
In the Lab: GRAID SupremeRAID™ on Corespan's Photonic PCIe Fabric

A direct, evidence-first field note: GRAID SupremeRAID™ running GPU-accelerated RAID 6 parity over Corespan's photonic PCIe fabric — 43 GB/s read bandwidth, 117 µs p99.99 latency, zero PCIe errors.

Bill Koss - CEO and President of Corespan Systems

This is In the Lab: a direct, evidence-first look at what happens when Corespan's photonic PCIe disaggregation fabric meets a real enterprise storage stack. No marketing hand-waving, no curated charts, no benchmark numbers without the harness that produced them. For our debut entry we chose the integration we believed would be the most demanding test of fabric transparency available on the market today: GRAID SupremeRAID™, running parity computation on a dedicated NVIDIA A10 GPU, against a four-drive RAID 6 array of Kingston self-encrypting NVMe SSDs sitting behind a HighPoint Gen5 controller, with the entire data path carried over light.

If a photonic PCIe fabric can carry GRAID without altering the behavior of one byte above it, it can carry anything you intend to deploy on top of it. That is the thesis. Here is the evidence.

01 Why GRAID + Corespan Matters

Corespan's reason for existing is simple to state and difficult to engineer: pull NVMe storage and accelerators off the motherboard, put them on the other end of a photonic PCIe link, and let operators compose infrastructure the way hyperscalers compose virtual machines. Need more storage for that training cluster? Add a shelf. Need to repurpose the GPU pool for inference at 2 a.m.? Re-map it. No forklift upgrades, no stranded capacity, no oversubscribed copper backplanes. The hard part has never been the optics. The hard part is proving that every piece of software above the fabric behaves exactly as it would on a local PCIe slot.

GRAID is one of the most honest stress-tests of that promise we could find. SupremeRAID is out-of-path: it does not sit between the host and the drives like a traditional HBA. Instead, it offloads parity math to a dedicated GPU while the data plane stays on PCIe. That architecture only works if the underlying transport is clean, low-jitter, and behaves exactly like local silicon. For an end user, the GRAID + Corespan combination delivers what previously required a rack of dedicated storage appliances: enterprise-grade RAID 6 protection, GPU-accelerated parity offload, and freedom to place storage anywhere within the photonic domain — all while liberating host CPU cores to do the work you actually paid for them to do.

02 The Production-Shape Configuration

We deliberately built this run on hardware that looks like what a Tier-1 enterprise would actually deploy, not on a hand-tuned demo rig:

  • Host platform: Dual-socket AMD EPYC 9654 (192 cores / 384 threads), Ubuntu with the 5.15 kernel, iommu=pt active, pcie_aspm=off
  • Parity engine: NVIDIA A10 GPU (Gen4 x16) running graid_core, managed by GRAID SupremeRAID
  • Storage pool: Four Kingston SEDC3000ME7T6 7.0 TiB self-encrypting NVMe SSDs behind a HighPoint Gen5 controller
  • Array: RAID 6, 14 TiB usable, GRAID drive group reporting OPTIMAL, all four physical drives ONLINE at 0% wearout
  • Transport: Corespan native PCIe-over-optics fabric between host and drive shelf
  • Harness: Our reproducible validation suite (version 2026-06-15.2), launched end-to-end with a single ./run_all.sh invocation. Every test writes FIO JSON, a Python summarizer rolls the output into a Markdown report, and mpstat, pidstat, vmstat, and nvidia-smi dmon captures run alongside.

Run ID: 20260615T165258Z. Every number in the next section is reproducible from the raw FIO JSON, available on request.
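The harness itself is not reproduced in this note. As a rough illustration of the summarizer step, the sketch below turns a parsed fio JSON report into a Markdown table. The function name `summarize_fio` and the table layout are hypothetical, not the suite's own. The field names (`jobs`, `jobname`, `bw_bytes`, `iops`, `clat_ns`) are the ones fio emits in its JSON output, and the sketch assumes completion-latency percentiles are enabled, which is fio's default.

```python
import json


def summarize_fio(report: dict) -> str:
    """Render one Markdown table row per fio job (read side only)."""
    rows = [
        "| job | read GiB/s | read IOPS | mean µs | p99.99 µs |",
        "|---|---|---|---|---|",
    ]
    for job in report.get("jobs", []):
        read = job["read"]
        clat = read.get("clat_ns", {})          # completion latency, nanoseconds
        pct = clat.get("percentile", {})        # keys look like "99.990000"
        gib_s = read["bw_bytes"] / 2**30        # bytes/s -> GiB/s
        mean_us = clat.get("mean", 0.0) / 1000  # ns -> µs
        p9999_us = pct.get("99.990000", 0) / 1000
        rows.append(
            f"| {job['jobname']} | {gib_s:.2f} | {read['iops']:.0f} "
            f"| {mean_us:.1f} | {p9999_us:.1f} |"
        )
    return "\n".join(rows)


# Synthetic input for illustration only; these are NOT values from the run.
SAMPLE = json.loads("""
{"jobs": [{"jobname": "demo",
           "read": {"bw_bytes": 1073741824, "iops": 1000.0,
                    "clat_ns": {"mean": 50000.0,
                                "percentile": {"99.990000": 100000}}}}]}
""")
print(summarize_fio(SAMPLE))
```

Against a real report, the input would come from `json.load()` on a file that fio wrote with `--output-format=json`.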

03 The Headline Numbers

BY THE NUMBERS

  • 43.3 GB/s: Peak read bandwidth, 1.35× our target, over photonic PCIe
  • 64.2 µs: Mean 4K QD=1 read latency, with the fabric adding no measurable jitter
  • 117.2 µs: p99.99 tail latency, inside our future SLA at baseline
  • 1.42 M IOPS: On 4 of 32 supported drives; a floor, not a ceiling

Bandwidth saturation — 40.33 GiB/s (43.3 GB/s) read.

This is the result that defines the run. Our target was 32 GB/s, which would have been enough to saturate the Gen4 x16 bus serving the A10 using a traditional architecture. We came in at roughly 1.35× that number because SupremeRAID uses a patented out-of-path architecture: read data moves directly between the drives and the host, so it never has to cross the A10's Gen4 x16 link. In plain terms: a four-drive RAID 6 array, with parity offloaded to a single A10 and the data plane carried entirely over Corespan's photonic PCIe fabric, sustained the kind of read throughput that traditional architectures need a dedicated all-NVMe appliance, multiple HBAs, and a forest of x86 cores to produce. The optical interconnect is not a tax on this workload. It is the reason the workload can scale.
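The unit conversion and the target ratio behind those figures can be checked in a few lines. This is a minimal sketch: the 40.33 GiB/s measurement and the 32 GB/s target come from the run, while the Gen4 x16 figure is the standard theoretical rate (16 GT/s per lane, 128b/130b encoding) before protocol overhead, included only for comparison.

```python
GIB = 2**30   # bytes in a gibibyte
GB = 10**9    # bytes in a gigabyte

measured_gib_s = 40.33   # peak read bandwidth reported by the run
target_gb_s = 32.0       # bandwidth target stated in the text

measured_gb_s = measured_gib_s * GIB / GB
ratio = measured_gb_s / target_gb_s

# Theoretical PCIe Gen4 x16 rate per direction, before protocol overhead:
# 16 GT/s per lane * 16 lanes * 128/130 encoding efficiency / 8 bits per byte
gen4_x16_gb_s = 16e9 * 16 * (128 / 130) / 8 / GB

print(f"{measured_gb_s:.1f} GB/s measured, {ratio:.2f}x target")          # 43.3 GB/s measured, 1.35x target
print(f"Gen4 x16 theoretical: {gen4_x16_gb_s:.1f} GB/s per direction")    # 31.5 GB/s
```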

Latency baseline — 64.2 µs mean, 75.3 µs p99, 117.2 µs p99.99.

At a queue depth of one, a single thread reading 4K blocks at random saw a p99.99 tail latency of 117.2 microseconds. Our capability matrix set a future target of "p99.99 below 150 microseconds" under concurrent read/write stress; we are already inside that envelope at the single-thread baseline. For tick databases, HFT order-routing, and any workload where the worst transaction is the only one that matters, this is the number that should make architects pay attention. The fabric is not adding jitter. The GRAID virtual driver is not queuing at the host bus. Light, in this configuration, behaves like copper — only longer.
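For readers who want to produce a comparable baseline on their own hardware, here is a hedged sketch of how such a job might be assembled. It is not the harness's actual invocation: the function name, the 60-second runtime, the choice of libaio, and the device path are all assumptions. The fio options themselves are standard, and `--readonly` guards against accidental writes.

```python
import shlex


def qd1_randread_argv(device: str, runtime_s: int = 60) -> list[str]:
    """Build an fio command line for a single-thread 4K random-read latency test."""
    return [
        "fio",
        "--name=qd1-randread",
        f"--filename={device}",
        "--readonly",              # refuse to issue writes to the device
        "--rw=randread",
        "--bs=4k",
        "--iodepth=1",             # queue depth of one
        "--numjobs=1",             # single thread
        "--ioengine=libaio",       # assumption: kernel async I/O path
        "--direct=1",              # bypass the page cache
        "--time_based",
        f"--runtime={runtime_s}",  # assumption: the run's duration is not stated
        "--output-format=json",
    ]


# "/dev/EXAMPLE" is a placeholder, not the array's real device node.
argv = qd1_randread_argv("/dev/EXAMPLE")
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, capture_output=True, text=True, check=True)` would execute the job; the p99.99 value then sits under `clat_ns.percentile` in the JSON that fio prints.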

IOPS saturation — 1.42 M IOPS, and why this is a population problem, not a ceiling.

The matrix target for this test was 6 million read IOPS at 4K QD=128 across 16 jobs. We produced 1,423,040. Before anyone misreads that as a gap in the architecture, here is the critical context: this run used four drives. Four. That is 12.5% of the drive capacity of the Corespan PRU 2500 chassis — a deliberately conservative population chosen to characterize a single GRAID drive group on a single HighPoint Gen5 controller, not to chase a peak number.

A single NVIDIA A10 running GRAID SupremeRAID supports up to 32 NVMe drives. We exercised four. Modern enterprise NVMe SSDs routinely deliver 800,000 to over 1,000,000 random read IOPS each, which means the A10 parity engine sitting in this lab has comfortable theoretical headroom well beyond the 6 million IOPS target once the chassis is populated to its design point. Per drive, our four-drive RAID 6 array is already producing roughly 356,000 read IOPS (1,423,040 / 4), a healthy and predictable scaling slope. If that slope holds linearly across a fully populated A10 domain, 32 drives project to roughly 11.4 million IOPS. You do not approach the 6 million IOPS bar. You clear it with room to spare.
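The extrapolation is simple enough to write down. The sketch below uses only figures from the text; the one assumption, flagged in the code, is that IOPS scale linearly with drive count, which this run did not measure.

```python
import math

measured_iops = 1_423_040   # 4K random read result from this run
drives_tested = 4
max_drives = 32             # SupremeRAID drive limit on one A10, per the text
target_iops = 6_000_000     # matrix target

per_drive = measured_iops / drives_tested

# ASSUMPTION: IOPS scale linearly with drive count (a projection, not a measurement).
projected_at_max = per_drive * max_drives
drives_for_target = math.ceil(target_iops / per_drive)

print(f"per-drive: {per_drive:,.0f} IOPS")                                  # 355,760
print(f"projected at {max_drives} drives: {projected_at_max:,.0f} IOPS")    # 11,384,320
print(f"drives needed for target: {drives_for_target}")                     # 17
```

Under that linear assumption the 6 million IOPS target is crossed at 17 drives, a little over half of the 32-drive envelope.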

Two additional levers, once the drive population is up: switching GRAID into SPDK mode to bypass the kernel block layer (the dominant lever for small-block performance), and pinning the GPU and the controller to the same NUMA node. Neither was active in this run. Both are next on the bench. The 1.42 M IOPS number is the floor for a four-drive configuration on the kernel path — not the ceiling of the architecture.
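Before pinning anything, it helps to confirm where the kernel thinks each device lives. The sketch below reads the `numa_node` attribute that Linux exposes for every PCI device in sysfs. The function names are hypothetical, and any PCI addresses you pass in are your own; none are taken from this lab's topology.

```python
from pathlib import Path


def numa_node(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    return int((Path(sysfs_root) / bdf / "numa_node").read_text().strip())


def share_numa_node(bdf_a: str, bdf_b: str,
                    sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """True only when both devices report the same, known NUMA node."""
    a = numa_node(bdf_a, sysfs_root)
    b = numa_node(bdf_b, sysfs_root)
    # -1 means the kernel has no affinity recorded, so treat it as "not proven"
    return a == b and a != -1
```

A call such as `share_numa_node("0000:41:00.0", "0000:c1:00.0")`, with placeholder addresses standing in for the GPU and the controller, returns `False` when the two devices hang off different sockets.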

CPU cost — quantifying the case for offload.

The kernel-path test (libaio, 4K random writes at QD=32 across 4 jobs) drove host sys=56.3% and usr=15.4%, with 6.45 million context switches over 60 seconds, for 1.28 GiB/s of throughput. That is the cost of running storage I/O through the standard kernel block layer at speed, and it is exactly the cost GRAID is engineered to eliminate. Traditional software RAID burns 3.5 to 5 percent of host CPU per GB/s of write bandwidth. SPDK-mode GRAID on Corespan is expected to come in below 0.2 percent per GB/s, a roughly 20× reduction. In a 192-core EPYC host, the difference between those two numbers is dozens of cores you get back to spend on trading algorithms, Monte Carlo risk models, or inference workloads, instead of on parity math. That is the dollars-and-cents case for this architecture, and the kernel-path number we just captured is the baseline we will measure SPDK against.
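A few derived figures make the kernel-path cost easier to compare across runs. This is a back-of-envelope sketch using only the numbers quoted above. It assumes "4K" means 4,096-byte blocks, and because the text does not say whether the context-switch count is host-wide or per-process, the per-I/O ratio should be read as a rough indicator only.

```python
GIB = 2**30

# Figures quoted in the text
ctx_switches = 6.45e6              # context switches over the test window
window_s = 60
throughput_gib_s = 1.28
block_bytes = 4096                 # ASSUMPTION: "4K" means 4 KiB

sw_raid_pct_per_gbs = (3.5, 5.0)   # traditional software RAID, % CPU per GB/s
spdk_pct_per_gbs = 0.2             # expected upper bound for SPDK-mode GRAID

ctx_per_s = ctx_switches / window_s
write_iops = throughput_gib_s * GIB / block_bytes
ctx_per_io = ctx_per_s / write_iops
reduction = tuple(x / spdk_pct_per_gbs for x in sw_raid_pct_per_gbs)

print(f"{ctx_per_s:,.0f} context switches/s")                                    # 107,500
print(f"{write_iops:,.0f} write IOPS implied, {ctx_per_io:.2f} switches per I/O")  # 335,544 and 0.32
print(f"reduction factor: {reduction[0]:.1f}x to {reduction[1]:.1f}x")           # 17.5x to 25.0x
```

The 17.5× to 25× range is where the "roughly 20×" figure comes from: it is the quoted software-RAID cost divided by the expected SPDK-mode ceiling.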

IOMMU coexistence — clean.

After the sustained mixed workload, dmesg returned zero DMAR translation errors and zero PCIe Advanced Error Reporting faults. The iommu=pt configuration is doing its job, the optical PCIe path is honoring address translation correctly, and there is no quiet error rain in the background. For regulated environments that mandate IOMMU isolation, this is the test that has to pass before any conversation about deployment can start. It did.
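A check of this kind is easy to automate. The sketch below is not the suite's own code; it counts kernel-log lines that look like IOMMU faults or PCIe AER reports and verifies the boot parameters named in the host configuration. The regular expressions are illustrative, since exact message wording varies by kernel version. On AMD hosts such as the EPYC platform used here, the kernel typically logs IOMMU faults under an `AMD-Vi` prefix rather than `DMAR`, so the sketch looks for both.

```python
import re

# Illustrative patterns; exact kernel message wording varies by version.
PATTERNS = {
    "iommu_fault": re.compile(r"DMAR:.*fault|AMD-Vi:.*IO_PAGE_FAULT", re.IGNORECASE),
    # Matches error reports but not the benign "AER: enabled with IRQ" boot line.
    "pcie_aer": re.compile(r"AER:.*error|PCIe Bus Error", re.IGNORECASE),
}
REQUIRED_BOOT_FLAGS = ("iommu=pt", "pcie_aspm=off")  # from the host configuration


def count_error_lines(log_text: str) -> dict[str, int]:
    """Count log lines (not distinct events) matching each error pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def missing_boot_flags(cmdline: str) -> list[str]:
    """Return required kernel parameters absent from a /proc/cmdline string."""
    present = cmdline.split()
    return [flag for flag in REQUIRED_BOOT_FLAGS if flag not in present]
```

In practice the inputs would be the output of `dmesg` (which may need elevated privileges) and the contents of `/proc/cmdline`. A clean run corresponds to every count being zero and an empty list of missing flags.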

A10 telemetry — the engine is doing real work.

nvidia-smi dmon captured the A10 across the workload window: average draw 60.5 W out of a 150 W cap, peak 81 W, and SM utilization spiking to 100 percent during the parity-heavy phases. The parity engine is not idle. It is doing real work, and it has substantial thermal and compute headroom for the SPDK-path runs to come.
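Averages and peaks like these can be pulled from a saved `nvidia-smi dmon` capture. The sketch below is a hypothetical parser, not the suite's own: it reads column names from the `# gpu ...` header line, because the column set differs between driver versions, and it skips samples that dmon reports as `-`. The sample text is synthetic.

```python
def summarize_dmon(text: str) -> dict[str, float]:
    """Return average power, peak power, and peak SM utilization from dmon output."""
    columns = None
    power, sm = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            names = stripped.lstrip("#").split()
            if names and names[0] == "gpu":   # column-name header (the units row is ignored)
                columns = names
            continue
        fields = stripped.split()
        if columns is None or len(fields) != len(columns):
            continue                          # malformed or truncated row
        row = dict(zip(columns, fields))
        if row.get("pwr", "-") != "-":
            power.append(float(row["pwr"]))
        if row.get("sm", "-") != "-":
            sm.append(float(row["sm"]))
    if not power or not sm:
        raise ValueError("no usable pwr/sm samples found")
    return {
        "avg_w": sum(power) / len(power),
        "peak_w": max(power),
        "peak_sm_pct": max(sm),
    }


# Synthetic capture for illustration only; NOT data from the run.
SAMPLE_DMON = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec   mclk   pclk
# Idx      W      C      C      %      %      %      %    MHz    MHz
    0     50     40      -     20      5      0      0   6250   1695
    0     70     45      -    100     30      0      0   6250   1695
"""
print(summarize_dmon(SAMPLE_DMON))
```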

04 What This Means for End Users

If you are designing AI training clusters, HFT platforms, regulated financial workloads, or any architecture where storage, GPUs, and host CPUs need to be composed independently, this run is the answer to the question that matters most: can a photonic PCIe fabric carry enterprise NVMe storage without altering the behavior of every piece of software above it, while liberating CPU, budget, and floor space? On the evidence of this run, yes. The Corespan + GRAID stack delivered 43 GB/s of read bandwidth to a single host, held a 4K-random-read p99.99 tail under 120 microseconds, produced zero PCIe error events under sustained load, and confirmed an active parity engine doing real work on a single A10. The remaining IOPS gap is named, scoped, and on the next test plan.

Just as importantly, every one of those numbers was produced by a reproducible suite that runs end-to-end with a single command and writes machine-readable output. That matters because reproducibility is the difference between a demo and a deployable architecture. If you cannot re-run a vendor's benchmark, you do not actually have a benchmark; you have a slide.

05 Where We Are Going Next

  • SPDK comparison: Re-run the CPU-cost test with the SPDK ioengine and publish the kernel-vs-SPDK delta on CPU cost per GB/s and small-block IOPS. This is the cost-of-ownership story in one chart.
  • Online drive failure: Pull a physical drive via graidctl offline mid-run and quantify the throughput drop. Target: ≤ 15% on a sustained 4K random read workload.
  • Rebuild under load: Trigger graidctl rebuild on volume 0 while background writes continue. Target: rebuild rate ≥ 1.5 GB/s with foreground writes held at ≥ 80% of their pre-rebuild throughput.
  • SED key rotation: Cryptographic erase timed to the millisecond with start_epoch / end_epoch brackets. Target: under one second, block-level data unreadable.
  • Second-controller IOPS run: Re-stage with a fully populated drive complement under the A10's 32-drive support envelope, paired HighPoint Gen5 controllers, and NUMA-pinned affinity, and clear the 6 M IOPS target head-on.

06 Talk to Us

Corespan exists because the next decade of enterprise infrastructure does not look like the last one. Storage, compute, and acceleration will be composed dynamically over light, and the vendors who can prove their fabric is transparent to the software above it are the vendors who will be in production. If you want to see this stack run against your workload, want raw FIO JSON, mpstat / pidstat / vmstat captures, nvidia-smi dmon output, and the Markdown summary for run 20260615T165258Z, or simply want a conversation about what disaggregated photonic infrastructure could do for your business, we would like to hear from you.

Reach out to your Corespan account team, or write to us at the lab.