If you are evaluating a Corespan PRU 2500 populated with 8 or 10 NVIDIA RTX 5090s, the question on the table is almost always the same: can a PCIe-attached consumer-GPU node actually compete with HGX-class hardware on real inference workloads? Our latest benchmark says yes, emphatically, and the numbers are worth walking through, because they reframe what this class of machine is capable of.

PRU 2500 with 5× RTX 5090 GPUs on the right side and available slots on the left side.
The Benchmark
We ran Qwen2.5-32B-Instruct on a 4× RTX 5090 configuration using vLLM with pipeline parallelism (PP=4, TP=1), native Blackwell FP8, chunked prefill, and a 65,536 max-batched-token budget. The workload: 200 requests at 20-way concurrency, up to 1,024 output tokens each, driven by ApacheBench.
The headline result: ~5,345 tokens/sec aggregate, ~1,336 tok/s per GPU, 5.22 requests/sec, P50 latency of 127 ms, P95 of 7.0 s, and zero failures.
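In concrete terms, the configuration looks roughly like the sketch below, expressed through vLLM's offline Python API. This is a minimal illustration rather than the exact launch script: the benchmark itself ran behind vLLM's OpenAI-compatible server with the corresponding command-line flags, and the prompt and memory-utilization value here are placeholders.

```python
# Minimal sketch of the serving configuration, using vLLM's offline LLM API.
# Assumes a recent vLLM release with pipeline-parallel support in this
# entrypoint; the actual benchmark ran behind vLLM's OpenAI-compatible
# server with the equivalent command-line flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",
    quantization="fp8",            # native Blackwell FP8
    pipeline_parallel_size=4,      # PP=4: one pipeline stage per RTX 5090
    tensor_parallel_size=1,        # TP=1: no PCIe-bound all-reduce
    enable_chunked_prefill=True,   # interleave prefill and decode chunks
    max_num_batched_tokens=65536,  # large budget keeps all four stages full
    gpu_memory_utilization=0.90,   # placeholder; tune to your node
)

params = SamplingParams(max_tokens=1024)
outputs = llm.generate(["Explain pipeline parallelism in two paragraphs."], params)
print(outputs[0].outputs[0].text)
```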
Three optimizations stacked cleanly: FP8 (1.3 to 1.5× over BF16 on Blackwell), pipeline parallelism in place of tensor parallelism (which eliminates the PCIe-bound all-reduce that handicaps multi-5090 nodes), and chunked prefill with a large batched-token budget (which keeps all four pipeline stages full). The latency improvement is the tell: most requests now clear the pipeline near-instantly because time-to-first-token (TTFT) and decode are no longer serialized behind synchronous all-reduce traffic.
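Reproducing the client side of the run is straightforward. We drove the benchmark with ApacheBench; the sketch below is a rough asyncio equivalent of the same load pattern (200 requests, 20-way concurrency, up to 1,024 output tokens each), assuming the node exposes vLLM's OpenAI-compatible completions endpoint. The URL, model name, and prompt are placeholders.

```python
# Rough asyncio equivalent of the ApacheBench run: 200 requests at 20-way
# concurrency, up to 1,024 output tokens each, against vLLM's
# OpenAI-compatible completions endpoint. URL, model name, and prompt are
# placeholders.
import asyncio
import httpx

URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen/Qwen2.5-32B-Instruct"
TOTAL_REQUESTS = 200
CONCURRENCY = 20

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore) -> int:
    async with sem:
        resp = await client.post(
            URL,
            json={
                "model": MODEL,
                "prompt": "Summarize the benefits of chunked prefill.",
                "max_tokens": 1024,
            },
            timeout=120.0,
        )
        resp.raise_for_status()
        return resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        tokens = await asyncio.gather(
            *(one_request(client, sem) for _ in range(TOTAL_REQUESTS))
        )
    print(f"completed {len(tokens)} requests, {sum(tokens)} output tokens")

if __name__ == "__main__":
    asyncio.run(main())
```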
How It Compares to an H100
This is the comparison customers actually care about. The closest published apples-to-apples benchmark is GPUStack’s H100 SXM performance lab running Qwen3-32B across BF16, FP8, and chunked-prefill configurations. Their numbers:
1× H100 SXM, Qwen3-32B BF16: 2,352 tok/s
1× H100 SXM, Qwen3-32B FP8 (vLLM): ~4,005 tok/s
1× H100 SXM, Qwen3-32B FP8 + chunked prefill (TRT-LLM, optimized): 4,285 tok/s
Our 4× RTX 5090 result: ~5,345 tok/s, and we are doing it on vLLM, not TRT-LLM, which means there is still meaningful headroom on the table.

Figure 1. A 4× RTX 5090 PRU 2500 out-throughputs a fully optimized single H100 SXM on the same model class.
Read that again: a four-GPU consumer node on PCIe, running an open-source serving stack, out-throughputs a single fully optimized H100 SXM on the same model class. Per-GPU, the H100 still wins by roughly 3.2×, which is expected: HBM3 at ~3.35 TB/s of memory bandwidth (versus the 5090’s ~1.79 TB/s of GDDR7) and NVLink are real advantages. But the case for the 5090 has never been single-GPU performance. It is aggregate tokens per second per dollar, and on that axis the math is brutal in our favor.
The Economic Argument: Cost Per Million Tokens
Performance only matters if it is affordable to deliver. The whole point of building inference infrastructure on PCIe-attached Blackwell consumer silicon is that the unit economics are dramatically better than HGX-class hardware. At sustained utilization, a 4× RTX 5090 PRU 2500 node delivers Qwen-32B-class FP8 inference at roughly $0.04 per million tokens, versus $0.30 to $0.60 per million tokens on H100 cloud pricing depending on the provider.

Figure 2. Roughly an order of magnitude lower cost per million tokens versus H100 cloud rates.
That is not a 20 percent improvement. It is 7.5× to 15× cheaper for the same model class at comparable throughput. Over the lifetime of a production deployment serving billions of tokens per month, this is the difference between an inference workload that pencils out economically and one that does not.
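For readers who want to sanity-check that figure, the back-of-envelope sketch below shows how a number in that range falls out of the hardware economics. Every input marked as assumed is illustrative (amortization period, average power draw, electricity rate, utilization), not the exact model behind Figure 2; shift any of them and the result moves, but it stays far below H100 cloud rates.

```python
# Back-of-envelope cost per million tokens for the 4x RTX 5090 node.
# Every input marked "assumed" is illustrative, not the exact model
# behind Figure 2; only the throughput figure comes from the benchmark.

GPU_COST_USD = 16_000        # 4x RTX 5090 at ~$4,000 each, list pricing
AMORTIZATION_MONTHS = 48     # assumed useful life of the cards
NODE_POWER_KW = 2.5          # assumed average draw under sustained load
POWER_USD_PER_KWH = 0.10     # assumed electricity rate
UTILIZATION = 0.90           # assumed fraction of each month under load
THROUGHPUT_TOK_S = 5_345     # measured aggregate throughput
HOURS_PER_MONTH = 730

hw_per_month = GPU_COST_USD / AMORTIZATION_MONTHS
power_per_month = NODE_POWER_KW * HOURS_PER_MONTH * POWER_USD_PER_KWH
tokens_per_month = THROUGHPUT_TOK_S * 3_600 * HOURS_PER_MONTH * UTILIZATION

cost_per_million = (hw_per_month + power_per_month) / (tokens_per_month / 1e6)
print(f"~${cost_per_million:.3f} per million output tokens")  # ~$0.04 with these inputs
```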
Where the Savings Start: GPU Acquisition Cost
The cloud-cost gap above is downstream of an even more fundamental gap: the raw hardware cost of the GPUs themselves. Using current list pricing, an RTX 5090 lands at approximately $4,000 per card. NVIDIA H100 PCIe sells in the $25,000 to $30,000 range per GPU, and H100 SXM (the variant used in the throughput benchmark above) runs $35,000 to $40,000 per GPU and is only available through OEMs as part of complete HGX systems.
That translates into roughly the following per-node GPU cost:
4× RTX 5090: $16,000
4× H100 PCIe: ~$110,000
4× H100 SXM: ~$150,000
8× H100 SXM (typical HGX node): ~$300,000

Figure 3. GPU hardware acquisition cost per node, list pricing. Excludes chassis, networking, and cooling.
A 4× RTX 5090 node beats a fully optimized single H100 SXM on aggregate throughput while costing roughly one ninth as much in GPU hardware as a 4× H100 SXM node, and roughly one nineteenth as much as the 8× H100 SXM HGX configuration customers typically benchmark against. Those ratios are before you factor in the chassis, NVLink switching, networking, and cooling infrastructure that HGX systems require, which can add another 30 to 50 percent on top of the GPU cost.
That is the underlying reason cost-per-million-tokens collapses so dramatically. The hardware is fundamentally an order of magnitude cheaper to acquire, and it is delivering equal or better throughput on the workloads that matter.
What Happens at 8 and 10 GPUs in the PRU 2500
Here is where the PRU 2500 platform earns its place in the conversation. Our 4-GPU result already beats a single H100 SXM. Scaling to 8 or 10 RTX 5090s opens two distinct deployment modes, and the right choice depends on your workload:
Mode 1: Pipeline-parallel one big model (PP=8 or PP=10). For 32B-class models at FP8, deeper pipelines give the scheduler more stages to fill in parallel. Expect throughput to scale meaningfully, though not perfectly linearly: pipeline-bubble overhead grows with depth, and you will want to tune max-num-batched-tokens and concurrency upward to compensate. Practically, an 8× 5090 PRU 2500 running PP=8 should put a serious dent in dual-H100 SXM territory on aggregate throughput, at a fraction of the BOM.
Mode 2: Replica parallelism for models that fit per-GPU. CloudRift’s published benchmark on 4× RTX 5090 running Qwen3-Coder-30B at AWQ-INT4 with one independent vLLM instance per GPU hit 12,744 tok/s, because there is zero inter-GPU traffic on the critical path. Extrapolate that to 8 or 10 GPUs in a single PRU 2500 and you are looking at 25,000 to 32,000 tok/s of aggregate throughput on a single chassis for models that fit in 32 GB VRAM at INT4. That is genuinely competitive with hardware costing 10× more.
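Mode 2 needs no special orchestration. Below is a minimal launcher sketch, assuming each replica is pinned to one GPU via CUDA_VISIBLE_DEVICES and served on its own port behind whatever load balancer you already run; the model name and flags are placeholders, not a prescribed configuration.

```python
# Illustrative Mode 2 launcher: one independent vLLM server per GPU, each
# pinned with CUDA_VISIBLE_DEVICES and listening on its own port. The model
# name is a placeholder for an AWQ-INT4 checkpoint that fits in 32 GB;
# front the replicas with whatever load balancer you already run.
import os
import subprocess

MODEL = "your-org/Qwen3-Coder-30B-AWQ"  # placeholder AWQ-INT4 checkpoint
NUM_GPUS = 8                            # 8x or 10x 5090 PRU 2500
BASE_PORT = 8000

procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # one GPU per replica
    procs.append(subprocess.Popen(
        [
            "vllm", "serve", MODEL,
            "--port", str(BASE_PORT + gpu),
            "--quantization", "awq",
            "--gpu-memory-utilization", "0.90",
        ],
        env=env,
    ))

for p in procs:
    p.wait()  # block until the replicas exit
```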
In practice, most production deployments will run a mix: pipeline-parallel for the large frontier models you are serving externally, and replica-parallel for the smaller specialized models behind internal tools. The PRU 2500 supports both without architectural compromise.
A Chassis Built for Composability, Not Compromise
The reason these benchmark numbers translate into real production performance, and not just lab curiosities, is the chassis itself. The PRU 2500 was designed around the assumption that PCIe-attached accelerators would dominate the inference-serving market, and it provisions accordingly: clean topology, adequate per-GPU bandwidth, thermal headroom for sustained Blackwell-class power draw, and the I/O margin to keep a fully-loaded chassis fed without artificial bottlenecks.
The PRU 2500 with 5090 GPUs is fully liquid-cooled, which is what makes sustained, full-power operation on 5090-class silicon practical in the first place. Air-cooled multi-5090 nodes throttle under continuous load long before they reach their published throughput numbers. Liquid cooling removes that ceiling, keeps acoustics in check, and meaningfully extends component life. The result is a node that delivers its benchmark performance not just in a clean test run, but hour after hour, day after day, under production traffic.
Just as important is the PRU 2500’s split-side architecture. The PRU 2500 supports up to 5× RTX 5090 GPUs with liquid cooling on one side of the chassis, while the other side is free to host whatever air-cooled resource your workload actually needs:
Another 5× RTX 5090 for a fully populated 10-GPU inference node
A bank of NVIDIA RTX 6000 PRO cards when you need 96 GB of VRAM per GPU for larger models, longer contexts, or fine-tuning workloads
AMD Instinct MI350 accelerators for customers standardizing on the ROCm stack or pursuing a dual-vendor strategy
Up to 600 TB of SSD storage, turning the same chassis into a high-density inference-plus-data node for RAG, vector search, model caching, or training-data staging
That composability is the real product. Most competing platforms force you to choose a configuration up front and live with it. The PRU 2500 lets you mix accelerator families, mix accelerators with storage, and re-balance the chassis as your workload evolves. Today you may want 10× 5090s for maximum inference throughput. Six months from now, you may want 5× 5090s on one side and 600 TB of NVMe on the other to support a retrieval-heavy agent workload. The same chassis handles both.
Production Stability
It is also worth noting what we observed on the stability side: zero failures across 200 requests and no out-of-memory events at the 64K batched-token budget. That matters for production. The configuration is well-tuned, the platform held up cleanly under sustained load, and the liquid-cooling loop kept every GPU well inside its thermal envelope for the duration of the run. Customers running multi-5090 nodes on commodity workstations or rack-mount servers built for other workloads hit thermal and PCIe-lane limits well before they hit GPU limits. The PRU 2500 does not.
The Bottom Line for Your Buying Decision
If your workload is 32B-class FP8 inference at production scale, an 8- or 10-GPU PRU 2500 will deliver throughput that meets or exceeds multi-H100 nodes at roughly one-tenth the cost per million tokens, on hardware that costs roughly one-ninth as much to acquire in the first place. If your workload skews toward smaller models you can run replica-parallel, the same chassis pushes well past 25,000 tok/s. And if your roadmap calls for mixing accelerator families, adding RTX 6000 PRO or MI350 cards, or co-locating hundreds of terabytes of storage with your GPUs, the PRU 2500 is the only platform in its class that lets you do all of that in a single liquid-cooled chassis.
The PCIe-plus-Blackwell-plus-vLLM combination, once dismissed as a hobbyist stack, is now competitive on serious inference SLAs, and the PRU 2500 is the chassis that makes it production-grade.
The 4× RTX 5090 number is not the ceiling. It is the floor of what this platform can do.