PRU 2500 with 5090

High-density AI inference with 8 to 10 NVIDIA GeForce RTX 5090 GPUs, hybrid liquid cooling, composable PCIe Gen5 architecture, and Docker/Kubernetes-native deployment.

The smartest inference node you can buy right now

A PRU 2500 populated with RTX 5090s is no longer a speculative alternative to HGX-class hardware. On real Qwen-32B-class FP8 inference workloads, a 4x RTX 5090 configuration out-throughputs a fully optimized single H100 SXM while changing the economics by an order of magnitude. The PRU 2500 makes that performance practical with liquid cooling, clean PCIe topology, and a chassis that can evolve as workloads change.

Corespan PRU 2500 chassis interior with five RTX 5090 GPUs populated on the right side and the open accelerator bay visible on the left.

BY THE NUMBERS

~5,345 tok/s

Aggregate Qwen-32B-class FP8 throughput on 4x RTX 5090

$0.04/M

Estimated cost per million tokens at sustained utilization

7.5x-15x

Lower cost per million tokens versus typical H100 cloud rates

0 failures

Across 200 benchmark requests with a 64K batched-token budget

A 4x RTX 5090 PRU 2500 out-throughputs a fully optimized H100 SXM

In our benchmark, Qwen2.5-32B-Instruct ran on 4x RTX 5090 using vLLM, pipeline parallelism, native Blackwell FP8, chunked prefill, and a 65,536 max-batched-token budget. The result was roughly 5,345 aggregate tokens per second, ahead of published H100 SXM Qwen-32B-class FP8 results, with meaningful headroom still available in the serving stack.
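As a rough illustration of how the settings named above map onto a vLLM launch, the sketch below assembles a `vllm serve` command line. It is an assumption about how such a run could be started, not the exact command used for the benchmark: the helper name `build_vllm_serve_command` is hypothetical, the choice of on-the-fly `--quantization fp8` (rather than a pre-quantized checkpoint) is assumed, and flag availability can vary between vLLM versions.

```python
import shlex


def build_vllm_serve_command(
    model: str = "Qwen/Qwen2.5-32B-Instruct",
    pipeline_stages: int = 4,
    max_batched_tokens: int = 65_536,
) -> list[str]:
    """Return an argv list for `vllm serve` that mirrors the described settings.

    Assumptions (not confirmed by the benchmark write-up): FP8 is requested via
    --quantization fp8, and tensor parallelism is pinned to 1 so that all
    inter-GPU traffic comes from pipeline stages.
    """
    return [
        "vllm", "serve", model,
        "--pipeline-parallel-size", str(pipeline_stages),  # one stage per GPU
        "--tensor-parallel-size", "1",                     # no tensor parallelism
        "--quantization", "fp8",                           # FP8 weights
        "--enable-chunked-prefill",                        # split long prefills into chunks
        "--max-num-batched-tokens", str(max_batched_tokens),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for inspection.
    print(shlex.join(build_vllm_serve_command()))
```

Building the command as a list keeps each setting explicit and makes it easy to vary the stage count when moving from 4 to 8 or 10 GPUs.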

Bar chart comparing aggregate throughput in tokens per second across four configurations: 1x H100 SXM BF16 at 2,352; 1x H100 SXM FP8 at 4,005; 1x H100 SXM FP8 with chunked prefill on TRT-LLM at 4,285; and 4x RTX 5090 PRU 2500 with FP8, pipeline parallelism, and chunked prefill on vLLM at 5,345.

Why the Benchmark Matters

The result reframes what a PCIe-attached consumer-GPU node can do for production inference. Per GPU, the H100 still wins: HBM3 memory bandwidth and NVLink are real advantages. But the RTX 5090 story is aggregate throughput per dollar, and that is where the economics become difficult to ignore.
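The per-GPU versus aggregate distinction can be checked directly from the published chart values. The sketch below uses only the four throughput figures quoted on this page; the function names are illustrative.

```python
# Aggregate throughput (tokens/s) and GPU count per configuration,
# taken from the benchmark chart on this page.
CONFIGS = {
    "1x H100 SXM BF16": (2352, 1),
    "1x H100 SXM FP8": (4005, 1),
    "1x H100 SXM FP8 + chunked prefill (TRT-LLM)": (4285, 1),
    "4x RTX 5090 PRU 2500 (vLLM)": (5345, 4),
}


def per_gpu_throughput(name: str) -> float:
    """Aggregate tokens/s divided by the number of GPUs in the configuration."""
    tok_s, gpus = CONFIGS[name]
    return tok_s / gpus


def aggregate_ratio(name: str, baseline: str) -> float:
    """How many times the aggregate throughput of `baseline` `name` delivers."""
    return CONFIGS[name][0] / CONFIGS[baseline][0]


if __name__ == "__main__":
    node = "4x RTX 5090 PRU 2500 (vLLM)"
    best_h100 = "1x H100 SXM FP8 + chunked prefill (TRT-LLM)"
    print(f"per-GPU 5090: {per_gpu_throughput(node):.2f} tok/s")
    print(f"per-GPU H100: {per_gpu_throughput(best_h100):.2f} tok/s")
    print(f"aggregate ratio: {aggregate_ratio(node, best_h100):.3f}x")
```

On these numbers a single RTX 5090 delivers roughly a third of the best single-H100 figure, while the four-GPU node comes out about 25% ahead in aggregate.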

Three optimizations stacked cleanly: FP8 on Blackwell, pipeline parallelism instead of tensor parallelism, and chunked prefill with a large batched-token budget. Together, they keep the pipeline full and avoid the PCIe-bound all-reduce bottleneck that usually punishes multi-5090 nodes.
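To see why pipeline parallelism is gentler on PCIe than tensor parallelism, a back-of-envelope traffic model helps. The sketch below is a simplification under stated assumptions, not a measurement: it assumes a 64-layer model with hidden size 5120 (believed to match Qwen2.5-32B, but check the model config), 2-byte activations on the wire, two all-reduces per layer under tensor parallelism, and ring all-reduce. It counts bytes only and ignores latency, which often matters as much as volume for all-reduce over PCIe.

```python
def tensor_parallel_bytes_per_token(
    layers: int,
    hidden: int,
    gpus: int,
    bytes_per_value: int = 2,
    allreduces_per_layer: int = 2,
) -> float:
    """Approximate bytes each GPU sends per token in a tensor-parallel forward pass.

    Assumes ring all-reduce, where each GPU sends 2 * (N - 1) / N times the
    message size, and two all-reduces per transformer layer.
    """
    message = hidden * bytes_per_value
    ring_factor = 2 * (gpus - 1) / gpus
    return layers * allreduces_per_layer * ring_factor * message


def pipeline_parallel_bytes_per_token(hidden: int, bytes_per_value: int = 2) -> int:
    """Approximate bytes a non-final pipeline stage sends per token.

    Each stage hands one hidden-state vector to the next stage, once.
    """
    return hidden * bytes_per_value


if __name__ == "__main__":
    # Assumed model dimensions; verify against the actual model config.
    tp = tensor_parallel_bytes_per_token(layers=64, hidden=5120, gpus=4)
    pp = pipeline_parallel_bytes_per_token(hidden=5120)
    print(f"tensor parallel:   {tp:,.0f} bytes/token/GPU")
    print(f"pipeline parallel: {pp:,} bytes/token/GPU")
    print(f"ratio: {tp / pp:.0f}x")
```

Under these assumptions the tensor-parallel layout moves on the order of a hundred times more data per token across the PCIe fabric, which is the bottleneck the pipeline-parallel configuration sidesteps.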

Roughly an order of magnitude lower cost per million tokens

At sustained utilization, a 4x RTX 5090 PRU 2500 node delivers Qwen-32B-class FP8 inference at roughly $0.04 per million tokens, compared with typical H100 cloud rates of $0.30 to $0.60 per million tokens. That is a 7.5x to 15x cost advantage for the same model class.
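The arithmetic behind these figures is simple to reproduce. The sketch below recomputes the 7.5x to 15x range from the quoted rates and back-solves the all-in hourly node cost that $0.04 per million tokens would imply at the benchmarked throughput. The page does not say what that figure includes (amortization, power, hosting), so the implied hourly cost is a derived number, not a published one.

```python
def tokens_per_hour(tok_per_s: float) -> float:
    """Tokens served in one hour at a sustained rate."""
    return tok_per_s * 3600


def cost_per_million_tokens(node_cost_per_hour: float, tok_per_s: float) -> float:
    """Dollars per million tokens for a node with a given hourly cost."""
    return node_cost_per_hour / (tokens_per_hour(tok_per_s) / 1e6)


def implied_node_cost_per_hour(cost_per_million: float, tok_per_s: float) -> float:
    """Back-solve the hourly cost that yields a given cost per million tokens."""
    return cost_per_million * tokens_per_hour(tok_per_s) / 1e6


if __name__ == "__main__":
    pru_rate = 0.04              # $/M tokens, from this page
    h100_low, h100_high = 0.30, 0.60
    print(f"advantage: {h100_low / pru_rate:.1f}x to {h100_high / pru_rate:.1f}x")
    hourly = implied_node_cost_per_hour(pru_rate, 5345)
    print(f"implied node cost: ${hourly:.2f}/hour at 5,345 tok/s")
```

At 5,345 tokens per second the node serves about 19.2 million tokens per hour, so $0.04 per million corresponds to roughly $0.77 per node-hour.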

Horizontal bar chart of estimated cost per million tokens served: 4x RTX 5090 PRU 2500 at $0.04, H100 cloud low end at $0.30, and H100 cloud high end at $0.60. A callout notes the PRU 2500 is roughly 7.5x to 15x cheaper.

The savings start with GPU acquisition cost

The raw GPU hardware gap is just as stark: roughly $16,000 for 4x RTX 5090s, compared with approximately $110,000 for 4x H100 PCIe, $150,000 for 4x H100 SXM, and $300,000 for a typical 8x H100 SXM HGX node. That capital-cost delta is the foundation of the lower per-token economics.

Bar chart of per-node GPU acquisition cost at list pricing: 4x RTX 5090 at $16,000; 4x H100 PCIe at $110,000; 4x H100 SXM at $150,000; 8x H100 SXM at $300,000. A callout notes the 4x RTX 5090 node is roughly 9x cheaper than 4x H100 SXM.
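The acquisition-cost ratios follow directly from the list prices above. This short sketch recomputes them per node and per GPU; the prices are the approximate figures quoted on this page, and the names are illustrative.

```python
# (GPU count, approximate GPU acquisition cost in USD) from this page.
NODE_GPU_COST = {
    "4x RTX 5090": (4, 16_000),
    "4x H100 PCIe": (4, 110_000),
    "4x H100 SXM": (4, 150_000),
    "8x H100 SXM": (8, 300_000),
}


def cost_ratio(name: str, baseline: str = "4x RTX 5090") -> float:
    """How many times more the GPUs in `name` cost than those in `baseline`."""
    return NODE_GPU_COST[name][1] / NODE_GPU_COST[baseline][1]


def cost_per_gpu(name: str) -> float:
    """Approximate acquisition cost of a single GPU in the configuration."""
    gpus, total = NODE_GPU_COST[name]
    return total / gpus


if __name__ == "__main__":
    for name in NODE_GPU_COST:
        print(f"{name}: ${cost_per_gpu(name):,.0f}/GPU, "
              f"{cost_ratio(name):.2f}x the 4x RTX 5090 GPU cost")
```

Note that the 8x H100 SXM row compares eight GPUs against four, so its node-level ratio is larger than its per-GPU ratio.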

Scale the same chassis for different inference modes

Scaling to 8 or 10 RTX 5090s opens two practical deployment modes. For 32B-class FP8 models, deeper pipeline parallelism gives the scheduler more stages to keep busy concurrently. For models that fit in a single GPU's memory, replica parallelism runs an independent copy on each GPU, which eliminates inter-GPU traffic and lets aggregate throughput scale with the number of replicas.

Most production deployments will use both modes: pipeline-parallel serving for larger externally facing models, and replica-parallel serving for smaller specialized models behind internal tools. The PRU 2500 supports both without forcing a new architecture.
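The choice between the two modes comes down to whether a model's weights fit in one GPU with room left for KV cache and activations. The sketch below is one way to frame that decision. It assumes 32 GB of memory per RTX 5090 and a hypothetical `usable_fraction` of 0.7 reserved for weights; both the threshold and the function name are illustrative, and real sizing should be validated against the serving stack.

```python
import math


def choose_mode(
    params_billion: float,
    bytes_per_param: float = 1.0,   # 1.0 for FP8, 2.0 for BF16/FP16
    vram_gb: float = 32.0,          # per-GPU memory of an RTX 5090
    usable_fraction: float = 0.7,   # hypothetical share of VRAM left for weights
    gpus: int = 8,
) -> tuple[str, int]:
    """Return (mode, gpus_per_model_copy).

    "replica" means each GPU holds a full copy of the model.
    "pipeline" means the model is split into the returned number of stages,
    which is the minimum that fits under the assumed memory budget.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    budget_gb = vram_gb * usable_fraction
    if weights_gb <= budget_gb:
        return ("replica", 1)
    stages = math.ceil(weights_gb / budget_gb)
    if stages > gpus:
        raise ValueError("model does not fit in this node under the assumed budget")
    return ("pipeline", stages)


if __name__ == "__main__":
    print(choose_mode(32))   # 32B-class model at FP8
    print(choose_mode(7))    # smaller model at FP8
```

The function returns only the minimum stage count. Running more stages than the minimum, as in the 4-stage benchmark, leaves extra memory per GPU for KV cache and larger batches.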

A chassis built for composability, not compromise

The PRU 2500 supports up to 5x liquid-cooled RTX 5090s on one side while the other side can host additional RTX 5090s, NVIDIA RTX 6000 PRO cards, AMD Instinct MI350 accelerators, or up to 600 TB of NVMe storage. Today it can be a 10-GPU inference node. Six months from now, the same chassis can become a mixed accelerator and storage platform for retrieval-heavy AI workloads.

Production Stability

The benchmark run completed with zero failures across 200 requests and no out-of-memory events at a 64K batched-token budget. The configuration held up under sustained load, and the liquid-cooling loop kept the GPUs inside their thermal envelope.

That matters because many commodity multi-5090 systems hit thermal or PCIe-lane limits before they hit GPU limits. The PRU 2500 is designed to sustain the benchmark behavior under production traffic, not just in a clean lab run.

Key Features

High-Density GPU Pooling

Pack 8 to 10 RTX 5090 GPUs into a single PRU 2500 for dense, shared accelerator capacity built for inference-heavy environments.

Hybrid Liquid Cooling

Direct-to-chip liquid cooling on the GPUs sustains full Blackwell-class power draw without thermal throttling, hour after hour under production traffic.

Composable PCIe Gen5 Architecture

Dynamically attach and reassign GPUs to hosts as workloads change, improving utilization and reducing idle accelerator capacity.

Mixed Accelerators and Storage

Combine RTX 5090s with RTX 6000 PRO, AMD Instinct MI350, or up to 600 TB of NVMe in the same chassis. Rebalance as workloads evolve.

Docker and Kubernetes Native

Expose GPUs to standard container and Kubernetes environments with familiar NVIDIA tooling and existing operational models.
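As an illustration of what Kubernetes-native deployment can look like, the sketch below builds a minimal Pod manifest that requests four GPUs through the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin. The pod name, the `vllm/vllm-openai:latest` image tag, and the container arguments are assumptions chosen for the example, and a production spec would typically also configure shared memory, model storage, and health probes.

```python
import json


def vllm_pod_manifest(
    name: str = "qwen32b-fp8",                 # hypothetical pod name
    gpus: int = 4,
    image: str = "vllm/vllm-openai:latest",    # assumed image tag
    model: str = "Qwen/Qwen2.5-32B-Instruct",
) -> dict:
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "vllm",
                    "image": image,
                    "args": [
                        "--model", model,
                        "--pipeline-parallel-size", str(gpus),
                    ],
                    # GPUs are requested via limits; the scheduler places the pod
                    # on a node that advertises enough nvidia.com/gpu capacity.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # `kubectl apply -f -`.
    print(json.dumps(vllm_pod_manifest(), indent=2))
```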

Use Cases

Neocloud GPU Services

Deliver GPU-as-a-Service with higher utilization and more flexible tenant allocation from a shared GPU pool.

Enterprise AI Inference

Support shared inference infrastructure across teams, applications, and changing workload demand.

Elastic Multi-Tenancy

Scale from smaller inference endpoints to larger multi-GPU jobs on the same physical platform.

Burstable GPU Capacity

Recompose GPUs between hosts and workloads in real time as jobs begin, finish, and shift.

FAQs

Are you an SI/VAR interested in selling the PRU 2500 with RTX 5090 GPUs?

Partner Information
