PRU 2500 with 5090
High-density AI inference with 8 to 12 NVIDIA GeForce RTX 5090 GPUs, hybrid liquid cooling, composable PCIe Gen5 architecture, and Docker/Kubernetes-native deployment.
The smartest inference node you can buy right now
A PRU 2500 populated with RTX 5090s is no longer a speculative alternative to HGX-class hardware. On real Qwen-32B-class FP8 inference workloads, a 4x RTX 5090 configuration out-throughputs a fully optimized single H100 SXM while cutting cost per million tokens by roughly an order of magnitude. The PRU 2500 makes that performance practical with liquid cooling, clean PCIe topology, and a chassis that can evolve as workloads change.
Contact Us
A 4x RTX 5090 PRU 2500 out-throughputs a fully optimized H100 SXM
In our benchmark, Qwen2.5-32B-Instruct ran on 4x RTX 5090 using vLLM, pipeline parallelism, native Blackwell FP8, chunked prefill, and a 65,536 max-batched-token budget. The result was roughly 5,345 aggregate tokens per second, ahead of published H100 SXM Qwen-32B-class FP8 results, with meaningful headroom still available in the serving stack.

Why the Benchmark Matters
The result reframes what a PCIe-attached consumer-GPU node can do for production inference. Per GPU, the H100 still wins: HBM3 memory bandwidth and NVLink are real advantages. But the RTX 5090 story is aggregate throughput per dollar, and that is where the economics become difficult to ignore.
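Three optimizations stacked cleanly: FP8 on Blackwell, pipeline parallelism instead of tensor parallelism, and chunked prefill with a large batched-token budget. Tensor parallelism needs an all-reduce across GPUs in every layer, which is PCIe-bound on a node without NVLink; pipeline parallelism only passes activations between adjacent stages. Together, the three optimizations keep the pipeline full and avoid the all-reduce bottleneck that usually punishes multi-5090 nodes.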
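The serving setup described above can be sketched as a vLLM launch command. This is a minimal illustration, not the exact benchmark invocation: the helper name `vllm_serve_args` is hypothetical, the Hugging Face model ID and the use of on-the-fly `--quantization fp8` (rather than a pre-quantized FP8 checkpoint) are assumptions, and only the pipeline depth of 4 and the 65,536-token budget come from the text.

```python
import shlex


def vllm_serve_args(model="Qwen/Qwen2.5-32B-Instruct",
                    pipeline_stages=4,
                    max_batched_tokens=65536):
    """Build an argv list for `vllm serve` (hypothetical helper).

    Mirrors the three optimizations named in the text: FP8, pipeline
    parallelism instead of tensor parallelism, and chunked prefill with
    a large batched-token budget.
    """
    return [
        "vllm", "serve", model,
        "--pipeline-parallel-size", str(pipeline_stages),
        "--tensor-parallel-size", "1",   # no per-layer all-reduce over PCIe
        "--quantization", "fp8",         # assumption: on-the-fly FP8 weights
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(max_batched_tokens),
    ]


# Print a copy-pasteable shell command.
print(shlex.join(vllm_serve_args()))
```

Scaling to a deeper pipeline would only change `pipeline_stages`; whether the same token budget remains appropriate at 8 or 10 stages is something to measure rather than assume.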
Roughly an order of magnitude lower cost per million tokens
At sustained utilization, a 4x RTX 5090 PRU 2500 node delivers Qwen-32B-class FP8 inference at roughly $0.04 per million tokens, compared with typical H100 cloud rates of $0.30 to $0.60 per million tokens. That is a 7.5x to 15x cost advantage for the same model class.
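The arithmetic behind that range can be checked in a few lines. The sketch below uses only the per-million-token figures quoted above; the function `cost_per_million_tokens` is a hypothetical helper for plugging in your own amortized hourly node cost, which the text does not state.

```python
def cost_per_million_tokens(node_cost_per_hour, tokens_per_second):
    """Dollars per one million tokens at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return node_cost_per_hour / tokens_per_hour * 1_000_000


def cost_advantage(reference_cost, node_cost):
    """How many times cheaper the node is than the reference."""
    return reference_cost / node_cost


PRU_COST = 0.04             # $/M tokens, from the text
H100_CLOUD = (0.30, 0.60)   # $/M tokens, from the text

low, high = (cost_advantage(c, PRU_COST) for c in H100_CLOUD)
print(f"{low:.1f}x to {high:.1f}x cost advantage")
```

To reproduce the per-token figure for a specific deployment, pass your own hourly cost (hardware amortization plus power and hosting) together with the measured aggregate throughput to `cost_per_million_tokens`.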

The savings start with GPU acquisition cost
The raw GPU hardware gap is just as stark: roughly $16,000 for 4x RTX 5090s, compared with approximately $110,000 for 4x H100 PCIe, $150,000 for 4x H100 SXM, and $300,000 for a typical 8x H100 SXM HGX node. That capital-cost delta is the foundation of the lower per-token economics.
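Normalizing those figures per GPU makes the gap easier to compare across the 4-GPU and 8-GPU configurations. The sketch below uses only the approximate prices quoted above; the table layout and helper name are illustrative.

```python
# (approximate total cost in USD, GPU count), as quoted in the text
GPU_COSTS = {
    "4x RTX 5090": (16_000, 4),
    "4x H100 PCIe": (110_000, 4),
    "4x H100 SXM": (150_000, 4),
    "8x H100 SXM HGX": (300_000, 8),
}


def per_gpu_cost(total, count):
    """Acquisition cost per GPU."""
    return total / count


baseline = per_gpu_cost(*GPU_COSTS["4x RTX 5090"])
for name, (total, count) in GPU_COSTS.items():
    unit = per_gpu_cost(total, count)
    print(f"{name:16s} ${unit:>8,.0f} per GPU  {unit / baseline:.2f}x")
```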

Scale the same chassis for different inference modes
Scaling to 8 or 10 RTX 5090s opens two practical deployment modes. For 32B-class FP8 models, deeper pipeline parallelism gives the scheduler more stages to keep busy in parallel. For models that fit on a single GPU, replica parallelism runs an independent copy on each GPU, eliminates inter-GPU traffic, and can push aggregate throughput dramatically higher.
Most production deployments will use both modes: pipeline-parallel serving for larger externally facing models, and replica-parallel serving for smaller specialized models behind internal tools. The PRU 2500 supports both without forcing a new architecture.
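The choice between the two modes comes down to whether the model weights fit on one GPU with room left for KV cache. The following is a rough sketch of that decision, not a sizing tool: `plan_deployment` and the `usable_fraction` headroom value are hypothetical, and the only hardware fact used is the RTX 5090's 32 GB of memory.

```python
import math
from dataclasses import dataclass

RTX_5090_MEMORY_GB = 32  # per-GPU memory capacity


@dataclass
class Plan:
    mode: str
    pipeline_stages: int
    replicas: int


def plan_deployment(weights_gb, gpu_count, usable_fraction=0.7):
    """Pick replica or pipeline parallelism from a rough memory estimate.

    usable_fraction is an assumed share of GPU memory available for
    weights after reserving space for KV cache and activations.
    """
    budget_gb = RTX_5090_MEMORY_GB * usable_fraction
    if weights_gb <= budget_gb:
        # Model fits on one GPU: one independent replica per GPU.
        return Plan("replica", pipeline_stages=1, replicas=gpu_count)
    min_stages = math.ceil(weights_gb / budget_gb)
    if min_stages > gpu_count:
        raise ValueError("model does not fit on this many GPUs")
    # Model must be split: use every GPU as a pipeline stage.
    return Plan("pipeline", pipeline_stages=gpu_count, replicas=1)


# A 32B model at FP8 is roughly 32 GB of weights (about 1 byte per parameter).
print(plan_deployment(weights_gb=32, gpu_count=4))
print(plan_deployment(weights_gb=8, gpu_count=10))
```

A mixed node would apply the same check per model and partition the GPUs between the two groups.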
A chassis built for composability, not compromise
The PRU 2500 supports up to 5x liquid-cooled RTX 5090s on one side, while the other side can host additional RTX 5090s, NVIDIA RTX PRO 6000 cards, AMD Instinct MI350 accelerators, or up to 600 TB of NVMe storage. Today it can be a 10-GPU inference node. Six months from now, the same chassis can become a mixed accelerator and storage platform for retrieval-heavy AI workloads.
Production Stability
The benchmark run completed with zero failures across 200 requests and no out-of-memory events at the 64K (65,536) batched-token budget. The configuration held up under sustained load, and the liquid-cooling loop kept the GPUs inside their thermal envelope.
That matters because many commodity multi-5090 systems hit thermal or PCIe-lane limits before they hit GPU limits. The PRU 2500 is designed to sustain the benchmark behavior under production traffic, not just in a clean lab run.
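For readers reproducing a run like this, the two headline numbers (failure count and aggregate tokens per second) can be derived from per-request records as sketched below. The record shape and `summarize` helper are hypothetical, the sample data is synthetic rather than the benchmark's, and counting only output tokens from successful requests is an assumption about how aggregate throughput is defined.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ok: bool            # request completed without error
    output_tokens: int  # tokens generated for this request


def summarize(results, wall_seconds):
    """Aggregate failures and throughput over one benchmark run."""
    failures = sum(1 for r in results if not r.ok)
    total_tokens = sum(r.output_tokens for r in results if r.ok)
    return {
        "requests": len(results),
        "failures": failures,
        "aggregate_tokens_per_second": total_tokens / wall_seconds,
    }


# Synthetic example: 200 successful requests of 512 tokens each.
sample = [RequestResult(ok=True, output_tokens=512) for _ in range(200)]
print(summarize(sample, wall_seconds=20.0))
```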
Key Features
High-Density GPU Pooling
Pack 8 to 12 RTX 5090 GPUs into a single PRU 2500 for dense, shared accelerator capacity built for inference-heavy environments.
Hybrid Liquid Cooling
Direct-to-chip liquid cooling on the GPUs sustains full Blackwell-class power draw without thermal throttling, hour after hour under production traffic.
Composable PCIe Gen5 Architecture
Dynamically attach and reassign GPUs to hosts as workloads change, improving utilization and reducing idle accelerator capacity.
Mixed Accelerators and Storage
Combine RTX 5090s with NVIDIA RTX PRO 6000, AMD Instinct MI350, or up to 600 TB of NVMe in the same chassis. Rebalance as workloads evolve.
Docker and Kubernetes Native
Expose GPUs to standard container and Kubernetes environments with familiar NVIDIA tooling and existing operational models.
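As an illustration of the Kubernetes path, the sketch below generates a Pod manifest that requests four GPUs through the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin. The pod name, container image tag, and server arguments are assumptions chosen for the example, not a PRU 2500 requirement; a production manifest would also need items such as shared memory, probes, and model storage.

```python
import json


def inference_pod(name="qwen32b-fp8", gpus=4,
                  image="vllm/vllm-openai:latest"):
    """Return a minimal Pod manifest as a dict (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "vllm",
                "image": image,  # assumption: public vLLM server image
                "args": [
                    "--model", "Qwen/Qwen2.5-32B-Instruct",
                    "--pipeline-parallel-size", str(gpus),
                ],
                # GPUs are scheduled via the NVIDIA device plugin resource.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# kubectl accepts JSON as well as YAML, so this output can be saved to a
# file and applied with `kubectl apply -f <file>`.
print(json.dumps(inference_pod(), indent=2))
```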
Use Cases
Neocloud GPU Services
Deliver GPU-as-a-Service with higher utilization and more flexible tenant allocation from a shared GPU pool.
Enterprise AI Inference
Support shared inference infrastructure across teams, applications, and changing workload demand.
Elastic Multi-Tenancy
Scale from smaller inference endpoints to larger multi-GPU jobs on the same physical platform.
Burstable GPU Capacity
Recompose GPUs between hosts and workloads in real time as jobs begin, finish, and shift.

Are you a system integrator or AI specialist who wants to sell the PRU 2500-5090 solution that sets the bar for inference token economics?
Better throughput, lower cost, production-grade from day one.