NVIDIA DGX Spark Cluster Review: Distributed Inference on Dell, GIGABYTE, and HP

by Dylan Dougherty and Divyansh Jain

Two things tend to come up first whenever the conversation turns to the NVIDIA DGX Spark. The first is the headline spec: 128 GB of unified memory in a roughly $4,000 desktop box, a number that would have seemed implausible on an engineer's desk even two years ago. The second is the 200 Gb network on the back of the unit. The presence of a real datacenter-class fabric on a desktop appliance is what makes people lean forward, because it implies something more than a faster single-box workstation. It implies the ability to connect Sparks into a physical cluster, the kind of multi-node setup that used to live exclusively in a rack.

This review examines that capability. We benchmark distributed inference across all three OEM Spark implementations we have on hand, paired into two-node clusters connected over the 200 Gb fabric and swept across model variants and three workload shapes. We also make a deliberate methodological choice about how the model is split between the two boxes, one that diverges from NVIDIA's default recommendation, and we defend it with data. Before any of that, though, two pieces of context shape everything that follows: the network that makes clustering possible in the first place, and the reasons a person might or might not want to use it.

The 200 Gb Fabric

We covered the networking implementation in detail in our original DGX Spark review, but the basics are worth restating because everything in this review depends on them. The back of every Spark carries two QSFP56 cages driven by an integrated NVIDIA ConnectX-7 SmartNIC. On paper, the two cages suggest 400 Gb of aggregate connectivity, but PCIe is the real ceiling: the ConnectX-7 lives behind a pair of Gen5 x4 links, and the platform tops out at 200 Gb of usable bandwidth no matter how the cages are wired. (Gen5 runs at 32 GT/s per lane, so each x4 link carries roughly 16 GB/s, or about 128 Gb/s, per direction; the pair clears 200 Gb comfortably but falls well short of 400 Gb.) A single populated QSFP56 cage already delivers the full 200 Gb the box supports, so the second port exists for topology flexibility rather than additional throughput.

That flexibility shows up in three common configurations. The simplest is a single 200 Gb port used as a direct Spark-to-Spark link, which is what NVIDIA's validated two-node setup specifies and what we used for this review. The second splits the bandwidth into two 100 Gb ports in a ring-like topology, letting more than two Sparks cluster without a switch. The third is a split-role configuration in which one cage goes to a peer Spark for clustering and the other goes to high-speed storage over NVMe-oF, which is useful when the working dataset will not fit on the Spark's internal NVMe.
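Because everything downstream depends on that link, it is worth sanity-checking what it actually delivers before stretching a model across it. Below is a minimal sketch of such a probe using PyTorch's distributed package over the NCCL backend; it is an illustration rather than our benchmark harness, and the script name, master address, and transfer sizes are placeholder assumptions you would adjust for your own pair of Sparks.

```python
# link_probe.py -- time large point-to-point transfers across the
# Spark-to-Spark link with torch.distributed. Launch on both nodes:
#   node 0: torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
#             --master_addr=192.168.100.1 --master_port=29500 link_probe.py
#   node 1: the same command with --node_rank=1
# The master address is a placeholder for whatever IP you assign the link.
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL picks the fastest path it finds
    rank = dist.get_rank()
    payload = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")  # 256 MB

    # Warm-up transfer so NCCL channel setup is not counted in the timing.
    if rank == 0:
        dist.send(payload, dst=1)
    else:
        dist.recv(payload, src=0)
    torch.cuda.synchronize()

    trials = 10
    start = time.perf_counter()
    for _ in range(trials):
        if rank == 0:
            dist.send(payload, dst=1)
        else:
            dist.recv(payload, src=0)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 1:
        gbps = payload.numel() * 8 * trials / elapsed / 1e9
        print(f"effective link bandwidth: {gbps:.1f} Gb/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```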
NVIDIA sells the Spark in three configurations that map directly onto how that network gets used: a single Spark for individual desktop work; a validated two-Spark cluster directly connected over the 200 Gb fabric for stretched models; and, as of GTC this year, a four-unit configuration that NVIDIA demonstrated publicly in response to user demand to push past the two-node limit. The dual-Spark configuration is the one NVIDIA actively markets, the one most readers will actually deploy, and the one we believe represents the sensible upper bound for production-style inference on this hardware. It is also the one this review benchmarks end-to-end.

Why Cluster Sparks in the First Place

The obvious reason to cluster Sparks is the same as for any cluster: a single 128 GB box cannot hold every model that matters. Stretching a 120B-parameter model across two boxes opens up a class of workloads that would otherwise not fit. That is the headline use case, and it is the one that gets demoed the most.

The less obvious reason, and arguably the more important one for NVIDIA's actual customer base on this platform, is learning. NVIDIA positions the Spark as an entry point. Their official documentation, sample notebooks, and partner playbooks treat the box as a teaching appliance, with first-class guides for everything from spinning up a pre-built model behind a local chat interface, to running a coding assistant against a hosted endpoint, to fine-tuning small models, to building end-to-end applications in PyTorch and JAX. The pitch is that someone who has never written a CUDA kernel can get from zero to a working AI workflow at their desk in a weekend, and the same applies to engineers outside ML who want a self-contained sandbox they fully control. A two-Spark cluster extends that teaching surface into multi-node territory: the same person can now also learn how tensor parallelism, pipeline parallelism, and collective communication libraries actually behave, on a network that is real enough to expose real bottlenecks.

What is conspicuously absent from any of NVIDIA's own positioning, though, is a claim that the Spark is for production inference serving. Jensen Huang has talked about hardware-software co-design in nearly every keynote for the last several years, and the principle applies here: every NVIDIA platform is optimized for a specific workload shape, and the Spark is optimized for individual exploration and learning, not for serving traffic. Our previous Spark reviews have already shown that the platform is heavily memory-bandwidth-bound on most inference tasks, and the network only sharpens that constraint as soon as you cluster. A single 200 Gb link, while impressive for a desktop, is meaningfully slower than a PCIe Gen5 x16 connection within a single chassis (roughly 25 GB/s versus about 64 GB/s per direction), and collective communication patterns that work cleanly across an NVLink-bridged pair of datacenter GPUs do not transplant to a 200 Gb fabric without real latency penalties.

That is the real reason NVIDIA limited the officially supported configuration to two Sparks for so long, and why the four-unit demonstration at GTC was a response to user demand rather than an organic product expansion. Nothing prevents the software stack from running on four or eight nodes, and several users and outlets have published results from larger clusters. The numbers from those experiments are generally not flattering: the inter-node fabric becomes the dominant cost, collective performance degrades sharply, and per-user throughput at the tail end of those configurations can drop into the single-digit tokens-per-second range for any model large enough to justify the cluster in the first place. At that point, the setup is functionally a learning lab rather than a serving platform.

None of that is meant as a dismissal. Clustering Sparks is a genuinely excellent way to develop an intuition for distributed inference and training that is otherwise locked behind hundreds of thousands of dollars of datacenter hardware, and the educational value of being able to actually see pipeline bubbles, all-reduce bottlenecks, and parallelism trade-offs on a system you own is significant.
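As a concrete taste of that kind of experiment, the sketch below sweeps NCCL all-reduce across message sizes on a two-node cluster, launched with torchrun exactly as in the earlier probe. It is a minimal teaching sketch under the same placeholder assumptions, not a tuned benchmark, but it is enough to watch the collective move from latency-bound at small sizes to fabric-bound at large ones.

```python
# allreduce_sweep.py -- observe NCCL all-reduce behavior over the 200 Gb link.
# Launch with torchrun on both Sparks, as in the link probe above.
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

for size_mb in (1, 4, 16, 64, 256):
    # float16 is 2 bytes per element, so this tensor is size_mb megabytes.
    tensor = torch.ones(size_mb * 1024 * 1024 // 2, dtype=torch.float16, device="cuda")
    dist.all_reduce(tensor)            # warm-up establishes NCCL channels
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(10):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    per_op = (time.perf_counter() - start) / 10

    if rank == 0:
        # For a 2-rank ring all-reduce, bus bandwidth ~ payload size / time.
        print(f"{size_mb:>4} MB: {per_op * 1e3:7.2f} ms  "
              f"{size_mb / 1024 / per_op:6.2f} GB/s")

dist.destroy_process_group()
```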
Our own follow-up plan was to take this further by training a small 1B or sub-1B-parameter model from scratch across a dual-Spark cluster, with a setup chosen to mirror as closely as possible the conditions of a real distributed pre-training run, so we could show exactly where this class of cluster does and does not make sense. That project is currently on the back burner while we work through other coverage you may have already seen and wait for the optics for our new 800 Gb lab core switch to arrive. We expect to revisit it once the lab build settles.

What follows focuses on the use case for which the dual-Spark configuration is most defensible: distributed inference of models large enough to require both boxes, benchmarked across all three OEM implementations we have on hand. Before getting to the per-model numbers, the next section explains why we report those numbers under a pipeline-parallel configuration rather than the tensor-parallel configuration that NVIDIA's own documentation tends to default to.
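For orientation before that discussion, here is the general shape of a two-node pipeline-parallel deployment in vLLM, one of the serving stacks that runs on this class of hardware. This is a hedged sketch rather than our exact harness: the model ID is a stand-in for any model too large for a single 128 GB box, and it assumes a Ray cluster already spans both Sparks over the 200 Gb link (ray start --head on one node, ray start --address=<head>:6379 on the other).

```python
# pp_serve.py -- sketch of a two-node pipeline-parallel deployment with vLLM.
# Assumes a Ray cluster already spans both Sparks over the 200 Gb link.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",         # stand-in: any model too big for one box
    pipeline_parallel_size=2,            # one pipeline stage per Spark
    tensor_parallel_size=1,              # no per-layer all-reduces over the fabric
    distributed_executor_backend="ray",  # multi-node execution rides the Ray cluster
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain pipeline parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```

The design point worth noting is the split itself: pipeline parallelism pushes activations across the link once per stage boundary, whereas tensor parallelism would put a collective on the fabric at every layer. That trade-off is exactly what the next section quantifies.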
