GPU Networking Basics, Part 2
Front-End & Back-End, North-South & East-West Traffic, DeepSeek Example
Hey everyone! I was off the grid last week while attending Intel Vision. It was a significant moment for me, witnessing Intel CEO Lip-Bu Tan’s first public appearance at my inaugural event as an analyst:
I had face-to-face meetings with executives, discussing both expected and unexpected topics. What I learned at Intel Vision is under NDA, except for what was live-streamed. I can’t share details yet, but once the topics are public, I’ll have well-formed opinions ready for you!
This event had a compelling human element, considering Intel’s broader comeback story.
Vision is a customer-focused gathering where Intel executives engage directly with partners and customers, offering them attention and roadmap insight. Yet there was underlying tension: will these plans shift with a new CEO in charge? After all, Lip-Bu Tan (LBT) had only been on the job for two weeks, which isn’t enough time to make sweeping changes. Even so, he gave hints that change is coming; from the Vision Day 1 keynote:
LBT: At Intel, we will redefine some of our strategies and free up bandwidth. Some of the noncore businesses we will spin off — and focus on our core business.
An immediate murmur arose: What are the noncore businesses? Gaudi 3 and Jaguar Shores? Mobileye and Intel Automotive? The networking and edge business unit?
In conversations later that day, customers and analysts weren’t afraid to ask Intel executives the existential question: Will your business unit survive?
I watched these executives—who have undoubtedly experienced a lot of turmoil in the past few years—stand up with hat in hand and say, “As soon as I find out, I’ll let you know.”
This is the human element of the Intel comeback. I saw leaders responsible for setting strategy, inspiring their teams, and engaging with customers, all the while knowing LBT could replace them, spin them out, or shut them down.
It’s a corporate version of Schrödinger's cat, where their business unit exists in a superposition of dead and alive.
The executives couldn’t overlook this tension; customers need clarity, so they stepped up and addressed questions head-on.
When will we know if the business unit is dead or alive? For you quantum physicists, when will the wavefunction collapse? The answer was straight out of the Copenhagen interpretation: “When LBT makes his assessment [takes a measurement], then we’ll know.”
If I were a gambling man, I’d bet we’ll see changes announced during Intel’s next earnings call in two weeks.
Even with uncertainty in the air, the executives pitched their roadmap with conviction and engaged openly in Q&A. During our one-on-ones, no question was off-limits.
Sitting across from these executives brought a human dimension back into focus, a dimension often missing today. It doesn’t change my belief in the need for hard decisions, yet the human dimension adds depth to my analysis. For example, it highlighted the tension in leadership between empathy and execution.
Pat Gelsinger, deeply invested in Intel and its people, arguably didn’t make hard personnel decisions fast enough or deep enough. That’s not a character flaw; on the contrary, Pat’s love for the people of Intel surely inspired employees, partners, and customers during many difficult years. But that’s precisely why LBT is the right person now. LBT’s emotional distance can enable the decisions Pat struggled with. That doesn’t mean Pat’s vision was wrong; I still believe Pat’s 5N4Y was right for Foundry and Intel. But this is now LBT’s Intel.
And, while I have the microphone, I think Pat’s product strategy was severely lacking.
But, at Vision, I saw further down Intel’s Product roadmaps than I’ve seen before, and I liked what I saw. (Again, more to come later).
I also had the opportunity to be in the room when Intel Products CEO Michelle Johnston Holthaus gave a passionate interview with Jay Goldberg and Ben Bajarin:
I highly recommend listening or watching. For example, just after the 13-minute mark, Michelle discusses x86, chiplets, and hyperscalers making their own Arm CPUs.
MJ: It wasn't that customers were against x86 as an architecture. They were against the fact that we wouldn't build them something that would allow them to differentiate... They didn't leave x86 — we left them!
We have a crown jewel; we just haven't figured out how to use it… so we can still deliver to the masses but customize for the few... Chiplets are an important part of that.
If you think about a custom monolithic die, that's a whole new design every time... but if you combine a customer's chiplets, my chiplets, my advanced packaging capabilities, and my manufacturing capabilities and put them together - it's a distinct change and shift...
Customers know what they want, and I want them to see Intel as an option... They didn't think this was an option, it was 'you buy the roadmap'...Imagine if I say to them ... “I have IPs in chiplet form, how do you want to mix and match that?”... it allows your customers to start dreaming, and when your customers are dreaming then you are part of the conversation and get a seat at the table."
Interesting signal there!
I appreciate you sticking with me after last week’s pause. I hope the Vision commentary was worth the wait. There will be more travel in the coming weeks as I attend events focused on autonomous vehicles, fabs, and more. You can expect more on-the-ground coverage from me 🫡
Now, onto the main event.
GPU Networking Basics, Part 2
We’re continuing with our very gentle introduction to GPU networking. Catch up with Part 1 if you missed it:
Front-End vs Back-End
Last time, we discussed GPU-to-GPU communication during the pretraining of a large language model. We focused on the fast, high-bandwidth connections to nearby neighbors (e.g., via NVLink) and the slightly slower, lower-bandwidth connections to more distant nodes via InfiniBand or Ethernet and a network switch.
This GPU-to-GPU communication network is called the back-end network.

These all-important GPU interconnects often draw all the attention but are only one part of the broader networking system.
Consider how data reaches the GPUs during training. LLMs must ingest trillions of tokens from storage (SSDs) for the neural net to train on. This communication is done through a separate, Ethernet-based front-end network.
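To make that concrete, here’s a minimal sketch of the kind of input pipeline that generates this front-end traffic: a PyTorch-style dataset streaming pre-tokenized shards from network-attached storage. The mount path, shard layout, and sequence length are placeholders for illustration, not anything from a real cluster.

```python
# Sketch: an input pipeline whose reads travel the front-end network.
# The mount path and shard format are hypothetical placeholders.
import glob
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class TokenShardDataset(IterableDataset):
    def __init__(self, pattern="/mnt/datasets/shard_*.npy", seq_len=4096):
        self.shard_paths = sorted(glob.glob(pattern))
        self.seq_len = seq_len

    def __iter__(self):
        for path in self.shard_paths:
            # Reading a shard pulls data from storage over the front-end network.
            tokens = np.load(path, mmap_mode="r")
            for start in range(0, len(tokens) - self.seq_len, self.seq_len):
                yield torch.from_numpy(np.array(tokens[start:start + self.seq_len]))

# Each batch pulled here is front-end traffic: storage -> GPU host.
loader = DataLoader(TokenShardDataset(), batch_size=8)
```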
Many other workloads traverse the front-end network, too, such as cluster management software and developers remotely accessing the cluster for debugging.

The front-end network is deliberately separated from the back-end to prevent interference and congestion. Routine tasks like loading data or logging are kept off the high-speed GPU network, ensuring that non-critical traffic doesn’t disrupt the network on which expensive training runs depend.
Since front-end devices can be outside the data center, firewalls and access segmentation policies are often needed to isolate the back end from front-end-originated traffic. The extra inspection and segmentation add latency, but that’s acceptable because front-end traffic is typically latency-tolerant.
North-South vs East-West
Communication between GPUs and devices on the front-end network is called North-South traffic.
This North-South traffic occurs over Ethernet.
Why Ethernet? It’s cheap and ubiquitous. Front-end devices are already built to run on standard Ethernet networks, and datacenter operators know and love Ethernet.
Can you guess what traffic within the back-end network is called?
Yep, East-West traffic.
East-West traffic is latency-optimized for GPU-to-GPU scale-up and scale-out communication. In the largest-scale training, this back-end network can even span multiple data centers! 🤯
As I said before, reality is much more complicated than my simplistic drawings 😅
But the simple mental model you’re building here is the foundation for understanding the nuance and complications.
Checkpointing & Direct Storage
As an example of nuance, during LLM pretraining, model checkpointing is the practice of periodically saving a snapshot of the model’s parameters to persistent storage. These checkpoints allow training to resume from the last known good state in case of hardware failure and provide versioned artifacts.
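As a rough illustration, here’s what a periodic checkpoint write might look like in a PyTorch training loop. The interval, path, and state contents are placeholders; real frameworks typically shard checkpoints and write them asynchronously.

```python
# Sketch: periodic checkpointing during training (assumed PyTorch setup;
# the path, interval, and state dict contents are placeholders).
import torch

CHECKPOINT_EVERY = 1000              # steps between snapshots
CHECKPOINT_DIR = "/mnt/checkpoints"  # hypothetical persistent storage mount

def maybe_checkpoint(step, model, optimizer):
    if step % CHECKPOINT_EVERY != 0:
        return
    state = {
        "step": step,
        "model": model.state_dict(),         # tens to hundreds of GB at scale
        "optimizer": optimizer.state_dict(),  # optimizer state is often larger still
    }
    # This single save pushes a huge burst of data toward storage.
    torch.save(state, f"{CHECKPOINT_DIR}/step_{step:07d}.pt")
```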
If these large checkpointing writes of tens or hundreds of GB at a time are sent over the front-end Ethernet network, they risk colliding with other less essential traffic, causing congestion and unnecessary training stalls. To avoid this, AI training clusters can connect dedicated high-speed storage directly to the back-end network:

In this setup, checkpointing is additional East-West traffic, as it stays on the back-end network.
Mixture of Experts Training & Network Impact
Let’s look at a real example to cement our understanding.
Training large language models requires heavy East-West communication as workloads are distributed across tens or hundreds of thousands of GPUs. These GPUs frequently exchange gradient updates to keep the model learning consistently and converging toward accurate outputs.
One example of this multi-parallelism approach is DeepSeek’s V3 Mixture of Experts (MoE) model, which we previously discussed in detail here:
DeepSeek distributes the training workload using a combination of parallelism strategies, including data parallelism, pipeline parallelism, and expert parallelism.
Data parallelism splits data across GPUs, each of which processes its shard (slice of data) independently before synchronizing updates to the shared model:

Pipeline parallelism splits the model across GPUs, with each GPU handling a segment of layers and passing intermediate results forward:
Expert parallelism splits the model into multiple experts (subsections of the network) distributed across GPUs, activating only a few experts per token to reduce compute:
What can we immediately take away from this?
Each of these strategies decomposes the problem such that any GPU is only tackling a subset of the network and training data. Thus, frequent GPU-to-GPU communication is required to stay synchronized and ensure consistent model updates.
Also — reality is complicated! The interaction of data, pipeline, and expert parallelism results in overlapping communication that must be carefully managed to avoid stalls.
Each strategy imposes its own pattern of East-West traffic. Let’s walk through how each layer adds to the pressure.
Data Parallelism: Global Syncs
With data parallelism, each GPU processes a different mini-batch of data and then shares its progress with every other GPU after every training step. Thus, these GPUs must perform an all-reduce to average gradients and synchronize weights—a collective operation that involves every GPU exchanging several GBs of data.
Because this occurs at every step and blocks forward progress, it is incredibly latency-sensitive.
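Here’s a minimal sketch of that gradient all-reduce using torch.distributed. In practice a framework like DistributedDataParallel handles this for you; spelling it out just shows what hits the network after every step.

```python
# Sketch: the gradient all-reduce behind data parallelism, written out by hand.
# (In practice DistributedDataParallel does this; this just shows the traffic.)
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Every rank sends and receives this tensor; gradients are summed...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # ...then divided locally so all ranks hold the averaged gradient.
        param.grad /= world_size

# Per training step: loss.backward(); allreduce_gradients(model); optimizer.step()
```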
You can imagine the network pressure this creates on the entire system after each training step, as data is simultaneously sent across the back-end network:
This pressure spawns innovation. Nvidia’s InfiniBand with SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) supports in-network reduction to minimize traffic volume and latency. The network switches themselves do the computation!
See this excellent two-minute explainer from Nvidia:
Having the switches do computation to reduce the network traffic is a prime example of Nvidia’s systems thinking — innovating at the AI datacenter level:
In summary, data parallelism is clearly a network-intensive training approach requiring a robust, low-latency, high-throughput fabric to scale efficiently.
Pipeline Parallelism: Chained Dependencies
Pipeline parallelism splits a model across GPUs by layers, with each GPU handling a different stage of the forward and backward pass. Activations move forward one stage at a time, while gradients flow in the opposite direction. This forms a sequence of strict dependencies where each GPU must wait for inputs from the previous stage before computing and then pass results to the next.
Any delay from network congestion can stall the pipeline. To minimize this, pipeline stages must be placed on physically close nodes to reduce hop count and avoid congested network paths. Thus, pipeline parallelism relies on topology-aware scheduling to maintain stable throughput.
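To see those chained dependencies in code, here’s a bare-bones sketch of one forward micro-batch moving between pipeline stages with point-to-point sends and receives. Real schedules (1F1B, DualPipe) interleave many micro-batches and overlap transfers with compute; get_next_microbatch and the fixed tensor shape here are assumptions for illustration.

```python
# Sketch: one forward micro-batch flowing through a pipeline stage.
# Assumes each rank owns one stage; real schedules interleave many
# micro-batches and overlap these transfers with compute.
import torch
import torch.distributed as dist

def forward_stage(stage_module, rank, world_size, micro_batch_shape, device):
    if rank == 0:
        x = get_next_microbatch()              # hypothetical input fetch
    else:
        x = torch.empty(micro_batch_shape, device=device)
        dist.recv(x, src=rank - 1)             # block until the previous stage sends
    y = stage_module(x)                        # this stage's slice of layers
    if rank < world_size - 1:
        dist.send(y, dst=rank + 1)             # hand off to the next stage
    return y
```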
Expert Parallelism: Uneven Traffic
Expert parallelism introduces a different communication pattern; it routes individual tokens to a small subset of experts. These experts are subnetworks located on different GPUs, and only a few are activated per input. A single token might be dispatched to Expert 3 and Expert 12, which could reside on GPUs in separate nodes.
This setup leads to irregular and bursty communication. Some GPUs receive large volumes of tokens, while others remain mostly idle. The resulting traffic is non-uniform and shifts with each batch.
Because the communication is non-deterministic, it also complicates system planning and debugging.
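Here’s a stripped-down sketch of that routing and dispatch step, assuming a common top-2 gating setup; this is illustrative, not DeepSeek’s kernel. The all-to-all at the end is where the bursty cross-node traffic comes from.

```python
# Sketch: top-k expert routing for a batch of tokens (illustrative only).
# Tokens routed to experts on other ranks are exchanged with an all-to-all,
# which is the source of the bursty, non-uniform East-West traffic.
import torch
import torch.distributed as dist

def route_tokens(tokens, gate, k=2):
    # tokens: [num_tokens, hidden]; gate: a small linear scoring layer
    scores = gate(tokens)                          # [num_tokens, num_experts]
    topk_scores, topk_experts = scores.topk(k, dim=-1)
    return topk_scores.softmax(dim=-1), topk_experts

def dispatch(send_buffers):
    # send_buffers[r] holds the tokens destined for experts hosted on rank r.
    # (Assumes equal-sized splits for simplicity; real systems exchange counts first.)
    recv_buffers = [torch.empty_like(b) for b in send_buffers]
    dist.all_to_all(recv_buffers, send_buffers)    # irregular, bursty exchange
    return recv_buffers
```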
Much work is done in software to load-balance the workload across experts. DeepSeek shared their strategy and code:
As described in the DeepSeek-V3 paper, we adopt a redundant experts strategy that duplicates heavy-loaded experts. Then, we heuristically pack the duplicated experts to GPUs to ensure load balancing across different GPUs. Moreover, thanks to the group-limited expert routing used in DeepSeek-V3, we also attempt to place the experts of the same group to the same node to reduce inter-node data traffic, whenever possible.
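A toy version of that heuristic, not DeepSeek’s released implementation: duplicate the hottest experts so their load splits across copies, then greedily place each replica on the least-loaded GPU.

```python
# Toy sketch of load-balanced expert placement (not DeepSeek's actual code).
# Duplicate the hottest experts, then greedily place each replica on the
# GPU with the least accumulated load.
import heapq

def place_experts(expert_loads, num_gpus, num_redundant):
    # expert_loads: {expert_id: observed token load}
    hottest = sorted(expert_loads, key=expert_loads.get, reverse=True)[:num_redundant]
    replicas = []
    for eid, load in expert_loads.items():
        if eid in hottest:
            # Two copies of a hot expert each take roughly half its tokens.
            replicas += [(eid, load / 2), (eid, load / 2)]
        else:
            replicas.append((eid, load))

    gpus = [(0.0, gpu_id, []) for gpu_id in range(num_gpus)]  # (load, id, experts)
    heapq.heapify(gpus)
    # Place heaviest replicas first, always onto the least-loaded GPU.
    for eid, load in sorted(replicas, key=lambda r: r[1], reverse=True):
        gpu_load, gpu_id, assigned = heapq.heappop(gpus)
        assigned.append(eid)
        heapq.heappush(gpus, (gpu_load + load, gpu_id, assigned))
    return {gpu_id: assigned for _, gpu_id, assigned in gpus}

# Expert 0 is overloaded, so it gets duplicated and split across the two GPUs.
print(place_experts({0: 900, 1: 120, 2: 80, 3: 60}, num_gpus=2, num_redundant=1))
```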
Putting It All Together
Each of these parallelism strategies is demanding on its own. The back-end network must support three distinct types of pressure:
Global collective ops (data parallel)
Synchronous chained flows (pipeline)
Sparse, bursty, cross-GPU dispatch (experts)
These networking tasks happen simultaneously; activations flow through pipelines, gradient all-reduce begins, and expert GPUs request tokens. The back-end network must absorb the chaos without slowing down.
Appreciating DeepSeek
Grasping the network challenges in MoE training lets us appreciate DeepSeek’s deliberate strategy to avoid congestion through careful system design.
From the V3 technical report:
Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
How did they do this? Remember the computation and communication innovations we unpacked last time? Again, from DeepSeek:
In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).
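DualPipe itself is beyond our scope, but the core overlap trick is easy to sketch: launch a non-blocking collective, keep computing while the network works, and synchronize only when the result is needed. This is a generic illustration, not DeepSeek’s code.

```python
# Sketch of the basic overlap idea (not DualPipe): launch a non-blocking
# collective, keep computing while the network works, then synchronize.
import torch
import torch.distributed as dist

def overlapped_step(bucket_grads, next_layer_compute):
    # Start averaging one bucket of gradients without blocking the GPU.
    handle = dist.all_reduce(bucket_grads, op=dist.ReduceOp.SUM, async_op=True)

    # While the all-reduce is in flight, run the next chunk of backward compute.
    next_layer_compute()

    # Only block when we actually need the reduced gradients.
    handle.wait()
    bucket_grads /= dist.get_world_size()
    return bucket_grads
```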
Other AI labs are surely working to overcome network congestion too; they don’t face the constrained H800 bandwidth that DeepSeek does, but they nonetheless deal with complex parallelism and network pressure. Still, we’ll give props to DeepSeek for sharing their insights with us.
And hey, although we’ve kept our GPU networking series very simple, it’s cool that we can skim modern research papers and better appreciate their innovations!
Beyond the paywall, we combine our insights with industry data to revisit the InfiniBand vs Ethernet contest. Hey, SHARP makes InfiniBand look strong, right? Can Nvidia stay on top?