GPUs: General-Purpose Parallel Computing
A Gentle Introduction to GPUs and Insights into Leading GPU Companies
Welcome back to our series explaining chips that help systems think. Our first article covered CPUs, the most general purpose semiconductor chip. I highly recommend beginning with the CPU article if you’re new.
Now we’ll dive into GPUs, a chip at the heart of current events including a chip ban on China, NVIDIA’s remarkable rise to a $1 trillion company, and a public spat between big GPU companies.
This article might feel more technical than the last one as we dive into parallelism to explain what GPUs are and why they matter. Hang tough! I’ll use everyday analogies to keep it simple.
For paid subscribers, we will cover the companies competing in the data center and client GPU markets, including NVIDIA, AMD, and Intel.
Let’s blast off!
Sequential vs Parallel
The best way to start understanding parallel processing is this fun 90-second Mythbusters video showing the difference between sequential processing (CPUs) and parallel processing (GPUs): Mythbusters video
📖 sequential: one operation at a time
📖 parallel: multiple operations processed simultaneously
Parallelism and GPUs
GPU stands for Graphics Processing Unit, which is a bit of a misnomer. The name is a relic of a time when the primary function of the chip was strictly graphics processing. GPUs have since evolved into a general purpose tool designed to efficiently solve particular problems that are inherently parallel.
📖 graphics processing: the manipulation and rendering of visual data by a computer
📖 inherently parallel: problems that can be divided into smaller, independent tasks capable of being performed at the same time
Remember our observation that it is more efficient to brush all teeth simultaneously rather than one at a time? This is an example of inherent parallelism in a problem. We noticed the traditional teeth brushing routine repeats a particular step over and over (“brush this single tooth”), and so we imagined building custom hardware to brush each tooth simultaneously.
Some problems are more parallelizable than others. If a problem requires little or no effort to divide into parallel tasks, it’s often called “embarrassingly parallel”.
📖 embarrassingly parallel: problems that can easily be broken down into a number of independent parallel tasks
Graphics rendering is an embarrassingly parallel task. Each pixel can often be calculated without needing information about other pixels, making it easy to render many pixels simultaneously.
Back to our teeth brushing example – we noticed each tooth could be brushed independently, but we were limited to operating on a single tooth at a time. We could improve this by using both hands (if we are coordinated enough) to cut the brushing time in half. Let’s call each hand with a toothbrush a “processing unit”. At best, humans only have two processing units, so our ability to parallelize teeth brushing is limited. To fully parallelize our solution and unlock maximum efficiency, we needed to build a robot with enough processing units to operate on every tooth simultaneously.
Graphics rendering is similar. Each pixel can be operated on simultaneously. We can decrease the time to render an image by increasing the number of processing units. Engineers did just this - they built specific chips with many processing units to speed up graphics rendering and called them Graphics Processing Units or GPUs. The processing units in these chips did very specific graphics processing tasks like “rasterization” and “pixel shading”.
💡GPUs were initially developed as specialized chips, primarily designed to handle the parallel processing of graphics-related tasks.
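To make the “one processing unit per pixel” idea concrete, here’s a tiny sketch written in CUDA (NVIDIA’s GPU programming platform, which we’ll come back to later in this article). It’s not real rendering code - the kernel name and the halve-the-brightness operation are made up for illustration - but it shows the key property: each GPU thread works on exactly one pixel and never needs to know anything about its neighbors.

```cuda
#include <cuda_runtime.h>

// Hypothetical "pixel shading" step: darken every pixel of a grayscale image.
// One GPU thread handles one pixel -- no thread depends on any other pixel.
__global__ void darkenPixels(unsigned char* pixels, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's pixel index
    if (i < numPixels) {
        pixels[i] = pixels[i] / 2;                  // halve the brightness
    }
}

int main() {
    const int numPixels = 1920 * 1080;              // one full-HD frame
    unsigned char* pixels;
    cudaMallocManaged(&pixels, numPixels);          // memory visible to CPU and GPU
    // (a real program would load image data into this buffer here)

    // Launch enough threads to cover every pixel (~2 million of them),
    // grouped into blocks of 256 threads.
    int threadsPerBlock = 256;
    int blocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;
    darkenPixels<<<blocks, threadsPerBlock>>>(pixels, numPixels);
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    cudaFree(pixels);
    return 0;
}
```

Two million threads sounds absurd by CPU standards, but it’s exactly how GPUs like to work: the hardware schedules those threads across its many cores and churns through them in parallel.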
General Purpose GPUs
Our toothbrush robot was not general purpose; it was specialized for the single purpose of brushing teeth. What if we later decided we wanted it to floss? Creating a separate robot exclusively for flossing would not be cost-effective or practical. Instead, given that it's already equipped to address each tooth individually, we should modify its manipulators to perform extra tasks like flossing.
Similarly, GPUs initially specialized in a single task (graphics processing) but have transformed into general processors for a range of parallel tasks. This evolution was enabled by their inherent design for parallel processing, allowing their processing units to be adapted for diverse applications beyond just graphics.
The GPU evolution from single purpose to general purpose didn’t happen overnight; the journey is fascinating and enlightening, and I’ll write a future post on it.
Thousands of Cores
Earlier I mentioned that GPUs have a parallel architecture with many processing units, and therefore can process many pixels simultaneously. In fact, “many” is a bit of an understatement - they have hundreds to thousands of cores per chip!
For example, NVIDIA’s H100 GPU has 14,592 parallel processing units it calls CUDA cores. Fourteen thousand! This amount of parallelism in a single GPU can massively accelerate certain workloads.
If you remember anything from this article, remember that GPUs are parallel processing workhorses:
💡Modern GPUs are general purpose parallel processing chips.
If the parallelism in a single GPU isn’t enough, you can distribute that workload across many GPUs.
For example, in 2020 Microsoft announced a supercomputer developed for OpenAI with more than 10,000 GPUs and 285,000 CPU cores in a single system. 🤯 As OpenAI CEO Sam Altman explained, “We are seeing that larger-scale systems are an important component in training more powerful (AI) models”.
✨ An aside: if you’re new to GPUs, the H100 is at the heart of the generative AI explosion and a single GPU costs >$30,000! Check out this nice 2 minute video from the World Economic Forum with Chris Miller discussing the H100. Also, I highly recommend Chris Miller’s amazing book Chip War if you haven’t read it yet. I personally devoured it. ✨
Parallel Processing Units
Given that GPUs are not graphics-specific anymore, I’d love to rebrand them as PPUs - Parallel Processing Units. PPU emphasizes parallelism and doesn’t mention any particular applications (like graphics).
You’ll sometimes see GPUs referred to as GPGPUs, which stands for General Purpose Graphics Processing Units. While GPGPU is technically correct, it’s not concise or memorable.
Performance is more than core count
The number of parallel processing units is important, but equally important are a GPU’s memory bandwidth and memory capacity. Let’s define these and then look at an analogy to make bandwidth and capacity intuitive.
📖 memory bandwidth: the rate at which data can be read from or stored into memory
Memory bandwidth is typically measured in Gigabytes per second (GB/s)
📖 memory capacity: the total amount of data that a chip’s memory system can store at any given time
Memory capacity is typically measured in Gigabytes (GB)
Sawmill Factory Analogy
To develop an intuition for memory bandwidth and capacity, let's consider a simple analogy - an old sawmill factory that turns logs into lumber.
Conveyor Belt as Memory Bandwidth
Imagine a scenario with a sawmill factory full of workers, each capable of processing logs into lumber. There’s a conveyor belt system that brings logs into the factory and transports individual logs to each worker. For maximum efficiency, this conveyor belt should be wide enough to carry the largest logs a worker can handle and fast enough to ensure a new log is ready as soon as a worker finishes processing the previous one. If the conveyor belt is too slow or too narrow, it could become a bottleneck, leaving workers idle or operating below their full capacity.
In this analogy, the conveyor belt represents the memory bandwidth of the GPU. What matters is how much data can be transported to the processing units and how quickly. That’s why bandwidth is measured as an amount of data per unit of time (GB per second). The goal is to deliver data at a rate that keeps the cores from sitting idle.
Log Storage Area as Memory Capacity
What about memory capacity? In our analogy, envision a log storage area adjacent to the sawmill, designed to hold a large amount of wood. This area is big enough to guarantee a consistent supply of logs for the sawmill's operations for the day. The storage area prevents downtime caused by waiting for log deliveries, thus maintaining a steady workflow in the sawmill.
The log storage area mirrors a GPU's memory capacity. It holds enough data to keep GPU cores productive. We need both a large pile of logs waiting to be processed (capacity) AND a fast and wide enough conveyor belt (bandwidth) to bring those logs to workers for maximum productivity.
Balancing Bandwidth, Capacity, and Processing Power
If a sawmill manager wants to increase the amount of work the factory gets done per day, simply adding more workers or increasing their individual processing speed might not improve overall throughput. The limiting factor may instead be the conveyor belt's speed or width, or the amount of wood on hand. Likewise, adding more GPU cores may not improve throughput; the system may need more memory bandwidth or capacity instead.
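To put rough (purely illustrative) numbers on this: suppose a GPU’s cores can together perform 10 trillion simple operations per second, and each operation needs one 4-byte number fetched from memory. Keeping every core busy would require 10 trillion × 4 bytes = 40 TB/s of memory bandwidth. If the memory system only delivers 2 TB/s, the cores spend roughly 95% of their time waiting, and adding more cores changes nothing. Similarly, if a workload’s data is larger than the GPU’s memory capacity, data has to be repeatedly swapped in from slower storage, which starves the cores in a different way.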
CPU vs GPU
I mentioned previously that modern CPUs have many cores (for example, 4 to 24 cores) in a single integrated circuit and can be thought of as simple parallel processing systems.
📖 integrated circuit: An integrated circuit (IC) is a set of electronic components such as transistors, resistors, and capacitors, built onto a single piece of semiconductor material, typically silicon, to form a complete electronic circuit.
✨ An aside: Prior to its 1959 invention by Robert Noyce (a fellow Iowan), circuits were made by connecting discrete, bulky components. By integrating the components into a single piece of silicon, manufacturers could reliably build small, power-efficient, and performant circuits. Integrated circuits on silicon enable Moore’s Law. ✨
At this point, CPUs and GPUs may be starting to sound very similar. CPUs are general purpose and have more than one core, and GPUs have evolved into a general purpose system with hundreds or thousands of cores. It’s fair to wonder, “are GPUs just scaled up CPUs with more cores?”
The answer: Not quite!
The key difference to know between CPUs and GPUs:
💡 CPUs are designed to minimize latency, whereas GPUs are tailored to maximize throughput.
Quick definitions,
📖 latency: the duration from the moment an instruction is received to the completion of its execution
📖 throughput: the volume of work done or data processed within a specific time frame
CPU cores are optimized to be fast and handle complex logic. For example, CPUs can handle conditional statements (like if-else statements) and even predict which path to take in these situations to avoid waiting for the condition to be computed.
In contrast, GPU cores are built for simpler operations that can be massively parallelized. While each core is less powerful individually, their sheer number makes GPUs incredibly efficient for certain tasks.
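Some made-up numbers to illustrate the trade-off: suppose a CPU core finishes a small task in 1 microsecond, while a single GPU thread needs 10 microseconds for the same task. For one task, the CPU wins easily (lower latency). But if the GPU can run 10,000 such tasks at once, it completes about a billion tasks per second versus the CPU core’s one million - a thousandfold advantage in throughput, provided the tasks are independent.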
Ferrari or a Delivery Truck?
To illustrate the difference between low latency (CPUs) and high throughput (GPUs), here’s an awesome analogy I heard on a podcast somewhere (I can’t recall which one, but thank you if you’re reading this!).
Imagine you’re in the package delivery business. You have the option of using a Ferrari, which will get you to each destination incredibly quickly (low latency) but can only hold one or two packages at a time. Alternatively, you can deliver packages with a delivery truck, which travels more leisurely to each destination but can carry 50-100 packages at a time (higher throughput). If you’re in the rapid delivery business, the Ferrari enables you to deliver packages quickly, whereas the truck is well-suited for a traditional delivery service.
In this analogy, the Ferrari is analogous to a CPU tailored for low latency, specializing in rapidly completing tasks. The delivery truck represents a GPU, optimized for high throughput and able to handle numerous computations simultaneously.
GPU cores - simple & small
CPU cores are typically larger and more complex, taking up more space on the chip due to additional components needed for handling complex tasks. In contrast, GPU cores are designed to be smaller and simpler, primarily focusing their space on performing calculations.
This is commonly illustrated conceptually as:
Kindergarteners Analogy
An elementary analogy (pun intended) to illustrate the power of many simple cores working on an inherently parallel problem:
Imagine a classroom with two teachers and 25 kindergarteners. For most tasks throughout the day, the teachers are far faster and more capable problem solvers than the students. Yet when it comes time to pick up a classroom full of scattered toys, nothing beats 25 pairs of little hands working simultaneously.
Many hands make light work.
CPU and GPU differences
The image below illustrates the internal structure of CPUs and GPUs. CPUs focus on low latency, featuring more cache memory, more control logic, and faster clock speeds, yet they have fewer computational units. Conversely, GPUs emphasize processing large amounts of data simultaneously with many cores, but have less local memory per core and simpler control logic.
As previously mentioned, GPUs and CPUs are architecturally different and therefore can’t be compared “apples to apples”. Resist the urge to think that a 1,024-core GPU is 64x faster than a 16-core CPU.
💡 CPU and GPU cores are not directly comparable.
Only as good as the software they enable
By now we’ve belabored the point that GPUs are amazing parallel processing machines. But the hardware can only shine if developers can write and deploy software on it!
💡 Parallel processors' performance hinges on the software they run.
In the early days, when GPUs were less general-purpose, GPU programming was very complex and required a deep understanding of GPU architecture and parallel algorithm design. Developers had to manually manage complicated details like memory hierarchies and the coordination of parallel thread execution. Over time, software abstraction layers were developed to hide these complexities and make GPU programming approachable. These frameworks have built up extensive ecosystems of libraries and tools, boosting productivity with pre-built solutions for tasks in domains ranging from financial modeling to robotics.
For example, CUDA from NVIDIA:
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.
You’ll hear CUDA referred to as a “moat” - a significant and durable competitive edge - for NVIDIA. They launched CUDA way back in 2006, and being such an early leader in this space has led to a vast ecosystem with widespread adoption in both industry and academia. This creates a vendor lock-in effect, as software optimized for CUDA runs exclusively on NVIDIA's GPUs.
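To give a feel for what CUDA and its libraries buy you, here’s a small sketch using Thrust, a C++ library that ships with the CUDA Toolkit. Sorting millions of numbers on the GPU takes one library call - no hand-written kernels, no manual memory management or thread coordination. (A sketch for illustration, not production code.)

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    // 16 million random numbers, generated on the CPU.
    thrust::host_vector<int> h_values(16 * 1024 * 1024);
    for (size_t i = 0; i < h_values.size(); ++i) {
        h_values[i] = std::rand();
    }

    // Assigning to a device_vector copies the data into GPU memory for us.
    thrust::device_vector<int> d_values = h_values;

    // One call sorts on the GPU; the library handles the kernel launches
    // and thread coordination behind the scenes.
    thrust::sort(d_values.begin(), d_values.end());

    // Copy the sorted result back to CPU memory.
    thrust::copy(d_values.begin(), d_values.end(), h_values.begin());
    return 0;
}
```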
Combining GPUs with CPUs
Many modern systems use both CPUs and GPUs. Leveraging both processors in a solution enables performance optimization across a broad spectrum of applications and computational workloads.
For example, laptops typically have a GPU integrated into the same chip as the CPU. The GPU is often tasked with graphics rendering and video playback while the CPU handles general-purpose computing tasks. This offloading allows the laptop to deliver smoother graphical performance, particularly in gaming or high-definition video streaming.
Autonomous vehicles are an example of an embedded system with both CPUs and GPUs. The GPUs process visual data for object detection, while the CPUs manage the vehicle's control systems, path planning, and sensor communications.
📖 embedded system: An embedded system is a combination of computer hardware and software designed for a specific function within a larger device or system.
An illustrative reference - 50 examples of embedded systems, such as pacemakers, game consoles, microwave ovens, and more.
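To make the division of labor concrete, here’s a hypothetical sketch of the offloading pattern in CUDA. The function names are invented for illustration; the point is that a kernel launch returns immediately, so the CPU can keep doing its own work while the GPU crunches through the data in parallel.

```cuda
#include <cuda_runtime.h>

// Hypothetical data-parallel GPU work, e.g. processing a camera frame.
__global__ void processFrame(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 0.5f;   // stand-in for real image processing
    }
}

// Hypothetical CPU-only work: control logic, path planning, sensor I/O...
void handleControlLogicOnCpu() { /* ... */ }

int main() {
    const int n = 1 << 20;       // about one million values
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    // The kernel launch is asynchronous: it returns right away,
    // leaving the CPU free while the GPU does the heavy lifting.
    processFrame<<<(n + 255) / 256, 256>>>(in, out, n);

    handleControlLogicOnCpu();   // CPU works in parallel with the GPU

    cudaDeviceSynchronize();     // wait for the GPU result when we need it
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```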
Application Time
Can you explain the concept of parallel processing and how it differs from sequential processing?
What is the primary purpose of a GPU, and how has its role evolved over time?
Can you describe the difference between latency and throughput and how these concepts apply to CPUs and GPUs?
How does the memory bandwidth and capacity of a GPU impact its performance?
Which is more general purpose - a CPU or GPU? Why?
How did you do? You can always reach out via email or social media if you have questions!
The rest of this article, accessible to paid subscribers, will discuss GPU market segments and GPU competitors. Market segments discussed include data center GPUs and client GPUs. Companies discussed include NVIDIA, AMD, Intel, Qualcomm, Apple, and ARM.
Free subscribers can skip ahead to ASICs: