🎧 Chipstrat + The Circuit - Episode 70 - NPUs
Why NPUs exist, consumer impact, the trouble with TOPS, predicting the future, and more
I recently joined The Circuit podcast for a fun discussion on Neural Processing Units (NPUs).
We explored the purpose and potential impact of NPUs on consumer technology, envisioning new applications powered by NPUs and multimodal LLMs. We also considered the limitations of TOPS and TOPS per Watt metrics, offered guidance on effectively evaluating NPUs, and pondered whether NVIDIA might develop its own NPU.
This is a great introduction to NPUs. The transcript is below, and you can listen to it here:
My next regular post is in the works and coming soon!
Transcript
Ben Bajarin
Hello everybody, welcome to another episode of The Circuit. I am Ben Bajarin.
Jay Goldberg
Hello world, I am Jay Goldberg, coming to you from an undisclosed location – effectively a TSA holding cell.
Ben Bajarin
We think Jay has been detained by some legal authority, and we're not sure who – based on his background.
We would like to welcome Austin Lyons back to the podcast. Austin, how are you?
Austin Lyons
Cowabunga, I'm good. I'm coming at you as Chipstrat on the interwebs.
Ben Bajarin
That was tremendous. I was going to work that in if somebody didn't at the end. So very good. That was a request from a listener to work in a word. In fact, do this every week because we will be happy to say something random in the middle of a conversation.
We'd also like to welcome Paul Karazuba. Did I say that right?
Paul Karazuba
You got it. Thanks for having me.
Ben Bajarin
Yes, awesome. And Paul, real quick, give a who are you, where you're from, what you do.
Paul Karazuba
Absolutely. My name is Paul Karazuba. I'm the vice president of marketing at Expedera. We are an NPU IP startup here in Silicon Valley. On Saturday it will be 26 years that I've spent in the semiconductor industry – not that I'm counting the days, but it certainly makes you feel old when you say that.
Ben Bajarin
Tremendous tremendous.
Okay, so as you might have guessed, Paul's job is relevant to today's podcast because the topic is NPUs.
If you're new, NPU stands for Neural Processing Unit. This is not a new term, although people are starting to use it more regularly, especially people who did not use it before – which was actually a relevant conversation a couple of years ago.
For example, Intel called this their VPU based on the fact that these things have vector cores in them. And many of us were like, stop doing that – everybody's going to call it an NPU. So you might as well do so too.
Apple almost never referred to it as this until the iPad launch. This was always Apple Neural Engine. Many of us who knew what that was knew it was a Neural Processing Unit, but they didn't want to talk about it.
And now, out into the world are Qualcomm, Intel, and AMD. Everybody's saying we've got NPUs, and it's basically the newest chip on the block—even though it's not the newest, it's the newest because there was a CPU, then there was a GPU, and now there's an NPU, which makes it relevant.
So that's the topic for today's conversation.
So I'm going to lob out to our committee the thesis for why this product (NPU) exists.
We recently at Creative Strategies did a research report on the NPU. A lot of that work described what this is architecturally: the way it uses a host of cores – vector, matrix, and scalar – working in a symphony of collaboration, tied to memory that may or may not be on that block.
But the thing that makes the NPU unique, and this is sort of the point that we wanted to make, is that it can process AI at a much lower power than the CPU and the GPU core.
That doesn't mean that it's always going to be the best place to run an AI workload, just that if you have something that you want to run, and let's just say that's going to run for five hours in the background, the NPU is a great part of that because it can run at milliwatts of power. So that's the basis.
Now, the thesis I want to throw out – the counterargument to what we're saying – is: why not just run that AI on the CPU or the GPU? But the point that landed with me in this exercise was that CPUs and GPUs have other jobs than running AI and doing dense matrix multiplication math.
It's not that they can't do it, but they might also be playing a game, crunching visuals, or encoding video in real time. The CPU is running system tasks. It's got a whole slew of things.
So, do we agree with the thesis that having something dedicated to this process – so that the other cores can do their jobs really well and not be distracted – is a sound way to approach the role of the NPU?
Paul Karazuba
So do we agree entirely? While a CPU and GPU, as you have said, are fully capable of running an AI network, there's a difference between can and should. And if you're talking about a battery-powered device, for instance, the difference in battery life can be considerable when you talk about running an AI network in a dedicated purpose-built AI core rather than a more general-purpose CPU or GPU. So yes, absolutely, NPUs should exist and do exist for a reason, and they are becoming a part of chip designs everywhere in every market from what we're seeing.
Austin Lyons
I will keep going with Paul's answer.
So, if we step back and ask again, what is the purpose of an NPU? Why did we create one in the first place? Why not just use the CPU or GPU?
If you pull back the cover and look – Ben, you mentioned scalar math, which is operating on single values, like 42 + 2. Vector math would be operating on arrays or lists of data. And then there's tensor math – matrix multiplication would be a two-dimensional tensor, and there are higher-dimensional tensors beyond that.
For neural networks, the math they do is vector and tensor math.
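[Editor’s note: a minimal NumPy sketch of the three kinds of math described here – scalar, vector, and tensor (matrix) operations. The shapes and values are made up purely for illustration.]

```python
import numpy as np

# Scalar math: one operation on single values
scalar_result = 42 + 2  # 44

# Vector math: elementwise multiply, then accumulate (a dot product)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.5, 0.5])
vector_result = np.dot(x, w)  # (1*0.5) + (2*0.5) + (3*0.5) = 3.0

# Tensor math: matrix multiplication (a 2-D tensor operation),
# the core workload of neural network layers
activations = np.random.rand(4, 8)   # e.g. a batch of 4 input vectors
weights = np.random.rand(8, 16)      # e.g. a layer's weight matrix
tensor_result = activations @ weights  # many multiply-accumulates -> shape (4, 16)
```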
You can definitely do these types of operations on a CPU. CPUs can handle vector math with SIMD extensions, but there are limitations (lack of significant parallelism), and it's just not as fast as having the native hardware to do it. So, sure, you can do it on a CPU, but it's going to be slower.
You can do it on a GPU. GPUs can support vector and matrix operations (with high degrees of parallelism). But, as Ben said, GPUs are graphics-oriented. Especially on an edge device like a cell phone, they're being utilized to drive the display.
So, you can have an NPU geared for AI compute with lots of parallel computations, both tensor and vector multiply-accumulate units. It’s designed for low latency and low power. By offloading AI to the NPU, you can reduce GPU and CPU utilization.
Running neural nets on a GPU requires extra power.
So that's how I frame it – where is the best place to do it [the AI workload]? It would be for the circuitry that was designed to do it most efficiently.
Paul Karazuba
One other thing perhaps to add to that: when we think of networks that are running on devices, regardless of cloud or device or whatever it may be, we tend to think today in terms of the more traditional video or audio AI networks that one might run. But if you think of the comparative size of an LLM that you might want to run, and you're trying to optimize for something like that – where you have a hundred times the number of operations of a traditional network – the level of optimization or efficiency that an NPU is going to give you is going to be dramatic compared to what the same thing would take on a GPU or a CPU. So, looking forward, the advantages of an NPU – of having an NPU inside of your SoC – are just going to grow with every release of every new network.
Jay Goldberg
I find it kind of humorous that we even have to have this debate, or not debate, that we even have to have this conversation to validate an NPU. Because the whole history of semiconductors is just one long succession of multiple chips being merged into one chip, right? Like ALUs, arithmetic logic units, used to be a discrete chip, what, 30 years ago, 40 years ago? And now we don't even talk about them, right? But it was a workload, and we needed something to do it, and over time we merged it into CPUs and kind of forgot about it.
I think NPUs fall into that category. They have new functionality. We need something that can do AI better than the existing solution. But for some reason, this one is contentious.
Austin Lyons
Yeah, I want to add to what Jay's saying.
I was looking back at the history of domain-specific accelerators. If you go back to the Intel 8086, it was integer-based math. If you wanted to do floating point math, you had to do that in software, which was very inefficient.
The Intel 8087, released in 1980, was a math co-processor designed for native floating-point arithmetic. It was a very early example of a domain-specific accelerator, offloading floating-point math from the CPU.
And NPUs are like the AI equivalent of that. So this concept is not new.
Ben Bajarin
Yeah, but at the same time, I guess where the lack of clarity came from was that it takes up transistor budget. And so everybody was essentially asking, is that worth it? Especially as that block gets bigger, as you see in a couple of different implementations – Qualcomm's, for example, and then you'll see a few others come out at Computex where you're like, wow, that's way bigger; you put way more cores in that than I thought you were going to. So that's where I think it was just misunderstood. Where I heard people critique this, even in the early days, some of the big names we're talking about were saying, I just don't know how much of the die I should give it, right? I have other things I need to do for core compute performance.
Paul Karazuba
I understand that argument, Ben, and I'll come at this from the business perspective. Yes, silicon real estate is the most expensive real estate in the world. But at the same point, how are OEMs selling their products today? They're selling them as AI-enabled. They're selling the AI functionality, and they're differentiating their products with AI. Chipmakers are differentiating their products with AI.
So if you look at a value versus cost perspective, I find it hard to justify why you wouldn't want to put an NPU in your system and why you wouldn't want to have a very high capacity NPU in your system to talk about how great your self-driving chip is going to be or to talk about how great the AI is on your smartphone or how good your data center is. There are just a million business advantages to doing it, especially when you talk about the cost of silicon. It's not really a debate for most folks.
Ben Bajarin
Agreed. Okay, so I'd like a little bit more kind of technical. And I say technical loosely, like, let's go as deep as we can, but make this digestible.
Well, let's do two things. I keep hearing TOPS per Watt. I don't love TOPS. Nobody loves TOPS. We put it in our paper, and I just need to shout this out because I came up with this title – I called it “TOPS o’ the morning to you” – which is how I described this section: to say, look, this is why it's being used, but this is not the best metric. So let's start with that. And then I want to understand the wattage element of this math.
Paul Karazuba
So, I'll start with this: I am an NPU supplier. I will freely admit that I use TOPS per Watt in all of my marketing language. I will also freely admit, and have done so on my website, that it is a completely ineffective, meaningless measurement of an NPU's effectiveness. And we all laugh. The problem is that it's the most commonly understood.
TOPS is a measure of MACs * frequency * 2. [Editor’s note: TOPS is Trillions of Operations per second].
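[Editor’s note: a quick worked example of that formula, using hypothetical hardware numbers – 4,096 MAC units at 1.5 GHz – purely for illustration.]

```python
# Peak TOPS = number of MAC units x clock frequency x 2 (one multiply + one accumulate)
mac_units = 4096        # hypothetical count of parallel multiply-accumulate units
frequency_hz = 1.5e9    # hypothetical 1.5 GHz clock
ops_per_mac = 2

peak_tops = mac_units * frequency_hz * ops_per_mac / 1e12
print(f"Peak: {peak_tops:.1f} TOPS")  # ~12.3 TOPS
```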
It's not a real measurement of actual performance. When you're looking at TOPS per Watt, I published a blog about this. I'll repost it on our social feed so folks who are hearing this can see it.
But when you're really looking at the effectiveness of the system, you need to understand all the underlying test conditions behind where that TOPS per Watt number came from. What process node are you in? What frequency? Are you assuming integer or floating point? What actual network are you running? All of that will highly skew the TOPS per Watt argument.
I mean, I've seen people post something wild, like 300 TOPS per Watt, which is absolutely unachievable in 99 and five nines percent of the cases, except for the one little corner case where I have something that is so optimized for this particular workload that it works really, really well. So yes, TOPS per watt is a meaningless number. But the problem is that it’s the number that most people understand or at least can somewhat relate to. So that's why it's used.
Austin Lyons
Yep. Let me take it a level deeper. Paul did a good job, and I'll explain really quick for new people – TOPS is Trillion Operations Per Second.
When Paul mentions MACs, that's multiply-accumulate operations. The core of neural network math — matrix math — tends to be multiply-and-accumulate operations. And the 2 Paul mentioned in the equation comes from the fact that there are two operations – you multiply and you accumulate.
Here’s how people can massage TOPS.
If it's MAC operations times frequency, the question is – what frequency? Is it a peak frequency? Okay, so that's your theoretical max TOPS, but what is the actual frequency when you're running a particular workload?
Even then, that also assumes that all of your compute units—all of your MACs—are fully utilized at all times. So we might ask, well, what about batch size? If I'm running inference with batch size one, I'm probably not going to exercise every MAC at all times.
So you could ask: what's the achievable TOPS at this particular frequency at a particular batch size?
And then, of course, there's precision – INT8 versus INT4.
Say your multiply-accumulate units are INT8, so each takes up eight bits of width. If you're running INT4, you could do two INT4 operations side by side in that particular unit. So you could basically double the TOPS number [double the number of operations].
That's why people just say it's a very massageable number because there are all these things you can do.
And even if you go further with TOPS per Watt, the question is, okay, first of all, how'd you get that TOPS? And then second, where'd your Watts come from? Is that your minimum power, or is that the actual power that you measured the TOPS at?
Paul Karazuba
And throw on top of that sparsity, compression, and pruning. If you're doing 30% sparsity, you're perhaps going to get an artificial 30% improvement in your TOPS per Watt number.
So anyone who is looking to evaluate the effectiveness of any NPU needs to look under the hood of where all of the assumptions of TOPS per watt came from.
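[Editor’s note: a rough sketch of how the same silicon can produce very different headline TOPS figures depending on the assumptions discussed above – clock frequency, MAC utilization, precision packing, and sparsity. The numbers and the simple multiplicative model are hypothetical, for illustration only.]

```python
# How a single NPU can yield very different "TOPS" figures
# depending on which assumptions go into the headline number.
# All numbers below are hypothetical.

mac_units = 4096
ops_per_mac = 2

def tops(freq_hz, utilization=1.0, precision_packing=1, sparsity_factor=1.0):
    """Effective TOPS under a given set of assumptions."""
    ops_per_sec = mac_units * ops_per_mac * freq_hz
    return ops_per_sec * utilization * precision_packing * sparsity_factor / 1e12

# Datasheet: peak clock, every MAC busy, INT4 packing (2 ops per INT8 unit),
# and 30% of operations skipped as "sparse"
marketing = tops(freq_hz=1.8e9, utilization=1.0, precision_packing=2, sparsity_factor=1 / 0.7)

# Workload: sustained clock, batch-size-1 inference leaves MACs idle, INT8, dense weights
realistic = tops(freq_hz=1.2e9, utilization=0.4, precision_packing=1, sparsity_factor=1.0)

print(f"Datasheet TOPS: {marketing:.0f}")  # ~42
print(f"Workload TOPS:  {realistic:.0f}")  # ~4
```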
Austin Lyons
Totally.
Jay Goldberg
Should we stop using it as a metric? And if so, what should we use instead?
Paul Karazuba
Austin, I'll let you answer that one first. I have my own answer, but I don't want to monopolize the conversation.
Austin Lyons
That's a great question. As a product person, the question is, “What matters for the end consumer?” TOPS may indirectly matter, but at the end of the day, if they're trying to run a small LLM – let's just say a ChatGPT [clone] locally or something – it's probably time-to-first-token latency and tokens-per-second throughput.
Now, it gets hard to compare, but at the end of the day, if you gave me an AI PC, I'd want to fire it up and just start asking questions and see how quickly it responds. You could call this a “vibes check”. And I know that that's not quantifiable, but I just want to see if it feels snappy enough.
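[Editor’s note: for readers who want to put numbers on the “vibes check,” here is a minimal sketch of measuring time-to-first-token and tokens-per-second around a local LLM. `generate_stream` is a hypothetical stand-in for whatever streaming API your runtime exposes.]

```python
import time

def measure_llm_responsiveness(generate_stream, prompt):
    """Measure time-to-first-token and tokens-per-second for a streaming generator.

    `generate_stream` is a hypothetical callable that yields tokens one at a time;
    substitute whatever streaming API your local runtime provides.
    """
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    for _token in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now - start  # latency until the first token appears
        token_count += 1

    total = time.perf_counter() - start
    decode_time = total - (first_token_time or 0.0)  # time spent after the first token
    tokens_per_second = token_count / decode_time if decode_time > 0 else float("inf")
    return first_token_time, tokens_per_second
```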
Paul Karazuba
Agreed, from an LLM perspective, time-to-first-token and tokens-per-second are going to be the key performance metrics.
For more traditional networks, I encourage the folks we work with not to rely on a single number but to instead ask suppliers like myself to produce performance estimates for all of the different networks they want to run. Don't rely on a single monolithic TOPS per Watt number; tell me the 10 networks you want to run and the conditions you're going to run them in. Then have me and my competitors give you those performance numbers so you see what the real world is actually going to look like, and you're not basing it on some artificial number that, for the reasons made here, probably isn't at all representative of what you're actually going to use.
Ben Bajarin
I think there are two points to this. No consumer is going to go run out and say, “Hey, I got a computer, and it runs X number of TOPS per watt.” What matters is that it's going to help your battery life. If you're going to run really dense applications at the edge, you're going to get another couple of hours out of your battery life.
Paul Karazuba
We all remember when computers were sold on 386, 486, Pentium, then the frequency, and then the amount of RAM. Very few people buy computers for that anymore, if you can even find it. We're looking at how long the battery is going to last. We're looking at “How fast can I run Stable Diffusion?” and “How quickly is my local LLM going to respond to my inquiries?” That's what's important to the consumer, not TOPS per Watt. It doesn't even need to be defined.
Ben Bajarin
Yeah, exactly. It's useful for [manufacturers] just to say, “We've proved efficiency with this metric.”
I want to go back to something that you guys have talked about: the max or theoretical wattage of NPUs. This was actually interesting to me because we've been benchmarking a bunch of these NPUs – Apple M series, Qualcomm's, Intel's. You see a range of power consumption. One might be 10 Watts max; one might be 8 Watts max. And so I'm kind of curious how that works. How much wattage can I peak this at? Is that a set number, or could I say, you know what, I want a 30-Watt max NPU? So I'm just curious, how variable is that number? Who sets that number? Are there theoretical limits?
Paul Karazuba
Well, it depends on the application for your device. If you are building something other than a data center device, you do not want to do any sort of liquid cooling, which puts the max power consumption of your chip at, let's say, 65 or 70 Watts before you're going to need some sort of active cooling. But as far as the Watt consumption of the AI portion, it really just comes down to the chipmaker and what they want to handle, or what their power budget might be in their system, based on what their customers or their users expect.
Ben Bajarin
So it can go higher, basically.
Paul Karazuba
It can. I mean, the silicon can handle it. It's a question of how you're going to package it, where you're going to put it in the system, and how it's going to be run. That's really the question of how hot it could get. The H100, the darling of the training industry for obvious and justified reasons, is a 700-Watt chip. Granted, it's in a data center, so they have the ability to cool it, but that's not stopping it from being super successful. That same chip stuck in a mobile phone will probably not be particularly successful.
Austin Lyons
I was just going to add – I would assume that the max power is a design constraint that they're designing around. So, you know, they're kind of choosing frequency and power for a given thermal characteristic that they care about.
Ben Bajarin
Okay. Well, I wanted to use that as a segue into data center.
But first, let's perhaps bust open a myth that all of these NPUs are proprietary, homegrown IP, which is kind of what everybody wants you to believe. You know, everybody out there, they won't say, “We built this on a standard microarchitecture.” They’ll say, “We did all of this. We invented our NPU.”
When you start looking at that, you're like, okay, well, where'd you get your tensor cores from? Where'd you get your matrix cores? Where'd you get your vector cores?
Maybe they acquired somebody back in the day – I think in AMD's case, some stuff came with Xilinx.
But let's dispel the myth that this is some brand new, evergreen creation of IP. These companies are using IP that came from somewhere—licensed or acquired.
Paul Karazuba
There are absolutely companies that have organically created their own NPU. They are very likely big companies. Designing an NPU is not a trivial task; it is incredibly difficult to design one that functions well.
But yes, an open secret is that not everything on chips has been designed in-house by the manufacturer that you see on the outside of that chip.
It would not be uncommon for parts or all of NPUs, for instance, to be licensed from external suppliers, wrapped into a larger SoC, and marketed as internally created.
NPUs, this is a weird word to use for any semiconductor, but NPUs are sexy. They're interesting. They're considered to be absolutely state-of-the-art. And to say that we didn't design that ourselves, at minimum, creates a little bit of a public relations “what are you doing”? And at maximum, says, “What exactly does this chip company do?”
There are absolutely NPUs that are licensed from other people that are relabeled, and that's fine. That's the way this industry works. And there is stuff that is created organically. I want to be very clear about that too. Not everyone is licensing them from someone else.
Austin Lyons
Two examples: When I was researching “where do NPUs come from and what is their history”, it seems that several companies started with DSPs, Digital Signal Processors – a domain-specific accelerator on a chip. Qualcomm’s Hexagon, for example – they basically said, “Hey, this DSP can do vector math, it has high parallelism, and it was designed for low latency – let's just add tensor support to it, and now it will be able to run AI workloads.” It seems that Qualcomm took this approach.
Intel took the same approach. They acquired a company, Movidius, and their SHAVE DSP. Intel took this DSP and added a bunch of matrix multiply and accumulate units. They probably massaged the memory layout and everything, and they call it an NPU. So these are just two examples of companies that started with something already.
I think Qualcomm says that they created theirs, and then Intel bought theirs, and then they both sort of morphed it, or iterated it, into an NPU.
Paul Karazuba
And I think that's a great way to do it.
Jay Goldberg
Has Intel said that explicitly? They said that “our NPU is from Movidius”? Or is that something you sort of pieced together?
Austin Lyons
If you look at Chips and Cheese, a really good website, they've done some work talking about it. I can't say off the top of my head if Intel explicitly said that or if it was pieced together [by Chips and Cheese].
Ben Bajarin
Well, it's because when they called it a VPU, they were pretty clear that it was Movidius IP. And now that's just become an NPU. That's just a renaming of it. So I think without saying it, again, the point I'm bringing up is nobody wants to really say where this came from.
I would imagine that Apple's was very similar. Their homegrown DSP evolved into bits of this. And now it’s the ANE [Apple Neural Engine].
There are other people I want to talk about next, but I don't know where they're getting it from. And that's totally fine. But my point was everybody wants you all to believe that this is like, “We came up with this thing.”
What intrigues me personally about this is that everybody's NPU is different. It's like a snowflake, and from a design standpoint, this shows a lot of creativity, but also each company's philosophy. Apple's going to approach this with their own philosophy, as will anybody doing these as an independent accelerator on a block. They're all going to be different, which makes it interesting to me, because that will be very telling of architectural design decisions we just don't normally see in a CPU or GPU – where you just throw more cores at it, right? Or scale frequency.
This is a, totally different thing where we actually get to see designers design, and that's what I find interesting to analyze. So that's my plug of why I think it's interesting.
Jay Goldberg
It's been really enjoyable for me to watch this because—let me try to put this diplomatically—Qualcomm should get more credit, because they took their DSP. I think they've actually said this publicly; I think there's a blog post about how they rethought their DSP for use as an NPU.
DSPs were once core to Qualcomm. A big part of building a modem is that DSP. It goes way back deep into Qualcomm's roots. And now they've sort of modernized it. That’s a really good story to tell.
Intel acquired Movidius in 2016, which is kind of old stuff. And I'm just wondering how many of these companies are just repurposing old stuff, like Movidius for Intel, and how many are designing something new – or at least making serious upgrades to older things, like Qualcomm did.
And is that the right approach? I mean, do we need to go back to first principles and rethink how we're doing NPU cores? I think that approach needs to be explored. What do you think, Paul?
Paul Karazuba
I have a biased answer because my company went back to day one and rethought what we believe an NPU should look like. There are a lot of other folks in the IP industry who – I say this somewhat sarcastically – just basically took a warmed-over GPU, CPU, or DSP and created an NPU out of it. If there were one correct approach to doing this, there would be one architecture in the market, and there's not. There are multiple architectures in the market. We're all doing our best to make sure that we handle as many different kinds of networks as we possibly can. I think the answer to your question, though, Jay, is really that time will tell.
I feel strongly that if you're going to build an NPU, make it as good as possible at processing neural networks and don't take any of the baggage of perhaps past usages of that core technology that then became an NPU.
We built ours from the ground up solely to process neural networks. That is what we at Expedera believe is the best approach. I believe we're right. Perhaps we'll be proven wrong, but I believe that's the right way to do it.
Ben Bajarin
Jay, we needed a bit more of a curmudgeon comment, by the way. I mean, your optimism is just hurting me at this point.
Jay Goldberg
Well, I don't want to name names, but for the last two years, you and I have been going to all these conferences, and we've talked to everybody, and we ask them, “Fancy-looking NPU there – tell us about the design!” And we're always met with this awkward silence. It's just the weirdest thing.
Ben Bajarin
Blank stares, blank stares.
Jay Goldberg
I mean, I almost wish they didn't know. I'm like, do they not know, or do they not want to tell us?
Ben Bajarin
It's, no, it was like the forbidden question. You're like, shoot.
Paul Karazuba
Well, let's be fair. Anyone standing in a trade show booth or at a technical conference has been media trained—they've been trained not to answer questions from folks like you specifically for that exact reason. So, if you've got an answer out of them, I would be quite impressed by your questioning skills.
Jay Goldberg
You're right. But, I mean, there are a lot of these analyst-specific events where you have a whole bunch of very nerdy people asking very technical questions. And so it has been glaring to me that it's not even a question of media training. It's just like silence. You know, “we'll get back to you”.
Paul Karazuba
It could just be a lack of knowledge of the subject. If you don't want a secret to come out, don't let it out. Don't tell anyone.
Austin Lyons
I'm going to throw out there that if we go back to first principles, in theory, designing it from scratch, you can make the right trade-offs that you need for neural networks. Now, the question is, when these companies took existing DSPs, what design trade-offs were made, and were those a limiting factor for their NPU at all? Were those decisions okay, and they didn't prevent the NPU from reaching its full potential? And that's what we don't know.
Jay Goldberg
Right, that's a great question. I'm asking these people because I want to hear their thoughts on this.
Ben Bajarin
Sure. Sure. But I think it goes back to everyone wanting everyone to believe this is the secret sauce. And a lot of times, it is an approach that they're taking that is their unique design. So, I get that. I would just love more understanding of what they're doing. So maybe in a few months, we should revisit this. After Computex, a lot of people are going to come out and give chip and block diagrams of what's doing what, and I think that'll be interesting to talk about – who's doing what. It's starting to come out, at least at the SoC layer. There's another part of this discussion that I want to get to – where I don't think anybody's going to tell us anything – that's at least vaguely interesting.
Jay Goldberg
You described it as secret sauce. My contention is that it's not secret sauce. It's three raccoons in a trench coat.
Ben Bajarin
There's our curmudgeon comment. Thank you, Jay. I was desperately needing this.
Three raccoons in a trench coat will be our code word when we're sitting in a room, and nobody gives us a straight answer. We'll say, “It's those three raccoons right there, man.”
Okay, so I was having a conversation with a hyperscaler who makes a custom ASIC for AI acceleration. And I basically said, give me some details. And they said no. So I said, if I were to think that this thing had some vector cores and some tensor cores, would I be in the right direction? And they said yes.
So theoretically, this big square block that's maybe two and a half inches by two and a half inches is a giant discrete NPU. I think that's interesting.
None of them are going to give us information, you know. Amazon's not going to tell us what the microarchitecture for Trainium and Inferentia is going to be. Microsoft's not going to tell us for Maia. Google's been pretty clear its tensor core is great, but they're probably not going to go much beyond that.
So is that even the right way to think about it, that these are giant NPUs?
But I ask the wattage question because, again, this is in a data center. Maybe it can scale higher, but they want these to function as the block purpose-built for AI workloads.
So that's essentially the train of logic that got me to where I'm at. I'm throwing out whether that's maybe the right way to think about Maia, Trainium, Inferentia, and Google's TPUs.
Austin Lyons
My initial reaction is: that’s a fine way to think about it – as a discrete NPU.
People tend to call those AI accelerators when they're in a server and NPUs when they're on the edge.
But yes, you are correct to say it's just compute built to support the correct data types—matrix, vector, and scalar math for neural network workloads. So isn't that an NPU?
Ben Bajarin
Theoretically, yeah. Paul? Jay?
Jay Goldberg
Well, can we have this debate real quick though?
Is an NPU a discrete chip or is it a block in an SoC?
Ben Bajarin
Oooooh. See, we would need to define this in our definition. This is a great question. I have not heard it asked before.
Paul Karazuba
If I'm going to freestyle the answer to Jay's question, an NPU is a block in an SoC, and an AI accelerator is a dedicated AI processor or a co-processor. A dedicated piece of silicon, let's say.
That's just my Funkmaster Flex freestyle.
Austin Lyons
Yeah, I guess the counterargument might be – well, is that chip, Maia, Inferentia - does it have a CPU on it too, like a little ARM CPU? And so if it's got a CPU and a bunch of MACs, is that really different than your Edge NPU SoC?
Ben Bajarin
I don't know. I mean, a GPU is a GPU, whether it's discrete or integrated. CPUs are generally just their own thing anyway because that's where this whole world started.
So I sort of just default to saying you could take it off or put it on. I could put a DSP on, I could put a DSP off. I could put memory on, I could put memory off. That's why I don't think anybody making a custom ASIC is going to call this an NPU.
My annoyance was that it was so ambiguous that no one had any idea what it was doing, that I was just like, we at least need to think about this somehow. And this custom-built thing, even if it's a co-processor that just handles AI workloads – it makes sense to think about it like we think about NPUs.
Jay Goldberg
My sense is increasingly, that NPU is the block inside the SoC.
Ben Bajarin
That's how people refer to it. Agreed. Agreed.
Yeah, nobody's going to call it that unless, in two weeks, Apple goes, “We're making hyperscaler chips, and by the way, we've turned ANE into its own engine, and it's an NPU.” They could screw everybody up, but that's a different conversation if and when we learn about Apple's data center efforts.
But I would tend to agree only because nobody's going to call their giant chip an NPU. But I did ask the wattage question in an attempt to perhaps make a distinction. If you told me that, theoretically, you can only throttle these NPUs to a certain amount of power, then I would say, cool. Then it probably is always going to be on. But if somebody could make an NPU and it throttles to 150 Watts and that's not theoretically impossible, then it scales. It scales to data center coprocessors.
Austin Lyons
I have a question for the group. To return to the topic of “Are everyone's NPUs different or similar?”
Do you think NPUs—let's talk edge or AI PCs—will be a point of differentiation, like the hardware itself?
Paul Karazuba
Let me make sure I understand your question, Austin.
Austin Lyons
So, I buy these laptops because they have the ANE or someone’s NPU in them, and they're always snappier. It's kind of like an Intel versus AMD—I'm buying it because of the NPU.
Paul Karazuba
I think consumers are, for the most part, more accustomed to buying based on what a device is going to do for them rather than a code name for something inside a device – whether it's an XYZ processor or whatever it may be. Marketing is skewed toward usage. It's skewed toward “how this device is going to make my life better, easier, quicker, faster”.
From the consumer point of view, I'm not sure I see people buying simply because a device contains a particular chip. A particular function that's different – a particular use case that is unique to a specific manufacturer or OEM – I could absolutely see that as a buying decision. But as far as simply containing a chip from someone, I don't know about that.
Austin Lyons
So, if I buy a Copilot Plus PC with the Qualcomm Snapdragon in it and it has more interesting or better AI functionality than a different OEM, would that be differentiation, even if the consumer doesn't know why?
Ben Bajarin
So, I'd say it this way. Let's just say, hypothetically, vendor B – because I don't want to name anybody – comes out with a Copilot Plus PC, and it gets horrible battery life because its NPU just wasn't desirable. Now, the consumer is not going to say, “Vendor B is really bad.” They're going to say, “Those things don't get 25 hours of battery life. This one does. And that's a pain point to me.” They're going to prefer the one that gets better battery life, right?
To Paul's point, that's the end result of the implementation that leads to the use case or the value proposition that you care about. So that's what they care about in the end. The secret sauce in the middle just gets you there. And if a vendor doesn't do a good job of implementing this, then that's definitely going to impact real-world performance, which will then skew people's buying decisions. That's how I would look at it.
Jay Goldberg
I think it's even worse than that. Consumers will not care about any of these features. It is not differentiated to them because there are no good consumer use cases for AI right now. At least none that consumers are aware of. Maybe some of these Microsoft features will take off. I don't want to knock them. But unless there's some surprise hit there, consumers aren't going to care. So they're not going to care about the AI features in the phone.
Conversely, if you're running an AI PC and, to your point, Ben, the performance is really bad because of the chip, the chip maker is not going to get blamed. It's going to be 50-50: either the OEM who made the laptop or Microsoft, because consumers know both of those. If consumers find the Recall feature stutters, people are going to blame Microsoft. If the battery life is terrible, they're going to blame HP or Dell.
Ben Bajarin
Yeah, I would agree with that.
Jay Goldberg
But the flip side of that is if we can actually get a really good consumer use for AI PCs and one of the chip vendors has a real advantage there, then that's their opportunity to build the Intel inside brand or for Intel to reinforce the Intel inside brand. And I think that Qualcomm's big opportunity here is if somewhere down the road, some great feature shows up and they have a big advantage, then they go spend a billion dollars on a marketing campaign. It'd be huge for them. But the problem is by the time we get that feature, there's a decent chance Intel and AMD will catch up.
Ben Bajarin
Right, right. I would agree that there'll be more parity than not. However, this one block is probably the one that, for at least the next couple of years, will see the greatest year-over-year performance increase as they throw more transistors at it, as they throw more matrix or tensor cores at it. I don't know what that's going to yield yet, to be honest with you. We could be building something that no one uses, and transistors are left on the table. Maybe. I don't know.
I'm just saying that's the one we always scrutinize—“I want to see IPC gains, gen-on-gen!” It's just really hard.
However, AI compute gains – we're going to see those at the edge. And now developers just need to absorb those gains in software, in the same way that when you throw more GPU compute at the cloud, all of a sudden the model gets bigger and they chew up every single one of those TOPS and FLOPS.
We need the same thing at the client edge: give them more compute – in this case, NPU compute – and let developers go wild. Let's see what happens. We don't know, but that one block is going to see drastic performance gains year over year because everybody's investing in it. And it is the one place you can get a whole lot more compute out of at the edge. So we'll see where that goes. But yes, we are still all for these wonderful use cases that, really, most normal consumers don't even care about at the moment.
Okay. Since we're almost at the end of our longest podcast ever, let's offer some parting thoughts. This was a rich discussion.
Last sort of thoughts, if you will, on the NPU as a whole. And let's just say specifically: can we make any kind of prediction at this point about where this goes, other than what I just said, which is that it's going to get more compute? Is there anything else we can say, like, watch for this as SoC vendors evolve their strategy?
Paul Karazuba
I'm going to make my prediction that you are going to see NPUs on 90-plus percent of chips – other than really small, discrete chips – within the next three to four years. You're going to see them absolutely everywhere, of varying sizes and varying capabilities, in almost every market. You're going to see them in refrigerators. You've seen them in smartphones, and you're going to see them in cars. They're going to be absolutely everywhere.
Ben Bajarin
Great prediction. All right.
Jay Goldberg
I'll go a step further on that. I think you're even gonna start seeing them in some pretty small chips too. Maybe not full-blown crazy ones, but like decent-sized ones in microcontrollers. I think that's coming sooner rather than later.
Austin Lyons
My prediction, and for the record, if I'm wrong, this is Austin from Chipstrat. We can check on this in a year.
I think multimodal LLMs plus NPUs will create a very interesting proliferation of awesome use cases for hands-free edge devices. Think smart glasses – what we hoped Google Glass would be.
Because now it can understand visual input, it can understand your natural language way better than Siri ever could.
By the way, NPUs bring the cost of inference down to zero for a developer. They don't have to rent or buy Nvidia GPUs. They've just got it right there.
So maybe we'll see tons of little, tiny consumer-edge apps being built over the next couple of years.
And maybe a year from now, I'll wear a pair of those for our follow-up.
Ben Bajarin
I hope so, too, dude. I want my Meta glasses to do all that stuff already, and I'm annoyed they don't because, theoretically, they should. Maybe just not on the device but to the cloud.
All right, so my wild one will be, and this might rub the feathers of some people who I know are listening to this: NVIDIA will make an NPU and either put it on a board [AI accelerator ASIC] or integrate that onto perhaps a client SoC.
A company that has no NPU at the moment and is the AI King will make one in the not-too-distant future.
So there we go. Well, gents – Austin, Paul, thanks for joining.
Austin, everyone can find you on your Chipstrat Substack. Again, I said this before – the best Twitter handle – thee Austin Lyons – @theaustinlyons
And Paul, you mentioned your company Expedera.
Paul Karazuba
You can find my company at Expedera.com. I'm also on LinkedIn, and that's the best way to get in touch with us or see what we're doing.
Jay Goldberg
Thanks, everybody. Tell your friends.
Ben Bajarin
Awesome. Thanks for listening, everybody. I appreciate your comments, and until next time, cowabunga. I got it in twice.