Chipstrat Chat with Groq's Mark Heaps
API business and "ramp to hardware" model, differentiation, next-gen chip, and more
I recently had the opportunity to chat with Mark Heaps from Groq. We discussed Groq's API business and "ramp to hardware" model, differentiation from other AI chip startups, Groq's next-gen chip, and more. You can catch the conversation on YouTube, listen on Spotify, or read the lightly edited transcript below.
Austin Lyons: Hello, listeners. We're here with Mark Heaps, Chief Tech Evangelist and VP of Brand and Creative at Groq. Listeners will know, but that's Groq with a Q — the ultra-low latency AI chip startup. Mark, thanks for taking the time to do this interview.
So, take me back to the beginning of Groq. Back in 2016, Jonathan launched Groq as a fabless design company. At the same time, that same year, Google publicly announced that they had a homebrewed AI accelerator, the TPU, that they were using for their own training and inference workloads. Of course, as listeners know, Jonathan was a founding member and creator of the TPU.
So yeah, I just wanted to go back to those early days. As Jonathan left Google and started Groq, what was his original vision?
Mark Heaps: Yeah, they had been working on the TPU for a while before Groq created the TSP, which later got rebranded as the LPU, Language Processing Unit.
You know, Jonathan had a couple of key focus points. First, he learned a lot from the team working on the TPU. Although it advanced the level of compute to what Google needed beyond the existing incumbent processors, he realized it was extremely hard to program for; the software challenge was quite steep.
So they had this vision of, let's go solve a couple of challenges for the market. But if we're going to build a chip to meet that need, let's actually do it the opposite of the way the chip and semiconductor industry does it. Let's start with software first.
So they actually went through two or three compiler teams when they were forming Groq, but they started with the compiler. And then when they finally got that running the way they wanted, they started working on the architecture of the silicon. That's what allowed them to create a truly deterministic chip.
Interestingly, at that time, when we first got it back, it was the world's first single-core PetaOp processor. So it was quite the achievement, and that's now the foundation of the entire architecture, not only of that first chip; it's the same compiler that we're building on for our next-generation LPU.
AL: So the original vision was inference and it was, hey, let's get the compiler right first and then let the silicon be designed around that.
Who were the original customers that Jonathan had in mind for this chip?
MH: Well, he had pretty good logic when starting the business. At that time, they'd seen about a decade of everybody in the AI/ML world focused on training models. So you had thousands, tens of thousands of models being created back then, pretty small in size. But the logic was, hey, if these models keep getting better in training and we see those benchmarks, people will have to deploy these at some point, right? And they're going to deploy them to production.
So, you train a model once, but once you run it in inference, you're really using it to provide an ROI. The logic was, let's focus on inference right out of the gate from day zero. And that'll be a pretty healthy market for us if we can design a chip that provides real value and performs better than the existing incumbents available to the market at large. They didn't know when that would happen. But the vision was to focus on inference from day one.
And I will say, even though we can train on our LPUs, and we actually have one customer that's trained a model on them, that's never been our focus. We've never actually developed tools or software for that. We've always developed everything for inference.
AL: That makes sense. Inference is where you reap the benefits of the trained model. So, as you follow it forward, everyone will be inferencing.
Now, as a reminder for everyone, like that was before GPT, before GPT-2, right?
Was it clear to Jonathan that these AI models would provide value anytime soon? Like that consumers would want to interact with them as they do today?
MH: Yeah, there's a running joke internally that Jonathan is a time traveler.
So I always tell him when he's thinking about things, “Hey, I need your feet on today and your eyes on tomorrow.” But he absolutely had faith from the day I met him that there would be value for society and humans at large when this AI technology becomes ubiquitous. So there was never a doubt—it was a matter of when.
We didn't know when there would be this nexus moment when the market would suddenly start deploying and realizing the value of AI technologies like ChatGPT and LLMs, but they just always believed in it.
We worked with several research groups and national labs, and we had great projects going on. But we were lucky that OpenAI and ChatGPT just exploded and had this massive viral moment for a year.
If people go back and watch our videos, they can see we were running GPT, YOLO, and all these other models long before then. But the awareness that happened from OpenAI was great for us.
But they had a private, closed model. There was this gap between two points of the trend: the education that was coming from OpenAI's ChatGPT, and the moment the first generation of Llama got leaked. We got Llama up and running in a matter of, like, two or three business days. In fact, the only reason it took us even that long with our compiler is that we had to strip out all of the GPU-specific information from Llama 1.
But once we had it up and running, there was a really powerful moment. Nobody knew how fast ChatGPT was running for OpenAI. The estimates at the time were 20 to 40 tokens per second. And within a very short period of time, we had Llama 1, a relatively equivalent benchmark model, running at 100 tokens per second.
And suddenly, we realized that we could do this better than what we were seeing in this new market category for inference. Here was our real proof point.
It's just exploded since then.
AL: There's a lot to unpack there!
So before Llama, before even ChatGPT got into the public zeitgeist and everyone recognized its benefits – there must have been a period where you had these chips, and you believed in inference, and you were probably trying to start to sell your hardware. But were people on the other end—you mentioned the national labs—were people on the other end not believers yet? Because customers weren't using something [like ChatGPT] daily. What was that period like?
MH: Yeah, that was really hard times. Jonathan's pretty transparent about this.
The company almost died a couple of times. We had to get investment. We had to be able to stay alive.
Salim Ismail, the CEO and founder of OpenEXO, who I think advises a number of boards like OpenAI and others, has this great book where he talks about how humans, and the market in general, are allergic to anything that challenges a dominant incumbent. And we found that to be true. I'll tell you, there was some real angst.
I remember being in an interview with a bunch of tech analysts and one pretty established analyst said to me, “I think you guys are lying”. He didn't quite use those words. And I said, “how so?” And his exact words to me were, “if this could be done, Nvidia would have already done it.” And I said, “Well, clearly you haven't read their papers where they talk about how they lose real-time performance and low latency, every time they connect about eight GPUs.” And that wasn't just them, that's true of any GPU architecture.
We've radically innovated around that, but there are trade-offs. We're not going to do training, we're not going to be optimized for a really high batch size, and we're not going to be good for a small single researcher system. If you're trying to put a chip on a card in a desktop box and want to run your research model, the GPU outperforms us at that point.
But from day one, Jonathan's background was in these large distributed systems. So, we always stuck with that vision of how and what we were going to serve in the market.
But I will tell you, the early days were hard.
We had people who came in and led our sales org who really believed we were going to be a PCIe card business and tried to take us that route.
We had folks who came in and said, "We have to really lean into the research community because they're always on the cutting edge." But we found they didn't align with the enterprise. So, you know, there was a lot of growth and a lot of learning there.
But, we had exciting projects like drug discovery during COVID with Argonne National Labs and working with Oak Ridge on the Tokamak Nuclear Fusion Reactor project. We still got to see some pretty cool stuff during that time.
AL: As with any startup, you're trying to find your way and sort of find product market fit. Of course, there are such huge investments upfront for semiconductor startups that it's not so easy. It's not like a SaaS company, you know, just write some new code and pivot. So – stressful times, you probably ran low on cash there at times.
You got some extra funding, but then, fast-forward: ChatGPT raised everyone's awareness of how valuable AI and LLMs could be. Then there was a period between ChatGPT getting in front of everyone, probably November ‘22, and Llama coming out – was that like January or February of the next year?
MH: Yeah, it was really, really early. We had a press release about it, and we just passed the one-year anniversary. So that was a really, really exciting time, because that's when we leaned into open source.
At that time, Meta hadn't said they would focus on open source. Now, clearly, they're one of the leaders today in providing models for the open source community, which has sort of balanced out this idea of private model development. That's been really exciting for us, because now we get models from them, from the Mistrals, and from a number of others. Even Google has Gemma, and OpenAI offers Whisper. That's been really, really exciting, because now, in our cloud solution, we provide access to the world.
It started with demos. It started, in those early days of Llama, by just standing it up at trade shows. And then we started getting instances of customers saying, "Hey, can I get access to this remotely?" And so we would provision something for them. Then we went to Jonathan and had a long conversation about what would benefit the company the most. Ultimately, the agreement was that because we had done lots of demos, we should just put this on the web and make it available to everybody for free.
And that was a chat instance, right? It almost mimicked the early days of ChatGPT, knowing that we don't make the model, the tools, or anything else; we're just hosting it. But then we acquired Definitive.io and their org — great group of guys. They're masters at building the infrastructure for cloud services. They picked up where we had begun with building a cloud instance and making that available to the world. This is crazy, but since February 27th of this year, we're at half a million users who have signed up and are now getting API keys so they can build their own apps.
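For readers curious what those API keys unlock in practice, here is a minimal sketch of calling a hosted open-weights model on GroqCloud from Python. It assumes the Groq Python SDK is installed and uses an illustrative model id; treat the model name and parameters as placeholders rather than a recommendation.

```python
# pip install groq
# Minimal sketch: call a hosted open-weights model on GroqCloud with an API key.
# The model id below is illustrative; check the console for currently hosted models.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what is an LPU?"},
    ],
    temperature=0.2,
)

print(completion.choices[0].message.content)
```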
AL: I saw that stat on the Groq Speed Read email that came out yesterday. 497,000. Wow, that's half a million!
MH: Yeah. And it's funny too, because even going back to January, February, we had the free version of chat up and people could try the model out. And I remember sitting with Jonathan saying, wow, look, we're almost at a thousand people signed up; I wonder when we'll hit it. And we were counting the tokens being generated.
Someone challenged us about this strategy recently on Twitter, or X. They said, "How can we trust a company that gives away five billion tokens a day for free? What's the logic of that when they're still onboarding paid customers?"
And actually, it was a pretty logical business decision. We're trying to accelerate everything, and we're trying to be efficient. A traditional company would have built a product org of 10 to 15 product managers, and they would have gone out and run tests and met customers and brought on developers, and that would take two to three years and you'd spend millions of dollars. But because we own our entire stack, we're just paying ourselves for the tokens that we give away. And now we've got half a million users that we're learning from with our systems.
One of our very proud moments is reliability. We have a pretty big logo right now that we'll be announcing later this year, and their system has had one hundred percent uptime. That's due to the changes we've made based on what we've learned from the user base.
AL: That's a pretty awesome approach: go get a ton of users, create a lot of value, showcase what you can do, and learn from customers quickly. And it sounds like it was done in a lean manner.
MH: A very lean manner. It's a fraction of what advisors estimated it would have cost us to do that level of product research. We have a product org today, technically by title, of two people. Groq is only just now about to hit 250 employees. So everything we've done, we've done at a fraction of the scale of other startups.
AL: I'm going to go back a little bit to that period right after the ChatGPT launch, when you didn't have your hands on Llama yet. What was that like?
I'm assuming it was exciting seeing everyone get ChatGPT, even though you couldn't get your hands on the model yet. But you probably had an intuition that you could run this a lot faster, that there was a better experience to be had.
And frankly, the first hour of using ChatGPT was mind-blowing. But then I started integrating it into my workflows, and as soon as I started using it all the time, I was like, "This is too slow!" This is amazing, but it's so slow.
MH: We've heard that from a number of people. It was a weird time, you know, because it wasn't just ChatGPT. At that time we were getting previews of all the diffusion models. Midjourney. Stability AI. There were all these types of models. You had people playing with Kinectomics models, RNNs, and GNNs. The diffusion model and the transformer-based model hadn't quite taken hold yet. So we were experimenting with all of these things. And actually, we were doing so much experimentation that we built a tool to scrape data so we could research which models were being downloaded and built on the most from Hugging Face, GitHub, and these other places.
Within a very short period, we onboarded over a thousand models into our compiler, meaning people could just build right out of the gate. But as we matured the compiler, we started watching the market move more towards these LLMs. So when we got our first version of GPT running (Peter Lillian was our engineer on that with me), we did a demo video on YouTube, and people were like, "This can't be real."
So we started just letting researchers play with it. And they came asking us, why aren't you guys huge? Why aren't you successful? This doesn't make sense! You have this great thing. And, you know, it really was that allergy to challenging the incumbent. And I think, obviously, we weren't as mature. We didn't have as many tools. We didn't have as many resources. But over the next year, OpenAI really validated what this technology was going to do. And all we had to do was step in and go, "Hey, if you like that, check this out."
One of the first videos we did after Llama was a side-by-side screen recording of a single prompt on ChatGPT and then a single prompt in Llama. And you could see, visibly, that we were running five or six times faster than them.
AL: Totally. I remember seeing that on X — a visual demonstration of how fast it was.
MH: This will show you how little people understood at the time, because you know what the reaction to that video was once they saw it? They were like, "Wow, that's crazy fast. But why would you build this? If you're making something that brings the results up faster than I can read them, why does this matter?"
If you're a coder or a programmer, there's an obvious path here. But Jonathan and others who were forward-looking in this space said, "No, no, no, no, no – you have to understand. Latency is what it's about." Because we're going to get more advanced applications – today we're talking about agentic, multimodal, you know, all these different areas. He knew that as applications got more complicated, inference was going to create a huge tax on the user experience. So he said, "Look, we just have to keep creating more speed and lower latency," so there's margin for a fluid user experience at the endpoints of the application. He was just so far ahead of everybody in that thinking.
AL: We're getting there now with o1, where you can see that the more time it spends thinking, the better the results are. Significantly better! But of course that's going to take more test-time compute. That's an example of where low latency matters, because if it actually took 20 steps under the hood, man, who cares that each step was so fast you couldn't read it? Especially because they're hiding the intermediate steps from us, all you care about now is how long it took to do those 20 steps.
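To make that point concrete, here is a quick back-of-the-envelope sketch. The step count comes from the example above; the tokens per step and throughput figures are made up purely for illustration.

```python
# Back-of-the-envelope (illustrative numbers): how per-step speed compounds
# when a model takes many hidden reasoning steps before answering.
hidden_steps = 20         # from the example above
tokens_per_step = 250     # assumed
for tokens_per_second in (30, 100, 500):  # assumed throughput levels
    total_seconds = hidden_steps * tokens_per_step / tokens_per_second
    print(f"{tokens_per_second:>4} tok/s -> {total_seconds:.0f} s before the user sees anything")
```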
MH: That's right. Actually, a funny story about that. When o1 came out, when they officially released it, I was having a meeting with our intern, a young gentleman named Ben, who had yet to start at Stanford. He's actually in his first couple of weeks there right now. We were having a conversation about o1, and I said, explain to me the chain of thought, the reasoning here, and the prompt engineering. And in the middle of this conversation, he says, I think I can recreate that. And I said, really? And he goes, yeah, I'm gonna go play with something. And he came back two hours later with a working version of a Llama-based, o1-style replica that he calls g1, which is now available in our open source on GitHub.
But it actually shows you all the steps and the results in the backend, which obviously o1 doesn't do. And it's really fun to play with. It's really powerful. I've been giving it multiple reasoning and logic problems that are classic stumpers for LLMs, and it nails them every time, and it's still a really fast, fluid experience.
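The g1 project Mark mentions lives on GitHub; the snippet below is not taken from it. It's a rough, hypothetical sketch of the general idea, asking a hosted model for one visible reasoning step at a time before a final answer. The model id, prompts, and helper functions are invented for illustration.

```python
# Hypothetical sketch of an o1-style visible reasoning loop (not the actual g1 code):
# ask the model for one reasoning step at a time, show each step, then a final answer.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
MODEL = "llama-3.1-70b-versatile"  # placeholder model id

def ask(messages):
    resp = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.3)
    return resp.choices[0].message.content

def reason(question, max_steps=5):
    messages = [
        {"role": "system", "content": "Solve the problem one short reasoning step per reply. Reply DONE when you are ready to give the final answer."},
        {"role": "user", "content": question},
    ]
    for i in range(max_steps):
        step = ask(messages)
        print(f"Step {i + 1}: {step}")  # expose intermediate reasoning, unlike o1
        messages.append({"role": "assistant", "content": step})
        if "DONE" in step:
            break
        messages.append({"role": "user", "content": "Continue with the next step."})
    messages.append({"role": "user", "content": "Now give only the final answer."})
    return ask(messages)

print(reason("A bat and a ball cost $1.10 total; the bat costs $1 more than the ball. What does the ball cost?"))
```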
AL: I'm glad you brought that up. I was going to ask you about this. o1 is another example of a gap between what OpenAI is doing privately and what Meta or others offer publicly. So Groq had to come into this space, and your intern said, "Hey, I can build that." Is Meta going to build an o1 equivalent you guys can benefit from? Or do you think now you'll have to take on some of the model development?
MH: I think a little of A, a little of B. The Llama 3.2 announcement last week was really exciting because they added vision and multimodal capabilities to the models. And we should expect all of the open source model providers to follow suit. I imagine Mistral is going to make some adaptations to their MoE model. So we will see that development and tooling embedded more deeply into the models. But then there's our ability to advance that ourselves, like we've done, for example, with our own version of Whisper.
We've also made a few other modifications to Llama, including a smaller context length and some function calling inside of it, just to further enable the developers. So you're going to see a little bit of both A and B.
We're obviously following whatever the open source model creators add. But then we look at our customers and our users and say, what can we do to make this better for you and your needs? A good example of this is speech-to-text. We have a couple of versions of that as a service on Groq Cloud. But one of the things we realized is that a majority of our customers didn't need languages beyond English. So we did our own version of that model that only understands English, and that ended up speeding it up by like 2-3X. So now, when you have a conversational AI agent and you know the majority of your user base speaks English, you're getting another adaptation that works to your advantage. So you can expect us to do things like that.
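As a rough illustration of what a hosted speech-to-text service looks like from the developer's side, here is a hedged sketch using the Groq Python SDK's OpenAI-style transcription call. The model id and filename are assumptions; substitute whichever Whisper variant the console currently lists.

```python
# Sketch: transcribe an English audio file via a hosted speech-to-text endpoint.
# The model id is an assumption; check the console for the available Whisper variants.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

with open("meeting.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=("meeting.wav", audio_file.read()),
        model="distil-whisper-large-v3-en",  # assumed English-only variant
    )

print(transcription.text)
```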
AL: That makes sense. Follow the open weights models, but tweak or sort of bridge the gap as needed for your developers.
Okay. Now to the business model. You've got hardware, you've got the API and Groq Cloud. I'm still fuzzy – is the goal to just grow the API business and that's the main stream of revenue, or is it to sell hardware?
I think you guys had an announcement fairly recently, with Aramco Digital. It wasn't clear, are you selling hardware to them? Clear me up!
MH: Yeah, as I said, we don't do anything very traditionally. One of our core internal values you'll see everywhere in the building is “defy gravity”. And that's exactly what that means. So the way the business model works is we have deployment optionality and we call it a ramp to hardware. So at the bottom of the ramp, the long tail, you've got the developer community and they want API access, right? They're not gonna buy multimillion dollar racks of hardware. The landscape for AI changes rapidly. Why would they commit to any one thing? It's sort of like leasing almost. And so they just wanna be able to provision and buy per token, switch out models as needed. And that gives them that rapid iteration. So why not serve that via our cloud?
Then we start getting some customers that say, hey, we want a cloud-like instance, but for compliance reasons and other things, we need to have a dedicated, provisioned system for us, but you guys manage it, right? And we go, okay, cool, we can do that. And that's the next stage moving up the ramp. Then you get people that say, okay, well, we want more than that. We want something that's a colo, where we've got some equipment in our lab, some in a data center, and you guys help us facilitate and manage that. And it goes up from there. Then you get to the announcement that just happened with Aramco Digital in the Kingdom of Saudi Arabia, where we're actually selling and shipping them hundreds and hundreds and hundreds of racks. They're going to stand those up in their new data center in the region.
AL: Gotcha. Okay. That clarifies it. So the developers can get value out of it right away. They don't have to buy hardware. Hey, you're a startup. Don't mess around with that hardware. Just get API access. And then it goes from there.
MH: Exactly. And on the hyperscaler model you were discussing earlier, one of the things we can offer customers like Aramco Digital, as they bring on this massive data center (they've committed to building the world's largest AI inference-focused data center), is that we've got a massive developer community that would love to have that instance and still use Groq in that part of the world.
And so now they're going to have access to our user base and can say, "Hey, why not access your tokens over there?" So it's a great partnership. You can buy hardware from us, but we have a massive pipeline of customers that are trying to get onboarded. And so there's a great opportunity all around.
AL: Nice, that makes sense. One thing that sticks out to me is that you're sort of a fully integrated company, making the software, the compiler, and the hardware.
And you have expanded to systems and data centers — obviously you have all of your own racks and you're powering Groq Cloud. And now you've also expanded to help with colo and all of this. And you mentioned that the acquisition of Definitive Intelligence helped on that front.
Do you see this truly full stack, broad competency — from silicon through software all the way out to data centers — do you see that as a competitive advantage?
MH: Massively. Yeah. So I'll give you an example. In this space, there are a lot of theories right now about loss-leader methodologies in the approach to market. There's a lot of belief by analysts that a lot of the folks providing inference are losing money on token services. And Andrew Ng actually did a really great article talking about how they've seen something like an 87 to 90% reduction in token price over the last 12 months, right? It's a massive reduction. So there are people who are truly racing to the bottom to compete on price to win customers.
But what you have to remember with a lot of those folks is that they're brokering their compute from another cloud service, or they're waiting in line to buy GPUs so they can stand up their stack, but they still have to buy those GPUs. The reason our economic model is profitable, and why our tokens are profitable, is that it's completely us. We own the whole stack. We have our own data centers that we're standing up, putting footprint in the ground, and we don't have any middlemen we're having to pay for any of these services other than electricity. Electricity is really the big thing that everybody fights for.
So for us, we know what it costs to run our equipment. I often equate this to what Tesla did with car sales. Rather than going with a dealership model, where you've got to sell your units to all these independent dealerships around the country and then have them mark them up so they can sell the vehicles, they said, no, we'll just sell direct. And so this is the same thing. When we didn't have a lot of folks jumping on Groq early, it just fueled our fire to say, okay, then we'll just do this direct model ourselves. And that's why today we can compete in this space and not lose money.
AL: Yeah, yeah, that makes a lot of sense. You're fully vertically integrated. There are no middleman margins. There's no GPU vendor's huge 80% margin on top of the hardware, and no one else you're paying to host or anything.
MH: Yeah. And when you take that strategy, look at where we're standing up these new deals with customers. We've got this massive data center being built in Saudi Arabia; they don't have a problem getting electricity. Look at where we're currently in negotiations, and we have a press release about this, with Norway; they don't have a problem getting electricity. And so we're getting a global footprint quickly, because these nations realize we have something very competitive and diverse, and very well adopted by a developer community.
The main thing is, where do you get electricity? We just published a paper that talks about how we're also more efficient with power usage at scale. This is also very attractive to these governments because they obviously have initiatives to try to reduce their electrical use and become more sustainable. So this is a hot topic for data center architects as well.
So we're becoming very attractive in new ways beyond just speed.
AL: Interesting. So this is fairly related – Groq was the first AI accelerator company to say, "hey, we could power our own API and sell access to it". And now there are others who have followed suit, AI chip startups that are saying, "hey, we've got an API like that now too".
If you just look at it without the conversation we just had, there feels like very little differentiation. Everyone is using the same open weights models. So it's these token wars, tokens-per-second wars, and maybe a race to the bottom on price. But what I hear you saying is that in the fullness of the business there are actually differentiation opportunities. For example, I don't know if they're all standing up their own data centers and running it full stack like you are.
How do you guys think about your differentiation there?
MH: Well, I'll tell you, in the early days we were talking about how to get people to adopt Groq, and we tried throwing the entire kitchen sink and everything in it at potential customers. "Did you know that we're fast with inference? Also, did you know that our time to production was faster than anybody else because of the compiler? Did you know we had innovation and patent portfolios relating to chip-to-chip connections and power usage?" All kinds of stuff. And people just went, this is too much. So we focused our messaging, and we really focused on speed.
And so now you've got some folks that are coming up competing on speed. Rivals that are following suit, as you said, while we defined this market category. But we don't see them having the other parts of the innovation. So when you say, how do you scale chip to chip? We have a proprietary connection that we developed that allows us to scale to thousands of chips and have complete linear performance through that. Like we don't lose anything. Not everybody can say that. So if you say, hey, we want to buy this hardware from you. Well, the difference is, do you want to be able to serve billions of users or do you want to serve hundreds of thousands of users?
And so once you start getting into the scale conversation, it changes dramatically. The other part that I think is really key, and this is why we recently raised another, I think it was $640 million, in a Series D. Why would investors do that if these other ASIC startups were coming into our space and catching up on benchmarks and things like that? Well, the other part, which leans into efficiency and sustainability, is this: we did what we have done with a 10-year-old, four-generations-back lithography on the silicon. It's a 14 nanometer chip that we had made in Malta, New York with GlobalFoundries. It's packaged up in Canada, and then we build the systems in the US. The folks catching up to us are on bleeding-edge five and four nanometer silicon. They are at the cutting edge, and they're just now getting to us.
So we're about to have our next-generation LPU come out next year. It'll be in testing with us probably Q1, Q2 next year, and we're doing that with Samsung. So now you're going to see a four-generation leap in silicon development in a single product lifecycle, which means that if we did no improvements at all to the design and simply stamped it smaller, our engineers tell us we'd get a minimum of a 3X improvement across the board.
Obviously we've learned a lot from having, you know, half a million developers on it. So we've made some changes still with the same compiler, but we're going to get even more performance out of it. So we're not really stressed about it. And neither were the investors because we've got so much room for growth.
AL: Fascinating. When people are comparing Groq with others, it's not just about tokens per second on a small instance. The customers you're selling to are asking: at scale, how does this whole system perform? And like you talked about, you guys have chip-to-chip communication and low latency all around. So that's really interesting: not only do you have the ability to stand up data centers, but your customers really care about this big, broad performance, not just a number on a website.
And then, you were talking about the fact that you're four generations behind in lithography, 14 nanometers.
MH: I have one right here. If you've never seen it, this one's actually metalized and stripped. So I have one right there.
AL: Nice. Look at that. That's awesome. Why'd they give one to you? What did you have to do to get one of those?
MH: It helps to be the Chief Tech Evangelist and the VP of Brand. I have to organize all of the hero product shots.
AL: There you go. Hey, you're probably in front of the most potential customers.
So you guys are about to punch the gas as you skip ahead to like four nanometers with Samsung.
Last question on that topic. I know that some people have asked questions about memory bandwidth and whether your systems scale, especially for these really, really large models. In 2020, OpenAI published a paper that talked about scaling laws, both for compute and for model size. So it seems that models will continue to get bigger.
With your new chip, are you guys tackling anything on the model size support front?
MH: This is why it was so imperative that we figured out a chip-to-chip connection versus doing traditional Ethernet or other methods. We knew that although SRAM was really, really fast for model switching, we needed to solve the scalability situation. And that's what drove us to do Groq's proprietary C2C. So that was a key point. Now with V1, I think we can connect, and I always get these numbers wrong, but I think it's around a thousand to eleven hundred chips before we see any kind of fall-off. The V2 will actually get us into the hundred-thousand range. So we've got some new innovation coming out on that, and it will still be SRAM-based. And so, yeah, that innovation will just carry forward.
But it's going to require people to not think about it the way the traditional GPU model with HBM is approached. We've figured out our own solutions, because it's a single core. Remember, the way the data flows on our silicon is through these east-to-west super lanes. It's all completely linear. So we have no schedulers or any of those sorts of management systems in the software that you need when you have a multi-core GPU. Because of that, it's sort of like living in a city where there are no stoplights and no stop signs, right? The data just flows as if you've got a perfect grid in the city.
When you think about how memory usage has been managed by multi-core GPUs in the past, you actually have to have a lot of stoplights so that the data flow traffic can be managed effectively. We don't have any of that. So because of that, the data just passes through these memory units as the data is flowing over the chip. It doesn't have to be sent off chip and be scheduled.
AL: Nice. Good analogy.
I expect that you'll see people point out, "hey, Groq uses 10,000 chips for their ultra-fast inference, and this other company only needs 10 chips, or a hundred, or whatever." It sounds like you're saying that maybe that doesn't matter?
MH: It doesn't. This was always Jonathan's vision. The semiconductor industry, as it was explained to me when I came into this field, is that everybody looks at the chip. How do we optimize the chip?
Then when you need to serve more users, you just get more chips. And Jonathan—I think this was part of the lesson in architecting large distributed systems around the TPU—realized that if you get a prompt or a search query coming in for Google, you might take that one query but actually spread it out over 10,000 chips. And that allows you to optimize in a number of ways. Same sort of idea here. Don't think of it as, this one prompt is going to this one chip. If we can create this massive amount of capacity and the pipe moves really, really fast in linear fashion, then you can distribute that over all of those chips. So this is another one of the reasons that being deterministic is so important.
So at the very core of the software stack for Groq, and this goes back a little more to the work we were doing with the research groups, not so much LLMs and API services: when you take, let's say, 128 LPUs and you put them together, the software can see that as one LPU. Now that's a massive chip at that point. So if you say, well, we gained more customers, then instead of us building a completely different system at a data center and using routing to manage customers, we can just keep adding these little LPUs until we serve the volume that we need. And it just keeps distributing as we need to scale.
Now think about that from a business model perspective: an LPU is not a massive cost. So if one chip in the system dies, and we do have redundancies in the system, it's not very expensive to go replace that single LPU. If you're basing your entire architecture on a massive wafer, or you've got these sorts of GPU-based systems, well, if that big wafer or something similar dies, that's a very expensive unit to replace. And because you need to schedule across the GPUs the way that you do, when you lose one GPU, there's a lot of software to manage that contingency. We don't have any of those problems. So we can just keep scaling up as demand increases.
But we know the demand right now: nobody in the market is serving all of the capacity that the market wants. Azure just announced they're investing way more, OpenAI is buying chips left and right, and people are famously taking Jensen to dinner to beg for more chips, right? So we know to just build massive systems. The demand is already there; the market has shown that. So yeah, we don't really worry about competing at the chip level. It's at the system level.
AL: Totally. So it's just about meeting demand, and it sounds like you guys have built a system that can scale without much degradation. Maybe they're cheap enough—I know you probably can't talk about cost, but there's no HBM.
So maybe it isn't also that expensive to scale?
MH: It's not. There was a really interesting rumor going around Twitter for a while; I love all of the inaccurate information that gets thrown around out there. People were telling us the cost for an LPU is $20,000, and that you need a minimum of hundreds of LPUs to be able to run a single Llama model. There was all this kind of hyperbole out there about it.
And one, the price was wrong. What they found was a page from a partner we had established something with in the past, trying to sell single cards. That was an experiment in the market. Ultimately, our cost per LPU was far from that. And so for us, we can stand up loads of these and create these massive distributed systems as a solution.
Again, nobody knows quite where the AI world is going. So there is hesitancy from a lot of customers; they say, we don't want to buy our own hardware yet, but we do need to be in the market playing today. So why not use the API? And then from there, if later on you want to have your own instance, you move up the ramp. And from there you say, nope, we want to have our own hardware on-prem. Great, and we can provide that for you. But in the meantime, why not be competing in the market right now?
AL: All right, Mark, I know I've kept you a long time. We covered a ton of ground. This was super interesting, super enlightening. My very last question for you. I know you play guitar. We can see them in the background. When are we going to get like a guitar tab generation app that runs on Groq? Is there one?
MH: I don't have one that runs on Groq. I know there is one out there in the market; someone was showing me one the other day. We did do a really fun app, a totally custom model that we developed maybe two years ago, and we called it Groq Jams. There's a video on our YouTube channel. It allowed you to put in a leading indicator and a lagging indicator, so you could choose Metallica and Dolly Parton. And literally in fractions of a second, it would write a one-minute song with five tracks: guitar, drums, bass, piano, et cetera. It would write it, again, in fractions of a second, all as MIDI tracks. And at Supercomputing we actually had people coming by the booth, playing my Strat over here, and jamming along to Groq Jams in real time. You could add descriptions, and you could change some of the temperature settings for each instrument. And because we're deterministic, it could write all of those different instrument tracks in sync with each other. That was another way for us to show how determinism actually works on the chip.
AL: Love it. I'll go look for the YouTube video.
MH: It's really fun. We want to get that one back up, but it's not quite a priority in the market right now.
AL: Right. Well, thank you so much, Mark.
MH: Yeah. Thanks for having me. Appreciate it.