SemiAnalysis’s recent MI300X exposé raises serious concerns about AMD’s AI accelerator business. I’ll give a quick recap for anyone who missed this story over Christmas break, and then we’ll discuss how AMD can restore confidence.
MI300X Problems
SemiAnalysis’s investigative efforts uncovered a big problem: AMD’s AI accelerator performance falls far short of expectations.
In theory, the MI300X should hold a huge advantage over Nvidia’s H100 and H200 in terms of specifications and Total Cost of Ownership (TCO). In reality, though, the on-paper specs are not representative of the performance you can expect in a real-world environment.
AMD’s real-world performance on publicly released, stable software is nowhere close to its marketed on-paper TFLOP/s. Nvidia’s real-world performance also undershoots its marketed TFLOP/s, but not by nearly as much.
Why is there a significant gap between the MI300X’s “paper” FLOPS and its actual FLOPS?
Software.
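To make the paper-versus-real gap concrete, here’s a minimal sketch of the kind of GEMM microbenchmark anyone can run on the public PyTorch stack (ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API). The matrix size, iteration counts, and the 1307.4 TFLOP/s dense BF16 peak for MI300X are illustrative assumptions on my part, not SemiAnalysis’s methodology:

```python
import torch

def achieved_tflops(n: int = 8192, iters: int = 50) -> float:
    """Time n x n bf16 GEMMs and return achieved TFLOP/s."""
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(10):                       # warm-up: exclude kernel-selection cost
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000   # elapsed_time() is in milliseconds
    return (2 * n**3 * iters) / seconds / 1e12  # 2*N^3 FLOPs per N x N GEMM

if __name__ == "__main__":
    peak = 1307.4  # assumed: AMD's marketed dense BF16 TFLOP/s for MI300X
    real = achieved_tflops()
    print(f"achieved {real:.0f} TFLOP/s ({real / peak:.0%} of paper spec)")
```

A single GEMM shape flatters whoever tuned that kernel; real workloads mix many shapes, attention kernels, and communication, which is where the gap SemiAnalysis measured gets worse.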
AMD’s ROCm software has historically been problematic and far behind Nvidia’s CUDA; AMD knows this and publicly declared fixing ROCm was the company’s number-one priority in 2023.
However, SemiAnalysis’s investigation reveals that AMD is taking the wrong approach to building a developer-friendly AI software stack: it is building one-off, customer-specific solutions instead of making those fixes broadly available to the public.
We see this in SemiAnalysis’s benchmarking, which shows a gap between publicly available software and AMD’s internal builds:
The silver lining is that the chart above demonstrates AMD can improve its software to unlock MI300X’s theoretical performance.
While AMD’s software fixes help key customers improve, the ecosystem doesn’t see the same benefits. That’s a problem—but not an unfixable one.
SemiAnalysis argues the solution is prioritizing public builds; that requires extra effort, but it would improve developers’ perception of AMD.
The current “customer one-off” strategy has a negative downstream effect: when AMD’s engineers focus on customized customer builds rather than the main public branch, AMD forgets the pain that exists for everyone else.
This leaves AMD in a bad state: even though ROCm is improving, developer sentiment isn’t.
Worse, this doesn’t seem to be a focal concern for AMD, which raises leadership and culture questions.
Lack of Testing
Another AMD cultural flaw, as noted by SemiAnalysis, is inadequate investment in testing and “dogfooding”.
Nvidia currently provides well over 1,000 GPUs for continuous integration and development of PyTorch externally, and many more internally. AMD doesn’t. AMD needs to work with an AMD-focused GPU Neocloud to dedicate ~10,000 GPUs of each generation to internal development and PyTorch CI/CD. That would still be an eighth of what Nvidia will have with its coming huge Blackwell clusters, but it’s a start.
Insufficient testing directly impacts product quality and customer confidence.
AMD should take a tip from Jensen, who publicly explained the right approach back in November (see ~11:08 in this video):
Jensen Huang: We don’t build PowerPoint slides and ship the chip.
Until we get the whole data center built up, how do you know the software works?
Until you get the whole data center built up, how do you know your fabric works and all the things that you expected the efficiencies to be?
How do you know it's going to really work at the scale?
That's the reason why it's not unusual to see somebody's actual performance be dramatically lower than their peak performance as shown in PowerPoint slides.
Computing is just not what it used to be. I say that the new unit of computing is the data center. So that's what you have to deliver. That's what we built.
Now, we build a whole thing like that. And then, for every single combination—air-cooled, x86, liquid-cooled, Grace, Ethernet, InfiniBand, NVLink, no NVLink, you know what I'm saying—we build every single configuration.
We have five supercomputers in our company today. Next year, we're going to build easily five more.
So, if you're serious about software, you build your own computers [GPU clusters].
If you're serious about software, then you're going to build your whole computer. And we build it all at scale.
If AMD is serious about software, they need large MI300X clusters running their publicly available software stack through regression tests and performance benchmarks nightly.
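As a hypothetical illustration of what that nightly gate could look like, here’s a minimal sketch that runs benchmarks on the public stack and fails loudly if throughput regresses against a stored baseline. The `bench_gemm` module is assumed to be the microbenchmark sketched earlier; the workload name, baseline file, and 5% tolerance are all illustrative choices, not anything AMD has described:

```python
import json
import sys
import time
from pathlib import Path

from bench_gemm import achieved_tflops  # hypothetical module: the earlier GEMM sketch

BASELINE = Path("baseline_tflops.json")  # updated only after a green nightly run
TOLERANCE = 0.05                         # fail if >5% below the recorded baseline

def run_benchmarks() -> dict:
    # A real harness would cover GEMM, attention, collectives, and
    # end-to-end training steps on a representatively sized cluster.
    return {"gemm_bf16_8192": achieved_tflops()}

def main() -> int:
    results = run_benchmarks()
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    failed = sorted(
        name for name, tflops in results.items()
        if name in baseline and tflops < baseline[name] * (1 - TOLERANCE)
    )
    print(json.dumps({"timestamp": time.time(),
                      "results": results, "failed": failed}))
    return 1 if failed else 0  # nonzero exit fails the nightly CI job

if __name__ == "__main__":
    sys.exit(main())
```

The point isn’t this particular script; it’s that every regression gets caught on the same public builds developers actually install, the night it lands.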
I would assume AMD is already working toward this; if so, is there a communications plan to share progress with the public and improve perception? E.g., “We’re actively standing up representatively sized MI300X clusters at AMD Neoclouds like TensorWave and Hot Aisle for internal AI model deployment and benchmarking… ZT Systems is designing an internally owned future test cluster…”
Lisa’s Leadership
To AMD’s credit, they sat down with Dylan’s team for 90 minutes the very next day to dive into the constructive criticism.
Lisa’s reply: they are doing a ton of ROCm work… it’s just that we, the public (the broader ecosystem), aren’t seeing it.
Which, of course, is the mistake we’ve been discussing.
Loss of Trust
The most significant fallout from this exposé is the erosion of trust in AMD.
SemiAnalysis’s testing has undermined the credibility of AMD’s MI300X benchmarks, which were previously accepted at face value as reliable indicators of real-world performance.
For example, at Computex 2024 AMD claimed the MI350 series will “bring up to a 35x increase in AI inference performance compared to AMD Instinct MI300 Series”.
Now we have to take that with a grain of salt. Well… maybe important customers can get a custom build that unlocks this theoretical 35x performance, but what about the rest of us?
To be fair, most of SemiAnalysis’s article tested AMD’s training performance, while AMD is likely more focused on inference performance. But the problems are surely the same: if the out-of-box training experience is buggy and inhibits MI300X performance, are we to believe that inference is any different? And anyway, AMD must compete on training too. (Let the capital-strapped startups focus on inference.)
Can we trust AMD’s Instinct claims going forward?
Concerns
I’m going to use an analogy to illustrate my concerns with AMD’s current strategy.
Imagine someone fit telling me they can run a sub-5-minute mile. I’d probably believe them, but I’d have no proof. If they added me on Strava and I saw a recent 4:59 mile logged on their high school track, my confidence would grow. But if they showed me official race results with independent timing verifying the sub-5 performance, I’d have no doubt. And once they’ve logged an official 4:59 mile, I’ll believe them if they come back and tell me they’re now in 4:50 shape.
Analogously, AMD told us they can run a sub-5 mile, but SemiAnalysis went to the track and timed them in 5:30. Not even close. SemiAnalysis gave AMD some training tips and watched AMD log a 5:15 a few months later. That’s better, and the rapid improvement suggests AMD has the physical capabilities to break 5 minutes.
But it also raises many questions.
How long will it take to get into sub-5-minute shape? The competition is already progressing toward 4:40 shape, so how can I trust that you’ll not only get into the shape you already promised, but also get even faster so you can hang with the competition?
Most concerning: does AMD have the right coach and training plan to even get to a sub-5-minute mile? After all, it took an external coach to come in, point out that they were well out of shape, and then help get them back into shape.
Why were they so badly out of shape?
What was the coach focused on? What was their previous training plan?
Restoring Confidence
Belaboring my analogy: if AMD wants us to believe they’re a 5-minute miler, they need to prove it.
Have the coach explain the training plan.
Give us regular updates. Run a time trial every two weeks and report back.
And sign up and run a race. Several races.
The time trials will help you know you’re fit, and the races will prove to everyone that you can perform against competition on a standard track with independent official verification.
OK, enough track talk. What actual steps must AMD take?
AMD’s AI software leadership must publicly communicate their game plan. Have your coach tell us the training plan so we can believe they know what they are doing.
AMD should begin internal benchmarking using the public AI stack and report back regularly. Post your workouts for all to see, and let them hold you accountable.
When performance improves, submit results to MLPerf. This is like competing in a race: a standard track, with performance validated by external officials. Prove it.
Don’t lose sight of what matters most: the customer. Benchmarks aren’t the prize; they’re part of the process of improving performance. Use them to build trust, but let customer impact remain the ultimate measure of success.
I believe AMD can overcome this, but it needs to act and communicate aggressively to ensure this issue doesn’t fester.
Get Serious About Software
Most importantly, AMD must get serious about AI software. Not lip service, but actually serious.
AMD is already designing the hardware, and that’s the most difficult part!
Now they simply need to get serious about AI software.