AMD's CDNA 4 Architecture Announcement (chipsandcheese.com)

160 points by rbanffy 20 hours ago | 31 comments

jauntywundrkind 19 hours ago [-]

Faster small matrix, for AI. Yup, that seems like good fit for what folks want.

Supercharging the Local Data Share (LDS) that's shared by threads is really cool to hear about. Size goes from 64KB to 160KB. Writes into LDS go from 32B max to 128B, increasing throughput. Transposes too, to help get the data in the right shape for its next use.
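To put those LDS numbers in proportion, here is a rough sketch of the arithmetic in Python. It takes the figures quoted above at face value (64KB -> 160KB capacity, 32B -> 128B max write). The element sizes and the "writes needed to fill the LDS" metric are my own illustrative assumptions, not something from the article.

```python
# Back-of-the-envelope arithmetic for the LDS figures quoted in this comment.
# Assumption: KB means KiB (1024 bytes); element sizes are illustrative only.
KIB = 1024
OLD_LDS_BYTES = 64 * KIB
NEW_LDS_BYTES = 160 * KIB
OLD_WRITE_BYTES = 32
NEW_WRITE_BYTES = 128

# Hypothetical element widths, used only to show how many values would fit.
ELEMENT_BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}


def capacity_ratio() -> float:
    """How much larger the new LDS is than the old one."""
    return NEW_LDS_BYTES / OLD_LDS_BYTES


def write_width_ratio() -> float:
    """How much wider the largest single write is."""
    return NEW_WRITE_BYTES / OLD_WRITE_BYTES


def elements_that_fit(lds_bytes: int, dtype: str) -> int:
    """Number of elements of the given type that fit in the LDS."""
    return lds_bytes // ELEMENT_BYTES[dtype]


def writes_to_fill(lds_bytes: int, write_bytes: int) -> int:
    """Number of max-width writes needed to fill the LDS (ceiling division)."""
    return -(-lds_bytes // write_bytes)


if __name__ == "__main__":
    print("capacity ratio:", capacity_ratio())
    print("write width ratio:", write_width_ratio())
    for dtype in ELEMENT_BYTES:
        print(dtype, "elements in new LDS:", elements_that_fit(NEW_LDS_BYTES, dtype))
    print("writes to fill old LDS:", writes_to_fill(OLD_LDS_BYTES, OLD_WRITE_BYTES))
    print("writes to fill new LDS:", writes_to_fill(NEW_LDS_BYTES, NEW_WRITE_BYTES))
```

So the write width grows faster than the capacity (4x vs 2.5x), which means fewer max-width writes are needed to fill the bigger LDS than the old one.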

Really really curious to see what the UDNA unified next gen architectures look like, if they really stick to merging compute and Radeon (CDNA and RDNA), as promised. If consumers end up getting multi-die compute solutions that would be neat & also intimidatingly hard (lots of energy spent keeping bits in sync across cores/coherency). After Navi 4X ended up having its flagship cancelled way back now, been wondering. I sort of expect that this won't scale as nicely as Epyc being a bunch of Ryzen dies. https://wccftech.com/amd-enthusiast-radeon-rx-8000-gpus-alle...

robjeiter 19 hours ago [-]

When looking at inference is AMD already on par with Nvidia?

moondistance 18 hours ago [-]

Yes, for many applications.

Meta, OpenAI, Crusoe, and xAI recently announced large purchases of MI300 chips for inference.

MI400, which will be available next year, also looks to be at least on par with Nvidia's roadmap.

moondistance 18 hours ago [-]

(this is also why AMD popped 10% at open yesterday - this is a new development and talks from their 2025 "Advancing AI" event were published late last week + over the weekend)

christkv 18 hours ago [-]

Is the software stack still lacking?

OneDeuxTriSeiGo 18 hours ago [-]

Yeah it's still a few years behind but it's getting better. They are hiring software and tooling engineers like crazy. I keep tabs on some of the job slots companies have in our area and every time I check AMD they always have tons of new slots for software, firmware, and tooling (and this has been the case for ~3 years now).

They've been playing catch up after "the bad old days" when they had to let a bunch of people go to avoid going under but it looks like they are catching back up to speed. Now it's just a matter of giving all those new engineers a few years to get their software world in order.

storus 17 hours ago [-]

They pay hardware rates to software engineers (principal engineer at the salary level of a decent fresh graduate) so I won't be too optimistic about them attracting software people that would propel them forward.

OneDeuxTriSeiGo 17 hours ago [-]

At least where I live (very much not west coast), their SW and HW rates are at or above what we normally see in this area.

latchkey 16 hours ago [-]

Stock is undervalued. If you get in now and it pops over the next few years, it'll likely make up for lower compensation.

MegaButts 12 hours ago [-]

You don't need to work at AMD to buy their stock.

latchkey 12 hours ago [-]

True, but if you don’t have a job, where’s the money for buying stock coming from?

alemanek 11 hours ago [-]

If you are what AMD needs to catch up then you can just go work for NVidia for 3x the pay. This market sucks but top tier engineers in the niche they need are not a dime a dozen.

latchkey 9 hours ago [-]

It isn't always about the money.

iszomer 11 hours ago [-]

We're forbidden from trading our own stock anyway, SEC regulation on insider trading and all.

almostgotcaught 12 hours ago [-]

You're "talking your book".

varelse 16 hours ago [-]

[dead]

zombiwoof 11 hours ago [-]

They pay terribly and still have legacy old guard managers. If you try to innovate on software you should look elsewhere or really make sure your manager knows what’s what.

martinald 14 hours ago [-]

FWIW for the first time in 2+ years I managed to compile llama.cpp with ROCm out of the box and run a model with no problems* on Linux (actually under WSL2 as well), with no weirdness or errors.

Every time I have tried this previously it has failed with some cryptic errors.

So from this very small test it has got way better recently.

*Did have problems enabling the WMMA extensions though. So not perfect yet.

halJordan 12 hours ago [-]

If this has been an issue for two years, then it's not a ROCm or llama.cpp problem.

martinald 2 hours ago [-]

Oh I'm sure you are right it's operator error, but I'd always have some issue installing ROCm and getting the paths right or something. This is the first time I've managed to install ROCm following the commands exactly and then compile llama.cpp without having to adjust anything.

BTW, this kind of dev experience does really matter. I'm sure it was possible to get working previously, but I didn't have the level of interest to make it work, even if it was somewhat trivial. Being able to compile out of the box makes a big difference. And AFAIK this new version is the first to properly support WSL2, which means I don't have to dual boot to even try and get it working. It's a big improvement.

moondistance 18 hours ago [-]

Yes, big time, but there continues to be lots of progress.

Most importantly, models are maturing, and this means less custom optimization is required.

martinald 14 hours ago [-]

Yes, I'd agree with that. There is so much demand for inference, which is maturing rapidly, that even if a lot of the "R&D" is done on Nvidia cards because of their (vastly, let's be fair) better software stack, doing the inference on AMD is still an enormous market if AMD is competitive on the inference side (and, perhaps more importantly, has shorter lead times).

I suspect we will be (or already are?) at a point where 95%+ of GPUs are used for inference, not training.

latchkey 16 hours ago [-]

https://eliovp.com/cranking-out-faster-tokens-for-fewer-doll...

incomingpain 3 hours ago [-]

I bought a Radeon 9060. ROCm works. I'm getting ~40 tokens/sec out of Phi4: 14B

BEWARE: I was running fully patched ubuntu 24 LTS and I needed to upgrade to ubuntu 24.10 and then ubuntu 25 before the drivers worked. Painful.

bee_rider 19 hours ago [-]

Machine learning is, of course, a massive market and everybody’s focus.

But, does AMD just own the whole HPC stack at this point? (Or would they, if the software was there?).

At least the individual nodes. What’s their equivalent to Infiniband?

phonon 18 hours ago [-]

Ultra Ethernet

https://www.tomshardware.com/networking/amd-deploys-its-firs...

https://semianalysis.com/2025/06/11/the-new-ai-networks-ultr...

OneDeuxTriSeiGo 17 hours ago [-]

It's also worth noting Ultra Ethernet isn't just an AMD thing. The steering committee for the UEC is made up of basically every hardware manufacturer in the space except Nvidia. And of course Nvidia is a general contributor as well (presumably so they don't get left behind).

https://ultraethernet.org/

jauntywundrkind 17 hours ago [-]

Also Ultra Ethernet went 1.0 (6d ago) and had a decent-sized comment thread: https://news.ycombinator.com/item?id=44249190

latchkey 16 hours ago [-]

Within the node (gpu to gpu), it is infinity fabric.

Externally, it is 8x400G NICs, which is the limit of PCIe 5.0 anyway.
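For anyone wondering why 400G per NIC lines up with PCIe 5.0, here is the arithmetic as a small Python sketch. The PCIe 5.0 numbers (32 GT/s per lane, 128b/130b encoding) are from the spec. The assumption that each NIC sits on its own x16 link is mine, and packet/protocol overhead is ignored, so real throughput is somewhat lower.

```python
# Rough check of "8x400G NICs is the limit of PCIe 5.0".
# Assumption (not stated in the thread): each NIC has its own x16 PCIe 5.0 link.
# Protocol overhead beyond line encoding is ignored.
PCIE5_GT_PER_LANE = 32.0        # giga-transfers per second per lane
PCIE5_ENCODING = 128.0 / 130.0  # 128b/130b line encoding efficiency
LANES_PER_NIC = 16
NIC_GBPS = 400.0
NICS_PER_NODE = 8


def pcie5_gbps(lanes: int = LANES_PER_NIC) -> float:
    """Usable PCIe 5.0 bandwidth in Gb/s, per direction, after line encoding."""
    return PCIE5_GT_PER_LANE * PCIE5_ENCODING * lanes


def nic_fits(nic_gbps: float, lanes: int = LANES_PER_NIC) -> bool:
    """True if a NIC of the given speed can be fed by the PCIe link."""
    return nic_gbps <= pcie5_gbps(lanes)


def node_gbps() -> float:
    """Aggregate external bandwidth of the node in Gb/s."""
    return NIC_GBPS * NICS_PER_NODE


if __name__ == "__main__":
    print("PCIe 5.0 x16, Gb/s per direction:", round(pcie5_gbps(), 1))
    print("400G NIC fits:", nic_fits(400.0))
    print("800G NIC fits:", nic_fits(800.0))
    print("node total, Gb/s:", node_gbps())
    print("node total, GB/s:", node_gbps() / 8)
```

An x16 link comes out at roughly 504 Gb/s per direction, so a 400G NIC fits with a bit of headroom and an 800G NIC would not.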

We had a guy training SOTA models on 9 of our MI300x boxes just fine. Networking wasn't the slow bit.

wmf 17 hours ago [-]

Cray Slingshot is even faster than Infiniband.

Now that Nvidia is removing FP64 I assume AMD will have 100% of the HPC market until Fujitsu Monaka comes out.

curt15 4 hours ago [-]

Would traditional HPC applications using FP64 gain anything from CDNA4 compared to the MI300A?

icf80 7 hours ago [-]

no UDNA ? any news ?
