TomVDB's comments | Hacker News

I was wondering about that. According to the Nvidia RTX 4090 product page, it still uses PCIe 4.


Yup, it only uses the PCIe 5 power connector. The GPU itself is gen 4.


Moving data around is indeed a major issue for any throughput-oriented device. But for a gaming GPU, PCIe BW has never been an issue in any of the benchmarks that I’ve seen. (Those benchmarks artificially reduce the number of PCIe lanes.)

In fact, the 4000 series still has PCIe 4.

Moving data around for a GPU is about the memory system feeding the shader cores. PCIe is way too slow to make that happen. That’s why a GPU has gigabytes of local RAM.


They're defending taking advantage of a miner's forced fire sale. Less waste, lower price for a good GPU. What's not to like?


His argument is that by buying a miner card, you're validating the strategy of miners, who are considered "bad actors" by gamers and environmentalists. If you intentionally decline to buy a miner card, you're helping to make mining less profitable, thus doing the "right thing".


You still want to give them money in exchange for dubious goods.


Why don't you go to a beach and scoop up some oil from the BP spill? It's good oil after all.

Let them clean up their environmental disaster.


Holy false equivalence, batman!


> Why don't you go to a beach and scoop up some oil from the BP spill?

If it was trivial, why not? If you told him "don't buy any GPUs", then it would make sense. But either way he will be using some "oil" (a GPU) in this case, so I don't see why he should waste perfectly good "oil". I suppose you would rather leave the oil in the ocean?


I have never understood this argument. How much environmental damage is done by the mere creation of all television shows cumulatively, the creation of plastic tat given away at conventions, plastic wrapping at grocery stores?

But graphics cards are what we REALLY need to focus on and you better believe you are tuning in for the next big show. It's almost manufactured.


AMD's decision to have different architectures for gaming and datacenter is still a major mystery. It's clear from Nvidia's product line that there's no reason to do so. (And, yes, Hopper and Ada are different names, but there was nothing in today's announcement that makes me believe that Ada and Hopper are a bifurcation in core architecture.)


Moreover, CDNA is not a new architecture, but just a rebranding of GCN.

CDNA 1 had few changes over the previous GCN variant, except for the addition of matrix operations with double the throughput of the vector operations, like NVIDIA had done before with the so-called "tensor" cores of its GPUs.

CDNA 2 had more important changes, with the double-precision operations becoming the main operations around which the compute units are structured, but the overall structure of the compute units has remained the same as in the first GCN GPUs from 2012.

The changes made in RDNA vs. GCN/CDNA would have been as useful in scientific computing applications as they are in the gaming GPUs, and RDNA is also defined to potentially allow fast double-precision operations, even if no such RDNA GPU has been designed yet.

I suppose the reason why AMD has continued with GCN for the datacenter GPUs is their weakness in software development. To this day, ROCm and the other AMD libraries and software tools for GPU compute have good support only for GCN/CDNA GPUs, while support for RDNA GPUs was non-existent in the beginning and is still very feeble now.

So I assume that they have kept GCN rebranded as CDNA for datacenter applications because they were not ready to develop appropriate software tools for RDNA.


Some guy on Reddit claiming to be an AMD engineer was telling me a year or so ago that RDNA took up 30% more area per FLOP than GCN / CDNA.

That's basically the reason for the split. Video game shaders need the latency improvements from RDNA (particularly the cache, but also the pipeline-level latency improvements: an instruction completes every clock rather than once every 4 clocks as on GCN).

But supercomputers care more about bandwidth. The once-every-4-clocks approach of GCN/CDNA is far denser and more power efficient.


GCN/CDNA is denser with more FLOPS.

RDNA has more cache and runs with far less latency. Like 1/4th the latency of CDNA/Vega. This makes it faster for video game shaders in practice.


Density, I can accept.

But what kind of latency are we talking about here?

CDNA has 16-wide SIMD units that retire one 64-wide warp instruction every 4 clock cycles.

RDNA has 32-wide SIMD units that retire one 32-wide warp every clock cycle. (It's uncanny how similar it is to Nvidia's Maxwell and Pascal architectures.)

Your 1/4 number makes me think that you're talking about a latency that has nothing to do with reads from memory, but with the rate at which instructions are retired? Or does it have to do with the depth of the instruction pipeline? As long as there's sufficient occupancy, a latency difference of a few clock cycles shouldn't mean anything in the context of a thousand-clock-cycle latency for accessing DRAM?


> thousand clock cycle latency for accessing DRAM?

That's what's faster.

Vega64 accesses HBM in like 500 nanoseconds. (https://www.reddit.com/r/ROCm/comments/iy2rfw/752_clock_tick...)

RDNA2 accesses GDDR6 in like 200 nanoseconds. (https://www.techpowerup.com/281178/gpu-memory-latency-tested...)

EDIT: So it looks like my memory was bad. I could have sworn RDNA2 was faster (maybe I was thinking of the faster L1/L2 caches of RDNA?). Either way, it's clear that Vega/GCN has much, much worse memory latency. I've updated the numbers above and also edited this post a few times as I looked stuff up.
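
For a rough sense of what those numbers mean in core clocks, here's a quick back-of-the-envelope sketch. The clock speeds are my assumptions (roughly Vega64's boost clock and a typical RDNA2 game clock), so treat the cycle counts as ballpark figures only:

    #include <cstdio>

    // Convert the measured latencies above into core clock cycles.
    // The clock speeds are assumed, not measured.
    int main() {
        const double vega_latency_ns  = 500.0;  // HBM2 access on Vega64 (Reddit thread above)
        const double rdna2_latency_ns = 200.0;  // GDDR6 access on RDNA2 (TechPowerUp article above)
        const double vega_clock_ghz   = 1.5;    // assumed boost clock
        const double rdna2_clock_ghz  = 2.0;    // assumed game clock

        printf("Vega64: ~%.0f cycles\n", vega_latency_ns * vega_clock_ghz);    // ~750 cycles
        printf("RDNA2:  ~%.0f cycles\n", rdna2_latency_ns * rdna2_clock_ghz);  // ~400 cycles
        return 0;
    }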


Thanks for that.

The weird part is that this latency difference has to be due to a terrible memory controller design by AMD, because there's not a huge difference in latency between any of the current DRAM technologies: the interfaces of HBM and GDDR (and regular DDR) are different, but the underlying method of accessing the data is similar enough for the access latency to be very similar as well.


Or... supercomputer users don't care about latency in GCN/CDNA applications.

500 ns to access main memory, and (lol) 120 nanoseconds to access L1 cache, is pretty awful. CPUs can access RAM with lower latency than Vega/GCN can access its L1 cache. Indeed, RDNA's main-memory access latency is approaching Vega/GCN's L2 latency.

----------

This has to be an explicit design decision by AMD's team to push GFLOPS higher and higher. But as I stated earlier: video game programmers want lower latency for their shaders. "More like NVidia", as you put it.

Seemingly, the supercomputer market is willing to put up with these bad latency scores.


But why would game programmers care about shader core latency??? I seriously don't understand.

We're not talking here about the latency that gamers care about, the one that's measured in milliseconds.

I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...


> But why would game programmers care about shader core latency??? I seriously don't understand.

Well, I don't know per se. What I can say is that the various improvements AMD made to RDNA did the following:

1. Barely increased TFLOPs -- Especially compared to CDNA, it is clear that RDNA has fewer FLOPs

2. Despite #1, improved gaming performance dramatically

--------

When we look at RDNA, we can see that many, many latency numbers improved (though throughput numbers, like TFLOPs, aren't that much better than Vega 7). It's clear that the RDNA team did some kind of analysis of the kinds of shaders that are used by video game programmers, and tailored RDNA to match them better.

> I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...

Those are just things I've noticed about the RDNA architecture. Maybe I'm latching onto the wrong things here, but... it's clear that RDNA was aimed at the gaming workload.

Perhaps modern shaders are no longer just brute-force vertex/pixel style shaders, but are instead doing far more complex things. These more complicated shaders could be more latency bound rather than TFLOPs bound.


Nvidia has been making different architectures for gaming and datacenter for a few generations now: Volta and Turing, Ampere and Ampere (called the same, but different architectures on different nodes). And Hopper and Lovelace are different architectures. The SMs are built differently, with different cache amounts, different numbers of shading units per SM, different FP16/FP32 ratios, no RT cores in Hopper, and I can go on and on. They are different architectures where some elements are the same.


No, the NVIDIA datacenter and gaming GPUs do not have different architectures.

They have some differences beyond the different sets of implemented features, e.g. ECC memory or FP64 speed, but those are caused much less by their target markets than by the offset in time between their designs, which gives the opportunity to add more improvements to whichever comes later.

The architectural differences between NVIDIA datacenter and gaming GPUs of the same generation are much less than between different NVIDIA GPU generations.

This can be obviously seen in the CUDA version numbers, which correspond to lists of implemented features.

For example, datacenter Volta is 7.0, automotive Volta is 7.2 and gaming Turing is 7.5, while different versions of Ampere are 8.0, 8.6 and 8.7.

The differences between any Ampere and any Volta/Turing are larger than between datacenter Volta and gaming Turing, or between datacenter Ampere and gaming Ampere.

The differences between two successive NVIDIA generations can be as large as between AMD CDNA and RDNA, while the differences between datacenter and gaming NVIDIA GPUs are less than between two successive generations of AMD RDNA or AMD CDNA.


I don't agree.

Turing is an evolution of Volta. In fact, in the CUDA slides of Turing, they mention explicitly that Turing shaders are binary compatible with Volta, and that's very clear from the whitepapers as well.

Ampere A100 and Ampere GeForce have the same core architecture as well.

The only differences are in HPC features (MIG, ECC), FP64, the beefiness of the tensor cores, and the lack of RTX cores on HPC units.

The jury is still out on Hopper vs Lovelace. Today's presentation definitely points to a similar difference as between A100 and Ampere GeForce.

It's more: the architectures are the same with some minor differences.

You can also see this with the SM feature levels:

Volta: SM 70, Turing SM 75

Ampere: SM 80 (A100) and SM 86 (GeForce)
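
For what it's worth, those SM numbers are exactly what device code branches on via __CUDA_ARCH__, which is why I read them as feature levels of one base architecture rather than separate architectures. A rough sketch (the branch bodies are placeholders, not a feature list):

    __global__ void kernel()
    {
    #if __CUDA_ARCH__ >= 800
        // Ampere path: covers both A100 (sm_80) and GeForce (sm_86)
    #elif __CUDA_ARCH__ >= 700
        // Volta (sm_70) and Turing (sm_75) path
    #endif
    }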


Turing is an evolution of Volta, but they are different architectures.

A100 and GA102 DO NOT have same core architecture. 192KB of L1 cache in A100 SM, 128KB in GA102 SM. That already means that it is not the same SM. And there are other differences. For example Volta started featuring second datapath that could process one INT32 instruction in addition to floating point instructions. This datapath was upgraded in GA102 so now it can handle FP32 instructions as well(not FP16, only first datapath can process them). A100 doesn't have this improvement, that's why we see such drastic(basically 2x) difference in FP32 flops between A100 and GA102. It is not a "minor difference" and neither is a huge difference in L2 cache(40MB vs 6MB). It's a different architecture on a different node designed by a different team.


GP100 and the GeForce Pascal chips have a different shared memory structure as well, so much so that GP100 was listed as having 30 SMs instead of 60 in some Nvidia presentations. But the base architecture (ISA, instruction delays, …) was the same.

It’s true that GA102 has double the FP32 units, but the way they work is very similar to the way SMs have 2x FP16, in that you need to go out of your way to benefit from them. Benchmarks show this as well.
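
To make the "go out of your way" point concrete on the FP16 side: the 2x rate only shows up when you write packed half2 code, roughly like the sketch below (the kernel itself is just illustrative):

    #include <cuda_fp16.h>

    // The doubled FP16 rate comes from packed half2 operations: each __hfma2
    // performs two FP16 fused multiply-adds. Scalar __half code doesn't benefit.
    __global__ void axpy_half2(const __half2* x, __half2* y, __half2 a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = __hfma2(a, x[i], y[i]);   // two FP16 FMAs per instruction
    }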

I like to think that Nvidia’s SM version nomenclature is a pretty good hint, but I guess it just boils down to personal opinion about what constitutes a base architecture.


AMD as well. The main difference being that Nvidia kills you big time with the damn licensing (often more expensive than the very pricey card itself) while AMD does not. It's quite unfortunate we don't have more budget options for these types of cards, as it would be pretty cool to have a bunch of VMs or containers with access to "discrete" graphics.


Nvidia's datacenter product licensing costs are beyond onerous, but even worse to me is that their license server (both its on-premise and cloud versions) is fiddly and sometimes just plain broken. Losing your license lease makes the card go into a super-low-performance hibernation mode, which means that dealing with the licensing server is not just about maintaining compliance -- it's about keeping your service up.

It's a bit of a mystery to me how anyone can run a high availability service that relies on Nvidia datacenter GPUs. Even if you somehow get it all sorted out, if there was ANY other option I would take it.


I'd be more than happy to buy a 3080 (or similar) at a bargain price, knowing that it has been run at lower voltages and power levels.

As for waste: I very much hope it will not become waste. Why would you be advocating for that?


How many giraffes is that?


With no virtual memory, no caches, and interface processors instead of direct access to external DRAM, this thing must be a programming nightmare?

Having tons of small CPUs with fast local SRAM is of course not a new idea. Back in 1998, I talked to a startup that believed it could replace standard-cell ASIC design with tiny CPUs that had custom instruction sets. (I didn't believe it could: it's extremely area inefficient and way too power hungry for that kind of application. The startup went nowhere.) And the IBM Cell is indeed an obvious inspiration.

But AFAIK, the IBM Cell was hard to program. I've seen PS3 presentations where it was primarily used as a software defined GPU, because it was just too difficult to use as a general purpose processor.

Now NOT being a general purpose processor is the whole point of Dojo, so maybe they can make it work. But from my limited experience with CUDA, virtual memory and direct access to DRAM are a major plus, even if the high-performance compute routines make intensive use of shared memory. The fact that an interface processor is involved (how?) in managing your local SRAM must make synchronization much more complex than with CUDA, where everything is handled by the same SM that manages the calculations: your warp issues a load, it waits on a barrier, the calculation happens (sometimes in a side unit, in which case you again wait on a barrier), you offload the data and wait on a barrier. And while one warp waits on a barrier, another warp can take over. It's pretty straightforward.
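
To make that CUDA flow concrete, a stripped-down kernel looks roughly like this. The names and the "calculation" are placeholders, not any particular workload, and it assumes a launch with 256 threads per block:

    // Each block stages data in shared memory, waits on a barrier, computes,
    // waits again, then writes the result back. While one warp waits on a
    // barrier, the SM runs other warps.
    __global__ void process(const float* in, float* out, int n)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // warp issues the load
        __syncthreads();                             // wait on a barrier

        float result = tile[threadIdx.x] * 2.0f;     // the calculation happens
        __syncthreads();                             // wait before the tile is reused

        if (i < n)
            out[i] = result;                         // offload the data
    }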

The Dojo model suggests that "wait on a barrier" becomes "wait on the interface processor".


If it only ever runs one program, and that program is an implementation of vanilla Transformers, that might be all it needs to be useful. Sufficiently large Transformers can do an incredible variety of tasks. If someone invents something better than vanilla Transformers, then they can write a second program for that.


Also, investing in a branch predictor when the intended workload doesn't seem at all scalar is a confusing choice to me. And the 362 F16 TFLOPS sounds super impressive, except the memory bandwidth is, I think, 800 GB/s (or is it 5 times that? Or effectively less than that if data has to be passed along multiple hops? I'm a bit confused), which means having to do roughly 1000 ops (or 200? or more?) on each 16-bit value loaded in. Maybe you could do that, but it feels like you'd probably end up bandwidth bound most of the time.
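
That ~1000 figure is just the compute-to-bandwidth ratio; here's the arithmetic sketched out with the numbers from my comment (362 F16 TFLOPS and 800 GB/s, whichever of those turns out to be right):

    #include <cstdio>

    // Arithmetic-intensity estimate: F16 ops per second divided by the number
    // of 16-bit values per second the memory system can deliver.
    int main() {
        const double flops        = 362e12;          // claimed F16 throughput
        const double bw_bytes     = 800e9;           // assumed memory bandwidth, bytes/s
        const double values_per_s = bw_bytes / 2.0;  // 2 bytes per 16-bit value

        printf("ops per value loaded: ~%.0f\n", flops / values_per_s);  // ~905
        return 0;
    }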


My understanding is that they occasionally load weights into SRAM and then pump training data in at the sides of the die, with multiple cores operating on a wavefront of data. So the cores don't compete for host memory bandwidth, because the same data flows (transformed) through multiple cores.


You are right that this won't work well with any language that assumes a "normal" processor. But a small language that is written for it could be fine.


From my understanding, the Cell was meant to be the GPU for the PS3, but Sony ran into the same issues and could not produce a reasonably performing SDK for it within the time limits (set by the MS Xbox 360), so they added in an Nvidia RSX GPU.

Another oddball architecture that went nowhere.


> could not program a reasonable performing SDK using it within the time limits

It feels like "within the time limits" has always been the problem for difficult-to-program, software-dependent architectures: time vs. competitors.

E.g., in the time it takes to write an intelligent compiler (IA-64), your better-resourced competitor (because they're getting revenue from the current market) has surpassed your performance via brute evolutionary force.

There are use cases out there (early supercomputing, NVIDIA) where radical, development-heavy architectures have been successful, but they generally lacked a competitor (the former) or iterate ruthlessly themselves (the latter).


"radical, development-heavy architectures" = niche use case

Connection machine = only had one customer afaik (NSA)

Transmeta - interesting technology but nobody in that market wanted to run anything besides Windows+x86.


Sounds to me like a programming dream. The usual way of things these days is 'don't waste your employer's time trying to optimize; everything that can profitably be done has already been done by other people; you just have to accept that particular part of your skill set is useless'. Dojo would let you actually use a lot more of your skills.


What programming when it's a model being run?


If you don’t want a computer on wheels, you don’t want a German car, you want a second hand Toyota Corolla.

We had an Audi Q5 first and a Tesla second. My wife and I fight over who gets to drive the Tesla.

With a modern German car, you still get some of the fancy features, but they don’t work very well (lane control comes to mind), and the electronic UI is much worse than Tesla’s. CarPlay is the only way out.

And let’s not even talk about the difference in driving fun.


As a counterpoint, my company has outsourced major parts of its internal IT support to the Philippines, yet my experience has been just as good as it was before they did so.

If they can do that for internal support, it should be possible for external as well?


You might have missed the last line of my comment :)

> Of course this isn't always true. There are definitely some US call centers that are awful and some stellar offshore ones, but as a general rule of thumb this seems to be true.


Sure, but it’s much cheaper to have crap support that can’t help you. For many companies it’s a plus if they can cheaply waste your time so much you won’t call again next time.


Putting a check in the mail (or pressing some button on a web form to make it so) is one of the things that comes to mind.

