averne_'s comments | Hacker News

Not really. https://codecs.multimedia.cx/2022/12/ffhistory-fabrice-bella...

>Fabrice won International Obfuscated C Code Contest three times and you need a certain mindset to create code like that—which creeps into your other work. So despite his implementation of FFmpeg was fast-working, it was not very nice to debug or refactor, especially if you’re not Fabrice


Not OP but I also often listen to ambient music while programming. A couple of recommendations would be "Music for Nine Post Cards" and other works by Hiroshi Yoshimura, and "Music for 18 Musicians" and others by Steve Reich.

In fact, the use of loops described in this article reminded me of what Reich called "phasing": essentially the same concept of melodic patterns emerging and shifting between different samples.


New physics in this context means previously unknown effects or mechanisms, or even a new theory/framework for an already understood phenomenon. Using "physics" in this way is common amongst academics.


Do you have two aliases on HN, or are you simply presuming to speak for the OP?


The main reason a wafer-scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.


The NVIDIA driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it leads to a significant simplification of the userland code either way; the driver basically has to keep track of the same things it would otherwise have submitted through an ioctl. If anything, there are some subtleties that require careful consideration.

The major upside is removing the context switch on each submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland.
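
A rough sketch of what that can look like from the userland side (the names and layout here are made up for illustration, not any real driver's ABI):

    // Hypothetical illustration only. The point is that once the kernel has
    // mapped the command ring and the doorbell page, a submission is just a
    // few memory writes, with no syscall involved.
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct UserQueue {
        uint8_t*  ring;               // GPU-visible command ring, mapped at setup time
        std::size_t ring_size;        // power-of-two size in bytes
        uint64_t  wptr;               // CPU-side write pointer (in bytes)
        volatile uint64_t* doorbell;  // MMIO doorbell page, also mapped at setup
    };

    // Submit a packet without entering the kernel: copy it into the ring,
    // make the writes visible, then ring the doorbell so the GPU fetches it.
    inline void submit(UserQueue& q, const void* cmds, std::size_t len) {
        std::size_t off = q.wptr & (q.ring_size - 1);
        std::memcpy(q.ring + off, cmds, len);    // assume no wrap-around, for brevity
        q.wptr += len;
        std::atomic_thread_fence(std::memory_order_release);  // payload before doorbell
        *q.doorbell = q.wptr;                    // publish the new write pointer
    }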


Yep. The future of GPU hardware programming? The one we will have to "standard"-ize à la RISC-V for CPUs?

The tricky part is the Vulkan "fences", namely the GPU-to-CPU notifications. These are probably hardware interrupts, which would have to be forwarded by the kernel to userland through an event ring buffer (probably a dedicated event file descriptor). There are alternatives though: userland could poll/spin on some CPU-mapped device memory for the notification, or we could go one "expensive" step further, which would "efficiently" remove the kernel for good here but would lock a CPU core (should be fine nowadays with our many-core CPUs): something along the lines of a MONITOR machine instruction, where a CPU core halts until some memory content is written, with the possibility for another CPU core to un-halt it (meaning spurious un-halting is expected).
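
For the polling alternative, a minimal sketch (illustrative names only, x86 pause intrinsic assumed; a real driver would bound the spin and fall back to the kernel interrupt path after a timeout):

    #include <atomic>
    #include <cstdint>
    #include <immintrin.h>

    // 'fence' points into coherent, CPU-mapped memory that the GPU updates
    // when work completes; 'target' is the value we are waiting to reach.
    inline void wait_for_fence(const volatile uint64_t* fence, uint64_t target) {
        while (*fence < target) {
            _mm_pause();   // be polite to the sibling hyperthread while spinning
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // order later reads after the wait
    }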

Does nvidia handle their GPU to CPU notifications without the kernel too?


eewww... my bad, we would need a timeout on the CPU core locking to go back to the kernel.

Well, polling? erk... I guess an event file descriptor is in order, and that NVIDIA is doing the same.


It actually doesn't make much difference: https://chipsandcheese.com/i/138977378/decoder-differences-a...


I had not realized that Apple did not implement any of the 32-bit ARM environment, but that cuts the legs out from under this argument in the article:

"In Anandtech’s interview, Jim Keller noted that both x86 and ARM both added features over time as software demands evolved. Both got cleaned up a bit when they went 64-bit, but remain old instruction sets that have seen years of iteration."

I still say that x86 must run two FPUs all the time, and that has to cost some power (AMD must run three - it also has 3dNow).

Intel really couldn't resist adding instructions with each new chip (MMX, PAE for 32-bit, many more on this shorthand list that I don't know), which are now mostly baggage.


> I still say that x86 must run two FPUs all the time, and that has to cost some power (AMD must run three - it also has 3dNow).

Legacy floating-point and SIMD instructions exposed by the ISA (and extensions to it) don't have any bearing on how the hardware works internally; modern cores decode x87, MMX and SSE into the same internal micro-ops and run them on shared execution units.

Additionally, AMD processors haven't supported 3DNow! in over a decade -- K10 was the last processor family to support it.


80-bit x87 has no bearing on SSE implementation.

Right. Not.


Oh wow, I need to dig way deeper into this but wonderful resource - thanks!


Do you have a link for that? I'm the guy working on the Vulkan ProRes decoder mentioned as "in review" in this changelog, as part of a GSoC project.

I'm curious wrt how a WebGPU implementation would differ from Vulkan. Here's mine if you're interested: https://github.com/averne/FFmpeg/tree/vk-proresdec


I don't have a link to hand right now, but I'll try to put one up for you this weekend. I'm very interested in your implementation - thanks, will take a good look!

Initially this was just a vehicle for me to get stuck in and learn some WebGPU, so no doubt I'm missing lots of opportunities for optimisation, but it's been as much fun as it has been frustrating. I leaned heavily on the SMPTE specification document and the FFmpeg proresdec.c implementation to understand and debug.


No problem, just be aware there's a bunch of optimizations I haven't had time to implement yet. In particular, I'd like to remove the reset kernel, fuse the VLD/IDCT ones, and try different strategies and hw-dependent specializations for the IDCT routine (AAN algorithm, packed FP16, cooperative matrices).


Do you mind going in some detail as to why they suck? Not a dig, just genuinely curious.


95% GPU usage but only 2x faster than the reference SIMD encoder/decoder


What I wonder is, how do you get the video frames to be compressed from the video card into the encoder?

The only frame capture APIs I know of take the image from the GPU to CPU RAM, and then you can put it back into the GPU for encoding.

Are there APIs which can sidestep the "load to CPU RAM" part?

Or is it implied that a game streaming codec has to be implemented with custom GPU drivers?


Some capture cards (Blackmagic comes to mind) have worked together with NVIDIA to expose DMA access. This way video frames are transferred directly from the card to GPU memory, bypassing system RAM and the CPU. I think all GPU manufacturers expose APIs to do this, but it's not that common in consumer products.


> Are there APIs which can sidestep the "load to CPU RAM" part?

On Windows that API is Desktop Duplication. It delivers D3D11 textures, usually in BGRA8_UNORM format. When HDR is enabled you need a slightly different API method, which can deliver HDR frames in RGBA16_FLOAT pixel format.
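
A minimal Desktop Duplication sketch (error handling mostly omitted): the acquired frame is an ID3D11Texture2D that already lives in VRAM, so it can be handed to a GPU encoder without a round trip through system RAM. Link against d3d11.lib and dxgi.lib.

    #include <d3d11.h>
    #include <dxgi1_2.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    int main() {
        ComPtr<ID3D11Device> device;
        ComPtr<ID3D11DeviceContext> context;
        D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                          nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);

        // Walk from the D3D11 device to the output (monitor) we want to capture.
        ComPtr<IDXGIDevice> dxgiDevice;
        device.As(&dxgiDevice);
        ComPtr<IDXGIAdapter> adapter;
        dxgiDevice->GetAdapter(&adapter);
        ComPtr<IDXGIOutput> output;
        adapter->EnumOutputs(0, &output);            // primary output
        ComPtr<IDXGIOutput1> output1;
        output.As(&output1);

        ComPtr<IDXGIOutputDuplication> dupl;
        output1->DuplicateOutput(device.Get(), &dupl);

        for (;;) {
            DXGI_OUTDUPL_FRAME_INFO info{};
            ComPtr<IDXGIResource> resource;
            if (dupl->AcquireNextFrame(16, &info, &resource) != S_OK)
                continue;                            // timeout or transient error: try again

            ComPtr<ID3D11Texture2D> frame;
            resource.As(&frame);
            // ... hand 'frame' to the encoder (e.g. copy into its input texture) ...

            dupl->ReleaseFrame();                    // must release before the next acquire
        }
    }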


There's also Windows.Graphics.Capture. It lets you get a texture not just for the whole desktop, but also for individual windows.


On Linux you should look into GStreamer and dmabuf.
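
As a hedged sketch (element names are assumptions and depend on which plugins you have installed), a pipeline built with the GStreamer C API: pipewiresrc can hand out frames as DMABufs (e.g. from xdg-desktop-portal screen capture), and the VA elements can import and encode them on the GPU without a CPU copy.

    #include <gst/gst.h>

    int main(int argc, char** argv) {
        gst_init(&argc, &argv);

        GError* err = nullptr;
        GstElement* pipeline = gst_parse_launch(
            "pipewiresrc ! vapostproc ! vah264enc ! h264parse ! "
            "mp4mux ! filesink location=capture.mp4", &err);
        if (!pipeline) {
            g_printerr("Failed to build pipeline: %s\n", err->message);
            g_error_free(err);
            return 1;
        }

        gst_element_set_state(pipeline, GST_STATE_PLAYING);

        // Run until an error or end-of-stream shows up on the bus.
        GstBus* bus = gst_element_get_bus(pipeline);
        GstMessage* msg = gst_bus_timed_pop_filtered(
            bus, GST_CLOCK_TIME_NONE,
            (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));

        if (msg) gst_message_unref(msg);
        gst_object_unref(bus);
        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(pipeline);
        return 0;
    }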


Hardware GPU encoders are dedicated ASIC engines, separate from the main shader cores. They run in parallel, so there is no performance penalty for using both simultaneously, besides increased power consumption.

Generally, you're right that these hardware blocks favor latency. One example of this is motion estimation (one of the most expensive operations during encoding). The NVENC engine on NVidia GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVidia has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate block) which might be able to do higher quality detection.
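
As a hedged sketch of how you would reach that dedicated block from application code through FFmpeg's libavcodec (the resolution, preset and tuning values are arbitrary, and error handling is minimal):

    extern "C" {
    #include <libavcodec/avcodec.h>
    #include <libavutil/opt.h>
    }
    #include <cstdio>

    int main() {
        // Ask for the NVENC-backed H.264 encoder explicitly; this fails if the
        // build lacks it or no NVIDIA GPU is present.
        const AVCodec* codec = avcodec_find_encoder_by_name("h264_nvenc");
        if (!codec) {
            std::fprintf(stderr, "h264_nvenc not available in this build\n");
            return 1;
        }

        AVCodecContext* ctx = avcodec_alloc_context3(codec);
        ctx->width     = 1920;
        ctx->height    = 1080;
        ctx->time_base = {1, 60};
        ctx->framerate = {60, 1};
        ctx->pix_fmt   = AV_PIX_FMT_NV12;

        // NVENC-specific knobs: fastest preset and ultra-low-latency tuning,
        // the kind of trade-off a game-streaming use case would pick.
        av_opt_set(ctx->priv_data, "preset", "p1", 0);
        av_opt_set(ctx->priv_data, "tune", "ull", 0);

        if (avcodec_open2(ctx, codec, nullptr) < 0) {
            std::fprintf(stderr, "failed to open encoder\n");
            return 1;
        }

        // Frames would then be sent with avcodec_send_frame() and packets drained
        // with avcodec_receive_packet(); the shader cores stay free meanwhile.

        avcodec_free_context(&ctx);
        return 0;
    }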


I'm pretty sure they aren't dedicated ASIC engines anymore. That's why hacks like nvidia-patch are a thing, where you can scale NVENC usage up to the full GPU's compute rather than the arbitrary limitation NVIDIA adds. The penalty for using them within those limitations tends to be negligible, however.

And on a similar note, NvFBC helps a ton with latency, but it's disabled at the driver level for consumer cards.


> I'm pretty sure they aren't dedicated ASIC engines anymore.

They are. That patch doesn't do what you think it does.


Matrix instructions do of course have uses in graphics. One example of this is DLSS.


This feels backwards to me when GPUs were created largely because graphics needed lots of parallel floating point operations, a big chunk of which are matrix multiplications.

When I think of matrix multiplication in graphics I primarily think of transforms between spaces: moving vertices from object space to camera space, transforming from camera space to screen space, ... This is a big part of the math done in regular rendering and needs to be done for every visible vertex in the scene - typically in the millions in modern games.
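
As a toy illustration of that kind of work (no GPU or libraries involved): a column-major 4x4 matrix applied to a homogeneous vertex, the same shape of computation a vertex shader performs per vertex with the model/view/projection matrices.

    #include <array>
    #include <cstdio>

    using Vec4 = std::array<float, 4>;
    using Mat4 = std::array<float, 16>;   // column-major, like GLSL/GLM

    Vec4 mul(const Mat4& m, const Vec4& v) {
        Vec4 r{0, 0, 0, 0};
        for (int col = 0; col < 4; ++col)
            for (int row = 0; row < 4; ++row)
                r[row] += m[col * 4 + row] * v[col];
        return r;
    }

    int main() {
        // A translation by (1, 2, 3): identity plus the offset in the last column.
        Mat4 model = {1, 0, 0, 0,
                      0, 1, 0, 0,
                      0, 0, 1, 0,
                      1, 2, 3, 1};
        Vec4 object_space = {0.5f, 0.5f, 0.0f, 1.0f};   // w = 1 for positions
        Vec4 world_space  = mul(model, object_space);   // object -> world
        std::printf("%f %f %f\n", world_space[0], world_space[1], world_space[2]);
    }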

I suppose the difference here is that DLSS is a case where you primarily do large numbers of consecutive matrix multiplications with little other logic, since it's more ANN code than graphics code.

