Why would GPUs necessarily have higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
And our LLMs still have latencies well into the human-perceptible range. If there's any necessary architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.
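To put rough numbers on that (every figure below is an illustrative assumption, not a measured spec): a memory-bandwidth-bound decode step costs on the order of (weight bytes read) / (HBM bandwidth) per token, on either kind of chip.

    # Back-of-envelope sketch with assumed numbers: per-token decode latency
    # when you are bound by reading the weights from HBM.
    weight_bytes = 70e9 * 2        # hypothetical 70B-parameter model in bf16
    hbm_bandwidth = 3.35e12        # ~3.35 TB/s, an H100-class figure

    per_token_s = weight_bytes / hbm_bandwidth
    print(f"~{per_token_s * 1e3:.0f} ms per token")   # ~42 ms

Tens of milliseconds per token, times hundreds of tokens, is seconds of wall-clock time; any fixed chip-level latency difference would be lost in the noise next to that.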
My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, whereas TPUs pipeline data through systolic arrays far more. From what I've heard, this generally improves latency and also reduces the overhead of supporting large context windows.
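As a toy illustration of the HBM round-trip point (the shapes and byte sizes below are assumptions, and real kernels are far more nuanced): spilling the intermediate activation between two chained matmuls to HBM and reading it back adds traffic that a fused kernel or a systolic pipeline avoids.

    # Toy comparison of HBM bytes moved for two chained matmuls, with the
    # intermediate activation either spilled to HBM or kept on-chip.
    tokens, d_model = 8192, 8192     # hypothetical long-context prefill
    elem = 2                         # bf16

    weights = 2 * d_model * d_model * elem          # both weight matrices, read either way
    acts_io = 2 * tokens * d_model * elem           # input read + final output write
    intermediate = tokens * d_model * elem          # activation between the two layers

    unfused = weights + acts_io + 2 * intermediate  # write it out, read it back
    fused = weights + acts_io                       # never leaves on-chip memory
    print(f"unfused: {unfused / 1e9:.2f} GB, fused: {fused / 1e9:.2f} GB")

With a long context, the round trip adds traffic comparable to reading the weights themselves, which a fused or pipelined path skips entirely.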
Not only is it not that much more complex, it is often less complex.
Higher-level services like PaaS (Heroku and above) genuinely do abstract away a number of details. But EC2 is just renting pseudo-bare computers: they save no complexity, and they add more by being diskless and requiring networked storage (EBS). The main thing they give you is the ability to spin up arbitrarily many more identical instances at a moment's notice (usually, at least in theory; in practice, how often you actually hit unavailability or shadow quotas is surprisingly high).
Yes, but you can also do the same thing with autoregressive models just by making them smaller. This tradeoff always exists; the question is whether the Pareto curve for diffusion models ever crosses or dominates the best autoregressive option at the same throughput (or quality).
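A minimal sketch of what "crossing the Pareto curve" means operationally, with entirely made-up (throughput, quality) points: a diffusion configuration only matters if no autoregressive configuration matches or beats it on both axes.

    # Hypothetical (tokens/s, quality) points; none of these are real benchmarks.
    ar_points = [(50, 0.90), (120, 0.85), (300, 0.78)]
    diffusion_points = [(100, 0.80), (400, 0.76), (900, 0.70)]

    def dominated(point, others):
        t, q = point
        return any(t2 >= t and q2 >= q for t2, q2 in others)

    # Diffusion configs that no AR config matches or beats on both axes.
    wins = [p for p in diffusion_points if not dominated(p, ar_points)]
    print(wins)   # here, only the very-high-throughput points survive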
I think they weren’t asking “why can’t Gemini 3, the model, just do good transcription,” they were asking “why can’t Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result.”
Exactly that. There is a layer (or more than one) between the user submitting the YT video and the actual model "reading" it and writing the digest. If the required outcome is a digest of a 3-hour video, and the best result comes from passing it first through a specialized transcription model and then through a generic one that can summarize, well, why doesn't Google/Gemini do that out of the box? I mean, I'm probably oversimplifying, but if you read the launch post by Pichai himself, I would expect nothing less than this.
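Something like the following, purely as an illustration of the routing being asked for; every function name here is hypothetical and none of it is a real Gemini or YouTube API:

    from typing import Callable, List

    def digest_video(
        audio_chunks: List[bytes],
        asr: Callable[[bytes], str],   # specialized speech-to-text model
        llm: Callable[[str], str],     # generic model that can summarize
    ) -> str:
        # Step 1: transcribe each chunk with the specialized ASR model.
        transcript = "\n".join(asr(chunk) for chunk in audio_chunks)
        # Step 2: hand the full transcript to the generic model for the digest.
        return llm(f"Write a digest of this 3-hour video:\n{transcript}")

    # Stand-in stubs, just to show the shape of the pipeline.
    fake_asr = lambda chunk: f"[transcript of {len(chunk)} bytes]"
    fake_llm = lambda prompt: f"Digest built from {prompt.count('[transcript')} chunks."
    print(digest_video([b"part1", b"part2"], fake_asr, fake_llm))

The point is that this decomposition should happen behind the API or app, not in the user's own glue code.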
I think the point was not that gem-grade synthetic diamonds are ugly, but that, as the industry masters gem-grade production, presumably below-gem-grade production (“ugly synthetic diamonds”) would become cheap enough to deploy in more engineering settings where diamond’s other unique properties are the key concern.
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" originally triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
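In generic terms (not Amp's actual implementation; the model names and routing below are made up for illustration), the delegation pattern looks like this:

    def route(task: str) -> str:
        """Pick a hypothetical sub-agent tier by task type."""
        if task.startswith("oracle:"):
            return "strong-reasoning-model"    # slow, expensive, for hard questions
        if task.startswith("search:"):
            return "cheap-long-context-model"  # fast, for scanning lots of code
        return "default-coding-model"

    for task in ("oracle: why does this deadlock?", "search: find callers of foo()"):
        print(task, "->", route(task))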
It is a few generations behind: Blackwell is still on N4, which is an N5 variant. Meanwhile, TSMC has been shipping N3-family processes in large-volume products (Apple) for more than two years already, and is starting to ramp the next major node family (N2) for Apple et al. next year.
NVIDIA has often lagged on process, since they drive such large dies, but having the first major project's demo wafer on N4 now is literally two generations behind Taiwan.
Forgot about AMD's brief GPU flirtation with GloFo. ATI used TSMC. I think it was only Polaris that ever shipped anything from NY. That's admittedly a couple of legendary value cards, though.