Why would GPUs necessarily have higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
And our LLMs still have latencies well into the human-perceptible range. If there's any necessary architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.
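To put rough numbers on that (every figure below is an illustrative assumption, not a measured spec): a memory-bandwidth-bound decode step costs on the order of (weight bytes read) / (HBM bandwidth) per token, on either kind of chip.

    # Back-of-envelope sketch with assumed numbers: per-token decode latency
    # when you are bound by reading the weights from HBM.
    weight_bytes = 70e9 * 2        # hypothetical 70B-parameter model in bf16
    hbm_bandwidth = 3.35e12        # ~3.35 TB/s, an H100-class figure

    per_token_s = weight_bytes / hbm_bandwidth
    print(f"~{per_token_s * 1e3:.0f} ms per token")   # ~42 ms

Tens of milliseconds per token, times hundreds of tokens, is seconds of wall-clock time; any fixed chip-level latency difference would be lost in the noise next to that.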
My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, whereas TPUs pipeline data through systolic arrays far more. From what I've heard, this generally improves latency and also reduces the overhead of supporting large context windows.
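As a toy illustration of the HBM round-trip point (the shapes and byte sizes below are assumptions, and real kernels are far more nuanced): spilling the intermediate activation between two chained matmuls to HBM and reading it back adds traffic that a fused kernel or a systolic pipeline avoids.

    # Toy comparison of HBM bytes moved for two chained matmuls, with the
    # intermediate activation either spilled to HBM or kept on-chip.
    tokens, d_model = 8192, 8192     # hypothetical long-context prefill
    elem = 2                         # bf16

    weights = 2 * d_model * d_model * elem          # both weight matrices, read either way
    acts_io = 2 * tokens * d_model * elem           # input read + final output write
    intermediate = tokens * d_model * elem          # activation between the two layers

    unfused = weights + acts_io + 2 * intermediate  # write it out, read it back
    fused = weights + acts_io                       # never leaves on-chip memory
    print(f"unfused: {unfused / 1e9:.2f} GB, fused: {fused / 1e9:.2f} GB")

With a long context, the round trip adds traffic comparable to reading the weights themselves, which a fused or pipelined path skips entirely.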
Not only is it not that much more complex, it is often less complex.
Higher-level services like PaaS (Heroku and above) genuinely do abstract away a number of details. But EC2 is just renting pseudo-bare computers: they save no complexity, and they add more by being diskless and requiring networked storage (EBS). The main thing they give you is the ability to spin up arbitrarily many more identical instances at a moment's notice (usually, at least in theory; in practice, how often you actually hit unavailability or shadow quotas is surprisingly high).
Yes, but you can also do the same thing with autoregressive models just by making them smaller. This tradeoff always exists; the question is whether the Pareto curve for diffusion models ever crosses or dominates the best autoregressive option at the same throughput (or quality).
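A minimal sketch of what "crossing the Pareto curve" means operationally, with entirely made-up (throughput, quality) points: a diffusion configuration only matters if no autoregressive configuration matches or beats it on both axes.

    # Hypothetical (tokens/s, quality) points; none of these are real benchmarks.
    ar_points = [(50, 0.90), (120, 0.85), (300, 0.78)]
    diffusion_points = [(100, 0.80), (400, 0.76), (900, 0.70)]

    def dominated(point, others):
        t, q = point
        return any(t2 >= t and q2 >= q for t2, q2 in others)

    # Diffusion configs that no AR config matches or beats on both axes.
    wins = [p for p in diffusion_points if not dominated(p, ar_points)]
    print(wins)   # here, only the very-high-throughput points survive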
I think they weren’t asking “why can’t Gemini 3, the model, just do good transcription,” they were asking “why can’t Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result.”
Exactly that. There is a layer (or more than one) between the user submitting the YT video and the actual model "reading" it and writing the digest. If the required outcome is a digest of a 3-hour video, and the best result comes from passing it first through a specialized transcription model and then through a generic one that can summarize, well, why doesn't Google/Gemini do that out of the box? I mean, I'm probably oversimplifying, but if you read the launch post by Pichai himself, I would expect nothing less than this.
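Something like the following, purely as an illustration of the routing being asked for; every function name here is hypothetical and none of it is a real Gemini or YouTube API:

    from typing import Callable, List

    def digest_video(
        audio_chunks: List[bytes],
        asr: Callable[[bytes], str],   # specialized speech-to-text model
        llm: Callable[[str], str],     # generic model that can summarize
    ) -> str:
        # Step 1: transcribe each chunk with the specialized ASR model.
        transcript = "\n".join(asr(chunk) for chunk in audio_chunks)
        # Step 2: hand the full transcript to the generic model for the digest.
        return llm(f"Write a digest of this 3-hour video:\n{transcript}")

    # Stand-in stubs, just to show the shape of the pipeline.
    fake_asr = lambda chunk: f"[transcript of {len(chunk)} bytes]"
    fake_llm = lambda prompt: f"Digest built from {prompt.count('[transcript')} chunks."
    print(digest_video([b"part1", b"part2"], fake_asr, fake_llm))

The point is that this decomposition should happen behind the API or app, not in the user's own glue code.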
I think the point was not that gem-grade synthetic diamonds are ugly, but that, as the industry masters gem-grade production, presumably below-gem-grade production (“ugly synthetic diamonds”) would become cheap enough to deploy in more engineering settings where diamond’s other unique properties are the key concern.
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" originally triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
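In generic terms (not Amp's actual implementation; the model names and routing below are made up for illustration), the delegation pattern looks like this:

    def route(task: str) -> str:
        """Pick a hypothetical sub-agent tier by task type."""
        if task.startswith("oracle:"):
            return "strong-reasoning-model"    # slow, expensive, for hard questions
        if task.startswith("search:"):
            return "cheap-long-context-model"  # fast, for scanning lots of code
        return "default-coding-model"

    for task in ("oracle: why does this deadlock?", "search: find callers of foo()"):
        print(task, "->", route(task))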
It is a few generations behind: Blackwell is still on N4, which is an N5 variant. Meanwhile, TSMC has been shipping N3-family processes in large-volume products (Apple) for more than two years already, and is starting to ramp the next major node family (N2) for Apple et al. next year.
NVIDIA has often lagged on process, since they drive such large dies, but having the first major project's demo wafer on N4 now is literally two generations behind Taiwan.
Forgot about AMD's brief GPU flirtation with GloFo. ATI used TSMC. I think it was only Polaris that ever shipped anything from NY. That's admittedly a couple of legendary value cards, though.