thatguysaguy's comments

TPUs predate LLMs by a long time. They were already being used for all the other internal ML work needed for search, YouTube, etc.

A big part of software engineering is maintenance, not just adding features. When you drop a 22,000-line PR without any discussion or previous work on the project, people will (probably correctly) assume that you aren't there for the long haul to take care of it.

On top of that, there's a huge asymmetry when people use AI to spit out huge PRs and expect thorough review from project maintainers. Of course they're not going to review your PR!


AI actually has the advantage here in my experience. Yes, you can do AI wrong and tell it to just change code, write no documentation, provide no notes on the changes, and not write any tests. But you would be dumb to do it that way.

As it stands now you can set AI up to do actual software development with documentation, notes, reasoning for changes, tests, and so on. It isn't exactly easy to do, and a novice to AI and software development definitely wouldn't set it up this way, but it is well within what the tech can do. There is a lot to figure out in using different AIs to write tests and code (don't let an AI that can see the code write the tests, or you'll just get a bunch of change-detector crap), but in general it turns out that all the things SWEs can do to improve their own work work on AI too.


Note that this PR works, was tested, etc.

I was careful to have AI run through the examples in the PR, run lldb on the sample code and make sure the output matches.

Some of the changes didn't make it in before the PR was closed but I don't think anyone bothered to actually check the work. All the discussion focused on the inappropriateness of the huge PR itself (yes, I agree), on it being written by AI... and on the AI somehow "stealing" work code.


I'm actually not talking about whether the PR works or was tested. Let's just assume it was bug-free and worked as advertised. I would say that even in that situation, they should not accept the PR. The reason is that no one is the owner of that code. None of the maintainers will want to dedicate some of their volunteer time to owning your code / the AI's code, and the AI itself can't become the owner of the code in any meaningful way. (At least not without some very involved engineering work on building a harness, and since that's still a research-level project, it's clearly something which should be discussed at the project level, not just assumed).

> but I don't think anyone bothered to actually check the work

Including you


I’ve been finding that the documentation the AI writes isn’t so much for humans as for the AI when it later goes back to work on the code... which is to say, AI benefits from good PRs as much as people do. You could ask the AI to break up the PR next time if possible; it will probably do so much more easily than you could manually.

You can ask AI to write documentation for humans.

Also, I'll try to break up the PR sometime but I'm already running Claude using two $200/mo accounts, in addition to another $200/mo ChatGPT, and still running into time limits.

I want to finish my compilers first.


What forces you to publish this work as a PR, or as many PRs? You could have simply kept that for yourself, since you admitted in the PR discussion that you found it useful. Many people seem to think you haven't properly tested it, so that would also be a good way of testing it before publishing it, wouldn't it?

It's a volunteer-run project... Saying that they have a duty to do anything other than what they want is quite strange.


Verification via LLM tends to break under quite small optimization pressure. For example, I did RL to improve <insert aspect> against one of the SOTA models from one generation ago, and the (quite weak) learner model found that it could emit a few nonsense words to get the max score.

That's without even being able to backprop through the annotator, and also with me actively trying to avoid reward hacking. If arXiv used an open model for review, it would be trivial for people to insert a few grammatical mistakes that cause their papers to receive max points.


FAIR is not older AI... They've been publishing a bunch on generative models.


FAIR is 3000 people, they do tons of different things


Back when BERT came out, everyone was trying to get it to generate text. These attempts generally didn't work; here's one for reference though: https://arxiv.org/abs/1902.04094

This doesn't have an explicit diffusion tie-in, but Savinov et al. at DeepMind figured out that doing two steps at training time and randomizing the masking probability is enough to get it to work reasonably well.
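
Very roughly, the recipe described above looks like the sketch below. This is a minimal PyTorch/transformers sketch of the idea as stated in this comment, not the paper's exact setup; the model name, masking bounds, and the toy sentence are placeholders I picked for illustration.

    # Sketch of "two steps at training time + randomized masking probability",
    # as described in the comment above; not the exact published recipe.
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    batch = tok(["the quick brown fox jumps over the lazy dog"], return_tensors="pt")
    targets = batch["input_ids"].clone()

    # Corrupt the input with a masking probability drawn fresh for this batch.
    p = torch.empty(()).uniform_(0.1, 0.9)
    special = (targets == tok.cls_token_id) | (targets == tok.sep_token_id)
    mask = (torch.rand(targets.shape) < p) & ~special
    corrupted = targets.masked_fill(mask, tok.mask_token_id)

    # Step 1: predict the original tokens from the corrupted input.
    logits1 = model(input_ids=corrupted, attention_mask=batch["attention_mask"]).logits
    loss1 = torch.nn.functional.cross_entropy(logits1.transpose(1, 2), targets)

    # Step 2: feed the model's own sampled predictions back in and predict again,
    # so it learns to refine its own drafts, not just gold context with holes.
    redraft = torch.distributions.Categorical(logits=logits1).sample()
    logits2 = model(input_ids=redraft, attention_mask=batch["attention_mask"]).logits
    loss2 = torch.nn.functional.cross_entropy(logits2.transpose(1, 2), targets)

    (loss1 + loss2).backward()  # one training step; wrap in a real loop with an optimizer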


I'm just learning this from your text, after spending last week trying to get a BERT model to talk.

https://joecooper.me/blog/crosstalk/

I’ve still got a few ideas to try though so I’m not done having fun with it.


The trick is to always put the [MASK] at the end:

"The [MASK]" "The quick [MASK]" etc


I've saved this and I'll study this when I come back to it. Thanks!


Interesting, as I was in the (very large) camp that never considered it for generation and saw it as a pure encoder for things like semantic similarity, with an easy jump to classification, etc.


I would recommend going and reading what the BlueSky leadership actually wrote, rather than this post's summary of it.


Why would you think that DeepSeek is more efficient than GPT-5/Claude 4 though? There's been enough time to integrate the lessons from DeepSeek.


Because to make GPT-5 or Claude better than previous models, you need to do more reasoning, which burns a lot more tokens. So, your per-token costs may drop, but you may also need a lot more tokens.
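
Toy arithmetic to make that point concrete; all prices and token counts below are made-up placeholders, not real pricing for any of these models.

    # Hypothetical figures only: a model that is cheaper per token can still
    # cost more per query if it burns far more reasoning tokens.
    price_old, price_new = 2.00, 1.00        # $ per million output tokens (made up)
    tokens_old, tokens_new = 500, 4_000      # short answer vs. answer + reasoning trace (made up)

    print(price_old * tokens_old / 1e6)      # 0.0010 per query
    print(price_new * tokens_new / 1e6)      # 0.0040 per query: 4x the cost at half the token price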


GPT-5 can be configured extensively. Is there any point at which any configuration of GPT-5 that offers ~DeepSeek level performance is more expensive than DeepSeek per token?


37 billion bytes per token?

Edit: Oh, assuming this is an estimate based on the model weights moving from HBM to SRAM, that's not how transformers are applied to input tokens. You only have to move the weights for every token during generation, not during "prefill". (And actually during generation you can use speculative decoding to do better than this roofline anyways).


> (And actually during generation you can use speculative decoding to do better than this roofline anyways).

And more importantly, batching: taking the example from the blog post, it would be 32 tokens per forward pass in the decoding phase.


There's also an estimate of how much the KV cache grows with each subsequent token. That would be roughly ~MBs/token. I think that would be the bottleneck
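
Back-of-the-envelope for this sub-thread; every number below is an illustrative assumption of mine, not a figure taken from the blog post being discussed.

    # Decode-phase bandwidth, roughly: weights are streamed once per forward pass
    # and amortized over the batch, while each sequence re-reads its own KV cache.
    weight_bytes = 37e9          # assumed bytes of weights read per forward pass
    batch = 32                   # decode batch size from the comment above
    kv_bytes_per_token = 1e6     # assumed KV-cache growth, ~1 MB per token
    context_len = 8_000          # assumed tokens already sitting in each sequence's cache

    weights_per_token = weight_bytes / batch              # ~1.2 GB of weight traffic per generated token
    kv_read_per_token = kv_bytes_per_token * context_len  # ~8 GB of KV reads per generated token

    print(weights_per_token / 1e9, kv_read_per_token / 1e9)
    # With numbers like these, the KV cache rather than the weights becomes the bandwidth
    # bottleneck once contexts get long; prefill is different because one weight pass
    # covers the whole prompt at once.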


Joel's blog in general is a really great read. I highly recommend subscribing.

