A big part of software engineering is maintenance, not just adding features. When you drop a 22,000-line PR without any discussion or previous work on the project, people will (probably correctly) assume that you aren't there for the long haul to take care of it.
On top of that, there's a huge asymmetry when people use AI to spit out huge PRs and expect thorough review from project maintainers. Of course they're not going to review your PR!
AI actually has the advantage here in my experience. Yes, you can use AI badly and tell it to just change code, write no documentation, provide no notes on the changes, and not write any tests. But you would be dumb to do it that way.
As it stands now you can set AI up to do actual software development with documentation, notes, reasoning for changes, tests, and so on. It isn't exactly easy to do this, and a novice to AI and software development definitely wouldn't set it up this way, but that isn't a limit of what the tech can do. There is a lot of room in using different AIs to write the tests and the code (well, don't let an AI that can see the code write the tests, or you'll just get a bunch of change-detector crap), but in general it mostly turns out that all the things SWEs can do to improve their work also work on AI.
I was careful to have the AI run through the examples in the PR, run lldb on the sample code, and make sure the output matched.
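For what it's worth, that kind of check is easy to script. A minimal sketch, assuming a compiled sample at ./sample and an expected_output.txt taken from the PR text (both are hypothetical placeholders, not the actual files from the PR):

```python
# Toy sketch: run a sample binary under lldb in batch mode and compare
# its output against the output claimed in the PR.
# Paths and the expected-output file are placeholders.
import subprocess
import sys

def run_under_lldb(binary: str) -> str:
    # -b/--batch exits when the commands finish; -o runs a single command.
    result = subprocess.run(
        ["lldb", "-b", "-o", "run", binary],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout

if __name__ == "__main__":
    actual = run_under_lldb("./sample")
    expected = open("expected_output.txt").read()
    if expected.strip() not in actual:
        sys.exit("sample output does not match what the PR claims")
    print("output matches")
```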
Some of the changes didn't make it in before the PR was closed, but I don't think anyone bothered to actually check the work. All the discussion focused on the inappropriateness of the huge PR itself (yes, I agree), on it being written by AI... and on the AI somehow "stealing" work code.
I'm actually not talking about whether the PR works or was tested. Let's just assume it was bug-free and worked as advertised. I would say that even in that situation, they should not accept the PR. The reason is that no one is the owner of that code. None of the maintainers will want to dedicate some of their volunteer time to owning your code/the AI's code, and the AI itself can't become the owner of the code in any meaningful way. (At least not without some very involved engineering work on building a harness, and since that's still a research-level project, it's clearly something that should be discussed at the project level, not just assumed.)
I've been finding that the documentation the AI writes isn't so much for humans as for the AI when it later goes back to work on the code... which is to say, AI benefits from good PRs as much as people do. You could ask the AI to break up the PR next time if possible; it will probably do so much more easily than you could manually.
Also, I'll try to break up the PR sometime, but I'm already running Claude using two $200/mo accounts, in addition to another $200/mo ChatGPT, and still running into time limits.
What forces you to publish this work as a PR, or as many PRs? You could have simply kept it for yourself, since you admitted in the PR discussion that you found it useful. Many people seem to think you haven't properly tested it, so that would also be a good way of testing it before publishing, wouldn't it?
Verification via LLM tends to break under quite small optimization pressure. For example, I did RL to improve <insert aspect> against one of the SOTA models from one generation ago, and the (quite weak) learner model found out that it could emit a few nonsense words to get the max score.
That's without even being able to backprop through the annotator, and with me actively trying to avoid reward hacking. If arXiv used an open model for review, it would be trivial for people to insert a few grammatical mistakes that cause them to receive max points.
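To make the failure mode concrete, here's a toy sketch (not the actual setup I used): a brittle keyword-based "judge" stands in for the LLM annotator, and plain random search stands in for the learner. Even with no gradients through the judge, the search quickly finds degenerate strings that score maximally.

```python
# Toy reward-hacking demo: a brittle "judge" (stand-in for an LLM grader)
# scores text by surface features; random search over nonsense strings
# finds a maximal-scoring output without producing anything meaningful.
import random

JUDGE_KEYWORDS = ["rigorous", "novel", "state-of-the-art", "comprehensive"]

def judge_score(text: str) -> float:
    # Stand-in for an LLM annotator: rewards buzzwords, understands nothing.
    score = sum(text.count(k) for k in JUDGE_KEYWORDS)
    return min(score / 4.0, 1.0)

def random_policy(vocab, length=8):
    return " ".join(random.choice(vocab) for _ in range(length))

if __name__ == "__main__":
    vocab = JUDGE_KEYWORDS + ["blue", "fish", "quantum", "banana"]
    best, best_score = "", 0.0
    for _ in range(2000):            # "optimization pressure" via brute search
        cand = random_policy(vocab)
        s = judge_score(cand)
        if s > best_score:
            best, best_score = cand, s
    print(best_score, "<-", best)    # hits 1.0 with pure nonsense
```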
Back when BERT came out, everyone was trying to get it to generate text. These attempts generally didn't work; here's one for reference, though: https://arxiv.org/abs/1902.04094
This doesn't have an explicit diffusion tie-in, but Savinov et al. at DeepMind figured out that doing two steps at training time and randomizing the masking probability is enough to get it to work reasonably well.
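For anyone curious what "getting BERT to generate" looks like in practice, here's a rough sketch of iterative unmask-and-refill decoding with an off-the-shelf masked LM. This is the generic mask-predict idea, not Savinov et al.'s exact training recipe; sequence length, step count, and the half-per-step fill schedule are arbitrary choices.

```python
# Rough sketch of iterative masked-LM decoding: start from an all-[MASK]
# sequence and repeatedly fill in the most confident positions.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def generate(length=12, steps=6):
    ids = torch.full((1, length + 2), tokenizer.mask_token_id)
    ids[0, 0] = tokenizer.cls_token_id
    ids[0, -1] = tokenizer.sep_token_id
    for _ in range(steps):
        with torch.no_grad():
            logits = model(input_ids=ids).logits
        probs = logits.softmax(-1)
        conf, preds = probs.max(-1)
        masked = (ids == tokenizer.mask_token_id)
        if not masked.any():
            break
        # Only consider positions that are still masked; fill half of them.
        conf = conf.masked_fill(~masked, -1.0)
        k = max(1, int(masked.sum()) // 2)
        top = conf[0].topk(k).indices
        ids[0, top] = preds[0, top]
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(generate())
```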
Interesting, as I was in the (very large) camp that never considered it for generation and saw it as a pure encoder for things like semantic similarity, with an easy jump to classification, etc.
Because to make GPT-5 or Claude better than previous models, you need to do more reasoning, which burns a lot more tokens. So your per-token costs may drop, but you may also need a lot more tokens.
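A quick back-of-the-envelope shows how the per-token price can fall while the bill goes up. All numbers below are made up for illustration, not real GPT-5 or DeepSeek prices:

```python
# Back-of-the-envelope: cheaper per-token pricing can still cost more
# overall once reasoning tokens are counted. All numbers are hypothetical.
old_price_per_mtok = 10.0     # $/1M output tokens, hypothetical older model
new_price_per_mtok = 4.0      # $/1M output tokens, hypothetical newer model

answer_tokens = 1_000
old_reasoning_tokens = 0          # older model answers directly
new_reasoning_tokens = 8_000      # newer model "thinks" before answering

old_cost = (answer_tokens + old_reasoning_tokens) / 1e6 * old_price_per_mtok
new_cost = (answer_tokens + new_reasoning_tokens) / 1e6 * new_price_per_mtok

print(f"old: ${old_cost:.4f}  new: ${new_cost:.4f}")
# old: $0.0100  new: $0.0360  -> 60% cheaper per token, ~3.6x the bill
```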
GPT-5 can be configured extensively. Is there any point at which any configuration of GPT-5 that offers ~DeepSeek level performance is more expensive than DeepSeek per token?
Edit: Oh, assuming this is an estimate based on the model weights moving from HBM to SRAM, that's not how transformers are applied to input tokens. You only have to move the weights for every token during generation, not during "prefill". (And during generation you can actually use speculative decoding to do better than this roofline anyway.)
There's also the question of how much the KV cache grows with each subsequent token. That's roughly on the order of MBs per token. I think that would be the bottleneck.
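Rough numbers, using an assumed Llama-2-70B-like shape (80 layers, 128-dim heads, fp16) rather than measured figures; the per-token footprint is 2 (K and V) × layers × KV heads × head dim × bytes per element:

```python
# Rough KV-cache growth per generated token:
#   bytes/token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes/elem
# Model shapes below are assumed (Llama-2-70B-like), not measured.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)  # no GQA
gqa = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)   # 8 KV heads

print(f"MHA: {mha / 2**20:.2f} MiB/token, GQA: {gqa / 2**20:.3f} MiB/token")
# MHA: 2.50 MiB/token; grouped-query attention brings it down to ~0.31 MiB/token
```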