
The marketing copy and the current livestream appear tautological: "it's better because it's better."

Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.



It has the last ~6 months' worth of flavor-of-the-month JavaScript libraries in its training set now, so it's "better at coding".

How is this sustainable?


Who said anything about sustainable? The only goal here is to hobble to the next VC round. And then the next, and the next, ...


It doesn't even have that; the knowledge cutoff is in 2024.


Vast quantities of extremely dumb money


As someone who tries to push the limits of hard coding tasks (mainly refactoring old codebases) with LLMs, without seeing much improvement since the last round of models, I'm finding that we're hitting the flattening part of the S-curve of quality. Obviously getting the same quality cheaper would be huge, but the day-to-day improvement in output quality isn't noticeable to me.


I find it struggles to even refactor codebases that aren't that large. If you have a somewhat complicated change that spans the full stack, with some sort of wrinkle that makes it slightly more complicated than adding a data field, then even the most modern LLMs seem to trip over themselves. Even when I tell it to create a plan for implementation, write it to a markdown file, and then step through those steps in a separate prompt.

Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.


I haven't used GPT-5 yet, but even on a 1,000-line code base I found Opus 4, o3, etc. to be very hit or miss. The trouble is I can't seem to predict when these models will hit. So the misses cost time, reducing their overall utility.


I'm exclusively using sonnet via claude-code on their max plan (opting to specify sonnet so that opus isn't used). I just wasn't pleased with the opus output, but maybe I just need to use it differently. I haven't bothered with 4.1 yet. Another thing I noticed is opus would eat up my caps super quick, whereas using sonnet exclusively I never hit a cap.

I'd really just love incremental improvements over sonnet. Increasing the context window on sonnet would be a game changer for me. After auto-compact the quality may fall off a cliff and I need to spend some time bringing it back up to speed.

When I need a bit more punch for more reasoning / architecture type evaluations, I have it talk to gemini pro via zen mcp and OpenRouter. I've been considering setting up a subagent for architecture / system design decisions that would use the latest opus to see if it's better than gemini pro (so far I have no complaints though).


This, plus I really doubt we will ever "be there". Software engineering evolves over time, and so far it's human engineers who drive the innovation in the field.


Agree, I think they'll need to move to performance now. If a model were comparable to Claude 4 but took 500ms or less per edit, the quicker feedback loop would be a big improvement.


> Not much explanation yet why GPT-5 warrants a major version bump

Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others


> Will wait for vibe check from simonw

https://openai.com/gpt-5/?video=1108156668

2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."

4:12 "The bicycle was flawless."

5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"

Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264


I feel like we need to move on from using the same tests on models. As time goes on, information about these specific tests ends up in the training data, and while I'm not saying that's happened in this case, there's nothing stopping model developers from adding extra data for these tests directly to the training data to make their models seem better than they are.


This effectively kills this benchmark.


Honestly, I have mixed feelings about him appearing there. His blog posts are a nice way to be updated about what's going on, and he deserves the recognition, but he's now part of their marketing content. I hope that doesn't make him afraid of speaking his mind when talking about OpenAI's models. I still trust his opinions, though.


Yeah, even if he wasn't paid to appear there, this seems a bit too close.


The pelican is still a mess.


Damn Theo is really a handsome dude.


Yeah. We've entered the smartphone stage: "You want the new one because it's the new one."


I think the biggest tell for me was having the leader of Cursor, who has been a big proponent of Claude in Cursor for the last year, up there vouching for the model. Doesn't seem like a light statement.


When they were about to release GPT-4, I remember the hype was so high there were a lot of AGI debates. But that was quickly overshadowed by more advanced models.

People knew that GPT-5 wouldn't be AGI or even close to it. It's just an updated version. GPT-N will become more or less an annual release.


There's a bunch of benchmarks on the intro page including AIME 2025 without tools, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard (not familiar with this one): https://openai.com/index/introducing-gpt-5/

Pretty par-for-the-course eval setup for a launch.


I didn't think GPT-4 warranted a major version bump. I do not believe that OpenAI's benchmarks are legitimate, and I don't think they have been for quite some time, if ever.


For fun, I asked it how much better it is than GPT-4. It started a rap battle against itself :P

https://chatgpt.com/share/6895d5da-8884-8003-bf9d-1e191b11d3...


It's >o3 performance at GPT-4 price. Seems pretty obvious.


o3 pricing: $8/Mtok out

GPT-5 pricing: $10/Mtok out

What am I missing?


It's more efficient with tools, for one, and the input cost is cheaper (which is where a lot of the cost is).

See comparison between GPT-5, 4.1, and o3 tool calling here: https://promptslice.com/share/b-2ap_rfjeJgIQsG.


That you can run Deepseek for 50 cents.


It seems like you might need fewer output tokens for the same quality of response, though. One of their plots shows o3 needing ~14k tokens to get 69% on SWE-bench Verified, but GPT-5 needing only ~4k.
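
A rough back-of-the-envelope using those figures (a sketch in Python; the token counts are eyeballed from their plot, and input/cached-token costs are ignored):

    # Output-token cost per SWE-bench Verified task, using the quoted
    # $/Mtok output prices and approximate per-task token counts.
    o3_price_per_tok = 8 / 1_000_000      # $8 per million output tokens
    gpt5_price_per_tok = 10 / 1_000_000   # $10 per million output tokens

    o3_tokens_per_task = 14_000           # ~tokens to reach 69% (eyeballed)
    gpt5_tokens_per_task = 4_000          # ~tokens for a similar score (eyeballed)

    print(f"o3:    ${o3_tokens_per_task * o3_price_per_tok:.3f} per task")      # ~$0.112
    print(f"GPT-5: ${gpt5_tokens_per_task * gpt5_price_per_tok:.3f} per task")  # ~$0.040

So even at a higher output price per token, the per-task output cost could come out lower, assuming those plots hold up in practice.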


O3 has had some major price cuts since Gemini 2.5 Pro came out. At the time, o3 cost $10/Mtok in and $40/Mtok out. The big deal with Gemini 2.5 Pro was it had comparable quality to o3 at a fraction of the cost.

I'm not sure when they slashed the o3 pricing, but the GPT-5 pricing looks like they set it to be identical to Gemini 2.5 Pro.

If you scroll down on this page you can see what different models cost when 2.5 Pro was released: https://deepmind.google/models/gemini/pro/


Pretty sure reduced cached-input pricing is a pretty big deal for reasoning models, but I'm not positive.


It just matches the 90% discount that Claude models have had for quite a while. I don't see anything groundbreaking...
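
For a sense of scale, a quick sketch with made-up numbers (placeholder prices, not any vendor's actual rates) of why a 90% cached-input discount matters for agent loops that re-send a large, mostly static context every turn:

    # Hypothetical agent loop: a 100k-token context re-sent for 20 turns,
    # plus ~2k fresh input tokens per turn. Prices are placeholders, and
    # the first turn's cache write is ignored for simplicity.
    input_price = 2.00 / 1_000_000        # $ per uncached input token (made up)
    cached_price = input_price * 0.10     # 90% discount on cache reads

    turns, context_tokens, fresh_per_turn = 20, 100_000, 2_000

    no_cache = turns * (context_tokens + fresh_per_turn) * input_price
    with_cache = turns * (context_tokens * cached_price + fresh_per_turn * input_price)

    print(f"no cache:   ${no_cache:.2f}")    # ~$4.08
    print(f"with cache: ${with_cache:.2f}")  # ~$0.48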


We’re at the audiophile stage of LLMs where people are talking about the improved soundstage, tonality, reduced sibilance etc


Note GPT-5's subtle mouthfeel reminiscent of cranberries with a touch of bourbon.


Explains why I find AGI fundamentalists similar to tater heads. /s

(Not to undermine progress in the foundational model space, but there is a lack of appreciation for the democratization of domain specific models amongst HNers).


Every bourbon tastes the same unless it's Weller, King's County Peated, or Pappy (or Jim Beam for the wrong reasons lol)


Tbh, a mid-shelf Four Roses gets you 90% of the way to an upper-shelf Weller.


I'm being hyperbolic, but yeah, Four Roses is probably the best deal next to Buffalo Trace. All their stuff is fairly priced. If you want something like Weller, though, you should get another wheated bourbon like Maker's Mark French Oaked.


Buffalo Trace is ridiculously overpriced nowadays. Good bourbon, but def not worth $35-40 for 750ml.

> you should get another wheated bourbon like Maker's Mark French oaked

I agree. I've found Maker's Mark products to be great bang for your buck, quality-wise and flavor-wise.


If you can find Buffalo Trace for MSRP, which is $20-30, it's a good deal. I think the bourbon "market" kind of popped recently, so finding things has been getting a little easier.


Yep! I agree! At MSRP BT is a great buy.

> I think the bourbon "market" kind of popped recently

It def did. The overproduction that was invested in during the peak of the COVID collector boom is coming into markets now. I think we'll see some well-priced, age-stated products in the next 3-4 years, based on what acquaintances in the space tell me.

Ofc, the elephant in the room is consolidation - everyone wants to copy the LVMH model (and they say Europeans are ethical elves who never use underhanded monopolistic and market-making behavior to corner markets /s).


I can already see LLM sommeliers: yes, the mouthfeel and punch of GPT-5 is comparable to that of Grok 4, but its tenderness lacks the crunch of Gemini 2.5 Pro.


Isn't that exactly what the typical LLM discourse is about? People just throw anecdotes around and stick with their opinions. A is better than B because C, and that's basically it. And whoever tries to actually benchmark them gets called out because all the benchmarks are gamed. Go figure.


You need to burn in your LLM by using it for 100 hours before you see its true performance.


Well, reduced sibilance is an ordinary and desirable thing. A better "audiophile absurdity" example would be $77,000 cables, freezing CDs to improve sound quality, using hospital-grade outlets, cryogenically frozen outlets (lol), the list goes on and on


I feel sorry for audiophiles because they have to work so much harder to get the same enjoyment of music that I get via my laptop speakers


That's just the other extreme, which is not that much less silly. It's not unreasonable to spend $300 on a good pair of headphones.


The "audiophile" attitude is such that that "work" is enjoyment. It's a game, a hobby. I'm not defending the extremes of it, but it's not like these people are miserable, they enjoy doing it even if it rapidly becomes completely insane nonsense entirely detached from reality.


I've never thought of it that way, thanks for mentioning it.

I now wonder if I have any such hobbies. Probably not to the same extent as audiophiles, but some software-related stuff could come close.


Always have been. This LLM-centered AI boom has been the craziest and most frustrating social experiment I've seen, propped up by the rhetoric (with no evidence to back it up) that this time we finally have the keys to AGI (whatever the hell that means), and infused with enough astroturfing to drive the discourse into ideological stances devoid of any substance (you must be either a true believer or a naysayer). On the plus side, it appears this hype train is hitting a bump with GPT-5.


Come on, we aren't even close to the level of audiophile nonsense like worrying about what cable sounds better.


We're still at the stage of which LLM lies the least (but they all do). So yeah, no different than audiophiles really.


Informed audiophiles rely on Klippel output now


The empirical ones do! There's still a healthy sports car element to the scene though, at least in my experience.


You're right, it's hard to admit you can buy a $50 speaker and sub and EQ it to 95% of maximum performance.


This is and isn't true.

The room is the limiting factor in most speaker setups. The worse the room, the sooner you hit diminishing returns for upgrading any other part of the system.

In a fantastic room a $50 speaker will be nowhere near 95% of the performance of a mastering monitor, no matter how much EQ you put on it. In the average living room with less than ideal speaker and listening position placement there will still be a difference, but it will be much less apparent due to the limitations of the listening environment.


Absolutely not true.

You might lose headroom or have to live with higher latency, but if your complaint is about actual empirical data like frequency response or phase, that can be corrected digitally.


You can only EQ speakers and headphones as far as the transducer can still respond accurately to the signal you're sending it. No amount of EQ will give the Sennheiser HD 600s good sub-bass performance, because the driver begins to distort the signal long before you've amplified it enough to match the Harman target at a normal listening level.

DSP is a very powerful tool that can make terrible speakers and headphones sound great, but it's not magic.
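
For anyone wondering what the "EQ" knob actually is here, a minimal sketch of a single parametric (peaking) band using the standard RBJ biquad formulas; the 60 Hz / +6 dB / Q values are arbitrary examples, not a correction for any particular headphone:

    import numpy as np
    from scipy.signal import lfilter

    def peaking_eq(fs, f0, gain_db, q):
        """Biquad coefficients for one peaking EQ band (RBJ audio EQ cookbook)."""
        A = 10 ** (gain_db / 40)
        w0 = 2 * np.pi * f0 / fs
        alpha = np.sin(w0) / (2 * q)
        b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
        a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
        return b / a[0], a / a[0]

    fs = 48_000
    x = np.random.randn(fs)                             # 1 s of noise as a stand-in signal
    b, a = peaking_eq(fs, f0=60, gain_db=6.0, q=0.7)    # +6 dB boost centered at 60 Hz
    y = lfilter(b, a, x)                                # the "EQ'd" signal

A correction curve is basically a stack of bands like this (plus level and delay), which is why it can flatten measured response but can't conjure excursion or headroom the driver doesn't have.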


> You might lose headroom

Pretty much my first point… At the same time, that same DSP can make a mediocre speaker that can reproduce those frequencies do so in phase at the listening position, so once again the point is moot: effectively, add a cheap sub.

There is no time where you cannot get results from mediocre transducers given the right processing.

I’m not arguing you should, but in 2025 if a speaker sounds bad it is entirely because processing was skimped on.


This varies wildly with what frequency range you're talking about. Bass region, yes - room geometry makes a big difference. The rest of the range, DSP is your friend. Loudspeakers and Rooms by Floyd Toole is an awesome resource here.


Ah, the aforementioned snake oil.


It’s always been this way with LLMs.



