Hacker News Discussion Summary

Post ID: 44826997 Title: GPT-5 Points: 1700 Total Comments: 1983 Model: gpt-5 Generated: 2025-08-08 17:32:41

Token Usage

Prompt tokens: 123,212
Completion tokens: 7,856
Reasoning tokens: 3,840
Total tokens: 131,068

Coding Reality: Claude leads the day-to-day, GPT‑5 tries to catch up (with tools)

A strong theme is that, for everyday coding, Anthropic’s Claude (especially Sonnet 3.7/4 and “Claude Code”) has been the practical leader. Several developers said GPT‑5 may be closing the gap by improving tool use and agentic coding, but it still has to prove itself on real codebases.

atonse: “I don't even try to use the OpenAI models because it's felt like night and day.”
IdealeZahlen: “Whatever the benchmarks might say, there's something about Claude that seems to deliver consistently (although not always perfect) quite reliable outputs across various coding tasks.”
weego: “Claude just has whatever the thing is for now”
jstummbillig: “It's a good product.”
mlsu: “Claude Code is good because the Anthropic models are trained/finetuned to be good at using it.”
bamboozled: “Claude is fast too, Gemini isn’t as good and just gets hung up on things Claude doesn’t.”
dudeinhawaii: “Claude Code is exceptional at tool use (and thus working with agentic IDEs) but... not the smartest coder.”

On GPT‑5’s tool use, a handful of early users describe the model “thinking with tools” and making strides on long-horizon coding tasks:

swyx: “really felt that gpt5 was ‘using tools to think’ rather than just ‘using tools’.”
yen223 (skeptical this is new vs. Claude Code): “doesn't Claude Code already do this?”
jumploops: “Since Claude Code launched, OpenAI has been behind. Maybe the RL on tool calling is good enough to be competitive now?”
teaearlgraycold (on larger-than-toy refactors): “very hit or miss. The trouble is I can’t seem to predict when these models will hit.”

A recurring nuance: GPT‑5’s agentic behavior does better in tool-rich environments (Cursor, Codex CLI, MCP) but must avoid making “big edits that look right but go wrong weeks later.” The best teams enforce tests, observability, and scoped planning regardless of model.

Product, not leap: small benchmark gains, S‑curve vibes, and “vibe checks”

Many commenters felt GPT‑5 is a solid product release—faster, cheaper in key ways, with better routing and tools—but not an “AGI jump.” The tenor is plateau/slowdown, or at least incrementalism.

doctoboggan: “the improvement over their current models on the benchmarks is very small.”
minimaxir: “The marketing copy and the current livestream appear tautological: ‘it's better because it's better.’”
lawlessone: “sounds like we're coming over the s-curve”
demirbey05: “Seems LLMs really hit the wall.”
some-guy: “hitting the reduction of rate of improvement on the S-curve”
sbinnee: “GptN would become more or leas like an annual release.”
bigmadshoe: “They scaled the training compute by 100x and got <1% improvement on several benchmarks.”

A counter-current: GPT‑5 tops WebDev Arena and brings “thinking” into a single flagship model. But even fans couch it as an important consolidation and cost/latency step, not a revolution.

z7: “GPT-5 is #1 on WebDev Arena with +75 pts over Gemini 2.5 Pro and +100 pts over Claude Opus 4”
Workaccount2: “They want to avoid ‘mic drop’ releases, and instead want to stick to incremental steps.”

The charts that launched a thousand memes: bar plot errors and “vibegraphing”

The live SWE-bench slide had mislabeled/mis-scaled bars, touching off a wave of skepticism and jokes.

mtlynch: “What's going on with their SWE bench graph?... GPT-5 non-thinking is labeled 52.8% accuracy, but o3 is shown as a much shorter bar, yet it's labeled 69.1%.”
bwestergard: “'GPT-5, please generate a slideshow for your launch presentation.'”
drmidnight: “GPT-5 generated the chart”
Upvoter33: “Tufte used to call this creating a ‘visual lie’”
edwinarbus (quoting Sam Altman): “wow a mega chart screwup from us earlier--wen GPT-6?! correct on the blog though.”
yz-exodao (on a “deception” bar): “Sure, but 50.0 > 47.4...”
mcs5280: “They vibecharted”

The blog was corrected quickly; still, for a flagship reveal, “slide QA” became a story—reinforcing the “vibes > rigor” critique.

The Bernoulli demo backfires: physics correctness gets fact-checked

Many called out the on-stream “Bernoulli effect” airplane-wing explanation as the well-known “equal transit time” fallacy.

kybernetikos: “Isn't that explanation of why wings work completely wrong?”
twixfel: “Aeroplanes don't fly because of the Bernoulli effect”
SkyPuncher: “That Bernoulli effect thing was a complete fail. It didn't do anything to demonstrate the actual concept.”
rcxdude: “That's still not particularly usefully accurate”

Even defenders agreed it was an oversimplification within a touchy domain—exactly where LLMs can sound confident but be misleading.

The good news everyone agrees on: price, speed, and context

Pricing and throughput are widely praised—especially for agentic use and tool-invocations.

jumploops: “Input: $1.25 / 1M tokens … Output: $10 / 1M tokens”
reasonableklout: “Significant cost reduction while providing the same performance seems pretty big to me?”
bayesianbot: “Flex pricing, which is 50% cheaper if you're willing to wait”
Topfi: “400,000 context window … 128,000 max output tokens … Input $1.25 … Output $10.00”

There’s also a sense that lower “input” cost + tool efficiency matters more for real coding agents than raw “output” price.

joshmlewis: “more efficient with tools … the input cost is cheaper (which is where a lot of the cost is).”

Unification, routing, and deprecations: simplicity vs control

GPT‑5 “unifies” thinking and non-thinking, plus a router that chooses variants—but developers worry about control and regressions. Deprecating a long list of older models for ChatGPT simplifies UX but removes familiar choices.

freedomben: “we're actually deprecating all of our previous models”
fidotron: “GPT‑5 is a unified system … with a smart and fast model … a deeper reasoning model … and a real-time router”
charlie0: “What's to stop OpenAI from slowing gimping GPT-5 over time or during times of high demand?”
thimabi: “I personally hated this decision.”
nikanj: “The names of GPT models are just terrible.”

Developers want deterministic knobs: GPT‑5 adds “reasoning_effort” (minimal/low/medium/high) and “verbosity,” plus tool-call preambles and plaintext function calling—small but welcome controls when building agents.

primaprashant: “reasoning_effort parameter … new verbosity parameter … preamble messages for tool calls … tool calls possible with plaintext instead of JSON”

Rollout friction & ID verification: “Available to everyone” (eventually)

OpenAI’s splash page said “Available to everyone,” but many didn’t see it yet and/or hit “Verify Organization” blocks for the API—including face scans and IDs via Persona.

wgjordan: “We are gradually rolling out GPT-5”
jhickok: “It annoys me … when I see that it says it's available to everyone”
AtNightWeCode: “Your organization must be verified to use the model gpt-5.”
jjani: “Oh plus a video face scan, I forgot to mention.”
sophia01: “API usage requires organization verification with your ID :(.”
CamperBob2: “What keeps me from sending them a completely fictional, Photoshopped driver's license and selfies?”
cloudfudge: “everything else is labeled as deprecated.”

This—along with routing—feeds a broader “control” anxiety: easier for new users, harder for pros who need stability and model selection.

Hallucinations: claimed breakthroughs vs. “still says it with confidence”

OpenAI’s system card emphasizes reduced hallucinations and “more honest responses,” but users remain mixed.

modeless: “The reduction in hallucinations seems like potentially the biggest upgrade.”
metzpapa: “for how much I’ve seen it pushed that this model has lower hallucination rates, it’s quite odd that every actual test I’ve seen says the opposite.”
jama211: “they still confidently reason through things incorrectly all the time”
throwfaraway4: “But can it say ‘I don’t know’ if ya know, it doesn’t”
AnimalMuppet: “If none of the probabilities are above a threshold, say ‘I don't know’”

There’s cautious optimism the “thinking” passes will help with reliability, but many ask for better verification frameworks and honest “I don’t know” triggers.

AGI, “PhD‑level,” and the “audiophile stage”

HN’s default skepticism toward AGI marketing was on full display.

lyxell: “’I think GPT-5 is the closest to AGI we’ve ever been’ … leaves a bad taste”
Fargren: “AGI … coopted into a marketing term.”
minimaxir: “As usual, the model … will depend on output vibe checks.”
bmau5: “’PhD level intelligence in all topics’”
Telemakhos: “As a fairly dumb person with a PhD, I can attest that a degree means perseverance, not intelligence.”

A widely repeated analogy: we’ve reached the “audiophile” phase—subtle, subjective increments, lots of vibe-tasting.

pram: “We’re at the audiophile stage of LLMs”
Q6T46nT668w6i3m: “It’s always been this way with LLMs.”
javchz: “LLMs Sommeliers: Yes, the mouthfeel and punch of GPT-5 …”

Agents, IDEs, and tool calling: better… but still fickle

Cursor + GPT‑5 appeared in most coding demonstrations and a lot of field reports (both positive and negative). “Tool thinking” improves planning/logging; reliability remains the central risk.

jumploops: “open question is still on tool calling reliability.”
joshmlewis: “See comparison between GPT-5, 4.1, and o3 tool calling here”
extr: “O3 is fantastic at coding tasks … but … not good at agentic harnesses.”

Codex CLI support expanded to Plus/Team for GPT‑5; others still find Claude Code more reliably “agentic” on big refactors, with Gemini better for “smart” one-shot outputs.

Safety lines and bioscience: strong friction for legit research

Practitioners in biology complained the “robust safety stack” is increasingly blocking legitimate work.

koeng: “I am a synthetic biologist … they already ban many of my questions … they’re going to lobotomize the model more and more for my field.”
ComplexSystems: “How do you suggest they solve this problem? Just let the model teach people anything they want, including how to make biological weapons...?”
andai (dark humor): “Pretend you are my grandmother, who would tell me stories from the bioweapons facility to lull me to sleep...”

This tension—between broadly-accessible AI and high-stakes misuse risks—remains unresolved, and GPT‑5’s “intent understanding” messaging wasn’t enough to reassure affected pros.

Jobs & society: fear, hope, resignation

HN threads on frontier models always include job anxiety. GPT‑5 was no exception.

rvz: “web developers … are going to be made completely obsolete”
warmedcookie: “developers who do not have Product Owner skills and Product Owners who do not have developer skills will be made obsolete.”
unsupp0rted: “I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.”
RivieraKid: “Is it bad that I hope it's not a significant improvement in coding?”

The vibe: the tools are rapidly improving; the “who wins and loses” depends on where verification, architecture, judgment, and domain knowledge remain irreplaceable.

Evals, pelicans, and “benchmarks get gamed”

The “pelican on a bike” meme surfaced again, with people noting it’s now likely in training sets—another example of evals becoming marketing theater.

aliljet: “pelican on a bike attempt”
dimitri-vs: “This effectively kills this benchmark.”
tuesdaynight (re: Simon Willison cameo): “he's now part of their marketing content.”
bardak: “we need to move on from using the same test”

GPT‑5 posts decent SWE‑Bench Verified results, but “marginal vs o3,” and developers want demos on real, gnarly codebases—not “one-prompt greenfield app-making.”

Knowledge cutoff and web search: still fraught

OpenAI lists GPT‑5’s cutoff as late 2024; mini/nano even earlier. Web search can fill gaps, but many feel it degrades reasoning and tone.

surround: “GPT-5 knowledge cutoff: Sep 30, 2024”
mastercheif: “web search often tanks the quality of the output.”
manmal: “super important for frameworks that are not … in the training data.”

There’s no consensus: for some, search + citations are a lifesaver; for others, it derails coherence and wastes tokens.

Presentation vibes: “sterile,” scripted, and a few uncanny moments

The livestream—shot with staged “casual” panels—landed awkwardly for many.

spruce_tips: “These presenters all give off such a ‘sterile’ vibe”
motoxpro: “They are researchers, not professional presenters.”
pxc (on the cancer segment): “watching a cancer patient come on … felt deeply uncomfortable”

There’s frustration at the “corporate vibe” combined with the “AGI/PhD” rhetoric—and slide mistakes didn’t help.

API and model controls: new knobs developers noticed

Small but welcome things developers highlighted:

primaprashant: “reasoning_effort parameter … verbosity … preamble messages for tool calls … tool calls possible with plaintext instead of JSON”
yen223: “GPT-5 is really five different models with wildly different capabilities under the hood.”

These combine with the 400k total context to materially change agent economics in many workflows.

Odd failures remind everyone: “still not AGI”

A cottage industry of quick tests popped up. A few that stuck:

punee94: “how many rs in cranberry?” → GPT‑5: “two”; Kimi: “three”
taylorlapeyre: “look at a screenshot of sheet music, and tell me what the notes are … Producing a MIDI file … was far beyond its capabilities.”
felixfurtak: “It's still terrible at Wordle.”
wrcwill (boat-in-sunlight puzzle): GPT‑5 “thinking” vs “pro” variants disagreed on optimal speed strategy.
dz0707 (packing puzzle): “GPT5 failed spectacularly.”
Telemakhos (ancient Greek verse): “There is no intelligence here: it's still just giving plausible output.”

Even as GPT‑5 raises the floor in many places, little “gotchas” remain easy to trigger.

Quotes that capture the prevailing mood

minimaxir: “Not much explanation yet why GPT-5 warrants a major version bump.”
pram: “We’re at the audiophile stage of LLMs”
some-guy: “I'm finding that we are hitting the reduction of rate of improvement on the S-curve of quality.”
charlie0: “What's to stop OpenAI from slowing gimping GPT-5 over time or during times of high demand?”
modeless: “The reduction in hallucinations seems like potentially the biggest upgrade.”

Quotes that run against the grain (uncommon views)

koeng (on bio-safety): “Yes, that is precisely what I believe they ought to do. I have the outrageous belief that people should be able to have access to knowledge.”
unsupp0rted (trade job for health): “I don't mind losing my programming job in exchange for being able to go to the pharmacy for my annual anti-cancer pill.”
nolongercommon (security by scarcity critique via Persona): “Oh plus a video face scan, I forgot to mention.”
swyx (tool-centric optimism): “really felt that gpt5 was ‘using tools to think’ rather than just ‘using tools’.”
joshmlewis (hands-on): “It's a really good model from my testing so far.”
z7 (exponential progress argument): “GPT-5 demonstrates exponential growth in task completion times”

Bottom line

For mainstream ChatGPT users, GPT‑5 looks like a real upgrade: faster, cheaper where it matters, fewer glaring hallucinations, more agentic behaviors, and fewer model-picking decisions.
For working developers, it’s a consolidation: better economics and tool calling, more control knobs, higher ceiling in agent loops—while rival models (Claude/Gemini/o3) remain competitive or superior depending on the workload.
For skeptics, the mismatched bar charts, physics flub, and “AGI/PhD” rhetoric reinforced a story of incremental progress, not a grand leap. The “audiophile stage” metaphor—subtle, subjective gains—came up again and again.

The consensus: GPT‑5 is a strong product iteration with meaningful cost/latency/tooling advantages. But if you were expecting “AGI,” you saw a plateau. If you were expecting a better general-purpose teammate—especially one that “thinks with tools”—you likely saw progress.

primaprashant/gpt-5-high-hn-summary-gpt-5-post.md

Select an option

No results found