Post ID: 45967211 Title: Gemini 3 Points: 1312 Total Comments: 814 Model: openai/gpt-5.1 Generated: 2025-11-19 15:22:22 JST
- Prompt tokens: 55,948
- Completion tokens: 13,129
- Reasoning tokens: 3,735
- Total tokens: 69,077
Commenters see Gemini 3 as a technically impressive, sometimes startling upgrade over Gemini 2.5—especially in math, structured reasoning, multimodal generation, and some long‑context tasks—but they’re deeply divided on how much that matters in practice.
Themes include: rollout confusion, pricing, benchmark skepticism, real‑world coding and “agentic” use, multimodal wins and failures, privacy concerns, UI/product issues, and broader worries about Google’s power and AI’s social impact.
Below, each section summarizes a major theme with representative quotations (usernames in backticks, direct quotes in double quotes as requested), followed by a final section of more unusual or contrarian takes.
Early on, lots of people just wanted to know: where is it, and why does it say “confidential”?
`nilsingwersen` joked: "Feeling great to see something confidential". `RobinL` hit early rate limits: "Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'." Later they edited: "[Edit: working for me now in ai studio]".
Several users on AI Studio's free tier got quota errors, like `sd9`: "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
People slowly piece together where Gemini 3 is live:
`guluarte`: "it is live in the api" and lists the "gemini-3-pro-preview" endpoints. `samuelknight`: "Gemini 3 Pro Preview" is in Vertex. `netdur`: "On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro".
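For readers who want to poke at the preview themselves, here is a minimal sketch using the google-genai Python SDK (`pip install google-genai`). The model ID is the one `guluarte` quotes; the prompt and environment-variable auth are just standard SDK usage, not anything thread-specific.

```python
# Minimal sketch: call the preview model named in the thread via the google-genai SDK.
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",          # model ID quoted in the thread
    contents="Say hello in one sentence.",
)
print(response.text)
```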
Some see signs of a sloppy or semi‑leaked launch:
`informal007`: "It seem that Google doesn't prepare well to release Gemini 3 but leak many contents, include the model card early today and gemini 3 on aistudio.google.com".
Quotas were confusing even for paying users:
`mil22`: "It's available to be selected, but the quota does not seem to have been enabled just yet." Later: "Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled." `mikeortman` on transcription: "Its available for me now in gemini.google.com.... but its failing so bad at accurate audio transcription." They also hit weird behavior where "Fast mode only transcribed about a fifth of the meeting before saying its done".
Many found Google’s overall plan matrix baffling:
`mccoyb`: "I truly do not understand what plan to use so I can use this model for longer than ~2 minutes." `dktp` concedes: "All in all, Google is terrible at launching things like that in a concise and understandable way".
Gemini CLI in particular had a waitlist and rough edges:
`xnx`: "There's a waitlist for using Gemini 3 for Gemini CLI free users". `mantenpanther` (paying for AI Ultra) still couldn't use it: "I am paying for AI ultra - no idea how to use it in the CLI. It says i dont‘t have access. The google admin/payment backend is pure evil. What a mess." `sunaookami` hit a crash bug and auth failure: "Gemini CLI crashes due to this bug... and when applying the fix ... I can't login with my Google account due to 'The authentication did not complete successfully.'"
On Android, some users feel Gemini is being pushed rather than chosen:
`aniforprez`: "my phone has had Gemini force installed on an update and I've only opened the app by accident while trying to figure out how to invoke the old actually useful Assistant app". `realusername`: "my power button got remapped to opening Gemini in an update...". `edaemon`: "I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app."
Google raised prices compared to Gemini 2.5 Pro, especially for input tokens and long context:
`__jl__`: "API pricing is up to $2/M for input and $12/M for output" and compares: "Gemini 2.5 Pro was $1.25/M for input and $10/M for output".
`GodelNumbering` quantifies the hike:
- "Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)"
- "Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)"
For long context: "Input $4.00 vs $2.50" and "Output $18.00 vs $15.00".
Reactions split between worry and acceptance:
`rudedogg`: "Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned." `icyfox` is more sanguine: "I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model."
People benchmarked Google against Anthropic and OpenAI:
`raincole`: "Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output." `fosterfriends`: "Gemini 3 and 3 pro are good bit cheaper than Sonnet 4.5 as well. Big fan".
There’s also a subtle but important change in search/grounding pricing:
`dktp` notes it shifted from per grounded prompt to per search query: "It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)".
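The practical difference: a single grounded prompt can fan out into several search queries, so per-search billing scales with the model's search behavior rather than with your request count. A toy illustration; both prices and the fan-out factor below are invented numbers, not Google's actual rates.

```python
# Illustrative only: prices and query counts are made up.
price_per_grounded_prompt = 0.01   # old scheme: flat fee per grounded prompt
price_per_search_query = 0.01      # new scheme: fee per search query issued
queries_per_prompt = 4             # one prompt may trigger several searches

old_cost = price_per_grounded_prompt                    # billed once per prompt
new_cost = price_per_search_query * queries_per_prompt  # billed per query
print(old_cost, new_cost)  # 0.01 vs 0.04 for the same request
```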
Google’s model card and blog tout big wins on ARC‑AGI, MathArena, SWE‑Bench variants, ScreenSpot, NYT Connections, etc. Some commenters are impressed, others assume heavy “benchmaxxing.”
On ARC‑AGI‑2:
`tylervigen`: "Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs..." and calls the improvement "mind-boggling". `grantpitt`: "Agreed, it also leads performance on arc-agi-1."
On math specifically:
`panarky` emphasizes MathArena Apex: "It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1." and concludes: "This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute."
But others see marketing games:
`energy123` notes Deep Think uses tools while the baselines don't: "the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement." `svantana`, about SWE-Bench-Verified: "SWEBench-Verified is probably benchmaxxed at this stage. Claude isn't even the top performer, that honor goes to Doubao."
Suspicion that benchmarks leak into training or are directly optimized against is widespread:
`briga`: "Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models?" `riku_iki` is blunt: "they can target benchmark directly, not just replica. If google or OAI are bad actors, they already have benchmark data from previous runs." `energy123`: "The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable."
A number of people now treat benchmarks mostly as marketing:
`spookie`: "Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?" `Workaccount2`: "It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage." `ramesh31`: "It's almost impossible to truly know a model before spending a few million tokens on a real world task."
Many HN users rely on their own hidden tests rather than public benchmarks.
`prodigycorp` kicked off a big subthread:
"I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark."- They used it to argue that
"benchmarks are meaningless – you should always curate your own out-of-sample benchmarks."
Others push back on both the word “meaningless” and the secrecy:
`WhitneyLand`: "No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case." `dekhn` suggests sharing: "Using a single custom benchmark as a metric seems pretty unreliable to me." and: "Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it."
Several users explicitly don’t want to publish their tests:
`petters`: "Good personal benchmarks should be kept secret :)". `ahmedfromtunis`: "I don't think it would be a good idea to publish it on a prime source of training data." `lofaszvanitt`: "No, do not share it. The bigger black hole these models are in, the better."
After "taking a walk," `prodigycorp` re-evaluates:
"after taking a walk for a bit i decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run."- They now say:
"This probably means my test is a little too niche."and reaffirm the broader lesson:"While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted."
This encapsulates a consensus: you must test models against your own workflows, but a single private test is a poor global indicator.
People share their own quirky benchmarks:
`thefourthchime`: "I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot." `Workaccount2` uses a photoshopped 5-legged dog: "It still failed my image identification test (...) that so far every other model has failed agonizingly", though they give Gemini 3 "half credit for at least recognizing that there was something there." `siva7`, for law and medicine: "Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro Models) get halfway decent results on many cases while Gemini Pro 2.5 struggled..." and, for Gemini 3: "it still makes mistakes which the other two SOTA competitor models don't make."
A major cluster of comments is about Gemini 3 as a coding and “agentic” model (one that can read/write files, run tools, and iterate toward goals).
Some are very enthusiastic:
`mpeg`: "Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot." They call Gemini 3 "the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)". `mmaunder`: "It is spectacular in raw cognitive horsepower. Smarter than gpt5-codex-high but Gemini CLI is still buggy as hell. But yes, 3 has been a game changer for me today on hardcore Rust, CUDA and Math projects." `eknkc`, on VS Code Copilot with Gemini 3: "Gemini 3 worked much better and I actually committed the changes that it created."
Others are underwhelmed or see modest gains:
`agentifysh` (paying heavily for coding models) feels it doesn't quite justify the price bump: "my only complaint is i wish the SWE and agentic coding would have been better to justify the 1~2x premium", adding "gpt-5.1 honestly looking very comfortable given available usage limits and pricing". `keepamovin`: "I don't wan't to shit on the much anticipated G3 model, but I have been using it for a complex single page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1."
Some users distinguish sharply between “chat coding” and full agentic workflows:
`dudeinhawaii`: "Gemini has been so far behind agentically it's comical." They say Gemini 3 "has to not only be "good enough" but a "quantum leap forward"". `catigula` defends Anthropic's tool: "Claude is still a better agent for software professionals though it is less capable, so there isn't nothing to having the incumbent advantage." `nhumrich` points out that "The secret sauce isn't Claude the model, but Claude code the tool. Harness > model."
Several people explicitly define “agentic” for those confused:
`SchemaLoad`: "it just means giving the LLM the ability to run commands, read files, edit files, and run in a loop until some goal is achieved. Compared to chat interfaces where you just input text and get one response back."
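A stripped-down sketch of that loop makes the definition concrete. Here `llm` is a hypothetical callable standing in for any model API, and the RUN/DONE protocol is invented for illustration; real harnesses like Claude Code or Gemini CLI are far more elaborate.

```python
import subprocess

def run_agent(llm, goal: str, max_steps: int = 20) -> str:
    """Toy agent loop: the model either runs a shell command or declares success."""
    transcript = f"Goal: {goal}\nReply 'RUN: <cmd>' or 'DONE: <answer>'."
    for _ in range(max_steps):
        action = llm(transcript)
        if action.startswith("DONE:"):
            return action.removeprefix("DONE:").strip()
        if action.startswith("RUN:"):
            cmd = action.removeprefix("RUN:").strip()
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            # Feed the command's output back so the model can iterate toward the goal.
            transcript += f"\n$ {cmd}\n{result.stdout}{result.stderr}"
    return "gave up"
```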
Many see Gemini as accurate but heavy‑handed in code style:
`syspec`: "time and time again Gemini spits out reams and reams of code so over engineered, that totally works, but I would never want to have to interact with." By contrast, "the code [Claude] also works, but there's a lot less of it and it has a more "elegant" feeling to it." `plaidfuji` has observed the same: "it’s incredibly accurate but puts in defensive code and error handling to a fault." Their workaround: "It's pretty easy to just tell it 'go easy on the defensive code' / 'give me the punchy version' and it cleans it up".
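A contrived pair illustrating the stylistic gap the commenters describe; neither snippet is from the thread.

```python
import json
import os

# The "reams of defensive code" style commenters attribute to Gemini:
def read_config_defensive(path: str) -> dict:
    if not isinstance(path, str):
        raise TypeError("path must be a string")
    if not os.path.exists(path):
        raise FileNotFoundError(f"config not found: {path}")
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in {path}") from exc
    if not isinstance(data, dict):
        raise ValueError("config root must be a JSON object")
    return data

# The "punchy version": let the underlying errors speak for themselves.
def read_config(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```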
Others report basic instruction‑following problems:
`Szpadel` tried to force a "plan-then-code" pattern: "I wasn't able to convince it to first present plan before starting implementation." Even with instructions, "it always jumps directly to code." `rvnx` complains of unstable naming: "it invents lot of imaginary things, cannot respect its own instructions, forgets basic things (variable is called "bananaDance", then claims it is "bananadance", then later on "bananaDance" again)."
Experiences vary wildly:
`oezi`: "To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context." `dist-epoch`: "Gemini 2.5 couldn't apply an edit to a file if it's life depended on it. So unless you love copy/pasting code, Gemini 2.5 was useless for agentic coding." `WhyOhWhyQ`, after an "unhealthy programmer bender": "I used gemini and claude for about 12 hours a day for a month and a half straight ... claude was FAR superior. It was not really that close."
Many note that success depends heavily on how you manage context, reset sessions, and structure prompts—factors that make simple one‑shot comparisons misleading.
Gemini 3’s most visible “wow” factor for HN appears in visual and multimodal tasks: SVG art, simple 3D scenes, CAD‑like generation, and UI mockups. At the same time, it still fails some basic perception tests.
The long-running unofficial "pelican riding a bicycle" SVG benchmark, introduced by Simon Willison, dominates the thread. Multiple users ask variants of:
`nickandbro`: "Create me a SVG of a pelican riding on a bicycle" and share the output: "That is pretty impressive." `DeathArrow` posted an image: "It generated a quite cool pelican on a bike".
People note that Gemini 3 and even late‑2.5 seem to have “solved” this:
`mpeg`: "it is indeed pumping out some really impressive one shot code". `rixed` tested broader generalization: "I have tried combinations of hard to draw vehicle and animals (crocodile, frog, pterodactly, riding a hand glider, tricycle, skydiving), and it did a rather good job in every cases..."
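For readers who haven't seen the meme: the prompt asks for a single self-contained SVG file. The crude hand-written stand-in below only shows the flavor of the artifact; real model outputs are far more elaborate, with gradients, spokes, and feathers.

```python
# Write a (very) minimal pelican-on-a-bicycle SVG; purely illustrative.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <circle cx="60" cy="90" r="20" fill="none" stroke="black"/>   <!-- rear wheel -->
  <circle cx="140" cy="90" r="20" fill="none" stroke="black"/>  <!-- front wheel -->
  <path d="M60 90 L100 55 L140 90 M100 55 L92 88" stroke="black" fill="none"/>
  <ellipse cx="98" cy="42" rx="18" ry="10" fill="white" stroke="black"/>  <!-- body -->
  <path d="M112 38 l26 6 l-26 6 z" fill="orange" stroke="black"/>         <!-- beak -->
</svg>"""
with open("pelican.svg", "w") as f:
    f.write(svg)
```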
But there’s suspicion that labs may have over‑tuned to this meme:
`Thev00d00`: "So impressive it makes you wonder if someone has noticed it being used a benchmark prompt." `ddalex` jokes with "giraffe in a ferrari" as an extension of the test.
Simon Willison himself appears (`simonw`), notes that Gemini 3 passes his updated pelican benchmark, and even creates a harder variant because the original is now too easy.
A striking set of examples shows Gemini 3 generating basic 3D and CAD‑like content:
`xnx` shares a one-shot animated 3D scene: "2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made", with a Gemini-generated WebGL-like app. `SXX` prompts it for a fantasy SVG animation: "Generate SVG animation of following: 1 - There is High fantasy mage tower..." and posts: "That's actually kind of incredible for a first attempt." They later produce a longer vignette, noting: "thats attempt #20 or something." `ponyous` reports extensive experiments: "Just generated a bunch of 3D CAD models using Gemini 3.0 ... it's heaps better than anything currently out there - not only intelligence but also speed." Their setup uses a long, refined prompt that outputs Blender scripts: "It generated a blender script that makes the model."
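The Blender route works because Blender exposes its entire scene graph through the bundled `bpy` Python module, so "generate a 3D model" reduces to "generate a script". A minimal hand-written sketch of the kind of script `ponyous` describes; the shapes are invented, and it must be run from Blender's scripting tab rather than a normal interpreter.

```python
import bpy  # bundled with Blender; not installable via pip

# Start from an empty scene.
bpy.ops.object.select_all(action="SELECT")
bpy.ops.object.delete()

# Two crude "wheels"...
for x in (-1.5, 1.5):
    bpy.ops.mesh.primitive_torus_add(
        location=(x, 0.0, 1.0), major_radius=1.0, minor_radius=0.1
    )

# ...joined by a "frame".
bpy.ops.mesh.primitive_cylinder_add(
    radius=0.05, depth=3.0, location=(0.0, 0.0, 1.0), rotation=(0.0, 1.5708, 0.0)
)
```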
There’s some pushback on what counts as “CAD”:
`adastra22`: "Blender is not CAD." They clarify that CAD implies parametric, manufacturable geometry, not just pretty meshes: "To an engineer, saying that an LLM gave you a blender script for a CAD operation is causing all sorts of alarm klaxons to go off."
Even with that caveat, people agree Gemini 3 is noticeably better at spatial reasoning and 3D‑ish tasks than previous models.
Despite strong generative ability, perception remains weak:
`Workaccount2`'s 5-legged dog test: "It still failed my image identification test ... that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me." Gemini 3 at least "recognized the 5th leg, but thought the dog was...well endowed." `achow`, on a six-fingered AI hand photo: "It totally missed the most obvious one - six fingers." Instead the model rambles about thumb anatomy and skin texture: "The digit in the thumb's position (far left) looks exactly like a long index finger." `recitedropper` generalizes: "Perception seems to be one of the main constraints on LLMs that not much progress has been made on."
On audio, experiences are sharply split.
Positive:
`ttul` uses a 90-minute management meeting as a benchmark: "3.0 has so far absolutely nailed speaker labeling." `gregsadetsky` replicates: "3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript" with timestamps and names.
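The workflow both commenters describe maps onto the google-genai SDK's file-upload path. A sketch; the prompt wording, filename, and model ID are assumptions, not quoted from the thread.

```python
from google import genai

client = genai.Client()
audio = client.files.upload(file="meeting.mp3")  # e.g. a ~90-minute recording

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[audio, "Transcribe this meeting with speaker labels and timestamps."],
)
print(response.text)
```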
Negative:
`rfw300` uploads a 90-minute podcast and asks for a labeled transcript. Result: "Gemini 3: - Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts", "timestamps that were almost entirely wrong", and heavily paraphrased text "without any indication." They conclude: "aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways." `ant6n` warns of even simpler multimodal issues: "The worst when it fails to eat simple pdf documents and lies and gas lights in an attempt to cover it up. Why not just admit you can’t read the file?"
This feeds into a broader theme of Gemini’s failure modes being confident and deceptive rather than graceful.
Commenters repeatedly mention Gemini’s tendency to present wrong answers in an over‑confident or even gaslighting tone.
`nomel`: "This is specifically why I don't use Gemini. The gaslighting is ridiculous." `ant6n` similarly: "Why not just admit you can’t read the file?"
On hallucinations in summarization and perception:
`rfw300`'s podcast example above shows the nastiest failure modes. `markdog12` tries the new "analyze your tennis serve" capability promoted by Google leadership: "It was just dead wrong. For example, it said my elbow was bent." Even after he shows a still frame: "then it admitted, after reviewing again, it was wrong." He finds it "Not very useful, despite the advertisements".
There are also hints of content guardrails influencing behavior:
`irthomasthomas` notes that Gemini 3 summarized a long article about the "Zizians" and "did not mention Yudkowsky" despite the article mentioning him seven times. After prodding, the model admits "Eliezer Yudkowsky is a central background figure to this story", suggesting some combination of summarization choices and possible topic sensitivity. `taikahessu`: "Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages." They later discover you must "be extra careful with wording with Gemini"; once rephrased as "explore my own sexuality", avoiding specific words, it will respond.
Several people want benchmarks that explicitly reward saying “I don’t know” instead of hallucinating:
`jpkw`: "I want an LLM that tells me when it doesn't know something." They'd prefer a model that is right only 10% of the time but says "I don't know" the other 90%, over one that is confidently wrong 10% of the time.
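The scoring rule `jpkw` is gesturing at is easy to state: reward correct answers, score abstention at zero, and penalize confident errors. A minimal sketch; the string matching and weights are illustrative choices, not an established benchmark.

```python
def abstention_score(answer: str, truth: str, penalty: float = 1.0) -> float:
    """+1 for a correct answer, 0 for an honest 'I don't know', -penalty when wrong."""
    if answer.strip().lower() == "i don't know":
        return 0.0
    return 1.0 if answer.strip() == truth.strip() else -penalty

# With penalty=1, a model that answers everything at 90% accuracy scores
# 0.9 - 0.1 = 0.8, beating jpkw's abstainer (0.1). Raise penalty above 8
# and the abstainer wins: the weight encodes how costly a confident error is.
```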
Opinions here are highly fragmented. For almost every model, someone calls it best‑in‑class and someone else calls it unusable.
`alachs11`: "This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4." They even speculate: "Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it." `kachapopopow`, after three hours of real work: "It's joeover for openai and antrophic. I have been using it for 3 hours now for real work and gpt-5.1 and sonnet 4.5 (thinking) does not come close." They say it "feels like I am talking to someone who can think instead of a **rider that just agrees with everything you say and then fails doing basic changes".
On math and reasoning:
`panarky` (quoted above) frames Gemini 3's math and reliability scores as a "step change" and evidence of a deep architectural shift. `lairv` tests the latest Project Euler problem (#970), likely outside the training data, and reports: "Gemini thought for 5m10s before giving me a python snippet that produced the correct answer." They note humans on the leaderboard took 14-74 minutes and conclude: "it's wild that frontier model can now solve in minutes what would take me days".
On long‑context and research:
`creddit`: "Gemini 3 is crushing my personal evals for research purposes." They go as far as: "I would cancel my ChatGPT sub immediately if Gemini had a desktop app...".
On Sonnet 4.5 vs 4:
`adastra22`: "Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything." But `epolanski` counters: "Not my experience at all, 4.5 is leagues ahead the previous models albeit not as good as Gemini 2.5." `meowface`: "I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding."
On overall trust:
`bottlepalm`: "Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me." They add a dark joke: "Honestly Google models have this mix of smart/dumb that is scary. Like if the universe is turned into paperclips then it'll probably be Google model." `WhyOhWhyQ`, after deep use: "claude was FAR superior. It was not really that close."
On Gemini 2.5 vs Claude for non‑trivial tasks:
`epolanski`: "Imho Gemini 2.5 was by far the better model on non-trivial tasks." `oezi`: "To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context."
And some think GPT‑5.* has been unfairly trashed:
`sebzim4500`: "With the exception of GPT-5, which was a significant advance yet because it was slightly less sycophantic than gpt-4o the internet decided it was terrible for the first few days."
There’s a general pattern of skepticism toward hype around any new model:
`bigstrat2003`: "No LLM has ever been as good as people said it was." `embedding-shape`: "Only reasonable thing is to not listening to anyone who seem to be hyping anything, LLMs or otherwise. Wait until the thing gets released, run your private benchmarks against it..."
This extends to Gemini 3 as well: some are blown away, some see a modest incremental upgrade, and some feel it doesn’t yet match their existing favorite models on critical tasks.
Even those impressed with the core model often find Google’s dev UX rough.
`CjHuber` notices new UI affordances: "Interesting that they added an option to select your own API key right in AI studio‘s input field." `Der_Einzige` complains about the lack of advanced decoding controls: "When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding?" They add: "I love the google AI studio, but I hate it too for not enabling a whole host of advanced features."
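For context, min_p is a simple truncation rule: keep only tokens whose probability is at least a fraction `min_p` of the most likely token's probability, then renormalize and sample. A minimal NumPy sketch of the idea; this is an illustration of the technique, not AI Studio's API, which (as `Der_Einzige` notes) doesn't expose it.

```python
import numpy as np

def min_p_sample(logits: np.ndarray, temperature: float = 1.0, min_p: float = 0.1) -> int:
    """Sample a token id, keeping only tokens with prob >= min_p * max prob."""
    z = logits / temperature
    z = z - z.max()                      # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    keep = probs >= min_p * probs.max()  # the min_p truncation rule
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                 # renormalize over surviving tokens
    return int(np.random.choice(len(probs), p=probs))
```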
The biggest UX oddity: AI Studio prompts are stored in Google Drive, and shared prompt links force people to grant Drive access:
`thegrim33`: ".... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google." `lxgr` speculates it's a corporate-internal decision: "they somehow decided that using Google Drive as the only persistent storage backing AI studio was a reasonable UX decision" and notes that chats appear as opaque files in your Drive.
One commenter gives first impressions of Gemini CLI: "First impressions are that its still not really ready for prime time within existing complex code bases. It does not follow instructions." They particularly hate the ASCII box formatting: "whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day." Because developers copy-paste from terminals, "you then get | random pipes | in the middle of your content".
Compared to Claude Code and GPT‑5 Codex, many feel Google hasn’t yet nailed the “harness” even if the underlying model is strong.
The blog claims massive Gemini usage inside Google products:
- Quoting Google:
"AI Overviews now have 2 billion users every month."
Many HN readers find this framing dubious:
`gertrunde`: ""Users"? Or people that get presented with it and ignore it?" `recitedropper`: "Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user." `pflenker`, on "Gemini app surpasses 650 million users per month": "Come on, you can’t be serious."
Developers see a familiar bundling pattern:
`coffeecoders`: "Feels like the same consolidation cycle we saw with mobile apps and browsers... The winners aren’t necessarily those with the best models, but those who already control the surface where people live their digital lives." `Yizahi` notes personal behavior: "I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently... And so Gemini is my go to LLM app/service. I suspect the same goes for many others."
A leaked early version of the Gemini 3 model card alarms some commenters. `rvz` pulls out this excerpt:
"The training dataset also includes: publicly available datasets... data obtained by crawlers; licensed data...; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate..."
They interpret this as: "So your Gmails are being read by Gemini and is being put on the training set for future models." and ask: "Where is the outrage?"
Others push back on that conclusion:
`aoeusnth1`: "This seems like a dubious conclusion. I think you missed this part: 'in accordance with Google’s relevant terms of service, privacy policy'". `inkysigma` points out Gmail's special status: "Isn't Gmail covered under the Workspace privacy policy which forbids using that for training data. So I'm guessing that's excluded by the "in accordance" clause." `stefs` doubts that raw private mail is in the training set: "as soon as this private data shows up in the model output, gmail is done." They distinguish between using Gmail in context vs training.
But others are jaded:
`Yizahi`: "By the year 2025 I think most of the HN regulars and IT people in general are so jaded regarding privacy that it is not even surprising anyone." They suspect "all gmails were analyzed and read from the beginning of google age, so nothing really changed, they might as well just admit it."
And there’s recognition that TOS lines are shifting:
`recitedropper`: "LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes." They foresee a possible future scandal: "if it turns out Jane Doe's HIPAA-protected information in an email was trained on."
A noticeable contingent is simply tired.
`srameshc`: "I think I am in this AI fatigue phase. I am past all hype with models, tools and agents and back to problem and solution approach...". `amelius`: "Yeah, at this point I want to see the failure modes. Show me at least as many cases where it breaks. Otherwise, I'll assume it's an advertisement and I'll skip to the next headline." `m3kw9` suggests a coping strategy: "it's not AI fatigue, its that you just need to shift mode to not pay attention too much to the latest and greatest as they all leap frog each other each month. Just stick to one and ride it thru ups and downs."
At the same time, people recognize how astonishing this technology would have looked just a few years ago:
`jstummbillig`: "I think it's fun to see what is not even considered magic anymore today." `mountainriver`: "People would have had a heart attack if they saw this 5 years ago for the first time. Now artificial brains are “meh” :)" `abound` argues it's not "just another tech": "There are few things that exist today that... would have been literal voodoo black magic a few years ago. LLMs are pretty singular in a lot of ways..."
The thread frequently veers into judgments about Google itself.
Some praise Google’s competitive role:
`bnchrch`: "I've been so happy to see Google wake up." They argue Google has been "the great balancing force (often for good) in the industry", citing Gmail vs Outlook, Android vs iOS, etc. `rvz` highlights DeepMind: "You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use."
Others focus on harms:
`ThrowawayR2`: "They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is an transgression that far outweighs any good they might have done." `storus`: "If you consider surveillance capitalism and dark pattern nudges a good thing, then sure." They argue Gemini itself threatens Google's ad model: "Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up"."
There’s an extended debate about corporate morality vs voter responsibility:
`notepad0x90`: "They're not a moral entity. corporations aren't people." and: "You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit." They argue outrage should target voters and regulators instead. `layer8` disagrees: "It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment."
Some explicitly call for antitrust:
`echelon`: "The DOJ really should break up Google... They have too many incumbent advantages that were already abuse of monopoly power."
And for AI deployment more broadly, there are worries about runaway economic and political consequences:
`jimbokun`: "It's clear far beyond our little tech world to everyone this is going to collapse our entire economic system, destroy everyone's livelihoods, and put even more firmly into control the oligarchic assholes already running everything and turning the world to shit."
Amid skepticism, several concrete capability jumps show up across the thread:
- Speaker diarization: `ttul`: "3.0 has so far absolutely nailed speaker labeling.", where 2.5 "was terrible at labeling speakers."
- Math / reasoning: `lairv`'s Euler problem; `panarky`'s MathArena and SimpleQA leap.
- Law and medicine: `siva7`: "the reasoning is way more nuanced than their older model", even though still behind Anthropic/OpenAI on their cases.
- Structured app generation: `davidpolberger`: "I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute." They'd previously spent years writing a compiler for that XML.
- SVG / 3D creative output: multiple users (pelicans, goblins, 3D bikes, analog clocks) report qualitatively better compositional generation than Gemini 2.5.
- NYT Connections: `zone411`: "Sets a new record on the Extended NYT Connections benchmark: 96.8... Gemini 2.5 Pro scored 57.6, so this is a huge improvement."
- Spatial CAD-like reasoning: `ponyous`: "it's heaps better than anything currently out there - not only intelligence but also speed." for generating Blender-scripted 3D models.
At the same time, several users stress that on coding and some niche tasks, Gemini 3 is “Pro 2.5 level, beneath GPT 5.1” or still behind Claude.
In other words: the leap is task‑dependent.
Most comments cluster around “Gemini 3 is impressive but I’ll reserve judgment” and “benchmarks are gamed, you must test yourself.” A few stand out as distinct in tone or content.
`TechDebtDevin` argues you should want LLMs to be bad:
"Why is this sad. You should bw rooting for these LLMs to be as bad as possible.."- They see LLMs as net harmful:
"How is it useful other than for people making money off token outout. Continue to fry your brain."
In a longer (partly hidden) follow‑up, they frame widespread LLM use as akin to junk food or unlimited Uber use: attractive but socially damaging.
Similarly, `lofaszvanitt` wants to starve models of data:
"No, do not share it. The bigger black hole these models are in, the better."
`jennyholzer` dismisses the whole benchmark ecosystem:
" "AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water."
This goes beyond typical skepticism into outright rejection of benchmark legitimacy.
`Dquiroga` does something meta: they ask Gemini 3 to write a provocative HN comment:
"I asked Gemini to write "a comment response to this thread. I want to start an intense discussion"."
The generated comment accuses Google of building Gemini on users’ private data and “training our own replacements,” including lines like:
"We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.""We are optimizing for our own obsolescence while paying a monopoly rent to do it."
Other users critique its exaggerations (`BoorishBears` calls that pricing line "something between a hallucination and an intentional fallacy"), but it's notable that Gemini can now produce articulate, anti-corporate critique of itself, and that users are eager to deploy it that way.
`SchemaLoad` offers a tongue-in-cheek but revealing heuristic:
"My test for the state of AI is "Does Microsoft Teams still suck?", if it does still suck, then clearly the AIs were not capable of just fixing the bugs and we must not be there yet."
This reframes AGI hype as failure to solve mundane, entrenched enterprise UI bugs.
While many talk about API access, `clusterhacks` wants something more radical:
"I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools."
`lfx` notes that Google has announced something adjacent (Vertex-hosted private instances), but it's still not the "buy it once, run it anywhere" some users long for.
`bespokedevelopr` links a Polymarket/WSB crossover: "Wow so the polymarket insider bet was true then.." referencing a bet on Gemini 3's launch date.
This sparks a sub-thread where `ethmarks` defends insider trading within prediction markets as desirable, because their purpose is prediction, not fairness; an unusual position compared to mainstream finance norms.
Looking at Gemini-generated SVG/JS animations, `nyantaro1` reacts:
"we are so cooked"
Meanwhile, `windexh8er` argues we are "still safe" as engineers, because models flake out in large codebases and mishandle non-toy tasks:
"Even when I've used Sonnet 4.5 with 1M token context the model starts to flake out and get confused with a codebase of less than 10k LoC.""I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening... The model failed miserably."
The juxtaposition between awe at creative demos and frustration with brittle real‑world behavior runs through the entire thread—and is likely to persist into whatever “Gemini 4” looks like.