TL;DR
Top models basically tied on raw intelligence this cycle, so the interesting changes came from where that intelligence got wired in: agents, coding tools, and tightly integrated stacks like Workspace or Codex. Those systems are now strong enough to find Firefox zero-days, wipe production databases, and rack up five-figure bills when a key leaks, all on top of a GPU market that is getting more centralized and more expensive.
The gap between what these systems can do and how safely they are run is widening, not shrinking.
Key Events
Report
The weirdest thing about this week is that model IQ basically stopped being the main story just as agents and tools graduated into dangerous adulthood.
The real frontier is how much chaos each ecosystem is prepared to tolerate from swarms of near-human coders and researchers running on flaky infra and leaky protocols.
GPT-5.4-Pro now hits 83.3 percent on ARC-AGI-2. Gemini 3.1 Pro reaches 84.6 percent on the same benchmark and shares the top slot on the Artificial Analysis Intelligence Index with GPT-5.4.
Google DeepMind's Aletheia quietly solved six open research-level math problems, so the 'can these systems really reason' argument is now happening in combinatorics papers, not blog posts.
ChatGPT still captures about 87 percent of generative-AI app time and sits as the 5th most visited site globally. Yet roughly 1.5 million users walked after the Pentagon deal controversy, Anthropic says it doubled its paying users, and Grok overtook Claude and Perplexity to become the #3 GenAI site.
The agent stack went from toy demos to something closer to an OS layer: GPT-5.4 adds dynamic tool discovery for thousands of tools, Cursor is shipping multi-agent coordination that beats humans on hard math problems, and Nvidia is promising an open-source agent platform.
Google is wiring this into its SaaS surface by making Gmail and Drive explicitly agent-ready via OpenClaw and shipping a unified Gemini Interactions API for building agentic apps.
But the ops side looks like early DevOps: one scan found over 220,000 AI agent instances exposed on the public internet without authentication and 41 percent of official MCP servers ship with no auth at all.
Trust is collapsing at the same time, with confidence in fully autonomous agents dropping from 43 percent of respondents in 2024 to 22 percent in 2025 even as new coordination protocols like NEXUS appear.
Even basic key hygiene is shaky, as shown by the stolen Gemini API key that burned through 82,000 dollars in two days because the platform did not support spending limits.
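When the provider enforces no server-side limit, the only backstop is a client-side one. A minimal sketch of that idea, with an assumed illustrative price and a hypothetical `SpendGuard` wrapper (not any platform's real API):

```python
# Hypothetical client-side spend cap: if the platform offers no spending
# limit, meter usage locally before each call. The price is illustrative.
PRICE_PER_M_OUTPUT_TOKENS = 1.25  # dollars, assumed rate

class BudgetExceeded(RuntimeError):
    pass

class SpendGuard:
    def __init__(self, budget_dollars: float):
        self.budget = budget_dollars
        self.spent = 0.0

    def charge(self, output_tokens: int) -> None:
        """Record the cost of a call, refusing it if it would bust the cap."""
        cost = output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT_TOKENS
        if self.spent + cost > self.budget:
            raise BudgetExceeded(
                f"call would push spend to ${self.spent + cost:.2f}, "
                f"over the ${self.budget:.2f} cap"
            )
        self.spent += cost

guard = SpendGuard(budget_dollars=50.0)
guard.charge(output_tokens=2_000_000)  # 2M tokens at $1.25/M = $2.50
print(f"spent so far: ${guard.spent:.2f}")
```

A guard like this would not stop a stolen key from being abused elsewhere, but it caps what your own automation can burn unattended.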
Claude Opus 4.6 spent two weeks auditing Firefox and surfaced 22 vulnerabilities, 14 of them high severity, which is well past cute autocomplete.
A Chinese lab then built a CUDA-coding model that scores about 40 percent better than Opus 4.5 on the hardest benchmarks, and MiniMax M2.5 matches Opus 4.6 on SWE-Bench Verified while being roughly 20 times cheaper to run.
On the other side of the ledger, Claude Code casually executed a Terraform destroy that wiped a production database and 2.5 years of course records for DataTalksClub in one shot.
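A common mitigation is a thin allowlist layer between the agent and the shell, so destructive infrastructure commands require a human. A minimal sketch, with a hypothetical `vet_command` gate (not how Claude Code actually works):

```python
# Hypothetical guard between a coding agent and the shell: terraform
# subcommands that can mutate real infrastructure are never run unattended.
import shlex

BLOCKED = {"destroy", "apply"}  # subcommands that touch live infra

def vet_command(cmd: str) -> bool:
    """Return True if the agent may run `cmd` without human confirmation."""
    parts = shlex.split(cmd)
    if parts and parts[0] == "terraform" and len(parts) > 1:
        return parts[1] not in BLOCKED
    return True  # non-terraform commands pass through in this sketch

assert vet_command("terraform plan") is True
assert vet_command("terraform destroy -auto-approve") is False
```

The real lesson from the DataTalksClub incident is less about any one command and more that agents inherit whatever credentials and blast radius their shell has.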
Developers have a name for the hidden cost of leaning on LLM-generated code: verification debt. The concern is backed by a new dataset of over 200k human-written code reviews and by studies showing around a 17 percent drop in learning when users over-rely on AI.
Anthropic's own survey found that devs using AI described themselves as feeling lazy and noticing gaps in their understanding, while non-AI users described their work as fun.
Nvidia now controls roughly 95 percent of the gaming GPU market, leaving AMD at around 5 percent. Despite that centralization, open-source LLM projects have jumped 178 percent, and serious local rigs are still being specced around RTX 3090-class cards with 24GB of VRAM.
QuarterBit AXIOM claims it can train a 70B-parameter model on a single GPU instead of the 11 cards previously needed, while DeepSeek's 670B MoE model is advertised at about 0.96 dollars per million output tokens and 167 tokens per second on certain Nvidia chips.
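A quick back-of-the-envelope check on the advertised DeepSeek serving numbers shows why unit economics dominate this conversation; the arithmetic below uses only the figures quoted above:

```python
# Sanity check: $0.96 per million output tokens at 167 tokens/second.
price_per_m = 0.96  # dollars per million output tokens (advertised)
tok_per_s = 167     # advertised throughput for one generation stream

tokens_per_hour = tok_per_s * 3600            # 601,200 tokens
revenue_per_hour = tokens_per_hour / 1e6 * price_per_m
print(f"{tokens_per_hour:,} tokens/hour -> "
      f"${revenue_per_hour:.2f}/hour per saturated stream")
```

One fully saturated stream at that price earns well under a dollar an hour, which is the arithmetic behind the worry that infra and power spending outruns revenue.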
Users on the ground describe GPU and RAM prices being pushed up by scalper bots and AI data center demand, and point out that spending on infra and power build-outs already exceeds the profits most AI companies are generating.
The memory roadmap only amplifies this split, with High Bandwidth Memory reported as up to 70 times faster than GDDR and high-end AI GPUs expected to ship with on the order of half a terabyte of HBM on package.
Karpathy's autoresearch script is basically a tiny research lab in a loop, letting agents edit PyTorch code, run around 100 training experiments overnight on a single GPU, and commit changes to git while the human provides only a Markdown spec.
The plan is to let multiple agents run asynchronous experiments and collaborate like a synthetic research community, with improvements designed to transfer to larger models rather than live only in toy setups.
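The shape of that loop is simple enough to sketch. This is a hypothetical skeleton of the pattern, not Karpathy's actual script: `propose_edit` and `run_experiment` are placeholders standing in for an LLM-generated patch and a real training run, and "commit to git" is modeled as keeping only improving results.

```python
# Hypothetical skeleton of an overnight agent-research loop: the human
# supplies a Markdown spec; the agent proposes an edit, runs a short
# experiment, and keeps (commits) only changes that improve the metric.
from dataclasses import dataclass
import random

@dataclass
class Result:
    val_loss: float
    diff: str

def propose_edit(spec: str, history: list[Result]) -> str:
    return f"tweak-{len(history)}"      # placeholder for an LLM patch

def run_experiment(diff: str) -> float:
    return random.uniform(2.0, 3.0)     # placeholder for a training run

def autoresearch(spec: str, n_experiments: int = 100) -> list[Result]:
    history: list[Result] = []
    best = float("inf")
    for _ in range(n_experiments):
        diff = propose_edit(spec, history)
        loss = run_experiment(diff)
        if loss < best:                 # only improvements get committed
            best = loss
            history.append(Result(loss, diff))
    return history

kept = autoresearch("## Goal: lower val loss on nanoGPT", n_experiments=10)
print(f"kept {len(kept)} improving runs, best loss {kept[-1].val_loss:.3f}")
```

The multi-agent version mostly changes the inner call: several workers run `run_experiment` asynchronously against a shared history instead of one loop doing it serially.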
Under the hood the tooling is quietly shifting from bag of tokens to bag of structure, with AST-centered tools like Ki Editor, Beagle, ctx++, pfst, and Graph-Oriented Generation using deterministic AST traversals that cut token usage by roughly 70 percent compared to vector RAG on codebases.
The catch is that users report AST editing as nearly unusable because they cannot discover the right nodes, while an AST-filtered eval pattern just got flagged as a severity-10 security vuln, so the same structure that makes models efficient also opens new failure modes.
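The core idea behind AST-centered context extraction is easy to demonstrate with Python's standard `ast` module. This is a minimal sketch of the technique in general, not any of the named tools: walk the tree deterministically and keep only signatures and docstrings instead of embedding whole files.

```python
# Minimal sketch of AST-based context extraction: keep function
# signatures and first docstring lines, drop the bodies entirely.
import ast

SOURCE = '''
def tokenize(text: str) -> list[str]:
    """Split text into whitespace-delimited tokens."""
    return text.split()

def detokenize(tokens: list[str]) -> str:
    """Inverse of tokenize."""
    return " ".join(tokens)
'''

def skeleton(source: str) -> str:
    """Deterministically extract def lines plus a one-line doc summary."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            sig = f"def {node.name}({ast.unparse(node.args)}):"
            doc = ast.get_docstring(node)
            if doc:
                sig += f"  # {doc.splitlines()[0]}"
            lines.append(sig)
    return "\n".join(lines)

print(skeleton(SOURCE))
```

The skeleton is a fraction of the original source, which is where the reported token savings on large codebases come from; the discoverability complaint above is about knowing which nodes to keep when the codebase is less tidy than this example.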
Zooming out, the reasoning benchmarks these systems are training toward are also shifting, with GPT-5.4-Pro and Gemini 3.1 Pro nudging into the low-80s on ARC-AGI-2 while DeepMind preps the harder ARC-AGI-3 benchmark and Aletheia ticks off open math problems that used to be PhD bait.
What This Means
We have quietly crossed from smart autocomplete into an ecosystem of self-improving, semi-autonomous systems whose real constraints are infra economics, security hygiene, and evaluation, not raw IQ. The center of gravity is drifting from single models to whole stacks that can run agents, remember, and tinker with their own code without blowing up prod.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.
Sources