TL;DR
ARC-AGI-3 basically says today’s models are nowhere near human-level agents, even as CEOs run around declaring AGI is already here. At the same time, compression tricks plus mid-range GPUs are making surprisingly strong models run locally, open competitors are chipping away at GPT/Claude’s moat, and the real choke points are shifting to routers and the security of the AI dev toolchain.
It feels less like a singular AGI breakthrough and more like the early-internet era where the messy plumbing quietly decides who actually wins.
Key Events
Report
On paper, we now “have AGI”; Nvidia’s Jensen Huang and the term’s originator both say so. But the only unsaturated AGI benchmark we have shows humans at 100% efficiency and frontier models under 1%, while the rest of the stack—compression tricks, local GPUs, routers, and compromised dev tools—mutates every week.
Both claims are live at once: Huang and the coiner of "AGI" say publicly that we've hit it, while the new ARC-AGI-3 benchmark (135 novel game-like environments scored by Relative Human Action Efficiency) has humans at 100% and frontier models below 1%.
ARC‑AGI‑3 explicitly measures how fast systems learn in interactive worlds rather than how much trivia they remember, and all well-performing models appear to have ARC-style data in their training sets anyway.
Seed IQ hit 95% of the second-best human's efficiency on day one, and another team bought a 36% score in a single day for about $1,000, underscoring how far generic chat models lag behind specialized agents.
Google’s TurboQuant compresses LLM key–value caches by ~6x and speeds inference by up to 8x with no measurable accuracy loss, enabling 100K‑token conversations on laptops like the M2 MacBook.
Delta‑KV for llama.cpp adds near‑lossless 4‑bit KV caches with 10,000x less quantization error, and a photonic KV‑selection chip claims 944x faster lookups and 18,000x lower energy than brute-force GPU scans.
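None of these projects reduce to a one-liner, but the trick they share is easy to show in miniature. The sketch below is a generic per-channel round-to-nearest 4-bit quantizer for a KV tensor in numpy, printing compression ratio and reconstruction error; it illustrates the general idea only and is not TurboQuant's or Delta-KV's actual algorithm.

```python
# Generic per-channel 4-bit KV-cache quantization in numpy. A sketch of the
# general idea only -- not TurboQuant's or Delta-KV's actual algorithm.
import numpy as np

def quantize_kv_4bit(kv: np.ndarray):
    """Symmetric round-to-nearest int4, one scale per channel."""
    scale = np.abs(kv).max(axis=0, keepdims=True) / 7.0  # int4 range is [-8, 7]
    scale = np.where(scale == 0.0, 1.0, scale)           # guard all-zero channels
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(4096, 1024)).astype(np.float32)    # 4K tokens of fake cache
q, scale = quantize_kv_4bit(kv)
err = np.abs(dequantize_kv(q, scale) - kv).mean()
# int8 storage here; a real kernel packs two int4 values per byte, hence /2.
print(f"fp32 {kv.nbytes / 1e6:.1f} MB -> int4 {q.nbytes / 2 / 1e6:.1f} MB, mean abs err {err:.4f}")
```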
On the user side this shows up as Qwen3.5-9B running 20K-token prompts on a MacBook Air and 27B-scale models streaming over a million tokens per second on 96 B200 GPUs.
Threads debating TurboQuant are already pointing out that these wins mostly hit the KV cache, not the model weights, so 70B+ models still want big VRAM even as long-context chat suddenly feels cheap.
Around the 0xSero ecosystem, people are now talking about GPUs as long-term hedges: 50M "free" local tokens versus 150M tokens for $30 on GLM or Kimi and 1B tokens for $110 on Claude.
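The back-of-envelope math, taking the prices quoted in those threads as given and remembering that "free" local tokens ignore the GPU's amortized cost:

```python
# Cost-per-token math for the plans quoted above. Prices are as reported in
# the threads; the local GPU's $0 marginal cost excludes the hardware itself.
plans = {
    "local GPU": (50_000_000, 0.0),
    "GLM/Kimi":  (150_000_000, 30.0),
    "Claude":    (1_000_000_000, 110.0),
}
for name, (tokens, price) in plans.items():
    print(f"{name:>10}: ${price / (tokens / 1e6):.3f} per 1M tokens")
# -> local GPU: $0.000, GLM/Kimi: $0.200, Claude: $0.110 per 1M tokens
```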
Posts frame owning a decent GPU as moving from hobbyist flex to essential developer infrastructure, with “AI survivalism” meaning your workflow keeps running when pricing or ToS change and local models deliver 80–95% of cloud quality.
Hardware and compression are meeting that sentiment halfway: Qwen3.5-35B can be compressed by 20% to fit in 24GB VRAM with only ~1% performance loss, Kimi K2.5 packs 1T parameters with 32B active into 96GB of RAM, and Intel is shipping a 32GB-VRAM GPU for $949.
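A rough way to sanity-check claims like "35B in 24GB": weight memory is roughly parameter count times bits-per-weight divided by eight, plus a KV cache that grows with context. The sketch below uses an invented GQA-style config for the cache math, not Qwen3.5-35B's real architecture.

```python
# Rough VRAM sanity check: weights ~ params * bits/8, plus a KV cache that
# grows with context. The layer/head numbers below are an invented GQA-style
# config for illustration, not Qwen3.5-35B's real architecture.
def weights_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, bits: int) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_len * bits / 8 / 1e9  # 2x = K and V

w = weights_gb(35, bits=4)                         # ~17.5 GB of weights at 4-bit
kv = kv_cache_gb(layers=60, kv_heads=8, head_dim=128,
                 ctx_len=32_768, bits=4)           # ~2.0 GB of cache
print(f"weights ~{w:.1f} GB + 32K-token KV cache ~{kv:.1f} GB = ~{w + kv:.1f} GB")
```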
At the same time, multiple threads warn that once you price in malware scares around LM Studio, disk and RAM upgrades, and operational hassle, cloud still wins for spiky, low‑duty workloads.
Grok 4.20 now leads non-hallucination rankings with a 78% score, beating Gemini 3.1 and Claude Opus 4.6 on factual accuracy even as X users dunk on it as "worse than GPT-4o." GLM-5.1 tops SWE-bench-Verified among open models at 77.8% and comes within a couple of points of Claude Opus 4.6 on coding benchmarks, while Xiaomi's MiMo-V2-Flash sits first on SWE-Bench at $0.10 per million input tokens.
Qwen3.5-27B hits ~1.1M tokens/second on 96 B200 GPUs, and its 2.5 generation outperforms radiologists by 10% on certain image-interpretation tasks without even seeing the images. Mistral's 3B-parameter Voxtral TTS beats ElevenLabs Flash v2.5 in human preference tests, with nine-language support in ~3GB of RAM.
Frontier closed models like GPT‑5.4 still hold the crown on the hardest math and reasoning benchmarks, but the day‑to‑day coding and multimodal work is increasingly getting done by this swarm of cheaper, specialized contenders.
Apple is turning Siri into a front-end router that can send queries to ChatGPT, Gemini, Claude and others through an Extensions-style integration and a dedicated Siri app with “Ask” and “Write” modes.
MCP servers like Paper Lantern (2M+ research papers), LegalMCP (18 tools over US case law), and RemoteBridge (SSH into servers for autonomous deployment) similarly give agents structured access to external systems, with experiments showing a 3.2% gain in hyperparameter search when they can read CS papers.
OpenRouter is doing something similar for models, aggregating GPT, Claude, Grok, Qwen and Xiaomi endpoints while users report noticeable cost savings versus one‑vendor subscriptions, and IDE agents like Cursor or orchestrators like Codex juggle these models alongside plugins for Slack, Figma and Notion.
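The aggregation itself is mundane plumbing: OpenRouter exposes an OpenAI-compatible endpoint, so switching vendors is a one-string change. A minimal sketch using the openai Python SDK (the model slug is illustrative; check OpenRouter's catalog for current names):

```python
# One client, many vendors: OpenRouter speaks the OpenAI API, so swapping
# models is a one-string change. The model slug below is illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="qwen/qwen-2.5-72b-instruct",  # or an openai/, anthropic/, x-ai/ slug
    messages=[{"role": "user", "content": "One-line summary of KV-cache quantization?"}],
)
print(resp.choices[0].message.content)
```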
The messy part is that 98% of MCP tool descriptions don’t actually tell agents how to behave and over a third of MCP servers get an F on security tests, so new safety layers like Ark and zero‑trust proxies are already appearing around this routing fabric.
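For contrast with that 98% figure, here is roughly what a behaviorally explicit tool description looks like using the official Python MCP SDK's FastMCP helper; the server and tool are invented for illustration, not taken from any of the servers named above.

```python
# A toy MCP tool whose description actually tells the agent how to behave,
# which is what the audits above say most servers skip. Uses the official
# Python MCP SDK's FastMCP helper; the server and tool are invented examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("papers-demo")

@mcp.tool()
def search_papers(query: str, max_results: int = 5) -> list[str]:
    """Search an index of CS papers by keyword.

    Use this ONLY for literature lookups; never pass user credentials or
    file paths as the query. Results are titles, not full text, so issue a
    narrower query instead of requesting max_results > 20.
    """
    # Stub: a real server would query a search index here.
    return [f"result {i} for {query!r}" for i in range(max_results)]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```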
What This Means
We’re in a bifurcated AI moment where benchmarks and infra say “not AGI, not yet,” but local hardware, open models, and routing layers are compounding fast enough that the stack people actually use is changing under their feet. The real leverage is drifting away from single frontier models and toward whoever controls compression tricks, GPUs, and the orchestration fabric that decides which model does what.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.