TL;DR
Benchmarks say GPT‑5.4 is back on top, but the most interesting moves are offstage: militaries are running custom Claude variants in classified stacks, with OpenAI and Grok deployments slated to follow, while Chinese and open models like Qwen and DeepSeek quietly catch up. At the same time, edge hardware and agent frameworks are making serious local and multi-agent systems practical years earlier than expected, under security, ethics, and legal regimes that look badly underfit to the capabilities.
The real story isn’t just smarter models—it’s where they’re being dropped and how little visibility anyone has into that deployment layer.
Key Events
Report
Everyone is watching benchmark charts, but the sharpest moves this month happened where there are no leaderboards: inside classified military stacks and on-device silicon.
Publicly, GPT‑5.4 and Gemini 3.1 Flash‑Lite look like the story; underneath, custom Claude models for the Pentagon (with Grok slated to join) and a flood of NPU hardware and MLX tooling are shifting where real capability lives.
OpenAI has agreed to deploy its models on the U.S. Department of War’s classified networks, while Anthropic is already running custom Claude models for the Pentagon that are 1–2 generations ahead of the consumer version and reportedly produced about 1,000 prioritized targets in operations like the Iran strike.
Musk’s Grok is slated for classified systems, and France’s Ministry of the Armed Forces has partnered with Mistral AI, so at least three frontier labs now maintain defense‑only branches.
In public, the status game is ARC‑AGI‑2—GPT‑5.4‑Pro at 83.3%, Gemini 3.1 Pro at 84.6%—and DeepMind’s Aletheia solving six open research‑level math problems, while Hassabis and LeCun bicker over what a “real” AGI test would look like.
Researchers simultaneously insist we’re far from true AGI and that alignment is unsolved, even as these not‑yet‑aligned systems are wired into live targeting pipelines behind classification walls.
GPT‑5.4 is a monster—1M‑token context and record scores on FrontierMath and CritPt—yet the usage data says “baseline,” not monopoly.
Claude just overtook ChatGPT as the #1 U.S. App Store app, even while Anthropic sells a Pentagon‑only Claude 1–2 generations ahead of what ordinary users touch.
Grok’s iPhone app has over 1M ratings at 4.9 stars and is pulling about 1.5× the traffic of both Claude and Perplexity, despite many power users dismissing it as a joke compared to Claude and Gemini.
On the open/Chinese side, Qwen 3.5‑35B‑A3B beats free‑tier ChatGPT and Gemini, GLM‑5/Kimi are near proprietary quality, DeepSeek V3 is called “frontier‑class” at a ~$5.6M training cost, and a Chinese CUDA‑coder model reportedly writes kernels 40% better than Claude Opus 4.5.
Against that backdrop, 1.5M users leaving ChatGPT and a 295% uninstall spike after the Pentagon deal look like demand redistributing across many “good enough” stacks rather than consolidating on one.
Apple's MLX stack plus Qwen 3.5 is turning Macs into serious local‑AI rigs: Qwen 3.5‑35B does around 110 tokens/second on an M4 Max, real‑time voice‑to‑voice interaction runs on a Mac Studio, and a single iOS app now ships with 60‑plus models entirely on‑device.
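For a sense of how lightweight that workflow has become, here is a minimal sketch of local inference through the mlx-lm package. The model ID is illustrative, not confirmed from the source (any MLX-format Qwen checkpoint from a hub like mlx-community would slot in), and throughput depends entirely on your machine:

```python
# Minimal sketch: local inference on Apple silicon via mlx-lm.
# The checkpoint name below is an illustrative assumption; substitute
# whatever MLX-format model you actually have available.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-32B-Instruct-4bit")  # illustrative ID

prompt = "Summarize the trade-offs of on-device inference in two sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)
```

That is the whole loop: no server, no API key, weights memory-mapped off local disk.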
Qwen3‑TTS runs locally on macOS and iOS with offline voice cloning and emotion presets, while Maic and LoopMaker provide MLX‑optimized LLM serving and all‑on‑device music generation, respectively.
NPUs are quietly joining the party: Strix Halo decodes about 19.5 tokens/second at 20W, Qwen 3 9B already hits >6 tokens/second on Android’s Hexagon NPU, Snapdragon Wear Elite brings 2B‑parameter models to watches, and Apple’s Neural Engine delivers 6.6 TFLOPS/W. At the same time, chips with models baked directly into hardware hit 17,000 tokens/second and QuarterBit trains 70B models on a single GPU, even as DDR5 scalpers and rising GPU prices keep the entry‑level PC market shrinking.
MCP and agent runtimes are crystallizing into a de facto agent stack: MCP servers slash context use by up to 98% for tools like Claude Code, CodeGraphContext’s graph‑based MCP reports 120× token reduction on large repos, and OpenClaw’s production stack runs 11 specialized agents with failover across 9 providers.
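The mechanics behind those token savings are simple: instead of dumping raw files into the prompt, an MCP server exposes narrow, typed tools the agent calls on demand. A minimal sketch using the official MCP Python SDK's FastMCP helper; the tool itself is a hypothetical stub standing in for a real code index:

```python
# Minimal sketch of an MCP tool server. Exposing a narrow lookup tool
# instead of raw repo context is where the reported token reductions
# come from. The tool body is a stub, not a real index.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("repo-tools")

@mcp.tool()
def find_symbol(name: str) -> str:
    """Return the file and line where a symbol is defined (stubbed here)."""
    # A real server would query a code graph or index; this keeps the sketch runnable.
    return f"{name}: src/example.py:42"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which clients like Claude Code speak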
LangGraph agents already read docs and manage real support tickets, while LangChain’s OpenClaw hit 100k+ GitHub stars and LangSmith added Skills, a CLI, and coding‑agent benchmarks to debug these workflows.
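Structurally, a support-ticket agent of the kind described above is just a small state machine. A minimal LangGraph sketch with stubbed node logic where a real deployment would call an LLM; the routing keyword and replies are placeholders:

```python
# Minimal sketch of a LangGraph triage flow: one router node, two terminal
# nodes. Node bodies are stubs; a production agent would invoke a model.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class TicketState(TypedDict):
    ticket: str
    route: str
    reply: str

def triage(state: TicketState) -> TicketState:
    route = "answer" if "password" in state["ticket"].lower() else "escalate"
    return {**state, "route": route}

def answer(state: TicketState) -> TicketState:
    return {**state, "reply": "Reset link sent."}

def escalate(state: TicketState) -> TicketState:
    return {**state, "reply": "Routed to a human agent."}

graph = StateGraph(TicketState)
graph.add_node("triage", triage)
graph.add_node("answer", answer)
graph.add_node("escalate", escalate)
graph.set_entry_point("triage")
graph.add_conditional_edges("triage", lambda s: s["route"],
                            {"answer": "answer", "escalate": "escalate"})
graph.add_edge("answer", END)
graph.add_edge("escalate", END)

app = graph.compile()
print(app.invoke({"ticket": "I forgot my password", "route": "", "reply": ""}))
```

Everything interesting in production, checkpointing, retries, human-in-the-loop gates, hangs off this same graph abstraction.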
But operational hygiene looks very 1999: more than 220,000 AI agent instances are exposed online with no auth, 41% of official MCP servers lack authentication, and 2,800 leaked Google API keys are still silently authenticating to Gemini.
One stolen Gemini key generated an $82,314 bill in 48 hours, Google doesn’t let you set hard spend caps, and researchers are simultaneously showing GPU Tensor Core side‑channel attacks and device‑memory inference leaks.
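Until providers offer hard caps, the practical mitigation is enforcing one client-side. A minimal, provider-agnostic sketch of a budget guard wrapped around every outbound call; the ceiling and per-token rate are illustrative assumptions, not real Gemini pricing:

```python
# Minimal sketch of client-side spend enforcement. The rate constant is an
# illustrative assumption; reconcile against the provider's reported usage
# where the response includes token counts.
import threading

class BudgetGuard:
    """Refuse further calls once estimated spend crosses a hard ceiling."""

    def __init__(self, ceiling_usd: float, usd_per_1k_tokens: float = 0.01):
        self.ceiling = ceiling_usd
        self.rate = usd_per_1k_tokens  # illustrative blended rate
        self.spent = 0.0
        self._lock = threading.Lock()

    def charge(self, tokens: int) -> None:
        with self._lock:
            self.spent += tokens / 1000 * self.rate
            if self.spent >= self.ceiling:
                raise RuntimeError(f"Budget ceiling ${self.ceiling:.2f} reached")

guard = BudgetGuard(ceiling_usd=50.0)

def guarded_call(client_fn, prompt: str):
    # Pre-charge a rough worst-case token estimate before the call goes out.
    guard.charge(tokens=len(prompt) // 4 + 1000)
    return client_fn(prompt)
```

Routing every call through a guard like this turns a runaway $82k weekend into a hard stop at a number you chose.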
The result is production‑grade agent graphs running on top of a security posture that still treats LLMs like harmless SaaS widgets.
What This Means
The visible battle is benchmark scores and app‑store rankings, but the real shape of progress is a split between opaque, militarized frontier stacks and increasingly capable edge and agent infrastructures whose security, ethics, and governance lag far behind their raw capability. The distance between what these systems can do and what institutions can safely absorb is now widening faster than any ARC‑AGI curve.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.