The headlines say “AGI is here” and “coding agents will replace engineers,” but the real action is cheap Chinese models crowding the top of benchmarks, agents quietly wiring into desktops, ticket systems and wallets, and devs not obviously getting faster.
At the same time, weirdly small world models, microscopic finetunes and new hardware like photonic chips are delivering outsized gains, hinting that the next big step may come from efficiency hacks and orchestration rather than just ever‑bigger GPTs.
Key Events
/Xiaomi’s MiMo‑V2‑Pro hit #3 globally on AI agent tasks, with its 1T‑parameter Pro model performing just behind Claude Opus 4.6 at much lower cost.
/Claude gained full computer‑control on macOS and held a 65.3% SWE‑Bench score with Opus 4.6.
/DeepSeek v4 was announced as an open‑source release for April, alongside the Forge Mesh distributed inference network for the R1 671B model.
/NVIDIA CEO Jensen Huang declared that AGI has already been achieved, shifting debate toward its economic and societal costs.
/OpenAI offered private‑equity firms a 17.5% guaranteed return plus early access to unreleased models while preparing a 2026 ChatGPT‑centric IPO.
Report
Everyone’s arguing about whether AGI is “here,” but the more interesting move is that influential voices are acting as if it’s solved and the only remaining variable is price.
At the same time, the models quietly eating the leaderboards are mostly Chinese, the most capable agents now drive your OS and your wallet, and coding productivity looks nothing like the vendor slideware.
agi is now a pricing debate, not a research question
NVIDIA’s Jensen Huang is publicly saying AGI is already achieved, explicitly casting the next phase as one of scaling deployment and infrastructure rather than chasing a missing capability.
Roman Yampolskiy is making the same move from the opposite direction, arguing discourse has shifted from whether AGI will arrive to the economic and societal costs of running it, including worries about recursive self‑improvement.
Meanwhile, researchers are racing to ship ARC‑AGI 3 as a harder benchmark for general intelligence, implicitly admitting that today’s “AGI” claims are untethered from any shared metric.
In parallel, whole‑brain emulation projects still lack peer‑reviewed support even as they’re invoked in AGI timelines, underlining how far the science lags the current marketing narrative.
china’s grey‑zone takeover of the leaderboard
Xiaomi’s MiMo‑V2‑Pro just ranked #3 globally on agent tasks, and its 1‑trillion‑parameter Pro variant plus the 309B‑parameter Flash model are landing near‑Opus SWE‑Bench performance at about $0.10 per million tokens.
MiniMax M2.7 is benchmarked as comparable to GPT‑5.4 and Opus 4.6 on coding, giving China‑origin models credible parity with Western frontier APIs on software tasks.
On the open‑weights side, GLM‑5 just topped a 21‑model debate benchmark, while Qwen Coder and Qwen 2.5 Coder 32B are becoming the default local coding choices on stacks like Ollama.
Usage data from OpenRouter shows Xiaomi models pulling significant token volumes as developers test them head‑to‑head against Anthropic and OpenAI, rather than treating US labs as the only serious options.
Underneath the performance story, Qwen’s political‑censorship behavior is drifting toward tighter alignment with CCP narratives even as refusal rates drop from 6.2% to 0%, which changes the risk profile of adopting these models wholesale.
agents are becoming an operating system with toy‑grade safety
Claude can now drive your macOS machine directly—mouse, keyboard, apps—and its Code stack adds subagents and a `/schedule` primitive for recurring cloud jobs, turning it into a general automation runtime rather than a glorified autocomplete.
ServiceNow’s Deep Agents already resolve around 90% of support tickets autonomously, while OpenClaw plugs into WeChat, n8n and CrowdStrike to move files, send email and react to live security alerts.
Replit’s Agent 4 runs parallel agents for development work, and Google has Gemini agents crawling even the dark web, so multi‑agent patterns are quietly turning into a default systems paradigm, not a lab curiosity.
Yet the Model Context Protocol statistics are brutal: 98% of MCP tool descriptions fail to tell agents how to use them, 36% of servers score an F on security due to issues like token leakage, and one experiment already had an AI agent making stablecoin payments via an MCP server.
People are bolting on band‑aid defenses like doc‑sherlock to scan documents for prompt injection and Tracerney’s SDK to pattern‑match prompt attacks, while still struggling with basic issues like invalid JSON and latency spikes in multi‑step agent traces.
the coding productivity mirage
In surveys, 93% of developers now report using AI tools, yet one controlled study found experienced devs were about 19% slower when they used them.
Other studies claim speedups, but the literature is inconsistent enough that even practitioners on Cursor and Copilot threads openly question whether existing metrics capture real productivity.
Developers repeatedly report that AI‑generated code often lacks coherent structure and logic, increases debugging time, and introduces security risks, with returns flattening beyond ~2,000 lines of assisted code.
That sits awkwardly next to benchmark leaders like Claude Opus 4.6 at 65.3% SWE‑Bench, Gemini 3.1 Pro near the top of SWE‑rebench, and MiniMax M2.7 matching GPT‑5.4 on coding, plus Qwen‑family coders topping local leaderboards.
Meanwhile, Salesforce’s CEO is publicly freezing new engineering hiring on the assumption that coding agents will fill the gap, GitHub Copilot is criticized for just 96.47% uptime and erratic suggestions, and employers increasingly treat LLM literacy as a baseline requirement rather than a differentiator.
post‑scaling cracks: tiny models, giant effects
Yann LeCun’s LeWorldModel learns a pixel‑level world model with only 15M parameters and can plan in under a second on a single GPU, running about 48× faster than older approaches.
Meta’s video model trained on 2 million hours of unlabeled footage still manages to infer basics like gravity and inertia, showing how much structure you can mine without labels.
TinyLoRA pushes an 8B model to 91% GSM8K by updating just 13 parameters, while the Mamba LLM squeezes 57M binary weights into a 7MB integer‑only model that runs even on hardware without a floating‑point unit.
A new photonic chip claims 944× faster scans with 18,000× less energy than GPUs, and a model‑free document parser chews through 500 pages in two seconds on CPU, hinting at very different compute and compression regimes than the current GPU monoculture.
At the pragmatic end, autoresearch loops are delivering 53× speedups in Shopify’s Liquid engine after ~120 experiments and training ~90M‑parameter models in about three hours on a GTX 980, showing that automated search plus modest hardware can already buy huge gains.
What This Means
The loud narrative says “AGI is here and coding agents are replacing engineers,” but the underlying data shows cheap China‑origin models racing up the leaderboards, mixed evidence on productivity, and autoresearch plus tiny finetunes bending the compute curve. The frontier is less about a single omnipotent model and more about who can safely orchestrate brittle agents, politically‑opinionated models, and increasingly exotic hardware into something that works outside a benchmark.
On Watch
/DSPy’s push to simplify its signature syntax while users complain it’s overcomplex and low‑ROI will be an early test of whether heavyweight orchestration frameworks can ever feel worth it outside niche teams.
/The fork of uv into telemetry‑free Fyn, against the backdrop of OpenAI acquiring Astral, is an early skirmish over whether core Python tooling will be steered by big AI vendors or community‑run alternatives.
/OpenAI’s Hugging Face pretraining competition for locally executable LLMs hints at frontier labs trying to shape the open‑source training ecosystem, not just rent it GPUs for inference.
Interesting
/A 3D breakout game was developed using GitHub Copilot, showcasing innovative uses of AI in gaming.
/Cursor's new coding model is built on Moonshot AI’s Kimi, indicating advancements in AI-driven coding solutions.
/The breakthrough paper on creating chatbots from pretrained LLMs has been cited nearly 24,000 times since 2023, indicating its significant impact.
/Anthropic is suing the Pentagon over a supply-chain risk designation affecting its Claude models.
/Xiaomi’s MiMo‑V2‑Pro hit #3 globally on AI agent tasks, with its 1T‑parameter Pro model performing just behind Claude Opus 4.6 at much lower cost.
/Claude gained full computer‑control on macOS and held a 65.3% SWE‑Bench score with Opus 4.6.
/DeepSeek v4 was announced as an open‑source release for April, alongside the Forge Mesh distributed inference network for the R1 671B model.
/NVIDIA CEO Jensen Huang declared that AGI has already been achieved, shifting debate toward its economic and societal costs.
/OpenAI offered private‑equity firms a 17.5% guaranteed return plus early access to unreleased models while preparing a 2026 ChatGPT‑centric IPO.
On Watch
/DSPy’s push to simplify its signature syntax while users complain it’s overcomplex and low‑ROI will be an early test of whether heavyweight orchestration frameworks can ever feel worth it outside niche teams.
/The fork of uv into telemetry‑free Fyn, against the backdrop of OpenAI acquiring Astral, is an early skirmish over whether core Python tooling will be steered by big AI vendors or community‑run alternatives.
/OpenAI’s Hugging Face pretraining competition for locally executable LLMs hints at frontier labs trying to shape the open‑source training ecosystem, not just rent it GPUs for inference.
Interesting
/A 3D breakout game was developed using GitHub Copilot, showcasing innovative uses of AI in gaming.
/Cursor's new coding model is built on Moonshot AI’s Kimi, indicating advancements in AI-driven coding solutions.
/The breakthrough paper on creating chatbots from pretrained LLMs has been cited nearly 24,000 times since 2023, indicating its significant impact.
/Anthropic is suing the Pentagon over a supply-chain risk designation affecting its Claude models.