Models got 1M–2M token windows and faster MoE backends, but the real action is in memory layers, agent scaffolding, and who controls the vertically integrated stack. NVIDIA is turning its own open-weight models into the default on Blackwell, while non‑US ecosystems like DeepSeek, Kimi, and Qwen quietly win on efficiency.
Agentic coding and multi‑agent frameworks are already causing outages, leaks, and brain‑fried reviewers, so the constraint is no longer raw capability but how safely we can wire these systems into the real world.
Key Events
/Anthropic released Claude Opus 4.6 and Sonnet 4.6 with a 1M-token context window now generally available.
/Grok 4.20 Beta hit 96.5% on τ²‑Bench for telecom tool use and logged the lowest hallucination rate (22%) among tested models.
/NVIDIA Nemotron 3 Super (120B total / 12B active Hybrid SSM Latent MoE) launched with a 1M context window and up to 2.2× FP4 speedup over GPT‑OSS‑120B, and is already in products like Perplexity and Agent API.
/Advanced Machine Intelligence (AMI) raised $1.03B at a $3.5B pre‑money to build JEPA‑style world‑model systems with persistent memory.
/Comfy Cloud upgraded to RTX Blackwell 6000 Pro GPUs and simultaneously cut prices by ~30%.
Report
Everyone is talking about 1M‑token context windows; the more interesting story is that nobody knows what to do with that much context that doesn’t break memory, tooling, or humans.
At the same time, NVIDIA quietly turned itself into a model vendor, and the strongest coding labs are discovering that giving agents root on prod was… optimistic.
nvidia isn’t just selling shovels anymore
NVIDIA’s Nemotron 3 Super is a 120B Hybrid SSM Latent MoE with 12B active params, 1M context, and benchmark wins like a score of 36 on the Artificial Analysis Intelligence Index, where earlier open models lagged.
It’s tuned for multi‑agent workloads and already shipping inside Perplexity, Agent API, and other stacks, effectively making NVIDIA both the GPU and the default model vendor in those flows.
On the hardware side, Blackwell‑class GPUs pushed DeepSeek inference from ~400 to 1300 tok/s per GPU in four months, and its MoE layer runs 78.9× faster than cuBLAS at ~$0.96 per million output tokens.
NVFP4/FP8 plus FlashAttention‑4 (~1600 TFLOPs/s) are locking these performance gains to NVIDIA’s own formats and kernels, even as some NVFP4 stacks on SM120 still produce garbage output.
agentic coding just hit its first real wall
Claude Code has been reported nuking production setups—including databases—which is the nightmare version of “move fast and ship agents.” Amazon now requires senior engineers to approve AI‑assisted changes after outages traced back to those edits, and is corralling usage into a single internal tool (Kiro). xAI has removed multiple founders as its AI coding efforts underperformed, signalling that even aggressively pro‑AI orgs are not getting reliable value from fully agentic coding.
In parallel, Anthropic says 70–90% of code for future models is already written by Claude, but developers describe “AI brain fry,” higher mental load from reviewing AI code, and a ~17% hit to skill formation.
memory is the real frontier, not 1m context
Claude 4.6 and GPT‑5.4 now offer 1M‑token windows, and Nemotron 3 Super advertises 1M context as well, so long‑context is effectively table stakes at the frontier.
But companion apps still routinely forget basic user info between sessions, and many “persistent memory” features degrade into glorified search over logs.
That gap is drawing serious money: AMI’s $1.03B round is explicitly for persistent memory and world‑model reasoning (JEPA), while projects like AgeMem and Hindsight integrate memory into agent decision‑making instead of just retrieval.
New layers like widemem.ai claim to resolve contradictions across an LLM’s outputs, and SK hynix’s LPDDR6 plus devices like the ROG Flow Z13 with 128GB unified memory show hardware bending around these long‑horizon workloads.
non‑us ecosystems are winning on efficiency, not just parity
DeepSeek on Blackwell now pushes ~1300 tok/s/GPU with a MoE layer 78.9× faster than cuBLAS, while keeping cost around $0.96 per million output tokens.
Kimi K2.5 hits 200 TPS via FireworksAI, scores 93.4% on OpenClaw benchmarks, and ties for second among 15 LLMs on real task evaluations, with strong 3D/Blender scripting performance.
Qwen 3.5 spans from a 0.8B model that runs DOOM on a smartwatch to 27B models doing ~2000 TPS in classification tasks and outperforming larger models on dictation cleanup and coding benchmarks.
At the media layer, Seedance 2.0 is already being used to generate entire TV series and viral dramas in minutes inside China, while Kling 3.0’s Motion Control enables frame‑level VFX edits and full actor or costume swaps—despite lingering realism issues and a copyright‑induced pause on global rollout.
agent frameworks are replaying early microservices mistakes
LangChain just shipped a static analyzer for prompt injection and PII leaks plus EU AI Act auto‑compliance checks, a sign that people are now debugging agents like distributed systems.
LangGraph leans on finite‑state‑machine designs to keep agents from looping forever and explicitly flags the “confused deputy” problem where low‑privilege agents trigger high‑privilege actions.
CrewAI and similar frameworks make multi‑agent orchestration easy enough that beginners underestimate reliability and deployment issues, even as AgentLeak results show 68.8% of private data leaks happen in multi‑agent LLM systems.
MCP is being called “dead” and up to 32× more expensive than CLI, with real cross‑tool hijacking incidents, yet adoption and new MCP servers (e.g., CodeGraphContext, LangWatch) are actually rising as people rediscover the need for standardized auth and tool schemas.
frontier models are stronger, but the ceiling is still obvious
GPT‑5.4 cuts errors by ~33% vs GPT‑5.2, can tackle research‑level physics problems, and tops ZeroBench, so raw capability is clearly moving.
Grok 4.20 pairs 2M context with 96.5% τ²‑Bench accuracy and the lowest hallucination rate (22%) among tested assistants, then applies that to things like recommending 77% of 149,183 EU regulations for deletion.
Yet on GAIA, leading assistants still score under 3% on truly hard questions, and defensive refusal bias makes LLMs 2.72× more likely to refuse defensive cybersecurity tasks than offensive ones.
Even in narrower domains like RAG over complex legal documents, standard systems still fail to maintain logical context without heavy chunking and custom evaluation, despite 1M‑token windows.
What This Means
The bottleneck has shifted from “are the models good enough?” to “can our memory layers, agent scaffolding, and human brains survive using them at scale,” while NVIDIA and non‑US ecosystems quietly reshape who actually controls that stack.
On Watch
/Qwen 3.5 has become a workhorse for GPU‑poor users—from a 0.8B smartwatch model to 27B/35B variants strong on coding—just as reports emerge that the Qwen team has disbanded, raising questions about long‑term support for a core open ecosystem.
/The MCP tool protocol is being called “dead” and up to 32× costlier than CLI even as adoption, new servers (e.g., CodeGraphContext, LangWatch), and evidence of cross‑tool hijacking rise, suggesting a coming inflection in standardized agent tooling and security.
/Intel’s Heracles chip computing fully encrypted data 1,074–5,547× faster than a 24‑core Xeon, plus 10,000 GHz light‑based processors and post‑quantum systems like Lattice, hint at a post‑GPU compute regime that hasn’t yet touched mainstream LLM workloads.
Interesting
/Researchers at Anthropic are observing early signs of recursive self-improvement in AI, potentially arriving as soon as next year.
/Covenant-72B is the largest decentralized LLM pre-training run, featuring 72B parameters and ~1.1T tokens.
/Fine-tuning a 14B model can outperform Claude Opus 4.6 in Ada code generation, highlighting the importance of model optimization for safety-critical applications.
/The EVMbench benchmark shows AI agents can detect 45.6% of vulnerabilities in smart contracts, highlighting their potential for automated auditing.
/NVIDIA's Nemotron 3 Super model, with 120 billion parameters, is tailored for multi-agent applications and features fully open weights and datasets.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.
/Anthropic released Claude Opus 4.6 and Sonnet 4.6 with a 1M-token context window now generally available.
/Grok 4.20 Beta hit 96.5% on τ²‑Bench for telecom tool use and logged the lowest hallucination rate (22%) among tested models.
/NVIDIA Nemotron 3 Super (120B total / 12B active Hybrid SSM Latent MoE) launched with a 1M context window and up to 2.2× FP4 speedup over GPT‑OSS‑120B, and is already in products like Perplexity and Agent API.
/Advanced Machine Intelligence (AMI) raised $1.03B at a $3.5B pre‑money to build JEPA‑style world‑model systems with persistent memory.
/Comfy Cloud upgraded to RTX Blackwell 6000 Pro GPUs and simultaneously cut prices by ~30%.
On Watch
/Qwen 3.5 has become a workhorse for GPU‑poor users—from a 0.8B smartwatch model to 27B/35B variants strong on coding—just as reports emerge that the Qwen team has disbanded, raising questions about long‑term support for a core open ecosystem.
/The MCP tool protocol is being called “dead” and up to 32× costlier than CLI even as adoption, new servers (e.g., CodeGraphContext, LangWatch), and evidence of cross‑tool hijacking rise, suggesting a coming inflection in standardized agent tooling and security.
/Intel’s Heracles chip computing fully encrypted data 1,074–5,547× faster than a 24‑core Xeon, plus 10,000 GHz light‑based processors and post‑quantum systems like Lattice, hint at a post‑GPU compute regime that hasn’t yet touched mainstream LLM workloads.
Interesting
/Researchers at Anthropic are observing early signs of recursive self-improvement in AI, potentially arriving as soon as next year.
/Covenant-72B is the largest decentralized LLM pre-training run, featuring 72B parameters and ~1.1T tokens.
/Fine-tuning a 14B model can outperform Claude Opus 4.6 in Ada code generation, highlighting the importance of model optimization for safety-critical applications.
/The EVMbench benchmark shows AI agents can detect 45.6% of vulnerabilities in smart contracts, highlighting their potential for automated auditing.
/NVIDIA's Nemotron 3 Super model, with 120 billion parameters, is tailored for multi-agent applications and features fully open weights and datasets.