The interesting action has moved out of the models and into the plumbing: harness design, memory, and local runtimes are now where agent systems actually succeed or explode. Open and mid-size models like GLM-5.1, Qwen 3.6, and Kimi K2.6 are good enough that token economics and stack architecture, not just raw IQ, are deciding what gets built.
At the same time, protocols like MCP and orchestration layers around agents are becoming prime security targets, which is starting to shape what serious teams are willing to ship.
Key Events
/Gemma 4 open models launched under Apache 2.0, running natively on iPhones with full offline inference.
/GLM-5.1 hit 58.4 on SWE-Bench Pro, becoming #1 open-weight and #3 global while shipping under an MIT license.
/Anthropic banned third-party harnesses like OpenClaw for Claude and launched Claude Managed Agents into public beta.
/OpenRouter raised $120M at a $1.3B valuation while Alibaba’s Qwen 3.6 Plus processed over 1 trillion tokens in a single day.
/Google’s TurboQuant cut KV-cache memory by at least 6x and delivered up to 8x speedups with no measured accuracy loss.
Report
The center of gravity has shifted from ‘pick the right model’ to ‘build the right machinery around any model.’ Harnesses, memory, and local stacks are where agents are winning or quietly face-planting.
harnesses as the real bottleneck
According to UCL's analysis, most of Claude Code is execution scaffolding: 98.4% of its codebase is operational infrastructure and only 1.6% is AI decision logic.
Anthropic has banned third-party harnesses like OpenClaw for Claude subscriptions and is adding extra fees for external tools, while simultaneously pushing Claude Managed Agents as a first-party harness layer.
Research groups are treating harness design as its own field, with Stanford’s improved harness outscoring Claude Code on TerminalBench 2 and projects like Meta-Harness and EvalMonkey emerging to auto-optimize and locally stress-test agents.
Security researchers are flagging the harness as a primary attack surface, noting that small changes in how tools, state, and permissions are wired can swing both performance and risk.
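Concretely, most of that scaffolding is unglamorous dispatch-and-policy code. A minimal sketch of the pattern in Python follows; the tool names and policy are hypothetical, and a production harness adds sandboxing, retries, and state checkpointing on top:

    # Minimal permission-gated harness loop. Tool names and policy are
    # hypothetical; real harnesses add sandboxing, retries, checkpoints.
    from dataclasses import dataclass, field

    def read_file(path: str) -> str:
        with open(path) as f:
            return f.read(4096)  # cap output so one call can't flood the context

    TOOLS = {"read_file": read_file}

    @dataclass
    class Harness:
        # The allowlist is the harness's job, not the model's: a tool call
        # the model emits only executes if policy explicitly permits it.
        allowed_tools: set = field(default_factory=lambda: {"read_file"})
        audit_log: list = field(default_factory=list)

        def dispatch(self, tool: str, args: dict) -> str:
            self.audit_log.append({"tool": tool, "args": args})
            if tool not in self.allowed_tools or tool not in TOOLS:
                # Refusals go back into context so the model can re-plan.
                return f"denied: {tool} is not permitted by policy"
            return TOOLS[tool](**args)

    harness = Harness()
    print(harness.dispatch("run_shell", {"cmd": "rm -rf /"}))  # denied by policy

Small wiring choices here (what sits in allowed_tools, how much tool output is capped, whether refusals are logged) are exactly the knobs that move both benchmark scores and attack surface.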
local-first large-context stacks on consumer hardware
Gemma 4 open models now run natively on iPhones with full offline inference, and have been brought up on devices as small as a Raspberry Pi 5 with 8GB RAM.
On the laptop side, MLX and oMLX plus quantization (TurboQuant, DFlash, oQ) are doubling generation speed for models like Qwen 3.5 27B on M5 Max and enabling large-context workloads on Apple Silicon.
KV-cache compression like TurboQuant’s 6x memory reduction and up to 8x speedup is letting long-context models run on commodity GPUs and even MacBooks, including a 397B-parameter model streaming from SSD on a 24–48GB RAM machine.
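The arithmetic explains the excitement. A back-of-envelope sketch in Python; the model dimensions are assumptions for a 70B-class model with grouped-query attention, not any vendor's published specs:

    # Back-of-envelope KV-cache sizing under assumed model dimensions.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
        # 2x for keys and values; one cache entry per layer per KV head.
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    fp16 = kv_cache_bytes(80, 8, 128, 200_000, 2)                 # 16-bit baseline
    print(f"fp16 KV cache at 200K ctx: {fp16 / 1e9:.1f} GB")      # ~65.5 GB
    print(f"after 6x compression:      {fp16 / 6 / 1e9:.1f} GB")  # ~10.9 GB

At fp16 the cache alone outgrows a 24GB card long before 200K context; a 6x reduction pulls it back into consumer-GPU territory.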
On RTX cards, Qwen 3.6 35B hits around 40 tokens/sec at ~200K context on a single 4090, while TurboQuant for GGML pushes 72K context for Llama-70B on dual 3090s, turning homelabs into serious agent/RAG backends.
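For homelab setups most of the work is configuration, not code. A hedged sketch using llama-cpp-python; the GGUF filename is hypothetical and these are just the standard knobs, not a recipe that reproduces the exact numbers above:

    # Illustrative local-inference config via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3.6-35b-q4_k_m.gguf",  # hypothetical quantized GGUF file
        n_ctx=200_000,     # long context is what KV-cache compression buys you
        n_gpu_layers=-1,   # offload every layer to the GPU (e.g. a 4090)
        flash_attn=True,   # fused attention kernels where the build supports it
    )
    out = llm("Summarize the repo layout:", max_tokens=256)
    print(out["choices"][0]["text"])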
sota is fragmented and cost-shaped, not monolithic
GLM-5.1 scores 58.4 on SWE-Bench Pro, ranking #1 in open weights and #3 globally while roughly matching Claude Opus 4.6’s coding and agentic performance at about one-third the cost under an MIT license.
Kimi K2.6, released as open source, scores 58.6 on SWE-Bench Pro and beats Claude Opus 4.6 and GPT-5.4 on that benchmark while being 76% cheaper than Opus 4.7 and pricing input at $0.95 per million tokens versus Opus’s $5.
Alibaba’s Qwen 3.6 Plus is the first model to process over 1 trillion tokens in a day and outperforms Claude Opus on TerminalBench with a 61.6 score, while medium models like Qwen 3.6-35B-A3B aim for ~80% of Opus 4.7’s performance at low operational cost.
At the same time, users report that Kimi underperforms competitors in day-to-day coding despite its benchmark wins, and that Opus 4.7 feels like a regression from 4.6. Meanwhile, ‘thinking’ token burn and new $100/month ChatGPT tiers are reshaping how much frontier performance teams can actually afford to tap.
rag is mutating into hybrid, cache-augmented, agentic systems
Classic RAG setups are hitting latency and cost ceilings because they retrieve on every query and lean on external APIs even when the model already knows the answer.
New patterns layer Cache-Augmented Generation to cache static information, Skill-RAG to detect when a model is approaching a knowledge failure, and MASS-RAG to route work across multiple specialized agents.
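Stripped to a skeleton, the hybrid pattern is a routing decision. In this Python sketch the cache, retriever, and confidence probe are placeholders for whatever your stack provides, not any specific library's API:

    # Hybrid cache-then-route sketch: answer from cache when possible,
    # retrieve only when the model looks likely to miss.
    CACHE: dict = {}  # stands in for a semantic cache keyed on embeddings

    def answer(query: str, llm, retriever, confidence_probe) -> str:
        if query in CACHE:                 # CAG: cached static answers skip the LLM
            return CACHE[query]
        if confidence_probe(query) < 0.5:  # Skill-RAG-style gap detection:
            docs = retriever(query, k=4)   # retrieve only near a knowledge failure
            prompt = "\n".join(docs) + "\n\n" + query
        else:
            prompt = query                 # the model already knows; skip retrieval
        result = llm(prompt)
        CACHE[query] = result
        return result

The win is that the expensive paths (retrieval calls, external APIs) only fire when the cheap checks say they are needed.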
LangGraph users are shipping semantic caching and a memory firewall that intercepts around 90.5% of memory poisoning attempts, while also exploring microRAG for tightly scoped domains.
Legal RAG stacks are dealing with corpora of up to 4 million documents, some over 30,000 tokens long, where practitioners call standard vector-similarity ranking dangerous and find query expansion empirically more important than chunk size.
mcp and supply chain: tool integration as attack surface
The Model Context Protocol (MCP) has effectively become the standard for wiring agents into tools and data, with over 97 million SDK downloads per month and more than 177,000 registered tools including Figma design, n8n workflow control, paper search, legal databases, and SSH/deployment bridges.
MCP Apps now let tools return interactive applications directly in chat, and specialized servers like Pentester-MCP bundle 235+ pentesting tools into agents for autonomous security work.
At the same time, MCP is being called out explicitly as a security risk because it connects agents straight into internal systems, prompting calls for a unified security framework around authentication, network isolation, and logging.
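For a sense of what is being wired together, here is a minimal MCP tool server using the official Python SDK's FastMCP helper, with the per-call audit logging those security proposals ask for. The tool itself is a toy, and authentication and network isolation have to live outside this file:

    # Minimal MCP tool server; the tool body is a toy placeholder.
    import logging
    from mcp.server.fastmcp import FastMCP

    logging.basicConfig(level=logging.INFO)
    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def lookup_ticket(ticket_id: str) -> str:
        """Fetch a support-ticket summary by id (toy implementation)."""
        logging.info("tool=lookup_ticket id=%s", ticket_id)  # audit every call
        return f"ticket {ticket_id}: status=open"

    if __name__ == "__main__":
        mcp.run()  # defaults to stdio transport for local agent clients

Every tool registered this way is a door into whatever system sits behind it, which is exactly why the logging and isolation questions are getting loud.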
The broader supply chain picture is ugly. The LiteLLM PyPI compromise briefly exfiltrated SSH keys and cloud credentials from versions pulling ~97M monthly downloads; axios shipped a full remote access trojan via npm; Vercel suffered an OAuth-powered breach that exposed environment variables; and Claude Code’s own source leaked through its npm registry. Meanwhile, audits describe OpenClaw-style harnesses as security nightmares riddled with privilege-escalation and sandbox-escape bugs.
What This Means
The action has moved into the ‘boring’ layers around models—harnesses, memory, local runtimes, and tool protocols—and that’s where both the biggest gains and the nastiest failures are now showing up.
On Watch
/DeepSeek V4 is slated for late April with a 1M-token context window and multimodal support, which could shake up long-context and open-weight agent stacks if it lands anywhere near its marketing.
/ARC-AGI-3 remains the only unsaturated agentic intelligence benchmark with frontier models under 1% of human efficiency, while tools like Seed IQ hit 95%, keeping the AGI-vs-benchmarks narrative volatile and ripe for re-interpretation.
/The MCP ecosystem is expanding fast—from Gemini API support to pentesting bundles like Pentester-MCP—without a mature shared security model, setting up a likely high-profile MCP-related incident.
Interesting
/A memory layer for AI coding agents achieved an 80% F1 score on the LoCoMo benchmark, significantly outperforming standard RAG systems.
/Orla, an open-source framework, makes LangGraph agents faster and cheaper by decoupling inference time from application logic.
/Enabling auto_commit=True in Kafka can lead to silent document deletions in RAG pipelines, raising data-integrity concerns (see the sketch after this list).
/WebGPU has shown up to 223x speedups over PyTorch on datacenter GPUs, a significant advantage for browser-based ML frameworks.
/Memory Sparse Attention processes up to 100M tokens by managing the KV cache efficiently in GPU VRAM.
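On the Kafka item: the safer pattern is to commit offsets only after a document actually lands in the index, so a crash replays messages instead of silently dropping them. A sketch with kafka-python, where index_document is a hypothetical stand-in for your ingestion step:

    # Commit-after-write consumer; index_document is a hypothetical
    # stand-in for writing into your vector store.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "doc-updates",
        bootstrap_servers="localhost:9092",
        enable_auto_commit=False,  # auto-commit can ack messages you never processed
        group_id="rag-ingest",
    )
    for msg in consumer:
        index_document(msg.value)  # hypothetical ingestion into the vector store
        consumer.commit()          # ack only after a successful write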
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.