General-purpose agents just had their first real sorting: Hermes is winning hard while OpenClaw is burning out on security and fragility. Under the surface, MCP tool layers, explicit memory systems, and aggressive local quantization are becoming the real stack decisions, with eval harnesses and security failures—not prompts or model hype—driving where serious builders focus.
In other words, the story has moved from "what model" to "what system" and whether that system can stay fast, observable, and safe once agents start taking actions.
Key Events
/Hermes Agent became the most used AI on OpenRouter, processing 271 billion tokens and overtaking Claude Code and OpenClaw.
/Open-source framework OpenClaw was found poisoned with over 575 malicious skills injected by just 13 accounts, driving a steep usage decline.
/CodeGraphContext, an MCP server that graphs codebases, surpassed 100,000 downloads as new MCP memory servers rolled out in production SaaS stacks.
/TurboQuant-powered BeeLlama.cpp pushed Qwen 3.6‑27B to around 80–87 tokens/s at 262K context on an RTX 4090, far above baseline llama.cpp speeds.
/An npm supply-chain attack injected credential-stealing malware into 84 TanStack packages, compromising CI tokens across projects.
Report
The agent ecosystem just had its first real selection event: Hermes Agent is exploding while OpenClaw is visibly collapsing under security and reliability problems.
Underneath that, a new stack is taking shape (MCP servers, explicit memory layers, aggressive quantization), and its friction points show where the real engineering stories are.
hermes vs openclaw: the first real agent fork
For engineers already shipping agentic workflows, the under-covered story is that Hermes Agent has effectively become the default general agent while OpenClaw is quietly aging out.
Hermes is now the most used AI on OpenRouter, processing 271 billion tokens and overtaking Claude Code and OpenClaw in real usage.
Its framework has picked up over 140,000 GitHub stars in less than three months and already runs against local GGUF and MLX models as well as cloud APIs.
OpenClaw, by contrast, is trending down after being poisoned with more than 575 malicious skills from just 13 accounts and is widely described as too fragile for business use unless isolated on its own machine.
mcp vs rest: the new tool fabric for agents
For people designing tool-using agents this quarter, the real shift is from classic REST-style APIs toward MCP servers as the integration layer.
MCP explicitly describes capabilities for autonomous LLM agents in a way OpenAPI never did, and early stacks already route everything from browsers to Zoom and Elementor through MCP servers with built-in OAuth and app auth.
CodeGraphContext, which turns entire codebases into navigable graphs for agents, has passed 100,000 downloads on PyPI, while a new MCP memory server on Cloudflare Workers handles semantic search and dedicated memory tools in production SaaS.
Android now ships native MCP support in the OS so apps can expose cross-app actions to agents, a strong signal that this protocol is being treated as infrastructure rather than a niche experiment.
The flip side is a visible "context tax": a five-server browser stack with Playwright and DevTools MCPs burns about 55,000 tokens before any work begins, and users report MCP overhead and latency dominating real tasks.
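To make the context tax concrete, here is a rough back-of-the-envelope sketch. The 55,000-token overhead is the figure reported above; the 200,000-token context window and the function name `context_tax` are assumptions chosen only for illustration, not properties of any particular model.

```python
def context_tax(overhead_tokens: int, context_window: int) -> float:
    """Return the fraction of the context window consumed before any work starts."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return overhead_tokens / context_window


# 55,000 tokens is the reported overhead of the five-server browser stack.
# 200,000 is a hypothetical window size, not a figure from the report.
OVERHEAD = 55_000
WINDOW = 200_000

fraction = context_tax(OVERHEAD, WINDOW)
remaining = WINDOW - OVERHEAD
print(f"{fraction:.1%} of the window used up front, {remaining:,} tokens left")
```

Under that assumed window, more than a quarter of the budget is gone before the agent reads its first instruction, which is why trimming the set of loaded servers per task may matter more than tuning prompts.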
memory is escaping the vector db
For teams building non-trivial agents, the interesting work has moved from "just use a vector DB" to explicit, layered memory systems.
The Agent Memory Protocol (AMP) is trying to standardize how agents read and write memories, while projects like agentmemory and a Cloudflare-based MCP memory server give coding agents persistent, searchable state instead of per-session context blobs.
Claude Code is architected as six layers where the model is only one node inside the loop, and Hermes-style agents now retain memory across sessions, treating recall and storage as separate concerns from raw model context.
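The separation of storage and recall from raw model context can be sketched in a few lines. This is a toy illustration, not the API of AMP, agentmemory, or any other project named above: the `MemoryStore` class, its methods, and `build_context` are hypothetical, an in-memory list stands in for persistent storage, and keyword overlap stands in for semantic search.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy memory layer: writing and recalling are separate operations."""
    records: list[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        # A real system would persist this to disk or a database.
        self.records.append(text)

    def recall(self, query: str, limit: int = 3) -> list[str]:
        # Naive keyword overlap is a stand-in for semantic search.
        terms = set(query.lower().split())
        scored = [(len(terms & set(r.lower().split())), r) for r in self.records]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [text for _, text in ranked[:limit]]


def build_context(store: MemoryStore, task: str, limit: int = 2) -> str:
    """Assemble a prompt from recalled memories only, not the full history."""
    recalled = store.recall(task, limit)
    return "\n".join(["Relevant memories:", *recalled, "Task:", task])


store = MemoryStore()
store.write("deploy script lives in tools/deploy.sh")
store.write("user prefers tabs over spaces")
store.write("staging deploy needs VPN")

print(build_context(store, "how do I deploy to staging"))
```

The point of the design is that the model only ever sees what `recall` returns for the current task, so the stored history can grow without the prompt growing with it.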
Meanwhile, long-context-only approaches are showing cracks: Kimi’s 262K-token window has been reported to bog down and lose coherence on extended tasks, and OpenCode users complain about slow prompt processing and inefficient context usage.
Even outside code, Obsidian users wiring Claude-like agents into their vaults are running into abandoned knowledge bases and serious plugin vulnerabilities, including a case of plugins being abused as a remote-access trojan and a critical bug in Tasks.
eval harnesses and routing are eclipsing single-model fandom
For experienced engineers scaling systems, the pattern is that performance gains are coming from eval harnesses and routing logic rather than betting on one "best" model.
Forward-deployed AI engineers are explicitly being asked for harness engineering, prompt caching, and model routing skills, while practitioners complain that current eval tools obsess over prompts instead of full production workflows and execution efficiency.
LangGraph’s open 3-agent blind eval primitive, robotics benchmarks grounded in real-world tests, and Perplexity Enterprise’s 74,000 weekly tasks at PayPal for validation and research all point to evaluation loops becoming core infrastructure.
At the same time, usage data shows developers are routing by task: many prefer Claude or Kimi-style tools for long-form coding and conversation, GPT‑5.5 or Gemini for top-end reasoning, and Perplexity for research, with people actively switching models per job rather than settling on a single vendor.
Users also report spending more time debugging AI workflows, quotas, and token blowups than writing prompts, which shifts the interesting engineering work into LLMOps and harness design.
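The per-task routing described above can be sketched as a small lookup. Everything here is an assumption for illustration: the route labels mirror the preferences reported in this section rather than real model identifiers, and the keyword classifier is a deliberately crude placeholder for whatever classification a production harness would use.

```python
# Hypothetical routing table; values are descriptive labels, not real model IDs.
ROUTES: dict[str, str] = {
    "coding": "claude-or-kimi-style",
    "reasoning": "gpt-5.5-or-gemini",
    "research": "perplexity",
}
DEFAULT_ROUTE = "claude-or-kimi-style"


def classify(task: str) -> str:
    """Crude keyword classifier; a real harness would use something sturdier."""
    text = task.lower()
    if any(w in text for w in ("refactor", "bug", "implement", "code")):
        return "coding"
    if any(w in text for w in ("cite", "sources", "research")):
        return "research"
    if any(w in text for w in ("prove", "plan", "analyze")):
        return "reasoning"
    return "general"


def route(task: str) -> str:
    """Pick a backend label for the task, falling back to the default."""
    return ROUTES.get(classify(task), DEFAULT_ROUTE)


for task in (
    "Refactor the auth module",
    "Research recent MCP servers and cite sources",
    "Prove this scheduler cannot deadlock",
):
    print(f"{task!r} -> {route(task)}")
```

Keeping the table separate from the classifier means an eval harness can score each route independently and swap a backend without touching the classification logic.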
agents are moving from talk to action, and the security bill is coming due
For anyone letting agents touch real systems—codebases, CI, SaaS APIs—the most acute story right now is the security cliff. Google confirmed the first known case of hackers using AI to create a zero-day exploit that bypassed a two-factor authentication system, while a Chinese grey market sells stolen Claude API access at 90% off.
Supply-chain attacks are escalating in parallel: 84 malicious TanStack package versions on npm stole CI credentials, a mini Shai-Hulud worm abused GitHub Actions cache poisoning to compromise over 160 npm packages, and Vercel’s ecosystem saw a third-party breach leak API keys.
On the application side, scans found that 90% of 48 vibe-coded apps had at least one vulnerability and that 22% of Supabase projects expose user data to anonymous access, while thousands of AI-built assets on platforms like Replit are exposing sensitive information.
Agent frameworks themselves are part of the attack surface: OpenClaw has been poisoned with hundreds of malicious skills and is now recommended only on isolated systems, Obsidian plugins have already been abused as remote-access trojans, and Perplexity is responding by building a dedicated secure agent runtime sandbox.
local-first, quantized stacks are becoming production-grade
For builders trying to escape cloud GPU pricing, the numbers around local inference and quantization have quietly crossed a threshold from toy to serious.
Qwen 3.6 27B can hit around 135 tokens per second on an RTX 3090 with DFlash and TurboQuant, and over 80 tokens per second at long contexts on consumer GPUs such as 12GB cards and the RTX 4090.
The BeeLlama.cpp fork is 2–3× faster than baseline llama.cpp on an RTX 3090, while Multi‑Token Prediction in llama.cpp and TurboQuant routinely deliver 40% wall-clock speedups without changing model weights.
New NVFP4 formats push throughput up to 270 tokens per second on Blackwell GPUs and enable aggressive KV-cache compression, but users are already flagging noticeable quality loss versus FP8 or FP16 in some workloads.
Around this, a coherent local stack is consolidating: over 176,000 public GGUF models for llama.cpp, LM Studio and Ollama managing multi-GPU home rigs, SQLite used as an ultra-fast local store, FastAPI as the default lightweight AI backend, and RunPod renting A100‑class GPUs at roughly a dollar an hour.
What This Means
Across these threads, the leverage is shifting from picking a single "best" model to designing stacks: agents, MCP tool layers, memory systems, eval harnesses, and increasingly local, quantized runtimes. The gap between glossy demos and durable systems is now defined by security boundaries and workflow-level engineering, not by prompt copywriting.
On Watch
/Subquadratic’s SubQ model claims 1000× efficiency gains over current LLMs, with researchers publicly asking for independent proof before treating the numbers as real.
/Chrome is reportedly silently downloading a ~4GB Gemini Nano model for local summarization while Android rolls out Gemini Intelligence across devices, hinting at a near-future where on-device agents are the default UX.
/China’s first dedicated AI agent policy defines agents as autonomous systems and sets a "safety first, innovation second" principle, a stance that could shape how global platforms frame agent capabilities and constraints.
Interesting
/AI now generates 75% of Google’s new code and up to 30% of Microsoft’s new code, indicating a significant shift in coding practices.
/GBrain offers a unique approach to agent memory by using markdown files as a source of truth, contrasting with traditional vector-based methods.
/The 'memory curse' in LLM agents describes how long histories can degrade performance by making agents overly focused on past events rather than future actions.
/A 7B language model trained with reinforcement learning can orchestrate larger models like GPT-5 and Claude Sonnet 4, outperforming them on various benchmarks.
/DeepSeek's v4 model, despite having only 210B parameters, performed similarly to models four times its size in the Claw-Eval benchmark.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.