Agent stacks are getting real stress tests: multi-agent frameworks, LLM routers, parsers, and memory systems are where things are failing, being attacked, or quietly working—not the flashy model frontends. Local inference (Gemma 4, TurboQuant) and second‑gen RAG (layout‑aware parsers, hybrid retrieval) are now viable for serious builders, but security incidents like the LiteLLM and Trivy compromises, plus the Claude Code leak, show that the AI plumbing itself has become a prime attack surface.
The capability frontier is splitting too, with internal‑only models like Claude Mythos and harsh benchmarks like ARC‑AGI‑3 revealing a mix of superhuman code/security skills and still‑brittle interactive reasoning.
Key Events
/Claude Code's full 512,000‑line source leaked via an NPM map file, prompting Anthropic to issue over 8,000 copyright takedown requests.
/The LiteLLM PyPI package (versions 1.82.7 and 1.82.8) was backdoored to exfiltrate SSH keys, AWS credentials, and other secrets in a large‑scale supply‑chain attack.
/Google released Gemma 4 Apache‑licensed open models, including a 31B variant that runs locally with a 256K context window.
/Google’s TurboQuant compression cut LLM KV‑cache memory usage by at least 6x and boosted speed by up to 8x without reported accuracy loss.
/The ARC‑AGI‑3 benchmark launched with 135 interactive environments where humans score 100% but frontier models reach only about 0.37% RHAE.
Report
Most of the interesting movement isn’t in new frontier models; it’s in how agent stacks behave under real load and real attacks. The gaps between hype demos and production (security, memory, retrieval, orchestration) are where the stories worth telling sit right now.
multi-agent coding stacks vs single-llm + tools
LangChain’s Deep Agents (MIT‑licensed, model‑agnostic) packages planning, filesystem access, shell execution, and async sub‑agents into a Claude Code–style harness, while LangGraph underpins 8‑node Agentic RAGs and finance agents with crash recovery layers and cost governance.
CrewAI, OpenClaw, Claude Managed Agents, and Slack’s new “agentic operating system” positioning all treat orchestration graphs, state, and observability as first‑class infrastructure rather than demo glue.
At the same time, multiple teams report that simple automations or even uncoordinated agents outperform complex role‑based crews, and that many LangGraph/LangChain builds collapse back into 50‑line scripts once real failure modes show up.
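The "50-line script" those builds collapse back into is easy to picture: a loop that calls a model, dispatches tool calls, and stops when the model answers. A minimal sketch, with `call_llm` and the shell tool stubbed out (both are hypothetical stand-ins, not any specific framework's API):

```python
# Minimal single-LLM + tools loop: the kind of script many
# orchestration-framework builds reportedly collapse back into.
import json

def call_llm(messages):
    # Stub: a real implementation would call a chat-completion API.
    # Here we simulate a model that asks for one tool call, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "shell", "args": {"cmd": "echo hello"}}
    return {"answer": "The command printed: hello"}

TOOLS = {
    "shell": lambda args: "hello",  # stubbed shell execution
}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "answer" in reply:                         # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])  # dispatch the tool call
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached"
```

Everything a heavier framework adds (planning nodes, sub-agents, crash recovery) layers on top of exactly this loop, which is why it is the natural fallback when the layers misbehave.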
Audience: experienced engineers and tool builders deciding how “agentic” their dev stack really needs to be this quarter.
ai supply-chain attacks hit the llm router layer
The LiteLLM PyPI package (97M downloads/month) was briefly shipped in compromised versions that stole SSH keys, AWS credentials, database passwords, and cloud tokens, with the malicious code often running at install time, before the package was ever imported.
The TeamPCP attackers got in by first compromising the Trivy CI scanner, then used that foothold to backdoor LiteLLM and other packages, ultimately touching an estimated 500,000 machines across 36 cloud environments.
Axios on npm (300M weekly downloads) was similarly hit, and the UK NIS report now attributes 29% of cybersecurity incidents to supply‑chain issues.
At the agent layer, OpenClaw was exploited via prompt injection to compromise 4,000 computers, while Claude Code’s 512k‑line leak revealed its internal telemetry and profanity‑logging behavior.
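On the defensive side, the bluntest countermeasure to a compromised release is refusing to install any artifact whose hash you didn't pin. A minimal sketch of that check, using an invented package name and, purely so the example is checkable, the sha256 of an empty file as the pinned digest:

```python
# Pin exact artifact hashes and verify downloads before they reach
# the installer. The filename and digest below are illustrative only;
# the digest is sha256 of empty bytes, chosen so the example is testable.
import hashlib

PINNED = {
    "example_pkg-1.0.0-py3-none-any.whl":
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def verify_artifact(filename: str, data: bytes) -> bool:
    """Return True only if the artifact matches its pinned sha256."""
    expected = PINNED.get(filename)
    if expected is None:
        return False  # unpinned artifacts are rejected outright
    return hashlib.sha256(data).hexdigest() == expected
```

pip ships this policy natively as hash-checking mode (`pip install --require-hashes -r requirements.txt`), which would have rejected the backdoored LiteLLM wheels on any machine that had pinned the clean release's hash.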
Audience: infra and security‑minded agent builders running LLM routers, gateways, and CI—this is a “right now” story, not a hypothetical.
local inference, turboquant, and the new memory ceiling
Gemma 4’s 31B dense and 26B MoE models run locally under Apache 2.0 on consumer RTX cards and even phones, with 6GB‑RAM variants (E2B/E4B) hitting ~40 tok/s on an iPhone 17 Pro and full 256K context on an RTX 5090 via TurboQuant KV compression.
Ollama, llama.cpp, MLX, AMD’s Lemonade server, and frameworks like GAIA are turning “local agent” from a novelty into a normal deployment target, while vLLM and TinyServe chase high‑throughput, multi‑user inference on 3090/4090s and B200 clusters.
TurboQuant claims ≥6x memory reduction and up to 8x speedups by aggressively quantizing KV cache—enabling 72K contexts on Llama‑70B and 100K‑token chats on laptops—yet critics note it mainly compresses context, not weights, and may trade subtle quality regressions for headline numbers.
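The headline numbers are easy to sanity-check with back-of-envelope KV-cache arithmetic. The sketch below assumes a Llama-70B-class shape (80 layers, 8 grouped-query KV heads, head dim 128); those figures are our assumptions for illustration, not anything from the TurboQuant report:

```python
# KV-cache sizing: two tensors (K and V) per layer, per KV head, per token.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Assumed Llama-70B-class shape; bytes_per_elem=2 means fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(72_000)   # fp16 baseline at a 72K-token context
quant = fp16 // 6               # the claimed >=6x compression
print(f"fp16: {fp16 / 2**30:.1f} GiB, compressed: {quant / 2**30:.1f} GiB")
```

Under these assumptions the fp16 cache alone lands around 22 GiB at 72K tokens, which is why compressing the KV cache rather than the weights is what unlocks long contexts on a single consumer GPU.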
Meanwhile, MLX ports of Gemma 4 show larger footprints, occasional crashes, and slightly worse output than GGUF on some Macs, keeping the local‑vs‑cloud decision very workload‑specific.
Audience: infra engineers and serious solo builders choosing between local‑first and API‑first agents over the next 1–2 quarters.
second-generation rag: parsers, ocr, and hybrid retrieval
Naive “chunk PDFs + embeddings” RAG breaks quickly on real documents—construction drawings, overlapping charts, messy financials—where OCR and layout understanding are brittle.
LLM‑native OCR and parsing stacks like LlamaParse (Agentic Plus mode with bounding boxes), LiteParse (500 pages in ~2 seconds, local, 50+ formats), and document models such as Qianfan‑OCR, MinerU2.5‑Pro, Nanonets OCR‑3, and GLM‑OCR now anchor many serious pipelines.
Retrieval is converging on hybrid: BM25 in Elasticsearch/Postgres, CJK‑aware BM25 for chat logs, ONNX‑based rerankers in systems like knowledge‑rag, and new frameworks like CDRAG that let LLMs guide cluster‑aware retrieval.
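Hybrid retrieval ultimately comes down to fusing two rankings, and reciprocal rank fusion (RRF) is the common score-free way to do it. A minimal sketch using toy doc-id lists where a production system would plug in BM25 hits and vector-index hits:

```python
# Reciprocal rank fusion: combine multiple ranked lists of doc ids.
# Each list contributes 1 / (k + rank + 1) per document; k=60 is the
# conventional damping constant from the original RRF formulation.
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists; earlier positions score higher."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_c", "doc_b"]    # lexical (BM25-style) ranking
dense_hits = ["doc_b", "doc_a", "doc_d"]   # dense (embedding) ranking
print(rrf([bm25_hits, dense_hits]))        # doc_a wins: high in both lists
```

Because RRF only consumes ranks, it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales, which is a large part of why hybrid stacks converge on it.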
At the same time, teams deploying RAG in production report that ingestion quality, temporal relevance, and poisoning defenses matter more than which vector DB they picked in the first place.
Audience: engineers already running RAG for agents and copilots, now feeling the pain of bad PDFs and stale answers.
capability split: mythos, arc-agi-3, and agent reality
Anthropic’s internal Claude Mythos model is described as powerful enough to find thousands of zero‑day vulnerabilities across major OSes and browsers, with a 244‑page system card stating its output can’t be trusted without extra verification; one security researcher reports finding more bugs with Mythos in weeks than in his entire prior career.
At the same time, ARC‑AGI‑3—135 interactive environments testing skill acquisition rather than memorized knowledge—shows humans at 100% while top frontier models sit under 1%, with the best scores around 0.37% RHAE.
Vendors and pundits claim we are 70–80% of the way to AGI and that AGI may arrive within a few years, yet benchmark authors emphasize that current systems still lack robust hypothesis formation and iterative reasoning.
Community discourse is split between “AGI is here” rhetoric and skepticism that today’s LLM‑centric path is anything more than scaled autocomplete, with ARC‑AGI‑3 positioned explicitly to puncture benchmark gaming.
Audience: advanced agent and security tool builders who need to understand where models are superhuman (code/search/exploit discovery) versus where they still fall over (interactive, novel reasoning) over the next year.
What This Means
The center of gravity has moved from single‑model heroics to the messy details of stacks: orchestration, security, memory, and retrieval now dominate whether agents are powerful, fragile, or outright attack surfaces. The public models look increasingly like the safe, rate‑limited tip of an iceberg whose most capable systems and sharpest failure modes are only visible in infrastructure, leaks, and internal‑only deployments.
On Watch
/MCP is becoming the default tool layer (97M monthly SDK downloads) but experiments show tool calls costing up to 32× more tokens than equivalent CLIs and failing 28% of the time, with 98% of tool descriptions lacking guidance on when to use them.
/Gemma 4’s mobile‑focused E2B/E4B models can run multimodal assistants on 6GB‑RAM devices but struggle with complex coding/reasoning tasks and even crash on some Android integrations due to missing OpenCL libraries.
/Emerging agent/market protocols like ACP for identity/compliance, the APEX standard for financial agents, and concerns that machine identities may soon outnumber human ones hint at a coming 'protocol layer' for autonomous agents.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.