Local models plus decoding tricks just got fast and cheap enough to power serious agents, but they’re fragile and heavily dependent on hardware and stack choices. At the same time, AI is quietly writing most of the code at big companies while supply‑chain attacks, poisoned skills, and memory/RAG failures show how risky these stacks are becoming.
The real action for builders is shifting from model choice to system design: orchestration, security, and memory architecture.
Key Events
/Hermes Agent became the most-used AI app on OpenRouter, surpassing Claude Code and OpenClaw in token processing.
/DFlash speculative decoding delivered up to 8.5× faster LLM token generation and helped Gemma 4 26B reach ~600 tokens/sec on an RTX 5090.
/The 'Mini Shai-Hulud' npm attack compromised 84 TanStack packages and over 160 npm packages in total, stealing CI and cloud credentials.
/Hugging Face hit 1M open datasets while being poisoned with over 575 malicious skills and a fake 'OpenAI Privacy Filter' downloaded 244,000 times.
/Claude Mythos drove Firefox to fix more security bugs in April 2026 than in the prior 15 months combined, surfacing 271 vulnerabilities overall.
Report
Local-first AI stacks and long-lived agents quietly crossed a line this week: they’re now fast and cheap enough for serious use, but brittle enough that security and orchestration dominate builder pain.
For experienced engineers shipping agents, RAG, and coding workflows, the writable gaps are where performance hacks, memory design, and supply‑chain risk intersect.
local-first inference is fast enough to matter, fragile enough to hurt
Everyone is posting tokens/sec screenshots, but the interesting part is that Qwen 3.6 27B is now reported at up to 135 tokens/sec on a single RTX 3090 via BeeLlama.cpp, putting local agents into 'production-ish' territory.
The Qwen 3.6 35B A3B variant is also hitting around 80 tokens/sec at a 128K context window on a 12GB GPU, which changes what a solo builder can do with on‑device coding agents.
Gemma 4 26B is clocking about 600 tokens/sec on an RTX 5090 in optimized runtimes like vLLM.
DeepSeek V4 Flash reaches 85.52 tokens/sec at a 524K context window on dual RTX PRO 6000 Max‑Q GPUs while costing roughly 90% less than GPT 5.4 Mini.
The catch is that all of this is extremely stack‑sensitive: builders report llama.cpp slowdowns on Windows with AMD GPUs, Vulkan generally beating ROCm but needing careful tuning, Qwen 3.6 instability in some harnesses, and Ollama both struggling with complex reasoning and exposing an unauthenticated 'Bleeding Llama' memory leak with potential remote code execution.
decoding tricks are now part of the product spec
Speculative decoding methods like DFlash are delivering up to 8.5× faster generation on some LLMs without measured accuracy loss in reported benchmarks.
On an RTX 5090, DFlash is part of the stack that lets Gemma 4 26B reach around 600 tokens/sec, and users report it outperforming classic MTP on parallel block diffusion drafting and stateful context management.
Multi‑Token Prediction gives Qwen 3.6 27B about a 2.5× speedup over baseline decoding, and the Qwen 3.6 35B A3B variant is generating around 80 tokens/sec at a 128K context on 12GB VRAM.
Gemma 4 MTP builds are drafting roughly 40% faster than standard llama.cpp‑style decoding in community benchmarks.
The other half of the story is where it breaks: DFlash often degrades beyond ~20K tokens and seems to work best under about 4K tokens on some models, MTP eats more VRAM, and gains are much weaker on creative chat than on coding or tool‑using agents. That has prompted calls for workload‑specific validation suites.
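For readers who have not looked inside these decoders, the draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a toy greedy variant, not DFlash or any runtime's MTP implementation: the "models" are hypothetical arithmetic functions, and names such as speculative_decode, target_model, and draft_model are invented for illustration. It shows why output matches plain greedy decoding while needing fewer passes through the expensive model.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of target verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every drafted position. A real runtime does this
        #    in one batched forward pass, so we count it as a single pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
        else:
            # Every draft was accepted; the same pass also yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes


# Hypothetical toy models: the draft agrees with the target except after
# tokens divisible by 5, which forces occasional rejections.
def target_model(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 17


def draft_model(prefix: List[int]) -> int:
    guess = (prefix[-1] * 3 + 1) % 17
    return guess if prefix[-1] % 5 else (guess + 1) % 17


tokens, target_passes = speculative_decode(target_model, draft_model, [1], max_new=20)
print(tokens == greedy_decode(target_model, [1], 20), target_passes)
```

The speedup depends entirely on how often the draft is accepted, which is one plausible reason gains vary so much between coding, tool use, and creative chat.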
agents are turning into distributed systems
Hermes Agent is now the most‑used AI app globally on OpenRouter, surpassing Claude Code and OpenClaw in token volume, and is being wired into cloud services that can create Cloudflare accounts, buy domains, and handle payments via USDC on AWS.
At the same time, there’s a visible shift toward personal and self‑hosted agents, with trends away from subscriptions to locally hosted setups and new releases like n8n‑as‑code V2 embedding an agent directly into VS Code for workflow management.
Telegram is becoming a de facto agent surface, adding Guest AI Bots, bot‑to‑bot chats, chat automation, and Telegram‑Drive while bots on the platform already handle voice notes, images, chess analysis from screenshots, and CRM‑style workflows.
Under the hood, orchestration is shifting from opaque single agents to explicit graphs and flows, with LangGraph 1.2 adding delta channels and checkpointing for long‑running agents and n8n powering multi‑layer AI revenue‑intelligence and fraud‑detection systems across Redis, PostgreSQL, and LLM agents.
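To make "explicit graphs with checkpointing" concrete, here is a minimal sketch in plain Python. It deliberately does not use LangGraph's or n8n's APIs; FlowRunner, the node names, and the in-memory checkpoint store are all hypothetical. It shows the core property long-running agents want: after a failure, a run resumes from the last completed node instead of starting over.

```python
import json
from typing import Callable, Dict, Optional

State = Dict[str, object]
Node = Callable[[State], State]


class FlowRunner:
    """Runs named nodes along explicit edges, snapshotting state after each node."""

    def __init__(self, nodes: Dict[str, Node], edges: Dict[str, Optional[str]]):
        self.nodes = nodes
        self.edges = edges                      # node name -> next node name (None = done)
        self.checkpoints: Dict[str, str] = {}   # run_id -> JSON snapshot (in-memory stand-in)

    def run(self, run_id: str, start: str, state: State) -> State:
        # Resume from the last snapshot if this run was interrupted earlier.
        if run_id in self.checkpoints:
            saved = json.loads(self.checkpoints[run_id])
            current, state = saved["next"], saved["state"]
        else:
            current = start
        while current is not None:
            state = self.nodes[current](state)  # may raise; checkpoint stays at prior node
            current = self.edges[current]
            self.checkpoints[run_id] = json.dumps({"next": current, "state": state})
        return state


calls = {"fetch": 0, "score": 0}


def fetch(state: State) -> State:
    calls["fetch"] += 1
    return {**state, "rows": [3, 5, 8]}


def score(state: State) -> State:
    calls["score"] += 1
    if calls["score"] == 1:
        raise RuntimeError("transient failure")  # simulated flaky step
    return {**state, "total": sum(state["rows"])}


runner = FlowRunner({"fetch": fetch, "score": score}, {"fetch": "score", "score": None})
try:
    runner.run("run-1", "fetch", {})
except RuntimeError:
    pass  # first attempt fails inside "score"
final = runner.run("run-1", "fetch", {})  # resumes at "score"; "fetch" is not re-run
print(final["total"], calls)
```

A production setup would persist snapshots somewhere durable and handle non-idempotent nodes with more care; the sketch only illustrates the control flow.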
MCP is emerging as a standard protocol layer in these stacks, standardizing how agents discover and call tools while also acting as a security boundary and shared auth layer across multiple services.
memory and rag are becoming their own infra layer
Anthropic’s agents now use a sleep mechanism to replay experiences and reorganize memory traces, while the Hermes Memory Installer 2.0 builds long‑term agent memory on PostgreSQL to give assistants durable, queryable history.
OpenCode‑based coding agents add persistent memory so they can retain project context without re‑explanation, and separate projects use PostgreSQL to track per‑session budgets as agents query production databases.
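A per-session budget guard of the kind described above might look like the following sketch. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so the example is self-contained; the table name, column names, and the notion of abstract "units" are assumptions, not details from any of the projects mentioned.

```python
import sqlite3


def open_budget_store() -> sqlite3.Connection:
    """In-memory store; a real deployment would point at a PostgreSQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE session_budget (
               session_id  TEXT PRIMARY KEY,
               limit_units INTEGER NOT NULL,
               spent_units INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def start_session(conn: sqlite3.Connection, session_id: str, limit_units: int) -> None:
    conn.execute(
        "INSERT INTO session_budget (session_id, limit_units) VALUES (?, ?)",
        (session_id, limit_units),
    )
    conn.commit()


def try_charge(conn: sqlite3.Connection, session_id: str, units: int) -> bool:
    """Spend units only if the session stays within its limit.

    The check and the increment happen in one conditional UPDATE, so two
    concurrent agent calls cannot both slip past the limit.
    """
    cur = conn.execute(
        "UPDATE session_budget SET spent_units = spent_units + ? "
        "WHERE session_id = ? AND spent_units + ? <= limit_units",
        (units, session_id, units),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_budget_store()
start_session(conn, "s1", limit_units=10)
results = [try_charge(conn, "s1", 6), try_charge(conn, "s1", 5), try_charge(conn, "s1", 4)]
print(results)
```

The agent would call try_charge before each production query and refuse to run the query when it returns False.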
On the retrieval side, EnterpriseRAG‑Bench is being introduced to stress‑test RAG systems on complex enterprise data rather than toy Q&A logs.
Blockify‑style corpus optimization reportedly shrinks document stores by about 40× and cuts tokens per query by roughly 3× compared to naive chunk‑and‑embed setups.
The darker story is that memory poisoning and prompt‑injection remain common control failures in RAG agents, while MCP‑based multi‑server setups introduce a 'context tax' where tool catalogs and server metadata eat context window and degrade model behavior.
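The "context tax" is easy to estimate before it bites. The sketch below sums a rough token count for each server's tool definitions and reports the share of the context window they consume. The 4-characters-per-token heuristic is an assumption (a real measurement should use the target model's tokenizer), and the two catalogs are hypothetical examples rather than real MCP servers.

```python
import json
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer for accuracy.
    return max(1, len(text) // 4)


def context_tax(tool_catalogs: Dict[str, List[dict]], context_window: int) -> dict:
    """Estimate how much of the window is spent on tool metadata, per server."""
    per_server = {
        server: sum(estimate_tokens(json.dumps(tool)) for tool in tools)
        for server, tools in tool_catalogs.items()
    }
    total = sum(per_server.values())
    return {
        "per_server": per_server,
        "total_tokens": total,
        "fraction_of_window": total / context_window,
    }


# Hypothetical tool catalogs, shaped like name/description/schema entries.
catalogs = {
    "files": [
        {"name": "read_file", "description": "Read a file from the workspace.",
         "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}}},
    ],
    "db": [
        {"name": "run_query", "description": "Run a read-only SQL query.",
         "inputSchema": {"type": "object", "properties": {"sql": {"type": "string"}}}},
        {"name": "list_tables", "description": "List tables in the current schema.",
         "inputSchema": {"type": "object", "properties": {}}},
    ],
}

report = context_tax(catalogs, context_window=8192)
print(report)
```

Running this per server makes it clearer which catalogs to trim or load lazily when many servers share one agent.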
ai coding just became majority author, and the backlash is starting
Airbnb reports that AI now writes about 60% of its new code, with even engineering managers using Claude Code to contribute. Google says AI now generates roughly 75% of its new code, while Microsoft puts its figure at up to 30%. At the shops reporting a majority share, the default author is already a model.
Hermes Agent's lead over Claude Code and OpenClaw in OpenRouter token processing points the same way, and Codex is being run in fully autonomous modes where it completes paid bug‑fixing and security work without direct human steering.
The backlash line is forming: GitHub reversed a move to make Copilot a co‑author on every VS Code project after consent and job‑security fears, usage‑based Copilot billing is confusing developers, and audits of Lovable/Replit‑built apps have found thousands of deployments leaking credentials or exposing sensitive data.
In the trenches, Cursor‑style 'vibe coding' workflows are speeding up delivery but leaving teams to wrestle with hallucinated agents and code‑quality worries, while veteran programmers argue that AI still cannot replace the need for humans who understand security and architecture even as most job postings barely mention AI skills.
What This Means
The center of gravity for AI engineering has shifted from picking 'the best model' to composing brittle but powerful systems where inference tricks, memory, orchestration, and security posture are all first‑class design choices. The distance between benchmark charts and lived developer experience is widening, and that gap is where the most revealing stories are emerging.
On Watch
/Speculation that the Qwen 3.6 line may be the last open release, with some users fearing future Qwen models will go closed‑weight despite strong demand for Qwen 4, is putting openness and long‑term stability on the radar for router and local‑stack builders.
/GitLab is cutting staff to 'reinvest in growth for the agentic era' while some users migrate to self‑hosted GitLab/Forgejo to escape heavy AI integrations, hinting at an upcoming split between AI‑first and AI‑minimal dev platforms.
/Google’s Gemini Intelligence is being woven directly into Android and Chrome—alongside a $9.99/month Gemini Health Coach and reports of Chrome silently downloading a 4GB AI model—raising questions about OS‑level AI agents, privacy, and regulatory scrutiny.
Interesting
/Local open-weight AI on laptops has improved over twice as fast as Moore's Law.
/The fake OpenAI Privacy Filter incident on Hugging Face underscores the risk of downloading unverified models and packages.
/Many developers report that coding models optimized for greenfield projects struggle with real codebases that have accumulated technical debt.
/A developer reported that a single line change in a system prompt drastically reduced model quality from 84% to 52%.
/A semantic mistake memory layer called DriftGuard was built to help agents remember past mistakes.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.