Cloud AI got a lot more capable and a lot more dangerous to your wallet at the same time: TPU 8t/8i look great on paper, but real users are still getting surprise five-figure bills.
Local Qwen-class models and rapidly shifting AI coding tools are now viable parts of a production stack, but the attack surface around npm, MCP, Mythos-style tooling, and even GitHub CLI telemetry means your boring security and cost controls matter more than the latest model leaderboard.
Key Events
/TPU 8t/8i launched on Google Cloud, offering 2–4× speed over TPU v7 and pods up to 9600 TPUs.
/Google Cloud user hit an unexpected $18k bill despite a $7 budget cap, exposing fragile cost controls.
/GitHub CLI added default pseudoanonymous telemetry for all users, sparking privacy backlash.
/GitHub Copilot paused new Pro/Pro+/Student signups, is removing Opus models from Pro, and will move to token-based billing in June.
/npm package pgserve versions 1.1.11–1.1.13 shipped a credential-stealing postinstall script as a supply-chain attack.
Report
Infra and tooling are shifting under your feet: Google is pushing massive TPU 8t/8i clusters while real users still get surprise five-figure cloud bills.
At the same time, local Qwen-class models and flaky AI coding tools are turning architecture and workflow choices into moving targets.
cloud ai infra and cost volatility
Google launched TPU 8t/8i with 2–4× speed over TPU v7 for training and inference, and pods scaling to 9600 chips, clearly aimed at large LLM workloads already living on GCP.
GCP's AI APIs now push over 16 billion tokens per minute via direct calls, and nearly 75% of its customers are already using AI products in production.
Against that, one GCP user still ate an $18k bill on a project with a $7 budget, so the cost controls visible in the console clearly did not cap real spend.
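Part of why bills outrun budgets is that cloud budgets typically alert rather than cap: on GCP, actually stopping spend means wiring budget notifications through Pub/Sub to code that disables billing. A minimal sketch of the decision step, assuming the costAmount/budgetAmount fields of GCP's budget notification format (the billing-disable call itself is omitted):

```python
import base64
import json

def over_budget(pubsub_event):
    """Return True when a budget notification reports spend above the budget.

    Field names (costAmount, budgetAmount) follow GCP's budget notification
    format; treat the exact schema as an assumption and verify it.
    """
    payload = json.loads(base64.b64decode(pubsub_event["data"]).decode("utf-8"))
    return payload["costAmount"] > payload["budgetAmount"]

# Illustrative event mimicking the $18k-vs-$7 case above
sample_event = {"data": base64.b64encode(
    json.dumps({"costAmount": 18000.0, "budgetAmount": 7.0}).encode()).decode()}
exceeded = over_budget(sample_event)
```

A real handler would follow a True result with a Cloud Billing API call that detaches the billing account, which is the only hard stop GCP offers; the budget threshold alone never blocks spend.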
On AWS, 1 TB on EFS is running around $307 per month while S3 stays much cheaper for the same capacity, which is pushing people to re-evaluate where they park state.
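The gap is easy to sanity-check. A minimal sketch with assumed list prices (roughly $0.30/GB-month for EFS Standard, matching the ~$307/TB figure, versus about $0.023/GB-month for S3 Standard; verify against current AWS pricing):

```python
# Assumed list prices in USD per GB-month; check current AWS pricing pages.
EFS_STANDARD_PER_GB = 0.30
S3_STANDARD_PER_GB = 0.023

def monthly_storage_cost(gib, price_per_gib):
    """Simple linear cost model; ignores tiering, requests, and transfer."""
    return gib * price_per_gib

efs_tb = monthly_storage_cost(1024, EFS_STANDARD_PER_GB)  # ~307 USD/month
s3_tb = monthly_storage_cost(1024, S3_STANDARD_PER_GB)    # ~24 USD/month
```

At these assumed rates EFS is more than an order of magnitude pricier per TB-month, which is why infrequently accessed state tends to migrate to S3.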
AWS App Runner has stopped accepting new customers entirely, showing that even PaaS-style services marketed as stable can quietly become dead ends.
Meanwhile, half of US AI data centers planned for 2026 are delayed or cancelled because transformers are scarce and prices have tripled over four years, so capacity and pricing for big GPU and TPU jobs will stay jumpy.
local vs cloud llms for real workloads
Qwen3.6-27B is an open model that beats the older 397B Qwen3.5-A17B on major coding benchmarks, including SWE-Bench, meaning a 27B model now outperforms a 397B one.
People are running Qwen3.6-27B at home with llama.cpp and vLLM, reporting around 13 tokens per second (tps) on three GPUs and roughly 400 tps on a Windows box with dual RTX 3080s and 256 GB RAM at a 100k context.
One user sees 50 tps with a 200k context on an RTX 5090, and TurboQuant-style KV cache compression is letting FP8 variants fit into single consumer GPUs with 256k contexts.
There are direct reports that running local LLMs can pay back GPU hardware costs over time by avoiding cloud API charges on heavy workloads.
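A back-of-envelope payback calculation makes the trade-off concrete; the dollar figures below are hypothetical placeholders, not measured costs:

```python
def payback_months(hardware_usd, monthly_api_bill_usd, monthly_power_usd):
    """Months until local hardware pays for itself versus cloud API spend.

    All inputs are user estimates; a negative or zero saving means the
    hardware never pays back under these assumptions.
    """
    saved_per_month = monthly_api_bill_usd - monthly_power_usd
    if saved_per_month <= 0:
        return float("inf")
    return hardware_usd / saved_per_month

# Hypothetical: a $2,500 GPU replacing a $300/month API bill, $50/month power
months = payback_months(2500, 300, 50)
```

The model is deliberately crude: it ignores depreciation, the engineer time spent tuning quantization, and the chance that next quarter's API price cut changes the answer.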
The catch is hardware and tuning: 70B-class models strain boxes like the GMKtec EVO-X2 128 GB, push people toward dual Mac Studio Ultras or high-end servers, and small changes to quantization or KV cache format can crater throughput.
All of this is happening while GPU supply is constrained and vendors like Anthropic are already feeling scarcity, so both local cards and cloud capacity are in a competitive market.
ai coding tools are churning fast
GitHub Copilot has paused new signups for Pro, Pro Plus, and Student plans, is dropping Opus models from Pro, and will switch users to token-based billing starting in June.
Across all plans, Copilot now supports bring-your-own-key, so it can front different backends instead of only Microsoft-hosted models.
At Google, around 75% of new code is now AI-generated, up from about 50% last fall, with tools like Claude Code wired into the development process, even as users complain that Opus 4.7 is uneven across benchmarks and often overly verbose.
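BYO-key setups mostly work because so many backends accept the OpenAI-style chat-completions request shape. A sketch of building such a request, with hypothetical endpoint URLs for illustration (verify the exact path and auth scheme per provider):

```python
# Hypothetical endpoints; real gateways and local servers expose their own URLs.
BACKENDS = {
    "local-qwen": "http://localhost:8000/v1/chat/completions",
    "hosted-gateway": "https://gateway.example.com/v1/chat/completions",
}

def build_request(backend, model, prompt, api_key):
    """Return (url, headers, body) for an OpenAI-style chat completion.

    Many providers accept this shape, but treat field names and auth as
    assumptions to confirm against each backend's docs.
    """
    url = BACKENDS[backend]
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return url, headers, body

url, headers, body = build_request("local-qwen", "qwen3.6-27b", "hello", "sk-test")
```

Because only the URL and model name change per backend, flipping between a local server and a hosted gateway becomes a config edit rather than a code change.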
OpenAI’s Codex endpoint is now officially supported at /backend-api/codex/responses, and some developers report preferring its backend logic outputs to Claude’s, despite Codex underperforming on UI-heavy work.
Cursor is reportedly in talks with SpaceX on anything from a $10B collaboration to a $60B acquisition based on its developer traces, underlining how much IDE telemetry is now raw training data.
On the edge, gateways like OpenClaw and OpenRouter are normalizing BYO-key setups that can flip between Kimi, Qwen, Codex, and others behind a single API, while users complain that API costs spike quickly on larger tasks.
security, supply chain, and tooling trust
npm package pgserve shipped versions 1.1.11 to 1.1.13 with a 41 KB postinstall script that quietly stole credentials using only standard Node APIs, so the package looked clean while exfiltrating secrets.
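One blunt mitigation for postinstall-based attacks is refusing to run lifecycle scripts at all. npm supports this directly, at the cost of breaking packages that legitimately need install scripts (for example, native builds):

```shell
# Refuse to run install/postinstall scripts on every npm install
npm config set ignore-scripts true

# Or opt out for a single command
npm install --ignore-scripts
```

With scripts disabled globally, packages that genuinely need a build step must be rebuilt explicitly, which turns a silent execution path into a deliberate one.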
The Model Context Protocol picked up a high-severity bug that allowed arbitrary remote code execution, affecting integrations with more than 150 million downloads.
Anthropic’s Mythos model, designed to find and exploit vulnerabilities, was accessed by a private Discord group via a guessed URL after a third-party breach, meaning an internal red-team tool briefly became an uncontrolled offensive asset.
Mozilla used Mythos-class tooling to flag 271 potential vulnerabilities in Firefox, including zero-days, but engineers are already calling out the triage burden and uncertainty around how many of those findings are really unique issues.
In parallel, static AWS credentials continue to get stolen in AI circles, with compromised long-lived keys still a common root cause of cloud incidents.
Even basic dev tooling is now part of the privacy surface: GitHub CLI enables pseudoanonymous telemetry by default for all users, and some people report that opt-out commands fail, which erodes trust in what used to feel like a thin git wrapper.
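A cheap guardrail is flagging long-lived keys wherever they surface: AWS access key IDs beginning with AKIA belong to long-term IAM user credentials, while ASIA marks temporary STS credentials that expire on their own. A minimal check:

```python
def is_long_lived_key(access_key_id: str) -> bool:
    """True for long-term IAM user keys, which should not sit in env vars.

    AWS prefixes: "AKIA" = long-term access key, "ASIA" = temporary
    STS credential that expires automatically.
    """
    return access_key_id.startswith("AKIA")
```

Running a check like this in CI over dotenv files and config catches the most dangerous class of leaked credential, since a stolen ASIA key dies on its own but a stolen AKIA key lives until someone revokes it.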
lightweight analytics and observability are getting good
An OTel trace analyzer now catches N+1 SQL and HTTP calls, slow queries, and pool saturation across languages including Java without per-runtime instrumentation, and it can run as a CI batch job, central collector, or sidecar emitting SARIF, JSON, or text.
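The core of N+1 detection is simple once spans are in hand: count identical child statements repeated under one parent. A toy sketch over a simplified span format (real OTel data would group db.statement attributes by parent span ID, which this deliberately abstracts away):

```python
from collections import Counter

def find_n_plus_one(spans, threshold=5):
    """spans: iterable of (parent_span_id, normalized_statement) pairs.

    Returns the (parent, statement) pairs repeated at least `threshold`
    times, which is the classic N+1 shape: one request firing the same
    parameterized query per item in a collection.
    """
    counts = Counter(spans)
    return {key: n for key, n in counts.items() if n >= threshold}

# One request issuing the same lookup six times, plus one unrelated query
spans = [("req-1", "SELECT * FROM users WHERE id = ?")] * 6 \
      + [("req-1", "SELECT now()")]
hot = find_n_plus_one(spans)
```

The same counting idea generalizes to repeated HTTP calls; the hard part in practice is normalizing statements so that differently-parameterized copies of one query collapse to a single key.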
DuckDB 1.5.2 is solidifying as the default embedded analytics engine, running on laptops, servers, and in browsers, with a dedicated Jupyter kernel that gives notebook users an analytical execution runtime.
Developers like its speed for inserts, updates, and deletes from Java and its ability to slurp CSV and Parquet for ad hoc analysis and app-embedded OLAP features.
Benchmarks show DuckDB can be up to 30 times faster than SQLite for some scenarios, but serious memory issues relative to ClickHouse are a recurring complaint once datasets stop fitting comfortably in RAM.
SQLite itself is being used as a local-first memory layer via sqlite-memory-MCP and remains a go-to for data scrubbing and intermediate ETL, so a lot of useful perf visibility is now doable from a single laptop.
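The memory-layer pattern needs nothing beyond stdlib sqlite3; a minimal key-value sketch (the table name and schema here are illustrative, not what sqlite-memory-MCP actually uses):

```python
import sqlite3

def open_memory(path=":memory:"):
    """Open (or create) a tiny key-value store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn

def remember(conn, key, value):
    # SQLite upsert (ON CONFLICT ... DO UPDATE) replaces an existing value
    conn.execute(
        "INSERT INTO memory (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

def recall(conn, key):
    row = conn.execute(
        "SELECT value FROM memory WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

db = open_memory()
remember(db, "user.name", "Ada")
remember(db, "user.name", "Grace")  # upsert replaces the old value
```

Pointing `path` at a file instead of ":memory:" makes the store persistent, which is the whole appeal: durable local state with zero services to run.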
What This Means
Raw capabilities for AI and analytics jumped again this cycle, but billing, security surface, and vendor churn are getting messier at the same time. The risk is shifting from "can the stack do it" to "can we operate and secure it without hidden costs or surprises."
On Watch
/Zed’s parallel-agent editor architecture is attracting power users for its speed and concurrency model, but complaints about slower TypeScript support than VS Code and unease with recent AI-heavy UI changes make it a candidate to watch as a potential core IDE for certain stacks.
/LangGraph users are experimenting with 5-agent validation setups and a 100-agent chaos-testing demo while still relying on print statements for debugging, signalling a fragile but rapidly evolving space in multi-agent orchestration that could harden into real production patterns once observability tools land.
/Async Flash v1.0 hitting about 81% sentence accuracy in real-time voice tests, together with async Rust libraries and proposed async React live hooks, points toward more streaming-by-default app designs once the ergonomics catch up.
Interesting
/The significant gap between OpenClaw's 247K stars and its 35K installs suggests a disconnect between developer interest and practical application.
/Setting up a self-hosted AI gateway on Google Cloud using Docker can be a cost-effective solution, with monthly expenses ranging from $12 to $25.
/The emergence of Kdts as an optimization-first TypeScript compiler reflects a trend towards performance-focused tools in the TypeScript ecosystem.
/The integration of DuckDB with Excel through xlwings Lite allows for seamless SQL queries directly within spreadsheets, enhancing data manipulation capabilities.
/A recent study found that retrieval in RAG systems is less challenging than document ingestion, which consumes most engineering time, suggesting areas for improvement in AI tools.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.