Last week's core narrative boils down to two words: "good enough." Claude Fable 5 pushed general-purpose model capabilities to a new high while halving its price. But more importantly, the industry's deliverables for Agent evaluation, safety, memory, and reasoning optimization are shifting from "paper concepts" to "runnable code and frameworks." Anthropic's prefill walkback, Kimi Work's 300 local parallel agents, MiniMax's sparse attention kernel — these events all point to a single signal: AI engineering in the first half of 2026 is moving from "can it run?" to "can it run reliably?"
AI hit multiple milestones today: MiniMax dropped M3, a 428B MoE model with 1M context and 14x speedup, while Kimi open-sourced K2.7-Code, boosting coding agent scores by 31%. On the cost frontier, researchers trained a 1B foundation model for just $1,500 using a novel HRM architecture — challenging
AI hit major milestones today. OpenAI's GPT-5.5 stunned the industry by beating Anthropic's Claude Fable 5 on the new Agents' Last Exam benchmark — a real-world, long-horizon workflow test where even the best models scored below 25%. Jeff Bezos stepped into the arena, unveiling Prometheus with a rec
AI hit multiple inflection points today. Google DeepMind's DiffusionGemma breaks the autoregressive lock-in, generating text 4x faster by diffusing 256-token blocks in parallel — a paradigm shift for latency-sensitive agents. NVIDIA, Apple, and Google teamed up to bring confidential computing to App
AI hit an inflection point today: Anthropic launched Claude Fable 5 and Mythos 5, which Andrej Karpathy calls a "major version jump"; Stripe used it to migrate 50 million lines of Ruby code in one day instead of two months. Meanwhile, OpenAI filed confidential IPO papers at an $852B valuation, setting
AI hit a funding milestone: DeepSeek launched a $7.4B Series A at a $52-59B valuation, with Tencent and CATL joining — the Chinese model race just got real. OpenAI and Anthropic both filed confidential S-1s, kicking off IPO prep. On the agent front, Kimi Work dropped a desktop agent supporting 300 p
AI chip talent wars escalated as OpenAI's custom chip lead Clive Chan jumped to Anthropic, while Jensen Huang warned chip shortages will last years across the full supply chain. China's AI models overtook US competitors on OpenRouter for the first time, driven by Kimi K2.5, MiniMax M2.5, and DeepSee
AI safety and cost efficiency dominated the news today. OpenAI launched ChatGPT Lockdown Mode to block prompt injection data theft — a deterministic defense that's hard to bypass. MiniMax M3 matched Claude Opus on code audit tasks but at 1/18 the cost ($0.07 vs $1.30), while a study in Science showe
This week's narrative boils down to one word: delivery. Model vendors shipped on three fronts they promised last quarter: inference efficiency, real-world Agent capability, and platform ecosystem. Microsoft CEO Satya Nadella, in two in-depth interviews after Build, reframed the company from "frontier model provider" to "frontier intelligence platform" and revealed a new balance with OpenAI. At the same time, NVIDIA, Google, and Microsoft delivered on inference: Nemotron 3 Ultra achieves 5x Agent inference acceleration with a 550B MoE architecture, Gemma 4 ships a 12B multimodal model for on-device use, and Microsoft's MAI series drops 7 models at once while revealing a 30% cost-performance advantage for the MAIA 200 chip. On Agent evaluation, Andon Labs uses vending machines to expose the vast gap between benchmarks and reality, while OpenWebRL shows that multi-turn RL works for visual web Agents. For formal theorem proving, Goedel-Architect and LEAP push open-source systems to new highs: 99.2% on MiniF2F and a perfect Putnam score. Finally, OpenAI's Lockdown Mode and Dreaming memory upgrade complete the safety and product experience puzzle: Lockdown Mode provides a deterministic defense against prompt injection, while Dreaming evolves ChatGPT's memory from manual saves to automated background synthesis.
This week's research in recommendation systems falls along three technical threads. Thread 1: Generative recommendation moves from functioning to stability — semantic IDs and reasoning become the industrial focus. Pinterest's UniPinRec unifies retrieval and ranking end-to-end (online engagement +1%, latency -11.1%), pushing generative recommendation beyond just retrieval. Kuaishou's OneReason (online deployment) reveals why reasoning mode fails in generative recommendation — missing both perception and cognition factors — and proposes a three-level CoT format plus specialized-unified training. Both point to the same conclusion: the core bottleneck in generative recommendation has shifted from model architecture to data format (semantic IDs) and system coordination. Thread 2: Cross-domain cold start moves from feature transfer to learning transfer — LLMs as cross-domain bridges begin large-scale deployment. Kuaishou's RGCD-Rep (serving 400M+ users) uses MLLM reasoning distillation to transfer short-video user interest to live streaming, with significant cold-start engagement gains. Meta's Quantizing Intent paper (online AUC +1.522% for cold start) quantifies organic feed behavior into semantic IDs for ad ranking, proving that behavioral richness determines cross-domain transfer quality. Both reveal that the key to cross-domain transfer isn't aligning features — it's building transferable semantic representations. Thread 3: LLM/Agent-enhanced recommendation moves toward industry differentiation — from general retrieval to deep adaptation in vertical scenarios. Li Auto's HPRO (132-day A/B, sales +9.5%) introduces preference optimization for lead scoring, solving sparse supervision and funnel hierarchy. Kuaishou's Taiji (CTR +12.4%, revenue +15.2%) proposes Pareto-optimal policy optimization, finding the optimal trade-off between semantics and IDs. Syft's DynTree (survival rate improved 1.5x) uses offline agent tree-building plus online lightweight subtree selection for
AI infrastructure and safety evaluation took center stage today. RedKnot from Xiaohongshu/Huawei Cloud shattered the monolithic KV cache abstraction, boosting LLM serving concurrency by 4.7-7.8x. Scale AI's PropensityBench introduced a new safety paradigm — testing what models *will* do under pressu
AI hit major milestones today: using formal verification, Axiom Math's system scored a perfect 120 on the Putnam exam, beating top human undergraduates and DeepSeek. NVIDIA dropped Nemotron 3 Ultra, a 550B MoE with Mamba-Attention that delivers 5x inference speedup for agent workflows. OpenAI upgraded