Why LayerNorm + AdamW Became the Standard Setup for Deep Networks: From Scale Invariance to Gradient Dynamics

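Deep networks rely on LayerNorm (or RMSNorm), which creates local scale invariance and, with it, a distinctive gradient dynamics. In this regime, several machine-learning intuitions are overturned. The weight norm no longer represents feature strength; it becomes a knob for learning progress. In theory the norm grows steadily, so SGD comes with a built-in learning-rate decay, but it brakes so hard that learning stops early. Weight decay, in turn, evolves from a regularization term into a dynamic valve on the effective learning rate. How AdamW became the default: Adam keeps the gradient step size constant and brakes the effective learning rate gently; warmup handles the overly small weights (exploding gradients) and the inaccurate second-moment estimates early in training; AdamW fixes the problem with L2 regularization by introducing weight decay as its own term, splitting "direction updates" and "progress control" into two clean knobs.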
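The scale invariance described above can be checked numerically. The following is a minimal sketch, not the article's own code: it assumes a toy linear layer followed by RMSNorm without a learnable gain, and the names `rms_norm`, `loss`, and `numerical_grad` are hypothetical helpers introduced only for illustration. It checks three things: scaling the weights leaves the loss unchanged, doubling the weights halves the gradient norm, and the gradient has no component along the weights.

```python
import numpy as np

def rms_norm(z):
    # RMSNorm without learnable gain and without epsilon (assumption for exact invariance).
    return z / np.sqrt(np.mean(z ** 2))

def loss(W, x, target):
    # Toy model: linear layer -> RMSNorm -> squared error.
    return 0.5 * np.sum((rms_norm(W @ x) - target) ** 2)

def numerical_grad(W, x, target, h=1e-6):
    # Central finite differences over every entry of W.
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (loss(W + step, x, target) - loss(W - step, x, target)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
target = rng.normal(size=4)

g1 = numerical_grad(W, x, target)
g2 = numerical_grad(2.0 * W, x, target)

# 1) The loss does not change when W is rescaled.
loss_is_invariant = np.isclose(loss(W, x, target), loss(2.0 * W, x, target))
# 2) The gradient norm shrinks in proportion to 1 / scale.
grad_ratio = np.linalg.norm(g2) / np.linalg.norm(g1)
# 3) The gradient is orthogonal to W, so a gradient step cannot shrink the norm.
radial_component = np.sum(W * g1)

print(loss_is_invariant, grad_ratio, radial_component)
```

Because the gradient shrinks as the norm grows while each SGD step can only push the norm outward, a fixed learning rate produces ever smaller changes in direction, which is one way to read the "built-in learning-rate decay" mentioned above.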

Recommendation Algorithms Can Only Be Icing on the Cake, Not a Lifeline

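When working with product and operations teams, I often have to play the one who "pours cold water", especially when everyone pins high hopes on the recommendation algorithm. I hear strategic plans like: "Our goal next year is 80% growth, and the recommender system is key to it." My view is blunt: if your growth strategy depends so heavily on the recommendation algorithm that the target collapses the moment the algorithm underperforms, then it is fundamentally a bad strategy. For growth in scale, a recommendation algorithm cannot be a lifeline; it can only add polish on top of scale that already exists.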

From RL Forgetting Less Than SFT to Rethinking the Flaws of Recommender Systems

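Several recent works study why RL causes less catastrophic forgetting than SFT in LLMs. They point out clearly that the on-policy nature of RL keeps the parameters stable, whereas SFT pushes the parameters in directions far from the pretraining distribution, which causes forgetting (see the figure; forgetting is measured as the drop in average performance on old tasks as new tasks are learned). This clear conclusion lit up my understanding of many things: problems in recommender systems that used to look isolated may connect into one picture with deeper support. This article covers: • A reading of the work showing that RL causes less catastrophic forgetting than SFT in LLMs • Recommender systems are standard off-policy supervised learning, and (conjecture) many of their flaws should stem from that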

How Large a Model Can a Recommender System Run Online?

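This article does not discuss deploying and optimizing complex models from a systems-optimization angle. Instead, it takes an industry-cost angle: how complex a model can online inference run while still meeting cost and ROI requirements? The assumptions: • The e-commerce recommendation industry, mainly because its cost accounting is more familiar • A standard Transformer deployed as the ranking model, following the OneTrans architecture • Parameter scales aligned with the Qwen2 model family, to see more intuitively which size can run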

Talent Dilution Roofline: Your Algorithm Team May Not Need to Hire Anymore?

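The Roofline model is an intuitive model from high-performance computing for analyzing program performance bottlenecks, named for the roof-like shape of its plot. As in the figure below, the x-axis is the algorithm's arithmetic intensity in FLOP/Byte (floating-point operations divided by bytes of memory traffic), and the y-axis is attainable performance in FLOP/s. It says that as arithmetic intensity rises, attainable performance rises linearly (memory-bound) until the intensity passes the hardware's ridge point, after which performance approaches the hardware peak (compute-bound). Its core question: what actually limits your program, compute capability or memory bandwidth, and where should you optimize?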
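The roofline relationship above reduces to a single `min`. The sketch below is illustrative only: the function names are hypothetical, and the hardware numbers (100 TFLOP/s peak, 1 TB/s memory bandwidth) are made-up assumptions rather than figures for any real chip.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops, intensity * mem_bandwidth)

def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/Byte) where the two roofs meet."""
    return peak_flops / mem_bandwidth

def bottleneck(intensity, peak_flops, mem_bandwidth):
    """Label a kernel as memory-bound or compute-bound."""
    if intensity < ridge_point(peak_flops, mem_bandwidth):
        return "memory-bound"
    return "compute-bound"

# Hypothetical hardware, chosen only to make the arithmetic easy to follow.
PEAK = 100e12       # FLOP/s (assumed)
BANDWIDTH = 1e12    # Byte/s (assumed)

for intensity in (10, 100, 500):
    perf = attainable_flops(intensity, PEAK, BANDWIDTH)
    print(intensity, perf, bottleneck(intensity, PEAK, BANDWIDTH))
```

Under these assumed numbers the ridge point sits at 100 FLOP/Byte: a kernel at 10 FLOP/Byte is capped by bandwidth, while one at 500 FLOP/Byte is capped by the compute peak.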

OneTrans: Aligning Sequence Processing and Feature Crossing in Recommender Systems

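Ever since fine-grained ranking moved to deep learning, industry has split ranking-model architecture research into two basic parts, sequence processing and feature crossing; some companies even divide their ranking group into two teams, one for behavior sequences and one for feature crossing. In the earliest designs, for example with DIN handling the sequence, the sequence is compressed into one or more vector representations, which then cross with the other features. We can write this as MLP(concat(DIN, Features)). Most model research today still upgrades the two parts separately: on one side, replacing the MLP with DCN, adding LHUC, or moving to the more complex RankMixer or Transformer; on the other, stacking MHA on DIN or replacing it outright with a Transformer. The result can be written as RankMixer(concat(Transformer, Features)). From MLP(concat(DIN, Features)) to RankMixer(concat(Transformer, Features)), the essence has not changed: sequence processing and feature crossing form an implicit two-stage pipeline, and the sequence is compressed into vector space before it crosses with the features. What makes LLMs interesting is that the crossing exploited by next-token prediction happens in the token space of the word sequence. The lesson for recommendation ranking models is that every feature crossing should happen in the token space of the user sequence.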

AI Tech Daily - 2026-06-21

Google DeepMind dropped a bombshell with a 57-page ASI roadmap, formally defining Superhuman AI as output exceeding tens of thousands of top experts working for a decade. Meta AI released SAGE-OPD, a selective distillation framework that boosts agent task success rates by 13.3% — a practical fix for

AI Weekly 2026-W25

The clearest narrative in 2026-W25: open-source model frontiers have shifted from catching up to running alongside closed-source models — and in some dimensions, surpassing them. Four models launched this week: GLM-5.2, DeepSeek-V4, Nemotron 3 Ultra, and Ling-2.6. Parameter counts range from 284B to 1.6T, all support 1M token context windows, and all are open-source. Community benchmarks and independent analysis report that these models now match GPT-5.5 and Opus 4.8 on knowledge work, coding, and scientific reasoning — and are cheaper. The second theme: Agent infrastructure is moving from scattered tools to platforms. Amazon Bedrock AgentCore Harness went GA — two API calls to deploy a production-grade Agent. Cursor launched Origin, a Git replacement designed for Agent workloads. Meanwhile, Agent evaluation methodology is shifting from aggregate leaderboards to predictive validity — an IBM paper directly challenges whether static leaderboards transfer to deployment scenarios. The third theme: micro-innovations in inference efficiency are accelerating. Pine AI proposes an editable/composable KV cache paradigm, reducing p90 TTFT by 53–398x. LMSYS used SGLang-JAX to optimize a 1T-parameter MoE model on TPUs, cutting prefill by 53%. Jeff Dean published the evolution of TPUs from v2 to Ironwood — 30x energy efficiency gains. The combination of hardware and algorithm innovations is making 1M token inference economically viable. Additionally, regulatory tensions escalated sharply this week — Anthropic restricted use of the Fable model, then the US Commerce Department imposed export license requirements on Fable and Mythos. Andrew Ng argues this will accelerate the AI sovereignty movement. Healthcare also saw multiple product-level advances, from rare disease diagnosis to full-body ultrasound CT.

RecSys Weekly 2026-W25

This week's recommendation systems research clusters around three themes: full lifecycle co-design for large-scale graph retrieval, Transformer-based sequence modeling deployed across platforms, and a shift from DNN to Transformer-native architectures for multi-task ranking. Meta, Airbnb, Alibaba, Shopee, and NetEase Cloud Music all published online deployment work with specific AB metrics. Thread 1 (End-to-end design of large-scale graph systems): Meta's RankGraph-2 couples graph construction, representation learning, and online serving into a joint optimization. On a billion-node graph, it reduces compute cost by 83%, achieves 3.8x the recall of GAT+Deep Graph Infomax, and lifts online CTR by 0.96% and CVR by 2.75%. Along the same line, HighLevel's ScoreGate uses a statistical fusion of two scores to adaptively control the number of retrieved chunks in RAG. In production, it cuts tokens by 34.8% while maintaining recall between 97.77% and 99.34%. Thread 2 (Generative recommendation moves from theory to production): Airbnb's JourneyFormer deploys a Transformer-based sequence model in search ranking to handle long, sparse user behavior. Alibaba's OneBar uses an end-to-end generative framework for video e-commerce query recommendation, achieving a 21.67% GMV lift. Both point in the same direction: generative recommendation needs engineering trade-offs under real constraints (cold start, latency, sparse labels) rather than chasing offline metrics alone. Thread 3 (Transformer-native paradigm for multi-task ranking): Shopee's OneRank eliminates the encoder-predictor separation, embedding task-private channels and gradient isolation inside the Transformer. Online CTR is up 1.2% and CVR is up 0.8%. NetEase Cloud Music's PIANO uses a learnable [CLS] token for list-level multi-objective re-ranking, lifting CTR by 0.62% and CVR by 4.45%. Both demonstrate that internalizing multi-objective reasoning into the Tr

AI Tech Daily - 2026-06-20

AI hit a major inflection point today. DeepSeek dropped DeepSeek-V4, a 1.6T MoE model that slashes long-context costs by 3.7x and beats GPT-5.4 — all open-source. Meanwhile, Subquadratic claims to have cracked the O(n²) attention bottleneck, and GLM-5.2 is now the first open model that independent d

AI Tech Daily - 2026-06-19

AI hit multiple inflection points today. Anthropic's Claude Opus 4.7 autonomously controlled a robot 20x faster than humans, while Qualcomm is reportedly acquiring Tenstorrent for $8-10B to challenge NVIDIA's inference dominance with RISC-V. Noam Shazeer — one of the "Attention is All You Need" auth

AI Tech Daily - 2026-06-18

AI hit multiple inflection points today. Noam Shazeer, co-author of the original Transformer paper, left Google for OpenAI — a decade-long pursuit finally realized. Vercel launched its eve agent framework with a full stack of components, while AWS and Hugging Face both unveiled critical agent infras