RecSys Weekly 2026-W20
2026-5-18

Weekly Overview

This week's recommendation systems research breaks down along three technical fronts: generative recommendation architectures moving from tokenizer optimization to inference efficiency; LLM-enhanced recommendation evolving from isolated auxiliary modules to agents with memory and reasoning; and system-level quantization and thread orchestration emerging as the real bottleneck for production deployment.
Theme 1 "Decoupling and Acceleration in Generative Recommendation": Alibaba deployed CQ-SID / EG-GRPO on the Tmall app, using category-aware semantic IDs and expert-guided reinforcement learning to achieve +1.15% GMV, with generative retrieval contributing 72.63% of purchases. Tencent and Tsinghua's AsymRec proposes an asymmetric continuous-discrete framework that replaces symmetric quantization with multi-expert projections, for an average improvement of 15.8% over baselines. Meituan's DIG embeds the tokenizer into a discriminative ranking model for end-to-end training, improving both retrieval and ranking. Snap's SID-MLP distills the Transformer decoder into an MLP, achieving an 8.74x speedup with no loss in accuracy. The common thread: generative recommendation is moving from "can run" to "runs stably and fast," and the core tactics are decoupling input and output representations and replacing over-designed structures with lighter ones.
Theme 2 "LLM Recommendation Toward Reasoning and Memory": Microsoft Research's PGR introduced look-ahead guided retrieval, using Tree-of-Thought to expand query steps, achieving nearly 3x recall improvement on MemoryQuest. Meituan's RecRM-Bench provides 1 million structured entries covering four reward dimensions (instruction following, fact consistency, etc.) for agent-based recommendation systems. SDAR (Meituan) uses gated auxiliary objectives to stabilize On-Policy Self-Distillation (OPSD), outperforming GRPO by 7–10% on ALFWorld, Search-QA, and WebShop. The difference: PGR focuses on look-ahead reasoning before retrieval; SDAR focuses on training stability. But the shared challenge is that LLM memory and reasoning capabilities for recommendation are still far from mature.
Theme 3 "System Co-Design Becomes the Key to Production Deployment": Meta's LoKA combines three components (Probe, Mods, Dispatch) to deliver +20% training throughput and +40% inference speedup under FP8 with no quality loss. Xiaohongshu's CCD-Level Thread Orchestration exploits the cache topology of multi-CCD CPUs, achieving 3.7x throughput improvement and 30–90% P999 latency reduction on ANNS services. Baidu's Efficient Generative Targeting combines quantization, sparsification, and parallel verification to achieve 1.8x inference speedup, and is deployed in ad systems. These works show that when model architecture improvements deliver diminishing returns, hardware-aware system optimization becomes the main source of real-world gains.

Generative Recommendation Architecture and Inference Optimization

Generative recommendation contributions this week cluster around two directions: decoupling the tokenizer and output space, and designing efficient inference paths. The first targets the information bottleneck of standard TIGER-style symmetric quantization, the second targets the redundant computation in autoregressive decoding.
CQ-SID / EG-GRPO (Alibaba): this week's only industrial generative recommendation deployment that reports both online metrics and system-level contributions. Key points: CQ-SID adds category awareness and query-item contrastive learning on top of RQ-VAE, halving the beam search size while improving semantic hit rate by 26.76%. EG-GRPO stabilizes reinforcement learning by replacing sparse rewards with expert-injected real samples. Production results: generative retrieval contributes over 50% of impressions, 58% of clicks, and 72% of purchases, with online GMV +1.15% and UCTCVR +0.40%. Compared with xGR's inference system optimization, CQ-SID focuses more on the quality of the retrieval stage itself.
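To make the RQ-VAE starting point concrete, here is a minimal sketch of the residual quantization step that turns an item embedding into a hierarchical semantic ID. It is an illustration only: the codebook values and the names `residual_quantize` and `CODEBOOKS` are assumptions, real codebooks are learned, and CQ-SID's category awareness and query-item contrastive learning are not modeled.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], vec: Vector) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(code: Vector) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(embedding: Vector,
                      codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Map a continuous embedding to one token per codebook level."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        # The next level only sees what this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid


# Toy codebooks with made-up values (assumption; a real RQ-VAE learns these).
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],                 # level 1: coarse
    [[0.2, 0.0], [0.0, 0.2], [-0.2, 0.0]],    # level 2: refinement
]

sid = residual_quantize([0.9, 0.05], CODEBOOKS)
print(sid)
```

Because each level quantizes the leftover of the previous one, the first token carries the coarse signal and later tokens refine it, which is the hierarchy the papers in this section exploit.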
AsymRec (Tencent / Tsinghua) — identifies a two-stage information bottleneck in symmetric semantic IDs. The input bottleneck comes from lossy quantization and popularity skew, the output bottleneck from insufficient precision in discrete targets. Solution: Multi-expert Semantic Projection (MSP) maps continuous embeddings into Transformer latent space via multiple expert projections, preserving fine-grained semantics; Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity discrete targets from multiple perspectives and levels. On Amazon Beauty, Toys, etc., it beats TIGER, RECOM, and other baselines by an average of 15.8%. This continues the dual-alignment approach of DAS, but explicitly decomposes it into input and output bottlenecks.
DIG (Meituan) — embeds the tokenizer into a discriminative ranking model for end-to-end training. Core insight: ranking performs argmax over the item space, retrieval performs argmax over the token space — essentially the same problem at different granularities. DIG encodes SID with item-intrinsic static features and uses u2i cross features to implicitly steer codebook boundaries toward recommendation decision boundaries. During inference, an MLP approximates token-level u2i. On Meituan industrial datasets, Recall@50 improves 12.3%, AUC 1.8%. This complements RelayGR's long-sequence relay reasoning — DIG handles representation alignment, RelayGR handles inference scaling.
SID-MLP (Snap) — reveals a counterintuitive fact: standard Transformer decoders are over-designed for hierarchical SID prediction. Because SID's hierarchical structure makes prediction difficulty drop sharply after the first token, repeated attention computation is highly redundant. SID-MLP captures global user context once and distills it into position-specific MLP heads, achieving 8.74x inference speedup with no accuracy loss. SID-MLP++ further replaces the encoder, offering a speed-accuracy trade-off. This echoes DualGR's dual-routing focus — one cuts from structural simplification, the other from routing control.
F-GRPO (Academic): decomposes GRPO into candidate generation and ranking stages to address credit assignment in unified autoregressive generation. It designs position-aware rewards and coverage rewards for the ranking stage and computes group-relative advantages separately. Outperforms standard GRPO and fine-tuning alone on sequential recommendation and QA benchmarks.
TwiSTAR (Academic): adaptive reasoning allocation that dynamically selects one of three tools (fast retrieval, lightweight ranking, or slow reasoning) for each segment of user history. The planner is trained with a supervised warm-up followed by reinforcement learning. Improves accuracy while reducing latency on Amazon/Yelp datasets.
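The "group relative advantage" in GRPO is a within-group normalization of rewards. The sketch below applies that standard normalization to two reward lists independently, as one plausible reading of "computing advantages separately"; the reward values and the split into `generation_rewards` and `ranking_rewards` are assumptions, not the paper's exact reward definitions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same user history.
generation_rewards = [1.0, 0.0, 1.0, 0.0]   # e.g. did the candidates cover the target
ranking_rewards = [0.5, 0.0, 1.0, 0.0]      # e.g. position-aware score of the target

# Each stage is normalized against its own group statistics, so a rollout
# is credited for ranking quality independently of generation quality.
gen_adv = group_relative_advantages(generation_rewards)
rank_adv = group_relative_advantages(ranking_rewards)
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason reward design matters here.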
LASAR (Academic) — introduces latent reasoning into generative recommendation. Addresses issues with semantic IDs: lack of pretrained semantics, representation drift, fixed reasoning length. Proposes a two-stage SFT+GRPO framework. Uses step-level bidirectional KL divergence to align latent reasoning trajectories with chain-of-thought text, and a policy head to predict per-sample reasoning depth. Reduces latent steps by ~50% on average, with inference 20x faster than explicit CoT. This is inspired by SCoTER's chain-of-thought transfer, but moves reasoning from explicit text to latent space.
  • Takeaway: The next bottleneck for generative recommendation is inference efficiency. SID-MLP and LASAR compress decoding cost from different angles, but production deployment still needs to solve the coupling between tokenizer updates and the model. Next, watch CQ-SID's scalability across larger categories, and F-GRPO's online performance in end-to-end scenarios.
  • Takeaway: Decoupling input/output representations (AsymRec, DIG) could further increase semantic ID capacity, but needs to be coordinated with efficient KV cache management. Keep an eye on integrated tests of xGR with these representation methods.

LLM-Enhanced Recommendation and Agent Systems

This topic shows two forces: one focuses on LLM reasoning and memory for recommendation, the other builds evaluation benchmarks to drive systematic progress.
PGR (Microsoft Research): "look-ahead guided retrieval" for long-term memory. Core problem: standard RAG and GraphRAG depend on query embedding similarity, so they easily miss facts that are semantically distant from the query but relevant to the user. PGR first simulates the user's possible next steps with Tree-of-Thought or linear chains, uses those simulated steps as retrieval probes, then runs the next round of simulation on the retrieval results. On the self-built MemoryQuest benchmark (1,625 queries, low similarity constraints), recall improves nearly 3x. In LLM-as-judge comparisons, PGR-generated answers are preferred on 89–98% of queries. This contrasts with GraphRAG-R1's process-constrained reinforcement learning: one optimizes the retrieval strategy, the other the reward design, but both point to how scarce effective long-term memory retrieval still is.
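The simulate-then-probe loop can be sketched in a few lines. This is a toy under stated assumptions: word overlap stands in for embedding similarity, `toy_simulate` is a hypothetical stand-in for the LLM that imagines next steps, and all names and memory entries are invented for illustration.

```python
from typing import Callable, List


def overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(probe: str, memory: List[str], k: int = 1) -> List[str]:
    """Return up to k memory entries sharing at least one word with the probe."""
    ranked = sorted(memory, key=lambda m: overlap(probe, m), reverse=True)
    return [m for m in ranked[:k] if overlap(probe, m) > 0]


def lookahead_retrieve(query: str, memory: List[str],
                       simulate: Callable[[str, List[str]], List[str]],
                       rounds: int = 2) -> List[str]:
    """Alternate retrieval and next-step simulation for a fixed number of rounds."""
    found: List[str] = []
    probes = [query]
    for _ in range(rounds):
        for probe in probes:
            for fact in retrieve(probe, memory):
                if fact not in found:
                    found.append(fact)
        # Simulated next steps become the probes for the following round.
        probes = simulate(query, found)
    return found


MEMORY = [
    "user has a peanut allergy",
    "user prefers window seats",
    "user owns a dog",
]


def toy_simulate(query: str, found: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM imagining the user's next steps."""
    return ["check allergy before ordering food"]


QUERY = "plan dinner tonight"
direct = retrieve(QUERY, MEMORY)                          # similarity to the query only
guided = lookahead_retrieve(QUERY, MEMORY, toy_simulate)  # probes from simulated steps
```

In this toy, the query shares no words with any memory entry, so direct retrieval returns nothing, while the simulated step surfaces the allergy fact.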
SDAR (Meituan) — addresses instability of On-Policy Self-Distillation (OPSD) in multi-turn agents. OPSD uses a teacher branch with privileged context to provide dense token-level guidance, but in multi-turn scenarios, skill-conditioned guidance can produce negative rejections due to retrieval failure. SDAR treats OPSD as a gated auxiliary objective: a sigmoid gate dynamically adjusts distillation strength — strengthening on teacher-supported tokens, softly attenuating on negative rejections. On Qwen2.5/Qwen3 model families, SDAR improves over GRPO by 9.4%, 10.2%, and 7.0% on ALFWorld, WebShop, and Search-QA respectively. This draws on Implicit Turn-Wise Policy Optimization's turn-level reward idea, but SDAR focuses more on fine-grained distillation gate control.
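A gated auxiliary objective of this kind can be sketched as follows. The gate input is an assumption made for illustration: here the gate opens when the teacher assigns a token higher log-probability than the student and closes when the teacher disagrees. The names `gated_distill_weights`, `beta`, and `aux_weight` are hypothetical and not taken from the paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_distill_weights(teacher_logp: List[float],
                          student_logp: List[float],
                          beta: float = 2.0) -> List[float]:
    """Per-token gate in (0, 1); assumed to depend on the teacher-student gap."""
    return [sigmoid(beta * (t - s)) for t, s in zip(teacher_logp, student_logp)]


def total_loss(rl_loss: float, distill_losses: List[float],
               gates: List[float], aux_weight: float = 0.1) -> float:
    """RL objective plus the gated distillation term as an auxiliary loss."""
    gated = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return rl_loss + aux_weight * gated


# Token 0: teacher supports the sampled token. Token 1: teacher rejects it.
gates = gated_distill_weights(teacher_logp=[-0.1, -3.0],
                              student_logp=[-2.0, -0.2])
loss = total_loss(rl_loss=1.0, distill_losses=[0.5, 0.5], gates=[1.0, 0.0])
```

The design point is that the distillation term never replaces the RL objective; it is added with a weight that shrinks smoothly on tokens where the teacher signal looks unreliable.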
RRCM (Academic) — proposes a ranking-driven retrieval-and-reasoning framework, optimizing memory reading strategy with GRPO. Motivation: fixed context construction strategies cannot decide per-instance whether collaborative evidence or metadata is needed; too much information clogs the context window. RRCM maintains two memory pathways (collaborative memory and metadata memory), represented in natural language, accessed via a unified retrieval interface. The policy is optimized using ranking rewards for flexible evidence acquisition. Outperforms LLaRA, RecLLM, etc. on Amazon, Yelp, MovieLens.
BLUE (Academic) — uses reinforcement learning to align LLM-generated text user profiles with embedding models. After profile generation, the embedding model provides reward signals to drive the profile closer to positive samples and away from negatives, with additional text-space supervision for next-item prediction. Outperforms strong baselines in zero-shot sequential recommendation and cross-domain transfer.
RecRM-Bench (Meituan) — the largest reward modeling benchmark for agent-based recommendation systems, containing 1 million structured entries covering four dimensions: instruction following, fact consistency, query-item relevance, and user behavior prediction. Provides a data foundation for training multi-dimensional reward models. Current baselines (e.g., Aligning LLMs for Controllable Recommendations) still focus on single-dimension rewards; RecRM-Bench could drive a shift from single to multi-dimension.
Standardized Evaluation (ReDial) (Academic) — re-evaluates 7 conversational recommendation methods, finding three issues: Recall@1 is sensitive to implementation details; nearly 50% of accuracy comes from repetition shortcuts (same item mentioned multiple times in a conversation); performance differences stem more from LLM backbone capacity than architectural innovation. Proposes a user utility metric, revealing that traditional recall overestimates system conversational effectiveness.
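The repetition shortcut is easy to check on any conversational dataset. The sketch below compares Recall@1 on all samples with Recall@1 restricted to targets not yet mentioned in the conversation. It is a simple diagnostic with an assumed sample schema and made-up data, not the user utility metric the paper proposes.

```python
from typing import Dict, List

Sample = Dict[str, object]


def recall_at_1(samples: List[Sample], exclude_mentioned: bool = False) -> float:
    """Fraction of samples whose top prediction equals the target item."""
    hits = total = 0
    for s in samples:
        if exclude_mentioned and s["target"] in s["mentioned"]:
            continue  # skip targets that already appeared in the dialogue
        total += 1
        hits += int(s["prediction"] == s["target"])
    return hits / total if total else 0.0


# Made-up evaluation samples (assumption, for illustration only).
SAMPLES: List[Sample] = [
    {"mentioned": ["Inception", "Heat"], "target": "Inception", "prediction": "Inception"},
    {"mentioned": ["Alien"], "target": "Alien", "prediction": "Alien"},
    {"mentioned": ["Heat"], "target": "Ronin", "prediction": "Ronin"},
    {"mentioned": ["Alien"], "target": "Arrival", "prediction": "Aliens"},
]

overall = recall_at_1(SAMPLES)
novel_only = recall_at_1(SAMPLES, exclude_mentioned=True)
print(overall, novel_only)
```

A large gap between the two numbers suggests that a model is scoring well mostly by echoing items the user already named.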
MEME (Academic) — defines 6 memory tasks across multi-entity and evolutionary dimensions, finding that existing memory systems fail on reasoning-dependent tasks (Cascade 3%, Absence 1%). Only Claude Opus 4.7 with a file agent partially mitigates this at 70x cost. Poses a serious challenge for long-term personalization in recommendation agents.
  • Takeaway: LLM's role in recommendation is shifting from "text enhancement" to "true reasoning and memory." PGR and RRCM explore look-ahead retrieval and policy-driven reading, but MEME shows reasoning-dependent tasks remain a key bottleneck. Next, watch how forward simulation can be combined with large-scale memory systems, and whether GRPO can validate these strategy gains online.
  • Takeaway: Both the ReDial re-evaluation and RecRM-Bench suggest systematic bias in current evaluation frameworks: repetition shortcuts and reliance on Recall alone mask true model capabilities. Industry teams should adopt multi-dimensional evaluation (combining instruction following, factuality, efficiency) to avoid being misled by short-term benchmark scores.

Efficiency Optimization in Ranking and Retrieval

This is a classic topic, but this week's contributions focus on hard optimizations for industrial systems.
LoKA (Meta): FP8 low-precision training for recommendation models. FP8 works for LLMs, but LRMs are numerically sensitive, rely on many small matrix multiplications, and are communication-intensive, so direct quantization causes quality loss and slower training. LoKA proposes system-model co-design: LoKA Probe statistically measures per-layer error distribution and computation speed to identify safe and unsafe sites; LoKA Mods provides reusable model adaptations such as layer normalization and GeLU replacement; LoKA Dispatch selects, at runtime, the fastest FP8 kernel that meets accuracy requirements. On Meta's production LRM, training throughput improves 20% and inference speeds up 40%, with no quality loss. This continues the topology-aware approach of Disaggregated Multi-Tower, but LoKA approaches the problem from numerical precision rather than communication topology.
CCD-Level Thread Orchestration (Xiaohongshu): addresses the performance bottleneck of vector ANNS on multi-CCD CPUs. Production observation: requests exhibit high access locality, but scheduling ignores the cache topology across CCDs, so multi-core scaling suffers from low cache utilization. Core contribution: a unified interface for HNSW and IVF with CCD-aware task allocation and task stealing. On Xiaohongshu retrieval/recommendation/ad production workloads, throughput improves 3.7x, P999 latency drops by up to 90%, and cache miss rate decreases 6–30%. This is a system-level complement to FAVOR's filter-agnostic vector search.
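The combination of locality-aware allocation and task stealing can be illustrated with a small scheduler. This is a conceptual sketch only: routing by `shard_id`, the class name `CcdScheduler`, and the steal-from-longest-queue policy are assumptions, and real thread pinning to cores is not shown.

```python
from collections import deque
from typing import Deque, List, Optional


class CcdScheduler:
    """One task queue per CCD, with stealing when a CCD runs dry."""

    def __init__(self, num_ccds: int) -> None:
        self.queues: List[Deque[str]] = [deque() for _ in range(num_ccds)]

    def submit(self, shard_id: int, task: str) -> None:
        # Route by index shard so requests touching the same data land on
        # the same CCD and can reuse what is already in its cache.
        self.queues[shard_id % len(self.queues)].append(task)

    def next_task(self, ccd: int) -> Optional[str]:
        if self.queues[ccd]:
            return self.queues[ccd].popleft()  # local work first (FIFO)
        victim = max(self.queues, key=len)     # busiest CCD
        if victim:
            return victim.pop()                # steal from the tail
        return None


sched = CcdScheduler(num_ccds=2)
for shard, task in [(0, "a"), (2, "b"), (4, "c")]:
    sched.submit(shard, task)                  # all three map to CCD 0

stolen = sched.next_task(1)                    # CCD 1 is idle, so it steals
local = sched.next_task(0)                     # CCD 0 serves its own queue
```

Stealing from the tail leaves the oldest tasks with their home CCD, which trades a little locality for balanced load when traffic is skewed.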
Efficient LLM-based Advertising (Baidu) — integrates adaptive group quantization (FP16→INT4), layer-adaptive sparsification, and prefix-tree parallel verification, achieving 1.8x speedup on Baidu's ad platform with acceptable quality loss. This is a practical engineering acceleration package, but lacks structural innovation.
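Group quantization is the easiest of the three techniques to show in isolation. The sketch below implements plain symmetric INT4 quantization with one scale per fixed-size group; the adaptive part of Baidu's method is not modeled, and the weights and function names are invented for illustration.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]


def quantize_group(values: List[float]) -> Group:
    """Symmetric INT4: integers in [-8, 7] plus one float scale per group."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def quantize_groupwise(weights: List[float], group_size: int) -> List[Group]:
    """Split weights into consecutive groups and quantize each one."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def dequantize(groups: List[Group]) -> List[float]:
    """Reconstruct approximate float weights from integers and scales."""
    return [q * scale for qs, scale in groups for q in qs]


def max_error(weights: List[float], group_size: int) -> float:
    restored = dequantize(quantize_groupwise(weights, group_size))
    return max(abs(w - r) for w, r in zip(weights, restored))


# Made-up weights: a small-magnitude group followed by a large-magnitude one.
WEIGHTS = [0.7, -0.3, 0.1, 0.0, 7.0, -3.0, 1.0, 0.0]

per_group_error = max_error(WEIGHTS, group_size=4)   # one scale per 4 weights
per_tensor_error = max_error(WEIGHTS, group_size=8)  # one scale for everything
```

With a single scale, the large weights dictate the step size and the small group is rounded coarsely; per-group scales avoid that at the cost of storing one extra float per group.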
Multimodal LLM Framework (ByteDance) — a generic three-part framework (content explanation, representation extraction, pipeline integration) using LLaMA2 to generate descriptive captions as tokenized categorical features. Offline AUC improves 0.35%, online metrics +0.02%. Results are modest but confirm LLM as a viable feature extractor, at the cost of additional latency.
ZipRerank (Academic) — a listwise multimodal reranker that compresses input length through query-image early interaction and eliminates autoregressive decoding with a single forward pass. Two-stage training: listwise pretraining (text rendered as images) + VLM distillation with soft ranking supervision. Matches SOTA on MMDocIR benchmark while reducing latency by ~10x. This draws on the distillation ideas of Efficient Long-Context Ranking, but extends to multimodal.
Granite Embedding Multilingual R2 (IBM Research) — a multilingual embedding model based on ModernBERT, supporting 200+ languages and a 32K context window, available in 311M and 97M parameter versions. The 97M version is obtained through pruning and vocabulary selection, achieving best retrieval performance among <100M parameter models. Released under Apache 2.0.
Simpson's Paradox in Behavioral Curves (Meta): reveals systematic bias in aggregated behavioral curves. On Goodreads, individual users' optimal exposure count is ~11, but the aggregated curve shows ~34, a 3x gap driven by survivorship bias. Amazon Electronics shows a 5.3x gap. Proposes Synthetic Null Calibration to reduce the per-user classification false positive rate from 32% to controllable levels. The practical warning for recommendation tuning: do not infer individual behavior from aggregated curves.
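A toy dataset shows how survivorship alone can shift the aggregated peak. The curves below are invented for illustration (three light users who leave after three exposures and one heavy user who stays), and the sketch does not implement Synthetic Null Calibration.

```python
from statistics import median
from typing import Dict, List

# Engagement at the 1st, 2nd, ... exposure; a list ends when the user leaves.
USERS: Dict[str, List[float]] = {
    "light_1": [0.3, 0.4, 0.3],
    "light_2": [0.3, 0.4, 0.3],
    "light_3": [0.3, 0.4, 0.3],
    "heavy": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.9, 0.8],
}


def peak_exposure(curve: List[float]) -> int:
    """1-based exposure count at which the curve is highest."""
    return max(range(len(curve)), key=curve.__getitem__) + 1


def aggregated_curve(users: Dict[str, List[float]]) -> List[float]:
    """Mean engagement per exposure count over users still present."""
    longest = max(len(c) for c in users.values())
    agg = []
    for k in range(longest):
        alive = [c[k] for c in users.values() if len(c) > k]
        agg.append(sum(alive) / len(alive))
    return agg


individual_peaks = [peak_exposure(c) for c in USERS.values()]
typical_peak = median(individual_peaks)                  # what most users want
aggregate_peak = peak_exposure(aggregated_curve(USERS))  # what the dashboard shows
print(typical_peak, aggregate_peak)
```

Here most users peak at 2 exposures, yet the aggregated curve peaks at 6 because only the heavy user is left to define its tail.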
  • Takeaway: Recommendation system efficiency improvement is shifting from model architecture to system co-design (LoKA, CCD threading). At the same time, aggregation bias (Simpson's Paradox) reminds us that tuning based on statistical curves may point in the wrong direction. Next, watch whether LoKA's Probe method can be standardized as a tool library, and whether CCD threading can extend to GPU architectures.
  • Takeaway: ZipRerank and Granite Embedding represent efficient paths for multimodal and language embeddings. ZipRerank's speed advantage makes it suitable for latency-sensitive scenarios, while Granite's multilingual support extends retrieval baselines for non-English markets. Industry teams should evaluate ZipRerank as a reranking accelerator for CVR scenarios.

Reinforcement Learning Exploration Strategies

Delightful Exploration (Google DeepMind): proposes Delight-gated exploration (DE), using "expected improvement times surprise" as the exploration gate. Mathematically, this recovers the reservation price of the Pandora's Box rule, with surprise setting the effective inspection cost. On Bernoulli bandits, linear bandits, and tabular MDPs, with the same hyperparameters and no tuning, regret grows far more slowly than with Thompson Sampling and ε-greedy. DE starts from the same observation as Dynamic Prior Thompson Sampling on cold start (exploration actions need to be priced based on uncertainty), but it simplifies the pricing from a Bayesian posterior to surprise times expected improvement, making it easier to implement.
ROAD (Ant Group) — optimizes data mixing in offline-to-online reinforcement learning. Treats data selection as bilevel optimization: the top level (outer loop) decides the mixing policy, the bottom level (inner loop) performs Q-learning. Uses a multi-armed bandit to approximate the bilevel gradient, replacing static mixing ratios. Averages over 10% improvement on D4RL, MuJoCo, etc.
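The idea of letting a bandit choose the mixing ratio, rather than fixing it, can be sketched with a standard UCB1 bandit over a few candidate ratios. This swaps in plain UCB1 for illustration and is not ROAD's bilevel gradient approximation; `toy_feedback` is a hypothetical stand-in for the gain measured after an inner-loop Q-learning update.

```python
import math
from typing import List, Sequence


class MixRatioBandit:
    """UCB1 over candidate offline-data mixing ratios."""

    def __init__(self, ratios: Sequence[float]) -> None:
        self.ratios: List[float] = list(ratios)
        self.counts = [0] * len(self.ratios)
        self.values = [0.0] * len(self.ratios)

    def select(self) -> int:
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm  # try every ratio once before using UCB
        total = sum(self.counts)
        ucb = [v + math.sqrt(2 * math.log(total) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this ratio.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def toy_feedback(ratio: float) -> float:
    """Hypothetical training gain; by construction it is best at ratio 0.25."""
    return 1.0 - abs(ratio - 0.25)


bandit = MixRatioBandit([0.0, 0.25, 0.5, 0.75])
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, toy_feedback(bandit.ratios[arm]))

best_arm = max(range(len(bandit.ratios)), key=bandit.counts.__getitem__)
best_ratio = bandit.ratios[best_arm]
```

In a real pipeline the reward would be noisy and non-stationary as the policy improves, so a discounted or sliding-window bandit may be a better fit than plain UCB1.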
HyperEyes (Xiaohongshu) — a parallel multimodal search agent that fuses visual grounding and retrieval into a single action while also training for inference efficiency. Two-stage training: parallel acceptable data synthesis + dual-granularity efficiency-aware reinforcement learning (TRACE trajectory-level reward + OPD token-level distillation). On 6 benchmarks, a 30B model achieves 9.9% higher accuracy than the strongest open-source agent, with 5.3x fewer tool call rounds. This contrasts with ReAct's serial tool calls; HyperEyes significantly improves efficiency through parallelism.
  • Takeaway: Reinforcement learning exploration strategies are moving from theory (DE) to recommendation cold start and data reuse (ROAD). DE's lightweight implementation makes it suitable for online deployment. Next, watch its actual performance in recommendation bandit scenarios. HyperEyes's parallelism approach could change the design paradigm for multi-turn recommendation agents.
  • Takeaway: ROAD's adaptive data mixing is directly attractive for offline-to-online transfer, but the accuracy of its multi-armed bandit surrogate depends on gradient approximation quality. A natural test is ROAD's dynamic mix ratio in user cold-start scenarios in recommendation systems.

Directions to Watch

Generative Recommendation Inference Efficiency Race
Three papers this week (SID-MLP, LASAR, TwiSTAR) accelerate generative inference from different angles — MLP distillation, latent reasoning, adaptive routing. Their common premise: standard autoregressive decoding is computationally inefficient for recommendation tasks. Next, watch whether SID-MLP maintains 8.74x speedup on a million-item pool, and whether LASAR's latent step reduction holds for larger models. Alibaba has already demonstrated generative retrieval's effectiveness with CQ-SID; the natural next step is embedding inference optimizations into production services.
Memory-Driven Retrieval and Recommendation Agents
PGR, RRCM, and MEME all point to how strongly recommendation agents depend on memory systems. Current approaches (RAG, GraphRAG) perform poorly on long-term personalization, while look-ahead simulation and policy-driven reading are two promising directions. Next, watch whether PGR's ToT retrieval can scale to millions of user profiles, and whether practical mitigations (e.g., structured memory or planners) emerge for the reasoning failures MEME reveals. Memory systems may be where the agentification of recommendation systems breaks through.
System-Model Co-Design Becomes Core for Production Deployment
LoKA, CCD threading, and Baidu's acceleration framework all emphasize "understanding hardware constraints" rather than just optimizing model architecture. The recommendation industry is undergoing a paradigm shift similar to LLMs in the post-training phase — hardware-aware optimization delivers observable end-to-end gains. Next, watch whether LoKA's Probe method gets integrated into mainstream frameworks like PyTorch/TorchAO, and whether CCD threading design generalizes to GPU multi-die architectures.

Paper Roundup

Generative Recommendation Architecture and Inference Optimization
CQ-SID / EG-GRPO (Alibaba) — Proposes category-aware semantic IDs and expert-guided GRPO, deployed on TmallAPP with +1.15% GMV and +0.40% UCTCVR; generative retrieval contributes 72.63% of purchases.
AsymRec (Tencent / Tsinghua) — Asymmetric continuous-discrete framework addressing information bottlenecks via multi-expert projection and multi-faceted hierarchical quantization; average 15.8% improvement.
DIG (Meituan) — End-to-end training embedding tokenizer into discriminative ranking model, unifying retrieval and ranking; +12.3% Recall@50, +1.8% AUC on industrial dataset.
F-GRPO (Academic) — Decomposes GRPO into candidate generation and ranking stages with position-aware and coverage rewards; outperforms standard GRPO.
SID-MLP (Snap) — MLP distillation replaces Transformer decoder; 8.74x speedup with equal accuracy.
TwiSTAR (Academic) — Adaptive reasoning allocation via RL planner that dynamically selects fast retrieval, lightweight ranking, or slow reasoning; improves accuracy and reduces latency on three datasets.
LASAR (Academic) — Latent adaptive semantic alignment reasoning with two-stage SFT+GRPO; inference 20x faster than explicit CoT, average 50% reduction in latent steps.
DiffRetriever (Academic) — Parallel generation of K representative tokens using diffusion language model; zero-shot NDCG@10 of 55.4 on BEIR-7, 68.2 after fine-tuning.
LLM-Enhanced Recommendation and Agent Systems
SDAR (Meituan) — Gated auxiliary objective stabilizes OPSD distillation; improves over GRPO by 7–10% on ALFWorld, Search-QA, WebShop.
PGR (Microsoft Research) — Look-ahead guided retrieval with ToT query expansion; recall improves nearly 3x, LLM preference 89–98%.
RecRM-Bench (Meituan) — Largest reward model benchmark for agent-based recommendation; 1M entries across four reward dimensions.
Standardized Re-evaluation (Academic) — Standardized evaluation of 7 conversational recommendation methods; finds 50% accuracy from repetition shortcuts, LLM backbone more important than architecture innovation.
MEME (Academic) — Multi-entity evolving memory evaluation; reasoning-dependent task accuracy only 1–3%, mainstream memory systems fail.
PDR (Academic) — Personalized deep research framework dynamically integrating user profiles into retrieval-reasoning loop; constructs PDR dataset and hybrid evaluation.
TRACE (Academic) — Tourism recommendation dialogue benchmark with 10,000 dialogues, review-span citation evidence, and rejection recovery; reveals Three-Competency Gap.
DCGL (Academic) — Dual-channel graph learning framework decoupling semantic and behavioral information with dynamic fusion; average 3–8% improvement on 4 datasets.
RRCM (Academic) — Ranking-driven retrieval reasoning optimizing memory reading strategy with GRPO; outperforms LLaRA, RecLLM, etc.
BLUE (Academic) — RL aligns text profiles with embeddings; significant improvement in zero-shot cross-domain transfer.
Efficiency Optimization in Ranking and Retrieval
LoKA (Meta) — +20% training throughput, +40% inference speedup under FP8 with no quality loss; deployed on Meta production LRM.
ZipRerank (Academic) — Efficient listwise multimodal reranker; ~10x inference latency reduction while matching SOTA.
Localization Boosting (Adobe) — Multi-objective LTR framework with VLM relevance signals and locale-aware boosting; recovers localized exposure across 5 locales.
Efficient LLM-based Advertising (Baidu) — Adaptive group quantization + sparsification + parallel verification; 1.8x inference speedup, deployed on Baidu ad platform.
Granite Embedding Multilingual R2 (IBM Research) — Multilingual embedding model; 200+ languages, 32K context, 311M/97M parameters; Apache 2.0 open source.
CCD-Level Thread Orchestration (Xiaohongshu) — CCD-aware thread orchestration; 3.7x throughput improvement, 90% P999 latency reduction; deployed for search/recommendation/ad services.
Multimodal LLM Framework (ByteDance) — LLaMA2 generates captions as tokenized features; +0.02% online metric, confirms LLM feature extraction viability.
Simpson's Paradox in Behavioral Curves (Meta) — Aggregated curves exhibit 3–5.3x bias; proposes synthetic null calibration to reduce false positive rate.
Reinforcement Learning Exploration Strategies
Delightful Exploration (Google DeepMind) — Delight-gated exploration using surprise times expected improvement; regret lower than Thompson Sampling and ε-greedy.
ROAD (Ant Group) — Bilevel optimization adaptive data mixing with multi-armed bandit gradient approximation; >10% average improvement on offline-to-online RL tasks.
HyperEyes (Xiaohongshu) — Parallel multimodal search agent; 30B model 9.9% more accurate, 5.3x fewer tool call rounds.
Other
TraXion (Academic) — Unified pretraining framework for multi-entity spatiotemporal event streams; single marker outperforms all task-specific baselines on 6 mobility datasets; zero-shot transfer to certification logs and mortality prediction.
ModelLens (Academic) — Learns model-dataset performance latent space from 1.62M leaderboard records; ranks models without running them; Top-K recommendation improves routing methods by up to 81%.
Graph Heuristic Audit (Academic) — Simple graph heuristics (last 1–2 interactions + few-hop item transfer graph + feature similarity) match or surpass modern generative recommendation baselines; +38–44% NDCG@10 relative improvement; reveals shortcut solvability in benchmarks.