Qwen3-Next: Hybrid Attention + Ultra-Sparse MoE + MTP = SOTA Inference Speed
Recently, Alibaba’s Qwen team released the Qwen3-Next model, another major innovation after Qwen3. The model achieves multiple breakthroughs in architectural design, especially reaching industry-leading levels in the balance between inference efficiency and performance. This article briefly summarizes Qwen3-Next’s core innovations.
Three major breakthroughs of Qwen3-Next:
- Hybrid attention architecture: 3 layers of linear attention + 1 layer of traditional attention, incorporating DeltaNet’s delta rule idea
- Ultra-sparse MoE: only 11 of 512 experts activated; 80B parameters with only 3B activated
- 100+ tokens/s inference speed: reaches a state-of-the-art level via MTP
Core value: With 1/10 the compute cost and 10× the token processing speed, it achieves performance surpassing 32B dense models, benchmarking against Gemini 2.5 Flash.
1. Hybrid Attention Mechanism: Breaking the Efficiency Bottleneck
The Dilemma of Traditional Attention
Traditional Transformer models use a softmax attention mechanism that must scan all historical tokens to generate each token, resulting in O(L²) computational complexity. As sequence length increases, the compute cost rises sharply.
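For contrast, here is a minimal single-head causal softmax attention in PyTorch; the explicit L×L score matrix is what makes the cost quadratic in sequence length (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Single-head causal softmax attention. q, k, v: (L, d).
    Materializes an L x L score matrix, hence O(L^2 d) compute and O(L^2) memory."""
    L, d = q.shape
    scores = q @ k.T / d**0.5                                 # (L, L): every token vs. every token
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                      # (L, d)

out = softmax_attention(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
```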
The Evolution of Linear Attention: From Theory to DeltaNet
Basic Linear Attention and Its Limitations
Linear attention achieves efficient computation by removing the softmax operation, essentially converting attention into an RNN form:
State update: S_t = S_{t-1} + v_t k_t^T
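A minimal sketch of this recurrence in plain PyTorch (feature maps and normalization omitted; names and shapes are illustrative). The running state S accumulates outer products v_t k_t^T, so each step touches only a d×d matrix instead of the full history:

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Causal linear attention as an RNN: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t.
    q, k, v: (L, d). Each step costs O(d^2), so the whole sequence is O(L d^2)."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(L):
        S = S + torch.outer(v[t], k[t])   # write the new key/value association into the state
        outputs.append(S @ q[t])          # read out with the current query
    return torch.stack(outputs)

out = linear_attention_recurrent(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
```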
While this reduces complexity (O(L²d) → O(Ld²)), it has fundamental drawbacks:
- Lossy compression: linear attention essentially compresses historical information and cannot preserve all details exactly
- Missing needle-in-a-haystack capability: performs poorly when precise retrieval from long text is required
- Cascading effects:
- Needle-in-a-haystack capability → in-context learning ability
- In-context learning → instruction-following ability
- Instruction-following → long chain-of-thought reasoning ability
- Instruction-following → tool-use ability
Therefore, while pure linear attention is elegant in theory, it struggles to reach SOTA performance in models that require deep reasoning and tool use.
DeltaNet: An Improvement via the Delta Rule
DeltaNet alleviates these issues in part by introducing the delta rule (from neural network learning theory):
S_t = S_{t-1} + β_t(v_t - S_{t-1}k_t)k_t^T
Where:
- β_t: adaptive learning rate controlling update strength
- (v_t - S_{t-1}k_t): prediction error term
- “Erase-write” mechanism: a combination of first erasing old values and then writing new ones
This update rule can be seen as online gradient descent, minimizing an MSE loss at each step:
L_t(S) = 1/2 ||Sk_t - v_t||²
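One gradient step on this loss with learning rate β_t, S_t = S_{t-1} - β_t(S_{t-1}k_t - v_t)k_t^T, expands to exactly the delta-rule update above. A minimal sketch of the recurrence (shapes are illustrative; β_t is produced by a sigmoid here as one common choice):

```python
import torch

def deltanet_recurrent(q, k, v, beta):
    """Delta-rule update: S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T,
    i.e. one gradient-descent step on L_t(S) = 1/2 ||S k_t - v_t||^2 with rate beta_t.
    q, k, v: (L, d); beta: (L,)."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(L):
        pred_error = v[t] - S @ k[t]                      # what the current state gets wrong for k_t
        S = S + beta[t] * torch.outer(pred_error, k[t])   # erase the old value along k_t, write the correction
        outputs.append(S @ q[t])
    return torch.stack(outputs)

L, d = 8, 16
out = deltanet_recurrent(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                         torch.sigmoid(torch.randn(L)))
```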
Three Technical Breakthroughs of DeltaNet
Mathematical foundations
- Uses MSE loss rather than a nonlinear loss, enabling stronger error correction
- Dynamic learning rate β_t achieves adaptive memory updates
- Theoretically equivalent to a special form of Test-Time Training (TTT)
Parallelization innovations
- Blocked parallelism: split the sequence into blocks and compute in parallel within blocks
- Hardware-friendly: fully leverages GPU parallelism
- flash-linear-attention: dedicated, hardware-optimized kernel implementations
Modern neural architecture tweaks
- L₂ normalization: normalize Q and K to improve numerical stability
- Output normalization: prevents gradient explosion/vanishing
- SiLU activation: provides smoother gradient flow than ReLU
- Short convolutions: capture local dependencies to complement global attention
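A rough sketch of how these pieces commonly fit around a linear-attention core (a hypothetical module for illustration; the exact placement in DeltaNet and Qwen3-Next may differ, and nn.RMSNorm requires PyTorch 2.4+):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaNetStyleBlock(nn.Module):
    """Illustrative wrapper: short causal depthwise conv for local context, SiLU
    activation, L2-normalized q/k, and RMSNorm applied to the attention output."""
    def __init__(self, d, conv_kernel=4):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.conv = nn.Conv1d(d, d, conv_kernel, padding=conv_kernel - 1, groups=d)
        self.out_norm = nn.RMSNorm(d)

    def forward(self, x, attention_core):                    # x: (L, d)
        h = self.conv(x.T.unsqueeze(0))[..., : x.shape[0]]   # short conv: local dependencies (causal)
        h = F.silu(h.squeeze(0).T)                           # SiLU: smoother gradients than ReLU
        q = F.normalize(self.q_proj(h), dim=-1)              # L2-normalize queries...
        k = F.normalize(self.k_proj(h), dim=-1)              # ...and keys for numerical stability
        out = attention_core(q, k, self.v_proj(h))           # e.g. the delta-rule recurrence above
        return self.out_norm(out)                            # output normalization

block = DeltaNetStyleBlock(16)
y = block(torch.randn(8, 16), linear_attention_recurrent)   # core from the earlier sketch
```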
Performance Validation: MQAR Benchmark
On the Multi-Query Associative Recall (MQAR) task, DeltaNet shows excellent performance:
| Model | In-Context Recall | Noisy Recall | Selective Copy | Average |
|---|---|---|---|---|
| Transformer | 94.1% | 86.8% | 99.6% | 74.5% |
| Mamba | 90.4% | 90.1% | 86.3% | 69.3% |
| Linear Attention | 80.8% | 81.6% | 88.6% | 60.0% |
| DeltaNet | 100% | 100% | 100% | 71.8% |
DeltaNet achieves perfect performance on associative memory tasks, demonstrating its advantage in precise retrieval.
Qwen3-Next’s Hybrid Strategy: Combining Theory and Practice
Qwen3-Next adopts a 3:1 hybrid attention architecture, a design grounded in deep theoretical insights:
Layers 1–3: linear attention (DeltaNet-style mechanism)
Layer 4: standard softmax attention
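Schematically, the 3:1 interleaving repeats as a block pattern over the network depth; a toy helper to make that concrete (hypothetical naming, the released config determines the actual layer counts and ordering):

```python
def layer_types(num_layers, ratio=(3, 1)):
    """Interleave 'linear' and 'softmax' layers in the given ratio."""
    pattern = ["linear"] * ratio[0] + ["softmax"] * ratio[1]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

print(layer_types(8))
# ['linear', 'linear', 'linear', 'softmax', 'linear', 'linear', 'linear', 'softmax']
```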
This choice of ratio reflects a delicate balance between efficiency and capability:
- Preserving key capabilities: 25% softmax layers are sufficient to maintain the “needle-in-a-haystack” capability, and thus preserve in-context learning and derived higher-level abilities such as long chain-of-thought reasoning and tool use.
- Maximizing computational efficiency: the 75% linear layers drastically cut compute cost
- Empirical validation: across multiple benchmarks, the 3:1 ratio offers the best price–performance
Comparison with Other Hybrid Architectures
| Model | Mix ratio | Context length | Inference speed | Features |
|---|---|---|---|---|
| MiniMax-01 | 7:1 | 1M→4M | ~50 tokens/s | Lightning Attention + Transnormer |
| Qwen3-Next | 3:1 | 128K | 100+ tokens/s | DeltaNet ideas + Flash Attention |
| Google Infini-Attention | N/A | Infinite | N/A | Dual attention, 114× memory compression |
| DeepSeek NSA | Dynamic | 64K | N/A | Hierarchical sparsity, hardware optimizations |
Key differences:
- MiniMax-01 adopts a more aggressive 7:1 ratio, sacrificing some precise retrieval ability in exchange for ultra-long context
- Qwen3-Next’s more conservative 3:1 design ensures stronger in-context learning
Differences from Pure DeltaNet
While borrowing DeltaNet’s core ideas, Qwen3-Next makes engineering refinements:
| Feature | DeltaNet | Qwen3-Next |
|---|---|---|
| Update rule | pure delta rule | simplified linear updates + softmax |
| Parallelization strategy | blocked parallelism | hierarchical hybrid parallelism |
| Memory mechanism | global state matrix | hierarchical progressive compression |
| Hardware optimizations | CUDA kernel | mixed precision + Flash Attention |
2. Ultra-High-Sparsity MoE: Extreme Optimization of Activated Parameters
Qwen3-Next achieves unprecedented sparsity in its mixture-of-experts (MoE) architecture:
| Model | Total experts | Activated experts | Activation ratio |
|---|---|---|---|
| Mixtral | 8 | 2 | 1/4 |
| DeepSeek R1 | 256 | 8 | 1/32 |
| Qwen3 | 128 | 8 | 1/16 |
| Qwen3-Next | 512 | 11 | 1/46 |
80B-A3B architecture:
- Total parameters: 80B
- Activated parameters: only 3B
- Performance: surpasses traditional 32B dense models
This means:
- Inference cost reduced by 10×: only 3.7% of parameters need to be activated
- Performance improves instead: via finer expert specialization
- Training becomes harder: each expert must have sufficiently strong specificity
High sparsity imposes higher training requirements:
- Must effectively separate knowledge from different domains into different experts
- The routing mechanism needs to accurately identify and select the right experts
- Avoid performance degradation due to overlapping expert functions
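To make the routing requirement concrete, here is a minimal top-k router sketch (illustrative only; the released model's router, shared-expert handling, and load-balancing losses are more involved):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=11):
    """Pick the top_k experts per token and renormalize their gate weights.
    hidden: (num_tokens, d); router_weight: (num_experts, d)."""
    logits = hidden @ router_weight.T                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.topk(top_k, dim=-1)      # keep only the best k experts per token
    gate = gate / gate.sum(dim=-1, keepdim=True)      # renormalize so the kept gates sum to 1
    return gate, expert_idx

d = 64
gate, idx = route_tokens(torch.randn(4, d), torch.randn(512, d))
print(idx.shape)   # torch.Size([4, 11]) -- 11 of 512 experts active per token
```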
3. Multi-Token Prediction (MTP): The Key to Inference Acceleration
Traditional autoregressive models can generate only one token at a time, leading to:
- Domestic SOTA open-source models (e.g., DeepSeek R1): typically 20–30 tokens/s (partly because they activate a large number of parameters)
- International SOTA closed-source models (GPT/Gemini): 100+ tokens/s
Qwen3-Next introduces MTP, which drafts several tokens in parallel and then verifies them, achieving a significant speedup to 100+ tokens/s, on par with SOTA closed-source models.
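A toy sketch of the draft-then-verify idea behind this kind of speedup (generic speculative decoding with greedy verification and stand-in integer "models"; not Qwen3-Next's exact MTP implementation):

```python
def draft_model(tokens, num_tokens):
    """Toy drafter: guesses the sequence keeps incrementing by 1."""
    return [tokens[-1] + i + 1 for i in range(num_tokens)]

def target_model(tokens, candidates):
    """Toy verifier: the 'true' continuation increments by 1 but skips multiples of 5.
    Crucially, it scores all candidate positions in a single pass."""
    out, cur = [], tokens[-1]
    for _ in candidates:
        cur += 1
        if cur % 5 == 0:
            cur += 1
        out.append(cur)
    return out

def speculative_generate(prompt, steps, k=4):
    """Draft k tokens cheaply, verify them in one target pass, keep the agreed
    prefix, and correct the first mismatch with the target's own token."""
    tokens = list(prompt)
    for _ in range(steps):
        draft = draft_model(tokens, num_tokens=k)
        verified = target_model(tokens, candidates=draft)
        for cand, true_tok in zip(draft, verified):
            tokens.append(true_tok)     # the verified token is always safe to keep
            if cand != true_tok:
                break                   # stop accepting after the first disagreement
    return tokens

print(speculative_generate([1], steps=3))   # several tokens accepted per target pass
```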
Measured Performance Data (compared with Qwen3-32B)
Thanks to the hybrid architecture design, Qwen3-Next shows remarkable performance improvements at all stages of inference:
Prefill Stage:
- 4K context: throughput increases by nearly 7x
- 32K+ context: throughput increases by over 10x
Decode Stage:
- 4K context: throughput increases by nearly 4x
- 32K+ context: still maintains a 10x+ advantage
These performance gains mainly come from:
- Hybrid attention architecture reduces computational complexity
- Ultra-sparse MoE greatly lowers the amount of activated parameters
- MTP mechanism improves token generation efficiency
High-speed inference is critical in many scenarios:
- Real-time dialog: voice assistants that must respond within 2 seconds
- Chain-of-thought reasoning: can “think” more within the same time (200 characters/s vs 40 characters/s)
- Agent multi-turn tool calls: tool-call latency is greatly reduced, improving user experience
4. Training Stability Optimizations
Qwen3-Next makes several key improvements for training stability:
Attention output gating: addresses two key issues
- Attention Sink (arxiv:2309.17453): the model tends to assign excessive attention weight to the first few tokens in a sequence (especially the first), even when those tokens are not semantically important
- Massive Activations (arxiv:2402.17762): a tiny number of activations are several orders of magnitude larger than others (even up to 10,000x), typically occurring at specific dimensions and specific tokens (such as the start token), acting like a fixed bias
- Through the output-gating mechanism, these abnormal activations are dynamically regulated to ensure training stability and inference efficiency
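One common way to realize such a gate is a learned sigmoid computed from the layer input and applied elementwise to the attention output; a minimal sketch under that assumption (not necessarily the exact formulation used in Qwen3-Next):

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Elementwise sigmoid gate on the attention output, computed from the layer input,
    so the model can damp tokens whose attention output is dominated by sink tokens
    or outlier activations."""
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out, x):                 # attn_out, x: (L, d_model)
        gate = torch.sigmoid(self.gate_proj(x))     # per-dimension gate in (0, 1)
        return gate * attn_out

layer = GatedAttentionOutput(64)
y = layer(torch.randn(8, 64), torch.randn(8, 64))
```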
Zero-Centered RMSNorm:
- Background: In the QK-Norm used by Qwen3, the learnable parameters of LayerNorm (γ scale and β shift) can grow abnormally during training, which may lead to gradient instability and overfitting risks (arxiv:1911.07013)
- Solution: adopt Zero-Centered RMSNorm and apply weight decay to the norm weights, effectively preventing unbounded parameter growth and improving training stability
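A sketch of one common zero-centered parameterization, in which the learnable scale is stored as an offset around zero so that weight decay pulls the effective scale toward 1 rather than toward 0 (an assumption about the exact formulation; module naming is illustrative):

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose scale is parameterized as (1 + gamma) with gamma initialized to 0.
    Applying weight decay to gamma then pulls the effective scale toward 1, discouraging
    unbounded growth of the norm weights."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))   # zero-centered learnable scale
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.gamma)

norm = ZeroCenteredRMSNorm(64)
out = norm(torch.randn(8, 64))
```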
MoE router initialization: normalize initialization parameters to ensure each expert is selected without bias early in training, reducing noise from random initialization
5. Benchmarking Gemini 2.5 Flash
Qwen3-Next is designed to benchmark against Google’s Gemini 2.5 Flash:
Long-context handling
- Supports 100+ rounds of dialogue history
- Handles ultra-long documents without losing accuracy
Adaptive Thinking
- Fast chain-of-thought reasoning
Cost-effectiveness
- Fewer activated parameters, lower inference cost
- Significantly reduced training and deployment costs
Fast inference
- Fast prefill and decode speeds
- Reduced text response latency and tool-call latency
6. Industrial Significance of Qwen3-Next
Carrying forward the open-source ecosystem
While the performance of other open-source models such as Llama 3 and Llama 4 has become increasingly underwhelming, the Qwen team remains committed to open source:
- Qwen2.5: became the foundation for many domain-specific models
- Qwen3: released 200B+ large-scale models and reasoning models with tool-calling capabilities
- Qwen3-Next: exploring new architectural directions
Industrial-grade validation
Although technologies such as linear attention and highly sparse MoE have been extensively studied in academia and are commonly adopted by SOTA closed-source models from Google and OpenAI, Qwen3-Next, with an open-source model, demonstrates that these techniques can be effectively integrated, can operate stably in production, and achieve excellent results in real business scenarios.
Conclusion
Qwen3-Next is not only a technological breakthrough; it also represents a new paradigm for the development of large models: achieving a win-win of efficiency and performance through architectural innovation. Its 10x inference performance improvement and generation speed comparable to international state-of-the-art have earned domestic large models an important position in global competition.
Hybrid attention has become an industry consensus: From Google’s Infini-Attention and MiniMax’s Lightning Attention to Qwen3-Next, everyone is exploring the optimal combination of linear and traditional attention. This is not accidental but an inevitable trend in technological development—purely linear attention cannot meet complex reasoning needs, while purely traditional attention faces efficiency bottlenecks.
For developers and enterprise users, Qwen3-Next offers a cost-effective choice: strong model capabilities, fast responses, and controllable costs.