Qwen3-Next: Hybrid Attention + Ultra-Sparse MoE + MTP = SOTA Inference Speed
Recently, Alibaba’s Qwen team released the Qwen3-Next model, another major innovation after Qwen3. The model achieves multiple breakthroughs in architectural design, especially reaching industry-leading levels in the balance between inference efficiency and performance. This article briefly summarizes Qwen3-Next’s core innovations.
Three major breakthroughs of Qwen3-Next:
- Hybrid attention architecture: 3 layers of linear attention + 1 layer of traditional attention, incorporating DeltaNet’s delta rule idea
- Ultra-sparse MoE: only 11 of 512 experts activated; 80B parameters with only 3B activated
- 100+ tokens/s inference speed: reaches a state-of-the-art level via MTP
Core value: With 1/10 the compute cost and 10× the token processing speed, it achieves performance surpassing 32B dense models, benchmarking against Gemini 2.5 Flash.
1. Hybrid Attention Mechanism: Breaking the Efficiency Bottleneck
The Dilemma of Traditional Attention
Traditional Transformer models use a softmax attention mechanism that must scan all historical tokens to generate each token, resulting in O(L²) computational complexity. As sequence length increases, the compute cost rises sharply.
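For contrast, here is a minimal single-head causal softmax attention in PyTorch; the explicit L×L score matrix is what makes the cost quadratic in sequence length (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Single-head causal softmax attention. q, k, v: (L, d).
    Materializes an L x L score matrix, hence O(L^2 d) compute and O(L^2) memory."""
    L, d = q.shape
    scores = q @ k.T / d**0.5                                 # (L, L): every token vs. every token
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                      # (L, d)

out = softmax_attention(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
```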
The Evolution of Linear Attention: From Theory to DeltaNet
Basic Linear Attention and Its Limitations
Linear attention achieves efficient computation by removing the softmax operation, essentially converting attention into an RNN form:
State update: S_t = S_{t-1} + v_t k_t^T
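A minimal sketch of this recurrence in plain PyTorch (feature maps and normalization omitted; names and shapes are illustrative). The running state S accumulates outer products v_t k_t^T, so each step touches only a d×d matrix instead of the full history:

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Causal linear attention as an RNN: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t.
    q, k, v: (L, d). Each step costs O(d^2), so the whole sequence is O(L d^2)."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(L):
        S = S + torch.outer(v[t], k[t])   # write the new key/value association into the state
        outputs.append(S @ q[t])          # read out with the current query
    return torch.stack(outputs)

out = linear_attention_recurrent(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
```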
While this reduces complexity (O(L²d) → O(Ld²)), it has fundamental drawbacks:
- Lossy compression: linear attention essentially compresses historical information and cannot preserve all details exactly
- Missing needle-in-a-haystack capability: performs poorly when precise retrieval from long text is required
- Cascading effects:
- Needle-in-a-haystack capability → in-context learning ability
- In-context learning → instruction-following ability
- Instruction-following → long chain-of-thought reasoning ability
- Instruction-following → tool-use ability
Therefore, while pure linear attention is elegant in theory, it struggles to reach SOTA performance in models that require deep reasoning and tool use.
DeltaNet: An Improvement via the Delta Rule
DeltaNet alleviates these issues in part by introducing the delta rule (from neural network learning theory):
S_t = S_{t-1} + β_t(v_t - S_{t-1}k_t)k_t^T
Where:
- β_t: adaptive learning rate controlling update strength
- (v_t - S_{t-1}k_t): prediction error term
- “Erase-write” mechanism: a combination of first erasing old values and then writing new ones
This update rule can be seen as online gradient descent, minimizing an MSE loss at each step:
L_t(S) = 1/2 ||Sk_t - v_t||²
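One gradient step on this loss with learning rate β_t, S_t = S_{t-1} - β_t(S_{t-1}k_t - v_t)k_t^T, expands to exactly the delta-rule update above. A minimal sketch of the recurrence (shapes are illustrative; β_t is produced by a sigmoid here as one common choice):

```python
import torch

def deltanet_recurrent(q, k, v, beta):
    """Delta-rule update: S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T,
    i.e. one gradient-descent step on L_t(S) = 1/2 ||S k_t - v_t||^2 with rate beta_t.
    q, k, v: (L, d); beta: (L,)."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(L):
        pred_error = v[t] - S @ k[t]                      # what the current state gets wrong for k_t
        S = S + beta[t] * torch.outer(pred_error, k[t])   # erase the old value along k_t, write the correction
        outputs.append(S @ q[t])
    return torch.stack(outputs)

L, d = 8, 16
out = deltanet_recurrent(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                         torch.sigmoid(torch.randn(L)))
```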
Three Technical Breakthroughs of DeltaNet
Mathematical foundations
- Uses MSE loss rather than a nonlinear loss, enabling stronger error correction
- Dynamic learning rate β_t achieves adaptive memory updates
- Theoretically equivalent to a special form of Test-Time Training (TTT)
Parallelization innovations
- Blocked parallelism: split the sequence into blocks and compute in parallel within blocks
- Hardware-friendly: fully leverages GPU parallelism
- flash-linear-attention: dedicated, hardware-optimized kernel implementations
Modern neural architecture tweaks
- L₂ normalization: normalize Q and K to improve numerical stability
- Output normalization: prevents gradient explosion/vanishing
- SiLU activation: provides smoother gradient flow than ReLU
- Short convolutions: capture local dependencies to complement global attention
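A rough sketch of how these pieces commonly fit around a linear-attention core (a hypothetical module for illustration; the exact placement in DeltaNet and Qwen3-Next may differ, and nn.RMSNorm requires PyTorch 2.4+):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaNetStyleBlock(nn.Module):
    """Illustrative wrapper: short causal depthwise conv for local context, SiLU
    activation, L2-normalized q/k, and RMSNorm applied to the attention output."""
    def __init__(self, d, conv_kernel=4):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.conv = nn.Conv1d(d, d, conv_kernel, padding=conv_kernel - 1, groups=d)
        self.out_norm = nn.RMSNorm(d)

    def forward(self, x, attention_core):                    # x: (L, d)
        h = self.conv(x.T.unsqueeze(0))[..., : x.shape[0]]   # short conv: local dependencies (causal)
        h = F.silu(h.squeeze(0).T)                           # SiLU: smoother gradients than ReLU
        q = F.normalize(self.q_proj(h), dim=-1)              # L2-normalize queries...
        k = F.normalize(self.k_proj(h), dim=-1)              # ...and keys for numerical stability
        out = attention_core(q, k, self.v_proj(h))           # e.g. the delta-rule recurrence above
        return self.out_norm(out)                            # output normalization

block = DeltaNetStyleBlock(16)
y = block(torch.randn(8, 16), linear_attention_recurrent)   # core from the earlier sketch
```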
Performance Validation: MQAR Benchmark
On the Multi-Query Associative Recall (MQAR) task, DeltaNet shows excellent performance:
| Model | In-Context Recall | Noisy Recall | Selective Copy | Average |
|---|---|---|---|---|
| Transformer | 94.1% | 86.8% | 99.6% | 74.5% |
| Mamba | 90.4% | 90.1% | 86.3% | 69.3% |
| Linear Attention | 80.8% | 81.6% | 88.6% | 60.0% |
| DeltaNet | 100% | 100% | 100% | 71.8% |
DeltaNet achieves perfect performance on associative memory tasks, demonstrating its advantage in precise retrieval.
Qwen3-Next’s Hybrid Strategy: Combining Theory and Practice
Qwen3-Next adopts a 3:1 hybrid attention architecture, a design grounded in deep theoretical insights:
Layers 1–3: linear attention (DeltaNet-style mechanism)
Layer 4: standard softmax attention
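Schematically, the 3:1 interleaving repeats as a block pattern over the network depth; a toy helper to make that concrete (hypothetical naming, the released config determines the actual layer counts and ordering):

```python
def layer_types(num_layers, ratio=(3, 1)):
    """Interleave 'linear' and 'softmax' layers in the given ratio."""
    pattern = ["linear"] * ratio[0] + ["softmax"] * ratio[1]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

print(layer_types(8))
# ['linear', 'linear', 'linear', 'softmax', 'linear', 'linear', 'linear', 'softmax']
```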
This choice of ratio reflects a delicate balance between efficiency and capability:
- Preserving key capabilities: 25% softmax layers are sufficient to maintain the “needle-in-a-haystack” capability, and thus preserve in-context learning and derived higher-level abilities such as long chain-of-thought reasoning and tool use.
- Maximizing computational efficiency: the 75% linear layers drastically cut compute cost
- Empirical validation: across multiple benchmarks, the 3:1 ratio offers the best price–performance
Comparison with Other Hybrid Architectures
| Model | Mix ratio | Context length | Inference speed | Features |
|---|---|---|---|---|
| MiniMax-01 | 7:1 | 1M→4M | ~50 tokens/s | Lightning Attention + Transnormer |
| Qwen3-Next | 3:1 | 128K | 100+ tokens/s | DeltaNet ideas + Flash Attention |
| Google Infini-Attention | N/A | Infinite | N/A | Dual attention, 114× memory compression |
| DeepSeek NSA | Dynamic | 64K | N/A | Hierarchical sparsity, hardware optimizations |
Key differences:
- MiniMax-01 adopts a more aggressive 7:1 ratio, sacrificing some precise retrieval ability in exchange for ultra-long context
- Qwen3-Next’s more conservative 3:1 design ensures stronger in-context learning
Differences from Pure DeltaNet
While borrowing DeltaNet’s core ideas, Qwen3-Next makes engineering refinements:
| Feature | DeltaNet | Qwen3-Next |
|---|---|---|
| Update rule | pure delta rule | simplified linear updates + softmax |
| Parallelization strategy | blocked parallelism | hierarchical hybrid parallelism |
| Memory mechanism | global state matrix | hierarchical progressive compression |
| Hardware optimizations | CUDA kernel | mixed precision + Flash Attention |
2. Ultra-High-Sparsity MoE: Extreme Optimization of Activated Parameters
Qwen3-Next achieves unprecedented sparsity in its mixture-of-experts (MoE) architecture:
| Model | Total experts | Activated experts | Activation ratio |
|---|---|---|---|
| Mixtral | 8 | 2 | 1/4 |
| DeepSeek R1 | 256 | 8 | 1/32 |
| Qwen3 | 128 | 8 | 1/16 |
| Qwen3-Next | 512 | 11 | 1/46 |
80B-A3B architecture:
- Total parameters: 80B
- Activated parameters: only 3B
- Performance: surpasses traditional 32B dense models
This means:
- Inference cost reduced by 10×: only 3.7% of parameters need to be activated
- Performance improves instead: via finer expert specialization
- Training becomes harder: each expert must have sufficiently strong specificity
High sparsity imposes higher training requirements:
- Must effectively separate knowledge from different domains into different experts
- The routing mechanism needs to accurately identify and select the right experts
- Avoid performance degradation due to overlapping expert functions
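To make the routing requirement concrete, here is a minimal top-k router sketch (illustrative only; the released model's router, shared-expert handling, and load-balancing losses are more involved):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=11):
    """Pick the top_k experts per token and renormalize their gate weights.
    hidden: (num_tokens, d); router_weight: (num_experts, d)."""
    logits = hidden @ router_weight.T                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.topk(top_k, dim=-1)      # keep only the best k experts per token
    gate = gate / gate.sum(dim=-1, keepdim=True)      # renormalize so the kept gates sum to 1
    return gate, expert_idx

d = 64
gate, idx = route_tokens(torch.randn(4, d), torch.randn(512, d))
print(idx.shape)   # torch.Size([4, 11]) -- 11 of 512 experts active per token
```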
3. Multi-Token Prediction (MTP): The Key to Inference Acceleration
Traditional autoregressive models can generate only one token at a time, leading to:
- Domestic SOTA open-source models (e.g., DeepSeek R1): typically 20–30 tokens/s (partly because they activate a large number of parameters)
- International SOTA closed-source models (GPT/Gemini): 100+ tokens/s
Qwen3-Next introduces MTP, which drafts several tokens in parallel and then verifies them, achieving a significant speedup to 100+ tokens/s, on par with SOTA closed-source models.
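A toy sketch of the draft-then-verify idea behind this kind of speedup (generic speculative decoding with greedy verification and stand-in integer "models"; not Qwen3-Next's exact MTP implementation):

```python
def draft_model(tokens, num_tokens):
    """Toy drafter: guesses the sequence keeps incrementing by 1."""
    return [tokens[-1] + i + 1 for i in range(num_tokens)]

def target_model(tokens, candidates):
    """Toy verifier: the 'true' continuation increments by 1 but skips multiples of 5.
    Crucially, it scores all candidate positions in a single pass."""
    out, cur = [], tokens[-1]
    for _ in candidates:
        cur += 1
        if cur % 5 == 0:
            cur += 1
        out.append(cur)
    return out

def speculative_generate(prompt, steps, k=4):
    """Draft k tokens cheaply, verify them in one target pass, keep the agreed
    prefix, and correct the first mismatch with the target's own token."""
    tokens = list(prompt)
    for _ in range(steps):
        draft = draft_model(tokens, num_tokens=k)
        verified = target_model(tokens, candidates=draft)
        for cand, true_tok in zip(draft, verified):
            tokens.append(true_tok)     # the verified token is always safe to keep
            if cand != true_tok:
                break                   # stop accepting after the first disagreement
    return tokens

print(speculative_generate([1], steps=3))   # several tokens accepted per target pass
```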
Measured Performance Data (compared with Qwen3-32B)
Thanks to the hybrid architecture design, Qwen3-Next shows remarkable performance improvements at all stages of inference:
Prefill Stage:
- 4K context: throughput increases by nearly 7x
- 32K+ context: throughput increases by over 10x
Decode Stage:
- 4K context: throughput increases by nearly 4x
- 32K+ context: still maintains a 10x+ advantage
These performance gains mainly come from:
- Hybrid attention architecture reduces computational complexity
- Ultra-sparse MoE greatly lowers the amount of activated parameters
- MTP mechanism improves token generation efficiency
High-speed inference is critical in many scenarios:
- Real-time dialog: voice assistants that must respond within 2 seconds
- Chain-of-thought reasoning: can “think” more within the same time (200 characters/s vs 40 characters/s)
- Agent multi-turn tool calls: tool-call latency is greatly reduced, improving user experience
4. Training Stability Optimizations
Qwen3-Next makes several key improvements for training stability:
Attention output gating: addresses two key issues
- Attention Sink (arxiv:2309.17453): the model tends to assign excessive attention weight to the first few tokens in a sequence (especially the first), even when those tokens are not semantically important
- Massive Activations (arxiv:2402.17762): a tiny number of activations are several orders of magnitude larger than others (even up to 10,000x), typically occurring at specific dimensions and specific tokens (such as the start token), acting like a fixed bias
- Through the output-gating mechanism, these abnormal activations are dynamically regulated to ensure training stability and inference efficiency
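One common way to realize such a gate is a learned sigmoid computed from the layer input and applied elementwise to the attention output; a minimal sketch under that assumption (not necessarily the exact formulation used in Qwen3-Next):

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Elementwise sigmoid gate on the attention output, computed from the layer input,
    so the model can damp tokens whose attention output is dominated by sink tokens
    or outlier activations."""
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out, x):                 # attn_out, x: (L, d_model)
        gate = torch.sigmoid(self.gate_proj(x))     # per-dimension gate in (0, 1)
        return gate * attn_out

layer = GatedAttentionOutput(64)
y = layer(torch.randn(8, 64), torch.randn(8, 64))
```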
Zero-Centered RMSNorm:
- Background: In the QK-Norm used by Qwen3, the learnable parameters of LayerNorm (γ scale and β shift) can grow abnormally during training, which may lead to gradient instability and overfitting risks (arxiv:1911.07013)
- Solution: adopt Zero-Centered RMSNorm and apply weight decay to the norm weights, effectively preventing unbounded parameter growth and improving training stability
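A sketch of one common zero-centered parameterization, in which the learnable scale is stored as an offset around zero so that weight decay pulls the effective scale toward 1 rather than toward 0 (an assumption about the exact formulation; module naming is illustrative):

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose scale is parameterized as (1 + gamma) with gamma initialized to 0.
    Applying weight decay to gamma then pulls the effective scale toward 1, discouraging
    unbounded growth of the norm weights."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))   # zero-centered learnable scale
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.gamma)

norm = ZeroCenteredRMSNorm(64)
out = norm(torch.randn(8, 64))
```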
MoE router initialization: normalize initialization parameters to ensure each expert is selected without bias early in training, reducing noise from random initialization
5. Benchmarking Gemini 2.5 Flash
Qwen3-Next is designed to benchmark against Google’s Gemini 2.5 Flash:
Long-context handling
- Supports 100+ rounds of dialogue history
- Handles ultra-long documents without losing accuracy
Adaptive Thinking
- Fast chain-of-thought reasoning
Cost-effectiveness
- Fewer activated parameters, lower inference cost
- Significantly reduced training and deployment costs
Fast inference
- Fast prefill and decode speeds
- Reduced text response latency and tool-call latency
6. Industrial Significance of Qwen3-Next
Carrying forward the open-source ecosystem
While the performance of other open-source models such as Llama 3 and Llama 4 has become increasingly underwhelming, the Qwen team remains committed to open source:
- Qwen2.5: became the foundation for many domain-specific models
- Qwen3: released 200B+ large-scale models and reasoning models with tool-calling capabilities
- Qwen3-Next: exploring new architectural directions
Industrial-grade validation
Although technologies such as linear attention and highly sparse MoE have been extensively studied in academia and are commonly adopted by SOTA closed-source models from Google and OpenAI, Qwen3-Next, with an open-source model, demonstrates that these techniques can be effectively integrated, can operate stably in production, and achieve excellent results in real business scenarios.
Conclusion
Qwen3-Next is not only a technological breakthrough; it also represents a new paradigm for the development of large models: achieving a win-win of efficiency and performance through architectural innovation. Its 10x inference performance improvement and generation speed comparable to international state-of-the-art have earned domestic large models an important position in global competition.
Hybrid attention has become an industry consensus: From Google’s Infini-Attention and MiniMax’s Lightning Attention to Qwen3-Next, everyone is exploring the optimal combination of linear and traditional attention. This is not accidental but an inevitable trend in technological development—purely linear attention cannot meet complex reasoning needs, while purely traditional attention faces efficiency bottlenecks.
For developers and enterprise users, Qwen3-Next offers a cost-effective choice: strong model capabilities, fast responses, and controllable costs.