Chapter 7: Results and Efficiency
What the SlimQwen Model Achieved
Chapter Goals
By the end of this chapter, you will understand:
- The end-to-end compression results: from 80B parameters to 23B
- How SlimQwen performed across diverse benchmarks
- The runtime efficiency gains (speed, memory, deployment)
- What "competitive" means in context
By now we've walked through the individual techniques — pruning, expert compression, training recipes, and progressive schedules. This chapter brings them all together and reports the bottom line.
The Starting Point and the Target
| Property | Qwen3-Next-80A3B | SlimQwen-23A2B |
|---|---|---|
| Total Parameters | ~80 billion | ~23 billion |
| Active Parameters | 3.8 billion | 2.0 billion |
| Layers | 48 | 36 (depth reduced by 25%) |
| Hidden Size | 2048 | 1536 (width reduced by 25%) |
| Experts per MoE | 512 (10+1 active) | 256 (8+1 active) |
| Compression Ratio | n/a | ~3.5x total reduction |
The model was compressed to a little under 30% of its original size (23B of 80B parameters, about 3.5x smaller). Keep in mind that both models activate only a fraction of their parameters for any given token. The active parameter count went from 3.8B to 2.0B, roughly halving the computation per token.
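The ratios can be checked directly from the parameter counts in the table. The sketch below is only arithmetic on those approximate figures; the dictionary layout and the `compression` helper are illustrative names, not anything from the paper.

```python
# Parameter counts in billions, copied from the table above (approximate).
teacher = {"total": 80.0, "active": 3.8}
student = {"total": 23.0, "active": 2.0}


def compression(before: float, after: float) -> tuple[float, float]:
    """Return (shrink factor, fraction remaining) for a before/after pair."""
    return before / after, after / before


total_ratio, total_left = compression(teacher["total"], student["total"])
active_ratio, active_left = compression(teacher["active"], student["active"])

print(f"total:  {total_ratio:.2f}x smaller, {total_left:.0%} remaining")
print(f"active: {active_ratio:.2f}x smaller, {active_left:.0%} remaining")
```

Total parameters drive memory footprint, while active parameters drive per-token compute, which is why the two ratios are worth tracking separately.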
Benchmark Performance
The paper evaluated SlimQwen across a wide range of tasks. Here's how the compressed model stacked up:
| Benchmark | What It Tests | Teacher | SlimQwen | Retention |
|---|---|---|---|---|
| MMLU | General knowledge | 85.22 | 77.39 | ~91% |
| MMLU-Pro | Harder general knowledge | 62.86 | 53.22 | ~85% |
| MMLU-Redux | Cleaned-up knowledge | 84.45 | 78.01 | ~92% |
| BBH | Complex reasoning | 85.12 | 70.70 | ~83% |
| GSM-8K | Math word problems | 90.07 | 85.82 | ~95% |
| EvalPlus | Code generation | 74.12 | 69.08 | ~93% |
| C-Eval | Chinese knowledge | 90.33 | 82.97 | ~92% |
| CMMLU | Chinese multitask | 89.27 | 83.01 | ~93% |
Average retention across the eight benchmarks is about 90.5%. Six of the eight land between roughly 91% and 95%; the two outliers are MMLU-Pro (~85%) and the reasoning-heavy BBH (~83%).
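The Retention column is simply the SlimQwen score divided by the teacher score. A minimal sketch that recomputes it from the table values follows; the `scores` dictionary and `retention` helper are illustrative names, and the unweighted mean is one reasonable way to summarize, not necessarily the paper's.

```python
# (teacher, SlimQwen) scores copied from the benchmark table above.
scores = {
    "MMLU": (85.22, 77.39),
    "MMLU-Pro": (62.86, 53.22),
    "MMLU-Redux": (84.45, 78.01),
    "BBH": (85.12, 70.70),
    "GSM-8K": (90.07, 85.82),
    "EvalPlus": (74.12, 69.08),
    "C-Eval": (90.33, 82.97),
    "CMMLU": (89.27, 83.01),
}


def retention(teacher_score: float, student_score: float) -> float:
    """Percentage of the teacher's score that the student keeps."""
    return 100.0 * student_score / teacher_score


per_benchmark = {name: retention(t, s) for name, (t, s) in scores.items()}

# Print from weakest to strongest retention.
for name, pct in sorted(per_benchmark.items(), key=lambda kv: kv[1]):
    print(f"{name:<11} {pct:5.1f}%")

# Unweighted mean over benchmarks.
average = sum(per_benchmark.values()) / len(per_benchmark)
print(f"{'average':<11} {average:5.1f}%")
```

Retention is a relative measure: it says nothing about absolute difficulty, so an 85% retention on MMLU-Pro and a 91% retention on MMLU are not directly comparable as "points lost".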
Applied Definitions
Benchmark — A standardized test designed to measure a model's capabilities in a specific domain. MMLU tests knowledge across 57 subjects, GSM-8K tests mathematical reasoning, EvalPlus tests code generation accuracy, etc.
Retention Rate — The percentage of the teacher model's score that the compressed model maintains. A 90% retention on MMLU means SlimQwen scores 90% of what the original 80B model scored.
Inference Efficiency — How fast and resource-efficient the model is when actually being used (as opposed to during training). Measured by memory footprint, response latency, and tokens generated per second.
Runtime Efficiency Gains
Compression isn't just about maintaining quality on paper — it's about making the model actually deployable. Here's the practical impact, starting with the production-ready numbers (vLLM, an optimized inference engine):
| Metric | Teacher | SlimQwen | Improvement |
|---|---|---|---|
| Peak Memory | 156.56 GB | 43.30 GB | 72% lower |
| Prefill Latency (vLLM) | 0.08 s | 0.06 s | 25% lower |
| Decoding Throughput (vLLM) | 142.58 tok/s | 210.87 tok/s | 48% higher |
| Prefill Latency (HuggingFace) | 0.99 s | 0.44 s | 56% lower |
| Decoding Throughput (HuggingFace) | 4.05 tok/s | 6.55 tok/s | 62% higher |
For context: HuggingFace's backend is unoptimized and mainly used for quick local testing; vLLM is what you'd run in production. The gains are real in either case, but the vLLM numbers reflect what matters for actual deployment.
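The improvement percentages use two different conventions: for memory and latency (lower is better) they are a reduction relative to the teacher, and for throughput (higher is better) they are an increase relative to the teacher. A small sketch that recomputes them from the table values, with illustrative helper names:

```python
# (teacher, SlimQwen) measurements copied from the efficiency table above.
lower_is_better = {
    "peak memory (GB)": (156.56, 43.30),
    "prefill, vLLM (s)": (0.08, 0.06),
    "prefill, HF (s)": (0.99, 0.44),
}
higher_is_better = {
    "decode, vLLM (tok/s)": (142.58, 210.87),
    "decode, HF (tok/s)": (4.05, 6.55),
}


def pct_reduction(before: float, after: float) -> float:
    """Relative drop, for metrics where smaller is better."""
    return 100.0 * (before - after) / before


def pct_increase(before: float, after: float) -> float:
    """Relative gain, for metrics where larger is better."""
    return 100.0 * (after - before) / before


reductions = {k: pct_reduction(*v) for k, v in lower_is_better.items()}
increases = {k: pct_increase(*v) for k, v in higher_is_better.items()}

for name, pct in reductions.items():
    print(f"{name:<22} {pct:5.1f}% lower")
for name, pct in increases.items():
    print(f"{name:<22} {pct:5.1f}% higher")
```

Keeping the two conventions apart matters when comparing rows: a 25% drop in latency corresponds to a 1.33x speedup, whereas a 48% rise in throughput is a 1.48x speedup.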
The most striking result: SlimQwen fits on a single 80GB GPU. The original teacher required two GPUs in parallel (tensor parallelism), which adds communication overhead and deployment complexity. SlimQwen runs solo.
This matters enormously in practice. A model that fits on one GPU means:
- Lower cloud computing costs
- Simpler deployment architecture
- A lower hardware bar: one 80GB-class GPU instead of a multi-GPU node (at 43.30 GB peak, the model is still beyond typical consumer cards)
- No tensor/pipeline parallelism overhead
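The single-GPU claim follows from the peak-memory figures. The sketch below computes a lower bound on the number of 80GB GPUs needed, under the assumption that the reported peak memory is the total footprint across devices; `min_gpus` is a hypothetical helper, and real deployments need extra headroom for framework overhead and uneven sharding.

```python
import math

GPU_MEMORY_GB = 80.0  # capacity of the GPU class named in the text


def min_gpus(peak_memory_gb: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest GPU count whose combined memory covers the peak footprint.

    This is a lower bound only: it ignores per-GPU overhead and assumes
    the model can be split evenly across devices.
    """
    return math.ceil(peak_memory_gb / gpu_memory_gb)


teacher_gpus = min_gpus(156.56)   # peak memory from the efficiency table
student_gpus = min_gpus(43.30)
student_headroom_gb = GPU_MEMORY_GB - 43.30

print(f"teacher needs at least {teacher_gpus} GPUs")
print(f"SlimQwen needs at least {student_gpus} GPU, "
      f"with {student_headroom_gb:.1f} GB to spare")
```

The spare memory on a single GPU is what can go toward larger batches or longer contexts, neither of which the teacher can use without first paying the multi-GPU communication cost.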
The Context of "Competitive"
The paper describes SlimQwen's performance as "competitive." This is deliberately measured language. It does not claim to match the teacher, and SlimQwen does not match it. Retaining about 91% of the teacher's MMLU score at under a third of the size is impressive, but the gap is real: roughly 9% relative on MMLU, and 15-17% on the hardest benchmarks (MMLU-Pro and BBH).
However, competitive means competitive for its size. Compared to models of similar scale trained from scratch, SlimQwen benefits from its pruned initialization: it starts further along the training curve than a same-sized model starting from random weights.
What's Lost and What's Kept
Looking at the retention rates across benchmarks reveals a pattern:
Strongest retention (~92-95%): Math (GSM-8K), Code (EvalPlus), Chinese benchmarks (C-Eval, CMMLU)
Moderate retention (~85-92%): General knowledge (MMLU variants), with MMLU-Pro at the low end
Weakest retention (~83%): Complex reasoning (BBH)
Complex reasoning appears to be the most sensitive to compression. This is consistent with a broader finding in the field: reasoning tasks often push models into regimes where architectural capacity matters most. Trimming layers and experts hits harder on tasks that demand deep chain-of-thought rather than pattern matching.
The Takeaway
SlimQwen demonstrates that a well-designed compression pipeline (combining pruning, partial-preservation merging, multi-token distillation, and progressive schedules) can produce a model roughly 3.5x smaller that retains about 90% of the teacher's benchmark scores on average (83% to 95% depending on the task), decodes about 1.5x faster under vLLM, and fits on one 80GB GPU instead of two.
It's not perfect compression. Some performance is lost. But the efficiency gains — both computational and economic — make the tradeoff compelling for many real-world applications.
Next: Chapter 8 — Takeaways and the Road Ahead