
Chapter 7: Results and Efficiency

What the SlimQwen Model Achieved


Chapter Goals

By the end of this chapter, you will understand:

  • The end-to-end compression results: from 80B parameters to 23B
  • How SlimQwen performed across diverse benchmarks
  • The runtime efficiency gains (speed, memory, deployment)
  • What "competitive" means in context

By now we've walked through the individual techniques — pruning, expert compression, training recipes, and progressive schedules. This chapter brings them all together and reports the bottom line.


The Starting Point and the Target

Property            Qwen3-Next-80A3B     SlimQwen-23A2B
Total Parameters    ~80 billion          ~23 billion
Active Parameters   3.8 billion          2.0 billion
Layers              48                   36 (depth reduced by 25%)
Hidden Size         2048                 1536 (width reduced by 25%)
Experts per MoE     512 (10+1 active)    256 (8+1 active)
Compression Ratio   ~3.5x total reduction (80B to 23B)

The model was compressed to a bit under 30% of its original size (23B of 80B total parameters). Keep in mind that in both models only a fraction of the parameters is active for any given token. The active parameter count went from 3.8B to 2.0B, cutting the computation per token by a little under half.
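The reductions in the comparison table can be checked with a few lines of arithmetic. The sketch below only restates the table's figures; the dictionary keys and the helper name `reduction_summary` are illustrative and do not come from the paper.

```python
# Figures copied from the comparison table (parameter counts in billions).
teacher = {"total_params_b": 80.0, "active_params_b": 3.8, "layers": 48, "hidden_size": 2048}
student = {"total_params_b": 23.0, "active_params_b": 2.0, "layers": 36, "hidden_size": 1536}


def reduction_summary(before, after):
    """Return (compression factor, fraction removed) for a before/after pair."""
    return before / after, 1.0 - after / before


for key in teacher:
    factor, removed = reduction_summary(teacher[key], student[key])
    print(f"{key:16s} {factor:.2f}x smaller, {removed:.0%} removed")
```

Because the parameter counts are quoted as approximate ("~80 billion", "~23 billion"), the resulting ratios are approximate as well.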


Benchmark Performance

The paper evaluated SlimQwen across a wide range of tasks. Here's how the compressed model stacked up:

Benchmark     What It Tests              Teacher   SlimQwen   Retention
MMLU          General knowledge          85.22     77.39      ~91%
MMLU-Pro      Harder general knowledge   62.86     53.22      ~85%
MMLU-Redux    Cleaned-up knowledge       84.45     78.01      ~92%
BBH           Complex reasoning          85.12     70.70      ~83%
GSM-8K        Math word problems         90.07     85.82      ~95%
EvalPlus      Code generation            74.12     69.08      ~93%
C-Eval        Chinese knowledge          90.33     82.97      ~92%
CMMLU         Chinese multitask          89.27     83.01      ~93%

Overall average retention: roughly 90-91% across the eight benchmarks. Most knowledge, math, and code tasks land between 91% and 95%, while the harder MMLU-Pro (~85%) and the reasoning-heavy BBH (~83%) degrade the most.
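The retention column is simply the SlimQwen score divided by the teacher score. The following sketch recomputes it from the table; the names `scores`, `retention`, and `mean_retention` are illustrative, and the unweighted mean is one assumed way of summarizing the eight benchmarks, not necessarily the paper's.

```python
# (teacher score, SlimQwen score) pairs copied from the benchmark table.
scores = {
    "MMLU": (85.22, 77.39),
    "MMLU-Pro": (62.86, 53.22),
    "MMLU-Redux": (84.45, 78.01),
    "BBH": (85.12, 70.70),
    "GSM-8K": (90.07, 85.82),
    "EvalPlus": (74.12, 69.08),
    "C-Eval": (90.33, 82.97),
    "CMMLU": (89.27, 83.01),
}


def retention(teacher_score, student_score):
    """Share of the teacher's score kept by the compressed model, in percent."""
    return 100.0 * student_score / teacher_score


retentions = {name: retention(t, s) for name, (t, s) in scores.items()}

# Print from weakest to strongest retention.
for name, value in sorted(retentions.items(), key=lambda kv: kv[1]):
    print(f"{name:11s} {value:5.1f}%")

# Unweighted mean over all benchmarks (an assumed summary statistic).
mean_retention = sum(retentions.values()) / len(retentions)
print(f"{'mean':11s} {mean_retention:5.1f}%")
```

Sorting by retention puts BBH and MMLU-Pro at the bottom and GSM-8K at the top; the unweighted mean comes out at about 90.5%.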


Applied Definitions

Benchmark — A standardized test designed to measure a model's capabilities in a specific domain. MMLU tests knowledge across 57 subjects, GSM-8K tests mathematical reasoning, EvalPlus tests code generation accuracy, etc.

Retention Rate — The percentage of the teacher model's score that the compressed model maintains. A 90% retention on MMLU means SlimQwen scores 90% of what the original 80B model scored.

Inference Efficiency — How fast and resource-efficient the model is when actually being used (as opposed to during training). Measured by memory footprint, response latency, and tokens generated per second.


Runtime Efficiency Gains

Compression isn't just about maintaining quality on paper — it's about making the model actually deployable. Here's the practical impact, starting with the production-ready numbers (vLLM, an optimized inference engine):

Metric                        Teacher        SlimQwen       Improvement
Peak Memory                   156.56 GB      43.30 GB       72% reduction
Prefill Latency (vLLM)        0.08s          0.06s          25% faster
Decoding Throughput (vLLM)    142.58 tok/s   210.87 tok/s   48% faster
Prefill (HuggingFace)         0.99s          0.44s          56% faster
Decoding Throughput (HF)      4.05 tok/s     6.55 tok/s     62% faster

For context: HuggingFace's backend is unoptimized and mainly used for quick local testing; vLLM is what you'd run in production. The gains are real in either case, but the vLLM numbers reflect what matters for actual deployment.
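The "Improvement" column mixes two kinds of percentages: a reduction for lower-is-better metrics (memory, latency) and a gain for higher-is-better metrics (throughput). A minimal sketch that recomputes both from the table's raw numbers follows; the function names are illustrative, not taken from the paper.

```python
def percent_reduction(before, after):
    """For lower-is-better metrics (memory, latency): percent saved."""
    return 100.0 * (before - after) / before


def percent_gain(before, after):
    """For higher-is-better metrics (throughput): percent increase."""
    return 100.0 * (after - before) / before


# (metric, teacher value, SlimQwen value, how to compare) from the runtime table.
measurements = [
    ("Peak memory (GB)", 156.56, 43.30, percent_reduction),
    ("Prefill latency, vLLM (s)", 0.08, 0.06, percent_reduction),
    ("Decoding, vLLM (tok/s)", 142.58, 210.87, percent_gain),
    ("Prefill latency, HF (s)", 0.99, 0.44, percent_reduction),
    ("Decoding, HF (tok/s)", 4.05, 6.55, percent_gain),
]

for name, teacher, slim, compare in measurements:
    print(f"{name:28s} {compare(teacher, slim):5.1f}%")
```

The distinction matters when reading "faster": a 56% lower prefill latency corresponds to a 2.25x speedup (0.99s / 0.44s), while a 48% throughput gain is a 1.48x speedup.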

The most striking result: SlimQwen fits on a single 80GB GPU. The original teacher required two GPUs in parallel (tensor parallelism), which adds communication overhead and deployment complexity. SlimQwen runs solo.
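The one-GPU versus two-GPU claim follows directly from the peak memory figures. As a rough sketch, assuming identical 80 GB devices and no extra headroom beyond the measured peak (the helper `min_gpus` is hypothetical):

```python
import math


def min_gpus(peak_memory_gb, gpu_memory_gb=80.0):
    """Smallest number of identical GPUs whose combined memory covers the peak.

    Simplifying assumption: memory splits evenly across devices and no
    additional headroom is reserved beyond the measured peak.
    """
    return math.ceil(peak_memory_gb / gpu_memory_gb)


print("Teacher :", min_gpus(156.56), "GPU(s)")
print("SlimQwen:", min_gpus(43.30), "GPU(s)")
```

Real deployments usually reserve extra memory for longer contexts and larger batches, so this is a lower bound rather than a sizing recommendation.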

This matters enormously in practice. A model that fits on one GPU means:

  • Lower cloud computing costs
  • Simpler deployment architecture
  • A lower hardware bar for serving (though at 43.30 GB peak memory it still exceeds typical consumer GPU memory unless quantized further)
  • No tensor/pipeline parallelism overhead

The Context of "Competitive"

The paper describes SlimQwen's performance as "competitive." This is deliberately measured language. It doesn't claim to match the teacher, and it doesn't. Retaining about 91% of the teacher's MMLU score after shrinking from 80B to 23B parameters is impressive, but the gaps are real: about 9% on MMLU and 15-17% on the hardest benchmarks (MMLU-Pro and BBH).

However, competitive means competitive for its size. Compared with similarly sized models trained from scratch, SlimQwen benefits from its pruning-based initialization: it starts further along the training curve than a model initialized with random weights.


What's Lost and What's Kept

Looking at the retention rates across benchmarks reveals a pattern:

Strongest retention (~92-95%): Math (GSM-8K, ~95%), Code (EvalPlus, ~93%), Chinese benchmarks (C-Eval ~92%, CMMLU ~93%)

Moderate retention (~85-92%): General knowledge (MMLU variants)

Weakest retention (~83%): Complex reasoning (BBH)

Complex reasoning appears to be the most sensitive to compression. This is consistent with a broader finding in the field: reasoning tasks often push models into regimes where architectural capacity matters most. Trimming layers and experts hits harder on tasks that demand deep chain-of-thought rather than pattern matching.


The Takeaway

SlimQwen demonstrates that a well-designed compression pipeline (combining pruning, partial-preservation merging, multi-token distillation, and progressive schedules) can produce a model less than a third the size (80B to 23B parameters) that retains about 90% of the original's benchmark scores on average (83-95% depending on the task), decodes roughly 1.5x faster under vLLM, and fits on a single 80GB GPU instead of two.

It's not perfect compression. Some performance is lost. But the efficiency gains — both computational and economic — make the tradeoff compelling for many real-world applications.


Next: Chapter 8 — Takeaways and the Road Ahead
