Prompt Compression: The TL;DR Approach

The most direct fix: give the model a shorter prompt.

Prompt compression techniques take a long input and produce a shorter version — condensing the meaning without losing key information. There are two main approaches:

Learned compression: Train a small language model specifically to rewrite prompts, preserving the information content while reducing token count. The compression model learns which parts of a prompt can be safely discarded: redundant instructions, verbose explanations, repeated information.

Semantic compression: Use the model itself to identify and merge semantically similar content. If the prompt contains three examples of the same concept, merge them into one. If an instruction is stated twice with different wording, keep the clearest version and drop the rest.

The results are surprisingly good. A 10,000-token prompt can often be compressed by 5-10x while retaining 95% or more of task accuracy. For retrieval-augmented generation systems, where large documents are stuffed into context, compression can cut prefill time from minutes to seconds.
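For a rough sense of scale, the arithmetic can be sketched as below. The throughput figure of 500 prefill tokens per second is a made-up placeholder, not a measurement, and the model assumes prefill time grows linearly with token count. Attention cost grows faster than linearly with sequence length, so for very long prompts the real saving may be larger than this estimate.

```python
def compressed_length(tokens: int, ratio: float) -> int:
    """Token count after compressing by the given ratio (5.0 means 5x)."""
    return round(tokens / ratio)


def prefill_seconds(tokens: int, tokens_per_second: float) -> float:
    """Rough prefill time, assuming a constant tokens-per-second rate."""
    return tokens / tokens_per_second


ASSUMED_THROUGHPUT = 500.0  # hypothetical prefill rate, tokens per second
original_tokens = 10_000

for ratio in (1.0, 5.0, 10.0):
    tokens = compressed_length(original_tokens, ratio)
    seconds = prefill_seconds(tokens, ASSUMED_THROUGHPUT)
    print(f"{ratio:>4.0f}x -> {tokens:>6} tokens, ~{seconds:.0f} s prefill")
```

Under these assumptions the 10,000-token prompt drops to 2,000 tokens at 5x and 1,000 tokens at 10x, and the estimated prefill time shrinks by the same factor.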

The risk: compression can drop something important. A detail about the format of the expected output, or a specific constraint on the model's behavior, might be exactly what the task requires. For many applications, the quality-speed trade-off is worth it. For safety-critical systems, where a single dropped constraint can change the model's behavior, it may not be.
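One cheap guard against this failure mode is to list the constraints that must survive and check for them after compression. The sketch below is one possible approach, not an established method: it assumes the critical constraints can be written as literal phrases, and it simply falls back to the original prompt if any are missing. A compressor that paraphrases would need a looser check. All names here are hypothetical.

```python
from __future__ import annotations


def missing_constraints(compressed_prompt: str, required: list[str]) -> list[str]:
    """Return the required phrases that no longer appear in the compressed prompt."""
    lowered = compressed_prompt.lower()
    return [phrase for phrase in required if phrase.lower() not in lowered]


def safe_prompt(original: str, compressed: str, required: list[str]) -> str:
    """Use the compressed prompt only if every required phrase survived."""
    if missing_constraints(compressed, required):
        return original
    return compressed


original = (
    "Summarize the report. Respond in JSON. "
    "Never include patient names. Please be thorough and detailed."
)
compressed = "Summarize the report. Respond in JSON."
required = ["respond in json", "never include patient names"]

dropped = missing_constraints(compressed, required)
chosen = safe_prompt(original, compressed, required)
print(dropped)
print(chosen == original)
```

In this example the compressor kept the output format but dropped the privacy constraint, so the check flags it and the original prompt is used instead. The trade-off is that any fallback gives up the speed gain for that request.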