Context Compression: Reduce Tokens Without Losing Quality

Compress prompts for lower cost and latency using LLMLingua, selective context, and summarization. Learn compression ratios, quality tradeoffs, and when not to compress.

June 10, 2026
Tags: context-compression, token-reduction, llmlingua, cost-optimization, prompt-engineering

The Compression Imperative

Context windows keep growing (200K+ tokens), but every token still costs money and adds latency. Much of a typical prompt is filler, and removing non-essential tokens while preserving meaning can cut input costs by 2-5x.

Before compression (2,500 tokens, $0.00625 at GPT-4o rates):
"You may find it helpful to know that the project under discussion
pertains to a web application framework that was originally developed
by a team of engineers at..."

After compression (600 tokens, $0.0015):
"Project: web framework by Meta, React-based, 2013, 200K+ GitHub stars"

Compression Techniques

LLMLingua

Use a small language model to compress prompts for a larger one. The small model identifies and removes non-essential tokens while preserving semantic content.

from llmlingua import PromptCompressor

# LLMLingua-2 uses a small token-classification model to decide which tokens to keep.
# Pass device_map="cpu" as well if no GPU is available.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

long_prompt = """
The Model Context Protocol is an open standard developed by Anthropic
that enables AI applications to connect with various tools and data
sources in a standardized way. This protocol is designed to provide
a common interface for AI systems to interact with external resources...
"""

# compress_prompt returns a dict, not a string
result = compressor.compress_prompt(
    long_prompt,
    rate=0.5,                      # Target: keep about 50% of the original tokens
    force_tokens=["!", "?", "."],  # Always preserve these tokens
)
print(f"Original: {result['origin_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")
print(result["compressed_prompt"])

Selective Context

For RAG pipelines: don't include all retrieved chunks in the prompt. Filter to only the most relevant ones.

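import tiktoken

# Assumption: o200k_base is the GPT-4o tokenizer; swap in your model's encoding.
_ENCODING = tiktoken.get_encoding("o200k_base")

def count_tokens(text):
    """Return the number of tokens in text."""
    return len(_ENCODING.encode(text))

def select_relevant_chunks(query, chunks, max_tokens=2000):
    """Select the highest-scoring chunks that fit within max_tokens.

    `query` is unused here: each chunk's `.score` is assumed to have been
    computed against the query by the retriever.
    """
    token_count = 0
    selected = []

    # Visit chunks from most to least relevant (score comes from retrieval)
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_tokens:
            break  # Stop at the first chunk that does not fit the budget
        selected.append(chunk)
        token_count += chunk_tokens

    return selected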
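
Stopping at the first chunk that overflows the budget can leave room unused when a smaller, lower-ranked chunk would still fit. The following is a minimal, self-contained sketch of a first-fit variant that skips oversized chunks instead of stopping. The `Chunk` dataclass, `approx_tokens`, and `select_within_budget` are hypothetical names for illustration, and the word-count tokenizer is only a crude stand-in for a real one.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical retrieved chunk: text plus the retriever's relevance score."""
    text: str
    score: float

def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())

def select_within_budget(chunks, max_tokens=2000, count=approx_tokens):
    """Greedy first-fit: take chunks by descending score, skipping any that overflow."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count(chunk.text)
        if used + cost > max_tokens:
            continue  # skip this chunk but keep looking for smaller ones
        selected.append(chunk)
        used += cost
    return selected

chunks = [
    Chunk("alpha " * 6, score=0.9),  # 6 words
    Chunk("beta " * 8, score=0.8),   # 8 words, does not fit after the first chunk
    Chunk("gamma " * 3, score=0.5),  # 3 words, still fits
]
picked = select_within_budget(chunks, max_tokens=10)
print([c.score for c in picked])
```

Whether skipping is preferable depends on the pipeline: it uses the budget more fully, but it can admit a weaker chunk while a stronger one is left out.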

Summarization-Based Compression

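Ask the model to summarize the context first, then answer from the summary. This takes two calls instead of one, and the first call still reads the full context, so the savings come from the much shorter second prompt. The approach pays off most when one summary is reused across several questions or when a cheaper model writes the summary.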
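
A minimal sketch of this two-call flow, reusing the wording of the two prompts in this section. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text; wire it to whichever client you use.

```python
from typing import Callable

SUMMARIZE_TEMPLATE = (
    "Summarize the following in {max_words} words, capturing only the "
    "key facts relevant to understanding the situation:\n\n{context}"
)
ANSWER_TEMPLATE = "Based on this summary: {summary}, answer: {question}"

def summarize_then_answer(
    context: str,
    question: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_words: int = 150,
) -> str:
    """Run the two-step flow: compress the context, then answer from the summary."""
    summary = llm(SUMMARIZE_TEMPLATE.format(max_words=max_words, context=context))
    return llm(ANSWER_TEMPLATE.format(summary=summary, question=question))
```

Passing the model call in as a parameter keeps the sketch independent of any particular SDK and makes it easy to route the summarization step to a cheaper model.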

# Step 1: Summarize
"Summarize the following in 150 words, capturing only the
key facts relevant to understanding the situation:"

# Step 2: Answer from summary
"Based on this summary: [compressed context], answer: [question]"

Token Pruning Heuristics

Simple rules that don't require a model call:

Remove:
- Stop words that don't change meaning ("the", "a", "an")
- Redundant phrases ("in order to" → "to")
- Filler text ("you may find it helpful to know that" → "")
- Repeated information across RAG chunks
- Whitespace and formatting tokens
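
A few of these rules can be sketched with regular expressions. This is an illustrative sketch, not a complete pruner: `PHRASE_REWRITES`, `prune`, and `drop_duplicate_chunks` are hypothetical names, the phrase table holds only the two examples from the list, and stop-word removal is left out on purpose because it is the rule most likely to change meaning.

```python
import re

# Illustrative rewrite table; extend it with phrases seen in your own prompts.
PHRASE_REWRITES = {
    r"\bin order to\b": "to",
    r"\byou may find it helpful to know that\b": "",
}

def prune(text: str) -> str:
    """Apply phrase rewrites, then collapse runs of whitespace."""
    for pattern, replacement in PHRASE_REWRITES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

def drop_duplicate_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, ignoring case and whitespace differences."""
    seen, unique = set(), []
    for chunk in chunks:
        key = re.sub(r"\s+", " ", chunk).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

print(prune("You may find it helpful to know that we cache   results in order to save time."))
```

Only exact repeats are caught by `drop_duplicate_chunks`; near-duplicate chunks would need a similarity measure, which is outside this sketch.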

Compression Quality Tradeoffs

| Method | Compression Ratio | Quality Retention | Added Latency |
|---|---|---|---|
| Token pruning | 1.2-1.5x | 95-98% | None |
| Selective context (RAG) | 2-10x | 90-95% | None |
| LLMLingua | 2-3x | 93-97% | +1-2s |
| Summarization | 3-10x | 85-93% | +API call |
| Aggressive compression | 10-20x | 70-85% | +API call |
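
The ratios in this table and the `rate` parameter in the LLMLingua example describe the same thing from opposite directions: a ratio of 2x corresponds to keeping half the tokens. A small sketch of the conversion, with hypothetical helper names:

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Ratio as used in the table: 2.0 means the prompt shrank to half its size."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens

def keep_rate(ratio: float) -> float:
    """Fraction of tokens kept for a given ratio (a 2x ratio keeps 0.5)."""
    return 1.0 / ratio

print(compression_ratio(2_500, 600))  # ratio for the opening before/after example
print(keep_rate(2.0))                 # matches rate=0.5 in the LLMLingua example
```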

When NOT to Compress

Tasks where exact wording matters:

  • Legal document analysis (every clause could be critical)
  • Contract review (missing a negation word changes meaning)
  • Medical records (precision > cost)

Short prompts: Compression overhead > savings if the prompt is under 500 tokens.

Creative tasks: Compression removes stylistic elements. A compressed poem prompt loses rhythm and nuance.

Ambiguous queries: Compression can remove disambiguating context. If the original query is already terse, don't compress further.
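
These rules can be turned into a guard that runs before any compressor. The sketch below is one possible encoding: the task labels and the `should_compress` name are hypothetical, and the 500-token threshold is the one given above. Ambiguity is not modeled, since detecting it needs more than a label.

```python
# Task labels are illustrative; map them to however your application tags requests.
EXACT_WORDING_TASKS = {"legal", "contract", "medical"}
STYLE_SENSITIVE_TASKS = {"creative"}
MIN_TOKENS_WORTH_COMPRESSING = 500  # threshold taken from the guidance above

def should_compress(prompt_tokens: int, task_type: str) -> bool:
    """Return False for cases where compression is likely to hurt more than help."""
    if task_type in EXACT_WORDING_TASKS or task_type in STYLE_SENSITIVE_TASKS:
        return False
    if prompt_tokens < MIN_TOKENS_WORTH_COMPRESSING:
        return False
    return True

print(should_compress(3_000, "rag"))
print(should_compress(3_000, "contract"))
```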

Cost Comparison

Typical RAG pipeline with 10 retrieved chunks (20,000 tokens total):

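| Approach | Input Tokens | Cost (GPT-4o) | Quality |
|---|---|---|---|
| Include all 10 chunks | 20,000 | $0.05 | 100% |
| Selective top-3 chunks | 6,000 | $0.015 | 95% |
| LLMLingua 50% + top-3 | 3,000 | $0.0075 | 93% |
| Summarize then answer | 500 + 500 | $0.0025 | 88% |

Note: the summarize-then-answer row does not include the cost of reading the original 20,000 tokens during the summarization call (another $0.05 at the same rate). Its savings hold only when a cheaper model writes the summary or the summary is reused across many queries.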

The sweet spot for most pipelines: top-3 selective context with light LLMLingua compression.
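
The token counts and costs in the table can be reproduced with a few lines of arithmetic. This sketch assumes equal-sized chunks of 2,000 tokens and an input price of $2.50 per million tokens, both inferred from the table's figures; the function names are illustrative, and current pricing may differ.

```python
PRICE_PER_MILLION_USD = 2.50  # assumed input rate implied by the table's figures

def pipeline_input_tokens(tokens_per_chunk: int, top_k: int, keep_rate: float = 1.0) -> int:
    """Tokens sent to the model after keeping top_k chunks and compressing them."""
    return round(tokens_per_chunk * top_k * keep_rate)

def input_cost(tokens: int) -> float:
    """Input cost in USD at the assumed rate."""
    return tokens * PRICE_PER_MILLION_USD / 1_000_000

scenarios = {
    "all 10 chunks": pipeline_input_tokens(2_000, top_k=10),
    "top-3 chunks": pipeline_input_tokens(2_000, top_k=3),
    "top-3 + 50% compression": pipeline_input_tokens(2_000, top_k=3, keep_rate=0.5),
}
for name, tokens in scenarios.items():
    print(f"{name:<26}{tokens:>7} tokens  ${input_cost(tokens):.4f}")
```

Changing `top_k` or `keep_rate` shows how quickly the two levers compound, which is why combining a modest selection step with light compression tends to be enough.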

Combining With Prompt Caching

Compression and caching are complementary:

  1. Compress your system prompt and static examples once.
  2. Cache the compressed version.
  3. Append only the dynamic query for each call.

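This combines compression savings with cache discounts. Because prompt caches typically match on an exact prefix, reuse the stored compressed text verbatim rather than recompressing it on every call.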
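
The three steps can be sketched as follows. The `compress` function is a placeholder that only collapses whitespace (swap in a real compressor), and the `lru_cache` here is local memoization, not the provider's prompt cache. Its job is to guarantee that the compressed prefix is identical across calls so that a provider-side prefix cache has a chance to hit.

```python
from functools import lru_cache

def compress(text: str) -> str:
    """Placeholder compressor (collapses whitespace); swap in a real one."""
    return " ".join(text.split())

@lru_cache(maxsize=None)
def compressed_prefix(static_prompt: str) -> str:
    """Compress the static part once per process and reuse the exact same string."""
    return compress(static_prompt)

def build_prompt(static_prompt: str, query: str) -> str:
    """Stable compressed prefix first, dynamic query last."""
    return f"{compressed_prefix(static_prompt)}\n\nQuery: {query}"

SYSTEM = "You are a support assistant.   Answer   briefly and cite the docs."
first = build_prompt(SYSTEM, "How do I reset my password?")
second = build_prompt(SYSTEM, "How do I change my email?")
```

Keeping the dynamic query at the end matters: anything that varies per call and sits before the static text would change the prefix and defeat the cache.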