The Compression Imperative

Context windows are growing (200K+ tokens), but every token costs money and adds latency. Most context is filler — removing non-essential tokens without losing meaning can cut costs 2-5x.

Before compression (2,500 tokens, $0.00625 at GPT-4o rates):
"You may find it helpful to know that the project under discussion
pertains to a web application framework that was originally developed
by a team of engineers at..."

After compression (600 tokens, $0.0015):
"Project: web framework by Meta, React-based, 2013, 200K+ GitHub stars"

Compression Techniques

LLMLingua

Use a small language model to compress prompts for a larger one. The small model identifies and removes non-essential tokens while preserving semantic content.

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual",
    use_llmlingua2=True
)

long_prompt = """
The Model Context Protocol is an open standard developed by Anthropic
that enables AI applications to connect with various tools and data
sources in a standardized way. This protocol is designed to provide
a common interface for AI systems to interact with external resources...
"""

compressed = compressor.compress_prompt(
    long_prompt,
    rate=0.5,           # Compress to 50% of original tokens
    force_tokens=["!", "?", "."]  # Preserve these
)
print(f"Original: {len(long_prompt.split())} words")
print(f"Compressed: {len(compressed.split())} words")

Selective Context

For RAG pipelines: don't include all retrieved chunks in the prompt. Filter to only the most relevant ones.

def select_relevant_chunks(query, chunks, max_tokens=2000):
    """Select only chunks likely to help answer the query."""
    token_count = 0
    selected = []

    # Sort by relevance score (from retrieval)
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_tokens:
            break
        selected.append(chunk)
        token_count += chunk_tokens

    return selected

Summarization-Based Compression

Ask the model to summarize context first, then answer. Two calls instead of one, but the second call has a dramatically compressed prompt.

# Step 1: Summarize
"Summarize the following in 150 words, capturing only the
key facts relevant to understanding the situation:"

# Step 2: Answer from summary
"Based on this summary: [compressed context], answer: [question]"

Token Pruning Heuristics

Simple rules that don't require a model call:

Remove if:
- Stop words that don't change meaning ("the", "a", "an")
- Redundant phrases ("in order to" → "to")
- Filler text ("you may find it helpful to know that" → "")
- Repeated information across RAG chunks
- Whitespace and formatting tokens

Compression Quality Tradeoffs

Method	Compression Ratio	Quality Retention	Latency
Token pruning	1.2-1.5x	95-98%	None
Selective context (RAG)	2-10x	90-95%	None
LLMLingua	2-3x	93-97%	+1-2s
Summarization	3-10x	85-93%	+API call
Aggressive compression	10-20x	70-85%	+API call

When NOT to Compress

Tasks where exact wording matters:

Legal document analysis (every clause could be critical)
Contract review (missing a negation word changes meaning)
Medical records (precision > cost)

Short prompts: Compression overhead > savings if the prompt is under 500 tokens.

Creative tasks: Compression removes stylistic elements. A compressed poem prompt loses rhythm and nuance.

Ambiguous queries: Compression can remove disambiguating context. If the original query is already terse, don't compress further.

Cost Comparison

Typical RAG pipeline with 10 retrieved documents (20,000 tokens total):

Approach	Input Tokens	Cost (GPT-4o)	Quality
Include all 10 chunks	20,000	$0.05	100%
Selective top-3 chunks	6,000	$0.015	95%
LLMLingua 50% + top-3	3,000	$0.0075	93%
Summarize then answer	500 + 500	$0.0025	88%

The sweet spot for most pipelines: top-3 selective context with light LLMLingua compression.

Combining With Prompt Caching

Compression and caching are complementary:

Compress your system prompt and static examples once.
Cache the compressed version.
Append only the dynamic query for each call.

This gives you both compression savings AND cache discounts — maximum cost optimization.

Context Compression: Reduce Tokens Without Losing Quality

The Compression Imperative

Compression Techniques

LLMLingua

Selective Context

Summarization-Based Compression

Token Pruning Heuristics

Compression Quality Tradeoffs

When NOT to Compress

Cost Comparison

Combining With Prompt Caching

Related Articles

Constitutional AI: Principle-Based Model Alignment

Gemini Streaming & Real-time: Live API & Latency Optimization

Image Generation with ChatGPT

On this page