Context Compression: Reduce Tokens Without Losing Quality
Compress prompts for lower cost and latency using LLMLingua, selective context, and summarization. Learn compression ratios, quality tradeoffs, and when not to compress.
The Compression Imperative
Context windows are growing (200K+ tokens), but every token costs money and adds latency. Much of a typical prompt is filler, and removing non-essential tokens without losing meaning can cut input costs 2-5x.
Before compression (2,500 tokens, $0.00625 at GPT-4o rates):
"You may find it helpful to know that the project under discussion
pertains to a web application framework that was originally developed
by a team of engineers at..."
After compression (600 tokens, $0.0015):
"Project: web framework by Meta, React-based, 2013, 200K+ GitHub stars"
Compression Techniques
LLMLingua
Use a small language model to compress prompts for a larger one. The small model identifies and removes non-essential tokens while preserving semantic content.
from llmlingua import PromptCompressor

# Load the LLMLingua-2 compressor (a small token-classification model).
# Pass device_map="cpu" if no GPU is available.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

long_prompt = """
The Model Context Protocol is an open standard developed by Anthropic
that enables AI applications to connect with various tools and data
sources in a standardized way. This protocol is designed to provide
a common interface for AI systems to interact with external resources...
"""

# compress_prompt returns a dict; the compressed text is under "compressed_prompt".
result = compressor.compress_prompt(
    long_prompt,
    rate=0.5,  # Keep roughly 50% of the original tokens
    force_tokens=["!", "?", "."],  # Always preserve these tokens
)
compressed = result["compressed_prompt"]

print(f"Original: {len(long_prompt.split())} words")
print(f"Compressed: {len(compressed.split())} words")
Selective Context
For RAG pipelines: don't include all retrieved chunks in the prompt. Filter to only the most relevant ones.
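import tiktoken

# Assumption: chunks expose .text and .score, and tokens are counted with
# tiktoken's o200k_base encoding (used by GPT-4o). Swap in your model's tokenizer.
_encoding = tiktoken.get_encoding("o200k_base")

def count_tokens(text):
    return len(_encoding.encode(text))

def select_relevant_chunks(query, chunks, max_tokens=2000):
    """Select only chunks likely to help answer the query."""
    token_count = 0
    selected = []
    # Sort by relevance score (from retrieval), highest first
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        chunk_tokens = count_tokens(chunk.text)
        # Stop at the first chunk that would exceed the token budget
        if token_count + chunk_tokens > max_tokens:
            break
        selected.append(chunk)
        token_count += chunk_tokens
    return selected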
Summarization-Based Compression
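Ask the model to summarize the context first, then answer from the summary. This takes two calls instead of one, but the second call has a dramatically compressed prompt. The first call still reads the full context, so the savings are largest when one summary is reused across many questions.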
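from typing import Callable

def summarize_then_answer(context: str, question: str, llm: Callable[[str], str]) -> str:
    """Two-step flow. `llm` is an assumed prompt-to-text callable, not a real API."""
    # Step 1: Summarize
    summary = llm("Summarize the following in 150 words, capturing only the "
                  f"key facts relevant to understanding the situation:\n\n{context}")
    # Step 2: Answer from summary
    return llm(f"Based on this summary: {summary}, answer: {question}")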
Token Pruning Heuristics
Simple rules that don't require a model call:
Candidates for removal:
- Stop words that don't change meaning ("the", "a", "an")
- Redundant phrases ("in order to" → "to")
- Filler text ("you may find it helpful to know that" → "")
- Repeated information across RAG chunks
- Whitespace and formatting tokens
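A minimal sketch of these rules using only regular expressions. The phrase list, the stop-word list, and the function names `prune` and `dedupe_chunks` are illustrative assumptions, not a standard rule set; stop-word removal is off by default because it can hurt readability.

```python
import re

# Illustrative rules only; extend these for your own corpus.
PHRASE_REWRITES = [
    (re.compile(r"\byou may find it helpful to know that\s*", re.IGNORECASE), ""),
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
]
STOP_WORDS = re.compile(r"\b(?:the|a|an)\b\s*", re.IGNORECASE)

def prune(text: str, drop_stop_words: bool = False) -> str:
    """Apply phrase rewrites, optionally drop articles, then collapse whitespace."""
    for pattern, replacement in PHRASE_REWRITES:
        text = pattern.sub(replacement, text)
    if drop_stop_words:
        text = STOP_WORDS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks that repeat an earlier one after whitespace/case normalization."""
    seen, unique = set(), []
    for chunk in chunks:
        key = re.sub(r"\s+", " ", chunk).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

sample = "You may find it helpful to know that   the cache is used in order to reduce latency."
print(prune(sample))
print(prune(sample, drop_stop_words=True))
```

Exact-match deduplication only catches verbatim repeats; near-duplicate chunks would need a similarity measure, which is outside this sketch.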
Compression Quality Tradeoffs
| Method | Compression Ratio | Quality Retention | Added Latency |
|---|---|---|---|
| Token pruning | 1.2-1.5x | 95-98% | None |
| Selective context (RAG) | 2-10x | 90-95% | None |
| LLMLingua | 2-3x | 93-97% | +1-2s |
| Summarization | 3-10x | 85-93% | +API call |
| Aggressive compression | 10-20x | 70-85% | +API call |
When NOT to Compress
Tasks where exact wording matters:
- Legal document analysis (every clause could be critical)
- Contract review (missing a negation word changes meaning)
- Medical records (precision > cost)
Short prompts: Compression overhead exceeds the savings if the prompt is under 500 tokens.
Creative tasks: Compression removes stylistic elements. A compressed poem prompt loses rhythm and nuance.
Ambiguous queries: Compression can remove disambiguating context. If the original query is already terse, don't compress further.
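These rules can be encoded as a simple gate in front of the compressor. This is a sketch under stated assumptions: the task labels and the `should_compress` function are hypothetical, and the ambiguous-query case is left out because it needs a judgment that a simple rule cannot make.

```python
# Task labels are hypothetical; map them to whatever categories your pipeline uses.
EXACT_WORDING_TASKS = {"legal", "contract", "medical"}
MIN_TOKENS_WORTH_COMPRESSING = 500  # threshold taken from the guidance above

def should_compress(task: str, prompt_tokens: int, creative: bool = False) -> bool:
    """Return False for the no-compress cases listed above, True otherwise."""
    if task in EXACT_WORDING_TASKS:
        return False  # exact wording matters
    if creative:
        return False  # style and rhythm would be lost
    if prompt_tokens < MIN_TOKENS_WORTH_COMPRESSING:
        return False  # overhead exceeds savings
    return True

print(should_compress("support", 4000))
print(should_compress("contract", 4000))
```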
Cost Comparison
Typical RAG pipeline with 10 retrieved chunks (20,000 tokens total):
| Approach | Input Tokens | Cost (GPT-4o) | Quality |
|---|---|---|---|
| Include all 10 chunks | 20,000 | $0.05 | 100% |
| Selective top-3 chunks | 6,000 | $0.015 | 95% |
| LLMLingua 50% + top-3 | 3,000 | $0.0075 | 93% |
| Summarize then answer | 500 + 500 | $0.0025 | 88% |
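The sweet spot for most pipelines: top-3 selective context with light LLMLingua compression. Note that the summarize-then-answer row counts only the compressed prompts. The summarization call itself still has to read the source chunks, so its real cost is higher unless the summary is reused across queries or produced by a cheaper model.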
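The cost column can be reproduced with per-call arithmetic. The sketch below assumes an input price of $2.50 per million tokens (the rate implied by the table; check current pricing) and ignores output tokens. The last entry is an added assumption, not a figure from the table: it shows the total if the summarization call reads all 20,000 source tokens at the same price.

```python
# Assumed input price implied by the table: $2.50 per 1M tokens. Output tokens are ignored.
PRICE_PER_MILLION_INPUT_USD = 2.50

# Input tokens read by each call in the pipeline, one list entry per call.
pipelines = {
    "all 10 chunks": [20_000],
    "top-3 chunks": [6_000],
    "LLMLingua 50% + top-3": [3_000],
    "summarize then answer (as in the table)": [500, 500],
    # Assumption: the summarization call reads the full 20,000 tokens.
    "summarize then answer (full source read)": [20_000, 500],
}

def total_input_cost(calls: list[int]) -> float:
    """Sum input-token cost in USD across every call in a pipeline."""
    return sum(calls) * PRICE_PER_MILLION_INPUT_USD / 1_000_000

totals = {name: total_input_cost(calls) for name, calls in pipelines.items()}
for name, usd in totals.items():
    print(f"{name:45s} ${usd:.5f}")
```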
Combining With Prompt Caching
Compression and caching are complementary:
- Compress your system prompt and static examples once.
- Cache the compressed version.
- Append only the dynamic query for each call.
This gives you both compression savings and cache discounts on the static portion of every call.
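The three steps above can be sketched as follows. The `compress` callable stands in for whichever compressor you use, and `compressed_prefix` and `build_prompt` are hypothetical helper names. The sketch assumes the provider's prompt cache matches on an identical prompt prefix, so the compressed text is computed once and reused byte-for-byte.

```python
import hashlib
from typing import Callable

_prefix_cache: dict[str, str] = {}

def compressed_prefix(static_prompt: str, compress: Callable[[str], str]) -> str:
    """Compress the static prompt once and reuse the exact result on later calls."""
    key = hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()
    if key not in _prefix_cache:
        _prefix_cache[key] = compress(static_prompt)
    return _prefix_cache[key]

def build_prompt(static_prompt: str, query: str, compress: Callable[[str], str]) -> str:
    """Keep the compressed prefix identical across calls; append only the query."""
    return f"{compressed_prefix(static_prompt, compress)}\n\nQuery: {query}"

# Stand-in compressor for illustration: collapses whitespace and records each call.
compress_calls = []
def toy_compress(text: str) -> str:
    compress_calls.append(text)
    return " ".join(text.split())

system_prompt = "You are a   support assistant.\n\nAnswer   briefly."
first = build_prompt(system_prompt, "How do I reset my password?", toy_compress)
second = build_prompt(system_prompt, "Where is my invoice?", toy_compress)
print(len(compress_calls))  # the compressor ran only for the first prompt
```

Recompressing on every call would waste compute, and if the compressor is not deterministic it could also change the prefix and defeat the provider-side cache.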