The Context Window Challenge
Large language models have limited context windows — the amount of text they can process at once. Even as models expand from 4K to 100K+ tokens, effective context management remains critical for cost, latency, and response quality.
Understanding Context Windows
Token Limits by Model
- GPT-4o: 128K tokens
- Claude 3.5 Sonnet: 200K tokens
- Gemini 1.5 Pro: 2M tokens
- Llama 3: 8K-128K tokens (varies by variant)
Why Limits Matter
Even with large windows, filling the full capacity has drawbacks:
- Cost: you pay per token
- Latency: more tokens mean slower responses
- Attention dilution: models may miss important details in very long contexts
- Budget allocation: input tokens compete with output tokens
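The examples below assume a `count_tokens` helper. In production you would use the provider's tokenizer (e.g. `tiktoken` for OpenAI models); as a minimal dependency-free sketch, a characters-divided-by-four heuristic is a common rough approximation:

```python
def count_tokens(msg_or_text):
    """Rough token estimate: ~4 characters per token for English text.

    Accepts either a raw string or a chat message dict with a 'content' key.
    Swap in a real tokenizer (e.g. tiktoken) when accurate counts matter.
    """
    text = msg_or_text["content"] if isinstance(msg_or_text, dict) else msg_or_text
    return max(1, len(text) // 4)
```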
Basic Strategies
Truncation
The simplest approach — cut content that exceeds the limit:
```python
def truncate_context(messages, max_tokens):
    total = 0
    result = []
    for msg in reversed(messages):  # Walk backwards to keep the most recent messages
        msg_tokens = count_tokens(msg)
        if total + msg_tokens > max_tokens:
            break
        result.insert(0, msg)
        total += msg_tokens
    return result
```
Limitation: may silently drop important early context, including the system prompt in a long conversation.
Sliding Window
Keep a fixed window of recent messages:
```python
def sliding_window(messages, window_size=10):
    system_msg = messages[0] if messages and messages[0]['role'] == 'system' else None
    # Slice recent turns from the non-system messages so the system prompt is never duplicated
    rest = messages[1:] if system_msg else messages
    recent = rest[-window_size:]
    return ([system_msg] if system_msg else []) + recent
```
Better for conversations, but still loses context.
Intermediate Strategies
Summarization
Compress old context into summaries:
```python
def summarize_and_compact(messages, max_tokens):
    if sum(count_tokens(m) for m in messages) <= max_tokens:
        return messages
    # Keep the system prompt and the five most recent messages verbatim
    system = messages[0]
    recent = messages[-5:]
    middle = messages[1:-5]
    # Compress the middle section (llm.summarize stands in for a
    # summarization call to your model of choice)
    summary = llm.summarize(middle)
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent,
    ]
```
Trade-off: Summaries lose detail but preserve themes.
Importance Scoring
Prioritize context by relevance:
```python
def score_message_importance(message, current_query, recency=1.0):
    # Callers should pass a recency value that decays with message age
    scores = {
        'recency': recency,
        'relevance': cosine_similarity(embed(message['content']), embed(current_query)),
        'role_weight': 1.5 if message['role'] == 'system' else 1.0,
        'user_explicit': 2.0 if 'remember' in message['content'].lower() else 1.0,
    }
    return sum(scores.values())
```
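To make the idea concrete, here is a self-contained toy version that ranks messages and keeps the top scorers. It substitutes a bag-of-words "embedding" and a simple age-based recency decay for real embeddings; every helper name here is illustrative, not from any particular library:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_messages(messages, current_query, keep=3):
    scored = []
    for age, msg in enumerate(reversed(messages)):   # age 0 = newest message
        recency = 1.0 / (1 + age)                    # simple decay with age
        relevance = cosine_similarity(embed(msg['content']), embed(current_query))
        role_weight = 1.5 if msg['role'] == 'system' else 1.0
        explicit = 2.0 if 'remember' in msg['content'].lower() else 1.0
        scored.append((recency + relevance + role_weight + explicit, msg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [msg for _, msg in scored[:keep]]
```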
Advanced Strategies
Retrieval-Augmented Generation (RAG)
Don't store everything in context — retrieve what's needed:
```python
class RAGContextManager:
    def __init__(self, vector_store, system_prompt):
        self.vector_store = vector_store
        self.system_prompt = system_prompt
        self.recent_messages = []

    def build_context(self, query, max_tokens=4000):
        # Always include the system prompt
        context = [self.system_prompt]
        remaining = max_tokens - count_tokens(self.system_prompt)
        # Retrieve relevant documents
        docs = self.vector_store.similarity_search(query, k=5)
        for doc in docs:
            doc_tokens = count_tokens(doc)
            if doc_tokens < remaining:
                context.append({"role": "system", "content": doc})
                remaining -= doc_tokens
        # Add the most recent conversation turns
        for msg in self.recent_messages[-3:]:
            msg_tokens = count_tokens(msg)
            if msg_tokens < remaining:
                context.append(msg)
                remaining -= msg_tokens
        return context
```
Hierarchical Context
Organize context in layers:
- Layer 1 (Always present): System prompt, user preferences, session state
- Layer 2 (Retrieved): Relevant documents, past interactions
- Layer 3 (Recent): Current conversation turns
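One way to sketch this layering is a greedy fill under a token budget, taking layers in priority order so Layer 1 is always served first. The `count_tokens` heuristic here is a toy stand-in for a real tokenizer, and the function itself is an illustrative sketch rather than a standard API:

```python
def count_tokens(msg):
    # Toy estimate: ~4 characters per token
    return max(1, len(msg["content"]) // 4)

def assemble_layers(always, retrieved, recent, max_tokens):
    """Greedily fill the context from Layer 1 outward; messages that
    don't fit the remaining budget are skipped."""
    context, used = [], 0
    for layer in (always, retrieved, recent):
        for msg in layer:
            cost = count_tokens(msg)
            if used + cost > max_tokens:
                continue
            context.append(msg)
            used += cost
    return context
```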
Context Caching
Some providers (like Anthropic) support prompt caching — reuse common prefixes:
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
```
Cached prefixes process faster and cost less on subsequent requests.
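As a back-of-the-envelope illustration of the savings: Anthropic's published cache pricing (at the time of writing) charges roughly 1.25x the base input rate for cache writes and 0.1x for cache reads. Treat the rates below as assumptions to check against current pricing:

```python
def cached_cost(prefix_tokens, requests, base_rate_per_mtok=3.00,
                write_mult=1.25, read_mult=0.10):
    """Cost of sending the same prefix `requests` times, with vs. without caching.

    Rates are illustrative: $3.00 per million input tokens as the base,
    with assumed cache write/read multipliers of 1.25x and 0.1x.
    """
    per_tok = base_rate_per_mtok / 1_000_000
    without = prefix_tokens * per_tok * requests
    # One cache write on the first request, cache reads thereafter
    with_cache = prefix_tokens * per_tok * (write_mult + read_mult * (requests - 1))
    return without, with_cache

uncached, cached = cached_cost(prefix_tokens=50_000, requests=100)
```

Under these assumptions, a 50K-token prefix reused across 100 requests costs roughly 9x less with caching.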
Monitoring and Optimization
Track Token Usage
```python
from collections import defaultdict
from datetime import datetime

class ContextMetrics:
    def __init__(self, input_rate=3.00, output_rate=15.00):
        # Illustrative per-million-token rates; set these to your model's pricing
        self.input_rate = input_rate
        self.output_rate = output_rate
        self.history = []

    def calculate_cost(self, input_tokens, output_tokens):
        return (input_tokens * self.input_rate + output_tokens * self.output_rate) / 1_000_000

    def log(self, input_tokens, output_tokens, context_strategy):
        self.history.append({
            'timestamp': datetime.now(),
            'input': input_tokens,
            'output': output_tokens,
            'strategy': context_strategy,
            'cost': self.calculate_cost(input_tokens, output_tokens),
        })

    def analyze(self):
        # Average cost per request, grouped by context strategy
        by_strategy = defaultdict(list)
        for entry in self.history:
            by_strategy[entry['strategy']].append(entry['cost'])
        return {k: sum(v) / len(v) for k, v in by_strategy.items()}
```
Best Practices
- Start conservatively: Use less context than available, expand if needed
- Prioritize system prompts: They set behavior and should rarely be truncated
- Measure quality impact: Track response quality against context utilization
- Consider latency: Smaller contexts respond faster
- Use structured data: JSON/XML is often more token-efficient than prose
Conclusion
Effective context management balances completeness against cost and latency. Start with simple strategies, measure their impact, and adopt more sophisticated approaches like RAG when the complexity is justified. The goal is giving the model exactly what it needs — no more, no less.