chopratejas/headroom: compress what your agent reads before it costs you tokens

A squeeze between your data and the model

Headroom is a context-compression layer for AI agents. It compresses everything your agent reads (tool outputs, logs, RAG chunks, files, and conversation history) before it reaches the LLM, and claims 60 to 95 percent fewer tokens with the same answers. The README’s example is a 10,144-token input compressed to 1,260 tokens, about 88 percent fewer, while still surfacing the same FATAL line. The premise is that most of what fills a context window is low-value noise around a few load-bearing tokens, and that squeezing the noise costs close to nothing in answer quality.
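To check the README example against the claimed range, the reduction is simple arithmetic. The snippet below is only a worked calculation; the function name reduction_percent is made up for illustration and is not part of Headroom.

```python
def reduction_percent(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens removed by compression."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100.0 * (tokens_before - tokens_after) / tokens_before


# Figures quoted from the README example: 10,144 tokens in, 1,260 out.
readme_example = reduction_percent(10_144, 1_260)
print(f"{readme_example:.1f}% fewer tokens")  # prints "87.6% fewer tokens"
```

That single example sits inside the claimed 60-to-95 band, but it is one input, not a benchmark.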

What makes it more than a one-trick compressor is the reversible design. Originals are never deleted; under what it calls CCR, the model can retrieve the full text on demand. Compression is therefore not lossy in the way truncation is; it is deferred. The agent sees a compact view and pulls the detail back only when it needs it. The reversible store doubles as a safety net: if compression drops something the model turns out to need, the original is one retrieval away rather than gone for good. Headroom also ships six compression algorithms rather than one fixed transform, so you can tune how aggressively to squeeze for a given workload.
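The retrieve-on-demand idea can be sketched with a toy store. This is not Headroom's implementation or API: the class ReversibleStore, its methods, the keyword filter, and the handle format are all hypothetical, chosen only to show the pattern of "keep the original, hand the model a compact view plus a way back".

```python
import hashlib


class ReversibleStore:
    """Toy retrieve-on-demand store (illustrative only, not Headroom's code)."""

    def __init__(self) -> None:
        self._originals = {}  # handle -> full original text

    def compress(self, text, keep=("FATAL", "ERROR")):
        """Store the original and return (handle, compact_view)."""
        handle = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[handle] = text  # the original is kept, never deleted
        lines = text.splitlines()
        # Naive "compression": keep only lines containing a keyword.
        kept = [ln for ln in lines if any(k in ln for k in keep)]
        dropped = len(lines) - len(kept)
        note = f"[{dropped} lines omitted; retrieve with handle {handle}]"
        return handle, "\n".join(kept + [note])

    def retrieve(self, handle):
        """Return the full original text for a handle."""
        return self._originals[handle]


log = "\n".join(
    [f"INFO heartbeat {i}" for i in range(200)] + ["FATAL disk full on /var"]
)
store = ReversibleStore()
handle, compact = store.compress(log)

print(compact)                          # the FATAL line plus an omission note
print(store.retrieve(handle) == log)    # the original is fully recoverable
```

The keyword filter here is deliberately crude; the point is only that the compact view and the stored original are separate, so a bad compression decision can be undone by a retrieval.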

Four ways to drop it in

Headroom deliberately ships in several form factors, which is most of why it is easy to adopt:

Library: call compress(messages) inline in Python or TypeScript.
Proxy: run headroom proxy --port 8787 and point any client at it, zero code changes, any language.
Agent wrap: headroom wrap claude|codex|cursor|aider|copilot wraps a coding agent in one command.
MCP server: headroom_compress, headroom_retrieve, and headroom_stats for any MCP client.

It also keeps a cross-agent memory store, shared across Claude, Codex, and Gemini, with automatic dedup. Separately, headroom learn mines failed sessions and writes corrections into your CLAUDE.md or AGENTS.md.

Install

pip install "headroom-ai[all]"   # Python
npm install headroom-ai          # Node / TypeScript

Then pick a form factor: headroom proxy --port 8787 for a drop-in proxy, headroom wrap claude to wrap an agent, or headroom mcp install for the MCP-native path.

The tradeoff the README does not put up front

This is the insight worth taking away, and it comes straight from the issue tracker: compression can fight prompt caching. The most-discussed open issue reports a large drop in cache hit rate. Providers cache on a stable prompt prefix, and if Headroom rewrites the context, that prefix changes, so previously cached tokens miss and get re-billed at full, sometimes higher, rates. The net token cost is then a race between what compression saves and what lost cache hits cost. On workloads built around a large, stable, cached context, aggressive compression can be a wash or worse, while on workloads dominated by fresh, uncacheable tool output it is a clear win. Measure your own traffic rather than assuming the 60-to-95-percent figure translates into net savings.
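The race between compression savings and lost cache hits can be made concrete with a back-of-the-envelope model. Everything numeric here is an assumption for illustration: cached prefix tokens are billed at 10 percent of the normal input price, compression keeps 20 percent of tokens, and compression is assumed to break the cache entirely. Real provider pricing, cache-write surcharges, and output tokens are ignored, and cost_per_request is a hypothetical helper, not a Headroom function.

```python
def cost_per_request(prefix_tokens, fresh_tokens, *, keep_ratio=1.0,
                     cache_hit=True, cached_price_factor=0.1,
                     price_per_token=1.0):
    """Input cost of one request in arbitrary units.

    keep_ratio: fraction of tokens left after compression (1.0 = none).
    cache_hit: whether the stable prefix is billed at the cached rate.
    cached_price_factor: assumed discount on cached tokens (hypothetical).
    """
    prefix = prefix_tokens * keep_ratio
    fresh = fresh_tokens * keep_ratio
    prefix_rate = price_per_token * (cached_price_factor if cache_hit else 1.0)
    return prefix * prefix_rate + fresh * price_per_token


workloads = {
    "cache-heavy": (50_000, 1_000),   # large stable prefix, little fresh input
    "tool-heavy": (2_000, 40_000),    # small prefix, lots of fresh tool output
}

results = {}
for name, (prefix, fresh) in workloads.items():
    baseline = cost_per_request(prefix, fresh)  # no compression, cache hits
    compressed = cost_per_request(prefix, fresh, keep_ratio=0.2,
                                  cache_hit=False)  # compressed, cache misses
    results[name] = (baseline, compressed)
    verdict = "cheaper" if compressed < baseline else "more expensive"
    print(f"{name}: baseline={baseline:.0f} compressed={compressed:.0f} "
          f"-> {verdict}")
```

Under these assumed numbers the cache-heavy workload gets more expensive with compression and the tool-heavy one gets much cheaper, which is the shape of the tradeoff described above. Plugging in your own token mix and your provider's actual cached-token pricing is the useful exercise.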

A second reported gotcha: the Codex proxy path has broken OpenAI authentication for some users. If you wrap Codex through the proxy and auth fails, that is a known reported issue and not necessarily a misconfiguration on your end. With 227 open issues as of June 2026 and frequent releases (v0.24.0 in June 2026), the project is fast-moving; pin a version if you need a stable pipeline.

Headroom versus LLMLingua

	headroom	LLMLingua
Stars	21,144	6,272
Form	library, proxy, agent wrap, MCP	research library
Reversible	yes, retrieve on demand	no
Focus	agent-integrated context compression	prompt compression

Counts are from GitHub as of June 2026. LLMLingua is the well-known prompt-compression research project, strong on the core technique but used as a library you wire in yourself. Headroom’s distinct bets are the multiple drop-in form factors, the reversible retrieve-on-demand model, and the direct agent integrations. If you want a researched compression algorithm to build on, LLMLingua; if you want compression wired into a running agent with minimal code, Headroom.

For the complementary token saving of pre-indexing code so agents grep less, see CodeGraph. For converting documents into compact, LLM-ready text upstream, see MarkItDown. For what else is climbing, see LLM tooling, the daily digest, and the weekly report.

FAQ

What does Headroom compress? Tool outputs, logs, RAG chunks, files, and conversation history, before they reach the LLM, reportedly cutting 60 to 95 percent of tokens.

Is the compression lossy? It is reversible. Originals are kept and the model can retrieve full text on demand, so detail is deferred rather than discarded.

Will it actually lower my bill? Not always. Compression can reduce prompt-cache hit rates, and lost cache hits cost tokens, so on cache-heavy workloads measure the net before assuming a win.

How do I add it? Install headroom-ai, then use it as a library, a proxy (headroom proxy), an agent wrap (headroom wrap), or an MCP server.
