kv-canary

A canary for silent KV-cache-compression failures.

GitHub repo · README · Runbook

Lossy KV-cache compression (quantization, token eviction) silently breaks functional outputs: generated code stops passing its tests, tool calls stop being valid JSON. VeriCache shows these methods' outputs "increasingly diverge from full-KV-cache outputs as more tokens are decoded, [leading] to catastrophic failures in code generation and tool calling"; The Pitfalls of KV Cache Compression shows aggregate metrics hide those per-instruction failures.

The question kv-canary tests: when KV compression breaks your code and tool calls, does the cheap token-level metric (perplexity) warn you, or stay silent? That is a hypothesis under test — no paper has shown perplexity is blind to KV-compression damage.

The finding: divergence

For each compression method, kv-canary plots two lines against KV memory retained: functional accuracy (does the code pass / is the tool call correct?) and perplexity-implied quality (the token-level metric). The result is the divergence — functional accuracy cliffing while the perplexity line hugs 1.0. As a single number:

SDS = (relative functional drop) / (relative perplexity rise)

A high Silent Degradation Score means functional accuracy craters while perplexity barely moves — the metric "lies." SDS ≈ 1 is graceful degradation. Perplexity improving while function breaks is the most deceptive case, and SDS reports it as maximally silent — by design.

What it measures

FamilyMethodsbudget
baselinefull (fp16)1.0
quantizationint8 / int4 / int2 KVbits / 16
token evictionStreamingLLM, SnapKVfraction of tokens kept

…against two objective, eviction-sensitive functional tasks — code execution (pass@1) and tool/JSON calling (valid + correct-function + arg-match) — contrasted with perplexity, all on a shared KV-memory-retained x-axis.

Quickstart (CPU, no GPU)

pip install -e ".[dev,ml]"
python -m kvcanary run configs/smoke.yaml --out results/raw/smoke.jsonl
python scripts/aggregate_and_report.py results/raw/smoke.jsonl
# -> report/divergence.png, report/RESULTS.md

How it works

Three small, independently-tested seams compose into a resumable experiment runner: KVCompressor (the methods, behind one budget knob), Backend (HuggingFace model, plus a deterministic fake for model-free tests), and Task (code-exec sandbox, tool-call validation, perplexity). The runner sweeps models × compressors × tasks, writes one JSON line per sample, and resumes by skipping finished cells — so a killed spot-GPU run restarts cheaply.

Status & honest caveats. The harness is wired end-to-end — compression is applied into HF's live KV cache and perplexity is measured under it, so the divergence is real (run the GPU matrix on a real instruct model before trusting numbers). Two method implementations are simplified and a faithful reviewer would flag them: StreamingLLM evicts from a post-RoPE cache without re-roping, and SnapKV is head-averaged without pooling — so treat these as "-style" eviction, not exact reproductions. Functional-degradation-under-KV-compression is also already measured by NVIDIA KVPress and others; the narrow niche here is executable-code + tool-schema conformance vs. perplexity. Full cited audit: RESEARCH.md.