A canary for silent KV-cache-compression failures.
Lossy KV-cache compression (quantization, token eviction) silently breaks functional outputs: generated code stops passing its tests, tool calls stop being valid JSON. VeriCache shows these methods' outputs "increasingly diverge from full-KV-cache outputs as more tokens are decoded, [leading] to catastrophic failures in code generation and tool calling"; The Pitfalls of KV Cache Compression shows aggregate metrics hide those per-instruction failures.
The question kv-canary tests: when KV compression breaks your code and tool calls, does the cheap token-level metric (perplexity) warn you, or stay silent? That is a hypothesis under test — no paper has shown perplexity is blind to KV-compression damage.
For each compression method, kv-canary plots two lines against KV memory retained: functional accuracy (does the code pass? is the tool call correct?) and perplexity-implied quality (the token-level metric). The signal of interest is divergence: functional accuracy falling off a cliff while the perplexity line stays near 1.0. The Silent Degradation Score (SDS) condenses that divergence into a single number:
SDS = (relative functional drop) / (relative perplexity rise)
A high Silent Degradation Score means functional accuracy craters while perplexity barely moves — the metric "lies." SDS ≈ 1 is graceful degradation. Perplexity improving while function breaks is the most deceptive case, and SDS reports it as maximally silent — by design.
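The ratio and its edge cases can be sketched in a few lines. This is a minimal illustration, not kv-canary's actual implementation: the function name `silent_degradation_score`, the choice to measure both changes relative to the full-cache baseline, the use of `math.inf` for "maximally silent", and the `0.0` returned when function does not degrade are all assumptions.

```python
import math


def silent_degradation_score(acc_full, acc_comp, ppl_full, ppl_comp):
    """SDS = (relative functional drop) / (relative perplexity rise).

    Hypothetical helper: both changes are measured relative to the
    full-KV-cache baseline (an assumption, not confirmed by the source).
    """
    if acc_full <= 0 or ppl_full <= 0:
        raise ValueError("baseline accuracy and perplexity must be positive")

    func_drop = (acc_full - acc_comp) / acc_full   # > 0 means function got worse
    ppl_rise = (ppl_comp - ppl_full) / ppl_full    # > 0 means perplexity got worse

    if func_drop <= 0:
        # Assumption: no functional damage means nothing is being hidden.
        return 0.0
    if ppl_rise <= 0:
        # Perplexity flat or improving while function breaks:
        # reported as maximally silent.
        return math.inf
    return func_drop / ppl_rise


# Accuracy halves while perplexity rises 50%: graceful degradation.
print(silent_degradation_score(0.8, 0.4, 10.0, 15.0))  # 1.0
# Accuracy halves while perplexity improves: maximally silent.
print(silent_degradation_score(0.8, 0.4, 10.0, 9.5))   # inf
```

Returning infinity rather than a negative ratio keeps the "perplexity improved" case at the top of any ranking by SDS, which matches the stated design intent.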
| Family | Methods | Budget (KV memory retained) |
|---|---|---|
| baseline | full (fp16) | 1.0 |
| quantization | int8 / int4 / int2 KV | bits / 16 |
| token eviction | StreamingLLM, SnapKV | fraction of tokens kept |
…against two objective, eviction-sensitive functional tasks — code execution (pass@1) and tool/JSON calling (valid + correct-function + arg-match) — contrasted with perplexity, all on a shared KV-memory-retained x-axis.
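The three-stage tool-call check (valid, correct function, argument match) might look like the sketch below. It assumes the model emits a JSON object of the form `{"name": ..., "arguments": {...}}`; that schema, the function name `score_tool_call`, and the exact-equality argument comparison are illustrative assumptions rather than kv-canary's documented behavior.

```python
import json


def score_tool_call(raw_output, expected_name, expected_args):
    """Grade one model output in three stages.

    Hypothetical scorer: assumes a {"name": ..., "arguments": {...}} schema.
    Each later stage can only pass if the earlier ones did.
    """
    result = {"valid": False, "correct_function": False, "arg_match": False}

    # Stage 1: is the output parseable JSON with an object at the top level?
    try:
        call = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return result
    if not isinstance(call, dict):
        return result
    result["valid"] = True

    # Stage 2: did the model pick the expected function?
    result["correct_function"] = call.get("name") == expected_name

    # Stage 3: do the arguments match exactly (only counted for the right function)?
    result["arg_match"] = (
        result["correct_function"] and call.get("arguments") == expected_args
    )
    return result


good = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
truncated = '{"name": "get_weather", "arguments": {"city": "Os'
print(score_tool_call(good, "get_weather", {"city": "Oslo"}))
print(score_tool_call(truncated, "get_weather", {"city": "Oslo"}))
```

Keeping the three stages separate makes it possible to see where a compression method fails first, for example output that is still valid JSON but names the wrong function.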
```bash
pip install -e ".[dev,ml]"
python -m kvcanary run configs/smoke.yaml --out results/raw/smoke.jsonl
python scripts/aggregate_and_report.py results/raw/smoke.jsonl
# -> report/divergence.png, report/RESULTS.md
```
Three small, independently tested seams compose into a resumable experiment runner:
KVCompressor (the methods, behind one budget knob),
Backend (HuggingFace model, plus a deterministic fake for model-free tests), and
Task (code-exec sandbox, tool-call validation, perplexity). The runner sweeps
models × compressors × tasks, writes one JSON line per sample, and resumes by skipping
finished cells — so a killed spot-GPU run restarts cheaply.
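The resume behavior described above can be illustrated with a small append-only JSONL loop. This is a sketch under stated assumptions, not the project's runner: the record fields (`model`, `compressor`, `task`, `sample_id`, `score`), the function names, and the choice to track completion per sample rather than per cell are all hypothetical.

```python
import itertools
import json
import os


def load_finished(path):
    """Return the set of (model, compressor, task, sample_id) keys already on disk."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # a killed run can leave a truncated final line
            done.add((rec["model"], rec["compressor"], rec["task"], rec["sample_id"]))
    return done


def run_sweep(models, compressors, tasks, sample_ids, evaluate, out_path):
    """Sweep models x compressors x tasks, skipping work already recorded.

    `evaluate` is a caller-supplied function returning a score for one sample.
    Returns the number of samples evaluated in this invocation.
    """
    done = load_finished(out_path)

    # If a previous run died mid-write, start on a fresh line.
    needs_newline = False
    if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
        with open(out_path, "rb") as fh:
            fh.seek(-1, os.SEEK_END)
            needs_newline = fh.read(1) != b"\n"

    ran = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")
        for model, comp, task, sid in itertools.product(
            models, compressors, tasks, sample_ids
        ):
            if (model, comp, task, sid) in done:
                continue
            rec = {
                "model": model,
                "compressor": comp,
                "task": task,
                "sample_id": sid,
                "score": evaluate(model, comp, task, sid),
            }
            fh.write(json.dumps(rec) + "\n")
            fh.flush()  # keep finished work on disk if the process is killed
            ran += 1
    return ran
```

Calling `run_sweep` a second time with the same arguments and output path evaluates nothing, because every key is already present in the file; adding a new compressor to the list evaluates only the new combinations.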