SemBench

SemBench is a benchmark suite for semantic KV cache reuse. It separates four questions that are often conflated:

Did an exact token/block cache already cover the request?
Did a semantic planner find reusable donor spans?
Did the serving backend actually materialize or reuse donor KV?
Did the routed request preserve time to first token (TTFT) and answer quality while negative controls stayed cold?

The repository is standalone. Generated LongBench manifests and live result artifacts are intentionally excluded from git; review the source dataset license and your sharing policy before publishing generated data.

Capabilities

Deterministic fixture workloads for smoke tests.
LongBench-derived replay manifests with enterprise-style transforms: instruction variants, same evidence with new tasks, RAG reorder, multi-donor composition, fuzzy edits, leading-evidence new-task, and negative controls.
Offline exact-cache and semantic-candidate metrics.
Live SGLang and generic OpenAI-compatible gateway replay.
Backend log/audit parsers for vLLM, SGLang, and TensorRT-LLM.
Result gates for quality, route placement, backend-confirmed reuse, materialized-token counts, and negative-control safety.

Install

python -m pip install -e '.[dev,tokenizer]'

For LongBench ingestion:

python -m pip install -e '.[longbench,tokenizer,dev]'

For live HTTP replay:

python -m pip install -e '.[live,tokenizer,dev]'

Fixture Smoke Test

python -m sembench build \
  --profile fixture \
  --output manifests/fixture.jsonl

python -m sembench run-offline \
  --manifest manifests/fixture.jsonl \
  --output results/fixture-offline.json \
  --block-size 16

LongBench-Derived Replay

python -m sembench build \
  --profile longbench-v1 \
  --datasets qasper multifieldqa_en hotpotqa 2wikimqa musique gov_report qmsum multi_news lcc repobench-p \
  --max-items-per-dataset 10 \
  --transforms instruction_variant same_evidence_new_task rag_reorder multi_donor_composite fuzzy_edit leading_evidence_new_task negative_control \
  --max-segments 4 \
  --min-segment-chars 400 \
  --output manifests/longbench-v1-enterprise-replay.jsonl

python -m sembench run-offline \
  --manifest manifests/longbench-v1-enterprise-replay.jsonl \
  --output results/longbench-v1-offline.json \
  --block-size 16 \
  --tokenizer Qwen/Qwen2.5-7B-Instruct

The replay keeps LongBench as the source corpus while reshaping it into semantic-KV reuse cases: same document with new tasks, reordered retrieval chunks, multi-donor compositions, fuzzy formatting/edit changes, and unrelated negative controls.
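
As a rough illustration of the reordered-retrieval case, the sketch below rotates a list of evidence chunks and counts how many leading chunks still match the donor order. The helper names `rotate_chunks` and `shared_leading_chunks` are hypothetical and are not part of the sembench package; the real `rag_reorder` transform may pick orders differently. The point is only that the evidence set is unchanged while the sequence a prefix-style cache would see diverges from the first chunk.

```python
def rotate_chunks(chunks, shift=1):
    """Return the same chunks in a rotated order.

    Hypothetical stand-in for a reorder transform, not the sembench implementation.
    """
    if not chunks:
        return []
    k = shift % len(chunks)
    return list(chunks[k:]) + list(chunks[:k])


def shared_leading_chunks(donor, recipient):
    """Count how many chunks match position by position before the first difference."""
    count = 0
    for d, r in zip(donor, recipient):
        if d != r:
            break
        count += 1
    return count


donor_chunks = ["chunk-A", "chunk-B", "chunk-C"]
recipient_chunks = rotate_chunks(donor_chunks)

# Same evidence, different order: nothing matches at the head of the prompt.
print(sorted(recipient_chunks) == sorted(donor_chunks))
print(shared_leading_chunks(donor_chunks, recipient_chunks))
```

Under the assumption that an exact cache matches only from the start of the prompt, a recipient like this would get no exact reuse even though every chunk already exists in a donor, which is the kind of gap the semantic metrics are meant to expose.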

Live Engine Replay

For SGLang:

python -m sembench run-live-sglang \
  --manifest manifests/longbench-v1-enterprise-replay.jsonl \
  --output results/longbench-v1-sglang.json \
  --base-url http://localhost:30000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-items 20

For any OpenAI-compatible gateway:

python -m sembench run-live-gateway \
  --manifest manifests/longbench-v1-enterprise-replay.jsonl \
  --output results/longbench-v1-gateway.json \
  --gateway-url http://localhost:8080 \
  --donor-url http://localhost:30000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-items 20

Gateway/router metrics answer whether traffic went through the expected serving path. Backend log/audit summaries answer whether the engine reported semantic KV materialization or reuse.

Backend Audit Summaries

python -m sembench summarize-engine-events \
  --engine vllm \
  --input results/vllm-backend.log \
  --output results/vllm-engine-events.json

python -m sembench summarize-engine-events \
  --engine trtllm \
  --input results/trtllm-audit.jsonl \
  --output results/trtllm-engine-events.json

python -m sembench summarize-engine-events \
  --engine sglang \
  --input results/sglang-backend.log \
  --output results/sglang-engine-events.json

For Kubernetes-hosted engines:

python -m sembench collect-k8s-engine-events \
  --engine sglang \
  --namespace inference \
  --pod sglang-0 \
  --since-time 2026-06-22T18:00:00Z \
  --output-log results/sglang.log \
  --output-summary results/sglang-engine-events.json

Acceptance Gates

python -m sembench assert-result-gates \
  --result results/longbench-v1-gateway.json \
  --engine-summary results/sglang-engine-events.json \
  --min-quality-pass-rate 0.80 \
  --min-backend-confirmed-block-rate 0.05 \
  --min-materialization-events 1 \
  --min-materialized-tokens 512 \
  --max-negative-control-confirmed-rate 0.0 \
  --require-materialized-reuse \
  --require-no-engine-errors
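
The gate logic can be pictured as a set of minimum and maximum thresholds applied to a flat metrics dictionary. The following is a minimal sketch under that assumption; `check_gates` and the metric key names are hypothetical (they mirror the flag names above) and do not describe the actual result schema or the `assert-result-gates` implementation.

```python
def check_gates(metrics, minimums, maximums):
    """Return a list of human-readable gate failures (empty list means all gates pass).

    A missing metric is treated as a failure so that an absent measurement
    cannot silently pass a gate.
    """
    failures = []
    for name, floor in minimums.items():
        value = metrics.get(name)
        if value is None or value < floor:
            failures.append(f"{name}={value} is below minimum {floor}")
    for name, ceiling in maximums.items():
        value = metrics.get(name)
        if value is None or value > ceiling:
            failures.append(f"{name}={value} is above maximum {ceiling}")
    return failures


# Hypothetical metric names and values, chosen only for illustration.
metrics = {
    "quality_pass_rate": 0.9,
    "backend_confirmed_block_rate": 0.08,
    "materialization_events": 3,
    "materialized_tokens": 2048,
    "negative_control_backend_confirmed_rate": 0.0,
}
minimums = {
    "quality_pass_rate": 0.80,
    "backend_confirmed_block_rate": 0.05,
    "materialization_events": 1,
    "materialized_tokens": 512,
}
maximums = {"negative_control_backend_confirmed_rate": 0.0}

failures = check_gates(metrics, minimums, maximums)
print("PASS" if not failures else "\n".join(failures))
```

Setting the negative-control ceiling to 0.0 means any confirmed reuse on an unrelated donor/recipient pair fails the run, which matches the intent of the negative controls staying cold.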

Main Metrics

See docs/METRICS.md for the exact metric contract.

exact_block_hit_rate: full recipient blocks found by exact token hash in the donor pool.
semantic_candidate_block_rate: full recipient blocks proposed for semantic donor reuse.
semantic_eligible_block_rate: candidate blocks that are aligned enough for backend materialization.
backend_confirmed_block_rate: recipient blocks whose reuse was confirmed by a live backend or backend audit stream.
semantic_eligible_lift: semantic_eligible_block_rate minus exact_block_hit_rate.
backend_confirmed_lift: backend_confirmed_block_rate minus exact_block_hit_rate.
semantic_placement_rate_by_request: fraction of replayed requests routed by semantic placement when route metadata is available.
negative_control_backend_confirmed_rate: confirmed reuse on unrelated donor/recipient pairs.
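
To make the exact-cache baseline concrete, here is a minimal sketch of a block-level hit rate: token ids are split into full blocks of `block_size`, each block is hashed, and recipient blocks are looked up in the set of donor block hashes. The function names are hypothetical, and hashing each block independently of its position is an assumption made for brevity; serving engines commonly chain a block's hash with its prefix, and docs/METRICS.md remains the authority on the real contract.

```python
import hashlib


def full_blocks(token_ids, block_size=16):
    """Split token ids into full blocks, dropping any trailing partial block."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [tuple(token_ids[i:i + block_size]) for i in range(0, usable, block_size)]


def block_hash(block):
    """Hash one block of token ids (position-independent, by assumption)."""
    return hashlib.sha256(",".join(map(str, block)).encode("utf-8")).hexdigest()


def exact_block_hit_rate(donor_sequences, recipient_token_ids, block_size=16):
    """Fraction of full recipient blocks whose hash appears in the donor pool."""
    donor_pool = {
        block_hash(block)
        for token_ids in donor_sequences
        for block in full_blocks(token_ids, block_size)
    }
    recipient = full_blocks(recipient_token_ids, block_size)
    if not recipient:
        return 0.0
    hits = sum(block_hash(block) in donor_pool for block in recipient)
    return hits / len(recipient)


# One donor of 64 tokens; the recipient shares its first 32 tokens only.
donor_sequences = [list(range(64))]
recipient_token_ids = list(range(32)) + list(range(100, 132))
rate = exact_block_hit_rate(donor_sequences, recipient_token_ids, block_size=16)
print(rate)  # 2 of 4 full recipient blocks match, so 0.5
```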

Semantic discovery is not counted as confirmed KV reuse unless a live backend or backend audit stream reports materialization or reuse.
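
The two lift metrics are simple differences against the exact-cache baseline, and the rule above implies that the confirmed lift should stay undefined when no backend confirmation exists. A minimal sketch, assuming the rates are already computed; the function name `compute_lifts` and the use of `None` for "not confirmed" are illustrative choices, not the package's behavior.

```python
def compute_lifts(exact_rate, eligible_rate, confirmed_rate=None):
    """Compute lift metrics relative to the exact-cache baseline.

    confirmed_rate is None when no live backend or audit stream reported
    materialization or reuse; in that case no confirmed lift is claimed.
    """
    confirmed_lift = None if confirmed_rate is None else confirmed_rate - exact_rate
    return {
        "semantic_eligible_lift": eligible_rate - exact_rate,
        "backend_confirmed_lift": confirmed_lift,
    }


# Offline run: the planner found eligible blocks, but nothing is confirmed yet.
offline = compute_lifts(exact_rate=0.25, eligible_rate=0.5)

# Live run: the backend confirmed reuse for some of those blocks.
live = compute_lifts(exact_rate=0.25, eligible_rate=0.5, confirmed_rate=0.375)

print(offline)
print(live)
```

Keeping the offline result at `None` instead of copying the eligible lift avoids reporting planner output as if it were confirmed KV reuse.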
