
vLLM

KV cache compression for high-throughput LLM inference

Paper | Blog | X/Twitter


This is a (messy) fork of vLLM v0.6.0 showcasing our new KV cache compression method that increases throughput for memory-constrained LLM deployments.

Current Limitations

We will be expanding the set of supported vLLM features as we upstream this work. The following features are not yet supported:

  • Tensor Parallelism
  • Chunked-prefill
  • Prefix caching
  • FlashInfer and other non-FlashAttention attention backends
  • CUDA graphs

Setup

We recommend running inside the NVIDIA PyTorch container image:

docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3

Install from source:

cd vllm-kvcompress/
pip install -e .

Alternatively, the prebuilt wheel can be used for x86 architectures:

pip install https://pub-ff08b7559526447fb14dd52ec4fac7c7.r2.dev/17da8eb/build/sm_89/vllm-0.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl

Inference

The inference server can be launched with:

export model=meta-llama/Meta-Llama-3.1-8B-Instruct
vllm serve $model --enforce-eager --enable-kvc

Requests can then be sent with:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
    "max_cache_tokens": 128,
    "protected_window_size": 32,
    "compress_once": false
  }'

Compression can be configured per-request by setting the following additional sampling parameters (a Python example follows this list):

  • max_cache_tokens - The maximum number of KVs to retain in cache for this sequence, computed as num_layers * num_kv_heads * max_cache_tokens.
  • protected_window_size - The window of final tokens for this sequence whose KVs are protected during compression.
  • compress_once - If set, the sequence is only compressed during the first compression iteration after its prefill.
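
The same request can be issued from Python. Below is a minimal sketch using the openai client against the OpenAI-compatible server started above, passing the KV-Compress parameters through extra_body exactly as they appear in the JSON body of the curl request; the base URL and placeholder API key are assumptions about your local setup. As a point of reference for the formula above, Meta-Llama-3.1-8B-Instruct has 32 layers and 8 KV heads, so max_cache_tokens=128 corresponds to a budget of 32 * 8 * 128 = 32,768 retained KVs for the sequence.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
    # The KV-Compress parameters are not part of the OpenAI schema, so they
    # are passed through extra_body, mirroring the curl request above.
    extra_body={
        "max_cache_tokens": 128,      # total KV budget = num_layers * num_kv_heads * 128
        "protected_window_size": 32,  # KVs of the last 32 tokens are protected
        "compress_once": False,       # keep compressing during decoding
    },
)
print(completion.choices[0].text)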

Running Experiments

LongBench

cd vllm-kvcompress/experiments/

To run experiments with a limited observation window (KVC-w):

export dataset=narrativeqa model=llama3 w=8 cache_size=128
python run_longbench.py \
  --dataset $dataset \
  --model $model \
  --protected-window-size $w \
  --prefill-metric-collection-window-size $w \
  --max-cache-tokens $cache_size

To run experiments with full query-range aggregation (KVC-full):

python run_longbench.py \
  --dataset $dataset \
  --model $model \
  --protected-window-size 32 \
  --metric-collection-buffer-size 10 \
  --prefill-metric-collection-window-size 33000 \
  --prefill-metric-collection-block-size 1024 \
  --no-maxpool-metrics \
  --gpu-mem-util 0.6 \
  --max-cache-tokens $cache_size

Note: Aggregating over the full query range requires significant memory and should be run on an H100 or comparable GPU to avoid OOMs. Lowering gpu-mem-util leaves more GPU memory free for the aggregation, and lowering prefill-metric-collection-block-size reduces the memory required by the aggregation, at the expense of longer execution time.

Experiments can be run with continual compression (compressing during decoding as well as during prefill) by adding the --continual-compression flag. To reproduce the results in the paper, --compression-rate can be used to limit cache size instead of --max-cache-tokens:

export cr=64
python run_longbench.py \
  --dataset $dataset \
  --model $model \
  --protected-window-size $w \
  --prefill-metric-collection-window-size $w \
  --continual-compression \
  --compression-rate $cr

Run scripts used for our experiments can be found in experiments/scripts.

Benchmark Throughput

cd vllm-kvcompress/

Run vLLM's benchmarking script with:

export model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  max_model_len=19000 input_len=6000 cr=64
python3 benchmarks/benchmark_throughput.py \
  --model $model \
  --max-model-len $max_model_len \
  --enforce-eager \
  --num-prompts 256 \
  --input-len $input_len \
  --output-len 500 \
  --protected-window-size 32 \
  --compression-rate $cr \
  --enable-kvc

Run scripts used for our experiments can be found in benchmarks/scripts.

Citation

If you use this work in your own research or projects, please cite our paper:

@misc{rehg2024kvcompresspagedkvcachecompression,
      title={KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head},
      author={Isaac Rehg},
      year={2024},
      eprint={2410.00161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00161},
}