ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

training-free, high-throughput long-context LLM inference

Carnegie Mellon University | ByteDance
[Paper] | [Blog]

ShadowKV Framework

Environment Setup

To reproduce the results in the paper, set up the environment as follows on a single A100 GPU:

# create env
conda create -n ShadowKV python=3.10 -y
conda activate ShadowKV

# install packages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# nemo dependencies (for dataset building)
pip install wheel
pip install Cython
pip install youtokentome
pip install nemo_toolkit[all]==1.23

# flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# cutlass
mkdir 3rdparty
git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass

# build kernels for ShadowKV
python setup.py build_ext --inplace
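
Before running any evaluation, a quick sanity check of the GPU stack can save time; the lines below are only an optional sketch (not part of the official setup) and assume the environment above installed cleanly.

# optional sanity check: confirm CUDA is visible and the key packages import
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn, flashinfer; print('flash-attn and flashinfer import OK')"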

Supported Models

Currently, we support multiple long-context LLMs; the examples below use gradientai/Llama-3-8B-Instruct-Gradient-1048k (Llama-3-8B-1M) and meta-llama/Meta-Llama-3.1-8B-Instruct.

Accuracy Evaluations

Here we provide an example of building the dataset and running the evaluation for the RULER benchmark with Llama-3-8B-1M.

Build Datasets

To build the RULER dataset, please run the following commands:

# build RULER
python -c "import nltk; nltk.download('punkt')"
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"

Run Evaluations

For the accuracy evaluation, please run the following commands on 8x A100 GPUs:

# Full attention
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

# ShadowKV
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8
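
Before committing to the full 8-GPU run, a smaller single-GPU smoke test can be handy. The sketch below simply reuses the flags shown above with one process and a single sub-task; whether test/eval_acc.py is intended to run this way is an assumption, not something documented here.

# single-GPU smoke test (sketch; assumes test/eval_acc.py also works with --nproc_per_node 1)
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 1 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1" --sparse_budget 2048 --rank 160 --chunk_size 8 --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"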

Compatibility with MInference

ShadowKV is compatible with pre-filling acceleration techniques, such as MInference. To enable MInference, please add the --minference flag to the command. For example:

# Full attention with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --minference

# ShadowKV with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8 --minference

Efficiency Evaluations

For the efficiency evaluation, please run the following command with a single A100 GPU:

python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"
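
To profile additional context lengths, the same script can presumably be called with other --datalen values in the same "NNNk" format; the lengths in the loop below are illustrative assumptions rather than values documented in this README.

# sweep a few context lengths (sketch; the accepted length strings are an assumption)
for LEN in 60k 122k; do
    python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "$LEN"
done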

Citation

If you find ShadowKV useful or relevant to your project and research, please kindly cite our paper:

@article{sun2024shadowkv,
  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.21465},
  year={2024}
}