ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

training-free, high-throughput long-context LLM inference

Carnegie Mellon University | ByteDance
[Paper] | [Blog]

ShadowKV Framework

Environment Setup

To reproduce the results in the paper, set up the environment as follows on a single A100 GPU:

# create env
conda create -n ShadowKV python=3.10 -y
conda activate ShadowKV

# install packages
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# nemo dependencies (for dataset building)
pip install wheel
pip install Cython
pip install youtokentome
pip install nemo_toolkit[all]==1.23

# flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# cutlass
mkdir 3rdparty
git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass

# build kernels for ShadowKV
python setup.py build_ext --inplace
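
Before running any evaluation, a quick sanity check of the GPU stack can save time; the lines below are only an optional sketch (not part of the official setup) and assume the environment above installed cleanly.

# optional sanity check: confirm CUDA is visible and the key packages import
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn, flashinfer; print('flash-attn and flashinfer import OK')"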

Supported Models

Currently, we support multiple long-context LLMs; the examples below use gradientai/Llama-3-8B-Instruct-Gradient-1048k (Llama-3-8B-1M) and meta-llama/Meta-Llama-3.1-8B-Instruct.

Accuracy Evaluations

Here we provide an example of building the dataset and running the evaluation for the RULER benchmark with Llama-3-8B-1M.

Build Datasets

To build the RULER dataset, please run the following commands:

# build RULER
python -c "import nltk; nltk.download('punkt')"
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"

Run Evaluations

For the accuracy evaluation, please run the following commands on 8x A100 GPUs:

# Full attention
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

# ShadowKV
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8
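
Before committing to the full 8-GPU run, a smaller single-GPU smoke test can be handy. The sketch below simply reuses the flags shown above with one process and a single sub-task; whether test/eval_acc.py is intended to run this way is an assumption, not something documented here.

# single-GPU smoke test (sketch; assumes test/eval_acc.py also works with --nproc_per_node 1)
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 1 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1" --sparse_budget 2048 --rank 160 --chunk_size 8 --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"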

Compatibility with MInference

ShadowKV is compatible with pre-filling acceleration techniques, such as MInference. To enable MInference, please add the --minference flag to the command. For example:

# Full attention with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --minference

# ShadowKV with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8 --minference

Efficiency Evaluations

For the efficiency evaluation, please run the following command with a single A100 GPU:

python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"
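
To profile additional context lengths, the same script can presumably be called with other --datalen values in the same "NNNk" format; the lengths in the loop below are illustrative assumptions rather than values documented in this README.

# sweep a few context lengths (sketch; the accepted length strings are an assumption)
for LEN in 60k 122k; do
    python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "$LEN"
done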

Citation

If you find ShadowKV useful or relevant to your project and research, please kindly cite our paper:

@article{sun2024shadowkv,
  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.21465},
  year={2024}
}