[Doc] Update document for RWKV (#293)
Hzfengsy authored Jun 2, 2023
1 parent d3c6053 commit c856439
Showing 6 changed files with 128 additions and 10 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -39,7 +39,7 @@ MLC LLM offers a repeatable, systematic, and customizable workflow that empowers

## How does MLC Enable Universal Native Deployment?

The cornerstone of our solution is machine learning compilation ([MLC](https://mlc.ai/)), which we leverage to efficiently deploy AI models. We build on the shoulders of open-source ecosystems, including tokenizers from Hugging Face and Google, as well as open-source LLMs like Llama, Vicuna, Dolly, MOSS and more. Our primary workflow is based on [Apache TVM Unity](https://github.com/apache/tvm/tree/unity), an exciting ongoing development in the Apache TVM Community.
The cornerstone of our solution is machine learning compilation ([MLC](https://mlc.ai/)), which we leverage to efficiently deploy AI models. We build on the shoulders of open-source ecosystems, including tokenizers from Hugging Face and Google, as well as open-source LLMs like Llama, Vicuna, Dolly, MOSS, RWKV and more. Our primary workflow is based on [Apache TVM Unity](https://github.com/apache/tvm/tree/unity), an exciting ongoing development in the Apache TVM Community.

- Dynamic shape: We bake a language model as a TVM IRModule with native dynamic shape support, avoiding the need for extra padding to the maximum length and reducing both computation amount and memory usage.
- Composable ML compilation optimizations: we perform many model deployment optimizations, such as compilation code transformation, fusion, memory planning, library offloading, and manual code optimization; these can be easily incorporated as TVM IRModule transformations exposed as Python APIs, as sketched below.
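
A minimal sketch of this composability, assuming the TVM Unity `relax` Python API; the toy model and the particular passes are illustrative, not MLC LLM's actual compilation pipeline:

```python
# Minimal sketch of composable IRModule transformations (assumes the
# TVM Unity relax API; toy model and pass selection are illustrative).
import tvm
from tvm import relax
from tvm.script import relax as R

@tvm.script.ir_module
class ToyModel:
    @R.function
    def main(x: R.Tensor(("n", 16), "float32")) -> R.Tensor(("n", 16), "float32"):
        # "n" is a symbolic (dynamic) sequence dimension: no padding needed.
        with R.dataflow():
            y = R.add(x, x)
            R.output(y)
        return y

# Optimizations compose as ordinary IRModule -> IRModule passes.
seq = tvm.ir.transform.Sequential([
    relax.transform.LegalizeOps(),  # lower high-level ops to loop-level TIR
    relax.transform.FuseOps(),      # fuse adjacent operators
])
mod = seq(ToyModel)
print(mod)  # transformed IRModule, ready for further passes or codegen
```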
@@ -112,4 +112,4 @@ walkthrough of our approaches.

This project is initiated by members from CMU catalyst, UW SAMPL, SJTU, OctoML and the MLC community. We would love to continue developing and supporting the open-source ML community.

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. We thank the open-source ML community members who made these models publicly available, and the PyTorch and Hugging Face communities that make them accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, Alpaca and MOSS. We would also like to thank the Vulkan, Swift, C++, Python, and Rust communities that enable this project.
This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. We thank the open-source ML community members who made these models publicly available, and the PyTorch and Hugging Face communities that make them accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, Alpaca, MOSS and RWKV. We would also like to thank the Vulkan, Swift, C++, Python, and Rust communities that enable this project.
26 changes: 24 additions & 2 deletions docs/model-zoo.rst
@@ -41,14 +41,32 @@ Below is a list of off-the-shelf prebuilt models compiled by the MLC-LLM community.
- `RedPajama <https://www.together.xyz/blog/redpajama>`__
- * Weight storage data type: int4
* Running data type: float32
* Symmetric quantization
- `link <https://huggingface.co/mlc-ai/mlc-chat-RedPajama-INCITE-Chat-3B-v1-q4f32_0>`__
* - `RedPajama-INCITE-Chat-3B-v1-q4f16_0`
- `RedPajama <https://www.together.xyz/blog/redpajama>`__
- * Weight storage data type: int4
* Running data type: float16
* Symmetric quantization
- `link <https://huggingface.co/mlc-ai/mlc-chat-RedPajama-INCITE-Chat-3B-v1-q4f16_0>`__
* - `rwkv-raven-1b5-q8f16_0`
- `RWKV <https://github.com/BlinkDL/RWKV-LM>`__
- * Weight storage data type: uint8
* Running data type: float16
* Symmetric quantization
- `link <https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-1b5-q8f16_0>`__
* - `rwkv-raven-3b-q8f16_0`
- `RWKV <https://github.com/BlinkDL/RWKV-LM>`__
- * Weight storage data type: uint8
* Running data type: float16
* Symmetric quantization
- `link <https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-3b-q8f16_0>`__
* - `rwkv-raven-7b-q8f16_0`
- `RWKV <https://github.com/BlinkDL/RWKV-LM>`__
- * Weight storage data type: uint8
* Running data type: float16
* Symmetric quantization
- `link <https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-7b-q8f16_0>`__
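
The quantization codes above follow the pattern ``q<weight-bits><runtime-type>_<version>``; for example, ``q8f16_0`` stores weights in 8 bits (as uint8) and computes in float16 with a symmetric scheme. As a reading aid, below is a small NumPy sketch of symmetric 8-bit quantization; it is illustrative only, not MLC-LLM's exact packing scheme.

.. code:: python

   # Illustrative sketch of symmetric 8-bit quantization (not MLC-LLM's
   # exact packing): weights stored as uint8, dequantized to float16.
   import numpy as np

   def quantize_sym(w, bits=8):
       qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
       scale = np.abs(w).max() / qmax              # one scale; zero-point is 0
       q = np.round(w / scale).clip(-qmax, qmax)   # signed integer grid
       return (q + qmax).astype(np.uint8), scale   # shift into uint8 storage

   def dequantize_sym(stored, scale, qmax=127):
       return ((stored.astype(np.int32) - qmax) * scale).astype(np.float16)

   w = np.random.randn(8).astype(np.float32)
   stored, scale = quantize_sym(w)
   print(dequantize_sym(stored, scale))            # approximately w, in float16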

You can check `MLC-LLM pull requests <https://github.com/mlc-ai/mlc-llm/pulls?q=is%3Aopen+is%3Apr+label%3Anew-models>`__ to track the ongoing efforts of new models. We encourage users to upload their compiled models to Hugging Face and share them with the community.

@@ -84,6 +102,10 @@ MLC-LLM supports the following model architectures:
- `GPT-J <https://github.com/kingoflolz/mesh-transformer-jax>`__
- `Relax Code <https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/gptj.py>`__
- * `MOSS <https://github.com/OpenLMLab/MOSS>`__
* - ``rwkv``
- `RWKV <https://github.com/BlinkDL/RWKV-LM>`__
- `Relax Code <https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/rwkv.py>`__
- * `RWKV-raven <https://github.com/BlinkDL/RWKV-LM>`__

For models within these model architectures, you can check the :doc:`/tutorials/compile-models` on how to compile models. Please create a new issue if you want to request a new model architecture. Our tutorial :doc:`/tutorials/bring-your-own-models` introduces how to bring a new model architecture to MLC-LLM.

73 changes: 73 additions & 0 deletions docs/tutorials/compile-models.rst
@@ -192,6 +192,79 @@ In the command above, ``--model`` specifies the name of the model to build, ``--
python3 build.py --model RedPajama-INCITE-Chat-3B-v1 --target android --max-seq-len 768 --quantization q4f16_0
.. tab:: rwkv-raven-1b5/3b/7b

.. tabs::

.. tab:: Target: CUDA

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target cuda --quantization q8f16_0
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target cuda --quantization q8f16_0
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target cuda --quantization q8f16_0

.. tab:: Metal

On an Apple Silicon powered Mac, build for Apple Silicon:

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target metal --quantization q8f16_0
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target metal --quantization q8f16_0
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target metal --quantization q8f16_0

On an Apple Silicon powered Mac, build for an x86 Mac:

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target metal_x86_64 --quantization q8f16_0
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target metal_x86_64 --quantization q8f16_0
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target metal_x86_64 --quantization q8f16_0

.. tab:: Vulkan

On Linux, build for Linux:

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target vulkan --quantization q8f16_0
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target vulkan --quantization q8f16_0
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target vulkan --quantization q8f16_0

On Linux, build for Windows:

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target vulkan --quantization q8f16_0 --llvm-mingw path/to/llvm-mingw
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target vulkan --quantization q8f16_0 --llvm-mingw path/to/llvm-mingw
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target vulkan --quantization q8f16_0 --llvm-mingw path/to/llvm-mingw

.. tab:: iPhone/iPad

.. code:: shell

   # For 1.5B model
   python3 build.py --hf-path=RWKV/rwkv-raven-1b5 --target iphone --quantization q8f16_0
   # For 3B model
   python3 build.py --hf-path=RWKV/rwkv-raven-3b --target iphone --quantization q8f16_0
   # For 7B model
   python3 build.py --hf-path=RWKV/rwkv-raven-7b --target iphone --quantization q8f16_0

.. tab:: Other models

.. tabs::
21 changes: 16 additions & 5 deletions docs/tutorials/deploy-models.rst
@@ -140,7 +140,7 @@ This section introduces how to prepare and upload the model you built.
.. note::
Before proceeding, you should first have the model built manually.
At this moment, the iOS/Android/web apps released by MLC LLM only support **specific model architectures with specific quantization modes**. Particularly,

- the :ref:`released iOS/iPadOS app <iPhone-download-app>` supports models structured by LLaMA-7B and quantized by ``q3f16_0``, and models structured by GPT-NeoX-3B and quantized by ``q4f16_0``.
- the :ref:`released Android app <Android-download-app>` supports models structured by LLaMA-7B and quantized by ``q4f16_0``.
- the `Web LLM demo page <https://mlc.ai/web-llm/>`_ supports models structured by LLaMA-7B and quantized by ``q4f32_0``, and models structured by GPT-NeoX-3B and quantized by both ``q4f16_0`` and ``q4f32_0``.
@@ -168,7 +168,7 @@ Opening that file, the ``model_lib`` field specifies the model library name we u
"model_lib": "vicuna-v1-7b-q3f16_0",
...
}
.. tab:: GPT-NeoX-3B

The model is expected to be quantized by ``q4f16_0``:
@@ -179,7 +179,18 @@ Opening that file, the ``model_lib`` field specifies the model library name we u
"model_lib": "RedPajama-INCITE-Chat-3B-v1-q4f16_0",
...
}
.. tab:: RWKV

The model is expected to be quantized by ``q8f16_0``:

.. code::

   {
       "model_lib": "rwkv-raven-1b5-q8f16_0",
       ...
   }

.. tab:: Android

.. tabs::
@@ -194,7 +205,7 @@ Opening that file, the ``model_lib`` field specifies the model library name we u
"model_lib": "vicuna-v1-7b-q4f16_0",
...
}
.. tab:: Web

.. tabs::
@@ -209,7 +220,7 @@ Opening that file, the ``model_lib`` field specifies the model library name we u
"model_lib": "vicuna-v1-7b-q4f32_0",
...
}
.. tab:: GPT-NeoX-3B

If the model is quantized by ``q4f16_0``:
3 changes: 3 additions & 0 deletions ios/prepare_params.sh
@@ -8,6 +8,9 @@ mkdir -p dist
declare -a builtin_list=(
"RedPajama-INCITE-Chat-3B-v1-q4f16_0"
# "vicuna-v1-7b-q3f16_0"
# "rwkv-raven-1b5-q8f16_0"
# "rwkv-raven-3b-q8f16_0"
# "rwkv-raven-7b-q8f16_0"
)

for model in "${builtin_list[@]}"
11 changes: 10 additions & 1 deletion site/index.md
@@ -54,6 +54,7 @@ please install the latest [Vulkan driver](https://developer.nvidia.com/vulkan-dr
Vulkan driver, as the CUDA driver may not work well.

After installing all the dependencies, just follow the instructions below to install the CLI app:

```shell
# Create a new conda environment and activate the environment.
conda create -n mlc-chat
@@ -84,13 +85,20 @@ cd dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-RedPajama-INCITE-Chat-3B-v1-q4f16_0
cd ../..
mlc_chat_cli --local-id RedPajama-INCITE-Chat-3B-v1-q4f16_0

# Download prebuilt weights of RWKV-raven-1.5B/3B/7B
cd dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-1b5-q8f16_0
# or git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-3b-q8f16_0
# or git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-7b-q8f16_0
cd ../..
mlc_chat_cli --local-id rwkv-raven-1b5-q8f16_0  # Replace the local id if you use the 3B or 7B model.
```

<p align="center">
<img src="gif/linux-demo.gif" width="80%">
</p>


### Web Browser

Please check out [WebLLM](https://mlc.ai/web-llm/), our companion project that deploys models natively to browsers. Everything here runs inside the browser with no server support and is accelerated with WebGPU.
@@ -104,4 +112,5 @@ Please check out [WebLLM](https://mlc.ai/web-llm/), our companion project that d
walkthrough of our approaches.

## Disclaimer

The pre-packaged demos are for research purposes only, subject to the model License.
