[Docs] Update project website (mlc-ai#2175)
This PR mainly updates the project website, and also makes some minor updates
to other docs.
MasterJH5574 authored Apr 19, 2024
1 parent 855f9a2 commit f87745d
Showing 4 changed files with 243 additions and 147 deletions.
98 changes: 5 additions & 93 deletions README.md
@@ -50,22 +50,7 @@
</table>


**Scalable.** MLC LLM scales universally across NVIDIA and AMD GPUs, on both cloud and gaming GPUs. Below
we showcase our single-batch decoding performance with prefill length 1 and decode length 256.

Performance of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 and two AMD Radeon 7900 XTX:
<p float="left">
<img src="site/img/multi-gpu/figure-1.svg" width="40%"/>
<img src="site/img/multi-gpu/figure-3.svg" width="30%"/>
</p>

Scaling of fp16 and 4-bit CodeLlama-34B and Llama2-70B on A100-80G-PCIe and A10G-24G-PCIe, up to 8 GPUs:
<p float="center">
<img src="site/img/multi-gpu/figure-2.svg" width="100%"/>
</p>


## Getting Started
## Quick Start

Here we introduce quick start examples of using MLC LLM via the chat CLI, Python API and REST server.
We use the 4-bit quantized 8B Llama-3 model for demonstration purposes.
@@ -140,30 +125,17 @@ print("\n")
engine.terminate()
```

**We design the Python API `mlc_llm.LLMEngine` to align with OpenAI API**,
which means you can use LLMEngine in the same way of using
**The Python API of `mlc_llm.LLMEngine` fully aligns with the OpenAI API**.
You can use LLMEngine in the same way as using
[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
for both synchronous and asynchronous generation.

In this code example, we use the synchronous chat completion interface and iterate over
all the stream responses.
If you want to run without streaming, you can run

```python
response = engine.chat.completions.create(
messages=[{"role": "user", "content": "What is the meaning of life?"}],
model=model,
stream=False,
)
print(response)
```
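
For comparison, a minimal sketch of the synchronous streaming loop, assuming the same `engine` and `model` as in the quick start example above (response chunks carry incremental deltas in the OpenAI format):

```python
# Streaming sketch (assumes `engine` and `model` are defined as above).
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        # Each streamed chunk carries a delta with the newly generated text.
        print(choice.delta.content, end="", flush=True)
print("\n")
```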

You can also try different arguments supported in [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create).
If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncLLMEngine` instead.
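
A minimal sketch of concurrent asynchronous generation, assuming `mlc_llm.AsyncLLMEngine` mirrors the OpenAI-style interface shown above (constructor arguments beyond the model may differ):

```python
import asyncio

from mlc_llm import AsyncLLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def ask(engine, prompt):
    # Non-streaming request: awaiting the call returns the full completion.
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=False,
    )
    return response.choices[0].message.content


async def main():
    engine = AsyncLLMEngine(model)
    prompts = ["What is the meaning of life?", "Write a haiku about GPUs."]
    # Concurrent requests are batched by the engine for better throughput.
    answers = await asyncio.gather(*(ask(engine, p) for p in prompts))
    for answer in answers:
        print(answer)
    engine.terminate()


asyncio.run(main())
```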

### REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests.
The server is fully compatible with the OpenAI API.

```bash
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
@@ -186,66 +158,6 @@ curl -X POST \
http://127.0.0.1:8000/v1/chat/completions
```

The server will process this request and send back the response.
Similar to the [Python API](#python-api), you can pass the argument ``"stream": true``
to request streaming responses.
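
For illustration, a minimal Python sketch of a streaming request against the server launched above, using the third-party `requests` package. It assumes the standard OpenAI-compatible server-sent-event format, and the `model` field should match the model passed to `mlc_llm serve`:

```python
import json

import requests  # third-party HTTP client, used here for illustration

payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    "stream": True,
}
with requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True
) as response:
    for line in response.iter_lines():
        # OpenAI-compatible streaming sends lines of the form "data: {...}",
        # terminated by "data: [DONE]".
        if not line:
            continue
        chunk = line.decode("utf-8").removeprefix("data: ")
        if chunk.strip() == "[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"].get("content", "")
        print(delta or "", end="", flush=True)
print()
```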

## Model Support

MLC LLM supports a wide range of model architectures and variants. We provide the following prebuilt models, which you can
use off the shelf. Visit [Prebuilt Models](https://llm.mlc.ai/docs/prebuilt_models.html) for the full list, and [Compile Models via MLC](https://llm.mlc.ai/docs/compilation/compile_models.html) to learn how to use models not on this list.

<table style="width:100%">
<thead>
<tr>
<th style="width:40%">Architecture</th>
<th style="width:60%">Prebuilt Model Variants</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama</td>
<td>Llama-3, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored</td>
</tr>
<tr>
<td>GPT-NeoX</td>
<td>RedPajama</td>
</tr>
<tr>
<td>GPT-J</td>
<td></td>
</tr>
<tr>
<td>RWKV</td>
<td>RWKV-raven</td>
</tr>
<tr>
<td>MiniGPT</td>
<td></td>
</tr>
<tr>
<td>GPTBigCode</td>
<td>WizardCoder</td>
</tr>
<tr>
<td>ChatGLM</td>
<td></td>
</tr>
<tr>
<td>StableLM</td>
<td></td>
</tr>
<tr>
<td>Mistral</td>
<td></td>
</tr>
<tr>
<td>Phi</td>
<td></td>
</tr>
</tbody>
</table>

## Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include
@@ -273,7 +185,7 @@ The underlying techniques of MLC LLM include:

<details>
<summary>References (Click to expand)</summary>

```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
6 changes: 6 additions & 0 deletions docs/deploy/python_engine.rst
@@ -38,6 +38,8 @@ Run LLMEngine
-------------

:class:`mlc_llm.LLMEngine` provides the interface of OpenAI chat completion synchronously.
Due to its synchronous design, :class:`mlc_llm.LLMEngine` does not batch concurrent requests;
please use :ref:`AsyncLLMEngine <python-engine-async-llm-engine>` for batched request processing.

**Stream Response.** In :ref:`quick-start` and :ref:`introduction-to-mlc-llm`,
we introduced the basic use of :class:`mlc_llm.LLMEngine`.
@@ -86,11 +88,14 @@ and `OpenAI chat completion API <https://platform.openai.com/docs/api-reference/
for the complete chat completion interface.


.. _python-engine-async-llm-engine:

Run AsyncLLMEngine
------------------

:class:`mlc_llm.AsyncLLMEngine` provides the interface of OpenAI chat completion with
asynchronous features.
**We recommend using** :class:`mlc_llm.AsyncLLMEngine` **to batch concurrent requests for better throughput.**

**Stream Response.** The core use of :class:`mlc_llm.AsyncLLMEngine` for stream responses is as follows.

@@ -188,6 +193,7 @@ In short,
- mode ``"interactive"`` uses 1 as the request concurrency and low KV cache capacity, which is designed for **interactive use cases** such as chats and conversations.
- mode ``"server"`` uses as much request concurrency and KV cache capacity as possible. This mode aims to **fully utilize the GPU memory for large server scenarios** where concurrent requests may be many.

**For system benchmarking, please select mode** ``"server"``.
Please refer to :ref:`python-engine-api-reference` for detailed documentation of the engine mode.
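
For illustration, a minimal sketch of selecting the engine mode when constructing the engine (the exact constructor signature is documented in :ref:`python-engine-api-reference`):

.. code:: python

   from mlc_llm import LLMEngine

   # Illustrative sketch: pick the "server" mode for benchmarking or
   # high-concurrency serving; "local" and "interactive" follow the same pattern.
   engine = LLMEngine(
       "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
       mode="server",
   )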


1 change: 0 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
@@ -46,7 +46,6 @@ Check out :ref:`introduction-to-mlc-llm` for the introduction and tutorial of a
compilation/convert_weights.rst
compilation/compile_models.rst
compilation/define_new_models.rst
compilation/configure_quantization.rst

.. toctree::
:maxdepth: 1
