[Docs] Update project website (mlc-ai#2175)
This PR mainly updates the project website, and also makes some minor updates
to other docs.
MasterJH5574 authored Apr 19, 2024
1 parent 855f9a2 commit f87745d
Showing 4 changed files with 243 additions and 147 deletions.
98 changes: 5 additions & 93 deletions README.md
@@ -50,22 +50,7 @@
</table>


**Scalable.** MLC LLM scales universally across NVIDIA and AMD GPUs, on both cloud and gaming GPUs. Below
we showcase our single-batch decoding performance with prefill length 1 and decode length 256.

Performance of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 and two AMD Radeon 7900 XTX:
<p float="left">
<img src="site/img/multi-gpu/figure-1.svg" width="40%"/>
<img src="site/img/multi-gpu/figure-3.svg" width="30%"/>
</p>

Scaling of fp16 and 4-bit CodeLlama-34B and Llama2-70B on A100-80G-PCIe and A10G-24G-PCIe, up to 8 GPUs:
<p float="center">
<img src="site/img/multi-gpu/figure-2.svg" width="100%"/>
</p>


## Getting Started
## Quick Start

Here we introduce quick start examples of using MLC LLM via the chat CLI, Python API and REST server.
We use the 4-bit quantized 8B Llama-3 model for demonstration purposes.
@@ -140,30 +125,17 @@ print("\n")
engine.terminate()
```

**We design the Python API `mlc_llm.LLMEngine` to align with OpenAI API**,
which means you can use LLMEngine in the same way of using
**The Python API of `mlc_llm.LLMEngine` fully aligns with the OpenAI API**.
You can use LLMEngine in the same way as using
[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
for both synchronous and asynchronous generation.

In this code example, we use the synchronous chat completion interface and iterate over
all the stream responses.
If you want to run without streaming, you can run

```python
response = engine.chat.completions.create(
messages=[{"role": "user", "content": "What is the meaning of life?"}],
model=model,
stream=False,
)
print(response)
```
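
For comparison, a minimal sketch of the synchronous streaming loop, assuming the same `engine` and `model` as in the quick start example above (response chunks carry incremental deltas in the OpenAI format):

```python
# Streaming sketch (assumes `engine` and `model` are defined as above).
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        # Each streamed chunk carries a delta with the newly generated text.
        print(choice.delta.content, end="", flush=True)
print("\n")
```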

You can also try different arguments supported in [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create).
If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncLLMEngine` instead.
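
A minimal sketch of concurrent asynchronous generation, assuming `mlc_llm.AsyncLLMEngine` mirrors the OpenAI-style interface shown above (constructor arguments beyond the model may differ):

```python
import asyncio

from mlc_llm import AsyncLLMEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


async def ask(engine, prompt):
    # Non-streaming request: awaiting the call returns the full completion.
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=False,
    )
    return response.choices[0].message.content


async def main():
    engine = AsyncLLMEngine(model)
    prompts = ["What is the meaning of life?", "Write a haiku about GPUs."]
    # Concurrent requests are batched by the engine for better throughput.
    answers = await asyncio.gather(*(ask(engine, p) for p in prompts))
    for answer in answers:
        print(answer)
    engine.terminate()


asyncio.run(main())
```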

### REST Server

We can launch a REST server to serve the 4-bit quantized Llama-3 model for OpenAI chat completion requests.
The server is fully compatible with the OpenAI API.

```bash
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
@@ -186,66 +158,6 @@ curl -X POST \
http://127.0.0.1:8000/v1/chat/completions
```

The server will process this request and send back the response.
Similar to the [Python API](#python-api), you can pass the argument ``"stream": true``
to request streaming responses.
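
For illustration, a minimal Python sketch of a streaming request against the server launched above, using the third-party `requests` package. It assumes the standard OpenAI-compatible server-sent-event format, and the `model` field should match the model passed to `mlc_llm serve`:

```python
import json

import requests  # third-party HTTP client, used here for illustration

payload = {
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    "stream": True,
}
with requests.post(
    "http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True
) as response:
    for line in response.iter_lines():
        # OpenAI-compatible streaming sends lines of the form "data: {...}",
        # terminated by "data: [DONE]".
        if not line:
            continue
        chunk = line.decode("utf-8").removeprefix("data: ")
        if chunk.strip() == "[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"].get("content", "")
        print(delta or "", end="", flush=True)
print()
```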

## Model Support

MLC LLM supports a wide range of model architectures and variants. We provide the following prebuilt models, which you can
use off the shelf. Visit [Prebuilt Models](https://llm.mlc.ai/docs/prebuilt_models.html) for the full list, and [Compile Models via MLC](https://llm.mlc.ai/docs/compilation/compile_models.html) to learn how to use models not on this list.

<table style="width:100%">
<thead>
<tr>
<th style="width:40%">Architecture</th>
<th style="width:60%">Prebuilt Model Variants</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama</td>
<td>Llama-3, Code Llama, Vicuna, WizardLM, WizardMath, OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored</td>
</tr>
<tr>
<td>GPT-NeoX</td>
<td>RedPajama</td>
</tr>
<tr>
<td>GPT-J</td>
<td></td>
</tr>
<tr>
<td>RWKV</td>
<td>RWKV-raven</td>
</tr>
<tr>
<td>MiniGPT</td>
<td></td>
</tr>
<tr>
<td>GPTBigCode</td>
<td>WizardCoder</td>
</tr>
<tr>
<td>ChatGLM</td>
<td></td>
</tr>
<tr>
<td>StableLM</td>
<td></td>
</tr>
<tr>
<td>Mistral</td>
<td></td>
</tr>
<tr>
<td>Phi</td>
<td></td>
</tr>
</tbody>
</table>

## Universal Deployment APIs

MLC LLM provides multiple sets of APIs across platforms and environments. These include
@@ -273,7 +185,7 @@ The underlying techniques of MLC LLM include:

<details>
<summary>References (Click to expand)</summary>

```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
6 changes: 6 additions & 0 deletions docs/deploy/python_engine.rst
@@ -38,6 +38,8 @@ Run LLMEngine
-------------

:class:`mlc_llm.LLMEngine` provides the interface of OpenAI chat completion synchronously.
Due to its synchronous design, :class:`mlc_llm.LLMEngine` does not batch concurrent requests;
please use :ref:`AsyncLLMEngine <python-engine-async-llm-engine>` for batched request processing.

**Stream Response.** In :ref:`quick-start` and :ref:`introduction-to-mlc-llm`,
we introduced the basic use of :class:`mlc_llm.LLMEngine`.
@@ -86,11 +88,14 @@ and `OpenAI chat completion API <https://platform.openai.com/docs/api-reference/
for the complete chat completion interface.


.. _python-engine-async-llm-engine:

Run AsyncLLMEngine
------------------

:class:`mlc_llm.AsyncLLMEngine` provides the interface of OpenAI chat completion with
asynchronous features.
**We recommend using** :class:`mlc_llm.AsyncLLMEngine` **to batch concurrent requests for better throughput.**

**Stream Response.** The core use of :class:`mlc_llm.AsyncLLMEngine` for stream responses is as follows.

@@ -188,6 +193,7 @@ In short,
- mode ``"interactive"`` uses 1 as the request concurrency and low KV cache capacity, which is designed for **interactive use cases** such as chats and conversations.
- mode ``"server"`` uses as much request concurrency and KV cache capacity as possible. This mode aims to **fully utilize the GPU memory for large server scenarios** where concurrent requests may be many.

**For system benchmarking, please select mode** ``"server"``.
Please refer to :ref:`python-engine-api-reference` for detailed documentation of the engine mode.
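
For illustration, a minimal sketch of selecting the engine mode when constructing the engine (the exact constructor signature is documented in :ref:`python-engine-api-reference`):

.. code:: python

   from mlc_llm import LLMEngine

   # Illustrative sketch: pick the "server" mode for benchmarking or
   # high-concurrency serving; "local" and "interactive" follow the same pattern.
   engine = LLMEngine(
       "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
       mode="server",
   )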


1 change: 0 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
@@ -46,7 +46,6 @@ Check out :ref:`introduction-to-mlc-llm` for the introduction and tutorial of a
compilation/convert_weights.rst
compilation/compile_models.rst
compilation/define_new_models.rst
compilation/configure_quantization.rst

.. toctree::
:maxdepth: 1
