Provide TensorRT-LLM and NVIDIA Triton Inference Server with an OpenAI-compatible API. This allows you to integrate with langchain or any other client that speaks the OpenAI API.
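For example, once the server is running (see the steps below), a langchain client can be pointed at it. This is a minimal sketch only: the /v1 base path, the dummy API key, and the model name "ensemble" are assumptions, so substitute whatever model name your Triton deployment exposes.

from langchain_openai import ChatOpenAI

# Point a langchain chat model at the OpenAI-compatible endpoint served by
# openai_trtllm. The /v1 base path, dummy API key and model name "ensemble"
# are assumptions; adjust them to your deployment.
llm = ChatOpenAI(
    base_url="http://localhost:3000/v1",
    api_key="not-needed",
    model="ensemble",
)

print(llm.invoke("Say hello from TensorRT-LLM.").content)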
Follow
the tensorrtllm_backend tutorial
to build your TensorRT engine and launch a Triton server. We provide a Baichuan
example below to follow.
Clone the repository together with its submodules to build the project:
git clone --recursive https://github.com/npuichigo/openai_trtllm.git
Make sure you have Docker and Docker Compose installed, then build and run with:
docker compose up --build
Alternatively, build and run locally with the Rust toolchain:
cargo run --release
The parameters can be set with environment variables or command line arguments:
./target/release/openai_trtllm --help
Usage: openai_trtllm [OPTIONS]
Options:
-H, --host <HOST> Host to bind to [default: 0.0.0.0]
-p, --port <PORT> Port to bind to [default: 3000]
-t, --triton-endpoint <TRITON_ENDPOINT> [default: http://localhost:16001]
-o, --otlp-endpoint <OTLP_ENDPOINT> Endpoint of OpenTelemetry collector
-h, --help Print help
We provide a model template in models/Baichuan
for you to follow. Since we don't know your hardware, we don't provide a pre-built TensorRT engine;
follow the steps below to build your own.
- Download the Baichuan model from HuggingFace.
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat models/download/Baichuan2-13B-Chat
- We provide a pre-built Docker image which is slightly newer
than v0.6.1.
You are free to test other versions.
docker run --rm -it --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus=all -v /models:/models npuichigo/tritonserver-trtllm:711a28d bash
- Follow the tutorial here to build your engine.
# int8 weight-only quantization, with inflight batching, for example
python /app/tensorrt_llm/examples/baichuan/build.py \
    --model_version v2_13b \
    --model_dir /models/download/Baichuan2-13B-Chat \
    --output_dir /models/baichuan/tensorrt_llm/1 \
    --max_input_len 4096 \
    --max_output_len 1024 \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --enable_context_fmha \
    --use_weight_only \
    --use_inflight_batching
After the build, the engine will be saved to /models/baichuan/tensorrt_llm/1/ to be used by Triton.
- Make sure models/baichuan/preprocessing/config.pbtxt
and models/baichuan/postprocessing/config.pbtxt
refer to the correct tokenizer directory (see the sanity-check sketch after this list). For example:
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/models/download/Baichuan2-13B-Chat"
  }
}
- Go ahead and launch the server, preferably with docker compose.
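As mentioned in the tokenizer step above, a quick way to catch a wrong tokenizer_dir before launching is to load it directly. This is an optional sketch, assuming the transformers package is installed in your environment:

from transformers import AutoTokenizer

# Optional sanity check: confirm that the tokenizer_dir referenced in the
# preprocessing/postprocessing config.pbtxt files contains a loadable tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "/models/download/Baichuan2-13B-Chat",
    trust_remote_code=True,  # the Baichuan tokenizer ships custom code
)
print(tokenizer("Hello, Baichuan!"))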
We trace performance metrics using the tracing, tracing-opentelemetry and opentelemetry-otlp crates; this works, for example, with Tempo on a k8s cluster or with a local Jaeger instance.
To run Jaeger locally, start it with the following command:
docker run --rm --name jaeger \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.51
To enable tracing, set the OPENAI_TRTLLM_OTLP_ENDPOINT
environment variable or the --otlp-endpoint
command-line
argument to the endpoint of your OpenTelemetry collector.
OPENAI_TRTLLM_OTLP_ENDPOINT=http://localhost:4317 cargo run --release
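Traces can then be inspected in the Jaeger UI, which the container above exposes at http://localhost:16686.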