Beijing Institute of Technology, Beijing
Stars
FlashInfer: Kernel Library for LLM Serving
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and real-world LLM application deployment).
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
ModelScope: bring the notion of Model-as-a-Service to life.
Fast inference from large language models via speculative decoding
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
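Several of the entries above (Spec-Bench, EAGLE, Medusa, and the speculative-decoding implementation) center on draft-and-verify decoding: a cheap draft model proposes several tokens, and the expensive target model checks them in one forward pass. A toy sketch of the core loop, where `draft_model` and `target_model` are hypothetical callables and the residual-resampling step of the full algorithm is omitted:

```python
import torch

def speculative_step(draft_model, target_model, prefix, k=4):
    """One draft-and-verify step of speculative decoding (toy sketch)."""
    # Draft phase: sample k tokens autoregressively from the small model.
    draft_tokens, draft_probs = [], []
    ctx = prefix
    for _ in range(k):
        p = draft_model(ctx)                       # (vocab,) next-token distribution
        t = torch.multinomial(p, 1).item()
        draft_tokens.append(t)
        draft_probs.append(p[t])
        ctx = torch.cat([ctx, torch.tensor([t])])
    # Verify phase: one target pass scores all k draft positions at once.
    target_probs = target_model(ctx)               # (k, vocab), one row per draft position
    accepted = []
    for i, t in enumerate(draft_tokens):
        # Accept token t with probability min(1, p_target / p_draft).
        if torch.rand(()) < torch.clamp(target_probs[i, t] / draft_probs[i], max=1.0):
            accepted.append(t)
        else:
            break  # first rejection ends the speculative run
            # (the full algorithm then resamples from the residual distribution)
    return accepted

# Toy demo with random "models" just to exercise the control flow.
vocab = 16
draft = lambda ctx: torch.softmax(torch.randn(vocab), dim=0)
target = lambda ctx: torch.softmax(torch.randn(4, vocab), dim=-1)
print(speculative_step(draft, target, torch.zeros(1, dtype=torch.long)))
```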
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Development repository for the Triton language and compiler
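The Triton entry above refers to the OpenAI Triton language, which lets you write GPU kernels in Python. A minimal vector-add kernel in the style of the official tutorials (assuming `triton` is installed and a CUDA GPU is available):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)     # one program per block of 1024 elements
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```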
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Universal LLM Deployment Engine with ML Compilation
EVA Series: Visual Representation Fantasies from BAAI
The largest collection of PyTorch image encoders / backbones, including train, eval, inference, and export scripts, and pretrained weights: ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), and more.
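For the timm entry above, the usual entry point is `timm.create_model`; a small usage sketch (assuming `timm` is installed):

```python
import timm
import torch

# Instantiate a pretrained backbone by name and run a dummy forward pass.
model = timm.create_model("resnet50", pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # (1, 1000) ImageNet logits
print(logits.shape)
```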
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
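The Accelerate entry above describes Hugging Face Accelerate; its core pattern is wrapping model, optimizer, and dataloader with `Accelerator.prepare` so the same script runs on CPU, a single GPU, or a distributed setup. A minimal sketch with a toy model and random data:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()                      # picks up device/distributed config
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=8,
)
model, optimizer, data = accelerator.prepare(model, optimizer, data)

for x, y in data:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)                   # replaces loss.backward()
    optimizer.step()
```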
Sample codes for my CUDA programming book
[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving
This is a Chinese translation of the CUDA programming guide
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Learn CUDA Programming, published by Packt
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
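The AIMET entry concerns quantization and compression of trained networks. Without reproducing AIMET's own API, the underlying idea can be illustrated with PyTorch's built-in dynamic quantization (a generic sketch, not AIMET code):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

# Post-training dynamic quantization: weights are stored in int8 and
# activations are quantized on the fly at inference time. This shows the
# concept only; AIMET provides finer-grained schemes (e.g. quantization
# simulation, AdaRound).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(qmodel(torch.randn(1, 128)).shape)
```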
How to optimize various algorithms in CUDA.
A simple tool that can generate TensorRT plugin code quickly.
Simple samples for TensorRT programming
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
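For the TensorRT entry above, a typical Python workflow parses an ONNX model and serializes an engine. A sketch assuming the TensorRT 8.x Python API and a hypothetical local `model.onnx`:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
# Explicit-batch networks are required for ONNX parsing.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:        # hypothetical input model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```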
A tool to modify ONNX models visually, based on Netron and Flask.
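The onnx-modifier entry above edits ONNX graphs through a UI; the same kind of edit can be scripted with the `onnx` Python package. A small sketch (assuming a hypothetical local `model.onnx`) that inspects the graph and renames its Conv nodes:

```python
import onnx

model = onnx.load("model.onnx")            # hypothetical input file

# Inspect the graph: one line per node, showing op type and tensor names.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))

# Example edit: give every Conv node an explicit name for easier tracking.
for i, node in enumerate(model.graph.node):
    if node.op_type == "Conv":
        node.name = f"Conv_{i}"

onnx.checker.check_model(model)            # validate before saving
onnx.save(model, "model_edited.onnx")
```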
Collection of various algorithms in mathematics, machine learning, computer science and physics implemented in C++ for educational purposes.