MagicPIG: LSH Sampling for Efficient LLM Generation

Python 172 8 Updated Dec 16, 2024

MoonPalace (月宫) is an API debugging tool provided by Moonshot AI (月之暗面).

Go 168 3 Updated Dec 30, 2024

Speculative Decoding Optimization

Python 4 Updated Dec 29, 2024

Tile primitives for speedy kernels

Cuda 1,896 90 Updated Jan 4, 2025

Deep learning for dummies. All the practical details and useful utilities that go into working with real models.

Python 746 38 Updated Dec 10, 2024

nanobind: tiny and efficient C++/Python bindings

C++ 2,467 205 Updated Jan 6, 2025

Notes on the Chinese third edition of "Advanced Programming in the UNIX Environment"

1,361 448 Updated Jan 25, 2019

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Cuda 572 29 Updated Dec 9, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,313 131 Updated Jan 3, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 1,703 169 Updated Jan 4, 2025

Machine Learning Engineering Open Book

Python 12,169 742 Updated Jan 6, 2025

Finetune Llama 3.3, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 70% less memory

Python 20,030 1,421 Updated Jan 5, 2025

A collection of benchmarks to measure basic GPU capabilities

Jupyter Notebook 274 41 Updated Jan 4, 2025

A low-latency & high-throughput serving engine for LLMs

Python 288 35 Updated Sep 12, 2024

Dynamic Memory Management for Serving LLMs without PagedAttention

C 267 19 Updated Dec 6, 2024

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Python 36,112 4,181 Updated Jan 4, 2025

📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉

3,106 211 Updated Jan 3, 2025

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

C++ 1,233 520 Updated Jan 6, 2025

nanoGPT style version of Llama 3.1

Python 1,283 67 Updated Aug 8, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 5,087 455 Updated Jan 3, 2025

Fast and memory-efficient exact attention

Python 14,925 1,410 Updated Jan 5, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 7,064 655 Updated Jan 5, 2025

Deep Reinforcement Learning: Zero to Hero!

Jupyter Notebook 2,031 74 Updated Aug 18, 2024

LLM training in simple, raw C/CUDA

Cuda 24,940 2,835 Updated Oct 2, 2024

LLM inference in C/C++

C++ 70,233 10,135 Updated Jan 4, 2025

Advanced Python Mastery (course by @dabeaz)

Python 10,774 1,800 Updated Aug 10, 2024

Material for gpu-mode lectures

Jupyter Notebook 3,387 343 Updated Dec 3, 2024

OneDiff: An out-of-the-box acceleration library for diffusion models.

Jupyter Notebook 1,754 110 Updated Dec 30, 2024

A GPU-driven system framework for scalable AI applications

C++ 111 16 Updated Oct 8, 2024

📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 1,867 195 Updated Jan 3, 2025