Starred repositories
Infinity ∞ : Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Fine-tune the Whisper speech recognition model to support training without timestamp data, training with timestamp data, and training without speech data. Accelerate inference and support Web deplo…
Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
UniSpeech - Large Scale Self-Supervised Learning for Speech
An open-source framework for training large multimodal models.
Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
An Open Source text-to-speech system built by inverting Whisper.
Acoustic mosquito detection code with Bayesian Neural Networks
Collection of scripts and utilities for reorganizing corpora to use with the Montreal Forced Aligner
This repository contains the code to set up the experiments for the ComParE 2022 mosquito event detection sub-challenge.
A library built for easier audio self-supervised training and downstream task evaluation
This repo includes the official implementations of "Fine-tune the pretrained ATST model for sound event detection".
Dataset for lightly supervised training using the LibriVox audiobook recordings. https://librivox.org/.
LibriSpeech-Long is a benchmark dataset for long-form speech generation and processing. Released as part of "Long-Form Speech Generation with Spoken Language Models" (arXiv 2024).
(NeurIPS 2024) Learning to Visual Question Answering, Asking and Assessment
Implementation of E2-TTS, "Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", in PyTorch
This repository contains code and metadata of How2 dataset
Prompting Large Language Models with Audio for General-Purpose Speech Summarization
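"A Practical Guide to Open-Source Large Models": a tutorial tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying Chinese and international open-source large language models (LLMs) and multimodal large models (MLLMs) in a Linux environment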
batch processing of Llama-2 7B
Learning vLLM: deploy the Qwen2-0.5B model with vLLM, and deploy it using Docker.
A curated list of awesome Multimodal studies.
[arXiv 2024] Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Chinese speech pretrained models
1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
Inference and training library for high-quality TTS models.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models