# VATLM

**VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning**

- (Done) Nov. 2022: released the code and models
- Nov. 2022: released the preprint on arXiv

## Pre-Trained and Fine-tuned Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| --- | --- | --- | --- |
| VatLM Base | LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |

## Setup

To fine-tune or pre-train more models, please follow the instructions below.

```bash
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update

cd fairseq && pip install --editable ./
cd ../vat_hubert && pip install -r requirements.txt
```
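
Before launching any jobs, you can optionally confirm that the environment is usable. The check below is a quick sanity check of ours, not part of the official setup; it only verifies that the editable fairseq install and PyTorch import cleanly.

```python
# Optional sanity check (not part of the official setup scripts):
# verifies that the editable fairseq install and PyTorch are importable.
import fairseq
import torch

print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```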

## Data preparation

1. For audio or visual data, please follow AV-HuBERT's preprocessing scripts to pre-process the data and obtain the corresponding `train.tsv` and `train.km` files.

2. For unimodal audio data, the visual modality is replaced with a zero vector; features are extracted according to this script, and k-means clustering is then performed on them to obtain the corresponding labels (a minimal sketch of this clustering step is given after this list).

3. For unimodal text data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the lexicon, and the unit sequences are obtained by extracting features and performing k-means clustering. Then follow this script to train the phone2unit model.
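
For illustration, the sketch below shows the general shape of the k-means labeling step mentioned in items 2 and 3: frame-level features are collected, a k-means model is fit, and each frame is mapped to a cluster index that serves as its unit label. This is a minimal sketch using scikit-learn and hypothetical file names, not the repository's own clustering script, and the number of clusters shown is an assumption rather than the paper's setting.

```python
# Minimal sketch of the k-means unit-labeling step (items 2-3 above).
# Hypothetical paths; prefer the repository's own scripts for real runs.
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feature_files = sorted(glob.glob("features/*.npy"))   # one (T, D) array per utterance
features = [np.load(f) for f in feature_files]

# Fit k-means on all frames (500 clusters is an assumption, not the paper's setting).
kmeans = MiniBatchKMeans(n_clusters=500, batch_size=10000, random_state=0)
kmeans.fit(np.concatenate(features, axis=0))

# Write one line of space-separated unit labels per utterance, .km style.
with open("train.km", "w") as out:
    for feat in features:
        labels = kmeans.predict(feat)
        out.write(" ".join(map(str, labels)) + "\n")
```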

## Pre-train

- VatLM Base model (LRS3 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path

  bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Base model (VoxCeleb2 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path

  bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Large model (VoxCeleb2 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=2
  save_path=/path/to/save_path

  bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

## Fine-tune AVSR/VSR

For example, an AVSR model can be obtained by fine-tuning the VatLM model with 30 hours of labeled data:

```bash
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
ngpu=8
updatefreq=1
save_path=/path/to/save_path

bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
```
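
The fine-tuned model is saved under `${save_path}` as a regular fairseq checkpoint. If you want a quick, unofficial look at it before decoding, it can be opened with plain PyTorch; the path below is hypothetical and the exact keys inside the checkpoint can vary across fairseq versions.

```python
# Unofficial quick inspection of a fine-tuned checkpoint (hypothetical path).
# fairseq checkpoints are torch-serialized dicts; exact keys may vary by version.
import torch

ckpt = torch.load("/path/to/save_path/checkpoint_best.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes 'model' and a config entry such as 'cfg' or 'args'
num_params = sum(p.numel() for p in ckpt["model"].values())
print(f"parameters in checkpoint: {num_params / 1e6:.1f}M")
```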

## Decode

For example, to decode the fine-tuned AVSR model:

```bash
cd VATLM/vat_hubert/vathubert/
data="test"
bash decode_avhubert_lrs3.sh ${data}
```
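
The decode script takes care of scoring. For reference only, the snippet below is a generic, self-contained word error rate (WER) computation you could use to score a hypothesis/reference pair yourself; the file names and the one-utterance-per-line format are assumptions, not the repository's output format.

```python
# Generic WER computation for reference (not the repository's scoring code).
# Assumes plain-text files with one utterance per line, aligned line by line.
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref_path, hyp_path):
    errors = words = 0
    with open(ref_path) as rf, open(hyp_path) as hf:
        for ref_line, hyp_line in zip(rf, hf):
            ref, hyp = ref_line.split(), hyp_line.split()
            errors += edit_distance(ref, hyp)
            words += len(ref)
    return errors / max(words, 1)

print(f"WER: {wer('ref.txt', 'hypo.txt'):.2%}")  # hypothetical file names
```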

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ and av_hubert projects.

Microsoft Open Source Code of Conduct

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{zhu2022vatlm,
  title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning},
  author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
  year={2022},
  eprint={2211.11275},
  archivePrefix={arXiv},
}
```

## Contact Information

For help or issues using VatLM models, please submit a GitHub issue.

For other communications related to VatLM, please contact Long Zhou ([email protected]).