- (Done) Nov. 2022: release the code and models
- Nov. 2022: release the preprint on arXiv
Model | Pre-training Dataset | Fine-tuning Dataset | Download |
---|---|---|---|
VatLM Base | LRS3 + paired audio+text+audio | - | Google drive |
VatLM Base | LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
VatLM Base | LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |
VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |
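Assuming the released checkpoints are fairseq-style .pt files (they are trained with the fairseq/av_hubert codebase), a downloaded file can be inspected with plain PyTorch to verify the download and peek at the stored weights. This is only a minimal sketch, not the repository's documented loading path, and the file name is illustrative.

```python
# Minimal sketch: inspect a downloaded VatLM checkpoint with plain PyTorch.
# "vatlm_base_lrs3.pt" is an illustrative name; on newer PyTorch you may need
# to pass weights_only=False because fairseq checkpoints pickle their config.
import torch

ckpt = torch.load("vatlm_base_lrs3.pt", map_location="cpu")

# fairseq checkpoints typically keep the state dict under "model"
# and the training configuration under "cfg" or "args".
print(list(ckpt.keys()))
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} parameter tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```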
To fine-tune or pre-train more models, please follow the instructions below.
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update
cd VATLM/fairseq && pip install --editable ./
cd VATLM/vat_hubert && pip install -r requirements.txt
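A quick sanity check after installation (a minimal sketch; it only confirms that the editable fairseq install and PyTorch are importable):

```python
# Sanity check: confirm the editable fairseq install and PyTorch are visible.
import torch
import fairseq

print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("fairseq", fairseq.__version__)
```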
- For audio or visual data, please follow the steps of AV-HuBERT's script to pre-process the data and obtain the corresponding train.tsv and train.km files.
- For unimodal audio data, the visual modality is replaced with a zero vector; the features are extracted according to this script, and k-means clustering is then performed to obtain the corresponding labels (a minimal labeling sketch follows this list).
- For unimodal text data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the lexicon, and the unit data are obtained by extracting features and performing k-means clustering (a lexicon-lookup sketch also follows this list). Then follow this script to train the phone2unit model.
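For the unimodal-audio labeling step above, here is a minimal sketch of producing k-means pseudo-labels from already-extracted features. It uses scikit-learn rather than the repository's own clustering scripts, and the file names, cluster count, and feature layout (one .npy matrix of frame-level features per utterance) are illustrative assumptions.

```python
# Minimal sketch: k-means pseudo-labels for dumped audio features.
# Assumes one .npy array of shape (num_frames, feature_dim) per utterance;
# paths, file names, and n_clusters are illustrative, not the repo's defaults.
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feature_files = sorted(glob.glob("features/*.npy"))
features = [np.load(f) for f in feature_files]

# Fit k-means on all frames pooled together.
kmeans = MiniBatchKMeans(n_clusters=500, batch_size=10000, random_state=0)
kmeans.fit(np.concatenate(features, axis=0))

# Write one line of space-separated cluster ids per utterance (train.km-style).
with open("train.km", "w") as out:
    for feat in features:
        out.write(" ".join(map(str, kmeans.predict(feat))) + "\n")
```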
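For the unimodal-text step, the sketch below shows the lexicon lookup that maps each word of a transcript to its phoneme sequence. The lexicon format (a word followed by space-separated phones per line) and the file names are assumptions; training the phone2unit model itself is handled by the repository's script.

```python
# Minimal sketch: map text to phoneme sequences via a lexicon lookup.
# Assumes a lexicon with "WORD PH1 PH2 ..." per line; file names are illustrative.
lexicon = {}
with open("lexicon.txt") as f:
    for line in f:
        parts = line.split()
        if parts:
            lexicon[parts[0].upper()] = parts[1:]

def text_to_phones(sentence, unk="<unk>"):
    """Look up each word; fall back to an unknown token for OOV words."""
    phones = []
    for word in sentence.upper().split():
        phones.extend(lexicon.get(word, [unk]))
    return phones

with open("text.txt") as fin, open("train.phn", "w") as fout:
    for line in fin:
        fout.write(" ".join(text_to_phones(line)) + "\n")
```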
- VatLM Base model (LRS3 + paired audio+text+audio)

cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=1
save_path=/path/to/save_path
bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
- VatLM Base model (VoxCeleb2 + paired audio+text+audio)

cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=1
save_path=/path/to/save_path
bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
- VatLM Large model (VoxCeleb2 + paired audio+text+audio)

cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=2
save_path=/path/to/save_path
bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
For example, the AVSR model can be obtained by fine-tuning the VatLM model using 30 hours of labeled data.
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
ngpu=8
updatefreq=1
save_path=/path/to/save_path
bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
For example, decode the fine-tuned AVSR model as follows.
cd VATLM/vat_hubert/vathubert/
data="test"
bash decode_avhubert_lrs3.sh ${data}
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ and av_hubert projects.
Microsoft Open Source Code of Conduct
If you find our work useful in your research, please cite the following paper:
@article{zhu2022vatlm,
title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning},
author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
year={2022},
eprint={2211.11275},
archivePrefix={arXiv},
}
For help or issues using VatLM models, please submit a GitHub issue.
For other communications related to VatLM, please contact Long Zhou ([email protected]).