# VATLM

**VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning**

- (Done) Nov. 2022: released the code and models
- Nov. 2022: released the preprint on arXiv

## Pre-Trained and Fine-tuned Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| --- | --- | --- | --- |
| VatLM Base | LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | - | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | Google drive |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | Google drive |

## Setup

To fine-tune or pre-train more models, please follow the instructions below.

```bash
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update

cd fairseq && pip install --editable ./
cd ../vat_hubert && pip install -r requirements.txt
```
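
Before launching any jobs, you can optionally confirm that the environment is usable. The check below is a quick sanity check of ours, not part of the official setup; it only verifies that the editable fairseq install and PyTorch import cleanly.

```python
# Optional sanity check (not part of the official setup scripts):
# verifies that the editable fairseq install and PyTorch are importable.
import fairseq
import torch

print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```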

## Data preparation

1. For audio or visual data, please follow AV-HuBERT's preprocessing scripts to pre-process the data and obtain the corresponding `train.tsv` and `train.km` files.

2. For unimodal audio data, the visual modality is replaced with a zero vector; features are extracted according to this script, and k-means clustering is then performed on them to obtain the corresponding labels (a minimal sketch of this clustering step is given after this list).

3. For unimodal text data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the lexicon, and the unit sequences are obtained by extracting features and performing k-means clustering. Then follow this script to train the phone2unit model.
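
For illustration, the sketch below shows the general shape of the k-means labeling step mentioned in items 2 and 3: frame-level features are collected, a k-means model is fit, and each frame is mapped to a cluster index that serves as its unit label. This is a minimal sketch using scikit-learn and hypothetical file names, not the repository's own clustering script, and the number of clusters shown is an assumption rather than the paper's setting.

```python
# Minimal sketch of the k-means unit-labeling step (items 2-3 above).
# Hypothetical paths; prefer the repository's own scripts for real runs.
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feature_files = sorted(glob.glob("features/*.npy"))   # one (T, D) array per utterance
features = [np.load(f) for f in feature_files]

# Fit k-means on all frames (500 clusters is an assumption, not the paper's setting).
kmeans = MiniBatchKMeans(n_clusters=500, batch_size=10000, random_state=0)
kmeans.fit(np.concatenate(features, axis=0))

# Write one line of space-separated unit labels per utterance, .km style.
with open("train.km", "w") as out:
    for feat in features:
        labels = kmeans.predict(feat)
        out.write(" ".join(map(str, labels)) + "\n")
```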

## Pre-train

- VatLM Base model (LRS3 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path

  bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Base model (VoxCeleb2 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path

  bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Large model (VoxCeleb2 + paired audio+text+audio)

  ```bash
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=2
  save_path=/path/to/save_path

  bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

## Fine-tune AVSR/VSR

For example, an AVSR model can be obtained by fine-tuning the VatLM model with 30 hours of labeled data:

```bash
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
ngpu=8
updatefreq=1
save_path=/path/to/save_path

bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
```
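
The fine-tuned model is saved under `${save_path}` as a regular fairseq checkpoint. If you want a quick, unofficial look at it before decoding, it can be opened with plain PyTorch; the path below is hypothetical and the exact keys inside the checkpoint can vary across fairseq versions.

```python
# Unofficial quick inspection of a fine-tuned checkpoint (hypothetical path).
# fairseq checkpoints are torch-serialized dicts; exact keys may vary by version.
import torch

ckpt = torch.load("/path/to/save_path/checkpoint_best.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes 'model' and a config entry such as 'cfg' or 'args'
num_params = sum(p.numel() for p in ckpt["model"].values())
print(f"parameters in checkpoint: {num_params / 1e6:.1f}M")
```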

## Decode

For example, to decode the fine-tuned AVSR model:

```bash
cd VATLM/vat_hubert/vathubert/
data="test"
bash decode_avhubert_lrs3.sh ${data}
```
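
The decode script takes care of scoring. For reference only, the snippet below is a generic, self-contained word error rate (WER) computation you could use to score a hypothesis/reference pair yourself; the file names and the one-utterance-per-line format are assumptions, not the repository's output format.

```python
# Generic WER computation for reference (not the repository's scoring code).
# Assumes plain-text files with one utterance per line, aligned line by line.
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref_path, hyp_path):
    errors = words = 0
    with open(ref_path) as rf, open(hyp_path) as hf:
        for ref_line, hyp_line in zip(rf, hf):
            ref, hyp = ref_line.split(), hyp_line.split()
            errors += edit_distance(ref, hyp)
            words += len(ref)
    return errors / max(words, 1)

print(f"WER: {wer('ref.txt', 'hypo.txt'):.2%}")  # hypothetical file names
```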

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ and av_hubert projects.

Microsoft Open Source Code of Conduct

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{zhu2022vatlm,
  title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning},
  author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
  year={2022},
  eprint={2211.11275},
  archivePrefix={arXiv},
}
```

## Contact Information

For help or issues using VatLM models, please submit a GitHub issue.

For other communications related to VatLM, please contact Long Zhou ([email protected]).