Speech2C

Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

Pre-Trained and Fine-tuned Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| -------- | -------- | -------- | -------- |
| Speech2C | 960 hrs LibriSpeech | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |

Language Model and Vocabulary

| Model | Dataset | Download | Vocabulary |
| -------- | -------- | -------- | -------- |
| LM | LibriSpeech LM Dataset | Model | Vocabulary |

Setup

git submodule update --init Speech2C/fairseq
cd Speech2C/
pip install --editable fairseq/

Data Preparation

Please follow the HuBERT data preparation steps described here.
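
As a rough guide to what that recipe produces (all paths and file names below are illustrative, not taken from this README): DATA_DIR should end up containing train.tsv / valid.tsv wav manifests, and LABEL_DIR the matching train.km / valid.km k-means labels plus dict.km.txt, which is what task.labels='["km"]' refers to during pre-training. A minimal sketch of building the manifests with fairseq's wav2vec_manifest.py:

# Illustrative paths; adjust to your local LibriSpeech layout.
LIBRISPEECH_ROOT=/data/LibriSpeech/train-960
DATA_DIR=/data/speech2c/manifests
FAIRSEQ_PATH=fairseq

# Build train.tsv / valid.tsv wav manifests (script ships with fairseq).
python ${FAIRSEQ_PATH}/examples/wav2vec/wav2vec_manifest.py ${LIBRISPEECH_ROOT} \
    --dest ${DATA_DIR} --ext flac --valid-percent 0.01

# The HuBERT recipe then extracts features and k-means labels, yielding
#   ${LABEL_DIR}/train.km, ${LABEL_DIR}/valid.km, ${LABEL_DIR}/dict.km.txt
# which task.label_dir points to in the pre-training command below.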

Pre-Training

DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c
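
For reference, a filled-in invocation might look like the following; the path values are placeholders, and the distributed_training / optimization overrides are generic fairseq hydra options added here as an assumption (the exact GPU count and update frequency are not stated in this README):

DATA_DIR=/data/speech2c/manifests
LABEL_DIR=/data/speech2c/labels
FAIRSEQ_PATH=fairseq

# Same command as above, plus standard fairseq overrides for an 8-GPU run.
python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c \
  distributed_training.distributed_world_size=8 \
  optimization.update_freq='[4]'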

Finetune

DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=
W2V_PATH=
CONFIG_NAME=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c
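
A filled-in example follows, with the caveat that every value is a placeholder: W2V_PATH points at the pre-trained Speech2C checkpoint downloaded above, and CONFIG_NAME should be one of the fine-tuning configs under speech2c/config matching the 10 hr or 100 hr split (check that directory for the exact file names). Note that fine-tuning uses letter ("ltr") targets, as the inference command below implies, so DATA_DIR and LABEL_DIR here follow the HuBERT CTC fine-tuning data format rather than the k-means labels used for pre-training.

# All values are placeholders; see speech2c/config for the real config names.
DATA_DIR=/data/speech2c/manifests_ltr
LABEL_DIR=/data/speech2c/labels_ltr
FAIRSEQ_PATH=fairseq
W2V_PATH=/models/speech2c_base_librispeech.pt   # pre-trained Speech2C checkpoint
CONFIG_NAME=speech2c_base_100h                  # assumed naming, verify locally

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c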

Inference

Note that joint CTC and decoder inference is only supported when the batch size is 1.

FAIRSEQ_PATH=
DATA_DIR=
LABEL_DIR=
BEAM_SIZE=
CTC_WEIGHT=
TEST_SET=
CHECKPOINT_PATH=
W2V_PATH=


python ${FAIRSEQ_PATH}/fairseq_cli/generate.py ${DATA_DIR} \
    --label-dir ${LABEL_DIR} \
    --path ${CHECKPOINT_PATH} \
    --user-dir SpeechT5/Speech2C/speech2c \
    --model-overrides "{'w2v_path': '${W2V_PATH}'}" \
    --gen-subset ${TEST_SET} \
    --task speech2c_pretraining \
    --post-process letter \
    --add-decoder \
    --labels '["ltr"]' \
    --fine-tuning \
    --scoring wer \
    --max-len-a 0 \
    --max-len-b 620 \
    --pad-audio \
    --random-crop \
    --ctc-weight ${CTC_WEIGHT} \
    --max-tokens 8000000 \
    --beam ${BEAM_SIZE} \
    --single-target
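
As a usage sketch, the decoding settings below are illustrative only (not values reported in the paper); tune BEAM_SIZE and CTC_WEIGHT on the dev sets first, and keep the batch-size-1 restriction above in mind for joint CTC/decoder scoring:

# Illustrative settings only; not taken from the paper.
FAIRSEQ_PATH=fairseq
DATA_DIR=/data/speech2c/manifests_ltr
LABEL_DIR=/data/speech2c/labels_ltr
BEAM_SIZE=10
CTC_WEIGHT=0.2
TEST_SET=test_clean      # must match the manifest name in ${DATA_DIR}
CHECKPOINT_PATH=/models/speech2c_100h_finetuned.pt
W2V_PATH=/models/speech2c_base_librispeech.pt

# generate.py prints the word error rate for ${TEST_SET} at the end of its log.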

Results on LibriSpeech

Evaluation on the LibriSpeech 10 hr subset (WER, %)

| Model | LM | test-clean | test-other |
| -------- | -------- | -------- | -------- |
| wav2vec 2.0 Base | - | 11.1 | 17.6 |
| HuBERT Base | - | 10.1 | 16.8 |
| Speech2C | - | 7.8 | 13.1 |
| wav2vec 2.0 Base | 4-gram | 4.3 | 9.5 |
| wav2vec 2.0 Base | Transf. | 3.2 | 7.8 |
| HuBERT Base | 4-gram | 4.3 | 9.4 |
| Speech2C | Transf. | 3.1 | 7.0 |

Evaluation on the LibriSpeech 100 hr subset (WER, %)

| Model | LM | test-clean | test-other |
| -------- | -------- | -------- | -------- |
| wav2vec 2.0 Base | - | 6.1 | 13.3 |
| wav2vec 2.0 Large | - | 4.7 | 9.0 |
| HuBERT Base | - | 6.3 | 13.2 |
| SpeechT5 | - | 4.4 | 10.4 |
| Baseline | - | 5.0 | 11.9 |
| Speech2C | - | 4.3 | 9.0 |
| wav2vec 2.0 Base | 4-gram | 3.4 | 8.0 |
| wav2vec 2.0 Base | Transf. | 2.6 | 6.3 |
| HuBERT Base | 4-gram | 3.4 | 8.1 |
| SpeechT5 | Transf. | 2.4 | 5.8 |
| Baseline | Transf. | 2.5 | 6.3 |
| Speech2C | Transf. | 2.4 | 5.2 |

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following paper:

@article{Ao2022Speech2C,
  title         = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author        = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint        = {2203.17113},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2022}
}