# Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
## Pre-Trained and Fine-Tuned Models

Model | Pre-training Dataset | Fine-tuning Dataset | Download |
---|---|---|---|
Speech2C | 960 hrs LibriSpeech | - | Google Drive |
Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |
## Language Model and Vocabulary

Model | Dataset | Download | Vocabulary |
---|---|---|---|
LM | LibriSpeech LM Dataset | Model | Vocabulary |
## Setup

```bash
git submodule update --init Speech2C/fairseq
cd Speech2C/
pip install --editable fairseq/
```
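A quick way to confirm the editable install worked (a generic fairseq import check, not a Speech2C-specific step):

```bash
# Should print the installed fairseq version without raising an ImportError
python -c "import fairseq; print(fairseq.__version__)"
```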
## Data Preparation

Please follow the HuBERT data preparation steps described here.
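After those steps, the data and label directories should contain tsv manifests and k-means label files. A rough sketch of the expected layout (file names follow the fairseq HuBERT recipe; the split names and paths below are illustrative):

```
DATA_DIR/
├── train.tsv       # line 1: audio root dir; then one "<relative path>\t<num samples>" per utterance
└── valid.tsv
LABEL_DIR/
├── train.km        # one line per utterance: space-separated k-means cluster IDs
├── valid.km
└── dict.km.txt     # label dictionary for the "km" units
```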
## Pre-Training

```bash
DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c
```
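For example, the variables above might be filled in as follows (all paths are placeholders for your own setup, not shipped defaults):

```bash
DATA_DIR=/data/LibriSpeech/manifest     # directory with {train,valid}.tsv
LABEL_DIR=/data/LibriSpeech/km_labels   # directory with {train,valid}.km and dict.km.txt
FAIRSEQ_PATH=$(pwd)/fairseq             # the fairseq submodule checked out during setup
```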
## Fine-Tuning

```bash
DATA_DIR=
LABEL_DIR=
FAIRSEQ_PATH=
W2V_PATH=
CONFIG_NAME=

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c
```
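As with pre-training, the variables are placeholders. An illustrative assignment (the config name below is hypothetical; use one of the fine-tuning configs actually shipped under speech2c/config):

```bash
DATA_DIR=/data/LibriSpeech/manifest_10h   # manifests for the labeled subset
LABEL_DIR=/data/LibriSpeech/ltr_labels    # character ("ltr") transcriptions
FAIRSEQ_PATH=$(pwd)/fairseq
W2V_PATH=/checkpoints/speech2c_base.pt    # pre-trained Speech2C checkpoint from the table above
CONFIG_NAME=speech2c_base_10h             # hypothetical name; pick the config matching your subset
```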
## Inference

Note that joint CTC and decoder inference is only supported when the batch size is 1.
```bash
FAIRSEQ_PATH=
DATA_DIR=
LABEL_DIR=
BEAM_SIZE=
CTC_WEIGHT=
TEST_SET=
CHECKPOINT_PATH=
W2V_PATH=

python ${FAIRSEQ_PATH}/fairseq_cli/generate.py ${DATA_DIR} \
  --label-dir ${LABEL_DIR} \
  --path ${CHECKPOINT_PATH} \
  --user-dir SpeechT5/Speech2C/speech2c \
  --model-overrides "{'w2v_path': '${W2V_PATH}'}" \
  --gen-subset ${TEST_SET} \
  --task speech2c_pretraining \
  --post-process letter \
  --add-decoder \
  --labels '["ltr"]' \
  --fine-tuning \
  --scoring wer \
  --max-len-a 0 \
  --max-len-b 620 \
  --pad-audio \
  --random-crop \
  --ctc-weight ${CTC_WEIGHT} \
  --max-tokens 8000000 \
  --beam ${BEAM_SIZE} \
  --single-target
```
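An illustrative decoding setup (the subset name, beam size, and CTC weight below are example values to tune, not prescribed defaults):

```bash
FAIRSEQ_PATH=$(pwd)/fairseq
DATA_DIR=/data/LibriSpeech/manifest
LABEL_DIR=/data/LibriSpeech/ltr_labels
BEAM_SIZE=10                                  # decoder beam width
CTC_WEIGHT=0.2                                # weight of the CTC branch in joint decoding
TEST_SET=test_clean                           # must match a tsv manifest in DATA_DIR
CHECKPOINT_PATH=/checkpoints/speech2c_10h.pt  # fine-tuned checkpoint
W2V_PATH=/checkpoints/speech2c_base.pt        # pre-trained checkpoint for --model-overrides
```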
## Results on LibriSpeech

### Evaluation on the 10 hr subset (WER, %)
Model | LM | test-clean | test-other |
---|---|---|---|
wav2vec 2.0 Base | - | 11.1 | 17.6 |
HuBERT Base | - | 10.1 | 16.8 |
Speech2C | - | 7.8 | 13.1 |
wav2vec 2.0 Base | 4-gram | 4.3 | 9.5 |
wav2vec 2.0 Base | Transf. | 3.2 | 7.8 |
HuBERT Base | 4-gram | 4.3 | 9.4 |
Speech2C | Transf. | 3.1 | 7.0 |
### Evaluation on the 100 hr subset (WER, %)
Model | LM | test-clean | test-other |
---|---|---|---|
wav2vec 2.0 Base | - | 6.1 | 13.3 |
wav2vec 2.0 Large | - | 4.7 | 9.0 |
HuBERT Base | - | 6.3 | 13.2 |
SpeechT5 | - | 4.4 | 10.4 |
Baseline | - | 5.0 | 11.9 |
Speech2C | - | 4.3 | 9.0 |
wav2vec 2.0 Base | 4-gram | 3.4 | 8.0 |
wav2vec 2.0 Base | Transf. | 2.6 | 6.3 |
HuBERT Base | 4-gram | 3.4 | 8.1 |
SpeechT5 | Transf. | 2.4 | 5.8 |
Baseline | Transf. | 2.5 | 6.3 |
Speech2C | Transf. | 2.4 | 5.2 |
## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.
Microsoft Open Source Code of Conduct
## Citation

If you find our work useful in your research, please cite the following paper:
```bibtex
@article{Ao2022Speech2C,
  title={Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author={Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint={2203.17113},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  year={2022}
}
```