The implementation of "X-TF-GridNet: A Time-Frequency Domain Target Speaker Extraction Network with Adaptive Speaker Embedding Fusion", which is accepted by Information Fusion.

HaoFengyuan/X-TF-GridNet

X-TF-GridNet: A Time-Frequency Domain Target Speaker Extraction Network with Adaptive Speaker Embedding Fusion

This repository contains the implementation of X-TF-GridNet, a target speaker extraction (TSE) network operating in the time-frequency (T-F) domain, described in a paper accepted by Information Fusion. The proposed method has two key extensions: a U2-Net style network extracts robust fixed speaker embeddings, and an adaptive embedding fusion (AEA) mechanism ensures effective use of the target speaker information.

This project builds primarily on the original implementation of SpEx+ and on the implementation of TF-GridNet. Note that it covers only the training and inference phases. For details on data preparation, please refer to the instructions linked from this repository.

Pretrained Models

We release a model trained on the WHAMR! dataset; it is available from the link in this repository.

Running Experiments

# Train the X-TF-GridNet model.
bash train.sh
# Decode the X-TF-GridNet model.
bash decode.sh
# Output score metrics.
bash evalute.sh
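
The three stages above are meant to run in order, and a later stage is only useful if the earlier one succeeded. As a minimal sketch (the Python wrapper, the `STAGES` list, and the `build_commands` and `run_pipeline` helpers are illustrative assumptions and not part of this repository; only the script names come from the commands above), the sequence could be driven like this:

```python
import subprocess

# Script names are taken from the commands above; everything else here
# is a hypothetical convenience wrapper, not part of the repository.
STAGES = ["train.sh", "decode.sh", "evalute.sh"]


def build_commands(stages=STAGES):
    """Return one `bash <script>` command (as an argument list) per stage."""
    return [["bash", script] for script in stages]


def run_pipeline(stages=STAGES, dry_run=False):
    """Run each stage in order; stop at the first stage that fails."""
    for command in build_commands(stages):
        print("running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so decoding never starts after a failed training run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # dry_run=True only prints the commands; set it to False to execute them.
    run_pipeline(dry_run=True)
```

With `dry_run=False`, the wrapper must be started from the directory that contains the three scripts.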

Results

We report PESQ, SDR, and SI-SDR results on the WSJ0-2mix dataset for comparison with other time-domain TSE methods.

| Method | Domain | Param. (M) | MACs (G/s) | PESQ $\uparrow$ | SDR (dB) $\uparrow$ | SI-SDR (dB) $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| Mixture | - | - | - | - | 2.02 | 0.2 |
| SpEx | T | 10.79 | 3.55 | - | 16.3 | 15.8 |
| SpEx+ | T | 11.14 | 3.76 | 3.43 | 17.2 | 16.9 |
| X-DPRNN | T | 6.32 | 63.92 | - | - | 17.4 |
| SpEx++ | T | 34.08 | 11.88 | 3.53 | 18.4 | 18.0 |
| SpExpc | T | 28.40 | 40.54 | - | 18.8 | 18.6 |
| VEVEN | T | 2.63 | 85.11 | 3.66 | 19.2 | 19.0 |
| X-SepFormer | T | 26.66 | 61.34 | 3.74 | 19.5 | 18.9 |
| X-TF-GridNet | T-F | 7.79 | 68.32 | 3.70 | 20.4 | 19.7 |
| X-TF-GridNet (Large) | T-F | 12.68 | 113.24 | 3.77 | 21.7 | 20.7 |

(* More details can be found in the paper.)
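
For readers unfamiliar with the SI-SDR column, the following is a minimal sketch of the standard scale-invariant SDR definition in pure Python. It is not the evaluation code used in this repository, and the function name `si_sdr`, the `eps` parameter, and the synthetic sine signals are assumptions made for illustration. The estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError("estimate and reference must be non-empty and equal in length")

    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [alpha * b for b in r]
    residual = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Synthetic example (hypothetical signals): a 5-cycle sine as the target
# speaker and a 50-cycle sine as the interference.
N = 1000
reference = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interference = [math.sin(2 * math.pi * 50 * t / N) for t in range(N)]
mild = [s + 0.1 * v for s, v in zip(reference, interference)]
strong = [s + 0.5 * v for s, v in zip(reference, interference)]

print("mild interference:  ", round(si_sdr(mild, reference), 2), "dB")
print("strong interference:", round(si_sdr(strong, reference), 2), "dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.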

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{hao2024if,
    title = {{X-TF-GridNet}: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion},
    journal = {Information Fusion},
    volume = {112},
    pages = {102550},
    year = {2024},
    issn = {1566-2535},
    doi = {https://doi.org/10.1016/j.inffus.2024.102550},
    url = {https://www.sciencedirect.com/science/article/pii/S1566253524003282},
    author = {Fengyuan Hao and Xiaodong Li and Chengshi Zheng},
}
