Warning: The core of the code, the Focal loss and Anti-Focal loss implementations, is available in the fairseq/criterions directory and can be used directly with fairseq. However, the end-to-end code is currently more of a code dump than a code release. Please wait for a (much better) cleaned-up version.
We use fairseq to train the models. Our code is tested on Ubuntu 18.04, with a Conda installation of Python 3.6.
git clone https://github.com/vyraun/long-tailed.git
cd long-tailed
pip install .
Other Repositories Used (thanks!):
- https://github.com/neulab/compare-mt
- https://github.com/mjpost/sacrebleu
Below are the steps to replicate each section of the paper.
The scripts with the prefix 'run' provide the code for the full pipeline, from data preparation to evaluation. For example:
bash run_iwslt14_de_en.sh
Compute Spearman's rank correlation between norms and frequencies:
python norm.py
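For readers unfamiliar with the metric, here is a minimal standard-library sketch of Spearman's rank correlation. It is not the code in norm.py; the function names and the toy frequency and norm values are hypothetical and only illustrate the computation (rank both lists, then take the Pearson correlation of the ranks).

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up): token frequencies and per-token vector norms.
frequencies = [1000, 500, 100, 10, 1]
norms = [2.0, 1.8, 1.5, 1.1, 0.9]
rho = spearman(frequencies, norms)
print(f"Spearman rho = {rho:.3f}")
```

Because the toy norms increase strictly with frequency, rho comes out at (approximately) 1; real data will generally give a value strictly between -1 and 1.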
cd analysis
bash evauate_splits.sh [model_dir]
bash evauate_model_on_splits.sh [model_dir]
The plot can be generated using compare-mt.
bash evaluate.sh model_dir data_dir
python probs_new.py beam_search.pkl
python probs_all.py [beam_search_*.pkl]
The loss functions are implemented in the fairseq/criterions directory.
bash run_iwslt14_de_fc.sh
bash run_iwslt14_de_afc.sh
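As a rough guide to what the two criteria compute, here is a minimal per-token sketch in plain Python. It is not the fairseq implementation from this repository. The focal loss follows the standard form -(1 - p)^gamma * log(p); the anti-focal form -(1 + alpha * p)^gamma * log(p) is an assumption based on our reading of the paper and should be checked against the criterions code. The function names and the default gamma and alpha values are illustrative, not the settings used in the experiments.

```python
import math


def cross_entropy(p):
    """Standard negative log-likelihood for a target-token probability p."""
    return -math.log(p)


def focal_loss(p, gamma=1.0):
    """Focal loss: down-weights tokens the model already predicts confidently."""
    return -((1.0 - p) ** gamma) * math.log(p)


def anti_focal_loss(p, gamma=1.0, alpha=1.0):
    """Anti-focal loss (assumed form): up-weights confidently predicted tokens
    relative to low-confidence ones."""
    return -((1.0 + alpha * p) ** gamma) * math.log(p)


# Compare the three losses for a low- and a high-confidence prediction.
for p in (0.1, 0.9):
    print(
        f"p={p}: CE={cross_entropy(p):.4f} "
        f"FL={focal_loss(p):.4f} AFL={anti_focal_loss(p):.4f}"
    )
```

With gamma set to 0, both variants reduce to plain cross-entropy, which is a convenient sanity check when comparing against a baseline run.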
cd analysis
bash normalization.sh
@inproceedings{raunak2020longtailed,
title = {On Long-Tailed Phenomena in Neural Machine Translation},
author = {Raunak, Vikas and Dalmia, Siddharth and Gupta, Vivek and Metze, Florian},
booktitle = {Findings of EMNLP},
year = 2020,
}