Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification, NeurIPS 2021
This folder contains code to train XR-Transformer models and reproduce experiments in "Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification".
- Clone the repository and enter the
examples/xr-transformer-neurips21
directory.
- Create a virtual environment, then install the dependencies by running:
pip install -r requirements.txt
If you're unfamiliar with Python virtual environments, check out the user guide.
The XMC datasets can be downloaded via:
# eurlex-4k, wiki10-31k, amazoncat-13k, amazon-670k, wiki-500k, amazon-3m
DATASET="wiki10-31k"
wget https://archive.org/download/pecos-dataset/xmc-base/${DATASET}.tar.gz
tar -zxvf ./${DATASET}.tar.gz
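After extraction, the dataset folder (e.g. xmc-base/wiki10-31k) contains raw-text files such as X.trn.txt and X.tst.txt, with one instance per line. A minimal sketch of reading such a file, using a small synthetic file in place of the real download (the sample file name is ours):

```python
from pathlib import Path

# Synthetic stand-in for xmc-base/${DATASET}/X.trn.txt: one raw-text instance per line.
sample = Path("X.trn.sample.txt")
sample.write_text("first training document\nsecond training document\n")

# Each non-empty line is the raw text of one training instance.
instances = sample.read_text().splitlines()
print(len(instances))  # number of training instances in the file
```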
To train and evaluate the XR-Transformer model, run:
bash run.sh ${DATASET}
Recommended platform for training: AWS p3.16xlarge instance or equivalent.
We also release the fine-tuned XR-Transformer encoders, with which users can generate instance embeddings. The encoders can be downloaded via:
# eurlex-4k, wiki10-31k, amazoncat-13k, amazon-670k, wiki-500k, amazon-3m
DATASET="wiki10-31k"
wget https://archive.org/download/xr-transformer-encoders/${DATASET}.tar.gz
mkdir -p ./encoders
tar -zxvf ./${DATASET}.tar.gz -C ./encoders
The XR-Transformer embeddings of the training and test instances can then be generated by:
# for eurlex-4k, wiki10-31k, amazoncat-13k, MODEL_NAME can be bert|roberta|xlnet
# for amazon-670k, wiki-500k, amazon-3m, MODEL_NAME can be bert1|bert2|bert3
MODEL_NAME="bert"
model_dir="./encoders/${DATASET}/${MODEL_NAME}"
python3 -m pecos.xmc.xtransformer.encode \
--text-path xmc-base/${DATASET}/X.trn.txt \
--model-folder ${model_dir} \
--batch-gen-workers 16 \
--save-emb-path ${model_dir}/X.emb.trn.npy \
--batch-size 128
python3 -m pecos.xmc.xtransformer.encode \
--text-path xmc-base/${DATASET}/X.tst.txt \
--model-folder ${model_dir} \
--batch-gen-workers 16 \
--save-emb-path ${model_dir}/X.emb.tst.npy \
--batch-size 128
Embeddings will be saved at ${model_dir}/X.emb.trn.npy and ${model_dir}/X.emb.tst.npy.
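The saved embeddings are plain NumPy arrays (one row per instance) and can be loaded with numpy.load. A minimal sketch of consuming them, using a small synthetic array in place of a real X.emb.trn.npy (the 4×8 shape is an arbitrary stand-in, not the actual encoder dimension):

```python
import numpy as np

# Synthetic stand-in for ${model_dir}/X.emb.trn.npy: 4 instances, 8-dim embeddings.
emb = np.random.rand(4, 8).astype(np.float32)
np.save("X.emb.trn.npy", emb)

# Load the saved embeddings and, for example, compute pairwise cosine similarity.
X = np.load("X.emb.trn.npy")
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_norm @ X_norm.T  # (num_instances, num_instances) cosine-similarity matrix
print(X.shape, sim.shape)
```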
If you find this useful, please consider citing our paper.