This sub-folder contains AWS Multi-node Batch CDK code for quickly generating the infrastructure for training distributed PECOS models. The PECOS code used for training is whatever is in your local folder when you synthesize and deploy the CDK app, which makes it easy to test local changes.
- AWS account with access to VPC, Batch, EC2, S3, ECR.
- Set up AWS credentials following [guide].
- Configure versions:
  ```bash
  export NPM_VERSION=v0.38.0
  export NODE_VERSION=14.6.0
  export CDK_VERSION=2.40.0
  ```
- Install `npm` and `node` if you haven't:
  - Download the nvm install script:
    ```bash
    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/${NPM_VERSION}/install.sh | bash
    ```
  - Append the following to your shell config (e.g. `.bashrc` or `.zshrc`), and source it:
    ```bash
    # Append below
    export NVM_DIR="$HOME/.nvm"
    [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
    [ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion

    # Source shell config
    source <YOUR_SHELL_CONFIG>
    ```
  - Install the latest CDK-supported `Node.js` and `npm`:
    ```bash
    # NOTE: the latest node version is not always supported by CDK, double-check before installation
    nvm install ${NODE_VERSION}
    ```
  - Check installation:
    ```bash
    node -v
    npm -v
    ```
- Install AWS CDK:
  ```bash
  npm install -g aws-cdk@${CDK_VERSION}

  # Check installation
  cdk --version
  ```
Clone code:
```bash
git clone https://github.com/amzn/pecos.git
cd pecos
```
All commands below run in the root folder of `pecos`. Export the relative directory of the CDK code:
```bash
export CDKDIR=aws_infra/multinode_batch_cdk
export CDKAPP="python3 ./${CDKDIR}/app.py"
```
Create a Python virtual environment and install dependencies:
```bash
python3 -m venv $CDKDIR/.venv
source $CDKDIR/.venv/bin/activate
python3 -m pip install -r $CDKDIR/requirements.txt
```
Configure parameters:
```bash
# Make sure the AWS account/region match your environment's AWS credentials
./$CDKDIR/config_generator.py
```
This will generate a `param_config.json` file in the same directory as `config_generator.py`, which is used for the CDK synth. Every time you change the parameters, you need to redo CDK synthesize/deploy.
Bootstrap:
```bash
# Replace contents in <> with your AWS account number and region
cdk -a $CDKAPP bootstrap aws://<YOUR_AWS_ID>/<YOUR_REGION>
```
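If you prefer not to fill in the account and region by hand, they can be derived from the AWS CLI. A minimal sketch (the `make_bootstrap_target` helper is ours for illustration, not part of this repo; the commented commands assume the AWS CLI is installed and configured):

```bash
# Helper (hypothetical, for illustration): build the aws://<account>/<region>
# string that `cdk bootstrap` expects.
make_bootstrap_target() {
  printf 'aws://%s/%s\n' "$1" "$2"
}

# With configured AWS credentials you could run:
#   account_id=$(aws sts get-caller-identity --query Account --output text)
#   region=$(aws configure get region)
#   cdk -a $CDKAPP bootstrap "$(make_bootstrap_target "$account_id" "$region")"
make_bootstrap_target 123456789012 us-east-1   # prints: aws://123456789012/us-east-1
```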
CDK synthesize:
```bash
cdk -a $CDKAPP synth

# Display changes
cdk -a $CDKAPP diff
```
CDK deploy:
- Rerun deploy for all stacks every time you want new changes from your local PECOS code to take effect.
- An image for distributed PECOS containers will be built and uploaded, so this may take a while.

```bash
cdk -a $CDKAPP deploy --all
```
You can open the AWS CloudFormation web console [link] to check the deployed resources.
If you no longer need the infra, first delete the S3 bucket `s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME>` generated by CDK, then destroy:
```bash
cdk -a $CDKAPP destroy --all
```
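For the bucket-deletion step above, one option is the AWS CLI (a sketch; the bucket name pattern is taken from this guide, and `aws s3 rb --force` removes the bucket together with all of its contents, so double-check the name first):

```bash
# Replace contents in <> as in the bootstrap step; this permanently
# deletes the bucket and everything in it.
aws s3 rb --force s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME>
```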
Download the `eurlex-4k` dataset and upload it to the S3 bucket created by CDK:
```bash
wget https://archive.org/download/pecos-dataset/xmc-base/eurlex-4k.tar.gz
tar -zxvf eurlex-4k.tar.gz
aws s3 cp --recursive ./xmc-base/eurlex-4k \
    s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME>/input-eurlex-4k/
```
Submit a job by executing the following command from the `pecos` root folder (it invokes `submit_job.py` in the `aws_infra/multinode_batch_cdk` directory):
```bash
./$CDKDIR/submit_job.py \
    --job-name pecos-train-xlinear-eurlex-4k \
    --input-folder input-eurlex-4k \
    --output-folder output-eurlex-4k \
    --num-nodes 2 \
    --cpu 1 \
    --memory 60000 \
    --commands 'mpiexec -n $AWS_BATCH_JOB_NUM_NODES -f /job/hostfile python3 -m pecos.distributed.xmc.xlinear.train \
        -x $PECOS_INPUT/tfidf-attnxml/X.trn.npz \
        -y $PECOS_INPUT/Y.trn.npz \
        -m $PECOS_OUTPUT/eurlex_model \
        --nr-splits 2 -b 50 -k 100 -nst 16 -t 0.1
    python3 -m pecos.xmc.xlinear.predict \
        -x $PECOS_INPUT/tfidf-attnxml/X.tst.npz \
        -y $PECOS_INPUT/Y.tst.npz \
        -m $PECOS_OUTPUT/eurlex_model > $PECOS_OUTPUT/eurlex_score.txt'
```
Here `$PECOS_INPUT` contains everything downloaded from the S3 input folder given by your parameters, and you should put all outputs into `$PECOS_OUTPUT` so they are uploaded back to the S3 bucket.
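Note the single quotes around the `--commands` value: they keep `$AWS_BATCH_JOB_NUM_NODES`, `$PECOS_INPUT`, and `$PECOS_OUTPUT` from being expanded by your local shell, so they are resolved inside the Batch containers instead. A minimal illustration (the variable value here is made up):

```bash
# Illustrative only: pretend this is the value set inside a Batch container.
PECOS_INPUT=/job/input

literal='ls $PECOS_INPUT'    # single quotes: kept as-is for the remote shell
expanded="ls $PECOS_INPUT"   # double quotes: expanded by the local shell

echo "$literal"    # prints: ls $PECOS_INPUT
echo "$expanded"   # prints: ls /job/input
```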
For the full parameter list, please check `submit_job.py`.
After submitting the job, navigate to the AWS Batch web console [link] to check the job's running status.
After the job is done, check the output model `eurlex_model/` and the score file `eurlex_score.txt` in the `output-eurlex-4k` folder of the S3 bucket generated by CDK.
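To inspect the results locally, one option is to copy them down with the AWS CLI (a sketch; the bucket and folder names follow the conventions used throughout this guide):

```bash
# Replace contents in <> as before, then fetch the job outputs.
aws s3 cp --recursive \
    s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME>/output-eurlex-4k/ \
    ./output-eurlex-4k/
```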