PECOS Distributed Training Infra AWS Multi-node Batch CDK

This sub-folder contains AWS CDK code that quickly provisions the AWS Multi-node Batch infra for training distributed PECOS models.

The PECOS code used for training is whatever is in your local folder when you synthesize and deploy the CDK app, which makes it easy to test local changes.

Prerequisite

AWS account with access to VPC, Batch, EC2, S3, ECR.
Set up AWS credentials following the [guide].

Configure versions (NPM_VERSION holds the nvm release tag used by the install script below):

export NPM_VERSION=v0.38.0
export NODE_VERSION=14.6.0
export CDK_VERSION=2.40.0

Install npm and node if you haven't:

Download and run the nvm install script:

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/${NPM_VERSION}/install.sh | bash

Append below to shell config (e.g. .bashrc or .zshrc), and source it:

# Append below
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  # This loads nvm bash_completion

# Source shell config
source <YOUR_SHELL_CONFIG>

Install latest CDK-supported Node.js and npm:

# NOTE: latest node version is not always supported by CDK, double-check before installation
nvm install ${NODE_VERSION}

Check installation:
```
node -v
npm -v
```
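
Since not every Node.js release is supported by CDK, it can be worth confirming that the active version is the pinned one. A minimal sketch, assuming a hypothetical helper named `check_node_version` (it is not part of PECOS, nvm, or CDK):

```shell
# Hypothetical helper: compare the pinned Node.js version with what `node -v` reports.
#   $1: pinned version without the leading "v" (e.g. the value of NODE_VERSION)
#   $2: version string as printed by `node -v` (e.g. v14.6.0)
check_node_version() {
    if [ "$2" = "v$1" ]; then
        echo "ok: node $2"
    else
        echo "mismatch: expected v$1, found $2"
        return 1
    fi
}
```

It could be called as `check_node_version "$NODE_VERSION" "$(node -v)"`.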

Install AWS CDK:

npm install -g aws-cdk@${CDK_VERSION}

# Check installation
cdk --version

Usage

Workspace Preparation

Clone Code:

git clone https://github.com/amzn/pecos.git
cd pecos

All commands below are run from the root folder of pecos. Export the relative directory of the CDK code:

export CDKDIR=aws_infra/multinode_batch_cdk
export CDKAPP="python3 ./${CDKDIR}/app.py"

Create Python virtual environment and install dependencies:

python3 -m venv $CDKDIR/.venv
source $CDKDIR/.venv/bin/activate
python3 -m pip install -r $CDKDIR/requirements.txt

CDK Constructs Generation

Configure parameters:

# Make sure the AWS account/region match those of the environment's AWS credentials
./$CDKDIR/config_generator.py

This will generate a param_config.json file in the same directory as config_generator.py, which will be used for the CDK synth.

CDK synthesize/deploy must be rerun every time the parameters change.
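
Because synth relies on param_config.json, a guard like the following can catch a missing file before cdk is invoked. This is only a sketch: `require_param_config` is a hypothetical name, and the file location simply follows the description above.

```shell
# Hypothetical guard: succeed only if param_config.json exists in the given CDK directory.
#   $1: directory that holds config_generator.py (e.g. the value of CDKDIR)
require_param_config() {
    if [ -f "$1/param_config.json" ]; then
        echo "found $1/param_config.json"
    else
        echo "missing $1/param_config.json; run config_generator.py first" >&2
        return 1
    fi
}
```

One possible use is `require_param_config "$CDKDIR" && cdk -a "$CDKAPP" synth`.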

Bootstrap:

# Replace contents in <> with your AWS account number and region
cdk -a "$CDKAPP" bootstrap aws://<YOUR_AWS_ID>/<YOUR_REGION>
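
The account number and region can also be read from the active credentials rather than typed by hand. The sketch below assumes the AWS CLI is installed and that the region is set in its configuration; `bootstrap_target` is a hypothetical helper.

```shell
# Hypothetical helper: format the environment string passed to `cdk bootstrap`.
#   $1: AWS account number
#   $2: AWS region
bootstrap_target() {
    echo "aws://$1/$2"
}

# Example usage (needs a configured AWS CLI, so it is left commented out):
#   account="$(aws sts get-caller-identity --query Account --output text)"
#   region="$(aws configure get region)"
#   cdk -a "$CDKAPP" bootstrap "$(bootstrap_target "$account" "$region")"
```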

CDK synthesize:

cdk -a "$CDKAPP" synth
# Display changes
cdk -a "$CDKAPP" diff

CDK deploy:

Rerun deploy for all stacks every time you want new changes in the local PECOS code to take effect.
An image for the distributed PECOS containers is built and uploaded, so this may take a while.

cdk -a "$CDKAPP" deploy --all

You can open the AWS CloudFormation web console [link] to check the deployed resources.

If you no longer need the infra, first delete the S3 bucket s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME> generated by CDK, then destroy:

cdk -a "$CDKAPP" destroy --all
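
The two teardown steps can be sketched as below. `aws s3 rb --force` empties a bucket and then removes it; whether that is sufficient here is an assumption (for instance, it does not remove old object versions in a versioned bucket). The helper names are hypothetical, and the commands are printed for review rather than executed.

```shell
# Hypothetical helper: build the bucket URI from the naming pattern above.
#   $1: AWS account number
#   $2: the <YOUR_NAME> part of the bucket name
pecos_bucket_uri() {
    echo "s3://pecos-distributed-bucket-$1-$2"
}

# Print the teardown commands in the required order: bucket first, then stacks.
print_teardown_commands() {
    echo "aws s3 rb $(pecos_bucket_uri "$1" "$2") --force"
    echo 'cdk -a "$CDKAPP" destroy --all'
}
```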

Example

Download eurlex-4k data and upload to the S3 bucket created by CDK:

wget https://archive.org/download/pecos-dataset/xmc-base/eurlex-4k.tar.gz
tar -zxvf eurlex-4k.tar.gz
aws s3 cp --recursive ./xmc-base/eurlex-4k \
s3://pecos-distributed-bucket-<YOUR_AWS_ID>-<YOUR_NAME>/input-eurlex-4k/

Submit the job by running the following command from the root folder of pecos (the script path is relative to it):

./$CDKDIR/submit_job.py \
--job-name pecos-train-xlinear-eurlex-4k \
--input-folder input-eurlex-4k \
--output-folder output-eurlex-4k \
--num-nodes 2 \
--cpu 1 \
--memory 60000 \
--commands 'mpiexec -n $AWS_BATCH_JOB_NUM_NODES -f /job/hostfile python3 -m pecos.distributed.xmc.xlinear.train \
-x $PECOS_INPUT/tfidf-attnxml/X.trn.npz \
-y $PECOS_INPUT/Y.trn.npz \
-m $PECOS_OUTPUT/eurlex_model \
--nr-splits 2 -b 50 -k 100 -nst 16 -t 0.1
python3 -m pecos.xmc.xlinear.predict \
-x $PECOS_INPUT/tfidf-attnxml/X.tst.npz \
-y $PECOS_INPUT/Y.tst.npz \
-m $PECOS_OUTPUT/eurlex_model > $PECOS_OUTPUT/eurlex_score.txt'

Here, $PECOS_INPUT contains everything downloaded from the S3 input folder given by your parameters (--input-folder above), and all outputs should be written to $PECOS_OUTPUT so that they are uploaded to the S3 bucket. For the full parameter list, see submit_job.py.
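
As an illustration of this contract, a step inside --commands could read from $PECOS_INPUT and write to $PECOS_OUTPUT like the hypothetical helper below. The function and the manifest file name are made up for the example; only the two environment variables come from the text above.

```shell
# Hypothetical helper: list what was downloaded into $PECOS_INPUT and store the
# listing under $PECOS_OUTPUT, so that it sits alongside the other job outputs.
write_input_manifest() {
    ls "$PECOS_INPUT" > "$PECOS_OUTPUT/input_manifest.txt"
}
```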

After submitting the job, navigate to the AWS Batch web console [link] to check the job's running status.
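
The status can also be polled from the command line. The sketch below assumes the AWS CLI is configured and that the job ID is known (for example, copied from the Batch console); both helper names are hypothetical.

```shell
# Hypothetical helper: print the current status of one AWS Batch job.
#   $1: job ID
batch_job_status() {
    aws batch describe-jobs --jobs "$1" --query 'jobs[0].status' --output text
}

# Return success when a status means the job has finished, one way or the other.
is_terminal_status() {
    case "$1" in
        SUCCEEDED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}
```

A simple wait loop could then be `until is_terminal_status "$(batch_job_status <JOB_ID>)"; do sleep 60; done`.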

After the job is done, check the output model eurlex_model/ and the score file eurlex_score.txt in the output-eurlex-4k folder of the S3 bucket generated by CDK.
