In this step, you create an environment configuration script for training a Natural Language Understanding model and upload the training data to an Amazon S3 bucket.
First, create an Amazon S3 bucket and upload the training data folder to it. The cluster compute nodes will access this training data through Amazon FSx for Lustre.
# generate a unique postfix
export BUCKET_POSTFIX=$(uuidgen --random | cut -d'-' -f1)
echo "Your bucket name will be mlbucket-${BUCKET_POSTFIX}"
aws s3 mb s3://mlbucket-${BUCKET_POSTFIX}
# downloading data:
export URL="https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip"
export FILE="wikitext-103-v1.zip"
wget $URL -O $FILE
unzip $FILE
# upload to your bucket
aws s3 cp wikitext-103 s3://mlbucket-${BUCKET_POSTFIX}/wikitext-103 --recursive
# delete local copies
rm -rf wikitext-103*
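Optionally, confirm the upload before moving on. Assuming the archive extracted to the usual layout, the wikitext-103 prefix should contain wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens; the following command lists the uploaded objects and summarizes their total size.
# list the uploaded objects and report the total count and size
aws s3 ls s3://mlbucket-${BUCKET_POSTFIX}/wikitext-103/ --recursive --human-readable --summarize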
The next step is to create a post-installation script that AWS ParallelCluster executes when it provisions the instances. The script first builds NVIDIA NCCL and the aws-ofi-nccl plugin so that NCCL can communicate over the Elastic Fabric Adapter (EFA) software already installed on the instances. NCCL is the communication library that PyTorch uses for GPU-to-GPU communication. For more information, refer to the Getting started with EFA and NCCL documentation.
The script also installs Miniconda3 and configures an environment with PyTorch and Fairseq on a shared file system. In the upcoming cluster configuration steps, you set up a gp2 Amazon EBS volume that is attached to the head node and exported over NFS so that the compute nodes can mount it at /shared.
cat > post-install.sh << EOF
#!/bin/bash
export HOME=/home/ec2-user

# configure NCCL and the EFA plugin only if CUDA and EFA are present
CUDA_DIRECTORY=/usr/local/cuda
EFA_DIRECTORY=/opt/amazon/efa
OPENMPI_DIRECTORY=/opt/amazon/openmpi
if [ -d "\$CUDA_DIRECTORY" ] && [ -d "\$EFA_DIRECTORY" ]; then

  # build NCCL from source
  NCCL_DIRECTORY=/home/ec2-user/nccl
  if [ ! -d "\$NCCL_DIRECTORY" ]; then
    echo "Installing NVIDIA NCCL"
    cd /home/ec2-user
    git clone https://github.com/NVIDIA/nccl.git
    cd /home/ec2-user/nccl
    make -j src.build
  fi

  # build the aws-ofi-nccl plugin so NCCL can run over EFA via libfabric
  AWS_OFI_DIRECTORY=/home/ec2-user/aws-ofi-nccl
  if [ ! -d "\$AWS_OFI_DIRECTORY" ]; then
    echo "Installing aws-ofi-nccl"
    cd /home/ec2-user
    git clone https://github.com/aws/aws-ofi-nccl.git -b aws
  fi

  cd \$AWS_OFI_DIRECTORY
  ./autogen.sh
  ./configure --with-mpi=\$OPENMPI_DIRECTORY --with-libfabric=\$EFA_DIRECTORY --with-nccl=\$NCCL_DIRECTORY/build --with-cuda=\$CUDA_DIRECTORY
  export PATH=\$OPENMPI_DIRECTORY/bin:\$PATH
  make
  sudo make install
fi

# configure the conda environment on the shared volume
cd /shared
CONDA_DIRECTORY=/shared/.conda/bin
if [ ! -d "\$CONDA_DIRECTORY" ]; then
  # control will enter here if \$CONDA_DIRECTORY doesn't exist
  echo "Conda installation not found. Installing..."
  wget -O miniconda.sh \
    "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
    && bash miniconda.sh -b -p /shared/.conda \
    && /shared/.conda/bin/conda init bash \
    && eval "\$(/shared/.conda/bin/conda shell.bash hook)" \
    && rm -rf miniconda.sh
  conda install python=3.6 -y
fi

FAIRSEQ_DIRECTORY=/shared/fairseq
if [ ! -d "\$FAIRSEQ_DIRECTORY" ]; then
  # control will enter here if \$FAIRSEQ_DIRECTORY doesn't exist
  echo "Fairseq repository not found. Installing..."
  git clone https://github.com/pytorch/fairseq.git \$FAIRSEQ_DIRECTORY
  pip install -e \$FAIRSEQ_DIRECTORY -U
  pip install boto3 torch tqdm
fi

chown -R ec2-user:ec2-user /lustre
chown -R ec2-user:ec2-user /shared
sudo -u ec2-user /shared/.conda/bin/conda init bash
EOF
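The script only builds NCCL and the aws-ofi-nccl plugin; training jobs still need to pick them up at run time. The lines below are a minimal sketch of the environment such a job could export, assuming the install locations used in the script above (NCCL under /home/ec2-user/nccl/build and the plugin under /usr/local, the default prefix of sudo make install); later steps may configure this differently in the job submission scripts.
# sketch: run-time environment for a job using the NCCL build and EFA plugin above
export LD_LIBRARY_PATH=/home/ec2-user/nccl/build/lib:/usr/local/lib:$LD_LIBRARY_PATH
export FI_PROVIDER=efa    # have libfabric select the EFA provider
export NCCL_DEBUG=INFO    # log which transport NCCL actually selects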
# upload to your bucket
aws s3 cp post-install.sh s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
# delete local copies
rm -rf post-install.sh
Before continuing, check the contents of your bucket using the AWS CLI with the command aws s3 ls s3://mlbucket-${BUCKET_POSTFIX}, or use the AWS Management Console.
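Later, once the cluster from the following steps is running, you can confirm that the post-install script built the shared environment by logging in to the head node and running a quick check such as the sketch below (assuming the script completed and /shared is mounted).
# on the head node, after the cluster is provisioned (sketch)
source ~/.bashrc                     # pick up the conda initialization added by the script
python -c "import torch, fairseq; print(torch.__version__, fairseq.__version__)"
fairseq-train --help | head -n 5     # CLI entry point installed by pip install -e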
Next, define the configuration of the ML cluster by creating the AWS ParallelCluster configuration file.