In this step, you create a cluster configuration that supports your Distributed Machine Learning task.
If you are not familiar with AWS ParallelCluster, EFA, and FSx for Lustre, we recommend that you first complete the Amazon FSx for Lustre lab and the EFA lab before proceeding. In particular, you need to be able to examine the FSx for Lustre file system and the EFA-enabled instances. Using NICE DCV to interact with the cluster through a remote desktop is optional; check out the Remote Visualization using NICE DCV lab for more information.
This section assumes that you are familiar with AWS ParallelCluster and the process of bootstrapping a cluster.
Let us reuse the SSH key-pair created earlier.
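If you opened a new terminal, the AWS_KEYPAIR environment variable may no longer be set. Re-export it with the name of your key pair; the value below is only a placeholder for the key pair you created earlier:
# name of the SSH key pair created earlier (replace with your own key pair name)
export AWS_KEYPAIR=lab-your-key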
The cluster configuration that you generate for training large-scale ML models includes the EFA and FSx for Lustre constructs that you explored in the previous sections of this workshop. The main additions to the cluster configuration are:
CapacityType: SPOT

Amazon EC2 Spot Instances are available for less than the cost of On-Demand Instances, but they can be interrupted. Because the training workload provides model checkpointing - saving the model as training progresses - you will be able to restart training after a job failure. Consider other capacity types in the case of limited Spot Instance availability, or when running large-scale training workloads that cannot be interrupted. Refer to this documentation to learn more about the impact of Spot Instance interruptions in ParallelCluster.

CustomActions:
  OnNodeConfigured:
    Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
Iam:
  S3Access:
    - BucketName: mlbucket-${BUCKET_POSTFIX}

The CustomActions section runs the post-install script on each compute node after it is configured, and the Iam section grants the nodes access to the mlbucket-${BUCKET_POSTFIX} Amazon S3 bucket where the script is stored.
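The post-install script should already be in the bucket from the earlier steps. If you need to upload (or re-upload) it yourself, a copy command along these lines would do it, assuming post-install.sh is in your current directory:
# upload the post-install script to the bucket referenced by the cluster configuration
aws s3 cp post-install.sh s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh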
For more details about the configuration options, see the AWS ParallelCluster User Guide, in particular its EFA parameters and FSx parameters sections.
If you are using a different terminal than in the previous section, make sure that the BUCKET_POSTFIX environment variable is still set so that the Amazon S3 bucket name is correct.
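As a quick sanity check (assuming your bucket name is built from the BUCKET_POSTFIX variable set in the earlier section), you can print the bucket name and list its contents:
# confirm the bucket name and that the bucket is reachable
echo mlbucket-${BUCKET_POSTFIX}
aws s3 ls s3://mlbucket-${BUCKET_POSTFIX}/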
# create the cluster configuration
export AWS_REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/region)
export IFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/)
export SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${IFACE}/subnet-id)
cat > ml-config.yaml << EOF
Region: ${AWS_REGION}
Image:
  Os: alinux2
SharedStorage:
  - MountDir: /shared
    Name: default-ebs
    StorageType: Ebs
  - Name: fsxshared
    StorageType: FsxLustre
    MountDir: /lustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://mlbucket-${BUCKET_POSTFIX}
      DeploymentType: SCRATCH_2
HeadNode:
  InstanceType: c5n.2xlarge
  Networking:
    SubnetId: ${SUBNET_ID}
  Ssh:
    KeyName: ${AWS_KEYPAIR}
  Dcv:
    Enabled: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: p3dn24xlarge
          InstanceType: p3dn.24xlarge
          MinCount: 0
          MaxCount: 2
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
      CapacityType: SPOT
      CustomActions:
        OnNodeConfigured:
          Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
      Iam:
        S3Access:
          - BucketName: mlbucket-${BUCKET_POSTFIX}
      Networking:
        SubnetIds:
          - ${SUBNET_ID}
        PlacementGroup:
          Enabled: true
EOF
If you want to check the content of your configuration file, use the following command:
cat ml-config.yaml
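Optionally, recent versions of the pcluster CLI can validate the configuration without creating any resources by using the dry-run flag; this is a quick way to catch schema errors before launching:
# validate the configuration only, without creating any resources
pcluster create-cluster --cluster-name ml-cluster -c ml-config.yaml --dryrun true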
Now, you are ready to create your Distributed ML cluster.
Create the cluster using the following command. This process takes about 15 minutes, depending on the resources and settings.
pcluster create-cluster --cluster-name ml-cluster -c ml-config.yaml
The cluster creation continues even if your terminal session is terminated. To check the status of the creation, use the following command:
pcluster describe-cluster --cluster-name ml-cluster
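If you prefer to wait in the terminal, a simple polling loop over the describe-cluster output works as a sketch (the 30-second interval is arbitrary):
# poll until the cluster leaves the CREATE_IN_PROGRESS state
while pcluster describe-cluster --cluster-name ml-cluster | grep -q CREATE_IN_PROGRESS; do
    echo "cluster creation still in progress..."
    sleep 30
done
pcluster describe-cluster --cluster-name ml-cluster | grep clusterStatus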
Once created, connect to your cluster.
pcluster ssh --cluster-name ml-cluster -i ${AWS_KEYPAIR}.pem
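Once connected to the head node, you can optionally verify that the shared storage is mounted and that the Slurm compute queue is visible; these checks are illustrative rather than required:
# check the shared EBS and FSx for Lustre mounts
df -h /shared /lustre
# list the Slurm partitions, including the compute queue
sinfo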
Next, preprocess the training data.