Automated Lustre Deployment on AWS
Deploying high-performance Lustre filesystems on AWS ParallelCluster traditionally requires extensive manual configuration and coordination across multiple components. This Ansible-based automation provides an interactive deployment workflow that handles everything from cluster sizing to post-installation configuration.
Important: This is a step-by-step Lustre deployment process that builds each Lustre component individually (MGS, MDS, OSS) and creates the filesystem from scratch. This approach does not use AWS built-in services like Amazon FSx for Lustre, but instead deploys a native Lustre filesystem directly on EC2 instances with full control over configuration, performance tuning, and customization.
GitHub Repository: https://github.com/veloduff/hpc/ansible-playbooks/pcluster-lustre
The complete automation scripts, Ansible playbooks, and supporting tools referenced in this post are available in the repository. This includes the Lustre deployment automation, cluster setup scripts, and storage management utilities.
What This Automation Does
The run-pcluster-lustre.sh script provides a complete end-to-end deployment solution:
- Interactive configuration with intelligent defaults
- Pre-configured cluster sizes optimized for different workloads
- Automated Lustre setup with proper component distribution
- Post-installation scripts for immediate usability
- Comprehensive validation of prerequisites and credentials
Getting Started
Prerequisites
# Install required tools
pip install ansible aws-parallelcluster
# Configure AWS credentials
aws configure
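To sanity-check the setup before running the automation, something like the following should work (standard commands; output will vary):
# Verify the tools are on the PATH and that credentials resolve
ansible --version
pcluster version
aws sts get-caller-identity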
Custom AMI
This process depends on the Lustre, DKMS, and ZFS kernel modules being installed on the AMI. The automation handles loading the modules, but it does not install them. For customizing a ParallelCluster AMI, see my Building Custom ParallelCluster AMIs with Lustre Server Support blog post.
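A quick way to confirm an AMI is ready is to launch an instance from it and check for the modules; a minimal sketch (assumes rhel8, with the modules installed as kmod or DKMS packages):
# Confirm the Lustre and ZFS kernel modules are present on the instance
modinfo lustre | grep -E '^(filename|version):'
modinfo zfs | grep -E '^(filename|version):'
dkms status   # lists DKMS-built modules, if DKMS builds were used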
ParallelCluster VPC
This process depends on ParallelCluster being configured, and letting ParallelCluster create the VPC is the recommended approach. See my Creating an HPC Cluster with AWS ParallelCluster blog post for setting up a ParallelCluster VPC.
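If you do not already have a suitable VPC, the interactive wizard can create one for you (the config file name below is just an example):
# The wizard prompts for region, key pair, and networking, and can create a
# new VPC with public/private subnets
pcluster configure --config cluster-config.yaml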
Basic Deployment
# Clone the repository (linked above) and navigate to the playbook directory
git clone https://github.com/veloduff/hpc.git && cd hpc
cd ansible-playbooks/pcluster-lustre
# Run the interactive setup
$ ./run-pcluster-lustre.sh
ParallelCluster Lustre Cluster Ansible Setup
============================================
Verifying AWS credentials... verified
Cluster name [lustre-cluster-Jul23-20250642]:
AWS region []: us-west-2
Custom AMI []: ami-111122223333
Operating System []: rhel8
SSH key file path []: /path/to/my-key.pem
EC2 key pair name [my-key]:
Head node subnet ID []: subnet-12121212
Compute subnet ID []: subnet-23232323
Placement group name []: my-placement-group-01
File system size (small/medium/large/xlarge/local) [small]: large
As an example, here is the lfs df -h output for a 1 PB file system:
$ lfs df -h
...
testfs-OST0032_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:50]
testfs-OST0033_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:51]
testfs-OST0034_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:52]
testfs-OST0035_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:53]
testfs-OST0036_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:54]
testfs-OST0037_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:55]
testfs-OST0038_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:56]
testfs-OST0039_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:57]
testfs-OST003a_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:58]
testfs-OST003b_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:59]
testfs-OST003c_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:60]
testfs-OST003d_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:61]
testfs-OST003e_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:62]
testfs-OST003f_UUID 15.7T 8.0M 15.7T 1% /mnt/lustre[OST:63]
filesystem_summary: 1007.1T 11.3G 1007.1T 1% /mnt/lustre
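Once the file system is mounted, a few standard Lustre client commands are useful for inspecting it (paths assume the /mnt/lustre mount point shown above; the data directory is just an example):
lfs df -h /mnt/lustre                  # per-OST capacity, as in the output above
lfs getstripe -d /mnt/lustre           # default striping for the file system
mkdir /mnt/lustre/data                 # example directory
lfs setstripe -c -1 /mnt/lustre/data   # stripe new files in it across all OSTs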
Pre-Configured File System Sizes
The total file system capacity depends on the number of OSS servers, the number of OSTs per OSS, and the size of each OST (raw capacity ≈ OSS count × OSTs per OSS × OST size).
Note: Each MDS will get one MDT, and the MGS has mirrored MGT volumes.
In the ansible-playbooks/pcluster-lustre/lustre_fs_settings.sh file, the OST size and the number of OSTs per OSS can be changed. Here are the settings for the small file system; with the defaults, it is a 96TB file system with 40K IOPS:
"small")
# Default performance: 20K IOPS, 4.8TB capacity
MDT_USE_LOCAL=false
OST_USE_LOCAL=false
# MGT will be mirrored volumes
MGT_SIZE=1 # Size (GB) for MGT volumes
MGT_VOLUME_TYPE="gp3" # Volume type for MGT (io1, io2, gp3)
MGT_THROUGHPUT=125 # MGT throughput in MiB/s
MGT_IOPS=3000 # MGT IOPS
# Settings for MDTs when *NOT* using local disk (see MDT_USE_LOCAL)
MDTS_PER_MDS=1 # Number of MDTs to create per MDS server
MDT_VOLUME_TYPE="io2" # Volume type for MDT (io1, io2, gp3)
MDT_THROUGHPUT=1000 # MDT Throughput in MiB/s
MDT_SIZE=512 # Size (GB) for MDT volumes
MDT_IOPS=12000 # MDT IOPS
# Settings for OSTs when *NOT* using local disk (see OST_USE_LOCAL)
OSTS_PER_OSS=1 # Number of OSTs to create per OSS server
OST_VOLUME_TYPE="io1" # Volume type for OST (io1, io2, gp3)
OST_THROUGHPUT=250 # Throughput in MiB/s
OST_SIZE=1200 # Size (GB) for OST volumes
OST_IOPS=3000 # IOPS
;;
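As an illustration of how these settings combine (the numbers below are hypothetical, not a recommended profile), doubling the OST size and placing two OSTs on each OSS roughly quadruples the capacity behind each OSS server:
# Hypothetical edit to the "small" block in lustre_fs_settings.sh
OSTS_PER_OSS=2     # two OSTs per OSS server instead of one
OST_SIZE=2400      # 2.4 TB per OST instead of 1.2 TB
# Raw OST capacity ~= (number of OSS servers) x OSTS_PER_OSS x OST_SIZE (GB)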
The Ansible run script ansible-playbooks/pcluster-lustre/run-pcluster-lustre.sh sets the cluster size and instance types, for example:
"small")
HEADNODE_INSTANCE_TYPE="m6idn.xlarge"
MGS_INSTANCE_TYPE="m6idn.large"
MGS_MIN_COUNT=1
MGS_MAX_COUNT=1
MDS_INSTANCE_TYPE="m6idn.xlarge"
MDS_MIN_COUNT=2
MDS_MAX_COUNT=8
OSS_INSTANCE_TYPE="m6idn.xlarge"
OSS_MIN_COUNT=4
OSS_MAX_COUNT=16
BATCH_INSTANCE_TYPE="m6idn.large"
BATCH_MIN_COUNT=4
BATCH_MAX_COUNT=32
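Before deploying, it can be worth confirming that the chosen instance types are offered in your target region and Availability Zones; a minimal check with the standard AWS CLI (region and instance types below are examples):
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=m6idn.large,m6idn.xlarge \
  --region us-west-2 \
  --query 'InstanceTypeOfferings[].[InstanceType,Location]' \
  --output table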
Pre-Configured Cluster Sizes
The script includes five optimized configurations for different use cases:
| Configuration | Use Case | Head Node | MGS | MDS | OSS | Compute |
|---|---|---|---|---|---|---|
| Small | Development, testing, small workloads | m6idn.xlarge | 1x m6idn.large | 2-8x m6idn.xlarge | 4-16x m6idn.xlarge | 4-32x m6idn.large |
| Medium | Production workloads, moderate scale | m6idn.xlarge | 1x m6idn.xlarge | 4-8x m6idn.xlarge | 20-40x m6idn.xlarge | 8-128x m6idn.large |
| Large | High-performance computing, large datasets | m6idn.2xlarge | 1x m6idn.xlarge | 8-16x m6idn.2xlarge | 40-128x m6idn.2xlarge | 16-256x m6idn.xlarge |
| XLarge | Extreme scale, mission-critical workloads | m6idn.2xlarge | 1x m6idn.xlarge | 16x m6idn.2xlarge (fixed) | 40-128x m6idn.2xlarge | 16-256x m6idn.xlarge |
| Local | Maximum performance with local NVMe storage | m6idn.2xlarge | 1x m6idn.xlarge | 16-32x m6idn.2xlarge | 40-64x m6idn.2xlarge | 16-256x m6idn.xlarge |
Automated Post-Installation Pipeline
The script orchestrates the post-installation process:
1. Cluster Setup
- Package installation via cluster_setup.sh
- System configuration and optimization
- Dependency management for Lustre components
2. Lustre Host Configuration
- Host file management via fix_lustre_hosts_files.sh
- Network configuration for Lustre communication
- Service discovery setup
3. Lustre Filesystem Creation
- Component creation via setup_lustre.sh
- MGS/MDS/OSS deployment across designated nodes
- Filesystem mounting and validation
4. Supporting Scripts
- EBS volume management for storage provisioning
- Lustre component configuration with proper settings
- Performance tuning and optimization
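Once the pipeline finishes, a few standard checks confirm that the cluster and file system are healthy (the cluster name and region are the example values used above):
# From your workstation, check cluster status
pcluster describe-cluster --cluster-name lustre-cluster-Jul23-20250642 --region us-west-2
# On a client node, confirm the Lustre mount; on a server node, list the
# configured Lustre devices
mount -t lustre
lctl dl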