Customizing an HPC cluster
Beyond Basic Cluster Deployment
AWS ParallelCluster provides an excellent foundation for deploying HPC infrastructure in the cloud, but out-of-the-box clusters often need additional customization to meet production requirements. ParallelCluster handles the initial deployment of compute resources, network configuration, and scheduler setup; turning these components into a production-ready environment takes further automation.
In this post, I’ll introduce my cluster_setup.sh script that automates post-deployment customization for AWS ParallelCluster environments. This script addresses common requirements for production HPC workloads that aren’t covered by the default ParallelCluster deployment.
GitHub Repository: https://github.com/veloduff/hpc/tree/main/Cluster_Setup/cluster_setup.sh
The complete cluster setup script and supporting tools are available in my HPC repository.
Common Production Requirements
Production HPC environments typically need several customizations beyond the basic cluster deployment:
- Consistent software environments across all nodes
- Efficient cluster-wide administration tools
- Performance monitoring and diagnostics
- Optimized storage configuration
- MPI library setup and validation
- Scheduler customization and testing
My cluster_setup.sh script automates all these customizations in a single operation, ensuring consistency and reducing manual configuration errors.
Prerequisites
- An existing AWS ParallelCluster deployment (see my Creating an HPC Cluster with AWS ParallelCluster guide)
- Basic understanding of Linux and SSH
Automating Cluster Customization
To customize your ParallelCluster deployment:
- Prepare the required scripts:
  - cluster_setup.sh: Main cluster customization script
  - install_pkgs.sh: Package installation dependency script
- Copy these scripts to your head node:
scp cluster_setup.sh install_pkgs.sh ec2-user@<head-node-ip>:~/
- Execute the setup script:
chmod +x cluster_setup.sh install_pkgs.sh
./cluster_setup.sh
What the Customization Script Does
The cluster_setup.sh script performs nine key operations to transform a basic ParallelCluster deployment into a production-ready HPC environment:
- Package Management: Installs essential HPC tools and utilities (pdsh, nvme-cli, monitoring tools)
- Cluster Communication: Sets up pdsh (Parallel Distributed Shell) for parallel command execution
- Host Management: Creates and distributes cluster host files for node-to-node communication
- SSH Configuration: Enables passwordless root access across all cluster nodes
- Environment Setup: Configures .bash_profile for optimal cluster operations
- Storage Management: Cleans up Instance Store devices for filesystem use (Lustre/GPFS)
- Monitoring Tools: Installs performance monitoring and debugging utilities (htop, nmon, iperf3)
- MPI Validation: Creates, compiles, and tests MPI applications for cluster verification
- Slurm Testing: Submits test jobs to validate scheduler functionality
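To make the host management and pdsh steps more concrete, here is a minimal sketch of how a host file could be built from Slurm and handed to pdsh. It is not taken from cluster_setup.sh (see the repository for the real logic). The helper name `build_hostfile` and the path `$HOME/cluster-hosts` are hypothetical, and the usage lines assume Slurm's `sinfo` is on the head node's PATH.

```shell
#!/bin/bash
# Sketch only: not the actual logic in cluster_setup.sh.

# build_hostfile (hypothetical helper): read node names on stdin, one per line,
# and write a sorted, de-duplicated list to the file given as $1.
build_hostfile() {
    local outfile="$1"
    # Drop blank lines, then sort and remove duplicates
    # (sinfo -N lists a node once per partition it belongs to).
    grep -v '^[[:space:]]*$' | sort -u > "$outfile"
}

# Typical use on the head node (assumes Slurm is installed):
#   sinfo -N -h -o "%N" | build_hostfile "$HOME/cluster-hosts"
#   export WCOLL="$HOME/cluster-hosts"   # pdsh reads its default host list from WCOLL
#   pdsh hostname
```

Setting `WCOLL` in a login profile is one way to let plain `pdsh <command>` target every node without passing `-w` each time.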
Benefits of Automated Customization
This automated approach to cluster customization provides several key benefits:
- Consistency: Ensures all nodes have identical configurations
- Efficiency: Reduces setup time from hours to minutes
- Reproducibility: Creates predictable environments across multiple clusters
- Validation: Automatically verifies that the cluster is functioning correctly
Validating the Customized Cluster
After running the setup script, you can verify that your customizations were applied correctly:
- Verify parallel command execution with pdsh:
$ pdsh hostname
batch01-st-batch-01: batch01-st-batch-01
batch01-st-batch-02: batch01-st-batch-02
batch01-st-batch-03: batch01-st-batch-03
...
- Check host file consistency across the cluster:
$ pdsh cat /etc/hosts | dshbak -c
- Verify MPI functionality with the test results:
$ cat mpi-test.out
Currently Loaded Modulefiles:
 1) openmpi/4.1.7
Hello world from processor batch01-st-batch-31, rank 30 out of 32 processors
Hello world from processor batch01-st-batch-27, rank 26 out of 32 processors
...
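Scanning 32 lines of MPI output by eye gets tedious, so a small check can confirm that every rank reported in. The sketch below assumes the output lines look exactly like the "Hello world from processor <host>, rank <N> out of <TOTAL> processors" lines shown above; the helper name `check_mpi_ranks` is hypothetical and not part of cluster_setup.sh.

```shell
#!/bin/bash
# check_mpi_ranks (hypothetical helper): confirm that every rank appears in the
# MPI hello-world output file given as $1. Returns 0 only if the number of
# distinct ranks equals the total reported in the "out of <TOTAL>" text.
check_mpi_ranks() {
    local outfile="$1"
    awk '
        /rank [0-9]+ out of [0-9]+ processors/ {
            for (i = 1; i < NF; i++) {
                if ($i == "rank") { seen[$(i + 1)] = 1 }   # record this rank number
                if ($i == "of")   { total = $(i + 1) }     # expected rank count
            }
        }
        END {
            n = 0
            for (r in seen) n++
            printf "%d of %d ranks reported\n", n, total
            exit !(total > 0 && n == total)
        }
    ' "$outfile"
}
```

For example, `check_mpi_ranks mpi-test.out && echo "MPI OK"` prints the summary line and then "MPI OK" only when no rank is missing.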
Leveraging the Customized Environment
With your customized cluster in place, you can now take advantage of several advanced capabilities:
1. Cluster-Wide Monitoring
Use pdsh to create comprehensive monitoring scripts:
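#!/bin/bash
# Print a cluster-wide status report; dshbak -c groups nodes with identical output
echo "===== CLUSTER STATUS REPORT ====="
# echo -e is needed so bash interprets \n as a newline
echo -e "\n== SYSTEM LOAD =="
pdsh uptime | dshbak -c
echo -e "\n== MEMORY USAGE =="
pdsh free -h | dshbak -c
echo -e "\n== DISK USAGE =="
pdsh df -h | grep -v tmpfs | dshbak -c
echo -e "\n== RUNNING PROCESSES =="
# --sort=-pcpu orders numerically by CPU usage; head -6 keeps the header plus the top five
pdsh "ps -eo pcpu,pmem,pid,user,args --sort=-pcpu | head -6" | dshbak -c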
2. Efficient File Distribution
Use pdcp for parallel file operations:
# Copy configuration files to all nodes
pdcp -w ^$HOME/cluster-ip-addr /path/to/config.file /destination/path/
# Distribute application binaries
pdcp -r -w ^$HOME/cluster-ip-addr /path/to/application/ /opt/apps/
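After a parallel copy, it can be worth confirming that every node ended up with the same file. One way is to run `md5sum` through pdsh and summarize the results. The helper below is a hypothetical sketch, not part of the setup script; it assumes pdsh's default output format of "<host>: <checksum>  <path>".

```shell
#!/bin/bash
# summarize_checksums (hypothetical helper): read pdsh md5sum output on stdin,
# one "<host>: <checksum>  <path>" line per node, and report how many hosts
# answered and how many distinct checksums they returned.
summarize_checksums() {
    awk '
        {
            hosts++
            if (!($2 in sums)) { sums[$2] = 1; distinct++ }
        }
        END { printf "%d hosts, %d distinct checksums\n", hosts, distinct }
    '
}

# Usage, reusing the host file and destination from the pdcp example above:
#   pdsh -w ^$HOME/cluster-ip-addr md5sum /destination/path/config.file | summarize_checksums
```

A result with one distinct checksum and a host count that matches the cluster size suggests the copy succeeded everywhere; nodes where the file is missing write to stderr and are not counted.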
3. Automated Health Checks
Implement periodic health checks across the cluster:
# Check for failed services
pdsh "systemctl list-units --state=failed" | dshbak -c
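# Verify that every node can reach the head node
# ($(hostname) expands locally inside the double quotes, so each node pings this host)
pdsh "ping -c 1 $(hostname)" | dshbak -c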
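Health checks become more useful when they flag only the nodes that need attention. As one illustration, the sketch below filters `df -P` output gathered through pdsh and prints hosts whose filesystem usage meets or exceeds a threshold. The helper name `flag_full_disks` and the 90% threshold in the usage line are assumptions for the example, not something cluster_setup.sh provides.

```shell
#!/bin/bash
# flag_full_disks (hypothetical helper): read "pdsh df -P <mount>" output on
# stdin and print each host whose Capacity column is at or above the
# percentage given as $1.
flag_full_disks() {
    local threshold="$1"
    awk -v limit="$threshold" '
        $2 == "Filesystem" { next }          # skip the df header line from each node
        {
            host = $1; sub(/:$/, "", host)   # strip the trailing colon added by pdsh
            pct = $6;  sub(/%$/, "", pct)    # Capacity column, e.g. "42%"
            if (pct + 0 >= limit + 0) print host, pct "%"
        }
    '
}

# Usage (assumes pdsh has a default host list, for example through WCOLL):
#   pdsh "df -P /" | flag_full_disks 90
```

The `-P` flag keeps each filesystem on a single line in the POSIX format, which makes the column positions predictable for awk.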
Conclusion: From Basic to Production-Ready
While AWS ParallelCluster provides an excellent starting point for HPC in the cloud, the automated customizations provided by the cluster_setup.sh script transform a basic deployment into a production-ready environment. This approach delivers several key benefits:
- Reduced Setup Time: Automates hours of manual configuration
- Improved Reliability: Ensures consistent configuration across all nodes
- Enhanced Functionality: Adds critical tools for cluster management and monitoring
- Validated Environment: Confirms that all components are working correctly
By automating these customizations, you can quickly deploy consistent, production-ready HPC environments that meet the demanding requirements of real-world scientific and engineering workloads.
This post is part of a series on building production-ready HPC environments on AWS. In my next post, I’ll explore how to leverage this customized environment for parallel filesystem deployment and optimization.