Setting up a PBS Scheduler in an AWS Cluster
- Joseph

- Nov 11
Updated: Nov 13
This is the fifth part of an eight-part series on how to set up an HPC cluster on AWS.
This document explains how to set up an OpenPBS job scheduler in an AWS cluster.
The cluster has seven virtual machines (VMs):
One head / control node (node1)
One login node (node2)
Three compute nodes (node3, node4, node5)
Two storage nodes (node6, node7)
All the VMs run Rocky Linux 9.6 (Blue Onyx).
OpenPBS (Portable Batch System) is an open-source workload management and job scheduling system designed for HPC environments. It efficiently allocates compute resources across clusters by queuing, scheduling, and monitoring batch jobs submitted by users. OpenPBS and PBS will be used interchangeably in this document.
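In practice, a batch job is just a shell script with PBS directives in its header. A minimal sketch (the job name and resource request here are illustrative):
#!/bin/bash
#PBS -N hello
#PBS -l select=1:ncpus=1
hostname
Users submit such a script with qsub, and PBS holds it in a queue until a compute node with the requested resources is available.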
There are three main types of nodes in a PBS cluster:
Server Node: Central control node of the PBS cluster. This manages job queues, user requests, and job tracking (node1)
Compute Node: Runs the actual computational jobs submitted by users. It acts as the worker node and executes and monitors jobs assigned by the server (node3, node4, node5)
Client Node: Used by users to log in, compile code, and submit jobs. These nodes do not execute jobs (node2)
In our setup, the storage servers do not execute jobs, so they do not require any PBS component, including the PBS client. In OpenPBS, the MOM (Machine Oriented Mini-server) is the daemon that runs on each compute node and is responsible for executing and managing jobs assigned by the PBS server. Acting as the cluster's worker agent, it receives job scripts, launches them, monitors their progress, and reports job status back to the server.
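Once the cluster is up and running jobs, you can watch a MOM at work by tailing its daily log on a compute node (assuming the PBS_HOME of /var/spool/pbs used later in this guide):
sudo tail -f /var/spool/pbs/mom_logs/$(date +%Y%m%d)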
Initial Installation
The first thing to do when installing OpenPBS is to disable SELinux. Check out the second part of this series on how to do this. The second part also covers how to do the initial installations. The next thing to do is set up passwordless SSH access between all the nodes in the cluster. Check out the third part of this series on how to do this.
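For quick reference, the SELinux change amounts to something like this (a sketch; part two has the full steps):
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
sudo reboot
getenforce   # should now print Disabled
Passwordless SSH boils down to generating a key and copying it to every other node (part three covers this in detail; node2 below stands in for each remote node):
ssh-keygen -t ed25519
ssh-copy-id node2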
After disabling SELinux, install the following packages on all nodes except the storage nodes:
sudo dnf install -y cjson-devel libedit-devel libical-devel ncurses-devel make cmake rpm-build libtool gcc gcc-c++ libX11-devel libXt-devel libXext libXext-devel libXmu-devel tcl-devel tk-devel postgresql-devel postgresql-server postgresql-contrib python3 python3-devel perl expat-devel openssl-devel hwloc-devel java-21-openjdk-devel swig swig-doc vim sendmail chkconfig autoconf automake git
Build from Source
Once the packages are installed, build OpenPBS from source on all nodes (except the storage nodes, since we are not installing PBS there):
sudo git clone https://github.com/openpbs/openpbs.git && cd openpbs
sudo ./autogen.sh
sudo ./configure --prefix=/opt/pbs
sudo make -j$(nproc) && sudo make install
echo "export PATH=/opt/pbs/bin:/opt/pbs/sbin:\$PATH" | sudo tee /etc/profile.d/pbs.sh
source /etc/profile.d/pbs.sh
Once installed, run the following command on each node where PBS was built:
sudo /opt/pbs/libexec/pbs_postinstall
Then set the proper permissions on the files /opt/pbs/sbin/pbs_iff and /opt/pbs/sbin/pbs_rcp:
sudo chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp
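You can confirm the setuid bit took effect with a quick listing:
ls -l /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp   # both should show -rwsr-xr-x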
If it doesn't already exist, create the required directory and set permissions on the head node (node1):
sudo mkdir -p /var/spool/pbs/server_priv/security
sudo chown root:root /var/spool/pbs/server_priv/security
sudo chmod 700 /var/spool/pbs/server_priv/security
Then, on all nodes, set the hostname of the PBS server:
sudo sh -c 'echo "node1" > /var/spool/pbs/server_name'
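As a quick sanity check before configuring anything else, confirm the server name was written and the binaries are in place (the exact version output will vary):
cat /var/spool/pbs/server_name
/opt/pbs/bin/qstat --version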
Configuration Files
Once installed, configure PBS by editing the configuration file /etc/pbs.conf. The configuration file has the following values:
PBS_SERVER: Specifies the hostname of the PBS server. All PBS components (scheduler, communication daemon, and MOMs) will connect to this server for coordination.
PBS_START_SERVER: Tells the system to start the PBS server daemon (pbs_server) on this node. The PBS server manages job queues, user submissions, and system-wide scheduling.
PBS_START_SCHED: Starts the PBS scheduler (pbs_sched), which decides when and where jobs should run based on policies and available resources.
PBS_START_COMM: Starts the PBS communication daemon (pbs_comm), which handles network communication between all PBS components (server, scheduler, and compute nodes).
PBS_START_MOM: Indicates whether the MOM (Machine Oriented Mini-server) process (pbs_mom) should start on this node. The MOM daemon runs on compute nodes to execute and manage jobs; setting this to 0 means the node is not a compute node.
PBS_EXEC: Specifies the installation directory where PBS binaries and scripts are located.
PBS_HOME: Defines the working directory for PBS, where it stores logs, job data, and configuration files.
PBS_CORE_LIMIT: Sets the core file size limit for PBS daemons; unlimited allows full core dumps for debugging if a daemon crashes.
PBS_SCP: Specifies the path to the scp command, used for securely copying files (like job scripts or output) between nodes.
PBS Server
On the PBS server (node1), set the following values:
PBS_SERVER=node1
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
Client Node
On the client node (node2), set the following values:
PBS_SERVER=node1
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_EXEC=/opt/pbs
Compute Nodes
On the compute nodes (node3, node4, node5), set the following values:
PBS_SERVER=node1
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_EXEC=/opt/pbs
Then, on all nodes, start and enable the PBS service:
sudo systemctl start pbs
sudo systemctl enable pbs
sudo systemctl status pbs
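If everything came up cleanly, each node should be running the daemons for its role (pbs_server, pbs_sched, and pbs_comm on node1; pbs_mom on the compute nodes). A quick way to check:
ps -ef | grep pbs_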
Configuring and Verifying PBS Nodes
Now that PBS is properly configured on all nodes, add the compute nodes from the PBS server (node1):
sudo /opt/pbs/bin/qmgr -c "create node node3"
sudo /opt/pbs/bin/qmgr -c "create node node4"
sudo /opt/pbs/bin/qmgr -c "create node node5"
You can verify that the nodes were properly added with the command:
sudo /opt/pbs/bin/qmgr -c "list node @active"
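You can also check each node's state with pbsnodes; once the MOMs have reported in, each node should show state = free:
sudo /opt/pbs/bin/pbsnodes -a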
Also verify PBS is reachable from the client node (node2):
sudo /opt/pbs/bin/qstat -B
In addition, on the head node, set the default server parameters:
sudo /opt/pbs/bin/qmgr -c "set server default_queue = workq"
sudo /opt/pbs/bin/qmgr -c "set server resources_default.select = 1"
sudo /opt/pbs/bin/qmgr -c "set server flatuid = True"
The last one is particularly important, as it tells PBS to treat all user IDs (UIDs) as equivalent across the cluster, so strict UID matching between the server and compute nodes is not enforced. Without it, you may get an error when submitting jobs from the client node (node2).
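At this point you can run a quick end-to-end test from the client node (node2). A minimal sketch: qsub reads a job script from standard input, so a one-line job is enough to exercise the whole pipeline:
echo "hostname" | /opt/pbs/bin/qsub
/opt/pbs/bin/qstat -a
Once the job finishes, its standard output is returned to the directory you submitted from in a file such as STDIN.o0 (STDIN is the default name for jobs read from standard input), containing the name of the compute node that ran it.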
The next part outlines the steps to set up an LDAP system in an AWS cluster. The main GitHub repo is available here.
