
Setting up a PBS Scheduler in an AWS Cluster

  • Writer: Joseph
  • Nov 11
  • 4 min read

Updated: Nov 13

This is the fifth part of an eight-part series on how to set up an HPC cluster on AWS.

This document explains how to set up an OpenPBS job scheduler in an AWS cluster.


The cluster has seven virtual machines (VMs):

  • One head / control node (node1)

  • One login node (node2)

  • Three compute nodes (node3, node4, node5)

  • Two storage nodes (node6, node7)

  • All the VMs run Rocky Linux 9.6 (Blue Onyx)


OpenPBS (Portable Batch System) is an open-source workload management and job scheduling system designed for HPC environments. It efficiently allocates compute resources across clusters by queuing, scheduling, and monitoring batch jobs submitted by users. OpenPBS and PBS will be used interchangeably in this document.

There are three main types of nodes in a PBS cluster:

  1. Server Node: Central control node of the PBS cluster. This manages job queues, user requests, and job tracking (node1)

  2. Compute Node: Runs the actual computational jobs submitted by users. It acts as the worker node and executes and monitors jobs assigned by the server (node3, node4, node5)

  3. Client Node: Used by users to log in, compile code, and submit jobs. These nodes do not execute jobs (node2)


In our setup, the storage servers do not execute jobs, so they do not require any PBS components, including the PBS client. In OpenPBS, the MOM (Machine Oriented Mini-server) is the daemon that runs on each compute node and is responsible for executing and managing jobs assigned by the PBS server. Acting as the cluster’s worker agent, it receives job scripts, launches them, monitors their progress, and reports job status back to the server.


Initial installation

The first things to do when installing OpenPBS are to disable SELinux and complete the initial installations; both are covered in the second part of this series. Next, set up passwordless SSH access between all the nodes in the cluster, as described in the third part of this series.
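
For a quick recap (the full walkthrough is in parts two and three), disabling SELinux and setting up key-based SSH typically look something like the sketch below; the node name is just an example, so adjust it to match your cluster:

sudo setenforce 0                                                          # switch to permissive for the current session
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config   # persist the change across reboots
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519                           # skip if a key already exists
ssh-copy-id node3                                                          # repeat for every other node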


After disabling SELinux, install the following packages on all nodes:

sudo dnf install -y  cjson-devel libedit-devel libical-devel ncurses-devel make cmake rpm-build libtool gcc gcc-c++ libX11-devel libXt-devel libXext libXext-devel libXmu-devel tcl-devel tk-devel postgresql-devel postgresql-server postgresql-contrib python3 python3-devel perl expat-devel openssl-devel hwloc-devel java-21-openjdk-devel swig swig-doc vim sendmail chkconfig autoconf automake git


Build from Source


Once the packages are installed, build OpenPBS from source on all nodes except the storage nodes (we are not installing PBS on them):

sudo git clone https://github.com/openpbs/openpbs.git && cd openpbs
sudo ./autogen.sh
sudo ./configure --prefix=/opt/pbs
sudo make -j$(nproc) && sudo make install
echo "export PATH=/opt/pbs/bin:/opt/pbs/sbin:\$PATH" | sudo tee /etc/profile.d/pbs.sh
source /etc/profile.d/pbs.sh
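
As a quick sanity check, confirm the PBS binaries are now on the PATH; each command below should resolve to a path under /opt/pbs:

command -v qsub qstat pbsnodes pbs_server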

Once installed, run the post-install script on each node where PBS was built:

sudo /opt/pbs/libexec/pbs_postinstall

Then set the proper permissions on the files /opt/pbs/sbin/pbs_iff and /opt/pbs/sbin/pbs_rcp:

sudo chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp
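
You can confirm the setuid bit took effect with ls; both files should be owned by root and show an s in the owner execute position (-rwsr-xr-x):

ls -l /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp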

If it doesn't already exist, create the required security directory and set its permissions on the head node (node1):

sudo mkdir -p /var/spool/pbs/server_priv/security
sudo chown root:root /var/spool/pbs/server_priv/security
sudo chmod 700 /var/spool/pbs/server_priv/security

Then, on all nodes, set the hostname of the PBS server:

sudo sh -c 'echo "node1" > /var/spool/pbs/server_name'
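
A quick check that the server name is in place and that node1 resolves from the node you are on:

cat /var/spool/pbs/server_name    # should print node1
getent hosts node1                # should return node1's IP address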

Configuration Files

Next, configure PBS by editing the configuration file /etc/pbs.conf. The file contains the following values:

  • PBS_SERVER: Specifies the hostname of the PBS server. All PBS components (scheduler, communication daemon, and MOMs) connect to this server for coordination.

  • PBS_START_SERVER: Controls whether the PBS server daemon (pbs_server) starts on this node. The PBS server manages job queues, user submissions, and system-wide scheduling.

  • PBS_START_SCHED: Controls whether the PBS scheduler (pbs_sched) starts on this node. The scheduler decides when and where jobs should run based on policies and available resources.

  • PBS_START_COMM: Controls whether the PBS communication daemon (pbs_comm) starts on this node. It handles network communication between all PBS components (server, scheduler, and compute nodes).

  • PBS_START_MOM: Controls whether the MOM (Machine Oriented Mini-server) process (pbs_mom) starts on this node. The MOM daemon runs on compute nodes to execute and manage jobs; setting this to 0 means the node is not a compute node.

  • PBS_EXEC: Specifies the installation directory where PBS binaries and scripts are located.

  • PBS_HOME: Defines the working directory for PBS, where it stores logs, job data, and configuration files.

  • PBS_CORE_LIMIT: Sets the core file size limit for PBS daemons to unlimited, allowing full core dumps for debugging if a daemon crashes.

  • PBS_SCP: Specifies the path to the scp command, used for securely copying files (like job scripts or output) between nodes.


PBS Server

On the PBS server (node1), set the following values:


PBS_SERVER=node1
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

Client Node

On the client node (node2), set the following values:

PBS_SERVER=node1
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_EXEC=/opt/pbs

Compute Nodes

On the compute nodes (node3, node4, node5), set the following values:


PBS_SERVER=node1
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_EXEC=/opt/pbs

Then, on all nodes, start and enable the PBS service:

sudo systemctl start pbs
sudo systemctl enable pbs
sudo systemctl status pbs
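
If you want to confirm that the right daemons came up on each node, a simple process check works. With the configuration above, node1 should be running pbs_server, pbs_sched, and pbs_comm, the compute nodes only pbs_mom, and the client node no PBS daemons at all:

pgrep -af pbs_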

Configuring and Verifying PBS Nodes

Now that PBS is properly configured on all nodes, add the compute nodes from the PBS server (node1):

sudo /opt/pbs/bin/qmgr -c "create node node3"
sudo /opt/pbs/bin/qmgr -c "create node node4"
sudo /opt/pbs/bin/qmgr -c "create node node5"

You can verify that the nodes were added correctly with the command:

sudo /opt/pbs/bin/qmgr -c "list node @active"

Also verify PBS is reachable from the client node (node2):

sudo /opt/pbs/bin/qstat -B
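
pbsnodes gives a more detailed view of each compute node; healthy nodes report state = free:

sudo /opt/pbs/bin/pbsnodes -a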

In addition, on the head node, set the default server parameters:

sudo /opt/pbs/bin/qmgr -c "set server default_queue = workq"
sudo /opt/pbs/bin/qmgr -c "set server resources_default.select = 1"
sudo /opt/pbs/bin/qmgr -c "set server flatuid = True"

The last one is particularly important, as it tells PBS to treat all user IDs (UIDs) as equivalent across the cluster. This ensures that strict UID matching between the server and compute nodes is not enforced. Without it, you may get an error when submitting jobs from the client node (node2).
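
With those parameters in place, a small test job is a good way to confirm the whole chain (client, server, scheduler, and MOMs) works end to end. The script below is a minimal sketch; the job name, file name, and resource request are arbitrary examples, and the queue matches the workq default set above.

#!/bin/bash
#PBS -N hello_pbs
#PBS -q workq
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:05:00
# print which compute node picked up the job
hostname
date

Save it as, say, test_job.sh and submit it from the client node (node2) as a regular user:

qsub test_job.sh    # returns a job ID such as 0.node1
qstat               # the job should move from Q (queued) to R (running) and then finish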


The next part outlines the steps to set up an LDAP system in an AWS cluster. The main GitHub repo is available here.

