Setting up Prometheus and Grafana in an AWS Cluster
- Joseph

- Nov 13
Updated: Nov 13
This is the seventh part of an eight-part series on how to set up an HPC cluster on AWS.
This document explains how to set up Prometheus and Grafana to monitor an AWS cluster.
The cluster has seven virtual machines (VMs):
One head / control node (node1)
One login node (node2)
Three compute nodes (node3, node4, node5)
Two storage nodes (node6, node7)
All the VMs run Rocky Linux 9.6 (Blue Onyx).
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects and stores time-series metrics from systems, applications, and services. Grafana is a visualisation and analytics platform that integrates with Prometheus and other data sources. It allows users to create interactive dashboards, set up alerts, and analyse performance trends in real time. Together, Prometheus and Grafana form a robust monitoring stack: Prometheus handles data collection and storage, while Grafana provides visual insights and alerting.
To avoid discrepancies when installing packages, follow the steps outlined in the second part of this series. Also, make sure passwordless SSH access is set up between all the nodes in the cluster. The third part of this series explains how to do this.
Prometheus
Prometheus will be installed on the head node (node1). To install Prometheus, first, create a dedicated user for Prometheus:
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus
id prometheus
Then download and extract Prometheus, and link the binaries into your system path so they can be run from anywhere in the terminal:
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v3.5.0/prometheus-3.5.0.linux-amd64.tar.gz
sudo tar -xvf /tmp/prometheus-3.5.0.linux-amd64.tar.gz -C /opt/
sudo ln -s /opt/prometheus-3.5.0.linux-amd64/prometheus /usr/local/bin/prometheus
sudo ln -s /opt/prometheus-3.5.0.linux-amd64/promtool /usr/local/bin/promtool
You can verify the installation by checking the versions:
prometheus --version
promtool --version
Then create the necessary directories and set the appropriate permissions:
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Next, create the Prometheus configuration file at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node_exporters'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100', 'node4:9100', 'node5:9100', 'node6:9100', 'node7:9100']
scrape_interval: Sets how often Prometheus collects (or scrapes) metrics from all configured targets.
job_name: Labels this scrape job (node_exporters). This is helpful for organising and filtering metrics later.
static_configs: Lists fixed, manually defined targets.
targets: Specifies the hosts and ports to scrape. Each target here is a node running Node Exporter, a Prometheus agent that exposes system metrics like CPU, memory, and disk usage on port 9100.
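Typing seven near-identical targets by hand is easy to get wrong, so here is a minimal sketch that generates the list instead. The function name node_targets is made up for illustration; the node prefix and port 9100 are the ones used in this article. The commented promtool line at the end uses the promtool binary installed earlier to check the finished file.

```shell
# Print a quoted, comma-separated target list for hosts named
# <prefix><first> .. <prefix><last>, all on the same port.
# node_targets is a hypothetical helper, not part of Prometheus.
node_targets() (
  first=$1; last=$2; port=${3:-9100}; prefix=${4:-node}
  out=""
  i=$first
  while [ "$i" -le "$last" ]; do
    # Add ", " before every entry except the first one.
    out="${out}${out:+, }'${prefix}${i}:${port}'"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
)

# Emit the targets line for node1..node7, ready to paste under static_configs.
printf -- "- targets: [%s]\n" "$(node_targets 1 7 9100)"

# After editing the file, check it for syntax errors:
#   promtool check config /etc/prometheus/prometheus.yml
```

Keep the YAML indentation intact when pasting the generated line into prometheus.yml, since the nesting under static_configs is significant.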
Now set the ownership and permissions for the configuration file:
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
sudo chmod 644 /etc/prometheus/prometheus.yml
Then create a systemd service file, /etc/systemd/system/prometheus.service, for Prometheus with the following content:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090
Restart=on-failure
[Install]
WantedBy=multi-user.target
Description: A short description of the service (shown when you run systemctl status prometheus).
Wants: Asks systemd to bring up network-online.target along with Prometheus, without failing the service if that target cannot be started.
After: Orders startup so that Prometheus is launched only once network-online.target has been reached, that is, after the network is up.
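The unit file only controls when Prometheus is launched, not whether it has finished loading. Once the service has been started in the next step, a small polling loop like the sketch below can confirm that it is actually serving requests. The helper name wait_until_ready and its defaults (10 tries, 2 seconds apart) are assumptions for illustration; /-/ready is Prometheus's readiness endpoint on the port set by --web.listen-address.

```shell
# Poll a URL until it answers with an HTTP success status, or give up.
# Usage: wait_until_ready URL [TRIES] [DELAY_SECONDS]
# wait_until_ready is a hypothetical helper written for this article.
wait_until_ready() (
  url=$1; tries=${2:-10}; delay=${3:-2}
  n=1
  while [ "$n" -le "$tries" ]; do
    # -f: non-zero exit on HTTP errors; -sS: no progress bar, but show errors.
    if curl -fsS -o /dev/null "$url"; then
      echo "ready after $n attempt(s): $url"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "not ready after $tries attempt(s): $url" >&2
  return 1
)

# Example (run on node1 after the service has been started):
#   wait_until_ready http://localhost:9090/-/ready
```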
Finally, reload the systemd daemon and start the Prometheus service:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
Installing Node Exporter
Node Exporter is a lightweight agent used with Prometheus to collect detailed hardware and OS-level metrics from Linux systems. It runs on each node and exposes system information such as CPU usage, memory consumption, disk I/O, filesystem statistics, and network performance via an HTTP endpoint (usually :9100/metrics). In a monitoring setup, Prometheus periodically scrapes these metrics from Node Exporter, helping administrators track system health, detect performance issues, and analyse resource utilisation across nodes in a cluster or HPC environment.
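Node Exporter has to be installed on the login node (node2), the compute nodes (node3, node4, node5), and the storage nodes (node6, node7), so the next steps have to be repeated on each of them. The Prometheus configuration above also lists node1:9100, so repeat the steps on the head node (node1) as well if that target should report as up.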
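Since passwordless SSH is already in place, the repetition can be driven from one machine. The sketch below is one possible way to do that, not the method used in this series: run_on_nodes and the DRY_RUN switch are hypothetical names, and by default the loop only prints the commands so nothing is executed by accident.

```shell
# Nodes that need Node Exporter, as listed in this article.
# Add node1 if the head node should be scraped too (it is listed in prometheus.yml).
NODES="node2 node3 node4 node5 node6 node7"

# Run the given command on every node in $NODES over SSH.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of running them.
run_on_nodes() {
  for node in $NODES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh $node $*"
    else
      ssh "$node" "$@"
    fi
  done
}

# Dry run: show what would be executed on each node.
run_on_nodes hostname

# To execute for real:
#   DRY_RUN=0 run_on_nodes hostname
```

Starting with a harmless command such as hostname is a quick way to confirm that SSH works to every node before running the installation steps through the same loop.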
First, create a dedicated user for Node Exporter:
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter
id node_exporter
Then download and install Node Exporter:
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.10.2/node_exporter-1.10.2.linux-amd64.tar.gz
sudo tar -xvf /tmp/node_exporter-1.10.2.linux-amd64.tar.gz -C /opt/
sudo ln -s /opt/node_exporter-1.10.2.linux-amd64/node_exporter /usr/local/bin/node_exporter
Verify the installation by checking the version:
/usr/local/bin/node_exporter --version
Then create a systemd service file, /etc/systemd/system/node_exporter.service, for Node Exporter:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
This systemd service file configures Node Exporter to run automatically as a background service on system startup. It starts after the network is available, runs under the dedicated node_exporter user, and executes the Node Exporter binary to expose system metrics (on port 9100 by default). The service restarts automatically if it fails, with a 5-second delay between attempts, so that metrics stay available for Prometheus.
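The metrics endpoint returns plain text, one "name value" pair per line, with comment lines starting with #. As a rough sketch of how to read a single value from it, the helper below (metric_value, a name invented here) picks out one unlabelled metric such as node_load1. The sample input is illustrative only and not output from a real node.

```shell
# Read Prometheus text-format metrics on stdin and print the value of one
# unlabelled metric. metric_value is a hypothetical helper for this article.
metric_value() {
  # Match lines whose first field is exactly the metric name; print the second field.
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Illustrative sample input (made-up value).
metric_value node_load1 <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
EOF

# Against a running exporter, once the service has been started:
#   curl -s http://node2:9100/metrics | metric_value node_load1
```

Metrics that carry labels, such as per-CPU or per-filesystem series, have the labels attached to the name and would need a different match; this sketch deliberately covers only the simple case.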
Finally, reload the systemd daemon and start the Node Exporter service:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
Installing Grafana
To install Grafana, first, import the Grafana GPG key:
cd /tmp
wget -q -O gpg.key https://rpm.grafana.com/gpg.key
sudo rpm --import gpg.key
Then create the Grafana repository file /etc/yum.repos.d/grafana.repo:
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
Once the repository is set up, install Grafana:
sudo dnf install grafana -y
grafana-server -v
Then start and enable the Grafana service:
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
Finally, access the Grafana web interface by navigating to the following URL in the web browser on your local system:
http://<ip-of-management-node>:3000
The default username and password are both admin.
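Grafana still needs to be told where Prometheus lives. That can be done in the web interface, or with a provisioning file as in the sketch below. This is an assumption-laden example rather than part of the original setup: it assumes Prometheus is reachable from the Grafana host at http://node1:9090 and that Grafana reads data source files from /etc/grafana/provisioning/datasources, which is the default location for the RPM package. The helper name write_datasource is made up.

```shell
# Write a Grafana data source provisioning file that points at Prometheus.
# $1 = output file, $2 = Prometheus URL. write_datasource is a hypothetical helper.
write_datasource() {
  cat > "$1" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $2
    isDefault: true
EOF
}

# Generate the file in a temporary location and show it.
tmp=$(mktemp)
write_datasource "$tmp" http://node1:9090   # assumed Prometheus address
cat "$tmp"

# To apply it on the Grafana host:
#   sudo cp "$tmp" /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```

After the restart, the Prometheus data source should appear in Grafana and can be selected when building dashboards.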
The next part outlines how I have automated the entire setup using Terraform and Ansible. The main GitHub repo is available here.
