1. Introduction to HPC Systems

1.1. Overview

This tutorial provides a high-level introduction to High-Performance Computing (HPC) systems, with a focus on Cyclone's architecture and operational principles. Participants will explore the fundamental components of an HPC system, including compute nodes and file systems, while gaining insight into data management policies. Additionally, the tutorial introduces key concepts such as software management through modules and job scheduling using SLURM, setting the stage for deeper exploration in subsequent tutorials.

1.2. Learning Objectives

By the end of this tutorial, participants will be able to:

  1. Describe the architecture of an HPC system, including Cyclone’s compute nodes and interconnects.
  2. Identify and understand the use cases of Cyclone’s file systems (home, scratch, shared directories).
  3. Identify when to use an HPC system or alternative solutions such as Cloud systems or High-end Workstations.
  4. Recognize the role of modules in managing software environments and how they simplify system use.
  5. Understand the purpose of job scheduling and the function of SLURM in resource management.

1.3. Overview of HPC Architectures

An HPC system is typically composed of multiple interconnected compute nodes that work in parallel to solve large-scale tasks. Each compute node contains CPUs and, in some cases, GPUs that perform the computations. The nodes are linked by high-speed networks, which enable them to share data efficiently and work in parallel.

Figure: HPC System High-level Overview

Key components of an HPC system:

  1. Login Nodes: These are the entry points for users to connect to the system. Users prepare their jobs, access resources, and manage data from the login nodes. However, compute jobs are not executed here.
  2. Compute Nodes: These nodes execute user jobs. They often come with specialized hardware, like GPUs, to accelerate computational workloads.
  3. Interconnects: These are high-speed networks that link the compute nodes together, ensuring fast data transfer between them. The interconnects are critical for achieving low latency and high bandwidth communication essential for parallel computing.
  4. Storage Systems: These systems store and manage data. They include both local storage for temporary data and shared storage for datasets that are accessible across all nodes in the system through the available file system.

The architecture ensures that large computational tasks can be broken down into smaller sub-tasks, which are processed simultaneously across the nodes.

1.4. Overview of Cyclone

Cyclone is the National HPC system hosted at The Cyprus Institute. It is designed to support a range of disciplines, including physics, climate modeling, bioinformatics, AI and more. Cyclone consists of login and compute nodes, high-speed interconnects and a file system to manage data storage across different types of files and directories.

1.4.1 Architecture Configuration

  1. Compute Nodes: These are the main engines that run the jobs and computations. Cyclone has two types of compute nodes:
    • CPU Nodes: There are 17 nodes, each with 40 cores. These are used for general-purpose computing tasks like simulations, data processing, and more.
    • GPU Nodes: There are 16 nodes, each with 40 cores and 4 NVIDIA V100 GPUs. These specialized nodes are ideal for tasks that require faster computation, such as deep learning (AI), large-scale simulations, or other workloads that benefit from GPU acceleration.
  2. Node Details:
    • Each node has 2 CPUs, each with 20 cores (40 cores in total). These are Intel Xeon Gold 6248 processors, well suited to handling parallel tasks.
    • Each node also has 192 GB of memory (RAM). This is where the data is temporarily stored and processed while jobs are running.
  3. Storage:
    • Cyclone has 135 TB of fast NVMe storage for the Scratch and Home directories. These directories are used to store files and data that you’re working on during a job. The Scratch directory is for temporary files, and the Home directory is for your personal files and scripts.
    • Additionally, there is 3.2 PB (Petabytes) of Shared Disk Storage. This is a large space used for collaboration, allowing multiple users or teams to access and share project data.
  4. Interconnect: The system has an HDR 100 Node-to-Node interconnect, a high-speed network (up to 100 Gb/s) that allows nodes to communicate with each other very quickly. This is especially important for tasks that move a lot of data between nodes, such as simulations and large-scale data analysis.
  5. Operating System: Cyclone uses Rocky Linux 8.6, an operating system that’s optimized for high-performance computing tasks.

This system is designed to support demanding tasks, like scientific simulations, machine learning, and data analysis, providing the necessary computing power, fast data transfer, and shared storage for large datasets.
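
If you want to verify this hardware yourself, the commands below can be run from within a job on a compute node (not on a login node). They are generic Linux and NVIDIA utilities, not Cyclone-specific tools:

lscpu | grep -E 'Model name|Socket|Core'   # CPU model, sockets, and cores per socket
free -h                                    # total memory available on the node
nvidia-smi                                 # GPU model and status (GPU nodes only)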

1.4.2. File System Overview

When working on Cyclone, your files will be stored in specific directories tailored to different purposes. Each directory has unique performance characteristics and retention policies. It is important to note that no backups are performed on any of these directories—it is solely the user's responsibility to maintain their own backups.

Home Directory
  • Path: /nvme/h/your_username/
  • Description: Personal space on Cyclone for storing configuration files, scripts, and small, critical files.
  • Performance: Moderate I/O performance.
  • Retention: Persistent storage with no automatic cleanup, but limited in size. Monitor usage to avoid exceeding your quota.
  • Usage Tips: Store SSH keys, environment setup files, and small codebases. Avoid storing large datasets or temporary files here. The home path can also be displayed by using the $HOME variable.

Scratch Directory
  • Path: /nvme/scratch/your_username/
  • Description: High-performance, temporary storage for active computations and intermediate files.
  • Performance: High I/O performance optimized for compute-intensive workloads.
  • Retention: Temporary storage. Files may be deleted after a set period or when the system requires space.
  • Usage Tips: Use for large datasets or files generated during computations. Regularly move important results to your home directory or a local backup to prevent data loss. The user Scratch path can also be located at ~/scratch.

Shared Data Directory
  • Path: /onyx/data/project_name/
  • Description: Shared project directory available to multiple users for collaborative datasets or resources.
  • Performance: Moderate to low I/O performance.
  • Retention: Persistent storage, but subject to project-specific quotas and policies.
  • Usage Tips: Collaborate with team members by storing shared input data and results. Ensure file organization and naming conventions are clear for effective collaboration.
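
For example (a minimal sketch; the project and directory names below are illustrative), important results can be copied from Scratch back to Home at the end of a job:

cp -r ~/scratch/my_project/results $HOME/my_project_results

For larger or repeated transfers, rsync -av ~/scratch/my_project/results/ $HOME/my_project_results/ avoids re-copying files that have not changed.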

1.4.3. Important Notes and Best Practices

  • No Backups: None of these directories are backed up by Cyclone. You must regularly back up your important data to a secure location.
  • Data Responsibility: It is the user's sole responsibility to maintain copies of critical files. Loss of data in these directories due to system failure or cleanup policies is irreversible.
  • Store active job data in the Scratch directory.
  • Keep your source code and executables in the Home directory.
  • Store large shared data in Shared directories for collaboration.
  • Always back up important results from the Scratch directory to the Home directory or external storage to avoid data loss.

1.5. HPC Systems vs Cloud Systems vs High-End Workstations

HPC systems, cloud platforms, and high-end workstations all offer powerful computational resources, but each is optimized for different tasks and use cases. Here's a comparison to help understand which system is more suitable for specific applications:

HPC Systems
  • Purpose: Optimized for large-scale, parallel, and intensive computations like simulations, AI, and complex calculations.
  • Hardware: Specialized CPUs, GPUs, and high-speed interconnects optimized for performance.
  • Resource Allocation: Fixed, controlled environment with job schedulers like SLURM to allocate resources efficiently.
  • When to Use: Ideal for large-scale simulations (climate modeling, molecular dynamics), intensive data processing (genomics, weather prediction), or AI tasks requiring parallel computing (large-scale training, hyperparameter optimisation).

Cloud Systems
  • Purpose: Designed for scalability and flexibility; best for small-scale tasks, web applications, or quick scalability.
  • Hardware: General-purpose hardware and flexible virtual machines with varying capabilities.
  • Resource Allocation: On-demand, pay-as-you-go provisioning with flexible scaling.
  • When to Use: Best for web applications, small to medium workloads, or cost-effective solutions for short-term or ad-hoc tasks.

High-End Workstations
  • Purpose: Suitable for single-machine tasks like 3D rendering, video editing, and software development, where extreme parallelism is not required.
  • Hardware: Single high-performance CPU and GPU, moderate memory capacity (64 GB to 128 GB), optimized for individual tasks.
  • Resource Allocation: Limited scalability, bound by local hardware limitations.
  • When to Use: Best for personal or small-team use, with high power needed for design, simulations, or smaller AI workloads.

1.6. Introduction to Modules

In a High-Performance Computing (HPC) system like Cyclone, users often need to work with specialized software and libraries that are required for their research or computational tasks. Modules help manage these software environments in a way that simplifies the process and ensures compatibility across different users and applications.

1.6.1. What are Modules?

A module is a tool that allows users to dynamically load, unload, and switch between different software environments without having to manually configure system paths or dependencies. Modules are used to make software easier to access on an HPC system, so you don’t need to worry about installation or environment conflicts.
For example, when you want to use a specific version of Python, you can load the Python module for that version, and the system will automatically configure the necessary settings for you.

1.6.2. Why Use Modules?

Using modules has several key benefits:
  • No Root Access: On HPC systems, users do not have administrative privileges and therefore cannot install software the way they would on a personal Linux system (e.g., using sudo apt-get).
  • Simplifies Environment Management: You don’t need to worry about setting up complicated software environments. The system automatically adjusts paths and settings when you load a module.
  • Prevents Software Conflicts: Many HPC systems, including Cyclone, host multiple versions of software. Modules ensure that you can use the right version for your task without causing conflicts with other users.
  • Saves Time: Instead of manually installing or configuring software, you can simply load the required module and begin working right away.
  • Ensures Reproducibility: By using modules, you ensure that your environment is consistent across different sessions and that your work can be replicated by others.

1.6.3. How Modules Work

Modules work by modifying the environment variables (like PATH, LD_LIBRARY_PATH, and others) to point to the correct version of the software. These environment variables tell the system where to find executables, libraries, and other resources required for the software to run properly.
For example, if you load the Python 3.10 module, the system will automatically adjust the environment to use the Python 3.10 binary and related libraries without affecting other users who may be using a different version of Python.
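
As an illustration (the module name and version string below are assumptions; check what is actually installed with module avail):

module avail Python          # list the Python modules installed on the system
module load Python/3.10.4    # hypothetical version; load one of the versions listed above
which python                 # now resolves to the Python binary provided by the module
echo $PATH                   # the module's bin directory has been prepended to PATH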

1.6.4. Common Module Commands Cheat Sheet

Below is a summary of the most commonly used module commands to manage your software environment on Cyclone:
  • List available modules: module avail (view all available software modules)
  • Load a module: module load Python/3.8.5 (set up the software environment for a specific package and version)
  • Unload a module: module unload Python/3.8.5 (remove a module that is no longer needed)
  • Check loaded modules: module list (view all modules currently loaded in your environment)
  • Switch between modules: module switch Python/3.8.5 Python/3.9.1 (replace one version of a module with another)

1.7. Introduction to Job Scheduling and SLURM

In systems like Cyclone, multiple users often share the same resources (e.g., CPUs, memory, GPUs). To ensure that everyone gets fair access to the system’s resources, HPC systems use job scheduling. SLURM (Simple Linux Utility for Resource Management) is the job scheduler used on Cyclone to manage and allocate resources for running computational tasks.

1.7.1. What is Job Scheduling?

Job scheduling is a process where the system manages the allocation of computational resources for running tasks (jobs). Instead of users running tasks directly on the system, SLURM queues up jobs, decides when and where they should run, and allocates resources such as CPUs, memory, and GPUs to those jobs.
This is especially important in a multi-user environment like Cyclone, where many tasks might need to run at the same time. SLURM helps manage and prioritize these tasks, ensuring that resources are used efficiently and fairly.

1.7.2. Why is Job Scheduling Important?

Job scheduling is important because:
  • Fair Resource Allocation: It ensures that no single user monopolizes the system and that resources are shared equitably among all users.
  • Efficient Use of Resources: SLURM makes sure that the system’s resources are used optimally, by deciding the best time and place to run each job.
  • Queue Management: SLURM organizes jobs into queues and allows jobs to wait in line until resources become available.

1.7.3. How Does SLURM Work?

SLURM divides resources into partitions (groups of compute nodes) based on their hardware and intended use. A user submits a job request, and SLURM assigns the job to the most appropriate node, based on the job’s resource requirements (e.g., number of cores, memory, GPU).
SLURM provides several commands to interact with the job scheduler and manage jobs. Below are the key SLURM commands that you will use to submit and manage jobs.

1.7.4. Key SLURM Commands Cheat Sheet

  • Job submission: sbatch <script> submits a job using a submission script.
  • Cancel a job: scancel <job_id> cancels a specific job using its job ID.
  • Hold a job: scontrol hold <job_id> places a job on hold, preventing it from starting.
  • Release a job: scontrol release <job_id> releases a held job, allowing it to start when resources become available.
  • Queue overview: squeue shows the status of all jobs in the queue.
  • Your queued jobs: squeue -u <your_username> shows only the jobs belonging to your user account.
  • Detailed job info: scontrol show job <job_id> shows detailed information about a specific job.
  • Job history: sacct -j <job_id> shows historical statistics and performance for a specific job.
  • Partition info: sinfo lists the available partitions, their nodes, and current states.
  • Node details: sinfo -N displays detailed node information for all partitions.
  • Resources per node: scontrol show node <node_name> displays detailed resource availability for a specific node.

1.7.5. Job Status Symbols

SLURM uses the following symbols to indicate the current state of a job:

  • PD (Pending): Job is waiting for resources or dependencies to become available.
  • R (Running): Job is currently executing.
  • CG (Completing): Job is completing and cleaning up resources.
  • CF (Configuring): Job is configuring and setting up resources.
  • CD (Completed): Job has finished successfully.
  • F (Failed): Job failed to execute successfully.
  • TO (Timeout): Job exceeded the allocated time limit.
  • CA (Canceled): Job was canceled by the user or an administrator.
  • NF (Node Failure): Job failed due to a node failure.
  • ST (Stopped): Job has been stopped.

1.7.6. Submitting a Job with sbatch

To submit a job to SLURM, you will typically create a job script (a text file with the required commands) and submit it using the sbatch command. The job script defines the resources your job needs (such as the number of CPUs, memory, and time), along with the command to run your application.

Example Job Script: my_job_script.sh

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --time=01:00:00  # Set the maximum runtime (hh:mm:ss)
#SBATCH --ntasks=1        # Number of tasks (processes) to run
#SBATCH --mem=4G          # Memory required for the job
#SBATCH --partition=cpu   # Which partition to run the job on

# Your command(s) to run the job 
# e.g., a Python script or an executable
python my_script.py

In this script:
  • #SBATCH lines are SLURM directives that define resource requirements.
  • After the resource definitions, the script contains the command to run the job, in this case a Python script (python my_script.py). Instead of a single command, this could also be a series of commands.
Once the job script is ready, you can submit it with the following command:

sbatch my_job_script.sh
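
If the submission succeeds, sbatch prints the ID assigned to the job, for example (the job ID will differ):

Submitted batch job 12345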

1.7.7. Monitoring Jobs

Monitoring your jobs on Cyclone is essential to ensure they are running smoothly. Use the squeue command to check the status of your active jobs, including information such as job ID, partition, and node allocation. If a job needs to be canceled, you can use the scancel command with the job ID. Once a job completes, the sacct command allows you to view detailed statistics, such as resource usage and job duration.

Checking Job Status

Use squeue to view active jobs:

squeue -u <your_username>

Output example:

JOBID   PARTITION   NAME      USER   ST   TIME    NODES   NODELIST(REASON)
12345   cpu         test_job  user1  R    00:05:12   1      cn01

Canceling Jobs

Cancel a running or queued job:

scancel <job_id>

Viewing Completed Jobs

Use sacct to see statistics of completed jobs:

sacct -j <job_id>
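
By default, sacct prints a brief summary. Specific fields can be requested with the --format option; as a minimal sketch (these are standard sacct field names, and sacct --helpformat lists all available fields):

sacct -j <job_id> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State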

1.7.8. Resource Allocation

When using an HPC system like Cyclone, it's crucial to request the right resources to run your jobs efficiently and avoid wasting system capacity. Useful specifications and best practices are described below.

Useful Specifications

  • Nodes: number of nodes (--nodes)
  • CPU tasks: number of CPU tasks per node (--ntasks-per-node)
  • CPU threads: number of CPU threads (cores) per task (--cpus-per-task)
  • Memory: amount of RAM per job (--mem)
  • GPUs: request GPUs if needed (--gres)
  • System partition: use either the CPU or the GPU part of the system (--partition)
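
As an illustration, the #SBATCH header below sketches a request for a full GPU node. The partition name, GPU count, and memory value are assumptions based on the Cyclone configuration described above; check the actual partition names with sinfo.

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1                  # one compute node
#SBATCH --ntasks-per-node=4        # four tasks on that node
#SBATCH --cpus-per-task=10         # 10 cores per task (4 x 10 = 40 cores, a full node)
#SBATCH --mem=160G                 # memory for the job (assumed value; nodes have 192 GB in total)
#SBATCH --gres=gpu:4               # request all 4 GPUs of a GPU node
#SBATCH --partition=gpu            # assumed GPU partition name; verify with sinfo
#SBATCH --time=02:00:00            # maximum runtime (hh:mm:ss)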

Best Practices

  1. Request Only What You Need: Avoid over-requesting resources. Only ask for the CPU cores, memory, and time your job actually needs.
  2. Use the Appropriate Partition: Submit jobs to the correct partition (e.g., CPU for general tasks, GPU for tasks requiring GPU acceleration).
  3. Specify the Right Number of Cores: Match the number of CPU cores to the needs of your job (e.g., single-core for small tasks, multi-core for parallel tasks).
  4. Limit Job Runtime: Set a realistic time limit to prevent wasting resources. Avoid setting excessive or very short time limits.
  5. Use Job Arrays for Multiple Jobs: For repetitive tasks (e.g., simulations), use job arrays to submit many jobs efficiently (see the sketch after this list).
  6. Avoid Overloading the System: Be mindful of the system load and avoid excessive resource requests, especially during peak usage times.
  7. Monitor Job Performance: Use commands like squeue and sacct to check job status and resource usage.
  8. Use Interactive Jobs for Debugging: For testing and debugging, run jobs interactively to better understand and optimize resource requirements. Do not run these tests on the login nodes of the system.
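
As a minimal sketch of a job array (the script name, input naming scheme, and array range are illustrative), each array element receives its own value of the SLURM_ARRAY_TASK_ID environment variable:

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=0-9               # run 10 independent array tasks
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --partition=cpu

# Each array task processes a different input file, selected via its task ID
python my_script.py input_${SLURM_ARRAY_TASK_ID}.dat

For interactive debugging, an interactive shell on a compute node can typically be requested with something like srun --partition=cpu --ntasks=1 --time=00:30:00 --pty bash; the exact options depend on the system configuration.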

1.8. Useful Resources

  • SLURM Best Practices Guide: This guide, created by EuroCC Spain, provides comprehensive information on using SLURM. It covers essential topics such as resource allocation, job initiation, and monitoring. The document highlights best practices for memory management, efficient parallelism, and handling large numbers of jobs. Tips on managing job arrays, wall time, and CPU usage are also included to ensure optimized performance. This guide is particularly valuable for users working with SLURM in high-performance computing environments, offering practical advice on system usage and resource allocation.
  • High-Performance Computing - Why & How: This document serves as an introduction to high-performance computing (HPC), explaining its relevance for research and computational tasks. It details the importance of HPC systems in fields like AI, data-intensive research, and simulations. The guide also emphasizes the need for proper training and access to systems like Cyclone at the Cyprus Institute. It provides an overview of various HPC resources, software, and how they enable more efficient and scalable computations. This guide is an excellent resource for those looking to understand HPC's capabilities and practical applications in the Cypriot ecosystem.

1.9. Recap and Next Steps

To conclude, this tutorial provided an overview of Cyclone's architecture, focusing on key components such as compute nodes, storage systems, and high-speed interconnects, as well as the role of the Home, Scratch, and Shared directories in the file system. We also explored the importance of resource management using SLURM, including job scheduling, job submission, and monitoring, as well as the use of modules to manage software environments.
With this foundational knowledge, you are now ready to proceed to the next tutorials, where you will gain hands-on experience with accessing the system, submitting jobs, allocating resources, and using modules, all of which are crucial for the efficient use of any HPC system, including Cyclone.