How it works
You log into the “head” or “submit” node via SSH using your favorite terminal or terminal emulator. Your working directory on this node holds everything you need to run your program: data, libraries, scripts, and the software your job will use.
You may need to test run your job using an interactive session; see the interactive session section of this site. Once you know what resources your program needs, you create a bash script (.sh file) that lists the instructions for your job, including the actual command that executes it. This script is read by the cluster scheduler, commonly referred to as SLURM (Simple Linux Utility for Resource Management).
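A minimal sketch of such a script is shown here. The job name, resource values, time limit, and email address are placeholders, not recommendations from this site, and the final command is a stand-in for your own program. The `#SBATCH` lines are standard sbatch options; check the User’s Guide for any settings specific to this cluster.

```shell
#!/bin/bash
# Resource requests read by the scheduler (all values are placeholders).
#SBATCH --job-name=example_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=example_job_%j.out
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=you@example.edu

# SLURM sets SLURM_CPUS_PER_TASK inside a job when --cpus-per-task is given;
# fall back to 1 so the script also runs outside the scheduler.
threads="${SLURM_CPUS_PER_TASK:-1}"

echo "Running on $(uname -n) with ${threads} thread(s)"

# Placeholder: replace this line with the command that runs your program.
echo "replace this line with your program's command"
```

Because the `#SBATCH` lines are ordinary comments to bash, the same file can be run directly as a quick sanity check before it is handed to the scheduler.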
After you submit your job, the scheduler locates other computers (nodes) that have enough available processors and memory for your job’s needs. You do not need to log into these nodes. SLURM sends your job out to them, and any results or output are written to the working directory from which you submitted the job. You will receive an email when your job actually starts and when it completes, with or without failures.
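As a rough sketch, submitting a job and checking on it from the submit node might look like the following. The file name `myjob.sh` is hypothetical. `sbatch` and `squeue` are standard SLURM commands, and the guard around them is only there so the snippet does nothing harmful on a machine without SLURM.

```shell
#!/bin/bash
# "myjob.sh" is a hypothetical batch script name used for illustration.
job_script="myjob.sh"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # --parsable prints only the job ID (followed by ";cluster" on
    # multi-cluster setups, which cut strips off).
    job_id=$(sbatch --parsable "$job_script" | cut -d';' -f1)
    submit_status="submitted"
    echo "Submitted job ${job_id}"
    # Show the job's current state in the queue (PD = pending, R = running).
    squeue -j "$job_id"
else
    submit_status="skipped"
    echo "sbatch or ${job_script} not found; nothing submitted."
fi
```

Unless the script sets `--output`, SLURM writes the job’s output to a file named `slurm-<jobid>.out` in the directory the job was submitted from.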
All of these steps are described in the User’s Guide section of this site.
The cluster is currently made up of 13 CPU compute nodes, one GPU node with 8 NVIDIA RTX 2080 Ti cards (for deep learning and machine learning tasks), two submit nodes with storage, and two high-speed direct-attached storage nodes. Each compute node has two Intel Xeon E5-2680 v3 CPUs @ 2.50 GHz and 128 GB of RAM. With hyper-threading enabled, the compute nodes provide a total of 624 logical cores and about 1.7 terabytes of RAM. For HPC work using MPI, high-speed InfiniBand interconnects running at around 40 Gbps are also available for true parallel processing.
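The totals quoted above can be reproduced with a little arithmetic. The sketch assumes 12 physical cores per E5-2680 v3 CPU and two hardware threads per core; neither figure is stated on this page. It counts only the 13 CPU compute nodes.

```shell
#!/bin/bash
nodes=13               # CPU compute nodes (from the text above)
sockets_per_node=2     # two CPUs per node (from the text above)
cores_per_socket=12    # assumption: physical cores per E5-2680 v3
threads_per_core=2     # assumption: hyper-threading doubles the count
ram_per_node_gb=128    # from the text above

physical_cores=$((nodes * sockets_per_node * cores_per_socket))
logical_cores=$((physical_cores * threads_per_core))
total_ram_gb=$((nodes * ram_per_node_gb))

echo "Physical cores: ${physical_cores}"
echo "Logical cores:  ${logical_cores}"
echo "Total RAM:      ${total_ram_gb} GB"
```

Under these assumptions the logical core count comes to 624 and the RAM to 1664 GB, which rounds to the 1.7 terabytes quoted above.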