Pronghorn is the University of Nevada, Reno’s High-Performance Computing (HPC) cluster. The GPU-accelerated system is designed, built, and maintained by the Office of Information Technology’s HPC Team. Pronghorn and the HPC Team support general research across the Nevada System of Higher Education (NSHE).
Pronghorn is composed of CPU, GPU, Visualization, and Storage subsystems interconnected by a 100Gb/s non-blocking Intel Omni-Path fabric. The CPU partition features 108 nodes, 3,456 CPU cores, and 24.8TiB of memory. The GPU partition features 44 NVIDIA Tesla P100 GPUs, 352 CPU cores, and 2.75TiB of memory. The Visualization partition is composed of three NVIDIA Tesla V100 GPUs, 48 CPU cores, and 1.1TiB of memory. The storage system uses the IBM SpectrumScale file system to provide 2PiB of high-performance storage. The computational and storage capabilities of Pronghorn will regularly expand to meet NSHE computing demands.
Pronghorn is co-located at the Switch Citadel Campus, 25 miles east of the University of Nevada, Reno. Switch is a leader in sustainable data center design and operation. The Switch Citadel is rated Tier 5 Platinum and, when complete, is planned to be one of the largest and most advanced data center campuses in the world.
Pronghorn is available to all University of Nevada, Reno faculty, staff, students, and sponsored affiliates. Priority access to the system is available for purchase.
First up, let’s talk about what a high-performance computer (HPC) is: really, it is a bunch of individual computers (“nodes”), just like the ones you are using, strung together with networking cables, with the ability to deploy “jobs” (some computational task you are trying to accomplish) across multiple nodes easily. As such, we can determine how many cores we have access to by counting the number of cores on each individual node, and summing them all up. Pronghorn has 3,456 CPU cores that (in theory) we have access to! In a perfect world (more on that later), you COULD divide the amount of time it takes to do a job by the number of cores you throw at it. With Pronghorn, you could theoretically do 10 YEARS of sequential calculations in less than one day! Put another way, Pronghorn’s capabilities are 864 times faster than my Windows machine.
Your desktop or laptop is all yours, generally, so you aren’t sharing its resources with anyone else. You’ve effectively pre-paid for ~5 years of computational time (the warranty period!) times the number of cores you have, so I’ve bought about 20 years of CPU-time on my Windows desktop and 40 years of CPU-time on my Mac laptop. Pronghorn, assuming a 5-year lifespan, has 17,280 years (!) of CPU-time, all of which was purchased in advance. While you are probably OK with your laptop/desktop just sitting there idle not doing much, a research computer like Pronghorn is designed to run at near-capacity! Also, this is a SHARED MACHINE, so much of the process of getting your programs to run on it requires some understanding of how the system shares its resources amongst all the users! Enter SLURM.
SLURM is what is known as a workload manager. SLURM’s job is to take the vast number of different jobs sent to it by all users in the system, reserve “resources” (# of nodes per job, # of cores per node, memory per job), and then execute the jobs based on the user or association’s priority.
A “job” is basically the top level of what you are trying to accomplish – a workflow, set of commands/programs to run, etc. Typically we define a single job at a time and submit it to the SLURM system. Within the job are “steps” which can be running sequentially or in parallel depending on the particulars of your workflow. A step consists of one or more “tasks”. Each “task” runs on one or more “cpus” (cpu is the same as a logical core in SLURM parlance). Parallelization can occur at multiple levels: job, step, and task.
SLURM uses a “job script” written in any interpreted language that uses “#” as the comment character – typically we’ll use the “bash” language to create a job. This job script follows a very specific format that you will become familiar with. Your job script 1) tells SLURM what resources you need, and 2) once the resources are allocated, tells it what programs to execute and how to allocate those resources to them.
As a general rule, Pronghorn is a BATCH system, which means you will focus on jobs that do not require user interaction, and will often be deferred (run at some time in the future). While you CAN run “interactive jobs” on Pronghorn, this should be minimized wherever possible. Interactive jobs typically idle resources quite a bit.
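As a sketch of what such a batch job script looks like, here is a minimal example. The partition name, resource numbers, and job name below are assumptions for illustration only – consult Pronghorn’s documentation for the actual partition names and limits available to your account.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=my_first_job   # name shown in the queue (hypothetical)
#SBATCH --nodes=1                 # number of nodes to reserve
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # cpus (logical cores) per task
#SBATCH --mem=8G                  # memory for the whole job
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)
#SBATCH --partition=cpu           # hypothetical partition name

# Everything below runs only once SLURM has allocated the resources.
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-unknown} cpus"
```

You would submit the script with `sbatch myjob.sh`, watch it in the queue with `squeue -u $USER`, and cancel it with `scancel <jobid>`.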
Pronghorn uses Linux as its underlying operating system. Understanding how to use a Unix environment and terminal to interact with files and folders is very important in bioinformatics. A lot of bioinformatic software is meant to be run on the command line. This training will enable you to feel confident running these command-line tools and moving, copying, and viewing files.
The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux), though the differences do not matter much for most basic functions.
Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.
In order to connect to Pronghorn, we will use ssh to connect to the remote server. Below is how you would connect using a Linux or macOS computer:
ssh yournetidhere@pronghorn.rc.unr.edu
You will then be prompted to type in your netid password. As you type, the cursor will not move/display text in order to keep your password secure.
If you are using a Windows computer, note that older versions of Windows do not have ssh functionality built in (Windows 10 and later ship with an OpenSSH client usable from PowerShell or the Command Prompt). If yours does not, you will need to install a program such as PuTTY in order to remotely connect to Pronghorn. Visit this website and download the appropriate installation file for your computer: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
Most likely: https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.78-installer.msi
Once installed, you will configure the connection information similar to the ssh example above.
Similar to Windows computers, UNIX has a directory structure to organize data.
At the very top of the file system is the root directory “/”. This is analogous to the “C:\” drive in Windows.
Below that, there are many “system” folders which the UNIX operating system uses to run (bin, dev, etc, lib, proc, sbin, and so on). The “home” folder contains user directories, similar to the “Users” folder on Windows and Mac systems.
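Once you are logged in, you can see these top-level system folders for yourself by listing the root directory:

```shell
# List the contents of the root directory "/"
ls /
```

On a typical Linux system you will see bin, dev, etc, home, lib, and similar folders in the output.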
Let’s login and open the terminal to start learning some commands and practice navigating the file/folders.
Linux follows rules for command syntax. Let’s take a look at the
cp command example below.
cp -R myFolder ~/Somewherelse/tocopy
The first part of the command, cp, is the Linux program/utility you are telling Linux to run. These programs need to be accessible via the $PATH environment variable (more on this later).
The second part of a command, which is optional, is the command-line flags. In this example, -R is the command-line flag. Flags allow users to modify the default behavior of the command in some way. In this case, -R means to recursively copy files/folders (more on this later).
The third part of a command is the command-line arguments. In this case, we have two arguments, myFolder and ~/Somewherelse/tocopy. This instructs the cp command to copy the data in myFolder to the location ~/Somewherelse/tocopy.
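Putting the three parts together, here is a small self-contained session you can try; the folder and file names are made up for illustration:

```shell
mkdir -p myFolder                     # create a source folder
echo "hello" > myFolder/greeting.txt  # put a file inside it
mkdir -p backup                       # create a destination folder
cp -R myFolder backup/                # recursively copy the whole folder
ls backup/myFolder                    # greeting.txt now exists in the copy
```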
On a Linux system, the default directory when you log into the system is your user’s home folder. This is signified by the tilde (~) character.
(base) hvasquezgross@jimkent:~$
To find out the full path of where you are located on the operating
system, use the pwd command.
(base) hvasquezgross@jimkent:~$ pwd
/data/gpfs/home/hvasquezgross
This looks very similar to the image above depicting the file/folder structure.
Let’s make a folder called “training” for this workshop using the
mkdir command.
(base) hvasquezgross@jimkent:~$ mkdir training
Use the ls command to verify that the training folder has been created.
(base) hvasquezgross@jimkent:~$ ls
b2gWorkspace Blast2GO Documents git igv Music Pictures R src Templates training
b2gFiles bin Desktop Downloads miniconda3 perl5 Public snap temp tmp Videos
Taken from Wikipedia: A Unix shell is a command-line interpreter or shell that provides a command line user interface for Unix-like operating systems. The shell is both an interactive command language and a scripting language, and is used by the operating system to control the execution of the system using shell scripts.
So the commands we were running above were sent to the Linux operating system by a shell. The default shell on Pronghorn is BASH. For today, we will be issuing commands interactively. However, in the following meeting, we will create “bash scripts” in order to automate our analyses (and get them to run on Pronghorn as scheduled jobs).
Notice the ls command lists the files. However, sometimes we want more information than just what is in a folder. For example, we may want to know file timestamps, file sizes, and user permissions. To get more information, we will use command-line flags, which change the way ls operates. Try running the ls -alh command. The “a” flag lists all files (including hidden ones), the “l” means to use the long format, and the “h” means to output file sizes in a human-readable format rather than just bytes.
(base) hvasquezgross@jimkent:~$ ls -alh
total 508K
drwxr-xr-x 46 hvasquezgross hvasquezgross 4.0K May 13 13:48 .
drwxr-xr-x 6 root root 4.0K Sep 3 2021 ..
drwxrwxr-x 3 hvasquezgross hvasquezgross 4.0K Dec 16 11:03 .aspera
drwxr-xr-x 5 hvasquezgross hvasquezgross 4.0K Apr 25 13:34 b2gFiles
drwxrwxr-x 2 hvasquezgross hvasquezgross 4.0K Nov 1 2021 b2gWorkspace
-rw------- 1 hvasquezgross hvasquezgross 56K May 11 14:47 .bash_history
-rw-r--r-- 1 hvasquezgross hvasquezgross 220 Jul 30 2021 .bash_logout
-rw-r--r-- 1 hvasquezgross hvasquezgross 5.0K Mar 24 14:08 .bashrc
drwxrwxr-x 4 hvasquezgross hvasquezgross 4.0K May 12 15:34 bin
drwxr-xr-x 8 hvasquezgross hvasquezgross 4.0K Nov 1 2021 Blast2GO
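The first column of each line (e.g. drwxr-xr-x) encodes the entry’s type and permissions: a leading “d” means directory, followed by three rwx triplets for the owner, the group, and everyone else. You can change these with chmod. A small illustration, using a throwaway file:

```shell
touch example.txt        # create an empty file
chmod 640 example.txt    # rw- for owner, r-- for group, nothing for others
ls -l example.txt        # the first column now reads -rw-r-----
```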
Let’s say we want to see if we have any files in our “Downloads” folder. We can list files in the Downloads folder by typing Downloads after ls to specify we want to list files in that folder.
(base) hvasquezgross@jimkent:~$ ls Downloads
1098_2021.pdf
1151851s_t7-9_f9-10.zip
16654993.zip
Since the Downloads folder is in our home folder, we can use TAB
COMPLETION to auto-complete the name. Test this by typing
“ls Down” then press TAB and you will see the rest of the
name auto-complete.
Now try typing ls down and press TAB. Did anything
happen?
No, this is because in unix files/folders/commands are CASE SENSITIVE.
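Because names are case sensitive, the following creates two distinct files on a Linux system (note that the default macOS filesystem is case-insensitive, so this demonstration only works as described on Linux):

```shell
touch readme.txt README.txt   # two different files on Linux!
ls *.txt                      # both appear in the listing
```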
You can view a history of all the commands run during your terminal session. Type history to view it.
(base) hvasquezgross@jimkent:~/training$ history
[edited out]
2083 cat second.txt >> third.txt
2084 rm third.txt
2085 cat data.txt > third.txt
2086 cat second.txt >> third.txt
2087 wc data.txt
2088 wc third.txt
2089 grep "Unix" data.txt
2090 grep "Unix" third.txt
2091 cat data.txt
2092 history
(base) hvasquezgross@jimkent:~/training$
You can rerun a particular command by using the bang (!) followed by the number of the command you want to rerun. Example: !2092
You can also rerun a particular command by using the bang (!) followed by the first few letters of the command. Example: !grep
When you run the history command, you will notice a lot of text gets printed to the screen. This isn’t particularly informative if you want to see some of the first commands you ran in your history.
So, how do we view these first commands?
You can scroll up in your terminal session, but imagine if you had 20,000 commands you ran. That would be a lot of scrolling!
We can use command-line redirection as before, to make a text file with the contents from the history command. Let’s do that now and name the output history.txt.
(base) hvasquezgross@jimkent:~/training$ history > history.txt
(base) hvasquezgross@jimkent:~/training$ ls
data history.txt
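Redirection works with any command, not just history. Here is the difference between > (overwrite) and >> (append), using a throwaway file:

```shell
echo "first line" > notes.txt    # ">" creates (or overwrites) the file
echo "second line" >> notes.txt  # ">>" appends to the existing file
cat notes.txt                    # shows both lines
echo "started over" > notes.txt  # ">" again replaces the contents
cat notes.txt                    # shows only: started over
```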
We can use the head and tail commands to look at the first or last lines of a file.
(base) hvasquezgross@jimkent:~/training$ head history.txt
(base) hvasquezgross@jimkent:~/training$ tail history.txt
Since cat outputs the entire file contents to the terminal, we don’t want to use cat to view this file. Let’s use a new utility called less (or more), which allows paging through files.
(base) hvasquezgross@jimkent:~/training$ less history.txt
This will open the file for viewing. You can scroll using arrow keys to navigate single line by line or PageUp/PageDown keys and CTRL+F or CTRL+B to scroll through pages at a time.
You can search within less by using the “/” then type
the search term and press ENTER.
In order to quit less, press the “q” key to exit and
return back to the terminal command prompt.
It’s a bit clunky having to create a “history.txt” file just to see the first commands that were run. We don’t necessarily want to keep this file, either.
Wouldn’t it be nice if we could temporarily view the history using less, so we can page through the contents?
We can, using the | (pipe) character, which on most US keyboards sits below Backspace and above Enter. The | tells Unix to take the output from the first command and use it as the input to the second command.
Let’s do this with history and less.
(base) hvasquezgross@jimkent:~/training$ history | less
You will notice this works the same as when we used less before with
the “history.txt” file. However, now we did not need to create the file
first, then open it with less.
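Pipes can be chained as many times as you like. For example, using seq (which prints a sequence of numbers) as stand-in data:

```shell
# Print numbers 1-100, keep the last 20, then keep the first 10 of those:
seq 1 100 | tail -n 20 | head
# prints the numbers 81 through 90, one per line
```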
The following examples introduce some other Unix commands and show how they could be used to work on a fictional file called file.txt. These are not all real-world cases, but they show the diversity of Unix command-line tools, as well as how to string multiple commands together to obtain a desired result.
View the penultimate 10 lines of a file (using head and tail commands):
tail -n 20 file.txt | head
Show lines of a file that begin with a start codon (ATG) (the ^ matches patterns at the start of a line):
grep "^ATG" file.txt
Cut out the 3rd column of a tab-delimited text file and sort it to only show unique lines (i.e. remove duplicates):
cut -f 3 file.txt | sort -u
Count how many lines in a file contain the words ‘cat’ or ‘bat’ (-c option of grep counts lines):
grep -c '[bc]at' file.txt
Turn lower-case text into upper-case (using tr command to ‘transliterate’):
cat file.txt | tr 'a-z' 'A-Z'
Change all occurrences of ‘Chr1’ to ‘Chromosome 1’ and write the changed output to a new file (using the sed command):
cat file.txt | sed 's/Chr1/Chromosome 1/' > file2.txt
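As a more bioinformatics-flavored example, each sequence in a FASTA file starts with a “>” header line, so grep can count the sequences. The tiny file below is fabricated for illustration:

```shell
# Create a two-sequence FASTA file
printf ">seq1\nATGAAA\n>seq2\nATGCCC\n" > toy.fasta
# Count sequences by counting header lines (lines starting with ">")
grep -c "^>" toy.fasta
# prints: 2
```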
Now, you should know how to use the following commands:
ls, cd, cp, nano,
tree, rm, mkdir,
cat, wc, grep, echo,
less, more.
If you ever forget how to use a specific command, run the man command followed by the command you want to look up to pull up its manual page.
Try this now with the cp command.
(base) hvasquezgross@jimkent:~/training$ man cp
NAME
cp - copy files and directories
SYNOPSIS
cp [OPTION]... [-T] SOURCE DEST
cp [OPTION]... SOURCE... DIRECTORY
cp [OPTION]... -t DIRECTORY SOURCE...
DESCRIPTION
Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.
Mandatory arguments to long options are mandatory for short options too.
-a, --archive
This will conclude our first day.
Below is a BASH commands cheat sheet.