Introduction

Outline

  • Day 1 ~ 2.5h
    • Intro to Pronghorn
    • SLURM
    • Introduction to Unix/Linux for Bioinformatics
    • Connecting to Pronghorn
    • File/Folder Organization
    • Anatomy of a linux command
    • First Linux/UNIX commands
    • What is a shell?
    • Command line options & TAB completion
    • Navigation: Relative vs Absolute pathing
    • Conda installation
    • Working with files continued
    • Using echo and viewing $HOME, $USER, and $PATH environment variable
    • Miscellaneous Unix power commands
    • Piping commands into other commands
    • RECAP
  • Day 2 ~ 3.5h
    • Working with Sequencing Data
    • Using screen command
    • Your first SLURM job
    • Conda: creating a new environment for reproducible analysis
    • SLURM interactive session for testing
    • Writing your first sequence analysis SBATCH script
    • Transferring files from Pronghorn to your local computer
    • Singularity containers
    • Pronghorn wrap up

Day 1

Pronghorn

Pronghorn is the University of Nevada, Reno’s High-Performance Computing (HPC) cluster. The GPU-accelerated system is designed, built, and maintained by the Office of Information Technology’s HPC Team. Pronghorn and the HPC Team support general research across the Nevada System of Higher Education (NSHE).

Pronghorn is composed of CPU, GPU, Visualization, and Storage subsystems interconnected by a 100Gb/s non-blocking Intel Omni-Path fabric. The CPU partition features 108 nodes, 3,456 CPU cores, and 24.8TiB of memory. The GPU partition features 44 NVIDIA Tesla P100 GPUs, 352 CPU cores, and 2.75TiB of memory. The Visualization partition is composed of three NVIDIA Tesla V100 GPUs, 48 CPU cores, and 1.1TiB of memory. The storage system uses the IBM SpectrumScale file system to provide 2PiB of high-performance storage. The computational and storage capabilities of Pronghorn will regularly expand to meet NSHE computing demands.

Pronghorn is colocated at the Switch Citadel Campus, located 25 miles east of the University of Nevada, Reno. Switch is a leader in sustainable data center design and operation. The Switch Citadel is rated Tier 5 Platinum and is planned to be the largest, most advanced data center campus on the planet.

Pronghorn is available to all University of Nevada, Reno faculty, staff, students, and sponsored affiliates. Priority access to the system is available for purchase.

First up, let’s talk about what a high-performance computer (HPC) is: really, it is a bunch of individual computers (“nodes”), just like the ones you are using, strung together with networking cables, with the ability to easily deploy “jobs” (some computational task you are trying to accomplish) across multiple nodes. As such, we can determine how many cores we have access to by counting the cores on each individual node and summing them all up. Pronghorn has 3,456 CPU cores that (in theory) we have access to! In a perfect world (more on that later), you COULD divide the amount of time a job takes by the number of cores you throw at it. With Pronghorn, you could theoretically do 10 YEARS of sequential calculations in less than one day! Put another way, Pronghorn has 864 times the core count of my 4-core Windows machine.

Your desktop or laptop is all yours, generally, so you aren’t sharing its resources with anyone else. You’ve effectively pre-paid for ~5 years of computational time (warranty!) times the number of cores you have, so I’ve bought about 20 years of CPU-time on my Windows desktop and 40 years of CPU-time on my Mac laptop. Pronghorn, assuming a 5-year lifespan, has 17,280 years (!) of CPU-time, all of which was purchased in advance. While you are probably OK with your laptop/desktop just sitting there idle, a research computer like Pronghorn is designed to be used at near-capacity! Also, this is a SHARED MACHINE, and as such much of the process of getting your programs to run on it requires some understanding of how the system shares its resources amongst all the users! Enter SLURM.

SLURM

SLURM is what is known as a workload manager. SLURM’s job is to take the vast number of different jobs sent to it by all users in the system, reserve “resources” (# of nodes per job, # of cores per node, memory per job), and then execute the jobs based on the user or association’s priority.

A “job” is basically the top level of what you are trying to accomplish – a workflow, set of commands/programs to run, etc. Typically we define a single job at a time and submit it to the SLURM system. Within the job are “steps” which can be running sequentially or in parallel depending on the particulars of your workflow. A step consists of one or more “tasks”. Each “task” runs on one or more “cpus” (cpu is the same as a logical core in SLURM parlance). Parallelization can occur at multiple levels: job, step, and task.

SLURM uses a “job script” written in any interpreted language that uses “#” as the comment character; typically we’ll use the “bash” language to create a job. This job script follows a very specific format that you will get familiar with. Your job script 1) tells SLURM what resources you need, and 2) once the resources are allocated, what programs to execute and how to allocate the resources to those programs.
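
As a minimal sketch of that format (the job name and resource values here are illustrative placeholders, not Pronghorn-specific recommendations):

```shell
#!/bin/bash
# Lines beginning with #SBATCH are directives read by SLURM;
# to bash they are ordinary comments.
#SBATCH --job-name=example        # placeholder job name
#SBATCH --ntasks=1                # one task
#SBATCH --cpus-per-task=4        # four cpus (logical cores) for that task
#SBATCH --mem=8g                  # memory for the whole job
#SBATCH --time=01:00:00           # wall-clock limit (1 hour)

# Everything below runs once SLURM has allocated the resources
echo "Job running on $(hostname)"
```

You would submit a script like this with `sbatch myscript.sh`; SLURM queues it and runs it when the requested resources become free. We will write a real version of this on Day 2.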

As a general rule, Pronghorn is a BATCH system, which means you will focus on jobs that do not require user interaction and will often be deferred (run at some time in the future). While you CAN run “interactive jobs” on Pronghorn, this should be minimized wherever possible, because interactive jobs tend to leave their reserved resources sitting idle.

Introduction to Unix/Linux for Bioinformatics

Pronghorn HPC uses Linux as the underlying operating system. Understanding how to use a Unix environment and terminal to interact with files and folders is very important in bioinformatics. A lot of bioinformatic software is meant to be run on the command line. This training will give you the confidence to run these command-line tools and to move, copy, and view files.

The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux), though the differences do not matter much for most basic functions.

Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.

Connecting to Pronghorn

To connect to Pronghorn, we will use ssh to reach the remote server. Below is how you would connect from a Linux or macOS computer:


ssh yournetidhere@pronghorn.rc.unr.edu 

You will then be prompted to type in your netid password. As you type, the cursor will not move/display text in order to keep your password secure.

If you are using a Windows computer: recent versions of Windows 10 and 11 ship with an OpenSSH client, so the ssh command above may work directly in PowerShell or the Command Prompt. If it does not, install PuTTY to connect remotely to Pronghorn. Visit this website and download the appropriate installation file for your computer: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html Most likely: https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.78-installer.msi

Once installed, you will configure the connection information similar to above.

File/Folder organization

Similar to Windows computers, UNIX has a directory structure to organize data.

At the very top of the directory tree sits the root directory “/”. This is analogous to the “C:\” drive in Windows.

Below that are many “system” folders which the UNIX operating system uses to run (bin, dev, etc, lib, proc, sbin, and so on). The “home” folder contains user directories, similar to the “Users” folder on Windows and macOS systems.
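
You can see these top-level folders for yourself by listing the root directory (the exact contents vary from system to system):

```shell
# List the system directories directly under the root "/"
ls /
# Typical entries on a Linux system (exact list varies):
# bin  boot  dev  etc  home  lib  proc  sbin  tmp  usr  var
```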

Let’s login and open the terminal to start learning some commands and practice navigating the file/folders.

Anatomy of a Linux Command

Linux follows rules for command syntax. Let’s take a look at the cp command example below.


cp -R myFolder ~/Somewherelse/tocopy

The first part of the command, cp, is the Linux program/utility you are telling Linux to run. These programs need to be accessible via the $PATH environment variable (more on this later).

The second part of a command, which is optional, is the set of command-line flags. In this example, -R is the command-line flag. Flags let users modify the default behavior of a command in some way; in this case, -R tells cp to copy files/folders recursively (more on this later).

The third part of a command is the command-line arguments. In this case, we have two arguments, myFolder and ~/Somewherelse/tocopy. This instructs the cp command to copy the data in myFolder to the location ~/Somewherelse/tocopy.

First Linux/UNIX commands

On a Linux system, the default directory when logging into the system is your user’s home folder. This is signified by the tilde ~ character.

(base) hvasquezgross@jimkent:~$ 

To find out the full path of where you are located on the operating system, use the pwd command.

(base) hvasquezgross@jimkent:~$ pwd
/data/gpfs/home/hvasquezgross

This full path mirrors the file/folder structure described above.

Let’s make a folder called “training” for this workshop using the mkdir command.

(base) hvasquezgross@jimkent:~$ mkdir training

Use the ls command to verify that the training folder has been created:

(base) hvasquezgross@jimkent:~$ ls
b2gFiles      bin       Desktop    Downloads   miniconda3  perl5  Public    snap  temp  tmp        Videos
b2gWorkspace  Blast2GO  Documents  git         igv         Music  Pictures  R     src   Templates  training

What is a shell?

Taken from Wikipedia: A Unix shell is a command-line interpreter or shell that provides a command line user interface for Unix-like operating systems. The shell is both an interactive command language and a scripting language, and is used by the operating system to control the execution of the system using shell scripts.

So the commands we were running above were sent to the Linux operating system by a shell. The default shell on Pronghorn is BASH. For today, we will be issuing commands interactively. However, in the following meeting, we will create “bash scripts” in order to automate our analyses (and get them to run on Pronghorn as scheduled jobs).

Command line options & TAB completion

Notice that the ls command lists the files. However, sometimes we want more information than just what is in a folder. For example, we may want to know file timestamps, file sizes, and user permissions.

To get more information, we will use command-line flags, which change the way ls operates. Try running the ls -alh command. The “a” flag lists all files (including hidden files, whose names begin with a dot), the “l” flag uses the long format, and the “h” flag outputs file sizes in human-readable units rather than raw bytes.

(base) hvasquezgross@jimkent:~$ ls -alh
total 508K
drwxr-xr-x 46 hvasquezgross hvasquezgross 4.0K May 13 13:48 .
drwxr-xr-x  6 root          root          4.0K Sep  3  2021 ..
drwxrwxr-x  3 hvasquezgross hvasquezgross 4.0K Dec 16 11:03 .aspera
drwxr-xr-x  5 hvasquezgross hvasquezgross 4.0K Apr 25 13:34 b2gFiles
drwxrwxr-x  2 hvasquezgross hvasquezgross 4.0K Nov  1  2021 b2gWorkspace
-rw-------  1 hvasquezgross hvasquezgross  56K May 11 14:47 .bash_history
-rw-r--r--  1 hvasquezgross hvasquezgross  220 Jul 30  2021 .bash_logout
-rw-r--r--  1 hvasquezgross hvasquezgross 5.0K Mar 24 14:08 .bashrc
drwxrwxr-x  4 hvasquezgross hvasquezgross 4.0K May 12 15:34 bin
drwxr-xr-x  8 hvasquezgross hvasquezgross 4.0K Nov  1  2021 Blast2GO

Let’s say we want to see if we have any files in our “Downloads” folder. We can list files in the Downloads folder by typing Downloads after ls to specify we want to list files in that folder.

(base) hvasquezgross@jimkent:~$ ls Downloads
 1098_2021.pdf
 1151851s_t7-9_f9-10.zip
 16654993.zip

Since the Downloads folder is in our home folder, we can use TAB COMPLETION to auto-complete the name. Test this by typing “ls Down” then press TAB and you will see the rest of the name auto-complete.

Now try typing ls down and press TAB. Did anything happen?

No. This is because in Unix, files, folders, and commands are CASE SENSITIVE.

Piping commands into other commands

You can view a history of all the commands run during your terminal session. Type history to view it.

(base) hvasquezgross@jimkent:~/training$ history

[edited out]

 2083  cat second.txt >> third.txt 
 2084  rm third.txt 
 2085  cat data.txt > third.txt
 2086  cat second.txt >> third.txt 
 2087  wc data.txt 
 2088  wc third.txt 
 2089  grep "Unix" data.txt 
 2090  grep "Unix" third.txt 
 2091  cat data.txt 
 2092  history
(base) hvasquezgross@jimkent:~/training$ 

You can rerun a particular command by using the bang (!) and then the number of the command you want to rerun. Example: !2092

You can also rerun the most recent command that starts with certain letters by using the bang (!) followed by those first few letters. Example: !grep

When you run the history command, you will notice a lot of text gets printed to the screen. This isn’t particularly informative if you want to see some of the first commands you ran in your history.

So, how do we view these first commands?

You can scroll up in your terminal session, but imagine if you had 20,000 commands you ran. That would be a lot of scrolling!

We can use command-line redirection as before, to make a text file with the contents from the history command. Let’s do that now and name the output history.txt.

(base) hvasquezgross@jimkent:~/training$ history > history.txt
(base) hvasquezgross@jimkent:~/training$ ls
data  history.txt

We can use the head and tail commands to look at the first lines or last lines of a file

(base) hvasquezgross@jimkent:~/training$ head history.txt
(base) hvasquezgross@jimkent:~/training$ tail history.txt

Since cat outputs the entire file contents to the terminal, we don’t want to use cat to view this file.

Let’s use a new utility called less or more which allows paging through files.

(base) hvasquezgross@jimkent:~/training$ less history.txt 

This will open the file for viewing. You can scroll using arrow keys to navigate single line by line or PageUp/PageDown keys and CTRL+F or CTRL+B to scroll through pages at a time.

You can search within less by using the “/” then type the search term and press ENTER.

In order to quit less, press the “q” key to exit and return back to the terminal command prompt.

It’s a bit clunky having to create a “history.txt” file just to see the first commands that were run, and we don’t necessarily want to save this file either.

Wouldn’t it be nice if we could temporarily view the history using less, so we can page through the contents?

We can, using the pipe character |, which on most US keyboards shares a key with the backslash, between Backspace and Enter (type it with Shift+\).

The | tells Unix to take the output from the first command and use it as the input to the second command.

Let’s do this with history and less.

(base) hvasquezgross@jimkent:~/training$ history | less

You will notice this works the same as when we used less before with the “history.txt” file. However, now we did not need to create the file first, then open it with less.
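
The pipe isn’t limited to less; any command that reads input can sit on the right-hand side. A couple of illustrative combinations with commands you already know:

```shell
history | wc -l    # count how many commands are in your history
history | head     # show only the first 10 entries of the history
```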

Miscellaneous Unix power commands

The following examples introduce some other Unix commands, and show how they could be used to work on a fictional file called file.txt. These are not all real world cases, but rather show the diversity of Unix command-line tools, as well as stringing multiple different commands together to obtain a desired result

View the penultimate 10 lines of a file, i.e. lines 11-20 counting from the end (combining the tail and head commands):


tail -n 20 file.txt | head

Show lines of a file that begin with a start codon (ATG) (the ^ matches patterns at the start of a line):


grep "^ATG" file.txt

Cut out the 3rd column of a tab-delimited text file and sort it to only show unique lines (i.e. remove duplicates):


cut -f 3 file.txt | sort -u

Count how many lines in a file contain the words ‘cat’ or ‘bat’ (-c option of grep counts lines):


grep -c '[bc]at' file.txt

Turn lower-case text into upper-case (using tr command to ‘transliterate’):


cat file.txt | tr 'a-z' 'A-Z'

Change all occurrences of ‘Chr1’ to ‘Chromosome 1’ and write the changed output to a new file (using the sed command; the trailing g flag replaces every occurrence on a line, not just the first):


cat file.txt | sed 's/Chr1/Chromosome 1/g' > file2.txt
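
Stringing several of these together is where the command line shines. As a hypothetical example (assuming file.txt held FASTA-style sequence records whose header lines begin with “>”), you could count the unique sequence IDs in one pipeline:

```shell
# Create a tiny FASTA-style sample file to demonstrate (made-up data)
printf '>seq1\nATG\n>seq2\nCCC\n>seq1\nAAA\n' > file.txt

# grep keeps the header lines, sed strips the leading ">",
# sort -u removes duplicate IDs, and wc -l counts what remains
grep "^>" file.txt | sed 's/^>//' | sort -u | wc -l
# prints 2 (seq1 and seq2)
```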

RECAP

Now, you should know how to use the following commands: ls, cd, cp, nano, tree, rm, mkdir, cat, wc, grep, echo, head, tail, history, less, more.

If you ever forget how to use a command, use the man command followed by the command you want to look up to pull up its manual page.

Try this now with the cp command.

(base) hvasquezgross@jimkent:~/training$ man cp
NAME
       cp - copy files and directories

SYNOPSIS
       cp [OPTION]... [-T] SOURCE DEST
       cp [OPTION]... SOURCE... DIRECTORY
       cp [OPTION]... -t DIRECTORY SOURCE...

DESCRIPTION
       Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --archive

This will conclude our first day.

Below is a BASH commands cheat sheet.