Introduction

Outline

  • Day 1 ~ 2.5h
    • Intro to Pronghorn
    • SLURM
    • Introduction to Unix/Linux for Bioinformatics
    • Connecting to Pronghorn
    • File/Folder Organization
    • Anatomy of a linux command
    • First Linux/UNIX commands
    • What is a shell?
    • Command line options & TAB completion
    • Navigation: Relative vs Absolute pathing
    • Conda installation
    • Working with files continued
    • Using echo and viewing $HOME, $USER, and $PATH environment variable
    • Miscellaneous Unix power commands
    • Piping commands into other commands
    • RECAP
  • Day 2 ~ 3.5h
    • Working with Sequencing Data
    • Using screen command
    • Your first SLURM job
    • Conda: creating a new environment for reproducible analysis
    • SLURM interactive session for testing
    • Writing your first sequence analysis SBATCH script
    • Transferring files from Pronghorn to your local computer
    • Singularity containers
    • Pronghorn wrap up

Day 1

Pronghorn

Pronghorn is the University of Nevada, Reno’s High-Performance Computing (HPC) cluster. The GPU-accelerated system is designed, built, and maintained by the Office of Information Technology’s HPC Team. Pronghorn and the HPC Team support general research across the Nevada System of Higher Education (NSHE).

Pronghorn is composed of CPU, GPU, Visualization, and Storage subsystems interconnected by a 100Gb/s non-blocking Intel Omni-Path fabric. The CPU partition features 108 nodes, 3,456 CPU cores, and 24.8TiB of memory. The GPU partition features 44 NVIDIA Tesla P100 GPUs, 352 CPU cores, and 2.75TiB of memory. The Visualization partition is composed of three NVIDIA Tesla V100 GPUs, 48 CPU cores, and 1.1TiB of memory. The storage system uses the IBM SpectrumScale file system to provide 2PiB of high-performance storage. The computational and storage capabilities of Pronghorn will regularly expand to meet NSHE computing demands.

Pronghorn is colocated at the Switch Citadel Campus, located 25 miles east of the University of Nevada, Reno. Switch is a leader in sustainable data center design and operation. The Switch Citadel is rated Tier 5 Platinum and is planned to be the largest, most advanced data center campus on the planet.

Pronghorn is available to all University of Nevada, Reno faculty, staff, students, and sponsored affiliates. Priority access to the system is available for purchase.

First up, let’s talk about what a high-performance computer (HPC) is: really, it is a bunch of individual computers (“nodes”), just like the ones you are using, strung together with networking cables, with the ability to easily deploy “jobs” (some computational task you are trying to accomplish) across multiple nodes. As such, we can determine how many cores we have access to by counting the cores on each individual node and summing them all up. Pronghorn has 3,456 CPU cores that (in theory) we have access to! In a perfect world (more on that later), you COULD divide the amount of time a job takes by the number of cores you throw at it. With Pronghorn, you could theoretically do 10 YEARS of sequential calculations in less than one day! Put another way, Pronghorn has 864 times the core count of my 4-core Windows machine.

Your desktop or laptop is all yours, generally, so you aren’t sharing its resources with anyone else. You’ve effectively pre-paid for ~5 years of computational time (warranty!) times the number of cores you have, so I’ve bought about 20 years of CPU-time on my Windows desktop and 40 years of CPU-time on my Mac laptop. Pronghorn, assuming a 5-year lifespan, has 17,280 years (!) of CPU-time, all of which was purchased in advance. While you are probably OK with your laptop/desktop just sitting there idle, a research computer like Pronghorn is designed to be used at near-capacity! Also, this is a SHARED MACHINE, and as such much of the process of getting your programs to run on it requires some understanding of how the system shares its resources amongst all the users! Enter SLURM.

SLURM

SLURM is what is known as a workload manager. SLURM’s job is to take the vast number of different jobs sent to it by all users in the system, reserve “resources” (# of nodes per job, # of cores per node, memory per job), and then execute the jobs based on the user or association’s priority.

A “job” is basically the top level of what you are trying to accomplish – a workflow, set of commands/programs to run, etc. Typically we define a single job at a time and submit it to the SLURM system. Within the job are “steps” which can be running sequentially or in parallel depending on the particulars of your workflow. A step consists of one or more “tasks”. Each “task” runs on one or more “cpus” (cpu is the same as a logical core in SLURM parlance). Parallelization can occur at multiple levels: job, step, and task.

SLURM uses a “job script” written in any interpreted language that uses “#” as the comment character; typically we’ll use the “bash” language to create a job. This job script follows a very specific format that you will get familiar with. Your job script 1) tells SLURM what resources you need, and 2) once the resources are allocated, what programs to execute and how to allocate the resources to those programs.
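
As a minimal sketch of that format (the job name and resource values here are illustrative placeholders, not Pronghorn-specific recommendations):

```shell
#!/bin/bash
# Lines beginning with #SBATCH are directives read by SLURM;
# to bash they are ordinary comments.
#SBATCH --job-name=example        # placeholder job name
#SBATCH --ntasks=1                # one task
#SBATCH --cpus-per-task=4        # four cpus (logical cores) for that task
#SBATCH --mem=8g                  # memory for the whole job
#SBATCH --time=01:00:00           # wall-clock limit (1 hour)

# Everything below runs once SLURM has allocated the resources
echo "Job running on $(hostname)"
```

You would submit a script like this with `sbatch myscript.sh`; SLURM queues it and runs it when the requested resources become free. We will write a real version of this on Day 2.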

As a general rule, Pronghorn is a BATCH system, which means you will focus on jobs that do not require user interaction and will often be deferred (run at some time in the future). While you CAN run “interactive jobs” on Pronghorn, this should be minimized wherever possible, because interactive jobs tend to leave their reserved resources sitting idle.

Introduction to Unix/Linux for Bioinformatics

Pronghorn HPC uses Linux as the underlying operating system. Understanding how to use a Unix environment and terminal to interact with files and folders is very important in bioinformatics. A lot of bioinformatic software is meant to be run on the command line. This training will give you the confidence to run these command-line tools and to move, copy, and view files.

The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux), though the differences do not matter much for most basic functions.

Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.

Connecting to Pronghorn

To connect to Pronghorn, we will use ssh to reach the remote server. Below is how you would connect from a Linux or macOS computer:


ssh yournetidhere@pronghorn.rc.unr.edu 

You will then be prompted to type in your netid password. As you type, the cursor will not move/display text in order to keep your password secure.

If you are using a Windows computer: recent versions of Windows 10 and 11 ship with an OpenSSH client, so the ssh command above may work directly in PowerShell or the Command Prompt. If it does not, install PuTTY to connect remotely to Pronghorn. Visit this website and download the appropriate installation file for your computer: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html Most likely: https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.78-installer.msi

Once installed, you will configure the connection information similar to above.

File/Folder organization

Similar to Windows computers, UNIX has a directory structure to organize data.

At the very top of the directory tree sits the root directory “/”. This is analogous to the “C:\” drive in Windows.

Below that are many “system” folders which the UNIX operating system uses to run (bin, dev, etc, lib, proc, sbin, and so on). The “home” folder contains user directories, similar to the “Users” folder on Windows and macOS systems.
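
You can see these top-level folders for yourself by listing the root directory (the exact contents vary from system to system):

```shell
# List the system directories directly under the root "/"
ls /
# Typical entries on a Linux system (exact list varies):
# bin  boot  dev  etc  home  lib  proc  sbin  tmp  usr  var
```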

Let’s login and open the terminal to start learning some commands and practice navigating the file/folders.

Anatomy of a Linux Command

Linux follows rules for command syntax. Let’s take a look at the cp command example below.


cp -R myFolder ~/Somewherelse/tocopy

The first part of the command, cp, is the Linux program/utility you are telling Linux to run. These programs need to be accessible via the $PATH environment variable (more on this later).

The second part of a command, which is optional, is the set of command-line flags. In this example, -R is the command-line flag. Flags let users modify the default behavior of a command in some way; in this case, -R tells cp to copy files/folders recursively (more on this later).

The third part of a command is the command-line arguments. In this case, we have two arguments, myFolder and ~/Somewherelse/tocopy. This instructs the cp command to copy the data in myFolder to the location ~/Somewherelse/tocopy.

First Linux/UNIX commands

On a Linux system, the default directory when logging into the system is your user’s home folder. This is signified by the tilde ~ character.

(base) hvasquezgross@jimkent:~$ 

To find out the full path of where you are located on the operating system, use the pwd command.

(base) hvasquezgross@jimkent:~$ pwd
/data/gpfs/home/hvasquezgross

This full path mirrors the file/folder structure described above.

Let’s make a folder called “training” for this workshop using the mkdir command.

(base) hvasquezgross@jimkent:~$ mkdir training

Use the ls command to verify that the training folder has been created:

(base) hvasquezgross@jimkent:~$ ls
b2gFiles      bin       Desktop    Downloads   miniconda3  perl5  Public    snap  temp  tmp        Videos
b2gWorkspace  Blast2GO  Documents  git         igv         Music  Pictures  R     src   Templates  training

What is a shell?

Taken from Wikipedia: A Unix shell is a command-line interpreter or shell that provides a command line user interface for Unix-like operating systems. The shell is both an interactive command language and a scripting language, and is used by the operating system to control the execution of the system using shell scripts.

So the commands we were running above were sent to the Linux operating system by a shell. The default shell on Pronghorn is BASH. For today, we will be issuing commands interactively. However, in the following meeting, we will create “bash scripts” in order to automate our analyses (and get them to run on Pronghorn as scheduled jobs).

Command line options & TAB completion

Notice that the ls command lists the files. However, sometimes we want more information than just what is in a folder. For example, we may want to know file timestamps, file sizes, and user permissions.

To get more information, we will use command-line flags, which change the way ls operates. Try running the ls -alh command. The “a” flag lists all files (including hidden files, whose names begin with a dot), the “l” flag uses the long format, and the “h” flag outputs file sizes in human-readable units rather than raw bytes.

(base) hvasquezgross@jimkent:~$ ls -alh
total 508K
drwxr-xr-x 46 hvasquezgross hvasquezgross 4.0K May 13 13:48 .
drwxr-xr-x  6 root          root          4.0K Sep  3  2021 ..
drwxrwxr-x  3 hvasquezgross hvasquezgross 4.0K Dec 16 11:03 .aspera
drwxr-xr-x  5 hvasquezgross hvasquezgross 4.0K Apr 25 13:34 b2gFiles
drwxrwxr-x  2 hvasquezgross hvasquezgross 4.0K Nov  1  2021 b2gWorkspace
-rw-------  1 hvasquezgross hvasquezgross  56K May 11 14:47 .bash_history
-rw-r--r--  1 hvasquezgross hvasquezgross  220 Jul 30  2021 .bash_logout
-rw-r--r--  1 hvasquezgross hvasquezgross 5.0K Mar 24 14:08 .bashrc
drwxrwxr-x  4 hvasquezgross hvasquezgross 4.0K May 12 15:34 bin
drwxr-xr-x  8 hvasquezgross hvasquezgross 4.0K Nov  1  2021 Blast2GO

Let’s say we want to see if we have any files in our “Downloads” folder. We can list files in the Downloads folder by typing Downloads after ls to specify we want to list files in that folder.

(base) hvasquezgross@jimkent:~$ ls Downloads
 1098_2021.pdf
 1151851s_t7-9_f9-10.zip
 16654993.zip

Since the Downloads folder is in our home folder, we can use TAB COMPLETION to auto-complete the name. Test this by typing “ls Down” then press TAB and you will see the rest of the name auto-complete.

Now try typing ls down and press TAB. Did anything happen?

No. This is because in Unix, files, folders, and commands are CASE SENSITIVE.

Piping commands into other commands

You can view a history of all the commands run during your terminal session. Type history to view it.

(base) hvasquezgross@jimkent:~/training$ history

[edited out]

 2083  cat second.txt >> third.txt 
 2084  rm third.txt 
 2085  cat data.txt > third.txt
 2086  cat second.txt >> third.txt 
 2087  wc data.txt 
 2088  wc third.txt 
 2089  grep "Unix" data.txt 
 2090  grep "Unix" third.txt 
 2091  cat data.txt 
 2092  history
(base) hvasquezgross@jimkent:~/training$ 

You can rerun a particular command by using the bang (!) and then the number of the command you want to rerun. Example: !2092

You can also rerun the most recent command that starts with certain letters by using the bang (!) followed by those first few letters. Example: !grep

When you run the history command, you will notice a lot of text gets printed to the screen. This isn’t particularly informative if you want to see some of the first commands you ran in your history.

So, how do we view these first commands?

You can scroll up in your terminal session, but imagine if you had 20,000 commands you ran. That would be a lot of scrolling!

We can use command-line redirection as before, to make a text file with the contents from the history command. Let’s do that now and name the output history.txt.

(base) hvasquezgross@jimkent:~/training$ history > history.txt
(base) hvasquezgross@jimkent:~/training$ ls
data  history.txt

We can use the head and tail commands to look at the first lines or last lines of a file

(base) hvasquezgross@jimkent:~/training$ head history.txt
(base) hvasquezgross@jimkent:~/training$ tail history.txt

Since cat outputs the entire file contents to the terminal, we don’t want to use cat to view this file.

Let’s use a new utility called less or more which allows paging through files.

(base) hvasquezgross@jimkent:~/training$ less history.txt 

This will open the file for viewing. You can scroll using arrow keys to navigate single line by line or PageUp/PageDown keys and CTRL+F or CTRL+B to scroll through pages at a time.

You can search within less by using the “/” then type the search term and press ENTER.

In order to quit less, press the “q” key to exit and return back to the terminal command prompt.

It’s a bit clunky having to create a “history.txt” file just to see the first commands that were run, and we don’t necessarily want to save this file either.

Wouldn’t it be nice if we could temporarily view the history using less, so we can page through the contents?

We can, using the pipe character |, which on most US keyboards shares a key with the backslash, between Backspace and Enter (type it with Shift+\).

The | tells Unix to take the output from the first command and use it as the input to the second command.

Let’s do this with history and less.

(base) hvasquezgross@jimkent:~/training$ history | less

You will notice this works the same as when we used less before with the “history.txt” file. However, now we did not need to create the file first, then open it with less.
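
The pipe isn’t limited to less; any command that reads input can sit on the right-hand side. A couple of illustrative combinations with commands you already know:

```shell
history | wc -l    # count how many commands are in your history
history | head     # show only the first 10 entries of the history
```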

Miscellaneous Unix power commands

The following examples introduce some other Unix commands, and show how they could be used to work on a fictional file called file.txt. These are not all real world cases, but rather show the diversity of Unix command-line tools, as well as stringing multiple different commands together to obtain a desired result

View the penultimate 10 lines of a file, i.e. lines 11-20 counting from the end (combining the tail and head commands):


tail -n 20 file.txt | head

Show lines of a file that begin with a start codon (ATG) (the ^ matches patterns at the start of a line):


grep "^ATG" file.txt

Cut out the 3rd column of a tab-delimited text file and sort it to only show unique lines (i.e. remove duplicates):


cut -f 3 file.txt | sort -u

Count how many lines in a file contain the words ‘cat’ or ‘bat’ (-c option of grep counts lines):


grep -c '[bc]at' file.txt

Turn lower-case text into upper-case (using tr command to ‘transliterate’):


cat file.txt | tr 'a-z' 'A-Z'

Change all occurrences of ‘Chr1’ to ‘Chromosome 1’ and write the changed output to a new file (using the sed command; the trailing g flag replaces every occurrence on a line, not just the first):


cat file.txt | sed 's/Chr1/Chromosome 1/g' > file2.txt
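
Stringing several of these together is where the command line shines. As a hypothetical example (assuming file.txt held FASTA-style sequence records whose header lines begin with “>”), you could count the unique sequence IDs in one pipeline:

```shell
# Create a tiny FASTA-style sample file to demonstrate (made-up data)
printf '>seq1\nATG\n>seq2\nCCC\n>seq1\nAAA\n' > file.txt

# grep keeps the header lines, sed strips the leading ">",
# sort -u removes duplicate IDs, and wc -l counts what remains
grep "^>" file.txt | sed 's/^>//' | sort -u | wc -l
# prints 2 (seq1 and seq2)
```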

RECAP

Now, you should know how to use the following commands: ls, cd, cp, nano, tree, rm, mkdir, cat, wc, grep, echo, head, tail, history, less, more.

If you ever forget how to use a command, use the man command followed by the command you want to look up to pull up its manual page.

Try this now with the cp command.

(base) hvasquezgross@jimkent:~/training$ man cp
NAME
       cp - copy files and directories

SYNOPSIS
       cp [OPTION]... [-T] SOURCE DEST
       cp [OPTION]... SOURCE... DIRECTORY
       cp [OPTION]... -t DIRECTORY SOURCE...

DESCRIPTION
       Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --archive

This will conclude our first day.

Below is a BASH commands cheat sheet.