Breaking Down the Process

Here, we'll walk through the bioinformatics process used by the authors, step by step, and provide one example workflow that you can use to do the initial "bioinformatic analysis" the authors describe yourself. To do this, we'll use bash, which is a way to talk to the machine when you're using an operating system based on some form of Unix. This includes macOS, if you're using the Terminal, and the command line interface on Linux systems. bash isn't so much a programming language as it is a way to talk to the computer. It's a command interpreter, or a translator of commands, but it also supports the kinds of patterns we expect in code, including loops, if-statements, and variable assignment.

Don't worry, we won't do anything too crazy in bash. On this page, you'll find everything you need to know to use bash well enough to create the script that we wrote to analyze the authors' data.

Starting a bash script: bash headers

Especially if you have any experience coding in other languages, the bash header can look pretty confusing. Below is what our header looks like for this tutorial and something that you might expect to see or use when you're building bash code more generally.

#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --qos=unlim
#SBATCH --time=10000
#SBATCH -n 8

The first line,

#!/bin/bash

signals that we're using bash to run this file. If we were running Perl or another language, we would put the path to that interpreter here instead. /bin/bash is the path to the bash application, which most of the time we assume is installed at that default path (but if, for some reason, you only had bash installed somewhere else, you would change it).

The #! character combination has an awesome name, the Shebang! It's just the script's way of saying "Run me using XX application", in this case bash.
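If you want to check where bash lives on your own machine before writing the header, something like the following can help. This is only a minimal sketch; the variable name bash_path is our own, chosen for illustration.

```shell
# Ask the shell where it finds the bash executable.
# On many systems this is /bin/bash, but it can differ.
bash_path="$(command -v bash)"
echo "bash is installed at: $bash_path"
```

Whatever path this prints is the one that would go after the #! in the first line of the script.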

The rest of the lines in the header relate to the scheduler. In this example, we're using SLURM, a very popular scheduler, or job submission manager, used on high-performance computing systems.

The first #SBATCH line tells the scheduler which partition we want the job to run on; a partition is basically one of several groups of machines that make up the big computing system. The second sets the quality of service (QOS), which on many clusters governs the limits and priority our job gets. The third gives the maximum amount of time we want this job to be able to run, which SLURM reads as minutes when it is written as a plain number (the job will stop early if it finishes). The last gives the number of tasks we want to be able to use for work that can run in parallel, that is, multiple tasks at the same time.
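Inside a running job, SLURM makes some of these settings available as environment variables. As a hedged sketch, the snippet below reads SLURM_NTASKS (which SLURM sets when -n is given) and falls back to 1 when the script is run outside the scheduler. The variable name num_tasks is our own and not part of SLURM.

```shell
# SLURM_NTASKS is set by SLURM when the job requests tasks with -n.
# The ${VAR:-default} form falls back to 1 if it is unset or empty,
# for example when running the script directly on a laptop.
num_tasks="${SLURM_NTASKS:-1}"
echo "Running with $num_tasks task(s)"
```

A script with a header like ours is typically handed to the scheduler with the sbatch command, for example sbatch my_script.sh, where my_script.sh is a placeholder file name.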

Defining Variables

bash is a little particular about how certain variable types are defined.

  • Never leave spaces between the variable name, the equals sign, and the value you're assigning to the variable
  • Variables never need a declared type (unlike in Java; don't worry about it if you have no idea what this is talking about)
  • Lists are defined as space-separated strings

Here's an example list:

myList="dog cat frog"
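The rules above can be sketched in a few lines. The variable count is made up for illustration; the dollar sign is how bash reads a variable's value back.

```shell
# Correct: no spaces around the equals sign.
myList="dog cat frog"
count=3    # no type declaration needed

# Incorrect (left commented out): with spaces, bash would try to run
# a command called myList instead of assigning a variable.
# myList = "dog cat frog"

# Put a dollar sign in front of the name to read the value.
echo "My list is: $myList"
echo "It has $count items"
```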

Loops in bash

Loops in bash must have three key lines: for, do, and done (yay!)

for curr in $myList
do
    echo "$curr"
done

You'll notice we put a dollar sign before "myList". This is how we refer to a variable's value in bash; without the dollar sign, bash would treat myList as plain text. We leave $myList unquoted in the for line so that bash splits it into separate words at the spaces, and we put "$curr" in double quotes so that each item is passed to echo as a single piece of text. Commands like echo can be called directly from bash, unlike in programming languages like Python, where you would usually have to go through a module, as in os.system (again, don't worry about it if you don't know what this means! You'll learn!).
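Since bash also supports if-statements, loops and conditions can be combined. The sketch below walks the same example list and counts the items longer than three characters. The variable longCount and the length cutoff are invented for illustration and are not part of the authors' analysis.

```shell
myList="dog cat frog"
longCount=0

# $myList is left unquoted so bash splits it into separate words.
for curr in $myList
do
    # ${#curr} is the number of characters in the current item.
    if [ "${#curr}" -gt 3 ]
    then
        echo "$curr is longer than three letters"
        longCount=$((longCount + 1))
    fi
done

echo "Found $longCount long item(s)"
```

Like the loop, the if-statement has its own opening and closing words: if, then, and fi.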