Here, we'll walk through the bioinformatics process used by the authors, step by step, and provide one example workflow that you can use to do the initial "bioinformatic analysis" the authors describe yourself. To do this, we'll use bash, which is a way to talk to the machine when you're using an operating system built on some form of Unix. This includes macOS, if you're using the Terminal app, and the command line interface on a Linux system.
bash isn't so much a programming language as it is a way to talk to the computer. It's almost an interpreter, or a translator of commands, but it also supports the types of patterns we expect in code, including loops, if-statements, and variable assignment.
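As a quick taste of those three patterns, here is a minimal sketch. The variable names and the list of animals are made up for illustration and are not part of the authors' analysis.

```shell
#!/bin/bash
# Variable assignment: note there are no spaces around the equals sign
count=0

# A loop: visit each word in turn and count it
for animal in dog cat frog
do
    count=$((count + 1))
done

# An if-statement: -gt means "greater than" for whole numbers
if [ "$count" -gt 2 ]
then
    verdict="more than two items"
else
    verdict="two items or fewer"
fi

echo "Counted $count items: $verdict"
```

Running it should print a single line reporting three items.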
Don't worry, we won't do anything too crazy in bash. On this page, you'll find everything you need to know to use bash well enough to create the script that we wrote for analyzing the authors' data.
The bash header can look pretty confusing, especially if your coding experience is in other languages. Below is the header we use for this tutorial, which is also the kind of thing you might expect to see or use when you're writing bash code more generally.
#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --qos=unlim
#SBATCH --time=10000
#SBATCH -n 8
The first line, #!/bin/bash, signals that we're using bash to run this file. If we were running Perl or another language, we would put the path to that program here instead.
/bin/bash is the path to the bash application, which most of the time we assume is installed at that default location (if, for some reason, you only had bash installed somewhere else, you would change the path to match).
The #! character combination has an awesome name, the Shebang! It's just the script's way of saying "Run me using XX application", in this case bash.
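If you're not sure where bash is installed on your machine, you can ask the shell. This is a small sketch that assumes bash is somewhere on your PATH; the variable name bash_path is just for illustration.

```shell
# Look up the full path of the bash that runs when you type "bash"
bash_path=$(command -v bash)
echo "bash lives at: $bash_path"
```

If this prints something other than /bin/bash, that path is what would go after the #! in your header.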
The rest of the lines in the header relate to the scheduler. In this example, we're using SLURM, a very popular scheduler, or job submission manager, used on high-performance computing systems.
The first #SBATCH line tells the scheduler which partition we want the job to run on, which is basically just one of a series of groups of machines available on the big computing system. The second names the quality of service (QOS) we're requesting, which determines the limits and priority that apply to our job. The third tells the scheduler the maximum amount of time we want this job to be able to run (a bare number is read as minutes, and the job will stop early if it finishes), and the last gives the number of tasks we want to be able to run in parallel, that is, at the same time.
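To make the header a little more concrete, here is a hedged sketch of a tiny script that uses it. The partition and QOS names come from our header and will differ on other clusters. SLURM_NTASKS is an environment variable SLURM sets inside a job; the other variable names are ours, chosen for illustration.

```shell
#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --qos=unlim
#SBATCH --time=10000
#SBATCH -n 8

# To bash, every #SBATCH line above is just a comment; only the scheduler reads them.

# Inside a SLURM job submitted with -n, SLURM_NTASKS holds the task count.
# Fall back to 1 so the script also runs outside the scheduler.
ntasks="${SLURM_NTASKS:-1}"
echo "Tasks available: $ntasks"

# A bare number given to --time is read as minutes, so convert it for readability.
time_limit_minutes=10000
limit_days=$((time_limit_minutes / 1440))
limit_hours=$(( (time_limit_minutes % 1440) / 60 ))
limit_minutes=$((time_limit_minutes % 60))
echo "Time limit: ${limit_days}d ${limit_hours}h ${limit_minutes}m"
```

Saved under a name like analysis.sh (a hypothetical filename), a script like this would be handed to the scheduler with sbatch analysis.sh. The time conversion shows that our limit of 10000 minutes is a little under seven days.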
bash is a little particular about how certain variable types are defined.
- Never leave spaces between the variable name, the equals sign, and the value you're assigning
- Variables never need a declared type (unlike in Java; don't worry about it if you have no idea what this is talking about)
- Lists are defined as space-separated strings
Here's an example list:
myList="dog cat frog"
for curr in $myList
do
    echo "$curr"
done
You'll notice we put a dollar sign before "myList". This is how we refer to a variable in bash: the dollar sign differentiates the variable name from commands, which can be called directly from bash, unlike in programming languages like Python, where you would probably have to call a command through a package, as in os.system (Python; again, don't worry about it if you don't know what this means! You'll learn!). We also put double quotes around "$curr" inside the loop so that its value is passed to echo as a single piece of text, while $myList is left unquoted on purpose so that bash splits it on spaces into separate items.
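To see what the quotes actually change, here is a small sketch that loops over the same list with and without them. The counter names are made up for this example.

```shell
#!/bin/bash
myList="dog cat frog"

# Unquoted: bash splits the string on spaces, so the loop runs three times
unquoted_count=0
for curr in $myList
do
    unquoted_count=$((unquoted_count + 1))
done

# Quoted: the whole string stays together, so the loop runs only once
quoted_count=0
for curr in "$myList"
do
    quoted_count=$((quoted_count + 1))
done

echo "unquoted: $unquoted_count, quoted: $quoted_count"
```

This is why a space-separated string works as a list only when it is left unquoted in the for line.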