How does one organize a bioinformatics project?

Perhaps this is the first time you've seen a bioinformatics project through from start to finish, and that's why you're interested in this tutorial. Fear not: while organization is often one of the more time-consuming and frustrating parts of project startup, it often ends up being a bioinformatician's saving grace. The more you invest in quality organization early in the project, the better others (and your future self) will be able to understand and use your workflow.

For this project, we're going to follow the same organizational principles that will be recommended for the bioinformatics challenge. You can follow along with this organization in a sample GitHub directory made especially for the project, which can be found at this link.

When we first started out our repository, we made a bunch of folders with a structure like this:

├── data
├── envs
├── jupyter-notebooks
├── output
└── scripts

Each of the arms represents a folder inside our project directory, called "bvcn-sample" in this case.
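
If you'd rather not click through a file browser to make these folders, the layout above can be created with a few lines of Python. This is only a minimal sketch: the function name create_project_skeleton is made up for illustration, and the folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the tree shown above
SUBFOLDERS = ["data", "envs", "jupyter-notebooks", "output", "scripts"]


def create_project_skeleton(root):
    """Create the project directory and its sub-folders; return the folder paths.

    Hypothetical helper for illustration. Folders that already exist are left alone.
    """
    root = Path(root)
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        # parents=True also creates the project directory itself if it is missing
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_project_skeleton("bvcn-sample") from the place where you keep your projects would give you the same five folders. Because existing folders are left untouched, it should be safe to run more than once.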

The file is just a place for us to document our steps so that someone else (or, again, our future selves) could retrace them. After all, this is what reproducibility is all about!

The sub-folders are where the real magic happens. The first folder, data, is of course where we will store all of our input observations from the experiment. In some cases, these files might be a little (or a lot) too big for GitHub, which has a 100 MB limit for individual files. So raw sequencing data is probably a no-go, but processed data that you're using as input to your workflow is a-okay.
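
Before committing anything in data, it can help to check whether any file is over that limit. Here is a rough sketch of one way to do that. The helper name find_oversized_files is hypothetical, and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption on our part.

```python
from pathlib import Path

# Assumption: "100 MB" is read here as 100 * 1024 * 1024 bytes
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(folder, limit_bytes=GITHUB_LIMIT_BYTES):
    """Return a sorted list of (path, size_in_bytes) for files larger than limit_bytes.

    Hypothetical helper for illustration; searches `folder` and all of its sub-folders.
    """
    oversized = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((path, size))
    return sorted(oversized)
```

Running find_oversized_files("data") from the project directory returns an empty list when everything is small enough to push. Anything it does report is a candidate for storing somewhere other than the repository.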

The envs folder applies if you're using environments for your project. You can use these environments to:

  • Create Jupyter notebooks, although this may be more trouble than it's worth
  • Activate when you are running scripts
  • Help you remember or remind users what kind of software they will need to run your code
    • And what version you used back in the day when the code was written (the longer you stick with bioinformatics, the more surprised you'll be at how quickly things change!)
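
To make the point about versions concrete, here is a small sketch that looks up which versions of a few packages are installed in whatever environment is currently active. The function name record_versions is made up for this example. An environment file exported by your environment manager is the more usual way to keep this record, so treat this as a quick check rather than a replacement.

```python
import sys
from importlib import metadata


def record_versions(packages):
    """Map each package name to its installed version, or None if it is not installed.

    Hypothetical helper for illustration; also records the running Python version.
    """
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment
            versions[name] = None
    return versions
```

For example, record_versions(["numpy", "pandas"]) returns a dictionary that you could print or save alongside your files in envs. The package names here are only placeholders for whatever your own project depends on.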

The jupyter-notebooks folder is where we'll store our inline coding notebooks. These contain a mixture of Markdown and code. The code is typically Python, but it can be any language that you can install a kernel for. The kernel is the backend of the notebook, and it does all the hard work of compiling or interpreting your code for you. Those terms are just jargon for error-checking your code and then translating it into a language that the computer can understand. Believe it or not, Python and the other languages that we bioinformaticians work in are just plain English as far as the machine is concerned, and it prefers to work in assembly and binary.

The output folder is made for just what it sounds like: all of the output files and figures we're bound to generate as we're analyzing the data. Finally, the scripts folder is for plain Python files that we can execute separately, for example on a high-performance cluster. Having these separate keeps us a little more organized, and tells us where we should go if we're looking for pretty output and interpretable results (in the Jupyter notebooks aisle, which might take a little longer to load), versus if we're looking for some of the heavy lifting of data manipulation or companions to our main analysis.
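
As a rough illustration of what might live in scripts, here is a sketch of a plain Python file that reads one input file and writes a small summary to a second file. Everything in it is hypothetical: the names summarize and main, the choice to count non-empty lines, and the layout of the output file are all stand-ins for whatever your own analysis needs.

```python
import argparse
from pathlib import Path


def summarize(input_path, output_path):
    """Count the non-empty lines in input_path and write the count to output_path."""
    lines = Path(input_path).read_text().splitlines()
    n_records = sum(1 for line in lines if line.strip())
    output_path = Path(output_path)
    # Create the destination folder if it does not exist yet
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(f"non_empty_lines\t{n_records}\n")
    return n_records


def main(argv=None):
    """Parse command-line arguments (or the list argv) and run the summary."""
    parser = argparse.ArgumentParser(description="Summarize one input file.")
    parser.add_argument("input", help="file to read, for example one kept under data")
    parser.add_argument("output", help="file to write, for example one under output")
    args = parser.parse_args(argv)
    return summarize(args.input, args.output)
```

To run a file like this from the command line or a cluster job, you would add a call to main() at the bottom, usually under an if __name__ == "__main__": guard. Keeping the work inside functions also means a notebook can import and reuse summarize instead of copying the code.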