Stumbling blocks getting started

Before writing our first line of code, we have to get a not-insubstantial initial stumbling block out of the way: software downloads and versioning.

For our bash script, here is the software we need, with the versions listed by the authors:

Software	Version
PRINSEQ	0.19.3
cmpfastq	Unspecified
TopHat	2.0.4
Cufflinks	2.0.2
cuffmerge	Unspecified
cuffdiff	Unspecified
SAMtools	0.1.18
IGVtools	2.1
Python	Unspecified

Now, that table isn't totally fair, because cuffmerge and cuffdiff come with Cufflinks, and cmpfastq probably doesn't have multiple versions (though one day it might). A pretty important problem here, though, is that the authors do not mention what version of Python they are using, and if you try to install this list of software naïvely, you are most definitely going to get versioning errors with other associated software. Not to mention, TopHat is simply not available for any of the several most recent versions of Python. Some adjustments are necessary here.

Getting past this with a `conda` environment

There are many ways to set up an environment, or a set of packages pre-packaged for you, meant to avoid such versioning issues as this TopHat-Python conflict. Many of these are probably better than conda. But, here, conda is what we are going to use, because all of the packages in the table above are readily accessible through Anaconda Cloud.

To do this, we are going to create a yaml file (sometimes called a YML file; either extension will work). YAML calls itself a "data serialization language", and its name is the cute, recursive acronym "YAML Ain't Markup Language". That's really all you need to know. It's a way of organizing data.

So we need an environment file, which we can then feed into conda to tell it to download all the packages we ask for. Oh, and remember that "envs" folder from the organization page? This is where we'll store the environments we create. Now, let's construct our yaml file.

name: bvcnenv
channels:
    - bioconda
    - conda-forge
dependencies:
    - python = 2.7
    - matplotlib = 1.5.3 # compatible with Python 2.7
    - prinseq = 0.20.4 # 0.19.3 not available from Anaconda
    - tophat = 2.0.13 # 2.0.4 not available from Anaconda
    - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
    - samtools = 0.1.18
    - igvtools = 2.3.16 # 2.1 not available from Anaconda
    - sra-tools = 2.10.3

The "name" line specifies what we want our environment to be called when we build it using this file. "channels" are spaces within Anaconda holding packages, so these are the two we wish to use to obtain the list of software under "dependencies". The version of each of the packages we want is specified after the equal sign.

You'll notice right away that there are some...inconsistencies. Several of the packages we wanted weren't even available as that version on Anaconda Cloud anymore. Now, we could probably go to the git repositories for some of them and get the version the authors originally used, but that's asking for even more versioning issues than we already have. So we'll try first with these slightly updated versions, using the closest we can to the actual versions the authors used. It will be interesting anyway to see how/whether that impacts the results.
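One way to see which versions a channel actually carries before pinning them is `conda search`. The loop below is a minimal sketch, assuming conda is installed and that bioconda is the channel of interest; the `packages` variable is introduced here purely for illustration, and the output depends on whatever the channel serves on the day you run it.

```shell
# Packages from the table above; cuffmerge and cuffdiff ship with Cufflinks.
packages="prinseq tophat cufflinks samtools igvtools"

for pkg in $packages; do
    echo "== $pkg =="
    if command -v conda >/dev/null 2>&1; then
        # Lists every version and build of $pkg that bioconda currently serves.
        conda search -c bioconda "$pkg" || echo "nothing found for $pkg"
    else
        echo "conda is not on the PATH; skipping"
    fi
done
```

Comparing that output against the authors' versions shows quickly which pins have to be loosened.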

The great disaster

It turns out that trying to use matplotlib and sra-tools (which is what we'll use to download the fastq files from the SRA website) at the same time as all of these packages is extremely problematic, and involves manually specifying the versions of many dependencies. This is how far I got before I gave up using conda for this. Conda just doesn't carry a comprehensive enough range of versions to make this work.

name: bvcnenv
channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    #- pip
    - python = 2.7
    #- matplotlib <= 1.5.3 #1.5.3 # compatible with Python 2.7
    - numpy = 1.11.0 # required by matplotlib
    - libstdcxx-ng >= 7.3.0 # required by Python
    - ca-certificates
    - ncbi-ngs-sdk
    - pypy2.7 # required by matplotlib
    - libgcc-ng >= 7.3.0
    - zlib = 1.2.11
    - bzip2 = 1.0.6
    - bowtie2 <= 2.2.5 # required to be older by TopHat
    - sqlite = 3.25.3
    - tk = 8.6.10 # only version accepted by Python and matplotlib
    - perl <= 5.22.0 # unresolvable conflict between sra-tools and prinseq; removed sra-tools and did fastq-dump separately
    #- perl-xml-libxml
    - perl-threaded
    - pytz
    - readline = 7.0
    - libpng = 1.6.35 #1.6.21
    - libffi = 3.2.1
    - libedit >= 3.1.20170329
    - dbus = 1.13.2
    - sip = 4.18
    - libxml2 = 2.9.10
    - libiconv = 1.15
    - qt = 5.9.7
    - ncurses = 6.1
    - pyqt #= 4.11
    - bzip2 = 1.0.6
    - openssl = 1.1 # 1.0.2p changed several times
    - freetype = 2.9 # 2.8 #2.7
    - prinseq #= 0.20.4 # 0.19.3 not available from Anaconda; going with most recent 0.20
    - tophat = 2.0.13 # 2.0.4 not available from Anaconda
    - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
    - samtools = 0.1.18
    - igvtools = 2.3.16 # 2.1 not available from Anaconda
    #- sra-tools = 2.10.3

Don't get bogged down by all that. The point of including it is (1) to show the shortcomings of conda environments, which are not very good at figuring out what you mean when you specify multiple packages from multiple time periods, and (2) to show that people who have a fair amount of experience working with software still struggle with these kinds of versioning issues, so don't worry if it ends up happening to you!

The solution: a simpler environment file

To solve the problem, we're going to be less ambitious. The new file can be found in the GitHub repository under envs/simplerenv.yaml, and looks like this:

name: bvcnenv
channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    - python = 2.7
    - prinseq #= 0.20.4 # 0.19.3 not available from Anaconda; going with most recent 0.20
    - tophat = 2.0.13 # 2.0.4 not available from Anaconda
    - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
    - samtools = 0.1.18
    - igvtools = 2.3.16 # 2.1 not available from Anaconda

We're still dealing with the fact that we can't get the version we want for all of the packages because they aren't available on conda. Because I'm working on an HPC system, I can't use pip to install these, either, so I'd have to go through and grab all of the archived zip files. Instead, let's try it out this way, and cross that bridge when we come to it.

To create this conda environment and install all the software we need, we execute:

conda env create -f simplerenv.yaml

We run that in the directory where we've placed the file. Then, to use our shiny new environment, we execute:

conda activate bvcnenv

We use "bvcnenv" because that is the name we gave the environment in the yaml file. See, that...wasn't too hard, was it...?
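To confirm that the environment really contains the versions we pinned, we can filter the output of `conda list`. This is a sketch under the assumption that the environment is named bvcnenv as above; the `pattern` variable is just a convenience introduced here.

```shell
# Matches the start of a `conda list` row for each tool we pinned.
pattern='^(python|prinseq|tophat|cufflinks|samtools|igvtools) '

if command -v conda >/dev/null 2>&1; then
    # -n selects the environment by name, so this works without activating it.
    conda list -n bvcnenv | grep -E "$pattern" || echo "no pinned packages found"
else
    echo "conda is not on the PATH"
fi
```

Each matching row shows the package name, the installed version, the build string, and the channel it came from.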

Side note: installing these packages one by one usually works, because conda then only has to resolve the dependencies of one new package at a time. It would have been easier for me to just install the packages I wanted manually, and then save the environment that I created, rather than trying to create the yaml file a priori. This is actually what I ended up doing with sra-tools, so you can choose to do that, rather than having to use a separate environment for downloading the fastq files.
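The one-by-one approach could look roughly like the following. This is a hedged sketch, not the exact sequence I ran: `build_env_stepwise` is a hypothetical helper name, the pins are copied from simplerenv.yaml, and the install order is an assumption.

```shell
# Hypothetical helper: build the environment one package at a time.
build_env_stepwise() {
    env_name="$1"
    # Start from a bare Python 2.7 environment.
    conda create -y -n "$env_name" -c conda-forge python=2.7 || return 1
    # Same pins as simplerenv.yaml; the install order here is an assumption.
    for spec in samtools=0.1.18 tophat=2.0.13 cufflinks=2.2.1 igvtools=2.3.16 prinseq; do
        conda install -y -n "$env_name" -c bioconda -c conda-forge "$spec" || return 1
    done
}

# Usage (not run automatically, because it downloads packages):
#   build_env_stepwise bvcnenv
```

If one install fails, the function stops there, so it is clear which package caused the conflict.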

The moral of the story? Please run `conda env export > environment_name.yaml` when you finish installing software for your project. That saves the name and version of all of the relevant software in your current environment to a yaml file, and it saves everyone involved a lot of pain, including your mentors and helpers! The same applies if you're using any other tool for software management.
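As a sketch of what that export step can look like in practice, the hypothetical helper below writes two files: a full snapshot and a shorter list of only the packages that were explicitly requested. It assumes a reasonably recent conda (the `--from-history` option is not present in older releases), and the function name, the "envs" default, and the output file names are illustrative choices.

```shell
# Hypothetical helper: export an environment to yaml files under a folder.
export_env() {
    env_name="$1"
    out_dir="${2:-envs}"
    mkdir -p "$out_dir"
    # Full snapshot: every package, dependencies included, without build strings.
    conda env export -n "$env_name" --no-builds > "$out_dir/$env_name.yaml" || return 1
    # Only the packages that were explicitly asked for on the command line.
    conda env export -n "$env_name" --from-history > "$out_dir/$env_name.requested.yaml" || return 1
}

# Usage (requires conda and an existing environment):
#   export_env bvcnenv envs
```

The full snapshot is the better record of what was actually installed; the shorter file tends to be easier to rebuild on a different machine.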
