Before writing our first line of code, we have to get a substantial initial stumbling block out of the way: software downloads and versioning.
bash script, here is the software we need, along with the versions listed by the authors:
Now that isn't totally fair, because cuffdiff will come with cufflinks, and cmpfastq probably doesn't have multiple versions (though one day it might). A pretty important problem here, though, is that the authors do not mention which version of Python they used, and if you try to install this list of software naïvely, you will almost certainly run into version conflicts with other associated software. Not to mention, TopHat is simply not available for any of the several most recent versions of Python. Some adjustments are necessary here.
Getting past this with a conda environment
There are many ways to set up an environment, that is, a set of packages bundled together for you, meant to avoid versioning issues like this TopHat-Python conflict. Many of these are probably better than conda. But here, conda is what we are going to use, because all of the packages in the table above are readily accessible through Anaconda Cloud.
To do this, we are going to create a yaml file (sometimes called a YML file; either extension will work). YAML calls itself a "data serialization language", and its name is the cute, recursive acronym "YAML Ain't Markup Language". That's really all you need to know. It's a way of organizing data.
We'll use it for our environment file, which we can then feed into conda to tell it to download all the packages we ask for. Oh, and remember that "envs" folder from the organization page? This is where we'll store the environments we create. Now, let's construct our environment file:
name: bvcnenv
channels:
  - bioconda
  - conda-forge
dependencies:
  - python = 2.7
  - matplotlib = 1.5.3 # compatible with Python 2.7
  - prinseq = 0.20.4 # 0.19.3 not available from Anaconda
  - tophat = 2.0.13 # 2.0.4 not available from Anaconda
  - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
  - samtools = 0.1.18
  - igvtools = 2.3.16 # 2.1 not available from Anaconda
  - sra-tools = 2.10.3
The "name" line specifies what our environment will be called when we build it from this file. "channels" are repositories on Anaconda Cloud that host packages; these are the two we draw on to obtain the software listed under "dependencies". The version we want of each package is specified after the equals sign.
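To make the layout of the file concrete, here is a small sketch that pulls the package name, the comparison operator, and the version out of each "dependencies" entry. It is only an illustration: the function name parse_dependencies and the EXAMPLE text are made up for this post, and the line-based parsing is an assumption that holds for the flat layout used here, not a real YAML parser and not how conda itself reads the file.

```python
import re

# Matches entries such as "- python = 2.7", "- bowtie2 <= 2.2.5", or a bare "- prinseq".
SPEC = re.compile(r"^\s*-\s*([A-Za-z0-9_.\-]+)\s*(?:(<=|>=|==|=|<|>)\s*(\S+))?\s*$")


def parse_dependencies(text):
    """Return (package, operator, version) tuples from the "dependencies" block.

    Hypothetical helper for illustration; it handles only the flat layout
    used in this post and ignores everything after a "#".
    """
    specs = []
    in_deps = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):   # top-level key such as "name:"
            in_deps = line.strip() == "dependencies:"
            continue
        if in_deps:
            match = SPEC.match(line)
            if match:
                specs.append(match.groups())
    return specs


# A shortened copy of the environment file, used only as sample input.
EXAMPLE = """\
name: bvcnenv
channels:
  - bioconda
  - conda-forge
dependencies:
  - python = 2.7
  - prinseq #= 0.20.4
  - tophat = 2.0.13 # 2.0.4 not available from Anaconda
"""

for package, operator, version in parse_dependencies(EXAMPLE):
    print("%s %s %s" % (package, operator or "(any)", version or ""))
```

An entry with no operator, like the prinseq line in the sample, comes back with None for both the operator and the version, which is the "let conda pick" case.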
You'll notice right away that there are some...inconsistencies. Several of the packages we wanted weren't even available in that version on Anaconda Cloud anymore. Now, we could probably go to the git repositories for some of them and get the version the authors originally used, but that's asking for even more versioning issues than we already have. So we'll try first with these slightly updated versions, staying as close as we can to the versions the authors used. It will be interesting anyway to see whether, and how, that impacts the results.
The great disaster
It turns out that trying to use sra-tools (which is what we'll use to download the fastq files from the SRA website) at the same time as all of these packages is extremely problematic, and involves manually specifying the versions of many dependencies. This is how far I got before I gave up using conda for this. There just aren't enough versions of each package available on conda to make this work.
name: bvcnenv
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  #- pip
  - python = 2.7
  #- matplotlib <= 1.5.3 #1.5.3 # compatible with Python 2.7
  - numpy = 1.11.0 # required by matplotlib
  - libstdcxx-ng >= 7.3.0 # required by Python
  - ca-certificates
  - ncbi-ngs-sdk
  - pypy2.7 # required by matplotlib
  - libgcc-ng >= 7.3.0
  - zlib = 1.2.11
  - bzip2 = 1.0.6
  - bowtie2 <= 2.2.5 # required to be older by TopHat
  - sqlite = 3.25.3
  - tk = 8.6.10 # only version accepted by Python and matplotlib
  - perl <= 5.22.0 # unresolvable conflict between sra-tools and prinseq; removed sra-tools and did fastq-dump separately
  #- perl-xml-libxml
  - perl-threaded
  - pytz
  - readline = 7.0
  - libpng = 1.6.35 #1.6.21
  - libffi = 3.2.1
  - libedit >= 3.1.20170329
  - dbus = 1.13.2
  - sip = 4.18
  - libxml2 = 2.9.10
  - libiconv = 1.15
  - qt = 5.9.7
  - ncurses = 6.1
  - pyqt #= 4.11
  - bzip2 = 1.0.6
  - openssl = 1.1 # 1.0.2p changed several times
  - freetype = 2.9 # 2.8 #2.7
  - prinseq #= 0.20.4 # 0.19.3 not available from Anaconda; going with most recent 0.20
  - tophat = 2.0.13 # 2.0.4 not available from Anaconda
  - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
  - samtools = 0.1.18
  - igvtools = 2.3.16 # 2.1 not available from Anaconda
  #- sra-tools = 2.10.3
Don't get bogged down by all that. The point of including it is (1) to show the limits of conda environments, which struggle to figure out what you mean when you specify multiple packages from multiple time periods, and (2) to show that people who have a fair amount of experience working with software still struggle with these kinds of versioning issues, so don't worry if it ends up happening to you!
The solution: a simpler environment file
To solve the problem, we're going to be less ambitious. The new file can be found in the GitHub repository under envs/simplerenv.yaml, and looks like this:
name: bvcnenv
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python = 2.7
  - prinseq #= 0.20.4 # 0.19.3 not available from Anaconda; going with most recent 0.20
  - tophat = 2.0.13 # 2.0.4 not available from Anaconda
  - cufflinks = 2.2.1 # 2.0.2 not available from Anaconda
  - samtools = 0.1.18
  - igvtools = 2.3.16 # 2.1 not available from Anaconda
We're still dealing with the fact that we can't get the version we want for all of the packages because they aren't available on conda. Because I'm working on an HPC system, I can't use pip to install these, either, so I'd have to go through and grab all of the archived zip files. Instead, let's try it out this way, and cross that bridge when we come to it.
To create this conda environment and install all the software we need, we run the following in the directory where we've placed the file:
conda env create -f simplerenv.yaml
Then, to use our shiny new environment, we run:
conda activate bvcnenv
We activate "bvcnenv" because that is the name we gave our environment in the file. See, that...wasn't too hard, was it...?
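Once the environment is activated, it can be reassuring to check that it really is the one in use and that the tools landed on your PATH. The sketch below is one way to do that, under a few assumptions: conda activate sets the CONDA_DEFAULT_ENV variable to the name of the active environment, the command names in TOOLS are my guess at what these packages install (adjust them as needed), and the helper names find_on_path and report are made up for this post. It avoids newer library calls so that it should also run under the Python 2.7 inside the environment.

```python
import os
import sys

EXPECTED_ENV = "bvcnenv"  # the "name" field of the environment file
# Assumed command names; change them to whatever the packages actually provide.
TOOLS = ["tophat", "cufflinks", "cuffdiff", "samtools", "igvtools"]


def find_on_path(name, path=None):
    """Return the first executable called `name` on a PATH-style string, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for folder in path.split(os.pathsep):
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def report(environ=os.environ):
    """Print the active environment and tool locations; return True if the name matches."""
    active = environ.get("CONDA_DEFAULT_ENV")  # set by `conda activate`
    print("active environment: %s" % active)
    print("python version: %s" % ".".join(str(n) for n in sys.version_info[:3]))
    for tool in TOOLS:
        location = find_on_path(tool, environ.get("PATH", ""))
        print("%s -> %s" % (tool, location or "not found"))
    return active == EXPECTED_ENV


if __name__ == "__main__":
    report()
```

If a tool shows up as "not found", that usually means either the environment is not active in this shell or the package installs its command under a different name.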
Side note: installing these packages one-by-one usually works, because conda then has to work out the dependencies for one package at a time. It would have been easier for me to just install the packages I wanted manually, and then save the environment that I created, rather than trying to create the yaml file a priori. This is actually what I ended up doing with sra-tools, so you can choose to do that, rather than having to use a separate environment for downloading the fastq files.
The moral of the story? Please use conda env export > environment_name.yaml when you finish installing software for your project. It saves everyone involved a lot of pain, including your mentors and helpers! That will save the name and version of all of the relevant software in your current environment to a yaml file. The same thing applies if you're using any other tool for software management.
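If your project already has a Python setup script, the export step can be folded into it so it doesn't get forgotten. Here is a minimal sketch, assuming conda is on your PATH; the function name export_environment and its run parameter are hypothetical, and the run parameter exists only so the function can be tried out without conda installed. It does the same thing as the shell redirect above: it runs conda env export and writes what comes back to a file.

```python
import subprocess


def export_environment(path, name=None, run=subprocess.check_output):
    """Write the output of `conda env export` to `path` and return the command used.

    Hypothetical helper for illustration. `name` selects an environment by
    name; when it is None, conda exports the currently active environment.
    """
    command = ["conda", "env", "export"]
    if name is not None:
        command += ["--name", name]
    text = run(command)
    if isinstance(text, bytes):  # check_output returns bytes by default
        text = text.decode("utf-8")
    with open(path, "w") as handle:
        handle.write(text)
    return command


# Example call (requires conda, so it is left commented out here):
# export_environment("bvcnenv.yaml", name="bvcnenv")
```

Committing the resulting file next to your code gives anyone reproducing the work the exact names and versions you ended up with.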