Search
Population Differentiation (Original R)

All of the code and software usage provided in this script was originally published by the authors (Benestan et al., 2016). Here, I am attempting to get the same results out using that analysis, explaining exactly what is meant by each part, and adding in some new visualizations to emphasize how you might extend this work for your own projects!

The authors include several, separate R scripts to analyze their data. It's up to use to construct the roadmap of how they put it all together.

codedir
├── scrip-rda-neutral.R
├── scrip-rda-selection-legendre.R
├── script_bayescan_laura.r
├── script-AEM.R
└── script-outflank.R

In addition, they provide several simplified text files that contain the environmental data they used for their analyses; in Environment.txt, for example, they list all of the minimum, maximum, and mean temperatures for each of the population localities that they examined, including total year values and just those for the summer.

After the authors used STACKS ("From the data set generated in that study, we developed a set of 13 688 filtered SNPs, which excluded SNPs that were not genotyped in at least 80% of the individuals and 70% of the locations, or did not show a minor allele frequency of at least 0.05 in all locations" (Benestan et al., 2016)) to generate their listing of 13,688 filtered SNPs, the next thing they did, per the paper, was:

used the software outflank (Whitlock & Lotterhos 2015), which calculates a likelihood based on a trimmed distribution of FST values to infer the distribution of FST for neutral markers

The authors have provided an R script for this, so I'll just walk through each step of their script, which is essentially just using the software.

### Dowload libraries
install.packages("devtools")
library(devtools)

###Download source file
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
install_github("whitlock/OutFLANK")

### Install OutFLANK package
library(qvalue)
library(OutFLANK)

Already we've run into a little hiccup: recent versions of R require that you use BiocManager. This is easy to fix. In addition to that, I prefer to use pacman (Rinker & Kurkiewicz, 2018) for downloading packages, a tool which simultaneously downloads and loads packages. I'll do this from now on, regardless of what the authors did in the paper.

pacman::p_load(devtools,qvalue,BiocManager)
BiocManager::install(c("qvalue"))
install_github("whitlock/OutFLANK")
library(OutFLANK)
Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.2 (2019-12-12)

Installing package(s) 'qvalue'

The downloaded binary packages are in
	/var/folders/lg/f38d68k97wlg4hvxl8ff0q480000gn/T//RtmpsYG7sk/downloaded_packages
Old packages: 'ade4', 'backports', 'bit', 'blob', 'boot', 'broom', 'car',
  'caTools', 'class', 'crosstalk', 'dbplyr', 'deSolve', 'diffobj', 'doRNG',
  'ellipsis', 'FactoMineR', 'fitdistrplus', 'foghorn', 'forcats', 'foreach',
  'fs', 'future', 'future.apply', 'gargle', 'ggfortify', 'ggplot2', 'ggpubr',
  'ggrepel', 'git2r', 'glue', 'gplots', 'gtools', 'hexbin', 'igraph',
  'KernSmooth', 'lattice', 'leaps', 'leiden', 'lme4', 'locfit', 'lubridate',
  'maptools', 'MASS', 'matrixStats', 'metap', 'mnormt', 'modelr', 'multcomp',
  'mvtnorm', 'nlme', 'nloptr', 'nnet', 'openxlsx', 'pbkrtest', 'pdftools',
  'pillar', 'pkgbuild', 'pkgmaker', 'plotly', 'plotrix', 'plyr', 'proxy', 'ps',
  'quantreg', 'R.methodsS3', 'RcppAnnoy', 'RcppArmadillo', 'RcppParallel',
  'RcppProgress', 'RCurl', 'rematch2', 'reshape2', 'reticulate', 'rex',
  'RgoogleMaps', 'RJSONIO', 'rlang', 'rngtools', 'robustbase', 'ROCR',
  'rootSolve', 'rrcov', 'rsvd', 'scales', 'Seurat', 'shiny', 'sn', 'sp',
  'spatial', 'statmod', 'survival', 'tidyr', 'tidyselect', 'tinytest',
  'tinytex', 'usethis', 'uwot', 'vctrs', 'withr', 'xfun', 'XML', 'xml2', 'zoo'

Skipping install of 'OutFLANK' from a github remote, the SHA1 (e502e824) has not changed since last install.
  Use `force = TRUE` to force installation

pacman::p_load(vcfR)
# vcfFile <- read.vcfR("https://www.dropbox.com/s/yoy2i6a540zg8bf/13688snps-562individus.recode.vcf?dl=1")
# system(paste0("vcftools --vcf ", vcfFile, " --012 --out 13688snps-582ind.012.txt"))

# For some reason, the authors had the commands just listed out as they would be run in the console.
# To run these directory from R, we need to use `system`. I had to run this on a separate system, as it 
# requires that you have `vcftools` installed locally. 
system("vcftools --vcf ../../lobster-reproduction/data/lobster_download/13688snps-562individus.recode.vcf --012 --out ../../lobster-reproduction/data/13688snps-582ind.012.txt")
Warning message in system("vcftools --vcf ../../lobster-reproduction/data/lobster_download/13688snps-562individus.recode.vcf --012 --out ../../lobster-reproduction/data/13688snps-582ind.012.txt"):
“error in running command”

I think that a better way to do this is to use vcfR, which keeps everything within R, allows us to work straight out of Dropbox, and also doesn't involve odd matrix file formats and several intermediate files:

vcfFile <- read.vcfR("https://www.dropbox.com/s/yoy2i6a540zg8bf/13688snps-562individus.recode.vcf?dl=1")

## Create a genotype matrix using a vcfR utility.
genotypemat <- extract.gt(vcfFile, element = 'GT', as.numeric = FALSE)
Local file 13688snps-562individus.recode.vcf?dl=1 found.

Using this local copy instead of retrieving a remote copy.

Scanning file to determine attributes.
File attributes:
  meta lines: 8
  header_line: 9
  variant count: 13688
  column count: 571
Meta line 8 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 13688
  Character matrix gt cols: 571
  skip: 0
  nrows: 13688
  row_num: 0
Processed variant: 13688
All variants processed
# Any N/A values should be converted to 9s to prevent OutFLANK from complaining
genotypemat[is.na(genotypemat)] = 9

# Unlike vcftools, vcfR returns genotypes in 0/1 format, which requires that we convert them to just 
# one numeric value.
mutatedgeno = data.frame(genotypemat)
for (c in 1:ncol(mutatedgeno)) {
    mutatedgeno[,c] = as.character(mutatedgeno[,c])
    for (r in 1:nrow(mutatedgeno)) {
        mutatedgeno[r,c] = sum(as.numeric(as.character(unlist(strsplit(as.character(mutatedgeno[r,c]), "/")))))
    }
}
head(mutatedgeno)
A data.frame: 6 × 562
ANT_02ANT_03ANT_04ANT_09ANT_11ANT_12ANT_14ANT_15ANT_16ANT_18TRI_36TRI_37TRI_38TRI_39TRI_41TRI_42TRI_43TRI_44TRI_45TRI_48
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
1-7510001010101001100112
1-16300001001001101100010
1-40901001000000000000000
1-110300001000101111100109
1-127800000001909090000000
1-382201011000009000000000

So now we have a matrix of genotypes without file I/O. Let's use the labels the authors provide to label the data and then feed the completed matrix into OutFLANK!

SNPmat <- mutatedgeno # instead of reading in output, we use what we produced above.
locusNames <- seq(from=1, to= 13688, by=1)
#popNames <- c(replicate(31,"ANT"), replicate(23,"BON"),replicate(34,"BOO"),replicate(23,"BRA"), replicate(36,"BRO"),replicate(34,"BUZ"),replicate(31,"CAN"),replicate(26,"CAP"),replicate(36,"CAR"), replicate(30,"GAS"),replicate(30,"LOB"),replicate(29,"MAL"),replicate(34,"MAR"),replicate(36,"OFF"),replicate(33,"RHO"),replicate(33,"SEA"),replicate(32,"SID"),replicate(18,"SJI"),replicate(33,"TRI"))
popNames <- strtrim(colnames(SNPmat), 3)
colnames(SNPmat) = popNames
rownames(SNPmat) = locusNames
SNPmat = t(SNPmat)
# Now we're working exclusively in OutFLANK (all code from 2015 paper!):
FstDataFrame <- MakeDiploidFSTMat(SNPmat,locusNames,popNames)

### Do the analysis with trim of 0.05
OFoutput <- OutFLANK(FstDataFrame, LeftTrimFraction=0.05,RightTrimFraction=0.05, Hmin=0.1, 19,qthreshold=0.05)
write.table(OFoutput, "Output_0.5_0.1_19pop.txt", quote=F)

### Do the analysis with trim of 0.1
OFoutput <- OutFLANK(FstDataFrame, LeftTrimFraction=0.1,RightTrimFraction=0.1, Hmin=0.1, 19,qthreshold=0.05)
write.table(OFoutput, "Output_0.5_0.1_19pop_without_trim.txt", quote=F)
Calculating FSTs, may take a few minutes...
[1] "10000 done of 13688"
OutFLANKResultsPlotter(OFoutput, withOutliers = TRUE, NoCorr = TRUE, Hmin = 0.1, binwidth = 0.005, Zoom =FALSE, RightZoomFraction = 0.05, titletext = NULL)
OutFLANKResultsPlotter(OFoutput, withOutliers = TRUE, NoCorr = TRUE, Hmin = 0.1, binwidth = 0.005, Zoom =TRUE, RightZoomFraction = 0.05, titletext = "Zoom = TRUE")

  1. Benestan, L., Quinn, B. K., Maaroufi, H., Laporte, M., Clark, F. K., Greenwood, S. J., … Bernatchez, L. (2016). Seascape genomics provides evidence for thermal adaptation and current-mediated population structure in American lobster (Homarus americanus). Molecular Ecology, 25(20), 5073–5092.
  2. Rinker, T. W., & Kurkiewicz, D. (2018). pacman: Package Management for R. Buffalo, New York. Retrieved from http://github.com/trinker/pacman