This is an example analysis of RNA-seq data using open source tools R and Bioconductor. It starts with raw reads downloaded from the Short Read Archive (SRA), does quality assessment and improvement, mapping, and analysis.
The data for this comes from Zhang et al., 2011, Genome-wide mapping of the HY5-mediated genenetworks in Arabidopsis that involve both transcriptional and post-transcriptional regulation (http://onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2010.04426.x/full). It is composed of two Arabidopsis thaliana ecotype Col-0 sets, one hy5 mutant and one wild type. The SRA accession is SRX029582.
GEO:GSM613465
: wild type samplesSRR070570
: replicate 1SRR070571
: replicate 2
GEO:GSM613466
: hy5 mutant samplesSRR070572
: replicate 1SRR070573
: replicate 2
These were sequenced on an Illumina Genome Analyzer after oligo(dT) selection and random hexamer priming.
Read data was acquired from the SRA via wget
, and checked with:
cd data/raw-reads/
md5sum -c *sra.md5
The TAIR10 Arabidopsis Genome was downloaded with:
wget -nd ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/*fas
-nd
tells wget
not to mirror the directory structure.
FASTQ data was extracted with NCBI's SRA Toolkit's fastq-dump
, used
with xargs
and find
to run this in parallel:
find . -name "*sra" | xargs -n1 -P4 /share/apps/sratoolkit/fastq-dump
cd data/athaliana-genome
gmap_build -d athaliana10 -C *.fas
The -C
option tries to extract chromosome information from the FASTA
header.
First, we assess the quality with qrqc
; see raw-read-qa.Rmd
.
sycthe
is a 3'-end quality trimmer. Illumina adapters are removed
from the raw reads:
find data/raw-reads/ -name "*.fastq" | xargs -n1 -I{} basename {} .fastq | xargs -n1 -P4 -I{} /share/apps/scythe/scythe -q sanger -a /share/apps/scythe/solexa_adapters.JNF.fa -o data/improved-reads/{}-trimmed.fastq data/raw-reads/{}.fastq
sickle
is a tool for trimming low-quality bases off of the 5'-end
and 3'-end of reads.
I use a script (map-gsnap.sh
) to process each alignment file. The
purpose of using the script is that I can use it with xargs
. Because
gsnap
returns results to standard out, we need to wrap the call in a
script to allow this to be parallelizable.
find data/improved-reads/ -name "*final.fastq" | xargs -n1 -P4 bash map-gsnap.sh
I am using -N1
with gsnap
, which allows for novel
splicing. gsnap
also allows for known splicing junctions to be used.
Another task for xargs
. As before, I use basename to keep just the
unique file name as a key, and run samtools
with the correct
directory and extension. This allows me to change the extension too.
find data/alignments/ -name "*sam" | xargs -n1 -I{} basename {} .sam | xargs -n1 -I{} -P4 samtools view -b -S -o data/alignments/{}.bam data/alignments/{}.sam