RNA-seq pipeline for the GTEx Consortium
This repository contains all components of the RNA-seq pipeline used by the GTEx Consortium, including alignment, expression quantification, and quality control.
The GTEx RNA-seq pipeline is provided as a Docker image, available at https://hub.docker.com/r/broadinstitute/gtex_rnaseq/
To download the image, run:
docker pull broadinstitute/gtex_rnaseq:V8
The following tools are included in the Docker image (V8 pipeline):
- SamToFastq: BAM to FASTQ conversion
- STAR: spliced alignment of RNA sequence reads (v2.5.3a)
- Picard MarkDuplicates: mark duplicate reads
- RSEM transcript expression quantification (v1.3.0)
- bamsync: utility for transferring QC flags from the input BAM and for re-generating read group IDs
- RNA-SeQC: QC metrics and gene-level expression quantification (v1.1.9)
Version in V7 pipeline:
- STAR v2.4.2a
- RSEM v1.2.22
- RNA-SeQC v1.1.8
Reference indexes for STAR and RSEM are needed to run the pipeline.
GTEx releases up to V7 were based on the GRCh37/hg19 genome reference (download). Releases from V8 onward are based on GRCh38/hg38. Please see TOPMed_RNAseq_pipeline.md for details and links for this reference.
Releases V6/V6p and V7 were based on the GENCODE v19 annotation; release V8 will be based on GENCODE v26.
For hg19-based analyses, the GENCODE annotation should be patched to use Ensembl chromosome names:
zcat gencode.v19.annotation.gtf.gz | \
sed 's/chrM/chrMT/;s/chr//' > gencode.v19.annotation.patched_contigs.gtf
The STAR index should be built to match the sequencing read length, specified by the sjdbOverhang parameter. GTEx samples were sequenced using a 2x76 bp paired-end sequencing protocol, and the matching sjdbOverhang is 75.
# build the STAR index:
mkdir $path_to_references/star_index_oh75
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "STAR \
--runMode genomeGenerate \
--genomeDir /data/star_index_oh75 \
--genomeFastaFiles /data/Homo_sapiens_assembly19.fasta \
--sjdbGTFfile /data/gencode.v19.annotation.patched_contigs.gtf \
--sjdbOverhang 75 \
--runThreadN 4"
# build the RSEM index:
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "rsem-prepare-reference \
/data/Homo_sapiens_assembly19.fasta \
/data/rsem_reference/rsem_reference \
--gtf /data/gencode.v19.annotation.patched_contigs.gtf \
--num-threads 4"
Individual components of the pipeline can be run using the commands below. It is assumed that the $path_to_data
directory contains the input data and reference indexes.
# BAM to FASTQ conversion
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq \
/bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"
# STAR alignment
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "/src/run_STAR.py \
/data/star_index_oh75 \
/data/${sample_id}_1.fastq.gz \
/data/${sample_id}_2.fastq.gz \
${sample_id} \
--threads 4 \
--output_dir /tmp/star_out && mv /tmp/star_out /data/star_out"
# sync BAMs (optional; copy QC flags and read group IDs)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "/src/run_bamsync.sh \
/data/$input_bam \
/data/star_out/${sample_id}.Aligned.sortedByCoord.out.bam \
/data/star_out/${sample_id}"
# mark duplicates (Picard)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "/src/run_MarkDuplicates.py \
/data/star_out/${sample_id}.Aligned.sortedByCoord.out.patched.bam \
${sample_id}.Aligned.sortedByCoord.out.patched.md \
--output_dir /data"
# RNA-SeQC
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "/src/run_rnaseqc.py \
${sample_id}.Aligned.sortedByCoord.out.patched.md.bam \
${genes_gtf} \
${genome_fasta} \
${sample_id} \
--output_dir /data"
# RSEM transcript quantification
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "/src/run_RSEM.py \
/data/rsem_reference \
/data/star_out/${sample_id}.Aligned.toTranscriptome.out.bam \
/data/${sample_id} \
--threads 4"
Sample-level outputs in GCT format can be concatenated using combine_GCTs.py
:
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
/bin/bash -c "python3 /src/combine_GCTs.py \
${rnaseqc_rpkm_gcts} ${sample_set_id}.rnaseqc_rpkm"