
RNA-seq pipeline for the GTEx Consortium

This repository contains all components of the RNA-seq pipeline used by the GTEx Consortium, including alignment, expression quantification, and quality control.

Docker image

The GTEx RNA-seq pipeline is provided as a Docker image, available at https://hub.docker.com/r/broadinstitute/gtex_rnaseq/

To download the image, run:

docker pull broadinstitute/gtex_rnaseq:V8

Image contents and pipeline components

The following tools are included in the Docker image (V8 pipeline):

SamToFastq: BAM to FASTQ conversion
STAR: spliced alignment of RNA sequence reads (v2.5.3a)
Picard MarkDuplicates: mark duplicate reads
RSEM: transcript expression quantification (v1.3.0)
bamsync: utility for transferring QC flags from the input BAM and for re-generating read group IDs
RNA-SeQC: QC metrics and gene-level expression quantification (v1.1.9)

Versions used in the V7 pipeline:

STAR v2.4.2a
RSEM v1.2.22
RNA-SeQC v1.1.8

Setup steps

Reference genome and annotation

Reference indexes for STAR and RSEM are needed to run the pipeline.

GTEx releases up to V7 were based on the GRCh37/hg19 genome reference. Releases from V8 onward are based on GRCh38/hg38; see TOPMed_RNAseq_pipeline.md for details and links for the GRCh38 reference.

Releases V6/V6p and V7 were based on the GENCODE v19 annotation; release V8 will be based on GENCODE v26.

For hg19-based analyses, the GENCODE annotation should be patched to use Ensembl chromosome names:

zcat gencode.v19.annotation.gtf.gz | \
    sed 's/chrM/chrMT/;s/chr//' > gencode.v19.annotation.patched_contigs.gtf
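
To illustrate what the sed expression does, the sketch below applies it to two made-up, tab-separated records. The records are illustrative rather than taken from GENCODE, and the function name patch_contigs is hypothetical.

```shell
# Hypothetical helper wrapping the sed expression from the command above.
patch_contigs() {
    sed 's/chrM/chrMT/;s/chr//'
}

# Two illustrative records (not real annotation data).
printf 'chr1\tHAVANA\tgene\nchrM\tENSEMBL\tgene\n' | patch_contigs
```

The contig chr1 becomes 1 and chrM becomes MT. The expressions are not anchored to the start of the line, so each one replaces the first match it finds anywhere on a line.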

Building the indexes

The STAR index should be built to match the sequencing read length, specified by the sjdbOverhang parameter. GTEx samples were sequenced using a 2x76 bp paired-end sequencing protocol, and the matching sjdbOverhang is 75.
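
For other read lengths, the same relationship (overhang equals read length minus one) can be computed rather than hard-coded. This is a minimal sketch; the variable names read_length, sjdb_overhang, and star_index_dir are assumptions introduced for illustration.

```shell
# Read (mate) length of the sequencing protocol; 76 for GTEx (2x76 bp).
read_length=76

# sjdbOverhang is set to the read length minus one.
sjdb_overhang=$((read_length - 1))

# Directory name follows the star_index_oh75 naming used in this README.
star_index_dir="star_index_oh${sjdb_overhang}"

echo "sjdbOverhang=${sjdb_overhang} index_dir=${star_index_dir}"
```

With read_length=76 this prints sjdbOverhang=75 index_dir=star_index_oh75.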

# build the STAR index:
mkdir -p $path_to_references/star_index_oh75
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "STAR \
        --runMode genomeGenerate \
        --genomeDir /data/star_index_oh75 \
        --genomeFastaFiles /data/Homo_sapiens_assembly19.fasta \
        --sjdbGTFfile /data/gencode.v19.annotation.patched_contigs.gtf \
        --sjdbOverhang 75 \
        --runThreadN 4"

# build the RSEM index (create the output directory if needed):
mkdir -p $path_to_references/rsem_reference
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "rsem-prepare-reference \
        /data/Homo_sapiens_assembly19.fasta \
        /data/rsem_reference/rsem_reference \
        --gtf /data/gencode.v19.annotation.patched_contigs.gtf \
        --num-threads 4"

Running the pipeline

Individual components of the pipeline can be run using the commands below. It is assumed that the $path_to_data directory contains the input data and reference indexes.
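
Because every command assumes this layout, it may help to check it before starting a container. The sketch below is one possible pre-flight check, not part of the pipeline; the function name check_inputs is hypothetical, and the directory names are the ones used in this README.

```shell
# Hypothetical pre-flight check: confirm that the input BAM and the
# reference index directories are present under the data directory.
check_inputs() {
    local data_dir="$1" bam="$2"
    local missing=0 dir
    # Input BAM for the SamToFastq step.
    if [ ! -f "${data_dir}/${bam}" ]; then
        echo "missing input BAM: ${data_dir}/${bam}" >&2
        missing=1
    fi
    # Reference index directories built during setup.
    for dir in star_index_oh75 rsem_reference; do
        if [ ! -d "${data_dir}/${dir}" ]; then
            echo "missing reference directory: ${data_dir}/${dir}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as check_inputs "$path_to_data" "$input_bam" and returns a non-zero status if anything is missing.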

# BAM to FASTQ conversion
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"

# STAR alignment
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_STAR.py \
        /data/star_index_oh75 \
        /data/${sample_id}_1.fastq.gz \
        /data/${sample_id}_2.fastq.gz \
        ${sample_id} \
        --threads 4 \
        --output_dir /tmp/star_out && mv /tmp/star_out /data/star_out"

# sync BAMs (optional; copy QC flags and read group IDs)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_bamsync.sh \
        /data/$input_bam \
        /data/star_out/${sample_id}.Aligned.sortedByCoord.out.bam \
        /data/star_out/${sample_id}"

# mark duplicates (Picard)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_MarkDuplicates.py \
        /data/star_out/${sample_id}.Aligned.sortedByCoord.out.patched.bam \
        ${sample_id}.Aligned.sortedByCoord.out.patched.md \
        --output_dir /data"

# RNA-SeQC
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_rnaseqc.py \
        /data/${sample_id}.Aligned.sortedByCoord.out.patched.md.bam \
        ${genes_gtf} \
        ${genome_fasta} \
        ${sample_id} \
        --output_dir /data"

# RSEM transcript quantification
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_RSEM.py \
        /data/rsem_reference \
        /data/star_out/${sample_id}.Aligned.toTranscriptome.out.bam \
        /data/${sample_id} \
        --threads 4"
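
The commands above pass files from one step to the next by name. As a reading aid, the sketch below prints the BAM associated with each step for a given sample, relative to the data directory. The helper name pipeline_bams and the sample ID SAMPLE1 are hypothetical, and the file names are inferred from the arguments used in the commands above rather than from the scripts themselves.

```shell
# Hypothetical helper: print the BAM produced by each step for a sample,
# following the file names used in the commands above.
pipeline_bams() {
    local sample_id="$1"
    local prefix="${sample_id}.Aligned.sortedByCoord.out"
    echo "star_genome=star_out/${prefix}.bam"
    echo "star_transcriptome=star_out/${sample_id}.Aligned.toTranscriptome.out.bam"
    echo "bamsync=star_out/${prefix}.patched.bam"
    echo "markduplicates=${prefix}.patched.md.bam"
}

pipeline_bams SAMPLE1
```

The coordinate-sorted BAM feeds bamsync, MarkDuplicates, and RNA-SeQC in turn, while RSEM reads the transcriptome BAM directly from the STAR output.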

Aggregating outputs

Sample-level outputs in GCT format can be concatenated using combine_GCTs.py:

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "python3 /src/combine_GCTs.py \
        ${rnaseqc_rpkm_gcts} ${sample_set_id}.rnaseqc_rpkm"
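
The variable ${rnaseqc_rpkm_gcts} has to hold the sample-level GCT paths. One way to build such a list is sketched below, assuming the script accepts the paths as separate arguments, as the command above suggests. The helper name collect_files is hypothetical, and the glob pattern is left to the caller because the exact output file names are not stated here.

```shell
# Hypothetical helper: gather all files in a directory that match a
# caller-supplied glob into one space-separated list.
# Assumes that the paths contain no whitespace.
collect_files() {
    local dir="$1" pattern="$2"
    local list="" f
    for f in "${dir}"/${pattern}; do
        # Skip the literal pattern when nothing matches.
        [ -e "$f" ] || continue
        list="${list:+${list} }${f}"
    done
    echo "$list"
}
```

A call such as rnaseqc_rpkm_gcts=$(collect_files "$path_to_data" "$pattern") would then populate the variable. Since the list is expanded on the host but read inside the container, the paths may need to be rewritten to their /data equivalents.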
