eQTL discovery pipeline for the GTEx Consortium
This repository contains all components of the eQTL discovery pipeline used by the GTEx Consortium, including data normalization, QTL mapping, and annotation steps. This document describes the pipeline used for the V7 and V8 data releases; for settings specific to the V6p analyses presented in [GTEx Consortium, 2017], please see the last section.
The GTEx eQTL pipeline components are provided in a Docker image, available at https://hub.docker.com/r/broadinstitute/gtex_eqtl/
To download the image, run:
docker pull broadinstitute/gtex_eqtl:V8
The following tools are included in the Docker image:
- FastQTL: QTL mapping software (Ongen et al., Bioinformatics, 2016)
- R 3.2
- Python 3.5
The following input files are needed:
- VCF file with genotype information. Must be bgzip compressed and indexed with tabix.
- Expression tables in GCT format. Two tables are needed: read counts and normalized (FPKM or TPM).
- Gene annotation in GTF format.
Additional documentation and details about parameter choices are provided on the GTEx Portal.
This pipeline requires gene-level expression data. A collapsed reference GTF can be generated for this purpose using the collapse_annotation.py script available in the gene model directory. In the code below, it is assumed that ${annotation_gtf} was generated using this script.
The expression data are normalized as follows:
- Read counts are normalized between samples using TMM (Robinson & Oshlack, Genome Biology, 2010)
- Genes are selected based on the following expression thresholds:
- ≥0.1 TPM in ≥20% samples AND
- ≥6 reads (unnormalized) in ≥20% samples
- Each gene is inverse normal transformed across samples.
eqtl_prepare_expression.py ${tpm_gct} ${counts_gct} ${annotation_gtf} \
${sample_participant_lookup} ${vcf_chr_list} ${prefix} \
--tpm_threshold 0.1 \
--count_threshold 6 \
--sample_frac_threshold 0.2 \
--normalization_method tmm
The file ${vcf_chr_list} lists the chromosomes in the VCF, and can be generated using:
tabix --list-chroms ${vcf} > ${vcf_chr_list}
The file ${sample_participant_lookup} must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the VCF (these can be the same).
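As an illustration of the lookup file layout, the sketch below builds the two-column table from a list of sample IDs. It assumes that the file is tab-delimited and that a participant ID is the first two hyphen-separated fields of a sample ID; both are assumptions, not stated above. The helper names and the example IDs are hypothetical.

```python
def participant_from_sample(sample_id):
    """Derive a participant ID from a sample ID.

    Assumption: the participant ID is the first two hyphen-separated fields.
    """
    return "-".join(sample_id.split("-")[:2])


def lookup_table(sample_ids):
    """Return the lookup table as text, with a sample_id/participant_id header.

    Assumption: columns are tab-delimited.
    """
    lines = ["sample_id\tparticipant_id"]
    for sample_id in sample_ids:
        lines.append("{}\t{}".format(sample_id, participant_from_sample(sample_id)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample IDs, for illustration only
    print(lookup_table(["GTEX-AAAAA-0001-SM-BBBBB", "GTEX-CCCCC-0002-SM-DDDDD"]), end="")
```

If the expression files and the VCF already use the same IDs, both columns simply hold the same value.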
This step generates the following BED file and index:
${prefix}.expression.bed.gz
${prefix}.expression.bed.gz.tbi
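The gene selection and per-gene transformation applied in this step can be sketched as follows. This is a minimal illustration, not the code of eqtl_prepare_expression.py: TMM normalization is omitted, the function names are hypothetical, and the rank-to-quantile mapping r / (n + 1) with averaged ties is an assumption about how the inverse normal transform is computed. It requires Python 3.8 or later for statistics.NormalDist.

```python
from statistics import NormalDist


def passes_thresholds(tpm, counts, tpm_threshold=0.1, count_threshold=6,
                      sample_frac_threshold=0.2):
    """Return True if one gene meets both expression thresholds.

    tpm, counts: per-sample values for the gene, in the same sample order.
    """
    min_samples = sample_frac_threshold * len(tpm)
    n_tpm = sum(1 for v in tpm if v >= tpm_threshold)
    n_counts = sum(1 for v in counts if v >= count_threshold)
    return n_tpm >= min_samples and n_counts >= min_samples


def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one gene across samples.

    Assumption: tied values share their average rank, and rank r (1-based)
    maps to the standard normal quantile at r / (n + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to the end of the block of values tied with position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf(r / (n + 1)) for r in ranks]
```

After the transform, each gene's values follow an approximately standard normal distribution across samples, which limits the influence of expression outliers on the association tests.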
Rscript run_PEER.R ${prefix}.expression.bed.gz ${prefix} ${num_peer}
The number of PEER factors was selected as a function of sample size (N):
- 15 factors for N < 150
- 30 factors for 150 ≤ N < 250
- 45 factors for 250 ≤ N < 350
- 60 factors for N ≥ 350
For information on how these thresholds were determined, please see the Supplementary Information of [GTEx Consortium, 2017].
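The sample-size rule above can be written as a small helper for setting ${num_peer}. The function name is hypothetical; only the thresholds come from the list above.

```python
def num_peer_factors(n_samples):
    """Return the number of PEER factors for a tissue with n_samples samples."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    if n_samples < 350:
        return 45
    return 60


if __name__ == "__main__":
    for n in (100, 150, 300, 500):
        print(n, num_peer_factors(n))
```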
This step will generate 3 files:
${prefix}.PEER_residuals.txt
${prefix}.PEER_alpha.txt
${prefix}.PEER_covariates.txt
This step generates a combined covariates file, containing genotype PCs, PEER factors, and additional explicit covariates (e.g., genotyping platform).
combine_covariates.py ${prefix}.PEER_covariates.txt ${prefix} \
--genotype_pcs ${genotype_pcs} \
--add_covariates ${add_covariates}
The covariate files should have one covariate per row, with an identifier in the first column, and a header line with sample identifiers. This step will generate the file ${prefix}.combined_covariates.txt
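To make the covariate file layout concrete, the sketch below parses tables with one covariate per row and stacks them, keeping only the samples present in every table. It is an illustration under stated assumptions, not the implementation of combine_covariates.py: the files are assumed to be tab-delimited, the "ID" label in the header and the function names are hypothetical, and restricting to shared samples is a choice made here.

```python
import csv
import io


def read_covariates(text):
    """Parse a covariate table: header row of sample IDs, one covariate per row."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[0][1:]
    covariates = {row[0]: dict(zip(samples, row[1:])) for row in rows[1:] if row}
    return samples, covariates


def combine(tables):
    """Stack covariate tables row-wise, keeping samples shared by all tables.

    Sample order follows the first table.
    """
    shared = [s for s in tables[0][0] if all(s in t[0] for t in tables[1:])]
    lines = ["\t".join(["ID"] + shared)]
    for _, covariates in tables:
        for name, values in covariates.items():
            lines.append("\t".join([name] + [values[s] for s in shared]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical covariate tables with different sample sets and orders
    peer = "ID\tS1\tS2\tS3\nInferredCov1\t0.1\t-0.2\t0.3\n"
    pcs = "ID\tS2\tS1\nPC1\t0.01\t0.02\n"
    print(combine([read_covariates(peer), read_covariates(pcs)]), end="")
```

Values are looked up by sample ID rather than by column position, so input tables with differently ordered samples are aligned correctly.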
A wrapper script for multithreaded execution is provided in the Docker image (/opt/fastqtl/python/run_FastQTL_threaded.py) and at https://github.com/francois-a/fastqtl
# nominal pass
run_FastQTL_threaded.py ${vcf} ${prefix}.expression.bed.gz ${prefix} \
--covariates ${prefix}.combined_covariates.txt \
--window 1e6 --chunks 100 --threads 16
# permutation pass
run_FastQTL_threaded.py ${vcf} ${prefix}.expression.bed.gz ${prefix} \
--covariates ${prefix}.combined_covariates.txt \
--window 1e6 --chunks 100 --threads 16 \
--permute 1000 10000
The nominal pass and the permutation pass generate the following files, respectively:
${prefix}.allpairs.txt.gz
${prefix}.egenes.txt.gz
The steps described above can be run using Docker. This assumes that the $path_to_data directory contains all required input files.
# Docker command for step 1:
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_eqtl:V8 /bin/bash \
-c "/src/eqtl_prepare_expression.py /data/${tpm_gct} /data/${counts_gct} \
/data/${annotation_gtf} /data/${sample_participant_lookup} /data/${vcf_chr_list} ${prefix} \
--tpm_threshold 0.1 --count_threshold 6 --sample_frac_threshold 0.2 --normalization_method tmm"
For the V6p analyses, the expression data were normalized as follows:
- Genes were selected based on the following expression thresholds:
- >0.1 RPKM in ≥10 samples AND
- ≥6 reads (unnormalized) in ≥10 samples
- RPKMs were normalized between samples using quantile normalization
- Each gene was inverse normal transformed across samples.
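The gene selection and between-sample quantile normalization listed above can be sketched as follows. This is a simplified illustration, not the original V6p code: the function names are hypothetical, and ties within a sample are broken by gene order rather than averaged, which is an assumption made to keep the example short.

```python
def passes_v6p_thresholds(rpkm, counts, rpkm_threshold=0.1, count_threshold=6,
                          min_samples=10):
    """Return True if one gene has >0.1 RPKM and >=6 reads in >=10 samples each."""
    n_rpkm = sum(1 for v in rpkm if v > rpkm_threshold)  # strict inequality
    n_counts = sum(1 for v in counts if v >= count_threshold)
    return n_rpkm >= min_samples and n_counts >= min_samples


def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix (list of per-gene lists).

    Each value is replaced by the mean, across samples, of the values that
    share its within-sample rank. Ties are broken by gene order (assumption).
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_columns) / n_samples
                  for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result
```

After this normalization every sample has the same distribution of values, so remaining differences between samples for a given gene reflect its rank within each sample.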