THE BIOSTAR HANDBOOK COLLECTION
RNA-SEQ
BY EXAMPLE
HISAT STAR KALLISTO SALMON DESEQ
ALIGN ALIGN CLASSIFY CLASSIFY STATISTICS
Book updated on February 14, 2023
Contents
1 Welcome to the world of RNA-Seq 8
1.1 Mission Impossible RNA-Seq . . . . . . . . . . . . . . . . . . 9
1.2 Typesetting conventions . . . . . . . . . . . . . . . . . . . . . 9
1.3 How to download the book . . . . . . . . . . . . . . . . . . . . 10
1.4 How was the book developed? . . . . . . . . . . . . . . . . . . 10
2 Introduction to RNA-Seq 12
2.1 What is RNA-Seq when simplified to its essence? . . . . . . . 12
2.2 What makes RNA-Seq challenging? . . . . . . . . . . . . . . . 14
2.3 How does RNA sequencing work? . . . . . . . . . . . . . . . . 14
2.4 What are gene isoforms? . . . . . . . . . . . . . . . . . . . . . 15
2.5 What kind of splicing events exist? . . . . . . . . . . . . . . . 15
2.6 What is the final result of an RNA-Seq analysis? . . . . . . . 16
2.7 How many replicates do I need? . . . . . . . . . . . . . . . . . 17
2.8 Can one have too many replicates? . . . . . . . . . . . . . . . 17
2.9 What are the main methods for quantifying RNA-Seq data? . 18
2.10 How does quantifying against a genome work? . . . . . . . . . 19
2.11 How does classifying against a transcriptome work? . . . . . . 19
2.12 What are the tradeoffs? . . . . . . . . . . . . . . . . . . . . . 20
2.13 Which method should I use? . . . . . . . . . . . . . . . . . . . 20
2.14 Will there ever be an optimal RNA-Seq analysis method? . . . 21
2.15 When will I know that I understand RNA-Seq? . . . . . . . . 22
3 RNA-Seq terminology 23
3.1 What is a sample? . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 How should we represent RNA-Seq samples? . . . . . . . . . . 23
3.3 What is normalization? . . . . . . . . . . . . . . . . . . . . . . 24
3.4 What is a library size normalization? . . . . . . . . . . . . . . 24
3.5 What is the effective length? . . . . . . . . . . . . . . . . . . . 24
3.6 What gets counted in a gene level analysis? . . . . . . . . . . 25
3.7 How do we quantify gene expression? . . . . . . . . . . . . . . 25
3.8 What is TMM (edgeR) normalization? . . . . . . . . . . . . . 26
3.9 What is DESeq normalization? . . . . . . . . . . . . . . . . . 26
3.10 Do I always need an advanced statistical analysis? . . . . . . . 26
3.11 What is a “spike-in” control? . . . . . . . . . . . . . . . . . . 27
3.12 How should I name samples? . . . . . . . . . . . . . . . . . . 27
3.13 Don’t mix up your samples! . . . . . . . . . . . . . . . . . . . 28
3.14 What is the CPM? . . . . . . . . . . . . . . . . . . . . . . . . 29
3.15 What is the RPKM? . . . . . . . . . . . . . . . . . . . . . . . 29
3.16 What the FPKM is that? . . . . . . . . . . . . . . . . . . . . 31
3.17 Why are RPKM and FPKM still used? . . . . . . . . . . . . . 31
3.18 What is TPM? . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Statistics for RNA-Seq 34
4.1 Why do statistics play a role in RNA-Seq? . . . . . . . . . . . 35
4.2 When do I need to make use of statistics? . . . . . . . . . . . 35
4.3 What kind of questions can we answer with a statistical test? 36
4.4 What types of statistical tests are common? . . . . . . . . . . 36
4.5 Do I need to involve a statistician? . . . . . . . . . . . . . . . 37
4.6 What is R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 How do I install R? . . . . . . . . . . . . . . . . . . . . . . . . 38
4.8 Can I also install R from command line? . . . . . . . . . . . . 38
4.9 How should I run R? . . . . . . . . . . . . . . . . . . . . . . . 39
4.10 What is Bioconductor? . . . . . . . . . . . . . . . . . . . . . . 39
4.11 What does a p-value mean? . . . . . . . . . . . . . . . . . . . 40
4.12 So how do I deal with p-values? . . . . . . . . . . . . . . . . . 43
4.13 Do I need to compute and discuss p-values? . . . . . . . . . . 45
5 Computer setup 46
5.1 Install and use RStudio . . . . . . . . . . . . . . . . . . . . . 46
5.2 Running our scripts at the command line . . . . . . . . . . . . 47
5.3 Making a new environment . . . . . . . . . . . . . . . . . . . 47
5.4 How to prepare a new environment . . . . . . . . . . . . . . . 48
5.5 Switching environments in scripts . . . . . . . . . . . . . . . . 48
6 Using RStudio for RNA-Seq 50
6.1 Running our scripts in RStudio . . . . . . . . . . . . . . . . . 50
6.2 Tips to become more productive . . . . . . . . . . . . . . . . . 51
I RNA-SEQ STEP-BY-STEP 52
7 1. Introducing the Golden Snidget 54
7.1 What do we know about the Golden Snidget? . . . . . . . . . 55
7.2 What is the objective of the entire section of the Golden Snidget? 56
8 2. Understand your reference 57
8.1 What is my very first step? . . . . . . . . . . . . . . . . . . . 57
8.2 What is the Golden Snidget’s genome like? . . . . . . . . . . . 58
8.3 How many sequences are in our reference files? . . . . . . . . . 59
8.4 What does the genome file look like? . . . . . . . . . . . . . . 59
8.5 What is the annotation file? . . . . . . . . . . . . . . . . . . . 60
8.6 What do the transcripts look like? . . . . . . . . . . . . . . . 60
8.7 Visualize your reference file . . . . . . . . . . . . . . . . . . . 61
8.8 Optional step: align your transcripts against the reference . . 62
8.9 What is the next step? . . . . . . . . . . . . . . . . . . . . . . 63
9 3. Understand the data 64
9.1 What information do I need to know? . . . . . . . . . . . . . . 64
9.2 What is the summary so far? . . . . . . . . . . . . . . . . . . 66
9.3 What is the experimental design file? . . . . . . . . . . . . . . 66
9.4 Does the order above matter? . . . . . . . . . . . . . . . . . . 67
10 4. Alignment based RNA-Seq 68
10.1 Index your reference genome . . . . . . . . . . . . . . . . . . . 68
10.2 Generate the alignments . . . . . . . . . . . . . . . . . . . . . 69
10.3 How to automate the alignments . . . . . . . . . . . . . . . . 69
10.4 Visualize the alignments . . . . . . . . . . . . . . . . . . . . . 71
10.5 Create coverage data . . . . . . . . . . . . . . . . . . . . . . . 71
10.6 Making a bigWig coverage file . . . . . . . . . . . . . . . . . . 72
10.7 What can you learn from the visualization? . . . . . . . . . . 73
10.8 What is the next step . . . . . . . . . . . . . . . . . . . . . . 73
11 5. Feature counting in RNA-Seq 74
11.1 What is our quantification matrix? . . . . . . . . . . . . . . . 74
11.2 How to count features? . . . . . . . . . . . . . . . . . . . . . . 75
11.3 Standardizing the count matrix . . . . . . . . . . . . . . . . . 77
11.4 There is even more to counting . . . . . . . . . . . . . . . . . 78
11.5 Library stranded-ness . . . . . . . . . . . . . . . . . . . . . . . 79
11.6 What is the next step . . . . . . . . . . . . . . . . . . . . . . 79
12 6. Classification based RNA-Seq 80
12.1 What are the advantages of classification? . . . . . . . . . . . 80
12.2 What are the disadvantages of classification? . . . . . . . . . . 81
12.3 What is the weakest link in classification? . . . . . . . . . . . 81
12.4 How does the read redistribution work? . . . . . . . . . . . . . 81
12.5 What tools implement pseudo-alignment? . . . . . . . . . . . 82
12.6 How to use kallisto and salmon? . . . . . . . . . . . . . . . 83
12.7 The design matrix . . . . . . . . . . . . . . . . . . . . . . . . 83
12.8 Preparing the transcriptome . . . . . . . . . . . . . . . . . . . 84
12.9 Running the classification . . . . . . . . . . . . . . . . . . . . 84
12.10 Combine your counts . . . . . . . . . . . . . . . . . . . . . . 85
12.11 Compare the outputs . . . . . . . . . . . . . . . . . . . . . . 86
12.12 Next step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
13 7. The differential expression 88
13.1 Computing differential expression . . . . . . . . . . . . . . . . 88
13.2 Which data will be analyzed? . . . . . . . . . . . . . . . . . . 89
13.3 How do I find differential expression? . . . . . . . . . . . . . . 90
13.4 Explore alternatives . . . . . . . . . . . . . . . . . . . . . . . 92
13.5 Visualizing the results . . . . . . . . . . . . . . . . . . . . . . 92
13.6 What do the results mean? . . . . . . . . . . . . . . . . . . . 94
13.7 What is the normalized matrix? . . . . . . . . . . . . . . . . . 96
13.8 How do I visualize the normalized matrix? . . . . . . . . . . . 98
13.9 Where do we go next? . . . . . . . . . . . . . . . . . . . . . . 98
14 8. What does the heatmap show 99
14.1 Why do we visualize the normalized matrix? . . . . . . . . . . 100
14.2 What does the heatmap show? . . . . . . . . . . . . . . . . . 100
14.3 Where do we go next? . . . . . . . . . . . . . . . . . . . . . . 103
15 9. Which method is the best? 104
15.1 What kinds of errors do we expect to see? . . . . . . . . . . . 105
15.2 How many genes should be detected as differentially expressed? 106
15.3 How many genes are detected as differentially expressed? . . . 106
15.4 False positives . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
15.5 False negatives . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15.6 And the “winner” is edgeR . . . . . . . . . . . . . . . . . . . . 109
II RNA-SEQ IN PRACTICE 110
16 UHR vs HBR data 113
16.1 Which publication is reanalyzed? . . . . . . . . . . . . . . . . 113
16.2 What type of data is included? . . . . . . . . . . . . . . . . . 113
16.3 How do I get the data? . . . . . . . . . . . . . . . . . . . . . 114
16.4 What is inside the data . . . . . . . . . . . . . . . . . . . . . 114
16.5 What is the reference? . . . . . . . . . . . . . . . . . . . . . . 115
16.6 What do I need to do now? . . . . . . . . . . . . . . . . . . . 115
III The Grouchy Grinch 116
17 1. Introducing the Grouchy Grinch 118
17.1 What do we know about the Grouchy Grinch? . . . . . . . . . 118
17.2 Experimental design . . . . . . . . . . . . . . . . . . . . . . . 119
18 2. Grinch: data disarray 120
18.1 Download and unpack the data . . . . . . . . . . . . . . . . . 120
18.2 Investigate the genome . . . . . . . . . . . . . . . . . . . . . . 121
18.3 Evaluate the annotations . . . . . . . . . . . . . . . . . . . . . 121
18.4 Visualizing annotations . . . . . . . . . . . . . . . . . . . . . . 122
19 3. Grinch: alignment gloom 124
19.1 Visualizing alignments . . . . . . . . . . . . . . . . . . . . . . 125
19.2 Tuning the aligner . . . . . . . . . . . . . . . . . . . . . . . . 127
20 4. Grinch: counting troubles 130
20.1 Let’s do the counting . . . . . . . . . . . . . . . . . . . . . . . 130
20.2 How does the counting work . . . . . . . . . . . . . . . . . . . 131
20.3 The search for the “right annotation” format begins . . . . . . 133
20.4 GTF - the zombie format that shouldn’t exist . . . . . . . . . 134
20.5 Ok, Grinch! Let’s count. . . . . . . . . . . . . . . . . . . . . . 134
21 5. Grinch: stranding woes 136
21.1 Stranded? Yes, No, Reverse, RF, FR, SF, SR . . . . . . . . . 137
21.2 Plz send help! . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
21.3 How to perform stranded alignments? . . . . . . . . . . . . . . 138
21.4 How to perform a stranded counting? . . . . . . . . . . . . . . 139
22 6. Grinch: integrity torment 141
22.1 Coding regions are NOT transcripts! . . . . . . . . . . . . . . 142
22.2 Transcript integrity . . . . . . . . . . . . . . . . . . . . . . . . 144
23 7. Grinch: R anguish 146
23.1 The Curse of Knowledge . . . . . . . . . . . . . . . . . . . . . 146
23.2 Help is available . . . . . . . . . . . . . . . . . . . . . . . . . . 147
23.3 R installation problems . . . . . . . . . . . . . . . . . . . . . . 147
IV FURTHER EXPLORATIONS 149
24 The RNA-Seq puzzle 151
24.1 How would I even solve this puzzle? . . . . . . . . . . . . . . . 151
24.2 The Pledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
24.3 The Turn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
24.4 The Prestige . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
24.5 How to solve it (a hint) . . . . . . . . . . . . . . . . . . . . . 154
25 The Bear Paradox 155
25.1 Create realistic data . . . . . . . . . . . . . . . . . . . . . . . 156
V CODE 157
26 Code: Mission Impossible RNA-Seq 159
26.1 Bioinformatics environment . . . . . . . . . . . . . . . . . . . 159
26.2 Obtain the recipes . . . . . . . . . . . . . . . . . . . . . . . . 160
26.3 Command line usage . . . . . . . . . . . . . . . . . . . . . . . 160
26.4 Code listing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Chapter 1
Welcome to the world of
RNA-Seq
Last updated on February 14, 2023
This book is the volume from the Biostar Handbook Collection1 that intro-
duces readers to RNA-Seq data analysis. Prerequisites:
• Computer setup2 - software must be installed as explained there.
• Introduction to Unix3 - a basic understanding of the command line.
Techniques in the book make use of more advanced looping constructs ex-
plained in great detail in The Art of Bioinformatics Scripting4
• The Biostar Handbook
• Bioinformatics Scripting
• RNA-Seq by Example
• Corona Virus Genome Analysis
• Biostar Workflows
1 [Link]
2 [Link]
3 [Link]
4 [Link]
1.1 Mission Impossible RNA-Seq
Not all heroes wear capes! One day you may be called upon to save your
world (or your thesis) with a masterful data analysis. Fear not, you came to
the right place; we will teach you how to do it, perhaps in a minute. The
analysis produces a differential expression study with three different methods.
Read on to learn how you can do it as well.
1.2 Typesetting conventions
In this book we will wrap long code lines via the \ symbol. A \ character
followed by a newline

echo AAAA\
BBBB

is equivalent to writing:
echo AAAA BBBB
In practice, code written like this:

cat [Link] | parallel --round-robin --pipe --recstart '>' \
    'blat -noHead [Link] stdin >(cat) >&2' > [Link]

is equivalent to writing it all on a single line:

cat [Link] | parallel --round-robin --pipe --recstart '>' 'blat -noHead [Link] stdin >(cat) >&2' > [Link]
1.3 How to download the book
The book is available to registered users. The latest versions can be down-
loaded from:
• RNA-Seq by Example, PDF5
• RNA-Seq by Example, eBook6
Our books are updated frequently. We recommend accessing each book via
the website as the web version will always contain the most recent and up-to-
date content. A few times a year we send out emails that describe the new
additions.
1.4 How was the book developed?
We have been teaching bioinformatics and programming courses to life sci-
entists for many years now. We are also the developers and maintainers of
Biostars: Bioinformatics Question and Answer7 website, the leading resource
for helping bioinformatics scientists with their data analysis questions.
5 [Link]
6 [Link]
7 [Link]
We wrote this book based on these multi-year experiences in training students
and interacting with scientists that needed help to complete their analyses.
We are uniquely in tune with the challenges and complexities of applying
bioinformatics methods to real problems, and we’ve designed this book to
help readers overcome these challenges and go further than they have ever
imagined.
Chapter 2
Introduction to RNA-Seq
2.1 What is RNA-Seq when simplified to its
essence?
Conceptually an RNA-Seq analysis is quite a straightforward process. It goes
like this:
1. Produce sequencing data from a transcriptome in a “control” state
2. Match sequencing reads to the genome or the transcriptome.
3. Count how many reads align to a region (feature). Let’s say 100 reads
overlap with Gene A.
Now say the cell is subjected to a different condition; for example, a cold
shock is applied. We perform the same measurements as before:
1. Produce sequencing data from the transcriptome in the “perturbed”
state
2. Match sequencing reads to the genome or the transcriptome.
3. Count how many reads align to the same region (feature), gene A. Sup-
pose 200 reads overlap with Gene A.
We see that 200 is twice as big as 100. In an ideal world, we could state that
there is a two-fold increase in coverage. Since coverage is (ideally)
proportional to the abundance of the transcript (aka its expression level),
we may then report that the transcription level of Gene A doubled in the
second experiment.
We call the doubling the differential expression (aka DE). We could report
a foldChange=2 for gene A. To keep things symmetrical between a 2-fold
increase (foldChange=2) and a 2-fold decrease (foldChange=0.5), scientists
typically compute the base-2 logarithm of the fold change and call that
log2FoldChange (aka log2FC). Then doubling would be log2FC=1 and halving
would be log2FC=-1. Note how nice and symmetrical the increase and
decrease are with this formula.
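The fold change arithmetic above can be written out in a few lines. Here is a minimal Python sketch (illustration only; the book's analyses themselves use command line tools and R):

```python
import math

def log2_fold_change(control, treated):
    """Base-2 logarithm of the expression ratio treated / control."""
    return math.log2(treated / control)

# Gene A from the example: coverage goes from 100 to 200 reads.
print(log2_fold_change(100, 200))   # doubling -> 1.0
print(log2_fold_change(200, 100))   # halving  -> -1.0
```

Note how the symmetry appears automatically: a doubling and a halving differ only in sign.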
That’s it. We would be done with the “informatics” section of our RNA-Seq
analysis in a perfect world.
Now the process of biological interpretation would start. We would need to
investigate gene A: what functions of it are known, and is the response to shock
one of them? If yes, we have both validated that finding and can look into how
the gene is regulated. If the response to cold shock was not listed as a function
of gene A, we might have discovered a new role. And so on. Note how this
step is not computational anymore; instead, it requires domain knowledge.
So that’s RNA-Seq in a nutshell. The task itself does not seem incredibly
daunting. We need to count some numbers. Of course, reality is not so
simple.
2.2 What makes RNA-Seq challenging?
The stochasticity and imprecision of the many processes involved, coupled
with the sheer scale of the data, conspire against determining the fold change
so readily.
We can’t just divide the two numbers; we can’t even trust the numbers to
mean what they are supposed to. There will be many bookkeeping, defensive
measures taken to protect against you-know-who: The Dark Lord of False
Discoveries. He grows stronger as the data grows; thus, the defense is chal-
lenging even though that task itself is relatively straightforward: compare
one number to another and estimate whether their difference is big enough
to be considered a valid change.
Take solace and strength from recognizing that the complexity lies in the data
and information management - the analysis process itself does not demand
particularly advanced mathematical or analytical problem-solving skills.
At the same time, you had best understand right away that there is no silver
bullet that makes everything fall into place. There is no magic wand
or approach that always guarantees success. Your job will always be to
respond to what the data tells you. For that, you will need to understand
what the methods do in more depth.
2.3 How does RNA sequencing work?
The RNA-seq protocol turns the RNA produced by a cell into DNA (cDNA,
complementary DNA) via a process known as reverse transcription. The
resulting DNA is then sequenced, and from the observed abundances of DNA,
we attempt to infer the original amounts of RNA in the cell.
Conceptually the task looks simple. It seems that all we need to do is count
how many DNA fragments there are. If we get higher numbers under
a particular condition, it means that the cell must be producing RNA at
higher levels. In reality, counting and comparing are more complicated and
confounded by several factors - the most important being that the mRNA
levels in the cell are rarely static! There is a continuous ebb and flow in
response to particular stimuli. We have to isolate the signal that corresponds
to the functions of interest from all the other unrelated changes present in
the mRNA concentrations.
2.4 What are gene isoforms?
Gene isoforms are mRNAs that are produced from the same locus but are
different in their transcription start sites (TSSs), protein-coding DNA se-
quences (CDSs) and untranslated regions (UTRs), potentially altering gene
function.
Above we show three possible isoforms of a gene composed of three exons.
Think about how reads that are fully contained in exon 2 would not be
informative as to which transcript is expressed.
2.5 What kind of splicing events exist?
Five primary modes of alternative splicing are recognized:
1. Exon skipping or cassette exon.
2. Mutually exclusive exons.
3. Alternative donor (5’) site.
4. Alternative acceptor (3’) site.
5. Intron retention.
A few of these examples are shown below:
When viewing RNA-Seq data in IGV you can recognize and identify these
by eye.
2.6 What is the final result of an RNA-Seq
analysis?
The result of an RNA-Seq analysis is a quantification matrix. For our toy
example, the file might look like this:
name control shock
Gene A 100 200
Gene B 80 60
Gene C 120 180
...
When processed by a statistical method, the quantification file would be
augmented with statistical measures that characterize the changes. For ex-
ample, if we were to perform a pairwise comparison between cold shock and
control (shock/control), the file above will gain more information, expressed
as additional columns:
name control shock fold_change pvalue
Gene A 100 200 2.0 0.000035
Gene B 80 60 0.75 0.234
Gene C 120 180 1.5 0.013
...
Interpreting this file at the 0.05 statistical significance level would then tell
us that Gene A and Gene C are differentially expressed between control and
shock, whereas Gene B is not.
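A table like this can also be screened programmatically. Below is a minimal Python sketch (the rows reproduce the toy values above; the 0.05 cutoff is the one used in the text):

```python
# Toy version of the augmented quantification table from above.
results = [
    # name,    control, shock, fold_change, pvalue
    ("Gene A", 100,     200,   2.0,         0.000035),
    ("Gene B", 80,      60,    0.75,        0.234),
    ("Gene C", 120,     180,   1.5,         0.013),
]

# Keep the genes that pass the 0.05 significance cutoff.
significant = [name for name, *_, pvalue in results if pvalue < 0.05]
print(significant)   # ['Gene A', 'Gene C']
```

In a real analysis the statistical method (DESeq, edgeR, etc.) produces this table; the filtering step is the same idea.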
2.7 How many replicates do I need?
While we can make some inferences from a single measurement, the inherent
variability of biological phenomena typically requires that we repeat our mea-
surements several times. The process of repeating the entire experimental
process, from sample extraction to sequencing, is called replication.
A typical recommendation is to make at least three measurements (replicates);
the word on the street is that five replicates might be optimal. With
replicates, our quantification matrix grows; it will look more like this:
name control1 control2 control3 shock1 shock2 shock3
Gene A 100 103 88 200 188 199
Gene B 300 276 310 120 121 120
Gene C 400 389 459 530 600 700
...
Above, we replicated each gene’s measurements three times in each state
control and shock.
As a general rule of thumb, when you expect the effects to be subtle and bio-
logical variation significant (e.g., experiments with live animals from different
populations), more replicates are better. Note that by generating five repli-
cates across two conditions, each producing two files, you are already looking
at 5 x 2 x 2 = 20 FASTQ files coming off the sequencing instrument. Your
ability to automate processes will become more critical than ever.
2.8 Can one have too many replicates?
Superficially it might seem that the more replicates, the better. From a
purely mathematical perspective, the statement is valid.
Yet there is a human factor to replication: with an increasing number of
samples, the chance of introducing a human error increases. Errors such as
mislabeling samples or misusing protocols can have immensely detrimental
effects on our ability to interpret data. A single incorrect replicate may severely
impact the statistical power of the experiment. Therefore, the potential for
human error needs to be factored into the cost/benefit evaluation.
Empirically we’ve observed an inverse correlation between experiment size
and data quality. The larger the experiment, the worse the data is. In
general, we recommend focused studies that strike a balance between the
statistical power requirements and human factors.
Then there are ethical considerations. Scientists working with live animals
need to consider the welfare of laboratory animals and should familiarize
themselves with the three Rs: Refinement, Reduction, and Replacement. For
an overview see:
• Recognition and Alleviation of Distress in Laboratory Animals1
where the authors state:
The simplest approach to avoiding, minimizing, and alleviating
distress in laboratory animal care and use is to follow the princi-
ples of the Three Rs—refinement, reduction, and replacement. It
is important, however, to strike a balance between the integrity
of research outcomes and the welfare of animals used.
2.9 What are the main methods for quanti-
fying RNA-Seq data?
The first task of an RNA-Seq data analysis is to select the “reference” that
your study will use to quantify gene expression changes. Two main ap-
proaches are in use:
1. Quantify against a genome.
2. Classify against a transcriptome.
1 [Link]
2.10 How does quantifying against a genome
work?
When quantifying against a genome, the reference data will be the genome
sequence and the genome annotations. You will need a sequence file that
contains chromosomal coordinates (FASTA) and an annotation (interval) file
(GFF) that describes coordinates on the genome.
The approach will intersect the alignment files with the annotations to produce
“counts” that are then filtered to retain statistically significant results. The
quantification matrix represents how many times alignments overlapped
(intersected) an interval labeled as part of a transcript.
• The green boxes represent the reference data you need to have access
to before starting the analysis.
• The blue boxes represent data that you start with or need to create
along the way.
• The red ovals represent methods/algorithms that transform the data.
2.11 How does classifying against a transcrip-
tome work?
When quantifying against a transcriptome, the reference data will be the
FASTA file of each transcript. The classification will assign each read to a
transcript, resolving ambiguities along the way. The quantification matrix
for each transcript will represent the number of times reads were classified
as that particular transcript.
• The green boxes represent the reference data you need to have access
to before starting the analysis.
• The blue boxes represent data that you start with or need to create
along the way.
• The red ovals represent methods/algorithms that transform the data.
2.12 What are the tradeoffs?
Genome-based methods:
1. Allow us to visualize the data in the context of the entire genome.
2. Enable us to discover/validate new transcripts.
Transcriptome- and classification-based methods:
1. Are typically more accurate.
2. Require lower computational resources.
For example, we can run a salmon-based RNA-Seq classification on a MacBook
Air laptop within an hour.
2.13 Which method should I use?
I have an awesome recommendation for you, friend:
You should perform BOTH types of analyses!
Each method will provide you with a different perspective of the same phe-
nomena. Only by seeing the data in the context of genome and transcriptome
will you fully appreciate the complexity of the task at hand.
The only reason NOT to do both is when the required reference information is
not available: for example, when the transcriptome has not yet been
characterized, or when no genome assembly exists.
2.14 Will there ever be an optimal RNA-Seq
analysis method?
As you will see, we present several alternative analysis protocols. You may
ask yourself, why isn’t there a “best” method?
There is an analogy to the mathematical paradox called the voting (Con-
dorcet) paradox2 . This paradox states that when tallying votes for more
than two options, we may end up with final rankings that are in conflict with
the individual wishes of a majority of voters!
Moreover, the Condorcet paradox also states that there is no optimal voting
strategy that would avoid all possible conflicts. Another way to say this is that
every voting scheme can, in some (potentially rare) situations, produce
wildly unrealistic results that do not capture the voters’ actual intent.
When RNA-Seq methods assign measurements to genes, a sort of voting
scheme applies. Because often there are ambiguities on how to resolve a
measure, decisions have to be made to allocate the measurement to its most
likely origin (“candidate”) transcript. Unsurprisingly the Condorcet paradox
will apply, and no matter what method we choose, under some conditions,
methods may evaluate gene expressions incorrectly.
What is the take-home message here? We believe that the explanation is as
simple as it is counter-intuitive:
There is no best method for RNA-Seq. There are only methods
that are good enough. The meaning of good enough evolves in
time. Good enough means that the method can identify most of
those gene expression changes relevant to the study.
There is always a chance that you’ll have an unusual phenomenon to quantify
for which your initial choice of methods is not optimal. The most crucial skill
2 [Link]
then is to recognize this situation.
You will need not just to learn how to perform an RNA-Seq analysis
but to understand when a process does not seem to produce
reasonable or correct results.
In our opinion, the usability and documentation of a method are among its
most essential ingredients. Treating an RNA-Seq analysis as a black box that
you just run with a click is usually a recipe for disaster.
2.15 When will I know that I understand
RNA-Seq?
Here is a simple guideline:
You know RNA-Seq when you can readily analyze a dataset with three
different methods!
It is essential not to become a one-shot wonder, where there is only one kind
of analysis you understand. Life science throws curveballs. Having a clear
understanding of the processes will allow you to make informed decisions.
You will see that it is not that difficult to use entirely different methods to
analyze the same data.
The challenge is not running the tool but making an informed decision when
biologically interpreting the results.
Chapter 3
RNA-Seq terminology
In this section, we’ll try to clarify terms that you may encounter when reading
RNA-Seq related documentation.
3.1 What is a sample?
The commonly used word “sample” in sequencing lingo means a biological
sample (extraction) that was subjected to sequencing. When we talk about
samples, we assume that these are grouped into conditions via a so-called
experimental design. For example, we could say we have 10 samples, arranged
into two conditions with five replicates per condition: 2 x 5 = 10.
The “RNA-Seq” sample should not be confused with “statistical sample” –
in statistics “sample” refers to selecting a subset of a population.
3.2 How should we represent RNA-Seq sam-
ples?
In our books we champion the use of a simple comma separated sample file
format where we list the so called roots of the names in a column called
sample and the labels for each sample in the condition column. For exam-
ple:
sample,condition
name1,A
name2,A
name3,A
name4,B
name5,B
name6,B
Above we list six samples arranged into two conditions, A and B. The names
correspond to the identifiers of the samples, from which we can build the full
path to each data file. For example, name1 could be the identifier for the file
data/[Link]. From this sample design file we can then build the
names of all subsequent input and output files.
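The design file above can drive the rest of an analysis. Here is a minimal Python sketch (the data/{name}.fq naming pattern is only a hypothetical example of building paths from the sample roots):

```python
import csv
import io

# The comma separated design file from above, inlined for the example.
design = """\
sample,condition
name1,A
name2,A
name3,A
name4,B
name5,B
name6,B
"""

rows = list(csv.DictReader(io.StringIO(design)))

# Build input paths from the sample "roots" (hypothetical naming scheme).
files = {row["sample"]: f"data/{row['sample']}.fq" for row in rows}

# Group sample names by condition.
groups = {}
for row in rows:
    groups.setdefault(row["condition"], []).append(row["sample"])

print(files["name1"])   # data/name1.fq
print(groups["B"])      # ['name4', 'name5', 'name6']
```

The same lookup works for output names (alignments, counts), which is what makes this simple file format so useful for automation.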
3.3 What is normalization?
When we assign values to the same labels in different samples, it becomes
essential that these values are comparable across the samples. The process
of ensuring that these values are expressed on the same scale is called nor-
malization. Several different normalization methods are in use, and each
operates with different types of assumptions.
3.4 What is a library size normalization?
The term “library size” is a frequently used (but improper term) that usually
means the sequencing coverage (depth) of a sample. For example, it might
be that for experiment A we have ended up with twice as much material
being placed into the instrument than for experiment B. Our quantification
methods need to be able to detect and account for differences that occur
merely due to the amount of initial material.
3.5 What is the effective length?
The sequencing coverage drops towards the ends of DNA molecules due to the
lower chances of producing fragments that include the end of the molecule.
For genomes, this loss of coverage is usually less of a problem since it only
affects the ends of very long chromosomes - and typically there are only a few
dozen such locations. When working with RNA transcripts that number in the
tens or hundreds of thousands, the edge effects will affect every transcript.
Moreover, shorter transcripts will be more impacted (the edge effect is a
larger fraction of the transcript) than longer ones.
A correction is usually applied to account for this edge effect - a so-called
“effective length” will replace the true length of each feature. For example,
one common correction is to shorten each transcript end by half of the read
length. Only the values calculated over this shortened length will be used.
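The correction described above can be sketched in a few lines of Python (the default read length of 100 bp is a made-up value for illustration):

```python
# Effective length: trim half a read length from each end of the transcript,
# i.e. subtract one full read length in total.
def effective_length(transcript_length, read_length=100):
    return max(transcript_length - read_length, 1)

# Shorter transcripts lose a larger fraction of their length.
print(effective_length(500))     # prints 400
print(effective_length(10000))   # prints 9900
```

For the 500 bp transcript the correction removes 20% of the length; for the 10 kb transcript only 1% - exactly the size-dependent bias discussed above.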
3.6 What gets counted in a gene level analysis?
The gene level analysis treats every gene as a single transcript that contains
all exons of the gene. In some cases, this approximation is sufficient to
get meaningful results. In other cases, a more fine-grained analysis will be
necessary wherein abundances are quantified at the level of each transcript.
Gene level analyses collapse the transcripts into a single “representative”
sequence - one that may not be a valid transcript, since it is the union of all
transcripts. Clearly, a gene level analysis will not be able to identify those
mechanisms that rely on changes in the abundance of isoforms of a single gene.
3.7 How do we quantify gene expression?
As you shall see, there is no shortage of measures that scientists have at
some point used. These measures include CPM, FPKM, RPKM and TPM. We
define each at the end of the current chapter, but, as you shall see later, none
of these measures is universally suitable for gene expression quantification.
The statistical methods that we employ are designed to use the observed read
counts without any type of correction factor.
3.8 What is TMM (edgeR) normalization?
Trimmed mean of M values (TMM) normalization estimates sequencing
depth after excluding genes for which the ratio of counts between a pair
of experiments is too extreme or for which the average expression is too
extreme. The edgeR software implements a TMM normalization.
3.9 What is DESeq normalization?
The DESeq normalization method (implemented in the DESeq R package)
estimates the sequencing depth of each sample as the median, taken across
genes, of the ratio between the gene’s count and its geometric mean across
all samples (the so-called median-of-ratios method).
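The median-of-ratios idea can be illustrated with a toy Python sketch (made-up counts; this is not the actual DESeq implementation, which handles zeros and many other edge cases):

```python
import math
from statistics import median

# Toy count matrix: rows are genes, columns are two samples.
# Sample 2 was sequenced at twice the depth of sample 1.
counts = [
    [100, 200],
    [ 50, 100],
    [ 30,  60],
    [ 10,  20],
]

def size_factors(matrix):
    n_samples = len(matrix[0])
    factors = []
    for j in range(n_samples):
        ratios = []
        for row in matrix:
            # Skip genes with zero counts (their log is undefined).
            if all(c > 0 for c in row):
                geomean = math.exp(sum(math.log(c) for c in row) / len(row))
                ratios.append(row[j] / geomean)
        factors.append(median(ratios))
    return factors

print(size_factors(counts))
```

The resulting factors (roughly 0.71 and 1.41) recover the twofold depth difference between the two toy samples; dividing each column by its factor puts the counts on a comparable scale.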
3.10 Do I always need an advanced statistical analysis?
Surprisingly, the answer is no. The methods that need to be employed depend
on the goals of your experiment.
If, before starting the experiment, you knew which gene you wanted to know
more about, and you care only about this gene, then the law of large numbers
works in your favor.
It is very unlikely that your gene of interest was singled out by chance and
affected in a way that misleads you. That is, if you use your RNA-Seq data
to verify a statement about a single, pre-selected gene, then the simplest of
statistical tests and common sense suffice.
But if you did not know which transcripts might change and you wanted
to reliably determine that out of tens of thousands of alternatives and their
combinations then more sophisticated methods are necessary to ensure that
whatever change you observe was not caused by natural variation in the data.
3.11 What is a “spike-in” control?
The goal of the spike-in control is to determine how well we can measure and
reproduce data with known (expected) properties. A commercial product
such as the “ERCC ExFold RNA Spike-In Control Mix”1 can be added in
different mixtures. This spike-in consists of 92 transcripts that are present in
known concentrations across a wide abundance range (from very few copies
to many copies).
You may use spike-in controls to validate that a protocol operates as expected.
Of course, challenges still remain even with spiked protocols.
3.12 How should I name samples?
With RNA-seq analysis you may need to work with many dozens of samples.
One of the skills that you have to develop is to parse file names and connect
them to known sample information. File naming practices vary immensely
but having an odd naming scheme can be the source of the most devious
catastrophes! We suggest the following practices:
1. Each attribute of the data should be captured by a single section of
the name.
2. If there is a hierarchy to the information then start with the MOST
GENERIC piece of information and work your way back from there,
end the file with the MOST SPECIFIC bit of information.
For example, this is an appropriate naming scheme.
HBR_1_R1.fq
HBR_2_R1.fq
UHR_1_R2.fq
UHR_2_R2.fq
The first unit indicates the sample: HBR or UHR, then comes the replicate
number 1 or 2, then the read of the pair: R1 or R2. The roots of the samples
are HBR_1 and UHR_1 etc. from which we can generate the full paths to both
inputs and outputs.
1 [Link]
A less optimal naming scheme would be one that encodes additional sample-
specific information into the name without grouping it properly:
HBR_1_Boston_R1_.fq
HBR_2_Boston_R2_.fq
UHR_1_Atlanta_R2_.fq
UHR_2_Atlanta_R2_.fq
Above, both HBR and Boston represent sample-specific information, whereas
1 and 2 are replicate-specific information at a different level of granularity.
The names are less well suited to automation, and you’ll have to work harder
to produce the outputs you want.
In a nutshell, it is much easier to automate and summarize processes when
the sample information is properly structured. You’d be surprised how few
data analysts understand this - many end up with programs that are far
more complicated than they need to be.
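One payoff of a properly structured scheme is that the parts can be extracted with a single regular expression. Here is a Python sketch (the pattern below assumes the <sample>_<replicate>_<read>.fq layout shown earlier):

```python
import re

# Pattern for names structured as <sample>_<replicate>_<read>.fq
pattern = re.compile(r"(?P<sample>[A-Za-z]+)_(?P<rep>\d+)_(?P<read>R[12])\.fq")

names = ["HBR_1_R1.fq", "UHR_2_R2.fq"]
parsed = [pattern.match(name).groupdict() for name in names]

for info in parsed:
    print(info["sample"], info["rep"], info["read"])
```

With the poorly grouped names shown below, a single pattern like this no longer works, and every downstream script grows a special case.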
3.13 Don’t mix up your samples!
As you gain confidence with data analysis you are getting closer to making
subtle mistakes with profound consequences.
The most dangerous mistake you will ever make is one where you
plug in the wrong data into the pipeline! For example, you acci-
dentally swap control with condition, or compute the wrong ra-
tio hot/cold vs cold/hot etc. Whereas initially your errors will
manifest themselves in obvious ways - i.e. the command fails
to run altogether - mislabeling data will silently produce
incorrect results.
Examples of errors caused by mixed-up samples abound in science; here
is one we saw at the time of writing the first version of this chapter:
• Study Linking Autism to ‘Male Brain’ Retracted, Replaced2
2 [Link]
where the authors accidentally flipped the labels and managed to publish a
paper with the exact opposite results to what the data indicated.
Of all the errors that you can make, the one above and its variants (so what
is divided by what?) cause the most lasting devastation. The computational
tasks are so complex, yet the error is so simple - and paradoxically that is
what makes it much more difficult to identify!
Simplicity is key to success! Keep your data naming scheme short, evident
and simple!
• Good: UHR_1
• Bad: Universal_Human_Brain_Sample_Genotype_paa11_t48_R1_boston_redo.[Link]
Remember! The sample name is not where the metadata on the sample
should primarily be stored!
3.14 What is the CPM?
CPM is an acronym for “counts per million”. It is a measure of the abundance
scaled to one million mapped reads.
CPM = 1,000,000 * N / C
where N is the number of reads that mapped to the feature and C is the
total number of mapped reads in the sample.
You could call it a milli-count.
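The formula translates directly into code; a small Python sketch with made-up numbers:

```python
# CPM: reads on the feature (N) scaled by the total mapped reads (C).
def cpm(n, total_mapped):
    return 1_000_000 * n / total_mapped

# A feature with 500 reads in a sample of 20 million mapped reads.
print(cpm(500, 20_000_000))  # prints 25.0
```

Note that CPM corrects only for sequencing depth; it makes no attempt to account for feature length.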
3.15 What is the RPKM?
RPKM is an acronym for read counts per kilobase of transcript per million
mapped reads. Gee thanks!
If we wanted to compare the number of reads mapped to one given transcript
to another transcript of the same sample, we have to account for the fact that
longer transcripts will produce more DNA fragments merely because they are
longer.
If N is the number of reads mapped to a transcript, and C is the total
number of reads mapped for the sample, we cannot just take N / C
as our measure of gene expression. A single copy of a longer transcript will
produce more fragments (a larger N) than a single copy of a shorter transcript.
Instead, we may choose to divide the fraction of the reads mapped to the
transcript by the effective length of the transcript:
gene expression = 1,000,000 * N / C * 1,000 / L
Basically, we first compute the fraction of the total reads that map to a
transcript then divide that by the length of the transcript. Then we scale
both the counts and the lengths.
This number would typically be tiny, since just a small subset of the total
reads will align to any one gene; we then also divide by the transcript length,
which again may be a large number in the thousands. Other sciences
solve the problem of reporting small numbers by using words such as milli,
micro, nano, pico, etc. to indicate the scale.
Those who came up with the concept of RPKM yearned to create a more
“user friendly” representation to avoid “confusing the stupid biologists” (they
did not actually mention the word stupid, but it is so clearly implied) and
decided that the best way to get there was to express L in kilobases (10^3)
and C in terms of millions of reads (10^6), in essence adding a factor of a
billion to the formula above:
RPKM = 10^9 * N / L * 1 / C
Hence a weird unit was born, the RPKM, that, in our opinion only ends
up being a lot more confusing than it needs to be. Those with training in
sciences where formulas are not just pulled out of thin air, like say physics,
will note immediately that the RPKM as defined above has a “dimension” to
it.
Whereas N and C are integer counts, 1/L is the inverse of a distance -
which should give everyone pause. Why is it appropriate to measure gene
expression by a quantity that is an inverse of a distance? Transcripts either
exist or not. The unit of the RPKM as defined above is a measure of the
speed of transcription (how many units are produced per length) and not
how many transcripts exist.
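Criticism aside, the computation itself is straightforward; a Python sketch with made-up numbers:

```python
# RPKM = 10^9 * N / (L * C): N reads on a transcript of length L (bp),
# C total mapped reads in the sample.
def rpkm(n, length, total_mapped):
    return 1e9 * n / (length * total_mapped)

# 500 reads on a 2 kb transcript in a 20 million read sample.
print(rpkm(500, 2000, 20_000_000))  # prints 12.5
```

The 10^9 factor is exactly the kilobase (10^3) and per-million (10^6) scaling described above, folded into one constant.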
Oh, how life would have been a whole lot simpler if the original group3 of
scientists that invented this measure would have just called this quantity
3 [Link]
a pachter, as a hat tip to Lior Pachter4 , one of our favorite academic-rebel
trolls. Then RPKM would have been just called a pico-pachter and may have
ended up being a better understood quantity.
As we learn more about the limitations of RPKM, the consensus appears to
be that RPKM is an inappropriate measure of gene expression, its use should
be curbed and eliminated. Yet, as you will see, some of the most commonly
used software continues to produce results in RPKM - it is a zombie concept
that refuses to die.
3.16 What the FPKM is that?
FPKM is an acronym for fragment counts per kilobase of transcript per million
mapped reads. Gee thanks!
FPKM is an extension of the already flawed concept of RPKM to paired-end
reads. Whereas RPKM refers to reads, FPKM computes the same values over
read-pair fragments. Conceptually it is even worse, as the word “fragment”
only adds another level of ambiguity to an already questionable concept.
Which fragments will be considered: the apparent fragment size (the TLEN
from SAM)? The sum of read lengths for the pair? The alignment lengths?
The sum of aligned regions for each read? Alas, this is not defined; you are
left at the mercy of the tool implementer.
For further reading we recommend a useful blog post by Harold Pimentel -
the post that inspired the title of this question: What the FPKM? A review
of RNA-Seq expression units5.
3.17 Why are RPKM and FPKM still used?
If the requirement for accuracy is sufficiently low, RPKM and FPKM can
produce “useful” results. For example, they are most certainly better than
naive estimates based on raw read counts. On the other hand, several known factors severely
4 [Link]
5 [Link]
affect RPKM and FPKM studies - and in our opinion, most results relying
on these values are scientifically less sound.
3.18 What is TPM?
TPM is an acronym for transcript counts per million mapped reads. Gee
thanks!
One serious limitation of RPKM is that it ignores the possibility that new
and different transcripts may be present when experimental conditions
change. The RPKM dimension of 1/distance also indicates that, instead of
being a quantity that indicates amounts, it is a quantity that characterizes
change over distance.
Values can only be compared when that “distance” is the same. As it turns
out, the “distance” that RPKM tacitly assumes to be the same is the total
transcript length: it assumes the reads are distributed over the same “range”
of DNA.
A more appropriate normalization should instead divide by a value T that
accounts for potential changes in the overall composition of the transcript pool.
gene expression = 1,000,000 * N / L * 1 / T
A way to incorporate both the number of counts and the length into T is to
sum the rates:
T = sum Ni/Li
where i goes over all observed transcripts and Ni are the reads mapped to a
transcript of length Li.
Not to be outdone by RPKM in the department of protecting biologists
from confusion, a new measure was born, this time called the TPM, where
we multiply the above by a million to save biologists from the unbearable
mental overload of having to deal with small numbers:
TPM = 10^6 * N / L * 1 / sum(Ni/Li)
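In code, the definition looks as follows (made-up counts and lengths). Note that, unlike RPKM, the TPM values within a sample always sum to one million:

```python
# TPM: each transcript's rate N/L, normalized by the sum of all rates.
def tpm(counts, lengths):
    rates = [n / l for n, l in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]

values = tpm([500, 1000, 250], [2000, 1000, 500])
print(values)
print(round(sum(values)))  # prints 1000000
```

That fixed total is what makes TPM values comparable as proportions within a sample, though comparisons across samples still rest on assumptions about the transcript pool.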
Life would have been a whole lot simpler if the original group6 of scientists
that invented TPM had just called this quantity a salzberg, as a hat
6 [Link]
tip to Steven Salzberg7, one of the Greatest Bioinformaticians Of All Time
(GBOAT). The TPM would then have been called a milli-salzberg and might
have turned out to be a better-understood quantity.
Since there is a distance dimension both in the numerator and denominator,
the TPM is dimensionless (unlike RPKM).
7 [Link]
Chapter 4
Statistics for RNA-Seq
Our favorite summary for statistics comes from the blog post The four aspects
of statistics1 where Frederick J. Ross writes:
Statistics resembles the apocryphal elephant being examined by
blind men. Each person uses, and often only knows, a particular
set of statistical tools, and, when passing on their knowledge,
does not have a picture to impart of the general structure of
statistics. Yet that structure consists of only four pieces: planning
experiments and trials; exploring patterns and structures in the
resulting data; making reproducible inferences from that data;
and designing the experience of interacting with the results of an
analysis. These parts are known in the field as
• design of experiments
• exploratory data analysis
• inference
• visualization
Sadly, this basic structure of statistics doesn’t seem to be writ-
ten down anywhere, particularly not in books accessible to the
beginner.
Read more on each point on the blog2.
1 [Link]
2 [Link]
4.1 Why do statistics play a role in RNA-Seq?
Whereas alignments or counting overlaps are mathematically well-defined
concepts, the goals of a typical experiment are more complicated. The data
itself may be affected by several competing factors as well as random and
systematic errors. Statistics gives us tools that can help us extract more
information from our data and can help us assign a level of confidence or a
degree of uncertainty to each estimate that we make.
4.2 When do I need to make use of statistics?
You will need to think about statistics first when designing the experiment.
In this stage you have to enumerate the goals and parameters of the
experiment. Consulting with a statistician, if that option is available to you,
is obviously a good choice. The most important advice I would give is to not
be too ambitious (greedy?) with the experiment. In my experience, projects
that try to cover too many ideas at once - multiple genotypes, multiple
conditions, various interactions, multiple time points, etc. - end up with
less reliable results than focused experiments.
It is akin to the joke: If you have one clock you know what the time is, if you
have ten clocks you never know which one is right.
The second stage for statistics comes into play when you collect and process
the data into a matrix. Then, in most cases, interpreting the information
in either a row, a column or a cell needs to be done in the context of those
other numbers in the table.
ID Condition 1 Condition 2 Condition 3
SEPT3 1887.75036923533 81.1993358490033 2399.647399233
SEZ6L 1053.93741152703 530.9988446730548 211.73983343458
MICALL1 136.421402611593 197.470430842325 120.9483772358
A statistical test is a process by which you make quantitative or qualitative
decisions about the numbers.
4.3 What kind of questions can we answer with a statistical test?
Here is a selection:
• How accurate (close to reality) are these results?
• How precise (how many digits are meaningful) are the values?
• For which observation do values change between conditions?
• For which observation is there at least one changed condition?
• Are there genes for which there is a trend to the data?
• What kinds of patterns are there?
• Which observations vary the same way?
Statistical tests operate with principles such as margins of error, probabilities
of observing outcomes and other somewhat indirect measures. These measures
can easily be misinterpreted, and scientists routinely make mistakes when
summarizing and reformulating statements derived from statistical tests. See
the section on p-values later in this chapter.
4.4 What types of statistical tests are common?
The pairwise comparison is one of the most common and conceptually most
straightforward tests. For example, a pairwise comparison would compare
the expressions of a gene between two conditions. A gene that is found to
have changed its expression is called differentially expressed. The set of all
genes with modified expression forms what is called the differential expression
(DE).
Here, it is essential to understand the type of results that pairwise compar-
isons usually produce. Almost always we want to answer whether a gene’s
expression level has changed. But what we will typically obtain instead is
the probability of observing a difference at least as large as the one seen,
assuming there was no difference between conditions (i.e., a probability
computed under the null hypothesis).
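To illustrate what "probability under the null hypothesis" means, here is a toy exact permutation test in Python (made-up expression values; real RNA-Seq analyses use dedicated tools such as edgeR or DESeq2):

```python
import itertools

# Expression of one gene in two conditions (made-up values).
cond_a = [10.1, 9.8, 10.3]
cond_b = [12.0, 12.4, 11.8]
observed = abs(sum(cond_b) / 3 - sum(cond_a) / 3)

# Null hypothesis: the labels do not matter. Count how often a relabeling
# of the six values yields a mean difference at least as large as observed.
values = cond_a + cond_b
hits, total = 0, 0
for idx in itertools.combinations(range(6), 3):
    group_a = [values[i] for i in idx]
    group_b = [values[i] for i in range(6) if i not in idx]
    diff = abs(sum(group_b) / 3 - sum(group_a) / 3)
    if diff >= observed - 1e-12:
        hits += 1
    total += 1

p_value = hits / total
print(p_value)  # prints 0.1
```

Only 2 of the 20 possible relabelings reach the observed difference, so the probability of seeing such a split under the null is 0.1 - a statement about the data under an assumption, not about whether the gene "really" changed.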
When this probability is low (small) we can reject the null hypothesis, and we
conclude that there is a change. The expression “reject the null hypothesis”
may seem like mincing words or trying to sound clever. But when further
investigations are necessary it is essential to use these results in their proper
context. It is exceedingly common to formulate conclusions in a manner that
implies more than what the tests support.
4.5 Do I need to involve a statistician?
Ideally, of course, the answer is yes.
But we’re in the practical advice business here and, in reality, it is not always
all that easy to find a collaborator well versed in statistics. Besides, just as
with bioinformatics, it would be a mistake to oversimplify statistics into a
purely procedural skill: “any statistician can do it.” You would need to find a
collaborator who understands the characteristics of biological data as well as
the challenges of working with it.
We believe that understanding and performing simple statistical tests like
pairwise comparisons, making sound and reliable statistical decisions are well
within anyone’s reach and in this book, we will provide you with alternative
ways for doing it.
For more sophisticated data modeling, for example, time course analysis or
comparing several samples at once, you would need a statistical collaborator,
or you will need to spend some time and effort understanding the statistics
behind it.
Statistics is not as complicated as it looks – so don’t be discouraged. There
is a wealth of information available on more advanced statistical modeling.
4.6 What is R?
Unlike most other approaches, where we install and run command line tools,
statistical analysis in general and the differential expression detection, in
particular, are typically performed using packages that run within the R
programming environment: The R Project for Statistical Computing3
Learning how to program in R is not so simple. Those who claim otherwise
are probably the lucky ones whose minds happen to fit R.
3 [Link]
You see, the R language was designed before computer scientists understood
how a programming language should work, what features it should have, and
what data types are useful. So don’t kick yourself if you can’t seem to quickly
learn R, it is without a doubt harder to learn than many other computational
languages. In addition, most people who write code in R are not well versed
in proper software engineering practice. As a result, typical R code is affected
by far more issues than code written in other domains of science.
That being said, it is thankfully far less complicated and quite manageable
to learn how to run tools written by others. As with other programming languages, an R
script is simply a list of commands instructing R to perform certain actions.
4.7 How do I install R?
While there are R versions packaged with conda we recommend that you
install R with a downloadable installer, see the R Home Page4 .
Once installed this way the program will be universally available on your
computer from the command line as well.
4.8 Can I also install R from command line?
Yes, but then you may end up with two versions of R installed. Both need
to be set up the same way. To install R with conda do:
conda install r
If you work in R, a good option (especially when learning) is to run it via
RStudio5 , a graphical user interface to R.
4 [Link]
5 [Link]
4.9 How should I run R?
We recommend running R via RStudio6 . It will provide you with an ex-
ploratory interface where you can edit and run your scripts.
Other alternatives such as Jupyter Notebooks7 or Google Colab8 are also
available.
4.10 What is Bioconductor?
Bioconductor [Link] is a project that collects R-based code for
genomic data analysis.
6 [Link]
7 [Link]
8 [Link]
If you have R installed separately, please visit the URL for each tool for the
installation instructions. The installation instructions are simple but may
change in time:
• [Link]
• [Link]
• [Link]
The installation commands will churn for a while and will install all the
packages required for the subsequent steps. To run a differential expression
study we need data and an R program that does the job.
If you installed R with conda you will need to install the following from
command line:
conda install -y bioconductor-deseq2 bioconductor-edger r-gplots
4.11 What does a p-value mean?
You probably don’t know what a p-value means especially when it comes to
interpreting biological data. Don’t sweat it. We have yet to meet someone
that does.
We have of course met overconfident people who think they understand p-
values and most of them are quite convinced that they got it right. In our
personal experience and observation, even statisticians giving talks on
p-values, and even statistical textbooks, routinely misuse the term.
Postulate: In our opinion nobody fully understands p-values and
their proper use in biology. There are only people that misunder-
stand and misuse the term less egregiously.
For every precise definition there is an expert opinion that can pull it apart
and prove that it does not always mean what it is supposed to.
• Give p a chance: significance testing is misunderstood9
• Scientists rise up against statistical significance10 in Nature, 2019
• Statisticians issue warning over misuse of P values11 in Nature, 2016
• …and so on…
Note how there is an ever-increasing number of publications describing the
misuse of p-values; publications that are akin to old men ranting, trying
to get the kids off their lawn.
I know what you’re thinking: I am going to ask a super-duper-expert
statistician. Well, here are excerpts from an email that one of our biologist
collaborators received when they asked one of the leading statistical experts
(with multiple Nature-level “rant” papers on the misuse of p-values) for
advice on choosing the “proper” RNA-Seq method.
We include it here as we found it quite representative of what you might
expect as a reply from a statistician:
[…] I need to caution you that high throughput data analysis is
somewhat fragile. Almost any change in the analysis such as
different options for the mappings or the normalization or the
optimization algorithm in the statistical routines will change the
gene list - sometimes dramatically. This is one reason that biolo-
gists use supplemental methods - from wet lab experiments to high
9 [Link]
10 [Link]
11 [Link]
throughput statistical methods such as clustering and gene set
analysis - to assist in the interpretation of the results.
Another idea is to use somewhat different options in the differen-
tial expression analysis and then take the genes in the intersection
or union of the significant genes from both analyses. Using the
intersection is likely to be quite conservative, selecting the genes
that have the highest signal to noise ratios, while the union is likely
to be anti-conservative, having a higher FDR than the adjusted
P-values or q-values suggest.
Omitting some low-expressing genes from the analysis has a very
simple effect on the significant gene list - it will get bigger, but
no significant genes should be dropped. The reason for this is that
the unadjusted P-values for the remaining genes usually do not
change, and the adjusted P-values are a simple function of the
unadjusted values and the number of tests, which decreases when
the number of tests is smaller. I say “usually do not change”
because the estimated dispersion is a function of all the genes,
and will change a bit, though usually not much, when some genes are
omitted from the analysis […]
[…]
A short summary of the above: P-values! Sixty percent of the time
they work every time.
A noteworthy feature of the letter is its evasive and non-committal nature.
The “expert statistician” goes out of their way to avoid taking any
responsibility for making decisions - all the responsibility is punted back to
the biologist, basically saying: do it many ways and see what works.
In addition, we can’t help but point out the recommendation that starts
with: Omitting some low-expressing genes from the analysis… To us, that
advice feels like nothing more than p-hacking and data dredging12, a misuse
of statistics where, after knowing a partial answer (the genes that are lowly
expressed), we tweak the input data (omit genes) for the sole purpose of
getting more results presented as statistically significant!
12 [Link]
Perhaps our explanation will feel like splitting hairs but it is not - it cuts to the
very essence of p-hacking. Filtering data before an analysis is an acceptable
practice. You may remove lowly expressed genes, or highly expressed genes,
genes that rhyme with fragilistic, or genes that from far away look like
flies. You may do whatever you wish to do as long as you have a scientifically
acceptable rationale for doing so and importantly you do it before looking at
the effects of said procedure. State and explain that rationale, then document
it - all fine.
What is not acceptable, however, is precisely the reasoning that the letter
above recommends: filtering data after you have seen the results, for the sole
purpose of making more results pass a threshold. That, in our opinion, is the
very definition of p-hacking13. Yet a famous statistician is the author of the advice
above. What gives?
Do you now see the origins of our postulate: Nobody fully understands
p-values and their proper use in biology? When under pressure to deliver
results, and when faced with the complexity and incredible messiness of
real biological data, statistics was, is and will be misused and misinterpreted
even by its very guardians, the “experts” who later chide us for not using
the concepts in the proper context. Gee, thanks (sarcasm)!
4.12 So how do I deal with p-values?
Your primary responsibility is to avoid outrageous, preposterous and
egregious errors. Use statistics to avoid radical mistakes instead of relying
on it to be the mechanism that leads you to truth and meaningful discoveries.
The bar seems awfully low, but don’t let that trick you into a false sense of
security.
A common misuse of a p-value is formulating a stronger statement than what
it was created for. Perhaps the most common misuse of a p-value is expressed
in the following way:
13 [Link]
• “our small p-values show that our results are not due to random
chance”
• “our small p-values show that the value increased two-fold”
As it turns out that is not at all what p-values indicate.
Another very common misuse of a p-value is to consider a smaller p-value to
be a stronger indication that an effect exists. Sorting by p-value is not the
right mechanism to put the “better” results first. We sort by p-value to have
“some” ordering in our data and to apply a cutoff more easily.
To avoid misleading yourself and others here is what we recommend on p-
values:
1. Think of the p-value as the probability of obtaining an effect
of the size that you observe due to random chance. Note how
the “size” itself is not a factor here; the change could be small
or large. The p-value does not capture the actual size of the
effect - just the chance of observing that particular size (or
larger).
2. Again remember the p-value does not care about how “big”
the effect is.
3. Think of p-values as selection cutoffs. Use them to reduce a
list for further study - not because 0.05 (or whatever the
cutoff) is good and 0.06 is not, but because you have to cut
the list somewhere. There is nothing inherently right about
0.05 - or any other cutoff. The cutoff is arbitrary - beggars
cannot be choosers - when you have noisy data with few
results, you’ll have to be more generous with the cutoff. But
with noisy data expect the burden of proof to be higher; you
will need additional evidence to back up your findings.
4. Do NOT use p-values as an indicator for more reliable or
less reliable results nor as indicators of stronger or weaker
effects.
But then, according to my own postulate, the sentences above most certainly
contain inconsistencies and allow for invalid interpretations of p-values. My
apologies.
4.13 Do I need to compute and discuss p-values?
Yes, I use them because there is no better alternative.
Theorem: The most cynical interpretation of p-values is that they serve as
a mechanism to filter out studies created by people who were so clueless
with respect to large scale analysis that they couldn’t even produce small
p-values. From that point of view p-values do correlate with the quality and
validity of the research.
Corollary: An expert scientist can (unwittingly) publish a Nature publication
with tiny p-values and impressive, compelling visualizations, even when
the underlying data is not different from random noise. For example:
• Genomic organization of human transcription initiation complexes, Nature, 2013.
Chapter 5
Computer setup
The book provides a number of R scripts that facilitate the analysis of RNA-
Seq data. For each project you should download a separate copy of all the
scripts, then edit and customize them to suit your needs.
# Obtain the biostar handbook rnaseq scripts.
curl -O [Link]
# Unpack the code.
tar -xzvf [Link]
A new directory called code will be created that contains the scripts.
5.1 Install and use RStudio
We recommend using R within RStudio, and we dedicated a separate section
to the topic; see RStudio for RNA-Seq.
But you can also use R within your own terminal. For fully automated
pipelines you would need to set up the same environments both in RStudio
and on the computers where the analyses take place.
While you are working out the process, RStudio is a superior choice. But
once the work is complete, ensure that your scripts are fully customized and
ready to repeat the entire analysis with minimal external interaction from you.
5.2 Running our scripts at the command line
You may also run the scripts at the command line, in which case you would
first need to install the required packages into your environment:
mamba install bioconductor-tximport bioconductor-biomart bioconductor-edger bioconducto
At the time of writing this section, I was able to install all these additional
packages into the same bioinfo environment. But that may change in the
future.
Make a note of which libraries would be downgraded to a lower version, and
DO NOT proceed if there are incompatibilities! Instead, make a new
environment as described in the next section.
If the command above worked you are all set, and you can move to the next
section.
5.3 Making a new environment
Sometimes the statistical packages clash with other existing tools. In that
case, to play it safe, it is best to create a new environment.
Creating a new environment makes your workflow a bit more annoying: you will
have to switch to stats to do statistics, then back to bioinfo to run another
tool. In that case, the best strategy is to work in two different terminals.
In general, when you do your own analyses, you should build a minimal envi-
ronment that contains only the tools you need. That way there is less chance
of creating incompatibilities, and you will not need to switch environments.
5.4 How to prepare a new environment
If you followed the book, your environment has been packed with a large
number of packages. Due to the many interactions and dependencies between
packages, it is occasionally advisable to create a separate environment for the
statistical tools alone.
Below we will create a new environment called stats and install the RNA-
Seq specific packages into that environment.
# Create a new environment for statistics.
conda create --name stats python=3.8 -y
# Activate the stats environment.
conda activate stats
# Install the statistical packages for rna-seq.
mamba install bioconductor-edger bioconductor-deseq2 r-gplots -y
Note: When running your statistics packages remember to activate the
stats environment.
5.5 Switching environments in scripts
Unfortunately there is a catch if you want to switch environments inside
a bash script. The following common sense approach will not work (even
though it should):
#
# This example DOES NOT WORK!
#
# This is inside the script
echo "Hello World"
conda activate stats
echo "Can I activate the stats environment like so?"
Alas the answer is no. The code above will produce the following error:
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'
To initialize your shell, run
$ conda init <SHELL_NAME>
When we run a script, bash runs it in a new “child” sub-shell that does
not know about conda activation. We need to initialize conda within the
script so that the conda commands become visible there; an additional line
of code must be included for the conda environment to operate correctly.
Unfortunately, switching environments does not play well with the set
-uex flags that we generally like to have turned on. Those need
to be turned off, then back on again. What a chore! … So your environment-
switching script will need to look like this:
# This is inside the script
echo "Hello World"
# Must include the following initialization command in the script.
source ~/miniconda3/etc/profile.d/[Link]
# Turn off error checking and tracing.
set +uex
conda activate bioinfo
echo "This now runs in bioinfo"
conda activate stats
echo "This now runs in stats"
# Turn back on the error checking and tracing.
set -uex
Chapter 6
Using RStudio for RNA-Seq
We recommend the use of RStudio to run our scripts.
6.1 Running our scripts in RStudio
RNA-Seq analyses typically require a number of additional packages to be
installed. Install the requirements in an RStudio session:
# Install the Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Target packages
pkgs = c("DESeq2", "edgeR", "gplots", "tximport", "biomaRt")
# Install the packages
BiocManager::install(pkgs)
Note: The working directory of your RStudio session needs to be
the directory that contains your data. You will need to set your working
directory to the appropriate location.
6.2 Tips to become more productive
You may want to start with our scripts then edit them to reflect data naming
that matches your needs.
Always create independent copies of the scripts for each project separately.
Don’t attempt to make an “uber” script that works for multiple projects.
We found that each project is sufficiently different that the scripts would
eventually diverge.
Add all these scripts to a GitHub-based repository, and commit and push
them regularly. This way you have a “free backup”
of your scripts and a track record of what you have done.
Part I
RNA-SEQ STEP-BY-STEP
Chapter 7
1. Introducing the Golden
Snidget
A “gold standard” dataset in biology is one with known properties. By repro-
ducing those known properties during analysis, we validate our methods and
demonstrate our skill in adequately applying the techniques.
In this book section, we will demonstrate several RNA-Seq data analysis
methods through the lens of “gold standard” data. The data we will analyze
was created as a so-called spike-in control, where known abundances of mRNA
were added to mRNA extracted from a cell; then, scientists sequenced both
the artificial spike-in and the real data on an Illumina sequencer. The study
was originally published as
• Informatics for RNA-seq: A web resource for analysis on the cloud,
published in PLoS Computational Biology (2015) by Malachi Griffith
et al.
While working through the data, I found the analysis and the data validation
challenging to understand or follow.
In a moment of inspiration, I imagined recasting the data above to represent
a more exciting scenario. I have “created” a hypothetical organism that
corresponds to the control data. I hope this organism, the Golden Snidget,
a magical golden bird with fully rotational wings, will help you understand
how RNA-Seq data analysis works, from its strengths to its limitations.
Once you understand how the Golden Snidget regulation manifests
itself in the RNA-Seq data, and once you can also deduce that regu-
latory behavior from the data, you can consider yourself an advanced
RNA-Seq analyst.
The Golden Snidget is not a toy model! You will see that the same code
and approach used to analyze the Golden Snidget data can be used almost
unchanged in many other realistic data analysis scenarios.
7.1 What do we know about the Golden Snidget?
Partly mechanical and wholly fictitious, the Golden Snidget is the perfect
candidate for RNA-Seq data analysis.
What makes it so well-suited? Two peculiar features. First, the Golden
Snidget can only be in two moods:
1. BORED when it flies nice and straight.
2. EXCITED when it constantly changes direction.
Additionally, the Golden Snidget is so magically regulated that, in each
state, we know precisely how many transcripts are expressed. The organism
is so perfectly controlled that the genes could even be named in a way that
describes their changes across both states. For example, a gene might be called:
ABC-1000-UP-4
The gene was named as such because in the BORED state, at every moment,
each cell has 1000 copies of the transcript corresponding to gene ABC. More-
over, the name also conveys that in the EXCITED state, this same gene is
up-regulated 4-fold compared to the BORED state. We call the state where
the mRNA concentrations do not change (the synthesis and the degradation
rates are identical) “steady-state mRNA”. In the Golden Snidget, all
transcripts are present at steady state.
Another way to explain the naming scheme is that in the BORED state, there
will be 1000 copies of the ABC transcript, whereas in the EXCITED state, each
cell will contain 4000 copies of the ABC transcript.
Thus, by looking at a gene name, we already know how much of it is inside the
cells. From the gene names, we can create a quantification matrix for the real
transcript abundance inside the cell. Our data would look like this:
name BORED EXCITED foldChange
ABC-1000-UP-4 1000 4000 4
ABD-40-DOWN-2 40 20 2
ABE-90-SAME-1 90 90 1
...
Across the two states, each gene can only be UP-regulated, DOWN-regulated, or
stay the SAME. Note how the quantification matrix above represents what
“really” happens in the cell.
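The naming scheme can be checked mechanically. Below is a minimal awk sketch (the gene names are taken from the table above) that recomputes the expected copy numbers in both states from the names alone:

```shell
# Split gene names like ABC-1000-UP-4 on "-" and recompute the
# expected transcript copy number in each state.
printf 'ABC-1000-UP-4\nABD-40-DOWN-2\nABE-90-SAME-1\n' | \
awk -F'-' '{
    bored = $2                               # baseline copies in BORED
    if      ($3 == "UP")   excited = $2 * $4
    else if ($3 == "DOWN") excited = $2 / $4
    else                   excited = $2      # SAME: no change
    print $1, bored, excited
}'
```

The output reproduces the BORED and EXCITED columns of the matrix above.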
The whole point of the RNA-Seq analysis is to find this count matrix. As it
happens, for the Golden Snidget we already know what the count matrix
ought to be. It remains to be seen how well the results from the RNA-Seq data
can reproduce the expected numbers or ratios. The unique gene naming
convention will allow you to spot-check any intermediate result and evaluate
how well the analysis is going before you get to the final results.
7.2 What is the objective of the entire Golden Snidget section?
Your job will be to demonstrate how well and under what circumstances you
can reproduce the known regulatory behavior when starting with RNA-Seq
data.
As you study the Golden Snidget and unlock its secrets, you will understand
the bigger picture of how to think about RNA-Seq in general.
Your first task is to understand the genome that you wish to quantify.
Chapter 8
2. Understand your reference
The first task of RNA-Seq data analysis is to select the “reference” that
your study will use to quantify the gene expression. As we described in the
previous chapter, there are two analysis paradigms:
1. Quantify against a genome
2. Classify against a transcriptome
As you recall, we also recommended that you do both.
Your first step will be to identify and obtain the reference files for your study.
Typically, reference files are distributed from dedicated data distribution
sites. For the Golden Snidget, as far as we know, there is only one source of
information; the reference is available from:
• [Link]
8.1 What is my very first step?
Understand your reference data! What properties does the “reference” have?
1. How big is the reference?
2. How many chromosomes does it have?
3. What are these chromosomes called? How long is each?
4. How many features are annotated?
5. What types of features are listed? …
Understanding your genome is the first step in the confidence-building
process that moves you from being a mere passenger, tagging along, to
becoming the driver and director of the process.
8.2 What is the Golden Snidget’s genome like?
First, do yourself a favor, create a subdirectory, and work in that.
We have seen countless well-intentioned individuals struggle to
make sense of their methods because their home directories were
overrun with hundreds of files of various origins. Make a new
separate directory for every new project!
We recommend creating one main work directory (might as well call it
work), then creating a new subdirectory for every subproject or task. For
example: work/foo, work/bar, work/golden, like so:
# Activate your environment
conda activate bioinfo
# Make two directories: work then inside that make golden
mkdir -p work/golden
# Switch to the directory
cd work/golden
We assume that all commands are run in the work/golden directory you
have just created and that the bioinfo environment is activated. Now let’s
get the reference data and investigate it.
# Download the reference genome.
wget -nc [Link]
# Unpack the reference genome.
tar xzvf [Link]
At this point, you will see that a refs folder has been created from the
downloaded data. List its contents:
$ ls -l refs/
The command prints:
-rw-r--r-- 1 ialbert staff 12K Feb 5 12:51 [Link]
-rw-r--r-- 1 ialbert staff 127K Feb 5 12:51 [Link]
-rw-r--r-- 1 ialbert staff 83K Jan 24 14:06 [Link]
We can see that we have a genome, features, and transcripts. For other
organisms, the naming of the files might not be as clear-cut.
Since we have both genome and transcriptome information here, we can
perform both kinds of RNA-Seq studies: genome- and transcriptome-oriented
ones. For other organisms, we may be restricted to only one type of reference.
8.3 How many sequences are in our reference files?
Evaluate the FASTA files:
seqkit stats refs/*.fa
will print:
file format type num_seqs sum_len min_len avg_len max_len
refs/[Link] FASTA DNA 1 128,756 128,756 128,756 128,756
refs/[Link] FASTA DNA 92 82,756 273 899.5 2,022
The genome has a single chromosome of size 128,756 bp. There are 92
transcripts. The shortest transcript is 273 bp; the longest is 2,022 bp. The
total transcript length is 82,756 bp (92 transcripts averaging 899.5 bp).
8.4 What does the genome file look like?
cat refs/[Link] | head -5
prints:
>Golden full genome, version 2020, Yo Ho Ho!
GCATTTTGAAAATTCTATGGAAGAGCTAGCATCTCTGACGAAAACAGCAGACGGAAAAGTACTGACCAGCGTCACACA
AACGGAACAGGGCTGACGCCGCTACATATATAGGAAAAGGGAAGGTAGAAGAGCTGAAGGCACTCGTGGAAGAGCTTG
GCTGATCTCCTCATCTTTAATGATGAACTGTCGCCAAGTCAGCTGAAGTCATTGGCAACAGCAATTGAAGTGAAGATG
TGACCGCACGCAATTGATATTAGATATTTTTGCAAAGCGGGCGAGAACGAGAGAAGGCAAACTTCAAATTGAGCTGGC
The genome is a FASTA file; the chromosomal sequence is named Golden.
8.5 What is the annotation file?
cat refs/[Link] | head -5
prints:
##gff-version 3
Golden ERCC exon 1 1060 0 + . gene_name=AAA-750000-UP-4; gene_id=AAA-750000
Golden ERCC exon 1560 2083 0 + . gene_name=ABA-187500-UP-4; gene_id=ABA-18
Golden ERCC exon 2583 3616 0 + . gene_name=ACA-46875-UP-4; gene_id=ACA-468
Golden ERCC exon 4116 5138 0 + . gene_name=ADA-23438-UP-4; gene_id=ADA-234
How many exons are there? (This is an approximate count.)
cat refs/[Link] | grep exon | wc -l
prints:
92
We note that the chromosome name in the feature file has the same value,
Golden, as in the genome, so the two names will match. Mismatching chro-
mosome names, for example if it were called golden instead, can be very
annoying to deal with.
8.6 What do the transcripts look like?
cat refs/[Link] | head -5
prints:
>AAA-750000-UP-4
GCATTTTGAAAATTCTATGGAAGAGCTAGCATCTCTGACGAAAACAGCAGACGGAAAAGTACTGACCAGCGTCACACA
AACGGAACAGGGCTGACGCCGCTACATATATAGGAAAAGGGAAGGTAGAAGAGCTGAAGGCACTCGTGGAAGAGCTTG
GCTGATCTCCTCATCTTTAATGATGAACTGTCGCCAAGTCAGCTGAAGTCATTGGCAACAGCAATTGAAGTGAAGATGAT
TGACCGCACGCAATTGATATTAGATATTTTTGCAAAGCGGGCGAGAACGAGAGAAGGCAAACTTCAAATTGAGCTGGCTC
We can see the naming scheme. Print out more gene names:
cat refs/[Link] | grep ">" | head
prints:
>AAA-750000-UP-4
>ABA-187500-UP-4
>ACA-46875-UP-4
>ADA-23438-UP-4
>AEA-11719-UP-4
>AFA-5859-UP-4
>AGA-2930-UP-4
>AHA-2930-UP-4
>AIA-1465-UP-4
>AJA-732-UP-4
From here, we can see that in this genome, gene AAA is expressed at 750,000
copies in the BORED state, whereas gene AJA is expressed at 732 copies. At
any time, there will be about 1000x more transcripts for AAA than for
AJA. Make a mental note of that.
Note how the AAA and AJA genes change by the same amount: fourfold! The
magnitude of the change (the ratio) across the states will be the same for
AAA and AJA. Yet the absolute concentration of each transcript will be
massively different.
It makes sense that it will be more challenging to accurately quantify AJA
than AAA. There may be a limit under which we cannot detect changes; per-
haps we can’t discover anything at levels of AFA and below.
Of course, we would not know beforehand which transcript might be lowly
expressed for any other organism. But we need to approach the analysis
with the expectation that we will only be able to detect transcripts above a
certain expression abundance threshold.
8.7 Visualize your reference file
Load both the genome and annotations into IGV:
8.8 Optional step: align your transcripts against the reference
You may want to validate your transcripts, especially if you are unsure of
their origin. Will they match the genome as you assume they would?
# The index to the genome
IDX=refs/[Link]
# Build the index
hisat2-build $IDX $IDX
# Create the transcript alignment BAM file.
hisat2 -x $IDX -f -U refs/[Link] | samtools sort > refs/[Link]
# Index the BAM file
samtools index refs/[Link]
Visualizing the generated alignment file makes our hearts overflow with joy:
the sequences do indeed correspond to the features in the GFF file.
And we have another aha moment: all transcripts for the Golden Snidget
are located on the forward strand. Figures, it’s the Golden Snidget after all.
8.9 What is the next step?
We have a good grasp of what our references contain. We’ll need to reach the
same confidence level regarding the RNA-Seq data. Let’s understand the
sequencing reads.
Chapter 9
3. Understand the data
The next step is to get a solid grasp of the properties and layout of your
sequencing reads. If the data is stored in SRA, you can use fastq-dump;
otherwise, your data will be stored in files distributed from various locations.
In our case, the data is located at:
• [Link]
9.1 What information do I need to know?
You need to know how many reads were measured, what the experimental
layout is, and how well the experiment worked. You may have
access to various sources of information that describe the experiment.
In general, we have found that documentation and even published supple-
mentary information often lack essential details and describe data incom-
pletely. As a rule, you can often only proceed by figuring things out yourself,
from the names of the files or from what the files contain. Thus, we will follow
that strategy:
# Download the data
wget -nc [Link]
# Unpack the data
tar zxvf [Link]
# List the content of the current directory.
ls -l
# List the contents of the reads directory
ls -l reads
on my system, it prints:
-rw-r--r-- 1 ialbert staff 26M Feb 5 11:56 BORED_1_R1.fq
-rw-r--r-- 1 ialbert staff 26M Feb 5 11:56 BORED_1_R2.fq
-rw-r--r-- 1 ialbert staff 32M Feb 5 11:56 BORED_2_R1.fq
-rw-r--r-- 1 ialbert staff 32M Feb 5 11:56 BORED_2_R2.fq
-rw-r--r-- 1 ialbert staff 29M Feb 5 11:56 BORED_3_R1.fq
-rw-r--r-- 1 ialbert staff 29M Feb 5 11:56 BORED_3_R2.fq
-rw-r--r-- 1 ialbert staff 55M Feb 5 11:56 EXCITED_1_R1.fq
-rw-r--r-- 1 ialbert staff 55M Feb 5 11:56 EXCITED_1_R2.fq
-rw-r--r-- 1 ialbert staff 37M Feb 5 11:56 EXCITED_2_R1.fq
-rw-r--r-- 1 ialbert staff 37M Feb 5 11:56 EXCITED_2_R2.fq
-rw-r--r-- 1 ialbert staff 46M Feb 5 11:56 EXCITED_3_R1.fq
-rw-r--r-- 1 ialbert staff 46M Feb 5 11:56 EXCITED_3_R2.fq
Run statistics on the reads:
seqkit stats reads/*.fq
The command prints:
file format type num_seqs sum_len min_len avg_len max_len
reads/BORED_1_R1.fq FASTQ DNA 112,193 11,219,300 100 100 100
reads/BORED_1_R2.fq FASTQ DNA 112,193 11,219,300 100 100 100
reads/BORED_2_R1.fq FASTQ DNA 137,581 13,758,100 100 100 100
reads/BORED_2_R2.fq FASTQ DNA 137,581 13,758,100 100 100 100
reads/BORED_3_R1.fq FASTQ DNA 123,093 12,309,300 100 100 100
reads/BORED_3_R2.fq FASTQ DNA 123,093 12,309,300 100 100 100
reads/EXCITED_1_R1.fq FASTQ DNA 237,018 23,701,800 100 100 100
reads/EXCITED_1_R2.fq FASTQ DNA 237,018 23,701,800 100 100 100
reads/EXCITED_2_R1.fq FASTQ DNA 158,009 15,800,900 100 100 100
reads/EXCITED_2_R2.fq FASTQ DNA 158,009 15,800,900 100 100 100
reads/EXCITED_3_R1.fq FASTQ DNA 196,673 19,667,300 100 100 100
reads/EXCITED_3_R2.fq FASTQ DNA 196,673 19,667,300 100 100 100
9.2 What is the summary so far?
1. The experiment captures data from two states: BORED and EXCITED.
2. Three replicates were measured for each state: 1, 2, 3.
3. The sequencing run is paired-end, designated by R1 and R2. The read
length is 100 bp.
4. The measurements are distributed somewhat unevenly over the samples.
5. Twice as much data was collected for sample EXCITED_1 as for
BORED_1.
9.3 What is the experimental design file?
Investigating the data, we identified the “root” words from the file naming
conventions above and built our [Link] matrix as:
sample,condition
BORED_1, bored
BORED_2, bored
BORED_3, bored
EXCITED_1, excited
EXCITED_2, excited
EXCITED_3, excited
The design file above will govern all our downstream analyses from now on.
From these names, we can generate all input and output file names in an
automated fashion.
You may add additional columns; the only requirement is that the sample and
condition columns are present.
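The design file itself can be derived from the sample names. A minimal sketch, assuming the CONDITION_REPLICATE naming convention seen in the listing above (the design.csv output name is our choice, not the book’s):

```shell
# Build a design file from sample names of the form CONDITION_REPLICATE.
echo "sample,condition" > design.csv
printf 'BORED_1\nBORED_2\nBORED_3\nEXCITED_1\nEXCITED_2\nEXCITED_3\n' | \
awk '{
    cond = tolower($1)           # BORED_1 -> bored_1
    sub(/_[0-9]+$/, "", cond)    # bored_1 -> bored
    print $1 "," cond
}' >> design.csv

cat design.csv
```

Deriving the file programmatically avoids typos that creep in when the design is typed by hand.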
To simplify further commands, you can also generate just the ids for the
samples, like so:
cat [Link] | csvcut -c sample | sed 1d > [Link]
where [Link] then contains only:
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
9.4 Does the order above matter?
When we make pairwise comparisons, as a convention, we compare the second
category to the first. Our differential expressions will be expressed as the
second condition compared to the first condition. While you can, of course,
change the order later, it is best if your initial listing already captures the
desired order.
In this project, we want to find the changes in the excited state relative to
the bored state. We want the fold change to be expressed as excited/bored;
thus, we list bored first and excited second.
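As a sanity check of this convention, consider a hypothetical gene with 1000 copies in the bored state and 4000 in the excited state (the counts are illustrative, not from the data):

```shell
# Hypothetical counts for one gene in the two states.
bored=1000
excited=4000

# By the convention above, fold change = second condition / first condition.
awk -v b="$bored" -v e="$excited" 'BEGIN { print e / b }'
```

Listing bored first yields a fold change of 4, i.e. up-regulated in the excited state; reversing the order would report 0.25 for the very same gene.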
Note: It is extremely important to always understand what gets compared
to what.
Chapter 10
4. Alignment based RNA-Seq
To perform alignment-based RNA-Seq, we need to use a so-called splice-aware
aligner, designed to map the same read across exon-exon junctions.
Even though the Golden Snidget has no spliced transcripts, we should get
into the habit of using the appropriate tools.
10.1 Index your reference genome
When aligning against a reference genome, we first need to prepare (index)
that genome so that our software can read it efficiently. This indexing needs
to be done only once; afterwards, the index can be reused when
aligning against the same organism. We even recommend placing the index
outside of the project folder; for clarity, however, we will keep it local:
# The reference genome.
IDX=refs/[Link]
# Build the genome index.
hisat2-build $IDX $IDX
# Index the reference genome with samtools.
samtools faidx $IDX
10.2 Generate the alignments
A single alignment with hisat2 would be generated with:
hisat2 -x refs/[Link] -1 reads/BORED_1_R1.fq -2 reads/BORED_1_R2.fq | head
we can see that it is a SAM file that gets produced:
@HD VN:1.0 SO:unsorted
@SQ SN:Golden LN:128756
@PG ID:hisat2 PN:hisat2 VN:2.2.1 CL:"/home/ialbert/miniconda3/envs/bioi
HWI-ST718_14696[Link] 83 Golden 33031 60 100M = 32827 -
HWI-ST718_14696[Link] 163 Golden 32827 60 100M = 33031
HWI-ST718_14696[Link] 83 Golden 33653 60 100M = 33534 -
To store this output in alignment files for each of the BORED samples, we’ll
need to run:
hisat2 -x refs/[Link] -1 reads/BORED_1_R1.fq -2 reads/BORED_1_R2.fq | samtools sort >
hisat2 -x refs/[Link] -1 reads/BORED_2_R1.fq -2 reads/BORED_2_R2.fq | samtools sort >
hisat2 -x refs/[Link] -1 reads/BORED_3_R1.fq -2 reads/BORED_3_R2.fq | samtools sort >
and for the EXCITED samples, we need to run:
hisat2 -x refs/[Link] -1 reads/EXCITED_1_R1.fq -2 reads/EXCITED_1_R2.fq | samtools so
hisat2 -x refs/[Link] -1 reads/EXCITED_2_R1.fq -2 reads/EXCITED_2_R2.fq | samtools so
hisat2 -x refs/[Link] -1 reads/EXCITED_3_R1.fq -2 reads/EXCITED_3_R2.fq | samtools so
10.3 How to automate the alignments
We could write out all our commands by hand as above, but you can appreci-
ate how error-prone and non-reusable that process is. We’d rather automate
it. Recall that our [Link] file contains:
sample,condition
BORED_1, bored
BORED_2, bored
BORED_3, bored
EXCITED_1, excited
EXCITED_2, excited
EXCITED_3, excited
from which we created an [Link] file that contains the first column with
no header:
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
now our commands become:
# The index name.
IDX=refs/[Link]
# Create the BAM folder.
mkdir -p bam
# Align the FASTQ files to the reference genome.
cat [Link] | parallel "hisat2 -x $IDX -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"
# Index each BAM file.
cat [Link] | parallel "samtools index bam/{}.bam"
See the Art of Bioinformatics Scripting for an introduction to using GNU
Parallel to automate processes. Moving on: the alignments have been created;
we can see all the resulting files with:
ls -l bam
it will print:
-rw-r--r-- 1 ialbert staff 12M Feb 5 13:06 BORED_1.bam
-rw-r--r-- 1 ialbert staff 352B Feb 5 13:06 BORED_1.[Link]
-rw-r--r-- 1 ialbert staff 75M Feb 5 13:06 BORED_1.sam
-rw-r--r-- 1 ialbert staff 16M Feb 5 13:06 BORED_2.bam
-rw-r--r-- 1 ialbert staff 352B Feb 5 13:06 BORED_2.[Link]
-rw-r--r-- 1 ialbert staff 92M Feb 5 13:06 BORED_2.sam
...
10.4 Visualize the alignments
Let’s visualize the file in IGV:
10.5 Create coverage data
Now, if you load the data, you will notice a problem. Even if you load only
a single BAM file, it takes quite a bit of time for it to appear in IGV.
Moreover, the same loading process needs to take place after each zooming
or panning operation.
It quickly becomes incredibly tedious to use, frankly an embarrassment for
the Broad Institute, the maintainers of the software. Really? Is this how we
are curing cancer? Is this the best tool that hundreds of millions of dollars
in public and private funding are able to provide? A genome visualizer that
gets bogged down when presented with the Golden Snidget genome?
Of course, there is a workaround. There always is. Sometimes, I feel bioinfor-
matics itself is the journey of working around problems that shouldn’t exist
in the first place.
Since what we primarily care about is the “coverage” rather than the individual
alignments, we can turn the BAM file into a so-called bigWig file. Ah yes, I
wish I were joking.
10.6 Making a bigWig coverage file
The bigWig format is the brainchild of the celebrated bioinformatician Dr. Jim
Kent of Human Genome Project fame. Among his many notable accomplish-
ments, Dr. Kent is also the creator of the UCSC Genome Browser, a tool of
high utility.
At the same time, he is also the “inventor” of data formats such as bed,
bigBed, and bigWig: formats that are little more than ad-hoc workarounds,
band-aid solutions to much larger, more fundamental, and still unaddressed
problems of biological data representation.
Silly or not, here we go: bigWig is what we must use to work around the
limitations of IGV. Oh well, let’s make a bigWig then… But first, install the
bedGraphToBigWig converter if you don’t already have it:
conda install -y ucsc-bedgraphtobigwig
You see, you can’t just make a bigWig; that would be too easy. First, you
turn your BAM files into bedGraph; then you take each bedGraph and turn it
into a bigWig. Sigh… Thankfully, we have it all automated. The following
steps will be necessary:
# Turn each BAM file into bedGraph coverage. The files will have the .bg extension.
cat [Link] | parallel "bedtools genomecov -ibam bam/{}.bam -split -bg > bam/{}.bg"
# Convert each bedGraph coverage into bigWig coverage. The files will have the .bw extension.
cat [Link] | parallel "bedGraphToBigWig bam/{}.bg ${IDX}.fai bam/{}.bw"
The resulting bedGraph and bigWig files will have the *.bg and *.bw extensions
and are placed in the bam directory.
You can drag your bigWig files into the IGV panels, and the coverage informa-
tion will load up much, much faster. Below, I have loaded all samples, re-sized
the tracks, and colored them by sample. I have also turned on logarithmic
and automatic scaling. The resulting browser track is quite informative:
Note that once you make a useful visualization with IGV, you can save the
IGV session into a file that can be reloaded later.
10.7 What can you learn from the visualization?
Try to answer the following questions from the visualization:
1. What are the minimal gene expression levels that can be detected at
all?
2. What are the minimal gene expression levels that could be used to
detect a four-fold change?
3. Look at the gene names. Does the data seem to sort of support the
expected gene regulation?
4. What else can you read off the tracks?
10.8 What is the next step?
Now that we have aligned files, we can proceed to counting the alignments
that overlap with certain features, as described in Feature counting in
RNA-Seq.
Chapter 11
5. Feature counting in RNA-Seq
This chapter analyzes the BAM files produced by the RNA-Seq pipeline
described in 4. Alignment based RNA-Seq.
11.1 What is our quantification matrix?
An RNA-Seq method aims to produce a quantification matrix that assigns
a value to each genomic feature. In alignment-based RNA-Seq, the process
requires intersecting the BAM alignment file with the intervals listed in an
annotation file.
The intersection process, often called “feature counting”, produces a count
of how many alignments overlap with each feature in the file. For example,
take GENE A as pictured below, and assume that each dashed line is an
alignment. How many alignments overlap with GENE A?
------- ------- -------
------ --------- ---------
------- ---------
------- --------- ---------
|<--- GENE A --->|
As you can probably tell, the answer depends on what the word “overlap”
means.
• Will any amount of overlap suffice? Then the 11 alignments overlap
with the feature.
• Should the alignment be entirely inside the feature? Then the count is
only 4.
• Should at least 50% of the alignment cover the feature? Now the count
is 8.
There is no universally correct way to count; different strategies might be
more appropriate depending on the situation. The default setting will
consider any amount of overlap to increase the counts.
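The three counting rules can be sketched with a few hardcoded intervals; the gene span and alignment coordinates below are hypothetical, not the ones in the drawing above:

```shell
# A gene spanning positions 100-200 and four alignments (start end).
# Count overlaps under three rules: any overlap, fully contained,
# and at least 50% of the alignment covering the feature.
printf '90 110\n120 180\n195 260\n10 50\n' | \
awk -v gs=100 -v ge=200 '{
    len = $2 - $1                                         # alignment length
    ov  = (($2 < ge) ? $2 : ge) - (($1 > gs) ? $1 : gs)   # overlap length
    if (ov > 0)               any++
    if ($1 >= gs && $2 <= ge) inside++
    if (ov >= len / 2)        half++
}
END { print any+0, inside+0, half+0 }'
```

This prints 3 1 2: three alignments touch the gene, one lies fully inside it, and two overlap it by at least half their length. Each rule yields a different count from the same data.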
11.2 How to count features?
For processing BAM files, we recommend using the tool featureCounts, though
many scientists seem to swear by htseq-count. A typical invocation of
featureCounts would be:
# Run the featureCounts program to summarize the number of reads that overlap with features.
featureCounts -p -a refs/[Link] -o [Link] bam/BORED_1.bam
# Show the first five lines of the [Link] file.
cat [Link] | head -5
It will produce the counts file with the following content:
# Program:featureCounts v1.6.4; Command:"featureCounts" "-p" "-a" "refs/[Link]" "-
Geneid Chr Start End Strand Length bam/BORED_1.bam
AAA-750000-UP-4 Golden 1 1060 + 1060 8103
ABA-187500-UP-4 Golden 1560 2083 + 524 979
ACA-46875-UP-4 Golden 2583 3616 + 1034 505
A nice feature of featureCounts is that we can list multiple BAM files:
featureCounts -p -a refs/[Link] -o [Link] bam/BORED_1.bam bam/BORED_2.bam bam/
We always try to avoid pattern matching, as it sometimes leads to subtle and
hard-to-catch errors. Recall that our [Link] file contains the following:
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
and now, we will pipe the above into the xargs tool, which adds all the above
parameters to the end of the command in a single line:
cat [Link] | parallel -j 1 echo "bam/{}.bam" | \
xargs featureCounts -p -a refs/[Link] -o [Link]
Tip: in this case, the -j 1 flag is essential as we want to ensure that the
commands are executed serially as listed in the file. Whenever the order is
critical, you must use -j 1; otherwise, the order is not guaranteed.
One slightly confusing feature of the default behavior of xargs is that it does
not visibly show where the incoming parameters will be inserted. As always, the
simplest way to troubleshoot automation code is to add an echo to the
beginning of the command to have it show what it will do:
cat [Link] | parallel -j 1 echo "bam/{}.bam" | \
xargs echo featureCounts -p -a refs/[Link] -o [Link]
It now prints the command instead of executing it:
featureCounts -p -a refs/[Link] -o [Link] \
bam/BORED_1.bam bam/BORED_2.bam bam/BORED_3.bam \
bam/EXCITED_1.bam bam/EXCITED_2.bam bam/EXCITED_3.bam
Running the command (without the echo inserted) will create the text
file [Link] that, when opened in Excel, will contain:
11.3 Standardizing the count matrix
For our tools to work, we need to standardize the count matrix so that
different methods may be interchangeable. With that, we step into the
domain of data munging and transformations. If you have downloaded
the R scripts provided in the RNA-Seq Computer setup, you can use the
parse_featurecounts.r script to transform your tab-delimited count data into
comma-separated count data with a more straightforward structure.
Note that many file names and paths are hard coded in the parse_featurecounts.r
script. We wrote the script to run for the current example. You need to
modify the file names in the script to run it in a different scenario.
You can use the script either as a command line script or in RStudio:
Rscript code/parse_featurecounts.r
Running the script will print:
[1] "# Tool: parse_featurecounts.r"
[1] "# Design: [Link]"
[1] "# Input: [Link]"
[1] "# Output: [Link]"
The script creates a simpler CSV file that you can use for further analysis.
The [Link] file may be considered the primary output of the alignment-based
quantification method. We may run different statistical methods on
this matrix to determine which rows show changes.
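The core of what the script does, dropping the annotation columns and keeping the identifier plus the counts, can be sketched in Python. The function name and the example data are made up; use the provided R script for real work:

```python
# Sketch of the core transformation that parse_featurecounts.r performs:
# featureCounts output -> a simpler CSV count matrix. It keeps the Geneid
# column plus the per-sample counts and drops the Chr/Start/End/Strand/Length
# annotation columns. This is an illustration, not the actual script.
import csv, io

def simplify_counts(featurecounts_text):
    """Return CSV text with only Geneid and the count columns."""
    lines = [l for l in featurecounts_text.splitlines() if not l.startswith("#")]
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(lines, delimiter="\t"):
        # Column 0 is Geneid; columns 6 onward are the per-sample counts.
        writer.writerow([row[0]] + row[6:])
    return out.getvalue()

example = "\n".join([
    "# Program:featureCounts v1.6.4",
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tbam/BORED_1.bam",
    "AAA-750000-UP-4\tGolden\t1\t1060\t+\t1060\t8103",
])
print(simplify_counts(example))
```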
11.4 There is even more to counting
When we run featureCounts with default parameters, numerous decisions
are made on our behalf:
1. Which types of features are used to count overlaps (default=exons)
2. How the counted features are grouped into a single unit (default=gene_name
attribute)
3. How much overlap is required to consider an alignment overlapping
with a feature (default=1)
4. Should individual read alignments or read pairs be counted (default=no).
Note: setting both -p and --countReadPairs will count read pairs
5. If the data is strand-specific, the library type needs to be selected
Depending on the use case, you may need to override one or more of these
parameters. See all possible options by invoking the -h flag:
featureCounts -h
If you do override any parameters, make a note of that in your script, like
so:
# Count read pairs that overlap on the same strand.
featureCounts -p --countReadPairs -s 1 ...
See also the featureCounts documentation page for more details.
11.5 Library stranded-ness
When using strand-specific RNA-Seq, the library preparation type needs to
match the strandedness of counting:
-s <int or string> Perform strand-specific read counting. A single integer
value (applied to all input files) or a string of comma-
separated values (applied to each corresponding input
file) should be provided. Possible values include:
0 (unstranded), 1 (stranded) and 2 (reversely stranded).
Default value is 0 (ie. unstranded read counting carried
out for all input files).
11.6 What is the next step?
You can now either:
1. Learn how to compute the same quantification using classification-
based methods.
2. Continue to the next step on how to perform differential expression.
Chapter 12
6. Classification based
RNA-Seq
A classification-based RNA-Seq works via so-called “pseudo-alignments”.
Instead of using a genome as a reference, each read is classified as (assigned
to) a single transcript. Since some reads can be assigned to multiple
transcripts, an internal redistribution algorithm is also run that attempts
to redistribute the reads across similar regions of transcripts in a way
that satisfies the other observed constraints.
12.1 What are the advantages of classification?
Since the methods operate on the sequences for the transcript, they can be
“smarter” when establishing accurate abundances. In an alignment-based
RNA-Seq, the two steps, alignment and counting, are independent and sep-
arate processes; the algorithms are unaware of one another.
In a classification-based method, detecting the “alignment” and the count
are the same step.
12.2 What are the disadvantages of classification?
The primary limitation of classification-based methods is that we must
present the classifier with matching targets. These targets should cover all
cases. If we miss one valid target, the reads that would match those could
get redistributed over all other transcripts - potentially negatively impacting
the accuracy of the results.
12.3 What is the weakest link in classification?
The difficulty in understanding why a specific transcript is reported as
differentially expressed is perhaps the greatest weakness of all classification
methods. They behave like black boxes. Say transcript T1, containing one
extra base compared to transcript T2, comes up differentially expressed.
How strong is the evidence for T1 being differentially expressed? The in-
ternal process and decision-making of the redistribution algorithm are not
made clear, nor do tools produce any evidence that will allow us to judge the
correctness of the process.
Most classification methods are “faith-based”; we blindly trust them (not a
good strategy, to say the least). In our opinion, all classifier-generated results
should be investigated with and supported by alignment-based validation.
12.4 How does the read redistribution work?
Imagine your transcriptome has two transcripts that look like so:
AAAAAAAAAAGGGGGGGGG
AAAAAAAAAATTTTTTTTT
Now imagine that:
• AAAAAAAAAAGGGGGGGGG is expressed with 100 copies but
• AAAAAAAAAATTTTTTTTT expressed with only 10.
During sequencing, these transcripts are broken into small fragments. After
sequencing, we might observe:
• 110 reads that cover the AAAAAAAAAA region (100+10)
• 100 reads have GGGGGGGGG in them,
• 10 reads cover TTTTTTTTT.
If we were to add up all reads that match each transcript, we would end up
with:
• 210 reads match AAAAAAAAAAGGGGGGGGG
• 120 reads match AAAAAAAAAATTTTTTTTT
The AAAAAAAAAA region is double counted; those reads match both transcripts.
From these observations and constraints, the classifier will attempt to
redistribute the 110 reads that match both transcripts in a way that is
compatible with the counts observed over the distinguishing regions, 100 and
10. Ideally, by the end of the redistribution, we get the correct 10:1 ratio
for the counts over the two transcripts.
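The redistribution idea can be sketched as a tiny expectation-maximization loop. This is a toy illustration only, not the actual salmon or kallisto algorithm:

```python
# Toy EM sketch of read redistribution (not the real salmon/kallisto code).
# 110 reads are ambiguous (they match both transcripts), 100 reads are
# unique to T1 (the GGGGGGGGG region) and 10 are unique to T2 (TTTTTTTTT).
shared, unique_t1, unique_t2 = 110, 100, 10

t1, t2 = 1.0, 1.0  # start with arbitrary equal abundances
for _ in range(50):
    # E-step: split the shared reads in proportion to current abundances.
    frac1 = t1 / (t1 + t2)
    # M-step: new abundance = unique reads + share of the ambiguous reads.
    t1 = unique_t1 + shared * frac1
    t2 = unique_t2 + shared * (1 - frac1)

print(round(t1), round(t2))
```

The 110 shared reads end up split roughly 100 to the first transcript and 10 to the second, recovering the 10:1 abundance ratio.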
Of course, in the simplified example above, there is only a single “correct”
redistribution that could match both cases. When the transcripts are more
complicated, when there are more similar variants that overlap in various
ways, it becomes a lot more challenging to figure out what the rationale for
a particular classification is.
12.5 What tools implement pseudo-alignment?
The first widely used tool was published as the software known as kallisto
in Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-
optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–
527 (2016). Since then, two widely used implementations exist:
• kallisto
• salmon
12.6 How to use kallisto and salmon?
We will use salmon in our examples as, over the years, it has been better
maintained, whereas kallisto appears to be somewhat abandoned, with its last
release in 2019.
# Create a new environment.
mamba create -y -n salmon
# Activate the new environment.
conda activate salmon
# Install the software.
mamba install salmon parallel
12.7 The design matrix
Recall that our [Link] file that we previously set up contains:
sample,condition
BORED_1, bored
BORED_2, bored
BORED_3, bored
EXCITED_1, excited
EXCITED_2, excited
EXCITED_3, excited
from which we also created an [Link] file that contains the first column
with no header:
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
12.8 Preparing the transcriptome
Prepare the transcriptome for classification:
# The transcriptome in fasta format.
REF=refs/[Link]
# The name of the salmon index
IDX=idx/[Link]
# Build the index with salmon.
salmon index -t ${REF} -i ${IDX}
Remember that salmon operates on individual transcript sequences, where
each sequence in the FASTA file corresponds to a single transcript. Contrast
that to how hisat2 works where each sequence in the FASTA corresponds
to a single chromosome.
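A quick way to convince yourself which kind of reference you have is to count the FASTA records: a transcriptome has many (typically short) sequences, a genome only a few long ones. A minimal sketch with a made-up, in-memory FASTA:

```python
# Sketch: count records and find the longest sequence in a FASTA file.
# A transcriptome FASTA has many short records; a genome has few long ones.

def fasta_stats(lines):
    """Return (number of records, length of the longest sequence)."""
    count, longest, current = 0, 0, 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            longest = max(longest, current)
            current = 0
        else:
            current += len(line.strip())
    return count, max(longest, current)

# Example with an in-memory FASTA; in real use, iterate over an open file.
fasta = [">tx1", "AAAA", "GGGG", ">tx2", "TTTTT"]
print(fasta_stats(fasta))
```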
12.9 Running the classification
Running the salmon classification is thankfully quite a straightforward task:
# Make a directory for salmon results.
mkdir -p salmon
# Run a salmon quantification.
salmon quant -i ${IDX} -l A --validateMappings -1 reads/BORED_1_R1.fq -2 reads/BOR
The above will create a directory called salmon/BORED_1, inside of which there
will be a file called [Link] that contains the counts.
cat salmon/BORED_1/[Link] | head | column -t
prints:
Name Length EffectiveLength TPM NumReads
AAA-750000-UP-4 1059 862.332 50512.074127 8101.000
ABA-187500-UP-4 523 324.332 16230.189480 979.000
ACA-46875-UP-4 1033 836.332 3240.282591 504.000
ADA-23438-UP-4 1022 822.332 1346.948810 206.000
AEA-11719-UP-4 1991 1794.332 782.111901 261.000
AFA-5859-UP-4 1124 925.332 389.321429 67.000
AGA-2930-UP-4 521 324.332 629.976711 38.000
AHA-2930-UP-4 771 572.332 347.603910 37.000
AIA-1465-UP-4 1023 824.332 150.022618 23.000
Infuriatingly, and this will be a recurring pattern in bioinformatics, for all
the genius that goes into the software implementation, the software creators
seem to make some questionable choices when it comes to data naming. For
example, the files that salmon and kallisto produce are all named the same
way [Link] (or [Link] for kallisto)! We can only keep track of
them by keeping them in different folders!
Each run will create a new directory with several files in it. It is best if we
store all these subdirectories in a single parent directory; in this case, we call
it salmon, and each sample will be in a subdirectory in the salmon directory.
Let’s automate the process:
cat [Link] | parallel -j 4 "salmon quant -i ${IDX} -l A --validateMappings -1 reads/{}_R1
We now have a ‘salmon’ directory containing subdirectories for each sample.
ls -1 salmon
prints:
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
12.10 Combine your counts
If you have downloaded the R scripts provided in the RNA-Seq Computer
setup, you can use the combine_transcripts.r to combine all the counts
data into a single comma-separated count matrix. You can use the script
either as a command line script or in RStudio:
Rscript code/combine_transcripts.r
that prints:
[1] "# Tool: Combine transcripts."
[1] "# Sample: [Link]"
[1] "# Data dir: salmon"
reading in files with read_tsv
1 2 3 4 5 6
[1] "# Results: [Link]"
The resulting file is called [Link] and contains the combined count matrix.
12.11 Compare the outputs
You can compare the counts file created with featureCounts with the one
made via salmon and see if you can spot any differences. The expectation
is that the two methods should be very similar. The data is consistent,
mostly, but surprisingly we note a discrepancy for sample BORED_2 (perhaps
we messed up the analysis); we invite you to verify this yourself:
12.12 Next step
You can now pick your count matrix produced with either approach and
continue with The differential expression.
Chapter 13
7. The differential expression
This chapter will focus on the practical aspects of using statistical meth-
ods. We also have a more extensive discussion on the role of statistics when
processing RNA-Seq data.
Make sure that you have completed the additional installation described in:
• RNA-Seq computer setup
13.1 Computing differential expression
For this book, we have written scripts that make use of different R-based
methods published as:
• Moderated estimation of fold change and dispersion for RNA-seq data
with DESeq2 (a.k.a. deseq2)
• Empirical Analysis of Digital Gene Expression Data in R (a.k.a. edger)
see also:
• A practical guide to methods controlling false discoveries in
computational biology
We have devoted a substantial effort to standardizing both the usage and
the results produced by these different methods so that you can use both
methods in identical ways.
For each project you should download a separate copy of all the scripts. Edit
and customize the scripts for your project.
# Obtain the biostar handbook rnaseq scripts.
curl -O [Link]
# Unpack the code.
tar -xzvf [Link]
A new directory called code will be created that contains the scripts.
As described in RNA-Seq computer setup, you may also run these scripts
directly from RStudio.
13.2 Which data will be analyzed?
The previous chapters on Alignment based RNA-Seq or Classification based
RNA-Seq ended with creating a count matrix called [Link]. You can
also obtain this count data from our site from the link:
• [Link]
with the following command:
curl [Link] > [Link]
The file contains the counts for alignments that overlap with each feature:
We will process this file with a pairwise comparison to determine which fea-
tures show differential expression between the EXCITED and BORED states.
13.3 How do I find differential expression?
All statistical methods rely on various assumptions regarding the character-
istics of the data.
In general, most users of statistical methods are unaware of these assumptions
- and frankly, it is not their fault. It always takes a surprising amount of effort
to understand the applicability of each method. Be prepared to recognize
when a methodology is ill-suited for the data we have in hand!
We will be making pairwise comparisons between two samples, each with
multiple replicates. The statistical methods need to be informed of which
columns belong to the same sample/condition. This is called the experimental
design. In our statistical scripts, we chose to specify the experimental
design with the [Link] file that we created before; it contains:
sample,condition
BORED_1,bored
BORED_2,bored
BORED_3,bored
EXCITED_1,excited
EXCITED_2,excited
EXCITED_3,excited
To perform the differential analysis with DESeq2 we can run:
Rscript code/deseq2.R
that will print:
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "# Tool: DESeq2"
[1] "# Design: [Link]"
[1] "# Input: [Link]"
[1] "# Output: [Link]"
The command above will create the [Link] file that, when viewed in
Excel, looks like this:
On the web, you can right-click and open the image in a new tab to view it
at full resolution.
And that’s it! Your statistical analysis is done. The remaining steps are
not computational anymore but rely on interpreting the results.
13.4 Explore alternatives
Statistical analysis of RNA-Seq data has many variations.
A simple Google search for it will produce seemingly countless alternatives.
Explore these alternatives. Make an effort to understand the benefits and
tradeoffs of the different choices.
13.5 Visualizing the results
We will explain the content of the [Link] file later. For now,
we note that you can also create a heatmap of the differentially expressed
features stored in the [Link] file with the following instruction:
Rscript code/create_heatmap.r
The code prints:
[1] "# Tool: Create Heatmap "
[1] "# Input: [Link]"
[1] "# Output: [Link]"
and produces the visualization:
You can also generate a heatmap using edgeR as the statistical method with:
Rscript code/edger.r
Rscript code/create_heatmap.r
that in turn prints:
[1] "# Tool: edgeR"
[1] "# Design: [Link]"
[1] "# Input: [Link]"
[1] "# Output: [Link]"
[1] "# Tool: Create Heatmap "
[1] "# Input: [Link]"
[1] "# Output: [Link]"
and generates the heatmap:
Note how the heatmap is “flipped” (red/green, top/bottom) compared to
the previous heatmap. One property of tree clustering is that branches can
“rotate” around the “joints” while maintaining the overall structure.
Edit the scripts to change the names of the files and other parameters.
13.6 What do the results mean?
When you use different statistical methods, you’ll soon find that the authors
of these packages are surprisingly cavalier with their notations and naming.
Often, words that should have well-defined meanings seem to be used in-
terchangeably. A column labeled padj in say deseq2 will mathematically
correspond to the column called FDR in edger and so on. Such inconsisten-
cies make it quite difficult (not to mention error-prone) to compare these
methods.
Fear not, though; we have rewritten each of the methods. With hard work
and tribulations (plus occasional swearing and gnashing of our teeth, it is
R after all that we are dealing with), we have produced information in a
standardized and reasonably complete tabular structure.
We have come up with what we believe is the minimally required data to
make informed decisions about gene or transcript expressions. As it happens,
none of the statistical packages we know of automatically produce all the
information we deem essential!
The resulting files from our scripts will be column-oriented, comma-separated
(CSV) files that you can readily open in Excel. Each row in the file represents
information on a single feature across two conditions (A and B). As we have
shown before, the output looks like this:
We believe that to be able to make sense of RNA-Seq results reliably, the
following columns are necessary:
1. name - the feature identity. It must be unique within the column. It
may be a gene name, transcript name, or exon - whatever the feature
we chose to quantify.
2. baseMean - the average normalized expression level across samples. It
measures how much total signal is present across both conditions.
3. baseMeanA - the average normalized expression level across the first
condition. It measures how much total signal is there for condition A.
4. baseMeanB - the average normalized expression level across the second
condition. It measures how much total signal is there for condition B.
5. foldChange - the ratio of baseMeanB/baseMeanA. It is very important
to always be aware that the fold change means B/A (second condi-
tion/first condition)
6. log2FoldChange - the base-2 logarithm of foldChange. Log 2 transformations
are convenient as they transform the changes onto a uniform scale. A
four-fold increase after transformation is 2. A four-fold decrease (1/4)
after log 2 transform is -2. This property makes it much easier to
compare the magnitude of up/down changes.
7. PValue - the uncorrected p-value of the likelihood of observing the
effect of the size foldChange (or larger) by chance alone. This p-value
is not corrected for multiple comparisons.
8. PAdj - the multiple-comparisons-corrected PValue (via the Hochberg
method). This is the probability of having at least one false positive when
accounting for all comparisons made. This value is usually overly
conservative in genomics.
9. FDR - the False Discovery Rate - this column represents the fraction of
false discoveries for all the rows above the row where the value is listed.
For example, if in row number 300 the FDR is 0.05, it means that if
you were to cut the table at this row and accept all genes at and above
it as differentially expressed, then 300 * 0.05 = 15 genes out of the
300 are likely to be false positives. The values in this column are also
called q-values.
10. falsePos - this column is derived directly from FDR and represents
the number of false positives in the rows above. It is computed as
RowIndex * FDR and is there to provide a direct interpretation of FDR.
11. The following columns represent the normalized matrix of the original
count data; in this case, 3 + 3 samples.
We typically use the FDR column for selecting rows. Depending on the
properties of the data, other selection and filtering methods may be more
appropriate.
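The arithmetic behind the foldChange, log2FoldChange, and falsePos columns is simple enough to verify by hand; here is a sketch with made-up numbers:

```python
import math

# Sketch: how the derived columns are computed (the numbers are made up).
baseMeanA, baseMeanB = 1000.0, 4000.0

foldChange = baseMeanB / baseMeanA       # always B/A: second condition over first
log2FoldChange = math.log2(foldChange)   # a four-fold increase becomes 2

# falsePos: expected number of false positives at and above a given row.
row_index, fdr = 300, 0.05
falsePos = row_index * fdr

print(foldChange, log2FoldChange, falsePos)
```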
13.7 What is the normalized matrix?
In the chapter Statistical analysis for RNA-Seq, we describe the normaliza-
tion concepts in more detail; in a nutshell, it is a transformation that makes
numbers comparable across samples. For example, comparing earnings of,
say $25 vs. $50 only makes sense if we know that both refer to the same
period.
The original (raw) counts that we load into the statistical method will first
get transformed into normalized counts - then the statistical process is run
on these normalized counts. We believe it is vital to see the normalized data
to understand what the statistical method “thinks” about our data. For
example, here are the original counts for our six samples:
cat [Link] | head -2 | cut -d, -f 2-9 | column -t -s ,
prints:
BORED_1 BORED_2 BORED_3 EXCITED_1 EXCITED_2 EXCITED_3
8103 13899 8825 69797 44913 57103
The normalized counts from our [Link] can be seen with:
cat [Link] | head -2 | cut -d, -f 13-20 | column -t -s ,
and are different:
BORED_1 BORED_2 BORED_3 EXCITED_1 EXCITED_2 EXCITED_3
11256.9 11005.4 10117.9 51077.6 50929.8 47242.6
Let’s write the numbers under one another (I will also drop the decimals for
clarity):
Original: 8103 13899 8825 69797 44913 57103
Normalized: 11256 11005 10117 51077 50929 47242
Let’s recap: the normalization method of deseq2 looked at our
entire dataset and rescaled the numbers to make them comparable across
conditions.
For example, it turned 8103 into 11256, then in another column transformed
13899 into 11005, and so on. Note how the software adjusted some columns
up, and others downward. Do also note how the replicates are
much more consistent after normalization.
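As an illustration of the idea, here is a simplified sketch of a median-of-ratios rescaling, the approach that deseq2's size factors are based on. This is not the actual DESeq2 code, and the counts are made up:

```python
import math
from statistics import median

# Simplified sketch of DESeq2-style median-of-ratios normalization.
# Not the actual DESeq2 code; the counts below are made up.
counts = {
    "BORED_1": [100, 2000, 40],
    "BORED_2": [200, 4000, 80],  # same sample sequenced twice as deeply
}

genes = range(3)
# Geometric mean of each gene across samples (the pseudo-reference).
geo = [math.exp(sum(math.log(counts[s][g]) for s in counts) / len(counts))
       for g in genes]

# Size factor per sample: median ratio of its counts to the reference.
factors = {s: median(c / r for c, r in zip(counts[s], geo)) for s in counts}

# Normalized counts: raw counts divided by the sample's size factor.
normalized = {s: [c / factors[s] for c in counts[s]] for s in counts}
print(factors)
print(normalized)
```

After dividing by the size factors, the two columns agree: the two-fold difference in sequencing depth has been removed.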
Statistical packages may apply different normalization processes as well. No
wonder there are disagreements on which method works better.
We include the normalized matrix in the results because we believe you need
to understand what the statistical method “thinks” about your data. All
the methods developed in the book will provide you with the normalized
matrix. Always verify and look at the normalized matrix when interpreting
the results for one particular feature.
13.8 How do I visualize the normalized matrix?
We believe the normalized matrix carries essential information and should be
visualized and clustered to understand the inter-replicate and inter-sample
variability. To that end, any result file created with our tools above can be
plotted as a heatmap with:
Rscript code/create_heatmap.r
Interpreting what the heatmap shows has been published as a separate
chapter:
• What does the heatmap show?
13.9 Where do we go next?
Now that we have run two algorithms and three methods, we are all dying to
know:
• What does the heatmap show?
• So which method is best?
Chapter 14
8. What does the heatmap
show
This heatmap looks neat, publication ready even, but what exactly is being
displayed and how do we interpret it?
Often we only have a cursory understanding of what a heatmap shows; the
tacit assumption is that “of course” you are supposed to understand it right
away. In reality, there are almost always important tacit assumptions built
into each visualization.
There are several ways that one may scale and visualize an expression
matrix. What follows below is what our heatmaps show, specifically.
14.1 Why do we visualize the normalized matrix?
We believe the normalized matrix carries essential information and should
always be visualized and clustered to assess and understand the inter-replicate
and inter-sample variability. The heatmap is a high-level overview of the
consistency of the statistical results; it helps you identify problems and
peculiarities, but it can also provide evidence that the processes all worked
well and that there is no reason for concern.
To that end, any result file created with our statistical tools above can be
plotted directly as a heatmap with:
Rscript code/create_heatmap.r
The heatmap drawing script will, by default, filter the matrix at a 5% false
discovery rate. Edit the script to change the cutoff.
# FDR cutoff.
MIN_FDR = 0.05
Other selection methods for differentially expressed genes are just as valid.
14.2 What does the heatmap show?
The heatmap offers a way to visualize expression data that may be on
radically different scales in each row. Recall how some genes express with tens
of thousands of copies, others with just a few. In addition, over the different
conditions the changes could also be substantial.
To deal with this problem the heatmap rescales values into so-called z-scores
for each row. Basically, it transforms numbers by subtracting the average of
the row from each number in the row, then divides the resulting value by
the standard deviation of the row. In R it would look like this:
# Original counts in a 3+3 design.
x = c( 1, 2, 3, 99, 88, 77 )
# Z-score rescaling.
z = ( x - mean(x) ) / sd(x)
# Print the z-scores.
print(z)
prints:
-0.92 -0.90 -0.88 1.13 0.90 0.67
The new vector will have a mean (average) of 0 and a standard deviation of
1. Note how numbers smaller than the mean are negative and numbers over
the mean are positive. In the heatmap negative values (decreases) are green,
positive values (increases) are red.
The magnitude of the number indicates how many standard deviations away
from the mean the value is. The larger the number, the further away the value
is from the mean.
The heatmap script we provide allows us to see, at a single glance, how
consistent the inter-replicate and intra-sample variations are. Ideally we want
to see the same color in the same group, and preferably the same shade.
Compare the two situations:
# Good replication
x = c(1, 2, 3, 5, 5, 4)
z = (x - mean(x)) / sd(x)
print(z)
Note how replicates within group are more similar than replicates between
groups.
-1.429 -0.816 -0.204 1.021 1.021 0.408
Here is another example:
# Bad replication
x = c(100, 200, 300, 500, 500, 100)
z = (x - mean(x)) / sd(x)
print(z)
the values are:
-0.9992 -0.4542 0.0908 1.1808 1.1808 -0.9992
When visualized as a heatmap the data would look like this:
Since the experimental design is 3x3, we want to see identical signs and
magnitudes for the first three columns, then the opposite color and the same
magnitudes for the next three columns. Note how for bad data the signs
(colors) are mixed.
An important skill is to learn to read this image:
• The number of columns will correspond to the samples.
• The columns are grouped by samples A and B
• The number of rows will correspond to the number of features (genes,
transcripts).
• For “good” data the sign (color) are consistent within groups.
• For “bad” data red and green intermingle within replicates.
• You can evaluate the consistency within replicates.
• We cannot compare the absolute expression levels across rows.
• Even within a row we can’t tell if red/green correspond to small or
large values.
• The colors only show how far away from the mean a value is.
To generate the image above create the following data:
name, FDR, falsePos, A1, A2, A3, B1, B2, B3
good, 0, 0, 1, 2, 3, 5, 5, 4
bad, 0, 0, 100, 200, 300, 500, 500, 100
Copy and paste the above into a file, say [Link], edit the script to
change the input name, then generate the heatmap with:
Rscript code/create_heatmap.r
14.3 Where do we go next?
Now that we have run two algorithms and three methods, we are all dying to know:
So which method is best?
Chapter 15
9. Which method is the best?
In the previous chapter, we applied two different statistical methods, deseq2
and edgeR, to our count data. To repeat the study below, run the commands
and store the outputs in separate files. We assume that:
• [Link] is based on deseq2
• [Link] is based on edgeR.
To produce that, you could run:
# Run DESeq2
(Rscript code/deseq2.r && mv [Link] [Link])
# Run edgeR
(Rscript code/edger.r && mv [Link] [Link])
We can even generate heatmaps for the results that look like this:
Which method is the best? In what context is it the best? Should we always
use that method?
Note As the packages/methods are updated, the number of
true/false positives may change.
15.1 What kinds of errors do we expect to
see?
Now you can see what makes the Golden Snidget’s naming convention so
convenient. It allows us to identify the errors by inspecting the gene names.
Recall that our genes are named as:
ADA-23438-UP-4
DLD-1465-DOWN-0.5
BEB-23438-SAME-1
In the above naming convention, the first label is the name, followed by a
number indicating the abundance, then the direction of change, and finally,
the magnitude of the change. What kinds of errors do we expect to see?
• False positives: Genes with the word SAME reported as differentially
expressed.
• False negatives: Genes with the word UP or DOWN in the name NOT
reported as being differentially expressed.
• Accuracy: The difference between the observed fold change and the
expected fold change.
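Because the expected behavior is baked into the gene names, the error classification itself can be scripted; here is a sketch with a handful of made-up calls:

```python
# Sketch: classify results using the UP/DOWN/SAME labels baked into the
# Golden Snidget gene names. Both sets below are made up for illustration.

def truly_changed(name):
    """A gene truly changes if its name carries UP or DOWN."""
    return "UP" in name or "DOWN" in name

all_genes = {"ADA-23438-UP-4", "DLD-1465-DOWN-0.5", "BEB-23438-SAME-1"}
detected = {"ADA-23438-UP-4", "BEB-23438-SAME-1"}  # called DE by a method

# False positives: detected genes that did not actually change.
false_pos = {g for g in detected if not truly_changed(g)}
# False negatives: changed genes that the method failed to detect.
false_neg = {g for g in (all_genes - detected) if truly_changed(g)}

print(sorted(false_pos))
print(sorted(false_neg))
```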
15.2 How many genes should be detected as
differentially expressed?
The genes with “UP” and “DOWN” in the name should be labeled as
differentially expressed.
cat [Link] | egrep "UP|DOWN" | wc -l
prints:
69
Thus, we know 69 genes change (have UP or DOWN in their name). But then,
some genes express at such low levels that there might not be enough data to
detect the change. Thus, we do not expect to find all 69 genes.
15.3 How many genes are detected as differentially expressed?
We will call a gene differentially expressed if its FDR value is less than 0.05.
While commonly used, the choice is not the only possible option. We could
also use the PAdj column or come up with a different definition altogether.
We are providing a script that we think is a good starting point. Edit and
expand on this script to reproduce all our findings.
Rscript code/compare_results.r
The above code will print:
[1] "# Tool: compare_results.r"
[1] ""
[1] "# File 1: 51 [Link]"
[1] "# File 2: 45 [Link]"
[1] ""
[1] "# Union: 52"
[1] "# Intersect: 44"
[1] "# File 1 only: 7"
[1] "# File 2 only: 1"
[1] "----"
[1] "Only 1:"
[1] "BGB-5859-SAME-1" "BFB-11719-SAME-1" "BIB-2930-SAME-1" "BLB-732-SAME-1"
[5] "BKB-1465-SAME-1" "BJB-1465-SAME-1" "DPD-366-DOWN-0.5"
[1] "----"
[1] "Only 2:"
[1] "BRB-92-SAME-1"
Now we can answer some of our questions:
• deseq2 detected 51 genes as differentially expressed.
• edgeR detected 45 genes as differentially expressed.
• 7 genes appeared only in the deseq2 results.
• 1 gene appeared only in the edgeR results.
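The union/intersection bookkeeping the script reports can also be sketched in the shell with comm; the gene names below are placeholders, not the actual results:

```shell
# Two hypothetical lists of differentially expressed genes (comm needs sorted input).
printf "GENE-A\nGENE-B\nGENE-C\n" | sort > genes1.txt
printf "GENE-B\nGENE-C\nGENE-D\n" | sort > genes2.txt

# comm -12 keeps shared lines; -23 keeps lines unique to file 1; -13 to file 2.
echo "Union:     $(cat genes1.txt genes2.txt | sort -u | wc -l)"
echo "Intersect: $(comm -12 genes1.txt genes2.txt | wc -l)"
echo "Only 1:    $(comm -23 genes1.txt genes2.txt | wc -l)"
echo "Only 2:    $(comm -13 genes1.txt genes2.txt | wc -l)"
```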
15.4 False positives
The genes that appeared only in the deseq2 output were:
BGB-5859-SAME-1 BFB-11719-SAME-1 BIB-2930-SAME-1 BLB-732-SAME-1
BKB-1465-SAME-1 BJB-1465-SAME-1 DPD-366-DOWN-0.5
Only one label indicates a good gene, DPD-366-DOWN-0.5; all others did not
change and should not have been detected as differentially expressed.
edgeR detected only a single false positive BRB-92-SAME-1
In this case, edgeR performed better than deseq2.
15.5 False negatives
To detect false negatives, we need to add more code to the code/compare_results.r
R script; we are just sketching out a few ideas here:
# What are the genes that do change
expect = subset(data1, grepl("UP|DOWN", name))$name
# name1 is the genes over FDR threshold in file 1
setdiff(expect, name1)
that prints the genes that do not appear in the deseq2 output.
[1] "DRD-183-DOWN-0.5" "ARA-46-UP-4" "CTC-69-DOWN-0.67"
[4] "CVC-34-DOWN-0.67" "CLC-1099-DOWN-0.67" "DND-732-DOWN-0.5"
[7] "DSD-183-DOWN-0.5" "CRC-137-DOWN-0.67" "CMC-549-DOWN-0.67"
[10] "AWA-6-UP-4" "CSC-137-DOWN-0.67" "DTD-92-DOWN-0.5"
[13] "AUA-11-UP-4" "DUD-46-DOWN-0.5" "CGC-8789-DOWN-0.67"
[16] "ATA-23-UP-4" "ASA-46-UP-4" "AVA-11-UP-4"
[19] "AXA-3-UP-4" "CWC-17-DOWN-0.67" "CNC-549-DOWN-0.67"
[22] "AYA-1-UP-4" "CUC-34-DOWN-0.67" "CXC-9-DOWN-0.67"
[25] "CYC-2-DOWN-0.67" "DVD-46-DOWN-0.5" "DWD-23-DOWN-0.5"
[28] "DXD-11-DOWN-0.5" "DYD-3-DOWN-0.5"
Similarly, the genes that do change and do not appear in the edgeR output
are:
[1] "DRD-183-DOWN-0.5" "DPD-366-DOWN-0.5" "ARA-46-UP-4"
[4] "CTC-69-DOWN-0.67" "CVC-34-DOWN-0.67" "CLC-1099-DOWN-0.67"
[7] "DND-732-DOWN-0.5" "DSD-183-DOWN-0.5" "CRC-137-DOWN-0.67"
[10] "CMC-549-DOWN-0.67" "AWA-6-UP-4" "CSC-137-DOWN-0.67"
[13] "DTD-92-DOWN-0.5" "AUA-11-UP-4" "DUD-46-DOWN-0.5"
[16] "CGC-8789-DOWN-0.67" "ATA-23-UP-4" "ASA-46-UP-4"
[19] "AVA-11-UP-4" "AXA-3-UP-4" "CWC-17-DOWN-0.67"
[22] "CNC-549-DOWN-0.67" "AYA-1-UP-4" "CUC-34-DOWN-0.67"
[25] "CXC-9-DOWN-0.67" "CYC-2-DOWN-0.67" "DVD-46-DOWN-0.5"
[28] "DWD-23-DOWN-0.5" "DXD-11-DOWN-0.5" "DYD-3-DOWN-0.5"
We can see that both methods miss genes with low abundances.
We invite you to study the results further and draw your own conclusions. It appears that out of the 69 changed genes, 30 could not be reliably detected, and that a tool may produce a sizeable number of false negatives.
Is the glass half full or half empty?
15.6 And the “winner” is edgeR
For this particular example, edgeR is the clear-cut winner. The edgeR results contain fewer false positives and more true positives.
That being said, this does not mean the same behavior will hold across all types of data. Some people swear by deseq2; others like edgeR. You can make up your own mind. Plus, there are many other alternatives as well.
Part II
RNA-SEQ IN PRACTICE
In this chapter we invite you to take the lessons learned from the Golden Snitch and apply them to a more realistic dataset.
Chapter 16
UHR vs HBR data
16.1 Which publication is reanalyzed?
This section will use data from the publication:
• Informatics for RNA-seq: A web resource for analysis on the cloud1 .
11(8):e1004393. PLoS Computational Biology (2015) by Malachi Grif-
fith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi
L. Griffith.
An alternative tutorial is available online at [Link]
Note: We have greatly simplified the data naming and organiza-
tion and processing. We recommend the original resource as an
alternative tutorial and source of information.
16.2 What type of data is included?
The data consists of two commercially available RNA samples:
• Universal Human Reference (UHR) is total RNA isolated from a diverse
set of 10 cancer cell lines.
1 [Link]
• Human Brain Reference (HBR) is total RNA isolated from the brains
of 23 Caucasians, male and female, of varying age but mostly 60-80
years old.
The data was produced in three replicates for each condition. So, to summarize, the data consists of the following:
1. UHR Replicate 1
2. UHR Replicate 2
3. UHR Replicate 3
4. HBR Replicate 1
5. HBR Replicate 2
6. HBR Replicate 3
16.3 How do I get the data?
As discussed above, this is a paired-end dataset with a 2x3=6 experimental design; since each sample is sequenced in paired-end mode, we will have 12 files in total.
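A quick sanity check on the expected file count; the naming pattern below is hypothetical, the actual files may be named differently:

```shell
# 2 conditions x 3 replicates x 2 reads per pair = 12 files.
for sample in UHR HBR; do
  for rep in 1 2 3; do
    for read in R1 R2; do
      echo "${sample}_${rep}_${read}.fq"
    done
  done
done | wc -l
```

The command prints 12, matching the expected total.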
Let’s get the data (145MB):
# The URL the data is located at.
URL=[Link]
# Download the data.
wget -nc $URL
# Unpack the data
tar zxvf [Link]
16.4 What is inside the data?
Once the process completes, you will have the following directories:
• reads containing the sequencing reads.
• refs containing genome and annotation information.
16.5 What is the reference?
For this data, we will use a subset of the human genome as a reference. We will only use chromosome 22 so that the examples complete much faster.
# The sequence information for human chromosome 22.
[Link]
# The genome annotations for chromosome 22.
[Link]
The utility called gffread may be used to create a transcript file for this
data.
conda install gffread
# Extract the transcripts.
gffread -w refs/[Link] -g refs/[Link] refs/[Link]
Even though our original references did not include a transcript file, we created one so that we can now employ both classification- and alignment-based methods. Read more about gffread here:
• [Link]
16.6 What do I need to do now?
Use the same procedures you have learned in the step-by-step instructions
to generate a list of genes or transcripts of interest.
Once you find your genes go to the gene enrichment chapter of the main book
and interpret these genes/transcripts.
Part III
The Grouchy Grinch
Chapter 17
1. Introducing the Grouchy
Grinch
Sometimes your data looks like the Golden Snitch, where everything falls
right into place. All you need is to follow the steps one by one. But then
there are other cases where you’ll run into unexpected challenges. Sometimes
you’ll end up stumbling from one deep hole to another.
In this section of the book, we will demonstrate several RNA-Seq data anal-
ysis methods through the lens of “grouchy” data where many things can go
wrong … really, really wrong.
This chapter’s data was collected from published scientific works to demon-
strate traps and pitfalls that a budding data analyst needs to navigate
around.
Rarely do all these problems coincide; what we want to demonstrate is that
we need to be vigilant at all times, visualize the results, and pick up on
characteristics that may indicate unusual phenomena.
17.1 What do we know about the Grouchy
Grinch?
Furry and fictitious, the Grouchy Grinch wants you to fail.
Beyond being behaviorally challenged, the Grouchy Grinch has evolved to
elude assumptions of how “regular” organisms work and expose some unsta-
ble foundations of bioinformatics in general.
17.2 Experimental design
As we all know, the Grouchy Grinch has many moods. Among them, we have managed to identify two:
1. CRANKY the Grinch does cranky things with a smirk.
2. WICKED the Grinch does wicked things with a sneer.
RNA-Seq samples have been extracted from the Grinch for both of these
moods. Your job is to characterize the gene regulation that takes place.
Good luck, comrade! You are going to need it!
Your first task is to understand the data, so head on to Grinch: data disarray.
Chapter 18
2. Grinch: data disarray
Wide-eyed and enthusiastic, equipped with the invincibility of the freshly
acquired powerful knowledge, we head out to slay the Grinch.
The data is available from
• [Link]
We expect it to contain information on two states:
1. CRANKY when the Grinch does cranky things with a smirk.
2. WICKED when the Grinch does wicked things with a sneer.
18.1 Download and unpack the data
# Download the data
wget -nc [Link]
# Unpack the data
tar zxvf [Link]
The process shown above prints:
x reads/[Link]
x reads/[Link]
x reads/[Link]
x reads/[Link]
x reads/[Link]
x reads/[Link]
x refs/[Link]
x refs/annotations_1.gff
x refs/annotations_2.gff
x refs/annotations_3.gtf
x refs/annotations_4.gtf
Let’s print statistics on the data:
seqkit stats reads/*.fq
the command above generates:
file format type num_seqs sum_len min_len avg_len max_len
reads/[Link] FASTQ DNA 62,364 6,236,400 100 100 100
reads/[Link] FASTQ DNA 114,287 11,428,700 100 100 100
reads/[Link] FASTQ DNA 78,916 7,891,600 100 100 100
reads/[Link] FASTQ DNA 203,795 20,379,500 100 100 100
reads/[Link] FASTQ DNA 107,099 10,709,900 100 100 100
reads/[Link] FASTQ DNA 199,330 19,933,000 100 100 100
The data appears to be a single-end sequencing run with variable coverage; almost 4 times as many reads were collected for Wicked2 as for Cranky1.
18.2 Investigate the genome
What do we know about the genome:
seqkit stats refs/[Link]
prints:
file format type num_seqs sum_len min_len avg_len max_len
refs/[Link] FASTA DNA 1 640,851 640,851 640,851 640,851
18.3 Evaluate the annotations
We note that there are multiple annotations, helpfully designated as:
1. refs/grinch-annotations_1.gff
2. refs/grinch-annotations_2.gff
3. refs/grinch-annotations_3.gtf
4. refs/grinch-annotations_4.gtf
Sharp-eyed readers will notice that some extensions are GFF while others are GTF. How many lines are in each?
wc -l refs/grinch-ann*
prints
1060 refs/grinch-annotations_1.gff
1098 refs/grinch-annotations_2.gff
784 refs/grinch-annotations_3.gtf
499 refs/grinch-annotations_4.gtf
Why so many files for the same organism?
Well, as it happens, the Grinch data is dispersed across many seemingly
equally credible data repositories. We found some information deposited at:
• Hogwarts School of Witchcraft and Wizardry
• Celestia’s School for Gifted Unicorns
• Canterlot School and Library of Magic.
Since we weren’t quite sure which one is the authoritative source (each
claimed to be so!), we visited each and obtained all files we could get our
hands on to make sure we cover all bases.
18.4 Visualizing annotations
Start IGV, load the file [Link] as the genome. Then drag each of the
annotation file into the main view to form a separate track. It might look
like so:
A few observations:
• The first track, grinch-annotations_1.gff, appears to contain the most information.
• The information, while similar across the tracks, does not match exactly.
• Hovering over each feature, we can see different types of annotation in different files: mRNA, CDS, exon or transcript; no single file contains all of these annotations.
A slight unease starts creeping in … will our results depend on the annotations
that we choose?
Hopefully not …
Time to move on to Grinch: alignment gloom
Chapter 19
3. Grinch: alignment gloom
Let’s set up an alignment-based data analysis, similar to the one done for the Golden Snitch.
Create the “roots” as explained in The Art of Bioinformatics Scripting1
parallel echo {1}{2} ::: Cranky Wicked ::: 1 2 3 > ids
now our file ids contains:
Cranky1
Cranky2
Cranky3
Wicked1
Wicked2
Wicked3
We want to run the hisat2 aligner. To do so, we first need to index the
genome:
# A shortcut to the genome.
IDX=refs/[Link]
# Build the index. It needs to be done only once per genome.
hisat2-build $IDX $IDX
Now run the aligner in parallel mode:
1 [Link]
# Make a directory for the bam file.
mkdir -p bam
# Run the hisat aligner for each sample.
cat ids | parallel "hisat2 -x $IDX -U reads/{}.fq | samtools sort > bam/{}.bam"
# Create a bam index for each file.
cat ids | parallel samtools index bam/{}.bam
During the run we see encouraging messages such as:
62364 reads; of these:
62364 (100.00%) were unpaired; of these:
19 (0.03%) aligned 0 times
62311 (99.92%) aligned exactly 1 time
34 (0.05%) aligned >1 times
...
Our hearts are filled with hopes and dreams; the reads seem to match the
reference with high alignment rates.
19.1 Visualizing alignments
Let’s fire up IGV to visualize the alignments. As we browse, we notice the following unexpected characteristics:
Specifically we note split alignments on ranges of 10kb that look like so:
Let’s check the average sequence composition, the so-called GC content2, with the Emboss tool called geecee:
cat refs/[Link] | geecee --filter
prints:
#Sequence GC content
grinch 0.21
Thus we note that only 21 percent of the genome sequence is composed of GC; the rest is AT. We usually call such sequences AT-rich (79% of the bases are A/T) rather than GC-poor.
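GC content is simple enough to estimate by hand; here is a sketch using awk on a toy sequence (not the actual Grinch genome):

```shell
# Count G/C versus A/T characters; gsub returns the number of replacements made.
echo "ATATATGCATATATATATAT" | awk '{
  gc = gsub(/[GCgc]/, "")
  at = gsub(/[ATat]/, "")
  printf "GC content: %.2f\n", gc / (gc + at)
}'
```

For the toy sequence above (2 of 20 bases are G/C), the sketch prints a GC content of 0.10.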
We could check whether our reads have the same GC composition. We’ll quickly concatenate all reads into a single sequence, then compute the GC content for the long sequence:
2 [Link]
cat reads/[Link] | seqkit seq -s | geecee --filter
#Sequence GC content
0.36
From that data alone, we would already know that the sequencing does not cover the entire genome but only a selected region of it (in this case, the coding regions). It is not unexpected that the coding sequences have higher GC content than the genome. Even so, when the GC content is skewed substantially, the sequence complexity is significantly reduced. “Misalignments” are more likely to occur.
Again, note that the split alignments with long introns are not “incorrect” in the classical sense of the algorithm making a mistake. The scoring matrix leads the algorithm to optimally align the sequences to the “wrong” locations. There is reason to believe that the long introns we found resulted from performing alignments against regions with low variability.
• See: [Link]
Zooming into that area and dragging the sequence track next to the odd
feature shows us that the long intron overlays a region composed of just ‘T’s.
We went into quite a bit of detail above to demonstrate the thought process and rationale. Always visually evaluate your results, explain any inconsistency that you see, and find an answer to what might be causing it.
19.2 Tuning the aligner
Hisat2 has a wide variety of options that allow us to customize the alignment process. See all options on the hisat2 manual page or by running hisat2 -h. Browsing the help, we see that the parameter:
--max-intronlen NUMBER
would limit the maximal size of the introns that it is willing to bridge over.
But what should that limit be? It is not entirely obvious; we need to estimate it in a way that does not introduce new constraints.
If we set the limit too low, we will lose valid introns that happen to be longer
than the cutoff.
We do have access to the annotation file, but there is no universal tool that quickly tells you what the intron length distribution is. Why isn’t there? Again, it comes down to the lack of foundational data structures and the information the files contain. If our GFF files had a feature called intron, we could readily extract those and compute their lengths. But none of our files lists introns as features … so what do we do? … hit Google and find some advice on Biostars, some of which may be directly applicable, some of which may not. Either way, it is an unexpected detour that may take two hours or two weeks to solve. It all depends on the severity of the problem.
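If the annotation lists exons per transcript, intron lengths can be derived as the gaps between consecutive exons. A minimal sketch on a toy, single-transcript file (the coordinates are made up):

```shell
# Build a tiny tab-separated GTF-like file with three exons of one transcript.
printf 'chr1\t.\texon\t100\t200\t.\t+\t.\ttranscript_id "t1";\n'   > toy_introns.gtf
printf 'chr1\t.\texon\t500\t600\t.\t+\t.\ttranscript_id "t1";\n'  >> toy_introns.gtf
printf 'chr1\t.\texon\t1000\t1100\t.\t+\t.\ttranscript_id "t1";\n' >> toy_introns.gtf

# Intron length = next exon start - previous exon end - 1.
awk -F '\t' '$3 == "exon" { if (prev) print $4 - prev - 1; prev = $5 }' toy_introns.gtf
```

A real annotation would first need its exons grouped and sorted per transcript; this sketch assumes a single, already sorted transcript.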
In this case, we’ll estimate visually that a typical intron length appears to
be around 500bp. So we’ll set a limit at 2500 (remembering this number and
documenting it early in our eventual notes):
# Make a directory for the bam file.
mkdir -p bam
# Run the hisat aligner for each sample.
cat ids | parallel "hisat2 --max-intronlen 2500 -x $IDX -U reads/{}.fq | samtools sort > bam/{}.bam"
# Create a bam index for each file.
cat ids | parallel samtools index bam/{}.bam
Now our view is the following:
Did we fix the problem completely? No.
But we have mitigated it to some extent. We make a mental note that our
approach will not detect junctions over 2500 bp and chalk one up to the
Grinch.
All we need is to count, so let’s move on to Grinch: counting troubles
Chapter 20
4. Grinch: counting troubles
Having performed the alignments described in Grinch: alignment gloom, we
can now continue building our count matrix. The counting of reads is also
called “read summarization”; the count is a sum of reads over certain features.
20.1 Let’s do the counting
First, let’s test that counting a single file works:
featureCounts -a refs/grinch-annotations_1.gff -o [Link] bam/[Link]
The program fails immediately, with a verbose error message. The Grinch is
back!
ERROR: no features were loaded in format GTF.
The annotation format can be specified by the '-F' option, and the required feature
The porgram has to terminate.
As you can see, the porgram had to terminate. I know, I know! Me joking
about other people’s typos is like the pot calling the kettle black.
But why did the porgram terminate, though? The error message seems to
talk about some -F flag missing. Our file is GFF, and the error talks about
GTF. Quite confusing. Should we set the type?
featureCounts -F GFF -a refs/grinch-annotations_1.gff -o [Link] bam/Cranky1.b
…
The porgram has to terminate.
As it turns out, the featureCounts software makes assumptions about how the annotations are formatted. Naively, we assumed that when we downloaded
grinch-annotations_1.gff from Celestia’s School for Gifted Unicorns, its
headmaster provided the oversight that ensured that the file is usable in a
standard workflow of counting features.
As it turns out, said headmaster did not do anything of the sort! The GFF
we got from that source will not work right away with featureCounts. I
expected more from gifted Unicorns, but perhaps we can still salvage the
situation.
20.2 How does the counting work
When presented with an annotation file, every counting program needs to be provided with the following information:
1. what kinds of features to compute the overlaps with: exon, CDS, mRNA
2. how to group (sum up) the counts for individual features into a single
entity.
When we do not explicitly specify these, the counting program will attempt to process the file with default settings that assume certain fields are present.
Two major counting strategies are common:
1. gene-level analysis: where exons are grouped once per gene, the output
is gene-level counts.
2. transcript level analyses, where exons are grouped per transcript.
Whereas in the first strategy, each exon appears only once per gene, in the second strategy, the same exons may be listed for different transcripts. Obviously, there needs to be a way to avoid counting the full contribution of an exon more than once if it is listed for multiple transcripts. The process of determining what fraction of each exon gets assigned to each transcript is called “deconvolution”.
In general, we recommend the use of featureCounts for gene-level analyses. We
recommend classification/pseudo-alignment-based methods with tools like
kallisto and salmon for transcript-level analysis.
Let’s view a few lines in our annotation file refs/grinch-annotations_1.gff.
We see the following:
grinch . gene 42367 46507 . - . ID=grinch.16;featflags=start_before:true,end
grinch . biological_region 42367 46507 . - . ID=grinch.17;featflags=type:mRN
grinch . mRNA 42367 43617 . - . Parent=grinch.17;featflags=start_before:true
grinch . mRNA 43775 46507 . - . Parent=grinch.17;featflags=end_after:true
grinch . biological_region 42367 46507 . - 0 ID=grinch.20;featflags=type:CDS
grinch . CDS 42367 43617 . - 0 Parent=grinch.20
grinch . CDS 43775 46507 . - 0 Parent=grinch.20
Valid feature types appear to be gene, mRNA, and CDS; the valid grouping field appears to be called Parent.
If no splicing were present, we could use the gene and mRNA fields, but if we
want to avoid counting reads that overlap with introns, we need to use the
CDS fields. Let’s try to specify the type of the field used for counting as CDS:
featureCounts -t CDS -a refs/grinch-annotations_1.gff -o [Link] bam/Cranky1.b
prints:
ERROR: failed to find the gene identifier attribute in the 9th column of the provid
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'Parent=grinch.6'
The program has to terminate.
Oh, the typo is gone! Ha, it is a different error message! Sometimes the typo
is helpful as it makes things more evident. Note how the error now talks
about missing gene identifier. It even gives us an example of what kinds of
valid annotations it has seen. That is very helpful. Let’s set the grouping
field as Parent:
featureCounts -t CDS -g Parent -a refs/grinch-annotations_1.gff -o [Link] bam
alas, it still fails:
ERROR: failed to find the gene identifier attribute in the 9th column of the provid
The specified gene identifier attribute is 'Parent'
An example of attributes included in your GTF annotation is 'ID=grinch.41;locus_ta
The program has to terminate.
Ah, it almost worked! Alas, whereas most CDS entries do have a Parent attribute, not all of them do!
What is the verdict? The file uses non-standard attribute schemes; some lines in the file do not even follow their own rules! Some CDS entries lack the Parent attribute. Thanks, Celestia’s School for Gifted Unicorns (/sarcasm).
The conclusion is that the first file, even though we obtained it from a seem-
ingly official source, cannot be used for feature counting as it does not contain
the information needed to operate.
20.3 The search for the “right annotation”
format begins
Let’s give the second annotation a go then:
featureCounts -a refs/grinch-annotations_2.gff -o [Link] bam/[Link]
it prints:
ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF f
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'ID=exon_GRI_0107600-E1;Paren
The program has to terminate.
Again, the typo is gone. From that, we know the software jumped to the second error right away. So the feature type selection is valid, but the grouping selection is not.
Looking at the file, the second annotation file contains:
grinch . gene 249231 252010 . - . ID=GRI_0105800;description=conserved Plasmodium pro
grinch . mRNA 249231 252010 . - . ID=GRI_0105800.1;Parent=GRI_0105800;Ontology_term=
grinch . exon 249231 250366 . - . ID=exon_GRI_0105800-E2;Parent=GRI_0105800.1
grinch . exon 250648 252010 . - . ID=exon_GRI_0105800-E1;Parent=GRI_0105800.1
The default feature type used by featureCounts is exon, and we see that
listed in the third column. Let’s see if adding Parent works; we will list the
exon type just to be explicit:
featureCounts -t exon -g Parent -a refs/grinch-annotations_2.gff -o [Link] bam/Crank
Yay, it worked! Thank you, Canterlot School and Library of Magic; that’s where I’m going to send my ponies!
20.4 GTF - the zombie format that shouldn’t exist
The trials and tribulations we described above were so common, and ensuring that a GFF contains properly annotated lines was so “difficult” to communicate and enforce, that the community “decided” that the only way to ensure that a file contains what it needs is to create a NEW file format called GTF.
It is not really a new format, they just went to the data format graveyard
and re-animated an obsolete, antediluvian, and simplistic design called Gene
Transfer Format (GTF) and declared that to be the “standard” format used
for counting. The GTF zombie format had one job and one job only!
Its requirement was to have a gene_id and transcript_id specified on every
exon line. That’s it. This is why it exists.
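That single requirement is easy to verify mechanically. A sketch on a toy file (with hypothetical attribute values) where the second exon line violates the rule:

```shell
# One valid and one invalid exon line in a tiny GTF.
printf 'chr1\t.\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'  > toy_check.gtf
printf 'chr1\t.\texon\t200\t300\t.\t+\t.\tgene_id "g1";\n' >> toy_check.gtf

# Flag exon lines missing either required attribute.
awk -F '\t' '$3 == "exon" && !(/gene_id/ && /transcript_id/) {
  print "line " NR " lacks gene_id/transcript_id"
}' toy_check.gtf
```

Running a check like this before counting can save a round of cryptic featureCounts error messages.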
The GFF format already contains a wealth of information; it could have
easily accommodated the above information as well. Some GFF files will
work fine; you just can’t tell from the extension.
Let’s look at refs/grinch-annotations_3.gtf. We note that the extension is GTF and the file contains lines like:
grinch . exon 29510 34762 . + . transcript_id "GRI_0100100.1"; gene_id "GRI_0
grinch . exon 35888 37126 . + . transcript_id "GRI_0100100.1"; gene_id "GRI_0
Complaints aside: Look Ma! Now featureCounts “just works”; maybe GTF is not so ludicrous after all (oh yes, it is):
featureCounts -a refs/grinch-annotations_3.gtf -o [Link] bam/[Link]
Above, no feature type or grouping information was necessary; featureCounts was built with GTF in mind. Though, because “it just works”, we are not quite sure what the type and grouping attributes are (exon and gene_id, if you must know).
20.5 Ok, Grinch! Let’s count.
Let’s list the Cranky and Wicked files in groups to generate our counts:
featureCounts -a refs/grinch-annotations_3.gtf -o [Link] bam/C*.bam bam/W*.b
The output is:
Finally, we are making some progress!
Are we there yet? Not so fast, says the Grinch!
Let’s move on to Grinch: stranding woes
Chapter 21
5. Grinch: stranding woes
As we’ve seen in Grinch: counting troubles we can generate our counts with:
featureCounts -a refs/grinch-annotations_3.gtf -o [Link] bam/C*.bam bam/W*.b
When we color alignments by strand (right-click and select that option if not set), we notice that certain colored blocks appear. Some regions are red (forward alignments), some regions are blue (reverse alignments):
Moreover, if we follow the little (barely visible) “fishbones” drawn over the genes, we can also note that the alignments appear to go in the opposite direction of the feature directionality:
But at the same time, we note that there might be some red alignments in
blue zones and vice versa.
Thanks, Grinch! It looks like we have another complexity at hand.
21.1 Stranded? Yes, No, Reverse, RF, FR,
SF, SR
Believe it or not, above, we lucked out a bit. At least the data was single-
end sequencing data. If it were paired-end data, each pair would have both
forward and reverse components where the patterns (blue/red) would be a lot
harder to discern. In paired stranded alignments, the first in pair would align
in the antisense, and the second in pair would align on the sense orientation
of the transcript.
For example, for DNA originating from a transcript on the reverse strand, the
first-in-pair would align on the forward strand and would be shifted leftward;
the second-in-pair would align on the reverse strand and be rightward relative
to the fragment. Simple, right? Not really.
The confusion, the labeling, and the misunderstandings around stranded data
are endemic, profound, and pervasive, so much so that there is even a soft-
ware tool called Guess My Library Type1 to assist users in exploring the
complexities of the matter. Really? I need software to figure out whether
my data is stranded and which type of stranded it is?
• [Link]
Then there are lengthier exposés trying to shed some light on the matter:
• Biostar: Orientation of PE reads a review of –fr –ff and –rf meanings2
• Biostar: How To Separate Illumina Based Strand Specific Rna-Seq
Alignments By Strand3
21.2 Plz send help!
A helpful summary (partly reproduced below) was collected by igordot4 at:
• [Link]
1 [Link]
2 [Link]
3 [Link]
4 [Link]
The table below lists the parameters that would need to be set for the various
tools to operate in the correct mode. Note the lack of standardization and
immense potential for confusion:
Tool                               forward (transcript)       reverse (rev comp of transcript)
TopHat/Cufflinks --library-type    fr-secondstrand            fr-firststrand
STAR                               1st read strand            2nd read strand
htseq-count -s/--stranded          yes                        reverse
featureCounts -s                   1                          2
RSEM --forward-prob                1                          0
Salmon/Sailfish --libType          SF/ISF                     SR/ISR
HISAT2 --rna-strandness            FR (F for single-end)      RF (R for single-end)
Library Kit                        Illumina ScriptSeq         Illumina TruSeq Stranded Total RNA
21.3 How to perform stranded alignments?
First, do we need to perform a strand-specific alignment? The table above states hisat2 should be invoked with the
--rna-strandness R
parameter for reverse-stranded RNA-Seq data. As it happens, for the hisat2 tool in particular, the only difference the flag makes is that, when applied, each alignment gains an additional optional tag of the form:
XS:A:-
to indicate the strand of the fragment. Thus each alignment becomes tagged with the strand opposite to the one it aligns to. Even though, in this case, rerunning the alignments is not critical, we will do so for completeness’ sake. For other tools, rerunning the alignment in stranded mode may have a larger effect. Here is our new alignment code:
# Make a directory for the bam file.
mkdir -p bam
# Run the hisat aligner for each sample.
cat ids | parallel "hisat2 --rna-strandness R --max-intronlen 2500 -x $IDX -U reads/{}.fq | samtools sort > bam/{}.bam"
# Create a bam index for each file.
cat ids | parallel samtools index bam/{}.bam
21.4 How to perform a stranded counting?
We read the manuals or visit the table above to note that featureCounts
takes the -s parameter with value 2:
featureCounts -s 2 -a refs/grinch-annotations_3.gtf -o [Link] bam/C*.bam bam/W
All across the board, the numbers have changed. Since we are here and have all the code, why not investigate both cases at the same time? We’ll study both sense and antisense transcription. You should too; there is no reason to leave that stone unturned.
This is where writing your own analysis code makes so much sense; it takes a flick of the wrist to rerun the entire pipeline in both sense and antisense mode and investigate phenomena that others might have completely missed.
# Count the alignments mapping the "wrong" way (antisense).
featureCounts -s 1 -a refs/grinch-annotations_3.gtf -o [Link] bam/C*.bam bam/W*
# Count the alignments mapping the "right" way (sense).
featureCounts -s 2 -a refs/grinch-annotations_3.gtf -o [Link] bam/C*.bam bam/W
Let’s see what kinds of maximal counts we get for our columns:
cat [Link] | grep GRI | datamash max 7-12
44226 83519 54239 39601 19858 36000
cat [Link] | grep GRI | datamash max 7-12
49 292 82 601 737 820
We can see that there are columns with as many as 820 antisense fragments aligning; a little grepping shows that gene to be GRI_0102800.
cat [Link] | grep "\t820"
GRI_0102800 grinch 126318 128141 - 1824 11 36 34 601 737 820
Let’s visualize the region for GRI_0102800 (you can type the name into the location box) and load up both the Cranky1 and the Wicked3 samples:
What? Whaaaaaat? No! NOOOOOOOO!
Do you see what I see? If we ran differential expression on these counts, ruin
is our destination!
Time to move on to Grinch: integrity torment
Chapter 22
6. Grinch: integrity torment
So what is the problem with this data?
Let’s follow through with what the data says.
cat [Link] | grep "\t820"
GRI_0102800 grinch 126318 128141 - 1824 11 36 34 601 737 820
We don’t need a super-duper statistical package to tell us that the numbers
11, 36, 34 and 601, 737, 820 will produce fold changes indicating
that gene GRI_0102800 is differentially expressed, with a large effect size of
about 25x and a teeny-tiny p-value that would pass even the most stringent
statistical cutoff.
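As a quick sanity check on the arithmetic, a couple of lines of awk reproduce the rough fold change from the two count triplets quoted above:

```shell
# Rough fold change from the counts quoted in the text:
# the first condition (11, 36, 34) versus the second (601, 737, 820).
echo "11 36 34 601 737 820" | awk '{
    m1 = ($1 + $2 + $3) / 3              # mean of the first triplet
    m2 = ($4 + $5 + $6) / 3              # mean of the second triplet
    printf "fold change: %.1f\n", m2 / m1
}'
```

The ratio of the means comes out near 26x, in line with the roughly 25x effect size mentioned above.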
Yet when we look at the data itself, something is not right. Gene
GRI_0102800 is covered only on the left side; the right side has no coverage.
This gene is not expressed! Yet our counts show it as such!
You see, featureCounts produces an integer, such as 34 or 820, that reflects
the number of reads that overlap with the feature. The statistical
method tells us how likely it is to observe an effect of this size (34 versus 820)
by random chance if the numbers were sampled from the same distribution.
None of the tools are concerned with whether those numbers were correct
to begin with.
22.1 Coding regions are NOT transcripts!
Perhaps the most common error in RNA-Seq data interpretation is
tacitly assuming that transcripts consist (mostly) of coding regions.
Let’s recall the potential parts of a transcript:
1. 5’ UTR (untranslated region)
2. one or more exons
3. zero or more introns
4. 3’ UTR (untranslated region)
The above is not an exhaustive enumeration, but these are the parts relevant
to our situation. But what happens when the untranslated regions are
unexpectedly long? Looking at annotation files (GFF), we note that UTRs
are often not annotated! Only the file:
1. refs/grinch-annotations_4.gtf
seems to contain transcript-level annotations. When we use any other GFF
or GTF file, our results are incomplete and potentially flawed. The RNA-Seq
experiment sequences transcripts. If we only have information on coding
regions, we end up ignoring some of our data.
Only the grinch-annotations_4.gtf annotation file contains the untranslated
regions. The other three files contain only coding sequence-related information.
In general, coding sequences can be validated via protein sequences and are
therefore typically more reliable; as a result, they usually appear in annotations
before transcript information is added.
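A quick way to see which feature types an annotation file actually contains is to tally column 3 of the GTF/GFF. The three-line file below is a made-up illustration; substitute your own annotation file:

```shell
# Build a tiny, made-up GFF file for illustration (three features).
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\nchr1\tsrc\texon\t1\t100\t.\t+\t.\tID=e1\nchr1\tsrc\tthree_prime_UTR\t80\t100\t.\t+\t.\tID=u1\n' > toy.gff

# Tally the feature types (column 3); if no UTR lines show up here,
# the file does not annotate untranslated regions.
cut -f 3 toy.gff | sort | uniq -c
```

Run the same tally on each of the four Grinch annotation files to see which ones carry UTR features.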
Let’s visualize the transcript level annotations. Note how the 3’ UTR
in grinch-annotations_4.gtf goes way past the stop codon from
grinch-annotations_3.gtf. What do we see now:
Ok, another unpleasant surprise. The transcript for gene GRI_01027000.1 is
labeled in the other file as MSTRG.7.1, but then, where is the transcript for
gene GRI_01028000.1? The file does not have it … sorry, pal. Why not? It is
just missing; chalk another one up to the Grinch and incomplete data. And
there is a lesson here: when you download data from public sources, it is never
complete - these data are simply snapshots of the current understanding.
22.2 Transcript integrity
The concept of transcript integrity characterizes the evenness of coverage
over a feature beyond just summarizing the number of reads that overlap
with it. Take the coverages below:
====== FEATURE A ======      ======= FEATURE B =======
--- --- --- --- --- ---      --- --- ---
--- --- --- --- ---          --- --- ---
                             --- --- ---
                             --- ---
Both features are covered with 11 reads; thus our coverage matrix would
report 11 reads for both. Yet clearly, FEATURE A is covered evenly (every base
is covered at nearly the same depth), whereas only half of FEATURE B
is covered, at double the depth, while the other half is empty.
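The evenness idea can be made concrete with toy per-base depths mirroring the diagram above; both features have the same total coverage but very different covered fractions:

```shell
# Toy per-base depths: FEATURE A is about 2x everywhere,
# FEATURE B is 4x on one half and 0 on the other.
for depths in "2 2 2 2 2 2" "4 4 4 0 0 0"; do
    echo $depths | awk '{
        covered = 0
        for (i = 1; i <= NF; i++) if ($i > 0) covered++
        printf "covered fraction: %.2f\n", covered / NF
    }'
done
```

The first feature reports a covered fraction of 1.00, the second only 0.50, even though the read totals agree.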
The concept that captures the information of the coverage evenness is called
“transcript integrity” (TIN) and was first described in:
• Measure transcript integrity using RNA-seq data. BMC Bioinformatics,
volume 17, Article number: 58 (2016)
It is claimed that the Python code to calculate the TIN score ([Link]) is available
from the RSeQC package; unfortunately, it is not so simple to install and
run. In this book, for simplicity and to avoid having to install yet another
tool, we’ll do what we call a Poor Man’s TIN.
We will attempt to compute the fraction of each gene’s bases covered by
reads. It is not an optimal solution and only works here because the introns
are relatively short. For a general solution, you would need to use the tool
([Link]) mentioned above.
# Select the genes from the GTF file.
cat refs/grinch-annotations_2.gff | awk '$3=="gene" { print $0 }' > [Link]
# Compute the coverage of each gene relative to all BAM files. Note the -S flag, which selects overlaps on the opposite strand.
bedtools coverage -S -a [Link] -b bam/*.bam > [Link]
The resulting coverage file has the following structure (as described by the
bedtools coverage help):
After each interval in A, bedtools coverage will report:
1. The number of features in B that overlapped (by at least one base pair) the A interval.
2. The number of bases in A that had non-zero coverage from features in B.
3. The length of the entry in A.
4. The fraction of bases in A that had non-zero coverage from features in B.
cat [Link] | cut -f 9,13 | tr ";" "\t" | cut -f 1,3 | head
prints:
ID=GRI_0107600 1.0000000
ID=GRI_0105800 1.0000000
ID=GRI_0100200 0.6924959
ID=GRI_0104500 0.3956835
ID=GRI_0111700 0.2824859
ID=GRI_0107900 0.8333333
ID=GRI_0105000 0.8969697
Sort and store the results:
cat [Link] | cut -f 9,13 | tr ";" "\t" | cut -f 1,3 | sort -k2,2rn > [Link]
How is this file used?
You can now follow the differential expression computation as shown for the
Golden Snidget to obtain a list of the differentially expressed genes.
For each gene reported as differentially expressed, we will need to verify its
coverage (transcript integrity) and ensure that its integrity is high enough to
warrant its designation as differentially expressed.
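A minimal sketch of that filtering step, using a hypothetical two-column (ID, covered fraction) file standing in for the sorted file built above, and an arbitrary cutoff of 0.8:

```shell
# Hypothetical stand-in for the sorted (ID, covered-fraction) file;
# the 0.8 cutoff is arbitrary and should be tuned to your data.
printf 'ID=GRI_0107600\t1.0000000\nID=GRI_0104500\t0.3956835\n' > tin.txt

# Keep only the genes whose covered fraction passes the cutoff.
awk -F '\t' '$2 >= 0.8 { print $1 }' tin.txt
```

Cross-referencing the survivors of this filter with your differentially expressed gene list flags the calls, like GRI_0102800, that rest on uneven coverage.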
Chapter 23
7. Grinch: R anguish
The following excerpt from the post What I’ve Learned in 45 Years in the
Software Industry captures the fundamental problem that plagues most
scientific software. It may affect R (and Bioconductor) to a larger degree than,
say, software developed in Python, but that is mostly because just about all
software in R is developed by really smart people who happen to know a lot.
Smart people are often affected by what is called The Curse of Knowledge.
23.1 The Curse of Knowledge
When you know something it is almost impossible to imagine
what it is like not to know that thing. This is the curse of
knowledge, and it is the root of countless misunderstandings and
inefficiencies. Smart people who are comfortable with complexity
can be especially prone to it!
If you don’t guard against the curse of knowledge it has the
potential to obfuscate all forms of communication, including code.
The more specialized your work, the greater the risk that you
will communicate in ways that are incomprehensible to the
uninitiated. Fight the curse of knowledge. Work to understand your
audience. Try to imagine what it would be like to learn what you
are communicating for the first time.
The reason I am calling this out is not to criticize others, but to help you cope
with the challenges. The R code, vignettes, and tutorials you read will
manifest a complexity that, initially, will be quite intimidating.
You will be frustrated, you will get mad.
I am not sure if it is of any consolation, but I myself get confused by R
documentation all the time. I often feel that the people who wrote the
documentation speak a different language than I do. They most certainly
think about the world differently than I do.
But don’t despair; put additional effort into it and it will work out. Most
problems you run into are not new at all. The answer to your question is
often out there already; you just need to find it.
23.2 Help is available
At the same time, I want to clearly state that most scientists are also selfless
and dedicated to the greater good. Support sites such as the Bioconductor
support Q&A are welcoming places where you can get in touch with the
individuals who wrote the very software that you are trying to use. Most are
willing and ready to help. If you get stuck, you can ask your question there:
• [Link]
23.3 R installation problems
You may run into mysterious problems such as:
dyld: Library not loaded: @rpath/./[Link]
Referenced from: /Users/badluckbrian/miniconda3/envs/stats/lib/R/lib/[Link]
Reason: image not found
Abort trap: 6
I have found that this often means you have two R installations, one inside
the bioinfo environment and one outside of it, and the two get into conflict.
Basically, you may have installed the R packages while the bioinfo environment
was not active.
Try uninstalling R both from within your environment and from outside of it as well.
conda activate bioinfo
conda uninstall r
conda deactivate
conda uninstall r
Another solution is to start the R that you do have, then type in the
DESeq2 installation instructions as described here:
[Link]
Part IV
FURTHER EXPLORATIONS
Chapter 24
The RNA-Seq puzzle
I thought up this problem while preparing a lecture on RNA-Seq data analysis.
On the spur of the moment, I assigned it as a homework problem. Most
students enjoyed solving it - and I received quite a bit of positive feedback.
I think that it reads almost like a puzzle in the Sunday paper.
Later I came to believe that this puzzle provides the means to assess how
well one understands RNA-Seq analysis.
When you can solve the puzzle, it means that you know how RNA-Seq
analysis works behind the scenes - what assumptions it makes and how
various effects manifest themselves in the data.
24.1 How would I even solve this puzzle?
It is ok if you can’t solve this puzzle yet. Read the next chapters, then come
back here. We’ve listed it early on to show you where we are going.
The purpose of this “puzzle” is to get you to think about what the numbers
mean and how they are linked together - and what it takes to produce
consistent, realistic data.
But, we admit it is also an unusual approach because it reverses the thought
process.
In a typical RNA-Seq analysis, you are given data, and you are asked to
determine the gene expression from it. In this puzzle, you are told what the
reality is - how genes express relative to one another; then you have to
make the data that supports those statements. Several constraints need to
be juggled. It is not unlike a sudoku puzzle, actually.
Note – you don’t need a computer to solve this problem: just paper and
pencil or write the numbers into a text file.
24.2 The Pledge
Imagine that you have an organism that has only three distinct transcripts:
A, B, and C.
• A, with a length of 10bp
• B, with a length of 100bp
• C, with a length of 1000bp
We want to study this organism under two conditions:
• Wild type: WT
• Heat shock: HEAT
We know from other sources that:
1. Within the WT condition, gene A expresses at levels twice as high as
gene B.
2. Only one transcript’s expression level (but you don’t know which)
changes between WT and HEAT conditions. The change is substantial
enough to be detectable.
3. All the other transcripts (not discussed in points 1 or 2) will express
at the same abundance across WT and HEAT.
Imagine that we have performed an RNA-Seq experiment across the wild-type
WT and treatment HEAT - with just one replicate per condition. Thus we
end up with two samples.
We have made one mistake, however. When we mixed the samples, we ended
up placing twice as much DNA for the WT condition as for the HEAT treatment.
We can still tell the samples apart since they were barcoded. We just added
and sequenced twice as much DNA material for WT as for HEAT. By the
end of the process, the instrument has produced twice as many reads for WT
as for HEAT. All reads align perfectly.
24.3 The Turn
Your task is the following:
Come up with the numbers for read counts that represent the situation
explained above. You can make up the numbers - the goal is to make them
express what you know based on the description above.
Create a 3x2 count table that shows the read counts. Each ? will need to
be replaced with a number; thus you have to come up with six numbers. That
is the puzzle. Fill in the matrix below:
ID WT HEAT
A ? ?
B ? ?
C ? ?
24.4 The Prestige
Show that your numbers work. When you can answer all of the questions
below, you understand how RNA-Seq works.
• How do the transcript lengths manifest in the numbers in your matrix?
• How can you tell from your data that you placed twice as much WT
material in the instrument?
• What is the CPM for each gene under each condition?
• What is the RPKM for each gene under each condition?
• What is the TPM for each gene under each condition?
• How can you tell that gene A expresses at twice the level of gene B
within the WT sample?
• Can you tell which gene’s expression level changed between WT and
HEAT?
Now think about this:
• How many reads would you need to sequence for the CPM to be a “nice”
number?
• How many reads would you need to sequence for the RPKM to be a “nice”
number?
• How many reads would you need to sequence for the TPM to be a “nice”
number?
• Does it make any sense to introduce measures like the ones above, with
arbitrary scaling factors, just to make the numbers look “nice”?
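To make the three measures concrete without spoiling the puzzle, here is a sketch of the CPM, RPKM, and TPM formulas applied to made-up counts (these numbers are not the puzzle’s answer, and the tiny library size makes the values look extreme):

```shell
# Made-up input columns: id, transcript length (bp), read count.
printf 'A\t10\t20\nB\t100\t100\nC\t1000\t880\n' | awk -F '\t' '
{
    id[NR] = $1; len[NR] = $2; cnt[NR] = $3
    total += $3                  # library size (total reads)
    rate[NR] = $3 / $2           # reads per base, used by TPM
    ratesum += rate[NR]
}
END {
    for (i = 1; i <= NR; i++) {
        cpm  = cnt[i] / total * 1e6              # counts per million
        rpkm = cnt[i] * 1e9 / (total * len[i])   # reads per kb per million
        tpm  = rate[i] / ratesum * 1e6           # transcripts per million
        printf "%s CPM=%.0f RPKM=%.0f TPM=%.0f\n", id[i], cpm, rpkm, tpm
    }
}'
```

Note that the TPM column always sums to one million by construction, regardless of the library size.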
24.5 How to solve it (a hint)
Solve it as you would a sudoku puzzle: start filling in one number at a time,
and see if you can generate all the others accordingly.
Chapter 25
The Bear Paradox
Park rangers hike through forests and observe the fauna until they identify
100 animals:
1. In the first forest, rangers counted 90 bears, one rabbit, and nine squirrels.
2. In the second forest, rangers counted 30 bears, 20 rabbits, and 50 squirrels.
Answer the following question:
• Are there more bears in the first forest than in the second?
Explore the following ideas:
• List what you think are valid interpretations of the counts above.
• List what you think might be a common yet invalid interpretation of
the counts above.
Draw a parallel to a sequencing instrument that works by randomly measuring
a predetermined number of DNA fragments from a sample.
• Can we tell how many DNA molecules were there in total?
• What if the sample contains DNA from a large number of organisms,
each with a different length of DNA?
• Would there be a difference in how we interpreted the data if a sequencer
were able to sequence all the DNA placed in the instrument?
What if the park rangers counted patches of fur? Instead of seeing 30 bears,
they found 30 patches of bear fur. How would your interpretation change?
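The sampling parallel can be sketched directly: scale two very different true populations with the same proportions down to a fixed 100 observations, and the reports become indistinguishable (the population sizes below are made up):

```shell
# Two true populations with identical proportions (90:1:9) but sizes
# that differ a hundredfold; columns are bears, rabbits, squirrels.
for pop in "90 1 9" "9000 100 900"; do
    echo $pop | awk '{
        total = $1 + $2 + $3
        printf "bears=%d rabbits=%d squirrels=%d\n",
               100 * $1 / total, 100 * $2 / total, 100 * $3 / total
    }'
done
```

Both populations produce the very same report of 100 animals, which is exactly why a fixed-capacity sequencer reveals proportions rather than absolute amounts.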
25.1 Create realistic data
Imagine that you have an instrument that can sequence 1000 reads. An
organism can be observed in two states:
1. Wild type (WT): transcript A expresses with 800 copies, transcript B
expresses with 200 copies
2. Cold Shock (SHOCK): transcript A expresses with 800 copies, transcript
B expresses with 200,000 copies
Note how transcript A is unaffected; it expresses at the same level in both
states.
Imagine that you ran an RNA-Seq experiment. What would a quantification
matrix look like that captures the expected changes above?
WT SHOCK FOLD_CHANGE
A
B
Could you produce another matrix that, even though technically accurate,
cannot inform us of the variation?
Part V
CODE
Chapter 26
Code: Mission Impossible
RNA-Seq
Your task, should you choose to accept it, is to run an RNA-Seq analysis in
less than 60 seconds.
26.1 Bioinformatics environment
The code we list below assumes that you have already set up your main
bioinfo environment as presented in the book.
If you wish to create a new environment then use the following commands:
# Create a new rnaseq-specific environment.
mamba create -n rnaseq
# Activate the rnaseq environment.
conda activate rnaseq
# Install the rnaseq packages.
mamba install wget parallel samtools subread hisat2 salmon r-gplots \
bioconductor-tximport bioconductor-edger bioconductor-biomart bioconductor-dese
26.2 Obtain the recipes
For each project you should download a separate copy of all the scripts. Edit
and customize the scripts for your project.
# Obtain the biostar handbook rnaseq scripts.
curl -O [Link]
# Unpack the code.
tar -xzvf [Link]
A new directory called code will be created that contains the scripts.
26.3 Command line usage
The Makefile is located in code/[Link]
# Link the makefile locally under a simpler name
ln -sf code/[Link] Makefile
# First obtain the data
make data
# Index the data
make index
# Generate counts from alignments
make align
# Generate differential expression from alignment output
make results
# Generate counts from classification
make classify
# Generate differential expression from classification output
make results
Explore the source code for the recipe to see how the commands are
constructed.
26.4 Code listing
#
# This Makefile performs the Mission Impossible RNA-Seq analysis.
#
#
# More info in the Biostar Handbook volume RNA-Seq by Example
#
# The reference genome.
REF = refs/[Link]
# The file containing transcripts.
TRX = refs/[Link]
# The name of the HISAT2 index.
HISAT2_INDEX = idx/genome
# The name of the salmon index
SALMON_INDEX = idx/[Link]
# The design file.
DESIGN = [Link]
# These targets are not files.
.PHONY: usage data index align classify results install
# Tell the user to read the source of this Makefile to understand it.
usage:
@echo "#"
@echo "# Use the source, Luke!"
@echo "#"
# Create the design file and [Link] file.
${DESIGN}:
# The design file
@echo "sample,condition" > [Link]
@echo "BORED_1,bored" >> [Link]
@echo "BORED_2,bored" >> [Link]
@echo "BORED_3,bored" >> [Link]
@echo "EXCITED_1,excited" >> [Link]
@echo "EXCITED_2,excited" >> [Link]
@echo "EXCITED_3,excited" >> [Link]
# Create the [Link] file (first column only, delete first line)
cat [Link] | cut -f1 -d , | sed 1d > [Link]
# Download the data for the analysis.
data:
# Download the reference genome.
wget -nc [Link]
# Unpack the reference genome.
tar xzvf [Link]
# Download the sequencing reads.
wget -nc [Link]
# Unpack the sequencing reads.
tar zxvf [Link]
# Build the HISAT2 and salmon index for the reference genomes
index: ${REF}
mkdir -p idx
hisat2-build ${REF} ${HISAT2_INDEX}
salmon index -t ${TRX} -i ${SALMON_INDEX}
# Runs a HISAT2 alignment.
align: ${DESIGN} ${HISAT2_INDEX_FILE}
mkdir -p bam
cat [Link] | parallel --progress --verbose "hisat2 -x ${HISAT2_INDEX} -1 reads/{
cat [Link] | parallel -j 1 echo "bam/{}.bam" | \
xargs featureCounts -p -a refs/[Link] -o [Link]
Rscript code/parse_featurecounts.r
# Run a SALMON classification.
classify: ${DESIGN} ${SALMON_INDEX}
mkdir -p salmon
cat [Link] | parallel --progress --verbose "salmon quant -i ${SALMON_INDEX} -l A
Rscript code/combine_transcripts.r
# Runs an analysis on the aligned data.
results:
mkdir -p res
Rscript code/deseq2.r
Rscript code/create_heatmap.r
# Required software
install:
@echo ""
@echo mamba install wget parallel samtools subread hisat2 salmon bioconductor-txi
@echo ""