forked from satijalab/seurat
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfuture_vignette.Rmd
131 lines (106 loc) · 5.91 KB
/
future_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "Parallelization in Seurat with future"
output:
html_document:
theme: united
df_print: kable
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
```{r setup, include=FALSE}
all_times <- list() # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
now <- NULL
function(before, options) {
if (before) {
now <<- Sys.time()
} else {
res <- difftime(Sys.time(), now, units = "secs")
all_times[[options$label]] <<- res
}
}
}))
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE,
time_it = TRUE
)
```
In Seurat, we have chosen to use the `future` framework for parallelization. In this vignette, we will demonstrate how you can take advantage of the `future` implementation of certain Seurat functions from a user's perspective. If you are interested in learning more about the `future` framework beyond what is described here, please see the package vignettes [here](https://cran.r-project.org/web/packages/future/index.html) for a comprehensive and detailed description.
# How to use parallelization in Seurat
To access the parallel version of functions in Seurat, you need to load the `future` package and set the `plan`. The `plan` will specify how the function is executed. The default behavior is to evaluate in a non-parallelized fashion (sequentially). To achieve parallel (asynchronous) behavior, we typically recommend the "multiprocess" strategy. By default, this uses all available cores but you can set the `workers` parameter to limit the number of concurrently active futures.
```{r future.setup}
library(future)
# check the current active plan
plan()
# change the current plan to access parallelization
plan("multiprocess", workers = 4)
plan()
```
# 'Futurized' functions in Seurat
The following functions have been written to take advantage of the future framework and will be parallelized if the current `plan` is set appropriately. Importantly, the way you call the function shouldn't change.
* `NormalizeData()`
* `ScaleData()`
* `JackStraw()`
* `FindMarkers()`
* `FindIntegrationAnchors()`
* `FindClusters()` - if clustering over multiple resolutions
For example, to run the parallel version of `FindMarkers()`, you simply need to set the plan and call the function as usual.
```{r demo}
library(Seurat)
pbmc <- readRDS("../data/pbmc3k_final.rds")
# Enable parallelization
plan('multiprocess', workers = 4)
markers <- FindMarkers(pbmc, ident.1 = "NK", verbose = FALSE)
```
# Comparison of sequential vs. parallel
Here we'll perform a brief comparison the runtimes for the same function calls with and without parallelization. Note that while we expect that using a parallelized strategy will decrease the runtimes of the functions listed above, the magnitude of that decrease will depend on many factors (e.g. the size of the dataset, the number of workers, specs of the system, the future strategy, etc). The following benchmarks were performed on a desktop computer running Ubuntu 16.04.5 LTS with an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz and 96 GB of RAM.
<details>
<summary>**Click to see bencharking code**</summary>
```{r compare}
timing.comparisons <- data.frame(fxn = character(), time = numeric(), strategy = character())
plan("sequential")
start <- Sys.time()
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt", verbose = FALSE)
end <- Sys.time()
timing.comparisons <- rbind(timing.comparisons, data.frame(fxn = "ScaleData", time = as.numeric(end - start, units = "secs"), strategy = "sequential"))
start <- Sys.time()
markers <- FindMarkers(pbmc, ident.1 = "NK", verbose = FALSE)
end <- Sys.time()
timing.comparisons <- rbind(timing.comparisons, data.frame(fxn = "FindMarkers", time = as.numeric(end - start, units = "secs"), strategy = "sequential"))
plan("multiprocess", workers = 4)
start <- Sys.time()
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt", verbose = FALSE)
end <- Sys.time()
timing.comparisons <- rbind(timing.comparisons, data.frame(fxn = "ScaleData", time = as.numeric(end - start, units = "secs"), strategy = "multiprocess"))
start <- Sys.time()
markers <- FindMarkers(pbmc, ident.1 = "NK", verbose = FALSE)
end <- Sys.time()
timing.comparisons <- rbind(timing.comparisons, data.frame(fxn = "FindMarkers", time = as.numeric(end - start, units = "secs"), strategy = "multiprocess"))
```
</details>
```{r viz.compare}
library(ggplot2)
library(cowplot)
ggplot(timing.comparisons, aes(fxn, time)) + geom_bar(aes(fill = strategy), stat = "identity", position = "dodge") + ylab("Time(s)") + xlab("Function") + theme_cowplot()
```
# Frequently asked questions
1. **Where did my progress bar go?**
<br>Unfortunately, the when running these functions in any of the parallel plan modes you will lose the progress bar. This is due to some technical limitations in the `future` framework and R generally. If you want to monitor function progress, you'll need to forgo parallelization and use `plan("sequential")`.
2. **What should I do if I keep seeing the following error?**
```
Error in getGlobalsAndPackages(expr, envir = envir, globals = TRUE) :
The total size of the X globals that need to be exported for the future expression ('FUN()') is X GiB.
This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The X largest globals are ...
```
For certain functions, each worker needs access to certain global variables. If these are larger than the default limit, you will see this error. To get around this, you can set `options(future.globals.maxSize = X)`, where X is the maximum allowed size in bytes. So to set it to 1GB, you would run `options(future.globals.maxSize = 1000 * 1024^2)`. Note that this will increase your RAM usage so set this number mindfully.
```{r save.times, include = FALSE}
write.csv(x = t(as.data.frame(all_times)), file = "../output/timings/future_vignette_times.csv")
```
<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>