Data Wrangling with R: Using the Tidyverse
Alexandra Emmons, PhD
Bioinformatics Training and Education Program (BTEP)
8 lessons directed toward data wrangling
• L1: Introduction to R, RStudio, and the Tidyverse
• L2: Getting Started, the basics
• L3: Loading and reshaping data
• L4: Data visualization with ggplot2
• L5: The pipe, filtering, and joining data tables
• L6: Split, apply, combine
• L7: Introduction to Bioconductor -omics classes (containers)
• L8: Data Wrangling Review and Practice
No Coding
Coding
What is R?
• A language and statistical computing environment
• Open source, for and by scientists
• Widespread community
• Extended use through package installation
• R packages are collections of R functions, compiled code, and sample data
Why should we use R?
• Great for statistical analysis, data visualization, and report generation
• Supports large scale data analysis
• Removes some of the human error associated with Excel
• Ever growing community
• Many ways to get help
• Field specific packages and workflows
• Problems are “googlable”
Where do we find R packages?
• Comprehensive R Archive Network (CRAN)
• Bioconductor
• GitHub
Check out METACRAN
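As a rough sketch of how packages from these sources are typically installed and loaded: the package names and the CRAN mirror URL below are only illustrative choices, not course requirements, and on DNAnexus the packages are already installed, so only the library() call should be needed there.

```r
# CRAN: install once per machine, only if the package is missing.
# The mirror URL is an illustrative choice; any CRAN mirror works.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Bioconductor: packages are installed through the BiocManager package.
# Left commented out because it downloads a large set of dependencies.
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager", repos = "https://cloud.r-project.org")
# }
# BiocManager::install("SummarizedExperiment")

# GitHub: development versions can be installed with the remotes package.
# remotes::install_github("tidyverse/dplyr")

# Load the installed package into the current session (needed every session).
library(dplyr)
```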
What is RStudio?
An integrated development environment (IDE) for R
Includes a console, code editor, and tools for plotting, history, debugging, and workspace management.
Open source and can be installed locally or used through a browser (RStudio Server, Posit Cloud)
DNAnexus
• A cloud-based platform for next-generation sequencing analysis for which CCR
has a "site license"
• We will be using this platform to provide a uniform, stable,
preinstalled interface for R training.
• Uses RStudio Server
• Integrates course-notes
• R packages installed and ready to use
• The data ready to use and in one place; no need to download
Course registrants, please fill out this form with your DNAnexus
information.
Let’s take a tour of the RStudio IDE
Data Wrangling
Best Practices for data analysis
1. Keep raw data separate from analyzed data.
2. Keep spreadsheet data tidy (or as tidy as possible).
3. Trust but verify.
--- From [Link]
factors-dataframes/[Link]
What is tidy data?
**Having tidy data is useful but not always necessary. Do not worry about strict adherence to the rules. Your data should be in whatever format makes your life easier for analysis.**
Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration
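A minimal sketch of what tidying can look like, assuming the usual tidy-data convention that each variable is a column and each observation is a row. The gene names, sample names, and counts are made up for illustration.

```r
library(tidyr)

# "Wide" table: one column per sample, so the variable "sample"
# is hidden in the column names (hypothetical data).
wide <- data.frame(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 5),
  sample2 = c(12, 7)
)

# Tidy ("long") table: one row per gene-sample observation,
# with sample and count each in their own column.
long <- pivot_longer(
  wide,
  cols      = c(sample1, sample2),
  names_to  = "sample",
  values_to = "count"
)

long
```

Either layout may be the more convenient one for a given analysis; pivot_wider() in tidyr goes in the opposite direction.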
Guidelines to keep spreadsheets tidy
• Be consistent
• Choose meaningful names for things; no spaces
• Write dates as YYYY-MM-DD
• No empty cells
• Put just one thing in a cell
• Don’t use font color or highlighting as data
• Save the data as plain text files
--- [Link]
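Two of these guidelines can be illustrated with a short sketch: dates written as YYYY-MM-DD are parsed by as.Date() without any extra format string, and a plain text (CSV) file can be written and read back with base R. The sample IDs, dates, and temporary file are hypothetical.

```r
# Hypothetical sample sheet with consistent names and ISO dates.
samples <- data.frame(
  sample_id       = c("s1", "s2"),
  collection_date = c("2023-01-31", "2023-02-14")
)

# Save as a plain text CSV file (a temporary file is used here).
csv_file <- tempfile(fileext = ".csv")
write.csv(samples, csv_file, row.names = FALSE)

# Read it back and convert the date column; no format string is needed
# because the dates are already written as YYYY-MM-DD.
back <- read.csv(csv_file)
back$collection_date <- as.Date(back$collection_date)

# Date arithmetic now works: days between the two collections.
back$collection_date[2] - back$collection_date[1]
```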
Tidyverse
• An opinionated collection of R
packages designed for data
science. All packages share an
underlying design philosophy,
grammar, and data structures. ---
[Link]
• Core packages:
• dplyr, ggplot2, forcats, tibble,
readr, stringr, tidyr, and purrr
What is data wrangling?
• Data wrangling is a catch-all phrase for cleaning, transforming, and summarizing data
• The primary packages we will focus on for this purpose are tidyr and dplyr.
Diagram labels: Tidy; Transform; Summarize; Visualize & Model
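The clean, transform, and summarize steps might look roughly like this with dplyr. This is a sketch on made-up counts, not the course dataset; later lessons cover these functions (and the pipe) properly.

```r
library(dplyr)

# Hypothetical expression counts for two genes under two treatments.
counts <- data.frame(
  gene      = c("geneA", "geneA", "geneB", "geneB"),
  treatment = c("control", "treated", "control", "treated"),
  count     = c(10, 30, 5, 0)
)

# Clean: drop rows with zero counts.
expressed <- filter(counts, count > 0)

# Transform: add a log-scaled column.
expressed <- mutate(expressed, log_count = log2(count + 1))

# Summarize: mean count per gene.
by_gene <- summarise(group_by(expressed, gene), mean_count = mean(count))

by_gene
```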
Getting Help
Stack Overflow and other forums
• Public Q&A platform
• Ex: [Link]
sample-names-unique-according-to-conditions
Vignettes
• Ex: browseVignettes(package = "dplyr")
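Help is also available from inside R itself. A few commonly used calls are sketched below; the dplyr lines assume that package is installed, and the browser-based call is guarded so it only runs in an interactive session.

```r
# Open the help page for a function (same as typing ?mean in the console).
mean_help <- help("mean")

# Search installed help pages by keyword when the function name is unknown.
# help.search("standard deviation")

# List the vignettes shipped with a package (assumes dplyr is installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_vignettes <- vignette(package = "dplyr")
}

# Open the vignette index in a web browser; only sensible interactively.
if (interactive()) {
  browseVignettes(package = "dplyr")
}
```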
Tutorials
Coursera
• JHU Tidyverse Skills for Data Science in R Specialization
• Introduction to the Tidyverse
• Importing Data in the Tidyverse
• Wrangling Data in the Tidyverse
• Visualizing Data in the Tidyverse
• Modeling Data in the Tidyverse
Dataquest
• Intro to data analysis in R
• Data Visualization in R
• Data Cleaning in R
Bioconductor tutorials / workflows
Many others…
• glittr
For a Coursera or Dataquest license go to [Link]
Course Materials
Materials for each lesson will be found at
[Link]
Course materials will be updated prior to each lesson.
BTEP
Other R courses
• R Introductory Series
• Data Visualization with R
Email us
• ncibtep@[Link]
BTEP Coding Club
• Once a month
• Tailored bioinformatics training to the NCI community.
• 1-hour demo / tutorial of a bioinformatics tool, software, skill, or
platform.
• Ranges in experience level from beginner to advanced.
• Email us at ncibtep@[Link] if there is a specific topic you would
like to see featured.
Check out past events here:
[Link]
Helpful things to know before
getting started
Terms to Know
• Function - code written to perform a specific task
• Example: getwd()
• String – a sequence of one or more characters
• Enclosed by quotation marks
• Data frame – object that stores tabular data; all variables are of the same
length
• Directory – location where files are stored
• Working directory – your current directory
• Package – the fundamental unit of shareable code, bundling together code,
data, documentation, and tests. This is how we extend the use of R.
• Library – a directory of installed packages
• Example: library(dplyr) loads the dplyr package from your library
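Several of these terms can be seen in a few lines of base R. The object names and values below are made up for illustration.

```r
# Function: getwd() is a function; calling it returns the working directory.
current_dir <- getwd()

# String: a sequence of characters enclosed in quotation marks.
sample_name <- "liver_01"
class(sample_name)   # "character"
nchar(sample_name)   # number of characters in the string

# Data frame: tabular data where every column has the same length.
reads <- data.frame(
  sample = c("a", "b", "c"),
  reads  = c(100, 250, 75)
)
dim(reads)           # rows, then columns
```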
Directory Structures
• A file path shows us the location of a file. These are nested structures.
• .libPaths()
• Will show us the location of installed R packages
• For example:
• [1] "/Library/Frameworks/[Link]/Versions/4.1/Resources/library"
• Absolute file path
• The complete file path
• Relative file path
• A shortcut path from some other directory
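A short sketch of absolute versus relative paths in R. The data/counts.csv file is hypothetical and does not need to exist for this code to run.

```r
# Where R looks for installed packages (one or more directories).
lib_dirs <- .libPaths()

# A relative path is interpreted starting from the working directory.
rel_path <- file.path("data", "counts.csv")   # hypothetical file

# The matching absolute path spells out the full location.
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

Relative paths keep a project portable between machines, as long as the working directory is set to the project folder.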
In summary
Today we…
• Learned about advantages of R and RStudio
• Navigated the RStudio environment (DNAnexus)
• Learned about concepts related to data wrangling
• Reviewed resources available for getting help
Next time…
• Get ready for some coding fun in RStudio
• Learn R basics