Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis On Text Files
1. Executive Summary
The goal of this project is to perform an exploratory data analysis of text files as part of the Week 2 activities of the
Data Science Specialization SwiftKey Capstone. The data for the analysis can be downloaded from the link below:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
2. Preparing Environment
2.1. Loading Libraries
Loading required packages:
set.seed(500)
library(ggplot2)
library(knitr)
library(RWeka)
library(SnowballC)
library(tm)
library(wordcloud)
Complementary information:
sessionInfo()
https://rstudio-pubs-static.s3.amazonaws.com/323145_6c395a8d69e6441d90c3abd94f67a5ce.html 1/7
1/18/2018 Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis on Text Files
Now create a corpus (collection of text documents) from the sample texts:
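The code for this step did not survive in the report, so the following is only a minimal sketch of how such a corpus could be built with the tm package. The name Corpus_ST is taken from the n-gram code later in the report; sample_texts and the particular cleaning steps are assumptions for illustration. The stemmed forms in the output further down (for example "peopl" and "realli") suggest that the original pipeline also stemmed the text, a step omitted here.

```r
library(tm)

# Hypothetical stand-in for the sampled lines (not the author's data)
sample_texts <- c("I think I can do this!",
                  "Last year was GREAT, 2 years ago was not.")

# Build a corpus with one document per sampled line
Corpus_ST <- VCorpus(VectorSource(sample_texts))

# Assumed cleaning steps: lower-case, then drop punctuation, digits, extra spaces
Corpus_ST <- tm_map(Corpus_ST, content_transformer(tolower))
Corpus_ST <- tm_map(Corpus_ST, removePunctuation)
Corpus_ST <- tm_map(Corpus_ST, removeNumbers)
Corpus_ST <- tm_map(Corpus_ST, stripWhitespace)
```

Each document's cleaned text can then be inspected with content(Corpus_ST[[1]]).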
# Find n-grams
ngram_US_2 <- f_tokenizer(Corpus_ST, 2)
ngram_US_4 <- f_tokenizer(Corpus_ST, 4)
head(ngram_US_2, 10)
## temp Freq
## 168895 i think 543
## 168103 i know 394
## 168164 i love 326
## 169016 i want 314
## 167449 i can 308
## 169058 i will 273
## 168083 i just 236
## 194669 last year 231
## 402541 year ago 186
## 168141 i like 175
head(ngram_US_4, 10)
## temp Freq
## 283048 me me me me 36
## 207375 i feel like i 16
## 479751 ugli ugli ugli ugli 14
## 207482 i felt like i 7
## 209752 i know i know 7
## 214658 i think i can 7
## 451947 the new york time 7
## 206733 i don’t know i 5
## 214729 i think im go 5
## 208866 i hope i can 4
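The definition of f_tokenizer is not included in the report. The tables above have columns temp and Freq and are sorted by decreasing frequency, so a function of roughly the following shape would produce similar output. This is a base-R sketch under assumptions: the name f_tokenizer_sketch is hypothetical, it takes a plain character vector rather than a tm corpus, and it splits on whitespace instead of using an RWeka tokenizer.

```r
# Hypothetical base-R stand-in for f_tokenizer: counts word n-grams per text
f_tokenizer_sketch <- function(texts, n) {
  grams <- unlist(lapply(texts, function(txt) {
    words <- strsplit(trimws(txt), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of n words over the text
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(temp = character(0), Freq = integer(0)))
  }
  temp <- grams
  # table() yields the 'temp' and 'Freq' columns seen in the output above
  freq <- as.data.frame(table(temp), stringsAsFactors = FALSE)
  freq[order(-freq$Freq), ]
}

# Usage on two toy sentences
bigrams_demo <- f_tokenizer_sketch(c("i think i can", "i think so"), 2)
```

Because the n-grams are formed within each text separately, no n-gram spans two documents.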
barplot(Bigrams$Frequency, las = 2,
names.arg = Bigrams$Bigram,
col ="lightgreen", main ="Top 10 Bigrams",
ylab = "Frequency")
barplot(Quadgrams[1:10,]$Frequency, las = 2,
names.arg = Quadgrams[1:10,]$Quadgram,
col ="lightblue", main ="Top 10 Quadgrams",
ylab = "Frequency")
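The barplot calls rely on data frames named Bigrams and Quadgrams with columns Bigram/Quadgram and Frequency, whose construction is not shown in the report. One way they might be derived from the n-gram tables is sketched below; the helper top_ngrams is hypothetical, and the demo input simply reuses the ten bigram rows printed earlier.

```r
# Hypothetical helper: keep the top rows and rename columns for plotting
top_ngrams <- function(ngram_table, label, top = 10) {
  top_rows <- head(ngram_table, top)
  out <- data.frame(as.character(top_rows$temp), top_rows$Freq,
                    stringsAsFactors = FALSE)
  names(out) <- c(label, "Frequency")
  out
}

# The ten bigram rows printed earlier in the report
ngram_demo <- data.frame(
  temp = c("i think", "i know", "i love", "i want", "i can",
           "i will", "i just", "last year", "year ago", "i like"),
  Freq = c(543, 394, 326, 314, 308, 273, 236, 231, 186, 175)
)

Bigrams <- top_ngrams(ngram_demo, "Bigram")
```

The same helper with label "Quadgram" would give a ten-row Quadgrams table in the form the second barplot call expects.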
## the one will said get like just time can year
## 4988 2854 2848 2838 2282 2245 2225 2182 2093 2037
## make day new work know now good love say peopl
## 1775 1641 1547 1528 1418 1359 1352 1337 1311 1302
## want think also use but look first see thing back
## 1297 1277 1267 1244 1199 1190 1186 1186 1156 1150
## two and need come last take even way much this
## 1147 1142 1127 1126 1124 1086 1072 1057 957 956
## week state start realli well right still great play game
## 924 919 918 910 904 872 864 823 818 816
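The code that produced the word counts above is missing from the report. A named, frequency-sorted vector of this kind can be obtained in several ways; the sketch below uses base R only and is not necessarily the author's method (a tm term-document matrix is another plausible route). The function name word_freq_sketch and the toy input are assumptions.

```r
# Hypothetical base-R sketch: most frequent words in a set of cleaned texts
word_freq_sketch <- function(texts, top = 50) {
  words <- unlist(strsplit(texts, "\\s+"))
  words <- words[nzchar(words)]
  # Sort counts in decreasing order and keep the first 'top' entries
  head(sort(table(words), decreasing = TRUE), top)
}

# Usage on two toy lines
freq_demo <- word_freq_sketch(c("the year the time", "the year"))
```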
4. Future Actions
My goal for the eventual app and algorithm is to create a “Shiny version” of the word prediction/completion apps
available for cell phones.