
Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis on Text Files
Leandro Freitas
10/26/2017

1. Executive Summary
The goal of this project is to perform an exploratory data analysis on text files as part of the Week 2 activities of the Data Science Specialization SwiftKey Capstone. The data for the analysis can be downloaded from the link below:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
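For reproducibility, a minimal download-and-extract sketch is shown below. The final/en_US/ paths are an assumption about the zip's internal layout, and ./source matches the directory used by the readLines() calls in section 2.2:

# Assumption: the zip stores the English files under final/en_US/
if (!file.exists("./source/en_US.blogs.txt")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip",
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        exdir = "./source", junkpaths = TRUE)
}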

2. Preparing Environment
2.1. Loading Libraries
Loading required packages:

set.seed(500) # fix the seed so the sampling in section 2.3 is reproducible
library(ggplot2)
library(knitr)
library(RWeka)
library(SnowballC)
library(tm)
library(wordcloud)

Session information for reproducibility:

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
## [5] LC_TIME=Portuguese_Brazil.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.5 RColorBrewer_1.1-2 tm_0.7-1
## [4] NLP_0.1-11 SnowballC_0.5.1 RWeka_0.4-34
## [7] knitr_1.17 ggplot2_2.2.1 RevoUtilsMath_10.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 magrittr_1.5 RWekajars_3.9.1-3
## [4] munsell_0.4.3 colorspace_1.3-2 rlang_0.1.2
## [7] stringr_1.2.0 plyr_1.8.4 tools_3.4.1
## [10] parallel_3.4.1 grid_3.4.1 gtable_0.2.0
## [13] htmltools_0.3.6 yaml_2.1.14 lazyeval_0.2.0
## [16] rprojroot_1.2 digest_0.6.12 tibble_1.3.4
## [19] rJava_0.9-8 slam_0.1-40 evaluate_0.10.1
## [22] rmarkdown_1.6 stringi_1.1.5 compiler_3.4.1
## [25] RevoUtils_10.0.5 scales_0.5.0 backports_1.1.0

2.2. Loading Datasets


# Read text files
Blogs <- readLines("./source/en_US.blogs.txt")
News <- readLines("./source/en_US.news.txt")
Twitter <- readLines("./source/en_US.twitter.txt")
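Depending on the platform, readLines() may warn about embedded nul characters or an incomplete final line in these files. If that happens, a hedged workaround (base R options, not part of the original run):

# Assumption: only needed if readLines() reports embedded nuls
News <- readLines("./source/en_US.news.txt", skipNul = TRUE, warn = FALSE)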

2.2.1. Basic summaries of the three files

Blogs_Summary <- c(sum(nchar(Blogs)),
                   length(unlist(strsplit(Blogs, " "))),
                   format(object.size(Blogs), units = "Mb"))

News_Summary <- c(sum(nchar(News)),
                  length(unlist(strsplit(News, " "))),
                  format(object.size(News), units = "Mb"))

Twitter_Summary <- c(sum(nchar(Twitter)),
                     length(unlist(strsplit(Twitter, " "))),
                     format(object.size(Twitter), units = "Mb"))

var_names <- c("Characters", "Words", "Size")

summary_files <- data.frame(Blogs_Summary, News_Summary, Twitter_Summary, row.names = var_names)
names(summary_files) <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
kable(summary_files, align = "c")

             en_US.blogs.txt    en_US.news.txt    en_US.twitter.txt
Characters       208361438          15683765            162384825
Words             37334131           2643969             30373543
Size              248.5 Mb           19.2 Mb             301.4 Mb
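Line counts are not included in the table above; since readLines() returns one vector element per line, they can be added with a one-liner if needed:

# Number of lines in each file (one vector element per line)
sapply(list(Blogs = Blogs, News = News, Twitter = Twitter), length)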

2.3. Preparing Data


2.3.1. Sampling and Corpus
Since the source files are large, a random sample of 10,000 lines will be taken from each one for the analysis:

Sample_Text <- rbind(sample(Blogs, 10000),
                     sample(News, 10000),
                     sample(Twitter, 10000))

# Remove the large source objects, which are no longer needed
rm(Blogs, News, Twitter)

Now create a corpus (collection of text documents) from the sample texts:

Corpus_ST <- Corpus(VectorSource(Sample_Text))

2.3.2. Clean and prep data for analysis


Corpus_ST <- tm_map(Corpus_ST, removeWords, stopwords("english"))
Corpus_ST <- tm_map(Corpus_ST, removePunctuation)
Corpus_ST <- tm_map(Corpus_ST, removeNumbers)
Corpus_ST <- tm_map(Corpus_ST, stripWhitespace)
Corpus_ST <- tm_map(Corpus_ST, tolower)
Corpus_ST <- tm_map(Corpus_ST, stemDocument)
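One caveat about the order above: removeWords runs before tolower, so capitalized stopwords such as "The" and "I" survive the filter, which is why "the" and bigrams starting with "i" still dominate the results in section 3. Recent versions of tm also expect base functions like tolower to be wrapped in content_transformer(). A corrected sketch of the same pipeline:

# Lowercase first so that stopword removal catches "The", "I", etc.
Corpus_ST <- tm_map(Corpus_ST, content_transformer(tolower))
Corpus_ST <- tm_map(Corpus_ST, removeWords, stopwords("english"))
Corpus_ST <- tm_map(Corpus_ST, removePunctuation)
Corpus_ST <- tm_map(Corpus_ST, removeNumbers)
Corpus_ST <- tm_map(Corpus_ST, stripWhitespace)
Corpus_ST <- tm_map(Corpus_ST, stemDocument)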

3. Exploratory Data Analysis


3.1. Finding n-grams
# Tokenize the corpus into n-grams of length i and tabulate their frequencies
f_tokenizer <- function(corpus, i) {
  temp <- NGramTokenizer(corpus, Weka_control(min = i, max = i))
  data.frame(table(temp))
}

# Find n-grams
ngram_US_2 <- f_tokenizer(Corpus_ST, 2)
ngram_US_4 <- f_tokenizer(Corpus_ST, 4)

3.1.1. Most used sequences of 2 and 4 words


ngram_US_2 <- ngram_US_2[order(ngram_US_2$Freq, decreasing = TRUE),]
ngram_US_4 <- ngram_US_4[order(ngram_US_4$Freq, decreasing = TRUE),]

head(ngram_US_2, 10)

## temp Freq
## 168895 i think 543
## 168103 i know 394
## 168164 i love 326
## 169016 i want 314
## 167449 i can 308
## 169058 i will 273
## 168083 i just 236
## 194669 last year 231
## 402541 year ago 186
## 168141 i like 175

head(ngram_US_4, 10)

## temp Freq
## 283048 me me me me 36
## 207375 i feel like i 16
## 479751 ugli ugli ugli ugli 14
## 207482 i felt like i 7
## 209752 i know i know 7
## 214658 i think i can 7
## 451947 the new york time 7
## 206733 i don’t know i 5
## 214729 i think im go 5
## 208866 i hope i can 4

3.1.2. Plot most used sequences of 2 words


Bigrams <- ngram_US_2[order(ngram_US_2$Freq, decreasing = TRUE), ]
colnames(Bigrams) <- c("Bigram", "Frequency")
Bigrams <- Bigrams[1:10, ]

barplot(Bigrams$Frequency, las = 2,
names.arg = Bigrams$Bigram,
col ="lightgreen", main ="Top 10 Bigrams",
ylab = "Frequency")
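ggplot2 is loaded in section 2.1 but not used; for reference, an equivalent plot with it would look like this (same data, different graphics system):

# Same top-10 bigrams, drawn with ggplot2 instead of base graphics
ggplot(Bigrams, aes(x = reorder(Bigram, -Frequency), y = Frequency)) +
  geom_col(fill = "lightgreen") +
  labs(title = "Top 10 Bigrams", x = NULL, y = "Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))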

3.1.3. Plot most used sequences of 4 words


Quadgrams <- ngram_US_4[order(ngram_US_4$Freq, decreasing = TRUE), ]
colnames(Quadgrams) <- c("Quadgram", "Frequency")
Quadgrams <- Quadgrams[1:10, ]

barplot(Quadgrams$Frequency, las = 2,
        names.arg = Quadgrams$Quadgram,
        col = "lightblue", main = "Top 10 Quadgrams",
        ylab = "Frequency")

3.2. Most Common Words


3.2.1. Top 50 words used in the texts
Matrix_US <- DocumentTermMatrix(Corpus_ST)
Matrix_US <- removeSparseTerms(Matrix_US, 0.99) # drop terms absent from more than 99% of documents
frequency <- colSums(as.matrix(Matrix_US))
order_freq <- order(frequency, decreasing = TRUE)
frequency[head(order_freq, 50)]

## the one will said get like just time can year
## 4988 2854 2848 2838 2282 2245 2225 2182 2093 2037
## make day new work know now good love say peopl
## 1775 1641 1547 1528 1418 1359 1352 1337 1311 1302
## want think also use but look first see thing back
## 1297 1277 1267 1244 1199 1190 1186 1186 1156 1150
## two and need come last take even way much this
## 1147 1142 1127 1126 1124 1086 1072 1057 957 956
## week state start realli well right still great play game
## 924 919 918 910 904 872 864 823 818 816

3.2.2. Word Cloud


colors <- c("blue", "red", "orange", "green")
wordcloud(names(frequency), frequency, max.words = 50, min.freq = 2, colors = colors)

4. Future Actions
My goal for the eventual app and algorithm is to create a Shiny version of the word prediction/completion apps available on cell phones.
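As a rough illustration of how the bigram table from section 3.1 could back such a predictor, here is a hypothetical lookup (predict_next is an illustrative name, not part of this analysis):

# Hypothetical helper: rank continuations of a word by bigram frequency.
# ngram_US_2 has columns 'temp' (the bigram text) and 'Freq' (its count).
predict_next <- function(word, bigrams = ngram_US_2, n = 3) {
  hits <- bigrams[grepl(paste0("^", word, " "), bigrams$temp), ]
  hits <- hits[order(hits$Freq, decreasing = TRUE), ]
  head(sub(paste0("^", word, " "), "", hits$temp), n)
}
predict_next("i")  # e.g. "think", "know", "love", per the counts above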
