DATA SCIENCE Using R Notes
OF
DATA SCIENCE
(R22A6702)
B.TECH III YEAR–I SEM (R22)
(2024-2025)
PREPARED BY
N.PRAMEELA
MALLA REDDY COLLEGE OF ENGINEERING AND
TECHNOLOGY
UNIT –I
Introduction to Data Science
Data Science Process: Roles in a data science project, Stages in a data science project, Applications of
data science.
Overview of R: Basic Features of R, R installation, basic data types: Numeric, Integer, Complex, Logical,
Character. Data Structures: vectors, lists, matrices, array, data frames, factors.
Control Structures: if, if-else, for loop, while loop, next, break. Functions:
named arguments, default parameters, return values.
UNIT –II
Loading, Exploring and Managing Data
Working with data from files: Reading and writing data, reading data files with read.table(), Reading in
larger datasets with read.table(). Working with relational databases.
Data manipulation packages: dplyr, data.table, reshape2, tidyr, lubridate.
UNIT–III
Exploratory Data Analysis and Validation Approaches
Data validation: handling missing values, null values, duplicate values, outlier detection, data cleaning,
data loading and inspection, data transformation.
Cross validation: Validation set approach, leave one out cross validation, k-fold cross validation, repeated
k -fold cross validation.
UNIT – IV
Modelling Methods
Supervised: Regression Analysis in R, linear regression, logistic regression,
naive bayes classifier, decision tree, random forest, knn classifier,
Unsupervised: kmeans clustering, association rule mining, apriori algorithm.
UNIT – V
Data Visualization in R
Introduction to ggplot2: Univariate graphs: categorical, quantitative, bivariate graphs categorical
vs. categorical, quantitative vs quantitative, categorical vs. quantitative, multivariate graphs :
grouping, faceting.
TEXT BOOKS:
1. Practical Data Science with R, Nina Zumel & John Mount , Manning Publications
NY, 2014.
2. Beginning Data Science in R-Data Analysis, Visualization, and Modelling for the Data
Scientist -Thomas Mailund –Apress -2017.
REFERENCE BOOKS:
1. The Comprehensive R Archive Network-https://cran.r-project.org.
2. R for Data Science by Hadley Wickham and Garrett Grolemund , 2017 , Published by O
Reilly Media, Inc.
3. R Programming for Data Science -Roger D. Peng, 2015 , Lean Publishing.
4. https://rkabacoff.github.io/datavis/IntroGGPLOT.html.
COURSE OUTCOMES:
The students will be able to:
Analyze the basics in R programming in terms of constructs, control statements and
Functions.
Implement Data Preprocessing using R Libraries.
Apply the R programming from a statistical perspective and Modelling Methods.
Build regression models for a given problem.
Illustrate R programming tools for Graphs.
UNIT-I
Roles in a Data Science Project
PROJECT SPONSOR
The most important role in a data science project is the project sponsor. The sponsor is the person who
wants the data science result; generally they represent the business interests. The sponsor is responsible
for deciding whether the project is a success or failure. The ideal sponsor meets the following
condition: if they’re satisfied with the project outcome, then the project is by definition a success.
KEEP THE SPONSOR INFORMED AND INVOLVED
It’s critical to keep the sponsor informed and involved. Show them plans, progress, and intermediate
successes or failures in terms they can understand.
CLIENT
While the sponsor is the role that represents the business interest, the client is the role that represents
the model’s end users’ interests. The client is more hands-on than the sponsor; they’re the interface
between the technical details of building a good model and the day-to-day work process into which the
model will be deployed. They aren’t necessarily mathematically or statistically sophisticated, but are
familiar with the relevant business processes and serve as the domain expert on the team. As with the
sponsor, you should keep the client informed and involved. Ideally you’d like to have regular meetings
with them to keep your efforts aligned with the needs of the end users.
DATA SCIENTIST
The next role in a data science project is the data scientist, who’s responsible for taking all necessary
steps to make the project succeed, including setting the project strategy and keeping the client informed.
They design the project steps, pick the data sources, and pick the tools to be used. Since they
pick the techniques that will be tried, they have to be well informed about statistics and machine learning.
They’re also responsible for project planning and tracking, though they may do this with a project
management partner.
DATA ARCHITECT
The data architect is responsible for all of the data and its storage. Often this role is filled by someone
outside of the data science group, such as a database administrator or architect. Data architects often
manage data warehouses for many different projects, and they may only be available for quick
consultation.
OPERATIONS
The operations role is critical both in acquiring data and delivering the final results. The person filling
this role usually has operational responsibilities outside of the data science group. For example, if
you’re deploying a data science result that affects how products are sorted on an online shopping site,
then the person responsible for running the site will have a lot to say about how such a thing can be
deployed. This person will likely have constraints on response time, programming language, or data
size that you need to respect in deployment. The person in the operations role may already be
supporting your sponsor or your client, so they’re often easy to find.
The Lifecycle of Data Science
The major steps in the life cycle of Data Science project are as follows:
1. Problem identification
This is the crucial step in any Data Science project. The first task is to understand how Data Science
can be useful in the domain under consideration and to identify the appropriate tasks for it. Domain
experts and Data Scientists are the key persons in problem identification. The domain expert has
in-depth knowledge of the application domain and knows exactly what problem is to be solved. The
Data Scientist understands the domain and helps in identifying the problem and possible solutions to it.
2. Business Understanding
Business Understanding means understanding exactly what the customer wants from the business
perspective. Whether the customer wishes to make predictions, improve sales, minimise losses, or
optimise a particular process, these form the business goals. During business understanding, two
important steps are followed:
KPI (Key Performance Indicator)
For any data science project, key performance indicators define the performance or success of the
project. There needs to be an agreement between the customer and the data science project team
on business-related indicators and the related data science project goals. Depending on the business need,
the business indicators are devised, and accordingly the data science project team decides the goals
and indicators. For example, if the business need is to optimise the overall spending of the company,
then the data science goal may be to manage double the clients with the existing resources. Defining
the key performance indicators is crucial for any data science project, as the cost of the solution will
be different for different goals.
SLA (Service Level Agreement)
Once the performance indicators are set, finalizing the service level agreement is important. The service
level agreement terms are decided as per the business goals. For example, an airline reservation system
may require simultaneous processing of, say, 1000 users; that the product must satisfy this service
requirement is part of the service level agreement. Once the performance indicators are agreed and the
service level agreement is completed, the project proceeds to the next important step.
3. Collecting Data
Basic data collection can be done using surveys. Generally, the data collected through surveys
provides important insights. Much of the data is collected from the various processes followed in the
enterprise; at various steps the data is recorded in the software systems used in the enterprise, which is
important for understanding the process followed from product development to deployment
and delivery. The historical data available through archives is also important to better understand the
business. Transactional data also plays a vital role, as it is collected on a daily basis. Many statistical
methods are applied to the data to extract important information related to the business. In a data science
project the major role is played by data, so proper data collection methods are important.
4. Pre-processing data
Large amounts of data are collected from archives, daily transactions and intermediate records. The data is
available in various formats and forms; some data may even be available only in hard copy. The
data is scattered across various places on various servers. All these data are extracted, converted into a
single format and then processed. Typically, a data warehouse is constructed where the Extract,
Transform and Load (ETL) operations are carried out. In a data science project this
ETL operation is vital. The data architect role is important in this stage, deciding the
structure of the data warehouse and performing the ETL operations.
5. Analyzing data
Now that the data is available and ready in the required format, the next important step is to
understand the data in depth. This understanding comes from analysis of the data using the various statistical
tools available. A data engineer plays a vital role in the analysis of data. This step is also called
Exploratory Data Analysis (EDA). Here the data is examined using various statistical
functions, and dependent and independent variables or features are identified. Careful analysis of the data
reveals which data or features are important and what the spread of the data is. Various plots are used to
visualize the data for better understanding. Tools like Tableau and Power BI are well known for
performing Exploratory Data Analysis and visualization. Knowledge of Data Science with Python and
R is important for performing EDA on any type of data.
6. Data Modelling
Data modelling is the important next step once the data is analysed and visualized. The important
components are retained in the dataset and thus the data is further refined. Now the key decisions are
how to model the data and which tasks are suitable for modelling. Whether a task like classification or
regression is suitable depends on what business value is required. Within these tasks,
many ways of modelling are available. The Machine Learning engineer applies various algorithms to
the data and generates the output. While modelling the data, the models are often first tested on
dummy data similar to the actual data.
7. Model Evaluation/ Monitoring
As there are various ways to model the data, it is important to decide which one is effective. For that,
the model evaluation and monitoring phase is crucial. The model is now tested with
actual data. The data may be very limited, and in that case the output is monitored for improvement. There
may be changes in the data while the model is being evaluated or tested, and the output can change drastically
depending on those changes. So, while evaluating the model, the following two phases are important:
Data Drift Analysis
A change in the input data is called data drift. Data drift is a common phenomenon in data science, as
depending on the situation there will be changes in the data. Analysis of this change is called Data Drift
Analysis. The accuracy of the model depends on how well it handles this data drift. The changes in
data are mainly due to changes in the statistical properties of the data.
Model Drift Analysis
Machine learning techniques can be used to discover data drift. More sophisticated methods
such as Adaptive Windowing and Page-Hinkley are also available. Model Drift Analysis is
important because, as we all know, change is constant. Incremental learning can also be used effectively,
where the model is exposed to new data incrementally.
8. Model Training
Once the task, the model and the data drift analysis are finalised, the next
important step is to train the model. The training can be done in phases, where the important parameters
can be further fine-tuned to get the required accuracy. The model is exposed to the actual data in the
production phase and the output is monitored.
9. Model Deployment
Once the model is trained with the actual data and the parameters are fine-tuned, the model is deployed.
Now the model is exposed to real-time data flowing into the system and output is generated. The model
can be deployed as a web service or as an embedded application in an edge or mobile application. This is a
very important step, as the model is now exposed to the real world.
10. Deriving insights and generating BI reports
After model deployment in the real world, the next step is to find out how the model is behaving in a real-world
scenario. The model is used to get insights which aid in strategic decisions related to the business. The
business goals are bound to these insights. Various reports are generated to see how the business is
doing. These reports help in finding out whether the key performance indicators are achieved or not.
11. Taking a decision based on insight
For data science to work wonders, every step indicated above has to be done carefully and
accurately. When the steps are followed properly, the reports generated in the above step help in
taking key decisions for the organization. The insights generated help in taking strategic decisions; for
example, the organization can predict in advance that there will be a need for raw material.
Data science can be of great help in taking many important decisions related to business growth and
better revenue generation.
Setting Expectations
Developing expectations is the process of deliberately thinking about what you expect before you do
anything, such as inspect your data, perform a procedure, or enter a command. For experienced data
analysts, in some circumstances, developing expectations may be an automatic, almost subconscious
process, but it’s an important activity to cultivate and be deliberate about. For example, you may be
going out to dinner with friends at a cash-only establishment and need to stop by the ATM to withdraw
money before meeting up. To make a decision about the amount of money you’re going to withdraw,
you have to have developed some expectation of the cost of dinner. This may be an automatic
expectation because you dine at this establishment regularly so you know what the typical cost of a
meal is there, which would be an example of a priori knowledge. Another example of a priori
knowledge would be knowing what a typical meal costs at a restaurant in your city, or knowing what a
meal at the most expensive restaurants in your city costs. Using that information, you could perhaps
place an upper and lower bound on how much the meal will cost. You may have also sought out
external information to develop your expectations, which could include asking your friends who will
be joining you or who have eaten at the restaurant before and/or Googling the restaurant to find
general cost information online or a menu with prices. This same process, in which you use any a
priori information you have and/or external sources to determine what you expect when you inspect
your data or execute an analysis procedure, applies to each core activity of the data analysis process.
Features Of R
1) Open Source
An open-source language is a language on which we can work without any need for a license or a fee.
R is an open-source language. We can contribute to the development of R by optimizing our packages,
developing new ones, and resolving issues.
2) Platform Independent
R is a platform-independent language or cross-platform programming language which means its code
can run on all operating systems. R enables programmers to develop software for several competing
platforms by writing a program only once. R can run quite easily on Windows, Linux, and Mac.
3) Machine Learning Operations
R allows us to do various machine learning operations such as classification and regression. For this
purpose, R provides various packages and features for developing the artificial neural network. R is
used by the best data scientists in the world.
4) Exemplary support for data wrangling
R allows us to perform data wrangling. R provides packages such as dplyr and readr, which are capable of
transforming messy data into a structured form.
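As a small, hedged illustration of data wrangling with dplyr (the data frame, column names and values below are made up for the example):
library(dplyr)
# A tiny made-up data set for illustration
scores <- data.frame(name = c("A", "B", "C", "D"),
                     marks = c(67, 82, 45, 90),
                     branch = c("CSE", "CSE", "IT", "IT"))
# Filter rows, derive a new column, and summarise per group
scores %>%
  filter(marks >= 50) %>%
  mutate(grade = ifelse(marks >= 80, "Distinction", "Pass")) %>%
  group_by(branch) %>%
  summarise(avg_marks = mean(marks))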
5) Quality plotting and graphing
R simplifies quality plotting and graphing. R libraries such as ggplot2 and plotly produce visually
appealing and aesthetic graphs, which sets R apart from other programming languages.
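A minimal sketch of a ggplot2 plot, using the built-in mtcars data set:
library(ggplot2)
# Scatter plot of weight vs. miles per gallon, coloured by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(title = "Fuel efficiency vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")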
6) The array of packages
R has a rich set of packages. R has over 10,000 packages in the CRAN repository which are constantly
growing. R provides packages for data science and machine learning operations.
7) Statistics
R is mainly known as the language of statistics. This is the main reason why R is preferred over other
programming languages for the development of statistical tools.
8) Continuously Growing
R is a constantly evolving programming language. Constantly evolving means that it changes and develops
over time, like our taste in music and clothes, which evolves as we get
older. R is state of the art and provides updates whenever a new feature is added.
Limitations of R
1) Data Handling
In R, objects are stored in physical memory, in contrast with other programming languages like
Python. R utilizes more memory as compared to Python and requires the entire data set to be held in
memory at once, so it is not an ideal option when we deal with Big Data.
2) Basic Security
R lacks basic security. It is an essential part of most programming languages such as Python. Because
of this, there are many restrictions with R as it cannot be embedded in a web-application.
3) Complicated Language
R is a very complicated language, and it has a steep learning curve. The people who don’t have prior
knowledge or programming experience may find it difficult to learn R.
4) Weak Origin
The main disadvantage of R is that it does not have support for dynamic or 3D graphics. The reason
behind this is its origin. It shares its origin with a much older programming language “S.”
5) Lesser Speed
R programming language is much slower than other programming languages such as MATLAB and
Python. In comparison to other programming languages, R packages are much slower.
In R, algorithms are spread across different packages. The programmers who have no prior knowledge
of packages may find it difficult to implement algorithms.
The Logical Type
Logical values are the two truth values TRUE and FALSE, for example the result of a comparison:
x <- 5 > 4
x
## [1] TRUE
class(x)
## [1] "logical"
is.logical(x)
## [1] TRUE
The Character Type
Finally, characters are what you get when you type in a string such as "hello, world".
x <- "hello, world"
class(x)
## [1] "character"
is.character(x)
## [1] TRUE
Unlike in some languages, character doesn’t mean a single character but any text. So it is not like in C
or Java where you have single character types, 'c', and multi-character strings, "string", they are both
just characters. You can, similar to the other types, explicitly convert a value into a character (string)
using as.character:
as.character(3.14)
## [1] "3.14"
Data Structures
Vectors
Vectors are sequences of values, all of the same type. You can create a vector with the c() function:
v <- c(1, 2, 3)
or through some other operator or function, e.g., the : operator or the rep function
1:3
## [1] 1 2 3
rep("foo", 3)
## [1] "foo""foo""foo"
We can test if something is this kind of vector using the is.atomic function:
v <- 1:3
is.atomic(v)
## [1] TRUE
v <- 1:3
is.vector(v)
## [1] TRUE
It is just that R only considers such a sequence a vector—in the sense that is.vector returns TRUE—if
the object doesn’t have any attributes (except for one, names, which it is allowed to have).
Attributes are meta-information associated with an object, and not something we will deal with much
here, but you just have to know that is.vector will be FALSE if something that is a perfectly good
vector gets
an attribute.
v <- 1:3
is.vector(v)
## [1] TRUE
attr(v, "foo") <- "bar"
v
## [1] 1 2 3
## attr(,"foo")
## [1] "bar"
is.vector(v)
## [1] FALSE
So if you want to test if something is the kind of vector I am talking about here, use is.atomic instead.
When you concatenate (atomic) vectors, you always get another vector back. So when you combine
several c() calls you don’t get any kind of tree structure if you do something like this:
c(1, 2, c(3, 4), c(5, 6, 7))
## [1] 1 2 3 4 5 6 7
The type might change, if you try to concatenate vectors of different types, R will try to translate the
type into the most general type of the vectors.
c(1, 2, 3, "foo")
## [1] "1""2""3""foo"
Matrix
If you want a matrix instead of a vector, what you really want is just a two-dimensional vector. You
can set the dimensions of a vector using the dim function—it sets one of those attributes we talked
about previously where you specify the number of rows and the number of columns you want the
matrix to have.
v <- 1:6
attributes(v)
## NULL
dim(v) <- c(2, 3)
attributes(v)
## $dim
## [1] 2 3
dim(v)
## [1] 2 3
v
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
When you do this, the values in the vector will go in the matrix column-wise, i.e., the values in the
vector will go down the first column first and then on to the next column and so forth. You can use the
convenience function matrix to create matrices and there you can specify if you want
the values to go by column or by row using the byrow parameter.
v <- 1:6
matrix(data = v, nrow = 2, ncol = 3, byrow = FALSE)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
matrix(data = v, nrow = 2, ncol = 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
the * operator will not do matrix multiplication. You use * if you want to make element-wise
multiplication; for matrix multiplication you need the operator %*% instead.
(A <- matrix(1:4, nrow = 2))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
(B <- matrix(5:8, nrow = 2))
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
A*B
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
A %*% B
## [,1] [,2]
## [1,] 23 31
## [2,] 34 46
If you want to transpose a matrix, you use the t function and, if you want to invert it, you use the solve
function.
t(A)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
solve(A)
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
Lists
Lists, like vectors, are sequences, but unlike vectors, the elements of a list can be any kind of objects,
and they do not have to be the same type of objects. This means that you can construct more complex
data structures out of lists.
For example, we can make a list of two vectors:
list(1:3, 5:8)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 5 6 7 8
Notice how the vectors do not get concatenated like they would if we combined them with c(). The
result of this command is a list of two elements that happens to be both vectors.
They didn’t have to have the same type either; we could make a list like this, which also consists of two
vectors but vectors of different types:
list(1:3, c(TRUE, FALSE))
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] TRUE FALSE
You can flatten a list into a vector using the function unlist(). This will force the elements in the list to
be converted into the same type, of course, since that is required of vectors.
unlist(list(1:4, 5:7))
## [1] 1 2 3 4 5 6 7
Indexing
If you want to get an individual element of a vector, you index it with []:
v <- 1:4
v[2]
## [1] 2
You can also get a subsequence out of the vector using a range of indices:
v[2:3]
## [1] 2 3
Here we are indexing with positive numbers, which makes sense since the elements in the vector have
positive indices, but it is also possible to use negative numbers to index in R. If you do that it is
interpreted as specifying the complement of the values you want. So if you want all elements except
the first element, you can use:
v[-1]
## [1] 2 3 4
You can also use multiple negative indices to remove some values:
v[-(1:2)]
## [1] 3 4
Another way to index is to use a Boolean vector. This vector should be the same length as the vector
you index into, and it will pick out the elements where the Boolean vector is true.
v[v %% 2 == 0]
## [1] 2 4
If you want to assign to a vector you just assign to elements you index; as long as the vector to the
right of the assignment operator has the same length as the elements the indexing pulls out you will be
assigning to the vector.
v[v %% 2 == 0] <- 13
v
## [1] 1 13 3 13
If the vector has more than one dimension—remember that matrices and arrays are really just vectors
with more dimensions—then you subset them by subsetting each dimension. If you leave out a
dimension, you will get the whole range of values in that dimension, which is a simple way of getting
rows and columns of a matrix:
m <- matrix(1:6, nrow = 2, byrow = TRUE)
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
m[1,]
## [1] 1 2 3
m[,1]
## [1] 1 4
You can also index out a submatrix this way by providing ranges in one or more dimensions:
m[1:2,1:2]
## [,1] [,2]
## [1,] 1 2
## [2,] 4 5
When you index a list with [], you get a list containing the elements you asked for. If you want to get
to the actual element in there, you need to use the [[]] operator instead.
L <- list(1,2,3)
L[[1]]
## [1] 1
Named Values
The elements in a vector or a list can have names. These are attributes that do not affect the values of
the elements but can be used to refer to them. You can set these names when you create the vector or
list:
v <- c(a = 1, b = 2, c = 3, d = 4)
v
## a b c d
## 1 2 3 4
L <- list(a = 1:5, b = c(TRUE, FALSE))
L
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] TRUE FALSE
Or you can set the names using the names<- function. That weird name, by the way, means that you
are dealing with the names() function combined with assignment:
names(v) <- LETTERS[1:4]
v
## A B C D
## 1 2 3 4
You can use names to index vectors and lists (where the [] and [[]] returns either a list or the element
of the list, as before):
v["A"]
## A
## 1
L["a"]
## $a
## [1] 1 2 3 4 5
L[["a"]]
## [1] 1 2 3 4 5
factors
In the first step, we create a vector. The next step is to convert the vector into a factor.
R provides the factor() function to convert a vector into a factor. The syntax of the factor()
function is as follows:
factor_data <- factor(vector)
# Creating a vector as input.
data <- c("Shubham","Nishka","Arpita","Nishka","Shubham","Sumit","Nishka","Shubham","Sumit","Arpita","Sumit")
print(data)
print(is.factor(data))
# Applying the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
Output:
[1] "Shubham" "Nishka"  "Arpita"  "Nishka"  "Shubham" "Sumit"   "Nishka"
[8] "Shubham" "Sumit"   "Arpita"  "Sumit"
[1] FALSE
[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
Accessing components of factor
Like vectors, we can access the components of factors. The process of accessing components of a factor
is very similar to that of vectors. We can access the elements with the help of the indexing method or
using logical vectors. Let's see an example in which we explore the different ways of
accessing the components.
# Creating a vector as input.
data <- c("Shubham","Nishka","Arpita","Nishka","Shubham","Sumit","Nishka","Shubham","Sumit","Arpita","Sumit")
factor_data<- factor(data)
print(factor_data)
print(factor_data[4])
print(factor_data[c(5,7)])
print(factor_data[-4])
print(factor_data[c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE)])
[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit
[1] Nishka
Levels: Arpita Nishka Shubham Sumit
[1] Shubham Nishka
Levels: Arpita Nishka Shubham Sumit
[1] Shubham Nishka Arpita Shubham Sumit Nishka Shubham Sumit Arpita
[10] Sumit
Levels: Arpita Nishka Shubham Sumit
[1] Shubham Shubham Sumit Nishka Sumit
Levels: Arpita Nishka Shubham Sumit
Modification of factor
Like data frames, R allows us to modify a factor. We can modify the value of a factor by simply
reassigning it. In R, we cannot choose values outside of its predefined levels, which means we cannot insert a
value if its level is not present in the factor. For this purpose, we first have to create a level for that value, and then
we can add it to our factor.
data <- c("Shubham","Nishka","Arpita","Nishka","Shubham")
factor_data<- factor(data)
print(factor_data)
factor_data[4] <-"Arpita"
print(factor_data)
factor_data[4] <- "Gunjan"
print(factor_data)
levels(factor_data) <- c(levels(factor_data),"Gunjan")
factor_data[4] <- "Gunjan"
print(factor_data)
output:
[1] Shubham Nishka Arpita Nishka Shubham
Levels: Arpita Nishka Shubham
[1] Shubham Nishka Arpita Arpita Shubham
Levels: Arpita Nishka Shubham
Warning message:
In `[<-.factor`(`*tmp*`, 4, value = "Gunjan") :
invalid factor level, NA generated
[1] Shubham Nishka Arpita <NA> Shubham
Levels: Arpita Nishka Shubham
[1] Shubham Nishka Arpita Gunjan Shubham
Levels: Arpita Nishka Shubham Gunjan
Generating Factor Levels
R provides the gl() function to generate factor levels. This function takes three arguments, i.e., n, k, and
labels. Here, n and k are integers which indicate how many levels we want and how many times
each level is required.
The syntax of the gl() function is as follows:
gl(n, k, labels)
1. n indicates the number of levels.
2. k indicates the number of replications.
3. labels is a vector of labels for the resulting factor levels.
Example
gen_factor <- gl(3, 5, labels = c("BCA","MCA","B.Tech"))
gen_factor
Output
[1] BCA BCA BCA BCA BCA MCA MCA MCA MCA MCA
[11] B.Tech B.Tech B.Tech B.Tech B.Tech
Levels: BCA MCA B.Tech
Factors in Data Frames
When a data frame is created with a column of text data, R treats the text column as categorical data and creates factors on it.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")
input_data <- data.frame(height,weight,gender)
print(input_data)
print(is.factor(input_data$gender))
print(input_data$gender)
When we execute the above code, it produces the following result −
height weight gender
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
[1] TRUE
[1] male male female female male female male
Levels: female male
Changing the Order of Levels
The order of the levels in a factor can be changed by applying the factor function again with new order
of the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
factor_data <- factor(data)
print(factor_data)
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
When we execute the above code, it produces the following result −
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
Subsetting R Objects
There are three operators that can be used to extract subsets of R objects.
The [ operator always returns an object of the same class as the original. It can be used to select
multiple elements of an object
The [[ operator is used to extract elements of a list or a data frame. It can only be used to
extract a single element and the class of the returned object will not necessarily be a list or data
frame.
The $ operator is used to extract elements of a list or data frame by literal name. Its semantics
are similar to that of [[.
Subsetting a Vector
Vectors are basic objects in R and they can be subsetted using the [ operator.
> x <- c("a", "b", "c", "c", "d", "a")
> x[1:4]
[1] "a" "b" "c" "c"
The sequence does not have to be in order; you can specify any arbitrary integer vector.
> x[x > "a"]
[1] "b" "c" "c" "d"
Subsetting a Matrix
Matrices can be subsetted in the usual way with (i,j) type indices. Here, we create a simple 2×3
matrix with the matrix function.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2
Indices can also be missing. This behavior is used to access entire rows or columns of a matrix.
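For example, leaving one index empty selects the whole row or column (continuing with the 2×3 matrix x above):
> x[1, ]    # first row, all columns
[1] 1 3 5
> x[, 2]    # all rows, second column
[1] 3 4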
Subsetting Lists
Lists in R can be subsetted using all three of the operators mentioned above, and all three are used
for different purposes.
> x <- list(foo = 1:4, bar = 0.6)
> x["bar"]
$bar
[1] 0.6
The [[ operator can be used to extract single elements from a list. Here we extract the first element
of the list.
> x[[1]]
[1] 1 2 3 4
The [[ operator can also use named indices so that you don’t have to remember the exact ordering of
every element of the list. You can also use the $ operator to extract elements by name.
> x[["bar"]]
[1] 0.6
> x$bar
[1] 0.6
Notice you don’t need the quotes when you use the $ operator.
One thing that differentiates the [[ operator from the $ is that the [[ operator can be used with
computed indices. The $ operator can only be used with literal names.
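A short sketch of the difference, reusing the list x defined above:
> name <- "foo"
> x[[name]]   # computed index: works with [[
[1] 1 2 3 4
> x$name      # $ looks for an element literally called "name"
NULL
> x$foo       # a literal name works with $
[1] 1 2 3 4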
Partial Matching
Partial matching of names is allowed with [[ and $. This is often very useful during interactive work
if the object you’re working with has very long element names. You can just abbreviate those names
and R will figure out what element you’re referring to.
> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 1 2 3 4 5
Removing NA Values
A common task in data analysis is removing missing values (NAs).
> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> print(bad)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> x[!bad]
[1] 1 2 4 5
What if there are multiple R objects and you want to take the subset with no missing values in any
of those objects?
> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y)
> good
[1] TRUE TRUE FALSE TRUE FALSE TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a""b""d""f"
Control Structures
if condition
This control structure checks whether the expression provided in parentheses is true or not. If true, the
statements in the braces {} are executed.
Syntax:
if(expression)
{
statements
....
....
}
Example:
x <- 100
if(x > 10){
print(paste(x, "is greater than 10"))
}
Output:
[1] "100 is greater than 10"
if-else condition
It is similar to the if condition, but when the test expression in the if condition fails, the statements in the
else block are executed.
Syntax:
if(expression)
{
statements
....
....
}
else
{
statements
....
....
}
Example:
x <- 5
if(x > 10){
print(paste(x, "is greater than 10"))
} else{
print(paste(x, "is less than 10"))
}
Output:
[1] "5 is less than 10"
for loop
It is a type of loop or sequence of statements executed repeatedly until exit condition is reached.
Syntax:
for(value in vector)
{
statements
....
....
}
Example:
x <- letters[4:10]
for(i in x){
print(i)
}
Output:
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Nested loops
Nested loops are similar to simple loops. Nested means a loop inside another loop. Nested loops
are often used to manipulate matrices. For example, the loop below prints every (i, j) pair:
for(i in 1:3)
{
for(j in 1:5)
{
print(paste("i =", i, "j =", j))
}
}
while loop
A while loop is another kind of loop that iterates as long as a condition is satisfied. The test expression is
checked before executing the body of the loop.
Syntax:
while(expression)
{
statement
....
....
}
Example:
x = 1
# Print 1 to 5
while(x <= 5){
print(x)
x = x + 1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
repeat loop and break statement
repeat is a loop which iterates any number of times, but it has no built-in exit condition to come
out of the loop. So, the break statement is used to exit the loop. The break statement can be used in
any type of loop to exit from the loop.
Syntax:
repeat {
statements
....
....
if(expression) {
break
}
}
Example:
x = 1
# Print 1 to 5
repeat{
print(x)
x = x + 1
if(x > 5){
break
}
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
next statement
The next statement is used to skip the current iteration without executing the remaining statements; the
loop continues with the next iteration cycle without terminating.
Example:
# Defining vector
x <- 1:10
# Print only the even numbers, skipping odd ones with next
for(i in x){
if(i %% 2 != 0){
next
}
print(i)
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
Functions
You define your own functions with the function keyword; the value of the last expression in the function body is the value the function returns:
plus <- function(x, y) {
print(paste(x, "+", y, "is", x + y))
x + y
}
div <- function(x, y) {
print(paste(x, "/", y, "is", x / y))
x / y
}
plus(2, 2)
## [1] "2 + 2 is 4"
## [1] 4
div(6, 2)
## [1] "6 / 2 is 3"
## [1] 3
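The paragraph below discusses what happens when you put assignments inside a function call; a minimal sketch, assuming calls like these with the div function from above:
div(x <- 6, y <- 2)
## [1] "6 / 2 is 3"
## [1] 3
div(y <- 2, x <- 6)
## [1] "2 / 6 is 0.333333333333333"
## [1] 0.3333333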
The assignment operator <- returns a value and that is passed along to the function as positional
arguments. So in the second function call above you are assigning 2 to y and 6 to x in the scope out-
side the function, but the values you pass to the function are positional so inside the function you
have given 2 to x and 6 to y.
Return Value from R Function
Method 1: R function with return value
In this scenario, we will use the return statement to return some value.
Syntax:
function_name <- function(parameters) {
statements
return(value)
}
Example: R function that adds two values and returns the sum
add <- function(val1, val2) {
sum = val1 + val2
return(sum)
}
add(10, 20)
Output:
[1] 30
Method 2: R function to return multiple values as a list
In this scenario, we will use the list() function in the return statement to return multiple values.
Syntax:
function_name <- function(parameters) {
statements
return(list(value1, value2, ..., valueN))
}
function_name(values)
where,
function_name is the name of the function
parameters are the values that are passed as arguments
return() function takes list of values as input
function_name(values) is used to pass values to the parameters
Example: R program to perform arithmetic operations and return those values
arithmetic = function(val1,val2)
{
add=val1+val2
sub=val1-val2
mul=val1*val2
div=val2/val1
return(list(add,sub,mul,div))
}
arithmetic(10,20)
Output:
[[1]]
[1] 30
[[2]]
[1] -10
[[3]]
[1] 200
[[4]]
[1] 2
Write an R program to create a 5 × 4 matrix, a 3 × 3 matrix with labels filled by rows, and a 3 × 3 matrix
with labels filled by columns.
Program:
m1 = matrix(1:20, nrow=5, ncol=4)
print("5 × 4 matrix:")
print(m1)
cells = c(1,3,5,7,8,9,11,12,14)
rnames = c("Row1", "Row2", "Row3")
cnames = c("Col1", "Col2", "Col3")
m2 = matrix(cells, nrow=3, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
print("3 × 3 matrix with labels, filled by rows: ")
print(m2)
print("3 × 3 matrix with labels, filled by columns: ")
m3 = matrix(cells, nrow=3, ncol=3, byrow=FALSE, dimnames=list(rnames, cnames))
print(m3)
Write an R program to create a data frame which contains details of 5 employees and display
the details.
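A minimal sketch of one possible solution (the employee names and values below are made up for illustration):
# Build a data frame of 5 employees and display it
employees <- data.frame(
emp_id = 1:5,
emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
stringsAsFactors = FALSE
)
print(employees)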
Write an R program to create the system's idea of the current date with and without time.
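A minimal sketch of one possible solution:
# Current date without time
print(Sys.Date())
# Current date and time
print(Sys.time())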
UNIT–II
Loading, Exploring and Managing Data
Working with data from files: Reading and Writing Data, Reading Data Files with read.table(),
Reading in Larger Datasets with read.table(). Working with relational databases. Data manipulation
packages: dplyr, data.table, reshape2, tidyr, lubridate.
[1] "c."
[1] "c++"
[2] "java"
Reading the whole file
read_file(): This method is used for reading the whole file at once. To use this method we have to import the
readr package.
Syntax: read_file(file)
file: the file path
program:
library(readr)
myData = read_file("1.txt")
print(myData)
Output:
[1] "c\nc++\njava\n"
Reading a file in a table format
Another popular format to store a file is in a tabular format. R provides various methods that one
can read data from a tabular formatted data file.
read.table(): read.table() is a general function that can be used to read a file in table format. The
data will be imported as a data frame.
Syntax: read.table(file, header = FALSE, sep = “”, dec = “.”)
myData = read.table("basic.csv")
print(myData)
Output:
1 Name,Age,Qualification,Address
2 Amiya,18,MCA,BBS
3 Niru,23,Msc,BLS
4 Debi,23,BCA,SBP
5 Biku,56,ISC,JJP
read.csv(): read.csv() is used for reading “comma separated value” files (“.csv”). In this also the
data will be imported as a data frame.
Syntax: read.csv(file, header = TRUE, sep = “,”, dec = “.”, …)
myData = read.csv("basic.csv")
print(myData)
Output:
Name Age Qualification Address
1 Amiya 18 MCA BBS
2 Niru 23 Msc BLS
3 Debi 23 BCA SBP
4 Biku 56 ISC JJP
read.csv2(): read.csv2() is the variant used in countries that use a comma “,” as the decimal point
and a semicolon “;” as the field separator.
Syntax: read.csv2(file, header = TRUE, sep = “;”, dec = “,”, …)
myData = read.csv2("basic.csv")
print(myData)
Output:
Name.Age.Qualification.Address
1 Amiya,18,MCA,BBS
2 Niru,23,Msc,BLS
3 Debi,23,BCA,SBP
4 Biku,56,ISC,JJP
file.choose(): You can also use file.choose() with read.csv() just like before.
myData = read.csv(file.choose())
print(myData)
Output:
Name Age Qualification Address
1 Amiya 18 MCA BBS
2 Niru 23 Msc BLS
3 Debi 23 BCA SBP
4 Biku 56 ISC JJP
read_csv(): This method is also used to read comma (“,”) separated values, with the help
of the readr package.
Syntax: read_csv(file, col_names = TRUE)
library(readr)
myData = read_csv("basic.csv", col_names = TRUE)
print(myData)
Output:
Parsed with column specification:
cols(
Name = col_character(),
Age = col_double(),
Qualification = col_character(),
Address = col_character()
)
# A tibble: 4 x 4
Name Age Qualification Address
1 Amiya 18 MCA BBS
2 Niru 23 Msc BLS
3 Debi 23 BCA SBP
4 Biku 56 ISC JJP
Reading a file from the internet
It’s possible to use the functions read.delim(), read.csv() and read.table() to import files from the
web.
myData = read.delim("http://www.sthda.com/upload/boxplot_format.txt")
print(head(myData))
Output:
Nom variable Group
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
4 IND4 14 A
5 IND5 14 A
6 IND6 12 A
Reading a CSV File
Following is a simple example of read.csv() function to read a CSV file available in your current
working directory −
data <- read.csv("input.csv")
print(data)
When we execute the above code, it produces the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
By default the read.csv() function gives the output as a data frame. This can be easily checked as
follows. Also we can check the number of columns and rows.
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −
[1] TRUE
[1] 5
[1] 8
Once we read data in a data frame, we can apply all the functions applicable to data frames as
explained in subsequent section.
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)
When we execute the above code, it produces the following result −
[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria similar to a SQL where clause.
# Create a data frame.
data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
print(retval)
When we execute the above code, it produces the following result −
id name salary start_date dept
5 NA Gary 843.25 2015-03-27 Finance
Get all the people working in IT department
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset( data, dept == "IT")
print(retval)
When we execute the above code, it produces the following result −
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT
Get the persons in IT department whose salary is greater than 600
data <- read.csv("input.csv")
info <- subset(data, salary > 600 & dept == "IT")
print(info)
When we execute the above code, it produces the following result −
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
Get the people who joined on or after 2014
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
print(retval)
When we execute the above code, it produces the following result −
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a csv file from an existing data frame. The write.csv() function is used to create the csv file.
This file gets created in the working directory.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result −
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 NA Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
Install xlsx Package
You can use the following command in the R console to install the "xlsx" package. It may ask to
install some additional packages on which this package is dependent. Follow the same command
with required package name to install the additional packages.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
any(grepl("xlsx",installed.packages()))
library("xlsx")
When the script is run we get the following output.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
Input as xlsx File
Open Microsoft excel. Copy and paste the following data in the work sheet named as sheet1.
id name salary start_date dept
1 Rick 623.3 1/1/2012 IT
2 Dan 515.2 9/23/2013 Operations
3 Michelle 611 11/15/2014 IT
4 Ryan 729 5/11/2014 HR
5 Gary 843.25 3/27/2015 Finance
6 Nina 578 5/21/2013 IT
7 Simon 632.8 7/30/2013 Operations
8 Guru 722.5 6/17/2014 Finance
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". You should save it in the current working directory of the R
workspace.
Reading the Excel File
The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a
data frame in the R environment.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
When we execute the above code, it produces the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
XML is a file format which shares both the file format and the data on the World Wide Web,
intranets, and elsewhere using standard ASCII text. XML stands for Extensible Markup Language.
Similar to HTML, it contains markup tags. But unlike HTML, where the markup tags
describe the structure of the page, in XML the markup tags describe the meaning of the data contained
in the file.
install.packages("XML")
Input Data
Create an XML file by copying the below data into a text editor like Notepad. Save the file with a .xml
extension, choosing the file type as all files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
Reading XML File
The xml file is read by R using the function xmlParse(). It is stored as a list in R.
library("XML")
library("methods")
result <- xmlParse(file = "input.xml")
print(result)
When we execute the above code, it prints the parsed XML document, i.e., the same eight EMPLOYEE
records (ID, NAME, SALARY, STARTDATE and DEPT) shown in the input file above.
Get Number of Nodes Present in XML File
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Find number of nodes in the root.
rootsize <- xmlSize(rootnode)
# Print the result.
print(rootsize)
When we execute the above code, it produces the following result −
output
[1] 8
Details of the First Node
Let's look at the first record of the parsed file. It will give us an idea of the various elements present
in the top level node.
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Print the result.
print(rootnode[1])
When we execute the above code, it produces the following result −
$EMPLOYEE
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Get Different Elements of a Node
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Extract the root node from the xml file.
rootnode <- xmlRoot(result)
# Get the first element of the first node.
print(rootnode[[1]][[1]])
# Get the fifth element of the first node.
print(rootnode[[1]][[5]])
# Get the second element of the third node.
print(rootnode[[3]][[2]])
When we execute the above code, it produces the following result −
<ID>1</ID>
<DEPT>IT</DEPT>
<NAME>Michelle</NAME>
JSON file stores data as text in human-readable format. Json stands for JavaScript Object Notation.
R can read JSON files using the rjson package.
Install rjson Package
In the R console, you can issue the following command to install the rjson package.
install.packages("rjson")
Input Data
Create a JSON file by copying the below data into a text editor like Notepad. Save the file with
a .json extension, choosing the file type as all files (*.*).
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
Read the JSON File
The JSON file is read by R using the function fromJSON(). It is stored as a list in R.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Print the result.
print(result)
When we execute the above code, it produces the following result −
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
Convert JSON to a Data Frame
We can convert the extracted data above to an R data frame for further analysis using the
as.data.frame() function.
# Load the package required to read JSON files.
library("rjson")
# Give the input file name to the function.
result <- fromJSON(file = "input.json")
# Convert JSON file to a data frame.
json_data_frame <- as.data.frame(result)
print(json_data_frame)
When we execute the above code, it produces the following result −
  ID     Name Salary  StartDate       Dept
1  1     Rick  623.3   1/1/2012         IT
2  2      Dan  515.2  9/23/2013 Operations
3  3 Michelle    611 11/15/2014         IT
4  4     Ryan    729  5/11/2014         HR
5  5     Gary 843.25  3/27/2015    Finance
6  6     Nina    578  5/21/2013         IT
7  7    Simon  632.8  7/30/2013 Operations
8  8     Guru  722.5  6/17/2014    Finance
Reading in Larger Datasets with read.table
R is known to have difficulties handling large data files. Here we will explore some tips that make
working with such files in R less painful.
If you can comfortably work with the entire file in memory, but reading the file is rather
slow, consider using the data.table package and read the file with its fread function.
If your file does not comfortably fit in memory:
Use sqldf if you have to stick to csv files.
Use a SQLite database and query it using either SQL queries or dplyr.
Convert your csv file to a sqlite database in order to query it.
Loading a large dataset: use fread() or functions from readr instead of read.xxx().
library("data.table")
library("readr")
To read an entire csv into memory, by default, R users use the read.table method or variations thereof
(such as read.csv). However, fread from the data.table package is a lot faster. Furthermore, the readr
package also provides more optimized reading functions (read_csv, read_delim, …). Let’s measure the
time to read in the data using these three different methods (csv.name below holds the path to the csv file).
read.table.timing <- system.time(read.table(csv.name, header = TRUE, sep = ","))
readr.timing <- system.time(read_delim(csv.name, ",", col_names = TRUE))
data.table.timing <- system.time(allData <- fread(csv.name, showProgress = FALSE))
data <- data.frame(method = c('read.table', 'readr', 'fread'),
timing = c(read.table.timing[3], readr.timing[3], data.table.timing[3]))
##       method  timing
## 1 read.table 183.732
## 2      readr   3.625
## 3      fread  12.564
Data files that don’t fit in memory
If you are not able to read in the data file, because it does not fit in memory (or because R becomes
too slow when you load the entire dataset), you will need to limit the amount of data that will
actually be stored in memory. There are a couple of options which we will investigate:
1. limit the number of lines you are trying to read for some exploratory analysis. Once you are
happy with the analysis you want to run on the entire dataset, move to another machine.
2. limit the number of columns you are reading to reduce the memory required to store the data.
3. limit both the number of rows and the number of columns using sqldf.
4. stream the data.
1. Limit the number of lines you read (fread)
Limiting the number of lines you read is easy. Just use the nrows and/or skip option (available to
both read.table and fread). skip can be used to skip a number of rows, but you can also pass a string
to this parameter causing fread to only start reading lines from the first line matching that string.
Let’s say we only want to start reading lines after we find a line matching the pattern 2015-06-12
15:14:39. We can do that like this:
sprintf("Number of lines in full data set: %s", nrow(allData))
## [1] "Number of lines in full data set: 3761058"
subSet <- fread(csv.name, skip = "2015-06-12 15:14:39", showProgress = FALSE)
sprintf("Number of lines in data set with skipped lines: %s", nrow(subSet))
## [1] "Number of lines in data set with skipped lines: 9998"
Skipping rows this way is obviously not giving you the entire dataset, so this strategy is only useful
for doing exploratory analysis on a subset of your data. Note that also read_delim provides a n_max
argument to limit the number of lines to read. If you want to explore the whole dataset, limiting the
number of columns you read can be a more useful strategy.
2. Limit the number of columns you read (fread)
If you only need 4 columns of the 21 columns present in the file, you can tell fread to only select
those 4. This can have a major impact on the memory footprint of your data. The option you need
for this is: select. With this, you can specify a number of columns to keep. The opposite - specifying
the columns you want to drop - can be accomplished with the drop option.
fourColumns = fread(csv.name, select = c("device_info_serial", "date_time", "latitude",
"longitude"), showProgress = FALSE)
sprintf("Size of total data in memory: %s MB", utils::object.size(allData)/1000000)
## [1] "Size of total data in memory: 1173.480728 MB"
sprintf("Size of only four columns in memory: %s MB", utils::object.size(fourColumns)/1000000)
## [1] "Size of only four columns in memory: 105.311936 MB"
The difference might not be as large as you would expect. R objects claim more memory than
is needed to store the data alone, because they also keep pointers and other object attributes. Still,
the reduction can be large enough to make the difference between a dataset that fits in memory and one that does not.
3. Limiting both the number of rows and the number of columns using sqldf
The sqldf package allows you to run SQL-like queries on a file, resulting in only a selection of the
file being read. It allows you to limit both the number of rows and the number of columns at the same
time. In the background, this actually creates a sqlite database on the fly to execute the query.
Consider using the package when starting from a csv file, but the actual strategy boils down to
making a sqlite database file of your data.
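As a rough sketch of this approach, the read.csv.sql() function from the sqldf package lets you pass an SQL query that is executed against the csv file; the column names below come from the tracking file used earlier, and the WHERE value is a made-up example.
library(sqldf)
#read only two columns and a subset of rows straight from the csv file;
#sqldf builds a temporary sqlite database behind the scenes
subSet <- read.csv.sql(csv.name,
                       sql = "SELECT device_info_serial, date_time
                              FROM file
                              WHERE device_info_serial = 853",
                       header = TRUE, sep = ",")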
4. Streaming data
Streaming a file means reading it line by line and only keeping the lines you need or do stuff with
the lines while you read through the file. It turns out that R is really not very efficient in streaming
files. The main reason is the memory allocation process that has difficulties with a constantly
growing object (which can be a dataframe containing only the selected lines).
Working with relational databases
In many production environments, the data you want lives in a relational or SQL database, not in
files. Public data is often in files (as they are easier to share), but your most important client data is
often in databases. Relational databases scale easily to the millions of records and supply important
production features such as parallelism, consistency, transactions, logging, and audits. When you’re
working with transaction data, you’re likely to find it already stored in a relational database, as
relational databases excel at online transaction processing (OLTP). Often you can export the
data into a structured file and use the methods of our previous sections to then transfer the data into
R. But this is generally not the right way to do things. Exporting from databases to files is often
unreliable and idiosyncratic due to variations in database tools and the typically poor job these
tools do when quoting and escaping characters that are confused with field separators. Data in a
database is often stored in what is called a normalized form, which requires relational
preparations called joins before the data is ready for analysis. Also, you often don’t want a dump of
the entire database, but instead wish to freely specify which columns and aggregations you need
during analysis.
Loading data with SQL Screwdriver
java -classpath SQLScrewdriver.jar:h2-1.3.170.jar \
  com.winvector.db.LoadFiles \
  file:dbDef.xml \
  , \
  hus \
  file:csv_hus/ss11husa.csv file:csv_hus/ss11husb.csv

java -classpath SQLScrewdriver.jar:h2-1.3.170.jar \
  com.winvector.db.LoadFiles \
  file:dbDef.xml \
  , \
  pus \
  file:csv_pus/ss11pusa.csv file:csv_pus/ss11pusb.csv
Loading data from a database into R
To load data from a database, we use a database connector. Then we can directly issue SQL queries
from R. SQL is the most common database query language and allows us to specify arbitrary joins
and aggregations. SQL is called a declarative language (as opposed to a procedural language)
because in SQL we specify what relations we would like our data sample to have, not how to
compute them. For our example, we load a sample of the household data from the hus table and the
rows from the person table (pus) that are associated with those households.
options( java.parameters = "-Xmx2g" )
drv <- JDBC("org.h2.Driver","h2-1.3.170.jar",identifier.quote="'")
options<-";LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0"
conn <- dbConnect(drv,paste("jdbc:h2:H2DB",options,sep=''),"u","u")
dhus <- dbGetQuery(conn,"SELECT * FROM hus WHERE ORIGRANDGROUP<=1")
dpus <- dbGetQuery(conn,"SELECT pus.* FROM pus WHERE pus.SERIALNO IN \
(SELECT DISTINCT hus.SERIALNO FROM hus \
WHERE hus.ORIGRANDGROUP<=1)")
dbDisconnect(conn)
save(dhus,dpus,file='phsample.RData')
And we’re in business; the data has been unpacked from the Census-supplied .csv files into our
database and a useful sample has been loaded into R for analysis. We have actually accomplished a
lot. Generating, as we have, a uniform sample of households and matching people would be tedious
using shell tools. It’s exactly what SQL databases are designed to do well.
1. dplyr Package
> myirisdata
> filter(myirisdata, Species %in% c('setosa', 'virginica'))
#select series of columns
> mynewdata %>%
select(mpg, cyl)%>%
mutate(newvariable = mpg*cyl)
#or
> newvariable <- mynewdata %>% mutate(newvariable = mpg*cyl)
#summarise - this is used to find insights from data
> myirisdata%>%
group_by(Species)%>%
summarise(Average = mean(Sepal.Length, na.rm = TRUE))
2. data.table Package
This package allows you to perform faster manipulation of a data set. Leave your traditional ways
of subsetting rows and columns and use this package. With minimal coding, you can do much
more. Using data.table helps to reduce computing time compared to data.frame. You’ll be
astonished by the simplicity of this package.
A data table has 3 parts, namely DT[i, j, by]. You can read this as: subset the rows using ‘i’,
calculate ‘j’, grouped by ‘by’. Most of the time, ‘by’ is a categorical variable. In the code below,
two data sets are used (airquality and iris).
#load data
> data("airquality")
> mydata <- airquality
> head(airquality,6)
> data(iris)
> myiris <- iris
#load package
> library(data.table)
> mydata <- data.table(mydata)
> mydata
> mydata[2:4,]
#convert the iris copy to a data table as well
> myiris <- data.table(myiris)
#select rows with a particular value
> myiris[Species == 'setosa']
#select rows with multiple values. This will give you rows with the setosa
#and virginica species
> myiris[Species %in% c('setosa', 'virginica')]
> mydata[,.(Temp,Month)]
#returns sum and standard deviation
> mydata[,.(sum(Ozone, na.rm = TRUE), sd(Ozone, na.rm = TRUE))]
#print and plot
> myiris[, {print(Sepal.Length)
+ plot(Sepal.Width)
+ NULL}]
#grouping by a variable
> myiris[,.(sepalsum = sum(Sepal.Length)), by=Species]
#select a column for computation, hence need to set the key on column
> setkey(myiris, Species)
3. reshape2 Package
#load package
> install.packages('reshape2')
> library(reshape2)
#melt
> mt <- melt(thisdata, id=(c('ID','Names')))
> mt
cast: This function converts data from long format to wide format. It starts with melted data and
reshapes it back into wide format; it’s just the reverse of the melt function. It comes in two variants, dcast
and acast. dcast returns a data frame as output; acast returns a vector/matrix/array as the output.
Let’s understand it using the code below.
#cast
> mcast <- dcast(mt, DateofBirth + Subject ~ variable)
> mcast
4. tidyr Package
This package can make your data look ‘tidy’. It has 4 major functions to accomplish this task.
Needless to say, if you find yourself stuck in data exploration phase, you can use them anytime
(along with dplyr). This duo makes a formidable team. They are easy to learn, code and implement.
These 4 functions are:
gather() – it ‘gathers’ multiple columns and converts them into key:value pairs. This
function transforms the wide form of data into the long form. You can use it as an alternative to
‘melt’ in the reshape package.
spread() – It does the reverse of gather. It takes a key:value pair and converts it into separate
columns.
separate() – It splits a column into multiple columns.
unite() – It does the reverse of separate. It unites multiple columns into a single column.
Let’s understand them closely using the code below:
#load package
> library(tidyr)
#create a dummy data set
> names <- c('A','B','C','D','E','A','B')
> weight <- c(55,49,76,71,65,44,34)
> age <- c(21,20,25,29,33,32,38)
> Class <- c('Maths','Science','Social','Physics','Biology','Economics','Accounts')
#create data frame
> tdata <- data.frame(names, age, weight, Class)
> tdata
#using gather function
> long_t <- tdata %>% gather(Key, Value, weight:Class)
> long_t
The separate function is most useful when a data set contains a date-time variable. Since such a
column holds several pieces of information, it makes sense to split it and use those values
individually. Using the code below, the Time column is separated into date, month and year.
#create a data set
> Humidity <- c(37.79, 42.34, 52.16, 44.57, 43.83, 44.59)
> Rain <- c(0.971360441, 1.10969716, 1.064475853, 0.953183435, 0.98878849, 0.939676146)
> Time <- c("27/01/2015 15:44","23/02/2015 23:24", "31/03/2015 19:15", "20/01/2015 20:52",
"23/02/2015 07:46", "31/01/2015 01:55")
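A minimal sketch of the remaining step, assuming the three vectors above are first combined into a data frame (the name d is hypothetical):
#create the data frame and split the Time column
> d <- data.frame(Humidity, Rain, Time)
> separate_d <- d %>% separate(Time, c('Date', 'Month', 'Year'), sep = '/')
> separate_d
#note: the Year column still carries the clock time (e.g. "2015 15:44") and
#could be split further with another separate() call on the space character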
#using spread function - reverse of gather
> wide_t <- long_t %>% spread(Key, Value)
> wide_t
5. Lubridate Package
The lubridate package reduces the pain of working with date-time variables in R. It includes an update
function, duration functions and date-extraction helpers.
> install.packages('lubridate')
> library(lubridate)
#current date and time
> now()
[1] "2015-12-11 13:23:48 IST"
#assigning current date and time to variable n_time
> n_time <- now()
#using update function
> n_update <- update(n_time, year = 2013, month = 10)
> n_update
[1] "2013-10-11 13:24:28 IST"
#add days, months, year, seconds
> d_time <- now()
> d_time + ddays(1)
[1] "2015-12-12 13:24:54 IST"
> d_time + dweeks(2)
[1] "2015-12-25 13:24:54 IST"
> d_time + dyears(3)
[1] "2018-12-10 13:24:54 IST"
> d_time + dhours(2)
[1] "2015-12-11 15:24:54 IST"
> d_time + dminutes(50)
[1] "2015-12-11 14:14:54 IST"
> d_time + dseconds(60)
[1] "2015-12-11 13:25:54 IST"
#extract date and time components (n_time is a date-time value, not a data
#frame, so the components are stored in separate variables)
> n_hour <- hour(n_time)
> n_minute <- minute(n_time)
> n_second <- second(n_time)
> n_month <- month(n_time)
> n_year <- year(n_time)
#check the extracted components in separate columns
> new_data <- data.frame(hour = n_hour, minute = n_minute, second = n_second,
month = n_month, year = n_year)
> new_data
output:
UNIT-III
DATA VALIDATION
Data validation is a critical process in data science, as it ensures the quality and reliability of your
data before using it for analysis, modeling, or decision-making. In a data science context, data
validation goes beyond simple checks like those used in data entry forms. It involves a deeper
understanding of the data, the context of the problem, and the intended use of the data.
Accurate Results: Invalid data can lead to incorrect conclusions and faulty models. Data
validation helps ensure the accuracy of your analysis and predictions.
Data Integrity: By identifying and correcting errors, inconsistencies, and outliers, you
maintain the integrity of your datasets, making them more reliable for future use.
Time and Resource Savings: Early detection and correction of data issues prevent wasted
time and resources spent on analysis based on faulty data.
Reproducibility: Validated datasets are easier to reproduce, which is crucial for scientific
rigor and collaborative projects.
Python Libraries:
o Pandas: Provides powerful data manipulation and cleaning functions.
o NumPy: Offers mathematical operations for numerical data validation.
o Scikit-learn: Includes tools for outlier detection and missing value imputation.
R Libraries:
o dplyr, tidyr: For data manipulation and cleaning.
o validate: A package specifically designed for data validation (a short sketch follows this list).
o ggplot2: For visualizing data distributions and potential issues.
SQL Queries:
o Can be used to perform validation checks directly in a database.
Domain Knowledge:
o Understanding the context and expected values of your data is crucial for effective
validation.
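As an illustration, a minimal rule-based check with the validate package might look like the sketch below; the rules and thresholds are only examples, applied to the built-in airquality data.
library(validate)
data(airquality)
#define a few validation rules (the particular thresholds are only examples)
rules <- validator(Ozone >= 0,
                   Temp > 0,
                   Month %in% 1:12,
                   !is.na(Solar.R))
#confront the data with the rules and summarise how many records pass or fail
cf <- confront(airquality, rules)
summary(cf)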
Handling missing values is a crucial part of data validation in data science. Missing values can arise
due to various reasons, such as data entry errors, sensor malfunctions, or incomplete surveys. Leav-
ing them unaddressed can lead to biased analysis and incorrect conclusions.
Here are several approaches to handling missing values during data validation:
1. Deletion:
Cons: Potential loss of valuable information if a significant portion of the data is missing. Can in-
troduce bias if missing values are not random.
2. Imputation:
Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode
(most frequent value) of the non-missing values in that variable.
Regression Imputation: Predict missing values based on other variables in the dataset.
Multiple Imputation: Create multiple imputed datasets, each with different plausible val-
ues for missing data, and combine results from the analyses of these datasets.
Pros: Preserves data and can be effective if the imputation method is appropriate.
Cons: May introduce bias or reduce the variability of the data if the imputation method is not well-
suited.
3. Advanced Techniques:
k-Nearest Neighbors (kNN) Imputation: Impute missing values based on the values of the
k most similar neighbors.
Maximum Likelihood Estimation (MLE): Estimate missing values based on a statistical
model that maximizes the likelihood of the observed data.
Expectation-Maximization (EM): An iterative algorithm that alternates between estimat-
ing missing values and model parameters.
Cons: More complex to implement and may require more computational resources.
Factors to consider when choosing an approach:
Amount of missing data: If a small percentage of data is missing, deletion might be ac-
ceptable. For larger proportions, imputation is often preferred.
Pattern of missingness: If missing values are not random, simply deleting or imputing
them can introduce bias. You might need more sophisticated methods like multiple imputa-
tion.
Type of variable: The type of variable with missing values (categorical, numerical) influ-
ences the choice of imputation method.
Purpose of analysis: The specific goals of your analysis will also guide your decision. For
example, if the analysis is sensitive to outliers, you might avoid imputation methods that can
introduce artificial values.
Let's say you have a survey dataset where some respondents didn't provide their age.
Deletion: If only a few respondents have missing age values, you might consider deleting
those rows.
Mean/Median Imputation: If the missingness is random, you could replace missing age
values with the mean or median age of the respondents who provided their age.
Regression Imputation: If you have other variables (e.g., income, education level) that are
correlated with age, you could build a regression model to predict missing age values based
on those variables.
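In R, the options for this age example could be sketched as follows; the small survey data frame is made up for illustration.
#a made-up survey with missing ages
survey <- data.frame(age = c(23, 31, NA, 45, NA, 38),
                     income = c(21000, 34000, 28000, 52000, 30000, 41000))
#deletion: keep only the complete rows
complete_rows <- na.omit(survey)
#mean imputation: replace missing ages with the mean of the observed ages
survey$age_imputed <- ifelse(is.na(survey$age),
                             mean(survey$age, na.rm = TRUE),
                             survey$age)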
In data science, null values represent missing or unknown data points within a dataset. They are a
common occurrence in real-world data and can pose significant challenges during analysis and
modeling.
NaN (Not a Number): Commonly used in numerical datasets to represent missing or unde-
fined values.
NULL or None: Often used in databases and programming languages like SQL or Python
to denote missing values.
Blank or Empty Cells: In spreadsheets or flat files, null values may be represented as simp-
ly blank or empty cells.
Reduced Sample Size: Null values can reduce the effective sample size of your dataset,
leading to less reliable statistical analysis.
Biased Results: If the missingness of data is not random, ignoring null values can introduce
bias and lead to incorrect conclusions. For instance, if respondents with lower income are
less likely to report their salary, your analysis might overestimate the average income.
Algorithm Issues: Many machine learning algorithms cannot handle null values directly
and may require preprocessing steps like imputation or removal.
The appropriate strategy depends on the amount of missing data, the pattern of missingness, the
type of variable, and the purpose of your analysis. Here are some common approaches:
1. Deletion:
o Listwise Deletion: Remove entire rows with any null values. Suitable when the pro-
portion of missing data is small and the missingness is random.
o Pairwise Deletion: Exclude only the missing values from specific calculations. This
preserves more data but can lead to inconsistencies in results.
2. Imputation:
o Mean/Median/Mode Imputation: Replace missing values with the mean, median,
or mode of the non-missing values for that variable. Simple but may introduce bias if
missingness is not random.
o Regression Imputation: Predict missing values based on other variables using re-
gression models. Can be effective if there are strong relationships between variables.
o Multiple Imputation: Create multiple imputed datasets, each with different plausi-
ble values for missing data, and combine results from the analyses. This is a more
robust approach that accounts for the uncertainty in imputation.
o K-Nearest Neighbors (kNN) Imputation: Impute missing values based on the val-
ues of the k most similar neighbors.
3. Advanced Techniques:
o Maximum Likelihood Estimation (MLE): Estimate missing values based on a sta-
tistical model that maximizes the likelihood of the observed data.
o Expectation-Maximization (EM): An iterative algorithm that alternates between
estimating missing values and model parameters.
If you are building a model to predict house prices, and you have some null values in the "square
footage" variable, you have a few options:
Deletion: Remove rows with missing square footage. This may be acceptable if only a few
rows are affected.
Imputation:
o Mean/Median Imputation: Replace missing values with the average or median
square footage of similar houses (e.g., those with the same number of bedrooms and
bathrooms).
o Regression Imputation: Build a model to predict square footage based on other fea-
tures like the number of bedrooms, bathrooms, and location.
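The regression-imputation option can be sketched in R as below; the houses data frame and its values are made up for illustration.
#made-up data with missing square footage
houses <- data.frame(sqft = c(1500, NA, 2100, NA, 1800),
                     bedrooms = c(3, 2, 4, 3, 3),
                     bathrooms = c(2, 1, 3, 2, 2))
#fit a model on the rows where sqft is known
fit <- lm(sqft ~ bedrooms + bathrooms, data = houses, subset = !is.na(sqft))
#predict sqft for the rows where it is missing
missing <- is.na(houses$sqft)
houses$sqft[missing] <- predict(fit, newdata = houses[missing, ])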
Duplicate Values
In data science, duplicate values refer to the occurrence of identical or nearly identical records with-
in a dataset. These duplicates can arise due to various reasons, such as data entry errors, multiple
data sources, or intentional data collection methods.
1. Exact Duplicates:
o Python: Use the duplicated() method in pandas to identify exact duplicate rows.
o SQL: Use the DISTINCT keyword or GROUP BY clause to find duplicates in a data-
base table.
2. Near Duplicates:
o Fuzzy Matching: Use techniques like fuzzy string matching or similarity measures
to identify near-duplicates, where values might differ slightly due to typos or varia-
tions.
The approach to handling duplicates depends on the context and the reasons for their occurrence.
1. Removal:
o Drop Duplicates: If duplicates are true errors, remove them using methods like
drop_duplicates() in pandas or DELETE statements in SQL.
o Keep First/Last: If duplicates represent valid but redundant information, you can
keep the first or last occurrence of a duplicate.
2. Aggregation:
o Group and Aggregate: If duplicates represent multiple measurements of the same
entity, you can group the data by unique identifiers and aggregate relevant columns
(e.g., sum, average).
3. Investigation:
o Analyze Causes: If duplicates are unexpected, investigate the data collection pro-
cess to understand the reasons for their occurrence.
o Correct Errors: If duplicates are due to errors, correct the data at the source.
Imagine you have a customer dataset where some customers appear multiple times due to data entry
errors. You could:
1. Identify Duplicates: Use pandas to identify rows with identical values in key fields like
customer ID or email.
2. Investigate: Look for patterns in the duplicates to understand why they occurred.
3. Remove Duplicates: Drop the duplicate rows, keeping only one unique record for each cus-
tomer.
4. Correct Errors: If the duplicates were caused by typos or inconsistencies in data entry, cor-
rect the original data source.
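Because this course works in R, the same workflow can be sketched with base R and dplyr; the customers data frame below is made up for illustration.
customers <- data.frame(id = c(1, 2, 2, 3, 4, 4),
                        email = c("a@x.com", "b@x.com", "b@x.com",
                                  "c@x.com", "d@x.com", "d@x.com"))
#flag duplicate rows (TRUE for every repeat after the first occurrence)
duplicated(customers)
#keep only one record per customer
unique_customers <- customers[!duplicated(customers), ]
#dplyr alternative
library(dplyr)
unique_customers <- distinct(customers)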
1. Outlier Detection
Outliers are data points that significantly deviate from the rest of your dataset. They can be legiti-
mate extreme values or errors that crept into your data during collection or processing. Detecting
and handling outliers is crucial because they can distort analysis results and undermine model per-
formance.
Types of Outliers:
Global Outliers: Data points that are significantly different from the overall distribution of
the data.
Contextual Outliers: Data points that are unusual within a specific context or subgroup of
the data.
Collective Outliers: A subset of data points that, when considered together, deviate signifi-
cantly from the rest of the data.
Statistical Methods:
o Z-score: Calculates how many standard deviations a data point is from the mean.
Points beyond a certain threshold (e.g., +/- 3 standard deviations) are considered out-
liers.
o Modified Z-score: Similar to Z-score but more robust to outliers in the data itself.
o Interquartile Range (IQR): Outliers are points that fall below Q1 - 1.5 * IQR or
above Q3 + 1.5 * IQR.
o Statistical tests: Grubbs' test, Dixon's Q test, etc., for formally testing if a data point
is an outlier.
Visual Methods:
o Boxplots: Visually identify outliers as points outside the whiskers.
o Scatterplots: Outliers can be visually spotted as isolated points.
Machine Learning:
o Isolation Forest: A tree-based algorithm that isolates outliers by randomly partition-
ing the data.
o Local Outlier Factor (LOF): Measures the local density deviation of a data point
compared to its neighbors.
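As an illustration, an IQR-based check in R on a small made-up vector could look like this:
x <- c(10, 12, 11, 13, 12, 95, 11, 10, 14, 12)
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- IQR(x)
#points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
outliers <- x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
outliers
#a boxplot marks the same points beyond the whiskers
boxplot(x)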
2. Data Cleaning
Data cleaning is the process of identifying and rectifying errors, inconsistencies, and inaccuracies in
your data. It's a crucial preprocessing step in data science to ensure the reliability and quality of
your analysis. For example, in a customer dataset the workflow might look like this:
1. Outlier Detection: Identify customers with unusual spending patterns or extreme demo-
graphic values.
2. Data Cleaning:
o Impute missing income values using the median income for similar demographics.
o Correct invalid zip codes or email addresses.
o Remove duplicate customer records.
o Standardize spending amounts for comparison across customers.
In data science, data loading and inspection are fundamental steps in the data analysis pipeline.
They involve importing data from various sources and examining its structure, content, and quality
to prepare it for further analysis and modeling.
Data Loading
Data loading is the process of bringing data from external sources into your data analysis environ-
ment. Common sources include:
APIs: Web APIs that provide structured data (e.g., Twitter API, Google Maps API).
Cloud Storage: Amazon S3, Google Cloud Storage, Azure Blob Storage.
Python:
o pandas: The primary library for data loading and manipulation in Python. It provides
functions like read_csv, read_excel, read_json, read_sql, and more.
o sqlalchemy: For connecting to and querying SQL databases.
o requests: For interacting with web APIs.
R:
o readr, readxl: For reading files.
o DBI: For connecting to databases.
o httr: For working with APIs.
Other Tools:
o Database Clients: Tools like SQL Server Management Studio or MySQL Work-
bench can be used to directly load data from databases.
o Cloud Storage Clients: Cloud providers offer tools to access and download data
from their storage services.
Data Inspection
Data inspection involves examining the loaded data to understand its characteristics, identify poten-
tial issues, and prepare it for cleaning and analysis. Common inspection tasks include:
Displaying Data:
o Viewing the first few rows (e.g., using df.head() in pandas) to get a quick over-
view of the data.
o Checking the dimensions (number of rows and columns) of the dataset.
o Displaying column names to understand the variables present.
Checking Data Types:
o Verifying that each column is of the expected data type (numeric, categorical,
date/time).
o Identifying any columns that might need type conversion (e.g., converting string
dates to datetime objects).
Summarizing Data:
o Calculating descriptive statistics (mean, median, standard deviation, etc.) for numer-
ical variables.
o Getting frequency counts for categorical variables.
Detecting Missing Values:
o Identifying columns or rows with missing data.
o Deciding on a strategy to handle missing values (deletion, imputation, etc.).
Identifying Outliers:
o Checking for values that are far from the rest of the data distribution.
o Investigating the cause of outliers (errors, unusual events, natural variation).
Checking for Duplicates:
o Identifying and potentially removing duplicate rows.
Data Visualization:
o Creating histograms, scatterplots, boxplots, or other visualizations to explore the dis-
tribution and relationships between variables.
Python:
o pandas: Provides functions like describe(), info(), isnull(), value_counts(),
and many others for data inspection.
o matplotlib, seaborn: For creating visualizations.
R:
o summary(), str(), table(): For summary statistics and data type information.
o ggplot2: For creating visualizations.
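For example, a quick inspection pass on the built-in airquality data could look like this:
data(airquality)
head(airquality)            #first few rows
dim(airquality)             #number of rows and columns
str(airquality)             #column names and data types
summary(airquality)         #descriptive statistics per column
colSums(is.na(airquality))  #missing values per column
sum(duplicated(airquality)) #number of duplicate rows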
Data Transformation
Data transformation is a crucial step in data science where you modify or convert raw data into a
format that is more suitable for analysis, modeling, or visualization. It's a key component of the data
preprocessing pipeline, helping to improve the quality and usability of your data.
Feature Engineering: Creating new features or modifying existing ones to improve the per-
formance of machine learning models.
Data Cleaning: Addressing issues like missing values, outliers, or inconsistent formats.
Normalization and Scaling: Transforming data to a common scale to ensure fair compari-
son and prevent certain features from dominating others in models.
Dimensionality Reduction: Reducing the number of features while retaining essential in-
formation. This can simplify models, improve performance, and reduce computational com-
plexity.
Data Integration: Combining data from multiple sources and converting them into a uni-
fied format for analysis.
1. Scaling:
o Standardization: Transforms data to have a mean of 0 and a standard deviation of
1. Useful for algorithms that are sensitive to the scale of features, like linear regres-
sion and k-means clustering.
o Normalization: Rescales data to a specific range (e.g., 0 to 1). Useful when the dis-
tribution of the data is not Gaussian.
2. Encoding:
o One-Hot Encoding: Converts categorical variables into binary (0/1) dummy varia-
bles. Useful for algorithms that require numerical input.
o Label Encoding: Assigns a unique numerical label to each category. Suitable for
ordinal variables where the order of categories matters.
o Ordinal Encoding: Converts ordinal categorical variables into numerical represen-
tations while preserving their order.
3. Log Transformation:
o Log Transformation: Compresses the range of skewed data, making it more nor-
mally distributed. Useful for features with a long tail, such as income or population.
4. Power Transformation:
o Box-Cox Transformation: Finds an optimal power transformation to make data
more normally distributed.
o Yeo-Johnson Transformation: Similar to Box-Cox but can handle negative values.
5. Aggregation:
o Grouping: Combining data points based on a categorical variable and calculating
summary statistics (mean, sum, count, etc.).
o Rolling: Calculating statistics over a moving window of time or observations.
6. Discretization:
o Binning: Dividing continuous data into discrete intervals or bins. Can help capture
non-linear relationships.
7. Feature Creation:
o Creating Interaction Terms: Combining two or more features to capture their
combined effect on the target variable.
o Polynomial Features: Adding polynomial terms (e.g., x^2, x^3) to capture non-
linear relationships.
For example, in a house-price dataset you might do the following:
Scaling: Standardize the "square footage" and "number of bedrooms" features to have a
mean of 0 and a standard deviation of 1.
Encoding: One-hot encode categorical features like "neighborhood" and "house style."
Log Transformation: Apply a log transformation to the "price" feature, which is typically
right-skewed.
Feature Creation: Create a new feature "age of house" by subtracting the year built from
the current year.
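These four transformations can be sketched in R as below; the houses data frame and its values are made up for illustration.
houses <- data.frame(sqft = c(1400, 2600, 1850),
                     neighborhood = c("North", "South", "North"),
                     price = c(210000, 540000, 310000),
                     year_built = c(1995, 2010, 1988))
#scaling: standardize square footage to mean 0, sd 1
houses$sqft_z <- as.numeric(scale(houses$sqft))
#encoding: model.matrix() one-hot encodes the neighborhood variable
dummies <- model.matrix(~ neighborhood - 1, data = houses)
#log transformation of the right-skewed price
houses$log_price <- log(houses$price)
#feature creation: age of the house
houses$age <- as.integer(format(Sys.Date(), "%Y")) - houses$year_built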
Cross-validation
Cross-validation (CV) is a resampling technique widely used in data science to assess the perfor-
mance of machine learning models, especially in cases where you have limited data. It helps to
avoid overfitting and provides a more reliable estimate of how well your model will generalize to
new, unseen data.
Key Idea:
The core idea behind cross-validation is to divide your dataset into multiple subsets (folds). You
then train your model on a combination of these folds and evaluate its performance on the remain-
ing fold. This process is repeated multiple times, using a different fold as the test set each time. Fi-
nally, you average the performance metrics across all folds to get a more robust estimate of your
model's performance.
Types of Cross-Validation:
1. K-Fold Cross-Validation:
o The most common type.
o Divides the dataset into K equally sized folds.
o Trains the model on K-1 folds and tests it on the remaining fold.
o Repeats this process K times, using each fold as the test set once.
o Averages the performance metrics across all K folds.
2. Stratified K-Fold Cross-Validation:
o Similar to K-fold but ensures that the distribution of classes (in classification prob-
lems) is preserved in each fold. This is particularly important when dealing with im-
balanced datasets.
3. Leave-One-Out Cross-Validation (LOOCV):
o A special case of K-fold where K is equal to the number of data points.
o Each fold consists of a single data point.
o Computationally expensive for large datasets.
4. Leave-P-Out Cross-Validation:
o A generalization of LOOCV where you leave out P data points for testing.
o Computationally even more expensive than LOOCV.
5. Time Series Cross-Validation:
o Specifically designed for time series data.
o Splits the data into training and test sets based on time, preserving the temporal or-
der.
Benefits of Cross-Validation:
Reduced Overfitting: By evaluating the model on multiple test sets, cross-validation reduc-
es the risk of overfitting.
Better Generalization Estimate: Provides a more reliable estimate of how the model will
perform on unseen data.
Efficient Use of Data: Makes better use of limited data compared to a simple train-test split.
Hyperparameter Tuning: Can be used to select the best hyperparameters for your model
by comparing performance across different folds.
Model Selection: Choose the best model among different algorithms or configurations.
Model Evaluation: Assess the performance of your final model on unseen data.
Feature Selection: Determine which features are most important for your model.
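As a sketch, k-fold and repeated k-fold cross-validation can be run with the caret package (one common choice, not the only one); the model and data below are only placeholders.
library(caret)
data(mtcars)
#repeated 10-fold cross-validation: 10 folds, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
#fit a linear model for mpg and estimate its out-of-sample performance
cv_model <- train(mpg ~ wt + hp, data = mtcars,
                  method = "lm", trControl = ctrl)
cv_model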
UNIT–IV
Modelling Methods
Supervised: Regression Analysis in R, linear regression, logistic regression,naive bayes classifier,
decision tree, random forest, knn classifier,
Unsupervised: kmeans clustering, association rule mining, apriori algorithm.
Supervised and unsupervised learning are two fundamental paradigms in data science and machine
learning. They differ primarily in how they approach the learning process and the types of tasks
they are suited for.
Supervised Learning
Data: Uses labeled data, where each input data point has a corresponding output label or
value.
Goal: Learn a function that maps inputs to outputs, enabling predictions on new, unseen da-
ta.
Examples:
o Classification: Predicting if an email is spam or not (label is "spam" or "not spam").
o Regression: Predicting the price of a house based on its features (label is the house
price).
Unsupervised Learning
Data: Uses unlabeled data, where data points do not have associated labels.
Goal: Discover patterns, relationships, or structures in the data.
Examples:
o Clustering: Grouping similar customers based on their purchasing behavior.
o Dimensionality Reduction: Finding a lower-dimensional representation of high-
dimensional data.
o Association Rule Mining: Discovering relationships between items in a shopping
cart (e.g., people who buy bread also buy butter).
Regression Analysis
Regression analysis is a statistical technique used to model the relationship between a dependent
variable (the outcome you want to predict) and one or more independent variables (predictors). It's a
cornerstone of supervised machine learning, where you have labeled data to train your model.
Types of Regression in R
Linear Regression (lm): Models a linear relationship between the dependent and independ-
ent variables.
Generalized Linear Models (glm): Extends linear regression to accommodate non-normal
response distributions (e.g., logistic regression for binary outcomes, Poisson regression for
count data).
Polynomial Regression: Models non-linear relationships using polynomial functions of the
predictors.
Robust Regression: Deals with outliers by using robust estimation techniques.
Ridge Regression and Lasso Regression: Linear regression methods that add regulariza-
tion terms to prevent overfitting.
Elastic Net Regression: Combines the features of Ridge and Lasso regression.
A typical regression analysis in R follows these steps:
1. Data Preparation:
o Load your data into an R data frame.
o Clean the data: Handle missing values, outliers, and inconsistencies.
o Explore the data: Visualize relationships between variables using scatterplots, histo-
grams, etc.
2. Model Building:
o Choose the appropriate regression type based on your data and research question.
o Fit the model using functions like lm() or glm().
o Specify the formula, which defines the relationship between the dependent and inde-
pendent variables.
3. Model Evaluation:
o Assess the model's goodness of fit using metrics like R-squared, adjusted R-squared,
AIC, or BIC.
o Check model assumptions (e.g., linearity, normality of residuals, homoscedasticity)
using diagnostic plots.
4. Model Interpretation:
o Examine the estimated coefficients to understand the direction and magnitude of the
relationship between predictors and the outcome.
o Use confidence intervals and p-values to assess the statistical significance of the co-
efficients.
5. Prediction (Optional):
o If your goal is prediction, use the fitted model to predict the outcome for new values
of the predictors.
Code snippet
# Load required library
library(datasets)
# Fit a linear regression model for mpg using weight and horsepower (mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
# Make predictions on new data (e.g., a car with wt = 2.5 and hp = 120)
new_data <- data.frame(wt = 2.5, hp = 120)
predicted_mpg <- predict(model, new_data)
Additional Libraries in R
Naive Bayes
Naive Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It's called "na-
ive" because it makes a strong assumption: it assumes that all features (attributes) in your data are
conditionally independent of each other given the class label. This means that the presence of one
feature doesn't affect the presence of another when you know the class. While this assumption may
not always hold true in real-world data, it often works surprisingly well in practice.
1. Training:
o The algorithm learns the probabilities of each feature value occurring within each
class.
o It also learns the prior probabilities of each class (how likely each class is to occur
overall).
2. Prediction:
o Given a new data point, Naive Bayes calculates the probability of it belonging to
each class using Bayes' Theorem:
P(class | features) = P(features | class) * P(class) / P(features)
Where:
o P(class | features) is the posterior probability (the probability of the class given the
features).
o P(features | class) is the likelihood (the probability of the features given the class).
o P(class) is the prior probability of the class.
o P(features) is the evidence (the probability of the features occurring at all).
It then predicts the class with the highest posterior probability.
Gaussian Naive Bayes: Assumes that continuous features follow a normal (Gaussian) dis-
tribution.
Multinomial Naive Bayes: Suitable for discrete data (e.g., text classification, where fea-
tures represent word counts).
Bernoulli Naive Bayes: Used for binary data (features are either present or absent).
Advantages:
Simple and Easy to Implement: The algorithm is easy to understand and code.
Fast: It's computationally efficient, making it suitable for large datasets.
Works well with high-dimensional data: It can handle datasets with many features.
Good for text classification: Commonly used for tasks like spam filtering and sentiment
analysis.
Disadvantages:
Naive Assumption of Independence: The assumption that features are independent may
not hold true in real-world data.
Zero Frequency Problem: If a feature value is not seen in the training data, it will have a
zero probability, which can affect the model's predictions.
Code snippet
library(e1071)
# Assuming you have your data in a data frame called 'data' with a text column
# 'text' and a category column 'label' (features would need to be extracted
# from the raw text first, e.g. as word counts)
model <- naiveBayes(label ~ ., data = data)
predictions <- predict(model, data)
Applications:
Spam filtering
Sentiment analysis
Document classification
Medical diagnosis
Fraud detection
Decision Trees
A decision tree is a versatile supervised machine learning algorithm used for both classification and
regression tasks. It's a flowchart-like model where each internal node represents a feature (or attrib-
ute), each branch represents a decision rule based on that feature, and each leaf node represents the
outcome or prediction.
1. Recursive Partitioning:
o The algorithm starts at the root node, which represents the entire dataset.
o It selects the feature that best splits the data based on a purity criterion (like Gini im-
purity or information gain).
o The data is then split into subsets based on the values of the selected feature.
o This process is repeated recursively on each subset until a stopping criterion is met
(e.g., all data points in a node belong to the same class, or a maximum tree depth is
reached).
2. Prediction:
o To predict the outcome for a new data point, start at the root node and follow the
branches based on the feature values of the data point until you reach a leaf node.
o The value in the leaf node represents the predicted outcome.
Example: Predicting Customer Churn
Imagine you have a dataset of customers with features like age, income, and contract length. You
want to build a decision tree to predict whether a customer will churn (stop using your service).
Interpretability: Decision trees are easy to visualize and understand, making them valuable
for explaining model predictions to stakeholders.
Handles Both Categorical and Numerical Data: You don't need to preprocess categorical
features.
Non-Linear Relationships: Can capture complex non-linear relationships between features
and the target variable.
Feature Importance: The algorithm implicitly performs feature selection by prioritizing the
most informative features at the top of the tree.
Overfitting: Prone to overfitting, especially with deep trees. Pruning techniques can help
mitigate this.
Instability: Small changes in the data can lead to significantly different tree structures.
Greedy Algorithm: The algorithm makes locally optimal decisions at each node, which
may not lead to the globally optimal tree.
Code snippet
library(rpart)
# Assuming you have a data frame 'data' with a target variable 'churn'
model <- rpart(churn ~ ., data = data, method = "class")
predictions <- predict(model, data, type = "class")
Random Forest
A Random Forest is a supervised machine learning algorithm that combines the output of multiple
decision trees to reach a single, more accurate prediction. It's a type of ensemble learning, where the
predictions of several individual models are combined to produce a final result. The "randomness"
in Random Forest comes from two main sources:
1. Bootstrapping: Each decision tree in the forest is trained on a random subset (bootstrap
sample) of the original training data. This introduces diversity among the trees, as each one
sees slightly different data.
2. Feature Randomness: At each node of the decision tree, the algorithm randomly selects a
subset of the available features for splitting. This further decorrelates the trees and helps to
prevent overfitting.
1. Building the Forest: The algorithm creates multiple decision trees (the number is a hy-
perparameter). Each tree is trained on a different bootstrap sample of the data, and at each
node, a random subset of features is considered for splitting.
2. Making Predictions (Classification):
o Each tree makes a prediction (e.g., the class label for a new data point).
o The final prediction is determined by majority vote (the class that receives the most
votes among the trees).
3. Making Predictions (Regression):
o Each tree makes a prediction (e.g., a numerical value).
o The final prediction is the average of the predictions from all the trees.
High Accuracy: Often achieves state-of-the-art performance in both classification and re-
gression tasks.
Handles Large Datasets and High-Dimensional Data: Can handle datasets with many
features and observations.
Reduces Overfitting: The randomness in the algorithm helps to reduce overfitting and im-
prove generalization to new data.
Robust to Noise and Outliers: Less sensitive to outliers and noisy data compared to indi-
vidual decision trees.
Provides Feature Importance: Gives an estimate of the importance of each feature in mak-
ing predictions.
Recommendation Systems: Recommending products or movies to users.
Anomaly Detection: Identifying unusual patterns or outliers in data.
Code snippet
library(randomForest)
# Assuming you have your data in a data frame called 'data' with a target
# variable 'label' (stored as a factor for classification)
model <- randomForest(label ~ ., data = data, ntree = 500)
print(model)
Knn Classifiers
What is KNN?
KNN is a supervised machine learning algorithm commonly used for both classification and regres-
sion tasks. It's a non-parametric algorithm, meaning it doesn't make any assumptions about the un-
derlying data distribution. Instead, it relies on the concept of similarity (or distance) between data
points.
1. Choose K: The first step is to choose the value of K, which represents the number of nearest
neighbors to consider.
2. Calculate Distances: For a new data point you want to classify, calculate the distance be-
tween this point and all other points in the training dataset. Common distance measures in-
clude Euclidean distance, Manhattan distance, and others.
3. Identify K-Nearest Neighbors: Select the K data points closest to the new data point based
on the calculated distances.
4. Majority Voting: In classification, the new data point is assigned the class label that is most
common among its K nearest neighbors.
Illustrative diagram (source: www.researchgate.net): a scatter plot of data points with different
classes, highlighting the k nearest neighbors of a new data point.
Advantages of KNN:
Simple and Easy to Understand: The concept of KNN is intuitive and easy to grasp.
No Training Time: The model doesn't need to be explicitly trained; it simply stores the
training data.
Versatile: Can be used for both classification and regression.
Non-Parametric: Doesn't make assumptions about the data distribution.
Works Well with Small Datasets: Can be effective even with limited data.
Disadvantages of KNN:
Computationally Expensive: Calculating distances to all data points for each prediction
can be computationally expensive, especially for large datasets.
Sensitive to Irrelevant Features: If your dataset has many irrelevant features, KNN per-
formance can suffer. Feature selection or dimensionality reduction might be necessary.
Sensitive to the Scale of Features: Features with larger ranges can dominate the distance
calculations. Scaling or normalization of features is often needed.
Choosing the Right K: The value of K can significantly impact model performance. It's of-
ten determined through cross-validation.
Code snippet
library(class)
# Assuming your data is already split into train_data and test_data, each with
# the target variable 'label' in the last column
# Train the KNN model (no explicit training step for KNN)
# Make predictions on the test set (using k = 5)
predictions <- knn(train_data[, -ncol(train_data)],
                   test_data[, -ncol(test_data)],
                   train_data$label, k = 5)
# Evaluate accuracy
accuracy <- mean(predictions == test_data$label)
K-means Clustering
K-means clustering is an algorithm that aims to partition a dataset into a pre-defined number of
clusters (denoted by 'K') in such a way that each data point belongs to the cluster with the nearest
cluster centre, and the points within a cluster are as similar to each other as possible.
The similarity between data points is usually measured using distance metrics like Euclidean dis-
tance or Manhattan distance.
The optimal value of K is not always obvious and often requires experimentation. Common methods include:
Elbow Method: Plot the sum of squared distances (within-cluster sum of squares - WCSS)
against different values of K. The "elbow" point where the WCSS starts to decrease more
slowly is often a good choice for K.
Silhouette Analysis: Measures how well each data point fits within its assigned cluster.
Higher silhouette scores indicate better clustering quality.
Simple and Easy to Understand: The algorithm is relatively straightforward and easy to
implement.
Efficient: It scales well to large datasets.
Versatile: Can be applied to various types of data.
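An illustrative run of k-means and the elbow method in R, using the numeric columns of the built-in iris data:
data(iris)
set.seed(42)
#cluster the four measurement columns into K = 3 groups
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$size          #cluster sizes
km$tot.withinss  #total within-cluster sum of squares (WCSS)
#elbow method: WCSS for K = 1 to 10
wcss <- sapply(1:10, function(k)
  kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b", xlab = "K", ylab = "Total within-cluster SS")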
Association rule mining (ARM), also known as market basket analysis, is an unsupervised machine
learning method used to discover interesting relationships (affinities) between variables in large da-
tabases. It's particularly useful for analyzing transactional data, where you have records of items
that are often purchased or used together.
The key concept in ARM is to find association rules, which are if-then statements that express the
likelihood of certain items occurring together in a transaction. For example, a classic association
rule might be:
"If a customer buys bread, then they are also likely to buy butter (with 80% confidence)."
1. Support: The proportion of transactions that contain both the antecedent (the "if" part) and
the consequent (the "then" part) of the rule. It indicates how frequently the itemset appears
in the dataset.
2. Confidence: The proportion of transactions containing the antecedent that also contain the
consequent. It measures how often the rule is found to be true.
3. Lift: The ratio of the observed support of the rule to the expected support if the antecedent
and consequent were independent. A lift greater than 1 indicates that the items are associat-
ed more often than would be expected by chance.
Apriori Algorithm
One of the most well-known algorithms for association rule mining is the Apriori algorithm. It uses
a "bottom-up" approach, starting by identifying frequent individual items and then iteratively gen-
erating larger and larger itemsets until no more frequent itemsets can be found.
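In R, the apriori algorithm is available in the arules package; the support and confidence thresholds below are arbitrary examples.
library(arules)
data(Groceries)
#mine rules with minimum support 1% and minimum confidence 50%
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
#inspect the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))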
Imagine a grocery store wants to analyze its transaction data to identify which products are fre-
quently bought together. Using association rule mining, they might find rules like:
If a customer buys diapers, they are also likely to buy baby wipes (with 70% confidence).
If a customer buys beer, they are also likely to buy chips (with 60% confidence).
This information could be used to create targeted promotions or adjust store layouts to increase
sales.
Scalability: ARM can become computationally expensive for very large datasets.
Interpreting Results: It's important to consider the context and business knowledge when
interpreting association rules. Not all rules may be actionable or meaningful.
Rare Itemsets: It can be challenging to find rules for items that occur infrequently.
Linear and Logistic Regression :
Linear models are especially useful when you don’t want only to predict an outcome, but also to
know the relationship between the input variables and the outcome. This knowledge can prove
useful because this relationship can often be used as advice on how to get the outcome that you
want. We’ll first define linear regression and then use it to predict customer income. Later, we will
use logistic regression to predict the probability that a newborn baby will need extra medical
attention. We’ll also walk through the diagnostics that R produces when you fit a linear or logistic
model.Linear methods can work well in a surprisingly wide range of situations. However, there can
be issues when the inputs to the model are correlated or collinear. In the case of logistic regression,
there can also be issues (ironically) when a subset of the variables predicts a classification output
perfectly in a subset of the training data.
Linear regression is the bread and butter prediction method for statisticians and data scientists. If
you’re trying to predict a numerical quantity like profit, cost, or sales volume, you should always
try linear regression first. If it works well, you’re done; if it fails, the detailed diagnostics produced
can give you a good clue as to what methods you should try next.
Linear regression assumes that the outcome pounds_lost is linearly related to each of the inputs
daily_cals_down[i] and daily_exercise[i]. This means that the relationship between (for instance)
daily_cals_down[i] and pounds_lost looks like a (noisy) straight line, as shown in figure 7.2.1
The relationship between daily_exercise and pounds_lost would similarly be a straight line.
Suppose that the equation of the line shown in figure 7.2 is
pounds_lost = b0 + b.cals * daily_cals_down
This means that for every unit change in daily_cals_down (every calorie reduced), the value of
pounds_lost changes by b.cals, no matter what the starting value of daily_cals_down was. To make
it concrete, suppose pounds_lost = 3 + 2 * daily_ cals_down. Then increasing daily_cals_down by
one increases pounds_lost by 2, no matter what value of daily_cals_down you start with. This
would not be true for, say, pounds_lost = 3 + 2 * (daily_cals_down^2).
Linear regression further assumes that the total pounds lost is a linear combination of our variables
daily_cals_down[i] and daily_exercise[i], or the sum of the pounds lost due to reduced caloric
intake, and the pounds lost due to exercise. This gives us the following form for the linear
regression model of pounds_lost:
pounds_lost[i] = b0 + b.cals * daily_cals_down[i] +
b.exercise * daily_exercise[i]
The goal of linear regression is to find the values of b0, b.cals, and b.exercise so that the linear
combination of daily_cals_lost[i] and daily_exercise[i] (plus some offset b0) comes very close to
pounds_lost[i] for all persons i in the training data. Let’s put this in more general terms. Suppose
that y[i] is the numeric quantity you want to predict (called the dependent or response variable), and
x[i,] is a row of inputs that corresponds to output y[i] (the x[i,] are the independent or explanatory
variables). Linear regression attempts to find a function f(x) such that
y[i] ~ f(x[i,]) + e[i] = b[0] + b[1] * x[i,1] + ... + b[n] * x[i,n] + e[i]
By assuming that the noise is unsystematic, linear regression tries to fit what is called an “unbiased”
predictor. This is another way of saying that the predictor gets the right answer “on average” over
the entire training set, or that it underpredicts about as much as it overpredicts. In particular,
unbiased estimates tend to get totals correct.
Example Suppose you have fit a linear regression model to predict weight loss based on reduction
of caloric intake and exercise. Now consider the set of subjects in the training data, LowExercise,
who exercised between zero and one hour a day. Together, these subjects lost a total of 150 pounds
over the course of the study. How much did the model predict they would lose?
With a linear regression model, if you take the predicted weight loss for all the subjects in Low
Exercise and sum them up, that total will sum to 150 pounds, which means that the model predicts
the average weight loss of a person in the Low Exercise group correctly, even though some of the
individuals will have lost more than the model predicted, and some of them will have lost less. In a
business setting, getting sums like this correct is critical, particularly when summing up monetary
amounts. Under these assumptions (linear relationships and unsystematic noise), linear regression is
absolutely relentless in finding the best coefficients b[i]. If there’s some advantageous combination
or cancellation of features, it’ll find it. One thing that linear regression doesn’t do is reshape
variables to be linear. Oddly enough, linear regression often does an excellent job, even when the
actual relation is not in fact linear.
For this task, you will use the 2016 US Census PUMS dataset. For simplicity, we have
prepared a small sample of PUMS data to use for this example. The data preparation
steps include these:
Restricting the data to full-time employees between 20 and 50 years of age, with
an income between $1,000 and $250,000.
Dividing the data into a training set, dtrain, and a test set, dtest.
Each row of PUMS data represents a single anonymized person or household. Personal data
recorded includes occupation, level of education, personal income, and many other demographic
variables. For this example we have decided to predict log10(PINCP), or the logarithm of income.
Fitting logarithm-transformed data typically gives results with smaller relative error, emphasizing
smaller errors on smaller incomes. But this improved relative error comes at a cost of introducing a
bias: on average, predicted incomes are going to be below actual training incomes. An unbiased
alternative to predicting log(income) would be to use a type of generalized linear model called
Poisson regression. The Poisson regression is unbiased, but typically at the cost of larger relative
errors.1 For the analysis in this section, we’ll consider the input variables age (AGEP), sex (SEX),
class of worker (COW), and level of education (SCHL). The output variable is personal income
(PINCP). We’ll also set the reference level, or “default” sex to M (male); the reference level of class
of worker to Employee of a private for-profit; and the reference level of education level to no high
school diploma.
BUILDING A LINEAR REGRESSION MODEL
The first step in either prediction or finding relations (advice) is to build the linear regression model.
The function to build the linear regression model in R is lm(), supplied by the stats package. The
most important argument to lm() is a formula with ~ used in place of an equals sign. The formula
specifies what column of the data frame is the quantity to be predicted, and what columns are to be
used to make the predictions. Statisticians call the quantity to be predicted the dependent variable
and the variables/ columns used to make the prediction the independent variables. We find it is
easier to call the quantity to be predicted the y and the variables used to make the predictions the xs.
Our formula is this: log10(PINCP) ~ AGEP + SEX + COW + SCHL, which is read “Predict the log
base 10 of income as a function of age, sex, employment class, and education.”
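Putting this together, and assuming the prepared training frame dtrain described above, the call is roughly:
model <- lm(log10(PINCP) ~ AGEP + SEX + COW + SCHL, data = dtrain)
summary(model)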
R STORES TRAINING DATA IN THE MODEL R holds a copy of the training data in
its model to supply the residual information seen in summary(model). Holding a copy of the data
this way is not strictly necessary, and can needlessly run you out of memory. If you’re running low
on memory (or swapping), you can dispose of R objects like model using the rm() command. In this
case, you’d dispose of the model by running rm("model").
MAKING PREDICTIONS:
Once you’ve called lm() to build the model, your first goal is to predict income. This is easy to do in
R. To predict, you pass data into the predict() method. Figure demonstrates this using both the test
and training data frames dtest and dtrain.
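The corresponding calls look roughly like this:
dtest$predLogPINCP <- predict(model, newdata = dtest)
dtrain$predLogPINCP <- predict(model, newdata = dtrain)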
The data frame columns dtest$predLogPINCP and dtrain$predLogPINCP now store the predictions
for the test and training sets, respectively. We have now both produced and applied a linear
regression model.
Logistic regression is used to predict the probability of a binary outcome. Suppose, for example, that
you want to predict whether a flight will be delayed, based on
facts like the flight’s origin and destination, weather, and air carrier. For every flight i, you want to
predict flight_delayed[i] based on origin[i], destination[i], weather[i], and air_carrier[i].
We’d like to use linear regression to predict the probability that a flight i will be delayed, but
probabilities are strictly in the range 0:1, and linear regression doesn’t restrict its prediction to that
range.
One idea is to find a function of probability that is in the range -Infinity:Infinity, fit a linear model
to predict that quantity, and then solve for the appropriate probabilities from the model predictions.
So let’s look at a slightly different problem: instead of predicting the probability that a flight is
delayed, consider the odds that the flight is delayed, or the ratio of the probability that the flight is
delayed over the probability that it is not.
odds[flight_delayed] = P[flight_delayed == TRUE] / P[flight_delayed == FALSE]
The range of the odds function isn’t -Infinity:Infinity; it’s restricted to be a nonnegative
number. But we can take the log of the odds---the log-odds---to get a function of the probabilities
that is in the range -Infinity:Infinity.
log_odds[flight_delayed] = log(P[flight_delayed == TRUE] / P[flight_delayed == FALSE])
Let: p = P[flight_delayed == TRUE]; then
log_odds[flight_delayed] = log(p / (1 - p))
Note that if it’s more likely that a flight will be delayed than on time, the odds ratio will be greater
than one; if it’s less likely that a flight will be delayed than on time, the odds ratio will be less than
one. So the log-odds is positive if it’s more likely that the flight will be delayed, negative if it’s
more likely that the flight will be on time, and zero if the chances of delay are 50-50.
The log-odds of a probability p is also known as logit(p). The inverse of logit(p) is the sigmoid
function. The sigmoid function maps values in the range -Infinity:Infinity to the range 0:1; in
this case, the sigmoid maps unbounded log-odds ratios to a probability value that is between
0 and 1.
logit <- function(p) { log(p/(1-p)) }
s <- function(x) { 1/(1 + exp(-x))}
s(logit(0.7))
# [1] 0.7
logit(s(-2))
# [1] -2
BUILDING A LOGISTIC REGRESSION MODEL
The function to build a logistic regression model in R is glm(), supplied by the stats package. In our
case, the dependent variable y is the logical (or Boolean) atRisk; all the other variables in table 7.1
are the independent variables x. The formula for building a model to predict atRisk using these
variables is rather long to type in by hand; you can generate the formula using the mk_formula()
function from the wrapr package, as shown next.
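A sketch of the model-building step. The outcome variable is atRisk; the predictor names below are placeholders standing in for the full set of variables from table 7.1, and the training data frame is assumed to be called train, as referenced below:
library(wrapr)
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3")   # placeholder predictor names
fmla <- mk_formula(y, x)                           # builds atRisk ~ PWGT + UPREVIS + ...
model <- glm(fmla, data = train, family = binomial(link = "logit"))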
This is similar to the linear regression call to lm(), with one additional argument:
family = binomial(link = "logit"). The family function specifies the assumed distribution of the
dependent variable y. In our case, we’re modeling y as a binomial distribution, or as a coin whose
probability of heads depends on x. The link function “links” the output to a linear model—it’s as if
you pass y through the link function, and then model the resulting value as a linear function of the x
values. Different combinations of family functions and link functions lead to different kinds of
generalized linear models (for example, Poisson, or probit). In this book, we’ll only discuss logistic
models, so we’ll only need to use the binomial family with the logit link.
MAKING PREDICTIONS
Making predictions with a logistic model is similar to making predictions with a linear model—use
the predict() function. The following code stores the predictions for the training and test sets as the
column pred in the respective data frames.
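A minimal sketch (assuming the model and the train/test data frames from the previous step):
# predicted probabilities for the training and test sets
train$pred <- predict(model, newdata = train, type = "response")
test$pred  <- predict(model, newdata = test, type = "response")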
Note the additional parameter type = "response". This tells the predict() function to return the
predicted probabilities y. If you don’t specify type = "response", then by default predict() will return
the output of the link function, logit(y). One strength of logistic regression is that it preserves the
marginal probabilities of the training data. That means that if you sum the predicted probability
scores for the entire training set, that quantity will be equal to the number of positive outcomes
(atRisk == TRUE) in the training set. This is also true for subsets of the data determined by
variables included in the model. For example, in the subset of the training data that has
train$GESTREC == "<37 weeks" (the baby was premature), the sum of the predicted probabilities
equals the number of positive training examples.
Example: multiple linear regression with the mtcars dataset.
# Creating the input data and inspecting it.
input <- mtcars[, c("mpg","wt","disp","hp")]
print(head(input))
# Creating the relationship model.
Model <- lm(mpg ~ wt + disp + hp, data = input)
# Showing the Model.
print(Model)
# Extracting the intercept and the coefficients.
b0 <- coef(Model)[1]
print(b0)
x_wt <- coef(Model)[2]
x_disp <- coef(Model)[3]
x_hp <- coef(Model)[4]
print(x_wt)
print(x_disp)
print(x_hp)
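The fitted coefficients can be combined to predict mileage for a new car; the weight, displacement, and horsepower values below are arbitrary illustration values:
# predict mpg for a hypothetical car with wt = 2.91, disp = 221, hp = 102
Y <- b0 + x_wt * 2.91 + x_disp * 221 + x_hp * 102
print(Y)
# the same prediction using predict()
print(predict(Model, newdata = data.frame(wt = 2.91, disp = 221, hp = 102)))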
UNIT-V
Data visualization with R
Introduction to ggplot2: A worked example, Placing the data and mapping options, Graphs as
objects, Univariate Graphs: Categorical, Quantitative.
Introduction to ggplot2:
This worked example uses the CPS85 data frame from the mosaicData package (a sample of workers from the 1985 Current Population Survey) to explore the relationship between work experience and hourly wage.
# load data
data(CPS85, package = "mosaicData")
In building a ggplot2 graph, only the first two functions described below are required. The other
functions are optional and can appear in any order.
5.1.1 ggplot
The first function in building a graph is the ggplot function. It specifies the data frame containing the data to be plotted and the mapping of variables to visual properties of the graph. The mappings are placed within the aes function (where aes stands for aesthetics).
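For example, the following call (a sketch of this first step) specifies the CPS85 data and maps experience to the x axis and wage to the y axis:
# specify dataset and mapping
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage))
By itself this produces an empty graph, because no geometric objects have been added yet.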
5.1.2 geoms
Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are
added using functions that start with geom_. In this example, we’ll add points using
the geom_point function, creating a scatterplot.
In ggplot2 graphs, functions are chained together using the + sign to build a final plot.
# add points
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
The graph indicates that there is an outlier. One individual has a wage much higher than the rest.
We’ll delete this case before continuing.
# delete outlier
library(dplyr)
plotdata <- filter(CPS85, wage < 40)

# redraw scatterplot
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
The color, size, and alpha options of geom_point control the point color, point size, and transparency, respectively. Transparency ranges from 0 (completely transparent) to 1 (completely opaque). Adding a degree of transparency can help visualize overlapping points.
# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3)
Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control
the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and
the presence or absence of a confidence interval. Here we request a linear regression (method =
lm) line (where lm stands for linear model).
# add a line of best fit.
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm")
5.1.3 grouping
In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape,
size, transparency, and other visual characteristics of geometric objects. This allows groups of
observations to be superimposed in a single graph.
The color = sex option is placed in the aes function, because we are mapping a variable to an
aesthetic. The geom_smooth option (se = FALSE) was added to suppress the confidence intervals.
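A sketch of the corresponding call (the point size and transparency values are assumptions):
# plot experience vs. wage, colored by sex, with a separate trend line per group
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE)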
It appears that men tend to make more money than women. Additionally, there may be a stronger
relationship between experience and wages for men than for women.
5.1.4 scales
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions
(which start with scale_) allow you to modify this mapping. In the next plot, we’ll change
the x and y axis scaling, and the colors employed.
# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              size = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     label = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue"))
5.1.5 facets
Facets reproduce a graph for each level of a given variable (or combination of variables). Facets are created using functions that start with facet_. Here, facets will be defined by the eight levels of the sector variable.
# reproduce plot for each level of job sector
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     label = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector)
Figure: Add job sector, using faceting
It appears that the differences between men and women depend on the job sector under consideration.
5.1.6 labels
Graphs should be easy to interpret, and informative labels are a key element in achieving this
goal. The labs function provides customized labels for the axes and legends. Additionally, a
custom title, subtitle, and caption can be added.
# add informative labels
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     label = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")
Figure: Add informative titles and labels
Now a viewer doesn’t need to guess what the labels exper and wage mean, or where the data
come from.
5.1.7 themes
Finally, we can fine tune the appearance of the graph using themes. Theme functions (which start
with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data
related features of the graph. Let’s use a cleaner theme.
# use a minimalist theme
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     label = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()
Figure: Use a simpler theme
Now we have something. It appears that men earn more than women in management, manufacturing, sales, and the “other” category. They are most similar in clerical, professional, and service positions. The data contain no women in the construction sector. For management positions, wages appear to be related to experience for men, but not for women (this may be the most interesting finding). This also appears to be true for sales.
Of course, these findings are tentative. They are based on a limited sample size and do not involve statistical testing to assess whether differences may be due to chance variation.
5.2 Placing the data and mapping options
Plots created with ggplot2 always start with the ggplot function. In the examples above,
the data and mapping options were placed in this function. In this case they apply to
each geom_ function that follows.
You can also place these options directly within a geom. In that case, they apply only to
that specific geom.
Consider the following graph.
Figure: Color mapping in ggplot function
Since the mapping of sex to color appears in the ggplot function, it applies
to both geom_point and geom_smooth. The color of the point indicates the sex, and a separate
colored trend line is produced for men and women. Compare this to
# placing color mapping in the geom_point function
ggplot(plotdata,
       aes(x = exper,
           y = wage)) +
  geom_point(aes(color = sex),
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2),
              se = FALSE,
              size = 1.5)
Since the sex to color mapping only appears in the geom_point function, it is only used there. A
single trend line is created for all observations.
Most of the examples that follow place the data and mapping options in the ggplot function.
Additionally, the phrases data = and mapping = are omitted, since the first option always refers to
data and the second option always refers to mapping.
5.3 Graphs as objects
A ggplot2 graph can be saved as a named R object, modified further, and printed later. The data preparation step below recreates the plotdata data frame, this time using base R indexing rather than dplyr::filter.
# prepare data
plotdata <- CPS85[CPS85$wage < 40, ]
5.4 Univariate Graphs
Univariate graphs display the distribution of a single variable, which may be categorical or quantitative.
5.4.1 Categorical
The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart,
or (less commonly) a tree map.
library(ggplot2)
data(Marriage, package = "mosaicData")
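5.4.1.1 Bar chart
A simple bar chart of the race variable can be produced with geom_bar (a minimal sketch; the axis labels and title are optional):
# plot the distribution of race
ggplot(Marriage, aes(x = race)) +
  geom_bar() +
  labs(x = "Race",
       y = "Frequency",
       title = "Participants by race")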
Figure: Simple barchart
5.4.1.1.1 Percents
Bars can represent percents rather than counts. For bar charts, the code aes(x = race) is actually a
shortcut for aes(x = race, y = ..count..), where ..count.. is a special variable representing the
frequency within each category. You can use this to calculate percentages by specifying
the y variable explicitly.
# plot the distribution as percentages
ggplot(Marriage,
       aes(x = race,
           y = ..count.. / sum(..count..))) +
  geom_bar() +
  labs(x = "Race",
       y = "Percent",
       title = "Participants by race") +
  scale_y_continuous(labels = scales::percent)
# calculate the number of participants in each race category
library(dplyr)
plotdata <- Marriage %>%
  count(race)
The resulting dataset is given below.
race n
American Indian 1
Black 22
Hispanic 1
White 74
Plotting counts from this data frame with reorder(race, n) on the x axis sorts the bars in ascending order; use reorder(race, -n) to sort in descending order. For example:
# plot the bars in ascending order
ggplot(plotdata,
       aes(x = reorder(race, n),
           y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Race",
       y = "Frequency",
       title = "Participants by race")
5.4.1.2 Pie chart
Pie charts are controversial in statistics. If your goal is to compare the frequency of categories, you are better off with bar charts (humans are better at judging the length of bars than the volume of pie slices). If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work for you. It takes a bit more code to make an attractive pie chart in R.
# create the data for the pie chart: proportions and label positions
library(dplyr)
plotdata <- Marriage %>%
  count(race) %>%
  mutate(prop = n / sum(n),
         lab.ypos = cumsum(prop) - 0.5 * prop)
ggplot(plotdata,
       aes(x = "",
           y = prop,
           fill = race)) +
  geom_bar(width = 1,
           stat = "identity",
           color = "black") +
  coord_polar("y",
              start = 0,
              direction = -1) +
  theme_void()
5.4.1.3 Tree map
An alternative to a pie chart is a tree map, which can handle categorical variables with many levels.
# create a treemap of marriages by wedding officiate
library(dplyr)
library(treemapify)
plotdata <- Marriage %>%
  count(officialTitle)
ggplot(plotdata,
       aes(fill = officialTitle,
           area = n)) +
  geom_treemap() +
  labs(title = "Marriages by officiate")
Figure: Basic treemap
5.4.2 Quantitative
The distribution of a single quantitative variable is typically plotted with a histogram, kernel
density plot, or dot plot.
5.4.2.1 Histogram
Using the Marriage dataset, let’s plot the ages of the wedding participants.
library(ggplot2)
# plot the age distribution using a histogram with 20 bins
ggplot(Marriage, aes(x = age)) +
  geom_histogram(bins = 20) +
  labs(title = "Participants by age",
       subtitle = "number of bins = 20",
       x = "Age")
5.4.2.2 Kernel density plot
An alternative to a histogram is the kernel density plot, which produces a smoothed estimate of the distribution. The bw (bandwidth) option controls the degree of smoothing.
# create a kernel density plot of age, with bandwidth = 1
ggplot(Marriage, aes(x = age)) +
  geom_density(fill = "deepskyblue",
               bw = 1) +
  labs(title = "Participants by age",
       subtitle = "bandwidth = 1")
5.5 Bivariate Graphs
Bivariate graphs display the relationship between two variables.
5.5.1 Categorical vs. Categorical
When plotting the relationship between two categorical variables, stacked, grouped, or segmented bar charts are typically used. A less common approach is the mosaic chart.
library(ggplot2)
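A sketch of a stacked bar chart of drive type within car class, using the mpg dataset that the options below refer to. Swapping position = "stack" for position = "dodge" gives the side-by-side (grouped) version shown in the figure that follows.
# stacked bar chart of drive type (drv) by car class
ggplot(mpg,
       aes(x = class,
           fill = drv)) +
  geom_bar(position = "stack")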
Stacked is the default, so the last line could have also been written as geom_bar().
Figure: Side-by-side bar chart
You can use additional options to improve color and labeling. In the graph below:
factor modifies the order of the categories for the class variable and both the order and
the labels for the drive variable
scale_y_continuous modifies the y-axis tick mark labels
labs provides a title and changes the labels for the x and y axes and the legend
scale_fill_brewer changes the fill color scheme
theme_minimal removes the grey background and changes the grid color
library(ggplot2)
# bar plot, with each bar representing 100%,
# reordered bars, and better labels and colors
library(scales)
ggplot(mpg,
       aes(x = factor(class,
                      levels = c("2seater", "subcompact",
                                 "compact", "midsize",
                                 "minivan", "suv", "pickup")),
           fill = factor(drv,
                         levels = c("f", "r", "4"),
                         labels = c("front-wheel",
                                    "rear-wheel",
                                    "4-wheel")))) +
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2),
                     label = percent) +
  scale_fill_brewer(palette = "Set2") +
  labs(y = "Percent",
       fill = "Drive Train",
       x = "Class",
       title = "Automobile Drive by Class") +
  theme_minimal()
5.5.2 Quantitative vs. Quantitative
The relationship between two quantitative variables is typically displayed with scatterplots and line graphs.
5.5.2.1 Scatterplot
The simplest display of two quantitative variables is a scatterplot, with each variable represented on an axis. For example, using the Salaries dataset, we can plot experience (yrs.since.phd) vs. academic salary (salary) for college professors.
library(ggplot2)
data(Salaries, package="carData")
# simple scatterplot
ggplot(Salaries,
       aes(x = yrs.since.phd,
           y = salary)) +
  geom_point()
5.5.2.2 Line plot
When one of the two variables represents time, a line plot can be an effective method of displaying the relationship. For example, the code below displays the relationship between time (year) and life expectancy (lifeExp) in the United States between 1952 and 2007. The data comes from the gapminder dataset.
data(gapminder, package = "gapminder")

# Select US cases
library(dplyr)
plotdata <- filter(gapminder,
                   country == "United States")
# simple line plot
ggplot(plotdata,
       aes(x = year,
           y = lifeExp)) +
  geom_line()
5.5.3 Categorical vs. Quantitative
When plotting the relationship between a categorical variable and a quantitative variable, a wide range of graph types is available, including bar charts of summary statistics, box plots, and ridgeline plots.
5.5.3.1 Bar chart (on summary statistics)
In previous sections, bar charts were used to display the number of cases by category for a single variable or for two variables. You can also use bar charts to display other summary statistics (e.g., means or medians) on a quantitative variable for each level of a categorical variable.
For example, the following graph displays the mean salary for a sample of university professors by their academic rank.
data(Salaries, package="carData")
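A sketch of one way to do this (the summary step and the mean_salary name are assumptions):
# calculate the mean salary for each rank
library(dplyr)
plotdata <- Salaries %>%
  group_by(rank) %>%
  summarize(mean_salary = mean(salary))

# plot the means as a bar chart
ggplot(plotdata,
       aes(x = rank,
           y = mean_salary)) +
  geom_bar(stat = "identity")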
Figure: Bar chart displaying means
Side-by-side box plots are very useful for comparing groups (i.e., the levels of a categorical variable) on a numerical variable.
# plot the distribution of salaries by rank using boxplots
ggplot(Salaries,
       aes(x = rank,
           y = salary)) +
  geom_boxplot() +
  labs(title = "Salary distribution by rank")
A ridgeline plot displays the distribution of a quantitative variable for each level of a categorical variable; it requires the ggridges package.
# plot the distribution of mileage by auto class
library(ggridges)
ggplot(mpg,
       aes(x = cty,
           y = class,
           fill = class)) +
  geom_density_ridges() +
  theme_ridges() +
  labs(title = "Highway mileage by auto class") +
  theme(legend.position = "none")
Another useful display plots group means with error bars. This requires a summary data frame with one row per rank and the columns n, mean, sd, se, and ci (the sample size, mean, standard deviation, standard error, and 95% confidence interval of salary for each rank).
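A sketch of that summary step, assuming (as elsewhere in these notes) that the result is stored in a data frame called plotdata:
# calculate the sample size, mean, standard deviation, standard error,
# and 95% confidence interval of salary for each rank
library(dplyr)
plotdata <- Salaries %>%
  group_by(rank) %>%
  summarize(n = n(),
            mean = mean(salary),
            sd = sd(salary),
            se = sd / sqrt(n),
            ci = qt(0.975, df = n - 1) * se)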
# plot the group means with standard error bars
ggplot(plotdata,
       aes(x = rank,
           y = mean,
           group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - se,
                    ymax = mean + se),
                width = .1)
Box plots can also be combined with the individual (jittered) data points:
# plot salary distributions by rank, with jittered points overlaid
ggplot(Salaries,
       aes(x = rank,
           y = salary,
           color = rank)) +
  geom_boxplot(size = 1,
               outlier.shape = 1,
               outlier.color = "black",
               outlier.size = 3) +
  geom_jitter(alpha = 0.5,
              width = .2) +
  scale_y_continuous(label = dollar) +
  labs(title = "Academic Salary by Rank",
       subtitle = "9-month salary for 2008-2009",
       x = "",
       y = "") +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_flip()
Figure: Beeswarm plot
data(gapminder, package="gapminder")
5.6 Multivariate Graphs
Multivariate graphs display the relationships among three or more variables. Two common approaches are grouping and faceting.
5.6.1 Grouping
In grouping, the values of the first two variables are mapped to the x and y axes. Then additional variables are mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping allows you to plot the data for multiple groups in a single graph.
Using the Salaries dataset, let’s display the relationship between yrs.since.phd and salary.
library(ggplot2)
data(Salaries, package="carData")
# plot experience vs. salary
ggplot(Salaries,
       aes(x = yrs.since.phd,
           y = salary)) +
  geom_point() +
  labs(title = "Academic salary by years since degree")
5.6.2 Faceting
In faceting, a graph consists of several separate plots or small multiples, one for each level of a
third variable, or combination of variables. It is easiest to understand this with an example.
Figure: Salary distribution by rank
The facet_wrap function creates a separate graph for each level of rank. The ncol option controls
the number of columns.
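A sketch of the faceted plot described above (the choice of histogram geometry is an assumption):
# plot the salary distribution for each rank, one panel per rank
ggplot(Salaries, aes(x = salary)) +
  geom_histogram() +
  facet_wrap(~rank, ncol = 1) +
  labs(title = "Salary distribution by rank")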
*****