Data Science
Once the model is deployed, it yields actionable knowledge, that is, knowledge on which actions can be taken to reach
the solution or goal.
Tools used to create data analysis software
Version Control Systems
Version control, also known as source control, is the practice of tracking and managing changes to software code.
Version control systems are software tools that help software teams manage changes to source code over time. As
development environments have accelerated, version control systems help software teams work faster and smarter. They
are especially useful for DevOps teams since they help them to reduce development time and increase successful
deployments. DevOps combines development (Dev) and operations (Ops) to increase the efficiency, speed, and security
of software development and delivery compared to traditional processes.
Version control software keeps track of every modification to the code in a special kind of database. If a mistake is
made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while
minimizing disruption to all team members.
Purpose of Version Control:
Multiple people can work simultaneously on a single project. Everyone works on and edits their own copy of the files
and it is up to them when they wish to share the changes made by them with the rest of the team.
It also enables one person to use multiple computers to work on a project, so it is valuable even if you are working by
yourself.
It integrates the work that is done simultaneously by different members of the team. In some rare cases, when
conflicting edits are made by two people to the same line of a file, then human assistance is requested by the
version control system in deciding what should be done.
Version control provides access to the historical versions of a project. This is insurance against computer crashes or
data loss. If a mistake is made, you can easily roll back to a previous version. It is also possible to undo specific
edits without losing the work done in the meantime. You can easily find out when, why, and by whom any
part of a file was edited.
Benefits of the version control system:
Speeds up project development by enabling efficient collaboration,
Improves productivity and expedites product delivery through better communication
and assistance among team members,
Reduces the possibility of errors and conflicts during project development through traceability of every small
change,
Contributors can work on the project from anywhere, irrespective of their geographical
location,
For each contributor to the project, a separate working copy is maintained and is not merged into the main
files until it has been validated. Popular examples of such systems are Git, Helix Core, and Microsoft TFS,
Helps in recovery in case of any disaster or contingent situation,
Informs us about who made a change, what was changed, when, and why.
Use of Version Control System:
A repository: It can be thought of as a database of changes. It contains all the edits and historical versions
(snapshots) of the project.
Working copy (sometimes called a checkout): It is your personal copy of all the files in a project. You can edit
this copy without affecting the work of others, and you commit your changes to a repository when you
are done.
Working in a group: Suppose you work in a company and are asked to work on a live project.
You can’t change the main code directly because it is in production and any change may inconvenience users. You
are also working in a team, so you need to collaborate with your teammates and incorporate their changes. Version control
helps you merge different requests into the main repository without making undesirable changes. You
can test functionality without putting it live, and you don’t need to download and set up the project each time: just pull
the latest changes, make your own changes, test them, and merge them back.
Types of Version Control Systems:
Local Version Control Systems(LVCS)
Centralized Version Control Systems(CVCS)
Distributed Version Control Systems(DVCS)
Local Version Control Systems: This is one of the simplest forms and has a database that keeps all the changes to files under
revision control. RCS is one of the most common local VCS tools. It keeps patch sets (differences between files) in a special
format on disk. By adding up all the patches, it can re-create what any file looked like at any point in time.
Centralized Version Control Systems: Centralized version control systems contain just one shared repository, and every
user needs to commit to reflect their changes in that repository. Others can see your changes by
updating.
Two things are required to make your changes visible to others which are:
You commit
They update
The benefit of a CVCS (Centralized Version Control System) is that it enables collaboration amongst developers and
gives everyone some insight into what everyone else is doing on the project. It also gives administrators
fine-grained control over who can do what.
It has some downsides as well, which led to the development of DVCS. The most obvious is the single point of failure
that the centralized repository represents: if it goes down, collaboration and saving versioned changes are
not possible during that period. And if the hard disk of the central database becomes corrupted and proper backups haven’t been kept,
you lose absolutely everything.
Distributed Version Control Systems: Distributed version control systems contain multiple repositories. Each user has
their own repository and working copy. Just committing your changes will not give others access to your changes. This is
because commit will reflect those changes in your local repository and you need to push them in order to make them
visible on the central repository. Similarly, when you update, you do not get others’ changes unless you have first pulled
those changes into your repository.
To make your changes visible to others, 4 things are required:
You commit
You push
They pull
They update
The most popular distributed version control systems are Git and Mercurial. They help us overcome the problem of a single
point of failure.
Git:
Git is a distributed version control system for tracking changes in source code during software development. It is designed
for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed,
data integrity, and support for distributed, non-linear workflows.
It is generally used for source code management in software development:
Git is used to track changes in the source code
It is a distributed version control tool used for source code management
It allows multiple developers to work together
It supports non-linear development through its support for thousands of parallel branches
Features of Git
Tracks history
Free and open source
Supports non-linear development
Creates backups
Scalable
Supports collaboration
Branching is easier
Distributed development
Git Workflow:
The Git workflow diagram shows a master branch and two separate branches called “small feature” and “large
feature.” Once you have finished working on the two separate branches, you can merge them back into the master
branch.
Some common Git commands:
Create repositories- git init
Make changes- git add, git commit, git status
Parallel development- git branch, git merge, git rebase
Sync repositories- git push, git pull, git remote add origin
Check the version of Git- git --version
GitHub:
GitHub is a web-based Git repository hosting service, which offers all of the distributed revision control and source
code management (SCM) functionality of Git and adds its own features, such as access control and collaboration
tools. It provides a web-based graphical interface. GitHub is an American
company. It hosts the source code of your project, whatever programming languages it is written in, and keeps track of the
various changes made by programmers.
Git vs GitHub
S.No. | Git | GitHub
5. | Git is focused on version control and code sharing. | GitHub is focused on centralized source code hosting.
8. | Git has no user management feature. | GitHub has a built-in user management feature.
10. | Git has minimal external tool configuration. | GitHub has an active marketplace for tool integration.
12. | Git competes with CVS, Azure DevOps Server, Subversion, Mercurial, etc. | GitHub competes with GitLab, Bitbucket, AWS CodeCommit, etc.
Markdown
Markdown is a lightweight markup language that describes how text should look on a page. HTML is another example
of a markup language. Markdown is a style of writing documents that makes it easy to define what the content should
look like. It describes headers, text styles, links, lists and so much more.
Markdown is used in documentation, articles, and notes and can even be used to build a webpage. If you use GitHub,
you'll be familiar with the “readme.md” files that show up in the root of a repository. That “md” stands for markdown. It's
a very readable syntax and it can be converted into HTML, XHTML as well as in other formats. Its main purpose is
readability and ease of use.
Markdown Editors: You can write Markdown in any text editor, or in dedicated editors such as Markpad, HarooPad, MarkdownPad 2, and Typora.
Importance of Markdown:
Simple Learning Curve: Markdown is very simple to learn. The official syntax can be found on the
Daring Fireball website. To get started, you only need to know that typing **word** makes the word bold, typing _word_
makes it italic, and adding a - sign before an item creates a list. Also, it is much easier
to read raw Markdown than raw HTML.
Easy HTML Conversion: Markdown comes with a tool that converts the plain text to HTML. Hence it can also
be considered text-to-HTML conversion software in addition to being a markup language.
Create Static Sites Easily: Markdown empowers you to make free, simple, and static sites utilizing open-
source tools like Mkdocs, Jekyll, or Read the Docs.
Easy Sharing and Syncing: You can easily sync and share files created in the Markdown editor to Dropbox,
Google Drive, and WordPress.
Diversification: Since Markdown is just plain text, it can be converted into a bunch of formats like PDF,
epub, Docx, HTML, etc.
Working with Markdown:
Working of Markdown can be explained in the following four steps:
1. Create Markdown file: The first step is to create a Markdown file with “.md” or “.markdown” extension.
2. Open File in Markdown Application: You need a Markdown application capable of processing the
Markdown file. There are lots of applications available such as Typora, Ghostwriter, etc.
3. Working of Markdown Application: Markdown applications use Markdown processors to take the
Markdown-formatted text and output it to HTML format.
4. View the file in a web browser: The final rendered output of your document can now be viewed in a web
browser. Following is the visual representation of the process:
3. Bold
If we want strong (bold) text, we can use double asterisks or double underscores. Given below is an example of
bold text in Markdown:
**This is first way to make text in bold**
__This is second way to make text in bold__
Output:
This is first way to make text in bold
This is second way to make text in bold
4. Strikethrough
For strikethrough, we can use the double tilde sign. Given below is an example of strikethrough text in
Markdown:
~~Strikethrough text example~~
Output:
Strikethrough text example
5. Horizontal Line
A horizontal line acts like a separator, and for that, we use triple hyphens or triple underscores. You can use this to
separate your content. Leave a blank line before triple hyphens, otherwise the preceding line of text is read as a heading.
This is text before Horizontal Line

---
This is text after horizontal line
Output:
This is text before Horizontal Line
This is text after horizontal line
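To make the conversion step described under "Working with Markdown" concrete, here is a toy sketch in R of what a Markdown processor does with the three constructs shown above. The function md_to_html() is a hypothetical helper written for this illustration. It handles only bold, strikethrough, and horizontal lines, and is not a real Markdown processor.

```r
# Toy converter (illustrative assumption, not a real Markdown processor):
# handles only **bold**, __bold__, ~~strikethrough~~ and --- / ___ lines.
md_to_html <- function(lines) {
  out <- character(0)
  for (line in lines) {
    stripped <- trimws(line)
    if (!nzchar(stripped)) next                 # skip blank lines
    if (stripped %in% c("---", "___")) {        # horizontal line
      out <- c(out, "<hr />")
      next
    }
    # Replace each marker pair with the matching HTML tag (lazy match).
    stripped <- gsub("\\*\\*(.+?)\\*\\*", "<strong>\\1</strong>", stripped, perl = TRUE)
    stripped <- gsub("__(.+?)__", "<strong>\\1</strong>", stripped, perl = TRUE)
    stripped <- gsub("~~(.+?)~~", "<del>\\1</del>", stripped, perl = TRUE)
    out <- c(out, paste0("<p>", stripped, "</p>"))
  }
  out
}

html <- md_to_html(c("**bold one**", "__bold two__", "~~gone~~", "", "---", "after"))
print(html)
```

Real processors cover the full syntax (headings, lists, links, nesting), so this sketch is only meant to show the text-in, HTML-out idea.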
R
R is a programming language and software environment for statistical computing and graphics. Developed in 1993 by
Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R has become a major tool in data analysis,
statistics, and graphical models. It's an open-source project, part of the GNU Project, which means it's freely available and
regularly updated by its community.
Features of R:
Data Analysis: Capable of handling various types of data and provides extensive packages for data analysis.
Statistical Modeling: Offers numerous techniques for statistical tests, linear and non-linear modeling, time-series
analysis, classification, clustering, etc.
Graphics: Features comprehensive graphical capabilities for data visualization.
Extensibility: Allows integration with other languages (C, C++, Java, etc.) and can be extended through packages.
CRAN in R Language:
CRAN (Comprehensive R Archive Network) is the primary repository for R packages, and it hosts thousands of packages
that users can download and install to extend the functionality of the R Programming Language. These packages are
created by R users and developers from around the world and cover a wide range of topics and applications.
It functions as a robust repository, hosting a diverse collection of R packages and related software, making it an essential
resource for statisticians, data scientists, and researchers worldwide. It plays a pivotal role in the growth of
the R programming language.
Purpose of CRAN:
CRAN is a network of servers storing R packages.
The packages on CRAN enhance data analysis capabilities.
CRAN serves as the primary platform for sharing packages with the R community.
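As a minimal sketch of how a package is fetched from CRAN, the snippet below installs a package only when it is missing. The helper name ensure_package() is a hypothetical name chosen for this illustration, and the mirror URL is one common choice rather than a requirement. The calls install.packages(), requireNamespace(), and packageVersion() are standard R functions.

```r
# ensure_package() is a hypothetical helper, not part of base R.
# It installs a package from a CRAN mirror only if it is not already available.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # downloads the package from CRAN
  }
  # Report whether the package can now be loaded.
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never touches the network.
ok <- ensure_package("stats")
print(ok)
print(packageVersion("stats"))
```

For a package that is not yet installed, for example ensure_package("ggplot2"), the same call would download it from the chosen mirror.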
Importance of CRAN:
1. Central Hub: CRAN acts as the central hub for R packages, a place where users can easily access, download, and
install packages without the need for extensive searches across various websites or sources. This seamless access
streamlines the process of enhancing R’s capabilities, enabling users to find and install packages effortlessly.
2. Quality Assurance: One of CRAN’s standout features is its dedication to quality assurance. Packages
undergo automated checks and review when they are submitted to CRAN. This
examination checks that packages meet CRAN’s standards, including documentation, best practices,
and adherence to CRAN’s guidelines. As a result, users can have reasonable confidence in the quality and dependability of
packages on CRAN.
3. Version Management: CRAN maintains a comprehensive history of package versions, allowing users to access and
install specific versions of packages. This feature is crucial for ensuring the reproducibility of data analysis and
research, ensuring that code performs as intended, even as packages evolve over time.
4. Diverse Selection of Packages: CRAN hosts a vast array of packages covering a wide range of domains. From
statistical modeling and machine learning to data visualization and manipulation, CRAN’s repository caters to the
needs of beginners and experienced users alike. Whatever your data analysis requirements, you’re likely to discover
a package that streamlines and enhances your workflow on CRAN.
5. Community Collaboration: Beyond being a distribution platform for packages, CRAN fosters a vibrant community of R
developers, maintainers, and users. Developers can collaborate on packages, share their expertise, and contribute to
the ongoing enrichment of R’s ecosystem. Users can seek help, report issues, and engage in discussions, fostering a
sense of camaraderie and support that bolsters the entire community.
R Studio
RStudio is an Integrated Development Environment (IDE) for R. It's a separate application designed to make using R easier
and more productive. The RStudio IDE was first released in 2011 by Posit, PBC (then named RStudio), a public-benefit
corporation founded by J. J. Allaire, creator of the programming language ColdFusion.
Features of R Studio:
User-Friendly Interface: Offers a clean, user-friendly interface to R, making it accessible to a wider range of users.
Integrated Tools: Combines a console, syntax-highlighting editor, and direct code execution. It also includes tools
for plotting, history, debugging, and workspace management.
Project Management: Simplifies the organization of R projects, files, and associated data.
Version Control Integration: Provides built-in support for Git and SVN.
Technical Aspects:
RStudio enhances the functionality of R, but does not replace it; you need to install R to run RStudio.
It's focused on streamlining the workflow in R, particularly for data analysis, visualization, and application
development.
2. Project Management: R Studio facilitates efficient project organization with its project-based workflow, allowing
users to manage multiple scripts, data files, and plots within a cohesive workspace.
3. Data Visualization: It supports powerful data visualization through integrated plotting tools and
compatibility with popular visualization libraries like ggplot2, enabling users to create insightful graphs and charts.
4. Package Management: R Studio simplifies package management with tools like the Package Manager and
integrated CRAN (Comprehensive R Archive Network) repository access, making it easy to install, update, and
manage R packages crucial for various analytical tasks.
5. Markdown and R Markdown Support: It integrates with Markdown and R Markdown, enabling users to
create dynamic reports, presentations, and documents that combine code, visualizations, and narrative text in a
single file.
6. Collaboration and Sharing: R Studio facilitates collaboration by supporting version control systems like Git and
enabling sharing of projects and analyses through R Studio Server and R Studio Cloud.
R-Studio layout
Exploratory data analysis can enable analysts to represent different sales trends graphically and visualize data
related to best-selling product categories, buyer demographics and preferences, customer spending patterns,
and units sold over a certain period.
Without EDA, this would not have been possible.
Data specialists perform exploratory data analysis using popular scripting languages for statistics, such as
Python and R. For effective EDA, data professionals also use a variety of BI (Business Intelligence) tools,
including Qlik Sense, IBM Cognos, and Tableau.
Python and R programming languages enable analysts to analyze data better and manipulate it using libraries
and packages such as Plotly, Seaborn, or Matplotlib.
BI tools, incorporating interactive dashboards, robust security, and advanced visualization features, provide
data processors with a comprehensive view of data that helps them develop Machine Learning (ML) models.
The exploratory data analysis steps that analysts have in mind when performing EDA include:
Asking the right questions related to the purpose of data analysis
Obtaining in-depth knowledge about problem domains
Setting clear objectives that are aligned with the desired outcomes.
There are four exploratory data analysis techniques that data experts use, which include:
i) Univariate Non-Graphical: This is the simplest type of EDA, where data has a single variable. Since there is
only one variable, data professionals do not have to deal with relationships.
ii) Univariate Graphical: Non-graphical techniques do not present the complete picture of data. Therefore, for
comprehensive EDA, data specialists implement graphical methods, such as stem-and-leaf plots, box plots, and
histograms.
iii) Multivariate Non-Graphical: Multivariate data consists of several variables. Non-graphic multivariate EDA
methods illustrate relationships between 2 or more data variables using statistics or cross-tabulation.
iv) Multivariate Graphical: This EDA technique makes use of graphics to show relationships between 2 or more
variables. Widely used multivariate graphics include the bar chart, heat map, bubble chart, run chart,
multivariate chart, and scatter plot.
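The four techniques above can be sketched in a few lines of base R. This example assumes the built-in mtcars dataset purely for illustration. The choice of variables (mpg, wt, cyl, gear) is arbitrary.

```r
# mtcars ships with R; it is used here only as a convenient example dataset.
data(mtcars)

# i) Univariate non-graphical: summary statistics of one variable
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# ii) Univariate graphical: stem-and-leaf plot, histogram, box plot
stem(mtcars$mpg)
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mtcars$mpg, main = "Box plot of mpg")

# iii) Multivariate non-graphical: cross-tabulation and correlation
cyl_by_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
print(cyl_by_gear)
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# iv) Multivariate graphical: scatter plot of two variables
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "mpg versus weight")
```

When run with Rscript, the plots are written to a PDF file in the working directory. In an interactive session they appear in the plotting window.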
Eliminating or sharpening potential hypotheses about the world that can be addressed by the data:
Exploratory Data Analysis is used for sharpening potential hypotheses about the world that can be addressed
by the data.
A statistical hypothesis is an assumption made by the researcher about the population from which data are collected for
an experiment. This assumption need not be true. Hypothesis testing is
a formal process of validating the hypothesis made by the researcher.
Ideally, validating a hypothesis would take the entire population into account. However, this is not
possible in practice. Thus, hypothesis testing uses random samples from the population. On the basis
of the test result on the sample data, it either rejects or fails to reject the null hypothesis.
Hypotheses in statistical hypothesis testing are of two types:
Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or assumption
that is made about the larger population. The default claim, typically that there is no effect or no difference, is known as the
Null Hypothesis. The null hypothesis is denoted by H0.
Alternative Hypothesis – The alternative hypothesis is the claim that is supported when the null hypothesis is
rejected. The evidence for it is the sample data and the statistical computations that
accompany it. The alternative hypothesis is denoted by H1 or Ha.
Statisticians use hypothesis testing to formally check whether the null hypothesis should be rejected or not.
Hypothesis testing is conducted in the following manner:
1. State the Hypotheses – Stating the null and alternative hypotheses.
2. Formulate an Analysis Plan – Decide how the sample data will be used to evaluate the null hypothesis, including the test statistic and the significance level.
3. Analyze Sample Data – Calculation and interpretation of the test statistic, as described in the analysis plan.
4. Interpret Results – Application of the decision rule described in the analysis plan.
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, in other words, what
the data are telling you about the population. The p-value ranges between 0 and 1. It can be interpreted in the
following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
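The four steps and the p-value rule above can be sketched with a one-sample t-test in R. The example assumes the built-in mtcars dataset and two made-up hypothesized means (20 and 25) chosen only for illustration. The helper decide() is a hypothetical name.

```r
data(mtcars)
alpha <- 0.05   # significance level chosen in the analysis plan

# decide() is a hypothetical helper that applies the decision rule.
decide <- function(p_value, alpha) {
  if (p_value <= alpha) "reject H0" else "fail to reject H0"
}

# H0: mean mpg equals 20, Ha: mean mpg differs from 20
test_20 <- t.test(mtcars$mpg, mu = 20)
# H0: mean mpg equals 25, Ha: mean mpg differs from 25
test_25 <- t.test(mtcars$mpg, mu = 25)

cat("mu = 20: p-value =", format.pval(test_20$p.value), "->",
    decide(test_20$p.value, alpha), "\n")
cat("mu = 25: p-value =", format.pval(test_25$p.value), "->",
    decide(test_25$p.value, alpha), "\n")
```

The sample mean of mpg is close to 20, so the first test gives a large p-value, while a hypothesized mean of 25 is far enough from the data to be rejected at the 0.05 level.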
Decision Errors in R
The two types of error that can occur from the hypothesis testing:
Type I Error – Type I error occurs when the researcher rejects a null hypothesis when it is true. The term
significance level is used to express the probability of Type I error while testing the hypothesis. The
significance level is represented by the symbol α (alpha).
Type II Error – Failing to reject a false null hypothesis H0 is referred to as the Type II error. The
probability of a Type II error is represented by the symbol β (beta). The power of the test is the probability of
correctly rejecting a false null hypothesis and equals 1 − β.
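Both error rates can be estimated by simulation. The sketch below assumes normally distributed samples of size 30, a true mean of 0.5 for the "H0 is false" case, and 2000 repetitions. All of these numbers are illustrative choices.

```r
set.seed(42)      # fixed seed so the simulation is repeatable
n_sims <- 2000    # number of simulated experiments (illustrative)
alpha  <- 0.05    # significance level

# Case 1: H0 (mean = 0) is true. Every rejection is a Type I error.
p_null <- replicate(n_sims, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
type1_rate <- mean(p_null < alpha)

# Case 2: H0 (mean = 0) is false because the true mean is 0.5.
# Every failure to reject is a Type II error.
p_alt <- replicate(n_sims, t.test(rnorm(30, mean = 0.5), mu = 0)$p.value)
type2_rate <- mean(p_alt >= alpha)   # estimate of beta
power <- 1 - type2_rate              # estimate of the power of the test

cat("Estimated Type I error rate:", type1_rate, "\n")
cat("Estimated Type II error rate (beta):", type2_rate, "\n")
cat("Estimated power (1 - beta):", power, "\n")
```

The estimated Type I error rate should land near the chosen α, while β depends on the sample size and on how far the true mean is from the hypothesized one.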
Icon-Based Techniques
These techniques visualize data values as features of icons. Examples - Chernoff faces, Stick
figures, Star plots, Color icons. Some of their important characteristics are
It can handle small to medium datasets with a few thousand data records, as icons tend to use a
screen space of several pixels.
It can be applied to datasets of high dimensionality, but interpretation is not straightforward and
requires training.
In this technique variables are treated differently, as some visual features of the icons may attract
more attention than others.
The way data variables are mapped to icon features greatly determines the expressiveness of the
resulting visualization and what can be perceived.
Defining a suitable mapping may be difficult and poses a bottleneck, particularly for higher
dimensional data.
Data record overlapping can occur if some variables are mapped to the display positions.
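As a small illustration of an icon-based technique, base R can draw star plots with the stars() function. The sketch assumes the built-in mtcars dataset and an arbitrary subset of nine cars and seven variables.

```r
# Each car becomes one star; each variable becomes one ray of the star.
# The subset (first nine cars, seven variables) is an arbitrary choice.
cars_subset <- mtcars[1:9, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]

stars(cars_subset, main = "Star plot of nine cars")
```

Reading the plot takes some practice, which matches the point above that interpreting icon-based displays requires training.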
Pixel-Based Techniques
Each variable is represented as a sub window in the display which is filled with colored pixels. A data
record with k variables is represented as k colored pixels, each in one sub window associated with a
variable. The color of a pixel demonstrates its corresponding value. The color mapping of the pixels,
arrangement of pixels in the sub windows and shape of the sub windows depend on the data
characteristics and visualization tasks. Some of its important characteristics are
It can handle large and very large datasets on high-resolution displays
Can reasonably handle medium- and high- dimensional datasets
As each data record is uniquely mapped to a pixel, data record overlapping and visual cluttering do not
occur
Limited in revealing quantitative relationships between variables because color is not effective in
visualizing quantitative values.
Hierarchical Techniques
These techniques subdivide the k-dimensional data space and present the subspaces in a hierarchical fashion. Examples-
Dimensional stacking, Mosaic Plot, Worlds-within-worlds, Treemap, Cone Trees, etc. Some of their important
characteristics are
It can handle small- to medium- sized datasets
More suitable for handling datasets of low- to medium- dimensionality
Variables are treated differently, with different mappings producing different views of data
Interpretation of resulting plots requires training
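A mosaic plot, one of the hierarchical techniques listed above, can be drawn in base R with mosaicplot(). The sketch assumes the built-in Titanic contingency table. The choice of the two variables Class and Survived is only for illustration.

```r
# Titanic is a built-in 4-way contingency table (Class, Sex, Age, Survived).
# Collapse it to Class x Survived to see the counts behind the plot.
survival_by_class <- apply(Titanic, c(1, 4), sum)
print(survival_by_class)

# The plot first splits the area by Class, then splits each part by Survived.
mosaicplot(~ Class + Survived, data = Titanic, color = TRUE,
           main = "Survival by passenger class")
```

Changing the order of the variables in the formula changes the order of subdivision and therefore produces a different view of the same data.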
Hybrid Techniques
It integrates multiple visualization techniques, either in one or multiple windows, to enhance the
expressiveness of visualization. Linking and brushing are powerful tools to integrate visualization windows.
Multivariate Graphs:
Multivariate graphs display the relationships among three or more variables. There are two common methods
for accommodating multiple variables: grouping and faceting.
Grouping
In grouping, the values of the first two variables are mapped to the x and y axes. Then additional variables are
mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping allows
you to plot the data for multiple groups in a single graph.
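Using the Salaries dataset, let’s display the relationship between yrs.since.phd and salary, with rank mapped to color.
library(ggplot2)
data(Salaries, package="carData")
ggplot(Salaries, aes(x = yrs.since.phd, y = salary, color = rank)) +
  geom_point()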
Faceting
Grouping allows you to plot multiple variables in a single graph, using visual characteristics such as
color, shape, and size.
In faceting, a graph consists of several separate plots or small multiples, one for each level of a third
variable, or combination of variables. It is easiest to understand this with an example.
# plot salary histograms by rank
ggplot(Salaries, aes(x = salary)) +
  geom_histogram(fill = "cornflowerblue",
                 color = "white") +
  facet_wrap(~rank, ncol = 1) +
  labs(title = "Salary histograms by rank")
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland in Auckland, New Zealand. R made its first appearance in 1993. A large group
of individuals has contributed to R by sending code and bug reports. Since mid-1997 there has been a
core group (the "R Core Team") who can modify the R source code archive.
Features of R
The following are the important features of R −
R is a well-developed, simple and effective programming language which includes conditionals,
loops, user defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the computer or
printed on paper.
In conclusion, R is one of the world’s most widely used statistical programming languages. It is a popular choice among
data scientists and is supported by a vibrant and talented community of contributors.
Basic Syntax
Depending on the needs, we can program either at the R command prompt or we can use an R script file
to write the program.
Using R Command Prompt:
Once the R environment is set up, it’s easy to start the R command prompt by just typing the
following command at the command prompt −
$ R
This will launch the R interpreter and we will get a prompt > where we can start typing our program as
follows −
> s <- "Hello, World!"
> print (s)
[1] "Hello, World!"
Here the first statement defines a string variable s, to which we assign the string "Hello, World!", and the next
statement uses print() to print the value stored in variable s.
Using R Script File:
Usually, we write programs in script files and then execute those scripts
at the command prompt with the help of the R script runner called Rscript. Example-
# My first program in R Programming
myString <- "Hello, World!"
print ( myString)
Save the above code in a file test.R and run the above program by writing Rscript test.R in terminal.
Output [1] "Hello, World!"
Comments:
Comments are helper text in an R program and are ignored by the interpreter while executing
the actual program. A single-line comment is written using # at the beginning of the statement as follows −
# My first program in R Programming
R does not support multi-line comments.
Data Types and Objects in R
A data type indicates the type of data to be stored in a variable. In contrast to other programming
languages like C and Java, in R the data types of variables are not declared. Variables are assigned
R-objects, and the data type of the R-object becomes the data type of the variable. There are many
types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic vectors.
Examples:
Logical TRUE, FALSE
Numeric 12.3, 5, 999
Integer 2L, 34L, 0L
Complex 3 + 2i
Character 'a', "good", "TRUE", '23.4'
Raw "Hello" is stored as 48 65 6c 6c 6f
R-Code:
v <- TRUE
print(v)
print(class(v))
Output:
[1] TRUE
[1] "logical"
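The same check can be run for all six atomic types at once. This is a minimal sketch using the example values from the table above. The variable names values and classes are chosen for this illustration.

```r
# One example value per atomic type, taken from the table above.
values <- list(
  logical   = TRUE,
  numeric   = 12.3,
  integer   = 2L,
  complex   = 3 + 2i,
  character = "good",
  raw       = charToRaw("Hello")
)

# class() reports the data type of each R-object.
classes <- vapply(values, class, character(1))
print(classes)

# The raw vector holds the bytes of "Hello" in hexadecimal.
print(values$raw)
```

Note that a number typed without the L suffix, such as 5, is stored as numeric, not integer.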
Vectors
When you want to create a vector with more than one element, use the c() function, which
combines the elements into a vector.
apple <- c('red','green',"yellow")
print(apple)
print(class(apple))
Output:-
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
list1 <- list(c(2,5,3),21.3,sin)
print(list1)
Output:-
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)  .Primitive("sin")
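Elements of a list are usually read back with double square brackets. The short sketch below reuses list1 from the example above. The variable names first_element and sub_list are chosen for this illustration.

```r
list1 <- list(c(2, 5, 3), 21.3, sin)

# [[ ]] extracts the element itself; [ ] returns a shorter list.
first_element <- list1[[1]]   # the numeric vector 2 5 3
sub_list      <- list1[1]     # a list of length 1 holding that vector

print(first_element)
print(class(sub_list))

# The third element is the sin function, so it can be called directly.
sine_of_zero <- list1[[3]](0)
print(sine_of_zero)
```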
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix
function.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
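The byrow argument controls how the input vector fills the matrix. As a small sketch with the same input vector, compare filling by row with the default column-wise filling. The names by_row and by_col are chosen for this illustration.

```r
v <- c('a', 'a', 'b', 'c', 'b', 'a')

# byrow = TRUE fills the first row, then the second row.
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)
# The default (byrow = FALSE) fills the first column, then the next.
by_col <- matrix(v, nrow = 2, ncol = 3)

print(by_row)
print(by_col)
print(dim(by_row))   # number of rows and columns
```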
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The
array function takes a dim attribute which sets the size of each dimension. In the
example below we create an array made of two 3x3 matrices.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
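, , 1
     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"
, , 2
     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"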
Factors
Factors are R-objects created from a vector. A factor stores the vector along with the distinct
values of its elements as labels (levels). The labels are always character strings, irrespective
of whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)
print(factor_apple)
print(nlevels(factor_apple))
Output:-
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
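To make the idea of "vector plus labels" concrete, here is a small sketch that reuses the apple_colors example and inspects the factor's levels, the integer codes R stores for each element, and the count per level. The names level_names, codes and counts are illustrative.

```r
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)

# The distinct values, sorted, are the levels.
level_names <- levels(factor_apple)

# Internally each element is stored as the position of its level.
codes <- as.integer(factor_apple)

# table() counts how often each level occurs.
counts <- table(factor_apple)

print(level_names)
print(codes)
print(counts)
```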
Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain
a different mode of data: the first column can be numeric while the second is character and the
third is logical. A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
BMI<-data.frame(gender=c("Male","Male","Female"),height=c(152,171.5,165),
weight = c(81,93, 78),Age = c(42,38,26))
print(BMI)
Output−
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
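The claim that each column can hold a different mode of data can be checked directly. The sketch below rebuilds the BMI data frame, reports the class of each column, and selects rows by a condition. The stringsAsFactors = FALSE argument is added here so the gender column stays character on older R versions; col_classes and tall are illustrative names.

```r
BMI <- data.frame(gender = c("Male", "Male", "Female"),
                  height = c(152, 171.5, 165),
                  weight = c(81, 93, 78),
                  Age = c(42, 38, 26),
                  stringsAsFactors = FALSE)

# Each column keeps its own data type.
col_classes <- sapply(BMI, class)
print(col_classes)

# Rows can be filtered with a logical condition on a column.
tall <- BMI[BMI$height > 160, ]
print(tall)
```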
R-Variables
A variable provides us with named storage that our programs can manipulate. A variable in R can
store an atomic vector, a group of atomic vectors or a combination of many R-objects. A valid variable
name consists of letters, numbers and the dot or underscore characters. The name must start with
a letter, or with a dot that is not followed by a number. Examples of valid variable names are
var_name, .var2, ab_cd and ab.cd.
Variable Assignment
Variables can be assigned values using the leftward (<-), rightward (->) or equal-to (=) operator. The
values of the variables can be printed using the print() or cat() function. The cat() function combines
multiple items into a continuous print output. Example
v <- c("learn","R")    # leftward assignment
v = c("learn","R")     # equal-to assignment
c("learn","R") -> v    # rightward assignment
print(v)
Each of the three assignments gives v the same value, so the code prints −
[1] "learn" "R"
Data Type of a Variable
In R, a variable is not declared with a data type; it takes the data type of the R-object
assigned to it. R is therefore called a dynamically typed language: the data type of the same
variable can change again and again as new values are assigned to it in a program.
v <- "Hello"
cat("The class of v is", class(v), "\n")
Output:-
The class of v is character
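A minimal sketch of dynamic typing: the same variable v is reassigned three times and its class is recorded after each assignment. The class_as_* names are illustrative.

```r
v <- "Hello"
class_as_text <- class(v)       # v currently holds a character vector

v <- 34.5
class_as_number <- class(v)     # now a numeric vector

v <- 27L
class_as_integer <- class(v)    # now an integer vector

print(c(class_as_text, class_as_number, class_as_integer))
```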
Finding Variables
To list all the variables currently available in the workspace we use the ls() function.
print(ls())
The ls() function can also use a pattern to match the variable names.
# List the variables whose names contain the pattern "var".
print(ls(pattern = "var"))
Variables starting with a dot (.) are hidden; they can be listed by passing the "all.names = TRUE"
argument to the ls() function.
print(ls(all.names = TRUE))
Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the variable var.3 (assumed to
exist in the workspace). Printing the variable afterwards throws an error.
rm(var.3)
print(var.3)
When we execute the above code, it produces the following result −
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() function together.
rm(list = ls())
print(ls())
When we execute the above code, it produces the following result −
character(0)
R - Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical
manipulation. The R language is rich in built-in operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
The following table shows the arithmetic operators supported by R. The operators act on each
element of the vector.
+    Adds two vectors
-    Subtracts the second vector from the first
*    Multiplies both vectors
/    Divides the first vector by the second (floating point division)
%%   Gives the remainder of dividing the first vector by the second
%/%  Gives the integer quotient of dividing the first vector by the second
^    Raises the first vector to the exponent of the second vector
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
When we execute the above code, it produces the following result −
[1] 0.250000 1.833333 1.500000
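The division example covers only one operator. As a sketch, the same two vectors can be used to show the remainder, integer quotient and exponent operators, each applied element by element. The result names are illustrative.

```r
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)

remainder <- v %% t     # remainder of each division: 2.0 2.5 2.0
quotient  <- v %/% t    # integer quotient of each division: 0 1 1
power     <- v ^ t      # each element of v raised to the matching element of t

print(remainder)
print(quotient)
print(power)
```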
Relational Operators
The following table shows the relational operators supported by R. Each element of the first
vector is compared with the corresponding element of the second vector. The result of each
comparison is a Boolean value.
>   Checks if each element of the first vector is greater than the corresponding element of the second vector.
<   Checks if each element of the first vector is less than the corresponding element of the second vector.
==  Checks if each element of the first vector is equal to the corresponding element of the second vector.
<=  Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
>=  Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
!=  Checks if each element of the first vector is unequal to the corresponding element of the second vector.
Example:-
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
it produces the following result −
[1] FALSE TRUE FALSE TRUE
Logical Operators
The following table shows the logical operators supported by R. They are applicable only to vectors
of type logical, numeric or complex. All non-zero numbers are treated as the logical value TRUE, and
zero is treated as FALSE.
Each element of the first vector is combined with the corresponding element of the second vector.
The result is a Boolean value.
&  Element-wise logical AND operator. It combines each element of the first vector with the
corresponding element of the second vector and gives TRUE if both elements are TRUE.
|  Element-wise logical OR operator. It combines each element of the first vector with the
corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
!  Logical NOT operator. It takes each element of the vector and gives the opposite logical value.
Example:-
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
it produces the following result −
[1] TRUE FALSE TRUE TRUE
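The logical operators && and || work on single values rather than whole vectors and give a
single-element result. Older versions of R silently used only the first element of longer vectors;
since R 4.3.0, passing a vector with more than one element is an error, so the first elements are
selected explicitly here.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
it produces the following result −
[1] TRUE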
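The difference between the element-wise and single-value operators can be sketched as follows. Because recent R versions (4.3.0 onward) raise an error when && or || receives a vector longer than one, the sketch picks single elements explicitly or collapses a vector with all() and any(). The variable names are illustrative.

```r
v <- c(3, 0, TRUE)
t <- c(1, 3, TRUE)

# Element-wise AND: one result per pair of elements.
elementwise <- v & t

# Single-value AND on the first elements only.
first_pair <- v[1] && t[1]

# all() and any() collapse a logical vector to one value.
every_pair <- all(v & t)
some_pair <- any(v & t)

print(elementwise)
print(first_pair)
print(every_pair)
print(some_pair)
```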
Assignment Operators
These operators are used to assign values to variables.
<- or = or <<-   Left assignment operators
-> or ->>        Right assignment operators
Example:-
v1 <- c(3,1,TRUE,2+3i)
print(v1)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes rather than general mathematical or logical
computation.
:     Colon operator. It creates a sequence of consecutive numbers as a vector.
%in%  Identifies whether an element belongs to a vector.
%*%   Matrix multiplication operator. It multiplies two conformable matrices, for example a matrix and its transpose.
Example
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
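The example above exercises only %in%. A short sketch of the colon and %*% operators follows, multiplying a 2x3 matrix by its transpose to give a 2x2 result. The names s, M and P are illustrative.

```r
# Colon operator: a sequence of consecutive integers.
s <- 2:5

# A 2x3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6).
M <- matrix(1:6, nrow = 2)

# Matrix product of M (2x3) and its transpose (3x2) gives a 2x2 matrix.
P <- M %*% t(M)

print(s)
print(P)
```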
Control Structure
Decision making structures
Decision making structures require the programmer to specify one or more conditions to be evaluated
or tested by the program, along with a statement or statements to be executed if the condition is
determined to be true, and optionally, other statements to be executed if the condition is determined
to be false.
R provides the following types of decision making statements.
a) If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to be true, then the block of code inside the if statement will be
executed. If Boolean expression evaluates to be false, then the first set of code after the end of the if
statement (after the closing curly brace) will be executed.
Example
x <- 30L
if(is.integer(x)) {
print("X is an Integer")
}
When the above code is executed, it produces the following result −
[1] "X is an Integer"
b) If...Else Statement
An if statement can be followed by an optional else statement which executes when the boolean
expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be executed,
otherwise else block of code will be executed.
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
When the above code is executed, it produces the following result −
[1] "Truth is not found"
Here "Truth" and "truth" are two different strings.
c) If...Else If...Else Statement
An if statement can be followed by an optional else if...else chain, which is useful for testing
several conditions in a single statement.
When using if, else if and else together there are a few points to keep in mind.
An if can have zero or one else, and it must come after any else if's.
An if can have zero to many else if's, and they must come before the else.
Once an else if succeeds, none of the remaining else if's or the else will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is −
if(boolean_expression 1) {
  # Executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
  # Executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
  # Executes when the boolean expression 3 is true.
} else {
  # Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
When the above code is executed, it produces the following result −
[1] "truth is found the second time"
Switch Statement
A switch statement tests an expression against a list of alternatives and returns the value of the
alternative that matches.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3 ... )
The following rules apply to a switch statement −
If the value of expression is not a character string it is coerced to integer.
You can have any number of alternatives within a switch. Alternatives can be named (name = value),
and the names are what a character expression is matched against.
If the value of the integer is between 1 and nargs()-1 (the maximum number of alternatives) then the
corresponding alternative is evaluated and the result returned.
If expression evaluates to a character string then that string is matched (exactly) to the names of
the alternatives.
If there is more than one match, the first matching element is returned.
For a character expression, a single unnamed alternative acts as the default. For a numeric
expression no default is available, and switch returns NULL invisibly when nothing matches.
Example
x <- switch(3,"first","second","third","fourth")
print(x)
When the above code is executed, it produces the following result −
[1] "third"
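The numeric example does not show matching by name. The sketch below, using a hypothetical helper describe_colour(), switches on a character string, falls back to an unnamed final alternative when nothing matches, and shows that a numeric switch with no matching position returns NULL.

```r
# Hypothetical helper: maps a one-letter code to a colour name.
describe_colour <- function(code) {
  switch(code,
         r = "red",
         g = "green",
         "unknown")   # unnamed last alternative: used when no name matches
}

print(describe_colour("g"))
print(describe_colour("x"))

# With a numeric expression there is no fallback: an out-of-range
# position gives NULL.
no_match <- switch(5, "first", "second")
print(is.null(no_match))
```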
Loops
There may be a situation when you need to execute a block of code several times. In
general, statements are executed sequentially: the first statement in a function is executed first,
followed by the second, and so on.
A loop statement allows us to execute a statement or group of statements multiple times. R provides
the following kinds of loops –
a) repeat loop: Executes a sequence of statements again and again until a break statement stops it.
b) while loop: Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.
c) for loop: Executes the loop body once for each element of a vector or list.
a) Repeat Loop
The Repeat loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat {
commands
if(condition) {
break
}
}
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt+1
if(cnt > 5) {
break
}
}
When the above code is executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
b) While Loop
The While loop executes the same code again and again as long as a given condition remains true.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
statement
}
The key point of the while loop is that the body might never run. When the condition is tested and
the result is false, the loop body is skipped and the first statement after the while loop is
executed.
Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
print(v)
cnt = cnt + 1
}
When the above code is executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
c) For Loop
A For loop is a repetition control structure that allows you to efficiently write a loop that needs to
execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
statements
}
R’s for loops are particularly flexible in that they are not limited to integers, or even numbers in the
input. We can pass character vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
Break Statement
The break statement in R is used inside a loop: when it is encountered, the loop is immediately
terminated and program control resumes at the next statement following the loop.
Unlike in C-like languages, break is not needed to end a case in R's switch() function, because
switch() returns only the matching alternative.
Syntax
The basic syntax for creating a break statement in R is −
break
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt + 1
if(cnt > 5) {
break
}
}
When the above code is executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
Next Statement
The next statement in R is useful when we want to skip the current iteration of a loop without
terminating it. On encountering next, R skips the rest of the current iteration and starts the
next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is −
next
Example
v <- LETTERS[1:6]
for ( i in v) {
if (i == "D") {
next
}
print(i)
}
When the above code is executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"
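As a sketch of how break and next interact, the loop below walks through LETTERS, skips "B" with next and stops entirely at "E" with break. The name collected is illustrative.

```r
collected <- character(0)

for (letter in LETTERS) {
  if (letter == "B") {
    next     # skip this iteration, continue with "C"
  }
  if (letter == "E") {
    break    # leave the loop; later letters are never visited
  }
  collected <- c(collected, letter)
}

print(collected)
```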
Functions
A function is a set of statements organized together to perform a specific task. R has a large number
of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any result which
may be stored in other objects.
Function Definition:-
An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −
function_name <- function(arg_1, arg_2, ...) {
  # function body
}
Function Components:
The different parts of a function are −
Function Name − This is the actual name of the function. It is stored in R environment as an
object with this name.
Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the
argument. Arguments are optional; that is, a function may contain no arguments. Also arguments
can have default values.
Function Body − The function body contains a collection of statements that defines what the
function does.
Return Value − The return value of a function is the last expression in the function body to be
evaluated.
R has many in-built functions which can be directly called in the program without defining them first.
We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste(). They are
directly called by user-written programs. Examples:-
print(seq(32,44))
print(mean(25:82))
print(sum(41:68))
When we execute the above code, it produces the following result −
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once created
they can be used like the built-in functions. Below is an example of how a function is created and
used.
# Create a function to print squares of numbers in sequence.
display <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
# Call the function display supplying 6 as an argument.
display(6)
When we execute the above code, it produces the following result −
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
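The function components described earlier (arguments with default values, and the last evaluated expression as the return value) can be sketched with a hypothetical function power_sum(), which is not part of R itself.

```r
# Hypothetical function: sums the first n natural numbers, each raised
# to a power. The exponent argument has a default value of 2.
power_sum <- function(n, exponent = 2) {
  # The last evaluated expression is the return value.
  sum(seq_len(n)^exponent)
}

# Uses the default exponent: 1 + 4 + 9 + 16 + 25 + 36.
print(power_sum(6))

# Overrides the default by naming the argument: 1 + 8 + 27.
print(power_sum(3, exponent = 3))
```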
Debugging in R Programming
Debugging is the process of finding and removing bugs from program code so that it runs correctly. Some
mistakes only show up when the code is run, and they can be hard to diagnose, especially when the error
occurs several levels deep in a chain of function calls, so fixing them can take a lot of time.
Various debugging tools are:
browser()
Editor breakpoint
traceback()
recover()
browser() Function
The browser() function is inserted into a function to open R's interactive debugger. It stops the
execution of the function at that point so that you can examine the function's own environment. In
debug mode, we can modify objects, look at the objects in the current environment, and also continue executing.
Editor Breakpoints
Editor breakpoints can be added in RStudio by clicking to the left of the line number or by
pressing Shift+F9 with the cursor on the line. A breakpoint works like browser() but doesn't involve
changing the code. Breakpoints are denoted by a red circle on the left side, indicating that debug mode
will be entered at this line after the source is run.
traceback() Function
The traceback() function shows how your code arrived at an error. It displays the sequence of
function calls that led up to the error, known as the “call stack”.
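traceback() is meant for interactive use after an error. As a rough non-interactive analogue (a sketch, not how traceback() itself is implemented), the call stack can be captured at the moment of the error with withCallingHandlers() and sys.calls(). The functions outer(), middle() and inner() are hypothetical; defining outer() here masks base R's outer() for the rest of the script.

```r
# Three hypothetical nested functions; the innermost one fails.
inner <- function(x) stop("bad input")
middle <- function(x) inner(x)
outer <- function(x) middle(x)

captured <- NULL
result <- tryCatch(
  withCallingHandlers(
    outer(1),
    # Runs while the failing calls are still on the stack.
    error = function(e) captured <<- sys.calls()
  ),
  # Runs after the stack has unwound; keeps the script going.
  error = function(e) conditionMessage(e)
)

# Name of the function at the head of each captured call.
called <- vapply(captured, function(cl) deparse(cl[[1]])[1], character(1))
print(result)
print(called)
```

The printed names include outer, middle and inner in calling order, mixed with the internal frames of the handlers themselves.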
recover() Function
recover() is used as an error handler rather than called directly as a statement, typically by
setting options(error = recover). When an error occurs, R prints the whole call stack and lets you
select which function's environment you would like to browse. A debugging session then starts at the
selected location.
Scoping Rules of R
The scoping rules of a language determine how a value is associated
with a free variable in a function. Scoping rules are of two types:
lexical scoping and dynamic scoping. R uses lexical scoping.
Lexical Scoping
In lexical scoping the scope of a variable is determined by the textual
structure of a program. Most programming languages in use today are
lexically scoped, so a reader can determine the scope of a variable just
by reading the code. Below is an example of lexical scoping in R.
a <- 10
f <- function(x) {
a <- 2
cat(a ^ 2 + g(x))
}
g <- function(x) {
x * a
}
f(3)
Output:
34
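The result is 34 because a inside f is the local value 2 (giving 4), while g
looks up a in the global environment where g was defined (giving 3 * 10 = 30).
Compared with dynamic scoping, lexical scoping provides less flexibility;
dynamic scoping provides more flexibility.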
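A common consequence of lexical scoping is the closure: a function keeps access to the variables of the environment in which it was defined. The sketch below uses a hypothetical make_counter() function to illustrate this.

```r
# Hypothetical example: make_counter() returns a function that remembers
# the variable count from the environment where it was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
first <- counter()
second <- counter()

# A second counter has its own, independent count.
other <- make_counter()
other_first <- other()

print(c(first, second, other_first))
```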
Profiling R code
The R Profiler
Using system.time() allows you to test certain functions or code blocks to
see if they are taking excessive amounts of time. However, this approach
assumes that you already know where the problem is and can
call system.time() on that piece of code.
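A minimal sketch of timing a code block with system.time(): the block in braces is evaluated once and the elapsed time is read from the result. The loop is only an illustrative workload.

```r
# Time a deliberately slow loop that adds the numbers 1 to 1,000,000.
timing <- system.time({
  total <- 0
  for (i in 1:1e6) {
    total <- total + i
  }
})

# The result holds user, system and elapsed times in seconds.
print(timing)
print(timing[["elapsed"]])
```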
The Rprof() function starts the profiler in R. In conjunction
with Rprof(), we will use the summaryRprof() function which summarizes the
output from Rprof() (otherwise it’s not really readable).
Rprof() keeps track of the function call stack at regularly sampled
intervals and tabulates how much time is spent inside each function. By
default, the profiler samples the function call stack every 0.02 seconds.
The profiler is started by calling the Rprof() function.
> Rprof() ## Turn on the profiler
By default it will write its output to a file called Rprof.out.
Once you call the Rprof() function, everything that you do from then on
will be measured by the profiler.
The profiler can be turned off by passing NULL to Rprof().
> Rprof(NULL) ## Turn off the profiler
Using summaryRprof()
The summaryRprof() function tabulates the R profiler output and calculates
how much time is spent in which function. There are two methods for
normalizing the data.
“by.total” divides the time spent in each function by the total run time
“by.self” does the same as “by.total” but first subtracts out time spent
in functions above the current function in the call stack.
Here is what summaryRprof() reports in the “by.total” output.
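The profiler output itself is not reproduced in these notes. As a sketch of how such a report can be produced, the code below profiles an illustrative workload, writes the samples to a temporary file, and prints the top of the "by.total" table. The workload and the names profile_file and prof are assumptions, and the timings will differ on every machine.

```r
# Write profiler samples to a temporary file instead of Rprof.out.
profile_file <- tempfile(fileext = ".out")

Rprof(profile_file)            # turn on the profiler
x <- 0
for (i in 1:100) {
  # Illustrative workload: sort 100,000 random numbers and sum them.
  x <- x + sum(sort(runif(1e5)))
}
Rprof(NULL)                    # turn off the profiler

# Summarise the samples; by.total ranks functions by total time.
prof <- summaryRprof(profile_file)
print(head(prof$by.total))
```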
Loop Functions
Looping on the Command Line
Writing for and while loops is useful when programming but not particularly
easy when working interactively on the command line. Multi-line expressions
with curly braces are just not that easy to sort through when working on the
command line. R has some functions which implement looping in a compact form
to make your life easier.
apply(): Apply a function over the margins of an array
lapply(): Loop over a list and evaluate a function on each element
sapply(): Same as lapply but try to simplify the result
tapply(): Apply a function over subsets of a vector
mapply(): Multivariate version of lapply
apply() function
The apply() function lets us apply a function to the rows or columns of a
matrix or data frame. It takes the matrix or data frame as an argument
along with the function and whether it is to be applied by row or by column,
and returns the results as a vector, array or list.
Syntax: apply( x, margin, function )
Parameters:
x: the input array (including a matrix).
margin: if the margin is 1 the function is applied to each row; if the
margin is 2 it is applied to each column.
function: the function to be applied to the input data.
Example:
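A minimal sketch of apply() on a small matrix, using margin 1 for rows and margin 2 for columns. The matrix and result names are illustrative.

```r
# A 2x3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6).
m <- matrix(1:6, nrow = 2)

# Margin 1: apply sum() to each row.
row_sums <- apply(m, 1, sum)

# Margin 2: apply max() to each column.
col_max <- apply(m, 2, max)

print(row_sums)
print(col_max)
```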
lapply() function
The lapply() function helps us in applying functions on list objects and
returns a list object of the same length. The lapply() function in the R
Language takes a list, vector, or data frame as input and gives output in the
form of a list object. Since the lapply() function applies a certain operation
to all the elements of the list it doesn’t need a MARGIN.
Syntax: lapply( x, fun )
Parameters:
x: determines the input vector or an object.
fun: determines the function that is to be applied to input data.
Example:
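A minimal sketch of lapply(): the built-in nchar() function is applied to each element of a list, and the result is a list of the same length. The names words and word_lengths are illustrative.

```r
words <- list("apple", "kiwi", "banana")

# Apply nchar() to every element; the result is always a list.
word_lengths <- lapply(words, nchar)

print(word_lengths)
print(length(word_lengths))
```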
sapply() function
The sapply() function applies a function to each element of a list, vector, or
data frame and simplifies the result: it returns a vector or matrix where
possible, and a list otherwise. Since the sapply() function applies the
operation to all the elements of the object it doesn’t need a MARGIN. It is
the same as lapply() with the only difference being the type of return object.
Syntax: sapply( x, fun )
Parameters:
x: determines the input vector or an object.
fun: determines the function that is to be applied to input data.
Example:
Here is a basic example showcasing the use of the sapply() function on a
data frame.
# create sample data
sample_data<- data.frame( x=c(1,2,3,4,5,6),
y=c(3,2,4,2,34,5))
print( "original data:")
sample_data
# apply sapply() function
print("data after sapply():")
sapply(sample_data, max)
Output of the final sapply() call (the maximum of each column):
 x  y
 6 34
tapply() function
The tapply() function computes a statistical measure (mean, median, min, max,
etc.) or a self-written function for each group in a vector, where the groups
are given by a factor. It splits the vector into subsets and then applies the
function to each subset. For example, if an organization has salary data for
its employees and we want the mean salary for male and female employees, we
can use tapply() with gender as the factor variable.
Syntax: tapply( x, index, fun )
Parameters:
x: determines the input vector or an object.
index: determines the factor vector that helps us distinguish the
data.
fun: determines the function that is to be applied to input data.
Example:
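A minimal sketch of the salary scenario described above, with made-up figures: tapply() splits the salaries by gender and computes the mean of each group. All data values are illustrative.

```r
# Illustrative data: five salaries and the gender of each employee.
salary <- c(50, 60, 55, 70, 60)
gender <- factor(c("M", "F", "M", "F", "M"))

# Mean salary for each level of the factor.
mean_by_gender <- tapply(salary, gender, mean)

print(mean_by_gender)
```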
mapply() function
The mapply() function stands for ‘multivariate’ apply. Its purpose is to be
able to vectorize arguments to a function that is not usually accepting
vectors as arguments.
In short, mapply() applies a Function to Multiple List or multiple Vector
Arguments.
Example-
The following code shows how to use mapply() to create a matrix by
repeating the values c(1, 2, 3) each 5 times:
#create matrix
mapply(rep, 1:3, times=5)
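To show the "multivariate" part more directly, the sketch below passes two vectors to mapply() so that the function receives one element from each on every call. The result names are illustrative.

```r
# Add corresponding elements of two vectors: (1+4, 2+5, 3+6).
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Vary both arguments of rep(): repeat 1 three times, 2 twice, 3 once.
# The results have different lengths, so mapply() returns a list.
repeated <- mapply(rep, 1:3, 3:1)

print(sums)
print(repeated)
```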
Simulation
Simulation is important both for statistics and for a variety of other
areas where there is a need to introduce randomness. Sometimes we want to
simulate a system, and random number generators can be used to model random
inputs.
R comes with a set of pseudo-random number generators that allow you to
simulate from well-known probability distributions like the Normal,
Poisson, and binomial. Some example functions for probability
distributions in R are:
rnorm: generate random Normal variates with a given mean and
standard deviation
dnorm: evaluate the Normal probability density (with a given
mean/SD) at a point (or vector of points)
pnorm: evaluate the cumulative distribution function for a
Normal distribution
rpois: generate random Poisson variates with a given rate
Here we simulate standard Normal random numbers with mean 0 and standard
deviation 1.
> ## Simulate standard Normal random numbers
> x <- rnorm(10)
> x
 [1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513  0.38979430
 [7] -1.20807618 -0.36367602 -1.62667268 -0.25647839
We can modify the default parameters to simulate numbers with mean 20 and
standard deviation 2.
> x <- rnorm(10, 20, 2)
> x
[1] 22.20356 21.51156 19.52353 21.97489 21.48278 20.17869 18.09011 19.60970
[9] 21.85104 20.96596
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.09 19.75 21.22 20.74 21.77 22.20
If you wanted to know the probability of a standard Normal random variable
being less than, say, 2, you could use the pnorm() function to do that
calculation.
> pnorm(2)
[1] 0.9772499
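As a sketch of how the distribution functions relate to each other, the code below computes the lower and upper tail probabilities at 2 and then inverts the result with qnorm(), the quantile function of the Normal distribution. The variable names are illustrative.

```r
# Probability that a standard Normal value is below 2.
p_below <- pnorm(2)

# Probability that it is above 2 (the upper tail).
p_above <- pnorm(2, lower.tail = FALSE)

# qnorm() is the inverse of pnorm(): it recovers the point 2.
point <- qnorm(p_below)

print(c(p_below, p_above, point))
```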
The set.seed() function sets a seed for R’s random number generator, so that
we get the same values consistently each time we run the code. Otherwise, we
will get different results.
The runif() function draws uniform random numbers. To get integers, we use
the round function, as follows:
round(runif(25,min=0,max=10))
The sample() function draws random elements from a vector:
set.seed(24)
sample(seq(1,10),8)
set.seed(1)
sample(letters,18)
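The effect of set.seed() can be checked directly. In this sketch the same seed is set twice and the two draws are compared; the seed value 42 and the variable names are arbitrary.

```r
# Same seed, same draws.
set.seed(42)
first_draw <- rnorm(5)

set.seed(42)
second_draw <- rnorm(5)

print(identical(first_draw, second_draw))

# sample() is reproducible in the same way.
set.seed(42)
first_sample <- sample(1:10, 4)
set.seed(42)
second_sample <- sample(1:10, 4)

print(identical(first_sample, second_sample))
```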
[Figure: histogram of the sample drawn with equal probabilities]
If we set prob=rep(0.1,5), then numbers 1 to 5 will be equally sampled, as
shown in the histogram.
# Scenario 2: Unequal Probability
We generate a random sample of size 10000 from the sequence 1 to 5 with unequal
probabilities
unequal_prob_dist = sample(5,10000,prob =c(0.1,0.25,0.4,0.25,0.1), replace=TRUE)
hist(unequal_prob_dist)
[Figure: histogram of the sample drawn with unequal probabilities]
We set the following selection probability rules for the sequence:
1 & 5: 0.1    2 & 4: 0.25    3: 0.4
As it turns out, number 3 is the most selected and 1 & 5 are the least
selected.
Example: There are two 6-sided dice. If you roll them together, what is the
probability of rolling a total of 7?
set.seed(1)
die = 1:6
die1 = sample(die,10000,replace = TRUE,prob=NULL)
die2 = sample(die,10000,replace = TRUE,prob=NULL)
outcomes = die1+die2
mean(outcomes == 7)
O/p- [1] 0.1614
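The simulated estimate can be compared with the exact answer. As a sketch, the code below enumerates all 36 equally likely outcomes of two dice and computes the share that total 7, which is 6/36, or about 0.1667. The names totals and exact are illustrative.

```r
die <- 1:6

# 6x6 table of every possible total of two dice.
totals <- outer(die, die, "+")

# Share of the 36 equally likely outcomes that total 7.
exact <- mean(totals == 7)

print(sum(totals == 7))
print(exact)
```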