R Programming UNIT-1
College for Women (A), Nellore
COURSE 1: INTRODUCTION TO
DATA SCIENCE AND R PROGRAMMING
UNIT-1
UNIT I: Syllabus
Defining Data Science and Big Data, Benefits and
Uses, Facets of Data, Data Science Process. History and
Overview of R, Getting Started with R, R Nuts and
Bolts
Can be challenging to integrate with existing systems
and processes
Similarities between Big Data and Data Science:
Both fields deal with large amounts of data and require
specialized skills and expertise
Both aim to extract insights and knowledge from data
to inform decision-making
Both have a wide range of applications in various
industries
Both can lead to significant cost savings and
1. Structured
2. Semi-structured
3. Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
1. Structured Data:
1. It covers all data that can be stored in an SQL
database, in tables with rows and columns.
2. Such data has relational keys and can be easily mapped
into pre-designed fields.
3. Today, structured data is the most widely processed
kind of data in development and the simplest way to
manage information.
4. However, structured data represents only 5 to 10% of
all data.
2. Semi-structured Data:
1. Semi-structured data is information that doesn’t
reside in a relational database but that does have some
organizational properties that make it easier to analyze.
2. With some processing you can store it in a relational
database (this can be very hard for some kinds of
semi-structured data), but the semi-structure exists to
save space, improve clarity, or ease computation…
3. Like structured data, semi-structured data
represents only a small share of all data (5 to 10%).
Examples of semi-structured data: JSON, CSV, and XML
documents.
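As a rough illustration of the "with some processing" step above, the following R sketch parses a small piece of CSV text into a table of rows and columns. The records and the column names (name, city, age) are invented for illustration; read.csv() is part of base R.

```r
# Hypothetical CSV text; in practice this would usually come from a file.
csv_text <- "name,city,age
Asha,Nellore,21
Meena,Chennai,23"

# read.csv() turns the delimited text into a data frame (rows and columns).
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the inferred column types, then work with one column.
str(people)
mean(people$age)
```

Once the text is in a data frame, it can be handled like any other structured table.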
3. Unstructured Data:
1. Unstructured data represents around 80% of all data.
2. It often includes text and multimedia content.
3. Examples include e-mail messages, word processing
documents, videos, photos, audio files, presentations,
webpages and many other kinds of business
documents.
4. Unstructured data is everywhere.
5. In fact, most individuals and organizations conduct
their lives around unstructured data.
6. Just as with structured data, unstructured data is
either machine generated or human generated.
Here are some examples of machine-generated
unstructured data:
Satellite images: This includes weather data or the data
that the government captures in its satellite surveillance
imagery. Just think about Google Earth, and you get the
picture.
Photographs and video: This includes security,
surveillance, and traffic video.
Radar or sonar data: This includes vehicular,
meteorological, and oceanographic seismic profiles.
The following list shows a few examples of human-
generated unstructured data:
Social media data: This data is generated from the
social media platforms such as YouTube, Facebook,
Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages
and location information.
Website content: This comes from any site delivering
unstructured content, like YouTube, Flickr, or
Instagram.
i) Natural Language:
Natural language is a special type of unstructured data;
it’s challenging to process because it requires
knowledge of specific data science techniques and
linguistics.
The natural language processing community has had
success in entity recognition, topic recognition,
summarization, and sentiment analysis, but models
trained in one domain don’t generalize well to other
domains.
ii) Graph-based or Network Data:
In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
Graph or network data is, in short, data that focuses on
the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to
represent and store graphical data. Graph-based data is
a natural way to represent social networks.
iii) Audio, Image & Video:
Audio, image, and video are data types that pose
specific challenges to a data scientist.
MLBAM (Major League Baseball Advanced Media)
announced in 2014 that they’ll increase video capture
to approximately 7 TB per game for the purpose of live,
in-game analytics. High-speed cameras at stadiums will
capture ball and athlete movements to calculate in real
time, for example, the path taken by a defender relative
to two baselines.
iv) Streaming Data:
Streaming data is data that is generated continuously by
thousands of data sources, which typically send in the
data records simultaneously and in small sizes (on the
order of kilobytes).
Examples include log files generated by customers
using your mobile or web applications, online game
activity, “What’s trending” on Twitter, live sporting or
music events, and the stock market.
Facets of Data
• Big data and data science generate very large amounts
of data. This data comes in various types; the main
categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a row and column
format. This helps applications retrieve and process
data easily. A database management system is used for
storing structured data.
• The term structured data refers to data that is
identifiable because it is organized in a structure. The
most common form of structured data or records is a
database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within
content. Structured data is understood by computers and
is also efficiently organized for human readers.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a
specified format. Rows and columns are not used for
unstructured data. Therefore it is difficult to retrieve
required information. Unstructured data has no
identifiable structure.
• Unstructured data can take the form of text
(documents, email messages, customer feedback),
audio, video, or images. Email is an example of
unstructured data.
• Even today, in most organizations more than 80%
of the data is in unstructured form. This data carries a lot
of information, but extracting information from these
various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the
data.
2. Data can be of any type.
3. Unstructured data does not follow any structural
rules.
4. There are no predefined formats, restrictions or
sequence for unstructured data.
5. Since there is no structural binding for unstructured
data, it is unpredictable in nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to
recognize characters, words and sentences, then apply
meaning and understanding to that information. This
helps machines to understand language as humans do.
• Natural language processing is the driving force
behind machine intelligence in many modern real-world
applications. The natural language processing
community has had success in entity recognition, topic
recognition, summarization, text completion and
sentiment analysis.
• For natural language processing to help machines
understand human language, it must go through speech
recognition, natural language understanding and
machine translation. It is an iterative process comprised
of several layers of text analysis.
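The first layers of text analysis mentioned above (recognizing words, then counting them) can be sketched in base R. This is only a toy example on an invented sentence, not a full natural language processing pipeline.

```r
# Hypothetical input sentence used only for illustration.
text <- "R is free. R is powerful, and R is fun!"

# Normalise case and strip punctuation so that "R" and "r." count as one word.
clean <- gsub("[[:punct:]]", "", tolower(text))

# Split on whitespace to obtain individual word tokens.
tokens <- strsplit(clean, "[[:space:]]+")[[1]]

# Count how often each token occurs, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)
freq
```

Real tasks such as sentiment analysis or entity recognition build on token counts like these, but require much richer models.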
Machine-Generated Data
• Machine-generated data is information that is
created without human interaction as a result of a
computer process or application activity. This means
that data entered manually by an end-user is not
considered machine-generated.
• Machine data contains a definitive record of all
activity and behavior of our customers, users,
transactions, applications, servers, networks, factory
machinery and so on.
• It includes configuration data, data from APIs and
message queues, change events, the output of diagnostic
commands, call detail records, sensor data from
remote equipment and more.
• Examples of machine data are web server logs, call
detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-
Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every
processor-based system, as well as many consumer-
oriented systems.
• It can be either structured or unstructured. In recent
years, the volume of machine data has surged. The
expansion of mobile devices, virtual servers and
desktops, as well as cloud-based services and RFID
technologies, is making IT infrastructures more
complex.
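To make the idea of machine data concrete, here is a minimal R sketch that turns a few web server log lines into a table. The log format (date, time, method, path, status code) and the column names are assumptions made for this example; real servers use a variety of formats.

```r
# Hypothetical web server log lines: date, time, method, path, status code.
log_lines <- c(
  "2024-01-15 10:02:11 GET /index.html 200",
  "2024-01-15 10:02:12 GET /missing.html 404",
  "2024-01-15 10:02:15 POST /login 200"
)

# read.table() splits each line on whitespace and builds a data frame.
logs <- read.table(
  text = log_lines,
  col.names = c("date", "time", "method", "path", "status"),
  stringsAsFactors = FALSE
)

# Count requests per status code and list the failed requests.
table(logs$status)
logs[logs$status >= 400, c("time", "path")]
```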
Graph-based or Network Data
• Graphs are data structures that describe relationships and
interactions between entities in complex systems. In
general, a graph contains a collection of entities called
nodes and another collection of interactions between
pairs of nodes called edges.
• Nodes represent entities, which can be of any object
type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a
graph (network) of nodes.
• A graph database stores nodes and relationships
instead of tables or documents. Data is stored just like
we might sketch ideas on a whiteboard. Our data is
stored without restricting it to a predefined model,
allowing a very flexible way of thinking about and
using it.
• Graph databases are used to store graph-based data
and are queried with specialized query languages such
as SPARQL.
• Graph databases are capable of sophisticated fraud
prevention. With graph databases, we can use
relationships to process financial and purchase
transactions in near-real time. With fast graph queries,
we are able to detect that, for example, a potential
purchaser is using the same email address and credit
card as included in a known fraud case.
• Graph databases can also help users easily detect
relationship patterns such as multiple people associated
with a personal email address or multiple people
sharing the same IP address but residing at different
physical addresses.
• Graph databases are a good choice for
recommendation applications. With graph databases,
we can store in a graph relationships between
information categories such as customer interests,
friends and purchase history. We can use a highly
available graph database to make product
recommendations to a user based on which products are
purchased by others who follow the same sport and
have similar purchase history.
• Graph theory was probably the main method of social
network analysis in the early history of the social
network concept. The approach is applied to social
network analysis in order to determine important
features of the network such as the nodes and links (for
example, influencers and their followers).
• Influencers on a social network have been identified as
users that have an impact on the activities or opinions of
other users by way of followership or influence on
decisions made by other users on the network, as shown
in Fig. 1.2.1.
• Graph theory has proved to be very effective on large-
scale datasets such as social network data. This is
because it is capable of by-passing the building of an
actual visual representation of the data to run directly
on data matrices.
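The idea of running graph analysis directly on a data matrix can be sketched in base R. The small social network here is invented, and counting each person's connections (the degree) is just one simple way to look for a likely influencer.

```r
# Hypothetical friendship network stored as an edge list.
edges <- data.frame(
  from = c("Asha", "Asha", "Meena", "Ravi"),
  to   = c("Meena", "Ravi", "Ravi", "Kiran"),
  stringsAsFactors = FALSE
)

# Nodes are all the distinct people appearing in the edge list.
people <- sort(unique(c(edges$from, edges$to)))

# Adjacency matrix: entry [i, j] is 1 when person i is linked to person j.
adj <- matrix(0L, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))

# Friendships are undirected, so mark each edge in both directions.
adj[cbind(edges$from, edges$to)] <- 1L
adj[cbind(edges$to, edges$from)] <- 1L

# Degree = number of connections per person; the largest is the best connected.
degree <- rowSums(adj)
degree
names(which.max(degree))
```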
Audio, Image and Video
• Audio, image and video are data types that pose
specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to the
time-based media storage formats for sound/music and
moving-picture information. Digital audio and video
recordings, encoded with audio and video codecs, can
be uncompressed, lossless compressed or lossy
compressed depending on the desired quality and use
cases.
• It is important to remark that multimedia data is one of
the most important sources of information and
knowledge; the integration, transformation and indexing
of multimedia data bring significant challenges in data
management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature
of Data Science and heterogeneity.
• Data Science is playing an important role to address
these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text,
image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data
Science can be a key instrument covering big data,
machine learning and data mining solutions to store,
handle and analyze such heterogeneous data.
Streaming Data
• Streaming data is data that is generated continuously by
thousands of data sources, which typically send in the
data records simultaneously and in small sizes (on the
order of kilobytes).
• Streaming data includes a wide variety of data such as
log files generated by customers using your mobile or
web applications, ecommerce purchases, in-game
player activity, information from social networks,
financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in
data centers.
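Because streaming records arrive continuously, a common approach is to update a summary one record at a time instead of storing everything first. The following R sketch simulates this with a short vector of invented sensor readings and an incrementally updated mean.

```r
# Hypothetical stream: sensor readings that arrive one at a time.
readings <- c(21.5, 22.0, 21.8, 23.1, 22.6)

n <- 0             # number of records seen so far
running_mean <- 0  # mean of the records seen so far

for (value in readings) {
  n <- n + 1
  # Incremental update: shift the mean toward the new value by 1/n of the gap.
  running_mean <- running_mean + (value - running_mean) / n
}

running_mean
```

The loop never needs more than the current record and two numbers, which is why this pattern suits data that is too large or too fast to hold in memory.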
Difference between Structured and Unstructured
Data
wrangling.
With R, data scientists can apply machine learning
Run R Programs
You can run R programs in two different ways:
• Installing R on your local machine
• Using an online environment
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
We can also create an empty list of a prespecified
length with the vector() function
> x <- vector("list", length = 5)
> x
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
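One typical use of a preallocated list is to fill it inside a loop. The sketch below is an illustration with made-up values; the variable name results is an assumption, not something from the notes.

```r
# Preallocate a list with three empty (NULL) slots.
results <- vector("list", length = 3)

# Fill each slot in turn; [[i]] assigns to a single list element.
for (i in seq_along(results)) {
  results[[i]] <- i^2
}

results
```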
4.11 Factors
Factors are used to represent categorical data and can
be unordered or ordered. One can think of a factor as
an integer vector where each integer has a label.
Factors are important in statistical modeling and are
treated specially by modelling functions
like lm() and glm().
Using factors with labels is better than using integers
because factors are self-describing. Having a variable
that has values “Male” and “Female” is better than a
variable that has values 1 and 2.
Factor objects can be created with the factor() function.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
 no yes
  2   3
> ## See the underlying representation of factor
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
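In versions of R before 4.0.0, factors were often created
automatically when you read a dataset in using a function
like read.table(), because those functions defaulted to
converting character columns to factors. Since R 4.0.0
the default is stringsAsFactors = FALSE, so character
columns stay as character vectors unless you request
the conversion.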
The order of the levels of a factor can be set using
the levels argument to factor(). This can be important
in linear modelling because the first level is used as the
baseline level.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x ## Levels are put in alphabetical order
[1] yes yes no yes no
Levels: no yes
> x <- factor(c("yes", "yes", "no", "yes", "no"),
+ levels = c("yes", "no"))
> x
[1] yes yes no yes no
Levels: yes no
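The effect of level order can be explored a little further. This sketch, using invented values, shows relevel() as an alternative way to choose the baseline level, and an ordered factor whose levels can be compared.

```r
# Default: levels are sorted alphabetically, so "no" is the baseline.
x <- factor(c("yes", "yes", "no", "yes", "no"))

# relevel() moves the chosen level to the front without retyping all levels.
x2 <- relevel(x, ref = "yes")
levels(x2)

# An ordered factor records that small < medium < large.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# The underlying integer codes follow the level order given above.
as.integer(sizes)

# Comparisons are meaningful only for ordered factors.
sizes[2] > sizes[1]
```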
4.12 Missing Values
Missing values are denoted by NA, or by NaN for
undefined mathematical operations.
• is.na() is used to test objects if they are NA
• is.nan() is used to test for NaN
• NA values have a class also, so there are integer NA,
character NA, etc.
• A NaN value is also NA but the converse is not true
> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> ## Now create a vector with both NA and NaN values
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE TRUE TRUE FALSE
> is.nan(x)
[1] FALSE FALSE TRUE FALSE FALSE
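In practice, missing values usually have to be counted or skipped before computing a summary. The following sketch reuses the vector from the example above; the variable names are chosen only for illustration.

```r
x <- c(1, 2, NA, 10, 3)

# Count the missing values: TRUE counts as 1 when summed.
n_missing <- sum(is.na(x))

# Most summaries return NA if any input is NA ...
mean_all <- mean(x)

# ... unless the missing values are explicitly removed.
mean_known <- mean(x, na.rm = TRUE)

# Keep only the observed values.
x_complete <- x[!is.na(x)]

# NA has a class: plain NA is logical, but typed versions exist.
class(NA)
class(NA_integer_)
class(NA_character_)
```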
4.13 Data Frames
Data frames are used to store tabular data in R. They
are an important type of object in R and are used in a
variety of statistical modeling applications. Hadley
Wickham’s package dplyr has an optimized set of
functions designed to work efficiently with data
frames.
Data frames are represented as a special type of list
where every element of the list has to have the same
length. Each element of the list can be thought of as a
column and the length of each element of the list is the
number of rows.
Unlike matrices, data frames can store different classes
of objects in each column. Matrices must have every
element be the same class (e.g. all integers or all
numeric).
In addition to column names, indicating the names of
the variables or predictors, data frames have a special
attribute called row.names which indicate information
about each row of the data frame.
Data frames are usually created by reading in a dataset
using the read.table() or read.csv(). However, data
frames can also be created explicitly with
the data.frame() function or they can be coerced from
other types of objects like lists.
Data frames can be converted to a matrix by
calling data.matrix(). While it might seem that
the as.matrix() function should be used to coerce a data
frame to a matrix, almost always, what you want is the
result of data.matrix().
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
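The difference between as.matrix() and data.matrix() shows up when a data frame has a non-numeric column. The small data frame below is invented for illustration, and stringsAsFactors = TRUE is set explicitly so the example behaves the same across R versions.

```r
# Hypothetical data frame with an integer column and a factor column.
df <- data.frame(id = 1:3, grade = c("A", "B", "A"),
                 stringsAsFactors = TRUE)

# as.matrix() falls back to a character matrix because of the factor column.
chr_m <- as.matrix(df)
typeof(chr_m)

# data.matrix() converts the factor to its integer codes, giving a numeric matrix.
num_m <- data.matrix(df)
typeof(num_m)
num_m
```

A numeric matrix is what most modelling and linear algebra functions expect, which is why data.matrix() is often the more useful conversion.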
4.14 Names
R objects can have names, which is very useful for
writing readable code and self-describing objects. Here
is an example of assigning names to an integer vector.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los Angeles")
> x
   New York     Seattle Los Angeles
          1           2           3
> names(x)
[1] "New York" "Seattle" "Los Angeles"
Lists can also have names, which is often very useful.
> x <- list("Los Angeles" = 1, Boston = 2, London = 3)
> x
$`Los Angeles`
[1] 1
$Boston
[1] 2
$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston" "London"
Matrices can have both column and row names.
> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4
Column names and row names can be set separately
using the colnames() and rownames() functions.
> colnames(m) <- c("h", "f")
> rownames(m) <- c("x", "z")
> m
  h f
x 1 3
z 2 4
Note that for data frames, there is a separate function
for setting the row names, the row.names() function.
Also, data frames do not have column names, they just
have names (like lists). So to set the column names of a
data frame just use the names() function. Yes, I know
it’s confusing. Here’s a quick summary:
Object       Set column names   Set row names
data frame   names()            row.names()
matrix       colnames()         rownames()
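The naming functions for data frames and matrices can be tried side by side. The objects and the names assigned here are made up for the example.

```r
# Data frame: names() sets the column names, row.names() the row names.
df <- data.frame(foo = 1:2, bar = c("a", "b"))
names(df) <- c("id", "label")
row.names(df) <- c("first", "second")
df["second", "id"]

# Matrix: colnames() and rownames() do the same jobs.
m <- matrix(1:4, nrow = 2, ncol = 2)
colnames(m) <- c("h", "f")
rownames(m) <- c("x", "z")
m["z", "f"]
```

Once names are set, elements can be selected by name instead of by position, which tends to make code easier to read.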
4.15 Summary
There are a variety of different built-in data types in R.
In this chapter we have reviewed the following:
• atomic classes: numeric, logical, character, integer,
complex
• vectors, lists
• factors
• missing values
• data frames and matrices
All R objects can have attributes that help to describe
what is in the object. Perhaps the most useful attribute
is names, such as column and row names in a data
frame, or simply names in a vector or list. Attributes
like dimensions are also important as they can modify
the behavior of objects, like turning a vector into a
matrix.
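The point about dimensions can be checked directly: assigning a dim attribute to a plain vector is enough to make R treat it as a matrix. A minimal sketch with illustrative variable names:

```r
# A plain integer vector has no attributes.
v <- 1:6
attributes(v)

# Setting the dim attribute turns the same data into a 2 x 3 matrix.
dim(v) <- c(2, 3)
is.matrix(v)
attributes(v)

# Names are stored as an attribute too.
named <- c(a = 1, b = 2)
attributes(named)
```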