
D.K. Govt. College for Women (A), Nellore
COURSE 1: INTRODUCTION TO DATA SCIENCE AND R PROGRAMMING

UNIT-1

UNIT I: Syllabus
Defining Data Science and Big Data, Benefits and Uses, Facets of Data, Data Science Process. History and Overview of R, Getting Started with R, R Nuts and Bolts

Introduction about Big Data:

Big data is the huge, voluminous data and the relevant statistics acquired by large organizations and ventures. Because big data is far too large to process manually, much specialized software and data storage has been created for it. It is used to discover patterns and trends and to make decisions about human behavior and interaction with technology.

Big data refers to significant volumes of data that cannot be processed effectively with the traditional applications that are currently used. The processing of big data begins with raw data that isn't aggregated and is most often impossible to store in the memory of a single computer.
Big data is a buzzword used to describe immense volumes of data, both unstructured and structured, that can inundate a business on a day-to-day basis. Big data is analyzed for insights, which can lead to better decisions and strategic business moves.
In summary, Gartner provides the following definition of big data: “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

Applications of Big Data

 Big Data for Financial Services
Credit card companies, retail banks, private wealth management advisories, insurance firms, venture funds, and institutional investment banks all use big data for their financial services. The common problem among them all is the massive amounts of multi-structured data living in multiple disparate systems, which big data can solve. As such, big data is used in several ways, including:
1. Customer analytics
2. Compliance analytics
3. Fraud analytics
4. Operational analytics
 Big Data in Communications
Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated and machine-generated data being created every day.
 Big Data for Retail
Whether it's a brick-and-mortar company or an online retailer, the answer to staying in the game and being competitive is understanding the customer better. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.
Advantages of Big Data:
 Able to handle and process large and complex data sets that cannot be easily managed with traditional database systems
 Provides a platform for advanced analytics and machine learning applications
 Enables organizations to gain insights and make data-driven decisions based on large amounts of data
 Offers potential for significant cost savings through efficient data management and analysis

Disadvantages of Big Data:
 Requires specialized skills and expertise in data engineering, data management, and big data tools and technologies
 Can be expensive to implement and maintain due to the need for specialized infrastructure and software
 May face privacy and security concerns when handling sensitive data
 Can be challenging to integrate with existing systems and processes
Introduction about Data Science:

Data Science is a field or domain which involves working with a huge amount of data and using it to build descriptive, predictive, and prescriptive analytical models. It is about digging into and capturing data, building and validating models, and utilizing the data by deploying the best model. It is an intersection of data and computing, and a blend of the fields of Computer Science, Business, and Statistics.

Data science is a field that deals with structured, semi-structured, and unstructured data. It involves practices like data cleansing, data preparation, data analysis, and much more.
Data science is the combination of statistics, mathematics, programming, and problem-solving; capturing data in ingenious ways; the ability to look at things differently; and the activity of cleansing, preparing, and aligning data. This umbrella term includes the various techniques that are used when extracting insights and information from data.

Applications of Data Science

 Internet Search
Search engines make use of data science algorithms to deliver the best results for search queries in seconds.
 Digital Advertisements
The entire digital marketing spectrum uses data science algorithms, from display banners to digital billboards. This is the main reason that digital ads have higher click-through rates than traditional advertisements.
 Recommender Systems
Recommender systems not only make it easy to find relevant products among billions of available products, but they also add a lot to the user experience. Many companies use these systems to promote their products and suggestions in accordance with the user's demands and the relevance of information. The recommendations are based on the user's previous search results.

Advantages of Data Science:
 Provides a framework for extracting insights and knowledge from data through statistical analysis, machine learning, and data visualization techniques
 Offers a wide range of applications in various fields such as finance, healthcare, and marketing
 Helps organizations make informed decisions by extracting meaningful insights from data
 Offers potential for significant cost savings through efficient data management and analysis

Disadvantages of Data Science:
 Requires specialized skills and expertise in statistical analysis, machine learning, and data visualization
 Can be time-consuming and resource-intensive due to the need for data cleaning and preprocessing
 May face ethical concerns when dealing with sensitive data
 Can be challenging to integrate with existing systems and processes
Similarities between Big Data and Data Science:
 Both fields deal with large amounts of data and require specialized skills and expertise
 Both aim to extract insights and knowledge from data to inform decision-making
 Both have a wide range of applications in various industries
 Both can lead to significant cost savings and operational efficiencies when applied correctly


Below is a point-by-point comparison of the differences between Data Science and Big Data:

1. Data Science is an area of study. Big Data is a technique to collect, maintain, and process huge information.
2. Data Science is about the collection, processing, analyzing, and utilizing of data in various operations; it is more conceptual. Big Data is about extracting vital and valuable information from a huge amount of data.
3. Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics. Big Data is a technique for tracking and discovering trends in complex data sets.
4. The goal of Data Science is to build data-dominant products for a venture. The goal of Big Data is to make data more vital and usable, i.e. to extract only the important information from huge data within existing traditional aspects.
5. Tools mainly used in Data Science include SAS, R, Python, etc. Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
6. Data Science is a superset of Big Data, since it covers data scraping, cleaning, visualization, statistics, and many more techniques. Big Data is a subset of Data Science, as its mining activities form one stage of the data science pipeline.
7. Data Science is mainly used for scientific purposes. Big Data is mainly used for business purposes and customer satisfaction.
8. Data Science broadly focuses on the science of the data. Big Data is more involved with the processes of handling voluminous data.
Conclusion:
While big data and data science are related fields that share many similarities, they differ in their areas of focus, data size, tools and technologies used, skills required, and applications. Choosing between big data and data science ultimately depends on an individual's interests, skills, and career goals.
Facets of data:

1. Structured
2. Semi-structured
3. Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
1. Structured Data:
1. It concerns all data which can be stored in a SQL database, in tables with rows and columns.
2. They have relational keys and can be easily mapped into pre-designed fields.
3. Today, such data are the most processed in development and the simplest kind of information to manage.
4. But structured data represent only 5 to 10% of all informatics data.
2. Semi-Structured Data:
1. Semi-structured data is information that doesn't reside in a relational database but that does have some organizational properties that make it easier to analyze.
2. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structure exists to ease storage space, clarity, or computation.
3. Like structured data, semi-structured data represents only a small part of all data (5 to 10%).
Examples of semi-structured data: JSON, CSV, and XML documents are semi-structured documents.
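As a small illustration, a semi-structured JSON record can be parsed into an R structure with the jsonlite package (a sketch assuming jsonlite is installed; the record is made up):
library(jsonlite)

txt <- '{"name": "Asha", "skills": ["R", "SQL"]}'  ## a made-up JSON record
x <- fromJSON(txt)   ## parse the JSON into an R list
x$skills             ## [1] "R"  "SQL"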
3. Unstructured Data:
1. Unstructured data represents around 80% of all data.
2. It often includes text and multimedia content.
3. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages, and many other kinds of business documents.
4. Unstructured data is everywhere.
5. In fact, most individuals and organizations conduct their lives around unstructured data.
6. Just as with structured data, unstructured data is either machine-generated or human-generated.
 Here are some examples of machine-generated unstructured data:
Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.
Photographs and video: This includes security, surveillance, and traffic video.
Radar or sonar data: This includes vehicular, meteorological, and seismic or oceanographic data.
 The following list shows a few examples of human-generated unstructured data:
Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages and location information.
Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
i) Natural Language:
Natural language is a special type of unstructured data;
it’s challenging to process because it requires
knowledge of specific data science techniques and
linguistics.
The natural language processing community has had
success in entity recognition, topic recognition,
summarization, and sentiment analysis, but models
trained in one domain don’t generalize well to other
domains.
ii) Graph-based or Network Data:
In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
Graph or network data is, in short, data that focuses on
the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to represent social networks.
iii) Audio, Image & Video:
Audio, image, and video are data types that pose
specific challenges to a data scientist.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed cameras at stadiums capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.
iv) Streaming Data:
Streaming data is data that is generated continuously by
thousands of data sources, which typically send in the
data records simultaneously, and in small sizes (order
of Kilobytes).
Examples are log files generated by customers using your mobile or web applications, online game activity, “what's trending” on Twitter, live sporting or music events, and the stock market.
Facets of Data
• A very large amount of data is generated in big data and data science. These data are of various types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in row and column format. This helps applications to retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is
identifiable because it is organized in a structure. The
most common form of structured data or records is a
database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within
content. Structured data is understood by computers and
is also efficiently organized for human readers.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video, or images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is in unstructured form. This carries lots of information, but extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the
data.
2. Data can be of any type.
3. Unstructured data does not follow any structural
rules.
4. There is no predefined format, restriction, or sequence for unstructured data.
5. Since there is no structural binding for unstructured
data, it is unpredictable in nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to
recognize characters, words and sentences, then apply
meaning and understanding to that information. This
helps machines to understand language as humans do.
• Natural language processing is the driving force
behind machine intelligence in many modern real-world
applications. The natural language processing
community has had success in entity recognition, topic
recognition, summarization, text completion and
sentiment analysis.
• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprising several layers of text analysis.
Machine-Generated Data
• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end-user is not considered machine-generated.
• Machine data contains a definitive record of all
activity and behavior of our customers, users,
transactions, applications, servers, networks, factory
machinery and so on.
• It's configuration data, data from APIs and message
queues, change events, the output of diagnostic
commands and call detail records, sensor data from
remote equipment and more.
• Examples of machine data are web server logs, call
detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-
Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every
processor-based system, as well as many consumer-
oriented systems.
• It can be either structured or unstructured. In recent years, the amount of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
Graph-based or Network Data
• Graphs are data structures to describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object
type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a
graph (network) of nodes.
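As a minimal sketch in R, assuming the igraph package is installed (the people here are made up), such a network can be built and inspected like this:
library(igraph)

## Nodes are users; an edge means two users are connected.
g <- graph_from_literal(Alice - Bob, Bob - Carol, Alice - Carol, Carol - Dave)
degree(g)              ## number of connections per user
neighbors(g, "Carol")  ## who Carol is directly connected to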
• A graph database stores nodes and relationships
instead of tables or documents. Data is stored just like
we might sketch ideas on a whiteboard. Our data is
stored without restricting it to a predefined model,
allowing a very flexible way of thinking about and
using it.
• Graph databases are used to store graph-based data
and are queried with specialized query languages such
as SPARQL.
• Graph databases are capable of sophisticated fraud
prevention. With graph databases, we can use
relationships to process financial and purchase
transactions in near-real time. With fast graph queries,
we are able to detect that, for example, a potential
purchaser is using the same email address and credit
card as included in a known fraud case.
• Graph databases can also help users easily detect relationship patterns such as multiple people associated with a personal email address, or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for
recommendation applications. With graph databases,
we can store in a graph relationships between
information categories such as customer interests,
friends and purchase history. We can use a highly
available graph database to make product
recommendations to a user based on which products are
purchased by others who follow the same sport and
have similar purchase history.
• Graph theory is probably the main method in social
network analysis in the early history of the social
network concept. The approach is applied to social
network analysis in order to determine important
features of the network such as the nodes and links (for
example influencers and the followers).
• Influencers on social network have been identified as
users that have impact on the activities or opinion of
other users by way of followership or influence on
decision made by other users on the network as shown
in Fig. 1.2.1.
• Graph theory has proved to be very effective on large-
scale datasets such as social network data. This is
because it is capable of by-passing the building of an
actual visual representation of the data to run directly
on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose
specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of
the most important sources of information and
knowledge; the integration, transformation and indexing
of multimedia data bring significant challenges in data
management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature
of Data Science and heterogeneity.
• Data Science is playing an important role to address
these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text,
image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data
Science can be a key instrument covering big data,
machine learning and data mining solutions to store,
handle and analyze such heterogeneous data.
Streaming Data
Streaming data is data that is generated continuously by
thousands of data sources, which typically send in the
data records simultaneously and in small sizes (order of
Kilobytes).
• Streaming data includes a wide variety of data such as
log files generated by customers using your mobile or
web applications, ecommerce purchases, in-game
player activity, information from social networks,
financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in
data centers.
Difference between Structured and Unstructured Data
• Structured data is arranged in row and column format; unstructured data does not follow any specified format.
• Structured data is easy to store in and retrieve from a database management system; unstructured data is difficult to search and retrieve.
• Structured data represents only 5 to 10% of all data; unstructured data represents around 80%.
• An Excel table is an example of structured data; email messages, videos, photos and audio files are examples of unstructured data.
Data Science Process


Data science process consists of six stages :
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
• Fig. 1.3.1 shows the data science design process.

• Step 1: Discovery or Defining research goal
This step involves understanding the business problem and setting a clear research goal: what the project should achieve, and how the results will help answer the business question.
• Step 2: Retrieving data
This is the collection of the data required for the project. It is also the process of gaining a business understanding of the data you have and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it, as well as determining what each of the data points means in terms of the company. If we are given a data set from a client, for example, we shall need to know what each column and row represents.
• Step 3: Data preparation
Data can have many inconsistencies, like missing values, blank columns, or an incorrect data format, which need to be cleaned. We need to process, explore and condition data before modeling. Clean data gives better predictions.
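As a small base-R sketch of this step (the data frame here is made up for illustration), complete.cases() flags rows with missing values and na.omit() drops them:
> df <- data.frame(age = c(21, NA, 35), city = c("X", "Y", NA))
> complete.cases(df)   ## which rows have no missing values?
[1]  TRUE FALSE FALSE
> na.omit(df)          ## keep only the complete rows
  age city
1  21    X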
• Step 4: Data exploration
Data exploration is about developing a deeper understanding of the data. Try to understand how variables interact with each other, what the distribution of the data is, and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis (EDA).
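For example, base R's summary(), hist() and boxplot() give quick descriptive statistics and visual checks; a minimal sketch on the built-in mtcars data:
> summary(mtcars$mpg)   ## five-number summary plus the mean
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
> hist(mtcars$mpg)      ## distribution of mpg
> boxplot(mtcars$mpg)   ## quick visual check for outliers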
• Step 5: Data modeling
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.
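A minimal train/test sketch in base R, using the built-in mtcars data and a simple linear model as a stand-in for the techniques named above:
> set.seed(1)                                     ## reproducible split
> idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
> train <- mtcars[idx, ]                          ## training set
> test  <- mtcars[-idx, ]                         ## testing set
> fit <- lm(mpg ~ wt, data = train)               ## build the model
> mean((predict(fit, test) - test$mpg)^2)         ## test-set error (MSE)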
• Step 6: Presentation and automation
Deliver the final baselined model with reports, code and
technical documents in this stage. Model is deployed
into a real-time production environment after thorough
testing. In this stage, the key findings are communicated
to all stakeholders. This helps to decide if the project
results are a success or a failure based on the inputs
from the model.
R Programming for Data Science

R is an open-source programming language that is widely used as statistical software and as a data analysis tool. R is an important tool for Data Science. It is highly popular and is the first choice of many statisticians and data scientists.
Data Science in R Programming Language
Data Science has emerged as the most popular field of
the 21st century. This is because there is a pressing
need to analyze and construct insights from the data.
Industries transform raw data into finished data products. In order to do so, they require several important tools to churn the raw data. R is one of the programming languages that provide an extensive environment to research, process, transform, and visualize information.
Features of R – Data Science
Some of the important features of R for data science applications are:
 R provides extensive support for statistical modeling.
 R is a suitable tool for various data science applications because it provides aesthetic visualization tools.
 R is heavily utilized in data science applications for ETL (Extract, Transform, Load). It provides an interface for many databases like SQL and even spreadsheets (see the sketch after this list).
 R also provides various important packages for data wrangling.
 With R, data scientists can apply machine learning algorithms to gain insights about future events.
 One of the important features of R is to interface with NoSQL databases and analyze unstructured data.
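A minimal ETL-style sketch through R's DBI interface, assuming the DBI and RSQLite packages are installed and using an in-memory SQLite database:
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   ## connect to the database
dbWriteTable(con, "mtcars", mtcars)               ## load a table into the DB
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg
                 FROM mtcars GROUP BY cyl")       ## transform via SQL
dbDisconnect(con)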


Most common R Libraries in Data Science
 Dplyr: For performing data wrangling and data analysis, we use the dplyr package. We use this package for facilitating various functions on data frames in R. Dplyr is actually built around the following five functions (see the sketch below). You can work with local data frames as well as with remote database tables. You might need to:
Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data in order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.
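A short sketch of these five verbs chained in a pipeline, assuming the dplyr package is installed and using the built-in mtcars data frame:
library(dplyr)

mtcars %>%
  select(mpg, cyl, wt) %>%          ## select certain columns
  filter(cyl == 4) %>%              ## filter to specific rows
  arrange(desc(mpg)) %>%            ## arrange the rows in order
  mutate(wt_kg = wt * 453.6) %>%    ## mutate: add a column (wt is in 1000 lbs)
  summarize(avg_mpg = mean(mpg))    ## summarize a chunk of the data
Each verb takes a data frame and returns a data frame, which is what makes the pipeline composable.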
 Ggplot2: R is most famous for its visualization library ggplot2. It provides an aesthetic set of graphics that can also be made interactive. The ggplot2 library implements a “grammar of graphics” (Wilkinson, 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.
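As a minimal grammar-of-graphics sketch (using the built-in mtcars data), we map data attributes to aesthetics and then add a layer:
library(ggplot2)

## Map weight to x, mileage to y and cylinder count to colour,
## then draw the mapping as a layer of points.
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")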
 Esquisse: This package has brought the most
important feature of Tableau to R. Just drag and drop,
and get your visualization done in minutes. This is
actually an enhancement to ggplot2. It allows us to
draw bar graphs, curves, scatter plots, and histograms,
then export the graph or retrieve the code generating
the graph.
 Tidyr: Tidyr is a package that we use for tidying or
cleaning the data. We consider this data to be tidy
when each variable represents a column and each row
represents an observation.
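A small sketch of tidying with tidyr's pivot_longer(), assuming the tidyr package is installed; the scores data frame is made up for illustration:
library(tidyr)

## Wide form: one row per student, one column per test.
scores <- data.frame(student = c("A", "B"),
                     test1 = c(80, 90),
                     test2 = c(85, 95))

## Tidy form: one row per observation (student, test, score).
pivot_longer(scores, cols = c(test1, test2),
             names_to = "test", values_to = "score")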
 Shiny: This is a very well-known package in R. When
you want to share your stuff with people around you
and make it easier for them to know and explore it
visually, you can use Shiny. It’s a Data Scientist’s best
friend.
 Caret: Caret stands for Classification And REgression Training. Using this package, you can model complex regression and classification problems.
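A minimal caret sketch, assuming the caret package is installed; it fits a linear regression on the built-in mtcars data with 5-fold cross-validation:
library(caret)

fit <- train(mpg ~ wt + cyl, data = mtcars,
             method = "lm",                                      ## linear regression
             trControl = trainControl(method = "cv", number = 5)) ## 5-fold CV
fit$results   ## cross-validated RMSE and R-squared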
 E1071: The E1071 package has wide use for
implementing clustering, Fourier Transform, Naive
Bayes, SVM, and other types of miscellaneous
functions.
 Mlr: This package is absolutely incredible for performing machine learning tasks. It has almost all the important and useful algorithms for machine learning. It can also be termed an extensible framework for classification, regression, clustering, multi-classification, and survival analysis.
R – History & Overview

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. For efficiency, R allows integration with procedures written in the C, C++, .Net, Python or FORTRAN languages.
R is freely available under the GNU General Public
License, and pre-compiled binary versions are provided
for various operating systems like Linux, Windows and
Mac.
R is free software distributed under a GNU-style copy
left, and an official part of the GNU project called GNU
S.
Evolution of R
R was initially written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the
University of Auckland in Auckland, New Zealand. R
made its first appearance in 1993.
 A large group of individuals has contributed to R
by sending code and bug reports.
 Since mid-1997 there has been a core group (the "R
Core Team") who can modify the R source code
archive.
Features of R
As stated earlier, R is a programming language and
software environment for statistical analysis, graphics
representation and reporting. The following are the
important features of R −
 R is a well-developed, simple and effective
programming language which includes
conditionals, loops, user defined recursive
functions and input and output facilities.
 R has an effective data handling and storage
facility,
 R provides a suite of operators for calculations on
arrays, lists, vectors and matrices.
 R provides a large, coherent and integrated
collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly at the computer or for printing on paper.

Getting Started With R


R is an interpreted programming language. It also allows you to carry out modular programming with the help of functions. It is widely used for statistical analysis as well as for graphical representation of data.
R allows you to integrate with programming procedures
written in C, C++, Python, .Net, etc. Today, R is widely
used in the field of data science by data analysts,
researchers, statisticians, etc. It is used to retrieve data
from datasets, clean it, analyze and visualize it, and
present it in the most suitable way.

Run R Programs
You can run R programs in two different ways:
 Installing R in your local machine
 Using an online environment

Install R in Your Local Machine


Before installing R on your computer, you first need to
determine the operating system that you are using. R
has binaries for all the major operating systems
including Windows, MacOS, and Linux.
In this tutorial, you will learn how to install R in
a Windows machine. You can follow the steps below:
1. Visit https://cloud.r-project.org/ and download the right binary for your operating system. For example, the R 4.1.1 build for Windows is available at https://cloud.r-project.org/bin/windows/base/R-4.1.1-win.exe.
2. Once you have finished the download, open the executable file to start the installation. This will start an installation wizard for you.
3. Here, select the path where you want to install the R tool. We recommend you to use the default path provided.
4. Next, select all the components that you want to install and click the next button.
5. The next wizard will ask you to either select a default setup or customize the setup. We recommend you select the default setup and click on next.
6. Wait for a while until the installation concludes. Once done, click on finish.

How to Install RStudio?


If you want to work with R in your local machine,
installing R is not enough. R does not come with a GUI-
based platform. Most users install a separate IDE which
allows them to interact with R. It gives them additional
functionality such as help, preview, etc.
The most popular IDE for the R programming language
is RStudio. You can follow these steps to install
RStudio on your Windows machine.
1. Visit https://www.rstudio.com/products/rstudio/
download/#download to download the free version of
RStudio for any platform you want.
2. Once the download is completed, you need to open the
executable file to start the installation process.
3. An installation wizard will appear on the screen. Click
on the next button.
4. On the next prompt, it will ask you to select the start
menu folder for shortcut creation. Click on the install
button. Once the installation is completed, click on
Finish.

Running Your First R Program


Now that you have installed R and RStudio
successfully, let's try to create your first R program. We
will try to create a simple Hello World program.
A Hello World program is a simple program that simply
prints a "Hello World!" message on the screen. It's
generally used to introduce a new language to learners.
Consider the program below.
message <-"Hello World!"
print(message)
Output
[1] "Hello World!"
Here, we have created a simple variable called message.
We have initialized this variable with a simple message
string called "Hello World!". On execution, this
program prints the message stored inside the variable.
Every line of output in R is preceded by a number in square brackets. This number is the index of the first element displayed on that line of output.

R Nuts and Bolts

4.1 Entering Input


At the R prompt we type expressions. The <- symbol is
the assignment operator.
> x <- 1
> print(x)
[1] 1
>x
[1] 1
> msg <- "hello"
The grammar of the language determines whether an
expression is complete or not.
x <- ## Incomplete expression
The # character indicates a comment. Anything to the
right of the # (including the # itself) is ignored. This is
the only comment character in R. Unlike some other
languages, R does not support multi-line comments or
comment blocks.
4.2 Evaluation
When a complete expression is entered at the prompt, it
is evaluated and the result of the evaluated expression
is returned. The result may be auto-printed.
> x <- 5 ## nothing printed
>x ## auto-printing occurs
[1] 5
> print(x) ## explicit printing
[1] 5
The [1] shown in the output indicates that x is a vector
and 5 is its first element.
Typically with interactive work, we do not explicitly
print objects with the print function; it is much easier to
just auto-print them by typing the name of the object
and hitting return/enter. However, when writing
scripts, functions, or longer programs, there is
sometimes a need to explicitly print objects because
auto-printing does not work in those settings.
When an R vector is printed you will notice that an
index for the vector is printed in square brackets [] on
the side. For example, see this integer sequence of
length 20.
> x <- 11:30
>x
[1] 11 12 13 14 15 16 17 18 19 20 21 22
[13] 23 24 25 26 27 28 29 30
The numbers in the square brackets are not part of the
vector itself, they are merely part of the printed output.
With R, it’s important that one understand that there is
a difference between the actual R object and the
manner in which that R object is printed to the console.
Often, the printed output may have additional bells and
whistles to make the output more friendly to the users.
However, these bells and whistles are not inherently
part of the object.
Note that the : operator is used to create integer
sequences.
4.3 R Objects
R has five basic or “atomic” classes of objects:
 character
 numeric (real numbers)
 integer
 complex
 logical (True/False)
The most basic type of R object is a vector. Empty vectors can be created with the vector() function. There is really only one rule about vectors in R, which is that a vector can only contain objects of the same class.
But of course, like any good rule, there is an exception,
which is a list, which we will get to a bit later. A list is
represented as a vector but can contain objects of
different classes. Indeed, that’s usually why we use
them.
There is also a class for “raw” objects, but they are not
commonly used directly in data analysis and I won’t
cover them here.
4.4 Numbers
Numbers in R are generally treated as numeric objects
(i.e. double precision real numbers). This means that
even if you see a number like “1” or “2” in R, which
you might think of as integers, they are likely
represented behind the scenes as numeric objects (so
something like “1.00” or “2.00”). This isn’t important
most of the time…except when it is.
If you explicitly want an integer, you need to specify
the L suffix. So entering 1 in R gives you a numeric
object; entering 1L explicitly gives you an integer
object.
There is also a special number Inf which represents
infinity. This allows us to represent entities like 1 / 0.
This way, Inf can be used in ordinary calculations;
e.g. 1 / Inf is 0.
The value NaN represents an undefined value (“not a number”), e.g. 0 / 0; NaN can also be thought of as a missing value (more on that later).
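These behaviours can be checked directly at the prompt:
> class(1)      ## numbers are numeric by default
[1] "numeric"
> class(1L)     ## the L suffix gives an integer
[1] "integer"
> 1 / Inf
[1] 0
> 0 / 0
[1] NaN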
4.5 Attributes
R objects can have attributes, which are like metadata
for the object. These metadata can be very useful in
that they help to describe the object. For example,
column names on a data frame help to tell us what data
are contained in each of the columns. Some examples
of R object attributes are
 names, dimnames
 dimensions (e.g. matrices, arrays)
 class (e.g. integer, numeric)
 length
 other user-defined attributes/metadata
Attributes of an object (if any) can be accessed using
the attributes() function. Not all R objects contain
attributes, in which case the attributes() function
returns NULL.
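For example, a bare integer vector has no attributes until we give it names:
> x <- 1:3
> attributes(x)    ## no attributes yet
NULL
> names(x) <- c("a", "b", "c")
> attributes(x)    ## the names are stored as an attribute
$names
[1] "a" "b" "c"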
4.6 Creating Vectors
The c() function can be used to create vectors of
objects by concatenating things together.
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
Note that in the above example, T and F are short-hand
ways to specify TRUE and FALSE. However, in
general one should try to use the
explicit TRUE and FALSE values when indicating
logical values. The T and F values are primarily there
for when you’re feeling lazy.
You can also use the vector() function to initialize
vectors.
> x <- vector("numeric", length = 10)
>x
[1] 0 0 0 0 0 0 0 0 0 0
4.7 Mixing Objects
There are occasions when different classes of R objects get mixed together. Sometimes this happens by accident, but it can also happen on purpose. So what happens with the following code?
> y <- c(1.7, "a") ## character
> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character
In each case above, we are mixing objects of two
different classes in a vector. But remember that the
only rule about vectors says this is not allowed. When
different objects are mixed in a vector, coercion occurs
so that every element in the vector is of the same class.
In the example above, we see the effect of implicit
coercion. What R tries to do is find a way to represent
all of the objects in the vector in a reasonable fashion.
Sometimes this does exactly what you want and…
sometimes not. For example, combining a numeric
object with a character object will create a character
vector, because numbers can usually be easily
represented as strings.
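We can confirm the resulting classes with class():
> class(c(1.7, "a"))    ## number coerced to character
[1] "character"
> class(c(TRUE, 2))     ## logical coerced to numeric (TRUE becomes 1)
[1] "numeric"
> class(c("a", TRUE))   ## logical coerced to character
[1] "character"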
4.8 Explicit Coercion
Objects can be explicitly coerced from one class to
another using the as.* functions, if available.
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
Sometimes, R can’t figure out how to coerce an object
and this can result in NAs being produced.
> x <- c("a", "b", "c")
> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA
When nonsensical coercion takes place, you will
usually get a warning from R.
4.9 Matrices
Matrices are vectors with a dimension attribute. The
dimension attribute is itself an integer vector of length
2 (number of rows, number of columns)
> m <- matrix(nrow = 2, ncol = 3)
>m
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
Matrices are constructed column-wise, so entries can be
thought of starting in the “upper left” corner and
running down the columns.
> m <- matrix(1:6, nrow = 2, ncol = 3)
>m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Matrices can also be created directly from vectors by
adding a dimension attribute.
> m <- 1:10
>m
[1] 1 2 3 4 5 6 7 8 9 10
> dim(m) <- c(2, 5)
>m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
Matrices can be created by column-binding or row-
binding with the cbind() and rbind() functions.
> x <- 1:3
> y <- 10:12
> cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
4.10 Lists
Lists are a special type of vector that can contain
elements of different classes. Lists are a very important
data type in R and you should get to know them well.
Lists, in combination with the various “apply”
functions discussed later, make for a powerful
combination.
Lists can be explicitly created using the list() function,
which takes an arbitrary number of arguments.
> x <- list(1, "a", TRUE, 1 + 4i)
>x
[[1]]
[1] 1

[[2]]
[1] "a"
[[3]]
[1] TRUE

[[4]]
[1] 1+4i
We can also create an empty list of a prespecified
length with the vector() function
> x <- vector("list", length = 5)
>x
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL
4.11 Factors
Factors are used to represent categorical data and can
be unordered or ordered. One can think of a factor as
an integer vector where each integer has a label.
Factors are important in statistical modeling and are
treated specially by modelling functions
like lm() and glm().
Using factors with labels is better than using integers
because factors are self-describing. Having a variable
that has values “Male” and “Female” is better than a
variable that has values 1 and 2.
Factor objects can be created with the factor() function.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
>x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
no yes
2 3
> ## See the underlying representation of factor
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
Often factors will be automatically created for you
when you read a dataset in using a function
like read.table(). Those functions often default to
creating factors when they encounter data that look like
characters or strings.
The order of the levels of a factor can be set using
the levels argument to factor(). This can be important
in linear modelling because the first level is used as the
baseline level.
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x ## Levels are put in alphabetical order
[1] yes yes no yes no
Levels: no yes
> x <- factor(c("yes", "yes", "no", "yes", "no"),
+ levels = c("yes", "no"))
>x
[1] yes yes no yes no
Levels: yes no
4.12 Missing Values
Missing values are denoted by NA, or by NaN for undefined mathematical operations.
 is.na() is used to test objects if they are NA
 is.nan() is used to test for NaN
 NA values have a class also, so there are integer NA, character NA, etc.
 A NaN value is also NA, but the converse is not true
> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements
are NA
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> ## Return a logical vector indicating which elements
are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> ## Now create a vector with both NA and NaN
values
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE TRUE TRUE FALSE
> is.nan(x)
[1] FALSE FALSE TRUE FALSE FALSE
4.13 Data Frames
Data frames are used to store tabular data in R. They
are an important type of object in R and are used in a
variety of statistical modeling applications. Hadley
Wickham’s package dplyr has an optimized set of
functions designed to work efficiently with data
frames.
Data frames are represented as a special type of list
where every element of the list has to have the same
length. Each element of the list can be thought of as a
column and the length of each element of the list is the
number of rows.
Unlike matrices, data frames can store different classes
of objects in each column. Matrices must have every
element be the same class (e.g. all integers or all
numeric).
In addition to column names, indicating the names of
the variables or predictors, data frames have a special
attribute called row.names which indicate information
about each row of the data frame.
Data frames are usually created by reading in a dataset
using the read.table() or read.csv(). However, data
frames can also be created explicitly with
the data.frame() function or they can be coerced from
other types of objects like lists.
Data frames can be converted to a matrix by
calling data.matrix(). While it might seem that
the as.matrix() function should be used to coerce a data
frame to a matrix, almost always, what you want is the
result of data.matrix().
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
>x
foo bar
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
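Continuing with the data frame above, data.matrix() turns it into a numeric matrix; the logical column becomes 0/1:
> data.matrix(x)
     foo bar
[1,]   1   1
[2,]   2   1
[3,]   3   0
[4,]   4   0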
4.14 Names
R objects can have names, which is very useful for
writing readable code and self-describing objects. Here
is an example of assigning names to an integer vector.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los
Angeles")
>x
New York Seattle Los Angeles
1 2 3
> names(x)
[1] "New York" "Seattle" "Los Angeles"
Lists can also have names, which is often very useful.
> x <- list("Los Angeles" = 1, Boston = 2, London = 3)
>x
$`Los Angeles`
[1] 1

$Boston
[1] 2

$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston" "London"
Matrices can have both column and row names.
> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
>m
  c d
a 1 3
b 2 4
Column names and row names can be set separately
using the colnames() and rownames() functions.
> colnames(m) <- c("h", "f")
> rownames(m) <- c("x", "z")
>m
  h f
x 1 3
z 2 4
Note that for data frames, there is a separate function for setting the row names, the row.names() function. Also, data frames do not have column names, they just have names (like lists). So to set the column names of a data frame just use the names() function. Yes, I know it's confusing. Here's a quick summary:
Object       Set column names   Set row names
data frame   names()            row.names()
matrix       colnames()         rownames()

4.15 Summary
There are a variety of different built-in data types in R. In this chapter we have reviewed the following:
 atomic classes: numeric, logical, character, integer,
complex
 vectors, lists
 factors
 missing values
 data frames and matrices
All R objects can have attributes that help to describe
what is in the object. Perhaps the most useful attribute
is names, such as column and row names in a data
frame, or simply names in a vector or list. Attributes
like dimensions are also important as they can modify
the behavior of objects, like turning a vector into a
matrix.
