Introduction To R Installation: Data Types Value Examples
Installation
Installation of the basic R package is fairly simple. You need to complete the following steps:
1. Visit the website of one of the mirrors of the R project (e.g. http://cran.gis-lab.info/)
2. Follow the link corresponding to your operating system (Download R for Linux/Mac/Windows)
3. Download the installer (using the instructions on the website, the download starts when you click
on “install R for the first time”)
4. Launch the installer (for example, in Windows that is R-3.2.0-win.exe)
Along with the basic R package we suggest installing RStudio, software that complements the core R with
a convenient integrated development environment (IDE) that often makes life much easier. The website
of this project, where you can find the installation instructions, is http://www.rstudio.com/.
In all the examples below we will work in the R console. It appears automatically when you launch RStudio in
the upper left part of the screen and invites the user to start typing R commands with the following
message:
Data types
Like any other programming language, R supports different data types to work with different kinds of
values: integer, numeric, logical etc. The basic data types in R are described in the table below.
The R language manages the types dynamically: this means that, when seeing an expression like one of
those in the right column of the table, the R interpreter automatically determines its type. For example,
if you want to assign a logical value to some variable x, you don’t need to define the type of x explicitly; you
only have to assign the value to it with no declarations (the assignment operator in R is an arrow
“<-”):
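As a minimal sketch of such an assignment (the values are illustrative, and class() is used here only to show which type R has inferred):

```r
# Assign a logical value; no type declaration is needed.
x <- TRUE
class(x)    # "logical"

# Assigning a different kind of value changes the type R infers for x.
x <- 3.14
class(x)    # "numeric"
```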
Having an already existing variable x, you can perform a range of operations (functions) to determine or
change its type:
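The sketch below shows a few such base R functions: the is.*() family tests a type and the as.*() family converts to it. The variable names and values are illustrative only.

```r
x <- 42L                  # the L suffix creates an integer
class(x)                  # "integer"
is.numeric(x)             # TRUE: integers count as numeric
is.character(x)           # FALSE

y <- as.character(x)      # convert the number to a string
class(y)                  # "character"

z <- as.numeric("3.5")    # convert a string to a number
z + 1                     # 4.5

as.logical(0)             # FALSE: zero converts to FALSE
```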
It is often the case that the values of certain parameters are missing for some observations. R
provides a mechanism for handling that case. Missing (unknown) observation values are represented by the special
value NA in R (“Not available”). The corresponding operations are:
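A short sketch of the two most common operations, is.na() for detecting missing values and na.omit() for dropping them. The vector of heights is a made-up example.

```r
heights <- c(170, 171, NA, 176)   # hypothetical measurements, one is missing

is.na(heights)        # FALSE FALSE  TRUE FALSE
sum(is.na(heights))   # 1: the number of missing values

clean <- na.omit(heights)   # drop the missing values
length(clean)               # 3
```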
The na.omit() function is very useful when you want to calculate some statistics based on the data that
possibly has some missing values (NA). If that is the case, then the mean() function (that computes the
mean) also returns NA, which is not the desired behavior in most cases. Here na.omit() solves the
problem:
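For illustration, assuming a vector of heights with one missing value (the same heights appear in the data frame used later in this text):

```r
heights <- c(170, 171, NA, 176, 173, 180)

mean(heights)                 # NA: a single missing value propagates
mean(na.omit(heights))        # 174
mean(heights, na.rm = TRUE)   # 174: mean() can also skip NA values itself
```

The na.rm argument of mean() is an alternative to na.omit() when only one statistic is needed.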
Compare with MATLAB: it has a similar mechanism for solving this problem, namely functions such as
nanmean().
The vector data structure is in fact a list of values of the same type: in the examples above, the first two
vectors hold the values [1, 2, 3], and in the last example the vector of three zeros gets created. A list
differs from a vector in that it can hold values of different types at the same time, including other lists.
A factor is a kind of vector used to encode categorical (nominal) data (in the example above it is used to
encode the observations of a variable that takes two values: “Male”/“Female”). A matrix in R can be
created by specifying the vector of the values in its cells and its dimensions (nrow/ncol parameters).
Finally, data.frame objects hold data tables (observations of several attributes). Data frames are thus the
basic data structure for many real-world data analysis tasks. An example of data.frame creation can be
seen below. This data frame contains data about six people (with age/height/sex attributes). After the code
you can see the resulting data table:
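The following sketch first creates one object of each structure described above and then builds the data frame. The age and height values are chosen to agree with the outputs shown later in this text; the order of most rows, and in particular of the sex values, is an assumption.

```r
# Basic structures (all values are illustrative)
v <- c(1, 2, 3)                            # vector: values of one type
l <- list(1, "a", TRUE, list(2, 3))        # list: mixed types, may nest lists
f <- factor(c("Male", "Female", "Male"))   # factor: categorical data
m <- matrix(1:6, nrow = 2, ncol = 3)       # 2 x 3 matrix, filled by column

# Data frame with six people (row order partly assumed, see above)
data <- data.frame(
  age    = 18:23,
  height = c(170, 171, NA, 176, 173, 180),
  sex    = factor(c("m", "f", "m", "m", "f", "m"))
)
data
#   age height sex
# 1  18    170   m
# 2  19    171   f
# 3  20     NA   m
# 4  21    176   m
# 5  22    173   f
# 6  23    180   m
```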
Functions in R
In programming, a function is a named section of a program that performs a specific task. Functions are
often also called methods or procedures. A function usually takes some parameter(s) as its input and
produces some value as its output. This return value can be a number, a logical value, or some complex
object like a plot.
In R, a user can both use a set of predefined functions incorporated into the language (like the
is.numeric() function that we’ve already seen before) and define their own functions that perform
the specific computations they need. Here is the syntax of function definition in R:
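A minimal sketch of this syntax. The function name myfunction matches the name used in the text; the argument names and the body are made up for illustration.

```r
# General form: name <- function(arguments) { body }
myfunction <- function(arg1, arg2 = 10) {
  # arg2 has a default value and may be omitted by the caller
  result <- arg1 + arg2
  return(result)
}

myfunction(1, 2)         # 3
myfunction(5)            # 15, arg2 falls back to its default

# A function can be passed to another function as an argument
sapply(1:3, myfunction)  # 11 12 13
```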
Note that functions in R are ordinary objects: they are assigned to variables (myfunction in the
example above) and can be passed to other functions as arguments. This first-class treatment of
functions distinguishes R from the traditional procedural style of languages like C or Java.
Let us define a simple function that computes the square of a number, and test it:
> square <- function(x) { return (x * x) }
> square(3)
[1] 9
It is also possible for a function to call itself in its body (main part). This is called recursion. A classic
example of a function that can be defined recursively is the factorial, defined for a non-negative integer
n as the product of all positive integers less than or equal to n (so the factorial of 0 is 1). Note also the
usage of the if ... else expression in its definition. It allows us to specify what the function should do
when the condition is true and when it is not:
> factorial <- function(x) {
    if (x <= 1) {    # base case, covers both 0 and 1
      return (1)
    } else {
      return (x * factorial (x - 1))
    }
  }
> factorial(1)
[1] 1
> factorial(5)
[1] 120
Finally, let us re-implement the same factorial function without recursion. We can do that with a
for-loop, which is yet another basic construct in most programming languages. A for-loop makes it possible
to iterate through a collection of values and perform some steps at each iteration. The implementation
of the factorial function follows its definition: we iterate through all positive integers less than or equal to
x (the range of these values is denoted as 1:x) and multiply each of them into the running product accumulated in the
variable called result:
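A sketch of such an implementation follows. The name factorial_loop is chosen here to keep it apart from the recursive version. It iterates over seq_len(x), which is the same as 1:x for x >= 1 but is empty for x = 0, so the function also returns 1 for zero (plain 1:0 would iterate over 1 and 0).

```r
factorial_loop <- function(x) {
  result <- 1
  for (i in seq_len(x)) {     # 1, 2, ..., x; no iterations when x is 0
    result <- result * i      # multiply the current value into the product
  }
  return(result)
}

factorial_loop(1)   # 1
factorial_loop(5)   # 120
```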
> summary(data)
      age            height     sex
 Min.   :18.00   Min.   :170    f:2
 1st Qu.:19.25   1st Qu.:171    m:4
 Median :20.50   Median :173
 Mean   :20.50   Mean   :174
 3rd Qu.:21.75   3rd Qu.:176
 Max.   :23.00   Max.   :180
                 NA's   :1
Data slicing in a data frame can be done both by rows and by columns. In both cases the syntax has the
form data[<row filter>, <column filter>]. For example, to retrieve the specified rows from
the data table, run:
> data[c(1,3),] # only the first and the third rows (indices vector)
  age height sex
1  18    170   m
3  20     NA   m
To get certain attributes only, you can use the data[<row filter>, <column filter>] syntax
again:
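As a sketch, with the row filter left empty so that all rows are kept. The data frame is re-created here so that the snippet runs on its own; its row order is partly an assumption.

```r
# Sample data (order of the sex values is assumed)
data <- data.frame(age = 18:23,
                   height = c(170, 171, NA, 176, 173, 180),
                   sex = factor(c("m", "f", "m", "m", "f", "m")))

data[, c("age", "height")]       # all rows, two columns selected by name
data[, 1]                        # first column, returned as a plain vector
data[1:2, "sex", drop = FALSE]   # drop = FALSE keeps the data frame shape
```

A single column can also be extracted by name with the $ operator: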
> data$height
[1] 170 171 NA 176 173 180
This “$” syntax also works for extracting named elements from an R list, as well as from some other R objects.
More complicated queries to a data frame can combine filters by rows and by columns. In the example
below, we retrieve the age of all men older than 20:
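One way to write such a query is sketched here: the row filter is a logical vector combining two conditions with &, and the column filter is the column name. The sample rows are partly assumed, so the result shown holds only for this version of the data.

```r
# Sample data (order of the sex values is assumed)
data <- data.frame(age = 18:23,
                   height = c(170, 171, NA, 176, 173, 180),
                   sex = factor(c("m", "f", "m", "m", "f", "m")))

# Rows: men older than 20; column: age
men_over_20 <- data[data$sex == "m" & data$age > 20, "age"]
men_over_20   # 21 23 for the sample rows above
```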
Other useful functions include all() and any(). These functions answer the question whether the
specified condition is true for all records or for at least one record in the data, respectively:
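A brief sketch on the sample data (re-created here with a partly assumed row order). Note how a missing value affects the result unless na.rm = TRUE is passed.

```r
data <- data.frame(age = 18:23,
                   height = c(170, 171, NA, 176, 173, 180),
                   sex = factor(c("m", "f", "m", "m", "f", "m")))

all(data$age >= 18)                    # TRUE: everyone is at least 18
any(data$age > 30)                     # FALSE: nobody is older than 30
any(data$height > 175, na.rm = TRUE)   # TRUE
all(data$height > 160)                 # NA: one height is unknown
all(data$height > 160, na.rm = TRUE)   # TRUE once the NA is ignored
```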
Finally, you can add a new attribute to a data frame by assigning a vector that holds the value of this
attribute for every record present in the data frame:
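For example, assuming a hypothetical weight attribute (the values are invented for illustration) and a column derived from an existing one:

```r
data <- data.frame(age = 18:23,
                   height = c(170, 171, NA, 176, 173, 180),
                   sex = factor(c("m", "f", "m", "m", "f", "m")))

# One value per row; the weights are made-up numbers
data$weight <- c(60, 55, 72, 80, 58, 77)

# A new column can also be computed from existing ones
data$height_m <- data$height / 100

names(data)   # "age" "height" "sex" "weight" "height_m"
```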
Data visualization
In the simplest case, you can try to visualize the dependencies between attribute pairs using plots. For
this purpose, R has the plot() function that builds a dot diagram (scatter plot) on the plane. There is
also the lines() function for connecting the dots on the plot with lines. Let’s construct a plot to estimate
the dependency between height and age for the people in the dataset we have been using in the examples
above:
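A minimal sketch of such a plot. The axis labels and the title are illustrative choices, and the sample data is re-created with a partly assumed row order.

```r
data <- data.frame(age = 18:23,
                   height = c(170, 171, NA, 176, 173, 180),
                   sex = factor(c("m", "f", "m", "m", "f", "m")))

# Scatter plot of height against age; the row with a missing height is skipped
plot(data$age, data$height,
     xlab = "Age", ylab = "Height", main = "Height vs. age")

# Connect the points; the line is broken at the missing value
lines(data$age, data$height)
```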
The plot() function can also build the scatterplot matrix for all attribute pairs if you pass the whole
data.frame object as its argument. Such a matrix can be useful as one of the first steps in data analysis:
it gives a clear idea of which attribute pairs expose a certain dependency and which seem to be
uncorrelated:
> plot(data)
Finally, to build bar charts (histograms for categorical attributes) in R, one can use the barplot() function.
This function should be given not the source data itself but the frequency of each value
of the attribute being investigated. These frequencies can be computed with the table() function.
In the example below we also pass the names.arg argument to barplot() to set the names of the bars on the
resulting chart:
> barplot(table(data$sex), names.arg=c("Female", "Male"))
Importing data
The simplest way to load an existing data set into R is to read it from a file. The R language has a
range of functions for reading data in different formats: read.csv() for CSV tables,
read.xlsx() (from package xlsx) for Excel tables, fromJSON() (package RJSONIO) for reading
JSON data. The table readers produce a data.frame object, while fromJSON() typically returns a list or
vector that can be converted afterwards. In the example below, we first download a
data set called “Iris flower” from the web to a temporary CSV file and then read this file in R:
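The original address of the data set is not given here, so the sketch below wraps the two steps in a small helper and uses a placeholder URL in the usage comment. The helper name read_remote_csv is made up; download.file(), tempfile() and read.csv() are base R functions.

```r
# Download a CSV file to a temporary location and read it as a data frame.
# Extra arguments (e.g. header = FALSE) are passed on to read.csv().
read_remote_csv <- function(url, ...) {
  tmp <- tempfile(fileext = ".csv")    # temporary CSV file
  on.exit(unlink(tmp))                 # remove it when the function exits
  download.file(url, destfile = tmp, quiet = TRUE)
  read.csv(tmp, ...)
}

# Usage (placeholder URL, replace it with the real location of the data):
# iris_data <- read_remote_csv("https://example.org/iris.csv")
# head(iris_data)
```

Whether header = FALSE is needed depends on whether the downloaded file has a header row.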
Reading matrices in R can be done with the combination of read.csv and as.matrix() functions. Let’s
assume you have the following contents in the matrix.txt file residing in your current working directory
(you can figure out what your current working directory is by calling getwd() in R and change it using the
setwd(<path>) function):
0,.11,.22,.4
.11,0,.5,.3
.22,.5,0,.7
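A sketch of reading this file into a matrix. To keep the snippet self-contained it first writes the three lines above to matrix.txt in a temporary directory; with a real file in the working directory only the read.csv() and as.matrix() steps are needed.

```r
# Create the example file (only needed to make this snippet runnable as is)
path <- file.path(tempdir(), "matrix.txt")
writeLines(c("0,.11,.22,.4", ".11,0,.5,.3", ".22,.5,0,.7"), path)

# header = FALSE: the first line holds data, not column names
m <- as.matrix(read.csv(path, header = FALSE))
dimnames(m) <- NULL   # drop the automatic column names V1..V4

dim(m)                # 3 4
m[2, 3]               # 0.5
```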
Saving the matrix back to a file is just as simple with the write.table() function. If you don’t set the
row.names and col.names parameters to FALSE, row/column names will be written to the output file
along with the raw data:
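A sketch of writing a matrix without row or column names. The output file name matrix_out.txt and the temporary directory are arbitrary choices made for this example.

```r
# The matrix from the example file, entered row by row
m <- matrix(c(0,   .11, .22, .4,
              .11, 0,   .5,  .3,
              .22, .5,  0,   .7), nrow = 3, byrow = TRUE)

out <- file.path(tempdir(), "matrix_out.txt")

# sep = "," gives comma-separated values; names are suppressed
write.table(m, file = out, sep = ",", row.names = FALSE, col.names = FALSE)

readLines(out)   # three comma-separated lines, no header
```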