R-Tutorial - Introduction
[CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code
and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.]
Choose "base".
Choose "Download".
Starting R
R Console
Choose "Properties"
In "Start In" box type "C:\R"
Working with Scalars
print(47.5)
print(35 + 56)
print(rabi)   # error: there is no object named rabi; text must be quoted, as in the next line
print("Rabi is working")
print(FALSE)
print(3>2)
print(47.1==47.2)
print(47.1=47.2)   # error: '=' is not a comparison operator; use '==' as in the previous line
bb<-56.6
cc<-"My Nepal"
dd<-TRUE
ee<-3 + 2i
> print(bb)
> print(cc)
> print(dd)
> print(ee)
Note: The value stored in a variable can be displayed simply by typing its name, such as
> aa
> bb
> cc
> dd
> ee
> class(aa)
To display a combination of text and variable values, use the 'cat()' function.
...............not now...............
> ls(pat = "^m") # displays all variable names which start with m.
.........Some left..............
R Objects
R supports object-oriented programming (OOP), and it provides many built-in types of objects. Some common
R objects used to handle data are:
Vectors
List
Matrices
Factors
Data Frames
Arrays
Vectors
A vector is an ordered collection of values, all of the same data type. It is the simplest type of R object.
All the variables created so far are vectors containing a single element (member).
Creating Vectors
Using 'c( )' function
> ab <- c(23,35,56)
> ab
> ac
> ad<-c(TRUE,TRUE,2<3,0==0)
> ad
> ae<-c(3+4i,7+2i)
> ae
> assign("a", 7)
> assign("b", c(1,2,3,4))
> baa<-5.6:12.6
> bc<-seq(3,54,2)
> bd<-seq(1,10,0.5)
> be <-seq(50,0,-5)
The 'sequence()' function creates a series of integer sequences, each starting at 1 and ending at one of the numbers given as arguments. E.g.
> sequence(4 : 7)
> sequence(6 : 3)
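The ways of creating vectors shown above can be compared side by side in a short sketch. The variable names v1 to v5 are illustrative only and do not appear elsewhere in this tutorial.

```r
# Different ways of building a vector (names v1..v5 are hypothetical)
v1 <- c(23, 35, 56)          # combine individual values
v2 <- 3:8                    # colon operator: 3 4 5 6 7 8
v3 <- seq(1, 10, by = 0.5)   # from 1 to 10 in steps of 0.5
v4 <- seq(50, 0, by = -5)    # descending sequence from 50 down to 0
v5 <- sequence(c(2, 3))      # 1:2 followed by 1:3, i.e. 1 2 1 2 3

print(v5)
print(length(v3))            # 19 elements
```

Note that 'sequence()' takes a vector of end points, so 'sequence(4:7)' is the same as joining 1:4, 1:5, 1:6 and 1:7.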
> rb[2]
> rb[c(2, 5)]
> rb[2:5]
> rc = rb[3]
> rc
> rd = rb[c(3,4,7)]
> rd
ma = c(34,65,76,21,23)
maa = ma + 2
mab = ma - 2
mac = ma * 2
mad = ma/2
mada = 1/ma
mae = ma ^2
maf = ma^(1/2)
mag = ma^(1/3)
mb = c(12,54,23, 45,32)
mc = ma + mb
md = ma - mb
me = ma * mb
mf = ma/mb
mg = c(1,2,3,2,1)
mh = mb ^ mg
Note: If two vectors differ in length, R recycles the values of the shorter vector while carrying
out the operation. If the longer length is a whole-number multiple of the shorter length, this
happens silently; otherwise R still computes a result but issues a warning. E.g.
> m1 = c(2, 6, 7)
> m2 = c(3, 6, 8, 7, 4, 5)
> m3 = m1 + m2
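The recycling behaviour in the note above can be checked with a small sketch. The vector m4 and its values are hypothetical and are added only to show the case where the lengths do not divide evenly.

```r
m1 <- c(2, 6, 7)
m2 <- c(3, 6, 8, 7, 4, 5)
m3 <- m1 + m2      # m1 is reused as 2 6 7 2 6 7
print(m3)          # 5 12 15 9 10 12

# Lengths 3 and 4 (hypothetical values): R still computes a result but warns;
# the warning is suppressed here only to keep the output clean.
m4 <- suppressWarnings(c(1, 2, 3) + c(10, 20, 30, 40))
print(m4)          # 11 22 33 41
```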
mb = c(12,54,23, 45,32)
mi = mean(mb)
mi
mj = var(mb)
mj
mk = sum(mb)
mk
ml = prod(mb)
ml
mm=sqrt(mb)
mm
mn = length(mb)
mn
mo = min(mb)
mp = max(mb)
mq = sort(mb)
Working on logical and relational operators with vectors
> a = c(1:5)
> b = a > 3
> b
> a == 3
> a != 3
> !FALSE
> !TRUE
Introduction
A vector contains elements of the same type. A list is similar to a vector, but it may contain elements of
different types.
Creating Lists
Exm.
> a1
> aa
> bb
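As a minimal sketch of the idea, a list mixing several types could be created and accessed as follows. The name student and all of its contents are hypothetical and are not taken from the tutorial.

```r
# A list holding elements of different types (hypothetical example data)
student <- list(name = "Hari", age = 21, passed = TRUE,
                marks = c(56, 67, 78))

print(student$name)         # access an element by name
print(student[["marks"]])   # the same idea with double brackets
print(student[[2]])         # access an element by position
print(length(student))      # number of elements in the list
```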
Factors
Introduction
A factor is an R data type that stores categorical variables. Such data types are used extensively in
statistical modeling.
A variable is said to be categorical if its values are not all different, but each value
is one of a limited number of categories.
For example, a variable recording gender may take only two values: male and female.
A variable recording blood group may take any one of four values: A, B, AB and O.
Or,
Or,
gen_fact = factor(gen)
For example, the blood groups of 11 patients admitted to a hospital on one day are recorded and
converted into a factor below:
> bg = factor(c("A","B","A","AB","A","O","O","A","AB","B","B"))
> bg
R stores a factor as a vector of integer codes, one code for each level. By default, R assigns the
codes to the levels in alphabetical order.
To display these integer codes together with the levels of a factor, the structure function
'str()' is used. E.g.
> str(bg)
R uses these integers internally, together with the level labels, to store the textual values of a factor.
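The relationship between levels and integer codes can be inspected directly, as in the following sketch built on the blood group data above.

```r
bg <- factor(c("A","B","A","AB","A","O","O","A","AB","B","B"))

print(levels(bg))       # the distinct categories, in sorted order
print(as.integer(bg))   # the integer code stored for each element
print(table(bg))        # number of patients in each blood group
```

Here "A" has code 1 because it comes first in the sorted list of levels.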
By default, the integer codes are assigned according to the alphabetical order of the levels.
However, we can specify our own order of levels, and hence our own integer codes, by using the
'levels' parameter inside the 'factor()' function.
Exm.
> str(bg)
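A minimal sketch of the 'levels' parameter follows. The chosen order O, A, B, AB is an arbitrary assumption for illustration, and the names bg_values and bg2 are hypothetical.

```r
bg_values <- c("A","B","A","AB","A","O","O","A","AB","B","B")

# Assumed custom order: O = 1, A = 2, B = 3, AB = 4
bg2 <- factor(bg_values, levels = c("O", "A", "B", "AB"))

str(bg2)
print(as.integer(bg2)[1:4])   # codes of the first four elements: 2 3 2 4
```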
The number of levels in a factor can be obtained with the 'nlevels()' function. E.g.
print(nlevels(gen_fact))
factor(x, levels = sort(unique(x), na.last = TRUE), labels = levels,
       exclude = NA, ordered = is.ordered(x))
Here, levels specifies the possible levels of the factor (by default the unique values of the vector x),
labels defines the names of the levels, exclude defines the values of x to exclude from the levels, and
ordered is a logical argument specifying whether the levels of the factor are ordered. Recall that x is of
mode numeric or character.
In some categorical variables the levels have a specific ordering. For example, the
economic status of citizens can be categorized as low, medium or high, and these levels have a
natural order. Categorical variables of this kind are said to be 'ordinal'.
a) Nominal Variable
b) Ordinal Variable
c) Scale Variable
d) Ratio Variable
Ordinal Variables
To create an ordinal variable, the 'ordered' argument is set to 'TRUE' while creating the factor. E.g.
If one views the structure of this factor by using the 'str()' function, then according to alphabetical order
"Large" is given the value 1, "Medium" the value 2 and "Small" the value 3.
To give the values 1, 2 and 3 to "Small", "Medium" and "Large", one can use the 'levels' argument of the
'factor()' function, as
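The two cases described above can be sketched as follows. The vector size and the factor names f_alpha and f_size are hypothetical and serve only as an illustration.

```r
size <- c("Small", "Large", "Medium", "Small", "Large")   # hypothetical data

# Ordered factor with the default (alphabetical) order of levels
f_alpha <- factor(size, ordered = TRUE)
print(levels(f_alpha))         # "Large" "Medium" "Small"

# Ordered factor with an explicit, meaningful order of levels
f_size <- factor(size, levels = c("Small", "Medium", "Large"), ordered = TRUE)
print(as.integer(f_size))      # 1 3 2 1 3
print(f_size[1] < f_size[2])   # TRUE: "Small" ranks below "Large"
```

With an ordered factor, comparison operators such as '<' follow the order of the levels.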
Categorical variables that are defined over ranges of values are called interval variables. For example:
(a) age groups (0-10, 10-20, 20-30, etc.) (b) income groups ($100-500, $600-1000, etc.)
Descriptions of interval variables and ratio variables are deferred for now.
> ts_fact[2]
> ts_fact[c(1,3,4)]
A matrix is a two-dimensional rectangular data set in which data values are arranged into rows and
columns.
The function used to create a matrix is 'matrix()'. The syntax of this function is as follows:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
The option byrow indicates whether the values given by data must successively fill the columns (the
default) or the rows (if TRUE). The option dimnames allows names to be given to the rows and columns.
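The effect of the byrow option is easiest to see by filling the same values both ways. This is a minimal sketch; the names by_col and by_row are hypothetical.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill row by row

print(by_col)   # first column holds 1 and 2
print(by_row)   # first row holds 1, 2 and 3
```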
Creating Matrices
Creating matrix of integers
> m2 = matrix(c(12, 43, 43, 23,34, 26), nrow = 2, ncol= 3)
The marks of two students, "Rajan" and "Hari", in three subjects, "Math", "Science" and "Computer", are
stored in a matrix, and row and column headings are provided, below:
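A matrix with headings of this kind could be built as in the sketch below. The marks themselves are invented for illustration, and the variable name marks is hypothetical.

```r
# Hypothetical marks: one row per student, one column per subject
marks <- matrix(c(78, 65, 82,
                  56, 71, 90),
                nrow = 2, ncol = 3, byrow = TRUE,
                dimnames = list(c("Rajan", "Hari"),
                                c("Math", "Science", "Computer")))

print(marks)
print(marks["Hari", "Computer"])   # one element, selected by names
print(marks["Rajan", ])            # all marks of one student
```

Once dimnames are set, elements can be selected either by name or by position.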
> m2[2, 3]
> n1 = m2[2, 3]
> m2[2, ]
> m2[ , 2]
> diag(m3)
Matrix Manipulation
> a = matrix(1:10, nrow=2, ncol=5)
>a
>b
>c=ab
>5*a
>1/a
Real matrix multiplication
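In R the '*' operator multiplies matrices element by element, while true matrix multiplication uses the '%*%' operator. The sketch below contrasts the two; multiplying a by its own transpose is an assumption chosen only so that the dimensions are compatible, and the names elementwise and p are hypothetical.

```r
a <- matrix(1:10, nrow = 2, ncol = 5)

elementwise <- a * a    # element-by-element product, still 2 x 5
p <- a %*% t(a)         # matrix product of a (2 x 5) and t(a) (5 x 2): 2 x 2

print(dim(elementwise))
print(p)
```

For 'x %*% y' to work, the number of columns of x must equal the number of rows of y.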
> sqrt(a)
> sum(a)
> mean(a)
> sum(a[1, ])
> sum(a[2, ])
> mean(a[2, ])
> sum(diag(c))
> mean(diag(c))
> p1 = diag(c)
> mean(p1)
> eigen(c)$values
Introduction
R is a statistical programming language, and in statistics we work with datasets. A dataset typically
comprises a number of observations, and each observation consists of several variables, which may be of different types.
In a dataset, different observations are stored in different rows. Each observation
has specific attributes, e.g. name, age, gender, score, etc. Since there are usually many observations, each
attribute is placed in its own column of the dataset.
So a dataset is similar to a matrix, since it is a two-dimensional structure consisting of rows and columns.
However, a matrix can only contain data of a single type, whereas a dataset must allow each observation to
contain data of several different types.
In fact, a list can represent a single observation (row) of a dataset, and a dataset could also be created as a
list of lists.
However, R provides a special object for datasets, the data frame, which is created with the 'data.frame()' function.
In a data frame, all elements of a column have the same data type, and each column represents one attribute
of the observations. In the same way, each row contains the values belonging to one particular
observation.
Let us create a data frame containing three columns (or vectors) named name, age, and gender, each
containing five observations.
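A data frame of this shape could be created as in the sketch below. The five observations are invented for illustration; the column names Name, Age and Male are assumptions chosen to match the column references used later in this section.

```r
# Hypothetical data: five observations of three attributes
name <- c("Ram", "Sita", "Hari", "Gita", "Rajan")
age  <- c(23, 21, 25, 22, 24)
male <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

df <- data.frame(Name = name, Age = age, Male = male)
str(df)   # 5 observations of 3 variables
```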
> df
An alternative method is
Or,
In the same way, different rows of observations can also be named. (Later)
E.g.
> str(df)
> df[3, 2]
> df[ 3 , ]
> df [ , 1]
To display all observations in the first and third columns, i.e., the name and male columns
> df[ , c(1,3)]
Alternatively, a data frame can be viewed as a list of column vectors, where each element holds one
attribute for all observations. Individual columns are accessed with '$' or '[[ ]]'.
> df $ Age
> df $ Male
> df [["Age"]]
> ht = c(132, 143, 214, 245, 243)   # creating a vector of heights
> df $ height = ht
Or,
> df[["height"]] = ht
To add a new row to the data frame, we create another data frame containing rows to be added and use
'rbind()' function as follows-
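A minimal sketch of adding a row with 'rbind()' is given below. The data frames people and new_rows and their contents are hypothetical; the only requirement is that both data frames have the same column names.

```r
# Hypothetical starting data frame
people <- data.frame(Name = c("Ram", "Sita"), Age = c(23, 21),
                     Male = c(TRUE, FALSE))

# The row to be added, as a data frame with matching column names
new_rows <- data.frame(Name = "Hari", Age = 25, Male = TRUE)

people <- rbind(people, new_rows)
print(people)
```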
Or,
> df[rnk, ]
   h   w    sex
1 60 100   Male
2 80 100   Male
3 60 300   Male
4 80 300   Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female
In practice, the data required for statistical analysis are rarely entered in R directly, as we have done
with the data frames above. They are usually imported from other sources, such as text files,
spreadsheets (e.g. Excel), databases (e.g. SQL, Access) or other statistical packages (e.g. SPSS).
Data files in which values are separated by commas are commonly called comma-separated files (or .CSV files), and those
in which values are separated by tabs are called tab-delimited text files (or .TXT files).
Other functions that can be used to import a dataset into R are scan(), read.fwf(), etc.
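As a sketch of importing both kinds of file, the example below first writes a small file to a temporary location so that it can be run as is. The file contents and the names csv_file, tab_file, marks and marks2 are hypothetical; with real data, the path of an existing file would be used instead.

```r
# Create a small comma-separated file in a temporary location (hypothetical data)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Math,Science",
             "Ram,45,67",
             "Sita,78,56",
             "Hari,66,81"), csv_file)

marks <- read.csv(csv_file, header = TRUE)   # import the CSV file

# Write the same data as a tab-delimited file and read it back
tab_file <- tempfile(fileext = ".txt")
write.table(marks, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)
marks2 <- read.table(tab_file, header = TRUE, sep = "\t")

print(marks)
print(mean(marks$Math))   # 63
```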
// OR
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", row.names, col.names,
           as.is = FALSE, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
           check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE,
           blank.lines.skip = TRUE, comment.char = "#")
'scan()' function to open data sets
The function scan() is more flexible than read.table(). One difference is that it is possible to specify the mode
of the variables through the 'what' argument. For example, a call with what = list("", 0, 0) on the file
'data.dat' reads three variables, the first of mode character and the next two of mode
numeric.
Another important distinction is that scan() can be used to create different objects: vectors, matrices,
data frames, lists, and so on.
In the example just described, if the result is assigned to mydata, then mydata is a list of three vectors.
By default, that is, if 'what' is omitted, scan() creates a numeric vector. If the data read do not correspond
to the mode(s) expected (either by default or as specified by 'what'), an error message is returned.
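The behaviour of the 'what' argument can be tried out with the sketch below. It writes its own small files to a temporary location, so the file contents and the names dat_file, nums_file and nums are hypothetical stand-ins for a real file such as 'data.dat'.

```r
# Hypothetical data file: one character field followed by two numeric fields
dat_file <- tempfile(fileext = ".dat")
writeLines(c("Ram 45 67.5", "Sita 78 56", "Hari 66 81"), dat_file)

# 'what' gives one template per variable: "" for character, 0 for numeric
mydata <- scan(dat_file, what = list("", 0, 0), quiet = TRUE)
str(mydata)   # a list of three vectors

# With 'what' omitted, scan() returns a numeric vector
nums_file <- tempfile()
writeLines("1 2 3 4.5", nums_file)
nums <- scan(nums_file, quiet = TRUE)
print(nums)
```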
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
     quote = if (sep=="\n") "" else "'\"", dec = ".", skip = 0, nlines = 0,
     na.strings = "NA", flush = FALSE, fill = FALSE, strip.white = FALSE,
     quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,
     comment.char = "", allowEscapes = TRUE)
The function read.fwf() can be used to read data stored in a file in fixed-width format:
read.fwf(file, widths, header = FALSE, sep = "\t", as.is = FALSE, skip = 0,
         row.names, col.names, n = -1, buffersize = 2000, ...)
> mean(bb)   # does not return a useful result: bb is a data frame, not a numeric vector
Its columns must be referred to with the '$' operator (e.g. bb$Math), or the data frame can first be attached:
> attach(bb)   # adds the columns of bb to R's search path
Once the attach() function has been used on a data frame, it is no longer necessary to use the '$' operator to refer
to any variable in it. E.g.
> mean(Math)
> dim(bb)   # displays the number of rows and columns in the data frame
> bb[2:3, ]
> bb[ , 4]
> bb [ , c(4,5)]
To display a summary of the data frame
> summary(bb)
For numeric data it displays the minimum, quartiles, median, mean and maximum. If there are categorical data, i.e.,
factors, then it displays the counts of the different categories.
To split a data frame into two or more data frames according to a categorical variable, e.g. Gender
> maledata
> femdata
> dim(maledata)
> summary(maledata)
> femdata[1:3,]
> mean(bb$Age[bb$Gender=="female"])
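One way to carry out the split described above is logical indexing on the categorical column, as in this sketch. The contents of bb are invented for illustration (the real bb is imported from a file), and 'split()' is shown as an alternative that returns all groups at once.

```r
# Hypothetical stand-in for the imported data frame bb
bb <- data.frame(Name   = c("Ram", "Sita", "Hari", "Gita"),
                 Gender = c("male", "female", "male", "female"),
                 Age    = c(23, 21, 25, 22))

maledata <- bb[bb$Gender == "male", ]     # rows where Gender is "male"
femdata  <- bb[bb$Gender == "female", ]   # rows where Gender is "female"

print(dim(maledata))
print(mean(femdata$Age))                  # 21.5

# Alternative: split() returns a list with one data frame per category
by_gender <- split(bb, bb$Gender)
print(names(by_gender))
```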
Array
While matrices are confined to two dimensions, arrays can have any number of dimensions. In this sense,
vectors, lists and factors can be thought of as one-dimensional structures, and matrices are two-dimensional arrays.
Creating arrays
The marks of 3 students in 4 subjects recorded for two terminal examinations can be presented in the form
of a 3-dimensional array made up of two 4 x 3 matrices as follows:
> ar3 = array(c(sub11, sub21, sub31, sub12, sub22, sub32), dim=c(4,3,2))
Manipulating Arrays
To provide names for the row headings, column headings, and matrix headings of the above array:
a)
b)
c)
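A sketch of a named 3-dimensional array follows. The marks, the subject and student labels, and the name ar are all hypothetical; each vector is assumed to hold the marks of one student in four subjects for one term.

```r
# Hypothetical marks: sub<student><term>, four subjects each
sub11 <- c(56, 67, 78, 45); sub21 <- c(66, 54, 81, 72); sub31 <- c(49, 90, 63, 58)
sub12 <- c(60, 70, 75, 50); sub22 <- c(68, 59, 85, 70); sub32 <- c(55, 88, 66, 61)

ar <- array(c(sub11, sub21, sub31, sub12, sub22, sub32),
            dim = c(4, 3, 2),
            dimnames = list(c("Math", "Science", "Computer", "English"),   # rows
                            c("S1", "S2", "S3"),                           # columns
                            c("Term1", "Term2")))                          # matrices

print(ar["Science", "S3", "Term1"])   # 90: second entry of sub31
print(ar[ , "S2", "Term2"])           # all marks of student S2 in term 2
print(mean(ar))                       # grand average over both terms
```

The dimnames argument takes a list with one vector of names per dimension, in the order rows, columns, matrices.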
ar4[2, 3, 1]
ar4[2, , 1]
mean(ar4[2, , 1])
ar4[ , 3, 2]
mean(ar4[ , 3, 2])
sum(ar4[ , , 1])
To display grand average marks of all students in all subjects in both terms
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n",
            na = "NA", dec = ".", row.names = TRUE, col.names = TRUE,
            qmethod = c("escape", "double"))