Module 2 Iris Data Set
Allan Lao
2023-09-26
Iris Dataset in R
The iris dataset is a built-in dataset in R that contains measurements of 4 attributes (in centimeters) for 150 flowers: 50 from each of 3
species.
data(iris)
Structure
The structure of the dataset
str(iris)
str() shows the structure of the dataset: the number of observations (records), the number of variables, and each variable's data type. The
iris dataset has 150 rows and 5 columns. Note that the Species variable is a Factor, while the other four variables are numeric.
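The same facts can be checked programmatically. The following is a minimal sketch using base R only; the variable names (col_classes, species_levels, n_obs, n_vars) are chosen here for illustration and do not come from the original notes.

```r
# Load the built-in dataset (part of base R's datasets package)
data(iris)

# Class of every column: the measurements are numeric, Species is a factor
col_classes <- sapply(iris, class)
print(col_classes)

# The levels of the factor are the species names
species_levels <- levels(iris$Species)
print(species_levels)

# Row and column counts, which should agree with what str() reports
n_obs <- nrow(iris)
n_vars <- ncol(iris)
print(c(rows = n_obs, columns = n_vars))
```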
The dimension
dim(iris)
## [1] 150 5
names(iris)
head(iris,4)
head(iris, 4) returns the first 4 rows.
tail(iris)
tail(iris) returns the last 6 rows, which is the default when no count is given.
summary(iris)
For each of the numeric variables, summary() reports the minimum, first quartile, median, mean, third quartile, and maximum.
For the only categorical variable in the dataset (Species), it reports a frequency count of each value: 50 flowers per species.
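To see where these numbers come from, the quantities reported by summary() can also be computed one at a time. This is a small sketch with base R functions; Sepal.Length is picked arbitrarily as the example column, and the variable names are illustrative.

```r
data(iris)

# summary() on a single numeric column returns six named values
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# The same quantities computed directly
sepal_quartiles <- quantile(iris$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))
sepal_mean <- mean(iris$Sepal.Length)
print(sepal_quartiles)
print(sepal_mean)

# For a factor, the frequency count is what table() returns
species_counts <- table(iris$Species)
print(species_counts)
```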
plot(iris)
Plotting the entire dataset gives a glimpse of the relationships between its variables. In the resulting scatterplot matrix, the panel directly
below the Sepal.Length label plots Sepal.Width on the y-axis against Sepal.Length on the x-axis.
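As a numeric companion to the scatterplot matrix, the pairwise correlations between the four measurements can be printed alongside it. This is only a sketch: the choice of Pearson correlation (the default of cor()) and the names measurements and cor_matrix are assumptions made here, not part of the original notes.

```r
data(iris)

# Keep only the numeric measurement columns (drops Species)
measurements <- iris[, sapply(iris, is.numeric)]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(measurements), 2)
print(cor_matrix)

# Scatterplot matrix of the measurements, coloured by species
pairs(measurements, col = as.integer(iris$Species), pch = 19)
```

Values close to 1 or -1 correspond to panels where the points fall near a straight line.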
plot(iris$Sepal.Length) #Quantitative
plot(iris$Sepal.Width, iris$Sepal.Length,
     col=factor(iris$Species),
     main='Sepal Length vs Width',
     xlab='Sepal Width',
     ylab='Sepal Length',
     pch=19)
plot(iris$Species)
Next, we use a histogram to see how the data are spread across a range of values, in this case the distribution of Sepal Length.
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')
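The counts behind the bars can be inspected without drawing anything, since hist() returns the bin edges and counts when called with plot = FALSE. The sketch below uses the default bin selection; the data frame bin_table is an illustrative name.

```r
data(iris)

# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, plot = FALSE)

# One row per bin: lower edge, upper edge, number of flowers in the bin
bin_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```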
A box plot shows the five-number summary of the data: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. It is
thus useful for visualizing the spread of the data and drawing inferences from it. By default, boxplot() in R draws points that lie more than
1.5 times the interquartile range beyond the box as separate outliers.
Using boxplot(), we can compare the distribution of sepal length across species.
boxplot(Sepal.Length~Species,
        data=iris,
        main='Sepal Length by Species',
        xlab='Species',
        ylab='Sepal Length',
        col='steelblue',
        border='black')
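The numbers drawn in each box can also be printed per species. The sketch below uses fivenum(), which returns Tukey's five-number summary. Its lower and upper hinges are what boxplot() draws, and they are close to, but not always identical to, the 25th and 75th percentiles. The row labels and variable names are added here for readability.

```r
data(iris)

# Five-number summary of Sepal.Length for each species (one column per species)
five_by_species <- sapply(split(iris$Sepal.Length, iris$Species), fivenum)
rownames(five_by_species) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five_by_species)

# boxplot.stats() returns the values boxplot() uses, including any outliers
virginica_stats <- boxplot.stats(iris$Sepal.Length[iris$Species == "virginica"])
print(virginica_stats$out)
```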