---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r loading_preprocessing,echo=TRUE}
#Unzipping the compressed file and loading the data
unzip("activity.zip")
df2 <- read.csv("./activity.csv",na.strings = "NA")
```
## What is mean total number of steps taken per day?
```{r meansteps,echo=TRUE}
library(ggplot2) # ggplot2 is the plotting library used throughout
# tapply sums the steps for each day; with na.rm=TRUE, a day whose values are
# all NA gets a total of 0
stepsvar <- tapply(df2$steps, df2$date, FUN=sum, na.rm=TRUE)
# The x axis shows total steps per day; the y axis counts the days in each bin
qplot(stepsvar, binwidth=800, xlab="Total Steps Per Day", ylab="Number of Days", colour=I("purple"))
mean_steps <- as.integer(mean(stepsvar))
median_steps <- as.integer(median(stepsvar))
```
The mean and median of total steps per day are `r mean_steps` and `r median_steps` respectively.

## What is the average daily activity pattern?
```{r avdailypat,echo=TRUE}
library(ggplot2)
# aggregate() splits the data into subsets (the `by` values are coerced to
# factors) and applies FUN to each subset
average_data <- aggregate(x=list(steps=df2$steps), by=list(interval=df2$interval),
                          FUN=mean, na.rm=TRUE)
ggplot(data=average_data, mapping=aes(x=interval, y=steps)) +
  geom_line(colour="blue") +
  xlab("5-Minute Interval") +
  ylab("Average Number of Steps Taken in Interval")
```
### 5-minute interval containing the maximum number of steps, averaged across all days
```{r findmax,echo=TRUE}
average_data[which.max(average_data$steps),]$interval
```
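
As a minimal sketch of how this lookup works, the example below repeats the same two steps on a small made-up data frame. The names `toy`, `toy_avg` and `peak_interval` and all values are hypothetical and not taken from `activity.csv`.

```r
# Hypothetical toy data: two days, three 5-minute intervals each.
toy <- data.frame(
  steps    = c(0, 10, 4, 2, 30, NA),
  interval = c(0, 5, 10, 0, 5, 10)
)

# Mean steps per interval, ignoring missing values.
toy_avg <- aggregate(x = list(steps = toy$steps),
                     by = list(interval = toy$interval),
                     FUN = mean, na.rm = TRUE)

# which.max() returns the position of the first maximum;
# indexing the interval column with it gives the interval label.
peak_interval <- toy_avg$interval[which.max(toy_avg$steps)]
peak_interval
```

Indexing the `interval` column directly is equivalent to selecting the whole row first and then taking `$interval`, as the chunk above does.
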
## Imputing missing values
There are a number of days/intervals where the number of steps taken is missing;
these are coded as NA in the original dataset. Missing values may introduce
bias into some calculations or summaries of the data.

### Finding the number of missing values
```{r number_of_missing,echo=TRUE}
# is.na() flags the missing elements; summing the flags counts them across all columns
sum(is.na(df2))
```
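
Because `sum(is.na(...))` counts missing cells across every column, it can help to see which columns they fall in and how many rows are affected. The following is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not the assignment data).

```r
# Hypothetical toy data with missing values only in `steps`.
toy <- data.frame(
  steps    = c(NA, 3, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows containing at least one missing value.
rows_with_na <- sum(!complete.cases(toy))

na_per_column
rows_with_na
```

When only one column contains missing values, the cell count and the row count agree.
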
We can fill each missing value with the mean of its 5-minute interval, averaged across all days.
```{r filling_new,echo=TRUE}
na_fill <- function(steps, interval) {
  mod_df <- NA
  if (!is.na(steps))
    mod_df <- c(steps)
  else
    mod_df <- (average_data[average_data$interval==interval, "steps"])
  return(mod_df)
}
# The function above replaces an NA value with the mean value of the
# corresponding interval and leaves observed values unchanged.
new_df <- df2
# mapply calls na_fill once per row, pairing each steps value with its interval
new_df$steps <- mapply(na_fill, new_df$steps, new_df$interval)
```
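
The row-by-row fill above can also be written as a single vectorised lookup. The sketch below is one possible alternative, not the method used in this report. It runs on made-up data, and the names `toy`, `toy_avg`, `missing` and `filled` are hypothetical.

```r
# Hypothetical toy data: intervals 0 and 5 observed on two days.
toy <- data.frame(
  steps    = c(NA, 10, 4, 30),
  interval = c(0, 5, 0, 5)
)

# Mean steps per interval; the formula interface drops NA rows by default.
toy_avg <- aggregate(steps ~ interval, data = toy, FUN = mean)

# match() finds, for each missing row, the position of its interval
# in the table of interval means.
missing <- is.na(toy$steps)
filled <- toy$steps
filled[missing] <- toy_avg$steps[match(toy$interval[missing], toy_avg$interval)]
filled
```

This avoids one function call per row, which may matter on larger data frames. Observed values are left unchanged in both approaches.
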
Using the new dataset, we again plot the histogram and find the mean and median
of the total number of steps taken per day.
```{r meansteps2,echo=TRUE}
# tapply sums the steps for each day and assigns the totals to stepsvar2
stepsvar2 <- tapply(new_df$steps, new_df$date, FUN=sum)
qplot(stepsvar2, binwidth=800, xlab="Total Steps Per Day", ylab="Number of Days", colour=I("green"))
mean_steps2 <- as.integer(mean(stepsvar2))
median_steps2 <- as.integer(median(stepsvar2))
```
The mean and median of total steps per day after imputation are
`r mean_steps2` and `r median_steps2` respectively.
These values differ from those in the first part of the assignment: both the
mean and the median are higher. In the first part, `sum` with `na.rm=TRUE` gave
any day whose values were all missing a total of 0; after imputation such a day
takes the sum of the interval means instead.

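
A small worked example may make this effect concrete. The sketch below uses made-up numbers: `toy` and the day labels "A", "B" and "C" are hypothetical, and `ave()` stands in here for the interval-mean fill.

```r
# Hypothetical toy data: day "A" is entirely missing, "B" and "C" are observed.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 40),
  day      = c("A", "A", "B", "B", "C", "C"),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Without imputation, na.rm = TRUE gives the all-missing day a total of 0.
totals_raw <- tapply(toy$steps, toy$day, FUN = sum, na.rm = TRUE)

# Fill each NA with the mean of its interval, then total again.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
totals_filled <- tapply(steps_filled, toy$day, FUN = sum)

c(mean_raw = mean(totals_raw), mean_filled = mean(totals_filled))
c(median_raw = median(totals_raw), median_filled = median(totals_filled))
```

In this toy case the daily totals change from 0, 30, 70 to 50, 30, 70, so the mean rises from about 33.3 to 50 and the median from 30 to 50.
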
## Are there differences in activity patterns between weekdays and weekends?
For observing this difference, we need to find the day corresponding to
each date measurement.
```{r filling_missing,echo=TRUE}
# The function below takes a date and returns either "weekend" or "weekday",
# using the weekdays() function in R. weekdays() returns names in the
# current locale, so the English names below assume an English locale.
day_helper <- function(date) {
  day <- weekdays(date)
  weekday_names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
  weekend_names <- c("Saturday","Sunday")
  if (day %in% weekday_names)
    return("weekday")
  else if (day %in% weekend_names)
    return("weekend")
  else
    stop("invalid date")
}
new_df$date <- as.Date(new_df$date)
# sapply applies the helper function over the date column of our data frame
new_df$day <- sapply(new_df$date, FUN=day_helper)
```
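
Since `weekdays()` returns day names in the current locale, the comparison against English names can fail on a non-English system. The sketch below shows one locale-independent alternative based on the numeric weekday from `as.POSIXlt()`. The function name `day_type` and the example dates are hypothetical and not part of the analysis above.

```r
# as.POSIXlt()$wday numbers the days from Sunday: 0 = Sunday, ..., 6 = Saturday.
day_type <- function(dates) {
  wday <- as.POSIXlt(dates)$wday
  ifelse(wday %in% c(0, 6), "weekend", "weekday")
}

# 2012-10-05 was a Friday, 2012-10-06 a Saturday and 2012-10-07 a Sunday.
example_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07"))
day_type(example_dates)
```

This version is also vectorised, so it can be applied to a whole date column without `sapply`.
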
We make a time series plot of the 5-minute interval against the average number
of steps taken, averaged separately across all weekdays and all weekend days.
```{r panel_plot,echo=TRUE}
# aggregate() averages the steps for each combination of interval and day type
average_data2 <- aggregate(steps ~ interval + day, data=new_df, mean)
# facet_grid() forms a matrix of panels defined by row and column faceting variables
ggplot(average_data2, aes(interval, steps)) + geom_line(colour=I("blue")) + facet_grid(day ~ .) +
  xlab("5-Minute Interval") + ylab("Average Number of Steps Taken in Interval")
```