Logistic Regression
Peter Caya
April 25, 2017
library(dplyr)
##
## Attaching package: 'dplyr'
le:///media/petercaya/B192-0759/Kaggle/CC_Default/Logistic_Regression_for_CC_Fraud.html 1/29
5/7/2017 Logistic Regression
# This R environment comes with all of CRAN preinstalled, as well as many other helpful packages
# The environment is defined by the kaggle/rstats docker image: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packages to load in
# [3] Observe the number of defaults that occur based on the previously mentioned variables.
# [4] Identify patterns in the Pay, Bill and Pay_AMT variables for the defaulted and non-defaulted datasets.
# [5] Get percentiles for different variables.
# [1] ---------------------------------------------------------------------
# 25 variables with 30,000 observations. No duplicates were observed. There were also
# no missing values, so this appears to be a rather clean dataset.
print(dim(raw_data))
## [1] 30000 25
print(sum(duplicated(raw_data)))
## [1] 0
## [1] 0
## ID LIMIT_BAL
## "numeric" "numeric"
## SEX EDUCATION
## "numeric" "numeric"
## MARRIAGE AGE
## "numeric" "numeric"
## PAY_0 PAY_2
## "numeric" "numeric"
## PAY_3 PAY_4
## "numeric" "numeric"
## PAY_5 PAY_6
## "numeric" "numeric"
## BILL_AMT1 BILL_AMT2
## "numeric" "numeric"
## BILL_AMT3 BILL_AMT4
## "numeric" "numeric"
## BILL_AMT5 BILL_AMT6
## "numeric" "numeric"
## PAY_AMT1 PAY_AMT2
## "numeric" "numeric"
## PAY_AMT3 PAY_AMT4
## "numeric" "numeric"
## PAY_AMT5 PAY_AMT6
## "numeric" "numeric"
## default.payment.next.month
## "numeric"
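The commands behind the second `## [1] 0` and the class listing above are not shown in this printout. As a minimal sketch, assuming they were a missing-value count and a per-column class check, the same sanity checks can be run on a small made-up data frame (`toy` and its values are hypothetical, not the real data):

```r
# Toy stand-in for raw_data (hypothetical values, not the real dataset).
toy <- data.frame(
  ID        = c(1, 2, 3, 4),
  LIMIT_BAL = c(20000, 120000, 90000, 50000),
  PAY_0     = c(2, -1, 0, -2)
)

n_dims       <- dim(toy)              # number of rows and columns
n_duplicates <- sum(duplicated(toy))  # fully duplicated rows
n_missing    <- sum(is.na(toy))       # missing cells across all columns (assumed check)
col_classes  <- sapply(toy, class)    # storage class of each column (assumed check)

print(n_dims)
print(n_duplicates)
print(n_missing)
print(col_classes)
```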
Note that the PAY_# variables contain observations of -2, a value that is not mentioned in the data dictionary.
Looking at the data, we also see fewer instances of 1 than we would expect: PAY_5 and PAY_6 contain none at all!
My guess is that these values were mislabeled somehow, so I correct the mistake below, along with some other
transformations on the data.
for(i in pay_names){
  raw_data[,i] <- ifelse(raw_data[,i]==-2,1,raw_data[,i])
}
for(i in bill_names){
  raw_data[,i] <- ifelse(raw_data[,i]<0, -raw_data[,i] ,raw_data[,i])
}
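The two loops above can also be written without explicit iteration. The following is only a sketch on made-up values: `toy` is hypothetical, and the contents of `pay_names` and `bill_names` are assumed, since their definitions do not appear in this printout.

```r
# Toy stand-in for raw_data; column names follow the dataset, values are made up.
toy <- data.frame(
  PAY_0     = c(-2, -1, 0, 2),
  PAY_2     = c(0, -2, 1, -1),
  BILL_AMT1 = c(3913, -165, 0, 2682)
)
pay_names  <- c("PAY_0", "PAY_2")  # assumed subset of the PAY_# columns
bill_names <- c("BILL_AMT1")       # assumed subset of the BILL_AMT# columns

# Recode -2 to 1 in every PAY column in a single pass.
toy[pay_names] <- lapply(toy[pay_names], function(x) ifelse(x == -2, 1, x))

# Replace negative bill amounts with their absolute value.
toy[bill_names] <- lapply(toy[bill_names], abs)

print(toy)
```

Assigning into `toy[pay_names]` keeps the data frame structure intact, so the result can be passed on exactly as the loop output would be.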
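# Bucket AGE into decade ranges. Boundary ages (30, 40, 50) land in the lower
# bucket because the first matching condition wins. Every age outside 20-60
# falls through to the final "70-80" label.
exp_data <- exp_data %>%
  mutate(AGE_Range = factor(
    ifelse(AGE >= 20 & AGE <= 30, "20-30",
    ifelse(AGE >= 30 & AGE <= 40, "30-40",
    ifelse(AGE >= 40 & AGE <= 50, "40-50",
    ifelse(AGE >= 50 & AGE <= 60, "50-60", "70-80"))))))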
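An alternative to nested `ifelse()` calls for this kind of bucketing is base R's `cut()`. This is a sketch on made-up ages, not the code used in this analysis; the open-ended `"60+"` label is my assumption for ages above 60.

```r
ages <- c(21, 30, 31, 45, 60, 61, 79)  # made-up ages for illustration

age_range <- cut(
  ages,
  breaks = c(20, 30, 40, 50, 60, Inf),
  labels = c("20-30", "30-40", "40-50", "50-60", "60+"),  # "60+" is an assumed label
  include.lowest = TRUE,  # keeps age 20 in the first bucket
  right = TRUE            # boundary ages (30, 40, ...) go to the lower bucket
)

print(data.frame(ages, age_range))
```

With these breaks, any age below 20 becomes `NA` instead of being silently assigned to a bucket, which makes unexpected values easier to spot.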
NOTE TO SELF: Create one variable from the payment data showing the number of months that were missed as
of payment 4!
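One possible way to build such a variable is sketched below on made-up values. It rests on two assumptions of mine: that a repayment status above 0 means the payment was late that month, and that "as of payment 4" refers to the columns PAY_0 through PAY_4. The name `MONTHS_MISSED` is hypothetical.

```r
# Made-up repayment statuses; assumption: a value above 0 marks a late payment.
toy <- data.frame(
  PAY_0 = c(2, -1, 0, 1),
  PAY_2 = c(2, -1, 0, 0),
  PAY_3 = c(0, -1, 2, 0),
  PAY_4 = c(0, 0, 3, 0)
)
status_cols <- c("PAY_0", "PAY_2", "PAY_3", "PAY_4")  # assumed reading of "as of payment 4"

# Count, per customer, how many of the selected months show a late payment.
toy$MONTHS_MISSED <- rowSums(toy[status_cols] > 0)

print(toy)
```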
# Summarize age:
summary(exp_data$AGE)
# [3] ---------------------------------------------------------------------
# We see that defaults make up ~22% of the data.
sum(exp_data$default.payment.next.month==1)/nrow(exp_data)
## [1] 0.2233652
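The same share can be obtained a little more directly, since the mean of a logical vector is the proportion of TRUE values. A small sketch on made-up labels (`default_flag` is hypothetical):

```r
default_flag <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)  # made-up labels, 1 = default

# The mean of a logical vector is the share of TRUE values.
default_rate <- mean(default_flag == 1)

# prop.table() gives the share of every class at once.
class_shares <- prop.table(table(default_flag))

print(default_rate)
print(class_shares)
```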
ggplot(exp_data)+geom_bar(aes(Default))+ coord_flip()
# [4] ---------------------------------------------------------------------
# For each PAY_# variable, compute the share of defaults within each category.
k <- 1
pay_table <- list()
for (i in pay_names) {
  cur_table <- exp_data %>%
    group_by_(i, "Default") %>%
    summarise(Observations = n()) %>%
    group_by_(i) %>%
    mutate(pct = Observations / sum(Observations)) %>%
    filter(Default == "Default")
  names(cur_table)[1] <- "Category"
  pay_table[[k]] <- cur_table[c(1, 4)]  # keep Category and pct
  k <- k + 1
}
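# Combine the per-variable tables into one display table keyed on Category.
# NOTE: the original else branch was lost at a page break; the join is an assumed reconstruction.
for (i in seq_along(pay_table)) {
  cur <- pay_table[[i]]
  names(cur)[ncol(cur)] <- pay_names[i]
  disp_table <- if (i == 1) cur else full_join(disp_table, cur, by = "Category")
}
kable(disp_table)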
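The default share per category computed above can be cross-checked in base R with `tapply()`. This is only a sketch on made-up data: `toy`, the `"No Default"` label, and the helper `default_rate_by` are hypothetical.

```r
# Made-up data; "No Default" is an assumed label for the non-default class.
toy <- data.frame(
  PAY_0   = c(0, 0, 0, 0, 2, 2, 2, 2),
  PAY_2   = c(0, 0, 2, 2, 0, 0, 2, 2),
  Default = c("No Default", "No Default", "No Default", "Default",
              "Default", "Default", "No Default", "Default")
)
pay_names <- c("PAY_0", "PAY_2")

# Share of defaults within each category of one PAY column.
default_rate_by <- function(column) {
  tapply(toy$Default == "Default", toy[[column]], mean)
}

# One column of default shares per PAY variable, one row per category.
rate_table <- sapply(pay_names, default_rate_by)
print(rate_table)
```

Note that `sapply()` only returns a tidy matrix here because both toy columns share the same set of categories; with differing categories it would return a list instead.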
b. Exploratory Analysis
In the exploratory analysis I will attempt to answer a few questions I have regarding this dataset.
PAY_2, PAY_3, PAY_4, PAY_5, and PAY_6 represent the same metric for the months of August, July, June, May,
and April respectively. How does the number of defaults change as we move from the earliest month to the latest?
To start this analysis, I will apply logistic regression using R's caret package. In this
# predict(object = hold, )

# Fit a logistic regression (binomial GLM with a logit link) through caret.
log_model <- train(Default ~ ., data = training_data,
                   method = "glm", family = binomial(link = "logit"),
                   trControl = fitControl)

# Earlier naive Bayes experiment, kept for reference:
# vars <- training_data[1:11]
# y = training_data$Default
# nb_model <- train(vars,y,'nb',trControl=trainControl(method='cv',number=10))
# log_model <- train( ,
#                    method="nb",trControl = fitControl)

library(mlbench)

# Predict classes for the training and test sets, then compare them with the true labels.
train_preds <- predict(log_model, newdata = training_data)
test_preds <- predict(log_model, newdata = testing_data)

confusionMatrix(train_preds, training_data$Default)
confusionMatrix(test_preds, testing_data$Default)
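The fit-and-evaluate steps above can be sketched without caret, using only base R. Everything here is an illustration under my own assumptions: the data are simulated, the 70/30 split and the 0.5 cutoff are arbitrary choices, and the names `sim`, `fit`, and `conf_mat` are hypothetical. It is not the setup that produced the results in this report.

```r
set.seed(42)  # reproducible simulated data

# Simulated stand-in for the credit data: one predictor and a binary label.
n <- 500
x <- rnorm(n)
p <- plogis(-1 + 2 * x)  # true default probability under the simulation
sim <- data.frame(
  x = x,
  Default = factor(ifelse(runif(n) < p, "Default", "No_Default"),
                   levels = c("No_Default", "Default"))
)

# Hold out 30% of the rows for testing (the split ratio is an assumption).
test_idx      <- sample(n, size = 0.3 * n)
training_data <- sim[-test_idx, ]
testing_data  <- sim[test_idx, ]

# Binomial GLM with a logit link; the second factor level ("Default") is modelled as success.
fit <- glm(Default ~ x, data = training_data, family = binomial(link = "logit"))

# Convert predicted probabilities to classes with a 0.5 cutoff.
test_prob  <- predict(fit, newdata = testing_data, type = "response")
test_preds <- factor(ifelse(test_prob > 0.5, "Default", "No_Default"),
                     levels = levels(sim$Default))

# Cross-tabulate predictions against the true labels and compute accuracy.
conf_mat <- table(Predicted = test_preds, Actual = testing_data$Default)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

With roughly 22% defaults in the real data, accuracy alone can look good for a model that rarely predicts a default, so the off-diagonal cells of the table deserve as much attention as the headline number.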