Project Report on Credit Risk Analysis Using Random Forest

This document outlines a project focused on credit risk analysis using a Random Forest classifier to predict loan defaults based on a dataset containing various applicant characteristics. It details the dataset, preprocessing steps, exploratory data analysis, model training, and performance evaluation, highlighting the Random Forest model's accuracy of 91.77% compared to baseline models. Future directions include enhancing the model through advanced feature engineering and incorporating external data sources for improved predictive accuracy.

Credit Risk Analysis Using Random Forest

1. Introduction and Dataset Description

Introduction

In today's financial landscape, assessing credit risk is crucial for lenders and financial
institutions. This project analyzes credit risk using machine learning techniques: the goal is to
predict loan defaults by developing a model, specifically a Random Forest classifier, on a given
dataset. The project offers a simplified view of the factors that contribute to credit risk, making
it a good opportunity for data scientists to apply their skills in machine learning and predictive
modeling.

Dataset Description

The dataset provides essential information about loan applicants and their characteristics. Key
features include:

- ID: Unique identifier for each loan applicant.
- Age: Age of the loan applicant.
- Income: Income of the loan applicant.
- Home: Home ownership status (Own, Mortgage, Rent).
- Emp_Length: Employment length in years.
- Intent: Purpose of the loan (e.g., education, home improvement).
- Amount: Loan amount applied for.
- Rate: Interest rate on the loan.
- Status: Loan approval status (Fully Paid, Charged Off, Current).
- Percent_Income: Loan amount as a percentage of income.
- Default: Whether the applicant has defaulted on a loan previously (Yes, No).
- Cred_Length: Length of the applicant's credit history.

The dataset is valuable for assessing credit risk, a crucial task for lenders and financial
institutions. Proper handling and analysis of this data can lead to better credit risk assessment
methods.

2. Importing Libraries

Essential libraries are imported for data manipulation, visualization, and machine learning; a
sketch of the corresponding import statements follows the list. These include:

- Pandas and NumPy for data manipulation.
- Seaborn and Matplotlib for data visualization.
- Plotly for interactive visualizations.
- Scikit-learn for machine learning models and evaluation.
- Yellowbrick for visualizing model performance.
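
A minimal set of import statements matching this stack might look like the following; the exact modules used in the original notebook are not shown in the report, so this is a representative sketch.

```python
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine learning: preprocessing, models, and evaluation
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Yellowbrick for visualizing classifier performance
from yellowbrick.classifier import ConfusionMatrix
```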
3. Data Loading and Data Preprocessing

Data Loading

The dataset is loaded into a DataFrame to facilitate data manipulation and analysis. This step
involves reading the data from a CSV file.
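
A minimal loading step, assuming the file is named credit_risk.csv (the report does not give the exact file name), could be:

```python
import pandas as pd

# Read the applicant data from CSV (file name assumed)
df = pd.read_csv("credit_risk.csv")

# Inspect the shape and the first few rows
print(df.shape)
print(df.head())
```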

Data Preprocessing

Data preprocessing is a critical step in preparing the dataset for analysis and modeling. This
process begins with handling missing values, which are common in real-world data. In this
project, missing values in the Emp_Length and Rate columns were addressed with a mean
imputation strategy: by applying the SimpleImputer from Scikit-learn, we replaced the missing
values in these columns with the column mean, ensuring that no important information was lost
and that the dataset remained complete.

To further ensure the integrity and quality of the dataset, we removed all duplicate rows; this
step is crucial because duplicates can skew analysis and model training. The ID column, which
serves only as a unique identifier and does not contribute to predictive modeling, was dropped
to streamline the data. Finally, we inspected the number of unique values in each column to
better understand the dataset's structure and to identify any potential issues or insights
regarding the diversity of the data. This thorough preprocessing ensures that the dataset is
clean, consistent, and ready for the subsequent exploratory data analysis and model building.
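
The preprocessing described above could be expressed roughly as follows; the column names follow the dataset description, and the identifier column name in the actual CSV may differ.

```python
from sklearn.impute import SimpleImputer

# Replace missing values in Emp_Length and Rate with the column mean
imputer = SimpleImputer(strategy="mean")
df[["Emp_Length", "Rate"]] = imputer.fit_transform(df[["Emp_Length", "Rate"]])

# Remove duplicate rows so they do not skew analysis or training
df = df.drop_duplicates()

# Drop the identifier column, which carries no predictive signal
# (written as "ID" in this report; adjust to the actual column name)
df = df.drop(columns=["ID"])

# Inspect the number of unique values per column
print(df.nunique())
```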

4. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an essential step in understanding the underlying patterns
and relationships in the dataset. It helps in visualizing the distribution of data, identifying
outliers, and discovering trends that can inform feature engineering and model building.

Target Feature: Status

The Status column, which indicates the loan approval status (Fully Paid, Charged Off, Current),
is the target variable for our predictive modeling. To understand its distribution, we use a count
plot:
This visualization helps us see the frequency of each loan status category, providing insight into
class imbalances that may need to be addressed in the modeling phase.
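
A count plot of the target can be drawn with Seaborn, roughly as below; the same call with a different x column is what the later count plots for Home, Intent, and Default would use.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Frequency of each loan status category
sns.countplot(x="Status", data=df)
plt.title("Distribution of Loan Status")
plt.show()
```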

Home Ownership Status

The Home column indicates the home ownership status of the loan applicants. We visualize its
distribution using a count plot:

This plot helps us understand the distribution of home ownership types (Own, Mortgage, Rent)
among the applicants.

Loan Intent

The Intent column specifies the purpose of the loan, such as education or home improvement. Its
distribution is visualized as follows:

This visualization reveals the most common reasons for loan applications, which can be crucial
for understanding different segments of loan applicants.
Default History

The Default column indicates whether the applicant has defaulted on a loan previously. Its
distribution is visualized using a count plot:

Understanding the proportion of applicants with a default history helps in assessing the risk
profile of the dataset.

Converting Categorical Columns to Numeric

To prepare the dataset for machine learning algorithms, we convert categorical columns to
numeric using LabelEncoder:

This step ensures that all features are in a numerical format, suitable for feeding into machine
learning models.
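
A sketch of this encoding step, assuming Home, Intent, Default, and Status are the categorical columns:

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer labels
categorical_cols = ["Home", "Intent", "Default", "Status"]
encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = encoder.fit_transform(df[col])
```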
Histogram of Dataset

Visualizing the distribution of numerical features using histograms:

Histograms provide a visual summary of the data distribution for each feature, highlighting
skewness, kurtosis, and the presence of outliers.
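
Pandas can draw histograms for all numerical columns in one call; a sketch:

```python
import matplotlib.pyplot as plt

# Histogram for every numerical feature in the DataFrame
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()
```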

Heatmap of Correlation Matrix

The heatmap visualizes the correlation matrix of the features, indicating how features are related
to each other and to the target variable:
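
With all columns numeric after encoding, the heatmap can be produced roughly as follows:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of all features, visualized as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```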
Correlations with Status

Analyzing the correlation of each feature with the status:

This bar plot highlights the features most strongly correlated with the target variable, providing
insights into which features might be most predictive of loan status.
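
One way to produce such a bar plot is to sort each feature's correlation with Status; a sketch:

```python
import matplotlib.pyplot as plt

# Correlation of each feature with the target, sorted for readability
corr_with_status = df.corr()["Status"].drop("Status").sort_values()
corr_with_status.plot(kind="barh", figsize=(8, 6))
plt.title("Feature Correlations with Loan Status")
plt.xlabel("Correlation coefficient")
plt.show()
```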
5. Model Building

Data Splitting

We start by splitting the dataset into training and testing sets, ensuring an unbiased assessment of
the model's performance. The training set is used to train the model, and the testing set is used to
evaluate its effectiveness on unseen data.
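
A typical split holds out a portion of the data for testing; the 80/20 ratio and random seed below are assumptions, since the report does not state them.

```python
from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = df.drop(columns=["Status"])
y = df["Status"]

# Hold out 20% of the rows for unbiased evaluation (split ratio assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```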

Training the Random Forest Model

We focus on the Random Forest algorithm, which combines multiple decision trees to improve
prediction accuracy and robustness. The Random Forest model is trained on the training data to
learn the patterns and relationships within the dataset.
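
Training then reduces to fitting the classifier on the training split; default hyperparameters are assumed here, since the report does not list them.

```python
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on the training data (default hyperparameters assumed)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
```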

Making Predictions

Once the model is trained, we use it to predict the loan statuses in the testing set. This step
involves applying the trained model to new data to assess its predictive capabilities.
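
Prediction is then a single call on the held-out features:

```python
# Predict loan statuses for the unseen test set
y_pred = rf_model.predict(X_test)
```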

Evaluating Model Performance

We evaluate the model's performance using several metrics; a sketch of the corresponding
scikit-learn calls follows the list.

- Accuracy Score: The Random Forest model achieved an accuracy score of 91.77%, indicating its high predictive capability.
- Confusion Matrix: Provides a detailed breakdown of the true positives, false positives, true negatives, and false negatives.
- Classification Report: Offers detailed metrics like precision, recall, and F1-score for each class.
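
A sketch of computing these metrics with scikit-learn, using the predictions from the previous step:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Overall accuracy on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))

# Per-class breakdown of correct and incorrect predictions
print(confusion_matrix(y_test, y_pred))

# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
```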

Feature Importances

Random Forest provides a measure of feature importance, indicating which features are most
influential in making predictions.
This bar plot visualizes the importance of each feature, helping to identify the most critical
factors influencing loan approval status.
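
The importances stored on the fitted model can be plotted directly; a sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Feature importances learned by the Random Forest, sorted for plotting
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", figsize=(8, 6))
plt.title("Random Forest Feature Importances")
plt.xlabel("Importance")
plt.show()
```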

6. Comparing Model Performance with Baseline Models

To ensure that the Random Forest model is the best choice, we compare its performance with
other baseline models like Logistic Regression and K-Nearest Neighbors.

The Random Forest model outperformed the baseline models, achieving an accuracy of 91.77%
compared to 80.58% for Logistic Regression and 82.82% for K-Nearest Neighbors. This
indicates that the Random Forest model is more effective in predicting credit risk, providing
more accurate and reliable results.
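
The comparison can be reproduced by training each baseline on the same split and reporting its test accuracy; the models below use default settings (an assumption), and the exact figures depend on the data and random seed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Baseline models trained on the same split as the Random Forest
baselines = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")
```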

Future Directions

To enhance the credit risk analysis model, future efforts should focus on advanced feature
engineering, handling imbalanced data through resampling or cost-sensitive learning, and
optimizing model performance with hyperparameter tuning and ensemble methods.
Incorporating external data sources, such as credit bureau scores and macroeconomic indicators,
can further improve predictive accuracy. Additionally, using tools like SHAP values for model
interpretation, ensuring regulatory compliance, and addressing ethical considerations will help
build a robust, transparent, and fair predictive model. Deploying and monitoring the model in a
real-time environment will ensure its ongoing effectiveness and reliability.
