THE OXFORD COLLEGE OF SCIENCE
Department of Computer Applications
INTERNSHIP REPORT
ON
“LOAN PREDICTION”
Submitted in partial fulfilment of the requirements for the award of the Degree of
Bachelor of Computer Applications
Submitted by
S B Anusha (U03MS21S0394)
Machine Learning Internship at
PRINSTON SMART ENGINEERS
Under the guidance of
Mr. AKASH V
Department of Computer Science and Applications
Bangalore University
CERTIFICATE
Certified that the project work entitled “LOAN PREDICTION” was
carried out by S B ANUSHA (U03MS21S0394), a bona fide
student of The Oxford College of Science, 4th Sector, HSR Layout,
Bengaluru, in partial fulfilment of the requirements of the VI semester
(Machine Learning Project) of the Bachelor of Computer Applications,
during the year 2023 – 2024. It is certified that all corrections/suggestions
indicated for Internal Assessment have been incorporated in the report
deposited in the departmental library. The Internship Project report
has been approved as it satisfies the academic requirements in
respect of the Mini Project work prescribed for the said degree.
ACKNOWLEDGEMENT
While presenting this Machine Learning Project on “Loan
Prediction”, I feel it is my duty to acknowledge the help rendered to
me by various people.
I extend my humble and sincere gratitude to Dr. Susil Kumar Sahoo,
Head of Department of Computer Science and Applications, for
his great encouragement and valuable support.
I sincerely acknowledge the guidance and support of my internship
guide, Mr. Akash V, Prinston Smart Engineers, who provided
valuable advice throughout the course of the project.
ABSTRACT
Loan prediction is a critical task in the banking and financial
sector, aiming to predict the likelihood of a loan applicant defaulting
on their loan. Accurate predictions can significantly enhance decision-
making processes, reduce financial risk, and increase profitability for
financial institutions. This research focuses on developing a predictive
model for loan approval using machine learning techniques.
The dataset, sourced from a leading financial institution, includes
features such as applicant income, loan amount, credit history,
employment status, and other demographic variables. Various data
preprocessing techniques are applied to handle missing values, encode
categorical variables, and normalize numerical features. The study
explores several machine learning algorithms, including Logistic
Regression, Decision Trees, Random Forests, Support Vector
Machines, and Gradient Boosting.
By leveraging advanced machine learning techniques, this
research provides a reliable and efficient approach to loan prediction,
offering valuable insights for risk management and strategic planning
in the banking industry.
DECLARATION
I, S B ANUSHA, hereby declare that this project work entitled
LOAN PREDICTION is submitted in fulfilment for the award of the
degree of BACHELOR OF COMPUTER APPLICATIONS of
Bangalore University. I further declare that I have not submitted this
project report either in part or in full to any other university for the
award of any degree.
S B ANUSHA
(U03MS21S0394)
CONTENTS
1. Introduction
1.1 Problem Statement
1.2 Objectives
1.3 Future Scope
2. Requirements Specification
2.1 Software Requirements
2.2 Hardware Requirements
3. System Definition
3.1 Project Description
3.2 Importance of Loan Prediction
3.3 Working Description
3.4 Libraries Used
3.5 Dataset
3.6 Advantages
3.7 Disadvantages
4. Implementation
5. Snapshots
6. Conclusion
CHAPTER 1
INTRODUCTION
Arthur Samuel, a pioneer in the field of artificial intelligence and
computer gaming, coined the term “Machine Learning”. He defined
machine learning as – “A Field of study that gives computers the
capability to learn without being explicitly programmed.” The process
starts with feeding good quality data and then training our
machines (computers) by building machine learning models using the
data and different algorithms. The choice of algorithms depends on
what type of data we have and what kind of task we are trying to
automate.
How does ML work?
Gathering relevant data for the problem you are trying to solve.
This can come from various sources like databases, sensors,
online repositories, etc. The quality and quantity of data
significantly impact the performance of the machine learning
model.
Data Preprocessing- Raw data often needs to be cleaned and
formatted before it can be used to train a model. It includes steps
like filling in or removing the missing data points, scaling
features to common range or distribution, converting categorical
data into numerical formats and dividing the data into training
and testing sets.
Selecting an appropriate algorithm based on the nature of
dataset and building models on the training set.
Evaluating the trained model using the testing set to assess its
performance. Common metrics include accuracy, precision,
recall, F1-score, mean squared error, etc.
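The classification metrics named above can be computed directly from the counts of correct and incorrect predictions. The following is a minimal sketch in plain Python; the two label lists are made-up values used only for illustration (1 = loan approved, 0 = loan rejected) and do not come from the project dataset.

```python
# Hypothetical true labels and model predictions (illustrative values only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # approved, predicted approved
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # rejected, predicted rejected
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # rejected, predicted approved
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # approved, predicted rejected

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp)                           # how many predicted approvals were right
recall = tp / (tp + fn)                              # how many real approvals were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

For these lists the counts are TP=5, TN=3, FP=1 and FN=1, so the accuracy is 0.80. A real evaluation would also need to guard against division by zero when a class is never predicted.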
Prerequisites for learning machine learning include:
- Linear Algebra
- Statistics and Probability
- Calculus
- Graph Theory
- Programming Skills (Languages like Python, R, MATLAB,
C++, or Octave)
How do we split data in Machine Learning?
Train-Test Split: The training set is used to train the model and
includes the majority of the data. The testing set is used to evaluate
the model’s performance and includes the remaining data.
Validation Data: This part of the data is used for frequent
evaluation of the model fitted on the training dataset, and for
tuning the hyperparameters involved. This data plays its role
while the model is still being trained.
Testing Data: Once the model is completely trained, testing
data provides an unbiased evaluation. The model predicts
outputs for the inputs in the testing data. After
prediction, we evaluate the model by comparing its predictions
with the actual outputs present in the testing data.
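The three-way split described above can be sketched with two calls to scikit-learn's train_test_split. The data below is randomly generated and the 60/20/20 proportions are an assumption chosen for illustration; they are not the proportions used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 applicants, 3 numeric features, 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Step 1: hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets.
# 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used while tuning the model, and the test set is touched only once at the end, so that the final score remains an unbiased estimate.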
1.1 Problem Statement
Loans are a major requirement of the modern world, and banks
earn a major part of their total profit from them. Loans help
students manage their education and living expenses, and help
people buy houses, cars and other high-value items. But
when it comes to deciding whether an applicant’s profile
qualifies for a loan, banks have to consider
many aspects. This project mainly focuses on identifying the
customer segments that are eligible for a loan, so
that banks can specifically target these customers.
1.2 Objectives
The objectives of developing a loan prediction model using
machine learning can be outlined as follows:
Develop a system that can automatically assess and predict
the eligibility of loan applicants, reducing the time and
resources spent on manual evaluations.
Improve the accuracy of loan approval decisions, ensuring
that only applicants who meet certain criteria and have a
high likelihood of repayment are approved.
Minimize the risk of default by identifying high-risk
applicants through predictive modeling.
Ensure consistent and unbiased loan approval decisions by
removing subjective human judgements.
Provide faster loan approval or rejection feedback to
applicants, enhancing customer satisfaction.
Gain insights into the key factors that influence loan
approval and default rates.
Streamline operations and reduce the cost associated with
manual processing of loan applications.
Ensure that the loan process complies with regulatory
requirements and guidelines.
Develop a scalable solution that can handle increasing
volumes of loan applications without compromising
performance.
Maintain transparency and explainability in the decision-
making process to meet compliance standards.
1.3 Future Scope
1. Social Media and Online Behavior: Using data from social
media, online transactions and digital footprints to assess
credit worthiness.
2. IoT and Smart Devices: Leveraging data from smart
devices and IoT to gain insights into an applicant’s
financial behavior and stability.
3. Dynamic Scoring Models: Developing real-time credit
scoring systems that update with new data, providing more
accurate and current assessments.
4. Deep Learning: Utilizing deep neural networks to capture
complex patterns in large datasets for improved prediction
accuracy.
5. Transfer Learning: Applying knowledge gained from one
financial domain to another to enhance model
performance.
6. Fraud Detection: Integrating predictive models with fraud
detection systems to identify and mitigate fraudulent
applications in real-time.
7. Microloans and Nano Loans: Developing models that can
assess the creditworthiness for Microloans and Nano
loans, providing financial services to underserved
populations.
8. Ecosystem Partnerships: Collaborating with fintech
companies, banks, and other financial institutions to create
an interconnected ecosystem that leverages loan prediction
models.
9. Green Loans: Developing models that promote and
support green financing initiatives and sustainable
investments.
CHAPTER 2
REQUIREMENTS SPECIFICATION
2.1 SOFTWARE REQUIREMENTS
Operating System – Windows 10/11
Language used – Python
Jupyter Notebook
Libraries like NumPy, Pandas, Matplotlib, scikit-learn,
Seaborn
Version Control
Security
2.2 HARDWARE REQUIREMENTS
Processor
Processor Speed – 1 GHz
Memory – 2 GB RAM
SSD with 1 TB capacity
Mouse or any other pointing device
Keyboard
Display device – Color Monitor
CHAPTER 3
SYSTEM DEFINITION
3.1 Project Description
The “Loan Prediction” project aims to develop a machine
learning model that predicts the eligibility of applicants for loan
approval based on their personal and financial data. The goal is to
streamline the loan approval process, minimize default risks, and
enhance customer satisfaction through efficient, accurate, and
automated decision-making.
1. Data Collection: The project begins by obtaining the Loan
Prediction dataset which gathers historical loan application data
from a financial institution or open-source database.
2. Data Preprocessing: This process is essential to ensure data
quality and usability. It includes steps like handling missing
values, encoding categorical variables, and scaling numerical
features to prepare the dataset for analysis.
3. Feature Engineering: New features are created or existing
ones are modified to improve model performance, and
relevant features are selected based on statistical analysis and
domain knowledge.
4. Model Building: Here the data is split into training, validation
and testing sets. Various machine learning models like Logistic
Regression, Decision Trees, Random Forest, etc. are trained to
predict the loan approval status of the applicants.
5. Model Evaluation: The performance of the models is evaluated
using metrics such as accuracy, precision, recall, etc. The
performance is then compared across different models to choose
the suitable one.
6. Model Interpretation: The feature importance is analyzed to
understand the key factors influencing loan approval. This will
help to ensure that the model is interpretable and explainable to
stakeholders.
7. Deployment: A user interface or API is developed and
integrated into the loan processing system. The deployment
environment should be secure, scalable and efficient.
8. Monitoring and Maintenance: The model performance is
continuously monitored and updated with new data to maintain
accuracy and relevance.
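The model building and model evaluation steps above can be sketched as a loop that trains several candidate models and compares them on the same metric. This is an illustrative sketch only: the data is synthetic (generated with make_classification), and the hyperparameters and the use of 5-fold cross-validation are assumptions, not the settings of this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared loan dataset (not real applicant data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate models named in the project description.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")

best_model = max(scores, key=scores.get)
print("Best model by mean CV accuracy:", best_model)
```

Accuracy alone can be misleading when approvals and rejections are imbalanced, so the same loop can be repeated with scoring="precision" or scoring="recall" before choosing a model.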
3.2 Importance of Loan Prediction
1. Risk Mitigation
Assessing Creditworthiness: Accurate loan prediction
models help banks and financial institutions evaluate the
creditworthiness of applicants. By identifying potential
defaulters, institutions can make informed decisions about
whether to approve or reject loan applications.
2. Optimizing Lending Practices
Personalized Loan Products: Loan prediction models
enable lenders to tailor loan products to individual
applicants based on their risk profile. This personalization
can include adjusting interest rates, loan amounts, and
repayment terms.
3. Regulatory Compliance
Fair Lending Practices: Loan prediction models help
ensure that lending decisions are based on objective data,
reducing the risk of discrimination and bias. This is
crucial for complying with regulations that mandate fair
lending practices.
4. Operational Efficiency
Automation of Loan Processing: Predictive models can
automate the evaluation of loan applications, significantly
reducing the time and effort required for manual
assessments. This leads to faster loan processing times and
improved customer satisfaction.
5. Enhanced Customer Experience
Quick Decision Making: Automated loan prediction
models enable quicker decision-making, providing
applicants with rapid feedback on their loan applications.
This improves the overall customer experience and
satisfaction.
6. Continuous Improvement
Learning from Data: Machine learning models improve
over time as they are exposed to more data. This
continuous learning process helps institutions refine their
predictive capabilities and adapt to changing market
conditions and borrower behaviours.
7. Strategic Decision Making
Informed Decision-Making: Data-driven insights from
loan prediction models support strategic decision-making
at higher levels of management. This includes portfolio
management, risk assessment, and long-term planning.
3.3 Working Description
In this data-driven project, we will predict whether a loan
applicant will be approved or not based on their personal and financial
information.
1. Data Collection
Gather historical loan application data, which includes
features like applicant demographics, income details, loan
amount, credit history and loan status.
2. Data Preprocessing
Handling Missing Values
Encoding Categorical Variables
Feature Scaling
Exploratory Data Analysis (EDA)
3. Feature Engineering
Creating New Features: Combine existing features to
create new ones that may better represent the underlying
patterns (e.g., Total_Income = ApplicantIncome +
CoapplicantIncome).
Feature Selection: Use statistical tests, correlation analysis
and domain knowledge to select the most relevant features
for the model.
4. Data Splitting
Split the dataset into training, validation and testing sets.
5. Model Selection and Training
Train multiple machine learning models and select the
best performing one.
6. Model Evaluation
Evaluate the model using the validation set to tune the
model and avoid overfitting. The different metrics include
accuracy, precision, recall etc.
7. Model Interpretation
Understand which features are most important in the
model’s decision-making process. Tools like SHAP
(SHapley Additive exPlanations) can help with
interpretability.
8. Testing and Validation
Test the final model on the test set to evaluate its
performance on unseen data.
9. Deployment
Develop an API or user-interface for the model to be used
in real-time loan application processing.
10. Monitoring and Maintenance
Continuously monitor the model’s performance and
update it with new data to maintain its accuracy and
relevance.
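The preprocessing, feature engineering, splitting and training steps above can be combined into a single scikit-learn pipeline. The sketch below is an assumption about how such a workflow could look, not the code used in this project: the twelve rows are made-up values that only reuse the dataset's column names, and the choice of median/most-frequent imputation and Logistic Regression is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample that reuses the dataset's column names.
data = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000, 6000, 2500, 8000,
                        3500, 7000, 2000, 4500, 5200, 2800],
    "CoapplicantIncome": [0, 1500, 0, 2000, 0, 0,
                          1200, 3000, 0, 1800, 0, 900],
    "LoanAmount": [130, 100, np.nan, 200, 90, 250,
                   110, np.nan, 80, 150, 140, 95],
    "Credit_History": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Rural", "Urban",
                      np.nan, "Semiurban", "Rural", "Urban", "Semiurban", "Rural"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"],
})

# Feature engineering: combine the two income columns.
data["Total_Income"] = data["ApplicantIncome"] + data["CoapplicantIncome"]

numeric = ["Total_Income", "LoanAmount", "Credit_History"]
categorical = ["Property_Area"]

# Preprocessing: impute missing values, scale numbers, one-hot encode categories.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = data[numeric + categorical]
y = data["Loan_Status"].map({"Y": 1, "N": 0})  # encode the target as 1/0

# Data splitting, training and a quick check on the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```

Placing the imputers and the scaler inside the pipeline means they are fitted on the training rows only, which keeps information from the test rows out of the preprocessing step.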
3.4 Libraries used
For the Loan Prediction project, several Python libraries are
used for various tasks such as data manipulation, visualization and
analysis.
1. NumPy (‘import numpy as np’): NumPy (Numerical Python)
is a fundamental library for scientific computing in Python. It
provides support for arrays, matrices, and many mathematical
functions to operate on these data structures efficiently.
2. Pandas (‘import pandas as pd’): Pandas is a powerful and
flexible open-source data analysis and data manipulation library
for Python. It provides data structures and functions needed to
work on structured data seamlessly and efficiently.
3. Matplotlib (‘import matplotlib.pyplot as plt’): Matplotlib is a
comprehensive library for creating static, animated, and
interactive visualizations in Python. It is widely used in data
science, data analysis, and scientific research to create a variety
of plots and charts.
4. Seaborn (‘import seaborn as sns’): Seaborn is a powerful and
user-friendly data visualization library for Python, built on top
of Matplotlib. It provides a high-level interface for drawing
attractive and informative statistical graphics, making it a
popular choice among data scientists and analysts.
5. Sci-Kit Learn (‘from sklearn…’): Scikit-learn (often
abbreviated as sklearn) is one of the most popular and powerful
machine learning libraries in Python. It provides simple and
efficient tools for data mining and data analysis, built on
NumPy, SciPy, and Matplotlib.
3.5 Dataset
The data of individuals who applied for a loan is used for
the analysis. It contains various details of the applicants, such as
marital status, education, income, employment details, etc.
Key features include:
1. Loan_ID: Unique identifier for each loan application.
2. Gender: Gender of the applicant (Male/Female).
3. Married: Marital status of the applicant (Yes/No).
4. Dependents: Number of dependents (0, 1, 2, 3+).
5. Education: Education level (Graduate/Not graduate).
6. Self_Employed: Self-employment status (Yes/No).
7. ApplicantIncome: Monthly income of the applicant.
8. CoapplicantIncome: Monthly income of the co-applicant.
9. LoanAmount: Loan amount in thousands.
10. Loan_Amount_Term: Term of the loan in months.
11. Credit_History: Credit history (1: Good, 0: Bad)
12. Property_Area: Property location (Urban/Semiurban/Rural)
13. Loan_Status: Target variable indicating the loan approval status
(Y: Approved, N: Not Approved).
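Several of the features listed above are stored as text and have to be converted to numbers before most models can use them. The sketch below shows one possible encoding with pandas; the three rows and the Loan_ID values are hypothetical, and the specific mappings are assumptions made for illustration.

```python
import pandas as pd

# Three made-up rows using a subset of the dataset's columns.
sample = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],  # hypothetical identifiers
    "Married": ["Yes", "No", "Yes"],
    "Dependents": ["0", "3+", "1"],
    "Education": ["Graduate", "Not graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Loan_ID only identifies the row, so it is dropped before modelling.
encoded = sample.drop(columns=["Loan_ID"])

# Two-valued columns become 1/0.
encoded["Married"] = encoded["Married"].map({"Yes": 1, "No": 0})
encoded["Education"] = encoded["Education"].map({"Graduate": 1, "Not graduate": 0})
encoded["Loan_Status"] = encoded["Loan_Status"].map({"Y": 1, "N": 0})

# "3+" is treated as 3 so the column can be stored as an integer.
encoded["Dependents"] = encoded["Dependents"].replace("3+", "3").astype(int)

# Property_Area has three unordered values, so it is one-hot encoded.
encoded = pd.get_dummies(encoded, columns=["Property_Area"], dtype=int)

print(encoded)
```

Treating "3+" as 3 loses the information that some applicants have more than three dependents; keeping it as a separate category is an equally valid choice.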
3.6 Advantages
Accuracy: Advanced predictive models can assess the credit
worthiness of applicants more accurately than the traditional
methods, reducing the likelihood of defaults.
Operational Efficiency: Automating the loan approval process
reduces the need for manual intervention, cutting down on labor
costs and minimizing human errors.
Fraud Detection: Predictive models can identify suspicious
patterns that may indicate fraudulent applications, saving the
institution from potential financial losses.
Risk Assessment: Machine learning models can evaluate a vast
array of data points to predict the risk associated with each loan
applicant more comprehensively.
3.7 Disadvantages
While loan prediction models can significantly improve the
efficiency and accuracy of the loan approval process, they are not
without their drawbacks. Here are some potential disadvantages of
using loan prediction models:
Incomplete or Inaccurate Data: The model's accuracy is highly
dependent on the quality of the data. Incomplete or inaccurate
data can lead to incorrect predictions.
Over-fitting & Under-fitting: If the model is too complex, it
may overfit the training data, capturing noise instead of the
underlying pattern. This reduces its performance on new,
unseen data. Conversely, a model that is too simple may
underfit, failing to capture the underlying trends in the data.
Complexity: Some advanced models, such as ensemble
methods or neural networks, can be difficult to interpret and
explain to stakeholders. This lack of transparency can be
problematic in regulated industries like finance.
Privacy: The use of personal data in loan prediction models
raises privacy concerns. Ensuring compliance with data
protection regulations (e.g., GDPR) is essential.
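The over-fitting and under-fitting drawback listed above can be observed by comparing training and testing accuracy for decision trees of different depths. This is a small illustrative experiment on synthetic, deliberately noisy data (flip_y=0.2 randomly changes about 20% of the labels); it is not a result from the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for loan applications.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_depth=1 is a very simple tree, None lets the tree grow without limit.
results = {}
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train accuracy={results[depth][0]:.2f}, "
          f"test accuracy={results[depth][1]:.2f}")
```

The unlimited tree memorizes the training rows, including the noisy labels, so its training accuracy reaches 1.00 while its testing accuracy is typically much lower. A large gap between the two scores is the usual sign of over-fitting, and low scores on both suggest under-fitting.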
CHAPTER 4
IMPLEMENTATION (CODE)
Executed in Jupyter Notebook Environment
#import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#loading the dataset
df = pd.read_csv("loanpred_dataset.csv")
df.head(10)
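#analyzing the data
print(df.shape)
print(df.info())
print(df.describe())
#data wrangling
#checking if there are any null values
print(df.isnull())
print(df.isnull().sum())
#copying the dataset into another variable
lpp = df.copy(deep=True)
#filling the null values; assigning the result back avoids chained inplace calls
lpp["Gender"] = lpp["Gender"].fillna("Others")
lpp["Married"] = lpp["Married"].fillna("Unmarried")
lpp["Self_Employed"] = lpp["Self_Employed"].fillna("No")
#LoanAmount is the model input below, so its nulls are filled too (median assumed)
lpp["LoanAmount"] = lpp["LoanAmount"].fillna(lpp["LoanAmount"].median())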
#train and test data
x = lpp[["LoanAmount"]]
y = lpp.Loan_Status
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
    random_state=2)
#Linear Regression needs a numeric target, so Y/N is mapped to 1/0
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(x_train, y_train.map({"Y": 1, "N": 0}))
#KNN Algorithm
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(x_train, y_train)
xpred = knn.predict(x_test)
print(xpred)
#accuracy and per-class precision/recall on the test set
from sklearn.metrics import classification_report, accuracy_score
print(accuracy_score(y_test, xpred))
print(classification_report(y_test, xpred))
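#Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(x, y)
#predictions on the same data the tree was fitted on
ypred = clf.predict(x)
#printing confusion matrix and plot tree
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, ypred))
from sklearn.tree import plot_tree
#class_names must follow the sorted order of clf.classes_, i.e. N then Y
plot_tree(clf, feature_names=["LoanAmount"], class_names=["N", "Y"])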
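#Heat map
plt.figure(figsize=(10, 6))
#numeric_only=True keeps text columns out of the correlation (pandas 1.5 or later)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()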
CHAPTER 5
SNAPSHOTS
5.1 Libraries Imported and Dataset Loaded
Figure 5.1 Libraries Imported and Dataset Loaded
5.2 Filling the null values
Figure 5.2 Filling the null values
5.3 Correlation Heat map
Figure 5.3 Correlation Heatmap
5.4 Training the Data using Linear Regression
Figure 5.4 Training the data using Linear Regression
5.5 KNN Classification
Figure 5.5 KNN Classification
5.7 Decision Tree Classifier
Figure 5.7 Decision Tree Classifier
CHAPTER 6
CONCLUSION
Loan Prediction is a critical process for financial institutions,
enabling them to assess the risk associated with loan applications and
make data-driven decisions. By applying machine learning techniques,
lenders can significantly improve their risk management practices,
enhance the accuracy of their lending decisions and reduce the
incidence of loan defaults.
The implementation of loan prediction models offers significant
benefits, including risk mitigation, improved decision making,
operational efficiency, and regulatory compliance. It has also raised
awareness about the ethical considerations and responsibilities
associated with automated decision-making, especially in domains
with significant real-world consequences like finance. By leveraging
advanced data analytics and machine learning techniques, financial
institutions can develop robust models that accurately predict loan
defaults, thereby supporting sustainable and profitable lending
practices. Continuous innovation and improvement in these models
are necessary to address ongoing challenges and meet the dynamic
needs of the financial sector.