Model Learning Steps

The document outlines a step-by-step guide to building a machine learning model using Python, specifically focusing on a regression model with a dataset related to patient charges. Key steps include loading and understanding the dataset, data preprocessing, visualization, model building, evaluation, and prediction. The guide emphasizes the importance of data preparation and model evaluation metrics such as the coefficient of determination and mean squared error.

Uploaded by

Aryan Singh

7 Steps to Build a Machine Learning Model with Python

Machine learning is a subset of artificial intelligence that gives machines the ability to learn automatically from experience without being explicitly programmed. In this guide, I’ll build a regression model in seven steps:

1. Loading the dataset

2. Understanding the dataset

3. Data preprocessing

4. Data visualization

5. Building a regression model

6. Model evaluation

7. Model prediction

Before getting started, please don’t forget to subscribe to my YouTube channel, where I create content about AI, data science, machine learning, and deep learning.

Let’s dive in!

1. Loading The Data

The dataset I’m going to use contains patient charges. I highly recommend that you download this dataset and write the code along with me. You can access my code from here. First, let’s import the dataset. I’m going to load the dataset using Pandas. Let me first import Pandas.

import pandas as pd

Pandas is an excellent library for data loading and data preprocessing. Now, let me load the dataset with the read_csv function.
data = pd.read_csv("[Link]")

Now, let’s take a look at the first rows of the dataset. To do this I’m
going to use the head method.
data.head()

As you can see, there are 7 columns: age, sex, body mass index (bmi), number of children, smoker, region, and charges.
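
If you don’t have the file at hand yet, here is a minimal sketch of the same loading step. The three rows below are made up for illustration only (they are not taken from the real dataset), and the names csv_text and sample are my own. It relies on the fact that read_csv accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical rows invented for this sketch; NOT the real dataset.
csv_text = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northwest,3200.50
47,male,31.2,2,yes,southeast,28950.75
33,male,27.8,1,no,northeast,5400.00
"""

# read_csv takes a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows of the frame
print(sample.shape)    # (number of rows, number of columns)
```

With the real file, you would pass its path to read_csv instead of the StringIO object.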

2. Understanding The Dataset

Understanding the data is very important before building a machine learning model. For example, let’s see the number of rows and columns of the dataset. I’m going to use the shape attribute to do this.
data.shape

As you can see, the dataset has 1338 rows and 7 columns. Now, I’m
going to use the info method to get more information about the
dataset.
data.info()

There is no missing data in the dataset. You can also use the isnull method to flag the missing values.
data.isnull()

Let me chain the sum method to count the missing values in each column.
data.isnull().sum()

As you can see, there is no missing data in the dataset. Knowing the
column types is very important for building a machine learning
model. Now, let’s take a look at the column types. I’m going to use
the dtypes attribute for this.
data.dtypes
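
Since this dataset has no missing values, the counts above are all zero. To show what these checks look like when something is missing, here is a small sketch on a made-up frame (the name df and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values, for illustration only.
df = pd.DataFrame({
    "age": [25, np.nan, 33],
    "smoker": ["no", "yes", None],
    "charges": [3200.5, 28950.75, 5400.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dtypes shows the type pandas inferred for each column.
print(df.dtypes)
```

Note that the missing value turns the age column into a float column, which is one reason to check the column types after checking for missing data.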

3. Data Preprocessing
Let’s convert object types to category types.
data['sex'] = data['sex'].astype('category')
data['region'] = data['region'].astype('category')
data['smoker'] = data['smoker'].astype('category')

Let’s see the data types again.

data.dtypes
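
To see what the category type does, here is a minimal sketch on a made-up column (the values are for illustration only). A categorical column stores each distinct label once and keeps an integer code per row.

```python
import pandas as pd

# Made-up column, for illustration only.
df = pd.DataFrame({"region": ["southwest", "southeast", "southwest", "northwest"]})
df["region"] = df["region"].astype("category")

print(df["region"].dtype)                  # category
print(list(df["region"].cat.categories))   # distinct labels, sorted
print(df["region"].cat.codes.tolist())     # integer code of each row
```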

Now, let’s go ahead and take a look at the statistics of the numeric variables with the describe method. If we transpose the result, the statistics are easier to read.
data.describe().T

Now, let’s look at the mean charges for smokers and non-smokers. To do this, let’s first group with the groupby method. Passing numeric_only=True skips the categorical columns, which cannot be averaged. I’m going to use the round method to keep only two decimal places.
smoke_data = data.groupby("smoker").mean(numeric_only=True).round(2)

Let’s see smoke_data.

smoke_data

As you can see, smokers pay more than non-smokers.
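
Here is a minimal sketch of the same grouping on a made-up frame (the values are assumptions for illustration only). It selects the charges column before averaging, which is an alternative way to keep non-numeric columns out of the mean.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "sex": ["male", "female", "female", "male"],
    "charges": [30000.0, 4000.0, 26000.0, 6000.0],
})

# Group by smoker, then average only the charges column.
mean_charges = df.groupby("smoker")["charges"].mean().round(2)
print(mean_charges)
```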

4. Data Visualization
You can understand the dataset better with data visualization. Now, let’s look at the relationships between the numeric variables using seaborn. First, let me import seaborn.
import seaborn as sns

Seaborn is a library built on top of matplotlib and is especially useful for statistical plots. Now, let’s choose the plot style.
sns.set_style("whitegrid")

I’m going to use the pairplot function to see the relationships between the numeric variables.
sns.pairplot(
    data[["age", "bmi", "charges", "smoker"]],
    hue="smoker",
    height=3,
    palette="Set1")

For example, as age increases, charges rise for both smokers and non-smokers. Now, let’s look at the correlation between the numeric variables.
sns.heatmap(data.corr(numeric_only=True), annot=True)

Notice that there is a relationship between charges and the other numeric variables.
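
The heatmap is drawn from a correlation matrix. Here is a small sketch of that matrix on made-up data (the values are assumptions for illustration only), using numeric_only=True so that text columns are left out.

```python
import pandas as pd

# Made-up values where charges grows exactly linearly with age.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0],
    "region": ["a", "b", "a", "b"],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)
print(corr)
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.
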
One-Hot Encoding
Now, I’m going to one-hot encode the categorical variables in the dataset. This is very easy to do with Pandas: the get_dummies function automatically converts categorical columns into one-hot encoded columns. Let’s convert the categorical data.
data = pd.get_dummies(data)

Thus, only categorical data were converted to one-hot encoding.
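
Here is a minimal sketch of what get_dummies does, on a made-up frame (the values are for illustration only). The drop_first option shown at the end is not used in this guide; it is a real get_dummies parameter that drops one column per category, which some people prefer for linear models.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({
    "age": [25, 47, 33],
    "smoker": ["no", "yes", "no"],
    "region": ["northwest", "southeast", "northwest"],
})

# Numeric columns are kept; each category gets its own indicator column.
encoded = pd.get_dummies(df)
print(list(encoded.columns))

# Optional variant: drop the first level of each categorical column.
encoded_drop = pd.get_dummies(df, drop_first=True)
print(list(encoded_drop.columns))
```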

Now let’s look at the columns of the dataset.
data.columns

As you can see, new columns have been created for each
subcategory. Using Pandas was very easy. Thanks, Pandas! Thus, the
dataset is ready to build the model. Let’s go ahead and build a
regression model.

5. Building a Regression Model

When building a model, you should start with the simplest model. If you don’t get good performance, you can try more complex models. I’ll build a linear regression model because the output variable, charges, is numeric.

Before building a machine learning model, we need to determine the input and output variables. The input variables are features. In statistics, these are called independent variables. The output variable is the target variable. In statistics, this variable is called the dependent variable. Let’s assign the target variable charges to variable y.
y = data["charges"]

If we drop the target variable, the remaining columns are the features.

X = data.drop("charges", axis=1)

Before the model is built, the dataset is split into training and
testing. The model is built with the training data, and the model is
evaluated with the test data. You can use the train_test_split
method in scikit-learn to split the dataset into training and testing.
With this method, you can easily split the dataset. First, let’s import
this method.
from sklearn.model_selection import train_test_split

Let’s split the dataset into 80 percent training and 20 percent
testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.80,
    random_state=1)
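
To check what the split does, here is a small sketch on made-up arrays (X_demo and y_demo are assumptions for illustration only). It shows the resulting sizes and that a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=1)
print(X_tr.shape, X_te.shape)   # 8 training rows, 2 test rows

# The same random_state gives the same split on every run.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=1)
print(np.array_equal(X_te, X_te2))
```
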

Now, let’s build the model. Let me import the linear regression class
from scikit-learn.
from sklearn.linear_model import LinearRegression

Let me create an instance of the LinearRegression class.

lr = LinearRegression()

I’m going to build the model using the training data.

lr.fit(X_train, y_train)
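
To see what fit learns, here is a minimal sketch on made-up data that follows y = 2x + 1 exactly (the data and the name model are assumptions for illustration only). After fitting, the slope is stored in coef_ and the constant term in intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 2x + 1, with no noise.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo[:, 0] + 1.0

model = LinearRegression()
model.fit(X_demo, y_demo)

print(model.coef_)        # learned slope, one value per feature
print(model.intercept_)   # learned constant term
```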

6. Model Evaluation

Let’s take a look at the performance of the model. To do this, I’m going to use the coefficient of determination. The closer this value is to 1, the better the model. First, let’s take a look at the score of the model on the test data.
round(lr.score(X_test, y_test), 3)

#Output: 0.762

The coefficient of determination on the test data is greater than 0.7. Our model is not bad. Of course, it would be better if it were closer to 1. Now, let’s see the score of the model on the training data.
round(lr.score(X_train, y_train), 3)
#Output: 0.748
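
For reference, the coefficient of determination is 1 minus the ratio of the squared prediction errors to the squared deviations from the mean. Here is a minimal sketch that computes it by hand on made-up numbers (y_true and y_hat are assumptions for illustration only) and compares it with scikit-learn’s r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_hat))
```
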

As you can see, the performance of the model on the training data is close to its performance on the test data. If the score on the training data were much higher than the score on the test data, it would mean that there is an overfitting problem. How can you solve an overfitting problem? One option is regularization; ridge or lasso models can be used for this.
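
Here is a minimal sketch of how ridge and lasso can be swapped in for plain linear regression. The data is synthetic and the alpha values are arbitrary assumptions for illustration; on a real problem you would tune alpha, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, two of which have no effect on the target.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_syn = X_syn @ true_coef + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, train_size=0.80, random_state=1)

# alpha controls the regularization strength (values chosen arbitrarily here).
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(name, round(scores[name][0], 3), round(scores[name][1], 3))
```

All three classes share the same fit, predict, and score methods, so trying a regularized model only requires changing the line that creates it.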

Now let’s take a look at another metric, mean squared error, to evaluate the model. For this, let’s first predict the test data with the predict method.
y_pred = lr.predict(X_test)

Now, let’s import the mean_squared_error metric.

from sklearn.metrics import mean_squared_error

I’m going to use this metric now. First, let me import the math
module because I’m going to calculate the square root of this metric.
import math

Let’s take a look at the square root of the mean squared error.
math.sqrt(mean_squared_error(y_test, y_pred))
#Output: 5956.45

This value is the root mean squared error: the typical size of the model’s prediction error is about 5956.45, in the same units as charges.
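
To make the metric concrete, here is a minimal sketch that computes the root mean squared error in two ways on made-up numbers (y_true and y_hat are assumptions for illustration only).

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# Mean of the squared errors, then its square root.
mse = mean_squared_error(y_true, y_hat)
rmse = math.sqrt(mse)

# The same calculation written out with NumPy.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(rmse, rmse_manual)
```

Because the errors are squared before averaging, one large error (30 here) affects the result more than several small ones.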

7. Model Prediction

Now, I’m going to predict the first row as an example. First, let’s
select the first row of the training data.
data_new = X_train[:1]

Let me predict the data with our model.

lr.predict(data_new)
#Output: 10508.42

Let’s take a look at the real value.

y_train[:1]
#Output: 10355.64

As you can see, our model predicted close to the real value.
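
The example above predicts a row that is already encoded. For a brand-new record, the columns must match the ones the model was trained on. Here is a minimal sketch of one way to do that, on made-up data (the frames, values, and the names train, new_patient, and model are all assumptions for illustration, not part of the dataset used in this guide). It uses reindex to add any one-hot columns that the new record is missing.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data, for illustration only.
train = pd.DataFrame({
    "age": [25, 47, 33, 52],
    "smoker": ["no", "yes", "no", "yes"],
    "charges": [3000.0, 29000.0, 5000.0, 32000.0],
})
X_demo = pd.get_dummies(train.drop("charges", axis=1), dtype=float)
y_demo = train["charges"]
model = LinearRegression().fit(X_demo, y_demo)

# A new, hypothetical patient: encoding it alone yields only smoker_yes.
new_patient = pd.DataFrame({"age": [40], "smoker": ["yes"]})
new_X = pd.get_dummies(new_patient, dtype=float)

# Align with the training columns; missing indicator columns become 0.
new_X = new_X.reindex(columns=X_demo.columns, fill_value=0.0)

prediction = model.predict(new_X)
print(list(new_X.columns))
print(prediction)
```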
