
Model Learning Steps

The document outlines a step-by-step guide to building a machine learning model using Python, specifically focusing on a regression model with a dataset related to patient charges. Key steps include loading and understanding the dataset, data preprocessing, visualization, model building, evaluation, and prediction. The guide emphasizes the importance of data preparation and model evaluation metrics such as the coefficient of determination and mean squared error.

Uploaded by

Aryan Singh

7 Steps to Build a Machine Learning Model with Python

Machine learning is a subset of artificial intelligence that gives a
machine the ability to learn automatically from experience without
being explicitly programmed. This guide builds a regression model in
seven steps:

1. Loading the dataset

2. Understanding the dataset

3. Data preprocessing

4. Data visualization

5. Building a regression model

6. Model evaluation

7. Model prediction

Before getting started, please don’t forget to subscribe to my YouTube
channel, where I create content about AI, data science, machine
learning, and deep learning.

Let’s dive in!

1. Loading The Data

The dataset I’m going to use contains patient charges. I highly
recommend that you download this dataset and write the code along
with me. You can access my code from here. First, let’s import the
dataset. I’m going to load the dataset using Pandas. Let me first
import Pandas.

import pandas as pd

Pandas is an excellent library for data loading and data
preprocessing. Now, let me load the dataset with the read_csv
function.
data = pd.read_csv("[Link]")

Now, let’s take a look at the first rows of the dataset. To do this I’m
going to use the head method.
data.head()

As you can see, there are 7 columns: age, sex, body mass index,
number of children, smoking status, region, and charges.

2. Understanding The Dataset

Understanding the data is very important before building a machine
learning model. For example, let’s see the number of rows and
columns of the dataset. I’m going to use the shape attribute to do
this.
data.shape

As you can see, the dataset has 1338 rows and 7 columns. Now, I’m
going to use the info method to get more information about the
dataset.
data.info()

The output shows that there is no missing data in the dataset. You can
also use the isnull method, which marks each missing cell as True.
data.isnull()

Let me chain the sum method to count the missing values in each column.
data.isnull().sum()
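
To make the isnull and sum combination concrete, here is a minimal sketch on a made-up frame. The name toy and its values are assumptions for illustration only and are not taken from the patient dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the patient dataset) with two missing cells.
toy = pd.DataFrame({
    "age": [19, np.nan, 28],
    "bmi": [27.9, 33.8, np.nan],
    "charges": [100.0, 200.0, 300.0],
})

# isnull() marks every cell as True (missing) or False (present);
# sum() then counts the True values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summing once more gives the number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A column whose count is zero has no missing values, which is what the guide reports for every column of its dataset.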

As you can see, there is no missing data in the dataset. Knowing the
column types is very important for building a machine learning
model. Now, let’s take a look at the column types. I’m going to use
the dtypes attribute for this.
data.dtypes

3. Data Preprocessing
Let’s convert object types to category types.
data['sex'] = data['sex'].astype('category')
data['region'] = data['region'].astype('category')
data['smoker'] = data['smoker'].astype('category')

Let’s see the data types again.
data.dtypes

Now, let’s go ahead and take a look at the statistics of the numeric
variables with the describe method. If we transpose the result, the
statistics are easier to read.
data.describe().T

Now, let’s look at the mean charges for smokers and non-smokers.
To do this, let’s first group the rows with the groupby method. I’m
going to pass numeric_only=True so that the mean is computed only over
the numeric columns, and use the round method to keep two decimal
places.
smoke_data = data.groupby("smoker").mean(numeric_only=True).round(2)

Let’s see smoke_data.
smoke_data

As you can see, smokers pay more on average than non-smokers.
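
The grouping step can be sketched on a tiny made-up frame. The name toy and all values below are assumptions chosen so the group means are easy to check by hand; they are not rows from the patient dataset.

```python
import pandas as pd

# Hypothetical toy frame: two smokers and two non-smokers.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "age": [30, 40, 50, 60],
    "charges": [300.0, 100.0, 500.0, 200.0],
})

# Group the rows by the smoker column, average the numeric columns
# within each group, and round to two decimal places.
group_means = toy.groupby("smoker").mean(numeric_only=True).round(2)
print(group_means)
```

Here the smokers average (300 + 500) / 2 = 400 and the non-smokers (100 + 200) / 2 = 150, so the comparison between the two groups can be read directly from the charges column of the result.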

4. Data Visualization
You can understand the dataset better with data visualization. Now,
let’s look at the relationships between the numeric variables using
seaborn. First, let me import seaborn.
import seaborn as sns

Seaborn is a library built on top of matplotlib and is used mainly
for statistical plots. Now, let’s choose the plot style.
sns.set_style("whitegrid")

I’m going to use the pairplot function to see the relationships
between the numeric variables.
sns.pairplot(
    data[["age", "bmi", "charges", "smoker"]],
    hue="smoker",
    height=3,
    palette="Set1")

For example, as age increases, both smokers and non-smokers pay
more. Now, let’s look at the correlation between the numeric
variables.
sns.heatmap(data.corr(numeric_only=True), annot=True)

Notice that charges is correlated with the other numeric variables.
One-Hot Encoding
Now, I’m going to one-hot encode the categorical variables in the
dataset. This is very easy to do with Pandas: the get_dummies
function automatically converts categorical columns into one-hot
encoded columns. Let’s convert the categorical data.
data = pd.get_dummies(data)

Thus, only the categorical columns were converted to one-hot encoding.
Now let’s look at the columns of the dataset.
data.columns

As you can see, a new column has been created for each category of
each categorical variable. Using Pandas was very easy. Thanks, Pandas!
The dataset is now ready for building the model. Let’s go ahead and
build a regression model.
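
As a small illustration of what one-hot encoding does, here is a sketch on a made-up frame with two categorical columns. The name toy and its values are assumptions for illustration and are not rows from the patient dataset.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": pd.Categorical(["yes", "no", "no"]),
    "region": pd.Categorical(["southwest", "southeast", "southeast"]),
})

# get_dummies leaves numeric columns untouched and replaces each
# categorical column with one indicator column per category.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

Each new column is named after the original column and the category it flags, such as smoker_yes. If perfectly collinear indicator columns are a concern for a linear model, get_dummies also accepts drop_first=True, which drops one category per variable; the guide keeps all of them.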

5. Building a Regression Model

When building a model, you should start with the simplest model. If
its performance is not good enough, you can try more complex models.
I’ll build a linear regression model because the output variable,
charges, is numeric.

Before building a machine learning model, we need to determine the
input and output variables. The input variables are features. In
statistics, these are called independent variables. The output
variable is the target variable. In statistics, this variable is called the
dependent variable. Let’s assign the target variable charges to
variable y.
y = data["charges"]

If we drop the target variable, the remaining columns are the features.
X = data.drop("charges", axis=1)

Before the model is built, the dataset is split into training and
test sets. The model is built with the training data and evaluated
with the test data. You can use the train_test_split function in
scikit-learn to split the dataset easily. First, let’s import this
function.
from sklearn.model_selection import train_test_split

Let’s split the dataset into 80 percent training and 20 percent
testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.80,
    random_state=1)
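
To see what the split returns, here is a minimal sketch on made-up arrays. The names X_toy and y_toy and the ten-row size are assumptions for illustration; the call itself uses the same train_size and random_state arguments as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with 2 features each, plus 10 targets.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 80 percent of the rows go to training and the rest to testing.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.80,
    random_state=1)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

With ten rows, eight end up in the training set and two in the test set, and every row appears in exactly one of the two.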

Now, let’s build the model. Let me import the linear regression class
from scikit-learn.
from sklearn.linear_model import LinearRegression

Let me create an instance of the LinearRegression class.
lr = LinearRegression()

I’m going to build the model using the training data.
lr.fit(X_train, y_train)

6. Model Evaluation

Let’s take a look at the performance of the model. To do this, I’m
going to use the coefficient of determination. The closer this value is
to 1, the better the model. First, let’s take a look at the score of the
model on the test data.
lr.score(X_test, y_test).round(3)
#Output: 0.762
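
For reference, the coefficient of determination can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper r_squared and made-up numbers; it illustrates the formula and is not the scikit-learn implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared errors of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared errors of always predicting the mean of y_true.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true values and predictions.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.0]

print(r_squared(y_true_toy, y_pred_toy))
```

In this toy case the residual sum of squares is 0.5 and the total sum of squares is 5, so the score is about 0.9. A model that predicts perfectly scores 1, and a model that always predicts the mean scores 0.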

The coefficient of determination on the test data is greater than 0.7.
Our model is not bad. Of course, it would be better if it were closer to
1. Now, let’s see the score of the model on the training data.
lr.score(X_train, y_train).round(3)
#Output: 0.748

As you can see, the performance of the model on the training data is
close to its performance on the test data. If the score on the
training data were much higher than the score on the test data, it
would indicate an overfitting problem. To reduce overfitting, you can
use regularization, for example with ridge or lasso models.
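
As a rough sketch of what regularization looks like in scikit-learn, the example below fits plain, ridge, and lasso regressions on synthetic data. The data, the variable names, and the alpha values are all assumptions for illustration; in practice alpha is usually tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (an assumption, not the patient dataset):
# 50 rows, 5 features, two of which have no effect on the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=50)

# Ordinary least squares has no penalty on the coefficients.
ols = LinearRegression().fit(X_toy, y_toy)
# Ridge penalizes the sum of squared coefficients (L2 penalty).
ridge = Ridge(alpha=10.0).fit(X_toy, y_toy)
# Lasso penalizes the sum of absolute coefficients (L1 penalty).
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))
```

Both penalties shrink the coefficients relative to the unpenalized fit, which limits how closely the model can follow noise in the training data. Larger alpha values shrink more strongly.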

Now let’s take a look at another metric, mean squared error, to
evaluate the model. For this, let’s first predict the test data with the
predict method.
y_pred = lr.predict(X_test)

Now, let’s import the mean_squared_error metric.
from sklearn.metrics import mean_squared_error

I’m going to use this metric now. First, let me import the math
module because I’m going to calculate the square root of this metric.
import math

Let’s take a look at the square root of the mean squared error.
math.sqrt(mean_squared_error(y_test, y_pred))
#Output: 5956.45

This value is the root mean squared error: the model’s predictions
are typically off by about 5956.45, in the same units as charges.
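
The root mean squared error can also be worked out by hand, which may make the metric easier to interpret. The helper rmse and the numbers below are assumptions for illustration, not values from the patient dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true values and predictions.
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
error = rmse(y_true_toy, y_pred_toy)
print(error)
```

The mean of the squared errors here is 5/3, so the result is the square root of 5/3. Because of the squaring, large errors weigh more heavily than small ones.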

7. Model Prediction

Now, I’m going to predict the first row as an example. First, let’s
select the first row of the training data.
data_new = X_train[:1]

Let me predict the data with our model.
lr.predict(data_new)
#Output: 10508.42

Let’s take a look at the real value.
y_train[:1]
#Output: 10355.64

As you can see, our model predicted close to the real value.
