
How To Prepare Your Dataset For Machine Learning In Python
By Krunal | Last updated Jul 25, 2018

How To Prepare Dataset For Machine Learning in Python. Machine Learning is
all about training your model on existing data so it can predict future values,
which means we need the right data to train our model. In real life, we do not
always have data that is ready to work with. If the data is not in the right
shape, we need to prepare it before we can start training a model. So in this
post, we will see, step by step, how to transform our initial data into training
and test data. For this example, we use Python libraries like scikit-learn,
numpy, and pandas.

Content Overview
 1 Prepare Dataset For Machine Learning in Python
 2 #Steps To Prepare The Data.
 3 #1: Get The Dataset.
 4 #2: Handle Missing Data.
 5 #3: Encode Categorical Data.
 6 #4: Split the Dataset into Training Set and Test Set.
 7 #5: Feature Scaling

Prepare Dataset For Machine Learning in Python
We use the Python programming language to prepare the dataset. To prepare a
dataset, we need to perform the following steps.

#Steps To Prepare The Data.

1. Get the dataset and import the libraries.
2. Handle missing data.
3. Encode categorical data.
4. Split the dataset into the Training set and Test set.
5. Feature scaling, if all the columns are not scaled correctly.

So, we will apply all the steps to the dataset one by one and prepare the final
dataset on which we can apply regression and other algorithms.

#1: Get The Dataset.

Okay, now we are going to use the Indian Liver Patient data. So we first prepare
a complete dataset of this kind, and I am putting the link here to download it.
Remember, this is not the real dataset, just a demo dataset that looks like the
actual one. You can get the real dataset from this link.

Download File: patientData

Now, we need to create a project directory. So let us make one using the
following command.

mkdir predata

Now go into the directory.

cd predata

We need to move the CSV file inside this folder.

Now, open the Anaconda Navigator software. If you are new to Anaconda, then
please check out How To Get Started With Machine Learning In Python. After
opening Navigator, you can see a screen like the one below.

Now, launch the Spyder application and navigate to your project folder. Since
we have already moved the patientData.csv file there, you can see that file in
the folder.
Okay, now we need to create one Python file called datapre.py and start
importing the mathematical libraries.

Write the following code inside the datapre.py file, so that your file looks
like the one below. Remember, we are using Python 3.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: your name


"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, select the three import statements and hit Command + Enter; in the console
at the bottom right you can see that the code runs successfully.

That means we have successfully imported the libraries. If you get an error
instead, then most likely the numpy, pandas, or matplotlib library is missing,
so you need to install it, and that is it.

#2: Handle Missing Data.

In real-world data, missing values happen quite a lot. If you look at a real
dataset, like one for patients, there is always missing data. To train the
model correctly, we need to fill in the missing data somehow; otherwise, the
model will mispredict the values. Luckily, libraries are already available to
do that; we just need to use the proper function. Now, in our dataset there is
missing data, so we need to fill it in with the mean values or use some other
algorithm. In this example, we are using the MEAN to supply the values. So let
us do that.

But first, let us divide the dataset into our features X and target Y.

Okay, now write the following code after importing the libraries.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: krunal
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('patientData.csv')

Now, select the following line and hit Command + Enter.



dataset = pd.read_csv('patientData.csv')

Okay, so we have included our initial dataset, and you can see it here. Where a
value is empty, NaN is displayed, so we need to replace it with the MEAN
values. So let us do that.

Write the following code.

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

So, here in X we have selected all the columns except the last one, and the
last column will be our Y.

Remember, indexes start from 0, and -1 means the last column. So for X we are
selecting all the columns except the last column.

For Y, we have explicitly selected the fourth column, whose index is 3; in this
dataset it is the last column.
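
To make the slicing concrete, here is a minimal sketch on a toy frame with the
same shape; the column names are hypothetical stand-ins for the patient data.

import pandas as pd

# Hypothetical four-column frame shaped like patientData.csv
df = pd.DataFrame({
    'Gender': ['Female', 'Male'],
    'Age': [45.0, 60.0],
    'Albumin': [3.1, 4.0],
    'Liver_Disease': [1, 0],
})

X = df.iloc[:, :-1].values  # all rows, every column except the last
Y = df.iloc[:, 3].values    # all rows, the fourth column (index 3)
print(X)  # [['Female' 45.0 3.1] ['Male' 60.0 4.0]]
print(Y)  # [1 0]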

Okay, now we need to handle the missing data. We will use the Scikit-learn
library.

Write the following code.

...

from sklearn.preprocessing import Imputer


imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

So, here we have used the Imputer module with the strategy 'mean' to fill the
missing values with the mean of each column. Run the above lines and type X in
the console, and you can see something like below. Here, columns 1 and 2 have
missing values, but we have written 1:3 because the upper bound is excluded;
that is why we have taken 1 and 3, and it works fine. Finally, we transform the
column values which had NaN, and now we have the filled values.

Here, you can see that the missing values are filled with the mean value of
each particular column.
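
Note that in newer scikit-learn versions (0.22 and later) the Imputer class was
removed; SimpleImputer from sklearn.impute is the replacement. A minimal
equivalent sketch, assuming the same X as above:

import numpy as np
from sklearn.impute import SimpleImputer

# Same behavior as the old Imputer: fill NaN cells with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])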

So, we have handled the missing data. Now, head over to the next step.
#3: Encode Categorical Data.

In our dataset, there are two categorical columns:

1. Gender
2. Liver Disease

So, we need to encode these two columns of data.

# Encode Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder


labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

Here, we have encoded the values of the first column. Now, there are only two
cases in the first column, and those are Female and Male. LabelEncoder assigns
the labels in alphabetical order, so after the transform the values are 0 for
Female and 1 for Male.

Run the above lines and see the changes in the categorical data. The
OneHotEncoder then expands that single encoded column into two indicator
columns, one for Female and one for Male, with a 1 where the row matches. That
is why the data goes from 3 columns to 4 columns.
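
As an aside, the categorical_features argument was removed from OneHotEncoder
in newer scikit-learn releases; the usual replacement is ColumnTransformer. A
minimal sketch, assuming the same X as above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer([('gender', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)  # call .toarray() on the result if it comes back sparse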
 

#4: Split the Dataset into Training Set and Test Set.

Now, generally, we split the data with a ratio of 70% for the training data and
30% for the test data. For our example, we split it into 80% for the training
data and 20% for the test data.
Write the following code inside Spyder.

# Split the data between the Training Data and Test Data

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Run the code, and you get four more variables, so we now have a total of seven
variables.

So, here we have split both X and Y: X becomes X_train and X_test, and Y
becomes Y_train and Y_test.

So, you have 80% of the data in X_train and Y_train and 20% of the data in
X_test and Y_test.
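
A quick sanity check on the split, just printing the shapes:

# X_test and Y_test should hold roughly 20% of the rows
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)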

#5: Feature Scaling

In a general scenario, many machine learning algorithms are based on Euclidean
distance. Here, the Albumin column and the Age column have entirely different
ranges of values, so we need to transform those values onto a comparable range.
That is why this is called feature scaling. We need to scale the values of
columns like Age, so let us scale X_train and X_test.
# Feature Scaling

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Here, we do not need to scale Y because it is already an encoded 0/1 label. Now
run the above code and inspect X_train in the console.

Here, we can see that all the values are appropriately scaled, and you can
check the X_test variable as well.
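
If you want to verify the scaling numerically, each column of X_train should
now have a mean of about 0 and a standard deviation of about 1 (the test set
will be close, since it reuses the training statistics):

import numpy as np

print(np.round(X_train.mean(axis=0), 2))  # approximately 0 for every column
print(np.round(X_train.std(axis=0), 2))   # approximately 1 for every column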

So, we have successfully cleaned and prepared the data.

Here is the final code of our datapre.py.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: krunal
"""

# Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing Dataset

dataset = pd.read_csv('patientData.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# Handing Missing Dataset

from sklearn.preprocessing import Imputer


imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encode Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder


labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

# Split the data between the Training Data and Test Data

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

So, we have successfully prepared the dataset for machine learning in Python.

Machine learning involves very complex computation, and everything depends on
how you get the data and what condition it is in. Based on the condition of the
data, you start preprocessing it and splitting it into training and test sets.

Finally, Prepare Dataset For Machine Learning in Python is over. Thanks for
reading.

Basically, you have three datasets: training, validation, and testing.

You train the classifier using the training set, tune the parameters using the
validation set, and then test the performance of your classifier on the unseen
test set. An important point to note is that during training, only the training
and/or validation set is available to the classifier. The test set must not be
used while training the classifier; it only becomes available when testing it.
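
A minimal sketch of carving out all three sets with two calls to
train_test_split (the 60/20/20 ratio here is just illustrative):

from sklearn.model_selection import train_test_split

# First split off 40% of the data, then cut that portion in half
# to get the validation and test sets
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.4, random_state=0)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=0)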

There is no one way of choosing the size of the training/testing sets; people
apply heuristics such as 10% testing and 90% training. However, doing so can
bias the classification results, and the results may not generalize. A
well-accepted method is N-fold cross validation, in which you randomize the
dataset and create N (almost) equal-size partitions, then choose one partition
for testing and the remaining N-1 partitions for training the classifier.
Within the training set you can further employ another K-fold cross validation
to create a validation set and find the best parameters. You repeat this
process N times to get an average of the metric. Since we want to get rid of
classifier 'bias', we can repeat the whole procedure M times (randomizing the
data and splitting it into N folds each time) and take the average of the
metric. Cross-validation is almost unbiased, but it can be misused if the
training and validation sets come from different populations or if knowledge
from the training set leaks into the test set.
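
As a sketch of what N-fold cross validation looks like in scikit-learn,
assuming the X and Y arrays prepared earlier (LogisticRegression here is just a
placeholder model, not something the article prescribes):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, Y, cv=10)  # 10 folds, one held out per round
print(scores.mean(), scores.std())          # average metric and its spread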
