Data Preprocessing JavaPoint
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
We can also create our own dataset by gathering data from various APIs with Python and putting that data into a .csv file.
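As a rough illustration, a sketch of gathering data from an API and saving it to a .csv file might look like the following; the endpoint URL and the field layout are hypothetical placeholders, not part of this tutorial's dataset:
# Sketch: collect records from a (hypothetical) API and store them as a csv file
import requests
import pandas as pd

response = requests.get("https://example.com/api/records")   # placeholder endpoint
records = response.json()                                   # expected to be a list of dictionaries

df = pd.DataFrame(records)                                  # one row per record
df.to_csv("Dataset.csv", index=False)                       # ready to be read back with read_csv()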
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries. Each of these libraries is used to perform a specific job. There are three libraries that we will use for data preprocessing:
Numpy: The Numpy library is used for mathematical operations in the code. It will be imported as below:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
Matplotlib: The second library is matplotlib, a Python 2D plotting library; we import its pyplot sub-library as mtp to plot charts:
import matplotlib.pyplot as mtp
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library.
3) Importing the Datasets
Now we need to import the dataset that we have collected. Before importing it, we set the folder that contains the dataset as the working directory.
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, the Python file is kept along with the required dataset, and the current folder is set as the working directory.
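If we prefer to set the working directory from code rather than from the IDE, a minimal sketch is shown below; the path is just a placeholder to be replaced with the folder that actually holds Dataset.csv:
import os
os.chdir(r"C:\Users\your_name\ml_project")   # hypothetical folder containing Dataset.csv
print(os.getcwd())                           # confirm the current working directory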
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv file and perform various operations on it. Using this function, we can read a csv file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, it will successfully import the dataset into our code.
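Since read_csv() also accepts a URL, the same kind of file could be loaded remotely; the address below is only a placeholder for illustration:
data_set= pd.read_csv('https://example.com/data/Dataset.csv')   # hypothetical URL
print(data_set.head())                                          # quick look at the first five rows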
We can also check the imported dataset by clicking on the Variable Explorer section and then double-clicking on data_set. There we can see that indexing starts from 0, which is the default indexing in Python. We can also change the display format of our dataset by clicking on the format option.
Extracting independent variables: To extract the independent variables, we will use the iloc[] method of the Pandas library:
x= data_set.iloc[:,:-1].values
In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for the columns. We have used :-1 because we don't want to take the last column, as it contains the dependent variable. By doing this, we will get the matrix of features.
As we can see in the resulting output, x contains only the three independent variables.
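For reference, the same matrix of features can be built more explicitly; the column names Country, Age and Salary used below are an assumption about how this demo dataset labels its feature columns:
x= data_set.iloc[:, 0:3].values                     # first three columns selected by position
x= data_set[['Country', 'Age', 'Salary']].values    # or the same columns selected by (assumed) name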
Extracting the dependent variable: To extract the dependent variable, we will again use the iloc[] method:
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of the dependent variable.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using the Python language for machine learning, then extracting the matrix of features and the dependent variable is mandatory, but for the R language it is not required.
4) Handling Missing Data
The next step of data preprocessing is to handle missing values in the dataset, since missing data can create problems for a machine learning model. There are mainly two ways to handle missing data:
By deleting the particular row: The first way is commonly used to deal with null values: we just delete the specific row or column that contains null values. But this way is not very efficient, and removing data may lead to a loss of information that reduces the accuracy of the output.
By calculating the mean: In this way, we calculate the mean of the column or row that contains the missing value and put it in the place of the missing value. This strategy is useful for features that have numeric data, such as age, salary, year, etc. Here, we will use this approach.
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values='NaN', strategy='mean', axis=0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the means of the remaining values in the corresponding columns.
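Note that the Imputer class used above belongs to older scikit-learn releases; it was removed in version 0.22. On recent versions, an equivalent sketch uses SimpleImputer with the same column slice and the same mean strategy:
#Replacing missing data with the column mean on recent scikit-learn versions
import numpy as np
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3]= imputer.fit_transform(x[:, 1:3])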
5) Encoding Categorical Data
Categorical data is data which has some categories, such as the Country and Purchased variables in our dataset. Since a machine learning model works on mathematics and numbers, these categorical variables must be encoded into numbers. We will start with the Country variable, using the LabelEncoder class of the sklearn.preprocessing library:
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In our case, there are three country categories, and as we can see in the above output, they are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some order or correlation between the countries, which would produce wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables that take only the values 0 and 1. To create them for the Country column, we use the OneHotEncoder class of the sklearn.preprocessing library:
#Encoding for dummy variables
from sklearn.preprocessing import OneHotEncoder
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
Output:
As we can see in the above output, the country variable has been encoded into 0s and 1s and divided into three dummy columns.
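The categorical_features parameter of OneHotEncoder used above was also removed in scikit-learn 0.22; on recent versions, the usual replacement is a ColumnTransformer. A minimal sketch that one-hot encodes column 0 and leaves the other columns unchanged:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= ct.fit_transform(x)   # dummy columns come first, followed by the untouched columns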
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, Purchased, we only use the labelencoder_y object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
6) Splitting the Dataset into Training and Test Set
If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, its performance will decrease. So we always try to make a machine learning model that performs well on the training set and also on the test dataset. Here, we can define these datasets as:
Training set: A subset of the dataset used to train the machine learning model, for which we already know the output.
Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
o In the above code, the first line is used for splitting the arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output, which are:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which tells the dividing ratio of the training and testing sets.
o The last parameter, random_state, is used to set a seed for the random generator so that you always get the same result; the most commonly used value for it is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the Variable Explorer section. There, the x and y variables are divided into four different variables with the corresponding values.
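As a quick sanity check (an optional addition, not part of the original steps), we can print the shapes of the new variables; with a 10-row dataset and test_size=0.2, we expect 8 training rows and 2 test rows:
print(x_train.shape, x_test.shape)   # e.g. (8, 5) (2, 5) after one-hot encoding
print(y_train.shape, y_test.shape)   # (8,) (2,)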
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates any other variable. There are two main ways to perform feature scaling:
Standardization: x' = (x - mean(x)) / standard deviation of x
Normalization: x' = (x - min(x)) / (max(x) - min(x))
Here, we will use the standardization method for our dataset. For this, we import the StandardScaler class of the sklearn.preprocessing library and apply it to the training and test sets:
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Note that we call fit_transform() on the training set but only transform() on the test set, because the scaler has already been fitted to the training data.
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test. As we can see in that output, all the variables are now on a common scale, with values roughly between -1 and 1.
Note: Here, we have not scaled the dependent variable because it has only the two values 0 and 1. But if the dependent variable had a wider range of values, then we would also need to scale it.
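If normalization were preferred over standardization, scikit-learn provides a MinMaxScaler that follows the same fit/transform pattern; a minimal sketch using the same training and test variables:
from sklearn.preprocessing import MinMaxScaler
mm_x= MinMaxScaler()                  # rescales each feature to the [0, 1] range by default
x_train= mm_x.fit_transform(x_train)
x_test= mm_x.transform(x_test)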
Now, in the end, we can combine all the steps together to make our
complete code more understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values='NaN', strategy='mean', axis=0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps
together. But there are some steps or lines of code which are not
necessary for all machine learning models. So we can exclude them from
our code to make it reusable for all models.
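As an illustration of that idea, the generic steps could be wrapped into a single helper. The sketch below uses the modern scikit-learn API; the column lists and the target name are parameters to be adapted per dataset, and the example call at the end assumes the column names of this demo dataset:
#Sketch of a reusable preprocessing helper (modern scikit-learn API, assumptions noted above)
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

def preprocess(csv_path, target_column, numeric_columns, categorical_columns):
    data= pd.read_csv(csv_path)
    x= data.drop(columns=[target_column])
    y= data[target_column].values
    #impute numeric columns with the mean, one-hot encode categorical columns
    transformer= ColumnTransformer(
        [('num', SimpleImputer(strategy='mean'), numeric_columns),
         ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)],
        sparse_threshold=0)
    x= transformer.fit_transform(x)
    #split first, then scale using statistics learned on the training set only
    x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)
    st_x= StandardScaler()
    x_train= st_x.fit_transform(x_train)
    x_test= st_x.transform(x_test)
    return x_train, x_test, y_train, y_test

#Example call with the column names assumed for this demo dataset
x_train, x_test, y_train, y_test= preprocess('Dataset.csv', 'Purchased', ['Age', 'Salary'], ['Country'])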