Data Acquisition Python
1.1.2 Requirements
1. Estimated time of completion: 30 minutes
2. Have a basic knowledge of Google Colaboratory.
3. Have a basic knowledge of the Python programming language.
1.3 Basic Insights from the Dataset
After collecting the data, we need to perform data exploration steps to better understand the data.
Data exploration steps include:
• Descriptive statistics: These are simple statistical calculations to describe the data, such
as mean, standard deviation, etc.
• Graphical analysis: This is the use of charts and graphs to visualize the data.
• Cluster analysis: This is the division of data into groups based on certain attributes.
Descriptive statistics help us understand the basic attributes of the data, such as mean, minimum
value, maximum value, etc. Graphical analysis helps us visualize the data and easily identify trends
and correlations. Cluster analysis helps us understand the data according to different groups.
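As a small, hedged sketch of these three steps (using a made-up toy table, not the automobile data), the exploration might look like this in pandas:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 10.0, 12.0],
})

# Descriptive statistics: mean, standard deviation, etc.
print(toy["value"].mean())   # 6.25
print(toy["value"].std())

# Graphical analysis would use a plotting library, e.g.:
# toy["value"].plot(kind="hist")   # requires matplotlib

# A very simple group-wise view, a first step toward cluster analysis
print(toy.groupby("group")["value"].mean())
```

The same three calls (`mean`/`std`, `plot`, `groupby`) scale directly to the automobile dataset used below.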
1.4 Laboratory:
1.4.1 Data Acquisition
A dataset can come in various formats: .csv, .json, .xlsx, etc. It can also be stored in different places: on your local machine or online.
Understanding the Dataset
In this laboratory, you will use a dataset called 1985 Ward's Automotive Yearbook, collected by Jeffrey Schlimmer (Schlimmer, 1987).
This dataset is in Comma Separated Value (CSV) format. There are three types of entities:
1. The specification of an auto in terms of various characteristics;
2. Its assigned insurance risk rating;
3. Its normalized losses in use as compared to other cars.
Step 2: Reading the Dataset To read a dataset in CSV format, use the read_csv() function in Pandas. Pass the file path as a string enclosed in quotation marks; Pandas will read the file at that address into a DataFrame. The file path can be either a URL or a local file address.
Because the data does not include headers, we add the argument header=None inside the read_csv() method so that Pandas will not automatically treat the first row as a header.
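The effect of header=None can be checked without any file at all by reading a small in-memory CSV (the values below are invented):

```python
import io
import pandas as pd

csv_text = "3,?,alfa-romero\n1,?,audi\n"

# Default behavior: pandas treats the first line as the header row
with_header = pd.read_csv(io.StringIO(csv_text))
print(with_header.columns.tolist())   # ['3', '?', 'alfa-romero']
print(len(with_header))               # 1 data row

# header=None: every line is data, and columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())     # [0, 1, 2]
print(len(no_header))                 # 2 data rows
```

This is exactly why header=None is needed here: without it, the first car's record would be swallowed as a header.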
I will assign the dataset to the df variable, but you can assign it to any variable name you like.
[2]: path = '/content/automobile_data.csv'
df = pd.read_csv(path, header=None)
After reading the dataset, you can use the df.head(n) method to check the top n rows of the DataFrame, where n is an integer. Conversely, df.tail(n) shows the bottom n rows of the DataFrame.
[3]: df.head()
[3]:    0          1                  2            3          4          …
0       symboling  normalized-losses  make         fuel-type  aspiration …
1       3          ?                  alfa-romero  gas        std        …
2       3          ?                  alfa-romero  gas        std        …
3       1          ?                  alfa-romero  gas        std        …
4       2          164                audi         gas        std        …
[5 rows x 26 columns]
(Only the first few of the 26 columns are shown here. Note that the original header names appear as data in row 0.)
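The head()/tail() behavior is easy to sanity-check on a tiny throw-away frame (invented here purely for illustration):

```python
import pandas as pd

# Toy frame with rows 0..9
tiny = pd.DataFrame({"x": range(10)})

print(tiny.head(3))   # rows 0, 1, 2
print(tiny.tail(3))   # rows 7, 8, 9
print(tiny.head())    # with no argument, n defaults to 5
```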
As you can see, Pandas automatically numbers the headers with integers starting from 0. Therefore, you need to add the headers for each column to the DataFrame manually. To complete this task, you need to read the dataset's documentation, available here
Firstly, I would create a list called headers that includes all column names in order.
[5]: # create headers list
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
print("headers\n", headers)
headers
['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base',
'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',
'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
Then, I assigned df.columns = headers to replace the headers with the list that I just created.
[6]: df.columns = headers
df.columns
Now, check the first 15 entries of the updated DataFrame and note that the headers are updated.
[7]: df.head(15)
(Output: the first 15 rows of the DataFrame, now displayed with the new column headers. Several entries, including one price value, still contain the ? symbol.)
Step 3: Preprocessing the Dataset Now, we need to replace the ? symbol with NaN so that dropna() can later remove the missing values:
[8]: df = df.replace('?', np.nan)
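On a toy frame (column name and values invented for illustration), the replacement works like this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": ["13495", "?", "16500"]})

# '?' is just an ordinary string to pandas; swap it for a real missing value
toy = toy.replace("?", np.nan)

print(toy["price"].isna().sum())   # 1 missing value
```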
Next, drop all rows that are missing a price value:
[9]: df = df.dropna(subset=['price'], axis=0)
df.head(20)
(Output: the first 20 rows of the cleaned DataFrame; row 10, whose price was NaN, has been dropped.)
Here, axis=0 means that an entire row is dropped wherever the price attribute is found to be NaN.
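A minimal sketch of that dropna() call, on invented data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make":  ["audi", "bmw", "saab"],
    "price": [13495.0, np.nan, 16500.0],
})

# axis=0 drops whole rows; subset=['price'] looks only at the price column
cleaned = toy.dropna(subset=["price"], axis=0)

print(cleaned["make"].tolist())   # ['audi', 'saab']
```

Note that missing values in other columns would be ignored; only rows with a missing price are removed.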
Finally, save the cleaned DataFrame to a new CSV file:
[10]: df.to_csv('/content/automobile.csv')
1.4.3 Basic Insights from the Dataset
After reading and saving the dataset into a new Pandas DataFrame, it is time for you to explore
the data set.
Checking the data types of the Dataset Data comes in a variety of types, so we need to check each column's type to understand the dataset better. The main types stored in Pandas DataFrames are object, float, int, bool, and datetime64. To learn more about each attribute, you should always know the data type of each column.
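A quick way to see how these types arise (toy columns with invented values):

```python
import pandas as pd

toy = pd.DataFrame({
    "symboling": [3, 1, 2],                   # whole numbers -> int64
    "wheel-base": [88.6, 94.5, 99.8],         # decimals -> float64
    "make": ["alfa-romero", "audi", "bmw"],   # strings -> object
})

# dtypes reports the inferred type of every column
print(toy.dtypes)
```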
[12]: df.dtypes
The output lists the data type of each column. Notice that in this DataFrame every column is reported as object: because the file's own header row was read in as the first data row, Pandas could not infer numeric types. After that row is cleaned up, you would expect columns such as symboling and curb-weight to be int64, wheel-base to be float64, and normalized-losses to remain object.
If you would like to get a statistical summary of each column such as count, column mean value,
column standard deviation, etc., use the df.describe() method. Note that this method will
provide various summary statistics, excluding NaN (Not a Number) values.
[13]: df.describe()
[13]:   symboling  normalized-losses    make  fuel-type  aspiration  num-of-doors  …
count         202                165     202        202         202           200  …
unique          7                 52      23          3           3             3  …
top             0                161  toyota        gas         std          four  …
freq           65                 11      32        181         165           113  …
[4 rows x 26 columns]
Because every column in this DataFrame is of type object, describe() reports count, unique, top, and freq rather than numeric statistics. For example, the attribute symboling has 202 non-null entries and 7 unique values; its most frequent value is 0, which occurs 65 times. For numeric columns, describe() would instead report the mean, standard deviation, minimum, quartiles, and maximum. You can add the argument include = "all" inside the parentheses to include every column, whatever its type, in the summary.
[14]: df.describe(include = "all")
(Output: a 4 rows x 26 columns summary covering every column. For example, fuel-system has 9 unique values, with mpfi the most frequent at 92 occurrences.)
The summary now covers every column. For the object-typed attributes you can see how many unique values each has, which value is the top (most frequent) one, and how often that top value occurs. Some cells in such a table show NaN: a statistic is reported as NaN when it does not apply to a column's type (for example, the mean of an object column).
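On a toy mixed-type frame (invented columns and values) you can see exactly why NaN appears in some cells of describe(include="all"):

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel"],   # object column
    "city-mpg": [21, 21, 19],                # numeric column
})

summary = toy.describe(include="all")
print(summary)

# 'top'/'freq' exist only for the object column; 'mean' only for the numeric one
print(summary.loc["top", "fuel-type"])    # 'gas'
print(summary.loc["mean", "city-mpg"])
```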
Question 3:
You can select columns of a DataFrame by listing their names inside double brackets. For example, you can select three columns as follows: df[['column 1', 'column 2', 'column 3']], where each entry is a column name. You can then apply the describe() method to get the statistics of just those columns: df[['column 1', 'column 2', 'column 3']].describe().
Apply the describe() method to the columns length and compression-ratio.
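The double-bracket selection pattern looks like this on a toy frame with invented column names (so as not to give away the lab answer):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": ["x", "y", "z"],
})

# An outer pair of brackets with a list of names returns a sub-DataFrame
subset = toy[["a", "b"]]
print(subset.columns.tolist())   # ['a', 'b']

# describe() then summarizes only those columns
print(subset.describe())
```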
You can also use the df.info() method to check your dataset. This method prints information about a DataFrame, including the index dtype, the columns, the non-null counts, and the memory usage.
[16]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202 entries, 0 to 205
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 202 non-null object
1 normalized-losses 165 non-null object
2 make 202 non-null object
3 fuel-type 202 non-null object
4 aspiration 202 non-null object
5 num-of-doors 200 non-null object
6 body-style 202 non-null object
7 drive-wheels 202 non-null object
8 engine-location 202 non-null object
9 wheel-base 202 non-null object
10 length 202 non-null object
11 width 202 non-null object
12 height 202 non-null object
13 curb-weight 202 non-null object
14 engine-type 202 non-null object
15 num-of-cylinders 202 non-null object
16 engine-size 202 non-null object
17 fuel-system 202 non-null object
18 bore 198 non-null object
19 stroke 198 non-null object
20 compression-ratio 202 non-null object
21 horsepower 200 non-null object
22 peak-rpm 200 non-null object
23 city-mpg 202 non-null object
24 highway-mpg 202 non-null object
25 price 202 non-null object
dtypes: object(26)
memory usage: 42.6+ KB
1.5 Citations:
1. Schlimmer, J. (1987). Automobile. UCI Machine Learning Repository. Retrieved October 29, 2023, from https://doi.org/10.24432/C5B01C
2. What is NumPy? (n.d.). NumPy. Retrieved October 29, 2023, from https://numpy.org/doc/stable/user/whatisnumpy.html