Data Acquisition Python
1.1.2 Requirements
1. Estimated time of completion: 30 minutes
2. Have a basic knowledge of Google Colaboratory.
3. Have a basic knowledge of the Python programming language.
1.3 Basic Insights from the Dataset
After collecting the data, we need to perform data exploration steps to better understand the data.
Data exploration steps include:
• Descriptive statistics: These are simple statistical calculations to describe the data, such
as mean, standard deviation, etc.
• Graphical analysis: This is the use of charts and graphs to visualize the data.
• Cluster analysis: This is the division of data into groups based on certain attributes.
Descriptive statistics help us understand the basic attributes of the data, such as mean, minimum
value, maximum value, etc. Graphical analysis helps us visualize the data and easily identify trends
and correlations. Cluster analysis helps us understand the data according to different groups.
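As a small, hedged sketch of these three steps (using a made-up toy table, not the automobile data), the exploration might look like this in pandas:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 10.0, 12.0],
})

# Descriptive statistics: mean, standard deviation, etc.
print(toy["value"].mean())   # 6.25
print(toy["value"].std())

# Graphical analysis would use a plotting library, e.g.:
# toy["value"].plot(kind="hist")   # requires matplotlib

# A very simple group-wise view, a first step toward cluster analysis
print(toy.groupby("group")["value"].mean())
```

The same three calls (`mean`/`std`, `plot`, `groupby`) scale directly to the automobile dataset used below.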
1.4 Laboratory:
1.4.1 Data Acquisition
A dataset can come in various formats: .csv, .json, .xlsx, etc. It can also be stored in different places: on your local machine or online.
Understanding the Dataset
In this laboratory, you will use a dataset called 1985 Ward's Automotive Yearbook, collected by Jeffrey Schlimmer (Schlimmer, 1987).
This dataset is in Comma Separated Value (CSV) format. There are three types of entities:
1. The specification of an auto in terms of various characteristics;
2. Its assigned insurance risk rating;
3. Its normalized losses in use as compared to other cars.
Step 2: Reading the Dataset To read a dataset in CSV format, use the read_csv() function in Pandas. Pass the file path as a string enclosed in quotation marks; Pandas will read the file at that address into a DataFrame. The file path can be either a URL or a local file address.
Because the data does not include headers, we add the argument header=None inside the read_csv() method so that Pandas will not automatically treat the first row as a header.
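The effect of header=None can be checked without any file at all by reading a small in-memory CSV (the values below are invented):

```python
import io
import pandas as pd

csv_text = "3,?,alfa-romero\n1,?,audi\n"

# Default behavior: pandas treats the first line as the header row
with_header = pd.read_csv(io.StringIO(csv_text))
print(with_header.columns.tolist())   # ['3', '?', 'alfa-romero']
print(len(with_header))               # 1 data row

# header=None: every line is data, and columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())     # [0, 1, 2]
print(len(no_header))                 # 2 data rows
```

This is exactly why header=None is needed here: without it, the first car's record would be swallowed as a header.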
I will assign the dataset to the df variable, but you can assign it to any variable name you like.
[2]: path = '/content/automobile_data.csv'
df = pd.read_csv(path, header=None)
After reading the dataset, you can use the df.head(n) method to check the top n rows of the DataFrame, where n is an integer. Conversely, df.tail(n) shows the bottom n rows of the DataFrame.
[3]: df.head()
[3]:    0          1                  2            3          4          …
0       symboling  normalized-losses  make         fuel-type  aspiration …
1       3          ?                  alfa-romero  gas        std        …
2       3          ?                  alfa-romero  gas        std        …
3       1          ?                  alfa-romero  gas        std        …
4       2          164                audi         gas        std        …
[5 rows x 26 columns]
(Only the first few of the 26 columns are shown here. Note that the original header names appear as data in row 0.)
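The head()/tail() behavior is easy to sanity-check on a tiny throw-away frame (invented here purely for illustration):

```python
import pandas as pd

# Toy frame with rows 0..9
tiny = pd.DataFrame({"x": range(10)})

print(tiny.head(3))   # rows 0, 1, 2
print(tiny.tail(3))   # rows 7, 8, 9
print(tiny.head())    # with no argument, n defaults to 5
```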
As you can see, Pandas automatically numbers the headers with integers starting from 0. Therefore, you need to add the headers for each column to the DataFrame manually. To complete this task, you need to read the dataset's documentation, available here
Firstly, I would create a list called headers that includes all column names in order.
[5]: # create headers list
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
print("headers\n", headers)
headers
['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base',
'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',
'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
Then, I assigned df.columns = headers to replace the headers with the list that I just created.
[6]: df.columns = headers
df.columns
Now, check the first 15 entries of the updated DataFrame and note that the headers are updated.
[7]: df.head(15)
(Output: the first 15 rows of the DataFrame, now displayed with the new column headers. Several entries, including one price value, still contain the ? symbol.)
Step 3: Preprocessing the Dataset Now, we need to replace the ? symbol with NaN so that dropna() can later remove the missing values:
[8]: df = df.replace('?', np.nan)
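On a toy frame (column name and values invented for illustration), the replacement works like this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": ["13495", "?", "16500"]})

# '?' is just an ordinary string to pandas; swap it for a real missing value
toy = toy.replace("?", np.nan)

print(toy["price"].isna().sum())   # 1 missing value
```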
Next, drop all rows that are missing a price value:
[9]: df = df.dropna(subset=['price'], axis=0)
df.head(20)
(Output: the first 20 rows of the cleaned DataFrame; row 10, whose price was NaN, has been dropped.)
Here, axis=0 means that an entire row is dropped wherever the price attribute is found to be NaN.
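A minimal sketch of that dropna() call, on invented data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make":  ["audi", "bmw", "saab"],
    "price": [13495.0, np.nan, 16500.0],
})

# axis=0 drops whole rows; subset=['price'] looks only at the price column
cleaned = toy.dropna(subset=["price"], axis=0)

print(cleaned["make"].tolist())   # ['audi', 'saab']
```

Note that missing values in other columns would be ignored; only rows with a missing price are removed.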
Finally, save the cleaned DataFrame to a new CSV file:
[10]: df.to_csv('/content/automobile.csv')
1.4.3 Basic Insights from the Dataset
After reading and saving the dataset into a new Pandas DataFrame, it is time for you to explore
the data set.
Checking the data types of the Dataset Data comes in a variety of types, so we need to check each column's type to understand the dataset better. The main types stored in Pandas DataFrames are object, float, int, bool, and datetime64. To learn more about each attribute, you should always know the data type of each column.
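A quick way to see how these types arise (toy columns with invented values):

```python
import pandas as pd

toy = pd.DataFrame({
    "symboling": [3, 1, 2],                   # whole numbers -> int64
    "wheel-base": [88.6, 94.5, 99.8],         # decimals -> float64
    "make": ["alfa-romero", "audi", "bmw"],   # strings -> object
})

# dtypes reports the inferred type of every column
print(toy.dtypes)
```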
[12]: df.dtypes
The output lists the data type of each column. Notice that in this DataFrame every column is reported as object: because the file's own header row was read in as the first data row, Pandas could not infer numeric types. After that row is cleaned up, you would expect columns such as symboling and curb-weight to be int64, wheel-base to be float64, and normalized-losses to remain object.
If you would like to get a statistical summary of each column such as count, column mean value,
column standard deviation, etc., use the df.describe() method. Note that this method will
provide various summary statistics, excluding NaN (Not a Number) values.
[13]: df.describe()
[13]:   symboling  normalized-losses    make  fuel-type  aspiration  num-of-doors  …
count         202                165     202        202         202           200  …
unique          7                 52      23          3           3             3  …
top             0                161  toyota        gas         std          four  …
freq           65                 11      32        181         165           113  …
[4 rows x 26 columns]
Because every column in this DataFrame is of type object, describe() reports count, unique, top, and freq rather than numeric statistics. For example, the attribute symboling has 202 non-null entries and 7 unique values; its most frequent value is 0, which occurs 65 times. For numeric columns, describe() would instead report the mean, standard deviation, minimum, quartiles, and maximum. You can add the argument include = "all" inside the parentheses to include every column, whatever its type, in the summary.
[14]: df.describe(include = "all")
(Output: a 4 rows x 26 columns summary covering every column. For example, fuel-system has 9 unique values, with mpfi the most frequent at 92 occurrences.)
The summary now covers every column. For the object-typed attributes you can see how many unique values each has, which value is the top (most frequent) one, and how often that top value occurs. Some cells in such a table show NaN: a statistic is reported as NaN when it does not apply to a column's type (for example, the mean of an object column).
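On a toy mixed-type frame (invented columns and values) you can see exactly why NaN appears in some cells of describe(include="all"):

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel"],   # object column
    "city-mpg": [21, 21, 19],                # numeric column
})

summary = toy.describe(include="all")
print(summary)

# 'top'/'freq' exist only for the object column; 'mean' only for the numeric one
print(summary.loc["top", "fuel-type"])    # 'gas'
print(summary.loc["mean", "city-mpg"])
```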
Question 3:
You can select columns of a DataFrame by listing their names inside double brackets. For example, you can select three columns as follows: df[['column 1', 'column 2', 'column 3']], where each entry is a column name. You can then apply the describe() method to get the statistics of just those columns: df[['column 1', 'column 2', 'column 3']].describe().
Apply the describe() method to the columns length and compression-ratio.
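The double-bracket selection pattern looks like this on a toy frame with invented column names (so as not to give away the lab answer):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": ["x", "y", "z"],
})

# An outer pair of brackets with a list of names returns a sub-DataFrame
subset = toy[["a", "b"]]
print(subset.columns.tolist())   # ['a', 'b']

# describe() then summarizes only those columns
print(subset.describe())
```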
You can also use the df.info() method to check your dataset. This method prints information about a DataFrame, including the index dtype, the columns, the non-null counts, and the memory usage.
[16]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202 entries, 0 to 205
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 202 non-null object
1 normalized-losses 165 non-null object
2 make 202 non-null object
3 fuel-type 202 non-null object
4 aspiration 202 non-null object
5 num-of-doors 200 non-null object
6 body-style 202 non-null object
7 drive-wheels 202 non-null object
8 engine-location 202 non-null object
9 wheel-base 202 non-null object
10 length 202 non-null object
11 width 202 non-null object
12 height 202 non-null object
13 curb-weight 202 non-null object
14 engine-type 202 non-null object
15 num-of-cylinders 202 non-null object
16 engine-size 202 non-null object
17 fuel-system 202 non-null object
18 bore 198 non-null object
19 stroke 198 non-null object
20 compression-ratio 202 non-null object
21 horsepower 200 non-null object
22 peak-rpm 200 non-null object
23 city-mpg 202 non-null object
24 highway-mpg 202 non-null object
25 price 202 non-null object
dtypes: object(26)
memory usage: 42.6+ KB
1.5 Citations:
1. Schlimmer, J. (1987). Automobile. UCI Machine Learning Repository. Retrieved October 29, 2023, from https://doi.org/10.24432/C5B01C
2. What is NumPy? (n.d.). NumPy. Retrieved October 29, 2023, from https://numpy.org/doc/stable/user/whatisnumpy.html