Unit IV Notes
Syllabus
4.1 Data Wrangling
• Data Wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.
• The primary purpose of data wrangling is to get data into a coherent shape. In
other words, it makes raw data usable and provides a basis for further processing.
• Data wrangling is the process of cleaning, structuring and enriching raw data into a desired
format for better decision making in less time.
• There are typically six iterative steps that make up the data wrangling process:
1. Discovering: Before you can dive deeply, you must better understand what is in your data,
which will inform how you want to analyze it. How you wrangle customer data, for example,
may be informed by where they are located, what they bought, or what promotions they received.
2. Structuring: This means organizing the data, which is necessary because raw data comes in
many different shapes and sizes. A single column may turn into several rows for easier analysis.
One column may become two. Data is rearranged for easier computation and analysis.
3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What
happens when state data is entered as AP or Andhra Pradesh or Arunachal Pradesh? You clean
P.T.Lee Chengalvaraya Naicker
College Of Engineering & Technology
P.T.Lee Chengalvaraya Naicker Nagar, Oovery, Kancheepuram – 631 502.
Department of Computer Science And Engineering
the data. Null values are changed and standard formatting implemented, ultimately increasing
data quality.
4. Enriching: Here you take stock of your data and strategize about how other additional data
might augment it. Questions asked during this data wrangling step might be: what new types of
data can I derive from what I already have or what other information would better inform my
decision making about this current data?
5. Validating: Validation rules are repetitive programming sequences that verify data
consistency, quality, and security. Examples of validation include ensuring uniform distribution
of attributes that should be distributed normally (e.g. birth dates) or confirming accuracy of fields
through a check across data.
6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular
user or software and document any particular steps taken or logic used to wrangle said data. Data
wrangling gurus understand that implementation of insights relies upon the ease with which it
can be accessed and utilized by others.
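The structuring and cleaning steps above can be sketched with pandas. This is a hedged illustration: the customer table, column names, and the replacement map are made up for the example, echoing the AP / Andhra Pradesh case mentioned earlier.

```python
import pandas as pd

# Hypothetical raw customer data: inconsistent state names and a missing value
raw = pd.DataFrame({
    "customer": ["Anu", "Ravi", "Sita", "Dev"],
    "state": ["AP", "Andhra Pradesh", "Arunachal Pradesh", None],
})

# Cleaning: map every variant spelling onto one standard form
raw["state"] = raw["state"].replace({"AP": "Andhra Pradesh"})

# Null values are changed to a standard placeholder, increasing data quality
raw["state"] = raw["state"].fillna("Unknown")

print(raw)
```

After these two lines every row carries a standardized, non-null state value, which is exactly what the cleaning step aims at.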
• Python is a high-level scripting language which can be used for a wide variety of text
processing, system administration and internet-related tasks.
• Python was developed in the early 1990's by Guido van Rossum, then at CWI in Amsterdam,
and currently at CNRI in Virginia.
• Python relies on modules, that is, self-contained programs which define a variety of functions
and data types.
• A module is a file containing Python definitions and statements. The file name is the module
name with the suffix .py appended.
• Within a module, the module's name (as a string) is available as the value of the global
variable __name__.
• If a module is executed directly, however, the value of the global variable __name__ will be
"__main__".
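A minimal sketch of this behaviour (the function name is illustrative; save the code in any module file):

```python
# When this file is run directly, __name__ is "__main__";
# when it is imported, __name__ is the module's own name instead.
def greet():
    return "Hello from the module"

if __name__ == "__main__":
    # This block runs only when the file is executed as a script,
    # not when the module is imported by another program.
    print(greet())
```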
• Modules can contain executable statements aside from definitions. These are executed only the
first time the module name is encountered in an import statement as well as if the file is executed
as a script.
• Integrated Development Environment (IDE) is the basic interpreter and editor environment that
you can use along with Python. This typically includes an editor for creating and modifying
programs, a translator for executing programs, and a program debugger. A debugger provides a
means of taking control of the execution of a program to aid in finding program errors.
• Python is most commonly translated by use of an interpreter. It provides the very useful ability
to execute in interactive mode. The window that provides this interaction is referred to as the
Python shell.
• Python supports two basic modes: normal mode and interactive mode.
• Normal mode: The normal mode is the mode where scripted and finished .py files are run in
the Python interpreter. This mode is also called script mode.
• Interactive mode is a command line shell which gives immediate feedback for each statement,
while running previously fed statements in active memory.
• Start the Python interactive interpreter by typing python with no arguments at the command
line.
• To access the Python shell, open the terminal of your operating system and then type "python".
Press the enter key and the Python shell will appear.
C:\Windows\system32>python
4.3Numpy
• NumPy, short for Numerical Python, is the core library for scientific computing in Python. It
has been designed specifically for performing basic and advanced array operations. It primarily
supports multi-dimensional arrays and vectors for complex arithmetic operations.
•A library is a collection of files (called modules) that contains functions for use by other
programs. A Python library is a reusable chunk of code that you may want to include in your
programs.
• Many popular Python libraries are NumPy, SciPy, Pandas and Scikit-Learn. Python
visualization libraries are matplotlib and Seaborn.
• NumPy has risen to become one of the most popular Python science libraries and has received
grant funding to support its continued development.
• NumPy's multidimensional array can perform very large calculations much more easily and
efficiently than using the Python standard data types.
• To get started, NumPy has many resources on their website, including documentation and
tutorials.
• NumPy (Numerical Python) is a perfect tool for scientific computing and performing basic and
advanced array operations.
• The library offers many handy features for performing operations on n-arrays and matrices in
Python. It helps to process arrays that store values of the same data type and makes performing
math operations on arrays easier. In fact, the vectorization of mathematical operations on the
NumPy array type increases performance and accelerates the execution time.
• Numpy is the core library for scientific computing in Python. It provides a high performance
multidimensional array object and tools for working with these arrays.
• NumPy is the fundamental package needed for scientific computing with Python.
• NumPy is an extension package to Python for array programming. It provides "closer to the
hardware" optimization, which in Python means a C implementation.
Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on
win32 Type "help", "copyright", "credits" or "license" for more information.
>>>
• The >>> indicates that the Python shell is ready to execute and send your commands to the
Python interpreter. The result is immediately displayed on the Python shell as soon as the Python
interpreter interprets the command.
• For example, to print the text "Hello World", we can type the following:
>>> print("Hello World")
Hello World
>>>
• In script mode, a file must be created and saved before executing the code to get results. In
interactive mode, the result is returned immediately after pressing the enter key.
• In script mode, you are provided with a direct way of editing your code. This is not possible in
interactive mode.
• It is portable: Python can run equally on different platforms such as Windows, Linux, UNIX,
and Macintosh, etc.
• It provides a vast range of libraries for various fields such as machine learning, web
development, and scripting.
Advantages of Python
• Ease of programming
Disadvantages of Python
• Numpy array is a powerful N-dimensional array object which is in the form of rows and
columns. We can initialize NumPy arrays from nested Python lists and access its elements.
A NumPy array is a collection of elements that have the same data type.
1. Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.
2. Indexing of arrays: Getting and setting the value of individual array elements.
3. Slicing of arrays: Getting and setting smaller subarrays within a larger array.
4. Reshaping of arrays: Changing the shape of a given array.
5. Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array
into many.
a) Attributes of array
• In Python, arrays from the NumPy library, called N-dimensional arrays or the ndarray, are used
as the primary data structure for representing data.
• The main data structure in NumPy is the ndarray, which is a shorthand name for N-
dimensional array. When working with NumPy, data in an ndarray is simply referred to as an
array. It is a fixed-sized array in memory that contains data of the same type, such as integers or
floating point values.
•The data type supported by an array can be accessed via the "dtype" attribute on the array. The
dimensions of an array can be accessed via the "shape" attribute that returns a tuple describing
the length of each dimension.
• Array attributes are essential to find out the shape, dimension, item size etc.
• ndarray.shape: By using this attribute, we can know the array's dimensions. It can also
be used to resize the array. Each array has attributes ndim (the number of dimensions), shape (the
size of each dimension), and size (the total size of the array).
• ndarray.size: The total number of elements of the array. This is equal to the product of the
elements of the array's shape.
• ndarray.dtype: An object describing the data type of the elements in the array. Recall that
NumPy's ND-arrays are homogeneous: they can only possess numbers of a uniform data type.
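The attributes above can be inspected on a small example array:

```python
import numpy as np

# A 2-D array: 3 rows and 4 columns of integers
a = np.arange(12).reshape(3, 4)

print(a.ndim)   # number of dimensions -> 2
print(a.shape)  # size of each dimension -> (3, 4)
print(a.size)   # total number of elements -> 12
print(a.dtype)  # data type of the elements
```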
b) Indexing of arrays
• Array indexing always refers to the use of square brackets ("[ ]") to index the elements of the
array. In order to access a single element of an array we can refer to its index.
>>> P = np.array([25, 26, 27, 28, 29, 30])
>>> P[3]
28
• NumPy arrays also accept negative indexes, which count backwards from the end of the
array: -1 refers to the last element, -2 to the second last, and so on.
>>>P[-1]
30
>>>P[-6]
25
• Moving on to the two-dimensional case, namely, the matrices, they are represented as
rectangular arrays consisting of rows and columns, defined by two axes, where axis 0 is
represented by the rows and axis 1 is represented by the columns. Thus, indexing in this case is
represented by a pair of values : the first value is the index of the row and the second is the index
of the column.
>>> A = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])
>>> A[1, 2]
15
c) Slicing of arrays
• Slicing is the operation which allows us to extract portions of an array to generate new ones.
Whereas with Python lists the arrays obtained by slicing are copies, in NumPy such arrays are
views onto the same underlying buffer.
• Slicing of array in Python means to access sub-parts of an array. These sub-parts can be stored
in other variables and further modified.
• Depending on the portion of the array, to extract or view the array, make use of the slice
syntax; that is, we will use a sequence of numbers separated by colons (':') within the square
brackets.
• The start parameter represents the starting index, stop is the ending index, and step is the
number of items that are "stepped" over. If any of these are unspecified, they default to the
values start=0, stop=size of dimension, step=1.
import numpy as np
arr = np.array([1,2,3,4])
print(arr[1:3:2])
print(arr[:3])
print(arr[::2])
Output:
[2]
[1 2 3]
[1 3]
Multidimensional sub-arrays:
• Multidimensional slices work in the same way, with multiple slices separated by commas. For
example:
In[24]: x2
Out[24]: array([[3, 5, 2, 4],
                [7, 6, 8, 8],
                [1, 6, 7, 7]])
In[25]: x2[:2, :3]  # two rows, three columns
Out[25]: array([[3, 5, 2],
                [7, 6, 8]])
In[26]: x2[:3, ::2]  # all rows, every other column
Out[26]: array([[3, 2],
                [7, 8],
                [1, 7]])
• Let us create an array using the package Numpy and access its columns.
# Creating an array
import numpy as np
a= np.array([[1,2,3],[4,5,6],[7,8,9]])
• Now let us access the elements column-wise. In order to access the elements in a column-wise
manner, the colon (:) symbol is used. Let us see that with an example.
import numpy as np
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(a[:,1])
Output:
[2 5 8]
d) Reshaping of array
•The numpy.reshape() function is used to reshape a numpy array without changing the data in the
array.
• Syntax: numpy.reshape(a, newshape, order='C')
Where order: {'C', 'F', 'A'}, optional. Read the elements of a using this index order, and place the
elements into the reshaped array using this index order.
num_array = np.array([1,2,3,4,5,6,7,8])
num_array
Output:
array([1, 2, 3, 4, 5, 6, 7, 8])
np.reshape(num_array, (4, 2))
Output:
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
• The shape of the input array has been changed to a (4,2). This is a 2-D array and contains the
same data present in the original input 1-D array.
• The np.concatenate() function is used to concatenate or join two or more arrays into one. The
only required argument is a list or tuple of arrays.
import numpy as np
arr1 = np.arange(1,4)
arr2 = np.arange(4,7)
print("Arrays to concatenate:")
print(arr1); print(arr2)
print("After concatenation:")
print(np.concatenate([arr1,arr2]))
Arrays to concatenate:
[1 2 3]
[4 5 6]
After concatenation:
[1 2 3 4 5 6]
4.5 Aggregations
• An aggregation function is one which takes multiple individual values and returns a summary.
In the majority of cases, this summary is a single value. The most common aggregation
functions are a simple average or summation of values.
>>> arr1 = np.array([10, 20, 30, 40, 50])
>>> arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
>>> arr3 = np.array([[14, 6, 9, -12, 19, 72], [-9, 8, 22, 0, 99, -11]])
• The Python numpy sum function allows an optional argument called axis. This
aggregate function then calculates the sum along the given axis. For example, axis
= 0 returns the sum of each column in a NumPy array.
arr2.sum(axis = 0)
arr3.sum(axis = 0)
arr2.sum(axis = 1)
arr3.sum(axis = 1)
>>> arr1.sum()
150
>>> arr2.sum()
360
>>> arr3.sum()
217
>>> arr2.sum(axis = 0)
array([ 90, 120, 150])
>>> arr3.sum(axis = 0)
array([  5,  14,  31, -12, 118,  61])
>>> arr2.sum(axis = 1)
array([ 30, 120, 210])
>>> arr3.sum(axis = 1)
array([108, 109])
• Python has built-in min and max functions used to find the minimum value and maximum
value of any given array.
• Python min() and max() are built-in functions which return the smallest number and the
largest number of a list respectively. min() can also be used to find the smaller of two
variables or lists in a comparison, while max() is used to find the bigger one.
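A short sketch comparing the built-in functions with NumPy's own aggregates (the sample values are arbitrary):

```python
import numpy as np

values = np.array([42, 7, 19, 73, 5])

# Python's built-in functions work on any iterable, including arrays
print(min(values), max(values))        # 5 73

# NumPy's own aggregates do the same and are faster on large arrays
print(values.min(), values.max())      # 5 73
print(np.min(values), np.max(values))  # 5 73
```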
• Computation on NumPy arrays can be very fast, or it can be very slow. Using vectorized
operations, fast computations are possible, and they are implemented by using NumPy's universal
functions (ufuncs).
• Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply
the function in an element-wise fashion. Use of ufuncs is an essential aspect of vectorization
and typically much more computationally efficient than using an explicit loop over each element.
NumPy's Ufuncs :
• There are two kinds: unary ufuncs, which operate on a single input, and binary ufuncs, which operate on two inputs.
# Taking input
num1 = input("Enter first number: ")
num2 = input("Enter second number: ")
# Addition
add = float(num1) + float(num2)
# Subtraction
sub = float(num1) - float(num2)
# Multiplication
mul = float(num1) * float(num2)
# Division
div = float(num1) / float(num2)
# Modulus
mod = float(num1) % float(num2)
# Exponentiation
exp = float(num1) ** float(num2)
# Floor Division
floor_div = float(num1) // float(num2)
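The same operators also work element-wise on whole arrays, which is exactly what NumPy's binary ufuncs provide (a sketch with made-up values):

```python
import numpy as np

x = np.array([9.0, 16.0, 25.0])
y = np.array([2.0, 3.0, 4.0])

# Each operator maps to a ufunc and is applied element by element
print(x + y)    # np.add          -> [11. 19. 29.]
print(x - y)    # np.subtract     -> [ 7. 13. 21.]
print(x * y)    # np.multiply
print(x / y)    # np.divide
print(x % y)    # np.mod          -> [1. 1. 1.]
print(x ** y)   # np.power
print(x // y)   # np.floor_divide -> [4. 5. 6.]
```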
Absolute value :
• NumPy understands Python's built-in arithmetic operators; it also understands Python's built-in
absolute value function. The abs() function returns the absolute magnitude or value of the input
passed to it as an argument, i.e. the actual value of the input without taking the sign into
consideration.
• The abs() function accepts only a single argument that has to be a number, and it returns the
absolute magnitude of the number. If the input is of type integer or float, the abs() function
returns the absolute magnitude/value. If the input is a complex number, the abs() function returns
only the magnitude portion of the number.
Syntax: abs(number)
Where the number can be of integer type, floating point type or a complex number.
• Example:
num = -25.79
print(abs(num))
• Output:
25.79
Trigonometric functions:
• The numpy package provides trigonometric functions which can be used to calculate
trigonometric ratios for a given angle in radians.
Example:
import numpy as np
Arr = np.array([0, 30, 45, 60, 90])  # angles in degrees (illustrative values)
Arr = Arr*np.pi/180                  # convert degrees to radians
print(np.sin(Arr))
print(np.cos(Arr))
print(np.tan(Arr))
• Masking means to extract, modify, count or otherwise manipulate values in an array based on
some criterion.
• Boolean masking, also called boolean indexing, is a feature in Python NumPy that allows for
the filtering of values in numpy arrays, either with comparison operators or with explicit
boolean arrays.
• The result of these comparison operators is always an array with a Boolean data type. All six of
the standard comparison operations are available. For example, we might wish to count all values
greater than a certain value, or perhaps remove all outliers that are above some threshold. In
NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
x = np.array([1, 2, 3, 4, 5])
print(x == 3)  # equal
Output:
[False False  True False False]
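Building on this, the mask can be used to count or extract values meeting a criterion (a small sketch):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
mask = x > 2                   # Boolean array: [False False  True  True  True]

print(np.count_nonzero(mask))  # how many values are greater than 2 -> 3
print(x[mask])                 # extract only those values -> [3 4 5]
print(x[x <= 2])               # keep only values at or below a threshold -> [1 2]
```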
Boolean array:
• A boolean array is a numpy array with boolean (True/False) values. Such an array can be obtained
by applying a logical operator to another numpy array:
import numpy as np
a = np.arange(16).reshape(4, 4)
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
large_values = (a > 10)
print(large_values)
[[False False False False]
 [False False False False]
 [False False False  True]
 [ True  True  True  True]]
even_values = (a % 2 == 0)
print(even_values)
[[ True False  True False]
 [ True False  True False]
 [ True False  True False]
 [ True False  True False]]
• With NumPy array fancy indexing, an array can be indexed with another NumPy array, a
Python list, or a sequence of integers, whose values select elements in the indexed array.
• Example: We first create a NumPy array with 11 floating-point numbers and then index the
array with another NumPy array and Python list, to extract element numbers 0, 2 and 4 from the
original array :
import numpy as np
A = np.linspace(0, 1, 11)
print(A)
print(A[np.array([0, 2, 4])])
print(A[[0, 2, 4]])
Output:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[0.  0.2 0.4]
[0.  0.2 0.4]
• A structured Numpy array is an array of structures. Numpy arrays are homogeneous, i.e. they
can contain data of one type only. So, instead of creating a numpy array of int or float, we can
create a numpy array of homogeneous structures too.
import numpy as np
• Now to create a structure numpy array we can pass a list of tuples containing the structure
elements i.e.
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• But as the elements of a Numpy array are homogeneous, how will the size and type of the
structure be decided? For that we need to pass the type of the above structure, i.e. its schema, in
the dtype parameter.
# Illustrative schema; the field names are assumptions chosen to match the tuples above
dtype = [('Name', 'U10'), ('Marks', 'f8'), ('Age', 'i4')]
structuredArr = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6),
('Iresh', 99.9, 7)], dtype=dtype)
• It will create a structured numpy array and its contents will be,
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• Let's check the data type of the above created numpy array:
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('Age', '<i4')]
• Below is a listing of all data types available in NumPy and the characters that represent them.
1) i - integer
2) b - boolean
3) u - unsigned integer
4) f - float
5) c - complex float
6) m - timedelta
7) M - datetime
8) O - object
9) S - string
• Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the
Numpy package and its key data structure is called the DataFrame.
• DataFrames allow you to store and manipulate tabular data in rows of observations and
columns of variables.
• Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used
or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting
functions from Matplotlib and machine learning algorithms in Scikit-learn.
• Pandas is the library for data manipulation and analysis. Usually, it is the starting point for your
data science tasks. It allows you to read/write data from/to multiple sources, process missing
data, align your data, reshape it, merge and join it with other data, search it, group it, and slice it.
• Duplicate data creates problems for a data science project. If the database is large, then
processing duplicate data means wastage of time.
• Finding duplicates is important because it saves time and space and avoids false results. The
drop_duplicates() function in pandas lets you easily and efficiently remove duplicate data.
import pandas as pd
# Illustrative data containing a duplicated row
df = pd.DataFrame({'first_name': ['Asha', 'Asha', 'Ravi'],
                   'last_name': ['Rao', 'Rao', 'Kumar'],
                   'age': [42, 42, 36]})
df
Drop duplicates:
df.drop_duplicates()
• Drop duplicates in the first_name column, but take the last observation in the duplicated set:
df.drop_duplicates(subset='first_name', keep='last')
• An overview of the dataset is given by a data map. A data map is used for finding potential
problems in data, such as redundant variables, possible errors, missing values and variable
transformations.
• Try creating a Python script that converts a Python dictionary into a Pandas DataFrame, then
print the DataFrame to screen.
import pandas as pd
# Illustrative dictionary; any column data would serve
scottish_hills = {'Hill Name': ['Ben Nevis', 'Ben Macdui', 'Braeriach'],
                  'Height': [1345, 1309, 1296]}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
• Categorical variable is one that has a specific value from a limited selection of values. The
number of values is usually fixed.
• Categorical features can only take on a limited, and usually fixed, number of possible values.
For example, if a dataset is about information related to users, then you will typically find
features like country, gender, age group, etc. Alternatively, if the data you are working with is
related to products, you will find features like product type, manufacturer, seller and so on.
• Method for creating a categorical variable and then using it to check whether some data falls
within the specified limits:
import pandas as pd
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
cycle_data = pd.Series(
    pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                   categories=cycle_colors, ordered=False))
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print(cycle_data)
print(find_entries[find_entries == True])
• Here cycle_colors is a categorical variable. It contains the values Blue, Red, and Green as
colors. Values outside these categories ('Yellow', 'Purple') become NaN, which is what
pd.isnull() detects.
• Data frame variable names are typically used many times when wrangling data. Good names
for these variables make it easier to write and read wrangling programs.
• Categorical data has a categories and an ordered property, which list the possible values and
whether the ordering matters or not.
In [41]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [43]: s
Out[43]:
0    a
1    b
2    c
3    a
dtype: category

In [44]: s = s.cat.rename_categories(["Group a", "Group b", "Group c"])

In [45]: s
Out[45]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category

In [46]: s.cat.rename_categories([1, 2, 3])
Out[46]:
0    1
1    2
2    3
3    1
dtype: category
• Dates are often provided in different formats and must be converted into single-format
DateTime objects before analysis.
1. strptime() function: It parses a string into a datetime object according to a given format.
2. strftime() function: It defines how the user wants the datetime value to appear after
conversion.
import pandas as pd
date = ['21.07.2020']
print(pd.to_datetime(date))
• We can convert a string to datetime using strptime() function. This function is available in
datetime and time modules to parse a string to datetime and time objects respectively.
datetime.strptime(date_string, format)
import datetime
format = "%a %b %d %H:%M:%S %Y"  # illustrative format string
today = datetime.datetime.today()
s = today.strftime(format)
print('strftime:', s)
d = datetime.datetime.strptime(s, format)
print('strptime:', d.strftime(format))
$ python datetime_datetime_strptime.py
• Time Zones: Within datetime, time zones are represented by subclasses of tzinfo. Since tzinfo
is an abstract base class, you need to define a subclass and provide appropriate implementations
for a few methods to make it useful.
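For fixed UTC offsets, the standard library already ships one concrete tzinfo subclass, datetime.timezone, so no custom subclass is needed (a minimal sketch; the IST offset is used as an example):

```python
from datetime import datetime, timezone, timedelta

# timezone is a ready-made tzinfo subclass for fixed offsets
ist = timezone(timedelta(hours=5, minutes=30), "IST")

aware = datetime(2020, 7, 21, 12, 0, tzinfo=ist)
print(aware.isoformat())  # 2020-07-21T12:00:00+05:30

# Converting to UTC uses the tzinfo's utcoffset()
print(aware.astimezone(timezone.utc).isoformat())  # 2020-07-21T06:30:00+00:00
```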
Missing Data
• Data can have missing values for a number of reasons such as observations that were not
recorded and data corruption. Handling missing data is important as many machine learning
algorithms do not support data with missing values.
• You can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
print(dataset.describe())
• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
Values with a NaN value are ignored from operations like sum, count, etc.
• Use the isnull() method to detect the missing values. Pandas DataFrame provides a function
isnull(); it returns a new dataframe of the same size as the calling dataframe containing only True
and False values, with True in the places where the original dataframe holds NaN and False elsewhere.
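A small sketch of isnull() in action (the frame and its NaN positions are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 5.0, 6.0]})

print(df.isnull())        # True wherever the original holds NaN
print(df.isnull().sum())  # count of missing values per column
```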
Encoding missingness:
• The fillna() function is used to fill NA/NaN values using the specified method.
• Syntax: DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Where
1. value: scalar, dict, Series, or DataFrame used to fill the holes.
2. method: 'backfill'/'bfill' or 'pad'/'ffill' to propagate the next or last valid observation.
3. axis: the axis along which to fill missing values.
4. inplace: if True, fill in place on the calling object.
5. limit: the maximum number of consecutive NaN values to fill.
6. downcast: It takes a dict that specifies what to downcast, e.g. float64 to int64.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two
dimensions. As we already know, a Series is a one-dimensional labelled NumPy array and a
DataFrame is usually a two-dimensional table whose columns are Series. In some instances, in
order to carry out some sophisticated data analysis and manipulation, our data is presented in
higher dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as the name
suggests is ordering more than one item in terms of their ranking.
• To create a DataFrame with player ratings of a few players from the Fifa 19 dataset (the
'Name' and 'Position' lists are truncated in the source):
In [3]: fifa19 = pd.DataFrame(
   ...:     {'Name': [..., 'Messi', 'Neymar'],
   ...:      'Position': [...],
   ...:      'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],
   ...:      'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']})
In [4]: fifa19
Out[4]:
• From the above DataFrame, we notice that the index is the default Pandas index; the columns
'Position' and 'Rank' both have values or objects that are repeated. This could sometimes pose a
problem for us when we want to analyse the data. What we would like to do is to use meaningful
indexes that uniquely identify each row and makes it easier to get a sense of the data we are
working with. This is where MultiIndex or Hierarchical Indexing comes in.
• We do this by using the set_index() method. For hierarchical indexing, we pass a list to
set_index() to represent how we want the rows to be identified uniquely.
In [6]: fifa19
Out[6];
• We can see from the code above that we have set our new indexes to 'Position' and 'Rank', but
there is a replication of these columns. This is because we passed drop=False, which keeps the
columns where they are. The default, however, is drop=True, so without indicating
drop=False the two columns will be set as the indexes and the original columns deleted automatically.
Position  Rank   Name       Overall
GK        1st    De Gea     91
GK        3rd    Coutios    88
GK        2nd    Allison    89
DF        1st    Ramos      91
DF        2nd    Godin      90
MF        2nd    Hazard     91
MF        3rd    Kante      90
MF        1st    De Bruyne  92
CF        1st    Ronaldo    94
CF        2nd    Messi      93
CF        3rd    Neymar     92
• We use set_index() with an ordered list of column labels to make the new indexes. To verify
that we have indeed set our DataFrame to a hierarchical index, we call the .index attribute.
In [9]: fifa19.index
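The same workflow on a tiny self-contained table (the rows are a made-up subset in the spirit of the Fifa 19 example):

```python
import pandas as pd

scores = pd.DataFrame({
    "Position": ["GK", "GK", "DF", "DF"],
    "Rank": ["1st", "2nd", "1st", "2nd"],
    "Name": ["De Gea", "Allison", "Ramos", "Godin"],
    "Overall": [91, 89, 91, 90],
})

# Pass a list of columns to build a hierarchical (multi-level) index
scores = scores.set_index(["Position", "Rank"])

print(scores.index)               # a MultiIndex of (Position, Rank) pairs
print(scores.loc[("GK", "1st")])  # select one row by its full key
```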
• Whether it is to concatenate several datasets from different csv files or to merge sets of
aggregated data from different google analytics accounts, combining data from various sources is
critical to drawing the right conclusions and extracting optimal value from data analytics.
• When using pandas, data scientists often have to concatenate multiple pandas DataFrame;
either vertically (adding lines) or horizontally (adding columns).
DataFrame.append
• This method allows us to add another dataframe to an existing one. While columns with matching
names are concatenated together, columns with different labels are filled with NA.
>>> df1 = pd.DataFrame({'ints': [0, 1, 2], 'bools': [True, False, True]})
>>> df1
   ints  bools
0     0   True
1     1  False
2     2   True
>>> df2 = pd.DataFrame({'ints': [3, 4, 5], 'floats': [1.5, 2.5, 3.5]})
>>> df2
   ints  floats
0     3     1.5
1     4     2.5
2     5     3.5
>>> df1.append(df2)
   ints  bools  floats
0     0   True     NaN
1     1  False     NaN
2     2   True     NaN
0     3    NaN     1.5
1     4    NaN     2.5
2     5    NaN     3.5
• In addition to this, DataFrame.append provides other flexibilities such as resetting the resulting
index, sorting the resulting data or raising an error when the resulting index includes duplicate
records.
Pandas.concat
• We can concatenate dataframes both vertically (axis=0) and horizontally (axis=1) by using the
pandas.concat function. Unlike DataFrame.append, pandas.concat is not a method but a function
that takes a list of objects as input. As with DataFrame.append, columns with different labels are
filled with NA values.
>>> df3
   bools  floats
0  False     4.5
1   True     5.5
2  False     6.5
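Using the df1 and df3 frames above, both concatenation directions can be sketched as follows (pd.concat is also the idiomatic replacement for the removed DataFrame.append):

```python
import pandas as pd

df1 = pd.DataFrame({'ints': [0, 1, 2], 'bools': [True, False, True]})
df3 = pd.DataFrame({'bools': [False, True, False], 'floats': [4.5, 5.5, 6.5]})

# Vertical concatenation (axis=0): rows are stacked;
# columns missing from one frame are filled with NaN
tall = pd.concat([df1, df3], axis=0)

# Horizontal concatenation (axis=1): columns are placed
# side by side, with rows aligned on the index
wide = pd.concat([df1, df3], axis=1)

print(tall.shape)  # (6, 3)
print(wide.shape)  # (3, 4)
```

Note that `tall` keeps both frames' original row labels (0, 1, 2 twice); passing `ignore_index=True` resets them to 0..5.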
• The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil

# pd.DataFrame.from_csv has been removed from pandas;
# pd.read_csv is the current way to load a CSV file
data = pd.read_csv('phone_data.csv')
• Once the data has been loaded into Python, Pandas makes the calculation of different statistics
very simple. For example, mean, max, min, standard deviations and more for columns are easily
calculable:
data['item'].count()
Out[38]: 830
data['duration'].max()
Out[39]: 10528.0
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
data['network'].nunique()
Out[42]: 9
groupby() function :
• groupby() essentially splits the data into different groups depending on a variable of the user's choice.
• The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object's groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
• Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object
to obtain summary statistics for each group.
• The GroupBy object supports column indexing in the same way as the DataFrame and returns a
modified GroupBy object.
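These points can be sketched with a small hypothetical dataset in the spirit of the phone_data example above:

```python
import pandas as pd

# Small illustrative dataset (hypothetical values)
data = pd.DataFrame({
    'item': ['call', 'sms', 'call', 'data', 'call'],
    'duration': [34.0, 1.0, 120.0, 998.0, 12.0],
})

# Split the rows into groups by the 'item' column
grouped = data.groupby('item')

# .groups maps each unique group key to the row labels in that group
print(list(grouped.groups['call']))  # [0, 2, 4]

# Column indexing returns a modified GroupBy; aggregation
# functions then summarize each group separately
print(grouped['duration'].sum())
```

Here `grouped['duration'].sum()` returns one summed duration per item type, e.g. 34.0 + 120.0 + 12.0 = 166.0 for 'call'.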
• A pivot table is a similar operation that is commonly seen in spreadsheets and other programs
that operate on tabular data. The pivot table takes simple column-wise data as input, and groups
the entries into a two-dimensional table that provides a multidimensional summarization of the
data.
• A pivot table is a table of statistics that helps summarize the data of a larger table by "pivoting"
that data. Pandas gives access to creating pivot tables in Python using the .pivot_table() function.
import pandas as pd

pd.pivot_table(
    data,
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    observed=False,
    sort=True,
)
1. index : the column used to identify and order the rows vertically.
2. columns : the column used to create the new columns in the reshaped DataFrame. Each unique value in this column becomes a column in the new DataFrame.
3. values : the column(s) used to fill the values in the cells of the DataFrame.
• Import modules:
import pandas as pd
Create dataframe :
data = {'regiment': [...],  # regiment values are not shown in the source notes
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
        'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(data)
df
pd.pivot_table(df,index=['regiment','company'], aggfunc='mean')
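Since the regiment values are omitted in the notes, here is a self-contained sketch with hypothetical regiment labels filled in:

```python
import pandas as pd

df = pd.DataFrame({
    # Hypothetical regiment names (not given in the source notes)
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Rows are identified by the (regiment, company) MultiIndex;
# aggfunc='mean' averages TestScore within each group
table = pd.pivot_table(df, index=['regiment', 'company'], aggfunc='mean')
print(table)
```

For example, the two Nighthawks/1st rows hold scores 4 and 24, so their cell in the pivot table is their mean, 14.0.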
Ans. Data wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.
Ans. Python is a high-level scripting language which can be used for a wide variety of text
processing, system administration and internet-related tasks. Python is a true object-oriented
language and is available on a wide variety of platforms.
Ans. NumPy, short for Numerical Python, is the core library for scientific computing in Python.
It has been designed specifically for performing basic and advanced array operations. It primarily
supports multi-dimensional arrays and vectors for complex arithmetic operations.
Ans. An aggregation function is one which takes multiple individual values and returns a
summary. In the majority of the cases, this summary is a single value. The most common
aggregation functions are a simple average or summation of values.
Ans. A structured NumPy array is an array of structures. NumPy arrays are homogeneous, i.e. they can contain data of only one type. So, instead of creating a NumPy array of ints or floats, we can create a NumPy array of homogeneous structures.
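As a short illustration (the field names below are hypothetical), each element of a structured array is a record whose fields can be accessed by name:

```python
import numpy as np

# A structured dtype: every element is a (name, age, marks) record
record_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('marks', 'f4')])

students = np.array(
    [('Asha', 21, 88.5), ('Ravi', 22, 79.0)],
    dtype=record_dtype,
)

# Fields behave like columns and are selected by name
print(students['name'])        # ['Asha' 'Ravi']
print(students['age'].mean())  # 21.5
```

The array is still homogeneous: every element has the same record layout, even though a record mixes strings, ints and floats.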
Ans. Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on
the Numpy package and its key data structure is called the DataFrame. DataFrames allow you to
store and manipulate tabular data in rows of observations and columns of variables. Pandas is
built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated
in Pandas.
Ans. A categorical variable is one that takes a specific value from a limited selection of values; the number of possible values is usually fixed. For example, if a dataset contains information related to users, you will typically find features like country, gender, age group, etc. Alternatively, if the data we are working with is related to products, you will find features like product type, manufacturer, seller and so on.
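In pandas, such a feature can be stored with the dedicated category dtype, sketched here with hypothetical country codes:

```python
import pandas as pd

# A 'country' feature stored as a categorical Series
s = pd.Series(['IN', 'US', 'IN', 'UK'], dtype='category')

# The fixed set of possible values lives in .cat.categories
print(list(s.cat.categories))  # ['IN', 'UK', 'US']
print(s.dtype)                 # category
```

Besides documenting the fixed value set, the category dtype stores each value as a small integer code, which saves memory on columns with few distinct values.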
Ans. : A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data. The pivot table takes simple column-wise data as input
and groups the entries into a two-dimensional table that provides a multidimensional
summarization of the data.