
P.T.Lee Chengalvaraya Naicker College Of Engineering & Technology
P.T.Lee Chengalvaraya Naicker Nagar, Oovery, Kancheepuram – 631 502.
Department of Computer Science And Engineering

UNIT IV: Python Libraries for Data Wrangling

Syllabus

Basics of NumPy arrays - aggregations - computations on arrays - comparisons, masks, boolean logic - fancy indexing - structured arrays - Data manipulation with Pandas - data indexing and selection - operating on data - missing data - hierarchical indexing - combining datasets - aggregation and grouping - pivot tables.

4.1 Data Wrangling

• Data Wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.

• Data wrangling is also called data munging.

• The primary purpose of data wrangling is to get data into a coherent shape; in other words, to make raw data usable and provide substance for further processing.

• Data wrangling covers the following processes:

1. Getting data from various sources into one place

2. Piecing the data together according to the determined setting

3. Cleaning the data of noise and erroneous or missing elements.

• Data wrangling is the process of cleaning, structuring and enriching raw data into a desired
format for better decision making in less time.

• There are typically six iterative steps that make up the data wrangling process:

1. Discovering: Before you can dive deeply, you must better understand what is in your data,
which will inform how you want to analyze it. How you wrangle customer data, for example,
may be informed by where they are located, what they bought, or what promotions they received.

2. Structuring: This means organizing the data, which is necessary because raw data comes in
many different shapes and sizes. A single column may turn into several rows for easier analysis.
One column may become two. Movement of data is made for easier computation and analysis.

3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What
happens when state data is entered as AP or Andhra Pradesh or Arunachal Pradesh? You clean
the data. Null values are changed and standard formatting implemented, ultimately increasing
data quality.

4. Enriching Here you take stock in your data and strategize about how other additional data
might augment it. Questions asked during this data wrangling step might be : what new types of
data can I derive from what I already have or what other information would better inform my
decision making about this current data?

5. Validating Validation rules are repetitive programming sequences that verify data
consistency, quality, and security. Examples of validation include ensuring uniform distribution
of attributes that should be distributed normally (e.g. birth dates) or confirming accuracy of fields
through a check across data.

6. Publishing: Analysts prepare the wrangled data for use downstream, whether by a particular
user or software and document any particular steps taken or logic used to wrangle said data. Data
wrangling gurus understand that implementation of insights relies upon the ease with which it
can be accessed and utilized by others.

4.2 Introduction to Python

• Python is a high-level scripting language which can be used for a wide variety of text
processing, system administration and internet-related tasks.

• Python is a true object-oriented language, and is available on a wide variety of platforms.

• Python was developed in the early 1990s by Guido van Rossum, then at CWI in Amsterdam and later at CNRI in Virginia.

• Python 3.0 was released in 2008.

• Python statements do not need to end with a special character.

• Python relies on modules, that is, self-contained programs which define a variety of functions
and data types.

• A module is a file containing Python definitions and statements. The file name is the module
name with the suffix .py appended.

• Within a module, the module's name (as a string) is available as the value of the global variable __name__.

• If a module is executed directly, however, the value of __name__ is "__main__".

• Modules can contain executable statements aside from definitions. These are executed only the first time the module name is encountered in an import statement, and also when the file is executed as a script.

• Integrated Development Environment (IDE) is the basic interpreter and editor environment that
you can use along with Python. This typically includes an editor for creating and modifying
programs, a translator for executing programs, and a program debugger. A debugger provides a
means of taking control of the execution of a program to aid in finding program errors.

• Python is most commonly translated by use of an interpreter. It provides the very useful ability
to execute in interactive mode. The window that provides this interaction is referred to as the
Python shell.

• Python supports two basic modes: normal mode and interactive mode.

• Normal mode: the mode in which scripted and finished .py files are run by the Python interpreter. This mode is also called script mode.

• Interactive mode is a command line shell which gives immediate feedback for each statement,
while running previously fed statements in active memory.

• Start the Python interactive interpreter by typing python with no arguments at the command
line.

• To access the Python shell, open the terminal of your operating system and then type "python".
Press the enter key and the Python shell will appear.

C:\Windows\system32>python

4.3 NumPy

• NumPy, short for Numerical Python, is the core library for scientific computing in Python. It
has been designed specifically for performing basic and advanced array operations. It primarily
supports multi-dimensional arrays and vectors for complex arithmetic operations.

• A library is a collection of files (called modules) that contains functions for use by other programs. A Python library is a reusable chunk of code that you may want to include in your programs.

• Many popular Python libraries are NumPy, SciPy, Pandas and Scikit-Learn. Python
visualization libraries are matplotlib and Seaborn.

• NumPy has risen to become one of the most popular Python science libraries and just secured a
round of grant funding.

• NumPy's multidimensional array can perform very large calculations much more easily and
efficiently than using the Python standard data types.

• To get started, NumPy has many resources on their website, including documentation and
tutorials.

• NumPy (Numerical Python) is a perfect tool for scientific computing and performing basic and
advanced array operations.

• The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing math operations on arrays easier. In fact, the vectorization of mathematical operations on the NumPy array type increases performance and accelerates execution time.

• Numpy is the core library for scientific computing in Python. It provides a high performance
multidimensional array object and tools for working with these arrays.

• NumPy is the fundamental package needed for scientific computing with Python. It contains:

a) A powerful N-dimensional array object

b) Basic linear algebra functions

c) Basic Fourier transforms

d) Sophisticated random number capabilities

e) Tools for integrating Fortran code

f) Tools for integrating C/C++ code.

• NumPy is an extension package to Python for array programming. It provides "closer to the
hardware" optimization, which in Python means C implementation.

Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on
win32 Type "help", "copyright", "credits" or "license" for more information.

>>>

• The >>> indicates that the Python shell is ready to execute and send your commands to the
Python interpreter. The result is immediately displayed on the Python shell as soon as the Python
interpreter interprets the command.

• For example, to print the text "Hello World", we can type the following:

>>> print("Hello World")

Hello World

>>>

• In script mode, a file must be created and saved before executing the code to get results. In
interactive mode, the result is returned immediately after pressing the enter key.

• In script mode, you are provided with a direct way of editing your code. This is not possible in
interactive mode.

Features of Python Programming

1. Python is a high-level, interpreted, interactive and object-oriented scripting language.

2. It is simple and easy to learn.

3. It is portable.

4. Python is free and open source programming language.

5. Python can perform complex tasks using a few lines of code.

6. Python can run equally on different platforms such as Windows, Linux, UNIX, and
Macintosh, etc

7. It provides a vast range of libraries for various fields such as machine learning, web development, and scripting.

Advantages and Disadvantages of Python

Advantages of Python

• Ease of programming

• Minimizes the time to develop and maintain code



• Modular and object-oriented

• Large community of users

• A large standard and user-contributed library

Disadvantages of python

• Interpreted and therefore slower than compiled languages

• Decentralized with packages

4.4 Basics of Numpy Arrays

• A NumPy array is a powerful N-dimensional array object which is in the form of rows and columns. We can initialize NumPy arrays from nested Python lists and access their elements. A NumPy array is a collection of elements that have the same data type.

• A one-dimensional NumPy array can be thought of as a vector, a two-dimensional array as a


matrix (i.e., a set of vectors), and a three-dimensional array as a tensor (i.e., a set of matrices).

• To define an array manually, we can use the np.array() function.

• Basic array manipulations are as follows :

1. Attributes of arrays: It define the size, shape, memory consumption, and data types of arrays.

2. Indexing of arrays: Getting and setting the value of individual array elements.

3. Slicing of arrays: Getting and setting smaller subarrays within a larger array.

4. Reshaping of arrays: Changing the shape of a given array.

5. Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array
into many.

a) Attributes of array

• In Python, arrays from the NumPy library, called N-dimensional arrays or the ndarray, are used
as the primary data structure for representing data.

• The main data structure in NumPy is the ndarray, which is a shorthand name for N-
dimensional array. When working with NumPy, data in an ndarray is simply referred to as an
array. It is a fixed-sized array in memory that contains data of the same type, such as integers or
floating point values.

•The data type supported by an array can be accessed via the "dtype" attribute on the array. The
dimensions of an array can be accessed via the "shape" attribute that returns a tuple describing
the length of each dimension.

• Array attributes are essential to find out the shape, dimension, item size etc.

• ndarray.shape: By using this method in numpy, we can know the array dimensions. It can also
be used to resize the array. Each array has attributes ndim (the number of dimensions), shape (the
size of each dimension), and size (the total size of the array).

• ndarray.size: The total number of elements of the array. This is equal to the product of the
elements of the array's shape.

• ndarray.dtype: An object describing the data type of the elements in the array. Recall that NumPy's N-dimensional arrays are homogeneous: they can only possess numbers of a uniform data type.
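• The attributes above can be checked on a small array; the array used here is only illustrative:

```python
import numpy as np

# A 2-D array of 3 rows and 4 columns
a = np.arange(12).reshape(3, 4)

print(a.ndim)   # number of dimensions: 2
print(a.shape)  # length of each dimension: (3, 4)
print(a.size)   # total number of elements: 12
print(a.dtype)  # data type of the elements
```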

b) Indexing of arrays

• Array indexing always refers to the use of square brackets ('[ ]') to index the elements of the array. In order to access a single element of an array, we can refer to its index.

• Fig. 4.4.1 shows the indexing of an ndarray mono-dimensional.

>>> P = np.arange(25, 31)

>>> P

array([25, 26, 27, 28, 29, 30])

>>> P[3]

28

• NumPy arrays also accept negative indexes. These indexes count backwards from the end of the array: -1 refers to the last element, -2 to the second-to-last, and so on.

>>>P[-1]

30

>>>P[-6]

25

• In a multidimensional array, we can access items using a comma-separated tuple of indices. To select multiple items at once, we can pass an array of indexes within the square brackets.

>>> P[[1, 3, 4]]

array([26, 28, 29])

• Moving on to the two-dimensional case, namely, the matrices, they are represented as
rectangular arrays consisting of rows and columns, defined by two axes, where axis 0 is
represented by the rows and axis 1 is represented by the columns. Thus, indexing in this case is
represented by a pair of values : the first value is the index of the row and the second is the index
of the column.

• Fig. 4.4.2 shows the indexing of a bi-dimensional array.

>>> A = np.arange(10, 19).reshape((3, 3))

>>> A

array([[10, 11, 12],

[13, 14, 15],

[16, 17, 18]])
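• Individual elements of this matrix are then accessed with a row index followed by a column index (the assignment at the end is only illustrative):

```python
import numpy as np

A = np.arange(10, 19).reshape((3, 3))

# Row index first, then column index
print(A[1, 2])    # row 1, column 2 -> 15
print(A[0, 0])    # top-left element -> 10
print(A[-1, -1])  # negative indexes work here too -> 18

# Assigning through an index modifies the array in place
A[0, 0] = 99
print(A[0, 0])    # -> 99
```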



c) Slicing of arrays

• Slicing is the operation that extracts portions of an array to generate new ones. Whereas slicing a Python list produces a copy, in NumPy slices are views onto the same underlying buffer.
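• This view behaviour can be demonstrated with a small sketch: modifying a slice modifies the original array, and .copy() must be used when an independent sub-array is needed.

```python
import numpy as np

arr = np.arange(10)
sub = arr[2:5]   # a view, not a copy
sub[0] = 99      # modifying the view...
print(arr)       # ...changes the original: [ 0  1 99  3  4  5  6  7  8  9]

# Use .copy() when an independent sub-array is needed
safe = arr[2:5].copy()
safe[0] = 0
print(arr[2])    # still 99; the copy is independent
```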

• Slicing of array in Python means to access sub-parts of an array. These sub-parts can be stored
in other variables and further modified.

• Depending on the portion of the array to extract or view, we make use of the slice syntax: a sequence of numbers separated by colons (':') within square brackets.

• Syntax: arr[start:stop:step]

arr[slice(start, stop, step)]

• The start parameter represents the starting index, stop is the ending index, and step is the number of items stepped over. If any of these are unspecified, they default to start = 0, stop = size of dimension, step = 1.

import numpy as np

arr = np.array([1,2,3,4])

print(arr[1:3:2])

print(arr[:3])

print(arr[::2])

Output:

[2]

[1 2 3]

[1 3]

Multidimensional sub-arrays:

• Multidimensional slices work in the same way, with multiple slices separated by commas. For
example:

In[24]: x2

Out[24]: array([[12, 5, 2, 4],

[ 7, 6, 8, 8],

[ 1, 6, 7, 7]])

In[25]: x2[:2, :3] # two rows, three columns

Out[25]: array([[12, 5, 2],

[ 7, 6, 8]])

In[26]: x2[:3, ::2] # all rows, every other column

Out[26]: array([[12, 2],

[7, 8],

[ 1, 7]])

• Let us create an array using the package Numpy and access its columns.

# Creating an array

import numpy as np

a= np.array([[1,2,3],[4,5,6],[7,8,9]])

• Now let us access the elements column-wise. In order to access the elements in a column-wise manner, the colon (:) symbol is used. Let us see that with an example.

import numpy as np

a = np.array([[1,2,3],[4,5,6],[7,8,9]])

print(a[:,1])

Output:

[2 5 8]

d) Reshaping of array

• The numpy.reshape() function is used to reshape a NumPy array without changing the data in the array.

• Syntax:

numpy.reshape(a, newshape, order='C')

Where order: {'C', 'F', 'A'}, optional Read the elements of a using this index order, and place the
elements into the reshaped array using this index order.

Step 1: Create a numpy array of shape (8,)

num_array = np.array([1,2,3,4,5,6,7,8])

num_array

Output:

array([1, 2, 3, 4, 5, 6, 7, 8])

Step 2: Use np.reshape() function with new shape as (4,2)

np.reshape(num_array,(4,2))

array([[1,2],

[3,4],

[5,6],

[7,8]])

• The shape of the input array has been changed to a (4,2). This is a 2-D array and contains the
same data present in the original input 1-D array.

e) Array concatenation and splitting

• The np.concatenate() constructor is used to concatenate or join two or more arrays into one. The only required argument is a list or tuple of arrays.

#first, import numpy

import numpy as np

# making two arrays to concatenate

arr1 = np.arange(1,4)

arr2 = np.arange(4,7)

print("Arrays to concatenate:")

print(arr1); print(arr2)

print("After concatenation:")

print(np.concatenate([arr1,arr2]))

Arrays to concatenate:

[1 2 3]

[4 5 6]

After concatenation:

[1 2 3 4 5 6]
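• The opposite of concatenation is splitting. As a sketch, np.split() divides one array into several; it accepts either a number of equal parts or a list of split positions:

```python
import numpy as np

arr = np.arange(1, 7)  # [1 2 3 4 5 6]

# Split into three equal sub-arrays
parts = np.split(arr, 3)
print(parts)           # [array([1, 2]), array([3, 4]), array([5, 6])]

# Split at explicit index positions instead
head, tail = np.split(arr, [4])
print(head)            # [1 2 3 4]
print(tail)            # [5 6]
```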

4.5 Aggregations

• An aggregation function is one that takes multiple individual values and returns a summary. In the majority of cases, this summary is a single value. The most common aggregation functions are a simple average or summation of values.

• Let us consider following example:

>>> import numpy as np

>>> arr1 = np.array([10, 20, 30, 40, 50])

>>> arr1

array([10, 20, 30, 40, 50])

>>> arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])

>>> arr2

array([[ 0, 10, 20],
       [30, 40, 50],
       [60, 70, 80]])



>>> arr3 = np.array([[14, 6, 9, -12, 19, 72], [-9, 8, 22, 0, 99, -11]])

>>> arr3

array([[ 14,   6,   9, -12,  19,  72],
       [ -9,   8,  22,   0,  99, -11]])

• The NumPy sum function calculates the sum of values in an array.

arr1.sum()

arr2.sum()

arr3.sum()

• The sum function allows an optional argument called axis, which calculates the sum along the given axis. For example, axis = 0 returns the sum of each column in a NumPy array.

arr2.sum(axis = 0)

arr3.sum(axis = 0)

• axis = 1 returns the sum of each row in an array.

arr2.sum(axis = 1)

arr3.sum(axis = 1)

>>> arr1.sum()

150

>>> arr2.sum()

360

>>> arr3.sum()

217

>>> arr2.sum(axis = 0)

array([90, 120, 150])



>>> arr3.sum(axis=0)

array([5, 14, 31, -12, 118, 61])

>>> arr2.sum(axis=1)

array([30, 120, 210])

>>> arr3.sum(axis =1)

array([108, 109])

• Python has built-in min and max functions used to find the minimum value and maximum
value of any given array.

• Python's min() and max() are built-in functions which return the smallest and the largest number of a list, respectively. min() can also be used to find the smaller of two variables or lists, and max() to find the bigger of the two.
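• NumPy arrays also provide their own min() and max() aggregations, which accept the same axis argument as sum(). A sketch, reusing the arr2 matrix from above:

```python
import numpy as np

arr2 = np.array([[0, 10, 20], [30, 40, 50], [60, 70, 80]])

print(arr2.min())        # smallest element: 0
print(arr2.max())        # largest element: 80
print(arr2.min(axis=0))  # per-column minimum: [ 0 10 20]
print(arr2.max(axis=1))  # per-row maximum: [20 50 80]
```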

4.6 Computations on Arrays

• Computation on NumPy arrays can be very fast, or it can be very slow. Using vectorized operations, fast computation is possible; it is implemented by using NumPy's universal functions (ufuncs).

• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.

• Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the function in an element-wise fashion. Use of ufuncs is an essential aspect of vectorization and typically much more computationally efficient than an explicit loop over each element.

NumPy's Ufuncs :

• Ufuncs are of two types: unary ufuncs and binary ufuncs.

• Unary ufuncs operate on a single input; binary ufuncs operate on two inputs.
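• The distinction can be sketched with two small arrays (the values are only illustrative):

```python
import numpy as np

x = np.array([1, -2, 3, -4])
y = np.array([10, 20, 30, 40])

# Unary ufuncs: one input array
print(np.negative(x))    # [-1  2 -3  4]
print(np.abs(x))         # [1 2 3 4]

# Binary ufuncs: two input arrays, combined element-wise
print(np.add(x, y))      # [11 18 33 36]
print(np.multiply(x, y)) # [ 10 -40  90 -160]
```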

• Arithmetic operators implemented in NumPy are as follows:

+ : np.add
- : np.subtract
- (unary) : np.negative
* : np.multiply
/ : np.divide
// : np.floor_divide
** : np.power
% : np.mod

• Example of Arithmetic Operators: Python Code

# Taking input

num1 = input('Enter first number:')

num2 = input('Enter second number:')

# Addition

sum = float(num1) + float(num2)

# Subtraction

min = float(num1) - float(num2)

# Multiplication

mul = float(num1) * float(num2)

#Division

div = float(num1) / float(num2)

#Modulus

mod = float(num1) % float(num2)



#Exponentiation

exp = float(num1) ** float(num2)

#Floor Division

floordiv = float(num1) // float(num2)

print("The sum of {0} and {1} is {2}".format(num1, num2, sum))

print("The subtraction of {0} and {1} is {2}".format(num1, num2, min))

print("The multiplication of {0} and {1} is {2}".format(num1, num2, mul))

print("The division of {0} and {1} is {2}".format(num1, num2, div))

print("The modulus of {0} and {1} is {2}".format(num1, num2, mod))

print("The exponentiation of {0} and {1} is {2}".format(num1, num2, exp))

print("The floor division of {0} and {1} is {2}".format(num1, num2, floordiv))

Absolute value :

• Just as NumPy understands Python's built-in arithmetic operators, it also understands Python's built-in absolute value function. The abs() function returns the absolute magnitude or value of the input passed to it as an argument. It returns the value of the input without taking the sign into consideration.

• The abs() function accepts only a single argument that has to be a number, and it returns the absolute magnitude of the number. If the input is of type integer or float, abs() returns the absolute magnitude/value. If the input is a complex number, abs() returns only the magnitude portion of the number.

Syntax: abs(number)

Where the number can be of integer type, floating point type or a complex number.

• Example:

num = -25.79

print("Absolute value:", abs(num))

• Output:

Absolute value : 25.79
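• On NumPy arrays, abs() (and the equivalent ufunc np.abs) works element-wise; for complex input it returns the magnitude. A small illustrative sketch:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])

print(abs(x))     # the built-in abs works element-wise: [2 1 0 1 2]
print(np.abs(x))  # the equivalent NumPy ufunc gives the same result

# For a complex number the magnitude is returned
print(np.abs(3 - 4j))  # 5.0
```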

Trigonometric functions:

• The numpy package provides trigonometric functions which can be used to calculate
trigonometric ratios for a given angle in radians.

Example:

import numpy as np

Arr = np.array([0, 30, 60, 90])

#converting the angles in radians

Arr = Arr*np.pi/180

print("\nThe sin value of angles:")

print(np.sin(Arr))

print("\nThe cos value of angles:")

print(np.cos(Arr))

print("\nThe tan value of angles:")

print(np.tan(Arr))

4.7 Comparisons, Masks and Boolean Logic

• Masking means to extract, modify, count or otherwise manipulate values in an array based on
some criterion.

• Boolean masking, also called boolean indexing, is a feature in Python NumPy that allows for
the filtering of values in numpy arrays. There are two main ways to carry out boolean masking:

a) Method one: Returning the result array.

b) Method two: Returning a boolean array.

Comparison operators as ufuncs

• The result of these comparison operators is always an array with a Boolean data type. All six of
the standard comparison operations are available. For example, we might wish to count all values
greater than a certain value, or perhaps remove all outliers that are above some threshold. In
NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

x = np.array([1,2,3,4,5])

print(x<3) # less than

print(x>3) # greater than

print(x<=3) # less than or equal

print(x>=3) #greater than or equal

print(x!=3) #not equal

print(x==3) #equal
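• The Boolean arrays produced by these comparisons can be used directly as masks to extract or count values, as a short sketch shows:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

mask = x < 3
print(mask)    # [ True  True False False False]

# Method one: use the mask to return the matching values
print(x[mask]) # [1 2]

# Count the values greater than 3
print(np.count_nonzero(x > 3))  # 2
print(np.sum(x > 3))            # True counts as 1, False as 0
```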

• Comparison operators and their equivalent ufuncs:

== : np.equal
!= : np.not_equal
< : np.less
<= : np.less_equal
> : np.greater
>= : np.greater_equal



Boolean array:

• A Boolean array is a NumPy array with Boolean (True/False) values. Such an array can be obtained by applying a logical operator to another NumPy array:

import numpy as np

a = np.reshape(np.arange(16), (4,4)) # create a 4x4 array of integers

print(a)

[[ 0 1 2 3]

[ 4 5 6 7]

[ 8 9 10 11]

[12 13 14 15]]

large_values = (a > 10) # test which elements of a are greater than 10

print(large_values)

[[False False False False]

[False False False False]

[False False False  True]

[ True  True  True  True]]

even_values = (a%2==0) # test which elements of a are even



print(even_values)

[[True False True False]

[True False True False]

[True False True False]

[True False True False]]

Logical operations on boolean arrays

• Boolean arrays can be combined using logical operators :

b = ~(a%3 == 0) # test which elements of a are not divisible by 3

print('array a:\n{}\n'.format(a))

print('array b:\n{}'.format(b))

array a:

[[ 0 1 2 3]

[ 4 5 6 7]

[ 8 9 10 11]

[12 13 14 15]]

array b:

[[False  True  True False]

[ True  True False  True]

[ True False  True  True]



[False  True  True False]]
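• Boolean arrays can also be combined with & (and), | (or) and ~ (not); parentheses are required around each comparison because these operators bind more tightly. A sketch on the same 4x4 array:

```python
import numpy as np

a = np.reshape(np.arange(16), (4, 4))

# Even AND greater than 3
both = (a > 3) & (a % 2 == 0)
print(a[both])    # [ 4  6  8 10 12 14]

# Less than 2 OR greater than 13
either = (a < 2) | (a > 13)
print(a[either])  # [ 0  1 14 15]
```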

4.8 Fancy Indexing

• With NumPy array fancy indexing, an array can be indexed with another NumPy array, a
Python list, or a sequence of integers, whose values select elements in the indexed array.

• Example: We first create a NumPy array with 11 floating-point numbers and then index the
array with another NumPy array and Python list, to extract element numbers 0, 2 and 4 from the
original array :

import numpy as np

A = np.linspace(0, 1, 11)

print(A)

print(A[np.array([0, 2, 4])])

# The same thing can be accomplished by indexing with a Python list

print(A[[0, 2, 4]])

Output:

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

[0. 0.2 0.4]

[0. 0.2 0.4]

4.9 Structured Arrays

• A structured NumPy array is an array of structures. NumPy arrays are homogeneous, i.e., they can contain data of one type only; so, instead of creating an array of ints or floats, we can create an array of homogeneous structures too.

• First of all import numpy module i.e.

import numpy as np

• Now to create a structure numpy array we can pass a list of tuples containing the structure
elements i.e.

[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]

• But as the elements of a NumPy array are homogeneous, how will the size and type of the structure be decided? For that we need to pass the type of the above structure, i.e., the schema, in the dtype parameter.

• Let's create a dtype for above structure i.e.

# Creating the type of a structure

dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]

• Let's create a numpy array based on this dtype i.e.

# Creating a structured NumPy array

structuredArr= np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6),
('Iresh', 99.9, 7)], dtype=dtype)

• It will create a structured numpy array and its contents will be,

[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]

• Let's check the data type of the above created NumPy array:

print(structuredArr.dtype)

Output:

[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
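• Individual fields of a structured array are selected by name, and Boolean masking works on them too. A sketch, reusing the dtype and data defined above:

```python
import numpy as np

dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
structuredArr = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5),
                          ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)], dtype=dtype)

# A field name inside [] selects one "column" of the structure
print(structuredArr['Name'])   # ['Ram' 'Rutu' 'Rupu' 'Iresh']
print(structuredArr['Marks'])  # [22.2 39.4 55.5 99.9]

# Boolean masking on one field filters whole records
print(structuredArr[structuredArr['Marks'] > 50]['Name'])  # ['Rupu' 'Iresh']
```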

Creating structured arrays:

• Structured array data types can be specified in a number of ways.

1. Dictionary method :

np.dtype({'names': ('name', 'age', 'weight'),

'formats': ('U10', 'i4', 'f8')})

Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

2. Numerical types can be specified with Python types or NumPy dtypes instead:

np.dtype({'names': ('name', 'age', 'weight'),



'formats':((np.str_, 10), int, np.float32)})

Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])

3. A compound type can also be specified as a list of tuples :

np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

NumPy data types:

• Below is a listing of all data types available in NumPy and the characters that represent them.

1) i - integer

2) b - boolean

3) u - unsigned integer

4) f - float

5) c - complex float

6) m - timedelta

7) M - datetime

8) O - object

9) S - string

10) U - unicode string

11) V - fixed chunk of memory for other types (void)
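These single-character codes surface as the `kind` attribute of a dtype, which can be checked directly:

```python
import numpy as np

# Each dtype carries its one-character code in the .kind attribute
k_int = np.dtype(np.int32).kind          # integer -> 'i'
k_float = np.dtype(np.float64).kind      # float -> 'f'
k_str = np.dtype('U10').kind             # unicode string -> 'U'
k_bool = np.dtype(bool).kind             # boolean -> 'b'
k_dt = np.dtype('datetime64[s]').kind    # datetime -> 'M'
print(k_int, k_float, k_str, k_bool, k_dt)
```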

4.10 Data Manipulation with Pandas

• Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the
Numpy package and its key data structure is called the DataFrame.

• DataFrames allow you to store and manipulate tabular data in rows of observations and
columns of variables.

• Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used
or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting
functions from Matplotlib and machine learning algorithms in Scikit-learn.

• Pandas is the library for data manipulation and analysis. Usually, it is the starting point for your
data science tasks. It allows you to read/write data from/to multiple sources. Process the missing
data, align your data, reshape it, merge and join it with other data, search data, group it, slice it.

Create DataFrame with Duplicate Data

• Duplicate data creates problems for a data science project. If the database is large, then processing
duplicate data means wastage of time.

• Finding duplicates is important because removing them saves time and space and avoids false
results. The drop_duplicates() function in pandas removes duplicate data easily and efficiently.

• Create Dataframe with Duplicate data

import pandas as pd

raw_data={'first_name': ['rupali', 'rupali', 'rakshita','sangeeta', 'mahesh', 'vilas'],

'last_name': ['dhotre', 'dhotre', 'dhotre','Auti', 'jadhav', 'bagad'],

'RNo': [12, 12, 1111111, 36, 24, 73],

'TestScore1': [4, 4, 4, 31, 2, 3],

'TestScore2': [25, 25, 25, 57, 62, 70]}

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'RNo', 'TestScore1',

'TestScore2'])

df

Drop duplicates

df.drop_duplicates()

• Drop duplicates in the first name column, but take the last observation in the duplicated set

df.drop_duplicates(['first_name'], keep='last')
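The two calls above can be sketched on a small self-contained frame (a trimmed-down, hypothetical version of the data above, where the first two rows are exact duplicates):

```python
import pandas as pd

# A small frame where row 1 is an exact copy of row 0
df = pd.DataFrame({'first_name': ['rupali', 'rupali', 'rakshita'],
                   'last_name': ['dhotre', 'dhotre', 'dhotre'],
                   'TestScore1': [4, 4, 31]})

full = df.drop_duplicates()                                # drops the exact-duplicate row
by_name = df.drop_duplicates(['first_name'], keep='last')  # one row per first_name, last kept
print(full)
print(by_name)
```

Note that drop_duplicates keeps the original index labels of the surviving rows; here the kept 'rupali' row is the one at index 1.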

Creating a Data Map and Data Plan

• Overview of dataset is given by data map. Data map is used for finding potential problems in
data, such as redundant variables, possible errors, missing values and variable transformations.

• Try creating a Python script that converts a Python dictionary into a Pandas DataFrame, then
print the DataFrame to screen.

import pandas as pd

scottish_hills={'Ben Nevis': (1345, 56.79685, -5.003508),

'Ben Macdui': (1309, 57.070453, -3.668262),

'Braeriach': (1296, 57.078628, -3.728024),

'Cairn Toul': (1291, 57.054611, -3.71042),

'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}

dataframe = pd.DataFrame(scottish_hills)

print(dataframe)

Manipulating and Creating Categorical Variables

• Categorical variable is one that has a specific value from a limited selection of values. The
number of values is usually fixed.

• Categorical features can only take on a limited, and usually fixed, number of possible values.
For example, if a dataset is about information related to users, then you will typically find
features like country, gender, age group, etc. Alternatively, if the data you are working with is
related to products, you will find features like product type, manufacturer, seller and so on.

• Method for creating a categorical variable and then using it to check whether some data falls
within the specified limits.

import pandas as pd

cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')

cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],

categories=cycle_colors, ordered=False))

find_entries = pd.isnull(cycle_data)

print(cycle_colors)

print(cycle_data)

print(find_entries[find_entries == True])

• Here cycle_colors is a categorical variable. It contains the values Blue, Red and Green as colors.
Any entry in cycle_data outside these categories (Yellow, Purple) becomes NaN, which is what
find_entries detects.

Renaming Levels and Combining Levels

• Data frame variable names are typically used many times when wrangling data. Good names
for these variables make it easier to write and read wrangling programs.

• Categorical data has a categories and an ordered property, which list the possible values and
whether the ordering matters or not.

• Renaming categories is done by assigning new values to the Series.cat.categories property or


by using the Categorical.rename_categories() method :

In [41]: s = pd.Series(["a","b","c","a"], dtype="category")

In [42]: s

Out[42]:

0    a
1    b
2    c
3    a

dtype: category

Categories (3, object): [a, b, c]

In [44]: s.cat.categories=["Group %s" % g for g in s.cat.categories]

In [45]: s

Out[45]:

0 Group a

1 Group b

2 Group c

3 Group a

dtype: category

Categories (3, object): [Group a, Group b, Group c]

In [46]: s.cat.rename_categories([1,2,3])

Out[46]:

0    1
1    2
2    3
3    1

dtype: category

Categories (3, int64): [1, 2, 3]
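The same renaming can be done without mutating the original Series, since rename_categories returns a new Series:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# rename_categories returns a new Series with relabelled categories
renamed = s.cat.rename_categories(["Group a", "Group b", "Group c"])
numbered = s.cat.rename_categories([1, 2, 3])

print(renamed.cat.categories.tolist())
print(numbered.tolist())
```

Only the category labels change; the underlying codes (which value maps to which category) stay the same, so "a, b, c, a" becomes "1, 2, 3, 1".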

Dealing with Dates and Times Values

• Dates are often provided in different formats and must be converted into a single format
(datetime objects) before analysis.

• Python provides two methods of formatting date and time.

1. str(): It turns a datetime value into a string without any formatting.

2. strftime(): It defines how the user wants the datetime value to appear after

conversion.

1. Using pandas.to_datetime() with a date

import pandas as pd

# input in dd.mm.yyyy format

date = ['21.07.2020']

# output in yyyy-mm-dd format

print(pd.to_datetime(date))

2. Using pandas.to_datetime() with a date and time

import pandas as pd

# date (dd.mm.yyyy) and time (H:MM:SS)

date = ['21.07.2020 11:31:01 AM']

# output in yyyy-mm-dd HH:MM:SS



print(pd.to_datetime(date))
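Both cases can be sketched explicitly. Passing dayfirst=True makes the dd.mm.yyyy intent unambiguous, and an explicit format string handles the AM/PM variant:

```python
import pandas as pd

# dayfirst=True makes the dd.mm.yyyy intent explicit
d1 = pd.to_datetime('21.07.2020', dayfirst=True)

# An explicit format string removes all ambiguity for the AM/PM variant
d2 = pd.to_datetime('21.07.2020 11:31:01 AM', format='%d.%m.%Y %I:%M:%S %p')

print(d1)
print(d2)
```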

• We can convert a string to datetime using strptime() function. This function is available in
datetime and time modules to parse a string to datetime and time objects respectively.

• Python strptime() is a class method in datetime class. Its syntax is :

datetime.strptime(date_string, format)

• Both the arguments are mandatory and should be string

import datetime

format = "%a %b %d %H:%M:%S %Y"

today = datetime.datetime.today()

print('ISO:', today)

s = today.strftime(format)

print('strftime:', s)

d = datetime.datetime.strptime(s, format)

print('strptime:', d.strftime(format))

$ python datetime_datetime_strptime.py

ISO : 2013-02-21 06:35:45.707450

strftime: Thu Feb 21 06:35:45 2013

strptime: Thu Feb 21 06:35:45 2013

• Time Zones: Within datetime, time zones are represented by subclasses of tzinfo. Since tzinfo
is an abstract base class, you need to define a subclass and provide appropriate implementations
for a few methods to make it useful.

Missing Data

• Data can have missing values for a number of reasons such as observations that were not
recorded and data corruption. Handling missing data is important as many machine learning
algorithms do not support data with missing values.

• You can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

# load and summarize the dataset

from pandas import read_csv

# load the dataset

dataset = read_csv('csv file name', header=None)

# summarize the dataset

print(dataset.describe())

• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
Values with a NaN value are ignored from operations like sum, count, etc.

• Use the isnull() method to detect the missing values. Pandas DataFrame provides a function
isnull(); it returns a new dataframe of the same size as the calling dataframe, containing only True
and False values: True wherever the original dataframe has NaN and False elsewhere.

Encoding missingness:

• The fillna() function is used to fill NA/NaN values using the specified method.

• Syntax :

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None,


downcast=None, **kwargs)

Where

1. value: It is a value that is used to fill the null values.

2. method: A method that is used to fill the null values.

3. axis: It takes int or string value for rows/columns.

4. inplace: If True, the calling dataframe is filled in place instead of returning a copy.

5. limit: It is an integer value that specifies the maximum number of consecutive


forward/backward NaN value fills.

6. downcast: It takes a dict that specifies what to downcast, e.g. float64 to int64.
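Typical fill strategies can be sketched on a small Series (note that newer pandas versions expose forward fill as its own ffill() method rather than fillna's method parameter):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0)        # value=0: every NaN becomes 0
filled_ffill = s.ffill()          # forward fill: propagate the last valid value
filled_mean = s.fillna(s.mean())  # a common choice: fill with the mean (2.5 here)

print(filled_const.tolist())
print(filled_ffill.tolist())
print(filled_mean.tolist())
```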

4.11 Hierarchical Indexing

• Hierarchical indexing is a method of creating structured group relationships in data.

• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two
dimensions. As we already know, a Series is a one-dimensional labelled NumPy array and a
DataFrame is usually a two-dimensional table whose columns are Series. In some instances, in
order to carry out some sophisticated data analysis and manipulation, our data is presented in
higher dimensions.

• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as the name
suggests is ordering more than one item in terms of their ranking.

• To create a DataFrame with player ratings of a few players from the FIFA 19 dataset:

In [1]: import pandas as pd

In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',

'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],

'Name': ['De Gea', 'Coutois', 'Allison', 'Van Dijk',

'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',

'Messi', 'Neymar'],

'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],

'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']}

In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])

In [4]: fifa19

Out[4]:

• From above Dataframe, we notice that the index is the default Pandas index; the columns
'Position' and 'Rank' both have values or objects that are repeated. This could sometimes pose a
problem for us when we want to analyse the data. What we would like to do is to use meaningful
indexes that uniquely identify each row and makes it easier to get a sense of the data we are
working with. This is where MultiIndex or Hierarchical Indexing comes in.

• We do this by using the set_index() method. For Hierarchical indexing, we use set_index()
method for passing a list to represent how we want the rows to be identified uniquely.

In [5]: fifa19.set_index(['Position', 'Rank'], drop=False)

In [6]: fifa19

Out[6]:

• We can see from the code above that we have set our new indexes to 'Position' and 'Rank', but
there is a replication of these columns. This is because we passed drop=False, which keeps the
columns where they are. The default, however, is drop=True, so without indicating drop=False the
two columns will be set as the indexes and the columns deleted automatically.

In [7]: fifa19.set_index(['Position', 'Rank'])

Out[7]: Name Overall

Position Rank

GK 1st De Gea 91

GK 3rd Coutois 88

GK 2nd Allison 89

DF 3rd Van Dijk 89

DF 1st Ramos 91

DF 2nd Godin 90

MF 2nd Hazard 91

MF 3rd Kante 90

MF 1st De Bruyne 92

CF 1st Ronaldo 94

CF 2nd Messi 93

CF 3rd Neymar 92

• We use set_index() with an ordered list of column labels to make the new indexes. To verify
that we have indeed set our DataFrame to a hierarchical index, we call the .index attribute.

In [8]: fifa19 = fifa19.set_index(['Position', 'Rank'])

In [9]: fifa19.index

Out[9]: MultiIndex(levels = [['CF', 'DF', 'GK', 'MF'],

['1st', '2nd', '3rd']],

codes = [[2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0],

[0, 2, 1, 2, 0, 1, 1, 2, 0, 0, 1, 2]],

names=['Position', 'Rank'])
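Once the hierarchical index is in place, rows can be selected level by level with .loc. A minimal sketch on a trimmed-down version of the frame above:

```python
import pandas as pd

# A trimmed-down version of the FIFA 19 frame above
data = {'Position': ['GK', 'GK', 'DF', 'DF'],
        'Rank': ['1st', '2nd', '1st', '2nd'],
        'Name': ['De Gea', 'Allison', 'Ramos', 'Godin'],
        'Overall': [91, 89, 91, 90]}
fifa19 = pd.DataFrame(data).set_index(['Position', 'Rank'])

# Select by the outer level only, or by a (level, level) tuple for one row
gk = fifa19.loc['GK']                    # all goalkeepers
ramos = fifa19.loc[('DF', '1st'), 'Name']
print(gk)
print(ramos)
```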

4.12 Combining Datasets

• Whether it is to concatenate several datasets from different csv files or to merge sets of
aggregated data from different google analytics accounts, combining data from various sources is
critical to drawing the right conclusions and extracting optimal value from data analytics.

• When using pandas, data scientists often have to concatenate multiple pandas DataFrame;
either vertically (adding lines) or horizontally (adding columns).

DataFrame.append

• This method allows adding another dataframe to an existing one. While columns with matching
names are concatenated together, columns with different labels are filled with NA. (Note that
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat is the
recommended replacement.)

>>>df1

ints bools

0 0 True

1 1 False

2 2 True

>>> df2

ints floats

0 3 1.5

1 4 2.5

2 5 3.5

>>> df1.append(df2)

ints bools floats

0 0 True NaN

1 1 False NaN

2 2 True NaN

0 3 NaN 1.5

1 4 NaN 2.5

2 5 NaN 3.5

• In addition to this, DataFrame.append provides other flexibilities such as resetting the resulting
index, sorting the resulting data or raising an error when the resulting index includes duplicate
records.

Pandas.concat

• We can concat dataframes both vertically (axis=0) and horizontally (axis=1) by using the
pandas.concat function. Unlike DataFrame.append, pandas.concat is not a method but a function
that takes a list of objects as input. As with DataFrame.append, columns with different labels are
filled with NA values.

>>> df3

bools floats

0 False 4.5

1 True 5.5

2 False 6.5

>>>pd.concat([df1, df2, df3])

ints bools floats

0 0.0 True NaN

1 1.0 False NaN

2 2.0 True NaN

0 3.0 NaN 1.5

1 4.0 NaN 2.5

2 5.0 NaN 3.5

0 NaN False 4.5

1 NaN True 5.5

2 NaN False 6.5
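The vertical and horizontal behaviours can be sketched together (df1 and df2 rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'ints': [0, 1, 2], 'bools': [True, False, True]})
df2 = pd.DataFrame({'ints': [3, 4, 5], 'floats': [1.5, 2.5, 3.5]})

stacked = pd.concat([df1, df2])                       # vertical (axis=0), union of columns
side = pd.concat([df1, df2], axis=1)                  # horizontal, aligned on the index
reindexed = pd.concat([df1, df2], ignore_index=True)  # vertical with a fresh 0..5 index

print(stacked.shape, side.shape, reindexed.index.tolist())
```

ignore_index=True is the pd.concat counterpart of resetting the duplicated 0, 1, 2, 0, 1, 2 index seen in the output above.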

4.13 Aggregation and Grouping

• Pandas aggregation methods are as follows:

a) count() Total number of items

b) first(), last(): First and last item

c) mean(), median(): Mean and median

d) min(), max(): Minimum and maximum

e) std(), var(): Standard deviation and variance



f) mad(): Mean absolute deviation

g) prod(): Product of all items

h) sum(): Sum of all items.
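The aggregation methods listed above can be tried on any Series directly:

```python
import pandas as pd

scores = pd.Series([4, 24, 31, 2, 3])

print(scores.count())              # number of items: 5
print(scores.sum())                # 64
print(scores.mean())               # 64 / 5 = 12.8
print(scores.min(), scores.max())  # 2 31
print(scores.std())                # sample standard deviation
```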

• The examples below use a sample CSV file (phone_data.csv) with columns such as date,
duration, item, month and network.

• The date column can be parsed using the extremely handy dateutil library.

import pandas as pd

import dateutil

# Load data from csv file

data = pd.read_csv('phone_data.csv')

# Convert date from string to date times

data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)



• Once the data has been loaded into Python, Pandas makes the calculation of different statistics
very simple. For example, mean, max, min, standard deviations and more for columns are easily
calculable:

# How many rows are in the dataset?

data['item'].count()

Out[38]: 830

# What was the longest phone call / data entry?

data['duration'].max()

Out[39]: 10528.0

# How many seconds of phone calls are recorded in total?

data['duration'][data['item'] == 'call'].sum()

Out[40]: 92321.0

# How many entries are there for each month?

data['month'].value_counts()

Out[41]:

2014-11 230

2015-01 205

2014-12 157

2015-02 137

2015-03 101

dtype: int64

# Number of non-null unique network entries

data['network'].nunique()

Out[42]: 9

groupby() function :

• groupby essentially splits the data into different groups depending on a variable of user choice.

• The groupby() function returns a GroupBy object, which essentially describes how the rows of
the original data set have been split. The GroupBy object's groups attribute is a dictionary whose
keys are the computed unique groups and whose corresponding values are the axis labels belonging
to each group.

• Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object
to obtain summary statistics for each group.

• The GroupBy object supports column indexing in the same way as the DataFrame and returns a
modified GroupBy object.
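The split-then-aggregate pattern can be sketched on a small hypothetical slice of the phone log used above:

```python
import pandas as pd

# A hypothetical slice of the phone log
data = pd.DataFrame({'item': ['call', 'sms', 'call', 'data', 'call'],
                     'duration': [120.0, 1.0, 90.0, 30.0, 45.0]})

grouped = data.groupby('item')        # splits the rows by their 'item' value
totals = grouped['duration'].sum()    # one aggregate value per group

print(dict(grouped.groups))           # group label -> row labels
print(totals)
```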

4.14 Pivot Tables

• A pivot table is a similar operation that is commonly seen in spreadsheets and other programs
that operate on tabular data. The pivot table takes simple column-wise data as input, and groups
the entries into a two-dimensional table that provides a multidimensional summarization of the
data.

• A pivot table is a table of statistics that helps summarize the data of a larger table by "pivoting"
that data. Pandas gives access to creating pivot tables in Python using the .pivot_table() function.

• The syntax of the .pivot_table() function:

import pandas as pd

pd.pivot_table(

data=,

values=None,

index=None,

columns=None,

aggfunc='mean',

fill_value=None,

margins=False,

dropna=True,

margins_name='All',

observed=False,

sort=True

)
• To use the pivot method in Pandas, we need to specify three parameters:

1. Index: Which column should be used to identify and order the rows vertically.

2. Columns: Which column should be used to create the new columns in reshaped DataFrame.
Each unique value in the column stated here will create a column in new DataFrame.

3. Values: Which column(s) should be used to fill the values in the cells of DataFrame.

• Import modules:

import pandas as pd

Create dataframe :

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons',

'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],

'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],

'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}

df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'TestScore'])

df

• Create a pivot table of group means, by company and regiment

pd.pivot_table(df,index=['regiment','company'], aggfunc='mean')

• Create a pivot table of group score counts, by company and regiment

df.pivot_table(index=['regiment', 'company'], aggfunc='count')
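Both calls can be sketched end to end on a reduced version of the data above:

```python
import pandas as pd

# A reduced version of the regiment data above, two scores per group
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'TestScore': [4, 24, 31, 2, 3, 4, 24, 31]}
df = pd.DataFrame(raw_data)

# Group means and group counts, indexed by (regiment, company)
means = pd.pivot_table(df, index=['regiment', 'company'], aggfunc='mean')
counts = df.pivot_table(index=['regiment', 'company'], aggfunc='count')

print(means)
print(counts)
```

The result has a hierarchical (regiment, company) index, tying this section back to the MultiIndex discussion above.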



Two Marks Questions with Answers

Q.1 Define data wrangling ?

Ans. Data wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.

Q.2 What is Python?

Ans. Python is a high-level scripting language which can be used for a wide variety of text
processing, system administration and internet-related tasks. Python is a true object-oriented
language and is available on a wide variety of platforms.

Q.3 What is NumPy ?

Ans. NumPy, short for Numerical Python, is the core library for scientific computing in Python.
It has been designed specifically for performing basic and advanced array operations. It primarily
supports multi-dimensional arrays and vectors for complex arithmetic operations.

Q.4 What is an aggregation function ?

Ans. An aggregation function is one which takes multiple individual values and returns a
summary. In the majority of the cases, this summary is a single value. The most common
aggregation functions are a simple average or summation of values.

Q.5 What is Structured Arrays?



Ans. A structured NumPy array is an array of structures. As NumPy arrays are homogeneous, i.e.
they can contain data of the same type only, instead of creating a NumPy array of int or float we
can create a NumPy array of homogeneous structures too.

Q.6 Describe Pandas.

Ans. Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on
the Numpy package and its key data structure is called the DataFrame. DataFrames allow you to
store and manipulate tabular data in rows of observations and columns of variables. Pandas is
built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated
in Pandas.

Q.7 How are categorical variables created and manipulated?

Ans. Categorical variable is one that has a specific value from a limited selection of values. The
number of values is usually fixed. Categorical features can only take on a limited and usually
fixed, number of possible values. For example, if a dataset is about information related to users,
then user will typically find features like country, gender, age group, etc. Alternatively, if the
data we are working with is related to products, you will find features like product type,
manufacturer, seller and so on.

Q.8 Explain Hierarchical Indexing.

Ans. Hierarchical indexing is a method of creating structured group relationships in data. A


MultiIndex or Hierarchical index comes in when our DataFrame has more than two dimensions.
As we already know, a Series is a one-dimensional labelled NumPy array and a DataFrame is
usually a two-dimensional table whose columns are Series. In some instances, in order to carry
out some sophisticated data analysis and manipulation, our data is presented in higher
dimensions.

Q.9 What is a pivot table?

Ans. A pivot table is an operation commonly seen in spreadsheets and other
programs that operate on tabular data. The pivot table takes simple column-wise data as input
and groups the entries into a two-dimensional table that provides a multidimensional
summarization of the data.

