Python For Data Science: NumPy and Pandas
NumPy is a Python library for mathematical and scientific computing that can handle
multidimensional data. NumPy allows us to create ndarrays, which are n-dimensional array objects
optimized for computational efficiency, and provides a variety of functions for accessing, manipulating,
performing operations on, and exporting ndarrays.
Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make
working with structured (tabular, multidimensional, potentially heterogeneous) data both easy and
intuitive. It is built on top of NumPy and provides a variety of functions for accessing,
manipulating, analyzing, and exporting structured data.
2. What are the differences between a Python list and a NumPy array?
Python list | NumPy array
Can have elements of different datatypes | All elements are of the same datatype
Requires more memory for storage, as the datatype of each element is stored separately | Requires less memory for storage, as the datatype of each element is not stored separately
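The two differences can be checked with a short sketch. The sample values and variable names below are assumptions chosen for illustration, and the byte counts are only a rough comparison that varies by platform.

```python
import sys
import numpy as np

# A list can mix datatypes; each element is a separate Python object.
mixed_list = [1, "two", 3.0]

# An array holds a single datatype, recorded once for the whole array.
int_array = np.array([1, 2, 3])
print(int_array.dtype)        # one integer dtype shared by all elements

# Mixed input is coerced to one common dtype (here: strings).
coerced = np.array(mixed_list)
print(coerced.dtype.kind)     # 'U' stands for Unicode string

# Rough memory comparison for 1000 integers (illustrative data).
values = list(range(1000))
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = np.array(values).nbytes
print(list_bytes, array_bytes)
```

The list total counts the list object plus every integer object it points to, while nbytes counts only the raw data buffer of the array.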
3. I used the drop() function to drop a column from the dataframe but the changes are not
reflected in the original data. How can I resolve this?
The drop() function has a parameter 'inplace', which is set to False by default. This ensures that the
function does not change the original dataframe and instead returns a copy of the dataframe with the
specified changes. Set the inplace parameter to True to make the changes in the original data, or
assign the returned dataframe back to a variable.
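A minimal sketch of both behaviours, using a small made-up dataframe (the column names a, b, and c are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Default (inplace=False): a new dataframe is returned, df is unchanged.
dropped = df.drop("b", axis=1)
print(list(df.columns))       # ['a', 'b', 'c']
print(list(dropped.columns))  # ['a', 'c']

# inplace=True: df itself is modified and the call returns None.
df.drop("c", axis=1, inplace=True)
print(list(df.columns))       # ['a', 'b']
```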
Aggregation refers to performing an operation on multiple rows corresponding to a single column,
reducing them to a single value. Some examples of aggregation functions are as follows:
sum: returns the sum of the values for the requested axis.
min: returns the minimum of the values for the requested axis.
max: returns the maximum of the values for the requested axis.
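As an illustration, the three aggregations can be applied to a small made-up dataframe (the price and qty columns are assumptions, not data from this course):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"price": [10, 20, 30], "qty": [1, 5, 2]})

# With the default axis=0, each column is reduced to one value.
col_sums = df.sum()
col_min = df.min()
col_max = df.max()

# agg applies several aggregations at once, one row per aggregation.
summary = df.agg(["sum", "min", "max"])
print(summary)
```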
How to resolve?
The FileNotFoundError is a very common error. It is caused by a mismatch between the directory in
which the code searches for the file and the directory where the file is actually located. To resolve it,
check the following:
1. Ensure that the data to be loaded and the notebook that loads it are stored in the same
folder.
2. Ensure that the name of the dataset is correct (check for lowercase and uppercase, check for
spaces, etc.)
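Both checks can be sketched in code. The helper find_similar_names and the file name data.csv are hypothetical, introduced only for this illustration; the helper lists files whose names differ from the requested one only by case or surrounding spaces.

```python
from pathlib import Path


def find_similar_names(file_name, folder="."):
    """Return files in `folder` whose names match `file_name`
    when case and surrounding spaces are ignored."""
    wanted = file_name.strip().lower()
    return [p.name for p in Path(folder).iterdir()
            if p.is_file() and p.name.strip().lower() == wanted]


file_name = "data.csv"  # hypothetical file name
print("Working directory:", Path.cwd())
if Path(file_name).exists():
    print("File found.")
else:
    print("Not found. Similar names here:", find_similar_names(file_name))
```

Printing the working directory shows where the code is actually searching, which is often not the folder you expect.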
loc is a label-based indexing method to access rows and columns of pandas objects. When using the loc
method on a dataframe, we specify which rows and columns to access by using the following format:
df.loc[row_labels, column_labels]
iloc is an integer-based indexing method to access rows and columns of pandas objects. When using the
iloc method on a dataframe, we specify which rows and which columns we want by using the following
format: df.iloc[row_positions, column_positions]
Here df stands for any dataframe.
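A small sketch of both indexers on a made-up dataframe (the row labels r1 to r3 and the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data with string row labels.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "age"]      # row label "r2", column label "age"
by_position = df.iloc[1, 1]         # second row, second column

# Lists select several rows and columns at once.
subset_loc = df.loc[["r1", "r3"], ["name"]]
subset_iloc = df.iloc[[0, 2], [0]]
print(by_label, by_position)
```

Both pairs of calls select the same cells; only the way the rows and columns are addressed differs.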
'shape' is an attribute of a pandas object, so it is accessed without parentheses (df.shape). It
describes the data itself and returns a tuple of the form (number of rows, number of columns).
'head' is a method provided by pandas and is computed only when we call it, so it needs parentheses
(df.head()). When called, it returns the first 5 rows of the data by default.
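The difference in syntax can be seen in a short sketch (the dataframe below is made-up data for illustration):

```python
import pandas as pd

# Hypothetical sample data: 10 rows, 2 columns.
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

print(df.shape)            # attribute, no parentheses: (10, 2)

first_five = df.head()     # method call, default of 5 rows
first_three = df.head(3)   # the number of rows can be passed explicitly
print(first_three)
```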
Ans: The axis parameter defines whether rows or columns are to be dropped: axis=0 (the default) drops
rows by index label, and axis=1 drops columns by name.
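A short sketch of how the axis parameter changes what drop() removes, using a made-up dataframe:

```python
import pandas as pd

# Hypothetical sample data with the default integer index 0, 1, 2.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

without_row = df.drop(0, axis=0)     # drops the row labelled 0
without_col = df.drop("b", axis=1)   # drops the column named "b"

print(without_row)
print(without_col)
```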
import numpy as np
vec = np.linspace(10, 20, 3)
print(vec)
The linspace function of NumPy returns evenly spaced samples over a given interval.
vec = np.linspace(10, 20, 3)
In the above code, 10 denotes the starting point, 20 denotes the ending point, and 3 denotes the number
of samples required. Please note that both the start and end points are included when calculating the
evenly spaced samples.
The formula below gives the spacing between consecutive samples: (End - Start) / (N - 1)
N - Number of points
For the code above, the spacing is (20 - 10) / (3 - 1) = 5, so the samples are 10, 15, and 20.
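The spacing formula can be verified directly. The sketch below also shows the endpoint parameter of np.linspace, which is not discussed above: with endpoint=False the end point is excluded, so the spacing formula above no longer applies.

```python
import numpy as np

start, end, n = 10, 20, 3

vec = np.linspace(start, end, n)
step = (end - start) / (n - 1)   # spacing when both end points are included
print(vec)                       # [10. 15. 20.]
print(step)                      # 5.0

# With endpoint=False the end point is left out of the samples.
open_vec = np.linspace(start, end, n, endpoint=False)
print(open_vec)
```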
10. Please explain what is meant by Uniform Random Variable and Standard Normal Random Variable.
Ans: Uniform Random Variable (np.random.rand()): this generates uniform random numbers in the
range 0 to 1. If you generate a large number of these random numbers, their mean will get close to
0.5. Try running the below code:
d = np.random.rand(100, 100)
d.mean()
Standard Normal Random Variable (np.random.randn()): this generates random numbers drawn from a
distribution with a mean of 0 and a standard deviation of 1. You will study the Standard Normal
Distribution in depth in the upcoming course. Try observing the mean and std of the output of randn:
e = np.random.randn(100, 100)
e.mean()
e.std()
Here the issue is that vec is defined as a list. When we apply the np.where function to a condition
on a list that has multiple values, Python is unable to decide which value should be checked against
the given condition. In that case, you can compare one element at a time to get the desired
output. Check the below code:
An alternative, easier way is to convert the given list into an array and then write the where code. If an
array is given, the where statement will compare and check the values one by one. This is an
application where arrays are handy. Try using the below code, where I have converted the given list
into an array, and check the output:
print(vec)
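Both approaches can be sketched as follows. The values in vec and the condition (greater than 5) are assumptions chosen for illustration, since the original question's code is not reproduced here.

```python
import numpy as np

vec = [3, 8, 1, 9]  # hypothetical values; the original list is not shown

# Comparing a whole list with a number is not defined, so this fails
# before np.where is even called.
try:
    np.where(vec > 5)
    list_comparison_failed = False
except TypeError:
    list_comparison_failed = True

# Option 1: compare one element at a time.
idx_loop = [i for i, v in enumerate(vec) if v > 5]

# Option 2: convert to an array; the comparison is then element-wise.
arr = np.array(vec)
idx_array = np.where(arr > 5)   # a tuple holding one array of positions
print(idx_loop, idx_array[0])
```

Note that np.where with only a condition returns a tuple of index arrays, one per dimension, which is why the positions are read from idx_array[0].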
Below are the main differences between the del statement and the drop function:
drop operates on both columns and rows; del operates on columns only.
drop can operate on multiple items at a time; del operates on only one at a time.
drop can operate in-place or return a copy; del is an in-place operation only.
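The three differences can be illustrated on a made-up dataframe (the column names are assumptions for this sketch):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6],
                   "c": [7, 8, 9], "d": [0, 0, 0]})

# del removes exactly one column and always modifies df in place.
del df["d"]

# drop can remove several columns at once and returns a copy by default.
smaller = df.drop(columns=["b", "c"])

# drop also works on rows, which del cannot do.
fewer_rows = df.drop(index=[0, 1])

print(list(df.columns), list(smaller.columns), list(fewer_rows.index))
```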
-np.array ([1,2,3,1],[4,5,6,4],[7,8,9,7])
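np.array expects the data as a single object in its first argument; the second positional argument is the datatype. A minimal sketch, assuming the intent of the call above was a two-dimensional array with three rows and four columns, nests the rows inside one outer list:

```python
import numpy as np

# Assumption: the intended result is a 3x4 two-dimensional array.
# The rows are wrapped in one outer list so np.array receives a single object.
matrix = np.array([[1, 2, 3, 1], [4, 5, 6, 4], [7, 8, 9, 7]])

print(matrix.shape)   # (3, 4)
print(matrix.ndim)    # 2
```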
14. When to use iloc command and when to use loc command?
Both loc and iloc are used for slicing the data of a pandas dataframe. The difference is that loc is a
label-based selection method, which means that we have to pass the name of the row or column that we
want to select, whereas iloc is an integer-position-based selection method, which means that we have
to pass the integer index to select a specific row/column. Use loc when you know the labels and iloc
when you know the positions.
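The choice matters most when the index labels are not the same as the row positions. The sketch below uses made-up data with the labels 10, 20, 30, and 40 to show this, along with one further difference: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical data whose index labels differ from the row positions.
df = pd.DataFrame({"score": [90, 80, 70, 60]}, index=[10, 20, 30, 40])

by_label = df.loc[20, "score"]   # the row labelled 20
by_position = df.iloc[2, 0]      # the third row, first column

# Slices: loc includes the end label, iloc excludes the end position.
loc_slice = df.loc[10:30]        # rows labelled 10, 20, and 30
iloc_slice = df.iloc[0:2]        # rows at positions 0 and 1

print(by_label, by_position, len(loc_slice), len(iloc_slice))
```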