Training Report on Data Science
A PROJECT REPORT
By: Soham B. Mistry
Enrollment No: 216370307055
Diploma Engineering
in
Computer Engineering
Aug, 2023
Certificate:
DECLARATION
I hereby certify that the work presented in this report, entitled
“Data Science”, in fulfilment of the requirement for completion of one-month
industrial training in the Department of Electronics and Communication
Engineering of “University Institute of Engineering and Technology,
Kurukshetra University”, is an authentic record of my own work carried out
during the industrial training.
Soham
216370305
ACKNOWLEDGEMENT
The work in this report is the outcome of continuous work over a period of
time and drew intellectual support from Internshala and other sources. I would
like to express my profound gratitude and indebtedness to Internshala, which
helped me in the completion of the training. I am thankful to the Internshala
Training Associates for teaching and assisting me in making the training successful.
Soham
216370305
Introduction to Organization:
Company name:
STYPIX
Address:
A-903, Siddhivinayak Business Tower Nr, Vasna Telephone Exchange, Makarba, Ahmedabad, Gujarat 380051
About Us:
STYPIX delivers best-in-class custom software solutions, elite software development teams and innovative
cloud software to enterprise businesses across numerous industries. We believe new technologies are the lifeline
of every business in the modern age, and we aim to connect businesses across all industries to innovative software,
technological development, solutions and services, in a manner that’s faster, easier and better than ever before.
Our mission is to create leading, innovative software that creates economic and social value on a global scale,
collaborating closely with our clients to help them achieve both their short- and longer-term goals.
Website URL:
https://www.stypix.co.in/
About Training:
The Data Science Training by Internshala is a 6-week online training program
in which Internshala aims to provide a comprehensive introduction to data
science. In this training program, you will learn the basics of Python,
statistics, predictive modeling, and machine learning. The program has video
tutorials and is packed with assignments, assessment tests, quizzes, and
practice exercises to give you hands-on learning experience. At the end of
this training program, you will have a solid understanding of data science
and will be able to build an end-to-end predictive model. For doubt clearing,
you can post your queries on the forum and get answers within 24 hours.
Table of Contents
Introduction to Organization
About Training
Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and
probability to forecast or estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to
purchase our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) where computers learn to
act and adapt to new data without being explicitly programmed to do so. The computer is
able to act independently of human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past
and present data, most commonly by analysis of trends. "Guessing" doesn't cut
it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This
logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a
broken watch is right twice a day.
Netflix knew that significant numbers of people who liked Fincher also liked Wright.
All this information combined to suggest that buying the series would be a good
investment for the company.
b. Relational Operators:
Relational operators compare values. They return either True or False according to the condition.
OPERATOR DESCRIPTION SYNTAX
> Greater than: True if left operand is greater than the right x > y
< Less than: True if left operand is less than the right x < y
== Equal to: True if both operands are equal x == y
!= Not equal to: True if operands are not equal x != y
>= Greater than or equal to: True if left operand is greater than or equal to the right x >= y
<= Less than or equal to: True if left operand is less than or equal to the right x <= y
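A minimal sketch (added for illustration, not from the original report) of these operators:
x, y = 10, 5
print(x > y)   # True
print(x < y)   # False
print(x == y)  # False
print(x != y)  # True
print(x >= y)  # True
print(x <= y)  # False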
c. Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
OPERATOR DESCRIPTION SYNTAX
and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if the operand is false not x
d. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.
OPERATOR DESCRIPTION SYNTAX
& Bitwise AND x & y
| Bitwise OR x | y
^ Bitwise XOR x ^ y
~ Bitwise NOT ~x
>> Bitwise right shift x >> y
<< Bitwise left shift x << y
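A short illustrative sketch (added, not part of the original) of logical and bitwise operators:
a, b = True, False
print(a and b)  # False
print(a or b)   # True
print(not a)    # False
x, y = 10, 4    # binary 0b1010 and 0b0100
print(x & y)    # 0
print(x | y)    # 14
print(x ^ y)    # 14
print(~x)       # -11
print(x >> 1)   # 5
print(x << 1)   # 20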
e. Assignment operators:
Assignment operators are used to assign values to the variables.
OPERATOR DESCRIPTION SYNTAX
= Assign the right operand to the left a = b
+= Add and assign a += b (same as a = a + b)
-= Subtract and assign a -= b
*= Multiply and assign a *= b
/= Divide and assign a /= b
%= Take modulus and assign a %= b
//= Floor-divide and assign a //= b
**= Exponentiate and assign a **= b
&= Bitwise AND and assign a &= b
|= Bitwise OR and assign a |= b
^= Bitwise XOR and assign a ^= b
>>= Right shift and assign a >>= b
<<= Left shift and assign a <<= b
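A brief illustrative sketch of assignment operators in action:
a = 5
a += 3   # same as a = a + 3 -> 8
a *= 2   # 16
a <<= 1  # 32
print(a)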
Operator precedence (highest to lowest):
OPERATOR DESCRIPTION ASSOCIATIVITY
() Parentheses left-to-right
** Exponent right-to-left
* / % Multiplication / division / modulus left-to-right
+ - Addition / subtraction left-to-right
c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement.
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
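A quick sketch (added for illustration) of multiple assignment and deletion:
x, y, z = 1, 2.5, "three"   # assign values to multiple variables in one statement
a = b = c = 0               # assign the same value to multiple variables
del x                       # delete a variable with the del keyword
print(a, b, c)              # 0 0 0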
Data Types:
A. Python Numbers:
There are four numeric Python data types.
a. int
int stands for integer. This Python Data Type holds signed integers. We can use the
type() function to find which class it belongs to.
b. float
This Python Data Type holds floating-point real values. An int can only store the
number 3, but float can store 3.25 if you want.
c. long
This Python Data type holds a long integer of unlimited length. But this construct
does not exist in Python 3.x.
d. complex
This Python data type holds a complex number. A complex number looks like this:
a+bj. Here, a is the real part, b is the imaginary part, and j is the imaginary unit.
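A small illustrative sketch of the numeric types and the type() function:
a = 3          # int
b = 3.25       # float
c = 2 + 3j     # complex
print(type(a), type(b), type(c))
big = 10 ** 20           # in Python 3.x, int has unlimited precision,
print(type(big))         # so no separate long type is needed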
B. Strings:
A string is a sequence of characters. Python does not have a char data type, unlike
C++ or Java. You can delimit a string using single quotes or double-quotes.
a. Spanning a String Across Lines:
To span a string across multiple lines, you can use triple quotes.
b. Displaying Part of a String:
You can display a character from a string using its index in the string. Remember,
indexing starts with 0.
c. String Formatters:
String formatters allow us to print characters and values at once. You can use the
% operator.
d. String Concatenation:
You can concatenate (join) strings using the + operator. However, you cannot
concatenate values of different types.
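An illustrative sketch (added here) covering the string features above:
s = """This string
spans multiple lines"""          # triple quotes span a string across lines
name = "Python"
print(name[0])                   # 'P' -- indexing starts at 0
print("Hello, %s!" % name)       # string formatter with the % operator
print("Hello, " + name)          # concatenation with +
# print("Version " + 3)          # TypeError: cannot concatenate str and int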
C. Python Lists:
A list is a collection of values. Remember, it may contain different types of values.
To define a list, you must put values separated with commas in square brackets.
You don’t need to declare a type for a list either.
a. Slicing a List
You can slice a list the way you’d slice a string, with the slicing operator. Indexing
for a list begins with 0, like for a string. Python doesn’t have built-in arrays, only lists.
b. Length of a List
Python supports an inbuilt function to calculate the length of a list.
c. Reassigning Elements of a List
A list is mutable. This means that you can reassign elements later on.
d. Iterating on the List
To iterate over the list we can use the for loop. By iterating, we can access each
element one by one which is very helpful when we need to perform some
operations on each element of list.
e. Multidimensional Lists
A list may have more than one dimension; that is, a list can contain other lists as its
elements.
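A brief illustrative sketch covering the list operations above:
nums = [10, 20, 30, 40]
print(nums[1:3])      # slicing -> [20, 30]
print(len(nums))      # length -> 4
nums[0] = 99          # lists are mutable, so elements can be reassigned
for n in nums:        # iterating with a for loop
    print(n)
matrix = [[1, 2], [3, 4]]   # a multidimensional (nested) list
print(matrix[1][0])         # 3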
D. Python Tuples:
A tuple is like a list. You declare it using parentheses instead.
a. Accessing and Slicing a Tuple
You access a tuple the same way as you’d access a list. The same goes for slicing
it.
b. A tuple is Immutable
Python tuple is immutable. Once declared, you can’t change its size or elements.
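A minimal sketch of tuple access and immutability:
t = (1, 2, 3)
print(t[0], t[1:])    # access and slice a tuple just like a list
# t[0] = 9            # TypeError: tuples are immutable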
E. Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated
by commas. Separate keys and values with a colon (:). The type() function works with
dictionaries too.
a. Accessing a Value
To access a value, you mention the key in square brackets.
b. Reassigning Elements
You can reassign a value to a key.
c. List of Keys
Use the keys() function to get a list of keys in the dictionary.
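A small illustrative sketch of these dictionary operations:
person = {"name": "Soham", "age": 20}
print(person["name"])        # access a value by its key
person["age"] = 21           # reassign a value to a key
print(list(person.keys()))   # ['name', 'age']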
F. Bool:
A Boolean value can be True or False.
G. Sets:
A set can have a list of values. Define it using curly braces. It returns only one
instance of any value present more than once. However, a set is unordered, so it
doesn’t support indexing. Also, it is mutable. You can change its elements or add
more. Use the add() and remove() methods to do so.
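An illustrative sketch of sets:
s = {1, 2, 2, 3}
print(s)          # {1, 2, 3} -- only one instance of each duplicate value
s.add(4)          # sets are mutable
s.remove(1)
print(s)          # {2, 3, 4}; unordered, so indexing is not supported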
H. Type Conversion:
Since Python is dynamically-typed, you may want to convert a value into another
type. Python supports a list of functions for the same.
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
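A quick sketch (added here) of these conversion functions:
print(int("42") + 1)       # 43
print(float(3))            # 3.0
print(bool(0), bool(5))    # False True
print(set([1, 1, 2]))      # {1, 2}
print(list("abc"))         # ['a', 'b', 'c']
print(tuple([1, 2]))       # (1, 2)
print(str(3.14))           # '3.14'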
2.4. Conditional Statements
a. If statements
The if statement is one of the most commonly used conditional statements in most
programming languages. It decides whether certain statements need to be
executed or not. An if statement checks a given condition; if the condition is true,
the set of code present inside the if block is executed.
The if condition evaluates a Boolean expression and executes the block of code
only when the Boolean expression becomes TRUE.
Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if the condition is true
b. If-else statements
The statement itself tells that if a given condition is true, the statements
inside the if block are executed, and if the condition is false, the else block
is executed.
The else block executes only when the condition becomes false; this is the block
where you perform some actions when the condition is not true.
The if-else statement evaluates the Boolean expression and executes the block of code
inside the if block if the condition becomes TRUE, and the block of code inside the
else block if the condition becomes FALSE.
Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if condition is true
else:
    Block of code  # set of statements to execute if condition is false
c. elif statements
In Python, we have one more conditional statement, called the elif statement. The
elif statement is used to check multiple conditions when the given if condition is
false. It is similar to an if-else statement, the only difference being that in else we
do not check a condition, whereas in elif we do.
Elif statements are similar to if-else statements, but elif statements can evaluate
multiple conditions.
Syntax:
if (condition):
    # set of statements to execute if condition is true
elif (condition):
    # set of statements to be executed when the if condition is false and the elif condition is true
else:
    # set of statements to be executed when both the if and elif conditions are false
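A concrete runnable example (added for illustration):
num = -3
if num > 0:
    print("Positive")
elif num == 0:
    print("Zero")
else:
    print("Negative")   # this branch runs, since -3 fails both conditions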
d. Nested if statements
An if statement can be placed inside another if or else block.
Syntax:
if (condition):
    # statements to execute if the outer condition is true
    if (condition):
        # statements to execute if the inner condition is true
    else:
        # statements to execute if the inner condition is false
else:
    # statements to execute if the outer condition is false
e. elif Ladder
We have seen elif statements, but what is an elif ladder? As the name suggests,
it is a program containing a ladder of elif statements, i.e., elif statements
structured in the form of a ladder.
This statement is used to test multiple expressions.
Syntax:
if (condition):
    # set of statements to execute if condition is true
elif (condition):
    # set of statements to be executed when the if condition is false and this elif condition is true
elif (condition):
    # set of statements to be executed when the if and first elif conditions are false and this elif condition is true
elif (condition):
    # set of statements to be executed when the if, first elif and second elif conditions are false and this elif condition is true
else:
    # set of statements to be executed when all if and elif conditions are false
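A short runnable elif-ladder sketch (illustrative):
marks = 72
if marks >= 80:
    grade = "A"
elif marks >= 60:
    grade = "B"
elif marks >= 40:
    grade = "C"
else:
    grade = "F"
print(grade)  # B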
2.5. Looping Constructs
Loops:
a. while loop:
Repeats a statement or group of statements while a given condition is TRUE. It
tests the condition before executing the loop body.
Syntax:
while expression:
    statement(s)
b. for loop:
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)
c. nested loops:
You can use one or more loops inside another while or for loop.
Syntax of a nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)
Syntax of a nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)
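A brief illustrative sketch of while, for, and nested loops:
i = 0
while i < 3:          # while loop: runs while the condition is true
    print(i)          # 0 1 2
    i += 1
for ch in "abc":      # for loop over a sequence
    print(ch)
for i in range(2):    # nested loops: prints every pair (i, j)
    for j in range(2):
        print(i, j)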
Loop Control Statements:
a. break statement:
Terminates the loop statement and transfers execution to the statement
immediately following the loop.
b. continue statement:
Causes the loop to skip the remainder of its body and immediately retest its
condition prior to reiterating.
c. pass statement:
The pass statement in Python is used when a statement is required syntactically
but you do not want any command or code to execute.
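A sketch (illustrative) of break, continue and pass:
for n in range(5):
    if n == 1:
        continue   # skip the rest of this iteration
    if n == 3:
        break      # terminate the loop entirely
    if n == 2:
        pass       # syntactically required, does nothing
    print(n)       # prints 0 and 2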
2.6. Functions
A. Built-in Functions or pre-defined functions:
These are the functions that are already defined by Python, for example: id(),
type(), print(), etc.
B. User-Defined Functions:
These are functions defined by users for simplicity and to avoid repetition of
code. They are created using the def keyword.
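A minimal sketch of a user-defined function:
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name
print(greet("Soham"))   # Hello, Soham
print(type(greet))      # while id(), type() and print() are built-ins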
2.7. Data Structure
Python has built-in support for data structures which enable you to store and
access data. These structures are called List, Dictionary, Tuple and Set.
2.8. Lists
Lists in Python are the most versatile data structure. They are used to store
heterogeneous data items, from integers to strings or even another list! They are
also mutable, which means that their elements can be changed even after the list
is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets and each item is
separated by a comma.
Since each element in a list has its own distinct position, having duplicate values in
a list is not a problem.
Accessing List elements
To access elements of a list, we use Indexing. Each element in a list has an index
related to it depending on its position in the list. The first element of the list has
the index 0, the next element has index 1, and so on. The last element of the list
has an index of one less than the length of the list.
While positive indexes return elements from the start of the list, negative indexes
return values from the end of the list. This saves us from the trivial calculation which
we would have to otherwise perform if we wanted to return the nth element from
the end of the list. So instead of trying to return List_name[len(List_name)-1]
element, we can simply write List_name[-1].
Using negative indexes, we can return the nth element from the end of the list
easily. If we wanted to return the first element from the end, or the last index, the
associated index is -1. Similarly, the index for the second last element will be -2,
and so on. Remember, the 0th index will still refer to the very first element in the
list.
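A small sketch of positive and negative indexing:
colors = ["red", "green", "blue", "yellow"]
print(colors[0])    # 'red' -- first element, index 0
print(colors[-1])   # 'yellow' -- last element, same as colors[len(colors)-1]
print(colors[-2])   # 'blue' -- second last element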
Appending values in Lists
We can add new elements to an existing list using the append() or insert() methods:
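The original code screenshot is not reproduced; an illustrative sketch of append() and insert():
nums = [10, 20, 30]
nums.append(40)        # add to the end -> [10, 20, 30, 40]
nums.insert(1, 15)     # add at index 1 -> [10, 15, 20, 30, 40]
print(nums)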
2.9. Dictionaries
A dictionary is another Python data structure used to store heterogeneous objects
as key-value pairs; the keys must be immutable, and items are accessed by key
rather than by position.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly } brackets,
with each key separated from its value by a colon, and each key-value pair
separated by a comma:
Using the key of the item, we can easily extract the associated value of the item:
Dictionaries are very useful for accessing items quickly because, unlike lists and tuples,
a dictionary does not have to iterate over all the items to find a value. A dictionary
uses the item's key to quickly find the item's value. This concept is called hashing.
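A minimal illustrative sketch:
prices = {"apple": 40, "banana": 10, "cherry": 85}
print(prices["banana"])   # 10 -- looked up by key (hashing), not by scanning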
We can even access these values simultaneously using the items() method which
returns the respective key and value pair for each element of the dictionary.
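A short sketch of items():
prices = {"apple": 40, "banana": 10}
for key, value in prices.items():   # each element yields a key-value pair
    print(key, value)               # apple 40 / banana 10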
• Using csv.reader(): First, the CSV file is opened using the open() method
in ‘r’ mode (which specifies read mode while opening a file), returning a file
object. It is then read using the reader() method of the csv module, which
returns a reader object that iterates over the lines of the specified
CSV document.
Note: The ‘with‘ keyword is used along with the open() method, as it simplifies
exception handling and automatically closes the CSV file.
import csv
import pandas
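The original code screenshot is not reproduced; a minimal sketch of csv.reader(), assuming a hypothetical file named 'data.csv':
import csv
with open("data.csv", "r") as file:   # 'data.csv' is a hypothetical filename
    reader = csv.reader(file)         # reader object iterating over the lines
    for row in reader:
        print(row)                    # each row is a list of strings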
DataFrame Methods:
FUNCTION DESCRIPTION
add() Method adds the values of a DataFrame and another element (binary operator add)
value_counts() Method counts the number of times each unique value occurs within the Series
isnull() Method creates a Boolean Series for extracting rows with null values
notnull() Method creates a Boolean Series for extracting rows with non-null values
between() Method creates a Boolean Series for extracting rows whose values fall within a predefined range
isin() Method extracts rows from a DataFrame where a column value exists in a predefined collection
dtypes Method returns a Series with the data type of each column
values Method returns a NumPy representation of the DataFrame, i.e. only the values; the axis labels are removed
sort_values() Method sorts a DataFrame in ascending or descending order of the passed column
sort_index() Method sorts the values in a DataFrame based on their index positions or labels; a DataFrame made out of two or more DataFrames can have its index changed later using this method
ix[] Method retrieves DataFrame rows based on either index label or index position; it combines the features of the .loc[] and .iloc[] methods (deprecated and removed in recent pandas versions)
nsmallest() Method pulls out the rows with the smallest values in a column
nlargest() Method pulls out the rows with the largest values in a column
ndim Method returns the number of axes / array dimensions: returns 1 if Series, otherwise returns 2 if DataFrame
dropna() Method allows the user to analyze and drop rows/columns with null values
fillna() Method lets the user replace NaN values with a value of their own
drop_duplicates() Method removes duplicate rows from a DataFrame
duplicated() Method creates a Boolean Series marking rows that have duplicate values
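A short illustrative sketch exercising a few of these methods (the data is hypothetical):
import pandas as pd
df = pd.DataFrame({"city": ["A", "B", "A", None], "sales": [100, 250, 100, 80]})
print(df["city"].value_counts())   # counts of each unique value
print(df["city"].isnull())         # Boolean Series marking null values
print(df.sort_values("sales"))     # sort by the 'sales' column
print(df.nlargest(2, "sales"))     # two rows with the largest sales
print(df.dropna())                 # drop rows containing nulls
print(df.fillna("unknown"))        # replace missing values with a chosen value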
(ii) Median :
It is a measure of the central value of a sample set. The data set is ordered from
the lowest to the highest value, and the exact middle value is then taken as the median.
For example,
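an illustrative sketch using Python's statistics module:
import statistics
data = [3, 1, 7, 5, 9]
print(statistics.median(data))   # 5 -- middle value of the ordered data [1, 3, 5, 7, 9]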
(iii) Mode :
It is the value that occurs most frequently in a sample set. The value repeated the
most in the data set is the mode.
For example,
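an illustrative sketch:
import statistics
data = [2, 4, 4, 4, 7, 9]
print(statistics.mode(data))   # 4 -- the most frequent value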
Boxplot : It is based on the percentiles of the data, as shown in the figure below.
The top and bottom of the boxplot are the 75th and 25th percentiles of the data. The
extended lines are known as whiskers and cover the range of the rest of the data.
import matplotlib.pyplot as plt
population_in_millions = [1.4, 2.1, 3.8, 0.9, 5.6]  # hypothetical sample data; the original series is not shown
# BoxPlot Population In Millions
fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)
ax1.boxplot(population_in_millions)
Frequency Table : It is a tool to distribute the data into equally spaced ranges
(segments); it tells us how many values fall in each segment.
Histogram : It is a way of visualizing the data distribution from a frequency table,
with bins on the x-axis and the data count on the y-axis.
Code – Histogram
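The histogram code in the report is a screenshot; a minimal sketch under the assumption of similar hypothetical population data:
import matplotlib.pyplot as plt
population_in_millions = [1.4, 2.1, 3.8, 0.9, 5.6, 2.7, 4.4, 1.1, 3.2, 2.5]  # hypothetical data
plt.hist(population_in_millions, bins=5)   # bins on the x-axis, counts on the y-axis
plt.xlabel("Population (millions)")
plt.ylabel("Frequency")
plt.show()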
1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach
Here, we study the axiomatic approach. In this approach, we represent the
probability in terms of the sample space (S) and other terms.
Basic Terminologies:
• Random Experiment :- If the repetition of an experiment occurs several
times under similar conditions, and it does not produce the same
outcome every time but the outcome in a trial is one of several
possible outcomes, then such an experiment is called a random experiment
or a probabilistic experiment.
• Elementary Event – The elementary event refers to the outcome of
each random experiment performed. Whenever the random experiment is
performed, each associated outcome is known as an elementary event.
• Sample Space – The sample space refers to the set of all possible
outcomes of a random experiment. For example, when a coin is tossed, the
possible outcomes are head and tail.
• Event – An event refers to a subset of the sample space associated
with a random experiment.
• Occurrence of an Event – An event associated with a random experiment
is said to occur if any one of the elementary events belonging to it is
an outcome.
• Sure Event – An event associated with a random experiment is said to be
a sure event if it always occurs whenever the random experiment is
performed.
• Impossible Event – An event associated with a random experiment is said
to be an impossible event if it never occurs whenever the random experiment
is performed.
• Compound Event – An event associated with a random experiment is said
to be a compound event if it is the disjoint union of two or more
elementary events.
• Mutually Exclusive Events – Two or more events associated with a
random experiment are said to be mutually exclusive if the occurrence of
any one of them prevents the occurrence of all the others. This
means that no two or more of these events can occur simultaneously at the
same time.
For a discrete random variable, the probabilities pi must satisfy:
1. 0 ≤ pi ≤ 1
2. ∑pi = 1, where the sum is taken over all possible values of x.
Continuous Random Variable:
A random variable X is said to be continuous if it takes on an infinite number of
values. The probability function associated with it is called the PDF (probability
density function).
PDF: If X is a continuous random variable, then
P(x < X < x + dx) = f(x) dx
provided
1. f(x) ≥ 0 for all x
2. ∫ f(x) dx = 1 over all values of x
Then f(x) is said to be the PDF of the distribution.
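An illustrative numeric check of these conditions for the standard normal PDF (assumes SciPy is available):
from scipy.stats import norm
from scipy.integrate import quad
area, _ = quad(norm.pdf, -10, 10)   # integrate the standard normal PDF
print(round(area, 6))               # ~1.0 -- total probability is 1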
It can be observed from the above graph that the distribution is symmetric about
its center, which is also the mean (0 in this case). This makes events at equal
deviations from the mean equally probable. The density is highly centered around
the mean, which translates to lower probabilities for values away from the mean.
Probability Density Function –
The probability density function of the general normal distribution is given as
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
In the above formula, all the symbols have their usual meanings: σ is the standard
deviation and μ is the mean.
It is easy to get overwhelmed by the above formula while trying to understand
everything in one glance, but we can try to break it down into smaller pieces so
as to get an intuition for what is going on. The z-score is a measure of how many
standard deviations away from the mean a data point is:
z = (x − μ) / σ
The exponent of e in the above formula is −1/2 times the square of the z-score. This
is in accordance with the observations that we made above: values away
from the mean have a lower probability compared to the values near the mean.
Values away from the mean have a higher z-score and consequently a lower
probability, since the exponent is negative. The opposite is true for values closer
to the mean.
This gives way to the 68-95-99.7 rule, which states that the percentages of
values that lie within a band around the mean in a normal distribution with a
width of two, four and six standard deviations comprise 68%, 95% and 99.7% of
all the values. The figure given below shows this rule.
The effects of μ and σ on the distribution are shown below. Here μ is used to
reposition the center of the distribution and consequently move the graph left
or right, and σ is used to flatten or inflate the curve.
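An illustrative check of the 68-95-99.7 rule (assumes SciPy is available):
from scipy.stats import norm
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)        # probability within k standard deviations
    print(f"within {k} sigma: {p:.3f}")   # ~0.683, 0.954, 0.997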
Here,
x’ and y’ = means of the given sample sets
n = total number of samples
xi and yi = individual samples of the sets
Example –
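The formula image from the report is not reproduced; an illustrative computation of covariance and correlation with NumPy:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
print(np.cov(x, y)[0, 1])       # sample covariance of x and y
print(np.corrcoef(x, y)[0, 1])  # correlation coefficient, between -1 and +1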
It involves the historical or past data from an authorized source over which
predictive analysis is to be performed.
3. Data Cleaning:
Data Cleaning is the process in which we refine our data sets. In the process of
data cleaning, we remove unnecessary and erroneous data, including redundant
and duplicate data, from our data sets.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it thoroughly
in order to identify some patterns or new outcomes from the data set. In this
stage, we discover useful information and conclude by identifying some patterns
or trends.
5. Build Predictive Model:
In this stage of predictive analysis, we use various algorithms to build predictive
models based on the patterns observed. It requires knowledge of Python, R,
statistics, MATLAB and so on. We also test our hypotheses using standard
statistical models.
6. Validation:
It is a very important step in predictive analysis. In this step, we check the
efficiency of our model by performing various tests. Here we provide sample
input sets to check the validity of our model. The model needs to be evaluated
for its accuracy in this stage.
7. Deployment:
In deployment, we make our model work in a real environment, where it helps in
everyday decision making and is made available for use.
8. Model Monitoring:
Regularly monitor your models to check performance and ensure proper results.
Monitoring means seeing how the model's predictions perform against actual
data sets.
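A compressed illustrative sketch of the build-and-validate steps (assumes scikit-learn; the data is synthetic, not the report's):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=200, random_state=0)   # stand-in for a cleaned data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # build the predictive model
print(accuracy_score(y_test, model.predict(X_test)))              # validation on held-out data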
4.4. Hypothesis Generation
A hypothesis is a function that best describes the target in supervised machine
learning. The hypothesis that an algorithm comes up with depends on the data
and also on the restrictions and bias that we have imposed on the data.
To better understand the hypothesis space and hypothesis, consider the following
coordinate plot that shows the distribution of some data:
Access modes govern the type of operations possible in the opened file. The access
mode refers to how the file will be used once it is opened. These modes also define
the location of the file handle in the file. The file handle is like a cursor, which
defines from where the data has to be read or written in the file. Different access
modes for reading a file are:
1. Read Only (‘r’) : Opens a text file for reading. The handle is positioned at the
beginning of the file. If the file does not exist, an I/O error is raised. This is also
the default mode in which a file is opened.
2. Read and Write (‘r+’) : Opens the file for reading and writing. The handle is
positioned at the beginning of the file. An I/O error is raised if the file does not exist.
3. Append and Read (‘a+’) : Opens the file for reading and writing. The file is created
if it does not exist. The handle is positioned at the end of the file, so the data being
written will be inserted at the end, after the existing data.
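A short illustrative sketch of these modes (uses a hypothetical file 'notes.txt'):
with open("notes.txt", "a+") as f:   # created if missing; handle positioned at the end
    f.write("new line\n")            # appended after the existing data
with open("notes.txt", "r") as f:    # default read mode; handle at the beginning
    print(f.read())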
4.8. Variable Identification
First, identify the Predictor (input) and Target (output) variables. Next, identify
the data type and category of the variables.
Example :- Suppose we want to predict whether students will play cricket or
not (refer to the data set below). Here you need to identify the predictor variables,
the target variable, the data type of the variables and the category of the variables.
Note: Univariate analysis is also used to highlight missing and outlier values. Later
in this report, we will look at methods to handle missing and outlier values.
A scatter plot shows the relationship between two variables but does not indicate the
strength of the relationship between them. To find the strength of the relationship, we
use correlation. Correlation varies between -1 and +1.
• -1: perfect negative linear correlation
• 0: no correlation
• +1: perfect positive linear correlation
Probability less than 0.05: It indicates that the relationship between the variables is
significant at 95% confidence. The chi-square test statistic for a test of
independence of two categorical variables is found by:
χ² = Σ (Observed − Expected)² / Expected
From the previous two-way table, the expected count for product category 1 to be of
small size is 0.22. It is derived by taking the row total for Size (9) times the column
total for Product category (2), then dividing by the sample size (81). This procedure
is conducted for each cell. Statistical measures used to analyze the power of the
relationship are:
Different data science languages and tools have specific methods to perform the chi-
square test. In SAS, we can use Chisq as an option with Proc freq to perform this
test.
• Z-Test/ T-Test :- Either test assesses whether the means of two groups are
statistically different from each other.
• ANOVA :- It assesses whether the averages of more than two groups are statistically
different.
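An illustrative chi-square test of independence (assumes SciPy; the table counts are hypothetical):
from scipy.stats import chi2_contingency
table = [[12, 8], [6, 20]]                 # hypothetical two-way counts
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # the relationship is significant at 95% confidence if p < 0.05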
Notice the missing values in the image shown above: in the left scenario, we have
not treated the missing values. The inference from this data set is that the chances of
playing cricket by males are higher than by females. On the other hand, if you look at
the second table, which shows the data after treatment of missing values (based on
gender), we can see that females have a higher chance of playing cricket compared
to males.
1. Data Extraction: It is possible that there are problems with the extraction process. In
such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure the data extraction is correct. Errors
at the data extraction stage are typically easy to find and can be corrected easily as
well.
2. Data Collection: These errors occur at the time of data collection and are harder to
correct. They can be categorized into four types:
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion.
o In list-wise deletion, we delete observations where any of the variables are missing.
Simplicity is one of the major advantages of this method, but the method reduces
the power of the model because it reduces the sample size.
o In pair-wise deletion, we perform analysis with all cases in which the variables of
interest are present. The advantage of this method is that it keeps as many cases as
possible available for analysis. One of its disadvantages is that it uses different
sample sizes for different variables.
o Deletion methods are used when the nature of missing data is “missing
completely at random”; otherwise, non-random missing values can bias the model
output.
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean/mode/median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by the
mean or median (quantitative attribute) or mode (qualitative attribute) of all known
values of that variable. It can be of two types:
o Generalized Imputation: In this case, we calculate the mean or median for all non-
missing values of that variable and then replace the missing value with it. As in the
table above, the variable “Manpower” is missing, so we take the average of all non-
missing values of “Manpower” (28.33) and then replace the missing value with it.
o Similar Case Imputation: In this case, we calculate the average for gender “Male”
(29.75) and “Female” (25) individually over the non-missing values, then replace the
missing value based on gender. For “Male”, we replace missing values of
manpower with 29.75, and for “Female” with 25.
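An illustrative pandas sketch of both imputation types (hypothetical data echoing the Manpower/Gender example):
import pandas as pd
df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"],
                   "Manpower": [30, 25, None, None]})
# Generalized imputation: fill with the overall mean
generalized = df["Manpower"].fillna(df["Manpower"].mean())
# Similar case imputation: fill with the mean of the matching gender group
similar = df.groupby("Gender")["Manpower"].transform(lambda s: s.fillna(s.mean()))
print(generalized.tolist(), similar.tolist())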
3. Prediction Model: A prediction model is one of the more sophisticated methods for
handling missing data. Here, we create a predictive model to estimate values that
will substitute for the missing data. In this case, we divide our data set into two sets:
one set with no missing values for the variable and another one with missing
values. The first data set becomes the training data set of the model, while the second
data set, with missing values, is the test data set, and the variable with missing values
is treated as the target variable. Next, we create a model to predict the target variable
based on other attributes of the training data set and populate the missing values of
the test data set. We can use regression, ANOVA, logistic regression and various other
modeling techniques to perform this. There are two drawbacks to this approach:
o The model-estimated values are usually more well-behaved than the true
values.
o If there are no relationships between the attributes in the data set and the attribute
with missing values, then the model will not be precise for estimating missing values.
4. KNN Imputation: In this method of imputation, the missing values of
an attribute are imputed using the given number of attributes that are most similar
to the attribute whose values are missing. The similarity of two attributes is
determined using a distance function. This method has certain advantages
and disadvantages.
o Advantages:
▪ k-nearest neighbour can predict both qualitative & quantitative attributes
▪ Creation of a predictive model for each attribute with missing data is not required
• Any value beyond the range of Q1 - 1.5 × IQR to Q3 + 1.5 × IQR (where IQR is the
interquartile range Q3 - Q1) can be considered an outlier
• Use capping methods: any value outside the range of the 5th and 95th percentiles can
be considered an outlier
• Data points three or more standard deviations away from the mean are considered
outliers
• Outlier detection is merely a special case of the examination of data for influential
data points, and it also depends on the business understanding
• Bivariate and multivariate outliers are typically measured using either an index of
influence or leverage, or distance. Popular indices such as Mahalanobis’ distance
and Cook’s D are frequently used to detect outliers.
• In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and
influential observations, we also look at statistical measures like STUDENT, COOKD,
RSTUDENT and others.
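A minimal sketch of the IQR rule in NumPy (illustrative data):
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])   # 102 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)   # [102]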
Most of the ways to deal with outliers are similar to the methods for missing values:
deleting observations, transforming them, binning them, treating them as a
separate group, imputing values and other statistical methods. Here, we
discuss the common techniques used to deal with outliers:
Imputing: As with the imputation of missing values, we can also impute outliers, using
mean, median or mode imputation methods. Before imputing values, we should
analyse whether an outlier is natural or artificial. If it is artificial, we can go ahead with
imputing values. We can also use a statistical model to predict the values of outlier
observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them
separately in the statistical model. One approach is to treat both groups as
two different groups, build an individual model for each group and then combine
the output.
• A symmetric distribution is preferred over a skewed distribution, as it is easier to
interpret and generate inferences from. Some modeling techniques require a normal
distribution of variables. So, whenever we have a skewed distribution, we can use
transformations that reduce the skewness. For a right-skewed distribution, we take the
square / cube root or logarithm of the variable, and for a left-skewed distribution, we
take the square / cube or exponential of the variable.
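A short illustrative sketch of a log transform for a right-skewed variable:
import numpy as np
right_skewed = np.array([1, 2, 2, 3, 5, 8, 40, 100])   # hypothetical right-skewed data
transformed = np.log(right_skewed)                     # the log reduces right skewness
print(transformed.round(2))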
Any change in the coefficient leads to a change in both the direction and the
steepness of the logistic function. It means positive slopes result in an S-shaped
curve and negative slopes result in a Z-shaped curve.
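A minimal sketch (illustrative) of the logistic (sigmoid) function and the effect of the coefficient's sign:
import numpy as np
def logistic(x, coef):
    return 1 / (1 + np.exp(-coef * x))
x = np.linspace(-6, 6, 5)
print(logistic(x, 2.0))    # positive slope: S-shaped curve
print(logistic(x, -2.0))   # negative slope: Z-shaped curve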
4.18. Decision Trees
Decision Tree : The decision tree is a powerful and popular tool for classification
and prediction. A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label.
Decision Tree Representation :
Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. An instance is
classified by starting at the root node of the tree, testing the attribute specified by
this node, then moving down the tree branch corresponding to the value of the
attribute, as shown in the figure above. This process is then repeated for the subtree
rooted at the new node.
Strengths and Weakness of Decision Tree approach
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods are:
• Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a
decision tree is computationally expensive: at each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive, since many candidate sub-trees
must be formed and compared.
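An illustrative scikit-learn decision tree (the Iris data set stands in for the report's data):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))     # readable rules: a key strength of decision trees
print(tree.predict(X[:2]))   # classify instances by sorting them down the tree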
4.19 Project:
Project Abstract:
Accurate prediction of flight prices plays a crucial role in the travel industry,
enabling travelers to plan their journeys effectively and make informed decisions
regarding their air travel expenses. This project utilizes data science techniques to
predict flight prices based on a comprehensive dataset comprising various attributes
and historical flight data. By leveraging the power of data analysis and predictive
modeling, this study aims to provide valuable insights into the factors influencing flight
prices.
Through data exploration, feature engineering, and rigorous model training, this project
develops a robust framework for estimating flight prices. While the specific algorithms
and techniques employed in this project are not discussed in this abstract, the focus
remains on the predictive power of the model and the potential benefits it offers to
travelers and airlines alike.
Accurate flight price predictions empower travelers to plan their trips more effectively,
enabling them to budget and make informed choices regarding their flight bookings.
Moreover, airlines can leverage these predictions to optimize revenue management,
pricing strategies, and flight availability. By understanding the underlying patterns and
trends in flight prices, airlines can better anticipate demand fluctuations and adjust
fares accordingly.
The project's findings shed light on the key factors influencing flight prices, such as
departure location, destination, time of travel, seasonality, airline preferences, and
flight duration. This understanding aids travelers in finding the best deals and
optimizing their travel plans, while airlines can enhance their pricing strategies and
maximize profitability.
Column Details
Airline: This column represents the name or code of the airline associated with the flight.
Date_of_Journey: This column denotes the date of the flight journey.
Source: This column indicates the source or departure city of the flight.
Destination: This column represents the destination city of the flight.
Route: This column provides the route or path of the flight, including any stopovers or layovers.
Dep_Time: This column denotes the departure time of the flight.
Arrival_Time: This column represents the arrival time of the flight.
Duration: This column indicates the duration of the flight journey.
Total_Stops: This column represents the total number of stops or layovers during the flight journey.
Additional_Info: This column may contain additional information or special remarks about the flight.
Price: This column denotes the price or fare of the flight ticket.
Code:
[Project code screenshots, steps 1–23; step 12 – visualization]
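The screenshots above do not survive as text. The following is a hedged, minimal sketch of the kind of pipeline the abstract and column list describe; the file name 'flight_data.csv', the model choice and all parameters are assumptions, not the report's actual code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("flight_data.csv")   # hypothetical filename for the dataset described above
dates = pd.to_datetime(df["Date_of_Journey"], dayfirst=True)
df["Journey_Day"], df["Journey_Month"] = dates.dt.day, dates.dt.month
X = pd.get_dummies(df[["Airline", "Source", "Destination", "Total_Stops"]])
X[["Journey_Day", "Journey_Month"]] = df[["Journey_Day", "Journey_Month"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 score of the flight price predictions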