Data Science
Data Science is a concept that unifies statistics, data analysis, machine learning and their related methods in order to understand
and analyse actual phenomena with data. It employs techniques and theories drawn from many fields within the
context of Mathematics, Statistics, Computer Science, and Information Science.
Over the years, banks have learned to divide and conquer their data through customer profiling, past expenditure records, and
other essential variables in order to analyse the probability of risk and default. It has also helped them market their
banking products based on customers' purchasing power.
Data Science applications also enable an advanced level of treatment personalization through research in genetics
and genomics. The goal is to understand the impact of the DNA on our health and find individual biological
connections between genetics, diseases, and drug response.
All search engines (including Google) make use of data science algorithms to deliver the best results for our search
query in a fraction of a second. Considering the fact that Google processes more than 20 petabytes of data every day,
had there been no data science, Google wouldn’t have been the ‘Google’ we know today.
If you thought Search would have been the biggest of all data science applications, here is a challenger: the entire
digital marketing spectrum. From the display banners on various websites to the digital billboards at the
airports, almost all of them are decided by using data science algorithms.
Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many more use this system to improve
the user experience. The recommendations are made based on previous search results for a user.
Data Science combines programming in Python with mathematical concepts like statistics, data analysis, probability, etc.
Concepts of Data Science can be used in developing applications around AI, as it gives a strong base for data analysis
in Python.
The Scenario
Every day, restaurants prepare food in large quantities keeping in mind the probable number of customers walking
into their outlet. But if the expectations are not met, a good amount of food gets wasted which eventually becomes a
loss for the restaurant as they either have to dump it or give it to hungry people for free. And if this daily loss is taken
into account for a year, it becomes quite a big amount.
Problem Scoping
The Problem statement template leads us towards the goal of our project which can now be stated as: “To be able to
predict the quantity of food dishes to be prepared for everyday consumption in restaurant buffets.”
Data Acquisition
For this problem, a dataset covering the factors that affect daily consumption is made for each dish prepared by the
restaurant over a period of 30 days.
Data Exploration
After creating the database, we now need to look at the data collected and understand what is required out of it. In
this case, since the goal of our project is to predict the quantity of food to be prepared for the next day, we need
data on how much of each dish was prepared and how much of it was left unconsumed on each of the recorded days.
Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen: the dataset is
fed to it as a dataframe and the model is trained accordingly. Regression is a Supervised Learning approach that works with
continuous values of data, such as quantities recorded over a period of time.
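As an illustration, here is a minimal sketch of what this step might look like in Python, assuming the 30-day records have been saved to a hypothetical file dish_records.csv with made-up columns day_of_week, customers and quantity_consumed, and using scikit-learn's LinearRegression as one possible regression model:

```python
# A minimal modelling sketch (file name and column names are assumptions).
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("dish_records.csv")      # 30 rows, one per day, for one dish

X = data[["day_of_week", "customers"]]      # input features
y = data["quantity_consumed"]               # continuous value to be predicted

model = LinearRegression()                  # a simple regression model
model.fit(X, y)                             # train the model on the collected data
```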
Evaluation
Once the model has been trained on the training dataset of 20 days, it is time to see whether it is working
properly by testing it on the data from the remaining days. Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for
real-time usage.
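A rough sketch of this evaluation step, continuing the assumptions of the modelling sketch: train on the first 20 days, test on the remaining 10, and measure how far the predictions are from the actual consumption.

```python
# Evaluation sketch: split the 30 days into 20 training days and 10 test days.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

data = pd.read_csv("dish_records.csv")       # assumed file from the modelling step
train, test = data.iloc[:20], data.iloc[20:]

model = LinearRegression()
model.fit(train[["day_of_week", "customers"]], train["quantity_consumed"])

predicted = model.predict(test[["day_of_week", "customers"]])
error = mean_absolute_error(test["quantity_consumed"], predicted)
print("Average error in plates per day:", error)    # lower is better
```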
Data Collection
Data collection is nothing new; it has been part of our society for ages. Even when people
did not have a fair knowledge of calculations, records were still maintained in some way or the other to keep an
account of relevant things.
Data collection is an exercise that does not require even a tiny bit of technological knowledge. But when it comes to
analysing the data, it becomes a tedious process for humans, as it is all about numbers and alphanumeric data.
That is where Data Science comes into the picture.
Data Science not only gives us a clearer idea of the dataset but also adds value to it by providing deeper and clearer
analyses of it. And as AI gets incorporated into the process, the machine can also make predictions and suggestions
based on the same data.
Sources of Data
Online Data Collection: open-sourced government portals, reliable websites (such as Kaggle), and world organisations'
open-sourced statistical observation websites.
While accessing data from any of the data sources, the following points should be kept in mind:
i. Data that is available for public usage only should be taken up.
ii. Personal datasets should only be used with the consent of the owner.
iii. Data should only be taken from reliable sources, as data collected from random sources can be wrong or
unusable.
iv. Reliable sources of data ensure the authenticity of the data, which helps in the proper training of the AI model.
Types of Data
For Data Science, usually, the data is collected in the form of tables. These tabular datasets can be stored in different
formats.
CSV
CSV stands for comma-separated values. It is a simple file format used to store tabular data. Each line of this file is a
data record, and each record consists of one or more fields separated by commas. Since the values of the
records are separated by commas, these files are known as CSV files.
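As a small illustration, the sketch below reads a hypothetical dishes.csv file using Python's built-in csv module; the file name and columns are assumptions:

```python
# Reading a CSV file with the built-in csv module (file and columns are assumed).
import csv

with open("dishes.csv", newline="") as f:
    reader = csv.reader(f)       # each row becomes a list of field values
    header = next(reader)        # the first line usually holds the column names
    for row in reader:
        print(row)               # e.g. ['Paneer Tikka', '250', '40']
```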
Spreadsheet
A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows
and columns into which information can be entered. Microsoft Excel is a program that helps in creating spreadsheets.
SQL
SQL, which stands for Structured Query Language, is a domain-specific programming language designed for managing
data held in different kinds of DBMS (Database Management Systems). It
is particularly useful in handling structured data.
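The sketch below shows one way of running SQL from Python using the built-in sqlite3 module; the database, table and column names are assumptions for illustration:

```python
# Running SQL from Python with sqlite3 (names are illustrative assumptions).
import sqlite3

conn = sqlite3.connect("restaurant.db")      # creates the file if it does not exist
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS dishes (name TEXT, price REAL, sold INTEGER)")
cur.execute("INSERT INTO dishes VALUES (?, ?, ?)", ("Paneer Tikka", 250.0, 40))
conn.commit()

# SQL query to fetch structured data back out of the database
for row in cur.execute("SELECT name, sold FROM dishes WHERE price > 100"):
    print(row)

conn.close()
```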
Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access the same in
Python code. To make our lives easier, there exist various Python packages which help us in accessing structured data
(in tabular form) inside the code.
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on
arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy provides a wide
range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also
works with arrays, which are nothing but a homogeneous collection of data.
An array is nothing but a set of multiple values of the same data type. They can be numbers, characters,
booleans, etc., but only one datatype can be stored in an array. In NumPy, the arrays used are known as ND-
arrays (N-Dimensional Arrays), as NumPy comes with the feature of creating n-dimensional arrays in Python.
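A short sketch of creating ND-arrays with NumPy (the values are made up):

```python
# Creating 1-D and 2-D NumPy arrays and checking their dimensions.
import numpy as np

one_d = np.array([10, 20, 30, 40])           # 1-dimensional array
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])                # 2-dimensional array (2 rows, 3 columns)

print(one_d.shape)    # (4,)
print(two_d.shape)    # (2, 3)
print(two_d.ndim)     # 2 -> number of dimensions
```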
NumPy Arrays
i. Can contain only one type of data, hence not flexible with datatypes.
ii. Cannot be directly initialised; arrays are created and operated on through the NumPy package.
iii. Direct numerical operations can be done. For example, dividing the whole array by 3 divides every element
by 3.
iv. Operations like concatenation, appending and reshaping need dedicated NumPy functions rather than plain Python syntax.
Lists
i. Can contain multiple types of data, hence flexible with datatypes.
ii. Can be initialised directly, as lists are part of core Python syntax.
iii. Direct numerical operations are not possible. For example, dividing the whole list by 3 does not divide every
element by 3.
iv. Operations like concatenation and appending are trivially possible with lists using plain Python syntax.
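The quick sketch below illustrates the difference in direct numerical operations between a NumPy array and a plain Python list (the values are made up):

```python
# Dividing an array by 3 works element-wise; dividing a list by 3 raises an error.
import numpy as np

arr = np.array([3, 6, 9, 12])
lst = [3, 6, 9, 12]

print(arr / 3)                              # [1. 2. 3. 4.]

try:
    lst / 3                                 # lists do not support this operation
except TypeError:
    print([x / 3 for x in lst])             # a loop or comprehension is needed instead
```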
Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In
particular, it offers data structures and operations for manipulating numerical tables and time series. The name is
derived from the term "panel data", an econometrics term for data sets that include observations over multiple time
periods for the same individuals.
Pandas works well with many different kinds of data, for example:
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
Any other form of observational / statistical data sets; the data need not be labelled at all to be placed into a Pandas data structure
The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast
majority of typical use cases in finance, statistics, social science, and many areas of engineering.
A few of the things that Pandas does well:
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can
simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
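A minimal sketch of the two primary Pandas data structures and of handling missing (NaN) values; the dish names and numbers are made up:

```python
# A Series (1-D) and a DataFrame (2-D) with a missing value stored as NaN.
import pandas as pd
import numpy as np

s = pd.Series([250, 300, np.nan, 180])                 # 1-dimensional Series

df = pd.DataFrame({
    "dish":  ["Paneer Tikka", "Dal Makhani", "Biryani"],
    "price": [250, 300, np.nan],                       # one missing value
    "sold":  [40, 55, 32],
})

print(df.isnull().sum())                               # count missing values per column
df["price"] = df["price"].fillna(df["price"].mean())   # one simple way to fill the gap
print(df)
```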
Matplotlib
Matplotlib is an amazing visualisation library in Python for 2D plots of arrays. It is a multiplatform data
visualisation library built on NumPy arrays.
Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns, and to make out
correlations. They are typically instruments for reasoning about quantitative information.
Some of the graphs that we can make with this package, such as scatter plots, bar charts, histograms and box plots, are discussed later in this section.
Not just plotting, but you can also modify your plots the way you wish. You can stylise them and make them more
descriptive and communicable.
Analysing the data collected can be difficult as it is all about tables and numbers. While machines work efficiently on
numbers, humans need a visual aid to understand and comprehend the information passed. Hence, data visualisation
is used to interpret the data collected and identify patterns and trends out of it.
Visualising the data also helps us spot problems in it. Erroneous Data: There are two ways in which the data can be erroneous:
Incorrect values: The values in the dataset (at random places) are incorrect.
Invalid or Null values: In some places, the values get corrupted and hence they become invalid.
Outliers: Data that do not fall in the range of a certain element are referred to as outliers.
In Python, the Matplotlib package helps in visualising the data and making some sense out of it. As we have already
discussed, with the help of this package we can plot various kinds of graphs.
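As a starting point, here is a minimal sketch of a simple plot; the daily sales figures are made up for illustration:

```python
# A basic line plot of plates sold per day over one week.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plates_sold = [34, 41, 38, 45, 60, 72, 65]

plt.plot(days, plates_sold, marker="o")
plt.title("Plates sold per day")
plt.xlabel("Day of the week")
plt.ylabel("Plates sold")
plt.show()
```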
Scatter plots
Scatter plots are used to plot discontinuous data, that is, data which does not have any continuity in flow; gaps in
the data introduce this discontinuity. A 2D scatter plot can display
information for a maximum of about four parameters at once, for example through the position, colour and size of each point.
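A sketch of such a scatter plot, where the position, colour and size of each point carry four parameters (the numbers are made up):

```python
# Scatter plot: x = price, y = plates sold, colour = rating, size = quantity wasted.
import matplotlib.pyplot as plt

price       = [100, 150, 200, 250, 300]
plates_sold = [80, 65, 50, 40, 25]
rating      = [3.5, 4.0, 4.5, 4.2, 4.8]
waste       = [50, 120, 200, 300, 380]

plt.scatter(price, plates_sold, c=rating, s=waste, cmap="viridis")
plt.colorbar(label="Customer rating")
plt.xlabel("Price of dish")
plt.ylabel("Plates sold")
plt.show()
```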
Bar Chart
It is one of the most commonly used graphical methods. From students to scientists, everyone uses bar charts in
some way or the other. It is a very easy-to-draw yet informative graphical representation. Various versions of bar charts
exist, like the single bar chart, the double bar chart, etc.
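A small sketch of a single bar chart (the dishes and quantities are made up):

```python
# Bar chart comparing plates sold for a few dishes on one day.
import matplotlib.pyplot as plt

dishes = ["Paneer Tikka", "Dal Makhani", "Biryani", "Naan"]
plates_sold = [40, 55, 32, 70]

plt.bar(dishes, plates_sold)
plt.title("Plates sold per dish")
plt.ylabel("Plates sold")
plt.show()
```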
Histograms
Histograms are an accurate representation of continuous data. When it comes to plotting the variation in just one
entity over a period of time, histograms come into the picture. They represent the frequency of the variable at different
points with the help of bins.
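A sketch of a histogram with bins, using 30 made-up daily customer counts:

```python
# Histogram: how often different customer counts occurred over 30 days.
import matplotlib.pyplot as plt

daily_customers = [52, 60, 58, 75, 80, 62, 55, 90, 85, 70,
                   65, 72, 68, 95, 100, 59, 63, 77, 82, 88,
                   54, 61, 69, 74, 79, 83, 91, 57, 66, 73]

plt.hist(daily_customers, bins=5)       # group the values into 5 bins
plt.xlabel("Customers per day")
plt.ylabel("Number of days")
plt.show()
```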
Box Plots
When the data is split according to its percentiles throughout the range, box plots come in handy. Box plots, also
known as box and whiskers plots, conveniently display the distribution of data throughout the range with the help of 4
quartiles.
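A sketch of a box plot showing how made-up daily sales of one dish are spread across the quartiles:

```python
# Box plot (box and whiskers) of plates sold per day.
import matplotlib.pyplot as plt

plates_sold = [34, 41, 38, 45, 60, 72, 65, 30, 55, 48, 80, 25]

plt.boxplot(plates_sold)
plt.title("Distribution of plates sold per day")
plt.ylabel("Plates sold")
plt.show()
```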
K-Nearest Neighbour
The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm
that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things
exist in close proximity. In other words, similar things are near to each other as the saying goes “Birds of a feather
flock together”.
The KNN prediction model relies on the surrounding points or neighbours to determine its class or group
Utilises the properties of the majority of the nearest points to decide how to classify unknown points
Based on the concept that similar data points should be close to each other
KNN tries to predict an unknown value on the basis of the known values. The model simply calculates the distance
between every known point and the unknown point (by distance we mean the difference between the two
values) and takes up the K points whose distance is the smallest. Predictions are then made according to these K neighbours.
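The from-scratch sketch below illustrates exactly this idea for a classification case: measure the distance of the unknown point from every known point, keep the K nearest, and predict by majority vote. The sample points, labels and K are made up:

```python
# KNN by hand: distance, K nearest neighbours, majority vote.
from collections import Counter

known_points = [(1.0, "blue"), (1.5, "blue"), (3.0, "green"),
                (3.2, "green"), (3.5, "green"), (5.0, "blue")]
unknown = 3.1
K = 3

# sort the known points by their distance from the unknown point
nearest = sorted(known_points, key=lambda p: abs(p[0] - unknown))[:K]

# majority vote among the K nearest labels
labels = [label for _, label in nearest]
prediction = Counter(labels).most_common(1)[0][0]
print(prediction)    # 'green', since the 3 nearest points are all green
```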
As we decrease the value of K to 1, our predictions become less stable. Just think for a minute, imagine K=1 and we
have X surrounded by several greens and one blue, but the blue is the single nearest neighbour. Reasonably, we
would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.
Conversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and
thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an
increasing number of errors; it is at this point that we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we
usually make K an odd number to have a tiebreaker.