
Data Science

Data Sciences

It is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand
and analyse actual phenomena with data. It employs techniques and theories drawn from many fields within the
context of Mathematics, Statistics, Computer Science, and Information Science.

Applications of Data Sciences - Fraud and Risk Detection

Over the years, banking companies have learned to divide and conquer data through customer profiling, past expenditures, and other essential variables in order to analyse the probability of risk and default. It has also helped them push their banking products based on customers' purchasing power.

Applications of Data Sciences - Genetics & Genomics

Data Science applications also enable an advanced level of treatment personalisation through research in genetics
and genomics. The goal is to understand the impact of DNA on our health and to find individual biological
connections between genetics, diseases, and drug response.

Applications of Data Sciences - Internet Search

All search engines (including Google) make use of data science algorithms to deliver the best results for our search
queries in a fraction of a second. Considering that Google processes more than 20 petabytes of data every day,
had there been no data science, Google would not have been the 'Google' we know today.

Applications of Data Sciences - Targeted Advertising

If you thought Search was the biggest of all data science applications, here is a challenger: the entire
digital marketing spectrum. From the display banners on various websites to the digital billboards at
airports, almost all of them are decided by using data science algorithms.

Applications of Data Sciences - Website Recommendations

Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many more use recommendation systems to improve
the user experience. The recommendations are made based on a user's previous search results.

Applications of Data Sciences - Airline Route Planning

Now, while using Data Science, airline companies can:

 Predict flight delays

 Decide which class of airplanes to buy

 Decide whether to fly directly to the destination or take a halt in between

 Effectively drive customer loyalty programs

Data Science is a combination of Python programming and mathematical concepts like Statistics, Data Analysis, Probability, etc.
Concepts of Data Science can be used in developing applications around AI, as they give a strong base for data analysis
in Python.

The Scenario

Every day, restaurants prepare food in large quantities keeping in mind the probable number of customers walking
into their outlet. But if the expectations are not met, a good amount of food gets wasted which eventually becomes a
loss for the restaurant as they either have to dump it or give it to hungry people for free. And if this daily loss is taken
into account for a year, it becomes quite a big amount.

Problem Scoping

The Problem statement template leads us towards the goal of our project which can now be stated as: “To be able to
predict the quantity of food dishes to be prepared for everyday consumption in restaurant buffets.”

Data Acquisition

For this problem, a dataset covering all the relevant factors is made for each dish prepared by the
restaurant over a period of 30 days.

Data Exploration

After creating the database, we now need to look at the data collected and understand what is required out of it. In
this case, since the goal of our project is to be able to predict the quantity of food to be prepared for the next day, we
need data such as the quantity of each dish prepared and the quantity of each dish consumed (or left unconsumed) per day over the 30-day period.
Modelling

Once the dataset is ready, we train our model on it. In this case, a regression model is chosen: the dataset is
fed to it as a dataframe and the model is trained accordingly. Regression is a Supervised Learning model that works with continuous
values of data collected over a period of time.
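
As an illustrative sketch (not the official project code), such a regression model could be trained in Python as shown below, assuming a hypothetical file food_data.csv with columns quantity_prepared and quantity_consumed:

# Sketch: training a regression model on the restaurant dataset.
# Assumption: a hypothetical file 'food_data.csv' with one row per dish per day
# and the columns 'quantity_prepared' and 'quantity_consumed'.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("food_data.csv")      # load the 30-day dataset as a dataframe
X = data[["quantity_prepared"]]          # input feature
y = data["quantity_consumed"]            # continuous value to be predicted

model = LinearRegression()               # a simple regression model
model.fit(X, y)                          # train the model on the dataframe
print(model.predict(X[:5]))              # predicted consumption for the first 5 records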

Evaluation

Once the model has been trained on the training dataset of 20 days, it is time to see whether it works properly by
testing it on the remaining days' data. Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for
real-time usage.
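
Continuing the sketch above (with the same assumed file and column names), the first 20 records can be used for training and the remaining ones for testing; the mean absolute error then indicates how far the predictions are from the actual values:

# Sketch: evaluating the model on data it has not seen during training.
# Assumption: the same hypothetical 'food_data.csv' as in the modelling sketch above.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

data = pd.read_csv("food_data.csv")
train, test = data.iloc[:20], data.iloc[20:]   # first 20 days for training, the rest for testing

model = LinearRegression()
model.fit(train[["quantity_prepared"]], train["quantity_consumed"])

predictions = model.predict(test[["quantity_prepared"]])
print("Mean absolute error:", mean_absolute_error(test["quantity_consumed"], predictions))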

Data Collection

Data collection is nothing new; it has been part of our society for ages. Even when people
did not have a fair knowledge of calculations, records were still maintained in some way or the other to keep an
account of relevant things.

Data collection is an exercise that does not require even a tiny bit of technological knowledge. But when it comes to
analysing the data, it becomes a tedious process for humans as it is all about numbers and alpha-numerical data.
That is where Data Science comes into the picture.

Data Science not only gives us a clearer idea of the dataset but also adds value to it by providing deeper and clearer
analyses around it. And as AI gets incorporated into the process, the machine becomes able to make predictions and
suggestions based on the same data.

Sources of Data

Offline Data Collection: Sensors, Surveys, Interviews, Observations.

Online Data Collection: Open-sourced government portals, reliable websites (e.g. Kaggle), and world organisations' open-sourced statistical websites.

While accessing data from any of the data sources, the following points should be kept in mind:

i. Data that is available for public usage only should be taken up.

ii. Personal datasets should only be used with the consent of the owner.

iii. One should never breach someone’s privacy to collect data.

iv. Data should only be taken from reliable sources as the data collected from random sources can be wrong or
unusable.

v. Reliable sources of data ensure the authenticity of data which helps in the proper training of the AI model.

Types of Data

For Data Science, usually, the data is collected in the form of tables. These tabular datasets can be stored in different
formats.

CSV

CSV stands for Comma-Separated Values. It is a simple file format used to store tabular data. Each line of the file is a
data record, and each record consists of one or more fields that are separated by commas. Since the values of the
records are separated by commas, these files are known as CSV files.
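
For example, a small illustrative CSV file and a sketch of reading it with Python's built-in csv module (the file name and contents are made up):

# Illustrative contents of a file 'dishes.csv':
#   Dish,Quantity_Prepared,Quantity_Consumed
#   Rice,50,45
#   Noodles,30,28
import csv

with open("dishes.csv") as f:
    for record in csv.reader(f):   # each line is one record, fields split at the commas
        print(record)              # e.g. ['Rice', '50', '45']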

Spreadsheet

A Spreadsheet is a piece of paper or a computer program that is used for accounting and recording data using rows
and columns into which information can be entered. Microsoft Excel is a program that helps in creating spreadsheets.

SQL
SQL is a programming language also known as Structured Query Language. It is a domain-specific language used in
programming and is designed for managing data held in different kinds of DBMS (Database Management Systems). It
is particularly useful in handling structured data.
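
As a small illustration, Python's built-in sqlite3 module can run SQL queries on a local database; the table and column names below are hypothetical:

# Sketch: running SQL from Python with the built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect("restaurant.db")   # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS dishes (name TEXT, quantity INTEGER)")
cur.execute("INSERT INTO dishes VALUES ('Rice', 50)")
cur.execute("SELECT name, quantity FROM dishes WHERE quantity > 20")
print(cur.fetchall())
conn.commit()
conn.close()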

Data Access

After collecting the data, to be able to use it for programming purposes, we should know how to access the same in
Python code. To make our lives easier, there exist various Python packages which help us in accessing structured data
(in tabular form) inside the code.

NumPy

NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on
arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy provides a wide
range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also
works with arrays, which are nothing but homogeneous collections of data.

An array is nothing but a set of multiple values which are all of the same data type. They can be numbers, characters,
booleans, etc., but only one datatype can be stored in an array. In NumPy, the arrays used are known as ND-
arrays (N-Dimensional Arrays), as NumPy comes with the feature of creating n-dimensional arrays in Python.

NumPy Arrays

i. Homogenous collection of Data.

ii. Can contain only one type of data, hence not flexible with datatypes.

iii. Cannot be directly initialised; they can be created and operated upon only through the NumPy package.

iv. Direct numerical operations can be done. For example, dividing the whole array by 3 divides every element by 3.

v. Widely used for arithmetic operations.

vi. Arrays take less memory space.

vii. Functions like concatenation, appending, reshaping, etc. are not trivially possible with arrays.

viii. Example: To create a NumPy array 'A':


import numpy
A = numpy.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
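
A small follow-up sketch of the direct numerical operations mentioned in point iv above:

import numpy

A = numpy.array([3, 6, 9, 12, 15])
print(A / 3)   # element-wise division: [1. 2. 3. 4. 5.]
print(A + 1)   # element-wise addition: [ 4  7 10 13 16]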

Lists

i. Heterogenous collection of Data.

ii. Can contain multiple types of data, hence flexible with datatypes.

iii. Can be directly initialized as it is a part of Python syntax.

iv. Direct numerical operations are not possible. For example, dividing the whole list by 3 does not divide every element by 3.

v. Widely used for data management.

vi. Lists acquire more memory space.

vii. Functions like concatenation, appending, reshaping, etc. are trivially possible with lists.

viii. Example: To create a list:


A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
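
A short sketch contrasting lists with NumPy arrays: direct numerical operations are not possible, while concatenation and appending are straightforward:

A = [1, 2, 3, 4, 5]
B = ["six", 7, 8.0]   # a list can mix datatypes

# A / 3 would raise a TypeError: direct numerical operations do not work on lists
C = A + B             # '+' concatenates the two lists instead of adding numbers
A.append(6)           # appending is trivially possible
print(C)
print(A)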

Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In
particular, it offers data structures and operations for manipulating numerical tables and time series. The name is
derived from the term "panel data", an econometrics term for data sets that include observations over multiple time
periods for the same individuals.

Pandas is well suited for many different kinds of data:

 Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

 Ordered and unordered (not necessarily fixed-frequency) time series data.

 Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

 Any other form of observational / statistical data sets. The data actually need not be labelled at all to be
placed into a Pandas data structure

The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast
majority of typical use cases in finance, statistics, social science, and many areas of engineering.
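
A brief sketch of these two structures with made-up values:

import pandas as pd

s = pd.Series([50, 30, 20], index=["Rice", "Noodles", "Soup"])   # 1-dimensional Series
df = pd.DataFrame(                                               # 2-dimensional DataFrame
    {"Quantity_Prepared": [50, 30, 20], "Quantity_Consumed": [45, 28, 15]},
    index=["Rice", "Noodles", "Soup"],
)
print(s)
print(df)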

Some things pandas does well:

 Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

 Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

 Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can
simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

 Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

 Intuitive merging and joining data sets

 Flexible reshaping and pivoting of data sets
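
For instance, missing values (represented as NaN) can be dropped or filled in a single call; a small sketch with made-up values:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Quantity_Consumed": [45, np.nan, 15]})
print(df.dropna())    # drop records that contain missing values
print(df.fillna(0))   # or replace the missing values with a default such as 0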

Matplotlib

Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multiplatform data
visualization library built on NumPy arrays.

Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make
correlations. They are typically instruments for reasoning about quantitative information.

Some of the graphs that we can make with this package include scatter plots, bar charts, histograms, and box plots, which are discussed later in this document.

Not just plotting: you can also modify your plots the way you wish. You can stylise them and make them more
descriptive and communicable.
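
As a minimal sketch, a simple line plot with a title, axis labels, and some styling can be drawn as follows (the values are made up):

import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
consumed = [45, 40, 48, 42, 50]                       # made-up values

plt.plot(days, consumed, color="green", marker="o")   # a styled line plot
plt.title("Quantity consumed per day")
plt.xlabel("Day")
plt.ylabel("Quantity")
plt.show()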

Basic Statistics with Python
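
As an illustrative sketch (the values below are made up), Python's built-in statistics module can compute common measures such as mean, median, mode, variance, and standard deviation:

import statistics

values = [45, 40, 48, 42, 50, 48]   # made-up values

print("Mean:", statistics.mean(values))
print("Median:", statistics.median(values))
print("Mode:", statistics.mode(values))
print("Variance:", statistics.variance(values))
print("Standard deviation:", statistics.stdev(values))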


Data Visualisation

Analysing the data collected can be difficult as it is all about tables and numbers. While machines work efficiently on
numbers, humans need a visual aid to understand and comprehend the information passed. Hence, data visualisation
is used to interpret the data collected and identify patterns and trends out of it.

Issues we can face with data:

 Erroneous Data: There are two ways in which the data can be erroneous:
Incorrect values: The values in the dataset (at random places) are incorrect.
Invalid or Null values: In some places, the values get corrupted and hence they become invalid.

 Missing Data: In some datasets, some cells remain empty.

 Outliers: Data that do not fall in the range of a certain element are referred to as outliers.

In Python, the Matplotlib package helps in visualising the data and making some sense out of it. As we have already
discussed, with the help of this package, we can plot various kinds of graphs.

Scatter plots

Scatter plots are used to plot discontinuous data, that is, data which does not have any continuity in flow; gaps in the
data introduce the discontinuity. A 2D scatter plot can display information for a maximum of up to 4 parameters.
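
A minimal sketch with made-up values, where the x position, y position, size, and colour of each point each carry one parameter:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [45, 40, 48, 42, 50]
sizes = [20, 50, 80, 40, 100]           # third parameter: size of each point
colours = [0.1, 0.3, 0.5, 0.7, 0.9]     # fourth parameter: colour of each point

plt.scatter(x, y, s=sizes, c=colours)
plt.show()
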
Bar Chart

It is one of the most commonly used graphical methods. From students to scientists, everyone uses bar charts in
some way or the other. It is a very easy-to-draw yet informative graphical representation. Various versions of the bar chart
exist, like the single bar chart, the double bar chart, etc.
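
A minimal single bar chart sketch with made-up values:

import matplotlib.pyplot as plt

dishes = ["Rice", "Noodles", "Soup"]
quantity = [50, 30, 20]        # made-up values

plt.bar(dishes, quantity)      # a single bar chart
plt.ylabel("Quantity prepared")
plt.show()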

Histograms

Histograms are an accurate representation of continuous data. When it comes to plotting the variation in just one
entity over a period of time, histograms come into the picture. They represent the frequency of the variable at different
points with the help of bins.
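
A short sketch with made-up values, where the bins parameter controls how the range of values is divided:

import matplotlib.pyplot as plt

consumed = [45, 40, 48, 42, 50, 48, 39, 44, 47, 41]   # made-up values

plt.hist(consumed, bins=5)     # frequency of the values across 5 bins
plt.xlabel("Quantity consumed")
plt.ylabel("Frequency")
plt.show()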

Box Plots

When the data is split according to its percentile throughout the range, box plots come in handy. Box plots, also
known as box-and-whisker plots, conveniently display the distribution of data throughout the range with the help of 4
quartiles.
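
A minimal box plot sketch on made-up values; the box shows the quartiles and the whiskers the rest of the range:

import matplotlib.pyplot as plt

consumed = [45, 40, 48, 42, 50, 48, 39, 44, 47, 41]   # made-up values

plt.boxplot(consumed)          # box shows the quartiles, whiskers the rest of the range
plt.show()
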
K-Nearest Neighbour

The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm
that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things
exist in close proximity. In other words, similar things are near to each other as the saying goes “Birds of a feather
flock together”.

Some features of KNN are:

 The KNN prediction model relies on the surrounding points or neighbours to determine its class or group

 Utilises the properties of the majority of the nearest points to decide how to classify unknown points

 Based on the concept that similar data points should be close to each other

KNN tries to predict an unknown value on the basis of the known values. The model simply calculates the distance
between the unknown point and all the known points (by distance we mean the difference between the two
values) and takes up the K points whose distance is minimum. Predictions are then made according to these points.
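
As a hedged sketch of this idea with made-up values and labels, the distance from the unknown point to every known point can be computed with NumPy, the K nearest points selected, and the majority label among them returned as the prediction:

import numpy as np

# Made-up known points and their labels
known_values = np.array([1.0, 1.2, 1.1, 5.0, 5.2, 4.9])
known_labels = np.array(["green", "green", "green", "blue", "blue", "blue"])

def knn_predict(unknown, k=3):
    distances = np.abs(known_values - unknown)     # distance = difference between the values
    nearest = np.argsort(distances)[:k]            # indices of the K nearest known points
    labels, counts = np.unique(known_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority vote among the neighbours

print(knn_predict(1.3))   # 'green'
print(knn_predict(4.7))   # 'blue'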

As we decrease the value of K to 1, our predictions become less stable. Just think for a minute, imagine K=1 and we
have X surrounded by several greens and one blue, but the blue is the single nearest neighbour. Reasonably, we
would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.

Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and
thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an
increasing number of errors. It is at this point we know we have pushed the value of K too far.

In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we
usually make K an odd number to have a tiebreaker.
