Data Science
Data Science is a concept that unifies statistics, data analysis, machine learning and their related methods in order to understand
and analyse actual phenomena with data. It employs techniques and theories drawn from many fields within the
context of Mathematics, Statistics, Computer Science, and Information Science.
Over the years, banks have learned to divide and conquer their data through customer profiling, past expenditure records, and
other essential variables in order to analyse the probability of risk and default. It has also helped them market their
banking products based on customers' purchasing power.
Data Science applications also enable an advanced level of treatment personalization through research in genetics
and genomics. The goal is to understand the impact of the DNA on our health and find individual biological
connections between genetics, diseases, and drug response.
All search engines (including Google) make use of data science algorithms to deliver the best results for our search
query in a fraction of a second. Considering the fact that Google processes more than 20 petabytes of data every day,
had there been no data science, Google wouldn’t have been the ‘Google’ we know today.
If you thought Search would have been the biggest of all data science applications, here is a challenger: the entire
digital marketing spectrum. From the display banners on various websites to the digital billboards at the
airports, almost all of them are decided by using data science algorithms.
Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many more use this system to improve
the user experience. The recommendations are made based on previous search results for a user.
Data Science combines programming in Python with mathematical concepts like statistics, data analysis, probability, etc.
Concepts of Data Science can be used in developing applications around AI, as it gives a strong base for data analysis
in Python.
The Scenario
Every day, restaurants prepare food in large quantities keeping in mind the probable number of customers walking
into their outlet. But if the expectations are not met, a good amount of food gets wasted which eventually becomes a
loss for the restaurant as they either have to dump it or give it to hungry people for free. And if this daily loss is taken
into account for a year, it becomes quite a big amount.
Problem Scoping
The Problem statement template leads us towards the goal of our project which can now be stated as: “To be able to
predict the quantity of food dishes to be prepared for everyday consumption in restaurant buffets.”
Data Acquisition
For this problem, a dataset covering the factors that affect daily consumption is made for each dish prepared by the
restaurant over a period of 30 days.
Data Exploration
After creating the database, we now need to look at the data collected and understand what is required out of it. In
this case, since the goal of our project is to predict the quantity of food to be prepared for the next day, we need
data on how much of each dish was prepared and how much of it was left unconsumed on each of the recorded days.
Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen: the dataset is
fed to it as a dataframe and the model is trained accordingly. Regression is a Supervised Learning approach that works with
continuous values of data, such as quantities recorded over a period of time.
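As an illustration, here is a minimal sketch of what this step might look like in Python, assuming the 30-day records have been saved to a hypothetical file dish_records.csv with made-up columns day_of_week, customers and quantity_consumed, and using scikit-learn's LinearRegression as one possible regression model:

```python
# A minimal modelling sketch (file name and column names are assumptions).
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("dish_records.csv")      # 30 rows, one per day, for one dish

X = data[["day_of_week", "customers"]]      # input features
y = data["quantity_consumed"]               # continuous value to be predicted

model = LinearRegression()                  # a simple regression model
model.fit(X, y)                             # train the model on the collected data
```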
Evaluation
Once the model has been trained on the training dataset of 20 days, it is time to see whether it is working
properly by testing it on the data from the remaining days. Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for
real-time usage.
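A rough sketch of this evaluation step, continuing the assumptions of the modelling sketch: train on the first 20 days, test on the remaining 10, and measure how far the predictions are from the actual consumption.

```python
# Evaluation sketch: split the 30 days into 20 training days and 10 test days.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

data = pd.read_csv("dish_records.csv")       # assumed file from the modelling step
train, test = data.iloc[:20], data.iloc[20:]

model = LinearRegression()
model.fit(train[["day_of_week", "customers"]], train["quantity_consumed"])

predicted = model.predict(test[["day_of_week", "customers"]])
error = mean_absolute_error(test["quantity_consumed"], predicted)
print("Average error in plates per day:", error)    # lower is better
```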
Data Collection
Data collection is nothing new; it has been part of our society for ages. Even when people
did not have a fair knowledge of calculations, records were still maintained in some way or the other to keep an
account of relevant things.
Data collection is an exercise that does not require even a tiny bit of technological knowledge. But when it comes to
analysing the data, it becomes a tedious process for humans, as it is all about numbers and alphanumeric data.
That is where Data Science comes into the picture.
Data Science not only gives us a clearer idea of the dataset but also adds value to it by providing deeper and clearer
analyses of it. And as AI gets incorporated into the process, the machine can also make predictions and suggestions
based on the same data.
Sources of Data
Online Data Collection: open-sourced government portals, reliable websites (such as Kaggle), and world organisations'
open-sourced statistical observation websites.
While accessing data from any of the data sources, the following points should be kept in mind:
i. Data that is available for public usage only should be taken up.
ii. Personal datasets should only be used with the consent of the owner.
iii. Data should only be taken from reliable sources, as data collected from random sources can be wrong or
unusable.
iv. Reliable sources of data ensure the authenticity of the data, which helps in the proper training of the AI model.
Types of Data
For Data Science, usually, the data is collected in the form of tables. These tabular datasets can be stored in different
formats.
CSV
CSV stands for comma-separated values. It is a simple file format used to store tabular data. Each line of this file is a
data record, and each record consists of one or more fields separated by commas. Since the values of the
records are separated by commas, these files are known as CSV files.
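As a small illustration, the sketch below reads a hypothetical dishes.csv file using Python's built-in csv module; the file name and columns are assumptions:

```python
# Reading a CSV file with the built-in csv module (file and columns are assumed).
import csv

with open("dishes.csv", newline="") as f:
    reader = csv.reader(f)       # each row becomes a list of field values
    header = next(reader)        # the first line usually holds the column names
    for row in reader:
        print(row)               # e.g. ['Paneer Tikka', '250', '40']
```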
Spreadsheet
A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows
and columns into which information can be entered. Microsoft Excel is a program that helps in creating spreadsheets.
SQL
SQL, which stands for Structured Query Language, is a domain-specific programming language designed for managing
data held in different kinds of DBMS (Database Management Systems). It
is particularly useful in handling structured data.
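The sketch below shows one way of running SQL from Python using the built-in sqlite3 module; the database, table and column names are assumptions for illustration:

```python
# Running SQL from Python with sqlite3 (names are illustrative assumptions).
import sqlite3

conn = sqlite3.connect("restaurant.db")      # creates the file if it does not exist
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS dishes (name TEXT, price REAL, sold INTEGER)")
cur.execute("INSERT INTO dishes VALUES (?, ?, ?)", ("Paneer Tikka", 250.0, 40))
conn.commit()

# SQL query to fetch structured data back out of the database
for row in cur.execute("SELECT name, sold FROM dishes WHERE price > 100"):
    print(row)

conn.close()
```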
Data Access
After collecting the data, to be able to use it for programming purposes, we should know how to access the same in
Python code. To make our lives easier, there exist various Python packages which help us in accessing structured data
(in tabular form) inside the code.
NumPy
NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on
arrays in Python. It is a commonly used package when it comes to working with numbers. NumPy provides a wide
range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also
works with arrays, which are nothing but a homogeneous collection of data.
An array is nothing but a set of multiple values of the same data type. They can be numbers, characters,
booleans, etc., but only one datatype can be stored in an array. In NumPy, the arrays used are known as ND-
arrays (N-Dimensional Arrays), as NumPy comes with the feature of creating n-dimensional arrays in Python.
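A short sketch of creating ND-arrays with NumPy (the values are made up):

```python
# Creating 1-D and 2-D NumPy arrays and checking their dimensions.
import numpy as np

one_d = np.array([10, 20, 30, 40])           # 1-dimensional array
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])                # 2-dimensional array (2 rows, 3 columns)

print(one_d.shape)    # (4,)
print(two_d.shape)    # (2, 3)
print(two_d.ndim)     # 2 -> number of dimensions
```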
NumPy Arrays
i. Can contain only one type of data, hence not flexible with datatypes.
ii. Cannot be directly initialised; arrays are created and operated on through the NumPy package.
iii. Direct numerical operations can be done. For example, dividing the whole array by 3 divides every element
by 3.
iv. Operations like concatenation, appending and reshaping need dedicated NumPy functions rather than plain Python syntax.
Lists
i. Can contain multiple types of data, hence flexible with datatypes.
ii. Can be initialised directly, as lists are part of core Python syntax.
iii. Direct numerical operations are not possible. For example, dividing the whole list by 3 does not divide every
element by 3.
iv. Operations like concatenation and appending are trivially possible with lists using plain Python syntax.
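The quick sketch below illustrates the difference in direct numerical operations between a NumPy array and a plain Python list (the values are made up):

```python
# Dividing an array by 3 works element-wise; dividing a list by 3 raises an error.
import numpy as np

arr = np.array([3, 6, 9, 12])
lst = [3, 6, 9, 12]

print(arr / 3)                              # [1. 2. 3. 4.]

try:
    lst / 3                                 # lists do not support this operation
except TypeError:
    print([x / 3 for x in lst])             # a loop or comprehension is needed instead
```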
Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In
particular, it offers data structures and operations for manipulating numerical tables and time series. The name is
derived from the term "panel data", an econometrics term for data sets that include observations over multiple time
periods for the same individuals.
Pandas works well with many different kinds of data, for example:
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
Any other form of observational / statistical data sets; the data need not be labelled at all to be placed into a Pandas data structure
The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast
majority of typical use cases in finance, statistics, social science, and many areas of engineering.
A few of the things that Pandas does well:
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can
simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
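A minimal sketch of the two primary Pandas data structures and of handling missing (NaN) values; the dish names and numbers are made up:

```python
# A Series (1-D) and a DataFrame (2-D) with a missing value stored as NaN.
import pandas as pd
import numpy as np

s = pd.Series([250, 300, np.nan, 180])                 # 1-dimensional Series

df = pd.DataFrame({
    "dish":  ["Paneer Tikka", "Dal Makhani", "Biryani"],
    "price": [250, 300, np.nan],                       # one missing value
    "sold":  [40, 55, 32],
})

print(df.isnull().sum())                               # count missing values per column
df["price"] = df["price"].fillna(df["price"].mean())   # one simple way to fill the gap
print(df)
```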
Matplotlib
Matplotlib is an amazing visualisation library in Python for 2D plots of arrays. It is a multiplatform data
visualisation library built on NumPy arrays.
Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns, and to make out
correlations. They are typically instruments for reasoning about quantitative information.
Some of the graphs that we can make with this package, such as scatter plots, bar charts, histograms and box plots, are discussed later in this section.
Not just plotting, but you can also modify your plots the way you wish. You can stylise them and make them more
descriptive and communicable.
Analysing the data collected can be difficult as it is all about tables and numbers. While machines work efficiently on
numbers, humans need a visual aid to understand and comprehend the information passed. Hence, data visualisation
is used to interpret the data collected and identify patterns and trends out of it.
Visualising the data also helps us spot problems in it. Erroneous Data: There are two ways in which the data can be erroneous:
Incorrect values: The values in the dataset (at random places) are incorrect.
Invalid or Null values: In some places, the values get corrupted and hence they become invalid.
Outliers: Data that do not fall in the range of a certain element are referred to as outliers.
In Python, the Matplotlib package helps in visualising the data and making some sense out of it. As we have already
discussed, with the help of this package we can plot various kinds of graphs.
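As a starting point, here is a minimal sketch of a simple plot; the daily sales figures are made up for illustration:

```python
# A basic line plot of plates sold per day over one week.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plates_sold = [34, 41, 38, 45, 60, 72, 65]

plt.plot(days, plates_sold, marker="o")
plt.title("Plates sold per day")
plt.xlabel("Day of the week")
plt.ylabel("Plates sold")
plt.show()
```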
Scatter plots
Scatter plots are used to plot discontinuous data, that is, data which does not have any continuity in flow; gaps in
the data introduce this discontinuity. A 2D scatter plot can display
information for a maximum of about four parameters at once, for example through the position, colour and size of each point.
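A sketch of such a scatter plot, where the position, colour and size of each point carry four parameters (the numbers are made up):

```python
# Scatter plot: x = price, y = plates sold, colour = rating, size = quantity wasted.
import matplotlib.pyplot as plt

price       = [100, 150, 200, 250, 300]
plates_sold = [80, 65, 50, 40, 25]
rating      = [3.5, 4.0, 4.5, 4.2, 4.8]
waste       = [50, 120, 200, 300, 380]

plt.scatter(price, plates_sold, c=rating, s=waste, cmap="viridis")
plt.colorbar(label="Customer rating")
plt.xlabel("Price of dish")
plt.ylabel("Plates sold")
plt.show()
```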
Bar Chart
It is one of the most commonly used graphical methods. From students to scientists, everyone uses bar charts in
some way or the other. It is a very easy-to-draw yet informative graphical representation. Various versions of bar charts
exist, like the single bar chart, the double bar chart, etc.
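A small sketch of a single bar chart (the dishes and quantities are made up):

```python
# Bar chart comparing plates sold for a few dishes on one day.
import matplotlib.pyplot as plt

dishes = ["Paneer Tikka", "Dal Makhani", "Biryani", "Naan"]
plates_sold = [40, 55, 32, 70]

plt.bar(dishes, plates_sold)
plt.title("Plates sold per dish")
plt.ylabel("Plates sold")
plt.show()
```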
Histograms
Histograms are an accurate representation of continuous data. When it comes to plotting the variation in just one
entity over a period of time, histograms come into the picture. They represent the frequency of the variable at different
points with the help of bins.
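A sketch of a histogram with bins, using 30 made-up daily customer counts:

```python
# Histogram: how often different customer counts occurred over 30 days.
import matplotlib.pyplot as plt

daily_customers = [52, 60, 58, 75, 80, 62, 55, 90, 85, 70,
                   65, 72, 68, 95, 100, 59, 63, 77, 82, 88,
                   54, 61, 69, 74, 79, 83, 91, 57, 66, 73]

plt.hist(daily_customers, bins=5)       # group the values into 5 bins
plt.xlabel("Customers per day")
plt.ylabel("Number of days")
plt.show()
```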
Box Plots
When the data is split according to its percentiles throughout the range, box plots come in handy. Box plots, also
known as box and whiskers plots, conveniently display the distribution of data throughout the range with the help of 4
quartiles.
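A sketch of a box plot showing how made-up daily sales of one dish are spread across the quartiles:

```python
# Box plot (box and whiskers) of plates sold per day.
import matplotlib.pyplot as plt

plates_sold = [34, 41, 38, 45, 60, 72, 65, 30, 55, 48, 80, 25]

plt.boxplot(plates_sold)
plt.title("Distribution of plates sold per day")
plt.ylabel("Plates sold")
plt.show()
```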
K-Nearest Neighbour
The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm
that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things
exist in close proximity. In other words, similar things are near to each other as the saying goes “Birds of a feather
flock together”.
The KNN prediction model relies on the surrounding points or neighbours to determine its class or group
Utilises the properties of the majority of the nearest points to decide how to classify unknown points
Based on the concept that similar data points should be close to each other
KNN tries to predict an unknown value on the basis of the known values. The model simply calculates the distance
between every known point and the unknown point (by distance we mean the difference between the two
values) and takes up the K points whose distance is the smallest. Predictions are then made according to these K neighbours.
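The from-scratch sketch below illustrates exactly this idea for a classification case: measure the distance of the unknown point from every known point, keep the K nearest, and predict by majority vote. The sample points, labels and K are made up:

```python
# KNN by hand: distance, K nearest neighbours, majority vote.
from collections import Counter

known_points = [(1.0, "blue"), (1.5, "blue"), (3.0, "green"),
                (3.2, "green"), (3.5, "green"), (5.0, "blue")]
unknown = 3.1
K = 3

# sort the known points by their distance from the unknown point
nearest = sorted(known_points, key=lambda p: abs(p[0] - unknown))[:K]

# majority vote among the K nearest labels
labels = [label for _, label in nearest]
prediction = Counter(labels).most_common(1)[0][0]
print(prediction)    # 'green', since the 3 nearest points are all green
```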
As we decrease the value of K to 1, our predictions become less stable. Just think for a minute, imagine K=1 and we
have X surrounded by several greens and one blue, but the blue is the single nearest neighbour. Reasonably, we
would think X is most likely green, but because K=1, KNN incorrectly predicts that it is blue.
Conversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and
thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an
increasing number of errors; it is at this point that we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we
usually make K an odd number to have a tiebreaker.