UNIT I: Introduction to Data Science

COURSE OUTCOMES:
Ø CO1: Understand the data science process
Ø CO2: Understand different types of data description for data science process
Ø CO3: Gain knowledge on relationships between data
Ø CO4: Use the Python Libraries for Data Wrangling
Ø CO5: Apply visualization Libraries in Python to interpret and explore data
Types of Data Science
Benefits and uses of data science and big data
• Data science and big data are used almost everywhere in both
commercial and noncommercial settings.
• Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their
offerings.
• A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched
to the person browsing the internet.
Upselling is the practice of encouraging customers to purchase a higher-end version of the product in question, while cross-selling invites customers to buy related or complementary items.
• Human resource professionals use people analytics and text mining
to screen candidates, monitor the mood of employees, and study
informal networks among coworkers.
• 50% of trades worldwide are performed automatically by machines
based on algorithms developed by quants.

“Quant” is financial slang for “quantitative analyst.” Quantitative analysts are highly skilled professionals in the finance sector, responsible for designing, developing, and implementing mathematical models or algorithms that solve complex financial problems.
• Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data
scientists to discover valuable information, but also share their
data with the public.
• You can use this data to gain insights or build data-driven
applications.
• Data.gov is one example; it’s the home of the US Government’s
open data.
• A data scientist in a governmental organization gets to work on
diverse projects such as detecting fraud and other criminal
activity or optimizing project funding.
Example
• A well-known example was provided by Edward Snowden, who
leaked internal documents of the American National Security
Agency and the British Government Communications
Headquarters that show clearly how they used data science and
big data to monitor millions of individuals.
• Those organizations collected 5 billion data records from
widespread applications such as Google Maps, Angry Birds,
email, and text messages, among many other data sources. Then
they applied data science techniques to distill information.

• Nongovernmental organizations (NGOs) use data science to raise money and defend their causes.
• The World Wildlife Fund (WWF), for instance, employs data scientists to
increase the effectiveness of their fundraising efforts.
• Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists.
• DataKind is one such data scientist group that devotes its time to the benefit of
mankind.
• Universities use data science in their research but also to enhance the study
experience of their students.
• The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can
complement traditional classes.
• MOOCs are an invaluable asset if you want to become a data scientist and big
data professional, so definitely look at a few of the better-known ones:
• Coursera, Udacity, and edX.
Facets of data
The main categories of data are :
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Facets of data
Structured data
• Structured data is data that depends on a data model and resides in a fixed field
within a record.
• It’s easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query
data that resides in databases.
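A minimal sketch of querying structured data with Python's built-in sqlite3 module (the table and its columns are hypothetical):

import sqlite3

# Create an in-memory database with a small structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "CSE", 55000.0), ("Bob", "ECE", 48000.0)],
)

# SQL is the preferred way to query data that resides in databases.
for row in conn.execute("SELECT name, salary FROM employees WHERE salary > 50000"):
    print(row)  # ('Alice', 55000.0)
conn.close()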
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. Example: Your regular email.
[Figures: an example of structured data in a table, and unstructured data such as an email.]
Facets of data
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize well
to other domains.
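As an illustration of one such technique, here is a minimal sentiment-analysis sketch using NLTK's VADER analyzer (assumes the nltk package is installed; the example sentences are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the lexicon used by the analyzer.
nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns negative/neutral/positive/compound scores.
print(analyzer.polarity_scores("The course material is clear and very helpful."))
print(analyzer.polarity_scores("The delivery was late and the product was broken."))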
Machine-generated data
• Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
• Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
Facets of data
Graph-based or network data
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and
the shortest path between two people.
Examples
• Graph-based data can be found on many social media websites. For instance, on
LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based data.
• Imagine the connecting edges here to show “friends” on Facebook. Imagine
another graph with the same people which connects business colleagues via
LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
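A small sketch with the networkx library (the people are hypothetical) showing the metrics mentioned above, such as the shortest path between two people and a simple influence measure:

import networkx as nx

# Build a small, hypothetical social network.
g = nx.Graph()
g.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Ann", "Carol"),
])

# Shortest path between two people.
print(nx.shortest_path(g, "Ann", "Dave"))  # ['Ann', 'Carol', 'Dave']

# Degree centrality as a rough measure of a person's influence.
print(nx.degree_centrality(g))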
Facets of data
Audio, image, and video
• Audio, image, and video are data types that pose specific challenges to a data scientist; tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• A company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games.
Streaming data
• Streaming data flows into the system when an event happens, instead of being loaded into a data store in a batch.
• Examples: “What's trending” on Twitter, live sporting or music events, and the stock market.
DATA SCIENCE PROCESS: OVERVIEW
The data science process typically consists of six steps:
1. Defining research goals
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling (building the models)
6. Presenting findings and building applications on top of them
DEFINING RESEARCH GOALS
• Understand the what, the why, and the how of your project.
• Develop a good understanding of the context, well-defined deliverables, and a plan of action with a timetable.
• Spend time understanding the goals and context of your research.
• Document everything in a project charter, which should contain:
• A clear research goal
• The project mission and context
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or proofs of concept
• Deliverables and a measure of success
• A timeline

RETRIEVING DATA
• Start with data stored within the company.
• Data can be stored in many forms:
• Text files
• Tables
• Official data repositories such as databases, data marts, data warehouses, and data lakes
• Don't be afraid to shop around: companies such as Nielsen and GfK sell data.

• Do data quality checks now to prevent problems later:
• Check whether the data is equal to the data in the source document.
• Check that you have the right data types.
• Check statistical properties such as distributions, correlations, and outliers.
DATA PREPARATION
Cleansing data
• Data cleansing focuses on removing errors in the data.
• The first type is the interpretation error, such as taking a value in your data for granted, like saying that a person's age is greater than 300 years.
• The second type of error points to inconsistencies between data
sources or against your company’s standardized values.

DATA ENTRY ERRORS
• Data entry is an error-prone process and often requires human intervention to fix.
• Errors can also originate from machine or hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.
Example:
• A variable that can take only two values, “Good” and “Bad”, also contains the misspelled values “Godo” and “Bade”. They can be corrected in Python:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
REDUNDANT WHITESPACE
• Keys in one table contained whitespace at the end of a string. This caused a mismatch of keys such as “FR ” – “FR”.

• In Python the strip() function is used to remove leading and trailing
spaces
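A quick illustration of how trailing whitespace breaks key matching and how strip() fixes it:

# A trailing space makes two seemingly equal keys differ.
print("FR " == "FR")          # False
print("FR ".strip() == "FR")  # True

# strip() removes whitespace on both sides;
# lstrip() and rstrip() remove only one side.
print("  FR  ".rstrip())  # '  FR'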
IMPOSSIBLE VALUES AND SANITY CHECKS
• Sanity checks can be directly expressed with rules:
• check = 0 <= age <= 120

OUTLIERS
• An outlier is an observation that seems to be distant from the other observations, or one that follows a different logic than the rest of the data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
DEALING WITH MISSING VALUES
• Missing values aren't necessarily wrong, but they need to be handled separately; common techniques are omitting the values, setting the value to null, imputing a static value such as 0 or the mean, or modeling the value from the other variables.
DEVIATIONS FROM A CODE BOOK
• A code book is a description of your data, a form of metadata. It
contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable
means.
• For instance, “0” equals “negative” and “5” stands for “very positive”.
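A hypothetical sketch of applying a code book's encodings in Python:

# Hypothetical code book: integer codes mapped to their meanings.
code_book = {0: "negative", 5: "very positive"}

responses = [0, 5, 5, 0]
decoded = [code_book[r] for r in responses]
print(decoded)  # ['negative', 'very positive', 'very positive', 'negative']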

DIFFERENT UNITS OF MEASUREMENT
• When gathering data from different data providers, watch out for differing units of measurement: one data set can contain prices per gallon while another contains prices per liter.
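A minimal sketch of normalizing prices to a single unit before combining data sets (1 US gallon = 3.78541 liters; the prices are made up):

LITERS_PER_US_GALLON = 3.78541

def price_per_liter(price_per_gallon):
    # Convert a price quoted per US gallon to a price per liter.
    return price_per_gallon / LITERS_PER_US_GALLON

provider_a = price_per_liter(3.60)  # quoted per gallon
provider_b = 0.95                   # already per liter
print(round(provider_a, 2), provider_b)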
DIFFERENT LEVELS OF AGGREGATION
• An example of this would be a data set containing data per week versus one containing data per work week.
Correct errors as early as possible
• Data errors may point to a business process that isn't working as designed.
• Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
Combining data from different data sources
• Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
THE DIFFERENT WAYS OF COMBINING DATA
• Joining: enriching an observation from one table with information
from another table.
• Appending or stacking: adding the observations of one table to those
of another table.
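A small pandas sketch of both operations on hypothetical tables:

import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
regions = pd.DataFrame({"id": [1, 2], "region": ["North", "South"]})

# Joining: enrich each customer with information from another table.
joined = customers.merge(regions, on="id")

# Appending (stacking): add the rows of one table to another.
more_customers = pd.DataFrame({"id": [3], "name": ["Carol"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(joined)
print(stacked)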

USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid duplication of data, virtually combine data with views.
• A view behaves as if you’re working on a table, but this table is
nothing but a virtual layer that combines the tables.
ENRICHING AGGREGATED MEASURES
• Data enrichment adds calculated information to a table, such as the total number of sales or the percentage of total stock sold in a certain region.
Transforming data
• Certain models require the relationship between an input variable and an output variable to be linear, but relationships aren't always linear.
• Take, for instance, a relationship of the form y = a·e^(bx); taking the logarithm of y linearizes it.
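A short numpy sketch of this linearization: taking the logarithm of y turns y = a·e^(bx) into ln y = ln a + b·x, which is linear in x (a and b are chosen arbitrarily here):

import numpy as np

a, b = 2.0, 0.5
x = np.linspace(0, 5, 6)
y = a * np.exp(b * x)  # exponential relationship y = a * e^(b*x)

log_y = np.log(y)  # ln y = ln a + b*x, linear in x
slope, intercept = np.polyfit(x, log_y, 1)
print(slope, np.exp(intercept))  # recovers b = 0.5 and a = 2.0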
REDUCING THE NUMBER OF VARIABLES
• Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.
TURNING VARIABLES INTO DUMMIES
• Dummy variables can take only two values: true (1) or false (0). A categorical variable is converted into separate columns, one per category, with a 1 indicating that the category applies to an observation.
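A one-line pandas sketch on hypothetical data:

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Wed"]})
# Each category becomes its own 0/1 dummy column.
print(pd.get_dummies(df, columns=["weekday"]))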
Step 4: Exploratory data analysis
• During exploratory data analysis you take a deep dive into the data, mainly using visualization techniques such as bar charts, line graphs, and Pareto diagrams.
Brushing and linking
• With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Histogram and Boxplot
• In a histogram, a variable is cut into discrete categories and the
number of occurrences in each category is summed up.
• The boxplot doesn't show how many observations are present, but does offer an impression of the distribution within categories. It can show the maximum, minimum, median, and other characterizing measures at the same time.
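A minimal matplotlib sketch drawing both plots from the same randomly generated data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)  # number of occurrences per discrete bin
ax1.set_title("Histogram")
ax2.boxplot(data)        # median, quartiles, and outliers
ax2.set_title("Boxplot")
plt.show()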

Step 5: Build the models
• Building a model is an iterative process.
• Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison.

Model and variable selection
• Select the variables to include in the model and choose the right model for the defined problem. Consider:
• Must the model be moved to a production environment and, if so, would it be easy to implement?
• How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
• Does the model need to be easy to explain?
Model execution
• Once you've chosen a model, you'll need to implement it in code.
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
• With these it's easy to execute, for instance, a linear regression.
[Figure: linear regression model information output.]
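A minimal StatsModels sketch that produces this kind of model-information output (the data is randomly generated, so the coefficients are meaningless):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))
y = 2 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=100)

X = sm.add_constant(x)        # add an intercept term
results = sm.OLS(y, X).fit()  # ordinary least squares fit
print(results.summary())      # R-squared, coefficients, p-values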
Model fit
• For this the R-squared or adjusted R-squared is used.
• This measure is an indication of the amount of variation in the data that gets captured by the model.
• Each predictor variable has a coefficient and an associated p-value.
• If the p-value is lower than 0.05, the variable is considered significant by most people.
Classification model: k-nearest neighbors
• k-nearest neighbors looks at labeled points near an unlabeled point and, based on these, makes a prediction of what the label should be.
[Figure: executing k-nearest neighbor classification on semi-random data.]
• To evaluate the classifier, compare its predictions to the real labels using a confusion matrix:
metrics.confusion_matrix(target, prediction)
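A hedged Scikit-learn sketch of k-nearest neighbor classification on semi-random data, ending with the confusion matrix call above:

import numpy as np
from sklearn import metrics, neighbors

rng = np.random.default_rng(0)
predictors = rng.normal(size=(200, 2))
# Semi-random target: the label depends loosely on the first predictor.
target = (predictors[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, target)
prediction = knn.predict(predictors)

# Rows are actual classes; columns are predicted classes.
print(metrics.confusion_matrix(target, prediction))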

Model diagnostics and model comparison
• When multiple models are built, the best one can be chosen based on multiple criteria.
• The model should work on unseen data.
• Only a fraction of the data is used to estimate the model; the other part, the holdout sample, is kept out of the equation.
• Error measures are then calculated on the holdout sample to evaluate the model.
• The error measure used in the example is the mean square error.

Mean square error
• Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and take the average over all predictions:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Example: comparing the performance of two models that predict the order size from the price.
• The first model is size = 3 × price; the second model is size = 10.
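A small numpy/Scikit-learn sketch comparing the two models on hypothetical price and order-size observations:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed prices and order sizes.
price = np.array([1, 2, 3, 4, 5])
size = np.array([4, 7, 10, 12, 16])

pred_model1 = 3 * price               # model 1: size = 3 * price
pred_model2 = np.full_like(size, 10)  # model 2: size = 10

print(mean_squared_error(size, pred_model1))  # 0.8  -> better fit
print(mean_squared_error(size, pred_model2))  # 17.0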


Model diagnostics
• Many models make strong assumptions, such as the independence of the inputs, and you have to verify that these assumptions are indeed met. This is called model diagnostics.
Step 6: Presenting findings and building applications on top of them
Once the model is validated, findings need to be communicated
effectively. This step involves:
• Data Visualization Dashboards (Tableau, Power BI, Matplotlib,
Seaborn)
• Generating Reports (PDFs, PowerPoint presentations)
• Building APIs for deployment (Flask, FastAPI)
• Integrating with business applications (CRM, financial software, etc.)
• Model monitoring and updating based on new data
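As one hedged illustration of the deployment step, a minimal Flask API that serves predictions (the model here is a hypothetical stand-in; in practice you would load a trained model):

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_order_size(price):
    # Hypothetical stand-in for a trained model.
    return 3 * price

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    prediction = predict_order_size(float(payload["price"]))
    return jsonify({"order_size": prediction})

if __name__ == "__main__":
    app.run(port=5000)  # for local testing only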
