UNIT I: Introduction to Data Science

COURSE OUTCOMES:
Ø CO1: Understand the data science process
Ø CO2: Understand different types of data description for data science process
Ø CO3: Gain knowledge on relationships between data
Ø CO4: Use the Python Libraries for Data Wrangling
Ø CO5: Apply visualization Libraries in Python to interpret and explore data
Types of Data Science
Benefits and uses of data science and big data
• Data science and big data are used almost everywhere in both
commercial and noncommercial settings.
• Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their
offerings.
• A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched
to the person browsing the internet.
Upselling is the practice of encouraging customers to purchase a higher-end version of the product in question, while cross-selling invites customers to buy related or complementary items.
• Human resource professionals use people analytics and text mining
to screen candidates, monitor the mood of employees, and study
informal networks among coworkers.
• 50% of trades worldwide are performed automatically by machines
based on algorithms developed by quants.

“Quant” is financial slang for “quantitative analyst.” Quantitative analysts are highly skilled professionals in the finance sector, responsible for designing, developing, and implementing mathematical models or algorithms that solve complex financial problems.
• Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data
scientists to discover valuable information, but also share their
data with the public.
• You can use this data to gain insights or build data-driven
applications.
• Data.gov is one example; it’s the home of the US Government’s
open data.
• A data scientist in a governmental organization gets to work on
diverse projects such as detecting fraud and other criminal
activity or optimizing project funding.
Example
• A well-known example was provided by Edward Snowden, who
leaked internal documents of the American National Security
Agency and the British Government Communications
Headquarters that show clearly how they used data science and
big data to monitor millions of individuals.
• Those organizations collected 5 billion data records from
widespread applications such as Google Maps, Angry Birds,
email, and text messages, among many other data sources. Then
they applied data science techniques to distill information.

• Nongovernmental organizations (NGOs) use data science to raise money and defend their causes.
• The World Wildlife Fund (WWF), for instance, employs data scientists to
increase the effectiveness of their fundraising efforts.
• Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists.
• DataKind is one such data scientist group that devotes its time to the benefit of
mankind.
• Universities use data science in their research but also to enhance the study
experience of their students.
• The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can
complement traditional classes.
• MOOCs are an invaluable asset if you want to become a data scientist and big
data professional, so definitely look at a few of the better-known ones:
• Coursera, Udacity, and edX.
Facets of data
The main categories of data are :
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Facets of data
Structured data
• Structured data is data that depends on a data model and resides in a fixed field
within a record.
• It’s easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query
data that resides in databases.
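A minimal sketch of querying structured data with Python's built-in sqlite3 module (the table and its columns are hypothetical):

import sqlite3

# Create an in-memory database with a small structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "CSE", 55000.0), ("Bob", "ECE", 48000.0)],
)

# SQL is the preferred way to query data that resides in databases.
for row in conn.execute("SELECT name, salary FROM employees WHERE salary > 50000"):
    print(row)  # ('Alice', 55000.0)
conn.close()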
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. Example: Your regular email.
[Figures: an example of structured data in a table, and unstructured data such as an email.]
Facets of data
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize well
to other domains.
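As an illustration of one such technique, here is a minimal sentiment-analysis sketch using NLTK's VADER analyzer (assumes the nltk package is installed; the example sentences are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the lexicon used by the analyzer.
nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns negative/neutral/positive/compound scores.
print(analyzer.polarity_scores("The course material is clear and very helpful."))
print(analyzer.polarity_scores("The delivery was late and the product was broken."))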
Machine-generated data
• Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
• Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
Facets of data
Graph-based or network data
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and
the shortest path between two people.
Examples
• Graph-based data can be found on many social media websites. For instance, on
LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based data.
• Imagine the connecting edges here to show “friends” on Facebook. Imagine
another graph with the same people which connects business colleagues via
LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
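A small sketch with the networkx library (the people are hypothetical) showing the metrics mentioned above, such as the shortest path between two people and a simple influence measure:

import networkx as nx

# Build a small, hypothetical social network.
g = nx.Graph()
g.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Ann", "Carol"),
])

# Shortest path between two people.
print(nx.shortest_path(g, "Ann", "Dave"))  # ['Ann', 'Carol', 'Dave']

# Degree centrality as a rough measure of a person's influence.
print(nx.degree_centrality(g))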
Facets of data
Audio, image, and video
• Audio, image, and video are data types that pose specific challenges to a data scientist; tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• A company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games.
Streaming data
• Streaming data flows into the system when an event happens, instead of being loaded into a data store in a batch.
• Examples: “What's trending” on Twitter, live sporting or music events, and the stock market.
DATA SCIENCE PROCESS: OVERVIEW
The data science process typically consists of six steps:
1. Defining research goals
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling (building the models)
6. Presenting findings and building applications on top of them
DEFINING RESEARCH GOALS
• Understand the what, the why, and the how of your project.
• Develop a good understanding of the context, well-defined deliverables, and a plan of action with a timetable.
• Spend time understanding the goals and context of your research.
• Document everything in a project charter, which should contain:
• A clear research goal
• The project mission and context
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or proofs of concept
• Deliverables and a measure of success
• A timeline

RETRIEVING DATA
• Start with data stored within the company.
• Data can be stored in many forms:
• Text files
• Tables
• Official data repositories such as databases, data marts, data warehouses, and data lakes
• Don't be afraid to shop around: companies such as Nielsen and GfK sell data.

• Do data quality checks now to prevent problems later:
• Check whether the data is equal to the data in the source document.
• Check that you have the right data types.
• Check statistical properties such as distributions, correlations, and outliers.
DATA PREPARATION
Cleansing data
• Data cleansing focuses on removing errors in the data.
• The first type is the interpretation error, such as taking a value in your data for granted, like saying that a person's age is greater than 300 years.
• The second type of error points to inconsistencies between data
sources or against your company’s standardized values.

DATA ENTRY ERRORS
• Data entry is an error-prone process and often requires human intervention to fix.
• Errors can also originate from machine or hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.
Example:
• A variable that can take only two values, “Good” and “Bad”, also contains the misspelled values “Godo” and “Bade”. They can be corrected in Python:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
REDUNDANT WHITESPACE
• Keys in one table contained whitespace at the end of a string. This caused a mismatch of keys such as “FR ” – “FR”.

• In Python the strip() function is used to remove leading and trailing
spaces
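A quick illustration of how trailing whitespace breaks key matching and how strip() fixes it:

# A trailing space makes two seemingly equal keys differ.
print("FR " == "FR")          # False
print("FR ".strip() == "FR")  # True

# strip() removes whitespace on both sides;
# lstrip() and rstrip() remove only one side.
print("  FR  ".rstrip())  # '  FR'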
IMPOSSIBLE VALUES AND SANITY CHECKS
• Sanity checks can be directly expressed with rules:
• check = 0 <= age <= 120

OUTLIERS
• An outlier is an observation that seems to be distant from the other observations, or one that follows a different logic than the rest of the data. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
DEALING WITH MISSING VALUES
• Missing values aren't necessarily wrong, but they need to be handled separately; common techniques are omitting the values, setting the value to null, imputing a static value such as 0 or the mean, or modeling the value from the other variables.
DEVIATIONS FROM A CODE BOOK
• A code book is a description of your data, a form of metadata. It
contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable
means.
• For instance, “0” equals “negative” and “5” stands for “very positive”.
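A hypothetical sketch of applying a code book's encodings in Python:

# Hypothetical code book: integer codes mapped to their meanings.
code_book = {0: "negative", 5: "very positive"}

responses = [0, 5, 5, 0]
decoded = [code_book[r] for r in responses]
print(decoded)  # ['negative', 'very positive', 'very positive', 'negative']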

DIFFERENT UNITS OF MEASUREMENT
• When gathering data from different data providers, watch out for differing units of measurement: one data set can contain prices per gallon while another contains prices per liter.
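A minimal sketch of normalizing prices to a single unit before combining data sets (1 US gallon = 3.78541 liters; the prices are made up):

LITERS_PER_US_GALLON = 3.78541

def price_per_liter(price_per_gallon):
    # Convert a price quoted per US gallon to a price per liter.
    return price_per_gallon / LITERS_PER_US_GALLON

provider_a = price_per_liter(3.60)  # quoted per gallon
provider_b = 0.95                   # already per liter
print(round(provider_a, 2), provider_b)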
DIFFERENT LEVELS OF AGGREGATION
• An example of this would be a data set containing data per week versus one containing data per work week.
Correct errors as early as possible
• Data errors may point to a business process that isn't working as designed.
• Data errors may point to defective equipment, such as broken transmission lines and defective sensors.
Combining data from different data sources
• Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
THE DIFFERENT WAYS OF COMBINING DATA
• Joining: enriching an observation from one table with information
from another table.
• Appending or stacking: adding the observations of one table to those
of another table.
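A small pandas sketch of both operations on hypothetical tables:

import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
regions = pd.DataFrame({"id": [1, 2], "region": ["North", "South"]})

# Joining: enrich each customer with information from another table.
joined = customers.merge(regions, on="id")

# Appending (stacking): add the rows of one table to another.
more_customers = pd.DataFrame({"id": [3], "name": ["Carol"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(joined)
print(stacked)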

USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid duplication of data, virtually combine data with views.
• A view behaves as if you’re working on a table, but this table is
nothing but a virtual layer that combines the tables.
ENRICHING AGGREGATED MEASURES
• Data enrichment adds calculated information to a table, such as the total number of sales or the percentage of total stock sold in a certain region.
Transforming data
• Certain models require the relationship between an input variable and an output variable to be linear, but relationships aren't always linear.
• Take, for instance, a relationship of the form y = a·e^(bx); taking the logarithm of y linearizes it.
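A short numpy sketch of this linearization: taking the logarithm of y turns y = a·e^(bx) into ln y = ln a + b·x, which is linear in x (a and b are chosen arbitrarily here):

import numpy as np

a, b = 2.0, 0.5
x = np.linspace(0, 5, 6)
y = a * np.exp(b * x)  # exponential relationship y = a * e^(b*x)

log_y = np.log(y)  # ln y = ln a + b*x, linear in x
slope, intercept = np.polyfit(x, log_y, 1)
print(slope, np.exp(intercept))  # recovers b = 0.5 and a = 2.0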
REDUCING THE NUMBER OF VARIABLES
• Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.
TURNING VARIABLES INTO DUMMIES
• Dummy variables can take only two values: true (1) or false (0). A categorical variable is converted into separate columns, one per category, with a 1 indicating that the category applies to an observation.
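A one-line pandas sketch on hypothetical data:

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Wed"]})
# Each category becomes its own 0/1 dummy column.
print(pd.get_dummies(df, columns=["weekday"]))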
Step 4: Exploratory data analysis
• During exploratory data analysis you take a deep dive into the data, mainly using visualization techniques such as bar charts, line graphs, and Pareto diagrams.
Brushing and linking
• With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Histogram and Boxplot
• In a histogram, a variable is cut into discrete categories and the
number of occurrences in each category is summed up.
• The boxplot doesn't show how many observations are present, but does offer an impression of the distribution within categories. It can show the maximum, minimum, median, and other characterizing measures at the same time.
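A minimal matplotlib sketch drawing both plots from the same randomly generated data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)  # number of occurrences per discrete bin
ax1.set_title("Histogram")
ax2.boxplot(data)        # median, quartiles, and outliers
ax2.set_title("Boxplot")
plt.show()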

Step 5: Build the models
• Building a model is an iterative process.
• Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison.

Model and variable selection
• Select the variables to include in the model and choose the right model for the defined problem. Consider:
• Must the model be moved to a production environment and, if so, would it be easy to implement?
• How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
• Does the model need to be easy to explain?
Model execution
• Once you've chosen a model, you'll need to implement it in code.
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
• With these it's easy to execute, for instance, a linear regression.
[Figure: linear regression model information output.]
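A minimal StatsModels sketch that produces this kind of model-information output (the data is randomly generated, so the coefficients are meaningless):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))
y = 2 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=100)

X = sm.add_constant(x)        # add an intercept term
results = sm.OLS(y, X).fit()  # ordinary least squares fit
print(results.summary())      # R-squared, coefficients, p-values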
Model fit
• For this the R-squared or adjusted R-squared is used.
• This measure is an indication of the amount of variation in the data that gets captured by the model.
• Each predictor variable has a coefficient and an associated p-value.
• If the p-value is lower than 0.05, the variable is considered significant by most people.
Classification model: k-nearest neighbors
• k-nearest neighbors looks at labeled points near an unlabeled point and, based on these, makes a prediction of what the label should be.
[Figure: executing k-nearest neighbor classification on semi-random data.]
• To evaluate the classifier, compare its predictions to the real labels using a confusion matrix:
metrics.confusion_matrix(target, prediction)
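A hedged Scikit-learn sketch of k-nearest neighbor classification on semi-random data, ending with the confusion matrix call above:

import numpy as np
from sklearn import metrics, neighbors

rng = np.random.default_rng(0)
predictors = rng.normal(size=(200, 2))
# Semi-random target: the label depends loosely on the first predictor.
target = (predictors[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, target)
prediction = knn.predict(predictors)

# Rows are actual classes; columns are predicted classes.
print(metrics.confusion_matrix(target, prediction))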

Model diagnostics and model comparison
• When multiple models are built, the best one can be chosen based on multiple criteria.
• The model should work on unseen data.
• Only a fraction of the data is used to estimate the model; the other part, the holdout sample, is kept out of the equation.
• Error measures are then calculated on the holdout sample to evaluate the model.
• The error measure used in the example is the mean square error.

Mean square error
• Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and take the average over all predictions:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Example: comparing the performance of two models that predict the order size from the price.
• The first model is size = 3 × price; the second model is size = 10.
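A small numpy/Scikit-learn sketch comparing the two models on hypothetical price and order-size observations:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed prices and order sizes.
price = np.array([1, 2, 3, 4, 5])
size = np.array([4, 7, 10, 12, 16])

pred_model1 = 3 * price               # model 1: size = 3 * price
pred_model2 = np.full_like(size, 10)  # model 2: size = 10

print(mean_squared_error(size, pred_model1))  # 0.8  -> better fit
print(mean_squared_error(size, pred_model2))  # 17.0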


Model diagnostics
• Many models make strong assumptions, such as the independence of the inputs, and you have to verify that these assumptions are indeed met. This is called model diagnostics.
Step 6: Presenting findings and building applications on top of them
Once the model is validated, findings need to be communicated
effectively. This step involves:
• Data Visualization Dashboards (Tableau, Power BI, Matplotlib,
Seaborn)
• Generating Reports (PDFs, PowerPoint presentations)
• Building APIs for deployment (Flask, FastAPI)
• Integrating with business applications (CRM, financial software, etc.)
• Model monitoring and updating based on new data
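As one hedged illustration of the deployment step, a minimal Flask API that serves predictions (the model here is a hypothetical stand-in; in practice you would load a trained model):

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_order_size(price):
    # Hypothetical stand-in for a trained model.
    return 3 * price

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    prediction = predict_order_size(float(payload["price"]))
    return jsonify({"order_size": prediction})

if __name__ == "__main__":
    app.run(port=5000)  # for local testing only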
