Principles of Data Science

The document outlines the fundamentals of data science, including its multidisciplinary nature and the various methods of data analysis such as descriptive, diagnostic, predictive, and prescriptive analytics. It details the differences between structured and unstructured data, as well as quantitative and qualitative data, while emphasizing the importance of data collection, preparation, exploration, and model evaluation in the data science lifecycle. Additionally, it highlights the significance of clearly defining problems and understanding data types to derive meaningful insights for business applications.

Unit-1

• CO1: Understand the different levels of Data and Steps in Data Science.

• CO2: Apply the basics of probability models for data exploration.

• CO3: Analyze the basics of statistics models for data exploration.

• CO4: Analyze the different data visualization techniques.

• CO5: Analyze suitable models for real-time applications.


Introduction to Data Science
• Data science is the study of data to extract meaningful insights for
business.
• It is a multidisciplinary approach that combines principles and
practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of
data.
• This analysis helps data scientists to ask and answer questions like
what happened, why it happened, what will happen, and what can be
done with the results.
• Data science is used to study data in four main ways:
• Descriptive analysis examines data to gain insights into what
happened or what is happening in the data environment.
• It is characterized by data visualizations such as pie charts, bar charts,
line graphs, tables, or generated narratives.
• For example, a flight booking service may record data like the number
of tickets booked each day. Descriptive analysis will reveal booking
spikes, booking slumps, and high-performing months for this service.
• Diagnostic analysis is a deep-dive or detailed data examination to
understand why something happened.
• It is characterized by techniques such as drill-down, data discovery,
data mining, and correlations.
• Multiple data operations and transformations may be performed on a
given data set to discover unique patterns in each of these techniques.
• For example, the flight service might drill down on a particularly high-
performing month to better understand the booking spike.
• This may lead to the discovery that many customers visit a particular
city to attend a monthly sporting event.
• Predictive analytics forecasts what is likely to happen next.
• By analyzing historical data in tandem with industry trends, it helps you make informed predictions about what the future could hold for your company.
• Prescriptive analytics takes predictive data to the next level.
• It not only predicts what is likely to happen but also suggests an optimum response to that outcome.
• It can analyze the potential implications of different choices and recommend the best course of action.
• It uses graph analysis, simulation, complex event processing, neural networks, and recommendation engines from machine learning.
• A short Python sketch contrasting descriptive and predictive analysis on hypothetical booking data follows.
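The following is a minimal Python sketch (assuming pandas and NumPy are available) contrasting descriptive and predictive analysis on a small, made-up bookings table; the column names, values, and the crude linear-trend projection are illustrative assumptions, not part of the original flight example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily ticket bookings for a flight service (illustrative data only)
bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=180, freq="D"),
    "tickets": [120 + (i % 30) * 3 for i in range(180)],
})

# Descriptive analysis: what happened? Summarise bookings per month.
monthly = bookings.set_index("date")["tickets"].resample("MS").sum()
print(monthly)  # reveals booking spikes, slumps, and high-performing months

# Predictive analysis (very rough): fit a linear trend and project the next month.
x = np.arange(len(monthly))
slope, intercept = np.polyfit(x, monthly.values, 1)
print("projected next month:", slope * len(monthly) + intercept)
```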
• Data is distinct pieces of information, usually formatted in a particular manner.
• All software is divided into two major categories: programs and data.
• Programs are collections of instructions used to manipulate data.
• Information is defined as classified or organized data that has some meaningful
value for the user.
• Information is also the processed data used to make decisions and take action.
• Processed data must meet the following criteria for it to be of any significant use in
decision-making:
• Accuracy: The information must be accurate.
• Completeness: The information must be complete.
• Timeliness: The information must be available when it’s needed.
Structured vs Unstructured Data
• Structured data is data that has a standardized format for efficient access by
software and humans alike.
• It is typically tabular with rows and columns that clearly define data
attributes.
• It is data that has been organized into a database that can be easily queried
and analyzed
• Computers can effectively process structured data for insights due to its
quantitative nature.
• For example, a structured customer data table containing columns—name,
address, and phone number—can provide insights like the total number of
customers and the locality with the maximum number of customers.
• Structured data is typically used by developers in the development stage of
an application, as it's the most straightforward way to manage information.
• When non-developers work with it, they often need guidance on how best to use
this data type.
• To manipulate it, there is a special language called SQL, which stands for
Structured Query Language and was developed back in the 1970s by IBM.
• Structured data has relational keys and can easily be mapped into pre-designed fields.
• Today, it is the data type most commonly processed during development and the simplest way
to manage information (a minimal querying sketch follows this slide).
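As a minimal sketch of querying structured data with SQL, the example below uses Python's built-in sqlite3 module; the table layout and sample rows are assumptions chosen to mirror the customer table described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, address TEXT, locality TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [("Asha", "12 Park Rd", "Chennai", "555-0101"),
     ("Ravi", "8 Lake St", "Chennai", "555-0102"),
     ("Meena", "3 Hill Ave", "Mumbai", "555-0103")],
)

# Total number of customers
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])

# Locality with the maximum number of customers
print(conn.execute(
    "SELECT locality, COUNT(*) AS n FROM customers "
    "GROUP BY locality ORDER BY n DESC LIMIT 1"
).fetchone())
```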
Features of structured data
• Organized Format:
• Structured data is arranged in a consistent, predictable format, often
in tables with rows and columns.
• Predefined Schema:
• A schema defines the structure of the data, including data types,
relationships, and constraints, before data entry.
• Ease of Querying:
• Structured data can be easily accessed and retrieved using query
languages like SQL, allowing for specific data extraction and analysis.
Features of structured data
• Consistency:
• Data types are consistent across all entries for a given field, ensuring
data quality and facilitating analysis.
• Relational:
• Structured data can be linked across different tables through
relationships, enabling complex queries and analysis.
• Use cases for structured data include:
• AI model training
• Customer relationship management (CRM)
• Business intelligence (BI)
• Inventory management
• Search engine optimization (SEO) rich snippets
• Unstructured data does not have a predefined format.
• Unstructured datasets are typically large (think terabytes or petabytes
of data) and comprise 90% of all enterprise-generated data.
• This high volume is due to the emergence of big data—the massive,
complex datasets from the internet and other connected
technologies.
• Unstructured data can contain both textual and nontextual data and
both qualitative (social media comments) and quantitative (figures
embedded in text) data.
Features of unstructured data
• 1. Lack of Defined Structure:
• Unstructured data doesn't adhere to a specific schema or data model, meaning
there's no predefined structure like rows and columns in a database.
• It's not organized in a way that's easily readable by computers without specific
processing.
• 2. Variety of Formats:
• It can include a vast array of data types, such as:
• Text: Emails, documents, social media posts, etc.
• Multimedia: Images, audio files, videos
• Web content: Web pages, blogs
• Other: Sensor data, machine logs, etc.
• 3. High Volume:
• Unstructured data often represents a large portion of the data
generated today, and its volume is constantly increasing.
• The growth of digital devices and the internet has accelerated the
generation of unstructured data.
• 4. Difficult to Analyze:
• Due to the lack of structure, analyzing unstructured data can be
challenging and often requires specialized tools and techniques.
• This can involve natural language processing (NLP), image recognition,
or other advanced methods.
• 5. Human-Generated:
• A significant portion of unstructured data is created by humans, such
as text documents, emails, and social media posts.
• 6. Context-Dependent:
• Understanding unstructured data often requires considering the
context in which it was created.
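To illustrate why unstructured text needs extra processing before analysis, here is a minimal Python sketch that imposes some structure on made-up social media comments; the comments, keyword lists, and scoring rule are illustrative assumptions, not a substitute for real NLP techniques.

```python
import re
from collections import Counter

# Hypothetical social media comments (unstructured text)
comments = [
    "Loved the quick delivery, great service!",
    "Terrible support, my order was late.",
    "Great prices but the app keeps crashing.",
]

# Tokenise and count word frequencies -- a first step toward structure
words = [w for c in comments for w in re.findall(r"[a-z']+", c.lower())]
print(Counter(words).most_common(5))

# Crude keyword-based sentiment tagging (illustrative only, not a real NLP model)
positive, negative = {"loved", "great"}, {"terrible", "late", "crashing"}
for c in comments:
    tokens = set(re.findall(r"[a-z']+", c.lower()))
    score = len(tokens & positive) - len(tokens & negative)
    print(score, c)
```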
• Examples of unstructured data from textual data sources include:
• Emails
• Text documents
• Social media posts
• Call transcripts
• Message text files, such as those from Microsoft Teams or Slack
• Examples of nontextual unstructured data include:
• Image files (JPEG, GIF and PNG)
• Multimedia files
• Video files
• Mobile activity
• Sensor data from Internet of Things (IoT) devices
• Use cases include:
• Retrieval augmented generation (RAG)
• Generative AI (gen AI)
• Customer behavior and sentiment analysis
• Predictive data analytics
• Chatbot text analysis
Quantitative and Qualitative data
• Quantitative data is numerical data that can be measured and analyzed
statistically. It is often used to describe quantities, amounts, or frequencies.
• Examples of quantitative data include:
• Age
• Height
• Weight
• Temperature
• Sales figures
• Test scores
• Characteristics of Quantitative Data
• Quantitative data has several key characteristics, including:
• 1. Numerical: Quantitative data is numerical in nature and can be expressed
in terms of numbers or quantities.
• 2. Measurable: Quantitative data can be measured using standardized units of
measurement, such as meters, kilograms, or dollars.
• 3. Statistical analysis: Quantitative data can be analyzed using statistical
methods, such as descriptive statistics, inferential statistics, and regression
analysis.
• 4. Objectivity: Quantitative data is often considered objective, as it is based
on numerical measurements that are less prone to personal biases.
• Types of Quantitative Data
• There are several types of quantitative data, including:
• 1. Discrete data: Discrete data is countable data that can take on
specific values, such as the number of students in a class or the
number of defective products.
• 2. Continuous data: Continuous data is measurable data that can take
on any value within a range, such as height, weight, or temperature.
• 3. Interval data: Interval data is continuous data that has equal
intervals between measurements, such as temperature in Celsius or
Fahrenheit.
• Methods of Collecting Quantitative Data
• There are several methods of collecting quantitative data, including:
• 1. Surveys: Surveys are a common method of collecting quantitative data,
where respondents are asked to provide numerical answers to questions.
• 2. Experiments: Experiments are a controlled method of collecting
quantitative data, where variables are manipulated and measured.
• 3. Observations: Observations are a method of collecting quantitative
data, where behavior or events are measured and recorded.
• 4. Existing data sources: Existing data sources, such as administrative
records or databases, can also be used to collect quantitative data.
• Statistical Analysis of Quantitative Data
• Quantitative data can be analyzed using various statistical methods,
including:
• 1. Descriptive statistics: Descriptive statistics, such as mean, median,
and mode, are used to summarize and describe quantitative data.
• 2. Inferential statistics: Inferential statistics, such as t-tests and
ANOVA, are used to make inferences about populations based on
sample data.
• 3. Regression analysis: Regression analysis is used to model the
relationship between variables and predict outcomes.
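A minimal sketch of the three kinds of statistical analysis listed above, assuming NumPy and SciPy are installed; all of the data is synthetic and the parameter choices are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(170, 8, 100)  # synthetic continuous data (cm)

# 1. Descriptive statistics: summarise the data
print(np.mean(heights), np.median(heights), np.std(heights))

# 2. Inferential statistics: two-sample t-test between two synthetic groups
group_a = rng.normal(170, 8, 50)
group_b = rng.normal(173, 8, 50)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)

# 3. Regression analysis: model weight as a function of height
weights = 0.9 * heights - 90 + rng.normal(0, 5, 100)
slope, intercept, r, p, se = stats.linregress(heights, weights)
print(slope, intercept, r ** 2)
```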
Qualitative Data
• Qualitative data is a type of data that is non-numerical in nature and
focuses on describing and understanding phenomena through words,
images, and observations.
• It is a crucial component of research in various fields, including social
sciences, healthcare, education, and business.
• Characteristics of Qualitative Data
• Qualitative data has several key characteristics, including:
• 1. Non-numerical: Qualitative data is non-numerical in nature and is
often collected through text, images, or observations.
• 2. Descriptive: Qualitative data is descriptive and focuses on providing
detailed insights into phenomena.
• 3. Subjective: Qualitative data is often subjective, as it is based on
interpretation and perspective.
• 4. Contextual: Qualitative data is often contextual, as it is collected in a
specific setting or context.
• Types of Qualitative Data
• There are several types of qualitative data, including:
• 1. Text data: Text data includes written or spoken words, such as
interview transcripts, survey responses, or social media posts.
• 2. Image data: Image data includes photographs, videos, or other visual
materials that can provide insights into phenomena.
• 3. Observational data: Observational data includes data collected
through observations, such as field notes or ethnographic research.
• 4. Case study data: Case study data includes in-depth examinations of
specific cases or phenomena.
• Methods of Collecting Qualitative Data
• There are several methods of collecting qualitative data, including:
• 1. Interviews: Interviews are a common method of collecting qualitative
data, where participants are asked open-ended questions.
• 2. Focus groups: Focus groups are a method of collecting qualitative
data, where a group of participants discuss a specific topic.
• 3. Observations: Observations are a method of collecting qualitative
data, where behavior or events are observed and recorded.
• 4. Document analysis: Document analysis is a method of collecting
qualitative data, where existing documents are analyzed for insights.
• Analyzing Qualitative Data
• Qualitative data can be analyzed using various methods, including:
• 1. Thematic analysis: Thematic analysis involves identifying and coding
themes in qualitative data.
• 2. Content analysis: Content analysis involves analyzing the content of
qualitative data, such as text or images.
• 3. Narrative analysis: Narrative analysis involves analyzing the narrative
structure of qualitative data.
• 4. Coding: Coding involves assigning codes or labels to qualitative data
to facilitate analysis.
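The sketch below illustrates coding (method 4 above) on made-up survey responses using a tiny keyword codebook; the responses and codebook are assumptions, and in practice thematic coding is an interpretive, largely human-driven process that software only assists.

```python
from collections import Counter

# Hypothetical open-ended survey responses (qualitative text data)
responses = [
    "The onboarding was confusing and the documentation was hard to find.",
    "Support was friendly and resolved my issue quickly.",
    "Documentation is outdated; I relied on support chat instead.",
]

# A tiny illustrative codebook mapping codes to trigger keywords
codebook = {
    "DOCS": ["documentation", "docs"],
    "SUPPORT": ["support"],
    "ONBOARDING": ["onboarding"],
}

# Assign codes to each response, then tally how often each theme appears
counts = Counter()
for text in responses:
    lowered = text.lower()
    codes = [code for code, kws in codebook.items() if any(k in lowered for k in kws)]
    counts.update(codes)
    print(codes, "->", text)

print(counts)  # frequency of each code across responses
```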
• In data science, the four levels of data measurement (also known as
scales of measurement) are:
• nominal,
• ordinal,
• interval, and
• ratio.
• These levels dictate the type of mathematical operations that can be
meaningfully applied to the data.
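As a quick reference, the snippet below summarises the four levels with standard textbook examples and the operations usually considered meaningful at each level; the specific examples are illustrative.

```python
# Illustrative summary of the four levels of measurement and the
# operations that are usually considered meaningful at each level.
levels = {
    "nominal":  {"example": "blood group, city",      "operations": "=, != (counting, mode)"},
    "ordinal":  {"example": "survey rating (1-5)",    "operations": "=, !=, <, > (median, rank)"},
    "interval": {"example": "temperature in Celsius", "operations": "+, - (mean, std dev); no true zero"},
    "ratio":    {"example": "height, weight, sales",  "operations": "+, -, *, / (ratios are meaningful)"},
}

for level, info in levels.items():
    print(f"{level:9s} e.g. {info['example']:25s} -> {info['operations']}")
```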
The five steps of Data Science
• The data science lifecycle is a systematic approach to extracting value from
data. It provides a framework for data scientists to follow from problem
definition to model evaluation.
• The data science lifecycle encompasses five main stages, each with its own
set of tasks and goals.
• These stages are:
• Defining the problem/ask an interesting question
• Data collection and preparation/obtain data
• Data exploration and analysis/explore data
• Model building and evaluation/model the data
• Deployment and maintenance/communicate the results
• Defining the problem
• The first step in the data science lifecycle is to define the problem that
needs to be solved.
• This involves clearly articulating the business objective and understanding
the key requirements and constraints.
• Effective problem definition sets the stage for the entire data science
project, as it helps to align the goals of the analysis with the needs of the
organisation.
• A well-defined problem provides a clear direction for the data science
project and helps data scientists focus their efforts on finding relevant and
actionable insights.
• Defining the problem
• Stakeholder interviews: Engaging with key stakeholders to understand their
requirements, expectations, and pain points.
• Problem framing: Breaking down the overarching problem into smaller, more
manageable sub-problems.
• Defining success criteria: Establishing clear and measurable criteria for
evaluating the success of the data science project.
• Setting priorities: Identifying the most critical aspects of the problem that
need to be addressed first.
• Documenting requirements: Documenting the problem statement, goals,
and constraints to ensure that all team members are aligned.
• Data collection and preparation
• Data collection is a critical phase in the data science lifecycle, as the
quality and completeness of the data directly impact the accuracy and
reliability of the analyses.
• Data scientists can collect data from various sources, including
internal databases, external APIs, web scraping, and surveys.
• During the data collection process, it is essential to ensure the privacy
and security of the data, especially when dealing with sensitive or
personally identifiable information.
• Data collection and preparation
• Before diving into the analysis, data scientists need to prepare the data by
cleaning, transforming, and restructuring it. This involves tasks such as:
• Data cleaning: Removing outliers, handling missing values, and resolving
inconsistencies.
• Data integration: Combining data from different sources and resolving any
discrepancies or conflicts.
• Feature engineering: Creating new features that capture relevant information
and improve the performance of machine learning models.
• Data reduction: Reducing the dimensionality of the data to focus on the most
informative variables.
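The following is a minimal pandas sketch of the cleaning and feature-engineering tasks listed above; the column names, thresholds, and reference date are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical raw customer data with the usual problems
raw = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 230],  # a missing value and an impossible outlier
    "city": ["chennai", "Mumbai", "mumbai", None, "Delhi"],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-02-11", "2024-02-20", "2024-03-02", "2024-03-15"]),
})

clean = raw.copy()
clean["age"] = clean["age"].where(clean["age"].between(0, 120))  # treat impossible ages as missing
clean["age"] = clean["age"].fillna(clean["age"].median())        # handle missing values
clean["city"] = clean["city"].str.title().fillna("Unknown")      # resolve inconsistencies

# Feature engineering: derive a new variable from an existing one
clean["tenure_days"] = (pd.Timestamp("2024-04-01") - clean["signup_date"]).dt.days
print(clean)
```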
• Data exploration and analysis
• Data exploration is a crucial step in the data science lifecycle, as it allows
data scientists to understand the characteristics and quirks of the data.
• Through data exploration, they can uncover hidden insights, identify
outliers or anomalies, and validate assumptions.
• Data exploration also helps data scientists identify potential data quality
issues or biases that may influence the analysis.
• By visualising the data and conducting exploratory analyses, they can
gain a holistic understanding of the dataset and make informed
decisions about subsequent analyses.
• Data exploration and analysis
• Data scientists employ various methods and techniques to analyse data
effectively. These methods include:
• Descriptive statistics: Calculating summary statistics, such as mean, median,
and standard deviation, to summarise the data.
• Statistical modelling: Applying statistical models, such as regression or time
series analysis, to uncover relationships and make predictions.
• Data visualisation: Creating charts, graphs, and interactive visualisations to
present the data in a meaningful and engaging way.
• Machine learning: Using machine learning algorithms to identify patterns,
classify data, or make predictions.
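A minimal exploration sketch assuming pandas, NumPy, and matplotlib are available; the synthetic columns stand in for a real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # assumed installed for the visualisation step

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.normal(50, 10, 200),
    "units_sold": rng.poisson(30, 200),
})

# Descriptive statistics: quick numeric summary of every column
print(df.describe())

# Relationships between variables
print(df.corr())

# Data visualisation: distribution of one variable
df["price"].hist(bins=20)
plt.title("Distribution of price")
plt.show()
```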
• Model building and evaluation
• In the model-building and evaluation stage, data scientists develop and
refine predictive models based on the insights gained from the previous
stages.
• Building a data model: what you need to know
• Building a data model entails selecting a suitable algorithm or technique
that aligns with the problem and the characteristics of the data.
• Data scientists can choose from a wide range of models, including linear
regression, decision trees, neural networks, and support vector machines.
• Model building and evaluation
• Evaluating your data model’s performance
• To evaluate the performance of a data model, data scientists employ
various evaluation metrics, such as accuracy, precision, recall, and F1 score.
• These metrics quantify the model’s predictive accuracy and allow for the
comparison of different models or approaches.
• Data scientists should also perform a thorough analysis of the model’s
strengths and weaknesses.
• This includes assessing potential biases or errors, determining the model’s
interpretability, and identifying areas for improvement.
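A minimal model-building and evaluation sketch assuming scikit-learn is available; the synthetic dataset and the choice of logistic regression are illustrative, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real business dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluation metrics named in the slide
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1 score :", f1_score(y_test, pred))
```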
• Deployment and maintenance
• Deploying a data model requires careful planning to minimise disruptions and
ensure its practical utility. Common deployment strategies include:
• Batch Processing: Implementing the model periodically to analyse large
volumes of data in batches, suitable for scenarios with less urgency.
• Real-time Processing: Enabling the model to process data in real-time,
providing instantaneous insights and predictions, ideal for applications
requiring quick responses.
• Cloud Deployment: Leveraging cloud platforms for deployment, offering
scalability, flexibility, and accessibility, facilitating easier updates and
maintenance.
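As one small illustration of the batch-processing strategy, the sketch below loads a previously saved model and scores a periodic batch file; the file names, the joblib serialisation, and the assumption that the CSV contains only feature columns are all illustrative.

```python
import joblib
import pandas as pd

# Batch processing sketch: load a model saved earlier (e.g. with joblib.dump)
# and score a periodic batch of fresh records. File names are hypothetical.
model = joblib.load("model.joblib")
batch = pd.read_csv("new_customers.csv")  # assumed to contain only the model's feature columns

batch["prediction"] = model.predict(batch)
batch.to_csv("scored_customers.csv", index=False)  # hand results to downstream systems
```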
• Interpret and Communicate results
• In data science, interpreting and communicating results is the crucial
final step that transforms raw data analysis into actionable insights
and strategic recommendations.
• It involves not just understanding the technical findings but also
conveying them clearly and compellingly to diverse audiences,
including technical and non-technical stakeholders.
• Example
• Imagine a data scientist analyzing customer behavior on an e-commerce website.
• Instead of just presenting a list of statistical measures, they would interpret the results
to identify patterns like which products are most frequently purchased together
(market basket analysis) and then communicate this insight to the marketing team.
• They might say, "We found that customers who purchase product A are 60% more
likely to also purchase product B within the same week.
• This suggests a strong product relationship and presents an opportunity to create
targeted bundles or promotions to increase sales."
• This clear explanation, coupled with a visual representation (like an association rule
graph), allows the marketing team to understand the finding, its potential impact, and
how to use it to improve their strategies.
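To show how a finding like the one above could be derived, here is a minimal sketch that computes support, confidence, and lift for two products from made-up purchase baskets; the baskets and product IDs are illustrative assumptions, not the analysis from the example.

```python
# Hypothetical weekly purchase baskets (sets of product IDs)
baskets = [
    {"A", "B", "C"}, {"A", "B"}, {"A"}, {"B", "C"},
    {"A", "B"}, {"C"}, {"A", "B", "D"}, {"B"},
]

n = len(baskets)
has_a = sum("A" in b for b in baskets)
has_b = sum("B" in b for b in baskets)
has_ab = sum({"A", "B"} <= b for b in baskets)

support_ab = has_ab / n            # P(A and B)
confidence = has_ab / has_a        # P(B | A): how often A-buyers also buy B
lift = confidence / (has_b / n)    # how much buying A raises the chance of buying B

print(f"support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```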
