Unit I
Introduction To Data Science, Benefits and Uses, Facets of Data, Data Science Process in Brief, Big Data
Ecosystem and Data Science.
Data Science Process: Overview, Defining Goals and Creating Project Charter, Retrieving Data, Cleansing,
Integrating and Transforming Data, Exploratory Analysis, Model Building, Presenting Findings and Building
Applications on Top of Them.
BENEFITS:
Discover unknown transformative patterns:
Data science allows businesses to uncover new patterns and relationships that have the potential to transform
the organization. It can reveal low-cost changes to resource management for maximum impact on profit
margins. For example, an e-commerce company uses data science to discover that too many customer queries
are being generated after business hours. Investigation reveals that customers are more likely to purchase if
they receive a prompt response instead of an answer the next business day. By implementing 24/7 customer
service, the business grows its revenue by 30%.
Innovate new products and solutions:
Data science can reveal gaps and problems that would otherwise go unnoticed. Greater insight into purchase
decisions, customer feedback, and business processes can drive innovation in internal operations and external
solutions. For example, an online payment solution uses data science to collate and analyze customer comments
about the company on social media. Analysis reveals that customers forget passwords during peak purchase
periods and are unhappy with the current password retrieval system. The company can innovate a better
solution and see a significant increase in customer satisfaction.
Real-time optimization:
It’s very challenging for businesses, especially large-scale enterprises, to respond to changing conditions in
real time. This can cause significant losses or disruptions in business activity. Data science can help companies
predict change and react optimally to different circumstances. For example, a truck-based shipping company
uses data science to reduce downtime when trucks break down. They identify the routes and shift patterns that
lead to faster breakdowns and tweak truck schedules. They also set up an inventory of common spare parts that
need frequent replacement so trucks can be repaired faster.
USES:
Data science is used to study data in four main ways:
1. Descriptive analysis
2. Diagnostic analysis
3. Predictive analysis
4. Prescriptive analysis
1. Descriptive analysis:
Descriptive analysis examines data to gain insights into what happened or what is happening in the data
environment. It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables, or
generated narratives. For example, a flight booking service may record data like the number of tickets booked
each day. Descriptive analysis will reveal booking spikes, booking slumps, and high-performing months for
this service.
2. Diagnostic analysis:
Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened. It is
characterized by techniques such as drill-down, data discovery, data mining, and correlations. Multiple data
operations and transformations may be performed on a given data set to discover unique patterns with each of
these techniques. For example, the flight service might drill down on a particularly high-performing month to
better understand the booking spike. This may lead to the discovery that many customers visit a particular city
to attend a monthly sporting event.
3. Predictive analysis:
Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur in the
future. It is characterized by techniques such as machine learning, forecasting, pattern matching, and predictive
modeling. In each of these techniques, computers are trained to reverse engineer causality connections in the
data. For example, the flight service team might use data science to predict flight booking patterns for the
coming year at the start of each year. The computer program or algorithm may look at past data and predict
booking spikes for certain destinations in May. Having anticipated their customers’ future travel requirements,
the company could start targeted advertising for those cities from February.
4. Prescriptive analysis:
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen but
also suggests an optimum response to that outcome. It can analyze the potential implications of different
choices and recommend the best course of action. It uses graph analysis, simulation, complex event processing,
neural networks, and recommendation engines from machine learning.
Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns to
maximize the advantage of the upcoming booking spike. A data scientist could project booking outcomes for
different levels of marketing spend on various marketing channels. These data forecasts would give the flight
booking company greater confidence in their marketing decisions.
FACETS OF DATA
In Data Science and Big Data you’ll come across many different types of data, and each of them tends to
require different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural Language
Machine-generated
Graph-based
Audio, video and images
Streaming
Structured Data:
Fig : An Excel Sheet is an example of Structured Data
Structured data is data that depends on a data model and resides in a fixed field within a record. It’s often
easy to store structured data in tables within databases or Excel files. SQL (Structured Query Language) is
the preferred way to manage and query data that resides in databases. You may also come across structured
data that is hard to store in a traditional relational database.
Hierarchical data such as a family tree is one such example. The world isn’t made up of structured data,
though; it’s imposed upon it by humans and machines.
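As an illustration, the minimal sketch below (using Python's built-in sqlite3 module and a hypothetical employees table, not taken from the text above) shows how structured data sits in fixed fields and is queried with SQL:

# A small sketch: structured data in a relational table, queried with SQL via sqlite3
import sqlite3

conn = sqlite3.connect(":memory:")          # temporary in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, city TEXT, salary INTEGER)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Chennai", 65000), ("Bob", "Ongole", 70000)],
)
conn.commit()

# SQL query against the fixed fields of the structured table
for row in cur.execute("SELECT name, salary FROM employees WHERE salary > 60000"):
    print(row)
conn.close()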
Unstructured Data:
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email. Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find the number of people who have written an
email complaint about a specific employee, because so many ways exist to refer to a person. The
thousands of different languages and dialects out there further complicate this.
A human-written email is also a perfect example of natural language data.
Natural Language:
Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize
well to other domains. Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of
text. This shouldn’t be a surprise though: humans struggle with natural language as well. It’s ambiguous by
nature. The concept of meaning itself is questionable here. Have two people listen to the same conversation.
Will they get the same meaning? The meaning of the same words can vary when coming from someone upset
or joyous.
Machine-generated Data
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples are web server logs, call detail records, network event logs and telemetry.
Graph-based or Network Data:
Table-like structures are not the best approach for highly interconnected or “networked” data, where the relationships between
entities have a valuable role to play.
“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this case points
to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects. Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
Graph structures use nodes, edges, and properties to represent and store graph data.
Graph-based data is a natural way to represent social networks, and its structure allows you to calculate
specific metrics such as the influence of a person and the shortest path between two people.
Graph databases are used to store graph-based data and are queried with specialized query languages such
as SPARQL.
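As a small illustration, the sketch below (assuming the networkx library is installed; the names and friendships are made up) computes a simple influence measure and the shortest path between two people in a social network:

# A minimal graph-data sketch: people as nodes, friendships as edges
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Anna", "Ben"), ("Ben", "Carl"), ("Carl", "Dina"),
    ("Anna", "Dina"), ("Dina", "Eve"),
])

print(nx.degree_centrality(G))              # rough measure of each person's influence
print(nx.shortest_path(G, "Anna", "Eve"))   # shortest path between two people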
Graph data poses its own challenges, but for a computer, interpreting audio and image data can be even more
difficult.
Audio, Image and Video:
Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks that are trivial
for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
Multimedia data in the form of audio, video, images and sensor signals have become an integral part of
everyday life. Moreover, they have revolutionized product testing and evidence collection by providing
multiple sources of data for quantitative and systematic assessment.
Various libraries, development languages and IDEs are commonly used in the field for working with such multimedia data.
Streaming Data:
While streaming data can take almost any of the previous forms, it has an extra property: the data flows into
the system when an event happens, instead of being loaded into a data store in a batch. Although it isn’t really
a different type of data, we treat it here separately because you need to adapt your process to deal with this type
of information.
Examples are the “What’s trending” on Twitter, live sporting or music events and the stock market.
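As a purely illustrative sketch (the events below are invented and simulated with a Python generator), the code processes each event as it arrives rather than loading a batch:

# A toy streaming sketch: events arrive one by one and are processed immediately
import time

def event_stream():
    # hypothetical events; in practice these would arrive from Twitter, a stock feed, etc.
    for event in ["#goal", "#goal", "#concert", "#goal"]:
        yield event
        time.sleep(0.1)     # pretend the events arrive over time

counts = {}
for event in event_stream():            # process each event as it arrives
    counts[event] = counts.get(event, 0) + 1
    print("trending so far:", counts)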
Data Science Process in Brief (the OSEMN framework):
O – Obtain data
Data can be pre-existing, newly acquired, or a data repository downloadable from the internet. Data
scientists can extract data from internal or external databases, company CRM software, web server logs,
social media or purchase it from trusted third-party sources.
S – Scrub data
Data scrubbing, or data cleaning, is the process of standardizing the data according to a predetermined
format. It includes handling missing data, fixing data errors, and removing any data outliers.
E – Explore data
Data exploration is preliminary data analysis that is used for planning further data modeling strategies. Data
scientists gain an initial understanding of the data using descriptive statistics and data visualization tools.
Then they explore the data to identify interesting patterns that can be studied or actioned.
M – Model data
Software and machine learning algorithms are used to gain deeper insights, predict outcomes, and prescribe
the best course of action. Machine learning techniques like association, classification, and clustering are
applied to the training data set. The model might be tested against predetermined test data to assess result
accuracy. The data model can be fine-tuned many times to improve result outcomes.
N – Interpret results
Data scientists work together with analysts and businesses to convert data insights into action. They make
diagrams, graphs, and charts to represent trends and predictions. Data summarization helps stakeholders
understand and implement results effectively.
Big Data Ecosystem And Data Science:
The Big Data Ecosystem refers to the collection of tools, technologies, frameworks, and platforms designed
to collect, store, process, analyze, and visualize large and complex datasets, commonly known as Big Data.
It provides the infrastructure and processes needed to manage the 5 V's of Big Data: Volume, Velocity,
Variety, Veracity, and Value.
1. Data Sources
The origins of raw data, for example:
o IoT devices and sensors.
o Social media platforms.
o Logs and transactions (e.g., web server logs, online purchases).
2. Data Ingestion
The process of importing raw data into a centralized system for processing and storage.
Tools:
o Batch Ingestion: Apache Sqoop, Flume.
o Real-Time Ingestion: Apache Kafka, RabbitMQ, Flink.
3. Data Storage
Solutions designed to store massive volumes of data efficiently, often in distributed systems.
Categories:
o Structured Storage: Relational databases (MySQL, PostgreSQL).
o Unstructured Storage: Hadoop Distributed File System (HDFS), Amazon S3.
o NoSQL Databases: MongoDB, Cassandra, HBase.
4. Data Processing
Frameworks that clean, transform, and analyze data at scale, in batch or in real time (e.g., Apache Spark, Apache Flink, Hadoop MapReduce).
5. Data Analysis
Languages and tools used to explore the processed data and extract insights (e.g., Python, R, SQL).
6. Data Visualization
Representing data and analytics in charts, graphs, and dashboards for decision-making.
Tools:
o Tableau
o Power BI
o D3.js
o Kibana
8. Orchestration and Workflow Management
Tools that schedule and coordinate the many steps of a data pipeline (e.g., Apache Airflow, Apache Oozie).
Key characteristics of the Big Data Ecosystem:
1. Scalability: the ecosystem can grow with the data by adding more machines or storage.
2. Distributed Architecture: data and computation are spread across clusters of machines.
3. Fault Tolerance: processing continues even when individual nodes fail.
4. Real-Time Processing: streaming data can be handled as it arrives.
5. Integration: diverse tools, data sources, and formats work together in one pipeline.
Examples of organizations built on this ecosystem:
Netflix: uses big data to analyze viewing behaviour and power its recommendation engine.
Amazon: uses big data for personalized product recommendations, demand forecasting and supply-chain optimization.
The Big Data Ecosystem and Data Science are interdependent domains. While the Big Data Ecosystem
focuses on the tools and technologies required to handle large-scale data, Data Science uses this data to
extract insights, build models, and make predictions. Here’s a detailed explanation of their relationship:
The Big Data Ecosystem provides the infrastructure needed for Data Science to function effectively:
Data Collection: Big data tools capture data from multiple sources, such as IoT devices, social
media, transactional systems, and logs.
o Example: Apache Kafka streams real-time data for analysis.
Data Storage: Tools like Hadoop HDFS and Amazon S3 store vast amounts of raw and processed
data that data scientists rely on.
Data Processing: Frameworks like Apache Spark and Flink enable large-scale data cleaning,
transformation, and feature engineering.
Data Science relies on the Big Data Ecosystem to perform tasks like:
Analyzing Trends and Patterns: Using tools like Python, R, or SQL to analyze data stored in big
data platforms.
o Example: Data scientists extract trends from customer behavior stored in Google BigQuery.
Machine Learning: Data Science applies machine learning to data processed by big data tools to
build predictive models.
The Big Data Ecosystem addresses the challenges of Volume, Velocity, and Variety, which are crucial for
Data Science projects:
Volume: Data Science needs tools like Hadoop to manage and query terabytes or petabytes of data.
Velocity: Stream processing tools (e.g., Apache Flink) ensure real-time data is available for
predictive models.
Variety: Data Science benefits from the ecosystem's ability to handle structured (e.g., databases),
semi-structured (e.g., JSON), and unstructured (e.g., videos, text) data.
Real-world examples of how the two work together:
Fraud Detection:
o Big Data Ecosystem: Real-time financial transactions are processed with Kafka and stored in
HDFS.
o Data Science: Machine learning models analyze these transactions for anomalies.
Customer Personalization:
o Big Data Ecosystem: Logs from millions of users are processed with Apache Spark.
o Data Science: Insights from these logs are used to recommend personalized products.
6. Complementary Roles
The Big Data Ecosystem supplies the data and the infrastructure to handle it; Data Science supplies the analytical methods that turn that data into insights and predictions.
Data Science Process: Overview
The data science process typically consists of six steps:
1. Defining research goals
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
• Step 1: Discovery or defining the research goal
This step involves understanding the business problem, defining the research goal, and deciding what data and analysis are needed to answer it.
• Step 2: Retrieving data
This step involves acquiring data from all the identified internal and external sources that help to answer
the business question.
It is the collection of the data required for the project, and the process of gaining a business understanding of the
data you have and deciphering what each piece of data means. This could entail determining exactly what
data is required and the best methods for obtaining it. It also entails determining what each of the data
points means in terms of the company. If a client gives us a data set, for example, we need
to know what each column and row represents.
• Step 3: Data preparation
Data can have many inconsistencies, like missing values, blank columns or an incorrect data format, which
need to be cleaned. We need to process, explore and condition the data before modeling. Clean data gives
better predictions.
• Step 4: Data exploration
Data exploration is about building a deeper understanding of the data: try to understand how variables interact with
each other, the distribution of the data, and whether there are outliers. To achieve this, use descriptive
statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis (EDA).
• Step 5: Data modeling
In this step, the actual model building process starts. Here, the data scientist splits the data into training and
testing sets. Techniques like association, classification and clustering are applied to the training data set. The
model, once prepared, is tested against the "testing" dataset.
• Step 6: Presentation and automation
Deliver the final baselined model with reports, code and technical documents in this stage. The model is
deployed into a real-time production environment after thorough testing. In this stage, the key findings are
communicated to all stakeholders, which helps to decide if the project results are a success or a failure based
on the results of the model.
Defining Goals and Creating a Project Charter
Defining goals and creating a project charter are crucial steps in ensuring the success of any project. These
processes help align the team's understanding, establish clear objectives, and provide a structured roadmap
for execution. Here's a guide to both:
1. Defining Goals
Goals are the specific, measurable outcomes that a project aims to achieve. Well-defined goals provide
direction, focus, and benchmarks for success.
To define goals effectively, break the broader goal into smaller, actionable tasks that collectively achieve it, and engage stakeholders so that the goals reflect real business needs.
2. Creating a Project Charter
A project charter is a document that outlines the project's purpose, objectives, scope, stakeholders, and
other key details. It serves as a formal agreement between stakeholders and the project team.
Key components of a project charter:
Project Title: a short, descriptive name for the project.
Scope: what the project will and will not cover.
Stakeholders: the individuals or groups involved in or affected by the project.
Deliverables: the tangible outputs the project will produce.
Timeline: key milestones and deadlines.
Resources: the budget, tools, and people available to the project.
Authorization: formal approval and sign-off by the project sponsor.
Benefits of a project charter:
Alignment: Ensures all stakeholders understand and agree on the project’s purpose and goals.
Direction: Serves as a roadmap for the project team.
Accountability: Defines roles, responsibilities, and deliverables.
Communication: Acts as a reference for stakeholders throughout the project lifecycle.
Example (outline of a project charter for a data science project):
Scope:
Stakeholders:
Deliverables:
Timeline:
Resources:
Budget: $50,000
Tools: Python, AWS, Tableau
Authorization:
Retrieving Data
• Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to go into
the field and design a data collection process themselves, but many companies will have already collected and stored the
data, and what they don't have can often be bought from third parties.
• A lot of high-quality data is freely available for public and commercial use. Data can be stored in
various formats, ranging from simple text files to tables in a database. Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data: assess the relevance and quality of the data that's
readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning
work may already be done. This data can be stored in official data repositories such as databases, data marts,
data warehouses and data lakes maintained by a team of IT professionals.
• A data repository is also known as a data library or data archive. This is a general term for a data set
isolated to be mined for data reporting and analysis. A data repository can be a large database infrastructure, or
several databases, that collect, manage and store data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from multiple sources or segments
of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and tagged with
metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to what the data user
needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data comes from,
how it was captured and what it represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.
• A centralized data repository has advantages and disadvantages: data isolation allows for easier and faster data
reporting, but if security is breached, unauthorized users can access all sensitive data more easily than if it were
distributed across several locations.
2. Look outside the company if internal data is not enough
• If the required data is not available within the company, look to other companies that provide such
data. For example, Nielsen and GfK provide data for the retail industry. Data scientists also
make use of data from Twitter, LinkedIn and Facebook.
• Government organizations share their data for free with the world. This data can be of excellent quality,
depending on the institution that creates and manages it. The information they share covers a broad range of
topics, such as the number of accidents or the amount of drug abuse in a certain region and its demographics.
• Allocate time for data correction and data cleaning: collecting suitable, error-free data is key to the
success of a data science project.
• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will
make data scientists spend many hours solving data issues that could have been prevented during data
import.
• Data scientists must investigate the data during the import, data preparation and exploratory phases. The
difference is in the goal and the depth of the investigation.
• In the data retrieval process, verify whether the data is of the right data type and matches the source
document.
• During data preparation, more elaborate checks are performed; for example, check whether time and date
formats are consistent.
• During the exploratory phase, the data scientist's focus shifts to what can be learned from the data. The data is
now assumed to be clean, and statistical properties such as distributions, correlations and outliers are
examined.
Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing noisy data and resolving
inconsistencies in the data.
• Data cleaning is the first step in data pre-processing; it is used to handle missing values, smooth
noisy data, recognize outliers and correct inconsistencies.
• Missing values: dirty data affects the mining procedure and leads to unreliable and poor output, so data
cleaning routines are important. For example, suppose that the average salary of
staff is Rs. 65,000; this value can be used to replace a missing salary value.
• Data entry errors: Data collection and data entry are error-prone processes. They often require human
intervention and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn't free from errors
either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the extract, transform and
load phase (ETL).
• Whitespace errors: whitespace characters tend to be hard to detect but cause errors like other redundant characters
would. To remove the spaces present at the start and end of a string, we can use the strip() function on the string
in Python.
• Fixing capital letter mismatches: capital letter mismatches are a common problem. Most programming
languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion functions such as lower() and upper(): the lower() function converts the
input string to lowercase, and the upper() function converts it to uppercase.
# Sample dataset (simulating a small dataset with issues)
data = [
    {"name": " Alice ", "salary": None},
    {"name": "BOB", "salary": 70000},
    {"name": "chennai", "salary": 65000},
    {"name": " dAvid ", "salary": None},
]

# Default value for missing salaries
default_salary = 65000

# Data cleaning routine
def clean_data(data):
    cleaned_data = []
    for record in data:
        # Fix missing salary by replacing it with the default value
        record["salary"] = record["salary"] if record["salary"] is not None else default_salary
        # Strip leading and trailing whitespace from the name
        record["name"] = record["name"].strip()
        # Convert the name to title case (fixes capital letter mismatches)
        record["name"] = record["name"].title()
        # Append the cleaned record to the cleaned data list
        cleaned_data.append(record)
    return cleaned_data

# Apply the cleaning routine
cleaned_data = clean_data(data)

# Display the cleaned dataset
print("Cleaned Data:")
for record in cleaned_data:
    print(record)
Integrating Data (Combining Data from Different Sources)
1. Joining tables
• Joining tables allows you to combine the information of one observation found in one table with the
information found in another table. The focus is on enriching a single observation.
• A primary key is a value that cannot be duplicated within a table. This means that one value can only be
seen once within the primary key column. That same key can exist as a foreign key in another table, which
creates the relationship. A foreign key can have duplicate instances within a table.
• Fig. 1.6.2 shows Joining two tables on the CountryID and CountryName keys.
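A minimal pandas sketch of such a join (assuming pandas is installed; the tables and the CountryID key are illustrative, in the spirit of Fig. 1.6.2):

# Joining two tables on a shared key so each sales observation is enriched
import pandas as pd

countries = pd.DataFrame({"CountryID": [1, 2], "CountryName": ["India", "Japan"]})
sales = pd.DataFrame({"CountryID": [1, 1, 2], "Amount": [100, 250, 175]})

# Each sales row picks up the matching country information
enriched = sales.merge(countries, on="CountryID", how="left")
print(enriched)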
2. Appending tables
• Appending tables is also called stacking tables. It effectively adds observations from one table to another table.
Fig. 1.6.3 shows appending tables.
• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of appending these tables is a
larger table with the observations from Table 1 as well as Table 2. The equivalent operation in set theory
would be the union, and this is also the command in SQL, the common language of relational databases.
Other set operators are also used in data science, such as set difference and intersection.
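A minimal pandas sketch of appending (stacking) two tables, assuming pandas is installed; the x3 values 3 and 33 mirror the example above:

# Appending two tables: observations of table2 are added below those of table1
import pandas as pd

table1 = pd.DataFrame({"x1": [1, 2], "x2": [10, 20], "x3": [3, 3]})
table2 = pd.DataFrame({"x1": [5, 6], "x2": [50, 60], "x3": [33, 33]})

# The equivalent of a UNION in SQL terms
stacked = pd.concat([table1, table2], ignore_index=True)
print(stacked)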
3. Using views to simulate data joins and appends
• Appending tables physically duplicates the data and requires more storage. If the tables hold terabytes of data,
duplicating them becomes problematic; for this reason, the concept of a view was invented. A view combines the
data virtually, so duplication of the data is avoided.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually into a yearly sales
table instead of duplicating the data.
Transforming Data
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Relationships between an input variable and an output variable aren't always linear.
• Reducing the number of variables: having too many variables in the model makes the model difficult to
handle, and certain techniques don't perform well when you overload them with too many input variables.
All techniques based on a Euclidean distance perform well only up to about 10 variables. Data scientists use
special methods to reduce the number of variables while retaining the maximum amount of information.
Euclidean distance:
• Euclidean distance is used to measure the similarity between observations. It is calculated as the square
root of the sum of the squared differences between the corresponding values of the two points, as sketched below.
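A minimal sketch of the calculation in plain Python (the two observations are made up):

# Euclidean distance between two observations with the same number of variables
import math

def euclidean_distance(x, y):
    # square root of the sum of squared differences between corresponding values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # prints 5.0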
• Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or
false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation;
a pandas sketch follows.
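A minimal sketch of creating dummy variables, assuming pandas is installed; the city column is illustrative:

# Turning a categorical variable into dummy (0/1) indicator columns
import pandas as pd

df = pd.DataFrame({"city": ["Chennai", "Ongole", "Chennai"]})
dummies = pd.get_dummies(df, columns=["city"])
print(dummies)   # columns city_Chennai and city_Ongole holding 1/0 (True/False) values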
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is an important first step in data science projects. It involves looking at
and visualizing data to understand its main features, find patterns, and discover how different parts of the
data are connected.
EDA helps to spot any unusual data or outliers and is usually done before starting more detailed statistical
analysis or building models.
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data
science and statistical modeling. Here are some of the key reasons why EDA is a critical step in the data
analysis process:
Helps to understand the dataset, showing how many features there are, the type of data in each feature,
and how the data is spread out, which helps in choosing the right methods for analysis.
EDA helps to identify hidden patterns and relationships between different data points, which help us in
feature selection and model building.
Allows us to spot errors or unusual data points (outliers) that could affect the results.
Insights obtained from EDA help decide which features are most important for building
models and how to prepare them to improve performance.
By understanding the data, EDA helps us choose the best modeling techniques and adjust them
for better results.
Types of Exploratory Data Analysis
There are various types of EDA techniques based on the nature of the data. Depending on the number of
columns we are analyzing, we can divide EDA into three types: univariate, bivariate and multivariate.
1. Univariate Analysis
Univariate analysis focuses on studying one variable to understand its characteristics. It helps describe the
data and find patterns within a single feature. Common methods include histograms to show data
distribution, box plots to detect outliers and understand data spread, and bar charts for categorical
data. Summary statistics like mean, median, mode, variance, and standard deviation help describe the
central tendency and spread of the data.
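A minimal univariate sketch (assuming pandas and matplotlib are installed; the salary values are made up):

# Univariate EDA: summary statistics and a histogram of a single numeric feature
import pandas as pd
import matplotlib.pyplot as plt

salaries = pd.Series([45000, 52000, 61000, 65000, 70000, 72000, 150000])

print(salaries.describe())        # mean, std, quartiles, min, max
print("median:", salaries.median())

salaries.plot(kind="hist", bins=5, title="Salary distribution")
plt.show()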
2. Bivariate Analysis
Bivariate analysis focuses on exploring the relationship between two variables to find connections,
correlations, and dependencies. It’s an important part of exploratory data analysis that helps understand
how two variables interact. Some key techniques used in bivariate analysis include scatter plots, which
visualize the relationship between two continuous variables; correlation coefficient, which measures how
strongly two variables are related, commonly using Pearson’s correlation for linear relationships;
and cross-tabulation, or contingency tables, which show the frequency distribution of two categorical
variables and help understand their relationship.
Line graphs are useful for comparing two variables over time, especially in time series data, to identify
trends or patterns. Covariance measures how two variables change together, though it’s often
supplemented by the correlation coefficient for a clearer, more standardized view of the relationship.
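A minimal bivariate sketch (assuming pandas and matplotlib are installed; the ad_spend and bookings figures are made up):

# Bivariate EDA: scatter plot and Pearson correlation between two numeric variables
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "bookings": [120, 180, 260, 310, 400],
})

print(df["ad_spend"].corr(df["bookings"]))   # Pearson correlation coefficient

df.plot(kind="scatter", x="ad_spend", y="bookings")
plt.show()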
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more variables in the dataset. It aims to
understand how variables interact with one another, which is crucial for most statistical modeling
techniques. It includes techniques like pair plots, which show the relationships between multiple variables
at once, helping to see how they interact. Another technique is Principal Component Analysis (PCA),
which reduces the complexity of large datasets by simplifying them while keeping the most important
information.
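A minimal multivariate sketch (assuming scikit-learn is installed, using its built-in Iris dataset):

# PCA reduces four correlated variables to two principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 observations, 4 variables
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # share of information kept by each component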
In addition to univariate, bivariate and multivariate analysis, there are specialized EDA techniques tailored for
specific types of data or analysis needs:
Spatial Analysis: For geographical data, using maps and spatial plotting to understand the geographical
distribution of variables.
Text Analysis: Involves techniques like word clouds, frequency distributions, and sentiment analysis to
explore text data.
Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal
component. Time series analysis involves examining and modeling patterns, trends, and seasonality
in the data over time. Techniques like line plots, autocorrelation analysis, moving
averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in
time series analysis.
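A minimal time series sketch (assuming pandas is installed; the monthly booking figures are made up):

# A monthly series smoothed with a 3-month moving average
import pandas as pd

bookings = pd.Series(
    [200, 220, 260, 400, 380, 250],
    index=pd.date_range("2024-01-01", periods=6, freq="MS"),
)

print(bookings.rolling(window=3).mean())   # 3-month moving average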
Model building
Model building is a crucial phase in the data science process, where data is transformed into actionable
insights and predictions. The following step-by-step guide covers the essential techniques needed to
develop accurate and reliable predictive models.
Step 1: Define the Problem and Set Objectives: Clearly define the problem you aim to solve and establish
measurable objectives. Understand the scope, constraints, and desired outcomes of your model. This step
ensures that your model aligns with the problem at hand and provides meaningful insights.
Step 2: Gather and Prepare the Data: Collect the relevant data required for model building. Clean and
preprocess the data, handling missing values, outliers, and inconsistencies. Perform feature engineering and
selection to extract meaningful predictors and ensure data quality.
Step 3: Split the Data: Split your data into training and testing sets. The training set is used to train the
model, while the testing set serves as an unseen dataset for evaluating the model’s performance. Consider
techniques like cross-validation for robust model assessment.
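A minimal sketch of this step (assuming scikit-learn is installed, using its built-in Iris dataset rather than project data):

# Splitting data into training and testing sets, plus cross-validation
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 5-fold cross-validation on the training data for a more robust assessment
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_train, y_train, cv=5))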
Step 4: Choose the Right Algorithm: Select the appropriate machine learning algorithm based on your
problem type (e.g., classification, regression) and data characteristics. Consider popular algorithms like linear
regression, decision trees, random forests, support vector machines, or deep learning models.
Step 5: Train the Model: Fit the selected algorithm to the training data. Adjust the model’s parameters and
hyperparameters to optimize its performance. Use techniques like grid search or Bayesian optimization to
find the best parameter settings.
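A minimal sketch of training with hyperparameter tuning via grid search (assuming scikit-learn is installed; the parameter grid is illustrative):

# Fitting a model and tuning its hyperparameters with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)     # best hyperparameter combination found
print(search.best_score_)      # its cross-validated accuracy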
Step 6: Evaluate the Model: Assess the model’s performance using appropriate evaluation metrics, such as
accuracy, precision, recall, or mean squared error. Compare the model’s predictions with the actual values in
the testing dataset. Consider additional techniques like ROC curves or confusion matrices for classification
problems.
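A minimal sketch of evaluating a classifier on the held-out test set (assuming scikit-learn is installed, using its built-in Iris dataset):

# Comparing predictions with actual test values using common metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class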
Step 7: Fine-tune and Optimize: Iteratively refine your model to enhance its performance. Experiment with
different parameter settings, feature selections, or ensemble techniques to improve accuracy and
generalization. Regularize the model to avoid overfitting and ensure it performs well on unseen data.
Step 8: Interpret the Results: Understand and interpret the model’s output to gain insights into the
underlying patterns and relationships in the data. Analyze feature importances, coefficients, or decision
boundaries to explain the model’s behavior. Communicate the results effectively to stakeholders.
Step 9: Deploy and Monitor: Deploy your model in a production environment to make predictions or
support decision-making processes. Continuously monitor the model’s performance and assess its impact on
business outcomes. Update the model periodically as new data becomes available.
Presenting Findings
The goal is to effectively communicate the results of your analysis to stakeholders, which often involves
storytelling supported by visuals and clear explanations.
Steps:
Understand Your Audience:
1. Tailor your presentation to the audience's technical expertise (e.g., executives, developers, or
analysts).
2. Focus on actionable insights for decision-makers and methodology for technical teams.
Create a Narrative:
1. Frame the analysis as a story: the business question, the approach taken, the key findings, and the recommended actions.
Use Visualizations:
1. Choose the right chart types (e.g., bar charts, scatter plots, heatmaps).
2. Tools: Python libraries (Matplotlib, Seaborn), Tableau, or Power BI.
Leverage Dashboards:
1. Build interactive dashboards so stakeholders can explore the key metrics themselves (e.g., in Tableau, Power BI, or Kibana).
Write Reports:
1. Include detailed methodologies, results, and appendices for raw data or extended analysis.
2. Tools: Jupyter Notebooks, LaTeX, or Word.
Building Applications on Top of the Findings
Applications transform insights into tools for automation, decision-making, or user interaction.
Steps:
1. Define the purpose: determine how the insights will be used (e.g., predictions, recommendations, process
optimization).
2. Develop and test: build the application around the model and test it before release.
3. Deploy: deploy on a server or cloud platform like AWS, Azure, Google Cloud, or Heroku.
Example workflow: the findings are first presented to stakeholders (for instance, as a dashboard and a summary
report), and an application is then developed on top of the model so that its outputs can be used in day-to-day
operations.