Unit 1 Notes

Introduction to Data Analytics
Data analytics is a field focused on examining raw data to uncover patterns, draw conclusions, and
support decision-making. It's a multidisciplinary area that combines statistical analysis, programming,
and domain expertise to make sense of data. Here’s a broad overview of the key components and
steps involved in data analytics:
1. Data Collection
 Sources: Data can come from various sources such as databases, web servers, surveys, social
media, or sensors.
 Types: Structured data (e.g., databases, spreadsheets) and unstructured data (e.g., text,
images).
2. Data Cleaning and Preparation
 Cleaning: This involves handling missing values, correcting errors, and removing duplicates.
 Transformation: Converting data into a usable format, which might include normalization,
aggregation, or encoding.
3. Exploratory Data Analysis (EDA)
 Descriptive Statistics: Measures such as mean, median, mode, variance, and standard
deviation.
 Visualization: Using charts (e.g., histograms, scatter plots) to understand data distributions
and relationships.
4. Data Modeling
 Statistical Models: Techniques like regression analysis to understand relationships between
variables.
 Machine Learning: Methods for tasks such as classification, clustering, and prediction, used to
make data-driven decisions or forecasts.
5. Interpretation and Analysis
 Insight Generation: Drawing conclusions based on data models and visualizations.
 Contextual Understanding: Relating findings to the business or research context to ensure
relevance and accuracy.
6. Reporting and Visualization
 Dashboards: Interactive tools that provide real-time data insights.
 Reports: Structured presentations of findings, including visualizations and summaries, often
used to communicate with stakeholders.
7. Decision Making
 Actionable Insights: Using the analysis to guide strategic decisions or operational changes.
 Continuous Improvement: Iteratively refining models and approaches based on feedback
and new data.
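As a rough illustration of steps 2 and 3 above, the sketch below removes duplicates, fills a missing value, and then computes a few descriptive statistics. It uses only the Python standard library. The records, the field names (order_id, amount), and the choice to fill gaps with the median are assumptions made for this example, not a prescribed method.

```python
import statistics

# Hypothetical raw records: one missing value and one duplicate row.
raw_rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},   # missing value
    {"order_id": 3, "amount": 80.0},
    {"order_id": 3, "amount": 80.0},   # duplicate of the previous row
    {"order_id": 4, "amount": 100.0},
]

def clean(rows):
    """Drop rows with a repeated order_id, then fill missing amounts with the median."""
    seen, unique = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(dict(row))  # copy so the input is left untouched
    known = [r["amount"] for r in unique if r["amount"] is not None]
    fill_value = statistics.median(known)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = fill_value
    return unique

cleaned = clean(raw_rows)
amounts = [r["amount"] for r in cleaned]

# Descriptive statistics from the EDA step.
summary = {
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "stdev": statistics.stdev(amounts),  # sample standard deviation
}
print(summary)
```

In practice the same steps are usually done with a library such as Pandas, and the right way to handle missing values depends on the data and the question being asked.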
Tools and Technologies
 Programming Languages: Python, R
 Software: Excel, Tableau, Power BI
 Libraries: Pandas, NumPy, Scikit-learn (for Python); ggplot2, dplyr (for R)
Skills Required
 Statistical Knowledge: Understanding of statistical methods and techniques.
 Programming Skills: Proficiency in data manipulation and analysis languages.
 Domain Knowledge: Insight into the specific industry or field of study to make data relevant
and actionable.
 Communication Skills: Ability to explain technical findings to non-technical stakeholders
effectively.
Data analytics is increasingly critical in various fields including business, healthcare, finance, and
more, as organizations strive to leverage their data for competitive advantage and improved
outcomes.
Sources and nature of data

In data analytics, understanding the sources and nature of data is crucial because it affects how the
data is collected, processed, and interpreted. Here’s a comprehensive look at the various sources and
nature of data:
Sources of Data
1. Internal Sources
o Databases: Relational databases (SQL-based), which store structured data, and NoSQL
databases, which often store semi-structured data.
o Spreadsheets: Excel, Google Sheets, often used for data entry and simple analysis.
o Enterprise Systems: ERP (Enterprise Resource Planning) and CRM (Customer
Relationship Management) systems that manage business processes and customer
data.
o Logs and Records: System logs, transaction records, and historical data generated by
internal processes.
2. External Sources
o Public Data Sets: Government databases (e.g., census data), public health data, and
other freely available datasets.
o Social Media: Data from platforms like Twitter, Facebook, and Instagram, which can
be used for sentiment analysis, trends, and user behavior studies.
o Web Scraping: Data collected from websites using automated tools or scripts.
o Third-Party Data Providers: Commercial vendors that sell data such as market
research, consumer behavior, or economic indicators.
3. Real-Time Data
o Sensors and IoT: Data from connected devices and sensors used in manufacturing,
healthcare, and smart cities.
o Streaming Data: Live data from social media feeds, financial markets, or online
transactions.
4. Surveys and Questionnaires
o Structured Surveys: Pre-defined questions that yield quantitative data.
o Open-Ended Responses: Qualitative data from user feedback and open questions.
5. Experimental Data
o A/B Testing: Data generated from controlled experiments comparing different
variants or treatments.
o Clinical Trials: Data from controlled studies in medical and scientific research.
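To make the A/B testing entry above more concrete, here is a minimal sketch of how data from two variants might be compared. It uses a two-proportion z-test, which is one common choice and not the only one. The visitor and conversion counts are made-up numbers, and the function name is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B with a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF, written with math.erf.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return rate_a, rate_b, z, p_value

# Hypothetical experiment: 1000 visitors per variant.
rate_a, rate_b, z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone, but the threshold to use is a design decision that should be fixed before the experiment is run.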
Nature of Data
1. Structured Data
o Definition: Highly organized and easily searchable data, typically in tabular format
(rows and columns).
o Examples: Databases, spreadsheets, and CSV files.
o Characteristics: Well-defined data types and fields, easy to analyze with traditional
data tools and techniques.
2. Unstructured Data
o Definition: Data that does not have a predefined format or structure.
o Examples: Text documents, email bodies, social media posts, images, and videos.
o Characteristics: Requires advanced techniques such as natural language processing
(NLP) and image recognition to extract useful information.
3. Semi-Structured Data
o Definition: Data that does not fit into a traditional row-column database but still has
some organizational properties.
o Examples: JSON, XML, and HTML.
o Characteristics: Contains tags or markers to separate data elements but does not
conform to a rigid structure.
4. Quantitative Data
o Definition: Numerical data that can be measured and quantified.
o Examples: Sales figures, temperature readings, and website traffic counts.
o Characteristics: Used for statistical analysis and mathematical modeling.
5. Qualitative Data
o Definition: Non-numerical data that describes qualities or characteristics.
o Examples: Customer feedback, interview transcripts, and product reviews.
o Characteristics: Analyzed through methods like thematic analysis or coding to
identify patterns and insights.
6. Time-Series Data
o Definition: Data collected at successive points in time.
o Examples: Stock prices, weather data, and website traffic over time.
o Characteristics: Used for trend analysis and forecasting.
7. Spatial Data
o Definition: Data related to geographic locations and spatial relationships.
o Examples: GPS data, maps, and location-based services.
o Characteristics: Often analyzed with geographic information systems (GIS).
Understanding these sources and the nature of data helps in selecting appropriate analytical
techniques, tools, and methodologies for effective data analysis.
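As a small example of how the nature of the data guides the technique, time-series data is often smoothed with a moving average before looking for a trend. The sketch below is one simple way to do that. The daily visit counts and the window size of 3 are assumptions chosen for illustration.

```python
def moving_average(values, window):
    """Return the simple moving average over each full window of consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical website traffic, one value per day.
daily_visits = [100, 120, 90, 110, 130, 150, 140]
trend = moving_average(daily_visits, 3)
print(trend)
```

Each output value averages three consecutive days, so short-term spikes are damped and the overall direction is easier to see.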
1. Structured Data
Definition: Structured data is highly organized and easily searchable in relational databases. It is
typically stored in tables with rows and columns, where each column has a specific data type and the
data is consistently formatted.
Characteristics:
 Organized: Data is neatly arranged in predefined fields or columns.
 Easily Searchable: Simple querying and data retrieval using SQL (Structured Query Language).
 Fixed Schema: Requires a predefined schema to define the structure of the data.
Examples:
 Databases: SQL databases like MySQL, PostgreSQL, and Oracle.
 Spreadsheets: Excel files with well-defined rows and columns.
 Tables: Customer records, inventory lists.
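The fixed schema and SQL querying described above can be sketched with Python's built-in sqlite3 module and an in-memory database. The table name, column names, and rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Fixed schema: every row must supply these typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Delhi"), (3, "Chen", "Pune")],
)

# Because the structure is known in advance, retrieval is a simple query.
pune_customers = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
print(pune_customers)

conn.close()
```

The query returns one tuple per matching row, which is why structured data is considered easy to search and analyze with traditional tools.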
2. Semi-Structured Data
Definition: Semi-structured data does not fit neatly into a table but still has some organizational
properties that make it easier to analyze than unstructured data. It often includes tags or markers to
separate data elements and enforce hierarchies.
Characteristics:
 Flexible Structure: Data may not conform to a rigid schema but has some organizational
patterns.
 Hierarchical Organization: Often uses tags or metadata to define relationships between
elements.
 Partial Schema: May have some structure (e.g., XML or JSON format) but not as strictly
enforced as in structured data.
Examples:
 Documents: XML files, JSON files.
 Emails: Header information (e.g., sender, date) and body content.
 Web Data: HTML files, RSS feeds.
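The flexible structure described above shows up when records in the same JSON document carry different fields. The following sketch, with made-up records and a hypothetical flatten helper, reads such data and maps it onto fixed columns while tolerating missing keys.

```python
import json

# Two records with the same core fields but different optional ones.
raw = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ben", "tags": ["vip"]}
]
"""
records = json.loads(raw)

def flatten(record):
    """Map a nested, partially filled record onto a fixed set of columns."""
    return {
        "id": record["id"],
        "name": record["name"],
        # Nested and optional: fall back to None when absent.
        "email": record.get("contact", {}).get("email"),
        # Optional list: join into one string, empty when absent.
        "tags": ",".join(record.get("tags", [])),
    }

rows = [flatten(r) for r in records]
for row in rows:
    print(row)
```

The keys act as the tags or markers mentioned above: they label each element, yet nothing forces every record to contain the same set.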
3. Unstructured Data
Definition: Unstructured data lacks a predefined format or structure, making it more challenging to
process and analyze. It often consists of free-form text or other data types that do not fit neatly into
traditional databases.
Characteristics:
 No Fixed Schema: Does not have a specific structure or schema.
 Varied Formats: Can include various data formats and types.
 Complex Analysis: Requires advanced techniques such as natural language processing (NLP)
to analyze.
Examples:
 Text: Social media posts, articles, books.
 Multimedia: Videos, images, audio files.
 Free-form Data: Customer feedback, survey responses.
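Unstructured text has to be given some structure before it can be analyzed. The sketch below counts terms in a couple of invented customer reviews. It is a very simple stand-in for real natural language processing, and the stopword list is an assumption made for this example.

```python
import re
from collections import Counter

# Hypothetical free-form customer feedback.
reviews = [
    "Great battery life, but the screen is dim.",
    "Battery drains fast. Screen is great though!",
]

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "but", "a", "though"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

counts = Counter(word for review in reviews for word in tokenize(review))
top_terms = counts.most_common(3)
print(top_terms)
```

Even this basic counting turns free text into a table of terms and frequencies, which can then be handled like structured data.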
Chart for Data Classification
Here’s a simple chart summarizing the classification of data:
Data Type | Structure | Schema | Examples
Structured | Highly organized | Fixed schema | SQL databases, Excel spreadsheets
Semi-Structured | Organized but flexible | Partial schema | XML files, JSON files, emails
Unstructured | No predefined structure | None | Text documents, images, videos
Characteristics of data
1. Volume
Definition: The amount of data being processed and analyzed.
Details:
 Big Data: Refers to extremely large data sets that can be analyzed computationally to reveal
patterns, trends, and associations.
 Data Storage: Increasing volume necessitates scalable storage solutions such as cloud storage
or distributed databases.
 Processing Challenges: Large volumes of data require efficient processing power and
optimized algorithms.
Examples:
 Social media platforms generating terabytes of data daily.
 Transaction logs from e-commerce websites.
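One common way to cope with volume on a single machine is to process data in chunks instead of loading everything at once. The sketch below simulates a transaction log with an in-memory CSV of 1000 rows. The column names, the chunk size, and the chunked helper are assumptions for illustration.

```python
import csv
import io
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Stand-in for a large transaction log; a real one would be read from disk.
log = io.StringIO(
    "order_id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 1001))
)
reader = csv.DictReader(log)

total, rows_seen = 0.0, 0
for chunk in chunked(reader, 100):
    # Only 100 rows are held in memory at any time.
    total += sum(float(row["amount"]) for row in chunk)
    rows_seen += len(chunk)

print(rows_seen, total)
```

The same idea, applied across many machines, is what distributed storage and processing systems do at much larger scale.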
2. Velocity
Definition: The speed at which data is generated, processed, and analyzed.
Details:
 Real-time Data: Data that is processed and analyzed as soon as it is generated. Important for
time-sensitive applications such as stock trading or fraud detection.
 Batch Processing: Data collected over a period and processed at intervals, such as daily or
weekly reports.
 Streaming Data: Continuous flow of data, requiring real-time processing capabilities.
Examples:
 Real-time analytics for financial transactions.
 Live streaming data from IoT devices.
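The difference between batch and streaming processing can be sketched with a simple average. In the batch version all readings are available first. In the streaming version the mean is updated as each reading arrives, without storing earlier values. The sensor readings are made-up numbers.

```python
def batch_average(readings):
    """Batch style: all data is collected first, then processed in one pass."""
    return sum(readings) / len(readings)

def streaming_average(stream):
    """Streaming style: update the running mean as each value arrives."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean

# Hypothetical temperature readings from an IoT sensor.
readings = [20.0, 22.0, 21.0, 25.0]

batch_result = batch_average(readings)
running = list(streaming_average(iter(readings)))
print(batch_result, running)
```

The last running value equals the batch result, but the streaming version has a usable answer after every reading, which is what time-sensitive applications need.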
3. Variety
Definition: The different types of data and sources of data.
Details:
 Structured Data: Organized in a tabular format, such as databases and spreadsheets.
 Semi-Structured Data: Includes data with some organizational properties but not rigidly
structured, like XML or JSON files.
 Unstructured Data: Includes free-form text, images, videos, and other data types without a
specific format.
Examples:
 Structured: Customer databases, sales records.
 Semi-Structured: JSON data from web APIs, XML files.
 Unstructured: Customer reviews, social media posts, multimedia content.
4. Veracity
Definition: The trustworthiness and accuracy of the data.
Details:
 Data Quality: Includes accuracy, completeness, consistency, and reliability of data.
 Data Integrity: Ensuring data is accurate and unaltered during collection and processing.
 Handling Uncertainty: Managing data with inherent uncertainty or errors, and implementing
validation and cleansing processes.
Examples:
 Data verification processes in data entry.
 Handling missing or erroneous values in datasets.
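Data quality can be measured before any cleaning is done. The sketch below reports, for each field, how complete it is and what share of the present values pass a validity rule. The records, the rules, and the quality_report function are hypothetical examples, not a standard.

```python
def quality_report(rows, required_fields, validators):
    """Return completeness and validity ratios for each required field."""
    total = len(rows)
    report = {}
    for field in required_fields:
        # A value counts as present when it is neither None nor an empty string.
        present = [r[field] for r in rows if r.get(field) not in (None, "")]
        check = validators.get(field, lambda value: True)
        valid = [v for v in present if check(v)]
        report[field] = {
            "completeness": len(present) / total,
            "validity": len(valid) / len(present) if present else 0.0,
        }
    return report

# Hypothetical records with typical problems.
records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},   # impossible age
    {"age": None, "email": "not-an-email"},  # missing age, malformed email
    {"age": 51, "email": ""},                # missing email
]

# Illustrative rules only; real validation is usually stricter.
validators = {
    "age": lambda v: 0 <= v <= 120,
    "email": lambda v: "@" in v,
}

report = quality_report(records, ["age", "email"], validators)
print(report)
```

A report like this helps decide whether the data is trustworthy enough to analyze and which fields need cleansing first.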
5. Value
Definition: The usefulness and relevance of the data for decision-making.
Details:
 Insights Generation: Data must be relevant to the questions or problems being analyzed.
 Actionable Information: Data should help in making informed decisions and drive business
strategies.
 ROI: Evaluating the return on investment from data analytics initiatives.
Examples:
 Analyzing customer behavior to improve marketing strategies.
 Using sales data to forecast demand and optimize inventory.
6. Variability
Definition: The inconsistencies in the data and the way it changes over time.
Details:
 Data Consistency: Managing variability to ensure consistent data formats and definitions
across different sources.
 Dynamic Data: Data that changes frequently, requiring adaptive analytics solutions to handle
updates and modifications.
 Context Sensitivity: Adapting to changes in the context or environment affecting data.
Examples:
 Changes in user behavior patterns over time.
 Variability in financial market data.
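A frequent source of variability is the same value arriving in different formats from different sources. The sketch below normalizes dates to one ISO format. The list of known formats is an assumption, and treating slash-separated dates as day-first is a choice that would need to be confirmed for each source.

```python
from datetime import datetime

# Formats assumed to occur in the incoming data (day-first for "dd/mm/yyyy").
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

# The same day written three ways, plus one unusable value.
raw_dates = ["2024-03-05", "05/03/2024", "2024/03/05", "sometime in March"]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
```

Values that cannot be parsed come back as None so they can be reviewed instead of being silently guessed.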
7. Complexity
Definition: The intricacy and interrelatedness of the data.
Details:
 Data Integration: Combining data from multiple sources, often requiring sophisticated
integration and transformation techniques.
 Data Relationships: Understanding and managing complex relationships between different
data elements.
 Data Models: Utilizing complex data models to represent and analyze data relationships and
hierarchies.
Examples:
 Integrating customer data from CRM, ERP, and social media platforms.
 Analyzing interconnected data points in a supply chain.
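Data integration often comes down to joining records from different systems on a shared key. The sketch below combines hypothetical CRM customers with hypothetical ERP orders using customer_id as the key. The field names and the integrate function are assumptions for this example.

```python
# Hypothetical extracts from two systems.
crm_customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
erp_orders = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 1, "total": 100.0},
    {"customer_id": 3, "total": 75.0},  # no matching CRM record
]

def integrate(customers, orders):
    """Attach each customer's order total and list order keys with no customer."""
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["total"]
    profiles = [
        {**c, "order_total": totals.get(c["customer_id"], 0.0)} for c in customers
    ]
    known_ids = {c["customer_id"] for c in customers}
    unmatched = sorted(set(totals) - known_ids)
    return profiles, unmatched

profiles, unmatched = integrate(crm_customers, erp_orders)
print(profiles)
print(unmatched)
```

Reporting the unmatched keys matters as much as the join itself, because mismatches between systems are one of the main reasons integrated data becomes hard to trust.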
Summary
In data analytics, understanding these characteristics allows organizations to effectively manage and
leverage data for decision-making. Here’s a quick overview in a tabular format:
Characteristic | Definition | Details | Examples
Volume | Amount of data | Big data, storage, processing challenges | Social media data, e-commerce transactions
Velocity | Speed of data generation and processing | Real-time, batch, streaming data | Stock trading, IoT device data
Variety | Types and sources of data | Structured, semi-structured, unstructured data | Databases, JSON files, social media content
Veracity | Trustworthiness and accuracy of data | Data quality, integrity, handling uncertainty | Data verification, error handling
Value | Usefulness and relevance of data | Insights generation, actionable information, ROI | Marketing analytics, sales forecasting
Variability | Inconsistencies and changes in data | Data consistency, dynamic data, context sensitivity | Changing user behavior, financial data fluctuations
Complexity | Intricacy and interrelatedness of data | Data integration, relationships, data models | Supply chain data, integrated customer profiles
Understanding these characteristics helps in designing effective data analytics strategies and
choosing the right tools and methods for data processing and analysis.
