Unit 1 Notes
Data analytics is a field focused on examining raw data to uncover patterns, draw conclusions, and
support decision-making. It's a multidisciplinary area that combines statistical analysis, programming,
and domain expertise to make sense of data. Here’s a broad overview of the key components and
steps involved in data analytics:
1. Data Collection
Sources: Data can come from various sources such as databases, web servers, surveys, social
media, or sensors.
Types: Structured data (e.g., databases, spreadsheets) and unstructured data (e.g., text,
images).
2. Data Cleaning and Preparation
Cleaning: This involves handling missing values, correcting errors, and removing duplicates.
Transformation: Converting data into a usable format, which might include normalization,
aggregation, or encoding.
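The cleaning and transformation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the record layout (`name`, `age`), the choice to fill missing ages with the mean, and the helper names `clean_records` and `min_max_normalize` are all assumptions made for the example.

```python
def clean_records(records):
    """Drop exact duplicates, then fill missing ages with the mean of known ages."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["name"], rec["age"])
        if key not in seen:          # keep only the first copy of each row
            seen.add(key)
            unique.append(dict(rec))
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known) if known else None
    for r in unique:
        if r["age"] is None:         # mean imputation (an assumed strategy)
            r["age"] = mean_age
    return unique


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [
    {"name": "Ana", "age": 30},
    {"name": "Ana", "age": 30},    # duplicate row
    {"name": "Ben", "age": None},  # missing value
    {"name": "Cy", "age": 50},
]
cleaned = clean_records(raw)
scaled = min_max_normalize([r["age"] for r in cleaned])
```

Mean imputation is only one way to handle missing values. Dropping the row or using the median may suit other data sets better.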
3. Exploratory Data Analysis
Descriptive Statistics: Measures such as mean, median, mode, variance, and standard
deviation.
Visualization: Using charts (e.g., histograms, scatter plots) to understand data distributions
and relationships.
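As a small illustration of the descriptive measures and a histogram-style view, the sketch below uses Python's standard `statistics` module. The `scores` list is made-up sample data, and the population forms of variance and standard deviation are an assumed choice.

```python
import statistics
from collections import Counter

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "variance": statistics.pvariance(scores),  # population variance
    "std_dev": statistics.pstdev(scores),      # population standard deviation
}

# A rough text histogram: one '#' per occurrence of each value.
for value, count in sorted(Counter(scores).items()):
    print(f"{value:>2} | {'#' * count}")
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` would be the usual alternatives.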
4. Data Modeling
Machine Learning: Techniques for tasks such as classification, clustering, and prediction,
used to make data-driven decisions or forecasts.
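To make the idea of classification concrete, here is a minimal sketch of a 1-nearest-neighbor classifier. It is one simple technique among many, and the training points, labels, and the function name `nearest_neighbor_predict` are invented for illustration.

```python
import math


def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(train, key=lambda example: math.dist(example[0], point))
    return label


# Hypothetical labelled examples: (features, label)
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

prediction = nearest_neighbor_predict(train, (7.0, 9.0))
```

In practice a library such as scikit-learn would normally be used, and the model would be evaluated on data it was not trained on.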
7. Decision Making
Actionable Insights: Using the analysis to guide strategic decisions or operational changes.
Continuous Improvement: Iteratively refining models and approaches based on feedback
and new data.
Skills Required
Domain Knowledge: Insight into the specific industry or field of study to make data relevant
and actionable.
Data analytics is increasingly critical in fields such as business, healthcare, and finance, as
organizations strive to use their data for competitive advantage and improved outcomes.
Sources of Data
1. Internal Sources
o Spreadsheets: Excel, Google Sheets, often used for data entry and simple analysis.
o Logs and Records: System logs, transaction records, and historical data generated by
internal processes.
2. External Sources
o Public Data Sets: Government databases (e.g., census data), public health data, and
other freely available datasets.
o Social Media: Data from platforms like Twitter, Facebook, and Instagram, which can
be used for sentiment analysis, trends, and user behavior studies.
o Web Scraping: Data collected from websites using automated tools or scripts.
o Third-Party Data Providers: Commercial vendors that sell data such as market
research, consumer behavior, or economic indicators.
3. Real-Time Data
o Sensors and IoT: Data from connected devices and sensors used in manufacturing,
healthcare, and smart cities.
o Streaming Data: Live data from social media feeds, financial markets, or online
transactions.
4. Survey and Feedback Data
o Open-Ended Responses: Qualitative data from user feedback and open questions.
5. Experimental Data
o Clinical Trials: Data from controlled studies in medical and scientific research.
Nature of Data
1. Structured Data
o Definition: Highly organized and easily searchable data, typically in tabular format
(rows and columns).
o Characteristics: Well-defined data types and fields, easy to analyze with traditional
data tools and techniques.
2. Unstructured Data
o Examples: Text documents, emails, social media posts, images, and videos.
3. Semi-Structured Data
o Definition: Data that does not fit into a traditional row-column database but still has
some organizational properties.
4. Quantitative Data
5. Qualitative Data
6. Time-Series Data
o Examples: Stock prices, weather data, and website traffic over time.
7. Spatial Data
Understanding these sources and the nature of data helps in selecting appropriate analytical
techniques, tools, and methodologies for effective data analysis.
1. Structured Data
Definition: Structured data is highly organized and easily searchable in relational databases. It is
typically stored in tables with rows and columns, where each column has a specific data type and the
data is consistently formatted.
Characteristics:
Easily Searchable: Simple querying and data retrieval using SQL (Structured Query Language).
Fixed Schema: Requires a predefined schema to define the structure of the data.
Examples:
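A minimal sketch of a fixed schema and SQL querying, using Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row must supply these typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ana", "Lisbon"), (2, "Ben", "Oslo"), (3, "Cy", "Lisbon")],
)

# Because the structure is known, retrieval is a simple declarative query.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lisbon",)
).fetchall()
conn.close()
```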
2. Semi-Structured Data
Definition: Semi-structured data does not fit neatly into a table but still has some organizational
properties that make it easier to analyze than unstructured data. It often includes tags or markers to
separate data elements and enforce hierarchies.
Characteristics:
Flexible Structure: Data may not conform to a rigid schema but has some organizational
patterns.
Partial Schema: May have some structure (e.g., XML or JSON format) but not as strictly
enforced as in structured data.
Examples:
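The sketch below shows what a partial schema can look like in practice, using JSON and Python's standard `json` module. The records are invented. Note that the second record omits fields that the first one has.

```python
import json

raw = """
[
  {"id": 1, "name": "Ana", "tags": ["vip"], "contact": {"email": "ana@example.com"}},
  {"id": 2, "name": "Ben"}
]
"""
records = json.loads(raw)

# Fields may be missing, so access them defensively instead of assuming columns.
emails = [rec.get("contact", {}).get("email") for rec in records]

# The set of keys actually present across all records.
all_keys = sorted({key for rec in records for key in rec})
```

The keys act as the tags that separate data elements, and nesting (`contact` inside a record) expresses the hierarchy.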
3. Unstructured Data
Definition: Unstructured data lacks a predefined format or structure, making it more challenging to
process and analyze. It often consists of free-form text or other data types that do not fit neatly into
traditional databases.
Characteristics:
Complex Analysis: Requires advanced techniques such as natural language processing (NLP)
to analyze.
Examples:
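As a very small taste of text analysis, the sketch below counts word frequencies in free-form text. Real NLP pipelines go much further (stop-word removal, stemming, sentiment models). The sample review and the function name `word_frequencies` are made up for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


review = "Great phone. Great battery, but the screen is not great."
freq = word_frequencies(review)
top_word, top_count = freq.most_common(1)[0]
```

The count alone would suggest a positive review, yet the last clause is negative, which is one reason unstructured data needs more advanced techniques.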
Semi-Structured Data: organized but flexible; partial schema; examples include XML files, JSON
files, and emails.
Characteristics of data
1. Volume
Details:
Big Data: Refers to extremely large data sets that can be analyzed computationally to reveal
patterns, trends, and associations.
Data Storage: Increasing volume necessitates scalable storage solutions such as cloud storage
or distributed databases.
Processing Challenges: Large volumes of data require efficient processing power and
optimized algorithms.
Examples:
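One common way to cope with volume is to process data in fixed-size chunks rather than loading everything into memory at once. The sketch below illustrates the idea with a generator. The chunk size and the helper names `chunked` and `streaming_total` are assumptions for the example.

```python
def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


def streaming_total(values, chunk_size=1000):
    """Sum and count values while holding only one chunk in memory."""
    total = 0
    count = 0
    for chunk in chunked(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total, count


total, count = streaming_total(range(1, 10_001), chunk_size=1000)
```

Distributed frameworks apply the same principle across many machines instead of within one process.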
2. Velocity
Details:
Real-time Data: Data that is processed and analyzed as soon as it is generated. Important for
time-sensitive applications such as stock trading or fraud detection.
Batch Processing: Data collected over a period and processed at intervals, such as daily or
weekly reports.
Examples:
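The difference between real-time and batch processing can be sketched with a running mean that is updated as each value arrives, compared with a mean computed once over the whole collected batch. The price values and the class name `RunningMean` are illustrative assumptions.

```python
class RunningMean:
    """Incrementally updated mean: one cheap update per incoming value."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


prices = [100.0, 102.0, 101.0, 105.0]  # hypothetical price ticks

# Real-time style: a result is available after every tick.
stream = RunningMean()
running = [stream.update(p) for p in prices]

# Batch style: wait for all the data, then compute once.
batch_mean = sum(prices) / len(prices)
```

Both approaches end at the same value. The streaming version simply makes intermediate results available without storing the full history.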
3. Variety
Details:
Structured Data: Organized in a tabular format, such as databases and spreadsheets.
Semi-Structured Data: Includes data with some organizational properties but not rigidly
structured, like XML or JSON files.
Unstructured Data: Includes free-form text, images, videos, and other data types without a
specific format.
Examples:
4. Veracity
Details:
Data Integrity: Ensuring data is accurate and unaltered during collection and processing.
Handling Uncertainty: Managing data with inherent uncertainty or errors, and implementing
validation and cleansing processes.
Examples:
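A validation step can be as simple as a set of rules applied to each record before it enters the analysis. The sketch below is illustrative only: the field names (`sensor_id`, `temperature_c`) and the plausible temperature range are assumptions, not values from these notes.

```python
def validate_reading(reading):
    """Return a list of problems found in one hypothetical sensor reading."""
    problems = []
    if not reading.get("sensor_id"):
        problems.append("missing sensor_id")
    temp = reading.get("temperature_c")
    if not isinstance(temp, (int, float)):
        problems.append("temperature is not numeric")
    elif not -50 <= temp <= 60:  # assumed plausible range
        problems.append("temperature out of plausible range")
    return problems


readings = [
    {"sensor_id": "s1", "temperature_c": 21.5},
    {"sensor_id": "", "temperature_c": 19.0},
    {"sensor_id": "s3", "temperature_c": 400},
    {"sensor_id": "s4", "temperature_c": "n/a"},
]

valid = [r for r in readings if not validate_reading(r)]
rejected = {
    r["sensor_id"] or "<unknown>": validate_reading(r)
    for r in readings
    if validate_reading(r)
}
```

Keeping the rejected records together with their reasons makes it easier to trace quality problems back to their source.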
5. Value
Details:
Insights Generation: Data must be relevant to the questions or problems being analyzed.
Actionable Information: Data should support informed decisions and help drive business
strategies.
Examples:
6. Variability
Definition: The inconsistencies in the data and the way it changes over time.
Details:
Data Consistency: Managing variability to ensure consistent data formats and definitions
across different sources.
Dynamic Data: Data that changes frequently, requiring adaptive analytics solutions to handle
updates and modifications.
Examples:
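One everyday form of variability is the same field arriving in different formats from different sources. The sketch below normalizes dates to ISO 8601 using the standard `datetime` module. The list of accepted formats is an assumption and would need to match the real sources.

```python
from datetime import datetime

# Formats assumed to occur across the (hypothetical) source systems.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


raw_dates = ["2024-03-05", "05/03/2024", "20240305", "not a date"]
normalized = [to_iso_date(d) for d in raw_dates]
```

Returning `None` for unrecognized values, rather than guessing, keeps inconsistent inputs visible for later review.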
7. Complexity
Details:
Data Integration: Combining data from multiple sources, often requiring sophisticated
integration and transformation techniques.
Data Models: Utilizing complex data models to represent and analyze data relationships and
hierarchies.
Examples:
Integrating customer data from CRM, ERP, and social media platforms.
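The integration example above can be sketched as merging per-customer records from several sources on a shared identifier. The data, the assumption that all three systems share one customer ID, and the function name `integrate` are hypothetical. In practice, matching identities across systems is often the hardest part.

```python
crm = {"c1": {"name": "Ana"}, "c2": {"name": "Ben"}}
erp = {"c1": {"orders": 3}, "c3": {"orders": 1}}
social = {"c1": {"mentions": 12}, "c2": {"mentions": 4}}


def integrate(*sources):
    """Merge dictionaries keyed by customer ID into one profile per customer."""
    merged = {}
    for source in sources:
        for customer_id, fields in source.items():
            merged.setdefault(customer_id, {}).update(fields)
    return merged


profiles = integrate(crm, erp, social)
```

Customers that appear in only some sources (such as `c3`) end up with partial profiles, which is a common outcome of this kind of outer-join-style merge.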
Summary
In data analytics, understanding these characteristics allows organizations to effectively manage and
leverage data for decision-making. Here’s a quick overview in a tabular format:
Characteristic | Definition | Details | Examples
Velocity | Speed of data generation and processing | Real-time, batch, streaming data | Stock trading, IoT device data
Understanding these characteristics helps in designing effective data analytics strategies and
choosing the right tools and methods for data processing and analysis.