IoT UNIT-II

IoT Data Analytics involves collecting and analyzing data from interconnected devices to derive actionable insights, with core types including Descriptive, Diagnostic, Predictive, and Prescriptive Analytics. Extended types such as Operational, Behavioral, and Anomaly Detection further enhance the analysis capabilities. Additionally, tools like Amazon Kinesis and AWS Lambda facilitate real-time data processing and serverless computing for various use cases in IoT environments.

IoT Data Analytics and Its Types

1. What is IoT Data Analytics?


IoT Data Analytics is the process of collecting, processing, storing, and analyzing data generated by
interconnected devices and sensors. This analysis helps derive actionable insights and supports decision-
making. A typical architecture includes:

• Data acquisition (from devices and metadata),


• Storage and processing (batch or streaming),
• Analytics (various types),
• Visualization or downstream actions (dashboards, APIs)

Core Types of IoT Analytics:

a) Descriptive Analytics
Purpose: Answers “What happened?” by summarizing historical data.
Techniques: Time-series analysis, trend charts, aggregated reports.
Example: Analyzing temperature patterns over a year in smart buildings to optimize HVAC scheduling.
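Illustration (not part of the original notes): a minimal descriptive-analytics sketch in pandas that summarizes hypothetical hourly temperature readings into monthly averages; the file name and column names are assumptions.

    import pandas as pd

    # Load hypothetical hourly temperature readings (columns: timestamp, temperature_c)
    df = pd.read_csv("building_temps.csv", parse_dates=["timestamp"])

    # Summarize "what happened": average temperature per calendar month
    monthly_avg = (
        df.set_index("timestamp")["temperature_c"]
          .resample("M")   # group readings by month
          .mean()
    )
    print(monthly_avg)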

b) Diagnostic Analytics
Purpose: Explains “Why did it happen?” by identifying root causes.
Methods: Correlation, event tracing, anomaly detection.
Example: Diagnosing machine failure in manufacturing using sensor data like vibration or temperature.



c) Predictive Analytics
Purpose: Forecasts “What will happen?” using past and present data.
Tools: Machine learning models like regression, classification.
Example: Predicting equipment failure in heavy machinery to enable pre-scheduled maintenance.

d) Prescriptive Analytics
Purpose: Answers “What should we do?” by recommending actions.
Techniques: Optimization algorithms, decision engines.
Example: Real-time flight rerouting based on current atmospheric data.

Extended Types of Analytics:


1. Operational Analytics (Real-Time or Continuous Analytics)

• Real-time data analysis that supports immediate decision-making.

• It focuses on what is happening right now rather than just analysing the past.

Key Features:

• Fast data processing

• Continuous updates

• Event-driven actions

Examples:

• Dynamic pricing in e-commerce or airlines: Prices change based on current demand.

• Ride-hailing apps (like Uber): Analysing real-time location, traffic, and demand to match
drivers with riders instantly.

2. Behavioural Analytics

• Focuses on analysing user behaviours and patterns to understand how people interact with a
product, service, or system.

• Often used for personalization and customer experience optimization.

Key Features:

• Tracks actions over time (clicks, movements, purchases)

• Builds profiles for segmentation and targeting

Examples:

• Retail RFID tracking: Sensors track how long a customer spends near a product; the system
can suggest promotions or layout improvements.

• Streaming platforms: Recommending shows based on viewing habits.



3. Anomaly Detection

• A form of analytics that identifies unusual patterns or outliers that deviate from the norm.

• Helps in early detection of problems or security breaches.

Key Features:

• Pattern recognition

• Alert systems

• Often uses machine learning models

Examples:

• Pipeline monitoring: Detects pressure or flow irregularities indicating a potential leak.

• Financial services: Detects suspicious transactions to prevent fraud.
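Illustration (an assumption, not taken from the notes): a very simple statistical outlier check using z-scores; real deployments would typically use trained machine learning models instead.

    import statistics

    def find_anomalies(readings, threshold=2.0):
        """Return readings lying more than `threshold` standard deviations from the mean."""
        mean = statistics.mean(readings)
        stdev = statistics.stdev(readings)
        return [x for x in readings if abs(x - mean) > threshold * stdev]

    # Example: pipeline pressure samples with one suspicious spike
    pressures = [101.2, 100.8, 101.0, 101.3, 100.9, 175.4, 101.1]
    print(find_anomalies(pressures))   # -> [175.4]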

Use Case Examples by Analytics Type:


Analytics Type Real‑World Use Case
Descriptive Smart thermostats mapping temperature trends
Diagnostic Understanding machine downtime via sensor data
Predictive Forecasting fleet vehicle maintenance
Prescriptive Airplane route adjustments during turbulence
Operational Dynamic fare pricing in ride-hailing apps
Behavioral In‑store shopper tracking for product recommendations
Anomaly Detection Leak detection in oil pipelines

2. Designing Data Processing for Analytics:


a. Amazon Kinesis:

Amazon Kinesis is a fully managed, real-time data streaming service offered by AWS that lets you collect,
process, and analyze large streams of data as they are generated. It's ideal for use cases that require real-time
insights, such as analytics, machine learning, IoT monitoring, and live dashboards.

How It Works:

Kinesis processes real-time data streams in four stages:

1. Data ingestion: Producers (apps, sensors, logs, IoT, video/audio streams) send data in formats
like JSON or binary into Kinesis.

2. Sharding and scaling


Data is split into shards—individual capacity units.

▪ One shard supports up to 1,000 records/sec or 1 MiB/sec for writes, and up to 2 MiB/sec for reads.

▪ Scaling is horizontal: add more shards to handle volume.



3. Processing and buffering
Data flows through buffers for aggregation, filtering, or transformation before storage or
delivery.

4. Making data accessible


After preparation, data is consumed via:

▪ Data Streams API (custom consumers)

▪ Kinesis Firehose (delivery to AWS destinations)

▪ Kinesis Data Analytics (live SQL or Flink processing)

Core Services in Amazon Kinesis:

1. Kinesis Data Streams (KDS)

• Ingest high-throughput streaming data.

• You build custom applications (producers & consumers).

• Example: monitor stock trading data for anomalies.
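Hedged sketch (the stream name, region, and payload are assumptions): a producer can push a record into a Kinesis data stream with boto3 as follows.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"device_id": "sensor-42", "temperature": 71.3}

    # PutRecord writes one record; the partition key decides which shard receives it
    kinesis.put_record(
        StreamName="iot-telemetry",                    # assumed stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["device_id"],
    )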

2. Kinesis Video Streams

• Ingest, store, and process video/audio streams.

• Supports ML applications like object detection.

• Example: monitor live security footage for motion detection.

3. Kinesis Data Firehose

• Delivers data automatically to storage destinations like S3, Redshift, OpenSearch.

• No manual provisioning; fully serverless.

• Example: collect and store website clickstream data in S3.


4. Kinesis Data Analytics

• Analyse streaming data using SQL or Apache Flink.

• Real-time ETL and pattern detection.

• Example: detect fraud in payment streams using Flink.

Key Features & Examples:

Cost-Efficiency

• Pay-as-you-go pricing per shard, data volume, and processing—no upfront cost.

Global Availability & Durability

• Data is replicated across AZs. Streams retention defaults to 24 hrs (extendable to 7 days).

Real-Time Processing

• Enables immediate actions on incoming data—perfect for monitoring, alerting, dashboards.

Examples:

1. IoT Sensor Monitoring

▪ Sensors publish telemetry (temperature, pressure, vibration) to Kinesis Data Streams.

▪ Kinesis Data Analytics runs anomaly-detection SQL jobs.

▪ Firehose loads results into S3 or Redshift for long-term storage and BI.

2. Clickstream Aggregation

▪ Web/app events are ingested into Data Streams.

▪ Firehose batches and delivers the data to OpenSearch, enabling real-time dashboarding.

3. Live Video Surveillance

▪ Cameras stream into Kinesis Video Streams.

▪ ML models (e.g., via SageMaker) analyse video for activity recognition or object
detection in real-time.

4. Fraud Detection in Finance

▪ Transaction logs flow into Kinesis Data Streams.

▪ Analytics or custom consumers detect suspicious patterns.

▪ Automatic alerts are triggered via Lambda or SNS.


b. AWS Lambda:

AWS Lambda is a powerful serverless computing service that automatically runs code in response to
events, without requiring you to manage the underlying infrastructure. It supports event-driven applications
triggered by events such as HTTP requests, DynamoDB table updates, or state transitions. You simply upload
your code (as a .zip file or container image), and Lambda handles everything from provisioning to scaling and
maintenance. It automatically scales applications based on traffic, handling server management, auto-scaling,
security patching, and monitoring. AWS Lambda is ideal for developers who want to focus on writing code
without worrying about infrastructure management.

AWS Lambda functions are serverless compute functions fully managed by AWS, so developers can run their
code without worrying about servers or provisioning.

Once you upload your source code to Lambda (for example as a ZIP file), Lambda runs it automatically,
without you provisioning servers, and scales your functions up or down based on demand. Lambda is mostly
used for event-driven applications, such as processing data from Amazon S3 buckets or responding to HTTP
requests.

Example:

1. Process data from Amazon S3 buckets.

2. Respond to HTTP requests.

3. Build serverless applications.

Use Cases of AWS Lambda Functions:

1. Real-Time File Processing

Trigger: S3 bucket events


Use Case: Automatically process uploaded files (e.g., resize images, transcode videos, extract metadata).
Example: Upload an image to S3 → Lambda resizes it → stores thumbnails in another S3 bucket.
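Minimal handler sketch for this trigger (the destination bucket and the processing step are assumptions; actual image resizing would need an imaging library packaged with the function):

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Each S3 event can carry one or more object-created records
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Placeholder for real processing (resize, transcode, extract metadata, ...)
            head = s3.head_object(Bucket=bucket, Key=key)
            print(f"Received {key} ({head['ContentLength']} bytes) from {bucket}")

            # Copy the (processed) object into a second, assumed bucket
            s3.copy_object(
                Bucket="my-thumbnails-bucket",
                Key=f"thumbnails/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
        return {"status": "ok"}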

2. Serverless Web Backends (REST APIs)

Trigger: Amazon API Gateway


Use Case: Handle API requests without managing servers.
Example: A Lambda function processes user registration forms, stores data in DynamoDB, and returns a
JSON response.



3. Real-Time Stream Processing

Trigger: Kinesis or DynamoDB Streams


Use Case: Analyse or react to real-time data (e.g., logs, metrics, clickstreams).
Example: Process IoT sensor data streaming into Kinesis to detect anomalies or outliers.
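Hedged sketch of such a handler (the threshold and field names are assumptions); Lambda delivers Kinesis record payloads base64-encoded.

    import base64
    import json

    def lambda_handler(event, context):
        alerts = []
        for record in event["Records"]:
            # Kinesis record payloads arrive base64-encoded
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

            # Assumed anomaly rule: flag readings above a fixed threshold
            if payload.get("temperature", 0) > 80:
                alerts.append(payload)

        print(f"Processed {len(event['Records'])} records, {len(alerts)} anomalies")
        return {"anomalies": alerts}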

4. Scheduled Tasks

Trigger: Amazon EventBridge (CloudWatch Events)


Use Case: Run tasks on a schedule—hourly, daily, or monthly.
Example: A Lambda function cleans up old database records every night at midnight.

5. Real-Time Notifications & Alerts

Trigger: SNS, S3, CloudWatch


Use Case: Send emails, SMS, or Slack messages on certain events.
Example: Trigger Lambda when CPU usage is too high → send Slack alert to DevOps team.

6. Authentication and Authorization Logic

Trigger: Amazon Cognito or API Gateway Lambda Authorizers


Use Case: Custom user authentication or role-based access control.
Example: Verify JWT (JSON Web Tokens) in a Lambda function before granting access to protected API
routes.

7. Data Transformation and ETL

Trigger: S3, Kinesis, or DynamoDB


Use Case: Transform raw data before storing it in a data lake or data warehouse.
Example: Lambda filters, normalizes, and enriches event logs before saving them to Amazon Redshift.

8. Infrastructure Automation

Trigger: CloudFormation Custom Resources or EventBridge


Use Case: Automatically configure or clean up AWS infrastructure.
Example: When a new EC2 instance is launched, a Lambda function adds it to a monitoring system.

9. Real-Time Video/Audio Analysis

Trigger: Amazon Kinesis Video Streams or S3


Use Case: Analyse live or recorded streams using ML models.
Example: Lambda triggers a Rekognition call to detect faces in uploaded surveillance video.

10. Cost Optimization

Trigger: CloudWatch or Billing Alarms


Use Case: Monitor usage and automatically shut down unused resources.
Example: When monthly spend crosses a threshold, Lambda sends alerts or stops dev EC2 instances.



Features of AWS Lambda Functions:

• Event-Driven Execution: Automatically runs in response to AWS events like S3 uploads, API calls, etc. Example: run code when a file is uploaded to S3.

• Pay-As-You-Go Pricing: Charged only for compute time (per ms), with no idle charges. Example: cost-efficient for occasional or burst workloads.

• Fully Managed Infrastructure: No need to manage servers, scaling, or maintenance. Example: ideal for developers without DevOps overhead.

• Multi-Language Support: Supports Node.js, Python, Java, Go, .NET, Ruby, and custom runtimes. Example: use Python for ETL, Node.js for APIs, Java for business logic.

• Automatic Scaling: Instantly scales based on incoming traffic or event volume. Example: handle 1 or 1 million API requests automatically.

• Stateless & Short-Lived: Functions are stateless and run for up to 15 minutes per invocation. Example: quick data transformation or job execution.

• Deep AWS Integration: Seamless integration with AWS services like S3, DynamoDB, API Gateway, etc. Example: automatically store processed data into DynamoDB.

• Security via IAM & VPC: IAM roles control access; supports private VPC networking. Example: limit DB access to only your Lambda in a private subnet.

• Environment Variables: Pass runtime config without changing code. Example: set DB credentials or feature flags via environment variables.

• Lambda Layers: Share and reuse common code or dependencies across functions. Example: centralize utility libraries across multiple functions.

• Scheduled Execution: Run functions on a schedule using EventBridge/CloudWatch rules. Example: nightly data clean-up or report generation.

• Versioning & Aliases: Manage versions and use aliases for dev/stage/prod environments. Example: roll back to a stable version instantly.

• Monitoring & Logging: Built-in logging and metrics via Amazon CloudWatch. Example: view execution logs, error rates, and performance metrics.

• Function URLs / HTTP Access: Directly expose functions via HTTPS without using API Gateway. Example: create simple webhooks or test endpoints.

c. Amazon Athena:

AWS Athena is a serverless, interactive query service provided by Amazon Web Services. It lets you use
standard SQL to analyse structured and semi-structured data stored in Amazon S3 — without the need for
setting up servers, databases, or ETL pipelines.

Key Advantages:

• Query data directly in S3 using SQL

• No servers to manage – fully serverless

• No ETL required – query raw data directly

• Scales automatically from gigabytes to petabytes

• Pay-per-query model – you pay only for the data scanned



Use Cases:

• Ad-hoc data analysis

• Log analytics

• Quick prototyping of dashboards

• Exploring large datasets without moving them into databases
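Sketch of how such an ad-hoc query might be issued programmatically (the database, table, partition column, and S3 output location are assumptions): boto3 can start an Athena query and point its results at an S3 prefix.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Standard SQL over data that already sits in S3; no cluster or ETL required
    response = athena.start_query_execution(
        QueryString="""
            SELECT device_id, avg(temperature) AS avg_temp
            FROM iot_telemetry                -- assumed external table over S3 data
            WHERE dt = '2024-01-15'           -- assumed partition column
            GROUP BY device_id
        """,
        QueryExecutionContext={"Database": "iot_analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])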

Features of AWS Athena:

1. Serverless Architecture

• No infrastructure management: No need to provision or maintain servers.

• Athena auto-scales based on the size and complexity of your query.

• Suitable for unpredictable workloads, exploratory analysis, and cost-sensitive projects.

• Cost model: You pay only for the amount of data scanned by your queries.

2. Integration with AWS Glue & Other Services

• AWS Glue Data Catalog: Stores metadata like table definitions, schemas, and locations.

▪ Helps Athena discover and manage datasets.

▪ Supports schema evolution and version control.



• Automatic schema detection: Glue can crawl data and detect schema for Athena to use.

• Tight integration with other AWS tools:

▪ Amazon QuickSight – for data visualization.

▪ AWS CloudTrail – for auditing and security tracking.

▪ Amazon S3 – as the primary data storage layer.

3. Support for Standard SQL

• Athena uses Presto, a distributed SQL engine.

• Supports:

▪ Joins, window functions, aggregates

▪ Complex types like arrays and maps

• Great for teams already familiar with SQL—no need to learn a new query language.

4. Support for Various Data Formats

Athena can process many data formats:

• CSV, JSON – Simple row-based formats.

• Avro, Parquet, ORC – Columnar formats (more efficient).

▪ Columnar formats help reduce the amount of data scanned, which:

▪ Improves performance

▪ Lowers costs

5. Scalability, Partitioning, and Performance

• Scalability: Athena handles datasets of any size, automatically running queries in parallel.

• Data Partitioning:

▪ Data can be partitioned in S3 (e.g., by date, region, etc.)

▪ Athena only scans relevant partitions, which:

▪ Speeds up queries

▪ Reduces scanned data volume, saving cost

• Performance tips:

▪ Use Parquet/ORC

▪ Filter with WHERE clauses

▪ Limit fields to reduce scanned data


6. Security and Compliance Features

• IAM Integration:

▪ Define fine-grained permissions for users, roles, and groups.

▪ Control access to specific S3 buckets or tables.

• Encryption: Supports encryption at rest (using KMS, SSE-S3) and in transit (SSL/TLS)


• Compliance: Athena complies with major security and privacy standards (HIPAA, GDPR, etc.)

Comparison of Amazon Athena, Amazon Redshift, Microsoft SQL Server, and AWS Glue:

• Description
  ▪ Athena: Serverless, interactive query service
  ▪ Redshift: Data warehousing platform
  ▪ Microsoft SQL Server: Relational Database Management System (RDBMS)
  ▪ Glue: Serverless data integration & ETL service

• Use Cases
  ▪ Athena: Ad-hoc querying of S3 data; big data analytics
  ▪ Redshift: Fast querying of large, structured datasets using clusters
  ▪ Microsoft SQL Server: BI, analytics, transaction processing; all SQL operations
  ▪ Glue: Creating, editing, retrieving tables; batch ELT and streaming processing

• Data Types Supported
  ▪ Athena: Structured, semi-structured
  ▪ Redshift: Structured, semi-structured
  ▪ Microsoft SQL Server: Structured
  ▪ Glue: Structured, semi-structured, unstructured

• File Formats Supported
  ▪ Athena: CSV, TSV, JSON, Parquet, ORC, Avro, Apache logs, custom text files
  ▪ Redshift: Supports conversion to columnar formats; Redshift-native table types
  ▪ Microsoft SQL Server: XML, non-XML
  ▪ Glue: CSV, JSON, ORC, Parquet, MS Excel

• Framework / SQL Engine
  ▪ Athena: Presto with ANSI SQL
  ▪ Redshift: PostgreSQL (8.0.2 compatible)
  ▪ Microsoft SQL Server: Transact-SQL
  ▪ Glue: Python/Scala-based ETL scripting engine

• S3 Integration
  ▪ Athena: Queries S3 directly; no ETL; creates external tables
  ▪ Redshift: Requires ETL & cluster setup before querying
  ▪ Microsoft SQL Server: Not natively integrated with S3
  ▪ Glue: Crawls & catalogs S3 data; helps prepare data for Athena

• Startup Time
  ▪ Athena: Instant (within seconds)
  ▪ Redshift: Takes 15–60 minutes to start up
  ▪ Microsoft SQL Server: Instant (on pre-configured instances)
  ▪ Glue: Varies by job type

• Pricing Model
  ▪ Athena: $5 per TB of data scanned
  ▪ Redshift: Charged for both compute and storage
  ▪ Microsoft SQL Server: Free (Express) to $15,123 (Enterprise edition)
  ▪ Glue: $0.44 per DPU-hour (billed per second)

• ETL Requirement
  ▪ Athena: No ETL required
  ▪ Redshift: ETL required before querying
  ▪ Microsoft SQL Server: Typically used with ETL pipelines
  ▪ Glue: Full ETL support with job scheduling and transformation

• Ideal For
  ▪ Athena: On-demand SQL queries over S3 with minimal setup
  ▪ Redshift: High-performance analytics on large, structured datasets
  ▪ Microsoft SQL Server: Traditional DB applications and enterprise workloads
  ▪ Glue: Data discovery, preparation, and transformation



d. The AWS IoT platform:

A fully-managed, serverless platform from AWS that simplifies connecting IoT devices to the cloud. It
handles provisioning, authentication, secure communication, messaging, state management, and routing—
letting you focus on device logic instead of cloud infrastructure.

Core Components & How They Work:

1. FreeRTOS

• A real-time OS optimized for microcontrollers.

• Comes with AWS IoT libraries pre-integrated for secure device–cloud connectivity.

Example: A fitness-tracking bracelet using FreeRTOS can stream heart rate and steps directly to AWS with
library/kernel support.

2. Device Registry

• A centralized metadata store for tracking “Things” (physical devices): models, serials, firmware
versions, etc.

• Useful for organizing large fleets and facilitating OTA updates.

3. Device SDK

• Open-source SDKs (C++, Java, Python, JS, Android, iOS) for secure, bi-directional device-cloud
interactions.

• Handles certificate-based authentication, message formatting, and communication over MQTT,


HTTP, or WebSockets.



4. Authentication & Authorization

• Authentication: Uses X.509 certs/TLS to verify device identity and AWS endpoint legitimacy.

• Authorization: IoT Core policies (IAM-like JSON documents) define what devices can do—
publish, subscribe, invoke AWS services.

5. Device Gateway (Message Broker)

• A fully-managed MQTT/HTTP/WebSocket broker that handles publish/subscribe messaging at


scale (billions of devices).

• Decouples publishers and subscribers (e.g., a bracelet publishes to a topic like StartupX/smart-
bracelets/ModelX/bracelet7).
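Illustrative sketch only (endpoint, certificate paths, and client ID are assumptions): a device can publish to such a topic with a generic MQTT-over-TLS client using the paho-mqtt 1.x client API; in practice the AWS IoT Device SDK wraps these details.

    import json
    import ssl
    import paho.mqtt.client as mqtt

    client = mqtt.Client(client_id="bracelet7")

    # Mutual TLS: AWS IoT verifies the device via its X.509 certificate
    client.tls_set(
        ca_certs="AmazonRootCA1.pem",
        certfile="bracelet7-certificate.pem.crt",
        keyfile="bracelet7-private.pem.key",
        tls_version=ssl.PROTOCOL_TLS_CLIENT,
    )

    client.connect("xxxxxxxx-ats.iot.us-east-1.amazonaws.com", 8883)  # assumed endpoint
    client.loop_start()

    payload = {"heart_rate": 95, "steps_walked": 1000}
    client.publish("StartupX/smart-bracelets/ModelX/bracelet7", json.dumps(payload), qos=1)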

6. Rules Engine

• Processes inbound messages and triggers actions—e.g., store readings in DynamoDB, forward
to Kinesis, invoke Lambda, or persist to S3.

• Also supports third-party integrations (e.g., Kafka, Snowflake).

7. Device Shadow

• A JSON device state “shadow” in the cloud, storing desired and reported states.

• Enables offline state sync: updates persist to shadow, and devices retrieve missing changes
when they reconnect.

8. Jobs

• Supports remote orchestration: tasks like firmware updates, config pushes, and certificate
rotations across thing groups.

9. Thing Groups

• Groups of similar devices (static or dynamic) to collectively manage policies, firmware updates,
and jobs—ideal for large fleets.

10. Tunnels (Secure Tunnelling)

• Enables secure remote access to devices behind firewalls, without altering network
configurations—ideal for troubleshooting.

Extended IoT Services:

• AWS IoT Greengrass: Extends compute to the edge; run Lambda, ML inference on local
devices when offline.

• AWS IoT Device Defender: Continuously audits security policies, monitors device behavior,
and alerts on deviations.

• AWS IoT Device Management: Fleet-wide device registration, health monitoring, OTA
updates.



• AWS IoT Analytics: Cleans, enriches, and analyzes noisy device data; handles storage, filtering,
and long-term insights.

Device Shadow process:

This diagram illustrates how AWS IoT Core interacts with a connected wearable device (e.g., a fitness
tracker), its Device Shadow, and an application or user interface. The workflow demonstrates bi-
directional communication using AWS IoT services and the Device Shadow feature.

1. Device publishes current state: The IoT device (e.g., smartwatch) sends its current status:

"heart_rate": "95", "blood_pressure": "120/80", "steps_walked": "1000"

This data is published to AWS IoT Core using the Device SDK over MQTT or HTTP.

2. Persist to Data Store: AWS IoT receives this message and stores the data in a cloud data store (e.g.,
DynamoDB, S3) using the Rules Engine or Lambda triggers.

3. App requests device status: A mobile or web app queries AWS IoT to get the latest state of the device.

4. App requests status changes: The app may send a desired state (e.g., "enable sleep mode", or "increase
step goal").

5. Device Shadow syncs updated state: AWS IoT Device Shadow receives this "desired state" and syncs it
with the current reported state. When the device connects again, it checks the shadow and retrieves the
update.



6. Device publishes current state (again): After applying changes (e.g., user set new step goal), the device
publishes its updated state back to AWS IoT.

7. Device Shadow confirms state changes: AWS IoT Shadow service now confirms the new state and
updates the user/app interface with the most current status.
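To make the shadow document concrete, here is a hedged sketch (the thing name and values are assumptions; the topic follows the documented $aws/things/<thingName>/shadow/update pattern) showing the JSON a device or app would publish to the shadow over MQTT.

    import json

    THING_NAME = "bracelet7"                                   # assumed thing name
    SHADOW_UPDATE_TOPIC = f"$aws/things/{THING_NAME}/shadow/update"

    # A shadow update carries "reported" (device side) and/or "desired" (app side) state
    shadow_document = {
        "state": {
            "reported": {                                      # published by the device
                "heart_rate": 95,
                "blood_pressure": "120/80",
                "steps_walked": 1000,
            },
            "desired": {                                       # typically set by the app
                "step_goal": 8000,
            },
        }
    }

    # The device (or app) publishes this JSON to the shadow update topic over MQTT, e.g.
    # mqtt_client.publish(SHADOW_UPDATE_TOPIC, json.dumps(shadow_document), qos=1)
    print(SHADOW_UPDATE_TOPIC)
    print(json.dumps(shadow_document, indent=2))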

e. Microsoft Azure IoT Hub:

Azure IoT Hub, a cloud service provided by Microsoft, is a fully managed service that enables organizations to
manage, monitor, and control IoT devices. In addition, Azure IoT Hub enables reliable, secure bidirectional
communications between IoT devices and its cloud-based services. It allows developers to receive messages
from, and send messages to, IoT devices, acting as a central message hub for communication. It can also help
organizations make use of data obtained from IoT devices, transforming IoT data into actionable insights.

• Bidirectional Communication: Enables real-time two-way communication between devices and the cloud. Example: a sensor detects a fault → the cloud sends a corrective command to the device.

• Device-to-Cloud Telemetry: Devices send data (e.g., temperature, performance metrics) to the cloud for analysis and storage. Example: a manufacturing device sends temperature data → the cloud detects an overheating trend.

• Cloud-to-Device Commands: The cloud can send control commands to devices to change behavior or settings. Example: the cloud adjusts A/C settings in specific building zones to save energy.

• Device Twins: A digital twin of a device in the cloud storing state, config, and metadata. Example: monitor and update device settings remotely via the twin.

• Direct Methods: The cloud can invoke actions directly on devices, such as reboots or resets. Example: an admin sends a command to reboot a faulty sensor remotely.
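Hedged sketch of device-to-cloud telemetry using the Azure IoT Device SDK for Python (the connection string placeholders and payload are assumptions):

    import json
    from azure.iot.device import IoTHubDeviceClient, Message

    # Per-device connection string from the IoT Hub device registry (placeholder values)
    CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

    client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
    client.connect()

    # Device-to-cloud telemetry message
    telemetry = {"temperature": 78.5, "status": "overheating"}
    client.send_message(Message(json.dumps(telemetry)))

    client.shutdown()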



The above diagram presents a high-level architecture for an IoT solution using an IoT Hub, showing how
devices of various types connect to the cloud, and how data is processed in the backend. It is split into two
major sections:

1. Device Connectivity:

This section shows how different types of IoT devices connect to the cloud-based IoT Hub.

Types of Devices:

1. IP-capable devices

✓ These are advanced devices (like smart appliances or industrial machines) that can
connect directly to the cloud.

✓ They use protocols such as:

▪ AMQP (Advanced Message Queuing Protocol)

▪ MQTT (Message Queuing Telemetry Transport)

▪ HTTPS

✓ They use the IoT device library to securely send data directly to the IoT Hub.

2. Existing IoT devices

✓ These are devices already deployed, possibly using custom protocols.

✓ They may not directly support the required protocols for cloud connection.

✓ A protocol gateway (custom or IoT protocol gateway) is used to translate messages


(e.g., from MQTT to AMQP) before they reach the cloud.

3. Low-power devices

✓ Examples: BLE (Bluetooth Low Energy) sensors, ZigBee devices.

✓ These cannot connect to the cloud directly due to limitations like power or protocol
support.

✓ They send data locally to an IoT Field Gateway, which aggregates and forwards it to
the cloud using AMQP.

Optional Components:

• IoT protocol gateway


Used to bridge protocol gaps for legacy/custom devices.

• IoT field gateway


Acts as a bridge or edge device for low-power or non-IP devices.

2. Data Processing and Analytics (Cloud):

Once the data reaches the IoT Hub, it is managed by the IoT Solution Backend, which includes:



Key Backend Components:

1. Event-based device-to-cloud ingestion

✓ Handles streaming data from connected devices (like telemetry, sensor data).

✓ Can be processed in real time using services like AWS Lambda, Azure Functions, or
Apache Kafka.

2. Reliable cloud-to-device messaging

✓ Sends commands, configurations, or firmware updates from the cloud back to the
device.

✓ Ensures messages are delivered even when devices reconnect later.

3. Per-device authentication and secure connectivity

✓ Every device is authenticated individually (via certificates or tokens).

✓ Ensures end-to-end encryption and secure channel.

✓ Helps in isolating devices and revoking access if needed.

3. IoT Data Storage Approaches:


There are several technical approaches to store IoT data:

a. Edge Storage
Edge storage involves storing data on local devices or near the data source, rather than transmitting
it to a centralized data centre. This mitigates latency issues by processing data close to where it is
generated, reducing bandwidth usage on networks. Examples of use cases include manufacturing
plants and autonomous vehicles.
Example: Autonomous Vehicles
• Self-driving cars use local edge processors to instantly analyze data from cameras, LiDAR, and
radar.
• Decisions like braking or steering are made on the spot, without needing cloud access.
• Later, trip logs or training data can be uploaded to the cloud when parked.

b. Cloud Storage
Cloud storage is scalable and flexible, leveraging the cloud’s resources to store data remotely. This
allows IoT deployments to expand storage capacity as needed without investing in physical
infrastructure. However, relying solely on cloud storage can introduce latency issues due to data
having to travel from the IoT devices to the cloud. Data caching and choosing cloud data centers
located nearer to the data sources can help mitigate these latency problems.
Example: Smart Agriculture
• IoT sensors in large farms collect temperature, humidity, and soil moisture levels.
• Data is pushed to the cloud for central monitoring and long-term analysis.
• AI models in the cloud predict optimal irrigation schedules or detect crop disease trends.



c. Hybrid Storage
Hybrid storage combines the advantages of edge and cloud storage, allowing data to be stored and
processed both locally and in the cloud. This enables a balance between reducing latency and
leveraging the scalable storage and advanced analytics capabilities of the cloud. It is useful for local
decision-making, but where long-term data analysis can be offloaded to the cloud.

Example: Smart Cities (Traffic Management)


• Cameras and traffic lights use edge computing to detect congestion or accidents in real time.
• Local systems handle instant reactions like changing light signals.
• Cloud systems collect the data for daily traffic patterns and future planning.

Technologies Enabling IoT Storage:

Here are some of the technologies that support IoT storage:

• Database technologies: Databases support structured and semi-structured data storage for IoT
systems:

▪ Time-series databases, such as InfluxDB and TimescaleDB, are optimized for storing sequential data
generated by IoT devices. They offer efficient data compression and specialized query capabilities to
handle large volumes of timestamped data (a minimal write sketch follows this list).

▪ NoSQL databases like Cassandra and MongoDB provide flexibility, scalability, and high performance,
managing the varied data from IoT devices. They support a schema-less data model, allowing them
to handle different data types.

• File systems: Suitable for IoT environments requiring high throughput and low-latency data access, file
systems like ZFS or Btrfs provide features like data integrity checking and snapshot capabilities. This is
useful for IoT applications that may need to restore historical data states.

• Block storage systems: These ensure high-performance data access for IoT applications, especially for
real-time processing and analysis. They are useful for data requiring immediate storage and retrieval.
Examples include iSCSI and Fibre Channel.

• Object storage solutions: These can handle unstructured data, such as video and images, from devices
like surveillance cameras or drones. Solutions like Amazon S3 in the cloud and Cloudian for on-premises
storage offer scalability and durability. Users can store, retrieve, and manage data non-sequentially.

• Data warehouses: These provide a structured format for querying and analyzing data, suitable for
structured data in IoT scenarios where response times are important. They allow for complex queries
and reporting on IoT data that has been processed and normalized.

• Data lakes: These offer a more flexible environment suitable for storing raw, unstructured data from
IoT devices. Technologies like Hadoop or Azure Data Lake can handle large amounts of heterogeneous
IoT data, enabling later refining and analysis.
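To make the time-series option above concrete, here is a hedged write sketch using the InfluxDB 2.x Python client (the URL, token, org, bucket, and measurement names are assumptions):

    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    # Connection details are placeholders for a real InfluxDB 2.x instance
    client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="iot-org")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    # One timestamped sensor reading, tagged by device
    point = (
        Point("machine_telemetry")            # assumed measurement name
        .tag("device_id", "press-07")
        .field("vibration_mm_s", 4.2)
        .field("temperature_c", 61.5)
    )
    write_api.write(bucket="factory", record=point)
    client.close()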



IoT Storage Challenges and Solutions:

Here are some of the main challenges associated with storing IoT data and how to address them.

1. Data Volume and Scalability

Problem: IoT devices generate huge amounts of data that traditional storage systems can't handle easily.

Solution: Use cloud storage or distributed file systems that can grow as needed (called horizontal scaling) to
store and manage increasing data smoothly.

2. Real-Time Processing and Latency

Problem: Many IoT applications (like smart cars or health monitors) need data to be processed immediately.
Any delay (latency) can cause problems.

Solution:

• Use edge computing – process data near the source (like on the device) to reduce delay.
• Use fast storage tools like in-memory databases and caching to speed up data access.

3. Security and Privacy

Problem: IoT devices often collect private and sensitive data, making them targets for cyberattacks.

Solution:

• Encrypt data during storage and transmission.


• Use strong access controls, regular security checks, and real-time threat detection.
• Apply data anonymization and follow data protection rules (like GDPR) to ensure privacy.

4. Interoperability and Standards

Problem: Devices from different brands often don’t work well together due to lack of common standards.

Solution:

• Use common communication protocols like MQTT or CoAP.


• Build APIs to let different systems talk to each other, reducing data silos and improving efficiency.

What is Big Data?

Big Data refers to very large, fast, and complex sets of data that cannot be easily managed or processed
using traditional tools like spreadsheets or basic databases.

This data comes from multiple sources, such as:

• Social media (e.g., tweets, posts, likes)

• Sensors and IoT devices (e.g., smartwatches, temperature monitors)

• Online transactions



• Mobile apps and GPS systems

• Audio, video, images, and documents

Big data includes structured, semi-structured, and unstructured formats:

• Structured: Tables, databases (e.g., Excel, SQL)

• Semi-structured: XML, JSON

• Unstructured: Videos, emails, social media posts

The 5 V’s of Big Data: Five key characteristics:

1. Volume

▪ Refers to the amount of data.

▪ Companies now collect data in terabytes, petabytes, or even exabytes.

▪ Example: Facebook generates over 4 petabytes of data per day.

2. Velocity

▪ The speed at which data is created and needs to be processed.

▪ Example: Credit card transactions or live social media updates.

3. Variety

▪ The different types of data formats: Text, images, audio, video, log files, etc.

4. Veracity

▪ Refers to the trustworthiness or quality of the data.

▪ Data can be messy or incomplete, which affects analysis.

5. Value

▪ The usefulness of the data after it is processed.

▪ Big data is only valuable if it produces insights that lead to better decisions.

Why is Big Data Important?

Big Data is useful because it helps organizations:

• Improve decision-making by analyzing real-time data.

• Understand customer behavior and personalize services.

• Predict trends and patterns, such as in sales or weather.



• Detect fraud in financial transactions.

• Enhance operational efficiency in industries like healthcare, logistics, or manufacturing.

How is Big Data Processed and Managed?

Traditional databases (like MySQL or Excel) cannot handle the size and speed of big data. So companies use:

1. Big Data Technologies

• Hadoop – Stores and processes large datasets using a distributed system.

• Apache Spark – Fast in-memory data processing engine.

• NoSQL Databases – Like MongoDB or Cassandra for flexible data storage.

2. Cloud Platforms

• AWS, Microsoft Azure, Google Cloud provide scalable tools to store and analyze big data.

3. Analytics Tools

• Machine Learning & AI are used to find patterns and predictions from big data.

• Data Visualization tools like Tableau or Power BI present insights in an understandable


format.

Real-World Applications of Big Data:

Industry Use Case

Healthcare Predict disease outbreaks, personalize treatments

Retail Customer behavior analysis, inventory management

Finance Fraud detection, credit scoring

Transportation Route optimization, traffic forecasting

Media Personalized recommendations (e.g., Netflix, YouTube)

Hadoop:

Apache Hadoop is an open-source, Java-based software framework and parallel data processing engine. It
enables big data analytics processing tasks to be broken down into smaller tasks that can be performed in
parallel by using an algorithm (like the MapReduce algorithm), and distributing them across a Hadoop
cluster. A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform these kinds of parallel computations on big data sets. Unlike traditional storage systems, Hadoop
excels in handling diverse data types—structured, semi-structured, and unstructured—distributed across
multiple nodes for fault tolerance and parallel processing.



Picture an e-commerce company like Amazon. Their operations hinge on analyzing user behavior, clicks, and
transactions. A Hadoop cluster enables them to process terabytes of data daily, helping refine
recommendations, optimize logistics, and predict consumer demand—all in near real-time.

1. Hadoop Cluster Architecture:

Hadoop is a master-slave model made up of three components. Within a Hadoop cluster, one machine in the
cluster is designated as the NameNode and another machine as the JobTracker, these are the masters. The
rest of the machines in the cluster are DataNodes and TaskTrackers, these are the slaves. The masters
coordinate the roles of many slaves. The table below provides more information on each component.

• Master Node — The master node in a Hadoop cluster is responsible for storing data in the Hadoop
Distributed Filesystem (HDFS). It also executes the computation of the stored data using MapReduce, which
is the data processing framework. Within the master node, there are three additional nodes: NameNode,
Secondary NameNode, and JobTracker. NameNode handles the data storage function with HDFS and
Secondary NameNode keeps a backup of the NameNode data. JobTracker monitors the parallel processing of
data using MapReduce.

• Slave/Worker Node — The slave/worker node in a Hadoop cluster is responsible for storing data and
performing computations. The slave/worker node is comprised of a TaskTracker and a DataNode. The
DataNode service communicates with the Master node in the cluster.

• Client Nodes — The client node is responsible for loading all the data into the Hadoop cluster. It submits
MapReduce jobs and outlines how the data needs to be processed, then retrieves the output once the job is
complete.

Types of Hadoop Clusters:

Hadoop clusters can be configured as either single-node or multi-node systems, each catering to distinct
needs based on workload complexity and scale. Selecting the appropriate configuration ensures operational
efficiency and scalability while minimizing risks.



1. Single-node clusters

In a single-node cluster, all processes, including the NameNode, DataNode, ResourceManager, and
NodeManager, operate on a single machine. This setup is best suited for testing or development
environments, where simplicity and minimal resource allocation are priorities. For instance, a startup
experimenting with recommendation algorithms might utilize a single-node cluster to validate ideas before
transitioning to a production-ready system.

Setting up a single-node cluster involves:

1. Installing Hadoop binaries.

2. Configuring HDFS files, such as core-site.xml and hdfs-site.xml.

3. Starting all services using the start-all.sh command.

Since this configuration is limited in fault tolerance—any failure impacts all processes—it’s ideal for proof-
of-concept testing. Basic manual checks are sufficient to ensure functionality during development.

2. Multi-node clusters

Multi-node clusters distribute processes across multiple machines, enabling large-scale data processing. This
configuration is designed for production environments where massive datasets and complex workflows are
involved. For example, streaming platforms like Netflix rely on multi-node clusters to process global
viewership data, enabling content recommendations and insights.

The setup process for a multi-node cluster includes:

1. Installing Java 8.

2. Configuring SSH for seamless communication between nodes.

3. Assigning master and worker roles to specific machines.

4. Configuring replication settings in hdfs-site.xml.

Advanced tools such as Apache Ambari or Cloudera Manager play a critical role in multi-node setups,
providing real-time monitoring of node health and performance. These clusters also excel in fault tolerance,
as HDFS replicates data across nodes, ensuring operational continuity even if a node fails.

2. Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem, designed
to store and manage large volumes of data across multiple machines in a distributed manner. It provides
high-throughput access to data, making it suitable for applications that deal with large datasets, such as big
data analytics, machine learning, and data warehousing.

The Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant storage solution designed for large
datasets. It consists of NameNode (manages metadata), DataNodes (store data blocks), and a client interface.
Key advantages include scalability, fault tolerance, high throughput, cost-effectiveness, and data locality,
making it ideal for big data applications.



HDFS Architecture:

HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of massive
datasets. Its architecture consists of several key components:

1. NameNode

2. DataNode

3. Secondary NameNode

4. HDFS Client

5. Block Structure

NameNode

The NameNode is the master server that manages the filesystem namespace and controls access to files by
clients. It performs operations such as opening, closing, and renaming files and directories. Additionally, the
NameNode maps file blocks to DataNodes, maintaining the metadata and the overall structure of the file
system. This metadata is stored in memory for fast access and persisted on disk for reliability.

Key Responsibilities:

• Maintaining the filesystem tree and metadata.

• Managing the mapping of file blocks to DataNodes.

• Ensuring data integrity and coordinating replication of data blocks.



DataNode

DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as
instructed by the NameNode. Each DataNode manages the storage attached to it and periodically reports
the list of blocks it stores to the NameNode.

Key Responsibilities:

• Storing data blocks and serving read/write requests from clients.

• Performing block creation, deletion, and replication upon instruction from the NameNode.

• Periodically sending block reports and heartbeats to the NameNode to confirm its status.

Secondary NameNode

The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for merging the
EditLogs with the current filesystem image (FsImage) to reduce the potential load on the NameNode. It
creates checkpoints of the namespace to ensure that the filesystem metadata is up-to-date and can be
recovered in case of a NameNode failure.

Key Responsibilities:

• Merging EditLogs with FsImage to create a new checkpoint.

• Helping to manage the NameNode's namespace metadata.

HDFS Client

The HDFS client is the interface through which users and applications interact with the HDFS. It allows for
file creation, deletion, reading, and writing operations. The client communicates with the NameNode to
determine which DataNodes hold the blocks of a file and interacts directly with the DataNodes for actual data
read/write operations.

Key Responsibilities:

• Facilitating interaction between the user/application and HDFS.

• Communicating with the NameNode for metadata and with DataNodes for data access.

Block Structure

HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is stored
independently across multiple DataNodes, allowing for parallel processing and fault tolerance. The
NameNode keeps track of the block locations and their replicas.

Key Features:

• Large block size reduces the overhead of managing a large number of blocks.

• Blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance.

Example: Assume a 100 TB file is inserted. The file is first divided into blocks (the default block size is
128 MB in Hadoop 2.x and above; assume 10 TB blocks here purely for illustration). These blocks are then
stored across different DataNodes (slave nodes). The DataNodes replicate the blocks among themselves, and
information about which blocks each node holds is sent to the master (NameNode). The default replication
factor is 3, meaning three replicas of each block are kept (including the original). The replication factor
can be increased or decreased by editing the configuration in hdfs-site.xml.

Data storage in HDFS

3. Parquet:

Apache Parquet is an open-source columnar storage format that addresses big data processing challenges.
Unlike traditional row-based storage, it organizes data into columns. This structure allows you to read only
the necessary columns, making data queries faster and reducing resource consumption.

Features of Apache Parquet:

Columnar storage

Unlike row-based formats like CSV, Parquet organizes data in columns. This means when we run a query, it
only pulls the specific columns we need instead of loading everything. This improves performance and
reduces I/O usage.

Row vs column-based structure.



Parquet files are split into row groups, which hold a batch of rows. Each row group is broken into column
chunks, each containing data for one column. These chunks are further divided into smaller pieces called
pages, which are compressed to save space.

In addition, Parquet files store extra information in the footer, called metadata, which helps readers locate
and read only the data they need.

Here’s what the structure looks like:

Row groups

• A row group contains multiple rows but stores data column-wise for efficient reading.

• Example: A dataset with 1 million rows might be split into 10 groups of 100,000 rows each.

Column chunks

• Within each row group, data is separated by columns.

• This design allows columnar pruning, where we can read only the relevant columns instead of
scanning the entire file.

Parquet file internal structure.

Pages

• Each column chunk is further split into pages to optimize memory usage.

• Pages are typically compressed, reducing storage costs.



Footer (metadata)

• The footer at the end of a Parquet file stores index information:

• Schema: Defines data types and column names.

• Row group offsets: Helps locate specific data quickly.

• Statistics: Min/max values to enable predicate pushdown (filtering at the storage


level).

Compression and encoding

As mentioned, Parquet compresses data column by column using compression methods like Snappy and
Gzip. It also uses two encoding techniques:

• Run-length encoding to store repeated values compactly.

• Dictionary encoding to replace duplicates with dictionary references.

This reduces file sizes and speeds up data reading, which is especially helpful when you work with big data.

Schema evolution

Schema evolution means modifying the structure of datasets, such as adding or altering columns. It may
sound simple, but depending on how your data is stored, modifying the schema can be slow and resource-
intensive.

Let’s understand this by comparing CSV and Parquet schema evolution.

Suppose you have a CSV file with columns like student_id, student_name, and student_age. If you want to add
a new scores column, you’d have to do the following:

1. Read the entire file into memory.

2. Update the header to include a new column, scores.

3. Add a score for each student. This means appending values for all rows (even if they are
missing, you may need placeholders like empty strings or NULL).

4. Save everything as a new CSV file.

CSV is a simple text-based format with no built-in schema support. This means any change to the structure
requires rewriting the entire file, and older systems reading the modified file might break if they expect a
different structure!

With Parquet, you can add, remove, or update fields without breaking your existing files. As we saw before,
Parquet stores schema information inside the file footer (metadata), allowing for evolving schemas without
modifying existing files.

Here’s how it works:

• When you add a new column, existing Parquet files remain unchanged.

• New files will include the additional column, while old files still follow the previous schema.
• Removing a column doesn’t require reprocessing previous data; queries will ignore the missing
column.

• If a column doesn’t exist in an older file, Parquet engines (like Apache Spark, Hive, or BigQuery)
return NULL instead of breaking the query.

• Older Parquet files can be read even after schema modifications.

• Newer Parquet files with additional columns can still be read by systems expecting an older
schema.

Adding a column to the Parquet file without breaking it.

Language and platform support

Parquet supports different programming languages, such as Java, Python, C++, and Rust. This means
developers can easily use it regardless of their platform. It is also natively integrated with big data
frameworks like Apache Spark, Hive, Presto, Flink, and Trino, ensuring efficient data processing at scale.

So whether you're using Python (through PySpark) or another language, Parquet can manage the data in a
way that makes it easy to query and analyze across different platforms.
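Small illustration (file name and columns are assumptions): pandas with the pyarrow engine can write a Parquet file and then read back only the columns a query needs, demonstrating columnar pruning.

    import pandas as pd

    df = pd.DataFrame({
        "student_id": [1, 2, 3],
        "student_name": ["Asha", "Ravi", "Meena"],
        "student_age": [20, 21, 20],
        "scores": [88, 92, 79],
    })

    # Write columnar, compressed Parquet (pyarrow engine, Snappy compression)
    df.to_parquet("students.parquet", engine="pyarrow", compression="snappy")

    # Columnar pruning: read back only the columns the query actually needs
    subset = pd.read_parquet("students.parquet", columns=["student_id", "scores"])
    print(subset)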

4. Avro:

Data Serialization: Data Serialization is the process wherein data structures or object states are converted
into a format that can be stored, transported, and subsequently reconstructed. Often used in data storage,
remote procedure calls (RPC), and data communication, serialization facilitates complex data processing and
analytics by making data more accessible and portable.

Functionality and Features:

Data Serialization operates by converting intricate data structures into a byte stream, enabling effective data
transfer across networks. Its features include:

• Data Persistence: Serialization helps in saving the state of an object to a storage medium and later
retrieving it.

• Data Exchange: It allows transmitting data over a network in a form that the network can understand.

• Remote Procedure Calls (RPCs): They can be made as though they are local calls via serialization.
Architecture:

The architecture of data serialization is based on two main components: the serializer and deserializer. The
serializer converts object data into a byte stream, while the deserializer reconverts the byte stream to
replicate the original object data structure.

Benefits and Use Cases:

Data Serialization accrues numerous benefits to businesses, notably:

• Facilitates Distributed Computing: Serialization simplifies the processing of objects in a distributed


environment by enabling object transport over the network.

• Enhances Data Interchange: Data exchange between different languages or platforms is made possible
through serialization.

• Enables Data Persistence: Serialized data can be stored and recovered efficiently, making it beneficial for
applications like caching, session state persistence, etc.

What is Avro?

Apache Avro is a data serialization system developed by the Apache Software Foundation that is used for big
data and high-speed data processing. It provides rich data structures and a compact, fast, binary data format
that can be processed quickly and in a distributed manner. Avro has wide use in the Hadoop ecosystem and is
often used in data-intensive applications, such as data analytics.

Functionality and Features:

• Schema definition: Avro data is always associated with a schema written in JSON format.

• Language-agnostic: Avro libraries are available in several languages including Java, C, C++, C#, Python,
and Ruby.

• Dynamic typing: Avro does not require code generation, which enhances its flexibility and ease of use.

• Compact and fast: Avro offers efficient serialization and deserialization.

Architecture:

The core of Avro's architecture is its schema, which is used to read and write data. Avro schemas are defined
in JSON, and the resulting serialized data is compact and efficient. Processing systems can use these schemas
to understand the data and perform operations on it.

An Avro schema is defined in the JSON format and is necessary for both serialization and deserialization,
enabling compatibility and evolution over time. It can be a:

• JSON string, which contains the type name, like int.

• JSON array, which represents a union of multiple data types.

• JSON object, which defines a new data type using the format {"type": "typeName", ...attributes...}



Example of an Avro record schema:
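The original figure is not reproduced here; a representative record schema (the field names are illustrative) and a serialization round-trip with the fastavro library might look like this:

    import io
    from fastavro import schemaless_writer, schemaless_reader, parse_schema

    # A record schema is a JSON object: a type name plus a list of typed fields
    schema = parse_schema({
        "type": "record",
        "name": "SensorReading",
        "namespace": "com.example.iot",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "temperature", "type": "float"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    # Serialize one record to compact binary, then read it back with the same schema
    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"device_id": "s-1", "temperature": 23.5, "timestamp": 1700000000})
    buf.seek(0)
    print(schemaless_reader(buf, schema))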

5. Hive:

Hive is a query interface on top of Hadoop's native MapReduce. Remember, Hive is a data warehouse; it is not
an RDBMS! Hive allows us to write SQL-style queries in a language known as Hive Query Language (HQL).

Why do we need Hive?

In Hadoop, data is stored in HDFS and MapReduce is used to process it. But there is a problem: writing
MapReduce programs requires Java, so to work with this HDFS data you would have to be a Java expert!

Facebook noticed this problem and built Hive, which lets you process this (structured) HDFS data even if you
don't know Java.



The Hive execution engine converts scripts written in HQL into MapReduce programs. These MapReduce programs
are packaged as JAR files and, when executed, behave exactly as hand-written MapReduce jobs would.

Hive reads data from HDFS and works with structured data in Hadoop.

In Hive, tables and data are stored separately. Data is stored only in HDFS; tables are simply projected over
that HDFS data. Unlike an RDBMS, data is not stored inside Hive tables. Hive only stores the table's schema
information (table metadata) in its metastore, which is backed by an RDBMS (Derby by default). In production,
Derby is replaced by Oracle or MS SQL Server, since Derby is a single-user database.

The table metadata is stored separately from the data.
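Hedged sketch (the table, partition column, and warehouse setup are assumptions): HQL can be run from Python through PySpark with Hive support enabled. Note that this executes the query on Spark rather than the classic MapReduce path described above, but the metastore-backed table model is the same.

    from pyspark.sql import SparkSession

    # Spark with Hive support uses the Hive metastore for table metadata
    spark = (
        SparkSession.builder
        .appName("hql-demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    # HQL: SQL-style query over a table projected onto data stored in HDFS
    result = spark.sql("""
        SELECT device_id, COUNT(*) AS readings
        FROM iot_logs                -- assumed Hive table
        WHERE dt = '2024-01-15'      -- assumed partition column
        GROUP BY device_id
    """)
    result.show()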



Working with Hive:

1. The user interface calls the driver's execute function to run a query.

2. The driver accepts the query, creates a session handle for it, and passes it to the compiler to generate
an execution plan.

3. The compiler sends a metadata request to the metastore, which returns the required metadata.

4. The compiler uses this metadata for type-checking and semantic analysis of the expressions in the query
tree, and then generates the execution plan: a Directed Acyclic Graph (DAG) of MapReduce jobs, including map
operator trees (operators used by mappers) and reduce operator trees (operators used by reducers).

5. The compiler transmits the generated execution plan to the driver.

6. The driver passes the execution plan to the execution engine for execution.

7. The execution engine submits the stages of the DAG to the appropriate components. For each table or
intermediate output, the associated deserializer is used to read rows from HDFS files, and these rows are
passed through the operator tree. Intermediate output is serialised and written to temporary HDFS files,
which then feed the subsequent MapReduce stages of the plan. The final temporary file is moved to the
table's location.

8. As part of a fetch call from the driver, the contents of the temporary files are read from HDFS and the
Hive interface returns the results to the driver.

6. Hadoop MapReduce:

Components of Hadoop: Hadoop has three components:

1. HDFS: Hadoop Distributed File System is a dedicated file system to store big data on a cluster of
commodity (low-cost) hardware with a streaming access pattern. It enables data to be stored at multiple
nodes in the cluster, which ensures data security and fault tolerance.

2. MapReduce: Data stored in HDFS also needs to be processed. When a query is submitted against a dataset in HDFS, Hadoop identifies where the relevant data blocks are stored, and the query is broken into multiple parts that run on those blocks in parallel; this is the map phase. The results of all these parts are then combined and the overall result is sent back to the user; this is the reduce phase. Thus, while HDFS is used to store the data, MapReduce is used to process it.

3. YARN: YARN stands for Yet Another Resource Negotiator. It acts as a dedicated operating system for Hadoop, managing the resources of the cluster and providing a framework for job scheduling. The available scheduling policies include First Come First Serve (FIFO), the Fair Scheduler, and the Capacity Scheduler; FIFO is the simplest, while modern Hadoop distributions typically default to the Capacity Scheduler.

How Hadoop Map and Reduce Work Together:

As the name suggests, MapReduce processes input data in two stages – Map and Reduce. To demonstrate this, we will use a simple example: counting the number of occurrences of certain words across documents.

The final output we are looking for is: How many times the words Apache, Hadoop, Class, and Track appear in
total in all documents.

For illustration purposes, the example environment consists of three nodes. The input contains six
documents distributed across the cluster. We will keep it simple here, but in real circumstances, there is no
limit. You can have thousands of servers and billions of documents.



1. First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the
three servers). In this case, each map task works on a split containing two documents. During mapping, there
is no communication between the nodes. They perform independently.

2. Then, the map tasks create a <key, value> pair for every word of interest. These pairs record how many times a word occurs: the word is the key and its count is the value. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in that map task’s output look like this:

• <apache, 7>

• <class, 8>

• <track, 6>

This process runs as parallel tasks on all nodes for all documents, and each map task produces its own output.

3. After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks or on any other node.

The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This
process groups the values by keys in the form of <key, value-list> pairs.

4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> pair to produce a final key-value pair. The reduce tasks also run at the same time and work independently.

In our example from the diagram, the reduce tasks get the following individual results:

• <apache, 22>

• <hadoop, 20>

• <class, 18>

• <track, 22>

5. Finally, the output of the Reduce stage is grouped into one result. MapReduce now shows how many times the words Apache, Hadoop, Class, and Track appeared across all documents. By default, the aggregate data is stored in HDFS.
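The same word-count logic can be sketched as a Hadoop Streaming job written in Python, which avoids writing the mapper and reducer in Java. The script name and the streaming invocation below are illustrative assumptions, not taken from the example above.

# word_count_streaming.py -- illustrative mapper and reducer for Hadoop Streaming.
# Run (roughly) as:
#   hadoop jar hadoop-streaming.jar \
#       -mapper "python3 word_count_streaming.py map" \
#       -reducer "python3 word_count_streaming.py reduce" \
#       -input docs/ -output counts/
import sys

TARGET_WORDS = {"apache", "hadoop", "class", "track"}

def mapper():
    # Map stage: emit <word, 1> for every occurrence of a target word.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            if word in TARGET_WORDS:
                print(f"{word}\t1")

def reducer():
    # Reduce stage: input arrives sorted by key (the shuffle step),
    # so the counts for each word can be summed in one pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()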

7. Yet Another Resource Negotiator (YARN):

YARN's architecture comprises three main components:

1. ResourceManager (RM): Acts as the master daemon, managing and allocating cluster resources. It
comprises two main parts:

• Scheduler: Allocates resources based on application requirements and policies.

• ApplicationManager: Manages job submissions and coordinates with NodeManagers.



2. NodeManager: A per-node agent that runs on each worker (data) node and coordinates the launching, monitoring, and tear-down of containers on that node.

3. ApplicationMaster (AM): A per-application component that negotiates resources from the ResourceManager and works with the NodeManager(s) to launch and monitor the application’s tasks. The ApplicationMaster itself ships with the application framework and runs inside a container allocated by the ResourceManager.

Apache HBase:

Apache HBase is an open-source, NoSQL, distributed big data store. It enables random, strictly consistent,
real-time access to petabytes of data. HBase is very effective for handling large, sparse datasets.

HBase integrates seamlessly with Apache Hadoop and the Hadoop ecosystem and runs on top of the Hadoop
Distributed File System (HDFS) or Amazon S3 using Amazon Elastic MapReduce file system (EMRFS). HBase
serves as a direct input and output to the Apache MapReduce framework for Hadoop, and works with Apache
Phoenix to enable SQL-like queries over HBase tables.

How does HBase work?

• HBase stores data in tables, similar to traditional relational databases. Each table consists of rows and
columns.

• Tables are divided into column families, which group related columns together.

• Each column family can contain multiple columns, and each column can store a value.

• Rows are uniquely identified by a row key.
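As a rough illustration of this table / column-family / row-key model, the sketch below uses the happybase client (one of several Python clients for HBase's Thrift interface). The Thrift server address, the sensor_data table, the reading column family, and the row-key layout are assumptions made for the example.

# Illustrative HBase access via happybase (HBase Thrift client).
# Assumes an HBase Thrift server on localhost and an existing table 'sensor_data'
# with a column family 'reading'.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("sensor_data")

# Row key: device id + timestamp, a common pattern for time-ordered IoT data.
row_key = b"device42-20240101T120000"
table.put(row_key, {
    b"reading:temperature": b"21.5",
    b"reading:humidity": b"48",
})

# Random, low-latency read by row key.
row = table.row(row_key)
print(row[b"reading:temperature"])

connection.close()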



Architecture:

HMaster

The HMaster is a central component in an HBase cluster and is responsible for managing metadata and
coordinating cluster operations. It keeps track of regions, assigns regions to Region Servers, and handles
region splits and merges.

Region Servers

• Region Servers are responsible for serving data in HBase. They host a set of regions.

• A region is a subset of a table and represents a range of row keys.

• Each Region Server can serve multiple regions and is responsible for reading, writing, and managing data
within those regions.

ZooKeeper

HBase relies on Apache ZooKeeper for coordination and distributed synchronization. ZooKeeper helps in:

1. Cluster Coordination:

▪ ZooKeeper helps coordinate the different nodes (RegionServers and Master) in an HBase cluster.
▪ It keeps track of live RegionServers, their health, and their availability.
▪ When a RegionServer joins or leaves the cluster (intentionally or due to failure), ZooKeeper updates
the cluster state accordingly.

2. Leader Election:

▪ In an HBase cluster, only one HMaster should be active at a time.


▪ If multiple HMasters are started for high availability, ZooKeeper helps elect the active master among
them using leader election mechanisms.
▪ If the active HMaster fails, ZooKeeper promotes one of the standby HMasters to become the new
active master.



3. RegionServer Failover:

▪ Each HBase RegionServer registers itself with ZooKeeper.


▪ If a RegionServer crashes or becomes unresponsive, ZooKeeper detects this (through missing
heartbeats).
▪ The HMaster, upon being notified by ZooKeeper, reassigns the regions served by the failed
RegionServer to other healthy RegionServers to maintain availability.

4. Metadata Management: ZooKeeper helps HBase track:

▪ Location of the -ROOT- and .META. tables (used to locate user data).
▪ Active Master’s address.
▪ Configuration details and server status.

What are the benefits of HBase?

Scalable: HBase is designed to handle scaling across thousands of servers and managing access to petabytes
of data. With the elasticity of Amazon EC2, and the scalability of Amazon S3, HBase is able to handle online
access to massive data sets.

Fast: HBase provides low latency random read and write access to petabytes of data by distributing requests
from applications across a cluster of hosts. Each host has access to data in HDFS and S3, and serves read and
write requests in milliseconds.

Fault-Tolerant: HBase splits data stored in tables across multiple hosts in the cluster and is built to
withstand individual host failures. Because data is stored on HDFS or S3, healthy hosts will automatically be
chosen to host the data once served by the failed host, and data is brought online automatically.

DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service from Amazon that supports key-value and document data models. It requires only a primary key and doesn’t require a schema to create a table. It can store any amount of data and serve virtually any amount of traffic, and you can expect consistently fast performance even as it scales up. Its API is simple and compact, following the key-value model to store, access, and perform advanced data retrieval.



DynamoDB is a web service, and interactions with it are stateless. Applications are not required to maintain
persistent network connections. Instead, interaction with DynamoDB occurs using HTTP(S) requests and
responses.

Features of DynamoDB:

DynamoDB is designed so that users can build high-performance, scalable applications that might not be possible with a traditional database system. Its additional features fall under the following categories:

• On-demand capacity mode: For applications using on-demand mode, DynamoDB automatically scales capacity up or down to accommodate the traffic.

• Built-in support for ACID transactions: DynamoDB provides native, server-side support for transactions.

• On-demand backup: This feature allows you to create a full backup of a table at any given point in time.

• Point-in-time recovery: This feature protects your data against accidental write or delete operations.

• Encryption at rest: Data is kept encrypted even when the table is not in use, enhancing security with the help of encryption keys.
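A small boto3 sketch of the key-value style of access described above. The table name, key attributes, and item fields are invented for illustration, and the table is assumed to already exist with device_id and timestamp as its key; AWS credentials are assumed to be configured locally.

# Illustrative DynamoDB access with boto3.
import boto3
from decimal import Decimal

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("IoTReadings")  # hypothetical, pre-created table

# Write an item (no fixed schema beyond the key attributes).
table.put_item(Item={
    "device_id": "sensor-42",
    "timestamp": "2024-01-01T12:00:00Z",
    "temperature": Decimal("21.5"),
})

# Read it back by primary key.
response = table.get_item(Key={
    "device_id": "sensor-42",
    "timestamp": "2024-01-01T12:00:00Z",
})
print(response.get("Item"))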

Amazon S3

Amazon Simple Storage Service (S3) is a massively scalable storage service based on object
storage technology. It provides a very high level of durability, with high availability and high performance.
Data can be accessed from anywhere via the Internet, through the Amazon Console and the powerful S3 API.

S3 storage provides the following key features:

• Buckets—data is stored in buckets. Each bucket can store an unlimited amount of unstructured data.

• Elastic scalability—S3 has no storage limit. Individual objects can be up to 5TB in size.

• Flexible data structure—each object is identified using a unique key, and you can use metadata to
flexibly organize data.

• Downloading data—easily share data with anyone inside or outside your organization and enable them
to download data over the Internet.

• Permissions—assign permissions at the bucket or object level to ensure only authorized users can access
data.

• APIs – the S3 API, provided both as REST and SOAP interfaces, has become an industry standard and is
integrated with a large number of existing tools.



Amazon S3 Core Concepts — Buckets and Objects:

Organizing, storing, and retrieving data in Amazon S3 centres on two key components that work together to form the storage system:

• Buckets, and

• Objects.

Buckets:

A bucket is a container for objects stored in Amazon S3. Objects are saved in the buckets. In order to store
your data in Amazon S3, you first create a bucket and specify a bucket name and AWS Region. Then, you
upload your data to that bucket as objects in Amazon S3.

It’s also important to know that Amazon S3 bucket names are globally unique: no other AWS account, in any region, can use the same bucket name as yours unless you first delete your own bucket.

Objects:

Objects are the fundamental entities stored in Amazon S3. Amazon S3 is an object storage service that stores
data as objects within buckets. Objects are data files, including documents, photos, videos and any metadata
that describes the file. Each object has a key (or key name), which is the unique identifier for the object
within the bucket.

Objects consist of object data and metadata. The metadata is a set of name-value pairs that describe the
object. These pairs include some default metadata, such as the date last modified, and standard HTTP
metadata, such as Content-Type. We can also specify custom metadata at the time that the object is stored.

How Amazon S3 Work:

When we create a bucket, we should give a bucket name and choose the AWS Region where the bucket will
reside. After we create a bucket, we cannot change the name of the bucket or its Region.

It’s best practice to select a region that’s geographically closest to you. Objects that reside in a bucket within a
specific region remain in that region unless you transfer the files somewhere else.
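A short boto3 sketch of the bucket-and-object workflow just described. The bucket name, region, object key, and metadata are placeholders chosen for the example (bucket names must be globally unique), and AWS credentials are assumed to be configured.

# Illustrative S3 usage with boto3: create a bucket, upload an object, read it back.
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")

bucket = "example-iot-archive-123456"  # placeholder; must be globally unique
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)

# Objects are identified by a key; custom metadata can be attached at upload time.
s3.put_object(
    Bucket=bucket,
    Key="readings/2024/01/01/device42.json",
    Body=b'{"temperature": 21.5}',
    Metadata={"device": "sensor-42"},
)

obj = s3.get_object(Bucket=bucket, Key="readings/2024/01/01/device42.json")
print(obj["Body"].read())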



Amazon S3 has many use cases:

• Data Storage: S3 is ideal for storing application images and videos. Amazon properties such as Prime Video and Amazon.com, as well as companies like Netflix and Airbnb, use Amazon S3 for this purpose.

• Backup and Disaster Recovery: Amazon S3 is suitable for storing and archiving critical or backup data because it can be automatically replicated across regions, providing high availability and durability.

• Analytics: We can run big data analytics, artificial intelligence (AI), machine learning (ML) on Amazon S3.

• Data Archiving: We can move data archives to the Amazon S3 Glacier storage classes to lower costs,
eliminate operational complexities, and gain new insights.

• Static Website Hosting: S3 stores static objects, so we can host static websites and SPA front-end layers directly from S3.

• Cloud-native applications: In microservices architectures, S3 can store blob data that is accessed by different services.

Apache Spark for IoT data processing

The Apache Spark framework is an open-source, distributed analytics engine designed to support big data
workloads. With Spark, users can harness the full power of distributed computing to extract insights from big
data quickly and effectively.

Spark handles parallel distributed processing by allowing users to deploy a computing cluster on local or
cloud infrastructure and schedule or distribute big data analytics jobs across the nodes. Spark has a built-in
standalone cluster manager, but can also connect to other cluster managers like Mesos, YARN, or Kubernetes.
Users can configure the Spark cluster to read data from various sources, perform complex transformations
on high-scale data, and optimize resource utilization.



Apache Spark framework

The Spark framework consists of five components:

1. Spark Core is the underlying distributed execution engine that powers Apache Spark. It
handles memory management, connects to data storage systems, and can schedule, distribute,
and manage jobs.

2. Spark SQL is a distributed query engine for interactive relational querying.

3. Spark Streaming is a streaming analytics engine that leverages Spark Core’s fast scheduling to ingest and analyse new data in real time.

4. Spark MLlib is a library of machine learning algorithms that users can train using their own
data.

5. Spark GraphX is an API for building graphs and running graph-parallel computation on large
datasets.

Spark Core is exposed through an API that supports several of the most popular programming languages,
including Scala, Java, SQL, R, and Python.

How Apache Spark Works:

At its core, Apache Spark uses a distributed computing model. Here’s how it processes data:

1. Data Distribution: Spark splits data into smaller chunks and distributes it across a cluster of nodes.

2. Task Execution: Each node processes its assigned chunk in parallel, increasing speed and efficiency.

3. In-Memory Computation: Unlike traditional systems that rely on disk-based processing, Spark keeps
data in memory, significantly reducing latency.

4. Resilient Distributed Datasets (RDDs): RDDs are immutable collections of objects that can be processed
in parallel, ensuring reliability and fault tolerance.
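A minimal PySpark sketch of these ideas: a small collection is parallelized into an RDD, transformed in parallel across partitions, and the result collected back to the driver. The sensor names and values are invented for illustration.

# Minimal PySpark sketch: distribute data as an RDD, transform it in parallel,
# and aggregate the results.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Data distribution: split a collection into partitions across the cluster.
readings = sc.parallelize([
    ("sensor-1", 21.5), ("sensor-2", 19.0),
    ("sensor-1", 22.0), ("sensor-2", 18.5),
])

# Task execution + in-memory computation: per-partition work, combined per key.
avg_by_sensor = (
    readings.mapValues(lambda t: (t, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda s: s[0] / s[1])
)

print(avg_by_sensor.collect())
spark.stop()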



Core Features of Apache Spark:

1. Faster Data Processing

One of the primary reasons Apache Spark is the backbone of big data analytics is its speed. Thanks to its in-
memory processing architecture, Spark is much faster than traditional big data platforms like Hadoop. This
speed is critical for applications requiring quick analysis, such as real-time decision-making in industries like
finance, retail, and healthcare.

2. Scalability for Massive Datasets

In the age of big data, the ability to scale efficiently is essential. Apache Spark scales horizontally, meaning
that as your data grows, you can add more machines to your cluster to handle the increased load. This
scalability makes Spark an ideal choice for organizations dealing with petabytes of data.

3. Versatility Across Data Processing Tasks

Apache Spark demonstrates significant versatility across various data processing tasks due to its unified
framework and specialized modules. This allows it to handle diverse workloads within a single platform,
eliminating the need for separate tools for different analytical needs.

4. Integration with Other Big Data Tools

Apache Spark easily integrates with other popular big data tools, such as Hadoop, Hive, and Cassandra,
allowing organizations to build a robust and flexible big data ecosystem. This integration enables businesses
to leverage their existing infrastructure while gaining the benefits of Spark’s advanced analytics capabilities.

5. Simplified Development and Maintenance

With its high-level APIs, Apache Spark makes it easier for developers to build big data applications. Unlike
older technologies like Hadoop, which require complex configurations and custom code, Spark provides a
more straightforward approach to building and maintaining data pipelines, making it an attractive option for
organizations looking to reduce development time and complexity.

6. Advanced Analytics and Machine Learning

Apache Spark’s ability to handle machine learning workloads through its MLlib library makes it a crucial tool
for organizations looking to incorporate advanced analytics into their operations. Spark also supports graph
processing through GraphX, allowing businesses to analyze relationships and patterns in data, such as
social networks or recommendation systems.

7. Cost-Effective Solution

Despite its advanced capabilities, Apache Spark is an open-source platform, which means there are no
licensing fees associated with its use. This makes Spark a cost-effective option for organizations looking to
perform big data analytics without the financial burden of proprietary software.

Thinking about a single machine versus a cluster of machines:

Apache Spark is a powerful open-source distributed computing system designed for fast and general-purpose
big data processing. It can be deployed in two main ways: on a single machine (local mode) or across a
cluster of machines. Each approach has its own advantages and disadvantages.



1. Single machine (local mode):

In local mode, the entire Spark application runs in a single Java Virtual Machine (JVM) on a single machine.

Advantages:

• Simplicity: Easy to set up and ideal for learning, prototyping, and debugging applications with smaller
datasets.

• Faster for Small Data: For smaller datasets, local mode might even be faster due to reduced network
overhead compared to a cluster setup.

• Resource Efficiency: Can be cost-effective for single-machine workloads, such as training machine
learning models on moderate datasets or handling lightweight data tasks.

Limitations:

• Scalability: Limited by the resources (CPU, RAM, storage) of a single machine.

• Not for Big Data: Not suitable for large datasets that cannot fit in a single machine's memory or require
significant processing power beyond a single machine's capabilities.

• Fault Tolerance: No fault tolerance; a single machine failure will halt the entire application.

2. Cluster of machines (distributed mode):

In distributed mode, Spark applications are distributed across multiple machines in a cluster, allowing for
parallel execution and handling of large datasets.

Advantages:

• Scalability: Can easily scale horizontally by adding more machines to handle growing data volumes and
processing demands.

• Faster for Big Data: Processes large datasets much faster by distributing tasks and computations across
multiple nodes.

• Fault Tolerance: Designed for fault tolerance, allowing applications to recover gracefully from node
failures without losing data.

• Resource Isolation: The driver program can run on a separate machine, preventing resource conflicts with
worker nodes.

• No Dependency on Local Machine: After submitting an application in cluster mode, it runs independently,
allowing you to disconnect from your local machine.

Limitations:

• Complexity: Setting up and managing a Spark cluster can be more complex than running Spark in local
mode.

• Cost: Running a cluster can be more expensive due to the need for multiple machines and potentially more
advanced hardware.



Use case considerations:

• Use single-node Spark for:

• Development, testing, and debugging with small datasets.


• Workloads that do not heavily utilize Spark's distributed nature or for situations where data
access is the primary need.
• Small deep learning jobs that might benefit from single-machine GPUs rather than distributed
processing.

• Use Spark clusters for:

• Production deployments and large-scale applications requiring high availability and scalability.
• Processing large datasets that exceed the resources of a single machine.
• Applications requiring parallel processing and the ability to recover from node failures.

In a single-node architecture, we take a large task and divide it into subtasks, which are then completed
sequentially.

In a distributed computing architecture, the fundamental principle is to take a large task and break it into smaller subtasks that can execute concurrently on different nodes within the cluster. Being able to exploit this distributed nature of computing is what allows Spark to scale and keep up with the increasing demands of enterprise-level data in today’s world.
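To make the local-versus-cluster distinction concrete, the sketch below shows how the same PySpark application can point at a local master or at a cluster manager. The master URLs are illustrative assumptions; in practice the cluster master is usually supplied via spark-submit rather than hard-coded in the application.

# Same application code, different deployment target (illustrative master URLs).
from pyspark.sql import SparkSession

# Single machine (local mode): one JVM, using all local cores.
local_spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("iot-analytics-local")
    .getOrCreate()
)

# Cluster (distributed mode): e.g. YARN or a standalone master.
# Typically selected at submission time instead of in code, roughly:
#   spark-submit --master yarn --deploy-mode cluster iot_job.py
# cluster_spark = (
#     SparkSession.builder
#     .master("spark://spark-master:7077")
#     .appName("iot-analytics-cluster")
#     .getOrCreate()
# )

df = local_spark.range(1_000_000)
print(df.count())
local_spark.stop()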



To stream or not to stream: Lambda architecture:

Lambda (λ) Architecture is a data processing architecture designed to handle massive quantities of data by
taking advantage of both batch and stream processing methods. This hybrid approach aims to balance the
trade-offs between latency, throughput, and fault tolerance, making it particularly suitable for real time
analytics on large datasets. The architecture processes data in a way that maximizes the strengths of both
batch processing—known for its comprehensive and accurate computations—and real time stream
processing, which provides low-latency updates.

Three Layers of the Lambda Architecture:

1. Batch Layer (Batch Processing)

Batch layer, also known as the batch processing layer, is responsible for storing the complete dataset and
pre-computing batch views. It operates on the principle of immutability, meaning that once data is ingested,
it is never updated or deleted; only new data is appended. This layer processes large volumes of historical
data at scheduled intervals, which can range from minutes to days, depending on the application's
requirements.

Key characteristics of the batch layer are:

• It handles extensive historical data.

• It allows for comprehensive data analysis and complex computations.

• Data is stored immutably, ensuring a reliable historical record.

• It processes data in scheduled batches.

Technologies commonly used in the batch layer are:

• Apache Hadoop

• Apache Spark

• Databricks

• Snowflake

• Amazon Redshift

• Google BigQuery

2. Speed Layer (Real time / Speed Processing)

Speed layer, also known as the real time or streaming layer, is designed to handle data that needs to be
processed with minimal latency. Unlike the batch layer, it doesn't wait for complete data and provides
immediate views based on the most recent data.

In the speed layer, incoming data is processed in real time to generate low-latency views. This layer aims to
bridge the gap between the arrival of new data and its availability in the batch layer’s views. While the speed
layer's results may be less accurate or comprehensive, they offer timely insights.



The speed layer is crucial for applications requiring real time analytics, such as fraud detection,
recommendation systems, or monitoring systems where immediate data insights are essential.

Key characteristics of the speed layer are:

• It processes data in real time or near real time.

• It provides low-latency updates to users and applications.

• It handles continuously generated data, such as logs or sensor data.

Technologies commonly used in the speed layer are:

• Apache Kafka

• Amazon Kinesis

• Apache Spark Streaming

• Apache Flink

• Apache Storm

3. Serving Layer (Data Access and Queries)

Serving Layer indexes and stores the precomputed batch views from the Batch Layer and the near real time
views from the Speed Layer. It provides a unified view for querying and analysis, allowing users to access
both historical and real time data efficiently.

Key characteristics of the serving layer are:

• It merges results from the Batch and Speed Layers.

• It provides low-latency access to processed data.

• It supports ad-hoc queries and data exploration.

Technologies commonly used in the serving layer are: Apache Druid, Apache HBase, Elasticsearch.
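As a rough sketch of a speed-layer job in this architecture, the snippet below uses Spark Structured Streaming to read sensor events from a Kafka topic and maintain low-latency windowed aggregates. The broker address, topic name, and JSON schema are assumptions for illustration; a real deployment would write the results to a serving store such as Druid or HBase rather than the console, and the job needs the Spark–Kafka connector package on its classpath.

# Speed-layer sketch: Spark Structured Streaming reading IoT events from Kafka.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "iot-readings")                   # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Low-latency view: average temperature per device over 1-minute windows.
realtime_view = (
    events.withWatermark("event_time", "2 minutes")
          .groupBy(F.window("event_time", "1 minute"), "device_id")
          .agg(F.avg("temperature").alias("avg_temp"))
)

# Console sink for illustration only; a serving-layer store would be used in practice.
query = realtime_view.writeStream.outputMode("update").format("console").start()
query.awaitTermination()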



4. EDA for IoT Data: Exploring and visualizing data
What is data visualization?

Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to
present data to non-technical audiences without confusion.

In the world of Big Data, data visualization tools and technologies are essential to analyse massive amounts
of information and make data-driven decisions.

Exploratory Data Analysis (EDA) is an important step in data science and data analytics as it visualizes data
to understand its main features, find patterns and discover how different parts of the data are connected.

Why Exploratory Data Analysis Important?

1. It helps to understand the dataset by showing how many features it has, what type of data each
feature contains and how the data is distributed.

2. It helps to identify hidden patterns and relationships between different data points, which help us in analysis and model building.

3. Allows to identify errors or unusual data points (outliers) that could affect our results.

4. The insights gained from EDA help us to identify the most important features for building models and guide us on how to prepare them for better performance.

Types of Exploratory Data Analysis:

There are various types of EDA based on the nature of the records. Depending on the number of variables (columns) we are analysing, we can divide EDA into three types:



1. Univariate Analysis

Univariate analysis focuses on studying one variable to understand its characteristics. It helps to describe the data and find patterns within a single feature. Common methods include histograms to show the data distribution, box plots to detect outliers and understand spread, and bar charts for categorical data. Summary statistics such as mean, median, mode, variance, and standard deviation help describe the central tendency and spread of the data.

2. Bivariate Analysis

Bivariate analysis focuses on identifying the relationship between two variables to find connections, correlations, and dependencies. It helps to understand how two variables interact with each other. Some key techniques include:

• Scatter plots, which visualize the relationship between two continuous variables.

• The correlation coefficient, which measures how strongly two variables are related; Pearson’s correlation is commonly used for linear relationships.

• Cross-tabulation or contingency tables, which show the frequency distribution of two categorical variables and help to understand their relationship.

• Line graphs, which are useful for comparing two variables over time in time-series data to identify trends or patterns.

• Covariance, which measures how two variables change together; it is usually paired with the correlation coefficient for a clearer and more standardized understanding of the relationship.

3. Multivariate Analysis

Multivariate analysis identifies relationships among three or more variables in the dataset and aims to understand how these variables interact with one another, which is important for statistical modelling techniques. It includes techniques such as the following (a short Python sketch of all three types of analysis follows this list):

• Pair plots, which show the relationships between multiple variables at once and help in
understanding how they interact.

• Another technique is Principal Component Analysis (PCA) which reduces the complexity of large
datasets by simplifying them while keeping the most important information.

• Spatial Analysis is used for geographical data by using maps and spatial plotting to understand the
geographical distribution of variables.

• Time Series Analysis is used for datasets that involve time-based data and it involves
understanding and modeling patterns and trends over time. Common techniques include line plots,
autocorrelation analysis, moving averages and ARIMA models.
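A compact sketch of the three levels of analysis using pandas, seaborn, and scikit-learn on a hypothetical IoT readings DataFrame; the column names and synthetic values are invented for illustration.

# Univariate, bivariate, and multivariate EDA on a hypothetical IoT dataset.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": rng.normal(22, 2, 500),
    "humidity": rng.normal(45, 5, 500),
    "power_kw": rng.normal(1.2, 0.3, 500),
})

# Univariate: distribution and summary statistics of one variable.
print(df["temperature"].describe())
sns.histplot(df["temperature"], kde=True)

# Bivariate: relationship between two variables.
print(df["temperature"].corr(df["power_kw"]))           # Pearson correlation
sns.scatterplot(data=df, x="temperature", y="power_kw")

# Multivariate: all pairwise relationships, plus PCA to compress the feature space.
sns.pairplot(df)
components = PCA(n_components=2).fit_transform(df)
print(components[:5])
plt.show()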

Steps for Performing Exploratory Data Analysis:

EDA is the process of analysing datasets to summarize their main features, uncover patterns, spot anomalies,
and prepare the data for modelling. It involves both statistical and visual techniques. Tools like Python
(Pandas, Seaborn) and R (ggplot2, dplyr) are commonly used.



1. Understand the Problem & Data

Before analyzing, understand the goal of the project and what each variable represents. Know the data
types (e.g., numerical, categorical) and any known quality issues. This ensures the analysis is relevant and
accurate.

2. Import & Inspect the Data

Load the data into your tool (Python/R). Check the shape (rows & columns), data types, missing values,
and any errors or inconsistencies. This gives you a clear view of the dataset’s structure.

3. Handle Missing Data

Missing values can mislead your results. You can either remove such data or fill it (impute) using methods
like mean, median, or predictive models. Choose the method based on how and why the data is missing.

4. Explore Data Characteristics

Calculate summary statistics (mean, median, std dev, skewness) to understand the distribution and
spread of data. This helps detect irregularities and determine suitable modelling techniques.

5. Transform the Data

Make data ready for modelling:

• Normalize/scale numeric values

• Encode categorical variables

• Apply log or square root transformations

• Create new features or aggregate data for better insights

6. Visualize Relationships

Use charts to explore patterns:

• Bar/pie charts for categorical data

• Histograms/box plots for numeric data

• Scatter plots/correlation heatmaps to find relationships between variables

7. Handle Outliers: Outliers are unusual data points that can distort analysis. Detect them using IQR, Z-
scores, or domain knowledge, and choose whether to remove or adjust them based on context.

8. Communicate Findings: Summarize your findings clearly using charts, tables, and key insights. Mention
limitations and suggest next steps. Effective communication ensures stakeholders understand and act on
the analysis.
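A minimal pandas walk-through of several of these steps (inspect, handle missing data, explore characteristics, transform, handle outliers) on a hypothetical CSV of sensor readings; the file name and columns are assumptions made for the example.

# Minimal EDA workflow sketch with pandas (file and column names are assumptions).
import numpy as np
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Inspect: shape, data types, missing values.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Handle missing data: impute numeric gaps with the median.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Explore characteristics: summary statistics and skewness.
print(df.describe())
print(df["temperature"].skew())

# Transform: log-scale a skewed feature, encode a categorical one.
df["power_log"] = np.log1p(df["power_kw"])
df = pd.get_dummies(df, columns=["device_type"])

# Handle outliers using the IQR rule.
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temperature"] < q1 - 1.5 * iqr) | (df["temperature"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier rows")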



The Tableau overview:

Tableau is one of the fastest and most powerful visualization tools. It provides features for cleaning, organizing, and visualizing data, making it easy to create interactive visual analytics in the form of dashboards. These dashboards make it easier for non-technical analysts and end users to turn data into understandable insights.

Tableau Features:

• Dashboard – A holistic and customizable visualization of an organization's data

• Collaboration – Share data and visualizations in real-time for live collaboration.

• Live and in-memory data – Use Tableau's live connection to extract data from the source or in-
memory.

• Advanced Visualization – Naturally, Tableau creates bar charts and pie charts. Still, its advanced
visualizations also include boxplots, bullet charts, Gantt charts, histograms, motion charts, and treemaps,
and that's just the tip of the iceberg.

• Maps – Tableau's map feature lets users see where trends are happening.

• Highly Robust Security – Tableau follows all industry best practices.

• Mobile View – Create dashboards and reports from your phone or tablet.

• Ask Data – Tableau lets users query data in natural language. Users don't have to be data scientists to find answers within data.

• Trend Lines and Predictive Analysis – Drag and drop technology creates trend lines for forecasting and
predictions.

• Cross-Database Join – Uncover insight through multiple datasets.

• Nested Sorting – Sort data from headers, or field labels.

• Drag-and-Drop Integration – Tableau's drag-and-drop feature creates fast user-driven customization and formatting.

• Data Connectors – Tableau supports dozens of data connectors.

• Text Editor – Format your text in a way that makes sense to you.

• Revision History – Revision history lets decision-makers and viewers see how the data has changed
over time.

• Licensing Views – All license holders will have viewing access to the dashboard and reports

• ETL Refresh – Automatically or manually refresh as new data is added

• Web Data Connector – Connect to the cloud and nearly every other online data source

• Split Function – Split data to create new fields in all supporting data sources



Values in Tableau:

There are two types of values in the tableau:

• Dimensions: Discrete, qualitative values (which do not change with respect to time) are called dimensions in Tableau. Example: city name, product name, country name.

• Measures: Continuous, quantitative values (which can change with respect to time and can be aggregated) are called measures in Tableau. Example: profit, sales, discount, population.

Techniques to understand data quality:

1. Look at your data - au naturel (data profiling): This involves analysing the raw data to understand its
structure, content, relationships, and statistical properties.

• Techniques:

o Data type analysis: Identifying the data types of each column (e.g., numeric, text, date)
and checking for inconsistencies.

o Value frequency distributions: Examining the frequency of values within columns to identify common values, outliers, and potential data entry errors.

o Statistical analysis: Calculating descriptive statistics (mean, median, mode, standard deviation) to understand the distribution and potential anomalies.

o Pattern detection: Identifying recurring patterns or formats, and deviations from those
patterns.

2. Data completeness: Assessing whether all required data is present and accounted for.

• Techniques:

o Missing value analysis: Identifying nulls, blank cells, or placeholder values that indicate
missing information.

o Attribute-level completeness: Evaluating the proportion of non-null values for each attribute or field within a dataset.

o Record-level completeness: Assessing whether entire records or entries in a dataset are complete, i.e., all required fields are populated.

o Business rule checks: Verifying if mandatory fields are populated according to defined
business rules.

3. Data validity: Ensuring the data conforms to defined standards, formats, and business rules.

• Techniques:

o Format validation: Checking if data adheres to specified formats (e.g., date formats,
email address formats, phone number formats).



o Range constraints: Ensuring values fall within acceptable ranges (e.g., age values
between 0 and 120, product prices above 0).

o Data type checks: Verifying that the data type of a column matches the expected data
type (e.g., numeric values in a numerical column).

o Referential integrity checks: Ensuring relationships between different datasets are valid
(e.g., foreign keys referencing existing primary keys).

o Business rule validation: Checking if data complies with specific business rules (e.g., a
customer's order quantity cannot exceed the available stock).

o Schema checks: Verifying data against predefined data structures and criteria.

4. Assessing information lag (data timeliness/recency): Evaluating how current and up-to-date the data is.

• Techniques:

o Timeliness analysis: Examining data creation and modification timestamps to determine how old the data is and whether it meets the requirements for a particular use case.

o Comparison with real-world events: Checking if the data reflects the most recent events
or changes in the real world.

o Identifying outdated information: Pinpointing data that is no longer relevant due to its
age.

5. Representativeness: Determining if the data accurately reflects the underlying population or phenomenon
it's intended to represent.

• Techniques:

o Sampling techniques: If dealing with a subset of data, assessing if the sampling method
used ensures a representative sample.

o Bias detection: Identifying if any biases are present in the data that might lead to
skewed or inaccurate insights.

o Domain expertise and contextual knowledge: Relying on domain knowledge to judge whether the data seems plausible and reflects reality.

Other important data quality dimensions and techniques:

• Accuracy: How well the data reflects reality and is free from errors.

o Techniques: Data profiling, validation, comparing data with reference sources.

• Consistency: Whether data stored in one place matches relevant data stored elsewhere,
avoiding contradictions.

o Techniques: Cross-dataset comparisons, consistency checks.



• Uniqueness (Integrity): Ensuring there are no duplicate records within a dataset or between
related datasets.

o Techniques: Duplicate detection, integrity checks.

• Usefulness/Relevance: Whether the data is applicable and valuable for solving problems and
making decisions.

o Techniques: Defining data quality metrics aligned with business objectives.
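A hedged pandas sketch of a few of the checks described above (profiling, completeness, validity, timeliness, uniqueness) on an assumed station-readings file; the file name and columns are invented for the example.

# Illustrative data-quality checks with pandas (column names are assumptions).
import pandas as pd

df = pd.read_csv("station_readings.csv", parse_dates=["reading_time"])

# Profiling: data types and value frequencies.
print(df.dtypes)
print(df["quality_flag"].value_counts(dropna=False))

# Completeness: share of non-null values per column.
print(df.notnull().mean().round(3))

# Validity: range constraint check (e.g. plausible temperatures).
invalid_temp = df[~df["temperature"].between(-50, 60)]
print(f"{len(invalid_temp)} rows outside the plausible temperature range")

# Timeliness: how stale is the newest record?
print(pd.Timestamp.now() - df["reading_time"].max())

# Uniqueness: duplicate station/time combinations.
print(df.duplicated(subset=["station_id", "reading_time"]).sum())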

Basic Time Series Analysis:

Time series analysis is a powerful technique used to understand trends, patterns, and seasonal variations in
data collected over time. It plays a critical role in fields such as finance, weather
forecasting, healthcare, energy, and retail, where predicting future values accurately is key to decision-
making. With the exponential growth in data availability, mastering time series analysis has become essential
for data scientists and analysts alike.

What is a Time Series?

A time series is a sequence of data points collected or recorded at specific and usually equally spaced
intervals over time. Unlike random or unordered data, time series data is inherently chronological, making
time a critical dimension for analysis. Each observation in a time series is dependent on previous values,
which differentiates it from other types of data structures.

Real-world examples of time series data include:

• Stock prices recorded every minute or day

• Temperature readings logged hourly or daily

• Electricity consumption measured every second

• Retail sales tracked weekly or monthly

• Website traffic monitored by hour or day



Time series data is widely used for forecasting, trend analysis, and anomaly detection. Its ability to capture
and model temporal patterns helps businesses and researchers make informed, data-driven decisions.

It is important to distinguish time series data from cross-sectional data, which captures observations at a
single point in time across multiple subjects (e.g., sales from different stores on the same day). While cross-
sectional analysis examines relationships among variables at a fixed time, time series analysis focuses on
understanding how a single variable evolves over time, taking into account temporal dependencies and
patterns like seasonality, trends, and cycles.

Components of Time Series Data:

Time series data is composed of several key components, each representing different underlying patterns.
Understanding these components is essential for accurate modelling and forecasting.

1. Trend

The trend represents the long-term direction in the data—whether it’s increasing, decreasing, or remaining
stable over time. Trends often emerge due to factors like economic growth, technological advancement, or
demographic shifts.

Example: The consistent rise in global average temperature over decades reflects a positive trend.

2. Seasonality

Seasonality refers to regular, repeating patterns observed over a fixed period, such as daily, weekly, monthly,
or yearly intervals. These variations are caused by external influences like weather, holidays, or business
cycles.

Example: Ice cream sales tend to spike during summer months every year, showing clear seasonal behaviour.

3. Cyclic Patterns

Cyclic variations are long-term fluctuations that do not follow a fixed frequency, unlike seasonality. These
cycles often correspond to economic or business cycles and can span multiple years.

Example: A country’s GDP might follow multi-year cycles of expansion and recession due to macroeconomic
factors.

4. Irregular (Random) Components

The irregular or residual component includes unpredictable and random variations that cannot be attributed
to trend, seasonality, or cyclic behaviour. These are typically caused by unexpected events like natural
disasters, pandemics, or sudden market shocks.

Example: A sudden drop in retail sales due to a nationwide strike would be considered an irregular
component.

Time Series Visualization Techniques:

Effective visualization is crucial in time series analysis. It helps identify patterns, trends, and anomalies that
may not be immediately obvious in raw data. Different types of plots highlight different aspects of time-
dependent behavior.



Common Plot Types:

• Line Charts: The most widely used method for visualizing time series data. Line plots provide a
clear view of how values change over time and are ideal for identifying trends and seasonality.
Use Case: Tracking monthly revenue over a year.

• Heatmaps: Heatmaps represent data in a matrix format where values are colored based on
intensity. In time series, they are especially useful for visualizing seasonal and daily patterns
over longer periods.
Use Case: Analyzing hourly website traffic over weeks.

• Seasonal Subseries Plots: These plots group time series data by season (month, quarter, etc.) to
highlight recurring seasonal patterns.
Use Case: Understanding month-wise sales fluctuations over several years.

Applying time series analysis:

Applying time series analysis typically involves these steps:

1. Data Collection and Cleaning: Gather relevant data and address issues like missing values.

2. Data Exploration and Visualization: Examine the data for patterns using visualizations.

3. Stationarity Assessment: Determine if the data's statistical properties are constant over time.

4. Decomposition: Break down the time series into its components.

5. Model Building: Select an appropriate model.

6. Model Evaluation: Assess the model's accuracy.

7. Forecasting/Prediction: Use the model to predict future values.

Tools and software:

Several tools and libraries are available, including Python libraries (pandas, NumPy, statsmodels, Prophet, scikit-learn), R, MATLAB, specialized time series databases (InfluxDB, TimescaleDB, Prometheus), and visualization tools (Matplotlib, ggplot2, Tableau, Grafana).
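As a small illustration of the decomposition step, the statsmodels sketch below splits a synthetic monthly series into trend, seasonal, and residual components; the synthetic series stands in for real IoT readings.

# Decompose a (synthetic) monthly time series into trend, seasonality, and residuals.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
series = pd.Series(
    10 + 0.2 * np.arange(48)                      # upward trend
    + 3 * np.sin(2 * np.pi * np.arange(48) / 12)  # yearly seasonality
    + rng.normal(0, 0.5, 48),                     # irregular component
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=12)
result.plot()
plt.show()

print(result.trend.dropna().head())
print(result.seasonal.head())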

Exploring data categories and missing values in Tableau:

1. Identifying Categories:

• Drag Dimensions to Rows/Columns: To identify categories, drag any dimension field (e.g.,
"Category", "Product Name", "Region", or "Measurement Flag fields" like Qgag and Qpcp
mentioned in the screenshot) from the Data pane to the Rows or Columns shelf. This will create
a list of all unique values within that dimension.

• Observe Distinct Values: The list generated by dragging the dimension to the Rows or Columns
shelf represents the categories within that field.



2. Identifying Missing Values (Nulls):

• Option 1: Using the Data Pane:

o Examine the Data pane for each field you are interested in.

o Click on the data type icon next to a field to open its context menu.

o Check for any indicators or information related to missing values (Null) within the
context menu or related descriptions.

• Option 2: Using the Worksheet (Visual Approach):

o Create a view by dragging a dimension to the Rows or Columns shelf and a measure to
the other shelf.

o Examine the generated view for gaps or breaks in the data that could indicate missing
values.

o Show Empty Rows/Columns: Go to Analysis > Table Layout > Show Empty
Rows/Columns to explicitly display rows or columns that have missing values.

o Show Missing Values: For date fields or numerical bins, right-click (control-click on
Mac) the date or bin headers and select Show Missing Values to display any missing
values within the range.

3. Comparing dimensions for interaction (Example: Qgag and Qpcp Measurement Flags):

• Create a Crosstab: Drag one of the "Measurement Flag" dimensions (e.g., Qgag) to the Rows
shelf.

• Drag the other "Measurement Flag" dimension (e.g., Qpcp) to the Columns shelf.

• Observe Co-occurrence of Nulls: The resulting view will display a crosstab showing the
relationship between the two dimensions. Look for rows and columns where both dimensions
display "Null" values, confirming their frequent co-occurrence.



Bring in geography:

Geography is a very powerful aid in understanding IoT data when devices are located in a variety of places.
Patterns that are not obvious from statistical analysis of the data can become very obvious when shown on a
map. Geography significantly influences how IoT data, particularly from weather stations, should be
interpreted, especially when analysing precipitation across a state.

1. Clustered weather stations and representative rainfall

• It is common to find weather stations clustered in certain areas, particularly in urban centres or
regions with higher population density, while other areas, such as mountainous or remote
regions, have sparse coverage.

• For instance, in southern India around Bangalore there is a relatively good density of rain gauges, while other areas of India have poor density.

• When weather station locations are not evenly distributed, simply averaging the rainfall
measurements from these stations may not accurately represent the actual average rainfall for
the entire state.

• This is because the arithmetic mean method assumes that each station represents an equal
corresponding area, an assumption that is violated when the stations are unevenly distributed.

2. Importance of spatial distribution and density

• Research indicates that both the density and the spatial distribution of rain gauge networks
play a crucial role in accurately calculating the areal average rainfall (AAR).

• Studies have shown that estimation error in calculating AAR is relatively small in areas with
well-distributed rain gauges but significantly increases when the stations are clustered.

• Even with a higher density of stations, if the spatial distribution is poor, the accuracy of the AAR
can be affected.

3. Techniques to address uneven distribution

• To mitigate the impact of uneven distribution, various techniques are employed, including:

o Isohyetal analysis: Drawing lines of equal rainfall (isohyets) on a map to estimate areal
precipitation.

o Thiessen polygon method: Assigning weights to stations based on the area each station
represents within a polygon network.

o Distance weighting/Gridded techniques: Assigning weights based on the distance of a grid point from observed station values.
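As a rough numeric illustration of the distance-weighting idea, the sketch below estimates rainfall at one grid point from a few made-up station readings using inverse-distance weights; a real areal-average workflow would repeat this over a whole grid and handle coordinates and distances much more carefully.

# Inverse-distance weighting (IDW) sketch for estimating rainfall at a grid point.
# Station coordinates and rainfall values are made up for illustration.
import numpy as np

stations = np.array([
    [12.97, 77.59],   # lat, lon
    [13.05, 77.40],
    [12.80, 77.75],
])
rainfall_mm = np.array([42.0, 35.0, 58.0])
grid_point = np.array([12.95, 77.60])

# Weight each station by 1 / distance^p (p = 2 is a common choice).
distances = np.linalg.norm(stations - grid_point, axis=1)
weights = 1.0 / distances**2
estimate = np.sum(weights * rainfall_mm) / np.sum(weights)

print(f"IDW rainfall estimate at grid point: {estimate:.1f} mm")
print(f"Simple (unweighted) mean for comparison: {rainfall_mm.mean():.1f} mm")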

4. Complementing station data

• In regions with poor rain gauge density, researchers explore using gauge-calibrated satellite
observations to supplement rainfall information and improve estimation accuracy.



Installing R and RStudio and Using R for statistical analysis:

https://www.datacamp.com/tutorial/r-studio-tutorial

Solving Industry-Specific Analysis Problems:

1. Manufacturing:

Challenges:

• Supply chain disruptions: Unforeseen events can interrupt the flow of raw materials and
finished goods, impacting production and delivery schedules.

• Quality control issues: Manufacturing processes can experience defects and inconsistencies,
leading to product recalls and customer dissatisfaction.

• Cost optimization: Identifying and eliminating inefficiencies in production processes is crucial for maintaining profitability.

• Predictive maintenance: Unexpected equipment breakdowns can lead to costly downtime and
production delays.

Data Analysis Solutions:

• Supply Chain Analytics: Analyse supplier performance, logistics data, and inventory levels to
predict and mitigate potential disruptions, according to Oracle.

• Quality Control Analytics: Analyse production data, sensor data, and defect rates to identify the
root causes of quality issues and implement corrective actions.

• Operational Efficiency Analytics: Analyse production line data, machine utilization, and waste
data to identify bottlenecks and areas for process improvement.

• Predictive Maintenance: Analyse equipment sensor data and historical maintenance records to
predict potential failures and schedule proactive maintenance, minimizing downtime.

2. Healthcare:

Challenges:

• Patient readmissions: High readmission rates indicate potential gaps in patient care and
increased healthcare costs.

• Operational efficiency: Optimizing resource allocation, staff scheduling, and inventory management is crucial for delivering quality care while controlling costs.

• Disease prevention and management: Identifying at-risk patients and proactively managing
chronic diseases can improve patient outcomes and reduce the burden on the healthcare
system.

• Fraud detection: Fraudulent claims and billing errors can impact the financial stability of
healthcare organizations.
Data Analysis Solutions:

• Predictive Analytics for Readmissions: Analyse patient data to identify individuals at high risk
of readmission and implement targeted follow-up care and interventions.

• Operational Analytics: Analyse patient admission rates, staffing schedules, and resource
utilization to optimize operations, improve efficiency, and reduce costs.

• Population Health Management: Analyse demographic data, patient histories, and social
determinants of health to identify populations at risk and design targeted preventive care
programs.

• Fraud Detection Analytics: Analyse claims data, billing patterns, and patient health histories to
identify anomalies and flag suspicious activities that may indicate fraud.

3. Retail:

Challenges:

• Inaccurate demand forecasting: Overstocking or understocking products can lead to losses and
missed sales opportunities.

• Rapidly changing trends: Retailers must adapt quickly to evolving consumer preferences and
market trends to remain competitive.

• Customer data protection: Safeguarding sensitive customer information is crucial for building
trust and maintaining compliance.

• Optimizing marketing campaigns: Effectively targeting marketing efforts to the right customers
at the right time is essential for maximizing ROI (Return on Investment).

Data Analysis Solutions:

• Demand Forecasting: Utilize historical sales data, market trends, and external factors to predict
future demand for products, optimizing inventory management and pricing strategies.

• Trend Analysis and Customer Segmentation: Analyse sales data, social media trends, and
customer feedback to identify emerging trends and segment customers based on their
preferences and purchasing habits.

• Data Security and Compliance: Implement robust security measures, access controls, and data
classification systems to protect customer data and adhere to privacy regulations.

• Customer Behaviour Analysis: Analyse customer browsing data, purchase history, and
demographics to create accurate customer profiles and personalize marketing campaigns,
improving conversion rates.



IoT Data set augmentation:

What is Data Augmentation?

Data augmentation is a technique used to artificially increase the size of a training dataset by applying
various transformations to the existing data. This technique is commonly used in machine learning and deep
learning tasks, especially in computer vision, to improve the generalization and robustness of the trained
models.

Augmented vs. synthetic data:

Data augmentation and synthetic data generation are distinct yet complementary techniques in machine
learning:

• Augmented data: This involves creating modified versions of existing data to increase dataset
diversity. For example, in image processing, applying transformations like rotations, flips, or
colour adjustments to existing images can help models generalize better.

• Synthetic data: This refers to artificially generated data, which allows researchers and
developers to test and improve algorithms without risking the privacy or security of real-world
data.

Why Use Data Augmentation?

1. Increased Dataset Size: By creating new samples from the existing data, data augmentation
effectively increases the size of the dataset, which can lead to better model performance.

2. Regularization: Data augmentation introduces additional variations in the data, which can help
prevent overfitting by providing the model with a more diverse set of examples.

3. Improved Generalization: By exposing the model to a wider range of variations in the data, data
augmentation helps the model generalize better to unseen examples.

Common Data Augmentation Techniques:

1. Rotation: Rotate the image by a certain angle (e.g., 90 degrees, 180 degrees).

2. Translation: Shift the image horizontally or vertically by a certain distance.

3. Scaling: Enlarge or shrink the image by a certain factor.

4. Flipping: Flip the image horizontally or vertically.

5. Shearing: Skew the image along the x or y-axis.

6. Zooming: Zoom in or out of the image.

7. Brightness Adjustment: Increase or decrease the brightness of the image.

8. Contrast Adjustment: Increase or decrease the contrast of the image.

9. Noise Addition: Add random noise to the image.
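The image-centric transformations above have simple time-series analogues for IoT sensor data. The sketch below applies a few of them (jitter, scaling, time shift) with NumPy to a made-up signal; the transforms and parameters are illustrative choices, not a prescribed method.

# Simple augmentation of a 1-D sensor signal: noise (jitter), scaling, and time shift.
import numpy as np

rng = np.random.default_rng(42)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + 20  # made-up temperature-like signal

def jitter(x, sigma=0.05):
    # Add Gaussian noise (analogue of image noise addition).
    return x + rng.normal(0, sigma, size=x.shape)

def scale(x, low=0.9, high=1.1):
    # Multiply by a random factor (analogue of zooming/scaling).
    return x * rng.uniform(low, high)

def time_shift(x, max_shift=10):
    # Roll the series by a random offset (analogue of translation).
    return np.roll(x, rng.integers(-max_shift, max_shift + 1))

augmented = [jitter(signal), scale(signal), time_shift(signal)]
print([a.shape for a in augmented])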



Data Augmentation Applications:

Healthcare:

Acquiring and labelling medical imaging datasets is time-consuming and expensive. You also need a subject
matter expert to validate the dataset before performing data analysis. Using geometric and other
transformations can help you train robust and accurate machine-learning models.

For example, in the case of Pneumonia Classification, you can use random cropping, zooming, stretching, and
colour space transformation to improve the model performance. However, you need to be careful about
certain augmentations as they can result in opposite results. For example, random rotation and reflection
along the x-axis are not recommended for the X-ray imaging dataset.

Self-Driving Cars:

There is limited data available on self-driving cars, and companies are using simulated environments to
generate synthetic data using reinforcement learning. It can help you train and test machine learning
applications where data security is an issue.



Decorating Your Data – Adding Internal Datasets in IoT:

IoT devices like sensors, cameras, and smart machines collect a lot of real-time data.
This data is useful, but by itself, it does not give the full picture of what’s happening in a business.
To make this data more valuable, companies can combine it with internal datasets they already have.
This process is called “decorating your data”, because it makes the data richer, more meaningful, and easier
to use for decision-making.

1. Customer Information:

This includes all the details a company knows about its customers, such as:

• Names and contact information

• Purchase history and product preferences

• Location and demographic details

• Service requests and complaint records

Why add it to IoT data?

• To create personalized services that match each customer’s habits.

• To connect IoT device usage patterns with specific customers.

• To send the right alerts, offers, or maintenance reminders to the right people.

Example in IoT:
A smart home energy meter collects electricity usage data. When combined with the customer’s profile, it can
send tips on how they can save money based on their lifestyle.
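
A minimal sketch of this kind of enrichment, assuming both datasets share a meter_id key; the column names, values, and the savings-tip rule are illustrative only (Python with pandas):

```python
# A sketch of enriching smart-meter readings with an internal customer dataset
# using pandas; all column names, values, and the tip rule are illustrative.
import pandas as pd

meter_readings = pd.DataFrame({
    "meter_id": ["M1", "M2", "M3"],
    "kwh_today": [12.4, 30.1, 8.7],
})

customers = pd.DataFrame({
    "meter_id": ["M1", "M2", "M3"],
    "customer_name": ["Asha", "Ravi", "Meena"],
    "tariff_plan": ["standard", "standard", "time-of-use"],
})

# "Decorate" the sensor data by joining on the shared meter_id key
enriched = meter_readings.merge(customers, on="meter_id", how="left")

# Example rule: send a savings tip to heavy users on the standard plan
tips = enriched[(enriched["kwh_today"] > 20) & (enriched["tariff_plan"] == "standard")]
print(tips[["customer_name", "kwh_today"]])   # -> Ravi, 30.1
```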

2. Production Data:

Data from manufacturing and production processes, such as:

• Machine performance logs

• Production schedules

• Quality control records

• Process parameters (temperature, pressure, speed, etc.)

Why add it to IoT data?

• To link machine sensor readings with production outcomes.

• To detect early signs of problems in machines or processes.

• To improve efficiency and reduce waste in manufacturing.

Example in IoT:

If vibration sensors on a machine show unusual patterns, production data can help identify which product
batches might have been affected, so they can be checked or stopped before shipping.
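
A minimal sketch of linking an anomaly time window from vibration sensors to production batch records, assuming batch logs carry start and end timestamps; all IDs, timestamps, and thresholds are illustrative:

```python
# A sketch of finding production batches that overlap a sensor anomaly window.
import pandas as pd

anomaly_start = pd.Timestamp("2024-03-05 10:15")
anomaly_end = pd.Timestamp("2024-03-05 11:40")

batches = pd.DataFrame({
    "batch_id": ["B-101", "B-102", "B-103"],
    "machine_id": ["MX-7", "MX-7", "MX-9"],
    "start_time": pd.to_datetime(["2024-03-05 09:00",
                                  "2024-03-05 10:30",
                                  "2024-03-05 10:45"]),
    "end_time": pd.to_datetime(["2024-03-05 10:20",
                                "2024-03-05 11:50",
                                "2024-03-05 12:00"]),
})

# Batches on the affected machine whose run overlaps the anomaly window
affected = batches[(batches["machine_id"] == "MX-7") &
                   (batches["start_time"] <= anomaly_end) &
                   (batches["end_time"] >= anomaly_start)]
print(affected["batch_id"].tolist())   # ['B-101', 'B-102']
```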

3. Field Service Data:

Information from maintenance and repair activities, such as:

• Service visit logs

• Repair history

• Technician reports

• Service schedules

Why add it to IoT data?

• To fix devices before they break down (predictive maintenance).

• To ensure technicians carry the right spare parts before visiting.

• To reduce downtime and improve customer satisfaction.

Example in IoT:
A connected elevator sends an error alert to the company. The system checks past service history and sees
the same problem happened twice before. It automatically sends a technician with the right replacement
part, avoiding multiple visits.
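
A minimal sketch of that lookup, assuming repair history is keyed by device and error code; the data structures, error codes, and part names are hypothetical:

```python
# A sketch of using field-service history to pre-select a spare part for a dispatch.
from typing import Optional

repair_history = {
    # (device_id, error_code) -> parts replaced in past service visits
    ("ELEV-102", "E42"): ["door_sensor", "door_sensor", "controller_board"],
}

def recommend_part(device_id: str, error_code: str) -> Optional[str]:
    """Return the most frequently replaced part for this device and error, if any."""
    past_parts = repair_history.get((device_id, error_code), [])
    if not past_parts:
        return None
    return max(set(past_parts), key=past_parts.count)

alert = {"device_id": "ELEV-102", "error_code": "E42"}
part = recommend_part(alert["device_id"], alert["error_code"])
print(part or "no specific part (diagnose on site)")   # -> door_sensor
```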

4. Financial Data:

All financial records related to the company’s operations, such as:

• Costs and expenses

• Sales and revenue

• Asset depreciation (loss of value over time)

• Budgets and forecasts

Why add it to IoT data?

• To see how much money IoT systems are saving or costing.

• To plan maintenance based on budget limits.

• To make cost-effective decisions.

Example in IoT:
A connected truck fleet tracks fuel use in real-time. When combined with financial data, managers can see
which trucks are using too much fuel and decide whether to repair or replace them.
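
A minimal sketch of combining fuel telemetry with financial figures to flag trucks for a repair-or-replace review; the costs, benchmark, and thresholds are illustrative assumptions:

```python
# A sketch of flagging trucks whose fuel use exceeds the budgeted benchmark.
fuel_cost_per_litre = 1.4            # from financial / fuel purchase records
benchmark_l_per_100km = 28.0         # from budget assumptions

trucks = [
    {"truck_id": "T-01", "litres": 480, "km": 1800},
    {"truck_id": "T-02", "litres": 900, "km": 2100},
]

for t in trucks:
    l_per_100km = t["litres"] / t["km"] * 100
    fuel_cost = t["litres"] * fuel_cost_per_litre
    status = "review (repair or replace?)" if l_per_100km > benchmark_l_per_100km else "ok"
    print(t["truck_id"], round(l_per_100km, 1), round(fuel_cost, 2), status)
```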

Benefits of Adding Internal Datasets:

• Better predictions – More data means better forecasting of problems and needs.

• Faster decisions – Managers get a complete picture of the situation.

• Happier customers – Services can be tailored to individual needs.

• Clear ROI – Easy to see how IoT investments are performing financially.

Challenges:

• Data privacy & security – Customer and company data must be protected.

• Integration issues – Old systems may be hard to connect with IoT platforms.

• Data quality problems – Internal and IoT data must be cleaned and matched.

• Legal compliance – Companies must follow data protection laws like GDPR.

Adding External Datasets in IoT:

While IoT devices collect valuable real-time sensor data, this information can become even more powerful
when combined with external datasets. External datasets come from public sources, government
databases, APIs, or research institutions. They add extra context to IoT data, improving analysis,
predictions, and decision-making.

1. External Datasets – Geography:

Geographical datasets help IoT systems understand location-based factors such as terrain, weather, and
transportation networks.

a. Elevation

Elevation data is important for applications like climate modelling, flood prediction, logistics planning,
and agriculture.

• SRTM Elevation (Shuttle Radar Topography Mission) – Satellite-based elevation data used
in mapping and terrain analysis. The Shuttle Radar Topography Mission (SRTM) is a joint
project between NASA and the National Geospatial-Intelligence Agency (NGA) that acquired
near-global elevation data using radar interferometry. Launched in February 2000 aboard the
space shuttle Endeavour, SRTM captured topographic data for approximately 80% of the
Earth's landmass, creating a comprehensive digital elevation model (DEM). This data is widely
used for various applications, including mapping, geographic information systems (GIS), and
environmental modeling.

• National Elevation Dataset (NED) – US-specific dataset with detailed height information for
environmental and engineering purposes. The National Elevation Dataset (NED) is a primary
elevation data product created and distributed by the U.S. Geological Survey (USGS). It provides
a seamless, raster-based digital elevation model (DEM) of the United States, including the
conterminous U.S., Alaska, Hawaii, and territorial islands. The NED is a key resource for various
applications, including hydrologic modeling, flood analysis, and geographic information system
(GIS) applications.

b. Weather

Weather data can help IoT systems adjust operations for safety, efficiency, and comfort. The National
Oceanic and Atmospheric Administration (NOAA) is a key source of weather-related data, both for public and
commercial use, and it offers a variety of free resources. NOAA's Open Data Dissemination (NODD) program
makes a vast amount of environmental data available on cloud platforms like Amazon Web Services,
Microsoft Azure, and Google Cloud. This includes data from satellites, sensors, models, and forecasts, which
are crucial for understanding the Earth system and its impacts.

Example: Smart irrigation systems can use weather data to avoid watering before rain.
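
A minimal sketch of such weather-aware logic; get_rain_probability() is a placeholder for whatever forecast source is used (for example a NOAA feed or a commercial API), and the moisture and rain thresholds are illustrative:

```python
# A sketch of weather-aware irrigation logic with a stubbed forecast source.
def get_rain_probability(lat: float, lon: float) -> float:
    """Placeholder: return the forecast probability of rain (0.0 to 1.0)."""
    return 0.7  # stubbed value for illustration

def should_irrigate(soil_moisture: float, lat: float, lon: float) -> bool:
    # Skip watering if the soil is already moist or rain is likely soon
    if soil_moisture > 0.35:
        return False
    if get_rain_probability(lat, lon) > 0.6:
        return False
    return True

print(should_irrigate(soil_moisture=0.22, lat=17.38, lon=78.48))   # -> False
```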

c. Geographical Features

These datasets describe land use, roads, rivers, cities, and other physical features.

• Planet.osm – OpenStreetMap data with roads, buildings, and landmarks.

(Figure: OpenStreetMap close-up view of London with the GPS traces layer turned on.)

• Google Maps API – Provides location, routing, and traffic data for integration with IoT apps.


• USGS National Transportation Datasets – US Geological Survey data on roads, railways, and
transportation infrastructure.

2. External Datasets – Demographic: Demographic datasets describe population characteristics such as
age, income, education, and population density. These are useful for public planning, marketing, and
service optimization.

• The U.S. Census Bureau – Official US population data with detailed demographics and
economic characteristics.

• CIA World Factbook – Global demographic and economic data for almost every country,
including population size, literacy rates, and GDP.

3. External Datasets – Economic: Economic datasets provide financial and market data that can help in
business forecasting and policy-making.

• Organization for Economic Cooperation and Development (OECD) – International statistics on
economy, trade, education, and development.

• Federal Reserve Economic Data (FRED) – US-based database with interest rates, inflation,
employment, and other economic indicators.

Benefits of Adding External Datasets:

• Better Predictions – Combining sensor data with weather, location, or economic data improves
forecasting.

• Enhanced Decision-Making – Gives a broader view for smarter strategies.

• Global Context – Allows IoT systems to work effectively across different regions.

• Customized Services – Tailors operations to specific locations or demographics.

Example Use Cases:

• Smart Farming – IoT soil sensors + weather forecast data = optimized watering and planting.

• Smart Transportation – GPS tracking + traffic datasets (Google Maps API) = faster route
planning.

• Urban Planning – Air quality sensors + demographic data = targeted environmental improvements.
