IoT UNIT-II
a) Descriptive Analytics
Purpose: Answers “What happened?” by summarizing historical data.
Techniques: Time-series analysis, trend charts, aggregated reports.
Example: Analyzing temperature patterns over a year in smart buildings to optimize HVAC scheduling.
b) Diagnostic Analytics
Purpose: Explains “Why did it happen?” by identifying root causes.
Methods: Correlation, event tracing, anomaly detection.
Example: Diagnosing machine failure in manufacturing using sensor data like vibration or temperature.
d) Prescriptive Analytics
Purpose: Answers “What should we do?” by recommending actions.
Techniques: Optimization algorithms, decision engines.
Example: Real-time flight rerouting based on current atmospheric data.
1. Real-Time Analytics
• It focuses on what is happening right now rather than just analysing the past.
Key Features:
• Continuous updates
• Event-driven actions
Examples:
• Ride-hailing apps (like Uber): Analysing real-time location, traffic, and demand to match
drivers with riders instantly.
2. Behavioural Analytics
• Focuses on analysing user behaviours and patterns to understand how people interact with a
product, service, or system.
Key Features:
Examples:
• Retail RFID tracking: Sensors track how long a customer spends near a product; the system
can suggest promotions or layout improvements.
3. Anomaly Detection Analytics
• A form of analytics that identifies unusual patterns or outliers that deviate from the norm.
Key Features:
• Pattern recognition
• Alert systems
Examples:
Amazon Kinesis is a fully managed, real-time data streaming service offered by AWS that lets you collect,
process, and analyze large streams of data as they are generated. It's ideal for use cases that require real-time
insights, such as analytics, machine learning, IoT monitoring, and live dashboards.
How It Works:
1. Data ingestion: Producers (apps, sensors, logs, IoT devices, video/audio streams) send data in formats
like JSON or binary into Kinesis.
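As a hedged illustration of this ingestion step, the sketch below shows a producer pushing one JSON record into Kinesis with boto3; the stream name, region, and record fields are assumptions.

```python
# Minimal producer sketch with boto3; stream name and fields are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"device_id": "sensor-42", "temperature": 23.7, "ts": "2024-01-01T00:00:00Z"}

# The partition key determines which shard receives the record.
response = kinesis.put_record(
    StreamName="iot-telemetry",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```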
Cost-Efficiency
• Pay-as-you-go pricing per shard, data volume, and processing—no upfront cost.
• Data is replicated across AZs. Streams retention defaults to 24 hrs (extendable to 7 days).
Real-Time Processing
Examples:
▪ Firehose loads results into S3 or Redshift for long-term storage and BI.
2. Clickstream Aggregation
▪ Firehose batches and delivers the data to OpenSearch, enabling real-time dashboarding.
▪ ML models (e.g., via SageMaker) analyse video for activity recognition or object
detection in real-time.
AWS Lambda is a powerful serverless computing service that automatically runs code in response to
events, without requiring you to manage the underlying infrastructure. It supports event-driven applications
triggered by events such as HTTP requests, DynamoDB table updates, or state transitions. You simply upload
your code (as a .zip file or container image), and Lambda handles everything from provisioning to scaling and
maintenance. It automatically scales applications based on traffic, handling server management, auto-scaling,
security patching, and monitoring. AWS Lambda is ideal for developers who want to focus on writing code
without worrying about infrastructure management.
AWS Lambda functions are serverless compute functions fully managed by AWS, so developers can run their
code without worrying about servers. Lambda lets you run code without provisioning or managing servers.
Once you upload the source code to AWS Lambda as a ZIP file, Lambda runs the code without you provisioning
any servers, and it automatically scales your functions up or down based on demand. Lambda is most often
used for event-driven applications, such as processing data from Amazon S3 buckets or responding to HTTP
requests.
Example:
4. Scheduled Tasks
8. Infrastructure Automation
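As a minimal sketch of the handler convention described above, the function below follows Lambda's standard Python entry point; the event fields are hypothetical and depend on the triggering service.

```python
# lambda_function.py -- a minimal Lambda handler sketch.
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload (e.g., an S3 notification or an HTTP request body).
    records = event.get("Records", [])
    processed = [r.get("eventName") for r in records]
    return {
        "statusCode": 200,
        "body": json.dumps({"processed_events": processed}),
    }
```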
c. Amazon Athena:
AWS Athena is a serverless, interactive query service provided by Amazon Web Services. It lets you use
standard SQL to analyse structured and semi-structured data stored in Amazon S3 — without the need for
setting up servers, databases, or ETL pipelines.
Key Advantages:
• Log analytics
1. Serverless Architecture
• Cost model: You pay only for the amount of data scanned by your queries.
• AWS Glue Data Catalog: Stores metadata like table definitions, schemas, and locations.
• Supports:
• Great for teams already familiar with SQL—no need to learn a new query language.
▪ Improves performance
▪ Lowers costs
• Scalability: Athena handles datasets of any size, automatically running queries in parallel.
• Data Partitioning:
▪ Speeds up queries
• Performance tips:
▪ Use Parquet/ORC
• IAM Integration:
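To tie the points above together, here is a hedged sketch that submits a standard SQL query to Athena with boto3; the database, table, partition columns, and result bucket are assumptions.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Standard SQL over data in S3; results land in the (hypothetical) bucket below.
query = """
    SELECT device_id, avg(temperature) AS avg_temp
    FROM iot_logs.sensor_readings
    WHERE year = '2024' AND month = '06'   -- partition pruning keeps scanned data small
    GROUP BY device_id
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "iot_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```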
Comparison of Amazon Athena, Amazon Redshift, AWS Glue, and Microsoft SQL Server:
• Description: Athena is a serverless, interactive query service; Redshift is a data warehousing platform; Glue is a serverless data integration & ETL service; Microsoft SQL Server is a Relational Database Management System (RDBMS).
• Use Cases: Athena for ad-hoc querying of S3 data and big data analytics; Redshift for fast querying of large, structured datasets using clusters; Glue for creating, editing, and retrieving tables plus batch ELT and streaming processing; Microsoft SQL Server for BI, analytics, transaction processing, and all SQL operations.
• Data Types Supported: Athena and Redshift support structured and semi-structured data; Glue supports structured, semi-structured, and unstructured data; Microsoft SQL Server supports structured data.
• File Formats Supported: Athena supports CSV, TSV, JSON, Parquet, ORC, Avro, Apache logs, and custom text files; Redshift supports conversion to columnar formats and Redshift-native table types; Glue supports CSV, JSON, ORC, Parquet, and MS Excel; Microsoft SQL Server supports XML and non-XML formats.
• Framework / SQL Engine: Athena uses Presto with ANSI SQL; Redshift is PostgreSQL (8.0.2 compatible); Glue uses a Python/Scala-based ETL scripting engine; Microsoft SQL Server uses Transact-SQL.
• S3 Integration: Athena queries S3 directly with no ETL, creating external tables; Redshift requires ETL and cluster setup before querying; Glue crawls and catalogs S3 data and helps prepare data for Athena; Microsoft SQL Server is not natively integrated with S3.
• Startup Time: Athena is instant (within seconds); Redshift takes 15–60 minutes to start up; Glue varies by job type; Microsoft SQL Server is instant (on pre-configured instances).
• Pricing Model: Athena charges $5 per TB of data scanned; Redshift charges for both compute and storage; Glue charges $0.44 per DPU-hour (billed per second); Microsoft SQL Server ranges from free (Express) to $15,123 (Enterprise edition).
• ETL Requirement: Athena requires no ETL; Redshift requires ETL before querying; Glue provides full ETL support with job scheduling and transformation; Microsoft SQL Server is typically used with ETL pipelines.
• Ideal For: Athena for on-demand SQL queries over S3 with minimal setup; Redshift for high-performance analytics on large, structured datasets; Glue for data discovery, preparation, and transformation; Microsoft SQL Server for traditional DB applications and enterprise workloads.
AWS IoT Core is a fully managed, serverless platform from AWS that simplifies connecting IoT devices to the cloud. It
handles provisioning, authentication, secure communication, messaging, state management, and routing—
letting you focus on device logic instead of cloud infrastructure.
1. FreeRTOS
• Comes with AWS IoT libraries pre-integrated for secure device–cloud connectivity.
Example: A fitness-tracking bracelet using FreeRTOS can stream heart rate and steps directly to AWS with
library/kernel support.
2. Device Registry
• A centralized metadata store for tracking “Things” (physical devices): models, serials, firmware
versions, etc.
3. Device SDK
• Open-source SDKs (C++, Java, Python, JS, Android, iOS) for secure, bi-directional device-cloud
interactions.
4. Authentication & Authorization
• Authentication: Uses X.509 certs/TLS to verify device identity and AWS endpoint legitimacy.
• Authorization: IoT Core policies (IAM-like JSON documents) define what devices can do—
publish, subscribe, invoke AWS services.
5. Message Broker
• Decouples publishers and subscribers (e.g., a bracelet publishes to a topic like StartupX/smart-
bracelets/ModelX/bracelet7).
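As a hedged illustration of publishing to such a topic, the sketch below uses the AWS IoT Device SDK v2 for Python (the "awsiotsdk" package); the endpoint, certificate file paths, client ID, and payload fields are placeholders.

```python
# Minimal MQTT publish sketch over mutual TLS; all identifiers below are placeholders.
import json
from awscrt import mqtt
from awsiot import mqtt_connection_builder

connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxxxxxx-ats.iot.us-east-1.amazonaws.com",  # account-specific endpoint
    cert_filepath="device.pem.crt",        # X.509 device certificate
    pri_key_filepath="private.pem.key",    # device private key
    ca_filepath="AmazonRootCA1.pem",       # Amazon root CA
    client_id="bracelet7",
)

connection.connect().result()              # blocks until the TLS/MQTT handshake completes

payload = json.dumps({"heart_rate": 72, "steps": 4521})
connection.publish(
    topic="StartupX/smart-bracelets/ModelX/bracelet7",
    payload=payload,
    qos=mqtt.QoS.AT_LEAST_ONCE,
)

connection.disconnect().result()
```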
6. Rules Engine
• Processes inbound messages and triggers actions—e.g., store readings in DynamoDB, forward
to Kinesis, invoke Lambda, or persist to S3.
7. Device Shadow
• A JSON device state “shadow” in the cloud, storing desired and reported states.
• Enables offline state sync: updates persist to shadow, and devices retrieve missing changes
when they reconnect.
8. Jobs
• Supports remote orchestration: tasks like firmware updates, config pushes, and certificate
rotations across Thing Groups.
9. Thing Groups
• Groups of similar devices (static or dynamic) to collectively manage policies, firmware updates,
and jobs—ideal for large fleets.
10. Secure Tunneling
• Enables secure remote access to devices behind firewalls, without altering network
configurations—ideal for troubleshooting.
• AWS IoT Greengrass: Extends compute to the edge; run Lambda, ML inference on local
devices when offline.
• AWS IoT Device Defender: Continuously audits security policies, monitors device behavior,
and alerts on deviations.
• AWS IoT Device Management: Fleet-wide device registration, health monitoring, OTA
updates.
This diagram illustrates how AWS IoT Core interacts with a connected wearable device (e.g., a fitness
tracker), its Device Shadow, and an application or user interface. The workflow demonstrates bi-
directional communication using AWS IoT services and the Device Shadow feature.
1. Device publishes current state: The IoT device (e.g., smartwatch) sends its current status:
This data is published to AWS IoT Core using the Device SDK over MQTT or HTTP.
2. Persist to Data Store: AWS IoT receives this message and stores the data in a cloud data store (e.g.,
DynamoDB, S3) using Rules Engine or lambda triggers.
3. App requests device status: A mobile or web app queries AWS IoT to get the latest state of the device.
4. App requests status changes: The app may send a desired state (e.g., "enable sleep mode", or "increase
step goal").
5. Device Shadow syncs updated state: AWS IoT Device Shadow receives this "desired state" and syncs it
with the current reported state. When the device connects again, it checks the shadow and retrieves the
update.
7. Device Shadow confirms state changes: AWS IoT Shadow service now confirms the new state and
updates the user/app interface with the most current status.
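As a hedged illustration of steps 3 to 5, the sketch below uses boto3's "iot-data" client to read and update a Device Shadow from the backend; the thing name and state fields are hypothetical.

```python
# Sketch: reading and updating a Device Shadow from a backend app via boto3 ("iot-data").
import json
import boto3

iot_data = boto3.client("iot-data", region_name="us-east-1")

# App requests the latest reported/desired state.
shadow = iot_data.get_thing_shadow(thingName="smartwatch-007")
state = json.loads(shadow["payload"].read())
print(state["state"].get("reported"))

# App requests a state change; the device picks it up from the shadow when it reconnects.
desired = {"state": {"desired": {"sleep_mode": True, "step_goal": 12000}}}
iot_data.update_thing_shadow(thingName="smartwatch-007",
                             payload=json.dumps(desired).encode("utf-8"))
```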
Azure IoT Hub, a cloud service provided by Microsoft, is a fully managed service that enables organizations to
manage, monitor, and control IoT devices. In addition, Azure IoT Hub enables reliable, secure bidirectional
communications between IoT devices and its cloud-based services. It allows developers to receive messages
from, and send messages to, IoT devices, acting as a central message hub for communication. It can also help
organizations make use of data obtained from IoT devices, transforming IoT data into actionable insights.
1. Device Connectivity:
This section shows how different types of IoT devices connect to the cloud-based IoT Hub.
Types of Devices:
1. IP-capable devices
✓ These are advanced devices (like smart appliances or industrial machines) that can
connect directly to the cloud.
▪ HTTPS
✓ They use the IoT device library to securely send data directly to the IoT Hub.
2. Existing IoT devices
✓ They may not directly support the required protocols for cloud connection.
3. Low-power devices
✓ These cannot connect to the cloud directly due to limitations like power or protocol
support.
✓ They send data locally to an IoT Field Gateway, which aggregates and forwards it to
the cloud using AMQP.
Optional Components:
Once the data reaches the IoT Hub, it is managed by the IoT Solution Backend, which includes:
✓ Handles streaming data from connected devices (like telemetry, sensor data).
✓ Can be processed in real time using services like AWS Lambda, Azure Functions, or
Apache Kafka.
✓ Sends commands, configurations, or firmware updates from the cloud back to the
device.
a. Edge Storage
Edge storage involves storing data on local devices or near the data source, rather than transmitting
it to a centralized data centre. This mitigates latency issues by processing data close to where it is
generated, reducing bandwidth usage on networks. Examples of use cases include manufacturing
plants and autonomous vehicles.
Example: Autonomous Vehicles
• Self-driving cars use local edge processors to instantly analyze data from cameras, LiDAR, and
radar.
• Decisions like braking or steering are made on the spot, without needing cloud access.
• Later, trip logs or training data can be uploaded to the cloud when parked.
b. Cloud Storage
Cloud storage is scalable and flexible, leveraging the cloud’s resources to store data remotely. This
allows IoT deployments to expand storage capacity as needed without investing in physical
infrastructure. However, relying solely on cloud storage can introduce latency issues due to data
having to travel from the IoT devices to the cloud. Data caching and choosing cloud data centers
located nearer to the data sources can help mitigate these latency problems.
Example: Smart Agriculture
• IoT sensors in large farms collect temperature, humidity, and soil moisture levels.
• Data is pushed to the cloud for central monitoring and long-term analysis.
• AI models in the cloud predict optimal irrigation schedules or detect crop disease trends.
• Database technologies: Databases support structured and semi-structured data storage for IoT
systems:
▪ Time-series databases, such as InfluxDB and TimescaleDB, are optimized for storing sequential data
generated by IoT devices. They offer efficient data compression and specialized query capabilities to
handle large volumes of timestamped data.
▪ NoSQL databases like Cassandra and MongoDB provide flexibility, scalability, and high performance,
managing the varied data from IoT devices. They support a schema-less data model, allowing them
to handle different data types.
• File systems: Suitable for IoT environments requiring high throughput and low-latency data access, file
systems like ZFS or Btrfs provide features like data integrity checking and snapshot capabilities. This is
useful for IoT applications that may need to restore historical data states.
• Block storage systems: These ensure high-performance data access for IoT applications, especially for
real-time processing and analysis. They are useful for data requiring immediate storage and retrieval.
Examples include iSCSI and Fiber Channel.
• Object storage solutions: These can handle unstructured data, such as video and images, from devices
like surveillance cameras or drones. Solutions like Amazon S3 in the cloud and Cloudian for on-premises
storage offer scalability and durability. Users can store, retrieve, and manage data non-sequentially.
• Data warehouses: These provide a structured format for querying and analyzing data, suitable for
structured data in IoT scenarios where response times are important. They allow for complex queries
and reporting on IoT data that has been processed and normalized.
• Data lakes: These offer a more flexible environment suitable for storing raw, unstructured data from
IoT devices. Technologies like Hadoop or Azure Data Lake can handle large amounts of heterogeneous
IoT data, enabling later refining and analysis.
Here are some of the main challenges associated with storing IoT data and how to address them.
Problem: IoT devices generate huge amounts of data that traditional storage systems can't handle easily.
Solution: Use cloud storage or distributed file systems that can grow as needed (called horizontal scaling) to
store and manage increasing data smoothly.
Problem: Many IoT applications (like smart cars or health monitors) need data to be processed immediately.
Any delay (latency) can cause problems.
Solution:
• Use edge computing – process data near the source (like on the device) to reduce delay.
• Use fast storage tools like in-memory databases and caching to speed up data access.
Problem: IoT devices often collect private and sensitive data, making them targets for cyberattacks.
Solution:
Problem: Devices from different brands often don’t work well together due to lack of common standards.
Solution:
Big Data refers to very large, fast, and complex sets of data that cannot be easily managed or processed
using traditional tools like spreadsheets or basic databases.
• Online transactions
1. Volume
2. Velocity
3. Variety
▪ The different types of data formats: Text, images, audio, video, log files, etc.
4. Veracity
5. Value
▪ Big data is only valuable if it produces insights that lead to better decisions.
Traditional databases (like MySQL or Excel) cannot handle the size and speed of big data. So companies use:
2. Cloud Platforms
• AWS, Microsoft Azure, Google Cloud provide scalable tools to store and analyze big data.
3. Analytics Tools
• Machine Learning & AI are used to find patterns and predictions from big data.
Hadoop:
Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine. It
enables big data analytics processing tasks to be broken down into smaller tasks that can be performed in
parallel by using an algorithm (like the MapReduce algorithm), and distributing them across a Hadoop
cluster. A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform these kinds of parallel computations on big data sets. Unlike traditional storage systems, Hadoop
excels in handling diverse data types—structured, semi-structured, and unstructured—distributed across
multiple nodes for fault tolerance and parallel processing.
Hadoop is a master-slave model made up of three components. Within a Hadoop cluster, one machine in the
cluster is designated as the NameNode and another machine as the JobTracker, these are the masters. The
rest of the machines in the cluster are DataNodes and TaskTrackers, these are the slaves. The masters
coordinate the roles of many slaves. The table below provides more information on each component.
• Master Node — The master node in a Hadoop cluster is responsible for storing data in the Hadoop
Distributed Filesystem (HDFS). It also executes the computation of the stored data using MapReduce, which
is the data processing framework. Within the master node, there are three additional nodes: NameNode,
Secondary NameNode, and JobTracker. NameNode handles the data storage function with HDFS and
Secondary NameNode keeps a backup of the NameNode data. JobTracker monitors the parallel processing of
data using MapReduce.
• Slave/Worker Node — The slave/worker node in a Hadoop cluster is responsible for storing data and
performing computations. The slave/worker node is comprised of a TaskTracker and a DataNode. The
DataNode service communicates with the Master node in the cluster.
• Client Nodes — The client node is responsible for loading all the data into the Hadoop cluster. It submits
MapReduce jobs and outlines how the data needs to be processed, then retrieves the output once the job is
complete.
Hadoop clusters can be configured as either single-node or multi-node systems, each catering to distinct
needs based on workload complexity and scale. Selecting the appropriate configuration ensures operational
efficiency and scalability while minimizing risks.
In a single-node cluster, all processes, including the NameNode, DataNode, ResourceManager, and
NodeManager, operate on a single machine. This setup is best suited for testing or development
environments, where simplicity and minimal resource allocation are priorities. For instance, a startup
experimenting with recommendation algorithms might utilize a single-node cluster to validate ideas before
transitioning to a production-ready system.
Since this configuration is limited in fault tolerance—any failure impacts all processes—it’s ideal for proof-
of-concept testing. Basic manual checks are sufficient to ensure functionality during development.
2. Multi-node clusters
Multi-node clusters distribute processes across multiple machines, enabling large-scale data processing. This
configuration is designed for production environments where massive datasets and complex workflows are
involved. For example, streaming platforms like Netflix rely on multi-node clusters to process global
viewership data, enabling content recommendations and insights.
Advanced tools such as Apache Ambari or Cloudera Manager play a critical role in multi-node setups,
providing real-time monitoring of node health and performance. These clusters also excel in fault tolerance,
as HDFS replicates data across nodes, ensuring operational continuity even if a node fails.
The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem, designed
to store and manage large volumes of data across multiple machines in a distributed manner. It provides
high-throughput access to data, making it suitable for applications that deal with large datasets, such as big
data analytics, machine learning, and data warehousing.
The Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant storage solution designed for large
datasets. It consists of NameNode (manages metadata), DataNodes (store data blocks), and a client interface.
Key advantages include scalability, fault tolerance, high throughput, cost-effectiveness, and data locality,
making it ideal for big data applications.
HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of massive
datasets. Its architecture consists of several key components:
1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
NameNode
The NameNode is the master server that manages the filesystem namespace and controls access to files by
clients. It performs operations such as opening, closing, and renaming files and directories. Additionally, the
NameNode maps file blocks to DataNodes, maintaining the metadata and the overall structure of the file
system. This metadata is stored in memory for fast access and persisted on disk for reliability.
Key Responsibilities:
DataNode
DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as
instructed by the NameNode. Each DataNode manages the storage attached to it and periodically reports
the list of blocks it stores to the NameNode.
Key Responsibilities:
• Performing block creation, deletion, and replication upon instruction from the NameNode.
• Periodically sending block reports and heartbeats to the NameNode to confirm its status.
Secondary NameNode
The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for merging the
EditLogs with the current filesystem image (FsImage) to reduce the potential load on the NameNode. It
creates checkpoints of the namespace to ensure that the filesystem metadata is up-to-date and can be
recovered in case of a NameNode failure.
Key Responsibilities:
HDFS Client
The HDFS client is the interface through which users and applications interact with the HDFS. It allows for
file creation, deletion, reading, and writing operations. The client communicates with the NameNode to
determine which DataNodes hold the blocks of a file and interacts directly with the DataNodes for actual data
read/write operations.
Key Responsibilities:
• Communicating with the NameNode for metadata and with DataNodes for data access.
Block Structure
HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is stored
independently across multiple DataNodes, allowing for parallel processing and fault tolerance. The
NameNode keeps track of the block locations and their replicas.
Key Features:
• Large block size reduces the overhead of managing a large number of blocks.
• Blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance.
Example: Let's assume a 100 TB file is inserted. The master node (NameNode) first divides the file into
blocks (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across the
DataNodes in the cluster, and each block is replicated (three copies by default) for fault tolerance.
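The short calculation below works through the example's numbers under the default 128 MB block size and 3-way replication.

```python
# Worked example: how many 128 MB blocks does a 100 TB file occupy, and how much
# raw capacity does the default 3-way replication consume?
BLOCK_SIZE_MB = 128
REPLICATION = 3

file_size_mb = 100 * 1024 * 1024                  # 100 TB expressed in MB
num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)    # ceiling division
raw_storage_tb = file_size_mb * REPLICATION / (1024 * 1024)

print(f"Blocks: {num_blocks:,}")                              # 819,200 blocks
print(f"Raw storage with replication: {raw_storage_tb} TB")   # 300.0 TB
```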
3. Parquet:
Apache Parquet is an open-source columnar storage format that addresses big data processing challenges.
Unlike traditional row-based storage, it organizes data into columns. This structure allows you to read only
the necessary columns, making data queries faster and reducing resource consumption.
Columnar storage
Unlike row-based formats like CSV, Parquet organizes data in columns. This means when we run a query, it
only pulls the specific columns we need instead of loading everything. This improves performance and
reduces I/O usage.
In addition, Parquet files store extra information in the footer, called metadata, which locates and reads only
the data we need.
Row groups
• A row group contains multiple rows but stores data column-wise for efficient reading.
• Example: A dataset with 1 million rows might be split into 10 groups of 100,000 rows each.
Column chunks
• This design allows columnar pruning, where we can read only the relevant columns instead of
scanning the entire file.
Pages
• Each column chunk is further split into pages to optimize memory usage.
As mentioned, Parquet compresses data column by column using compression methods like Snappy and
Gzip. It also uses two encoding techniques: dictionary encoding and run-length encoding (RLE).
This reduces file sizes and speeds up data reading, which is especially helpful when you work with big data.
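A minimal sketch, assuming pandas with the pyarrow engine installed, showing a Snappy-compressed write and a column-pruned read; the file name and columns are made up.

```python
# Write a Parquet file with Snappy compression, then read back only the needed columns.
import pandas as pd

df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "student_name": ["Asha", "Ravi", "Meena"],
    "student_age": [20, 21, 22],
})

df.to_parquet("students.parquet", compression="snappy")

# Columnar pruning: only the requested column is read from disk.
names_only = pd.read_parquet("students.parquet", columns=["student_name"])
print(names_only)
```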
Schema evolution
Schema evolution means modifying the structure of datasets, such as adding or altering columns. It may
sound simple, but depending on how your data is stored, modifying the schema can be slow and resource-
intensive.
Suppose you have a CSV file with columns like student_id, student_name, and student_age. If you want to add
a new scores column, you’d have to do the following:
3. Add a score for each student. This means appending values for all rows (even if they are
missing, you may need placeholders like empty strings or NULL).
CSV is a simple text-based format with no built-in schema support. This means any change to the structure
requires rewriting the entire file, and older systems reading the modified file might break if they expect a
different structure!
With Parquet, you can add, remove, or update fields without breaking your existing files. As we saw before,
Parquet stores schema information inside the file footer (metadata), allowing for evolving schemas without
modifying existing files.
• When you add a new column, existing Parquet files remain unchanged.
• New files will include the additional column, while old files still follow the previous schema.
• Removing a column doesn’t require reprocessing previous data; queries will ignore the missing
column.
• If a column doesn’t exist in an older file, Parquet engines (like Apache Spark, Hive, or BigQuery)
return NULL instead of breaking the query.
• Newer Parquet files with additional columns can still be read by systems expecting an older
schema.
Parquet supports different programming languages, such as Java, Python, C++, and Rust. This means
developers can easily use it regardless of their platform. It is also natively integrated with big data
frameworks like Apache Spark, Hive, Presto, Flink, and Trino, ensuring efficient data processing at scale.
So whether you're using Python (through PySpark) or another language, Parquet can manage the data in a
way that makes it easy to query and analyze across different platforms.
4. Avro:
Data Serialization: Data Serialization is the process wherein data structures or object states are converted
into a format that can be stored, transported, and subsequently reconstructed. Often used in data storage,
remote procedure calls (RPC), and data communication, serialization facilitates complex data processing and
analytics by making data more accessible and portable.
Data Serialization operates by converting intricate data structures into a byte stream, enabling effective data
transfer across networks. Its features include:
• Data Persistence: Serialization helps in saving the state of an object to a storage medium and later
retrieving it.
• Data Exchange: It allows transmitting data over a network in a form that the network can understand.
• Remote Procedure Calls (RPCs): They can be made as though they are local calls via serialization.
Architecture:
The architecture of data serialization is based on two main components: the serializer and deserializer. The
serializer converts object data into a byte stream, while the deserializer reconverts the byte stream to
replicate the original object data structure.
• Enhances Data Interchange: Data exchange between different languages or platforms is made possible
through serialization.
• Enables Data Persistence: Serialized data can be stored and recovered efficiently, making it beneficial for
applications like caching, session state persistence, etc.
What is Avro?
Apache Avro is a data serialization system developed by the Apache Software Foundation that is used for big
data and high-speed data processing. It provides rich data structures and a compact, fast, binary data format
that can be processed quickly and in a distributed manner. Avro has wide use in the Hadoop ecosystem and is
often used in data-intensive applications, such as data analytics.
• Schema definition: Avro data is always associated with a schema written in JSON format.
• Language-agnostic: Avro libraries are available in several languages including Java, C, C++, C#, Python,
and Ruby.
• Dynamic typing: Avro does not require code generation, which enhances its flexibility and ease of use.
Architecture:
The core of Avro's architecture is its schema, which is used to read and write data. Avro schemas are defined
in JSON, and the resulting serialized data is compact and efficient. Processing systems can use these schemas
to understand the data and perform operations on it.
An Avro schema is defined in the JSON format and is necessary for both serialization and deserialization,
enabling compatibility and evolution over time. It can be a:
• JSON object, which defines a new data type using the format {"type": "typeName", ...attributes...}
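A minimal serialization/deserialization sketch follows, assuming the third-party fastavro library (the official avro package exposes a similar API); the record schema and file name are hypothetical.

```python
from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "temperature", "type": "float"},
    ],
})

records = [{"device_id": "sensor-1", "temperature": 22.5}]

# Serialize: the schema travels with the binary data in the file header.
with open("readings.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize: any reader can reconstruct the records using the embedded schema.
with open("readings.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```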
5. Hive:
Hive is a query interface on top of Hadoop’s native MapReduce. Remember Hive is a data warehouse it is not
an RDBMS! Hive allows us to write SQL-style queries in a native language known as Hive Query Language
(HQL).
In Hadoop we have our data stored in HDFS, and as MapReduce is used for processing that data. But there is a
problem, the MapReduce framework needs a person who knows JAVA. So if you want to work with this HDFS
data you should be a JAVA expert!
Facebook noticed this problem and in 2010, they invented Hive which lets you process this HDFS
data(structured one) even if you don’t know JAVA.
HIVE Reads data from HDFS. HIVE works with structured data in Hadoop.
In Hive, Tables and Data are stored separately. Data is stored only in HDFS. Tables are just projected over
that data of HDFS. Remember data is not stored in these tables of HIVE like we have in RDBMS!! Hive only
stores the table’s schema information (table metadata) in its metastore which is borrowed from an RDBMS
(derby is the default one). But in Production, Derby is replaced by Oracle, MSSQL as Derby is a single-user
database.
1. The driver calls the user interface’s execute function to perform a query.
2. The driver answers the query, creates a session handle for the query, and passes it to the compiler for
generating the execution plan.
3. The compiler sends a metadata request to the metastore.
4. The metastore returns the metadata, which the compiler uses for type-checking and semantic analysis of
the expressions in the query tree. The compiler then generates the execution plan (a Directed Acyclic Graph)
of MapReduce jobs, which includes map operator trees (operators used by mappers) as well as reduce
operator trees (operators used by reducers).
5. The compiler then transmits the generated execution plan to the driver.
6. After the compiler provides the execution plan to the driver, the driver passes the implemented plan to
the execution engine for execution.
7. The execution engine then passes the stages of the DAG to the appropriate components. For each table or
intermediate output, the engine uses the associated deserializer to read rows from the HDFS files; these rows
are then passed through the operator tree. Intermediate output is serialised using the serializer and written
to temporary HDFS files, which supply data to the subsequent MapReduce stages of the plan. The final
temporary file is moved to the table's location.
6. Hadoop MapReduce:
1. HDFS: Hadoop Distributed File System is a dedicated file system to store big data with a cluster of
commodity hardware or cheaper hardware with streaming access pattern. It enables data to be stored at
multiple nodes in the cluster which ensures data security and fault tolerance.
2. Map Reduce : Data once stored in the HDFS also needs to be processed upon. Now suppose a query is
sent to process a data set in the HDFS. Now, Hadoop identifies where this data is stored, this is called
Mapping. Now the query is broken into multiple parts and the results of all these multiple parts are combined
and the overall result is sent back to the user. This is called reduce process. Thus while HDFS is used to store
the data, Map Reduce is used to process the data.
3. YARN : YARN stands for Yet Another Resource Negotiator. It is a dedicated operating system for Hadoop
which manages the resources of the cluster and also functions as a framework for job scheduling in Hadoop.
The various types of scheduling are First Come First Serve, Fair Share Scheduler and Capacity Scheduler etc.
The First Come First Serve scheduling is set by default in YARN.
As the name suggests, MapReduce works by processing input data in two stages – Map and Reduce. To
demonstrate this, we will use a simple example with counting the number of occurrences of words in each
document.
The final output we are looking for is: How many times the words Apache, Hadoop, Class, and Track appear in
total in all documents.
For illustration purposes, the example environment consists of three nodes. The input contains six
documents distributed across the cluster. We will keep it simple here, but in real circumstances, there is no
limit. You can have thousands of servers and billions of documents.
2. Then, map tasks create a <key, value> pair for every word. These pairs show how many times a word
occurs. A word is a key, and a value is its count. For example, one document contains three of four words we
are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in one map task output
look like this:
• <apache, 7>
• <class, 8>
• <track, 6>
This process is done in parallel tasks on all nodes for all documents and gives a unique output.
3. After input splitting and mapping completes, the outputs of every map task are shuffled. This is the first
step of the Reduce stage. Since we are looking for the frequency of occurrence for four words, there are four
parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or they can run on any
other node.
The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This
process groups the values by keys in the form of <key, value-list> pairs.
4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> pair to produce a
final key-value pair. The reduce tasks also run at the same time and work independently.
In our example from the diagram, the reduce tasks get the following individual results:
• <apache, 22>
• <hadoop, 20>
• <class, 18>
• <track, 22>
5. Finally, the data in the Reduce stage is grouped into one output. MapReduce now shows us how many
times the words Apache, Hadoop, Class, and track appeared in all documents. The aggregate data is, by
default, stored in the HDFS.
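To make the stages concrete, here is a pure-Python simulation of Map, Shuffle, and Reduce for the word-count example above (no Hadoop cluster required; the document contents are made up).

```python
from collections import defaultdict

documents = ["apache hadoop class", "hadoop track apache apache", "class track hadoop"]

# Map: emit <word, 1> pairs for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key so each reducer sees <key, value-list>.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the value-list for each key.
totals = {word: sum(counts) for word, counts in grouped.items()}
print(totals)   # e.g. {'apache': 3, 'hadoop': 3, 'class': 2, 'track': 2}
```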
1. ResourceManager (RM): Acts as the master daemon, managing and allocating cluster resources. It
comprises two main parts:
3. ApplicationMaster (AM): This component is tasked with negotiating resources from the
ResourceManager and working with the NodeManager(s) to launch and monitor the tasks. The
ResourceManager and the per-node NodeManager form the data-computation framework, while the
ApplicationMaster is part of the application framework package.
Apache HBase:
Apache HBase is an open-source, NoSQL, distributed big data store. It enables random, strictly consistent,
real-time access to petabytes of data. HBase is very effective for handling large, sparse datasets.
HBase integrates seamlessly with Apache Hadoop and the Hadoop ecosystem and runs on top of the Hadoop
Distributed File System (HDFS) or Amazon S3 using Amazon Elastic MapReduce file system (EMRFS). HBase
serves as a direct input and output to the Apache MapReduce framework for Hadoop, and works with Apache
Phoenix to enable SQL-like queries over HBase tables.
• HBase stores data in tables, similar to traditional relational databases. Each table consists of rows and
columns.
• Tables are divided into column families, which group related columns together.
• Each column family can contain multiple columns, and each column can store a value.
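A minimal put/get sketch of this data model follows, assuming the third-party happybase client talking to an HBase Thrift server; the host, table, column family, and row key are placeholders.

```python
import happybase

connection = happybase.Connection(host="hbase-thrift.example.com")
table = connection.table("sensor_readings")

# Row key + column-family:qualifier -> value (everything is bytes in HBase).
table.put(b"device42#2024-06-01T00:00", {
    b"metrics:temperature": b"23.7",
    b"metrics:humidity": b"48",
})

row = table.row(b"device42#2024-06-01T00:00")
print(row[b"metrics:temperature"])

connection.close()
```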
HMaster
The HMaster is a central component in an HBase cluster and is responsible for managing metadata and
coordinating cluster operations. It keeps track of regions, assigns regions to Region Servers, and handles
region splits and merges.
Region Servers
• Region Servers are responsible for serving data in HBase. They host a set of regions.
• Each Region Server can serve multiple regions and is responsible for reading, writing, and managing data
within those regions.
ZooKeeper
HBase relies on Apache ZooKeeper for coordination and distributed synchronization. ZooKeeper helps in:
1. Cluster Coordination:
▪ ZooKeeper helps coordinate the different nodes (RegionServers and Master) in an HBase cluster.
▪ It keeps track of live RegionServers, their health, and their availability.
▪ When a RegionServer joins or leaves the cluster (intentionally or due to failure), ZooKeeper updates
the cluster state accordingly.
2. Leader Election:
▪ Location of the -ROOT- and .META. tables (used to locate user data).
▪ Active Master’s address.
▪ Configuration details and server status.
Scalable: HBase is designed to handle scaling across thousands of servers and managing access to petabytes
of data. With the elasticity of Amazon EC2, and the scalability of Amazon S3, HBase is able to handle online
access to massive data sets.
Fast: HBase provides low latency random read and write access to petabytes of data by distributing requests
from applications across a cluster of hosts. Each host has access to data in HDFS and S3, and serves read and
write requests in milliseconds.
Fault-Tolerant: HBase splits data stored in tables across multiple hosts in the cluster and is built to
withstand individual host failures. Because data is stored on HDFS or S3, healthy hosts will automatically be
chosen to host the data once served by the failed host, and data is brought online automatically.
DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service from Amazon that supports key-value and
document data structures. It requires only a primary key and doesn't require a schema to create a
table. It can store any amount of data and serve any amount of traffic to any extent. With DynamoDB, you can
expect great performance even when it scales up. It’s a really simple and small API that follows the key-value
method to store, access, and perform advanced data retrieval.
Features of DynamoDB:
DynamoDB is designed so that users can get high performance and run scalable applications that might not
be possible with a traditional database system. These additional features of DynamoDB fall under the
following categories:
• On-demand capacity mode: The applications using the on-demand service, DynamoDB automatically
scales up/down to accommodate the traffic.
• Built-in support for ACID transactions: DynamoDB provides native/ server-side support for
transactions.
• On-demand backup: This feature allows you to make an entire backup of your work on any given point
in time.
• Point-in-time recovery: This feature helps protect your data in case of accidental write or delete
operations.
• Encryption at rest: It keeps the info encrypted even when the table is not in use. This enhances security
with the assistance of encryption keys.
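A minimal key-value sketch with boto3 follows, assuming an existing table named "Devices" with partition key device_id (both names are hypothetical).

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Devices")

# Write an item (no schema beyond the primary key is required).
table.put_item(Item={"device_id": "sensor-42", "firmware": "1.4.2", "battery": 87})

# Read it back by primary key.
response = table.get_item(Key={"device_id": "sensor-42"})
print(response.get("Item"))
```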
Amazon S3
Amazon Simple Storage Service (S3) is a massively scalable storage service based on object
storage technology. It provides a very high level of durability, with high availability and high performance.
Data can be accessed from anywhere via the Internet, through the Amazon Console and the powerful S3 API.
• Buckets—data is stored in buckets. Each bucket can store an unlimited amount of unstructured data.
• Elastic scalability—S3 has no storage limit. Individual objects can be up to 5TB in size.
• Flexible data structure—each object is identified using a unique key, and you can use metadata to
flexibly organize data.
• Downloading data—easily share data with anyone inside or outside your organization and enable them
to download data over the Internet.
• Permissions—assign permissions at the bucket or object level to ensure only authorized users can access
data.
• APIs – the S3 API, provided both as REST and SOAP interfaces, has become an industry standard and is
integrated with a large number of existing tools.
Organizing, storing, and retrieving data in Amazon S3 revolves around two key components:
• Buckets and
• Objects
Buckets:
A bucket is a container for objects stored in Amazon S3. Objects are saved in the buckets. In order to store
your data in Amazon S3, you first create a bucket and specify a bucket name and AWS Region. Then, you
upload your data to that bucket as objects in Amazon S3.
It's also important to know that Amazon S3 bucket names are globally unique. No other AWS account can
create a bucket with the same name as yours unless you first delete your own bucket.
Objects:
Objects are the fundamental entities stored in Amazon S3. Amazon S3 is an object storage service that stores
data as objects within buckets. Objects are data files, including documents, photos, videos and any metadata
that describes the file. Each object has a key (or key name), which is the unique identifier for the object
within the bucket.
Objects consist of object data and metadata. The metadata is a set of name-value pairs that describe the
object. These pairs include some default metadata, such as the date last modified, and standard HTTP
metadata, such as Content-Type. We can also specify custom metadata at the time that the object is stored.
When we create a bucket, we should give a bucket name and choose the AWS Region where the bucket will
reside. After we create a bucket, we cannot change the name of the bucket or its Region.
It’s best practice to select a region that’s geographically closest to you. Objects that reside in a bucket within a
specific region remain in that region unless you transfer the files somewhere else.
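A minimal bucket/object sketch with boto3 follows; the bucket name, key, and metadata are placeholders (remember that bucket names must be globally unique).

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="my-iot-archive-2024")

s3.put_object(
    Bucket="my-iot-archive-2024",
    Key="devices/sensor-42/2024-06-01.json",        # the object's unique key in the bucket
    Body=b'{"temperature": 23.7}',
    ContentType="application/json",
    Metadata={"device-id": "sensor-42"},            # custom name-value metadata
)

obj = s3.get_object(Bucket="my-iot-archive-2024", Key="devices/sensor-42/2024-06-01.json")
print(obj["Body"].read())
```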
• Data Storage: S3 is ideal when you want to store application images and videos. All AWS services like
Amazon Prime and Amazon.com, as well as Netflix and Airbnb, use Amazon S3 for this purpose.
• Backup and Disaster Recovery: Amazon S3 is suitable for storing and archiving critical data or backup
data with its automatically replicated cross-region, providing maximum availability and durability.
• Analytics: We can run big data analytics, artificial intelligence (AI), machine learning (ML) on Amazon S3.
• Data Archiving: We can move data archives to the Amazon S3 Glacier storage classes to lower costs,
eliminate operational complexities, and gain new insights.
• Static Website Hosting: S3 stores various static objects, so we can host single-page application (SPA)
front-end layers using S3.
• Cloud-native applications: We can use in microservices architectures for storing blob data and access
from different services.
The Apache Spark framework is an open-source, distributed analytics engine designed to support big data
workloads. With Spark, users can harness the full power of distributed computing to extract insights from big
data quickly and effectively.
Spark handles parallel distributed processing by allowing users to deploy a computing cluster on local or
cloud infrastructure and schedule or distribute big data analytics jobs across the nodes. Spark has a built-in
standalone cluster manager, but can also connect to other cluster managers like Mesos, YARN, or Kubernetes.
Users can configure the Spark cluster to read data from various sources, perform complex transformations
on high-scale data, and optimize resource utilization.
1. Spark Core is the underlying distributed execution engine that powers Apache Spark. It
handles memory management, connects to data storage systems, and can schedule, distribute,
and manage jobs.
2. Spark SQL is a module for working with structured data, letting users run SQL queries or use the
DataFrame API on top of Spark.
3. Spark Streaming is a streaming analytics engine that leverages Spark Core’s fast scheduling to
ingest and analyse newly ingested data in real-time.
4. Spark MLlib is a library of machine learning algorithms that users can train using their own
data.
5. Spark GraphX is an API for building graphs and running graph-parallel computation on large
datasets.
Spark Core is exposed through an API that supports several of the most popular programming languages,
including Scala, Java, SQL, R, and Python.
At its core, Apache Spark uses a distributed computing model. Here’s how it processes data:
1. Data Distribution: Spark splits data into smaller chunks and distributes it across a cluster of nodes.
2. Task Execution: Each node processes its assigned chunk in parallel, increasing speed and efficiency.
3. In-Memory Computation: Unlike traditional systems that rely on disk-based processing, Spark keeps
data in memory, significantly reducing latency.
4. Resilient Distributed Datasets (RDDs): RDDs are immutable collections of objects that can be processed
in parallel, ensuring reliability and fault tolerance.
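A minimal PySpark sketch of this model follows, assuming pyspark is installed; the dataset and filter threshold are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-demo").getOrCreate()

readings = [("sensor-1", 21.5), ("sensor-2", 35.2), ("sensor-3", 19.8)]
df = spark.createDataFrame(readings, ["device_id", "temperature"])

# Transformations are lazy; Spark builds a plan and executes it across the cluster
# (or across local cores in local mode) when an action like .show() is called.
hot = df.filter(df.temperature > 30.0)
hot.show()

spark.stop()
```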
One of the primary reasons Apache Spark is the backbone of big data analytics is its speed. Thanks to its in-
memory processing architecture, Spark is much faster than traditional big data platforms like Hadoop. This
speed is critical for applications requiring quick analysis, such as real-time decision-making in industries like
finance, retail, and healthcare.
In the age of big data, the ability to scale efficiently is essential. Apache Spark scales horizontally, meaning
that as your data grows, you can add more machines to your cluster to handle the increased load. This
scalability makes Spark an ideal choice for organizations dealing with petabytes of data.
Apache Spark demonstrates significant versatility across various data processing tasks due to its unified
framework and specialized modules. This allows it to handle diverse workloads within a single platform,
eliminating the need for separate tools for different analytical needs.
Apache Spark easily integrates with other popular big data tools, such as Hadoop, Hive, and Cassandra,
allowing organizations to build a robust and flexible big data ecosystem. This integration enables businesses
to leverage their existing infrastructure while gaining the benefits of Spark’s advanced analytics capabilities.
With its high-level APIs, Apache Spark makes it easier for developers to build big data applications. Unlike
older technologies like Hadoop, which require complex configurations and custom code, Spark provides a
more straightforward approach to building and maintaining data pipelines, making it an attractive option for
organizations looking to reduce development time and complexity.
Apache Spark’s ability to handle machine learning workloads through its MLlib library makes it a crucial tool
for organizations looking to incorporate advanced analytics into their operations. Spark also supports graph
processing through GraphX, allowing businesses to analyze relationships and patterns in data, such as
social networks or recommendation systems.
7. Cost-Effective Solution
Despite its advanced capabilities, Apache Spark is an open-source platform, which means there are no
licensing fees associated with its use. This makes Spark a cost-effective option for organizations looking to
perform big data analytics without the financial burden of proprietary software.
Apache Spark is a powerful open-source distributed computing system designed for fast and general-purpose
big data processing. It can be deployed in two main ways: on a single machine (local mode) or across a
cluster of machines. Each approach has its own advantages and disadvantages.
In local mode, the entire Spark application runs in a single Java Virtual Machine (JVM) on a single machine.
Advantages:
• Simplicity: Easy to set up and ideal for learning, prototyping, and debugging applications with smaller
datasets.
• Faster for Small Data: For smaller datasets, local mode might even be faster due to reduced network
overhead compared to a cluster setup.
• Resource Efficiency: Can be cost-effective for single-machine workloads, such as training machine
learning models on moderate datasets or handling lightweight data tasks.
Limitations:
• Not for Big Data: Not suitable for large datasets that cannot fit in a single machine's memory or require
significant processing power beyond a single machine's capabilities.
• Fault Tolerance: No fault tolerance; a single machine failure will halt the entire application.
In distributed mode, Spark applications are distributed across multiple machines in a cluster, allowing for
parallel execution and handling of large datasets.
Advantages:
• Scalability: Can easily scale horizontally by adding more machines to handle growing data volumes and
processing demands.
• Faster for Big Data: Processes large datasets much faster by distributing tasks and computations across
multiple nodes.
• Fault Tolerance: Designed for fault tolerance, allowing applications to recover gracefully from node
failures without losing data.
• Resource Isolation: The driver program can run on a separate machine, preventing resource conflicts with
worker nodes.
• No Dependency on Local Machine: After submitting an application in cluster mode, it runs independently,
allowing you to disconnect from your local machine.
Limitations:
• Complexity: Setting up and managing a Spark cluster can be more complex than running Spark in local
mode.
• Cost: Running a cluster can be more expensive due to the need for multiple machines and potentially more
advanced hardware.
• Production deployments and large-scale applications requiring high availability and scalability.
• Processing large datasets that exceed the resources of a single machine.
• Applications requiring parallel processing and the ability to recover from node failures.
In a single-node architecture, we take a large task and divide it into subtasks, which are then completed
sequentially.
In a distributed computing architecture, the fundamental principle is to take a large task and break it into
smaller subtasks that execute concurrently, each on a node within the cluster. The ability to exploit this
distributed nature of computing is what allows Spark to scale and perform with the increasing demands of
enterprise-level data in today's world.
Lambda (λ) Architecture is a data processing architecture designed to handle massive quantities of data by
taking advantage of both batch and stream processing methods. This hybrid approach aims to balance the
trade-offs between latency, throughput, and fault tolerance, making it particularly suitable for real time
analytics on large datasets. The architecture processes data in a way that maximizes the strengths of both
batch processing—known for its comprehensive and accurate computations—and real time stream
processing, which provides low-latency updates.
Batch layer, also known as the batch processing layer, is responsible for storing the complete dataset and
pre-computing batch views. It operates on the principle of immutability, meaning that once data is ingested,
it is never updated or deleted; only new data is appended. This layer processes large volumes of historical
data at scheduled intervals, which can range from minutes to days, depending on the application's
requirements.
• Apache Hadoop
• Apache Spark
• Databricks
• Snowflake
• Amazon Redshift
• Google BigQuery
Speed layer, also known as the real time or streaming layer, is designed to handle data that needs to be
processed with minimal latency. Unlike the batch layer, it doesn't wait for complete data and provides
immediate views based on the most recent data.
In the speed layer, incoming data is processed in real time to generate low-latency views. This layer aims to
bridge the gap between the arrival of new data and its availability in the batch layer’s views. While the speed
layer's results may be less accurate or comprehensive, they offer timely insights.
• Apache Kafka
• Amazon Kinesis
• Apache Flink
• Apache Storm
Serving Layer indexes and stores the precomputed batch views from the Batch Layer and the near real time
views from the Speed Layer. It provides a unified view for querying and analysis, allowing users to access
both historical and real time data efficiently.
Technologies commonly used in the serving layer are: Apache Druid, Apache HBase, Elasticsearch.
Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to
present data to non-technical audiences without confusion.
In the world of Big Data, data visualization tools and technologies are essential to analyse massive amounts
of information and make data-driven decisions.
Exploratory Data Analysis (EDA) is an important step in data science and data analytics as it visualizes data
to understand its main features, find patterns and discover how different parts of the data are connected.
1. It helps to understand the dataset by showing how many features it has, what type of data each
feature contains and how the data is distributed.
2. It helps to identify hidden patterns and relationships between different data points, which help us in
model building.
3. Allows to identify errors or unusual data points (outliers) that could affect our results.
4. The insights gained from EDA help us to identify most important features for building models and
guide us on how to prepare them for better performance.
There are various types of EDA based on the nature of the data. Depending on the number of columns we are
analysing, we can divide EDA into three types:
1. Univariate Analysis
Univariate analysis focuses on studying one variable to understand its characteristics. It helps to describe
data and find patterns within a single feature. Common methods include histograms to show data
distribution, box plots to detect outliers and understand data spread, and bar charts for categorical data.
Summary statistics like mean, median, mode, variance and standard deviation help in describing the
central tendency and spread of the data.
2. Bivariate Analysis
Bivariate Analysis focuses on identifying relationship between two variables to find connections, correlations
and dependencies. It helps to understand how two variables interact with each other. Some key techniques
include:
• Scatter plots which visualize the relationship between two continuous variables.
• Correlation coefficient measures how strongly two variables are related which commonly
use Pearson's correlation for linear relationships.
• Line graphs are useful for comparing two variables over time in time series data to identify trends
or patterns.
• Covariance measures how two variables change together but it is paired with the correlation
coefficient for a clearer and more standardized understanding of the relationship.
3. Multivariate Analysis
Multivariate Analysis identifies relationships among two or more variables in the dataset and aims to
understand how variables interact with one another, which is important for statistical modelling techniques.
It includes techniques like:
• Pair plots which shows the relationships between multiple variables at once and helps in
understanding how they interact.
• Another technique is Principal Component Analysis (PCA) which reduces the complexity of large
datasets by simplifying them while keeping the most important information.
• Spatial Analysis is used for geographical data by using maps and spatial plotting to understand the
geographical distribution of variables.
• Time Series Analysis is used for datasets that involve time-based data and it involves
understanding and modeling patterns and trends over time. Common techniques include line plots,
autocorrelation analysis, moving averages and ARIMA models.
EDA is the process of analysing datasets to summarize their main features, uncover patterns, spot anomalies,
and prepare the data for modelling. It involves both statistical and visual techniques. Tools like Python
(Pandas, Seaborn) and R (ggplot2, dplyr) are commonly used.
1. Understand the Problem and the Data: Before analyzing, understand the goal of the project and what each variable represents. Know the data types (e.g., numerical, categorical) and any known quality issues. This ensures the analysis is relevant and accurate.
2. Import and Inspect the Data: Load the data into your tool (Python/R). Check the shape (rows and columns), data types, missing values, and any errors or inconsistencies. This gives you a clear view of the dataset's structure.
3. Handle Missing Values: Missing values can mislead your results. You can either remove such data or fill it in (impute) using methods like the mean, the median, or predictive models. Choose the method based on how and why the data is missing.
4. Explore Data Characteristics: Calculate summary statistics (mean, median, standard deviation, skewness) to understand the distribution and spread of the data. This helps detect irregularities and determine suitable modelling techniques.
5. Visualize Relationships: Use plots such as scatter plots, correlation heatmaps, and grouped box plots to explore how variables relate to one another and to spot trends.
6. Handle Outliers: Outliers are unusual data points that can distort analysis. Detect them using the IQR, Z-scores, or domain knowledge, and choose whether to remove or adjust them based on context (see the sketch after this list).
7. Communicate Findings: Summarize your findings clearly using charts, tables, and key insights. Mention limitations and suggest next steps. Effective communication ensures stakeholders understand and act on the analysis.
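To make steps 2, 3, and 6 above concrete, here is a minimal pandas sketch; the CSV file and the "temperature" column are assumed for illustration only.

import pandas as pd

# Step 2: import and inspect (hypothetical file and columns)
df = pd.read_csv("sensor_readings.csv")
print(df.shape)            # rows and columns
print(df.dtypes)           # data types
print(df.isnull().sum())   # missing values per column

# Step 3: impute missing numeric values with the median
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Step 6: flag outliers with the 1.5 * IQR rule
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temperature"] < q1 - 1.5 * iqr) | (df["temperature"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers flagged")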
Tableau is a fast and powerful data visualization tool. It provides features for cleaning, organizing, and visualizing data, making it easier to create interactive visual analytics in the form of dashboards. These dashboards make it easier for non-technical analysts and end-users to understand and act on the data.
Tableau Features:
• Live and in-memory data – Use Tableau's live connection to query data directly from the source, or work with an in-memory extract.
• Advanced Visualization – Naturally, Tableau creates bar charts and pie charts. Still, its advanced
visualizations also include boxplots, bullet charts, Gantt charts, histograms, motion charts, and treemaps,
and that's just the tip of the iceberg.
• Maps – Tableau's map feature lets users see where trends are happening.
• Mobile View – Create dashboards and reports from your phone or tablet.
• Ask Data – Lets users ask questions of their data in plain, natural language; they don't have to be data scientists to find answers within the data.
• Trend Lines and Predictive Analysis – Drag and drop technology creates trend lines for forecasting and
predictions.
• Text Editor – Format your text in a way that makes sense to you.
• Revision History – Revision history lets decision-makers and viewers see how the data has changed
over time.
• Licensing Views – All license holders will have viewing access to the dashboard and reports
• Web Data Connector – Connect to the cloud and nearly every other online data source
• Split Function – Split data to create new fields in all supporting data sources
• Dimensions: Discrete (qualitative) values, which do not change with respect to time, are called dimensions in Tableau. Example: city name, product name, country name.
• Measures: Continuous (quantitative) values, which can change with respect to time, are called measures in Tableau. Example: profit, sales, discount, population.
1. Look at your data - au naturel (data profiling): This involves analysing the raw data to understand its
structure, content, relationships, and statistical properties.
• Techniques:
o Data type analysis: Identifying the data types of each column (e.g., numeric, text, date)
and checking for inconsistencies.
o Pattern detection: Identifying recurring patterns or formats, and deviations from those
patterns.
2. Data completeness: Assessing whether all required data is present and accounted for.
• Techniques:
o Missing value analysis: Identifying nulls, blank cells, or placeholder values that indicate
missing information.
o Business rule checks: Verifying if mandatory fields are populated according to defined
business rules.
3. Data validity: Ensuring the data conforms to defined standards, formats, and business rules.
• Techniques:
o Format validation: Checking if data adheres to specified formats (e.g., date formats,
email address formats, phone number formats).
o Data type checks: Verifying that the data type of a column matches the expected data
type (e.g., numeric values in a numerical column).
o Referential integrity checks: Ensuring relationships between different datasets are valid
(e.g., foreign keys referencing existing primary keys).
o Business rule validation: Checking if data complies with specific business rules (e.g., a
customer's order quantity cannot exceed the available stock).
o Schema checks: Verifying data against predefined data structures and criteria.
4. Assessing information lag (data timeliness/recency): Evaluating how current and up-to-date the data is.
• Techniques:
o Comparison with real-world events: Checking if the data reflects the most recent events
or changes in the real world.
o Identifying outdated information: Pinpointing data that is no longer relevant due to its
age.
5. Representativeness: Determining if the data accurately reflects the underlying population or phenomenon
it's intended to represent.
• Techniques:
o Sampling techniques: If dealing with a subset of data, assessing if the sampling method
used ensures a representative sample.
o Bias detection: Identifying if any biases are present in the data that might lead to
skewed or inaccurate insights.
Other important data quality dimensions include (a small Python validation sketch follows this list):
• Accuracy: How well the data reflects reality and is free from errors.
• Consistency: Whether data stored in one place matches relevant data stored elsewhere,
avoiding contradictions.
• Usefulness/Relevance: Whether the data is applicable and valuable for solving problems and
making decisions.
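The sketch below illustrates a few of the checks described above (completeness, format validation, and referential integrity) with pandas; the file names, column names, and the email pattern are hypothetical.

import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical dataset
customers = pd.read_csv("customers.csv")  # hypothetical reference table

# Completeness: count missing values per column
print(orders.isnull().sum())

# Validity: format validation with a simple email regular expression
valid_email = orders["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print("Invalid emails:", (~valid_email).sum())

# Business rule / range check: order quantity must be positive
print("Non-positive quantities:", (orders["quantity"] <= 0).sum())

# Referential integrity: every order must reference an existing customer
orphans = ~orders["customer_id"].isin(customers["customer_id"])
print("Orders with unknown customer_id:", orphans.sum())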
Time series analysis is a powerful technique used to understand trends, patterns, and seasonal variations in
data collected over time. It plays a critical role in fields such as finance, weather
forecasting, healthcare, energy, and retail, where predicting future values accurately is key to decision-
making. With the exponential growth in data availability, mastering time series analysis has become essential
for data scientists and analysts alike.
A time series is a sequence of data points collected or recorded at specific and usually equally spaced
intervals over time. Unlike random or unordered data, time series data is inherently chronological, making
time a critical dimension for analysis. Each observation in a time series is dependent on previous values,
which differentiates it from other types of data structures.
It is important to distinguish time series data from cross-sectional data, which captures observations at a
single point in time across multiple subjects (e.g., sales from different stores on the same day). While cross-
sectional analysis examines relationships among variables at a fixed time, time series analysis focuses on
understanding how a single variable evolves over time, taking into account temporal dependencies and
patterns like seasonality, trends, and cycles.
Time series data is composed of several key components, each representing different underlying patterns.
Understanding these components is essential for accurate modelling and forecasting.
1. Trend
The trend represents the long-term direction in the data—whether it’s increasing, decreasing, or remaining
stable over time. Trends often emerge due to factors like economic growth, technological advancement, or
demographic shifts.
Example: The consistent rise in global average temperature over decades reflects a positive trend.
2. Seasonality
Seasonality refers to regular, repeating patterns observed over a fixed period, such as daily, weekly, monthly,
or yearly intervals. These variations are caused by external influences like weather, holidays, or business
cycles.
Example: Ice cream sales tend to spike during summer months every year, showing clear seasonal behaviour.
3. Cyclic Patterns
Cyclic variations are long-term fluctuations that do not follow a fixed frequency, unlike seasonality. These
cycles often correspond to economic or business cycles and can span multiple years.
Example: A country’s GDP might follow multi-year cycles of expansion and recession due to macroeconomic
factors.
4. Irregular (Residual) Component
The irregular or residual component includes unpredictable and random variations that cannot be attributed
to trend, seasonality, or cyclic behaviour. These are typically caused by unexpected events like natural
disasters, pandemics, or sudden market shocks.
Example: A sudden drop in retail sales due to a nationwide strike would be considered an irregular
component.
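As a rough sketch, statsmodels' seasonal_decompose can split a series into these components; the monthly sales file and column names below are hypothetical.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series indexed by date
sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"], index_col="month")["sales"]

# Additive decomposition with a yearly seasonal period (12 months)
result = seasonal_decompose(sales, model="additive", period=12)

print(result.trend.dropna().head())   # long-term direction
print(result.seasonal.head(12))       # repeating seasonal pattern
print(result.resid.dropna().head())   # irregular / residual component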
Effective visualization is crucial in time series analysis. It helps identify patterns, trends, and anomalies that
may not be immediately obvious in raw data. Different types of plots highlight different aspects of time-
dependent behavior.
• Line Charts: The most widely used method for visualizing time series data. Line plots provide a
clear view of how values change over time and are ideal for identifying trends and seasonality.
Use Case: Tracking monthly revenue over a year.
• Heatmaps: Heatmaps represent data in a matrix format where values are colored based on
intensity. In time series, they are especially useful for visualizing seasonal and daily patterns
over longer periods.
Use Case: Analyzing hourly website traffic over weeks.
• Seasonal Subseries Plots: These plots group time series data by season (month, quarter, etc.) to
highlight recurring seasonal patterns.
Use Case: Understanding month-wise sales fluctuations over several years.
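A minimal plotting sketch for the line chart and heatmap ideas above, assuming a hypothetical hourly traffic file with "timestamp" and "visits" columns:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical hourly website-traffic data
df = pd.read_csv("traffic.csv", parse_dates=["timestamp"])

# Line chart: overall trend over time
df.set_index("timestamp")["visits"].plot()
plt.show()

# Heatmap: days on the y-axis, hour of day on the x-axis
df["day"] = df["timestamp"].dt.date
df["hour"] = df["timestamp"].dt.hour
pivot = df.pivot_table(index="day", columns="hour", values="visits", aggfunc="mean")
plt.imshow(pivot.values, aspect="auto", cmap="viridis")
plt.xlabel("Hour of day")
plt.ylabel("Day")
plt.colorbar(label="Average visits")
plt.show()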
1. Data Collection and Cleaning: Gather relevant data and address issues like missing values.
2. Data Exploration and Visualization: Examine the data for patterns using visualizations.
3. Stationarity Assessment: Determine if the data's statistical properties are constant over time.
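One common way to assess stationarity is the Augmented Dickey-Fuller (ADF) test; the sketch below uses statsmodels on a hypothetical series.

import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical time series loaded from a CSV
series = pd.read_csv("monthly_sales.csv", parse_dates=["month"], index_col="month")["sales"]

# ADF test: the null hypothesis is that the series is non-stationary
adf_stat, p_value, *_ = adfuller(series.dropna())
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")

# Rule of thumb: p-value < 0.05 suggests stationarity; otherwise differencing
# (e.g., series.diff()) is often applied before modelling.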
Several tools and libraries are available, including Python libraries (pandas, NumPy, statsmodels, Prophet, scikit-learn), R, MATLAB, specialized time series databases (InfluxDB, TimescaleDB, Prometheus), and visualization tools (Matplotlib, ggplot2, Tableau, Grafana).
1. Identifying Categories:
• Drag Dimensions to Rows/Columns: To identify categories, drag any dimension field (e.g., "Category", "Product Name", "Region", or measurement-flag fields such as Qgag and Qpcp in the example weather dataset) from the Data pane to the Rows or Columns shelf. This will create a list of all unique values within that dimension.
• Observe Distinct Values: The list generated by dragging the dimension to the Rows or Columns
shelf represents the categories within that field.
2. Identifying Missing Values:
o Examine the Data pane for each field you are interested in.
o Click on the data type icon next to a field to open its context menu.
o Check for any indicators or information related to missing values (Null) within the
context menu or related descriptions.
o Create a view by dragging a dimension to the Rows or Columns shelf and a measure to
the other shelf.
o Examine the generated view for gaps or breaks in the data that could indicate missing
values.
o Show Empty Rows/Columns: Go to Analysis > Table Layout > Show Empty
Rows/Columns to explicitly display rows or columns that have missing values.
o Show Missing Values: For date fields or numerical bins, right-click (control-click on
Mac) the date or bin headers and select Show Missing Values to display any missing
values within the range.
3. Comparing dimensions for interaction (Example: Qgag and Qpcp Measurement Flags):
• Create a Crosstab: Drag one of the "Measurement Flag" dimensions (e.g., Qgag) to the Rows
shelf.
• Drag the other "Measurement Flag" dimension (e.g., Qpcp) to the Columns shelf.
• Observe Co-occurrence of Nulls: The resulting view will display a crosstab showing the
relationship between the two dimensions. Look for rows and columns where both dimensions
display "Null" values, confirming their frequent co-occurrence.
Geography is a very powerful aid in understanding IoT data when devices are located in a variety of places.
Patterns that are not obvious from statistical analysis of the data can become very obvious when shown on a
map. Geography significantly influences how IoT data, particularly from weather stations, should be
interpreted, especially when analysing precipitation across a state.
• It is common to find weather stations clustered in certain areas, particularly in urban centres or
regions with higher population density, while other areas, such as mountainous or remote
regions, have sparse coverage.
• For instance, in southern India, around Bangalore, there is a relatively good density of rain gauges, while other areas of India have poor density.
• When weather station locations are not evenly distributed, simply averaging the rainfall
measurements from these stations may not accurately represent the actual average rainfall for
the entire state.
• This is because the arithmetic mean method assumes that each station represents an equal
corresponding area, an assumption that is violated when the stations are unevenly distributed.
• Research indicates that both the density and the spatial distribution of rain gauge networks
play a crucial role in accurately calculating the areal average rainfall (AAR).
• Studies have shown that estimation error in calculating AAR is relatively small in areas with
well-distributed rain gauges but significantly increases when the stations are clustered.
• Even with a higher density of stations, if the spatial distribution is poor, the accuracy of the AAR
can be affected.
• To mitigate the impact of uneven distribution, various techniques are employed, including:
o Isohyetal analysis: Drawing lines of equal rainfall (isohyets) on a map to estimate areal
precipitation.
o Thiessen polygon method: Assigning weights to stations based on the area each station represents within a polygon network (a small weighted-average sketch follows this list).
• In regions with poor rain gauge density, researchers explore using gauge-calibrated satellite
observations to supplement rainfall information and improve estimation accuracy.
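The sketch below contrasts a simple arithmetic mean with an area-weighted areal average rainfall (AAR); the station readings and Thiessen polygon areas are purely hypothetical.

# Hypothetical rain-gauge readings (mm) and Thiessen polygon areas (km^2)
stations = {
    "station_A": {"rain_mm": 42.0, "area_km2": 120.0},
    "station_B": {"rain_mm": 35.5, "area_km2": 310.0},
    "station_C": {"rain_mm": 60.2, "area_km2": 75.0},
}

# Arithmetic mean: assumes every gauge represents an equal area
arithmetic_mean = sum(s["rain_mm"] for s in stations.values()) / len(stations)

# Thiessen (area-weighted) areal average rainfall
total_area = sum(s["area_km2"] for s in stations.values())
aar = sum(s["rain_mm"] * s["area_km2"] for s in stations.values()) / total_area

print(f"Arithmetic mean: {arithmetic_mean:.1f} mm")
print(f"Area-weighted AAR: {aar:.1f} mm")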
https://www.datacamp.com/tutorial/r-studio-tutorial
1. Manufacturing:
Challenges:
• Supply chain disruptions: Unforeseen events can interrupt the flow of raw materials and
finished goods, impacting production and delivery schedules.
• Quality control issues: Manufacturing processes can experience defects and inconsistencies,
leading to product recalls and customer dissatisfaction.
• Predictive maintenance: Unexpected equipment breakdowns can lead to costly downtime and
production delays.
Data Analysis Solutions:
• Supply Chain Analytics: Analyse supplier performance, logistics data, and inventory levels to
predict and mitigate potential disruptions, according to Oracle.
• Quality Control Analytics: Analyse production data, sensor data, and defect rates to identify the
root causes of quality issues and implement corrective actions.
• Operational Efficiency Analytics: Analyse production line data, machine utilization, and waste
data to identify bottlenecks and areas for process improvement.
• Predictive Maintenance: Analyse equipment sensor data and historical maintenance records to
predict potential failures and schedule proactive maintenance, minimizing downtime.
2. Healthcare:
Challenges:
• Patient readmissions: High readmission rates indicate potential gaps in patient care and
increased healthcare costs.
• Disease prevention and management: Identifying at-risk patients and proactively managing
chronic diseases can improve patient outcomes and reduce the burden on the healthcare
system.
• Fraud detection: Fraudulent claims and billing errors can impact the financial stability of
healthcare organizations.
Data Analysis Solutions:
• Predictive Analytics for Readmissions: Analyse patient data to identify individuals at high risk
of readmission and implement targeted follow-up care and interventions.
• Operational Analytics: Analyse patient admission rates, staffing schedules, and resource
utilization to optimize operations, improve efficiency, and reduce costs.
• Population Health Management: Analyse demographic data, patient histories, and social
determinants of health to identify populations at risk and design targeted preventive care
programs.
• Fraud Detection Analytics: Analyse claims data, billing patterns, and patient health histories to
identify anomalies and flag suspicious activities that may indicate fraud.
3. Retail:
Challenges:
• Inaccurate demand forecasting: Overstocking or understocking products can lead to losses and
missed sales opportunities.
• Rapidly changing trends: Retailers must adapt quickly to evolving consumer preferences and
market trends to remain competitive.
• Customer data protection: Safeguarding sensitive customer information is crucial for building
trust and maintaining compliance.
• Optimizing marketing campaigns: Effectively targeting marketing efforts to the right customers
at the right time is essential for maximizing ROI (Return on Investment).
Data Analysis Solutions:
• Demand Forecasting: Utilize historical sales data, market trends, and external factors to predict
future demand for products, optimizing inventory management and pricing strategies.
• Trend Analysis and Customer Segmentation: Analyse sales data, social media trends, and
customer feedback to identify emerging trends and segment customers based on their
preferences and purchasing habits.
• Data Security and Compliance: Implement robust security measures, access controls, and data
classification systems to protect customer data and adhere to privacy regulations.
• Customer Behaviour Analysis: Analyse customer browsing data, purchase history, and
demographics to create accurate customer profiles and personalize marketing campaigns,
improving conversion rates.
Data augmentation is a technique used to artificially increase the size of a training dataset by applying
various transformations to the existing data. This technique is commonly used in machine learning and deep
learning tasks, especially in computer vision, to improve the generalization and robustness of the trained
models.
Data augmentation and synthetic data generation are distinct yet complementary techniques in machine
learning:
• Augmented data: This involves creating modified versions of existing data to increase dataset
diversity. For example, in image processing, applying transformations like rotations, flips, or
colour adjustments to existing images can help models generalize better.
• Synthetic data: This refers to artificially generated data, which allows researchers and
developers to test and improve algorithms without risking the privacy or security of real-world
data.
Benefits of Data Augmentation:
1. Increased Dataset Size: By creating new samples from the existing data, data augmentation
effectively increases the size of the dataset, which can lead to better model performance.
2. Regularization: Data augmentation introduces additional variations in the data, which can help
prevent overfitting by providing the model with a more diverse set of examples.
3. Improved Generalization: By exposing the model to a wider range of variations in the data, data
augmentation helps the model generalize better to unseen examples.
Common image augmentation techniques include, for example:
1. Rotation: Rotate the image by a certain angle (e.g., 90 degrees, 180 degrees).
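A minimal sketch of rotation and flipping with the Pillow library is shown below; the image file name is hypothetical, and real pipelines usually apply such transforms randomly during training.

from PIL import Image

# Hypothetical training image
img = Image.open("sample.jpg")

# Rotation by fixed angles (expand=True keeps the whole rotated image)
rotated_90 = img.rotate(90, expand=True)
rotated_180 = img.rotate(180)

# Horizontal flip (reflection along the vertical axis)
flipped = img.transpose(Image.FLIP_LEFT_RIGHT)

# Each transformed copy can be saved or fed to the model as an extra sample
rotated_90.save("sample_rot90.jpg")
flipped.save("sample_flipped.jpg")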
Healthcare:
Acquiring and labelling medical imaging datasets is time-consuming and expensive. You also need a subject
matter expert to validate the dataset before performing data analysis. Using geometric and other
transformations can help you train robust and accurate machine-learning models.
For example, in the case of Pneumonia Classification, you can use random cropping, zooming, stretching, and
colour space transformation to improve model performance. However, you need to be careful with certain augmentations, as they can have the opposite effect. For example, random rotation and reflection along the x-axis are not recommended for X-ray imaging datasets.
Self-Driving Cars:
There is limited real-world data available for self-driving cars, so companies use simulated environments to generate synthetic data, often with reinforcement learning. Synthetic data can also help you train and test machine learning applications where data security is an issue.
IoT devices like sensors, cameras, and smart machines collect a lot of real-time data.
This data is useful, but by itself, it does not give the full picture of what’s happening in a business.
To make this data more valuable, companies can combine it with internal datasets they already have.
This process is called “decorating your data”, because it makes the data richer, more meaningful, and easier
to use for decision-making.
1. Customer Information:
This includes the details a company knows about its customers. Combined with IoT data, it helps send the right alerts, offers, or maintenance reminders to the right people.
Example in IoT:
A smart home energy meter collects electricity usage data. When combined with the customer’s profile, it can
send tips on how they can save money based on their lifestyle.
2. Production Data:
Records of what was manufactured, when, and in which batch.
Example in IoT:
If vibration sensors on a machine show unusual patterns, production data can help identify which product
batches might have been affected, so they can be checked or stopped before shipping.
3. Service and Maintenance Data:
• Repair history
• Technician reports
• Service schedules
Example in IoT:
A connected elevator sends an error alert to the company. The system checks past service history and sees
the same problem happened twice before. It automatically sends a technician with the right replacement
part, avoiding multiple visits.
4. Financial Data:
Costs, revenues, and budgets associated with assets and operations.
Example in IoT:
A connected truck fleet tracks fuel use in real-time. When combined with financial data, managers can see
which trucks are using too much fuel and decide whether to repair or replace them.
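As a rough sketch of "decorating" IoT data, the snippet below joins hypothetical truck telemetry with a hypothetical internal cost table using pandas; all file and column names are assumptions.

import pandas as pd

# Hypothetical IoT telemetry: one row per truck per day
telemetry = pd.read_csv("truck_fuel_telemetry.csv")   # truck_id, date, litres_used, km_driven

# Hypothetical internal financial data about each truck
finance = pd.read_csv("truck_costs.csv")              # truck_id, purchase_year, maintenance_cost

# "Decorate" the sensor data by joining it with the internal dataset
combined = telemetry.merge(finance, on="truck_id", how="left")

# Ask business questions of the enriched data
combined["litres_per_100km"] = 100 * combined["litres_used"] / combined["km_driven"]
summary = combined.groupby("truck_id")[["litres_per_100km", "maintenance_cost"]].mean()
print(summary.sort_values("litres_per_100km", ascending=False).head())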
Benefits:
• Better predictions – More data means better forecasting of problems and needs.
• Clear ROI – Easy to see how IoT investments are performing financially.
Challenges:
• Data privacy & security – Customer and company data must be protected.
• Integration issues – Old systems may be hard to connect with IoT platforms.
• Data quality problems – Internal and IoT data must be cleaned and matched.
• Legal compliance – Companies must follow data protection laws like GDPR.
While IoT devices collect valuable real-time sensor data, this information can become even more powerful
when combined with external datasets. External datasets come from public sources, government
databases, APIs, or research institutions. They add extra context to IoT data, improving analysis,
predictions, and decision-making.
1. External Datasets – Geographical: Geographical datasets help IoT systems understand location-based factors such as terrain, weather, and transportation networks.
a. Elevation
Elevation data is important for applications like climate modelling, flood prediction, logistics planning,
and agriculture.
• SRTM Elevation (Shuttle Radar Topography Mission) – Satellite-based elevation data used
in mapping and terrain analysis. The Shuttle Radar Topography Mission (SRTM) is a joint
project between NASA and the National Geospatial-Intelligence Agency (NGA) that acquired
near-global elevation data using radar interferometry. Launched in February 2000 aboard the
space shuttle Endeavour, SRTM captured topographic data for approximately 80% of the Earth's land surface.
• National Elevation Dataset (NED) – US-specific dataset with detailed height information for
environmental and engineering purposes. The National Elevation Dataset (NED) is a primary
elevation data product created and distributed by the U.S. Geological Survey (USGS). It provides
a seamless, raster-based digital elevation model (DEM) of the United States, including the
conterminous U.S., Alaska, Hawaii, and territorial islands. The NED is a key resource for various
applications, including hydrologic modeling, flood analysis, and geographic information system
(GIS) applications.
b. Weather
Weather data can help IoT systems adjust operations for safety, efficiency, and comfort. The National
Oceanic and Atmospheric Administration (NOAA) is a key source of weather-related data, both for public and commercial use.
Example: Smart irrigation systems can use weather data to avoid watering before rain.
c. Geographical Features
These datasets describe land use, roads, rivers, cities, and other physical features.
(Figure: OpenStreetMap close-up view of London with the GPS traces layer turned on.)
• Google Maps API – Provides location, routing, and traffic data for integration with IoT apps.
2. External Datasets – Demographic: Demographic datasets describe population size and characteristics.
• The U.S. Census Bureau – Official US population data with detailed demographics and economic characteristics.
• CIA World Factbook – Global demographic and economic data for almost every country,
including population size, literacy rates, and GDP.
3. External Datasets – Economic: Economic datasets provide financial and market data that can help in
business forecasting and policy-making.
• Federal Reserve Economic Data (FRED) – US-based database with interest rates, inflation,
employment, and other economic indicators.
Benefits:
• Better Predictions – Combining sensor data with weather, location, or economic data improves
forecasting.
• Global Context – Allows IoT systems to work effectively across different regions.
Examples:
• Smart Farming – IoT soil sensors + weather forecast data = optimized watering and planting.
• Smart Transportation – GPS tracking + traffic datasets (Google Maps API) = faster route
planning.
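A minimal sketch of combining IoT readings with an external dataset is shown below: soil-moisture readings are aligned with the most recent weather record using pandas merge_asof. The files, column names, and thresholds are hypothetical.

import pandas as pd

# Hypothetical IoT soil-moisture readings and an external weather feed
soil = pd.read_csv("soil_sensors.csv", parse_dates=["timestamp"])        # timestamp, field_id, moisture
weather = pd.read_csv("weather_forecast.csv", parse_dates=["timestamp"]) # timestamp, rain_prob_pct

# Align each sensor reading with the latest weather record at or before it
merged = pd.merge_asof(
    soil.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
)

# Simple decision rule: skip irrigation when soil is moist or rain is likely
merged["irrigate"] = (merged["moisture"] < 30) & (merged["rain_prob_pct"] < 40)
print(merged[["timestamp", "field_id", "moisture", "rain_prob_pct", "irrigate"]].head())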