Module-2-MINING DATA STREAMS
Contents
Introduction To Streams Concepts – Stream Data Model and
Architecture –
Stream Computing –
Sampling Data in a Stream –
Filtering Streams –
Counting Distinct Elements in a Stream –
Estimating Moments –
Counting Ones in a Window –
Decaying Window –
Real time Analytics Platform (RTAP) Applications –
Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions.
Stream processing
It is a method of continuously ingesting, analysing and
acting on data as it’s generated.
Unlike traditional batch processing, where data is
collected over a period of time and then processed in
chunks, stream processing operates on data as it is
collected, offering insights and actions within milliseconds
to seconds of data arrival.
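To make the contrast concrete, here is a tiny Python sketch: the batch version waits until a whole chunk has been collected before computing, while the streaming version updates its answer as each record arrives; the readings and the average computation are illustrative assumptions.

readings = [12.0, 48.5, 53.2, 7.9, 41.0]   # stand-in for arriving data

# Batch: collect everything for a period, then process the chunk at once.
batch_avg = sum(readings) / len(readings)

# Stream: update the answer as each record arrives, within milliseconds.
count, total = 0, 0.0
for r in readings:
    count += 1
    total += r
    running_avg = total / count   # available immediately after each arrival

print(batch_avg, running_avg)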
Stream-Processing Architecture
1. Data producers
To even get started with stream processing you need
some data to process, which means you need
something to create data in the first place.
Some common data producers are IoT sensors, software
applications producing metrics or user activity data, and
financial data feeds.
The volume and frequency of data being produced
will affect the architecture needed for other
components in the stream-processing system.
The downstream component that receives this data is often
called a message broker or stream processor (see the summary
below).
2. Data ingestion
The data ingestion component of a stream-
processing system is critical for ensuring reliability
and scalability.
The data ingestion layer is what captures data from
various sources and acts as a buffer for data before
it is sent to be processed.
A proper data ingestion layer ensures data is collected in a
fault-tolerant manner, so that no data is lost before
processing and short-term spikes in data volume do not
overwhelm the stream-processing system.
In practice, this ingestion is handled by an ETL tool.
A few examples of open-source ETL tools for streaming
data are Apache Storm, Spark Streaming, and WSO2
Stream Processor.
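To make the buffering described above concrete, here is a minimal Python sketch of a bounded ingestion buffer between one producer thread and one consumer thread; the buffer size, the sensor-style readings, and the sentinel mechanism are illustrative assumptions, not the design of any particular tool.

import queue
import random
import threading

BUFFER = queue.Queue(maxsize=1000)  # bounded buffer: absorbs short-term spikes
DONE = object()                     # sentinel marking the end of the stream

def produce(n=10_000):
    """Simulates a data producer, e.g. an IoT current sensor."""
    for i in range(n):
        reading = {"seq": i, "amps": random.uniform(0.0, 60.0)}
        BUFFER.put(reading)         # blocks when the buffer is full (backpressure)
    BUFFER.put(DONE)

def consume():
    """Drains the buffer and hands each tuple to the processing stage."""
    while True:
        item = BUFFER.get()
        if item is DONE:
            break
        _ = item["amps"]            # placeholder for joins/transformations/aggregations

producer = threading.Thread(target=produce)
consumer = threading.Thread(target=consume)
producer.start(); consumer.start()
producer.join(); consumer.join()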
The processing stage often performs additional joins,
transformations, or aggregations on the data.
The result may be, an action, a visualization, an alert,
or in some cases a new data stream.
The query layer supports two types of queries: ad hoc queries
and standing queries.
This layer provides the tools which can be used for
querying and analysing the stored data stream.
Standing Query – a query asked of the stream at all
times (continuous).
Example – alert whenever the current value exceeds 50 A.
Ad hoc Query – a query asked of the stream once.
Example – what is the average of the current values captured
so far?
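As a sketch of the two query types, the following Python fragment evaluates a standing query (alert when a reading exceeds 50 A) on every arriving tuple and answers an ad hoc query (the running average) on demand; the class name, threshold, and sample readings are illustrative assumptions.

class CurrentStream:
    """Tracks a stream of current readings (in amps)."""
    def __init__(self, alert_threshold=50.0):
        self.threshold = alert_threshold
        self.count = 0
        self.total = 0.0

    def ingest(self, amps):
        # Standing query: evaluated continuously, on every arriving tuple.
        if amps > self.threshold:
            print(f"ALERT: current {amps:.1f} A exceeds {self.threshold} A")
        self.count += 1
        self.total += amps

    def average_so_far(self):
        # Ad hoc query: asked once, answered from the state kept so far.
        return self.total / self.count if self.count else 0.0

stream = CurrentStream()
for amps in [12.0, 48.5, 53.2, 7.9]:   # stand-in for an unbounded stream
    stream.ingest(amps)
print("average so far:", stream.average_so_far())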
Data storage
In many cases, once real-time analysis is done
using stream processing, there will still be a
need for storing the data for historical analysis
or archival purposes.
Common solutions are things like data warehouses, time-series
databases, and object storage.
Summary
Stream Processor.
Popular stream processing tools include Apache Kafka and
Amazon Kinesis. Also called the message broker, the stream
processor collects data from a producer; that data then
undergoes processing.
The processed data is then streamed for continuous
consumption by other components.
Real-time ETL (Extract, Transform, Load) Tools.
These tools help move real-time data between locations. ETL
tools include Amazon Kinesis Data Streams, Azure Event Hubs,
and Apache Storm.
Data from streams usually arrives in unstructured formats and
needs processing, such as aggregation and transformation,
before it can be queried by SQL tools.
The result may be an API call, an action, a visualization, or
an alert.
Data Analysis Tools.
After transformation by ETL tools, the processed data
undergoes analysis to provide actionable insights for
organizations.
Most data analysis tools also help with visualization to
better understand insights.
Storage.
Cloud storage is the most popular option for storing event
streams.
Data lakes, warehouses, and object storage each have their own
pros and cons.
The storage unit needs to be able to store large amounts of
data, support record ordering, and enable consistency to
allow fast read-write of data.
Some options include Amazon S3 and Amazon Redshift.
Stream Computing
Stream computing is a way to analyse and process Big Data in
real time, to gain current insights for taking appropriate
decisions or predicting new trends in the immediate future.
• Implemented in a distributed, clustered environment
• Handles a high rate of incoming stream data
Model For Data Stream Processing
As stated, an infinite amount of data arrives continuously in
a data stream.
Assume D is the data stream which is the sequence of
transactions and can be defined as:
D = (T1, T2, …, Ti, Ti+1, …, Tj)
where,
T1: is the 1st transaction,
T2: is the 2nd transaction,
Ti: is the ith transaction, and
Tj: is the jth transaction.
There are three different models for data stream processing,
namely,
a. Landmark,
b. Sliding Windows
c. Damped
(a) Landmark model:
This model finds the frequent items in the entire data stream
from a specific point in time (known as the landmark) up to
the present.
In other words, the model finds the frequent items from Ti up
to the current time Tt, i.e., over the window W[i, t], where i
represents the landmark time.
If i = 1, the model finds the frequent items over the entire
data stream.
In this type of model, all time-points are treated equally
after the starting time.
Examples of the landmark model include a stock-monitoring
system, which observes and reports on the global stock market.
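A minimal sketch of the landmark model in Python: every transaction from the landmark position onward counts equally, and the frequent items are those whose count reaches a support threshold; the function name, threshold, and toy stream are illustrative assumptions.

from collections import Counter

def frequent_items_landmark(stream, landmark, min_support):
    """Counts items from position `landmark` (1-indexed, inclusive) up to
    the present; all time points after the landmark weigh equally."""
    counts = Counter()
    for t, item in enumerate(stream, start=1):
        if t >= landmark:          # ignore everything before the landmark
            counts[item] += 1
    return {item: c for item, c in counts.items() if c >= min_support}

# landmark = 1 covers the entire stream seen so far
print(frequent_items_landmark(["a", "b", "a", "c", "a", "b"],
                              landmark=1, min_support=2))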
(b) Sliding-window model:
This model considers only the most recent portion of the
stream: a window holding the last w transactions (or the last
w time units). As new transactions arrive, the oldest ones
expire from the window, so the mining results always reflect
recent data.
(c) Damped model:
This model is also called the Time-Fading model, as it assigns
more weight to recent transactions in the data stream, and
this weight keeps decreasing with age.
In other words, older transactions carry less weight than
newer transactions in the data stream.
This model is mostly used in applications where new data has
more impact on the mining results than old data, and the
impact of old data decreases with time.
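A minimal sketch of time-fading counts in Python (the same idea underlies the decaying window listed in the contents): on each arrival every existing count is multiplied by a decay factor close to 1, so a transaction of age k contributes decay**k; the decay value and the items are illustrative assumptions.

from collections import defaultdict

class DecayingCounts:
    """Time-faded counts: a transaction of age k has weight decay**k."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = defaultdict(float)

    def ingest(self, item):
        for key in self.counts:          # age all existing counts
            self.counts[key] *= self.decay
        self.counts[item] += 1.0         # newest transaction has full weight 1

counter = DecayingCounts(decay=0.9)
for item in ["a", "a", "b", "a", "b", "b", "b"]:
    counter.ingest(item)
print(dict(counter.counts))   # "b" dominates because its occurrences are recent

In practice one avoids touching every count on each arrival by storing, per item, a count together with the time of its last update and applying the accumulated decay lazily when that item is next seen or queried.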
Sampling
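One widely used technique for sampling a data stream is reservoir sampling, which maintains a uniform random sample of fixed size k from a stream of unknown (possibly unbounded) length; the sketch below is one standard variant (Algorithm R), and the stream contents and k are illustrative assumptions.

import random

def reservoir_sample(stream, k):
    """After t items have arrived, every item seen so far is in the
    sample with equal probability k/t."""
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            sample.append(item)           # fill the reservoir first
        else:
            j = random.randrange(t)       # uniform in [0, t)
            if j < k:
                sample[j] = item          # replace a random slot
    return sample

print(reservoir_sample(range(1000), k=10))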
Filtering Streams
Accept only those tuples in the stream that meet a criterion;
the others are dropped.
If the selection criterion is a property/attribute of a tuple
that can be calculated, then the selection is easy.
If the criterion requires looking up membership in a set, then
the selection is hard.
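The standard tool for the hard case is a Bloom filter: a bit array plus several hash functions that answers membership queries with no false negatives and a tunable false-positive rate. A minimal sketch in Python follows; the bit-array size, number of hash functions, and the salted use of hashlib.sha256 are illustrative assumptions, not a prescribed design.

import hashlib

class BloomFilter:
    """Approximate set membership: may say yes for an absent key (false
    positive), never says no for a present key (no false negatives)."""
    def __init__(self, n_bits=8192, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key):
        # Derive n_hashes bit positions by salting the key before hashing.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

allowed = BloomFilter()
allowed.add("alice@example.com")
# Filtering the stream: keep a tuple only if its key might be in the set.
print(allowed.might_contain("alice@example.com"))    # True
print(allowed.might_contain("mallory@example.com"))  # almost certainly False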