
Module II

MINING DATA STREAMS

Contents
Introduction to Streams Concepts – Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Window – Real-time Analytics Platform (RTAP) Applications – Case Studies: Real-Time Sentiment Analysis, Stock Market Predictions.

Introduction to stream concepts

 A data stream is a continuous, ordered sequence of items.
 The term "streaming" describes continuous, never-ending data streams with no beginning or end, which provide a constant feed of data that can be utilized/acted upon without needing to be downloaded first. Data streams are generated by all types of sources, in various formats and volumes.
 It involves enormous volumes of data; items arrive at a high rate.
 Examples: log files, social networks, financial trading, etc.

Characteristics of Data Streams:


1. Large volumes of continuous data, possibly infinite.
2. Constantly changing, requiring a fast, real-time response.
3. The data stream model captures many of today's data processing needs.
4. Random access is expensive, so single-scan (i.e., sequential) algorithms are used.
5. Only a summary of the data seen so far is stored.

Examples of Stream Sources


1. IoT sensors.
2. Security monitoring (automated applications).
3. Clickstream analysis from apps and websites (e.g., how much time users spend).
4. Financial markets.
5. Network monitoring and traffic engineering.

Applications of Data Streams:
1. Fraud detection.
2. Real-time trading.
3. Customer intelligence.
4. Monitoring and reporting on internal IT systems.

S.No. | DBMS | DSMS
----- | ---- | ----
01. | DBMS refers to Database Management System. | DSMS refers to Data Stream Management System.
02. | Database Management System deals with persistent data. | Data Stream Management System deals with stream data.
03. | In DBMS, random data access takes place. | In DSMS, sequential data access takes place.
04. | The data update rate in DBMS is relatively low. | The data update rate in DSMS is relatively high.
05. | In DBMS, the queries are one-time queries. | In DSMS, the queries are continuous.
06. | In DBMS, the query gives the exact answer. | In DSMS, the query gives an exact/approximate answer.
07. | DBMS provides no real-time service. | DSMS provides real-time service.
08. | DBMS uses an unbounded disk store, i.e., unlimited secondary storage. | DSMS uses bounded main memory, i.e., limited main memory.

What is DSMS Architecture?


 DSMS stands for Data Stream Management System.
 It is a software application, much like a DBMS (database management system), but it involves the processing and management of a continuously flowing data stream rather than static data such as Excel, PDF, or other files.
For example, Google queries. The sources of data streams include Internet traffic, online transactions, satellite data, sensor data, live event data, real-time surveillance systems, etc.

Stream processing
 It is a method of continuously ingesting, analysing and
acting on data as it’s generated.
 Unlike traditional batch processing, where data is
collected over a period of time and then processed in
chunks, stream processing operates on data as it is
collected, offering insights and actions within milliseconds
to seconds of data arrival.
Stream-Processing Architecture

1. Data producers
 To even get started with stream processing you need
some data to process, which means you need
something to create data in the first place.
 Some common data producers are things like IoT
sensors , software applications producing metrics or
user activity data, and financial data.
 The volume and frequency of data being produced
will affect the architecture needed for other
components in the stream-processing system.
 The component that collects this produced data is also called a message broker or stream processor.
2. Data ingestion
 The data ingestion component of a stream-
processing system is critical for ensuring reliability
and scalability.
 The data ingestion layer is what captures data from
various sources and acts as a buffer for data before
it is sent to be processed.
 A proper data ingestion layer will ensure data is collected in a fault-tolerant manner, so that no data is lost before processing and short-term increases in data volume do not overwhelm the stream-processing system.
 This is typically done by an ETL tool.
 A few examples of open-source ETL tools for streaming data are Apache Storm, Spark Streaming, and WSO2 Stream Processor.
 This stage often performs additional joins, transformations, or aggregations on the data.
 The result may be an action, a visualization, an alert, or in some cases a new data stream.
 It supports two types of queries: ad hoc queries and standing queries (illustrated in the sketch below).
 This layer provides the tools which can be used for querying and analysing the stored data stream.
 Standing Query: a query asked of the stream at all times (continuous).
 Example: alert whenever the current value exceeds 50 A.
 Ad hoc Query: a query asked of the stream one time.
 Example: what is the average of the current values captured so far?
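A minimal Python sketch of these two query types, assuming a hypothetical producer of current readings in amperes and the illustrative 50 A threshold from the example above:

```python
import random

def current_readings(n=100):
    """Hypothetical data producer: yields simulated current readings."""
    for _ in range(n):
        yield random.uniform(30.0, 60.0)

total, count = 0.0, 0

for value in current_readings():
    # Standing query: evaluated continuously on every arriving item.
    if value > 50.0:
        print(f"ALERT: current {value:.1f} A exceeds 50 A")
    # Keep a running summary so an ad hoc query can be answered later
    # without replaying the (unbounded) stream.
    total += value
    count += 1

# Ad hoc query: asked once, answered from the stored summary.
print(f"Average current so far: {total / count:.2f} A")
```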

Data processing /Data Analytics


 This is the heart of a stream-processing system,
where things like real-time analytics, data
transformation, filtering or enrichment are
performed.
 Depending on the volume of data and what needs to be done to it, this component could be anything from a simple Python script to a distributed computing framework (a sketch of the simple-script end of the spectrum follows).
 Some commonly used tools are Amazon Redshift, Elasticsearch, and Cassandra.
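As a sketch of the simple-script end of that spectrum, a hypothetical generator that filters and enriches click events as they arrive (the field names are illustrative, not from any particular system):

```python
def process(events):
    """Filter and enrich click events one at a time as they arrive."""
    for event in events:
        # Filtering: drop internal/test traffic.
        if event.get("user") == "test":
            continue
        # Enrichment: derive a new field from existing ones.
        event["is_mobile"] = "Mobile" in event.get("agent", "")
        yield event

clicks = [
    {"user": "alice", "agent": "Mobile Safari", "page": "/home"},
    {"user": "test", "agent": "curl", "page": "/health"},
]
for e in process(clicks):
    print(e)
```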

Data storage
 In many cases, once real-time analysis is done
using stream processing, there will still be a
need for storing the data for historical analysis
or archival purposes.
 Common solutions are things like data
warehouses , time series databases and object
storage.
Summary
Stream Processor.
 Popular stream processing tools include Apache Kafka and Amazon Kinesis. Also called the message broker, the stream processor collects data from a producer; this data then undergoes processing.
 The processed data is then streamed for continuous
consumption by other components.
Real-time ETL (Extract, Transform, Load) Tools.
 These tools help move real-time data between locations. ETL tools include Amazon Kinesis Data Streams, Azure Event Hubs, and Apache Storm.
 Data from streams usually arrives in unstructured formats and needs some form of processing, like aggregation and transformation, before querying by SQL tools.
 The result may be an API call, an action, a visualization, or
an alert.
Data Analysis tool.
 After transformation by ETL tools, the processed data
undergoes analysis to provide actionable insights for
organizations.
 Most data analysis tools also help with visualization to
better understand insights.
Storage.
 Cloud-storage options are the most popular options for
storing event streams.
 Data lakes, warehouses, and object storage each have their pros and cons.
 The storage unit needs to be able to store large amounts of
data, support record ordering, and enable consistency to
allow fast read-write of data.
 Some options include Amazon S3 and Amazon Redshift.

Stream Computing
Stream computing is a way to analyse and process Big Data in real time to gain current insights, take appropriate decisions, or predict new trends in the immediate future.
• Implemented in a distributed, clustered environment.
• Handles a high rate of incoming stream data.
Model For Data Stream Processing
 As stated, an infinite amount of data arrives continuously in
a data stream.
 Assume D is the data stream, a sequence of transactions, defined as:
D = (T1, T2, …, Ti, Ti+1, …, Tj)
where T1 is the 1st transaction, T2 is the 2nd transaction, Ti is the i-th transaction, and Tj is the j-th transaction.
There are three different models for data stream processing,
namely,
a. Landmark,
b. Sliding Windows
c. Damped
(a) Landmark model:
 This model finds the frequently used items in the entire data stream from a specific time (known as the landmark) till the present.
 In other words, the model finds frequent items from Ti up to the current time Tt, i.e., over the window W[i, t], where i represents the landmark time.
 If i = 1, the model finds the frequent items over the entire data stream.
 In this type of model, all time points after the starting time are treated equally.
 Examples of the landmark model include a stock monitoring system that observes and reports on the global stock market. A sketch follows.
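A minimal sketch of the landmark model, with a hypothetical transaction list and landmark index:

```python
from collections import Counter

transactions = ["a", "b", "a", "c", "a", "b", "a"]  # T1 .. T7
landmark = 3  # i = 3: consider the window W[3, t]

# Count everything from the landmark Ti to the current time Tt;
# all time points after the landmark are treated equally.
counts = Counter(transactions[landmark - 1:])
print(counts.most_common(2))  # most frequent items since the landmark
```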

(b) Sliding Windows model:
 This model stores recent data in a sliding window of a certain range and discards old data items.
 The size of the sliding window may vary according to the type of application.
 The window updates according to the current time.
 The part of the data stream that falls within the range of the sliding window is retrieved at a particular time point, as in the sketch below.
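A minimal sketch of the sliding-windows model using a fixed-size window, where old items are discarded automatically:

```python
from collections import deque

window_size = 4
window = deque(maxlen=window_size)  # old items fall off the left end

for item in [10, 20, 30, 40, 50, 60]:
    window.append(item)
    print(f"window now holds: {list(window)}")
```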

(c) Damped model:
 This model is also called the Time-Fading model, as it assigns more weight to the recent transactions in the data stream, and this weight keeps decreasing with age.
 In other words, older transactions have less weight than newer transactions in the data stream.
 This model is mostly used in applications where new data has more impact on the mining results than old data, and the impact of old data decreases with time. A sketch follows.
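A minimal sketch of the damped model as an exponentially decayed count; the decay factor here is an assumption, chosen only for illustration:

```python
decay = 0.9          # assumed decay factor applied per arriving transaction
weighted_count = 0.0

for item in ["x", "y", "x", "x", "y", "x"]:
    weighted_count *= decay    # age every previous contribution
    if item == "x":
        weighted_count += 1.0  # the newest transaction gets full weight 1
    print(f"decayed count of 'x': {weighted_count:.3f}")
```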

Sampling Data in a Stream

Sampling

 The process of collecting a representative collection of elements from the entire stream.
 Smaller in size.
 Reduces computational cost.
 A good sample retains all the significant characteristics and behaviour of the stream.
 A sample helps to estimate/predict many crucial aggregates on the stream.

Sampling Techniques in Big Data Stream

1. Fixed Proportion Sampling
2. Fixed Size Sampling
3. Biased Reservoir Sampling
4. Concise Sampling

1. Fixed Proportion Sampling
 It samples the data at a fixed proportion.
 Proportion simply means percentage.
 It can be used when we are aware of the length of the data (i.e., the approximate count of the entire data stream).
 It mostly ensures a representative sample.
 It is used when the data volume is very large (billions of items), which needs high computational power and resources.
 It may lead to under-representation or over-representation. (See the sketch below.)
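A minimal sketch of fixed proportion sampling that keeps roughly a 1-in-10 sample; hashing a key (here, a hypothetical user name) rather than drawing a random number per item keeps the sample consistent for each user:

```python
import hashlib

def keep(key, buckets=10):
    """Keep the item if its key hashes into bucket 0 (a 1-in-10 sample)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % buckets == 0

queries = [("alice", "weather"), ("bob", "news"), ("carol", "maps")]
sample = [(user, q) for (user, q) in queries if keep(user)]
print(sample)
```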
2. Fixed Size Sampling
 This technique samples a fixed number of data points.
 It does not guarantee a representative sample.
 When you have very large data, it is useful for reducing volume. A reservoir-based sketch follows.
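Fixed-size sampling is commonly implemented with the classic reservoir algorithm; a minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep exactly k items; each element ends up in the sample with
    equal probability k/n after n items have arrived."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # uniform over i+1 positions
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1000), k=5))
```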

3. Biased Reservoir Sampling


 It is used in streams to select a subset of the data in a way that is not uniformly random.
 This sampling can lead to a biased sample that may not be representative of the full dataset.
 The selection of elements is based on a predetermined probability distribution that may be weighted towards certain elements or groups of elements.
 The probability distribution used for biased reservoir sampling may be based on various factors, such as the frequency of occurrence of certain types of data or the importance of certain data points.
 It is used when there are constraints on the resources available for sampling, such as limited memory or computational power.
 It is important to carefully consider the potential biases introduced by this sampling technique and adjust the analysis accordingly. A recency-biased sketch follows.
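A minimal sketch of one simple biased variant, weighted towards recent items; the insertion probability is an assumption:

```python
import random

def biased_reservoir(stream, k, p_insert=0.5):
    """Sample biased toward recent items: overwriting a random slot
    makes older elements exponentially less likely to survive, so the
    result is deliberately not uniformly random."""
    reservoir = []
    for item in stream:
        if len(reservoir) < k:
            reservoir.append(item)
        elif random.random() < p_insert:
            reservoir[random.randrange(k)] = item  # displace an old item
    return reservoir

print(biased_reservoir(range(1000), k=5))
```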

4. Concise Sampling Technique


 Similar to fixed size sampling.
 But the goal is to maintain a small reservoir of a fixed size while still achieving representative sampling of the data stream.
 The number of samples that can be stored in memory at a given time is limited, which can be a challenge when dealing with large data streams.
 The size of the sample may need to be adjusted based on the amount of memory available to store the data.
 Instead of selecting samples randomly, the sampling algorithm may prioritize choosing samples with unique or representative values of a particular attribute in the data stream. (See the sketch below.)
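A minimal sketch of the idea behind concise sampling, storing repeated values as (value, count) pairs so a bounded memory budget represents a larger sample; the sampling rate and the eviction step are simplifications of the full algorithm:

```python
from collections import Counter
import random

def concise_sample(stream, max_entries, rate=0.5):
    """Store the sample as (value, count) pairs so duplicates cost no
    extra memory; evict the rarest entry when over budget (the full
    algorithm instead lowers the sampling rate and re-samples)."""
    sample = Counter()
    for item in stream:
        if random.random() < rate:      # sampling rate (assumed)
            sample[item] += 1           # duplicate values share one entry
            if len(sample) > max_entries:
                rarest = min(sample, key=sample.get)
                del sample[rarest]
    return sample

print(concise_sample(["a", "b", "a", "a", "c", "b"] * 10, max_entries=2))
```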

Problems on Data Streams


Types of queries one wants answered on a data stream:
1. Sampling data from a stream
 Example: a search engine. It is practically not feasible to store the entire data in a single storage environment, so we must decide which data to select for analysis; a random sample should be constructed.

2. Queries over a sliding window
 A sliding window highlights the most recent elements in the data stream.
 Applying a query over it can be difficult.
 Example: the number of items of type x in the last k elements of the stream.
 Determining which most recent k elements are essential for answering a query under the sliding window concept is the next challenge.

3. Filtering a data stream
 Select elements with property x from the stream.
 Accept only those tuples in the stream that meet a criterion; others are dropped.

4. Counting distinct elements
 The number of distinct elements in the last k elements of a stream.
 Example: a search engine, where the query-engine experts have to find out how many unique queries have hit their engine. (A minimal exact-count sketch follows this list.)
5. Estimating moments
 In a data stream environment, the distribution of the data is not known in advance.
 In this scenario, estimating moments such as the average and standard deviation for statistical purposes is a challenge.
6. Finding frequent elements
 In a search engine, of the queries that hit the engine, which query has the largest number of hits?
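A minimal exact-count sketch for problem 4 above; it is only feasible while the set of distinct elements fits in memory, which is why streaming algorithms approximate this count in bounded space instead:

```python
seen = set()
for query in ["cats", "dogs", "cats", "news", "dogs", "maps"]:
    seen.add(query)  # duplicates are ignored by the set
print(f"distinct queries so far: {len(seen)}")
```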
 Once these challenges are addressed, the applications include:

1. Mining query streams
2. Mining click streams
3. Mining social network news feeds
4. Sensor networks
5. Telephone call records
6. IP packets monitored at a switch

Filtering Streams
 Accept only those tuples in the stream that meet a criterion.
 Others are dropped.
 If the selection criterion is a property/attribute of a tuple that can be calculated, then the selection is easy.
 If the criterion requires a lookup of membership in a set, then the selection is hard; a common approach is sketched below.
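A common approach to the hard membership case is a Bloom filter: a bit array plus several hash functions that answers "possibly in the set" or "definitely not in the set". A minimal sketch (the array size and hash construction are assumptions made for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several hash positions by salting one digest.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

allowed = BloomFilter()
allowed.add("alice@example.com")
print(allowed.might_contain("alice@example.com"))    # True
print(allowed.might_contain("mallory@example.com"))  # almost surely False
```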
