Module II
Traditional DBMS store data that are finite and persistent; the data is available whenever we want it. In big data analytics and data mining, data is assumed to arrive in streams, and if it is not processed immediately it is lost. Data streams are continuous flows of data. Sensor data, network traffic, call centre records, satellite images, and data from electric power grids are popular examples of data streams. Data streams have the following characteristics:
● Infinite
● Massive
● Fast changing
● New classes may evolve that are difficult to fit into the existing classes (concept evolution)
● The relationship between the input data and the output may change over time (concept drift)
Apart from these unique characteristics, there are some potential challenges in data stream mining:
● It is not manually possible to label all the data points in the stream.
● Concept drift
● Concept evolution
● The speed and huge volume of the data make it difficult to mine; only single-scan algorithms are feasible.
● It is difficult to query streams with SQL-based tools due to the lack of schema and structure.
A streaming data model processes data immediately as it is generated and then continues to store it. The architecture also includes various additional tools for real-time processing, data manipulation and analysis. The main components of a streaming data architecture are described below.
Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special circumstances
using time-consuming retrieval processes. There is also a working store, into which summaries
or parts of streams may be placed, and which can be used for answering queries. The working
store might be disk, or it might be main memory, depending on how fast we need to process
queries. But either way, it is of sufficiently limited capacity that it cannot store all the data from
all the streams.
1. The Message Broker (Stream Processor)
Takes data coming from various sources, translates it into a standard format, and streams it on an ongoing basis. Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
2. Batch and Real-time ETL Tools
ETL stands for Extract, Transform and Load. It is the process of moving a huge volume of unstructured data from one source to another, and is basically a data integration process. ETL tools aggregate data streams from one or more message brokers. The ETL tool or platform receives queries from users, fetches events from the message queues and applies the query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
A few examples of open-source ETL tools for streaming data are Apache Storm, Spark Streaming and WSO2 Stream Processor.
3. Query Engine
After streaming data is prepared for consumption by the stream processor, it must be analysed to provide value. Some commonly used data analytics tools are Amazon Athena, Amazon Redshift, Elasticsearch and Cassandra.
4. Data Storage
Streams may be archived in a large archival store, but it is not possible to answer
queries from the archival store.
The advent of low-cost storage technologies has paved the way for organizations to store streaming event data. Events can be stored in several ways: in a database or data warehouse, in the message broker, or in a data lake.
A data lake is the most flexible and inexpensive option for storing event data, but its latency (the time required to retrieve data from storage) is too high for real-time analysis.
A modern streaming data architecture can handle both bounded and unbounded data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data architecture powered by Apache Flink to update product details and inventory information in real time. Netflix also uses Flink to support its recommendation engines, and ING, the global bank based in the Netherlands, uses the architecture to prevent identity theft and provide better fraud protection. Other platforms that can accommodate both stream and batch processing include Apache Spark, Apache Storm, Google Cloud Dataflow and AWS Kinesis.
Sources of Stream Data
1. Sensor Data
Imagine a temperature sensor bobbing about in the ocean, sending back to a
base station a reading of the surface temperature each hour. The data produced by this
sensor is a stream of real numbers.
2. Image Data
Satellites often send down to earth streams consisting of many terabytes of images
per day. Surveillance cameras produce images with lower resolution than satellites, but
there can be many of them, each producing a stream of images at intervals like one second.
London is said to have six million such cameras, each producing a stream.
3. Internet and Web Traffic
A switching node in the middle of the Internet receives streams of IP packets from many inputs and routes them to its outputs. Normally, the job of the switch is to transmit data and not to retain it or query it. But there is a tendency to put more capability into the switch, e.g., the ability to detect denial-of-service attacks or the ability to reroute packets based on information about congestion in the network.
Stream Queries:
There are two types of stream queries: standing queries and ad-hoc queries.
1. Standing queries
E.g., consider a temperature sensor bobbing about in the ocean, sending back to a base
station a reading of the surface temperature each hour. The data produced by this sensor is a
stream of real numbers. In this case we can ask a standing query: what is the maximum temperature ever recorded by the sensor? To answer this query we need not store the entire stream. When a
new stream element arrives, we compare it with the stored maximum, and set the maximum to
whichever is larger. Similarly, if we want the average temperature over all time, we have only
to record two values: the number of readings ever sent in the stream and the sum of those
readings.
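As an illustration of how little state such standing queries need, here is a minimal Python sketch (the class and variable names are illustrative, not from the source) that maintains the maximum and the all-time average of a temperature stream using only three stored numbers.

class TemperatureStandingQuery:
    """Answers two standing queries over a sensor stream: the maximum
    temperature ever seen and the average over all time.
    Only three numbers are stored, regardless of the stream length."""

    def __init__(self):
        self.maximum = float("-inf")  # largest reading so far
        self.count = 0                # number of readings seen
        self.total = 0.0              # sum of all readings

    def process(self, reading: float) -> None:
        # Compare the new element with the stored maximum.
        if reading > self.maximum:
            self.maximum = reading
        self.count += 1
        self.total += reading

    def average(self) -> float:
        return self.total / self.count if self.count else float("nan")


# Hypothetical usage with a few hourly readings:
q = TemperatureStandingQuery()
for t in [21.4, 22.0, 20.8, 23.1]:
    q.process(t)
print(q.maximum, q.average())   # 23.1 21.825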
2. Ad-hoc queries
E.g., asked once about the current state of streams. Such queries are difficult to answer
as we are not archiving the entire stream. For answering such ad-hoc queries we have
to store the parts or summaries of the stream.
Sliding Window:
A sliding window query operates only on the most recent portion of the stream. Windows can be time-based or count-based.
Time-based
○ The range and stride are specified as time intervals.
○ For example, a sliding window with range = 10 minutes and stride = 2 minutes produces a window that covers the data from the last 10 minutes, and a new window is created every 2 minutes.
Count-based
○ The range and stride are specified as numbers of elements; for example, a window over the most recent N elements of the stream (a small sketch follows below).
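Below is a minimal sketch of a count-based sliding window in Python, assuming a window over the last N elements (N and the windowed-average query are illustrative choices); collections.deque with maxlen drops the oldest element automatically.

from collections import deque

N = 5                       # assumed window size (count-based)
window = deque(maxlen=N)    # keeps only the most recent N elements

def process(element):
    window.append(element)            # oldest element is dropped once len == N
    return sum(window) / len(window)  # e.g. a windowed average query

for x in [3, 8, 2, 7, 4, 9, 1]:
    avg = process(x)
    print(list(window), avg)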
Stream Computing
A high-performance computer system that analyses multiple data streams from many sources, live.
The word stream in stream computing refers to pulling in streams of data, processing the data and streaming it back out as a single flow.
Stream computing uses software algorithms that analyse the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.
It continuously integrates and analyses data in motion to deliver real-time analytics.
Stream computing enables organizations to detect insights (risks and opportunities) in high-velocity data.
Stream computing enables organizations to:
● Analyse and act upon rapidly changing data in real time.
● Enhance existing models with new insights.
● Capture, analyse and act on insights before opportunities are lost forever.
Sampling Data in a Stream
As mentioned earlier, a data stream is a massive, infinite dataset, so it is not possible to store the entire stream. A typical question that arises when mining a data stream is how we can answer critical queries without storing the entire stream. In some cases we can get an answer from samples of the stream, without examining the whole stream. Here we have to keep two things in mind: the sample should be unbiased, and it should be able to answer the queries of interest. Choosing the right sample is critical; carelessly chosen samples can destroy the results of a query, so while sampling we must be aware of some pitfalls.
An Example
Consider a search engine like Google that receives a stream of queries, and suppose Google wants to study the behaviour of its users. A typical question is: what fraction of the typical user’s queries were repeated over the past month? Assume that we have space to store only 1/10th of the stream elements.
The obvious approach would be to generate a random number, say an integer from 0 to
9, in response to each search query. Store the tuple if and only if the random number is 0. If
we do so, each user has, on average, 1/10th of their queries stored. Statistical fluctuations will
introduce some noise into the data, but if users issue many queries, the law of large numbers
will assure us that most users will have a fraction quite close to 1/10th of their queries stored.
This scheme gives us the wrong answer to the query asking for the average number of
duplicate queries for a user. Suppose a user has issued s search queries one time in the past
month, d search queries twice, and no search queries more than twice. If we have a 1/10th
sample of queries, we shall see in the sample for that user an expected s/10 of the search queries
issued once. Of the d search queries issued twice, only d/100 will appear twice in the sample;
that fraction is d times the probability that both occurrences of the query will be in the 1/10th
sample. Of the queries that appear twice in the full stream, 18d/100 will appear exactly once.
[The sample will contain s/10 of the singleton queries and 19d/100 of the duplicate queries at least once, but only d/100 pairs of duplicates, since d/100 = 1/10 × 1/10 × d.]
To see why, note that 18/100 is the probability that one of the two occurrences will be in the
1/10th of the stream that is selected, while the other is in the 9/10th that is not selected.
The correct answer to the query about the fraction of repeated searches is d/(s+d). However,
the answer we shall obtain from the sample is d/(10s+19d).
To derive this, note that in the sample d/100 of the queries appear twice, while s/10 + 18d/100 appear once. Thus the fraction appearing twice in the sample is (d/100) divided by (d/100 + s/10 + 18d/100), which simplifies to d/(10s + 19d).
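The algebra above can be checked numerically. The following sketch (with s and d chosen arbitrarily for illustration) compares the true fraction of repeated searches, d/(s + d), with the fraction obtained from a 1/10th sample of individual query occurrences, d/(10s + 19d).

# s: queries a user issued exactly once; d: queries issued exactly twice.
s, d = 1000, 200   # arbitrary illustrative values

true_fraction = d / (s + d)

# Expected counts inside a 1/10th sample of query occurrences:
twice_in_sample = d / 100                 # both copies kept: (1/10)*(1/10)*d
once_in_sample = s / 10 + 18 * d / 100    # singletons kept + doubles kept once
sample_fraction = twice_in_sample / (twice_in_sample + once_in_sample)

print(true_fraction)            # 0.1666...
print(sample_fraction)          # equals d / (10*s + 19*d)
print(d / (10 * s + 19 * d))    # about 0.0145 for these values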
Solution
● Pick 1/10th of users and take all their searches in the sample.
● Use a hash function that hashes the user name or user id uniformly into 10 buckets.
● Each time a search query arrives in the stream, we look up the user to see whether or not they are in the sample. If so, we add the search query to the sample; if not, we discard it.
● By using a hash function, one can avoid keeping an explicit list of users: hash each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, accept the search query for the sample; otherwise discard it (see the sketch below).
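A minimal Python sketch of this bucket-based user sampling, assuming hashlib for a deterministic hash (the function and variable names are illustrative):

import hashlib

BUCKETS = 10          # sample roughly 1/10th of users

def bucket(user_id: str) -> int:
    """Hash the user id uniformly into one of BUCKETS buckets."""
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return int(digest, 16) % BUCKETS

sample = []           # stores (user, query) tuples for sampled users

def process(user_id: str, query: str) -> None:
    # Keep the query only if the user falls into bucket 0.
    if bucket(user_id) == 0:
        sample.append((user_id, query))

# Hypothetical stream of (user, query) pairs:
for user, q in [("u1", "weather"), ("u2", "news"), ("u1", "weather")]:
    process(user, q)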
General Solution
More generally, to obtain a sample consisting of a fixed fraction a/b of the stream, based on a key attribute such as the user, hash the key value to b buckets and accept the element into the sample if the hash value is less than a.
Filtering Streams
Filtering, or selection, involves accepting the stream elements that satisfy a particular criterion. There are different methods for filtering streams; the problem is hard when the criterion involves membership in a large set.
Applications of filtering:
● Publish-subscribe systems
○ You are collecting lots of messages (news articles).
○ People express interest in certain sets of keywords.
○ Determine whether each message matches user’s interest.
Example
Suppose we want to check whether a username chosen by a new user has already been taken, i.e., whether it belongs to the set of billions of existing usernames.
• Linear search: obviously a bad idea, because there may be billions of accounts.
• Binary search: the usernames must be stored in sorted order, and even then it may not be feasible to search among billions of entries in memory.
A Bloom filter solves this membership problem using a small amount of memory, at the cost of occasional false positives.
Bloom Filter
● An empty Bloom filter is a bit array of m bits, all set to zero.
● We need k hash functions to calculate the hashes for a given input.
○ Assume that each hash function maps each item in the universe to a random number uniformly over the range of indices.
● To add an element x to the filter, the bits at the k indices h1(x), h2(x), …, hk(x) are set to 1, where the indices are calculated using the hash functions.
○ A bit in the array may be set to 1 multiple times, for different elements.
● Example – Suppose we want to enter “good” into the filter, using 3 hash functions and a bit array of length 10, all bits initially set to 0. First we calculate the hashes as follows:
h1(“good”) % 10 = 1
h2(“good”) % 10 = 4
h3(“good”) % 10 = 7
Now we set the bits at indices 1, 4 and 7 to 1 for the element “good”. Similarly, to enter “bad” we calculate:
h1(“bad”) % 10 = 3
h2(“bad”) % 10 = 5
h3(“bad”) % 10 = 4
and we set the bits at indices 3, 5 and 4 to 1 for the element “bad”.
Now, to check whether an element is present, we do the reverse process: calculate the respective hashes using h1, h2 and h3 and check whether all of these indices are set to 1 in the bit array. If all the bits are set, the element is probably present. If any of the bits at these indices is 0, the element is definitely not present.
The result “probably present” carries some uncertainty. Let’s understand this with an example.
Suppose we want to check whether “cat” is present or not. We’ll calculate hashes using h1, h2
and h3.
h1 (“cat”) % 10 = 1
h2 (“cat”) % 10 = 3
h3 (“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “good”, and the bit at index 3 was set when we added “bad”. So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive.
● By controlling the size of the Bloom filter we can control the probability of getting false positives.
● Using a larger bit array, together with an appropriate number of hash functions, reduces the false positive rate (a minimal implementation sketch follows).
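The following is a minimal Bloom filter sketch in Python, not a production implementation; it derives k indices by salting a single md5 hash, which is just one possible choice, and its indices will not match the illustrative h1, h2, h3 values used above.

import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                  # number of bits in the array
        self.k = k                  # number of hash functions
        self.bits = [0] * m

    def _indices(self, item: str):
        # Derive k indices by salting a single hash function with i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1      # a bit may be set more than once

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=10, k=3)
bf.add("good")
bf.add("bad")
print(bf.might_contain("good"))   # True
print(bf.might_contain("cat"))    # may be True or False; a false positive is possible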
Probability of a false positive: let m be the size of the bit array, k the number of hash functions, and n the expected number of elements to be inserted into the filter. Then the probability P of a false positive can be calculated as:
$$P = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k}$$
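A small sketch that evaluates this formula for assumed values of m, k and n:

def false_positive_probability(m: int, k: int, n: int) -> float:
    """P = (1 - (1 - 1/m)^(k*n))^k"""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

# Illustrative values: a 10,000-bit array, 3 hash functions, 1,000 inserted keys.
print(false_positive_probability(m=10_000, k=3, n=1_000))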
Generalization
In general, a Bloom filter consists of:
1. An array of m bits, initially all 0’s.
2. A collection of k hash functions h1, h2, . . . , hk. Each hash function maps “key” values to m buckets, corresponding to the m bits of the bit array.
3. A set S of n key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S.
The hash functions used in Bloom filters should be independent and uniformly distributed. They should also be as fast as possible.
Applications of Bloom filters:
● Quora implemented a sharded Bloom filter in its feed backend to filter out stories that people have seen before.
● The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
● Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns.
Counting Distinct Elements in a Stream
The problem is to count the number of distinct elements in a data stream that contains repeated elements. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of sensor/RFID networks.
Definition
Given a stream of elements {x1, x2, ..., xs} with repetitions, and an integer m, let n be the number of distinct elements, namely n = |{x1, x2, ..., xs}|, and let these distinct elements be {e1, e2, ..., en}. The objective is to find an estimate of n using only m storage units, where m ≪ n.
In other words, the data stream consists of elements chosen from a universe of size N, and we must maintain a count of the number of distinct elements seen so far.
Example
Suppose Google wants to gather the statistics of unique users it has seen in each month.
Google does not require a unique login to issue a search query. The only way to recognize
users is to identify the IP addresses from which the queries are issued. In this case the 4 billion
IP addresses serve as the universal set.
Solution
● Keep a list of all elements seen so far, in a hash table or a search tree in main memory.
● When a new query arrives, check whether the IP address from which the query was issued is already in the list.
● If it is not there, add the new IP address; otherwise discard it.
The above solution works well as long as the number of distinct elements is not too large. The problem arises when the number of distinct elements is too great, or when many streams must be processed at once, so that the data does not fit in main memory.
Flajolet-Martin Algorithm
This algorithm approximates the number of distinct elements in a stream or a database in one pass. If the stream consists of n elements, m of which are unique, the time complexity of the algorithm is O(n) and it requires O(log m) memory.
Algorithm
Step 1: Apply the hash function h(x) to each element in the stream.
Step 2: Write the binary equivalent of each hash value obtained.
Step 3: Count the number of trailing zeros (zeros at the end) of each binary hash value.
Step 4: Take the maximum number of trailing zeros observed; call it r.
Step 5: The estimated number of distinct elements is R = 2^r (a code sketch of these steps follows below).
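A minimal Flajolet-Martin sketch in Python that follows these steps; the hash function h(x) = (6x + 1) mod 5 is the one used in the worked example below, and, as in that example, a hash value of 0 is treated as contributing 0 trailing zeros, so this is only an illustration rather than a robust estimator.

def trailing_zeros(value: int) -> int:
    """Number of trailing zeros in the binary representation of value.
    A hash value of 0 is treated as contributing 0 (as in the example below)."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    r = 0                          # maximum number of trailing zeros seen
    for x in stream:
        r = max(r, trailing_zeros(h(x)))
    return 2 ** r                  # estimated number of distinct elements

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
h = lambda x: (6 * x + 1) % 5
print(flajolet_martin(stream, h))  # 4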
Example
Consider the data stream of integers 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1. Estimate the number of distinct elements if the hash function is h(x) = (6x + 1) mod 5.
Solution:
Since the data stream is small, we can readily see that the number of distinct elements is four (1, 2, 3 and 4); the algorithm should give an estimate close to this.
Step 1: Apply h(x) = (6x + 1) mod 5 to each element of the stream:
h(1) = 2, h(3) = 4, h(2) = 3, h(1) = 2, h(2) = 3, h(3) = 4, h(4) = 0, h(3) = 4, h(1) = 2, h(2) = 3, h(3) = 4, h(1) = 2.
Step 2: Write the binary equivalent of each hash value:
010, 100, 011, 010, 011, 100, 000, 100, 010, 011, 100, 010.
Step 3: Count the trailing zeros of each binary value (the all-zero value for h(4) contributes 0 here):
1, 2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 1.
Step 4: The maximum number of trailing zeros is r = 2.
Step 5: The estimated number of distinct elements is R = 2^r = 2^2 = 4.
Hence the estimate is 4 distinct elements: 1, 2, 3, 4.
Estimating Moments
Definition of Moments
Consider a data stream of elements drawn from a universal set. Let m_i be the number of occurrences of the i-th element, for any i. Then the k-th moment of the stream is the sum over all i of (m_i)^k.
The 1st moment is the sum of the m_i’s, which is simply the length of the stream.
The 2nd moment is the sum of the squares of the m_i’s. It is also called the “surprise number”, since it measures how uneven the distribution of elements is.
Example
Suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would have one element appearing 10 times and the other ten appearing 9 times each. In this case the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each; then the surprise number would be 90^2 + 10 × 1^2 = 8110.
As the above example shows, moments of any order can be computed exactly as long as the stream fits in main memory. When the stream does not fit in memory, we can estimate the k-th moment by keeping a limited number of values and computing an estimate from these values. Even when there is not enough storage space, the second moment can still be estimated using the AMS (Alon-Matias-Szegedy) algorithm, described below.
Algorithm
Consider a stream of length n. Instead of examining all the elements, we pick a number of positions in the stream at random and define a variable X for each. For a variable X defined at position p, X.element is the element found at position p, and X.value is the number of times X.element appears at or after position p. The estimate of the second moment derived from X is n(2 × X.value − 1); with several variables, we average their estimates.
Example
● Consider the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. Its length is n = 15. Since a appears 5 times, b appears 4 times, and c and d appear 3 times each, the second moment of the stream is 5^2 + 4^2 + 3^2 + 3^2 = 59.
Suppose we keep three variables, X1, X2, and X3.
● Assume that at “random” we pick the 3rd, 8th, and 13th positions to define these three
variables. When we reach position 3, we find element c, so we set X1.element = c and
X1.value = 1. Position 4 holds b, so we do not change X1. Likewise, nothing happens
at positions 5 or 6. At position 7, we see c again, so we set X1.value = 2.
● At position 8 we find d, and so set X2.element = d and X2.value = 1. Positions 9 and
10 hold a and b, so they do not affect X1 or X2.
● Position 11 holds d so we set X2.value = 2, and position 12 holds c so we set X1.value
= 3.
● At position 13, we find element a, and so set X3.element = a and X3.value = 1. Then,
at position 14 we see another a and so set X3.value = 2.
● Position 15, with element b does not affect any of the variables, so we are done, with
final values,
X1.value = 3
X2.value = 2
X3.value = 2.
We can derive an estimate of the second moment from any variable X. This estimate
is n(2X.value − 1).
From the previous example, for X1, we derive the estimate n(2X1.value − 1)
= 15 × (2 × 3 − 1) = 75.
The other two variables, X2 and X3, each have value 2 at the end, so their
estimates are 15 × (2 × 2 − 1) = 45.
Recall that the true value of the second moment for this stream is 59. On the
other hand, the average of the three estimates is 55, a fairly close approximation.
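A minimal sketch of the AMS estimate in Python, using the stream and the (1-based) sampled positions from this example; the function name is illustrative.

def ams_estimate(stream, positions):
    """Estimate the second moment from variables defined at the given
    1-based positions, as n * (2 * X.value - 1), averaged over the variables."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p - 1]
        # X.value = occurrences of element from position p to the end of the stream
        value = stream[p - 1:].count(element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")          # a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
print(ams_estimate(stream, [3, 8, 13]))   # (75 + 45 + 45) / 3 = 55.0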
The DGIM Algorithm (Counting Ones in a Window)
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm estimates the number of 1’s in the last N bits of a binary stream without storing all N bits.
● It allows us to estimate the number of 1’s in the window with an error of no more than 50%.
Components:
Timestamp
Buckets
Each bit that arrives has a timestamp for the position at which it arrives: the first bit has timestamp 1, the second bit timestamp 2, and so on. Since only the most recent N positions matter, timestamps can be represented relative to the window size N.
The window is divided into buckets, each consisting of 1’s and 0’s.
Rules:
1. The right end of every bucket is a position with a 1 (a bucket cannot end in a 0). E.g., the block 1001011 can form a bucket of size 4: it contains four 1’s and its right end is a 1.
2. Every bucket must contain at least one 1; otherwise no bucket is formed.
3. The size of a bucket (its number of 1’s) must be a power of 2.
4. Bucket sizes cannot decrease as we move to the left (they are non-decreasing towards the older end of the window).
Example
Bucket sizes from left (oldest) to right (most recent): 2^2 = 4, 2^2 = 4, 2^1 = 2, 2^1 = 2, 2^0 = 1.
In this data stream, assume new bits arrive at the right. If the new bit is 0, the buckets do not change. If the new bit is 1, a new bucket of size 1 is created for it; if that leaves three buckets of the same size, the two oldest of them are merged into one bucket of twice the size, and this merging is repeated as needed.
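A simplified sketch of DGIM bucket maintenance in Python, under the assumptions that each bucket is stored as (timestamp of its rightmost 1, size), that absolute timestamps are kept rather than timestamps modulo N, and that the estimate counts all buckets fully except half of the oldest; this is a sketch of the update rule, not a complete DGIM implementation.

N = 16                 # window size (number of most recent bits considered)
buckets = []           # list of (timestamp, size), oldest first
current_time = 0

def new_bit(bit: int) -> None:
    global current_time
    current_time += 1
    # Drop buckets whose right end has fallen out of the window.
    while buckets and buckets[0][0] <= current_time - N:
        buckets.pop(0)
    if bit == 0:
        return                        # 0's never create or change buckets
    buckets.append((current_time, 1))  # every 1 starts a new bucket of size 1
    # Merge whenever three buckets of the same size exist:
    size = 1
    while sum(1 for _, s in buckets if s == size) == 3:
        # The two OLDEST buckets of this size are merged into one of twice the size.
        idx = next(i for i, (_, s) in enumerate(buckets) if s == size)
        merged = (buckets[idx + 1][0], size * 2)   # keep the more recent right-end timestamp
        buckets[idx:idx + 2] = [merged]
        size *= 2

def estimate_ones() -> float:
    """Estimate the number of 1's in the last N bits:
    all buckets counted fully, except only half of the oldest bucket."""
    if not buckets:
        return 0.0
    return sum(size for _, size in buckets[1:]) + buckets[0][1] / 2

# Feed a few bits and report the (approximate) count of 1's in the window:
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    new_bit(b)
print(buckets, estimate_ones())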
Decaying Window
This approach is used for finding the most popular elements in a stream. It can be seen as a generalization of the fixed-length window used in DGIM: instead of a sharp cutoff, recent elements are weighted more heavily than older ones.
Let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10^-6 or 10^-9.
● Define the exponentially decaying window for this stream to be the sum
$$\sum_{i=0}^{t-1} a_{t-i}\,(1 - c)^{i}$$
When a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by (1 − c).
2. Add a_{t+1}.
Decaying Window Algorithm:
The goal is to identify the most popular (trending) elements in an incoming data stream.
1. For each distinct element, imagine a separate bit stream that has a 1 at each position where that element appears and a 0 elsewhere, and weight the positions by (1 − c)^i as above.
2. Calculate an aggregate score for each distinct element by adding up all the weights assigned to its occurrences:
$$\sum_{i=0}^{t-1} a_{t-i}\,(1 - c)^{i}$$
Example
Consider a stream of hashtags containing the tags fifa and ipl, and let c = 0.1. A separate decaying score is kept for each tag: each time a new tag arrives, every score is multiplied by (1 − c) = 0.9, and then 1 is added to the score of the arriving tag (and 0 to every other tag). For instance, if ipl arrives first and fifa arrives next:
ipl arrives: ipl = 0 × 0.9 + 1 = 1, fifa = 0 × 0.9 + 0 = 0
fifa arrives: ipl = 1 × 0.9 + 0 = 0.9, fifa = 0 × 0.9 + 1 = 1
Continuing in this way to the end of the stream, the final score of fifa is 2.135 while that of ipl is 3.7264. So ipl is trending more than fifa, and even though both tags occur almost the same number of times in the input, their scores differ because of when they occurred.
This mechanism gives more weight to new elements, which is what produces the right trending output.
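A minimal sketch of per-tag decaying scores in Python, assuming c = 0.1 and a made-up sequence of tags (the tag order shown is illustrative, not the stream from the example above). In practice one would avoid touching every score on each arrival, for example by dropping scores that fall below a small threshold.

c = 0.1
scores = {}                 # decaying score per tag

def process(tag: str) -> None:
    # Multiply every existing score by (1 - c), then add 1 for the arriving tag.
    for t in scores:
        scores[t] *= (1 - c)
    scores[tag] = scores.get(tag, 0.0) + 1.0

# Illustrative stream of hashtags:
for tag in ["ipl", "fifa", "ipl", "ipl", "fifa", "ipl"]:
    process(tag)

trending = max(scores, key=scores.get)
print(scores, trending)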