Module II
Traditional DBMS store data that are finite and persistent; the data is available whenever we want it. In big data analytics and data mining, data is assumed to arrive in streams, and if it is not processed immediately it is lost. Data streams are continuous flows of data. Sensor data, network traffic, call centre records, satellite images, and data from electric power grids are popular examples of data streams. Data streams have the following characteristics:
● Infinite
● Massive
● Fast changing
● New classes may evolve that are difficult to fit into the existing classes (concept evolution)
● The relationship between the input data and the output may change over time (concept drift)
Apart from these unique characteristics, there are some potential challenges in data stream mining:
● It is not manually possible to label all the data points in the stream.
● Concept drift
● Concept evolution
● The speed and huge volume of the data make it difficult to mine; only single-scan algorithms are feasible.
● It is difficult to query streams with SQL-based tools due to the lack of schema and structure.
A streaming data model processes data immediately as it is generated and then continues to store it. The architecture also includes various additional tools for real-time processing, data manipulation and analysis. The main components of a streaming data architecture are described below.
Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special circumstances
using time-consuming retrieval processes. There is also a working store, into which summaries
or parts of streams may be placed, and which can be used for answering queries. The working
store might be disk, or it might be main memory, depending on how fast we need to process
queries. But either way, it is of sufficiently limited capacity that it cannot store all the data from
all the streams.
1. The Message Broker (Stream Processor)
Takes data coming from various sources, translates it into a standard format, and streams it on an ongoing basis. Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
2. Batch and Real-time ETL Tools
ETL stands for Extract, Transform and Load. It is the process of moving a huge volume of unstructured data from one source to another, and is basically a data integration process. ETL tools aggregate data streams from one or more message brokers. The ETL tool or platform receives queries from users, fetches events from the message queues and applies the query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
A few examples of open-source ETL tools for streaming data are Apache Storm, Spark Streaming and WSO2 Stream Processor.
3. Query Engine
After streaming data is prepared for consumption by the stream processor, it must be analysed to provide value. Some commonly used data analytics tools are Amazon Athena, Amazon Redshift, Elasticsearch and Cassandra.
4. Data Storage
Streams may be archived in a large archival store, but it is not possible to answer
queries from the archival store.
The advent of low-cost storage technologies has paved the way for organizations to store streaming event data. Events can be stored in several ways: in a database or data warehouse, in the message broker, or in a data lake.
A data lake is the most flexible and inexpensive option for storing event data, but its latency (the time required to retrieve data from storage) is too high for real-time analysis.
A modern streaming data architecture can handle both bounded and unbounded data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data architecture powered by Apache Flink to update product details and inventory information in real time. Netflix also uses Flink to support its recommendation engines, and ING, the global bank based in the Netherlands, uses the architecture to prevent identity theft and provide better fraud protection. Other platforms that can accommodate both stream and batch processing include Apache Spark, Apache Storm, Google Cloud Dataflow and AWS Kinesis.
Sources of Stream Data
1. Sensor Data
Imagine a temperature sensor bobbing about in the ocean, sending back to a
base station a reading of the surface temperature each hour. The data produced by this
sensor is a stream of real numbers.
2. Image Data
Satellites often send down to earth streams consisting of many terabytes of images
per day. Surveillance cameras produce images with lower resolution than satellites, but
there can be many of them, each producing a stream of images at intervals like one second.
London is said to have six million such cameras, each producing a stream.
3. Internet and Web Traffic
A switching node in the middle of the Internet receives streams of IP packets from many inputs and routes them to its outputs. Normally, the job of the switch is to transmit data and not to retain it or query it. But there is a tendency to put more capability into the switch, e.g., the ability to detect denial-of-service attacks or the ability to reroute packets based on information about congestion in the network.
Stream Queries:
There are two types of stream queries: standing queries and ad-hoc queries.
1. Standing queries
E.g., consider a temperature sensor bobbing about in the ocean, sending back to a base
station a reading of the surface temperature each hour. The data produced by this sensor is a
stream of real numbers. In this case we can ask a standing query: what is the maximum temperature ever recorded by the sensor? To answer this query we need not store the entire stream. When a
new stream element arrives, we compare it with the stored maximum, and set the maximum to
whichever is larger. Similarly, if we want the average temperature over all time, we have only
to record two values: the number of readings ever sent in the stream and the sum of those
readings.
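As an illustration of how little state such standing queries need, here is a minimal Python sketch (the class and variable names are illustrative, not from the source) that maintains the maximum and the all-time average of a temperature stream using only three stored numbers.

class TemperatureStandingQuery:
    """Answers two standing queries over a sensor stream: the maximum
    temperature ever seen and the average over all time.
    Only three numbers are stored, regardless of the stream length."""

    def __init__(self):
        self.maximum = float("-inf")  # largest reading so far
        self.count = 0                # number of readings seen
        self.total = 0.0              # sum of all readings

    def process(self, reading: float) -> None:
        # Compare the new element with the stored maximum.
        if reading > self.maximum:
            self.maximum = reading
        self.count += 1
        self.total += reading

    def average(self) -> float:
        return self.total / self.count if self.count else float("nan")


# Hypothetical usage with a few hourly readings:
q = TemperatureStandingQuery()
for t in [21.4, 22.0, 20.8, 23.1]:
    q.process(t)
print(q.maximum, q.average())   # 23.1 21.825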
2. Ad-hoc queries
E.g., asked once about the current state of streams. Such queries are difficult to answer
as we are not archiving the entire stream. For answering such ad-hoc queries we have
to store the parts or summaries of the stream.
Sliding Window:
A sliding window query operates only on the most recent portion of the stream. Windows can be time-based or count-based.
Time-based
○ The range and stride are specified as time intervals.
○ For example, a sliding window with range = 10 minutes and stride = 2 minutes produces a window that covers the data from the last 10 minutes, and a new window is created every 2 minutes.
Count-based
○ The range and stride are specified as numbers of elements; for example, a window over the most recent N elements of the stream (a small sketch follows below).
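Below is a minimal sketch of a count-based sliding window in Python, assuming a window over the last N elements (N and the windowed-average query are illustrative choices); collections.deque with maxlen drops the oldest element automatically.

from collections import deque

N = 5                       # assumed window size (count-based)
window = deque(maxlen=N)    # keeps only the most recent N elements

def process(element):
    window.append(element)            # oldest element is dropped once len == N
    return sum(window) / len(window)  # e.g. a windowed average query

for x in [3, 8, 2, 7, 4, 9, 1]:
    avg = process(x)
    print(list(window), avg)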
Stream Computing
A high-performance computer system that analyses multiple data streams from many sources, live.
The word stream in stream computing refers to pulling in streams of data, processing the data and streaming it back out as a single flow.
Stream computing uses software algorithms that analyse the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.
It continuously integrates and analyses data in motion to deliver real-time analytics.
Stream computing enables organizations to detect insights (risks and opportunities) in high-velocity data.
Stream computing enables organizations to:
● Analyse and act upon rapidly changing data in real time.
● Enhance existing models with new insights.
● Capture, analyse and act on insights before opportunities are lost forever.
Sampling Data in a Stream
As mentioned earlier, a data stream is a massive, infinite dataset, so it is not possible to store the entire stream. A typical question that arises when mining a data stream is how we can answer critical queries without storing the entire stream. In some cases we can get an answer from samples of the stream, without examining the whole stream. Here we have to keep two things in mind: the sample should be unbiased, and it should be able to answer the queries of interest. Choosing the right sample is critical; carelessly chosen samples can destroy the results of a query, so while sampling we must be aware of some pitfalls.
An Example
Consider a search engine like Google that receives a stream of queries, and suppose Google wants to study the behaviour of its users. A typical question is: what fraction of the typical user’s queries were repeated over the past month? Assume that we have space to store only 1/10th of the stream elements.
The obvious approach would be to generate a random number, say an integer from 0 to
9, in response to each search query. Store the tuple if and only if the random number is 0. If
we do so, each user has, on average, 1/10th of their queries stored. Statistical fluctuations will
introduce some noise into the data, but if users issue many queries, the law of large numbers
will assure us that most users will have a fraction quite close to 1/10th of their queries stored.
This scheme gives us the wrong answer to the query asking for the average number of
duplicate queries for a user. Suppose a user has issued s search queries one time in the past
month, d search queries twice, and no search queries more than twice. If we have a 1/10th
sample of queries, we shall see in the sample for that user an expected s/10 of the search queries
issued once. Of the d search queries issued twice, only d/100 will appear twice in the sample;
that fraction is d times the probability that both occurrences of the query will be in the 1/10th
sample. Of the queries that appear twice in the full stream, 18d/100 will appear exactly once.
[The sample will contain s/10 of the singleton queries and 19d/100 of the duplicate queries at least once, but only d/100 pairs of duplicates, since d/100 = 1/10 × 1/10 × d.]
To see why, note that 18/100 is the probability that one of the two occurrences will be in the
1/10th of the stream that is selected, while the other is in the 9/10th that is not selected.
The correct answer to the query about the fraction of repeated searches is d/(s+d). However,
the answer we shall obtain from the sample is d/(10s+19d).
To derive this, note that in the sample d/100 of the queries appear twice, while s/10 + 18d/100 appear once. Thus the fraction appearing twice in the sample is (d/100) divided by (d/100 + s/10 + 18d/100), which simplifies to d/(10s + 19d).
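The algebra above can be checked numerically. The following sketch (with s and d chosen arbitrarily for illustration) compares the true fraction of repeated searches, d/(s + d), with the fraction obtained from a 1/10th sample of individual query occurrences, d/(10s + 19d).

# s: queries a user issued exactly once; d: queries issued exactly twice.
s, d = 1000, 200   # arbitrary illustrative values

true_fraction = d / (s + d)

# Expected counts inside a 1/10th sample of query occurrences:
twice_in_sample = d / 100                 # both copies kept: (1/10)*(1/10)*d
once_in_sample = s / 10 + 18 * d / 100    # singletons kept + doubles kept once
sample_fraction = twice_in_sample / (twice_in_sample + once_in_sample)

print(true_fraction)            # 0.1666...
print(sample_fraction)          # equals d / (10*s + 19*d)
print(d / (10 * s + 19 * d))    # about 0.0145 for these values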
Solution
● Pick 1/10th of users and take all their searches in the sample.
● Use a hash function that hashes the user name or user id uniformly into 10 buckets.
● Each time a search query arrives in the stream, we look up the user to see whether or not they are in the sample. If so, we add the search query to the sample; if not, we discard it.
● By using a hash function, one can avoid keeping an explicit list of users: hash each user name to one of ten buckets, 0 through 9. If the user hashes to bucket 0, accept the search query for the sample; otherwise discard it (see the sketch below).
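A minimal Python sketch of this bucket-based user sampling, assuming hashlib for a deterministic hash (the function and variable names are illustrative):

import hashlib

BUCKETS = 10          # sample roughly 1/10th of users

def bucket(user_id: str) -> int:
    """Hash the user id uniformly into one of BUCKETS buckets."""
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return int(digest, 16) % BUCKETS

sample = []           # stores (user, query) tuples for sampled users

def process(user_id: str, query: str) -> None:
    # Keep the query only if the user falls into bucket 0.
    if bucket(user_id) == 0:
        sample.append((user_id, query))

# Hypothetical stream of (user, query) pairs:
for user, q in [("u1", "weather"), ("u2", "news"), ("u1", "weather")]:
    process(user, q)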
General Solution
More generally, to obtain a sample consisting of a fixed fraction a/b of the stream, based on a key attribute such as the user, hash the key value to b buckets and accept the element into the sample if the hash value is less than a.
Filtering Streams
Filtering, or selection, involves accepting the stream elements that satisfy a particular criterion. There are different methods for filtering streams; the problem is hard when the criterion involves membership in a large set.
Applications of filtering:
● Publish-subscribe systems
○ You are collecting lots of messages (news articles).
○ People express interest in certain sets of keywords.
○ Determine whether each message matches user’s interest.
Example
Suppose we want to check whether a username chosen by a new user has already been taken, i.e., whether it belongs to the set of billions of existing usernames.
• Linear search: obviously a bad idea, because there may be billions of accounts.
• Binary search: the usernames must be stored in sorted order, and even then it may not be feasible to search among billions of entries in memory.
A Bloom filter solves this membership problem using a small amount of memory, at the cost of occasional false positives.
Bloom Filter
● An empty Bloom filter is a bit array of m bits, all set to zero.
● We need k hash functions to calculate the hashes for a given input.
○ Assume that each hash function maps each item in the universe to a random number uniformly over the range of indices.
● To add an element x to the filter, the bits at the k indices h1(x), h2(x), …, hk(x) are set to 1, where the indices are calculated using the hash functions.
○ A bit in the array may be set to 1 multiple times, for different elements.
● Example – Suppose we want to enter “good” into the filter, using 3 hash functions and a bit array of length 10, all bits initially set to 0. First we calculate the hashes as follows:
h1(“good”) % 10 = 1
h2(“good”) % 10 = 4
h3(“good”) % 10 = 7
Now we set the bits at indices 1, 4 and 7 to 1 for the element “good”. Similarly, to enter “bad” we calculate:
h1(“bad”) % 10 = 3
h2(“bad”) % 10 = 5
h3(“bad”) % 10 = 4
and we set the bits at indices 3, 5 and 4 to 1 for the element “bad”.
Now, to check whether an element is present, we do the reverse process: calculate the respective hashes using h1, h2 and h3 and check whether all of these indices are set to 1 in the bit array. If all the bits are set, the element is probably present. If any of the bits at these indices is 0, the element is definitely not present.
The result “probably present” carries some uncertainty. Let’s understand this with an example.
Suppose we want to check whether “cat” is present or not. We’ll calculate hashes using h1, h2
and h3.
h1 (“cat”) % 10 = 1
h2 (“cat”) % 10 = 3
h3 (“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “good”, and the bit at index 3 was set when we added “bad”. So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive.
● By controlling the size of the Bloom filter we can control the probability of getting false positives.
● Using a larger bit array, together with an appropriate number of hash functions, reduces the false positive rate (a minimal implementation sketch follows).
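The following is a minimal Bloom filter sketch in Python, not a production implementation; it derives k indices by salting a single md5 hash, which is just one possible choice, and its indices will not match the illustrative h1, h2, h3 values used above.

import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                  # number of bits in the array
        self.k = k                  # number of hash functions
        self.bits = [0] * m

    def _indices(self, item: str):
        # Derive k indices by salting a single hash function with i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1      # a bit may be set more than once

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=10, k=3)
bf.add("good")
bf.add("bad")
print(bf.might_contain("good"))   # True
print(bf.might_contain("cat"))    # may be True or False; a false positive is possible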
Probability of a false positive: let m be the size of the bit array, k the number of hash functions, and n the expected number of elements to be inserted into the filter. Then the probability P of a false positive can be calculated as:
$$P = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^{k}$$
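A small sketch that evaluates this formula for assumed values of m, k and n:

def false_positive_probability(m: int, k: int, n: int) -> float:
    """P = (1 - (1 - 1/m)^(k*n))^k"""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

# Illustrative values: a 10,000-bit array, 3 hash functions, 1,000 inserted keys.
print(false_positive_probability(m=10_000, k=3, n=1_000))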
Generalization
In general, a Bloom filter consists of:
1. An array of m bits, initially all 0’s.
2. A collection of k hash functions h1, h2, . . . , hk. Each hash function maps “key” values to m buckets, corresponding to the m bits of the bit array.
3. A set S of n key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S.
The hash functions used in Bloom filters should be independent and uniformly distributed. They should also be as fast as possible.
Applications of Bloom filters:
● Quora implemented a sharded Bloom filter in its feed backend to filter out stories that people have seen before.
● The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
● Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns.
Counting Distinct Elements in a Stream
The problem is to count the number of distinct elements in a data stream that contains repeated elements. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of sensor/RFID networks.
Definition
Given a stream of elements {x1, x2, ..., xs} with repetitions, and an integer m, let n be the number of distinct elements, namely n = |{x1, x2, ..., xs}|, and let these distinct elements be {e1, e2, ..., en}. The objective is to find an estimate of n using only m storage units, where m ≪ n.
In other words, the data stream consists of elements chosen from a universe of size N, and we must maintain a count of the number of distinct elements seen so far.
Example
Suppose Google wants to gather the statistics of unique users it has seen in each month.
Google does not require a unique login to issue a search query. The only way to recognize
users is to identify the IP addresses from which the queries are issued. In this case the 4 billion
IP addresses serve as the universal set.
Solution
● Keep a list of all elements seen so far, in a hash table or a search tree in main memory.
● When a new query arrives, check whether the IP address from which the query was issued is already in the list.
● If it is not there, add the new IP address; otherwise discard it.
The above solution works well as long as the number of distinct elements is not too large. The problem arises when the number of distinct elements is too great, or when many streams must be processed at once, so that the data does not fit in main memory.
Flajolet-Martin Algorithm
This algorithm approximates the number of distinct elements in a stream or a database in one pass. If the stream consists of n elements, m of which are unique, the time complexity of the algorithm is O(n) and it requires O(log m) memory.
Algorithm
Step 1: Apply the hash function h(x) to each element in the stream.
Step 2: Write the binary equivalent of each hash value obtained.
Step 3: Count the number of trailing zeros (zeros at the end) of each binary hash value.
Step 4: Take the maximum number of trailing zeros observed; call it r.
Step 5: The estimated number of distinct elements is R = 2^r (a code sketch of these steps follows below).
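A minimal Flajolet-Martin sketch in Python that follows these steps; the hash function h(x) = (6x + 1) mod 5 is the one used in the worked example below, and, as in that example, a hash value of 0 is treated as contributing 0 trailing zeros, so this is only an illustration rather than a robust estimator.

def trailing_zeros(value: int) -> int:
    """Number of trailing zeros in the binary representation of value.
    A hash value of 0 is treated as contributing 0 (as in the example below)."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    r = 0                          # maximum number of trailing zeros seen
    for x in stream:
        r = max(r, trailing_zeros(h(x)))
    return 2 ** r                  # estimated number of distinct elements

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
h = lambda x: (6 * x + 1) % 5
print(flajolet_martin(stream, h))  # 4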
Example
Consider the data stream of integers 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1. Estimate the number of distinct elements if the hash function is h(x) = (6x + 1) mod 5.
Solution:
Since the data stream is small, we can readily see that the number of distinct elements is four (1, 2, 3 and 4); the algorithm should give an estimate close to this.
Step 1: Apply h(x) = (6x + 1) mod 5 to each element of the stream:
h(1) = 2, h(3) = 4, h(2) = 3, h(1) = 2, h(2) = 3, h(3) = 4, h(4) = 0, h(3) = 4, h(1) = 2, h(2) = 3, h(3) = 4, h(1) = 2.
Step 2: Write the binary equivalent of each hash value:
010, 100, 011, 010, 011, 100, 000, 100, 010, 011, 100, 010.
Step 3: Count the trailing zeros of each binary value (the all-zero value for h(4) contributes 0 here):
1, 2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 1.
Step 4: The maximum number of trailing zeros is r = 2.
Step 5: The estimated number of distinct elements is R = 2^r = 2^2 = 4.
Hence the estimate is 4 distinct elements: 1, 2, 3, 4.
Estimating Moments
Definition of Moments
Consider a data stream of elements drawn from a universal set. Let m_i be the number of occurrences of the i-th element, for any i. Then the k-th moment of the stream is the sum over all i of (m_i)^k.
The 1st moment is the sum of the m_i’s, which is simply the length of the stream.
The 2nd moment is the sum of the squares of the m_i’s. It is also called the “surprise number”, since it measures how uneven the distribution of elements is.
Example
Suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would have one element appearing 10 times and the other ten appearing 9 times each. In this case the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each; then the surprise number would be 90^2 + 10 × 1^2 = 8110.
As the above example shows, moments of any order can be computed exactly as long as the stream fits in main memory. When the stream does not fit in memory, we can estimate the k-th moment by keeping a limited number of values and computing an estimate from these values. Even when there is not enough storage space, the second moment can still be estimated using the AMS (Alon-Matias-Szegedy) algorithm, described below.
Algorithm
Consider a stream of length n. Instead of examining all the elements, we pick a number of positions in the stream at random and define a variable X for each. For a variable X defined at position p, X.element is the element found at position p, and X.value is the number of times X.element appears at or after position p. The estimate of the second moment derived from X is n(2 × X.value − 1); with several variables, we average their estimates.
Example
● Consider the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. Its length is n = 15. Since a appears 5 times, b appears 4 times, and c and d appear 3 times each, the second moment of the stream is 5^2 + 4^2 + 3^2 + 3^2 = 59.
Suppose we keep three variables, X1, X2, and X3.
● Assume that at “random” we pick the 3rd, 8th, and 13th positions to define these three
variables. When we reach position 3, we find element c, so we set X1.element = c and
X1.value = 1. Position 4 holds b, so we do not change X1. Likewise, nothing happens
at positions 5 or 6. At position 7, we see c again, so we set X1.value = 2.
● At position 8 we find d, and so set X2.element = d and X2.value = 1. Positions 9 and
10 hold a and b, so they do not affect X1 or X2.
● Position 11 holds d so we set X2.value = 2, and position 12 holds c so we set X1.value
= 3.
● At position 13, we find element a, and so set X3.element = a and X3.value = 1. Then,
at position 14 we see another a and so set X3.value = 2.
● Position 15, with element b does not affect any of the variables, so we are done, with
final values,
X1.value = 3
X2.value = 2
X3.value = 2.
We can derive an estimate of the second moment from any variable X. This estimate
is n(2X.value − 1).
From the previous example, for X1, we derive the estimate n(2X1.value − 1)
= 15 × (2 × 3 − 1) = 75.
The other two variables, X2 and X3, each have value 2 at the end, so their
estimates are 15 × (2 × 2 − 1) = 45.
Recall that the true value of the second moment for this stream is 59. On the
other hand, the average of the three estimates is 55, a fairly close approximation.
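A minimal sketch of the AMS estimate in Python, using the stream and the (1-based) sampled positions from this example; the function name is illustrative.

def ams_estimate(stream, positions):
    """Estimate the second moment from variables defined at the given
    1-based positions, as n * (2 * X.value - 1), averaged over the variables."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p - 1]
        # X.value = occurrences of element from position p to the end of the stream
        value = stream[p - 1:].count(element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")          # a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
print(ams_estimate(stream, [3, 8, 13]))   # (75 + 45 + 45) / 3 = 55.0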
The DGIM Algorithm (Counting Ones in a Window)
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm estimates the number of 1’s in the last N bits of a binary stream without storing all N bits.
● It allows us to estimate the number of 1’s in the window with an error of no more than 50%.
Components:
Timestamp
Buckets
Each bit that arrives has a timestamp for the position at which it arrives: the first bit has timestamp 1, the second bit timestamp 2, and so on. Since only the most recent N positions matter, timestamps can be represented relative to the window size N.
The window is divided into buckets, each consisting of 1’s and 0’s.
Rules:
1. The right end of every bucket is a position with a 1 (a bucket cannot end in a 0). E.g., the block 1001011 can form a bucket of size 4: it contains four 1’s and its right end is a 1.
2. Every bucket must contain at least one 1; otherwise no bucket is formed.
3. The size of a bucket (its number of 1’s) must be a power of 2.
4. Bucket sizes cannot decrease as we move to the left (they are non-decreasing towards the older end of the window).
Example
Bucket sizes from left (oldest) to right (most recent): 2^2 = 4, 2^2 = 4, 2^1 = 2, 2^1 = 2, 2^0 = 1.
In this data stream, assume new bits arrive at the right. If the new bit is 0, the buckets do not change. If the new bit is 1, a new bucket of size 1 is created for it; if that leaves three buckets of the same size, the two oldest of them are merged into one bucket of twice the size, and this merging is repeated as needed.
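A simplified sketch of DGIM bucket maintenance in Python, under the assumptions that each bucket is stored as (timestamp of its rightmost 1, size), that absolute timestamps are kept rather than timestamps modulo N, and that the estimate counts all buckets fully except half of the oldest; this is a sketch of the update rule, not a complete DGIM implementation.

N = 16                 # window size (number of most recent bits considered)
buckets = []           # list of (timestamp, size), oldest first
current_time = 0

def new_bit(bit: int) -> None:
    global current_time
    current_time += 1
    # Drop buckets whose right end has fallen out of the window.
    while buckets and buckets[0][0] <= current_time - N:
        buckets.pop(0)
    if bit == 0:
        return                        # 0's never create or change buckets
    buckets.append((current_time, 1))  # every 1 starts a new bucket of size 1
    # Merge whenever three buckets of the same size exist:
    size = 1
    while sum(1 for _, s in buckets if s == size) == 3:
        # The two OLDEST buckets of this size are merged into one of twice the size.
        idx = next(i for i, (_, s) in enumerate(buckets) if s == size)
        merged = (buckets[idx + 1][0], size * 2)   # keep the more recent right-end timestamp
        buckets[idx:idx + 2] = [merged]
        size *= 2

def estimate_ones() -> float:
    """Estimate the number of 1's in the last N bits:
    all buckets counted fully, except only half of the oldest bucket."""
    if not buckets:
        return 0.0
    return sum(size for _, size in buckets[1:]) + buckets[0][1] / 2

# Feed a few bits and report the (approximate) count of 1's in the window:
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    new_bit(b)
print(buckets, estimate_ones())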
Decaying Window
This approach is used for finding the most popular elements in a stream. It can be seen as a generalization of the fixed-length window used in DGIM: instead of a sharp cutoff, recent elements are weighted more heavily than older ones.
Let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10^-6 or 10^-9.
● Define the exponentially decaying window for this stream to be the sum
$$\sum_{i=0}^{t-1} a_{t-i}\,(1 - c)^{i}$$
When a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by (1 − c).
2. Add a_{t+1}.
Decaying Window Algorithm:
The goal is to identify the most popular (trending) elements in an incoming data stream.
1. For each distinct element, imagine a separate bit stream that has a 1 at each position where that element appears and a 0 elsewhere, and weight the positions by (1 − c)^i as above.
2. Calculate an aggregate score for each distinct element by adding up all the weights assigned to its occurrences:
$$\sum_{i=0}^{t-1} a_{t-i}\,(1 - c)^{i}$$
Example
Consider a stream of hashtags containing the tags fifa and ipl, and let c = 0.1. A separate decaying score is kept for each tag: each time a new tag arrives, every score is multiplied by (1 − c) = 0.9, and then 1 is added to the score of the arriving tag (and 0 to every other tag). For instance, if ipl arrives first and fifa arrives next:
ipl arrives: ipl = 0 × 0.9 + 1 = 1, fifa = 0 × 0.9 + 0 = 0
fifa arrives: ipl = 1 × 0.9 + 0 = 0.9, fifa = 0 × 0.9 + 1 = 1
Continuing in this way to the end of the stream, the final score of fifa is 2.135 while that of ipl is 3.7264. So ipl is trending more than fifa, and even though both tags occur almost the same number of times in the input, their scores differ because of when they occurred.
This mechanism gives more weight to new elements, which is what produces the right trending output.
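A minimal sketch of per-tag decaying scores in Python, assuming c = 0.1 and a made-up sequence of tags (the tag order shown is illustrative, not the stream from the example above). In practice one would avoid touching every score on each arrival, for example by dropping scores that fall below a small threshold.

c = 0.1
scores = {}                 # decaying score per tag

def process(tag: str) -> None:
    # Multiply every existing score by (1 - c), then add 1 for the arriving tag.
    for t in scores:
        scores[t] *= (1 - c)
    scores[tag] = scores.get(tag, 0.0) + 1.0

# Illustrative stream of hashtags:
for tag in ["ipl", "fifa", "ipl", "ipl", "fifa", "ipl"]:
    process(tag)

trending = max(scores, key=scores.get)
print(scores, trending)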