COUNTING THE NUMBER OF 1’s IN THE DATA
STREAM
DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)
Designed to estimate the number of 1’s in the last N bits
of a data stream. The algorithm uses O(log²N) bits to
represent a window of N bits and estimates the number of
1’s in the window with an error of no more than 50%.
In other words, the estimate is never off from the true
count by more than half of that count.
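As a rough illustration of the O(log²N) bound, the sketch below counts the
bits needed for a given window size. It assumes at most two buckets of each
size and one timestamp of about log2 N bits per bucket. The function name
dgim_space_bits is made up for this example, and the few extra bits needed
to record each bucket's size are ignored.

```python
def dgim_space_bits(window_size: int) -> int:
    """Rough upper bound on the bits stored for a window of window_size bits."""
    # Bits for one timestamp: ceil(log2 N), computed without floating point.
    timestamp_bits = (window_size - 1).bit_length()
    # Sizes 1, 2, 4, ... up to N give floor(log2 N) + 1 distinct sizes;
    # assumption: no more than two buckets of any one size.
    max_buckets = 2 * window_size.bit_length()
    return max_buckets * timestamp_bits


print(dgim_space_bits(1024))     # window of 1,024 bits
print(dgim_space_bits(2 ** 20))  # window of about a million bits
```

Under these assumptions, a window of about a million bits needs fewer than a
thousand bits of bucket data, which is the point of the logarithmic bound.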
In the DGIM algorithm, each bit that arrives has a
timestamp giving the position at which it arrives: the
first bit has timestamp 1, the second bit has timestamp 2,
and so on. Positions are recorded modulo the window size N,
so a timestamp needs only log2 N bits (the window size is
usually taken as a multiple of 2). The window is divided
into buckets consisting of 1’s and 0's.
RULES FOR FORMING THE BUCKETS:
1. The right end of a bucket should always be a 1.
(If a bucket would end in a 0, that 0 is left out.)
E.g. 1001011 → a bucket of size 4, having four 1’s
and a 1 at its right end. The size of a bucket is the
number of 1’s it contains.
2. Every bucket should have at least one 1; otherwise
no bucket can be formed.
3. All bucket sizes should be powers of 2.
4. The buckets cannot decrease in size as we move to
the left (sizes are in increasing order towards the left).
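The four rules can be checked mechanically. The sketch below is only an
illustration: it assumes each bucket is given as a string of bits, listed
from left (oldest) to right (newest), and the function names are made up
for this example.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for 0 and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def buckets_follow_rules(buckets: list[str]) -> bool:
    """Check rules 1 to 4 for buckets listed left (oldest) to right (newest)."""
    sizes = [bits.count("1") for bits in buckets]  # size = number of 1's
    for bits, size in zip(buckets, sizes):
        if not bits.endswith("1"):     # rule 1: right end must be a 1
            return False
        if not is_power_of_two(size):  # rules 2 and 3: at least one 1, power of 2
            return False
    # Rule 4: sizes never decrease moving left, so they never
    # increase when read from left to right.
    return all(left >= right for left, right in zip(sizes, sizes[1:]))


print(buckets_follow_rules(["1001011", "101", "1"]))  # sizes 4, 2, 1
print(buckets_follow_rules(["1", "101"]))             # sizes 1, 2: grows to the right
```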
Let us take an example to understand the algorithm:
estimating the number of 1’s and counting the buckets in
the given data stream.
This picture shows how we can form the buckets based on
the number of 1’s by following the rules.
In the given data stream, let us assume the new bit arrives
from the right.
When the new bit = 0
After the new bit (0) arrives with timestamp 101, there
is no change in the buckets.
When the new bit = 1
If the new bit that arrives is 1, we need to make changes:
· Create a new bucket with the current timestamp and size
1.
· If there was only one bucket of size 1 before, then
nothing more needs to be done. However, if there are now
three buckets of size 1 (the buckets with timestamps 100,
102 and 103 in the second step in the picture), that is
one too many. We fix the problem by combining the
leftmost (earliest) two buckets of size 1 (purple box).
To combine any two adjacent buckets of the same size,
replace them by one bucket of twice the size. The
timestamp of the new bucket is the timestamp of the
rightmost of the two buckets.
Now, sometimes combining two buckets of size 1 may
create a third bucket of size 2. If so, we combine the
leftmost two buckets of size 2 into a bucket of size 4. This
process may ripple through the bucket sizes.
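The update steps above can be sketched as follows. This is a minimal
illustration, not a reference implementation: it assumes the buckets are
kept as a list of (timestamp, size) pairs with the newest bucket first, and
the starting buckets in the usage lines are made up because the picture is
not reproduced here.

```python
def add_bit(buckets, timestamp, bit):
    """Update the bucket list in place for one arriving bit.

    buckets: list of (timestamp, size) pairs, newest first (assumed layout).
    """
    if bit == 0:
        return buckets                   # a 0 changes nothing
    buckets.insert(0, (timestamp, 1))    # new bucket of size 1
    i = 0
    # While three consecutive buckets share a size, merge the two oldest.
    while (i + 2 < len(buckets)
           and buckets[i][1] == buckets[i + 1][1] == buckets[i + 2][1]):
        newer_timestamp, size = buckets[i + 1]
        # The merged bucket keeps the timestamp of the rightmost (newer) one.
        buckets[i + 1:i + 3] = [(newer_timestamp, 2 * size)]
        i += 1                           # the merge may ripple to the next size
    return buckets


buckets = [(100, 1), (98, 2), (95, 2)]   # hypothetical starting state
for timestamp, bit in [(101, 0), (102, 1), (103, 1)]:
    add_bit(buckets, timestamp, bit)
print(buckets)
```

In this made-up run, the 1 at timestamp 103 creates a third bucket of size 1;
merging the two oldest produces a third bucket of size 2, which triggers a
second merge into a bucket of size 4.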
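How far to the left do the buckets continue?
A bucket belongs to the window as long as
current timestamp - leftmost bucket timestamp < N
(N = 24 here). E.g. 103 - 87 = 16 < 24, so that bucket is
kept. Once the difference is greater than or equal to N,
the bucket lies outside the window and is dropped.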
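The window test above amounts to a short loop. The sketch assumes the same
bucket layout as before, (timestamp, size) pairs with the newest first, and
the function name drop_expired and the sample buckets are made up for
illustration.

```python
def drop_expired(buckets, current_timestamp, window_size):
    """Remove buckets whose timestamp has fallen out of the window.

    buckets: list of (timestamp, size) pairs, newest first (assumed layout).
    """
    # The leftmost (oldest) bucket is the last element of the list.
    while buckets and current_timestamp - buckets[-1][0] >= window_size:
        buckets.pop()
    return buckets


buckets = [(103, 1), (102, 2), (98, 4), (87, 4)]  # hypothetical buckets
drop_expired(buckets, 103, 24)  # 103 - 87 = 16 < 24: nothing is dropped
drop_expired(buckets, 111, 24)  # 111 - 87 = 24: the oldest bucket is dropped
print(buckets)
```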
Finally, the answer to the query:
How many 1’s are there in the last 20 bits?
Adding up the sizes of the buckets in the last 20 bits, we
say there are 11 ones.
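A query of this kind can be sketched as below. The bucket list is made up
(the picture is not reproduced here) and was chosen so that the sizes add up
to 11. The halve_oldest option is not part of the description above: it is
the refinement commonly used with DGIM, which counts only half of the oldest
bucket because that bucket may extend past the query range.

```python
def estimate_ones(buckets, current_timestamp, k, halve_oldest=False):
    """Estimate the number of 1's among the last k bits.

    buckets: list of (timestamp, size) pairs, newest first (assumed layout).
    """
    # A bucket counts if its timestamp (its right end) is within the last k bits.
    sizes = [size for ts, size in buckets if current_timestamp - ts < k]
    if not sizes:
        return 0
    total = sum(sizes)
    if halve_oldest:
        total -= sizes[-1] // 2   # count only half of the oldest bucket
    return total


buckets = [(103, 1), (102, 2), (98, 4), (87, 4)]  # hypothetical buckets
print(estimate_ones(buckets, 103, 20))                     # plain sum of sizes
print(estimate_ones(buckets, 103, 20, halve_oldest=True))  # with the refinement
```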
Advantages
• Stores only O(log²N) bits: O(log N) counts of log2 N bits each.
• Easy update as more bits enter.
• Error in count is no greater than the number of 1’s in the unknown area.
Drawbacks
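• As long as the 1s are fairly evenly distributed, the error due to the unknown region is small: no
more than 50%.
• If buckets were cut at fixed positions instead of by their number of 1s, all the 1s could lie in the
unknown area (indicated by “?” in the figure below) at the end. In that case, the error would be
unbounded. Sizing every bucket by its count of 1s, as in the rules above, is what keeps the DGIM
error within 50%.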