Big Data - NoSQL Databases and Related Concepts
• When the writes reach the server, the server will serialize them, that is,
decide to apply one and then the other. Let's assume it uses alphabetical order
and picks Martin's update first, then Pramod's. Without any concurrency
control, Martin's update would be applied and immediately overwritten by
Pramod's. In this case Martin's is a lost update.
• A pessimistic approach works by preventing conflicts from
occurring; an optimistic approach lets conflicts occur, but
detects them and takes action to sort them out.
• The next step again follows from version control: You have to
merge the two updates somehow. Maybe you show both values
to the user and ask them to sort it out
• People often prefer pessimistic concurrency at first because they are
determined to avoid conflicts, but pessimistic approaches often severely
degrade the responsiveness of a system, to the degree that it becomes unfit
for its purpose.
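• As an illustration (not from the slides), here is a minimal Python sketch of the two approaches, assuming a hypothetical in-memory Record with a version counter; a lock stands in for the pessimistic path and a version check for the optimistic one.

```python
import threading


class ConflictError(Exception):
    """Raised when an optimistic write detects an intervening update."""


class Record:
    """A single record carrying a version counter for optimistic checks."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()   # used only by the pessimistic path


def pessimistic_update(record, new_value):
    # Pessimistic: prevent the conflict by locking out concurrent writers.
    with record.lock:
        record.value = new_value
        record.version += 1


def optimistic_update(record, new_value, version_read):
    # Optimistic: let writers proceed, but detect the conflict at write time.
    if record.version != version_read:
        raise ConflictError("record changed between read and write")
    record.value = new_value
    record.version += 1


rec = Record("draft v1")
seen = rec.version
pessimistic_update(rec, "Martin's update")
try:
    optimistic_update(rec, "Pramod's update", seen)   # conflict: version moved on
except ConflictError:
    pass   # e.g. show both values to the user and ask them to merge
```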
• Availability: It means that if you can talk to a node in the cluster, it can
read and write data.
• Atomic
– All operations in a transaction succeed or every operation is rolled back.
• Consistent
– On the completion of a transaction, the database is structurally sound.
• Isolated
– Transactions do not contend with one another. Contentious access to data is
moderated by the database so that transactions appear to run sequentially.
• Durable
– The results of applying a transaction are permanent, even in the presence of
failures.
• ACID properties mean that once a transaction is complete, its data is consistent (tech
lingo: write consistency) and stable on disk, which may involve multiple distinct
memory locations.
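• As an illustration (not from the slides), a minimal sketch of an atomic transaction using Python's built-in sqlite3 module: either both account updates take effect or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("martin", 100), ("pramod", 100)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Atomic transfer: both UPDATEs commit together or are rolled back together."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()      # durable once the commit returns
    except Exception:
        conn.rollback()    # atomicity: no partial update survives a failure
        raise


transfer(conn, "martin", "pramod", 25)
```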
The BASE Consistency Model
• For many domains and use cases, ACID transactions are far more pessimistic
(i.e., they’re more worried about data safety) than the domain actually
requires.
• In the NoSQL world, ACID transactions are less fashionable as some databases
have loosened the requirements for immediate consistency, data freshness
and accuracy in order to gain other benefits, like scale and resilience.
• Basic Availability: The database appears to work most of the time.
• Soft-state: Stores don’t have to be write-consistent, nor do different replicas
have to be mutually consistent all the time.
• Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily
at read time).
• BASE properties are much looser than ACID guarantees, but there isn’t a direct
one-for-one mapping between the two consistency models.
The BASE consistency model is primarily used by aggregate stores, including
column-family, key-value, and document stores.
Version Stamps
• Many critics of NoSQL databases focus on the lack of support for
transactions.
• One reason why many NoSQL supporters worry less about a lack of transactions
is that aggregate-oriented NoSQL databases do support atomic updates within an
aggregate, and aggregates are designed so that their data forms a natural unit
of update.
Version Stamps
• A good way of managing the transaction and consistency problem is to ensure
that records in the database contain some form of version stamp.
• A fourth approach is to use the timestamp of the last update. Like counters,
timestamps are reasonably short and can be directly compared for recentness,
yet have the advantage of not needing a single master. Multiple machines can
generate timestamps, but to work properly their clocks have to be kept in
sync. One node with a bad clock can cause all sorts of data corruption. There
is also a danger that if the timestamp is too coarse you can get duplicates:
it is no good using timestamps of millisecond precision if you get many
updates per millisecond.
Version Stamp
• You can blend the advantages of these different version stamp schemes by
using more than one of them to create a composite stamp. For example, CouchDB
uses a combination of counter and content hash.
• Version stamps help to avoid update conflicts; they are also useful for
providing session consistency.
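• As an illustration (not CouchDB's actual format), a minimal sketch of a composite stamp built from a counter plus a content hash; the function name and stamp layout are assumptions for the example.

```python
import hashlib
import json


def next_version_stamp(previous_counter, document):
    """Build a composite stamp: an incrementing counter plus a hash of the content.

    The counter gives cheap ordering for recentness; the content hash tells you
    whether two stamps actually describe the same data.
    """
    counter = previous_counter + 1
    content = json.dumps(document, sort_keys=True).encode("utf-8")
    return f"{counter}-{hashlib.md5(content).hexdigest()}"


print(next_version_stamp(1, {"product": "puerh-tea", "quantity": 3}))
# e.g. "2-<32-hex-digit hash of the content>"
```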
Version Stamps: Vector Stamps
• A vector stamp is a set of counters, one for each
node.
• A vector stamp for three nodes (blue, green, black) would look something
like [blue: 43, green: 54, black: 12].
• Each time a node has an internal update, it updates its own counter, so an
update in the green node would change the vector to [blue: 43, green: 55,
black: 12].
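• A minimal sketch (not from the slides) of vector-stamp bookkeeping and comparison, using the blue/green/black nodes from the example; the helper names are illustrative.

```python
def bump(vector, node):
    """Record an internal update on `node` by incrementing that node's counter."""
    updated = dict(vector)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify two vector stamps: equal, descends, precedes, or conflict."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "descends"   # a has seen everything b has seen
    if b_ge:
        return "precedes"
    return "conflict"       # concurrent updates: neither stamp dominates


v = {"blue": 43, "green": 54, "black": 12}
print(bump(v, "green"))                             # {'blue': 43, 'green': 55, 'black': 12}
print(compare(bump(v, "green"), v))                 # 'descends'
print(compare(bump(v, "green"), bump(v, "blue")))   # 'conflict'
```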
Conclude: Version stamps
• Version stamps help you detect concurrency conflicts. When you
read data, then update it, you can check the version stamp to
ensure nobody updated the data between your read and write.
The Hadoop JobTracker
• The JobTracker is the service within Hadoop that farms out MapReduce tasks to
specific nodes in the cluster, ideally the nodes that have the data, or at
least nodes in the same rack.
• Client applications submit jobs to the JobTracker.
• The JobTracker talks to the NameNode to determine the location of the data
• The JobTracker locates TaskTracker nodes with available slots at or near the data
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they are deemed to have failed and the work is scheduled on a different
TaskTracker.
• A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to
avoid, and it may even blacklist the TaskTracker as unreliable.
• When the work is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for information.
• The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all
running jobs are halted.
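• As an illustration of the heartbeat idea only (not Hadoop's actual JobTracker code), a minimal sketch of how a tracker that stops reporting within a timeout would be deemed failed; the class name and timeout value are assumptions.

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds of silence before a TaskTracker is deemed failed


class TaskTrackerRegistry:
    """Tracks the last heartbeat seen from each TaskTracker."""

    def __init__(self):
        self.last_heartbeat = {}   # tracker id -> time of the last heartbeat

    def heartbeat(self, tracker_id):
        self.last_heartbeat[tracker_id] = time.monotonic()

    def failed_trackers(self):
        now = time.monotonic()
        return [t for t, seen in self.last_heartbeat.items()
                if now - seen > HEARTBEAT_TIMEOUT]


registry = TaskTrackerRegistry()
registry.heartbeat("tracker-1")
for tracker in registry.failed_trackers():
    print(f"reschedule the work that was assigned to {tracker}")
```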
Map-Reduce
• The rise of aggregate-oriented databases is in large part due to the growth of
clusters. Running on a cluster means you have to make your tradeoffs in data storage
differently than when running on a single machine.
• Clusters don’t just change the rules for data storage—they also change the rules for
computation. If you store lots of data on a cluster, processing that data efficiently
means you have to think differently about how you organize your processing.
• With a centralized database, there are generally two ways you can run the processing
logic against it: either on the database server itself or on a client machine.
• Running it on a client machine gives you more flexibility in choosing a programming
environment, which usually makes for programs that are easier to create or extend.
This comes at the cost of having to shift lots of data from the database server.
• If you need to hit a lot of data, then it makes sense to do the processing on the
server, paying the price in programming convenience and increasing the load on the
database server.
Need for Map Reduce
• When you have a cluster, there is good news
immediately—you have lots of machines to
spread the computation over.
• However, you also still need to try to reduce
the amount of data that needs to be
transferred across the network by doing as
much processing as you can on the same node
as the data it needs.
Map-Reduce
• The map-reduce pattern (a form of Scatter-Gather) is a way to organize
processing so as to take advantage of multiple machines on a cluster while
keeping as much of the processing and the data it needs together on the same
machine.
• Example: orders and customers. Orders can be aggregated together, but to
produce a Product Revenue Report you would have to visit every machine in the
cluster and examine many records on each machine.
• So we turn to map-reduce.
Map-Reduce with a Single Reduce Task
MAP
• The first stage in a map-reduce job is the map.
• A map is a function whose input is a single aggregate and whose output is a
bunch of key-value pairs.
• In this case, the input would be an order. The output would be key-value
pairs corresponding to each line item: each one would have the product ID as
the key and an embedded map with the quantity and price as the value, as
sketched below.
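• A minimal Python sketch of this map step (not from the slides), assuming an order is a dict with a list of line items; the field names are illustrative.

```python
def map_order(order):
    """Map a single order aggregate to (product_id, {quantity, price}) pairs."""
    for item in order["line_items"]:
        yield item["product_id"], {"quantity": item["quantity"],
                                   "price": item["price"]}


order = {"order_id": 1,
         "line_items": [
             {"product_id": "puerh-tea", "quantity": 2, "price": 8.0},
             {"product_id": "dragonwell", "quantity": 1, "price": 12.0},
         ]}
print(list(map_order(order)))
# [('puerh-tea', {'quantity': 2, 'price': 8.0}),
#  ('dragonwell', {'quantity': 1, 'price': 12.0})]
```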
Map
• Each application of the map function is
independent of all the others.
• This allows them to be safely parallelized, so that a map-reduce framework
can create efficient map tasks on each node and freely allocate each order to
a map task. This yields a great deal of parallelism and locality of data
access.
Map Reduce
• A map operation only operates on a single record; the reduce function takes
multiple map outputs with the same key and combines their values. So a map
function might yield 1,000 line items from orders; the reduce function would
then reduce them down to one result per product, with the totals for quantity
and revenue. While the map function is limited to working only on data from a
single aggregate, the reduce function can use all values emitted for a single
key.
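• Continuing the sketch above (illustrative field names), a reduce function that takes all the map outputs for one product key and totals quantity and revenue.

```python
def reduce_product(product_id, values):
    """Combine every map output for one product into total quantity and revenue."""
    total_quantity = 0
    total_revenue = 0.0
    for v in values:
        total_quantity += v["quantity"]
        total_revenue += v["quantity"] * v["price"]
    return product_id, {"quantity": total_quantity, "revenue": total_revenue}


values = [{"quantity": 2, "price": 8.0}, {"quantity": 3, "price": 8.0}]
print(reduce_product("puerh-tea", values))
# ('puerh-tea', {'quantity': 5, 'revenue': 40.0})
```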
Partitioning and Combining
• In the simplest form, we think of a map-
reduce job as having a single reduce function.
The outputs from all the map tasks running on
the various nodes are concatenated together
and sent into the reduce.
• While this will work, there are things we can
do to increase the parallelism and to reduce
the data transfer
Partitioning and Combining
• The first thing we can do is increase parallelism by partitioning the output of the
mappers.
• Each reduce function operates on the results of a single key.
• This is a limitation—it means you can’t do anything in the reduce that operates
across keys—but it’s also a benefit in that it allows you to run multiple reducers
in parallel.
• To take advantage of this, the results of the mapper are divided up, based
on the key, on each processing node.
• Typically, multiple keys are grouped together into partitions.
• The framework then takes the data from all the nodes for one partition,
combines it into a single group for that partition, and sends it off to a reducer.
• Multiple reducers can then operate on the partitions in parallel, with the final
results merged together. (This step is also called “shuffling,” and the partitions
are sometimes referred to as “buckets” or “regions.”)
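• A minimal sketch (not from the slides) of key-based partitioning: a stable hash of the key picks the partition, so every node routes pairs with the same key to the same reducer. The hash choice and reducer count are assumptions.

```python
import hashlib


def partition(key, num_reducers):
    """Route a mapper output key to a partition; stable across nodes and runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers


pairs = [("puerh-tea", {"quantity": 2, "price": 8.0}),
         ("dragonwell", {"quantity": 1, "price": 12.0})]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, num_reducers=4), []).append((key, value))
print(buckets)   # pairs grouped by the partition their key hashes to
```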
Partitioning and Combining
• The next problem we can deal with is the amount of
data being moved from node to node between the map
and reduce stages.
• Much of this data is repetitive, consisting of multiple
key-value pairs for the same key.
• A combiner function cuts this data down by combining all the data for the
same key into a single value.
• A combiner function is, in essence, a reducer function; indeed, in many
cases the same function can be used both for combining and for the final
reduction.
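• A minimal sketch (not from the slides) of a node-local combine step. Here the map outputs are assumed to already carry per-item revenue, so the combined pairs have the same shape as the input pairs; that is what allows the very same function to be reused as the final reducer.

```python
from collections import defaultdict


def combine(pairs):
    """Merge all local map outputs that share a key into one pair per key,
    shrinking the data that has to be shipped to the reducers."""
    totals = defaultdict(lambda: {"quantity": 0, "revenue": 0.0})
    for key, value in pairs:
        totals[key]["quantity"] += value["quantity"]
        totals[key]["revenue"] += value["revenue"]
    return list(totals.items())


local_map_output = [("puerh-tea", {"quantity": 2, "revenue": 16.0}),
                    ("puerh-tea", {"quantity": 3, "revenue": 24.0}),
                    ("dragonwell", {"quantity": 1, "revenue": 12.0})]
print(combine(local_map_output))
# One pair per key leaves the node; the same function can run again as the
# final reduce across partitions.
```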
Map-Reduce Dataflow with Multiple Reduce Tasks
Composing Map-Reduce Calculations