Big Data Analysis
NOSQL
3.1 Introduction to NoSQL
With technological advancements such as the easy availability of the internet at affordable prices, the increased use of IoT devices, and the growth of social networking websites, the rate at which digital data is generated rose at an alarming pace. Traditional systems could not scale to accommodate such massive volumes of data. Moreover, much of this digital data was no longer structured, and organisations were unwilling to discard it, because unstructured data now comprises almost 80% of all generated data, a significant proportion. So, to store and process this huge amount of unstructured data, organisations began migrating from traditional SQL-based systems to data processing tools that are scalable and fault tolerant.
When organisations were trying to find a way to store and process big data, Hadoop came to their rescue. Hadoop uses HDFS as its storage layer to store data in a distributed manner across a cluster of commodity machines. HDFS is a robust file system that can store structured, unstructured, and semi-structured data and is horizontally scalable.
Though Hadoop overcame some of the challenges faced by SQL systems, it posed new challenges of its own. The MapReduce framework of Hadoop (used for processing the data stored in HDFS) is well suited for batch processing, where the whole dataset is accessed sequentially. The drawback of using MapReduce for processing big data is that it is not suited for all use cases, such as performing random lookups on data. Hence, apart from Hadoop, there was a need for solutions that address these use cases. These limitations of Hadoop led to the inception of NoSQL datastores.
Some of the reasons for the increasing popularity of NoSQL databases are as follows:
• NoSQL datastores are efficient in storing and handling big data. Based on its targeted use cases, every NoSQL database has its own data model for storing data.
• NoSQL datastores provide scalability, i.e., in case of a space crunch, extra capacity can be created by simply adding nodes to the cluster.
• NoSQL datastores are flexible and do not restrict themselves to a fixed schema. Hence, they can adapt to changes in the schema of the data dynamically (see the sketch after this list).
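To make the flexible-schema point concrete, here is a minimal sketch that uses an in-memory Python list as a stand-in for a NoSQL collection; the collection name and records are hypothetical examples.

```python
# Minimal sketch of schema flexibility: records in the same collection
# need not share a fixed set of columns. The "users" collection and the
# records themselves are hypothetical.
users = []  # stands in for a NoSQL collection

# The first record has a name and an email.
users.append({"name": "Asha", "email": "asha@example.com"})

# A later record adds a new field ("phone") without any schema migration.
users.append({"name": "Ravi", "email": "ravi@example.com", "phone": "555-0101"})

for user in users:
    # Missing fields are simply absent, not NULL columns.
    print(user.get("name"), user.get("phone", "<no phone>"))
```

Because there is no fixed schema, the second record carries an extra field without any migration step.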
NoSQL is a set of concepts that allows the rapid and efficient processing of
data sets with a focus on performance, reliability, and agility.
The figure shows how the business drivers (volume, velocity, variability, and agility) apply pressure to a single-CPU system, resulting in cracks. Volume and velocity refer to the ability to handle large datasets that arrive quickly. Variability refers to how diverse data types do not fit into structured tables, and agility refers to how quickly an organisation responds to business change.
• Volume: Without a doubt, the key factor pushing organisations to look for alternatives to their current RDBMSs is the need to query big data using clusters of commodity processors. The focus has shifted from increasing the speed of a single chip to using more processors working together. The need to scale out (also known as horizontal scaling), rather than scale up (buy faster processors), moved organisations from serial to parallel processing, where data problems are split into separate parts and sent to separate processors to divide and conquer the work.
• Variability: Adding new columns to an RDBMS requires that the system be shut down and ALTER TABLE commands be run. When a database is large, this process can impact system availability, costing time and money (see the SQL sketch after this list).
• Agility: The most complex part of building applications using RDBMSs is the process of
putting data into and getting data out of the database. The aim is to generate the
correct combination of INSERT, UPDATE, DELETE, and SELECT SQL statements to move
object data to and from the RDBMS persistence layer.
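To make the variability point concrete, the sketch below uses Python's built-in sqlite3 module on an in-memory database; the table and column names are hypothetical. On a large production table, the ALTER TABLE step is exactly where availability can suffer, whereas a schema-free store accepts a new field with no migration at all.

```python
import sqlite3

# Hypothetical table: adding a column to an RDBMS is a schema change.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES ('Asha')")

# Every existing row is affected by the schema change; on a large,
# live table this is where availability can suffer.
cur.execute("ALTER TABLE users ADD COLUMN email TEXT")
cur.execute("UPDATE users SET email = 'asha@example.com' WHERE id = 1")
conn.commit()

print(cur.execute("SELECT id, name, email FROM users").fetchall())
```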
The CAP theorem states that a distributed data store can provide at most two of the following three guarantees:
• Consistency guarantees that every node in the distributed system returns the same, most recent successful write.
• Availability is when every request receives a response without the guarantee that it
contains the most recent write.
• Partition tolerance is when the system continues to function and upholds its
guarantees despite network partitions.
Let the data store consist of two nodes, G1 and G2, each maintaining a replica of an Object-O with an initial value of v0. We will construct a sequence of requests in which some request returns an inconsistent response.
Now suppose that the network between the two nodes gets partitioned, that is, nodes G1
and G2 can no longer communicate with each other.
A client C1 wants to overwrite the value of Object-O with v1. By assumption, the system is always available, so let Node-G1 accept the request and update the value.
Since the network is partitioned, the updated value of Object-O cannot be replicated to Node-G2. Now, a read request arrives for Object-O. Again, the system must respond to this request due to the availability assumption. Assume that Node-G2 responds to this request by returning its older value v0; this contradicts the consistency assumption.
Hence, in the case of network partitioning, one must choose between availability and
consistency.
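The argument above can be traced in code. The following is a toy simulation, not a real distributed system: two node objects hold replicas of Object-O, replication is skipped while a partition flag is set, and a read served by the stale node exhibits exactly the inconsistency described.

```python
# Toy simulation of the two-node CAP argument above. Node names (G1, G2),
# the object key "O", and values v0/v1 mirror the text; everything else
# is a hypothetical in-memory stand-in for a real distributed store.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"O": "v0"}  # both replicas start at v0

def write(node, peer, key, value, partitioned):
    node.data[key] = value      # the accepting node applies the write
    if not partitioned:
        peer.data[key] = value  # replication only happens without a partition

g1, g2 = Node("G1"), Node("G2")

# Network partition: G1 and G2 can no longer communicate.
partitioned = True

# Client C1 writes v1; G1 accepts it because the system stays available.
write(g1, g2, "O", "v1", partitioned)

# A read served by G2 returns the stale value v0: availability was
# preserved, but consistency was not.
print("G1 reads:", g1.data["O"])  # v1
print("G2 reads:", g2.data["O"])  # v0 -> inconsistent response
```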
Key-Value stores are the simplest type of NoSQL databases to design and implement. Clients can use get, put, and delete APIs to read and write data, and since they always access data by primary key, these stores are very easy to scale and provide high performance. Key-Value stores are ideal for applications with simple data models that require high-velocity reads and writes. They are not suitable when the application requires frequent updates or complex queries involving specific data values, multiple unique keys, or relationships between them.
Examples of Key-Value store databases: Berkeley DB, DynamoDB, Memcached, etc.
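As a minimal sketch of the get/put/delete interface, the in-memory class below imitates a key-value client; the class and method names are illustrative, not the API of any particular product.

```python
# Minimal in-memory stand-in for a key-value store client. The class and
# method names are hypothetical; real stores expose the same three verbs
# over the network, always keyed by the primary key.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # overwrite-or-create by primary key

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("session:42", {"user": "asha", "cart": ["book"]})
print(store.get("session:42"))  # lookup is always by key...
store.delete("session:42")
print(store.get("session:42", "<gone>"))
# ...which is why value-based queries ("find all carts containing 'book'")
# are awkward: they would require scanning every entry.
```

Every operation goes through the primary key, which is what makes these stores easy to partition and scale, and also why value-based queries are awkward.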
Due to their flexible schema and complex querying capabilities, document stores are popular and suitable for a wide variety of use cases.
Document stores are not suitable if the application requires complex transactions spanning multiple operations.
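As a minimal illustration of document-style querying, the sketch below treats a plain Python list of dicts as a hypothetical collection and filters it by field value; real document stores index fields inside documents so that such queries run efficiently.

```python
# Hypothetical in-memory "collection" of JSON-like documents. Documents in
# the same collection may differ in shape, and queries filter on fields
# inside the documents.
orders = [
    {"order_id": 1, "customer": "Asha", "items": [{"sku": "A1", "qty": 2}]},
    {"order_id": 2, "customer": "Ravi", "items": [{"sku": "B7", "qty": 1}],
     "coupon": "WELCOME10"},  # extra field: no schema change needed
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(orders, customer="Ravi"))
```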