Big Data Analysis
NOSQL
3.1 Introduction to NoSQL
With technological advancements such as the easy availability of the internet at affordable prices, the increased use of IoT devices, and the growth of social networking websites, the rate at which digital data is generated rose at an alarming pace. Traditional systems could not scale to accommodate such massive volumes of data. Moreover, much of this digital data was no longer structured, and organisations were unwilling to discard it, because unstructured data now comprises almost 80% of all generated data, a significant proportion. So, to store and process this huge amount of unstructured data, organisations began migrating from traditional SQL-based systems to data processing tools that are scalable and fault tolerant.
When organisations were trying to find a way to store and process big data, Hadoop came to their rescue. Hadoop uses HDFS as its storage layer to store data in a distributed manner across a cluster of commodity machines. HDFS is a robust file system that can store structured, unstructured, and semi-structured data and is horizontally scalable.
Though Hadoop overcame some of the challenges faced by SQL systems, it posed new challenges of its own. The MapReduce framework of Hadoop (used for processing the data stored in HDFS) is well suited for batch processing, where the whole dataset is accessed sequentially. The drawback of using MapReduce for processing big data is that it is not suited for all use cases, such as performing random lookups on data. Hence, apart from Hadoop, there was a need for solutions that address these use cases. These limitations of Hadoop led to the inception of NoSQL datastores.
Some of the reasons for the increasing popularity of NoSQL databases are as follows:
• NoSQL datastores are efficient in storing and handling big data. Based on its targeted use cases, every NoSQL database has its own data model for storing data.
• NoSQL datastores provide scalability, i.e., in case of a space crunch, extra capacity can be created by simply adding nodes to the cluster.
• NoSQL datastores are flexible and do not restrict themselves to a fixed schema. Hence, they can adapt to changes in the schema of the data dynamically (see the sketch after this list).
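To make the flexible-schema point concrete, here is a minimal sketch that uses an in-memory Python list as a stand-in for a NoSQL collection; the collection name and records are hypothetical examples.

```python
# Minimal sketch of schema flexibility: records in the same collection
# need not share a fixed set of columns. The "users" collection and the
# records themselves are hypothetical.
users = []  # stands in for a NoSQL collection

# The first record has a name and an email.
users.append({"name": "Asha", "email": "asha@example.com"})

# A later record adds a new field ("phone") without any schema migration.
users.append({"name": "Ravi", "email": "ravi@example.com", "phone": "555-0101"})

for user in users:
    # Missing fields are simply absent, not NULL columns.
    print(user.get("name"), user.get("phone", "<no phone>"))
```

Because there is no fixed schema, the second record carries an extra field without any migration step.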
NoSQL is a set of concepts that allows the rapid and efficient processing of
data sets with a focus on performance, reliability, and agility.
The figure shows how the business drivers (volume, velocity, variability, and agility) apply pressure to a single-CPU system, resulting in cracks. Volume and velocity refer to the ability to handle large datasets that arrive quickly. Variability refers to how diverse data types do not fit into structured tables, and agility refers to how quickly an organisation responds to business change.
• Volume: Without a doubt, the key factor pushing organisations to look for alternatives to their current RDBMSs is the need to query big data using clusters of commodity processors. The focus has shifted from increasing the speed of a single chip to using more processors working together. The need to scale out (also known as horizontal scaling), rather than scale up (buy faster processors), moved organisations from serial to parallel processing, where data problems are split into separate parts and sent to separate processors to divide and conquer the work.
• Variability: Adding new columns to an RDBMS requires that the system be shut down and ALTER TABLE commands be run. When a database is large, this process can impact system availability, costing time and money (see the SQL sketch after this list).
• Agility: The most complex part of building applications using RDBMSs is the process of
putting data into and getting data out of the database. The aim is to generate the
correct combination of INSERT, UPDATE, DELETE, and SELECT SQL statements to move
object data to and from the RDBMS persistence layer.
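To make the variability point concrete, the sketch below uses Python's built-in sqlite3 module on an in-memory database; the table and column names are hypothetical. On a large production table, the ALTER TABLE step is exactly where availability can suffer, whereas a schema-free store accepts a new field with no migration at all.

```python
import sqlite3

# Hypothetical table: adding a column to an RDBMS is a schema change.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES ('Asha')")

# Every existing row is affected by the schema change; on a large,
# live table this is where availability can suffer.
cur.execute("ALTER TABLE users ADD COLUMN email TEXT")
cur.execute("UPDATE users SET email = 'asha@example.com' WHERE id = 1")
conn.commit()

print(cur.execute("SELECT id, name, email FROM users").fetchall())
```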
The CAP theorem states that a distributed data store can provide at most two of the following three guarantees:
• Consistency guarantees that every node in the distributed system returns the same, most recent successful write.
• Availability is when every request receives a response without the guarantee that it
contains the most recent write.
• Partition tolerance is when the system continues to function and upholds its
guarantees despite network partitions.
Let the data store consist of two nodes, G1 and G2, each maintaining a replica of an Object-O with an initial value of v0. We will construct a sequence of requests in which some request returns an inconsistent response.
Now suppose that the network between the two nodes gets partitioned, that is, nodes G1
and G2 can no longer communicate with each other.
A client C1 wants to overwrite the value of Object-O with v1. By assumption, the system is always available, so let Node-G1 accept the request and update the value.
Since the network is partitioned, the updated value of Object-O cannot be replicated to Node-G2. Now, a read request arrives for Object-O. Again, the system must respond to this request due to the availability assumption. Assume that Node-G2 responds to this request by returning its older value v0; this contradicts the consistency assumption.
Hence, in the case of network partitioning, one must choose between availability and
consistency.
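The argument above can be traced in code. The following is a toy simulation, not a real distributed system: two node objects hold replicas of Object-O, replication is skipped while a partition flag is set, and a read served by the stale node exhibits exactly the inconsistency described.

```python
# Toy simulation of the two-node CAP argument above. Node names (G1, G2),
# the object key "O", and values v0/v1 mirror the text; everything else
# is a hypothetical in-memory stand-in for a real distributed store.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"O": "v0"}  # both replicas start at v0

def write(node, peer, key, value, partitioned):
    node.data[key] = value      # the accepting node applies the write
    if not partitioned:
        peer.data[key] = value  # replication only happens without a partition

g1, g2 = Node("G1"), Node("G2")

# Network partition: G1 and G2 can no longer communicate.
partitioned = True

# Client C1 writes v1; G1 accepts it because the system stays available.
write(g1, g2, "O", "v1", partitioned)

# A read served by G2 returns the stale value v0: availability was
# preserved, but consistency was not.
print("G1 reads:", g1.data["O"])  # v1
print("G2 reads:", g2.data["O"])  # v0 -> inconsistent response
```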
Key-Value stores are the simplest type of NoSQL databases to design and implement. Clients can use get, put, and delete APIs to read and write data, and since they always access data by primary key, these stores are very easy to scale and provide high performance. Key-Value stores are ideal for applications with simple data models that require high-velocity reads and writes. They are not suitable when the application requires frequent updates or complex queries involving specific data values, multiple unique keys, or relationships between them.
Examples of Key-Value store databases: Berkeley DB, DynamoDB, Memcached, etc.
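As a minimal sketch of the get/put/delete interface, the in-memory class below imitates a key-value client; the class and method names are illustrative, not the API of any particular product.

```python
# Minimal in-memory stand-in for a key-value store client. The class and
# method names are hypothetical; real stores expose the same three verbs
# over the network, always keyed by the primary key.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # overwrite-or-create by primary key

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("session:42", {"user": "asha", "cart": ["book"]})
print(store.get("session:42"))  # lookup is always by key...
store.delete("session:42")
print(store.get("session:42", "<gone>"))
# ...which is why value-based queries ("find all carts containing 'book'")
# are awkward: they would require scanning every entry.
```

Every operation goes through the primary key, which is what makes these stores easy to partition and scale, and also why value-based queries are awkward.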
Due to their flexible schema and complex querying capabilities, document stores are popular and suitable for a wide variety of use cases.
Document stores are not suitable if the application requires complex transactions spanning multiple operations.
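As a minimal illustration of document-style querying, the sketch below treats a plain Python list of dicts as a hypothetical collection and filters it by field value; real document stores index fields inside documents so that such queries run efficiently.

```python
# Hypothetical in-memory "collection" of JSON-like documents. Documents in
# the same collection may differ in shape, and queries filter on fields
# inside the documents.
orders = [
    {"order_id": 1, "customer": "Asha", "items": [{"sku": "A1", "qty": 2}]},
    {"order_id": 2, "customer": "Ravi", "items": [{"sku": "B7", "qty": 1}],
     "coupon": "WELCOME10"},  # extra field: no schema change needed
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(orders, customer="Ravi"))
```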