
No SQL - Types, CAP Theorem

Uploaded by

Guhan Bala

NoSQL Databases: Introduction – CAP Theorem

NoSQL is a type of database management system (DBMS) that is designed to handle and store
large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally
to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but
the term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to
include a wide range of different database architectures and data models.
NoSQL databases are generally classified into four main categories:
Document databases: These databases store data as semi-structured documents, such as JSON
or XML, and can be queried using document-oriented query languages.
Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations.
Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
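
As a rough illustration of the four categories above, the following sketch uses plain Python structures as stand-ins for each data model. It does not use any real database API, and all names and values are made up for the example.

```python
# Illustrative only: plain Python structures standing in for each data model.
# Every key, name, and value below is hypothetical.

# Document: one self-describing, possibly nested record.
document = {
    "_id": "u1",
    "name": "Asha",
    "tags": ["admin"],
    "address": {"city": "Chennai"},
}

# Key-value: an opaque value looked up by its key.
key_value = {"session:42": b"serialized-session-bytes"}

# Column family: row key -> column family -> columns.
column_family = {
    "u1": {
        "profile": {"name": "Asha", "city": "Chennai"},
        "activity": {"last_login": "2024-01-01"},
    }
}

# Graph: nodes plus labeled edges between them.
nodes = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
edges = [("u1", "FOLLOWS", "u2")]


def neighbors(node_id, label):
    """Return ids reachable from node_id over edges with the given label."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]
```

The point of the sketch is only the shape of the data: a document nests everything about one entity, a key-value store knows nothing about the value, a column family groups related columns under a row key, and a graph makes relationships first-class.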
NoSQL databases are often used in applications where there is a high volume of data
that needs to be processed and analyzed in real-time, such as social media analytics,
e-commerce, and gaming. They can also be used for other applications, such as content
management systems, document management, and customer relationship management.
However, NoSQL databases may not be suitable for all applications, as they may not
provide the same level of data consistency and transactional guarantees as traditional relational
databases. It is important to carefully evaluate the specific needs of an application when
choosing a database management system.
A NoSQL (originally “non-SQL” or “non-relational”) database provides a mechanism for storing
and retrieving data that is modeled by means other than the tabular relations used in
relational databases. Such databases have existed since the late 1960s, but did not acquire
the NoSQL moniker until a surge of popularity in the early twenty-first century. NoSQL
databases are used in real-time web applications and big data, and their use is increasing
over time.
NoSQL systems are also sometimes called “Not only SQL” to emphasize the fact that
they may support SQL-like query languages. Motivations for the NoSQL approach include
simplicity of design, simpler horizontal scaling to clusters of machines, and finer control
over availability. The data structures used by NoSQL databases are different from those used
by default in relational databases, which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it should solve.
NoSQL databases, also known as “not only SQL” databases, are a type of database management
system that has gained popularity in recent years. Unlike traditional relational databases,
NoSQL databases are designed to handle large amounts of unstructured or semi-structured data,
and they can accommodate dynamic changes to the data model. This makes NoSQL databases a good
fit for modern web applications, real-time analytics, and big data processing.
Data structures used by NoSQL databases are sometimes also viewed as more flexible
than relational database tables. Many NoSQL stores compromise consistency in favor of
availability, speed and partition tolerance. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages, lack of standardized interfaces, and huge previous
investments in existing relational databases.
Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google
Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB have made
them central to their designs.
Most NoSQL databases offer a concept of eventual consistency, in which database changes are
propagated to all nodes eventually rather than immediately. Queries might therefore not
return updated data right away, or might return data that is out of date, a problem known as
stale reads. Some NoSQL systems may also exhibit lost writes and other forms of data loss.
Some NoSQL systems provide mechanisms such as write-ahead logging to avoid data loss.
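
The write-ahead logging idea mentioned above can be sketched as follows. This is a toy, single-process illustration and not how any particular NoSQL product implements it; the class `TinyStore`, the JSON-lines log format, and the file name `wal.log` are all assumptions made for the example.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Rebuild the in-memory state from the log after a restart or crash.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the intended change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


log_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(log_file)
store.put("user:1", "Asha")

# Simulate a crash: ignore the old object and recover from the log alone.
recovered = TinyStore(log_file)
```

Because the change reaches durable storage before it is acknowledged, a crash after `put` returns cannot lose the write. Production systems add details this sketch omits, such as checksums for partially written entries and log compaction.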
One simple example of a NoSQL database is a document database. In a document database,
data is stored in documents rather than tables. Each document can contain a different set of
fields, making it easy to accommodate changing data requirements.
Take, for instance, a database that holds data regarding employees. In
a relational database, this information might be stored in tables, with one table for employee
information and another table for department information. In a document database, each
employee would be stored as a separate document, with all of their information contained
within the document.
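
The employee example above might look like this in practice. The sketch uses plain Python lists and dictionaries rather than a real database, and the field names (`emp_id`, `dept_id`, `skills`, and so on) are hypothetical.

```python
# Relational shape: two tables linked by a department id.
employees_table = [
    {"emp_id": 1, "name": "Asha", "dept_id": 10},
    {"emp_id": 2, "name": "Ravi", "dept_id": 20},
]
departments_table = [
    {"dept_id": 10, "dept_name": "Engineering"},
    {"dept_id": 20, "dept_name": "Sales"},
]


def to_document(employee, departments):
    """Embed the matching department into the employee record (a manual join)."""
    dept = next(d for d in departments if d["dept_id"] == employee["dept_id"])
    return {
        "_id": employee["emp_id"],
        "name": employee["name"],
        "department": {"id": dept["dept_id"], "name": dept["dept_name"]},
    }


# Document shape: one self-contained record per employee.
employee_docs = [to_document(e, departments_table) for e in employees_table]

# Documents in the same collection need not share the same fields.
employee_docs[1]["skills"] = ["negotiation"]
```

Reading one employee in the document shape needs no join, at the cost of repeating department details in every employee document.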
NoSQL databases are a relatively new type of database management system that has
gained popularity in recent years due to its scalability and flexibility. They are designed to
handle large amounts of unstructured or semi-structured data and can handle dynamic changes
to the data model. This makes NoSQL databases a good fit for modern web applications,
real-time analytics, and big data processing.
Key Features of NoSQL:
Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate
changing data structures without the need for migrations or schema alterations.
Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to
a database cluster, making them well-suited for handling large amounts of data and high levels
of traffic.
Document-based: Some NoSQL databases, such as MongoDB, use a document-based data
model, where data is stored in semi-structured format, such as JSON or BSON.
Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where
data is stored as a collection of key-value pairs.
Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model,
where data is organized into columns instead of rows.
Distributed and high availability: NoSQL databases are often designed to be highly available
and to automatically handle node failures and data replication across multiple nodes in a
database cluster.
Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and
dynamic manner, with support for multiple data types and changing data structures.
Performance: NoSQL databases are optimized for high performance and can handle a high
volume of reads and writes, making them suitable for big data and real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such
as MongoDB and Cassandra. The main advantages are high scalability and high availability.

High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the
partitioning of data across multiple machines, using a rule (such as key ranges, which preserve
the order of the data) that determines where each record lives. Vertical scaling means adding
more resources to the existing machine, whereas horizontal scaling means adding more machines
to handle the data. Vertical scaling is limited by the capacity of a single machine, while
NoSQL databases are designed to make horizontal scaling comparatively easy. Examples of
horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can handle a huge amount of
data because of this scalability: as the data grows, the database scales out to handle it
efficiently.
Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data,
which means that they can accommodate dynamic changes to the data model. This makes
NoSQL databases a good fit for applications that need to handle changing data requirements.
High availability: The automatic replication feature of NoSQL databases makes them highly
available, because in case of a failure the data can be served from a replica that holds the
last consistent state.
Scalability: NoSQL databases are highly scalable, which means that they can handle large
amounts of data and traffic with ease. This makes them a good fit for applications that need to
handle large amounts of data or traffic.
Performance: NoSQL databases are designed to handle large amounts of data and traffic,
which means that they can offer improved performance compared to traditional relational
databases.
Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational
databases, as they are typically less complex and do not require expensive hardware or
software.
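Agility: The flexible schema makes NoSQL databases well suited to agile development, where
data requirements change from one iteration to the next.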
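
The sharding described under High scalability above can be sketched with a simple hash-based router. This is a minimal illustration, assuming three hypothetical shards held in one process; hash placement is only one possible rule, and unlike range-based sharding it does not preserve key order.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical node names


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key; the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# Each shard is modeled as its own dictionary, standing in for a separate machine.
cluster = {name: {} for name in SHARDS}


def put(key, value):
    cluster[shard_for(key)][key] = value


def get(key):
    return cluster[shard_for(key)].get(key)


for i in range(100):
    put(f"user:{i}", {"id": i})
```

Every read and write touches exactly one shard, which is what lets capacity grow by adding machines. One design caveat: with plain modulo placement, changing the number of shards moves most keys, which is why real systems often use consistent hashing or range splitting instead.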
Disadvantages of NoSQL: NoSQL has the following disadvantages.
Lack of standardization: There are many different types of NoSQL databases, each with its
own unique strengths and weaknesses. This lack of standardization can make it difficult to
choose the right database for a specific application.
Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant, which means
that they may not guarantee the consistency, integrity, and durability of data. This can be a
drawback for applications that require strong data consistency guarantees.
Narrow focus: NoSQL databases have a narrow focus: they are mainly designed for storage
and provide little functionality beyond it. Relational databases are a better choice than
NoSQL in the field of transaction management.
Open-source: Many NoSQL databases are open source, and there is no reliable standard for
NoSQL yet. In other words, no two database systems are likely to behave alike.
Lack of support for complex queries: Many NoSQL databases are not designed to handle complex
queries, which means that they may not be a good fit for applications that require complex
data analysis or reporting.
Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional databases.
Management challenge: The purpose of big data tools is to make the management of a large
amount of data as simple as possible. But it is not so easy. Data management in NoSQL is
much more complex than in a relational database. NoSQL, in particular, has a reputation for
being challenging to install and even more hectic to manage on a daily basis.
GUI is not available: GUI tools for accessing the database are not widely available in the
market.
Backup: Backup is a weak point for some NoSQL databases. Taking a consistent backup of a
distributed deployment, such as a sharded MongoDB cluster, requires extra coordination.
Large document size: Some database systems like MongoDB and CouchDB store data in a
JSON-like format, which repeats the field names in every document. Documents can therefore be
quite large, which costs storage, network bandwidth, and speed, and descriptive key names make
this worse because they increase the document size.
Types of NoSQL database: The types of NoSQL databases, and examples of the database systems
that fall into each category, are:

1. Graph databases: Examples – Amazon Neptune, Neo4j
2. Key-value stores: Examples – Memcached, Redis, Coherence
3. Tabular (column-family) stores: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant

When should NoSQL be used:

1. When a huge amount of data needs to be stored and retrieved.
2. When the relationships between the data you store are not that important.
3. When the data changes over time and is not structured.
4. When support for constraints and joins is not required at the database level.
5. When the data is growing continuously and you need to scale the database regularly to
handle it.

CAP theorem
It is very important to understand the limitations of NoSQL databases. A distributed
NoSQL database cannot guarantee both consistency and high availability while a network
partition is in progress. This was first expressed by Eric Brewer in the CAP theorem.
The CAP theorem, or Brewer's theorem, states that we can achieve at most two
out of three guarantees for a database: Consistency, Availability, and Partition Tolerance.
Consistency means that all nodes in the network see the same data at the same time.
Availability is a guarantee that every request receives a response about whether it was
successful or failed. However, it does not guarantee that a read request returns the most recent
write. Informally, the more requests a system can continue to serve, the better its availability.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network outage
in the data center and some of the computers are unreachable, still the system continues to
perform.
Out of these three guarantees, no system can provide more than two at once. Since
network partitions cannot be ruled out in distributed systems, the tradeoff is always
between consistency and availability.
As depicted in the Venn diagram, an RDBMS can provide consistency and availability but not
partition tolerance, while HBase and Redis can provide consistency and partition tolerance.
MongoDB, CouchDB, Cassandra, and Dynamo are shown as favoring availability over strong
consistency, although MongoDB's placement depends on configuration: with its default
single-primary writes it is often classified as CP. Such databases generally settle for
eventual consistency, meaning that after a while all nodes converge on the same data.
Let us take a look at various scenarios or architectures of systems to better understand
the CAP theorem.
The first one is an RDBMS where the reading and writing of data happen on the same
machine. Such systems are consistent but not partition tolerant, because if this machine goes
down, there is no backup. Also, if one user is modifying a record, others have to wait,
which compromises high availability.
The second diagram is of a system that has two or more machines. Only one machine can accept
modifications, while reads can be served from all machines. In such systems, the
modifications flow from that one machine to the rest. Such systems are highly available as
there are multiple machines to serve requests. They are also partition tolerant, because if
one machine goes down, there are other machines available to take up that responsibility.
Since it takes time for the data to reach the other machines from node A, those machines may
be serving older data. This causes inconsistency, although the data eventually reaches all
machines and after a while the copies agree again. Therefore we call such systems eventually
consistent instead of strongly consistent. This kind of architecture is found in ZooKeeper and
MongoDB.
In the third design, we have one machine similar to our first
diagram along with its backup. Every new change or modification at A in the diagram is
propagated to the backup machine B. There is only one machine which is interacting with the
readers and writers, so it is consistent but not highly available.
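
The second scenario above, where one machine accepts writes and the others serve reads, can be sketched as follows. This is an in-memory toy, assuming a hypothetical `replicate` step that stands in for the asynchronous shipping of changes in a real system.

```python
class Primary:
    """The single machine that accepts modifications (node A in the text)."""

    def __init__(self):
        self.data = {}
        self.pending = []  # changes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Replica:
    """A read-only copy that lags behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(primary, replicas):
    """Ship all pending changes to every replica (asynchronous in real systems)."""
    for key, value in primary.pending:
        for replica in replicas:
            replica.data[key] = value
    primary.pending.clear()


node_a = Primary()
node_b = Replica()

node_a.write("price", 100)
replicate(node_a, [node_b])

node_a.write("price", 120)
stale = node_b.read("price")  # the replica has not received the update yet
replicate(node_a, [node_b])
fresh = node_b.read("price")  # the replica has converged
```

Between the second write and the next replication step, a reader on node B sees the old price. That window is the inconsistency the text describes, and the later convergence is what "eventually consistent" refers to.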
To restate C, A, and P in simple words:
Consistency means that all clients see the same data at the same time, no matter which
node they connect to in a distributed system. To achieve consistency, whenever data is written
to one node, it must be instantly forwarded or replicated to all the other nodes in the system
before the write is deemed successful.
Availability means that every non-failing node returns a response for all read and write
requests in a reasonable amount of time, even if one or more nodes are down. Another way to
state this — all working nodes in the distributed system return a valid response for any request,
without failing or exception.
Partition Tolerance means that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network outage
in the data center and some of the computers are unreachable, still the system continues to
perform. Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
The CAP theorem categorizes systems into three categories:
CP (Consistent and Partition Tolerant) database: A CP database delivers
consistency and partition tolerance at the expense of availability. When a partition occurs
between any two nodes, the system has to shut down the non-consistent node (i.e., make it
unavailable) until the partition is resolved.
A partition is a communication break between nodes within a distributed system.
That is, if a node cannot receive any messages from another node in the system, there is a
partition between the two nodes. A partition can be caused by network failure, a server
crash, or any other reason.
AP (Available and Partition Tolerant) database: An AP database delivers
availability and partition tolerance at the expense of consistency. When a partition occurs, all
nodes remain available but those at the wrong end of a partition might return an older version
of data than others. When the partition is resolved, the AP databases typically resync the nodes
to repair all inconsistencies in the system.
CA (Consistent and Available) database: A CA database delivers consistency and availability
in the absence of any network partition. Single-node DB servers are often categorized as
CA systems, because they do not need to deal with partition tolerance.
In any networked shared-data system or distributed system, partition tolerance is a
must. Network partitions and dropped messages are a fact of life and must be handled
appropriately. Consequently, system designers must choose between consistency and
availability.
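
The choice between consistency and availability during a partition can be made concrete with a two-node sketch. This is a simplified model and not the behavior of any specific database: the `Node` class, the `"CP"`/`"AP"` mode labels, and the `set_partition` helper are assumptions for illustration. With only two nodes there is no majority, so the CP pair here refuses requests on both sides, whereas real CP systems usually keep the majority side available.

```python
class Node:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "CP" or "AP" (labels used only in this sketch)
        self.value = None
        self.peer = None
        self.partitioned = False  # True while the peer is unreachable

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Cannot replicate, so refuse the write rather than diverge.
                raise RuntimeError(f"{self.name}: unavailable during partition")
            # AP: accept the write locally and reconcile after the partition heals.
            self.value = value
            return
        self.value = value
        self.peer.value = value   # synchronous replication while connected

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError(f"{self.name}: unavailable during partition")
        return self.value


def make_pair(mode):
    a, b = Node("A", mode), Node("B", mode)
    a.peer, b.peer = b, a
    return a, b


def set_partition(a, b, flag):
    a.partitioned = b.partitioned = flag


# AP pair: stays available, but the two sides diverge.
a, b = make_pair("AP")
a.write("v1")
set_partition(a, b, True)
a.write("v2")
ap_reads = (a.read(), b.read())

# CP pair: stays consistent, but rejects requests during the partition.
c, d = make_pair("CP")
c.write("v1")
set_partition(c, d, True)
try:
    c.write("v2")
    cp_write_rejected = False
except RuntimeError:
    cp_write_rejected = True
```

In the AP pair, node B keeps answering with the older value while A has moved on. In the CP pair no client ever sees two different values, but the write fails until the partition is resolved.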
