NoSQL - Database Revolution - Resp
Let's revisit the origin of DBMS to align our understanding before you step into
NoSQL.
Data storage and retrieval have been a key focus area throughout the evolution of
computers.
Towards the end of the 19th century, punch cards were used for input,
output, and data storage. They provided a faster way to key in data and to
retrieve it.
Later, in the 1960s, two famous DBMSs were launched: IBM's Information
Management System (IMS), written for the Apollo program on System/360,
and the Integrated Data Store (IDS) by Charles W. Bachman.
Both IDS and IMS are known as navigational DBMSs.
Navigational Syndrome
Let's discuss the pain areas of Navigational DBMS -
Record centered: One must navigate from one object to another using pointers
or links. For instance, to trace an 'order' it would be necessary first to locate
the 'customer', then follow the link to the customer’s orders.
Extremely inflexible in terms of data structure and query capabilities.
Difficult to add new data elements to an existing system.
Databases were too hard to use, and they lacked a theoretical foundation.
These limitations motivated the search for a better data model.
In 1970, E.F. Codd envisioned a new model of DBMS through his paper titled A
Relational Model of Data for Large Shared Data Banks which paved the way for the
emergence of Relational DBMS (RDBMS).
RDBMS formulated a new methodology for storing data and processing large
databases.
The records (data) would be stored in tables of fixed-length records, unlike the
free-form lists of linked records in IDS and IMS.
Later, databases such as Ingres and query languages such as SQL evolved.
The benefits of RDBMS gained wide acceptance, resulting in buy-in from
different vendors and setting the stage for an era of database wars.
Many RDBMSs such as Sybase, Microsoft SQL Server, Informix, MySQL, DB2, and
Oracle were launched over the following decades, each claiming better
Performance
Availability
Functionality
Cost of storage
Economy of usage.
With no real alternatives, RDBMS was firmly entrenched by the early 2000s.
NoSQL Explosion
Around 2005, the shift in application architecture from the client-server era
to the era of massive web-scale applications put a lot of pressure on RDBMS
in terms of
Level of usage
Volume of data
Ability to handle and monitor change
which RDBMS could not absorb through incremental innovation.
This started the era of distributed non-relational database management systems,
later termed 'NoSQL', which were better aligned with new-age applications.
The term 'NoSQL' drew attention to database systems that broke with the practices of
traditional SQL databases.
NoSQL Traits
In this video, you will understand more about the NoSQL traits.
What's Next?
Having understood the evolution of the DB, it's time to get familiarized with NoSQL.
In the upcoming sections, you will learn about
Key differences between RDBMS and NoSQL
How data replication works in NoSQL
Four types of NoSQL
NoSQL types explained
How to choose the right NoSQL type?
RDBMS vs NoSQL
Scaling
RDBMS - Vertical Scaling
The architecture is designed to run well on a single machine.
The way to handle larger volumes of operations is to upgrade the machine with a faster
processor or more memory.
There is a limit to how far a single machine can be scaled.
NoSQL - Horizontal Scaling
NoSQL databases are intended to run on clusters of comparatively low-
specification servers.
To handle more data, add more servers to the cluster.
Designed to perform well even on low-cost hardware.
A relatively cheaper approach to handling increases in
o Number of operations
o Size of the data.
Maintenance
Data Model
Caching
What is CAP?
Before we proceed further with the comparison, let's quickly understand more about
CAP (Consistency, Availability, and Partition Tolerance) from this video.
Implications of CAP
Let's understand the implications of CAP in this video and know why this is very
important while designing a database.
Changing pH measure
Here you will get to know about core principles of DB processing.
RDBMS - ACID
Atomicity: If any one element of a transaction fails then the entire transaction
fails.
Consistency: The transaction must leave the database in a state that satisfies all
defined rules and constraints.
Isolation: No transaction has access to any other transaction that is in an
intermediate or unfinished state.
Durability: Once the transaction is committed, its changes persist and
cannot be undone, even after a system failure.
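Atomicity is easier to see in a small example. The following is a minimal sketch using Python's built-in sqlite3 module; the accounts table, the names, and the "no negative balance" rule are all hypothetical and only serve the illustration.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Hypothetical business rule: no account may go negative.
            (lowest,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("harry", 100), ("ron", 20)])
conn.commit()

ok = transfer(conn, "harry", "ron", 30)       # both updates are applied
failed = transfer(conn, "ron", "harry", 500)  # both updates are rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer fails after both UPDATE statements have run, yet neither change survives, which is the behaviour the Atomicity rule describes.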
NoSQL - BASE
Basically Available: The system guarantees availability of the data in the sense of the
CAP (Consistency, Availability, and Partition Tolerance) theorem.
Soft state: The state of the system may change over time, even during
times without input.
Eventual consistency: The system eventually becomes consistent once it
stops receiving input.
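The three BASE properties can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single value, in-order delivery, no failures); the class and method names are made up for the example and do not belong to any real database.

```python
class EventuallyConsistentStore:
    """Toy model: one value, several replicas, asynchronous propagation."""

    def __init__(self, replica_count):
        self.replicas = [None] * replica_count
        self.pending = []  # (replica_index, value) updates not yet delivered

    def write(self, value):
        # The write is acknowledged as soon as replica 0 has it.
        self.replicas[0] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, value))

    def read(self, replica_index):
        # Always answers (basically available), possibly with stale data.
        return self.replicas[replica_index]

    def propagate(self):
        # State changes here without any new input (soft state).
        for index, value in self.pending:
            self.replicas[index] = value
        self.pending.clear()

    def is_consistent(self):
        return len(set(self.replicas)) == 1


store = EventuallyConsistentStore(replica_count=3)
store.write("v1")
stale_read = store.read(2)                 # replica 2 has not seen the write yet
consistent_before = store.is_consistent()
store.propagate()                          # no further writes arrive
consistent_after = store.is_consistent()
fresh_read = store.read(2)
```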
BASE Transactions
In this video, you will get to know more about BASE.
Data Replication
Flavors
NoSQL Categories
In this video, let's understand more about the four different categories at a high level.
Key-Value Stores
The simplest type of NoSQL database.
The data is stored in key-value pairs.
Provides fast access, since each operation is a simple lookup by key.
Data is easy to access via an API, through which the client can
o get the value for a key
o put a value for a key
o delete a key
A few key-value databases: Riak, Redis, Memcached, Berkeley DB, Amazon
DynamoDB (not open-source), and so on.
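The three operations listed above can be sketched with an in-memory dictionary. The class below is hypothetical and is not the client API of any of the products named; it only shows the shape of a get/put/delete interface.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing get, put, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is opaque to the store: it is saved and returned as-is.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Returns True if the key existed, False otherwise.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("student:003", {"first": "Harry", "last": "Potter"})
found = store.get("student:003")
removed = store.delete("student:003")
missing = store.get("student:003")
```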
Columnar Stores
Stores data as sections of data columns (column families).
A column family contains rows, each with many associated columns.
Column families are chunks of related data often accessed together.
Popular columnar databases include Cassandra, HBase, Hypertable and Google
Bigtable.
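For example, a three-row table serialized column by column, with each value followed
by its row ID:
10:001,11:002,15:003;
Ron:001,Hermione:002,Harry:003;
Weasley:001,Granger:002,Potter:003;
48000:001,58000:002,42000:003;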
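The four lines above can be produced from ordinary rows by transposing them. This is a minimal sketch; the notes do not name the columns, so the tuple layout below (row ID first, then four values) is an assumption made for the example.

```python
# Each tuple is one row: (row_id, col1, col2, col3, col4). The layout is assumed.
rows = [
    ("001", 10, "Ron", "Weasley", 48000),
    ("002", 11, "Hermione", "Granger", 58000),
    ("003", 15, "Harry", "Potter", 42000),
]


def serialize_by_column(rows):
    """Return one line per column, each value tagged with its row ID."""
    # zip(*rows) transposes rows into columns; the first column holds the row IDs.
    row_ids, *columns = zip(*rows)
    lines = []
    for column in columns:
        pairs = (f"{value}:{row_id}" for value, row_id in zip(column, row_ids))
        lines.append(",".join(pairs) + ";")
    return lines


column_lines = serialize_by_column(rows)
```

Because all values of one column end up next to each other, a query that needs a single column reads one line instead of every row.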
Document Stores
Stores and retrieves documents in formats such as XML, JSON, and BSON.
Documents consist of maps, collections, and scalar values.
Document stores fall mainly into two categories:
- XML-based databases
- JSON-based databases
A few of the document databases are MongoDB, CouchDB, Terrastore, RavenDB, and
OrientDB.
{
  "FirstName": "Harry",
  "LastName": "Potter",
  "phone": {
  },
  "Hobby": ["Potion", "Quidditch"]
}
<contact>
  <firstname>Harry</firstname>
  <lastname>Potter</lastname>
  <address>
  </address>
</contact>
Graph Store
Uses nodes, relationships, and properties to represent data.
All nodes are connected through relationships.
A relationship has a direction, a type, a start node, and an end node.
Uses graph theory to store, map, and query relationships.
A few popular graph databases are Neo4j, AllegroGraph, Oracle Spatial and Graph,
Teradata Aster, ArangoDB, and Graphbase.
You will learn more about the different flavors in detail in the following sections.
Key-Value - Zeroed
The key-value design is similar to a simple hash table. It can be
compared to an RDBMS table with two columns (ID/Name).
The value could be a blob, text, JSON, XML, and so on.
Popular databases include Riak, Redis, Memcached, and Amazon
DynamoDB.
The key must be unique and should not be too long.
Redis supports richer operations on its data structures, such as range, difference, and
intersection.
In this section, you will understand more about Key-value DBs.
Consistency
Consistency applies only to operations on a single key, since each operation is a
get, put, or delete.
Although optimistic writes can be performed, they are expensive to
implement.
In distributed key-value store implementations like Riak, values are
replicated to other nodes.
Buckets act as namespaces for keys, which reduces key collisions. Example: all Student
keys may reside in the Student bucket.
A bucket can be set up so that a write is considered good only when the data is
consistent across all the nodes where the data is stored.
Buckets in key-value databases are similar to tables in an RDBMS.
Query Features
The design of the key plays a prominent role. A key can be generated
o using some algorithm
o from user inputs (user-id, name, email-id)
o from timestamps or other external data.
Data can be queried only by key, which returns the associated value.
Querying by an attribute inside the value is not possible from the database.
In some databases, the value for a key is retrieved using a fetch API. Ex: Riak.
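The three ways of building a key can be sketched as small helper functions. The function names and key formats below are hypothetical; real systems choose their own conventions.

```python
import hashlib
import time


def key_from_user(user_id, email):
    """Key built from user inputs (user-id and email-id)."""
    return f"user:{user_id}:{email.lower()}"


def key_from_hash(content):
    """Key generated by an algorithm, here a SHA-256 digest of the content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def key_from_timestamp(prefix, timestamp=None):
    """Key derived from a timestamp; the current time is used if none is given."""
    if timestamp is None:
        timestamp = time.time()
    return f"{prefix}:{int(timestamp)}"


user_key = key_from_user(42, "Harry@Example.com")
hash_key = key_from_hash("some document body")
time_key = key_from_timestamp("session", 1700000000.0)
```

Since the store can only look data up by key, whatever the application will later search for has to be recoverable from the key itself.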
Scaling
Scalability of a key-value database is achieved through sharding.
In sharding, the value of the key determines the node on which the key is stored.
For example, say you are sharding by the first character of the key.
The key k76151487d, which starts with 'k', will be sent to a different node than
the key dgh396542, which starts with 'd'.
Benefits
Performance increases as more nodes are added to the cluster.
Impact
If the node that stores keys starting with 'k' goes down, the data stored on that node
becomes unavailable, and no new data can be written with keys that start with 'k'.
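Both the benefit and the impact can be seen in a small sketch of first-character sharding. The class, the node names, and the routing rule (character code modulo node count) are assumptions for illustration, not the scheme of any particular database.

```python
class ShardedStore:
    """Routes each key to a node based on the first character of the key."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.data = {name: {} for name in self.node_names}
        self.down = set()  # names of nodes that are currently unavailable

    def node_for(self, key):
        # Assumed routing rule: character code of the first letter, modulo node count.
        return self.node_names[ord(key[0]) % len(self.node_names)]

    def put(self, key, value):
        node = self.node_for(key)
        if node in self.down:
            raise ConnectionError(f"{node} is down")
        self.data[node][key] = value

    def get(self, key):
        node = self.node_for(key)
        if node in self.down:
            raise ConnectionError(f"{node} is down")
        return self.data[node][key]


store = ShardedStore(["node-a", "node-b", "node-c"])
store.put("k76151487d", "first")
store.put("dgh396542", "second")
node_k = store.node_for("k76151487d")
node_d = store.node_for("dgh396542")

store.down.add(node_k)  # simulate the failure of the node holding 'k' keys
try:
    store.put("k999", "new")
    write_failed = False
except ConnectionError:
    write_failed = True
still_readable = store.get("dgh396542")
```

Keys on the surviving nodes stay available, but everything routed to the failed node is lost until it returns, which is the problem replication addresses.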
How to overcome this issue?
Riak lets you tune the trade-offs described by the CAP theorem with three settings:
N - number of nodes that store replicas of each key-value pair.
R - number of nodes that must respond to a read.
W - number of nodes that must acknowledge a write.
For example, consider a 5-node Riak cluster configured with
N = 3 => all data should be replicated to at least 3 nodes.
R = 2 => any 2 nodes must respond to a GET request for it to be considered successful.
W = 2 => a PUT request must be written to 2 nodes before the write is considered successful.
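With N = 3, R = 2, and W = 2, any two replicas that answer a read must include at least one that took the latest write, because R + W > N. The simulation below is a simplified sketch of that overlap argument, not Riak's actual implementation; the Replica class, the version counter, and the random choice of responding nodes are assumptions.

```python
import random


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def put(replicas, value, version, w, rng):
    # Simplification: only w replicas receive the write before it is acknowledged.
    for replica in rng.sample(replicas, w):
        replica.version, replica.value = version, value


def get(replicas, r, rng):
    # Ask r replicas and keep the answer with the highest version.
    responders = rng.sample(replicas, r)
    return max(responders, key=lambda replica: replica.version).value


N, R, W = 3, 2, 2
rng = random.Random(0)
replicas = [Replica() for _ in range(N)]

results = []
for version in range(1, 101):
    put(replicas, f"value-{version}", version, W, rng)
    results.append(get(replicas, R, rng) == f"value-{version}")
always_latest = all(results)
```

Lowering R or W so that R + W no longer exceeds N makes reads and writes faster but allows a read to miss the latest write.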
Key-value databases are not the best fit in the scenarios highlighted below.
Relationships among data - You need relationships between
different sets of data, or need to correlate data across different sets of keys.
Multi-operation transactions -
You are storing many keys and, when saving one of them fails,
you want to roll back the rest of the operations.
Query data by value - You need to search for keys based on something found in the
value part of the key-value pairs. Some exceptions include Riak Search or
indexing engines such as Lucene or Solr.
Operations on groups - As operations are confined to one key at a time, there
is no way to operate on several keys at once.
Columnar - Zeroed
The columnar model allows efficient data storage.
Data is stored on a column-family basis.
Columns can store any data type, as long as it can be represented as an array of bytes.
Avoids storing null values.
Each unit of data is a set of key/value pairs, and the unit itself is
identified by a primary identifier (primary key).
The column families are not physically isolated for a given row.
All data pertaining to a row key is stored together.
Columnar Architecture
The key drawback of columnar architecture is the insert and update overhead for single
rows.
Columnar databases implement a write-optimized store (the delta
store) to handle a constant trickle feed of changes.
The delta store is memory resident and can accept high-frequency data
modifications in an uncompressed form.
Data in the delta store is merged into the main column-oriented store
periodically or when it crosses a threshold.
Mode of operation:
o Large-scale bulk loads are directed to the column store.
o Incremental inserts/updates flow to the delta store.
o Queries read from both stores to fetch complete results.
o A process moves data from the delta store to the column store
periodically.
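The four steps above can be sketched in a few lines. This is a toy model under stated assumptions: the class name, the row format (dictionaries), and the merge threshold of three rows are hypothetical, and no compression is modelled.

```python
class ColumnTable:
    """Column store with a small row-oriented delta store in front of it."""

    def __init__(self, column_names, merge_threshold=3):
        self.columns = {name: [] for name in column_names}  # main column store
        self.delta = []  # uncompressed, in-memory buffer of recent rows
        self.merge_threshold = merge_threshold

    def bulk_load(self, rows):
        # Large loads go straight to the column store.
        for row in rows:
            for name in self.columns:
                self.columns[name].append(row[name])

    def insert(self, row):
        # Incremental inserts go to the delta store first.
        self.delta.append(row)
        if len(self.delta) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Move everything from the delta store into the column store.
        self.bulk_load(self.delta)
        self.delta = []

    def column_values(self, name):
        # Queries read both stores to get complete results.
        return self.columns[name] + [row[name] for row in self.delta]


table = ColumnTable(["name", "salary"])
table.bulk_load([{"name": "Ron", "salary": 48000}, {"name": "Hermione", "salary": 58000}])
table.insert({"name": "Harry", "salary": 42000})
delta_size_before = len(table.delta)
total_salary = sum(table.column_values("salary"))

table.insert({"name": "Luna", "salary": 40000})
table.insert({"name": "Neville", "salary": 41000})  # third insert triggers a merge
delta_size_after = len(table.delta)
```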
The main idea of the columnar approach is that the data for each column is grouped on
disk.
Values for a specific column are co-located in the same disk blocks.
Aggregation over a specific column is optimized because all values
to be aggregated sit in the same disk blocks.
The exact IO and CPU optimizations depend on
o Workload
o Indexing and
o Schema design.
Apache HBase
Cassandra
Developed initially by Facebook for its inbox search feature and later handed
over to Apache.
Offers high scalability and availability, and avoids a single point of
failure.
Offers very fast writes without compromising read efficiency.
Key variables providing a variety of outcomes:
Apache Kudu
Open source storage engine
Designed to support Hadoop ecosystem tools (Cloudera Impala, Apache Spark,
and MapReduce).
Distributes data using horizontal partitioning.
Supports low-latency random access and efficient analytical access patterns.
Offers API for row-level inserts/updates/deletes.
Document Database
Document databases store structured documents, typically XML or
JSON, i.e., sets of key/value pairs.
Documents are treated as a whole; splitting a document into its
constituent name/value pairs is avoided.
Puts together a diverse set of documents into a single collection.
Allows indexing of documents based on their primary identifier and properties.
Can store documents or spreadsheets as well.
Implements ACID transactions and adopts RDBMS characteristics.
Supports Query transactions (to an extent).
XML Databases
XML documents formed the basis of the first document databases.
XML is capable of representing almost any form of information.
XML has a variety of standards and tools to assist with authoring, validating,
searching, and transforming XML documents.
Let's understand the different tools and their usage.
XML Tools and Standards - Snapshot:
XPath: Syntax for retrieving specific elements from an XML document.
XQuery: Query language for interrogating XML documents, also known as “the SQL
of XML”.
XML Schema: Document template that defines which elements may be
present in a specified class of XML documents, used to validate document
correctness.
XSLT: Language to transform XML documents into other formats, including non-XML
formats such as HTML.
Document Object Model (DOM): Object-oriented API to interact with XML,
XHTML, and similarly structured documents.
XML Databases
Provide a platform that incorporates various XML standards like XQuery and
XSLT.
Provide services for the storage, indexing, security, and concurrent access to
XML files.
Famous XML databases
eXist (open-source)
MarkLogic (commercial)
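To make XPath-style retrieval concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The contacts document is made up for the example and extends the contact element shown earlier; a full XML database would offer complete XPath and XQuery support instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with two contact elements.
xml_text = """
<contacts>
  <contact>
    <firstname>Harry</firstname>
    <lastname>Potter</lastname>
  </contact>
  <contact>
    <firstname>Hermione</firstname>
    <lastname>Granger</lastname>
  </contact>
</contacts>
""".strip()

root = ET.fromstring(xml_text)

# Path expression: every firstname element under a contact.
first_names = [element.text for element in root.findall("./contact/firstname")]

# Predicate: the contact whose lastname child has the text 'Potter'.
potter_first_name = root.find("./contact[lastname='Potter']/firstname").text
```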
JSON Databases
JSON was created by JavaScript pioneer Douglas Crockford while he was attempting to
build a framework for more dynamic and interactive web applications.
JSON is a lightweight alternative to XML.
A JSON document database expects data to be stored in JSON format.
o A document is the base unit of storage (resembles a row in an RDBMS).
o It contains one or more key-value pairs, nested documents, and arrays.
o Arrays may hold complex hierarchical structures.
A collection (data bucket) is a group of documents sharing some common
objective (resembles a table in an RDBMS).
Although preferred, documents in a collection need not be of the same type.
In the example, “players” are nested as an array within documents. This pattern is
known as 'document embedding'.
Where does it score?
The design pattern allows retrieving information in a single operation.
Avoids performing joins within the application.
Issues
Base information is duplicated across multiple documents.
This complicates the design and can result in inconsistency.
Solution
Link multiple documents using document identifiers (resembles a foreign key in an
RDBMS).
Provides a balance between performance and maintainability.
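The difference between embedding and linking can be sketched with plain Python dictionaries standing in for JSON documents. The team and player documents, the field names, and the resolve_players helper are all hypothetical.

```python
# Embedding: the players live inside the team document (one read, no join).
embedded_team = {
    "_id": "team-1",
    "name": "Gryffindor",
    "players": [
        {"name": "Harry Potter", "position": "Seeker"},
        {"name": "Ron Weasley", "position": "Keeper"},
    ],
}

# Linking: players are separate documents, referenced by identifier.
players = {
    "player-1": {"name": "Harry Potter", "position": "Seeker"},
    "player-2": {"name": "Ron Weasley", "position": "Keeper"},
}
linked_team = {
    "_id": "team-1",
    "name": "Gryffindor",
    "player_ids": ["player-1", "player-2"],
}


def resolve_players(team, player_collection):
    """Follow the identifiers, much like resolving foreign keys in the application."""
    return [player_collection[player_id] for player_id in team["player_ids"]]


resolved = resolve_players(linked_team, players)
```

With linking, a player who appears in several teams is stored once, so an update touches one document; the price is the extra lookups done by resolve_players.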
JSON databases
MongoDB, CouchDB, OrientDB, and DocumentDB
MongoDB
Couchbase
Graph vs RDBMS
Graph constructs can be represented in a relational model, but
there are two main challenges to be addressed:
Using SQL syntax to perform graph traversal of unknown depth is
not easy. Ex: With SQL, determining the friends of your friends is easy, but
it's hard to address the “degrees of separation” problem (i.e., the number of
connections that separate one person from another).
Performance degrades while traversing the graph, because
query response time increases at each level. Ex: Each level adds to the number of joins
required in SQL, and the query cannot be generalized because the depth is
not known in advance.
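The “degrees of separation” problem is a shortest-path search, which is natural to express as a graph traversal. The sketch below uses breadth-first search over an in-memory adjacency list; the friendship data and function name are made up for the example, and a graph database would run an equivalent traversal natively.

```python
from collections import deque


def degrees_of_separation(friends, start, target):
    """Return the number of connections between two people, or None if unconnected."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == target:
                return depth + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None


# Hypothetical friendship graph (each link is listed in both directions).
friends = {
    "Harry": ["Ron"],
    "Ron": ["Harry", "Hermione"],
    "Hermione": ["Ron", "Neville"],
    "Neville": ["Hermione"],
    "Luna": [],
}

harry_to_neville = degrees_of_separation(friends, "Harry", "Neville")
harry_to_luna = degrees_of_separation(friends, "Harry", "Luna")
```

The loop keeps going for as many levels as the data requires, which is exactly what a fixed number of SQL self-joins cannot express.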
Parameters to Focus
Key parameters to be taken into consideration while weighing NoSQL databases
against each other include:
Database features
Performance and
Context-based criteria
NoSQL databases come in different shapes, sizes, and forms.
Feature-based comparison is the best way to group them logically.
Polyglot Persistence
In this video, let's understand more about the Polyglot Persistence.
Feature Comparison
Let's quickly compare and contrast the different NoSQL choices on the
basis of the following features:
Scalability
Transactional integrity and consistency
Data modeling
Query support
Access and interface availability
Scalability
Not all NoSQL databases offer horizontal scalability to the same degree.
HBase and Hypertable carry an advantage, while Redis, MongoDB, and
Couchbase Server lag behind.
The difference becomes more pronounced as the data size grows beyond a few
petabytes.
Data Modeling
RDBMS offers a consistent and organized way of modeling data with standardized
implementation.
The NoSQL world has no single standardized, well-defined data model, as the
products are not bound to solve the same problem or share
the same architecture.
MongoDB (document DB) has gradually adopted a few RDBMS concepts, like
o SQL-like querying
o Rudimentary relational references
o Database objects (inspired by the standard table and column-based
model)
Querying Support
Querying data from a database easily and effectively is
an interesting puzzle to solve.
With standardized syntax and semantics, RDBMS thrives on SQL support for easy
access to data.
Among NoSQL databases:
MongoDB and CouchDB (document DBs) come with querying capabilities that
are as powerful as those of an RDBMS.
Among key-value DBs, Redis alone supports querying the data structures it stores.
Among columnar DBs, HBase has limited querying capabilities.
Why Benchmark?
Benchmarking is required to compare and gain deeper insight into how the different
NoSQL products stack up.
Yahoo! Cloud Serving Benchmark (YCSB) is one of the best-known
benchmarking frameworks for comparing NoSQL products.
Other known benchmarks from various product vendors include
Tokyo Cabinet Benchmarks
How fast is Redis - from Redis
Riak benchmark - from Riak
VoltDB - Key/value benchmarking
Sort benchmark.
Formation
The success of Google (Bigtable) and Amazon (Dynamo) triggered the creation
of HBase, Hypertable, Cassandra, and Riak.
Know more about the formation of some of the DBs listed below.
CouchDB
MongoDB
Redis
Which type of Key-Value datastore DB has its key and value sorted?
Ordered Key-Value datastore
In the Master-Slave Replication model, the node which pushes all the
updates in data to subordinate nodes is __________.
All the options
Horizontal scaling approach tends to be cheaper as the number of operations and the size
of the data increases.--true
Full-form of 'CRUD' is _________.--create read update delete
Key-value pair data storages include all except ________.--Network attached storage
HBase Tables are divided _________ by row key range into ________.--horizontally, regions
In a columnar database, the columns are stored together on disk, achieving a higher
compression ratio is an expensive operation.--false
The column store has to perform _____ IO to insert a new value.--As many disc blocks
MongoDB read/write performance can be tuned with the help of Stored Procedures.--false
Document databases split a document into its constituent name/value pairs for indexing
purpose.--false
The MATCH clause is roughly equivalent to the _______ clause in SQL and the RETURN
clause to a ______ clause--where/select
---------------
Sorted Column store would provide higher compression ratio by representing each column
as ________ compared to the preceding one.--delta