NoSQL - Database Revolution - Resp

The document provides background information on NoSQL databases and their evolution. It discusses how traditional RDBMS systems led to limitations that prompted the development of NoSQL databases. The document then covers the four main types of NoSQL databases (key-value stores, columnar stores, document stores, and graph stores) and how they address issues like scalability and flexibility that RDBMS systems struggle with. It aims to help readers understand the origins and value of the NoSQL approach to non-relational data storage.

Uploaded by

IgorJales
Copyright
© All Rights Reserved
NoSQL - Database Revolution

NoSQL - Journey Ahead!


Many big enterprises have started adopting alternative databases such
as NoSQL, moving away from their traditional reliance on RDBMS.
As a result, they can save money, innovate more rapidly, and achieve better productivity
and quicker ROI.
In this course, you will explore NoSQL in detail: its types, modus operandi, and
storage models, and finally how to make the right choice of NoSQL.
As you progress, parallels with RDBMS are drawn to aid
understanding.

Origin: Punch Cards to DBMS

Let's revisit the origin of DBMS to align our understanding before you step into
NoSQL.
 Data storage and retrieval has been a key focus area throughout the evolution of
computers.
 Towards the end of the 19th century, 'punch cards' were used for input,
output, and data storage. They provided a faster way to key in data and to
retrieve it.
 Later, in the 1960s, two famous DBMSs were launched: IBM's Information
Management System (IMS), written for the Apollo program on System/360,
and the Integrated Data Store (IDS) by Charles W. Bachman.
 Both IDS and IMS were called Navigational DBMSs.

Navigational Syndrome
Let's discuss the pain areas of Navigational DBMS -
 Record centered: One must navigate from one object to another using pointers
or links. For instance, to trace an 'order' it would be necessary first to locate
the 'customer', then follow the link to the customer’s orders.
 Extremely inflexible in terms of data structure and query capabilities.
 Difficult to add new data elements to an existing system.
 Databases were too hard to use, and they lacked a theoretical foundation.
These limitations motivated the search for a better model.

Rise of RDBMS: Codd's Vision

In 1970, E.F. Codd envisioned a new model of DBMS in his paper titled "A
Relational Model of Data for Large Shared Data Banks", which paved the way for the
emergence of Relational DBMS (RDBMS).
 RDBMS formulated a new methodology for storing data and processing large
databases.
 The records (data) would be stored in a 'table' with fixed-length records, unlike the
free-form list of linked records in IDS and IMS.
 Later, databases like Ingres and query languages like SQL evolved from this model.

Era of Database Wars

The benefits of RDBMS reached a wide audience, resulting in buy-in from
different vendors and setting the stage for an era of database wars.
Many RDBMSs such as Sybase, Microsoft SQL Server, Informix, MySQL, DB2, and
Oracle were launched in the following decades, each claiming better
 Performance
 Availability
 Functionality
 Cost of storage
 Economy of usage.
With no alternatives, RDBMS was firmly entrenched by the early 2000s.

NoSQL Explosion
Later, around 2005, the shift in application architecture from the client-server era
to the era of massive web-scale applications put a lot of pressure on RDBMS
in terms of the
 Level of usage
 Volume of data considered
 Ability to handle/monitor change
which RDBMS couldn't keep up with through incremental innovation.
This started the era of Distributed Non-Relational Database Management Systems,
later termed 'NoSQL', which were more aligned to new-age applications.
'NoSQL' drew attention to database systems that broke with the practice of the
traditional SQL database.

Byte it!

The NoSQL market is forecast to reach $4.2 billion in revenue by 2020.

NoSQL - Where it Scores?


Key features of NoSQL that make it a sought-after DB:
 Distributed computing system
 Higher scalability
 Reduced Costs
 Flexible schema design
 Process unstructured and semi-structured data
 No complex relationship
 Open-sourced

NoSQL Traits
In this video, you will understand more about the NoSQL traits.

Focus - Key Points


Let's focus on the Key points to be noted in NoSQL in this video.

What's Next?
Having understood the evolution of the DB, it's time to get familiarized with NoSQL.
In the upcoming sections, you will learn about
 Key differences between RDBMS vs NoSQL
 How Data replication works in NoSQL
 Four types of NoSQL
 NoSQL types explained
 How to choose the right NoSQL type?

RDBMS vs NoSQL

The above illustration depicts the key differentiators between RDBMS and NoSQL.


You will go through various parameters in the following set of cards to get a broader
understanding.

Scaling
RDBMS - Vertical Scaling
 The architecture is designed to run well on a single machine.
 The way to handle larger volumes of operations is to upgrade the machine with a faster
processor or more memory.
 There is a limit to the size/level of scaling.
NoSQL - Horizontal Scaling
 NoSQL databases are intended to run on clusters of comparatively
low-specification servers.
 To handle more data, add more servers to the cluster.
 Designed to perform well even on low-cost hardware.
 Relatively cheaper approach to handle increased
o Number of operations
o Size of the data.

Maintenance

RDBMS - High Maintenance


 Maintaining high-end RDBMS systems is expensive and requires a trained
workforce for database management.
NoSQL - Low Maintenance
 NoSQL databases require minimal management and support many
features that reduce the need for administration and
tuning. These include
o Automatic repair
o Easier data distribution
o Simpler data models

Data Model

RDBMS - Rigid Data Model:


 RDBMS requires data in a structured format as per the defined data model.
 As schema changes are difficult in SQL because of the strong dependency on
primary/foreign keys, ad-hoc data insertion becomes tougher.
NoSQL - No Schema/Data model:
 NoSQL database is schema-less so that data can be inserted into a database
with ease, even without any predefined schema.
 The format or data model could be changed anytime, without application
disruption.
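
As a rough illustration of the schema-less idea, the sketch below uses a plain Python list as a stand-in for a collection. The `students` name and the fields are made up for this example and do not belong to any particular NoSQL product.

```python
# A plain Python list standing in for a schema-less collection (hypothetical).
students = []

# Records with different shapes can be inserted side by side;
# no table definition or ALTER TABLE step is needed beforehand.
students.append({"name": "Harry", "house": "Gryffindor"})
students.append({"name": "Luna", "house": "Ravenclaw", "hobby": ["Singing"]})

# Readers must tolerate missing fields, since no schema guarantees them.
hobbies = [student.get("hobby", []) for student in students]
print(hobbies)
```

The flexibility comes at a price: the application, not the database, now has to cope with records that lack a field.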

Caching

RDBMS - Separate Hardware


 Caching in a typical RDBMS requires separate infrastructure.
 Because of this overhead, retrieval involves a small delay.
NoSQL - Integrated:
 NoSQL databases support caching in system memory, which improves
read performance.

What is CAP?
Before we proceed further on the comparison, Let's quickly understand more about
CAP (Consistency, Availability, and Partition Tolerance) from this video.

Implications of CAP
Let's understand the implications of CAP in this video and know why this is very
important while designing a database.

Changing pH measure
Here you will get to know about core principles of DB processing.
RDBMS - ACID
 Atomicity: If any one element of a transaction fails then the entire transaction
fails.
 Consistency: The transaction must adhere to all protocols/rules at all times.
 Isolation: No transaction has access to any other transaction that is in an
intermediate or unfinished state.
 Durability: Once the transaction is complete, it will continue to persist as
complete and cannot be undone.
NoSQL - BASE
 Basically Available: The system guarantees the availability of the data in the sense of the
CAP (Consistency, Availability and Partition Tolerance) Theorem.
 Soft state: The state of the system could change over time, even during
times without input.
 Eventual consistency: The system will eventually become consistent once it
stops receiving input.
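
Eventual consistency can be pictured with a toy, in-memory cluster. This is only a sketch under simplifying assumptions: the `Replica` and `EventuallyConsistentStore` classes are hypothetical, and real systems propagate updates in the background rather than through an explicit `sync()` call.

```python
class Replica:
    """One copy of the data in a toy, in-memory cluster (hypothetical)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet applied to every replica

    def put(self, key, value):
        # The write is acknowledged after reaching one replica only.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def get(self, replica_index, key):
        # A read may hit a replica that has not seen the update yet.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Propagation step: once input stops and this has run,
        # all replicas hold the same state.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("house", "Gryffindor")
stale = store.get(2, "house")  # replica 2 has not been updated yet
store.sync()
fresh = store.get(2, "house")  # replica 2 now agrees with replica 0
```

Between `put` and `sync` the system is in the 'soft state' described above: it stays available for reads, but different replicas may return different answers.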

BASE Transactions
In this video, you will get to know more about BASE.

Byte it!

NoSQL database is not viewed as a replacement to RDBMS but, rather, a
complementary addition to RDBMS and SQL.


NoSQL vs RDBMS Summary
In this section, you learned about the key differences between
NoSQL and traditional RDBMS,
with a detailed comparison covering
 Scaling
 Maintenance
 Data Model
 Caching technique
 ACID vs BASE
and also about CAP and its implications.

Data Replication

Data replication is all about having your data geo-distributed through a non-interactive
and reliable process as a contingency measure to avoid loss of data.
Most NoSQL systems have a data replication feature built in.
Data replication in RDBMS is somewhat difficult, as RDBMS was not designed for horizontal
scaling.
NoSQL data replication is homogeneous, in the sense that data cannot be replicated from a
NoSQL system to an RDBMS SQL system.
Three types of Data Replication include -
 Sharding Replication
 Master-Slave Replication
 Peer-to-Peer Replication.
You will understand more about these in the following set of cards.

Sharding Replication Model


In this video, you will understand more about Sharding replication model.

Master-Slave Replication Model


Know more about the Master-Slave Replication model in this video.

Peer-to-Peer Replication Model


Understand how Peer-to-Peer works in this video.

Data Replication Summary


In this section, we understood about the
 Different Data replication strategies
 How did they work?
 Highlights and where they are powerful
 Limitations
Not all replication methodology would be applicable in all scenarios.
Based on several factors like
 Environment
 Server capacity/Storage constraints
 Application demand
 Performance/throughput and so on
you might have to select the right option.

Flavors

NoSQL databases are classified into four categories:


 Key-Value Stores
 Columnar Stores
 Document Stores
 Graph Stores

NoSQL Categories
In this video, let's understand more about the four different categories at a high level.

Byte it!

NoSQL just means 'Not only SQL', reconfirming that NoSQL works as an alternative
database to RDBMS.

Key-Value Stores
 The simplest NoSQL database of all.
 The data is stored in key-value pairs.
 Lookups by key are fast.
 Data is easy to access via an API, through which the client can
o get the value for a key
o put a value for a key
o delete a key
A few key-value databases - Riak, Redis, Memcached, Berkeley DB, Amazon
DynamoDB (not open-source) and so on.

Key Value Store Example


Let's take an example of key-value pairs.
Key      Value
Harry    "Gryffindor, seeker, 14"
Malfoy   "Slytherin, chaser, 15"
Luna     "Ravenclaw, Singer, 13"
Cedric   "Hufflepuff, '', 17"
Based on the key, the corresponding value can be fetched and then tokenized
by the client.
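
The get/put/delete API can be sketched with a small in-memory class. The `KeyValueStore` name is hypothetical and only mirrors the three operations listed earlier; it is not the client API of any specific product.

```python
class KeyValueStore:
    """Minimal in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


kv = KeyValueStore()
kv.put("Malfoy", "Slytherin, chaser, 15")
kv.put("Luna", "Ravenclaw, Singer,13")

# The store treats the value as opaque; the client splits it into fields.
house, role, age = [part.strip() for part in kv.get("Malfoy").split(",")]
kv.delete("Luna")
```

Because the database never looks inside the value, any structure (comma-separated text here, JSON or XML elsewhere) is a convention between the writers and readers of that key.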

Columnar Stores
 Stores data as sections of data columns (column families).
 Column families are rows with many columns associated.
 Column families are chunks of related data often accessed together.
Popular columnar databases include Cassandra, HBase, Hypertable and Google
Bigtable.

Columnar Store Example

RowId  StuID  Lastname  Firstname  Score
001    10     Weasley   Ron        48000
002    11     Granger   Hermione   58000
003    15     Potter    Harry      42000

A column store serializes the same data column by column, pairing each value with its RowId:

10:001,11:002,15:003;
Ron:001,Hermione:002,Harry:003;
Weasley:001,Granger:002,Potter:003;
48000:001,58000:002,42000:003;
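
The pivot from rows to columns shown above can be reproduced in a few lines. This is a minimal sketch: the `to_columns` helper is a hypothetical name, and a real column store would add compression and on-disk block layout.

```python
rows = [
    {"RowId": "001", "StuID": 10, "Lastname": "Weasley", "Firstname": "Ron", "Score": 48000},
    {"RowId": "002", "StuID": 11, "Lastname": "Granger", "Firstname": "Hermione", "Score": 58000},
    {"RowId": "003", "StuID": 15, "Lastname": "Potter", "Firstname": "Harry", "Score": 42000},
]


def to_columns(rows, key="RowId"):
    """Pivot row-oriented records into one (value, row id) list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            if name != key:
                columns.setdefault(name, []).append((value, row[key]))
    return columns


columns = to_columns(rows)

# Serialize each column in the value:row-id notation used in the example.
serialized = {
    name: ",".join(f"{value}:{row_id}" for value, row_id in pairs) + ";"
    for name, pairs in columns.items()
}
print(serialized["StuID"])

# An aggregate touches a single column instead of every full row.
average_score = sum(value for value, _ in columns["Score"]) / len(rows)
```

Computing `average_score` reads only the Score column, which is the access pattern a columnar layout is meant to favor.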

Document Stores

 Stores and retrieves documents in formats such as XML, JSON, and BSON.
 Documents consist of maps, collections, and scalar values.
 Document stores fall mainly into two categories -
o XML-based databases
o JSON-based databases
A few of the document databases include MongoDB, CouchDB, Terrastore, RavenDB, and
OrientDB.

Document Store Example

A contact stored as a JSON document:

{
  "FirstName": "Harry",
  "LastName": "Potter",
  "phone": {
    "cell": "095 - 000 - 100 - 110",
    "work": "099 - 800 - 100 - 110"
  },
  "Address": "15 Hogwarts School of Wizardry",
  "Hobby": ["Potion", "Quidditch"]
}

A similar contact stored as an XML document:

<contact>
  <firstname>Harry</firstname>
  <lastname>Potter</lastname>
  <phone type="Cell">095 - 000 - 100 - 110</phone>
  <phone type="Work">099 - 800 - 100 - 110</phone>
  <address>
    <RoomNo>15</RoomNo>
    <University>Hogwarts School of Wizardry</University>
  </address>
</contact>
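
To show how an application might read such documents, here is a small sketch using only Python's standard `json` and `xml.etree.ElementTree` modules. The document strings are shortened, well-formed versions of the contact example, and the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

json_doc = """{
  "FirstName": "Harry",
  "LastName": "Potter",
  "phone": {"cell": "095 - 000 - 100 - 110", "work": "099 - 800 - 100 - 110"},
  "Hobby": ["Potion", "Quidditch"]
}"""

xml_doc = """<contact>
  <firstname>Harry</firstname>
  <lastname>Potter</lastname>
  <phone type="Cell">095 - 000 - 100 - 110</phone>
  <phone type="Work">099 - 800 - 100 - 110</phone>
</contact>"""

# JSON maps directly onto dicts, lists, and scalars.
contact = json.loads(json_doc)
cell_from_json = contact["phone"]["cell"]

# XML is parsed into a tree of elements with text and attributes.
root = ET.fromstring(xml_doc)
first_name = root.findtext("firstname")
phones = {phone.get("type"): phone.text for phone in root.findall("phone")}
cell_from_xml = phones["Cell"]
```

Both formats carry nested structure in a single document, which is why one read can return the whole contact.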

Graph Store
 Uses nodes, relationships, and properties to represent data.
 All nodes are connected through relationships.
 A relationship has a direction, a type, a start node, and an end node.
 Uses Graph Theory to store, map and query relationships.
A few popular Graph Databases are Neo4j, AllegroGraph, Oracle Spatial and Graph,
Teradata Aster, ArangoDB, and Graphbase.
You will learn more about the different flavors in detail in the following sections.

Key-Value - Zeroed
 The Key-Value pair design is similar to a simple hash table design. It can be
compared to an RDBMS table with two columns (ID/Name).
 The value could be a blob, text, JSON, XML, and so on.
 Popular DBs include - Riak, Redis, Memcached DB, and Amazon
DynamoDB.
 The 'Key' must be unique and should not be too long.
 Redis supports richer operations on its data structures, such as range, diff, and
intersection.
In this section, you will understand more about Key-value DBs.

Key-Value Data Model


Let's understand more about the Key-Value Data Model.
Key-Value Data Model Example
Here you will know about a Key-Value DB in detail.

Popular Key Value Databases

Few widely used Key-Value databases are highlighted above.


Key Features to be considered while selecting Key Value Store are listed below
 Consistency
 Transactions
 Querying ability
 Scalability
Let's discuss in detail about them in the following set of cards.

Consistency
 Applicable only to operations on a single key, as each operation is a get, put, or delete.
 Although optimistic writes could be performed, they are expensive to
implement.
 In distributed key-value store implementations like Riak, values are
replicated to other nodes.
Buckets act as namespaces for keys, which reduces key collisions. Example - All Student
keys may reside in the Student bucket.
 With Buckets - a 'write' is considered good only when the data is consistent
across all the nodes where the data is stored.
'Buckets' in Key-Value DBs are similar to 'Tables' in RDBMS.

Query Features
 The design of the 'Key' plays a prominent role. Keys can be generated
o using some algorithm
o from user inputs (user-id, name, email-id)
o from timestamps/external data.
 Data is queried by key, which returns the value associated with it.
 Querying based on an attribute of the value is not possible from the DB.
 In some DBs, the value of the key is retrieved using a fetch API. Ex: Riak.

Scaling
 Scalability of a Key-Value database is achieved through sharding.
 In sharding, the value of the key determines on which node the key is stored.
For example, say you are sharding by the first character of the key.
The key k76151487d, which starts with a 'k', will be sent to a different node than
the key dgh396542.
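
Sharding by the first character of the key can be sketched as follows. The node names and the modulo rule are assumptions made for this example; real systems usually hash the whole key.

```python
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def shard_for(key, nodes=NODES):
    """Pick a node from the first character of the key."""
    return nodes[ord(key[0]) % len(nodes)]


print(shard_for("k76151487d"))
print(shard_for("dgh396542"))
```

All keys that start with the same character land on the same node, which keeps routing trivial but also means a popular first character can overload one node.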

Benefits
 Performance increases as more nodes can be added to the cluster.
Impact
 If the node that stores keys starting with 'f' goes down, the data stored on that node becomes
unavailable, and new data cannot be written with keys that start with 'f'.
How to overcome this issue?

Scaling
Riak lets you tune the CAP trade-off through three settings:
 N - # of nodes to store the key-value replicas.
 R - # of nodes to fetch data from.
 W - # of nodes to write data to.
For example, consider a 5-node Riak cluster configured with
N = 3 => each piece of data is replicated to 3 nodes.
R = 2 => any 2 nodes must respond to a GET request for it to be considered successful.
W = 2 => a PUT request must be written to 2 nodes before the write is considered successful.

 Best practice is to choose a W value that matches your consistency needs
during bucket creation.
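
The effect of a given N/R/W choice can be checked with a little arithmetic. The sketch below relies on the general quorum rule that a read and a write must overlap on at least one replica when R + W > N; that rule is a standard property of quorum systems rather than something taken from this course, and `quorum_properties` is a hypothetical helper, not a Riak API.

```python
def quorum_properties(n, r, w):
    """Summarise what an N/R/W choice implies for a replicated key."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return {
        # A read set and a write set of these sizes must share a replica.
        "reads_see_latest_write": r + w > n,
        # Replicas that may be down while the operation still succeeds.
        "read_failures_tolerated": n - r,
        "write_failures_tolerated": n - w,
    }


settings = quorum_properties(n=3, r=2, w=2)
print(settings)
```

With N = 3, R = 2, W = 2 every read overlaps the latest successful write, and one replica may be unavailable without failing reads or writes.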

Key-Value Data Stores in Depth


Now it is time to dive little more into Key-Value DB.

Usecase - Storing Session Details


 Every web session is assigned a unique session-id value, which
applications store on disk (logfile) or in a DB (RDBMS).
 Moving this to a key-value DB improves performance to a great extent, as all
info about the session can be
o Stored by a PUT request
o Retrieved using a GET request
 The operation is very fast, as the session info is stored in a single object.
Usage:
 Memcached for caching web applications and microapps,
 Riak when availability is an important criterion.
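
A session store built on a key-value database might look like the sketch below. A Python dict stands in for the database, and `create_session` / `load_session` are hypothetical helper names.

```python
import json
import uuid

sessions = {}  # stand-in for a key-value database (hypothetical)


def create_session(user, cart):
    """Store the whole session under a fresh session id (one PUT)."""
    session_id = str(uuid.uuid4())
    sessions[session_id] = json.dumps({"user": user, "cart": cart})
    return session_id


def load_session(session_id):
    """Fetch everything known about the session (one GET)."""
    raw = sessions.get(session_id)
    return json.loads(raw) if raw is not None else None


sid = create_session("harry", ["wand", "broom"])
session = load_session(sid)
```

Since the session is a single serialized object, no joins or multi-row reads are needed to restore it.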

Usecase - Storing Profile/Preferences


Key-Value would be best-fit to store user profile
 userId
 username and
 additional attributes
and user preferences
 language
 country
 timezone and
 user favorites and so on
All this information could be stored in a single object, so getting the preferences of a user
takes just a single GET operation.
On similar lines, product profiles could be stored as well.

Usecase - Shopping Cart Details


 All e-commerce websites have shopping carts deeply linked with the user.
 The shopping cart details should be available at all times, across different
browsers, devices, machines, and sessions.
 Key-Value would be the best fit for this scenario, with all shopping-related
information put into the 'value' where the 'key' is the userid.
Usage
 Amazon uses its DynamoDB for storing its users' shopping cart details.

Limitations of Key-Value Store

Key-value databases would not be the best fit in the scenarios highlighted below.
 Relationships among Multiple-Data - When relationships exist between
different sets of data, or data must be correlated across different sets of keys.
 Multi-operation Transactions -
When you are storing many keys, one of them fails to save, and
you want to roll back the rest of the operations.
 Query Data by 'value' - When you need to search the 'keys' based on some info found in the
'value' part of the key-value pairs. Some exceptions include Riak Search or
indexing engines such as Lucene or Solr.
 Operation by groups - As operations are confined to one key at a time, there
exists no way to operate on several keys simultaneously.
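
The 'query by value' limitation can be made concrete with a small sketch. The `store` dict and the `find_keys_by_value` helper are hypothetical; the point is that, without help from the database, the client has to scan every value or maintain its own index.

```python
store = {
    "Harry": {"house": "Gryffindor", "age": 14},
    "Malfoy": {"house": "Slytherin", "age": 15},
    "Luna": {"house": "Ravenclaw", "age": 13},
    "Ron": {"house": "Gryffindor", "age": 14},
}


def find_keys_by_value(store, field, wanted):
    """Full scan: every value must be read and inspected by the client."""
    return [key for key, value in store.items() if value.get(field) == wanted]


gryffindors = find_keys_by_value(store, "house", "Gryffindor")

# A client-maintained secondary index avoids the scan,
# at the cost of extra writes whenever a value changes.
house_index = {}
for key, value in store.items():
    house_index.setdefault(value["house"], []).append(key)
```

External indexing engines play the role of `house_index` at scale, which is why they appear as exceptions in the list above.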

Columnar - Zeroed
The columnar model allows efficient data storage.
 Data is stored on a column-family basis.
 Columns can store any data type, as long as it can be represented as an array of bytes.
 Avoids storing null values.
 Each unit of data is considered a set of key/value pairs, while the unit itself is
identified by a primary identifier (primary key).
 The column-families are not physically isolated for a given row.
 All data pertaining to a row-key is stored together.

Column Data Model


Let's understand about the Columnar Data model in this video.

Popular Columnar Databases

Few of the popular column-family stores are mentioned above.

Column Data Model Example


In this video, let's understand more about the Columnar with an example.

Column Data Stores in Depth


In this video, we will learn more in-depth about Column Database.

What is Column Family?

RDBMS demands that a 'table' be defined upfront with its columns. All attributes of an entity are
stored in table columns.
 The column-oriented database does not require upfront schema definition and
can include newer columns as the data evolves.
 It is the column-family, not the individual column, that is predefined; a column-family is simply a set of
columns grouped as a bundle.
 There exists a logical relation between the columns in a column family.
 In general, Column-family members are physically stored together.
Good Practice
 Form a column family by grouping together columns with similar characteristics.

Column Family Data Storage

Column-family in a columnar database is analogous to 'column' in an RDBMS. While
columns have a restriction on the type of data they store, column-families have no such
constraints.
 A columnar database stores data values only in those columns with valid values. Nulls are
ignored.
 Column-family acts as a storage container for sparse and malleable datasets of
continuously evolving data.
 The row-key uniquely identifies a row in a column database.
 Data is actually stored by column-families, not in tables.
 Column databases can scale to accommodate more rows and columns; a single table
often spans multiple machines.
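
One way to picture column-family storage is as nested maps: row-key, then column-family, then column. The sketch below assumes that layout; the `table` structure and `put` helper are illustrative and not the storage format of any specific product.

```python
from collections import defaultdict

# row-key -> column-family -> column -> value (hypothetical layout)
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store a cell; None values are skipped, so nulls take no space."""
    if value is None:
        return
    table[row_key][family][column] = value


put("001", "name", "first", "Ron")
put("001", "marks", "score", 48000)
put("002", "name", "first", "Hermione")
put("002", "marks", "score", None)     # not stored at all
put("002", "name", "middle", "Jean")   # new column, no schema change

stored_families = {row: sorted(families) for row, families in table.items()}
```

Row 002 ends up with no `marks` family and an extra `middle` column, which is the sparse, evolving shape described above.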

Columnar Architecture

The key drawback of columnar architecture is the insert and update overhead for single
rows.
 Columnar databases implement a form of write-optimized store (delta
store) to handle a constant trickle feed of changes.
 The delta store is memory-resident and can accept high-frequency data
modifications in an uncompressed manner.
 Data in the delta store gets merged with the main column-oriented store
periodically or when a threshold is crossed.
 Modus operandi:
o Large-scale bulk loads are directed to the column store.
o Incremental inserts/updates flow to the delta store.
o Queries read from both stores to fetch complete results.
o A process moves data from the delta store to the column store
periodically.
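
The delta-store flow described above can be sketched with a toy class. `ColumnTable` and its threshold-based merge are assumptions for illustration; the sketch also assumes every row carries the same columns and ignores compression and concurrency.

```python
class ColumnTable:
    """Toy column store with a write-optimized delta store (illustrative)."""

    def __init__(self, merge_threshold=3):
        self.column_store = {}  # column name -> list of values (bulk data)
        self.delta_store = []   # recent rows, kept uncompressed row by row
        self.merge_threshold = merge_threshold

    def bulk_load(self, rows):
        # Large loads go straight to the column store.
        for row in rows:
            self._append_to_columns(row)

    def insert(self, row):
        # Single-row writes land in the delta store first.
        self.delta_store.append(row)
        if len(self.delta_store) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Move accumulated rows into the column store.
        for row in self.delta_store:
            self._append_to_columns(row)
        self.delta_store.clear()

    def _append_to_columns(self, row):
        for name, value in row.items():
            self.column_store.setdefault(name, []).append(value)

    def column(self, name):
        # Queries combine both stores to return complete results.
        merged = list(self.column_store.get(name, []))
        merged += [row[name] for row in self.delta_store if name in row]
        return merged


t = ColumnTable(merge_threshold=3)
t.bulk_load([{"score": 48000}, {"score": 58000}])
t.insert({"score": 42000})
scores = t.column("score")
pending = len(t.delta_store)
```

After the single insert, the query still sees all three scores even though one of them has not yet been merged into the column store.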

Data Storage - How it Works?

The main idea of the columnar concept is that the data for each column is grouped together on
disk.
 Values for a specific column become co-located in the same disk blocks.
 Aggregation of the values of specific columns is optimized because all values
to be aggregated exist within the same disk blocks.
 Exact IO and CPU optimizations depend on
o Workload
o Indexing and
o Schema design.

Apache HBase

 Apache HBase is a columnar database that runs on a Hadoop cluster.
 It does not have a rigid schema like RDBMS.
 Stores unstructured or semi-structured data.
 Acts as a distributed database; sharding distributes the data
across multiple servers.
 Provides low-latency, row-level random access to data.
Where it fails?
With very high data volume (TBs), the performance is not up to the mark.

Cassandra
 Developed by Facebook initially for its inbox search feature and later handed
over to Apache.
 Offers high scalability and availability, and avoids the single point of failure
problem.
 Offers fast writes without compromising read efficiency.
 Key variables that determine the outcome:
o N - # of copies of each data item.
o W - # of copies of the data item that must be written.
o R - # of copies consulted while reading the data item.
 Replication Factor: Determines the # of data copies maintained across multiple
nodes.
 Read/Write Consistency - Key configuration parameter for each read/write
operation, stating how many replica nodes must respond:
o ALL: all replica nodes.
o ONE|TWO|THREE: the specified number of replica nodes.
o QUORUM: a majority of the replica nodes.
o EACH_QUORUM: a majority of the replica nodes in each data center.
o LOCAL_QUORUM: a majority of the replica nodes in the current data center only.
o ANY: any node (writes only).
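
How many replicas each consistency level needs can be worked out from the replication factor. The sketch below assumes a single data center and computes a quorum as a majority, i.e. floor(RF / 2) + 1; `replicas_required` is a hypothetical helper, not part of any Cassandra driver. ANY and EACH_QUORUM are left out because they depend on hinted writes and on the per-data-center layout.

```python
def replicas_required(level, replication_factor):
    """Replica acknowledgements needed within one data center (sketch)."""
    quorum = replication_factor // 2 + 1
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum,
        "LOCAL_QUORUM": quorum,
        "ALL": replication_factor,
    }
    return required[level]


rf = 3
write_acks = replicas_required("QUORUM", rf)
read_acks = replicas_required("QUORUM", rf)

# Reads overlap the latest write when R + W > N.
strongly_consistent = read_acks + write_acks > rf
```

With a replication factor of 3, QUORUM reads and writes each need 2 replicas, so every read overlaps the most recent successful write.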

Apache Kudu
 Open source storage engine
 Designed to support Hadoop ecosystem tools (Cloudera Impala, Apache Spark,
and MapReduce).
 Distributes data using horizontal partitioning.
 Supports low-latency random access and efficient analytical access patterns.
 Offers API for row-level inserts/updates/deletes.

Columnar in Other DBs


There exist variations on the implementation of columnar paradigm within both
traditional relational systems and other NoSQL systems.
 SAP HANA (in-memory DB) provides support for column/row orientation on a
table-by-table basis.
 Oracle 12c “Database In-Memory” incorporates a column store.
 Oracle Exadata leverages Exadata Hybrid Columnar Compression
(EHCC) to achieve a best-of-both-worlds combination of row and column
storage technologies.
o Rows are stored within compression units of 1 MB, reducing the overhead
of performing row-level modifications.
o Columns are stored together within smaller 8K blocks, yielding high levels of
compression.

Document Database
 Document databases store structured documents, typically in XML or
JSON, i.e., sets of key/value pairs.
 Documents are treated as a whole, and splitting a document into its
constituent name/value pairs is avoided.
 Puts together a diverse set of documents into a single collection.
 Allows indexing of documents based on their primary identifier and properties.
 Stores documents or spreadsheets as well.
 Some implement ACID transactions and adopt other RDBMS characteristics.
 Supports Query transactions (to an extent).

Document-Based Data Model


Let's understand about the Document DB.

Popular Document DBs


Few of the popular Document DBs are highlighted above.

Document-Based Data Model Example


Let's look at an example to understand more about the Document DB.
You will get to know more about XML/JSON format of Document DBs in the following
sections.

Document Data Stores in Detail


Let's dive little more deeply to understand the Document DB concepts.

XML Databases
 XML documents formed the basis of the first document databases.
 XML is capable of representing almost any form of information.
XML has a variety of standards and tools to assist with authoring, validation,
searching, and transforming XML documents.
Let's understand the different tools and their usage.
XML Tools and Standards - Snapshot:
 XPath: Syntax for retrieving specific elements from an XML document.
 XQuery: Query language for interrogating XML documents, also known as “the SQL
of XML”.
 XML Schema: Document template that defines which elements may be
present in a specified class of XML documents, used to validate document
correctness.
 XSLT: Language to transform XML documents into other formats, including non-XML
formats such as HTML.
 Document Object Model (DOM): Object-oriented API to interact with XML,
XHTML, and similarly structured documents.
An XML database
 Provides a platform that incorporates various XML standards like XQuery and
XSLT.
 Provides services for the storage, indexing, security, and concurrent access to
XML files.
Famous XML databases
 eXist (open-source)
 MarkLogic (commercial)
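
Two of the tools listed above, XPath and DOM, can be tried with Python's standard library alone. Note the assumptions: `xml.etree.ElementTree` implements only a subset of XPath, and the `contacts` document below is made up for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

xml_doc = """<contacts>
  <contact house="Gryffindor"><firstname>Harry</firstname></contact>
  <contact house="Ravenclaw"><firstname>Luna</firstname></contact>
</contacts>"""

# XPath-style selection: contacts whose 'house' attribute matches.
root = ET.fromstring(xml_doc)
gryffindors = [
    contact.findtext("firstname")
    for contact in root.findall("./contact[@house='Gryffindor']")
]

# DOM: the same document exposed as a tree of node objects.
dom = minidom.parseString(xml_doc)
names = [node.firstChild.data for node in dom.getElementsByTagName("firstname")]
```

XQuery and XSLT need a dedicated engine and are therefore not shown here.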

JSON Databases
JavaScript pioneer Douglas Crockford created JSON while attempting to build a framework for more
dynamic and interactive web applications.
 JSON is a lightweight substitute for XML.
 A JSON document database expects the data to be stored in JSON format.
o The document is the base unit of storage and resembles a row in an RDBMS.
o It contains one or more key-value pairs, nested documents, and arrays.
o Arrays may hold complex hierarchical structures.
 A collection (data bucket) is a group of documents sharing some common
objective (resembles a table in an RDBMS).
 Although preferred, documents in a collection need not be of the same type.
In the example - “players” are nested as an array within documents. This pattern is
known as 'document embedding'.

JSON Databases
Where it scores?
 The embedding design pattern allows retrieving info in a single operation.
 Avoids performing joins within the application.
Issues
 Base info is duplicated across multiple documents.
 This complicates the design and can result in inconsistency.
Solution
 Link multiple documents using document identifiers (resembles a foreign key in
RDBMS).
 This provides a balance between performance and maintainability.
JSON databases
MongoDB, CouchDB, OrientDB, and DocumentDB
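
The difference between embedding and linking can be sketched with plain Python dicts. The team and player documents, the `_id` values, and the `resolve_players` helper are all made up for this example.

```python
# Embedding: players live inside the team document; one read returns all.
team_embedded = {
    "_id": "team:gryffindor",
    "name": "Gryffindor",
    "players": [
        {"name": "Harry", "position": "Seeker"},
        {"name": "Ron", "position": "Keeper"},
    ],
}

# Linking: the team stores identifiers; players are separate documents.
players = {
    "player:harry": {"name": "Harry", "position": "Seeker"},
    "player:ron": {"name": "Ron", "position": "Keeper"},
}
team_linked = {
    "_id": "team:gryffindor",
    "name": "Gryffindor",
    "player_ids": ["player:harry", "player:ron"],
}


def resolve_players(team, players):
    """Follow identifiers, much like a foreign-key lookup done by the application."""
    return [players[player_id] for player_id in team["player_ids"]]


embedded_names = [player["name"] for player in team_embedded["players"]]
linked_names = [player["name"] for player in resolve_players(team_linked, players)]
```

Embedding answers the query in one read; linking needs an extra lookup per player but keeps each player's data in exactly one place.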

Data Modelling in Document


 Less deterministic compared to RDBMS.
 Driven by nature of the queries to be executed, while in RDBMS it is driven
by the kind of data to be stored.

MongoDB

MongoDB offers a competitive edge in the NoSQL space by providing a developer-friendly
ecosystem and architecture.
Acts as a good alternative to MySQL/Oracle in the NoSQL arena.
 JSON-oriented document database
 Leverages BSON (binary-encoded variant of JSON)
o Lower parse overhead than JSON.
o Enhanced support for additional data types like dates and binary data.
 Comes with querying capability (JavaScript-based).
 Needs to improve on scalability and throughput capabilities.
 Sharding happens through range or hash.
 Achieves consistency for individual documents through locks.
Couchbase

Formed with the merger of Membase and CouchOne (the company behind CouchDB).


 Open source and distributed DB
 Possesses a flexible data model with dynamic schemas
 Leverages N1QL - an expressive, SQL-like language for manipulating
and transforming JSON data.
 Achieves sub-millisecond latencies at scale.
 Comes with reliability, high availability, and simple administration capability.

Graph Datastores - Zeroed


Graph Store is an expressive structure made up of a collection
of Nodes and the Relationships interlinking them.
 Nodes - representation of entities
 Relationships - how entities relate to the world.
Graph Store is used to model all kinds of scenarios such as
 Construction of a space rocket
 Transportation system (roads and trains)
 Supply-chain and Logistics
 Medical history
 Fraud Detection
 Network and IT Operations
Graph - Types
At a very high level, Graph stores can be categorized into two kinds, although the
underlying principles remain the same.
1) Graph Database - (Real-time)
 Performs transactional online graph persistence in real-time.
 Similar to online transactional processing (OLTP) databases in the RDBMS
area.
2) Graph Compute Engine - (Batch Mode)
 Performs offline graph analytics in batch as a series of steps.
 Similar to online analytical processing (OLAP) for analysis of data in bulk,
such as data mining.
You will learn about both in detail in this section.

Graph-Based Data Model


In this video, you will understand about Graph DB.

Popular Graph Stores

Few of the popular Graph Stores are highlighted above.

Graph-Based Data Model Example


Let's analyze Graph DB further with this example.

Graph Data Stores in Depth


Let's dive deeper to know more about Graph DB through this video.

Understanding Graph Theory

According to Graph theory, the major constituents of a graph include -
 Vertices or Nodes representing distinct objects.
 Edges (also called Relationships or arcs) establishing connectivity among these
objects.
 Both Nodes and Relationships carry some properties.
o Properties of Nodes are similar to those of a relational table row or JSON
document.
o Properties of a Relationship describe the type, strength, or history of the
relationship.
Graph theory provides mathematical notation for
 Adding/removing nodes or relationships from a graph
 Performing operations to trace adjacent nodes.
These operations assist with traversal, i.e., walking through the graph to explore the network.
Core Rule: 'No broken links'.
 A relationship should always have a start and end node.
 Deletion of a node is not possible without deleting its associated relationships.
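
The 'no broken links' rule can be enforced in a few lines. The `Graph` class below is a minimal, hypothetical in-memory sketch, not the API of any graph database.

```python
class Graph:
    """Tiny in-memory graph that enforces the 'no broken links' rule."""

    def __init__(self):
        self.nodes = {}          # node id -> properties
        self.relationships = []  # (start, type, end, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_relationship(self, start, rel_type, end, **properties):
        # A relationship must always have an existing start and end node.
        if start not in self.nodes or end not in self.nodes:
            raise ValueError("both start and end node must exist")
        self.relationships.append((start, rel_type, end, properties))

    def delete_node(self, node_id):
        # Refuse to leave dangling relationships behind.
        if any(node_id in (s, e) for s, _, e, _ in self.relationships):
            raise ValueError("delete the node's relationships first")
        del self.nodes[node_id]

    def neighbours(self, node_id):
        # One traversal step: adjacent nodes, ignoring direction.
        outgoing = [e for s, _, e, _ in self.relationships if s == node_id]
        incoming = [s for s, _, e, _ in self.relationships if e == node_id]
        return outgoing + incoming


g = Graph()
g.add_node("Harry", house="Gryffindor")
g.add_node("Ron", house="Gryffindor")
g.add_relationship("Harry", "FRIEND_OF", "Ron", since=1991)

try:
    g.delete_node("Harry")
    blocked = False
except ValueError:
    blocked = True  # the node still has a relationship attached
```

Deleting "Harry" is rejected until the FRIEND_OF relationship is removed, so no relationship can ever point at a missing node.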
Property Graph Model

Property Graph Model is similar to an object model or an entity-relationship diagram.


Nodes
 Nodes (entities) could have multiple attributes (as key-value pairs).
 Nodes are tagged with labels which are tied to different roles.
 Labels can also attach metadata, such as an index or constraint, to Nodes.
Relationships
 Relationships, along with a start and end node, have a direction, a type,
and quantitative properties like weights, costs, distances, ratings, time
intervals, or strengths.
 Two nodes could share multiple relationships without sacrificing
performance.
 Relationships can be navigated regardless of direction.
Graph vs RDBMS
Graph constructs can be represented in a relational model.
However, two main challenges have to be addressed:
 Using SQL syntax to traverse a graph of unknown depth is
not easy. Ex: With SQL, finding the friends of your friends is easy, but
it is hard to address the “degrees of separation” problem (i.e., the number of
connections that separate one person from another).
 Performance degrades while traversing the graph, because query
response time increases at each level. Ex: Each level adds to the number of joins
required in SQL, and the query cannot be generalized because the depth is
not known in advance.
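For comparison, the “degrees of separation” problem is a short breadth-first traversal once the data is held as a graph. The sketch below assumes a plain adjacency dictionary (the friends mapping and the names in it are made up for illustration) and is not tied to any specific database.

```python
from collections import deque


def degrees_of_separation(friends, source, target):
    """Number of hops between two people, or None if they are not connected."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        person, depth = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == target:
                return depth + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return None


# Hypothetical friendship data: each person maps to a list of direct friends.
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}
```

The traversal depth does not have to be known in advance, which is exactly the part that is awkward to express as a fixed number of SQL joins.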

Storage vs Processing - Analysis


Two key parameters should be considered while evaluating a Graph DB.
1. Underlying Storage
 Some graph databases use native graph storage that is designed and
optimized for storing and managing graphs.
 Other graph databases serialize the graph data into a relational or object-
oriented database or some other general-purpose data store.
2. Processing Engine
 Some graph databases use native graph processing based on index-free adjacency,
meaning that connected nodes physically “point” to each other in the database,
which offers better traversal performance.
 More broadly, any database that exposes a graph data model through CRUD operations qualifies as a graph
database.
A comparison of different databases on the above criteria is shown in the picture above.

Graph Compute Engine


 Enables execution of global graph computational algorithms on large
datasets.
 Scans and processes large amounts of data in batches in an optimized manner.
Architecture Design
 Application query requests and responses are processed at runtime
by a system of record (SOR) database with OLTP properties (like Neo4j).
 Data is moved from the system of record database into the graph compute engine
through an ETL job for offline querying and analysis.
Usage
 Identify clusters in the data.
 Answer questions such as 'How many relationships does everyone have in a social
network?'
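The two usage examples above are global computations over the whole graph. A minimal single-machine sketch, assuming the graph has already been exported as a list of undirected edges (the edges list and function names are illustrative only), could look like this. A real graph compute engine would distribute the same kind of work across a cluster.

```python
from collections import Counter


def relationship_counts(edges):
    """How many relationships each node has (its degree)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def clusters(edges):
    """Group nodes into connected clusters using a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical batch of relationships extracted from the system of record.
edges = [("ann", "bob"), ("bob", "cat"), ("dan", "eve")]
```

Both functions scan every edge once, which is the batch-style access pattern described above rather than a per-request OLTP lookup.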

Popular Graph Computing Engines


 Apache Giraph: leverages MapReduce on Hadoop data.
 GraphX: part of the Berkeley Data Analytics Stack (BDAS); leverages Spark.
 Titan: a graph database that can be overlaid on top of Big Data storage
engines, including HBase and Cassandra.

Scenario for Choosing a Particular DB


Having gone through the different flavors and kinds of NoSQL, the key points
you should understand by now are:
 Not all NoSQL databases are alike.
 They are not all made to solve the same problems.
Understanding which NoSQL database is appropriate for a given scenario and
context is very important.

Parameters to Focus
Key parameters to be taken into consideration while weighing NoSQL databases
against each other include:
 Database features
 Performance
 Context-based criteria
NoSQL databases come in different shapes, sizes, and forms.
 Feature-based comparison is the best way to group them logically.

Polyglot Persistence
In this video, let's understand more about Polyglot Persistence.

Feature Comparison
Let's quickly compare and contrast the different NoSQL choices on the
basis of the following features:
 Scalability
 Transactional integrity and consistency
 Data modeling
 Query support
 Access and interface availability

Scalability
 Not all NoSQL databases offer horizontal scalability to the same degree.
 HBase and Hypertable have an advantage, while Redis, MongoDB, and
Couchbase Server lag behind.
 The difference becomes more pronounced as the data size grows beyond a few
petabytes.

Transactional Integrity and Consistency


Transactional integrity is
 Applicable only when data is created, updated, modified, or deleted.
 Not relevant in pure data warehousing and mining contexts where data is
written once and read multiple times. Ex: web traffic logs, social
networking status updates, stock market tick data, and game scores.
An RDBMS is the best fit if updates are common and a range of operations requires the integrity
of updates.
 Column-family databases (HBase and Hypertable) and document databases
(MongoDB) are well suited if atomicity at the level of an individual item is sufficient.

Data Modeling
RDBMS offers a consistent and organized way of modeling data with a standardized
implementation.
 The NoSQL world does not offer a standardized, well-defined
data model, because NoSQL databases do not all solve the same problem or share
the same architecture.
 MongoDB (Document DB) has gradually adopted a few RDBMS concepts, like
o SQL-like querying
o Rudimentary relational references
o Database objects (inspired by the standard table and column-based
model)

Querying Support
 Querying data from a database easily and effectively is
an interesting problem to solve.
With standardized syntax and semantics, RDBMS thrives on SQL support for easy
access to data.
Among NoSQL databases:
 MongoDB and CouchDB (Document DB) come with querying capabilities that
are as powerful as those of an RDBMS.
 Redis (Key-Value DB) supports querying only the data structures it stores.
 Among columnar databases, HBase has limited querying capabilities.

Access and Interface Availability


 MongoDB dominates this space, with drivers available for mainstream
libraries for interfacing and interacting.
 CouchDB also has a few drivers available, as well as a RESTful HTTP interface.
 Language bindings for most mainstream languages are available
for several others, such as Redis, Membase, Riak, HBase, Hypertable, Cassandra, and
Voldemort.
It is very important to understand the performance characteristics of the various
serialization formats, as they form the basis for the wrappers.

Why Benchmark?
Benchmarking is required to compare the different NoSQL products and to gain deeper
insight into how they stack up.
 Yahoo! Cloud Serving Benchmark (YCSB) is one of the best-known
benchmarking frameworks for comparing NoSQL products.
Other known benchmarks from various product vendors include
 Tokyo Cabinet Benchmarks
 How fast is Redis - from Redis
 Riak benchmark - from Riak
 VoltDB - Key/value benchmarking
 Sort benchmark.

YCSB - Deeper Look


Yahoo! runs a number of tests on popular NoSQL products as part of the benchmark in a
tiered manner (measuring latency and throughput at each tier).
 Tier 1 -> Performance - The hardware is kept constant
and the workload is increased until the hardware is saturated.
 Tier 2 -> Scalability - Hardware is added as the workload increases, to measure
latency as workload and hardware are scaled up proportionally.
Sample study on columnar databases
 50/50 Read and Update:

- Regarded as an update-heavy test case.

- Apache Cassandra excelled on both read and update latencies.

 95/5 Read and Update:

- Regarded as a read-heavy case.

- HBase delivered consistent performance for reads.
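To make the workload mixes concrete, here is a toy harness that runs a 50/50 and a 95/5 read/update mix and records latency and throughput. It is only a sketch under stated assumptions: it is not YCSB itself, the function run_workload is hypothetical, and a plain Python dict stands in for the database under test.

```python
import random
import time


def run_workload(store, operations, read_fraction, seed=0):
    """Run a mixed read/update workload and report simple metrics."""
    rng = random.Random(seed)
    keys = list(store)
    reads = updates = 0
    latencies = []
    started = time.perf_counter()
    for _ in range(operations):
        key = rng.choice(keys)
        t0 = time.perf_counter()
        if rng.random() < read_fraction:
            _ = store[key]              # read operation
            reads += 1
        else:
            store[key] = rng.random()   # update operation
            updates += 1
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    return {
        "reads": reads,
        "updates": updates,
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_ops_s": operations / elapsed if elapsed > 0 else float("inf"),
    }


# A dict stands in for the data store; real runs would call a database client.
store = {f"user{i}": 0.0 for i in range(1000)}
update_heavy = run_workload(store, 10_000, read_fraction=0.5)   # 50/50 mix
read_heavy = run_workload(store, 10_000, read_fraction=0.95)    # 95/5 mix
```

The absolute numbers from such a harness say nothing about any real product; the point is only how a read/update ratio, latency, and throughput relate to each other in a benchmark run.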


NoSQL - Contextual Comparison


Understanding the context in which each NoSQL DB was created and evolved
also plays a significant role in selecting the right one.
Every NoSQL DB has its own -
 History
 Motivation/Purpose
 Use case
 Unique value proposition
Aligning with these viewpoints helps in choosing the right NoSQL DB to meet the
requirement at hand.

Formation
The success of Google (Bigtable) and Amazon (Dynamo) triggered the formation
of HBase, Hypertable, Cassandra, and Riak.
Know more about the formation of some of the DBs listed below.
 CouchDB
 MongoDB
 Redis

NoSQL Course Summary


Hope you enjoyed taking this course!
In this course, you have learned the basics of NoSQL:
 How databases evolved
 The rise of NoSQL
 The differences between RDBMS and NoSQL
 How data replication happens in NoSQL
 The different types of NoSQL databases
 How to select the right NoSQL database
Key NoSQL databases are covered in detail in separate courses.
Kindly go through them as well.
Questions
Distributed Database solutions can be implemented by __________.
All the options

NoSQL can handle __________.


Unstructured and Semi-structured data

__________ is an Object Oriented Database.


NoSQL

Hash Table Design is similar to __________.


Key Value datastore

Terrastore is an example of __________.


Document datastore

In a Key-Value datastore, both keys and values need to be unique.


False TRUE

Key-value databases would not be the best fit if there is/are __________


All the options

An example of Key-Value datastore is __________.


DynamoDB

In MongoDB, data is represented as a collection of __________


None of the options

Riak DB leverages the CAP Theorem to improve its scalability.


True

NoSQL data replication is __________.


Homogenous

In a columnar Database, __________ uniquely identifies a record.


Row-Key

Which among the following is the correct API call in Key-Value datastore?


put(key,value)

The type of Graph Store that works in real-time is __________.


Graph Database

A Graph Store similar to OLAP in RDBMS is __________.


Graph Compute Engine
Which of the following has properties attached to it in the Graph
datastore?
Nodes and Relationships

Which of the following factors influence(s) the choice of replication model?


All the options

Which among the following is used by Amazon to store the user's shopping cart details?


DynamoDB

The key parameter(s) to be taken into consideration while weighing NoSQL databases against each other is/are __________


All the options

Which type of Key-Value datastore DB has its key and value sorted?
Ordered Key-Value datastore

Which Replication model supports database read and write operations in all the nodes?


All the options

In the Master-Slave Replication model, the node which pushes all the
updates in data to subordinate nodes is __________.
All the options

The Specialized Query Language(s) used in Graph datastore is/are __________.


Cypher

The most popular Navigational DBMS system is/are __________.


Integrated Management System and Integrated Database System

In RDBMS, the attributes of an entity are stored in __________.


None of the options

__________ refers to an individual's knowledge of multiple languages.


Polyglot

Cassandra was developed by __________


Facebook

JSON is a lightweight substitute for XML.


True
The famous XML Database(s) is/are _________.
eXist and MarkLogic

The Property Graph Model is similar to __________.


Entity Relationship Diagram

In Riak Key Value datastore, the variable 'W' indicates __________.


Number of Write Operations

__________ are chunks of related data often accessed together.


Column Families

An XML document which satisfies the rules specified by W3C is __________.


Well Formed XML

The scalability of the Key-Value database is achieved through Sharding.


True

Columnar datastore avoids storing null values.


True

Which type of database requires a trained workforce for the management of data?


RDBMS

Which type of scaling handles voluminous data by adding servers to the clusters?


Horizontal

In a Column Data Model, the number of columns that a row can have __________.


Varies

Limitations of RDBMS are ______________. -- scalability/design complexity

NoSQL databases are designed to expand _________. -- horizontally

________ distributes different data across multiple servers. -- sharding

Horizontal scaling approach tends to be cheaper as the number of operations and the size
of the data increases. -- true

Full-form of 'CRUD' is _________. -- create read update delete

A Key-value store does not support Secondary Indexes. -- false

The RDBMS 'table' equivalent terminology in Riak is ________. -- bucket

Key-value pair data storages include all except ________. -- Network attached storage

Cassandra has properties of both __________ and ____________. -- Google Bigtable/Amazon Dynamo

A Riak convergent replicated data type (CRDT) includes ________. -- Maps/sets/counters

Cassandra allows to define composite Primary Keys. -- true

Pre-join projection is equivalent to ________ as in traditional relational systems. -- materialized view

HBase tables are divided _________ by row key range into ________. -- horizontally/regions

A column-database is used to store __________ versions of each cell. -- multiple

Columnar databases are preferable for OLTP systems. -- false

In a columnar database, the columns are stored together on disk, achieving a higher
compression ratio is an expensive operation. -- false

HBase main server components include all except _________. -- HBase MemStore

The column store has to perform _____ IO to insert a new value. -- As many disk blocks

In HBase, 'Columns' are named and specified in table definition. -- false

The row store needs to perform _____ IO to insert a new value. -- single

In a column-database, a row is uniquely identified by __________. -- row key

An RDBMS equivalent component for a "collection" in a Document database: -- Table

JSON documents are built up of _________. -- all the options

In MongoDB, there is a feature similar to the 'like' expression in RDBMS. -- False

______ is a syntax for retrieving specific elements from an XML document. -- XPath

An RDBMS equivalent component for a "document" in a Document database: -- row

An RDBMS equivalent component for a "document identifier" in a Document database: -- foreign key

MongoDB read/write performance can be tuned with the help of Stored Procedures. -- false

Document databases split a document into its constituent name/value pairs for indexing
purpose. -- false

The MATCH clause is roughly equivalent to the _______ clause in SQL and the RETURN
clause to a ______ clause. -- where/select

Cypher query language is associated with __________. -- Neo4j

The major components of a Graph include all except _______. -- JSON

Graph databases are generally built for use with ________. -- OLTP

Only Nodes have properties in Graph database. -- False


Neo4j architecture is a self-driven and independent architecture because of
________________. -- Both the options

---------------

Sorted Column store would provide higher compression ratio by representing each column
as ________ compared to the preceding one. -- delta

HBase data block metadata is maintained by __________. -- NameNodes

Riak demonstrates dual nature of _____________ key/value store and a document
database.

Some of the common Write Consistency levels in Cassandra include all except
___________. -- QUORUM?

In Riak, ________ consistency model is implemented. -- eventual

WiredTiger storage engine is a part of ___________. -- MongoDB

_________ are replicated to allow failover in MongoDB. -- shards

Kudu can be accessed via all except _________. -- Hive
