Shri Dharmasthala Manjunatheshwara College of Engineering and Technology,
Dharwad-580002
Department of Information Science and Engineering
Big Data Analytics
(18UISC700)
Course Instructor:
Dr. Rajashekarappa
Chapter – 4
NoSQL
Source: Arshdeep Bahga and Vijay Madisetti
NoSQL Databases
• Non-relational databases ("NoSQL databases") are becoming popular with the
increasing use of cloud computing services.
• Non-relational databases have better horizontal scaling capability and improved
performance for big data at the cost of having less rigorous consistency models.
• NoSQL databases are popular for applications in which the scale of data involved
is massive, the data may not be structured, and real-time performance is more
important than consistency. These systems are optimized for fast retrieval and
appending operations on records.
• Unlike relational databases, the NoSQL databases do not have a strict schema.
• The records can be in the form of key-value pairs or documents. Most NoSQL
databases are classified in terms of the data storage model or type of records that
can be stored.
NoSQL Database Types
• Key-Value Databases
• Document databases
• Column family databases
• Graph databases
Key-Value Databases
• Key-value databases are the simplest form of NoSQL databases.
• These databases store data in the form of key-value pairs. The keys are used to
uniquely identify the values stored in the database.
• Applications that want to store data generate unique keys and submit the
key-value pairs to the database. The database uses the key to determine where the
value should be stored.
• Most key-value databases have distributed architectures comprising multiple
storage nodes. The data is partitioned across the storage nodes by the keys.
• For determining the partitions for the keys, hash functions are used. The partition
number for a key is obtained by applying a hash function to the key. The hash
functions are chosen such that the keys are evenly distributed across the
partitions.
• Unlike relational databases, in which the tables have fixed schemas and there are
constraints on the columns, key-value databases have no such constraints.
Key-value databases do not have tables like those in relational databases.
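The hash-based partitioning described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the scheme of any particular product: the class name `TinyKeyValueStore`, the helper `partition_for`, and the choice of MD5 with a modulo over four partitions are all assumptions made for the example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number with a stable hash (MD5 is an assumed choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class TinyKeyValueStore:
    """Hypothetical store: each partition stands in for one storage node."""

    def __init__(self, num_partitions: int = 4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def put(self, key, value):
        # The key alone decides which partition receives the value.
        self.partitions[partition_for(key, len(self.partitions))][key] = value

    def get(self, key, default=None):
        return self.partitions[partition_for(key, len(self.partitions))].get(key, default)


store = TinyKeyValueStore(num_partitions=4)
for i in range(1000):
    store.put(f"user:{i}", {"visits": i})

# Number of keys that landed on each partition.
sizes = [len(p) for p in store.partitions]
print(sizes)
```

With a well-behaved hash function the four sizes printed at the end should be roughly equal, which is the even distribution the slide refers to.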
Amazon DynamoDB
• Amazon DynamoDB is a fully-managed,
scalable, high-performance NoSQL
database service from Amazon.
• DynamoDB provides fast and predictable
performance and seamless scalability
without any operational overhead.
• DynamoDB’s data model includes Tables,
Items, and Attributes. A table is a
collection of items and each item is a
collection of attributes.
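The table, item and attribute hierarchy can be pictured with plain Python dictionaries. The sketch below is an in-memory model only and does not use the DynamoDB API; the helper `put_item`, the keys and the attribute names are hypothetical.

```python
# Hypothetical in-memory model; this is not the DynamoDB API.
table = {}  # a table: primary key -> item


def put_item(table, key, attributes):
    """Store an item (a collection of attributes) under its primary key."""
    table[key] = dict(attributes)


put_item(table, "cust-001", {"name": "Asha", "city": "Dharwad"})
put_item(table, "cust-002", {"name": "Ravi", "email": "ravi@example.com", "orders": 3})

# Items in the same table can carry different sets of attributes.
attribute_names = {key: sorted(item) for key, item in table.items()}
print(attribute_names)
```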
Document Databases
• Document store databases store semi-structured data in the form of documents
which are encoded in different standards such as JSON, XML, BSON or YAML.
By semi-structured data we mean that the documents stored are similar to each
other (similar fields, keys or attributes) but there are no strict requirements for a
schema.
• Documents are organized in different ways in different document databases, such as
in the form of collections, buckets or tags.
• Each document stored in a document database has a collection of named fields and
their values. Each document is identified by a unique key or ID.
• There is no need to define any schema for the documents before storing them in the
database.
• While it is possible to store JSON or XML-like documents as values in a key-value
database, the benefit of using document databases over key-value databases is that
they allow the documents to be queried efficiently based on the attribute values
in the documents.
• Document databases are useful for applications that want to store semi-structured
data.
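Querying documents by attribute value, rather than only by key, can be illustrated with a small in-memory collection. This is a sketch under simple assumptions: the `documents` collection, its fields and the `find` helper are hypothetical, and the lookup is a linear scan, whereas a real document database would typically use indexes.

```python
# Hypothetical collection: document ID -> document (fields differ per document).
documents = {
    "p1": {"title": "Laptop", "brand": "Acme", "price": 55000},
    "p2": {"title": "Phone", "brand": "Acme", "price": 20000, "color": "black"},
    "p3": {"title": "Desk", "material": "wood", "price": 8000},
}


def find(collection, **criteria):
    """Return the IDs of documents whose fields equal all the given values."""
    return [
        doc_id
        for doc_id, doc in collection.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


acme_ids = find(documents, brand="Acme")
print(acme_ids)
```

No schema is declared anywhere: the third document simply has a `material` field that the others lack, and the query ignores documents that do not have the requested field.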
MongoDB
• MongoDB is a document-oriented
non-relational database system. MongoDB
is a powerful, flexible and highly scalable
database designed for web applications and
is a good choice as a serving database for
data analytics applications.
• The basic unit of data stored by MongoDB
is a document.
• A document includes a JSON-like set of
key-value pairs.
Column Family Databases
• In column family databases the basic unit of data storage is a column, which has
a name and a value.
• A collection of columns makes up a row, which is identified by a row key.
Columns are grouped together into column families.
• Unlike relational databases, column family databases do not need to have
fixed schemas and a fixed number of columns in each row.
• The number of columns in a column family database can vary across different
rows.
• A column family can be considered as a map having key-value pairs and this map
can vary across different rows.
• Column family databases store data in a denormalized form so that all relevant
information related to an entity required by the applications can be retrieved by
reading a single row.
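The idea that each row is a map from column names to values, grouped by column family and varying from row to row, can be sketched as nested dictionaries. The row keys, family names and helper functions below are hypothetical and only illustrate the layout.

```python
# Hypothetical layout: row key -> column family -> {column name: value}
rows = {
    "user1": {
        "profile": {"name": "Asha", "city": "Dharwad"},
        "orders": {"order-101": "book", "order-102": "pen"},
    },
    "user2": {
        "profile": {"name": "Ravi"},
        "orders": {},
    },
}


def read_row(rows, row_key):
    """Fetch everything stored for one entity with a single row lookup."""
    return rows.get(row_key, {})


def column_count(rows, row_key):
    """Count the columns of a row across all of its column families."""
    return sum(len(columns) for columns in read_row(rows, row_key).values())


print(column_count(rows, "user1"), column_count(rows, "user2"))
```

Storing a user's orders inside the user's own row is the denormalized form mentioned above: one read of `user1` returns the profile and the orders together.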
HBase
• HBase is a scalable, non-relational,
distributed, column-family database that
provides structured data storage for large
tables.
• HBase can store both structured and
unstructured data.
• The data storage in HBase can scale linearly
and automatically by the addition of new
nodes.
• HBase has been designed to work with
commodity hardware and is a highly reliable
and fault tolerant system.
• HBase allows fast random reads and writes.
HBase Data Model
• An HBase table consists of rows, which are indexed by the row key.
• Each row includes multiple column families.
• Each column family includes multiple columns.
• Each column includes multiple cells or entries which are timestamped.
• HBase tables are indexed by the row key, column key and timestamp.
• Unlike relational database tables, HBase tables do not have a fixed
schema.
• HBase column families are declared at the time of creation of the table;
changing them later requires altering the table definition.
• Columns can be added dynamically, and an HBase table can have millions of
columns.
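The indexing by row key, column key and timestamp can be sketched with a small in-memory class. This is a hypothetical model, not the HBase client API: the class name `TinyHTable`, its methods and the integer timestamps are assumptions made for illustration.

```python
import time


class TinyHTable:
    """Hypothetical in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are declared when the table object is created.
        self.column_families = frozenset(column_families)
        self.rows = {}  # row key -> {(family, column): [(timestamp, value), ...]}

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if timestamp is None else timestamp
        # Columns are created on first write; every write adds a timestamped cell.
        cells = self.rows.setdefault(row_key, {}).setdefault((family, column), [])
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def get(self, row_key, family, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        cells = self.rows.get(row_key, {}).get((family, column), [])
        for ts, value in cells:
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = TinyHTable(["info"])
table.put("row1", "info", "city", "Hubli", timestamp=1)
table.put("row1", "info", "city", "Dharwad", timestamp=2)
latest = table.get("row1", "info", "city")
older = table.get("row1", "info", "city", timestamp=1)
print(latest, older)
```

The column `city` is never declared up front, while a write to an undeclared column family is rejected, which mirrors the distinction between columns and column families drawn above.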
HBase Architecture
• HBase has a distributed architecture.
• An HBase deployment comprises multiple region
servers which usually run on the same machines as
the Hadoop data nodes.
• HBase tables are partitioned by the row key into
multiple regions (HRegions). Each region server has
multiple regions.
• HBase has a master-slave architecture with one of
the nodes acting as the master node (HMaster) and
the other nodes acting as slave nodes.
• The HMaster is responsible for maintaining the
HBase meta-data and assignment of regions to
region servers.
• HBase uses ZooKeeper for distributed state
coordination.
• HBase has two special tables - ROOT and META,
for identifying which region server is responsible
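Partitioning a table by row key into regions, each hosted by a region server, can be sketched as a lookup over sorted region start keys. The start keys, the server names and the `locate` function are hypothetical; the sketch only shows how a row key can be mapped to the region that contains it and does not reproduce HBase's own lookup tables.

```python
import bisect

# Hypothetical regions: each starts at a row key and runs up to the next start key.
region_start_keys = ["", "g", "n", "t"]        # sorted; "" covers the lowest keys
region_servers = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region


def locate(row_key):
    """Return (region index, region server) for the region containing row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


for key in ["apple", "monkey", "zebra"]:
    print(key, locate(key))
```

Here `rs1` hosts two regions, matching the point that each region server holds multiple regions.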
Graph Databases
• Graph stores are NoSQL databases designed for storing data that has a
graph structure with nodes and edges.
• While relational databases model data in the form of rows and columns,
the graph databases model data in the form of nodes and relationships.
• Nodes represent the entities in the data model. Nodes have a set of
attributes. A node can represent different types of entities, for example, a
person, place (such as a city, restaurant or a building) or an object (such as
a car).
• The relationships between the entities are represented in the form of links
between the nodes. Links also have a set of attributes. Links can be
directed or undirected. Directed links denote that the relationship is
unidirectional.
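Nodes with attributes, links with attributes, and the difference between directed and undirected links can be sketched as follows. The node names, the attribute names and the helpers `add_link` and `neighbors` are hypothetical and chosen only for this example.

```python
# Hypothetical graph: node name -> attributes of that node.
nodes = {
    "alice": {"type": "person", "age": 30},
    "bob": {"type": "person", "age": 28},
    "dharwad": {"type": "city"},
}
links = []  # each link: (source, target, attributes, directed)


def add_link(source, target, directed=True, **attributes):
    """Record a link between two nodes together with its own attributes."""
    links.append((source, target, attributes, directed))


def neighbors(node):
    """Nodes reachable from `node` in one hop, honoring link direction."""
    found = []
    for source, target, _attributes, directed in links:
        if source == node:
            found.append(target)
        elif target == node and not directed:
            found.append(source)  # undirected links can be followed both ways
    return found


add_link("alice", "dharwad", relation="lives_in", since=2015)  # directed
add_link("alice", "bob", directed=False, relation="friend_of")  # undirected

print(neighbors("alice"), neighbors("bob"), neighbors("dharwad"))
```

The directed `lives_in` link can be followed from `alice` to `dharwad` but not back, while the undirected `friend_of` link is visible from both ends.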
Neo4j
• Neo4j is one of the popular graph
databases which provides support for
Atomicity, Consistency, Isolation,
Durability (ACID).
• Neo4j adopts a graph model that
consists of nodes and relationships.
• Both nodes and relationships have
properties which are captured in the
form of multiple attributes (key-value
pairs).
• Nodes are tagged with labels which are
used to represent different roles in the
domain being modeled.
Neo4j - Cypher
• For create, read, update and delete
(CRUD) operations, Neo4j provides a
query language called Cypher.
• Cypher has some similarities with the
SQL query language used for relational
databases.
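To give a feel for Cypher, the sketch below pairs each CRUD operation with one possible Cypher statement. The `Person` label and the property names are hypothetical, and the statements are held as plain Python strings that are only printed; running them would require a Neo4j server and a client driver, which this example does not assume.

```python
# Hypothetical Person nodes; the statements are plain strings and are not executed here.
crud_statements = {
    "create": "CREATE (p:Person {name: 'Asha', city: 'Dharwad'})",
    "read": "MATCH (p:Person {name: 'Asha'}) RETURN p.name, p.city",
    "update": "MATCH (p:Person {name: 'Asha'}) SET p.city = 'Hubli'",
    "delete": "MATCH (p:Person {name: 'Asha'}) DETACH DELETE p",
}

for operation, statement in crud_statements.items():
    print(f"{operation:>6}: {statement}")
```

The `MATCH ... RETURN` form plays a role similar to `SELECT ... WHERE` in SQL, with the pattern in parentheses describing which nodes to find.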
Comparison of NoSQL databases
Thank You