Data Analytics Using NoSQL

This document discusses NoSQL databases and MongoDB. It covers typical NoSQL architectures that use hashing to map keys to servers, and summarizes the CAP theorem, which says it is impossible to satisfy all three of consistency, availability, and partition tolerance simultaneously. It also covers MongoDB-specific topics such as sharding of data, replica sets for redundancy and failover, and the flexible schema-less document model using the BSON format.

Uploaded by PREM KUMAR M

Data Analytics using NoSQL

DHINESHKUMAR S K
Taxonomy of NoSQL

 Key-value
 Graph database
 Document-oriented
 Column family
Typical NoSQL architecture

 A hashing function maps each key K to a server (node)
CAP theorem for NoSQL

What the CAP theorem really says:

• If you cannot limit the number of faults, requests can be directed to any server, and you insist on serving every request you receive, then you cannot possibly be consistent

How it is interpreted:
• You must always give something up: consistency, availability, or tolerance to failure and reconfiguration
Theory of NoSQL: CAP

GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data

• Consistency
  • All replicas contain the same version of the data
  • A client always has the same view of the data (no matter which node)
• Availability
  • System remains operational on failing nodes
  • All clients can always read and write
• Partition tolerance
  • Multiple entry points
  • System remains operational on system split (communication malfunction)
  • System works well across physical network partitions

CAP Theorem: satisfying all three at the same time is impossible
Sharding of data

 Distributes a single logical database system across a cluster of machines
 Uses range-based partitioning to distribute documents based on a specific shard key
 Automatically balances the data associated with each shard
 Can be turned on and off per collection (table)
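Range-based partitioning on a shard key can be sketched as follows. The shard names and range boundaries are made up for illustration; a real deployment has many ranges that split and migrate as data grows.

```javascript
// Hypothetical sketch of range-based partitioning: each shard owns a
// contiguous range of shard-key values, and the router sends a
// document to the shard whose range contains its key.
const shards = [
  { name: "shard0", upperBound: "g" },      // keys < "g"
  { name: "shard1", upperBound: "n" },      // "g" <= keys < "n"
  { name: "shard2", upperBound: "\uffff" }, // keys >= "n"
];

function shardFor(shardKey) {
  // The first shard whose upper bound exceeds the key owns it.
  return shards.find(s => shardKey < s.upperBound).name;
}

console.log(shardFor("alice")); // shard0
console.log(shardFor("zoe"));   // shard2
```

Because ranges are contiguous, range queries on the shard key touch only the shards whose ranges overlap the query, rather than every node.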
Replica Sets

 Redundancy and failover
 Zero downtime for upgrades and maintenance
 Master-slave replication
 Strong consistency
 Delayed consistency
 Geospatial features
How does NoSQL vary from RDBMS?

 Looser schema definition
 Applications written to deal with specific documents/data
 Applications aware of the schema definition, as opposed to the data
 Designed to handle distributed, large databases
 Trade-offs:
  No strong support for ad hoc queries, but designed for speed and growth of the database
  Query language through the API
  Relaxation of the ACID properties
Benefits of NoSQL

• Elastic scaling
  • RDBMS scale up: bigger load, bigger server
  • NoSQL scale out: distribute data across multiple hosts seamlessly
• Big data
  • Huge increase in data; RDBMS capacity and constraints on data volumes at their limits
  • NoSQL designed for big data
• DBA specialists
  • RDBMS require highly trained experts to monitor the DB
  • NoSQL require less management: automatic repair and simpler data models
Benefits of NoSQL

• Flexible data models
  • Changes to an RDBMS schema have to be carefully managed
  • NoSQL databases are more relaxed about the structure of the data
  • Database schema changes do not have to be managed as one complicated change unit
  • Applications are already written to address an amorphous schema
• Economics
  • RDBMS rely on expensive proprietary servers to manage data
  • NoSQL: clusters of cheap commodity servers to manage the data and transaction volumes
  • Cost per gigabyte or transactions/second for NoSQL can be lower than the cost for an RDBMS
Drawbacks of NoSQL

• Support
  • RDBMS vendors provide a high level of support to clients
  • Stellar reputation
  • NoSQL products are open source projects with startups supporting them
  • Reputation not yet established
• Maturity
  • RDBMS are mature products: stable and dependable
  • Also means old: no longer cutting edge nor interesting
  • NoSQL products are still implementing their basic feature set
Drawbacks of NoSQL

• Administration
  • The RDBMS administrator is a well-defined role
  • NoSQL's goal: no administrator necessary; however, NoSQL still requires effort to maintain
• Lack of expertise
  • Whole workforce of trained and seasoned RDBMS developers
  • Still recruiting developers to the NoSQL camp
• Analytics and business intelligence
  • RDBMS designed to address this niche
  • NoSQL designed to meet the needs of a Web 2.0 application, not for ad hoc query of the data
  • Tools are being developed to address this need
RDB ACID to NoSQL BASE

ACID: Atomicity, Consistency, Isolation, Durability

BASE:
• Basically Available
• Soft-state (state of the system may change over time)
• Eventually consistent (asynchronous propagation)
MongoDB
What is MongoDB?

 Developed by 10gen
 Founded in 2007
 A document-oriented, NoSQL database
 Written in C++
 Supports APIs (drivers) in many computer languages
  JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
Functionality of MongoDB

• Dynamic schema
• No DDL
• Document-based database
• Secondary indexes
• Query language via an API
• Atomic writes and fully-consistent reads
• If system configured that way
• Master-slave replication with automated failover (replica sets)
• Built-in horizontal scaling via automated range-based partitioning of data (sharding)
Why use MongoDB?

 Simple queries
 Functionality provided applicable to most web applications
 Easy and fast integration of data
 No ERD diagram
 Not well suited for heavy and complex transaction systems
MongoDB: CAP approach

Focus on Consistency and Partition tolerance

• Consistency
  • All replicas contain the same version of the data
• Availability
  • System remains operational on failing nodes
• Partition tolerance
  • Multiple entry points
  • System remains operational on system split

CAP Theorem: satisfying all three at the same time is impossible
MongoDB: Hierarchical Objects

 A MongoDB instance may have zero or more 'databases'
 A database may have zero or more 'collections'
 A collection may have zero or more 'documents'
 A document may have one or more 'fields'
 MongoDB 'indexes' function much like their RDBMS counterparts
RDB Concepts to NoSQL

RDBMS        MongoDB
Database     Database
Table, View  Collection
Row          Document (BSON)
Column       Field
Index        Index
Join         Embedded document
Foreign Key  Reference
Partition    Shard
Choices made for Design of MongoDB

 Scale horizontally over commodity hardware


 Lots of relatively inexpensive servers
 Keep the functionality that works well in RDBMSs
 Ad hoc queries
 Fully featured indexes
 Secondary indexes
 What doesn’t distribute well in RDB?
 Long running multi-row transactions
 Joins
 Both artifacts of the relational data model (row x column)
BSON format

 Binary-encoded serialization of JSON-like documents


 Zero or more key/value pairs are stored as a single entity
 Each entry consists of a field name, a data type, and a value
 Large elements in a BSON document are prefixed with a length field to facilitate scanning
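The length-prefix idea can be sketched for a single string element. This is an illustrative encoder for one element type only, not a full BSON implementation; the helper name is invented here.

```javascript
// Sketch of one BSON-style string element: a type byte, the
// null-terminated field name, then a little-endian length prefix
// before the value, so a scanner can skip the value without decoding.
function encodeStringField(name, value) {
  const nameBytes = Buffer.from(name + "\0", "utf8");
  const valueBytes = Buffer.from(value + "\0", "utf8");
  const lenPrefix = Buffer.alloc(4);
  lenPrefix.writeInt32LE(valueBytes.length, 0); // length of the value
  return Buffer.concat([
    Buffer.from([0x02]), // 0x02 marks a string element in BSON
    nameBytes,
    lenPrefix,
    valueBytes,
  ]);
}

const field = encodeStringField("name", "R2-D2");
console.log(field.length); // 1 + 5 + 4 + 6 = 16 bytes
```

A reader that wants to skip this element reads the 4-byte length after the field name and jumps that many bytes ahead, which is what makes scanning large documents cheap.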
Schema Free

 MongoDB does not need any pre-defined data schema
 Every document in a collection could have different data
 Addresses NULL data fields

{name: "will",
 eyes: "blue",
 birthplace: "NY",
 aliases: ["bill", "la ciacco"],
 loc: [32.7, 63.4],
 boss: "ben"}

{name: "jeff",
 eyes: "blue",
 loc: [40.7, 73.4],
 boss: "ben"}

{name: "brendan",
 aliases: ["el diablo"]}

{name: "matt",
 pizza: "DiGiorno",
 height: 72,
 loc: [44.6, 71.3]}

{name: "ben",
 hat: "yes"}
JSON format

 Data is in name/value pairs
 A name/value pair consists of a field name followed by a colon, followed by a value:
  Example: "name": "R2-D2"
 Data is separated by commas
  Example: "name": "R2-D2", "race": "Droid"
 Curly braces hold objects
  Example: {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}
 An array is stored in brackets []
  Example: [ {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}, {"name": "Yoda", "affiliation": "rebels"} ]
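The rules above in executable form, using the same droid data:

```javascript
// Name/value pairs inside curly braces make an object, square
// brackets hold an array, and JSON.stringify / JSON.parse round-trip
// between objects and JSON text.
const droid = { name: "R2-D2", race: "Droid", affiliation: "rebels" };
const rebels = [droid, { name: "Yoda", affiliation: "rebels" }];

const text = JSON.stringify(rebels); // objects -> JSON text
const back = JSON.parse(text);       // JSON text -> objects

console.log(back[1].name); // "Yoda"
```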
MongoDB Features

 Document-oriented storage
 Full index support
 Replication & high availability
 Auto-sharding
 Querying
 Fast in-place updates
 Map/Reduce functionality
Index Functionality

• B+ tree indexes
• An index is automatically created on the _id field (the primary key)
• Users can create other indexes to improve query performance or to enforce unique values for a particular field
• Supports single-field indexes as well as compound indexes
  • As in SQL, the order of the fields in a compound index matters
• If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array
• The sparse property of an index ensures that the index only contains entries for documents that have the indexed field (so it ignores records that do not have the field defined)
• If an index is both unique and sparse, the system will reject records that have a duplicate key value but allow records that do not have the indexed field defined
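The unique + sparse rule can be simulated in memory. The helper below is invented purely to illustrate the accept/reject behavior described above; it is not how MongoDB stores indexes.

```javascript
// Sketch of the unique + sparse rule: duplicates of the indexed
// field are rejected, but documents missing the field never enter
// the sparse index and are always allowed.
function makeUniqueSparseIndex(field) {
  const seen = new Set();
  return function tryInsert(doc) {
    if (!(field in doc)) return true;        // sparse: not indexed, allowed
    if (seen.has(doc[field])) return false;  // unique: duplicate rejected
    seen.add(doc[field]);
    return true;
  };
}

const tryInsert = makeUniqueSparseIndex("sku");
console.log(tryInsert({ sku: 10, type: "hammer" })); // true
console.log(tryInsert({ sku: 10, type: "saw" }));    // false (duplicate)
console.log(tryInsert({ type: "unlabeled part" }));  // true (field absent)
```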
Hands ON!!!!!
Example: Mongo Document

{
  name: 'Brad Steve',
  address: {
    street: 'Oak Terrace',
    city: 'Denton'
  }
}
Example: Mongo Collection

The _id field is obligatory, and automatically generated by MongoDB.

{
  "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
  "Last Name": "DUMONT",
  "First Name": "Jean",
  "Date of Birth": "01-22-1963"
},
{
  "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
  "Last Name": "PELLERIN",
  "First Name": "Franck",
  "Date of Birth": "09-19-1983",
  "Address": "1 chemin des Loges",
  "City": "VERSAILLES"
}
Sample!

 BLOG
 A blog post has an author, some text, and many comments
 The comments are unique per post, but one author has many posts

 How would you design this in SQL?


Blog – BAD Design

 Collections for posts, authors, and comments
 References by manually created ID

post = {
  id: 150,
  author: 100,
  text: 'This is a pretty awesome post.',
  comments: [100, 105, 112]
}

author = {
  id: 100,
  name: 'Michael Arrington',
  posts: [150]
}

comment = {
  id: 105,
  text: 'Whatever this is good comment'
}
Sample: Better Design

 Collection for posts
 Embed comments, author name

post = {
  author: 'Michael Arrington',
  text: 'This is a pretty awesome post.',
  comments: [ 'Whatever this post sux.', 'I agree, lame!' ]
}
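The payoff of the embedded design shows up when reading: the post above is one self-contained object, so a single lookup returns the author and every comment with no join or second query.

```javascript
// The embedded design as a plain object: comments and the author's
// name live inside the post itself.
const post = {
  author: "Michael Arrington",
  text: "This is a pretty awesome post.",
  comments: ["Whatever this post sux.", "I agree, lame!"],
};

// One read yields everything; no second lookup by ID is needed.
console.log(post.author);          // "Michael Arrington"
console.log(post.comments.length); // 2
```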
Installation
CRUD Operations

db.collection specifies the collection or the 'table' to store the document.

• Create
  • db.collection.insert( <document> )
  • db.collection.save( <document> )
  • db.collection.update( <query>, <update>, { upsert: true } )
• Read
  • db.collection.find( <query>, <projection> )
  • db.collection.findOne( <query>, <projection> )
• Update
  • db.collection.update( <query>, <update>, <options> )
• Delete
  • db.collection.remove( <query>, <justOne> )
Create Operations

db.collection specifies the collection or the 'table' to store the document.

• db.collection_name.insert( <document> )
  • Omit the _id field to have MongoDB generate a unique key
  • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
  • db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
• db.collection_name.update( <query>, <update>, { upsert: true } )
  • Will update one or more records in a collection satisfying the query
• db.collection_name.save( <document> )
  • Updates an existing record or creates a new record
Read Operations

• db.collection.find( <query>, <projection> )
  • Provides functionality similar to the SELECT command
  • <query> is the where condition; <projection> is the fields in the result set
  • Example: var partsCursor = db.parts.find( { type: "hammer" } ).limit(5)
  • Has cursors to handle a result set
  • Can modify the query to impose limits, skips, and sort orders
  • Can specify to return the 'top' number of records from the result set
• db.collection.findOne( <query>, <projection> )
Query Operators

Name       Description
$eq        Matches values that are equal to a specified value
$gt, $gte  Matches values that are greater than (or equal to) a specified value
$lt, $lte  Matches values that are less than (or equal to) a specified value
$ne        Matches values that are not equal to a specified value
$in        Matches any of the values specified in an array
$nin       Matches none of the values specified in an array
$or        Joins query clauses with a logical OR
$and       Joins query clauses with a logical AND
$not       Inverts the effect of a query expression
$nor       Joins query clauses with a logical NOR
$exists    Matches documents that have a specified field
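The comparison operators in the table can be sketched as a tiny client-side matcher. This is a hypothetical illustration of the semantics only; real MongoDB evaluates operators server-side with richer rules (type ordering, arrays, nested fields).

```javascript
// Mini-matcher for a few comparison operators from the table.
function matches(value, condition) {
  if (condition !== null && typeof condition === "object") {
    return Object.entries(condition).every(([op, arg]) => {
      switch (op) {
        case "$eq":  return value === arg;
        case "$ne":  return value !== arg;
        case "$gt":  return value > arg;
        case "$gte": return value >= arg;
        case "$lt":  return value < arg;
        case "$lte": return value <= arg;
        case "$in":  return arg.includes(value);
        case "$nin": return !arg.includes(value);
        default:     return false; // operator not covered by this sketch
      }
    });
  }
  return value === condition; // a bare value means equality
}

console.log(matches(15, { $gt: 10, $lte: 15 }));            // true
console.log(matches("hammer", { $in: ["hammer", "saw"] })); // true
```

Note that several operators in one condition object combine with AND, which mirrors how a query document like { quantity: { $gt: 10, $lte: 15 } } behaves.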
Update Operations

• db.collection_name.insert( <document> )
  • Omit the _id field to have MongoDB generate a unique key
  • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
  • db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
• db.collection_name.save( <document> )
  • Updates an existing record or creates a new record
• db.collection_name.update( <query>, <update>, { upsert: true } )
  • Will update one or more records in a collection satisfying the query
• db.collection_name.findAndModify( <query>, <sort>, <update>, <new>, <fields>, <upsert> )
  • Modify existing record(s) and retrieve either the old or new version of the record
Delete Operations

• db.collection_name.remove( <query>, <justOne> )
  • Delete all records from a collection, or those matching a criterion
  • <justOne>: specifies to delete only one record matching the criterion
  • Example: db.parts.remove( { type: /^h/ } ) removes all parts whose type starts with "h"
  • db.parts.remove() deletes all documents in the parts collection
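The four operations' semantics can be sketched with an in-memory collection. makeCollection and its helpers are invented for illustration and support equality-only queries; real MongoDB adds operator queries, options, and persistence.

```javascript
// In-memory sketch of CRUD semantics with equality-only queries.
function makeCollection() {
  let docs = [];
  const match = query => doc =>
    Object.entries(query).every(([k, v]) => doc[k] === v);
  return {
    insert(doc) { docs.push(doc); },                         // Create
    find(query = {}) { return docs.filter(match(query)); },  // Read
    update(query, changes) {                                 // Update
      docs.filter(match(query)).forEach(d => Object.assign(d, changes));
    },
    remove(query = {}) {                                     // Delete
      docs = docs.filter(d => !match(query)(d));
    },
  };
}

const parts = makeCollection();
parts.insert({ type: "screwdriver", quantity: 15 });
parts.insert({ type: "hammer", quantity: 1 });
parts.update({ type: "hammer" }, { quantity: 2 });
console.log(parts.find({ type: "hammer" })[0].quantity); // 2
parts.remove({ type: "screwdriver" });
console.log(parts.find().length); // 1
```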
SQL vs. MongoDB entities

MySQL:

START TRANSACTION;
INSERT INTO contacts VALUES (NULL, 'joeblow');
INSERT INTO contact_emails VALUES
  ( NULL, 'joe@blow.com', LAST_INSERT_ID() ),
  ( NULL, 'joseph@blow.com', LAST_INSERT_ID() );
COMMIT;

MongoDB:

db.contacts.save( {
  userName: "joeblow",
  emailAddresses: [ "joe@blow.com", "joseph@blow.com" ]
} );

MongoDB separates physical structure from logical structure. Designed to deal with large & distributed data.
Aggregation
Aggregation Framework Operators

 $project
 $match
 $limit
 $skip
 $sort
 $unwind
 $group
 …….
$match

 Filter documents
 Uses existing query syntax
 If using $geoNear it has to be first in pipeline
 $where is not supported
Matching Field Values

Input documents:

{
  "_id" : 271421,
  "amenity" : "pub",
  "name" : "Sir Walter Tyrrell",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.6192422, 50.9131996 ]
  }
}
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}

Document matching on "name" : "The Red Lion":

{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
$project

 Reshape documents
 Include, exclude or rename fields
 Inject computed fields
 Create sub-document fields
Including and Excluding Fields

Input document:

{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}

Projection stage:

{ "$project": {
  "_id": 0,
  "amenity": 1,
  "name": 1
}}

Result:

{
  "amenity" : "pub",
  "name" : "The Red Lion"
}
Reformatting Documents

Input document:

{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}

Projection stage:

{ "$project": {
  "_id": 0,
  "name": 1,
  "meta": { "type": "$amenity" }
}}

Result:

{
  "name" : "The Red Lion",
  "meta" : { "type" : "pub" }
}
$group

• Group documents by an ID
  • Field reference, object, constant
• Other output fields are computed
  • $max, $min, $avg, $sum
  • $addToSet, $push, $first, $last
• Processes all data in memory
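The three core stages can be sketched as a pure-JS pipeline over the pub documents from earlier. The aggregate helper and its parameter names are invented for illustration; MongoDB runs the equivalent pipeline server-side in C++.

```javascript
// Sketch of $match -> $project -> $group (with a $sum-style count).
function aggregate(docs, { matchQuery, keepFields, groupByField }) {
  // $match: keep documents whose fields equal the query's values.
  const matched = docs.filter(d =>
    Object.entries(matchQuery).every(([k, v]) => d[k] === v));
  // $project: keep only the listed fields.
  const projected = matched.map(d =>
    Object.fromEntries(keepFields.map(f => [f, d[f]])));
  // $group: count documents per group key, like { $sum: 1 }.
  const counts = {};
  for (const d of projected) {
    counts[d[groupByField]] = (counts[d[groupByField]] || 0) + 1;
  }
  return counts;
}

const places = [
  { amenity: "pub", name: "The Red Lion" },
  { amenity: "pub", name: "Sir Walter Tyrrell" },
  { amenity: "cafe", name: "The Beanery" },
];

console.log(aggregate(places, {
  matchQuery: { amenity: "pub" },
  keepFields: ["amenity", "name"],
  groupByField: "amenity",
})); // { pub: 2 }
```

Ordering matters in real pipelines too: matching first shrinks the data that every later stage has to process.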


Aggregation Framework Benefits

 Real-time
 Simple yet powerful interface
 Declared in JSON, executes in C++
 Runs inside MongoDB on local data

Limitations:
− Adds load to your DB
− Limited operators
− Data output is limited
