MongoDB Cookbook - Second Edition - Sample Chapter
This book starts with an introduction to programming language drivers in both Java and Python.
You will then learn a variety of topics, including advanced query operations and features with MongoDB,
monitoring and backup using MMS, and some very handy administration recipes. After that, there are
recipes on cloud deployment, which feature Docker containers, integration with Hadoop, and improving
developer productivity.
Cyrus Dasadia
Amol Nayak
Over 80 comprehensive recipes that will help you master the art of using and administering MongoDB 3
MongoDB is a high-performance and feature-rich NoSQL database that forms the backbone of the systems powering many organizations in different domains. Though it is feature-rich and simple to use, developers and administrators often need quick solutions to the various problems they may come across.
Amol Nayak is a MongoDB certified developer and has been working as a developer for
over 8 years. He is currently employed with a leading financial data provider, working on
cutting-edge technologies. He has used MongoDB as a database for various systems at his
current and previous workplaces to support enormous data volumes. He is an open source
enthusiast and supports it by contributing to open source frameworks and promoting them.
He has made contributions to the Spring Integration project; his contributions include the adapters for JPA, XQuery, MongoDB, push notifications to mobile devices, and Amazon Web Services (AWS). He has also made some contributions to the Spring Data MongoDB project. Apart from technology, he is passionate about motorsports and is a race official at Buddh International Circuit, India, for various motorsport events. He previously authored Instant MongoDB, published by Packt Publishing.
Preface
MongoDB is a leading document-oriented NoSQL database that offers linear scalability, making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features.
This book provides detailed recipes that describe how to use the different features
of MongoDB. The recipes cover topics ranging from setting up MongoDB, knowing its
programming language API, and monitoring and administration, to some advanced topics
such as cloud deployment, integration with Hadoop, and some open source and proprietary
tools for MongoDB. The recipe format presents the information in a concise, actionable form; it lets you look up just the use case at hand without going through the entire book.
Chapter 5, Advanced Operations, is an extension of Chapter 2, Command-line Operations
and Indexes. We will look at some of the slightly advanced features such as implementing
server-side scripts, geospatial search, GridFS, full text search, and how to integrate MongoDB
with an external full text search engine.
Chapter 6, Monitoring and Backups, is all about administration and some basic monitoring. MongoDB provides a state-of-the-art monitoring and real-time backup service, MongoDB Monitoring Service (MMS), and in this chapter, we will look at some recipes around monitoring and backup using MMS.
Chapter 7, Deploying MongoDB on the Cloud, covers recipes that use MongoDB service
providers for cloud deployment. We will set up our own MongoDB server on the AWS cloud as
well as run MongoDB in Docker containers.
Chapter 8, Integration with Hadoop, covers recipes to integrate MongoDB with Hadoop to use
the Hadoop MapReduce API in order to run MapReduce jobs on the data residing in MongoDB
data files and write the results to them. We will also see how to use AWS EMR to run our
MapReduce jobs on the cloud using Amazon's Hadoop cluster, EMR, with the mongo-hadoop
connector.
Chapter 9, Open Source and Proprietary Tools, is about using frameworks and products
built around MongoDB to improve a developer's productivity or about simplifying some of the
day-to-day jobs using Mongo. Unless explicitly mentioned, the products/frameworks that we
will be looking at in this chapter are open source.
Appendix, Concepts for Reference, gives you a bit of additional information on the write
concern and read preference for reference.
Installing single node MongoDB with options from the config file
Connecting to the replica set in the shell to query and insert data
Connecting to the replica set to query and insert data from a Java client
Connecting to the replica set to query and insert data using a Python client
Introduction
In this chapter, we will look at starting up the MongoDB server. Though it is a cakewalk to
start the server with default settings for development purposes, there are numerous options
available to fine-tune the start up behavior. We will start the server as a single node and then
introduce various configuration options. We will conclude this chapter by setting up a simple
replica set and running a sharded cluster. So, let's get started with installing and setting up
the MongoDB server in the easiest way possible for simple development purposes.
Getting ready
We assume that you have downloaded the MongoDB binaries from the download site, extracted them, and added the resulting bin directory to the operating system's path variable. (This is not mandatory, but it really becomes convenient after doing so.) The binaries can be downloaded from http://www.mongodb.org/downloads after selecting your host operating system.
How to do it
1. Create the directory /data/mongo/db (or any directory of your choice). This will be our database directory, and the mongod process (the mongo server process) needs permission to write to it.
2. We will start the server from the console with the data directory, /data/mongo/db,
as follows:
> mongod --dbpath /data/mongo/db
How it works
If you see the following line on the console, you have successfully started the server:
[initandlisten] waiting for connections on port 27017
Chapter 1
Starting a server can't get easier than this. Despite the simplicity in starting the server, there
are a lot of configuration options that can be used to tune the behavior of the server on
startup. Most of the default options are sensible and need not be changed. With the default
values, the server should be listening to port 27017 for new connections, and the logs will be
printed out to the standard output.
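To get a feel for how startup options change this default behavior, here is a small Python sketch (a hypothetical helper, not part of MongoDB's tooling) that assembles a mongod command line from a dictionary of options:

```python
def build_mongod_command(options):
    """Assemble a mongod command line from a dict of options.

    Boolean True values become bare flags (for example, --smallfiles);
    everything else becomes a --key value pair, in insertion order.
    """
    parts = ["mongod"]
    for key, value in options.items():
        if value is True:
            parts.append("--" + key)
        else:
            parts.append("--" + key)
            parts.append(str(value))
    return " ".join(parts)

print(build_mongod_command({"dbpath": "/data/mongo/db", "port": 27000}))
# mongod --dbpath /data/mongo/db --port 27000
```

Running the resulting string from a shell is equivalent to typing the flags by hand; the helper only makes the option-to-flag mapping explicit.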
See also
There are times when we would like to configure some options on server startup. In the Starting a single node instance using command-line options recipe, we will use some more startup options.
As the server has been started for development purposes, we don't want to preallocate full-size database files. (We will soon see what this means.)
Getting ready
If you have already seen and executed the Installing single node MongoDB recipe, you need
not do anything different. If all these prerequisites are met, we are good for this recipe.
How to do it
1. The /data/mongo/db directory for the database and /logs/ for the logs should be
created and present on your filesystem with appropriate permissions to write to it.
2. Execute the following command:
> mongod --port 27000 --dbpath /data/mongo/db --logpath /logs/mongo.log --smallfiles
How it works
OK, this wasn't too difficult and is similar to the previous recipe, but we have some additional command-line options this time around. MongoDB supports quite a few options at startup; the following are, in my opinion, the most common and important ones:
--help or -h: This prints information about all the available command-line options and exits.

--config or -f: This specifies the location of the configuration file that contains all the configuration options. We will see more on this option in a later recipe. It is just a convenient way of specifying the configurations in a file rather than on the command prompt, especially when the number of options specified is large. Using a separate configuration file shared across different MongoDB instances will also ensure that all the instances are running with identical configurations.

--verbose or -v: This makes the logs more verbose; we can put more v's to make the output even more verbose, for example, -vvvvv.

--quiet: This gives quieter output; it is the opposite of the verbose option and cuts down on the amount of logging.

--port: This option is used if you are looking to start the server listening on some port other than the default 27017. We will frequently use this option whenever we want to start multiple mongo servers on the same machine; for example, --port 27018 will start the server listening on port 27018 for new connections.

--logpath: This provides a path to a log file where the logs will be written. The value defaults to STDOUT. For example, --logpath /logs/server.out will use /logs/server.out as the log file for the server. Remember that the value provided should be a file, not a directory.

--logappend: This option appends to the existing log file, if any. The default behavior is to rename the existing log file and then create a new file for the logs of the currently started mongo instance. Suppose that we have used server.out as the name of the log file and the file exists on startup; then by default this file will be renamed as server.out.<timestamp>, where <timestamp> is the current time. The time is GMT, as opposed to the local time. Let's assume that the current date is October 28th, 2013 and the time is 12:02:15; then the file generated will have the following value as the timestamp: 2013-10-28T12-02-15.

--dbpath: This provides the directory where a new database will be created or an existing database is present. The value defaults to /data/db. We will start the server using /data/mongo/db as the database directory. Note that the value should be a directory, not the name of a file.

--smallfiles: This makes the server use a smaller default size for data files, which is handy for development and test instances where we don't want to preallocate large files.

--replSet: This starts the server as a member of the replica set whose name is given as the value of this option. We will use this option later when we set up a simple replica set.

--configsvr: This informs the started mongod process that it is being started as a config server for a sharded setup. By default, a config server listens on port 27019.

--shardsvr: This informs the started mongod process that this server is being started as a shard server. By giving this option, the server also listens on port 27018 instead of the default 27017. We will learn more about this option when we start a simple sharded server.

--oplogSize: This specifies the maximum size, in megabytes, of the oplog, the capped collection used for replication. A small value such as 128 is convenient for test setups.

--storageEngine: Starting with MongoDB 3.0, a new storage engine called WiredTiger was introduced. The previous (default) storage engine is now called mmapv1. To start MongoDB with WiredTiger instead of mmapv1, use the wiredTiger value with this option.

--directoryperdb: This stores the files of each database in its own subdirectory under the data directory.
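As a concrete illustration of the --logappend rename behavior described above, the following Python sketch (a hypothetical helper, not MongoDB's actual code) produces the same GMT timestamp suffix:

```python
from datetime import datetime

def renamed_log_file(logpath, when):
    """Return the name the old log file would get on restart: the original
    path plus a GMT timestamp suffix in YYYY-MM-DDTHH-MM-SS form."""
    return logpath + "." + when.strftime("%Y-%m-%dT%H-%M-%S")

# The example from the text: October 28th, 2013 at 12:02:15 GMT
print(renamed_log_file("server.out", datetime(2013, 10, 28, 12, 2, 15)))
# server.out.2013-10-28T12-02-15
```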
There's more
For an exhaustive list of options that are available, use the --help or -h option. This list of
options is not exhaustive, and we will see some more coming up in later recipes as and when
we need them. In the next recipe, we will see how to use a configuration file instead of the
command-line arguments.
See also
The Installing single node MongoDB with options from the config file recipe, for using configuration files to provide startup options
Getting ready
If you have already executed the Installing single node MongoDB recipe, you need not do
anything different as all the prerequisites of this recipe are the same.
How to do it
The /data/mongo/db directory for the database and /logs/ for the logs should be created
and present on your filesystem with the appropriate permissions to write to it and perform the
following steps:
1. Create a configuration file that can have any arbitrary name. In our case, let's say that
we create this in /conf/mongo.conf. We then edit the file and add the following
lines to it:
port = 27000
dbpath = /data/mongo/db
logpath = /logs/mongo.log
smallfiles = true
2. Start the server, pointing it at the configuration file we just created:
> mongod --config /conf/mongo.conf
How it works
All the command-line options that we discussed in the previous recipe, Starting a single node instance using command-line options, hold true; we are just providing them in a configuration file instead. If you have not visited the previous recipe, I would recommend that you do so, as that is where we discussed some of the common command-line options. The properties are specified as <property name> = <value>. For all the properties that don't take values, for example, the smallfiles option, the value given is the Boolean value true. If we need verbose output, we would add v=true (or multiple v's to make it more verbose) to our configuration file. If you already know the command-line option, it is pretty easy to guess the name of the property in the file: it is almost the same as the command-line option, with just the hyphens removed.
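To make the option-to-property mapping concrete, here is a small Python sketch (a hypothetical parser, not part of MongoDB) that reads this <property name> = <value> format into a dictionary:

```python
def parse_mongo_conf(text):
    """Parse the old-style mongod config format: one key = value per line.

    Blank lines and comment lines starting with # are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip()
    return options

conf = """
port = 27000
dbpath = /data/mongo/db
logpath = /logs/mongo.log
smallfiles = true
"""
print(parse_mongo_conf(conf))
# {'port': '27000', 'dbpath': '/data/mongo/db', 'logpath': '/logs/mongo.log', 'smallfiles': 'true'}
```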
Getting ready
Although it is possible to run the mongo shell without connecting to the MongoDB server
using mongo --nodb, we would rarely need to do so. To start a server on the localhost
without much of a hassle, take a look at the first recipe, Installing single node MongoDB,
and start the server.
How to do it
1. First, we create a simple JavaScript file and call it hello.js. Type the following body
in the hello.js file:
function sayHello(name) {
print('Hello ' + name + ', how are you?')
}
5. Test the database that the shell is connected to by typing the following command:
> db
Note: This book was written with MongoDB version 3.0.2. There is a
good chance that you may be using a later version and hence see a
different version number in the mongo shell.
How it works
The JavaScript function that we executed here is of no practical use and is just used to
demonstrate how a function can be preloaded on the startup of the shell. There could be
multiple functions in the .js file containing valid JavaScript code, possibly some complex business logic.
On executing the mongo command without any arguments, we connect to the MongoDB server running on localhost and listening for new connections on the default port 27017.
Generally speaking, the format of the command is as follows:
mongo <options> <db address> <.js files>
In cases where there are no arguments passed to the mongo executable, it is equivalent to
the passing of the db address as localhost:27017/test.
Let's look at an example value of the db address command-line option and its interpretation:

mydb: This will connect to the server running on localhost and listening for connections on port 27017. The database connected to will be mydb.
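To illustrate how such a db address breaks down, the following Python sketch (a hypothetical parser, not the mongo shell's actual code) applies the same defaults of localhost, 27017, and test:

```python
def parse_db_address(address):
    """Split a mongo shell db address into (host, port, database).

    Mirrors the shell's defaults: localhost, port 27017, database test.
    A bare name such as "mydb" is treated as the database name.
    """
    host, port, database = "localhost", 27017, "test"
    if address:
        if "/" in address:
            hostpart, database = address.split("/", 1)
        elif ":" in address:
            hostpart = address
        else:
            hostpart, database = "", address  # bare name is the database
        if hostpart:
            if ":" in hostpart:
                host, portstr = hostpart.split(":", 1)
                port = int(portstr)
            else:
                host = hostpart
    return host, port, database

print(parse_db_address("mydb"))                  # ('localhost', 27017, 'mydb')
print(parse_db_address("localhost:27000/test"))  # ('localhost', 27000, 'test')
```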
Now, there are quite a few options available on the mongo client too. We will see a few of
them in the following table:
--help or -h: This shows information about the various command-line options available for the mongo shell.

--shell: This is used along with JavaScript files passed as arguments; once the scripts have executed, the shell stays open instead of exiting.

--port: This specifies the port of the mongo server where the client needs to connect.

--host: This specifies the hostname of the mongo server where the client needs to connect. If the db address is provided with the hostname, port, and database, then neither the --host nor the --port option needs to be specified.

--username or -u: This is relevant when security is enabled for mongo. It is used to provide the username of the user to be logged in.

--password or -p: This is relevant when security is enabled for mongo. It is used along with --username to provide the password of the user to be logged in.
Getting ready
The following are the prerequisites for this recipe:
Use the latest version of Maven available. Version 3.3.3 was the latest at the time of
writing this book.
MongoDB Java driver version 3.0.1 was the latest at the time of writing this book.
The Mongo server is up and running on localhost and port 27017. Take a look at the
first recipe, Installing single node MongoDB, and start the server.
How to do it
1. Install the latest version of JDK from https://www.java.com/en/download/
if you don't already have it on your machine. We will not be going through the steps
to install JDK in this recipe, but before moving on with the next step, JDK should be
present.
2. Maven needs to be downloaded from http://maven.apache.org/download.cgi. Choose the binaries in a .tar.gz or .zip format and download them. This recipe was executed on a machine running the Windows platform, so these steps cover installation on Windows.
3. Once the archive has been downloaded, we need to extract it and put the absolute
path of the bin folder in the extracted archive in the operating system's path
variable. Maven also needs the path of JDK to be set as the JAVA_HOME environment
variable. Remember to set the root of your JDK as the value of this variable.
4. All we need to do now is type mvn -version on the command prompt, and if we see
the output that begins with something as follows, we have successfully set up maven:
> mvn -version
5. At this stage, we have maven installed, and we are now ready to create our simple
project to write our first Mongo client in Java. We start by creating a project
folder. Let's say that we create a folder called Mongo Java. Then we create a folder
structure, src/main/java, in this project folder. The root of the project folder
then contains a file called pom.xml. Once this folder's creation is done, the folder
structure should look as follows:
Mongo Java
+--src
|   +--main
|       +--java
+--pom.xml
6. We just have the project skeleton with us. We shall now add some content to the
pom.xml file. Not much is needed for this. The following content is all we need
in the pom.xml file:
<project>
  <modelVersion>4.0.0</modelVersion>
  <name>Mongo Java</name>
  <groupId>com.packtpub</groupId>
  <artifactId>mongo-cookbook-java</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <dependencies>
    <dependency>
      <groupId>org.mongodb</groupId>
      <artifactId>mongo-java-driver</artifactId>
      <version>3.0.1</version>
    </dependency>
  </dependencies>
</project>
7. We finally write our Java client that will be used to connect to the Mongo server and execute some very basic operations. The following is the Java class in the src/main/java location, in the com.packtpub.mongo.cookbook package; the name of the class is FirstMongoClient:
package com.packtpub.mongo.cookbook;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

import java.net.UnknownHostException;
import java.util.List;
/**
 * Simple Mongo Java client
 */
public class FirstMongoClient {
    /**
     * Main method for the First Mongo Client. Here we shall be connecting
     * to a mongo instance running on localhost and port 27017.
     *
     * @param args
     */
    public static final void main(String[] args) throws UnknownHostException {
        MongoClient client = new MongoClient("localhost", 27017);
        DB testDB = client.getDB("test");
        System.out.println("Dropping person collection in test database");
        DBCollection collection = testDB.getCollection("person");
        collection.drop();
        System.out.println("Adding a person document in the person collection of test database");
        DBObject person =
                new BasicDBObject("name", "Fred").append("age", 30);
        collection.insert(person);
        System.out.println("Now finding a person using findOne");
        person = collection.findOne();
        if (person != null) {
            System.out.printf("Person found, name is %s and age is %d\n",
                    person.get("name"), person.get("age"));
        }
        List<String> databases = client.getDatabaseNames();
        System.out.println("Database names are");
        int i = 1;
        for (String database : databases) {
            System.out.println(i++ + ": " + database);
        }
        System.out.println("Closing client");
        client.close();
    }
}
8. It's now time to execute the preceding Java code. We will execute it using maven from
the shell. You should be in the same directory as pom.xml of the project:
mvn compile exec:java -Dexec.mainClass=com.packtpub.mongo.cookbook.FirstMongoClient
How it works
These were quite a lot of steps to follow. Let's look at some of them in more detail. Everything up
to step 6 is straightforward and doesn't need any explanation. Let's look at step 7 onwards.
The pom.xml file that we have here is pretty simple. We defined a dependency on
mongo's Java driver. It relies on the online repository, repo.maven.apache.org, to
resolve the artifacts. For a local repository, all we need to do is define the repositories and
pluginRepositories tags in pom.xml. For more information on maven, refer to the maven
documentation at http://maven.apache.org/guides/index.html.
For the Java class, the com.mongodb.MongoClient class is the backbone. We first
instantiate it using one of its overloaded constructors giving the server's host and port. In this
case, the hostname and port were not really needed as the values provided are the default
values anyway, and the no-argument constructor would have worked well too. The following
code snippet instantiates this client:
MongoClient client = new MongoClient("localhost", 27017);
Before we insert a document, we will drop the collection so that even upon multiple executions
of the program, we will have just one document in the person collection. The collection is
dropped using the drop() method on the DBCollection object's instance. Next, we create
an instance of com.mongodb.DBObject. This is an object that represents the document to
be inserted into the collection. The concrete class used here is BasicDBObject, which is a
type of java.util.LinkedHashMap, where the key is String and the value is Object. The
value can be another DBObject too, in which case, it is a document nested within another
document. In our case, we have two keys, name and age, which are the field names in the
document to be inserted and the values are of the String and Integer types, respectively.
The append method of BasicDBObject adds a new key value pair to the BasicDBObject
instance and returns the same instance, which allows us to chain the append method calls
to add multiple key value pairs. This created DBObject is then inserted into the collection
using the insert method. This is how we instantiated DBObject for the person collection and
inserted it into the collection as follows:
DBObject person = new BasicDBObject("name", "Fred").append("age", 30);
collection.insert(person);
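The chained append calls work because append returns the same instance. The pattern is easy to mimic in a few lines; as a rough illustration (plain Python standing in for the Java driver's BasicDBObject, which is a kind of linked hash map), consider:

```python
class BasicDocument(dict):
    """A tiny stand-in for BasicDBObject: a map whose append returns self,
    which is what makes call chaining possible."""

    def __init__(self, key=None, value=None):
        super().__init__()
        if key is not None:
            self[key] = value

    def append(self, key, value):
        self[key] = value
        return self  # returning self enables chaining

person = BasicDocument("name", "Fred").append("age", 30)
print(person)  # {'name': 'Fred', 'age': 30}
```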
The findOne method on DBCollection is straightforward and returns one document from
the collection. This version of findOne doesn't accept DBObject (which otherwise acts
as a query executed before a document is selected and returned) as a parameter. This is synonymous with running db.person.findOne() from the shell.
Finally, we simply invoke getDatabaseNames to get a list of the database names on the server. At this point, we should have at least the test and local databases in the returned result. Once all the operations are complete, we close the client. The MongoClient class is thread-safe and generally one instance is used per application. To execute the program, we use Maven's exec plugin. On executing step 8, we should see the following lines toward the end in the console:
[INFO] [exec:java {execution: default-cli}]
--snip--
Dropping person collection in test database
Adding a person document in the person collection of test database
Now finding a person using findOne
Person found, name is Fred and age is 30
Database names are
1: local
2: test
INFO: Closed connection [connectionId{localValue:2, serverValue:2}] to
localhost:27017 because the pool has been closed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3 seconds
[INFO] Finished at: Tue May 12 07:33:00 UTC 2015
[INFO] Final Memory: 22M/53M
[INFO] ------------------------------------------------------------------------
Getting ready
The following are the prerequisites for this recipe:
The Mongo server is up and running on localhost and port 27017. Take a look at the
first recipe, Installing single node MongoDB, and start the server.
How to do it
1. Depending on your operating system, install the pip utility, say, on the Ubuntu/Debian
system. You can use the following command to install pip:
> apt-get install python-pip
3. Lastly, create a new file called my_client.py and type in the following code:
from __future__ import print_function
import pymongo

# Connect to server
client = pymongo.MongoClient('localhost', 27017)

# Select the database
testdb = client.test

# Drop collection
print('Dropping collection person')
testdb.person.drop()

# Add a person
print('Adding a person to collection person')
employee = dict(name='Fred', age=30)
testdb.person.insert(employee)

# Fetch the first entry from collection
person = testdb.person.find_one()
if person:
    print('Name: %s, Age: %s' % (person['name'], person['age']))

# Fetch list of all databases
print('DB\'s present on the system:')
for db in client.database_names():
    print('    %s' % db)

# Close connection
print('Closing client connection')
client.close()
How it works
We start off by installing the Python MongoDB driver, pymongo, on the system with the help of
the pip package manager. In the given Python code, we begin by importing print_function
from the __future__ module to allow compatibility with Python 3.x. Next, we import
pymongo so that it can be used in the script.
We instantiate pymongo.MongoClient() with localhost and 27017 as the mongo server
host and port, respectively. In pymongo, we can directly refer to the database and its
collection by using the <client>.<database_name>.<collection_name> convention.
In our recipe, we used the client handler to select the database test simply by referring to
client.test. This returns a database object even if the database does not exist. As a part
of this recipe, we drop the collection by calling testdb.person.drop(), where testdb is
a reference to client.test and person is a collection that we wish to drop. For this recipe,
we are intentionally dropping the collection so that recurring runs will always yield one record
in the collection.
Next, we instantiate a dictionary called employee with a few values such as name and age. We then add this entry to our person collection using the insert() method.
As we now know that there is an entry in the person collection, we will fetch one document
using the find_one() method. This method returns the first document in the collection,
depending on the order of documents stored on the disk.
Following this, we also get the list of all the databases by calling the database_names() method on the client. This method returns a list of database names present on the server.
This method may come in handy when you are trying to assert the existence of a database
on the server.
Finally, we close the client connection using the close() method.
Getting ready
Though not a prerequisite, taking a look at the Starting a single node instance using
command-line options recipe will definitely make things easier just in case you are not
aware of various command-line options and their significance while starting a mongo server.
Additionally, the necessary binaries and setup mentioned in the single server recipe must be in place before we continue with this recipe. Let's sum up what we need to do.
We will start three mongod processes (mongo server instances) on our localhost.
We will create three data directories, /data/n1, /data/n2, and /data/n3 for Node1, Node2, and Node3, respectively. Similarly, we will redirect the logs to /logs/n1.log, /logs/n2.log, and /logs/n3.log. The following image will give you an idea of how the cluster would look:
[Figure: A three-node replica set on localhost. Clients connect to the primary, N1 (port 27000, /data/n1, /logs/n1.log). N2 (port 27001, /data/n2, /logs/n2.log) and N3 (port 27002, /data/n3, /logs/n3.log) are slave/secondary nodes, which read-only clients can query.]
How to do it
Let's take a look at the steps in detail:
1. Create the /data/n1, /data/n2, /data/n3, and /logs directories for the data and
logs of the three nodes respectively. On the Windows platform, you can choose the
c:\data\n1, c:\data\n2, c:\data\n3, and c:\logs\ directories or any other
directory of your choice for the data and logs respectively. Ensure that these directories
have appropriate write permissions for the mongo server to write the data and logs.
2. Start the three servers as follows. Users on the Windows platform need to skip the
--fork option as it is not supported:
$ mongod --replSet repSetTest --dbpath /data/n1 --logpath /logs/n1.log --port 27000 --smallfiles --oplogSize 128 --fork
$ mongod --replSet repSetTest --dbpath /data/n2 --logpath /logs/n2.log --port 27001 --smallfiles --oplogSize 128 --fork
$ mongod --replSet repSetTest --dbpath /data/n3 --logpath /logs/n3.log --port 27002 --smallfiles --oplogSize 128 --fork
3. Start the mongo shell and connect to any of the mongo servers running. In this case,
we connect to the first one (listening to port 27000). Execute the following command:
$ mongo localhost:27000
4. Try to execute an insert operation from the mongo shell after connecting to it:
> db.person.insert({name:'Fred', age:35})
This operation should fail as the replica set has not been initialized yet. More
information can be found in the How it works section.
5. The next step is to start configuring the replica set. We start by preparing a JSON
configuration in the shell as follows:
cfg = {
    '_id': 'repSetTest',
    'members': [
        {'_id': 0, 'host': 'localhost:27000'},
        {'_id': 1, 'host': 'localhost:27001'},
        {'_id': 2, 'host': 'localhost:27002'}
    ]
}
6. The last step is to initiate the replica set with the preceding configuration as follows:
> rs.initiate(cfg)
7. Execute rs.status() after a few seconds on the shell to see the status. In a few seconds, one of the nodes should become the primary and the remaining two should become secondaries.
How it works
The command-line options used here were described in detail in the Starting a single node instance using command-line options recipe earlier.
As we are starting three independent mongod services, we have three dedicated database
paths on the filesystem. Similarly, we have three separate log file locations for each of the
processes. We then start three mongod processes with the database and log file path
specified. As this setup is for test purposes and is started on the same machine, we use the
--smallfiles and --oplogSize options. As these processes are running on the same host,
we also choose the ports explicitly to avoid port conflicts. The ports that we chose here were
27000, 27001, and 27002. When we start the servers on different hosts, we may or may not
choose a separate port. We can very well choose to use the default one whenever possible.
The --fork option demands some explanation. By choosing this option, we start the server
as a background process from our operating system's shell and get the control back in the
shell where we can then start more such mongod processes or perform other operations.
In the absence of the --fork option, we cannot start more than one process per shell and
would need to start three mongod processes in three separate shells.
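Since the three invocations differ only in their data directory, log file, and port, they are easy to generate programmatically. The following Python sketch (a hypothetical helper, not part of MongoDB) builds them:

```python
def replica_member_command(replset, node, port):
    """Build the startup command for one replica set member.

    Each member differs only in its data directory, log file, and port.
    """
    return ("mongod --replSet {rs} --dbpath /data/{n} --logpath /logs/{n}.log "
            "--port {p} --smallfiles --oplogSize 128 --fork").format(
                rs=replset, n=node, p=port)

# Generate the three commands used in this recipe
for i, port in enumerate([27000, 27001, 27002], start=1):
    print(replica_member_command("repSetTest", "n%d" % i, port))
```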
If we take a look at the logs generated in the log directory, we should see the following lines
in it:
[rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
[rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Though we started three mongod processes with the --replSet option, we still haven't
configured them to work with each other as a replica set. This command-line option is just
used to tell the server on startup that this process will be running as a part of a replica set.
The name of the replica set is the same as the value of this option passed on the command
prompt. This also explains why the insert operation executed on one of the nodes failed before
the replica set was initialized. In a MongoDB replica set, there can be only one primary node,
where all the inserting and querying happens. In the image shown, the N1 node is the
primary and listens on port 27000 for client connections. All the other nodes are slave/
secondary instances, which sync themselves with the primary; hence, querying is
disabled on them by default. Only when the primary goes down does one of the secondaries
take over and become the primary node. However, it is possible to query a secondary for
data, as we have shown in the image; we will see how to query from a secondary instance in
the next recipe.
Chapter 1
Well, all that is left now is to configure the replica set by grouping the three processes that we
started. This is done by first defining a JSON object as follows:
cfg = {
    '_id': 'repSetTest',
    'members': [
        {'_id': 0, 'host': 'localhost:27000'},
        {'_id': 1, 'host': 'localhost:27001'},
        {'_id': 2, 'host': 'localhost:27002'}
    ]
}
There are two fields, _id and members, for the unique ID of the replica set and an array of
the hostnames and port numbers of the mongod server processes as part of this replica
set, respectively. Using localhost to refer to hosts is not a very good idea and is usually
discouraged; however, as we started all the processes on the same machine, we are ok with
it here. It is preferable to refer to hosts by their hostnames even if they are running on
localhost. Note that you cannot mix localhost and hostnames in the same configuration; it is
either all hostnames or all localhost. To configure
the replica set, we then connect to any one of the three running mongod processes; in this
case, we connect to the first one and then execute the following from the shell:
> rs.initiate(cfg)
The _id field in the cfg object passed has the same value as the one we gave to the
--replSet option on the command prompt when we started the server processes. Giving a
different value would throw the following error:
{
    "ok" : 0,
    "errmsg" : "couldn't initiate : set name does not match the set name host Amol-PC:27000 expects"
}
If all goes well and the initiate call is successful, we should see something similar to the
following JSON response on the shell:
{"ok" : 1}
In a few seconds, you should see a different prompt for the shell that we executed this
command from. It should now become a primary or secondary. The following is an example
of the shell connected to a primary member of the replica set:
repSetTest:PRIMARY>
Executing rs.status() should give us some stats on the replica set's status, which we
will explore in depth in a recipe in the administration section later in the book. For now, the
stateStr field is the important one; it contains values such as PRIMARY and SECONDARY.
There's more
Look at the Connecting to the replica set in the shell to query and insert data recipe to perform
more operations from the shell after connecting to a replica set. Replication isn't as simple as
we saw here. See the administration section for more advanced recipes on replication.
See also
If you are looking to convert a standalone instance to a replica set, then the instance
with the data needs to become a primary first, and then empty secondary instances will be
added to which the data will be synchronized. Refer to the following URL on how to perform
this operation:
http://docs.mongodb.org/manual/tutorial/convert-standalone-to-replica-set/
Getting ready
The prerequisite for this recipe is that the replica set should be set up and running. Refer to
the previous recipe, Starting multiple instances as part of a replica set, for details on how to
start the replica set.
How to do it
1. We will start two shells here, one for PRIMARY and one for SECONDARY. Execute the
following command on the command prompt:
> mongo localhost:27000
2. The prompt of the shell tells us whether the server to which we have
connected is PRIMARY or SECONDARY. It should show the replica set's name
followed by a :, followed by the server state. In this case, if the replica set is
initialized, up, and running, we should see either repSetTest:PRIMARY> or
repSetTest:SECONDARY>.
3. If the first server we connected to is a secondary, we need to find the
primary. Execute the rs.status() command in the shell and look for the
stateStr field. This should give us the primary server. Use the mongo shell to
connect to this server.
4. At this point, we should have two shells running, one connected to the primary and
the other connected to a secondary.
5. In the shell connected to the primary node, execute the following insert:
repSetTest:PRIMARY> db.replTest.insert({_id:1, value:'abc'})
6. There is nothing special about this. We just inserted a small document in a collection
that we will use for the replication test.
7. By executing the following query on the primary, we should get the following result:
repSetTest:PRIMARY> db.replTest.findOne()
{ "_id" : 1, "value" : "abc" }
8. So far, so good. Now, we will go to the shell that is connected to the SECONDARY node
and execute the following query; it fails because, by default, queries are not allowed
on secondary nodes:
repSetTest:SECONDARY> db.replTest.findOne()
9. To allow queries on this secondary node, execute the following in the same shell:
repSetTest:SECONDARY> rs.slaveOk(true)
10. Execute the query that we executed in step 7 again on the shell. This should now get
the results as follows:
repSetTest:SECONDARY> db.replTest.findOne()
{ "_id" : 1, "value" : "abc" }
11. Execute the following insert on the secondary node; it should fail with the
following message:
repSetTest:SECONDARY> db.replTest.insert({_id:1, value:'abc'})
not master
How it works
We have done a lot of things in this recipe, and we will try to throw some light on some of the
important concepts to remember.
See also
The next recipe, Connecting to the replica set to query and insert data from a Java client, is
about connecting to a replica set from a Java client.
Getting ready
We need to take a look at the Connecting to the single node using a Java client recipe as
it contains all the prerequisites and steps to set up Maven and other dependencies. As we
are dealing with a Java client for replica sets, a replica set must be up and running. Refer to
the Starting multiple instances as part of a replica set recipe for details on how to start the
replica set.
How to do it
1. Write/copy the following piece of code: (This Java class is also available for download
from the Packt website.)
package com.packtpub.mongo.cookbook;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;

import java.util.Arrays;

/**
 *
 */
public class ReplicaSetMongoClient {

    /**
     * Main method for the test client connecting to the replica set.
     * @param args
     */
    public static final void main(String[] args) throws Exception {
        MongoClient client = new MongoClient(
            Arrays.asList(
                new ServerAddress("localhost", 27000),
                new ServerAddress("localhost", 27001),
                new ServerAddress("localhost", 27002)
            )
        );
        DB testDB = client.getDB("test");
2. Connect to any of the nodes in the replica set, say to localhost:27000, and
execute rs.status() from the shell. Take a note of the primary instance in the
replica set and connect to it from the shell if localhost:27000 is not a primary.
Here, switch to the administrator database as follows:
repSetTest:PRIMARY> use admin
3. We now execute the preceding program from the operating system shell as follows:
$ mvn compile exec:java -Dexec.mainClass=com.packtpub.mongo.cookbook.ReplicaSetMongoClient
4. Shut down the primary instance by executing the following on the mongo shell that is
connected to the primary:
repSetTest:PRIMARY> db.shutdownServer()
How it works
An interesting thing to observe is how we instantiate the MongoClient instance. It is done
as follows:
MongoClient client = new MongoClient(Arrays.asList(
new ServerAddress("localhost", 27000),
new ServerAddress("localhost", 27001),
new ServerAddress("localhost", 27002)));
As we can see, the query in the loop was interrupted when the primary node went down.
However, the client switched to the new primary seamlessly. Well, nearly seamlessly, as the
client might have to catch an exception and retry the operation after a predetermined interval
has elapsed.
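The catch-and-retry behavior mentioned above can be sketched as follows. This is a minimal illustration of the pattern, not the driver's actual logic; with_retry and flaky_find_one are invented names, and a real client would catch the driver's specific exception (for example, pymongo's AutoReconnect) rather than Python's built-in ConnectionError:

```python
import time

def with_retry(operation, retries=3, interval=0.01):
    # Retry an operation that can fail while a new primary is elected.
    for _ in range(retries):
        try:
            return operation()
        except ConnectionError:
            # The old primary is gone; wait briefly for the election.
            time.sleep(interval)
    # Final attempt; let the exception propagate if it still fails.
    return operation()

# Simulate a query that fails twice during failover, then succeeds.
state = {'calls': 0}

def flaky_find_one():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ConnectionError('not master')
    return {'_id': 1, 'value': 'abc'}

result = with_retry(flaky_find_one)
```

The predetermined interval simply gives the remaining members time to hold an election before the operation is attempted again.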
Getting ready
Refer to the Connecting to the single node using a Python client recipe as it describes how to
set up and install PyMongo, the Python driver for MongoDB. Additionally, a replica set must
be up and running. Refer to the Starting multiple instances as part of a replica set recipe for
details on how to start the replica set.
How to do it
1. Write/copy the following piece of code to replicaset_client.py: (This script is
also available for download from the Packt website.)
from __future__ import print_function
import pymongo
import time

# Instantiate MongoClient with a list of server addresses
client = pymongo.MongoClient(
    ['localhost:27002', 'localhost:27001', 'localhost:27000'],
    replicaSet='repSetTest')

# Select the collection and drop it before using
collection = client.test.repTest
collection.drop()

# Insert a record
collection.insert_one(dict(name='Foo', age='30'))

for x in range(5):
    try:
        print('Fetching record: %s' % collection.find_one())
    except Exception as e:
        print('Could not connect to primary')
    time.sleep(3)
2. Connect to any of the nodes in the replica set, say to localhost:27000, and
execute rs.status() from the shell. Take a note of the primary instance in the
replica set and connect to it from the shell, if localhost:27000 is not a primary.
Here, switch to the administrator database as follows:
repSetTest:PRIMARY> use admin
3. We now execute the preceding script from the operating system shell as follows:
$ python replicaset_client.py
4. Shut down the primary instance by executing the following on the mongo shell that is
connected to the primary:
repSetTest:PRIMARY> db.shutdownServer()
5. Watch the output on the console where the Python script is executed.
How it works
You will notice that, in this script, we instantiated the mongo client by giving a list of
hosts instead of a single host. As of version 3.0, the pymongo driver's MongoClient()
class can accept either a list of hosts or a single host during initialization, deprecating
MongoReplicaSetClient(). The client will attempt to connect to the first host in the list,
and if successful, it will be able to determine the other nodes in the replica set. We are also
passing the replicaSet='repSetTest' parameter explicitly, ensuring that the client
checks whether the connected node is a part of this replica set.
Once connected, we perform normal database operations such as selecting the test database,
dropping the repTest collection, and inserting a single document into the collection.
In the preceding output, the client gets disconnected from the primary node midway. However,
very soon, a new primary node is selected by the remaining nodes and the mongo client is
able to resume the connection.
Now, consider an archiving application that needs to store the details of all the requests that
hit a particular website over the past decade. For each request hitting the website, we create
a new record in the underlying data store. Assuming that each record is 250 bytes and an
average load of three million requests per day, we will cross the 1 TB data mark in under
four years. This data would be used for various analytics purposes and might be frequently
queried. The query performance should not be drastically affected as the data size
increases. If the system is able to cope with this increasing data volume and still give
performance comparable to its performance on low data volumes, the system is said to have
scaled up well.
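A quick back-of-the-envelope check of that estimate (raw document bytes only; indexes and storage overhead would push the real on-disk figure higher):

```python
# Assumptions from the text: 250-byte records, three million requests/day.
record_size_bytes = 250
requests_per_day = 3_000_000

bytes_per_day = record_size_bytes * requests_per_day  # 750 MB per day
one_tb = 10 ** 12

days_to_one_tb = one_tb / bytes_per_day
years_to_one_tb = days_to_one_tb / 365

print(round(years_to_one_tb, 2))  # roughly 3.65 years of raw data
```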
Now that we have seen in brief what scalability is, let me tell you that sharding is a
mechanism that lets a system scale to meet increasing demands. The crux lies in the fact that
the entire data is partitioned into smaller segments and distributed across various nodes
called shards. Suppose that we have a total of 10 million documents in a mongo collection.
If we shard this collection across 10 shards, then we will ideally have 10,000,000 / 10 =
1,000,000 documents on each shard. At any given point in time, a document will reside on
exactly one shard (which by itself will be a replica set in a production system). However, there
is some magic involved that keeps this concept hidden from the developer who is querying
the collection and who gets one unified view of the collection irrespective of the number of
shards. Based on the query, it is mongo that decides which shard to query for the data and
returns the entire result set. With this background, let's set up a simple shard and take a
closer look at it.
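The unified-view idea can be sketched with a toy in-memory model; TinyShardedCollection is an invented class purely for illustration and bears no relation to MongoDB's real chunk-based partitioning:

```python
class TinyShardedCollection:
    """Toy model of a sharded collection: documents are partitioned
    across shards, but find() presents one unified view (the role
    mongos plays in a real deployment)."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def insert(self, doc):
        # Partition on _id so each document lives on exactly one shard.
        self.shards[doc['_id'] % len(self.shards)].append(doc)

    def find(self, predicate):
        # Scatter the query to every shard and gather the results.
        return [doc for shard in self.shards
                for doc in shard if predicate(doc)]

coll = TinyShardedCollection(num_shards=10)
for i in range(10_000):
    coll.insert({'_id': i})

per_shard = [len(s) for s in coll.shards]   # 1,000 documents each
matches = coll.find(lambda d: d['_id'] < 5)
```

The caller of find() never sees which shard a document came from, which is exactly the transparency the text describes.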
Getting ready
Apart from the MongoDB server already being installed, there are no prerequisites from a
software perspective. We will be creating two data directories, one for each shard, plus a
data directory for the configuration server and a directory for logs.
How to do it
1. We start by creating directories for the logs and data. Create the /data/s1/db,
/data/s2/db, and /logs directories. On Windows, we can have c:\data\s1\db
and so on for the data and log directories. There is also a configuration server
that is used in the sharded environment to store some metadata. We will use
/data/con1/db as the data directory for the configuration server.
2. Start the following mongod processes, one for each of the two shards, one for the
configuration database, and one mongos process. For the Windows platform, skip the
--fork parameter as it is not supported.
$ mongod --shardsvr --dbpath /data/s1/db --port 27000 --logpath /logs/s1.log --smallfiles --oplogSize 128 --fork
$ mongod --shardsvr --dbpath /data/s2/db --port 27001 --logpath /logs/s2.log --smallfiles --oplogSize 128 --fork
$ mongod --configsvr --dbpath /data/con1/db --port 25000 --logpath /logs/con1.log --fork
$ mongos --configdb localhost:25000 --logpath /logs/mongos.log --fork
3. From the command prompt, execute the following command. This should show a
mongos prompt as follows:
$ mongo
MongoDB shell version: 3.0.2
connecting to: test
mongos>
4. Finally, we set up the shard. From the mongos shell, execute the following two
commands:
mongos> sh.addShard("localhost:27000")
mongos> sh.addShard("localhost:27001")
5. On each addition of a shard, we should get an ok reply. The following JSON message
should be seen giving the unique ID for each shard added:
{ "shardAdded" : "shard0000", "ok" : 1 }
How it works
Let's see what we did in the process. We created three directories for data (two for the
shards and one for the configuration database) and one directory for logs. We can have a shell
script or batch file to create the directories as well. In fact, in large production deployments,
setting up shards manually is not only time-consuming but also error-prone.
Downloading the example code
You can download the example code files for all Packt books you have
purchased from your account at http://www.packtpub.com.
If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed
directly to you.
Let's try to get a picture of what exactly we have done and are trying to achieve. The following
is an image of the shard setup that we just did:
[Figure: two shard servers (Shard 1 and Shard 2) and a Config server connected to a mongos process, with Client 1 through Client n connecting only to mongos]
If we look at the preceding image and the servers started in step 2, we have shard
servers that would store the actual data in the collections. These were the first two of the
four processes that we started listening to ports 27000 and 27001. Next, we started a
configuration server that is seen on the left side in this image. It is the third server of the
four servers started in step 2 and it listens to port 25000 for the incoming connections. The
sole purpose of this database is to maintain the metadata about the shard servers. Ideally,
only the mongos process or drivers connect to this server for the shard details/metadata and
the shard key information. We will see what a shard key is in the next recipe, where we play
around with a sharded collection and see the shards that we have created in action.
Finally, we have a mongos process. This is a lightweight process that doesn't do any
persistence of data and just accepts connections from clients. This is the layer that acts as
a gatekeeper and abstracts the client from the concept of shards. For now, we can view it as
basically a router that consults the configuration server and takes the decision to route the
client's query to the appropriate shard server for execution. It then aggregates the result from
various shards if applicable and returns the result to the client. It is safe to say that no client
connects directly to the configuration or shard servers; in fact, no one ideally should connect
to these processes directly except for some administration operations. Clients simply connect
to the mongos process and execute their queries and insert or update operations.
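The router role described above can be sketched as a lookup over chunk metadata. The chunk ranges below are invented for illustration; real config-server metadata is richer, and chunks move between shards as the balancer runs:

```python
# A toy routing table in the spirit of what mongos fetches from the
# config server: each chunk is a range of shard-key values mapped to
# the shard that currently holds it.
CHUNKS = [
    {'min': float('-inf'), 'max': 0,            'shard': 'shard0000'},
    {'min': 0,             'max': 1000,         'shard': 'shard0001'},
    {'min': 1000,          'max': float('inf'), 'shard': 'shard0000'},
]

def route(shard_key_value):
    # Pick the shard whose chunk range contains the key value.
    for chunk in CHUNKS:
        if chunk['min'] <= shard_key_value < chunk['max']:
            return chunk['shard']

targeted = route(42)                    # falls in the [0, 1000) chunk
scatter = {c['shard'] for c in CHUNKS}  # a query without the shard key
                                        # must be sent to every shard
```

When the query carries the shard key, the router can target a single shard; otherwise, it scatters the query to all shards and gathers the results, as the text describes.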
[Figure: a scaled-out deployment with Shard 1 through Shard n and multiple mongos processes, each client connecting to one of the mongos processes]
There's more
What good is a shard unless we put it into action and see what happens from the shell on
inserting and querying the data? In the next recipe, we will make use of the shard set up here,
add some data, and see it in action.
Getting ready
Obviously, we need a sharded mongo server setup that is up and running. See the previous
recipe, Starting a simple sharded environment of two shards, for more details on how to set
up a simple shard. The mongos process, as in the previous recipe, should be listening on
port number 27017. We have some names in a JavaScript file called names.js. This file needs
to be downloaded from the Packt website and kept on the local filesystem. The file contains
a variable called names and the value is an array with some JSON documents as the values,
each one representing a person. The contents look as follows:
names = [
{name:'James Smith', age:30},
{name:'Robert Johnson', age:22},
How to do it
1. Start the mongo shell and connect to the default port on localhost as follows. This will
ensure that the names will be available in the current shell:
mongo --shell names.js
MongoDB shell version: 3.0.2
connecting to: test
mongos>
2. Switch to the database that would be used to test the sharding; we call it shardDB:
mongos> use shardDB
3. Enable sharding on the database:
mongos> sh.enableSharding("shardDB")
4. Shard the person collection on the hashed name field:
mongos> sh.shardCollection("shardDB.person", {name: "hashed"}, false)
5. Load the test data by repeatedly inserting the documents in the names array until
the person collection contains 300,000 documents.
6. Execute the following to see the data distribution and the number of documents on each shard:
mongos> db.person.getShardDistribution()
How it works
This recipe demands some explanation. We downloaded a JavaScript file that defines an array
of 20 people. Each element of the array is a JSON object with the name and age attributes.
We start the shell connecting to the mongos process loaded with this JavaScript file. We then
switch to shardDB, which we use for the purpose of sharding.
For a collection to be sharded, the database in which it will be created needs to be enabled
for the sharding first. We do this using sh.enableSharding().
The next step is to enable the collection to be sharded. By default, all the data will be kept on
one shard and not split across different shards. Think about it; how will Mongo be able to split
the data meaningfully? The whole intention is to split it meaningfully and as evenly as possible
so that whenever we query based on the shard key, Mongo would easily be able to determine
which shard(s) to query. If a query doesn't contain the shard key, the execution of the query
will happen on all the shards and the data would then be collated by the mongos process
before returning it to the client. Thus, choosing the right shard key is very crucial.
Let's now see how to shard the collection. We do this by invoking
sh.shardCollection("shardDB.person", {name: "hashed"}, false). There are
three parameters here:
The fully qualified name of the collection in the <db name>.<collection name>
format is the first parameter of the shardCollection method.
The second parameter is the field name to shard on in the collection. This is the field
that would be used to split the documents on the shards. One of the requirements
of a good shard key is that it should have high cardinality. (The number of possible
values should be high.) In our test data, the name value has very low cardinality and
thus is not a good choice as a shard key. We hash this key when using this as a shard
key. We do so by mentioning the key as {name: "hashed"}.
The last parameter specifies whether the value used as the shard key is unique or
not. The name field is definitely not unique and thus it will be false. If the field was,
say, the person's social security number, it could have been set as true. Additionally,
SSN is a good choice for a shard key due to its high cardinality. Remember that the
shard key has to be present in a query for the query to be executed efficiently.
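The effect of hashing a low-cardinality key can be sketched as follows; hashed_shard is an invented helper that hashes with MD5 purely for illustration (MongoDB's hashed indexes use a different hash function internally):

```python
import hashlib

def hashed_shard(value, num_shards=2):
    # Hash the field value and map it onto one of the shards.
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_shards

# A low-cardinality key: 20 distinct names over 300,000 documents.
names = ['name%02d' % (i % 20) for i in range(300_000)]

counts = {}
for name in names:
    shard = hashed_shard(name)
    counts[shard] = counts.get(shard, 0) + 1

# Hashing spreads the 20 distinct values across the shards, but all
# documents sharing a name value still land on the same shard, so
# per-shard counts can only move in multiples of 15,000.
print(counts)
```

This is why high cardinality matters: with only 20 distinct values, the finest split the hash can achieve is in blocks of 15,000 documents.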
The last step is to see how the data is distributed across the shards. With 300,000 documents,
we expect something around 150,000 documents on each shard. From the
distribution statistics, we can observe that shard0000 has 149,715 documents whereas
shard0001 has 150,285:
Shard shard0000 at localhost:27000
data : 15.99MiB docs : 149715 chunks : 2
estimated data per chunk : 7.99MiB
estimated docs per chunk : 74857
Totals
data : 32.04MiB docs : 300000 chunks : 4
Shard shard0000 contains 49.9% data, 49.9% docs in cluster, avg obj size
on shard : 112B
Shard shard0001 contains 50.09% data, 50.09% docs in cluster, avg obj
size on shard : 112B
There are a couple of additional things I would recommend you try.
Connect to the individual shards from the mongo shell and execute queries on the person
collection. Check that the counts in these collections are similar to what we see in the
preceding distribution. Additionally, you can verify that no document exists on both shards
at the same time.
We briefly discussed how cardinality affects the way the data is split across shards.
Let's do a simple exercise. We first drop the person collection and execute the
shardCollection operation again but, this time, with the {name: 1} shard key instead
of {name: "hashed"}. This ensures that the shard key is not hashed and is stored as is.
Now, load the data using the JavaScript function we used earlier in step 5, and then execute
the explain() command on the collection once the data is loaded. Observe how the data is
now split (or not) across the shards.
There's more
A lot of questions must now be coming up, such as: what are the best practices? What are
some tips and tricks? How does MongoDB pull off sharding behind the scenes in a way that
is transparent to the end user?
This recipe here only explained the basics. In the administration section, all such questions
will be answered.