Hadoop Interview Guide
Ace Your Hadoop Interview
Big Data has taken the world by storm and has been growing tremendously in the past decade. With Big Data comes the widespread adoption of Hadoop to solve major Big Data challenges. Hadoop is one of the most popular frameworks used to store, process, and analyze Big Data. Hence, there is always a demand for professionals to work in this field.

So, how do you get yourself a job in the field of Hadoop? If you have the requisite skills already, the next step is definitely cracking a Hadoop interview, and we've got that sorted for you. Go ahead and give a read to all the frequently asked Hadoop interview questions and answers!
Topics Covered

General Questions
MapReduce Questions
YARN Questions
Hive Questions
Pig Questions
HBase Questions
Sqoop Questions
General Questions
Q: What are the different vendor-specific distributions of Hadoop?

A: Different vendors in the market have packaged Apache Hadoop into cluster management solutions which allow them to easily deploy, manage, monitor, and upgrade the clusters. Some of the vendor-specific distributions are:

Cloudera
Hortonworks (it has now merged with Cloudera)
MapR
Microsoft Azure
IBM InfoSphere
Amazon Web Services

Apart from the above popular distributions, there are other distributions that run Apache Hadoop but package it as a solution, like an installer, so that you can easily set up clusters on a set of machines.

Q: What are the different configuration files in Hadoop?

A: The important configuration files in Hadoop are:

hadoop-env.sh
It contains information about environment variables such as the Java path, the process ID path, where the logs get stored, what kind of metrics will be collected, and so on.

core-site.xml
This file has the HDFS path and many other properties, like enabling trash, enabling high availability, etc.

hdfs-site.xml
This file has other information related to the Hadoop cluster, such as:
the replication factor
where the namenode will store its metadata on disk
if a datanode is running, where it will store the data
if a secondary namenode is running, where it will store a copy of the namenode's metadata, and so on

mapred-site.xml
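As a quick sanity check, you can read back what these files configure with the hdfs getconf utility (a minimal sketch; the property names are standard, but the values returned depend on your own configuration files):

hdfs getconf -confKey fs.defaultFS        # HDFS path set in core-site.xml
hdfs getconf -confKey dfs.replication     # replication factor set in hdfs-site.xml
hdfs getconf -namenodes                   # namenode host(s) for the cluster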
Q: What are the differences between the regular file system and HDFS?

A:
Regular File System: Data is stored and maintained on a single machine, with a small block size (typically a few kilobytes). If the machine fails, recovering the data is difficult, as there is no built-in fault tolerance.

HDFS: Data is distributed across multiple machines in a cluster and maintained in large blocks (128 MB by default in Hadoop 2 onwards). Each block is replicated across datanodes, so the data remains available and recoverable even if a node fails.
NameNode

NameNode is the master service that hosts metadata on disk and in RAM. It holds information about the various datanodes, their location, the size of each block, etc.

On disk, the namenode keeps an edit log (transaction log) and an fsimage (filesystem image). The edit log gets appended every time a write or any other namespace operation happens on HDFS. The metadata in RAM is rebuilt every time the cluster comes up and is used to serve all the client requests.
Datanode

Datanodes hold the actual data blocks and send heartbeats to the namenode every 3 seconds by default, along with periodic block reports. A datanode stores and retrieves blocks when asked by the namenode or by clients. It serves the clients' read and write requests and performs block creation, deletion, and replication as instructed by the namenode.
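A quick way to see this namenode/datanode view of the cluster is the standard HDFS admin report (output details vary by cluster):

hdfs dfsadmin -report    # lists live and dead datanodes, configured capacity, and per-node block usage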
Q: If you have an input file of 350 MB, how many input splits would be created by HDFS, and what would be the size of each input split?

A: By default, each HDFS block is processed as one input split. Assuming the default block size of 128 MB, a 350 MB file is divided into three input splits: 128 MB, 128 MB, and 94 MB.
Q: What do you mean by Rack Awareness in HDFS?

A: HDFS Rack Awareness refers to the knowledge of the different datanodes and how they are distributed across the racks of a Hadoop cluster. The namenode uses this information to place block replicas on different racks, so that data is not lost even if an entire rack fails.
Q: How can you restart the namenode and all the daemons in Hadoop?

A: You can stop the namenode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.

You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.
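On Hadoop 3.x the per-daemon scripts are deprecated in favour of the unified launchers (a minimal sketch, run from the Hadoop installation directory):

./bin/hdfs --daemon stop namenode
./bin/hdfs --daemon start namenode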
Q: Which command will help you find the status of blocks and filesystem health?

A: The hdfs fsck command checks the health of the HDFS filesystem and reports the status of blocks, including missing, corrupt, under-replicated, and over-replicated blocks.
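For example (the path is only an illustration):

hdfs fsck /user/test -files -blocks -locations    # reports block status and datanode locations for the given path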
Q: What would happen if you store too many small files in a cluster on HDFS?

A: As Hadoop is coded in Java, every directory, file, and file-related block is treated as an object. For every object within the Hadoop cluster, namenode RAM gets utilized: the more blocks there are, the more namenode RAM is used. Storing a lot of small files on HDFS therefore generates a lot of metadata. Keeping this metadata in RAM becomes a challenge, as each file, block, or directory takes about 150 bytes of metadata, so the cumulative size of all the metadata becomes too big.
Q: How do you copy data from the local system onto HDFS?

A: The following command helps copy data from the local file system into HDFS:

hadoop fs -copyFromLocal [source] [destination]

Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv

In the above syntax, the source is the local path and the destination is the HDFS path. Adding the -f (force) option overwrites the destination file if it already exists on HDFS.
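For instance, using the same illustrative paths as above:

hadoop fs -copyFromLocal -f /tmp/data.csv /user/test/data.csv    # -f overwrites /user/test/data.csv if it already exists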
Q: What is the purpose of the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

A: When a node is added to or removed from the cluster, the Hadoop master has to be informed which nodes will or will no longer be used for storage or processing. These commands are used to refresh the node information while commissioning or decommissioning of nodes is done.

dfsadmin -refreshNodes
This is run with the HDFS client and refreshes the node configuration for the NameNode.

rmadmin -refreshNodes
This performs the same administrative task for the ResourceManager.
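A typical decommissioning sequence looks like this (a minimal sketch; the exclude files are whatever dfs.hosts.exclude and the ResourceManager's exclude path point to in your cluster):

# 1. Add the hostname being decommissioned to the HDFS and YARN exclude files
# 2. Ask the NameNode and ResourceManager to re-read the host lists
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes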
Q: Is there any way to change the replication of files on HDFS after they are already written to HDFS?

A: Yes, the following are the ways to change the replication of files on HDFS:

We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, which will apply that replication factor to any new content that comes in.

If you want to change the replication factor for a particular file or directory, use:

$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path/of/the/file

Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv
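You can verify the new replication factor afterwards (same illustrative path):

hdfs dfs -stat %r /user/temp/test.csv    # prints the current replication factor of the file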
Over-replicated blocks:

These are blocks that exceed the target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete the excess replicas.

Consider a case of three nodes running with a replication factor of three, where one of the nodes goes down due to a network failure. Within a few minutes, the namenode re-replicates the data, and then the failed node comes back with its set of blocks. This is an over-replication situation, and the namenode will delete a set of replicas from one of the nodes.
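You can spot such blocks with fsck (a minimal sketch, run against the root path or any directory of interest):

hdfs fsck / | grep -i replicated    # the summary includes under-replicated, over-replicated, and mis-replicated block counts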
MapReduce Questions
Q: What is distributed cache in MapReduce?

A: Distributed cache is a mechanism provided by the MapReduce framework to cache files (such as text files, jars, or archives) needed by an application. The framework copies the cached files to the worker nodes before any task of the job runs there, so map and reduce tasks can read them locally instead of fetching them repeatedly over the network.
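With a driver that uses ToolRunner, a file can be pushed into the distributed cache straight from the command line (a minimal sketch; the jar, class, file, and paths are only illustrative):

hadoop jar wordcount.jar WordCount -files /tmp/stopwords.txt /user/test/input /user/test/output
# -files ships stopwords.txt to every node; tasks can open it by its plain name ("stopwords.txt")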
Q: What is the role of the RecordReader, Combiner, and Partitioner in a MapReduce operation?

A:
RecordReader
It communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper.

Combiner
The combiner is also known as a mini reducer. For every mapper, there is one combiner. It summarizes the intermediate key-value pairs produced by the mapper and passes the output to the partitioner.

Partitioner
The partitioner decides how many reduce tasks will be used to summarize the data. It also determines how outputs from the combiners are sent to the reducers and controls the partitioning of the keys of the intermediate map outputs.
MapReduce reads data from the disk and, after a particular iteration, sends the results back to HDFS. Such a process increases latency and makes graph processing slow.
Q: Name some Hadoop-specific data types that are used in a MapReduce program.

A: The following are some Hadoop-specific data types used in a MapReduce program:
IntWritable
FloatWritable
LongWritable
DoubleWritable
BooleanWritable
ArrayWritable
MapWritable
ObjectWritable
Q: What is speculative execution in Hadoop?

A: If a datanode is executing a task slowly, the master node can redundantly launch another instance of the same task on another node. The task that finishes first is accepted, and the other task is killed. Speculative execution is therefore useful if you are working in an intensive-workload kind of environment.
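Speculative execution is on by default and can be toggled per job (a minimal sketch using the standard MapReduce properties):

-D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false    # disable speculative map/reduce attempts for a job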
Q: What are the differences between a map-side join and a reduce-side join?

A:
Map-side join:
The join is performed by the mapper.
Each input dataset must be divided into the same number of partitions.
The input to each map is in the form of a structured partition and is in sorted order.

Reduce-side join:
The join is performed by the reducer.
It is easier to implement than the map-side join, as the sorting and shuffling phase sends the values having identical keys to the same reducer.
There is no need to have the dataset in a structured form (or partitioned).
OutputCommitter describes the commit of task output for a MapReduce job.

Example: org.apache.hadoop.mapreduce.OutputCommitter
Q: What is spilling in MapReduce?

A: Spilling is the process of copying data from the memory buffer to disk when the content of the buffer reaches a certain threshold size. Spilling happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80% of the buffer size is filled. For a 100 MB buffer, the spilling will start once the content of the buffer reaches a size of 80 MB.
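Both the buffer size and the spill threshold are configurable per job (a minimal sketch with the default values spelled out):

-D mapreduce.task.io.sort.mb=100 -D mapreduce.map.sort.spill.percent=0.80    # 100 MB sort buffer, spill when 80% full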
Q: How can you set the number of mappers and reducers for a MapReduce job?

A: The number of mappers and reducers can be set on the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the driver code, the same can be done with:

job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
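On Hadoop 2 / YARN the mapred.* names above still work but are deprecated; the newer equivalents are (a minimal sketch):

-D mapreduce.job.maps=5 -D mapreduce.job.reduces=2    # mapreduce.job.maps is only a hint; the map count ultimately follows the input splits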
Q: What happens when a node running a map task fails before sending the output to the reducer?

A: If a node crashes while running a map task, the whole task will be assigned to a new node and run again to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the application master, which takes care of the execution of the application. If a task on a particular node fails due to the unavailability of that node, it is the role of the application master to have the task scheduled on another node.
Questions on YARN
(Yet Another Resource Negotiator)
Q: What benefits did YARN bring in Hadoop 2.0, and how did it solve the issues of MapReduce v1?

A: MapReduce v1 had major issues when it came to scalability and availability. In Hadoop v1, there was only one master process for the processing layer, called the job tracker. The job tracker was responsible for both resource tracking and job scheduling. In YARN, there is a processing master called the ResourceManager, and in Hadoop v2 you can run the ResourceManager in high-availability mode. There are node managers running on multiple machines, and a temporary daemon called the application master. In Hadoop v2, the ResourceManager only handles the client connections and takes care of tracking the resources; the job scheduling and execution across multiple nodes is controlled by the application masters until the applications complete. In YARN, there are different kinds of resource allocations, and it can run different kinds of workloads. Hadoop v2 provides the following features:

Scalability - You can have a cluster size of more than 10,000 nodes and can run more than 100,000 concurrent tasks.

Compatibility - Applications developed for Hadoop v1 run on YARN without any disruption or availability issues.

Resource utilization - YARN allows dynamic allocation of cluster resources to improve resource utilization.

Multitenancy - YARN can use open-source and proprietary data access engines, perform real-time analysis, and run ad-hoc queries.

Q: How does YARN allocate resources to an application? Explain with the help of its architecture.

A: There is a client/application/API which talks to the ResourceManager. The ResourceManager manages the resource allocation in the cluster. It has two internal components, the Scheduler and the Applications Manager. The ResourceManager is aware of the resources that are available with every node manager.

The Scheduler allocates resources to the various applications running in parallel. It schedules resources based on the requirements of the applications, but it does not monitor or track the status of the applications.

The Applications Manager is the one which accepts job submissions. It monitors and restarts the application masters in case of failures. The Application Master manages the resource needs of individual applications. It interacts with the Scheduler to acquire the required resources, and it interacts with the NodeManager to execute and monitor tasks.
The Node Manager is a tracker that tracks the jobs running on its node. It monitors each container's resource utilization.

A Container is a collection of resources such as RAM, CPU, and network bandwidth. It grants an application the right to use a specific amount of those resources.
[YARN architecture diagram: the client submits an application to the ResourceManager (Scheduler and Applications Manager), which launches an Application Master that requests containers from the NodeManagers.]
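You can observe these components on a running cluster with the YARN CLI:

yarn node -list           # NodeManagers registered with the ResourceManager, with container counts and memory in use
yarn application -list    # applications currently known to the ResourceManager and their states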
Q: Can we have more than one ResourceManager in a YARN-based cluster?

A: Yes, there can be more than one ResourceManager in the case of a High Availability cluster:

Active ResourceManager
Standby ResourceManager

At any point in time, there can only be one active ResourceManager. If the active ResourceManager fails, then the standby ResourceManager comes to the rescue.
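The HA state of each ResourceManager can be checked from the command line (a minimal sketch; rm1 and rm2 stand for whatever IDs are configured in yarn.resourcemanager.ha.rm-ids):

yarn rmadmin -getServiceState rm1    # prints "active" or "standby"
yarn rmadmin -getServiceState rm2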
Q: In a cluster of 10 datanodes, how would you estimate the RAM and cores available for processing on each node?

A: Every node in a Hadoop cluster has multiple processes running that need RAM. The machine itself, which has a Linux file system, has its own processes that need some RAM. So, on each datanode you need to allocate at least 20-30% of the memory towards these overheads, Cloudera base services, etc. That could leave around 11 or 12 GB of RAM and 6-7 cores available on every machine for processing. Multiply that by 10 and that is your cluster's processing capacity.
Q: What happens if the requested memory or CPU cores go beyond the size of the container allocation?

A: The ResourceManager will not grant a container larger than the configured maximum allocation (yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores). A request that exceeds these limits is rejected as an invalid resource request, and the application fails instead of being scheduled.
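For example, a MapReduce job can ask for bigger map containers on the command line (the values are only illustrative and must stay within the scheduler's maximum allocation):

-D mapreduce.map.memory.mb=4096 -D mapreduce.map.cpu.vcores=2    # per-map-task container request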
Hive Questions
Q: What are the different components of a Hive architecture?

A: The main components of the Hive architecture are:

Hive clients and user interface (CLI, JDBC/ODBC, web UI) - used to submit queries.
Metastore - stores the metadata (schemas, table and partition locations) in an RDBMS.
Driver, compiler, and optimizer - parse, compile, and optimize HiveQL into an execution plan.
Execution engine - runs the plan on the Hadoop cluster, with the table data stored in HDFS.
Q: What is the difference between an external table and a managed table in Hive?

A: For a managed (internal) table, Hive owns both the metadata and the data, which lives under the Hive warehouse directory; dropping the table deletes the metadata and the data. For an external table, Hive manages only the metadata, while the data stays at the location you specify; dropping the table removes just the metadata and leaves the data untouched.
Q: What is a partition in Hive, and why is partitioning required?

A: Partitioning is the process of grouping similar types of data together on the basis of a column or partition key. Each table can have one or more partition keys to identify a particular partition.

Partitioning provides granularity in a Hive table. It reduces query latency by scanning only the relevant partitioned data instead of the whole data set.

For example, we can partition the transaction data of a bank based on month - Jan, Feb, etc. Any operation regarding a particular month, say Feb, will then have to scan only the Feb partition instead of the whole table data.
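A minimal sketch of this in HiveQL, run through the Hive CLI (the table and column names are only illustrative):

hive -e "CREATE TABLE bank_txn (txn_id INT, amount DOUBLE) PARTITIONED BY (txn_month STRING);"
hive -e "SELECT SUM(amount) FROM bank_txn WHERE txn_month = 'Feb';"    # scans only the Feb partition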
Q: Why does Hive not store metadata information in HDFS?

A: HDFS read/write operations are time-consuming. So, Hive stores the metadata information in the metastore using an RDBMS instead of HDFS. This allows it to achieve low latency and is faster.
Q: What are the components of the Hive query processor?

A: The following are the components of the Hive query processor:

Parser
Semantic analyzer
Execution engine
User-defined functions
Logical plan generation
Physical plan generation
Optimizer
Operators
Type checking
Q: Suppose a directory on HDFS contains a large number of small files. How can you handle this in Hive?

A: Using the SequenceFile format and grouping these small files together to form a single sequence file can solve this problem.
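A minimal sketch of the idea in HiveQL (assuming an existing text-format table named small_files_txt with a single string column; the names are illustrative):

hive -e "CREATE TABLE files_seq (line STRING) STORED AS SEQUENCEFILE;"
hive -e "INSERT OVERWRITE TABLE files_seq SELECT * FROM small_files_txt;"    # compacts the many small text files into sequence-file blocks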
Q: What are the key differences between Hive and Pig?

A:
Hive: Uses HiveQL, a declarative, SQL-like language; works best on structured data; mainly used by data analysts for reporting and ad-hoc querying.
Pig: Uses Pig Latin, a procedural data-flow language; handles structured as well as semi-structured data; mainly used by programmers and researchers for building data pipelines.
Pig Questions

Q: How does Pig differ from MapReduce?

A:
Pig: A high-level scripting framework; the same logic needs far fewer lines of code, and no compilation is required, since Pig converts the script into MapReduce jobs internally.
MapReduce: A low-level processing framework; programs are typically written and compiled in Java, which gives finer control but takes more development effort.
Q: What are the different ways of executing a Pig script?

A: The three ways of executing a Pig script are:

Grunt shell
Script file
Embedded script
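The first two look like this from the command line (sample.pig is an illustrative script name):

pig               # opens the interactive Grunt shell
pig sample.pig    # runs a Pig Latin script file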
Q: What are the major components of the Pig execution architecture?

A:
Pig Scripts
They are written in Pig Latin using built-in operators and UDFs and are submitted to the execution environment.

Parser
Does type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).

Optimizer
Performs optimizations such as merge, transform, split, etc. It aims to reduce the amount of data in the pipeline.

Compiler
The Pig compiler converts the optimized code into MapReduce jobs automatically.

Execution Engine
The MapReduce jobs are submitted to the execution engine to generate the desired results.
Q: What are the complex data types in Pig?

A:
Tuple
A tuple is an ordered set of fields that can contain different data types for each field. It is represented by parentheses ().
Example: (1,3)

Bag
A bag is a set of tuples represented by curly braces {}.
Example: {(1,4), (3,5), (4,6)}

Map
A map is a set of key-value pairs used to represent data elements. It is represented by square brackets [].
Example: [key#value, key1#value1, ...]
Q: What are the diagnostic operators available in Apache Pig?

A:
Dump
• The Dump operator runs the Pig Latin scripts and displays the results on the screen.
• Load the data using the "load" operator into Pig.
• Display the results using the "dump" operator.

Describe
• The Describe operator is used to view the schema of a relation.
• Load the data using the "load" operator into Pig.
• View the schema of a relation using the "describe" operator.

Explain
• The Explain operator displays the physical, logical, and MapReduce execution plans.
• Load the data using the "load" operator into Pig.
• Display the logical, physical, and MapReduce execution plans using the "explain" operator.

Illustrate
• The Illustrate operator gives the step-by-step execution of a sequence of statements.
• Load the data using the "load" operator into Pig.
• Show the step-by-step execution of a sequence of statements using the "illustrate" operator.
Q: State the usage of the group, order by, and distinct keywords in Pig scripts.

A: The group statement collects various records with the same key and groups the data in one or more relations.
Example: Group_data = GROUP Relation_name BY AGE

The order by statement sorts a relation on the basis of one or more fields.
The distinct statement removes duplicate records; it works on entire records, not on individual fields.

Q: What are the relational operators in Pig?

A:
COGROUP - Joins two or more tables and then performs a GROUP operation on the joined table result.
CROSS - Used to compute the cross product (Cartesian product) of two or more relations.
FOREACH - Iterates through the tuples of a relation, generating a data transformation.
JOIN - Used to join two or more tables in a relation.
LIMIT - Limits the number of output tuples.
SPLIT - Splits the relation into two or more relations.
UNION - Merges the contents of two relations.
ORDER - Used to sort a relation based on one or more fields.
Q: What is the use of the FILTER operator in Pig?

A: The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.
For example, to filter the products whose quantity is greater than 1000:

A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);
B = FILTER A BY quantity > 1000;

"Phone_sales" data:
year, product, quantity
----------------------------
Q: How can you retrieve the first 10 records from a file using Pig?

A: We need to use the LIMIT operator to retrieve the first 10 records from a file.

Load the data in Pig:
test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2, ...);

Limit the data to the first 10 records:
Limit_test_data = LIMIT test_data 10;
HBase Questions

Q: What are the key components of HBase?

A:
Region Server
A region server contains HBase tables that are divided horizontally into "regions" based on their key values. It runs on every node and decides the size of each region. Each region server is a worker node which handles read, write, update, and delete requests from clients.

ZooKeeper
It provides a distributed coordination service to maintain server state in the cluster. It maintains which servers are alive and available, and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating whether they are ready for read and write operations.
Q: Can we do import/export in an HBase table?

A: Yes. An HBase table can be exported to HDFS and imported back using the Export and Import MapReduce utilities that ship with HBase.
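A minimal sketch of both directions (the table names and HDFS path are only illustrative, and the target table of the import must already exist):

hbase org.apache.hadoop.hbase.mapreduce.Export "employee" /user/backup/employee
hbase org.apache.hadoop.hbase.mapreduce.Import "employee_copy" /user/backup/employee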
Q: What is the purpose of the catalog table hbase:meta?

A: The catalog table hbase:meta exists as an HBase table, although it is filtered out of the HBase shell's list command. It keeps a list of all the regions in the system, and the location of hbase:meta itself is stored in ZooKeeper. In older HBase versions, a -ROOT- table kept track of the location of the .META. table.
Q: What is hotspotting in HBase, and how can it be avoided?

A: In HBase, all read and write requests should be uniformly distributed across all of the regions in the RegionServers. Hotspotting occurs when a given region, serviced by a single RegionServer, receives most or all of the read or write requests.

Hotspotting can be avoided by designing the row key in such a way that the data being written is spread across multiple regions in the cluster. The techniques to do so are:

Salting
Hashing
Reversing the key
Sqoop Questions

Q: What is the difference between Sqoop and Flume?

A:
Sqoop: Used for bulk transfer of data between Hadoop and structured data stores such as relational databases, importing and exporting over JDBC.
Flume: Used for collecting, aggregating, and moving continuously generated streaming data, such as log files, into Hadoop; it is event-driven and built around sources, channels, and sinks.
Q: What are the default file formats to import data using Sqoop?

A: Sqoop supports two file formats for imports:

Delimited Text Format
This is the default import format. Each record is written as a line of text, with the fields separated by a delimiter (a comma by default), for example:

2, strive to learn,2010-01-01
3, third message,2009-11-12

SequenceFile Format
SequenceFile is a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations and is appropriate for storing binary data.
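The format is chosen with an import flag (a minimal sketch; the connection string, table, and target directory are only illustrative):

sqoop import --connect jdbc:mysql://localhost/testdb --username root --table employee --as-sequencefile --target-dir /user/data/employee_seq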
Q: What is the use of the eval tool in Sqoop?

A: The Sqoop eval tool allows users to execute user-defined queries against the respective database server and preview the result in the console.
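For example (reusing the illustrative testdb and employee names):

sqoop eval --connect jdbc:mysql://localhost/testdb --username root --query "SELECT * FROM employee LIMIT 5"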
Q: How does Sqoop import and export data between an RDBMS and HDFS? Explain with the Sqoop architecture.

A: Sqoop turns every import or export into a map-only MapReduce job. For an import, Sqoop first connects to the database over JDBC to read the table's metadata, then splits the table on a split-by column and launches several mappers, each of which pulls its slice of rows in parallel and writes it to HDFS. For an export, the mappers read files from an HDFS directory in parallel and insert the records into the target database table over JDBC.
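A minimal sketch of both directions (connection details, table names, split column, and paths are only illustrative):

sqoop import --connect jdbc:mysql://localhost/testdb --username root --table employee --split-by emp_id -m 4 --target-dir /user/data/employee
sqoop export --connect jdbc:mysql://localhost/testdb --username root --table employee_backup --export-dir /user/data/employee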
Q: What is the role of the JDBC driver in a Sqoop setup? Is the JDBC driver enough to connect Sqoop to the database?

A: The JDBC driver is the vendor-specific library that lets Sqoop open a connection to the database and run SQL against it. On its own it is not enough: Sqoop also needs a connector for the database, and the connector uses the JDBC driver to interact with the database efficiently.
Q: How will you update the columns that are already exported? Write the Sqoop command to show all the databases in a MySQL server.

A: To update rows that have already been exported, run sqoop export again with the --update-key argument set to the table's key column(s); Sqoop then generates UPDATE statements instead of INSERTs. The databases on a MySQL server can be listed with the sqoop list-databases tool.
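A minimal sketch of both commands (connection details, table, directory, and key column are only illustrative):

sqoop export --connect jdbc:mysql://localhost/testdb --username root --table employee --export-dir /user/data/employee --update-key emp_id
sqoop list-databases --connect jdbc:mysql://localhost/ --username root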
Q: What is the role of the Sqoop codegen command?

A: The codegen tool in Sqoop generates Data Access Object (DAO) Java classes that encapsulate and interpret imported records.

The following example generates Java code for an "employee" table in the "testdb" database:
$ sqoop codegen \
--connect jdbc:mysql://localhost/testdb \
--username root \
--table employee
Q: Can Sqoop be used to convert data into different formats? If not, which tools can be used for this purpose?

A: Yes, Sqoop can be used to convert data into different formats. This depends on the different arguments that are used for importing.
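For instance, the import format is controlled by arguments such as these (connection details and paths are illustrative):

sqoop import --connect jdbc:mysql://localhost/testdb --username root --table employee --as-avrodatafile --target-dir /user/data/employee_avro
sqoop import --connect jdbc:mysql://localhost/testdb --username root --table employee --as-parquetfile --target-dir /user/data/employee_parquet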
Big Data Hadoop Certification Training: Our Big Data Hadoop training
course lets you master the concepts of the Hadoop framework, preparing
you for Cloudera’s CCA175 Hadoop Certification Exam. Learn how various
components of the Hadoop ecosystem fit into the Big Data processing
lifecycle.
INDIA
Simplilearn Solutions Pvt Ltd.
# 53/1 C, Manoj Arcade, 24th Main, Harlkunte
2nd Sector, HSR Layout
Bangalore: 560102
Call us at: 1800-212-7688

USA
Simplilearn Americas, Inc.
201 Spear Street, Suite 1100
San Francisco, CA 94105
United States
Phone No: +1-844-532-7688

www.simplilearn.com