
HADOOP
Interview Guide
Ace Your Hadoop Interview

Big Data has taken the world by storm and has been growing tremendously in the past decade. With Big Data comes the widespread adoption of Hadoop to solve major Big Data challenges. Hadoop is one of the most popular frameworks used to store, process, and analyze Big Data. Hence, there is always a demand for professionals to work in this field.

So, how do you get yourself a job in the field of Hadoop? If you have the requisite skills already, the next step is definitely cracking a Hadoop interview, and we've got that sorted for you. Go ahead and give a read to all the frequently asked Hadoop interview questions and answers!


Topics Covered

General Questions
MapReduce Questions
YARN Questions
Hive Questions
Pig Questions
HBase Questions
Sqoop Questions


General Questions

Q: What are the different vendor-specific distributions of Hadoop?

A: Different vendors in the market have packaged Apache Hadoop in a cluster management solution which allows them to easily deploy, manage, monitor, and upgrade the clusters. Some of the vendor-specific distributions are:

Cloudera
Hortonworks (it has now merged with Cloudera)
MapR
Microsoft Azure
IBM InfoSphere
Amazon Web Services

Apart from the above popular distributions, there are other distributions that run Apache Hadoop. They have packaged it as a solution, like an installer, so that you can easily set up clusters on a set of machines.

Q: What are the different Hadoop configuration files?

A: Regardless of which distribution of Hadoop one works on, the following config files are important and exist in every distribution of Hadoop:

hadoop-env.sh
It contains information about environment variables such as the Java path, the process ID path, where the logs get stored, what kind of metrics will be collected, and so on.

core-site.xml
This file has the HDFS path and many other properties like enabling trash, enabling high availability, etc.

hdfs-site.xml
This file has other information related to the Hadoop cluster, such as:
the replication factor
where the namenode will store its metadata on disk
if a datanode is running, where it will store the data
if a secondary namenode is running, where it will store a copy of the namenode's metadata, and so on

mapred-site.xml
It is a file which has properties related to MapReduce processing.

masters and slaves
These might be deprecated in a vendor-specific distribution but are part of the Hadoop core distribution.

yarn-site.xml
It is based on the YARN processing framework, which was introduced in Hadoop v2. It contains resource allocation, resource manager, and node manager related properties. If you work as a Hadoop admin or a Hadoop developer, knowing these config properties is important, and it will showcase your internal knowledge about the configs which drive the Hadoop cluster.

Q: What are the three modes in which Hadoop can run?

A: Hadoop runs in three modes, namely:

Standalone mode
This is the default mode. It uses a local file system and a single Java process to run the Hadoop services. This mode is suitable for a test setup.

Pseudo-distributed mode
It uses a single-node Hadoop deployment to execute all the Hadoop services. Hadoop as a framework has many services that would be running irrespective of the distribution, and each service would then have multiple processes. A pseudo-distributed mode is a mode of a cluster where all the important processes belonging to one or multiple services run on a single node. This mode is suitable for testing and development.

Fully-distributed mode
It uses separate nodes to run the Hadoop master and slave services. It means the Hadoop framework and its components are spread across multiple machines. For multiple services such as HDFS, YARN, Flume, Kafka, HBase, Hive, and Impala, there would be one or multiple processes distributed across multiple nodes. This mode is normally used in a production environment.
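A quick, hedged way to check which of these modes (and which values from the config files above) a given installation actually picked up is to query the loaded configuration from the shell; the property names below are the standard Hadoop 2.x keys:

hdfs getconf -confKey fs.defaultFS     # prints file:/// in standalone (local) mode, hdfs://<host>:<port> otherwise
hdfs getconf -confKey dfs.replication  # effective replication factor read from hdfs-site.xml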

Q: What are the differences between the regular file system and HDFS?

A:
Regular File System:
Data is maintained in a single system.
If the machine crashes, data recovery is very difficult due to low fault tolerance.
Seek time is more, and hence it takes more time to process the data.

HDFS:
Data is distributed and maintained on multiple systems.
If a datanode crashes, data can still be recovered from other nodes in the cluster.
Time taken to read data is comparatively more, as data is read locally from the disc and coordinated from multiple systems.


Q: Why is HDFS fault tolerant?

A: HDFS is fault tolerant as it replicates data on different datanodes. By default, a block of data gets replicated on three datanodes.

In Hadoop v2, the default block size is 128 MB. Any file which is up to 128 MB uses one logical block. If the file size is bigger than 128 MB, it will be split into blocks. These blocks will be stored across multiple machines.

The data blocks are stored on different datanodes. If one node crashes, the data can still be retrieved from the other datanodes. This makes HDFS fault tolerant.
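As a quick illustration (the path below is just a hypothetical file), you can check how many replicas and what block size HDFS recorded for a file:

hdfs dfs -stat "replication=%r blocksize=%o name=%n" /user/test/data.csv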

Q: Explain the architecture of HDFS.

A: The architecture of HDFS is as follows:

NameNode
NameNode is the master service that hosts metadata on disk and in RAM. It holds information about the various datanodes, their location, the size of each block, etc.

The disk holds an edit log (transaction log) and an fsimage (filesystem image). The edit log gets appended every time a write or any other metadata operation happens on HDFS. The metadata in RAM is dynamically built every time the cluster comes up and serves all the client requests.

Datanode
Datanodes hold the actual data blocks and send block reports to the Namenode every 10 seconds. A Datanode stores and retrieves the blocks when asked by the Namenode. It serves the clients' read and write requests and performs block creation, deletion, and replication on instruction from the Namenode.

In a Hadoop cluster, the main service is HDFS. Data which is written to HDFS is split into blocks depending on its size. The blocks are randomly distributed across nodes. With the auto-replication feature, these blocks are auto-replicated across multiple machines with the condition that no two identical blocks can sit on the same machine.

As soon as the cluster comes up, the datanodes (which are part of the cluster based on the config files) start sending their heartbeat to the namenode every three seconds. The namenode stores this information, i.e., it starts building metadata in its RAM, which has information about the datanodes available. This metadata is maintained in RAM as well as on the disk.
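If you want to look at that on-disk metadata directly, Hadoop ships offline viewers for the fsimage and the edit log; a hedged sketch (the file names below are made up, real ones carry transaction IDs and live under the namenode's metadata directory):

hdfs oiv -p XML -i fsimage_0000000000000001234 -o fsimage.xml   # dump the fsimage to XML
hdfs oev -p xml -i edits_0000000000000001235-0000000000000001300 -o edits.xml   # dump an edit log segment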

Q: What are the two types of metadata a namenode server holds?

A: The two types of metadata that a namenode server holds are:


Metadata in Disk
Metadata in RAM

Q: What is the difference between a federation and high availability?

A:
HDFS Federation:
There is no limitation to the number of namenodes, and the namenodes are not related to each other.
All the namenodes share a pool of metadata, in which each namenode has its own dedicated pool.
It provides fault tolerance, i.e., if one namenode goes down, it will not affect the data of the other namenodes.

HDFS High Availability:
There are two namenodes which are related to each other. Both the active and standby namenodes work all the time.
At a time, the active namenode will be up and running, while the standby namenode will be idle, updating its metadata once in a while.
It requires two separate machines. The active namenode will be configured on one machine, while the standby namenode will be configured on the other.

Q: If you have an input file of 350 MB, how many input splits would be created by HDFS and what would be the size of each input split?

A: In HDFS, data is divided into blocks of 128 MB by default. The size of all the blocks, except the last one, will be 128 MB. So, there would be three input splits in total.
The sizes of the splits are 128 MB, 128 MB, and 94 MB.

Q: How does rack awareness work in HDFS?

A: HDFS Rack Awareness refers to the knowledge of the different datanodes and how they are distributed across the racks of a Hadoop cluster.

By default, each block of data is replicated three times on various datanodes present on different racks. Two identical blocks cannot be placed on the same datanode. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack. If a datanode crashes, you can retrieve the data block from a different datanode.
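A hedged way to see the rack assignments the namenode is actually using (run as the HDFS superuser) is:

hdfs dfsadmin -printTopology   # prints the rack / datanode tree the namenode knows about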

Q: How can you restart the namenode and all the daemons in Hadoop?

A: You can stop the namenode with the ./sbin/hadoop-daemon.sh stop namenode command and then start the namenode using the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.

Q: Which command will help you find the status of blocks and filesystem health?

A: To check the status of the blocks, use the command:


hdfs fsck <path> -files -blocks
To check the health status of the filesystem, use the command:
hdfs fsck / -files -blocks -locations > dfs-fsck.log


Q: What would happen if you store too many small files in a cluster on HDFS?

A: As Hadoop is coded in Java, every directory, file, and file-related block is considered an object. For every object within the Hadoop cluster, namenode RAM gets utilized. The more blocks there are, the more namenode RAM is used. Storing a lot of small files on HDFS generates a lot of metadata. Storing this metadata in RAM is a challenge, as each file, block, or directory takes about 150 bytes just for metadata. Thus, the cumulative size of all the metadata will be too big.
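As a rough worked example (assuming one block per file): 10 million small files means roughly 10 million file objects plus 10 million block objects, i.e., about 20,000,000 x 150 bytes, or around 3 GB of namenode heap consumed by metadata alone.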

Q: How do you copy data from the local system onto HDFS?

A: The following command helps to copy data from the local file system into
HDFS:
hadoop fs -copyFromLocal [source] [destination]
Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv
In the above syntax, the source is the local path and the destination is the HDFS path. Using the -f (force) option with copyFromLocal lets you overwrite a file that already exists on HDFS.
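For instance, re-running the same copy with -f (a hedged example reusing the paths above) overwrites the existing HDFS file instead of failing:

hadoop fs -copyFromLocal -f /tmp/data.csv /user/test/data.csv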

Q: When do you use dfsadmin -refreshNodes and rmadmin -refreshNodes


command?

A: When a node is added to or removed from the cluster, the Hadoop master needs to be informed about which nodes should be used for storage or processing. These commands are used to refresh the node information while commissioning or decommissioning nodes.

dfsadmin -refreshNodes
This is run with the HDFS client and refreshes the node configuration for the NameNode.

rmadmin -refreshNodes
This performs the same administrative task for the ResourceManager.

Q: Is there any way to change replication of files on HDFS after they are already
written to HDFS?

A: Yes, the following are the ways to change the replication of files on HDFS:
We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, which will start replicating to that factor for any new content that comes in.
If you want to change the replication factor for a particular file or directory, use:
$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path/of/the/file
Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv


Q: Who takes care of replication consistency in a Hadoop cluster and what


do you mean by under/over-replicated blocks?

A: The namenode takes care of replication consistency in a Hadoop cluster, and the fsck command gives information about over- and under-replicated blocks.

Under-replicated blocks:
These are the blocks that do not meet their target replication for the file they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication. Consider a cluster with three nodes and replication set to three. At any point, if one of the datanodes crashes, the blocks would be under-replicated. It means that there was a replication factor set, but there are not enough replicas as per that factor. If the namenode does not get information about the replicas, it will wait for some time and then start re-replication of the missing blocks from the available nodes.

Over-replicated blocks:
These are the blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete the excess replicas.
Consider a case of three nodes running with a replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the namenode re-replicates the data, and then the failed node comes back with its set of blocks. This is an over-replication situation, and the namenode will delete a set of blocks from one of the nodes.
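A hedged way to spot both kinds on a live cluster is to filter the fsck summary, which reports a count for each:

hdfs fsck / | egrep -i 'under-replicated|over-replicated'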


MapReduce Questions
Q: What is distributed cache in MapReduce?

A: Distributed cache is a mechanism supported by the Hadoop MapReduce


framework. The data coming from the disk can be cached and made available
for all worker nodes where the map/reduce tasks are running for a given job.
When a MapReduce program is running, instead of reading the data from the
disk every time, it would pick up the data from the distributed cache and it will
benefit the MapReduce processing.
To copy the file to HDFS, you can use the command:
hdfs dfs-put /user/Simplilearn/lib/jar_file.jar
To set up the application’s JobConf, use the command:
DistributedCache.addFileToClasspath(newpath(“/user/Simplilearn/lib/jar_
file.jar”), conf)

Q: What roles do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

A:
RecordReader
It communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper.

Combiner
Combiner is also known as a mini reducer. For every combiner, there is one mapper. It performs local aggregation on the intermediate key-value pairs from the mapper and passes the output to the partitioner.

Partitioner
Partitioner decides how many reduce tasks would be used to summarize the data. It also controls how the outputs from the combiners are sent to the reducers and controls the partitioning of the keys of the intermediate map outputs.

Q: Why is MapReduce slower in processing data in comparison to other


processing frameworks?

A: MapReduce is slower because:

It uses batch processing to process data. No matter what kind of processing you want to do, you have to provide mapper and reducer functions to work on the data.
It uses the Java language, which is difficult to program as it requires multiple lines of code.
It reads data from the disk and, after a particular iteration, sends the results to HDFS. Such a process increases latency and makes graph processing slow.

Q: For a MapReduce job, is it possible to change the number of mappers


to be created?

A: By default, you cannot change the number of mappers because it is equal to the number of input splits. For example, if you have a 1 GB file that is split into 8 blocks (of 128 MB each), there will be only 8 mappers running on the cluster. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.
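One property-based sketch (assuming the driver uses ToolRunner so that -D options are parsed; the jar name, driver class, and paths are placeholders) is to shrink the maximum split size so that more splits, and hence more mappers, are created:

hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
  /input /output
# 64 MB splits on a 1 GB input would yield 16 mappers instead of 8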

Q: Name some Hadoop specific data types that are used in a MapReduce
program.

A: Following are some Hadoop specific data types used in a MapReduce program:
IntWritable
FloatWritable
LongWritable
DoubleWritable
BooleanWritable
ArrayWritable
MapWritable
ObjectWritable

Q: What is speculative execution in Hadoop?

A: If a datanode is executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first will be accepted, and the other task will be killed. Speculative execution is therefore useful if you are working in an intensive-workload environment.

For example, suppose node A is running a slower task. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task runs on node B. The output is then accepted from node B.
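Speculative execution can be toggled per job; a hedged example of turning it off, using the Hadoop 2.x property names (jar, driver class, and paths are placeholders):

hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.map.speculative=false \
  -D mapreduce.reduce.speculative=false \
  /input /output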


Q: How is the identity mapper different from the chain mapper?

A:
Identity Mapper:
It is the default mapper, which is chosen when no mapper is specified in the MapReduce driver class.
It implements the identity function, which directly writes all its key-value pairs into the output.
It is defined in the old MapReduce API (MR1), in the org.apache.hadoop.mapred.lib package.

Chain Mapper:
This class is used to run multiple mappers in a single map task.
The output of the first mapper becomes the input to the second mapper, the second to the third, and so on.
It is defined in org.apache.hadoop.mapreduce.lib.chain.ChainMapper.

Q: What are the major configuration parameters required in a MapReduce


program?

A: The major configuration parameters are:


Input location of the job in HDFS
Output location of the job in HDFS
Input and output formats
Classes containing map and reduce functions
jar file for mapper, reducer and driver classes

Q: What do you mean by map-side join and reduce-side join in MapReduce?

A:
Map-side join:
The join is performed by the mapper.
Each input dataset must be divided into the same number of partitions.
The input to each map is in the form of a structured partition and is in sorted order.

Reduce-side join:
The join is performed by the reducer.
It is easier to implement than the map-side join, as the sorting and shuffling phase sends the values having identical keys to the same reducer.
There is no need to have the dataset in a structured form (or partitioned).


Q: What is the role of the OutputCommitter class in a MapReduce job?

A: OutputCommitter describes the commit of task output for a MapReduce job. The new-API base class is org.apache.hadoop.mapreduce.OutputCommitter, and the old-API class extends it:

public abstract class OutputCommitter extends org.apache.hadoop.mapreduce.OutputCommitter

MapReduce relies on the OutputCommitter of the job to:

Set up the job during initialization
Clean up the job after job completion
Set up the task's temporary output
Check whether a task needs a commit
Commit the task output
Discard the task commit

Q: Explain the process of spilling in MapReduce.

A: Spilling is the process of copying data from the memory buffer to disk when the content of the buffer reaches a certain threshold size. Spilling happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80% of the buffer size is filled. For a 100 MB buffer, the spilling will start once the content of the buffer reaches 80 MB.
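Both the buffer size and the spill threshold are configurable per job; a hedged example using the Hadoop 2.x property names (jar, driver class, and paths are placeholders):

hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.task.io.sort.mb=256 \
  -D mapreduce.map.sort.spill.percent=0.90 \
  /input /output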

Q: How can you set the mappers and reducers for a MapReduce job?

A: The number of mappers and reducers can be set in the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the code, one can configure JobConf variables:

job.setNumMapTasks(5); // 5 mappers

job.setNumReduceTasks(2); // 2 reducers

Q: What happens when a node running a map task fails before sending the output to the reducer?

A: If a node crashes while running a map task, the whole task will be assigned to a new node and run again to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the application master, which takes care of the execution of the application. If a task on a particular node fails due to the unavailability of that node, it is the role of the application master to have the task scheduled on another node.


Q. Can we write the output of MapReduce in different formats?

A: Yes, we can write the output of MapReduce in different formats. Hadoop supports various output formats such as:

TextOutputFormat - the default output format; it writes records as lines of text.

SequenceFileOutputFormat - used to write sequence files when the output files need to be fed into another MapReduce job as input files.

MapFileOutputFormat - used to write the output as map files.

SequenceFileAsBinaryOutputFormat - a variant of SequenceFileOutputFormat; it writes keys and values to a sequence file in binary format.

DBOutputFormat - used for writing to relational databases and HBase. This format sends the reduce output to a SQL table.


Questions on YARN
(Yet Another Resource Negotiator)
Q: What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?

A: MapReduce v1 had major issues when it came to scalability and availability. In Hadoop v1, there was only one master process for the processing layer, called the job tracker. The job tracker was responsible for resource tracking and job scheduling. In YARN, there is a processing master called the ResourceManager. In Hadoop v2, you have the ResourceManager running in high availability mode. There are node managers running on multiple machines, and a temporary daemon called the application master. In Hadoop v2, the ResourceManager only handles the client connections and takes care of tracking the resources. Job scheduling and execution across multiple nodes is controlled by the application masters until the application completes. In YARN, there are different kinds of resource allocations, and it can run different kinds of workloads. In Hadoop v2, the following features are available:

Scalability - You can have a cluster size of more than 10,000 nodes and can run more than 100,000 concurrent tasks.
Compatibility - Applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
Resource utilization - YARN allows dynamic allocation of cluster resources to improve resource utilization.
Multitenancy - YARN can use open-source and proprietary data access engines and perform real-time analysis and ad-hoc querying.

Q: How does YARN allocate resources to an application? Explain with the help of its architecture.

A: There is a client/application/API which talks to the ResourceManager. The ResourceManager manages the resource allocation in the cluster. It has two internal components, the Scheduler and the Applications Manager. The ResourceManager is aware of the resources that are available with every node manager.

The Scheduler allocates resources to the various running applications when they are running in parallel. It schedules resources based on the requirements of the applications. It does not monitor or track the status of the applications.

The Applications Manager is the one which accepts job submissions. It monitors and restarts the application masters in case of failures. The Application Master manages the resource needs of individual applications. It interacts with the Scheduler to acquire the required resources, and with the NodeManager to execute and monitor tasks.

The Node Manager is a tracker that tracks the jobs running on its node. It monitors each container's resource utilization.

A Container is a collection of resources like RAM, CPU, or network bandwidth. It gives an application the right to use a specific amount of resources.

Let us discuss how YARN allocates resources to an application through its architecture:

Whenever a job submission happens, the ResourceManager makes a request to the Node Manager to hold some resources for processing. The Node Manager then guarantees the container (a combination of RAM and CPU cores) which would be available for processing. Next, the ResourceManager starts a temporary daemon called the application master to take care of the execution. The App Master, which is launched by the Applications Manager, will run in one of the containers. The other containers will be utilized for execution. This is how YARN takes care of the allocation.

Q: Which of the following has occupied the place of JobTracker of


MapReduce v1?

a. NodeManager b. ApplicationManager c. ResourceManager d. Scheduler

A: ResourceManager

Q: Write the YARN commands to check the status of an application and


kill an application.

A: To check the status of an application: yarn application -status ApplicationID


To kill or terminate an application: yarn application -kill ApplicationID


Q: Can we have more than one ResourceManager in a YARN based


cluster?

A: Yes, there can be more than one ResourceManager in the case of a High
Availability cluster. They are:
Active ResourceManager
Standby ResourceManager
At a particular time, there can only be one active ResourceManager. If Active
ResourceManager fails, then the Standby ResourceManager comes to the
rescue.
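With ResourceManager HA enabled, you can check which instance is currently active (rm1 and rm2 are the conventional example IDs from yarn.resourcemanager.ha.rm-ids; yours may differ):

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2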

Q: What are the different schedulers available in YARN?

A: The different schedulers available in YARN are:


FIFO scheduler (first-in-first-out) - It places applications in a queue and runs
them in the order of submission (first in, first out).
Capacity scheduler - A separate dedicated queue allows the small job to start
as soon as it is submitted. The large job finishes later than when using the FIFO
scheduler.
Fair scheduler - There is no need to reserve a set amount of capacity since it
will dynamically balance resources between all the running jobs.

Q: What happens if a ResourceManager fails while executing an


application in a high availability cluster?

A: In a high availability cluster, there are two ResourceManagers, one being


active and the other standby. If a ResourceManager fails in case of high
availability cluster, the standby will be elected as active and instructs the
ApplicationMaster to abort. The ResourceManager recovers its running state by
taking advantage of the container statuses sent from all node managers.

Q: In a cluster of 10 datanodes, each having 16 GB and 10 cores, what


would be the total processing capacity of the cluster?

A: Every node in a Hadoop cluster has multiple processes running which need RAM. The machine itself, which has a Linux file system, has its own processes that need some RAM. So, if you have 10 datanodes, you need to allocate at least 20-30% towards the overheads, Cloudera base services, etc. You could have 11 or 12 GB and 6 to 7 cores available on every machine for processing. Multiply that by 10, and that is your processing capacity.
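As a rough worked figure under those assumptions: 10 x 11 GB is about 110 GB of RAM and 10 x 6 = 60 cores usable for YARN containers across the cluster.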

Q: What happens if requested memory or CPU cores goes beyond the size
of container allocation?

A: If an application starts demanding more memory or CPU cores, it cannot fit


into a container allocation. So, the application will fail.


Hive Questions
Q: What are the different components of a Hive architecture?

A: The different components of a Hive architecture are:


User Interface
It calls the execute interface to the driver, which creates a session for the query. The query is then sent to the compiler to generate an execution plan for it.

Metastore
It stores the metadata information and sends it to the compiler for the execution of a query.

Compiler
It generates the execution plan. It has a DAG of stages, where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.

Execution Engine
The execution engine acts as a bridge between Hive and Hadoop to process the query. It communicates bidirectionally with the Metastore to perform operations like creating and dropping tables.

Q: What is the difference between an external table and a managed table in Hive?

A:
External Table:
External tables in Hive refer to data that is at an existing location outside the warehouse directory.
If an external table is dropped, Hive deletes only the metadata information of the table and does not change the table data present in HDFS.

Managed Table:
Also known as an internal table, it manages the data and moves it into its warehouse directory by default.
If one drops a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.
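A minimal sketch run through the Hive CLI from the shell (the table name, columns, and HDFS location are made up for illustration):

hive -e "CREATE EXTERNAL TABLE ext_sales (id INT, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/test/sales';"
# DROP TABLE ext_sales would remove only the metadata; the files under /user/test/sales remain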

Q: What is a partition and why is it required in Hive?

A: Partitioning is the process of grouping similar types of data together on the basis of a column or partition key. Each table can have one or more partition keys to identify a particular partition.
Partitioning provides granularity in a Hive table. It reduces query latency by scanning only the relevant partitioned data instead of the whole data set.
We can partition the transaction data for a bank based on month (Jan, Feb, etc.). Any operation regarding a particular month, say Feb, will then have to scan only the Feb partition instead of the whole table data.
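A hedged sketch of that bank example via the Hive CLI (table and column names are illustrative):

hive -e "CREATE TABLE txn (txn_id INT, amount DOUBLE) PARTITIONED BY (txn_month STRING);"
hive -e "SELECT COUNT(*) FROM txn WHERE txn_month = 'Feb';"   # prunes the scan to the Feb partition only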

Q: Why does Hive not store metadata information in HDFS?

A: HDFS read/write operations are time consuming. So, Hive stores metadata information in the metastore using an RDBMS instead of HDFS. This allows it to achieve low latency and be faster.

Q: What are the components used in Hive query processor?

A: Listed below are the components used in Hive query processor:

Parser
Semantic analyzer
Execution engine
User-defined functions
Logical plan generation
Physical plan generation
Optimizer
Operators
Type checking

Q: Suppose, there are a lot of small CSV files present in /user/input


directory in HDFS and you want to create a single Hive table from these
files. The data in these files have fields: {registration_no, name, email,
address}. What will be your approach to solve this, where will you
create a single Hive table for lots of small files without degrading the
performance of the system?

A: Using the SequenceFile format and grouping these small files together to form a single sequence file can solve this problem. Below are the steps:

1. Create a temporary table:
CREATE TABLE test (registration_no INT, name STRING, email STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

2. Load the data into the test table:
LOAD DATA INPATH '/user/input' INTO TABLE test;

3. Create a table that will store data in SequenceFile format:
CREATE TABLE test_seqfile (registration_no INT, name STRING, email STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;

4. Move the data from the test table into the test_seqfile table:
INSERT OVERWRITE TABLE test_seqfile SELECT * FROM test;

Q: Write a query to insert a new column (new_col INT) into a Hive table (h_table) at a position before an existing column (x_col).

A: Hive adds new columns at the end of the table, and its CHANGE COLUMN clause supports the FIRST and AFTER keywords for repositioning. So the column is first added and then moved to sit just before x_col by placing it after the column that precedes x_col (col_before_x below stands for that column):

ALTER TABLE h_table ADD COLUMNS (new_col INT);
ALTER TABLE h_table CHANGE COLUMN new_col new_col INT AFTER col_before_x;

Q: What are the key differences between Hive and Pig?

A:
Hive:
Uses a declarative language called HiveQL, similar to SQL, for reporting.
Operates on the server side of the cluster and allows structured data.
Does not support the Avro file format by default; this can be done using org.apache.hadoop.hive.serde2.avro.AvroSerDe.
It was developed by Facebook, and it supports partitions.

Pig:
Uses a high-level procedural language called Pig Latin for programming.
Operates on the client side of the cluster and allows both structured and unstructured data.
Supports the Avro file format by default.
It was developed by Yahoo, and it does not support partitions.


Pig Questions

Q: What are the key differences between Pig and MapReduce?

A:
Pig:
Has fewer lines of code compared to MapReduce.
A high-level language which can easily perform join operations.
On execution, every Pig operator is converted internally into a MapReduce job.
Works with all the versions of Hadoop.

MapReduce:
Has more lines of code.
A low-level language which cannot perform join operations easily.
MapReduce jobs take more time to compile.
A MapReduce program written in one Hadoop version may not work with other versions.

Q: What are the different ways of executing a Pig script?

A: The different ways of executing a Pig script are:
Grunt shell
Script file
Embedded script
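A hedged illustration of the first two from the shell (myscript.pig is a placeholder; -x local runs against the local filesystem, -x mapreduce against the cluster):

pig -x mapreduce          # opens the interactive Grunt shell
pig -x local myscript.pig # runs a script file locally
pig myscript.pig          # runs a script file on the cluster (mapreduce is the default mode)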

Q: What are the major components of Pig execution environment?

Pig Scripts
It is written in Pig Latin using built-in operators and UDFs and submitted to the
execution environment.
Parser
Does type checking and checks the syntax of the script. The output of parser is
a Directed Acyclic Graph (DAG).
Optimizer
Performs optimization using merge, transform, split, etc. It aims to reduce the
amount of data in the pipeline.
Compiler
Pig compiler converts the optimized code into MapReduce jobs automatically.
Execution Engine
MapReduce jobs are submitted to execution engines to generate the desired
results.


Q: Explain the different complex data types in Pig.

Tuple
A tuple is an ordered set of fields which can contain different data types for each field. It is represented by parentheses ().
Example: (1,3)
Bag
A bag is a set of tuples represented by curly braces {}.
Example: {(1,4), (3,5), (4,6)}
Map
A Map is a set of key-value pairs used to represent data elements. It is
represented in square brackets [].
Example: [key#value, key1#value1,….]

Q: What are the various diagnostic operators available in Apache Pig?

Dump
• Dump operator runs the Pig Latin scripts and displays the results on the screen.
• Load the data using “load” operator into Pig.
• Display the results using “dump” operator.

Describe
• Describe operator is used to view the schema of a relation.
• Load the data using “load” operator into Pig
• View the schema of a relation using “describe” operator

Explain
• Explain operator displays the physical, logical, and MapReduce execution plans.
• Load the data using the “load” operator into Pig.
• Display the logical, physical, and MapReduce execution plans using the “explain” operator.

Illustrate
• Illustrate operator gives the step-by-step execution of a sequence of statements.
• Load the data using the “load” operator into Pig.
• Show the step-by-step execution of a sequence of statements using the “illustrate” operator.


Q: State the usage of group, order by, and distinct keywords in Pig scripts.

A: Group statement collects various records with the same key and groups the
data in one or more relations.
Example: Group_data = GROUP Relation_name BY AGE

Order statement is used to display the contents of a relation in sorted order based on one or more fields.
Example: Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC)

Distinct statement removes duplicate records; it works on entire records, not on individual fields.
Example: Relation_2 = DISTINCT Relation_name1

Q: What are the relational operators in Pig?

COGROUP - Joins two or more tables and then performs GROUP operation on
the joined table result.
CROSS - It is used to compute the cross product (cartesian product) of two or
more relations.
FOREACH - It will iterate through the tuples of a relation, generating a data
transformation
JOIN - Used to join two or more tables in a relation
LIMIT - It will limit the number of output tuples
SPLIT - This will split the relation into two or more relations
UNION - It will merge the contents of two relations
ORDER - Used to sort a relation based on one or more fields

Q: What is the use of filters in Apache Pig?

A: The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.
Example: filter the products whose quantity is greater than 1000.
A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);
B = FILTER A BY quantity > 1000;

"phone_sales" data:
year, product, quantity
----------------------------


2000, phone, 1000


2001, phone, 1500
2002, phone, 1700
2003, phone, 1200
2004, phone, 800
2005, phone, 900

Q: Suppose there is a file called “test.txt” having 150 records in HDFS.


Write a PIG command to retrieve the first 10 records of the file.

A: We need to use the LIMIT operator to retrieve the first 10 records from the file.
Load the data in Pig:
test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2,….);
Limit the data to the first 10 records:
Limit_test_data = LIMIT test_data 10;


HBase Questions

Q: What are the key components of HBase?

A:
Region Server
A region server contains HBase tables that are divided horizontally into "Regions" based on their key values. It runs on every node and decides the size of the region. Each region server is a worker node which handles read, write, update, and delete requests from clients.

HMaster
It assigns regions to RegionServers for load balancing, and monitors and manages the Hadoop cluster. Whenever a client wants to change the schema or perform any metadata operations, HMaster is used.

ZooKeeper
It provides a distributed coordination service to maintain server state in the cluster. It maintains which servers are alive and available, and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating if they are ready for read and write operations.

Q: Explain what row keys and column families are in HBase.

A: The row key acts as a primary key for any HBase table. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.

Column families consist of a group of columns which are defined during table creation, and each column family has certain column qualifiers separated by a delimiter.


Q: Why do we need to disable a table in HBase and how do you do it?

A: An HBase table is disabled to allow it to be modified or to change its settings. When a table is disabled, it cannot be accessed through the scan command.

To disable the employee table, use the command:
disable 'employee_table'

To check if the table is disabled, use the command:
is_disabled 'employee_table'

Q: Write the code to open a connection in HBase.

A: The following code is used to open a connection in HBase:

Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");

Q: What does replication mean in terms of HBase?

A: The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.

The following commands alter the hbase1 table and set the replication_scope to 1. A replication_scope of 0 indicates that the table is not replicated.

disable 'hbase1'
alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'hbase1'

Q: Can we do import/export in an HBase table?

A: Yes, it is possible to import and export tables from one HBase cluster to another.

HBase export utility:
hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"
Example: hbase org.apache.hadoop.hbase.mapreduce.Export "employee_table" "/export/employee_table"

HBase import utility:
create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"
Example:
create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "emp_table_import" "/export/employee_table"

Q: What do you mean by compaction in HBase?

A: Compaction is the process of merging HBase files into a single file. This is done to reduce the amount of memory required to store the files and the number of disk seeks needed. Once the files are merged, the original files are deleted.

Q: How does the Bloom filter work?

A: The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-col cell. The Bloom filter is named after its creator, Burton Howard Bloom. It is a data structure which predicts whether a given element is a member of a set of data. These filters provide an in-memory index structure that reduces disk reads and determines the probability of finding a row in a particular file.

Q: Does HBase have any concept of namespace?

A: A namespace is a logical grouping of tables, analogous to a database in RDBMS. You can map an HBase namespace to the schema of an RDBMS database.

To create a namespace, use the command:
create_namespace 'namespacename'

To list all the tables that are members of a namespace, use the command:
list_namespace_tables 'default'

To list all the namespaces, use the command:
list_namespace

Q: How does the Write Ahead Log (WAL) help when a RegionServer crashes?

A: If a RegionServer hosting a MemStore crashes, the data that existed in memory but was not yet persisted is lost. HBase recovers against that by writing to the WAL before the write completes. The HBase cluster keeps a WAL to record changes as they happen. If HBase goes down, the data that was not yet flushed from the MemStore to the HFile can be recovered by replaying the WAL.

Q: Write the HBase commands to list the contents of a table and to update the column families of a table.

A: The following command is used to list the contents of an HBase table:
scan 'table_name'
Example: scan 'employee_table'

To update column families in the table, use the command:
alter 'table_name', 'column_family_name'
Example: alter 'employee_table', 'emp_address'

Q: What are catalog tables in HBase?

A: The catalog has two tables: hbase:meta and -ROOT-.

The catalog table hbase:meta exists as an HBase table and is filtered out of the
HBase shell’s list command. It keeps a list of all the regions in the system and
the location of hbase:meta is stored in ZooKeeper. The -ROOT- table keeps
track of the location of the .META table.

Q: What is Hotspotting in HBase and how to avoid it?

A: In HBase, all read and write requests should be uniformly distributed across all of the regions in the RegionServers. Hotspotting occurs when a given region serviced by a single RegionServer receives most or all of the read or write requests.
Hotspotting can be avoided by designing the row key in such a way that data
being written should go to multiple regions across the cluster. Below are the
techniques to do so:
Salting
Hashing
Reversing the key

Sqoop Questions

Q: How is Sqoop different from Flume?

A:
Sqoop:
Works with RDBMS and NoSQL databases to import and export data.
Loading data in Sqoop is not event-driven.
Works with structured data sources, and Sqoop connectors are used to fetch data from them.
It imports data from an RDBMS onto HDFS and exports it back to the RDBMS.

Flume:
Works with streaming data that is generated continuously in the Hadoop environment, for example, log files.
Loading data in Flume is completely event-driven.
Fetches streaming data, like tweets or log files, from web servers or application servers.
Data flows from multiple channels into HDFS.

Q: What are the default file formats to import data using Sqoop?

Delimited Text File Format


This is the default import format and can be specified explicitly using the --as-
textfile argument. This argument will write string-based representations of
each record to the output files, with delimiter characters between individual
columns and rows.
1, here is a message,2010-05-01


2, strive to learn,2010-01-01
3, third message,2009-11-12

SequenceFile Format
SequenceFile is a binary format that stores individual records in custom
record-specific data types. These data types are manifested as Java classes.
Sqoop will automatically generate these data types for you. This format
supports exact storage of all data in binary representations, and is appropriate
for storing binary data.
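A hedged example of asking for the SequenceFile format at import time (connection string, credentials, and paths are placeholders):

sqoop import --connect jdbc:mysql://localhost/testdb --username root -P \
  --table employee --as-sequencefile --target-dir /user/test/employee_seq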

Q: What is the importance of eval tool in Sqoop?

A: Sqoop eval tool allows users to execute user-defined queries against respective
database servers and preview the result in the console.
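For instance (a sketch with placeholder connection details):

sqoop eval --connect jdbc:mysql://localhost/testdb --username root -P \
  --query "SELECT * FROM employee LIMIT 5"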

Q: How does Sqoop import and export data between RDBMS and HDFS? Explain with the Sqoop architecture.

A:
1. Sqoop introspects the database to gather the metadata (primary key information).
2. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.


Q: Suppose you have a database "test_db" in MySQL. Write the command to connect to this database and import tables into Sqoop.

A: The following commands show how to connect to the test_db database and import the test_demo table present in it using Sqoop.
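A hedged sketch of those commands (host, credentials, and target directory are placeholders):

# list the tables available in test_db
sqoop list-tables --connect jdbc:mysql://localhost/test_db --username root -P

# import the test_demo table into HDFS
sqoop import --connect jdbc:mysql://localhost/test_db --username root -P \
  --table test_demo --target-dir /user/test/test_demo -m 1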

Q: Explain how to export a table back to RDBMS with an example.

A: Suppose there is a "departments" table in "retail_db" which has already been imported into HDFS using Sqoop, and you need to export this table back to the RDBMS.

1. Create a new "dept" table in the RDBMS (MySQL) to export into.
2. Export the "departments" data from HDFS into the "dept" table.
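A hedged sketch of step 2 (the export directory and credentials are placeholders; the dept table is assumed to already exist in MySQL with a matching schema):

sqoop export --connect jdbc:mysql://localhost/retail_db --username root -P \
  --table dept --export-dir /user/hdfs/departments \
  --input-fields-terminated-by ','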

Q: What is the role of JDBC driver in Sqoop setup? Is the JDBC driver enough to
connect Sqoop to the database?

A: JDBC driver is a standard Java API used for


accessing different databases in RDBMS using
Sqoop. Each database vendor is responsible
for writing their own implementation that will
communicate with the corresponding database
with its native protocol. Each user needs to
download the drivers separately and install them
onto Sqoop prior to its use.
JDBC driver alone is not enough to connect
Sqoop to the database. We also need connectors


to interact with the different databases. A connector is a pluggable piece that


is used to fetch metadata and allows Sqoop to overcome the differences in
SQL dialects supported by various databases along with providing optimized
data transfer.

Q: How will you update the columns that are already exported? Write Sqoop
command to show all the databases in MySQL server.

A: To update a column of a table which is already exported, we use the --update-key parameter.
The following is an example:
sqoop export --connect jdbc:mysql://localhost/dbname --username root \
  --password cloudera --export-dir /input/dir \
  --table test_demo --fields-terminated-by "," \
  --update-key column_name
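For the second part of the question, the databases on the MySQL server can be listed with the list-databases tool (a sketch with placeholder credentials):

sqoop list-databases --connect jdbc:mysql://localhost --username root -P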

Q: What is Codegen in Sqoop?

A: Codegen tool in Sqoop generates Data Access Object (DAO) Java classes that
encapsulate and interpret imported records.
The following example generates Java code for an “employee” table in the
“testdb” database.
$ sqoop codegen \
--connect jdbc:mysql://localhost/testdb \
--username root \
--table employee

Q: Can Sqoop be used to convert data in different formats? If no, which tools
can be used for this purpose?

A: Yes, Sqoop can be used to convert data into different formats. This depends
on the different arguments that are used for importing.
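For example (a hedged sketch; connection details and paths are placeholders), the output format is chosen with the --as-* import arguments:

sqoop import --connect jdbc:mysql://localhost/testdb --username root -P \
  --table employee --as-avrodatafile --target-dir /user/test/employee_avro
# other options include --as-textfile, --as-sequencefile, and --as-parquetfile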


GET AHEAD IN YOUR HADOOP CAREER
These Hadoop questions and answers are just some of the examples of
what you can come across while interviewing in the Big Data field. If you
want to learn Big Data Hadoop in detail, check out the Big Data Engineer
Master’s Program or go through the courses:

Big Data Hadoop Certification Training: Our Big Data Hadoop training
course lets you master the concepts of the Hadoop framework, preparing
you for Cloudera’s CCA175 Hadoop Certification Exam. Learn how various
components of the Hadoop ecosystem fit into the Big Data processing
lifecycle.

Big Data Hadoop Administrator Certification Training: Big Data


analysis is emerging as a key advantage in business intelligence for many
organizations. Our Big Data and Hadoop Administrator training course lets
you deep-dive into the concepts of Big Data, equipping you with the skills
required for Hadoop administration roles.
These courses help you achieve thorough expertise in Big Data and
Hadoop, and will be a valuable resource when looking for that new,
rewarding career in the booming field of Big Data!


INDIA
Simplilearn Solutions Pvt Ltd.
# 53/1 C, Manoj Arcade, 24th Main, Harlkunte
2nd Sector, HSR Layout
Bangalore: 560102
Call us at: 1800-212-7688

USA
Simplilearn Americas, Inc.
201 Spear Street, Suite 1100,
San Francisco, CA 94105
United States
Phone No: +1-844-532-7688

www.simplilearn.com
