MODULE 2

Syllabus:
Introduction to Hadoop (T1): Introduction, Hadoop and its Ecosystem, Hadoop Distributed File System, MapReduce Framework and Programming Model, Hadoop Yarn, Hadoop Ecosystem Tools.
Hadoop Distributed File System Basics (T2): HDFS Design Features, Components, HDFS User Commands.
Essential Hadoop Tools (T2): Using Apache Pig, Hive, Sqoop, Flume, Oozie, HBase.

Introduction to Hadoop:
Introduction:
Hadoop is an Apache open source framework written in Java that allows distributed processing of large
datasets across clusters of computers using simple programming models. The Hadoop framework
application works in an environment that provides distributed storage and computation across clusters
of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering
local computation and storage.
In a centralized computing programming model, data is transferred from multiple distributed data sources to a central server. Analyzing, reporting, visualizing and business-intelligence tasks compute centrally, and the collected data are the inputs to the central server.
An enterprise collects and analyzes data at the enterprise level.
Big Data Store Model:
The model for a Big Data store is as follows: data is stored in a file system consisting of data blocks (the physical division of data). The data blocks are distributed across multiple nodes. DataNodes are at the racks of a cluster. Racks are scalable. A rack has multiple DataNodes (data servers), and each cluster is arranged in a number of racks.
The Hadoop system uses this data store model of files in DataNodes in racks in the clusters, in which storage is organized at clusters, racks, DataNodes and data blocks. Data blocks replicate at the DataNodes such that a failure of a link leads to access of the data block from the other nodes holding replicas at the same or other racks.
Big Data Programming Model:
In the Big Data programming model, application jobs and tasks (or sub-tasks) are scheduled on the same servers which store the data for processing.
A job means running an assignment of a set of instructions for processing. For example, processing the queries in an application and sending the result back to the application is a job. Another example is a set of instructions for sorting the examination performance data.
Hadoop and its Ecosystem:
Apache initiated the project for developing a storage and processing framework for Big Data storage and processing. Doug Cutting and Mike Cafarella, the creators, named that framework Hadoop. Cutting's son was fascinated by a stuffed toy elephant named Hadoop, and this is how the name Hadoop was derived.
The project consisted of two components: one for data store in blocks in the clusters and the other for computations at each individual cluster in parallel with another.
Hadoop components are written in Java with part of the native code in C. The command line utilities are written in shell scripts.
The infrastructure consists of cloud for clusters. A cluster consists of sets of computers or PCs. The Hadoop platform provides a low cost Big Data platform, which is open source and uses cloud services. Terabytes of data processing takes just a few minutes. Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce. The system characteristics are scalable, self-manageable, self-healing and distributed file system.
Hadoop core components:
The following diagram shows the core components of the Apache Software Foundation’s Hadoop
framework.
The Hadoop core components of the framework are:
1. Hadoop Common - The common module contains the libraries and utilities that are required by the other
modules of Hadoop. For example, Hadoop common provides various components and interfaces for distributed
file system and general input/output. This includes serialization, Java RPC (Remote Procedure Call) and file-based data structures.
2. Hadoop Distributed File System (HDFS) - A Java-based distributed file system which can store all kinds of
data on the disks at the clusters.
3. MapReduce v1 - Software programming model in Hadoop 1 using Mapper and Reducer. The v1 processes
large sets of data in parallel and in batches.
4. YARN - Software for managing computing resources. The user application tasks or sub-tasks run in parallel on Hadoop; YARN uses scheduling and handles the requests for the resources in distributed running of the tasks.
5. MapReduce v2 – Hadoop 2 YARN based system for parallel processing of large datasets and
distributed processing of the application tasks.
Features of Hadoop:
Hadoop features are as follows:
1. Fault-efficient scalable, flexible and modular design which uses simple and modular programming
model. The system provides servers at high scalability. The system is scalable by adding new nodes to
handle larger data. Hadoop proves very helpful in storing, managing, processing and analysing Big
Data. Modular functions make the system flexible. One can add or replace components at ease.
Modularity allows replacing its components for a different software tool.
2. Robust design of HDFS: Execution of Big Data applications continues even when an individual server or cluster fails. This is because of Hadoop provisions for backup (due to replication at least three times for each data block) and a data recovery mechanism. HDFS thus has high reliability.
3. Store and process Big Data: Processes Big Data of 3V characteristics.
4. Distributed clusters computing model with data locality: Processes Big Data at high speed as the
application tasks and sub-tasks submit to the DataNodes. One can achieve more computing power by
increasing the number of computing nodes. The processing splits across multiple DataNodes (servers),
and thus fast processing and aggregated results.
5. Hardware fault-tolerant: A fault does not affect data and application processing. If a node goes down, the
other nodes take care of the residue. This is due to multiple copies of all data blocks which replicate
automatically. Default is three copies of data blocks.
6. Open-source framework: Open source access and cloud services enable large data stores. Hadoop uses a cluster of multiple inexpensive servers or the cloud.
7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux but it has its own set of shell commands support.
Hadoop Ecosystem Components:

Hadoop main components and ecosystem components

The four layers are:
1. Distributed storage layer
2. Resource manager layer for job or application sub-task scheduling and execution
3. Processing framework layer, consisting of Mapper and Reducer for the MapReduce process flow
4. APIs at the application support layer.

Hadoop Distributed File System:
HDFS Data Storage:
 Hadoop data store concept implies storing the data at a number of clusters. Each cluster has a number of data stores, called racks.
 Each rack stores a number of DataNodes. Each DataNode has a large number of data blocks. The racks distribute across a cluster. The nodes have processing and storage capabilities.
 The nodes have the data in data blocks to run the application tasks. The data blocks replicate by default at least on three DataNodes in same or remote nodes.
 Data at the stores enable running the distributed applications including analytics, data mining and OLAP using the clusters. A file containing the data divides into data blocks.
 A data block default size is 64 MB (the HDFS division of files concept is similar to Linux or the virtual memory page in Intel x86 and Pentium processors, where the block size is fixed and is of 4 KB).
Hadoop HDFS features are as follows:
(i) Create, append, delete, rename and attribute modification functions
(ii) Content of an individual file cannot be modified or replaced but appended with new data at the end of the file
(iii) Write once but read many times during usages and processing
(iv) Average file size can be more than 500 MB.
Example:
Consider a data storage for University students. Each student's data, stuData, is in a file of size less than 64 MB (1 MB = 2^20 B). A data block stores the full file data for a student of stuData_idN, where N = 1 to 500.
(i) How will the files of each student be distributed at a Hadoop cluster? How many students' data can be stored at one cluster? Assume that each rack has two DataNodes for processing, each of 64 GB (1 GB = 2^30 B) memory. Assume that the cluster consists of 120 racks, and thus 240 DataNodes.
(ii) What is the total memory capacity of the cluster in TB (1 TB = 2^40 B) and of the DataNodes in each rack?
(iii) Show the distributed blocks for students with ID = 96 and 1025. Assume default replication in the DataNodes = 3.
(iv) What shall be the changes when a stuData file size ≤ 128 MB?
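A possible working for part (ii), assuming the figures stated above (240 DataNodes of 64 GB each, two per rack): total cluster capacity = 240 × 64 GB = 15,360 GB = 15 TB, and each rack holds 2 × 64 GB = 128 GB.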
A Hadoop cluster example and the replication of data blocks in racks for two students of IDs 96 and 1025
Hadoop Physical Organization:
A few nodes in a Hadoop cluster act as NameNodes. These nodes are termed MasterNodes or simply masters. The masters have a different configuration supporting high DRAM and processing power. The masters have much less local storage. The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. These nodes are referred to as slave nodes or slaves. The slaves have lots of disk storage and moderate amounts of processing capabilities and DRAM. Slaves are responsible to store the data and process the computation tasks submitted by the clients.
The following Figure shows the client, master NameNode, primary and secondary MasterNodes and slave
nodes in the Hadoop physical architecture.
Clients as the users run the application with the help of Hadoop ecosystem projects. For example, Hive,
Mahout and Pig are the ecosystem's projects. They are not required to be present at the Hadoop cluster.
A single MasterNode provides HDFS, MapReduce and Hbase using threads in small to medium sized
clusters. When the cluster size is large, multiple servers are used, such as to balance the load. The
secondary NameNode provides NameNode management services and Zookeeper is used by HBase for
metadata storage.
The client, master NameNode, master nodes and slave nodes
The MasterNode fundamentally plays the role of a coordinator. The MasterNode receives client
connections, maintains the description of the global file system namespace, and the allocation of file
blocks. It also monitors the state of the system in order to detect any failure. The Masters consist of three components: NameNode, Secondary NameNode and JobTracker. The NameNode stores all the file system related information such as:
 The file section is stored in which part of the cluster
 Last access time for the files
 User permissions like which user has access to the file.
The Secondary NameNode is an alternate for the NameNode. The secondary node keeps a copy of the NameNode metadata.

MapReduce Framework and Programming Model:
Mapper means software for doing the assigned task after organizing the data blocks imported using the keys.
Reducer means software for reducing the mapped data by using the aggregation, query or user specified
function.
Hadoop MapReduce framework:
MapReduce provides two important functions. The distribution of job based on client application task
or users query to various nodes within a cluster is one function. The second function is organizing and
reducing the results from each node into a cohesive response to the application or answer to the query.
The processing tasks are submitted to Hadoop. The Hadoop framework in turn manages the task of issuing jobs, job completion, and copying data around the cluster between the DataNodes with the help of the JobTracker.
A client node submits a request of an application to the JobTracker. A JobTracker is a Hadoop daemon
(background program).
The following are the steps on the request to MapReduce:
(i) estimate the need of resources for processing that request
(ii) analyze the states of the slave nodes
(iii) place the mapping tasks in queue
(iv) monitor the progress of the task, and on failure, restart the task on slots of time available.
The job execution is controlled by two types of processes in MapReduce:
1. The Mapper deploys map tasks on the slots. Map tasks are assigned to those nodes where the data for the application is stored. The Reducer output transfers to the client node after the data serialization using AVRO.
2. The Hadoop system sends the Map and Reduce jobs to the appropriate servers in the cluster. The Hadoop framework in turn manages the task of issuing jobs, job completion and copying data around the cluster between the slave nodes. Finally, the cluster collects and reduces the data to obtain the result and sends it back to the Hadoop server after completion of the given tasks.
MapReduce Programming Model:
A MapReduce program can be written in any language including Java, C++ PIPES or Python. The Map function of a MapReduce program does mapping to compute the data and convert the data into other data sets (distributed in HDFS). After the Mapper computations finish, the Reducer function collects the result of the map and generates the final output result. A MapReduce program can be applied to any type of data, i.e., structured or unstructured, stored in HDFS.
 The input data is in the form of a file or directory and is stored in HDFS.
 The MapReduce program performs two jobs on this input data, the Map job and the Reduce job. They are also termed the two phases: Map phase and Reduce phase.
 The map job takes a set of data and converts it into another set of data. The individual elements are broken down into tuples (key/value pairs) in the resultant set of data.
 The reduce job takes the output from a map as input and combines the data tuples into a smaller set of tuples.
 Map and reduce jobs run in isolation from one another. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job (a minimal command-line sketch follows this list).
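As a minimal sketch of the two phases using Hadoop Streaming (the streaming JAR path and the HDFS input/output directories below are assumptions that vary by installation), the Mapper simply passes each input line through as a record, the framework sorts the map output by key between the phases, and the Reducer runs wc over the sorted map output and emits line, word and byte counts:

Syntax: $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hdfs/books \
  -output /user/hdfs/wc-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

The result can then typically be inspected with hdfs dfs -cat /user/hdfs/wc-out/part-00000.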
Hadoop YARN:
YARN is a resource management platform. It manages computing resources. YARN manages the schedules for running the sub-tasks.
Hadoop 2 Execution Model:
The following figure shows the YARN-based execution model and its components, namely the Client, Resource Manager (RM), Node Manager (NM), Application Master (AM) and Containers.

YARN-based execution model
The list of actions of YARN resource allocation and scheduling functions is as follows (an illustrative command-line check follows this list):
 A MasterNode has two components: (i) Job History Server and (ii) Resource Manager (RM).
 A Client Node submits the request of an application to the RM. The RM is the master. One RM exists per cluster. The RM keeps information of all the slave NMs.
 Multiple NMs are at a cluster. An NM creates an AM instance (AMI) and starts up. The AMI initializes itself and registers with the RM. Multiple AMIs can be created in an AM.
 The AMI performs the role of an Application Manager (ApplM) that estimates the resource requirements for running an application program or sub-task. The ApplMs send their requests for the necessary resources to the RM.
 The NM is a slave of the infrastructure. It signals whenever it initializes. All active NMs send the controlling signal periodically to the RM signaling their presence.
 Each NM assigns a container(s) for each AMI. The container(s) assigned at an instance may be at the same NM or another NM. An ApplM uses just a fraction of the resources available. The ApplM at an instance uses the assigned container(s) for running the application sub-task.
 The RM allots the resources to the AM, and thus to the ApplMs, for using assigned containers on the same or other NM for running the application sub-tasks in parallel.
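The same components can be observed on a running cluster with the YARN command line (output depends on the installation): yarn node -list reports the registered NodeManagers, and yarn application -list reports the applications whose ApplicationMasters the RM has scheduled.

Syntax: $ yarn node -list
Syntax: $ yarn application -list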

Hadoop Ecosystem Tools:
Zookeeper:
ZooKeeper in Hadoop behaves as a centralized repository where distributed applications can write data at a node called a znode and read the data out of it. ZooKeeper uses synchronization, serialization and coordination activities. It enables functioning of a distributed system as a single function.
ZooKeeper's main coordination services are:
1 Name service - A name service maps a name to the information associated with that name. For example, the DNS service is a name service that maps a domain name to an IP address. Similarly, a name service keeps track of servers or services that are up and running, and looks up their status by name.
2 Concurrency control - Concurrent access to a shared resource may cause inconsistency of the
resource. A concurrency control algorithm accesses shared resource in the distributed system and
controls concurrency.
3 Configuration management - A requirement of a distributed system is a central configuration manager. A new joining node can pick up the up-to-date centralized configuration from the ZooKeeper coordination service as soon as the node joins the system.
4 Failure handling - Distributed systems are susceptible to the problem of node failures. This requires implementing an automatic recovery strategy by selecting some alternate node for processing.
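As a hedged illustration of the configuration-management use (the znode path /app-config and its data are assumptions, not part of the source), a node can publish settings with the ZooKeeper command-line client and any newly joined node can read them:

Syntax: $ zkCli.sh -server localhost:2181
create /app-config "replication=3"
get /app-config
ls /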
Oozie:
Apache Oozie is an open-source project of Apache that schedules Hadoop jobs. An efficient process for job handling is required. Analysis of Big Data requires creation of multiple jobs and sub-tasks in a process. The Oozie design provisions the scalable processing of multiple jobs. Thus, Oozie provides a way to package and bundle multiple coordinator and workflow jobs, and manage the lifecycle of those jobs.
Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs), specifying a sequence of actions. Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and data.
Oozie provisions for the following:
1. Integrates multiple jobs in a sequential manner
2. Stores and supports Hadoop jobs for MapReduce, Hive, Pig, and Sqoop
3. Runs workflow jobs based on time and data triggers
4. Manages batch coordinator for the applications
5. Manages the timely execution of tens of elementary jobs lying in thousands of workflows in a Hadoop cluster.

Sqoop:
The loading of data into Hadoop clusters becomes an important task during data analytics. Apache Sqoop is a tool built for efficiently loading voluminous amounts of data between Hadoop and external data stores. Sqoop initially parses the arguments passed in the command line and prepares the map task. The map task initializes multiple Mappers depending on the number supplied by the user in the command line. Each map task is assigned a part of the data to be imported, based on the key defined in the command line. Sqoop distributes the input data equally among the Mappers. Then each Mapper creates a connection with the database using JDBC, fetches the part of the data assigned by Sqoop, and writes it into HDFS/Hive/HBase as per the choice provided in the command line.
Sqoop provides the mechanism to import data from external data stores into HDFS. Sqoop relates to Hadoop ecosystem components, such as Hive and HBase. Sqoop can also extract data from Hadoop or other ecosystem components.
Sqoop provides a command line interface to its users and can also be accessed using Java APIs. The tool allows defining the schema of the data for import. Sqoop exploits the MapReduce framework to import and export the data, and transfers for parallel processing of sub-tasks. Sqoop provisions for fault tolerance. Parallel transfer of data results in fast data transfer.
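A minimal import sketch in the style of the commands used elsewhere in these notes (the JDBC URL, credentials, table name and target directory are illustrative assumptions):

Syntax: $ sqoop import --connect jdbc:mysql://dbhost/university \
  --username dbuser -P \
  --table stuData --target-dir /user/hdfs/stuData -m 4

Here -P prompts for the password, and -m 4 asks Sqoop to run four parallel Mappers.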
Flume:
Apache Flume provides a distributed, reliable and available service. Flume efficiently collects,
aggregates and transfers a large amount of streaming data into HDFS. Flume enables upload of large
files into Hadoop clusters.
The features of Flume include robustness and fault tolerance. Flume provides data transfer which is
reliable and provides for recovery in case of failure. Flume is useful for transferring a large amount of
data in applications related to logs of network traffic, sensor data, geo-location data, e-mails and social-
media messages.
Apache Flume has the following four important components:
1. Sources, which accept data from a server or an application.
2. Sinks, which receive data and store it in the HDFS repository or transmit the data to another source. Data units that are transferred over a channel from source to sink are called events.
3. Channels, which connect between sources and sinks by queuing event data for transactions. The size of events data is usually 4 KB. The data source is considered to be a source of various sets of events. Sources listen for events and write events to a channel. Sinks basically write event data to a target and remove the event from the queue.
4. Agents, which run the sinks and sources in Flume. The interceptors drop the data or transfer data as it flows into the system.

Hadoop Distributed File System Basics (T2)
Hadoop Distributed File System Design Features
The Hadoop Distributed File System (HDFS) was designed for Big Data processing. Although capable of supporting many users simultaneously, HDFS is not designed as a true parallel file system. Rather, the design assumes a large file write-once/read-many model. HDFS rigorously restricts data writing to one user at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written. The design of HDFS is based on the design of the Google File System (GFS). HDFS is designed for data streaming, where large amounts of data are read from disk in bulk. The HDFS block size is typically 64 MB or 128 MB. Thus, this approach is unsuitable for standard POSIX file system use.

Due to the sequential nature of the data, there is no local caching mechanism. The large block and file sizes make it more efficient to reread data from HDFS than to try to cache the data. A principal design aspect of Hadoop MapReduce is the emphasis on moving the computation to the data rather than moving the data to the computation. In other high performance systems, a parallel file system will exist on hardware separate from the compute hardware. Data is then moved to and from the computer components via high-speed interfaces to the parallel file system array. Finally, Hadoop clusters assume node failure will occur at some point. To deal with this situation, HDFS has a redundant design that can tolerate system failure and still provide the data needed by the compute part of the program.

The following points are important aspects of HDFS:

 The write-once/read-many design is intended to facilitate streaming reads.
 Files may be appended, but random seeks are not permitted. There is no caching of data.
 Converged data storage and processing happen on the same server nodes.
 "Moving computation is cheaper than moving data."
 A reliable file system maintains multiple copies of data across the cluster.
 Consequently, failure of a single node will not bring down the file system.
 A specialized file system is used, which is not designed for general use.
HDFS components:
Various system roles in an HDFS deployment
Reference: Douglas Eadline,"Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache
Hadoop 2 Ecosystem", 1st Edition, Pearson Education, 2016

The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes. In a basic design, the NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes. No data is actually stored on the NameNode. The design is a master/slave architecture in which the master (NameNode) manages the file system namespace and regulates access to files by clients. File system namespace operations such as opening, closing and renaming files and directories are all managed by the NameNode. The NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.

The slaves (DataNodes) are responsible for serving read and write requests from the file system to the clients. The NameNode manages block creation, deletion and replication. When a client writes data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data. As part of the storage process, the data blocks are replicated after they are written to the assigned node.
Depending on how many nodes are in the cluster, the NameNode will attempt to write replicas of the data blocks on nodes that are in other separate racks. If there is only one rack, then the replicated blocks are written to other servers in the same rack. After the DataNode acknowledges that the file block replication is complete, the client closes the file and informs the NameNode that the operation is complete. Note that the NameNode does not write any data directly to the DataNodes. It does, however, give the client a limited amount of time to complete the operation. If it does not complete in the time period, the operation is cancelled.

The client requests a file from the NameNode, which returns the best DataNodes from which to read the data. The client then accesses the data directly from the DataNodes. Thus, once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed. While data transfer is progressing, the NameNode also monitors the DataNodes by listening for heartbeats sent from the DataNodes. The lack of a heartbeat signal indicates a node failure. Hence the NameNode will route around the failed DataNode and begin re-replicating the now-missing blocks. The mappings between data blocks and physical DataNodes are not kept in persistent storage on the NameNode; the NameNode stores all metadata in memory. In almost all Hadoop deployments, there is a SecondaryNameNode (Checkpoint Node). It is not an active failover node and cannot replace the primary NameNode in case of its failure.

 Thus the various important roles in HDFS are:

 HDFS uses a master/slave model designed for large file reading or streaming.
 The NameNode is a metadata server or "data traffic cop".
 HDFS provides a single namespace that is managed by the NameNode.
 Data is redundantly stored on DataNodes; there is no data on the NameNode.
 The SecondaryNameNode performs checkpoints of the NameNode file system's state but is not a failover node.

HDFS user commands:
 List Files in HDFS
 To list the files in the root HDFS directory, enter the following command:
Syntax: $ hdfs dfs -ls /
Output:
Found 2 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps
 To list files in your home directory, enter the following command:
Syntax: $ hdfs dfs -ls
Output:
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2015-05-24 20:06 bin
drwxr-xr-x - hdfs hdfs 0 2015-04-29 16:52 examples
 Make a Directory in HDFS
 To make a directory in HDFS, use the following command. As with the -ls command, when no path is supplied, the user's home directory is used.
Syntax: $ hdfs dfs -mkdir stuff
 Copy Files to HDFS
 To copy a file from your current local directory into HDFS, use the following command. If a full path is not supplied, your home directory is assumed. In this case, the file test is placed in the directory stuff that was created previously.
Syntax: $ hdfs dfs -put test stuff
 The file transfer can be confirmed by using the -ls command:
Syntax: $ hdfs dfs -ls stuff
Output:
Found 1 items
-rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test
 Copy Files from HDFS
 Files can be copied back to your local file system using the following command.
 In this case, the file we copied into HDFS, test, will be copied back to the current local directory with the name test-local.
Syntax: $ hdfs dfs -get stuff/test test-local
 Copy Files within HDFS
 The following command will copy a file in HDFS:
Syntax: $ hdfs dfs -cp stuff/test test.hdfs
 Delete a File within HDFS
 The following command will delete the HDFS file test.hdfs that was copied above:
Syntax: $ hdfs dfs -rm test.hdfs
Essential Hadoop Tools:
Using Apache Pig:
Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort.
Pig is often used for extract, transform, and load (ETL) data pipelines and for quick research on raw data. Apache Pig has several usage modes. The first is a local mode in which all processing is done on the local machine. The non-local (cluster) modes are MapReduce and Tez.
These modes execute the job on the cluster using either the MapReduce engine or the optimized Tez engine.
There are also interactive and batch modes available; they enable Pig applications to be developed locally in interactive modes, using small amounts of data, and then run at scale on the cluster in a production mode. The modes are shown in the figure below, and a brief command-line sketch follows.
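A hedged sketch of selecting the execution mode from the command line (the script name wordcount.pig is an illustrative assumption); running pig with no script starts the interactive Grunt shell:

Syntax: $ pig -x local wordcount.pig
Syntax: $ pig -x mapreduce wordcount.pig
Syntax: $ pig -x tez wordcount.pig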
Apache Hive:
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL.
Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop.
Some essential features:
 Tools to enable easy data extraction, transformation, and loading (ETL)
 A mechanism to impose structure on a variety of data formats
 Access to files stored either directly in HDFS or in other data storage systems such as HBase
 Query execution via MapReduce and Tez (optimized MapReduce)
Hive is also installed as part of the Hortonworks HDP Sandbox. To work in Hive with Hadoop, a user with access to HDFS can run Hive queries.
Simply enter the hive command. If Hive starts correctly, you get a hive> prompt.
$ hive
(some messages may show up here)
hive>
The Hive commands to create and drop a table are shown below. Note that Hive commands must end with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
To see that the table is created:
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
To drop the table:
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
Apache Sqoop:
Sqoop is a tool designed to transfer data between Hadoop and relational databases.

Sqoop is used to
- import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS),
- transform the data in Hadoop, and
- export the data back into an RDBMS.
Sqoop import method:

Sqoop import
The data import is done in two steps:
1) Sqoop examines the database to gather the necessary metadata for the data to be imported.

2) A map-only Hadoop job transfers the actual data using the metadata.

The imported data are saved in an HDFS directory; the user can specify the directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating different records.

Sqoop Export method:
Data export from the cluster works in a similar fashion. The export is done in two steps:
1) Sqoop examines the database for metadata.

2) A map-only Hadoop job writes the data to the database.

Sqoop divides the input dataset into splits, then uses individual map tasks to push the splits to the database.

Sqoop export
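A minimal export sketch, mirroring the earlier import example (database, table and HDFS directory are illustrative assumptions):

Syntax: $ sqoop export --connect jdbc:mysql://dbhost/university \
  --username dbuser -P \
  --table stuResults --export-dir /user/hdfs/stuResults -m 4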
Apache Flume:
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
Data transport involves a number of Flume agents that may traverse a series of machines and locations.
Flume is often used for log files, social media-generated data, email messages, and just about any
continuous data source.
Flume agent with source, channel, and sink
A Flume agent is composed of three components.
1. Source: The source component receives data and sends it to a channel. It can send the data to more than one channel.
2. Channel: A channel is a data queue that forwards the source data to the sink destination.
3. Sink: The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
A Flume agent must have all three of these components defined. A Flume agent can have several sources, channels, and sinks.
A source can write to multiple channels, but a sink can take data from only a single channel. Data written to a channel remain in the channel until a sink removes the data.
By default, the data in a channel are kept in memory but may be optionally stored on disk to prevent data loss in the event of a network failure.
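An agent is started from the command line with a properties file that names its sources, channels and sinks (the configuration file simple-agent.conf and the agent name a1 below are illustrative assumptions, not part of the source):

Syntax: $ flume-ng agent --conf conf --conf-file simple-agent.conf --name a1 -Dflume.root.logger=INFO,console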

Pipeline created by connecting Flume agents
As shown in the above figure, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.
In a Flume pipeline, the sink from one agent is connected to the source of another.
The data transfer between agents commonly uses Apache Avro, which serializes the data as part of the data exchange and is defined using JSON.
A Flume consolidation network
Apache Oozie:
Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.
For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a
workflow in which the output of one job serves as the input for a successive job. Oozie is designed to
construct and manage these workflows. Oozie is not a substitute for the YARN scheduler. That is, YARN
manages resources for individual Hadoop jobs, and Oozie provides a way to connect and control Hadoop
jobs on the cluster.
Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions. (DAGs are basically graphs that cannot have directed loops.) Three types of Oozie jobs are permitted:
 Workflow - a specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
 Coordinator - a scheduled workflow job that can run at various time intervals or when data become available.
 Bundle - a higher-level Oozie abstraction that will batch a set of coordinator jobs.
Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (e.g., Java MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (e.g., Java programs and shell scripts). Oozie also provides a CLI and a web UI for monitoring jobs.
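For example, a workflow is typically submitted and tracked from the CLI with a job.properties file that points to the workflow definition stored in HDFS (the Oozie server URL and properties file name below are illustrative assumptions):

Syntax: $ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
Syntax: $ oozie job -oozie http://localhost:11000/oozie -info <job-id>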
The following figure depicts a simple Oozie workflow. In this case, Oozie runs a basic MapReduce operation. If the application was successful, the job ends; if an error occurred, the job is killed.

A simple Oozie DAG workflow
Apache HBase:
Like Google's Bigtable, HBase leverages the distributed data storage provided by the underlying distributed file systems spread across commodity servers. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Some of the more important features include the following capabilities:
 Linear and modular scalability
 Strictly consistent reads and writes
 Automatic and configurable sharding of tables
 Automatic failover support between RegionServers
 Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
 Easy-to-use Java API for client access
HBase Data Model Overview
A table in HBase is similar to other databases, having rows and columns. Columns in HBase are grouped into column families, all with the same prefix.
 Specific HBase cell values are identified by a row key, column (column family and column), and version (timestamp).
 It is possible to have many versions of data within an HBase cell.
 A version is specified as a timestamp and is created each time data are written to a cell.
 Rows are lexicographically sorted with the lowest order appearing first in a table.
 The empty byte array denotes both the start and the end of a table's namespace.
 All table accesses are via the table row key, which is considered its primary key.
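A hedged sketch of this model in the HBase shell (the table name stuData, column family info, and the sample row are illustrative assumptions): each put records a new timestamped version of the addressed cell.

Syntax: $ hbase shell
create 'stuData', 'info'
put 'stuData', 'row96', 'info:name', 'VEENA'
get 'stuData', 'row96'
scan 'stuData'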
