An example of how to bulk import data from XML files into an HBase table.
License
Apache licensed.
HBase gives random read and write access to your big data, but getting that data into HBase can be a challenge. Writing it through the client API works, but every write has to traverse HBase's write path (the WAL, then the memstore, before it is flushed to an HFile). That is slower than bypassing the write path entirely: creating the HFiles yourself and copying them directly into HDFS.
Luckily HBase comes with bulk load capabilities, and this example demonstrates how they work. The HBase bulk load process consists of two steps:
1. HFile preparation via a MapReduce job, and
2. Importing the HFile into HBase using LoadIncrementalHFiles.doBulkLoad
The aim of the MapReduce job is to generate HBase data files (HFiles) from your input data using HFileOutputFormat. This output format writes data in HBase's internal storage format so that the resulting files can be loaded into HBase efficiently.
HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.
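To make the partitioning idea concrete, here is a minimal plain-Java sketch of how row keys could be assigned to partitions from a table's region start keys. It is not the Hadoop TotalOrderPartitioner itself and uses no HBase classes; the class name, the split points "g" and "p", and the sample row keys are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustration only: total-order partitioning of row keys by region start keys. */
final class RegionPartitionSketch {

    /** Compares byte arrays as unsigned bytes, the lexicographic order HBase uses for row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns the index of the partition whose key range contains rowKey.
     * splitPoints holds the sorted start keys of every region except the first,
     * so N split points describe N + 1 partitions.
     */
    static int partitionFor(byte[] rowKey, List<byte[]> splitPoints) {
        int partition = 0;
        for (byte[] split : splitPoints) {
            if (compareUnsigned(rowKey, split) >= 0) {
                partition++;
            } else {
                break;
            }
        }
        return partition;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical table with three regions: [start, "g"), ["g", "p"), ["p", end).
        List<byte[]> splits = Arrays.asList(bytes("g"), bytes("p"));
        for (String key : new String[] {"book-001", "hbase", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(bytes(key), splits));
        }
    }
}
```

Because every partition covers one contiguous key range, each reducer can write HFiles that fall entirely within a single region, which is what allows them to be handed over to HBase without further splitting.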
There are two ways to import the generated HFiles into an HBase table:
1. A command line tool called completebulkload.
2. A programmatic approach that uses the LoadIncrementalHFiles.doBulkLoad method to load the HFiles generated by the previous MapReduce job into the given HBase table. This is the approach used in this example.
The Mapper class emits ImmutableBytesWritable keys and KeyValue values. The subsequent partitioner and reducer use these pairs to create the HFiles.
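As a rough illustration of what the mapper has to extract, the following plain-Java sketch turns one XML record into (row key, family, qualifier, value) tuples. In the real mapper each tuple would become a KeyValue emitted under an ImmutableBytesWritable row key. The XML layout (a book element with an id attribute and simple child elements) and the class name are assumptions and are not taken from the project's test file; only the column family name bookFamily comes from this README.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustration only: flattens one hypothetical <book> record into HBase-style cells. */
final class BookXmlToCellsSketch {

    // Column family used by the 'book' table in this README.
    static final String FAMILY = "bookFamily";

    /** Returns one {rowKey, family, qualifier, value} array per child element. */
    static List<String[]> toCells(String bookXml) {
        try {
            Element book = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bookXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // Assumption: the id attribute identifies the record and serves as the row key.
            String rowKey = book.getAttribute("id");
            List<String[]> cells = new ArrayList<>();
            NodeList children = book.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other non-element nodes
                }
                // The element name becomes the column qualifier, its text the cell value.
                cells.add(new String[] {
                    rowKey, FAMILY, child.getNodeName(), child.getTextContent().trim()
                });
            }
            return cells;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse book record", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book id=\"bk101\"><author>Jane Doe</author>"
                + "<title>HBase Basics</title></book>";
        for (String[] cell : toCells(xml)) {
            System.out.println(String.join(" | ", cell));
        }
    }
}
```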
The destination HBase table is called book.
There is no need to write your own reducer: HFileOutputFormat.configureIncrementalLoad(), as used in the driver code, sets up the correct reducer and partitioner for you.
See src/test/resource/test.xml for a sample input file.
Open an HBase shell and run the following to set up the table:
create 'book','bookFamily'
The HBase table must be created before the MapReduce program is executed. Once it exists, run the job:
hadoop jar HBaseBulkLoadXml-0.0.1-SNAPSHOT-jar-with-dependencies.jar /inputDirectory/book.xml /outputDirectory/book.xml book
- First argument: input feed
- Second argument: output directory
- Third argument: HBase table (book)