Notes for BDA
1. Structured Data
• Definition:
o Data that is highly organized and stored in a predefined format (rows and
columns).
• Examples:
• Advantages:
• Disadvantages:
• Usage:
2. Unstructured Data
• Definition:
o Can include text, images, videos, audio files, social media posts, etc.
• Examples:
• Advantages:
• Disadvantages:
o Requires advanced tools like AI, ML, and deep learning for processing.
o Storage and retrieval are complex and may require NoSQL databases or cloud
solutions.
• Usage:
3. Semi-Structured Data
• Definition:
o Data that does not conform to a strict tabular structure but still contains some
organizational elements like tags and metadata.
• Examples:
o Emails (which have structured metadata like sender, recipient, timestamp but
unstructured message content).
• Advantages:
o More flexible than structured data while still retaining some level of organization.
• Disadvantages:
• Usage:
The 7 V's of Big Data
1. Volume
• Definition:
o Refers to the enormous amount of data generated every second from multiple
sources.
• Example:
o Facebook generates over 4 petabytes of data daily, including posts, images, and
videos.
2. Velocity
• Definition:
• Example:
3. Variety
• Definition:
• Example:
4. Veracity
• Definition:
• Example:
o A company analyzing social media sentiment may get biased results due to fake
or bot-generated reviews.
5. Value
• Definition:
• Example:
o Retailers use big data to predict customer preferences and enhance personalized
marketing strategies.
6. Variability
• Definition:
o Seasonal trends, context shifts, and unexpected anomalies affect data analysis.
• Example:
7. Visualization
• Definition:
• Example:
Apart from the core 7 V’s, researchers and experts have introduced more characteristics to
describe Big Data comprehensively:
• Volatility: Data relevance changes over time (e.g., trending hashtags on Twitter).
• Validity: Ensuring data is collected from credible sources (e.g., verified health records vs.
unreliable internet articles).
• Vulnerability: Data security concerns, including cyber threats and breaches (e.g.,
financial fraud detection).
• Venue: The geographical origin of data (e.g., weather data from different countries may
vary in format and accuracy).
Why Big Data Matters
1. Explosive Growth of Data
o Traditional methods cannot handle the massive increase in digital data from
social media, IoT, and smart devices.
2. Business and Competitive Advantage
o Companies that leverage Big Data gain insights into customer behavior, leading
to better decision-making.
3. Scientific and Medical Advancements
o Mainframes and relational databases handled structured data but struggled with
large volumes.
o Machine learning, deep learning, and cloud computing enhance Big Data
analytics for real-time decision-making.
o Cloud platforms like AWS, Google Cloud, and Azure provide scalable storage and
processing solutions.
Challenges Faced by Big Data
• Issue: Storing ever-growing volumes of data exceeds the capacity of traditional storage systems.
• Solution: Distributed file systems (HDFS, Amazon S3) and cloud storage improve
scalability.
• Issue: Processing data quickly enough to deliver real-time insights.
• Solution: Apache Spark, Flink, and in-memory computing enhance real-time analytics.
• Issue: Poor data quality (incomplete, inconsistent, or noisy data).
• Solution: Data preprocessing techniques, data validation, and AI-powered cleaning tools
help.
• Issue: Integrating many different data formats from diverse sources.
• Solution: ETL (Extract, Transform, Load) pipelines and data lakes manage diverse
datasets efficiently.
• Issue: Storing, managing, and processing Big Data requires expensive hardware and
skilled personnel.
1. Data Volume
• Traditional BI:
• Big Data:
o Uses distributed storage systems like Hadoop and cloud storage for scalability.
2. Data Variety
• Traditional BI:
• Big Data:
o Works with text, images, videos, social media posts, IoT sensor data, and logs.
3. Data Processing
• Traditional BI:
o Uses batch processing with pre-defined queries and structured data models.
• Big Data:
o Uses both batch processing (Hadoop) and real-time streaming (Apache Kafka,
Spark Streaming).
4. Data Storage
• Traditional BI:
o Uses Data Warehouses (e.g., SAP BW, Teradata) for historical analysis.
• Big Data:
o Uses distributed storage across multiple nodes (HDFS, NoSQL, Cloud Data Lakes).
o Data is stored in formats like JSON, Parquet, Avro, and columnar storage for
flexibility.
5. Real-Time Analytics
• Traditional BI:
o Works well for historical data analysis but struggles with real-time insights.
• Big Data:
o Uses stream processing frameworks like Apache Kafka, Flink, and Spark.
6. Scalability
• Traditional BI:
• Big Data:
o Uses cloud-based auto-scaling solutions like AWS, Google Cloud, and Azure.
7. Cost
• Traditional BI:
o Requires expensive on-premises servers and high licensing fees for enterprise
software.
• Big Data:
8. Analytics Capabilities
• Traditional BI:
• Big Data:
o Uses deep learning, NLP, and data mining to generate intelligent insights.
9. Decision-Making Approach
• Traditional BI:
• Big Data:
• Traditional BI:
• Big Data:
• Traditional BI:
• Big Data:
• Traditional BI:
• Big Data:
A Data Warehouse (DW) is a structured system that consolidates and organizes business data
for analysis and decision-making.
• Data Integration: Combines information from multiple sources into a unified structure.
• Non-Volatile (Immutable) Storage: Once loaded, data is not edited in place; the
warehouse is refreshed through scheduled batch loads.
• Source Systems: Includes databases, enterprise applications, and external data feeds.
• Query & Reporting Tools: Platforms like Power BI, Tableau, and SAP BusinessObjects
facilitate business analysis.
• MapReduce:
o A processing framework that divides tasks into smaller chunks and executes
them in parallel.
• Kafka & Flink: Used for real-time data streaming and analytics.
Limitations of Hadoop:
• High latency: MapReduce is batch-oriented and not suited for low-latency, interactive queries.
• Small-file problem: very large numbers of small files overload the NameNode's metadata.
• Complex setup and tuning compared with managed cloud services.
The rise of Big Data technologies is reshaping how businesses store, process, and analyze
information.
o Big Data solutions scale dynamically with cloud computing (AWS, Azure, Google
Cloud).
o Big Data facilitates predictive analytics and automation through AI/ML models.
Data Lakehouses: A fusion of Data Warehouses and Data Lakes to handle diverse data
workloads.
Cloud-Based Analytics: Growing adoption of serverless and managed Big Data platforms.
Edge Computing: Bringing real-time data processing closer to devices.
AI-Powered Analytics: Automating business decisions through AI-driven insights.
Self-Service Data Platforms: Enabling business users to query and analyze data independently.
1. What is Big Data Analytics?
Big Data Analytics refers to the process of examining large, complex datasets to uncover
hidden patterns, correlations, trends, and insights that aid decision-making. Unlike traditional
data analysis, Big Data Analytics handles vast amounts of structured, semi-structured, and
unstructured data at scale, often leveraging distributed computing, artificial intelligence, and
machine learning.
1. Descriptive Analytics – Summarizes past trends and events (e.g., dashboards, reports).
2. Diagnostic Analytics – Examines why events and anomalies happened (e.g., root-cause analysis).
3. Predictive Analytics – Uses historical data and machine learning to forecast future
trends.
4. Prescriptive Analytics – Recommends the best course of action based on predictions (e.g., optimization, automation).
Real-World Applications:
Despite its transformative impact, Big Data Analytics is often misunderstood. Here’s what it is
not:
Just About Large Data Volumes – While volume is a factor, Big Data Analytics is more about
deriving meaningful insights from complex datasets.
Traditional Business Intelligence (BI) – BI provides historical insights, whereas Big Data
Analytics incorporates real-time analysis, AI, and predictive modeling.
A One-Size-Fits-All Approach – There are multiple tools and methods tailored to different
industries and needs.
Purely a Technology Upgrade – It involves a strategic shift in how data is collected, processed,
and used for decision-making.
Only for Large Enterprises – Small and medium businesses also leverage Big Data solutions via
cloud-based platforms.
Always Accurate – Data quality, bias, and ethical concerns must be addressed to ensure
meaningful and unbiased analytics.
The growing buzz around Big Data Analytics is driven by several technological advancements,
market demands, and business transformations.
• Rapid increase in data from social media, IoT devices, transactions, sensors, and digital
interactions.
• By 2025, the global data volume is projected to exceed 175 zettabytes.
• Cloud platforms (AWS, Google Cloud, Azure) provide on-demand, scalable data storage.
• Data Lakehouses combine structured data warehousing with unstructured data lakes.
• Businesses need instant insights for customer engagement, fraud detection, and
logistics.
• Streaming analytics platforms like Apache Kafka and Apache Flink process data in
milliseconds.
• Industries require advanced analytics for compliance with GDPR, HIPAA, and financial
regulations.
• Risk management and anomaly detection prevent fraud, cyber threats, and financial
crimes.
Types of Big Data Analytics
Big Data Analytics is categorized into four major types based on the nature of insights it
provides:
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Each type serves a distinct purpose in analyzing data, from summarizing historical patterns to
making real-time, AI-driven decisions.
Definition:
• Descriptive Analytics summarizes historical data to show what happened over a given period.
Key Features:
Summarizes historical data using reports, charts, and key performance indicators (KPIs).
Provides a static snapshot of business performance over a period.
Does not explain causes but highlights trends.
Examples:
E-commerce Sales Reports – Online platforms like Amazon analyze past sales, customer
behavior, and website visits to identify peak shopping seasons.
Hospital Readmission Rates – Healthcare organizations analyze patient records to report
trends in hospital readmissions, helping to assess the quality of care.
Definition:
• Diagnostic Analytics digs deeper into past data to determine causes and correlations
behind trends and anomalies.
• It uses techniques like drill-down analysis, data mining, correlation analysis, and
machine learning.
Key Features:
Identifies root causes of trends and anomalies.
Uses statistical techniques to find patterns and relationships.
Answers "why" questions about business performance.
Examples:
Customer Churn Analysis – Telecom companies (e.g., Verizon, AT&T) analyze why customers
switch to competitors by examining call drop rates, service complaints, and billing issues.
Healthcare Diagnostics – Hospitals analyze electronic medical records to determine why
certain treatments lead to better patient outcomes.
Definition:
• Predictive Analytics forecasts future trends using historical data, statistical modeling,
and machine learning techniques.
• It is widely used for risk assessment, demand forecasting, fraud detection, and
personalized recommendations.
Key Features:
Uses AI/ML algorithms (e.g., regression, time-series analysis, decision trees).
Forecasts future trends based on historical data patterns.
Enables proactive decision-making before an event occurs.
Examples:
Credit Score Prediction – Banks and financial institutions (e.g., FICO, Experian) use predictive
analytics to assess customer creditworthiness and predict loan default risks.
Weather Forecasting – Meteorological agencies analyze historical climate data to predict
upcoming storms, hurricanes, and temperature fluctuations.
Definition:
• Prescriptive Analytics recommends the best course of action by combining predictions with optimization, simulation, and AI-driven decision rules.
Key Features:
Suggests optimal decisions using simulation and optimization.
Uses reinforcement learning and advanced AI models for decision automation.
Common in autonomous systems like robotics, self-driving cars, and automated trading.
Examples:
Supply Chain Optimization – Companies like Walmart use prescriptive analytics to automate
inventory management by analyzing demand fluctuations and logistics costs.
Autonomous Vehicles – Tesla’s AI-powered system makes real-time decisions based on traffic
data, pedestrian movements, and road conditions to ensure safe driving.
What is Hadoop?
Definition:
• Apache Hadoop is an open-source framework for storing and processing large datasets
across distributed computing clusters.
• It follows the MapReduce programming model for parallel data processing.
• Developed by Doug Cutting and inspired by Google’s GFS and MapReduce papers.
• Written in Java and maintained by the Apache Software Foundation.
Key Features:
Scalability – Can scale from a single server to thousands of machines.
Fault tolerance – Automatically recovers from node failures.
Cost-effectiveness – Uses commodity hardware to store and process data.
Flexibility – Supports structured, semi-structured, and unstructured data.
1. Hadoop Distributed File System (HDFS)
Definition:
• The storage layer of Hadoop; it stores large files across multiple machines in a distributed, fault-tolerant way.
How It Works:
Splits large files into smaller blocks (default: 128MB or 256MB).
Stores multiple copies (replication factor, default: 3).
Manages files efficiently across nodes.
Example:
A social media platform (e.g., Facebook) uses HDFS to store billions of images and videos
across distributed clusters.
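As a quick worked example (assuming the default 128 MB block size and a replication factor of 3): a 1 GB file is split into 8 blocks, and storing 3 replicas of each block consumes roughly 3 GB of raw cluster capacity.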
2. Yet Another Resource Negotiator (YARN)
Definition:
• A resource management layer in Hadoop that schedules tasks and allocates system
resources.
• Separates resource management from data processing, improving efficiency.
Key Components:
ResourceManager – Manages system resources globally.
NodeManager – Manages resources on individual nodes.
ApplicationMaster – Oversees execution of applications.
Example:
Netflix uses YARN to allocate resources dynamically when processing user recommendations.
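To make the resource-allocation idea concrete, below is a minimal, hedged sketch of how a client can hint at container sizes when submitting a MapReduce job. The property names (mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores) are standard Hadoop settings; the values and class name are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceHints {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask YARN for 2 GB containers for map tasks and 4 GB for reduce tasks.
        // The ResourceManager decides placement; NodeManagers enforce the limits.
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        conf.set("mapreduce.map.cpu.vcores", "1");

        Job job = Job.getInstance(conf, "resource-hint-demo");
        // ... set mapper, reducer, input and output paths as usual before submitting ...
    }
}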
3. MapReduce
Definition:
• Hadoop's processing engine; it splits a job into Map tasks (filter/transform) and Reduce tasks (aggregate) that run in parallel across the cluster.
Example:
Twitter processes real-time tweets using MapReduce to analyze trending topics.
4. Hadoop Common
Definition:
• The set of shared Java libraries and utilities used by all other Hadoop modules.
Key Features:
Provides authentication (Kerberos).
Supports compression for optimized storage.
Includes Java libraries for Hadoop applications.
Example:
A financial firm uses Hadoop Common libraries to integrate external data sources into their
Hadoop cluster.
HDFS Read Process (How Data is Retrieved from Hadoop)
• The NameNode does not store data but returns the list of DataNodes where each
block resides.
Data is Reassembled
• Once all blocks are fetched, the client reconstructs the complete file.
Key Features:
Reads are optimized by fetching from the nearest DataNode.
NameNode only provides metadata; actual data is read from DataNodes.
Hadoop ensures fault tolerance – if a DataNode fails, the client reads from another replica.
HDFS Write Process (How Data is Stored in Hadoop)
• The NameNode does not store data but maintains metadata about file locations.
• It provides block placement details for the file to be stored across DataNodes.
Key Features:
Data is stored in a fault-tolerant manner with replication.
NameNode only manages metadata, not actual data.
Pipeline replication ensures efficient storage across multiple nodes.
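To make these read and write paths concrete, here is a minimal sketch using the standard HDFS Java API (FileSystem, FSDataOutputStream, FSDataInputStream). The NameNode address and file path are assumptions that match the single-node setup configured later in these notes.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; matches fs.defaultFS in core-site.xml below.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode for block placement,
        // then streams data to the DataNode replication pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations (metadata only),
        // and the client reads the blocks directly from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}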
Hadoop Ecosystem Components
Data Storage: HDFS, HBase
Data Processing: MapReduce, Spark, Hive
Data Ingestion: Sqoop, Flume, Kafka
Workflow Management: Oozie
Example:
Uber uses the Hadoop ecosystem (HDFS, Spark, Hive) to analyze ride data and optimize pricing
models.
Advantages of Hadoop & HDFS over GFS and Other Distributed File Systems
Open-Source & Cost-Effective – Unlike Google File System (GFS), Hadoop is open-source and
does not require proprietary hardware.
Scalability – Can scale horizontally by adding more nodes instead of upgrading expensive
servers.
Supports Diverse Workloads – Unlike GFS, which is optimized for Google’s internal services,
Hadoop supports machine learning, analytics, ETL, and real-time processing.
Parallel Processing – Uses MapReduce and Spark to process data faster than traditional
databases.
2. Fault Tolerance:
• HDFS replicates data across multiple nodes (default: 3 copies), ensuring high
availability.
• GFS also replicates, but HDFS offers better handling of node failures.
3. Cost-Effectiveness:
• HDFS runs on commodity hardware, while GFS is built for Google’s infrastructure.
• This makes Hadoop affordable for enterprises of all sizes.
4. Block Size:
• HDFS supports larger block sizes (128MB+) for efficient storage and retrieval.
• GFS uses smaller block sizes, leading to higher metadata overhead.
5. Industry Adoption:
Example:
Bank of America uses HDFS for real-time fraud detection instead of relying on traditional
relational databases.
1. NameNode
Functionality:
• Master daemon of HDFS; maintains the file system namespace and the mapping of files to blocks and DataNodes.
Changes in Hadoop 3:
Erasure Coding replaces replication for improved storage efficiency.
Multiple NameNodes are now supported in HA (High Availability) mode.
Example:
If a user wants to read /user/data.csv, the NameNode tells them which DataNodes store the
file chunks.
2. DataNode
Functionality:
• Worker daemon of HDFS; stores the actual data blocks and serves client read/write requests.
Changes in Hadoop 3:
Storage efficiency improved with Erasure Coding, reducing redundancy.
DataNodes can store data on heterogeneous storage types (SSD/HDD hybrid).
Example:
When a user uploads a file, the NameNode breaks it into blocks, and DataNodes store these
blocks.
3. ResourceManager
Functionality:
• Master daemon of YARN; tracks cluster resources and schedules applications across the cluster.
Changes in Hadoop 3:
Federation Support – Multiple ResourceManagers can operate in parallel for scalability.
GPU & FPGA support – Improved hardware acceleration for ML tasks.
Example:
If multiple users submit jobs, the ResourceManager allocates memory & CPU to ensure fair
resource distribution.
4. NodeManager
Functionality:
• Worker daemon of YARN; launches containers and monitors CPU and memory usage on its node.
Changes in Hadoop 3:
Supports GPU scheduling, allowing deep learning workloads.
Improved container reuse for better efficiency.
Example:
If a Spark job requires 2GB RAM, the NodeManager ensures that enough resources are
available before execution.
5. JobHistoryServer (Job Execution History Daemon)
Functionality:
• Stores the status, counters, and logs of completed MapReduce jobs for later inspection.
Changes in Hadoop 3:
Optimized log storage reduces overhead on large clusters.
Example:
A data engineer can check the JobHistoryServer to debug a failed MapReduce job.
Hadoop Deployment Modes
1. Standalone (Local) Mode
Definition:
• The simplest mode where Hadoop runs on a single machine without using HDFS.
• Mostly used for testing and debugging.
Key Characteristics:
Advantages:
Quick setup, minimal configuration.
Ideal for testing MapReduce programs before deploying them in clusters.
Disadvantages:
No distributed processing or storage.
Cannot handle large-scale Big Data workloads.
Usage:
• Used by developers for testing and debugging Hadoop applications before running
them on a cluster.
2. Pseudo-Distributed Mode
Definition:
• All Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) run on a single machine, each in its own Java process, with HDFS enabled.
Key Characteristics:
Advantages:
Simulates a real Hadoop cluster.
Allows developers to test HDFS and YARN before deploying on a full cluster.
Disadvantages:
Not suitable for real-world large-scale applications.
Performance is limited to a single machine.
Usage:
• Used for small-scale testing and learning Hadoop before moving to production.
3. Fully Distributed Mode
Definition:
• Hadoop runs on a multi-node cluster, with master daemons (NameNode, ResourceManager) and worker daemons (DataNodes, NodeManagers) on separate machines.
Key Characteristics:
Advantages:
Fully distributed Big Data processing.
Scales horizontally by adding more nodes.
Fault-tolerant and optimized for real-world workloads.
Disadvantages:
Requires complex setup and configuration.
Needs high hardware resources and infrastructure management.
Usage:
• Used in production environments for large-scale Big Data processing and analytics.
On Linux (Ubuntu/Debian/CentOS)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar -xvzf hadoop-3.3.1.tar.gz
mv hadoop-3.3.1 ~/hadoop
Add the following environment variables to ~/.bashrc:
export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Then run:
source ~/.bashrc
hadoop version
hadoop fs -ls /
On Linux
Modify /home/user/hadoop/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Explanation:
• fs.defaultFS → Defines the default file system (HDFS in this case) and its address.
Modify /home/user/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/user/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/user/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
Explanation:
• dfs.replication → Defines number of copies of data blocks.
• dfs.namenode.name.dir → Directory to store NameNode metadata.
• dfs.datanode.data.dir → Directory for DataNode storage.
Modify /home/user/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Explanation:
• mapreduce.framework.name → Runs MapReduce jobs on YARN instead of the local job runner.
Modify /home/user/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Explanation:
• yarn.nodemanager.aux-services → Enables the MapReduce shuffle service on each NodeManager.
List the hostnames of the cluster nodes in /home/user/hadoop/etc/hadoop/workers:
master
worker1
worker2
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
jps # Check if NameNode, DataNode, ResourceManager, JobHistoryServer are running
Troubleshooting Guide
JAVA_HOME Not Found
Solution: point JAVA_HOME at the installed JDK (in ~/.bashrc or hadoop-env.sh):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
NameNode Fails to Start (stale or corrupted metadata)
Solution: clear the NameNode directory and reformat it (warning: this erases HDFS metadata):
rm -r /home/user/hadoop_data/hdfs/namenode/*
hdfs namenode -format
DataNode Failing to Start
Solution: check the Hadoop logs for the underlying error:
tail -f $HADOOP_HOME/logs/hadoop-*.log
Feature | RDBMS | Hadoop
Data Variety | Mainly structured data | Structured, semi-structured, and unstructured data
Data Storage | Average-sized data sets (GBs) | Very large data sets (TBs and PBs)
Querying | SQL | HQL (Hive Query Language)
Schema | Required on write (static schema) | Required on read (dynamic schema)
Speed | Reads are fast | Both reads and writes are fast
Cost | Licensed | Free
Use Case | OLTP (online transaction processing) | Analytics (audio, video, logs, etc.), data discovery
Data Objects | Works on relational tables | Works on key/value pairs
Throughput | Low | High
Scalability | Vertical | Horizontal
Hardware Profile | High-end servers | Commodity/utility hardware
Integrity | High (ACID) | Low
Distributed systems offer scalability, fault tolerance, and performance benefits, but they also
introduce significant complexities. Below are some of the most pressing challenges in
distributed computing:
2. Data Consistency and the CAP Theorem
Maintaining data consistency across multiple nodes is difficult, especially when trying to
balance availability and fault tolerance. Systems must choose between eventual consistency
(where updates propagate over time) and strong consistency (where all replicas remain
synchronized). The CAP theorem dictates that in a partitioned network, a system can only
guarantee two out of three properties: Consistency, Availability, and Partition Tolerance.
3. Fault Tolerance and Self-Healing Mechanisms
Distributed systems must continue functioning even when individual nodes fail. Achieving fault
tolerance requires redundancy, failure detection, and automated recovery mechanisms.
Techniques such as checkpointing, leader election, and self-healing architectures ensure that
failures do not bring down the entire system.
4. Coordination and Synchronization
With multiple nodes operating in parallel, coordinating access to shared resources becomes a
challenge. Without proper synchronization mechanisms, race conditions and deadlocks can
occur. Distributed systems use consensus protocols (e.g., Paxos, Raft) and distributed locking
mechanisms (e.g., ZooKeeper) to maintain order and prevent data corruption.
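As one concrete illustration of such coordination, the sketch below acquires a distributed lock using Apache Curator's InterProcessMutex recipe on top of ZooKeeper; the connection string, lock path, and timeout are assumptions, not part of the notes above.

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DistributedLockDemo {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble address.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Processes contending for the same resource use the same lock path.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/inventory");

        // Wait up to 10 seconds for the lock; only one node holds it at a time.
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // ... critical section: update the shared resource safely ...
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}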
5. Scalability and Load Balancing
As workloads grow, distributed systems must scale efficiently while balancing loads across
available resources. Poor load balancing can lead to hotspots, where some nodes become
overloaded while others remain underutilized. Dynamic load balancing techniques, auto-
scaling, and horizontal scaling strategies help manage varying workloads efficiently.
To overcome these challenges, various approaches and best practices have been developed:
Using consensus algorithms such as Paxos, Raft, and Two-Phase Commit (2PC) ensures that
distributed nodes agree on data updates while maintaining fault tolerance. Replication
strategies like leader-follower replication and quorum-based reads/writes enhance data
reliability.
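A worked example of the quorum idea: with N replicas, choosing a write quorum W and a read quorum R such that W + R > N guarantees that every read overlaps the latest successful write. For N = 3, setting W = 2 and R = 2 gives 2 + 2 = 4 > 3, so at least one replica contacted by any read already holds the newest value.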
The circuit breaker pattern prevents cascading failures by detecting failures in services and
temporarily blocking interactions with unresponsive components. This allows services to
recover gracefully instead of causing system-wide outages.
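A minimal, illustrative sketch of the circuit breaker idea in Java (class name, thresholds, and the simplified half-open behavior are assumptions, not a production implementation):

import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;    // failures before the breaker opens
    private final long openTimeoutMillis;  // how long to fail fast once open
    private int failureCount = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public <T> T call(Supplier<T> remoteCall, T fallback) {
        if (state == State.OPEN) {
            // While open, fail fast with the fallback; allow a retry after the timeout.
            if (System.currentTimeMillis() - openedAt < openTimeoutMillis) {
                return fallback;
            }
            state = State.CLOSED;   // simplified half-open: permit one trial call
            failureCount = 0;
        }
        try {
            T result = remoteCall.get();
            failureCount = 0;       // a success resets the failure counter
            return result;
        } catch (RuntimeException e) {
            if (++failureCount >= failureThreshold) {
                state = State.OPEN; // too many failures: stop calling the service
                openedAt = System.currentTimeMillis();
            }
            return fallback;
        }
    }
}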
Reducing direct dependencies between services through asynchronous messaging (e.g., Kafka,
RabbitMQ, and event-driven microservices) enhances system resilience and prevents
bottlenecks.
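As a small sketch of this decoupling, the producer below publishes an event to Kafka and returns immediately, leaving downstream consumers to process it asynchronously; the broker address, topic name, and payload are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish the event and move on; billing, shipping, and analytics
            // services consume it independently from the "orders" topic.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"CREATED\"}"));
        }
    }
}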
Example: Word count using a custom InputFormat and RecordReader (reads the input line by line and counts word occurrences).
CustomFileInputFormat.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Custom InputFormat that plugs the custom record reader into the MapReduce job.
public class CustomFileInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new CustomLineRecordReader();
    }
}
CustomLineRecordReader.java
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Custom RecordReader that reads a file split line by line and emits
// (byte offset, line text) pairs to the mapper.
public class CustomLineRecordReader extends RecordReader<LongWritable, Text> {

    private static final Log LOG = LogFactory.getLog(CustomLineRecordReader.class);

    private long start;          // first byte of this split
    private long pos;            // current position in the file
    private long end;            // last byte of this split
    private LineReader in;       // reads one line at a time
    private int maxLineLength;   // lines longer than this are skipped
    private LongWritable key = new LongWritable();
    private Text value = new Text();

    /**
     * This method takes as arguments the map task's assigned InputSplit and
     * the TaskAttemptContext. For file-based formats, this is a good place to
     * seek to the byte position in the file where reading should begin.
     */
    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);

        boolean skipFirstLine = false;
        // If split "S" does not start at byte 0, its first line has already been
        // read by the record reader of the previous split, so skip it here.
        if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new LineReader(fileIn, job);
        if (skipFirstLine) {
            start += in.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - start));
        }
        this.pos = start;
    }

    /**
     * Reads a single key/value pair and returns true until the data is consumed.
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        key.set(pos);
        int newSize = 0;
        // Make sure we get at least one record that starts in this split.
        while (pos < end) {
            newSize = in.readLine(value, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            if (newSize == 0) {
                break;
            }
            pos += newSize;
            // Line is shorter than the maximum record line size, so keep it.
            if (newSize < maxLineLength) {
                break;
            }
            // Line was too long; log and skip it.
            LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }

    /**
     * These methods are used by the framework to hand the generated key/value
     * pairs to the mapper.
     */
    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    /**
     * Reports how much of the split has been processed.
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float) (end - start));
        }
    }

    /**
     * This method is used by the framework for cleanup after there are no more
     * key/value pairs to read.
     */
    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}
MapClass.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Map is passed a single line at a time; it splits the line on spaces and
 * generates tokens, which are output by map with a value of one, to be
 * consumed by the reduce class.
 *
 * @author Raman
 */
public class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    /**
     * Splits the line into tokens and writes each token to the context as the
     * word, along with a value of one.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}
ReduceClass.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reduce class, which is executed after the map class. It takes a key (word)
 * and the corresponding values, sums all the values, and writes the word with
 * its total count to the output.
 *
 * @author Raman
 */
public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        Iterator<IntWritable> valuesIt = values.iterator();
        while (valuesIt.hasNext()) {
            sum += valuesIt.next().get();
        }
        context.write(key, new IntWritable(sum));
    }
}
WordCount.java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Driver class which sets up the Hadoop job with the Map and Reduce classes.
 *
 * @author Raman
 */
public class WordCount extends Configured implements Tool {

    /**
     * Main function which calls the run method and passes the args using ToolRunner.
     *
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }

    /**
     * Configures and submits the job, then waits for it to finish.
     */
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s <input path> <output path>\n",
                    getClass().getSimpleName());
            return -1;
        }

        // Initialize the Hadoop job and set the jar as well as the name of the job
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCount.class);
        job.setJobName("WordCounter");

        // Add input and output file paths to the job based on the arguments passed
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setInputFormatClass(CustomFileInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(MapClass.class);
        job.setReducerClass(ReduceClass.class);

        // Wait for the job to complete and print whether the job was successful or not
        int returnValue = job.waitForCompletion(true) ? 0 : 1;
        if (job.isSuccessful()) {
            System.out.println("Job was successful");
        } else if (!job.isSuccessful()) {
            System.out.println("Job was not successful");
        }
        return returnValue;
    }
}
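To run the completed example (assuming the classes above are packaged into a jar named wordcount.jar), a command of the form hadoop jar wordcount.jar WordCount <input path> <output path> submits the job; the output directory must not already exist in HDFS.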