Big Data Analytics Using Apache Hadoop
SEMINAR REPORT
Submitted in partial fulfilment of
the requirements for the award of Bachelor of Technology Degree
in Computer Science and Engineering
of the University of Kerala
Submitted by
ABIN BABY
Roll No : 1
Seventh Semester
B.Tech Computer Science and Engineering
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that this seminar report entitled BIG DATA ANALYTICS USING
APACHE HADOOP is a bonafide record of the work done by Abin Baby, under our
guidance towards partial fulfilment of the requirements for the award of the Degree of
Bachelor of Technology in Computer Science and Engineering of the University of
Kerala during the year 2011-2015.
Mrs. Sabitha S
Professor
Dept. of CSE
Assoc. Professor
Dept. of CSE
(Guide)
Assoc. Professor
Dept. of CSE
(Guide)
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude and heartfelt indebtedness to my
guide, Dr. Abdul Nizar, Head of the Department of Computer Science and
Engineering, for the valuable guidance and encouragement in pursuing this seminar.
Above all, I thank God for the immense grace and blessings at all stages of the project.
Abin Baby
ABSTRACT
The paradigm of processing huge datasets has shifted from centralized to distributed
architectures. As enterprises began gathering large volumes of data, they found that the
data could not be processed by any of the existing centralized solutions. Apart from time
constraints, enterprises faced issues of efficiency, performance and elevated
infrastructure cost when processing the data in a centralized environment.
With the help of distributed architectures, these large organizations were able to
overcome the problem of extracting relevant information from a huge data dump. One
of the best open source tools on the market for harnessing a distributed architecture to
solve such data processing problems is Apache Hadoop. Using Apache Hadoop's various
components, such as data clusters, the MapReduce algorithm and distributed
processing, we can resolve various location-based complex data problems and feed the
relevant information back into the system, thereby improving the user experience.
TABLE OF CONTENTS
1. Introduction
2. Big Data
3. Big Data Analytics
4. Apache Hadoop
5. Hadoop Distributed File System (HDFS)
   5.1 Architecture
6. Hadoop Map-Reduce
   6.1 Architecture
   6.2 Map Reduce Paradigm
   6.3 Comparing RDBMS & Map Reduce
7. Apache Hadoop Ecosystem
8. Conclusion
9. Bibliography
CHAPTER 1
INTRODUCTION
The amount of data generated every day is expanding at a drastic rate. Big data is a
popular term used to describe data on the order of zettabytes. Governments, companies
and many other organisations try to acquire and store data about their citizens and
customers in order to know them better and predict customer behaviour. Social
networking websites generate new data every second, and handling such data is one of
the major challenges companies are facing. Data stored in data warehouses causes
disruption because it is in a raw format; proper analysis and processing must be done
in order to produce usable information out of it. Big Data has to deal with large and
complex datasets that can be structured, semi-structured or unstructured and will
typically not fit into memory to be processed. They have to be processed in place, which
means that computation has to be done where the data resides.
Big data challenges include analysis, capture, curation, search, sharing, storage,
transfer, visualization, and privacy. The trend towards larger data sets is due to the
additional information derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of data, allowing
correlations to be found to "spot business trends, prevent diseases, combat crime and so
on".
Big Data usually includes datasets with sizes beyond the ability of commonly used
software tools to capture, manage, and process within a tolerable elapsed time; it is not
possible for such systems to process this amount of data within the time frame
mandated by the business. Big Data volumes are a constantly moving target, as of 2012
ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced
with this seemingly insurmountable challenge, entirely new platforms, called Big Data
platforms, have emerged.
New tools are being used to handle such large amounts of data in a short time. Apache
Hadoop is a Java-based programming framework used for processing large data sets in
a distributed computing environment. Hadoop is used in systems where multiple nodes
are present, which can together process terabytes of data. Hadoop uses its own file
system, HDFS, which facilitates fast transfer of data, can sustain node failures and
avoids failure of the system as a whole. Hadoop uses the MapReduce algorithm, which
breaks the big data down into smaller chunks and performs the required operations on
them. The Hadoop framework is used by many big companies such as Google, Yahoo!
and IBM for applications such as search engines, advertising, and information gathering
and processing.
Various technologies work hand-in-hand to accomplish this task, such as the Spring
Hadoop data framework for the basic foundations and running of the MapReduce jobs,
Apache Maven for building the code, REST web services for communication, and lastly
Apache Hadoop for distributed processing of the huge dataset.
CHAPTER 2
BIG DATA
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the
world today has been created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records, and cell phone GPS signals, to
name a few. This data is big data.
The data lying in the servers of a company was just data until yesterday, sorted and
filed. Suddenly the term Big Data became popular, and now the data in a company is
Big Data. The term covers each and every piece of data an organization has stored till
now. It includes data stored in clouds and even the URLs that you have bookmarked. A
company might not have digitized all of its data, and it may not have structured all of it
yet; but all the digital, paper, structured and unstructured data with the company is
now Big Data.
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using traditional data processing
applications. It refers to the large amounts, at least terabytes, of poly-structured data
that flows continuously through and around organizations, including video, text, sensor
logs, and transactional records. The business benefits of analyzing this data can be
significant. According to a recent study by the MIT Sloan School of Management,
organizations that use analytics are twice as likely to be top performers in their industry
as those that don't.
Big data burst upon the scene in the first decade of the 21st century, and the first
organizations to embrace it were online and startup firms. In a nutshell, Big Data is your
data. It's the information owned by your company, obtained and processed through new
techniques to produce value in the best way possible.
Companies have sought for decades to make the best use of information to
improve their business capabilities. However, it's the structure (or lack thereof) and
size of Big Data that makes it so unique. Big Data is also special because it represents
both significant information, which can open new doors, and the way this information
is analyzed to help open those doors. The analysis goes hand-in-hand with the
information, so in this sense "Big Data" represents both a noun, "the data", and a verb,
"combing the data to find value." The days of keeping company data in Microsoft Office
documents on carefully organized file shares are behind us, much like the bygone era of
sailing across the ocean in tiny ships. That 50 gigabyte file share in 2002 looks quite tiny
compared to a modern-day 50 terabyte marketing database containing customer
preferences and habits.
The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s; as of 2012, some 2.5 exabytes (2.5 × 10^18
bytes) of data were created every day. The challenge for large enterprises is
determining who should own big data initiatives that straddle the entire organization.
Some of the popular organizations that hold Big Data are as follows:
Facebook: It has 40 PB of data and captures 100 TB/day
Yahoo!: It has 60 PB of data
Twitter: It captures 8 TB/day
eBay: It has 40 PB of data and captures 50 TB/day
How much data is considered Big Data differs from company to company. Though it is
true that one company's Big Data is another company's small data, there is something
common: the data does not fit in memory or on a single disk, it arrives rapidly and
needs to be processed, and processing it would benefit from a distributed software
stack. For some companies, 10 TB of data would be considered Big Data, while for
others 1 PB would be Big Data. So only you can determine whether the data is really
Big Data. It is sufficient to say that it starts in the low terabyte range.
Volume. The quantity of data that is generated is very important in this context. It is
the size of the data which determines its value and potential and whether it can
actually be considered Big Data or not. The name Big Data itself contains a term related
to size, hence the characteristic. Many factors contribute to the increase in data
volume: transaction-based data stored through the years, unstructured data streaming
in from social media, and increasing amounts of sensor and machine-to-machine data
being collected. In the past, excessive data volume was a storage issue. But with
decreasing storage costs, other issues emerge, including how to determine relevance
within large data volumes and how to use analytics to create value from relevant data.
It is estimated that 2.5 quintillion bytes of data are generated every day.
40 zettabytes of data will be created by 2020, an increase of 30 times over 2005.
6 billion people around the world are using mobile phones.
Velocity. The term velocity in this context refers to the speed of generation of data, or
how fast the data is generated and processed to meet the demands and the challenges
which lie ahead in the path of growth and development. Data is streaming in at
unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and
smart metering are driving the need to deal with torrents of data in near-real time.
Reacting quickly enough to deal with data velocity is a challenge for most organizations.
Variety. The next aspect of Big Data is its variety. This means that the category to
which Big Data belongs is also a very essential fact that needs to be known by the data
analysts. This helps the people who are closely analyzing the data, and are associated
with it, to effectively use the data to their advantage, thus upholding the importance of
the Big Data. Data today comes in all types of formats: structured, numeric data in
traditional databases; information created from line-of-business applications;
unstructured text documents, email, video, audio, stock ticker data and financial
transactions. Managing, merging and governing different varieties of data is something
many organizations still grapple with.
Veracity. In addition to the increasing velocities and varieties of data, data flows can be
highly inconsistent with periodic peaks. Is something trending in social media? Daily,
seasonal and event-triggered peak data loads can be challenging to manage. Even more
so with unstructured data involved. This is a factor which can be a problem for those
who analyse the data.
Volatility. Big data volatility refers to how long data is valid and how long it should be
stored. In this world of real-time data, you need to determine at what point data is no
longer relevant to the current analysis.
ADVERTISING : Big data analytics helps companies like Google and other advertising
companies to identify the behaviour of a person and to target ads accordingly. Big data
analytics enables more personal and targeted ads.
HEALTH CARE : The average amount of data per hospital is expected to increase from
167 TB to 665 TB in 2015. With Big Data, medical professionals can improve patient
care and reduce cost by extracting relevant clinical information.
CUSTOMER SERVICE : Service representatives can use data to gain a more holistic
view of their customers, understanding their likes and dislikes in real time.
FINANCIAL TRADING : High-Frequency Trading (HFT) is an area where big data finds a
lot of use today. Here, big data algorithms are used to make trading decisions. Today,
the majority of equity trading takes place via data algorithms that increasingly take
into account signals from social media networks and news websites to make buy and
sell decisions in split seconds.
IMPROVING SPORTS PERFORMANCE : Most elite sports have now embraced big data
analytics. We have the IBM SlamTracker tool for tennis tournaments; we use video
analytics that track the performance of every player in a football or baseball game, and
sensor technology in sports equipment such as basketballs or golf clubs allows us to
get feedback (via smartphones and cloud servers) on our game and how to improve it.
CHAPTER 3
BIG DATA ANALYTICS
Big data is difficult to work with using most relational database management
systems and desktop statistics and visualization packages, requiring instead "massively
parallel software running on tens, hundreds, or even thousands of servers".
Rapidly ingesting, storing, and processing big data requires a cost-effective
infrastructure that can scale with the amount of data and the scope of analysis. Most
organizations with traditional data platforms, typically relational database systems,
find that these platforms cannot scale cost-effectively to meet such demands.
By most accounts, 80 percent of the development effort in a big data project goes into
data integration and only 20 percent goes toward data analysis. Furthermore, a
traditional EDW platform can cost upwards of USD 60K per terabyte. Analyzing one
petabyte (the amount of data Google processes in one hour) would cost USD 60M.
Clearly, more of the same is not a big data strategy that any CIO can afford, so more
efficient analytics are required for Big Data.
The difficulty of analyzing big data comes from the sheer volume of data and the many
different formats of the data (both structured and unstructured) collected across the
entire organization, and from the many different ways different types of data can be
combined, contrasted and analyzed to find patterns and other useful information. Some
of the main challenges in analyzing and visualizing big data are described below.
1. Meeting the need for speed : In today's hypercompetitive business environment,
companies not only have to find and analyze the relevant data they need, they must find
it quickly. Visualization helps organizations perform analyses and make decisions much
more rapidly, but the challenge is going through the sheer volumes of data and
accessing the level of detail needed, all at a high speed. One possible solution is
hardware. Some vendors are using increased memory and powerful parallel processing
to crunch large volumes of data extremely quickly. Another method is putting data
in-memory but using a grid computing approach, where many machines are used to solve
a problem.
2. Understanding the data : It takes a lot of understanding to get data in the right
shape so that you can use visualization as part of data analysis. For example, if the data
comes from social media content, you need to know who the user is in a general sense,
such as a customer using a particular set of products, and understand what it is you're
trying to visualize out of the data. One solution to this challenge is to have the proper
domain expertise in place. Make sure the people analyzing the data have a deep
understanding of where the data comes from, what audience will be consuming the data
and how that audience will interpret the information.
3. Addressing data quality : Even if you can find and analyze data quickly and put it in
the proper context for the audience that will be consuming the information, the value of
data for decision-making purposes will be jeopardized if the data is not accurate or
timely. This is a challenge with any data analysis, but when considering the volumes of
information involved in big data projects, it becomes even more pronounced. Again, data
visualization will only prove to be a valuable tool if the data quality is assured. To address
this issue, companies need to have a data governance or information management process
in place to ensure the data is clean.
4. Displaying meaningful results : Plotting points on a graph for analysis becomes
difficult when dealing with extremely large amounts of information or a variety of
categories of information. For example, imagine you have 10 billion rows of retail SKU
data that you're trying to compare. The user trying to view 10 billion plots on the screen
will have a hard time seeing so many data points. One way to resolve this is to cluster
data into a higher-level view where smaller groups of data become visible. By grouping
the data together, or binning, you can more effectively visualize the data; a short sketch
of this binning idea follows the list of challenges.
5. Dealing with outliers : The graphical representations of data made possible by
visualization can communicate trends and outliers much faster than tables containing
numbers and text. Users can easily spot issues that need attention simply by glancing at
a chart. Outliers typically represent about 1 to 5 percent of data, but when you're
working with massive amounts of data, viewing 1 to 5 percent of the data is rather
difficult. How do you represent those points without getting into plotting issues?
Possible solutions are to remove the outliers from the data (and therefore from the
chart) or to create a separate chart for the outliers.
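To make the binning idea from challenge 4 concrete, here is a minimal, self-contained Java sketch; the simulated values, their range and the bin width are illustrative assumptions, not figures from this report.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.DoubleStream;

public class BinningExample {
    public static void main(String[] args) {
        double binWidth = 10.0;           // assumed bucket size
        Random random = new Random(42);

        // One million simulated measurements standing in for a large raw dataset.
        DoubleStream values = DoubleStream.generate(() -> random.nextDouble() * 100)
                                          .limit(1_000_000);

        // Assign each value to a fixed-width bin and count the members of each bin.
        Map<Long, Long> histogram = values.boxed().collect(Collectors.groupingBy(
                v -> (long) Math.floor(v / binWidth),   // bin index
                TreeMap::new,
                Collectors.counting()));

        // Only these bin counts would be plotted, not the million raw points.
        histogram.forEach((bin, count) -> System.out.printf(
                "[%5.1f - %5.1f): %d%n", bin * binWidth, (bin + 1) * binWidth, count));
    }
}
```

Only the handful of bucket counts would then be plotted instead of a million individual points.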
The availability of new in-memory technology and high-performance analytics
that use data visualization is providing a better way to analyze data more quickly than
ever. Visual analytics enables organizations to take raw data and present it in a
meaningful way that generates the most value. Nevertheless, when used with big data,
visualization is bound to lead to some challenges. If you're prepared to deal with these
hurdles, the opportunity for success with a data visualization strategy is much greater.
CHAPTER 4
APACHE HADOOP
Doug Cutting and Mike Cafarella created Apache Hadoop in 2005 out of necessity, as
data from the web exploded and grew far beyond the ability of traditional systems to
handle it. Hadoop was initially inspired by papers published by Google outlining its
approach to handling an avalanche of data, and has since become the de facto standard
for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data.
Apache Hadoop is an open source distributed software platform for storing and
processing data. Written in Java, it runs on a cluster of industry-standard servers
configured with direct-attached storage. Using Hadoop, you can store petabytes of data
reliably on tens of thousands of servers while scaling performance cost-effectively by
merely adding inexpensive nodes to the cluster.
Apache Hadoop is 100% open source, and pioneered a fundamentally new way
of storing and processing data. Instead of relying on expensive, proprietary hardware
and different systems to store and process data, Hadoop enables distributed parallel
processing of huge amounts of data across inexpensive, industry-standard servers that
both store and process the data, and can scale without limits. With Hadoop, no data is
too big. And in today's hyper-connected world where more and more data is being
created every day, Hadoop's breakthrough advantages mean that businesses and
organizations can now find value in data that was recently considered useless.
Reveal Insight From All Types of Data, From All Types of Systems
Hadoop can handle all types of data from disparate systems: structured, unstructured,
log files, pictures, audio files, communications records, email, just about anything you
can think of, regardless of its native format. Even when different types of data have
been stored in unrelated systems, one can dump it all into a Hadoop cluster with no
prior need for a schema. In other words, we do not need to know how we intend to
query the data before storing it; Hadoop lets one decide later, and over time it can
reveal answers to questions one never even thought to ask.
By making all data usable, not just what is in databases, Hadoop lets one see
relationships that were hidden before and reveal answers that have always been just
out of reach. One can start making more decisions based on hard data instead of
hunches and look at complete data sets, not just samples.
CHAPTER 5
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HDFS is Hadoop's own rack-aware file system, a UNIX-based data storage layer of
Hadoop. HDFS is derived from the concepts of the Google File System. An important
characteristic of Hadoop is the partitioning of data and computation across many
(thousands of) hosts, and the execution of application computations in parallel, close to
their data. On HDFS, data files are replicated as sequences of blocks in the cluster. A
Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by
simply adding commodity servers. HDFS can be accessed from applications in many
different ways. Natively, HDFS provides a Java API for applications to use.
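As a minimal illustration of that Java API, the sketch below writes and then reads a small HDFS file; the NameNode address and the file path are placeholders rather than details taken from this report, and in a real deployment fs.defaultFS would normally come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");   // placeholder path

            // Write a small file; HDFS splits larger files into blocks transparently.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the same file back through the same API.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```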
The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of
application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred
other organizations worldwide are known to use Hadoop.
HDFS was designed to be a scalable, fault-tolerant, distributed storage system
that works closely with MapReduce. HDFS will just work under a variety of physical
and systemic circumstances. By distributing storage and computation across many
servers, the combined storage resource can grow with demand while remaining
economical at every size.
HDFS supports parallel reading and writing and is optimized for streaming reads and
writes; the bandwidth scales linearly with the number of nodes. HDFS provides a block
replication factor, normally 3, i.e. every block is replicated three times on different
nodes. This helps achieve higher fault tolerance.
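As a small follow-on sketch (again with an assumed file path), the same Java API also lets an application inspect and change the replication factor of an individual file; the cluster-wide default normally comes from the dfs.replication property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/user/demo/hello.txt");   // placeholder path

            // Current replication factor of this file (the cluster default is typically 3).
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Replication factor: " + current);

            // Ask the NameNode to keep five replicas of this particular file.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("Change accepted: " + accepted);
        }
    }
}
```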
These specific features ensure that the Hadoop clusters are highly functional and highly
available:
Utilities diagnose the health of the file system and can rebalance the data across
different nodes.
Rollback allows system operators to bring back the previous version of HDFS after an
upgrade, in case of human or system errors.
Highly operable: Hadoop handles different types of cluster failures that might otherwise
require operator intervention. This design allows a single operator to maintain a cluster
of thousands of nodes.
ARCHITECTURE
HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and
replicating the blocks on three or more servers. Data is organized into files and
directories.
HDFS has a master/slave architecture. An HDFS cluster consists of a single
NameNode, a master server that manages the file system namespace and regulates
access to files by clients. In addition, there are a number of DataNodes, usually one per
node in the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of DataNodes.
NameNode : The NameNode executes file system namespace operations like opening,
closing, and renaming files and directories. It also determines the mapping of blocks to
DataNodes. The NameNode actively monitors the number of replicas of each block.
When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode
creates another replica of the block. The NameNode maintains the namespace tree and
the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
DataNodes : DataNodes are responsible for serving read and write requests from the
file system's clients. The DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity
machines. These machines typically run a GNU/Linux operating system (OS). HDFS is
built using the Java language; any machine that supports Java can run the NameNode or
the DataNode software. Usage of the highly portable Java language means that HDFS can
be deployed on a wide range of machines. A typical deployment has a dedicated
machine that runs only the NameNode software. Each of the other machines in the
cluster runs one instance of the DataNode software. HDFS provides automatic block
replication if any node fails. Unlike in RAID systems, a failed disk or node need not be
repaired immediately; typically, repair is done periodically for a collection of failures,
which makes it more efficient.
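To see the block-to-DataNode mapping described above from an application's point of view, the following sketch (file path assumed) asks the NameNode, through the FileSystem API, where each block of a file is stored.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log"));   // placeholder path

            // One BlockLocation per block; each lists the DataNodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
            }
        }
    }
}
```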
CHAPTER 6
HADOOP MAP REDUCE
When a client submits a job, the JobTracker schedules it on a TaskTracker to start
executing the job. The TaskTracker then copies the job resources to its local machine
and launches a JVM to run the map and reduce programs over the data. Along with this,
the TaskTracker periodically sends updates to the JobTracker; these can be considered
heartbeats that report the job ID, job status, and usage of resources.
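To make the map and reduce phases concrete, below is the canonical word-count job written against the org.apache.hadoop.mapreduce API; this is the standard introductory MapReduce example rather than code taken from this report, and the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a job would typically be launched with a command along the lines of hadoop jar wordcount.jar WordCount /input /output, where the two paths are HDFS directories chosen by the user.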
COMPARING RDBMS & MAP REDUCE
The table below contrasts a traditional RDBMS with Hadoop MapReduce:

                Traditional RDBMS             MapReduce
Data Size       Gigabytes (Terabytes)         Petabytes (Exabytes)
Data Format     Relational tables             Key/value pairs
Access          Interactive and batch         Batch
Updates         Read and write many times     Write once, read many times
Structure       Static schema                 Dynamic schema
Integrity       High (ACID)                   Low
Scaling         Non-linear                    Linear
Processing      Online transactions           Batch processing
Querying        Declarative queries           Functional programming
CHAPTER 7
APACHE HADOOP ECOSYSTEM
The Apache Hadoop ecosystem consists of many other components apart from Hadoop
HDFS and Hadoop MapReduce. They include:
1. Apache Pig (scripting platform) : Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig programs is
that their structure is amenable to substantial parallelization, which in turn enables
them to handle very large data sets. At present, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs. Pig's language layer
currently consists of a textual language called Pig Latin, which is easy to use, optimized,
and extensible.
2. Apache Hive : Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop
compatible file systems. It provides a mechanism to project structure onto this data and
query the data using a SQL-like language called HiveQL. Hive also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
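As one hedged illustration of HiveQL in use from an application (assuming a HiveServer2 instance is running; the host, port, table and column names below are made up for the example), queries can be submitted over Hive's JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (the hive-jdbc jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port and database are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is compiled into jobs that run on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```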
3. Apache HBase : HBase (Hadoop Database) is a distributed, column-oriented
database. HBase uses HDFS for its underlying storage. It supports both batch-style
computations using MapReduce and point queries (random reads).
The main components of HBase are described below:
HBase Master is responsible for negotiating load balancing across all Region Servers
and maintaining the state of the cluster. It is not part of the actual data storage or
retrieval path.
RegionServer is deployed on each machine and hosts data and processes I/O
requests.
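A brief sketch of the point-query access pattern mentioned above, using the HBase Java client API (HBase 1.x or later); the table name, column family and values are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Write one cell: row key "user42", column family "profile", qualifier "city".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Trivandrum"));
            table.put(put);

            // Point query (random read) by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] city = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```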
4. Apache Mahout : Mahout is a scalable machine learning library that implements
many different approaches to machine learning. The project currently contains
implementations of algorithms for collaborative filtering, clustering, and classification.
CHAPTER 8
CONCLUSION
Big data is considered the next big thing in the world of information
technology. The use of Big Data is becoming a crucial way for leading companies to
outperform their peers. Big Data will help to create new growth opportunities and
entirely new categories of companies, such as those that aggregate and analyse industry
data. Sophisticated analytics of big data can substantially improve decision-making,
minimise risks, and unearth valuable insights that would otherwise remain hidden.
Big data analytics is the process of examining big data to uncover hidden
patterns, unknown correlations and other useful information that can be used to make
better decisions. With big data analytics, data scientists and others can analyze huge
volumes of data that conventional analytics and business intelligence solutions can't
touch.
Apache Hadoop is a popular open source big data analytics tool. Using Hadoop,
we can store petabytes of data reliably on tens of thousands of servers while scaling
performance cost-effectively by merely adding inexpensive nodes to the cluster.
Instead of relying on expensive, proprietary hardware and different systems to store
and process data, Hadoop enables distributed parallel processing of huge amounts of
data across inexpensive, industry-standard servers that both store and process the data,
and can scale without limits.
Bibliography