Big Data Report
on
“Big Data”
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS & COMMUNICATION ENGINEERING
of
Dr. A.P.J. Abdul Kalam Technical University
LUCKNOW
SUBMITTED BY:
HUNNY SAINI [1416531025]
Under the Guidance
Of
Ms. Shruti Awasthi
(Assistant Professor, ECE Deptt.)
CERTIFICATE
This is to certify that the Seminar entitled "Big Data" has been
submitted by Hunny Saini under my guidance in partial fulfilment of
the requirements for the degree of Bachelor of Technology in Electronics and
Communication Engineering of Kanpur Institute of Technology
during the academic year 2016-2017 (Semester VI).
Date: 12/04/2017
Place: Kanpur
Hunny Saini
Table of Contents
Abstract
Introduction
Definition
Characteristics
Architecture
Technologies
Applications
Pros of Real-Time Big Data
Cons of Real-Time Big Data
Conclusion
Abstract
The age of big data has arrived, but traditional data analytics may not be able
to handle such large quantities of data. The questions that arise now are how to
develop a high-performance platform to efficiently analyze big data and how to
design appropriate mining algorithms to extract useful information from it. To
discuss this issue in depth, this report begins with a brief introduction to data
analytics, followed by a discussion of big data analytics. Some important open
issues and further research directions are also presented for the next steps in big
data analytics.
As information technology spreads rapidly, most data today are born digital and
exchanged over the Internet. According to the estimation of Lyman and
Varian [1], more than 92% of new data were already stored on digital media in
2002, and the size of these new data exceeded five exabytes. In fact, the problem
of analyzing large-scale data did not appear suddenly; it has existed for several
years, because creating data is usually much easier than extracting useful
information from it. Even though computer systems today are much faster than
those of the 1930s, large-scale data remain a strain to analyze with the computers
available today.
Big data is commonly characterized by the following "V's":
1. Volume
2. Variety
3. Velocity
4. Variability
5. Veracity
Analytics are already used to process medical information rapidly and efficiently,
enabling faster decision making and the detection of suspicious or fraudulent
claims; the Food and Drug Administration, for example, is using big data in this way.
1. Introduction
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization, and information privacy.
The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set. Accuracy in big data may lead to more confident decision making, and
better decisions can mean greater operational efficiency, cost reductions, and
reduced risk.
Analysis of data sets can find new correlations to "spot business trends, prevent
diseases, combat crime and so on." Scientists, practitioners of media and
advertising and governments alike regularly meet difficulties with large data sets in
areas including Internet search, finance and business informatics. Scientists
encounter limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, and biological and environmental
research.
Data sets grow in size in part because they are increasingly being gathered by
cheap and numerous information-sensing mobile devices, aerial (remote sensing),
software logs, cameras, microphones, radio-frequency identification (RFID)
readers, and wireless sensor networks. The world's technological per-capita
capacity to store information has roughly doubled every 40 months since the
1980s; as of 2012, about 2.5 exabytes (2.5×10¹⁸ bytes) of data were created
every day. The challenge for large enterprises is determining who should own
big data initiatives that straddle the entire organization.
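As a rough illustration of what "doubling every 40 months" implies, the short Python sketch below projects the daily volume of newly created data forward from the 2012 figure quoted above. The chosen projection years are arbitrary, and the result is only a back-of-the-envelope estimate, not an official statistic.

    # Back-of-the-envelope projection of daily data creation, assuming the
    # figures quoted above: 2.5 exabytes/day in 2012, doubling every 40 months.
    BASE_YEAR = 2012
    BASE_EXABYTES_PER_DAY = 2.5
    DOUBLING_MONTHS = 40

    def daily_volume(year):
        """Projected exabytes created per day in the given year."""
        months_elapsed = (year - BASE_YEAR) * 12
        return BASE_EXABYTES_PER_DAY * 2 ** (months_elapsed / DOUBLING_MONTHS)

    if __name__ == "__main__":
        for year in (2012, 2017, 2022):
            print(f"{year}: ~{daily_volume(year):.1f} exabytes/day")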
Work with big data is necessarily uncommon; most analysis is of "PC size" data,
on a desktop PC or notebook that can handle the available data set.
Relational database management systems and desktop statistics and visualization
packages often have difficulty handling big data. The work instead requires
"massively parallel software running on tens, hundreds, or even thousands of
servers". What is considered "big data" varies depending on the capabilities of the
users and their tools, and expanding capabilities make Big Data a moving target.
Thus, what is considered to be "Big" in one year will become ordinary in later
years. "For some organizations, facing hundreds of gigabytes of data for the first
time may
trigger a need to reconsider data management options. For others, it may take tens
or hundreds of terabytes before data size becomes a significant consideration."
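The "massively parallel software" mentioned above follows a simple split-process-merge pattern: the data are divided into chunks, each chunk is processed independently, and the partial results are merged. The Python sketch below illustrates this pattern on a single machine using the standard multiprocessing module; real platforms such as Hadoop or HPCC distribute the same idea across many servers. The word-count example and worker count are arbitrary choices for illustration.

    # Minimal sketch of the split/process/merge pattern behind "massively
    # parallel" analytics, using Python's multiprocessing on one machine.
    from collections import Counter
    from multiprocessing import Pool

    def count_words(chunk):
        """Map step: count words in one chunk of text."""
        return Counter(chunk.split())

    def parallel_word_count(chunks, workers=4):
        """Split chunks across workers, then merge the per-chunk counts."""
        with Pool(workers) as pool:
            partial_counts = pool.map(count_words, chunks)
        total = Counter()
        for c in partial_counts:
            total.update(c)
        return total

    if __name__ == "__main__":
        data = ["big data big analytics", "data velocity data volume"]
        print(parallel_word_count(data))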
2. Definition
Big data usually includes data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and process the data within a
tolerable elapsed time. Big data "size" is a constantly moving target; as of 2012
it ranged from a few dozen terabytes to many petabytes of data. Big data is also
a set of techniques and technologies that require new forms of integration to
uncover large hidden values from datasets that are diverse, complex, and of a
massive scale.
In a 2001 research report and related lectures, META Group (now Gartner)
analyst Doug Laney defined data growth challenges and opportunities as being
three-dimensional, i.e. increasing volume (amount of data), velocity (speed of
data in and out), and variety (range of data types and sources). Gartner, and now
much of the industry, continue to use this "3Vs" model for describing big data.
In 2012, Gartner updated its definition as follows: "Big data is high volume,
high velocity, and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process
optimization." Additionally, a new V "Veracity" is added by some organizations
to describe it.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity of
the concept fosters a sounder distinction between big data and business
intelligence with regard to data and their use.
A more recent, consensual definition states that "Big Data represents the
Information assets characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into
Value".
3. Characteristics
Volume - The first aspect of Big Data is its sheer volume, i.e. the quantity of
data that is generated and stored. The size of the data determines whether it
can still be handled by traditional tools and what value and insight it may hold.
Variety - The next aspect of Big Data is its variety, i.e. the category to which
the data belongs. Knowing the category is essential for the analysts who work
closely with the data, since it helps them use the data effectively to their
advantage and thus upholds its importance.
Velocity - The term 'velocity' in this context refers to the speed at which data
is generated and processed to meet the demands and challenges that lie along
the path of growth and development.
Variability - A factor that can be a problem for those who analyze the data:
it refers to the inconsistency the data can show at times, which hampers the
ability to handle and manage the data effectively.
Veracity - The quality of the data being captured can vary greatly. Accuracy
of analysis depends on the veracity of the source data.
Complexity - Data management can become a very complex process,
especially when large volumes of data come from multiple sources. These
data need to be linked, connected, and correlated in order to grasp the
information they are supposed to convey. This situation is therefore termed
the 'complexity' of Big Data.
In an industrial scenario, in order to provide useful insight to factory management
and obtain the correct content, data have to be processed with advanced tools
(analytics and algorithms) to generate meaningful information. Given the presence
of visible and invisible issues in an industrial factory, the information-generation
algorithm has to be capable of detecting and addressing invisible issues such as
machine degradation and component wear on the factory floor.
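As a toy illustration of how an algorithm might flag an "invisible" issue such as gradual machine degradation, the Python sketch below compares each new sensor reading against a rolling baseline and reports readings that drift beyond a tolerance. The window size, tolerance, and synthetic vibration data are arbitrary assumptions, not part of any specific PHM system.

    # Toy illustration of flagging machine degradation: compare each new
    # sensor reading against a rolling baseline and report readings that
    # drift beyond a tolerance. Thresholds and window size are arbitrary.
    from collections import deque

    def detect_drift(readings, window=10, tolerance=0.2):
        """Yield (index, value) for readings that deviate from the rolling mean."""
        history = deque(maxlen=window)
        for i, value in enumerate(readings):
            if len(history) == window:
                baseline = sum(history) / window
                if abs(value - baseline) > tolerance * abs(baseline):
                    yield i, value
            history.append(value)

    if __name__ == "__main__":
        # Synthetic vibration signal: stable, then slowly rising.
        vibration = [1.0] * 20 + [1.0 + 0.05 * k for k in range(1, 11)]
        for idx, val in detect_drift(vibration):
            print(f"possible degradation at sample {idx}: {val:.2f}")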
4. Architecture
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework
for data storage and querying. Structured, semi-structured and/or unstructured
data is stored and distributed across multiple servers. Data is queried using a
modified C++ dialect called ECL, which applies a schema-on-read method to
create the structure of the stored data at query time. In 2004 LexisNexis
acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. along with their
high-speed parallel-processing platform. The two platforms were merged into
HPCC Systems, which was open-sourced under the Apache v2.0 license in 2011.
Currently HPCC and the Quantcast File System are the only publicly
available platforms capable of analyzing multiple exabytes of data.
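The "schema on read" approach used by ECL can be illustrated generically: raw records are stored untyped, and a schema (field names and types) is imposed only when a query runs. The Python sketch below is not ECL or HPCC code; the record layout and field names are invented for illustration.

    # Generic sketch of "schema on read": raw text records are stored untyped,
    # and a schema (field names and types) is applied only when a query runs.
    RAW_RECORDS = [
        "1001,thermostat,21.5",
        "1002,camera,NA",
        "1003,thermostat,19.0",
    ]

    SCHEMA = [("device_id", int), ("device_type", str), ("reading", float)]

    def read_with_schema(raw_lines, schema):
        """Apply the schema at query time; unparseable fields become None."""
        for line in raw_lines:
            row = {}
            for (name, cast), field in zip(schema, line.split(",")):
                try:
                    row[name] = cast(field)
                except ValueError:
                    row[name] = None
            yield row

    if __name__ == "__main__":
        # "Query": average thermostat reading; structure is imposed only now.
        temps = [r["reading"] for r in read_with_schema(RAW_RECORDS, SCHEMA)
                 if r["device_type"] == "thermostat" and r["reading"] is not None]
        print(sum(temps) / len(temps))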
Big Data Lake: With the changing face of the business and IT sectors, the capture
and storage of data has grown into a sophisticated system. The big data lake
allows an organization to shift its focus from centralized control to a shared model
in order to respond to the changing dynamics of information management. This
enables quick segregation of data into the data lake, thereby reducing overhead time.
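A minimal sketch of the "segregation" idea behind a data lake is shown below: raw events are landed unmodified, partitioned by source and date, so that later jobs can read only the slices they need. The directory scheme and event fields are illustrative assumptions rather than a prescribed layout.

    # Minimal sketch of landing raw events into a data-lake style layout,
    # partitioned by source and date. The directory scheme is illustrative.
    import json
    import os
    from datetime import date

    def land_event(lake_root, source, event):
        """Append one raw event, unmodified, under lake_root/source/YYYY-MM-DD/."""
        partition = os.path.join(lake_root, source, date.today().isoformat())
        os.makedirs(partition, exist_ok=True)
        with open(os.path.join(partition, "events.jsonl"), "a") as f:
            f.write(json.dumps(event) + "\n")

    if __name__ == "__main__":
        land_event("lake", "web_clicks", {"user": 42, "page": "/home"})
        land_event("lake", "sensors", {"device": 7, "temp_c": 21.5})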
5. Technologies
Some but not all MPP relational databases have the ability to store and manage
petabytes of data. Implicit is the ability to load, monitor, back up, and optimize
the use of the large data tables in the RDBMS.
Practitioners of big data analytics are generally hostile to slower
shared storage, preferring direct-attached storage (DAS) in its various forms,
from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel
processing nodes. The perception of shared storage architectures—Storage area
network (SAN) and Network-attached storage (NAS) —is that they are
relatively slow, complex, and expensive. These qualities are not consistent with
big data analytics systems that thrive on system performance, commodity
infrastructure, and low cost.
While many vendors offer off-the-shelf solutions for Big Data, experts
recommend the development of in-house solutions custom-tailored to solve the
company's problem at hand if the company has sufficient technical capabilities.
6. Applications

Government
The use and adoption of Big Data within governmental processes is beneficial
and allows efficiencies in terms of cost, productivity, and innovation. That said,
this process does not come without its flaws. Data analysis often requires
multiple parts of government (central and local) to work in collaboration and
create new and innovative processes to deliver the desired outcome. Below are
some leading examples within the governmental big data space.
In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address
important problems faced by the government. The initiative is composed of 84
different big data programs spread across six departments.
Big data analysis played a large role in Barack Obama's successful 2012
re-election campaign.
The United States Federal Government owns six of the ten most powerful
supercomputers in the world.
The Utah Data Center is a data center currently being constructed by the
United States National Security Agency. When finished, the facility will be able
to handle a large amount of information collected by the NSA over the Internet.
The exact amount of storage space is unknown, but more recent sources claim it
will be on the order of a few exabytes.
India
Big data analysis was, in part, responsible for the successful campaign of the
BJP and its allies in the 2014 Indian general election.
The Indian government utilises numerous techniques to ascertain how
the Indian electorate is responding to government action, as well as
ideas for policy augmentation.
United Kingdom
International development
Manufacturing
Cyber-Physical Models
Current PHM implementations mostly utilize data captured during actual usage,
while analytical algorithms can perform more accurately when more
information from throughout the machine's lifecycle, such as system configuration,
physical knowledge, and working principles, is included. There is a need to
systematically integrate, manage, and analyze machinery or process data during
the different stages of the machine's life cycle.
Media
To understand how the media utilises Big Data, it is first necessary to provide
some context on the mechanisms used in the media process. It has been
suggested by Nick Couldry and Joseph Turow that practitioners in media and
advertising approach big data as many actionable points of information about
millions of individuals. The industry appears to be moving away from the
traditional approach of using specific media environments such as newspapers,
magazines, or television shows and instead taps into consumers with
technologies that reach targeted people at optimal times in optimal locations.
The ultimate aim is to serve, or convey, a message or content that is
(statistically speaking) in line with the consumer's mindset. For example,
publishing environments are increasingly tailoring messages (advertisements)
and content (articles) to appeal to consumers, based on insights gleaned
exclusively through various data-mining activities.
Data capture: Big Data and the IoT work in conjunction. From a media
perspective, data is the key derivative of device interconnectivity and allows
accurate targeting. The Internet of Things, with the help of big data, is therefore
transforming the media industry, companies, and even governments, opening up a
new era of economic growth and competitiveness. The intersection of people,
data, and intelligent algorithms has far-reaching impacts on media efficiency.
The wealth of data generated adds an elaborate layer to the present targeting
mechanisms of the industry.
Technology
eBay.com uses two data warehouses, at 7.5 petabytes and 40 petabytes, as well
as a 40 PB Hadoop cluster for search, consumer recommendations, and
merchandising.
As of August 2012, Google was handling roughly 100 billion searches per
month.
Private sector
Retail
Retail Banking
The volume of business data worldwide, across all companies, doubles every
1.2 years, according to estimates.
Real Estate
Windermere Real Estate uses anonymous GPS signals from nearly 100
million drivers to help new home buyers determine their typical drive times
to and from work throughout various times of the day.
Science
The Large Hadron Collider experiments represent about 150 million sensors
delivering data 40 million times per second. There are nearly 600 million
collisions per second. After filtering and refraining from recording more than
99.99995% of these streams, there are 100 collisions of interest per second.
As a result, working with less than 0.001% of the sensor stream data,
the data flow from all four LHC experiments represents a 25-petabyte
annual rate before replication (as of 2012). This becomes nearly 200
petabytes after replication.
If all sensor data were recorded at the LHC, the data flow would be
extremely hard to work with: it would exceed an annual rate of 150 million
petabytes, or nearly 500 exabytes per day, before replication. To
put the number in perspective, this is equivalent to 500 quintillion (5×10²⁰)
bytes per day, almost 200 times more than all the other sources combined
in the world.
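A quick unit-conversion check (using decimal units, 1 EB = 1,000 PB = 10¹⁸ bytes) shows that the figures quoted above are mutually consistent; the Python snippet below is only a rough sanity check, not an official CERN calculation.

    # Rough sanity check of the LHC figures quoted above (decimal units).
    EB = 10**18                      # bytes in one exabyte
    PB = 10**15                      # bytes in one petabyte

    unfiltered_per_day = 500 * EB    # "nearly 500 exabytes per day"
    per_year_pb = unfiltered_per_day * 365 / PB

    print(f"unfiltered: {unfiltered_per_day:.1e} bytes/day "
          f"(i.e. 5x10^20, or 500 quintillion)")
    print(f"unfiltered: {per_year_pb / 1e6:.0f} million PB/year "
          f"(consistent with 'exceeds 150 million PB')")
    print(f"recorded after filtering: 25 PB/year = {25 * PB:.1e} bytes/year")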
When the Sloan Digital Sky Survey (SDSS) began collecting astronomical
data in 2000, it amassed more data in its first few weeks than had been collected
in the entire previous history of astronomy. Continuing at a rate of about 200 GB
per night, SDSS has amassed more than 140 terabytes of information. When the
Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, it is
anticipated to acquire that amount of data every five days.
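A quick piece of arithmetic on the survey figures above puts the two instruments side by side; the snippet below simply restates the quoted numbers as daily rates (decimal units).

    # Quick arithmetic on the survey figures above: SDSS collects about 200 GB
    # per night, while its successor is expected to gather SDSS's entire
    # archive (~140 TB) every five days.
    sdss_per_night_tb = 0.2          # 200 GB per night
    lsst_per_day_tb = 140 / 5        # 140 TB every five days

    print(f"LSST: about {lsst_per_day_tb:.0f} TB/day, roughly "
          f"{lsst_per_day_tb / sdss_per_night_tb:.0f}x the SDSS nightly rate")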
Decoding the human genome originally took 10 years; now it can be achieved in
less than a day. DNA sequencers have divided the sequencing cost by 10,000 in
the last ten years, a reduction 100 times greater than that predicted by
Moore's Law.
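The "100 times greater than Moore's Law" claim can be checked with simple arithmetic, assuming the common rule-of-thumb doubling period of 18 months for Moore's Law (an assumption, since the text above does not state one):

    # Back-of-the-envelope check of the sequencing-cost claim above, assuming
    # a Moore's-law doubling period of 18 months (a rule-of-thumb choice).
    years = 10
    moore_factor = 2 ** (years * 12 / 18)     # roughly 100x in 10 years
    actual_factor = 10_000                    # sequencing cost divided by 10,000

    print(f"Moore's law over {years} years: ~{moore_factor:.0f}x cheaper")
    print(f"Sequencing over {years} years: {actual_factor}x cheaper "
          f"(about {actual_factor / moore_factor:.0f}x beyond Moore's law)")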
7. Pros of Real-Time Big Data

There are significant benefits to real-time big data analytics. First, it allows
businesses to detect errors and fraud quickly, which significantly mitigates losses.
Second, it provides major advantages from a competitive standpoint: real-time
analysis allows businesses to develop more effective strategies against competitors
in less time, offering deep insight into consumer trends and sales. In addition, the
data collected are valuable and offer businesses a chance to improve profits and
customer service.
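As a toy sketch of the real-time fraud detection described above, the Python snippet below keeps running statistics of each account's transaction amounts and flags transactions that deviate sharply from that account's usual behaviour. The three-sigma threshold, the minimum history of five transactions, and the sample data are arbitrary illustrative choices, not a production rule set.

    # Toy sketch of a real-time fraud check: keep running statistics per
    # account and flag transactions far from that account's usual spending.
    import math
    from collections import defaultdict

    class AccountStats:
        """Running mean/variance of transaction amounts (Welford's method)."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def std(self):
            return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

    def process(stream, sigmas=3.0):
        stats = defaultdict(AccountStats)
        for account, amount in stream:
            s = stats[account]
            # Flag only once some history exists for the account.
            if s.n >= 5 and s.std() > 0 and abs(amount - s.mean) > sigmas * s.std():
                print(f"flag: account {account}, amount {amount}")
            s.update(amount)

    if __name__ == "__main__":
        txns = [("A", 20), ("A", 22), ("A", 19), ("A", 21), ("A", 23), ("A", 950)]
        process(txns)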
Perhaps the greatest argument in favor of real-time analysis of big data is that it may
be used to provide cutting-edge healthcare. Proponents of big data point out that
healthcare organizations can use electronic medical records and data from wearables
to prevent deadly hospital infections, for example. To these proponents, privacy
cannot trump the lives big data might save.
8. Cons of Real-Time Big Data
As valuable as this kind of big data can be, it also presents serious challenges. First is
the logistical issue. Companies hoping to use big data will need to modify their entire
approach as data flowing into the company becomes constant rather than periodic: this
mandates major strategic changes for many businesses. Next, real-time big data
demands the ability to conduct sophisticated analyses; companies that fail to do this
correctly open themselves up to implementing entirely incorrect strategies
organization-wide. Furthermore, many currently used data tools are not able to handle
real-time analysis.
One of the biggest concerns many laypeople and politicians have about real-time
analysis of big data is privacy. Civil liberties advocates have attacked the use of big
data from license plate scanners and drones, for example. The idea is that authorities
should not be able to circumvent constitutional protections against unreasonable
searches.
9. Conclusion
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history
of data analysis. The convergence of these trends means that we have the
capabilities required to analyze astonishing data sets quickly and cost-effectively
for the first time in history. These capabilities are neither theoretical nor trivial.
They represent a genuine leap forward and a clear opportunity to realize
enormous gains in terms of efficiency, productivity, revenue, and profitability.
The Age of Big Data is here, and these are truly revolutionary times if both
business and technology professionals continue to work together and
deliver on the promise.
References
https://en.wikipedia.org/wiki/Big_data
https://www.oracle.com/big-data/index.html
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
https://www.tutorialspoint.com/big_data_tutorials.html