Introduction To Big Data Management
SAW WAN SYNN, LIAU SHUK YEE, NOR AZREENA HUSNA BT MOHD JAMAIL, ARMILA AZIRA BT
MOHD SHARIFF
Introduction / Background
Big data refers to data sets that are so large and complex that traditional data
processing application software is inadequate to deal with them. The term is
often used for predictive analytics, user behaviour analytics, or certain other
advanced data analytics methods that extract value from data, and seldom for a
particular size of data set. Big data can also mean a massive volume of both
structured and unstructured data that is so large it is difficult to process
using traditional database and software techniques. In most enterprise
scenarios the volume of data is too big, it moves too fast, or it exceeds
current processing capacity.
Big data has the potential to help companies improve operations and make
faster, more intelligent decisions. When captured, formatted, manipulated,
stored, and analysed, this data can help a company gain useful insight to
increase revenues, win or retain customers, and improve operations.
Furthermore, big data relates to data creation, storage, retrieval and analysis
that is remarkable in terms of volume, velocity, and variety. These 3Vs are
the three defining properties or dimensions of big data. Volume refers to the
amount of data: organizations collect data from a variety of sources,
including business transactions, social media, and sensor or
machine-to-machine data. Velocity refers to the speed of data processing: data
streams in at an unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors and smart metering are driving the need to deal
with torrents of data in near-real time. Variety refers to the number of types
of data: data comes in all kinds of formats, from structured, numeric data in
traditional databases to unstructured text documents, email, video, audio,
stock ticker data and financial transactions.
Figure 1: 3Vs of Big Data
The goal of big data management is to ensure a high level of data quality and
accessibility for business intelligence and big data analytics applications.
Corporations, government agencies and other organizations employ big data
management strategies to help them contend with fast-growing pools of data,
typically involving many terabytes or even petabytes of information saved in a
variety of file formats. Effective big data management helps companies locate
valuable information in large sets of unstructured and semi-structured data
from a variety of sources, including call detail records, system logs and
social media sites.
One of the challenges of big data management is data visualization. It is hard
to present a mountain of data in a consumable form. If the interpreters, whether
human or software, analyse the data but produce output that cannot be
understood, the analysis has little value. Effective data visualization
therefore depends on a deep understanding of the range and vagaries of human
cognition.
Secondly, data quality is another challenge of big data management. As data
accumulates, it becomes harder to keep all of it consistent, correct and
complete (Emran et al. 2008; Leza & Emran 2014). The more data is stored, the
harder it is to maintain data integrity. It is important to make sure the data
stays consistent at all times. If an update does not reach every place where the
data is stored, the copies are no longer synchronized and the output will differ
from the original. Data quality suffers in this situation. Data should be stored
and updated in a synchronized manner across all copies to achieve the best data
quality in big data management.
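The two quality properties discussed above, completeness and consistency
between copies, can be illustrated with a minimal sketch. The record layout,
the `REQUIRED_FIELDS` list and the sample data are assumptions made for
illustration only; they are not taken from the cited works.

```python
from typing import Any

# Hypothetical required fields for a customer record (an assumption for illustration).
REQUIRED_FIELDS = ("id", "name", "email")


def completeness(records: list[dict[str, Any]]) -> float:
    """Return the fraction of records with a non-empty value for every required field."""
    if not records:
        return 1.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for rec in records
    )
    return complete / len(records)


def inconsistent_ids(primary: list[dict[str, Any]], replica: list[dict[str, Any]]) -> set:
    """Return ids whose record differs between two copies, or exists in only one."""
    a = {rec["id"]: rec for rec in primary}
    b = {rec["id"]: rec for rec in replica}
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


primary = [
    {"id": 1, "name": "Aminah", "email": "aminah@example.com"},
    {"id": 2, "name": "Ben", "email": "ben@example.com"},
    {"id": 3, "name": "Chong", "email": ""},
]
# The replica missed an update to record 2 and never received record 3.
replica = [
    {"id": 1, "name": "Aminah", "email": "aminah@example.com"},
    {"id": 2, "name": "Ben", "email": "ben@old.example.com"},
]

print(f"completeness: {completeness(primary):.2f}")
print("out of sync:", sorted(inconsistent_ids(primary, replica)))
```

In this sample, record 3 is incomplete because its email is empty, and records
2 and 3 are reported as out of sync between the two copies. A real system would
run such checks incrementally rather than over full in-memory lists.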
Besides that, the more data is stored, the harder it is to analyse.
Interpretation of data is difficult. How the data is handled affects the
interpretation result, but the key factor is how well the analyst understands
the stored data. Reading, understanding and analysing the data is not an easy
job, because big data contains many different kinds of content. Efficient
filters and pattern recognizers have to be designed to sift through the huge
volume of data and find the patterns that may be relevant to the dimension of
interest.
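As a rough illustration of a filter followed by a pattern recognizer, the
sketch below streams log lines through Python generators so that the full data
set never has to be held in memory. The log format, the severity level and the
two named patterns are hypothetical choices made for this example.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical log format: "<timestamp> <LEVEL> <message>" (an assumption for illustration).
LINE_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def parse(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Lazily yield parsed records, skipping lines that do not match the format."""
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            yield match.groupdict()


def only_level(records: Iterable[dict[str, str]], level: str) -> Iterator[dict[str, str]]:
    """Filter: keep only records of the requested severity level."""
    return (rec for rec in records if rec["level"] == level)


def count_patterns(records: Iterable[dict[str, str]], patterns: dict[str, re.Pattern]) -> Counter:
    """Pattern recognizer: count how many messages match each named pattern."""
    counts: Counter = Counter()
    for rec in records:
        for name, pattern in patterns.items():
            if pattern.search(rec["msg"]):
                counts[name] += 1
    return counts


sample_lines = [
    "2017-03-01T10:00:00 INFO user=42 logged in",
    "2017-03-01T10:00:05 ERROR timeout while calling payment service",
    "garbage line that does not parse",
    "2017-03-01T10:00:09 ERROR disk quota exceeded on node-7",
    "2017-03-01T10:00:12 ERROR timeout while calling inventory service",
]
patterns_of_interest = {
    "timeout": re.compile(r"\btimeout\b"),
    "disk": re.compile(r"\bdisk\b"),
}

error_counts = count_patterns(only_level(parse(sample_lines), "ERROR"), patterns_of_interest)
print(dict(error_counts))
```

Because each stage is a generator, the same pipeline could read from a file or
a network stream instead of the small in-memory list used here.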
Furthermore, querying is hard within big data. The way the data is stored may
affect how it can be queried. With such a huge amount of data stored, the
relationships between data items must be known before an overall result can be
obtained. In addition, a result is returned faster when data is retrieved from
only 10 rows than from thousands of rows. The high complexity and high volume of
the data make querying a critical challenge.
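The effect of storage layout on querying can be sketched with SQLite from the
Python standard library, used here only as a small stand-in for a big data
store. The `calls` table, its columns and the generated rows are hypothetical.
The sketch asks the database for its query plan before and after an index is
added, which shows whether every row must be scanned or only the matching rows
are visited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, caller TEXT, duration INTEGER)")
# 50,000 synthetic call records spread over 1,000 hypothetical callers.
conn.executemany(
    "INSERT INTO calls (caller, duration) VALUES (?, ?)",
    ((f"user{i % 1000}", i % 300) for i in range(50_000)),
)

QUERY = "SELECT COUNT(*), SUM(duration) FROM calls WHERE caller = ?"


def plan(sql: str, params: tuple) -> str:
    """Return SQLite's query plan description as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, ("user7",))
conn.execute("CREATE INDEX idx_calls_caller ON calls (caller)")
plan_after = plan(QUERY, ("user7",))

result = conn.execute(QUERY, ("user7",)).fetchone()
print("without index:", plan_before)
print("with index:   ", plan_after)
print("rows matched, total duration:", result)
```

The exact wording of the plan text varies between SQLite versions, so it
should be read as a diagnostic aid rather than a stable interface.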
Personal health data should be kept secure because the information it contains
is confidential to the patient.
The tools that work with big data are used for storage, analysis, and
querying. Storing big data is an issue that data managers need to deal with,
especially in organizations that set green data management as a priority (Emran
et al. 2013). A good big data tool provides the infrastructure to support all
the related activities, because the data we deal with is much bigger than
traditional data. Good tools are therefore in demand to get the best
performance on big data. Hadoop, Cloudera, Talend and others are examples of big
data management tools.
Hadoop is one of the tools used to manage big data. It is a popular tool for
organizing and processing data. Hadoop is an open-source software framework
produced by Apache. It is prevalent in many industries because it provides a
software library for processing voluminous data sets across clusters of
computers using effective programming models. Hadoop can deliver great
processing power and handle a very large number of concurrent tasks or jobs. In
addition, its developers provide improvements and updates to the product
regularly.
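The best-known programming model associated with Hadoop is MapReduce. The
sketch below imitates its three steps (map, shuffle, reduce) in plain Python
within a single process. It does not use any Hadoop API, and the function names
and sample text are assumptions chosen for illustration. On a real cluster the
framework would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key, as a framework would between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for an input split that a cluster would process on a separate node.
splits = [
    "big data needs big storage",
    "big clusters process data in parallel",
]

mapped = (pair for split in splits for pair in map_phase(split))
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

The point of the model is that the map and reduce functions contain no
knowledge of how the data is distributed, which is what allows the same logic
to scale from one machine to many.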
Cloudera's main purpose is to create a data repository that can be accessed by
all corporate users who need the data for different purposes. Cloudera helps a
business build an enterprise data hub and gives people in the organization
better access to the data it stores. It serves as an enterprise solution for
managing both the business's data and the Hadoop ecosystem at the same time.
This combination helps increase the organization's competitiveness.
Talend also provides a good platform for working with big data. Its
open-source product is called Talend Open Studio. Talend offers an Eclipse-based
IDE for stringing together data processing jobs with Hadoop. The company focuses
on its Master Data Management (MDM) offering, which combines real-time data,
applications, and process integration with embedded data quality and
stewardship. In Talend Studio, jobs are built by dragging and dropping small
icons onto a canvas. For example, to get an RSS feed, a Talend component will
fetch the RSS and add proxying if necessary. There are dozens of components for
gathering information, dozens more for operations such as a "fuzzy match", and
others for outputting the results. Stringing blocks together visually can be
simple once the user has a feel for what the components actually do and don't
do. This becomes easier to figure out by looking at the source code being
assembled behind the canvas. Visual programming in Talend may seem like a lofty
goal, but the icons can never represent the mechanisms with enough detail to
make it possible to understand what is going on.
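To give a feel for what a "fuzzy match" step does, the following sketch matches
misspelled or differently cased names against a clean reference list using
Python's standard `difflib` module. This is not Talend's component or its
algorithm; the names, the `fuzzy_match` helper and the 0.8 cutoff are
assumptions made for illustration.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical reference list of clean customer names (illustrative data only).
master_names = ["Acme Corporation", "Globex Industries", "Initech Sdn Bhd"]


def fuzzy_match(raw: str, candidates: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest candidate at or above the cutoff, or None if nothing is close enough."""
    # Compare in lower case so that differences in capitalization do not count as mismatches.
    lowered = {name.lower(): name for name in candidates}
    hits = get_close_matches(raw.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


incoming = ["Acme Corporaton", "globex industries", "Umbrella Group"]
matches = {raw: fuzzy_match(raw, master_names) for raw in incoming}
print(matches)
```

A higher cutoff reduces false matches at the cost of missing more badly
misspelled entries, so the threshold usually has to be tuned against sample
data.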
Conclusion
Big data management is a current trend. It helps to raise the level of
business intelligence. Moreover, it improves performance when running
operational data, cleaning data, enriching data, modelling data and other tasks
for the best analysis result. The majority of the software for big data
management is open-source, which makes it easy to deploy data analysis that
copes with data volume, velocity and variety. Big data management helps
organizations keep up with the rapid pace of innovation. In a nutshell, big data
is the up-and-coming era of data warehousing and business analytics, and is
poised to deliver top-line revenue cost-efficiently for enterprises.
References
Webopedia. (2017). Big Data. Retrieved from
http://www.webopedia.com/TERM/B/big_data.html
Tom Jager. (2016). Top 10 tools for working with big data for successful analytics
developers. Retrieved from
http://bigdata-madesimple.com/top-10-tools-for-working-with-big-data-for-successful-analytics-developers-2/
Import.io. (2017). All the best big data tools and how to use them. Retrieved from
https://www.import.io/post/all-the-best-big-data-tools-and-how-to-use-them/
James Nunns. (2015). 10 of the most popular Big Data tools for developers. Retrieved
from
http://www.cbronline.com/news/big-data/10-of-the-most-popular-big-data-tools-for-developers-4570483/
Kathy Simpson. (2016). 10 Tips to Prevent Data Theft for Your Small Business.
Retrieved from
https://sba.thehartford.com/managing-risk/10-tips-to-prevent-data-theft
Rajeev Agrawal, Christopher Nyamful. (2016). Challenges of big data storage and
management. Retrieved from
https://www.researchgate.net/publication/298433319_Challenges_of_big_data_storage_and_management
Bill Carmody. (2016). Biggest problem with big data management in 2016. Retrieved
from
https://www.inc.com/bill-carmody/biggest-problem-with-big-data-management-in-2016.html
Kirk Borne. (2014). Top 10 big data challenges: A serious look at 10 big data V's.
Retrieved from
https://mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs/
Peter Wayne. (2012). 7 top tools for taming big data. Retrieved from
http://www.infoworld.com/article/2616959/big-data/7-top-tools-for-taming-big-data.html
John Parkinson. (2012). Managing Big Data: Six Operational Challenges. Retrieved
from
http://www.cioinsight.com/c/a/Expert-Voices/Managing-Big-Data-Six-Operational-Challenges-484979
Emran, N.A., Abdullah, N. & Isa, M.N.M., 2013. Storage space optimisation for
green data center. In Procedia Engineering. pp. 483-490.
Emran, N., Embury, S. & Missier, P., 2008. Model-driven component generation for
families of completeness. In 6th International Workshop on Quality in
Databases and Management of Uncertain Data, Very Large Databases (VLDB).
Leza, F.N.M. & Emran, N.A., 2014. Data accessibility model using QR code for
lifetime healthcare records. World Applied Sciences Journal, 30(30), pp. 395-402.