
BIG DATA ANALYTICS

MODULE I
BIG DATA
• Big data is the term used to describe data of high volume, high velocity and high variety that requires new technologies and techniques to capture, store and analyze it.

• Big data consists of large datasets that cannot be managed efficiently by traditional relational database management systems.

• These datasets range in size from terabytes (2^40 bytes) to exabytes (2^60 bytes).
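For reference, the storage units used above, in the binary interpretation consistent with the 2^40 and 2^60 figures:

\[
1~\text{TB} = 2^{40}~\text{bytes}, \quad
1~\text{PB} = 2^{50}~\text{bytes}, \quad
1~\text{EB} = 2^{60}~\text{bytes}, \quad
1~\text{ZB} = 2^{70}~\text{bytes} \approx 10^{21}~\text{bytes}.
\]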

• Big data can be structured, unstructured or semi-structured, i.e., heterogeneous in nature.

• Mobile phones, credit cards, RFID (Radio Frequency Identification) devices and social networking platforms create huge amounts of data that may reside unutilized on unknown servers for many years.

• With the evolution of big data, this data can be accessed and analyzed to generate useful information.
Big Data Definitions
• Big Data is high-volume, high-velocity and/or high-variety information that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

• "A collection of data sets so large or complex that traditional data processing applications are inadequate." – Wikipedia

• Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.
Classification of Data
• Data can be classified as
▫ Structured
▫ Semi-structured
▫ Unstructured.
▫ Multi-structured
Structured Data
• Structured data conform to and associate with data schemas and data models.

• Structured data are found in tables (rows and columns). Nearly 15-20% of data are in structured or semi-structured form.
Structured data enables the following (see the sketch below):
• Data insert, delete, update and append
• Indexing to enable faster data retrieval
• Scalability, which enables increasing or decreasing capacities and data processing operations such as storing, processing and analytics
• Transaction processing, which follows the ACID rules (Atomicity, Consistency, Isolation and Durability)
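A minimal sketch of these operations using Python's built-in sqlite3 module; the table name and values are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational store
cur = conn.cursor()

cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_item ON sales(item)")   # indexing for faster retrieval

try:
    # Atomicity: either both statements take effect or neither does
    cur.execute("INSERT INTO sales (item, amount) VALUES (?, ?)", ("laptop", 55000.0))
    cur.execute("UPDATE sales SET amount = amount * 1.18 WHERE item = ?", ("laptop",))
    conn.commit()        # Durability: committed changes persist
except sqlite3.Error:
    conn.rollback()      # Consistency: undo partial changes on error

print(cur.execute("SELECT item, amount FROM sales").fetchall())
conn.close()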
• Semi-Structured Data
▫ Examples of semi-structured data are XML and JSON documents. Semi-structured data contain tags or other markers, which separate semantic elements and enforce hierarchies of records and fields within the data (see the sketch below).
▫ Semi-structured data do not conform to formal data model structures; they do not associate with data models such as the relational database and table models.
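A minimal sketch of a semi-structured JSON document parsed with Python's standard json module; the field names are hypothetical:

import json

doc = """
{
  "student": {
    "name": "Asha",
    "grades": [
      {"course": "Big Data Analytics", "marks": 88},
      {"course": "DBMS", "marks": 91}
    ]
  }
}
"""

record = json.loads(doc)                  # keys act as markers separating semantic elements
for g in record["student"]["grades"]:     # nesting enforces a hierarchy of records and fields
    print(g["course"], g["marks"])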
Unstructured Data
• Data does not possess data features such as a table or a database.
• Unstructured data are found in file types such as .txt and .csv.
• Data may be in the form of key-value pairs, such as hash key-value pairs.
• Data may have internal structures, such as in e-mails.
• The data do not reveal relationships or hierarchies.
• The relationships, schema and features need to be established separately.

Examples of Unstructured Data
• Mobile data: text messages, chat messages, tweets, blogs and comments
• Website content data: YouTube videos, browsing data, e-payments, web store data, user-generated maps
• Social media data: data exchanged in various forms
• Texts and documents
• Personal documents and e-mails
• Text internal to an organization: text within documents, logs, survey results
• Satellite images, atmospheric data, surveillance and traffic videos, images from Instagram and Flickr (upload, access, organize, edit and share photos from any device, from anywhere in the world)
Multi-Structured Data
• Multi-structured data refers to data consisting of multiple formats of data, viz. structured, semi-structured and/or unstructured data.
• Multi-structured data sets can have many formats.
• They are found in non-transactional systems.
• For example, streaming data on customer interactions, data from multiple sensors, data at web or enterprise servers, or data-warehouse data in multiple formats.

Big Data Characteristics
• Volume: relates to the size of the data.
• Velocity: refers to the speed at which data is generated.
• Variety: comprises the different forms and types of data.
• Veracity: the quality of the data captured, which can vary greatly, affecting accurate analysis.
Volume
• Volume is the data generated by organizations or individuals.
• Today the volume of data in most organizations is approaching exabytes.
• According to IBM, over 2.7 zettabytes (1 ZB = 10^21 bytes) of data are present in the digital universe today.
• Every minute, 571 new websites are being created.

Velocity
• Velocity is the rate at which data is generated, captured and shared.
• The sources of high-velocity data include the following:
▫ IT devices, including routers, switches, firewalls, etc., constantly generate valuable data.
▫ Portable devices, including mobiles, PDAs, etc., also generate data at high speed.

Variety
• Data is being generated at a very fast pace.
• Data is now generated from different types of sources, such as internal, external, social and behavioural, and comes in different formats such as images, text, videos, etc.

Veracity
• Veracity refers to the uncertainty of data, i.e., whether the obtained data is correct or not.
• Out of the huge amount of data generated in almost every process, only the data that is correct and consistent can be used for further analysis.
• Data, when processed, becomes information.
• Big data is messy in nature, so it takes a good amount of time and expertise to clean that data and make it suitable for analysis.
Big Data Types

• Social networks and web data, such as Facebook, Twitter, e-mails, blogs and YouTube.
• Transactions data and Business Processes (BPs) data, such as credit card transactions, flight bookings, etc., and public agencies data, such as medical records, insurance business data, etc.
• Customer master data, such as data for facial recognition and for the name, date of birth, marriage anniversary, gender, location and income category.
• Machine-generated data, such as machine-to-machine or Internet of Things data, and the data from sensors, trackers and web logs. Computer-generated data is also considered machine-generated data.
• Human-generated data, such as biometrics data, human-machine interaction data, e-mail records with a mail server and a MySQL database of student grades.
• Humans also record their experiences in ways such as writing in notebooks or diaries, taking photographs, or making audio and video clips.

Big Data Handling Techniques
• Following are the techniques deployed for Big Data storage, applications, data management, mining and analytics:
▫ Huge data volume storage, data distribution, high-speed networks and high-performance computing.
▫ Application scheduling using open source, reliable, scalable, distributed file systems, distributed databases, and parallel and distributed computing systems, such as Hadoop or Spark.
▫ Open source tools which are scalable and elastic, and provide a virtualized environment, clusters of data nodes, and task and thread management.
▫ Data management using NoSQL, document databases, column-oriented databases, graph databases and other forms of databases (including in-memory) as per the needs of the applications.
▫ Data mining and analytics, data retrieval, data reporting, data visualization and machine learning Big Data tools.
Scalability and Parallel Processing
• Big Data needs processing of large data volumes, and therefore needs intensive computations.
• Processing complex applications with large datasets (terabyte to petabyte datasets) needs hundreds of computing nodes.
• Processing this much distributed data within a short time and at minimum cost is problematic.
• Scalability is the capability of a system to handle the workload as per the magnitude of the work.
• System capability needs to increase with increased workloads.
• When the workload and complexity exceed the system capacity, scale it up and scale it out.
• Scalability enables increase or decrease in the capacity of data storage, processing and analytics.

Analytical Scalability
• Vertical scalability means scaling up the given system's resources and increasing the system's analytics, reporting and visualization capabilities.
• This is an additional way to solve problems of greater complexity. Scaling up means designing the algorithm according to the architecture that uses resources efficiently.
• If x terabytes of data take time t for processing and the code size or complexity increases by a factor n, then scaling up means that processing takes a time equal to, less than or much less than (n * t).
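This condition can be written as follows (notation assumed here: t is the original processing time and n the complexity growth factor):

\[
T_{\text{scaled-up}} \;\le\; n \cdot t
\]

That is, effective scaling up keeps the processing time from growing faster than linearly in the added complexity.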
• Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload.
• Processing the different datasets of a large dataset in parallel deploys horizontal scalability.
• Scaling out means using more resources and distributing the processing and storage tasks in parallel.
• The easiest way to scale up and scale out execution of analytics software is to implement it on a bigger machine with more CPUs for greater volume, velocity, variety and complexity of data.
• The software will definitely perform better on a bigger machine.

Massive Parallel Processing Platforms
• Many programs are so large and complex that it is impractical to execute them on one computer.
• So scale up the computer system or use massive parallel processing (MPP).
• Parallelization of tasks can be done at several levels (see the sketch after this list):
▫ Distributing separate tasks onto separate threads on the same CPU.
▫ Distributing separate tasks onto separate CPUs on the same computer.
▫ Distributing separate tasks onto separate computers.
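A minimal sketch of the first two levels using Python's standard library: separate threads, and separate processes that the operating system can place on separate CPUs (the data and chunk size are hypothetical):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def partial_sum(chunk):
    # one independent task: sum a chunk of the data
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:       # tasks on separate threads
        thread_total = sum(pool.map(partial_sum, chunks))

    with ProcessPoolExecutor(max_workers=4) as pool:      # tasks on separate CPUs/processes
        process_total = sum(pool.map(partial_sum, chunks))

    print(thread_total, process_total)                    # both equal sum(range(1_000_000))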
(i) Distributed Computing Model
▫ A distributed computing model uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high-speed networks.
▫ Big Data processing uses a parallel, scalable and no-sharing (shared-nothing) program model, such as MapReduce, for computations on it.

(ii) Cloud Computing
• "Cloud computing is a type of Internet-based computing that provides shared processing resources and data to the computers and other devices on demand."
• One of the best approaches for data processing is to perform parallel and distributed computing in a cloud-computing environment.
• Cloud resources can be Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft Azure or Apache CloudStack.
• Cloud computing features are: (i) on-demand service, (ii) resource pooling, (iii) scalability, (iv) accountability and (v) broad network access.
• Cloud services can be accessed from anywhere and at any time through the Internet.
• Cloud computing allows availability of computer infrastructure and services on a demand basis.
Cloud Services
• There are three types of cloud services:
▫ Infrastructure as a Service (IaaS)
▫ Platform as a Service (PaaS)
▫ Software as a Service (SaaS)

Infrastructure as a Service (IaaS)
• Providing access to resources such as hard disks, network connections, database storage, data centers and virtual server spaces is Infrastructure as a Service (IaaS).
• Some examples are Tata Communications, and Amazon data centers and virtual servers.
• Apache CloudStack is open source software for deploying and managing a large network of virtual machines, and offers public cloud services which provide highly scalable Infrastructure as a Service (IaaS).

Platform as a Service (PaaS)
• It implies providing the runtime environment to allow developers to build applications and services, which is what cloud Platform as a Service means.
• Software at the cloud supports and manages the services, storage, networking, deploying, testing, collaborating, hosting and maintaining of applications.
• Examples are Hadoop cloud services (IBM BigInsights, Microsoft Azure HDInsight, Oracle Big Data Cloud Service).

Software as a Service (SaaS)
• Providing software applications as a service to end-users is known as Software as a Service.
• Software applications are hosted by a service provider and made available to customers over the Internet.
• Some examples are Google SQL, IBM Big SQL, Microsoft PolyBase and Oracle Big Data SQL.
(iii) Grid Computing
• Grid computing refers to distributed computing, in which a group of computers from several locations are connected with each other to achieve a common task.
• The computer resources are heterogeneous and geographically dispersed.
• A group of computers that might be spread over remote locations comprises a grid.
• A single grid, of course, is dedicated at an instance to a particular application only.

Features
• Grid computing, similar to cloud computing, is scalable.
• Cloud computing depends on sharing of resources (for example, networks, servers, storage, applications and services) to attain coordination and coherence among resources, similar to grid computing.
• Similarly, a grid also forms a distributed network for resource integration.
(iv) Cluster Computing
• A cluster is a group of homogeneous computers
connected by a network.
• The group works together to accomplish the same
task. Clusters are used mainly for load balancing. They
shift processes between nodes to keep an even load on
the group of connected computers.
(v) Volunteer Computing
• Volunteers are organizations or members who own personal computers.
• Volunteer computing is a distributed computing paradigm which uses the computing resources of the volunteers.
• Examples of projects are science-related projects executed by universities or academia in general.
• Some issues with volunteer computing systems are:
▫ Heterogeneity of the volunteered computers
▫ Drop-outs from the network over time
Designing the Data Architecture
• Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.
• The architecture logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security, and more.

Data processing architecture consists of five layers:
(i) Identification of data sources
(ii) Acquisition, ingestion, extraction, pre-processing and transformation of data
(iii) Data storage at files, servers, cluster or cloud
(iv) Data processing
(v) Data consumption in a number of programs and applications
Big Data Architecture
• Logical layer 1 (L1) is for identifying data sources, which are external, internal or both.
• Layer 2 (L2) is for data ingestion.
• Data ingestion means a process of absorbing information, just like the process of absorbing nutrients and medications into the body by eating or drinking them.
• Ingestion is the process of obtaining and importing data for immediate use or transfer. Ingestion may be in batches or in real time, using pre-processing or semantics.
Layer 1
L1 considers the following aspects in a design:
▫ Amount of data needed at ingestion layer 2 (L2)
▫ Push from L1 or pull by L2 as per the mechanism for the usages
▫ Source data types: database, files, web or service
▫ Source formats, i.e., semi-structured, unstructured or structured

Layer 2
L2 considers the following aspects:
• Ingestion and ETL processes either in real time, which means store and use the data as generated, or in batches.
• Batch processing means using discrete datasets at scheduled or periodic intervals of time.

Layer 3
L3 considers the following aspects:
• Data storage type (historical or incremental), format, compression, incoming data frequency, querying patterns and consumption requirements for L4 or L5.
• Data storage using the Hadoop Distributed File System or NoSQL data stores, such as HBase, Cassandra and MongoDB.

Layer 4
L4 considers the following aspects:
• Data processing software such as MapReduce, Hive, Pig, Spark, Spark Mahout and Spark Streaming.
• Processing in scheduled batches, in real time or hybrid.
• Processing as per synchronous or asynchronous processing requirements at L5.

Layer 5
L5 considers the following aspects:
• Data integration.
• Dataset usages for reporting and visualization.
• Analytics (real time, near real time, scheduled batches), BPs, BIs, knowledge discovery.
• Export of datasets to cloud, web or other systems.


Managing Data for Analysis
• Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
• Reports, analyses and visualizations need well-defined data.
• Data management functions include:
1. Data assets creation, maintenance and protection.
2. Data governance, which includes establishing the processes for ensuring the availability, usability, integrity, security and high quality of data.
3. Data architecture creation, modelling and analysis.
4. Database maintenance, administration and management systems, for example, RDBMS (relational database management system) and NoSQL.
5. Managing data security, data access control, deletion, privacy and security.
6. Managing the data quality.
7. Data collection using the ETL process.
9. Creation of reference and master data, and data control and supervision.
10. Data and application integration.
11. Integrated data management, enterprise-ready data creation, fast access and analysis, and automation and simplification of operations on the data.
12. Data warehouse management.
13. Maintenance of business intelligence.
Data Source, Quality, Pre-processing and Storing

Data Source
• Application programs and tools use data.
• Sources can be external, such as sensors, trackers, web logs, etc.
• Data can also be internal, such as databases, flat files, spreadsheets, CSV files, web servers, etc.
• Data can be structured, semi-structured, multi-structured or unstructured.
Structured Data Sources
• A data source for ingestion, storage and processing can be a file, database or streaming data.
• The source may be on the same computer running a program or on a networked computer.
• Structured data sources include SQL Server, MySQL, Microsoft Access database, Oracle DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server.
• A data source name implies a defined name, which a process uses to identify the source.
• The name needs to be meaningful.
• A data dictionary enables references for access to data.
• The dictionary consists of a set of master lookup tables.
• The dictionary is stored at a central location for easier access and administration of changes in resources.
• Microsoft applications consider two types of sources for processing:
▫ (i) Machine sources and (ii) File sources.
• (i) Machine sources: data are present on computing nodes such as servers. A machine identifies a source by the user-defined name, driver manager name and source driver name.
• (ii) File sources are stored files. An application accessing data first connects to the driver manager of the source. A user or application connects to the manager when required.
Unstructured Data Sources
• Unstructured data sources are distributed over high-speed networks.
• The data need high-velocity processing. Sources are from distributed file systems.
• The sources are of file types such as .txt (text file) and .csv (comma-separated values file).
• Data may be as key-value pairs, such as hash key-value pairs.

Data Sources - Sensors, Signals and GPS
• The data sources can be sensors, sensor networks, signals from machines, devices, controllers and intelligent edge nodes of different types in industry M2M communication, and the GPS systems.
• Sensors are electronic devices that sense the physical environment.
• They are used for measuring temperature, pressure, humidity, traffic or objects in proximity.
• Data from RFID is used to track parcels.
Data Quality
• High quality means data which enables all the required operations, analysis, decisions, planning and knowledge discovery correctly.
• High-quality data can be defined with the five R's as follows:
▫ Relevancy
▫ Recency
▫ Range
▫ Robustness
▫ Reliability

Data Integrity
• Data integrity refers to the maintenance of consistency and accuracy in data over its usable life.
• Software which stores, processes or retrieves the data should maintain the integrity of the data.
• Data should be incorruptible.
Factors Affecting Data Quality
• Data Noise
• Outlier
• Missing Value
• Duplicate value
Data Noise
• One of the factors affecting data quality is noise.
• Noise in data refers to data giving additional meaningless information besides the true (actual/required) information.
• Noise is random in character, which means the frequency with which it occurs varies over time.

Outlier
• An outlier in data refers to data which appears not to belong to the dataset, for example, data that is outside an expected range.
• Actual outliers need to be removed from the dataset, else the result will be affected by a small or large amount.

Missing Value, Duplicate Value
• Another factor affecting data quality is missing values.
• A missing value implies data not appearing in the dataset.
• Another factor affecting data quality is duplicate values.
• A duplicate value implies the same data appearing two or more times in a dataset.
Data Preprocessing
• Data pre-processing is an important step at the ingestion layer.
• Pre-processing is a must before data mining and analytics.
• Pre-processing is also a must before running a Machine Learning (ML) algorithm.
• Pre-processing needs are (see the sketch after this list):
▫ Dropping out-of-range, inconsistent and outlier values
▫ Filtering unreliable, irrelevant and redundant information
▫ Data cleaning, editing, reduction and/or wrangling
▫ Data validation, transformation or transcoding
▫ ELT processing
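A minimal sketch of these pre-processing steps with pandas (assumed to be installed); the column names, value ranges and sentinel code are hypothetical:

import pandas as pd

raw = pd.DataFrame({
    "sensor_id":   [1, 1, 2, 2, 3],
    "temperature": [21.5, 21.5, -999.0, 23.1, None],   # -999.0 is an out-of-range code
})

clean = (
    raw.drop_duplicates()                                        # remove duplicate values
       .dropna(subset=["temperature"])                           # drop missing values
       .query("temperature >= -40 and temperature <= 60")        # drop out-of-range/outlier values
       .astype({"temperature": "float64"})                       # simple transformation/transcoding
)
print(clean)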
Data Enrichment
• "Data enrichment refers to operations or processes which refine, enhance or improve the raw data."

Data Editing
• Data editing refers to the process of reviewing and adjusting the acquired datasets.
• Editing controls the data quality.
• Editing methods are (i) interactive, (ii) selective, (iii) automatic, (iv) aggregating and (v) distribution.

Data Reduction
• Data reduction enables the transformation of acquired information into an ordered, correct and simplified form.

Data Wrangling
• Data wrangling refers to the process of transforming and mapping the data.
• Results from analytics are then appropriate and valuable.
• Mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
Data Formats Used During Pre-processing

Different formats for data transfer (see the sketch after this list):
• Comma-Separated Values (CSV).
• JavaScript Object Notation (JSON), as batches of object arrays or resource arrays.
• Tag-Length-Value (TLV).
• Key-value pairs.
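A minimal sketch of three of these transfer formats using Python's standard library; the items and amounts are hypothetical:

import csv, io, json

csv_text = "item,amount\nlaptop,55000\nphone,18000\n"          # CSV
rows = list(csv.DictReader(io.StringIO(csv_text)))

as_json = json.dumps(rows)                                      # JSON batch of objects
as_key_value = {r["item"]: int(r["amount"]) for r in rows}      # key-value pairs

print(as_json)
print(as_key_value)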
Data Export to Cloud
• The figure shows the resulting data pre-processing, data mining, analysis, visualization and data store.
• The data exports to cloud services.
• The results integrate at the enterprise server or data warehouse.

Figure 1.3 Data pre-processing, analysis, visualization, data store export

Cloud Services
• The cloud offers various services. These services can be accessed through a cloud client (client application), such as a web browser, SQL or another client.
• Figure 1.4 shows data-store export from machines, files, computers, web servers and web services.
• The data exports to clouds, such as IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata Communications or Hadoop cloud services.

Export of Data to AWS and Rackspace Clouds
• Google Cloud Platform provides a cloud service called BigQuery, as Figure 1.5 shows.
• BigQuery is a cloud service at the Google Cloud Platform.
• The data exports from a table or partition schema, or from JSON, CSV or Avro files from data sources after the pre-processing.
Data Storage and Analysis
Data Storage and Management: Traditional Systems
• Traditional systems use structured or semi-structured data.
• The sources of structured data stores are:
▫ Traditional relational database management system (RDBMS) data, such as MySQL and DB2, enterprise servers and data warehouses.
• The sources of semi-structured data are:
▫ XML and JSON semi-structured documents.
▫ CSV files.
SQL
• An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing (update, insert, append or delete) databases. SQL provides the following (see the sketch after this list):
1. Create schema: a schema is a structure which contains descriptions of objects (base tables, views, constraints) created by a user. The user can describe the data and define the data in the database.
2. Create catalog, which consists of a set of schemas which describe the database.
3. Data Definition Language (DDL), for the commands which describe a database, including creating, altering and dropping tables and establishing constraints. A user can create and drop databases and tables, establish foreign keys, and create views and stored procedures.
4. Data Manipulation Language (DML), for commands that maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
5. Data Control Language (DCL), for commands that control a database, including administering privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.
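A minimal sketch distinguishing DDL and DML statements, run with Python's built-in sqlite3; the table and values are hypothetical, and DCL statements such as GRANT/REVOKE need a server RDBMS, so they are shown only as a comment:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of the database
cur.execute("CREATE TABLE students (usn TEXT PRIMARY KEY, name TEXT, marks INTEGER)")
cur.execute("ALTER TABLE students ADD COLUMN grade TEXT")

# DML: maintain and query the data
cur.execute("INSERT INTO students (usn, name, marks) VALUES ('1XX20CS001', 'Asha', 88)")
cur.execute("UPDATE students SET grade = 'A' WHERE marks >= 85")
print(cur.execute("SELECT usn, name, grade FROM students").fetchall())

# DCL (illustrative only, e.g. on MySQL or Oracle):
# GRANT SELECT ON students TO analyst_role;

conn.commit()
conn.close()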
Distributed Database Management System
• A distributed DBMS (DDBMS) is a collection of logically interrelated databases at multiple systems over a computer network.
• It is a collection of logically related databases.
• There is cooperation between the databases in a transparent manner.
• It is 'location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.

In-Memory Column-Format Data
• A columnar in-memory format allows faster data retrieval when only a few columns in a table need to be selected during query processing or aggregation.
• Online Analytical Processing (OLAP) in real-time transaction processing is fast when using in-memory column-format tables.
• The CPU accesses all columns in a single instance of access to the memory in columnar-format in-memory data storage.
• OLAP enables online viewing of analyzed data and visualization up to the desired granularity.
• It helps to obtain summarized information and automated results for a large database.
In-Memory Row Format Databases

• A row in-memory format allows much faster data processing during OLTP (online transaction processing).
• Each row record has corresponding values in multiple columns, and the online values store at consecutive memory addresses in row format.
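A minimal sketch contrasting the two layouts with plain Python structures; the records are hypothetical:

row_format = [                                   # one record per row (suits OLTP)
    {"id": 1, "item": "laptop", "amount": 55000},
    {"id": 2, "item": "phone",  "amount": 18000},
]

column_format = {                                # one array per column (suits OLAP)
    "id":     [1, 2],
    "item":   ["laptop", "phone"],
    "amount": [55000, 18000],
}

# Aggregating one column touches a single contiguous list in the columnar layout ...
print(sum(column_format["amount"]))
# ... whereas the row layout must visit every record to reach that field.
print(sum(r["amount"] for r in row_format))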
Enterprise Data-Store Server and Data Warehouse

• Enterprise data servers use data from several distributed sources which store data using various technologies.
• All data merge using an integration tool.
• Integration enables collective viewing of the datasets at the data warehouse.
• Enterprise data integration may also include integration with application(s), such as analytics, visualization, reporting, business intelligence and knowledge discovery.

• Following are some standardised business processes, as defined in the Oracle application-integration architecture:
▫ Integrating and enhancing the existing systems and processes
▫ Business intelligence
▫ Data security and integrity
▫ New business services/products (web services)
▫ Collaboration/knowledge management
▫ Enterprise architecture/SOA
▫ e-commerce
▫ External customer services
▫ Supply chain automation/visualization

• Steps 1 to 5 show enterprise data integration and management with Big Data for high-performance computing, using local and cloud resources for the analytics, applications and services.
Big Data Storage
NoSQL
• NoSQL data are considered as semi-structured data.
• Big Data stores use NoSQL.
• NoSQL stands for No SQL or Not Only SQL.
• The stores do not integrate with applications using SQL.
• NoSQL is also used in cloud data stores.

• NoSQL databases have the following properties:
▫ They have higher scalability.
▫ They use distributed computing.
▫ They are cost effective.
▫ They support flexible schema.
▫ They can process both unstructured and semi-structured data.
▫ There are no complex relationships, such as the ones between tables in an RDBMS.

Features of NoSQL are as follows (see the sketch after this list):
• It is a class of non-relational data storage systems, with flexible data models and multiple schemas:
▫ uninterpreted key/value or big hash table storage.
▫ unordered keys using JSON (PNUTS).
▫ ordered keys and semi-structured data storage systems [BigTable, Cassandra (used in Facebook/Apache) and HBase].
▫ It does not use JOINs.
▫ Data written at one node can replicate to multiple nodes, therefore data storage is fault-tolerant.
▫ It may relax the ACID rules during data store transactions.
▫ Data are partitioned and follow the CAP theorem (Consistency, Availability, Partition tolerance).
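A minimal sketch of schema-flexible, JSON-like document storage in a NoSQL store; it assumes a MongoDB server running on localhost with the pymongo driver installed, and the database, collection and field names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["bigdata_demo"]["posts"]

# Documents in one collection need not share a schema (flexible schema)
posts.insert_one({"user": "asha", "text": "Hello Big Data", "tags": ["hadoop"]})
posts.insert_one({"user": "ravi", "likes": 42})            # different fields, no JOINs

for doc in posts.find({"user": "asha"}):                   # query by key-value match
    print(doc)

client.close()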
CAP THEOREM
• The Consistency, Availability and Partition tolerance (CAP) theorem is also known as Brewer's theorem.
• It states that a distributed database system, running on a cluster, can only provide two of the following three properties:
▫ Consistency
▫ Availability
▫ Partition tolerance

• Consistency – A read from any node results in the same data across multiple nodes.
• Availability – A read/write request will always be acknowledged in the form of a success or a failure.
• Partition tolerance – The database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests.

• If consistency (C) and availability (A) are required, the available nodes need to communicate to ensure consistency (C). Therefore, partition tolerance (P) is not possible.
• If consistency (C) and partition tolerance (P) are required, nodes cannot remain available (A), as the nodes will become unavailable while achieving a state of consistency (C).
• If availability (A) and partition tolerance (P) are required, then consistency (C) is not possible because of the data communication requirement between the nodes. So the database can remain available (A), but with inconsistent results.
Big Data Platform

• A Big Data platform supports large datasets and volumes of data.
• The data generate at a higher velocity, in more varieties or with higher veracity.
• Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
• A Big Data platform should provision tools and services for:
1. Storage, processing and analytics,
2. Developing, deploying, operating and managing the Big Data environment,
3. Reducing the complexity of multiple data sources and integration of applications into one cohesive solution,
4. Custom development, querying and integration with other systems, and
5. The traditional as well as Big Data techniques.
Data management, storage and analytics of Big Data captured at companies and services require the following:
1. New, innovative and non-traditional methods of storage, processing and analytics.
2. Distributed and huge-volume data stores.
3. Creating scalable as well as elastic virtualized platforms (cloud computing).
5. Massive parallelism.
6. High-speed networks.
7. High-performance processing, optimization and tuning.
8. A data management model based on Not Only SQL (NoSQL).
9. In-memory data column-format transaction processing, or dual in-memory data column as well as row formats for OLAP and OLTP.
10. Data retrieval, mining, reporting, visualization and analytics.
11. Graph databases to enable analytics with social network messages, pages and data analytics.
12. Machine learning or other approaches.
13. Big Data sources: data storages, data warehouses, Oracle Big Data, MongoDB NoSQL, Cassandra NoSQL.
14. Data sources: sensors, audit trails of financial transactions data, external data such as web, social media, weather data and health records data.
Hadoop
• A Big Data platform consists of Big Data storage(s), server(s), and data management and business intelligence software.
• Storage can deploy the Hadoop Distributed File System (HDFS) or NoSQL data stores, such as HBase, MongoDB and Cassandra. HDFS is an open source storage system.
• HDFS is a scaling, self-managing and self-healing file system.
• The Hadoop system packages an application-programming model.
• Hadoop is a scalable and reliable parallel computing platform.
• Hadoop manages Big Data distributed databases.
• (In the accompanying figure, the small cylinders represent MapReduce and the big ones represent Hadoop.)
• There are two main components of Apache Hadoop:
▫ (i) HDFS - used for storage
▫ (ii) MapReduce parallel processing framework - used for processing
• Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System (HDFS).
• It stores large files, from terabytes to petabytes in size.
• HDFS attains reliability by replicating the data over multiple hosts.
• A file in HDFS is split into large blocks of 64 MB by default, and each block of the file is independently replicated at multiple nodes.
• MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed architecture, producing results in a useful aggregated form (see the sketch below).
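A minimal sketch of the MapReduce idea in plain Python (not Hadoop code): a map phase emits (word, 1) pairs and a reduce phase aggregates them into a word count; the documents are hypothetical:

from collections import defaultdict

documents = ["big data needs big storage", "data becomes information"]

# Map phase: each document is processed independently (parallelizable)
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/Reduce phase: group the pairs by key and aggregate the values
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}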
Mesos
• Mesos is a resource management platform which enables sharing of a cluster of nodes by multiple frameworks.
• Apache Mesos is an open source cluster manager.
• It handles workloads in a distributed environment through dynamic resource sharing and isolation.

Big Data Stack
• A stack consists of a set of software components and data store units.
• Applications, machine learning algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud.
• The stack uses a cluster of high-performance machines.
BIG DATA ANALYTICS
• Big Data Analytics has reformed the way business is conducted in many ways, such as improving decision making, business process management, etc.
• Data analytics can be formally defined as the statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities.
• "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."
• The main types/phases of business analytics are:
▫ Descriptive analytics
▫ Predictive analytics
▫ Prescriptive analytics
▫ Cognitive analytics
Descriptive Analytics
• It is the most prevalent form of analytics.
• It answers the question "What happened in the business?"
• It serves as a base for advanced analytics.
• It analyzes a database to provide information on the trends of past or current business events that can help managers, planners, leaders, etc. to develop a roadmap for future actions.
• It performs in-depth analysis of data to reveal details such as frequency of events, underlying reasons for failure, etc.
• It helps in identifying the root cause of problems.
Predictive Analytics
• It is about understanding and predicting the future, and answers the question "What could happen?" using statistical models and different forecasting techniques.
• In predictive analytics we use statistics, data mining techniques and machine learning to analyze the future (see the sketch below).
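A minimal sketch of the predictive idea: fit a simple statistical model to past observations and forecast the next value; numpy is assumed to be installed and the sales figures are hypothetical:

import numpy as np

months = np.array([1, 2, 3, 4, 5, 6])
sales  = np.array([100, 110, 125, 138, 151, 166])     # past observations

slope, intercept = np.polyfit(months, sales, deg=1)   # least-squares trend line
forecast_month_7 = slope * 7 + intercept              # "what could happen" next month
print(round(forecast_month_7, 1))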
Prescriptive Analytics
• Prescriptive analytics answers the question "What should we do?" on the basis of complex data obtained from descriptive and predictive analyses.
• By using optimization techniques, prescriptive analytics determines the best alternative to minimize or maximize objectives in finance, marketing and many other areas.
• If we have to find the best way of shipping goods from a factory to a destination to minimize cost, we can use prescriptive analytics.
• Data can be streamlined for growth and expansion in technology as well as business.
• When data is analyzed, it becomes the answer to "How can the business acquire more customers and gain business insight?"

Cognitive Analytics
• It enables derivation of additional value and helps undertake better decisions.
• Analytics integrates with the enterprise server or data warehouse.
• The figure below shows an overview of a reference model for analytics architecture.
• The RHS of the figure shows the Big Data file system, machine learning algorithms, query languages and usage of the Hadoop ecosystem.
• The captured or stored data require a well-proven strategy to calculate, plan and analyze.
• When Big Data is combined with high-powered data analysis, enterprises achieve valued business-related tasks, for example:
▫ Determine the root cause of defects, faults and failures in minimum time.
▫ Deliver advertisements on mobile or web based on the customer's location and buying habits.
Berkeley Data Analytics Stack (BDAS)
• Big data analytics needs innovative as well as cost-effective techniques.
• BDAS is an open source data analytics stack for complex computations on Big Data.
• It supports efficient large-scale in-memory data processing and thus enables user applications to achieve the three fundamental processing requirements: accuracy, time and cost.

• The Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and resource management layers. The following list describes these:
1. Applications, such as AMP-Genomics and Carat, run at the BDAS. The data processing software component provides in-memory processing, which processes the data efficiently across the frameworks. AMP stands for Berkeley's Algorithms, Machines and People Laboratory.
2. Data processing combines batch, streaming and interactive computations.
3. The resource management software component provides for sharing of the cluster resources across the frameworks.