
BIG DATA ANALYTICS

MODULE I
BIG DATA
• Big data is the term used to describe data of high volume, high velocity and high variety that requires new technologies and techniques to capture, store and analyze it.

• Big data consists of large datasets that cannot be managed efficiently by traditional relational database management systems.

• These datasets range in size from terabytes (2^40 bytes) to exabytes (2^60 bytes).
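For reference, the storage units used above, in the binary interpretation consistent with the 2^40 and 2^60 figures:

\[
1~\text{TB} = 2^{40}~\text{bytes}, \quad
1~\text{PB} = 2^{50}~\text{bytes}, \quad
1~\text{EB} = 2^{60}~\text{bytes}, \quad
1~\text{ZB} = 2^{70}~\text{bytes} \approx 10^{21}~\text{bytes}.
\]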

• Big data can be structured, unstructured or semi-structured, i.e., heterogeneous in nature.

• Mobile phones, credit cards, RFID (Radio Frequency Identification) devices and social networking platforms create huge amounts of data that may reside unutilized on unknown servers for many years.

• With the evolution of big data, this data can be accessed and analyzed to generate useful information.
Big Data Definitions
• Big Data is high-volume, high-velocity and/or high-variety information that requires new forms of processing for enhanced decision making, insight discovery and process optimization.

• "A collection of data sets so large or complex that traditional data processing applications are inadequate." – Wikipedia

• Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.
Classification of Data
• Data can be classified as
▫ Structured
▫ Semi-structured
▫ Unstructured.
▫ Multi-structured
Structured Data
• Structured data conform to and associate with data schemas and data models.

• Structured data are found in tables (rows and columns). Nearly 15-20% of data are in structured or semi-structured form.
Structured data enables the following (see the sketch below):
• Data insert, delete, update and append
• Indexing to enable faster data retrieval
• Scalability, which enables increasing or decreasing capacities and data processing operations such as storing, processing and analytics
• Transaction processing, which follows the ACID rules (Atomicity, Consistency, Isolation and Durability)
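A minimal sketch of these operations using Python's built-in sqlite3 module; the table name and values are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory relational store
cur = conn.cursor()

cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_item ON sales(item)")   # indexing for faster retrieval

try:
    # Atomicity: either both statements take effect or neither does
    cur.execute("INSERT INTO sales (item, amount) VALUES (?, ?)", ("laptop", 55000.0))
    cur.execute("UPDATE sales SET amount = amount * 1.18 WHERE item = ?", ("laptop",))
    conn.commit()        # Durability: committed changes persist
except sqlite3.Error:
    conn.rollback()      # Consistency: undo partial changes on error

print(cur.execute("SELECT item, amount FROM sales").fetchall())
conn.close()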
• Semi-Structured Data
▫ Examples of semi-structured data are XML and JSON documents. Semi-structured data contain tags or other markers, which separate semantic elements and enforce hierarchies of records and fields within the data (see the sketch below).
▫ Semi-structured data do not conform to formal data model structures; they do not associate with data models such as the relational database and table models.
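A minimal sketch of a semi-structured JSON document parsed with Python's standard json module; the field names are hypothetical:

import json

doc = """
{
  "student": {
    "name": "Asha",
    "grades": [
      {"course": "Big Data Analytics", "marks": 88},
      {"course": "DBMS", "marks": 91}
    ]
  }
}
"""

record = json.loads(doc)                  # keys act as markers separating semantic elements
for g in record["student"]["grades"]:     # nesting enforces a hierarchy of records and fields
    print(g["course"], g["marks"])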
Unstructured Data
• Data does not possess data features such as a table or a database.
• Unstructured data are found in file types such as .txt and .csv.
• Data may be in the form of key-value pairs, such as hash key-value pairs.
• Data may have internal structures, such as in e-mails.
• The data do not reveal relationships or hierarchies.
• The relationships, schema and features need to be established separately.

Examples of Unstructured Data
• Mobile data: text messages, chat messages, tweets, blogs and comments
• Website content data: YouTube videos, browsing data, e-payments, web store data, user-generated maps
• Social media data: data exchanged in various forms
• Texts and documents
• Personal documents and e-mails
• Text internal to an organization: text within documents, logs, survey results
• Satellite images, atmospheric data, surveillance and traffic videos, images from Instagram and Flickr (upload, access, organize, edit and share photos from any device, from anywhere in the world)
Multi-Structured Data
• Multi-structured data refers to data consisting of multiple formats of data, viz. structured, semi-structured and/or unstructured data.
• Multi-structured data sets can have many formats.
• They are found in non-transactional systems.
• For example, streaming data on customer interactions, data from multiple sensors, data at web or enterprise servers, or data-warehouse data in multiple formats.

Big Data Characteristics
• Volume: relates to the size of the data.
• Velocity: refers to the speed at which data is generated.
• Variety: comprises the different forms and types of data.
• Veracity: the quality of the data captured, which can vary greatly, affecting accurate analysis.
Volume
• Volume is the data generated by organizations or individuals.
• Today the volume of data in most organizations is approaching exabytes.
• According to IBM, over 2.7 zettabytes (1 ZB = 10^21 bytes) of data are present in the digital universe today.
• Every minute, 571 new websites are being created.

Velocity
• Velocity is the rate at which data is generated, captured and shared.
• The sources of high-velocity data include the following:
▫ IT devices, including routers, switches, firewalls, etc., constantly generate valuable data.
▫ Portable devices, including mobiles, PDAs, etc., also generate data at high speed.

Variety
• Data is being generated at a very fast pace.
• Data is now generated from different types of sources, such as internal, external, social and behavioural, and comes in different formats such as images, text, videos, etc.

Veracity
• Veracity refers to the uncertainty of data, i.e., whether the obtained data is correct or not.
• Out of the huge amount of data generated in almost every process, only the data that is correct and consistent can be used for further analysis.
• Data, when processed, becomes information.
• Big data is messy in nature, so it takes a good amount of time and expertise to clean that data and make it suitable for analysis.
Big Data Types

• Social networks and web data, such as Facebook, Twitter, e-mails, blogs and YouTube.
• Transactions data and Business Processes (BPs) data, such as credit card transactions, flight bookings, etc., and public agencies data, such as medical records, insurance business data, etc.
• Customer master data, such as data for facial recognition and for the name, date of birth, marriage anniversary, gender, location and income category.
• Machine-generated data, such as machine-to-machine or Internet of Things data, and the data from sensors, trackers and web logs. Computer-generated data is also considered machine-generated data.
• Human-generated data, such as biometrics data, human-machine interaction data, e-mail records with a mail server and a MySQL database of student grades.
• Humans also record their experiences in ways such as writing in notebooks or diaries, taking photographs, or making audio and video clips.

Big Data Handling Techniques
• Following are the techniques deployed for Big Data storage, applications, data management, mining and analytics:
▫ Huge data volume storage, data distribution, high-speed networks and high-performance computing.
▫ Application scheduling using open source, reliable, scalable, distributed file systems, distributed databases, and parallel and distributed computing systems, such as Hadoop or Spark.
▫ Open source tools which are scalable and elastic, and provide a virtualized environment, clusters of data nodes, and task and thread management.
▫ Data management using NoSQL, document databases, column-oriented databases, graph databases and other forms of databases (including in-memory) as per the needs of the applications.
▫ Data mining and analytics, data retrieval, data reporting, data visualization and machine learning Big Data tools.
Scalability and Parallel Processing
• Big Data needs processing of large data volumes, and therefore needs intensive computations.
• Processing complex applications with large datasets (terabyte to petabyte datasets) needs hundreds of computing nodes.
• Processing this much distributed data within a short time and at minimum cost is problematic.
• Scalability is the capability of a system to handle the workload as per the magnitude of the work.
• System capability needs to increase with increased workloads.
• When the workload and complexity exceed the system capacity, scale it up and scale it out.
• Scalability enables increase or decrease in the capacity of data storage, processing and analytics.

Analytical Scalability
• Vertical scalability means scaling up the given system's resources and increasing the system's analytics, reporting and visualization capabilities.
• This is an additional way to solve problems of greater complexity. Scaling up means designing the algorithm according to the architecture that uses resources efficiently.
• If x terabytes of data take time t for processing and the code size or complexity increases by a factor n, then scaling up means that processing takes a time equal to, less than or much less than (n * t).
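This condition can be written as follows (notation assumed here: t is the original processing time and n the complexity growth factor):

\[
T_{\text{scaled-up}} \;\le\; n \cdot t
\]

That is, effective scaling up keeps the processing time from growing faster than linearly in the added complexity.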
• Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload.
• Processing the different datasets of a large dataset in parallel deploys horizontal scalability.
• Scaling out means using more resources and distributing the processing and storage tasks in parallel.
• The easiest way to scale up and scale out execution of analytics software is to implement it on a bigger machine with more CPUs for greater volume, velocity, variety and complexity of data.
• The software will definitely perform better on a bigger machine.

Massive Parallel Processing Platforms
• Many programs are so large and complex that it is impractical to execute them on one computer.
• So scale up the computer system or use massive parallel processing (MPP).
• Parallelization of tasks can be done at several levels (see the sketch after this list):
▫ Distributing separate tasks onto separate threads on the same CPU.
▫ Distributing separate tasks onto separate CPUs on the same computer.
▫ Distributing separate tasks onto separate computers.
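A minimal sketch of the first two levels using Python's standard library: separate threads, and separate processes that the operating system can place on separate CPUs (the data and chunk size are hypothetical):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def partial_sum(chunk):
    # one independent task: sum a chunk of the data
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:       # tasks on separate threads
        thread_total = sum(pool.map(partial_sum, chunks))

    with ProcessPoolExecutor(max_workers=4) as pool:      # tasks on separate CPUs/processes
        process_total = sum(pool.map(partial_sum, chunks))

    print(thread_total, process_total)                    # both equal sum(range(1_000_000))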
(i) Distributed Computing Model
▫ A distributed computing model uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high-speed networks.
▫ Big Data processing uses a parallel, scalable and no-sharing (shared-nothing) program model, such as MapReduce, for computations on it.

(ii) Cloud Computing
• "Cloud computing is a type of Internet-based computing that provides shared processing resources and data to the computers and other devices on demand."
• One of the best approaches for data processing is to perform parallel and distributed computing in a cloud-computing environment.
• Cloud resources can be Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft Azure or Apache CloudStack.
• Cloud computing features are: (i) on-demand service, (ii) resource pooling, (iii) scalability, (iv) accountability and (v) broad network access.
• Cloud services can be accessed from anywhere and at any time through the Internet.
• Cloud computing allows availability of computer infrastructure and services on a demand basis.
Cloud Services
• There are three types of cloud services:
▫ Infrastructure as a Service (IaaS)
▫ Platform as a Service (PaaS)
▫ Software as a Service (SaaS)

Infrastructure as a Service (IaaS)
• Providing access to resources such as hard disks, network connections, database storage, data centers and virtual server spaces is Infrastructure as a Service (IaaS).
• Some examples are Tata Communications, and Amazon data centers and virtual servers.
• Apache CloudStack is open source software for deploying and managing a large network of virtual machines, and offers public cloud services which provide highly scalable Infrastructure as a Service (IaaS).

Platform as a Service (PaaS)
• It implies providing the runtime environment to allow developers to build applications and services, which is what cloud Platform as a Service means.
• Software at the cloud supports and manages the services, storage, networking, deploying, testing, collaborating, hosting and maintaining of applications.
• Examples are Hadoop cloud services (IBM BigInsights, Microsoft Azure HDInsight, Oracle Big Data Cloud Service).

Software as a Service (SaaS)
• Providing software applications as a service to end-users is known as Software as a Service.
• Software applications are hosted by a service provider and made available to customers over the Internet.
• Some examples are Google SQL, IBM Big SQL, Microsoft PolyBase and Oracle Big Data SQL.
(iii) Grid Computing
• Grid computing refers to distributed computing, in which a group of computers from several locations are connected with each other to achieve a common task.
• The computer resources are heterogeneous and geographically dispersed.
• A group of computers that might be spread over remote locations comprises a grid.
• A single grid, of course, is dedicated at an instance to a particular application only.

Features
• Grid computing, similar to cloud computing, is scalable.
• Cloud computing depends on sharing of resources (for example, networks, servers, storage, applications and services) to attain coordination and coherence among resources, similar to grid computing.
• Similarly, a grid also forms a distributed network for resource integration.
(iv) Cluster Computing
• A cluster is a group of homogeneous computers
connected by a network.
• The group works together to accomplish the same
task. Clusters are used mainly for load balancing. They
shift processes between nodes to keep an even load on
the group of connected computers.
(v) Volunteer Computing
• Volunteers are organizations or members who own personal computers.
• Volunteer computing is a distributed computing paradigm which uses the computing resources of the volunteers.
• Examples of projects are science-related projects executed by universities or academia in general.
• Some issues with volunteer computing systems are:
▫ Heterogeneity of the volunteered computers
▫ Drop-outs from the network over time
Designing the Data Architecture
• Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.
• The architecture logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security, and more.

Data processing architecture consists of five layers:
(i) Identification of data sources
(ii) Acquisition, ingestion, extraction, pre-processing and transformation of data
(iii) Data storage at files, servers, cluster or cloud
(iv) Data processing
(v) Data consumption in a number of programs and applications
Big Data Architecture
• Logical layer 1 (L1) is for identifying data sources, which are external, internal or both.
• Layer 2 (L2) is for data ingestion.
• Data ingestion means a process of absorbing information, just like the process of absorbing nutrients and medications into the body by eating or drinking them.
• Ingestion is the process of obtaining and importing data for immediate use or transfer. Ingestion may be in batches or in real time, using pre-processing or semantics.
Layer 1
L1 considers the following aspects in a design:
▫ Amount of data needed at ingestion layer 2 (L2)
▫ Push from L1 or pull by L2 as per the mechanism for the usages
▫ Source data types: database, files, web or service
▫ Source formats, i.e., semi-structured, unstructured or structured

Layer 2
L2 considers the following aspects:
• Ingestion and ETL processes either in real time, which means store and use the data as generated, or in batches.
• Batch processing means using discrete datasets at scheduled or periodic intervals of time.

Layer 3
L3 considers the following aspects:
• Data storage type (historical or incremental), format, compression, incoming data frequency, querying patterns and consumption requirements for L4 or L5.
• Data storage using the Hadoop Distributed File System or NoSQL data stores, such as HBase, Cassandra and MongoDB.

Layer 4
L4 considers the following aspects:
• Data processing software such as MapReduce, Hive, Pig, Spark, Spark Mahout and Spark Streaming.
• Processing in scheduled batches, in real time or hybrid.
• Processing as per synchronous or asynchronous processing requirements at L5.

Layer 5
L5 considers the following aspects:
• Data integration.
• Dataset usages for reporting and visualization.
• Analytics (real time, near real time, scheduled batches), BPs, BIs, knowledge discovery.
• Export of datasets to cloud, web or other systems.


Managing Data for Analysis
• Data management means enabling, controlling, protecting, delivering and enhancing the value of data and information assets.
• Reports, analyses and visualizations need well-defined data.
• Data management functions include:
1. Data assets creation, maintenance and protection.
2. Data governance, which includes establishing the processes for ensuring the availability, usability, integrity, security and high quality of data.
3. Data architecture creation, modelling and analysis.
4. Database maintenance, administration and management systems, for example, RDBMS (relational database management system) and NoSQL.
5. Managing data security, data access control, deletion, privacy and security.
6. Managing the data quality.
7. Data collection using the ETL process.
9. Creation of reference and master data, and data control and supervision.
10. Data and application integration.
11. Integrated data management, enterprise-ready data creation, fast access and analysis, and automation and simplification of operations on the data.
12. Data warehouse management.
13. Maintenance of business intelligence.
Data Source, Quality, Pre-processing and Storing

Data Source
• Application programs and tools use data.
• Sources can be external, such as sensors, trackers, web logs, etc.
• Data can also be internal, such as databases, flat files, spreadsheets, CSV files, web servers, etc.
• Data can be structured, semi-structured, multi-structured or unstructured.
Structured Data Sources
• A data source for ingestion, storage and processing can be a file, database or streaming data.
• The source may be on the same computer running a program or on a networked computer.
• Structured data sources include SQL Server, MySQL, Microsoft Access database, Oracle DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server.
• A data source name implies a defined name, which a process uses to identify the source.
• The name needs to be meaningful.
• A data dictionary enables references for access to data.
• The dictionary consists of a set of master lookup tables.
• The dictionary is stored at a central location for easier access and administration of changes in resources.
• Microsoft applications consider two types of sources for processing:
▫ (i) Machine sources and (ii) File sources.
• (i) Machine sources: data are present on computing nodes such as servers. A machine identifies a source by the user-defined name, driver manager name and source driver name.
• (ii) File sources are stored files. An application accessing data first connects to the driver manager of the source. A user or application connects to the manager when required.
Unstructured Data Sources
• Unstructured data sources are distributed over high-speed networks.
• The data need high-velocity processing. Sources are from distributed file systems.
• The sources are of file types such as .txt (text file) and .csv (comma-separated values file).
• Data may be as key-value pairs, such as hash key-value pairs.

Data Sources - Sensors, Signals and GPS
• The data sources can be sensors, sensor networks, signals from machines, devices, controllers and intelligent edge nodes of different types in industry M2M communication, and the GPS systems.
• Sensors are electronic devices that sense the physical environment.
• They are used for measuring temperature, pressure, humidity, traffic or objects in proximity.
• Data from RFID is used to track parcels.
Data Quality
• High quality means data which enables all the required operations, analysis, decisions, planning and knowledge discovery correctly.
• High-quality data can be defined with the five R's as follows:
▫ Relevancy
▫ Recency
▫ Range
▫ Robustness
▫ Reliability

Data Integrity
• Data integrity refers to the maintenance of consistency and accuracy in data over its usable life.
• Software which stores, processes or retrieves the data should maintain the integrity of the data.
• Data should be incorruptible.
Factors Affecting Data Quality
• Data Noise
• Outlier
• Missing Value
• Duplicate value
Data Noise
• One of the factors affecting data quality is noise.
• Noise in data refers to data giving additional meaningless information besides the true (actual/required) information.
• Noise is random in character, which means the frequency with which it occurs varies over time.

Outlier
• An outlier in data refers to data which appears not to belong to the dataset, for example, data that is outside an expected range.
• Actual outliers need to be removed from the dataset, else the result will be affected by a small or large amount.

Missing Value, Duplicate Value
• Another factor affecting data quality is missing values.
• A missing value implies data not appearing in the dataset.
• Another factor affecting data quality is duplicate values.
• A duplicate value implies the same data appearing two or more times in a dataset.
Data Preprocessing
• Data pre-processing is an important step at the ingestion layer.
• Pre-processing is a must before data mining and analytics.
• Pre-processing is also a must before running a Machine Learning (ML) algorithm.
• Pre-processing needs are (see the sketch after this list):
▫ Dropping out-of-range, inconsistent and outlier values
▫ Filtering unreliable, irrelevant and redundant information
▫ Data cleaning, editing, reduction and/or wrangling
▫ Data validation, transformation or transcoding
▫ ELT processing
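A minimal sketch of these pre-processing steps with pandas (assumed to be installed); the column names, value ranges and sentinel code are hypothetical:

import pandas as pd

raw = pd.DataFrame({
    "sensor_id":   [1, 1, 2, 2, 3],
    "temperature": [21.5, 21.5, -999.0, 23.1, None],   # -999.0 is an out-of-range code
})

clean = (
    raw.drop_duplicates()                                        # remove duplicate values
       .dropna(subset=["temperature"])                           # drop missing values
       .query("temperature >= -40 and temperature <= 60")        # drop out-of-range/outlier values
       .astype({"temperature": "float64"})                       # simple transformation/transcoding
)
print(clean)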
Data Enrichment
• "Data enrichment refers to operations or processes which refine, enhance or improve the raw data."

Data Editing
• Data editing refers to the process of reviewing and adjusting the acquired datasets.
• Editing controls the data quality.
• Editing methods are (i) interactive, (ii) selective, (iii) automatic, (iv) aggregating and (v) distribution.

Data Reduction
• Data reduction enables the transformation of acquired information into an ordered, correct and simplified form.

Data Wrangling
• Data wrangling refers to the process of transforming and mapping the data.
• Results from analytics are then appropriate and valuable.
• Mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
Data Formats Used During Pre-processing

Different formats for data transfer (see the sketch after this list):
• Comma-Separated Values (CSV).
• JavaScript Object Notation (JSON), as batches of object arrays or resource arrays.
• Tag-Length-Value (TLV).
• Key-value pairs.
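A minimal sketch of three of these transfer formats using Python's standard library; the items and amounts are hypothetical:

import csv, io, json

csv_text = "item,amount\nlaptop,55000\nphone,18000\n"          # CSV
rows = list(csv.DictReader(io.StringIO(csv_text)))

as_json = json.dumps(rows)                                      # JSON batch of objects
as_key_value = {r["item"]: int(r["amount"]) for r in rows}      # key-value pairs

print(as_json)
print(as_key_value)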
Data Export to Cloud
• The figure shows the resulting data pre-processing, data mining, analysis, visualization and data store.
• The data exports to cloud services.
• The results integrate at the enterprise server or data warehouse.

Figure 1.3 Data pre-processing, analysis, visualization, data store export

Cloud Services
• The cloud offers various services. These services can be accessed through a cloud client (client application), such as a web browser, SQL or another client.
• Figure 1.4 shows data-store export from machines, files, computers, web servers and web services.
• The data exports to clouds, such as IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata Communications or Hadoop cloud services.

Export of Data to AWS and Rackspace Clouds
• Google Cloud Platform provides a cloud service called BigQuery, as Figure 1.5 shows.
• BigQuery is a cloud service at the Google Cloud Platform.
• The data exports from a table or partition schema, or from JSON, CSV or Avro files from data sources after the pre-processing.
Data Storage and Analysis
Data Storage and Management: Traditional Systems
• Traditional systems use structured or semi-structured data.
• The sources of structured data stores are:
▫ Traditional relational database management system (RDBMS) data, such as MySQL and DB2, enterprise servers and data warehouses.
• The sources of semi-structured data are:
▫ XML and JSON semi-structured documents.
▫ CSV files.
SQL
• An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing (update, insert, append or delete) databases. SQL provides the following (see the sketch after this list):
1. Create schema: a schema is a structure which contains descriptions of objects (base tables, views, constraints) created by a user. The user can describe the data and define the data in the database.
2. Create catalog, which consists of a set of schemas which describe the database.
3. Data Definition Language (DDL), for the commands which describe a database, including creating, altering and dropping tables and establishing constraints. A user can create and drop databases and tables, establish foreign keys, and create views and stored procedures.
4. Data Manipulation Language (DML), for commands that maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
5. Data Control Language (DCL), for commands that control a database, including administering privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.
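A minimal sketch distinguishing DDL and DML statements, run with Python's built-in sqlite3; the table and values are hypothetical, and DCL statements such as GRANT/REVOKE need a server RDBMS, so they are shown only as a comment:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of the database
cur.execute("CREATE TABLE students (usn TEXT PRIMARY KEY, name TEXT, marks INTEGER)")
cur.execute("ALTER TABLE students ADD COLUMN grade TEXT")

# DML: maintain and query the data
cur.execute("INSERT INTO students (usn, name, marks) VALUES ('1XX20CS001', 'Asha', 88)")
cur.execute("UPDATE students SET grade = 'A' WHERE marks >= 85")
print(cur.execute("SELECT usn, name, grade FROM students").fetchall())

# DCL (illustrative only, e.g. on MySQL or Oracle):
# GRANT SELECT ON students TO analyst_role;

conn.commit()
conn.close()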
Distributed Database Management System
• A distributed DBMS (DDBMS) is a collection of logically interrelated databases at multiple systems over a computer network.
• It is a collection of logically related databases.
• There is cooperation between the databases in a transparent manner.
• It is 'location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.

In-Memory Column-Format Data
• A columnar in-memory format allows faster data retrieval when only a few columns in a table need to be selected during query processing or aggregation.
• Online Analytical Processing (OLAP) in real-time transaction processing is fast when using in-memory column-format tables.
• The CPU accesses all columns in a single instance of access to the memory in columnar-format in-memory data storage.
• OLAP enables online viewing of analyzed data and visualization up to the desired granularity.
• It helps to obtain summarized information and automated results for a large database.
In-Memory Row Format Databases

• A row in-memory format allows much faster data processing during OLTP (online transaction processing).
• Each row record has corresponding values in multiple columns, and the online values store at consecutive memory addresses in row format.
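A minimal sketch contrasting the two layouts with plain Python structures; the records are hypothetical:

row_format = [                                   # one record per row (suits OLTP)
    {"id": 1, "item": "laptop", "amount": 55000},
    {"id": 2, "item": "phone",  "amount": 18000},
]

column_format = {                                # one array per column (suits OLAP)
    "id":     [1, 2],
    "item":   ["laptop", "phone"],
    "amount": [55000, 18000],
}

# Aggregating one column touches a single contiguous list in the columnar layout ...
print(sum(column_format["amount"]))
# ... whereas the row layout must visit every record to reach that field.
print(sum(r["amount"] for r in row_format))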
Enterprise Data-Store Server and Data Warehouse

• Enterprise data servers use data from several distributed sources which store data using various technologies.
• All data merge using an integration tool.
• Integration enables collective viewing of the datasets at the data warehouse.
• Enterprise data integration may also include integration with application(s), such as analytics, visualization, reporting, business intelligence and knowledge discovery.

• Following are some standardised business processes, as defined in the Oracle application-integration architecture:
▫ Integrating and enhancing the existing systems and processes
▫ Business intelligence
▫ Data security and integrity
▫ New business services/products (web services)
▫ Collaboration/knowledge management
▫ Enterprise architecture/SOA
▫ e-commerce
▫ External customer services
▫ Supply chain automation/visualization

• Steps 1 to 5 show enterprise data integration and management with Big Data for high-performance computing, using local and cloud resources for the analytics, applications and services.
Big Data Storage
NoSQL
• NoSQL data are considered as semi-structured data.
• Big Data stores use NoSQL.
• NoSQL stands for No SQL or Not Only SQL.
• The stores do not integrate with applications using SQL.
• NoSQL is also used in cloud data stores.

• NoSQL databases have the following properties:
▫ They have higher scalability.
▫ They use distributed computing.
▫ They are cost effective.
▫ They support flexible schema.
▫ They can process both unstructured and semi-structured data.
▫ There are no complex relationships, such as the ones between tables in an RDBMS.

Features of NoSQL are as follows (see the sketch after this list):
• It is a class of non-relational data storage systems, with flexible data models and multiple schemas:
▫ uninterpreted key/value or big hash table storage.
▫ unordered keys using JSON (PNUTS).
▫ ordered keys and semi-structured data storage systems [BigTable, Cassandra (used in Facebook/Apache) and HBase].
▫ It does not use JOINs.
▫ Data written at one node can replicate to multiple nodes, therefore data storage is fault-tolerant.
▫ It may relax the ACID rules during data store transactions.
▫ Data are partitioned and follow the CAP theorem (Consistency, Availability, Partition tolerance).
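A minimal sketch of schema-flexible, JSON-like document storage in a NoSQL store; it assumes a MongoDB server running on localhost with the pymongo driver installed, and the database, collection and field names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["bigdata_demo"]["posts"]

# Documents in one collection need not share a schema (flexible schema)
posts.insert_one({"user": "asha", "text": "Hello Big Data", "tags": ["hadoop"]})
posts.insert_one({"user": "ravi", "likes": 42})            # different fields, no JOINs

for doc in posts.find({"user": "asha"}):                   # query by key-value match
    print(doc)

client.close()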
CAP THEOREM
• The Consistency, Availability and Partition tolerance (CAP) theorem is also known as Brewer's theorem.
• It states that a distributed database system, running on a cluster, can only provide two of the following three properties:
▫ Consistency
▫ Availability
▫ Partition tolerance

• Consistency – A read from any node results in the same data across multiple nodes.
• Availability – A read/write request will always be acknowledged in the form of a success or a failure.
• Partition tolerance – The database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests.

• If consistency (C) and availability (A) are required, the available nodes need to communicate to ensure consistency (C). Therefore, partition tolerance (P) is not possible.
• If consistency (C) and partition tolerance (P) are required, nodes cannot remain available (A), as the nodes will become unavailable while achieving a state of consistency (C).
• If availability (A) and partition tolerance (P) are required, then consistency (C) is not possible because of the data communication requirement between the nodes. So the database can remain available (A), but with inconsistent results.
Big Data Platform

• A Big Data platform supports large datasets and volumes of data.
• The data generate at a higher velocity, in more varieties or with higher veracity.
• Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools.
• A Big Data platform should provision tools and services for:
1. Storage, processing and analytics,
2. Developing, deploying, operating and managing the Big Data environment,
3. Reducing the complexity of multiple data sources and integration of applications into one cohesive solution,
4. Custom development, querying and integration with other systems, and
5. The traditional as well as Big Data techniques.
Data management, storage and analytics of Big Data captured at companies and services require the following:
1. New, innovative and non-traditional methods of storage, processing and analytics.
2. Distributed and huge-volume data stores.
3. Creating scalable as well as elastic virtualized platforms (cloud computing).
5. Massive parallelism.
6. High-speed networks.
7. High-performance processing, optimization and tuning.
8. A data management model based on Not Only SQL (NoSQL).
9. In-memory data column-format transaction processing, or dual in-memory data column as well as row formats for OLAP and OLTP.
10. Data retrieval, mining, reporting, visualization and analytics.
11. Graph databases to enable analytics with social network messages, pages and data analytics.
12. Machine learning or other approaches.
13. Big Data sources: data storages, data warehouses, Oracle Big Data, MongoDB NoSQL, Cassandra NoSQL.
14. Data sources: sensors, audit trails of financial transactions data, external data such as web, social media, weather data and health records data.
Hadoop
• A Big Data platform consists of Big Data storage(s), server(s), and data management and business intelligence software.
• Storage can deploy the Hadoop Distributed File System (HDFS) or NoSQL data stores, such as HBase, MongoDB and Cassandra. HDFS is an open source storage system.
• HDFS is a scaling, self-managing and self-healing file system.
• The Hadoop system packages an application-programming model.
• Hadoop is a scalable and reliable parallel computing platform.
• Hadoop manages Big Data distributed databases.
• (In the accompanying figure, the small cylinders represent MapReduce and the big ones represent Hadoop.)
• There are two main components of Apache Hadoop:
▫ (i) HDFS - used for storage
▫ (ii) MapReduce parallel processing framework - used for processing
• Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System (HDFS).
• It stores large files, from terabytes to petabytes in size.
• HDFS attains reliability by replicating the data over multiple hosts.
• A file in HDFS is split into large blocks of 64 MB by default, and each block of the file is independently replicated at multiple nodes.
• MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed architecture, producing results in a useful aggregated form (see the sketch below).
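A minimal sketch of the MapReduce idea in plain Python (not Hadoop code): a map phase emits (word, 1) pairs and a reduce phase aggregates them into a word count; the documents are hypothetical:

from collections import defaultdict

documents = ["big data needs big storage", "data becomes information"]

# Map phase: each document is processed independently (parallelizable)
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/Reduce phase: group the pairs by key and aggregate the values
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}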
Mesos
• Mesos is a resource management platform which enables sharing of a cluster of nodes by multiple frameworks.
• Apache Mesos is an open source cluster manager.
• It handles workloads in a distributed environment through dynamic resource sharing and isolation.

Big Data Stack
• A stack consists of a set of software components and data store units.
• Applications, machine learning algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud.
• The stack uses a cluster of high-performance machines.
BIG DATA ANALYTICS
• Big Data Analytics has reformed the way business is conducted in many ways, such as improving decision making, business process management, etc.
• Data analytics can be formally defined as the statistical and mathematical data analysis that clusters, segments, ranks and predicts future possibilities.
• "Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making."
• The main types/phases of business analytics are:
▫ Descriptive analytics
▫ Predictive analytics
▫ Prescriptive analytics
▫ Cognitive analytics
Descriptive Analytics
• It is the most prevalent form of analytics.
• It answers the question "What happened in the business?"
• It serves as a base for advanced analytics.
• It analyzes a database to provide information on the trends of past or current business events that can help managers, planners, leaders, etc. to develop a roadmap for future actions.
• It performs in-depth analysis of data to reveal details such as frequency of events, underlying reasons for failure, etc.
• It helps in identifying the root cause of problems.
Predictive Analytics
• It is about understanding and predicting the future, and answers the question "What could happen?" using statistical models and different forecasting techniques.
• In predictive analytics we use statistics, data mining techniques and machine learning to analyze the future (see the sketch below).
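A minimal sketch of the predictive idea: fit a simple statistical model to past observations and forecast the next value; numpy is assumed to be installed and the sales figures are hypothetical:

import numpy as np

months = np.array([1, 2, 3, 4, 5, 6])
sales  = np.array([100, 110, 125, 138, 151, 166])     # past observations

slope, intercept = np.polyfit(months, sales, deg=1)   # least-squares trend line
forecast_month_7 = slope * 7 + intercept              # "what could happen" next month
print(round(forecast_month_7, 1))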
Prescriptive Analytics
• Prescriptive analytics answers the question "What should we do?" on the basis of complex data obtained from descriptive and predictive analyses.
• By using optimization techniques, prescriptive analytics determines the best alternative to minimize or maximize objectives in finance, marketing and many other areas.
• If we have to find the best way of shipping goods from a factory to a destination to minimize cost, we can use prescriptive analytics.
• Data can be streamlined for growth and expansion in technology as well as business.
• When data is analyzed, it becomes the answer to "How can the business acquire more customers and gain business insight?"

Cognitive Analytics
• It enables derivation of additional value and helps undertake better decisions.
• Analytics integrates with the enterprise server or data warehouse.
• The figure below shows an overview of a reference model for analytics architecture.
• The RHS of the figure shows the Big Data file system, machine learning algorithms, query languages and usage of the Hadoop ecosystem.
• The captured or stored data require a well-proven strategy to calculate, plan and analyze.
• When Big Data is combined with high-powered data analysis, enterprises achieve valued business-related tasks, for example:
▫ Determine the root cause of defects, faults and failures in minimum time.
▫ Deliver advertisements on mobile or web based on the customer's location and buying habits.
Berkeley Data Analytics Stack (BDAS)
• Big data analytics needs innovative as well as cost-effective techniques.
• BDAS is an open source data analytics stack for complex computations on Big Data.
• It supports efficient large-scale in-memory data processing and thus enables user applications to achieve the three fundamental processing requirements: accuracy, time and cost.

• The Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and resource management layers. The following list describes these:
1. Applications, such as AMP-Genomics and Carat, run at the BDAS. The data processing software component provides in-memory processing, which processes the data efficiently across the frameworks. AMP stands for Berkeley's Algorithms, Machines and People Laboratory.
2. Data processing combines batch, streaming and interactive computations.
3. The resource management software component provides for sharing of the cluster resources across the frameworks.