BDA Notes

Big data is characterized by high volume, velocity, and variety. The big data architecture consists of five layers: data sources, ingestion/preprocessing, storage, processing, and consumption. Scalability enables increasing or decreasing storage and processing capacity. Data preprocessing is needed to clean, transform, and integrate data before analysis. Hadoop is an open-source framework for distributed storage and processing of big data; its core components are HDFS for storage and MapReduce for processing.

Big Data Analytics

Module 1
1. Define Big Data. Explain the classification of big data (page no. 55 & 69 to 70)

Big Data is a high-volume, high-velocity, and high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery, and process optimization.

2. Explain the functions of each layer in big data architecture design with a diagram. (92 to 100)

Big Data architecture: the logical or physical layout/structure of how Big Data will be stored, accessed, and managed within an IT environment.

The data processing architecture consists of five layers:

(i) identification of data sources,
(ii) acquisition, ingestion, extraction, pre-processing and transformation of data,
(iii) data storage in files, servers, clusters or cloud,
(iv) data processing, and
(v) data consumption by a number of programs and tools, such as business intelligence, data mining, pattern/cluster discovery, artificial intelligence (AI), machine learning (ML), text analytics, descriptive and predictive analytics, and data visualization.
Layer 1:
• L1 considers the following aspects in a design:
◦ Amount of data needed at ingestion layer 2 (L2)
◦ Push from L1 or pull by L2, depending on the usage mechanism
◦ Source datatypes: database, files, web, or service
◦ Source formats: semi-structured, unstructured, or structured.

Layer 2:
• Ingestion and ETL processes run either in real time (store and use the data as it is generated) or in batches.
• Batch processing uses discrete datasets at scheduled or periodic intervals of time.

Layer 3:
• Data storage type (historical or incremental), format, compression, incoming data frequency, querying patterns and consumption requirements for L4 or L5
• Data storage using the Hadoop distributed file system (HDFS) or NoSQL data stores: HBase, Cassandra, MongoDB.

Layer 4:
• Data processing software, such as MapReduce, Hive, Pig, Spark, Mahout, Spark Streaming
• Processing in scheduled batches, in real time, or hybrid
• Processing as per synchronous or asynchronous processing requirements at L5.

Layer 5:
• Data integration
• Dataset usage for reporting and visualization
• Analytics (real time, near real time, scheduled batches), business processes (BPs), business intelligence (BI), knowledge discovery
• Export of datasets to cloud, web or other systems
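To make the five layers concrete, here is a minimal batch pipeline traced through L2 to L5 with standard Hadoop shell commands. This is a sketch, not part of the notes: the file, paths and jar name (access.log, /data/raw, wordcount.jar) are hypothetical placeholders, and the WordCount job is the one discussed in Module 2, question 5.

    # L2 (ingestion, batch): copy a local log file into the cluster
    hdfs dfs -mkdir -p /data/raw
    hdfs dfs -put access.log /data/raw/

    # L3 (storage) is implicit: the data now resides in HDFS.

    # L4 (processing): run a MapReduce job over the stored data
    hadoop jar wordcount.jar WordCount /data/raw /data/out

    # L5 (consumption): retrieve results for reporting and visualization
    hdfs dfs -get /data/out/part-r-00000 results.txt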

3. Define scalability and its types with an example (72 to 79) OR write a short note on analytics scalability to big data and massively parallel processing platforms.

Scalability
• Scalability enables an increase or decrease in the capacity of data storage, processing and analytics.
• Scalability is the capability of a system to handle the workload as per the magnitude of the work.
• System capability needs to grow as workloads increase.
• Scaling is of two types: vertical scaling (scale up: add CPU, memory or disk to a single node) and horizontal scaling (scale out: add more nodes, the approach taken by massively parallel processing platforms); a sketch follows.
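A toy Java sketch (not from the notes) of the scale-out idea: the job is split into partitions, and capacity grows by raising the worker count rather than by upgrading one machine. Real massively parallel processing platforms distribute partitions across cluster nodes; threads here only mirror the principle.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ScaleOutSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical data partitions; in an MPP system these live on different nodes.
            List<String> partitions = List.of("part-0", "part-1", "part-2", "part-3");
            int workers = 4; // increasing this is the "scale-out" knob
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (String p : partitions) {
                // Each worker processes one partition independently of the others.
                pool.submit(() -> System.out.println("processing " + p));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }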
4. Define data preprocessing. Explain in brief the need for preprocessing (116 to 120)
5. Discuss the evolution of big data and the characteristics of big data (28 to 30 & 56 to 57)
Module 2
1. With a neat diagram, explain Hadoop main components and ecosystem components (28 to 29 & 35 to 38)
OR
What are the core components of Hadoop? Explain each of its components in brief.

Hadoop Ecosystem Components

(Diagram placeholder: Hadoop ecosystem, with HDFS and MapReduce at the core and ecosystem tools such as Hive, Pig, Spark, Sqoop and Flume layered above.)
2. Outline the features of Hadoop HDFS. Also explain the functions of the NameNode and DataNode (44 & 49 to 52)
Hadoop HDFS features are as follows:
(i) Create, append, delete, rename, and attribute modification functions.
(ii) The content of an individual file cannot be modified or replaced in place, only appended with new data at the end of the file.
(iii) Write once but read many times during usage and processing.
(iv) Average file size can be more than 500 MB.
The NameNode maintains the file-system namespace and block metadata; DataNodes store the actual blocks and serve read/write requests, reporting to the NameNode through heartbeats and block reports.
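A minimal Java sketch of features (i) to (iii) against the standard org.apache.hadoop.fs.FileSystem API. The path and record contents are hypothetical, and the sketch assumes the cluster is configured to permit appends.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFeaturesSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/data/notes/events.log"); // hypothetical path

            // (iii) write once: create the file and write the initial content
            FSDataOutputStream out = fs.create(p);
            out.writeBytes("first record\n");
            out.close();

            // (ii) existing bytes cannot be modified in place, but appending is allowed
            FSDataOutputStream app = fs.append(p);
            app.writeBytes("appended record\n");
            app.close();

            // (i) rename and delete are also supported
            Path renamed = new Path("/data/notes/events-renamed.log");
            fs.rename(p, renamed);
            fs.delete(renamed, false); // false = non-recursive
            fs.close();
        }
    }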
3. Explain the Apache Sqoop import and export methods with a neat diagram (81 + SVIT_notes 29 to 30)
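For reference, a sketch of the two Sqoop methods from the command line; the JDBC URL, credentials, tables and HDFS paths are hypothetical placeholders.

    # Import: copy an RDBMS table into HDFS, using 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /data/sqoop/orders \
      -m 4

    # Export: push an HDFS directory back into an RDBMS table
    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table order_summary \
      --export-dir /data/out/summary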
4. What is Apache Flume? Describe the features, components and working of Apache Flume (SVIT_notes 36 to 37)
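Flume's working can be summarized as an agent wiring a source to a sink through a channel. Below is a minimal sketch of an agent configuration file; the agent/component names, port and HDFS path are hypothetical.

    # Agent a1: netcat source -> memory channel -> HDFS sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: listens for lines of text on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: in-memory buffer between source and sink
    a1.channels.c1.type = memory

    # Sink: writes events into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /data/flume/events

    # Wire the components together
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1

Such an agent would be started with: flume-ng agent --conf conf --conf-file agent.conf --name a1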
5. Explain the MapReduce workflow for the word count program, the steps on a job request, and the two types of processes (63 to 67 & 68, 69)
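For the workflow, the canonical word count (modeled on the standard Apache Hadoop example) shows the two phases: map tasks tokenize the input and emit (word, 1) pairs, and reduce tasks sum the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emits (word, 1) for every token in a line of input
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts received for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run as: hadoop jar wordcount.jar WordCount <input dir> <output dir> (this is the job invoked in the pipeline sketch in Module 1, question 2).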
