Big Data Analytics
Module 1
1. Define Big data. Explain the classification of big data (page no 55 & 69 to 70)
Big Data is high-volume, high-velocity and high-variety information assets that demand new, cost-effective forms
of processing for enhanced decision making, insight discovery and process optimization.
Big data is commonly classified by structure into structured, semi-structured, multi-structured and unstructured data.
2. Explain the functions of each layer in big data architecture design with a diagram. (92 to 100)
Big Data architecture: Big Data architecture is the logical or physical layout/structure of how Big
Data will be stored, accessed and managed within an IT environment.
Data processing architecture consists of five layers:
(i) identification of data sources,
(ii) acquisition, ingestion, extraction, pre-processing, transformation of data,
(iii) data storage in files, servers, clusters or cloud,
(iv) data processing, and
(v) data consumption by a number of programs and tools such as business intelligence, data
mining, discovering patterns/clusters, artificial intelligence (AI), machine learning (ML),
text analytics, descriptive and predictive analytics, and data visualization.
Layer 1:
L1 considers the following aspects in a design:
◦ Amount of data needed at ingestion layer 2 (L2)
◦ Push from L1 or pull by L2, as per the mechanism required for the usage
◦ Source data types: database, files, web or service
◦ Source formats, i.e., structured, semi-structured or unstructured.
Layer 2:
Ingestion and ETL processes run either in real time, which means the data is stored and used as it is
generated, or in batches.
Batch processing uses discrete datasets at scheduled or periodic intervals of time.
Layer 3:
L3 considers data storage type (historical or incremental), format, compression, incoming data
frequency, querying patterns and consumption requirements for L4 or L5.
Data storage uses the Hadoop distributed file system (HDFS) or NoSQL data stores such as HBase,
Cassandra or MongoDB.
Layer 4:
Data-processing software such as MapReduce, Hive, Pig, Spark, Mahout and Spark Streaming
Processing in scheduled batches or real time or hybrid
Processing as per synchronous or asynchronous processing requirements at L5.
Layer 5:
Data integration
Dataset usage for reporting and visualization
Analytics (real time, near real time, scheduled batches), business processes (BPs), business intelligence (BI), knowledge discovery
Export of datasets to cloud, web or other systems
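To make the data flow across the five layers concrete, here is a minimal, illustrative Java sketch; the class and method names (FiveLayerPipeline, identifySource, ingestAndTransform, store, process) are invented for this example and only model how data moves from source to consumption, not any actual framework API.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual model of the five-layer Big Data architecture (names are illustrative only).
public class FiveLayerPipeline {

    // L1: identification of data sources (an in-memory list stands in for files/web/services).
    static List<String> identifySource() {
        return List.of("sensor,25", "sensor,30", "clickstream,home", "sensor,28");
    }

    // L2: acquisition/ingestion with simple pre-processing and transformation (trim and split).
    static List<String[]> ingestAndTransform(List<String> raw) {
        return raw.stream()
                  .map(line -> line.trim().split(","))
                  .collect(Collectors.toList());
    }

    // L3: data storage (a map keyed by source stands in for HDFS/NoSQL storage).
    static Map<String, List<String[]>> store(List<String[]> records) {
        return records.stream().collect(Collectors.groupingBy(r -> r[0]));
    }

    // L4: data processing (counting records per source, as MapReduce/Spark would at scale).
    static Map<String, Long> process(Map<String, List<String[]>> stored) {
        return stored.entrySet().stream()
                     .collect(Collectors.toMap(Map.Entry::getKey, e -> (long) e.getValue().size()));
    }

    // L5: data consumption (reporting/visualization/export).
    public static void main(String[] args) {
        Map<String, Long> report = process(store(ingestAndTransform(identifySource())));
        report.forEach((source, count) -> System.out.println(source + " -> " + count + " records"));
    }
}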
3. Define scalability and its types with an example (72 to 79) OR write a short note on analytics
scalability to big data and massively parallel processing platforms.
Scalability
Scalability enables increasing or decreasing the capacity of data storage, processing and analytics.
Scalability is the capability of a system to handle the workload as per the magnitude of the work;
system capability must grow as the workload increases.
Scalability is of two types: vertical scalability (scaling up a single system by adding more CPUs,
memory or disks) and horizontal scalability (scaling out by adding more systems that process the
workload in parallel, as on massively parallel processing (MPP) platforms).
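As an example of horizontal scaling, the Java sketch below partitions one fixed workload (summing a range of numbers, a stand-in chosen only for illustration) across an increasing number of worker threads, mimicking how a massively parallel processing platform spreads work over more nodes; the worker counts 1, 2, 4, 8 are arbitrary.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

// Illustrative only: the same total work is split across more workers, so per-worker load drops.
public class ScaleOutDemo {

    // The "work": sum the numbers in [start, end).
    static long sumRange(long start, long end) {
        return LongStream.range(start, end).sum();
    }

    static long runWithWorkers(long total, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long chunk = total / workers;
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            long start = i * chunk;
            long end = (i == workers - 1) ? total : start + chunk; // last worker takes the remainder
            Callable<Long> task = () -> sumRange(start, end);
            parts.add(pool.submit(task));
        }
        long result = 0;
        for (Future<Long> f : parts) result += f.get(); // combine partial results
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        long total = 100_000_000L;
        for (int workers : new int[]{1, 2, 4, 8}) {
            long t0 = System.nanoTime();
            long sum = runWithWorkers(total, workers);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println(workers + " worker(s): sum=" + sum + ", time=" + ms + " ms");
        }
    }
}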
4. Define data preprocessing. Explain in brief the needs of preprocessing (116 to 120)
5. Discuss the evolution of big data and characteristics of big data (28 to 30 & 56 to 57)
Module 2
1. With a neat diagram, explain Hadoop main components and ecosystem components (28 to 29 & 35
to 38)
OR
What are the core components of Hadoop? Explain each of them in brief.
Hadoop Ecosystem Components
2. Brief out the features of Hadoop HDFS? Also explain the functions of name node and data node
(44 & 49 to 52)
Hadoop HDFS features are as follows:
(i) Create, append, delete, rename, and attribute modification functions.
(ii) Content of an individual file cannot be modified or replaced, but new data can be appended at
the end of the file.
(iii) Write once but read many times during usage and processing.
(iv) Average file size can be more than 500 MB.
NameNode: the master node that maintains the HDFS namespace and metadata (directory tree,
file-to-block mapping, block locations) and coordinates client access; it does not store the file data itself.
DataNode: a slave node that stores the actual data blocks, serves read/write requests from clients,
and reports to the NameNode periodically through heartbeats and block reports.
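The following illustrative Java sketch exercises these HDFS file operations through the Hadoop FileSystem API; the NameNode URI hdfs://localhost:9000 and the /user/demo paths are assumptions for a single-node setup, and append additionally requires the cluster to have append support enabled.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a local single-node cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/user/demo/log.txt");

        // (i) Create and write once.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // (ii) Existing content cannot be modified in place, but new data can be appended at the end.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }

        // (iii) Write once, read many times.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[1024];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }

        // (i) Rename and delete.
        fs.rename(file, new Path("/user/demo/log-renamed.txt"));
        fs.delete(new Path("/user/demo/log-renamed.txt"), false);

        fs.close();
    }
}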
3. Explain the Apache Sqoop import and export methods with a neat diagram (81 + SVIT_notes 29 to 30)
4. What is Apache Flume? Describe the features, components and working of Apache Flume (SVIT_notes
36 to 37)
5. Explain the MapReduce workflow for the word-count program, with the steps on a request and the
two types of processes (63 to 67 & 68, 69)