4. Apache Hive & Pig for Querying Large Datasets: Creation of Tables, Data Loading, and Running Queries
Aim: To create tables, load data, and run queries in Apache Hive
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create a working directory and the data file
mkdir dema
cd dema
vi hive.txt
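Press i and enter tab-separated employee records matching the table schema created in Step 4. The rows below are only sample values; any tab-separated eid, ename, eage, esalary records will work:
1	anu	25	25000.0
2	bala	32	30000.0
3	chitra	28	45000.0
Press Esc, type :wq, and hit Enter to save and exit.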
Step 2: Create a directory in HDFS and upload the file
hadoop fs -mkdir /dema
hadoop fs -put hive.txt /dema
hadoop fs -ls /dema
Step 3: Enter Hive shell
hive
Step 4: Create the Hive table
CREATE TABLE employee (eid INT, ename STRING, eage INT, esalary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Step 5: Load data into Hive table
LOAD DATA LOCAL INPATH 'hive.txt' INTO TABLE employee;
Step 6: Verify table exists
SHOW TABLES;
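The listing should include the newly created table, for example:
employee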
Step 7: Run HiveQL queries
a. View all data:
SELECT * FROM employee;
b. View specific columns:
SELECT ename, esalary FROM employee;
c. Filter rows based on condition:
SELECT eid, esalary FROM employee WHERE eage < 30;
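d. (Optional) Compute an aggregate, for example the average salary using Hive's built-in AVG function:
SELECT AVG(esalary) FROM employee;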
Aim: To load data, create relations, and run queries in Apache Pig
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create a Local File with Employee Data
Open terminal and create the file keer.txt:
vi keer.txt
Then enter the following data (tab-separated):
1 a 10.0
2 b 20.0
3 c 30.0
4 d 40.0
Save and exit:
Press Esc
Type :wq
Hit Enter
Step 2: Create Directory in HDFS and Upload the File
Create an HDFS directory (if not already created):
hdfs dfs -mkdir /cc
Upload the local file to HDFS:
hdfs dfs -put keer.txt /cc/
Check the file is in HDFS:
hdfs dfs -ls /cc
Step 3: Start the Pig Grunt Shell
pig
Wait for the grunt> prompt.
Step 4: Load the Data into a Pig Relation
Inside the grunt> prompt, load the file with schema:
employee = LOAD '/cc/keer.txt' USING PigStorage('\t')
AS (eid:int, ename:chararray, esal:float);
Step 5: Display All Records
DUMP employee;
Sample Output:
(1,a,10.0)
(2,b,20.0)
(3,c,30.0)
(4,d,40.0)
Step 6: Filter Employees with Salary Greater Than 10
high_paid = FILTER employee BY esal > 10;
Display result:
DUMP high_paid;
Sample Output:
(2,b,20.0)
(3,c,30.0)
(4,d,40.0)
Step 7: Group All Records Together (for aggregation)
grouped = GROUP employee ALL;
Step 8: Calculate the Average Salary
avg_salary = FOREACH grouped GENERATE AVG(employee.esal) AS avg_salary;
Display result:
DUMP avg_salary;
Sample Output:
(25.0)
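Step 9 (optional): Sort the records by salary. As a further example, the same relation can be ordered by the esal field in descending order (the alias sorted_emp is illustrative):
sorted_emp = ORDER employee BY esal DESC;
DUMP sorted_emp;
Sample Output:
(4,d,40.0)
(3,c,30.0)
(2,b,20.0)
(1,a,10.0)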
Result:
5. Apache Spark Basics: RDDs and DataFrames - Implement Spark Transformations and Actions
Aim:
To implement basic Apache Spark transformations and actions using RDDs and DataFrames,
demonstrating fundamental data processing operations in a distributed computing environment.
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create the data file
vi number.txt
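Press i and enter one integer per line. The values below are only sample data:
10
20
30
40
50
Press Esc, type :wq, and hit Enter to save and exit.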
Step 2: Create a directory in HDFS
hdfs dfs -mkdir /apache
Step 3: Upload your input file to HDFS
hdfs dfs -put number.txt /apache
Step 4: Check if the file is in HDFS:
hdfs dfs -ls /apache
Step 5: Write your Spark job. Create a Python file with your Spark code, for example rdd_example.py:
vi rdd_example.py
Step 6: Press i and type the following program:
from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
# Read text file from HDFS
lines = sc.textFile("hdfs:///apache/number.txt")
# Filter out empty lines and convert to integers
numbers = lines.filter(lambda x: x.strip() != "").map(lambda x: int(x.strip()))
# Square the numbers
squared = numbers.map(lambda x: x * x)
# Collect and print results
results = squared.collect()
for num in results:
    print(num)
sc.stop()
After typing the program, press Esc, then type :wq to save and exit.
Step 7: Run your Spark job
spark-submit rdd_example.py
Step 8: View the output
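With the sample values shown in Step 1 (10 to 50), the job would print the square of each input number, for example:
100
400
900
1600
2500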
Procedure (DataFrame example):
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create the data file
vi people.txt
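Press i and enter comma-separated records in the order Name,Age,Salary,Experience. The names and values below are only sample data:
Anil,45,60000,10
Ravi,28,30000,3
Meena,50,80000,20
Kiran,35,40000,6
Press Esc, type :wq, and hit Enter to save and exit.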
Step 2: Create a directory in HDFS
hdfs dfs -mkdir /demo
Step 3: Upload your input file to HDFS
hdfs dfs -put people.txt /demo
Step 4: Check if the file is in HDFS:
hdfs dfs -ls /demo
Step 5: Create your Spark Python script
Run this command:
vi spark_dataframe.py
Then press i to insert and type your program:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
# Initialize SparkContext and SQLContext
sc = SparkContext(appName="DataFrameExample")
sqlContext = SQLContext(sc)
# Read text file from HDFS
lines = sc.textFile("hdfs:///demo/people.txt")
# Convert each line into a Row object with proper types
rows_rdd = lines.map(lambda line: line.split(",")) \
    .map(lambda p: Row(
        Name=p[0].strip(),
        Age=int(p[1].strip()),
        Salary=int(p[2].strip()),
        Experience=int(p[3].strip())
    ))
# Create DataFrame from the RDD of Rows
df = sqlContext.createDataFrame(rows_rdd)
# Filter rows where Age > 40 and Experience > 5
df_filtered = df.filter((df.Age > 40) & (df.Experience > 5))
# Show filtered results
df_filtered.show(1000)
# Stop SparkContext
sc.stop()
After typing the program, press Esc, then type :wq to save and exit.
Step 6: Submit your Spark job
Run this command:
spark-submit spark_dataframe.py
Step 7: View the expected output in the terminal
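With the sample people.txt shown in Step 1, only the rows with Age > 40 and Experience > 5 (Anil and Meena) would remain, giving output similar to the following (column order may vary with the Spark version):
+---+----------+-----+------+
|Age|Experience| Name|Salary|
+---+----------+-----+------+
| 45|        10| Anil| 60000|
| 50|        20|Meena| 80000|
+---+----------+-----+------+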