4. Apache Hive & Pig for Querying Large Datasets: Creation of Tables, Data Loading, and Running Queries
Aim: To create tables, load data, and run queries in Apache Hive
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create a working directory and the data file
mkdir dema
cd dema
vi hive.txt
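Press i and enter tab-separated employee records matching the table schema created in Step 4. The rows below are only sample values; any tab-separated eid, ename, eage, esalary records will work:
1	anu	25	25000.0
2	bala	32	30000.0
3	chitra	28	45000.0
Press Esc, type :wq, and hit Enter to save and exit.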
Step 2: Create a directory in HDFS and upload the file
hadoop fs -mkdir /dema
hadoop fs -put hive.txt /dema
hadoop fs -ls /dema
Step 3: Enter Hive shell
hive
Step 4: Create the Hive table
CREATE TABLE employee (eid INT, ename STRING, eage INT, esalary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Step 5: Load data into Hive table
LOAD DATA LOCAL INPATH 'hive.txt' INTO TABLE employee;
Step 6: Verify table exists
SHOW TABLES;
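The listing should include the newly created table, for example:
employee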
Step 7: Run HiveQL queries
a. View all data:
SELECT * FROM employee;
b. View specific columns:
SELECT ename, esalary FROM employee;
c. Filter rows based on condition:
SELECT eid, esalary FROM employee WHERE eage < 30;
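d. (Optional) Compute an aggregate, for example the average salary using Hive's built-in AVG function:
SELECT AVG(esalary) FROM employee;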
Aim: To load data, create relations, and run queries in Apache Pig
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create a Local File with Employee Data
Open terminal and create the file keer.txt:
vi keer.txt
Then enter the following data (tab-separated):
1 a 10.0
2 b 20.0
3 c 30.0
4 d 40.0
Save and exit:
Press Esc
Type :wq
Hit Enter
Step 2: Create Directory in HDFS and Upload the File
Create an HDFS directory (if not already created):
hdfs dfs -mkdir /cc
Upload the local file to HDFS:
hdfs dfs -put keer.txt /cc/
Check the file is in HDFS:
hdfs dfs -ls /cc
Step 3: Start the Pig Grunt Shell
pig
Wait for the grunt> prompt.
Step 4: Load the Data into a Pig Relation
Inside the grunt> prompt, load the file with schema:
employee = LOAD '/cc/keer.txt' USING PigStorage('\t')
AS (eid:int, ename:chararray, esal:float);
Step 5: Display All Records
DUMP employee;
Sample Output:
(1,a,10.0)
(2,b,20.0)
(3,c,30.0)
(4,d,40.0)
Step 6: Filter Employees with Salary Greater Than 10
high_paid = FILTER employee BY esal > 10;
Display result:
DUMP high_paid;
Sample Output:
(2,b,20.0)
(3,c,30.0)
(4,d,40.0)
Step 7: Group All Records Together (for aggregation)
grouped = GROUP employee ALL;
Step 8: Calculate the Average Salary
avg_salary = FOREACH grouped GENERATE AVG(employee.esal) AS avg_salary;
Display result:
DUMP avg_salary;
Sample Output:
(25.0)
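Step 9 (optional): Sort the records by salary. As a further example, the same relation can be ordered by the esal field in descending order (the alias sorted_emp is illustrative):
sorted_emp = ORDER employee BY esal DESC;
DUMP sorted_emp;
Sample Output:
(4,d,40.0)
(3,c,30.0)
(2,b,20.0)
(1,a,10.0)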
Result:
5. Apache Spark Basics: RDDs and DataFrames - Implement Spark Transformations and Actions
Aim:
To implement basic Apache Spark transformations and actions using RDDs and DataFrames,
demonstrating fundamental data processing operations in a distributed computing environment.
Procedure:
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create the data file
vi number.txt
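Press i and enter one integer per line. The values below are only sample data:
10
20
30
40
50
Press Esc, type :wq, and hit Enter to save and exit.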
Step 2: Create a directory in HDFS
hdfs dfs -mkdir /apache
Step 3: Upload your input file to HDFS
hdfs dfs -put number.txt /apache
Step 4: Check if the file is in HDFS:
hdfs dfs -ls /apache
Step 5: Write your Spark job. Create a Python file with your Spark code, for example rdd_example.py:
vi rdd_example.py
Step 6: Press i and type the following program:
from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
# Read text file from HDFS
lines = sc.textFile("hdfs:///apache/number.txt")
# Filter out empty lines and convert to integers
numbers = lines.filter(lambda x: x.strip() != "").map(lambda x: int(x.strip()))
# Square the numbers
squared = numbers.map(lambda x: x * x)
# Collect and print results
results = squared.collect()
for num in results:
    print(num)
sc.stop()
After typing the program, press Esc, then type :wq to save and exit.
Step 7: Run your Spark job
spark-submit rdd_example.py
Step 8: View the output
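With the sample values shown in Step 1 (10 to 50), the job would print the square of each input number, for example:
100
400
900
1600
2500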
Procedure (DataFrame example):
1. Open PuTTY on Windows.
2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).
3. Port = 22 (default for SSH).
4. Connection type = SSH.
5. Click Open.
6. A terminal will appear → enter your username (e.g., hduser) and password.
Step 1: Create the data file
vi people.txt
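Press i and enter comma-separated records in the order Name,Age,Salary,Experience. The names and values below are only sample data:
Anil,45,60000,10
Ravi,28,30000,3
Meena,50,80000,20
Kiran,35,40000,6
Press Esc, type :wq, and hit Enter to save and exit.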
Step 2: Create a directory in HDFS
hdfs dfs -mkdir /demo
Step 3: Upload your input file to HDFS
hdfs dfs -put people.txt /demo
Step 4: Check if the file is in HDFS:
hdfs dfs -ls /demo
Step 5: Create your Spark Python script
Run this command:
vi spark_dataframe.py
Then press i to insert and type your program:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
# Initialize SparkContext and SQLContext
sc = SparkContext(appName="DataFrameExample")
sqlContext = SQLContext(sc)
# Read text file from HDFS
lines = sc.textFile("hdfs:///demo/people.txt")
# Convert each line into a Row object with proper types
rows_rdd = lines.map(lambda line: line.split(",")) \
    .map(lambda p: Row(
        Name=p[0].strip(),
        Age=int(p[1].strip()),
        Salary=int(p[2].strip()),
        Experience=int(p[3].strip())
    ))
# Create DataFrame from the RDD of Rows
df = sqlContext.createDataFrame(rows_rdd)
# Filter rows where Age > 40 and Experience > 5
df_filtered = df.filter((df.Age > 40) & (df.Experience > 5))
# Show filtered results
df_filtered.show(1000)
# Stop SparkContext
sc.stop()
After typing the program, press Esc, then type :wq to save and exit.
Step 6: Submit your Spark job
Run this command:
spark-submit spark_dataframe.py
Step 7: View the expected output in the terminal
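With the sample people.txt shown in Step 1, only the rows with Age > 40 and Experience > 5 (Anil and Meena) would remain, giving output similar to the following (column order may vary with the Spark version):
+---+----------+-----+------+
|Age|Experience| Name|Salary|
+---+----------+-----+------+
| 45|        10| Anil| 60000|
| 50|        20|Meena| 80000|
+---+----------+-----+------+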