Experiment 4 & 5

4. Apache Hive & Pig for Querying Large Datasets: Creation of Tables, Data Loading, and Running Queries

Aim: To create tables, load data, and run queries in Apache Hive.

Procedure:

1. Open PuTTY on Windows.

2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).

3. Port = 22 (default for SSH).

4. Connection type = SSH.

5. Click Open.

6. A terminal will appear → enter your username (e.g., hduser) and password.

Step 1: Create a working directory and the data file

mkdir dema

cd dema

vi hive.txt
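
The file should contain tab-separated employee records matching the schema used in Step 4 (eid, ename, eage, esalary). The exact values are only illustrative, for example:

1	anu	25	10000.0
2	banu	32	20000.0
3	charu	28	30000.0
4	dev	45	40000.0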

Step 2: Create a directory in HDFS and upload the file

hadoop fs -mkdir /dema

hadoop fs -put hive.txt /dema

hadoop fs -ls /dema

Step 3: Enter Hive shell

hive

Step 4: Create the Hive table

CREATE TABLE employee (eid INT, ename STRING, eage INT, esalary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Step 5: Load data into Hive table

LOAD DATA LOCAL INPATH 'hive.txt' INTO TABLE employee;

Step 6: Verify table exists

SHOW TABLES;
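
Optionally, the table schema can also be verified with:

DESCRIBE employee;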

Step 7: Run HiveQL queries

a. View all data:

SELECT * FROM employee;

b. View specific columns:

SELECT ename, esalary FROM employee;

c. Filter rows based on a condition:

SELECT eid, esalary FROM employee WHERE eage < 30;
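
With the illustrative hive.txt data shown earlier, query (c) would return only the employees younger than 30, for example:

1	10000.0
3	30000.0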


Aim: To create relations, load data, and run queries in Apache Pig.

Procedure:

1. Open PuTTY on Windows.

2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).

3. Port = 22 (default for SSH).

4. Connection type = SSH.

5. Click Open.

6. A terminal will appear → enter your username (e.g., hduser) and password.

Step 1: Create a Local File with Employee Data

Open terminal and create the file keer.txt:

vi keer.txt

Then enter the following data (tab-separated):

1 a 10.0
2 b 20.0
3 c 30.0
4 d 40.0

Save and exit:

• Press Esc
• Type :wq
• Hit Enter

Step 2: Create Directory in HDFS and Upload the File

Create an HDFS directory (if not already created):

hdfs dfs -mkdir /cc

Upload the local file to HDFS:

hdfs dfs -put keer.txt /cc/


Check the file is in HDFS:

hdfs dfs -ls /cc

Step 3: Start the Pig Grunt Shell

pig

Wait for the grunt> prompt.

Step 4: Load the Data into a Pig Relation

Inside the grunt> prompt, load the file with schema:

employee = LOAD '/cc/keer.txt' USING PigStorage('\t')
    AS (eid:int, ename:chararray, esal:float);

Step 5: Display All Records

DUMP employee;

Sample Output:

(1,a,10.0)
(2,b,20.0)
(3,c,30.0)
(4,d,40.0)

Step 6: Filter Employees with Salary Greater Than 10

high_paid = FILTER employee BY esal > 10;

Display result:

DUMP high_paid;

Sample Output:

(2,b,20.0)
(3,c,30.0)
(4,d,40.0)

Step 7: Group All Records Together (for aggregation)

grouped = GROUP employee ALL;

Step 8: Calculate the Average Salary

avg_salary = FOREACH grouped GENERATE AVG(employee.esal) AS avg_salary;

Display result:

DUMP avg_salary;

Sample Output:

(25.0)
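
To leave the Grunt shell after the queries complete, type quit; at the grunt> prompt.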

Result:

Thus, tables were created, data was loaded, and queries were executed successfully in Apache Hive and Pig.

5. Apache Spark Basics (RDDs and DataFrames): Implementation of Spark Transformations and Actions

Aim:

To implement basic Apache Spark transformations and actions using RDDs and DataFrames,
demonstrating fundamental data processing operations in a distributed computing environment.

Procedure (RDD example):

1. Open PuTTY on Windows.

2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).

3. Port = 22 (default for SSH).

4. Connection type = SSH.

5. Click Open.

6. A terminal will appear → enter your username (e.g., hduser) and password

Step 1: Create the data file

vi number.txt
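
For example, number.txt could contain one integer per line (the values below are only illustrative):

1
2
3
4
5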

Step 2: Create a directory in HDFS

hdfs dfs -mkdir /apache

Step 3: Upload your input file to HDFS

hdfs dfs -put number.txt /apache

Step 4: Check if the file is in HDFS:

hdfs dfs -ls /apache

Step 5: Write your Spark job. Create a Python file with your Spark code, for example rdd_example.py:

vi rdd_example.py

Step 6: Press i to enter insert mode and type the program below

from pyspark import SparkContext

sc = SparkContext("local", "RDD Example")

# Read text file from HDFS
lines = sc.textFile("hdfs:///apache/number.txt")

# Filter out empty lines and convert to integers
numbers = lines.filter(lambda x: x.strip() != "").map(lambda x: int(x.strip()))

# Square the numbers
squared = numbers.map(lambda x: x * x)

# Collect and print results
results = squared.collect()
for num in results:
    print(num)

sc.stop()

Step 7: Save and exit. After typing the program, press Esc, then type :wq and press Enter.

Step 8: Run your Spark job

spark-submit rdd_example.py

Step 9: View the output
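
Assuming the illustrative number.txt shown earlier, the job would print the square of each number:

1
4
9
16
25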


Procedure (DataFrame example):

1. Open PuTTY on Windows.

2. In the Host Name (or IP address) field → enter the server’s IP or hostname (for example:
192.168.1.100 or hadoop-master).

3. Port = 22 (default for SSH).

4. Connection type = SSH.

5. Click Open.

6. A terminal will appear → enter your username (e.g., hduser) and password

Step 1: Create the data file

vi people.txt
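
The file should contain comma-separated records in the order Name,Age,Salary,Experience expected by the script in Step 5. The exact values are only illustrative, for example:

Arun,45,60000,10
Bala,35,40000,6
Chitra,50,80000,20
Deva,28,30000,3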

Step 2: Create a directory in HDFS

hdfs dfs -mkdir /demo

Step 3: Upload your input file to HDFS

hdfs dfs -put people.txt /demo

Step 4: Check if the file is in HDFS:

hdfs dfs -ls /demo

Step 5: Create your Spark Python script

Run this command:

vi spark_dataframe.py

Then press i to enter insert mode and type the program:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

# Initialize SparkContext and SQLContext
sc = SparkContext(appName="DataFrameExample")
sqlContext = SQLContext(sc)

# Read text file from HDFS
lines = sc.textFile("hdfs:///demo/people.txt")

# Convert each line into a Row object with proper types
rows_rdd = lines.map(lambda line: line.split(",")) \
    .map(lambda p: Row(
        Name=p[0].strip(),
        Age=int(p[1].strip()),
        Salary=int(p[2].strip()),
        Experience=int(p[3].strip())
    ))

# Create DataFrame from the RDD of Rows
df = sqlContext.createDataFrame(rows_rdd)

# Filter rows where Age > 40 and Experience > 5
df_filtered = df.filter((df.Age > 40) & (df.Experience > 5))

# Show filtered results
df_filtered.show(1000)

# Stop SparkContext
sc.stop()

After typing the program, press Esc, then type :wq and press Enter to save and exit.

Step 6: Submit your Spark job

Run this command:

spark-submit spark_dataframe.py

Step 7: View the expected output in the terminal
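
Assuming the illustrative people.txt shown earlier, only the rows with Age > 40 and Experience > 5 are displayed. The exact column order and formatting may vary with the Spark version, for example:

+---+----------+------+------+
|Age|Experience|  Name|Salary|
+---+----------+------+------+
| 45|        10|  Arun| 60000|
| 50|        20|Chitra| 80000|
+---+----------+------+------+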
