Scenario-Based Interview Question
Ganesh. R
Problem Statement
Given a table of messages, write a query to find the top two senders by number of messages sent during August 2022. Output each sender_id along with the number of messages they sent.
Input Table Data

message_id  sender_id  receiver_id  content                                         sent_date
901         3601       4500        You up?                                         08/03/2022 16:43:00
743         3601       8752        Let's take this offline                         06/14/2022 14:30:00
888         3601       7855        DataLemur has awesome user base!                08/12/2022 08:45:00
100         2520       6987        Send this out now!                              08/16/2021 00:35:00
898         2520       9630        Are you ready for your upcoming presentation?   08/13/2022 14:35:00
990         2520       8520        Maybe it was done by the automation process.    08/19/2022 06:30:00
819         2310       4500        What's the status on this?                      07/10/2022 15:55:00
922         3601       4500        Get on the call                                 08/10/2022 17:03:00
942         2520       3561        How much do you know about Data Science?        08/17/2022 13:44:00
966         3601       7852        Meet me in five!                                08/17/2022 02:20:00
902         4500       3601        Only if you're buying                           08/03/2022 06:50:00

(The full code that builds this DataFrame appears in the solution below.)
Output Table

sender_id  count_messages
3601       4
2520       3

Sender 3601 sent four messages in August 2022 and sender 2520 sent three; the rows from June 2022, July 2022, and August 2021 fall outside the filter.
Problem Statement: Message Count Analysis
Objective: Identify the two senders with the highest number of messages sent during August 2022.
Background: You have a dataset containing message records. Each record includes information
such as the sender_id, message_id, and the date the message was sent (sent_date). The goal is
to extract insights regarding message activity during a specific month and year.
Dataset Description:
Table Name: messages
Columns:
- message_id (Integer): Unique identifier for each message.
- sender_id (Integer): Unique identifier for each sender.
- receiver_id (Integer): Unique identifier for the recipient.
- content (String): Text of the message.
- sent_date (Timestamp): When the message was sent.
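The solution code below assumes a Databricks-style notebook, where the SparkSession is already available as spark and DataFrames expose display(). Outside Databricks, a minimal session can be created first (a sketch; the app name is arbitrary), and df.show() replaces df.display():

from pyspark.sql import SparkSession

# Create a local SparkSession if one is not already provided
# (Databricks notebooks predefine `spark`, so this step can be skipped there).
spark = SparkSession.builder.appName("message_count_analysis").getOrCreate()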
from pyspark.sql.types import (
StructType,
StructField,
IntegerType,
StringType,
TimestampType,
)
from datetime import datetime
# Define schema for the DataFrame
schema = StructType(
[
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", TimestampType(), True),
]
)
# Data as a list of tuples
data = [
(
901,
3601,
4500,
"You up?",
datetime.strptime("08/03/2022 16:43:00", "%m/%d/%Y %H:%M:%S"),
),
(
743,
3601,
8752,
"Let's take this offline",
datetime.strptime("06/14/2022 14:30:00", "%m/%d/%Y %H:%M:%S"),
),
(
888,
3601,
7855,
"DataLemur has awesome user base!",
datetime.strptime("08/12/2022 08:45:00", "%m/%d/%Y %H:%M:%S"),
),
(
100,
2520,
6987,
"Send this out now!",
datetime.strptime("08/16/2021 00:35:00", "%m/%d/%Y %H:%M:%S"),
),
(
898,
2520,
9630,
"Are you ready for your upcoming presentation?",
datetime.strptime("08/13/2022 14:35:00", "%m/%d/%Y %H:%M:%S"),
),
(
990,
2520,
8520,
"Maybe it was done by the automation process.",
datetime.strptime("08/19/2022 06:30:00", "%m/%d/%Y %H:%M:%S"),
),
(
819,
2310,
4500,
"What's the status on this?",
datetime.strptime("07/10/2022 15:55:00", "%m/%d/%Y %H:%M:%S"),
),
(
922,
3601,
4500,
"Get on the call",
datetime.strptime("08/10/2022 17:03:00", "%m/%d/%Y %H:%M:%S"),
),
(
942,
2520,
3561,
"How much do you know about Data Science?",
datetime.strptime("08/17/2022 13:44:00", "%m/%d/%Y %H:%M:%S"),
),
(
966,
3601,
7852,
"Meet me in five!",
datetime.strptime("08/17/2022 02:20:00", "%m/%d/%Y %H:%M:%S"),
),
(
902,
4500,
3601,
"Only if you're buying",
datetime.strptime("08/03/2022 06:50:00", "%m/%d/%Y %H:%M:%S"),
),
]
# Create DataFrame
df = spark.createDataFrame(data, schema=schema)
# display the DataFrame
df.display()
df.createOrReplaceTempView("messages")
%sql
WITH message_counts AS (
SELECT
sender_id,
COUNT(message_id) AS count_messages
FROM
messages
WHERE
MONTH(sent_date) = 8
AND YEAR(sent_date) = 2022
GROUP BY
sender_id
)
SELECT
*
FROM
message_counts
ORDER BY
count_messages DESC
LIMIT
2;
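The same query can also be run programmatically against the registered temp view; spark.sql returns a DataFrame (a sketch, using the messages view created above):

top_senders_sql_df = spark.sql("""
    SELECT sender_id,
           COUNT(message_id) AS count_messages
    FROM messages
    WHERE MONTH(sent_date) = 8
      AND YEAR(sent_date) = 2022
    GROUP BY sender_id
    ORDER BY count_messages DESC
    LIMIT 2
""")
top_senders_sql_df.display()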
from pyspark.sql.functions import col, count, month, year

# Load your DataFrame (assuming it's already available as `df`)
# messages_df = spark.read.csv("path_to_your_data.csv", header=True, inferSchema=True)

# Filter to August 2022, then count messages per sender
message_counts_df = (
    df.filter((month(col("sent_date")) == 8) & (year(col("sent_date")) == 2022))
    .groupBy("sender_id")
    .agg(count("message_id").alias("count_messages"))
)

# Order by count_messages and keep the top two senders
top_senders_df = message_counts_df.orderBy(col("count_messages").desc()).limit(2)

# Show the results
top_senders_df.display()
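As a quick sanity check (a sketch; it collects the small result to the driver), the computed rows can be compared with the expected output table above:

# Compare the computed result with the expected output shown earlier
expected = {3601: 4, 2520: 3}
actual = {row["sender_id"]: row["count_messages"] for row in top_senders_df.collect()}
assert actual == expected, f"Unexpected result: {actual}"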
Explanation:
- Filter: The filter method keeps only records sent in August 2022.
- GroupBy and Aggregation: groupBy groups the rows by sender_id, and agg counts the messages for each sender.
- Order and Limit: The results are sorted by count in descending order and limited to the top two senders; note that limit(2) breaks ties arbitrarily (see the tie-safe sketch below).
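If a third sender matched the second-highest count, limit(2) would silently drop one of them. A tie-safe variant (a sketch, reusing message_counts_df from the solution above) ranks the counts with a window function and keeps every sender in the top two ranks:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# Rank senders by message count; senders with equal counts share a rank
rank_window = Window.orderBy(col("count_messages").desc())
top_senders_with_ties_df = (
    message_counts_df.withColumn("rnk", dense_rank().over(rank_window))
    .filter(col("rnk") <= 2)
    .drop("rnk")
)
top_senders_with_ties_df.display()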
Make sure to replace df with the actual DataFrame variable name you have in your PySpark
environment.
IF YOU FOUND THIS POST
USEFUL, PLEASE SAVE IT.
Ganesh. R
THANK YOU
For Your Support
I appreciate your support on my account, and I will
never stop sharing knowledge.
rganesh203 (Ganesh R)