Scenario-Based Interview Question
Ganesh. R
Problem Statement
Given a table of messages, write a query to find the top two senders by number of messages sent during August 2022. Output each sender_id along with the number of messages they sent.
Input Table Data

message_id  sender_id  receiver_id  content                                         sent_date
901         3601       4500        You up?                                         08/03/2022 16:43:00
743         3601       8752        Let's take this offline                         06/14/2022 14:30:00
888         3601       7855        DataLemur has awesome user base!                08/12/2022 08:45:00
100         2520       6987        Send this out now!                              08/16/2021 00:35:00
898         2520       9630        Are you ready for your upcoming presentation?   08/13/2022 14:35:00
990         2520       8520        Maybe it was done by the automation process.    08/19/2022 06:30:00
819         2310       4500        What's the status on this?                      07/10/2022 15:55:00
922         3601       4500        Get on the call                                 08/10/2022 17:03:00
942         2520       3561        How much do you know about Data Science?        08/17/2022 13:44:00
966         3601       7852        Meet me in five!                                08/17/2022 02:20:00
902         4500       3601        Only if you're buying                           08/03/2022 06:50:00

(The full code that builds this DataFrame appears in the solution below.)
Output Table

sender_id  count_messages
3601       4
2520       3

Sender 3601 sent four messages in August 2022 and sender 2520 sent three; the rows from June 2022, July 2022, and August 2021 fall outside the filter.
Problem Statement: Message Count Analysis
Objective: Identify the two senders with the highest number of messages sent during August 2022.
Background: You have a dataset containing message records. Each record includes information
such as the sender_id, message_id, and the date the message was sent (sent_date). The goal is
to extract insights regarding message activity during a specific month and year.
Dataset Description:
Table Name: messages
Columns:
- message_id (Integer): Unique identifier for each message.
- sender_id (Integer): Unique identifier for each sender.
- receiver_id (Integer): Unique identifier for the recipient.
- content (String): Text of the message.
- sent_date (Timestamp): When the message was sent.
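The solution code below assumes a Databricks-style notebook, where the SparkSession is already available as spark and DataFrames expose display(). Outside Databricks, a minimal session can be created first (a sketch; the app name is arbitrary), and df.show() replaces df.display():

from pyspark.sql import SparkSession

# Create a local SparkSession if one is not already provided
# (Databricks notebooks predefine `spark`, so this step can be skipped there).
spark = SparkSession.builder.appName("message_count_analysis").getOrCreate()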
from pyspark.sql.types import (
StructType,
StructField,
IntegerType,
StringType,
TimestampType,
)
from datetime import datetime
# Define schema for the DataFrame
schema = StructType(
[
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", TimestampType(), True),
]
)
# Data as a list of tuples
data = [
(
901,
3601,
4500,
"You up?",
datetime.strptime("08/03/2022 16:43:00", "%m/%d/%Y %H:%M:%S"),
),
(
743,
3601,
8752,
"Let's take this offline",
datetime.strptime("06/14/2022 14:30:00", "%m/%d/%Y %H:%M:%S"),
),
(
888,
3601,
7855,
"DataLemur has awesome user base!",
datetime.strptime("08/12/2022 08:45:00", "%m/%d/%Y %H:%M:%S"),
),
(
100,
2520,
6987,
"Send this out now!",
datetime.strptime("08/16/2021 00:35:00", "%m/%d/%Y %H:%M:%S"),
),
(
898,
2520,
9630,
"Are you ready for your upcoming presentation?",
datetime.strptime("08/13/2022 14:35:00", "%m/%d/%Y %H:%M:%S"),
),
(
990,
2520,
8520,
"Maybe it was done by the automation process.",
datetime.strptime("08/19/2022 06:30:00", "%m/%d/%Y %H:%M:%S"),
),
(
819,
2310,
4500,
"What's the status on this?",
datetime.strptime("07/10/2022 15:55:00", "%m/%d/%Y %H:%M:%S"),
),
(
922,
3601,
4500,
"Get on the call",
datetime.strptime("08/10/2022 17:03:00", "%m/%d/%Y %H:%M:%S"),
),
(
942,
2520,
3561,
"How much do you know about Data Science?",
datetime.strptime("08/17/2022 13:44:00", "%m/%d/%Y %H:%M:%S"),
),
(
966,
3601,
7852,
"Meet me in five!",
datetime.strptime("08/17/2022 02:20:00", "%m/%d/%Y %H:%M:%S"),
),
(
902,
4500,
3601,
"Only if you're buying",
datetime.strptime("08/03/2022 06:50:00", "%m/%d/%Y %H:%M:%S"),
),
]
# Create DataFrame
df = spark.createDataFrame(data, schema=schema)
# display the DataFrame
df.display()
df.createOrReplaceTempView("messages")
%sql
WITH message_counts AS (
SELECT
sender_id,
COUNT(message_id) AS count_messages
FROM
messages
WHERE
MONTH(sent_date) = 8
AND YEAR(sent_date) = 2022
GROUP BY
sender_id
)
SELECT
*
FROM
message_counts
ORDER BY
count_messages DESC
LIMIT
2;
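The same query can also be run programmatically against the registered temp view; spark.sql returns a DataFrame (a sketch, using the messages view created above):

top_senders_sql_df = spark.sql("""
    SELECT sender_id,
           COUNT(message_id) AS count_messages
    FROM messages
    WHERE MONTH(sent_date) = 8
      AND YEAR(sent_date) = 2022
    GROUP BY sender_id
    ORDER BY count_messages DESC
    LIMIT 2
""")
top_senders_sql_df.display()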
from pyspark.sql.functions import col, count, month, year

# Load your DataFrame (assuming it's already available as `df`)
# messages_df = spark.read.csv("path_to_your_data.csv", header=True, inferSchema=True)

# Filter to August 2022, then count messages per sender
message_counts_df = (
    df.filter((month(col("sent_date")) == 8) & (year(col("sent_date")) == 2022))
    .groupBy("sender_id")
    .agg(count("message_id").alias("count_messages"))
)

# Order by count_messages and keep the top two senders
top_senders_df = message_counts_df.orderBy(col("count_messages").desc()).limit(2)

# Show the results
top_senders_df.display()
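As a quick sanity check (a sketch; it collects the small result to the driver), the computed rows can be compared with the expected output table above:

# Compare the computed result with the expected output shown earlier
expected = {3601: 4, 2520: 3}
actual = {row["sender_id"]: row["count_messages"] for row in top_senders_df.collect()}
assert actual == expected, f"Unexpected result: {actual}"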
Explanation:
- Filter: The filter method keeps only records sent in August 2022.
- GroupBy and Aggregation: groupBy groups the rows by sender_id, and agg counts the messages for each sender.
- Order and Limit: The results are sorted by count in descending order and limited to the top two senders; note that limit(2) breaks ties arbitrarily (see the tie-safe sketch below).
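If a third sender matched the second-highest count, limit(2) would silently drop one of them. A tie-safe variant (a sketch, reusing message_counts_df from the solution above) ranks the counts with a window function and keeps every sender in the top two ranks:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# Rank senders by message count; senders with equal counts share a rank
rank_window = Window.orderBy(col("count_messages").desc())
top_senders_with_ties_df = (
    message_counts_df.withColumn("rnk", dense_rank().over(rank_window))
    .filter(col("rnk") <= 2)
    .drop("rnk")
)
top_senders_with_ties_df.display()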
Make sure to replace df with the actual DataFrame variable name you have in your PySpark
environment.
IF YOU FOUND THIS POST
USEFUL, PLEASE SAVE IT.
Ganesh. R
THANK YOU
For Your Support
I appreciate your support on my account, and I will
never stop sharing knowledge.
rganesh203 (Ganesh R)