PySpark and Azure Data Engineer Free Notes

The document contains 75+ interview questions and answers related to Azure Data Engineering, specifically focusing on Delta Lake, Databricks, and SQL commands. Key topics include commands for managing Delta tables, data formats, Databricks architecture, and data engineering best practices. Each question is accompanied by a concise explanation and reference links for further reading.


Azure Data Engineering
75+ Interview Questions
version 6
Q.1. Which of the following commands can a data engineer use to compact small data files of a Delta table into larger ones?

Ans: OPTIMIZE

Overall explanation

Delta Lake can improve the speed of read queries from a table. One way to improve this speed is by compacting small files into larger ones. You trigger compaction by running the OPTIMIZE command.

Reference: https://docs.databricks.com/sql/language-manual/delta-optimize.html
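For illustration, a minimal sketch of running compaction from a notebook (the table name employees is hypothetical); OPTIMIZE can optionally be combined with ZORDER BY to co-locate related data while compacting:

# Compact small files of a Delta table into larger ones
spark.sql("OPTIMIZE employees")

# Optionally co-locate data on a frequently filtered column during compaction
spark.sql("OPTIMIZE employees ZORDER BY (country)")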

Q.2. A data engineer is trying to use Delta time travel to rollback a table to a previous version, but the
data engineer received an error that the data files are no longer present.

Which of the following commands was run on the table that caused deleting the data files?

Ans: VACUUM

Overall explanation

Running the VACUUM command on a Delta table deletes the unused data files older than a specified
data retention period. As a result, you lose the ability to time travel back to any version older than that
retention threshold.

Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
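As a small sketch (the table name is hypothetical), VACUUM can be run with the default 7-day retention or with an explicit retention window:

# Delete unused data files older than the default retention period (7 days)
spark.sql("VACUUM employees")

# Or state the retention period explicitly (168 hours = 7 days)
spark.sql("VACUUM employees RETAIN 168 HOURS")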

Q.3. In Delta Lake tables, which of the following is the primary format for the data files?

Ans: Parquet

Overall explanation

Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Parquet format, along with transaction logs in JSON format.

Reference: https://docs.databricks.com/delta/index.html
Q.4. Which of the following locations hosts the Databricks web application ?

Ans: Control plane

Overall explanation

According to the Databricks Lakehouse architecture, Databricks workspace is deployed in the control
plane along with Databricks services like Databricks web application (UI), Cluster manager, workflow
service, and notebooks.

Reference: https://docs.databricks.com/getting-started/overview.html

Q.5. In Databricks Repos, which of the following operations can a data engineer use to update the local version of a repo from its remote Git repository?

Ans: Pull

Overall explanation

The git Pull operation is used to fetch and download content from a remote repository and immediately
update the local repository to match that content.

References:

https://docs.databricks.com/repos/index.html

https://github.com/git-guides/git-pull

Q.6. According to the Databricks Lakehouse architecture, which of the following is located in the
customer's cloud account?

Ans: Cluster virtual machines

Overall explanation

When the customer sets up a Spark cluster, the cluster virtual machines are deployed in the data plane
in the customer's cloud account.

Reference: https://docs.databricks.com/getting-started/overview.html
Q.7. Describe Databricks Lakehouse?

Ans: Single, flexible, high-performance system that supports data, analytics, and machine learning
workloads.

Overall explanation

Databricks Lakehouse is a unified analytics platform that combines the best elements of data lakes and
data warehouses. So, in the Lakehouse, you can work on data engineering, analytics, and AI, all in one
platform.

Reference: https://www.databricks.com/glossary/data-lakehouse

Q.8. If the default notebook language is SQL, which of the following options can a data engineer use to run Python code in this SQL notebook?

Ans: They can add %python at the start of a cell.

Overall explanation

By default, cells use the default language of the notebook. You can override the default language in a
cell by using the language magic command at the beginning of a cell. The supported magic commands
are: %python, %sql, %scala, and %r.

Reference: https://docs.databricks.com/notebooks/notebooks-code.html
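For illustration, a cell in a SQL-default notebook that runs Python might look like the sketch below (the table name is hypothetical):

%python
# The %python magic overrides the notebook's default language for this cell only
df = spark.table("employees")
display(df)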

Q.9. Which of the following tasks is not supported by Databricks Repos, and must be performed in
your Git provider ?

Ans: Delete branches

Overall explanation

The following tasks are not supported by Databricks Repos, and must be performed in your Git provider:

1. Create a pull request


2. Delete branches
3. Merge and rebase branches *
* NOTE: Recently, merge and rebase branches have become supported in Databricks Repos. However,
this may still not be updated in the current exam version.

Q.10. Which of the following statements is Not true about Delta Lake ?

Ans: Delta Lake builds upon standard data formats: Parquet + XML

Overall explanation

It is not true that Delta Lake builds upon the XML format. It builds upon Parquet and JSON formats.

Reference: https://docs.databricks.com/delta/index.html

Q.11. How long is the default retention period of the VACUUM command ?

Ans: 7 days

Overall explanation

By default, the retention threshold of the VACUUM command is 7 days. This means that the VACUUM operation will not delete data files that are less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted.

Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Q.12. The data engineering team has a Delta table called employees that contains the employees' personal information including their gross salaries.

Which of the following code blocks will keep in the table only the employees having a salary greater
than 3000 ?

Ans: DELETE FROM employees WHERE salary <= 3000;

Overall explanation

In order to keep only the employees having a salary greater than 3000, we must delete the employees having a salary less than or equal to 3000. To do so, use the DELETE statement:

DELETE FROM table_name WHERE condition;

Reference: https://docs.databricks.com/sql/language-manual/delta-delete-from.html

Q.13. A data engineer wants to create a relational object by pulling data from two tables. The relational object must be used by other data engineers in other sessions on the same cluster only. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

Ans: Global Temporary view (not External table or Managed table)

Overall explanation

In order to avoid copying and storing physical data, the data engineer must create a view object. A view in Databricks is a virtual table that has no physical data. It's just a saved SQL query against actual tables.

The view type should be Global Temporary view that can be accessed in other sessions on the same
cluster. Global Temporary views are tied to a cluster temporary database called global_temp.

Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-view.html
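A minimal sketch (table and column names are hypothetical) of creating a global temporary view and querying it from another session on the same cluster:

# Create a global temporary view joining two tables; no physical data is copied
spark.sql("""
    CREATE GLOBAL TEMPORARY VIEW employee_details AS
    SELECT e.id, e.name, d.department_name
    FROM employees e
    JOIN departments d ON e.department_id = d.id
""")

# Any session on the same cluster can read it through the global_temp database
spark.sql("SELECT * FROM global_temp.employee_details").show()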

Q.14. Fill in the below blank to successfully create a table in Databricks using data from an existing
PostgreSQL database:
CREATE TABLE employees

USING ____________

OPTIONS (

url "jdbc:postgresql:dbserver",

dbtable "employees"

Ans: org.apache.spark.sql.jdbc

Overall explanation

Using the JDBC library, Spark SQL can extract data from any existing relational database that supports
JDBC. Examples include MySQL, PostgreSQL, SQLite, and more.

Reference: https://learn.microsoft.com/en-us/azure/databricks/external-data/jdbc

Q.15. Which of the following commands can a data engineer use to create a new table along with a
comment ?

Ans: CREATE TABLE payments

COMMENT "This table contains sensitive information"

AS SELECT * FROM bank_transactions

Overall explanation

The CREATE TABLE clause supports adding a descriptive comment for the table. This allows for easier
discovery of table contents.

Syntax:

CREATE TABLE table_name

COMMENT "here is a comment"

AS query
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

Q.16. A junior data engineer usually uses INSERT INTO command to write data into a Delta table. A
senior data engineer suggested using another command that avoids writing of duplicate records.

Which of the following commands is the one suggested by the senior data engineer ?

Ans: MERGE INTO (not APPLY CHANGES INTO or UPDATE or COPY INTO)

MERGE INTO allows you to merge a set of updates, insertions, and deletions based on a source table into a target Delta table. With MERGE INTO, you can avoid inserting duplicate records when writing into Delta tables.

References:

https://docs.databricks.com/sql/language-manual/delta-merge-into.html

https://docs.databricks.com/delta/merge.html#data-deduplication-when-writing-into-delta-tables
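For illustration, a sketch of using MERGE INTO to insert only new records and thereby avoid duplicates (table and column names are hypothetical):

# Insert only the source rows whose order_id does not already exist in the target
spark.sql("""
    MERGE INTO orders AS target
    USING orders_updates AS source
    ON target.order_id = source.order_id
    WHEN NOT MATCHED THEN INSERT *
""")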

Q.17. A data engineer is designing a Delta Live Tables pipeline. The source system generates files containing changes captured in the source data. Each change event has metadata indicating whether the specified record was inserted, updated, or deleted, in addition to a timestamp column indicating the order in which the changes happened. The data engineer needs to update a target table based on these change events.

Which of the following commands can the data engineer use to best solve this problem?

Ans: APPLY CHANGES INTO

Overall explanation

The events described in the question represent Change Data Capture (CDC) feed. CDC is logged at the
source as events that contain both the data of the records along with metadata information:

Operation column indicating whether the specified record was inserted, updated, or deleted

Sequence column that is usually a timestamp indicating the order in which the changes happened
You can use the APPLY CHANGES INTO statement to use Delta Live Tables CDC functionality

Reference: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cdc.html
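For illustration, a rough sketch of the equivalent Delta Live Tables Python API (the table, source, and column names are hypothetical, and the exact dlt helper names can vary between DLT releases):

import dlt
from pyspark.sql.functions import col, expr

# Target streaming table that will receive the applied changes
dlt.create_streaming_table("customers")

# Apply the CDC feed; 'operation' and 'event_ts' are assumed metadata columns
dlt.apply_changes(
    target = "customers",
    source = "customers_cdc",
    keys = ["customer_id"],
    sequence_by = col("event_ts"),
    apply_as_deletes = expr("operation = 'DELETE'")
)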

Q.18. In PySpark, which of the following commands can you use to query the Delta table employees
created in Spark SQL?

Ans: spark.table("employees")

Overall explanation

spark.table() function returns the specified Spark SQL table as a PySpark DataFrame

Reference:

https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/session.html#SparkSession.table

Q.19. When dropping a Delta table, which of the following explains why only the table's metadata will
be deleted, while the data files will be kept in the storage ?

Ans: The table is external

Overall explanation

External (unmanaged) tables are tables whose data is stored in an external storage path by using a
LOCATION clause.

When you run DROP TABLE on an external table, only the table's metadata is deleted, while the
underlying data files are kept.

Reference: https://docs.databricks.com/lakehouse/data-objects.html#what-is-an-unmanaged-table
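As a sketch (the path and names are illustrative), an external table keeps its files after a DROP because the data lives at an explicit LOCATION:

# External (unmanaged) table: data files live at the given LOCATION
spark.sql("""
    CREATE TABLE sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/tables/sales'
""")

# Dropping it removes only the metadata; the data files remain in storage
spark.sql("DROP TABLE sales_external")
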
Q.20. Given the two tables students_course_1 and students_course_2. Which of the following
commands can a data engineer use to get all the students from the above two tables without
duplicate records ?

Ans: SELECT * FROM students_course_1

UNION

SELECT * FROM students_course_2

Overall explanation

With UNION, you can return the result of subquery1 plus the rows of subquery2

Syntax:

subquery1

UNION [ ALL | DISTINCT ]

subquery2

If ALL is specified duplicate rows are preserved.

If DISTINCT is specified the result does not contain any duplicate rows. This is the default.

Note that both subqueries must have the same number of columns and share a least common type for
each respective column.

Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-select-setops.html

Q.21. Given the following command:

CREATE DATABASE IF NOT EXISTS hr_db ;

In which of the following locations will the hr_db database be located?

Ans: dbfs:/user/hive/warehouse

Overall explanation

Since we are creating the database here without specifying a LOCATION clause, the database will be
created in the default warehouse directory under dbfs:/user/hive/warehouse
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html

Q.22. Fill in the below blank to get the students enrolled in less than 3 courses from array column
students

SELECT

faculty_id,

students,

___________ AS few_courses_students

FROM faculties

Ans: FILTER (students, i -> i.total_courses < 3)

Overall explanation

filter(input_array, lambda_function) is a higher-order function that returns an output array from an input array by extracting elements for which the predicate of a lambda function holds.

Example:

Extracting odd numbers from an input array of integers:

SELECT filter(array(1, 2, 3, 4), i -> i % 2 == 1);

output: [1, 3]

References:

https://docs.databricks.com/sql/language-manual/functions/filter.html

https://docs.databricks.com/optimizations/higher-order-lambda-functions.html

Q.23. The data engineering team has a DLT pipeline that updates all the tables once and then stops. The compute resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline ?

Ans: The DLT pipeline executes in Triggered Pipeline mode under Development mode.

Overall explanation

Triggered pipelines update each table with whatever data is currently available and then they shut
down.

In Development mode, the Delta Live Tables system eases the development process by

• Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when
development mode is enabled.
• Disabling pipeline retries so you can immediately detect and fix errors.

Reference:

https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html

Q.24. In multi-hop architecture, which of the following statements best describes the Bronze layer ?

Ans: It maintains raw data ingested from various sources

Overall explanation

Bronze tables contain data in its rawest format ingested from various sources (e.g., JSON files, operational databases, Kafka streams, ...)

Reference:

https://www.databricks.com/glossary/medallion-architecture

Q.25. Which of the following compute resources is available in Databricks SQL ?

Ans: SQL warehouses


Overall explanation

Compute resources are infrastructure resources that provide processing capabilities in the cloud. A SQL
warehouse is a compute resource that lets you run SQL commands on data objects within Databricks
SQL.

Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html

Q.26. Which of the following is the benefit of using the Auto Stop feature of Databricks SQL
warehouses ?

Ans: Minimizes the total running time of the warehouse

Overall explanation

The Auto Stop feature stops the warehouse if it’s idle for a specified number of minutes.

Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html

Q.27. A data engineer wants to increase the cluster size of an existing Databricks SQL warehouse.

Which of the following is the benefit of increasing the cluster size of Databricks SQL warehouses ?

Ans: Improves the latency of query execution

Overall explanation

Cluster Size represents the number of cluster workers and size of compute resources available to run
your queries and dashboards. To reduce query latency, you can increase the cluster size.

Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html#cluster-size-1

Q.28. Which of the following describes Cron syntax in Databricks Jobs ?

Ans: It's an expression to represent complex job schedules that can be defined programmatically

Overall explanation

To define a schedule for a Databricks job, you can either interactively specify the period and starting time, or write a Cron syntax expression. Cron syntax allows you to represent complex job schedules programmatically; Databricks uses Quartz cron expressions (for example, 0 0 9 * * ? runs a job every day at 9:00 AM).
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#schedule-a-job

Q.29. The data engineering team has a DLT pipeline that updates all the tables at defined intervals until manually stopped. The compute resources terminate when the pipeline is stopped.

Which of the following best describes the execution modes of this DLT pipeline ?

Ans: The DLT pipeline executes in Continuous Pipeline mode under Production mode.

Overall explanation

Continuous pipelines update tables continuously as input data changes. Once an update is started, it
continues to run until the pipeline is shut down.

In Production mode, the Delta Live Tables system:

• Terminates the cluster immediately when the pipeline is stopped.


• Restarts the cluster for recoverable errors (e.g., memory leak or stale credentials).
• Retries execution in case of specific errors (e.g., a failure to start a cluster)

Reference:

https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html

Q.30. Which of the following commands can a data engineer use to purge stale data files of a Delta
table?

Ans: VACUUM

Overall explanation

The VACUUM command deletes the unused data files older than a specified data retention period.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html

Q.31. In Databricks Repos (Git folders), which of the following operations can a data engineer use to save local changes of a repo to its remote repository?

Ans: Commit & Push

Overall explanation

Commit & Push is used to save the changes on a local repo, and then uploads this local repo content to
the remote repository.

References:

https://docs.databricks.com/repos/index.html

https://github.com/git-guides/git-push

Q.32. In Delta Lake tables, which of the following is the primary format for the transaction log files?

Ans: JSON

Overall explanation

Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Parquet format, along with transaction logs in JSON format.

Reference: https://docs.databricks.com/delta/index.html

Q.33. Which of the following locations completely hosts the customer data ?

Ans: Customer's cloud account

Overall explanation

According to the Databricks Lakehouse architecture, the storage account hosting the customer data is
provisioned in the data plane in the Databricks customer's cloud account.

Reference: https://docs.databricks.com/getting-started/overview.html
Q.34. A junior data engineer uses the built-in Databricks Notebooks versioning for source control. A
senior data engineer recommended using Databricks Repos (Git folders) instead.

Which of the following could explain why Databricks Repos is recommended instead of Databricks
Notebooks versioning?

Ans: Databricks Repos supports creating and managing branches for development work.

Overall explanation

One advantage of Databricks Repos over the built-in Databricks Notebooks versioning is that Databricks
Repos supports creating and managing branches for development work.

Reference: https://docs.databricks.com/repos/index.html

Q.35. Which of the following services provides a data warehousing experience to its users?

Ans: Databricks SQL

Overall explanation

Databricks SQL (DB SQL) is a data warehouse on the Databricks Lakehouse Platform that lets you run all
your SQL and BI applications at scale.

Reference: https://www.databricks.com/product/databricks-sql

Q.36. A data engineer noticed that there are unused data files in the directory of a Delta table. They
executed the VACUUM command on this table; however, only some of those unused data files have
been deleted.

Which of the following could explain why only some of the unused data files have been deleted after
running the VACUUM command ?

Ans: The deleted data files were older than the default retention threshold, while the remaining files are newer than the default retention threshold and cannot be deleted.

Overall explanation

Running the VACUUM command on a Delta table deletes the unused data files older than a specified
data retention period. Unused files newer than the default retention threshold are kept untouched.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html

Q.37. The data engineering team has a Delta table called products that contains products’ details
including the net price.

Which of the following code blocks will apply a 50% discount on all the products where the price is
greater than 1000 and save the new price to the table?

Ans: UPDATE products SET price = price * 0.5 WHERE price > 1000;

Overall explanation

The UPDATE statement is used to modify the existing records in a table that match the WHERE
condition. In this case, we are updating the products where the price is strictly greater than 1000.

Syntax:

UPDATE table_name

SET column_name = expr

WHERE condition

Reference:

https://docs.databricks.com/sql/language-manual/delta-update.html

Q.38. A data engineer wants to create a relational object by pulling data from two tables. The relational object will only be used in the current session. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

Ans: Temporary view

Overall explanation

In order to avoid copying and storing physical data, the data engineer must create a view object. A view in Databricks is a virtual table that has no physical data. It's just a saved SQL query against actual tables.

The view type should be Temporary view since it’s tied to a Spark session and dropped when the session
ends.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-view.html
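A minimal sketch (table and column names are hypothetical) of a session-scoped temporary view:

# Temporary view built from two tables; it disappears when the session ends
spark.sql("""
    CREATE TEMPORARY VIEW employee_salaries AS
    SELECT e.name, s.gross_salary
    FROM employees e
    JOIN salaries s ON e.id = s.employee_id
""")

# The same can be done from a DataFrame:
# df.createOrReplaceTempView("employee_salaries")
spark.sql("SELECT * FROM employee_salaries").show()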

Q.39. A data engineer has a database named db_hr, and they want to know where this database was
created in the underlying storage.

Which of the following commands can the data engineer use to complete this task?

Ans: DESCRIBE DATABASE db_hr

Overall explanation

The DESCRIBE DATABASE or DESCRIBE SCHEMA returns the metadata of an existing database (schema).
The metadata information includes the database’s name, comment, and location on the filesystem. If
the optional EXTENDED option is specified, database properties are also returned.

Syntax:

DESCRIBE DATABASE [ EXTENDED ] database_name

Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-describe-schema.html

Q.40. Which of the following commands can a data engineer use to register the table orders from an existing SQLite database?

Ans:

CREATE TABLE orders

USING org.apache.spark.sql.jdbc

OPTIONS (

url "jdbc:sqlite:/bookstore.db",

dbtable "orders"

Overall explanation
Using the JDBC library, Spark SQL can extract data from any existing relational database that supports
JDBC. Examples include MySQL, PostgreSQL, SQLite, and more.

Reference: https://learn.microsoft.com/en-us/azure/databricks/external-data/jdbc

Q.41. When dropping a Delta table, which of the following explains why both the table's metadata
and the data files will be deleted ?

Ans: The table is managed

Overall explanation

Managed tables are tables whose metadata and the data are managed by Databricks.

When you run DROP TABLE on a managed table, both the metadata and the underlying data files are
deleted.

Reference: https://docs.databricks.com/lakehouse/data-objects.html#what-is-a-managed-table

Q.42. Given the following commands:

CREATE DATABASE db_hr;

USE db_hr;

CREATE TABLE employees;

In which of the following locations will the employees table be located?

Ans: dbfs:/user/hive/warehouse/db_hr.db

Overall explanation

Since we are creating the database here without specifying a LOCATION clause, the database will be created in the default warehouse directory under dbfs:/user/hive/warehouse. The database folder has the (.db) extension.

And since we are also creating the table without specifying a LOCATION clause, the table becomes a managed table created under the database directory (in the db_hr.db folder).
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html

Q.43. Which of the following statements best describes the usage of CREATE SCHEMA command ?

Ans: It’s used to create a database

Overall explanation

CREATE SCHEMA is an alias for CREATE DATABASE statement. While usage of SCHEMA and DATABASE is
interchangeable, SCHEMA is preferred.

Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-database.html

Q.44. Given the following Structured Streaming query:

(spark.table("orders")

.withColumn("total_after_tax", col("total")+col("tax"))

.writeStream

.option("checkpointLocation", checkpointPath)

.outputMode("append")

.___________

.table("new_orders") )

Fill in the blank to make the query execute multiple micro-batches to process all available data, then stop the trigger.

Ans: trigger(availableNow=True)

Overall explanation

In Spark Structured Streaming, we use trigger(availableNow=True) to run the stream in batch mode
where it processes all available data in multiple micro-batches. The trigger will stop on its own once it
finishes processing the available data.
Reference: https://docs.databricks.com/structured-streaming/triggers.html#configuring-incremental-
batch-processing

Q.45. In multi-hop architecture, which of the following statements best describes the Silver layer
tables?

Ans: They provide a more refined view of raw data, where it’s filtered, cleaned, and enriched.

Overall explanation

Silver tables provide a more refined view of the raw data. For example, data can be cleaned and filtered
at this level. And we can also join fields from various bronze tables to enrich our silver records

Reference:

https://www.databricks.com/glossary/medallion-architecture

Q.46. The data engineering team has a DLT pipeline that updates all the tables at defined intervals until manually stopped. The compute resources of the pipeline continue running to allow for quick testing.

Which of the following best describes the execution modes of this DLT pipeline ?

Ans: The DLT pipeline executes in Continuous Pipeline mode under Development mode.

Overall explanation

Continuous pipelines update tables continuously as input data changes. Once an update is started, it
continues to run until the pipeline is shut down.

In Development mode, the Delta Live Tables system eases the development process by:

• Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when development mode is enabled.
• Disabling pipeline retries so you can immediately detect and fix errors.

Reference:

https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html

Q.47. Given the following Structured Streaming query:


(spark.readStream

.table("cleanedOrders")

.groupBy("productCategory")

.agg(sum("totalWithTax"))

.writeStream

.option("checkpointLocation", checkpointPath)

.outputMode("complete")

.table("aggregatedOrders")

Which of the following best describes the purpose of this query in a multi-hop architecture?

Ans: The query is performing a hop from Silver layer to a Gold table

Overall explanation

The above Structured Streaming query creates business-level aggregates from clean orders data in the
silver table cleanedOrders, and loads them in the gold table aggregatedOrders.

Reference:

https://www.databricks.com/glossary/medallion-architecture

Q.48. Given the following Structured Streaming query:

(spark.readStream

.table("orders")

.writeStream

.option("checkpointLocation", checkpointPath)

.table("Output_Table")

Which of the following is the trigger Interval for this query ?

Ans: Every half second


Overall explanation

By default, if you don't provide any trigger interval, the data will be processed every half second. This is equivalent to trigger(processingTime="500ms").

Reference: https://docs.databricks.com/structured-streaming/triggers.html#what-is-the-default-trigger-
interval

Q.49. In multi-hop architecture, which of the following statements best describes the Gold layer
tables?

Ans: They provide business-level aggregations that power analytics, machine learning, and production
applications

Overall explanation

Gold layer is the final layer in the multi-hop architecture, where tables provide business level aggregates
often used for reporting and dashboarding, or even for Machine learning.

Reference:

https://www.databricks.com/glossary/medallion-architecture

Q.50. The data engineering team has a DLT pipeline that updates all the tables once and then stops. The compute resources of the pipeline terminate when the pipeline is stopped.

Which of the following best describes the execution modes of this DLT pipeline ?

Ans: The DLT pipeline executes in Triggered Pipeline mode under Production mode.

Overall explanation

Triggered pipelines update each table with whatever data is currently available and then they shut
down.

In Production mode, the Delta Live Tables system:

• Terminates the cluster immediately when the pipeline is stopped.


• Restarts the cluster for recoverable errors (e.g., memory leak or stale credentials).
• Retries execution in case of specific errors (e.g., a failure to start a cluster)

Reference:

https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html
Q.51. A data engineer needs to determine whether to use Auto Loader or COPY INTO command in
order to load input data files incrementally.

In which of the following scenarios should the data engineer use Auto Loader over COPY INTO
command ?

Ans: If they are going to ingest files in the order of millions or more over time

Overall explanation

Here are a few things to consider when choosing between Auto Loader and COPY INTO command:

❖ If you’re going to ingest files in the order of thousands, you can use COPY INTO. If you are
expecting files in the order of millions or more over time, use Auto Loader.
❖ If your data schema is going to evolve frequently, Auto Loader provides better primitives around
schema inference and evolution.

Reference: https://docs.databricks.com/ingestion/index.html#when-to-use-copy-into-and-when-to-use-
auto-loader
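For illustration, a minimal Auto Loader sketch (the paths, file format, and table name are hypothetical):

# Incrementally ingest files from cloud storage using the cloudFiles source
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/landing/orders")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .table("bronze_orders"))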

Q.52. From which of the following locations can a data engineer set a schedule to automatically
refresh a Databricks SQL query ?

Ans: From the query's page in Databricks SQL

Overall explanation

In Databricks SQL, you can set a schedule to automatically refresh a query from the query's page.

Reference: https://docs.databricks.com/sql/user/queries/schedule-query.html
Q.53. Databricks provides a declarative ETL framework for building reliable and maintainable data
processing pipelines, while maintaining table dependencies and data quality.

Which of the following technologies is being described above?

Ans: Delta Live Tables

Overall explanation

Delta Live Tables is a framework for building reliable, maintainable, and testable data processing
pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task
orchestration, cluster management, monitoring, data quality, and error handling.

Reference: https://docs.databricks.com/workflows/delta-live-tables/index.html

Q.54. Which of the following services can a data engineer use for orchestration purposes in Databricks
platform ?

Ans: Databricks Jobs

Overall explanation

Databricks Jobs allow orchestrating data processing tasks. This means the ability to run and manage
multiple tasks as a directed acyclic graph (DAG) in a job.

Reference: https://docs.databricks.com/workflows/jobs/jobs.html

Q.55. A data engineer has a Job with multiple tasks that takes more than 2 hours to complete. In the
last run, the final task unexpectedly failed.

Which of the following actions can the data engineer perform to complete this Job Run while
minimizing the execution time ?

Ans: They can repair this Job Run so only the failed tasks will be re-executed

Overall explanation
You can repair failed multi-task jobs by running only the subset of unsuccessful tasks and any dependent
tasks. Because successful tasks are not re-run, this feature reduces the time and resources required to
recover from unsuccessful job runs.

Reference: https://docs.databricks.com/workflows/jobs/repair-job-failures.html

Q.56. A data engineering team has a multi-task Job in production. The team members need to be
notified in the case of job failure.

Which of the following approaches can be used to send emails to the team members in the case of job
failure ?

Ans: They can configure email notifications settings in the job page

Overall explanation

Databricks Jobs support email notifications to be notified in the case of job start, success, or failure.
Simply, click Edit email notifications from the details panel in the Job page. From there, you can add one
or more email addresses.

Q.57. For production jobs, which of the following cluster types is recommended to use?

Ans: Job clusters

Overall explanation

Job Clusters are dedicated clusters for a job or task run. A job cluster auto-terminates once the job is completed, which saves cost compared to all-purpose clusters.

In addition, Databricks recommends using job clusters in production so that each job runs in a fully
isolated environment.
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#choose-the-correct-cluster-type-for-
your-job

Q.58. In Databricks Jobs, which of the following approaches can a data engineer use to configure a
linear dependency between Task A and Task B ?

Ans: They can select Task A in the Depends On field of the Task B configuration

Overall explanation

You can define the order of execution of tasks in a job using the Depends on dropdown menu. You can
set this field to one or more tasks in the job.

Q.59. Which part of the Databricks Platform can a data engineer use to revoke permissions from users
on tables ?

Ans: Data Explorer

Overall explanation

Data Explorer in Databricks SQL allows you to manage data object permissions. This includes revoking
privileges on tables and databases from users or groups of users.

Reference: https://docs.databricks.com/security/access-control/data-acl.html#data-explorer

Q.60. In which of the following locations can a data engineer change the owner of a table?

Ans: In Data Explorer, from the Owner field in the table's page

Overall explanation
From Data Explorer in Databricks SQL, you can navigate to the table's page to review and change the
owner of the table. Simply, click on the Owner field, then Edit owner to set the new owner.

Reference: https://docs.databricks.com/security/access-control/data-acl.html#manage-data-object-
ownership

Q.61. What is Broadcasting in PySpark, and Why Is It Useful?

→ Explain how broadcasting optimizes joins by reducing data shuffling.

Ans: Broadcasting is a technique used in PySpark to optimize the performance of operations involving
small DataFrames. When a DataFrame is broadcasted, it is sent to all worker nodes and cached, ensuring
that each node has a full copy of the data. This eliminates the need to shuffle and exchange data
between nodes during operations, such as joins, significantly reducing the communication overhead and
improving performance.

• When to Use Broadcasting:

Broadcasting should be used when you have a small DataFrame that is used multiple times in your
processing pipeline, especially in join operations. Broadcasting the small DataFrame can significantly
improve performance by reducing the amount of data that needs to be exchanged between worker
nodes.

• Resource: https://www.sparkcodehub.com/broadcasting-dataframes-in-pyspark

When working with large datasets in PySpark, optimizing performance is crucial. Two powerful
optimization techniques are broadcast variables and broadcast joins. Let’s dive into what they are, when
to use them, and how they help improve performance with clear examples.

• Broadcast Variables:

What is a Broadcast Variable?

A broadcast variable allows you to efficiently share a small, read-only dataset across all executors in a
cluster. Instead of sending this data with every task, it is sent once from the driver to each executor,
minimizing network I/O and allowing tasks to access it locally.
When to Use a Broadcast Variable?

- Scenario: When you need to share small lookup data or configuration settings with all tasks in the
cluster.

- Optimization: Minimizes network I/O by sending the data once and caching it locally on each executor.

Example Code

Let’s say we have a small dictionary of country codes and names that we need to use in our
transformations.
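The original example code is not reproduced in these notes; a minimal sketch consistent with the description (the country codes and data are made up) could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-variable-example").getOrCreate()

# Small lookup dictionary of country codes -> country names
country_codes = {"IN": "India", "US": "United States", "DE": "Germany"}

# Broadcast it once from the driver; every executor gets a local read-only copy
bc_countries = spark.sparkContext.broadcast(country_codes)

df = spark.createDataFrame([(1, "IN"), (2, "US"), (3, "DE")], ["id", "code"])

@udf(StringType())
def code_to_name(code):
    # Each task reads the broadcasted dictionary locally; no shuffle is needed
    return bc_countries.value.get(code, "Unknown")

df.withColumn("country", code_to_name("code")).show()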
In the above example:

- We create and broadcast a dictionary of country codes.

- Each task accesses the broadcasted dictionary locally to convert country codes.

• Broadcast Joins:

What is a Broadcast Join?

A broadcast join optimizes join operations by broadcasting a small dataset to all executor nodes. This
allows each node to perform the join locally, reducing the need for shuffling large datasets across the
network.

When to Use a Broadcast Join?

- Scenario: When performing joins and one of the datasets is small enough to fit in memory.

- Optimization: Reduces shuffling and network I/O, making joins more efficient by enabling local join
operations.
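A small self-contained sketch of a broadcast join (the DataFrames and their contents are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

# A larger fact table and a small dimension table
orders = spark.createDataFrame(
    [(1, "IN", 100.0), (2, "US", 250.0)], ["order_id", "country_code", "amount"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["country_code", "country_name"])

# Broadcasting the small table lets every executor join locally, avoiding a shuffle
orders.join(broadcast(countries), on="country_code", how="left").show()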

Best Practices for broadcasting:

• Only broadcast small DataFrames: Broadcasting large DataFrames can cause performance
issues and consume a significant amount of memory on worker nodes. Make sure to only
broadcast DataFrames that are small enough to fit in the memory of each worker node.
• Monitor the performance: Keep an eye on the performance of your PySpark applications to
ensure that broadcasting is improving performance as expected. If you notice any performance
issues or memory problems, consider adjusting your broadcasting strategy or revisiting your
data processing pipeline.
• Consider alternative techniques: Broadcasting is not always the best solution for optimizing
performance. In some cases, you may achieve better results by repartitioning your DataFrames
or using other optimization techniques, such as bucketing or caching. Evaluate your specific use
case and choose the most appropriate technique for your needs.
• Be cautious with broadcasting in iterative algorithms: If you're using iterative algorithms, be
careful when broadcasting DataFrames, as the memory used by the broadcasted DataFrame
may not be released until the end of the application. This could lead to memory issues and
performance problems over time.

Q.62. What Makes Spark Superior to MapReduce?


→ Discuss Spark's in-memory processing and speed advantages over Hadoop MapReduce.

Ans: Spark processes data in memory and only spills to disk when necessary, whereas Hadoop MapReduce writes intermediate results to disk between every map and reduce phase. Combined with its DAG-based execution engine, lazy evaluation, and rich high-level APIs (DataFrames, Spark SQL, streaming, MLlib), this makes Spark significantly faster, especially for iterative and interactive workloads.

Q.63. Differentiate Between RDDs, DataFrames, and Datasets.

→ Compare their flexibility, performance, and optimization capabilities.

Ans: RDDs are the low-level, untyped distributed collection API; they offer the most flexibility but no built-in optimization. DataFrames organize data into named columns and are optimized by the Catalyst optimizer and the Tungsten execution engine, which generally gives better performance. Datasets (available in Scala and Java) add compile-time type safety on top of the DataFrame optimizations. In PySpark, the DataFrame API is the primary structured API.
Q.64. How Can You Create a DataFrame in PySpark?

→ Provide examples using spark.read and createDataFrame().

Ans: A DataFrame can be created either by reading data from a source with spark.read (CSV, JSON, Parquet, Delta, JDBC, etc.) or by building one from in-memory data with spark.createDataFrame().
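A short sketch of both approaches (the file path and schema are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-dataframe-example").getOrCreate()

# 1. From a file source with spark.read
df_csv = spark.read.option("header", "true").csv("/mnt/landing/employees.csv")

# 2. From in-memory data with createDataFrame()
data = [(1, "Alice", 3500), (2, "Bob", 2800)]
df_mem = spark.createDataFrame(data, ["id", "name", "salary"])

df_mem.show()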

Q.65. Why Is Partitioning Important in Spark?

→ Explain data distribution and its effect on parallelism and performance.

Ans: Efficient data processing in Apache Spark hinges on shuffle partitions, which directly impact performance and
cluster utilization.

Here's a quick breakdown:


Shuffling: Occurs during wide transformations (e.g., groupBy, join) to group related data across nodes.

Key Scenarios:
1. Large Data per Partition: Increase partitions to ensure each core handles an optimal workload (1MB–200MB).
2. Small Data per Partition: Reduce partitions or match them to the available cores for better utilization.

Common Challenges:
Data Skew: Uneven data distribution slows jobs.
Solutions: Use Adaptive Query Execution (AQE) or salting to balance partitions.

🎯 Why it Matters: Properly tuning shuffle partitions ensures faster job completion and optimal resource usage,
unlocking the true power of Spark!
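For illustration, a sketch of how shuffle partitions can be inspected and tuned (the values shown are arbitrary):

# The default number of shuffle partitions is 200
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", 64)

# Or let Adaptive Query Execution coalesce shuffle partitions automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# A DataFrame can also be explicitly repartitioned before a wide transformation
# df = df.repartition(64, "customer_id")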
Q.66. Why Is Caching Crucial in Spark? Share how caching improves performance by storing
intermediate results.

Ans: cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD
when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or
RDD in the memory of your cluster’s workers. Since cache() is a transformation, the caching operation
takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on
the same DataFrame, Dataset, or RDD in a single action.

Scenarios where caching is an optimal solution:

• Reusing Data: Caching is optimal when you need to perform multiple operations on the same
dataset to avoid reading from storage repeatedly.
• Frequent Subset Access: Useful for frequently accessing small subsets of a large dataset,
reducing the need to load the entire dataset repeatedly.

In Spark, caching is a mechanism for storing data in memory to speed up access to that data.
When you cache a dataset, Spark keeps the data in memory so that it can be quickly retrieved the
next time it is needed. Caching is especially useful when you need to perform multiple operations on
the same dataset, as it eliminates the need to read the data from a disk each time.

Understanding the concept of caching and how to use it effectively is crucial for optimizing the
performance of your Spark applications. By caching the right data at the right time, you can
significantly speed up your applications and make the most out of your Spark cluster.

To cache a dataset in Spark, you simply call the cache() method on the RDD or DataFrame. For
example, if you have an RDD called myRDD, you can cache it like this:

myRDD.cache()

• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet
data files in nodes’ local storage using a fast intermediate data format. The data is cached
automatically whenever a file has to be fetched from a remote location. Successive reads of
the same data are then performed locally, which results in significantly improved reading
speed. The cache works for all Parquet data files (including Delta Lake tables).

• Disk cache vs. Spark cache: The Databricks disk cache differs from Apache Spark caching.
Databricks recommends using automatic disk caching.

• The following table summarizes the key differences between disk and Apache Spark caching
so that you can choose the best tool for your workflow:
Resource: https://docs.databricks.com/en/optimizations/disk-cache.html

Q.67. What is Persistence in spark?

Ans: Alternate to cache(), you can use the persist() method to cache a dataset. The persist() method
allows you to specify the level of storage for the cached data, such as memory-only or disk-only storage.
For example, to cache an RDD in memory only, you can use the following code:

• myRDD.persist(StorageLevel.MEMORY_ONLY)

When you cache a dataset in Spark, you should be aware that it will occupy memory on the worker
nodes. If you have limited memory available, you may need to prioritize which datasets to cache based
on their importance to your processing workflow.

• Persistence is a related concept to caching in Spark. When you persist a dataset, you are telling
Spark to store the data on disk or in memory, or a combination of the two, so that it can be
retrieved quickly the next time it is needed.

The persist() method can be used to specify the level of storage for the persisted data. The available
storage levels include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK,
MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP. The MEMORY_ONLY and MEMORY_ONLY_SER
levels store the data in memory, while the MEMORY_AND_DISK and MEMORY_AND_DISK_SER levels
store the data in memory and on disk. The DISK_ONLY level stores the data on disk only, while the
OFF_HEAP level stores the data in off-heap memory.

To persist a dataset in Spark, you can use the persist() method on the RDD or DataFrame. For
example, if you have an RDD called myRDD, you can persist it in memory using the following code:

• myRDD.persist(StorageLevel.MEMORY_ONLY)

If you want to persist the data in memory and on disk, you can use the following code:

• myRDD.persist(StorageLevel.MEMORY_AND_DISK)

When you persist a dataset in Spark, the data will be stored in the specified storage level until you
explicitly remove it from memory or disk. You can remove a persisted dataset using the unpersist()
method. For example, to remove the myRDD dataset from memory, you can use the following code:

• myRDD.unpersist()

Difference between cache() and persist() methods:

❖ Using cache() and persist() methods, Spark provides an optimization mechanism to store the
intermediate computation of an RDD, DataFrame, and Dataset so they can be reused in
subsequent actions(reusing the RDD, Dataframe, and Dataset computation results).
❖ Both caching and persisting are used to save the Spark RDD, Dataframe, and Datasets. But, the
difference is, RDD cache() method default saves it to memory (MEMORY_ONLY) and, DataFrame
cache() method default saves it to memory (MEMORY_AND_DISK), whereas persist() method is
used to store it to the user-defined storage level.
❖ When you persist a dataset, each node stores its partitioned data in memory and reuses them in
other actions on that dataset. And Spark’s persisted data on nodes are fault-tolerant meaning if
any partition of a Dataset is lost, it will automatically be recomputed using the original
transformations that created it.

Caching vs. Persistence in PySpark: Which to Use and When?

Caching and persisting data in PySpark are techniques to store intermediate results, enabling faster access and efficient processing. Knowing when to use each can optimize performance, especially for iterative tasks or reuse of DataFrames within a job.

Why Cache or Persist Data?

1. Speeds Up Reuse of DataFrames: When a DataFrame is cached or persisted, it remains in memory (or specified storage) for quicker access, saving the time it takes to recompute the data.

2. Improves Iterative Operations: In machine learning or data transformations, where the same data is needed multiple times, caching/persisting minimizes redundant calculations.

3. Minimizes I/O Operations: By keeping data in memory, you avoid repeated disk I/O operations, which are costly in terms of time.

Difference Between Caching and Persistence

- Caching is a simplified form of persistence that defaults to storing data in memory (MEMORY_ONLY). It's often used when memory is not a constraint and you only need the data during the current Spark session.

- Persistence allows you to specify storage levels (e.g., MEMORY_AND_DISK, DISK_ONLY) and provides more control over where and how data is stored. This is useful for data too large to fit in memory.

When to Use Caching vs. Persistence

When to Cache?

- Use Caching for Quick Reuse: If you need to repeatedly access a DataFrame within the same job without changing its storage level, caching is efficient and straightforward.

When to Persist?

- Use Persistence for Storage Control: When the DataFrame is large or memory is limited, persisting allows you to specify storage levels like MEMORY_AND_DISK, which offloads part of the data to disk if memory is full.

Advantages of Caching and Persistence:

Below are the advantages of using Spark Cache and Persist methods.

❖ Cost efficient – Spark computations are very expensive; hence, reusing the computations are
used to save cost.
❖ Time efficient – Reusing repeated computations saves lots of time.
❖ Execution time – Saves execution time of the job and we can perform more jobs on the same
cluster.

Q.68. What do you understand by Spark Context, SQL Context and Spark Session?

Ans:
Spark Context, SQL Context & Spark Session:
In Apache Spark, SparkContext, SQLContext, and SparkSession are essential components for interacting with Spark. Each has its specific role in setting up the environment, managing resources, and enabling functionalities like querying, transforming data, and working with different APIs (RDD, DataFrame, Dataset).

Here’s a detailed explanation of each:

1. Spark Context:

• Role: SparkContext is the entry point for using the RDD (Resilient Distributed Dataset) API and
the core abstraction in Spark’s distributed computation. It allows for the creation of RDDs and
provides methods to access Spark’s capabilities, like resource allocation and job execution
across the cluster.

• Key Functions:
1. Resource Allocation: When a Spark job is submitted, SparkContext connects to the cluster
manager (like YARN, Mesos, or Kubernetes), which allocates resources like executors and
tasks.
2. RDD Operations: SparkContext is responsible for creating RDDs, distributing data, and
managing job execution on the distributed cluster.
3. Job Execution: Through SparkContext, transformations and actions applied to RDDs are
scheduled and executed across the cluster.
4. Limitations: SparkContext primarily supports RDDs, which are low level, making it difficult to
perform SQL-like operations and manipulate structured data easily.

2. SQL Context:

• Role: SQLContext was the original class introduced to work with structured data and to run
SQL queries on Spark. It allows users to interact with Spark DataFrames and execute SQL-like
queries on structured data.

• Key Functions:
1. DataFrame Creation: SQLContext allows for creating DataFrames from various data sources
like JSON, CSV, Parquet, etc.
2. SQL Queries: Users can run SQL queries on DataFrames using SQLContext. This gives Spark SQL
capabilities for querying structured and semi-structured data.
3. Integration with Hive: With HiveContext (a subclass of SQLContext), users could interact with
Hive tables and perform more complex SQL operations.

3. Spark Session:

• Role:
1. SparkSession is the new unified entry point for using all the features of Spark, including RDD,
DataFrame, and Dataset APIs. It consolidates different contexts like SparkContext,
SQLContext, and HiveContext into a single, more user-friendly object.
2. Introduced in Spark 2.0, SparkSession simplifies the user interface by managing SparkContext
internally. It is the primary point of entry to run Spark applications involving DataFrames and
SQL queries.

• Key Features:

1. Unified API: SparkSession combines the capabilities of SparkContext, SQLContext, and


HiveContext, allowing users to create DataFrames, run SQL queries, and access all Spark
functionalities through a single object.
2. DataFrame Operations: SparkSession provides an easy-to-use interface for working with
DataFrames and Datasets, which are more efficient and easier to use than RDDs for structured
and semi structured data.
3. Configuring Spark Properties: You can configure settings (like Spark configurations, execution
properties, etc.) within the SparkSession object.

• Advantages:

1. With SparkSession, you no longer need to instantiate separate objects for SQLContext or
HiveContext.
2. It simplifies the user experience and reduces the need for manually managing multiple
context objects.
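A minimal sketch of creating a SparkSession (on Databricks a session is already provided as the spark variable, so this is mainly relevant for standalone PySpark applications):

from pyspark.sql import SparkSession

# Unified entry point: replaces separate SparkContext / SQLContext / HiveContext usage
spark = (SparkSession.builder
         .appName("example-app")
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())

# The underlying SparkContext is still available when the RDD API is needed
sc = spark.sparkContext

df = spark.createDataFrame([(1, "a")], ["id", "value"])
df.createOrReplaceTempView("demo")
spark.sql("SELECT COUNT(*) FROM demo").show()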

Q.69. What are DAG, Jobs, Stages and Task in Spark Databricks?

Ans:
• DAG: Spark uses a DAG (Directed Acyclic Graph) scheduler, which schedules stages of tasks. For example, in a simple DAG:
Task 1 is the root task and does not depend on any other task.
Task 2 and Task 3 depend on Task 1 completing first.
Finally, Task 4 depends on Task 2 and Task 3 completing successfully.
Each DAG consists of stages, and each stage consists of transformations applied on RDDs. Each transformation generates tasks executed in parallel on the cluster nodes. Once this DAG is generated, the Spark scheduler is responsible for the execution of both transformations and actions across the cluster.

Directed Acyclic Graph (DAG): Represents the logical execution plan, showing task dependencies. The
DAG scheduler optimizes execution by breaking operations into stages and minimizing data shuffling.

• Jobs: A job in Spark refers to a sequence of transformations on data.


Whenever an action like count(), first(), collect(), and save() is called on RDD (Resilient Distributed
Datasets), a job is created. A job could be thought of as the total work that your Spark application needs
to perform, broken down into a series of steps.

Consider a scenario where you’re executing a Spark program, and you call the action count() to get the
number of elements. This will create a Spark job. If further in your program, you call collect(), another
job will be created. So, a Spark application could have multiple jobs, depending upon the number of
actions.

Each action creates a job. Jobs in Azure Databricks are used to schedule and run automated tasks.
These tasks can be notebook runs, Spark jobs, or arbitrary code executions. Jobs can be triggered on a
schedule or run in response to certain events, making it easy to automate workflows and periodic data
processing tasks.
• Stages: A stage in Spark represents a sequence of transformations that can be executed
in a single pass, i.e., without any shuffling of data. When a job is divided, it is split into stages.
Each stage comprises tasks, and all the tasks within a stage perform the same computation.
➢ A Stage is a collection of tasks that share the same shuffle dependencies, meaning that they
must exchange data with one another during execution.
➢ Stages are executed sequentially, with the output of one stage becoming the input to the next
stage.
➢ Stages are where wide transformations occur (e.g., groupBy(), repartition(), join())

The boundary between two stages is drawn when transformations cause data shuffling across partitions.
Transformations in Spark are categorized into two types: narrow and wide. Narrow transformations, like
map(), filter(), and union(), can be done within a single partition. But for wide transformations like
groupByKey(), reduceByKey(), or join(), data from all partitions may need to be combined, thus
necessitating shuffling and marking the start of a new stage.

Each wide transformation introduces a new stage boundary; for example, a job with two wide transformations is split into three stages.

• Task: A task is the smallest unit of work in Spark; each stage is broken into one task per partition, and
each task runs on one core. Tasks are where narrow transformations occur (e.g., union(), map(), filter())

Each time Spark needs to perform a shuffle of the data, it decides how many partitions the shuffled data
will have. The default value is 200 (controlled by spark.sql.shuffle.partitions). Therefore, after using groupBy(), which
requires a full data shuffle, the number of tasks in the following stage increases to 200.

It is quite common to see 200 tasks in one of your stages, specifically a stage that requires a
wide transformation. The reason is that wide transformations in Spark require a shuffle; operations
like join and groupBy are wide transformations and they trigger a shuffle.

By default, Spark creates 200 partitions whenever there is a need for a shuffle. Each partition is
processed by one task, so you end up with 200 tasks during that stage.
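
As a quick illustration (a sketch assuming an existing SparkSession named spark and a DataFrame df with a region column), you can inspect and tune the setting that drives this task count:

# Default number of shuffle partitions (200 unless changed)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it for small datasets so a groupBy does not fan out into 200 tiny tasks
spark.conf.set("spark.sql.shuffle.partitions", "64")

agg_df = df.groupBy("region").count()     # wide transformation -> shuffle -> new stage
print(agg_df.rdd.getNumPartitions())      # 64 shuffle partitions, i.e. 64 tasks in that stage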

====================================================

𝐐𝐮𝐞𝐬𝐭𝐢𝐨𝐧: How many Spark Jobs & Stages are involved in this activity?👇🏻
In Databricks, say you are reading a CSV file from a source, applying a filter transformation, counting the number of rows, and
then writing the resulting DataFrame to storage in Parquet format.

🔑𝐀𝐧𝐬𝐰𝐞𝐫:
Let's break the scenario down into steps:

✔️𝐒𝐭𝐞𝐩-𝟏 - 𝑹𝒆𝒂𝒅𝒊𝒏𝒈 𝒕𝒉𝒆 𝑭𝒊𝒍𝒆:


The file is read as an input, and no computation occurs at this stage.

✔️𝐒𝐭𝐞𝐩-𝟐- 𝑨𝒑𝒑𝒍𝒚𝒊𝒏𝒈 𝒕𝒉𝒆 𝑭𝒊𝒍𝒕𝒆𝒓:


This is a 𝐥𝐚𝐳𝐲 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 and does not execute until an action triggers the computation.

✔️𝐒𝐭𝐞𝐩-𝟑 - 𝑪𝒐𝒖𝒏𝒕𝒊𝒏𝒈 𝒕𝒉𝒆 𝑹𝒐𝒘𝒔:


The count() 𝐚𝐜𝐭𝐢𝐨𝐧 triggers a Spark job to process the filter and compute the row count.

✔️𝐒𝐭𝐞𝐩-𝟒 - 𝑾𝒓𝒊𝒕𝒊𝒏𝒈 𝒕𝒐 𝑷𝒂𝒓𝒒𝒖𝒆𝒕:


Writing to Parquet is another 𝐚𝐜𝐭𝐢𝐨𝐧, which triggers a separate job to execute the transformations and save the
output.

𝐉𝐎𝐁𝐒:
Here we have 2 actions:
One for the count() action.
One for the write() action.

𝐍𝐮𝐦𝐛𝐞𝐫 𝐨𝐟 𝐚𝐜𝐭𝐢𝐨𝐧𝐬= 𝐍𝐮𝐦𝐛𝐞𝐫 𝐨𝐟 𝐣𝐨𝐛𝐬


Hence 2 Jobs.

🎯𝐒𝐓𝐀𝐆𝐄𝐒:
Each job consists of one or more stages, depending on whether shuffle operations are involved.
Job 1 (Count): Typically one stage if there’s no shuffle.
Job 2 (Write): At least one stage for writing the output, but more stages if re-partitioning is required.
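
As a PySpark sketch of the scenario above (the paths and the amount column are placeholders for illustration):

df = spark.read.csv("/mnt/raw/sales.csv", header=True)      # defines the read; options like header/inferSchema may trigger a small file-inspection job
filtered = df.filter(df["amount"].cast("double") > 100)     # lazy transformation (cast because CSV columns default to string); nothing executes yet

print(filtered.count())                                     # action 1 -> Job 1: applies the filter and counts rows

filtered.write.mode("overwrite").parquet("/mnt/curated/sales_filtered")   # action 2 -> Job 2: re-applies the filter and writes Parquet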

Spark Architecture:

In the context of Spark, the Driver Program is responsible for coordinating and
executing jobs within the application. These are the main components in the Spark
Driver Program.
1. Spark Context - Entry point to the Spark Application, connects to Cluster Manager
2. DAG Scheduler - converts Jobs → Stages
3. Task Scheduler - converts Stages → Tasks
4. Block Manager - In the driver, it handles the data shuffle by maintaining the metadata
of the blocks in the cluster. In executors, it is responsible for caching, broadcasting and
shuffling the data.

====================================================
Resources:
https://mayankoracledba.wordpress.com/2022/10/05/apache-spark-understanding-spark-job-
stages-and-tasks/
Debugging with the Apache Spark UI:
https://docs.databricks.com/en/compute/troubleshooting/debugging-spark-ui.html
https://docs.databricks.com/en/jobs/run-if.html
https://sparkbyexamples.com/spark/what-is-spark-stage/
https://blog.det.life/decoding-jobs-stages-tasks-in-apache-spark-05c8b2b16114

Q.70. What are the Difference Between Number of Cores and Number of Executors?

Ans: When running a Spark job, two terms can often be confusing:

➢ Number of Cores and Number of Executors. Let's break it down:

• Number of Executors
1. Executors are Spark's workhorses.
2. Each executor is a JVM instance responsible for executing tasks.
3. Executors handle parallelism at the cluster level.

• Number of Cores
1. Cores determine the number of tasks an executor can run in parallel.
2. It represents CPU power allocated to the executor.
3. It controls parallelism within an executor.

• In simple terms:
1. Executors = How many workers you have.
2. Cores = How many hands each worker has to complete tasks (see the configuration sketch below).
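
For instance, a sizing sketch in code (the numbers are illustrative only; on Databricks these values are normally set through the cluster configuration rather than in the application):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SizingExample")
         .config("spark.executor.instances", "4")   # number of executors (workers)
         .config("spark.executor.cores", "5")       # parallel tasks per executor
         .config("spark.executor.memory", "8g")     # memory per executor
         .getOrCreate())

# Up to 4 executors * 5 cores = 20 tasks can run in parallel across the cluster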

Q.71. What are the differences between Partitioning and Bucketing in Big Data with PySpark for
performance optimization?

Ans:

A common point of confusion among data engineers is when to partition and when to bucket your data.
Here's a quick breakdown, with an example and PySpark code, to help you differentiate between these two
essential techniques for optimizing query performance on large datasets.

---

Partitioning:

Partitioning divides data into separate directories based on column values. It improves query
performance by pruning unnecessary partitions but can lead to too many small files if not managed
properly.

Use Case: Suppose you're analyzing transaction data by region. You can partition the data by the
region column to limit the data scanned for region-specific queries.

PySpark Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()

# Sample data

data = [("John", "North", 1000), ("Doe", "South", 1500), ("Jane", "East", 1200)]

columns = ["name", "region", "sales"]

df = spark.createDataFrame(data, columns)

# Write partitioned data


df.write.partitionBy("region").parquet("partitioned_data")

Querying for region='North' scans only the relevant partition.
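
As a follow-up sketch, reading the partitioned output back with a filter on the partition column lets Spark prune the other directories:

pruned = spark.read.parquet("partitioned_data").filter("region = 'North'")
pruned.explain()   # the physical plan lists PartitionFilters on `region`, so only the region=North folder is scanned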

---

Bucketing:

Bucketing divides data into fixed buckets based on a hash function applied to column values. Unlike
partitioning, bucketing doesn’t create physical directories; it stores data in files organized logically
within each partition.

Use Case: For evenly distributing data, such as customer IDs in a transaction table, bucketing ensures
better load balancing and reduces skewness during joins.

PySpark Example:

# Enable bucketing

df.write.bucketBy(4, "region").sortBy("sales").saveAsTable("bucketed_data")

# This creates 4 buckets for the `region` column

Note: Joins and aggregations on the region column will now benefit from bucketing, leading to faster
execution.

---

Key Difference

Partitioning: Organizes data into physical directories; great for pruning irrelevant data.

Bucketing: Groups data into logical buckets; ideal for improving join performance.

---

Both techniques are powerful tools for optimizing query execution in large-scale data pipelines.
Combine them strategically based on your data distribution and use case!

Q.72. : 𝐋𝐢𝐬𝐭 𝐨𝐟 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬 𝐚𝐧𝐝 𝐚𝐜𝐭𝐢𝐨𝐧𝐬 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞𝐬 𝐟𝐨𝐫 𝐚 𝐃𝐚𝐭𝐚
𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐫𝐨𝐥𝐞.

Ans: 𝐋𝐢𝐬𝐭 𝐨𝐟 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬 𝐚𝐧𝐝 𝐚𝐜𝐭𝐢𝐨𝐧𝐬 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞𝐬 𝐟𝐨𝐫 𝐚 𝐃𝐚𝐭𝐚
𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐫𝐨𝐥𝐞:
𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:

Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated,
meaning they do not execute immediately but build a logical plan that is executed when an action is
performed.

Transformations: Operations like map, filter, join, and groupBy that create new RDDs. They are lazily
evaluated, defining the computation plan.

𝟏. 𝐁𝐚𝐬𝐢𝐜 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:

𝐬𝐞𝐥𝐞𝐜𝐭(): Select specific columns.

𝐟𝐢𝐥𝐭𝐞𝐫(): Filter rows based on a condition.

𝐰𝐢𝐭𝐡𝐂𝐨𝐥𝐮𝐦𝐧():Add or replace a column.

𝐝𝐫𝐨𝐩(): Remove columns.

𝐰𝐡𝐞𝐫𝐞(𝐜𝐨𝐧𝐝𝐢𝐭𝐢𝐨𝐧): Equivalent to filter(condition).

𝐝𝐫𝐨𝐩(*𝐜𝐨𝐥𝐬): Returns a new DataFrame with columns dropped.

𝐝𝐢𝐬𝐭𝐢𝐧𝐜𝐭():Remove duplicate rows.

𝐬𝐨𝐫𝐭(): Sort the DataFrame by columns.

𝐨𝐫𝐝𝐞𝐫𝐁𝐲(): Order the DataFrame by columns.

𝟐. 𝐀𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐆𝐫𝐨𝐮𝐩𝐢𝐧𝐠:

𝐠𝐫𝐨𝐮𝐩𝐁𝐲(): Group rows by column values.

𝐚𝐠𝐠(): Aggregate data using functions.

𝐜𝐨𝐮𝐧𝐭(): Count rows.

𝐬𝐮𝐦(*𝐜𝐨𝐥𝐬):Computes the sum for each numeric column.

𝐚𝐯𝐠(*𝐜𝐨𝐥𝐬): Computes the average for each numeric column.

𝐦𝐢𝐧(*𝐜𝐨𝐥𝐬):Computes the minimum value for each column.

𝐦𝐚𝐱(*𝐜𝐨𝐥𝐬): Computes the maximum value for each column.


𝟑. 𝐉𝐨𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞𝐬:

𝐣𝐨𝐢𝐧(𝐨𝐭𝐡𝐞𝐫, 𝐨𝐧=𝐍𝐨𝐧𝐞, 𝐡𝐨𝐰=𝐍𝐨𝐧𝐞): Joins with another DataFrame using the given join expression.

𝐮𝐧𝐢𝐨𝐧(): Combine two DataFrames with the same schema.

𝐢𝐧𝐭𝐞𝐫𝐬𝐞𝐜𝐭(): Return common rows between DataFrames.

𝟒. 𝐀𝐝𝐯𝐚𝐧𝐜𝐞𝐝 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧𝐬:

𝐰𝐢𝐭𝐡𝐂𝐨𝐥𝐮𝐦𝐧𝐑𝐞𝐧𝐚𝐦𝐞𝐝(): Rename a column.

𝐝𝐫𝐨𝐩𝐃𝐮𝐩𝐥𝐢𝐜𝐚𝐭𝐞𝐬(): Drop duplicate rows based on columns.

𝐬𝐚𝐦𝐩𝐥𝐞(): Sample a fraction of rows.

𝐥𝐢𝐦𝐢𝐭(): Limit the number of rows.

𝟓. 𝐖𝐢𝐧𝐝𝐨𝐰 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬:

𝐨𝐯𝐞𝐫(𝐰𝐢𝐧𝐝𝐨𝐰𝐒𝐩𝐞𝐜): Defines a window specification for window functions.

𝐫𝐨𝐰_𝐧𝐮𝐦𝐛𝐞𝐫().𝐨𝐯𝐞𝐫(𝐰𝐢𝐧𝐝𝐨𝐰𝐒𝐩𝐞𝐜): Assigns a row number starting at 1 within a window partition.

rank().over(windowSpec): Provides the rank of rows within a window partition (a combined sketch follows below).
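
A short sketch tying these together (assuming a hypothetical df with dept and salary columns):

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank

windowSpec = Window.partitionBy("dept").orderBy("salary")            # window per department, ordered by salary

ranked = (df.withColumn("row_num", row_number().over(windowSpec))    # unique sequence within each partition
            .withColumn("rnk", rank().over(windowSpec)))             # rank, with gaps for ties
ranked.show()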

𝐀𝐜𝐭𝐢𝐨𝐧𝐬:

Actions trigger the execution of the transformations and return a result to the driver program or write
data to an external storage system.

Actions: Operations like count, collect, save, and reduce that trigger computation and return results to
the driver or write them to an output.

1. Basic Actions:

show(): Display the top rows of the DataFrame.

collect(): Return all rows as an array.


count(): Count the number of rows.

take(): Return the first N rows as an array.

first(): Return the first row.

head(): Return the first N rows.

2. Writing Data:

write(): Write the DataFrame to external storage.

write.mode(): Specify save mode (e.g., overwrite, append).

save(): Save the DataFrame to a specified path.

toJSON(): Convert the DataFrame to a JSON dataset.

3. Other Actions:

foreach(): Apply a function to each row.

foreachPartition(): Apply a function to each partition.

Q.73. What is 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄 𝗶𝗻 𝗔𝘇𝘂𝗿𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀, how to solve it?

Ans: 𝗦𝗼𝗹𝘃𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄 𝗶𝗻 𝗔𝘇𝘂𝗿𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀: 𝗖𝗮𝘂𝘀𝗲𝘀, 𝗗𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝗙𝗶𝘅𝗲𝘀 ?

Handling data skew is one of the most common challenges faced by data engineers when working with
distributed computing systems like 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀. Skew can severely degrade performance, leading to
longer job runtimes, increased costs, and potential failures.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄?


Data skew occurs when some partitions in a dataset are significantly larger than others, causing uneven
workload distribution across worker nodes. The result? A few nodes get overwhelmed while others
remain idle—leading to inefficient processing.

𝗖𝗼𝗺𝗺𝗼𝗻 𝗖𝗮𝘂𝘀𝗲𝘀 𝗼𝗳 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄:

1️⃣ 𝗜𝗺𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗱 𝗞𝗲𝘆𝘀: When specific keys appear much more frequently in the dataset.

2️⃣ 𝗟𝗮𝗿𝗴𝗲 𝗝𝗼𝗶𝗻 𝗜𝗺𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝘀: When one side of a join is heavily skewed.

3️⃣ 𝗪𝗶𝗱𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀: Operations like groupByKey or reduceByKey aggregate data by keys,
leading to uneven partition sizes.

𝗛𝗼𝘄 𝘁𝗼 𝗗𝗲𝘁𝗲𝗰𝘁 𝗦𝗸𝗲𝘄 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀:

1️⃣ 𝗧𝗮𝘀𝗸 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Use the Spark UI to identify stages where some tasks take significantly longer.

2️⃣ 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗦𝗶𝘇𝗲: Monitor data partition sizes and look for disproportionate partitioning.

3️⃣ 𝗝𝗼𝗯 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Set up monitoring tools like Azure Monitor or Ganglia to identify performance
bottlenecks.

𝗙𝗶𝘅𝗲𝘀 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄:

𝗦𝗮𝗹𝘁𝗶𝗻𝗴 𝗞𝗲𝘆𝘀: Append a random “salt” value to the keys during transformations (e.g., key_1,
key_2) to spread data evenly across partitions (see the sketch after this list).

𝗕𝗿𝗼𝗮𝗱𝗰𝗮𝘀𝘁 𝗝𝗼𝗶𝗻𝘀: For highly skewed joins, broadcast the smaller dataset to all nodes.

𝗖𝘂𝘀𝘁𝗼𝗺 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴: Define a custom partitioner to redistribute data more evenly.

𝗥𝗲𝗱𝘂𝗰𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗦𝗶𝘇𝗲: Avoid default partition numbers—optimize it based on data size.

𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗗𝗮𝘁𝗮 𝗟𝗮𝘆𝗼𝘂𝘁: Use file formats like Delta Lake with features like Z-order clustering to
improve query performance.
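
A minimal sketch of the first two fixes, assuming two hypothetical DataFrames: fact_df (large and skewed on customer_id) and dim_df (small):

from pyspark.sql.functions import broadcast, concat_ws, floor, rand

# 1) Broadcast join: ship the small dimension table to every executor so the large side is not shuffled
joined = fact_df.join(broadcast(dim_df), "customer_id")

# 2) Salting: spread each hot key across N sub-keys before a heavy aggregation
N = 10
salted = fact_df.withColumn(
    "salted_key",
    concat_ws("_", "customer_id", floor(rand() * N).cast("string")))
partial = salted.groupBy("salted_key").count()   # the work for a hot key is now spread over up to N tasks
# A second, cheap aggregation on the original key recombines the partial results
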
𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆

In distributed computing, addressing data skew is essential for optimizing job performance, reducing
costs, and ensuring the reliability of your workloads. As a data engineer, mastering these techniques will
set you apart in solving real-world scalability challenges.

Q.74. 𝐃𝐨 𝐲𝐨𝐮 𝐤𝐧𝐨𝐰 𝐭𝐡𝐞 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐭𝐡𝐞 𝐜𝐨𝐚𝐥𝐞𝐬𝐜𝐞() 𝐦𝐞𝐭𝐡𝐨𝐝 𝐚𝐧𝐝 𝐫𝐞𝐩𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧()
𝐢𝐧 𝐒𝐩𝐚𝐫𝐤?
Ans:

☑️ The coalesce() method takes the target number of partitions and combines the local partitions
available on the same worker node to achieve the target.
☑️ For example, let's assume you have a five-node cluster
and a dataframe with 12 partitions spread across those worker nodes.
Now you want to reduce the number of partitions to 5.
So you execute coalesce(5) on your dataframe.
Spark will then try to collapse the local partitions on each worker and reduce your partition count to 5.

You must learn the following things about coalesce:


1️⃣ Coalesce doesn't cause a shuffle/sort.
2️⃣ It will combine local partitions only.
3️⃣ So you should use coalesce when you want to reduce the number of partitions.
4️⃣ If you try to increase the number of partitions using coalesce, it will do nothing.
5️⃣ You must use repartition() to increase the number of partitions.
6️⃣ Coalesce can cause skewed partitions.
7️⃣ So try to avoid drastically decreasing the number of partitions with it.

You should also avoid using repartition() to reduce the number of partitions.

Why?
Because you can reduce your partitions using coalesce() without doing a shuffle.
You can also reduce your partition count using repartition(), but it will cost you a shuffle
operation.
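
A small sketch to see the difference (df is any existing DataFrame, here assumed to start with 12 partitions):

print(df.rdd.getNumPartitions())        # e.g. 12

reduced = df.coalesce(5)                # merges local partitions, no shuffle
print(reduced.rdd.getNumPartitions())   # 5

same = df.coalesce(24)                  # coalesce cannot increase the count: still 12
print(same.rdd.getNumPartitions())

wider = df.repartition(24)              # full shuffle, produces evenly sized partitions
print(wider.rdd.getNumPartitions())     # 24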

Q.75. What is Partitioning vs. Bucketing in Big Data with PySpark?

Ans: Understanding partitioning vs. bucketing in big data with PySpark for performance optimization.
A common point of confusion among data engineers is when to partition and when to bucket your data. Here's a
quick breakdown, with an example and PySpark code, to help you differentiate between these two essential techniques for
optimizing query performance on large datasets.

---
Partitioning:

Partitioning divides data into separate directories based on column values. It improves query performance by pruning unnecessary
partitions but can lead to too many small files if not managed properly.

Partitioning (creates folders): Use it when the cardinality of the column is low (e.g., Country, City, Transport Mode);
avoid it when the cardinality is high (e.g., EmpID, Aadhaar Card Numbers, Contact Numbers, Date).

- eg: low-cardinality ORDER_STATUS

- benefit: partition pruning (filtering) by skipping some partitions

Use Case: Suppose you're analyzing transaction data by region. You can partition the data by the region column to limit the data
scanned for region-specific queries.

PySpark Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()
# Sample data
data = [("John", "North", 1000), ("Doe", "South", 1500), ("Jane", "East", 1200)]
columns = ["name", "region", "sales"]

df = spark.createDataFrame(data, columns)

# Write partitioned data


df.write.partitionBy("region").parquet("partitioned_data")

Querying for region='North' scans only the relevant partition.

---
Bucketing (creates files): Use it when the cardinality of the column is high (e.g., EmpID, Aadhaar Card Numbers, Contact Numbers,
Date); avoid it when the cardinality is low (e.g., Country, City, Transport Mode).

Bucketing divides data into fixed buckets based on a hash function applied to column values. Unlike partitioning, bucketing doesn’t
create physical directories; it stores data in files organized logically within each partition.

Bucketing:

- e.g.: high-cardinality ORDER_ID

- benefit: join optimization between two big DataFrames

- the number of buckets is fixed; a deterministic HASH function sends records with the same keys to the same bucket

Use Case: For evenly distributing data, such as customer IDs in a transaction table, bucketing ensures better load balancing and
reduces skewness during joins.

Note: Bucketing is not supported on Delta tables; the alternative is Z-ORDER clustering.

PySpark Example:

# Enable bucketing
df.write.bucketBy(4, "region").sortBy("sales").saveAsTable("bucketed_data")

# This creates 4 buckets for the `region` column

Note: Joins and aggregations on the region column will now benefit from bucketing, leading to faster execution.

---
Key Difference

Partitioning: Organizes data into physical directories; great for pruning irrelevant data.

Bucketing: Groups data into logical buckets; ideal for improving join performance.

---
Both techniques are powerful tools for optimizing query execution in large-scale data pipelines. Combine them strategically based
on your data distribution and use case!
Q.76. What is z-ordering in Databricks?

Ans: If you're working with large datasets in Delta Lake, optimizing query performance is crucial.
Here's how Z-Ordering can help.

What is Z-Ordering?

Z-Ordering reorganizes your data files by clustering similar data together based on columns frequently
used in queries. It leverages a space-filling curve (the Z-order curve) to store related data closer
on disk, reducing unnecessary file scans.

How It Works:

Imagine you frequently query sales data in a retail analytics application:

- Query by store:

SELECT * FROM sales_transactions WHERE store_id = 'S1';

- Query by store and product:

SELECT * FROM sales_transactions WHERE store_id = 'S1' AND product_id = 'P1';

Without optimization, these queries scan the entire table, slowing down performance.

Enter Z-Ordering:

Optimize your table by clustering the data with Z-Ordering:

OPTIMIZE sales_transactions
ZORDER BY (store_id, product_id);

Why Use Z-Ordering?

1. Improved Data Skipping: Only the files relevant to the query are scanned.

2. Faster Query Performance: Fewer files are scanned for filtered queries (e.g., store_id = 'S1').

3. Resource Efficiency: Save on compute by reducing I/O.

Illustrative Results:

- Before Z-Ordering: 1000 files scanned, 2 minutes query time.

- After Z-Ordering: 100 files scanned, 20 seconds query time.

Best Practices:

- Use Z-Ordering for columns that appear in WHERE, JOIN, or GROUP BY clauses.

- Z-Ordering is most effective on high-cardinality columns that are frequently filtered on; avoid Z-Ordering columns that rarely appear in query predicates.

- Re-run Z-Ordering periodically after major data ingestion.


Real-world Applications: From sales reporting to inventory tracking, Z-Ordering is a game-changer
for retail analytics and beyond.

Start using Z-Ordering today and experience faster, more efficient queries in Databricks!

Q.77. What are different Execution Plans in Apache Spark?

Ans: Whenever we fire a query, Apache Spark generates an execution plan that describes how things will work
internally; essentially, the steps Spark will take while running the query. Spark tries to optimize this plan
for better performance and efficient resource management.

There are 4 types of plan Spark generates:

1. Unresolved Logical Plan / Parsed Logical Plan: In this phase Spark checks whether the query is
syntactically correct. Apart from syntax, it doesn't check anything else. If the syntax is not
correct, it throws a ParseException.

2. Resolved Logical Plan (Analyzed Logical Plan): In this phase Spark tries to resolve all the objects present in the query, such as
databases, tables, views, and columns. Spark maintains metadata called the catalog, which stores the details
of all these objects and is used to verify them when we run queries. If an object is not present, Spark
throws an AnalysisException (UnresolvedRelation).

3. Optimized Logical Plan: Once the query passes the first two phases, it enters this phase. Spark tries to create an
optimized plan: the query goes through a set of pre-configured and/or custom-defined rules in this layer to
optimize the query. In this phase Spark combines projections, optimizes filters and aggregations,
and applies predicate pushdown.

4. Physical Plan: This plan describes how the query will be physically executed on the cluster.
The Catalyst optimizer generates multiple physical plans, estimates each of them based on
projected execution time and resource consumption, and finally selects the most cost-effective plan. This plan is
used to generate RDD code, which then runs on the machines.
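
You can inspect all of these plans from PySpark with explain() (df here is a placeholder DataFrame with sales and region columns):

query = df.filter("sales > 100").groupBy("region").count()

# extended=True prints the Parsed, Analyzed (Resolved), Optimized Logical and Physical plans
query.explain(extended=True)

# mode="formatted" (Spark 3+) prints a more readable summary of the physical plan
query.explain(mode="formatted")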


Additional quick-prep prompts:

How Does Spark Achieve Fault Tolerance?

→ Dive into lineage, RDD immutability, and data recovery.

Data Processing & Optimization

6️⃣ What Role Does Spark SQL Play in Data Processing?

→ Discuss structured data handling and SQL querying capabilities.


7️⃣ How Does Spark Manage Memory Efficiently?

→ Talk about unified memory management and partitioning.

Advanced PySpark Concepts

1️⃣1️⃣ How Do You Handle Schema Evolution in PySpark?

→ Discuss evolving schemas in structured streaming and data ingestion.

1️⃣2️⃣ Explain Window Functions in PySpark.

→ Describe how to perform operations like ranking and aggregation over partitions.

1️⃣4️⃣ Which Version Control Tools Do You Use?

→ Highlight experience with Git, Bitbucket, or similar platforms.

1️⃣5⃣ How Do You Test Your Spark Code?

→ Mention unit testing with PyTest, mocking data, and integration testing.

Performance & Optimization

1️⃣6️⃣ What Is Shuffling, and How Can You Minimize It?

→ Discuss the impact of shuffling and strategies to reduce it (partitioning, broadcasting).

1️⃣7️⃣ How Does Spark Ensure Fault Tolerance? (Repeated for Emphasis)
→ Dive deeper into RDD lineage and data recovery.

1️⃣8️⃣ Why Is Caching in Spark So Significant? (Repeated for Emphasis)

→ Reinforce understanding of memory optimization.

1️⃣9️⃣ Explain Broadcast Variables in Detail. (Repeated for Emphasis)

→ Provide real-world use cases for broadcasting.

2️⃣0⃣ How Does Spark SQL Enhance Data Processing? (Repeated for Emphasis)

→ Discuss schema inference and performance tuning.
