PySpark and Azure Data Engineer Free Notes
75+ Interview Questions
version 6
Q.1. Which of the following commands can a data engineer use to compact small data files of a Delta
table into larger ones?
Ans: OPTIMIZE
Overall explanation
Delta Lake can improve the speed of read queries from a table. One way to improve this speed is by
compacting small files into larger ones. You trigger compaction by running the OPTIMIZE command.
Reference: https://docs.databricks.com/sql/language-manual/delta-optimize.html
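PySpark example (a minimal sketch, run from a notebook where spark is already defined; the table name employees and the Z-order column are placeholders):
# Compact the small data files of the Delta table into larger ones
spark.sql("OPTIMIZE employees")
# Optionally co-locate related data by a frequently filtered column while compacting
spark.sql("OPTIMIZE employees ZORDER BY (country)")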
Q.2. A data engineer is trying to use Delta time travel to rollback a table to a previous version, but the
data engineer received an error that the data files are no longer present.
Which of the following commands was run on the table that caused deleting the data files?
Ans: VACUUM
Overall explanation
Running the VACUUM command on a Delta table deletes the unused data files older than a specified
data retention period. As a result, you lose the ability to time travel back to any version older than that
retention threshold.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Q.3. In Delta Lake tables, which of the following is the primary format for the data files?
Ans: Parquet
Overall explanation
Delta Lake builds upon standard data formats. Delta lake table gets stored on the storage in one or more
data files in Parquet format, along with transaction logs in JSON format.
Reference: https://docs.databricks.com/delta/index.html
Q.4. Which of the following locations hosts the Databricks web application?
Ans: Control plane
Overall explanation
According to the Databricks Lakehouse architecture, Databricks workspace is deployed in the control
plane along with Databricks services like Databricks web application (UI), Cluster manager, workflow
service, and notebooks.
Reference: https://docs.databricks.com/getting-started/overview.html
Q.5. In Databricks Repos, which of the following operations can a data engineer use to update the
local version of a repo from its remote Git repository?
Ans: Pull
Overall explanation
The git Pull operation is used to fetch and download content from a remote repository and immediately
update the local repository to match that content.
References:
https://docs.databricks.com/repos/index.html
https://github.com/git-guides/git-pull
Q.6. According to the Databricks Lakehouse architecture, which of the following is located in the
customer's cloud account?
Ans: Cluster virtual machines
Overall explanation
When the customer sets up a Spark cluster, the cluster virtual machines are deployed in the data plane
in the customer's cloud account.
Reference: https://docs.databricks.com/getting-started/overview.html
Q.7. Describe Databricks Lakehouse?
Ans: Single, flexible, high-performance system that supports data, analytics, and machine learning
workloads.
Overall explanation
Databricks Lakehouse is a unified analytics platform that combines the best elements of data lakes and
data warehouses. So, in the Lakehouse, you can work on data engineering, analytics, and AI, all in one
platform.
Reference: https://www.databricks.com/glossary/data-lakehouse
Q.8. If the default notebook language is SQL, which of the following options can a data engineer use to
run Python code in this SQL notebook?
Ans: They can add %python at the start of the cell
Overall explanation
By default, cells use the default language of the notebook. You can override the default language in a
cell by using the language magic command at the beginning of a cell. The supported magic commands
are: %python, %sql, %scala, and %r.
Reference: https://docs.databricks.com/notebooks/notebooks-code.html
Q.9. Which of the following tasks is not supported by Databricks Repos, and must be performed in
your Git provider?
Ans: Delete branches
Overall explanation
The following tasks are not supported by Databricks Repos, and must be performed in your Git provider:
Q.10. Which of the following statements is Not true about Delta Lake ?
Ans: Delta Lake builds upon standard data formats: Parquet + XML
Overall explanation
It is not true that Delta Lake builds upon XML format. It builds upon Parquet and JSON formats
Reference: https://docs.databricks.com/delta/index.html
Q.11. How long is the default retention period of the VACUUM command ?
Ans: 7 days
Overall explanation
By default, the retention threshold of the VACUUM command is 7 days. This means that the VACUUM
operation will prevent you from deleting files less than 7 days old, to ensure that no long-running
operations are still referencing any of the files to be deleted.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
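PySpark example (a short illustration from a notebook where spark is already defined; employees is a placeholder table name, and 168 hours is simply the 7-day default written explicitly):
# Delete unused data files older than the default 7-day retention period
spark.sql("VACUUM employees")
# The same call with the retention threshold stated explicitly (7 days = 168 hours)
spark.sql("VACUUM employees RETAIN 168 HOURS")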
Q.12. The data engineering team has a Delta table called employees that contains the employees'
personal information including their gross salaries.
Which of the following code blocks will keep in the table only the employees having a salary greater
than 3000?
Ans: DELETE FROM employees WHERE salary <= 3000;
Overall explanation
In order to keep only the employees having a salary greater than 3000, we must delete the employees
having a salary less than or equal to 3000. To do so, use the DELETE statement:
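PySpark sketch of that statement, run through spark.sql in a notebook (table and column names come from the question):
# Keep only employees earning more than 3000 by deleting everyone else
spark.sql("DELETE FROM employees WHERE salary <= 3000")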
Reference: https://docs.databricks.com/sql/language-manual/delta-delete-from.html
Q.13. A data engineer wants to create a relational object by pulling data from two tables. The
relational object must be used by other data engineers in other sessions on the same cluster only. In
order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
Ans: Global Temporary view
Overall explanation
In order to avoid copying and storing physical data, the data engineer must create a view object. A view
in databricks is a virtual table that has no physical data. It’s just a saved SQL query against actual tables.
The view type should be Global Temporary view that can be accessed in other sessions on the same
cluster. Global Temporary views are tied to a cluster temporary database called global_temp.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-view.html
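PySpark example (a brief sketch, assuming two placeholder source tables employees and departments and a notebook where spark is already defined):
# Create a global temporary view joining the two tables; no physical data is copied
spark.sql("""
    CREATE OR REPLACE GLOBAL TEMPORARY VIEW employee_departments AS
    SELECT e.id, e.name, d.department_name
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
""")
# Other sessions on the same cluster read it through the global_temp database
spark.sql("SELECT * FROM global_temp.employee_departments").show()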
Q.14. Fill in the below blank to successfully create a table in Databricks using data from an existing
PostgreSQL database:
CREATE TABLE employees
USING ____________
OPTIONS (
  url "jdbc:postgresql:dbserver",
  dbtable "employees"
)
Ans: org.apache.spark.sql.jdbc
Overall explanation
Using the JDBC library, Spark SQL can extract data from any existing relational database that supports
JDBC. Examples include mysql, postgres, SQLite, and more.
Reference: https://learn.microsoft.com/en-us/azure/databricks/external-data/jdbc
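For completeness, the same connection expressed with the PySpark DataFrame reader; the URL, table name, and credentials below are placeholders:
# Read the employees table from PostgreSQL over JDBC into a DataFrame
employees_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql:dbserver")
    .option("dbtable", "employees")
    .option("user", "db_user")          # placeholder credentials
    .option("password", "db_password")
    .load())
employees_df.show(5)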
Q.15. Which of the following commands can a data engineer use to create a new table along with a
comment ?
Overall explanation
The CREATE TABLE clause supports adding a descriptive comment for the table. This allows for easier
discovery of table contents.
Syntax:
CREATE TABLE table_name
COMMENT "table_comment"
AS query
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
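A short example through spark.sql; the table names and comment text are purely illustrative:
spark.sql("""
    CREATE TABLE payments
    COMMENT 'This table contains sensitive payment information'
    AS SELECT * FROM bank_transactions
""")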
Q.16. A junior data engineer usually uses INSERT INTO command to write data into a Delta table. A
senior data engineer suggested using another command that avoids writing of duplicate records.
Which of the following commands is the one suggested by the senior data engineer ?
Ans: MERGE INTO (not APPLY CHANGES INTO or UPDATE or COPY INTO)
MERGE INTO allows you to merge a set of updates, insertions, and deletions based on a source table into a
target Delta table. With MERGE INTO, you can avoid inserting duplicate records when writing into
Delta tables.
References:
https://docs.databricks.com/sql/language-manual/delta-merge-into.html
https://docs.databricks.com/delta/merge.html#data-deduplication-when-writing-into-delta-tables
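A condensed sketch of de-duplicating inserts with MERGE INTO, assuming a target Delta table customers and a staging view customers_updates keyed by customer_id (all names are placeholders):
spark.sql("""
    MERGE INTO customers AS target
    USING customers_updates AS source
    ON target.customer_id = source.customer_id
    WHEN NOT MATCHED THEN INSERT *
""")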
Q.17. A data engineer is designing a Delta Live Tables pipeline. The source system generates files
containing changes captured in the source data. Each change event has metadata indicating whether
the specified record was inserted, updated, or deleted, in addition to a timestamp column indicating
the order in which the changes happened. The data engineer needs to update a target table based on
these change events.
Which of the following commands can the data engineer use to best solve this problem?
Ans: APPLY CHANGES INTO
Overall explanation
The events described in the question represent Change Data Capture (CDC) feed. CDC is logged at the
source as events that contain both the data of the records along with metadata information:
Operation column indicating whether the specified record was inserted, updated, or deleted
Sequence column that is usually a timestamp indicating the order in which the changes happened
You can use the APPLY CHANGES INTO statement to use Delta Live Tables CDC functionality
Reference: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cdc.html
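A rough sketch of the Delta Live Tables Python API for the same idea; it only runs inside a DLT pipeline, and the table, key, and column names here are assumptions, so treat it as illustrative rather than a drop-in solution:
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("customers_target")

dlt.apply_changes(
    target="customers_target",
    source="customers_cdc_feed",          # hypothetical CDC source table
    keys=["customer_id"],
    sequence_by=col("event_timestamp"),   # ordering column from the feed
    apply_as_deletes=expr("operation = 'DELETE'"),
)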
Q.18. In PySpark, which of the following commands can you use to query the Delta table employees
created in Spark SQL?
Ans: spark.table("employees")
Overall explanation
spark.table() function returns the specified Spark SQL table as a PySpark DataFrame
Reference: https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/session.html#SparkSession.table
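A two-line illustration in a notebook where spark is already defined (table name from the question):
# Load the Delta table registered in the metastore as a PySpark DataFrame
employees_df = spark.table("employees")
employees_df.show(5)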
Q.19. When dropping a Delta table, which of the following explains why only the table's metadata will
be deleted, while the data files will be kept in the storage?
Ans: The table is an external (unmanaged) table
Overall explanation
External (unmanaged) tables are tables whose data is stored in an external storage path by using a
LOCATION clause.
When you run DROP TABLE on an external table, only the table's metadata is deleted, while the
underlying data files are kept.
Reference: https://docs.databricks.com/lakehouse/data-objects.html#what-is-an-unmanaged-table
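A compact sketch of creating an external table; the storage path below is a placeholder:
spark.sql("""
    CREATE TABLE sales_external
    USING DELTA
    LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/data/sales'
""")
# Dropping it later removes only the metastore entry; the files under LOCATION remain
spark.sql("DROP TABLE sales_external")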
Q.20. Given the two tables students_course_1 and students_course_2. Which of the following
commands can a data engineer use to get all the students from the above two tables without
duplicate records ?
Ans: UNION
Overall explanation
With UNION, you can return the result of subquery1 plus the rows of subquery2
Syntax:
subquery1
UNION [ ALL | DISTINCT ]
subquery2
If DISTINCT is specified the result does not contain any duplicate rows. This is the default.
Note that both subqueries must have the same number of columns and share a least common type for
each respective column.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-select-setops.html
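A small sketch using the two tables from the question, in a notebook where spark is already defined:
all_students_df = spark.sql("""
    SELECT * FROM students_course_1
    UNION
    SELECT * FROM students_course_2
""")
all_students_df.show()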
Q.21. A data engineer creates a new database without specifying a LOCATION clause. In which location
will the database be created?
Ans: dbfs:/user/hive/warehouse
Overall explanation
Since we are creating the database here without specifying a LOCATION clause, the database will be
created in the default warehouse directory under dbfs:/user/hive/warehouse
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
Q.22. Fill in the below blank to get the students enrolled in less than 3 courses from array column
students
SELECT
faculty_id,
students,
___________ AS few_courses_students
FROM faculties
Ans: FILTER (students, i -> i.total_courses < 3)
Overall explanation
filter(input_array, lambda_function) is a higher-order function that returns an output array from an input
array by extracting elements for which the predicate of a lambda function holds.
Example:
SELECT filter(array(1, 2, 3, 4), i -> i % 2 == 1);
output: [1, 3]
References:
https://docs.databricks.com/sql/language-manual/functions/filter.html
https://docs.databricks.com/optimizations/higher-order-lambda-functions.html
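A runnable sketch of the same pattern, assuming each element of the students array is a struct with a total_courses field (the sample schema and data are made up):
faculties_df = spark.createDataFrame(
    [(1, [{"name": "Amal", "total_courses": 2}, {"name": "Omar", "total_courses": 5}])],
    "faculty_id INT, students ARRAY<STRUCT<name: STRING, total_courses: INT>>",
)
faculties_df.createOrReplaceTempView("faculties")
spark.sql("""
    SELECT faculty_id,
           students,
           FILTER(students, i -> i.total_courses < 3) AS few_courses_students
    FROM faculties
""").show(truncate=False)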
Q.23. The data engineer team has a DLT pipeline that updates all the tables once and then stops. The
compute resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline ?
Ans: The DLT pipeline executes in Triggered Pipeline mode under Development mode.
Overall explanation
Triggered pipelines update each table with whatever data is currently available and then they shut
down.
In Development mode, the Delta Live Tables system eases the development process by:
• Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when
development mode is enabled.
• Disabling pipeline retries so you can immediately detect and fix errors.
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html
Q.24. In multi-hop architecture, which of the following statements best describes the Bronze layer?
Ans: It maintains raw data ingested from various sources
Overall explanation
Bronze tables contain data in its rawest format ingested from various sources (e.g., JSON files,
operational databases, Kafka streams, ...)
Reference:
https://www.databricks.com/glossary/medallion-architecture
Q.25. Which of the following compute resources can a data engineer use to run SQL queries on data
objects within Databricks SQL?
Ans: SQL warehouses
Overall explanation
Compute resources are infrastructure resources that provide processing capabilities in the cloud. A SQL
warehouse is a compute resource that lets you run SQL commands on data objects within Databricks
SQL.
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html
Q.26. Which of the following is the benefit of using the Auto Stop feature of Databricks SQL
warehouses?
Ans: It minimizes costs by stopping the warehouse when it is idle
Overall explanation
The Auto Stop feature stops the warehouse if it’s idle for a specified number of minutes.
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html
Q.27. A data engineer wants to increase the cluster size of an existing Databricks SQL warehouse.
Which of the following is the benefit of increasing the cluster size of Databricks SQL warehouses?
Ans: It reduces the latency of the queries
Overall explanation
Cluster Size represents the number of cluster workers and size of compute resources available to run
your queries and dashboards. To reduce query latency, you can increase the cluster size.
Reference: https://docs.databricks.com/sql/admin/sql-endpoints.html#cluster-size-1
Q.28. In Databricks Jobs, which of the following statements best describes the Cron syntax used for
scheduling?
Ans: It's an expression to represent a complex job schedule that can be defined programmatically
Overall explanation
To define a schedule for a Databricks job, you can either interactively specify the period and starting
time, or write a Cron syntax expression. The Cron syntax allows you to represent a complex job schedule
that can be defined programmatically.
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#schedule-a-job
Q.29. The data engineer team has a DLT pipeline that updates all the tables at defined intervals until
manually stopped. The compute resources terminate when the pipeline is stopped.
Which of the following best describes the execution modes of this DLT pipeline ?
Ans: The DLT pipeline executes in Continuous Pipeline mode under Production mode.
Overall explanation
Continuous pipelines update tables continuously as input data changes. Once an update is started, it
continues to run until the pipeline is shut down.
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html
Q.30. Which of the following commands can a data engineer use to purge stale data files of a Delta
table?
Ans: VACUUM
Overall explanation
The VACUUM command deletes the unused data files older than a specified data retention period.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Q.31. In Databricks Repos (Git folders), which of the following operations can a data engineer use to
save local changes of a repo to its remote repository?
Ans: Commit & Push
Overall explanation
Commit & Push is used to save the changes on a local repo, and then uploads this local repo content to
the remote repository.
References:
https://docs.databricks.com/repos/index.html
https://github.com/git-guides/git-push
Q.32. In Delta Lake tables, which of the following is the primary format for the transaction log files?
Ans: JSON
Overall explanation
Delta Lake builds upon standard data formats. Delta lake table gets stored on the storage in one or more
data files in Parquet format, along with transaction logs in JSON format.
Reference: https://docs.databricks.com/delta/index.html
Q.33. Which of the following locations completely hosts the customer data?
Ans: The customer's cloud account (data plane)
Overall explanation
According to the Databricks Lakehouse architecture, the storage account hosting the customer data is
provisioned in the data plane in the Databricks customer's cloud account.
Reference: https://docs.databricks.com/getting-started/overview.html
Q.34. A junior data engineer uses the built-in Databricks Notebooks versioning for source control. A
senior data engineer recommended using Databricks Repos (Git folders) instead.
Which of the following could explain why Databricks Repos is recommended instead of Databricks
Notebooks versioning?
Ans: Databricks Repos supports creating and managing branches for development work.
Overall explanation
One advantage of Databricks Repos over the built-in Databricks Notebooks versioning is that Databricks
Repos supports creating and managing branches for development work.
Reference: https://docs.databricks.com/repos/index.html
Q.35. Which of the following services provides a data warehousing experience to its users?
Ans: Databricks SQL
Overall explanation
Databricks SQL (DB SQL) is a data warehouse on the Databricks Lakehouse Platform that lets you run all
your SQL and BI applications at scale.
Reference: https://www.databricks.com/product/databricks-sql
Q.36. A data engineer noticed that there are unused data files in the directory of a Delta table. They
executed the VACUUM command on this table; however, only some of those unused data files have
been deleted.
Which of the following could explain why only some of the unused data files have been deleted after
running the VACUUM command ?
Ans: The deleted data files were older than the default retention threshold, while the remaining files
are newer than the default retention threshold and cannot be deleted.
Overall explanation
Running the VACUUM command on a Delta table deletes the unused data files older than a specified
data retention period. Unused files newer than the default retention threshold are kept untouched.
Reference: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Q.37. The data engineering team has a Delta table called products that contains products’ details
including the net price.
Which of the following code blocks will apply a 50% discount on all the products where the price is
greater than 1000 and save the new price to the table?
Ans: UPDATE products SET price = price * 0.5 WHERE price > 1000;
Overall explanation
The UPDATE statement is used to modify the existing records in a table that match the WHERE
condition. In this case, we are updating the products where the price is strictly greater than 1000.
Syntax:
UPDATE table_name
SET column = value [, ...]
WHERE condition
Reference:
https://docs.databricks.com/sql/language-manual/delta-update.html
Q.38. A data engineer wants to create a relational object by pulling data from two tables. The
relational object will only be used in the current session. In order to save on storage costs, the data
engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
Ans: Temporary view
Overall explanation
In order to avoid copying and storing physical data, the data engineer must create a view object. A view
in databricks is a virtual table that has no physical data. It’s just a saved SQL query against actual tables.
The view type should be Temporary view since it’s tied to a Spark session and dropped when the session
ends.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-view.html
Q.39. A data engineer has a database named db_hr, and they want to know where this database was
created in the underlying storage.
Which of the following commands can the data engineer use to complete this task?
Ans: DESCRIBE DATABASE db_hr
Overall explanation
The DESCRIBE DATABASE or DESCRIBE SCHEMA returns the metadata of an existing database (schema).
The metadata information includes the database’s name, comment, and location on the filesystem. If
the optional EXTENDED option is specified, database properties are also returned.
Syntax:
DESCRIBE DATABASE [ EXTENDED ] database_name
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-describe-schema.html
Q.40. Which of the following commands can a data engineer use to register the table orders from an
existing SQLite database?
Ans:
CREATE TABLE orders
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlite:/bookstore.db",
  dbtable "orders"
)
Overall explanation
Using the JDBC library, Spark SQL can extract data from any existing relational database that supports
JDBC. Examples include mysql, postgres, SQLite, and more.
Reference: https://learn.microsoft.com/en-us/azure/databricks/external-data/jdbc
Q.41. When dropping a Delta table, which of the following explains why both the table's metadata
and the data files will be deleted?
Ans: The table is a managed table
Overall explanation
Managed tables are tables whose metadata and the data are managed by Databricks.
When you run DROP TABLE on a managed table, both the metadata and the underlying data files are
deleted.
Reference: https://docs.databricks.com/lakehouse/data-objects.html#what-is-a-managed-table
Q.42. A data engineer creates a database db_hr without specifying a LOCATION clause, runs USE db_hr;,
and then creates a table, also without specifying a LOCATION clause. In which location will the table be
created?
Ans: dbfs:/user/hive/warehouse/db_hr.db
Overall explanation
Since we are creating the database here without specifying a LOCATION clause, the database will be
created in the default warehouse directory under dbfs:/user/hive/warehouse. The database folder has
the extension (.db).
And since we are creating the table also without specifying a LOCATION clause, the table becomes a
managed table created under the database directory (in the db_hr.db folder).
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
Q.43. Which of the following statements best describes the usage of the CREATE SCHEMA command?
Ans: It's an alias for the CREATE DATABASE command
Overall explanation
CREATE SCHEMA is an alias for CREATE DATABASE statement. While usage of SCHEMA and DATABASE is
interchangeable, SCHEMA is preferred.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-database.html
Q.44. Given the following Structured Streaming query:
(spark.table("orders")
.withColumn("total_after_tax", col("total")+col("tax"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.___________
.table("new_orders") )
Fill in the blank to make the query execute multiple micro-batches to process all available data, then
stop the trigger.
Ans: trigger(availableNow=True)
Overall explanation
In Spark Structured Streaming, we use trigger(availableNow=True) to run the stream in batch mode
where it processes all available data in multiple micro-batches. The trigger will stop on its own once it
finishes processing the available data.
Reference: https://docs.databricks.com/structured-streaming/triggers.html#configuring-incremental-batch-processing
Q.45. In multi-hop architecture, which of the following statements best describes the Silver layer
tables?
Ans: They provide a more refined view of raw data, where it’s filtered, cleaned, and enriched.
Overall explanation
Silver tables provide a more refined view of the raw data. For example, data can be cleaned and filtered
at this level. And we can also join fields from various bronze tables to enrich our silver records
Reference:
https://www.databricks.com/glossary/medallion-architecture
Q.46. The data engineer team has a DLT pipeline that updates all the tables at defined intervals until
manually stopped. The compute resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline ?
Ans: The DLT pipeline executes in Continuous Pipeline mode under Development mode.
Overall explanation
Continuous pipelines update tables continuously as input data changes. Once an update is started, it
continues to run until the pipeline is shut down.
In Development mode, the Delta Live Tables system eases the development process by:
• Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when
development mode is enabled.
• Disabling pipeline retries so you can immediately detect and fix errors.
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html
.table("cleanedOrders")
.groupBy("productCategory")
.agg(sum("totalWithTax"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.table("aggregatedOrders")
Which of the following best describe the purpose of this query in a multi-hop architecture?
Ans: The query is performing a hop from Silver layer to a Gold table
Overall explanation
The above Structured Streaming query creates business-level aggregates from clean orders data in the
silver table cleanedOrders, and loads them in the gold table aggregatedOrders.
Reference:
https://www.databricks.com/glossary/medallion-architecture
Q.48. Given the following Structured Streaming query:
(spark.readStream
.table("orders")
.writeStream
.option("checkpointLocation", checkpointPath)
.table("Output_Table")
By default, if you don’t provide any trigger interval, the data will be processed every half second. This is
equivalent to trigger (processingTime=”500ms")
Reference: https://docs.databricks.com/structured-streaming/triggers.html#what-is-the-default-trigger-interval
Q.49. In multi-hop architecture, which of the following statements best describes the Gold layer
tables?
Ans: They provide business-level aggregations that power analytics, machine learning, and production
applications
Overall explanation
Gold layer is the final layer in the multi-hop architecture, where tables provide business level aggregates
often used for reporting and dashboarding, or even for Machine learning.
Reference:
https://www.databricks.com/glossary/medallion-architecture
Q.50. The data engineer team has a DLT pipeline that updates all the tables once and then stops. The
compute resources of the pipeline terminate when the pipeline is stopped.
Which of the following best describes the execution modes of this DLT pipeline ?
Ans: The DLT pipeline executes in Triggered Pipeline mode under Production mode.
Overall explanation
Triggered pipelines update each table with whatever data is currently available and then they shut
down.
Reference:
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html
Q.51. A data engineer needs to determine whether to use Auto Loader or COPY INTO command in
order to load input data files incrementally.
In which of the following scenarios should the data engineer use Auto Loader over COPY INTO
command ?
Ans: If they are going to ingest files in the order of millions or more over time
Overall explanation
Here are a few things to consider when choosing between Auto Loader and COPY INTO command:
❖ If you’re going to ingest files in the order of thousands, you can use COPY INTO. If you are
expecting files in the order of millions or more over time, use Auto Loader.
❖ If your data schema is going to evolve frequently, Auto Loader provides better primitives around
schema inference and evolution.
Reference: https://docs.databricks.com/ingestion/index.html#when-to-use-copy-into-and-when-to-use-auto-loader
Q.52. From which of the following locations can a data engineer set a schedule to automatically
refresh a Databricks SQL query?
Ans: From the query's page in Databricks SQL
Overall explanation
In Databricks SQL, you can set a schedule to automatically refresh a query from the query's page.
Reference: https://docs.databricks.com/sql/user/queries/schedule-query.html
Q.53. Which Databricks service provides a declarative ETL framework for building reliable and
maintainable data processing pipelines, while maintaining table dependencies and data quality?
Ans: Delta Live Tables (DLT)
Overall explanation
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing
pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task
orchestration, cluster management, monitoring, data quality, and error handling.
Reference: https://docs.databricks.com/workflows/delta-live-tables/index.html
Q.54. Which of the following services can a data engineer use for orchestration purposes in the
Databricks platform?
Ans: Databricks Jobs (Workflows)
Overall explanation
Databricks Jobs allow orchestrating data processing tasks. This means the ability to run and manage
multiple tasks as a directed acyclic graph (DAG) in a job.
Reference: https://docs.databricks.com/workflows/jobs/jobs.html
Q.55. A data engineer has a Job with multiple tasks that takes more than 2 hours to complete. In the
last run, the final task unexpectedly failed.
Which of the following actions can the data engineer perform to complete this Job Run while
minimizing the execution time ?
Ans: They can repair this Job Run so only the failed tasks will be re-executed
Overall explanation
You can repair failed multi-task jobs by running only the subset of unsuccessful tasks and any dependent
tasks. Because successful tasks are not re-run, this feature reduces the time and resources required to
recover from unsuccessful job runs.
Reference: https://docs.databricks.com/workflows/jobs/repair-job-failures.html
Q.56. A data engineering team has a multi-tasks Job in production. The team members need to be
notified in the case of job failure.
Which of the following approaches can be used to send emails to the team members in the case of job
failure ?
Ans: They can configure email notifications settings in the job page
Overall explanation
Databricks Jobs support email notifications to be notified in the case of job start, success, or failure.
Simply, click Edit email notifications from the details panel in the Job page. From there, you can add one
or more email addresses.
Q.57. For production jobs, which of the following cluster types is recommended to use?
Ans: Job clusters
Overall explanation
Job Clusters are dedicated clusters for a job or task run. A job cluster auto terminates once the job is
completed, which saves cost compared to all-purpose clusters.
In addition, Databricks recommends using job clusters in production so that each job runs in a fully
isolated environment.
Reference: https://docs.databricks.com/workflows/jobs/jobs.html#choose-the-correct-cluster-type-for-your-job
Q.58. In Databricks Jobs, which of the following approaches can a data engineer use to configure a
linear dependency between Task A and Task B ?
Ans: They can select the Task A in the Depends On field of the Task B configuration
Overall explanation
You can define the order of execution of tasks in a job using the Depends on dropdown menu. You can
set this field to one or more tasks in the job.
Q.59. Which part of the Databricks Platform can a data engineer use to revoke permissions from users
on tables?
Ans: Data Explorer in Databricks SQL
Overall explanation
Data Explorer in Databricks SQL allows you to manage data object permissions. This includes revoking
privileges on tables and databases from users or groups of users.
Reference: https://docs.databricks.com/security/access-control/data-acl.html#data-explorer
Q.60. In which of the following locations can a data engineer change the owner of a table?
Ans: In Data Explorer, from the Owner field in the table's page
Overall explanation
From Data Explorer in Databricks SQL, you can navigate to the table's page to review and change the
owner of the table. Simply, click on the Owner field, then Edit owner to set the new owner.
Reference: https://docs.databricks.com/security/access-control/data-acl.html#manage-data-object-ownership
Q.61. What is broadcasting in PySpark, and when should you use it?
Ans: Broadcasting is a technique used in PySpark to optimize the performance of operations involving
small DataFrames. When a DataFrame is broadcasted, it is sent to all worker nodes and cached, ensuring
that each node has a full copy of the data. This eliminates the need to shuffle and exchange data
between nodes during operations, such as joins, significantly reducing the communication overhead and
improving performance.
Broadcasting should be used when you have a small DataFrame that is used multiple times in your
processing pipeline, especially in join operations. Broadcasting the small DataFrame can significantly
improve performance by reducing the amount of data that needs to be exchanged between worker
nodes.
• Resource: https://www.sparkcodehub.com/broadcasting-dataframes-in-pyspark
When working with large datasets in PySpark, optimizing performance is crucial. Two powerful
optimization techniques are broadcast variables and broadcast joins. Let’s dive into what they are, when
to use them, and how they help improve performance with clear examples.
• Broadcast Variables:
A broadcast variable allows you to efficiently share a small, read-only dataset across all executors in a
cluster. Instead of sending this data with every task, it is sent once from the driver to each executor,
minimizing network I/O and allowing tasks to access it locally.
When to Use a Broadcast Variable?
- Scenario: When you need to share small lookup data or configuration settings with all tasks in the
cluster.
- Optimization: Minimizes network I/O by sending the data once and caching it locally on each executor.
Example Code
Let’s say we have a small dictionary of country codes and names that we need to use in our
transformations.
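A minimal sketch of that scenario (the sample data and column names are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("BroadcastVariableExample").getOrCreate()

# Small read-only lookup data shared with every executor
country_codes = {"IN": "India", "US": "United States", "DE": "Germany"}
broadcast_codes = spark.sparkContext.broadcast(country_codes)

orders_df = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "DE")], ["order_id", "country_code"]
)

# Each task reads the broadcast dictionary locally instead of shipping it with every task
@udf(returnType=StringType())
def to_country_name(code):
    return broadcast_codes.value.get(code, "Unknown")

orders_df.withColumn("country", to_country_name("country_code")).show()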
In the above example, each task accesses the broadcast dictionary locally to convert country codes,
with no extra data exchanged between nodes.
• Broadcast Joins:
A broadcast join optimizes join operations by broadcasting a small dataset to all executor nodes. This
allows each node to perform the join locally, reducing the need for shuffling large datasets across the
network.
- Scenario: When performing joins and one of the datasets is small enough to fit in memory.
- Optimization: Reduces shuffling and network I/O, making joins more efficient by enabling local join
operations.
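A short sketch of a broadcast join, reusing the hypothetical orders_df from the previous example together with a small lookup DataFrame:
from pyspark.sql.functions import broadcast

countries_df = spark.createDataFrame(
    [("IN", "India"), ("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)
# Hint Spark to broadcast the small side so each executor joins locally, avoiding a shuffle
joined_df = orders_df.join(broadcast(countries_df), on="country_code", how="left")
joined_df.explain()  # the physical plan should show a BroadcastHashJoin
The best-practice notes below apply to both broadcast variables and broadcast joins.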
• Only broadcast small DataFrames: Broadcasting large DataFrames can cause performance
issues and consume a significant amount of memory on worker nodes. Make sure to only
broadcast DataFrames that are small enough to fit in the memory of each worker node.
• Monitor the performance: Keep an eye on the performance of your PySpark applications to
ensure that broadcasting is improving performance as expected. If you notice any performance
issues or memory problems, consider adjusting your broadcasting strategy or revisiting your
data processing pipeline.
• Consider alternative techniques: Broadcasting is not always the best solution for optimizing
performance. In some cases, you may achieve better results by repartitioning your DataFrames
or using other optimization techniques, such as bucketing or caching. Evaluate your specific use
case and choose the most appropriate technique for your needs.
• Be cautious with broadcasting in iterative algorithms: If you're using iterative algorithms, be
careful when broadcasting DataFrames, as the memory used by the broadcasted DataFrame
may not be released until the end of the application. This could lead to memory issues and
performance problems over time.
Q.64. How Can You Create a DataFrame in PySpark?
Ans: You can create a DataFrame from an in-memory collection with spark.createDataFrame(), by reading
files or tables with spark.read (CSV, JSON, Parquet, Delta, JDBC, etc.), from an existing RDD, or by
converting a pandas DataFrame.
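A few common ways, shown as a brief sketch (the file path and table name are placeholders):
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("CreateDataFrameExamples").getOrCreate()

# 1. From an in-memory collection with an explicit column list
people_df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

# 2. From a file on storage (placeholder path)
csv_df = spark.read.option("header", True).csv("/tmp/people.csv")

# 3. From an existing table registered in the metastore (placeholder table)
table_df = spark.table("employees")

# 4. From a pandas DataFrame
pandas_df = pd.DataFrame({"name": ["Carol"], "age": [41]})
from_pandas_df = spark.createDataFrame(pandas_df)

people_df.show()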
Q.65. What are shuffle partitions in Spark, and how do you tune them?
Ans: Efficient data processing in Apache Spark hinges on shuffle partitions, which directly impact
performance and cluster utilization.
Key Scenarios:
1 Large Data per Partition: Increase partitions to ensure each core handles an optimal workload (1MB–200MB).
2 Small Data per Partition: Reduce partitions or match them to the available cores for better utilization.
Common Challenges:
Data Skew: Uneven data distribution slows jobs.
Solutions: Use Adaptive Query Execution (AQE) or salting to balance partitions.
🎯 Why it Matters: Properly tuning shuffle partitions ensures faster job completion and optimal resource usage,
unlocking the true power of Spark!
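A brief sketch of the relevant settings in a notebook where spark is already defined; the values are illustrative, not recommendations:
# Number of partitions produced after wide transformations (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "400")
# Let Adaptive Query Execution coalesce small shuffle partitions and mitigate skew at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")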
Q.66. Why Is Caching Crucial in Spark? Share how caching improves performance by storing
intermediate results.
Ans: cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD
when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or
RDD in the memory of your cluster’s workers. Since cache() is a transformation, the caching operation
takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on
the same DataFrame, Dataset, or RDD in a single action.
• Reusing Data: Caching is optimal when you need to perform multiple operations on the same
dataset to avoid reading from storage repeatedly.
• Frequent Subset Access: Useful for frequently accessing small subsets of a large dataset,
reducing the need to load the entire dataset repeatedly.
In Spark, caching is a mechanism for storing data in memory to speed up access to that data.
When you cache a dataset, Spark keeps the data in memory so that it can be quickly retrieved the
next time it is needed. Caching is especially useful when you need to perform multiple operations on
the same dataset, as it eliminates the need to read the data from a disk each time.
Understanding the concept of caching and how to use it effectively is crucial for optimizing the
performance of your Spark applications. By caching the right data at the right time, you can
significantly speed up your applications and make the most out of your Spark cluster.
To cache a dataset in Spark, you simply call the cache() method on the RDD or DataFrame. For
example, if you have an RDD called myRDD, you can cache it like this:
myRDD.cache()
• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet
data files in nodes’ local storage using a fast intermediate data format. The data is cached
automatically whenever a file has to be fetched from a remote location. Successive reads of
the same data are then performed locally, which results in significantly improved reading
speed. The cache works for all Parquet data files (including Delta Lake tables).
• Disk cache vs. Spark cache: The Databricks disk cache differs from Apache Spark caching.
Databricks recommends using automatic disk caching.
• The key differences: the disk cache stores local copies of remote Parquet data files on the workers'
local disks and is applied automatically the first time a file is read, while the Apache Spark cache can
store any DataFrame or RDD (in memory and optionally on disk) and must be triggered explicitly with
cache() or persist(); the disk cache is evicted automatically, whereas the Spark cache is released
manually with unpersist().
Resource: https://docs.databricks.com/en/optimizations/disk-cache.html
Q.67. What is the difference between cache() and persist() in Spark?
Ans: As an alternative to cache(), you can use the persist() method to cache a dataset. The persist() method
allows you to specify the level of storage for the cached data, such as memory-only or disk-only storage.
For example, to cache an RDD in memory only, you can use the following code:
• myRDD.persist(StorageLevel.MEMORY_ONLY)
When you cache a dataset in Spark, you should be aware that it will occupy memory on the worker
nodes. If you have limited memory available, you may need to prioritize which datasets to cache based
on their importance to your processing workflow.
• Persistence is a related concept to caching in Spark. When you persist a dataset, you are telling
Spark to store the data on disk or in memory, or a combination of the two, so that it can be
retrieved quickly the next time it is needed.
The persist() method can be used to specify the level of storage for the persisted data. The available
storage levels include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK,
MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP. The MEMORY_ONLY and MEMORY_ONLY_SER
levels store the data in memory, while the MEMORY_AND_DISK and MEMORY_AND_DISK_SER levels
store the data in memory and on disk. The DISK_ONLY level stores the data on disk only, while the
OFF_HEAP level stores the data in off-heap memory.
To persist a dataset in Spark, you can use the persist() method on the RDD or DataFrame. For
example, if you have an RDD called myRDD, you can persist it in memory using the following code:
• myRDD.persist(StorageLevel.MEMORY_ONLY)
If you want to persist the data in memory and on disk, you can use the following code:
• myRDD.persist(StorageLevel.MEMORY_AND_DISK)
When you persist a dataset in Spark, the data will be stored in the specified storage level until you
explicitly remove it from memory or disk. You can remove a persisted dataset using the unpersist()
method. For example, to remove the myRDD dataset from memory, you can use the following code:
• myRDD.unpersist()
❖ Using the cache() and persist() methods, Spark provides an optimization mechanism to store the
intermediate computation of an RDD, DataFrame, or Dataset so it can be reused in
subsequent actions (reusing the RDD, DataFrame, and Dataset computation results).
❖ Both caching and persisting are used to save the Spark RDD, DataFrame, and Dataset. The
difference is that the RDD cache() method saves it to memory (MEMORY_ONLY) by default, the DataFrame
cache() method saves it to memory and disk (MEMORY_AND_DISK) by default, whereas the persist() method
lets you store it at a user-defined storage level.
❖ When you persist a dataset, each node stores its partitioned data in memory and reuses them in
other actions on that dataset. And Spark’s persisted data on nodes are fault-tolerant meaning if
any partition of a Dataset is lost, it will automatically be recomputed using the original
transformations that created it.
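A self-contained sketch of persisting a DataFrame at an explicit storage level (the orders table is a placeholder):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders_df = spark.table("orders")
# Keep the DataFrame in memory, spilling to disk if it does not fit
orders_df.persist(StorageLevel.MEMORY_AND_DISK)
orders_df.count()      # first action materializes and stores the data
orders_df.show(10)     # later actions reuse the persisted result
orders_df.unpersist()  # release memory/disk when no longer needed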
Caching and persisting data in PySpark are techniques to store intermediate results, enabling faster
access and efficient processing. Knowing when to use each can optimize performance, especially for
iterative tasks or reuse of DataFrames within a job.
1. Speeds Up Repeated Access: cached data is read from memory instead of being recomputed or
re-read from storage each time it is needed.
2. Improves Iterative Operations: In machine learning or data transformations, where the same
data is needed multiple times, caching/persisting minimizes redundant calculations.
3. Minimizes I/O Operations: By keeping data in memory, you avoid repeated disk I/O operations,
which are costly in terms of time.
- Persistence allows you to specify storage levels (e.g., MEMORY_AND_DISK, DISK_ONLY) and
provides more control over where and how data is stored. This is useful for data too large to fit in
memory.
When to Cache?
- Use Caching for Quick Reuse: If you need to repeatedly access a DataFrame within the same job
without changing its storage level, caching is efficient and straightforward.
When to Persist?
- Use Persistence for Storage Control: When the DataFrame is large or memory is limited,
persisting allows you to specify storage levels like MEMORY_AND_DISK, which offloads part of the
data to disk if memory is full.
Below are the advantages of using Spark Cache and Persist methods.
❖ Cost efficient – Spark computations are very expensive; hence, reusing the computations
saves cost.
❖ Time efficient – Reusing repeated computations saves lots of time.
❖ Execution time – Saves execution time of the job and we can perform more jobs on the same
cluster.
Q.68. What do you understand by Spark Context, SQL Context and Spark Session?
Ans:
Spark context, Sql Context & Spark session:
In Apache Spark, Spark Context and SQL Context are essential components for interacting with Spark.
Each has its specific role in setting up the environment, managing resources, and enabling
functionalities like querying, transforming data, and working with different APIs (RDD, DataFrame,
Dataset).
1. Spark Context:
• Role: SparkContext is the entry point for using the RDD (Resilient Distributed Dataset) API and
the core abstraction in Spark’s distributed computation. It allows for the creation of RDDs and
provides methods to access Spark’s capabilities, like resource allocation and job execution
across the cluster.
• Key Functions:
1. Resource Allocation: When a Spark job is submitted, SparkContext connects to the cluster
manager (like YARN, Mesos, or Kubernetes), which allocates resources like executors and
tasks.
2. RDD Operations: SparkContext is responsible for creating RDDs, distributing data, and
managing job execution on the distributed cluster.
3. Job Execution: Through SparkContext, transformations and actions applied to RDDs are
scheduled and executed across the cluster.
4. Limitations: SparkContext primarily supports RDDs, which are low level, making it difficult to
perform SQL-like operations and manipulate structured data easily.
2. SQL Context:
• Role: SQLContext was the original class introduced to work with structured data and to run
SQL queries on Spark. It allows users to interact with Spark DataFrames and execute SQL-like
queries on structured data.
• Key Functions:
1. DataFrame Creation: SQLContext allows for creating DataFrames from various data sources
like JSON, CSV, Parquet, etc.
2. SQL Queries: Users can run SQL queries on DataFrames using SQLContext. This gives Spark SQL
capabilities for querying structured and semi-structured data.
3. Integration with Hive: With HiveContext (a subclass of SQLContext), users could interact with
Hive tables and perform more complex SQL operations.
3. Spark Session:
• Role:
1. SparkSession is the new unified entry point for using all the features of Spark, including RDD,
DataFrame, and Dataset APIs. It consolidates different contexts like SparkContext,
SQLContext, and HiveContext into a single, more user-friendly object.
2. Introduced in Spark 2.0, SparkSession simplifies the user interface by managing SparkContext
internally. It is the primary point of entry to run Spark applications involving DataFrames and
SQL queries.
• Key Features:
• Advantages:
1. With SparkSession, you no longer need to instantiate separate objects for SQLContext or
HiveContext.
2. It simplifies the user experience and reduces the need for manually managing multiple
context objects.
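A small sketch showing how SparkSession wraps the older entry points:
from pyspark.sql import SparkSession

# Single unified entry point (Spark 2.0+)
spark = SparkSession.builder.appName("EntryPointsDemo").getOrCreate()

# The underlying SparkContext is still available for RDD work
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())

# DataFrame/SQL work goes directly through the session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("demo")
spark.sql("SELECT COUNT(*) AS n FROM demo").show()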
Q.69. What are DAG, Jobs, Stages and Task in Spark Databricks?
Ans:
• DAG: Spark uses a DAG (Directed Acyclic Graph) scheduler, which schedules stages of
tasks.
For example, consider a simple DAG of four tasks: Task 1 is the root task and does not depend on any other task.
Task 2 and Task 3 depend on Task 1 completing first.
Finally, Task 4 depends on Task 2 and Task 3 completing successfully.
Each DAG consists of stages, and each stage consists of transformations applied on RDDs. Each
transformation generates tasks executed in parallel on the cluster nodes. Once this DAG is generated,
the Spark scheduler is responsible for executing both transformations and actions across the cluster.
Directed Acyclic Graph (DAG): Represents the logical execution plan, showing task dependencies. The
DAG scheduler optimizes execution by breaking operations into stages and minimizing data shuffling.
• Jobs: Consider a scenario where you’re executing a Spark program, and you call the action count() to get the
number of elements. This will create a Spark job. If further in your program, you call collect(), another
job will be created. So, a Spark application could have multiple jobs, depending upon the number of
actions.
Each action creates a job. Jobs in Azure Databricks are also used to schedule and run automated tasks.
These tasks can be notebook runs, Spark jobs, or arbitrary code executions. Jobs can be triggered on a
schedule or run in response to certain events, making it easy to automate workflows and periodic data
processing tasks.
• Stages: A stage in Spark represents a sequence of transformations that can be executed
in a single pass, i.e., without any shuffling of data. When a job is divided, it is split into stages.
Each stage comprises tasks, and all the tasks within a stage perform the same computation.
➢ A Stage is a collection of tasks that share the same shuffle dependencies, meaning that they
must exchange data with one another during execution.
➢ Stages are executed sequentially, with the output of one stage becoming the input to the next
stage.
➢ Stages are where wide transformations occur (e.g.,groupBy(), repartition(), join())
The boundary between two stages is drawn when transformations cause data shuffling across partitions.
Transformations in Spark are categorized into two types: narrow and wide. Narrow transformations, like
map(), filter(), and union(), can be done within a single partition. But for wide transformations like
groupByKey(), reduceByKey(), or join(), data from all partitions may need to be combined, thus
necessitating shuffling and marking the start of a new stage.
Each wide transformation introduces a new stage boundary; for example, a job with two wide transformations has three stages.
• Task: A task processes a single partition, so the number of tasks in a stage equals the number of
partitions (one task runs per core at a time). Tasks are where narrow transformations occur (e.g., union(), map(), filter()).
Each time Spark needs to perform a shuffle of the data, it decides how many partitions the shuffled
RDD will have. The default value is 200. Therefore, after using groupBy(), which requires a full data
shuffle, the number of tasks increases to 200.
It is quite common to see 200 tasks in one of your stages and more specifically at a stage which requires
wide transformation. The reason for this is, wide transformations in Spark requires a shuffle. Operations
like join, group by etc. are wide transform operations and they trigger a shuffle.
By default, Spark creates 200 partitions whenever there is a need for shuffle. Each partition will be
processed by a task. So, you will end up with 200 tasks during execution.
====================================================
Question: How many Spark jobs and stages are involved in the following activity?
In Databricks, say you are reading a CSV file from a source, doing a filter transformation, counting the
number of rows, and then writing the result DataFrame to storage in Parquet format.
Answer:
Let's divide the question into steps:
JOBS:
Here we have 2 actions:
One for the count() action.
One for the write() action.
STAGES:
Each job consists of one or more stages, depending on whether shuffle operations are involved.
Job 1 (Count): Typically one stage if there’s no shuffle.
Job 2 (Write): At least one stage for writing the output, but more stages if re-partitioning is required.
Spark Architecture:
In the context of Spark, the Driver Program is responsible for coordinating and
executing jobs within the application. These are the main components in the Spark
Driver Program.
1. Spark Context - Entry point to the Spark Application, connects to Cluster Manager
2. DAG Scheduler - converts Jobs → Stages
3. Task Scheduler - converts Stages → Tasks
4. Block Manager - In the driver, it handles the data shuffle by maintaining the metadata
of the blocks in the cluster. In executors, it is responsible for caching, broadcasting and
shuffling the data.
====================================================
Resources:
https://mayankoracledba.wordpress.com/2022/10/05/apache-spark-understanding-spark-job-stages-and-tasks/
Debugging with the Apache Spark UI:
https://docs.databricks.com/en/compute/troubleshooting/debugging-spark-ui.html
https://docs.databricks.com/en/jobs/run-if.html
https://sparkbyexamples.com/spark/what-is-spark-stage/
https://blog.det.life/decoding-jobs-stages-tasks-in-apache-spark-05c8b2b16114
Q.70. What are the Difference Between Number of Cores and Number of Executors?
Ans: When running a Spark job, two terms are often confused:
• Number of Executors
1. Executors are Spark's workhorses.
2. Each executor is a JVM instance responsible for executing tasks.
3. Executors handle parallelism at the cluster level.
• Number of Cores
1. Cores determine the number of tasks an executor can run in parallel.
2. It represents CPU power allocated to the executor.
3. It controls parallelism within an executor.
• In simple terms:
1. Executors = How many workers you have.
2. Cores = How many hands each worker has to complete tasks.
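A rough sketch of how these knobs are typically set when building a session; the values are illustrative and depend on your cluster manager (on Databricks the cluster UI controls them instead):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("ExecutorCoreSizingDemo")
    .config("spark.executor.instances", "4")  # how many workers (executors)
    .config("spark.executor.cores", "4")      # parallel tasks per executor
    .config("spark.executor.memory", "8g")    # memory per executor
    .getOrCreate())

# Total parallel tasks = executors * cores per executor = 16 in this sketch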
Q.71. What are the differences between Partitioning and Bucketing in Big Data with PySpark for
performance optimization?
Ans:
A common point of confusion among data engineers is when to use partitioning and when to bucket
your data. Here's a quick breakdown, with an example and PySpark code to help you differentiate
between these two essential techniques for optimizing query performance in large datasets.
---
Partitioning:
Partitioning divides data into separate directories based on column values. It improves query
performance by pruning unnecessary partitions but can lead to too many small files if not managed
properly.
Use Case: Suppose you're analyzing transaction data by region. You can partition the data by the
region column to limit the data scanned for region-specific queries.
PySpark Example:
spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()
# Sample data
data = [("John", "North", 1000), ("Doe", "South", 1500), ("Jane", "East", 1200)]
df = spark.createDataFrame(data, columns)
---
Bucketing:
Bucketing divides data into fixed buckets based on a hash function applied to column values. Unlike
partitioning, bucketing doesn’t create physical directories; it stores data in files organized logically
within each partition.
Use Case: For evenly distributing data, such as customer IDs in a transaction table, bucketing ensures
better load balancing and reduces skewness during joins.
PySpark Example:
# Enable bucketing
df.write.bucketBy(4, "region").sortBy("sales").saveAsTable("bucketed_data")
Note: Joins and aggregations on the region column will now benefit from bucketing, leading to faster
execution.
---
Key Difference
Partitioning: Organizes data into physical directories; great for pruning irrelevant data.
Bucketing: Groups data into logical buckets; ideal for improving join performance.
---
Both techniques are powerful tools for optimizing query execution in large-scale data pipelines.
Combine them strategically based on your data distribution and use case!
Q.72. List the transformations and actions used in Apache Spark DataFrames for a Data Engineering role.
Ans:
Transformations:
Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated: they do not execute immediately but build a logical plan that runs only when an action is performed. Typical examples are select, filter, join, and groupBy.
1. Basic transformations: select(), filter()/where(), withColumn(), withColumnRenamed(), drop(), distinct(), dropDuplicates().
2. Joins: join(other, on=None, how=None) joins with another DataFrame using the given join expression and join type.
3. Advanced transformations: groupBy() with agg(), union(), orderBy()/sort(), repartition(), coalesce().
4. Window functions: row_number(), rank(), dense_rank(), lag(), lead(), applied with over(Window.partitionBy(...).orderBy(...)).
Actions:
Actions trigger the execution of the transformations and return a result to the driver program or write data to an external storage system. Typical examples are count, collect, show, and the write operations.
1. Basic actions: show(), count(), collect(), take(n), first(), head().
2. Writing data: write.parquet(), write.csv(), write.json(), write.saveAsTable(), write.insertInto().
3. Other actions: foreach(), foreachPartition(), toPandas().
(See the end-to-end sketch below.)
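This sketch (the input path, column names, and output path are made up for illustration) chains transformations lazily; nothing executes until the actions at the end:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("TransformationsAndActions").getOrCreate()

orders = spark.read.parquet("/tmp/orders")                      # hypothetical input
summary = (
    orders
    .filter(F.col("amount") > 1000)                             # transformation (lazy)
    .withColumn("year", F.year("order_date"))                   # transformation (lazy)
    .groupBy("year")
    .agg(F.sum("amount").alias("total_amount"))                 # transformation (lazy)
)
summary.show()                                                  # action: triggers execution
summary.write.mode("overwrite").parquet("/tmp/order_summary")   # action: writes output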
Q.73. How do you solve data skew in Azure Databricks: causes, detection, and fixes?
Ans:
Handling data skew is one of the most common challenges data engineers face when working with distributed computing systems like Databricks. Skew can severely degrade performance, leading to longer job runtimes, increased costs, and potential failures.
Common causes:
1. Imbalanced keys: specific keys appear much more frequently than others in the dataset.
2. Large join imbalances: one side of a join is heavily skewed.
3. Wide transformations: operations like groupByKey or reduceByKey aggregate data by key, leading to uneven partition sizes.
How to detect it:
1. Task metrics: use the Spark UI to identify stages where some tasks take significantly longer than others.
2. Partition size: monitor data partition sizes and look for disproportionate partitioning.
3. Job metrics: set up monitoring tools like Azure Monitor or Ganglia to identify performance bottlenecks.
How to fix it:
1. Salting keys: append a random "salt" value to the keys during transformations (e.g., key_1, key_2) to spread data evenly across partitions (see the sketch below).
2. Broadcast joins: for highly skewed joins, broadcast the smaller dataset to all nodes.
3. Tune the partition count: avoid relying on the default number of partitions; choose it based on data size.
4. Optimize the data layout: use formats like Delta Lake with features such as Z-order clustering to improve query performance.
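A minimal salting sketch (df, key, and value are hypothetical names; 8 salt buckets is an arbitrary choice): the hot key is split across several sub-groups, aggregated partially, and then combined.
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # arbitrary; pick based on how skewed the hot keys are

# Add a random salt so rows with the same hot key land in different groups/partitions
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Step 1: aggregate on (key, salt) so no single task processes the whole hot key
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))

# Step 2: combine the partial results per original key (now a much smaller aggregation)
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))

# For skewed joins against a small table, a broadcast join avoids the shuffle:
# joined = big_df.join(F.broadcast(small_df), "key")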
Key takeaway
In distributed computing, addressing data skew is essential for optimizing job performance, reducing
costs, and ensuring the reliability of your workloads. As a data engineer, mastering these techniques will
set you apart in solving real-world scalability challenges.
Q.74. Do you know the difference between the coalesce() method and repartition() in Spark?
Ans:
☑️ The coalesce() method takes the target number of partitions and combines the local partitions available on the same worker node to achieve that target.
☑️ For example, assume you have a five-node cluster and a dataframe with 12 partitions spread across those five worker nodes. You want to reduce the number of partitions to 5, so you execute coalesce(5) on the dataframe. Spark collapses partitions that sit on the same worker node, without moving data across the network, until the partition count is 5.
☑️ You should avoid using repartition() to reduce the number of partitions. Why? Because coalesce() can reduce the partition count without doing a shuffle, whereas repartition() will also get you there but at the cost of a full shuffle operation (a short code sketch follows).
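A minimal sketch (assuming df is the 12-partition dataframe from the example):
print(df.rdd.getNumPartitions())                # 12

df_coalesced = df.coalesce(5)                   # narrow: merges local partitions, no shuffle
df_repartitioned = df.repartition(5)            # wide: full shuffle, but evenly sized partitions

print(df_coalesced.rdd.getNumPartitions())      # 5
print(df_repartitioned.rdd.getNumPartitions())  # 5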
Q.75. When should you choose partitioning and when should you choose bucketing, based on column cardinality?
Ans: This builds on Q.71 with a rule of thumb based on the cardinality of the column you organize the data by.
Partitioning (creates folders): use it when the column has low cardinality (e.g., Country, City, Transport Mode). Avoid it for high-cardinality columns (e.g., EmpID, Aadhaar Card Number, Contact Number, Date), because every distinct value creates its own directory and you end up with a huge number of small files.
Bucketing (creates files): use it when the column has high cardinality (e.g., EmpID, Aadhaar Card Number, Contact Number, Date) and skip it for low-cardinality columns (e.g., Country, City, Transport Mode). The number of buckets is fixed, and a deterministic HASH function sends records with the same key to the same bucket, which balances the load and reduces skew during joins and aggregations on that column.
See Q.71 for the PySpark examples (partitionBy for partitioning; bucketBy/sortBy/saveAsTable for bucketing).
Q.76. What is z-ordering in Databricks?
Ans: If you're working with large datasets in Delta Lake, optimizing query performance is crucial. Here's how Z-Ordering can help.
Z-Ordering reorganizes your data files by clustering similar data together based on columns frequently used in queries. It leverages a space-filling curve (the Z-order curve) to store related data closer together on disk, reducing unnecessary file scans.
For example, suppose a sales_transactions table is usually queried by store:
```sql
SELECT * FROM sales_transactions WHERE store_id = 'S1';
```
Without optimization, these queries scan the entire table, slowing down performance. Z-ordering the table on store_id co-locates each store's rows in fewer files:
```sql
OPTIMIZE sales_transactions
ZORDER BY (store_id);
```
Benefits:
1. Improved data skipping: only the files relevant to the query are scanned.
2. Faster query performance: filtered queries (e.g., store_id = 'S1') read fewer files.
Start using Z-Ordering today and experience faster, more efficient queries in Databricks!
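The same optimization can also be triggered from Python via the Delta Lake API (a sketch assuming the delta-spark package is installed and the sales_transactions table from the example above exists):
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "sales_transactions")
# Compact small files and Z-order the data by store_id
delta_table.optimize().executeZOrderBy("store_id")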
Q.77. Explain the execution plan Apache Spark builds when you run a query (the Catalyst optimizer phases).
Ans: Whenever we fire a query, Apache Spark generates a plan of execution: the steps Spark will take while running the query. Spark tries to optimize this plan for better performance and efficient resource management.
1. Unresolved (Parsed) Logical Plan: Spark checks whether the query is syntactically correct; apart from syntax, it checks nothing else. If the syntax is wrong, it throws a ParseException.
2. Resolved (Analyzed) Logical Plan: Spark resolves all the objects referenced in the query (databases, tables, views, columns, etc.) against the catalog, the metadata store that records these objects. If an object is not present, it throws an AnalysisException (e.g., UnresolvedRelation).
3. Optimized Logical Plan: once the query passes the first two phases, it goes through a set of pre-configured and/or custom rules in this layer. Spark combines projections and optimizes filters and aggregations; predicate pushdown takes place here.
4. Physical Plan: this plan describes how the query will be physically executed on the cluster. The Catalyst optimizer generates multiple physical plans, estimates each one by projected execution time and resource consumption, and selects the most cost-effective plan. That plan is used to generate RDD code, which then runs on the machines.
You can print each of these plans with explain(), as shown below.
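A minimal sketch (the transactions table is hypothetical):
df = spark.sql("SELECT region, SUM(sales) AS total_sales FROM transactions GROUP BY region")

df.explain(mode="extended")   # parsed, analyzed (resolved), optimized logical, and physical plans
df.explain(mode="formatted")  # a readable view of the selected physical plan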
Additional topics to review:
• Window functions: performing ranking and aggregation over partitions.
• Testing PySpark code: unit testing with PyTest, mocking data, and integration testing.
• How Spark ensures fault tolerance: RDD lineage and data recovery.
• How Spark SQL enhances data processing.