Azure Databricks
Contents
Azure Databricks
Objective
Tools & Environment
Azure Databricks Conceptual Overview
What is Azure Databricks?
Core Concepts
Task
1) Launch Databricks workspace and create a cluster
2) Create a Notebook and Run Basic PySpark and SQL Commands
3) Read CSV/Parquet/JSON Files from Azure Data Lake/Blob Using Spark
4) Perform basic data transformations (filter, select, groupBy, join)
5) Write transformed data back to ADLS in Parquet format
Objective:
This task supports my PIP by enhancing my skills in big data processing using
Azure Databricks. It involves building an ETL pipeline that reads data from
Azure Data Lake, transforms it with PySpark, and writes to Delta Lake. I will
use Spark SQL for data validation and analysis within Databricks notebooks.
This improves my proficiency in cloud-based data engineering and scalable
analytics.
Tools & Environment:
Azure Databricks: Cloud-based analytics platform for big data processing
using Apache Spark.
Azure Data Lake Storage (ADLS): Scalable cloud storage optimized for
big data analytics.
PySpark: Python API for Apache Spark to process large datasets in a
distributed way.
Databricks Notebook: Interactive workspace to write, run, and visualize
code and queries.
Parquet Format: Efficient, columnar file format for storing structured data.
Azure Databricks Conceptual Overview:
What is Azure Databricks?
Azure Databricks is a cloud-based analytics platform that combines the
power of Apache Spark with the ease of use and scalability of Azure Cloud.
It is designed for big data processing, machine learning, and collaborative
analytics.
Core Concepts
1. Unified Analytics Platform
Combines data engineering, data science, and business analytics in
one platform.
Allows teams to work together using Notebooks, Jobs, and Dashboards.
2. Apache Spark Engine
Built on Apache Spark, which allows for distributed computing across
clusters.
Efficiently processes large-scale data using in-memory computation.
3. Notebooks
Interactive interface to write code in Python, SQL, Scala, or R.
Supports collaboration, visualization, and versioning (like Google Docs
for code).
4. Clusters
Scalable virtual machines that execute Spark jobs.
Can be automatically scaled up/down based on workload.
5. Databricks File System (DBFS)
Built-in distributed storage layer on top of cloud storage (ADLS or
Blob).
Acts as a staging area for temporary and permanent files.
6. Delta Lake
Databricks' storage layer that brings ACID transactions, schema
enforcement, and time travel to data lakes.
Enables a reliable data lakehouse architecture.
7. Integration with Azure
Seamless integration with Azure services:
o Azure Data Factory (pipelines)
o Azure Data Lake (storage)
o Azure Synapse (warehousing)
o Power BI (reporting)
o Azure Key Vault (security)
Task:
1) Launch Databricks workspace and create a cluster.
Step 1: Launch Azure Databricks Workspace
1. Go to the Azure Portal
2. Click "Create a resource" → Search for Databricks
3. Choose your subscription, resource group, workspace name, and
region
4. Select Pricing Tier (Standard or Premium)
5. Click Create
Step 2: Create and Start a Cluster
1. In Databricks UI → Go to Compute
2. Click “Create Cluster”
1. Enter a name
2. Choose a runtime version (e.g., a 13.x LTS or ML runtime)
3. Set number of workers (can be auto-scale)
3. Click “Start”
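The same cluster can also be created programmatically through the Databricks Clusters REST API (POST /api/2.0/clusters/create). The sketch below builds the request body and shows how it would be sent; the workspace URL, token, runtime version, and node type are placeholder assumptions, not values from this task.

```python
import json
import urllib.request

def build_cluster_payload(name, num_workers=2):
    """Build a request body for the Databricks Clusters API.

    The runtime version and node type are placeholder assumptions;
    use the values available in your own workspace.
    """
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # assumed LTS runtime
        "node_type_id": "Standard_DS3_v2",    # assumed Azure VM size
        "num_workers": num_workers,
    }

def create_cluster(workspace_url, token, payload):
    """Send the create-cluster request (workspace_url/token are placeholders)."""
    req = urllib.request.Request(
        f"{workspace_url}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # {"cluster_id": "..."} on success

payload = build_cluster_payload("pip-task-cluster")
print(json.dumps(payload, indent=2))
```

This mirrors what the Compute UI does in steps 1–3 above; the response's cluster_id can then be used to attach notebooks or jobs.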
2) Create a Notebook and Run Basic PySpark and SQL
Commands
1. Go to the Workspace section in Azure Databricks
2. Click on Create > Notebook
3. Enter a Notebook name
4. Select Default Language: Python
5. Choose and attach your running cluster
6. Run PySpark and SQL commands
a. PySpark
1. Create a DataFrame
python
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
2. Display schema and column names
python
df.printSchema()
df.columns
3. Select specific columns
python
df.select("name", "age").show()
b. SQL
(These queries assume the DataFrame has first been registered as a table, e.g. df.createOrReplaceTempView("Employees").)
1. Select all rows
SQL
SELECT * FROM Employees;
2. Filter rows using WHERE
SQL
SELECT * FROM Employees WHERE age > 25;
3. Describe the table structure
SQL
DESCRIBE TABLE Employees;
3) Read CSV/Parquet/JSON Files from Azure Data
Lake/Blob Using Spark
Step 1: Prepare Azure Storage
Make sure your data files are already stored in Azure Data Lake Gen2
or Azure Blob Storage.
Get the following details:
o Storage account name
o Container name
o File path (e.g., /data/<file>.csv)
o Storage access method: either a SAS token, Account Key, or
Service Principal
Step 2: Set Up Access to Storage in Databricks
Option A: Use Account Key (quick setup for testing)
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", "<your_account_key>")
Step 3: Define File Path
Use the ABFS path format for Azure Data Lake Gen2:
file_path = "abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<file>"
Example:
file_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/input/employees.csv"
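To avoid typos when assembling the abfss:// URI, a small plain-Python helper can build it from its parts (a hypothetical convenience function, with sample names matching the placeholders above):

```python
def abfss_path(container: str, storage_account: str, *parts: str) -> str:
    """Assemble an ABFS URI for ADLS Gen2:
    abfss://<container>@<account>.dfs.core.windows.net/<parts...>
    """
    suffix = "/".join(p.strip("/") for p in parts)
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{suffix}"

# Hypothetical sample names:
file_path = abfss_path("mycontainer", "mystorageacct", "input", "employees.csv")
print(file_path)
# abfss://mycontainer@mystorageacct.dfs.core.windows.net/input/employees.csv
```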
Step 4: Read a CSV File
df_csv = spark.read.option("header", True).csv(file_path)
df_csv.show()
Explanation:
option("header", True) reads the first row as column names.
.show() displays the top rows.
Step 5: Read a Parquet File
df_parquet = spark.read.parquet(file_path)
df_parquet.show()
Explanation:
Parquet is a highly efficient, compressed columnar storage
format.
Step 6: Read a JSON File
df_json = spark.read.json(file_path)
df_json.show()
Explanation:
Spark will automatically infer schema from JSON.
Step 7: Verify Schema and Content
df_csv.printSchema()
df_csv.display()
Explanation:
printSchema() shows column names and types.
display() (Databricks only) shows an interactive table.
4) Perform basic data transformations (filter, select,
groupBy, join).
Transformations in Spark are operations that return a new DataFrame.
These include filter, select, groupBy, and join.
a. Filter: Used to select rows based on a condition (like SQL's
WHERE clause).
df_filtered = df_csv.filter(df_csv["Age"] > 30)
df_filtered.show()
This selects only rows where the Age column is greater than 30
b. Select: Used to select specific columns from the DataFrame.
df_selected = df_csv.select("Name", "Age")
df_selected.show()
This keeps only the Name and Age columns, useful when you
don't need all the data.
c. groupBy: Used to group data by a column. On its own, groupBy returns a
GroupedData object, so an aggregation such as count() must be applied
before show().
df_grouped = df_csv.groupBy("Department").count()
df_grouped.show()
Counts the rows in each Department.
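For intuition, grouping by Department with a count aggregation behaves like this plain-Python aggregation over sample rows (a sketch of the semantics, not how Spark executes it; the data is hypothetical):

```python
from collections import Counter

# Hypothetical rows standing in for df_csv
rows = [
    {"Name": "Alice", "Age": 34, "Department": "Sales"},
    {"Name": "Bob", "Age": 28, "Department": "IT"},
    {"Name": "Cara", "Age": 41, "Department": "Sales"},
]

# Equivalent of grouping by "Department" and counting rows per group
counts = Counter(row["Department"] for row in rows)
print(dict(counts))  # {'Sales': 2, 'IT': 1}
```

Spark performs the same logical operation, but distributed across the cluster and with many more aggregation functions (sum, avg, max, ...) available.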
d. Join: Used to combine two DataFrames based on a common
column (like SQL joins).
Firstly, read data:
df_employees = spark.read.csv("/mnt/data/employees.csv",
header=True, inferSchema=True)
df_departments = spark.read.csv("/mnt/data/departments.csv",
header=True, inferSchema=True)
Then join them:
df_joined = df_employees.join(df_departments,
df_employees["DeptID"] == df_departments["DeptID"], "inner")
df_joined.show()
Combines employee and department data where DeptID
matches in both DataFrames.
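Note that joining on the expression df_employees["DeptID"] == df_departments["DeptID"] keeps two DeptID columns in the result; passing the column name instead, df_employees.join(df_departments, on="DeptID", how="inner"), deduplicates the key. The matching logic itself is an ordinary inner join, sketched here in plain Python with hypothetical rows:

```python
employees = [
    {"Name": "Alice", "DeptID": 1},
    {"Name": "Bob", "DeptID": 2},
    {"Name": "Cara", "DeptID": 3},  # no matching department -> dropped
]
departments = [
    {"DeptID": 1, "Department": "Sales"},
    {"DeptID": 2, "Department": "IT"},
]

# Inner join: keep only rows whose DeptID appears on both sides
dept_by_id = {d["DeptID"]: d for d in departments}
joined = [
    {**e, "Department": dept_by_id[e["DeptID"]]["Department"]}
    for e in employees
    if e["DeptID"] in dept_by_id
]
print(joined)
# [{'Name': 'Alice', 'DeptID': 1, 'Department': 'Sales'}, {'Name': 'Bob', 'DeptID': 2, 'Department': 'IT'}]
```

Spark distributes this matching across partitions (e.g. via shuffle or broadcast joins), but the row-level result is the same.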
5) Write transformed data back to ADLS in Parquet format.
Step 1: Write to ADLS in Parquet
final_df.write.mode("overwrite").parquet("abfss://my-container@<storage_account>.dfs.core.windows.net/output/transformed_data/")
mode("overwrite"): replaces existing data
You can use mode("append") to add new data
Step 2: Verify Output
You can browse the ADLS container via:
Azure Portal > Storage Account > Containers > my-container >
output/transformed_data/
Or list it in Databricks:
dbutils.fs.ls("abfss://my-container@<storage_account>.dfs.core.windows.net/output/transformed_data/")