Azure Databricks ETL Pipeline Guide

The document provides a comprehensive guide on using Azure Databricks for big data processing, focusing on building an ETL pipeline with PySpark. It covers the objectives, tools, and core concepts of Azure Databricks, along with step-by-step tasks for launching a workspace, creating clusters, and performing data transformations. Additionally, it explains how to read data from Azure Data Lake and write transformed data back in Parquet format.


Azure Databricks

Contents
Azure Databricks
  Objective
  Tools & Environment
  Azure Databricks Conceptual Overview
    What is Azure Databricks?
    Core Concepts
  Task
    1) Launch Databricks workspace and create a cluster
    2) Create a Notebook and Run Basic PySpark and SQL Commands
    3) Read CSV/Parquet/JSON Files from Azure Data Lake/Blob Using Spark
    4) Perform basic data transformations (filter, select, groupBy, join)
    5) Write transformed data back to ADLS in Parquet format

Objective:
This task supports my PIP by enhancing my skills in big data processing using
Azure Databricks. It involves building an ETL pipeline that reads data from
Azure Data Lake, transforms it with PySpark, and writes the transformed data
back to the lake in Parquet format. I will use Spark SQL for data validation
and analysis within Databricks notebooks. This improves my proficiency in
cloud-based data engineering and scalable analytics.

Tools & Environment:

 Azure Databricks: Cloud-based analytics platform for big data processing
using Apache Spark.

 Azure Data Lake Storage (ADLS): Scalable cloud storage optimized for
big data analytics.

 PySpark: Python API for Apache Spark to process large datasets in a
distributed way.

 Databricks Notebook: Interactive workspace to write, run, and visualize
code and queries.

 Parquet Format: Efficient, columnar file format for storing structured data.

Azure Databricks Conceptual Overview:


What is Azure Databricks?
Azure Databricks is a cloud-based analytics platform that combines the
power of Apache Spark with the ease of use and scalability of Azure Cloud.
It is designed for big data processing, machine learning, and collaborative
analytics.

Core Concepts
1. Unified Analytics Platform
 Combines data engineering, data science, and business analytics in
one platform.

 Allows teams to work together using Notebooks, Jobs, and Dashboards.

2. Apache Spark Engine


 Built on Apache Spark, which allows for distributed computing across
clusters.

 Efficiently processes large-scale data using in-memory computation.

3. Notebooks
 Interactive interface to write code in Python, SQL, Scala, or R.

 Supports collaboration, visualization, and versioning (like Google Docs
for code).
4. Clusters
 Scalable virtual machines that execute Spark jobs.

 Can be automatically scaled up/down based on workload.

5. Databricks File System (DBFS)


 Built-in distributed storage layer on top of cloud storage (ADLS or
Blob).

 Acts as a staging area for temporary and permanent files.

6. Delta Lake
 Databricks' storage layer that brings ACID transactions, schema
enforcement, and time travel to data lakes.

 Enables a reliable lakehouse architecture.

7. Integration with Azure


 Seamless integration with Azure services:

o Azure Data Factory (pipelines)

o Azure Data Lake (storage)

o Azure Synapse (warehousing)

o Power BI (reporting)

o Azure Key Vault (security)

Task:
1) Launch Databricks workspace and create a cluster.
Step 1: Launch Azure Databricks Workspace
1. Go to the Azure Portal
2. Click "Create a resource" → Search for Azure Databricks
3. Choose your subscription, resource group, workspace name, and region
4. Select a Pricing Tier (Standard or Premium)
5. Click Create

Step 2: Create and Start a Cluster
1. In the Databricks UI, go to Compute
2. Click "Create Cluster"
   1. Enter a name
   2. Choose a runtime version (e.g. a 13.x LTS runtime)
   3. Set the number of workers (autoscaling is available)
3. Click "Start"
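Cluster creation can also be automated instead of clicked through the UI. Below is a minimal sketch of a cluster specification as accepted by the Databricks Clusters API / CLI; the runtime version and node type shown are assumptions, so pick values available in your workspace.

```python
# Illustrative cluster specification, expressed as a Python dict that could be
# serialized to JSON for the Databricks Clusters API or CLI.
# spark_version and node_type_id are assumptions; adjust to your workspace.
import json

cluster_spec = {
    "cluster_name": "etl-demo-cluster",
    "spark_version": "13.3.x-scala2.12",   # an LTS runtime (assumption)
    "node_type_id": "Standard_DS3_v2",     # Azure VM size (assumption)
    "autoscale": {"min_workers": 1, "max_workers": 3},
    "autotermination_minutes": 30,         # stop idle clusters to save cost
}

print(json.dumps(cluster_spec, indent=2))
```

Autoscaling bounds and auto-termination are worth setting explicitly: they keep a demo cluster from running (and billing) unattended.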

2) Create a Notebook and Run Basic PySpark and SQL Commands

1. Go to the Workspace section in Azure Databricks
2. Click Create > Notebook
3. Enter a Notebook name
4. Select Default Language: Python
5. Choose and attach your running cluster
6. Run PySpark and SQL commands

a. PySpark

1. Create a DataFrame

python

data = [("Alice", 25), ("Bob", 30)]

df = spark.createDataFrame(data, ["name", "age"])

df.show()

2. Display schema and column names

python

df.printSchema()

df.columns

3. Select specific columns

python
df.select("name", "age").show()

b. SQL

1. Select all rows

SQL

SELECT * FROM Employees;

2. Filter rows using WHERE

SQL

SELECT * FROM Employees WHERE age > 25;

3. Describe the table structure

SQL

DESCRIBE TABLE Employees;

3) Read CSV/Parquet/JSON Files from Azure Data Lake/Blob Using Spark

Step 1: Prepare Azure Storage
 Make sure your data files are already stored in Azure Data Lake Gen2
or Azure Blob Storage.
 Get the following details:
o Storage account name
o Container name
o File path (e.g., /data/<file>.csv)
o Storage access method: either a SAS token, Account Key, or
Service Principal

Step 2: Set Up Access to Storage in Databricks

Option A: Use Account Key (quick setup for testing)
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net",
"<your_account_key>")

Step 3: Define File Path

Use the ABFS path format for Azure Data Lake Gen2:
file_path =
"abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<file>"
Example:
file_path =
"abfss://mycontainer@mystorageacct.dfs.core.windows.net/input/data.csv"
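Assembling the ABFS URI from its parts can be done with a simple f-string; the container, account, and file names below are illustrative placeholders, not real resources.

```python
# Building the abfss:// URI for ADLS Gen2 from its components.
# All values here are illustrative placeholders.
container = "mycontainer"
storage_account = "mystorageacct"
relative_path = "input/data.csv"

file_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{relative_path}"
print(file_path)
```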

Step 4: Read a CSV File


df_csv = spark.read.option("header", True).csv(file_path)
df_csv.show()
Explanation:
 option("header", True) reads the first row as column names.
 .show() displays the top rows.

Step 5: Read a Parquet File


df_parquet = spark.read.parquet(file_path)
df_parquet.show()
Explanation:
 Parquet is a highly efficient, compressed columnar storage
format.

Step 6: Read a JSON File


df_json = spark.read.json(file_path)
df_json.show()
Explanation:
 Spark will automatically infer schema from JSON.

Step 7: Verify Schema and Content


df_csv.printSchema()
df_csv.display()
Explanation:
 printSchema() shows column names and types.
 display() (Databricks only) shows an interactive table.
4) Perform basic data transformations (filter, select,
groupBy, join).
Transformations in Spark are operations that return a new DataFrame.
These include filter, select, groupBy, and join.
a. Filter: Used to select rows based on a condition (like SQL's
WHERE clause).

df_filtered = df_csv.filter(df_csv["Age"] > 30)

df_filtered.show()

This selects only rows where the Age column is greater than 30.

b. Select: Used to select specific columns from the DataFrame.


df_selected = df_csv.select("Name", "Age")
df_selected.show()

This keeps only the Name and Age columns, useful when you
don't need all the data.

c. groupBy: Used to group data by a column, usually followed by an
aggregation.

df_grouped = df_csv.groupBy("Department").count()
df_grouped.show()

Groups data by Department and counts the rows in each group. Note
that groupBy alone returns a GroupedData object; an aggregation such
as count() is needed to get a DataFrame back.
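For intuition, a grouped aggregation such as counting rows per Department can be mimicked in plain Python; this is an analogy only, not Spark code, and the sample rows are made up.

```python
# Plain-Python analogy of groupBy("Department").count() (not Spark code).
from collections import Counter

# Toy rows standing in for df_csv (made-up sample data)
rows = [
    {"Name": "Alice", "Department": "HR"},
    {"Name": "Bob", "Department": "IT"},
    {"Name": "Cara", "Department": "HR"},
]

# Count how many rows fall into each Department group
counts = Counter(row["Department"] for row in rows)
print(counts)  # Counter({'HR': 2, 'IT': 1})
```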

d. Join: Used to combine two DataFrames based on a common


column (like SQL joins).
Firstly, read data:

df_employees = spark.read.csv("/mnt/data/employees.csv",
header=True, inferSchema=True)
df_departments = spark.read.csv("/mnt/data/departments.csv",
header=True, inferSchema=True)

Then perform the join:

df_joined = df_employees.join(df_departments,
df_employees["DeptID"] == df_departments["DeptID"], "inner")
df_joined.show()
Combines employee and department data where DeptID
matches in both DataFrames.
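The inner-join semantics can also be sketched in plain Python for intuition; again this is an analogy, not Spark code, and the sample rows are made up.

```python
# Plain-Python analogy of an inner join on DeptID (not Spark code).
employees = [
    {"Name": "Alice", "DeptID": 1},
    {"Name": "Bob", "DeptID": 2},
]
departments = [
    {"DeptID": 1, "Department": "HR"},
]

# Index departments by DeptID, then keep only employees with a match
dept_by_id = {d["DeptID"]: d for d in departments}
joined = [
    {**e, "Department": dept_by_id[e["DeptID"]]["Department"]}
    for e in employees
    if e["DeptID"] in dept_by_id
]
print(joined)  # [{'Name': 'Alice', 'DeptID': 1, 'Department': 'HR'}]
```

Bob is dropped because DeptID 2 has no matching department row, which is exactly what "inner" means in the Spark join above.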

5) Write transformed data back to ADLS in Parquet format.


Step 1: Write to ADLS in Parquet
final_df.write.mode("overwrite").parquet(
"abfss://my-container@<storage_account>.dfs.core.windows.net/output/transformed_data/")

mode("overwrite"): replaces existing data

You can use mode("append") to add new data

Step 2: Verify Output


You can browse the ADLS container via:

 Azure Portal > Storage Account > Containers > my-container >
output/transformed_data/

 Or list it in Databricks
