PySpark and Delta Lake are very powerful tools, but the requirement of working with them via Databricks can be restrictive for some projects. Although the two building blocks (PySpark & Delta Lake) are open source, their setup & implementation can be tricky to get right. This project aims to fix that using Bazel, an automated application setup & testing framework.
- Streamlined, One-Click Setup
- Environments set up using Anaconda
  - Since a large part of data science & analytics work uses conda as an environment manager, this tool is built on top of conda (a sample environment.yml sketch follows this list).
- Multiple distinct environments can be supported on a single machine
  - Modifications to the Spark install in one environment should not interfere with Spark installs in other environments
- Support for S3 data storage
- Platform-agnostic installation (Mac/Windows/Linux) (stretch goal)
- UI for better UX (stretch goal)
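For reference, an environment.yml for this kind of setup might look like the sketch below. The environment name and the pinned package versions are illustrative assumptions, not the exact contents of this repo's environment.yml.

```yaml
# Hypothetical environment.yml; the name and versions are illustrative only.
name: spark-delta-env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
      # delta-spark 3.x pairs with pyspark 3.5.x; check the Delta/Spark
      # compatibility matrix for the versions you actually pin.
      - pyspark==3.5.1
      - delta-spark==3.1.0
```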
What it does:
- Creates a fresh conda environment based on the environment.yml file
- Installs Spark & Delta Lake into the conda environment, placing the required files under the environment's site-packages directory (e.g. env_name/lib/python3.11/site-packages/pyspark/...)
- Configures the Spark environment to use Delta & S3 (a sketch of this configuration follows this list)
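For context, the configuration this step produces looks roughly like the sketch below. The Maven coordinates, versions, and config values are illustrative assumptions; the repo's setup.sh may wire things up differently (for example by copying jars directly into the PySpark install instead of using spark.jars.packages).

```python
# Minimal sketch of a SparkSession configured for Delta Lake + S3 (s3a).
# Package versions are illustrative; match them to your PySpark/Hadoop build.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-s3-sketch")
    # Enable the Delta Lake SQL extensions and catalog
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Pull in the Delta and S3 (hadoop-aws) jars at startup
    .config("spark.jars.packages",
            "io.delta:delta-spark_2.12:3.1.0,org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)

# AWS credentials are picked up by the s3a connector's default provider chain
# (environment variables, ~/.aws/credentials, instance profiles, ...).
```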
MacOS/Linux:
Windows:
- Chocolatey? # TODO: test this
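As the TODO above notes, the Windows path is untested. If Chocolatey ends up being the route, the prerequisites would likely be installed along these lines; the package names below are assumptions and have not been verified against this repo.

```
# Untested sketch: install conda and a JDK (required by Spark) via Chocolatey.
choco install miniconda3
choco install openjdk
```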
- Copy the files in this repository into a new/existing repo.
- Update the environment.yml to match the requirements of your project.
- Run `./setup.sh` to install Spark and Delta Lake into the conda environment.
- Run `bazel build spark`. Use `bazel clean` to remove Bazel's build outputs when needed.
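Once the build completes, a quick way to sanity-check the install is to write and read back a small Delta table from inside the activated conda environment. This snippet is illustrative and not part of the repo; it assumes setup.sh has already placed the Delta jars alongside the environment's PySpark install, and the table path is arbitrary.

```python
# Smoke test: run inside the activated conda environment after setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-smoke-test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a tiny Delta table locally and read it back.
# Swap the path for an s3a:// URI to exercise the S3 configuration as well.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
spark.read.format("delta").load("/tmp/delta_smoke_test").show()
```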