# Comparing Beam and Bytewax

## Setup
```sh
python -m venv venv
source venv/bin/activate
pip install -r beam-requirements.txt
```
`beam-wordcount.py` just makes sure that everything is running.
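For reference, the whole wordcount fits in a handful of transforms. A minimal sketch with a toy in-memory input, not necessarily the repo's exact script:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(["to be or not to be"])  # toy in-memory input
        | beam.FlatMap(str.split)              # lines -> words
        | beam.combiners.Count.PerElement()    # -> (word, count) pairs
        | beam.Map(print)
    )
```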
Then, we can create a fake parquet file by running `python generate-parquet.py`. This file, `df.parquet`, contains a list of `user_ids` and their features.
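A minimal sketch of what `generate-parquet.py` might do; the feature column names are placeholders, not the repo's actual schema:

```python
import numpy as np
import pandas as pd

n = 1_000
df = pd.DataFrame(
    {
        "user_id": np.arange(n),         # the user_ids column
        "feature_a": np.random.rand(n),  # placeholder feature columns
        "feature_b": np.random.rand(n),
    }
)
# Writing parquet needs pyarrow (or fastparquet) installed.
df.to_parquet("df.parquet")
```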
`beam-parquet-proto.py` uses Beam to scan this file (in reality we should use Kafka and stream it) and tries to convert each row to a protobuf which, with a given probability `p` (default 0), has a wrong schema (`-c proto`) or a wrong field (`-c field`).
Unfortunately, this conversion is not done directly (parquet to proto) but goes through an intermediate JSON step, which can be a problem for some (complex) data types.
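The shape of that pipeline, as a hedged sketch: `UserFeatures` stands in for whatever generated proto class the repo defines, and the corruption logic is a simplified stand-in for the `-c proto` / `-c field` modes:

```python
import json
import random

import apache_beam as beam
from google.protobuf import json_format

from user_pb2 import UserFeatures  # hypothetical generated proto class

P = 0.1  # probability of emitting a bad row

def maybe_corrupt(row):
    # With probability P, break a field so the proto conversion fails;
    # a simplified stand-in for the -c proto / -c field modes.
    if random.random() < P:
        row = dict(row, user_id="not-an-int")
    return row

def to_proto(row):
    # The indirect route described above: dict -> JSON string -> proto.
    # A corrupted row fails here, which is what we want to detect.
    try:
        return ("ok", json_format.Parse(json.dumps(row), UserFeatures()))
    except json_format.ParseError:
        return ("bad", row)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.io.ReadFromParquet("df.parquet")  # each element is a dict
        | beam.Map(maybe_corrupt)
        | beam.Map(to_proto)
        | beam.Map(print)
    )
```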
Some Beam and Bytewax dependencies clash with each other, so I just created a new virtual environment:
```sh
python -m venv venv
source venv/bin/activate
pip install -r bw-requirements.txt
```
`bw-wordcount.py` implements the same wordcount, but uses `reduce_window` instead. A `reduce` method is also available, but I haven't found a way to tell it when to stop... In a streaming case we need a window anyway.
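A sketch of the `reduce_window` version. Bytewax's import paths and signatures have moved around across its 0.x releases, so take this as the rough shape of the code (written against the older API that exposes `Dataflow.reduce_window`), not something guaranteed to run against your installed version:

```python
import operator
import re
from datetime import timedelta

from bytewax.dataflow import Dataflow
from bytewax.execution import run_main
from bytewax.inputs import TestingInputConfig
from bytewax.outputs import StdOutputConfig
from bytewax.window import SystemClockConfig, TumblingWindowConfig

flow = Dataflow()
flow.input("lines", TestingInputConfig(["to be or not to be"]))
flow.flat_map(lambda line: re.findall(r"[a-z']+", line.lower()))
flow.map(lambda word: (word, 1))
# reduce_window sums the counts per key inside each window, which gives
# the step a natural point to emit; plain reduce has no such endpoint.
flow.reduce_window(
    "count",
    SystemClockConfig(),
    TumblingWindowConfig(length=timedelta(seconds=5)),
    operator.add,
)
flow.capture(StdOutputConfig())

run_main(flow)
```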
`bw-parquet-proto.py` is basically a 1:1 translation of `beam-parquet-proto.py`: we can see it's a lot more "manual" than Beam, everything needs to be specified, and again we use `reduce_window` to count. This also means it's probably much more flexible than Beam, but what I'm doing here is way too simple to test that claim.
Finally, we run a prototype pipeline using Kafka.
- Start Kafka locally: `docker compose up`
- Start the producer (sketched below): `python kafka-producer.py`
- Start the Bytewax pipeline: `python kafka-bw.py`
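What the producer side might look like, assuming kafka-python, a broker on localhost:9092, and a placeholder topic name (the repo's actual topic and message schema may differ):

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

user_id = 0
while True:
    # Emit one fake user row per second, mirroring the parquet schema.
    producer.send("user-features", {"user_id": user_id, "feature_a": 0.5})
    producer.flush()
    user_id += 1
    time.sleep(1)
```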
Bonus: for Slack notifications, a `WEBHOOK_URL` environment variable (in `.env`) is required.
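Posting to that webhook is a couple of lines with requests; a minimal sketch, assuming python-dotenv is used to load `.env`:

```python
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # reads WEBHOOK_URL from .env into the environment
resp = requests.post(os.environ["WEBHOOK_URL"], json={"text": "anomaly detected"})
resp.raise_for_status()
```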
This can be augmented with stateful anomaly detection and Slack messaging like so: https://github.com/awmatheson/junk-drawer/blob/main/data-quality/dataflow.py