DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check https://arrow.apache.org/datafusion/user-guide/introduction.html
Here we use a parquet file: we create an external table for it and then execute the queries against that table.
The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2).
- Manually start an AWS EC2 instance:
  - `c6a.4xlarge`
  - Ubuntu 22.04 or later
  - Root 500GB gp2 SSD
  - no EBS optimized
  - no instance store
- Wait for the status checks to pass, then ssh to the EC2 instance: `ssh ubuntu@{ip}`
- `git clone https://github.com/ClickHouse/ClickBench`
- `cd ClickBench/datafusion`
- `vi benchmark.sh` and modify the following line to target the desired DataFusion version: `git checkout 46.0.0`
- `bash benchmark.sh`
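
The version edit in `benchmark.sh` can also be done without opening an editor. The following is a minimal sketch, assuming the script contains a single line that starts with `git checkout`; `pin_version` is a hypothetical helper name and is not part of ClickBench.

```shell
# pin_version SCRIPT VERSION
# Hypothetical helper: rewrites the `git checkout <something>` line in SCRIPT
# so that it checks out VERSION instead. A backup is kept as SCRIPT.bak.
pin_version() {
    script="$1"
    version="$2"
    # Keep the leading indentation, replace the rest of the line.
    # Assumes VERSION contains no "/" characters (plain tags such as 46.0.0).
    sed -i.bak -E "s/^([[:space:]]*)git checkout .*/\1git checkout ${version}/" "$script"
}
```

For example, `pin_version benchmark.sh 46.0.0` would leave the original script in `benchmark.sh.bak` and update the checkout line in place.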

Known issues:

- Importing parquet with `datafusion-cli` doesn't support specifying a schema, so some casts had to be added in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`).
- Importing parquet with `datafusion-cli` makes column names case-sensitive, so all column names in queries.sql were changed to double-quoted identifiers (e.g. `EventTime` -> `"EventTime"`).
- `comparing binary with utf-8` and `group by binary` don't work on Mac; if you run these queries on a Mac, you'll get errors for queries that involve the binary format (apache/datafusion#3050).
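
To illustrate the casting and quoting workarounds, the sketch below writes a small query file that applies both. The file name `example_query.sql` is hypothetical, and the table name `hits` is an assumption about what `create_single.sql` defines; adjust it if the external table is named differently.

```shell
# Write an illustrative query to a file (example_query.sql is a hypothetical name).
cat > example_query.sql <<'EOF'
-- "EventTime" is double-quoted because column names are case-sensitive.
-- to_timestamp_seconds converts the integer value into a timestamp.
-- The table name hits is an assumption, not confirmed by this README.
SELECT to_timestamp_seconds("EventTime") AS event_time
FROM hits
LIMIT 5;
EOF
```

Under those assumptions, the file could be run the same way as the benchmark queries, e.g. `datafusion-cli -f create_single.sql example_query.sql`.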

To run the queries manually:

- Install datafusion-cli.
- Download the parquet file: `wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/hits.parquet`
- Execute the queries: `datafusion-cli -f create_single.sql queries.sql` or `bash run2.sh`
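
Before executing the queries, it can help to confirm that the CLI is installed and the download finished. This is a minimal sketch; `check_prereqs` is a hypothetical helper, and the default names `datafusion-cli` and `hits.parquet` are taken from the steps above.

```shell
# check_prereqs [COMMAND] [DATA_FILE]
# Hypothetical helper: returns 0 only if COMMAND is on PATH and DATA_FILE
# exists and is non-empty. It does not validate the file's contents.
check_prereqs() {
    cli="${1:-datafusion-cli}"
    data="${2:-hits.parquet}"
    if ! command -v "$cli" >/dev/null 2>&1; then
        echo "missing command: $cli" >&2
        return 1
    fi
    if [ ! -s "$data" ]; then
        echo "missing or empty file: $data" >&2
        return 1
    fi
    return 0
}
```

One possible use is `check_prereqs && datafusion-cli -f create_single.sql queries.sql`, which skips the run when either prerequisite is missing.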