
feat: improve prom write requests decode performance #3478

Merged
waynexia merged 2 commits into GreptimeTeam:main from
v0y4g3r:feat/optimize-decode-performance
Mar 12, 2024

Conversation

@v0y4g3r (Contributor) commented Mar 11, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

N/A

What's changed and what's your intention?

This PR aims to improve the decode performance of PromWriteRequest. This was previously addressed in #3425, but we still observe some room for optimization.

Decoding WriteRequest

  • First, we can eliminate one copy_to_bytes invocation by inlining prost::encoding::bytes::merge.
  • Then we dive into copy_to_bytes and observe that it is much slower than Golang's byte slice operation:
    • For a bytes = bytes[..idx] operation repeated 120k times (which is the case when decoding labels from 10k timeseries in a WriteRequest), Rust's Bytes takes 1.2ms, while Golang's byte slice takes 30us.
    • The reason for this performance difference is that Bytes also handles reference counting when built from Vec<u8>, while Golang's byte slice only has ptr, len and cap fields; the garbage collector handles reclamation, which brings little overhead when bytes are pooled.
    • The bytes fields of PromLabel are short-lived: they are converted to strings and added to TableBuilder soon after a PromTimeseries finishes decoding, so the original decompressed Bytes always outlives PromLabel::name and PromLabel::value. This lets us introduce some unsafe code that directly constructs a new Bytes from the raw pointer and len/cap fields, which is what servers::proto::copy_to_bytes does.
    • For the same test mentioned above, servers::proto::copy_to_bytes takes 200us, still slower than Golang but fairly acceptable.
    • With this optimization, the decode time for a 10k-timeseries WriteRequest further improves from 3.0ms to 1.8ms. For reference, VictoriaMetrics' WriteRequest decoding takes 1.2ms.

Decode-only benchmark result:

Running benches/bench_prom.rs (target/release/deps/bench_prom-9768faca31abd429)
Gnuplot not found, using plotters backend 
decode/write_request  time:     [7.1515 ms 7.1673 ms 7.1832 ms]
                      change:   [-0.7329% -0.3899% -0.0453%]  (p=0.03<0.05)
                      Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

decode/prom_write_request
                        time:   [1.8225 ms 1.8308 ms 1.8379 ms]
                        change: [-1.2093% -0.5679% +0.0813%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  15 (15.00%) low mild
  1 (1.00%) high mild

Building RowInsertRequests

Aside from decoding, we also found that PromWriteRequest::as_row_insert_requests takes more time than decoding itself.

  • While decoding PromTimeseries, we need to build the schemas for each metric (table), which involves two hashmap lookups per label; that sums to 120k hashmap lookups for a 10k-timeseries request.
  • We can use an ahash-based hashmap instead of Rust's default SipHash-based hashmap to get roughly 40% faster hashtable lookups (Rust's default hashmap is quite slow, by the way).
  • Compared to SipHash, ahash is not a cryptographically secure hasher, but that doesn't matter in our case.
  • I also compared other hashmaps like FxHashMap, which is used internally in rustc, but did not notice a significant improvement.

Results

Again decoding a WriteRequest with 10k timeseries and 60k labels, but now also converting it to RowInsertRequests:

     Running benches/bench_prom.rs (target/release/deps/bench_prom-ac6fa2abc4d24957)
decode/write_request    time:   [13.816 ms 13.882 ms 13.954 ms]
                        change: [-0.9281% -0.3477% +0.2455%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
decode/prom_write_request
                        time:   [5.7520 ms 5.7591 ms 5.7666 ms]
                        change: [+0.5254% +0.7116% +0.9096%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Mar 11, 2024
@v0y4g3r v0y4g3r marked this pull request as ready for review March 11, 2024 08:22
@v0y4g3r v0y4g3r requested review from evenyag and waynexia March 11, 2024 08:25
codecov bot commented Mar 11, 2024

Codecov Report

Attention: Patch coverage is 73.68421%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 84.93%. Comparing base (a9d42f7) to head (07710c6).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3478      +/-   ##
==========================================
- Coverage   85.44%   84.93%   -0.51%     
==========================================
  Files         895      900       +5     
  Lines      147093   149647    +2554     
==========================================
+ Hits       125685   127105    +1420     
- Misses      21408    22542    +1134     

@killme2008 (Member) commented:
Great job! Do you think we could submit these enhancements to the main prost project?

@v0y4g3r (Contributor, Author) commented Mar 11, 2024

> Great job! Do you think we could submit these enhancements to the main prost project?

No, this optimization is safe if and only if:

  • Buf is backed by a contiguous byte array
  • callers can ensure the decode result never outlives the original Bytes.

@waynexia waynexia enabled auto-merge March 12, 2024 11:45
@waynexia waynexia added this pull request to the merge queue Mar 12, 2024
Merged via the queue into GreptimeTeam:main with commit 9afe327 Mar 12, 2024
3u32 => {
    // we can ignore metadata for now.
    prost::encoding::skip_field(wire_type, tag, &mut buf, ctx.clone())?;
    // todo(hl): metadata are skipped.
}

A Collaborator commented on this snippet:

What exactly is the TODO?

May we create an issue for it with a brief description?

