How is sum_over_time() / count_over_time() different than avg_over_time()? #354

Open
mmazur opened this issue Jul 20, 2022 · 1 comment

mmazur commented Jul 20, 2022

The (default) optimized slo:sli_error:ratio_rate30d is computed as sum_over_time() / count_over_time(). This follows 9cd3177, which changed it from avg_over_time().

I'm very confused about what the difference is. The definition of an arithmetic average (mean) is sum() / count(), so unless there's something unusual in Prometheus's implementation of these functions, I would expect the two expressions to be equivalent.
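To make the expected equivalence concrete, here is a minimal Python sketch. It models only the arithmetic, assuming all three functions see the same samples in the range window; the helper names mirror the PromQL functions but are plain Python stand-ins, and the sample values are made up for illustration.

```python
import math
import statistics

# Plain-Python stand-ins for the PromQL functions (hypothetical helpers,
# not Prometheus code). Each takes the samples inside one range window.
def sum_over_time(samples):
    return sum(samples)

def count_over_time(samples):
    return len(samples)

def avg_over_time(samples):
    return statistics.fmean(samples)

# Made-up 5m error-ratio samples, for illustration only.
ratios = [0.10, 0.00, 0.02, 0.05]

via_sum_count = sum_over_time(ratios) / count_over_time(ratios)
via_avg = avg_over_time(ratios)

# Equal up to floating-point rounding.
print(math.isclose(via_sum_count, via_avg))
```

Under that assumption the two expressions can differ only by floating-point rounding, which is why the change looks like a no-op from a purely mathematical point of view.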

Prometheus's best practices on recording rules do mention:

When aggregating up ratios, aggregate up the numerator and denominator separately and then divide. Do not take the average of a ratio or average of an average as that is not statistically valid.

But Sloth does not preserve either the numerator or the denominator, so doing that is not possible.

@ThomWright

Agreed. This seems to be just a different way of averaging ratios, as far as I can tell.

The missing information is the number of requests in each 5m period. Without that, a 5 minute period with 1 error in 10 requests (10% error rate) will be treated equally to a 5 minute period with 1,000 errors in 10,000 requests (also a 10% error rate). But the 1,000 errors should contribute significantly more to the overall 30 day error rate than the 1 error.
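The weighting problem can be sketched in a few lines of Python. This is only an illustration: the `(errors, requests)` tuples and the function names are hypothetical and do not come from Sloth or Prometheus. With the numbers above both periods have the same 10% rate, so the two methods happen to agree; the gap appears as soon as the per-period rates differ.

```python
import math

def mean_of_ratios(windows):
    # Average the per-window error ratios; every window counts equally.
    ratios = [errors / requests for errors, requests in windows]
    return sum(ratios) / len(ratios)

def ratio_of_sums(windows):
    # Aggregate numerator and denominator separately, then divide.
    total_errors = sum(errors for errors, _ in windows)
    total_requests = sum(requests for _, requests in windows)
    return total_errors / total_requests

# The two periods from the comment: identical 10% rates, so no difference.
same_rate = [(1, 10), (1_000, 10_000)]

# Illustrative periods with different rates: 10% on low traffic, 0% on high.
different_rate = [(1, 10), (0, 10_000)]

print(mean_of_ratios(same_rate), ratio_of_sums(same_rate))
print(mean_of_ratios(different_rate), ratio_of_sums(different_rate))
```

In the second case the mean of ratios reports 5%, while the request-weighted figure is 1 error in 10,010 requests, roughly 0.01%. Recovering the weighted figure requires the per-period request counts, which is exactly the information that is not retained.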
