-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Improve histogram bin logic #18761
Conversation
d331adc
to
3a9fa0a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18761 +/- ##
==========================================
+ Coverage 79.55% 79.58% +0.02%
==========================================
Files 1544 1544
Lines 213240 213307 +67
Branches 2441 2441
==========================================
+ Hits 169643 169757 +114
+ Misses 43048 43001 -47
Partials 549 549 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
@alexander-beedie can you take a look at this one? |
Looks like an improvement to me, but I'll see if I can run it by one of our developers (the one who spotted the panic) for bonus feedback tomorrow 👍 |
just sharing my thoughts: Why deviate from numpynumpy interval is opened the other side.
pl.Series([2, 5, 8]).hist(bin_count=4)
shape: (4, 3)
┌────────────┬──────────────┬───────┐
│ breakpoint ┆ category ┆ count │
│ --- ┆ --- ┆ --- │
│ f64 ┆ cat ┆ u32 │
╞════════════╪══════════════╪═══════╡
│ 3.5 ┆ (1.994, 3.5] ┆ 1 │
│ 5.0 ┆ (3.5, 5.0] ┆ 1 │
│ 6.5 ┆ (5.0, 6.5] ┆ 0 │
│ 8.0 ┆ (6.5, 8.0] ┆ 1 │
└────────────┴──────────────┴───────┘
np.histogram([2, 5, 8], bins=4)
(array([1, 0, 1, 1]), array([2. , 3.5, 5. , 6.5, 8. ])) Return Struct with Edges (feature)The current "category" is almost unusable since it is a string/cat. example
I can also open a feature request if this is too much for this PR? |
@Julian-J-S I believe @MarcoGorelli has an issue open for exactly the point you make. I did not want to address that in this PR as I didn't want to cover too much ground but I wholeheartedly agree. Regarding your other issue, I think the bounds difference from numpy is to align with pandas. In my opinion, the type of interval is probably not super important to the histogram, but I do agree about the edge bins being ugly. |
Yup, lets split the follow-up here - this PR can improve what is present right now (by fixing the panic and returning more reasonable bins), and then we can look to potentially retool it more comprehensively as a separate (and likely breaking) issue/PR. |
6edf934
to
c653d16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, working through my backlog...😅 @ritchie46 The improved behaviour gets my vote; further changes can be left to a subsequent PR. As the function is marked as unstable (and this makes several fixes) I don't see a problem merging 👍
Just my opinion/thoughts 😄
On another note: |
I'm happy to update the logic in this PR, but I think we decided first to fix the obviously broken behavior before revamping the "working but odd" behavior. |
Yup; let's get the fixes out first and then follow-up with the behavioural improvements 👌 |
Closes #18650.
This is a revamp of
hist
and how it deals with bin edges when data is not provided, with the intent of better aligning the defaults with numpy. Bins are still left half-open intervals, i.e. of the form(a, b]
. Here are the main changes:bin_count
norbins
is provided, 10 bins are created. The bins are all equally-spaced, spanning from the minimum value to the maximum value. The lowest bin's left edge is increase by 0.1% the span of the entire data (in line with pandas, and our existing implementation) to include the lowest value. This way, all values are included in a call toseries.hist()
with no arguments, orbin_count
provided:bin_counts
is used). The default bin count is still 10.bins
is provided, it is always respected.bins
is provided with only a single edge, then it is zero-width, and no values are binned. The result is empty:bins
andbin_counts
:This implementation is a bit cleaner in that it separates the bin-creation logic from the binning logic. It also includes a check for uniformity of bins, even if bins are provided by the user, enabling the fast path to be used in more cases. The fast path and the slower path are now a bit more obvious in separate functions, and I think that the new slow should be a bit faster than the older slow path. The new fast path revives an additional check that fixes a few floating point issues when dealing with integers but it's a bit improved (performance wise) compared to the old one.
I do think that
hist
could use some improvements with ainclude_lower
parameter to avoid the "ugly" lowest bin which is extended by 0.1%, but that can be for another day.