Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix #19697

Merged
merged 5 commits into from
Nov 16, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Nov 8, 2024

Closes #7174.

(Supersedes the now-closed #19663).

New feature

  • Support applying dt.to_string() to Duration cols, transforming them to an ISO8601 duration string (see: https://en.wikipedia.org/wiki/ISO_8601#Durations for details on this format).
  • The to_string expression (for all temporal dtypes) can now take the shortcut format string "iso", which represents the dtype-specific ISO8601 format for the given column(s). Setting no explicit format is equivalent to setting "iso", leading to some nice ergonomic improvements (detailed below).

(Note: the ISO duration string conversion was validated against an established package that does the same conversion; created an ad-hoc parametric test against many tens of thousands of randomly-generated timedeltas, and everything tied-out. Included various edge-cases in the unit tests).

Ergonomics

  • If you do not supply a format string (previously not possible) or set it to"iso", the associated dtype-specific ISO8601 format is used. This means that you can now make a single expression that will properly format all temporal columns with their dtype-appropriate ISO format, rather than have to separately associate an explicit format string with each dtype.

    So, can now write...

    df.select(cs.temporal().dt.to_string())

    ...instead of:

    df.select(
      cs.datetime("ms").dt.to_string("%F %T.3f"),
      cs.datetime("us").dt.to_string("%F %T.6f"),
      cs.datetime("ns").dt.to_string("%F %T.9f"),
      cs.datetime("ms","*").dt.to_string("%F %T.3f%:z"),
      cs.datetime("us","*").dt.to_string("%F %T.6f%:z"),
      cs.datetime("ns","*").dt.to_string("%F %T.9f%:z"),
      cs.date("%F"),
      cs.time("ms").dt.to_string("%T.3f"),
      cs.time("us").dt.to_string("%T.6f"),
      cs.time("ns").dt.to_string("%T.9f"),
    )

Performance

  • Consolidated the four duration → string functions (and the new "iso" option) down to two, then optimised them using itoa conversions and push/push_str calls instead of relying on write! to do the integer transforms into the output string (~3x faster in local testing) 🚀

Bugfix

  • Transforming a timezone-aware column to string (via cast) did not include the expected ISO timezone offset - spotted this with some new tests and fixed it.

Examples

from datetime import datetime, date, timedelta, time
import polars as pl

df = pl.DataFrame({
    "td": [
        timedelta(days=-1, seconds=-42),
        timedelta(days=14, hours=-10, microseconds=1001),
        timedelta(seconds=0),
    ],
    "tm": [
        time(1, 2, 3, 456789),
        time(23, 59, 9, 101),
        time(0, 0, 0, 1),
    ],
    "dt": [
        date(1999, 3, 1),
        date(2020, 5, 3),
        date(2077, 7, 5),
    ],
    "dtm": [
        datetime(1980, 8, 10, 0, 10, 20),
        datetime(2010, 10, 20, 8, 25, 35),
        datetime(2040, 12, 30, 16, 40, 50),
    ],
}).with_columns(
    dtm_tz = pl.col("dtm").dt.replace_time_zone("Asia/Kathmandu")
)

Cast to dtype-specific ISO8601 string (the default, if no explicit format string given):

df.select(pl.col("td","tm","dt").dt.to_string())
# shape: (3, 3)
# ┌───────────────────┬─────────────────┬────────────┐
# │ td                ┆ tm              ┆ dt         │
# │ ---               ┆ ---             ┆ ---        │
# │ str               ┆ str             ┆ str        │
# ╞═══════════════════╪═════════════════╪════════════╡
# │ -P1DT42.S         ┆ 01:02:03.456789 ┆ 1999-03-01 │
# │ P13DT14H0.001001S ┆ 23:59:09.000101 ┆ 2020-05-03 │
# │ PT0S              ┆ 00:00:00.000001 ┆ 2077-07-05 │
# └───────────────────┴─────────────────┴────────────┘

df.select(pl.col("dtm","dtm_tz").dt.to_string())
# shape: (3, 2)
# ┌────────────────────────────┬──────────────────────────────────┐
# │ dtm                        ┆ dtm_tz                           │
# │ ---                        ┆ ---                              │
# │ str                        ┆ str                              │
# ╞════════════════════════════╪══════════════════════════════════╡
# │ 1980-08-10 00:10:20.000000 ┆ 1980-08-10 00:10:20.000000+05:30 │
# │ 2010-10-20 08:25:35.000000 ┆ 2010-10-20 08:25:35.000000+05:45 │
# │ 2040-12-30 16:40:50.000000 ┆ 2040-12-30 16:40:50.000000+05:45 │
# └────────────────────────────┴──────────────────────────────────┘

Durations do not support strftime; instead to_string recognises only "iso" or "polars" as formats for durations - the latter results in the same string output that we display in the frame repr:

df = pl.DataFrame(
    {
        "td1": [
            timedelta(days=-1),
            timedelta(days=0),
        ],
        "td2": [
            timedelta(weeks=4, days=-10, microseconds=999999),
            timedelta(weeks=20, days=20, hours=13, minutes=45, seconds=22, microseconds=134455),
        ],
    },
    schema={"td1": pl.Duration("ms"), "td2": pl.Duration("ns")},
)

df.select(
    pl.all().dt.to_string("iso").name.suffix("(iso)"),
    pl.all().dt.to_string("polars").name.suffix("(pl)"),
)
# shape: (2, 4)
# ┌──────────┬────────────────────────┬─────────┬───────────────────────────┐
# │ td1(iso) ┆ td2(iso)               ┆ td1(pl) ┆ td2(pl)                   │
# │ ---      ┆ ---                    ┆ ---     ┆ ---                       │
# │ str      ┆ str                    ┆ str     ┆ str                       │
# ╞══════════╪════════════════════════╪═════════╪═══════════════════════════╡
# │ -P1D     ┆ P18DT0.999999S         ┆ -1d     ┆ 18d 999999µs              │
# │ PT0S     ┆ P160DT13H45M22.134455S ┆ 0ms     ┆ 160d 13h 45m 22s 134455µs │
# └──────────┴────────────────────────┴─────────┴───────────────────────────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 8, 2024
@alexander-beedie alexander-beedie changed the title feat: Support use of Duration dtype in to_string feat: Support use of Duration dtype in to_string, ergonomic improvement, tz-aware Datetime bugfix Nov 8, 2024
@alexander-beedie alexander-beedie changed the title feat: Support use of Duration dtype in to_string, ergonomic improvement, tz-aware Datetime bugfix feat: Support use of Duration dtype in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix Nov 8, 2024
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice one, just left some comments!

i also notice that this does about 2-3 things together - is it possible to do any of them separately (like the tz-aware datetime bugfix)?

py-polars/polars/expr/datetime.py Show resolved Hide resolved
crates/polars-core/src/chunked_array/temporal/duration.rs Outdated Show resolved Hide resolved
crates/polars-core/src/fmt.rs Outdated Show resolved Hide resolved
crates/polars-core/src/fmt.rs Outdated Show resolved Hide resolved
@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 2 times, most recently from 9f6c6f3 to 646cbf6 Compare November 8, 2024 14:42
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Nov 8, 2024

i also notice that this does about 2-3 things together - is it possible to do any of them separately (like the tz-aware datetime bugfix)?

The bug is fixed as a side-effect of explicitly/centrally declaring the right ISO strings, and everything else is connected, so there isn't really anything to separate out 😅

@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 3 times, most recently from 546fb33 to 9d99e1e Compare November 8, 2024 14:53
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label Nov 8, 2024
@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 9 times, most recently from fef1443 to 620a23f Compare November 9, 2024 11:26
@alexander-beedie alexander-beedie changed the title feat: Support use of Duration dtype in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix feat: Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix Nov 9, 2024
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for explaining!

I think the logic is right (still doing a final check), but I do want to check about what's going on in Display::fmt. It looks to me like the current logic is very careful to just write to f, and that fmt_duration_ns / fmt_duration_us / fmt_duration_ms only do integer arithmetic without allocating any new Strings

With this PR, on the other hand, fmt_duration_string always allocates a new String

A comment Ritchie had made on something similar some time ago was:

But those string allocations will make this slower down the line as well. As they fragment your heap.

So, I need to defer to @ritchie46 on the perf / memory implications here

crates/polars-core/src/fmt.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 79.56%. Comparing base (8cb7839) to head (95b578e).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...lars-core/src/chunked_array/temporal/conversion.rs 66.66% 7 Missing ⚠️
crates/polars-core/src/fmt.rs 94.44% 5 Missing ⚠️
...polars-core/src/chunked_array/temporal/duration.rs 84.61% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #19697    +/-   ##
========================================
  Coverage   79.55%   79.56%            
========================================
  Files        1544     1544            
  Lines      213240   213341   +101     
  Branches     2441     2442     +1     
========================================
+ Hits       169643   169741    +98     
- Misses      43048    43051     +3     
  Partials      549      549            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 2 times, most recently from 85dbbce to f37f800 Compare November 13, 2024 21:47
@ritchie46 ritchie46 merged commit 5476332 into pola-rs:main Nov 16, 2024
25 of 26 checks passed
@alexander-beedie alexander-beedie deleted the duration-to-string branch November 16, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Duration => Utf8 cast; currently returns stringified physical (Int64) representation
4 participants