feat: Support both "iso" and "iso:strict" format options for dt.to_string #19840

Merged (2 commits) on Nov 20, 2024


alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Nov 18, 2024

A follow-up to enable stricter ISO formatting for dt.to_string() with Datetime.
See #19697 (comment) for the rationale/details 🤔

TLDR

  • "iso": (what we have now) each component is ISO, joined with a space (this was valid ISO 8601 for the majority of the spec's lifetime, but now falls under RFC 3339 - see the comments below for more explanation/details, and the "Spec Arcana" section).
  • "iso:strict": (new) same as the above, but with a "T" separator between the date and time instead of a space; this conforms to the latest ISO spec amendment (ISO 8601-1:2019).

Spec Arcana

Making the "T" mandatory for Datetimes was an amendment made (around five years ago) in ISO 8601-1:2019. Consequently you will see both forms (with a space or with a "T") used widely, with both being referred to as "ISO", depending on the date of implementation and which version of the spec was being targeted.
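
As a minimal sketch of how the two forms can be told apart, the snippet below uses only the Python standard library. The helper names `iso_separator` and `is_strict_iso` are hypothetical (they are not part of polars or this PR), and the code assumes a 10-character `YYYY-MM-DD` date prefix followed by a time part.

```python
from datetime import datetime


def iso_separator(value: str) -> str:
    """Return the character that joins the date and time parts.

    Hypothetical helper: assumes a 'YYYY-MM-DD' prefix followed by a time,
    so the separator is always the character at index 10.
    """
    # Raises ValueError if the string is not parseable at all; for this
    # shape, fromisoformat accepts either a space or a "T" as the separator.
    datetime.fromisoformat(value)
    return value[10]


def is_strict_iso(value: str) -> bool:
    """True when date and time are joined by 'T' (the ISO 8601-1:2019 form)."""
    return iso_separator(value) == "T"


print(is_strict_iso("1980-08-10T00:10:20.000000"))
print(is_strict_iso("1980-08-10 00:10:20.000000"))
```

Both strings describe the same instant; only the separator decides which label applies.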

Also

  • Updated the to_string docstring for Series and DataFrame with additional explanation and examples.
  • Additional/updated unit tests.

Example

from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "dtm": [
        datetime(1980, 8, 10, 0, 10, 20),
        datetime(2010, 10, 20, 8, 25, 35),
        datetime(2040, 12, 30, 16, 40, 50),
    ]
})

df.select(
    pl.col("dtm").dt.to_string("iso").name.suffix(":iso"),
    pl.col("dtm").dt.to_string("iso:strict").name.suffix(":iso_strict"),
)
# shape: (3, 2)
# ┌────────────────────────────┬────────────────────────────┐
# │ dtm:iso                    ┆ dtm:iso_strict             │
# │ ---                        ┆ ---                        │
# │ str                        ┆ str                        │
# ╞════════════════════════════╪════════════════════════════╡
# │ 1980-08-10 00:10:20.000000 ┆ 1980-08-10T00:10:20.000000 │
# │ 2010-10-20 08:25:35.000000 ┆ 2010-10-20T08:25:35.000000 │
# │ 2040-12-30 16:40:50.000000 ┆ 2040-12-30T16:40:50.000000 │
# └────────────────────────────┴────────────────────────────┘

@github-actions github-actions bot added the enhancement, python, and rust labels Nov 18, 2024
@eitsupi
Contributor

eitsupi commented Nov 18, 2024

Excuse me, but why do you insist on joining with a space instead of using the ISO 8601 format?
If you have to stick to it, I don't see any basis for calling it ISO.

I think one of the following changes is necessary:

  1. Rename 'iso' to 'default' or something.
  2. Use T as the separator to follow ISO 8601.

@alexander-beedie
Collaborator Author

alexander-beedie commented Nov 18, 2024

Excuse me, but why do you insist on joining with a space instead of using the ISO 8601 format? If you have to stick to it, I don't see any basis for calling it ISO.

I think I explained it reasonably clearly, but I'll try again to better clarify ;)

It is called "iso" because each part of the output is ISO, the format as a whole is recognised as ISO by essentially everything, and it was also recognised as such by the spec itself for the majority of its existence; this only changed a few years ago (in 2019) when an amendment was made to the spec.

"T" being considered mandatory by the ISO spec is a new development, and only true for ISO 8601-1:2019 and later - for earlier versions of the spec (which dates back some 36 years to 1988) the space is fine.

The space version is preferred as it is the more human-readable of the two, and is still considered to be an ISO format by all tools, libraries, etc (as for the majority of the spec's existence space was considered a valid separator).

As two examples (amongst many), both the Python standard library and the dateutil library recognise the space-separated format as being ISO without prompting...

from datetime import datetime
datetime.fromisoformat("2020-10-08 10:30:45")
# datetime.datetime(2020, 10, 8, 10, 30, 45)

# import the parser submodule explicitly; a bare "import dateutil" may not load it
from dateutil import parser
dtm = parser.isoparse("2020-10-08 10:30:45")
# datetime.datetime(2020, 10, 8, 10, 30, 45)

...and the standard library allows for a space when exporting back to ISO (it defaults to "T", but space is so common that an option to allow it is integrated directly into the ISO format function):

dtm.isoformat(" ")
# '2020-10-08 10:30:45'

I do not consider these usages (either in dateutil or the Python standard library) to be incorrect.

But, to fully conform with the most recent version of the spec, I introduce "iso:strict" in this PR, so that the caller can get the degree of ISO conformance that they need/prefer, as I agree that we should offer this format via a shortcut name given the existence of "iso" (and the full ISO format, or any other format, of course remains available via an explicit strftime format string).
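
To illustrate the explicit-format route mentioned above, here is a small sketch using the standard library's `datetime.strftime`. It is not the polars call itself: the format-string syntax that polars accepts may differ in details (for example for fractional seconds), so treat the directives below as an assumption that holds for Python's `strftime` only.

```python
from datetime import datetime

dtm = datetime(1980, 8, 10, 0, 10, 20)

# "%f" is zero-padded microseconds, giving six fractional digits
# as in the PR's example output.
strict_fmt = "%Y-%m-%dT%H:%M:%S.%f"  # "T" separator
spaced_fmt = "%Y-%m-%d %H:%M:%S.%f"  # space separator

strict_out = dtm.strftime(strict_fmt)
spaced_out = dtm.strftime(spaced_fmt)

print(strict_out)
print(spaced_out)
```

Writing the format out by hand makes the separator choice explicit, which is the fallback when neither shortcut name fits.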

@eitsupi
Contributor

eitsupi commented Nov 18, 2024

Thanks for the clarification!
I understand the information about when it was revised.

This SO thread was also informative.
https://stackoverflow.com/questions/9531524/in-an-iso-8601-date-is-the-t-character-mandatory

The space is certainly used in this 2017 post from ISO itself.
https://www.iso.org/iso-8601-date-and-time-format.html

For example, September 27, 2022 at 6 p.m. is represented as 2022-09-27 18:00:00.000.

That said, given that it is now 2024, it is questionable whether having the default "iso" produce the old ISO format, which does not comply with the current spec, is the right choice.

I don't think there is any equivalence between a function like datetime.fromisoformat being able to parse (i.e., accept both the old and current formats) and writing out only the old format.
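
The asymmetry between parsing and writing can be sketched with the standard library alone (a minimal illustration, not anything from the PR): a parser can accept both separators and yield the same value, while a formatter has to commit to exactly one output form.

```python
from datetime import datetime

spaced = datetime.fromisoformat("2020-10-08 10:30:45")
strict = datetime.fromisoformat("2020-10-08T10:30:45")

# Parsing is lenient: both inputs produce the same datetime value.
same_value = spaced == strict

# Writing must pick one form; the stdlib default separator is "T",
# and the space has to be requested explicitly.
default_out = spaced.isoformat()
spaced_out = spaced.isoformat(sep=" ")

print(same_value, default_out, spaced_out)
```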

At the very least, "iso" should refer to the current format ("current" is a bit of a misnomer, since the "T" format was also valid in the past, right?), and shouldn't something like "iso:legacy" be the one that uses a space instead of "T"?

@ritchie46 ritchie46 merged commit 9f1b40c into pola-rs:main Nov 20, 2024
31 of 32 checks passed
@alexander-beedie alexander-beedie deleted the strict-iso-datetime-format branch November 20, 2024 07:41