Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Raise on data mismatch in str.json_decode #19347

Merged
merged 1 commit into from
Oct 22, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Oct 21, 2024

Fixes #13061

  • Raise if JSON data cannot be decoded to dtype instead of silently appending NULL
  • Raise if there are extra keys in struct data outside of the schema instead of silently ignoring them
print(pl.Series([None, "1"]).str.json_decode(infer_schema_length=1))
# Before
# shape: (2,)
# Series: '' [null]
# [
# 	null
# 	null
# ]
#
# After
# polars.exceptions.ComputeError: error deserializing JSON: error deserializing value "Static(U64(1))" as null. \
#             Try increasing `infer_schema_length` or specifying a schema.
print(
    pl.Series(['{"a": 1}', '{"a": 2, "b": 2}']).str.json_decode(infer_schema_length=1)
)
# Before
# shape: (2,)
# Series: '' [struct[1]]
# [
# 	{1}
# 	{2}
# ]
#
# After
# polars.exceptions.ComputeError: error deserializing JSON: extra key in struct data: b

Todo: I think our NDJSON and JSON readers also suffer from this issue

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 21, 2024
@nameexhaustion nameexhaustion changed the title fix: Raise on data mismatch / extra data in JSON decode fix: Raise on data mismatch in JSON decode Oct 21, 2024
Copy link

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 80.69307% with 39 lines in your changes missing coverage. Please review.

Project coverage is 80.18%. Comparing base (11bd08e) to head (8f4d95a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-json/src/json/deserialize.rs 82.94% 29 Missing ⚠️
crates/polars-json/src/ndjson/deserialize.rs 56.52% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19347      +/-   ##
==========================================
- Coverage   80.21%   80.18%   -0.03%     
==========================================
  Files        1523     1523              
  Lines      209979   210096     +117     
  Branches     2432     2432              
==========================================
+ Hits       168426   168472      +46     
- Misses      40997    41068      +71     
  Partials      556      556              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion changed the title fix: Raise on data mismatch in JSON decode fix: Raise on data mismatch in str.json_decode Oct 21, 2024
@nameexhaustion nameexhaustion force-pushed the json-raise-data-mismatch branch 2 times, most recently from f63e4a3 to fdb806f Compare October 22, 2024 05:59
@@ -287,6 +287,8 @@ where
}
}

let allow_extra_fields_in_struct = self.schema.is_some();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have this behavior where a user can perform a column selection within a struct field by giving a partial schema - this happens because of the existing behavior of JSON deserialization to silently ignore extra fields.

In this PR we change that behavior to raise an error, but to maintain the ability to do this column selection I made it so that we don't raise when we see extra keys if the schema was provided by the user.

_ => None,
BorrowedValue::Static(StaticNode::Null) => None,
_ => {
err_idx = if err_idx == rows.len() { i } else { err_idx };
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure we only push NULL if the value itself was "null", and add tracking of an "error index" to all of the deserializers that tells us the position of the first parsing error.

@ritchie46 ritchie46 merged commit c6b2329 into pola-rs:main Oct 22, 2024
26 checks passed
@nameexhaustion nameexhaustion deleted the json-raise-data-mismatch branch October 28, 2024 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Silent data loss in json_decode with default infer_schema_length
2 participants