fix: Fix panic with empty delta scan, or empty parquet scan with a provided schema #19884

Merged
merged 12 commits into from
Nov 21, 2024

Conversation

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Nov 20, 2024

Fixes #19876
Fixes #19854
Fixes #19890

When scan_parquet is called on an empty directory and a schema is provided, it should return an empty DataFrame. This case is currently not handled in some areas of the code, which causes errors and panics.

Also fixes a mypy lint issue caused by a new pydantic release.

@nameexhaustion nameexhaustion changed the title from "fix:" to "fix: Fix panic with empty delta scan, or empty parquet scan with a provided schema" Nov 20, 2024
@github-actions github-actions bot added the fix, python, and rust labels and removed the title needs formatting label Nov 20, 2024
- !expanded_paths.is_empty() && (paths[0].as_ref() != expanded_paths[0].as_ref())
+ // For cloud paths, we determine that the input path isn't a file by checking that the
+ // output path differs.
+ expanded_paths.is_empty() || (paths[0].as_ref() != expanded_paths[0].as_ref())
Collaborator Author

Fix this to also cover the empty directory case
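
The effect of this change can be illustrated with a small Python model of the predicate in the diff above. This is only a sketch of the cloud-path check, not the Rust implementation: the function names are hypothetical, and paths are modeled as plain strings.

```python
from typing import Sequence


def expanded_from_directory_old(paths: Sequence[str], expanded: Sequence[str]) -> bool:
    # Old check (hypothetical Python model): an empty expansion is never
    # treated as coming from a directory.
    return len(expanded) > 0 and paths[0] != expanded[0]


def expanded_from_directory_new(paths: Sequence[str], expanded: Sequence[str]) -> bool:
    # New check: an empty expansion (empty directory) also counts, as does
    # an expansion whose first entry differs from the input path.
    return len(expanded) == 0 or paths[0] != expanded[0]


# Illustrative inputs only; the bucket and file names are made up.
cases = {
    "single file": (["s3://bucket/a.parquet"], ["s3://bucket/a.parquet"]),
    "directory": (["s3://bucket/data/"], ["s3://bucket/data/0.parquet"]),
    "empty directory": (["s3://bucket/empty/"], []),
}

# Map each case to (old result, new result).
results = {
    name: (expanded_from_directory_old(p, e), expanded_from_directory_new(p, e))
    for name, (p, e) in cases.items()
}

for name, (old, new) in results.items():
    print(f"{name}: old={old}, new={new}")
```

Only the empty-directory case changes between the two versions in this model; the single-file and non-empty-directory cases are classified the same way as before.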

{
polars_bail!(
ComputeError:
"a hive schema was given but hive_partitioning was disabled"
Collaborator Author

Raise an error if someone accidentally calls scan_parquet(..., hive_schema={...}, hive_partitioning=False).
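
As a rough illustration of the check described in this comment, here is a minimal Python sketch. The helper name `check_hive_options` is hypothetical, and `ValueError` stands in for the `ComputeError` raised on the Rust side; the exact condition used in the PR is not fully shown in the diff, so this models only the case named in the comment.

```python
from typing import Mapping, Optional


def check_hive_options(
    hive_partitioning: Optional[bool],
    hive_schema: Optional[Mapping[str, str]],
) -> None:
    # Hypothetical helper: reject a hive schema when hive partitioning was
    # explicitly disabled. None means "not specified" and is allowed.
    if hive_schema is not None and hive_partitioning is False:
        raise ValueError("a hive schema was given but hive_partitioning was disabled")


# Explicitly disabled plus a schema: rejected.
try:
    check_hive_options(False, {"p": "Int64"})
except ValueError as exc:
    print(f"rejected: {exc}")

# Unspecified or enabled: accepted.
check_hive_options(None, {"p": "Int64"})
check_hive_options(True, {"p": "Int64"})
```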

file_options.hive_options.hive_start_idx = hive_start_idx;

Ok(Self::Paths(expanded_paths))
},
v => {
file_options.hive_options.enabled = Some(false);
Collaborator Author

Don't override this here

if file_options.hive_options.enabled.is_none()
&& expanded_from_single_directory(paths, expanded_paths.as_ref())
{
file_options.hive_options.enabled = Some(true);
Collaborator Author

This does the same thing (only overwriting if the existing value is None). The old code used hive_enabled.unwrap_or_else, which made this a bit hard to see.
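
The "only overwrite when unset" pattern can be sketched in Python as follows. This models just the `if` shown in the snippet above; the function name `resolve_hive_enabled` is hypothetical, and what happens to a value that is still `None` afterwards is not covered here.

```python
from typing import Optional


def resolve_hive_enabled(
    enabled: Optional[bool],
    expanded_from_single_directory: bool,
) -> Optional[bool]:
    # An explicit user choice (True or False) is never overridden; only an
    # unset value (None) is switched on for a single-directory expansion.
    if enabled is None and expanded_from_single_directory:
        return True
    return enabled


print(resolve_hive_enabled(None, True))
print(resolve_hive_enabled(False, True))
print(resolve_hive_enabled(None, False))
```

Writing the condition as an explicit `is None` check keeps the "user setting wins" rule visible at the call site, which is the readability point made in the comment.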

tmp_path.mkdir(exist_ok=True)
path = str(tmp_path)
df = pl.DataFrame({}, [("p", pl.Int64)])
df.write_delta(path)
Collaborator Author

@nameexhaustion nameexhaustion Nov 20, 2024

@ion-elgreco, can you check this one? I get an error when writing empty tables with deltalake 0.21.0.

Contributor

@ion-elgreco ion-elgreco Nov 20, 2024

You should create empty tables with something like this: DeltaTable.create(path, schema=df.to_arrow().schema)

Collaborator Author

I see - thanks for the tip!


codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.43%. Comparing base (5f61d70) to head (a4598a3).
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19884       +/-   ##
===========================================
+ Coverage   59.38%   79.43%   +20.05%     
===========================================
  Files        1554     1554               
  Lines      215612   215611        -1     
  Branches     2452     2452               
===========================================
+ Hits       128035   171280    +43245     
+ Misses      87019    43773    -43246     
  Partials      558      558               


@nameexhaustion nameexhaustion marked this pull request as ready for review November 21, 2024 04:44
@ramonvermeulen-asml

Thanks for putting in the effort @nameexhaustion! Let me know if I can test some stuff to ensure this resolves #19890

e.g. building from source and testing locally

@ritchie46 ritchie46 merged commit 927b7b8 into pola-rs:main Nov 21, 2024
28 checks passed
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
4 participants