unexpected behavior for missing data #260

Closed
damianr99 opened this issue Aug 8, 2021 · 3 comments

@damianr99

Not sure if this is intentional, but I found this behavior surprising:

(dfn/+ (:a (ds/->dataset {:a [1 2 nil 4]})) 1)
[2 3 -9223372036854775807 5]

(dfn/+ (:a (ds/->dataset {:a [1 2.0 nil 4]})) 1)
[2.0 3.0 ##NaN 5.0]

Possibly related: missing values don't sort. I'd expect them to move to either the front or the end, but they don't budge:

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a <)
_unnamed [7 1]:

|   :a |
|------|
| -Inf |
|  NaN |
|  1.0 |
|  2.0 |
|      |
|  4.0 |
|  Inf |
main> (ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a >)
_unnamed [7 1]:

|   :a |
|------|
|  Inf |
|  NaN |
|  4.0 |
|  2.0 |
|      |
|  1.0 |
| -Inf |
main> 

This is using version "6.010". Thanks!

@cnuernber
Collaborator

These are great.

  • The first is half-intentional. dfn is a much lower-level namespace than the dataset column namespace, and it has no knowledge of missing values. The official recommendation is to clear out missing values before you start numeric processing on the dataset. The deeper fix would be to have the dtype-next architecture know about missing values and use float64 or object space if the column has any, since either nil or NaN is a valid missing-value marker there. Because there is no NaN equivalent for :int64, I use Long/MIN_VALUE when I have to write a missing value into a long array, and dfn picks this up. There would be similar issues for any of the other integer types. The tack I took for tmdjs is that all math for numeric columns is done in float64 space, so NaN is always an option; this issue has a solid answer at least for the ClojureScript version.

  • The second (sorting) is definitely confusing, and I agree with your analysis: when sorting by column, missing values should go either first or last. We should check pandas and do whatever it does.

Both are valid points, and the first is especially irksome and potentially corrupting.
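To make the first point concrete, here is a minimal plain-Clojure sketch of why the integer result in the report looks the way it does. It does not use tech.ml.dataset or dtype-next; the helper name `->float64-with-nan` is hypothetical and only imitates the "do the math in float64 space" idea described above.

```clojure
;; Long/MIN_VALUE is the stand-in for a missing value in a long array
;; (per the comment above). Adding 1 to it gives the number from the report.
(def sentinel Long/MIN_VALUE)

(+ sentinel 1)
;; => -9223372036854775807

;; NaN propagates through arithmetic, so a float64 column keeps its
;; missing marker after (dfn/+ col 1)-style math.
(Double/isNaN (+ Double/NaN 1))
;; => true

;; Hypothetical helper: promote values to doubles, mapping nil to NaN,
;; before doing any arithmetic.
(defn ->float64-with-nan
  [xs]
  (mapv #(if (nil? %) Double/NaN (double %)) xs))

(def incremented
  (mapv #(+ % 1) (->float64-with-nan [1 2 nil 4])))
;; incremented holds 2.0, 3.0, NaN, 5.0
```

The integer path silently turns a missing value into an ordinary-looking number, whereas the float64 path keeps it recognizable as missing.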

@damianr99
Author

Apologies, I didn't see the recommendation to clear out missing values first. I was copying from the examples (e.g. https://techascent.github.io/tech.ml.dataset/walkthrough.html#elementwise-operations). A number of the dataset examples in the documentation drop down to using tech.v3.datatype functions. It's a little unclear for a new user what the pitfalls of doing that are.

@cnuernber
Collaborator

cnuernber commented Aug 9, 2021

I don't know if that recommendation is documented; it has just been discussed on Zulip. There certainly are pitfalls :-).

Here are some things that may help this situation:

I think we can mitigate the first issue by adding a protocol method to dtype-next, operational-elemwise-datatype, alongside elemwise-datatype. The distinction is that some containers may have to advertise a more general datatype than the specific container type in order to correctly interpret both values that elemwise-datatype can represent and values that it cannot.

For numeric types, the operational datatype would be :float64 if there were missing values; otherwise it would match the actual datatype. We would then update the code in dispatch.clj to respect this, and at least all of the math operations in tech.v3.datatype.functional would work as correctly as possible with missing values.
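The rule for numeric types could be sketched roughly as follows. This is plain Clojure and not dtype-next code: the function `operational-datatype`, the set `numeric-datatypes`, and the `{:datatype ... :missing ...}` map describing a column are all hypothetical stand-ins for whatever the real protocol method would see.

```clojure
;; Assumed list of numeric datatypes, for illustration only.
(def numeric-datatypes
  #{:int8 :int16 :int32 :int64 :float32 :float64})

(defn operational-datatype
  "Hypothetical: the datatype math should be carried out in.
  `column` is a map with :datatype (a keyword) and :missing
  (a collection of row indexes that hold missing values)."
  [{:keys [datatype missing]}]
  (if (and (contains? numeric-datatypes datatype)
           (seq missing))
    :float64   ; missing values present: widen so NaN can represent them
    datatype)) ; nothing missing, or not numeric: keep the actual datatype

(operational-datatype {:datatype :int64 :missing #{2}})
;; => :float64

(operational-datatype {:datatype :int64 :missing #{}})
;; => :int64
```

Under this sketch, a dispatch layer would ask for the operational datatype instead of the storage datatype when choosing which arithmetic path to take.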

For the second (sorting of nil values), perhaps we add a new option for sort, {:missing-policy #{:first :last :exception}}, which defaults to whatever pandas does; then at least the result format will be standardized.

Finally, the documentation could really be improved here, especially for first-time users. I think the tablecloth project is much further along this pathway, and that is the current focus of the scicloj team.
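A rough sketch of what such a policy might do, written against plain Clojure sequences instead of datasets. The function `sort-with-missing`, the helper `missing?`, and the `:compare-fn` option are hypothetical; only the `:missing-policy` key and its three values come from the proposal. The default of `:last` is an assumption based on pandas, where `sort_values` defaults to `na_position='last'`. Treating NaN as missing alongside nil is also an assumption of this sketch.

```clojure
(defn missing?
  "Hypothetical: nil and NaN both count as missing."
  [x]
  (or (nil? x)
      (and (number? x) (Double/isNaN (double x)))))

(defn sort-with-missing
  "Sorts the non-missing values of xs and places the missing ones
  according to :missing-policy (:first, :last, or :exception)."
  ([xs] (sort-with-missing xs {}))
  ([xs {:keys [compare-fn missing-policy]
        :or   {compare-fn compare missing-policy :last}}]
   (let [missing (filter missing? xs)
         sorted  (sort compare-fn (remove missing? xs))]
     (case missing-policy
       :first     (concat missing sorted)
       :last      (concat sorted missing)
       :exception (if (seq missing)
                    (throw (ex-info "Missing values present"
                                    {:missing-count (count missing)}))
                    sorted)))))

(def xs [1 ##NaN 2 ##Inf nil 4 ##-Inf])

(sort-with-missing xs)
;; => (##-Inf 1 2 4 ##Inf ##NaN nil)

(sort-with-missing xs {:compare-fn > :missing-policy :first})
;; => (##NaN nil ##Inf 4 2 1 ##-Inf)
```

Separating the missing values before sorting keeps NaN out of the comparator entirely, which avoids the "doesn't budge" behavior shown in the report.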
