unexpected behavior for missing data #260

damianr99 · 2021-08-08T18:33:33Z

Not sure if this is intentional, but I found this behavior surprising:

(dfn/+ (:a (ds/->dataset {:a [1 2 nil 4]})) 1)
[2 3 -9223372036854775807 5]

(dfn/+ (:a (ds/->dataset {:a [1 2.0 nil 4]})) 1)
[2.0 3.0 ##NaN 5.0]

Possibly related: missing values don't sort. I'd expect them to move to either the front or the end, but they don't budge.

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a <)
_unnamed [7 1]:

|   :a |
|------|
| -Inf |
|  NaN |
|  1.0 |
|  2.0 |
|      |
|  4.0 |
|  Inf |
(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a >)
_unnamed [7 1]:

|   :a |
|------|
|  Inf |
|  NaN |
|  4.0 |
|  2.0 |
|      |
|  1.0 |
| -Inf |

This is using version "6.010". Thanks!


cnuernber · 2021-08-08T20:12:20Z

These are great.

The first is half-intentional: dfn is a namespace that is much lower level than the dataset column namespace, and it has no knowledge of missing values. The official recommendation is to clear out missing values before you start numeric processing on the dataset. The deeper fix would be for the dtype-next architecture to know about missing values and to use float64 or object space if the column has any, since either nil or NaN is a valid missing-value marker there. Because there is no :int64 equivalent of NaN, I use Long/MIN_VALUE when I have to write a long into an array of data, and dfn picks this up. There would be similar issues for any of the other integer types. The tack I took for tmdjs is that all math for numeric columns is done in float64 space, so NaN is always an option; this issue has a solid answer at least for the ClojureScript version.
The second (sorting) is definitely confusing and I agree with your analysis: when sorting by column, missing values should go either first or last. We should check pandas and do whatever they do.

Both are valid points, and the first especially is irksome and potentially corrupting.
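To make the sentinel effect concrete, here is a minimal sketch in plain Clojure. It does not call tech.ml.dataset or dtype-next; the names `missing-long`, `sentinel-plus-one`, `nan-plus-one`, and `cleaned` are illustrative only, and the nil filtering at the end is a plain-vector stand-in for "clear out missing values first", not the library's own API.

```clojure
;; Plain Clojure only: an illustration of the effect, not library code.

;; Placeholder written into a long array when a value is missing
;; (per the comment above, Long/MIN_VALUE is used for :int64).
(def missing-long Long/MIN_VALUE)

;; Arithmetic treats the placeholder as an ordinary number, which is
;; where the -9223372036854775807 in the report comes from.
(def sentinel-plus-one (+ missing-long 1))

;; In float64 space the missing marker is NaN, and NaN survives arithmetic.
(def nan-plus-one (+ Double/NaN 1))

;; Stand-in for removing missing values before doing math
;; (hypothetical: uses a plain vector, not a dataset column).
(def cleaned (mapv inc (remove nil? [1 2 nil 4])))
```

The difference between the two results in the report follows from this: a long column has no value that arithmetic will preserve as "missing", while a double column does.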

damianr99 · 2021-08-09T05:33:11Z

Apologies, I didn't see the recommendation to clear out missing values first. I was copying from the examples (e.g. https://techascent.github.io/tech.ml.dataset/walkthrough.html#elementwise-operations). A number of the dataset examples in the documentation drop down to using tech.v3.datatype functions. It's a little unclear for a new user what the pitfalls of doing that are.

cnuernber · 2021-08-09T15:03:21Z

I don't know if that recommendation is documented; it has just been discussed on Zulip. There certainly are pitfalls :-).

Here are some things that may help this situation:

I think we can mitigate the first issue by adding a protocol method to dtype-next, operational-elemwise-datatype, alongside elemwise-datatype. The distinction is that some containers may have to advertise a more general datatype than the specific container type in order to correctly interpret both values that the elemwise-datatype can represent and values that it cannot.

For numeric types, the operational datatype would be :float64 if there were missing values; otherwise it would match the actual datatype. Then update the code in dispatch.clj to respect this, and at least all of the math operations in tech.v3.datatype.functional would work as correctly as possible with missing values.
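A rough sketch of that rule in plain Clojure, under stated assumptions: `numeric-dtypes`, `operational-datatype`, and `add-scalar` are hypothetical names invented for illustration, nil stands in for a missing entry, and none of this is the dtype-next protocol or the dispatch.clj code.

```clojure
;; Hypothetical sketch; not the dtype-next protocol or its API.
(def numeric-dtypes
  #{:int8 :int16 :int32 :int64 :uint8 :uint16 :uint32 :uint64
    :float32 :float64})

(defn operational-datatype
  "Datatype to do math in: :float64 when a numeric column has missing
  values, otherwise the container's own elemwise datatype."
  [elemwise-datatype has-missing?]
  (if (and has-missing? (contains? numeric-dtypes elemwise-datatype))
    :float64
    elemwise-datatype))

(defn add-scalar
  "Adds scalar to every value, widening to doubles when the operational
  datatype is :float64. nil marks a missing entry and becomes NaN."
  [values elemwise-datatype scalar]
  (let [has-missing? (boolean (some nil? values))
        op-dtype     (operational-datatype elemwise-datatype has-missing?)]
    (if (= op-dtype :float64)
      (mapv #(if (nil? %) Double/NaN (+ (double %) scalar)) values)
      (mapv #(+ % scalar) values))))
```

With this rule, `(add-scalar [1 2 nil 4] :int64 1)` is computed in double space and the missing slot stays NaN, while `(add-scalar [1 2 4] :int64 1)` stays in long arithmetic.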

For the second (sorting of nil values), perhaps we add a new option for sort, {:missing-policy #{:first :last :exception}}, which defaults to whatever pandas does; then at least the result format will be standardized.
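One way the proposed policy could behave, sketched on a plain Clojure sequence. `sort-with-missing` is a hypothetical helper, not an existing tech.ml.dataset function, and treating NaN as missing alongside nil is an assumption made for this sketch.

```clojure
;; Hypothetical sketch of a :missing-policy for sorting; not library code.
(defn sort-with-missing
  "Sorts the present values with cmp and places missing values (nil or
  NaN, by assumption) according to missing-policy."
  [values cmp missing-policy]
  (let [missing? #(or (nil? %)
                      (and (number? %) (Double/isNaN (double %))))
        present  (sort cmp (remove missing? values))
        missing  (filter missing? values)]
    (case missing-policy
      :first     (vec (concat missing present))
      :last      (vec (concat present missing))
      :exception (if (seq missing)
                   (throw (ex-info "Column contains missing values"
                                   {:missing-count (count missing)}))
                   (vec present)))))

;; Usage: missing entries are grouped at the end instead of staying put.
(def sorted-last
  (sort-with-missing [1 ##NaN 2 ##Inf nil 4 ##-Inf] < :last))
```

For comparison, pandas' `sort_values` places missing values last by default (`na_position='last'`), so `:last` would be the matching default here.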

And finally, the documentation could really be improved here, especially for first-time users. I think the tablecloth project is much further along this pathway, and that is the current focus of the scicloj team.

This was referenced Aug 15, 2021

Provide base datatype support for missing cnuernber/dtype-next#33

Closed

Argsort needs to work sanely with NaN cnuernber/dtype-next#34

Closed

cnuernber added a commit that referenced this issue Aug 18, 2021

Partial fix for #260

71ecdfa

cnuernber closed this as completed in fde778c Aug 18, 2021
