-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dtype of quality metrics before and after merging #3497
Conversation
# we can iterate through the columns and convert them back to numbers with | ||
# pandas.to_numeric. coerce allows us to keep the nan values. | ||
for column in metrics.columns: | ||
metrics[column] = pd.to_numeric(metrics[column], errors="coerce") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is ok for me.
pandas behavior is becoming quite cryptic for me.
using old_metrics[col].dtype could be also used no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. I agree Pandas is making their own dtypes like NADType which doesn't play nicely with numpy in my scipts I tend to just query based on numpy stuff). So I don't know for sure. I could test that later. Although for me I would prefer to coerce everything to numpy types since that's what I'm used to. None of my tables are big enough that I worry about dtype inefficiency stuff that Pandas has been working on with the new backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @zm711 this looks great good catch! super useful test too. Weird behaviour from pandas. Some minor comments:
-
I think
new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)
will have the same effect. You loose the coerce on error behaviour but assuming the data is always going to be filled withNaN
this shouldn't be a problem. However it is more implicit and provides less information on the weird pandas behaviour than the loop approach. -
The results of this operation mean all columns are
np.float64
but in the originalmetrics
as returned fromcompute_quality_metrics
some columns areInt64Dtype
. This seems to be dynamic based on contents (e.g. in the test run presence ratio were all1
and it's dtype isInt64Dtype
but presumably it would be a float under most circumstances.num_spikes
I guess will always beint
. The only time I can imagine this being a problem is if some equality check is performed e.g.num_spikes == 1
which might work for the originalcompute_quality_metrics
output but fail after merging as data will be float. So maybe it is simplest just to castnum_spikes
->Int64Dtype
and leave the rest as float?
Thanks so much @JoeZiminski!
I'm no Pandas expert so I'm happy to have changes here if they are better! I just don't know have an intuition for what is the smartest strategy so if you know Pandas really well then I'll make the change :)
True. This is our mistake for letting Pandas infer. presence ratio is a float between 0 and 1. But if they are all 0 or all 1 it casts to int for memory purposes. Users should never assume it is an int although it could be in extreme cases. It would be better for us to explicitly make it a float and take the memory hit in my opinion.
This is true and when I scanned the table I forgot about this one. It would be better to make that one an
<3 Thanks. I figure we really need to protect ourselves from some of these small regressions. So I'm trying :) |
@alejoe91, do you have any opinions of implementing this? Happy to change to a different method if you prefer something. I think the only thing we are failing to maintain is |
assert len(metrics.index) > len(new_metrics.index) | ||
|
||
# dtype should be fine after merge but is cast from Float64->float64 | ||
assert np.float64 == new_metrics["snr"].dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add a test on int coercion if we end up using the suggestion here: https://github.com/SpikeInterface/spikeinterface/pull/3497/files#r1827487180
Co-authored-by: Alessio Buccino <[email protected]>
for more information, see https://pre-commit.ci
So the problem is that Pandas will infer the dtype and sometimes this is actually wrong. Like the presence ratio above which should technically always be a float between 0.0 and 1.0, but if it is all 1s and 0s will be stored as an int. Then if we merge and get a fraction then the dtype is wrong. I think it might be better to hard code the int64 for I basically implemented Sam's idea. But this fails. Unless we hard code the dtype of the different metrics rather than allow Pandas to infer them. What do people think about me adding a line to coerce everything in the original calculator to |
Okay so changes in this PR
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fix the conflict. Let me know what you all think :)
# we have one issue where the name of the columns for synchrony are named based on | ||
# what the user has input as arguments so we need a way to handle this separately | ||
# everything else should be handled with the column name. | ||
if "sync" in column: | ||
metrics[column] = metrics[column].astype(column_name_to_column_dtype["sync"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue we keep this for backward compatibility no? I could add a comment saying we can simplify this in a couple versions.
Never mind. After Chris's updates I thought this would go in cleanly other than one conflict. His changes actually require me to make additional updates to this PR... |
Now we are finally ready :) |
"sync_spike_2", | ||
"sync_spike_4", | ||
"sync_spike_8", | ||
], # we probably shouldn't hard code this. This is determined by the arguments in the function... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I think we agreed that we can just hard-code this at the QM level, so it should be ok.
Let's keep the comment until this is actually hard-coded!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this was done already: #3559
"amplitude_median": float, | ||
"amplitude_cv_median": float, | ||
"amplitude_cv_range": float, | ||
"sync": float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then this could become sync_spike_2
, sync_spike_4
, sync_spike_8
as well
@zm711 since we hard-coded the sync sizes in #3559 I simplified the PR, so we don't have to deal with them differently. For back compatibility, I just added a check that casts the column values only if the column name is in |
Okay cool. Works for me :) |
MRE
Basically when you copy a dataframe from a previous dataframe columns it forces the dtype to be object instead of numeric.
Easy Solution
Using the
to_numeric
will bring us back to numeric values.Caveats
This switches the dtype from the Pandas
Float64
to the numpyfloat64
. I don't think this is too bad, since doing the queries should still be fine no?Testing
i added a small test to test merging, but let me know if we'd prefer not to have it.