Avoid generating duplicate nan keys with MapGen(FloatGen) #9852
Signed-off-by: Haoyang Li <[email protected]>
build
LGTM. Just nit left.
@@ -676,7 +676,27 @@ def start(self, rand):
        def make_dict():
            length = rand.randint(self._min_length, self._max_length)
            return {self._key_gen.gen(): self._value_gen.gen() for idx in range(0, length)}
        self._start(rand, make_dict)
        def make_dict_float():
            # Make sure at most one key is nan
Nit: could we say more about why at most one key may be NaN?
e.g.,
In Spark, NaN = NaN returns true, so at most one key may be NaN to avoid duplicate keys.
This differs from Python, where NaN dict keys do not compare equal to each other, so a single dict can hold multiple NaN keys.
Thanks for the review, done.
These two cases seem more complicated: the chance that datagen generates multiple NaNs is very high, yet converting the dict to a dataframe works silently in most cases.
And also, if we replace the command in
Although the failure comes from the input data and the behavior matches, I'd like to keep this PR on hold until I can make the root cause clearer.
The base branch was changed.
The two failures occur because the test Scalar, which is used directly as a literal in the query, contains multiple NaN keys. Maps with multiple NaN keys are fine when they are converted to dataframes. Since we do not intend to create maps with multiple NaN keys in DataGen, I think we can keep the current solution and avoid generating this kind of data at all.
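As a rough illustration of "avoid generating such data at all", here is a minimal sketch of a dict builder that keeps at most one NaN key. It is not the code from this PR: the function name make_dict_at_most_one_nan and the gen_key / gen_value callables are hypothetical stand-ins for MapGen's key and value generators.

```python
import math
import random


def make_dict_at_most_one_nan(rand, gen_key, gen_value, min_length, max_length):
    """Build a dict with up to `length` entries and at most one NaN key.

    Hypothetical helper: gen_key and gen_value stand in for the key and
    value generators that MapGen would call.
    """
    length = rand.randint(min_length, max_length)  # inclusive on both ends
    result = {}
    has_nan = False
    for _ in range(length):
        key = gen_key()
        if isinstance(key, float) and math.isnan(key):
            if has_nan:
                # A NaN key is already present; skip this one so that
                # Spark never sees a duplicate NaN key.
                continue
            has_nan = True
        result[key] = gen_value()
    return result


# Usage with a key generator that produces NaN far more often than 1%.
rand = random.Random(0)


def nan_heavy_key():
    return float("nan") if rand.random() < 0.5 else rand.random()


sample = make_dict_at_most_one_nan(rand, nan_heavy_key, lambda: 0, 0, 20)
nan_keys_in_sample = sum(1 for k in sample if math.isnan(k))
```

Skipping the extra NaN key can make the map shorter than the drawn length, which is also true of the original make_dict whenever two generated keys collide.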
What happens on the GPU with this same issue? Do we do the same thing? I realize that Python supports multiple NaN keys that never compare as equal. But does the GPU do the same thing as the CPU and throw an exception? If not, we have to document this and at least file a follow-on issue, even if it is very low priority. If it does, we probably want a test to verify that we continue to do this on all of the platforms we support.
To be clear I want to understand what happens when multiple NaN values are inserted as the keys, but also what happens when we try to look up the value stored under a NaN key.
The behavior matches pyspark. Only one NaN key is kept, and its value can be looked up normally. pyspark converts the Python object to a Java object first and then creates the dataframe. I think the plugin does not touch the related logic, so we are good.
build
Fixes #9685
Fixes #9684
These two failed cases complain that "Duplicate map key NaN was found, please check the input data. "
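This is because a Python dict and a Spark MapType handle NaN keys differently: in Python, NaN keys do not compare equal to each other, so one dict can hold several of them, while Spark treats NaN = NaN as true and rejects them as duplicate keys.
This PR avoids generating duplicate NaN keys with MapGen(FloatGen) in integration tests, which should fix these two test cases.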
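The Python side of this difference can be checked in plain Python, without Spark. The snippet below is only a small demonstration; the variable names are made up for illustration.

```python
# Two separately created NaN objects are distinct dict keys, because
# NaN != NaN and the objects are not identical.
nan_a = float("nan")
nan_b = float("nan")

distinct = {}
distinct[nan_a] = "first"
distinct[nan_b] = "second"
distinct_key_count = len(distinct)  # both NaN keys are kept

# Reusing the very same NaN object hits the same slot, since dict lookup
# checks object identity before equality.
same = {}
same[nan_a] = "first"
same[nan_a] = "second"
same_key_count = len(same)

# Looking up with a fresh NaN object finds nothing.
fresh_nan_found = float("nan") in distinct
```

A dict like `distinct` is exactly the kind of input that triggers "Duplicate map key NaN was found" once Spark sees both keys as equal.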
I'm surprised by this root cause, because the probability of this case does not seem very low: when generating a dataframe of Map[Float, _], DataGen generates 2048 maps, each containing 0~20 entries, and NaN appears in the data with a 1% chance. Will do the math later.
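A back-of-the-envelope version of that math can be sketched as follows. It rests on assumptions that are not confirmed by the PR: every key is NaN independently with probability 1%, the map length is uniform over 0 to 20 inclusive, and all 2048 maps are independent. The real FloatGen weighting may differ.

```python
P_NAN = 0.01       # assumed per-key chance of NaN (from the description)
MIN_LEN = 0        # assumed minimum number of entries per map
MAX_LEN = 20       # assumed maximum number of entries per map
NUM_MAPS = 2048    # number of maps generated per dataframe


def prob_at_least_two_nan(n, p):
    """Chance that at least 2 of n independent keys are NaN (binomial)."""
    none = (1.0 - p) ** n
    exactly_one = n * p * (1.0 - p) ** (n - 1) if n > 0 else 0.0
    return max(0.0, 1.0 - none - exactly_one)


lengths = range(MIN_LEN, MAX_LEN + 1)

# Average over the uniformly distributed map length.
per_map = sum(prob_at_least_two_nan(n, P_NAN) for n in lengths) / len(lengths)

# Expected number of affected maps, and the chance that at least one
# of the generated maps has two or more NaN keys.
expected_affected_maps = NUM_MAPS * per_map
any_map = 1.0 - (1.0 - per_map) ** NUM_MAPS

print(f"per map: {per_map:.4%}")
print(f"expected affected maps: {expected_affected_maps:.1f}")
print(f"at least one affected map: {any_map:.6f}")
```

Under these assumptions, a single map has a bit over half a percent chance of containing two or more NaN keys, so across 2048 maps at least one such map is close to certain.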