Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

Merged
merged 6 commits into from
Dec 1, 2023

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Nov 24, 2023

Fixes #9685
Fixes #9684

These two failed cases complain that "Duplicate map key NaN was found, please check the input data. "
This is because python dict and spark maptype handle NaN as key in different ways:

  • In python dict, NaN as dict key are not equal to each other, so it is possible to have multiple NaN as keys in one dict.
  • In Spark map, map treats NaN as a normal number when used as a key, so no duplicate NaN keys are allowed.

This PR avoids generating duplicate NaN keys with MapGen(FloatGen) in integration tests, should be able to fix these two test cases.

I'm surprised by this root cause because the probability of this case feels not very low:
When generating a dataframe of Map[Float, _], DataGen will generate 2048 maps, each of them containing 0~20 entries, NaN will appear in data with 1% chance. Will do the math later.

@thirtiseven
Copy link
Collaborator Author

build

winningsix
winningsix previously approved these changes Nov 27, 2023
Copy link
Collaborator

@winningsix winningsix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just nit left.

@@ -676,7 +676,27 @@ def start(self, rand):
def make_dict():
length = rand.randint(self._min_length, self._max_length)
return {self._key_gen.gen(): self._value_gen.gen() for idx in range(0, length)}
self._start(rand, make_dict)
def make_dict_float():
# Make sure at most one key is nan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could we mention more about the reason why at most one key is NaN?

e.g.,

In Spark, NaN = NaN returns true. At most one key is NaN to avoid duplicated key value.
This is different from python where NaN as dict key are not equal to each other, so it is possible to have multiple NaN as keys in one dict.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review, done.

@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Nov 27, 2023

Seems it is more complicated for these two cases, the chance that datagen generated multiple NaNs is very high. However, it works fine and quiet in most cases when converting dict to dataframe.

>>> dict1 = {float('nan'): 1, float('nan'): 2, 0.3: 3}
>>> dict2 = {float('nan'): 3, float('nan'): 2, 0.3: 3}
>>> data = [dict1, dict2]
>>> df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, numSlices=None), MapType(DoubleType(), IntegerType(), valueContainsNull=True))
>>> df.show()
+--------------------+
|               value|
+--------------------+
|{0.3 -> 3, NaN -> 1}|
|{0.3 -> 3, NaN -> 3}|
+--------------------+

And also, if we replace the command in test_case_when with a collect, the test will pass with the same seed.

Although the failure is from input data and the behavior is matched, I'd like to keep this PR on hold until I can make the root cause more clear.

@thirtiseven thirtiseven marked this pull request as draft November 27, 2023 03:24
@sameerz sameerz added the test Only impacts tests label Nov 28, 2023
@thirtiseven thirtiseven changed the base branch from branch-23.12 to branch-24.02 November 28, 2023 01:48
@thirtiseven thirtiseven dismissed winningsix’s stale review November 28, 2023 01:48

The base branch was changed.

@thirtiseven
Copy link
Collaborator Author

The two failures are because there are multiple NaNs in the test Scalar that will be used directly as literal in query. It will be fine if we convert maps with multiple NaNs as key to dataframes.

Since we do not intend to create maps with multiple NaNs in DataGen, I think we can just keep the current solution to avoid generating such kind of data at all.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven self-assigned this Nov 29, 2023
@thirtiseven thirtiseven marked this pull request as ready for review November 29, 2023 02:24
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on the GPU with this same issue? Do we do the same thing? I realize that python supports multiple NaN keys that will never compare as equal. But does the GPU do the same thing as the CPU does and throws an exception? If not we have to document this and at least file a follow on issue, even if it is a very low priority. If it does we probably want a test to verify that we continue to do this on all of the platforms we support.

To be clear I want to understand what happens when multiple NaN values are inserted as the keys, but also what happens when we try to look up the value stored under a NaN key.

@thirtiseven
Copy link
Collaborator Author

  1. Create dataframe with dicts containing multiple NaN keys and look up value:
>>> from pyspark.sql.types import *
>>> dict1 = {float('nan'): 1, float('nan'): 2, 0.3: 3}
>>> dict2 = {float('nan'): 3, float('nan'): 2, 0.3: 3}
>>> data = [dict1, dict2]
>>> df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, numSlices=None), MapType(DoubleType(), IntegerType(), valueContainsNull=True))
>>> df.show()
23/11/30 11:13:35 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(value#0 as string) AS value#3 will run on GPU
      *Expression <Cast> cast(value#0 as string) will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> value#0 could run on GPU

+--------------------+
|               value|
+--------------------+
|{0.3 -> 3, NaN -> 1}|
|{0.3 -> 3, NaN -> 3}|
+--------------------+

>>> from pyspark.sql.functions import col
>>> df.select(col("value").getItem(float('nan'))).first()[0]
23/11/30 11:14:46 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> value#0[NaN] AS value[NaN]#7 will run on GPU
      *Expression <GetMapValue> value#0[NaN] will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> value#0 could run on GPU

1

The behavior is matched to pyspark. There is only one NaN key kept and it's value can be looked up normally.

pyspark will convert python object to java object first then create the dataframe, I think plugin is not touching related logic so we are good.

  1. This issue's case, if we use a literal map containing multiple NaN keys in a query. I added a test and it throws same exception as CPU does.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven merged commit dfae463 into NVIDIA:branch-24.02 Dec 1, 2023
36 checks passed
@thirtiseven thirtiseven deleted the duplicatenan branch December 1, 2023 06:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
4 participants