Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

thirtiseven · 2023-11-24T10:10:57Z

These two failed cases complain that "Duplicate map key NaN was found, please check the input data. "
This is because python dict and spark maptype handle NaN as key in different ways:

In python dict, NaN as dict key are not equal to each other, so it is possible to have multiple NaN as keys in one dict.
In Spark map, map treats NaN as a normal number when used as a key, so no duplicate NaN keys are allowed.

This PR avoids generating duplicate NaN keys with MapGen(FloatGen) in integration tests, should be able to fix these two test cases.

I'm surprised by this root cause because the probability of this case feels not very low:
When generating a dataframe of Map[Float, _], DataGen will generate 2048 maps, each of them containing 0~20 entries, NaN will appear in data with 1% chance. Will do the math later.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-11-24T10:33:10Z

build

winningsix

LGTM. Just nit left.

winningsix · 2023-11-27T00:11:59Z

integration_tests/src/main/python/data_gen.py

@@ -676,7 +676,27 @@ def start(self, rand):
        def make_dict():
            length = rand.randint(self._min_length, self._max_length)
            return {self._key_gen.gen(): self._value_gen.gen() for idx in range(0, length)}
-        self._start(rand, make_dict)
+        def make_dict_float():
+            # Make sure at most one key is nan


Nit: could we mention more about the reason why at most one key is NaN?

e.g.,

In Spark, NaN = NaN returns true. At most one key is NaN to avoid duplicated key value. This is different from python where NaN as dict key are not equal to each other, so it is possible to have multiple NaN as keys in one dict.

Thanks for review, done.

thirtiseven · 2023-11-27T03:24:05Z

Seems it is more complicated for these two cases, the chance that datagen generated multiple NaNs is very high. However, it works fine and quiet in most cases when converting dict to dataframe.

>>> dict1 = {float('nan'): 1, float('nan'): 2, 0.3: 3}
>>> dict2 = {float('nan'): 3, float('nan'): 2, 0.3: 3}
>>> data = [dict1, dict2]
>>> df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, numSlices=None), MapType(DoubleType(), IntegerType(), valueContainsNull=True))
>>> df.show()
+--------------------+
|               value|
+--------------------+
|{0.3 -> 3, NaN -> 1}|
|{0.3 -> 3, NaN -> 3}|
+--------------------+

And also, if we replace the command in test_case_when with a collect, the test will pass with the same seed.

Although the failure is from input data and the behavior is matched, I'd like to keep this PR on hold until I can make the root cause more clear.

The base branch was changed.

thirtiseven · 2023-11-29T02:16:37Z

The two failures are because there are multiple NaNs in the test Scalar that will be used directly as literal in query. It will be fine if we convert maps with multiple NaNs as key to dataframes.

Since we do not intend to create maps with multiple NaNs in DataGen, I think we can just keep the current solution to avoid generating such kind of data at all.

Signed-off-by: Haoyang Li <[email protected]>

revans2

What happens on the GPU with this same issue? Do we do the same thing? I realize that python supports multiple NaN keys that will never compare as equal. But does the GPU do the same thing as the CPU does and throws an exception? If not we have to document this and at least file a follow on issue, even if it is a very low priority. If it does we probably want a test to verify that we continue to do this on all of the platforms we support.

To be clear I want to understand what happens when multiple NaN values are inserted as the keys, but also what happens when we try to look up the value stored under a NaN key.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-11-30T03:28:22Z

Create dataframe with dicts containing multiple NaN keys and look up value:

>>> from pyspark.sql.types import *
>>> dict1 = {float('nan'): 1, float('nan'): 2, 0.3: 3}
>>> dict2 = {float('nan'): 3, float('nan'): 2, 0.3: 3}
>>> data = [dict1, dict2]
>>> df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, numSlices=None), MapType(DoubleType(), IntegerType(), valueContainsNull=True))
>>> df.show()
23/11/30 11:13:35 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(value#0 as string) AS value#3 will run on GPU
      *Expression <Cast> cast(value#0 as string) will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> value#0 could run on GPU

+--------------------+
|               value|
+--------------------+
|{0.3 -> 3, NaN -> 1}|
|{0.3 -> 3, NaN -> 3}|
+--------------------+

>>> from pyspark.sql.functions import col
>>> df.select(col("value").getItem(float('nan'))).first()[0]
23/11/30 11:14:46 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> value#0[NaN] AS value[NaN]#7 will run on GPU
      *Expression <GetMapValue> value#0[NaN] will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> value#0 could run on GPU

1

The behavior is matched to pyspark. There is only one NaN key kept and it's value can be looked up normally.

pyspark will convert python object to java object first then create the dataframe, I think plugin is not touching related logic so we are good.

This issue's case, if we use a literal map containing multiple NaN keys in a query. I added a test and it throws same exception as CPU does.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2023-12-01T00:53:23Z

build

Avoid generating duplicate nan keys with MapGen(FloatGen)

58749b1

Signed-off-by: Haoyang Li <[email protected]>

winningsix previously approved these changes Nov 27, 2023

View reviewed changes

thirtiseven marked this pull request as draft November 27, 2023 03:24

sameerz added the test Only impacts tests label Nov 28, 2023

thirtiseven changed the base branch from branch-23.12 to branch-24.02 November 28, 2023 01:48

Merge branch 'NVIDIA:branch-24.02' into duplicatenan

ac1a80c

Improve comment

7912619

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Nov 29, 2023

thirtiseven marked this pull request as ready for review November 29, 2023 02:24

revans2 reviewed Nov 29, 2023

View reviewed changes

Add map expr literal dupe nan keys test

aad9f5d

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven added 2 commits November 30, 2023 11:29

clean up

8244e1e

Signed-off-by: Haoyang Li <[email protected]>

clean up

818622b

Signed-off-by: Haoyang Li <[email protected]>

revans2 approved these changes Nov 30, 2023

View reviewed changes

thirtiseven merged commit dfae463 into NVIDIA:branch-24.02 Dec 1, 2023
36 checks passed

thirtiseven deleted the duplicatenan branch December 1, 2023 06:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

thirtiseven commented Nov 24, 2023 •

edited

Loading

thirtiseven commented Nov 24, 2023

winningsix left a comment

winningsix Nov 27, 2023

thirtiseven Nov 29, 2023

thirtiseven commented Nov 27, 2023 •

edited

Loading

thirtiseven commented Nov 29, 2023

revans2 left a comment

thirtiseven commented Nov 30, 2023

thirtiseven commented Dec 1, 2023

Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

Avoid generating duplicate nan keys with MapGen(FloatGen) #9852

Conversation

thirtiseven commented Nov 24, 2023 • edited Loading

thirtiseven commented Nov 24, 2023

winningsix left a comment

Choose a reason for hiding this comment

winningsix Nov 27, 2023

Choose a reason for hiding this comment

thirtiseven Nov 29, 2023

Choose a reason for hiding this comment

thirtiseven commented Nov 27, 2023 • edited Loading

thirtiseven commented Nov 29, 2023

revans2 left a comment

Choose a reason for hiding this comment

thirtiseven commented Nov 30, 2023

thirtiseven commented Dec 1, 2023

thirtiseven commented Nov 24, 2023 •

edited

Loading

thirtiseven commented Nov 27, 2023 •

edited

Loading