
[HUDI-8758] Enforce insert deduplicate policy for spark SQL #12588

Open
wants to merge 8 commits into base: master
Conversation

linliu-code
Contributor

Change Logs

Based on the value of hoodie.datasource.insert.dup.policy, we do the following:

if the value is drop, we remove the duplicate records before the insert;
if the value is fail, we fail the insert query; and
if the value is none (the default), we do not dedup (the common path).
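The three-way dispatch above can be sketched as follows. This is an illustrative, self-contained sketch only: the method name `applyPolicy`, the key-set arguments, and the plain-collections logic are hypothetical stand-ins for the actual Hudi code, which operates on tagged `HoodieRecord` RDDs.

```java
import java.util.*;
import java.util.stream.Collectors;

public class DedupPolicySketch {
    static final String DROP = "drop";
    static final String FAIL = "fail";
    static final String NONE = "none";

    // Hypothetical dispatcher on hoodie.datasource.insert.dup.policy.
    static List<String> applyPolicy(String policy, List<String> incomingKeys, Set<String> existingKeys) {
        switch (policy) {
            case DROP:
                // drop: filter out records whose key already exists in the table.
                return incomingKeys.stream()
                        .filter(k -> !existingKeys.contains(k))
                        .collect(Collectors.toList());
            case FAIL:
                // fail: abort the insert query if any duplicate key is found.
                if (incomingKeys.stream().anyMatch(existingKeys::contains)) {
                    throw new IllegalStateException("Duplicate keys found; failing insert per policy");
                }
                return incomingKeys;
            default:
                // none (the default): no dedup work at all -- the common path.
                return incomingKeys;
        }
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("k1"));
        List<String> incoming = Arrays.asList("k1", "k2");
        System.out.println(applyPolicy(DROP, incoming, existing)); // [k2]
        System.out.println(applyPolicy(NONE, incoming, existing)); // [k1, k2]
    }
}
```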

Impact

This enforces the insert dedup policy in the Spark SQL insert query logic.

Risk level (write none, low, medium, or high below)

Medium.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jan 7, 2025
}

// Check if deduplication is needed.
def isDeduplicationNeeded(operation: WriteOperationType): Boolean = {
Contributor

Where did you pull this from?
Is it somewhere in master, or are we introducing it newly in this patch?

Contributor Author

I took it from master, line 510. I modified it to narrow the impact to inserts only. Previously it could impact upserts as well.

Contributor

Gotcha.
BTW, this could apply only to INSERT and not INSERT_PREPPED.
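The reviewer's point can be illustrated with a small sketch. Everything here is hypothetical: the enum is a simplified stand-in for Hudi's `WriteOperationType`, and the real `isDeduplicationNeeded` lives in the Hudi Scala code with different surroundings; this only shows the intended INSERT-vs-INSERT_PREPPED distinction.

```java
public class DedupCheckSketch {
    // Simplified stand-in for Hudi's WriteOperationType (only the values relevant here).
    enum WriteOperationType { INSERT, INSERT_PREPPED, UPSERT, BULK_INSERT }

    // Hypothetical check: dedup applies only to plain INSERT (prepped records are
    // assumed already tagged/deduped upstream), and only when the policy asks for work.
    static boolean isDeduplicationNeeded(WriteOperationType op, String policy) {
        boolean policyActive = "drop".equals(policy) || "fail".equals(policy);
        return op == WriteOperationType.INSERT && policyActive;
    }

    public static void main(String[] args) {
        System.out.println(isDeduplicationNeeded(WriteOperationType.INSERT, "drop"));         // true
        System.out.println(isDeduplicationNeeded(WriteOperationType.INSERT_PREPPED, "drop")); // false
        System.out.println(isDeduplicationNeeded(WriteOperationType.INSERT, "none"));         // false
    }
}
```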

incomingRecords
} else {
// Perform deduplication
val deduplicatedRecords = DataSourceUtils.dropDuplicates(jsc, incomingRecords, parameters.asJava)
Contributor

It would be better if we push it down into

public static JavaRDD<HoodieRecord> dropDuplicates(HoodieSparkEngineContext engineContext, JavaRDD<HoodieRecord> incomingHoodieRecords,
      HoodieWriteConfig writeConfig) {
    try {
      SparkRDDReadClient client = new SparkRDDReadClient<>(engineContext, writeConfig);
      return client.tagLocation(incomingHoodieRecords)
          .filter(r -> !((HoodieRecord<HoodieRecordPayload>) r).isCurrentLocationKnown());
    } catch (TableNotFoundException e) {
      // this will be executed when there is no hoodie table yet
      // so no dups to drop
      return incomingHoodieRecords;
    }
  }

this method in DataSourceUtils only.

Why trigger the DAG twice?

Contributor Author

Will do.

@@ -3087,4 +3089,140 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
})
}
}

test("Test table with insert dup policy - drop case") {
withSQLConf("hoodie.datasource.insert.dup.policy" -> "drop") {
Contributor

Can we add tests for all 3 insert dup policies at the Spark datasource layer?

Contributor Author

I added the test in the second part. I can move it here.

|""".stripMargin)

// check result after insert and merge data into target table
checkAnswer(s"select id, name, dt, day, hour from $targetTable limit 10")(
Contributor

Can we have if, else if, and else branches here and have just one test method? Why duplicate the test code?

Contributor Author

That way, if one case fails, the other cases will stop. I remember parameterization does not work here; need to confirm.
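One possible middle ground for the concern above (a hypothetical sketch, not the Hudi test code): loop over the policies inside a single test method but collect failures instead of stopping at the first one, so every case still runs even when one fails. The `runCase` body here is a placeholder for the real per-policy insert-and-check logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AllCasesSketch {
    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        for (String policy : Arrays.asList("none", "drop", "fail")) {
            try {
                runCase(policy); // hypothetical per-policy insert + assertions
            } catch (AssertionError e) {
                // Record the failure and keep going so the remaining cases still run.
                failures.add(policy + ": " + e.getMessage());
            }
        }
        if (!failures.isEmpty()) {
            throw new AssertionError("Failed cases: " + failures);
        }
        System.out.println("all cases passed");
    }

    // Placeholder for the real test body (create table, insert, checkAnswer, ...).
    static void runCase(String policy) {
        if (!Arrays.asList("none", "drop", "fail").contains(policy)) {
            throw new AssertionError("unknown policy " + policy);
        }
    }
}
```

A framework-level alternative under the same goal would be a parameterized test (one parameter per policy), which some frameworks support and some of these Scala SQL test bases may not, as the author notes.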

@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Jan 8, 2025
@linliu-code linliu-code requested a review from nsivabalan January 8, 2025 23:14
@hudi-bot

hudi-bot commented Jan 9, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@@ -284,24 +285,47 @@ public static HoodieRecord createHoodieRecord(GenericRecord gr, HoodieKey hKey,
* @param writeConfig HoodieWriteConfig
*/
@SuppressWarnings("unchecked")
public static JavaRDD<HoodieRecord> dropDuplicates(HoodieSparkEngineContext engineContext, JavaRDD<HoodieRecord> incomingHoodieRecords,
HoodieWriteConfig writeConfig) {
public static JavaRDD<HoodieRecord> doDropDuplicates(HoodieSparkEngineContext engineContext,
Contributor

We should rename this to handleDuplicates.

HoodieWriteConfig writeConfig =
HoodieWriteConfig.newBuilder().withPath(parameters.get("path")).withProps(parameters).build();
return dropDuplicates(new HoodieSparkEngineContext(jssc), incomingHoodieRecords, writeConfig);
public static JavaRDD<HoodieRecord> dropDuplicates(JavaSparkContext jssc,
Contributor

Again, let's rename this.

Contributor

Can we add Javadocs please?


(10, "3", "rider-C", "driver-C", 33.90, 10),
(11, "5", "rider-C", "driver-C", 3.3, 3))
val expectedForNone: Seq[(Int, String, String, String, Double, Int)] = Seq(
(11, "1", "rider-A", "driver-A", 1.1, 1),
Contributor

Shouldn't we expect to see duplicate records with NONE as the policy value?

Contributor

The concat handle should be enabled by default, right?

java.util.Arrays.asList(
Arguments.of("MERGE_ON_READ", "AVRO", NONE_INSERT_DUP_POLICY),
Arguments.of("MERGE_ON_READ", "SPARK", NONE_INSERT_DUP_POLICY),
Arguments.of("MERGE_ON_READ", "AVRO", DROP_INSERT_DUP_POLICY),
Contributor

We can avoid multiple record types; just the table type and dup policy combos would do.

|""".stripMargin)
if (policy.equals(NONE_INSERT_DUP_POLICY)) {
checkAnswer(s"select id, name, dt, day, hour from $targetTable limit 10")(
Seq("1", "aa", 1234, "2024-02-19", 10)
Contributor

Same comment as above; we should see duplicates.
