[HUDI-8855] Add bucket properties for spark bucket index query pruning #12614

xicm · 2025-01-10T09:36:18Z

Change Logs

Bucket index pruning has been supported since HUDI-6207, but the bucket index configurations are not passed to SparkHoodieTableFileIndex.
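
To show why the read path needs these settings, here is a simplified sketch of bucket pruning. It is not Hudi's actual implementation: the object `BucketPruningSketch`, the hash function, and the assumption that each file name starts with an 8-digit zero-padded bucket id are all illustrative. The point is only that the reader cannot compute the target bucket without knowing the hash field value and the number of buckets.

```scala
// Simplified illustration of bucket pruning; names and hashing are assumptions,
// not Hudi's real code.
object BucketPruningSketch {

  // Hypothetical stand-in for the bucket index hash: map a key to [0, numBuckets).
  def bucketId(hashFieldValue: String, numBuckets: Int): Int =
    (hashFieldValue.hashCode & Int.MaxValue) % numBuckets

  // Keep only files whose name starts with the zero-padded bucket id.
  // Assumes file names are prefixed with an 8-digit bucket id.
  def prune(fileNames: Seq[String], hashFieldValue: String, numBuckets: Int): Seq[String] = {
    val prefix = f"${bucketId(hashFieldValue, numBuckets)}%08d"
    fileNames.filter(_.startsWith(prefix))
  }
}
```

With an equality filter on the hash field, a reader that knows `numBuckets` would only need to list the files of one bucket instead of all of them.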

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

none

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

hudi-bot · 2025-01-10T16:36:47Z

CI report:

d89310f Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

danny0405 · 2025-01-13T02:51:21Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -534,6 +535,15 @@ object HoodieFileIndex extends Logging {
      properties.setProperty(RECORDKEY_FIELD.key, tableConfig.getRecordKeyFields.orElse(Array.empty).mkString(","))
      properties.setProperty(PRECOMBINE_FIELD.key, Option(tableConfig.getPreCombineField).getOrElse(""))
      properties.setProperty(PARTITIONPATH_FIELD.key, HoodieTableConfig.getPartitionFieldPropForKeyGenerator(tableConfig).orElse(""))
+
+      // for simple bucket index, we need to set the INDEX_TYPE, BUCKET_INDEX_HASH_FIELD, BUCKET_INDEX_NUM_BUCKETS
+      val dataBase = Some(tableConfig.getDatabaseName)


All of these properties are write configs, so we want to fix this by explicitly setting up all of the catalog properties.

TheR1sing3un · 2025-01-13T06:17:11Z

The index-related configuration items are currently classified as write configs. Before bucket index pruning was implemented, these index-related write configs were not used on the read path, but now the read path needs them.
Could we consider addressing this problem in a more general way:

Move the configurations that are currently write configs, that determine the file layout, and that reads depend on, to table-level configuration?
Or, more simply, store these write configs in the HMS and load them automatically when reading the table.
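
As a rough sketch of the second option, assuming the table properties loaded from the metastore are available as a plain `Map[String, String]`: the helper below forwards only the bucket-index keys into the `Properties` used by the file index. The object `BucketIndexProps` and its `forward` method are hypothetical, and the key strings are the Hudi config names as I understand them, so they should be checked against the Hudi version in use.

```scala
import java.util.Properties

// Hypothetical helper; not part of Hudi.
object BucketIndexProps {

  // Assumed config keys for the bucket index; verify against your Hudi version.
  val BucketKeys: Seq[String] = Seq(
    "hoodie.index.type",
    "hoodie.index.bucket.engine",
    "hoodie.bucket.index.hash.field",
    "hoodie.bucket.index.num.buckets"
  )

  // Copy only the bucket-index keys that are present in `tableProps` into `target`.
  // Keys that are absent are left unset rather than filled with guessed defaults.
  def forward(tableProps: Map[String, String], target: Properties): Properties = {
    BucketKeys.foreach { key =>
      tableProps.get(key).foreach(value => target.setProperty(key, value))
    }
    target
  }
}
```

Forwarding a fixed allow-list keeps unrelated write configs out of the read path, at the cost of having to extend the list whenever a new layout-affecting config is introduced.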

github-actions bot added the size:XS PR with lines of changes in <= 10 label Jan 10, 2025

xicm force-pushed the HUDI-8855 branch from 298dd47 to 88aa176 Compare January 10, 2025 09:52

[HUDI-8855] Add bucket properties for spark bucket index query pruning

d89310f

xicm force-pushed the HUDI-8855 branch from 88aa176 to d89310f Compare January 10, 2025 15:23

github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:XS PR with lines of changes in <= 10 labels Jan 10, 2025

danny0405 reviewed Jan 13, 2025
