[GLUTEN-8340][VL] Enable from_json function #8320

Open
wants to merge 9 commits into
base: main
Conversation

zhli1142015
Contributor

@zhli1142015 zhli1142015 commented Dec 24, 2024

What changes were proposed in this pull request?

Fixes: #8340

How was this patch tested?

UT.

@github-actions github-actions bot added the CORE, BUILD, and VELOX labels Dec 24, 2024
Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Run Gluten ClickHouse CI on ARM

@zhli1142015
Contributor Author

cc @PHILO-HE , thanks.

Contributor

@PHILO-HE PHILO-HE left a comment

Looks good! Some minor comments.

@zhouyuan
Contributor

@zhli1142015 Thanks a lot for the patch. Could you please create an issue to track this? I assume some follow-up patches will be required.

Run Gluten Clickhouse CI on x86

@zhli1142015 zhli1142015 changed the title from "[VL] Enable from_json function" to "[GLUTEN-8340][VL] Enable from_json function" Dec 25, 2024
#8340

@zhli1142015
Contributor Author

> @zhli1142015 Thanks a lot for the patch. Could you please create an issue to track this? I assume some follow-up patches will be required.

Yes, #8340. This patch targets phase 1 only.

@zhli1142015 zhli1142015 requested a review from PHILO-HE December 25, 2024 06:10
@zhouyuan
Contributor

zhouyuan commented Jan 2, 2025

pending: facebookincubator/velox#11709

@ayushi-agarwal
Contributor

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Three JSON documents that share the same three string fields.
val jsonData = Seq(
  """{"platformId": "IPHONE", "userId": "123", "sessionId": "abc"}""",
  """{"platformId": "ANDROID", "userId": "456", "sessionId": "def"}""",
  """{"platformId": "IPHONE", "userId": "789", "sessionId": "ghi"}"""
)
val df = spark.createDataFrame(jsonData.map(Tuple1(_))).toDF("json_column")
df.printSchema()

// Write the strings to Parquet and read them back.
df.write.mode("overwrite").parquet("output/json_parquet_data")
val parquetDF = spark.read.parquet("output/json_parquet_data")

val r2 = parquetDF.collect()
r2.foreach(println)

// Parse each JSON string into a struct with the fields declared below.
val schema = new StructType().add("platformId", StringType).add("userId", StringType).add("sessionId", StringType)
val filteredDF = parquetDF.withColumn("parsed_json", from_json(col("json_column"), schema)).select("parsed_json")
val result = filteredDF.collect()
result.foreach(println)

This prints a null result when offloaded.
@zhli1142015 Are structs not supported? Should we add a check so that struct types are not offloaded?

github-actions bot commented Jan 6, 2025

Run Gluten Clickhouse CI on x86

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@ayushi-agarwal
Contributor

> This prints a null result when offloaded. @zhli1142015 Are structs not supported? Should we add a check so that struct types are not offloaded?

@zhli1142015 Will this be fixed by the last change you made? Shall we add this as a test that checks that the results match with and without offload?

@zhli1142015
Contributor Author

Yes, this issue has been resolved. The null values occurred because the case of the schema field names did not match that of the input field names. I tested mostly on 1.2, which does not have this problem. I'm not sure which commit caused the difference.
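
To illustrate the kind of mismatch described above, here is a minimal sketch in plain Scala. It is not Gluten or Velox code: the `extract` helper, the `caseSensitive` flag, and the assumption that the schema names arrive lower-cased are all hypothetical, chosen only to show how a case-sensitive lookup can return nulls for every field while a case-insensitive one finds the values.

```scala
// Hypothetical illustration only; this is not how Gluten or Velox is implemented.
object CaseMismatchSketch {
  // Look up each schema field in one parsed JSON object.
  // A missing field is returned as None, which stands in for a null column value.
  def extract(
      fields: Seq[String],
      row: Map[String, String],
      caseSensitive: Boolean): Seq[Option[String]] = {
    if (caseSensitive) {
      // Exact match on the key, so "platformid" does not find "platformId".
      fields.map(row.get)
    } else {
      // Normalize both sides to lower case before matching.
      val lowered = row.map { case (k, v) => k.toLowerCase -> v }
      fields.map(f => lowered.get(f.toLowerCase))
    }
  }

  def main(args: Array[String]): Unit = {
    val row = Map("platformId" -> "IPHONE", "userId" -> "123", "sessionId" -> "abc")
    // Assumption: the schema field names were lower-cased before reaching the parser.
    val schemaFields = Seq("platformid", "userid", "sessionid")
    println(extract(schemaFields, row, caseSensitive = true))
    println(extract(schemaFields, row, caseSensitive = false))
  }
}
```

With the case-sensitive lookup every field comes back empty, which matches the all-null symptom in the report above; the case-insensitive lookup recovers all three values.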

github-actions bot commented Jan 9, 2025

Run Gluten ClickHouse CI on ARM

Labels
BUILD CORE works for Gluten Core VELOX
Development

Successfully merging this pull request may close these issues.

[VL] Support from_json function
4 participants