Enable build for Databricks 13.3 [databricks] #9677

Merged
pxLi merged 34 commits into NVIDIA:branch-23.12 from the final-pr branch on Nov 23, 2023

Conversation

razajafri
Collaborator

@razajafri razajafri commented Nov 12, 2023

This PR builds on previous PRs to add Databricks 13 support to the Spark Rapids plugin. This PR specifically adds pom changes to build the plugin with Databricks 13.3.

Changes Made:

POM changes: All modules have been updated with a profile for 341db support.
XFAIL failing tests: Failing tests were marked with the pytest xfail marker, which should be removed once support for them is added.
PythonUDAF: Added support for PythonUDAF, similar to Spark 3.5.
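
As a rough illustration of the xfail marking described in the list above, a parametrized test can carry the marker on only the failing parameter. This is a minimal sketch, not the plugin's actual test code: the `is_databricks_runtime` helper below is a hypothetical stand-in that assumes the runtime can be detected through an environment variable, and `TRACKING_ISSUE` is a placeholder URL.

```python
import os

import pytest


def is_databricks_runtime() -> bool:
    # Hypothetical stand-in: assumes detection through an environment variable.
    return "DATABRICKS_RUNTIME_VERSION" in os.environ


# Placeholder link; a real marker would point at the actual tracking issue.
TRACKING_ISSUE = "https://example.com/tracking-issue"

float_params = [
    # Only this parameter is expected to fail, and only on Databricks.
    pytest.param(
        "float_with_nans",
        marks=pytest.mark.xfail(is_databricks_runtime(), reason=TRACKING_ISSUE),
    ),
    # Unmarked parameters keep running as ordinary test cases.
    "int",
]


@pytest.mark.parametrize("kind", float_params)
def test_round_trip(kind):
    assert kind in ("float_with_nans", "int")
```

Keeping the tracking link in `reason` makes it easy to find and remove the marker once the underlying support lands.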

Tests:
All the tests were updated

This is in draft mode because it should be merged only after #9644 is merged

Collaborator

@gerashegalov gerashegalov left a comment

Follow the approach in #9508 to reduce bloat in the poms.

datagen/pom.xml Outdated Show resolved Hide resolved
integration_tests/pom.xml Outdated Show resolved Hide resolved
shuffle-plugin/pom.xml Outdated Show resolved Hide resolved
tests/pom.xml Outdated Show resolved Hide resolved
@gerashegalov gerashegalov self-requested a review November 13, 2023 20:21
@jlowe
Contributor

jlowe commented Nov 16, 2023

build

@jlowe
Contributor

jlowe commented Nov 17, 2023

The latest failure is in the fastparquet compatibility test, which I could not reproduce on a Databricks 13.3 instance. Kicking again to see if it's reproducible.

@jlowe
Contributor

jlowe commented Nov 17, 2023

build

@jlowe
Contributor

jlowe commented Nov 17, 2023

I'm now able to reproduce the fastparquet failures, and it appears to be an issue with the fastparquet setup on Databricks 13.3. It's reading NaNs as nulls, whereas the GPU is reading NaNs as NaNs. Not sure yet why we're getting different fastparquet behavior in the DB 13.3 environment with an explicit install of fastparquet vs. what we get on the other Databricks environments.
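
To make the mismatch concrete, here is a small sketch of why a null-versus-NaN difference fails a CPU/GPU comparison even when NaNs are treated as equal to each other. The helper names and the sample rows are made up for illustration; this is not the comparison code used by the integration tests.

```python
import math


def same_value(cpu, gpu):
    """Compare two cells: NaN matches NaN, but null (None) only matches null."""
    if cpu is None or gpu is None:
        return cpu is None and gpu is None
    if isinstance(cpu, float) and isinstance(gpu, float):
        if math.isnan(cpu) and math.isnan(gpu):
            return True
    return cpu == gpu


def first_mismatch(cpu_rows, gpu_rows):
    """Return the index of the first differing cell, or None if all match."""
    for index, (cpu, gpu) in enumerate(zip(cpu_rows, gpu_rows)):
        if not same_value(cpu, gpu):
            return index
    return None


# Illustrative data: one reader surfaces the NaN as null, the other keeps it.
fastparquet_rows = [1.5, None, 3.0]
gpu_rows = [1.5, math.nan, 3.0]

print(first_mismatch(fastparquet_rows, gpu_rows))  # prints 1
```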

@jlowe
Contributor

jlowe commented Nov 17, 2023

build

1 similar comment
@jlowe
Contributor

jlowe commented Nov 20, 2023

build

@jlowe
Contributor

jlowe commented Nov 20, 2023

build

1 similar comment
@sameerz
Collaborator

sameerz commented Nov 21, 2023

build

@pxLi
Collaborator

pxLi commented Nov 21, 2023

Delta Lake test cases failed on 341db:

[2023-11-21T05:19:07.746Z] FAILED [ 28%]
[2023-11-21T05:19:46.387Z] ../../src/main/python/delta_lake_merge_test.py::test_delta_merge_not_match_insert_only[10-['a', 'b']-False-(range(0, 5), range(0, 5))][DATAGEN_SEED=1700542054, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] 23/11/21 05:19:41 ERROR Utils: Aborting task
[2023-11-21T05:19:46.387Z] java.lang.OutOfMemoryError: GC overhead limit exceeded
[2023-11-21T05:19:46.387Z] 23/11/21 05:19:42 ERROR FileFormatWriter: Job job_202311210519103400711262770238775_3241 aborted.
[2023-11-21T05:19:46.387Z] 23/11/21 05:19:42 ERROR Executor: Exception in task 2.0 in stage 3241.0 (TID 11704)
[2023-11-21T05:19:46.387Z] org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/tmp/pyspark_tests/1121-014647-nfuszhj3-10-2-128-19-master-371556-540372822/DELTA_DATA/CPU.
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:968)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:551)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:116)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:931)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:931)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
[2023-11-21T05:19:46.387Z] 	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1682)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
[2023-11-21T05:19:46.387Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
[2023-11-21T05:19:46.387Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-21T05:19:46.387Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-21T05:19:46.387Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-21T05:19:46.387Z] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[2023-11-21T05:19:52.904Z] 23/11/21 05:19:51 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 2.0 in stage 3241.0 (TID 11704),5,main]
[2023-11-21T05:19:52.905Z] org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/tmp/pyspark_tests/1121-014647-nfuszhj3-10-2-128-19-master-371556-540372822/DELTA_DATA/CPU.
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:968)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:551)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:116)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:931)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:931)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
[2023-11-21T05:19:52.905Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
[2023-11-21T05:19:52.905Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
[2023-11-21T05:19:52.905Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
[2023-11-21T05:19:52.905Z] 	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
[2023-11-21T05:19:52.905Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1682)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
[2023-11-21T05:19:52.905Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2023-11-21T05:19:52.905Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:52.905Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
[2023-11-21T05:19:52.905Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-21T05:19:52.905Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-21T05:19:52.905Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-21T05:19:52.905Z] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

@jlowe
Contributor

jlowe commented Nov 21, 2023

build

@pxLi
Collaborator

pxLi commented Nov 22, 2023

build

Comment on lines +110 to +112
pytest.param(FloatGen(nullable=False),
             marks=pytest.mark.xfail(is_databricks_runtime(),
                                     reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
Collaborator

I was thinking of including the following:

Suggested change
- pytest.param(FloatGen(nullable=False),
-              marks=pytest.mark.xfail(is_databricks_runtime(),
-                                      reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
+ pytest.param(FloatGen(nullable=False),
+              marks=pytest.mark.xfail(is_databricks_runtime(),
+                                      reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
+ FloatGen(nullable=False, no_nans=True),

Not strictly in the purview of this change. I can add this as a follow-on.
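
The idea behind the extra `no_nans=True` parameter can be sketched as follows: while the NaN-producing case is xfailed, a NaN-free generator still exercises the rest of the float path. `gen_floats` below is a hypothetical stand-in written only for this illustration; it is not the real `FloatGen`, and its special-value list and 20% special-value rate are assumptions.

```python
import math
import random


def gen_floats(count, seed, no_nans=False):
    """Hypothetical stand-in for a float data generator (not the real FloatGen)."""
    rng = random.Random(seed)
    special = [0.0, -0.0, math.inf, -math.inf]
    if not no_nans:
        # NaN is only included as a candidate when the caller allows it.
        special.append(math.nan)
    values = []
    for _ in range(count):
        if rng.random() < 0.2:
            # Assumed rate: roughly one value in five is a special value.
            values.append(rng.choice(special))
        else:
            values.append(rng.uniform(-1e6, 1e6))
    return values


with_nans = gen_floats(1000, seed=0)
without_nans = gen_floats(1000, seed=0, no_nans=True)

# The NaN-free variant still covers ordinary values, signed zeros, and infinities.
assert not any(math.isnan(v) for v in without_nans)
```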

Collaborator

@mythrocks mythrocks left a comment

LGTM, barring that single (optional) suggestion. Thanks for disabling the float-double tests.

@jlowe
Contributor

jlowe commented Nov 22, 2023

build

1 similar comment
@jlowe
Contributor

jlowe commented Nov 22, 2023

build

@pxLi
Collaborator

pxLi commented Nov 23, 2023

Thanks! Also cc @NvTimLiu to help set up the nightly build later.

@pxLi pxLi merged commit d3629fd into NVIDIA:branch-23.12 Nov 23, 2023
36 checks passed
@jlowe
Contributor

jlowe commented Nov 23, 2023

Thanks for merging, @pxLi. I ran the build three times to make sure CI would not be flaky with heap GC OOM or other problems, and it passed three times in a row. So we should be good with this enabled for premerge and nightly.

@sameerz sameerz added the task Work required that improves the product but is not user facing label Nov 26, 2023
@razajafri razajafri deleted the final-pr branch November 27, 2023 17:14