[QST] RAPIDS Shuffle Manager (Use Mellanox ConnectX5) Opposite Worker's Executor has been Exited forcefully. #9843
Comments
Hello @leehaoun, it would be great to see executor logs, as they should hopefully give more information on what is failing. Can you take a look at those logs and share any errors you see?
@abellina Because the executor was never created, there are no executor logs available to check (there is no record of the executor being added). The EventLog in the HistoryServer is below.
I found this log in the remote worker's /root/spark-3.4.1-bin-hadoop3/work/app-20231124084536-0000/2/stderr:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
I'm sorry, it was my fault. The issue was caused by a mismatch in the SPARK_RAPIDS_JAR path between the host and the remote worker. After aligning the RAPIDS jar paths on both machines, the RAPIDS Shuffle Manager worked correctly. I will close the issue.
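For anyone hitting the same symptom: since the executor dies before it can log anything, a quick sanity check is that the plugin jar exists, and is byte-identical, at the same path on every node. A minimal sketch of that check (the helper name is mine, not from this thread; in a real cluster you would run the second checksum over ssh on the worker):

```shell
# Succeeds only when both files have the same MD5 digest.
# On a cluster, the second argument would be fetched via:  ssh worker md5sum "$path"
same_jar() {
  [ "$(md5sum < "$1" | cut -d' ' -f1)" = "$(md5sum < "$2" | cut -d' ' -f1)" ]
}
```

If the digests differ, or the file is missing on one node, the executor will fail to load the RapidsShuffleManager class at startup and exit before producing useful logs.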
Glad you got it working! Let us know if we can help!
What is your question?
I want to use RAPIDS Shuffle Manager on my spark job.
My Spark job completes successfully when no ShuffleManager-related settings are present.
However, when I enable the ShuffleManager-related options, the executor on the remote PC, which is connected via a Mellanox optical cable,
exits immediately after it is created and never becomes active.
My submit command is below.
(When I remove the '--conf' entries shown here and submit, the remote executor registers normally and the job completes without errors.)
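The original command is not reproduced above; as a rough sketch, a spark-submit enabling the RAPIDS Shuffle Manager for this version combination (Spark 3.4.1 / Scala 2.12, RAPIDS 23.10.0, UCX) typically looks like the following. The jar path, master URL, and application file are assumptions; the shuffle-manager class name is version-specific (spark341 corresponds to Spark 3.4.1):

```shell
# Sketch only: adjust paths and URLs to your cluster.
# The jar must exist at the identical path on the driver and on every worker.
SPARK_RAPIDS_JAR=/opt/rapids/rapids-4-spark_2.12-23.10.0.jar

$SPARK_HOME/bin/spark-submit \
  --master spark://master:7077 \
  --jars $SPARK_RAPIDS_JAR \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.driver.extraClassPath=$SPARK_RAPIDS_JAR \
  --conf spark.executor.extraClassPath=$SPARK_RAPIDS_JAR \
  my_job.py
```

Because the shuffle manager is instantiated on the classpath before the plugin is fully loaded, the extraClassPath entries are what make a wrong or mismatched jar path fail at executor startup with no executor log.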
My Spark-Job code is below.
Problem
When I submit this job, the local executor loads correctly.
However, the executor on the remote worker is forcefully terminated with the following message:
My detail environment - Master
OS : Ubuntu 22.04
cuda : 11.8
Spark : 3.4.1 (Scala 2.12)
JAVA : 1.8
rapids jar : rapids-4-spark_2.12-23.10.0.jar
Conda environment :
GPU : RTX 3080 x 1
RAM : 314G
MLNX Driver : MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64
NV_Peer_Mem : 1.2
UCX : 1.14.1
My detail environment - Worker
OS : Ubuntu 22.04
cuda : 11.8
Spark : 3.4.1 (Scala 2.12)
JAVA : 1.8
rapids jar : rapids-4-spark_2.12-23.10.0.jar
Conda environment :
GPU : RTX 3090 x 1
RAM : 1TB
MLNX Driver : MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64
NV_Peer_Mem : 1.2
UCX : 1.14.1