-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake #2061
Conversation
1cf5f69
to
61dbd4c
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #2061 +/- ##
==========================================
+ Coverage 46.81% 47.06% +0.26%
==========================================
Files 173 173
Lines 10929 10995 +66
Branches 1009 1009
==========================================
+ Hits 5115 5174 +59
- Misses 5479 5483 +4
- Partials 335 338 +3 ☔ View full report in Codecov by Sentry. |
e3b26d9
to
b4aedb8
Compare
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Outdated
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Outdated
Show resolved
Hide resolved
@SteNicholas nice idea, this feature is useful for debug memory problem in our spark production |
b4aedb8
to
ecf7f23
Compare
@wangshengjie123, thanks for detailed review. I have addressed above comments. PTAL. |
ecf7f23
to
82ae546
Compare
LGTM, thanks @SteNicholas, cc @waitinfuture @FMX |
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Outdated
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Outdated
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Outdated
Show resolved
Hide resolved
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/monitor/JVMQuake.scala
Outdated
Show resolved
Hide resolved
3e70971
to
40d430f
Compare
40d430f
to
930129e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thank you, merging to main(v0.4.0). |
### What changes were proposed in this pull request? Add java tools.jar into classpath for JVM quake. ### Why are the changes needed? Meet below issue with `celeborn.worker.jvmQuake.enabled=true`, see #2061 ``` 24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed. java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66) at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360) at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041) at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala) Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 9 more ``` Related code: https://github.com/apache/celeborn/blob/c12e8881ab5b03e39a8b53d038a283351bf5906c/project/JDKTools.scala#L58-L75 Similar issue: vladimirvivien/jmx-cli#4 After copy the `tools.jar` into worker-jars, the issue got resolved. It is better that to involve the `tools.jar` automatically without copy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? <img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d"> Closes #2874 from turboFei/jdk_tools. Authored-by: Wang, Fei <[email protected]> Signed-off-by: SteNicholas <[email protected]>
### What changes were proposed in this pull request? Add java tools.jar into classpath for JVM quake. ### Why are the changes needed? Meet below issue with `celeborn.worker.jvmQuake.enabled=true`, see #2061 ``` 24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed. java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66) at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360) at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041) at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala) Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 9 more ``` Related code: https://github.com/apache/celeborn/blob/c12e8881ab5b03e39a8b53d038a283351bf5906c/project/JDKTools.scala#L58-L75 Similar issue: vladimirvivien/jmx-cli#4 After copy the `tools.jar` into worker-jars, the issue got resolved. It is better that to involve the `tools.jar` automatically without copy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? <img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d"> Closes #2874 from turboFei/jdk_tools. Authored-by: Wang, Fei <[email protected]> Signed-off-by: SteNicholas <[email protected]> (cherry picked from commit d4044c5) Signed-off-by: SteNicholas <[email protected]>
What changes were proposed in this pull request?
Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.
Why are the changes needed?
When facing out-of-control memory management in Celeborn Worker we typically use JVMkill as a remedy by killing the process and generating a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by JVM running out of memory, such as repeated execution of Full GC without performing any useful work during the pause time. Since the JVM does not exhaust 100% of resources, JVMkill will not be triggered. Therefore JVMQuake is introduced to provide more granular monitoring of GC behavior, enabling early detection of memory management issues and facilitating fast failure. Refers to the principle of jvmquake which is a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
JVMQuakeSuite