[LNL] hang, timeout or crash in pause-release #5109
Comments
The kernel logs are unfortunately missing in https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100, I don't know why. I checked the kernel logs on the device directly and something strange caught my attention: it looks like the DMIC SFX2 pipeline under pause/release test is being freed and restarted WHILE the test is still running!? The firmware logs are available on the web but I don't know how to match the timestamps.
Should that WARNING be fatal? Maybe it would help collect the logs?
|
This is not just DMIC SFX2, many examples: https://sof-ci.01.org/sofpr/PR9313/build6571/devicetest/index.html |
I'm not sure what we are looking at here, but if I read it right, the test fails because the aplay hw:0,2 is failing to die, right? What is interesting regarding hw:0,2 is: I don't know what the test is doing, but it does not appear to be trying to stop the playback. |
This test passes in all other configurations and (Thanks for analyzing the logs!) |
This failure looks a little bit different:
It does not seem to have any overrun!? |
This looks like memory corruption of some kind. Not the first time I see it. No overrun there.
In the same software configuration, the HDA model has mass overruns: This one has this and then silence, no logs.
|
Here's my observation with one of the failure instances (https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50). The multiple pause/resume combination that gets started is aplay on Port 0 (stream 0) and DeepBuffer Port 0 (stream 31). Pause/release iterations start almost simultaneously for both streams and proceed normally until stream 31 finishes 15 iterations. After it is released in the 15th iteration, the FW runs into an overrun, which makes aplay try to recover from the xrun. From then onwards all we see is stream prepare, stream trigger start and then an immediate stream trigger stop, indicating that aplay cannot recover from the xrun but keeps trying endlessly. In parallel, stream 0 continues with its pause/release iterations without any issues. So the question is: why does the FW run into an overrun after 15 pause/releases? I am afraid this issue looks very similar to issue #5080, in the sense that there's a random xrun in the FW after many iterations and nothing unusual in the sequences leading up to it. |
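For illustration only, the stuck pattern described above (prepare, trigger start, immediate trigger stop, repeated endlessly) could be spotted mechanically once the log has been reduced to a list of event names. This is a minimal sketch: the event names and the `find_recovery_loop` helper are hypothetical, not the real kernel or firmware log format and not part of sof-test.

```python
from typing import Iterable, Optional

# Hypothetical, simplified event names; real kernel/firmware logs look different.
RECOVERY_CYCLE = ("prepare", "trigger_start", "trigger_stop")


def find_recovery_loop(events: Iterable[str], min_cycles: int = 3) -> Optional[int]:
    """Return the index where `min_cycles` back-to-back recovery cycles begin.

    Returns None when no such run of prepare/start/stop cycles exists.
    """
    events = list(events)
    span = len(RECOVERY_CYCLE) * min_cycles
    pattern = RECOVERY_CYCLE * min_cycles
    for i in range(len(events) - span + 1):
        if tuple(events[i:i + span]) == pattern:
            return i
    return None
```

A healthy stream that only alternates pause and release yields `None`; a stream that falls into the recovery loop yields the index of the first `prepare` of that loop, which is roughly where to start reading the logs.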
What happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever? If the xrun happens only in a case where pipelines are mixed, this could be a case of IIRC we tested pause_push/pause_release with one stream playing; when all streams getting mixed are paused/released at the same time there could be race conditions left and right. |
@plbossart, what is interesting is that the issue (afaik) only can be reproduced on LNL. |
we've seen in the SoundWire case errors that were different on LNL, the timing of the transitions seems a bit different and that seems to open the door to race conditions we didn't see or notice before. |
@plbossart yes, there're several instances of the xrun happening with pause/release with just one stream (DMIC Raw) |
What's interesting in this particular case is that the Port 0 stream is going on with its test just fine while the other stream gets stuck randomly. My initial suspicion was the link DMA, but now I feel like it might be the host DMA that's the problem, as the logs show that after releasing we're reporting that there's nothing available to copy from the host DMA buffer. But this makes it even harder to understand because in the case of pause/release, the host DMA is left untouched: it just keeps running until the pipeline is completely stopped. |
Experience with thesofproject/linux#5109 shows that this warning never seems harmless: the test ends up timing out and failing anyway. So, better failing fast for clearer and better logs. Also increase the log level of press_space() to avoid state confusion. Signed-off-by: Marc Herbert <[email protected]>
thesofproject/sof-test#1226 makes a huge difference, please review. |
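To make the "fail fast" idea concrete, here is a minimal sketch of a pause/release tracker that raises on the first unexpected message instead of logging a warning and carrying on. It is not the actual sof-test code: the state names, the `RESUME` message and the `PauseReleaseTracker` class are assumptions made up for this example.

```python
class UnexpectedStateError(RuntimeError):
    """Raised instead of printing a warning and ignoring the message."""


class PauseReleaseTracker:
    # Hypothetical states and messages, not the real sof-test ones.
    def __init__(self) -> None:
        self.state = "recording"

    def press_space(self) -> None:
        """The test toggles pause/release; only legal from a settled state."""
        if self.state == "recording":
            self.state = "pause_requested"
        elif self.state == "paused":
            self.state = "release_requested"
        else:
            raise UnexpectedStateError(
                f"space pressed while in state {self.state}")

    def on_message(self, message: str) -> None:
        """The tool reports a transition; fail fast if it was not requested."""
        expected = {
            "pause_requested": ("PAUSE", "paused"),
            "release_requested": ("RESUME", "recording"),
        }
        want = expected.get(self.state)
        if want is None or want[0] != message:
            raise UnexpectedStateError(
                f"received == {message} == while in state {self.state}")
        self.state = want[1]
```

With this shape, a `PAUSE` that arrives while the tracker is still in state `recording` stops the test immediately, so the first error in the log is the one that matters.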
The failure logs have simplified and changed A LOT since my "fail fast" test fix in thesofproject/sof-test#1226 was merged. Summary of the results of daily test 44477 (Start Time: 2024-08-02 13:07:34 UTC)
|
This failure has a https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100 This one has only a "non-zero exit status" This one passed on the same system: HDA seems to have a higher pass rate and NOCODEC the worst, example: |
In these runs, some tests passed and some tests failed in every LNL model: https://sof-ci.01.org/sofpr/PR9351/build6896/devicetest/index.html , https://sof-ci.01.org/linuxpr/PR5110/build4324/devicetest/index.html So failures seem intermittent in every configuration. |
Daily run 44926 is interesting with 5 failures out of 6 test runs and they don't all look the same. Same in https://sof-ci.01.org/sofpr/PR9299/build7192/devicetest/index.html EDIT: in daily run 44967?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50, the whole system crashed and became unresponsive. The |
With the xrun failures fixed on LNL DUTs and the earlier fix to sof-test, this issue can potentially be now closed. Let's log any new failures with the test case here and/or close the issue if none are seen anymore. FYI @ujfalusi |
I spent a long time looking for an existing bug and I found many similar issues (see list below) but I think this one is either brand new or never filed yet. Part of the problem was that the Expect part of the pause-resume test was utterly buggy; that didn't help. Now that I rewrote it in thesofproject/sof-test#1218 we can finally pay less attention to the test code and a bit more attention to the product.
The "WARNING: received == PAUSE == while in state recording! Ignoring." message is very surprising. It showed up more than once. EDIT: this message is now gone since the "fail fast" test fix thesofproject/sof-test#1226. Never look past the very first error, which is now usually "file descriptor in bad state".
2024-07-16T13:31:16-07:00
Linux Branch: topic/sof-dev
Commit: 1998ade4783a
Kconfig Branch: master
Kconfig Commit: 8189104a4f38
SOF Branch: main
Commit: 3051607efb4f
Zephyr Commit: 650227d8c47f
https://sof-ci.01.org/softestpr/PR1218/build632/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html
https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html
https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/sofpr/PR9305/build6551/devicetest/index.html
https://sof-ci.01.org/softestpr/PR1224/build686/devicetest/index.html
https://sof-ci.01.org/sofpr/PR9335/build6748/devicetest/index.html
The logs don't all look the same but here's a typical one: