-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LNL] SoundWire: pm_runtime resume timed out - interrupts lost #4943
Comments
|
Looks like a hardware/board issue to me The last log for the rt714 is this https://sof-ci.ostc.intel.com/#/result/planresultdetail/40150?model=LNLM_SDW_AIOC&testcase=verify-kernel-boot-log
This means the device was assigned an address and was no longer seen as device 0. But then nothing happened after that. @bardliao what do you think? |
@plbossart Device number has been assigned. And there is no trace in rt715_sdca_io_init(). I think it is probably why we don't see the next action in the log.
To me, it is more like a codec driver error. See change below. We need to wait for initialization_complete instead of enumeration_complete when the codec get resume. diff --git a/sound/soc/codecs/rt715-sdca-sdw.c b/sound/soc/codecs/rt715-sdca-sdw.c
index d3fb02e8f31a..c8dabb9b16b5 100644
--- a/sound/soc/codecs/rt715-sdca-sdw.c
+++ b/sound/soc/codecs/rt715-sdca-sdw.c
@@ -234,10 +234,10 @@ static int __maybe_unused rt715_dev_resume(struct device *dev)
if (!slave->unattach_request)
goto regmap_sync;
- time = wait_for_completion_timeout(&slave->enumeration_complete,
+ time = wait_for_completion_timeout(&slave->initialization_complete,
msecs_to_jiffies(RT715_PROBE_TIMEOUT));
if (!time) {
- dev_err(&slave->dev, "%s: Enumeration not complete, timed out\n", __func__);
+ dev_err(&slave->dev, "%s: Initialization not complete, timed out\n", __func__);
sdw_show_ping_status(slave->bus, true);
return -ETIMEDOUT; |
I saw that confusion @bardliao but if the enumeration didn't complete by the deadline, then the initialization would not complete either? |
Seen again today on a different device, this time it's ba-lnlm-rvp-sdw-01 https://sof-ci.01.org/softestpr/PR1075/build329/devicetest/index.html (internal run 40213) |
Fair point. Indeed, I can't find |
go back to the hard-coded 0x1 which was changed in "soundwire: intel_ace2x: cleanup DOAIS/DODS settings" Link: thesofproject#4943 Signed-off-by: Pierre-Louis Bossart <[email protected]>
reverting the DOAIS setting does not trigger any obvious change, so we'll have to track where and when this issue re-occurs, if at all. |
similar problem reported on LNL RVP:
It looks like the RT711 is attached on link0 but somehow we never deal with it. That's starting to look like an interrupt that's not detected. It was the same pattern in the initial bug report above but this time for link1.
|
Log what the interrupt enable state is. we should merge this but there's no real merit in upstreaming this. Link: thesofproject#4943 Signed-off-by: Pierre-Louis Bossart <[email protected]>
Log what the interrupt enable state is. we should merge this but there's no real merit in upstreaming this. Link: thesofproject#4943 Signed-off-by: Pierre-Louis Bossart <[email protected]>
@bardliao FYI. My two hypotheses are |
The mechanism for interrupts is that we have ONE interrupt, and then we deal with all the links with this loop
If we have an interrupt on link2, we could very well miss an interrupt on link0 if it happens AFTER we deal with link0 in the list walk. If we look at the reports so far we seem to miss the interrupt for link0 and link1, in all cases we don't seem to have problems dealing with the amplifiers on link2-3. |
Good finding. And it explains why the issue only seen with AIOC. However, I am curious why we only see the issue on LNL? |
@bardliao I am not sure this theory is correct, if it is that would mean the solution we implemented years ago was never robust... For LNL we start the links earlier, and IIRC we don't power-up all the links at once as we do for earlier generations. That could explain differences in timing during the enumeration stage. |
Hope we can find something with #4976. But if it is the case, how should we fix it? |
If I am correct, we can loop multiple times to reduce the possibility of such issues, but it would not be a 100% robust fix. |
Question 1 is probably, is the interrupt definitely edge triggered? And secondly, does it have to be? I would assume it is here and there is no option to switch it to level triggered, which is generally much easier to get right. The normal handling for overlapping edge triggered IRQs works roughly like this:
Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core. |
@charleskeepax we have ONE interrupt, but it's a combined status for IPC, streams, SoundWire and wakes. |
I have had very occasional enumeration failures with cs42l43 on the TGL UpX, that has been on my list to investigate for a quite a while. This is probably a good candidate for a cause. If I follow the code correctly in sound/soc/sof/intel/hda.c, it is masking the IRQ in the handler, and unmasking it at the bottom of the thread. That is a pretty questionable thing to be doing with an edge triggered IRQ, will depend on the exact details of the hardware but generally that would be broken as you lose any IRQs that happen during the time the handler is running. Also keep in mind this could be a result of there being more IRQs. The less IRQs there are the less likely you are to lose some, so maybe one of those other changes increased the IRQ volume, or just changed the time such that two IRQs were more likely to happen close to each other. |
But first step would really be to check the hardware behaviour, we still haven't had any luck getting docs for the Cadence IP, although I assume this bit falls outside of that anyway. |
It's not a Cadence problem. |
we did change the timing on LNL, now the links are started serially whereas in previous solutions we have to power them up at the same time. I think the serial part creates a case where a link has an alert while handling another, it would be less likely in the simultaneous case since all devices would be handled at the same time. |
What happens if 2) comes between 1.1 and 1.2?
Looks like there is another short window between 3.2 and 3.3 for another race here...
The generic issue should indeed be a "solved problem": do you have a pointer to a good and not "rough" reference documentation? Unless edge-triggered interrupts are broken by design :-D |
I believe the state variable would need to be either handled as an atomic or be guarded with a spin lock.
Not sure this one actually counts since until the IRQ is unmasked you know no new handlers will kick off but depends a bit on how you order the code. Probably safest to wrap all access to the state in a spin lock.
They are not broken by design, but they sure are much harder to get right than level triggered IRQs. Partly why it's a good idea to outsource the handling to the IRQ subsystem. I think I had a couple emails from Thomas Gleixner saved somewhere I will try to dig out. |
Last observation from me 8 days ago: |
Last seen is 41430 (generic_test_new number) "Start Time: 2024-05-17 08:27:34 UTC"
|
seen today on ba-lnlm-rvp-sdw-03 boot -28 2379f50512f242698128c0505c5f9e90 Thu 2024-06-06 16:00:15 UTC—Thu 2024-06-06 16:04:19 UTC |
@marc-hb @plbossart TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-kmod-load-unload.sh -l 50 I have observed issue mostly occurring after 21th iteration (3 times observed) |
Keep in mind that |
I am thinking will on board rt711 + external rt1316/rt714 cause any issue? Can we test on board rt711 only or use external rt711? |
@bardliao is certainly correct that we should keep one unmodified RVP with just the on-board RT711 codec, to help triangulate issues with board reworks and integration of external AIOC. @fredoh9 do you know if the 'nocodec' boards used on LNL can be retargeted for SoundWire tests? That said, the problem did occur with link0 and RT711, so I am not sure if this is only a board/hardware problem. |
seen again in daily tests planresultdetail/42570?model=LNLM_SDW_AIOC&testcase=check-runtime-pm-status |
I ran 500 iterations of playback/pm_runtime suspend without seeing this enumeration problem with #5065 I did see a bunch of xruns reported in issue #5066, but that's a separate problem. @bardliao @lgirdwood we are entering the second order twilight zone... |
I managed to reproduce the issue with a non-stop test: while true; do
TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p 2;
done The first error came after 3800s, over an hour. I am not sure how we are going to make progress on this one. @bardliao @lgirdwood this is going to be fun! |
To stop the while true; do
VAR1=VAL1 test.sh || break
done Or: while VAR1=VAL1 test.sh; do
:
done |
|
It took 649 iterations to reproduce the problem!
so far it's always link2 that shows the problem, not sure why. Maybe it's because it's the first link that gets used with this script. |
My theory is wrong, there are examples of failures where the first link to be used is NOT the one that fails, example copied from https://sof-ci.01.org/linuxpr/PR4857/build2721/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload shows link1/rt715 is started first but link0/rt711 fails to resume.
|
I can change one nocodec RVP to AIOC model today |
Another one today. The logs look a little bit different |
It's really the same issue. |
@plbossart @kv2019i Can we close this issue? I'm not observing this issue. if anyone observing sporadically let me know we can keep this issue with reduce priority and remove “production” label |
we need a stress test with PR #5106. If we see any occurrences of 'Peripheral 0' in the dmesg log, then it's a fail. |
Ok I 'll schedule stress test with PR #5106 |
@plbossart here is the stress report planresultdetail/43888. For failure I did not see 'Peripheral 0' in dmesg. For suspend/resume dmesg logs are missing so run it manual for 100 iteration for 3 different tests. Tests are able to complete 50+ iteration before it fails due to known S/R issue but dmesg does not show up 'Peripheral 0' when it fails. |
50 iterations is hardly a stress test, given the previous observation of about 0.6% likelihood of getting the enumeration issue. we don't need to test the system suspend-resume, which will have other issues. It's enough to go to D3/pm_runtime suspend. set -e
cnt=0
while true; do
TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p 2;
sleep 5;
cnt=$((cnt + 1));
echo "iteration $cnt done";
done and let it rip for over 2000 iterations. |
I ran the script for 15+ hours, 2557 iterations completed and still running(Peripheral 0 did not show up) for multiple pipeline playback scenarios. I will try capture stress scenario once then we can conclude. |
For multiple pipeline capture 2014 iteration completed and still running, issue not reproduced |
Spotted by chance on jf-lnlm-rvp-sdw-1 while testing something totally unrelated:
https://sof-ci.01.org/softestpr/PR1178/build324/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-sof-logger (internal run ID 40150)
(Spotted while testing unrelated thesofproject/sof-test#1178)
The text was updated successfully, but these errors were encountered: