multiple-pipeline-capture: Fail to find 4 pipelines still alive, only 3 left #472
This is a random issue on multiple platforms, e.g. on EHL RVP and a CML-U notebook (Olym*) in the daily test on 2020-10-25, report ID 613.

Comments
Try adding some debug logging to see which pipeline is dead.
I ran 500 iterations on GLK, the platform for which this error was reported in one daily test log. No luck: it's a PASS every time. Will try with longer values.
Of course Murphy's Law applies: failure detected on GLK at the 552nd iteration, with SOF firmware master and topic/sof-dev.
And unfortunately the sof-test logs give zero information on what the errors might be. @aiChaoSONG we really need to update the script to provide more information on which capture failed.
@plbossart I have added some debug info for this case in my personal sof-test repo, and I am also trying to reproduce this issue on some platforms; check reports 663, 662 and 661. I will test more.
I just submitted a two-line…
I checked and there's no silly… Is there any chance…
More specifically?
I saw this problem on ICL_HDA_HDA/WHL_RVP_HDA with multiple-pipeline-capture.sh recently. #507 will help debug which process might be the problem, but I still don't see any error message.
When the issue occurs, there are four pipeline processes running, but only three are counted. Confirmed to be a test case issue.
BTW there are obsolete…
There's a fair amount of code churn happening right now in…
With a1c5677, we are able to see how many processes are alive and what they are. But in a recent report, https://sof-ci.sh.intel.com/#/result/planresultdetail/789?model=JSL_RVP_HDA&testcase=multiple-pipeline-capture-50, we do have four processes, yet the calculated value is 3.
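For illustration, a hypothetical debug snippet in the spirit of a1c5677 — the exact commands and the `$calculated_count` variable are assumptions for this sketch, not the actual sof-test code:

```bash
# List every surviving aplay/arecord process with its PID and state,
# then print the count the test logic computed ($calculated_count is a
# hypothetical stand-in for sof-test's internal counter).
echo "--- alive pipeline processes ---"
ps -C aplay,arecord -o pid,stat,etime,cmd
echo "calculated pipeline count: $calculated_count"
```

With output like this in the logs, a mismatch between the ps listing (four rows) and the computed count (3) points at the counting method rather than at a genuinely dead pipeline.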
Thanks for the example. Amazingly, the total is wrong only after 29 successful test iterations. After the #525 alignment and de-duplication, I think a good next step would be to add…
I think the very first thing to do immediately after #525 is to merge the two scripts into one, because that will remove a LOT of duplicated code. Could you do that after #525 has been tested by at least one daily test run? Once that's done we can look at #516 and other things. I scanned…
As usual this should be tested locally first and not merged on the same day as other test changes; one (test) change at a time.
I'm 90% sure I found the root cause, and good news: yes, this seems to be a pure test code issue. I will hopefully submit a fix on Monday the 30th when I get back. In the meantime please review #532, because I'm not going to copy/paste the fix.
pidof ignores processes in uninterruptible sleep by default, probably because sending them signals is futile. pidof has a -z option but it's just simpler to use "ps". Also de-duplicate code into a new ps_checks() function.

Any aplay or arecord process can be in uninterruptible sleep when doing I/O, but the easiest way to reproduce this is to wait a few seconds and let the DSP go to sleep.

Fixes: thesofproject#472
Signed-off-by: Marc Herbert <[email protected]>
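A minimal sketch of the "ps"-based check this commit message describes — ps_checks() is the name the message gives, but the body, ps options, and error text here are assumptions, not the actual sof-test implementation:

```bash
# Count surviving pipeline processes with "ps" rather than "pidof",
# because pidof silently skips processes in uninterruptible sleep.
ps_checks()
{
    local expected=$1 alive
    # ps -C matches by command name and lists processes in any state,
    # including "D" (uninterruptible sleep).
    alive=$(ps -C aplay,arecord --no-headers -o pid,stat,cmd | wc -l)
    if [ "$alive" -ne "$expected" ]; then
        ps -C aplay,arecord -o pid,stat,cmd   # show what is left
        echo "Fail to find $expected pipelines still alive, only $alive left"
        return 1
    fi
}
```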
Confirmed to be the issue, see the longer explanation in the commit message of candidate fix PR #538.
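To observe the pidof behaviour in isolation, a hypothetical reproduction at a shell prompt (the ALSA device name and the durations are guesses; adjust for your platform):

```bash
# Start a capture in the background; it can enter uninterruptible
# sleep ("D" state) while waiting on I/O, e.g. once the DSP sleeps.
arecord -D hw:0,0 -f dat -d 60 /dev/null &
sleep 5                        # give the DSP a few seconds to go to sleep
ps -C arecord -o pid,stat,cmd  # ps still lists the process, state "D..."
pidof arecord                  # may print nothing: D-state PIDs skipped
pidof -z arecord               # -z makes pidof include D-state PIDs
```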
Having the same -w parameter for both this initial sleep and the test duration made no sense: fixed by removing the first sleep.

It's not clear what the purpose of this first sleep was; it has been in multiple-pipeline-capture/playback.sh since the dawn of time (= commit 39cd964). For sure this first sleep was helping hide "uninterruptible state" bug thesofproject#472 by giving the DSP more time to wake up!

Signed-off-by: Marc Herbert <[email protected]>
Nightly test run 912 still shows some failures; let's not let GitHub auto-close this, and close it manually only when we actually stop seeing the failure (PR tests almost never showed it; they don't run for long enough).

Tentative fix #538 was merged 15 hours ago, and nightly test run 912 started at 2020-12-02 05:31:57 UTC, which is unfortunately about the same time, so it's not obvious which sof-test version run 912 actually used (internal issue 619, possible solution in #544).

By sheer luck I also fixed a log message in the same #538 ("…").
Nightly test run 937 seems OK, including the report of an actual failure!
Having the same -w parameter for two different sleeps in the same iteration made no sense.

It was not clear what the purpose of this first sleep was; it has been in multiple-pipeline-capture/playback.sh since the dawn of time (= commit 39cd964). For sure this first sleep was helping hide "uninterruptible state" bug thesofproject#472 by giving the DSP more time to wake up!

More recently, after merging and reverting an earlier version of this (PR thesofproject#543 / commit f93a3c8), it was found that this first sleep was (accidentally?) giving processes more time to actually disappear after being killed at the end of a test round, so they would not pollute the next test iteration. Make that clearer by moving the first sleep to the end of the iteration, right after the kills, and hardcode it to one second.

Signed-off-by: Marc Herbert <[email protected]>
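A sketch of the iteration ordering this commit message describes — the helper functions and variables are hypothetical placeholders, not the actual sof-test code:

```bash
for i in $(seq 1 "$loop_count"); do
    start_pipelines           # launch the aplay/arecord processes
    sleep "$wait_time"        # single use of -w: the test duration
    ps_checks "$expected"     # verify all pipelines are still alive
    kill_pipelines            # stop everything at the end of the round
    sleep 1                   # hardcoded 1s so killed processes actually
                              # disappear before the next iteration counts
done
```

Putting the fixed one-second sleep after the kills makes its purpose explicit: it drains the previous round instead of delaying the measurement.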