
multiple-pipeline-capture: Fail to find 4 pipelines still alive, only 3 left #472

Closed
mengdonglin opened this issue Oct 26, 2020 · 19 comments · Fixed by #538
Labels: P1 Blocker bugs or important features · type:bug Something doesn't work as expected

@mengdonglin
Contributor

mengdonglin commented Oct 26, 2020

This is an intermittent issue seen on multiple platforms.

For example, it occurred on the EHL RVP and the CML-U notebook Olym* in the daily test on 2020-10-25, report ID 613.

2020-10-25 23:07:04 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh will use topology /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg to run the test case
2020-10-25 23:07:04 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2020-10-25 23:07:04 UTC [REMOTE_INFO] Run command to get pipeline parameters
2020-10-25 23:07:04 UTC [REMOTE_COMMAND] sof-tplgreader.py /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg -f 'type:playback' -b ' pcm:HDA Digital' -s 0 -e
2020-10-25 23:07:04 UTC [REMOTE_INFO] Testing: Headset [hw:0,0]
2020-10-25 23:07:04 UTC [REMOTE_COMMAND] aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2020-10-25 23:07:04 UTC [REMOTE_INFO] pipeline start sleep 0.5s for device wakeup
2020-10-25 23:07:09 UTC [REMOTE_INFO] checking pipeline status
2020-10-25 23:07:09 UTC [REMOTE_INFO] preparing sleep 5
2020-10-25 23:07:14 UTC [REMOTE_INFO] checking pipeline status again
/home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh: line 103: 66312 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh: line 103: 66321 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh: line 103: 66330 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh: line 103: 66350 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
2020-10-25 23:07:14 UTC [REMOTE_INFO] ===== Testing: (Loop: 32/50) =====
2020-10-25 23:07:14 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh will use topology /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg to run the test case
2020-10-25 23:07:14 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2020-10-25 23:07:14 UTC [REMOTE_INFO] Run command to get pipeline parameters
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] sof-tplgreader.py /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg -f 'type:capture' -b ' pcm:HDA Digital' -s 0 -e
2020-10-25 23:07:14 UTC [REMOTE_INFO] Testing: Headset [hw:0,0]
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q
2020-10-25 23:07:14 UTC [REMOTE_INFO] Testing: DMIC [hw:0,1]
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] arecord -D hw:0,1 -c 4 -r 48000 -f S16_LE /dev/null -q
2020-10-25 23:07:14 UTC [REMOTE_INFO] Testing: DMIC16kHz [hw:0,2]
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] arecord -D hw:0,2 -c 2 -r 16000 -f S16_LE /dev/null -q
2020-10-25 23:07:14 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh will use topology /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg to run the test case
2020-10-25 23:07:14 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2020-10-25 23:07:14 UTC [REMOTE_INFO] Run command to get pipeline parameters
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] sof-tplgreader.py /lib/firmware/intel/sof-tplg/sof-ehl-rt5660.tplg -f 'type:playback' -b ' pcm:HDA Digital' -s 0 -e
2020-10-25 23:07:14 UTC [REMOTE_INFO] Testing: Headset [hw:0,0]
2020-10-25 23:07:14 UTC [REMOTE_COMMAND] aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2020-10-25 23:07:14 UTC [REMOTE_INFO] pipeline start sleep 0.5s for device wakeup
2020-10-25 23:07:20 UTC [REMOTE_INFO] checking pipeline status
2020-10-25 23:07:20 UTC [REMOTE_INFO] preparing sleep 5
2020-10-25 23:07:25 UTC [REMOTE_ERROR] Target pipeline count: 4, current process count: 3
/home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh: line 96: 66463 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
2020-10-25 23:07:25 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
/home/ubuntu/sof-test/case-lib/hijack.sh: line 17: 66434 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/home/ubuntu/sof-test/case-lib/hijack.sh: line 17: 66443 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/home/ubuntu/sof-test/case-lib/hijack.sh: line 23: 66424 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
2020-10-25 23:07:25 UTC [REMOTE_ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh
2020-10-25 23:07:25 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline-capture.sh:140
@mengdonglin mengdonglin added type:bug Something doesn't work as expected P2 Critical bugs or normal features EHL labels Oct 26, 2020
@aiChaoSONG aiChaoSONG self-assigned this Nov 2, 2020
@aiChaoSONG

Will try to add some debug logging to see which pipeline is dead.

@plbossart
Member

I ran 500 iterations on GLK, where this error was reported in one daily test log. No luck: everything passed. Will try with longer runs.

@plbossart
Member

Of course Murphy's Law applies: failure detected on GLK at the 552nd iteration, with SOF firmware master and topic/sof-dev.

2020-11-03 03:26:26 UTC [INFO] ===== Testing: (Loop: 552/5000) =====
2020-11-03 03:26:26 UTC [INFO] /root/sof-test/test-case/multiple-pipeline-capture.sh will use topology /lib/firmware/intel/sof-tplg/sof-glk-da7219.tplg to run the test case
2020-11-03 03:26:26 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2020-11-03 03:26:26 UTC [INFO] Run command to get pipeline parameters
2020-11-03 03:26:26 UTC [COMMAND] sof-tplgreader.py /lib/firmware/intel/sof-tplg/sof-glk-da7219.tplg -f 'type:capture' -b ' pcm:HDA Digital' -s 0 -e
2020-11-03 03:26:26 UTC [INFO] Testing: Headset [hw:0,1]
2020-11-03 03:26:26 UTC [COMMAND] arecord -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2020-11-03 03:26:26 UTC [INFO] Testing: DMIC [hw:0,99]
2020-11-03 03:26:26 UTC [COMMAND] arecord -D hw:0,99 -c 4 -r 48000 -f S16_LE /dev/null -q
2020-11-03 03:26:26 UTC [INFO] /root/sof-test/test-case/multiple-pipeline-capture.sh will use topology /lib/firmware/intel/sof-tplg/sof-glk-da7219.tplg to run the test case
2020-11-03 03:26:26 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2020-11-03 03:26:26 UTC [INFO] Run command to get pipeline parameters
2020-11-03 03:26:26 UTC [COMMAND] sof-tplgreader.py /lib/firmware/intel/sof-tplg/sof-glk-da7219.tplg -f 'type:playback' -b ' pcm:HDA Digital' -s 0 -e
2020-11-03 03:26:26 UTC [INFO] Testing: Speakers [hw:0,0]
2020-11-03 03:26:26 UTC [COMMAND] aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2020-11-03 03:26:26 UTC [INFO] Testing: Headset [hw:0,1]
2020-11-03 03:26:26 UTC [COMMAND] aplay -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/zero -q
2020-11-03 03:26:26 UTC [INFO] pipeline start sleep 0.5s for device wakeup
2020-11-03 03:26:31 UTC [INFO] checking pipeline status
2020-11-03 03:26:31 UTC [INFO] preparing sleep 5
2020-11-03 03:26:36 UTC [ERROR] Target pipeline count: 4, current process count: 3
/root/sof-test/test-case/multiple-pipeline-capture.sh: line 96: 125680 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/root/sof-test/test-case/multiple-pipeline-capture.sh: line 96: 125689 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
2020-11-03 03:26:36 UTC [ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
/root/sof-test/case-lib/hijack.sh: line 23: 125651 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
/root/sof-test/case-lib/hijack.sh: line 23: 125660 Killed                  "${APP_LST[$1]}" -D $dev -c $channel -r $rate -f $fmt "${DEV_LST[$1]}" -q
2020-11-03 03:26:36 UTC [ERROR]  func_error_exit()  @  /root/sof-test/test-case/multiple-pipeline-capture.sh
2020-11-03 03:26:36 UTC [ERROR]  main()  @  /root/sof-test/test-case/multiple-pipeline-capture.sh:140
2020-11-03 03:26:36 UTC [INFO] Starting /usr/local/bin/sof-logger  -l /etc/sof/sof-glk.ldc -o /root/sof-test/logs/multiple-pipeline-capture/2020-11-02-19:47:14-8753/etrace.txt
2020-11-03 03:26:37 UTC [INFO] pkill -TERM sof-logger
Terminated
2020-11-03 03:26:39 UTC [INFO] nlines=7 /root/sof-test/logs/multiple-pipeline-capture/2020-11-02-19:47:14-8753/etrace.txt
2020-11-03 03:26:39 UTC [INFO] Test Result: FAIL!

multiple-pipeline-capture-glk.tar.gz

@plbossart
Member

And unfortunately there is zero information in the sof-test logs about what the errors might be. @aiChaoSONG, we really need to update the script to provide more information on which capture failed.

@aiChaoSONG

aiChaoSONG commented Nov 4, 2020

@plbossart I have added some debug info for this case in my personal sof-test repo and am also trying to reproduce the issue on some platforms; check reports 663, 662 and 661, and I will test more.
Note: CI now supports using my personal sof-test repo to run a test plan.

@marc-hb marc-hb changed the title multiple-pipeline-capture: Fail to find 4 active pipelines, can only find 3 multiple-pipeline-capture: Fail to find 4 pipelines still alive, only 3 left Nov 10, 2020
@marc-hb
Collaborator

marc-hb commented Nov 10, 2020

... to provide more information on which capture failed.

I just submitted the two-line pgrep -a change #507, which helps.

And unfortunately there is zero information on what the errors might be in sof-test logs.

I checked, and there's no silly /dev/null redirection, so this is not an sof-test problem this time.

Is there any chance --quiet silences errors too? --quiet options do not normally silence errors, but man aplay is very vague.

I have added some debug info in my personal sof-test repo to this case

More specifically?

@fredoh9
Collaborator

fredoh9 commented Nov 11, 2020

I saw this problem on ICL_HDA_HDA/WHL_RVP_HDA with multiple-pipeline-capture.sh recently. #507 will help narrow down which process is the problem, but I still don't see any error message.

@mengdonglin mengdonglin added P1 Blocker bugs or important features and removed P2 Critical bugs or normal features labels Nov 16, 2020
@aiChaoSONG

aiChaoSONG commented Nov 16, 2020

When the issue occurs, there are four pipeline processes running, but only three are counted. Confirmed to be a test case issue.
Report ID: 745; failed platforms: TGLU_RVP_HDA, TGLU_SHUR_HDA, ICL_RVP_HDA

2020-11-14 07:41:54 UTC [REMOTE_ERROR] Target pipeline count: 4, current process count: 3
154582 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
154591 aplay -D hw:0,3 -c 2 -r 48000 -f S16_LE /dev/zero -q
154552 arecord -D hw:0,6 -c 4 -r 48000 -f S32_LE /dev/null -q
154561 arecord -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/null -q

@marc-hb
Collaborator

marc-hb commented Nov 17, 2020

but only three are counted

BTW, there are obsolete expr commands in these tests; please help review #497, which removes a few, and then let's use shellcheck to remove the rest.
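
For context, the kind of cleanup #497 and shellcheck push for looks roughly like this (hypothetical snippet, not the actual diff):

```shell
# Obsolete style flagged by shellcheck SC2003 ("expr is antiquated"):
# forks an external expr process just to do simple arithmetic
i=$(expr "$i" + 1)

# Equivalent shell arithmetic expansion, no external process needed
i=$((i + 1))
```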

@marc-hb
Collaborator

marc-hb commented Nov 18, 2020

There's a fair amount of code churn happening right now in multiple-pipeline-capture/playback.sh (and not just there), so please keep reporting any occurrence you see here, including the versions (SHA1) of all components.

@aiChaoSONG

With a1c5677, we are able to see how many processes are alive and what they are. But in a recent report, https://sof-ci.sh.intel.com/#/result/planresultdetail/789?model=JSL_RVP_HDA&testcase=multiple-pipeline-capture-50, we do have four processes, yet the counted value is 3.

@marc-hb
Collaborator

marc-hb commented Nov 19, 2020

Thanks for the example. Amazingly the total is wrong only after 29 successful test iterations.

After #525 alignment and de-duplication, I think a good next step would be to add rec_count and aplay_count to the error message.

@aiChaoSONG

@marc-hb Yes, that's what #516 does, but it needs to wait for your patch to merge to eliminate the divergence between the two multiple-pipeline-* cases.

@marc-hb
Collaborator

marc-hb commented Nov 19, 2020

I think the very first thing to do immediately after #525 is to merge the two scripts into only one because it will remove a LOT of duplicated code. Could you do that after #525 has been tested by at least one daily test run?

Once that's done we can look at #516 and other things.

I scanned man pidof and just found -z; this is another (small code) change I think we should try:

       -z     Try to detect processes which are stuck in uninterruptible (D) or zombie (Z) status.
               Usually  these  processes are skipped as trying to deal with them can cause pidof to hang.

As usual, this should be tested locally first and not merged on the same day as other test changes; one (test) change at a time.
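
For illustration, here is a minimal sketch (not the actual test code) of how the counts can diverge when a capture process is stuck in uninterruptible sleep:

```shell
# Count running aplay/arecord processes three different ways. Plain pidof
# can skip processes stuck in uninterruptible (D) sleep, which would
# explain a count of 3 when 4 processes actually exist.
pidof aplay arecord | wc -w               # default: may skip D-state processes
pidof -z aplay arecord | wc -w            # -z: also tries to catch D/Z-state processes
ps -C aplay,arecord --no-headers | wc -l  # ps lists them regardless of state
```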

@marc-hb marc-hb self-assigned this Nov 25, 2020
@marc-hb
Collaborator

marc-hb commented Nov 25, 2020

I'm 90% sure I found the root cause, and good news: yes this seems to be a pure test code issue.

I will hopefully submit a fix on Monday the 30th when I get back. In the meantime, please review #532, because I'm not going to copy/paste the fix.

marc-hb added a commit to marc-hb/sof-test that referenced this issue Dec 1, 2020
pidof ignores processes in uninterruptible sleep by default, probably
because sending them signals is futile. pidof has a -z option but it's
just simpler to use "ps".

Also de-duplicate code into new ps_checks() function

Fixes: thesofproject#472

Any aplay or arecord process can be in uninterruptible sleep when doing
I/O but the easiest way to reproduce is to wait a few seconds and let the
DSP go to sleep.

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Collaborator

marc-hb commented Dec 1, 2020

I scanned man pidof and I just found -z,

Confirmed to be the issue, see longer explanation in commit message of candidate fix PR #538

xiulipan pushed a commit that referenced this issue Dec 2, 2020
pidof ignores processes in uninterruptible sleep by default, probably
because sending them signals is futile. pidof has a -z option but it's
just simpler to use "ps".

Also de-duplicate code into new ps_checks() function

Fixes: #472

Any aplay or arecord process can be in uninterruptible sleep when doing
I/O but the easiest way to reproduce is to wait a few seconds and let the
DSP go to sleep.

Signed-off-by: Marc Herbert <[email protected]>
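
As a rough illustration of the approach that commit describes, a ps-based check could look something like the sketch below (hypothetical shape; the actual ps_checks() merged in #538 may differ):

```shell
# Hypothetical sketch of a ps-based pipeline liveness check.
check_pipelines_alive()
{
    local expected=$1 current

    # ps counts processes in every state, including uninterruptible (D) sleep,
    # so a capture process blocked while the DSP wakes up is still counted.
    current=$(ps -C aplay,arecord --no-headers | wc -l)

    if [ "$current" -ne "$expected" ]; then
        echo "Target pipeline count: $expected, current process count: $current"
        ps -C aplay,arecord -o pid,stat,args --no-headers
        return 1
    fi
}
```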
marc-hb added a commit to marc-hb/sof-test that referenced this issue Dec 2, 2020
Having the same -w parameter for both this initial sleep and the test
duration made no sense: fixed by removing the first sleep.

It's not clear what the purpose of this first sleep was, it has been in
multiple-pipeline-capture/playback.sh since the dawn of time (= commit
39cd964) For sure this first sleep was helping hide
"uninterruptible state" bug thesofproject#472 by giving the DSP more time to wake up!

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Collaborator

marc-hb commented Dec 2, 2020

Nightly test run 912 still shows some failures. Let's not let GitHub auto-close this; let's close it manually only when we actually stop seeing it (PR tests almost never showed it; they don't run for long enough).

Tentative fix #538 was merged 15 hours ago, and nightly test run 912 started at 2020-12-02 05:31:57 UTC, which is unfortunately about the same time, so it's not obvious which sof-test version run 912 used (internal issue 619, possible solution in #544).

By sheer luck I also fixed a log message in the same #538 ("pipeline start sleep 0.5s for device wakeup"), which shows that run 912 did not run with fix #538 yet.

@marc-hb marc-hb reopened this Dec 2, 2020
@marc-hb
Collaborator

marc-hb commented Dec 3, 2020

Nightly test run 937 seems OK, including the report of an actual failure!

@marc-hb
Collaborator

marc-hb commented Dec 4, 2020

Looks like this test bug is gone, closing. However, note that fix #538 broke the directory name for logs/; see tentative fix #546 (+ internal Jenkins issue 640).

@marc-hb marc-hb closed this as completed Dec 4, 2020
marc-hb added a commit to marc-hb/sof-test that referenced this issue Mar 5, 2021
Having the same -w parameter for both this initial sleep and the test
duration made no sense: fixed by removing the first sleep.

It's not clear what the purpose of this first sleep was, it has been in
multiple-pipeline-capture/playback.sh since the dawn of time (= commit
39cd964) For sure this first sleep was helping hide
"uninterruptible state" bug thesofproject#472 by giving the DSP more time to wake up!

Signed-off-by: Marc Herbert <[email protected]>
marc-hb added a commit that referenced this issue Apr 27, 2021
Having the same -w parameter for both this initial sleep and the test
duration made no sense: fixed by removing the first sleep.

It's not clear what the purpose of this first sleep was, it has been in
multiple-pipeline-capture/playback.sh since the dawn of time (= commit
39cd964) For sure this first sleep was helping hide
"uninterruptible state" bug #472 by giving the DSP more time to wake up!

Signed-off-by: Marc Herbert <[email protected]>
marc-hb added a commit to marc-hb/sof-test that referenced this issue Apr 29, 2021
Having the same -w parameter for two different sleeps in the same
iteration made no sense.

It was not clear what the purpose of this first sleep was, it has been in
multiple-pipeline-capture/playback.sh since the dawn of time (= commit
39cd964) For sure this first sleep was helping hide
"uninterruptible state" bug thesofproject#472 by giving the DSP more time to wake up!

More recently it was found after merging and reverting an earlier
version of this (PR thesofproject#543 / commit f93a3c8) that this first sleep
was (accidentally?) giving more time for processes to actually
disappear after being killed at the end of a test round and not
pollute the next test iteration. Make that clearer by moving the first
sleep to the end of the iteration, right after the kills and hardcode to
one second.

Signed-off-by: Marc Herbert <[email protected]>
marc-hb added a commit that referenced this issue Apr 29, 2021
Having the same -w parameter for two different sleeps in the same
iteration made no sense.

It was not clear what the purpose of this first sleep was, it has been in
multiple-pipeline-capture/playback.sh since the dawn of time (= commit
39cd964) For sure this first sleep was helping hide
"uninterruptible state" bug #472 by giving the DSP more time to wake up!

More recently it was found after merging and reverting an earlier
version of this (PR #543 / commit f93a3c8) that this first sleep
was (accidentally?) giving more time for processes to actually
disappear after being killed at the end of a test round and not
pollute the next test iteration. Make that clearer by moving the first
sleep to the end of the iteration, right after the kills and hardcode to
one second.

Signed-off-by: Marc Herbert <[email protected]>