Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [ZEPHYR] sof-logger dead: Invalid filename length 1650680879 or ldc file does not match firmware #5352

Closed
marc-hb opened this issue Feb 12, 2022 · 16 comments
Labels
bug Something isn't working as expected Intel Linux Daily tests This issue can be found in internal Linux daily tests logger Dictionary and logger P2 Critical bugs or normal features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) Zephyr Issues only observed with Zephyr integrated

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 12, 2022

Rarely, the sof-logger dies like this at the very start of a test:
https://sof-ci.01.org/sofpr/PR5340/build11976/devicetest/?model=TGLU_UP_HDA_ZEPHYR&testcase=test-speaker

error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting

This happens in about 1 out 50 test runs.

The dictionary is valid because other tests immediately before and after are fine in the exact same configuration.

While the "sof-logger was already dead" error message is in the end the same as #5120 (which was hiding this bug), the symptoms are extremely different.

This caused a number of failures in recent daily runs 10079, 10105 and 10146. It seems Zephyr specific. It's not clear why it's even more rare in PR testing.

10079?model=TGLU_UP_HDA_ZEPHYR&testcase=volume-basic-test-50
10079?model=WHL_UPEXT_HDA_ZEPHYR&testcase=multiple-pipeline-capture-50

10105?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-playback-50

10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=check-capture-50rounds
10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pause-resume-50
10146?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10

Daily run 10079:

Start Time: 2022-02-09 22:27:36 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 8bf95cb0
SOF Branch: main
SOF Commit: 9983ab67a9fe
Zephyr Commit: 41a63f48fdc8

DMA alignment fix c11562b for #5120 was between 10079 and 10105

Daily run 10105 was:

Start Time: 2022-02-10 22:27:29 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 936898a6
SOF Branch: main
SOF Commit: 443b21de4b3f
Zephyr Commit: 5094a6e08cee

Daily run 10146 was:

Kernel Branch: topic/sof-dev
Kernel Commit: 98119478
SOF Branch: main
SOF Commit: b8954754f055
Zephyr Commit: v3.0.0-rc3

@marc-hb marc-hb added the bug Something isn't working as expected label Feb 12, 2022
@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 12, 2022

P1 because it affects PR and daily testing.

@kv2019i could this be related to thesofproject/linux#3275

@jsarha any clue now that you're an expert after fixing #5120?

@marc-hb marc-hb added logger Dictionary and logger P1 Blocker bugs or important features labels Feb 12, 2022
@marc-hb marc-hb added the Zephyr Issues only observed with Zephyr integrated label Feb 12, 2022
@marc-hb marc-hb changed the title [BUG] sof-logger dead: Invalid filename length 1650680879 or ldc file does not match firmware [BUG] [ZEPHYR] sof-logger dead: Invalid filename length 1650680879 or ldc file does not match firmware Feb 12, 2022
@lgirdwood
Copy link
Member

lgirdwood commented Feb 14, 2022

1650680879 == 0x6263642F == "bcd/" So it looks like we are reading a header? as length.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 14, 2022

I don't think so, the values seem fairly random. Here's a sample of a few different failures:

error: Invalid filename length 12320 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x35008) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 1601203553 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

error: Invalid filename length 1601463667 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 7471104 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x23006) returned -22
error: fetch_entry() failed with: -22, aborting

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 19, 2022

Something strange happened in the middle of a test on WHL:
https://sof-ci.01.org/softestpr/PR862/build985/devicetest/?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-suspend-resume-with-playback-5

[     2728877.704064] (          14.114583) c0 hda-dma            ..../intel/hda/hda-dma.c:936  INFO hda-dmac :7 -> remove
[     2728973.954060] (          96.249992) c0 dma                           src/lib/dma.c:141  INFO dma_put(), dma = 0x9e06e570, sref = 0
[     2728986.089477] (          12.135416) c0 hda-dai      3.6   /drivers/intel/hda/hda.c:90   INFO hda_remove()
[     2729002.756143] (          16.666666) c0 dai                           src/lib/dai.c:188  INFO dai_put type 3 index 6 new sref 0
[     2729116.558221] (         113.802078) c0 ipc                  src/ipc/ipc3/handler.c:1579 INFO ipc: new cmd 0x30110000
[     2729129.422804] (          12.864583) c0 ipc                  src/ipc/ipc3/handler.c:1361 INFO ipc: comp 62 -> free
[     2729141.089470] (          11.666666) c0 pipe         9.62  ......./pipeline-graph.c:195  INFO pipeline_free()
error: log_entry_address    0x204b0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.

Found valid LDC address after skipping 4 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid filename length 1952539747 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x29007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

@jsarha
Copy link
Contributor

jsarha commented Feb 21, 2022

This does not appear to happen on my TGL board. I have now run test-speaker.sh 730 times (had it running in a loop over night) on Zephyr build of the latest https://github.com/thesofproject/sof main.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 22, 2022

TGL does not seem to reproduce the most frequently and test-speaker neither. Do you have an APL board?

I would also recommend reloading the firmware regularly, just an intuition based on staring at a lot of failures recently. https://github.com/thesofproject/sof-test/tree/main/tools/kmod works great.

EDIT: seen again in daily 10429

Start Time: 2022-02-22 22:27:21 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 5b26e39d
SOF Branch: main
SOF Commit: 58ce6e68608f
Zephyr Commit: 8239aea29c16

10429?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-all-50
10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10
10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=multiple-pipeline-capture-50

@keqiaozhang
Copy link
Collaborator

@jsarha,this issue is still valid in CI, we see this issue almost every day in "CI daily test(Today's test ID:10429)".
For TGL platform, maybe you can test with check-xrun-injection.sh.

@mengdonglin mengdonglin added P2 Critical bugs or normal features and removed P1 Blocker bugs or important features labels Feb 28, 2022
@mengdonglin
Copy link
Collaborator

Lower priority to P2, as we're going to abandon DMA-based SOF trace and use Zephyr native LOG implementation. Zephyr native LOG implementation will probably fix this problem.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Feb 28, 2022

Unlike bug severity, priority is never "objectively" defined with specified metrics. Priority can be defined in many different, project-specific ways based on various inputs and preferences. However I don't think I've seen the implementation details of the fix make any priority difference before.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Mar 9, 2022

Slightly different failure in daily 10797?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-capture-50rounds

Still happening at the very start of the test.

 TIMESTAMP         (us)              DELTA  C# COMPONENT          LOCATION                      CONTENT	ktime=1980.913s  @  2022-03-08 23:03:17 +0000 UTC

Found valid LDC address after skipping 8 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid number of parameters.
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

@keqiaozhang
Copy link
Collaborator

New failure in daily 11126?modelFirmwareType=SOF-Zephyr&model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pipeline-capture-50, happening at the very start of the test.

error: Invalid text length.
error: read_entry_from_ldc_file(0x26007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Apr 1, 2022

More DMA trace corruption with Zephyr, this time not at the very start: https://sof-ci.01.org/sofpr/PR5631/build12594/devicetest/

marc-hb added a commit to marc-hb/sof-test that referenced this issue Apr 28, 2022
This failure is just too frequent on Zephyr, downgrade it to a SKIP.

See long story in thesofproject/sof#5352

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Apr 28, 2022

Downgrading this failure to a SKIP because it's too frequent; too much pollution

marc-hb added a commit to marc-hb/sof-test that referenced this issue Apr 28, 2022
This failure is just too frequent on Zephyr, downgrade it to a SKIP.

See long story in thesofproject/sof#5352

Signed-off-by: Marc Herbert <[email protected]>
marc-hb added a commit to thesofproject/sof-test that referenced this issue Apr 28, 2022
This failure is just too frequent on Zephyr, downgrade it to a SKIP.

See long story in thesofproject/sof#5352

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb marc-hb added the Intel Linux Daily tests This issue can be found in internal Linux daily tests label Apr 29, 2022
@marc-hb
Copy link
Collaborator Author

marc-hb commented May 11, 2022

Downgrading this failure to a SKIP because it's too frequent; too much pollution

This made a huge difference but there are still some failures, typically like this:
https://sof-ci.01.org/sofpr/PR5796/build327/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger

error: log_entry_address          0 is not in dictionary range!

There is only so much DMA corruption you can ignore.

Same in https://sof-ci.01.org/sofpr/PR5266/build399/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 26, 2022

@mengdonglin mengdonglin added the won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) label Sep 6, 2022
@mengdonglin
Copy link
Collaborator

Won't fix cavs tool issue. Will switch to new logging tool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected Intel Linux Daily tests This issue can be found in internal Linux daily tests logger Dictionary and logger P2 Critical bugs or normal features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) Zephyr Issues only observed with Zephyr integrated
Projects
None yet
Development

No branches or pull requests

5 participants