
[BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer ("sof-logger was already dead") #5120

Closed
keqiaozhang opened this issue Dec 21, 2021 · 34 comments · Fixed by #5329
Assignees
Labels
APL Applies to Apollolake platform blocked progress blocked by something else, applies to either feature or bug bug Something isn't working as expected CML Applies to Comet Lake platform logger Dictionary and logger P1 Blocker bugs or important features WHL Applies to WhiskeyLake platform Zephyr Issues only observed with Zephyr integrated
Milestone

Comments

@keqiaozhang
Collaborator

keqiaozhang commented Dec 21, 2021

EDIT: this is caused by DMA trace corruption, mostly with XCC and mostly with Zephyr (but not just). See below - @marc-hb

Describe the bug
We observed this error on some Zephyr platforms recently. The console log shows that sof-logger was already dead, but there are no obvious errors in dmesg or in the DMA trace. Since the error trace is not available on Zephyr platforms, we don't know what caused sof-logger to stop working.

the DMA trace shows:

warn: log_entry_address        0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.

Found valid LDC address after skipping 52 bytes (one line uses 20 + 0 to 16 bytes)
Skipped 0 bytes after the last statement.
error: Invalid filename length 1702256483 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x2d007) returned -22
error: fetch_entry() failed with: -22, aborting

To Reproduce
e.g. on the WHL Zephyr platform:
$ ./check-playback.sh -d 1 -l 1 -r 50

Reproduction Rate
Almost 100%

Expected behavior
sof-logger keeps running and decoding the DMA trace for the entire test run.

Impact
sof-logger stops working

Environment
Kernel Branch: topic/sof-dev
Kernel Commit: ac3b3338-1
SOF Branch: main
SOF Commit: afac44af5f49-2
Zephyr Commit: fef2e30b7f83

Screenshots or console output

2021-12-20 23:13:27 UTC [REMOTE_COMMAND] aplay   -Dhw:0,5 -r 48000 -c 2 -f S16_LE -d 1 /dev/zero -v -q
Hardware PCM card 0 'sof-hda-dsp' device 5 subdevice 0
Its setup is:
  stream       : PLAYBACK
  access       : RW_INTERLEAVED
  format       : S16_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 16
  buffer_size  : 16384
  period_size  : 4096
  period_time  : 85333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 4096
  period_event : 0
  start_threshold  : 16384
  stop_threshold   : 16384
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
declare -- cmd="journalctl_cmd --since=@1640041583"
2021-12-20 23:13:30 UTC [REMOTE_INFO] pkill -TERM sof-logger
2021-12-20 23:13:30 UTC [REMOTE_ERROR] sof-logger was already dead
2021-12-20 23:13:31 UTC [REMOTE_INFO] nlines=16604 /home/ubuntu/sof-test/logs/check-playback/2021-12-20-23:06:31-26447/slogger.txt
2021-12-20 23:13:32 UTC [REMOTE_INFO] Test Result: FAIL!

** dmesg **

[ 1472.825236] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x2014001e successful
[ 1472.825244] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 1472.825713] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 1472.825734] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1472.825759] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1472.826882] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1472.826907] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1472.827843] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1472.827867] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.732822] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.732861] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.733813] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.733848] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.734807] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.734842] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.756807] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.756841] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.757801] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.757835] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.847737] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 5 dir 0 cmd 0
[ 1473.847762] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 1473.848743] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.848778] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.848808] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 1473.848822] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x20140000 successful
[ 1473.848838] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 1473.849076] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 1473.849109] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: In hda_link_pcm_trigger cmd=0
[ 1473.849140] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
[ 1473.849373] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x80010000: GLB_DAI_MSG: CONFIG
[ 1473.849453] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 5 dir 0
[ 1473.849482] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.849772] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.849795] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.849884] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.849906] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PCM5P freed
[ 1473.849921] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30210000: GLB_TPLG_MSG: BUFFER_FREE
[ 1473.850080] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30210000: GLB_TPLG_MSG: BUFFER_FREE
[ 1473.850089] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget BUF9.0 freed
[ 1473.850099] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.850324] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.850335] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PGA9.0 freed
[ 1473.850345] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30210000: GLB_TPLG_MSG: BUFFER_FREE
[ 1473.850518] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30210000: GLB_TPLG_MSG: BUFFER_FREE
[ 1473.850528] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget BUF9.1 freed
[ 1473.850540] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.850804] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.850827] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.850933] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP_FREE
[ 1473.850947] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30110000: GLB_TPLG_MSG: PIPE_FREE
[ 1473.851139] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30110000: GLB_TPLG_MSG: PIPE_FREE
[ 1473.851147] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PIPELINE.9.HDA6.OUT freed
[ 1473.851152] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget HDA6.OUT freed
[ 1473.851347] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: close stream 5 dir 0
[ 1473.851752] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.851792] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.852728] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1473.852767] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.756773] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.756816] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.757727] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.757766] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.758727] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.758767] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.780720] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.780758] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.781712] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1474.781749] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.780750] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.780787] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.781709] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.781738] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.782718] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.782746] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.804729] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.804760] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.805704] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1475.805733] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1476.234981] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x20140000 successful
[ 1476.235007] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x90050000: GLB_TRACE_MSG: DMA_FREE
[ 1476.235164] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x90050000: GLB_TRACE_MSG: DMA_FREE
[ 1476.267007] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x40010000: GLB_PM_MSG: CTX_SAVE
[ 1476.267177] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x40010000: GLB_PM_MSG: CTX_SAVE
[ 1476.267210] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1010f0f successful
[ 1476.268485] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f successful
[ 1476.268492] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask f
[ 1476.268568] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Debug PCIR: 00000010 at  00000044
[ 1476.268575] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Turning i915 HDAC power 0
[ 1476.268578] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Current DSP power state: D3
[ 1476.268581] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: fw_state change: 6 -> 0

Test ID: 9007, Model name:WHL_UPEXT_HDA_ZEPHYR

@keqiaozhang keqiaozhang added bug Something isn't working as expected Zephyr Issues only observed with Zephyr integrated logger Dictionary and logger labels Dec 21, 2021
@marc-hb
Collaborator

marc-hb commented Dec 21, 2021

The console log shows that sof-logger was already dead, but there are no obvious errors in dmesg or in the DMA trace. Since the error trace is not available on Zephyr platforms, we don't know what caused sof-logger to stop working.

There will never be an error from the kernel or from the firmware when using the wrong .ldc file. The only check is in sof-logger.

error: Invalid filename length 1702256483 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x2d007) returned -22
error: fetch_entry() failed with: -22, aborting

There is no .ldc checksum with Zephyr right now, working on it. EDIT: fix submitted in #5129, please review.
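
For background, here is a purely conceptual sketch of what such a check can look like. Nothing below is the actual #5129 code; the hash, names and layout are made up for illustration. The idea is simply that a dictionary checksum is embedded in both the firmware image and the .ldc at build time, and sof-logger refuses to decode when the two values disagree.

/* Conceptual sketch only, not the #5129 implementation. */
#include <stdint.h>
#include <stdio.h>

/* FNV-1a, a stand-in for whatever hash the real check uses. */
static uint32_t dict_hash(const uint8_t *data, uint32_t len)
{
	uint32_t h = 2166136261u;

	for (uint32_t i = 0; i < len; i++)
		h = (h ^ data[i]) * 16777619u;
	return h;
}

int main(void)
{
	const uint8_t ldc_payload[] = "pretend .ldc dictionary payload";
	uint32_t ldc_checksum = dict_hash(ldc_payload, sizeof(ldc_payload));
	uint32_t fw_reported = ldc_checksum; /* pretend value read from firmware */

	if (fw_reported != ldc_checksum) {
		fprintf(stderr, "error: ldc file does not match firmware\n");
		return 1;
	}
	puts("ldc matches firmware, safe to decode the DMA trace");
	return 0;
}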

In the meantime, always rsync both the .ldc and .ri files at the same time.

I also recommend never having any .ldc file in /etc to avoid mismatches.


@marc-hb
Collaborator

marc-hb commented Dec 22, 2021

I took a closer look at the latest daily test results and found random trace corruption across ALL tests (and all platforms), even in the PASSing test runs! Most of the time sof-logger can recover from it and the tests PASS; sometimes there is just too much DMA trace corruption, sof-logger gives up, and the test fails.

It seems to happen at random places, not at any particular point in time. EDIT: actually, it seems to happen at regular intervals?

Trying to reproduce now.

[     9058880.473366] (          12.656249) c0 ipc                      src/ipc/ipc3/dai.c:187  INFO ipc_comp_dai_config() dai type = 3 index = 6
[     9058893.858782] (          13.385416) c0 dai          9.61        src/ipc/ipc3/dai.c:281  INFO dai_config() dai type = 3 index = 6 dd 0x9e0795c0
[     9059466.462926] (         572.604126) c0 ipc                  src/ipc/ipc3/handler.c:1576 INFO ipc: new cmd 0x60010000
[     9059492.817091] (          26.354166) c0 pipe         9.62  ....../pipeline-params.c:222  INFO pipe params dir 0 frame_fmt 0 buffer_fmt 9 rate 14
warn: log_entry_address        0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.

Found valid LDC address after skipping 64 bytes (one line uses 20 + 0 to 16 bytes)
[     9059639.379586] (         146.562500) c0 dai          9.61        src/ipc/ipc3/dai.c:103  INFO dai_data_config() dai type = 3 index = 6 dd 0x9e0795c0
[     9059653.650418] (          14.270833) c0 hda-dai      3.6   /drivers/intel/hda/hda.c:53   INFO hda_get_hw_params(): channels 2 rate 48000
[     9059702.140000] (          48.489582) c0 dai          9.61           src/audio/dai.c:347  INFO dai_playback_params() dest_dev = 0 stream_id = 0 src_width = 4 dest_width = 4
[     9059715.629583] (          13.489583) c0 dai          9.61           src/audio/dai.c:353  INFO dai_playback_params() fifo 0x0
[     9059745.629581] (          29.999998) c0 pipe         9.62  ....../pipeline-params.c:308  INFO pipe prepare

@marc-hb marc-hb self-assigned this Dec 22, 2021
@marc-hb
Collaborator

marc-hb commented Dec 22, 2021

I can finally reproduce on APL UP2, but only with XCC. No reproduction with the Zephyr SDK 0.13.2.

EDIT: confirmed 100%. With all other things strictly identical, switching back and forth between the two toolchains switches between 0% and 100% reproduction (within only a couple of minutes). Again: very often sof-logger can recover from the corruption, so don't wait for the test to fail; keep an eye on the DMA trace like this:

./test-case/sometest.sh -s                 # disable sof-logger in the test case, run it yourself instead
sudo sof-logger -t ... | tee _.log         # in one window
tail -f _.log | grep -i -e sync             # in a different window

EDIT: no need to run any test, regular logging is enough, see below. The more logs the faster the reproduction.

@marc-hb
Collaborator

marc-hb commented Dec 22, 2021

I had a look at check-playback-100s on APL Zephyr in many recent daily test results. They all show "periodic DMA trace corruption" from the moment I enabled sof-logger in Zephyr tests (daily 8417, November 23rd). What's recent is only how much worse it became; until now sof-logger was merely skipping over the corruption.

Every time I saw periodic DMA trace corruption with Zephyr, I checked the logs for XTOS on the same APL with the same versions of everything. There was never any DMA trace corruption with XTOS, only with Zephyr.

@marc-hb
Collaborator

marc-hb commented Dec 22, 2021

I took a closer look at the latest daily test results and I found random trace corruption across ALL tests (and all platforms), even in the PASSing test runs!

Actually, not all platforms. It seems to affect only WHL and APL, not TGL. There are some logging failures with TGL, but no periodic DMA trace corruption like this. It seems to have appeared a bit more recently on WHL.

@marc-hb
Collaborator

marc-hb commented Dec 23, 2021

@lyakh got the right intuition (thanks!): the reason the corruption is periodic in "stable" tests is that it happens exactly at the end of the ring buffer, at wrap time, every time. In my testing the corruption either never happens (e.g. with the Zephyr SDK) or always happens (e.g. with XCC) at the end of the ring buffer.

Instead of the end of the ring buffer, some corrupted data is sent. Interestingly:

  • It's not just the last, split, 36-byte-long (in my testing) log statement that is corrupted: the previous one is corrupted too! About 48 bytes seem corrupted every time.
  • The corruption seems deterministic! On the same UP2 system it always looks like this:

skipping 0x000e0009, skipping 0x06480230, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x08030007, skipping 0x00000000,

For reference

struct log_entry_header {
	uint32_t uid;
	uint32_t id_0 : TRACE_ID_LENGTH; /* e.g. Pipeline ID */
	uint32_t id_1 : TRACE_ID_LENGTH; /* e.g. Component ID */
	uint32_t core_id : 8;		 /* Reporting core's id */

	uint64_t timestamp;		 /* Timestamp (in dsp ticks) */
	uint32_t log_entry_address;	 /* Address of log entry in ELF */ offsetof=18
} __attribute__((packed));

sizeof(header)=20 = 5 words

4 arguments = 16 bytes = 4 words
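
To make those numbers easy to re-derive, here is a small standalone sketch that reprints the header layout. The only assumption not stated above is TRACE_ID_LENGTH = 12 bits; everything else is copied from the struct definition.

/* Re-derivation of the sizes quoted above; TRACE_ID_LENGTH = 12 is an
 * assumption, the rest is copied from the struct definition. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_ID_LENGTH 12 /* assumed bit width of pipeline/component IDs */

struct log_entry_header {
	uint32_t uid;
	uint32_t id_0 : TRACE_ID_LENGTH; /* e.g. Pipeline ID */
	uint32_t id_1 : TRACE_ID_LENGTH; /* e.g. Component ID */
	uint32_t core_id : 8;		 /* Reporting core's id */
	uint64_t timestamp;		 /* Timestamp (in dsp ticks) */
	uint32_t log_entry_address;	 /* Address of log entry in ELF */
} __attribute__((packed));

int main(void)
{
	/* Expect 20 bytes of header plus 0 to 16 bytes of arguments,
	 * matching the "one line uses 20 + 0 to 16 bytes" logger message. */
	printf("sizeof(header) = %zu\n", sizeof(struct log_entry_header));
	printf("offsetof(timestamp) = %zu\n",
	       offsetof(struct log_entry_header, timestamp));
	printf("offsetof(log_entry_address) = %zu\n",
	       offsetof(struct log_entry_header, log_entry_address));
	printf("max line size = %zu bytes\n",
	       sizeof(struct log_entry_header) + 4 * sizeof(uint32_t));
	return 0;
}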

@marc-hb marc-hb changed the title [BUG] "sof-logger was already dead" happened while testing on some zephyr platforms [BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer "sof-logger was already dead" Dec 23, 2021
@marc-hb
Collaborator

marc-hb commented Dec 23, 2021

Some data mining in the test results (i.e., a large grep -r) showed that WHL UPX never experienced this corruption until daily test 8866 on Dec 12. Starting from the next daily test run, 8897, it experienced the same end-of-DMA-ring-buffer corruption systematically. This allowed me to bisect the regression to this particular commit from @andyross (PR zephyrproject-rtos/zephyr/pull/40609):

commit 923277a84288ff42e4191520ec4b178623a0e1ac
Author: Andy Ross <[email protected]> 
Date:   Thu Nov 18 19:23:43 2021 -0800

    soc/intel_adsp_cavs18: Use new IDC driver
    
    Disable the use of the legacy IDC driver and IPM-based sched_ipi()
    implementation.
    
    Signed-off-by: Andy Ross <[email protected]>

 .../intel_adsp/cavs_v18/Kconfig.defconfig.series   | 29 ++++++----------------
 1 file changed, 7 insertions(+), 22 deletions(-)

In case you wondered, it's not possible to revert just this commit:

zephyr/drivers/ipm/ipm_cavs_idc.c:233: multiple definition of `cavs_idc_smp_init'
zephyr/soc/xtensa/intel_adsp/common/soc_mp.c:286: first defined here

It is possible to revert the entire series, and that does make the corruption go away, but I think this commit is just a trigger/messenger. Which subtle caching issue it exposes, I have no idea.

@marc-hb marc-hb added APL Applies to Apollolake platform blocked progress blocked by something else, applies to either feature or bug P1 Blocker bugs or important features WHL Applies to WhiskeyLake platform labels Dec 23, 2021
@marc-hb marc-hb changed the title [BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer "sof-logger was already dead" [BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer ("sof-logger was already dead") Dec 23, 2021
@marc-hb
Collaborator

marc-hb commented Dec 23, 2021

The toolchain situation with WHL UPX is almost "reversed" compared to APL UP2: while the corruption on WHL UPX appeared only recently with XCC (see the bisect in the previous comment), it has been happening forever with the Zephyr SDK 0.12.3 on WHL UPX! More precisely, "forever" = Zephyr commit cf0c5e2a1ce5 ("lib/os: Add sys_heap_usable_size()"), which is the oldest Zephyr commit compatible with today's SOF.

Every configuration I tested so far (and there have been many) has behaved deterministically: always corrupting or never corrupting. However predicting which hardware+toolchain+firmware configuration will fail and which will not seems completely random.

The corruption on WHL seems a bit less deterministic than on APL but most of these values seem frequent:

skipping  0x00000009, skipping  0x0000000e, skipping  0x00000230, skipping  0x00000648, skipping  0x00000000, 
skipping  0x00000000, skipping  0x00000000, skipping  0x00000000, skipping  0x00000000, skipping  0x00000000, 
skipping  0x00000000, skipping  0x00000000, skipping  0x00000000, skipping  0x00000000, skipping  0x00000007, 
skipping  0x00000803, 
--- sof/tools/logger/convert.c
+++ sof/tools/logger/convert.c
@@ -903,6 +905,7 @@ static int logger_read(void)
                        /* When the address is not correct, move forward by one DWORD (not
                         * entire struct dma_log)
                         */
+                       fprintf(global_config->out_fd, "skipping  0x%08x, ", dma_log.uid);
                        fseek(global_config->in_fd, -(sizeof(dma_log) - sizeof(uint32_t)),
                              SEEK_CUR);
                        skipped_dwords++;

@andyross
Contributor

Did you ever get a chance to dump the buffer extents (start pointer and length) on a failing configuration?

Also, I'd be really curious whether the problem simply goes away entirely if you use an uncached pointer for the buffer and remove the cache control steps completely. I'll be honest: I think that code is hurting you more than helping. The nature of this buffer being streaming and write only means that the only value you are getting from the L1 data cache is some write combining (i.e. writing a single word is quicker than a round trip to HP-SRAM). And for that you pay with a HAL[1] invalidate and flush sequence for every update. I think it would be faster uncached. It would surely be smaller.

[1] Another thing you could try on Zephyr builds is to use our z_xtensa_cache_*() API instead of the HAL. I really doubt the HAL has actual bugs, but I will promise that ours is smaller and better.
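
To make the comparison concrete, here is a self-contained sketch of the two write strategies being discussed. None of this is actual SOF or Zephyr code: the buffer, the flush helper and the uncached-alias helper are hypothetical stand-ins for the real HP-SRAM trace buffer and cache API.

/* Hypothetical stand-ins: on real cAVS hardware these would be the cached
 * and uncached aliases of the same HP-SRAM trace buffer, and the flush
 * would be a HAL or Zephyr cache call. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t trace_buf[8192];			/* pretend DMA trace ring buffer */

static void cache_flush_range(void *addr, size_t size)
{
	(void)addr;
	(void)size;				/* stub for a dcache writeback */
}

static uint8_t *uncached_alias(uint8_t *p)
{
	return p;				/* stub: same address, no cache */
}

/* Current strategy (simplified): write through the cached mapping, then
 * pay for an explicit writeback of every dirty line. */
static void trace_write_cached(uint32_t off, const void *rec, size_t len)
{
	memcpy(&trace_buf[off], rec, len);
	cache_flush_range(&trace_buf[off], len);
}

/* Suggested experiment: a streaming, write-only buffer can be written
 * through the uncached alias with no cache maintenance at all, at the
 * cost of losing write combining in the L1 data cache. */
static void trace_write_uncached(uint32_t off, const void *rec, size_t len)
{
	memcpy(uncached_alias(trace_buf) + off, rec, len);
}

int main(void)
{
	const uint32_t rec[5] = { 0 };		/* a fake 20-byte log header */

	trace_write_cached(0, rec, sizeof(rec));
	trace_write_uncached(64, rec, sizeof(rec));
	return 0;
}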

@andyross
Contributor

For clarity above: the write of a single word is quicker because it goes into the cache, and the line then gets flushed out in a plausibly faster block transfer, plausibly asynchronously (whether it is or not depends on hardware details I'm not privy to; a naive implementation might very well stall the core while a line is flushed, I have no idea).

Anyway the point is I really doubt the cache is helping. :)

@jsarha
Contributor

jsarha commented Jan 20, 2022

I was able to reproduce the issue also with a gcc-built Zephyr firmware on up-i11:

[ 94204324.589990] ( 1997820.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0x9 <bad uid ptr 0x0000000e> avg 560, max 1608
error: log_entry_address 0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.

Found valid LDC address after skipping 32 bytes (one line uses 20 + 0 to 16 bytes)
[458599188777.999268] ( nan) c0 unknown 0.0 src/schedule/zephyr_ll.c:141 INFO ll task 0x9e056980 dma-trace-task <2b972272-c5b1-4b7e-926f-0fc5cb4c4690> avg 39, max 490

             --- negative DELTA = -458504934273.255 us: wrap, IPC_TRACE, other? ---

[ 295981042.665853] ( 0.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0xbe05d4c0 pipe-task avg 3388, max 3422
[ 1997819.972697] ( 1997820.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0x9e056fd8 agent-work avg 54, max 58

@jsarha
Contributor

jsarha commented Jan 20, 2022

I also think I found a race in src/trace/dma-trace.c trace_work(). The function first copies the log entries with dma_copy_to_host_nowait() and then sends SOF_IPC_TRACE_DMA_POSITION with ipc_msg_send() right after. At least in theory, the Linux side may be faster at reading the buffer than the DSP is at writing it. However, implementing a dma_copy_to_host_wait() by adding DMA_COPY_BLOCKING to the dma_copy() flags and using it in trace_work() did not fix the issue in my case.
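
For readers not familiar with trace_work(), here is a purely illustrative, self-contained sketch of the ordering described above; the stubs below are not the real dma-trace.c functions or signatures, only the copy-then-notify sequence is the point.

/* Illustrative only: stand-ins for the copy/notify pair in trace_work(). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static void start_dma_copy(uint32_t bytes, bool wait_for_completion)
{
	printf("DMA copy of %u bytes%s\n", bytes,
	       wait_for_completion ? " (blocking)" : " (fire and forget)");
}

static void send_position_ipc(uint32_t host_offset)
{
	printf("IPC DMA_POSITION: host write pointer now at %u\n", host_offset);
}

static void trace_work_sketch(uint32_t host_offset, uint32_t avail,
			      bool blocking_copy)
{
	/* 1. push 'avail' new bytes of trace towards the host ring buffer */
	start_dma_copy(avail, blocking_copy);
	/* 2. tell the host where the write pointer now is; with a
	 * non-blocking copy the host may start reading before step 1 has
	 * finished, which is the suspected (but unconfirmed) race. */
	send_position_ipc(host_offset + avail);
}

int main(void)
{
	trace_work_sketch(0, 256, false);  /* current ordering, racy in theory */
	trace_work_sketch(256, 256, true); /* blocking variant that was tried
					    * and did not help */
	return 0;
}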

@jsarha
Contributor

jsarha commented Jan 20, 2022

Also, if my findings (based on some hackish debug code) are correct, the log entries sometimes get corrupted right in the middle of the circular buffer between the DSP and Linux, and not (only) when the circular buffer wraps from the end to the beginning. However, this finding should be taken with a grain of salt, since it looks like my debug-code changes somewhat alter the behavior of the tracing system.

@marc-hb
Collaborator

marc-hb commented Jan 20, 2022

the log entries sometimes get corrupted right in the middle of the circular buffer between the DSP and Linux and not (only) when the circular buffer wraps from the end to the beginning.

Across all my testing I've never seen that. I also analyzed many test logs and the message error: log_entry_address X is not in dictionary range! always appeared at very regular intervals.

@lgirdwood
Member

@jsarha any update here?

@jsarha
Contributor

jsarha commented Jan 31, 2022

Not yet. Still working on a new angle to tackle the issue.

@marc-hb
Collaborator

marc-hb commented Feb 3, 2022

I'm afraid this has been by far the most frequent test failure in recent daily runs, especially when looking at Zephyr results :-(

@jsarha
Contributor

jsarha commented Feb 4, 2022

Here is what I have established so far with my upx-i11 running Zephyr FW.

To make the issue more manageable, I changed the log record size to always be 64 bytes (= the cache line length). On the DSP side the DMA buffer is 8192 bytes long, and on the Linux host side the DMA buffer is 8 times the size of the DSP-side buffer, i.e. 64 KiB.

With 64-byte log entries, every 128th entry is always broken. This happens 7 times in 8 in the middle of the Linux-side buffer, but the broken entry is always the one transferred from the last 64 bytes of the DSP-side buffer. The wrongly transferred entry always has exactly the same content on the Linux side. When reading that same 64-byte entry from the DSP side, before dma_copy() is called for it, the memory content is correct. So somehow dma_copy() does not copy the last cache line of the DMA buffer correctly (or at all) to the Linux host side.
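
The "every 128th entry" and "7 times in 8" figures follow directly from the buffer sizes quoted above; here is a quick sketch of the arithmetic (the constants come from this comment, nothing else is assumed):

#include <stdio.h>

int main(void)
{
	const unsigned entry_size = 64;		/* forced log record size */
	const unsigned dsp_buf = 8192;		/* DSP-side DMA buffer */
	const unsigned host_buf = 8 * dsp_buf;	/* Linux-side buffer, 64 KiB */

	unsigned per_dsp_wrap = dsp_buf / entry_size;		/* 128 */
	unsigned per_host_wrap = host_buf / entry_size;		/* 1024 */
	unsigned dsp_wraps_per_host_wrap = per_host_wrap / per_dsp_wrap; /* 8 */

	printf("DSP buffer wraps every %u entries, so every %uth entry comes "
	       "from its last cache line\n", per_dsp_wrap, per_dsp_wrap);
	printf("host buffer holds %u entries (%u DSP wraps), so %u of every "
	       "%u broken entries land in the middle of the host buffer\n",
	       per_host_wrap, dsp_wraps_per_host_wrap,
	       dsp_wraps_per_host_wrap - 1, dsp_wraps_per_host_wrap);
	return 0;
}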

The DMA_GW version used on this setup does not pass any parameters to dma_copy() other than the size, so the error must be in the DMA initialization code. I have been scrutinizing the DMA initialization code for a while now, but it is a bit of a task to fully understand what is going on there.

Feel free to ping me for details, if you are interested.

@miRoox

miRoox commented Feb 11, 2022

In today's daily test (10105), this issue still happened on TGLH_RVP_NOCODEC_ZEPHYR and APL_UP2_NOCODEC_ZEPHYR.

@miRoox miRoox reopened this Feb 11, 2022
@marc-hb
Collaborator

marc-hb commented Feb 11, 2022

@miRoox I looked at these test results. The sof-logger dies too, but immediately and with a different error message. In this bug sof-logger was dying only after many error: log_entry_address 0x.... is not in dictionary range! warnings, which I don't see in 10105.

I will spend more time scanning the logs to make sure, but so far I think this is a different issue (which was shadowed by this one).

@miRoox

miRoox commented Feb 11, 2022

@marc-hb thanks for your correction, now the DMA trace shows:

Found valid LDC address after skipping 12 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid filename length 31304 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x23006) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

Perhaps this is a different issue.

@marc-hb
Collaborator

marc-hb commented Feb 11, 2022

I filed follow-up issue #5345 for the (CML and non-Zephyr) error: log_entry_address 0x___ is not in dictionary range warning.

@marc-hb marc-hb closed this as completed Feb 11, 2022
@marc-hb
Collaborator

marc-hb commented Feb 11, 2022

@miRoox could you please file yet another issue for the immediate failures on TGL?

Test PR #5340 makes the sof-logger stricter and demonstrates an easy way to reproduce.

@marc-hb
Collaborator

marc-hb commented Feb 12, 2022

could you please file yet another issue for the immediate failures on TGL?

I did it in a new issue.
