[BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer ("sof-logger was already dead") #5120
Comments
There will never be an error from the kernel or from the firmware when using the wrong .ldc file. The only check is in
There is no .ldc checksum with Zephyr right now; I'm working on it. EDIT: fix submitted in #5129, please review.
I also recommend never having any .ldc file in
I took a closer look at the latest daily test results and I found random trace corruption across ALL tests (and all platforms), even in the PASSing test runs! Most of the time the sof-logger can recover from it and the tests PASS; sometimes there is just too much DMA trace corruption, the sof-logger gives up and the test fails. It seems to happen at random places, not at any particular point in time. EDIT: actually, it seems to happen at regular intervals? Trying to reproduce now.
I can finally reproduce on APL UP2, but only with XCC. No reproduction with the Zephyr SDK 0.13.2. EDIT: confirmed 100%. All other things strictly identical, switching back and forth between the two toolchains switches between 0% reproduction and 100% reproduction (after only a couple of minutes). Again: very often the sof-logger can recover from the corruption, so don't wait for the test to fail but keep an eye on the DMA trace like this:
EDIT: no need to run any test; regular logging is enough, see below. The more logs, the faster the reproduction.
I had a look at check-playback-100s on APL Zephyr in many recent daily test results. They all have "periodic DMA trace corruption" from the moment I enabled sof-logger in Zephyr tests (daily 8417, November 23rd). What's recent is only how much worse it became; until now the sof-logger was merely skipping the corruption. Every time I saw periodic DMA trace corruption with Zephyr, I checked the logs for XTOS on the same APL with the same versions of everything. There was never any DMA trace corruption with XTOS, only with Zephyr.
Actually, not all platforms. It seems to affect only WHL and APL but not TGL. There are some logging failures with TGL, but no periodic DMA trace corruption like this. It seems to have appeared a bit more recently on WHL.
@lyakh got the right intuition (thx!): the reason why the corruption is periodic in "stable" tests is that it happens exactly at the end of the ring buffer / wrapping time, every time. In my testing the corruption was either never happening (e.g. with the Zephyr SDK) or always happening (e.g. with XCC) at the end of the ring buffer. Instead of the end of the ring buffer, some corrupted data is sent. Interestingly:
skipping 0x000e0009, skipping 0x06480230, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x08030007, skipping 0x00000000,

For reference: sizeof(header) = 20 bytes = 5 words; 4 arguments = 16 bytes = 4 words.
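To make those sizes concrete, here is a minimal sketch of the entry layout they imply. It is only an illustration of the arithmetic; the real structure is defined in the SOF headers and the field names below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration only: a trace entry as implied by the sizes above, i.e. a
 * 20-byte (5-word) header followed by 0 to 4 32-bit arguments. A maximal
 * entry is therefore 20 + 16 = 36 bytes = 9 dwords.
 */
struct dma_log_sketch {
	uint32_t uid;            /* address looked up in the .ldc file    */
	uint32_t header_rest[4]; /* rest of the header: 5 * 4 = 20 bytes  */
	uint32_t args[4];        /* 0 to 4 arguments: up to 16 more bytes */
};

static_assert(offsetof(struct dma_log_sketch, args) == 20, "5-word header");
```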
Some data mining in the test results (i.e., large
In case you wondered, it's not possible to revert just this commit:
It is possible to revert the entire series but that does make the corruption go away. I think this commit is just a trigger/messenger of some subtle caching issue; which subtle caching issue, I have no idea.
The toolchain situation with WHL UPX is almost "reversed" compared to APL UP2: while the corruption on WHL UPX appeared only recently with XCC (see the bisect in the previous comment), it has been happening since forever with the Zephyr SDK 0.12.3 on WHL UPX! More precisely, "forever" = zephyr commit cf0c5e2a1ce5. Every configuration I tested so far (and there have been many) has behaved deterministically: always corrupting or never corrupting. However, predicting which hardware+toolchain+firmware configuration will fail and which will not seems completely random. The corruption on WHL seems a bit less deterministic than on APL, but most of these values seem frequent:
--- sof/tools/logger/convert.c
+++ sof/tools/logger/convert.c
@@ -903,6 +905,7 @@ static int logger_read(void)
/* When the address is not correct, move forward by one DWORD (not
* entire struct dma_log)
*/
+ fprintf(global_config->out_fd, "skipping 0x%08x, ", dma_log.uid);
fseek(global_config->in_fd, -(sizeof(dma_log) - sizeof(uint32_t)),
SEEK_CUR);
skipped_dwords++;
Did you ever get a chance to dump the buffer extents (start pointer and length) on a failing configuration? Also, I'd be really curious whether the problem simply goes away entirely if you use an uncached pointer for the buffer and remove the cache control steps completely. I'll be honest: I think that code is hurting you more than helping. The nature of this buffer being streaming and write only means that the only value you are getting from the L1 data cache is some write combining (i.e. writing a single word is quicker than a round trip to HP-SRAM). And for that you pay with a HAL[1] invalidate and flush sequence for every update. I think it would be faster uncached. It would surely be smaller. [1] Another thing you could try on Zephyr builds is to use our z_xtensa_cache_*() API instead of the HAL. I really doubt the HAL has actual bugs, but I will promise that ours is smaller and better.
For clarity above: the write of a single word is quicker because it goes into the cache and the line then gets flushed out in a plausibly faster block transfer, plausibly asynchronously (whether it is or not depends on hardware details I'm not privy to, a naive implementation might very well stall the core while a line is flushed, I have no idea). Anyway the point is I really doubt the cache is helping. :)
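A minimal sketch of the two approaches being compared, assuming the Xtensa HAL and Zephyr cache routines mentioned above; the buffer pointers, the way the uncached alias is obtained, and everything else below are placeholders, not the actual SOF trace code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Current approach: write the entry through the cached mapping, then
 * write the dirty line(s) back to HP-SRAM so the DMA engine / host can
 * see it.
 */
static void trace_write_cached(uint8_t *cached_buf, const void *entry,
			       size_t len)
{
	memcpy(cached_buf, entry, len);
	/* e.g. xthal_dcache_region_writeback(cached_buf, len); or, on
	 * Zephyr, z_xtensa_cache_flush(cached_buf, len);
	 */
}

/* Suggested experiment: write through an uncached pointer to the same
 * memory and drop the cache maintenance entirely; every store goes
 * straight to HP-SRAM, at the cost of losing write combining.
 */
static void trace_write_uncached(uint8_t *uncached_buf, const void *entry,
				 size_t len)
{
	memcpy(uncached_buf, entry, len);
}
```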
I was able to reproduce the issue also with a gcc-built Zephyr firmware on up-i11:

[ 94204324.589990] ( 1997820.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0x9 <bad uid ptr 0x0000000e> avg 560, max 1608

Found a valid LDC address after skipping 32 bytes (one line uses 20 + 0 to 16 bytes):

[ 295981042.665853] ( 0.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0xbe05d4c0 pipe-task avg 3388, max 3422
I also think I found a race in src/trace/dma-trace.c trace_work(). The function first copies the log entries with dma_copy_to_host_nowait() and then sends SOF_IPC_TRACE_DMA_POSITION with ipc_msg_send() right after. At least in theory, the Linux side may be faster in reading the buffer than the DSP is in writing it. However, implementing a dma_copy_to_host_wait() by adding DMA_COPY_BLOCKING to the dma_copy() flags and using it in trace_work() did not fix the issue in my case.
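For reference, a minimal sketch of the ordering that was tried here. Everything below is a placeholder illustrating the idea only; the real implementation lives in src/trace/dma-trace.c, and the only names taken from the comment above are DMA_COPY_BLOCKING and the position-update IPC sent after the copy.

```c
#include <stdint.h>

#define DMA_COPY_BLOCKING  (1u << 0)  /* placeholder for the real flag */

/* Placeholder prototypes standing in for the real DMA copy and IPC APIs. */
int dma_copy_sketch(void *host_buf, const void *local_buf, uint32_t size,
		    uint32_t flags);
int ipc_position_update_sketch(uint32_t host_offset);

static int trace_work_sketch(void *host_buf, const void *local_buf,
			     uint32_t size, uint32_t host_offset)
{
	int ret;

	/* Copy the new log entries and wait for the DMA transfer to finish,
	 * so the host cannot read the ring buffer before the data landed.
	 */
	ret = dma_copy_sketch(host_buf, local_buf, size, DMA_COPY_BLOCKING);
	if (ret < 0)
		return ret;

	/* Only after the copy has completed, publish the new write position
	 * (the real code sends SOF_IPC_TRACE_DMA_POSITION here).
	 */
	return ipc_position_update_sketch(host_offset + size);
}
```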
Also, if my findings - based on some hackish debug code - are correct, the log entries sometimes get corrupted right in the middle of the circular buffer between the DSP and Linux, and not (only) when the circular buffer wraps from the end to the beginning. However, this finding should be taken with a grain of salt, since it looks like my debug code changes are somewhat changing the behavior of the tracing system.
Across all my testing I've never seen that. I also analyzed many test logs and the message
@jsarha any update here?
Not yet. Still working on a new angle to tackle the issue.
I'm afraid this has been by far the most frequent test failure in recent daily runs, especially when looking at Zephyr results :-(
Here is what I have established so far with my upx-i11 running Zephyr FW. I changed the log record size to always be 64 bytes (= the cache line length) to make the issue more manageable. On the DSP side the DMA buffer is 8192 bytes long, and on the Linux host side the DMA buffer is 8 times the size of the DSP side buffer, i.e. 64k.

With 64-byte log entries, every 128th entry is always broken. This happens 7 times out of 8 in the middle of the Linux side buffer, but the entry is always transferred from the last 64 bytes of the DSP side buffer. The wrongly transferred entry always has exactly the same content on the Linux side. When reading that same 64-byte entry from the DSP side, before dma_copy() is called for it, the memory content is correct. So somehow dma_copy() does not copy the last cache line of the DMA buffer correctly (or at all) to the Linux host side.

The DMA_GW version used on this setup does not pass any parameters to dma_copy() other than the size, so the error must be in the DMA initialization code. I have been scrutinizing the DMA initialization code for a while now, but it is a bit of a task to fully understand what is going on there. Feel free to ping me for details if you are interested.
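A quick back-of-the-envelope check of those numbers; the sizes below are the ones quoted above, not read from the firmware.

```c
#include <stdio.h>

int main(void)
{
	const unsigned entry    = 64;           /* forced log record size    */
	const unsigned dsp_buf  = 8192;         /* DSP-side DMA buffer       */
	const unsigned host_buf = 8 * dsp_buf;  /* Linux-side buffer, 64 KiB */

	/* 8192 / 64 = 128 entries per DSP buffer, so "the last 64 bytes of
	 * the DSP side buffer" is every 128th entry.
	 */
	printf("entries per DSP buffer: %u\n", dsp_buf / entry);

	/* Where those entries land in the host buffer: 7 out of 8 fall in
	 * the middle, and only 1 out of 8 coincides with the end of the
	 * host buffer.
	 */
	for (unsigned off = dsp_buf - entry; off < host_buf; off += dsp_buf)
		printf("host offset 0x%05x%s\n", off,
		       off + entry == host_buf ? " (end of host buffer)" : "");

	return 0;
}
```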
Let's test DMA corruption bug thesofproject#5120 for real.

Signed-off-by: Marc Herbert <[email protected]>
In today's daily test (10105), this issue still happened on TGLH_RVP_NOCODEC_ZEPHYR and APL_UP2_NOCODEC_ZEPHYR.
@miRoox I looked at these test results. The sof-logger dies too, but immediately and with a different error message. In this bug the sof-logger was dying only after many. I will spend more time scanning the logs to make sure, but so far I think this is a different issue (which was shadowed by this one).
@marc-hb thanks for your correction, now the DMA trace shows:
Perhaps this is a different issue.
I filed follow-up issue #5345 for the (CML and non-Zephyr)
I did it in new
EDIT: this is caused by DMA trace corruption, mostly with XCC and mostly with Zephyr (but not just). See below - @marc-hb
Describe the bug
We observed this error on some Zephyr platforms recently. The console log shows that sof-logger was already dead, but there are no obvious errors in dmesg or the DMA trace. Since the error trace is not available on Zephyr platforms, we don't know what caused sof-logger to stop working.
The DMA trace shows:
To Reproduce
e.g. on a WHL Zephyr platform:
$ ./check-playback.sh -d 1 -l 1 -r 50
Reproduction Rate
Almost 100%
Expected behavior
sof-logger keeps running for the whole test; the DMA trace is not corrupted and the test passes.
Impact
sof-logger stops working
Environment
Kernel Branch: topic/sof-dev
Kernel Commit: ac3b3338-1
SOF Branch: main
SOF Commit: afac44af5f49-2
Zephyr Commit: fef2e30b7f83
Screenshots or console output
**dmesg**
Test ID: 9007, Model name: WHL_UPEXT_HDA_ZEPHYR