[BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer ("sof-logger was already dead") #5120
Comments
There will never be an error from the kernel or from the firmware when using the wrong .ldc file. The only check is in
There is no .ldc checksum with Zephyr right now; I'm working on it. EDIT: fix submitted in #5129, please review.
I also recommend never having any .ldc file in
I took a closer look at the latest daily test results and I found random trace corruption across ALL tests (and all platforms), even in the PASSing test runs! Most of the time the sof-logger can recover from it and the tests PASS; sometimes there is just too much DMA trace corruption, the sof-logger gives up and the test fails. It seems to happen at random places, not at any particular point in time. EDIT: actually, it seems to happen at regular intervals? Trying to reproduce now.
I can finally reproduce on APL UP2, but only with XCC. No reproduction with the Zephyr SDK 0.13.2. EDIT: confirmed 100%. All other things strictly identical, switching back and forth between the two toolchains switches between 0% reproduction and 100% reproduction (after only a couple of minutes). Again: very often the sof-logger can recover from the corruption, so don't wait for the test to fail but keep an eye on the DMA trace like this:
EDIT: no need to run any test; regular logging is enough, see below. The more logs, the faster the reproduction.
I had a look at check-playback-100s on APL Zephyr in many recent daily test results. They all have "periodic DMA trace corruption" from the moment I enabled sof-logger in Zephyr tests (daily 8417, November 23rd). What's recent is only how much worse it became; until now the sof-logger was merely skipping the corruption. Every time I saw periodic DMA trace corruption with Zephyr, I checked the logs for XTOS on the same APL with the same versions of everything. There was never any DMA trace corruption with XTOS, only with Zephyr.
Actually, not all platforms. It seems to affect only WHL and APL but not TGL. There are some logging failures with TGL, but no periodic DMA trace corruption like this. It seems to have appeared a bit more recently on WHL.
@lyakh got the right intuition (thx!): the reason why the corruption is periodic in "stable" tests is that it happens exactly at the end of the ring buffer / wrapping time, every time. In my testing the corruption was either never happening (e.g. with the Zephyr SDK) or always happening (e.g. with XCC) at the end of the ring buffer. Instead of the end of the ring buffer, some corrupted data is sent. Interestingly:
skipping 0x000e0009, skipping 0x06480230, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x00000000, skipping 0x08030007, skipping 0x00000000,

For reference: sizeof(header) = 20 bytes = 5 words; 4 arguments = 16 bytes = 4 words.
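To make those sizes concrete, here is a minimal sketch of the entry layout they imply. It is only an illustration of the arithmetic; the real structure is defined in the SOF headers and the field names below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration only: a trace entry as implied by the sizes above, i.e. a
 * 20-byte (5-word) header followed by 0 to 4 32-bit arguments. A maximal
 * entry is therefore 20 + 16 = 36 bytes = 9 dwords.
 */
struct dma_log_sketch {
	uint32_t uid;            /* address looked up in the .ldc file    */
	uint32_t header_rest[4]; /* rest of the header: 5 * 4 = 20 bytes  */
	uint32_t args[4];        /* 0 to 4 arguments: up to 16 more bytes */
};

static_assert(offsetof(struct dma_log_sketch, args) == 20, "5-word header");
```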
Some data mining in the test results (i.e., large
In case you wondered, it's not possible to revert just this commit:
It is possible to revert the entire series but that does make the corruption go away. I think this commit is just a trigger/messenger of some subtle caching issue; which subtle caching issue, I have no idea.
The toolchain situation with WHL UPX is almost "reversed" compared to APL UP2: while the corruption on WHL UPX appeared only recently with XCC (see the bisect in the previous comment), it has been happening since forever with the Zephyr SDK 0.12.3 on WHL UPX! More precisely, "forever" = zephyr commit cf0c5e2a1ce5. Every configuration I tested so far (and there have been many) has behaved deterministically: always corrupting or never corrupting. However, predicting which hardware+toolchain+firmware configuration will fail and which will not seems completely random. The corruption on WHL seems a bit less deterministic than on APL, but most of these values seem frequent:
--- sof/tools/logger/convert.c
+++ sof/tools/logger/convert.c
@@ -903,6 +905,7 @@ static int logger_read(void)
/* When the address is not correct, move forward by one DWORD (not
* entire struct dma_log)
*/
+ fprintf(global_config->out_fd, "skipping 0x%08x, ", dma_log.uid);
fseek(global_config->in_fd, -(sizeof(dma_log) - sizeof(uint32_t)),
SEEK_CUR);
skipped_dwords++;
Did you ever get a chance to dump the buffer extents (start pointer and length) on a failing configuration? Also, I'd be really curious whether the problem simply goes away entirely if you use an uncached pointer for the buffer and remove the cache control steps completely. I'll be honest: I think that code is hurting you more than helping. The nature of this buffer being streaming and write only means that the only value you are getting from the L1 data cache is some write combining (i.e. writing a single word is quicker than a round trip to HP-SRAM). And for that you pay with a HAL[1] invalidate and flush sequence for every update. I think it would be faster uncached. It would surely be smaller. [1] Another thing you could try on Zephyr builds is to use our z_xtensa_cache_*() API instead of the HAL. I really doubt the HAL has actual bugs, but I will promise that ours is smaller and better.
For clarity above: the write of a single word is quicker because it goes into the cache and the line then gets flushed out in a plausibly faster block transfer, plausibly asynchronously (whether it is or not depends on hardware details I'm not privy to, a naive implementation might very well stall the core while a line is flushed, I have no idea). Anyway the point is I really doubt the cache is helping. :)
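A minimal sketch of the two approaches being compared, assuming the Xtensa HAL and Zephyr cache routines mentioned above; the buffer pointers, the way the uncached alias is obtained, and everything else below are placeholders, not the actual SOF trace code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Current approach: write the entry through the cached mapping, then
 * write the dirty line(s) back to HP-SRAM so the DMA engine / host can
 * see it.
 */
static void trace_write_cached(uint8_t *cached_buf, const void *entry,
			       size_t len)
{
	memcpy(cached_buf, entry, len);
	/* e.g. xthal_dcache_region_writeback(cached_buf, len); or, on
	 * Zephyr, z_xtensa_cache_flush(cached_buf, len);
	 */
}

/* Suggested experiment: write through an uncached pointer to the same
 * memory and drop the cache maintenance entirely; every store goes
 * straight to HP-SRAM, at the cost of losing write combining.
 */
static void trace_write_uncached(uint8_t *uncached_buf, const void *entry,
				 size_t len)
{
	memcpy(uncached_buf, entry, len);
}
```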
I was able to reproduce the issue also with a gcc-built Zephyr firmware on up-i11:

[ 94204324.589990] ( 1997820.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0x9 <bad uid ptr 0x0000000e> avg 560, max 1608

Found a valid LDC address after skipping 32 bytes (one line uses 20 + 0 to 16 bytes):

[ 295981042.665853] ( 0.000000) c0 zll-schedule src/schedule/zephyr_ll.c:141 INFO ll task 0xbe05d4c0 pipe-task avg 3388, max 3422
I also think I found a race in src/trace/dma-trace.c trace_work(). The function first copies the log entries with dma_copy_to_host_nowait() and then sends SOF_IPC_TRACE_DMA_POSITION with ipc_msg_send() right after. At least in theory, the Linux side may be faster in reading the buffer than the DSP is in writing it. However, implementing a dma_copy_to_host_wait() by adding DMA_COPY_BLOCKING to the dma_copy() flags and using it in trace_work() did not fix the issue in my case.
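For reference, a minimal sketch of the ordering that was tried here. Everything below is a placeholder illustrating the idea only; the real implementation lives in src/trace/dma-trace.c, and the only names taken from the comment above are DMA_COPY_BLOCKING and the position-update IPC sent after the copy.

```c
#include <stdint.h>

#define DMA_COPY_BLOCKING  (1u << 0)  /* placeholder for the real flag */

/* Placeholder prototypes standing in for the real DMA copy and IPC APIs. */
int dma_copy_sketch(void *host_buf, const void *local_buf, uint32_t size,
		    uint32_t flags);
int ipc_position_update_sketch(uint32_t host_offset);

static int trace_work_sketch(void *host_buf, const void *local_buf,
			     uint32_t size, uint32_t host_offset)
{
	int ret;

	/* Copy the new log entries and wait for the DMA transfer to finish,
	 * so the host cannot read the ring buffer before the data landed.
	 */
	ret = dma_copy_sketch(host_buf, local_buf, size, DMA_COPY_BLOCKING);
	if (ret < 0)
		return ret;

	/* Only after the copy has completed, publish the new write position
	 * (the real code sends SOF_IPC_TRACE_DMA_POSITION here).
	 */
	return ipc_position_update_sketch(host_offset + size);
}
```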
Also, if my findings - based on some hackish debug code - are correct, the log entries sometimes get corrupted right in the middle of the circular buffer between the DSP and Linux, and not (only) when the circular buffer wraps from the end to the beginning. However, this finding should be taken with a grain of salt, since it looks like my debug code changes are somewhat changing the behavior of the tracing system.
Across all my testing I've never seen that. I also analyzed many test logs and the message
@jsarha any update here?
Not yet. Still working on a new angle to tackle the issue.
I'm afraid this has been by far the most frequent test failure in recent daily runs, especially when looking at Zephyr results :-(
Here is what I have established so far with my upx-i11 running Zephyr FW. I changed the log record size to always be 64 bytes (= the cache line length) to make the issue more manageable. On the DSP side the DMA buffer is 8192 bytes long, and on the Linux host side the DMA buffer is 8 times the size of the DSP side buffer, i.e. 64k.

With 64-byte log entries, every 128th entry is always broken. This happens 7 times out of 8 in the middle of the Linux side buffer, but the entry is always transferred from the last 64 bytes of the DSP side buffer. The wrongly transferred entry always has exactly the same content on the Linux side. When reading that same 64-byte entry from the DSP side, before dma_copy() is called for it, the memory content is correct. So somehow dma_copy() does not copy the last cache line of the DMA buffer correctly (or at all) to the Linux host side.

The DMA_GW version used on this setup does not pass any parameters to dma_copy() other than the size, so the error must be in the DMA initialization code. I have been scrutinizing the DMA initialization code for a while now, but it is a bit of a task to fully understand what is going on there. Feel free to ping me for details if you are interested.
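A quick back-of-the-envelope check of those numbers; the sizes below are the ones quoted above, not read from the firmware.

```c
#include <stdio.h>

int main(void)
{
	const unsigned entry    = 64;           /* forced log record size    */
	const unsigned dsp_buf  = 8192;         /* DSP-side DMA buffer       */
	const unsigned host_buf = 8 * dsp_buf;  /* Linux-side buffer, 64 KiB */

	/* 8192 / 64 = 128 entries per DSP buffer, so "the last 64 bytes of
	 * the DSP side buffer" is every 128th entry.
	 */
	printf("entries per DSP buffer: %u\n", dsp_buf / entry);

	/* Where those entries land in the host buffer: 7 out of 8 fall in
	 * the middle, and only 1 out of 8 coincides with the end of the
	 * host buffer.
	 */
	for (unsigned off = dsp_buf - entry; off < host_buf; off += dsp_buf)
		printf("host offset 0x%05x%s\n", off,
		       off + entry == host_buf ? " (end of host buffer)" : "");

	return 0;
}
```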
Let's test DMA corruption bug thesofproject#5120 for real.

Signed-off-by: Marc Herbert <[email protected]>
In today's daily test (10105), this issue still happened on TGLH_RVP_NOCODEC_ZEPHYR and APL_UP2_NOCODEC_ZEPHYR.
@miRoox I looked at these test results. The sof-logger dies too, but immediately and with a different error message. In this bug the sof-logger was dying only after many. I will spend more time scanning the logs to make sure, but so far I think this is a different issue (which was shadowed by this one).
@marc-hb thanks for your correction, now the DMA trace shows:
Perhaps this is a different issue.
I filed follow-up issue #5345 for the (CML and non-Zephyr)
I did it in new
EDIT: this is caused by DMA trace corruption, mostly with XCC and mostly with Zephyr (but not just). See below - @marc-hb
Describe the bug
We observed this error on some Zephyr platforms recently. The console log shows that sof-logger was already dead, but there are no obvious errors in dmesg or the DMA trace. Since the error trace is not available on Zephyr platforms, we don't know what caused sof-logger to stop working.
The DMA trace shows:
To Reproduce
e.g. on a WHL Zephyr platform:
$ ./check-playback.sh -d 1 -l 1 -r 50
Reproduction Rate
Almost 100%
Expected behavior
sof-logger keeps running for the whole test; the DMA trace is not corrupted and the test passes.
Impact
sof-logger stops working
Environment
Kernel Branch: topic/sof-dev
Kernel Commit: ac3b3338-1
SOF Branch: main
SOF Commit: afac44af5f49-2
Zephyr Commit: fef2e30b7f83
Screenshots or console output
**dmesg**
Test ID: 9007, Model name: WHL_UPEXT_HDA_ZEPHYR