MediaTek DSP memory layout improvements #83292

Open
wants to merge 3 commits into
base: main

andyross
Contributor

Move stacks, .data and .bss from DRAM to SRAM. The SRAM on these devices is limited, so the entire firmware can't live in it. But its read latency from the DSP is roughly 5x lower than DRAM's, so the default linkage should prioritize putting mutable data there.

In fact, even the SRAM is pretty slow (~40 cycle read latency). Thankfully these devices have monster caches, so in practice things usually work out.
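To make the "caches usually save us" point concrete, here is a minimal sketch of the standard average-access-time arithmetic. The function name and the hit rates are illustrative assumptions; only the ~40 cycle uncached SRAM figure and the ~1 cycle cached figure come from this PR.

```c
#include <assert.h>

/*
 * Average load latency in cycles for a given data cache hit rate.
 * hit_cyc is the latency of a cache hit, miss_cyc the latency of a
 * load that has to go out to backing memory. Hypothetical helper,
 * not part of the PR.
 */
static double avg_load_latency(double hit_rate, double hit_cyc, double miss_cyc)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cyc + (1.0 - hit_rate) * miss_cyc;
}
```

For example, `avg_load_latency(0.99, 1.0, 40.0)` models a workload where 99% of loads hit the cache and the rest pay the SRAM latency. The hit rate is a made-up input here; real workloads would need to be measured.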

Add a tiny feature at the top level allowing a new section (".stacks")
for all stack memory (vs. .noinit and/or .user_stacks) so that this
device can place stacks in SRAM for performance.  Longer term we should
come up with a unified plan for "stack placement" such that it can
coexist with USERSPACE and KERNEL_COHERENCE, which also want control
over stack memory at the linker level.  For now, this is very simple.

Signed-off-by: Andy Ross <[email protected]>

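As a rough illustration of what "a dedicated section for stack memory" means at the C level, the sketch below tags a buffer with a GCC/Clang section attribute so that a linker script can route it to a chosen memory region. The macro, buffer, and alignment value are hypothetical and are not the macros this PR touches; only the ".stacks" section name comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment for illustration only. */
#define STACK_ALIGN 16

/*
 * Hypothetical helper: define a stack buffer that the compiler emits
 * into an input section named ".stacks". A linker script can then map
 * that section to whichever memory region it likes (here, SRAM).
 */
#define STACK_IN_SECTION(name, size)                               \
    static uint8_t name[size]                                      \
        __attribute__((section(".stacks"), aligned(STACK_ALIGN), used))

STACK_IN_SECTION(demo_stack, 1024);

/* Initial stack pointer: stacks grow downward, so start at the top. */
static inline void *demo_stack_top(void)
{
    uint8_t *top = demo_stack + sizeof(demo_stack);

    assert(((uintptr_t)top % STACK_ALIGN) == 0);
    return top;
}
```

The C side only names the section; where it ends up is decided entirely by the linker script, which is why a single top-level hook is enough for a board to redirect every stack at once.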
Further cleanup for this architecture.  With .text/.rodata left in
DRAM, use the higher performance memory for what is likely to be the
most used app data.

Signed-off-by: Andy Ross <[email protected]>

Both as a way to validate the "stacks/data in SRAM" feature and for
personal curiosity, I added a quick test case that measures memory
read latency over various linker memory sections, for example:

    START - mem_lat
    Measuring estimated load latency (dcache disabled):
          .data:  40.011 cyc
        .rodata: 224.117 cyc
           .bss:  40.000 cyc
          .text: 222.756 cyc
      __nocache: 221.512 cyc
    Measuring estimated load latency (dcache enabled):
          .data:   1.002 cyc
        .rodata:   1.014 cyc
           .bss:   1.002 cyc
          .text:   1.014 cyc
      __nocache: 222.564 cyc
     PASS - mem_lat in 0.022 seconds

Note that there isn't actually any board-specific code here.  This
should run on any Xtensa device with a cache.  Consider moving
somewhere better.

Signed-off-by: Andy Ross <[email protected]>
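The test source is not shown on this page, so the following is only a sketch of one common way to estimate load latency: walk a chain of dependent loads so that each load must complete before the next can issue, then divide elapsed time by the number of loads. All names are hypothetical, and it uses the standard C `clock()` so that it runs on a host. On a Zephyr target one would presumably read a cycle counter such as `k_cycle_get_32()` instead, which is an assumption rather than something taken from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Link every slot to the one `stride` slots ahead (wrapping), so that
 * walking the chain produces a sequence of dependent loads.
 */
static void chain_init(volatile uint32_t *buf, size_t n, size_t stride)
{
    assert(n > 0 && stride > 0);
    for (size_t i = 0; i < n; i++) {
        buf[i] = (uint32_t)((i + stride) % n);
    }
}

/* Follow the chain for `steps` loads and return the final index. */
static uint32_t chain_walk(const volatile uint32_t *buf, uint32_t start,
                           size_t steps)
{
    uint32_t idx = start;

    while (steps-- > 0) {
        idx = buf[idx]; /* the next address depends on this load */
    }
    return idx;
}

/*
 * Estimated cost per load in clock() ticks. The final index is written
 * to *end_idx so the caller can check that the walk really happened.
 */
static double ticks_per_load(const volatile uint32_t *buf, size_t steps,
                             uint32_t *end_idx)
{
    assert(steps > 0);

    clock_t t0 = clock();
    uint32_t idx = chain_walk(buf, 0, steps);
    clock_t t1 = clock();

    *end_idx = idx;
    return (double)(t1 - t0) / (double)steps;
}
```

To compare sections the way the output above does, the chained buffer would be placed in each section of interest in turn, with the data cache disabled or enabled around the walk. Loop overhead is included in the estimate, so a serious measurement would subtract a baseline.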

eddy1021 left a comment

LGTM from the Google team.

Collaborator

kartben commented Jan 6, 2025

@andyross please rebase / resolve merge conflicts

Collaborator

kartben commented Jan 17, 2025

@andyross a reminder that this needs a rebase

7 participants