
Bad codegen for decomposed word loads #85

Closed
nwf opened this issue Dec 21, 2024 · 17 comments

@nwf
Member

nwf commented Dec 21, 2024

BLAKE2s's portable implementation does what a lot of code that needs to care about endianness does, and decomposes word loads into bytes (see https://github.com/BLAKE2/libb2/blob/643decfbf8ae600c3387686754d74c84144950d1/src/blake2-impl.h#L37-L42 in particular) in hopes that the compiler puts it back together if it can.
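For concreteness, the idiom in question looks roughly like this (a paraphrase of the linked libb2 code, not a verbatim copy):

```c
#include <stdint.h>

/* Paraphrase of the libb2 idiom: assemble a little-endian 32-bit
 * word from four byte loads, in the hope that the compiler fuses
 * them into a single word load when it can prove that is safe. */
static uint32_t load32(const void *src)
{
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = *p++;
    w |= (uint32_t)(*p++) << 8;
    w |= (uint32_t)(*p++) << 16;
    w |= (uint32_t)(*p++) << 24;
    return w;
}
```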

Unfortunately, we seem not to, and instead generate a pretty literal transcription:

Disassembly of section .text.load32:

00000000 <load32>:
;   w |= ( uint32_t )( *p++ ) <<  8;
       0: 83 45 15 00   clbu    a1, 1(ca0)
;   uint32_t w = *p++;
       4: 03 46 05 00   clbu    a2, 0(ca0)
;   w |= ( uint32_t )( *p++ ) << 16;
       8: 83 46 25 00   clbu    a3, 2(ca0)
;   w |= ( uint32_t )( *p++ ) << 24;
       c: 03 05 35 00   clb     a0, 3(ca0)
;   w |= ( uint32_t )( *p++ ) <<  8;
      10: a2 05         slli    a1, a1, 8
      12: d1 8d         or      a1, a1, a2
;   w |= ( uint32_t )( *p++ ) << 16;
      14: c2 06         slli    a3, a3, 16
;   w |= ( uint32_t )( *p++ ) << 24;
      16: 62 05         slli    a0, a0, 24
;   w |= ( uint32_t )( *p++ ) << 16;
      18: 55 8d         or      a0, a0, a3
;   w |= ( uint32_t )( *p++ ) << 24;
      1a: 4d 8d         or      a0, a0, a1
;   return w;
      1c: 82 80         cret

That, in turn, seems to raise the function above inlining thresholds, and so it gets called instead, and the whole thing is just a bit sad.

FWIW, LLVM is 0fa9bc5 with #48 still sitting atop, locally; specifically...

$ /cheri/out/cheriot/sdk/bin/clang --version                                                                                                                       
clang version 17.0.0 (https://github.com/CHERIoT-Platform/llvm-project.git 63ed0bf791e19b3bacabb5bb08b7f29790604c11)
Target: powerpc64le-unknown-linux-gnu
Thread model: posix
InstalledDir: /cheri/out/cheriot/sdk/bin
@v01dXYZ

v01dXYZ commented Dec 22, 2024

It seems this optimisation is available for x86-64 and AArch64, but not for ARMv7 or RISC-V 32/64. The relevant pass is AggressiveInstCombine.

define i32 @load32(i8* %0) {
  %2 = getelementptr inbounds i8, i8* %0, i64 1
  %3 = load i8, i8* %0, align 1
  %4 = zext i8 %3 to i32

  %5 = getelementptr inbounds i8, i8* %0, i64 2
  %6 = load i8, i8* %2, align 1
  %7 = zext i8 %6 to i32
  %8 = shl nuw nsw i32 %7, 8

  %9 = or i32 %8, %4

  %10 = getelementptr inbounds i8, i8* %0, i64 3
  %11 = load i8, i8* %5, align 1
  %12 = zext i8 %11 to i32
  %13 = shl nuw nsw i32 %12, 16

  %14 = or i32 %9, %13

  %15 = load i8, i8* %10, align 1
  %16 = zext i8 %15 to i32
  %17 = shl nuw i32 %16, 24

  %18 = or i32 %14, %17

  ret i32 %18
}

@v01dXYZ

v01dXYZ commented Dec 22, 2024

It seems it's because RISC-V doesn't, by default, report unaligned i32 loads as fast. There is a target feature that allows it, though: unaligned-scalar-mem. I don't know whether there is a way to indicate the alignment in LLVM IR, such as an aligned(...) / align ... annotation, that the pass would use.
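For illustration, that target feature surfaces in IR as a function attribute; a minimal sketch (the attribute-group number and the elided function body are illustrative):

```llvm
; With this target feature, passes that ask the target whether
; unaligned scalar accesses are fast (as AggressiveInstCombine does)
; may merge the four byte loads into one i32 load.
define i32 @load32(i8* %0) #0 {
  %2 = load i8, i8* %0, align 1
  %3 = zext i8 %2 to i32
  ; ... remaining byte loads, shifts, and ors as above ...
  ret i32 %3
}

attributes #0 = { "target-features"="+unaligned-scalar-mem" }
```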

@v01dXYZ

v01dXYZ commented Dec 22, 2024

At the LLVM level, one possible patch is to add a check for whether the load's pointer operand is a function argument with an alignment attribute.

Below, API reminder:

if (auto *ArgPtr = dyn_cast<Argument>(LI1->getPointerOperand())) {
  if (MaybeAlign PtrMaybeAlign = ArgPtr->getParamAlign()) {
    Align PtrAlign = *PtrMaybeAlign;
    // Alignment is in bytes; the load type's width is in bits.
    if (PtrAlign.value() * 8 >= WideType->getScalarSizeInBits()) {
      // fold the shl/or chain into a single wide load
    }
  }
}

@resistor
Collaborator

#87 fixes the codegen, but breaks CheriotRTOS. I need to figure out why.

@resistor
Collaborator

The breakage is because the SAIL-based simulator disables misaligned memory accesses by default. See CHERIoT-Platform/cheriot-sail#92

@resistor
Collaborator

Enabling this attribute incidentally improves code size on the cheriot-rtos test suite by 0.2%.

@davidchisnall

Can you check on the Ibex sim (in the dev container)? I believe it supports unaligned access (for everything that isn’t a capability) and there’s a bug in the sail.

@resistor
Collaborator

> Can you check on the Ibex sim (in the dev container)? I believe it supports unaligned access (for everything that isn’t a capability) and there’s a bug in the sail.

Now that I've figured out how to do that, the rtos testsuite passes with misaligned loads enabled on the Ibex sim.

@nwf
Member Author

nwf commented Dec 23, 2024

Hm. I'm not sure I meant to commit us to requiring misaligned loads! I think (but will check in more detail when I'm back home later today) that the BLAKE2 loads are always 32-bit aligned and perhaps we're just losing that information in C.

Can we restrict the instruction fusion to the case that we do know that the result will be well-aligned?

@davidchisnall

RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

@nwf
Member Author

nwf commented Dec 23, 2024

> RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

Well, AIUI they'd claim that a compliant implementation could always do 16-bit-sized and -aligned loads and maybe issue several in rapid succession for uncompressed instructions... I don't think there's anything in the architecture that would require supporting, say, an 8-bit-aligned 32-bit load.

@nwf
Member Author

nwf commented Dec 23, 2024

At a glance, blake2s-ref.c uses load32 only at well-aligned offsets, assuming the input block itself is well aligned, but unfortunately those buffers are typed as const uint8_t [] and the argument to load32 is a const void *, so none of that reasoning is available to the compiler.

However, even changing the type of load32 to take a uint32_t * doesn't seem to be informative enough for the existing compiler, as perhaps expected; but it would, IMHO, be a justifiable change to get improved codegen.

@resistor
Collaborator

> RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

> Well, AIUI they'd claim that a compliant implementation could always do 16-bit-sized and -aligned loads and maybe issue several in rapid succession for uncompressed instructions... I don't think there's anything in the architecture that would require supporting, say, an 8-bit-aligned 32-bit load.

I believe a compliant RISC-V HW combined system can always support misaligned loads. The difference is whether they're implemented in HW, or generate an exception that can be handled in SW. The question of whether the toolchain should emit them then becomes primarily a performance question.

@v01dXYZ

v01dXYZ commented Dec 24, 2024

@nwf It seems there are two ways to indicate to the compiler that the pointer is 4-byte aligned:

  • a typedef (e.g. aligned_4B_uint8_t) carrying __attribute__((aligned(4))); I can't get the same result without using a typedef. This generates loads with align 4.
  • __builtin_assume_aligned(src, 4), which generates loads with align 4 plus an llvm.assume(true) ["align"(...)].

The RISC-V backend will then optimise them into a single 4-byte load.
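The two approaches can be sketched like this (a minimal sketch; the function and typedef names are illustrative):

```c
#include <stdint.h>

/* 1. A typedef carrying an alignment attribute; the parameter type
 * tells the compiler the incoming pointer is 4-byte aligned. */
typedef uint8_t __attribute__((aligned(4))) aligned_4B_uint8_t;

uint32_t load32_typedef(const aligned_4B_uint8_t *src)
{
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = p[0];
    w |= (uint32_t)p[1] << 8;
    w |= (uint32_t)p[2] << 16;
    w |= (uint32_t)p[3] << 24;
    return w;
}

/* 2. __builtin_assume_aligned, which conveys the same fact at the
 * point of use (and also emits an llvm.assume in the IR). */
uint32_t load32_assume(const void *src)
{
    const uint8_t *p = __builtin_assume_aligned(src, 4);
    uint32_t w = p[0];
    w |= (uint32_t)p[1] << 8;
    w |= (uint32_t)p[2] << 16;
    w |= (uint32_t)p[3] << 24;
    return w;
}
```

Both variants compute the same little-endian word; they differ only in how the alignment fact reaches the optimiser.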

@resistor
Collaborator

@davidchisnall If we want to require misaligned load support, should we write it into the CHERIoT ISA spec?

@davidchisnall

I think it's a good idea for the compiler to default to assuming that they work. I think the trap-and-emulate logic will require more area, as code memory, than the equivalent hardware to make them fast, but I'd like to make sure that Sail defaults to supporting them in the dev container before we merge it in the compiler.

@resistor
Collaborator

resistor commented Jan 8, 2025

This will be fixed by #87, which is waiting on CHERIoT-Platform/cheriot-sail#93.

@resistor resistor closed this as completed Jan 9, 2025