
Bad codegen for decomposed word loads #85

Closed
nwf opened this issue Dec 21, 2024 · 17 comments

@nwf
Member

nwf commented Dec 21, 2024

BLAKE2s's portable implementation does what a lot of code that needs to care about endianness does, and decomposes word loads into bytes (see https://github.com/BLAKE2/libb2/blob/643decfbf8ae600c3387686754d74c84144950d1/src/blake2-impl.h#L37-L42 in particular) in hopes that the compiler puts it back together if it can.
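For concreteness, the idiom in question looks roughly like this (a paraphrase of the linked libb2 code, not a verbatim copy):

```c
#include <stdint.h>

/* Paraphrase of the libb2 idiom: assemble a little-endian 32-bit
 * word from four byte loads, in the hope that the compiler fuses
 * them into a single word load when it can prove that is safe. */
static uint32_t load32(const void *src)
{
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = *p++;
    w |= (uint32_t)(*p++) << 8;
    w |= (uint32_t)(*p++) << 16;
    w |= (uint32_t)(*p++) << 24;
    return w;
}
```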

Unfortunately, we seem not to, and instead generate a pretty literal transcription:

Disassembly of section .text.load32:

00000000 <load32>:
;   w |= ( uint32_t )( *p++ ) <<  8;
       0: 83 45 15 00   clbu    a1, 1(ca0)
;   uint32_t w = *p++;
       4: 03 46 05 00   clbu    a2, 0(ca0)
;   w |= ( uint32_t )( *p++ ) << 16;
       8: 83 46 25 00   clbu    a3, 2(ca0)
;   w |= ( uint32_t )( *p++ ) << 24;
       c: 03 05 35 00   clb     a0, 3(ca0)
;   w |= ( uint32_t )( *p++ ) <<  8;
      10: a2 05         slli    a1, a1, 8
      12: d1 8d         or      a1, a1, a2
;   w |= ( uint32_t )( *p++ ) << 16;
      14: c2 06         slli    a3, a3, 16
;   w |= ( uint32_t )( *p++ ) << 24;
      16: 62 05         slli    a0, a0, 24
;   w |= ( uint32_t )( *p++ ) << 16;
      18: 55 8d         or      a0, a0, a3
;   w |= ( uint32_t )( *p++ ) << 24;
      1a: 4d 8d         or      a0, a0, a1
;   return w;
      1c: 82 80         cret

That, in turn, seems to raise the function above inlining thresholds, and so it gets called instead, and the whole thing is just a bit sad.

FWIW, LLVM is 0fa9bc5 with #48 still sitting atop, locally; specifically...

$ /cheri/out/cheriot/sdk/bin/clang --version                                                                                                                       
clang version 17.0.0 (https://github.com/CHERIoT-Platform/llvm-project.git 63ed0bf791e19b3bacabb5bb08b7f29790604c11)
Target: powerpc64le-unknown-linux-gnu
Thread model: posix
InstalledDir: /cheri/out/cheriot/sdk/bin
@v01dXYZ

v01dXYZ commented Dec 22, 2024

It seems this optimisation is available for x86-64 and AArch64, but not for ARMv7 or RISC-V 32/64. The relevant pass is AggressiveInstCombine.

define i32 @load32(i8* %0) {
  %2 = getelementptr inbounds i8, i8* %0, i64 1
  %3 = load i8, i8* %0, align 1
  %4 = zext i8 %3 to i32

  %5 = getelementptr inbounds i8, i8* %0, i64 2
  %6 = load i8, i8* %2, align 1
  %7 = zext i8 %6 to i32
  %8 = shl nuw nsw i32 %7, 8

  %9 = or i32 %8, %4

  %10 = getelementptr inbounds i8, i8* %0, i64 3
  %11 = load i8, i8* %5, align 1
  %12 = zext i8 %11 to i32
  %13 = shl nuw nsw i32 %12, 16

  %14 = or i32 %9, %13

  %15 = load i8, i8* %10, align 1
  %16 = zext i8 %15 to i32
  %17 = shl nuw i32 %16, 24

  %18 = or i32 %14, %17

  ret i32 %18
}

@v01dXYZ

v01dXYZ commented Dec 22, 2024

It seems it's because RISC-V doesn't, by default, report unaligned i32 loads as fast. There is a target feature that allows it, though: unaligned-scalar-mem. I don't know whether there is a way to indicate the alignment in LLVM IR, such as an aligned(...) / align ... annotation, that the pass would use.
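For illustration, that target feature surfaces in IR as a function attribute; a minimal sketch (the attribute-group number and the elided function body are illustrative):

```llvm
; With this target feature, passes that ask the target whether
; unaligned scalar accesses are fast (as AggressiveInstCombine does)
; may merge the four byte loads into one i32 load.
define i32 @load32(i8* %0) #0 {
  %2 = load i8, i8* %0, align 1
  %3 = zext i8 %2 to i32
  ; ... remaining byte loads, shifts, and ors as above ...
  ret i32 %3
}

attributes #0 = { "target-features"="+unaligned-scalar-mem" }
```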

@v01dXYZ

v01dXYZ commented Dec 22, 2024

At the LLVM level, one possible patch is to add a check for whether the load's pointer operand is a function argument with an alignment attribute.

Below, API reminder:

if (auto *ArgPtr = dyn_cast<Argument>(LI1->getPointerOperand())) {
  if (MaybeAlign PtrMaybeAlign = ArgPtr->getParamAlign()) {
    Align PtrAlign = *PtrMaybeAlign;
    // Alignment is in bytes; the load type's width is in bits.
    if (PtrAlign.value() * 8 >= WideType->getScalarSizeInBits()) {
      // fold the shl/or chain into a single wide load
    }
  }
}

@resistor
Collaborator

#87 fixes the codegen, but breaks CheriotRTOS. I need to figure out why.

@resistor
Collaborator

The breakage is because the SAIL-based simulator disables misaligned memory accesses by default. See CHERIoT-Platform/cheriot-sail#92

@resistor
Collaborator

Enabling this attribute incidentally improves code size on the cheriot-rtos test suite by 0.2%.

@davidchisnall

Can you check on the Ibex sim (in the dev container)? I believe it supports unaligned access (for everything that isn’t a capability) and there’s a bug in the sail.

@resistor
Collaborator

> Can you check on the Ibex sim (in the dev container)? I believe it supports unaligned access (for everything that isn’t a capability) and there’s a bug in the sail.

Now that I've figured out how to do that, the rtos testsuite passes with misaligned loads enabled on the Ibex sim.

@nwf
Member Author

nwf commented Dec 23, 2024

Hm. I'm not sure I meant to commit us to requiring misaligned loads! I think (but will check in more detail when I'm back home later today) that the BLAKE2 loads are always 32-bit aligned and perhaps we're just losing that information in C.

Can we restrict the instruction fusion to the case that we do know that the result will be well-aligned?

@davidchisnall

RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

@nwf
Member Author

nwf commented Dec 23, 2024

> RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

Well, AIUI they'd claim that a compliant implementation could always do 16-bit-sized and -aligned loads and maybe issue several in rapid succession for uncompressed instructions... I don't think there's anything in the architecture that would require supporting, say, an 8-bit-aligned 32-bit load.

@nwf
Member Author

nwf commented Dec 23, 2024

At a glance, blake2s-ref.c uses load32 only at well-aligned offsets, assuming the input block itself is well aligned, but unfortunately those buffers are typed as const uint8_t [] and the argument to load32 is a const void *, so none of that reasoning is available to the compiler.

However, even changing the type of load32 to take a uint32_t * doesn't seem to be informative enough for the existing compiler, as perhaps expected; but it would, IMHO, be a justifiable change to get improved codegen.

@resistor
Collaborator

> RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary.

> Well, AIUI they'd claim that a compliant implementation could always do 16-bit-sized and -aligned loads and maybe issue several in rapid succession for uncompressed instructions... I don't think there's anything in the architecture that would require supporting, say, an 8-bit-aligned 32-bit load.

I believe a compliant RISC-V HW combined system can always support misaligned loads. The difference is whether they're implemented in HW, or generate an exception that can be handled in SW. The question of whether the toolchain should emit them then becomes primarily a performance question.

@v01dXYZ

v01dXYZ commented Dec 24, 2024

@nwf It seems there are two ways to indicate to the compiler that the pointer is 4-byte aligned:

  • a typedef (e.g. aligned_4B_uint8_t) carrying __attribute__((aligned(4))); I can't get the same result without using a typedef. This generates loads with align 4.
  • __builtin_assume_aligned(src, 4), which generates loads with align 4 plus an llvm.assume(true) ["align"(...)].

The RISC-V backend will then optimise them into a single 4-byte load.
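The two approaches can be sketched like this (a minimal sketch; the function and typedef names are illustrative):

```c
#include <stdint.h>

/* 1. A typedef carrying an alignment attribute; the parameter type
 * tells the compiler the incoming pointer is 4-byte aligned. */
typedef uint8_t __attribute__((aligned(4))) aligned_4B_uint8_t;

uint32_t load32_typedef(const aligned_4B_uint8_t *src)
{
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = p[0];
    w |= (uint32_t)p[1] << 8;
    w |= (uint32_t)p[2] << 16;
    w |= (uint32_t)p[3] << 24;
    return w;
}

/* 2. __builtin_assume_aligned, which conveys the same fact at the
 * point of use (and also emits an llvm.assume in the IR). */
uint32_t load32_assume(const void *src)
{
    const uint8_t *p = __builtin_assume_aligned(src, 4);
    uint32_t w = p[0];
    w |= (uint32_t)p[1] << 8;
    w |= (uint32_t)p[2] << 16;
    w |= (uint32_t)p[3] << 24;
    return w;
}
```

Both variants compute the same little-endian word; they differ only in how the alignment fact reaches the optimiser.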

@resistor
Collaborator

@davidchisnall If we want to require misaligned load support, should we write it into the CHERIoT ISA spec?

@davidchisnall

I think it's a good idea for the compiler to default to assuming that they work. I think the trap-and-emulate logic will require more area, as code memory, than the equivalent hardware to make them fast, but I'd like to make sure that Sail defaults to supporting them in the dev container before we merge it in the compiler.

@resistor
Collaborator

resistor commented Jan 8, 2025

This will be fixed by #87, which is waiting on CHERIoT-Platform/cheriot-sail#93.

@resistor resistor closed this as completed Jan 9, 2025