Bad codegen for decomposed word loads #85
Comments
It seems this optimisation is available for x86-64 and AArch64, but not for ARMv7 or RISC-V32/64. The related pass is
define i32 @load32(i8* %0) {
%2 = getelementptr inbounds i8, i8* %0, i64 1
%3 = load i8, i8* %0, align 1
%4 = zext i8 %3 to i32
%5 = getelementptr inbounds i8, i8* %0, i64 2
%6 = load i8, i8* %2, align 1
%7 = zext i8 %6 to i32
%8 = shl nuw nsw i32 %7, 8
%9 = or i32 %8, %4
%10 = getelementptr inbounds i8, i8* %0, i64 3
%11 = load i8, i8* %5, align 1
%12 = zext i8 %11 to i32
%13 = shl nuw nsw i32 %12, 16
%14 = or i32 %9, %13
%15 = load i8, i8* %10, align 1
%16 = zext i8 %15 to i32
%17 = shl nuw i32 %16, 24
%18 = or i32 %14, %17
ret i32 %18
} |
It seems this is because RISC-V doesn't have a fast load for i32 by default. There is an attribute to allow it, though |
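At the LLVM level, one possible patch is to add a check for whether the load's pointer operand is a function argument with an alignment attribute. Below, an API reminder (a sketch, not a tested patch):
auto *ArgPtr = dyn_cast<Argument>(L1LI1->getPointerOperand());
if (ArgPtr) { // dyn_cast yields null when the pointer is not a function argument
  MaybeAlign PtrMaybeAlign = ArgPtr->getParamAlign();
  if (PtrMaybeAlign) {
    Align PtrAlign = *PtrMaybeAlign;
    // Align::value() is in bytes, while the merged load's width is in bits
    if (PtrAlign.value() * 8 >= Type.getBitWidth()) {
      // optimize the or + shl
    }
  }
} |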
#87 fixes the codegen, but breaks CheriotRTOS. I need to figure out why. |
The breakage is because the SAIL-based simulator disables misaligned memory accesses by default. See CHERIoT-Platform/cheriot-sail#92 |
Enabling this attribute incidentally improves code size on the cheriot-rtos test suite by 0.2% |
Can you check on the Ibex sim (in the dev container)? I believe it supports unaligned access (for everything that isn’t a capability) and there’s a bug in the Sail. |
Now that I've figured out how to do that, the rtos testsuite passes with misaligned loads enabled on the Ibex sim. |
Hm. I'm not sure I meant to commit us to requiring misaligned loads! I think (but will check in more detail when I'm back home later today) that the BLAKE2 loads are always 32-bit aligned and perhaps we're just losing that information in C. Can we restrict the instruction fusion to the case that we do know that the result will be well-aligned? |
RISC-V needs to support unaligned loads for instruction fetch, so I don’t think that there’s much value in permitting implementations that can’t do them for data. On systems with paging, it can be painful if a load or store spans two pages, but that’s the kind of corner case that you can punt to M Mode if necessary. |
Well, AIUI they'd claim that a compliant implementation could always do 16-bit-sized and -aligned loads and maybe issue several in rapid succession for uncompressed instructions... I don't think there's anything in the architecture that would require supporting, say, an 8-bit-aligned 32-bit load. |
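To make the 16-bit fetch argument concrete, here is a minimal sketch in C of how a 32-bit instruction word can be assembled from two naturally aligned 16-bit loads. The function name `fetch32_from_halfwords` is hypothetical and only for illustration. It models the idea rather than any particular core's fetch unit.

```c
#include <stdint.h>

/* Hypothetical helper: build a 32-bit word from two 16-bit parcels.
 * Each parcel is loaded at its natural (2-byte) alignment, so no
 * access wider than 16 bits is ever misaligned. The low parcel
 * comes first, as in RISC-V's little-endian instruction encoding. */
static inline uint32_t fetch32_from_halfwords(const uint16_t *parcels)
{
    uint32_t lo = parcels[0];
    uint32_t hi = parcels[1];
    return lo | (hi << 16);
}
```

This only shows that 2-byte-aligned 32-bit fetches need nothing beyond aligned 16-bit accesses. It says nothing about 8-bit-aligned 32-bit data loads, which is the case in question here.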
At a glance, the However, even changing the type of |
I believe a compliant RISC-V HW+SW combined system can always support misaligned loads. The difference is whether they're implemented in HW, or generate an exception that can be handled in SW. The question of whether the toolchain should emit them then becomes primarily a performance question. |
@nwf It seems there are two ways to indicate to the compiler that the pointer is 4-byte aligned:
The RISCV backend will then optimise it as a single 4B load. |
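The two methods listed in the comment above did not survive in this copy of the thread. As an illustration only, one mechanism that Clang and GCC provide for this purpose is `__builtin_assume_aligned`. The sketch below uses a hypothetical function name, `load32_le_aligned`, and assumes the caller really does pass a 4-byte-aligned pointer. Whether the byte loads are then merged into one word load depends on the compiler version and target.

```c
#include <stdint.h>

/* Hypothetical helper: little-endian 32-bit load from a pointer the
 * caller promises is 4-byte aligned. Passing a pointer that is not
 * 4-byte aligned makes the promise false and is undefined behaviour. */
static inline uint32_t load32_le_aligned(const void *src)
{
    /* Tell the compiler about the alignment it cannot see from the type. */
    const uint8_t *p = (const uint8_t *)__builtin_assume_aligned(src, 4);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The function computes the same value whether or not the loads get merged, so the hint only affects code generation.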
@davidchisnall If we want to require misaligned load support, should we write it into the CHERIoT ISA spec? |
I think it's a good idea for the compiler to default to assuming that they work. I think the trap and emulate logic will require more area for code memory than the equivalent hardware to make it fast, but I'd like to make sure that Sail defaults to supporting them in the dev container before we merge it in the compiler. |
This will be fixed by #87 which is waiting on CHERIoT-Platform/cheriot-sail#93 |
BLAKE2s's portable implementation does what a lot of code that needs to care about endianness does, and decomposes word loads into bytes (see https://github.com/BLAKE2/libb2/blob/643decfbf8ae600c3387686754d74c84144950d1/src/blake2-impl.h#L37-L42 in particular) in hopes that the compiler puts it back together if it can.
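For readers who don't want to follow the link, here is a minimal sketch of the byte-decomposition pattern being described. It is written from scratch for illustration, not copied from libb2, and the name `load32_le` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: endianness-independent little-endian 32-bit load.
 * It reads one byte at a time, so it is correct for any alignment and
 * on any host byte order. The hope is that the compiler recognises the
 * pattern and emits a single word load where that is legal and fast. */
static inline uint32_t load32_le(const void *src)
{
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = p[0];
    w |= (uint32_t)p[1] << 8;
    w |= (uint32_t)p[2] << 16;
    w |= (uint32_t)p[3] << 24;
    return w;
}
```

Each byte is widened to `uint32_t` before shifting, so the shift by 24 cannot overflow a signed `int`.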
Unfortunately, we seem not to, and instead generate a pretty literal transcription:
That, in turn, seems to raise the function above inlining thresholds, and so it gets called instead, and the whole thing is just a bit sad.
FWIW, LLVM is 0fa9bc5 with #48 still sitting atop, locally; specifically...