Support for AArch64
Development of support for 64-bit ARM CPUs using the AArch64 architecture is progressing faster than expected.
This is provided as source code only and may be built natively on Linux by following the existing procedure, subject to the modifications described below.
Bitcoin talk discussion thread: https://bitcointalk.org/index.php?topic=5226770.0
Requirements:
- An ARM CPU supporting AArch64
- Linux OS.
cpuminer-opt-23.8 is released; all users should upgrade.
Highlights from this release: Removed some obsolete code, which should make it easier to support AArch64 and, hopefully, MacOS soon. AES is working in general and is enabled for Shavite & Echo. Groestl and Fugue still have issues.
Upgraded development environment:
- Orange Pi 5 Plus 16 GB, Rockchip 8 core CPU with AES & SHA2
- Ubuntu Mate 22.04
- GCC-11.4
Compile with:
$ ./arm-build.sh
The only change from build.sh is the addition of "-flax-vector-conversions" to CFLAGS. The compiler will remind you if you forget. Specific architectures and features can be compiled using the examples in armbuild-all.sh.
The miner is known to compile and run on Raspberry Pi 4B and Orange Pi 5 Plus, and compiles for all versions of armv8 with or without AES, SHA2, or both.
What works:
- All algorithms except Verthash should be working.
- Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON. Allium is also optimized for AES, but this is untested.
- Yespower, Yescrypt, Scrypt, ScryptN2 are fully optimized. SHA is enabled but untested.
- Sha256dt, Sha256t, Sha256d are fully optimized, SHA2 is also working.
- X17 is mostly optimized.
- MinotaurX is partially optimized.
- AES is working for Shavite & Echo.
- stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
- CPU and SW feature detection and reporting are working, algo features are in progress, CPU brand is not yet implemented.
- CPU temperature and clock frequency are working.
- cpu-affinity & threads are working.
Known problems:
- MacOS is not working.
- No detection of CPU model, default info is displayed.
- Detection of AES and SHA CPU extensions is not working.
- No detection of ARM architecture minor version number.
- NEON may not be displayed in algo features for some algos that may support it.
- Algos may show support for NEON even if it's disabled or not yet implemented.
- X17, MinotaurX are not fully optimized.
- Simd: NEON parallel hash not enabled, using unoptimized.
- Fugue: Multiple issues with NEON & AES, using unoptimized.
- Groestl: NEON AES not working, using unoptimized.
- Hamsi: parallel NEON not working, using unoptimized.
- SWIFFTX: Deferred, using unoptimized.
- Algos not mentioned have either been deferred or have not been analyzed. They may or may not work on ARM.
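As a point of reference for the AES and SHA detection problem listed above, one common approach on AArch64 Linux is to read the kernel's hardware capability bits from the auxiliary vector. The following is only a minimal sketch of that approach, not the miner's actual detection code. The struct, the function names and the SKETCH_ constants are hypothetical; the bit positions are assumed to match the Linux arm64 hwcap definitions.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#endif

/* Hypothetical local names. Bit positions are assumed to mirror the
   Linux arm64 HWCAP_ASIMD, HWCAP_AES and HWCAP_SHA2 definitions. */
#define SKETCH_HWCAP_ASIMD (1UL << 1)
#define SKETCH_HWCAP_AES   (1UL << 3)
#define SKETCH_HWCAP_SHA2  (1UL << 6)

/* Hypothetical container for the features of interest. */
struct arm_features {
    bool neon;
    bool aes;
    bool sha2;
};

/* Decode a raw hwcap word into individual feature flags. */
static inline struct arm_features decode_hwcap(unsigned long hwcap)
{
    struct arm_features f;
    f.neon = (hwcap & SKETCH_HWCAP_ASIMD) != 0;
    f.aes  = (hwcap & SKETCH_HWCAP_AES)   != 0;
    f.sha2 = (hwcap & SKETCH_HWCAP_SHA2)  != 0;
    return f;
}

/* Read the hwcap word from the kernel. Returns 0 (no features reported)
   on any platform other than AArch64 Linux. */
static inline unsigned long read_hwcap(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return getauxval(AT_HWCAP);
#else
    return 0;
#endif
}
```

A caller would use decode_hwcap(read_hwcap()) once at startup. This approach is Linux specific and would not cover MacOS.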
Short term plan:
Continue fixing parallel hash functions for x17 before propagating them to the rest of the X family. Figure out what's going on with verthash. Extend support to x21s, x22i, x25x. Add support for the short algos like skein2, keccak, blake2s, etc. Complete any other work needed to reach parity with SSE2. Performance testing.
Medium term:
Find NEON optimization opportunities that exploit its architecture and instruction set. Apply lessons learned to x86_64.
Long term:
- ARM SVE
- x86_64 AVX10
- RISC-V
Some notable observations about the problems encountered:
Verthash is a mystery: it only produces rejects on ARM even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was always successful. Verthash data file creation and verification work.
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word & sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources. X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving 2 zero-extended 64 bit source integers. With ARM the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
uint64x2_t = uint32x2_t * uint32x2_t
Most uses assume the x86_64 format, requiring a workaround for ARM.
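The lane difference can be illustrated with a scalar model. The sketch below is plain C so that it runs on any platform. It models the x86_64 behaviour (as in _mm_mul_epu32), the ARM behaviour (as in vmull_u32), and one possible workaround that packs lanes 0 & 2 before the widening multiply. All function names are hypothetical and this is not necessarily the workaround used in the miner. With real intrinsics the packing step might be done by viewing the vector as 2 x 64 bit and narrowing with vmovn_u64, which is an assumption rather than something confirmed by the source.

```c
#include <stdint.h>

/* Scalar model of the x86_64 style widening multiply: each operand has
   four 32-bit lanes, lanes 0 and 2 are multiplied, lanes 1 and 3 ignored. */
static inline void mul_epu32_model(const uint32_t a[4], const uint32_t b[4],
                                   uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of the ARM style widening multiply: each operand is a
   packed pair of 32-bit lanes, both lanes are multiplied. */
static inline void vmull_u32_model(const uint32_t a[2], const uint32_t b[2],
                                   uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical workaround: pack lanes 0 and 2 of each four-lane operand
   into a two-lane operand, then use the ARM style multiply. The result
   matches the x86_64 model at the cost of the extra packing step. */
static inline void mul_epu32_via_vmull(const uint32_t a[4], const uint32_t b[4],
                                       uint64_t out[2])
{
    uint32_t a_packed[2] = { a[0], a[2] };
    uint32_t b_packed[2] = { b[0], b[2] };
    vmull_u32_model(a_packed, b_packed, out);
}
```

The packing step is the extra cost on ARM: code written for the x86_64 layout needs one narrowing operation per operand before each multiply.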
NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't. Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
NEON has no blend instruction but can emulate one compatible with x86_64 blendv using boolean algebra, but not very efficiently.
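The boolean algebra involved can be sketched in scalar C. The model below follows the byte-wise blendv convention where the high bit of each mask byte selects the second source. The function name is hypothetical and this is only an illustration of the logic, not the miner's NEON code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a byte-wise blendv: for each byte, take b[i] when the
   high bit of mask[i] is set, otherwise take a[i]. */
static inline void blendv_bytes_model(const uint8_t *a, const uint8_t *b,
                                      const uint8_t *mask, uint8_t *out,
                                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Expand the high bit of the mask byte to 0x00 or 0xFF. */
        uint8_t m = (uint8_t)(0u - (unsigned)(mask[i] >> 7));
        /* Select with AND, AND-NOT and OR. */
        out[i] = (uint8_t)((a[i] & (uint8_t)~m) | (b[i] & m));
    }
}
```

The mask expansion plus three logic operations per vector is where the inefficiency comes from compared with a single blend instruction.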