Monero POW v2 #1831

psychocrypt · 2018-09-15T20:52:15Z

Implementation of the upcoming new Monero POW cryptonight_v8.

add AMD/CPU and CUDA implementation
add cpu single hash assembler version
AMD: add option unroll
CPU: add option asm to select assembler version of cryptonight_v8

Note: The CUDA implementation is currently slow and can be updated before the next release.

ATTENTION old config files fill not be compatible with this PR

Please do not report any hash reports here. For hash comparison please report it to the corresponding issue #1832

update docs: pools.tpl, README.md and amd.txt (strided_index:1 is forbidden)
set dev pool to monero8
~~- [] add currency monero8 and remove monero7~~ will be go in as soon as POW is final defined from the Monero devs
test NVIDIA bfactor
test Monero POW v2 #1831 (comment)
fix Monero POW v2 #1831 (comment)

SChernykh · 2018-09-15T22:23:54Z

xmrstak/backend/cpu/crypto/cryptonight_aesni.h

+	/* BEGIN cryptonight_monero_v8 variables */ \
+	__m128i bx1; \
+	__m128i division_result_xmm; \
+	__m128i sqrt_result_xmm; \


It should be a bit faster to store sqrt_result as uint64_t. Compiler doesn't spill it to stack (at least Visual Studio compiler), so it can be stored directly, not in XMM register.

This is for single hash mode only. There will always be register spill in 2x, 3x, ... hash modes.

thx for the info: I also saw it but have not looked deeper into the change to uint64_t. I will test it soon.

@SChernykh ~~I tested it on an ryzen 1600X and the performance will go down a little bit if I switch to uint64_t, therefore this change is not included~~

it was a mistake from my side for single hash it is a little bit faster

SChernykh · 2018-09-16T07:32:40Z

@psychocrypt I'm going to make asm optimized versions for double hash today (single hash is done already), you should use them to get the best performance on CPU. It should be easy to integrate since I use old xmr-stak-cpu for testing.

psychocrypt · 2018-09-16T07:41:14Z

thx I will have a look this evening and add also the unrolling as option to the amd backend.

fireice-uk · 2018-09-16T10:21:10Z

Looking good - thanks for your help @SChernykh

Spudz76 · 2018-09-16T20:19:04Z

Awaiting the CPU asm optimizations before I bother with testing CPUs.

Many of my CPUs are non-AES so it would be pretty rad if someone asm-optimized the softAES part too. I run xmr-stak on them since they are running anyway (blasting Ethereum on the GPUs), not because they are otherwise worth running just to CPU mine - but it would be nice if they were a little faster (it feels like the current softAES module is quick and dirty crutch? unoptimized?). None of these new mods help anything unless it has AES, as far as I can tell.

Couldn't figure out how to find or use the asm mods from your xmr-stak-cpu fork, and what I did try clashed pretty hard with the #1821 that more or less rearranged the entire CPU backend, aside from being based on an old fork that merged with xmr-stak (unified) quite some time ago. I'm not sure why the xmr-stak-cpu xmr-stak-amd repos even exist anymore other than to provide a bad starting vector for forks. All mods and upgrades should happen here, post unification and on dev branch. It still compiles equivalent to xmr-stak-cpu when you disable the GPU backends? Seems like a lot of unnecessary hassle to constantly fore-and-backport things from a deprecated codebase. Or, am I missing some sort of real reason for the non-unified repos still existing?

It is actually somewhat easier to backport mods from xmrig (being forked from the unified, although long ago, and rearranged quite a bit) than from the single-use versions of xmr-stak.

Spudz76 · 2018-09-16T20:32:32Z

I'm always four minutes too soon haha

Spudz76 · 2018-09-16T20:49:01Z

Instantly gives me issues on clang-3.8 (others still attempting build):

[  3%] Building ASM object CMakeFiles/xmr-stak-asm.dir/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop.S.o
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:160:2: error: invalid operand for instruction
 mov r13, -4389456576512
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:167:2: error: invalid operand for instruction
 mov r13, 4389456576511
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ryzen.inc:160:2: error: invalid operand for instruction
 mov rdx, 4389456576511
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ryzen.inc:167:2: error: invalid operand for instruction
 mov rdx, -4389456576512
 ^

Similar barfs from clang-3.9, 4.0, 5.0, 6.0, 7, 8 so it's a LLVM assembler dialect issue I guess? Additionally newer Clang seems to hate the $tokens too (output from clang-8)

/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:108:5: error: unknown token in expression
 je $sqrt_fixup_ivybridge
    ^

Clang is detected as such in the prep, all that changes is the version on the executable when I switch versions:

-- The ASM compiler identification is Clang
-- Found assembler: /usr/bin/clang-3.8

gcc-5 (5.5.0) seems to be building with no complaints...

ok gcc-5,6,7,8 all built fine.

Maybe hack the CMakeLists.txt ASM detection to always use gcc regardless if the main compiler has been switched out (it should still link regardless?) Although, on some older distros (jessie) clang-3.8 is the only easily added C++11 capable compiler and GCC 4.x is still the system compiler (but that may still be ok for the ASM stubs only).

icc-18 with gcc6 as backer (-gxx-name=g++-6 -gcc-name=gcc-6) builds fine as well, and detects ASM compiler properly:

-- The ASM compiler identification is Intel
-- Found assembler: /opt/intel/bin/icc

HOWEVER: icc-18 with gcc-7 as backer (-gxx-name=g++-7 -gcc-name=gcc-7) the ASM is fine but this section explodes fatally:

[ 34%] Building CXX object CMakeFiles/xmr-stak-backend.dir/xmrstak/backend/cpu/crypto/cryptonight_common.cpp.o
In file included from /usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_common.cpp(33):
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_aesni.h(438): error: argument of type "unsigned long long *" is incompatible with parameter of type "unsigned long *"
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);
                                                ^
In file included from /usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_common.cpp(33):
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_aesni.h(438): error: argument of type "unsigned long long *" is incompatible with parameter of type "unsigned long *"
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);
                                                                                     ^

Spudz76 · 2018-09-16T21:20:23Z

auto is definitely the wrong word to use here, should be false

If there were a real auto mode it would check CPUID and select the correct of the options, using auto where this is not the case (and it really means off or disabled or false) is very confusing

 * asm            - Allow to switch to a assembler version of cryptonight_v8; allowed value [auto, intel, ryzen]
 *                    - auto: used the default implementation (no assembler version)
 *                    - intel: supports Intel Ivy Bridge (Xeon v2, Core i7/i5/i3 3xxx, Pentium G2xxx, Celeron G1xxx)
 *                    - ryzen: AMD Ryzen (1xxx and 2xxx series)

SChernykh · 2018-09-16T21:22:40Z

supports Intel Ivy Bridge

This is also misleading. It will work on any x86 processor with SSE2 and AES support, it was just optimized for Ivy Bridge.

P.S. My double hash asm optimized version will not be ready today, it's much harder to do than single hash.

Spudz76 · 2018-09-16T22:56:36Z

This fixes the clang problem, by using 'gcc' regardless of anything else before enabling the ASM plugin.

I've also cleaned it up so the lib only gets linked when it is applicable, and added some x86_64 guards (in case ARM/neon ever gets added, etc)

The x86_64 guard logic is probably not the best, not sure what CMAKE_SYSTEM_PROCESSOR shows up as on the CI platforms or VM for example. I also assume it is not some other string on a Ryzen. Also this breaks "alien" cross CPU compiles (static mode) probably...

cmake-asm-improvement.patch.txt

This makes build work with clang, all versions, negating my previous errors. Have not yet tested on debian jessie where gcc is still 4.x while clang-3.8 works for the rest, that could crash while building.

ICC-18 with gcc-7 backer may still crash, testing...

Spudz76 · 2018-09-17T02:44:01Z

I fixed the ICC with gcc-7 backer problem by checking for an immutable macro that will appear when ICC is involved (even if -no-icc is passed to stealth itself, which would unset simpler macros such as _ICC)

-#if defined _MSC_VER || (__GNUC__ >= 7)
+#if defined _MSC_VER || ((__GNUC__ >= 7) && (!defined __INTEL_COMPILER_BUILD_DATE))
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);

Thus, raw gcc-7 itself will still catch the intrinsics version, but gcc-7 as a backer for another toolchain will use the generic. Unclear why the expected types are different, but the intrinsic is recognized... could also be patched some other way (change types based on compiler macro logic)

There appear to be lots of problems with these particular intrinsics especially in clang, but it is strange that gcc-7 can consume them but not when it is being called via ICC. Mayy be worth making a real assembly inline for each compiler to ensure it is not converted into garbage code smashing the stack and registers (which seems to be what every version of clang does with them).

psychocrypt · 2018-09-17T06:28:53Z

@Spudz76 I fixed many things in my last two commits:

a348d99

fix assembler code to pass the clang compiler
CMake: set asm file language
fix icc with gcc-7 compile issue with _addcarry_u64

cf458d6

Remove the asm option auto by off

Could you please check if the icc issue is now gone.

psychocrypt · 2018-09-17T06:29:49Z

The fix for the clang asm compile a348d99#diff-2fe31be299ace6e788b71e821d6b8877R160 for ryzen is currently not tested. I will do this this evening.

SChernykh · 2018-09-17T06:34:01Z

The fix for the clang asm compile a348d99#diff-2fe31be299ace6e788b71e821d6b8877R160 for ryzen is currently not tested.

This will probably compile to the exact same machine code
It's in sqrt_fixup branch which is executed only on 1 of ~470,000 iterations on average, so it shouldn't change performance at all.

- add special asm version for win64 and linux - add cmake path for MSVC and other systems

- reintroduce monero7 until the POW is final - update docs (add cryptonigh_v8)

- fix code style issues - fix spelling issue - fix asm to support newer clang versions

psychocrypt · 2018-09-19T16:07:55Z

@SChernykh thx for your cuda work, I will have a look to it later.

I have now rebased the branch against the dev and added the asm fixes.

SChernykh · 2018-09-19T17:51:21Z

@psychocrypt This code is literally a drop-in replacement, it took me a few minutes to get it up and running. I'm testing GTX 660 Ti now.

Add fast version for div and sqrt for the cuda backend

psychocrypt · 2018-09-19T19:38:18Z

@SChernykh I added again some optimizations. The hash rate for may GTX1080 is now much better (with default configuration).

psychocrypt · 2018-09-19T19:39:20Z

I will fore rebase this branch in a few min because I removed by accident some hash algos^^

- use optimzed div and sqrt - reduce memory footprint

- fix that shared memory for fast div is always used even if an algorithm is not using it - optimize fast div algo - store `division_result` (64_bit) per thread instead of shuffle around and store it as 32bit

psychocrypt · 2018-09-19T19:47:35Z

OK now after 5 force push of the last two commits I repaired my mistake. The PR can now tested

Spudz76 · 2018-09-20T05:48:13Z

Odd side effect of the new code, it doesn't work with "memory write execute" blocked, but the old version was fine.

aledomu · 2018-09-20T17:48:21Z

This is what that option is blocking: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=

I don't know much about ASM but I guess it's a side effect of doing dark wizardry to squeeze out the most performance possible.

I've deleted the line if you decide to go ahead, but I recommend if possible to figure out a way to not weaken the sandbox. Compromised miners are no fun.

SChernykh · 2018-09-20T17:55:23Z

@psychocrypt There is "_TEXT_CNV8_MAINLOOP SEGMENT PAGE READ EXECUTE" in cryptonight_v8_main_loop.asm which is compatible with "deny write execute" mode on Windows, but I don't see anything similar in cryptonight_v8_main_loop.S. I guess you need to specify segment attributes there too.

Spudz76 · 2018-09-20T19:09:19Z

Thanks @aledomu I use your .service file on several systems, but it must be Stretch/Xenial level else like half of the file has to be commented out (Jessie/Trusty have systemd too old to know most of those new tricks). I guess what I'm saying is even an insecure .service file is better than none. :) Friends don't let friends use pm2 haha

aledomu · 2018-09-20T21:52:41Z

Indeed I had to comment a few options in my Debian Stretch box. Since the sources are small enough for oneself to compile in a reasonable time, I thought I'd leave the task to comment those options to the user, who would have to deal with the sources anyway.

Not only Debian users need to tinker with the unit file: CentOS still uses an old-as-hell systemd version.

My goal with this unit file was to have it upstream so anybody who wanted a unit file had at least most of the work done. I was targeting mostly DIY rolling distros and systemd portable services, but I personally think that docker/podman is the best solution here in a more "stable" setup.

If there is some kind of policy for supporting old Linux distros, I will comment the required lines.

Spudz76 · 2018-09-21T12:13:47Z

Still can't get CUDA of any sort to work, tried Windows and it throws at cuda_extra line 421
CUDA 9.2 said it was not autoconfiguring due to "only" 3842MB free.
CUDA 9.1 made a config that seemed ok but line 421 crash
Compiled CUDA 8.0 but didn't try it yet. Generally have not had good luck with 8.0 clients to 9.2 driver even on old well proven code, and don't feel like reverting driver at this moment just for it to do likely the same (421 explosion).

Windows GTX970 via OpenCL does work, comparable rates in v7, but v8 runs around 3 times slower that way. Assuming due to lack of nVidia-OpenCL optimal int-sqrt kernel so it uses normal long-hand float sqrt which I think we already knew would be slow.

Did not retest Linux yet but expect same as before (no nVidia via either method)

kio3i0j9024vkoenio · 2018-09-23T22:04:59Z

Where exactly is this fireice-uk:dev code?

There used to be a link to a zip of the v8 code to test but that seems to be missing now.

I would like to test the latest v8 code on my HP DL580 G7 server but need access to the code to do so.

psychocrypt · 2018-09-24T04:21:15Z

The final change is currently not ready. I will finish it today. I need to test windows first. The tiny last tweak fr CUDA make some trouble because the cuda code works with a more complicated cooperate thread mode to be faster. The last tweak round require more communication between the threads. Therefore I need to think about how I can solve it.

psychocrypt added feature backend nvidia backend cpu backend amd labels Sep 15, 2018

psychocrypt assigned fireice-uk Sep 15, 2018

psychocrypt requested a review from fireice-uk September 15, 2018 20:52

psychocrypt mentioned this pull request Sep 15, 2018

Test upcoming Monero cryptonight_v8 #1832

Closed

psychocrypt force-pushed the topic-cn-v8_v2 branch from 9675005 to 8d2c3e8 Compare September 15, 2018 21:06

psychocrypt mentioned this pull request Sep 15, 2018

PoW modifications (shuffles and integer math) discussion and tests SChernykh/xmr-stak-cpu#1

Open

SChernykh reviewed Sep 15, 2018

View reviewed changes

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 16, 2018

testing Monero POW v2 fireice-uk#1831

3e538db

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 16, 2018

testing Monero POW v2 fireice-uk#1831 (correctly this time)

8f6ba9c

psychocrypt force-pushed the topic-cn-v8_v2 branch 5 times, most recently from ce64294 to 9134db5 Compare September 17, 2018 08:33

psychocrypt added 5 commits September 19, 2018 17:48

asm compiler compatibility

13fbb8a

- add special asm version for win64 and linux - add cmake path for MSVC and other systems

fix segfault if option asm is not in cpu.txt

ce84244

update docs and reintroduce monero7

9a2ef07

- reintroduce monero7 until the POW is final - update docs (add cryptonigh_v8)

fix naming cryptonigh to cryptonight

78bd54f

asm, style and spelling fixes

1692c54

- fix code style issues - fix spelling issue - fix asm to support newer clang versions

psychocrypt force-pushed the topic-cn-v8_v2 branch from 2eef7d2 to 1692c54 Compare September 19, 2018 16:06

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 19, 2018

testing Monero POW v2 fireice-uk#1831 (as of 20180919)

cfe33ad

cuda fast math for Monero pow v8

ac56ecb

Add fast version for div and sqrt for the cuda backend

psychocrypt force-pushed the topic-cn-v8_v2 branch from d1bafca to 2943f5c Compare September 19, 2018 19:41

psychocrypt added 2 commits September 19, 2018 21:46

NVIDIA: optimize div and sqrt

659918f

- use optimzed div and sqrt - reduce memory footprint

NVIDIA: optimze v8

fd27561

- fix that shared memory for fast div is always used even if an algorithm is not using it - optimize fast div algo - store `division_result` (64_bit) per thread instead of shuffle around and store it as 32bit

psychocrypt force-pushed the topic-cn-v8_v2 branch from 2943f5c to fd27561 Compare September 19, 2018 19:46

fireice-uk approved these changes Sep 21, 2018

View reviewed changes

fireice-uk merged commit 2f09629 into fireice-uk:dev Sep 21, 2018

psychocrypt mentioned this pull request Sep 24, 2018

Test final Monero POW cryptonight_v8 #1851

Open

psychocrypt deleted the topic-cn-v8_v2 branch September 26, 2018 19:14

psychocrypt mentioned this pull request Sep 30, 2018

cryptonight_v8 version 2 #1850

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monero POW v2 #1831

Monero POW v2 #1831

psychocrypt commented Sep 15, 2018 •

edited

Loading

SChernykh Sep 15, 2018

SChernykh Sep 15, 2018

psychocrypt Sep 15, 2018

psychocrypt Sep 16, 2018 •

edited

Loading

psychocrypt Sep 16, 2018

SChernykh commented Sep 16, 2018

psychocrypt commented Sep 16, 2018 via email

fireice-uk commented Sep 16, 2018

Spudz76 commented Sep 16, 2018

Spudz76 commented Sep 16, 2018

Spudz76 commented Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 16, 2018

SChernykh commented Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 17, 2018 •

edited

Loading

psychocrypt commented Sep 17, 2018 •

edited

Loading

psychocrypt commented Sep 17, 2018

SChernykh commented Sep 17, 2018

psychocrypt commented Sep 19, 2018

SChernykh commented Sep 19, 2018 •

edited

Loading

psychocrypt commented Sep 19, 2018

psychocrypt commented Sep 19, 2018

psychocrypt commented Sep 19, 2018

Spudz76 commented Sep 20, 2018

aledomu commented Sep 20, 2018

SChernykh commented Sep 20, 2018 •

edited

Loading

Spudz76 commented Sep 20, 2018

aledomu commented Sep 20, 2018

Spudz76 commented Sep 21, 2018

kio3i0j9024vkoenio commented Sep 23, 2018 •

edited

Loading

psychocrypt commented Sep 24, 2018 via email

Monero POW v2 #1831

Monero POW v2 #1831

Conversation

psychocrypt commented Sep 15, 2018 • edited Loading

SChernykh Sep 15, 2018

Choose a reason for hiding this comment

SChernykh Sep 15, 2018

Choose a reason for hiding this comment

psychocrypt Sep 15, 2018

Choose a reason for hiding this comment

psychocrypt Sep 16, 2018 • edited Loading

Choose a reason for hiding this comment

psychocrypt Sep 16, 2018

Choose a reason for hiding this comment

SChernykh commented Sep 16, 2018

psychocrypt commented Sep 16, 2018 via email

fireice-uk commented Sep 16, 2018

Spudz76 commented Sep 16, 2018

Spudz76 commented Sep 16, 2018

Spudz76 commented Sep 16, 2018 • edited Loading

Spudz76 commented Sep 16, 2018

SChernykh commented Sep 16, 2018 • edited Loading

Spudz76 commented Sep 16, 2018 • edited Loading

Spudz76 commented Sep 17, 2018 • edited Loading

psychocrypt commented Sep 17, 2018 • edited Loading

psychocrypt commented Sep 17, 2018

SChernykh commented Sep 17, 2018

psychocrypt commented Sep 19, 2018

SChernykh commented Sep 19, 2018 • edited Loading

psychocrypt commented Sep 19, 2018

psychocrypt commented Sep 19, 2018

psychocrypt commented Sep 19, 2018

Spudz76 commented Sep 20, 2018

aledomu commented Sep 20, 2018

SChernykh commented Sep 20, 2018 • edited Loading

Spudz76 commented Sep 20, 2018

aledomu commented Sep 20, 2018

Spudz76 commented Sep 21, 2018

kio3i0j9024vkoenio commented Sep 23, 2018 • edited Loading

psychocrypt commented Sep 24, 2018 via email

psychocrypt commented Sep 15, 2018 •

edited

Loading

psychocrypt Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 16, 2018 •

edited

Loading

SChernykh commented Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 16, 2018 •

edited

Loading

Spudz76 commented Sep 17, 2018 •

edited

Loading

psychocrypt commented Sep 17, 2018 •

edited

Loading

SChernykh commented Sep 19, 2018 •

edited

Loading

SChernykh commented Sep 20, 2018 •

edited

Loading

kio3i0j9024vkoenio commented Sep 23, 2018 •

edited

Loading