Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monero POW v2 #1831

Merged
merged 20 commits into from
Sep 21, 2018
Merged

Monero POW v2 #1831

merged 20 commits into from
Sep 21, 2018

Conversation

psychocrypt
Copy link
Collaborator

@psychocrypt psychocrypt commented Sep 15, 2018

Implementation of the upcoming new Monero POW cryptonight_v8.

  • add AMD/CPU and CUDA implementation
  • add cpu single hash assembler version
  • AMD: add option unroll
  • CPU: add option asm to select assembler version of cryptonight_v8

Note: The CUDA implementation is currently slow and can be updated before the next release.

ATTENTION old config files fill not be compatible with this PR

Please do not report any hash reports here. For hash comparison please report it to the corresponding issue #1832

  • update docs: pools.tpl, README.md and amd.txt (strided_index:1 is forbidden)
  • set dev pool to monero8
    - [] add currency monero8 and remove monero7 will be go in as soon as POW is final defined from the Monero devs
  • test NVIDIA bfactor
  • test Monero POW v2 #1831 (comment)
  • fix Monero POW v2 #1831 (comment)

/* BEGIN cryptonight_monero_v8 variables */ \
__m128i bx1; \
__m128i division_result_xmm; \
__m128i sqrt_result_xmm; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a bit faster to store sqrt_result as uint64_t. Compiler doesn't spill it to stack (at least Visual Studio compiler), so it can be stored directly, not in XMM register.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for single hash mode only. There will always be register spill in 2x, 3x, ... hash modes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for the info: I also saw it but have not looked deeper into the change to uint64_t. I will test it soon.

Copy link
Collaborator Author

@psychocrypt psychocrypt Sep 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SChernykh I tested it on an ryzen 1600X and the performance will go down a little bit if I switch to uint64_t, therefore this change is not included

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a mistake from my side for single hash it is a little bit faster

@SChernykh
Copy link
Contributor

@psychocrypt I'm going to make asm optimized versions for double hash today (single hash is done already), you should use them to get the best performance on CPU. It should be easy to integrate since I use old xmr-stak-cpu for testing.

@psychocrypt
Copy link
Collaborator Author

psychocrypt commented Sep 16, 2018 via email

@fireice-uk
Copy link
Owner

Looking good - thanks for your help @SChernykh

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 16, 2018

Awaiting the CPU asm optimizations before I bother with testing CPUs.

Many of my CPUs are non-AES so it would be pretty rad if someone asm-optimized the softAES part too. I run xmr-stak on them since they are running anyway (blasting Ethereum on the GPUs), not because they are otherwise worth running just to CPU mine - but it would be nice if they were a little faster (it feels like the current softAES module is quick and dirty crutch? unoptimized?). None of these new mods help anything unless it has AES, as far as I can tell.

Couldn't figure out how to find or use the asm mods from your xmr-stak-cpu fork, and what I did try clashed pretty hard with the #1821 that more or less rearranged the entire CPU backend, aside from being based on an old fork that merged with xmr-stak (unified) quite some time ago. I'm not sure why the xmr-stak-cpu xmr-stak-amd repos even exist anymore other than to provide a bad starting vector for forks. All mods and upgrades should happen here, post unification and on dev branch. It still compiles equivalent to xmr-stak-cpu when you disable the GPU backends? Seems like a lot of unnecessary hassle to constantly fore-and-backport things from a deprecated codebase. Or, am I missing some sort of real reason for the non-unified repos still existing?

It is actually somewhat easier to backport mods from xmrig (being forked from the unified, although long ago, and rearranged quite a bit) than from the single-use versions of xmr-stak.

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 16, 2018

I'm always four minutes too soon haha

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 16, 2018
@Spudz76
Copy link
Contributor

Spudz76 commented Sep 16, 2018

Instantly gives me issues on clang-3.8 (others still attempting build):

[  3%] Building ASM object CMakeFiles/xmr-stak-asm.dir/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop.S.o
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:160:2: error: invalid operand for instruction
 mov r13, -4389456576512
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:167:2: error: invalid operand for instruction
 mov r13, 4389456576511
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ryzen.inc:160:2: error: invalid operand for instruction
 mov rdx, 4389456576511
 ^
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ryzen.inc:167:2: error: invalid operand for instruction
 mov rdx, -4389456576512
 ^

Similar barfs from clang-3.9, 4.0, 5.0, 6.0, 7, 8 so it's a LLVM assembler dialect issue I guess? Additionally newer Clang seems to hate the $tokens too (output from clang-8)

/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/asm/cryptonigh_v8_main_loop_ivybridge.inc:108:5: error: unknown token in expression
 je $sqrt_fixup_ivybridge
    ^

Clang is detected as such in the prep, all that changes is the version on the executable when I switch versions:

-- The ASM compiler identification is Clang
-- Found assembler: /usr/bin/clang-3.8

gcc-5 (5.5.0) seems to be building with no complaints...

ok gcc-5,6,7,8 all built fine.

Maybe hack the CMakeLists.txt ASM detection to always use gcc regardless if the main compiler has been switched out (it should still link regardless?) Although, on some older distros (jessie) clang-3.8 is the only easily added C++11 capable compiler and GCC 4.x is still the system compiler (but that may still be ok for the ASM stubs only).

icc-18 with gcc6 as backer (-gxx-name=g++-6 -gcc-name=gcc-6) builds fine as well, and detects ASM compiler properly:

-- The ASM compiler identification is Intel
-- Found assembler: /opt/intel/bin/icc

HOWEVER: icc-18 with gcc-7 as backer (-gxx-name=g++-7 -gcc-name=gcc-7) the ASM is fine but this section explodes fatally:

[ 34%] Building CXX object CMakeFiles/xmr-stak-backend.dir/xmrstak/backend/cpu/crypto/cryptonight_common.cpp.o
In file included from /usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_common.cpp(33):
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_aesni.h(438): error: argument of type "unsigned long long *" is incompatible with parameter of type "unsigned long *"
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);
                                                ^
In file included from /usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_common.cpp(33):
/usr/src/xmr-stak/xmrstak/backend/cpu/crypto/cryptonight_aesni.h(438): error: argument of type "unsigned long long *" is incompatible with parameter of type "unsigned long *"
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);
                                                                                     ^

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 16, 2018

auto is definitely the wrong word to use here, should be false

If there were a real auto mode it would check CPUID and select the correct of the options, using auto where this is not the case (and it really means off or disabled or false) is very confusing

 * asm            - Allow to switch to a assembler version of cryptonight_v8; allowed value [auto, intel, ryzen]
 *                    - auto: used the default implementation (no assembler version)
 *                    - intel: supports Intel Ivy Bridge (Xeon v2, Core i7/i5/i3 3xxx, Pentium G2xxx, Celeron G1xxx)
 *                    - ryzen: AMD Ryzen (1xxx and 2xxx series)

@SChernykh
Copy link
Contributor

SChernykh commented Sep 16, 2018

supports Intel Ivy Bridge

This is also misleading. It will work on any x86 processor with SSE2 and AES support, it was just optimized for Ivy Bridge.

P.S. My double hash asm optimized version will not be ready today, it's much harder to do than single hash.

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 16, 2018
@Spudz76
Copy link
Contributor

Spudz76 commented Sep 16, 2018

This fixes the clang problem, by using 'gcc' regardless of anything else before enabling the ASM plugin.

I've also cleaned it up so the lib only gets linked when it is applicable, and added some x86_64 guards (in case ARM/neon ever gets added, etc)

The x86_64 guard logic is probably not the best, not sure what CMAKE_SYSTEM_PROCESSOR shows up as on the CI platforms or VM for example. I also assume it is not some other string on a Ryzen. Also this breaks "alien" cross CPU compiles (static mode) probably...

cmake-asm-improvement.patch.txt

This makes build work with clang, all versions, negating my previous errors. Have not yet tested on debian jessie where gcc is still 4.x while clang-3.8 works for the rest, that could crash while building.

ICC-18 with gcc-7 backer may still crash, testing...

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 17, 2018

I fixed the ICC with gcc-7 backer problem by checking for an immutable macro that will appear when ICC is involved (even if -no-icc is passed to stealth itself, which would unset simpler macros such as _ICC)

-#if defined _MSC_VER || (__GNUC__ >= 7)
+#if defined _MSC_VER || ((__GNUC__ >= 7) && (!defined __INTEL_COMPILER_BUILD_DATE))
        _addcarry_u64(_subborrow_u64(0, x2, n0, (unsigned long long int*)&x2), r, 0, (unsigned long long int*)&r);

Thus, raw gcc-7 itself will still catch the intrinsics version, but gcc-7 as a backer for another toolchain will use the generic. Unclear why the expected types are different, but the intrinsic is recognized... could also be patched some other way (change types based on compiler macro logic)

There appear to be lots of problems with these particular intrinsics especially in clang, but it is strange that gcc-7 can consume them but not when it is being called via ICC. Mayy be worth making a real assembly inline for each compiler to ensure it is not converted into garbage code smashing the stack and registers (which seems to be what every version of clang does with them).

@psychocrypt
Copy link
Collaborator Author

psychocrypt commented Sep 17, 2018

@Spudz76 I fixed many things in my last two commits:

a348d99

  • fix assembler code to pass the clang compiler
  • CMake: set asm file language
  • fix icc with gcc-7 compile issue with _addcarry_u64

cf458d6

  • Remove the asm option auto by off

Could you please check if the icc issue is now gone.

@psychocrypt
Copy link
Collaborator Author

The fix for the clang asm compile a348d99#diff-2fe31be299ace6e788b71e821d6b8877R160 for ryzen is currently not tested. I will do this this evening.

@SChernykh
Copy link
Contributor

The fix for the clang asm compile a348d99#diff-2fe31be299ace6e788b71e821d6b8877R160 for ryzen is currently not tested.

  • This will probably compile to the exact same machine code
  • It's in sqrt_fixup branch which is executed only on 1 of ~470,000 iterations on average, so it shouldn't change performance at all.

@psychocrypt psychocrypt force-pushed the topic-cn-v8_v2 branch 5 times, most recently from ce64294 to 9134db5 Compare September 17, 2018 08:33
- add special asm version for win64 and linux
- add cmake path for MSVC and other systems
- reintroduce monero7 until the POW is final
- update docs (add cryptonigh_v8)
- fix code style issues
- fix spelling issue
- fix asm to support newer clang versions
@psychocrypt
Copy link
Collaborator Author

@SChernykh thx for your cuda work, I will have a look to it later.

I have now rebased the branch against the dev and added the asm fixes.

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Sep 19, 2018
@SChernykh
Copy link
Contributor

SChernykh commented Sep 19, 2018

@psychocrypt This code is literally a drop-in replacement, it took me a few minutes to get it up and running. I'm testing GTX 660 Ti now.

Add fast version for div and sqrt for the cuda backend
@psychocrypt
Copy link
Collaborator Author

@SChernykh I added again some optimizations. The hash rate for may GTX1080 is now much better (with default configuration).

@psychocrypt
Copy link
Collaborator Author

I will fore rebase this branch in a few min because I removed by accident some hash algos^^

- use optimzed div and sqrt
- reduce memory footprint
- fix that shared memory for fast div is always used even if an algorithm is not using it
- optimize fast div algo
- store `division_result` (64_bit) per thread instead of shuffle around and store it as 32bit
@psychocrypt
Copy link
Collaborator Author

OK now after 5 force push of the last two commits I repaired my mistake. The PR can now tested

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 20, 2018

Odd side effect of the new code, it doesn't work with "memory write execute" blocked, but the old version was fine.

@aledomu
Copy link
Contributor

aledomu commented Sep 20, 2018

This is what that option is blocking: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=

I don't know much about ASM but I guess it's a side effect of doing dark wizardry to squeeze out the most performance possible.

I've deleted the line if you decide to go ahead, but I recommend if possible to figure out a way to not weaken the sandbox. Compromised miners are no fun.

@SChernykh
Copy link
Contributor

SChernykh commented Sep 20, 2018

@psychocrypt There is "_TEXT_CNV8_MAINLOOP SEGMENT PAGE READ EXECUTE" in cryptonight_v8_main_loop.asm which is compatible with "deny write execute" mode on Windows, but I don't see anything similar in cryptonight_v8_main_loop.S. I guess you need to specify segment attributes there too.

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 20, 2018

Thanks @aledomu I use your .service file on several systems, but it must be Stretch/Xenial level else like half of the file has to be commented out (Jessie/Trusty have systemd too old to know most of those new tricks). I guess what I'm saying is even an insecure .service file is better than none. :) Friends don't let friends use pm2 haha

@aledomu
Copy link
Contributor

aledomu commented Sep 20, 2018

Indeed I had to comment a few options in my Debian Stretch box. Since the sources are small enough for oneself to compile in a reasonable time, I thought I'd leave the task to comment those options to the user, who would have to deal with the sources anyway.

Not only Debian users need to tinker with the unit file: CentOS still uses an old-as-hell systemd version.

My goal with this unit file was to have it upstream so anybody who wanted a unit file had at least most of the work done. I was targeting mostly DIY rolling distros and systemd portable services, but I personally think that docker/podman is the best solution here in a more "stable" setup.

If there is some kind of policy for supporting old Linux distros, I will comment the required lines.

@Spudz76
Copy link
Contributor

Spudz76 commented Sep 21, 2018

Still can't get CUDA of any sort to work, tried Windows and it throws at cuda_extra line 421
CUDA 9.2 said it was not autoconfiguring due to "only" 3842MB free.
CUDA 9.1 made a config that seemed ok but line 421 crash
Compiled CUDA 8.0 but didn't try it yet. Generally have not had good luck with 8.0 clients to 9.2 driver even on old well proven code, and don't feel like reverting driver at this moment just for it to do likely the same (421 explosion).

Windows GTX970 via OpenCL does work, comparable rates in v7, but v8 runs around 3 times slower that way. Assuming due to lack of nVidia-OpenCL optimal int-sqrt kernel so it uses normal long-hand float sqrt which I think we already knew would be slow.

Did not retest Linux yet but expect same as before (no nVidia via either method)

@fireice-uk fireice-uk merged commit 2f09629 into fireice-uk:dev Sep 21, 2018
@kio3i0j9024vkoenio
Copy link

kio3i0j9024vkoenio commented Sep 23, 2018

Where exactly is this fireice-uk:dev code?

There used to be a link to a zip of the v8 code to test but that seems to be missing now.

I would like to test the latest v8 code on my HP DL580 G7 server but need access to the code to do so.

@psychocrypt
Copy link
Collaborator Author

psychocrypt commented Sep 24, 2018 via email

@psychocrypt psychocrypt deleted the topic-cn-v8_v2 branch September 26, 2018 19:14
@psychocrypt psychocrypt mentioned this pull request Sep 30, 2018
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants