-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monero POW v2 #1831
Monero POW v2 #1831
Conversation
9675005
to
8d2c3e8
Compare
/* BEGIN cryptonight_monero_v8 variables */ \ | ||
__m128i bx1; \ | ||
__m128i division_result_xmm; \ | ||
__m128i sqrt_result_xmm; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be a bit faster to store sqrt_result as uint64_t. Compiler doesn't spill it to stack (at least Visual Studio compiler), so it can be stored directly, not in XMM register.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for single hash mode only. There will always be register spill in 2x, 3x, ... hash modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for the info: I also saw it but have not looked deeper into the change to uint64_t
. I will test it soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SChernykh I tested it on an ryzen 1600X and the performance will go down a little bit if I switch to uint64_t
, therefore this change is not included
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was a mistake from my side for single hash it is a little bit faster
@psychocrypt I'm going to make asm optimized versions for double hash today (single hash is done already), you should use them to get the best performance on CPU. It should be easy to integrate since I use old xmr-stak-cpu for testing. |
thx I will have a look this evening and add also the unrolling as option to
the amd backend.
|
Looking good - thanks for your help @SChernykh |
Awaiting the CPU asm optimizations before I bother with testing CPUs. Many of my CPUs are non-AES so it would be pretty rad if someone asm-optimized the softAES part too. I run xmr-stak on them since they are running anyway (blasting Ethereum on the GPUs), not because they are otherwise worth running just to CPU mine - but it would be nice if they were a little faster (it feels like the current softAES module is quick and dirty crutch? unoptimized?). None of these new mods help anything unless it has AES, as far as I can tell. Couldn't figure out how to find or use the asm mods from your xmr-stak-cpu fork, and what I did try clashed pretty hard with the #1821 that more or less rearranged the entire CPU backend, aside from being based on an old fork that merged with xmr-stak (unified) quite some time ago. I'm not sure why the xmr-stak-cpu xmr-stak-amd repos even exist anymore other than to provide a bad starting vector for forks. All mods and upgrades should happen here, post unification and on dev branch. It still compiles equivalent to xmr-stak-cpu when you disable the GPU backends? Seems like a lot of unnecessary hassle to constantly fore-and-backport things from a deprecated codebase. Or, am I missing some sort of real reason for the non-unified repos still existing? It is actually somewhat easier to backport mods from xmrig (being forked from the unified, although long ago, and rearranged quite a bit) than from the single-use versions of xmr-stak. |
I'm always four minutes too soon haha |
Instantly gives me issues on clang-3.8 (others still attempting build):
Similar barfs from clang-3.9, 4.0, 5.0, 6.0, 7, 8 so it's a LLVM assembler dialect issue I guess? Additionally newer Clang seems to hate the $tokens too (output from clang-8)
Clang is detected as such in the prep, all that changes is the version on the executable when I switch versions:
gcc-5 (5.5.0) seems to be building with no complaints... ok gcc-5,6,7,8 all built fine. Maybe hack the CMakeLists.txt ASM detection to always use gcc regardless if the main compiler has been switched out (it should still link regardless?) Although, on some older distros (jessie) clang-3.8 is the only easily added C++11 capable compiler and GCC 4.x is still the system compiler (but that may still be ok for the ASM stubs only). icc-18 with gcc6 as backer (
HOWEVER: icc-18 with gcc-7 as backer (
|
If there were a real
|
This is also misleading. It will work on any x86 processor with SSE2 and AES support, it was just optimized for Ivy Bridge. P.S. My double hash asm optimized version will not be ready today, it's much harder to do than single hash. |
This fixes the clang problem, by using 'gcc' regardless of anything else before enabling the ASM plugin. I've also cleaned it up so the lib only gets linked when it is applicable, and added some x86_64 guards (in case ARM/neon ever gets added, etc) The x86_64 guard logic is probably not the best, not sure what cmake-asm-improvement.patch.txt This makes build work with clang, all versions, negating my previous errors. Have not yet tested on debian jessie where gcc is still 4.x while clang-3.8 works for the rest, that could crash while building. ICC-18 with gcc-7 backer may still crash, testing... |
I fixed the ICC with gcc-7 backer problem by checking for an immutable macro that will appear when ICC is involved (even if
Thus, raw gcc-7 itself will still catch the intrinsics version, but gcc-7 as a backer for another toolchain will use the generic. Unclear why the expected types are different, but the intrinsic is recognized... could also be patched some other way (change types based on compiler macro logic) There appear to be lots of problems with these particular intrinsics especially in clang, but it is strange that gcc-7 can consume them but not when it is being called via ICC. Mayy be worth making a real assembly inline for each compiler to ensure it is not converted into garbage code smashing the stack and registers (which seems to be what every version of clang does with them). |
The fix for the clang asm compile a348d99#diff-2fe31be299ace6e788b71e821d6b8877R160 for ryzen is currently not tested. I will do this this evening. |
|
ce64294
to
9134db5
Compare
- add special asm version for win64 and linux - add cmake path for MSVC and other systems
- reintroduce monero7 until the POW is final - update docs (add cryptonigh_v8)
- fix code style issues - fix spelling issue - fix asm to support newer clang versions
2eef7d2
to
1692c54
Compare
@SChernykh thx for your cuda work, I will have a look to it later. I have now rebased the branch against the dev and added the asm fixes. |
@psychocrypt This code is literally a drop-in replacement, it took me a few minutes to get it up and running. I'm testing GTX 660 Ti now. |
Add fast version for div and sqrt for the cuda backend
@SChernykh I added again some optimizations. The hash rate for may GTX1080 is now much better (with default configuration). |
I will fore rebase this branch in a few min because I removed by accident some hash algos^^ |
d1bafca
to
2943f5c
Compare
- use optimzed div and sqrt - reduce memory footprint
- fix that shared memory for fast div is always used even if an algorithm is not using it - optimize fast div algo - store `division_result` (64_bit) per thread instead of shuffle around and store it as 32bit
2943f5c
to
fd27561
Compare
OK now after 5 force push of the last two commits I repaired my mistake. The PR can now tested |
Odd side effect of the new code, it doesn't work with "memory write execute" blocked, but the old version was fine. |
This is what that option is blocking: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute= I don't know much about ASM but I guess it's a side effect of doing dark wizardry to squeeze out the most performance possible. I've deleted the line if you decide to go ahead, but I recommend if possible to figure out a way to not weaken the sandbox. Compromised miners are no fun. |
@psychocrypt There is "_TEXT_CNV8_MAINLOOP SEGMENT PAGE READ EXECUTE" in cryptonight_v8_main_loop.asm which is compatible with "deny write execute" mode on Windows, but I don't see anything similar in cryptonight_v8_main_loop.S. I guess you need to specify segment attributes there too. |
Thanks @aledomu I use your .service file on several systems, but it must be Stretch/Xenial level else like half of the file has to be commented out (Jessie/Trusty have systemd too old to know most of those new tricks). I guess what I'm saying is even an insecure .service file is better than none. :) Friends don't let friends use |
Indeed I had to comment a few options in my Debian Stretch box. Since the sources are small enough for oneself to compile in a reasonable time, I thought I'd leave the task to comment those options to the user, who would have to deal with the sources anyway. Not only Debian users need to tinker with the unit file: CentOS still uses an old-as-hell systemd version. My goal with this unit file was to have it upstream so anybody who wanted a unit file had at least most of the work done. I was targeting mostly DIY rolling distros and systemd portable services, but I personally think that If there is some kind of policy for supporting old Linux distros, I will comment the required lines. |
Still can't get CUDA of any sort to work, tried Windows and it throws at cuda_extra line 421 Windows GTX970 via OpenCL does work, comparable rates in v7, but v8 runs around 3 times slower that way. Assuming due to lack of nVidia-OpenCL optimal int-sqrt kernel so it uses normal long-hand float sqrt which I think we already knew would be slow. Did not retest Linux yet but expect same as before (no nVidia via either method) |
Where exactly is this fireice-uk:dev code? There used to be a link to a zip of the v8 code to test but that seems to be missing now. I would like to test the latest v8 code on my HP DL580 G7 server but need access to the code to do so. |
The final change is currently not ready. I will finish it today. I need to
test windows first.
The tiny last tweak fr CUDA make some trouble because the cuda code works
with a more complicated cooperate thread mode to be faster. The last tweak
round require more communication between the threads. Therefore I need to
think about how I can solve it.
|
Implementation of the upcoming new Monero POW cryptonight_v8.
unroll
asm
to select assembler version of cryptonight_v8Note: The CUDA implementation is currently slow and can be updated before the next release.
ATTENTION old config files fill not be compatible with this PR
Please do not report any hash reports here. For hash comparison please report it to the corresponding issue #1832
pools.tpl
,README.md
andamd.txt
(strided_index:1 is forbidden)monero8
- [] add currencywill be go in as soon as POW is final defined from the Monero devsmonero8
and removemonero7
bfactor