
unify cpu cryptonight implementations #1821

Conversation

Collaborator

@psychocrypt psychocrypt commented Sep 10, 2018

xmr-stak has several implementations for multi-hash per thread.
This results in three independent implementations.
Each time the algorithm must be changed, the risk of introducing errors is very high.

  • unify the different cryptonight CPU implementations
  • simplify the function selection array that finds the specialized cryptonight implementation

Note: I tried to implement everything as recursive unrolling with templates because I am no fan of macros, but the result was slightly slower code. Therefore I decided to use the ugly macros to implement it.
On my laptop (Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 4 MiB cache) the new code is slightly faster than the old implementation for cryptonight_lite with 4 hashes per thread.
Nevertheless I will test a few more CPUs.

The reason I have not reused the old macros is that their variables were defined outside of the macros. This is another point where errors can be introduced very quickly, e.g. during the implementation of cryptonight_v8. After this PR we always get correct results as long as the single-hash implementation is correct.

This PR also solves #1780.

  • Test on different CPUs
  • windows tests
  • test NVIDIA and AMD (to avoid that the CPU check of the GPU nonce is broken)

@psychocrypt
Collaborator Author

Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (4 MiB cache), two hashes per thread:

  • cryptonight
    • dev: 117 H/S
    • PR: 119 H/S
  • cryptonight_v7
    • dev: 117 H/S
    • PR: 117 H/S

I removed the decimal places because the measurement error is at least ±1 H/s.

@psychocrypt psychocrypt force-pushed the topic-simplefyMultihashImplementation_defines branch 7 times, most recently from bf5db85 to 0c0e7fa Compare September 13, 2018 11:33
@psychocrypt
Collaborator Author

I tested the PR on Windows with the NVIDIA and OpenCL miners. Everything works fine. The hash rate on different systems is better than or equal to the old implementation.

@psychocrypt
Collaborator Author

I tested the macros with clang, icc, msvc and gcc in different versions.

xmr-stak has several implementations for multi-hash per thread.
This results in three independent implementations.
Each time the algorithm must be changed, the risk of introducing errors is very high.

- unify the different cryptonight CPU implementations
- simplify the function selection array that finds the specialized cryptonight implementation
- add an intermediate pointer to access the large state (similar to the old multi-hash implementation)

As a side effect this change increases the speed of the single- and multi-hash algorithms.
@psychocrypt psychocrypt force-pushed the topic-simplefyMultihashImplementation_defines branch from 0c0e7fa to 931bd5f Compare September 13, 2018 22:07
@fireice-uk fireice-uk merged commit b5d3890 into fireice-uk:dev Sep 14, 2018
@Spudz76
Contributor

Spudz76 commented Sep 14, 2018

I am excited about this but haven't had a chance to read through the new code yet. Does it also allow for any number of low-power-mode threads now, or is it still just 0-5?

Some CPUs of mine gained a good percentage from running 10 or 20 (with prefetch on) using the mbasaman hacks to the previous macros.

I will test this on all my CPUs and compilers too. So far the fastest for me are clang 3.8 or gcc 6 in most cases.

@psychocrypt psychocrypt deleted the topic-simplefyMultihashImplementation_defines branch September 14, 2018 20:53
@psychocrypt
Collaborator Author

@Spudz76 It is now much simpler and needs less code to generate 10- or 20-hash versions. As a side effect, the chance of making copy-paste mistakes is very low.

@plavirudar

plavirudar commented Sep 15, 2018

I've tested on my CPUs that run N>2 scratchpads, and the performance was pretty unusual, in that the hashrate did not change linearly.

CNL N=1 I get 90 on old and 90 on new implementation.
CNL N=2 I get 150 on old and 155 on new implementation.
CNL N=4 I get 250 on old and 245 on new implementation.
CNL N=8 I get 200 on old and 170 on new implementation.

I believe the code improves the N=2 implementation, which was suboptimal, but for N>2 the older hardcoded method works better. The old code contained an EVEN ROUND and ODD ROUND loop that was removed in the new code; could that be the reason for such a significant slowdown?

I'm not a C++ programmer, but would it be possible for you to see whether the macro system can be implemented in a way that's more similar to the old one? I believe some testing on a machine with >8 MB L3 and N=4 should make this pretty clear.

-edit-

I actually grafted the old CN_STEP1-4 defined functions back into the struct Cryptonight_hash<8> function and got back 200 H/s on the N=8 thread, so it's almost certain that the breaking change is in cryptonight_aesni.h.

@psychocrypt
Collaborator Author

psychocrypt commented Sep 15, 2018 via email

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

#1604 is where the N=6,8,10,20 came from. I never quite got it all broken down into macros, but my vision was to generate even the N selections from macro looping (so you can provide a list of N values to build, like the CUDA platform list; the default being 1,2,3,4,5, but it would be very simple to add whatever ones you think your CPU likes). This patch gets us closer to that.

I had been using N=10 on several CPUs for gains. I think N = (cache ways) is some sort of harmonic that works well (8 or 12 or 16 or ...).

I had also succeeded in splitting the CPU miner out into a dll/so while leaving a completely unoptimized (i.e. no special tricks, to be sure it's accurate) single-threaded non-AES version in core to handle result validation, which doesn't need the special tricks or speed, only accuracy; I figure the AES part would get in the way of the actual miner loops using it, right? This was so that when 20 variants of N are included the exe doesn't get insanely huge: keeping the CPU methods in a library keeps the main exe size normal, and memory commit is lower because the kernel for CPU*N can be loaded dynamically at runtime.

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

This diff from my fork contains the N=10 that I have been running for months; it provides +30 H/s (~90 improves to ~120) on several lower-TDP, cache-starved CPUs. That's a 33% gain, but I honestly don't know why it gains.

Note that this also removes the odd/even thing, while still improving on some core families and having no apparent effect, plus or minus, on the smaller-N versions or on other core families.

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

Also fun: with this new refactor you can try insane things like N=128 and/or far too many threads, and while the code falls back to N=1, the stats get confused and will show KH/s from some CPU cores (not real, of course). I think this happens because no CPU time is left to tend to the stats.

model name      : Intel(R) Celeron(R) CPU 1037U @ 1.80GHz
cache size      : 2048 KB
----- yes it is underpowered and no-AES from the start ---
"cpu_threads_conf" :
[
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
],
----- results: -----
[2018-09-14 16:21:06] : Benchmark Thread 0 cpu: 90.9 H/S
[2018-09-14 16:21:06] : Benchmark Thread 1 cpu: 73.5 H/S
[2018-09-14 16:21:06] : Benchmark Thread 2 cpu: 69.4 H/S
[2018-09-14 16:21:06] : Benchmark Thread 3 cpu: 64.1 H/S
[2018-09-14 16:21:06] : Benchmark Thread 4 cpu: 58.8 H/S
[2018-09-14 16:21:06] : Benchmark Thread 5 cpu: 54.9 H/S
[2018-09-14 16:21:06] : Benchmark Thread 6 cpu: 5000.0 H/S
[2018-09-14 16:21:06] : Benchmark Thread 7 cpu: 312.5 H/S
[2018-09-14 16:21:06] : Benchmark Total: 5724.3 H/S

iMultiway should simply be clamped to MAX_N if it is larger, before any of the rest of the logic, and set to 1 if it is lower than 1; then the default in the switch case would become unreachable (that default being the cause of this bug).

I'm not sure how flooding with threads makes the stats collection angry, though (race condition? needs spinlocks? don't use the hash-count divisor as trigger?). The same thing may also be part of the GPU stats going super wacky in un-smooth cases; overthreading those, or using super-high intensity, does much the same.

@Spudz76 Spudz76 mentioned this pull request Sep 16, 2018