
unify cpu cryptonight implementations #1821

Conversation

Collaborator

@psychocrypt psychocrypt commented Sep 10, 2018

xmr-stak has several implementations for multi-hash per thread.
This results in three independent implementations.
Each time the algorithm must be changed, the risk of introducing errors is very high.

  • unify the different cryptonight CPU implementations
  • simplify the function selection array that finds the specialized cryptonight implementation

Note: I tried to implement everything as recursive unrolling with templates because I am no fan of macros, but the result was slightly slower code. Therefore I decided to use the ugly macros to implement it.
On my laptop (Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 4 MiB cache) the new code is slightly faster than the old implementation for cryptonight_lite with 4 hashes per thread.
Nevertheless I will test a few more CPUs.

The reason I have not reused the old macros is that their variables were defined outside of the macros. This is another point where errors can be introduced very quickly, e.g. during the implementation of cryptonight_v8. After this PR we always get correct results as long as the single-hash implementation is correct.

This PR also solves #1780.

  • Test on different CPUs
  • windows tests
  • test NVIDIA and AMD (to avoid that the CPU check of the GPU nonce is broken)

@psychocrypt
Collaborator Author

Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (4 MiB cache), two hashes per thread:

  • cryptonight
    • dev: 117 H/S
    • PR: 119 H/S
  • cryptonight_v7
    • dev: 117 H/S
    • PR: 117 H/S

I removed the decimal places because the measurement error is at least ±1 H/s.

@psychocrypt psychocrypt force-pushed the topic-simplefyMultihashImplementation_defines branch 7 times, most recently from bf5db85 to 0c0e7fa Compare September 13, 2018 11:33
@psychocrypt
Collaborator Author

I tested the PR on Windows with the NVIDIA and OpenCL miners. Everything works fine. The hash rate on different systems is better than or equal to the old implementation.

@psychocrypt
Collaborator Author

I tested the macros with clang, icc, msvc and gcc in different versions.

xmr-stak has several implementations for multi-hash per thread.
This results in three independent implementations.
Each time the algorithm must be changed, the risk of introducing errors is very high.

- unify the different cryptonight CPU implementations
- simplify the function selection array that finds the specialized cryptonight implementation
- add an intermediate pointer to access the large state (similar to the old multi-hash implementation)

As a side effect this change increases the speed of the single- and multi-hash algorithms.
@psychocrypt psychocrypt force-pushed the topic-simplefyMultihashImplementation_defines branch from 0c0e7fa to 931bd5f Compare September 13, 2018 22:07
@fireice-uk fireice-uk merged commit b5d3890 into fireice-uk:dev Sep 14, 2018
@Spudz76
Contributor

Spudz76 commented Sep 14, 2018

I am excited about this but haven't had a chance to read through the new code yet. Does it also allow for any number of low-power-mode threads now, or is it still just 0-5?

Some CPUs of mine gained a good percentage from running 10 or 20 (with prefetch on) using the mbasaman hacks to the previous macros.

I will test this on all my CPUs and compilers too. So far the fastest for me are clang 3.8 or gcc 6 in most cases.

@psychocrypt psychocrypt deleted the topic-simplefyMultihashImplementation_defines branch September 14, 2018 20:53
@psychocrypt
Collaborator Author

@Spudz76 It is now much simpler and needs less code to generate 10- or 20-hash versions. As a side effect, the chance of making copy-paste mistakes is very low.

@plavirudar

plavirudar commented Sep 15, 2018

I've tested on my CPUs that run N>2 scratchpads, and the performance was pretty unusual, in that the hashrate did not change linearly.

CNL N=1 I get 90 on old and 90 on new implementation.
CNL N=2 I get 150 on old and 155 on new implementation.
CNL N=4 I get 250 on old and 245 on new implementation.
CNL N=8 I get 200 on old and 170 on new implementation.

I believe the code improves the N=2 implementation, which was suboptimal, but for N>2 the older hardcoded method works better. The old code contained an EVEN ROUND and ODD ROUND loop that was removed in the new code; could that be the reason for such a significant slowdown?

I'm not a C++ programmer, but would it be possible for you to see whether the macro system can be implemented in a way that's more similar to the old one? I believe some testing on a machine with >8 MB L3 and N=4 should make this pretty clear.

-edit-

I actually grafted the old CN_STEP1-4 defined functions back into the struct Cryptonight_hash<8> function and got back 200 H/s on the N=8 thread, so it's almost certain that the breaking change is in cryptonight_aesni.h.

@psychocrypt
Collaborator Author

psychocrypt commented Sep 15, 2018 via email

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

#1604 is where the N=6,8,10,20 came from. I never quite got it all broken down into macros, but my vision was to generate even the N selections from macro looping (so you can provide a list of N values to build, like the CUDA platform list; the default being 1,2,3,4,5, but it would be very simple to add whatever ones you think your CPU likes). This patch gets us closer to that.

I had been using N=10 on several CPUs for gains. I think N = (cache ways) is some sort of harmonic that works well (8 or 12 or 16 or ...).

I had also succeeded in splitting the CPU miner out into a dll/so while leaving a completely unoptimized (i.e. no special tricks, to be sure it's accurate) single-threaded non-AES version in core to handle result validation, which doesn't need the special tricks or speed, only accuracy; I figure the AES part would get in the way of the actual miner loops using it, right? This was so that when 20 variants of N are included the exe doesn't get insanely huge: keeping the CPU methods in a library keeps the main exe size normal, and memory commit is lower because the kernel for CPU*N can be loaded dynamically at runtime.

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

This diff from my fork contains the N=10 that I have been running for months; it provides +30 H/s (~90 improves to ~120) on several lower-TDP, cache-starved CPUs. That's a 33% gain, but I honestly don't know why it gains.

Note that this also removes the odd/even thing, while still improving on some core families and having no apparent effect, plus or minus, on the smaller-N versions or on other core families.

@Spudz76
Contributor

Spudz76 commented Sep 15, 2018

Also fun: with this new refactor you can try insane things like N=128 and/or far too many threads, and while the code falls back to N=1, the stats get confused and will show KH/s from some CPU cores (not real, of course). I think this happens because no CPU time is left to tend to the stats.

model name      : Intel(R) Celeron(R) CPU 1037U @ 1.80GHz
cache size      : 2048 KB
----- yes it is underpowered and no-AES from the start ---
"cpu_threads_conf" :
[
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : 5, "no_prefetch" : true, "affine_to_cpu" : 1 },
],
----- results: -----
[2018-09-14 16:21:06] : Benchmark Thread 0 cpu: 90.9 H/S
[2018-09-14 16:21:06] : Benchmark Thread 1 cpu: 73.5 H/S
[2018-09-14 16:21:06] : Benchmark Thread 2 cpu: 69.4 H/S
[2018-09-14 16:21:06] : Benchmark Thread 3 cpu: 64.1 H/S
[2018-09-14 16:21:06] : Benchmark Thread 4 cpu: 58.8 H/S
[2018-09-14 16:21:06] : Benchmark Thread 5 cpu: 54.9 H/S
[2018-09-14 16:21:06] : Benchmark Thread 6 cpu: 5000.0 H/S
[2018-09-14 16:21:06] : Benchmark Thread 7 cpu: 312.5 H/S
[2018-09-14 16:21:06] : Benchmark Total: 5724.3 H/S

iMultiway should simply be clamped to MAX_N if it is larger, before any of the rest of the logic, and set to 1 if it is lower than 1; then the default in the switch case would become unreachable (that default being the cause of this bug).

I'm not sure how flooding with threads makes the stats collection angry, though (race condition? needs spinlocks? don't use the hash-count divisor as trigger?). The same thing may also be part of the GPU stats going super wacky in un-smooth cases; overthreading those, or using super-high intensity, does much the same.

@Spudz76 Spudz76 mentioned this pull request Sep 16, 2018