unify cpu cryptonight implementations #1821
Conversation
Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (4MiB cache), two hashes per thread.
I removed the digits after the comma because at least ±1 hash is within measurement error.
force-pushed from bf5db85 to 0c0e7fa
I tested the PR on Windows, with the NVIDIA and OpenCL miners. All works fine. The hash rate on different systems is better than or equal to the old implementation.
I tested the macros with clang, icc, msvc and gcc in different versions.
xmr-stak has several implementations for multi hash per thread. This results in 3 independent implementations. Each time the algorithm must be changed, the possibility of introducing errors is very large.
- unify the different cryptonight CPU implementations
- simplify the function selection array to find the specialized cryptonight implementation
- add an intermediate pointer to access the large state (similar to the old multi hash implementation)
As a side effect, this change increases the speed of the single and multi hash algorithms.
force-pushed from 0c0e7fa to 931bd5f
I am excited about this but haven't had a chance to read through the new code. Does it also allow any number of low-power-mode threads now, or is it still just 0-5? Some of my CPUs gained a good percentage from running 10 or 20 (with prefetch on) using the mbasaman hacks to the previous macros. I will test this on all my CPUs and compilers too. So far the fastest for me are clang 3.8 or gcc 6 in most cases.
@Spudz76 It is now much simpler and needs less code to generate 10- or 20-hash versions. As a side effect, the possibility of making mistakes during copy-paste is very low.
I've tested on my CPUs that run N>2 scratchpads, and the performance was pretty unusual in that the hashrate did not change linearly. With CNL N=1 I get 90 on the old and 90 on the new implementation. I believe the code improves the N=2 implementation, as it was suboptimal, but for N>2 the older hardcoded method works better. The old code contained an EVEN ROUND and ODD ROUND loop that was removed in the new code; could that be the reason for such a significant slowdown? I'm not a C++ programmer, but would it be possible for you to see if the macro system can be implemented in a way that is more similar to the old system? I believe some testing on a machine with >8MB L3 and N=4 should make this pretty clear. -edit- I actually grafted the old CN_STEP1-4 defined functions back to the
Could you please link your code changes, because N=8 is not available.
#1604 is where the N=6,8,10,20 came from. I never quite got it all broken down into macros, but my vision was to generate even the N selections from macro looping (so you can provide a list of N values to build, like the CUDA platform list — default being 1,2,3,4,5, but very simple to add whatever ones you think your CPU likes). This patch possibly gets closer to that. I had been using N=10 on several CPUs for gains; I think N=(cache ways) is some sort of harmonic that works well (8 or 12 or 16 or...). I had also succeeded in splitting the CPU miner out into a dll/so while leaving a completely unoptimized (aka no special tricks, to be sure it's accurate) single-thread non-AES version in core to handle the result validations (which don't need the special tricks nor speed, just accuracy — and I figure the AES part gets in the way of the actual miner loops using it, right?). This was because with 20 variants of N included the exe gets insanely huge, while keeping the CPU methods in a library keeps the main exe size down to something normal (and memory commit is lower, being able to dynamically load the kernel for CPU*N at runtime).
This diff from my fork contains the N=10 which I have been running for months, provides +30H/s (~90 improves to ~120) on several lower TDP and cache starved CPUs. That's 33% gain, but I honestly don't know why it gains. Note, this also removes the odd/even thing, while still improving on some core-family while having no apparent effect plus or minus on the smaller N versions or on other core-family. |
Also fun: with this new refactor you can try insane things like N=128 and/or far too many threads, and while the code falls back to N=1, the stats get confused and will show KH/s from some CPU cores (not real, of course). I think this is from no CPU thread power being left to tend to the stats. Not sure how flooding with threads makes the stats collection angry, though (race condition / needs spinlocks / don't use the hashcount divisor as trigger?). The same thing may also be part of the GPU stats going super wacky in un-smooth cases (overthreading those, or using super high intensity, does much the same).
xmr-stak has several implementations for multi hash per thread.
This results in 3 independent implementations.
Each time the algorithm must be changed, the possibility of introducing errors is very large.
Note: I tried to implement it all as recursive unrolling with templates because I am no fan of macros. The result was slightly slower code. Therefore I decided to use the ugly macros to implement it.
On my laptop (Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 4MiB cache) the new code is slightly faster than the old implementation for cryptonight_lite with 4 hashes per thread. Nevertheless I will test a few more CPUs.
The reason why I have not reused the old macros is that the variables were defined outside of the macros. This is again a point where errors can be introduced very quickly, e.g. during the implementation of cryptonight_v8. After this PR we always get correct results if the single hash implementation is correct.
This PR also fixes #1780.