support to CPU mine at more than 75% for dedicated machines #1730

Open
ToXIc69 opened this issue Jul 19, 2018 · 13 comments

Comments

@ToXIc69

ToXIc69 commented Jul 19, 2018

Hi guys, thanks for a great miner. Is it possible to add an option to push CPU mining utilization above 75%?
I have dedicated hardware and would like to get some more H/s out of it. I've gone over all the recommended options and settings but can't get any more.

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

You did not specify the CPU model or its specs, but the maximum number of threads you can run is the CPU cache size divided by the algorithm's block size. Normal cryptonight uses 2MB. So if you don't have "cores * 2MB" of cache, you will never be able to mine with all the cores. For example, an 8-core CPU with 12MB of cache can run 6 threads; beyond that you run out of cache, and any more threads will slow the others down. You would need a 16MB cache version to get closer to 100% utilization.

Cryptonight-heavy uses a 4MB block size, so with coins based on it you will run out of cache twice as fast (half the cores). An 8-core/12MB CPU would only be able to run 3 threads before falling on its face.
You will get the best utilization with Cryptonight-lite based coins, which use a 1MB block size. An 8-core/12MB CPU could then use all the cores with some cache left over.

Otherwise, get a CPU whose cache is better balanced for the block size of whatever coin you target.
All that matters is cache size. Running hashes on any memory other than cache is too slow (the CPU constantly waits for memory access), plus the AES extensions only operate on data within cache and do nothing for data sitting in farther memory (so using system memory it would max out at whatever non-AES speeds your cores would do, probably around 4H/s each?).
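
As a rough sketch of the arithmetic above: the thread limit is the cache size divided by the block size, capped at the core count. The helper name and the dictionary are made up for illustration and are not part of xmr-stak; the block sizes are the ones quoted in this comment.

```python
# Block sizes in MB per algorithm family, as quoted in this thread.
BLOCK_SIZE_MB = {
    "cryptonight": 2,
    "cryptonight-heavy": 4,
    "cryptonight-lite": 1,
}


def max_threads(cache_mb: int, cores: int, algo: str = "cryptonight") -> int:
    """Hypothetical helper: threads that fit in cache, capped at the core count."""
    return min(cores, cache_mb // BLOCK_SIZE_MB[algo])


if __name__ == "__main__":
    # The 8-core / 12MB example from the comment above.
    for algo in BLOCK_SIZE_MB:
        print(algo, max_threads(cache_mb=12, cores=8, algo=algo))
```

For the 8-core/12MB example this gives 6 threads for cryptonight, 3 for cryptonight-heavy, and 8 for cryptonight-lite.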

What CPU model and version, and what does your cpu.txt look like?

@ToXIc69
Author

ToXIc69 commented Jul 19, 2018

Sorry about that. I've gone over this topic with some miners over on reddit and couldn't get any more, so I was wondering if it could be done with some software changes.

OS | Server 2012R2

CPU | 2x E5-2630 v3 @ 2.4GHz (https://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz)

RAM 20GB

CPU | 2x E5-2650 @ 2.00GHz (https://ark.intel.com/products/64590/Intel-Xeon-Processor-E5-2650-20M-Cache-2_00-GHz-8_00-GTs-Intel-QPI)

RAM 64GB

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Those should do fine then, on 2MB CN algos, as far as cache.

Haswell cores do tend to be slow with this code. I have a Haswell i3 with 3MB of cache, and it gets quite a boost from the patches applied to my dev-superthread branch. If you compile from that branch, use low_power_mode: 10 and no_prefetch: false in your cpu.txt, with whatever the correct physical-core affinities are. Essentially, turning off prefetching reduces bad predictions, and then by stacking lots of work (10x normal) the CPU's internal optimizations kick in and smooth out cache usage.

"SmartCache" doesn't get smart on Haswell unless you let it see more of what work is coming (deeper pipeline). So then I think it internally reallocates cache too often, forcing actual mining to wait.

They may benefit from going even deeper than 10; however, that is still being tested in #1604.

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
This CPU that I run is very close to yours, but it is a Sandy Bridge (not Haswell) with 6 cores. The Sandy/Ivy Bridge families don't seem to enjoy the 10-way patch like Haswell does.

On it, I get 6 threads on defaults (=12MB cache), and then I add one thread that "roams" with affinity:false, which seems to scrape up the last bit of cache and hashrate, even though it's running on an HT core. That leaves 1MB of cache left over. It ends up running at 60% load. It's Linux, but you can see the 7 threads I have doing XMR are pegging 100% of those cores:

  1  [||||||||||||||||||||||||100.0%]    4  [||||||||||||||||||||||||100.0%]   7  [||||||||||||||||||||||||100.0%]    10 [|                         1.3%]
  2  [||||||||||||||||||||||||100.0%]    5  [||||||||||||||||||||||||100.0%]   8  [||                        4.5%]    11 [|                         1.3%]
  3  [||||||||||||||||||||||||100.0%]    6  [||||||||||||||||||||||||100.0%]   9  [||                        2.6%]    12 [|                         1.3%]
  Mem[|||||||||||||||||||||||||||||||||||||||||||                835M/3.82G]   Tasks: 48, 48 thr; 8 running
  Swp[                                                             0K/3.74G]   Load average: 7.18 7.20 7.13
                                                                               Uptime: 6 days, 17:40:15

If I shut off HT and just ran 6 threads then it would be at 100% utilized but the rest of the system would be choking (HT cores basically run the rest of the system, interrupt handling, etc). Also this rig is pushing work to six GPUs with ethminer -- xmr-stak is only CPU mining since the CPU was doing nothing otherwise.
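
The roughly 60% figure is about what you would expect from simply counting threads. A minimal sketch, assuming each mining thread pegs one logical CPU and everything else sits near idle (the helper name is made up for illustration):

```python
def expected_load_percent(busy_threads: int, logical_cpus: int) -> float:
    """Overall load if each mining thread saturates exactly one logical CPU."""
    return 100.0 * busy_threads / logical_cpus


if __name__ == "__main__":
    # 6 cores with HT on = 12 logical CPUs, 7 mining threads.
    print(round(expected_load_percent(7, 12), 1))
    # HT off = 6 logical CPUs, 6 mining threads.
    print(round(expected_load_percent(6, 6), 1))
```

With HT on, 7 of 12 logical CPUs comes to about 58%; with HT off, 6 of 6 is 100%, even though the hashing work is nearly the same.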

HASHRATE REPORT - CPU
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |   35.8 |   35.8 |   35.8 |  1 |   39.2 |   39.2 |   39.2 |
|  2 |   38.4 |   38.4 |   38.4 |  3 |   38.3 |   38.3 |   38.3 |
|  4 |   39.3 |   39.3 |   39.3 |  5 |   40.7 |   40.7 |   40.7 |
|  6 |   35.3 |   35.3 |   35.3 |
Totals (CPU):   267.1  267.1  267.1 H/s
-----------------------------------------------------------------
Totals (ALL):    267.1  267.1  267.1 H/s
Highest:   274.0 H/s
-----------------------------------------------------------------

@ToXIc69
Author

ToXIc69 commented Jul 19, 2018

how long do you think i should wait after each change to see the effects?

@ToXIc69
Author

ToXIc69 commented Jul 19, 2018

Here's one of the cpu.txt files, for example:

E5-2630 v3 @ 2.4GHz

HASHRATE REPORT - CPU
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 35.7 | 42.6 | 44.2 | 1 | 36.1 | 42.2 | 43.6 |
| 2 | 41.8 | 48.2 | 49.8 | 3 | 41.4 | 48.2 | 49.8 |
| 4 | 41.7 | 48.2 | 49.7 | 5 | 41.9 | 48.3 | 49.8 |
| 6 | 40.6 | 48.0 | 49.8 | 7 | 41.0 | 48.1 | 49.8 |
| 8 | 37.0 | 43.1 | 44.4 | 9 | 36.1 | 42.1 | 43.6 |
| 10 | 37.5 | 42.4 | 47.1 | 11 | 37.4 | 47.4 | 49.1 |
| 12 | 37.0 | 43.6 | 47.8 | 13 | 36.9 | 47.3 | 49.3 |
| 14 | 36.8 | 47.4 | 48.4 | 15 | 37.0 | 42.3 | 45.0 |
| 16 | 37.1 | 43.4 | 46.5 | 17 | 37.1 | 47.3 | 47.8 |
| 18 | 36.8 | 47.1 | 47.1 | 19 | 37.5 | 47.4 | 46.7 |
Totals (CPU): 764.6 914.6 949.6 H/s

Totals (ALL): 764.6 914.6 949.6 H/s
Highest: 960.1 H/s

/*
 * Thread configuration for each thread. Make sure it matches the number above.
 * low_power_mode - This can either be a boolean (true or false), or a number between 1 to 5. When set to true,
 *                  this mode will double the cache usage, and double the single thread performance. It will
 *                  consume much less power (as less cores are working), but will max out at around 80-85% of
 *                  the maximum performance. When set to a number N greater than 1, this mode will increase the
 *                  cache usage and single thread performance by N times.
 *
 * no_prefetch - Some systems can gain up to extra 5% here, but sometimes it will have no difference or make
 *               things slower.
 *
 * affine_to_cpu - This can be either false (no affinity), or the CPU core number. Note that on hyperthreading
 *                 systems it is better to assign threads to physical cores. On Windows this usually means selecting
 *                 even or odd numbered cpu numbers. For Linux it will be usually the lower CPU numbers, so for a 4
 *                 physical core CPU you should select cpu numbers 0-3.
 *
 * On the first run the miner will look at your system and suggest a basic configuration that will work,
 * you can try to tweak it from there to get the best performance.
 *
 * A filled out configuration should look like this:
 * "cpu_threads_conf" :
 * [
 *      { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
 *      { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
 * ],
 * If you do not wish to mine with your CPU(s) then use:
 * "cpu_threads_conf" :
 * null,
 */

"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 64 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 66 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 68 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 70 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 72 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 74 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 76 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 78 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 65 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 67 },

],

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Around 50H/s per core on a Sandy Bridge is about all I get as well; the 10-way patch does not seem to help on that family of cores. This is on my system most similar to yours, with dual Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz and 20MB cache. I am not running full thread loading, as this server also runs as a hypervisor with several VMs (proxbox), but the per-thread speeds should match (same core family).

HASHRATE REPORT - CPU
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |   51.4 |   51.8 |   51.8 |  1 |   51.7 |   51.9 |   51.9 |
|  2 |   51.1 |   51.4 |   51.4 |  3 |   45.8 |   46.2 |   46.1 |
|  4 |   46.0 |   46.5 |   46.4 |  5 |   46.3 |   46.8 |   46.7 |
|  6 |   48.2 |   48.1 |   48.2 |  7 |   47.9 |   48.1 |   48.0 |
|  8 |   47.8 |   47.9 |   47.9 |  9 |   47.8 |   48.2 |   48.1 |
| 10 |   48.6 |   48.7 |   48.6 | 11 |   48.9 |   49.0 |   48.9 |
Totals (CPU):   581.6  584.7  583.8 H/s
-----------------------------------------------------------------
Totals (ALL):    581.6  584.7  583.8 H/s
Highest:   599.5 H/s
-----------------------------------------------------------------

Your affinities are weird, but that could just be how the mobo is mapping things (still weird), and Windows doesn't actually do affinity of 64 or higher (it should be shouting warnings on startup). I would expect affinities to be sequential with no huge hole between 15 and 64 (so 0-15 are CPU0 and 16-31 would be CPU1, and you would use the even-numbered ones). So CPU0 seems normal with 0,2,4,6,8,10,12,14, but then I would expect 16,18,20,22,24,26,28,30 for the second set. Where did 1 and 3 come from? Weird mobo, or perhaps hwloc is misunderstanding.
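
For reference, the sequential layout described above can be generated like this. It is only a sketch: it assumes the Windows-style numbering where each physical core sits on an even logical CPU number, and the helper names are made up for illustration. The dictionary keys and default values are the ones from the cpu.txt posted earlier in the thread.

```python
import json


def physical_core_ids(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> list:
    """Assumed layout: logical CPUs are numbered sequentially across sockets and
    the first hardware thread of each core lands on a multiple of threads_per_core."""
    return [core * threads_per_core for core in range(sockets * cores_per_socket)]


def cpu_threads_conf(affinities) -> list:
    """Build one cpu.txt-style entry per affinity, using the values from the posted config."""
    return [
        {"low_power_mode": False, "no_prefetch": True, "affine_to_cpu": cpu}
        for cpu in affinities
    ]


if __name__ == "__main__":
    ids = physical_core_ids(sockets=2, cores_per_socket=8)
    print(ids)  # 0, 2, ..., 14 for the first socket, then 16, 18, ..., 30
    for entry in cpu_threads_conf(ids):
        print(json.dumps(entry) + ",")
```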

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Maybe it's a quad socket board? Running as dual with CPU0 and 2 loaded (which could leave a logical hole for CPU1 socket that is not populated) and then the numbering makes a little more sense.

Or they used a quad capable chipset for a dual board for some reason and mapped it as above.

Linux always makes its IDs sequential regardless of the "street address" on the mobo. Hmm. I thought Windows did also, just with the HT cores interleaved on the odds instead of appended.

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Maybe it's just an oddball way Server 2012 does things versus how it would be on desktop windows...
But that somewhat sucks because it still doesn't support affinity lock on > 64 yet it forces you to ID the second CPU up there. Hmmmmmmm.

@ToXIc69
Author

ToXIc69 commented Jul 19, 2018

probably correct on the quad socket i havent physically looked at it.. just dusted it off at work and put it to work lol

i just hate seeing the CPU around 75% util i just wanna crank it up.

@baldpope

@Spudz76 based on your comment above, with the E5-2640 v4, at 25MB cache, I should easily be able to use all 10 cores at 2MB each. Just sharing data as I follow various threads.

Am I correct in understanding your comment?

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

Microsoft says there may be holes reserved in the numbering, depending on whether Windows thinks CPUs might show up via hotswap, so that must be the deal with your motherboard.

I do not know if the current code uses any of the tricks for managing which "group" a thread belongs to and for setting affinity at 64 and above, although I do know there is a check and a warning message. Apparently it is instead supposed to set the group affinity using an alternate function call, and then set the affinity mask within that group to reach the processors in the second group (64+).

The whole 64 limit exists because the affinity mask is a 64-bit bitmask, so only 64 affinities can be marked. Groups must be their way around that, so you use group 1 and get a second 64-bit mask to set. Not sure if this was previously known to this project or accounted for. But that page also says it affects Win7 as well.
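
To make the group idea concrete, here is a sketch of the arithmetic only. It assumes every group holds exactly 64 logical processors and that a flat CPU number maps onto groups in order; real group sizes are decided by Windows and can be smaller, so treat the mapping as an assumption. On the Win32 side the relevant call is `SetThreadGroupAffinity`, which takes a `GROUP_AFFINITY` structure holding a group number and a mask. The Python helper below is hypothetical and does not call into Windows.

```python
GROUP_SIZE = 64  # assumption: full 64-processor groups


def group_and_mask(cpu_id: int, group_size: int = GROUP_SIZE):
    """Split a flat logical CPU number into (group number, single-bit mask within the group)."""
    group, bit = divmod(cpu_id, group_size)
    return group, 1 << bit


if __name__ == "__main__":
    # A few of the affinities from the posted cpu.txt.
    for cpu in (14, 64, 66):
        group, mask = group_and_mask(cpu)
        print(cpu, group, hex(mask))
```

Under that assumption, affinity 14 stays in group 0 with mask 0x4000, while 64 and 66 fall in group 1 with masks 0x1 and 0x4.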

@Spudz76
Contributor

Spudz76 commented Jul 19, 2018

@baldpope Yes, although you may be able to steal some extra hashrate by adding two more "illegal" HT threads (use 24MB of cache and more threads than there are physical cores). I generally set the stowaways to affinity:false though; it seems to be happier about that.

Cache is more important than actual physical cores. But with an odd cache size you have to leave 1MB on the table and use 24MB total (12 threads: 10 cores, 2 HT).

Similar to how I threw a seventh thread on my 6-core 15MB cache and it gave me a total hashrate higher than not doing so.

Something about HT and SmartCache makes the "fake cores" somehow work nicely even though it should hurt the physical side of the same virtual core... who knows?
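
The "stowaway" budgeting above can be sketched as follows. This is an illustration under the assumptions in this thread (2MB per thread, one pinned thread per physical core, any remaining cache given to unpinned threads); the function name is hypothetical and whether the extra threads actually help has to be measured on your own hardware.

```python
def plan_threads(cores: int, logical_cpus: int, cache_mb: int, block_mb: int = 2):
    """Return (pinned_threads, roaming_threads, leftover_cache_mb)."""
    # Total threads the cache can hold, never more than the logical CPU count.
    budget = min(cache_mb // block_mb, logical_cpus)
    # One pinned thread per physical core first.
    pinned = min(cores, budget)
    # Whatever cache budget remains goes to unpinned "roaming" threads.
    roaming = budget - pinned
    leftover_mb = cache_mb - budget * block_mb
    return pinned, roaming, leftover_mb


if __name__ == "__main__":
    print(plan_threads(cores=6, logical_cpus=12, cache_mb=15))   # 6-core, 15MB case
    print(plan_threads(cores=10, logical_cpus=20, cache_mb=25))  # 10-core, 25MB case
```

For the 6-core/15MB CPU this yields 6 pinned threads, 1 roaming thread and 1MB left over; for the 10-core/25MB CPU it yields 10 pinned, 2 roaming and 1MB left over, matching the numbers in the comments.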

3 participants