Likwid Pin
likwid-pin only works with dynamically linked applications whose threading implementation creates threads through the pthread_create API call. Moreover, it only makes sense with a static placement of the threads, meaning that every thread runs on a dedicated processor. Since version 3.1 it is possible to oversubscribe processors by creating more threads than there are processors. likwid-pin then distributes the threads round robin over the processors you specify in your thread list.
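The round-robin distribution can be pictured with a short Python sketch. It only models the behavior described above and is not likwid-pin's code; the function name `distribute_round_robin` is made up for illustration.

```python
def distribute_round_robin(cpus, num_threads):
    """Map thread index -> CPU, wrapping around when threads outnumber CPUs."""
    if not cpus:
        raise ValueError("processor list must not be empty")
    return {tid: cpus[tid % len(cpus)] for tid in range(num_threads)}


# Seven threads on three processors: the list is reused after the third thread.
print(distribute_round_robin([0, 2, 4], 7))
# {0: 0, 1: 2, 2: 4, 3: 0, 4: 2, 5: 4, 6: 0}
```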
For threaded applications on modern multi-core platforms it is crucial to pin threads to dedicated cores. While the Linux kernel offers an API to pin your threads, implementing a flexible solution for thread affinity with it is tedious and involves some coding. Intel includes a sophisticated pinning mechanism in its OpenMP implementation. It already works quite well out of the box and can be further controlled with environment variables.
Still, there are occasions where a simple, platform- and compiler-independent solution is required. Because all common OpenMP implementations rely on the pthread API, likwid-pin can preload a wrapper library around the pthread_create call. In this wrapper, the threads are pinned using the Linux OS API. likwid-pin can also be used to pin serial applications as a replacement for taskset. This idea is inspired by a tool available at http://www.mulder.franken.de/workstuff/ .
likwid-pin explicitly supports pthread and the OpenMP implementations of Intel and GNU gcc. Other OpenMP implementations are supported through a skip mask. This mask specifies which threads are skipped during pinning because they are shepherd threads that do no actual work.
likwid-pin offers three different syntax flavors to specify how to pin threads to processors:
- Using a thread list
- Using an expression-based thread list
- Using a scatter policy
Processors are usually numbered by the Linux kernel; we refer to this ordering as physical numbering. LIKWID introduces thread groups throughout all tools to enable logical pinning. A thread group is a set of processors sharing a topological entity on a node or chip, such as a socket, a ccNUMA domain or a shared cache. likwid-pin supports six different ways of numbering the cores when using the thread group syntax:
- physical numbering: processors are numbered according to the numbering in the OS
- logical numbering in node: processors are logically numbered over the whole node (N prefix)
- logical numbering in socket: processors are logically numbered within every socket (S# prefix, e.g., S0)
- logical numbering in cache group: processors are logically numbered within the last level cache group (C# prefix, e.g., C1)
- logical numbering in memory domain: processors are logically numbered within the NUMA domain (M# prefix, e.g., M2)
- logical numbering within cpuset: processors are logically numbered inside the Linux cpuset (L prefix)
For all numberings except physical numbering and numbering within a cpuset (the first and last entries above), physical cores come first. If you have two sockets with 4 cores each and every core has 2 SMT threads, -c N:0-7 gives you all physical cores. To also use the SMT threads, use N:0-15.
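The "physical cores first" ordering can be illustrated with a small Python sketch for the machine just described (two sockets, 4 cores each, 2 SMT threads per core). The OS numbering used here, where the SMT siblings of core c are 2c and 2c+1, is a made-up assumption to make the effect visible; real machines may number their processors differently, and `physical_cores_first` is a hypothetical helper, not part of LIKWID.

```python
def physical_cores_first(cores):
    """Order hardware threads so the first thread of every core precedes any sibling.

    `cores` is a list of tuples holding the OS IDs of the SMT threads of one core.
    """
    smt = max(len(core) for core in cores)
    return [core[k] for k in range(smt) for core in cores if k < len(core)]


# Assumed OS numbering: core c owns the hardware threads 2c and 2c+1.
cores = [(2 * c, 2 * c + 1) for c in range(8)]
node = physical_cores_first(cores)

print(node[0:8])  # N:0-7  -> [0, 2, 4, 6, 8, 10, 12, 14], one thread per physical core
print(node)       # N:0-15 -> the SMT siblings 1, 3, ..., 15 follow afterwards
```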
Since version 3.1 LIKWID also supports an alternative expression-based syntax. An expression-based thread list uses compact ordering: processors are listed consecutively with regard to SMT threads, so the SMT threads of one physical core follow each other.
likwid-pin can also set the NUMA memory policy to interleave. Because likwid-pin can figure out all memory domains involved in your run, it automatically configures interleaving for all NUMA nodes used.
likwid-pin sets the environment variable OMP_NUM_THREADS for you if it is not already present in your environment. It is set to the number of threads present in your pin expression. Moreover, the environment variable CILK_WORKERS is set to the number of threads present in the pin expression. likwid-pin always sets KMP_AFFINITY to disabled to avoid interference with other pinning mechanisms. In LIKWID 4.2.1, OMP_PLACES, GOMP_CPU_AFFINITY and OMP_PROC_BIND are also unset if they were set before.
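The environment handling described above can be summarized in a few lines of Python. This is a sketch of the documented rules applied to a plain dictionary, not the code likwid-pin runs; the function name `prepare_environment` is hypothetical.

```python
def prepare_environment(env, num_threads):
    """Return a copy of `env` adjusted the way the text above describes."""
    new_env = dict(env)
    # OMP_NUM_THREADS is only set when the user has not set it already.
    new_env.setdefault("OMP_NUM_THREADS", str(num_threads))
    # CILK_WORKERS follows the number of threads in the pin expression.
    new_env["CILK_WORKERS"] = str(num_threads)
    # Intel's own affinity mechanism is switched off to avoid interference.
    new_env["KMP_AFFINITY"] = "disabled"
    # Other pinning-related variables are removed if present.
    for name in ("OMP_PLACES", "GOMP_CPU_AFFINITY", "OMP_PROC_BIND"):
        new_env.pop(name, None)
    return new_env


env = prepare_environment({"OMP_PROC_BIND": "close", "PATH": "/usr/bin"}, 4)
print(env)
# {'PATH': '/usr/bin', 'OMP_NUM_THREADS': '4', 'CILK_WORKERS': '4', 'KMP_AFFINITY': 'disabled'}
```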
-h, --help              Help message
-v, --version           Version information
-V, --verbose <level>   Verbose output, 0 (only errors), 1 (info), 2 (details), 3 (developer)
-i                      Set NUMA interleave policy with all involved NUMA nodes
-m                      Set NUMA membind policy with all involved NUMA nodes
-S, --sweep             Sweep memory and LLC of involved NUMA nodes
-c <list>               Comma separated processor IDs or expression
-s, --skip <hex>        Bitmask with threads to skip
-p                      Print available domains with mapping on physical IDs.
                        If used together with the -c option, outputs the physical processor IDs.
-d <string>             Delimiter used when -p outputs the physical processor list, default is comma
-q, --quiet             Silent without output
As usual you can get a short help message with
$ likwid-pin -h
For a pthread application (in this example with 5 threads), type:
$ likwid-pin -c 0,2,4-6 ./myApp parameters
With pthreads, the process itself must be included in your processor list, because it can also be used as a worker. You can also omit the -c option; likwid-pin then automatically uses -c N:0-maxProcessors.
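The thread list 0,2,4-6 used above mixes single processor IDs and ranges. A minimal Python sketch of how such a list expands follows; `parse_cpu_list` is an illustrative helper, not a LIKWID function, and it only handles plain physical IDs.

```python
def parse_cpu_list(spec):
    """Expand a list such as '0,2,4-6' into individual processor IDs."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            cpus.extend(range(first, last + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus


print(parse_cpu_list("0,2,4-6"))  # [0, 2, 4, 5, 6]: five processors for five threads
```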
For a gcc OpenMP application this is the same. If you do not set OMP_NUM_THREADS, likwid-pin sets it to the number of threads you specified in your pinning expression.
$ likwid-pin -c 0,2,4,6 ./myApp parameters
With logical numbering this may translate to:
$ likwid-pin -c N:0-3 ./myApp parameters
or:
$ likwid-pin -c S0:0-3 ./myApp parameters
If you want the ccNUMA domains your threads run in to be cleaned up before your code starts, as likwid-memsweeper does, use the -S flag:
$ likwid-pin -S -c S0:0-3 ./myApp parameters
You can use multiple thread domains in a logical processor list, separated by @:
$ likwid-pin -c S0:0-3@S3:4-7 ./myApp parameters
To print the available thread domains, use the -p option. The output below is for a four-socket Nehalem EX machine, where socket, last level cache group and memory domain are equivalent:
$ likwid-pin -p
Domain 0:
Tag N: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Domain 1:
Tag S0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 2:
Tag S1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 3:
Tag S2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 4:
Tag S3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
Domain 5:
Tag C0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 6:
Tag C1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 7:
Tag C2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 8:
Tag C3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
Domain 9:
Tag M0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 10:
Tag M1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 11:
Tag M2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 12:
Tag M3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
$ likwid-pin -c S0:0@S3:0 -S ./stream-icc
Sweeping memory
Sweeping domain 0: Using 104849 MB of 131062 MB
Cleaning LLC with 50 MB
Sweeping domain 3: Using 104858 MB of 131072 MB
Cleaning LLC with 50 MB
Starting from version 3.1 likwid-pin also supports thread expressions, which generate an expression-based thread list with compact processor numbering. Example usage: likwid-pin -c E:N:8 ./myApp. This generates a compact thread-to-processor mapping for the node domain with eight threads. The following syntax variants are available:
- -c E:<thread domain>:<number of threads>
- -c E:<thread domain>:<number of threads>:<chunk size>:<stride>
For two SMT threads per core on a machine with 4-way SMT, use e.g. -c E:N:122:2:4.
The simplest way to use the expression based syntax is:
$ likwid-pin -c E:S0:4 ./myApp parameters
This will use 4 processors within the socket 0 thread domain. Remember that the ordering is compact. This means if the processor has 2-way SMT the first two physical cores will be used with 4 threads.
Optionally you may specify a block size and stride:
$ likwid-pin -c E:S0:8:1:2 ./myApp parameters
On a 2-way SMT system this is equivalent to -c S0:0-7: eight threads, a block size of one and a stride (from start of block to start of block) of two. This is especially handy on systems with 4-way SMT. Consider an Intel Xeon Phi on which you want to use 2 SMT threads per physical core on only 30 cores, resulting in 60 threads. This can easily be achieved with:
$ likwid-pin -c E:N:60:2:4 ./myApp parameters
Or consider an AMD Bulldozer system on which you want to use only one core per FPU:
$ likwid-pin -c E:S0:4:1:2 ./myApp parameters
You may also chain expressions using the following syntax:
$ likwid-pin -c E:S0:20:2:4@S1:4:1:2 ./myApp parameters
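How the thread count, block size and stride of an expression interact can be sketched in Python. The sketch assumes the compact ordering described above, with the S0 processor IDs taken from the Nehalem EX listing (cores 0-7 with SMT siblings 32-39). The function `expand_expression` is hypothetical, does not handle chaining with @, and raising an error when the domain runs out of processors is a choice made for this sketch rather than documented likwid-pin behavior.

```python
def expand_expression(compact, count, chunk=1, stride=1):
    """Pick `count` processors from a compact-ordered list.

    Blocks of `chunk` entries are taken, and block starts lie `stride` entries apart.
    """
    selected = []
    start = 0
    while len(selected) < count:
        block = compact[start:start + chunk]
        if not block:
            raise ValueError("domain has too few processors for this expression")
        selected.extend(block[:count - len(selected)])
        start += stride
    return selected


# Compact order for S0: the SMT sibling directly follows its core.
s0_compact = [cpu for pair in zip(range(0, 8), range(32, 40)) for cpu in pair]

print(expand_expression(s0_compact, 4))        # E:S0:4     -> [0, 32, 1, 33]
print(expand_expression(s0_compact, 8, 1, 2))  # E:S0:8:1:2 -> [0, 1, 2, 3, 4, 5, 6, 7]
```

The second call skips every SMT sibling, which is why it matches S0:0-7 on this machine.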
Another option is to use a scatter policy among a thread domain type. Example usage: likwid-pin -c M:scatter ./myApp. This generates a thread-to-processor mapping scattered among all memory domains, with physical cores first.
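One plausible reading of the scatter policy is a round-robin walk over the domains, where each domain lists its physical cores before its SMT threads. The Python sketch below uses the M0 to M3 lists from the Nehalem EX output above; the function `scatter` is an illustration of that reading and may differ in detail from what likwid-pin actually does.

```python
from itertools import zip_longest


def scatter(domains):
    """Interleave the processor lists of several domains round robin."""
    result = []
    for group in zip_longest(*domains):
        # zip_longest pads shorter domains with None, which is dropped here.
        result.extend(cpu for cpu in group if cpu is not None)
    return result


# Memory domains as printed by likwid-pin -p: physical cores first, then SMT threads.
memory_domains = [
    list(range(0, 8)) + list(range(32, 40)),    # M0
    list(range(8, 16)) + list(range(40, 48)),   # M1
    list(range(16, 24)) + list(range(48, 56)),  # M2
    list(range(24, 32)) + list(range(56, 64)),  # M3
]

order = scatter(memory_domains)
print(order[:8])  # [0, 8, 16, 24, 1, 9, 17, 25]
```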
You can also use likwid-pin to convert logical thread expressions into physical processor lists. This may be handy for other tools which do not support logical processor IDs. Optionally you can specify a custom delimiter for this list with the -d option.
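The conversion itself is a lookup of logical positions in the domain lists. The following Python sketch resolves S0:0-3@S3:4-7 against the S0 and S3 lines of the Nehalem EX listing above and joins the result with a chosen delimiter. `DOMAINS` and `resolve` are illustrative names, and the sketch covers only the plain domain syntax, not expressions or scatter.

```python
# Logical-to-physical tables copied from the likwid-pin -p output above.
DOMAINS = {
    "S0": list(range(0, 8)) + list(range(32, 40)),
    "S3": list(range(24, 32)) + list(range(56, 64)),
}


def resolve(expression, domains, delimiter=","):
    """Turn a logical list such as 'S0:0-3@S3:4-7' into physical IDs."""
    physical = []
    for part in expression.split("@"):
        tag, selection = part.split(":", 1)
        for item in selection.split(","):
            first, _, last = item.partition("-")
            last = last or first  # a single index is a range of length one
            physical.extend(domains[tag][int(first):int(last) + 1])
    return delimiter.join(str(cpu) for cpu in physical)


print(resolve("S0:0-3@S3:4-7", DOMAINS))       # 0,1,2,3,28,29,30,31
print(resolve("S0:0-3@S3:4-7", DOMAINS, " "))  # 0 1 2 3 28 29 30 31
```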
Since version 3.1 oversubscription is allowed; the thread list you provided is then reused. If such an overflow occurs, it is indicated in the output.
With version 4.2.1 the shepherd threads are detected automatically. Many OpenMP runtime versions were tested, and the only one for which likwid-pin was not able to detect them was that of the Intel C/C++ compiler 11.0/11.1. If you want to use likwid-pin with older OpenMP runtimes, you might have to skip the shepherd threads manually by setting a skip mask with the -s command line option.
# Example with Intel C/C++ compiler 11.1
$ likwid-pin -c 3,4,5,6 -s 0x0 a.out
[pthread wrapper]
[pthread wrapper] MAIN -> 3
[pthread wrapper] PIN_MASK: 0->4 1->5 2->6
[pthread wrapper] SKIP MASK: 0x0
threadid 140177980745472 -> core 4 - OK
threadid 140177980479232 -> core 5 - OK
threadid 140177976280832 -> core 6 - OK
Roundrobin placement triggered
threadid 140177972082432 -> core 3 - OK
likwid-pin(35974)---a.out(35978)-+-pstree(35995)
|-{a.out}(35982)
|-{a.out}(35986)
|-{a.out}(35990)
`-{a.out}(35994)
Thread 0 of 4 threads says: Hello from CPU 3 on host host1! - 35978 - 35978
Thread 1 of 4 threads says: Hello from CPU 5 on host host1! - 35978 - 35986
Thread 2 of 4 threads says: Hello from CPU 6 on host host1! - 35978 - 35990
Thread 3 of 4 threads says: Hello from CPU 3 on host host1! - 35978 - 35994
Every time you see the message Roundrobin placement triggered, more threads were started than CPUs were specified. These additional threads are commonly shepherd threads. In the output you can see that threads 0 and 3 are both scheduled to CPU 3. Comparing the output of pstree with the thread IDs (TID, the last number in the hello lines) shows that the thread with TID 35982 does not say hello because it is a shepherd thread. In the list it is the first thread started, so a skip mask of 0x1 skips it:
$ likwid-pin -c 3,4,5,6 -s 0x1 a.out
[pthread wrapper]
[pthread wrapper] MAIN -> 3
[pthread wrapper] PIN_MASK: 0->4 1->5 2->6
[pthread wrapper] SKIP MASK: 0x1
threadid 140457091475200 -> SKIP
threadid 140457091208960 -> core 4 - OK
threadid 140457087010560 -> core 5 - OK
threadid 140457082812160 -> core 6 - OK
likwid-pin(32439)---a.out(32443)-+-pstree(32460)
|-{a.out}(32447)
|-{a.out}(32451)
|-{a.out}(32455)
`-{a.out}(32459)
Thread 0 of 4 threads says: Hello from CPU 3 on host host1! - 32443 - 32443
Thread 2 of 4 threads says: Hello from CPU 5 on host host1! - 32443 - 32455
Thread 1 of 4 threads says: Hello from CPU 4 on host host1! - 32443 - 32451
Thread 3 of 4 threads says: Hello from CPU 6 on host host1! - 32443 - 32459
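The effect of the skip mask in the two runs above can be reproduced with a small Python model. It assumes that bit i of the mask refers to the i-th thread created through pthread_create and that the main process occupies the first entry of the processor list, which is consistent with the output shown; `simulate_pinning` is a hypothetical helper, not LIKWID code.

```python
def simulate_pinning(cpus, created_threads, skip_mask=0x0):
    """Return (thread index, CPU or 'SKIP') for every created thread."""
    log = []
    next_slot = 1  # slot 0 of the list is used by the main process
    for index in range(created_threads):
        if skip_mask & (1 << index):
            log.append((index, "SKIP"))  # shepherd thread, left unpinned
            continue
        # Wrapping around the list corresponds to "Roundrobin placement triggered".
        log.append((index, cpus[next_slot % len(cpus)]))
        next_slot += 1
    return log


print(simulate_pinning([3, 4, 5, 6], 4, 0x0))
# [(0, 4), (1, 5), (2, 6), (3, 3)]: the last thread shares CPU 3 with the main process
print(simulate_pinning([3, 4, 5, 6], 4, 0x1))
# [(0, 'SKIP'), (1, 4), (2, 5), (3, 6)]: every worker gets its own CPU
```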
Example output for an OpenMP threaded STREAM benchmark:
$ likwid-pin -c 0-3 ./STREAM_OMP-WOODY
[likwid-pin] Main PID -> core 0 - OK
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 6000000, Offset = 0
Total memory required = 137.3 MB.
Each test is run 10 times, but only
the **best** time for each is used.
-------------------------------------------------------------
[wrapper](pthread) [wrapper](pthread) PIN_MASK: 0->1 1->2 2->3
[wrapper](pthread) SKIP MASK: 0x2
[wrapper 0](pthread) Notice: Using libpthread.so.0
threadid 47223170505040 -> core 1 - OK
[wrapper 1](pthread) Notice: Using libpthread.so.0
threadid 47223174703440 -> SKIP
[wrapper 2](pthread) Notice: Using libpthread.so.0
threadid 47223178901840 -> core 2 - OK
[wrapper 3](pthread) Notice: Using libpthread.so.0
threadid 47223183100240 -> core 3 - OK
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 70298 microseconds.
(= 35149 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 7034.1035 0.0137 0.0136 0.0137
Scale: 7087.4672 0.0138 0.0135 0.0154
Add: 7147.0976 0.0207 0.0201 0.0219
Triad: 7186.9842 0.0207 0.0200 0.0227
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------