LIKWID and Nvidia GPUs
LIKWID 5.0 was extended to support Nvidia GPUs. The task was challenging because LIKWID focused on CPUs and had no logic for other devices. To simplify the transition from CPUs to GPUs for users, the LIKWID API for GPUs is essentially a copy of the LIKWID API for CPUs with a few differences. For the command line applications, new CLI options were introduced.
The Nvidia libraries provide an option to allow profiling as a user. If you try out likwid-perfctr and it reports error 35 (CUPTI_ERROR_INSUFFICIENT_PRIVILEGES) in the output, see this page and follow the instructions there.
As soon as the CUDA and CUPTI libraries are reachable (path to libs in LD_LIBRARY_PATH), likwid-topology prints a separate section GPU Topology listing all GPUs.
********************************************************************************
GPU Topology
********************************************************************************
GPU count: 1
--------------------------------------------------------------------------------
ID: 0
Name: Tesla V100-SXM2-16GB
Compute capability: 7.0
L2 size: 6.00 MB
Memory: 16.00 GB
SIMD width: 32
Clock rate: 1530000 kHz
Memory clock rate: 877000 kHz
Attached to NUMA node: -1
--------------------------------------------------------------------------------
If you worked with likwid-topology, you might have used -c, --caches to display detailed cache information. For GPUs, the CLI option -G, --gpus causes likwid-topology to print out detailed information about the GPU:
********************************************************************************
GPU Topology
********************************************************************************
GPU count: 1
--------------------------------------------------------------------------------
ID: 0
Name: Tesla V100-SXM2-16GB
Compute capability: 7.0
L2 size: 6.00 MB
Memory: 16.00 GB
SIMD width: 32
Clock rate: 1530000 kHz
Memory clock rate: 877000 kHz
Attached to NUMA node: -1
Number of SPs: 80
Max. threads per SP: 2048
Max. threads per block: 1024
Max. thread dimensions: 1024/1024/64
Max. regs per block: 0
Shared mem per block: 49152
Memory bus width: 4096
Texture alignment: 512
Surface alignment: 512
ECC: on
GPU integrated: no
Max. grid sizes: 2147483647/65535/65535
PCI bus: 0x4
PCI domain: 0x4
PCI device: 0x0
--------------------------------------------------------------------------------
The second CLI tool that was adapted to support Nvidia GPUs is likwid-perfctr. We thought about different ways to specify the GPUs and related events and ended up with distinct CLI options:
$ likwid-perfctr -h
[...]
-G, --gpus <list> List of GPUs to monitor
-W, --gpugroup <string> Performance group or custom event set string for GPU monitoring
[...]
Notice: At the moment LIKWID for Nvidia GPUs is only supported for instrumented code, so you always need -m!
$ make -C test triadCU
$ likwid-perfctr -G 0 -W FLOPS_DP -m test/triadCU
--------------------------------------------------------------------------------
CPU name: POWER9, altivec supported
CPU type: POWER9 architecture
CPU clock: 3.80 GHz
--------------------------------------------------------------------------------
137.38ms 31.26GB/s
--------------------------------------------------------------------------------
Region triad, Group 1: FLOPS_DP
+-------------------+----------+
| Region Info | GPU 0 |
+-------------------+----------+
| RDTSC Runtime [s] | 0.105340 |
| call count | 10 |
+-------------------+----------+
+----------------------------------------------------+---------+-----------+
| Event | Counter | GPU 0 |
+----------------------------------------------------+---------+-----------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DADD_PRED_ON_SUM | GPU0 | 0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DMUL_PRED_ON_SUM | GPU1 | 0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_DFMA_PRED_ON_SUM | GPU2 | 867648400 |
+----------------------------------------------------+---------+-----------+
+--------------+------------+
| Metric | GPU 0 |
+--------------+------------+
| DP [MFLOP/s] | 16473.3097 |
+--------------+------------+
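The derived metric in the last table can be reproduced from the raw counts. The following is a minimal sketch, assuming the FLOPS_DP group computes 1e-6 * (DADD + DMUL + 2 * DFMA) / runtime; that formula is an assumption that happens to match the numbers shown above, and the function name dp_mflops is hypothetical, not part of LIKWID.

```c
#include <assert.h>

/*
 * Sketch of the DP [MFLOP/s] metric.
 * Assumed formula (not taken from the LIKWID sources):
 *   1e-6 * (DADD + DMUL + 2 * DFMA) / runtime
 * A fused multiply-add counts as two floating-point operations.
 */
double dp_mflops(double dadd, double dmul, double dfma, double runtime_s)
{
    assert(runtime_s > 0.0); /* a region must have run for a positive time */
    return 1.0e-6 * (dadd + dmul + 2.0 * dfma) / runtime_s;
}
```

Calling dp_mflops(0, 0, 867648400, 0.105340) with the values from the tables gives roughly 16473.3. The small deviation from the printed 16473.3097 is consistent with the runtime being displayed rounded to six decimals.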
The NvMarkerAPI is also quite similar to the MarkerAPI:
For LIKWID < 5.4:
- LIKWID_NVMARKER_INIT: Initialize the LIKWID library for Nvidia GPUs
- LIKWID_NVMARKER_CLOSE: Close the LIKWID library and write result file
- LIKWID_NVMARKER_REGISTER(name): Register a region with the given name
- LIKWID_NVMARKER_START(name): Start a region with the given name
- LIKWID_NVMARKER_STOP(name): Stop a region with the given name
- LIKWID_NVMARKER_RESET(name): Reset counter data for the region with the given name
- LIKWID_NVMARKER_SWITCH: Switch to next group if any (round-robin fashion)
- LIKWID_NVMARKER_GET(name, ngpu, nevents, eventlist, time, count): Get the current results of a region.
For LIKWID >= 5.4:
- NVMON_MARKER_INIT: Initialize the LIKWID library for Nvidia GPUs
- NVMON_MARKER_CLOSE: Close the LIKWID library and write result file
- NVMON_MARKER_REGISTER(name): Register a region with the given name
- NVMON_MARKER_START(name): Start a region with the given name
- NVMON_MARKER_STOP(name): Stop a region with the given name
- NVMON_MARKER_RESET(name): Reset counter data for the region with the given name
- NVMON_MARKER_SWITCH: Switch to next group if any (round-robin fashion)
- NVMON_MARKER_GET(name, ngpu, nevents, eventlist, time, count): Get the current results of a region.
To activate the macros, you have to compile with -DLIKWID_NVMON and the proper include and library paths. Finally, you have to link with the LIKWID library.
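Example code showing how to use the NvMarkerAPI (5.4+):
#include <likwid-marker.h>
int main(int argc, char **argv) {
    // Excerpt: iters, max_blocks, block_size, buffer_size, the device buffers
    // dA..dD and sch_triad_kernel are defined elsewhere in the full program.
    NVMON_MARKER_INIT;               // initialize LIKWID for Nvidia GPUs
    NVMON_MARKER_REGISTER("triad");  // register the region once, before the loop
    for (int i = 0; i < iters; i++) {
        NVMON_MARKER_START("triad");
        sch_triad_kernel<double>
            <<<max_blocks, block_size>>>(dA, dB, dC, dD, buffer_size);
        NVMON_MARKER_STOP("triad");
    }
    NVMON_MARKER_CLOSE;              // close LIKWID and write the result file
    return 0;
}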
Compilation step:
nvcc -O3 -I. -Xcompiler -mno-float128 -I$LIKWID_HOME/include -DLIKWID_NVMON triad.cu -o triadCU -lm -L$LIKWID_HOME/lib -llikwid
One difference is that the NvMarkerAPI is not bound to the threads executing the calls, so NVMON_MARKER_START/LIKWID_NVMARKER_START starts the counters on all GPUs supplied to likwid-perfctr. With the CPU-related MarkerAPI, in contrast, each thread has to execute the macros itself.
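Since the macro names differ between LIKWID < 5.4 and LIKWID >= 5.4, code that should build against both can use a small compatibility layer. The following is a minimal sketch, assuming that an older likwid-marker.h defines only the LIKWID_NVMARKER_* names listed above; the function run_marked_iterations is a hypothetical stand-in for a kernel launch loop. Without -DLIKWID_NVMON the markers expand to nothing, so the code also compiles on systems without LIKWID.

```c
#include <assert.h>

/* Pull in the real macros only when building with -DLIKWID_NVMON. */
#ifdef LIKWID_NVMON
#include <likwid-marker.h>
#endif

#ifndef NVMON_MARKER_INIT
#  ifdef LIKWID_NVMARKER_INIT
/* Assumption: older header with only the LIKWID_NVMARKER_* names. */
#    define NVMON_MARKER_INIT           LIKWID_NVMARKER_INIT
#    define NVMON_MARKER_CLOSE          LIKWID_NVMARKER_CLOSE
#    define NVMON_MARKER_REGISTER(name) LIKWID_NVMARKER_REGISTER(name)
#    define NVMON_MARKER_START(name)    LIKWID_NVMARKER_START(name)
#    define NVMON_MARKER_STOP(name)     LIKWID_NVMARKER_STOP(name)
#  else
/* No LIKWID GPU markers available: compile the markers away. */
#    define NVMON_MARKER_INIT
#    define NVMON_MARKER_CLOSE
#    define NVMON_MARKER_REGISTER(name)
#    define NVMON_MARKER_START(name)
#    define NVMON_MARKER_STOP(name)
#  endif
#endif

/* Hypothetical example: brackets each iteration with the region "triad". */
int run_marked_iterations(int iters)
{
    int done = 0;
    assert(iters >= 0);
    NVMON_MARKER_INIT;
    NVMON_MARKER_REGISTER("triad");
    for (int i = 0; i < iters; i++) {
        NVMON_MARKER_START("triad");
        done++; /* a real program would launch its kernel here */
        NVMON_MARKER_STOP("triad");
    }
    NVMON_MARKER_CLOSE;
    return done;
}
```

Only the macros used in the example are mapped; RESET, SWITCH and GET could be forwarded in the same way.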
The command line applications are basically just Lua scripts interfacing with the LIKWID library. For users of the CPU-related LIKWID API (PerfMon) the transition to the GPU-related LIKWID API (NvMon) is not difficult as the calls are mostly the same.
Example calls:
PerfMon | NvMon | Description |
---|---|---|
perfmon_init(ncpus, cpulist) | nvmon_init(ngpus, gpulist) | Initialize the measurement system on given CPUs/GPUs |
perfmon_getMetric(group, metric, cpu-idx) | nvmon_getMetric(group, metric, gpu-idx) | Get the result of a derived metric for CPUs/GPUs |
perfmon_startCounters() | nvmon_startCounters() | Start previously set up event set on all CPUs/GPUs |
There are a few exceptions like nvmon_getEventsOfGpu(gpuId, NvmonEventList_t* list), which exists because each GPU might provide a different set of events/metrics. The PerfMon equivalent would be perfmon_init_maps(); perfmon_check_counter_map(cpuId) and a directly readable list of events.
When LIKWID is compiled with NVIDIA_INTERFACE=true, the output of likwid-perfctr -a contains multiple sections, the first for the CPU-related groups and the second for the Nvidia GPU-related performance groups. The groups differ between the backends:
- compute capability < 7.0: $LIKWID_SRC/groups/nvidia_gpu_cc_lt_7
- compute capability >= 7.0: $LIKWID_SRC/groups/nvidia_gpu_cc_ge_7
If you want to create your own performance groups, put them in $HOME/.likwid/groups/nvidia_gpu_cc_lt_7 and $HOME/.likwid/groups/nvidia_gpu_cc_ge_7, respectively.
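The mapping from compute capability to group directory can be sketched as a small helper. This is only an illustration of the rule described above, not LIKWID's internal code; the function name nvidia_group_dir is hypothetical, and it takes the major part of the compute capability (for example 7 for the Tesla V100 with compute capability 7.0).

```c
#include <assert.h>
#include <string.h> /* for strcmp when comparing the returned name */

/*
 * Returns the performance group directory name for a GPU, based on the
 * major part of its compute capability:
 *   < 7  -> nvidia_gpu_cc_lt_7
 *   >= 7 -> nvidia_gpu_cc_ge_7
 */
const char *nvidia_group_dir(int cc_major)
{
    assert(cc_major > 0); /* compute capabilities start at 1.x */
    return (cc_major < 7) ? "nvidia_gpu_cc_lt_7" : "nvidia_gpu_cc_ge_7";
}
```

Because only the major version decides the directory, the minor part of the compute capability can be ignored here.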
Internally, LIKWID uses different backends to access the performance events on Nvidia GPUs. For GPUs with compute capability < 7.0, LIKWID uses the CUPTI Event API (Metrics API upcoming), while for GPUs with compute capability >= 7.0, it uses the CUPTI Profiling API in combination with the PerfWorks API. Which backend is used for a GPU is determined once in the initialization phase. The PerfWorks API is not published yet (no documentation), but there is some example code which served as the basis for the LIKWID backend.
General:
- No CUPTI Metrics API for GPUs with compute capability < 7.0.
CUPTI Profiling backend:
- Only works if GPU 0 is part of the GPU list. Only measuring on GPU 1 does not work.
- Some metrics require multi-pass kernel execution. It's not detectable in the setup phase whether an event requires multiple passes, so the error is thrown at the first counter starting.
- On some occasions, the CUPTI Profiling API returns error 999, which is "Unknown CUPTI error".
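Because the CUPTI Profiling backend currently requires GPU 0 in the GPU list, a tool that builds the list itself could check this before starting a measurement. The following is a minimal sketch with a hypothetical helper name; it is not part of the LIKWID API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if GPU 0 is part of the list, 0 otherwise.
 * gpu_ids may be NULL only if count is 0.
 */
int gpu_list_has_gpu0(const int *gpu_ids, size_t count)
{
    assert(gpu_ids != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (gpu_ids[i] == 0) {
            return 1;
        }
    }
    return 0;
}
```

A list such as {1} would be rejected by this check, while {0} and {1, 0} pass.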
General:
- Nvidia changes data structures even in minor releases. LIKWID uses some #ifdefs in the code to switch between the data structures. You might need to recompile LIKWID to make it work.