Why blocksize is 256 in gpu-cache test #9

Open
blueWatermelonFri opened this issue Jul 2, 2024 · 1 comment

blueWatermelonFri commented Jul 2, 2024

Hey, I noticed that the gpu-cache test uses a blocksize of 256. Why not 1024?

When I changed the blocksize from 256 to 1024, the measured L1 cache bandwidth improved somewhat, but it also fluctuated more.

Results with blocksize = 256:

     data set   exec time     spread        Eff. bw       DRAM read      DRAM write         L2 read       L2 store
         1 kB        50ms       0.7%    8648.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         2 kB        37ms       0.1%   11608.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         3 kB        33ms       0.0%   12947.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         4 kB        31ms       5.4%   14061.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB        30ms       3.3%   14402.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        30ms       6.6%   14989.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        30ms       3.0%   14555.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        30ms      27.9%   15976.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        30ms       5.3%   14430.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        30ms       2.2%   14588.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        33ms       2.0%   13113.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        30ms      17.5%   15206.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        29ms       7.9%   15610.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        28ms      11.8%   15916.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        32ms      11.1%   13737.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        30ms       5.0%   14240.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        31ms       0.6%   14172.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        30ms       4.1%   14733.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        29ms       2.2%   14845.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        29ms       3.3%   15113.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        38 kB        29ms       5.4%   14967.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        40 kB        29ms       5.4%   15129.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        42 kB        29ms       8.7%   15437.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        44 kB        29ms       7.0%   15451.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        46 kB        29ms       8.4%   15633.8 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        48 kB        28ms      12.3%   15940.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        50 kB        28ms      16.4%   16288.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        52 kB        28ms      14.6%   16230.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        54 kB        28ms      12.6%   16195.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        56 kB        27ms      10.0%   16434.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        58 kB        28ms      11.0%   16433.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

Results with blocksize = 1024:

     data set   exec time     spread        Eff. bw       DRAM read      DRAM write         L2 read       L2 store
         4 kB        37ms       0.1%   11645.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB       111ms       0.0%    3902.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        29ms      46.0%   17593.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        66ms       6.0%    6564.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        29ms      24.8%   16609.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        52ms       1.4%    8303.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        28ms      27.1%   17275.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        44ms       6.6%    9894.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        28ms      27.0%   17521.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        39ms       7.5%   11307.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        27ms      16.9%   17184.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        37ms      18.0%   12475.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        27ms      40.3%   18542.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        34ms      11.9%   13365.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        26ms      20.7%   18043.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        34ms      23.1%   14124.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        27ms      26.9%   17707.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

My device is an A800 80GB PCIe.

te42kyfo (Collaborator) commented Jul 3, 2024

The number of threads per block (the blocksize) needs to be a divisor of N, which is a template parameter to measure. Otherwise many threads will do too much work.
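The divisibility requirement can be illustrated with a small arithmetic sketch. It is not taken from the benchmark source. It assumes each thread in a block walks the data with a stride equal to the block size (`for i in range(0, n, block_size)`), and it reads the requirement as "the block size must divide N". The function name `reads_per_block` and the chosen values of N are hypothetical.

```python
def reads_per_block(n: int, block_size: int) -> int:
    """Element reads issued by one thread block under the assumed loop shape."""
    # Assumed per-thread loop: for i in range(0, n, block_size): read(i + tid)
    iters_per_thread = len(range(0, n, block_size))  # equals ceil(n / block_size)
    return iters_per_thread * block_size


# Illustrative values of N only; they are not the sizes used by the benchmark.
for n in (1024, 1536, 2048, 2560):
    for block_size in (256, 1024):
        extra = reads_per_block(n, block_size) - n
        print(f"N={n:5d}  block size={block_size:5d}  reads beyond N: {extra}")
```

Under these assumptions, a block size that does not divide N makes the block perform more reads than the N elements the bandwidth figure accounts for, which would lower the reported effective bandwidth.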

From line 144 onward, use only multiples of 1024 as the template parameter. On some GPUs that do not have as large an L1 cache, the amount of work per thread would then be very small, and the performance would actually be worse.
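A rough way to see the work-per-thread trade-off is the sketch below. It is an assumption-based model, not the benchmark's code: it supposes that each thread handles N / blocksize elements per pass, and the helper name `iterations_per_thread` is hypothetical.

```python
def iterations_per_thread(n: int, block_size: int) -> int:
    """Loop iterations each thread performs per pass, assuming block_size divides n."""
    if n % block_size != 0:
        raise ValueError("block_size must divide n")
    return n // block_size


# Illustrative multiples of 1024; not the actual template parameters in the test.
for n in (1024, 2048, 4096):
    small = iterations_per_thread(n, 256)
    large = iterations_per_thread(n, 1024)
    print(f"N={n}: {small} iterations/thread at 256, {large} at 1024")
```

In this model a block of 1024 threads leaves each thread a quarter of the iterations it would have with 256 threads, so on small data sets there may be too little work per thread to amortize fixed overhead.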
