
This is a question about how to determine the CUDA grid, block and thread sizes. It is a follow-up to the question posted here.

Following this link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".

I haven't found a good explanation or clarification that explains this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (number of threads) given the following code:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work



There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.

Hardware Constraints:

This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

  1. Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
  2. The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
  3. Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
  4. Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)

If you stay within those limits, any kernel you can successfully compile will launch without error.

Performance Tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. How each code behaves will be different, and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

  1. The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
  2. Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try and cover in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkeley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have an impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite and the best configuration for a given piece of code relatively easy to find.

Tuesday, June 1, 2021
  1. You have to install the drivers for your integrated onboard GPU. This can be done by booting while using the iGPU from the BIOS settings; your PC should then be able to load the drivers it needs on its own. For my Ivy Bridge system, the BIOS settings are these:

    • Go to bios by repeatedly pressing del on startup.

    • Go to Advanced Mode by pressing F7 or by mouse clicking the respective option

    • Go to the Advanced tab, System Agent, set the iGPU as primary, and enable multi-monitor support.

  2. You boot with the iGPU (mine was an HD4000) and check that all is OK, but now you see the problem mentioned: CUDA devices can't be found/used (except Tesla cards, I think, since they are not considered display devices and so have nothing to do with this). Normally, you can find the discrete GPU under the device manager and update it manually from there. Then, CUDA should be able to spot your GPU. If that doesn't work, proceed to step 3.

  3. You reboot, go to the BIOS again, select the PCI-E card as primary, and boot. You should now be using the Nvidia card and everything should be normal, but there is still no way to use both the iGPU and the Nvidia card. While using the Nvidia card, go to the device manager, find the Intel iGPU under display devices, and perform a driver update. This is needed despite the fact that we already let Intel get its own drivers in the previous steps.

  4. When done, go to the BIOS again, set the iGPU as primary, and plug your display into it. Now, when using it, CUDA will be able to see the NVIDIA card and perform computations on it, while the display uses the iGPU and the system remains responsive! I think this can also be used for gaming or other performance reasons, like dual monitors on the integrated (onboard) GPU and the discrete GPU.

Friday, August 6, 2021

Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string as the identifier. Follow it with an EOF char, and as an added bonus, people who cat or type your binary file won't get a mangled terminal.

Sunday, September 26, 2021
Jeremy Pare

Your comparison has several flaws, some of which are covered in the comments.

  1. You need to eliminate any allocation effects. You can do this by doing some "warm-up" transfers first.
  2. You need to eliminate any "start-up" effects. You can do this by doing some "warm-up" transfers first.
  3. When comparing the data, remember that bandwidthTest is using a PINNED memory allocation, which thrust does not use. Therefore the thrust data transfer rate will be slower. This typically contributes about a 2x factor (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch.
  4. Your choice of timing functions might not be the best. cudaEvents is pretty reliable for timing CUDA operations.

Here is a code which does proper timing:

$ cat
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>

#define DSIZE ((1UL<<20)*32)

int main(){

  thrust::device_vector<int> d_data(DSIZE);
  thrust::host_vector<int> h_data(DSIZE);
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfers to eliminate allocation and start-up effects
  thrust::fill(h_data.begin(), h_data.end(), 1);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  std::cout<< "warm up iteration " << d_data[0] << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 2);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  std::cout<< "warm up iteration " << h_data[0] << std::endl;

  // timed host-to-device transfer
  thrust::fill(h_data.begin(), h_data.end(), 3);
  cudaEventRecord(start);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  // timed device-to-host transfer
  thrust::fill(d_data.begin(), d_data.end(), 4);
  cudaEventRecord(start);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  std::cout << "finished" << std::endl;
  return 0;
}
I compile with (I have a PCIE Gen2 system with a cc2.0 device)

$ nvcc -O3 -arch=sm_20 -o t213

When I run it I get the following results:

$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s

This looks correct to me because a bandwidthTest on my system would report about 6GB/s in either direction as I have a PCIE Gen2 system. Since thrust uses pageable, not pinned memory, I get about half that bandwidth, i.e. 3GB/s, and thrust is reporting about 2.5GB/s.

For comparison, here is the bandwidth test on my system, using pageable memory:

$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1

Friday, December 3, 2021

Manage your date format masking using the most reasonable approach

First of all, I agree with Alex regarding using to_char. This would be my first choice for modifying date masks for specific requirements.

In Toad on an ad hoc basis, you could just invoke the alter session command as needed:

ALTER SESSION SET nls_date_format='DD/MON/RRRR';

If you are partial to a specific date format mask (and you see yourself often issuing the command, ALTER SESSION SET NLS...) then perhaps you might want to consider changing your user login settings.

If you just modify your specific user preference login file, login.sql (see here), your session will adhere to the date format mask of your choosing from the beginning of your session. I am partial to creating the environment variable SQLPATH and placing my login script there.
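For example, a minimal login.sql (the file name is the standard SQL*Plus one; the mask is just the example from above) might contain:

```sql
-- login.sql, placed in a directory pointed to by the SQLPATH environment variable
ALTER SESSION SET nls_date_format = 'DD/MON/RRRR';
```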

Toad will honor your login.sql file settings (e.g. see this post).

Since this is driven by specific requirements or personal preferences, I would never think of modifying this from default at the site level.

Saturday, December 4, 2021