This is a question about how to determine CUDA grid, block, and thread sizes. It is a follow-up to the question posted here.
At that link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".
I haven't found a good explanation of this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (number of threads per block) given the following code:
    const int n = 128 * 1024;
    int blocksize = 512; // value usually chosen by tuning and hardware constraints
    int nblocks = n / blocksize; // value determined by block size and total work
    madd<<<nblocks, blocksize>>>(A, B, C, n);
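For context, here is a minimal sketch of one way to get a starting value for blocksize instead of hard-coding 512. It assumes CUDA 6.5 or later, where the runtime provides `cudaOccupancyMaxPotentialBlockSize`, and it assumes `madd` is an element-wise add, since the kernel body is not shown above. The helper `gridSizeFor` is a hypothetical name introduced only for this example. The value the runtime suggests maximizes theoretical occupancy for the kernel on the current device, which is not necessarily the fastest choice, so timing a few candidates (typically multiples of the warp size, 32, up to the 1024 threads-per-block limit on current hardware) is likely still needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed stand-in for the madd kernel from the snippet above: C = A + B.
__global__ void madd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partially filled
        C[i] = A[i] + B[i];
    }
}

// Hypothetical helper: round up so nblocks * blocksize >= n
// even when n is not a multiple of blocksize.
int gridSizeFor(int n, int blocksize)
{
    return (n + blocksize - 1) / blocksize;
}

int main()
{
    const int n = 128 * 1024;
    const size_t bytes = n * sizeof(float);

    // Ask the runtime for a block size that maximizes theoretical occupancy
    // for this kernel on the current device (a heuristic, not a guarantee).
    int minGridSize = 0;
    int blocksize = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blocksize, madd, 0, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int nblocks = gridSizeFor(n, blocksize);

    float *A = nullptr, *B = nullptr, *C = nullptr;
    cudaMalloc((void **)&A, bytes);
    cudaMalloc((void **)&B, bytes);
    cudaMalloc((void **)&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    madd<<<nblocks, blocksize>>>(A, B, C, n);
    err = cudaGetLastError();            // catches invalid launch configurations
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();   // catches errors during execution
    }

    printf("suggested blocksize = %d, nblocks = %d\n", blocksize, nblocks);

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return err == cudaSuccess ? 0 : 1;
}
```

The rounding in `gridSizeFor` together with the `i < n` guard in the kernel keeps the launch correct for any blocksize, which makes it straightforward to time several values and compare.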