Unfortunate conditional placement ("if" could be outside "for" for 2D loop domain)
Using non-square groups in this matrix multiplication produces a conditional inside the inner loop and increases the run time by 10x. Could this conditional instead be folded into the loop bounds? If so, that would probably reduce its effect on execution time. The attached example demonstrates the problem by timing the two cases and printing the generated OpenCL code for comparison.
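To make the request concrete, here is a minimal plain-Python sketch of the two loop shapes. It is not the generated code from the attached example. It assumes the conditional is a bounds check of the kind that appears when the tile (group) sizes do not divide the matrix size, and the names `matmul_guarded`, `matmul_clamped`, `tile_i` and `tile_j` are hypothetical.

```python
def matmul_guarded(a, b, n, tile_i, tile_j):
    """Tiled n x n matmul with the bounds check inside the innermost loop."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for j0 in range(0, n, tile_j):
            for ii in range(tile_i):
                for jj in range(tile_j):
                    i, j = i0 + ii, j0 + jj
                    for k in range(n):
                        # The check does not depend on k, yet it is
                        # evaluated on every k iteration.
                        if i < n and j < n:
                            c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_clamped(a, b, n, tile_i, tile_j):
    """Same computation with the check folded into the loop bounds."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for j0 in range(0, n, tile_j):
            # Clamping the upper bounds removes the need for any
            # conditional inside the k loop.
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, n)):
                    for k in range(n):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions visit the same `(i, j, k)` triples in the same order, so they produce identical results. The second form is roughly what is being asked for here, under the assumption above.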
Edited by Andreas Klöckner