Splitting offset loop with size larger than the loop bounds causes incorrect output / segfaults
Consider the following simple example:
https://gist.github.com/arghdos/4683c18d8960c17940745c193ca6f977#file-loop_creator-py
Now for the example given, `range(3, 6)`, loopy produces the following:
https://gist.github.com/arghdos/4683c18d8960c17940745c193ca6f977#file-range36-cl
If my interpretation is correct, this results in the following values:
lid | i_outer | 4 * i_outer + lid(0) |
---|---|---|
0 | 1 | 4 |
1 | 1 | 5 |
2 | 1 | 6 |
3 | 0 | 3 |
Hence, for `lid(0) == 2`, we are writing to elements `600...699`, outside the bounds of the array.
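The arithmetic in the table can be double-checked with a short script. This is only a sketch of my reading of the generated kernel, not of what loopy does internally: the `(lid, i_outer)` pairs are copied from the table, the split length of 4 comes from the `4 * i_outer + lid(0)` expression, and the row length of 100 is an assumption inferred from the `600...699` element range. It also assumes `range(3, 6)` means `3 <= i < 6`.

```python
# (lid, i_outer) pairs copied from the table above.
TABLE = [(0, 1), (1, 1), (2, 1), (3, 0)]

LOWER, UPPER = 3, 6  # intended loop bounds, assumed half-open: 3 <= i < 6
SPLIT = 4            # inner split length, from 4 * i_outer + lid(0)
ROW_LEN = 100        # assumption: inferred from the 600...699 element range


def out_of_bounds(table, lower, upper, split):
    """Return (lid, i) for every work item whose index i leaves [lower, upper)."""
    bad = []
    for lid, i_outer in table:
        i = split * i_outer + lid
        if not (lower <= i < upper):
            bad.append((lid, i))
    return bad


bad_items = out_of_bounds(TABLE, LOWER, UPPER, SPLIT)
for lid, i in bad_items:
    first = i * ROW_LEN
    last = first + ROW_LEN - 1
    print(f"lid(0) == {lid}: i = {i}, touches elements {first}...{last}")
```

Under these assumptions, only the work item with `lid(0) == 2` falls outside the loop bounds, which matches the out-of-bounds write described above.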
There are many examples of this splitting going subtly wrong (e.g., `range(3, 5)`), but this particular example causes a segfault on my setup with both POCL and Intel's OpenCL drivers (although, interestingly, NVIDIA runs just fine).
Further, changing the range to `range(0, 3)` results in the required work-group size being (correctly) scaled down to 3 (as do `range(1, 4)`, `range(4, 7)`, and many others), but `range(7, 10)` gives segfaults again.