No privatisation of reduction variables for `ilp.seq` on CTarget when inames differ.
This is reduced from a bigger testcase in Firedrake. I'm not sure whether this is a bug in codegen or whether we need to annotate the split inames with more information.
Consider:
```python
import loopy as lp
import numpy as np

knl = lp.make_kernel(
    [
        "[n] -> { [i, j, k] : 0 <= i < n and 0 <= j, k < 2}",
    ],
    """
    a[k] = 0 {id=statement1, inames=i:k}
    a[j] = a[j] + b[j] {id=expr_insn, dep=statement1, inames=i:j}
    """, [
        lp.ValueArg(
            name='n', dtype=np.int32),
        lp.GlobalArg(
            name='a', dtype=np.float64,
            shape=(2,)),
        lp.GlobalArg(
            name='b', dtype=np.float64,
            shape=(2,)),
    ], lang_version=(2018, 2), target=lp.CTarget(), assumptions="n > 0 and n mod 4 = 0")

sk = lp.split_iname(knl, "i", 4, inner_tag="ilp.seq")
print(lp.generate_code_v2(knl).device_code())
print(lp.generate_code_v2(sk).device_code())
```
The initial kernel is fine:
```c
#include <stdint.h>
void loopy_kernel(int32_t const n, double *__restrict__ a, double const *__restrict__ b)
{
  for (int32_t i = 0; i <= -1 + n; ++i)
  {
    for (int32_t k = 0; k <= 1; ++k)
      a[k] = 0.0;
    for (int32_t j = 0; j <= 1; ++j)
      a[j] = a[j] + b[j];
  }
}
```
We first zero, and then reduce. The second kernel, however, is not fine. We get:
```c
#include <stdint.h>
void loopy_kernel(int32_t const n, double *__restrict__ a, double const *__restrict__ b)
{
  for (int32_t i_outer = 0; i_outer <= (-4 + n) / 4; ++i_outer)
  {
    for (int32_t k = 0; k <= 1; ++k)
      for (int32_t i_inner = 0; i_inner <= 3; ++i_inner)
        a[k] = 0.0;
    for (int32_t j = 0; j <= 1; ++j)
      for (int32_t i_inner = 0; i_inner <= 3; ++i_inner)
        a[j] = a[j] + b[j];
  }
}
```
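So we've pushed the ILP loop inside, which is good, but the transformation is invalid: within each `i_outer` iteration, `a` is now zeroed four times and then accumulated into four times, instead of being zeroed before each individual accumulation. To be fair, loopy does report a write-race warning: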
```
/Users/vtdb72/Documents/work/src/firedrake/src/loopy/loopy/check.py:268: WriteRaceConditionWarning: in kernel loopy_kernel: instruction 'statement1' contains a write race: instruction will be run across parallel iname(s) 'i_inner', which is/are not referenced in the lhs index (add 'write_race(statement1)' to silenced_warnings kernel attribute to disable)
  WriteRaceConditionWarning)
/Users/vtdb72/Documents/work/src/firedrake/src/loopy/loopy/check.py:268: WriteRaceConditionWarning: in kernel loopy_kernel: instruction 'expr_insn' contains a write race: instruction will be run across parallel iname(s) 'i_inner', which is/are not referenced in the lhs index (add 'write_race(expr_insn)' to silenced_warnings kernel attribute to disable)
  WriteRaceConditionWarning)
```
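To make the difference concrete, here is a minimal sketch that emulates the two generated loop nests in plain Python, with no loopy involved. The helper names `run_original` and `run_split` and the sample values for `n` and `b` are made up for illustration, and plain lists stand in for the arrays.

```python
def run_original(n, b):
    """Emulate the unsplit kernel: zero a, then accumulate, once per i."""
    a = [float("nan"), float("nan")]
    for i in range(n):
        for k in range(2):
            a[k] = 0.0
        for j in range(2):
            a[j] = a[j] + b[j]
    return a


def run_split(n, b):
    """Emulate the split kernel with the i_inner loop pushed innermost."""
    a = [float("nan"), float("nan")]
    for i_outer in range(n // 4):
        for k in range(2):
            for i_inner in range(4):
                a[k] = 0.0  # zeroed four times in a row
        for j in range(2):
            for i_inner in range(4):
                a[j] = a[j] + b[j]  # accumulated four times with no reset
    return a


b = [1.0, 2.0]
a_original = run_original(8, b)  # ends up equal to b
a_split = run_split(8, b)        # ends up equal to 4 * b
print(a_original, a_split)
```

With these inputs the unsplit nest leaves `a` equal to `b`, while the split nest leaves it at four times `b`, which is the mismatch described above.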
So I know something iffy is going on. But what do I need to do to get the right code? I suppose I need to privatise `a` over `i_inner` and create a separate reduction step at the end.