No privatisation of reduction variables for `ilp.seq` on CTarget when inames differ.

This is reduced from a bigger testcase in Firedrake. I'm not sure whether this is a bug in codegen or whether we need to annotate the split inames with more information.

Consider:

```python
import loopy as lp
import numpy as np

knl = lp.make_kernel(
    [
        "[n] -> { [i, j, k] : 0 <= i < n and 0 <= j, k < 2}",
    ],
    # For every i: zero a (over k), then accumulate b into a (over j).
    """
    a[k] = 0 {id=statement1, inames=i:k}
    a[j] = a[j] + b[j] {id=expr_insn, dep=statement1, inames=i:j}
    """,
    [
        lp.ValueArg(name='n', dtype=np.int32),
        lp.GlobalArg(name='a', dtype=np.float64, shape=(2,)),
        lp.GlobalArg(name='b', dtype=np.float64, shape=(2,)),
    ],
    lang_version=(2018, 2),
    target=lp.CTarget(),
    assumptions="n > 0 and n mod 4 = 0")

# Split i into i_outer and i_inner (length 4), tagging the inner iname as ilp.seq.
sk = lp.split_iname(knl, "i", 4, inner_tag="ilp.seq")

print(lp.generate_code_v2(knl).device_code())
print(lp.generate_code_v2(sk).device_code())
```

The initial kernel is fine:

```c
#include <stdint.h>

void loopy_kernel(int32_t const n, double *__restrict__ a, double const *__restrict__ b)
{
  for (int32_t i = 0; i <= -1 + n; ++i)
  {
    for (int32_t k = 0; k <= 1; ++k)
      a[k] = 0.0;
    for (int32_t j = 0; j <= 1; ++j)
      a[j] = a[j] + b[j];
  }
}
```

We first zero `a` and then accumulate into it. The split kernel, however, is not fine. We get:

```c
#include <stdint.h>

void loopy_kernel(int32_t const n, double *__restrict__ a, double const *__restrict__ b)
{
  for (int32_t i_outer = 0; i_outer <= (-4 + n) / 4; ++i_outer)
  {
    for (int32_t k = 0; k <= 1; ++k)
      for (int32_t i_inner = 0; i_inner <= 3; ++i_inner)
        a[k] = 0.0;
    for (int32_t j = 0; j <= 1; ++j)
      for (int32_t i_inner = 0; i_inner <= 3; ++i_inner)
        a[j] = a[j] + b[j];
  }
}
```

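So we've pushed the ILP loop inside, which is good, but the transformation is invalid because the zeroing no longer happens in the right place: each `a[j]` is now incremented four times per zeroing instead of once, so `a` ends up as `4*b` rather than `b`. To be fair, loopy does report a write-race warning: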

```
/Users/vtdb72/Documents/work/src/firedrake/src/loopy/loopy/check.py:268: WriteRaceConditionWarning: in kernel loopy_kernel: instruction 'statement1' contains a write race: instruction will be run across parallel iname(s) 'i_inner', which is/are not referenced in the lhs index (add 'write_race(statement1)' to silenced_warnings kernel attribute to disable)
  WriteRaceConditionWarning)
/Users/vtdb72/Documents/work/src/firedrake/src/loopy/loopy/check.py:268: WriteRaceConditionWarning: in kernel loopy_kernel: instruction 'expr_insn' contains a write race: instruction will be run across parallel iname(s) 'i_inner', which is/are not referenced in the lhs index (add 'write_race(expr_insn)' to silenced_warnings kernel attribute to disable)
  WriteRaceConditionWarning)
```
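
To make the difference concrete, the two generated loop nests can be transcribed by hand into plain Python and run on the same input. This is only a sketch of the C output above, not something loopy produces; the function names `run_original` and `run_split` are made up for illustration.

```python
def run_original(n, b):
    # Transcription of the unsplit kernel: zero a, then accumulate, once per i.
    a = [float("nan"), float("nan")]
    for i in range(n):
        for k in range(2):
            a[k] = 0.0
        for j in range(2):
            a[j] = a[j] + b[j]
    return a


def run_split(n, b):
    # Transcription of the split kernel: i_inner has been pushed inside
    # the k and j loops, so the accumulation runs four times per zeroing.
    a = [float("nan"), float("nan")]
    for i_outer in range(n // 4):
        for k in range(2):
            for i_inner in range(4):
                a[k] = 0.0
        for j in range(2):
            for i_inner in range(4):
                a[j] = a[j] + b[j]
    return a


b = [1.0, 2.0]
print(run_original(8, b))  # [1.0, 2.0]
print(run_split(8, b))     # [4.0, 8.0]
```

With `n = 8` the unsplit nest leaves `a == b`, while the split nest leaves `a == 4*b`.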

So I know something iffy is going on. But what do I need to do to get the right code? I suppose I need to privatise `a` over `i_inner` and add a separate reduction step at the end.
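
One possible reading of that idea, again hand-written in plain Python rather than generated by loopy, is sketched below. It assumes each `i_inner` lane gets its own copy of `a` (the name `a_priv` is hypothetical), and that the final combine step simply copies out the last lane so as to reproduce the sequential result. What the combine step should be in the real Firedrake kernel is an open question that this small example does not settle.

```python
def run_privatised(n, b, width=4):
    # Hand-written sketch: one private copy of `a` per i_inner lane.
    a = [float("nan"), float("nan")]
    for i_outer in range(n // width):
        # a_priv[i_inner][k] is lane i_inner's private copy of a[k].
        a_priv = [[float("nan"), float("nan")] for _ in range(width)]
        for k in range(2):
            for i_inner in range(width):
                a_priv[i_inner][k] = 0.0
        for j in range(2):
            for i_inner in range(width):
                a_priv[i_inner][j] = a_priv[i_inner][j] + b[j]
        # Combine step (assumption): keep the last lane, which corresponds
        # to the last sequential iteration of i within this i_outer block.
        for j in range(2):
            a[j] = a_priv[width - 1][j]
    return a


print(run_privatised(8, [1.0, 2.0]))  # [1.0, 2.0]
```

Under these assumptions the result matches the unsplit kernel, since every lane zeroes and accumulates into its own copy exactly once.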