Save and reload doesn't do its job in the presence of duplicated inames across kernels
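Our current save-and-reload heuristic adds an axis in front of the temporary based on the common set of hardware-parallel inames of the instructions accessing it. It might make sense to relax this constraint and just look at the hardware-parallel tags of the accessing instructions instead. In the reproducer below, after duplicate_inames the writer of a uses j and the reader uses j_0; both are tagged l.0, but they are different inames.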
import loopy as lp

knl = lp.make_kernel("{[i,j]: 0 <= i < 10 and 0 <= j < 10}",
"""
<>a[j] = j
... gbarrier
out[i,j] = a[j]
""",
seq_dependencies=True)
knl = lp.tag_inames(knl, dict(i="g.0", j="l.0"))
knl = lp.duplicate_inames(knl, "j", within="writes:out", tags={"j": "l.0"})
knl = lp.set_temporary_scope(knl, "a", "local")
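# Schedule once, insert the save/reload instructions, then reschedule and print.
knl = lp.get_one_scheduled_kernel(lp.preprocess_kernel(knl))
knl = lp.save_and_reload_temporaries(knl)
print(lp.get_one_scheduled_kernel(knl))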
Output (note that a_save_slot, with shape (10), is smaller than desired):
---------------------------------------------------------------------------
KERNEL: loopy_kernel
---------------------------------------------------------------------------
ARGUMENTS:
out: GlobalArg, type: np:dtype('int32'), shape: (10, 10), dim_tags: (N1:stride:10, N0:stride:1)
---------------------------------------------------------------------------
DOMAINS:
{ [i, j, j_0] : 0 <= i <= 9 and 0 <= j <= 9 and 0 <= j_0 <= 9 }
{ [a_save_axis_0_loopy_kernel] : 0 <= a_save_axis_0_loopy_kernel <= 9 }
{ [a_reload_axis_0_loopy_kernel_0] : 0 <= a_reload_axis_0_loopy_kernel_0 <= 9 }
---------------------------------------------------------------------------
INAME IMPLEMENTATION TAGS:
a_reload_axis_0_loopy_kernel_0: l.0
a_save_axis_0_loopy_kernel: l.0
i: g.0
j: l.0
j_0: l.0
---------------------------------------------------------------------------
TEMPORARIES:
a: type: np:dtype('int32'), shape: (10), dim_tags: (N0:stride:1) scope:local
a_save_slot: type: np:dtype('int32'), shape: (10), dim_tags: (N0:stride:1) scope:global
---------------------------------------------------------------------------
INSTRUCTIONS:
↱↱ [j] a[j] <- j # insn
└│↱ [a_save_axis_0_loopy_kernel] a_save_slot[a_save_axis_0_loopy_kernel] <- a[a_save_axis_0_loopy_kernel] # a.save,no_sync_with=a.reload@global:a.save@global
↱└└↱ [] ... gbarrier # insn_0
└↱ │ [a_reload_axis_0_loopy_kernel_0] a[a_reload_axis_0_loopy_kernel_0] <- a_save_slot[a_reload_axis_0_loopy_kernel_0] # a.reload,no_sync_with=a.reload@global:a.save@global
└ └ [i,j_0] out[i, j_0] <- a[j_0] # insn_1
---------------------------------------------------------------------------
SCHEDULE:
0: CALL KERNEL loopy_kernel(extra_args=['a_save_slot'], extra_inames=[])
1: [insn] a[j] <- j
2: ---BARRIER:local---
3: [a.save] a_save_slot[a_save_axis_0_loopy_kernel] <- a[a_save_axis_0_loopy_kernel]
4: RETURN FROM KERNEL loopy_kernel
5: ---BARRIER:global---
6: CALL KERNEL loopy_kernel_0(extra_args=['a_save_slot'], extra_inames=[])
7: [a.reload] a[a_reload_axis_0_loopy_kernel_0] <- a_save_slot[a_reload_axis_0_loopy_kernel_0]
8: ---BARRIER:local---
9: [insn_1] out[i, j_0] <- a[j_0]
10: RETURN FROM KERNEL loopy_kernel_0
---------------------------------------------------------------------------
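
To make the proposed relaxation concrete, here is a minimal, loopy-independent sketch. It compares intersecting hardware-parallel iname names against intersecting hardware-parallel tags for the two instructions that access `a` in the output above. The dictionaries and both helper functions are hypothetical and written only for illustration; they are not loopy's implementation. Reading "just look at the tags" as an intersection of tags is also an assumption.

```python
# Hypothetical model of the kernel above; not loopy's actual code.
# Tags are copied from the "INAME IMPLEMENTATION TAGS" section of the output.
iname_tags = {"i": "g.0", "j": "l.0", "j_0": "l.0"}

# Inames of the two user instructions that access the temporary `a`.
accessor_inames = {
    "insn": {"j"},           # a[j] <- j
    "insn_1": {"i", "j_0"},  # out[i, j_0] <- a[j_0]
}


def common_parallel_inames(accessors, tags):
    """Intersect the hardware-parallel iname *names* across all accessors."""
    per_insn = [{n for n in inames if n in tags} for inames in accessors.values()]
    return set.intersection(*per_insn)


def common_parallel_tags(accessors, tags):
    """Intersect the hardware-parallel *tags* across all accessors."""
    per_insn = [{tags[n] for n in inames if n in tags} for inames in accessors.values()]
    return set.intersection(*per_insn)


by_iname = common_parallel_inames(accessor_inames, iname_tags)
by_tag = common_parallel_tags(accessor_inames, iname_tags)

print(by_iname)  # set(): j and j_0 are different names, so nothing is shared
print(by_tag)    # {'l.0'}: both accessors run under an l.0-tagged iname
```

In this model, comparing by name finds nothing in common once `j` has been duplicated, whereas comparing by tag still finds `l.0`. Whether an intersection or some other combination of the tags is the right rule for sizing the save slot is left open here.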