    import loopy as lp

    knl = lp.make_kernel(
        "{[i]: 0<=i<4}",
        """
        a = simul_reduce(sum, i, 7*i)
        b = simul_reduce(sum, i, 10*i)
        """)
    knl = lp.tag_inames(knl, "i:l.0")
    knl = lp.realize_reduction(knl)
    print(lp.generate_code_v2(knl).device_code())
generates two sets of inames.
map_reduction_local should generate the same set of intermediate inames for both simul_reduce expressions.
After realize_reduction, the kernel in the description should then become:
    ---------------------------------------------------------------------------
    INSTRUCTIONS:
        for red_init_i
    ↱     acc_i[red_init_i] = 0  {id=insn_i_init}
    │↱    neutral_i = 0  {id=insn_i_init_neutral}
    ││  end red_init_i
    ││  for i
    └└↱   acc_i[i] = neutral_i + 7*i  {id=insn_i_transfer}
      │ end i
      │ for red_i_s0_0
    ↱ └   acc_i[red_i_s0_0] = acc_i[red_i_s0_0] + acc_i[red_i_s0_0 + 2]  {id=red_i_stage_0}
    │   end red_i_s0_0
    │   for red_i_s1_0
    └↱    acc_i[red_i_s1_0] = acc_i[red_i_s1_0] + acc_i[red_i_s1_0 + 1]  {id=red_i_stage_1}
     │  end red_i_s1_0
     │  for red_i
     └    a = acc_i[0]  {id=insn_1}
        end red_i
        for red_init_i
    ↱     acc_i_0[red_init_i] = 0  {id=insn_0_i_init}
    │↱    neutral_i_0 = 0  {id=insn_0_i_init_neutral}
    ││  end red_init_i
    ││  for i
    └└↱   acc_i_0[i] = neutral_i_0 + 10*i  {id=insn_0_i_transfer}
      │ end i
      │ for red_i_s0_0
    ↱ └   acc_i_0[red_i_s0_0] = acc_i_0[red_i_s0_0] + acc_i_0[red_i_s0_0 + 2]  {id=red_i_stage_0_0}
    │   end red_i_s0_0
    │   for red_i_s1_0
    └↱    acc_i_0[red_i_s1_0] = acc_i_0[red_i_s1_0] + acc_i_0[red_i_s1_0 + 1]  {id=red_i_stage_1_0}
     │  end red_i_s1_0
     │  for red_i
     └    b = acc_i_0[0]  {id=insn_0_0}
        end red_i
    ---------------------------------------------------------------------------
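To make the staged instructions easier to follow, here is a minimal pure-Python sketch of what the init, transfer, and stage instructions compute for one accumulator. It is only an illustration, not loopy code: the function name `staged_local_sum` is hypothetical, the work-items are simulated sequentially, and the input length is assumed to be a power of two (4 in this issue).

```python
def staged_local_sum(values):
    """Sum `values` using the same stage structure as the listing above.

    Assumes len(values) is a power of two (hypothetical helper, not loopy API).
    """
    n = len(values)
    acc = [0] * n                     # insn_i_init: acc_i[red_init_i] = 0
    neutral = 0                       # insn_i_init_neutral: neutral_i = 0
    for i, v in enumerate(values):    # insn_i_transfer: acc_i[i] = neutral_i + expr
        acc[i] = neutral + v
    stride = n // 2
    while stride >= 1:                # red_i_stage_0 (stride 2), red_i_stage_1 (stride 1)
        for j in range(stride):
            acc[j] = acc[j] + acc[j + stride]
        stride //= 2
    return acc[0]                     # insn_1: a = acc_i[0]


a = staged_local_sum([7 * i for i in range(4)])
b = staged_local_sum([10 * i for i in range(4)])
print(a, b)  # 42 60
```

Both reductions walk through exactly the same stages with the same strides, which is why sharing the intermediate inames between them looks natural.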
Currently, the improvised kernel above, obtained after step 1, generates sub-optimal code (notice the unnecessary local barriers):
    __kernel void __attribute__ ((reqd_work_group_size(4, 1, 1))) loopy_kernel(__global int *__restrict__ a, __global int *__restrict__ b)
    {
      __local int acc_i[4];
      __local int acc_i_0[4];
      int neutral_i;
      int neutral_i_0;

      neutral_i = 0;
      acc_i[lid(0)] = 0;
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i (insn_i_transfer depends on insn_i_init) */;
      acc_i[lid(0)] = neutral_i + 7 * lid(0);
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i (red_i_stage_0 depends on insn_i_transfer) */;
      if (1 + -1 * lid(0) >= 0)
        acc_i[lid(0)] = acc_i[lid(0)] + acc_i[2 + lid(0)];
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i (red_i_stage_1 depends on red_i_stage_0) */;
      if (lid(0) == 0)
        acc_i[0] = acc_i[0] + acc_i[1];
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i (insn_1 depends on red_i_stage_1) */;
      a[0] = acc_i[0];
      neutral_i_0 = 0;
      acc_i_0[lid(0)] = 0;
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i_0 (insn_0_i_transfer depends on insn_0_i_init) */;
      acc_i_0[lid(0)] = neutral_i_0 + 10 * lid(0);
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i_0 (red_i_stage_0_0 depends on insn_0_i_transfer) */;
      if (1 + -1 * lid(0) >= 0)
        acc_i_0[lid(0)] = acc_i_0[lid(0)] + acc_i_0[2 + lid(0)];
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i_0 (red_i_stage_1_0 depends on red_i_stage_0_0) */;
      if (lid(0) == 0)
        acc_i_0[0] = acc_i_0[0] + acc_i_0[1];
      barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i_0 (insn_0_0 depends on red_i_stage_1_0) */;
      b[0] = acc_i_0[0];
    }
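When comparing candidate fixes, it may help to tally the barriers mechanically rather than by eye. The sketch below is plain string processing on the generated device code; the helper names `count_local_barriers` and `barriers_by_variable` are hypothetical, and the second helper assumes the barrier comments keep the `/* for <variable> (...) */` shape seen in the listing above.

```python
import re
from collections import Counter

# Matches a local-memory barrier call, tolerating optional whitespace.
_BARRIER_RE = re.compile(r"barrier\s*\(\s*CLK_LOCAL_MEM_FENCE\s*\)")
# Additionally captures the variable named in the trailing comment.
_BARRIER_VAR_RE = re.compile(
    r"barrier\s*\(\s*CLK_LOCAL_MEM_FENCE\s*\)\s*/\*\s*for\s+(\w+)")


def count_local_barriers(device_code):
    """Return the number of local barriers in an OpenCL source string."""
    return len(_BARRIER_RE.findall(device_code))


def barriers_by_variable(device_code):
    """Tally barriers by the variable mentioned in each barrier comment."""
    return Counter(_BARRIER_VAR_RE.findall(device_code))


# Short excerpt in the style of the generated code above.
sample = """
acc_i[lid(0)] = 0;
barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i (insn_i_transfer depends on insn_i_init) */;
acc_i_0[lid(0)] = 0;
barrier(CLK_LOCAL_MEM_FENCE) /* for acc_i_0 (insn_0_i_transfer depends on insn_0_i_init) */;
"""
print(count_local_barriers(sample))
print(dict(barriers_by_variable(sample)))
```

Applied to the full listing, this tally should come to four barriers for `acc_i` and four for `acc_i_0`, eight in total.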