TODO list
^^^^^^^^^

Immediately:
------------

TODO: Imitate codegen bulk slab handling in bulk slab trials

For writeup:
------------

TODO: Reimplement forced lengths

TODO: Try, fix reg. prefetch (DG example) / CSEs
  ILP and reg. prefetch interact!

TODO: Custom reductions per red. axis

TODO: Functions

TODO: Common subexpressions

TODO: Array common subexpressions (shared and private!)

TODO: ILP arrays

FIXME: support non-reductive dimensions (what did I mean here?)

FIXME: write names should be assigned during scheduling

FIXME: screwy lower bounds in ILP

FIXME: Leading syncthreads elimination

TODO: Divisibility

TODO: Try, fix indirect addressing

TODO: Implement GT200 matmul, Fermi matmul, DG

TODO: DMA engine threads?

TODO: Deal with equalities that crop up.

TODO: Better user feedback.

Later:
------

TODO: Try different kernels

TODO:   - Tricky: Convolution, Stencil

TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)

TODO: implement efficient ceil_div? (as opposed to floor_div)

TODO: why are corner cases inefficient?

TODO: Use gists (why do disjoint sets arise?)

TODO: variable shuffle detection

Things to consider
^^^^^^^^^^^^^^^^^^

- implemented_domain may end up being smaller than requested in cse
  evaluations--check that!

- Instructions must agree on all iname tags except the parallel ones

- Auto tag assignment depends on known work group size

- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)

- Parallel dimension splitting/merging via tags

- Generalize reduction to be over multiple variables

- Implement get_problems()

Dealt with
^^^^^^^^^^

- Reduction needs to know a neutral element

- Types of reduction variables?

How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Focus everything on instructions
- Each instruction can have its own interpretation of global/local ids.
- Loop variables/splits and such are and remain global
- What about grouped dimensions?
- UniqueTag is the wrong idea! (not really--it's ok per-insn)

Scheduling:

- Find insns whose dependencies are satisfied
- Find maximally shareable loop
- Open that one
- For that opened loop, check if an available insn can run
- If not, open another loop
- Else, schedule that instruction
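
The scheduling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's scheduler. The ``Insn`` record, the ``schedule`` function and the step tuples are all hypothetical names. "Maximally shareable" is read here as "the loop used by the most ready instructions", which is an assumption. The sketch also adds a "close" step, which the notes do not mention, so that an instruction needing fewer loops can run once deeper loops are finished. The pointwise/global dependency distinction is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record (not taken from the notes)."""
    name: str
    inames: frozenset                    # loops the insn must sit inside
    depends_on: frozenset = frozenset()  # names of prerequisite insns


def schedule(insns):
    """Greedy sketch; returns a list of ("open" | "close" | "run", name)."""
    done = set()       # names of insns already scheduled
    open_loops = []    # stack of currently open loop variables
    steps = []
    remaining = list(insns)

    while remaining:
        # Find insns whose dependencies are satisfied.
        ready = [i for i in remaining if i.depends_on <= done]
        if not ready:
            raise ValueError("unsatisfiable dependencies")

        # Check if an available insn can run inside the open loops.
        current = frozenset(open_loops)
        runnable = [i for i in ready if i.inames == current]
        if runnable:
            insn = runnable[0]
            steps.append(("run", insn.name))
            done.add(insn.name)
            remaining.remove(insn)
            continue

        # Ready insns that could run after opening further loops.
        reachable = [i for i in ready if current <= i.inames]
        if not reachable:
            # Assumption: close the innermost loop when nothing fits in it.
            steps.append(("close", open_loops.pop()))
            continue

        # Open the loop shared by the most reachable insns.
        candidates = sorted({n for i in reachable for n in i.inames} - current)
        best = max(candidates,
                   key=lambda n: sum(n in i.inames for i in reachable))
        open_loops.append(best)
        steps.append(("open", best))

    # Close whatever is still open at the end.
    while open_loops:
        steps.append(("close", open_loops.pop()))
    return steps


if __name__ == "__main__":
    example = [
        Insn("a", frozenset({"i"})),
        Insn("b", frozenset({"i", "j"}), frozenset({"a"})),
        Insn("c", frozenset({"i"}), frozenset({"b"})),
    ]
    for step in schedule(example):
        print(step)
```

For the three example instructions, the sketch opens ``i``, runs ``a``, opens ``j``, runs ``b``, closes ``j``, runs ``c``, and finally closes ``i``. Ties between equally shared loops are broken alphabetically, which is an arbitrary choice.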