TODO list
^^^^^^^^^

For writeup:
------------

TODO: Reimplement forced lengths
TODO: Try, fix reg. prefetch (DG example) / CSEs
      ILP and reg. prefetch interact!
TODO: Custom reductions per red. axis
TODO: Functions
TODO: Common subexpressions
TODO: Array common subexpressions (shared and private!)
TODO: ILP arrays
FIXME: Support non-reductive dimensions (what did I mean here?)
FIXME: Write names should be assigned during scheduling
FIXME: Screwy lower bounds in ILP
FIXME: Leading syncthreads elimination
TODO: Divisibility
TODO: Try, fix indirect addressing
TODO: Implement GT200 matmul, Fermi matmul, DG
TODO: DMA engine threads?
TODO: Deal with equalities that crop up.
TODO: Better user feedback.

Later:
------

TODO: Try different kernels
      - Tricky: Convolution, Stencil
TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
TODO: Implement efficient ceil_div? (as opposed to floor_div;
      for positive b, ceil_div(a, b) == floor_div(a + b - 1, b))
TODO: Why are corner cases inefficient?
TODO: Use gists (why do disjoint sets arise?)
TODO: Variable shuffle detection

Things to consider
^^^^^^^^^^^^^^^^^^

- implemented_domain may end up being smaller than requested
  in cse evaluations--check that!

- Auto tag assignment depends on known work group size.

- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor).

- Parallel dimension splitting/merging via tags

- Implement get_problems()

- FIXME: Deal with insns losing a seq iname dep in a CSE realization:

      a <- cse(reduce(stuff))

- Every loop in loopy is opened at most once.

- Syntax to declare insn deps

- Reimplement add_prefetch

- User interface for dim length prescription

- Make syntax for explicit loop dependencies

- Multiple insns could fight over which iname gets local axis 0
  -> complicated optimization problem

- How to determine which variables need to be duplicated for ILP?
  -> only reduction

- Slab decomposition for parallel dimensions
  - implement at the outermost nesting level regardless

- Bound *all* tagged inames.

- Sharing of checks across ILP instances

- Loop bounds currently may not depend on parallel dimensions.
  Does it make sense to relax this?

Dealt with
^^^^^^^^^^

- CSE iname duplication might be unnecessary? (Don't think so: it might be
  desired to do a full fetch before a mxm k loop, even if that requires
  going iterative.)

- Reduction needs to know a neutral element.

- Types of reduction variables?

- Generalize reduction to be over multiple variables.

Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Local var:

    l  | n
    g  | y
    dl | Err
    d  | Err

Private var:

    l  | y
    g  | y
    dl | Err
    d  | Err

Legend:

    l:  iname is tagged as local idx
    g:  iname is tagged as group idx
    d:  iname is duplicate
    dl: duplicate tagged as local idx
    dg: duplicate tagged as group idx -> invalid, raise error
        (hence no table row)

Raise an error if dl targets a private variable, regardless of whether
it is a dependency or not.

How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Focus everything on instructions.
- Each instruction can have its own interpretation of global/local ids.
  - Loop variables/splits and such are and remain global.
- What about grouped dimensions?
- UniqueTag is the wrong idea! (Not really--it's OK per-insn.)

Scheduling (see the sketch below):

- Find insns whose dependencies are satisfied.
- Find the maximally shareable loop.
- Open that one.
- For that opened loop, check if an available insn can run.
- If not, open another loop.
- Else, schedule that instruction.
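A minimal Python sketch of that greedy loop, under stated assumptions: the
Insn container, its fields, and greedy_schedule are hypothetical stand-ins,
not loopy's actual data structures, and "maximally shareable" is approximated
by counting how many ready insns want each not-yet-open iname. Note that,
unlike the invariant recorded under "Things to consider", this sketch may
close and later reopen the same loop::

    # Hypothetical sketch of the greedy scheduling loop above; the names
    # below are illustrative, not loopy's actual API.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Insn:
        id: str
        inames: frozenset                     # loops the insn must run inside
        depends_on: frozenset = frozenset()   # ids that must be scheduled first

    def greedy_schedule(insns):
        sched = []        # ("open", iname), ("run", insn id), ("close", iname)
        done = set()      # ids of already-scheduled insns
        open_loops = []   # stack of currently open inames

        while len(done) < len(insns):
            # find insns whose dependencies are satisfied
            avail = [i for i in insns
                     if i.id not in done and i.depends_on <= done]
            if not avail:
                raise RuntimeError("cyclic insn dependencies")

            opened = set(open_loops)

            # schedule an available insn whose loop nest is exactly open
            runnable = [i for i in avail if i.inames == opened]
            if runnable:
                sched.append(("run", runnable[0].id))
                done.add(runnable[0].id)
                continue

            # among insns that still fit under the open loops, open the
            # not-yet-open iname shared by the most of them (the
            # "maximally shareable loop")
            fitting = [i for i in avail if opened <= i.inames]
            if fitting:
                count = {}
                for i in fitting:
                    for iname in i.inames - opened:
                        count[iname] = count.get(iname, 0) + 1
                iname = max(count, key=count.get)
                open_loops.append(iname)
                sched.append(("open", iname))
            else:
                # nothing fits under the current nest: close the innermost
                # loop and retry (this is where the sketch may violate the
                # "every loop opened at most once" invariant)
                sched.append(("close", open_loops.pop()))

        while open_loops:
            sched.append(("close", open_loops.pop()))
        return sched

    if __name__ == "__main__":
        print(greedy_schedule([
            Insn("fetch", frozenset({"i"})),
            Insn("mul", frozenset({"i", "j"}), frozenset({"fetch"})),
        ]))
        # [('open', 'i'), ('run', 'fetch'), ('open', 'j'),
        #  ('run', 'mul'), ('close', 'j'), ('close', 'i')]

Requiring an insn's inames to match the open nest exactly is the simplifying
choice here; a real scheduler would also have to decide which insn to prefer
when several are runnable (cf. the local-axis-0 conflict noted above).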