Documentation Notes
^^^^^^^^^^^^^^^^^^^

- Need to clarify the fundamental difference between constants baked into
  code and things that remain variable. (ISL parameters, symbolic shapes)

Things to consider
^^^^^^^^^^^^^^^^^^

- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)
- multiple insns could fight over which iname gets local axis 0
  -> complicated optimization problem
- Every loop in loopy is opened at most once.
  Too restrictive?
- Why do precomputes necessarily have to duplicate the inames?
  -> because that would be necessary for a sequential prefetch
- Cannot do slab decomposition on inames that share a tag with
  other inames
  -> Is that reasonable?
- Entering a loop means:

  - setting up conditionals related to it (slabs/bounds)
  - allowing loops nested inside to depend on loop state

- Not using all hw loop dimensions causes an error, as
  is the case for variant 3 in the rank_one test.
- Measure efficiency of corner cases
- Loopy as a data model for implementing custom rewritings
- We won't generate WAW barrier-needing dependencies
  from one instruction to itself.
- Loopy is semi-interactive.
- Limitation: base index for parallel axes is 0.
- Dependency on order of operations is ill-formed
- Dependency on non-local global writes is ill-formed
- No substitution rules allowed on lhs of insns

To-do
^^^^^

- Kernel fusion
- when are link_inames, duplicate_inames safe?
- rename IndexTag -> InameTag
- Data implementation tags

  - turn base_indices into offset
  - vectorization
  - write_image()
  - change_arg_to_image (test!)

- Make tests run on GPUs
- Test array access with modulo
- Derive all errors from central hierarchy
- Provide context for more errors?
- Allow mixing computed and stored strides

Fixes:

- applied_iname_rewrites tracking for prefetch footprints isn't bulletproof:
  old inames may still be around, so the rewrite may or may not have to be
  applied.
- Group instructions by dependency/inames for scheduling, to
  increase sched. scalability
- What if no universally valid precompute base index expression is found?
  (test_intel_matrix_mul with n = 6*16, e.g.?)
- If finding a maximum proves troublesome, move parameters into the domain

Future ideas
^^^^^^^^^^^^

- subtract_domain_lower_bound
- Storage sharing for temporaries?
- Kernel splitting (via what variables get computed in a kernel)
- Put all OpenCL functions into mangler
- Fuse: store/fetch elimination?
- Array language
- reg rolling
- When duplicating inames, use iname aliases to relieve burden on isl
- (Web) UI
- Check for unordered (no-dependency) writes to the same location
- Vanilla C string instructions?
- Barriers for data exchanged via global vars?
- Float4 joining on fetch/store?
- Better for loop bound generation
  -> Try a triangular loop
- Eliminate the first (pre-)barrier in a loop.
- Generate automatic test against sequential code.
- Reason about generated code, give user feedback on potential
  improvements.
- Convolutions, Stencils
- DMA engine threads?
- Try, fix indirect addressing
- Nested slab decomposition (in conjunction with conditional hoisting) could
  generate nested conditional code.
- Better code for strides.

Dealt with
^^^^^^^^^^

- How can one automatically generate something like microblocks?
  -> Some sort of axis-adding transform?
- RuleAwareIdentityMapper

  - extract_subst -> needs WalkMapper [actually fine as is]
  - padding [DONE]
  - replace make_unique_var_name [DONE]
  - join_inames [DONE]
  - duplicate_inames [DONE]
  - split_iname [DONE]
  - CSE [DONE]

- rename iname
- delete unused inames
- Expose iname-duplicate-and-rename as a primitive.
- make sure simple side effects work
- Loop bounds currently may not depend on parallel dimensions.
  Does it make sense to relax this?
- Streamline argument specification
- syntax for linear array access
- Test divisibility constraints
- Test join_inames
- Divisibility, modulo, strides?
  -> Tested, gives correct (but suboptimal) code.
- ``*_dimension`` -> ``*_iname``
- Use gists (why do disjoint sets arise?)
- Automatically verify that all array access is within bounds.
- ":" (as in, Matlab full-slice) in prefetches
- Add dependencies after the fact
- Scalar insn priority
- ScalarArg is a bad name
  -> renamed to ValueArg
- What to do about constants in codegen? (...f suffix, complex types)
  -> dealt with by type contexts
- relating to Multi-Domain [DONE]

  - Reenable codegen sanity check. [DONE]
  - Incorporate loop-bound-mediated iname dependencies into domain
    parenthood. [DONE]
  - Make sure that variables that enter into loop bounds are only written
    exactly once. [DONE]
  - Make sure that loop bound writes are scheduled before the relevant
    loops. [DONE]

- add_prefetch tagging
- nbody GPU
  -> pending better prefetch spec
- Prefetch by sample access
- How is intra-instruction ordering of ILP loops going to be determined?
  (taking into account that it could vary even per-instruction?)
- Sharing of checks across ILP instances
- Differentiate ilp.unr from ilp.seq
- Allow complex-valued arithmetic, despite CL's best efforts.
- "No schedule found" debug help:

  - Find longest dead-end
  - Automatically report on what hinders progress there

- CSE should be more like variable assignment
- Deal with equality constraints.
  (These arise, e.g., when partitioning a loop of length 16 into 16s.)
- dim_{min,max} caching
- Exhaust the search for a no-boost solution first, before looking
  for a schedule with boosts.
- Pick not just axis 0, but all axes by lowest available stride
- Scheduler tries too many boostability-related options
- Automatically generate testing code vs. sequential.
- If isl can prove that all operands are positive, may use '/' instead of
  'floor_div'.
- For forced workgroup sizes: check that at least one iname
  maps to them.
- variable shuffle detection
  -> will need unification
- Dimension joining
- user interface for dim length prescription
- Restrict-to-sequential and tagging have nothing to do with each other.
  -> Removed SequentialTag and turned it into a separate computed kernel
  property.
- Just touching a variable written to by a non-idempotent
  instruction makes that instruction also not idempotent
  -> Idempotent renamed to boostable.
  -> Done.
- Give the user control over which reduction inames are
  duplicated.
- assert dependencies <= parent_inames in loopy/__init__.py
  -> Yes, this must be the case.
  -> If you include reduction inames.
- Give a good error message if a parameter assignment in get_problems()
  is missing.
- Slab decomposition for ILP
  -> I don't think that's possible.
- It is hard to understand error messages that refer to instructions that
  are generated during preprocessing.
  -> Expose preprocessing to the user so she can inspect the preprocessed
  kernel.
- Which variables need to be duplicated for ILP?
  -> Only reduction
- implemented_domain may end up being smaller than requested in cse
  evaluations--check that!
- Allow prioritization of loops in scheduling.
- Make axpy better.
- Screwy lower bounds in slab decomposition
- reimplement add_prefetch
- Flag, exploit idempotence
- Some things involving CSEs might be impossible to schedule:
  a[i,j] = cse(b[i]) * cse(c[j])
- Be smarter about automatic local axis choice
  -> What if we run out of axes?
- Implement condition hoisting
  (needed, e.g., by slab decomposition)
- Check for non-use of hardware axes
- Slab decomposition for parallel dimensions

  - implement at the outermost nesting level regardless
  - bound *all* tagged inames
  - can't slab inames that share tags with other inames (for now)

- Make syntax for iname dependencies
- make syntax for insn dependencies
- Implement get_problems()
- CSE iname duplication might be unnecessary?
  (don't think so: It might be desired to do a full fetch before a mxm
  k loop even if that requires going iterative.)
- Reduction needs to know a neutral element
- Types of reduction variables?
- Generalize reduction to be over multiple variables
- duplicate_dimensions can be implemented without having to muck around
  with individual constraints:

  - add_dims
  - move_dims
  - intersect

Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Local var:

====  =======
Tag   Forced?
====  =======
l     n
g     y
dl    Err
d     Err
====  =======

Private var:

====  =======
Tag   Forced?
====  =======
l     y
g     y
dl    Err
d     Err
====  =======

dg: Invalid -> error

- d: is duplicate
- l: is tagged as local idx
- g: is tagged as group idx

Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
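
The decision table above can be written down as a plain lookup. The following is only a sketch of the table as noted here, not code from loopy: the function name, the dictionary, and the "local"/"private" strings are all hypothetical, and it assumes that "dl" and "dg" stand for combinations of the legend entries.

```python
# Hypothetical encoding of the CSE dependency table; none of these names
# are part of loopy's API.

# (variable kind, iname tag) -> should the dependency be forced?
FORCE_DEPENDENCY = {
    ("local", "l"): False,
    ("local", "g"): True,
    ("private", "l"): True,
    ("private", "g"): True,
}

# Rows marked "Err" in both tables, plus "dg", which is always invalid.
ERROR_TAGS = {"dl", "d", "dg"}


def force_cse_dependency(var_kind, iname_tag):
    """Return True if a dependency on the iname is forced, else False.

    Raises ValueError for every combination the table marks as an error
    and for inputs the table does not cover.
    """
    if var_kind not in ("local", "private"):
        raise ValueError("unknown variable kind: %r" % (var_kind,))
    if iname_tag in ERROR_TAGS:
        raise ValueError(
            "tag %r is an error for %s variables" % (iname_tag, var_kind))
    try:
        return FORCE_DEPENDENCY[(var_kind, iname_tag)]
    except KeyError:
        raise ValueError("tag %r is not covered by the table" % (iname_tag,))
```

Keeping the error rows in one set makes the last rule explicit: "dl" on a private variable raises before any lookup happens, whether or not a dependency exists.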