MEMO 8.52 KiB
    Tim Warburton committed
    Documentation Notes
    ^^^^^^^^^^^^^^^^^^^
    
    - Need to clarify fundamental difference between constants baked into code
      and things that remain variable. (ISL parameters, symbolic shapes)
    
    
    Things to consider
    ^^^^^^^^^^^^^^^^^^
    
    - Dependencies are pointwise for shared loop dimensions
      and global over non-shared ones (between dependent and ancestor)
    
    
    - multiple insns could fight over which iname gets local axis 0
      -> complicated optimization problem
    
    
    - Every loop in loopy is opened at most once.
      Too restrictive?
    
    
    Andreas Klöckner committed
    - Why do precomputes necessarily have to duplicate the inames?
    
      -> because that would be necessary for a sequential prefetch
    
    - Cannot do slab decomposition on inames that share a tag with
      other inames
      -> Is that reasonable?
    
    
    - Entering a loop means:
      - setting up conditionals related to it (slabs/bounds)
      - allowing loops nested inside to depend on loop state
    
    
    - Not using all hw loop dimensions causes an error, as
      is the case for variant 3 in the rank_one test.
    
    - Measure efficiency of corner cases
    
    
    - Loopy as a data model for implementing custom rewritings
    
    
    - We won't generate WAW barrier-needing dependencies
    
    - Loopy is semi-interactive.
    
    
    - Limitation: base index for parallel axes is 0.
    
    
    - Dependency on order of operations is ill-formed
    
    - Dependency on non-local global writes is ill-formed
    
    
    - No substitution rules allowed on lhs of insns
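
    The note above on dependencies being pointwise for shared loop dimensions
    and global over non-shared ones can be illustrated with a small
    enumeration. This is only a sketch in plain Python, not loopy code: the
    helper names ``instances`` and ``dependency_pairs`` and the dict-based
    representation of loop instances are assumptions made for illustration.

```python
from itertools import product


def instances(inames, extents):
    """All instances of an instruction nested in the given inames."""
    ranges = (range(extents[name]) for name in inames)
    return [dict(zip(inames, point)) for point in product(*ranges)]


def dependency_pairs(writer_inames, reader_inames, extents):
    """Pairs (writer instance, reader instance) that are ordered.

    Shared inames must agree (pointwise); inames that are not shared
    are unconstrained (global).
    """
    shared = set(writer_inames) & set(reader_inames)
    return [
        (w, r)
        for r in instances(reader_inames, extents)
        for w in instances(writer_inames, extents)
        if all(w[name] == r[name] for name in shared)
    ]


# Writer sits in loops (i, j), reader in loops (i, k); only i is shared.
pairs = dependency_pairs(("i", "j"), ("i", "k"), {"i": 2, "j": 3, "k": 2})
print(len(pairs))
```

    Each reader instance depends on every ``j`` instance of the writer, but
    only for its own value of ``i``.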
    
    
    To-do
    ^^^^^
    
    
    - Kernel fusion
    
    
    - when are link_inames, duplicate_inames safe?
    
    - Data implementation tags
    
      - turn base_indices into offset
    
      - vectorization
      - write_image()
    
    - Make tests run on GPUs
    
    
    - Test array access with modulo
    
    
    - Derive all errors from central hierarchy
    
    - Provide context for more errors?
    
    
    - Allow mixing computed and stored strides
    
    
    - applied_iname_rewrites tracking for prefetch footprints isn't bulletproof:
      old inames may still be around, so the rewrite may or may not have to be
      applied.
    
    
    - Group instructions by dependency/inames for scheduling, to
      increase sched. scalability
    
    
    - What if no universally valid precompute base index expression is found?
      (test_intel_matrix_mul with n = 6*16, e.g.?)
    
    - If finding a maximum proves troublesome, move parameters into the domain
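
    For the to-do item on grouping instructions for scheduling, one possible
    starting point is to bucket instructions by the set of inames they are
    nested in. The sketch below is plain Python and covers only the iname half
    of the idea (not dependencies); the function name ``group_by_inames`` and
    the instruction ids are hypothetical and not part of loopy.

```python
from collections import defaultdict


def group_by_inames(insn_inames):
    """Map each distinct iname set to the instruction ids that use it."""
    groups = defaultdict(list)
    for insn_id, inames in insn_inames.items():
        groups[frozenset(inames)].append(insn_id)
    return dict(groups)


# Hypothetical kernel: three instructions, two distinct loop nests.
example = {
    "init": ("i",),
    "update": ("i", "k"),
    "accumulate": ("k", "i"),
}
for inames, insns in group_by_inames(example).items():
    print(sorted(inames), insns)
```

    A scheduler could then make decisions once per group instead of once per
    instruction, which is the scalability gain the item hopes for.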
    
    
    Future ideas
    ^^^^^^^^^^^^
    
    - subtract_domain_lower_bound
    
    
    - Storage sharing for temporaries?
    
    
    - Kernel splitting (via what variables get computed in a kernel)
    
    
    - Put all OpenCL functions into mangler
    
    
    - Fuse: store/fetch elimination?
    
    
    - reg rolling
    
    - When duplicating inames, use iname aliases to relieve burden on isl
    
    
    - (Web) UI
    
    
    - Check for unordered (no-dependency) writes to the same location
    
    
    - Vanilla C string instructions?
    
    - Barriers for data exchanged via global vars?
    
    
    - Float4 joining on fetch/store?
    
    
    - Better for loop bound generation
      -> Try a triangular loop
    
    
    - Eliminate the first (pre-)barrier in a loop.
    
    - Generate automatic test against sequential code.
    
    - Reason about generated code, give user feedback on potential
      improvements.
    
    - Convolutions, Stencils
    
    - DMA engine threads?
    
    - Try, fix indirect addressing
    
    
    - Nested slab decomposition (in conjunction with conditional hoisting) could
      generate nested conditional code.
    
    
    - Better code for strides.
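
    Several notes in this memo refer to slab decomposition. As a reminder of
    the idea, the sketch below splits a blocked loop into a bulk slab, where
    every index is known to be in range, and a boundary slab that keeps the
    bounds check. It is hand-written Python under that assumed meaning of the
    term, not code generated by loopy; ``blocked_sum_with_slabs`` is a
    made-up name.

```python
def blocked_sum_with_slabs(data, block_size):
    """Sum `data` in blocks, checking bounds only in the last partial block."""
    n = len(data)
    n_full_blocks = n // block_size
    total = 0

    # Bulk slab: all inner indices are in range, so no conditional is needed.
    for outer in range(n_full_blocks):
        for inner in range(block_size):
            total += data[outer * block_size + inner]

    # Boundary slab: at most one partial block, guarded by a bounds check.
    if n % block_size:
        outer = n_full_blocks
        for inner in range(block_size):
            i = outer * block_size + inner
            if i < n:
                total += data[i]

    return total


print(blocked_sum_with_slabs(list(range(10)), 4))
```

    Nesting this split over two blocked loops yields the nested conditional
    code mentioned in the item on nested slab decomposition.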
    
    
    Dealt with
    ^^^^^^^^^^
    
    
    - How can one automatically generate something like microblocks?
      -> Some sort of axis-adding transform?
    
    
      extract_subst -> needs WalkMapper [actually fine as is]
      padding [DONE]
      replace make_unique_var_name [DONE]
      join_inames [DONE]
      duplicate_inames [DONE]
      split_iname [DONE]
      CSE [DONE]
    
    
    - rename iname
    
    
    - delete unused inames
    
    
    - Expose iname-duplicate-and-rename as a primitive.
    
    
    - make sure simple side effects work
    
    - Loop bounds currently may not depend on parallel dimensions
      Does it make sense to relax this?
    
    
    - Streamline argument specification
    
    
    - syntax for linear array access
    
    
    - Test divisibility constraints
    
    - Test join_inames
    
    
    - Divisibility, modulo, strides?
      -> Tested, gives correct (but suboptimal) code.
    
    - *_dimension -> *_iname
    
    
    - Use gists (why do disjoint sets arise?)
    
    
    - Automatically verify that all array access is within bounds.
    
    
    - : (as in, Matlab full-slice) in prefetches
    
    
    - Add dependencies after the fact
    
    - Scalar insn priority
    
    
    - ScalarArg is a bad name
      -> renamed to ValueArg
    
    
    - What to do about constants in codegen? (...f suffix, complex types)
      -> dealt with by type contexts
    
    
    - relating to Multi-Domain [DONE]
      - Reenable codegen sanity check. [DONE]
    
    
      - Incorporate loop-bound-mediated iname dependencies into domain
        parenthood. [DONE]
    
    
      - Make sure that variables that enter into loop bounds are only written
        exactly once. [DONE]
    
      - Make sure that loop bound writes are scheduled before the relevant
        loops. [DONE]
    
    
    - add_prefetch tagging
    
    
    - nbody GPU
      -> pending better prefetch spec
      - Prefetch by sample access
    
    
    - How is intra-instruction ordering of ILP loops going to be determined?
      (taking into account that it could vary even per-instruction?)
    
    - Sharing of checks across ILP instances
    
    
    - Differentiate ilp.unr from ilp.seq
    
    
    - Allow complex-valued arithmetic, despite CL's best efforts.
    
    
    - "No schedule found" debug help:
    
      - Find longest dead-end
      - Automatically report on what hinders progress there
    
    
    - CSE should be more like variable assignment
    
    
    - Deal with equality constraints.
      (These arise, e.g., when partitioning a loop of length 16 into 16s.)
    
    - dim_{min,max} caching
    
    - Exhaust the search for a no-boost solution first, before looking
      for a schedule with boosts.
    
    
    - Pick not just axis 0, but all axes by lowest available stride
    
    
    - Scheduler tries too many boostability-related options
    
    
    - Automatically generate testing code vs. sequential.
    
    
    - If isl can prove that all operands are positive, may use '/' instead of
      'floor_div'.
    
    
    - For forced workgroup sizes: check that at least one iname
      maps to them.
    
    
    - variable shuffle detection
      -> will need unification
    
    
    - Restrict-to-sequential and tagging have nothing to do with each other.
      -> Removed SequentialTag and turned it into a separate computed kernel
      property.
    
    
    - Just touching a variable written to by a non-idempotent
      instruction makes that instruction also not idempotent
      -> Idempotent renamed to boostable.
      -> Done.
    
    
    - Give the user control over which reduction inames are
      duplicated.
    
    
    - assert dependencies <= parent_inames in loopy/__init__.py
      -> Yes, this must be the case.
    
    - Give a good error message if a parameter assignment in get_problems()
      is missing.
    
    
    - Slab decomposition for ILP
      -> I don't think that's possible.
    
    
    - It is hard to understand error messages that refer to instructions that
      are generated during preprocessing.

      -> Expose preprocessing to the user so she can inspect the preprocessed
         kernel.
    
    
    - Which variables need to be duplicated for ILP?
      -> Only reduction
    
    
    - implemented_domain may end up being smaller than requested in cse
      evaluations--check that!
    
    
    - Allow prioritization of loops in scheduling.
    
    - Make axpy better.
    
    
    - Screwy lower bounds in slab decomposition
    
    
    - reimplement add_prefetch
    
    
    - Flag, exploit idempotence
    
    - Some things involving CSEs might be impossible to schedule
      a[i,j] = cse(b[i]) * cse(c[j])
    
    - Be smarter about automatic local axis choice
      -> What if we run out of axes?
    
    
    - Implement condition hoisting
      (needed, e.g., by slab decomposition)
    
    
    - Check for non-use of hardware axes
    
    
    - Slab decomposition for parallel dimensions
      - implement at the outermost nesting level regardless
      - bound *all* tagged inames
      - can't slab inames that share tags with other inames (for now)
    
    
    - Make syntax for iname dependencies
    
    - make syntax for insn dependencies
    
    - Implement get_problems()
    
    
    - CSE iname duplication might be unnecessary?
      (don't think so: It might be desired to do a full fetch before a mxm k loop
      even if that requires going iterative.)
    
    
    - Reduction needs to know a neutral element
    
    - Types of reduction variables?
    
    
    - Generalize reduction to be over multiple variables
    
    
    - duplicate_dimensions can be implemented without having to muck around 
      with individual constraints:
      - add_dims
      - move_dims
      - intersect
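
    One item in this section notes that '/' may replace 'floor_div' when isl
    can prove that all operands are positive. The reason is that C-style
    integer division truncates toward zero, while floor division rounds toward
    negative infinity, and the two only differ when the operands have
    opposite signs. The sketch below checks this in plain Python; ``c_div``
    is a hypothetical helper that emulates C's truncating division.

```python
def c_div(a, b):
    """Integer division that truncates toward zero, as '/' does in C."""
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient


# For non-negative numerators and positive denominators both agree.
assert all(
    c_div(a, b) == a // b
    for a in range(0, 50)
    for b in range(1, 10)
)

# With a negative operand they differ, so floor_div is still required.
print(c_div(-7, 2), -7 // 2)
```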
    
    
    
    Should a dependency on an iname be forced in a CSE?
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
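    Rows give the kind of iname; the second column says whether the dependency
    is forced (y), not forced (n), or rejected with an error (Err).

    Local var:

    iname | forced?
    l     | n
    g     | y
    dl    | Err
    d     | Err

    Private var:

    iname | forced?
    l     | y
    g     | y
    dl    | Err
    d     | Err

    dg: invalid -> error

    d: is duplicate
    l: is tagged as local idx
    g: is tagged as group idx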
    
    Raise error if dl is targeting a private variable, regardless of whether it's
    a dependency or not.
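
    The decision tables above can be written down as a small lookup. This is
    only a sketch of the rule as stated in this section, in plain Python; the
    names ``FORCE_DEPENDENCY``, ``ERROR_KINDS`` and
    ``force_iname_dependency`` are hypothetical and do not exist in loopy.

```python
# Whether a dependency on the iname is forced, by variable kind and iname kind.
FORCE_DEPENDENCY = {
    "local": {"l": False, "g": True},
    "private": {"l": True, "g": True},
}

# Iname kinds that are an error for both variable kinds (d, dl, and dg).
ERROR_KINDS = {"d", "dl", "dg"}


def force_iname_dependency(var_kind, iname_kind):
    """Return True/False per the tables, or raise for the 'Err' entries."""
    if iname_kind in ERROR_KINDS:
        raise ValueError(
            f"iname kind {iname_kind!r} is not allowed for a {var_kind} variable"
        )
    return FORCE_DEPENDENCY[var_kind][iname_kind]


print(force_iname_dependency("local", "l"))    # local-tagged iname, local var
print(force_iname_dependency("private", "l"))  # local-tagged iname, private var
```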