    TODO list
    ^^^^^^^^^
    
    For writeup:
    ------------
    TODO: Reimplement forced lengths
    TODO: Try, fix reg. prefetch (DG example) / CSEs
      ILP and reg. prefetch interact!
    TODO: Custom reductions per red. axis
    TODO: Functions
    TODO: Common subexpressions
    TODO: Array common subexpressions (shared and private!)
    TODO: ILP arrays
    FIXME: support non-reductive dimensions (what did I mean here?)
    FIXME: write names should be assigned during scheduling
    FIXME: screwy lower bounds in ILP
    FIXME: Leading syncthreads elimination
    
    TODO: Divisibility
    TODO: Try, fix indirect addressing
    
    TODO: Implement GT200 matmul, Fermi matmul, DG
    TODO: DMA engine threads?
    TODO: Deal with equalities that crop up.
    TODO: Better user feedback.
    
    Later:
    ------
    TODO: Try different kernels
    TODO:   - Tricky: Convolution, Stencil
    TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
    TODO: implement efficient ceil_div? (as opposed to floor_div)
    TODO: why are corner cases inefficient?
    TODO: Use gists (why do disjoint sets arise?)
    TODO: variable shuffle detection
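
For the ceil_div item above, a minimal sketch of one way to get a ceiling division out of a floor division. The name `ceil_div` and the Python setting are assumptions for illustration; this says nothing about how the generated code should do it or whether it would be efficient there.

```python
def ceil_div(a, b):
    """Ceiling of a/b for integers, built from Python's floor division.

    Negating the numerator turns floor into ceiling: ceil(x) == -floor(-x).
    Assumes b != 0.
    """
    return -(-a // b)


# Number of blocks of size 16 needed to cover 100 items.
n_blocks = ceil_div(100, 16)
```

The negation trick avoids the usual `(a + b - 1) // b` form, which only holds for positive `b`.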
    
    Things to consider
    ^^^^^^^^^^^^^^^^^^
    
    - Dependencies are pointwise for shared loop dimensions
      and global over non-shared ones (between dependent and ancestor)
    
    
    - multiple insns could fight over which iname gets local axis 0
      -> complicated optimization problem
    
    
    - Every loop in loopy is opened at most once.
      Too restrictive?
    
    - Loop bounds currently may not depend on parallel dimensions
      Does it make sense to relax this?
    
    
    - implemented_domain may end up being smaller than requested in cse
      evaluations--check that!
    
    
    - Parallel dimension splitting/merging via tags
    
    
    - FIXME: Deal with insns losing a seq iname dep in a CSE realization
    
    - reimplement add_prefetch
    
    - user interface for dim length prescription
    
    
    - How to determine which variables need to be duplicated for ILP?
    
      -> Reduction
      -> CSEs?
    
    
    - Slab decomposition for parallel dimensions
    
      - implement at the outermost nesting level regardless
      - bound *all* tagged inames
    
    
    - Sharing of checks across ILP instances
    
    
    - Some things involving CSEs might be impossible to schedule
      a[i,j] = cse(b[i]) * cse(c[j])
    
    - Flag, exploit idempotence
    
    - Implement insert_parallel_dim_check_points
      (but first: find a kernel that needs it)
    
    
    Dealt with
    ^^^^^^^^^^
    
    
    - Make syntax for iname dependencies
    
    - make syntax for insn dependencies
    
    - Implement get_problems()
    
    
    - CSE iname duplication might be unnecessary?
      (don't think so: It might be desired to do a full fetch before a mxm k loop
      even if that requires going iterative.)
    
    
    - Reduction needs to know a neutral element
    
    - Types of reduction variables?
    
    
    - Generalize reduction to be over multiple variables
    
    
    Should a dependency on an iname be forced in a CSE?
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    Local var:
    
    l  | n
    g  | y
    dl | Err
    d  | Err
    
    Private var:
    
    l  | y
    g  | y
    dl | Err
    d  | Err
    
    dg: invalid -> error
    
    d: is duplicate
    l: is tagged as local idx
    g: is tagged as group idx
    
    Raise error if dl is targeting a private variable, regardless of whether it's
    a dependency or not.
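
The two tables above can be read as a lookup from (variable kind, tag) to a decision. The following is only a sketch of that reading: `FORCE_DEP` and `force_iname_dep` are hypothetical names, and the mapping of y/n/Err to True/False/error is an interpretation of the tables, not existing code.

```python
# Transcription of the tables: True = force the dependency (y),
# False = do not force it (n), None = error (Err).
FORCE_DEP = {
    ("local", "l"): False,
    ("local", "g"): True,
    ("local", "dl"): None,
    ("local", "d"): None,
    ("private", "l"): True,
    ("private", "g"): True,
    ("private", "dl"): None,
    ("private", "d"): None,
}


def force_iname_dep(var_kind, tag):
    """Return whether a CSE must depend on an iname with the given tag.

    var_kind is "local" or "private"; tag is one of "l", "g", "dl", "d", "dg".
    Raises ValueError for the combinations the tables mark as errors.
    """
    if tag == "dg":
        raise ValueError("dg is invalid")
    if (var_kind, tag) not in FORCE_DEP:
        raise ValueError("unknown combination: %s/%s" % (var_kind, tag))
    decision = FORCE_DEP[(var_kind, tag)]
    if decision is None:
        raise ValueError("%s iname not allowed for %s variable" % (tag, var_kind))
    return decision
```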
    
    
    How to represent the schedule
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    - Focus everything on instructions
      - Each instruction can have its own interpretation of global/local ids.
    - Loop variables/splits and such are and remain global
    - What about grouped dimensions?
    - UniqueTag is the wrong idea! (not really--it's ok per-insn)
    
    Scheduling:
    - Find insns whose dependencies are satisfied
    - Find maximally shareable loop
    - Open that one
    - For that opened loop, check if an available insn can run
      - If not, open another loop
      - Else, schedule that instruction
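
The scheduling steps above could be sketched roughly as follows. This is a toy model under stated assumptions, not the actual scheduler: instructions are plain (dependencies, inames) pairs, "maximally shareable" is taken to mean "wanted by the most ready instructions", an instruction may run only when exactly its inames are open, and each loop is opened at most once. The name `greedy_schedule` is hypothetical.

```python
from collections import Counter


def greedy_schedule(insns):
    """insns maps name -> (set of insn names it depends on, set of inames)."""
    done, open_loops, closed, sched = set(), [], set(), []
    while len(done) < len(insns):
        # Find insns whose dependencies are satisfied.
        ready = [n for n, (deps, _) in insns.items()
                 if n not in done and deps <= done]
        active = set(open_loops)
        # Check whether a ready insn can run inside the currently open loops.
        runnable = [n for n in ready if insns[n][1] == active]
        if runnable:
            sched.append(("run", runnable[0]))
            done.add(runnable[0])
            continue
        # Otherwise count, per unopened iname, how many ready insns want it.
        wanted = Counter(
            iname
            for n in ready if active <= insns[n][1]
            for iname in insns[n][1] - active - closed)
        if wanted:
            # Open the loop shared by the most ready insns (ties: by name).
            best = max(sorted(wanted), key=wanted.get)
            open_loops.append(best)
            sched.append(("open", best))
        elif open_loops:
            # Nothing fits inside the innermost loop any more: close it.
            inner = open_loops.pop()
            closed.add(inner)
            sched.append(("close", inner))
        else:
            raise RuntimeError("no schedulable instruction")
    sched.extend(("close", iname) for iname in reversed(open_loops))
    return sched


example = {
    "init": (set(), {"i"}),
    "update": ({"init"}, {"i", "k"}),
    "store": ({"update"}, {"i"}),
}
result = greedy_schedule(example)
```

For the example, the sketch opens `i`, runs `init`, opens `k`, runs `update`, closes `k`, runs `store`, and closes `i`. Because a closed loop is never reopened, the model fails with an error on kernels that would need the same loop twice, which is the restriction questioned under "Things to consider".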