TODO list
^^^^^^^^^

For writeup:
------------
TODO: Reimplement forced lengths
TODO: Try, fix reg. prefetch (DG example) / CSEs
  ILP and reg. prefetch interact!
TODO: Custom reductions per red. axis
TODO: Functions
TODO: Common subexpressions
TODO: Array common subexpressions (shared and private!)
TODO: ILP arrays
FIXME: support non-reductive dimensions (what did I mean here?)
FIXME: write names should be assigned during scheduling
FIXME: screwy lower bounds in ILP
FIXME: Leading syncthreads elimination

TODO: Divisibility
TODO: Try, fix indirect addressing

TODO: Implement GT200 matmul, Fermi matmul, DG
TODO: DMA engine threads?
TODO: Deal with equalities that crop up.
TODO: Better user feedback.

Later:
------
TODO: Try different kernels
TODO:   - Tricky: Convolution, Stencil
TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
TODO: implement efficient ceil_div? (as opposed to floor_div)
TODO: why are corner cases inefficient?
TODO: Use gists (why do disjoint sets arise?)
TODO: variable shuffle detection
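
For the ceil_div item above: a minimal sketch of the usual integer trick,
avoiding floating point entirely (function names are illustrative, not an
existing API):

```python
def floor_div(a, b):
    # Rounds toward negative infinity; this is what Python's // already does.
    return a // b

def ceil_div(a, b):
    # ceil(a/b) via the identity ceil(a/b) == -floor(-a/b).
    # For positive b this is equivalent to (a + b - 1) // b, but the
    # negation form also behaves correctly for negative numerators.
    return -((-a) // b)
```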

Things to consider
^^^^^^^^^^^^^^^^^^

- implemented_domain may end up being smaller than requested in cse
  evaluations--check that!

- Auto tag assignment depends on known work group size

- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)

- Parallel dimension splitting/merging via tags

- Implement get_problems()

- FIXME: Deal with insns losing a seq iname dep in a CSE realization

  a <- cse(reduce(stuff))

- Every loop in loopy is opened at most once.

- Syntax to declare insn deps

- reimplement add_prefetch

- user interface for dim length prescription

- make syntax for explicit loop dependencies

- multiple insns could fight over which iname gets local axis 0
  -> complicated optimization problem

- How to determine which variables need to be duplicated for ILP?
  -> Only reduction

- Slab decomposition for parallel dimensions
  - implement at the outermost nesting level regardless
  - bound *all* tagged inames

- Sharing of checks across ILP instances

- Loop bounds currently may not depend on parallel dimensions
  Does it make sense to relax this?

Dealt with
^^^^^^^^^^

- CSE iname duplication might be unnecessary?
  (don't think so: It might be desired to do a full fetch before a mxm k loop
  even if that requires going iterative.)

- Reduction needs to know a neutral element

- Types of reduction variables?

- Generalize reduction to be over multiple variables

Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Local var:

tag | forced?
----+--------
l   | n
g   | y
dl  | Err
d   | Err

Private var:

tag | forced?
----+--------
l   | y
g   | y
dl  | Err
d   | Err

dg: invalid -> error

d: is duplicate
l: is tagged as local idx
g: is tagged as group idx

Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
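
The tables above can be encoded as a small decision function. This is only a
sketch of the rules as tabulated; the function name, tag strings, and error
handling are illustrative assumptions, not loopy's actual API:

```python
def force_cse_iname_dep(tag, var_kind):
    """Decide whether a CSE should force a dependency on an iname.

    tag:      'l' (local idx), 'g' (group idx), 'd' (duplicate),
              'dl' (duplicate local), 'dg' (duplicate group)
    var_kind: 'local' or 'private'
    Returns True/False, or raises for combinations marked Err/invalid.
    """
    if tag == "dg":
        # dg is invalid outright, per the note above
        raise ValueError("duplicate group index is invalid")
    if tag in ("d", "dl"):
        # d and dl error for both local and private variables;
        # dl targeting a private variable errors regardless of whether
        # it is a dependency or not
        raise ValueError(f"tag {tag!r} not allowed here")
    if tag == "g":
        # group index: dependency forced for both variable kinds
        return True
    if tag == "l":
        # local index: forced only for private variables
        return var_kind == "private"
    raise ValueError(f"unknown tag {tag!r}")
```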

How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Focus everything on instructions
  - Each instruction can have its own interpretation of global/local ids.
- Loop variables/splits and such are global and remain so
- What about grouped dimensions?
- UniqueTag is the wrong idea! (not really--it's ok per-insn)

Scheduling:
- Find insns whose dependencies are satisfied
- Find maximally shareable loop
- Open that one
- For that opened loop, check if an available insn can run
  - If not, open another loop
  - Else, schedule that instruction
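
The scheduling steps above can be sketched as a greedy loop. This is a toy
model under simplified assumptions (acyclic dependencies, each insn pinned to
an exact set of inames, "maximally shareable" approximated by a popularity
count); the data structures are invented for illustration and are not the
real implementation:

```python
def schedule(insns, deps, loops_of):
    """Greedy scheduler sketch (assumes acyclic dependencies).

    insns:    list of instruction names
    deps:     dict insn -> set of insns that must run first
    loops_of: dict insn -> set of inames the insn must be nested inside
    Returns a list of ('open', iname), ('run', insn), ('close', iname) events.
    """
    done, open_loops, events = set(), [], []
    while len(done) < len(insns):
        # find insns whose dependencies are satisfied
        ready = [i for i in insns if i not in done and deps[i] <= done]
        # schedule a ready insn that fits the currently open loop nest exactly
        runnable = [i for i in ready if loops_of[i] == set(open_loops)]
        if runnable:
            events.append(("run", runnable[0]))
            done.add(runnable[0])
        elif open_loops and not any(set(open_loops) <= loops_of[i] for i in ready):
            # no ready insn nests inside the current loops: close the innermost
            events.append(("close", open_loops.pop()))
        else:
            # open the not-yet-open loop shared by the most ready insns
            candidates = {n for i in ready for n in loops_of[i] - set(open_loops)}
            iname = max(candidates,
                        key=lambda n: sum(n in loops_of[i] for i in ready))
            open_loops.append(iname)
            events.append(("open", iname))
    while open_loops:
        events.append(("close", open_loops.pop()))
    return events
```

Note that each loop is closed and may be reopened here, whereas per the note
above, every loop in loopy is opened at most once; enforcing that constraint
is exactly what makes the real scheduling problem harder than this sketch.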