Forked from Andreas Klöckner / loopy
Documentation Notes
^^^^^^^^^^^^^^^^^^^

- Need to clarify fundamental difference between constants baked into code
  and things that remain variable. (ISL parameters, symbolic shapes)

Things to consider
^^^^^^^^^^^^^^^^^^

- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)

- multiple insns could fight over which iname gets local axis 0
  -> complicated optimization problem

- Every loop in loopy is opened at most once.
  Too restrictive?

- Loop bounds currently may not depend on parallel dimensions
  Does it make sense to relax this?

- Why do CSEs necessarily have to duplicate the inames?
  -> because that would be necessary for a sequential prefetch

- Cannot do slab decomposition on inames that share a tag with
  other inames
  -> Is that reasonable?

- Parallel dimension splitting/merging via tags
  -> unnecessary?

- Not using all hw loop dimensions causes an error, as
  is the case for variant 3 in the rank_one test.

- Measure efficiency of corner cases

- Loopy as a data model for implementing custom rewritings

- We won't generate barrier-needing WAW (write-after-write) dependencies
  from one instruction to itself.
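
The pointwise/global rule for dependencies noted in this section can be made concrete with a small enumeration. This is only a sketch of one reading of that note, not loopy code: `dependency_pairs` and the `extents` mapping are hypothetical names, and inames are modeled as plain integer ranges starting at 0.

```python
from itertools import product


def dependency_pairs(writer_inames, reader_inames, extents):
    """Enumerate (writer_point, reader_point) pairs (hypothetical helper).

    Shared inames must take the same value on both sides (pointwise);
    inames that appear on only one side range over all their values (global).
    """
    shared = [n for n in writer_inames if n in reader_inames]
    writer_only = [n for n in writer_inames if n not in shared]
    reader_only = [n for n in reader_inames if n not in shared]

    def points(names):
        # All value assignments for the given inames, as dicts.
        ranges = [range(extents[n]) for n in names]
        return [dict(zip(names, vals)) for vals in product(*ranges)]

    pairs = []
    for s in points(shared):
        for w in points(writer_only):
            for r in points(reader_only):
                pairs.append(({**s, **w}, {**s, **r}))
    return pairs


# Writer nested in (i, j), reader nested in (i, k): i is shared.
pairs = dependency_pairs(["i", "j"], ["i", "k"], {"i": 2, "j": 3, "k": 2})
```

For each fixed `i`, every reader point `(i, k)` ends up depending on every writer point `(i, j)`, which is the "global over non-shared" part.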

To-do
^^^^^

- variable shuffle detection
  -> will need unification

- Fix all tests

- Automatically generate testing code vs. sequential.

- Deal with equality constraints.
  (These arise, e.g., when partitioning a loop of length 16 into chunks of 16.)

- duplicate_dimensions can be implemented without having to muck around 
  with individual constraints:
  - add_dims
  - move_dims
  - intersect
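
Why equality constraints show up in the length-16 case mentioned above can be seen by brute force. The sketch below is plain Python, not isl or loopy; `split_iname_bounds` is a hypothetical helper that assumes the split is written as i = inner_length * i_outer + i_inner with 0 <= i < length.

```python
def split_iname_bounds(length, inner_length):
    """Return inclusive (lo, hi) bounds of i_outer and i_inner (hypothetical helper).

    Assumes i = inner_length * i_outer + i_inner with 0 <= i < length.
    """
    outer_vals = set()
    inner_vals = set()
    for i in range(length):
        i_outer, i_inner = divmod(i, inner_length)
        outer_vals.add(i_outer)
        inner_vals.add(i_inner)
    return (min(outer_vals), max(outer_vals)), (min(inner_vals), max(inner_vals))


outer_bounds, inner_bounds = split_iname_bounds(16, 16)

# Lower and upper bound coincide, so the two inequalities on i_outer
# collapse into the single equality i_outer == 0.
outer_is_equality = outer_bounds[0] == outer_bounds[1]
```

With a chunk size smaller than the loop length (say 4), the outer iname keeps a genuine range and no equality appears.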

Future ideas
^^^^^^^^^^^^

- Float4 joining on fetch/store?

- How can one automatically generate something like microblocks?

- Better for-loop bound generation
  -> Try a triangular loop

- Sharing of checks across ILP instances

- Eliminate the first (pre-)barrier in a loop.

- Generate automatic test against sequential code.

- Automatically verify that all array access is within bounds.

- Reason about generated code, give user feedback on potential
  improvements.

- Convolutions, Stencils

- DMA engine threads?

- Divisibility, modulo, strides?

- Try, fix indirect addressing

- Use gists (why do disjoint sets arise?)

- Nested slab decomposition (in conjunction with conditional hoisting) could
  generate nested conditional code.
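
As a reference point for the triangular-loop item above, here is a sketch of what good bound generation would ideally produce, compared with a bounding-box loop plus guard. Both functions are illustrative Python, not generated code, and the domain {(i, j) : 0 <= j <= i < n} is an assumed example.

```python
def triangular_points(n):
    """Iterate over {(i, j) : 0 <= j <= i < n} with tight loop bounds."""
    points = []
    for i in range(n):
        for j in range(i + 1):  # upper bound of j depends on i
            points.append((i, j))
    return points


def rectangular_with_guard(n):
    """Same domain, but via the bounding box and a per-iteration conditional."""
    points = []
    for i in range(n):
        for j in range(n):
            if j <= i:  # guard evaluated n*n times
                points.append((i, j))
    return points


tri = triangular_points(5)
boxed = rectangular_with_guard(5)
```

Both visit the same points in the same order; the first does so without any conditional in the loop body.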

Dealt with
^^^^^^^^^^

- Dimension joining

- user interface for dim length prescription

- Restrict-to-sequential and tagging have nothing to do with each other.
  -> Removed SequentialTag and turned it into a separate computed kernel
  property.

- Just touching a variable written to by a non-idempotent
  instruction makes that instruction also not idempotent
  -> Idempotent renamed to boostable.
  -> Done.

- Give the user control over which reduction inames are
  duplicated.

- assert dependencies <= parent_inames in loopy/__init__.py
  -> Yes, this must be the case.
  -> If you include reduction inames.

- Give a good error message if a parameter assignment in get_problems()
  is missing.

- Slab decomposition for ILP
  -> I don't think that's possible.

- It is hard to understand error messages that refer to instructions that
  are generated during preprocessing.

  -> Expose preprocessing to the user so she can inspect the preprocessed
     kernel.

- Which variables need to be duplicated for ILP?
  -> Only reduction

- implemented_domain may end up being smaller than requested in cse
  evaluations--check that!

- Allow prioritization of loops in scheduling.

- Make axpy better.

- Screwy lower bounds in slab decomposition

- reimplement add_prefetch

- Flag, exploit idempotence

- Some things involving CSEs might be impossible to schedule
  a[i,j] = cse(b[i]) * cse(c[j])

- Be smarter about automatic local axis choice
  -> What if we run out of axes?

- Implement condition hoisting
  (needed, e.g., by slab decomposition)

- Check for non-use of hardware axes

- Slab decomposition for parallel dimensions
  - implement at the outermost nesting level regardless
  - bound *all* tagged inames
  - can't slab inames that share tags with other inames (for now)

- Make syntax for iname dependencies

- make syntax for insn dependencies

- Implement get_problems()

- CSE iname duplication might be unnecessary?
  (don't think so: It might be desired to do a full fetch before a mxm k loop
  even if that requires going iterative.)

- Reduction needs to know a neutral element

- Types of reduction variables?

- Generalize reduction to be over multiple variables
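
On the neutral-element item above: a reduction needs a starting value that leaves the result unchanged, both for empty ranges and when partial results from separate chunks are combined. A minimal sketch in plain Python follows; the `NEUTRAL` table and `reduce_with_neutral` are hypothetical names, not loopy's reduction interface.

```python
import operator
from functools import reduce

# Operation name -> (binary operator, neutral element). Hypothetical table.
NEUTRAL = {
    "sum": (operator.add, 0),
    "product": (operator.mul, 1),
    "max": (max, float("-inf")),
    "min": (min, float("inf")),
}


def reduce_with_neutral(op_name, values):
    """Reduce values, starting from the operation's neutral element."""
    op, neutral = NEUTRAL[op_name]
    return reduce(op, values, neutral)


# Chunked reduction: each chunk starts from the neutral element, and the
# partial results are combined with the same operation.
data = [3, 1, 4, 1, 5, 9]
partials = [reduce_with_neutral("sum", data[:3]), reduce_with_neutral("sum", data[3:])]
chunked_total = reduce_with_neutral("sum", partials)
```

Starting each chunk from the neutral element is what makes the chunked result agree with the sequential one.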


Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

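Local var:

iname kind | force dependency?
-----------+------------------
l          | n
g          | y
dl         | Err
d          | Err

Private var:

iname kind | force dependency?
-----------+------------------
l          | y
g          | y
dl         | Err
d          | Err

dg: invalid -> error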

d: is duplicate
l: is tagged as local idx
g: is tagged as group idx

Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
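
The decision table above can be written down as a lookup. This is only a sketch: `force_iname_dependency` and the string keys are hypothetical, and it assumes that "y"/"n" answer the question in the section heading and that every "Err" entry (including dg) should raise.

```python
# (variable kind, iname kind) -> should the dependency be forced?
FORCE_DEPENDENCY = {
    ("local", "l"): False,
    ("local", "g"): True,
    ("private", "l"): True,
    ("private", "g"): True,
}

# Iname kinds that are errors for both local and private variables.
ERROR_KINDS = {"d", "dl", "dg"}


def force_iname_dependency(var_kind, iname_kind):
    """Return True/False per the table, or raise for the 'Err' entries."""
    if iname_kind in ERROR_KINDS:
        raise ValueError(
            f"iname kind {iname_kind!r} is not allowed for a {var_kind} variable"
        )
    return FORCE_DEPENDENCY[(var_kind, iname_kind)]
```

The check on `ERROR_KINDS` runs before the lookup, so a dl iname targeting a private variable is rejected whether or not it is a dependency.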