TODO list
^^^^^^^^^
For writeup:
------------
TODO: Reimplement forced lengths
TODO: Try and fix reg. prefetch (DG example) / CSEs
ILP and reg. prefetch interact!
TODO: Custom reductions per red. axis
TODO: Functions
TODO: Common subexpressions
TODO: Array common subexpressions (shared and private!)
TODO: ILP arrays
FIXME: support non-reductive dimensions (what did I mean here?)
FIXME: write names should be assigned during scheduling
FIXME: screwy lower bounds in ILP
FIXME: Leading syncthreads elimination
TODO: Divisibility
TODO: Try and fix indirect addressing
TODO: Implement GT200 matmul, Fermi matmul, DG
TODO: DMA engine threads?
TODO: Deal with equalities that crop up.
TODO: Better user feedback.
Later:
------
TODO: Try different kernels
TODO: - Tricky: Convolution, Stencil
TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
TODO: implement efficient ceil_div? (as opposed to floor_div)
TODO: why are corner cases inefficient?
TODO: Use gists (why do disjoint sets arise?)
TODO: variable shuffle detection
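For the ceil_div item above: one branch-free way to get ceiling division out of
Python's floor division (a sketch of the arithmetic identity, not necessarily
how it should be emitted in generated code):

```python
def floor_div(a, b):
    # Python's // already floors (rounds toward negative infinity).
    return a // b

def ceil_div(a, b):
    # Ceiling division via negate-floor-negate: ceil(a/b) == -floor(-a/b).
    # Avoids floating point and works for negative numerators too.
    return -(-a // b)
```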
Things to consider
^^^^^^^^^^^^^^^^^^
- implemented_domain may end up being smaller than requested in cse
evaluations--check that!
- Auto tag assignment depends on known work group size
- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)
- Parallel dimension splitting/merging via tags
- Implement get_problems()
- FIXME: Deal with insns losing a seq iname dep in a CSE realization
a <- cse(reduce(stuff))
- Every loop in loopy is opened at most once.
- reimplement add_prefetch
- user interface for dim length prescription
- make syntax for explicit loop dependencies
- multiple insns could fight over which iname gets local axis 0
-> complicated optimization problem
- How to determine which variables need to be duplicated for ILP?
-> Only reduction
- Slab decomposition for parallel dimensions
- implement at the outermost nesting level regardless
- bound *all* tagged inames
- Sharing of checks across ILP instances
- Loop bounds currently may not depend on parallel dimensions
Does it make sense to relax this?
- CSE iname duplication might be unnecessary?
(don't think so: It might be desired to do a full fetch before a mxm k loop
even if that requires going iterative.)
- Reduction needs to know a neutral element
- Types of reduction variables?
- Generalize reduction to be over multiple variables
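For the reduction items above, a minimal sketch of per-operation neutral
elements (the mapping and function names are illustrative, not an actual
loopy API):

```python
# Hypothetical table of reduction operations and their neutral elements.
NEUTRAL_ELEMENTS = {
    "sum": 0,
    "product": 1,
    "max": float("-inf"),
    "min": float("inf"),
}

def reduce_with_neutral(op, combine, values):
    # Starting from the neutral element makes empty reductions well-defined
    # and lets per-thread partial results be combined safely.
    acc = NEUTRAL_ELEMENTS[op]
    for v in values:
        acc = combine(acc, v)
    return acc
```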
Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Local var:
    tag | force dep?
    ----+-----------
    l   | n
    g   | y
    dl  | Err
    d   | Err

Private var:
    tag | force dep?
    ----+-----------
    l   | y
    g   | y
    dl  | Err
    d   | Err
dg: Invalid -> error
d:  is duplicate
l:  is tagged as local idx
g:  is tagged as group idx
Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Focus everything on instructions
- Each instruction can have its own interpretation of global/local ids.
- Loop variables/splits and such are and remain global
- What about grouped dimensions?
- UniqueTag is the wrong idea! (not really--it's ok per-insn)
Scheduling:
- Find insns whose dependencies are satisfied
- Find maximally shareable loop
- Open that one
- For that opened loop, check if an available insn can run
- If not, open another loop
- Else, schedule that instruction
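The scheduling steps above could be sketched roughly as the following greedy
loop (a toy model with made-up data structures, not the actual loopy
scheduler):

```python
def schedule(insns, deps, loops_for):
    """Greedy sketch: insns is a list of instruction ids, deps maps each
    insn to the set of insns it depends on, loops_for maps each insn to
    the set of inames it must be nested inside."""
    scheduled = []    # sequence of ("enter"/"run"/"leave", payload)
    done = set()      # instructions already scheduled
    open_loops = []   # stack of currently open inames

    while len(done) < len(insns):
        # 1. Find insns whose dependencies are satisfied.
        ready = [i for i in insns if i not in done and deps[i] <= done]
        assert ready, "dependency cycle"

        # Close open loops no ready insn needs any more.
        needed = set().union(*(loops_for[i] for i in ready))
        while open_loops and open_loops[-1] not in needed:
            scheduled.append(("leave", open_loops.pop()))

        # 2. If an available insn fits the currently open loops, run it.
        runnable = [i for i in ready if loops_for[i] == set(open_loops)]
        if runnable:
            scheduled.append(("run", runnable[0]))
            done.add(runnable[0])
            continue

        # 3. Otherwise, open the not-yet-open loop shared by the most
        #    ready insns ("maximally shareable").
        counts = {}
        for i in ready:
            for iname in loops_for[i] - set(open_loops):
                counts[iname] = counts.get(iname, 0) + 1
        if not counts:
            # Nesting mismatch: back out of the innermost loop and retry.
            scheduled.append(("leave", open_loops.pop()))
            continue
        open_loops.append(max(counts, key=counts.get))
        scheduled.append(("enter", open_loops[-1]))

    while open_loops:
        scheduled.append(("leave", open_loops.pop()))
    return scheduled
```

Note each loop is entered at most once along any straight-line path here,
matching the "every loop in loopy is opened at most once" note above.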