TODO list
^^^^^^^^^
For writeup:
------------
TODO: Reimplement forced lengths
TODO: Try, fix reg. prefetch (DG example) / CSEs
ILP and reg. prefetch interact!
TODO: Functions
TODO: ILP arrays
FIXME: support non-reductive dimensions (what did I mean here?)
FIXME: write names should be assigned during scheduling
FIXME: screwy lower bounds in ILP
FIXME: Leading syncthreads elimination
TODO: Divisibility
TODO: Try, fix indirect addressing
TODO: Implement GT200 matmul, Fermi matmul, DG
TODO: DMA engine threads?
TODO: Deal with equalities that crop up.
TODO: Better user feedback.
Later:
------
TODO: Try different kernels
TODO: - Tricky: Convolution, Stencil
TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
TODO: implement efficient ceil_div? (as opposed to floor_div)
TODO: why are corner cases inefficient?
TODO: Use gists (why do disjoint sets arise?)
TODO: variable shuffle detection
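
For the ceil_div item above, one common integer-only formulation is sketched below. It assumes a positive divisor and relies on Python's floor semantics for ``//``. The function names are illustrative and are not taken from the existing code. In generated C code the same trick would need care, because C integer division truncates toward zero instead of flooring.

```python
def floor_div(a, b):
    """Floor division: Python's // rounds toward negative infinity."""
    return a // b


def ceil_div(a, b):
    """Ceiling division using only integer arithmetic.

    Assumes b > 0. Negating the numerator turns the floor into a
    ceiling, and the outer negation restores the sign.
    """
    return -(-a // b)
```

A typical use would be computing the number of blocks needed to cover n items with a given block size, e.g. ``ceil_div(n, block_size)``.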
Things to consider
^^^^^^^^^^^^^^^^^^
- Dependencies are pointwise for shared loop dimensions
  and global over non-shared ones (between dependent and ancestor)
- multiple insns could fight over which iname gets local axis 0
-> complicated optimization problem
- Every loop in loopy is opened at most once.
Too restrictive?
- Loop bounds currently may not depend on parallel dimensions
Does it make sense to relax this?
- Why do CSEs necessarily have to duplicate the inames?
-> because that would be necessary for a sequential prefetch
- Cannot do slab decomposition on inames that share a tag with
other inames
-> Is that reasonable?
- Parallel dimension splitting/merging via tags
-> unnecessary?
- Not using all hw loop dimensions causes an error, as
is the case for variant 3 in the rank_one test.
- implemented_domain may end up being smaller than requested in cse
evaluations--check that!
- FIXME: Deal with insns losing a seq iname dep in a CSE realization
a <- cse(reduce(stuff))
- reimplement add_prefetch
- user interface for dim length prescription
- How to determine which variables need to be duplicated for ILP?
- Sharing of checks across ILP instances
- Better for loop bound generation
-> Try a triangular loop
- Nested slab decomposition (in conjunction with conditional hoisting) could
generate nested conditional code.
- Flag, exploit idempotence
- Some things involving CSEs might be impossible to schedule
a[i,j] = cse(b[i]) * cse(c[j])
- Be smarter about automatic local axis choice
-> What if we run out of axes?
- Implement condition hoisting
(needed, e.g., by slab decomposition)
- Check for non-use of hardware axes
- Slab decomposition for parallel dimensions
- implement at the outermost nesting level regardless
- bound *all* tagged inames
- can't slab inames that share tags with other inames (for now)
- Make syntax for iname dependencies
- make syntax for insn dependencies
- CSE iname duplication might be unnecessary?
(don't think so: It might be desired to do a full fetch before a mxm k loop
even if that requires going iterative.)
- Reduction needs to know a neutral element
- Types of reduction variables?
- Generalize reduction to be over multiple variables
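
As a rough illustration of the neutral-element item above: a reduction can only initialize its accumulator if each operation is paired with a neutral element. The table and the function name below are hypothetical and only sketch the idea in plain Python. They do not describe how reductions are represented in the code.

```python
import operator

# Hypothetical table: reduction name -> (binary operation, neutral element).
REDUCTIONS = {
    "sum": (operator.add, 0),
    "product": (operator.mul, 1),
    "max": (max, float("-inf")),
    "min": (min, float("inf")),
}


def reduce_values(name, values):
    """Fold `values` with the named operation, starting at its neutral element."""
    op, acc = REDUCTIONS[name]
    for v in values:
        acc = op(acc, v)
    return acc
```

Starting from the neutral element means an empty reduction domain yields a well-defined result, which is one reason the reduction has to know it.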
Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Local var:

l  | n
g  | y
dl | Err
d  | Err

Private var:

l  | y
g  | y
dl | Err
d  | Err

dg: Invalid -> error

d: is duplicate
l: is tagged as local idx
g: is tagged as group idx
Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
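
The two tables above could be encoded as a small lookup, sketched below. This assumes that "y"/"n" mean "force the dependency" / "do not force it" and that "Err" means the combination should be rejected. The names ``FORCE_DEPENDENCY`` and ``must_force_dependency`` are made up for this sketch.

```python
# Rows: variable kind. Columns: iname tag class (l, g, dl, d).
# True = force the dependency, False = do not, None = error ("Err").
FORCE_DEPENDENCY = {
    "local": {"l": False, "g": True, "dl": None, "d": None},
    "private": {"l": True, "g": True, "dl": None, "d": None},
}


def must_force_dependency(var_kind, tag_class):
    """Return whether an iname dependency is forced, or raise on 'Err' entries."""
    if tag_class == "dg":
        # "dg" is listed as invalid in the notes above.
        raise ValueError("dg is invalid")
    entry = FORCE_DEPENDENCY[var_kind][tag_class]
    if entry is None:
        raise ValueError(
            "tag class %r is an error for %s variables" % (tag_class, var_kind))
    return entry
```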
How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Focus everything on instructions
- Each instruction can have its own interpretation of global/local ids.
- Loop variables/splits and such are and remain global
- What about grouped dimensions?
- UniqueTag is the wrong idea! (not really--it's ok per-insn)
Scheduling:
- Find insns whose dependencies are satisfied
- Find maximally shareable loop
- Open that one
- For that opened loop, check if an available insn can run
- If not, open another loop
- Else, schedule that instruction
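
The scheduling steps above can be sketched as a greedy loop. This is a simplified illustration under several assumptions, not the actual scheduler: ``Insn`` is a made-up record type, an instruction "can run" only when the set of open loops equals its inames, "maximally shareable" is read as "needed by the most ready instructions", and loops are closed when no ready instruction fits inside them. The sketch does not enforce that a loop is opened at most once.

```python
from collections import Counter, namedtuple

# Hypothetical instruction record: id, ids it depends on, loops it lives in.
Insn = namedtuple("Insn", ["id", "deps", "inames"])


def schedule(insns):
    """Return a list of ("open", iname), ("run", id), ("close", iname) items."""
    remaining = {insn.id: insn for insn in insns}
    done = set()
    open_loops = []  # stack of currently open inames
    result = []

    while remaining:
        # Find insns whose dependencies are satisfied.
        ready = [i for i in remaining.values() if i.deps <= done]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")

        active = set(open_loops)
        runnable = [i for i in ready if i.inames == active]
        if runnable:
            # An available insn can run in the open loops: schedule it.
            insn = runnable[0]
            result.append(("run", insn.id))
            done.add(insn.id)
            del remaining[insn.id]
            continue

        # Ready insns that could still run inside the open loops.
        candidates = [i for i in ready if active <= i.inames]
        if candidates:
            # Open the loop shared by the most candidates (ties: by name).
            counts = Counter(n for i in candidates for n in i.inames - active)
            iname = max(sorted(counts), key=counts.get)
            open_loops.append(iname)
            result.append(("open", iname))
        else:
            # Assumption: nothing fits here, so close the innermost loop.
            result.append(("close", open_loops.pop()))

    while open_loops:
        result.append(("close", open_loops.pop()))
    return result
```

For example, with instruction ``a`` in loop ``i``, ``b`` (depending on ``a``) in loops ``i`` and ``j``, and ``c`` (depending on ``b``) in loop ``i``, the sketch opens ``i``, runs ``a``, opens ``j``, runs ``b``, closes ``j``, runs ``c``, and closes ``i``.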