TODO list
^^^^^^^^^
Immediately:
------------
TODO: Imitate codegen bulk slab handling in bulk slab trials
For writeup:
------------
TODO: Reimplement forced lengths
TODO: Try, fix reg. prefetch (DG example) / CSEs
ILP and reg. prefetch interact!
TODO: Custom reductions per red. axis
TODO: Functions
TODO: Common subexpressions
TODO: Array common subexpressions (shared and private!)
TODO: ILP arrays
FIXME: support non-reductive dimensions (what did I mean here?)
FIXME: write names should be assigned during scheduling
FIXME: screwy lower bounds in ILP
FIXME: Leading syncthreads elimination
TODO: Divisibility
TODO: Try, fix indirect addressing
TODO: Implement GT200 matmul, Fermi matmul, DG
TODO: DMA engine threads?
TODO: Deal with equalities that crop up.
TODO: Better user feedback.
Later:
------
TODO: Try different kernels
TODO: - Tricky: Convolution, Stencil
TODO: Separate all-bulk from non-bulk kernels. (maybe?) (#ifdef?)
TODO: implement efficient ceil_div? (as opposed to floor_div)
TODO: why are corner cases inefficient?
TODO: Use gists (why do disjoint sets arise?)
TODO: variable shuffle detection
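
For the ceil_div item above, a minimal sketch of the distinction, assuming integer operands and a positive divisor. The helper names are illustrative only and are not taken from the code base; generated C code would need its own treatment, since C integer division truncates toward zero.

```python
def floor_div(a, b):
    # Python's // already rounds toward negative infinity.
    return a // b


def ceil_div(a, b):
    # Round toward positive infinity by negating twice around a floor
    # division. Assumes b > 0.
    return -(-a // b)
```

One place this matters: the number of blocks of size b needed to cover n items is ceil_div(n, b), whereas floor_div(n, b) counts only the full blocks.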
Things to consider
^^^^^^^^^^^^^^^^^^
- implemented_domain may end up being smaller than requested in cse
evaluations--check that!
- Instructions must agree on all iname tags except the parallel ones
- Auto tag assignment depends on known work group size
- Dependencies are pointwise for shared loop dimensions
and global over non-shared ones (between dependent and ancestor)
- Parallel dimension splitting/merging via tags
- Generalize reduction to be over multiple variables
- Implement get_problems()
Dealt with
^^^^^^^^^^
- Reduction needs to know a neutral element
- Types of reduction variables?
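
To illustrate why a reduction needs a neutral element, here is a small sketch in plain Python. The table REDUCTIONS and the function reduce_axis are hypothetical names chosen for this example, not the project's API. The neutral element is the starting value of the accumulator, so it is also the result for an empty reduction axis, and its value depends on the type of the reduction variable (floats are assumed for max and min here).

```python
import math
import operator
from functools import reduce

# Hypothetical table: reduction name -> (binary operation, neutral element).
REDUCTIONS = {
    "sum": (operator.add, 0),
    "product": (operator.mul, 1),
    "max": (max, -math.inf),
    "min": (min, math.inf),
}


def reduce_axis(name, values):
    # Start the accumulator at the neutral element, then fold in the values.
    op, neutral = REDUCTIONS[name]
    return reduce(op, values, neutral)
```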
How to represent the schedule
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Focus everything on instructions
- Each instruction can have its own interpretation of global/local ids.
- Loop variables/splits and such are and remain global
- What about grouped dimensions?
- UniqueTag is the wrong idea! (not really--it's ok per-insn)
Scheduling:
- Find insns whose dependencies are satisfied
- Find maximally shareable loop
- Open that one
- For that opened loop, check if an available insn can run
- If not, open another loop
- Else, schedule that instruction
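
The scheduling steps above can be sketched as a greedy loop. This is a simplified illustration, not the actual scheduler. It assumes each instruction is described only by the names of the instructions it depends on and the set of loops it must run inside, and it reads "maximally shareable" as "wanted by the most ready instructions". The function name schedule and the tuple-based output are made up for this example.

```python
from collections import Counter


def schedule(insns):
    """insns maps name -> (set of dependency names, frozenset of loop names).

    Returns a list of ("open", loop), ("run", insn) and ("close", loop) items.
    """
    done, open_loops, out = set(), [], []
    while len(done) < len(insns):
        # Find insns whose dependencies are satisfied.
        ready = [n for n, (deps, _) in insns.items()
                 if n not in done and deps <= done]
        if not ready:
            raise ValueError("dependency cycle")
        active = set(open_loops)

        # An insn can run once exactly its loops are open.
        runnable = [n for n in ready if insns[n][1] == active]
        if runnable:
            out.append(("run", runnable[0]))
            done.add(runnable[0])
            continue

        # Otherwise count how many ready insns want each not-yet-open loop,
        # looking only at insns compatible with the loops already open.
        votes = Counter(loop
                        for n in ready if active <= insns[n][1]
                        for loop in insns[n][1] - active)
        if votes:
            # Open the most widely shared loop (ties broken by name).
            loop = max(sorted(votes), key=votes.get)
            open_loops.append(loop)
            out.append(("open", loop))
        else:
            # No ready insn fits the open loops: close the innermost one.
            out.append(("close", open_loops.pop()))

    while open_loops:
        out.append(("close", open_loops.pop()))
    return out
```

For example, with an instruction "init" inside loop i, an instruction "acc" inside loops i and k that depends on "init", and an instruction "store" inside loop i that depends on "acc", the sketch opens i, runs "init", opens k, runs "acc", closes k, runs "store" and closes i.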