Forked from
Andreas Klöckner / loopy
Andreas Klöckner authored
Documentation Notes
^^^^^^^^^^^^^^^^^^^
- Need to clarify fundamental difference between constants baked into code
and things that remain variable. (ISL parameters, symbolic shapes)
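
One way to picture the distinction noted above is a toy code generator that either writes a size into the generated source as a literal or leaves it as an argument. This is only a minimal sketch in plain Python. `generate_sum_kernel`, `compile_kernel` and the `fixed_n` argument are hypothetical names for illustration and are not part of loopy.

```python
def generate_sum_kernel(fixed_n=None):
    """Return Python source for a function summing the first n entries of a.

    If fixed_n is given, the loop bound is baked into the source as a
    literal; otherwise n stays a run-time argument. (Hypothetical helper,
    not loopy API.)
    """
    if fixed_n is None:
        signature = "def kernel(a, n):"
        bound = "n"
    else:
        signature = "def kernel(a):"
        bound = str(int(fixed_n))
    return "\n".join([
        signature,
        "    total = 0",
        f"    for i in range({bound}):",
        "        total += a[i]",
        "    return total",
    ])


def compile_kernel(source):
    """Execute generated source and return the resulting function."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


baked = compile_kernel(generate_sum_kernel(fixed_n=4))
symbolic = compile_kernel(generate_sum_kernel())

data = [1, 2, 3, 4, 5, 6]
baked_result = baked(data)           # bound was fixed when the code was generated
symbolic_result = symbolic(data, 6)  # bound is supplied at call time
```

The baked variant would have to be regenerated for a different size, while the symbolic variant keeps one piece of code for all sizes.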
Things to consider
^^^^^^^^^^^^^^^^^^
- Dependencies are pointwise for shared loop dimensions
and global over non-shared ones (between dependent and ancestor)
- multiple insns could fight over which iname gets local axis 0
-> complicated optimization problem
- Every loop in loopy is opened at most once.
Too restrictive?
- Why do precomputes necessarily have to duplicate the inames?
-> because that would be necessary for a sequential prefetch
- Cannot do slab decomposition on inames that share a tag with
other inames
-> Is that reasonable?
- Entering a loop means:
- setting up conditionals related to it (slabs/bounds)
- allowing loops nested inside to depend on loop state
- Not using all hw loop dimensions causes an error, as
is the case for variant 3 in the rank_one test.
- Measure efficiency of corner cases
- Loopy as a data model for implementing custom rewritings
- We won't generate WAW barrier-needing dependencies
from one instruction to itself.
- Loopy is semi-interactive.
- Limitation: base index for parallel axes is 0.
- Dependency on order of operations is ill-formed
- Dependency on non-local global writes is ill-formed
- No substitution rules allowed on lhs of insns
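
The first item in this list (dependencies pointwise over shared loop dimensions, global over non-shared ones) can be made concrete by enumerating which instances of one instruction have to precede which instances of another. The following is a minimal sketch in plain Python. The names `instances` and `dependency_pairs` and the iname ranges are hypothetical, and this is not how loopy represents dependencies internally.

```python
from itertools import product


def instances(inames, ranges):
    """Enumerate all instances of an instruction as dicts iname -> value."""
    return [dict(zip(inames, values))
            for values in product(*(range(ranges[name]) for name in inames))]


def dependency_pairs(src_inames, dst_inames, ranges):
    """List (src_instance, dst_instance) pairs where dst must wait for src.

    Shared inames must agree (pointwise); inames that occur on only one
    side are unconstrained (global).
    """
    shared = set(src_inames) & set(dst_inames)
    return [(src, dst)
            for src in instances(src_inames, ranges)
            for dst in instances(dst_inames, ranges)
            if all(src[name] == dst[name] for name in shared)]


ranges = {"i": 2, "j": 3}
# The writer is nested in i and j, the reader in i only.
pairs = dependency_pairs(("i", "j"), ("i",), ranges)
```

Here each reader instance for a given `i` depends on all `j` instances of the writer with the same `i`, and on none with a different `i`.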
To-do
^^^^^
- Kernel fusion
- when are link_inames, duplicate_inames safe?
- rename IndexTag -> InameTag
- Data implementation tags
- turn base_indices into offset
- vectorization
- write_image()
- change_arg_to_image (test!)
- Make tests run on GPUs
- Test array access with modulo
- Derive all errors from central hierarchy
- Provide context for more errors?
- Allow mixing computed and stored strides
Fixes:
- applied_iname_rewrites tracking for prefetch footprints isn't bulletproof
old inames may still be around, so the rewrite may or may not have to be
applied.
- Group instructions by dependency/inames for scheduling, to
increase sched. scalability
- What if no universally valid precompute base index expression is found?
(test_intel_matrix_mul with n = 6*16, e.g.?)
- If finding a maximum proves troublesome, move parameters into the domain
Future ideas
^^^^^^^^^^^^
- subtract_domain_lower_bound
- Storage sharing for temporaries?
- Kernel splitting (via what variables get computed in a kernel)
- Put all OpenCL functions into mangler
- Fuse: store/fetch elimination?
- Array language
- reg rolling
- When duplicating inames, use iname aliases to relieve burden on isl
- (Web) UI
- Check for unordered (no-dependency) writes to the same location
- Vanilla C string instructions?
- Barriers for data exchanged via global vars?
- Float4 joining on fetch/store?
- Better for loop bound generation
-> Try a triangular loop
- Eliminate the first (pre-)barrier in a loop.
- Generate automatic test against sequential code.
- Reason about generated code, give user feedback on potential
improvements.
- Convolutions, Stencils
- DMA engine threads?
- Try, fix indirect addressing
- Nested slab decomposition (in conjunction with conditional hoisting) could
generate nested conditional code.
- Better code for strides.
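
As an illustration of the general idea behind slab decomposition mentioned in this list, the sketch below splits a loop into a bulk slab that needs no bounds check and a final slab that does. It is plain Python with hypothetical names (`scale_with_slabs`, `block`) and only mimics the shape of such code. It does not reproduce what loopy generates.

```python
def scale_with_slabs(a, factor, block=4):
    """Multiply each entry of a by factor, blockwise, in two slabs."""
    n = len(a)
    out = list(a)
    full_blocks = n // block
    checks = 0  # counts bounds checks executed, for illustration only

    # Bulk slab: every block is full, so no per-element bounds check.
    for outer in range(full_blocks):
        for inner in range(block):
            i = outer * block + inner
            out[i] = a[i] * factor

    # Edge slab: the last, partial block needs a conditional.
    if n % block:
        for inner in range(block):
            i = full_blocks * block + inner
            checks += 1
            if i < n:
                out[i] = a[i] * factor
    return out, checks


result, checks = scale_with_slabs(list(range(10)), 2)
```

With ten entries and a block size of four, only the four iterations of the last block pay for a conditional.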
Dealt with
^^^^^^^^^^
- How can one automatically generate something like microblocks?
-> Some sort of axis-adding transform?
- RuleAwareIdentityMapper
extract_subst -> needs WalkMapper [actually fine as is]
padding [DONE]
replace make_unique_var_name [DONE]
join_inames [DONE]
duplicate_inames [DONE]
split_iname [DONE]
CSE [DONE]
- rename iname
- delete unused inames
- Expose iname-duplicate-and-rename as a primitive.
- make sure simple side effects work
- Loop bounds currently may not depend on parallel dimensions
Does it make sense to relax this?
- Streamline argument specification
- syntax for linear array access
- Test divisibility constraints
- Test join_inames
- Divisibility, modulo, strides?
-> Tested, gives correct (but suboptimal) code.
- *_dimension -> *_iname
- Use gists (why do disjoint sets arise?)
- Automatically verify that all array access is within bounds.
- : (as in, Matlab full-slice) in prefetches
- Add dependencies after the fact
- Scalar insn priority
- ScalarArg is a bad name
-> renamed to ValueArg
- What to do about constants in codegen? (...f suffix, complex types)
-> dealt with by type contexts
- relating to Multi-Domain [DONE]
- Reenable codegen sanity check. [DONE]
- Incorporate loop-bound-mediated iname dependencies into domain
parenthood. [DONE]
- Make sure that variables that enter into loop bounds are only written
exactly once. [DONE]
- Make sure that loop bound writes are scheduled before the relevant
loops. [DONE]
- add_prefetch tagging
- nbody GPU
-> pending better prefetch spec
- Prefetch by sample access
- How is intra-instruction ordering of ILP loops going to be determined?
(taking into account that it could vary even per-instruction?)
- Sharing of checks across ILP instances
- Differentiate ilp.unr from ilp.seq
- Allow complex-valued arithmetic, despite CL's best efforts.
- "No schedule found" debug help:
- Find longest dead-end
- Automatically report on what hinders progress there
- CSE should be more like variable assignment
- Deal with equality constraints.
(These arise, e.g., when partitioning a loop of length 16 into 16s.)
- dim_{min,max} caching
- Exhaust the search for a no-boost solution first, before looking
for a schedule with boosts.
- Pick not just axis 0, but all axes by lowest available stride
- Scheduler tries too many boostability-related options
- Automatically generate testing code vs. sequential.
- If isl can prove that all operands are positive, may use '/' instead of
'floor_div'.
- For forced workgroup sizes: check that at least one iname
maps to them.
- variable shuffle detection
-> will need unification
- Dimension joining
- user interface for dim length prescription
- Restrict-to-sequential and tagging have nothing to do with each other.
-> Removed SequentialTag and turned it into a separate computed kernel
property.
- Just touching a variable written to by a non-idempotent
instruction makes that instruction also not idempotent
-> Idempotent renamed to boostable.
-> Done.
- Give the user control over which reduction inames are
duplicated.
- assert dependencies <= parent_inames in loopy/__init__.py
-> Yes, this must be the case.
-> If you include reduction inames.
- Give a good error message if a parameter assignment in get_problems()
is missing.
- Slab decomposition for ILP
-> I don't think that's possible.
- Error messages that refer to instructions generated during
preprocessing are hard to understand.
-> Expose preprocessing to the user so she can inspect the preprocessed
kernel.
- Which variables need to be duplicated for ILP?
-> Only reduction
- implemented_domain may end up being smaller than requested in cse
evaluations--check that!
- Allow prioritization of loops in scheduling.
- Make axpy better.
- Screwy lower bounds in slab decomposition
- reimplement add_prefetch
- Flag, exploit idempotence
- Some things involving CSEs might be impossible to schedule
a[i,j] = cse(b[i]) * cse(c[j])
- Be smarter about automatic local axis choice
-> What if we run out of axes?
- Implement condition hoisting
(needed, e.g., by slab decomposition)
- Check for non-use of hardware axes
- Slab decomposition for parallel dimensions
- implement at the outermost nesting level regardless
- bound *all* tagged inames
- can't slab inames that share tags with other inames (for now)
- Make syntax for iname dependencies
- make syntax for insn dependencies
- Implement get_problems()
- CSE iname duplication might be unnecessary?
(don't think so: It might be desired to do a full fetch before a mxm k loop
even if that requires going iterative.)
- Reduction needs to know a neutral element
- Types of reduction variables?
- Generalize reduction to be over multiple variables
- duplicate_dimensions can be implemented without having to muck around
with individual constraints:
- add_dims
- move_dims
- intersect
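
On the item above about using '/' instead of 'floor_div': C-style integer division truncates toward zero, while floor division rounds toward negative infinity. The two agree whenever both operands are positive, which is why a proof of positivity permits the plain operator. A small sketch in plain Python follows. `trunc_div` is a hypothetical helper that emulates C99 integer division and is not taken from loopy.

```python
def trunc_div(a, b):
    """Integer division that truncates toward zero, as C99 '/' does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Integer division that rounds toward negative infinity."""
    return a // b


# For positive operands the two notions of division coincide.
positive_agree = all(trunc_div(a, b) == floor_div(a, b)
                     for a in range(1, 50) for b in range(1, 50))

# With a negative operand they differ: truncation gives -3, floor gives -4.
mismatch = (trunc_div(-7, 2), floor_div(-7, 2))
```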
Should a dependency on an iname be forced in a CSE?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Local var:

l  | n
g  | y
dl | Err
d  | Err

Private var:

l  | y
g  | y
dl | Err
d  | Err

dg: Invalid -> error

d: is duplicate
l: is tagged as local idx
g: is tagged as group idx
Raise error if dl is targeting a private variable, regardless of whether it's
a dependency or not.
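
The table above can be read as a lookup from (variable kind, iname state) to an outcome. The following is a minimal sketch under assumptions: 'y' and 'n' are taken to mean that the dependency is or is not forced, 'Err' that the combination is rejected, and 'dl' that the iname is both duplicated and tagged local. The names `CSE_DEPENDENCY_TABLE` and `cse_dependency_forced` are hypothetical and not from loopy.

```python
# Outcomes transcribed from the table; keys are (variable kind, iname state).
# Assumed reading: "l" local idx tag, "g" group idx tag, "d" duplicate,
# "dl" duplicate and tagged local.
CSE_DEPENDENCY_TABLE = {
    ("local", "l"): "n",
    ("local", "g"): "y",
    ("local", "dl"): "Err",
    ("local", "d"): "Err",
    ("private", "l"): "y",
    ("private", "g"): "y",
    ("private", "dl"): "Err",
    ("private", "d"): "Err",
}


def cse_dependency_forced(var_kind, iname_state):
    """Return True if the dependency is forced, False if not.

    Raises ValueError for the combinations marked as errors.
    """
    if iname_state == "dg":
        raise ValueError("dg is invalid")
    outcome = CSE_DEPENDENCY_TABLE[(var_kind, iname_state)]
    if outcome == "Err":
        raise ValueError(
            f"iname state {iname_state!r} is an error for {var_kind} variables")
    return outcome == "y"


local_l = cse_dependency_forced("local", "l")
private_l = cse_dependency_forced("private", "l")
```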