Terrifically slow scheduling and subsequent code generation for kernels with many inames
Some of the kernels we're making in tsfc/firedrake have many inames. I've attached a (py3k) pickled simple case.
I got bored waiting for this kernel to schedule (for a simpler case that scheduled OK, I got bored in `generate_code_v2`). I did a bit of profiling: for this kernel, it takes a long time even to get through the pre-scheduling checks. In particular, `check_bounds` takes a terribly long time. A large part of this appears to be spent in `get_access_range`, and in particular in turning the `access_map` into a range with `access_map.range()`.
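A profile like this can be collected with the standard-library `cProfile` module. The following is a minimal sketch, not the exact commands I ran: `profile_call` and `slow_stand_in` are hypothetical names, and the stand-in would be replaced by whichever loopy entry point triggers the pre-scheduling checks.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report).

    The report is a text table of the `top` entries sorted by
    cumulative time, which makes call chains easy to spot.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_stand_in(n):
    # Hypothetical placeholder for the expensive call being investigated.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    result, report = profile_call(slow_stand_in, 100_000)
    print(report)
```

Sorting by cumulative time rather than own time is a deliberate choice here: it surfaces the outer function that contains the slow call, not only the innermost routine.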
For example, computing the bounds check for the final (write-back) instruction takes around 4 seconds on my machine.
The pickled kernel is attached as `foo`. This is what I did:
    import pickle

    from loopy.symbolic import IdentityMapper, get_access_range


    class Collector(IdentityMapper):
        """Collect the index tuple of every subscript in an expression."""

        def __init__(self):
            super().__init__()
            self.collected = []

        def map_subscript(self, expr):
            self.rec(expr.aggregate)
            # index_tuple is always a tuple, even for 1D subscripts
            self.collected.append(expr.index_tuple)
            return expr


    collector = Collector()
    with open("foo", "rb") as f:
        knl = pickle.load(f)

    domain = knl.domains[0]
    assumptions = knl.assumptions
    # Calling the mapper fills collector.collected; the return value is
    # the (unchanged) expression, not the list of subscripts.
    collector(knl.instructions[-1].expression)
    subscripts = collector.collected

    %timeit list(get_access_range(domain, s, assumptions) for s in subscripts)
    => 3.98 s ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
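To see whether the time is spread evenly or dominated by a few subscripts, the calls could also be timed one at a time. This is a rough sketch under that assumption; `time_each` is a hypothetical helper, not part of loopy.

```python
import time


def time_each(func, items):
    """Call func(item) for each item and return (item, seconds) pairs,
    slowest first."""
    timings = []
    for item in items:
        start = time.perf_counter()
        func(item)
        timings.append((item, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)
```

With the names from the snippet above, this would be called as `time_each(lambda s: get_access_range(domain, s, assumptions), subscripts)`.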
I didn't dig as deeply into why the codegen itself was slow.