Terrifically slow scheduling and subsequent code generation for kernels with many inames

If you run this through perf, I think you'll find that most of the time is actually spent in isl's bounds-finding and projection functions. IIUC, these use a procedure called Fourier-Motzkin elimination, whose runtime is (worst-case) double-exponential in the number of variables/inames (and "just plain expensive" even on an average day). Loopy already tries pretty hard (by caching and hacking around needlessly expensive corner cases) to keep this cost down, but if you have a genuinely coupled problem with that many variables, that cost is hard to avoid.
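To make the cost argument concrete, here is a minimal textbook sketch of a single Fourier-Motzkin elimination step in plain Python. It is not isl's actual code; the function name `eliminate` and the constraint representation are assumptions made for illustration only. Each constraint is a pair `(coeffs, bound)` meaning `sum(coeffs[i] * x[i]) <= bound`.

```python
from itertools import product


def eliminate(constraints, var):
    """Project out variable `var` (hypothetical helper, textbook version).

    Each constraint is (coeffs, bound): sum(coeffs[i] * x[i]) <= bound.
    """
    # Constraints that bound `var` from above, from below, or not at all.
    upper = [c for c in constraints if c[0][var] > 0]
    lower = [c for c in constraints if c[0][var] < 0]
    rest = [c for c in constraints if c[0][var] == 0]

    # Every (upper, lower) pair yields one new constraint without `var`.
    for (uc, ub), (lc, lb) in product(upper, lower):
        scale_u, scale_l = -lc[var], uc[var]  # both strictly positive
        coeffs = tuple(scale_u * a + scale_l * b for a, b in zip(uc, lc))
        rest.append((coeffs, scale_u * ub + scale_l * lb))
    return rest


# Variables are (i, k). System: 0 <= k, k <= i - 1, i <= 10.
system = [((0, -1), 0), ((-1, 1), -1), ((1, 0), 10)]
projected = eliminate(system, var=1)  # project out k
print(projected)
```

Eliminating a variable that has m upper bounds and m lower bounds produces m*m new constraints, and that product is applied again at every subsequent elimination. This is why keeping the variable count per domain small matters so much.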

Looking at your kernel, it seems like sticking all inames into the same domain (and therefore getting hit with a large variable count and associated elimination cost) isn't necessary, since many of your inames are independent of one another.

Instead, if that's an option, I would recommend generating a separate subdomain for each set of loops.

For example, this is possible:

lp.make_kernel([
    "{[i,j]: 0<=i,j<n}",
    "{[k1]: 0<=k1<i and 0<=k1<j}",
    "{[k2]: 0<=k2<i and 0<=k2<j}",
    "{[k3]: 0<=k3<i and 0<=k3<j}",
    "{[k4]: 0<=k4<i and 0<=k4<j}",
    ],
    """
    for i,j
      for k1
      end
      for k2
      end
      for k3
      end
      for k4
      end
    end
    """)

OK, thanks, that works much better. So if I understand this domain forest idea, I only need to put inames in the same domain if I want constraints that couple them (e.g. {[i,j]: 0<=i,j<n and i+j = n}). Furthermore, domains later in the list can refer to inames from domains earlier in the list in their constraints.

This gets me down from ~20 minutes to ~10 seconds for that example, which is much more plausible. That's sort of at the upper limit of where we'd like to be, but is certainly good for now.

Thanks!

closed

Happy to hear it! If you profile the 10 seconds that are left over, you'll probably see that most of that is in the scheduler. That's known (#90), and also not unfixable.
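For reference, a minimal sketch of how one might profile the remaining time from Python using the standard library's cProfile. The helper `profile_call` and the stand-in function `build_and_generate` are hypothetical names; the stand-in's body would be replaced by your own kernel construction and code generation calls.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def build_and_generate():
    # Hypothetical stand-in workload: replace with your own
    # lp.make_kernel(...) and code generation calls.
    return sum(i * i for i in range(10000))


result, report = profile_call(build_and_generate)
print(report)
```

Time spent inside compiled code is included in the cumulative time of whichever Python function called into it, so a native profiler such as perf is still the better tool for looking inside isl itself.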

mentioned in commit a9940db3
