Loading code generation results from cache is slow for large kernels
If you profile a warm-cache run of `examples/layerpot.py` from pytential, you'll see that `isl_set_read_from_str` is called over 1000 times (!). Each call is somewhat expensive, and together they add up to 1 second of overhead for a very low-order QBX example, even though only 23 kernels get loaded. I've traced this, and I'm pretty sure the overhead comes from `implemented_domains` saving a lot of sets. We don't really need any of these sets to be loaded anyway.
There are also 500 calls to `isl_basic_set_read_from_str`, which, however, don't take as much time in total.
Two solutions I can think of:
- A "fast loading path" for code generation results when the goal is just to load enough code to call the kernel.
- Lazy loading of ISL sets.
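
To make the lazy-loading option more concrete, here is a minimal sketch of the idea. It is not based on the actual cache code: `LazySet`, `fake_parse`, and `parse_calls` are hypothetical names, and `fake_parse` is a stand-in for whatever ends up calling into ISL's string parser. The wrapper keeps only the string form and parses it the first time someone asks for the set.

```python
class LazySet:
    """Hypothetical wrapper: stores the string form, parses on first access."""

    def __init__(self, text, parser):
        self._text = text        # serialized form, as it would come from the cache
        self._parser = parser    # callable that turns the string into a set object
        self._value = None       # filled in on first access

    @property
    def value(self):
        # Parse at most once, and only if somebody actually needs the set.
        if self._value is None:
            self._value = self._parser(self._text)
        return self._value

    def __str__(self):
        # The string form is available without parsing.
        return self._text


# Stand-in for the expensive parser (an assumption for this sketch).
# It records every call so the number of parses can be inspected.
parse_calls = []


def fake_parse(text):
    parse_calls.append(text)
    return ("parsed", text)


# "Loading" three domains from the cache: no parsing happens here.
domains = [LazySet("{ [i] : 0 <= i < %d }" % n, fake_parse) for n in (8, 16, 32)]
calls_after_load = len(parse_calls)

# Only the set that is actually touched gets parsed, and only once.
first = domains[0].value
first_again = domains[0].value
calls_after_access = len(parse_calls)
```

With something along these lines, a warm-cache load that only needs to call the kernel would never pay for the sets stored in `implemented_domains`, while code that does inspect them would still get the parsed objects on demand.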