loading code generation results from cache is slow for large kernels

If you profile a warm-cache run of examples/layerpot.py from pytential, you'll see that isl_set_read_from_str is called over 1000 times (!). Each call is somewhat expensive, and together they add up to 1 second of overhead for a very low-order QBX example, even though only 23 kernels get loaded. I've traced this, and I'm pretty sure the overhead comes from implemented_domains saving a lot of sets. We don't really need any of these sets to be loaded anyway.

There are also 500 calls to isl_basic_set_read_from_str, although these don't take as much time in total.

Two solutions I can think of:

- A "fast loading path" for code generation results when the goal is just to load enough code to call the kernel.
- Lazy loading of ISL sets.
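
To make the lazy-loading idea concrete, here is a minimal sketch of what I have in mind. It is not loopy's actual code: `LazyParsed` and `CountingParser` are hypothetical names, and the parser is a stand-in so the example runs without islpy. The point is only that the cache would hold the string form of each set, and the expensive parse would happen on first access, if at all.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyParsed(Generic[T]):
    """Hold the serialized text of an object and parse it only on first use.

    Hypothetical helper for illustration; not part of loopy or islpy.
    """

    def __init__(self, text: str, parse: Callable[[str], T]) -> None:
        self.text = text
        self._parse = parse
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Parse at most once, then reuse the cached result.
        if not self._loaded:
            self._value = self._parse(self.text)
            self._loaded = True
        return self._value


class CountingParser:
    """Stand-in for an expensive parser; counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> frozenset:
        self.calls += 1
        return frozenset(text.split(","))


parser = CountingParser()

# "Loading from cache" only wraps the strings; nothing is parsed yet.
domain_a = LazyParsed("i,j", parser)
domain_b = LazyParsed("k", parser)
calls_after_load = parser.calls

# Only the domain that is actually used pays the parsing cost, and only once.
first = domain_a.get()
second = domain_a.get()
calls_after_use = parser.calls
```

With islpy, the parser passed in would presumably be something like `islpy.Set`, which can be constructed from a string. Whether the pickling path in loopy can be hooked this way is something I haven't checked.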