Instruction scheduling is slow
The instruction scheduler creates a priority queue of instructions to check at each call to the scheduler, as can be seen here.
However, this queue is inefficient for two reasons:
- Sorting does not take into account instruction dependencies, so the scheduler may try A before B even when A depends on B.
- The sorted order could be re-used across recursive calls to the scheduler, but it is recomputed on every call.
This has a noticeable performance impact for large (order 25-ish) pytential kernels.
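To make the two points concrete, here is a minimal sketch of a dependency-aware ordering that could be computed once and handed down to recursive calls. It is not the scheduler's actual code: the function name `dependency_aware_order` and the `priorities` and `depends_on` dictionaries are hypothetical, and the approach (Kahn's topological sort with a heap to break ties by priority) is just one possible way to do it.

```python
import heapq


def dependency_aware_order(priorities, depends_on):
    """Order instruction ids so each one comes after its dependencies.

    priorities: dict mapping insn id -> number (higher is tried first).
    depends_on: dict mapping insn id -> set of insn ids it depends on.
    Both structures are hypothetical stand-ins for the scheduler's data.
    """
    # Unsatisfied dependencies per instruction (ignoring unknown ids).
    remaining = {
        insn: {dep for dep in depends_on.get(insn, ()) if dep in priorities}
        for insn in priorities
    }
    # Reverse map: which instructions are waiting on a given instruction.
    dependents = {insn: [] for insn in priorities}
    for insn, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(insn)

    # Min-heap on negated priority, so the highest priority pops first.
    ready = [(-priorities[insn], insn)
             for insn, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, insn = heapq.heappop(ready)
        order.append(insn)
        for user in dependents[insn]:
            remaining[user].discard(insn)
            if not remaining[user]:
                heapq.heappush(ready, (-priorities[user], user))

    if len(order) != len(priorities):
        raise ValueError("dependency cycle among instructions")
    return order


# A has the highest priority but depends on B, so B must come first.
order = dependency_aware_order({"A": 10, "B": 1, "C": 5}, {"A": {"B"}})
print(order)  # ['C', 'B', 'A']
```

A plain sort by priority would try A first here, even though it cannot be scheduled before B. Since the resulting order depends only on the priorities and the dependency graph, it could be computed once at the top-level call and passed (for example as a tuple) to the recursive calls, which would then only need to skip instructions that are already scheduled.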