Parallelism never really "gets out of bed"

While solving a "slightly" big problem on dunkel (such as examples/helmholtz-dirichlet.py at 756a6610), overall CPU utilization on the machine (24 cores) doesn't exceed 300% during GMRES iteration, which should be 95% FMM calls. To me, this indicates that there is (sequential) overhead here that prevents the code from achieving its "actual" efficiency. Maybe connected to #4 (closed)?
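
For a rough sense of scale, here is a back-of-the-envelope sketch using Amdahl's law. It assumes (and this is only an assumption, not something measured) that the work splits into a purely single-threaded part and a part that scales perfectly across all 24 cores. The helper names `serial_fraction` and `serial_wall_time_share` are made up for this illustration.

```python
def serial_fraction(utilization, n_cores):
    """Fraction of total (single-core) work that is serial, per Amdahl's law.

    `utilization` is the average number of busy cores (300% -> 3.0).
    Assumes the parallel part scales perfectly over `n_cores`, so that
    utilization = 1 / (s + (1 - s) / n_cores). Solving for s gives:
    """
    return (n_cores / utilization - 1.0) / (n_cores - 1.0)


def serial_wall_time_share(utilization, n_cores):
    """Share of wall-clock time spent in the serial part, same model."""
    s = serial_fraction(utilization, n_cores)
    return s / (s + (1.0 - s) / n_cores)


if __name__ == "__main__":
    s = serial_fraction(3.0, 24)
    w = serial_wall_time_share(3.0, 24)
    print(f"serial fraction of work: {s:.3f}")
    print(f"serial share of wall time: {w:.3f}")
```

Under that idealized model, 300% on 24 cores would correspond to roughly 30% of the work being sequential, which would take up about 91% of the wall-clock time. The real picture is likely messier (imperfect scaling, load imbalance), so a profile would be needed to confirm where the time goes.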

cc @mattwala