Parallelism never really "gets out of bed"
While solving a "slightly" big problem on dunkel (24 cores), such as examples/helmholtz-dirichlet.py at 756a6610, overall CPU utilization doesn't exceed 300% during the GMRES iteration, which should be 95% FMM calls. To me, this indicates that there is (sequential) overhead here that prevents the code from achieving its "actual" efficiency. Maybe connected to #4 (closed)?
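
For a rough sense of scale, here is a back-of-the-envelope sketch of what 300% on 24 cores would mean under a simple Amdahl-style model. The assumptions are mine and not measured: the work splits into a strictly sequential part and a perfectly parallel part, and average utilization equals total CPU time over wall time. `serial_fraction` is a hypothetical helper, not part of the code base.

```python
def serial_fraction(utilization: float, cores: int) -> float:
    """Estimate the sequential share of the work from average CPU utilization.

    Model (an assumption): utilization = 1 / (s + (1 - s) / cores),
    where s is the sequential fraction. Solving for s gives
    s = (cores / utilization - 1) / (cores - 1).

    `utilization` is in units of cores, so 300% is 3.0.
    """
    if cores < 2:
        raise ValueError("need at least two cores")
    if not 1.0 <= utilization <= cores:
        raise ValueError("utilization must lie between 1 and the core count")
    return (cores / utilization - 1.0) / (cores - 1.0)


if __name__ == "__main__":
    # 300% utilization on a 24-core machine
    print(f"{serial_fraction(3.0, 24):.1%}")  # prints 30.4%
```

Under that model, about 30% of the work would be running sequentially, which is hard to square with the iteration being 95% FMM calls unless the FMM itself has substantial sequential sections.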
cc @mattwala