Improve parallelism in fmmlib interface
The new logging infrastructure (be747563, boxtree!46) makes it clear which parts of the code have:
- long run times
- no parallelism at all
- or poor parallel efficiency
form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
The lowest-hanging fruit, in descending order of juiciness:
- form_locals: completed (414.51s wall, 1.0x CPU)
- coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
- form_multipoles: completed (70.84s wall, 1.0x CPU)
- refine_locals: completed (64.89s wall, 1.0x CPU)
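
For re-checking this ranking after future changes, here is a minimal sketch that parses log lines of the form shown above and sorts the stages that ran essentially serially by wall time. The helper names (`parse_log`, `serial_stages`) and the 1.5x CPU cutoff are assumptions made for illustration; they are not part of boxtree or of the logging infrastructure. The script considers every stage in the log, so its output can include serial stages beyond the four listed here.

```python
import re

# Log excerpt copied from this issue.
LOG = """\
form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
"""

# Matches "<stage>: completed (<wall>s wall, <cpu>x CPU)".
LINE_RE = re.compile(
    r"^(?P<stage>[\w ]+): completed "
    r"\((?P<wall>[\d.]+)s wall, (?P<cpu>[\d.]+)x CPU\)$"
)


def parse_log(text):
    """Return a list of (stage, wall_seconds, cpu_factor) tuples."""
    stages = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stages.append(
                (match["stage"], float(match["wall"]), float(match["cpu"]))
            )
    return stages


def serial_stages(stages, max_cpu_factor=1.5):
    """Stages at or below the (assumed) CPU-factor cutoff, slowest first."""
    serial = [s for s in stages if s[2] <= max_cpu_factor]
    return sorted(serial, key=lambda s: s[1], reverse=True)


ranked = serial_stages(parse_log(LOG))
for stage, wall, cpu in ranked:
    print(f"{stage}: {wall:.2f}s wall, {cpu:.1f}x CPU")
```

The cutoff only separates stages with no measurable parallelism from the rest; stages with poor but nonzero parallel efficiency would need a separate look.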
Edited by Andreas Klöckner