
Improve parallelism in fmmlib interface

The new logging infrastructure (be747563, boxtree!46 (merged)) makes it clear which parts of the code have:

  • long run times
  • no parallelism at all
  • or poor parallel efficiency

form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)

The lowest-hanging fruit, in descending order of juiciness:

  • form_locals: completed (414.51s wall, 1.0x CPU)
  • coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
  • form_multipoles: completed (70.84s wall, 1.0x CPU)
  • refine_locals: completed (64.89s wall, 1.0x CPU)
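
A ranking like the one above can also be pulled out of the log mechanically. The following is only a minimal sketch: the helper names `parse_timings` and `serial_hotspots` and the thresholds `max_cpu` and `min_wall` are made up for illustration and are not part of boxtree or the logging infrastructure. It assumes the log lines keep exactly the `name: completed (Ns wall, Mx CPU)` shape shown above.

```python
import re

# Timing lines copied verbatim from the log in this issue.
LOG = """\
form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
"""

# Assumed line shape: "<stage>: completed (<wall>s wall, <cpu>x CPU)".
LINE_RE = re.compile(
    r"^(?P<stage>[\w ]+): completed "
    r"\((?P<wall>[\d.]+)s wall, (?P<cpu>[\d.]+)x CPU\)$"
)


def parse_timings(log):
    """Return a list of (stage, wall_seconds, cpu_multiple) tuples."""
    timings = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            timings.append((m["stage"], float(m["wall"]), float(m["cpu"])))
    return timings


def serial_hotspots(timings, max_cpu=1.1, min_wall=10.0):
    """Stages that are effectively serial (CPU multiple <= max_cpu) and
    take at least min_wall seconds, sorted by wall time, longest first.
    Both thresholds are arbitrary illustrative choices."""
    hits = [t for t in timings if t[2] <= max_cpu and t[1] >= min_wall]
    return sorted(hits, key=lambda t: t[1], reverse=True)


for stage, wall, cpu in serial_hotspots(parse_timings(LOG)):
    print(f"{stage}: {wall:.2f}s wall, {cpu:.1f}x CPU")
```

With these thresholds, the filter reports the four stages listed above in the same order, plus translate_box_local_to_qbx_local (67.63s wall, 1.0x CPU), which falls between form_multipoles and refine_locals.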

cc @howard28 @haogao2 @mattwala

Edited by Andreas Klöckner