Make sense of running times
Consider the run time data from #92:
Presumed DLP:
form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
Presumed SLP:
build_tree_with_qbx_metadata: completed (0.43s wall, 11.6x CPU)
mark_targets: completed (0.42s wall, 16.3x CPU)
try_find_centers: completed (1.19s wall, 16.8x CPU)
tree build: completed (4.92s wall, 13.5x CPU): 15 levels, 742156 boxes
build traversal: completed (5.53s wall, 8.5x CPU)
form_multipoles: completed (220.47s wall, 1.0x CPU)
coarsen_multipoles: completed (112.64s wall, 1.0x CPU)
eval_direct: completed (1.20s wall, 1.1x CPU)
multipole_to_local: completed (489.84s wall, 13.4x CPU)
eval_multipoles: completed (1.34s wall, 1.0x CPU)
form_locals: completed (1422.97s wall, 0.9x CPU)
refine_locals: completed (58.35s wall, 1.0x CPU)
eval_locals: completed (0.97s wall, 1.0x CPU)
form_global_qbx_locals: completed (2093.70s wall, 20.0x CPU)
translate_box_multipoles_to_qbx_local: completed (337.42s wall, 16.5x CPU)
translate_box_local_to_qbx_local: completed (61.03s wall, 1.0x CPU)
eval_qbx_expansions: completed (15.77s wall, 1.0x CPU)
qbx fmm: completed (4818.42s wall, 11.6x CPU)
@mattwala Why are we spending so much time in form_global_qbx_locals in the "presumed SLP" run (2093.70s wall, versus 379.56s in the "presumed DLP" run)? Any idea?
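To make the comparison easier to eyeball, here is a rough sketch that parses log lines in the format above and ranks stages by their SLP/DLP wall-time ratio. The helper names (`parse_timings`, `compare`) are made up for this comment and are not part of the code base; only a handful of the lines above are pasted in as sample data.

```python
import re

# Matches lines such as "form_locals: completed (414.51s wall, 1.0x CPU)".
# Trailing text (e.g. ": 15 levels, 742156 boxes") is ignored.
LINE_RE = re.compile(
    r"^(?P<stage>.+?): completed \((?P<wall>[\d.]+)s wall, (?P<cpu>[\d.]+)x CPU\)"
)


def parse_timings(log_text):
    """Return {stage: (wall_seconds, cpu_factor)} for every matching line."""
    timings = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            timings[m.group("stage")] = (
                float(m.group("wall")),
                float(m.group("cpu")),
            )
    return timings


def compare(dlp, slp):
    """Return (stage, dlp_wall, slp_wall, slp/dlp ratio) rows for stages
    present in both runs, largest ratio first."""
    rows = []
    for stage in dlp.keys() & slp.keys():
        dlp_wall = dlp[stage][0]
        slp_wall = slp[stage][0]
        rows.append((stage, dlp_wall, slp_wall, slp_wall / dlp_wall))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Sample data: a subset of the lines quoted in this issue.
DLP_LOG = """\
form_multipoles: completed (70.84s wall, 1.0x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
"""

SLP_LOG = """\
form_multipoles: completed (220.47s wall, 1.0x CPU)
multipole_to_local: completed (489.84s wall, 13.4x CPU)
form_locals: completed (1422.97s wall, 0.9x CPU)
form_global_qbx_locals: completed (2093.70s wall, 20.0x CPU)
qbx fmm: completed (4818.42s wall, 11.6x CPU)
"""

for stage, dlp_wall, slp_wall, ratio in compare(
    parse_timings(DLP_LOG), parse_timings(SLP_LOG)
):
    print(f"{stage:28s} {dlp_wall:9.2f}s {slp_wall:9.2f}s  {ratio:5.2f}x")
```

On this subset, form_global_qbx_locals has the largest ratio (roughly 5.5x), followed by form_locals and form_multipoles; multipole_to_local grows the least. This only ranks the symptoms and says nothing about the cause.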