Improve parallelism in fmmlib interface

The new logging infrastructure (be747563, boxtree!46 (merged)) makes it clear which parts of the code have:

long run times
no parallelism at all
or poor parallel efficiency

form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)

The lowest-hanging fruit, in descending order of juiciness:

form_locals: completed (414.51s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
form_multipoles: completed (70.84s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
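
A ranking like the one above can be reproduced from the log with a short script. This is a minimal sketch, not part of boxtree or the logging infrastructure: the names `parse_timings` and `serial_stages` are hypothetical, the regular expression assumes the exact line format shown in the log, and the 1.1x cutoff for "effectively serial" is an assumption. With that cutoff, the result may include serial stages beyond the four named above.

```python
import re

# Timing lines copied verbatim from the log in this issue.
LOG = """\
form_multipoles: completed (70.84s wall, 1.0x CPU)
coarsen_multipoles: completed (107.20s wall, 1.0x CPU)
eval_direct: completed (1.08s wall, 1.1x CPU)
multipole_to_local: completed (333.29s wall, 19.6x CPU)
eval_multipoles: completed (1.25s wall, 1.0x CPU)
form_locals: completed (414.51s wall, 1.0x CPU)
refine_locals: completed (64.89s wall, 1.0x CPU)
eval_locals: completed (1.17s wall, 1.0x CPU)
form_global_qbx_locals: completed (379.56s wall, 19.4x CPU)
translate_box_multipoles_to_qbx_local: completed (180.96s wall, 15.2x CPU)
translate_box_local_to_qbx_local: completed (67.63s wall, 1.0x CPU)
eval_qbx_expansions: completed (8.51s wall, 1.0x CPU)
qbx fmm: completed (1633.51s wall, 10.6x CPU)
"""

# Assumed line format: "<stage>: completed (<wall>s wall, <ratio>x CPU)".
LINE_RE = re.compile(
    r"^(?P<stage>.+): completed "
    r"\((?P<wall>[\d.]+)s wall, (?P<cpu>[\d.]+)x CPU\)$"
)


def parse_timings(log):
    """Return a list of (stage, wall_seconds, cpu_ratio) tuples."""
    rows = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append((
                match.group("stage"),
                float(match.group("wall")),
                float(match.group("cpu")),
            ))
    return rows


def serial_stages(rows, max_cpu_ratio=1.1, exclude=("qbx fmm",)):
    """Stages at or below the CPU-ratio cutoff, longest wall time first.

    The cutoff is an assumption; "qbx fmm" is excluded because it is
    the enclosing total rather than an individual stage.
    """
    picked = [
        row for row in rows
        if row[2] <= max_cpu_ratio and row[0] not in exclude
    ]
    return sorted(picked, key=lambda row: row[1], reverse=True)


rows = parse_timings(LOG)
ranked = serial_stages(rows)
for stage, wall, cpu in ranked:
    print(f"{stage:40s} {wall:8.2f}s  {cpu:.1f}x CPU")
```

Sorting by wall time among the stages with a CPU ratio near 1.0 puts the stages where adding parallelism would recover the most wall-clock time at the top.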

cc @howard28 @haogao2 @mattwala

Edited Sep 11, 2018 by Andreas Klöckner