Skip to content

Compare efficiency of broadcasting the complete tree across all ranks

Hao Gao requested to merge broadcast-particles into distributed-fmm-global

In the old implementation, the root rank computes the local tree objects and distributes each local tree to a worker rank. For N particles and P processes, the complexity of generating one local tree is O(N), so the total cost is O(NP).

This MR broadcasts the complete tree across all ranks. Then, worker ranks can generate the complete traversal object and compute the local tree independently. The complexity is O(NlogP).

Experiments shows this already offers benefit for 16 nodes.

Total time of generating global tree/traversal + generating/distributing local tree + generating local traversal.

case 4 nodes 16 nodes
old 83.85s 94.32s
new 86.77s 88.63s
Edited by Hao Gao

Merge request reports