Compare efficiency of broadcasting the complete tree across all ranks
In the old implementation, the root rank computes the local tree objects and distributes each local tree to a worker rank. For N particles and P processes, the complexity of generating one local tree is O(N), so the total cost is O(NP).
This MR broadcasts the complete tree across all ranks. Then, worker ranks can generate the complete traversal object and compute the local tree independently. The complexity is O(NlogP).
Experiments shows this already offers benefit for 16 nodes.
Total time of generating global tree/traversal + generating/distributing local tree + generating local traversal.
| case | 4 nodes | 16 nodes |
|---|---|---|
| old | 83.85s | 94.32s |
| new | 86.77s | 88.63s |
Edited by Hao Gao