Compare efficiency of broadcasting the complete tree across all ranks
In the old implementation, the root rank computes the local tree objects and distributes each local tree to a worker rank. For N
particles and P
processes, the complexity of generating one local tree is O(N)
, so the total cost is O(NP)
.
This MR broadcasts the complete tree across all ranks. Then, worker ranks can generate the complete traversal object and compute the local tree independently. The complexity is O(NlogP)
.
Experiments shows this already offers benefit for 16 nodes.
Total time of generating global tree/traversal + generating/distributing local tree + generating local traversal.
case | 4 nodes | 16 nodes |
---|---|---|
old | 83.85s | 94.32s |
new | 86.77s | 88.63s |
Edited by Hao Gao