Unverified commit f53993dd authored by gaohao95, committed by GitHub

Distributed FMM with complete tree structure on each rank (#49)



* Print absolute and relative errors.

* Add stats collection code.

* Add scaling test.

* Generate responsible_boxes_mask on the host. This saves a lot of time
compared with keeping it on the device.

* Revert "Add scaling test."

This reverts commit 3210a175.

* Revert "Add stats collection code."

This reverts commit 501db499.

* Add test script using ConstantOneExpansionWrangler

* Revert "Revert "Add scaling test.""

This reverts commit 903585b9.

* Revert "Print absolute and relative errors."

This reverts commit 514c527f.

* Collect local mask and scan

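As an illustration of the mask-and-scan pattern mentioned above, here is a minimal NumPy sketch; the array names and values are invented for illustration, and the actual implementation may run on the device rather than the host.

```python
import numpy as np

# Hypothetical boolean mask marking the boxes this rank is responsible for.
responsible_boxes_mask = np.array([0, 1, 1, 0, 1, 0, 1, 1], dtype=bool)

# Exclusive prefix sum ("scan") assigns each responsible box a contiguous local index.
local_index = np.cumsum(responsible_boxes_mask) - responsible_boxes_mask

# Gather the global ids of the responsible boxes.
responsible_box_ids = np.flatnonzero(responsible_boxes_mask)

assert local_index[responsible_box_ids].tolist() == list(range(len(responsible_box_ids)))
```
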
* Remove unnecessary argument

* Move local weights construction to drive_dfmm

* Improve comment.

* s/butterfly/tree-like/

* Revert "Revert "Revert "Add scaling test."""

This reverts commit 3545bc74.

* Add a comment about efficiency of the mask compressor kernel

* Integrate distributed FMM with compressed list 3

* A dict is more appropriate than a class without methods

* Add workload_weight interface

* Improve work partition to consider m2l, m2p and p2l

* Handle different FMM order by level

* Memoize local tree and traversal build

* Make test script more concise

* Revert memoization to prevent potential deadlock

* Support well_sep_is_n_away flag

* Fix flake8

* Handle list 3, 4 near merged with list 1

* Get well_sep_is_n_away from trav object on root

* Refactor source weight distribution

* Put source weights into tree order before distribute_source_weight

* Temporarily use one global command queue for worker process instead of an argument

* Refactor code

* Add tgt_box_mask to local_data

* Add more options for generate_local_travs

* _from_sep_smaller_min_nsources_cumul is not compatible with distributed implementation

* Use Python's standard logging module

* Build one traversal object with different source flags instead of two separate traversals

* Distribute sources and targets without pickle

* Use waitall in tree distribution instead of a loop

* Use nonblocking recv for tree distribution

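A hedged mpi4py sketch of the non-blocking pattern the two items above describe: post Isend/Irecv requests and complete them with a single Waitall instead of a loop of blocking calls. The buffer contents, sizes, and tag are placeholders.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()

if rank == 0:
    # Root posts one non-blocking send per worker and waits for all of them at once.
    payload = np.arange(100, dtype=np.float64)
    reqs = [comm.Isend(payload, dest=dst, tag=7) for dst in range(1, nranks)]
    MPI.Request.Waitall(reqs)
else:
    # Each worker posts a non-blocking receive and waits for it to complete.
    buf = np.empty(100, dtype=np.float64)
    req = comm.Irecv(buf, source=0, tag=7)
    req.Wait()
```
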
* Reduce request object

* Use multidimensional array instead of object arrays for local sources/targets

* Move work partition to a separate file

* Refactor ancestor boxes construction

* Refactor source box construction

* Add doc for function src_boxes_mask

* Refactor multipole boxes query

* Refactor local tree building

* Bug fix

* Tweak interface so that partition work is outside local tree build

* Add get_boxes_mask

* Make responsible_box_query an argument

* Update documentation

* Use to_device API

* Refactor code, move distributed implementation into a submodule

* Add more documentation

* Remove mask and scan from local_data to save device memory

* Improve code quality

* Add documentation for FMM calculation

* Integrate test_distributed with pytest

* Integrate constantone test case into test_distributed

* Improve doc

* Add performance model for form_multipole

* Correction for timing API

* Add no_targets option

* Refactor total FMM terms computation

* Count the workload of direct evaluation

* Refactor linear regression, add eval_direct model

* Extend linear regression to multiple variables

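A minimal sketch of least squares over multiple variables with NumPy; the feature names here are placeholders, not the cost model's actual counters.

```python
import numpy as np

# Hypothetical measurements: wall time as a function of several workload counters.
nsources_by_box = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
ntargets_by_box = np.array([90.0, 300.0, 60.0, 380.0, 120.0])
wall_time = np.array([0.012, 0.031, 0.008, 0.047, 0.016])

# Design matrix with an intercept column; least squares over multiple variables.
design = np.column_stack([np.ones_like(wall_time), nsources_by_box, ntargets_by_box])
coeffs, *_ = np.linalg.lstsq(design, wall_time, rcond=None)
predicted = design @ coeffs
```
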
* Refactor FMM parameters

* Count m2l operations

* Add script for testing performance model

* Add m2l model

* Add eval_multipoles model

* Add form_locals model

* Add eval_locals model

* Move code around and refactoring

* Integrate performance model into distributed implementation

* Bug fix

* Improve performance model of list 3 and 4

* Add save/load to performance model

* Use robust linear regression

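One way to make such a fit robust to timing outliers is an M-estimator; below is a sketch using statsmodels' RLM with invented data. The choice of estimator is an assumption for illustration, not necessarily what this PR uses.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical workload counts and measured wall times (one outlier).
work = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0])
wall_time = np.array([0.010, 0.021, 0.029, 0.042, 0.250, 0.061])

# Huber's M-estimator downweights the outlier instead of squaring its residual.
design = sm.add_constant(work)  # intercept + slope
rlm_result = sm.RLM(wall_time, design, M=sm.robust.norms.HuberT()).fit()
intercept, slope = rlm_result.params
```
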
* Allow distributed FMM to use the model on disk; bug fix

* Improve logging

* More logging

* Add barriers for accurate timing

* Save and load the perf model to json file

* Add default performance model

* Refactor default perf model

* Refactor direct eval source box counting

* Refactor boxes time prediction

* Tweak interface

* Remove FMM driver argument from perf model

* Remove wrangler factory for performance model

* __add__ for TimingResult

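A hypothetical sketch of what an `__add__` on a timing-result record can look like, accumulating elapsed times field by field. This is only an illustration of the interface idea, not the actual TimingResult class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalTimingResult:
    """Illustrative stand-in for a timing-result record."""
    wall_elapsed: float
    process_elapsed: float

    def __add__(self, other):
        # Adding two results accumulates elapsed times, e.g. across FMM stages.
        return HypotheticalTimingResult(
            wall_elapsed=self.wall_elapsed + other.wall_elapsed,
            process_elapsed=self.process_elapsed + other.process_elapsed)


total = (HypotheticalTimingResult(0.5, 0.4)
         + HypotheticalTimingResult(1.2, 1.1))
```
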
* Add box_target_counts_nonchild kwarg

* Fix syntax error

* Update __add__

* Update uses of TimingResult in the performance model to new interface

* Another fix

* Fixes for reindexing List 4 close

* [WIP] Try to fix CIs on distributed-fmm-global branch

* Use setuptools.find_packages() in setup.py

* Flake8 fixes

* Localize a few more imports in test_distributed.py

* Flake 8 fix

* Make statsmodels optional

* More detailed warning on statsmodels

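A sketch of the optional-dependency pattern the two items above suggest; the warning text is illustrative.

```python
try:
    import statsmodels.api as sm
except ImportError:
    sm = None
    from warnings import warn
    warn("statsmodels is not installed; robust linear regression is unavailable.")
```
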
* Add PY_EXTRA_FLAGS="-m mpi4py.run" to help avoid stuck MPI-based CIs

* Use -m mpi4py.run when spawning subprocesses

* Turn off output capture (for now) on MPI test

* Temporarily add level_nterms to constantone wrangler

* Fix typo

* Add default_perf_model.json to package_data

* Force OpenMP to use 1 thread in test cases

* Add a note for not importing mpi4py.MPI at module level

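Taken together, the MPI/CI items above amount to two habits: launch test subprocesses under mpi4py.run with OpenMP pinned to one thread, and import mpi4py.MPI lazily rather than at module level. A sketch under those assumptions; the script name, rank count, launcher, and helper names are made up.

```python
import os
import subprocess
import sys


def run_mpi_test(script="mpi_test_script.py", num_ranks=4):
    """Spawn an MPI test as a subprocess; a sketch, not the actual test driver."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"  # keep OpenMP from oversubscribing the cores

    # "-m mpi4py.run" makes an uncaught exception abort the whole MPI job
    # instead of leaving the other ranks hanging.
    subprocess.run(
        ["mpiexec", "-np", str(num_ranks),
         sys.executable, "-m", "mpi4py.run", script],
        env=env, check=True)


def inside_the_test():
    # Import mpi4py.MPI lazily, so merely importing the test module
    # does not initialize MPI.
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank()
```
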
* Raise runtime error when partition cannot be done

* Add time recording utility

* Bug fix

* Bug fix

* Integrate OpenCL cost model with distributed implementation

* Try disable recv mprobe

* Revert "Try disable recv mprobe"

This reverts commit 4d16f6d0.

* Use new cost model interface

* Remove record_timing argument

* Find boxes in subrange: loop only over contributing boxes instead of all boxes

* Try PyOpenCL instead of Loopy for find_boxes_used_by_subrange

* Try pure Python for find_boxes_used_by_subrange

* Try PyOpenCL version again

* Use cost model with calibration parameters

* Add explanation for local_traversal

* Use new cost model interface

* Allow MPI oversubscription in distributed test cases

* Revert "Allow MPI oversubscription in distributed test cases"

This reverts commit 1ae75402.

* Support MPICH for distributed test cases

* Skip distributed test cases for Python<3.5

* Improve doc

* Use new cost model API

* Allow oversubscription in OpenMPI

* Move run_mpi to tools

* Broadcast the complete traversal object to all worker ranks

* Broadcast tree instead of traversal

* Change generate_local_tree interface for accepting a tree instead of a traversal

* Gather source and target indices outside generate_local_tree

* Add a base class DistributedExpansionWrangler

* Fix pylint

* More pylint fix

* Add more documentation

* Log timing for each stage in distributed FMM eval

* Refactor local tree generation

* Placate flake8

* Refactor distributed FMM driver into drive_fmm

* Fix test failure

* Move kernels for generating local trees to methods instead of using namedtuple

* Improve partition interfaces

* Add ImmutableHostDeviceArray

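The name suggests a wrapper that keeps a host copy and lazily mirrors it onto the device, handing out whichever view a consumer asks for. The sketch below is a guess at that idea, not the class actually added here.

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array


class HostDeviceArraySketch:
    """Illustrative wrapper that lazily mirrors a numpy array onto a device."""

    def __init__(self, queue, host_array):
        self.queue = queue
        self.host = host_array
        self._device = None

    @property
    def device(self):
        if self._device is None:
            self._device = cl_array.to_device(self.queue, self.host)
        return self._device


ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
wrapped = HostDeviceArraySketch(queue, np.arange(10, dtype=np.float64))
assert np.array_equal(wrapped.device.get(), wrapped.host)
```
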
* Add documentation page for the distributed implementation

* Fix doc CI failure

* Address reviewer's comments

* Register pytest markers

* Expansion wranglers: do not store trees

* Remove (unrealized) Wrangler.tree_dependent_info doc

* Add sumpy downstream CI

* Address reviewer's comments

* Accept both host and device array in ImmutableHostDeviceArray

* Fix test failures for tree/wrangler refactor

* Create fmm.py

* More justification for TraversalAndWrangler design

* Placate flake8

* Back out ill-fated TraversalAndWrangler, introduce TreeIndependentDataForWrangler, introduce boxtree.{timing,constant_one}

* Fix pylint/flake8 for tree-indep data for wrangler

* Update the may-hold-tree comment in the wrangler docstring

* Fix incorrect merge of drive_fmm docstring

* Remove *zeros methods from wrangler interface

* Tweak sumpy downstream to use appropriate branch

* Fix downstream Github CI script

* Fix downstream CI script syntax

* Refactor so that FMMLibTreeIndependentDataForWrangler knows kernel but not Helmholtz k

* Adjust downstream pytential CI for wrangler-refactor branch

* Add template_ary arg to finalize_potentials

* Remove global_wrangler from drive_fmm interface

* Mark DistributedExpansionWrangler as an abstract class

* Move distributed logic and MPI communicator to the distributed wrangler

* Remove queue from the distributed wrangler

* Partition boxes on the root rank and distribute the endpoints

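A NumPy sketch of one way to choose per-rank endpoints from a cumulative cost array on the root, so that only the endpoints need to be sent to the workers; the costs are invented and the PR's actual partitioning logic may differ.

```python
import numpy as np

nranks = 4
# Hypothetical predicted cost per box, in traversal order.
box_costs = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])

cumulative = np.cumsum(box_costs)
total = cumulative[-1]

# Endpoints such that each rank receives roughly total/nranks worth of work.
split_indices = np.searchsorted(cumulative, total * np.arange(1, nranks) / nranks)
rank_ranges = np.concatenate([[0], split_indices, [len(box_costs)]])
# Rank r is responsible for boxes rank_ranges[r]:rank_ranges[r + 1].
```
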
* Use PyOpenCL's fancy indexing instead of custom kernels

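A short PyOpenCL sketch of replacing a custom gather kernel with take-style indexing on device arrays; the data and context setup are illustrative.

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

values = cl_array.to_device(queue, np.arange(10.0))
picked_indices = cl_array.to_device(queue, np.array([7, 1, 4], dtype=np.int32))

# Gather on the device without writing a custom kernel.
gathered = cl_array.take(values, picked_indices, queue=queue)
assert np.array_equal(gathered.get(), np.array([7.0, 1.0, 4.0]))
```
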
* Address reviewer's comments

* Placate pylint

* Improve documentation

* Add Github CI job with MPI tests

* Revert "Add Github CI job with MPI tests"

This reverts commit 8b2e392b.

* Placate pylint

* Use MPICH version as default

* Address reviewer's suggestions

* Use code container to control cache lifetime

* Address reviewer's suggestions on distributed partitioning

* Address reviewer's comments on local tree

* Refactor repeated logic of generating local particles

* Make local trees by calling constructors instead of copying

* Continue to address reviewer's comments on local tree generation

* Use the inherited version of Tree to calculate

* Restrict source boxes directly instead of relying on local_box_flags

* Move the logic of modifying target box flags to local tree construction

* Minor cleanups

Co-authored-by: Matt Wala <wala1@illinois.edu>
Co-authored-by: Andreas Klöckner <inform@tiker.net>
Co-authored-by: Hao Gao <haogao@Haos-MacBook-Pro.local>
parent 34d75dcf
Pipeline #281752 passed in 34 minutes and 46 seconds