Distributed FMM with complete tree structure on each rank (#49)

* Print absolute and relative errors.
* Add stats collection code.
* Add scaling test.
* Generate responsible_boxes_mask on the host.
  This saves a lot of time compared with keeping it on the device.
* Revert "Add scaling test."
  This reverts commit 3210a175.
* Revert "Add stats collection code."
  This reverts commit 501db499.
* Add test script using ConstantOneExpansionWrangler
* Revert "Revert "Add scaling test.""
  This reverts commit 903585b9.
* Revert "Print absolute and relative errors."
  This reverts commit 514c527f.
* Collect local mask and scan
* Remove unnecessary argument
* Move local weights construction to drive_dfmm
* Improve comment.
* s/butterfly/tree-like/
* Revert "Revert "Revert "Add scaling test."""
  This reverts commit 3545bc74.
* Add a comment about efficiency of the mask compressor kernel
* Integrate distributed FMM with compressed list 3
* Dict is more proper than class without methods
* Add workload_weight interface
* Improve work partition to consider m2l, m2p and p2l
* Handle different FMM order by level
* Memorize local tree and traversal build
* Make test script more concise
* Revert memorization to prevent potential deadlock
* Support well_sep_is_n_away flag
* Fix flake8
* Handle list 3, 4 near merged with list 1
* Get well_sep_is_n_away from trav object on root
* Refactor source weight distribution
* put source weight to tree order before distribute_source_weight
* Temporarily use one global command queue for worker process instead of an argument
* Refactor code
* Add tgt_box_mask to local_data
* Add more options for generate_local_travs
* _from_sep_smaller_min_nsources_cumul is not compatible with distributed implementation
* Use Python's standard logging module
* Build one traversal object with different source flags instead of two separate traversals
* Distribute sources and targets without pickle
* Use waitall in tree distribution instead of a loop
* Use nonblocking recv for tree distribution
* Reduce request object
* Use multidimensional array instead of object arrays for local sources/targets
* Move work partition to a separate file
* Refactor ancestor boxes construction
* Refactor source box construction
* Add doc for function src_boxes_mask
* Refactor multipole boxes query
* Refactor local tree building
* Bug fix
* Tweak interface so that partition work is outside local tree build
* Add get_boxes_mask
* Make responsible_box_query an argument
* Update documentation
* Use to_device API
* Refactor code, move distributed implementation into a submodule
* Add more documentation
* Remove mask and scan from local_data to save device memory
* Improve code quality
* Add documentation for FMM calculation
* Integrate test_distributed with pytest
* Integrate constantone test case into test_distributed
* Improve doc
* Add performance model for form_multipole
* Correction for timing API
* Add no_targets option
* Refactor total FMM terms computation
* Count the workload of direct evaluation
* Refactor linear regression, add eval_direct model
* Extend linear regression to multiple variables
* Refactor FMM parameters
* Count m2l operations
* Add script for testing performance model
* Add m2l model
* Add eval_multipoles model
* Add form_locals model
* Add eval_locals model
* Move code around and refactoring
* Integrate performance model into distributed implementation
* Bug fix
* Improve performance model of list 3 and 4
* Add save/load to performance model
* Use robust linear regression
* Allow distributed fmm to use model on disk, bug fix
* Improve logging
* More logging
* Add barriers for accurate timing
* Save and load the perf model to json file
* Add default performance model
* Refactor default perf model
* Refactor direct eval source box counting
* Refactor boxes time prediction
* Tweak interface
* Remove FMM driver argument from perf model
* Remove wrangler factory for performance model
* __add__ for TimingResult
* Add box_target_counts_nonchild kwarg
* Fix syntax error
* Update __add__
* Update uses of TimingResult in the performance model to new interface
* Another fix
* Fixes for reindexing List 4 close
* [WIP] Try to fix CIs on distributed-fmm-global branch
* Use setuptools.find_packages() in setup.py
* Flake8 fixes
* Localize a few more imports in test_distributed.py
* Flake 8 fix
* Make statsmodels optional
* More detailed warning on statsmodels
* Add PY_EXTRA_FLAGS="-m mpi4py.run" to help avoid stuck MPI-based CIs
* Use -m mpi4py.run in spawning subprocesses
* Turn off output capture (for now) on MPI test
* Temporarily add level_nterms to constantone wrangler
* Fix typo
* Add default_perf_model.json to package_data
* Force OpenMP to use 1 thread in test cases
* Add a note for not importing mpi4py.MPI at module level
* Raise runtime error when partition cannot be done
* Add time recording utility
* Bug fix
* Bug fix
* Integrate OpenCL cost model with distributed implementation
* Try disable recv mprobe
* Revert "Try disable recv mprobe"
  This reverts commit 4d16f6d0.
* Use new cost model interface
* Remove record_timing argument
* Find boxes in subrange only loop through contrib boxes instead of all boxes
* Try PyOpenCL instead of Loopy for find_boxes_used_by_subrange
* Try pure Python for find_boxes_used_by_subrange
* Try PyOpenCL version again
* Use cost model with calibration parameters
* Add explanation for local_traversal
* Use new cost model interface
* Allow MPI oversubscription in distributed test cases
* Revert "Allow MPI oversubscription in distributed test cases"
  This reverts commit 1ae75402.
* Support MPICH for distributed test cases
* Skip distributed test cases for Python<3.5
* Improve doc
* Use new cost model API
* Allow oversubscription in OpenMPI
* Move run_mpi to tools
* Broadcast the complete traversal object to all worker ranks
* Broadcast tree instead of traversal
* Change generate_local_tree interface for accepting a tree instead of a traversal
* Gather source and target indices outside generate_local_tree
* Add a base class DistributedExpansionWrangler
* Fix pylint
* More pylint fix
* Add more documentation
* Log timing for each stage in distributed FMM eval
* Refactor local tree generation
* Placate flake8
* Refactor distributed FMM driver into drive_fmm
* Fix test failure
* Move kernels for generating local trees to methods instead of using namedtuple
* Improve partition interfaces
* Add ImmutableHostDeviceArray
* Add documentation page for the distributed implementation
* Fix doc CI failure
* Address reviewer's comments
* Register pytest markers
* Expansion wranglers: do not store trees
* Remove (unrealized) Wrangler.tree_dependent_info doc
* Add sumpy downstream CI
* Address reviewer's comments
* Accept both host and device array in ImmutableHostDeviceArray
* Fix test failures for tree/wrangler refactor
* Create fmm.py
* More justification for TraversalAndWrangler design
* Placate flake8
* Back out ill-fated TraversalAndWrangler, introduce TreeIndependentDataForWrangler, introduce boxtree.{timing,constant_one}
* Fix pylint/flake8 for tree-indep data for wrangler
* Update the may-hold-tree comment in the wrangler docstring
* Fix incorrect merge of drive_fmm docstring
* Remove *zeros methods from wrangler interface
* Tweak sumpy downstream to use appropriate branch
* Fix downstream Github CI script
* Fix downstream CI script syntax
* Refactor so that FMMLibTreeIndependentDataForWrangler knows kernel but not Helmholtz k
* Adjust downstream pytential CI for wrangler-refactor branch
* Add template_ary arg to finalize_potentials
* Remove global_wrangler from drive_fmm interface
* Mark DistributedExpansionWrangler as an abstract class
* Move distributed logic and MPI communicator to the distributed wrangler
* Remove queue from the distributed wrangler
* Partition boxes on the root rank and distribute the endpoints
* Use PyOpenCL's fancy indexing in favor of custom kernels
* Address reviewer's comments
* placate pylint
* Improve documentation
* Add Github CI job with MPI tests
* Revert "Add Github CI job with MPI tests"
  This reverts commit 8b2e392b.
* Placate pylint
* Use MPICH version as default
* Address reviewer's suggestions
* Use code container to control cache lifetime
* Address reviewer's suggestions on distributed partitioning
* Address reviewer's comments on local tree
* Refactor repeated logic of generating local particles
* Make local trees by calling constructors instead of copying
* Continue to address reviewer's comments on local tree generation
* Use the inherited version of Tree to calculate
* Restrict source boxes directly instead of relying on local_box_flags
* Move the logic of modifying target box flags to local tree construction
* Minor cleanups

Co-authored-by: Matt Wala <wala1@illinois.edu>
Co-authored-by: Andreas Klöckner <inform@tiker.net>
Co-authored-by: Hao Gao <haogao@Haos-MacBook-Pro.local>