f53993dd
Distributed FMM with complete tree structure on each rank (#49)
gaohao95 authored
    
    
    * Print absolute and relative errors.
    
    * Add stats collection code.
    
    * Add scaling test.
    
    * Generate responsible_boxes_mask on the host. This saves a lot of time
    compared with keeping it on the device.
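A minimal sketch of the idea in this commit, assuming numpy: build the boolean ownership mask on the host with one vectorized scatter, then transfer it to the device once, instead of issuing many small device-side updates. The names `responsible_boxes_mask` and `responsible_boxes_list` are illustrative, not necessarily the actual identifiers.

```python
import numpy as np

def responsible_boxes_mask(nboxes, responsible_boxes_list):
    """Host-side boolean mask marking the boxes this rank is responsible for.

    Building the mask with a single numpy scatter on the host and copying
    it to the device once is much cheaper than maintaining it on the device.
    """
    mask = np.zeros(nboxes, dtype=np.int8)
    mask[responsible_boxes_list] = 1
    return mask

mask = responsible_boxes_mask(8, np.array([1, 3, 4]))
```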
    
    * Revert "Add scaling test."
    
    This reverts commit 3210a175.
    
    * Revert "Add stats collection code."
    
    This reverts commit 501db499.
    
    * Add test script using ConstantOneExpansionWrangler
    
    * Revert "Revert "Add scaling test.""
    
    This reverts commit 903585b9.
    
    * Revert "Print absolute and relative errors."
    
    This reverts commit 514c527f.
    
    * Collect local mask and scan
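The mask-and-scan pattern referenced here is a standard stream-compaction technique: a mask marks the entries to keep, and an exclusive prefix sum of the mask gives each kept entry its output position. A small numpy illustration (the data is synthetic):

```python
import numpy as np

# Mask-and-scan stream compaction: keep the masked entries of `values`,
# using an exclusive prefix sum of the mask to compute output indices.
values = np.array([10, 20, 30, 40, 50])
mask = np.array([1, 0, 1, 1, 0])

scan = np.concatenate(([0], np.cumsum(mask)[:-1]))  # exclusive scan
out = np.empty(int(mask.sum()), dtype=values.dtype)
out[scan[mask == 1]] = values[mask == 1]
```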
    
    * Remove unnecessary argument
    
    * Move local weights construction to drive_dfmm
    
    * Improve comment.
    
    * s/butterfly/tree-like/
    
    * Revert "Revert "Revert "Add scaling test."""
    
    This reverts commit 3545bc74.
    
    * Add a comment about efficiency of the mask compressor kernel
    
    * Integrate distributed FMM with compressed list 3
    
    * A dict is more appropriate than a class without methods
    
    * Add workload_weight interface
    
    * Improve work partition to consider m2l, m2p and p2l
    
    * Handle different FMM order by level
    
    * Memoize local tree and traversal build
    
    * Make test script more concise
    
    * Revert memoization to prevent a potential deadlock
    
    * Support well_sep_is_n_away flag
    
    * Fix flake8
    
    * Handle list 3, 4 near merged with list 1
    
    * Get well_sep_is_n_away from trav object on root
    
    * Refactor source weight distribution
    
    * Put source weights in tree order before distribute_source_weight
    
    * Temporarily use one global command queue for worker process instead of an argument
    
    * Refactor code
    
    * Add tgt_box_mask to local_data
    
    * Add more options for generate_local_travs
    
    * _from_sep_smaller_min_nsources_cumul is not compatible with distributed implementation
    
    * Use Python's standard logging module
    
    * Build one traversal object with different source flags instead of two separate traversals
    
    * Distribute sources and targets without pickle
    
    * Use waitall in tree distribution instead of a loop
    
    * Use nonblocking recv for tree distribution
    
    * Reduce request object
    
    * Use multidimensional array instead of object arrays for local sources/targets
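The switch from object arrays to a multidimensional array, sketched with numpy: a contiguous `(ndims, nparticles)` array is a single typed buffer that MPI can send directly, whereas an object array of per-dimension arrays must be pickled. Variable names here are illustrative.

```python
import numpy as np

# Object array of per-dimension coordinate arrays: each entry is a separate
# buffer, so sending it over MPI requires pickling.
sources_obj = np.empty(2, dtype=object)
sources_obj[0] = np.array([0.0, 1.0, 2.0])  # x coordinates
sources_obj[1] = np.array([3.0, 4.0, 5.0])  # y coordinates

# Contiguous (ndims, nparticles) array: one typed buffer that can be
# transmitted directly without pickle.
sources = np.stack([sources_obj[0], sources_obj[1]])
```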
    
    * Move work partition to a separate file
    
    * Refactor ancestor boxes construction
    
    * Refactor source box construction
    
    * Add doc for function src_boxes_mask
    
    * Refactor multipole boxes query
    
    * Refactor local tree building
    
    * Bug fix
    
    * Tweak interface so that partition work is outside local tree build
    
    * Add get_boxes_mask
    
    * Make responsible_box_query an argument
    
    * Update documentation
    
    * Use to_device API
    
    * Refactor code, move distributed implementation into a submodule
    
    * Add more documentation
    
    * Remove mask and scan from local_data to save device memory
    
    * Improve code quality
    
    * Add documentation for FMM calculation
    
    * Integrate test_distributed with pytest
    
    * Integrate constantone test case into test_distributed
    
    * Improve doc
    
    * Add performance model for form_multipole
    
    * Correction for timing API
    
    * Add no_targets option
    
    * Refactor total FMM terms computation
    
    * Count the workload of direct evaluation
    
    * Refactor linear regression, add eval_direct model
    
    * Extend linear regression to multiple variables
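Extending the performance-model regression to multiple variables amounts to a multivariate least-squares fit: one coefficient per workload feature, fit against measured wall times. A sketch with synthetic data (the feature columns are hypothetical stand-ins for the model's actual workload counts):

```python
import numpy as np

# Hypothetical per-run workload features (e.g. interaction counts for two
# stages) and wall times generated from known coefficients for this demo.
features = np.array([
    [100.0, 10.0],
    [200.0, 30.0],
    [400.0, 50.0],
    [800.0, 60.0],
])
times = features @ np.array([2e-3, 5e-3])  # synthetic timings

# Least-squares fit recovers one cost coefficient per feature.
coeffs, *_ = np.linalg.lstsq(features, times, rcond=None)
```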
    
    * Refactor FMM parameters
    
    * Count m2l operations
    
    * Add script for testing performance model
    
    * Add m2l model
    
    * Add eval_multipoles model
    
    * Add form_locals model
    
    * Add eval_locals model
    
    * Move code around and refactor
    
    * Integrate performance model into distributed implementation
    
    * Bug fix
    
    * Improve performance model of list 3 and 4
    
    * Add save/load to performance model
    
    * Use robust linear regression
    
    * Allow distributed fmm to use model on disk, bug fix
    
    * Improve logging
    
    * More logging
    
    * Add barriers for accurate timing
    
    * Save and load the perf model to json file
    
    * Add default performance model
    
    * Refactor default perf model
    
    * Refactor direct eval source box counting
    
    * Refactor boxes time prediction
    
    * Tweak interface
    
    * Remove FMM driver argument from perf model
    
    * Remove wrangler factory for performance model
    
    * __add__ for TimingResult
    
    * Add box_target_counts_nonchild kwarg
    
    * Fix syntax error
    
    * Update __add__
    
    * Update uses of TimingResult in the performance model to new interface
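The point of giving `TimingResult` an `__add__` is that per-stage timings collected across multiple calls can be merged by summation. A hypothetical sketch of the idea (not boxtree's actual class, whose fields and interface may differ):

```python
from dataclasses import dataclass

@dataclass
class TimingResult:
    """Hypothetical sketch: a timing record that merges by addition."""
    wall_elapsed: float

    def __add__(self, other):
        # Summing lets timings from repeated stages be accumulated into one
        # record, e.g. in the performance model's bookkeeping.
        return TimingResult(self.wall_elapsed + other.wall_elapsed)

total = TimingResult(1.5) + TimingResult(2.5)
```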
    
    * Another fix
    
    * Fixes for reindexing List 4 close
    
    * [WIP] Try to fix CIs on distributed-fmm-global branch
    
    * Use setuptools.find_packages() in setup.py
    
    * Flake8 fixes
    
    * Localize a few more imports in test_distributed.py
    
    * Flake8 fix
    
    * Make statsmodels optional
    
    * More detailed warning on statsmodels
    
    * Add PY_EXTRA_FLAGS="-m mpi4py.run" to help avoid stuck MPI-based CIs
    
    * Use -m mpi4py.run in spawning subprocesses
    
    * Turn off output capture (for now) on MPI test
    
    * Temporarily add level_nterms to constantone wrangler
    
    * Fix typo
    
    * Add default_perf_model.json to package_data
    
    * Force OpenMP to use 1 thread in test cases
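Pinning OpenMP to one thread in tests keeps MPI ranks from oversubscribing cores. This is typically done via the standard `OMP_NUM_THREADS` environment variable, set before any OpenMP-using library initializes its thread pool:

```python
import os

# Must be set before the first OpenMP-using library spins up its threads;
# each MPI test rank then uses a single OpenMP thread.
os.environ["OMP_NUM_THREADS"] = "1"
```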
    
    * Add a note for not importing mpi4py.MPI at module level
    
    * Raise runtime error when partition cannot be done
    
    * Add time recording utility
    
    * Bug fix
    
    * Bug fix
    
    * Integrate OpenCL cost model with distributed implementation
    
    * Try disable recv mprobe
    
    * Revert "Try disable recv mprobe"
    
    This reverts commit 4d16f6d0.
    
    * Use new cost model interface
    
    * Remove record_timing argument
    
    * Find boxes in subrange by looping only over contributing boxes instead of all boxes
    
    * Try PyOpenCL instead of Loopy for find_boxes_used_by_subrange
    
    * Try pure Python for find_boxes_used_by_subrange
    
    * Try PyOpenCL version again
    
    * Use cost model with calibration parameters
    
    * Add explanation for local_traversal
    
    * Use new cost model interface
    
    * Allow MPI oversubscription in distributed test cases
    
    * Revert "Allow MPI oversubscription in distributed test cases"
    
    This reverts commit 1ae75402.
    
    * Support MPICH for distributed test cases
    
    * Skip distributed test cases for Python<3.5
    
    * Improve doc
    
    * Use new cost model API
    
    * Allow oversubscription in OpenMPI
    
    * Move run_mpi to tools
    
    * Broadcast the complete traversal object to all worker ranks
    
    * Broadcast tree instead of traversal
    
    * Change generate_local_tree interface for accepting a tree instead of a traversal
    
    * Gather source and target indices outside generate_local_tree
    
    * Add a base class DistributedExpansionWrangler
    
    * Fix pylint
    
    * More pylint fix
    
    * Add more documentation
    
    * Log timing for each stage in distributed FMM eval
    
    * Refactor local tree generation
    
    * Placate flake8
    
    * Refactor distributed FMM driver into drive_fmm
    
    * Fix test failure
    
    * Move kernels for generating local trees to methods instead of using namedtuple
    
    * Improve partition interfaces
    
    * Add ImmutableHostDeviceArray
    
    * Add documentation page for the distributed implementation
    
    * Fix doc CI failure
    
    * Address reviewer's comments
    
    * Register pytest markers
    
    * Expansion wranglers: do not store trees
    
    * Remove (unrealized) Wrangler.tree_dependent_info doc
    
    * Add sumpy downstream CI
    
    * Address reviewer's comments
    
    * Accept both host and device array in ImmutableHostDeviceArray
    
    * Fix test failures for tree/wrangler refactor
    
    * Create fmm.py
    
    * More justification for TraversalAndWrangler design
    
    * Placate flake8
    
    * Back out ill-fated TraversalAndWrangler, introduce TreeIndependentDataForWrangler, introduce boxtree.{timing,constant_one}
    
    * Fix pylint/flake8 for tree-indep data for wrangler
    
    * Update the may-hold-tree comment in the wrangler docstring
    
    * Fix incorrect merge of drive_fmm docstring
    
    * Remove *zeros methods from wrangler interface
    
    * Tweak sumpy downstream to use appropriate branch
    
    * Fix downstream Github CI script
    
    * Fix downstream CI script syntax
    
    * Refactor so that FMMLibTreeIndependentDataForWrangler knows kernel but not Helmholtz k
    
    * Adjust downstream pytential CI for wrangler-refactor branch
    
    * Add template_ary arg to finalize_potentials
    
    * Remove global_wrangler from drive_fmm interface
    
    * Mark DistributedExpansionWrangler as an abstract class
    
    * Move distributed logic and MPI communicator to the distributed wrangler
    
    * Remove queue from the distributed wrangler
    
    * Partition boxes on the root rank and distribute the endpoints
    
    * Use PyOpenCL's fancy indexing in favor of custom kernels
    
    * Address reviewer's comments
    
    * Placate pylint
    
    * Improve documentation
    
    * Add Github CI job with MPI tests
    
    * Revert "Add Github CI job with MPI tests"
    
    This reverts commit 8b2e392b.
    
    * Placate pylint
    
    * Use MPICH version as default
    
    * Address reviewer's suggestions
    
    * Use code container to control cache lifetime
    
    * Address reviewer's suggestions on distributed partitioning
    
    * Address reviewer's comments on local tree
    
    * Refactor repeated logic of generating local particles
    
    * Make local trees by calling constructors instead of copying
    
    * Continue to address reviewer's comments on local tree generation
    
    * Use the inherited version of Tree to calculate
    
    * Restrict source boxes directly instead of relying on local_box_flags
    
    * Move the logic of modifying target box flags to local tree construction
    
    * Minor cleanups
    
    Co-authored-by: Matt Wala <wala1@illinois.edu>
    Co-authored-by: Andreas Klöckner <inform@tiker.net>
    Co-authored-by: Hao Gao <haogao@Haos-MacBook-Pro.local>