Distributed v3 (#148)
* add test
* use are_shape_components_equal
  Co-authored-by: Kaushik Kulkarni <15399010+kaushikcfd@users.noreply.github.com>
* fix
* Add random graph generation
* push current state
* Random DAG generator actually tests things
* Fix show_dot_graph invocation that tripped up pylint
* Partitioner: Partitions exist even when they have no in/out edges
* Partitioner: Refactor disjointness check into separate function
* get_dot_graph_from_partitions: Do not require toposorted partition list
* test_partitioner: properly use make_random_dag
* Move transform functionality to separate file, fix partition type annotations
* Use PartitionId type alias for partition IDs
* Add pytato.partition to docs
* Fix flake8 in partition
* Code folding in pytato.partition
* Partition: rename _handle_new_binding -> _handle_parent_child
* Split EdgeCachedMapper out of graph partitioner
* EdgeCachedMapper: support *args
* Remove beginnings of _PartitionSplitter
* Add PartitionInducedCycleError, bail on test if cycle encountered
* remove DictOfNamedArrays.__str__
* add GraphToDictMapper
* remove *args, derive from CachedMapper
* rename to Userscollector, add index map functions
* flake8
* add random DAG test
* better doc
* flake8
* remove changes to get_dot_graph
* remove spurious array.py changes
* add missing distributed.py
* misc fixes
* doc fix
* fix equality
* Update array.py
* Update partition.py
* Update __init__.py
* misc type fixes
* rename
* mypy
* make example run again
* fix and rename tag_child_nodes
* add test
* flake8
* use ints
* don't duplicate input arrays
* make ArrayToDotNodeInfoMapper a cached mapper
* make ArrayToDotNodeInfoMapper a cached mapper
* fixes
* fix get_dot_graph_from_partitions
* various constructor fixes
* inputs fix
* MPI fixes
* comm passing
* DistributedSend: remove shape,dtype, Recv: remove data
* Partitioned vis: only emit Placeholders once
* Hook distributed docs into main docs
* Refactor distributed for DistributedSendRefHolder
* misc fixes
* add a few
  sanity checks
* fix reverse_graph
* flake8
* fix recv hang
  Previously, send was stapled to recv, and both were in the same partition. This meant that the irecv was waited on before the corresponding send, leading to a deadlock.
* Revert "fix recv hang"
  This reverts commit ba374bc91c60cb33374e4aa7034024a8f18818b1.
* Rename DistributedSendRefHolder.{data->passthrough_data}
* UsersCollector: Collect DistributedSend as a separate user
* Add ArrayToDotNodeInfoMapper.map_distributed_send_ref_holder to appropriately traverse send
* Add nitpicky style FIXMEs to CodePartitions
* Drop _DistributedCommReplacer from generate_code_for_partitions
* Make DistributedSend a Hashable
* make_distributed_recv: normalize dtype
* Create DistributedGraphPartitions, rework representation of distributed graphs
* Teach the graph visualizer how to show DistributedGraphPartitions
* Adapt distributed example to new distributed graph data structure
* rename CodePartitions to GraphPartitions
* fixup CodePartitions rename
* mypy fixes
* bail before execution
* make recv part of partition
* make addr part of fields
* doc fix
* Revert "make recv part of partition"
  This reverts commit 8de5a686b43159a3fd19730cf09c79002c7f4e22.
* Refactor partitioning to use fewer dicts, preserve partial part order
* Track fewer-dicts refactor of partitioner in distributed
* Dynamically select ready parts in execute_partition_distributed
* Rename _GraphPartitioner.{seen_partition_ids->seen_part_ids}
* Partitioner: more partition -> part renaming
* Mypy-clean distributed execution
* Partitioned Vis: Emit non-Placeholder input arrays inside their partitions
* Make map_distributed_send_ref_holder an abstract method of EdgeCachedMapper
* find_partition: Use better var name for _GraphPartitioner
* Teach find_partition to handle DistributedSend (fixable design fail)
* gather_distributed_comm_info: Handle sends extracted by partitioner
* Distributed example: make get_part_id a nested function
* Distributed exec: use non-blocking send, wait for send request completion
* Fix direction of Part.needed_pids
* Distributed exec: ready_pids: actually contain pids
* Distributed exec: fix minor bugs, distributed example works
* Remove abort from distributed example app
* Placate flake8 about distributed example
* Refactor partitioning to use fewer dicts, preserve partial part order
* Rename _GraphPartitioner.{seen_partition_ids->seen_part_ids}
* Partitioner: more partition -> part renaming
* Partitioned Vis: Emit non-Placeholder input arrays inside their partitions
* Test get_dot_graph_from_partition as part of test_partitioner
* find_partition: Use better var name for _GraphPartitioner
* Fix direction of Part.needed_pids
* Fix find_partition
* Fix doc warnings
* Fix find_partition
* Fix doc warnings
* lint fixes
* run CI with multiple ranks
* mpi fix
* another ci fix
* ci fix
* work around doc build failure
* better test for example + cleanup
* extract find_partition_distributed
* add basic pytest
* add random dag test
* change comm tag
* add first pass comment
* fix get_dot_graph_from_partition doc
* export staple_distributed_send
* export more functions
* add missing map_loopy_call
* Change canonical import location for
  distributed functionality
* Re-break circular imports for doc build in pytato.partition
* add comment to doc
* Use MRO to find Array mapper methods
* Use a separate mapper method for LoopyCallResult
* Visualization: Skip data in DataWrapper
* Visualization: Support visualization of LoopyCall, DictOfNamedArray
* Partition: gather user_input_names for each part
* Partitioned visualization: Do not mishandle Placeholders from user input
* lint fixes
* simplify pid_to_user_input_names
* spelling
* Fix partition/vis type annotations
* Ensure ph names are unique across partitions in _DistributedCommReplacer
* Dist: Rename *_partition_distributed -> *_distributed_partition, drop gather_comm_info
* Clarify, rename GraphPart.{user,partition}_input_names
* send numpy data
* receive cl buffer
* small name change
* add fixme
* Distributed receive: Use cl.array.to_device
* Fix DistributedSend.copy: args are optional
* Fix DistributedRecv._fields: shape and dtype were missing
* Distributed: rename arguments comm -> mpi_communicator
* Distributed: Support symbolic tags, add number_distributed_tags
* fix running with a single rank
* opencl/numpy arg fixes
* lint fixes
* another doc fix
* frozenset in number_tags
* Refactor _GraphPartitioner/find_partition so that it does not know about distributed_sends
* attempt to address axes changes
* better checking for non-existing recvs
* rename _gather_distributed_comm_info arg
* simplify getting output
* mypy fix
* show hex id
* fix walkmapper for dist recv
* better LoopyCall visualization
* debug
* add missing axes
* assert that we are giving find_distributed_partition not too many outputs
* fix merge error
* disable some debug
* flake
* less strict disjoint checking
* lint fixes
* fixes
* another one
* add to stringifier
* support multiple outputs in find_distributed_partition
* reprifier
* undo disable disjoint check
* undo check disable
* flake8
* better test
* Avoid type-ignores in single-rank case of execute_distributed_partition
* find_partition: avoid passing a partitioner instance
* Revert some tag/einsum merge accidents
* simplify recv_names*
* Random DAG generator: Add support for user-supplied 'additional_generators'
* Distributed tests: actually run with MPI, fix 'basic' test, use comm nodes in random DAG
* use number_distributed_tags
* document partitioner class
* add comment regarding renumbering
* refactor random comm generation
* restore currentmodule:: pytato.tags (was this removed accidentally?)
* fix doc build
* fix flake8
* remove TODO (attributes are kind of obvious now)
* cleanup _check_partition_disjointness
* cleanup distributed.py imports
* fix doc build
* add NodeCountMapper
* fix doc
* Document GraphPartitioner interface
* Revert "refactor random comm generation"
  This reverts commit 4fd2dd1873302a648a8168a745cfcd0630a319b5.
* Add a comment explaining index reversal in distributed exec
* _do_test_distributed_execution_random_dag: Use numpy for reference result
  Compiled evaluation for these graphs seems to compute incorrect results, see gh-255.

Co-authored-by: Kaushik Kulkarni <15399010+kaushikcfd@users.noreply.github.com>
Co-authored-by: [6~ <inform@tiker.net>
parent bd4ce53e
Pipeline #262406 failed in 12 minutes and 42 seconds