Codegen much slower due to do_access_ranges_overlap_conservative()
Here is a profile of a recent cold-cache run of some pytential code. The code builds and runs an order-20 FMM with the Laplace kernel.
Thu Mar 8 14:00:44 2018 out.prof
475425154 function calls (447088610 primitive calls) in 390.339 seconds
Ordered by: cumulative time
List reduced from 10357 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
1669/1 0.092 0.000 390.582 390.582 {built-in method builtins.exec}
1 0.000 0.000 390.582 390.582 test.py:1(<module>)
1 0.000 0.000 390.104 390.104 test.py:107(test_qbx_cauchy_integral)
2 0.001 0.000 390.097 195.048 /Users/matt/src/conformal-map-paper/code/qbx.py:28(qbx_cauchy_integral)
6849590/3644 2.566 0.000 389.342 0.107 /Users/matt/miniconda3/envs/inteq/lib/python3.6/site-packages/pytools/__init__.py:569(wrapper)
57/25 0.000 0.000 384.512 15.380 /Users/matt/src/pytential/pytential/symbolic/execution.py:303(__call__)
57/25 0.003 0.000 384.512 15.380 /Users/matt/src/pytential/pytential/symbolic/compiler.py:365(execute)
6 0.000 0.000 383.562 63.927 /Users/matt/src/pytential/pytential/qbx/__init__.py:492(exec_compute_potential_insn)
6 0.001 0.000 383.557 63.926 /Users/matt/src/pytential/pytential/qbx/__init__.py:556(exec_compute_potential_insn_fmm)
6 0.002 0.000 379.547 63.258 /Users/matt/src/pytential/pytential/qbx/fmm.py:360(drive_fmm)
445 0.005 0.000 286.728 0.644 /Users/matt/src/loopy/loopy/kernel/__init__.py:1262(__call__)
445 0.005 0.000 286.549 0.644 /Users/matt/src/loopy/loopy/target/pyopencl_execution.py:314(__call__)
65 0.005 0.000 286.332 4.405 /Users/matt/src/loopy/loopy/target/pyopencl_execution.py:273(kernel_info)
5203139/712804 7.307 0.000 237.544 0.000 /Users/matt/miniconda3/envs/inteq/lib/python3.6/site-packages/pymbolic/mapper/__init__.py:114(__call__)
65 0.001 0.000 221.180 3.403 /Users/matt/src/loopy/loopy/target/execution.py:762(get_typed_and_scheduled_kernel)
25 0.026 0.001 211.100 8.444 /Users/matt/src/loopy/loopy/target/execution.py:728(get_typed_and_scheduled_kernel_uncached)
25 0.001 0.000 184.095 7.364 /Users/matt/src/loopy/loopy/schedule/__init__.py:2051(get_one_scheduled_kernel)
25 0.000 0.000 179.949 7.198 /Users/matt/src/loopy/loopy/schedule/__init__.py:2038(_get_one_scheduled_kernel_inner)
50 0.001 0.000 179.949 3.599 /Users/matt/src/loopy/loopy/schedule/__init__.py:1837(generate_loop_schedules)
180399/180352 0.049 0.000 177.455 0.001 {built-in method builtins.next}
50 0.005 0.000 177.282 3.546 /Users/matt/src/loopy/loopy/schedule/__init__.py:1854(generate_loop_schedules_inner)
25 0.001 0.000 168.311 6.732 /Users/matt/src/loopy/loopy/check.py:595(pre_schedule_checks)
22001 2.010 0.000 164.969 0.007 /Users/matt/src/loopy/loopy/symbolic.py:1583(get_access_range)
25163/25121 0.258 0.000 164.552 0.007 /Users/matt/src/loopy/loopy/symbolic.py:1697(map_subscript)
25 0.005 0.000 162.958 6.518 /Users/matt/src/loopy/loopy/check.py:560(check_variable_access_ordered)
25 0.233 0.009 162.953 6.518 /Users/matt/src/loopy/loopy/check.py:446(_check_variable_access_ordered_inner)
8202 0.054 0.000 158.409 0.019 /Users/matt/src/loopy/loopy/symbolic.py:1806(do_access_ranges_overlap_conservative)
52278 0.066 0.000 157.768 0.003 /Users/matt/src/loopy/loopy/symbolic.py:1753(__call__)
16404 0.343 0.000 157.614 0.010 /Users/matt/src/loopy/loopy/symbolic.py:1769(_get_access_range_conservative)
24 0.001 0.000 93.877 3.912 /Users/matt/src/sumpy/sumpy/tools.py:366(get_cached_optimized_kernel)
In particular, about 157 seconds of cumulative time (out of roughly 390 total) is spent in /Users/matt/src/loopy/loopy/symbolic.py:1753 (__call__), which is part of AccessRangeMapper.
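For anyone who wants to reproduce a listing in this format, here is a minimal sketch using the standard-library cProfile and pstats modules. The functions slow_check and workload are hypothetical stand-ins, not pytential or loopy code; in practice the real test entry point would take the place of workload. The sketch prints the top entries by cumulative time and then the callers of one function of interest, which may help narrow down where the calls into the access range checks come from.

```python
import cProfile
import io
import pstats


def slow_check(n):
    # Hypothetical stand-in for an expensive per-pair check.
    return sum(i * i for i in range(n))


def workload():
    # Hypothetical stand-in for the code being profiled.
    return [slow_check(2000) for _ in range(50)]


def profile_to_text(func, limit=30, callers_of=None):
    """Profile func() and return a pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func)

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)

    # Same ordering and row limit as the listing above.
    stats.sort_stats("cumulative").print_stats(limit)

    if callers_of is not None:
        # The argument is treated as a regex matched against function entries.
        stats.print_callers(callers_of)

    return out.getvalue()


report = profile_to_text(workload, limit=30, callers_of="slow_check")
print(report)
```

Passing a name such as do_access_ranges_overlap_conservative as callers_of on a real run would list which functions account for its calls.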