Slowness in test_fmm with warm cache
Related: pytential#38
Running test_fmm (with a warm disk cache) results in the following profile output.
- Worst (cumulative) loopy offenders:
  ```
  Sun May 21 17:41:41 2017    fmm2.prof

  66086176 function calls (59437428 primitive calls) in 91.009 seconds

  Ordered by: cumulative time
  List reduced from 10388 to 597 due to restriction <'loopy'>
  List reduced from 597 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     675    0.015    0.000   52.935    0.078 /home/matt/src/loopy/loopy/kernel/__init__.py:1365(__call__)
     675    0.016    0.000   52.616    0.078 /home/matt/src/loopy/loopy/target/pyopencl_execution.py:704(__call__)
     225    0.022    0.000   41.617    0.185 /home/matt/src/loopy/loopy/target/pyopencl_execution.py:646(cl_kernel_info)
     225    0.084    0.000   26.974    0.120 /home/matt/src/loopy/loopy/execution.py:140(get_typed_and_scheduled_kernel)
     225    0.005    0.000   10.077    0.045 /home/matt/src/loopy/loopy/schedule/__init__.py:1967(get_one_scheduled_kernel)
     225    0.007    0.000    9.836    0.044 /home/matt/src/loopy/loopy/codegen/__init__.py:375(generate_code_v2)
     225    0.006    0.000    9.405    0.042 /home/matt/src/loopy/loopy/preprocess.py:1109(preprocess_kernel)
     675    0.015    0.000    8.017    0.012 /home/matt/src/loopy/loopy/kernel/__init__.py:1442(update_persistent_hash)
     225    0.169    0.001    7.141    0.032 /home/matt/src/loopy/loopy/type_inference.py:472(infer_unknown_types)
   36975    0.490    0.000    5.779    0.000 /home/matt/src/loopy/loopy/kernel/instruction.py:821(update_persistent_hash)
  ```
- Worst (cumulative) pyopencl offenders:
  ```
  Sun May 21 17:41:41 2017    fmm2.prof

  66086176 function calls (59437428 primitive calls) in 91.009 seconds

  Ordered by: cumulative time
  List reduced from 10388 to 767 due to restriction <'/home/matt/src/pyopencl'>
  List reduced from 767 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     120    0.008    0.000   14.801    0.123 /home/matt/src/pyopencl/pyopencl/scan.py:870(__init__)
     120    0.019    0.000   14.777    0.123 /home/matt/src/pyopencl/pyopencl/scan.py:1057(finish_setup)
     240    0.015    0.000   12.016    0.050 /home/matt/src/pyopencl/pyopencl/scan.py:1248(build_scan_kernel)
     360    0.007    0.000   11.633    0.032 /home/matt/src/pyopencl/pyopencl/scan.py:828(_make_template)
     498    0.003    0.000   11.024    0.022 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:808(finish)
    3012    0.032    0.000   10.768    0.004 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:2127(enqueue_nd_range_kernel)
      96    0.014    0.000    8.652    0.090 /home/matt/src/pyopencl/pyopencl/algorithm.py:959(__call__)
      60    0.002    0.000    6.992    0.117 /home/matt/src/pyopencl/pyopencl/algorithm.py:816(get_scan_kernel)
   26811    0.028    0.000    5.330    0.000 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:227(__del__)
    1963    0.077    0.000    4.863    0.002 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:1692(_set_set_args_body)
  ```
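
For reference, listings in this shape come straight from `cProfile`/`pstats`; the two "List reduced ..." lines correspond to the two arguments passed to `print_stats`. A minimal, self-contained sketch (the profiled function is a stand-in, not `test_fmm`):

```python
import cProfile
import pstats

def work():
    # Stand-in workload; in the real case this would be the test run.
    return sum(i * i for i in range(10000))

pr = cProfile.Profile()
pr.runcall(work)
pr.dump_stats("example.prof")

stats = pstats.Stats("example.prof")
stats.sort_stats("cumulative")
# First restriction is a regex filter on filename:lineno(function),
# second keeps only the top 10 rows -- mirroring the output above.
stats.print_stats("loopy", 10)
```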
Thoughts:
- I am somewhat surprised that scan kernel building shows up in the profile at all. The scan kernels are used by the tree builder in boxtree. Isn't rebuilding them something the caching mechanism should take care of?
- As with pytential, cl_kernel_info is a bad offender here (about 45% of total time). This strongly suggests we should look into caching type inference results, at the very least.