H2D/Y2D kernel codegen is pretty slow
From https://gitlab.tiker.net/inducer/pytential/-/jobs/31393:
```
========================== slowest 10 test durations ===========================
48.44s call test/test_fmm.py::test_sumpy_fmm[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl6-H2DLocalExpansion-H2DMultipoleExpansion]
39.44s call test/test_kernels.py::test_translations[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl4-H2DLocalExpansion-H2DMultipoleExpansion]
24.29s call test/test_fmm.py::test_sumpy_fmm[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl9-Y2DLocalExpansion-Y2DMultipoleExpansion]
13.27s call test/test_fmm.py::test_sumpy_fmm[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl4-VolumeTaylorLocalExpansion-VolumeTaylorMultipoleExpansion]
12.32s call test/test_fmm.py::test_sumpy_fmm_exclude_self[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>]
12.27s call test/test_fmm.py::test_sumpy_fmm[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl5-HelmholtzConformingVolumeTaylorLocalExpansion-HelmholtzConformingVolumeTaylorMultipoleExpansion]
8.14s call test/test_kernels.py::test_p2e2p[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-True-base_knl9-H2DMultipoleExpansion-4]
6.70s call test/test_kernels.py::test_translations[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl2-VolumeTaylorLocalExpansion-VolumeTaylorMultipoleExpansion]
6.40s call test/test_kernels.py::test_translations[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl3-HelmholtzConformingVolumeTaylorLocalExpansion-HelmholtzConformingVolumeTaylorMultipoleExpansion]
6.27s call test/test_fmm.py::test_sumpy_fmm[ctx_getter=<context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz' on 'Portable Computing Language' at 0x557a157af7e0>>-knl3-LaplaceConformingVolumeTaylorLocalExpansion-LaplaceConformingVolumeTaylorMultipoleExpansion]
```
Now, granted, these typically use a higher order than the Taylor ones, but still: all they do is spit out and CSE a formula. There isn't even any numerical differentiation involved. I don't see why they should be taking so long. :)
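For context, here is a toy sketch of the "CSE a formula" step, as a two-pass common-subexpression elimination over nested tuples. This is *not* sumpy's actual implementation (which works on symbolic expressions); it is just an illustration, under the assumption that CSE amounts to counting repeated compound subexpressions and pulling each repeated one out into a temporary. With hashing, that pass is roughly linear in expression size, which is part of why the timings above seem out of proportion.

```python
# Toy two-pass CSE over nested tuple expressions like ("*", ("+", "x", "y"), ...).
# Purely illustrative; not sumpy's codegen. Temporaries are named t0, t1, ...
from collections import Counter


def _count_subexprs(expr, counts):
    """First pass: count how often each compound subexpression occurs."""
    if isinstance(expr, tuple):
        counts[expr] += 1
        for child in expr[1:]:
            _count_subexprs(child, counts)


def cse(expr):
    """Return (assignments, reduced_expr), where each subexpression that
    occurs more than once is replaced by a temporary name."""
    counts = Counter()
    _count_subexprs(expr, counts)

    names = {}        # subexpression -> temporary name
    assignments = []  # (name, reduced subexpression), in definition order

    def rebuild(e):
        if not isinstance(e, tuple):  # leaf: variable or constant
            return e
        if e in names:                # already extracted
            return names[e]
        reduced = (e[0],) + tuple(rebuild(c) for c in e[1:])
        if counts[e] > 1:             # repeated: pull out a temporary
            name = f"t{len(assignments)}"
            assignments.append((name, reduced))
            names[e] = name
            return name
        return reduced

    return assignments, rebuild(expr)


# (x + y) * (x + y): the repeated sum is extracted once.
assignments, reduced = cse(("*", ("+", "x", "y"), ("+", "x", "y")))
print(assignments)  # [('t0', ('+', 'x', 'y'))]
print(reduced)      # ('*', 't0', 't0')
```

Since the hash-based pass scales well, the cost presumably lies in *producing* the expressions (e.g. symbolic manipulation of the Hankel/Bessel terms in the H2D/Y2D expansions) rather than in the elimination itself, but that would need profiling to confirm.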
Edited by Andreas Klöckner