Slowness in test_fmm with warm cache
Related: pytential#38
Running test_fmm (with a warm disk cache) results in the following profile output.
- Worst (cumulative) loopy offenders:
  ```
  Sun May 21 17:41:41 2017    fmm2.prof

  66086176 function calls (59437428 primitive calls) in 91.009 seconds

  Ordered by: cumulative time
  List reduced from 10388 to 597 due to restriction <'loopy'>
  List reduced from 597 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     675    0.015    0.000   52.935    0.078 /home/matt/src/loopy/loopy/kernel/__init__.py:1365(__call__)
     675    0.016    0.000   52.616    0.078 /home/matt/src/loopy/loopy/target/pyopencl_execution.py:704(__call__)
     225    0.022    0.000   41.617    0.185 /home/matt/src/loopy/loopy/target/pyopencl_execution.py:646(cl_kernel_info)
     225    0.084    0.000   26.974    0.120 /home/matt/src/loopy/loopy/execution.py:140(get_typed_and_scheduled_kernel)
     225    0.005    0.000   10.077    0.045 /home/matt/src/loopy/loopy/schedule/__init__.py:1967(get_one_scheduled_kernel)
     225    0.007    0.000    9.836    0.044 /home/matt/src/loopy/loopy/codegen/__init__.py:375(generate_code_v2)
     225    0.006    0.000    9.405    0.042 /home/matt/src/loopy/loopy/preprocess.py:1109(preprocess_kernel)
     675    0.015    0.000    8.017    0.012 /home/matt/src/loopy/loopy/kernel/__init__.py:1442(update_persistent_hash)
     225    0.169    0.001    7.141    0.032 /home/matt/src/loopy/loopy/type_inference.py:472(infer_unknown_types)
   36975    0.490    0.000    5.779    0.000 /home/matt/src/loopy/loopy/kernel/instruction.py:821(update_persistent_hash)
  ```
- Worst (cumulative) pyopencl offenders:
  ```
  Sun May 21 17:41:41 2017    fmm2.prof

  66086176 function calls (59437428 primitive calls) in 91.009 seconds

  Ordered by: cumulative time
  List reduced from 10388 to 767 due to restriction <'/home/matt/src/pyopencl'>
  List reduced from 767 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     120    0.008    0.000   14.801    0.123 /home/matt/src/pyopencl/pyopencl/scan.py:870(__init__)
     120    0.019    0.000   14.777    0.123 /home/matt/src/pyopencl/pyopencl/scan.py:1057(finish_setup)
     240    0.015    0.000   12.016    0.050 /home/matt/src/pyopencl/pyopencl/scan.py:1248(build_scan_kernel)
     360    0.007    0.000   11.633    0.032 /home/matt/src/pyopencl/pyopencl/scan.py:828(_make_template)
     498    0.003    0.000   11.024    0.022 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:808(finish)
    3012    0.032    0.000   10.768    0.004 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:2127(enqueue_nd_range_kernel)
      96    0.014    0.000    8.652    0.090 /home/matt/src/pyopencl/pyopencl/algorithm.py:959(__call__)
      60    0.002    0.000    6.992    0.117 /home/matt/src/pyopencl/pyopencl/algorithm.py:816(get_scan_kernel)
   26811    0.028    0.000    5.330    0.000 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:227(__del__)
    1963    0.077    0.000    4.863    0.002 /home/matt/src/pyopencl/pyopencl/cffi_cl.py:1692(_set_set_args_body)
  ```
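
For reference, listings in this shape come straight from `cProfile`/`pstats`; the two "List reduced ..." lines correspond to the two arguments passed to `print_stats`. A minimal, self-contained sketch (the profiled function is a stand-in, not `test_fmm`):

```python
import cProfile
import pstats

def work():
    # Stand-in workload; in the real case this would be the test run.
    return sum(i * i for i in range(10000))

pr = cProfile.Profile()
pr.runcall(work)
pr.dump_stats("example.prof")

stats = pstats.Stats("example.prof")
stats.sort_stats("cumulative")
# First restriction is a regex filter on filename:lineno(function),
# second keeps only the top 10 rows -- mirroring the output above.
stats.print_stats("loopy", 10)
```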
Thoughts:
- I am somewhat surprised that scan kernel building shows up in the profile at all. The scan kernels are used by the tree builder in boxtree. Isn't rebuilding them something the caching mechanism should take care of?
- As with pytential, cl_kernel_info is a bad offender here (about 45% of total time). This strongly suggests we should look into caching type inference results, at the very least.