Large performance difference between double and single layer P2P / List 1 interaction

This example computes a single-layer potential (SLP) and a double-layer potential (DLP) from the same source geometry (a starfish) to the same target geometry (a 1000x1000 grid). That all works, but the P2P evaluation for the double layer takes 4.21 s (loopy kernel: $13), while the one for the single layer takes 35.25 s (loopy kernel: $14), i.e. roughly 8x longer. The main difference seems to be that the SLP needs to evaluate a log while the DLP does not. (If that's true: Aaargh, "special" function hate. 😠)
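To make the arithmetic difference concrete, here is a minimal pure-Python sketch of the two P2P sums. It assumes the 2D Laplace kernels (the post only says the SLP needs a log, so that is my reading), and the function names `slp_p2p` and `dlp_p2p`, the unit-circle sources, and the small target grid are all hypothetical stand-ins. It is not the loopy kernels from $13/$14, and timings from an interpreter loop say nothing about what pocl generates; the point is only that the SLP inner loop has one `log` per source/target pair while the DLP inner loop gets by with multiplies, adds, and one divide.

```python
import math
import time


def slp_p2p(sources, targets, density):
    """Assumed 2D Laplace single layer: sum_j -1/(2 pi) log|x - y_j| * density_j."""
    result = []
    for tx, ty in targets:
        acc = 0.0
        for (sx, sy), w in zip(sources, density):
            r2 = (tx - sx) ** 2 + (ty - sy) ** 2
            # log|r| = 0.5 * log(r^2), so -1/(2 pi) log|r| = -1/(4 pi) log(r^2).
            # This is the one transcendental call per pair.
            acc += -0.25 / math.pi * math.log(r2) * w
        result.append(acc)
    return result


def dlp_p2p(sources, normals, targets, density):
    """Assumed 2D Laplace double layer: sum_j (x - y_j).n_j / (2 pi |x - y_j|^2) * density_j."""
    result = []
    for tx, ty in targets:
        acc = 0.0
        for (sx, sy), (nx, ny), w in zip(sources, normals, density):
            dx, dy = tx - sx, ty - sy
            # Only multiplies, adds, and a single divide: no log, no sqrt.
            acc += 0.5 / math.pi * (dx * nx + dy * ny) / (dx * dx + dy * dy) * w
        result.append(acc)
    return result


# Hypothetical small setup: sources on the unit circle (not a starfish),
# targets on a 10x10 grid well away from the curve.
n_src = 64
angles = [2 * math.pi * j / n_src for j in range(n_src)]
sources = [(math.cos(t), math.sin(t)) for t in angles]
normals = sources[:]  # the outward normal of the unit circle equals the point itself
density = [2 * math.pi / n_src] * n_src  # constant density times trapezoidal weight
targets = [(2.0 + 0.1 * i, 2.0 + 0.1 * k) for i in range(10) for k in range(10)]

t0 = time.perf_counter()
slp_vals = slp_p2p(sources, targets, density)
t1 = time.perf_counter()
dlp_vals = dlp_p2p(sources, normals, targets, density)
t2 = time.perf_counter()

print(f"SLP: {t1 - t0:.4f} s, DLP: {t2 - t1:.4f} s (interpreter timings, illustrative only)")
```

If the log really is the culprit, timing the same two kernels with the `log` swapped for something cheap (giving wrong results, but the same memory traffic) would be one way to confirm it.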

I'm running with pocl 0.13 on my laptop, in case it matters.

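@joshbevan: This explains the big perf difference we saw yesterday between SLP and DLP. It was misattributed to list 2 because of async OpenCL execution: presumably the P2P / list 1 kernel was still running when the timer for list 2 started, so its wall time got charged to list 2.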

cc @mattwala @isuruf @joshbevan @jjdoher2