Kernel launch overhead

I'm observing surprisingly high launch overhead for kernels, on the order of 80 µs per call. Below is rudimentary code to profile it with cProfile.

  • @inducer hypothesized that the dtype lookup in arg_to_dtype_set might be the culprit, which is why I added the extra redundant arguments. However, this doesn't seem to be the case per the results below.
  • Looping 1000x means the reported totals (in seconds) can be read directly as per-call times in milliseconds (I can't figure out how to change the reported time unit).
  • Notably, the results are slower by ~25% when run in a Jupyter notebook.
import loopy as lp
import pyopencl as cl
import pyopencl.array as cla
ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "f[i] = 1",
    [lp.GlobalArg('f', shape=('n',)),
     lp.GlobalArg('a', shape=('n',)),
     lp.GlobalArg('b', shape=('n',)),
     lp.GlobalArg('c', shape=('n',)),
     lp.GlobalArg('d', shape=('n',)),
     '...'],
)

f = cla.zeros(queue, shape=(1,), dtype='float64')

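# Call once before profiling so that one-time setup cost is not measured.
knl(queue, f=f, a=f, b=f, c=f, d=f)

def test():
    # Launch 1000 times, waiting on each event before the next launch.
    for i in range(1000):
        evt, _ = knl(queue, f=f, a=f, b=f, c=f, d=f)
        evt.wait()

import cProfile
# Sort by internal time (tottime), i.e. time spent in each function itself.
cProfile.run('test()', sort='tottime')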

The output I'm getting:

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.035    0.000    0.069    0.000 <generated code>:8(_lpy_host_loopy_kernel)
     1000    0.030    0.000    0.030    0.000 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
     1000    0.011    0.000    0.086    0.000 <generated code>:52(invoke_loopy_kernel_loopy_kernel)
     1000    0.005    0.000    0.008    0.000 execution.py:790(arg_to_dtype_set)
   ...
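
On the time-unit point: I'm not aware of a switch that makes cProfile print anything other than seconds, so one workaround is to read the raw numbers through pstats and rescale them. The following is only a minimal sketch. It assumes that the `stats` dictionary on `pstats.Stats` keeps its `(primitive calls, total calls, tottime, cumtime, callers)` layout, and `launch_stub` is a hypothetical stand-in for the `knl(...)` call plus `evt.wait()` so that the snippet runs without an OpenCL device. `per_call_microseconds` is likewise a helper name made up for this example.

```python
import cProfile
import pstats


def launch_stub():
    # Hypothetical stand-in for `knl(queue, ...)` followed by `evt.wait()`.
    return sum(range(100))


def test(n_calls=1000):
    for _ in range(n_calls):
        launch_stub()


def per_call_microseconds(profiler, top=5):
    """Return rows of (name, ncalls, tottime per call in us, cumtime per call in us),
    ordered by total internal time, largest first."""
    stats = pstats.Stats(profiler)
    rows = []
    # Assumed layout of each value: (primitive calls, total calls, tottime, cumtime, callers).
    for (_filename, _lineno, name), (_, ncalls, tottime, cumtime, _) in stats.stats.items():
        if ncalls == 0:
            continue
        rows.append((tottime, name, ncalls, 1e6 * tottime / ncalls, 1e6 * cumtime / ncalls))
    rows.sort(key=lambda row: row[0], reverse=True)
    # Drop the sort key before returning.
    return [row[1:] for row in rows[:top]]


profiler = cProfile.Profile()
profiler.enable()
test()
profiler.disable()

print(f"{'function':<45} {'ncalls':>7} {'tot us/call':>12} {'cum us/call':>12}")
for name, ncalls, tot_us, cum_us in per_call_microseconds(profiler):
    print(f"{name:<45} {ncalls:>7} {tot_us:>12.2f} {cum_us:>12.2f}")
```

Swapping `launch_stub()` for the real kernel call should give the same table in microseconds per call. For reference, the 0.086 s cumulative time reported above for 1000 calls of invoke_loopy_kernel_loopy_kernel works out to 86 µs per call.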