Kernel launch overhead

I'm observing that kernel launch overhead is surprisingly high, O(80 us) per launch. Below is rudimentary code to profile it with cProfile.

@inducer hypothesized that the dtype lookup in arg_to_dtype_set might be the culprit, hence the extra redundant arguments in the kernel. However, this doesn't seem to be the case per the results below.
Looping 1000x means the reported totals (in seconds) can be read as per-call times in milliseconds (I can't figure out how to change the reported time unit).
Notably, the results are slower by ~25% when run in a Jupyter notebook.

import loopy as lp
import pyopencl as cl
import pyopencl.array as cla
ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "f[i] = 1",
    [lp.GlobalArg('f', shape=('n',)),
     lp.GlobalArg('a', shape=('n',)),
     lp.GlobalArg('b', shape=('n',)),
     lp.GlobalArg('c', shape=('n',)),
     lp.GlobalArg('d', shape=('n',)),
     '...'],
)

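# A single-element array keeps the actual device work negligible
f = cla.zeros(queue, shape=(1,), dtype='float64')

# Warm-up call, so code generation and compilation are not part of the profile
knl(queue, f=f, a=f, b=f, c=f, d=f)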

def test():
    for i in range(1000):
        evt, _ = knl(queue, f=f, a=f, b=f, c=f, d=f)
        evt.wait()

import cProfile
# sort='tottime' is the named equivalent of sort=1 (internal time)
cProfile.run('test()', sort='tottime')

The output I'm getting:

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.035    0.000    0.069    0.000 <generated code>:8(_lpy_host_loopy_kernel)
     1000    0.030    0.000    0.030    0.000 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
     1000    0.011    0.000    0.086    0.000 <generated code>:52(invoke_loopy_kernel_loopy_kernel)
     1000    0.005    0.000    0.008    0.000 execution.py:790(arg_to_dtype_set)
   ...
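
On the time-unit question: as far as I can tell cProfile's printed report is always in seconds, but the raw numbers can be pulled out through pstats and rescaled. Below is a minimal sketch of that, assuming a stand-in workload instead of the loopy kernel; `per_call_times_us` and `workload` are names made up for this example, not part of loopy or pyopencl. For reference, 0.086 s cumulative over 1000 calls in the table above works out to 86 us per call, which is where the O(80 us) figure comes from.

```python
import cProfile
import pstats


def per_call_times_us(func, n_calls=1000):
    """Profile n_calls invocations of func and return
    (function name, ncalls, tottime per call [us], cumtime per call [us])
    tuples, ordered by total internal time, largest first."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        func()
    profiler.disable()

    rows = []
    # Stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, tottime, cumtime, callers); times are in seconds.
    for key, (_, ncalls, tottime, cumtime, _) in pstats.Stats(profiler).stats.items():
        if ncalls == 0:
            continue
        funcname = key[2]
        rows.append((funcname, ncalls,
                     1e6 * tottime / ncalls,
                     1e6 * cumtime / ncalls))

    # Same ordering as sort=1 ("internal time"): total tottime, descending
    return sorted(rows, key=lambda row: row[1] * row[2], reverse=True)


def workload():
    # Stand-in for the kernel call; replace with the real launch
    return sum(range(100))


if __name__ == "__main__":
    for name, ncalls, tot_us, cum_us in per_call_times_us(workload):
        print(f"{ncalls:8d} {tot_us:10.2f} {cum_us:10.2f}  {name}")
```

For the kernel above, `func` would presumably be something like `lambda: knl(queue, f=f, a=f, b=f, c=f, d=f)[0].wait()` (untested here).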