Kernel launch overhead
I observe that the launch overhead for kernels is surprisingly high, O(80 us) per call. Below is rudimentary code to profile it with cProfile.
- @inducer hypothesized that the dtype lookup in arg_to_dtype_set might be the culprit, hence my adding extra redundant arguments. However, this doesn't seem to be the case per the results below.
- Looping 1000x means the reported totals (nominally in seconds) can be read directly as milliseconds per call (can't figure out how to change the reported time unit).
- Notably, the results are slower by ~25% when run in a Jupyter notebook.
import loopy as lp
import pyopencl as cl
import pyopencl.array as cla
ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "f[i] = 1",
    [lp.GlobalArg('f', shape=('n',)),
     lp.GlobalArg('a', shape=('n',)),
     lp.GlobalArg('b', shape=('n',)),
     lp.GlobalArg('c', shape=('n',)),
     lp.GlobalArg('d', shape=('n',)),
     '...'],
)
f = cla.zeros(queue, shape=(1,), dtype='float64')
knl(queue, f=f, a=f, b=f, c=f, d=f)
def test():
    for i in range(1000):
        evt, _ = knl(queue, f=f, a=f, b=f, c=f, d=f)
    evt.wait()
import cProfile
cProfile.run('test()', sort=1)
The output I'm getting:
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.035    0.000    0.069    0.000 <generated code>:8(_lpy_host_loopy_kernel)
     1000    0.030    0.000    0.030    0.000 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
     1000    0.011    0.000    0.086    0.000 <generated code>:52(invoke_loopy_kernel_loopy_kernel)
     1000    0.005    0.000    0.008    0.000 execution.py:790(arg_to_dtype_set)
   ...
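On the time-unit question: as far as I know, the cProfile/pstats printout is always in seconds, but the raw numbers can be pulled out of pstats.Stats and rescaled by hand. Below is a minimal sketch of that, not a definitive solution. It assumes a stand-in function dummy_launch (hypothetical, so the snippet runs without an OpenCL device) in place of the real knl(queue, ...) call, and it reads the stats attribute of pstats.Stats, which is widely used but not formally documented.

```python
import cProfile
import pstats


def per_call_microseconds(func, n_calls=1000, top=5):
    """Profile n_calls invocations of func; return per-call times in microseconds."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        func()
    profiler.disable()

    stats = pstats.Stats(profiler)
    rows = []
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, tottime, cumtime, callers), times in seconds.
    for (_file, _line, name), (_cc, ncalls, tottime, cumtime, _callers) in stats.stats.items():
        rows.append((name, ncalls, 1e6 * tottime / ncalls, 1e6 * cumtime / ncalls, tottime))
    # Same ordering as sort=1 ("internal time") in cProfile.run: largest tottime first.
    rows.sort(key=lambda row: row[4], reverse=True)
    return [row[:4] for row in rows[:top]]


def dummy_launch():
    # Hypothetical stand-in for `knl(queue, f=f, a=f, b=f, c=f, d=f)`.
    return sum(range(100))


if __name__ == "__main__":
    print(f"{'function':45s} {'ncalls':>6s} {'tot us/call':>12s} {'cum us/call':>12s}")
    for name, ncalls, tot_us, cum_us in per_call_microseconds(dummy_launch):
        print(f"{name:45s} {ncalls:6d} {tot_us:12.3f} {cum_us:12.3f}")
```

Swapping dummy_launch for a closure around the real kernel call should give the same rows as the table above, just in microseconds per call. Keep in mind that cProfile adds its own per-call overhead, so the absolute numbers will be somewhat inflated either way.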