diff --git a/doc/tutorial.rst b/doc/tutorial.rst index 92ec799f7045cf63dc75d1386d8a51fd7d42954c..12c058fb741279db55521118f6711f197735dbd0 100644 --- a/doc/tutorial.rst +++ b/doc/tutorial.rst @@ -1118,14 +1118,17 @@ Once a work item has reached a barrier, it waits for everyone that it synchronizes with to reach the barrier before continuing. This means that unless all work items reach the same barrier, the kernel will hang during execution. +Barrier insertion +~~~~~~~~~~~~~~~~~ + By default, :mod:`loopy` inserts local barriers between two instructions when it detects that a dependency involving local memory may occur across work items. To see this in action, take a look at the section on :ref:`local_temporaries`. -In contrast, :mod:`loopy` will *not* insert global barriers automatically. -Global barriers require manual intervention along with some special -post-processing which we describe below. Consider the following kernel, which -attempts to rotate its input to the right by 1 in parallel: +In contrast, :mod:`loopy` will *not* insert global barriers automatically and +instead will report an error if it detects the need for a global barrier. As an +example, consider the following kernel, which attempts to rotate its input to +the right by 1 in parallel: .. doctest:: @@ -1153,8 +1156,22 @@ this, :mod:`loopy` will complain that global barrier needs to be inserted: ... MissingBarrierError: Dependency 'rotate depends on maketmp' (for variable 'arr') requires synchronization by a global barrier (add a 'no_sync_with' instruction option to state that no synchronization is needed) -The syntax for a global barrier instruction is ``... gbarrier``. This needs to -be added between the pair of offending instructions. +The syntax for a inserting a global barrier instruction is +``... gbarrier``. :mod:`loopy` also supports manually inserting local +barriers. The syntax for a local barrier instruction is ``... lbarrier``. + +Saving temporaries across global barriers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For some platforms (currently only PyOpenCL), :mod:`loopy` implements global +barriers by splitting the kernel into a host side kernel and multiple +device-side kernels. On such platforms, it will be necessary to save non-global +temporaries that are live across kernel calls. This section presents an example +of how to use :func:`loopy.save_and_reload_temporaries` which is helpful for +that purpose. + +Let us start with an example. Consider the kernel from above with a +``... gbarrier`` instruction that has already been inserted. .. doctest:: @@ -1175,17 +1192,16 @@ be added between the pair of offending instructions. ... assumptions="n mod 16 = 0") >>> knl = lp.split_iname(knl, "i", 16, inner_tag="l.0", outer_tag="g.0") -When we try to generate code for this, it will still not work. +Here is what happens when we try to generate code for the kernel: >>> cgr = lp.generate_code_v2(knl) Traceback (most recent call last): ... MissingDefinitionError: temporary variable 'tmp' gets used in subkernel 'rotate_v2_0' without a definition (maybe you forgot to call loopy.save_and_reload_temporaries?) -To understand what is going on, you need to know that :mod:`loopy` implements -global barriers by splitting the kernel into multiple device-side kernels. The -splitting happens when the instruction schedule is generated. To see the -schedule, we must first call :func:`loopy.get_one_scheduled_kernel`: +This happens due to the kernel splitting done by :mod:`loopy`. The splitting +happens when the instruction schedule is generated. To see the schedule, we +should call :func:`loopy.get_one_scheduled_kernel`: >>> knl = lp.get_one_scheduled_kernel(lp.preprocess_kernel(knl)) >>> print(knl) @@ -1280,7 +1296,7 @@ The kernel translates into two OpenCL kernels. arr[((1 + lid(0) + gid(0) * 16) % n)] = tmp; } -Executing the kernel does what we expect. +Now we can execute the kernel. >>> arr = cl.array.arange(queue, 16, dtype=np.int32) >>> print(arr)