diff --git a/doc/reference.rst b/doc/reference.rst index 355088a6b5be3c4d5e2863ffa7eb65b998775de6..0f03ab58e21cb3d4fd6fc75384285276c76bce6e 100644 --- a/doc/reference.rst +++ b/doc/reference.rst @@ -12,19 +12,28 @@ a more gentle introduction, you may consider reading the example-based .. _inames: +Domain Tree +----------- + + + Inames ------- +^^^^^^ + Loops are (by default) entered exactly once. This is necessary to preserve dependency semantics--otherwise e.g. a fetch could happen inside one loop nest, and then the instruction using that fetch could be inside a wholly different loop nest. -Integer Domain --------------- +Instructions +------------ Expressions ------------ +^^^^^^^^^^^ + +Loopy's expressions are a slight superset of the expressions supported by +:mod:`pymbolic`. * `if` * `reductions` @@ -47,21 +56,22 @@ are accepted, in addition to what is registered in :mod:`pyopencl`. .. _tags: -Tags ----- +Iname Implementation Tags +------------------------- ===================== ==================================================== Tag Meaning ===================== ==================================================== `None` | `"for"` Sequential loop `"l.N"` Local (intra-group) axis N -`"l.auto"` Automatically chosen local (intra-group) axis `"g.N"` Group-number axis N -`"unr"` Plain unrolling +`"unr"` Unroll `"ilp"` | `"ilp.unr"` Unroll using instruction-level parallelism `"ilp.seq"` Realize parallel iname as innermost loop ===================== ==================================================== +.. "l.auto" intentionally undocumented + (Throughout this table, `N` must be replaced by an actual number.) "ILP" does three things: diff --git a/doc/tutorial.rst b/doc/tutorial.rst index b65bd917657c918960dc57d9b3f6acf029dbb433..8203e483b3cf4cd2f237fdf8d6afef35b959b0d9 100644 --- a/doc/tutorial.rst +++ b/doc/tutorial.rst @@ -3,9 +3,9 @@ Tutorial ======== -This guide provides a gentle introduction into what loopy is, how it works, and -what it can do. There's also the :ref:`reference` that clearly defines all -aspects of loopy. +This guide provides a gentle introduction into what loopy is, how it works, +and what it can do. There's also the :ref:`reference` that aims to +unambiguously define all aspects of loopy. Preparation ----------- @@ -14,7 +14,6 @@ For now, :mod:`loopy` depends on :mod:`pyopencl`. We import a few modules and initialize :mod:`pyopencl` .. doctest:: - :options: +ELLIPSIS >>> import numpy as np >>> import pyopencl as cl @@ -202,7 +201,6 @@ If you are suspecting that this code is causing you an issue, you can inspect that code, too, using :attr:`loopy.Flags.write_wrapper`: .. doctest:: - :options: +ELLIPSIS >>> evt, (out,) = knl(queue, a=x_vec_host, flags="write_wrapper") from __future__ import division @@ -296,7 +294,6 @@ instruction on the first. Looking at loopy's view of this kernel, we see that these dependencies show up there, too: .. doctest:: - :options: +ELLIPSIS >>> print knl --------------------------------------------------------------------------- @@ -364,7 +361,6 @@ While our requested instruction ordering has been obeyed, something is still not right: .. doctest:: - :options: +ELLIPSIS >>> assert (out.get() == a_mat_dev.get().T*2).all() Traceback (most recent call last): @@ -579,15 +575,24 @@ commonly called 'loop tiling': ... + .. _implementing-inames: Implementing Loop Axes ("Inames") --------------------------------- So far, all the loops we have seen loopy implement were ``for`` loops. Each -iname in loopy carries a so-called 'implementation tag'. Let's split the -main loop of a vector fill kernel into chunks of 4 and unroll the -fixed-length inner loop by setting the inner loop's tag to ``"unr"``: +iname in loopy carries a so-called 'implementation tag'. :ref:`tags` shows +all possible choices for iname implementation tags. The important ones are +explained below. + +Unrolling +~~~~~~~~~ + +Our first example of an 'implementation tag' is ``"unr"``, which performs +loop unrolling. Let us split the main loop of a vector fill kernel into +chunks of 4 and unroll the fixed-length inner loop by setting the inner +loop's tag to ``"unr"``: .. doctest:: @@ -650,6 +655,53 @@ Iname implementation tags are also printed along with the entire kernel: --------------------------------------------------------------------------- ... +Parallelization +~~~~~~~~~~~~~~~ + +Loops are also parallelized in loopy by assigning them parallelizing +implementation tags. In OpenCL, this means that the loop variable +corresponds to either a local ID or a workgroup ID. The implementation tags +for local IDs are ``"l.0"``, ``"l.1"``, ``"l.2"``, and so on. The +corresponding tags for group IDs are ``"g.0"``, ``"g.1"``, ``"g.2"``, and +so on. + +Let's try this out on our vector fill kernel by creating workgroups of size +128: + +.. doctest:: + + >>> knl = lp.make_kernel(ctx.devices[0], + ... "{ [i]: 0<=i<n }", + ... "a[i] = 0", assumptions="n>=0") + >>> knl = lp.split_iname(knl, "i", 128, + ... outer_tag="g.0", inner_tag="l.0") + >>> evt, (out,) = knl(queue, a=x_vec_dev, flags="write_cl") + <BLANKLINE> + ... + __kernel void __attribute__ ((reqd_work_group_size(128, 1, 1))) loopy_kernel(__global float *restrict a, int const n) + { + <BLANKLINE> + if ((-1 + -128 * gid(0) + -1 * lid(0) + n) >= 0) + a[lid(0) + gid(0) * 128] = 0.0f; + } + +Loopy requires that workgroup sizes are fixed and constant at compile time. +By comparison, the overall execution ND-range size (i.e. the number of +workgroups) is allowed to be runtime-variable. + +Note how there was no need to specify group or range sizes. Loopy computes +those for us: + +.. doctest:: + + >>> glob, loc = knl.get_grid_sizes() + >>> print glob + (Aff("[n] -> { [([(127 + n)/128])] }"),) + >>> print loc + (Aff("[n] -> { [(128)] }"),) + +Note that this functionality returns internal objects and is not really +intended for end users. Avoiding Conditionals ~~~~~~~~~~~~~~~~~~~~~ @@ -743,7 +795,6 @@ A static maximum was not found Attempting to create this kernel results in an error: .. doctest:: - :options: +ELLIPSIS >>> lp.make_kernel(ctx.devices[0], ... "{ [i]: 0<=i<n }", @@ -771,7 +822,6 @@ Loopy does not know that non-positive values of *n* make no sense. It needs to be told in order for the error to disappear--note the *assumptions* argument: .. doctest:: - :options: +ELLIPSIS >>> knl = lp.make_kernel(ctx.devices[0], ... "{ [i]: 0<=i<n }", @@ -827,7 +877,6 @@ sign that something is amiss: When we ask to see the code, the issue becomes apparent: .. doctest:: - :options: +ELLIPSIS >>> evt, (out,) = knl(queue, a=a_mat_dev, flags="write_cl") <BLANKLINE> @@ -850,6 +899,6 @@ considers making *a_fetch_0* ``local`` (in the OpenCL memory sense of the word) to make use of parallelism in prefetching, it discovers that a write race across the remaining axis of the workgroup would emerge. -FIXME +TODO .. vim: tw=75:spell