From e3f98570890aa0c4bfc016570c6b291645a8d539 Mon Sep 17 00:00:00 2001
From: Andreas Kloeckner <inform@tiker.net>
Date: Thu, 13 Feb 2014 22:27:57 -0600
Subject: [PATCH] Start writing tutorial on temporaries

---
 doc/tutorial.rst | 173 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 169 insertions(+), 4 deletions(-)

diff --git a/doc/tutorial.rst b/doc/tutorial.rst
index 7785a4259..46f1c5ea2 100644
--- a/doc/tutorial.rst
+++ b/doc/tutorial.rst
@@ -10,6 +10,8 @@ unambiguously define all aspects of loopy.
 Preparation
 -----------
 
+.. {{{
+
 :mod:`loopy` currently requires on :mod:`pyopencl` to be installed. We
 import a few modules and set up a :class:`pyopencl.Context` and a
 :class:`pyopencl.CommandQueue`:
@@ -34,6 +36,7 @@ examples:
     >>> n = 16*16
     >>> x_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
     >>> y_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
+    >>> z_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
     >>> a_mat_dev = cl.clrandom.rand(queue, (n, n), dtype=np.float32)
     >>> b_mat_dev = cl.clrandom.rand(queue, (n, n), dtype=np.float32)
 
@@ -44,9 +47,13 @@ And some data on the host:
     >>> x_vec_host = np.random.randn(n).astype(np.float32)
     >>> y_vec_host = np.random.randn(n).astype(np.float32)
 
+.. }}}
+
 Getting started
 ---------------
 
+.. {{{
+
 We'll start by taking a closer look at a very simple kernel that reads in
 one vector, doubles it, and writes it to another.
 
@@ -139,9 +146,13 @@ following:
   as ``(stride:1)``. We will see more on this in
   :ref:`implementing-array-axes`.
 
+.. }}}
+
 Running a kernel
 ----------------
 
+.. {{{
+
 Running the kernel that we've just created is easy. Let's check the result
 for good measure.
 
@@ -246,9 +257,15 @@ call :func:`loopy.generate_code`:
         out[i] = 2.0f * a[i];
     }
 
+.. }}}
+
+.. _ordering:
+
 Ordering
 --------
 
+.. {{{
+
 Next, we'll change our kernel a bit. Our goal will be to transpose a matrix
 and double its entries, and we will do this in two steps for the sake of
 argument:
@@ -328,7 +345,7 @@ a heuristic:
     instructions reading that variable will automatically depend on the
     writing instruction.
 
-The intention of this heuristic is to cover the common case of a
+The intent of this heuristic is to cover the common case of a
 precomputed result being stored and used many times. Generally, these
 dependencies are *in addition* to any manual dependencies added via
 ``{dep=...}``.  It is possible (but rare) that the heuristic adds undesired
@@ -477,11 +494,15 @@ No more warnings! Loop nesting is also reflected in the dependency graph:
 
 .. image:: images/dep-graph-nesting.svg
 
+.. }}}
+
 .. _intro-transformations:
 
 Introduction to Kernel Transformations
 --------------------------------------
 
+.. {{{
+
 What we have covered thus far puts you in a position to describe many kinds
 of computations to loopy--in the sense that loopy will generate code that
 carries out the correct operation. That's nice, but it's natural to also
@@ -581,13 +602,15 @@ commonly called 'loop tiling':
               out[n * (i_inner + i_outer * 16) + j_inner + j_outer * 16] = a[n * (j_inner + j_outer * 16) + i_inner + i_outer * 16];
     ...
 
-
+.. }}}
 
 .. _implementing-inames:
 
 Implementing Loop Axes ("Inames")
 ---------------------------------
 
+.. {{{
+
 So far, all the loops we have seen loopy implement were ``for`` loops. Each
 iname in loopy carries a so-called 'implementation tag'.  :ref:`tags` shows
 all possible choices for iname implementation tags. The important ones are
@@ -783,6 +806,8 @@ enabling some cost savings:
         }
     ...
 
+.. }}}
+
 .. _specifying-arguments:
 
 Specifying arguments
@@ -793,11 +818,146 @@ Specifying arguments
 Implementing Array Axes
 -----------------------
 
+
+Precomputation, Storage, and Temporary Variables
+------------------------------------------------
+
+.. {{{
+
+The loopy kernels we have seen thus far have consisted only of assignments
+from one global-memory storage location to another. Sometimes, computation
+results obviously get reused, so that recomputing them or even just
+re-fetching them from global memory becomes uneconomical. Loopy has
+a number of different ways of adressing this need.
+
+Explicit private temporaries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest of these ways is the creation of an explicit temporary
+variable, as one might do in C or another programming language:
+
+.. doctest::
+
+    >>> knl = lp.make_kernel(
+    ...     "{ [i]: 0<=i<n }",
+    ...     """
+    ...     <float32> a_temp = sin(a[i])
+    ...     out1[i] = a_temp
+    ...     out2[i] = sqrt(1-a_temp*a_temp)
+    ...     """)
+
+The angle brackets ``<>`` denote the creation of a temporary. The name of
+the temporary may not override inames, argument names, or other names in
+the kernel. The name in between the angle brackets is a typename as
+understood by the type registry :mod:`pyopencl.array`. To first order,
+the conventional :mod:`numpy` scalar types (:class:`numpy.int16`,
+:class:`numpy.complex128`) will work. (Yes, :mod:`loopy` supports and
+generates correct code for complex arithmetic.)
+
+The generated code places this variable into what OpenCL calss 'private'
+memory, local to each work item.
+
+.. doctest::
+
+    >>> knl = lp.set_options(knl, "write_cl")
+    >>> evt, (out1, out2) = knl(queue, a=x_vec_dev)
+    <BLANKLINE>
+    ...
+    {
+      float a_temp;
+    <BLANKLINE>
+      for (int i = 0; i <= (-1 + n); ++i)
+      {
+        a_temp = sin(a[i]);
+        out2[i] = sqrt(1.0f + -1.0f * a_temp * a_temp);
+        out1[i] = a_temp;
+      }
+    }
+
+Type inference for temporaries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Most :mod:`loopy` code can be written so as to be type-generic, so
+specifying a type for the variable is optional. The angle brackets alone
+denote that a temporary should be created, and the type of the variable
+will be deduced from the expression being assigned.
+
+.. doctest::
+
+    >>> knl = lp.make_kernel(
+    ...     "{ [i]: 0<=i<n }",
+    ...     """
+    ...     <> a_temp = sin(a[i])
+    ...     out1[i] = a_temp
+    ...     out2[i] = sqrt(1-a_temp*a_temp)
+    ...     """)
+    >>> evt, (out1, out2) = knl(queue, a=x_vec_dev)
+
+Temporaries in local memory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In most situations, :mod:`loopy` will  automatically deduce whether a given
+temporary should be placed into local or private storage. If the variable
+is ever written to in parallel and indexed by expressions containing local
+IDs, then it is marked as residing in local memory. If this heuristic is
+insufficient, :class:`loopy.TemporaryVariable` instances can be marked
+local manually.
+
+Consider the following example:
+
+.. doctest::
+
+    >>> knl = lp.make_kernel(
+    ...     "{ [i_outer,i_inner, k]:  "
+    ...          "0<= 16*i_outer + i_inner <n and 0<= i_inner,k <16}",
+    ...     """
+    ...     <> a_temp[i_inner] = a[16*i_outer + i_inner]
+    ...     out[16*i_outer + i_inner] = sum(k, a_temp[k])
+    ...     """)
+    >>> knl = lp.tag_inames(knl, dict(i_outer="g.0", i_inner="l.0"))
+    >>> knl = lp.set_options(knl, "write_cl")
+    >>> evt, (out,) = knl(queue, a=x_vec_dev)
+    <BLANKLINE>
+    ...
+    {
+      __local float a_temp[16];
+      float acc_k;
+    <BLANKLINE>
+      if ((-1 + -16 * gid(0) + -1 * lid(0) + n) >= 0)
+      {
+        a_temp[lid(0)] = a[16 * gid(0) + lid(0)];
+        acc_k = 0.0f;
+      }
+      barrier(CLK_LOCAL_MEM_FENCE) /* for a_temp (insn_0_k_update depends on insn) */;
+      if ((-1 + -16 * gid(0) + -1 * lid(0) + n) >= 0)
+      {
+        for (int k = 0; k <= 15; ++k)
+          acc_k = acc_k + a_temp[k];
+        out[16 * gid(0) + lid(0)] = acc_k;
+      }
+    }
+
+Observe that ``a_temp`` was automatically placed in local memory, because
+it is written in parallel across values of the group-local iname
+``i_inner``. In addition, :mod:`loopy` has emitted a barrier instruction to
+achieve the :ref:`ordering` specified by the instruction dependencies.
+
+
+.. }}}
+
+Prefetching
+~~~~~~~~~~~
+
+Creating temporaries by transformation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. _more-complicated-programs:
 
 More complicated programs
 -------------------------
 
+.. {{{
+
 SCOP
 
 Data-dependent control flow
@@ -809,10 +969,13 @@ Conditionals
 Snippets of C
 ~~~~~~~~~~~~~
 
+.. }}}
 
 Common Problems
 ---------------
 
+.. {{{
+
 A static maximum was not found
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -834,7 +997,7 @@ Attempting to create this kernel results in an error:
 
 The problem is that loopy cannot find a simple, universally valid expression
 for the length of *out* in this case. Notice how the kernel accesses both the
-*i*th and the first element of out.  The set notation at the end of the error
+*i*-th and the first element of out.  The set notation at the end of the error
 message summarizes its best attempt:
 
 * If n=1, then out has size 1.
@@ -926,4 +1089,6 @@ across the remaining axis of the workgroup would emerge.
 
 TODO
 
-.. vim: tw=75:spell
+.. }}}
+
+.. vim: tw=75:spell:foldmethod=marker
-- 
GitLab