diff --git a/doc/source/algorithm.rst b/doc/source/algorithm.rst
new file mode 100644
index 0000000000000000000000000000000000000000..94758f7129d7e302e64e6251c645f99b791ed95a
--- /dev/null
+++ b/doc/source/algorithm.rst
@@ -0,0 +1,281 @@
+Parallel Algorithms
+===================
+
+Element-wise expression evalution ("map")
+-----------------------------------------
+
+.. module:: pyopencl.elementwise
+
+Evaluating involved expressions on :class:`pyopencl.array.Array` instances by
+using overloaded operators can be somewhat inefficient, because a new temporary
+is created for each intermediate result. The functionality in the module
+:mod:`pyopencl.elementwise` contains tools to help generate kernels that
+evaluate multi-stage expressions on one or several operands in a single pass.
+
+.. class:: ElementwiseKernel(context, arguments, operation, name="kernel", preamble="", options=[])
+
+    Generate a kernel that takes a number of scalar or vector *arguments*
+    and performs the scalar *operation* on each entry of its arguments, if that
+    argument is a vector.
+
+    *arguments* is specified as a string formatted as a C argument list.
+    *operation* is specified as a C assignment statement, without a semicolon.
+    Vectors in *operation* should be indexed by the variable *i*.
+
+    *name* specifies the name as which the kernel is compiled, 
+    and *options* are passed unmodified to :meth:`pyopencl.Program.build`.
+
+    *preamble* is a piece of C source code that gets inserted outside of the
+    function context in the elementwise operation's kernel source code.
+
+    .. method:: __call__(*args, wait_for=None)
+
+        Invoke the generated scalar kernel. The arguments may either be scalars or
+        :class:`GPUArray` instances.
+
+Here's a usage example::
+
+    import pyopencl as cl
+    import pyopencl.array as cl_array
+    import numpy
+
+    ctx = cl.create_some_context()
+    queue = cl.CommandQueue(ctx)
+
+    n = 10
+    a_gpu = cl_array.to_device(
+            ctx, queue, numpy.random.randn(n).astype(numpy.float32))
+    b_gpu = cl_array.to_device(
+            ctx, queue, numpy.random.randn(n).astype(numpy.float32))
+
+    from pyopencl.elementwise import ElementwiseKernel
+    lin_comb = ElementwiseKernel(ctx,
+            "float a, float *x, "
+            "float b, float *y, "
+            "float *z",
+            "z[i] = a*x[i] + b*y[i]",
+            "linear_combination")
+
+    c_gpu = cl_array.empty_like(a_gpu)
+    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
+
+    import numpy.linalg as la
+    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
+
+(You can find this example as :file:`examples/demo_elementwise.py` in the PyOpenCL
+distribution.)
+
+.. _custom-reductions:
+
+Sums and counts ("reduce")
+--------------------------
+
+.. module:: pyopencl.reduction
+
+.. class:: ReductionKernel(ctx, dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", options=[], preamble="")
+
+    Generate a kernel that takes a number of scalar or vector *arguments*
+    (at least one vector argument), performs the *map_expr* on each entry of
+    the vector argument and then the *reduce_expr* on the outcome of that.
+    *neutral* serves as an initial value. *preamble* offers the possibility
+    to add preprocessor directives and other code (such as helper functions)
+    to be added before the actual reduction kernel code.
+
+    Vectors in *map_expr* should be indexed by the variable *i*. *reduce_expr*
+    uses the formal values "a" and "b" to indicate two operands of a binary
+    reduction operation. If you do not specify a *map_expr*, "in[i]" -- and
+    therefore the presence of only one input argument -- is automatically
+    assumed.
+
+    *dtype_out* specifies the :class:`numpy.dtype` in which the reduction is
+    performed and in which the result is returned. *neutral* is specified as
+    float or integer formatted as string. *reduce_expr* and *map_expr* are
+    specified as string formatted operations and *arguments* is specified as a
+    string formatted as a C argument list. *name* specifies the name as which
+    the kernel is compiled. *options* are passed unmodified to
+    :meth:`pyopencl.Program.build`. *preamble* specifies a string of code that
+    is inserted before the actual kernels.
+
+    .. method:: __call__(*args, queue=None)
+
+    .. versionadded: 2011.1
+
+Here's a usage example::
+
+    a = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
+    b = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
+
+    krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
+            reduce_expr="a+b", map_expr="x[i]*y[i]",
+            arguments="__global float *x, __global float *y")
+
+    my_dot_prod = krnl(a, b).get()
+
+.. _custom-scan:
+
+Prefix Sums ("scan")
+--------------------
+
+.. module:: pyopencl.scan
+
+.. |scan_extra_args| replace:: a list of tuples *(name, value)* specifying
+    extra arguments to pass to the scan procedure. *value* must be :mod:`numpy`
+    sized type.
+.. |preamble| replace:: A snippet of C that is inserted into the compiled kernel
+    before the actual kernel function. May be used for, e.g. type definitions
+    or include statements.
+
+A prefix sum is a running sum of an array, as provided by
+e.g. :mod:`numpy.cumsum`::
+
+    >>> import numpy as np
+    >>> a = [1,1,1,1,1,2,2,2,2,2]
+    >>> np.cumsum(a)
+    array([ 1,  2,  3,  4,  5,  7,  9, 11, 13, 15])
+
+This is a very simple example of what a scan can do. It turns out that scans
+are significantly more versatile. They are a basic building block of many
+non-trivial parallel algorithms. Many of the operations enabled by scans seem
+difficult to parallelize because of loop-carried dependencies.
+
+.. seealso::
+
+    `Prefix sums and their applications <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.6230>`_, by Guy Blelloch.
+        This article gives an overview of some surprising applications of scans.
+
+    :ref:`predefined-scans`
+        These operations built into PyOpenCL are realized using :class:`GenericScanKernel`.
+
+Usage Example
+^^^^^^^^^^^^^
+
+This example illustrates the implementation of a simplified version of :func:`copy_if`,
+which copies integers from an array into the (variable-size) output if they are
+greater than 300::
+
+    knl = GenericScanKernel(
+            ctx, np.int32,
+            arguments="__global int *ary, __global int *out",
+            input_expr="(ary[i] > 300) ? 1 : 0",
+            scan_expr="a+b", neutral="0",
+            output_statement="""
+                if (prev_item != item) out[item-1] = ary[i];
+                """)
+
+    out = a.copy()
+    knl(a, out)
+
+    a_host = a.get()
+    out_host = a_host[a_host > 300]
+
+    assert (out_host == out.get()[:len(out_host)]).all()
+
+The value being scanned over is a number of flags indicating whether each array
+element is greater than 300. These flags are computed by *input_expr*. The
+prefix sum over this array gives a running count of array items greater than
+300. The *output_statement* the compares `prev_item` (the previous item's scan
+result, i.e. index) to `item` (the current item's scan result, i.e.
+index). If they differ, i.e. if the predicate was satisfied at this
+position, then the item is stored in the output at the computed index.
+
+This example does not make use of the following advanced features also available
+in PyOpenCL:
+
+* Segmented scans
+
+* Access to the previous item in *input_expr* (e.g. for comparisons)
+  See the `implementation <https://github.com/inducer/pyopencl/blob/master/pyopencl/scan.py#L1353>`_ of :func:`unique` for an example.
+
+Making Custom Scan Kernels
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. versionadded: 2012.2
+
+.. autoclass:: GenericScanKernel
+
+    .. method:: __call__(*args, allocator=None, queue=None)
+
+        *queue* and *allocator* default to the ones provided on the first
+        :class:`pyopencl.array.Array` in *args*.
+
+Debugging aids
+~~~~~~~~~~~~~~
+
+.. class:: GenericDebugScanKernel
+
+    Performs the same function and has the same interface as
+    :class:`GenericScanKernel`, but uses a dead-simple, sequential scan.  Works
+    best on CPU platforms, and helps isolate bugs in scans by removing the
+    potential for issues originating in parallel execution.
+
+.. _predefined-scans:
+
+Simple / Legacy Interface
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. class:: ExclusiveScanKernel(ctx, dtype, scan_expr, neutral, name_prefix="scan", options=[], preamble="", devices=None)
+
+    Generates a kernel that can compute a `prefix sum <https://secure.wikimedia.org/wikipedia/en/wiki/Prefix_sum>`_
+    using any associative operation given as *scan_expr*.
+    *scan_expr* uses the formal values "a" and "b" to indicate two operands of
+    an associative binary operation. *neutral* is the neutral element
+    of *scan_expr*, obeying *scan_expr(a, neutral) == a*.
+
+    *dtype* specifies the type of the arrays being operated on. 
+    *name_prefix* is used for kernel names to ensure recognizability
+    in profiles and logs. *options* is a list of compiler options to use
+    when building. *preamble* specifies a string of code that is
+    inserted before the actual kernels. *devices* may be used to restrict
+    the set of devices on which the kernel is meant to run. (defaults
+    to all devices in the context *ctx*.
+
+    .. method:: __call__(self, input_ary, output_ary=None, allocator=None, queue=None)
+
+.. class:: InclusiveScanKernel(dtype, scan_expr, neutral=None, name_prefix="scan", options=[], preamble="", devices=None)
+
+    Works like :class:`ExclusiveScanKernel`.
+
+    .. versionchanged:: 2012.2
+        *neutral* is now always required.
+
+For the array `[1,2,3]`, inclusive scan results in `[1,3,6]`, and exclusive
+scan results in `[0,1,3]`.
+
+Here's a usage example::
+
+    knl = InclusiveScanKernel(context, np.int32, "a+b")
+
+    n = 2**20-2**18+5
+    host_data = np.random.randint(0, 10, n).astype(np.int32)
+    dev_data = cl_array.to_device(queue, host_data)
+
+    knl(dev_data)
+    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
+
+Predicated copies ("partition", "unique", ...)
+----------------------------------------------
+
+.. module:: pyopencl.algorithm
+
+.. autofunction:: copy_if
+
+.. autofunction:: remove_if
+
+.. autofunction:: partition
+
+.. autofunction:: unique
+
+Sorting (radix sort)
+--------------------
+
+.. autoclass:: RadixSort
+
+    .. automethod:: __call__
+
+Building many variable-size lists
+---------------------------------
+
+.. autoclass:: ListOfListsBuilder
+
+    .. automethod:: __call__
+
diff --git a/doc/source/array.rst b/doc/source/array.rst
index 752a1b8f22f0958d7ba929789c5b16dcac1612a0..6a4eda2e04dfd2990fa38837508b0e98fc68ccd0 100644
--- a/doc/source/array.rst
+++ b/doc/source/array.rst
@@ -552,278 +552,6 @@ generator for scientific purposes, please consider citing them:
   number generator of LÃ¼scher, `Computer Physics Communications 79 (1994) 111-114
   <http://dx.doi.org/10.1016/0010-4655(94)90233-X>`_
 
-
-Single-pass Custom Expression Evaluation
-----------------------------------------
-
-.. module:: pyopencl.elementwise
-
-Evaluating involved expressions on :class:`pyopencl.array.Array` instances can be
-somewhat inefficient, because a new temporary is created for each
-intermediate result. The functionality in the module :mod:`pyopencl.elementwise`
-contains tools to help generate kernels that evaluate multi-stage expressions
-on one or several operands in a single pass.
-
-.. class:: ElementwiseKernel(context, arguments, operation, name="kernel", preamble="", options=[])
-
-    Generate a kernel that takes a number of scalar or vector *arguments*
-    and performs the scalar *operation* on each entry of its arguments, if that
-    argument is a vector.
-
-    *arguments* is specified as a string formatted as a C argument list.
-    *operation* is specified as a C assignment statement, without a semicolon.
-    Vectors in *operation* should be indexed by the variable *i*.
-
-    *name* specifies the name as which the kernel is compiled, 
-    and *options* are passed unmodified to :meth:`pyopencl.Program.build`.
-
-    *preamble* is a piece of C source code that gets inserted outside of the
-    function context in the elementwise operation's kernel source code.
-
-    .. method:: __call__(*args, wait_for=None)
-
-        Invoke the generated scalar kernel. The arguments may either be scalars or
-        :class:`GPUArray` instances.
-
-Here's a usage example::
-
-    import pyopencl as cl
-    import pyopencl.array as cl_array
-    import numpy
-
-    ctx = cl.create_some_context()
-    queue = cl.CommandQueue(ctx)
-
-    n = 10
-    a_gpu = cl_array.to_device(
-            ctx, queue, numpy.random.randn(n).astype(numpy.float32))
-    b_gpu = cl_array.to_device(
-            ctx, queue, numpy.random.randn(n).astype(numpy.float32))
-
-    from pyopencl.elementwise import ElementwiseKernel
-    lin_comb = ElementwiseKernel(ctx,
-            "float a, float *x, "
-            "float b, float *y, "
-            "float *z",
-            "z[i] = a*x[i] + b*y[i]",
-            "linear_combination")
-
-    c_gpu = cl_array.empty_like(a_gpu)
-    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
-
-    import numpy.linalg as la
-    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
-
-(You can find this example as :file:`examples/demo_elementwise.py` in the PyOpenCL
-distribution.)
-
-.. _custom-reductions:
-
-Custom Reductions
------------------
-
-.. module:: pyopencl.reduction
-
-.. class:: ReductionKernel(ctx, dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", options=[], preamble="")
-
-    Generate a kernel that takes a number of scalar or vector *arguments*
-    (at least one vector argument), performs the *map_expr* on each entry of
-    the vector argument and then the *reduce_expr* on the outcome of that.
-    *neutral* serves as an initial value. *preamble* offers the possibility
-    to add preprocessor directives and other code (such as helper functions)
-    to be added before the actual reduction kernel code.
-
-    Vectors in *map_expr* should be indexed by the variable *i*. *reduce_expr*
-    uses the formal values "a" and "b" to indicate two operands of a binary
-    reduction operation. If you do not specify a *map_expr*, "in[i]" -- and
-    therefore the presence of only one input argument -- is automatically
-    assumed.
-
-    *dtype_out* specifies the :class:`numpy.dtype` in which the reduction is
-    performed and in which the result is returned. *neutral* is specified as
-    float or integer formatted as string. *reduce_expr* and *map_expr* are
-    specified as string formatted operations and *arguments* is specified as a
-    string formatted as a C argument list. *name* specifies the name as which
-    the kernel is compiled. *options* are passed unmodified to
-    :meth:`pyopencl.Program.build`. *preamble* specifies a string of code that
-    is inserted before the actual kernels.
-
-    .. method:: __call__(*args, queue=None)
-
-    .. versionadded: 2011.1
-
-Here's a usage example::
-
-    a = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
-    b = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
-
-    krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
-            reduce_expr="a+b", map_expr="x[i]*y[i]",
-            arguments="__global float *x, __global float *y")
-
-    my_dot_prod = krnl(a, b).get()
-
-.. _custom-scan:
-
-Parallel Scan / Prefix Sum
---------------------------
-
-.. module:: pyopencl.scan
-
-.. |scan_extra_args| replace:: a list of tuples *(name, value)* specifying
-    extra arguments to pass to the scan procedure. *value* must be :mod:`numpy`
-    sized type.
-.. |preamble| replace:: A snippet of C that is inserted into the compiled kernel
-    before the actual kernel function. May be used for, e.g. type definitions
-    or include statements.
-
-A prefix sum is a running sum of an array, as provided by
-e.g. :mod:`numpy.cumsum`::
-
-    >>> import numpy as np
-    >>> a = [1,1,1,1,1,2,2,2,2,2]
-    >>> np.cumsum(a)
-    array([ 1,  2,  3,  4,  5,  7,  9, 11, 13, 15])
-
-This is a very simple example of what a scan can do. It turns out that scans
-are significantly more versatile. They are a basic building block of many
-non-trivial parallel algorithms. Many of the operations enabled by scans seem
-difficult to parallelize because of loop-carried dependencies.
-
-.. seealso::
-
-    `Prefix sums and their applications <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.6230>`_, by Guy Blelloch.
-        This article gives an overview of some surprising applications of scans.
-
-    :ref:`predefined-scans`
-        These operations built into PyOpenCL are realized using :class:`GenericScanKernel`.
-
-Usage Example
-^^^^^^^^^^^^^
-
-This example illustrates the implementation of a simplified version of :func:`copy_if`,
-which copies integers from an array into the (variable-size) output if they are
-greater than 300::
-
-    knl = GenericScanKernel(
-            ctx, np.int32,
-            arguments="__global int *ary, __global int *out",
-            input_expr="(ary[i] > 300) ? 1 : 0",
-            scan_expr="a+b", neutral="0",
-            output_statement="""
-                if (prev_item != item) out[item-1] = ary[i];
-                """)
-
-    out = a.copy()
-    knl(a, out)
-
-    a_host = a.get()
-    out_host = a_host[a_host > 300]
-
-    assert (out_host == out.get()[:len(out_host)]).all()
-
-The value being scanned over is a number of flags indicating whether each array
-element is greater than 300. These flags are computed by *input_expr*. The
-prefix sum over this array gives a running count of array items greater than
-300. The *output_statement* the compares `prev_item` (the previous item's scan
-result, i.e. index) to `item` (the current item's scan result, i.e.
-index). If they differ, i.e. if the predicate was satisfied at this
-position, then the item is stored in the output at the computed index.
-
-This example does not make use of the following advanced features also available
-in PyOpenCL:
-
-* Segmented scans
-
-* Access to the previous item in *input_expr* (e.g. for comparisons)
-  See the `implementation <https://github.com/inducer/pyopencl/blob/master/pyopencl/scan.py#L1353>`_ of :func:`unique` for an example.
-
-Making Custom Scan Kernels
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. versionadded: 2012.2
-
-.. autoclass:: GenericScanKernel
-
-    .. method:: __call__(*args, allocator=None, queue=None)
-
-        *queue* and *allocator* default to the ones provided on the first
-        :class:`pyopencl.array.Array` in *args*.
-
-Debugging aids
-~~~~~~~~~~~~~~
-
-.. class:: GenericDebugScanKernel
-
-    Performs the same function and has the same interface as
-    :class:`GenericScanKernel`, but uses a dead-simple, sequential scan.  Works
-    best on CPU platforms, and helps isolate bugs in scans by removing the
-    potential for issues originating in parallel execution.
-
-.. _predefined-scans:
-
-Simple / Legacy Interface
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. class:: ExclusiveScanKernel(ctx, dtype, scan_expr, neutral, name_prefix="scan", options=[], preamble="", devices=None)
-
-    Generates a kernel that can compute a `prefix sum <https://secure.wikimedia.org/wikipedia/en/wiki/Prefix_sum>`_
-    using any associative operation given as *scan_expr*.
-    *scan_expr* uses the formal values "a" and "b" to indicate two operands of
-    an associative binary operation. *neutral* is the neutral element
-    of *scan_expr*, obeying *scan_expr(a, neutral) == a*.
-
-    *dtype* specifies the type of the arrays being operated on. 
-    *name_prefix* is used for kernel names to ensure recognizability
-    in profiles and logs. *options* is a list of compiler options to use
-    when building. *preamble* specifies a string of code that is
-    inserted before the actual kernels. *devices* may be used to restrict
-    the set of devices on which the kernel is meant to run. (defaults
-    to all devices in the context *ctx*.
-
-    .. method:: __call__(self, input_ary, output_ary=None, allocator=None, queue=None)
-
-.. class:: InclusiveScanKernel(dtype, scan_expr, neutral=None, name_prefix="scan", options=[], preamble="", devices=None)
-
-    Works like :class:`ExclusiveScanKernel`.
-
-    .. versionchanged:: 2012.2
-        *neutral* is now always required.
-
-For the array `[1,2,3]`, inclusive scan results in `[1,3,6]`, and exclusive
-scan results in `[0,1,3]`.
-
-Here's a usage example::
-
-    knl = InclusiveScanKernel(context, np.int32, "a+b")
-
-    n = 2**20-2**18+5
-    host_data = np.random.randint(0, 10, n).astype(np.int32)
-    dev_data = cl_array.to_device(queue, host_data)
-
-    knl(dev_data)
-    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
-
-Higher-level algorithms
------------------------
-
-.. module:: pyopencl.algorithm
-
-.. autofunction:: copy_if
-
-.. autofunction:: remove_if
-
-.. autofunction:: partition
-
-.. autofunction:: unique
-
-Sorting
-^^^^^^^
-
-.. autoclass:: RadixSort
-
-    .. automethod:: __call__
-
 Fast Fourier Transforms
 -----------------------
 
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 387bd7cd0be922e4b07a8548a56cce1abe133db1..607b9da53cf074ae51af88e40f66c8a722f20f01 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -74,6 +74,7 @@ is on the web, thanks to Ian Johnson.
 
     runtime
     array
+    algorithm
     tools
     misc