diff --git a/doc/runtime_program.rst b/doc/runtime_program.rst
index e95468782e37f3ec9fc1ab18556dfda371dc3adf..71ee84b4841c457daba248a1be87d2967fa21a08 100644
--- a/doc/runtime_program.rst
+++ b/doc/runtime_program.rst
@@ -212,12 +212,7 @@ Kernel
         :meth:`set_arg` to see what argument types are allowed.
         |std-enqueue-blurb|
 
-        *None* may be passed for local_size.
-
-        If *g_times_l* is specified, the global size will be multiplied by the
-        local size. (which makes the behavior more like Nvidia CUDA) In this case,
-        *global_size* and *local_size* also do not have to have the same number
-        of dimensions.
+        |glsize|
 
         .. note::
 
@@ -287,10 +282,7 @@ Kernel
 
     |std-enqueue-blurb|
 
-    If *g_times_l* is specified, the global size will be multiplied by the
-    local size. (which makes the behavior more like Nvidia CUDA) In this case,
-    *global_size* and *local_size* also do not have to have the same number
-    of dimensions.
+    |glsize|
 
     .. versionchanged:: 2011.1
         Added the *g_times_l* keyword arg.
diff --git a/doc/subst.rst b/doc/subst.rst
index 4210ab24ce99a871aa4cfe318d3eb07049d5a98a..5e7b524b4c927602196fca2027d99a954af68199 100644
--- a/doc/subst.rst
+++ b/doc/subst.rst
@@ -13,3 +13,15 @@
 
 .. |copy-depr| replace:: **Note:** This function is deprecated as of PyOpenCL 2011.1.
         Use :func:`enqueue_copy` instead.
+
+.. |glsize| replace:: *global_size* and *local_size* are tuples of identical length, with
+        between one and three entries. *global_size* specifies the overall size
+        of the computational grid: one work item will be launched for every
+        integer point in the grid. *local_size* specifies the workgroup size,
+        which must evenly divide the *global_size* in a dimension-by-dimension
+        manner. *None* may be passed for *local_size*, in which case the
+        OpenCL implementation will choose the workgroup size.
+        If *g_times_l* is *True*, the global size will be multiplied by the
+        local size, which makes the behavior more like Nvidia CUDA. In this case,
+        *global_size* and *local_size* do not have to have the same number
+        of entries.