diff --git a/doc/tutorial.rst b/doc/tutorial.rst index 7920c542e31a0469f17ac9e4151ab91da9915bbd..ef69f16491f2a116fd0729191832428b7e8b4611 100644 --- a/doc/tutorial.rst +++ b/doc/tutorial.rst @@ -1186,11 +1186,15 @@ TODO Gathering kernel statistics --------------------------- -Operations, array access, and barriers can all be counted, which may facilitate performance prediction and optimization of a :mod:`loopy` kernel. +Operations, array access, and barriers can all be counted, which may +facilitate performance prediction and optimization of a :mod:`loopy` +kernel. .. note:: - The functions used in the following examples may produce warnings. If you have already made the filterwarnings and catch_warnings calls used in the examples above, you may need to reset these before continuing: + The functions used in the following examples may produce warnings. If + you have already made the filterwarnings and catch_warnings calls used + in the examples above, you may need to reset these before continuing: .. doctest:: @@ -1200,7 +1204,10 @@ Operations, array access, and barriers can all be counted, which may facilitate Counting operations ~~~~~~~~~~~~~~~~~~~ -:func:`loopy.get_op_poly` provides information on the number and type of operations being performed in a kernel. To demonstrate this, we'll create an example kernel that performs several operations on arrays containing different types of data: +:func:`loopy.get_op_poly` provides information on the number and type of +operations being performed in a kernel. To demonstrate this, we'll create +an example kernel that performs several operations on arrays containing +different types of data: .. doctest:: @@ -1210,16 +1217,21 @@ Counting operations ... c[i, j, k] = a[i,j,k]*b[i,j,k]/3.0+a[i,j,k] ... e[i, k] = g[i,k]*(2+h[i,k+1]) ... """) - >>> knl = lp.add_and_infer_dtypes(knl, - dict(a=np.float32, b=np.float32, g=np.float64, h=np.float64)) + >>> knl = lp.add_and_infer_dtypes(knl, dict(a=np.float32, b=np.float32, g=np.float64, h=np.float64)) -Note that loopy will infer the data types for arrays c and e from the information provided. Now we will count the operations: +Note that loopy will infer the data types for arrays c and e from the +information provided. Now we will count the operations: .. doctest:: + >>> from loopy.statistics import get_op_poly >>> op_map = get_op_poly(knl) -:func:`loopy.get_op_poly` returns a mapping of **{** :class:`numpy.dtype` **:** :class:`islpy.PwQPolynomial` **}**. The :class:`islpy.PwQPolynomial` holds the number of operations for the :class:`numpy.dtype` specified in the key (in terms of the :class:`loopy.LoopKernel` *inames*). We'll print this map now: +:func:`loopy.get_op_poly` returns a mapping of **{** :class:`numpy.dtype` +**:** :class:`islpy.PwQPolynomial` **}**. The :class:`islpy.PwQPolynomial` +holds the number of operations for the :class:`numpy.dtype` specified in +the key (in terms of the :class:`loopy.LoopKernel` *inames*). We'll print +this map now: .. doctest:: @@ -1227,6 +1239,7 @@ Note that loopy will infer the data types for arrays c and e from the informatio float64 : [n, m, l] -> { 2 * n * m : n >= 1 and m >= 1 and l >= 1 } int32 : [n, m, l] -> { n * m : n >= 1 and m >= 1 and l >= 1 } float32 : [n, m, l] -> { 3 * n * m * l : n >= 1 and m >= 1 and l >= 1 } + <BLANKLINE> We can evaluate these polynomials using :func:`islpy.eval_with_dict`: @@ -1244,7 +1257,9 @@ We can evaluate these polynomials using :func:`islpy.eval_with_dict`: Counting array accesses ~~~~~~~~~~~~~~~~~~~~~~~ -:func:`loopy.get_DRAM_access_poly` provides information on the number and type of array loads and stores being performed in a kernel. To demonstrate this, we'll continue using the kernel from the previous example: +:func:`loopy.get_DRAM_access_poly` provides information on the number and +type of array loads and stores being performed in a kernel. To demonstrate +this, we'll continue using the kernel from the previous example: .. doctest:: @@ -1255,16 +1270,27 @@ Counting array accesses (dtype('float64'), 'uniform', 'load') : [n, m, l] -> { 2 * n * m : n >= 1 and m >= 1 and l >= 1 } (dtype('float64'), 'uniform', 'store') : [n, m, l] -> { n * m : n >= 1 and m >= 1 and l >= 1 } (dtype('float32'), 'uniform', 'load') : [n, m, l] -> { 3 * n * m * l : n >= 1 and m >= 1 and l >= 1 } + <BLANKLINE> -:func:`loopy.get_DRAM_access_poly` returns a mapping of **{(** :class:`numpy.dtype` **,** :class:`string` **,** :class:`string` **)** **:** :class:`islpy.PwQPolynomial` **}**. +:func:`loopy.get_DRAM_access_poly` returns a mapping of **{(** +:class:`numpy.dtype` **,** :class:`string` **,** :class:`string` **)** +**:** :class:`islpy.PwQPolynomial` **}**. - The :class:`numpy.dtype` specifies the type of the data being accessed. -- The first string in the map key specifies the DRAM access type as *consecutive*, *nonconsecutive*, or *uniform*. *Consecutive* memory accesses occur when consecutive threads access consecutive array elements in memory, *nonconsecutive* accesses occur when consecutive threads access nonconsecutive array elements in memory, and *uniform* accesses occur when consecutive threads access the *same* element in memory. +- The first string in the map key specifies the DRAM access type as + *consecutive*, *nonconsecutive*, or *uniform*. *Consecutive* memory + accesses occur when consecutive threads access consecutive array elements + in memory, *nonconsecutive* accesses occur when consecutive threads + access nonconsecutive array elements in memory, and *uniform* accesses + occur when consecutive threads access the *same* element in memory. -- The second string in the map key specifies the DRAM access type as a *load*, or a *store*. +- The second string in the map key specifies the DRAM access type as a + *load*, or a *store*. -- The :class:`islpy.PwQPolynomial` holds the number of DRAM accesses with the characteristics specified in the key (in terms of the :class:`loopy.LoopKernel` *inames*). +- The :class:`islpy.PwQPolynomial` holds the number of DRAM accesses with + the characteristics specified in the key (in terms of the + :class:`loopy.LoopKernel` *inames*). We can evaluate these polynomials using :func:`islpy.eval_with_dict`: @@ -1282,7 +1308,12 @@ We can evaluate these polynomials using :func:`islpy.eval_with_dict`: ~~~~~~~~~~~ -Since we have not tagged any of the inames or parallelized the kernel across threads (which would have produced iname tags), :func:`loopy.get_DRAM_access_poly` considers the array accesses *uniform*. Now we'll parallelize the kernel and count the array accesses again. The resulting :class:`islpy.PwQPolynomial` will be more complicated this time, so we'll print the mapping manually to make it more legible: +Since we have not tagged any of the inames or parallelized the kernel +across threads (which would have produced iname tags), +:func:`loopy.get_DRAM_access_poly` considers the array accesses *uniform*. +Now we'll parallelize the kernel and count the array accesses again. The +resulting :class:`islpy.PwQPolynomial` will be more complicated this time, +so we'll print the mapping manually to make it more legible: .. doctest:: @@ -1292,17 +1323,21 @@ Since we have not tagged any of the inames or parallelized the kernel across thr ... print("%s :\n%s\n" % (key, load_store_map.dict[key])) (dtype('float32'), 'consecutive', 'load') : [n, m, l] -> { (3 * n * m * l * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (384 * n * l * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float64'), 'consecutive', 'store') : [n, m, l] -> { (n * m * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (128 * n * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float64'), 'consecutive', 'load') : [n, m, l] -> { (2 * n * m * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (256 * n * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float32'), 'consecutive', 'store') : [n, m, l] -> { (n * m * l * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (128 * n * l * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } + <BLANKLINE> -With this parallelization, consecutive threads will access consecutive array elements in memory. The polynomials are a bit more complicated now due to the parallelization, but when we evaluate them, we see that the total number of array accesses has not changed: +With this parallelization, consecutive threads will access consecutive +array elements in memory. The polynomials are a bit more complicated now +due to the parallelization, but when we evaluate them, we see that the +total number of array accesses has not changed: .. doctest:: @@ -1318,7 +1353,8 @@ With this parallelization, consecutive threads will access consecutive array ele ~~~~~~~~~~~ -To produce *nonconsecutive* array accesses, we'll switch the inner and outer tags in our parallelization of the kernel: +To produce *nonconsecutive* array accesses, we'll switch the inner and +outer tags in our parallelization of the kernel: .. doctest:: @@ -1328,17 +1364,20 @@ To produce *nonconsecutive* array accesses, we'll switch the inner and outer tag ... print("%s :\n%s\n" % (key, load_store_map.dict[key])) (dtype('float32'), 'nonconsecutive', 'store') : [n, m, l] -> { (n * m * l * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (128 * n * l * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float64'), 'nonconsecutive', 'load') : [n, m, l] -> { (2 * n * m * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (256 * n * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float64'), 'nonconsecutive', 'store') : [n, m, l] -> { (n * m * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (128 * n * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } - + <BLANKLINE> (dtype('float32'), 'nonconsecutive', 'load') : [n, m, l] -> { (3 * n * m * l * floor((127 + m)/128)) : n >= 1 and m <= 127 and m >= 1 and l >= 1; (384 * n * l * floor((127 + m)/128)) : n >= 1 and m >= 128 and l >= 1 } + <BLANKLINE> -With this parallelization, consecutive threads will access *nonconsecutive* array elements in memory. The total number of array accesses has not changed: +With this parallelization, consecutive threads will access *nonconsecutive* +array elements in memory. The total number of array accesses has not +changed: .. doctest:: @@ -1355,7 +1394,8 @@ With this parallelization, consecutive threads will access *nonconsecutive* arra Counting barriers ~~~~~~~~~~~~~~~~~ -:func:`loopy.get_barrier_poly` counts the number of barriers in a kernel. First, we'll call this function on the kernel from the previous example: +:func:`loopy.get_barrier_poly` counts the number of barriers in a kernel. +First, we'll call this function on the kernel from the previous example: .. doctest:: @@ -1393,22 +1433,26 @@ Now to make things more interesting, we'll create a kernel with barriers: >>> print(code) #define lid(N) ((int) get_local_id(N)) #define gid(N) ((int) get_group_id(N)) - + <BLANKLINE> __kernel void __attribute__ ((reqd_work_group_size(97, 1, 1))) loopy_kernel(__global int const *restrict a, __global int *restrict e) { __local int c[50 * 10 * 99]; - + <BLANKLINE> for (int i = 0; i <= 49; ++i) for (int j = 0; j <= 9; ++j) { - barrier(CLK_LOCAL_MEM_FENCE) /* for c (first rev-depends on insn) */; + barrier(CLK_LOCAL_MEM_FENCE) /* for c (insn rev-depends on insn_0) */; c[990 * i + 99 * j + lid(0) + 1 + gid(0) * 128] = 2 * a[980 * i + 98 * j + lid(0) + 1 + gid(0) * 128]; - barrier(CLK_LOCAL_MEM_FENCE) /* for c (insn depends on first) */; + barrier(CLK_LOCAL_MEM_FENCE) /* for c (insn_0 depends on insn) */; e[980 * i + 98 * j + lid(0) + 1 + gid(0) * 128] = c[990 * i + 99 * j + 1 + lid(0) + 1 + gid(0) * 128] + c[990 * i + 99 * j + -1 + lid(0) + 1 + gid(0) * 128]; } } -In this kernel, when a thread performs the second instruction it uses data produced by *different* threads during the first instruction. For correct execution barriers are required, so loopy inserts them. Now we'll count the barriers using :func:`loopy.get_barrier_poly`: + +In this kernel, when a thread performs the second instruction it uses data +produced by *different* threads during the first instruction. For correct +execution barriers are required, so loopy inserts them. Now we'll count the +barriers using :func:`loopy.get_barrier_poly`: .. doctest:: @@ -1418,7 +1462,10 @@ In this kernel, when a thread performs the second instruction it uses data produ Barrier polynomial: { 1000 } Barrier count: 1000 -Based on the kernel code printed above, we would expect to find 50x10x2 barriers, and we do. In this case, the number of barriers does not depend on any inames, so we can pass an empty dictionary to :func:`islpy.eval_with_dict`. +Based on the kernel code printed above, we would expect to find 50x10x2 +barriers, and we do. In this case, the number of barriers does not depend +on any inames, so we can pass an empty dictionary to +:func:`islpy.eval_with_dict`. .. }}} diff --git a/loopy/compiled.py b/loopy/compiled.py index 5a522ee0bce01ce6b6c58bdb4025811d601aea73..d8d127c0bc89f439569b8a016e485054c05f2bc2 100644 --- a/loopy/compiled.py +++ b/loopy/compiled.py @@ -786,11 +786,10 @@ class _CLKernelInfo(Record): class CompiledKernel: def __init__(self, context, kernel): """ - :arg kernel: may be a loopy.LoopKernel, a generator returning kernels \ - (a warning will be issued if more than one is returned). If the \ - kernel has not yet been loop-scheduled, that is done, too, with no \ + :arg kernel: may be a loopy.LoopKernel, a generator returning kernels + (a warning will be issued if more than one is returned). If the + kernel has not yet been loop-scheduled, that is done, too, with no specific arguments. - """ self.context = context