diff --git a/doc/misc.rst b/doc/misc.rst
index cd52ae1b0eaaba1fd576491392841a9c48d4ebdf..c339280f58139fd5cfaa1589e4a47b58ea7713bc 100644
--- a/doc/misc.rst
+++ b/doc/misc.rst
@@ -83,8 +83,204 @@ OTHER DEALINGS IN THE SOFTWARE.
 Frequently Asked Questions
 ==========================
 
-The FAQ is maintained collaboratively on the
-`Wiki FAQ page <http://wiki.tiker.net/Loopy/FrequentlyAskedQuestions>`_.
+Is Loopy specific to OpenCL?
+----------------------------
+
+No, absolutely not. You can switch to a different code generation target
+(subclasses of :class:`loopy.TargetBase`) by using (say)::
+
+    knl = knl.copy(target=loopy.CudaTarget())
+
+Also see :ref:`targets`. (Py)OpenCL right now has the best support for
+running kernels directly out of the box, but that could easily be expanded.
+Open an issue to discuss what you need.
+
+In the meantime, you can generate code simply by saying::
+
+    cg_result = loopy.generate_code_v2(knl)
+    print(cg_result.host_code())
+    print(cg_result.device_code())
+
+For what types of codes does :mod:`loopy` work well?
+----------------------------------------------------
+
+Any array-based/number-crunching code whose control flow is not *too*
+data dependent should be expressible. For example:
+
+* Sparse matrix-vector multiplies, despite data-dependent control
+  flow (varying row lengths, say), are easy and natural to express.
+
+* Looping until convergence on the other hand is an example
+  of something that can't be expressed easily. Such checks
+  would have to be performed outside of :mod:`loopy` code.
+
+Can I see some examples?
+------------------------
+
+Loopy has a ton of tests, and right now, those are probably the best
+source of examples. Here are some links:
+
+* `Tests directory <https://github.com/inducer/loopy/tree/master/test>`_
+* `Applications tests <https://github.com/inducer/loopy/blob/master/test/test_apps.py>`_
+* `Feature tests <https://github.com/inducer/loopy/blob/master/test/test_loopy.py>`_
+
+Here's a more complicated example of Loopy code:
+
+.. literalinclude:: ../examples/python/find-centers.py
+    :language: python
+
+This example is included in the :mod:`loopy` distribution as
+:download:`examples/python/find-centers.py <../examples/python/find-centers.py>`.
+What this does is find nearby "centers" satisfying some criteria
+for an array of points ("targets").
+
+What types of transformations can I do?
+---------------------------------------
+
+This list is always growing, but here are a few pointers:
+
+* Unroll
+
+  Use :func:`loopy.tag_inames` with the ``"unr"`` tag.
+  Unrolled loops must have a fixed size. (See either
+  :func:`loopy.split_iname` or :func:`loopy.fix_parameters`.)
+
+* Stride changes (Row/column/something major)
+
+  Use :func:`loopy.tag_array_axes` with (e.g.) ``stride:17`` or
+  ``N1,N2,N0`` to determine how each axis of an array is realized.
+
+* Prefetch
+
+  Use :func:`loopy.add_prefetch`.
+
+* Reorder loops
+
+  Use :func:`loopy.set_loop_priority`.
+
+* Precompute subexpressions
+
+  Use a :ref:`substitution-rule` to assign a name to a subexpression,
+  for example via :func:`loopy.assignment_to_subst` or :func:`loopy.extract_subst`.
+  Then use :func:`loopy.precompute` to create an (array or scalar)
+  temporary with precomputed values.
+
+* Tile
+
+  Use :func:`loopy.split_iname` to produce enough loops, then use
+  :func:`loopy.set_loop_priority` to set the ordering.
+
+* Fix constants
+
+  Use :func:`loopy.fix_parameters`.
+
+* Parallelize (across cores)
+
+  Use :func:`loopy.tag_inames` with the ``"g.0"``, ``"g.1"`` (and so on) tags.
+
+* Parallelize (across vector lanes)
+
+  Use :func:`loopy.tag_inames` with the ``"l.0"``, ``"l.1"`` (and so on) tags.
+
+* Affinely map loop domains
+
+  Use :func:`loopy.affine_map_inames`.
+
+* Texture-based data access
+
+  Use :func:`loopy.change_arg_to_image` to use texture memory
+  for an argument.
+
+* Kernel Fusion
+
+  Use :func:`loopy.fuse_kernels`.
+
+* Explicit-SIMD Vectorization
+
+  Use :func:`loopy.tag_inames` with the ``"vec"`` iname tag.
+  Note that the corresponding axis of an array must
+  also be tagged using the ``"vec"`` array axis tag
+  (using :func:`loopy.tag_array_axes`) in order for vector code to be
+  generated.
+
+  Vectorized loops (and array axes) must have a fixed size. (See either
+  :func:`loopy.split_iname` or :func:`loopy.fix_parameters` along with
+  :func:`loopy.split_array_axis`.)
+
+* Reuse of Temporary Storage
+
+  Use :func:`loopy.alias_temporaries` to reduce the size of intermediate
+  storage.
+
+* SoA :math:`\leftrightarrow` AoS
+
+  Use :func:`loopy.tag_array_axes` with the ``"sep"`` array axis tag
+  to generate separate arrays for each entry of a short, fixed-length
+  array axis.
+
+  Separated array axes must have a fixed size. (See
+  :func:`loopy.split_array_axis`.)
+
+* Realization of Instruction-level parallelism
+
+  Use :func:`loopy.tag_inames` with the ``"ilp"`` tag.
+  ILP loops must have a fixed size. (See either
+  :func:`loopy.split_iname` or :func:`loopy.fix_parameters`.)
+
+* Type inference
+
+  Use :func:`loopy.add_and_infer_dtypes`.
+
+* Convey assumptions
+
+  Use :func:`loopy.assume` to say, e.g.
+  ``loopy.assume(knl, "N mod 4 = 0")`` or
+  ``loopy.assume(knl, "N > 0")``.
+
+* Perform batch computations
+
+  Use :func:`loopy.to_batched`.
+
+* Interface with your own library functions
+
+  Use :func:`loopy.register_function_manglers`.
+
+Uh-oh. I got a scheduling error. Any hints?
+-------------------------------------------
+
+* Make sure that dependencies between instructions are as
+  you intend.
+
+  Use :func:`loopy.show_dependency_graph` to check.
+
+  There's a heuristic that tries to help find dependencies. If there's
+  only a single write to a variable, then it adds dependencies from all
+  readers to the writer. In cases like the one below, that's counterproductive,
+  because it creates a circular dependency, hence the scheduling issue.
+  So you'll have to turn that off, like so::
+
+      knl = lp.make_kernel(
+          "{ [t]: 0 <= t < T}",
+          """
+          <> xt = x[t] {id=fetch,dep=*}
+          x[t + 1] = xt * 0.1 {dep=fetch}
+          """)
+
+* Make sure that your loops are correctly nested.
+
+  Print the kernel to make sure all instructions are within
+  the set of inames you intend them to be in.
+
+* One iname is one for loop.
+
+  For sequential loops, one iname corresponds to exactly one
+  ``for`` loop in generated code. Loopy will not generate multiple
+  loops from one iname.
+
+* Read the scheduler's error message.
+
+  The scheduler will try to be as helpful as it can in telling
+  you where it got stuck.
 
 Citing Loopy
 ============
diff --git a/doc/ref_kernel.rst b/doc/ref_kernel.rst
index 560facd63f183e1113ccb7cb94ff1953aced6e7d..e41fbd6e89abbe7fc120b1460f982045d807dca9 100644
--- a/doc/ref_kernel.rst
+++ b/doc/ref_kernel.rst
@@ -468,6 +468,8 @@ Kernel Options
 
 .. autoclass:: Options
 
+.. _targets:
+
 Targets
 -------
 
diff --git a/examples/python/find-centers.py b/examples/python/find-centers.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5e5e916156fd44b5a37cdb3cd41718916461a06
--- /dev/null
+++ b/examples/python/find-centers.py
@@ -0,0 +1,43 @@
+import numpy as np
+import loopy as lp
+import pyopencl as cl
+
+cl_ctx = cl.create_some_context(interactive=True)
+
+knl = lp.make_kernel(
+    "{[ictr,itgt,idim]: "
+    "0<=itgt<ntargets "
+    "and 0<=ictr<ncenters "
+    "and 0<=idim<ambient_dim}",
+
+    """
+    for itgt
+        for ictr
+            <> dist_sq = sum(idim,
+                    (tgt[idim,itgt] - center[idim,ictr])**2)
+            <> in_disk = dist_sq < (radius[ictr]*1.05)**2
+            <> matches = (
+                    (in_disk
+                        and qbx_forced_limit == 0)
+                    or (in_disk
+                            and qbx_forced_limit != 0
+                            and qbx_forced_limit * center_side[ictr] > 0)
+                    )
+
+            <> post_dist_sq = if(matches, dist_sq, HUGE)
+        end
+        <> min_dist_sq, <> min_ictr = argmin(ictr, post_dist_sq)
+
+        tgt_to_qbx_center[itgt] = if(min_dist_sq < HUGE, min_ictr, -1)
+    end
+    """)
+
+knl = lp.fix_parameters(knl, ambient_dim=2)
+knl = lp.add_and_infer_dtypes(knl, {
+        "tgt,center,radius,HUGE": np.float32,
+        "center_side,qbx_forced_limit": np.int32,
+        })
+
+lp.auto_test_vs_ref(knl, cl_ctx, knl, parameters={
+        "HUGE": 1e20, "ncenters": 200, "ntargets": 300,
+        "qbx_forced_limit": 1})
diff --git a/loopy/target/__init__.py b/loopy/target/__init__.py
index eb39539b9c489320b227da7c7397c0748a704159..88e656a1e3a4bfeb25a250dbcb3a05d1f805bac8 100644
--- a/loopy/target/__init__.py
+++ b/loopy/target/__init__.py
@@ -36,6 +36,8 @@ __doc__ = """
 .. autoclass:: OpenCLTarget
 .. autoclass:: PyOpenCLTarget
 .. autoclass:: ISPCTarget
+.. autoclass:: NumbaTarget
+.. autoclass:: NumbaCudaTarget
 
 """
 
diff --git a/loopy/target/numba.py b/loopy/target/numba.py
index 95c1de08c9ef90bda6438d613e45e0515508573d..6946063ee04f52a4890344b4cbff9446bacb6923 100644
--- a/loopy/target/numba.py
+++ b/loopy/target/numba.py
@@ -167,7 +167,7 @@ class NumbaCudaASTBuilder(NumbaBaseASTBuilder):
 
 
 class NumbaCudaTarget(TargetBase):
-    """A target for plain Python, without any parallel extensions.
+    """A target for Numba with CUDA extensions.
     """
 
     host_program_name_suffix = ""
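
Note on the scheduling FAQ entry added in doc/misc.rst: the circular dependency it describes can be reproduced with a toy model. The sketch below is plain Python and does not use Loopy. The `insns` table and the helpers `heuristic_deps` and `has_cycle` are hypothetical names for illustration only, and the rule implemented is a simplified reading of the heuristic as the FAQ states it (every reader of a variable depends on that variable's single writer), not Loopy's actual implementation.

```python
from graphlib import CycleError, TopologicalSorter

# Toy model of the two instructions in the FAQ example (assumed structure,
# not Loopy's data model): instruction id -> variables read and written.
insns = {
    "fetch": {"reads": {"x"}, "writes": {"xt"}},   # <> xt = x[t]
    "update": {"reads": {"xt"}, "writes": {"x"}},  # x[t + 1] = xt * 0.1
}


def heuristic_deps(insns):
    """Make each reader depend on a variable's writer if that writer is unique."""
    writers = {}
    for insn_id, insn in insns.items():
        for var in insn["writes"]:
            writers.setdefault(var, []).append(insn_id)

    deps = {insn_id: set() for insn_id in insns}
    for insn_id, insn in insns.items():
        for var in insn["reads"]:
            var_writers = writers.get(var, [])
            if len(var_writers) == 1 and var_writers[0] != insn_id:
                deps[insn_id].add(var_writers[0])
    return deps


def has_cycle(deps):
    """Return True if the dependency graph cannot be ordered topologically."""
    try:
        tuple(TopologicalSorter(deps).static_order())
    except CycleError:
        return True
    return False


# Dependencies as the simplified heuristic would add them.
auto = heuristic_deps(insns)

# Dependencies written out by hand: only "update" depends on "fetch".
explicit = {"fetch": set(), "update": {"fetch"}}

print("heuristic deps cyclic:", has_cycle(auto))
print("explicit deps cyclic:", has_cycle(explicit))
```

Under these assumptions, `auto` contains a cycle between `fetch` and `update`, while the hand-written `explicit` dependencies can be ordered, which is the situation the FAQ entry resolves by spelling out dependencies in the kernel.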