  1. Jul 07, 2015
  2. Jul 06, 2015
  3. Jul 05, 2015
  4. Jul 02, 2015
  5. Jun 29, 2015
  6. Jun 24, 2015
  7. Jun 23, 2015
    • SpGEMM: Workaround for bug in NVIDIA OpenCL compiler. · 216a6ac4
      Karl Rupp authored
      if (buffer_size == get_local_size(0)) { ... } block caused problems
      with NVIDIA drivers 34x.yz. Reproducing the error on simpler kernels
      was not possible.
      By moving operations on index_in_C and buffer_size out of the block,
      the issues get resolved.
      
      Also introduces use of thread-private variable 'local_id' to replace
      uses of get_local_id(0) in same kernel.
      Might improve performance slightly.
      216a6ac4
  8. Jun 11, 2015
    • CUDA: Added improved CSR SpMV kernel. · b3a6f0e1
      Karl Rupp authored
      Used whenever average number of nonzeros per row is larger
      than 6.5 (Maxwell) or 12.0 (Kepler and earlier).
      Overall performance about 10-20 percent better than CUSPARSE.
      b3a6f0e1
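
The selection rule in this commit message can be sketched on the host side as follows. This is only an illustration of the stated thresholds, not the code from the commit: the names `SpmvKernel`, `average_nnz_per_row`, and `select_spmv_kernel` are hypothetical, and how the architecture is detected is left to the caller through a plain boolean flag.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names; only the thresholds 6.5 and 12.0 come from the commit message.
enum class SpmvKernel { Baseline, Improved };

// Average number of stored nonzeros per row of a CSR matrix,
// computed from the row-pointer array (length = rows + 1).
double average_nnz_per_row(const std::vector<unsigned int>& row_ptr)
{
  if (row_ptr.size() < 2)
    return 0.0;
  const std::size_t rows = row_ptr.size() - 1;
  return static_cast<double>(row_ptr.back() - row_ptr.front()) / static_cast<double>(rows);
}

// Pick the improved kernel once the average row length exceeds the
// architecture-dependent threshold: 6.5 on Maxwell, 12.0 on Kepler and earlier.
SpmvKernel select_spmv_kernel(const std::vector<unsigned int>& row_ptr, bool is_maxwell)
{
  const double threshold = is_maxwell ? 6.5 : 12.0;
  return average_nnz_per_row(row_ptr) > threshold ? SpmvKernel::Improved
                                                  : SpmvKernel::Baseline;
}
```

With a matrix averaging 7 nonzeros per row, this sketch would pick the improved kernel on Maxwell but stay with the baseline kernel on Kepler.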
  9. Jun 08, 2015
  10. May 31, 2015
  11. May 28, 2015
  12. May 27, 2015
  13. May 23, 2015
  14. May 22, 2015
    • Scan: Refurbished CUDA and OpenCL implementations. · aadb5b72
      Karl Rupp authored
      Now uses only three kernels and one temporary buffer rather than the
      previous approach with four kernels and two temporary vectors(!).
      Also prepared explicit API for inplace-scans.
      
      Possible further optimizations:
       - Non-inplace scans can run without temporary buffer
       - Small vectors can run with only one kernel invocation, no temporary buffer
       - Test suite for scans needs more love.
      aadb5b72
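
One common way to organize a scan in three passes with a single temporary buffer is sketched below as a sequential C++ emulation. Whether the actual CUDA and OpenCL kernels are split exactly this way is an assumption; the function name `blocked_inclusive_scan` and the parameter `block_size` (standing in for the chunk handled by one work group) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential emulation of a three-pass, block-wise inclusive scan.
// Each loop nest corresponds to what one GPU kernel launch could do.
std::vector<int> blocked_inclusive_scan(const std::vector<int>& in, std::size_t block_size)
{
  std::vector<int> out(in.size());
  if (in.empty())
    return out;
  if (block_size == 0)
    block_size = in.size();   // degenerate case: treat the input as one block

  const std::size_t num_blocks = (in.size() + block_size - 1) / block_size;
  std::vector<int> carries(num_blocks);   // the single temporary buffer

  // Pass 1: scan each block independently and record the block total.
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = b * block_size;
    const std::size_t end = std::min(begin + block_size, in.size());
    int sum = 0;
    for (std::size_t i = begin; i < end; ++i) {
      sum += in[i];
      out[i] = sum;
    }
    carries[b] = sum;
  }

  // Pass 2: exclusive scan of the block totals, giving each block's offset.
  int running = 0;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const int total = carries[b];
    carries[b] = running;
    running += total;
  }

  // Pass 3: add each block's offset to all of its elements.
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = b * block_size;
    const std::size_t end = std::min(begin + block_size, in.size());
    for (std::size_t i = begin; i < end; ++i)
      out[i] += carries[b];
  }
  return out;
}
```

When the whole input fits into one block, pass 1 alone already produces the final result, which matches the single-kernel optimization for small vectors noted in the commit message.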
  15. May 21, 2015
    • SpGEMM: Fixed missing barrier in OpenCL kernels. · 525fc3ae
      Karl Rupp authored
      The current kernels only worked for true lock-step execution.
      On the CPU, where each work group is executed by a few threads,
      an additional barrier is required for correct execution.
      Should also fix problems on some NVIDIA GPUs.
      525fc3ae
  16. May 20, 2015
  17. May 10, 2015
    • Merge branch 'karlrupp/sparse-matrix-matrix-product' · 66a8949c
      Karl Rupp authored
      karlrupp/sparse-matrix-matrix-product:
       Fast implementations of sparse matrix-matrix products.
       About 1.5x faster than MKL on Haswell if AVX2 enabled.
       About 1.5x faster than CUSP and CUBLAS on NVIDIA GPUs.
       About the same performance on MIC.
       Faster on FirePro W9100 with OpenCL than on a Tesla K20m with CUDA.
       A few more tweaks possible, but will be applied in a separate feature branch.
      66a8949c
    • SpGEMM: Switched back to dynamic scheduling with OpenMP. · b3e5daa0
      Karl Rupp authored
      Lists and hashes did not perform well, so removed.
      Work estimation only showed very mild gains over dynamic scheduling with
      suitable block size, so for the time being we stick to the much simpler version.
      b3e5daa0
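
A row-wise sparse matrix-matrix product with OpenMP dynamic scheduling might look like the sketch below. It is not the implementation from this commit: the `Csr` struct, the function name `spgemm_dynamic`, the per-thread dense accumulator, and the block size of 64 rows are all assumptions made for illustration. Without OpenMP enabled, the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal CSR container (hypothetical, for illustration only).
struct Csr {
  std::size_t rows;
  std::size_t cols;
  std::vector<std::size_t> row_ptr;   // length rows + 1
  std::vector<std::size_t> col_idx;
  std::vector<double> values;
};

// C = A * B, one row of C per loop iteration. Rows differ widely in cost,
// so iterations are handed out dynamically in blocks of 64 rows.
Csr spgemm_dynamic(const Csr& A, const Csr& B)
{
  std::vector<std::vector<std::size_t> > row_cols(A.rows);
  std::vector<std::vector<double> > row_vals(A.rows);

  #pragma omp parallel
  {
    // Per-thread scratch: dense accumulator plus a "column already seen" marker.
    std::vector<double> acc(B.cols, 0.0);
    std::vector<char> seen(B.cols, 0);

    #pragma omp for schedule(dynamic, 64)
    for (long row = 0; row < static_cast<long>(A.rows); ++row) {
      const std::size_t i = static_cast<std::size_t>(row);
      std::vector<std::size_t>& cols = row_cols[i];
      for (std::size_t ka = A.row_ptr[i]; ka < A.row_ptr[i + 1]; ++ka) {
        const std::size_t k = A.col_idx[ka];
        const double a = A.values[ka];
        for (std::size_t kb = B.row_ptr[k]; kb < B.row_ptr[k + 1]; ++kb) {
          const std::size_t j = B.col_idx[kb];
          if (!seen[j]) { seen[j] = 1; cols.push_back(j); }
          acc[j] += a * B.values[kb];
        }
      }
      std::sort(cols.begin(), cols.end());
      row_vals[i].reserve(cols.size());
      for (std::size_t c = 0; c < cols.size(); ++c) {
        const std::size_t j = cols[c];
        row_vals[i].push_back(acc[j]);
        acc[j] = 0.0;   // reset scratch for the next row
        seen[j] = 0;
      }
    }
  }

  // Serial assembly of the per-row results into CSR arrays.
  Csr C;
  C.rows = A.rows;
  C.cols = B.cols;
  C.row_ptr.assign(A.rows + 1, 0);
  for (std::size_t i = 0; i < A.rows; ++i)
    C.row_ptr[i + 1] = C.row_ptr[i] + row_cols[i].size();
  for (std::size_t i = 0; i < A.rows; ++i) {
    C.col_idx.insert(C.col_idx.end(), row_cols[i].begin(), row_cols[i].end());
    C.values.insert(C.values.end(), row_vals[i].begin(), row_vals[i].end());
  }
  return C;
}
```

The appeal of `schedule(dynamic, ...)` here is that it needs no up-front work estimate per row: threads simply fetch the next block of rows when they finish, which is the simplicity the commit message cites as the reason for preferring it.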
  18. May 07, 2015
  19. Apr 27, 2015
  20. Apr 18, 2015