* Reimplementation of LU factorization in viennacl/linalg/lu.hpp. Better performance, but still a lot of unused potential.
* Replaced slow generic CUDA matrix-matrix multiplication kernel by several semi-automatically generated kernels. Performance still only half of OpenCL, although code is virtually identical.
* Fixed a bug with C = prod(A, B) if C is a matrix_range or matrix_slice. An unnecessary temporary was introduced.
* CUDA-benchmarks now build correctly.