- Jul 07, 2015
Andi authored
Replaced VIENNACL_LINAL_BISECT_GPU by VIENNACL_LINALG_BISECT_GPU
Andi authored
Andi authored
Replaced the following preprocessor macros in the OpenCL backend of the bisection algorithm:
- MAX_THREADS_BLOCK -> VIENNACL_BISECT_MAX_THREADS_BLOCK
- MAX_SMALL_MATRIX -> VIENNACL_BISECT_MAX_SMALL_MATRIX
- MAX_THREADS_BLOCK_SMALL_MATRIX -> VIENNACL_BISECT_MAX_THREADS_BLOCK_SMALL_MATRIX
- MIN_ABS_INTERVAL -> VIENNACL_BISECT_MIN_ABS_INTERVAL
Andi authored
* Fixed an endless loop.
* Fixed a race condition.
- Jul 06, 2015
Andi authored
- Jul 05, 2015
- Jul 02, 2015
Karl Rupp authored
Added size1() and size2() traits methods for Eigen::Map.
Charles Determan authored
Karl Rupp authored
viennacl::copy() now also works for Eigen::Map<VectorXf> and Eigen::Map<VectorXd>. As a positive side effect, this also improves the performance of the copy. Refer to #137 for discussion.
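As a rough usage sketch of what this enables (illustrative only; the buffer contents are made up, and VIENNACL_WITH_EIGEN must be defined before including the ViennaCL headers):

```cpp
// Illustrative sketch: copying between an Eigen::Map view of an existing
// host buffer and a viennacl::vector.
#define VIENNACL_WITH_EIGEN
#include <vector>
#include <Eigen/Dense>
#include "viennacl/vector.hpp"

int main()
{
  std::vector<float> raw(100, 1.0f);                             // existing host storage
  Eigen::Map<Eigen::VectorXf> host_view(&raw[0], raw.size());    // no copy, just a view

  viennacl::vector<float> device_vec(raw.size());
  viennacl::copy(host_view, device_vec);                         // host -> device
  viennacl::copy(device_vec, host_view);                         // device -> host
  return 0;
}
```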
- Jun 29, 2015
Karl Rupp authored
Based on previous tweaks of the CUDA kernels. Performance gain of up to 30 percent. Tests on Maxwell pending.
- Jun 24, 2015
Karl Rupp authored
Works for dynamically sized matrices. Statically sized (small) matrices are not supported, because they would provide extremely poor performance due to PCIe latency.
- Jun 23, 2015
Karl Rupp authored
The if (buffer_size == get_local_size(0)) { ... } block caused problems with NVIDIA drivers 34x.yz. Reproducing the error with simpler kernels was not possible. Moving the operations on index_in_C and buffer_size out of the block resolves the issues. Also introduces a thread-private variable 'local_id' to replace uses of get_local_id(0) in the same kernel, which might improve performance slightly.
- Jun 11, 2015
Karl Rupp authored
Used whenever the average number of nonzeros per row is larger than 6.5 (Maxwell) or 12.0 (Kepler and earlier). Overall performance is about 10-20 percent better than CUSPARSE.
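A minimal host-side sketch of that selection heuristic; the function and kernel labels are hypothetical, only the thresholds come from the message above:

```cpp
#include <cstddef>

// Hypothetical kernel labels for illustration only.
enum SpMVKernel { SPMV_NEW_KERNEL, SPMV_DEFAULT_KERNEL };

// Choose a CSR SpMV kernel from the average number of nonzeros per row;
// thresholds as stated above: 6.5 on Maxwell, 12.0 on Kepler and earlier.
SpMVKernel select_spmv_kernel(std::size_t num_rows, std::size_t num_nonzeros, bool is_maxwell)
{
  double avg_nnz   = (num_rows > 0) ? static_cast<double>(num_nonzeros) / num_rows : 0.0;
  double threshold = is_maxwell ? 6.5 : 12.0;
  return (avg_nnz > threshold) ? SPMV_NEW_KERNEL : SPMV_DEFAULT_KERNEL;
}
```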
- Jun 08, 2015
Karl Rupp authored
Removes the need for warp shuffles by using shared memory instead. Exclusive scans now run on the device (no Thrust or host-based operations).
- May 31, 2015
Karl Rupp authored
Caused problems with Visual Studio 2008, since it is not part of C++03.
Karl Rupp authored
std::map<Key, Value>::at() and std::vector<T>::data() were used a couple of times; since these features are not part of C++03, compilation on VS 2008 failed. stdint.h is likewise not available in C++03 and needed to be replaced by custom typedefs.
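A minimal sketch of the kind of C++03-compatible replacements this refers to (illustrative only, not the actual ViennaCL code):

```cpp
#include <cassert>
#include <map>
#include <vector>

int main()
{
  std::vector<double> v(10, 1.0);
  double *ptr = v.empty() ? 0 : &v[0];                       // instead of v.data() (C++11)

  std::map<int, double> m;
  m[42] = 3.14;
  std::map<int, double>::const_iterator it = m.find(42);     // instead of m.at(42) (C++11)
  assert(it != m.end());
  double value = it->second;

  typedef unsigned int my_uint32;   // hypothetical custom typedef standing in for <stdint.h> types

  (void)ptr; (void)value;
  return 0;
}
```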
- May 28, 2015
- May 27, 2015
Karl Rupp authored
Also extended test suite accordingly.
Karl Rupp authored
Extended test suite accordingly to cover both implementations.
Karl Rupp authored
Significantly simplifies debugging and diagnostics :-)
Karl Rupp authored
This routine is required in cases where the user populates the memory buffers manually. Otherwise, failures in sparse matrix products are to be expected.
Karl Rupp authored
Adds the in-place versions inclusive_scan(x) and exclusive_scan(x). The extended test suite uncovered a bug in the in-place version of the OpenMP implementation of exclusive_scan(x).
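A short usage sketch of the scan interface mentioned above; the header path and the exact namespace are assumptions based on common ViennaCL conventions:

```cpp
#include "viennacl/vector.hpp"
#include "viennacl/linalg/scan.hpp"   // assumed location of the scan routines

int main()
{
  viennacl::vector<int> x = viennacl::scalar_vector<int>(8, 1);   // x = (1, 1, ..., 1)
  viennacl::vector<int> y(8);

  viennacl::linalg::inclusive_scan(x, y);   // y = (1, 2, 3, ..., 8)
  viennacl::linalg::exclusive_scan(x, y);   // y = (0, 1, 2, ..., 7)

  viennacl::linalg::inclusive_scan(x);      // in-place variants
  viennacl::linalg::exclusive_scan(x);
  return 0;
}
```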
Karl Rupp authored
Warp-shuffles required an explicit cast to int.
- May 23, 2015
Karl Rupp authored
- May 22, 2015
Karl Rupp authored
Now uses only three kernels and one temporary buffer rather than the previous approach with four kernels and two temporary vectors(!); a schematic sketch of this structure follows below. Also prepared an explicit API for in-place scans. Possible further optimizations:
- Non-in-place scans can run without a temporary buffer.
- Small vectors can run with only one kernel invocation and no temporary buffer.
- The test suite for scans needs more love.
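For orientation, a schematic host-side sketch of such a three-kernel scan (each loop standing in for one device kernel, 'carries' for the single temporary buffer); this illustrates the general approach, not the actual ViennaCL kernels:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inclusive scan split into three phases; assumes block_size > 0.
void inclusive_scan_sketch(std::vector<int> &x, std::size_t block_size)
{
  std::size_t num_blocks = (x.size() + block_size - 1) / block_size;
  std::vector<int> carries(num_blocks);   // the single temporary buffer

  // Kernel 1: scan each block independently and record its total.
  for (std::size_t b = 0; b < num_blocks; ++b)
  {
    int sum = 0;
    std::size_t end = std::min(x.size(), (b + 1) * block_size);
    for (std::size_t i = b * block_size; i < end; ++i)
    {
      sum += x[i];
      x[i] = sum;
    }
    carries[b] = sum;
  }

  // Kernel 2: exclusive scan over the block totals.
  int offset = 0;
  for (std::size_t b = 0; b < num_blocks; ++b)
  {
    int tmp = carries[b];
    carries[b] = offset;
    offset += tmp;
  }

  // Kernel 3: add each block's offset to all of its elements.
  for (std::size_t b = 0; b < num_blocks; ++b)
  {
    std::size_t end = std::min(x.size(), (b + 1) * block_size);
    for (std::size_t i = b * block_size; i < end; ++i)
      x[i] += carries[b];
  }
}
```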
- May 21, 2015
Karl Rupp authored
The current kernels only worked for true lock-step execution. On the CPU, where each work group is executed by a few threads, an additional barrier is required for correct execution. Should also fix problems on some NVIDIA GPUs.
- May 20, 2015
- May 10, 2015
Karl Rupp authored
karlrupp/sparse-matrix-matrix-product: Fast implementations of sparse matrix-matrix products. About 1.5x faster than MKL on Haswell if AVX2 is enabled. About 1.5x faster than CUSP and CUBLAS on NVIDIA GPUs. About the same performance on MIC. Faster on a FirePro W9100 with OpenCL than on a Tesla K20m with CUDA. A few more tweaks are possible, but will be applied in a separate feature branch.
Karl Rupp authored
Lists and hashes did not perform well, so they were removed. Work estimation only showed very mild gains over dynamic scheduling with a suitable block size, so for the time being we stick to the much simpler version.
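For illustration, dynamic scheduling with a fixed chunk size, as referred to above, might look like the following sketch (the function name and chunk size are assumptions):

```cpp
#include <cstddef>

// Illustrative only: distribute rows dynamically in fixed-size chunks instead
// of estimating the work per row up front.
void process_rows(std::size_t num_rows)
{
  const int chunk_size = 256;   // "suitable block size"; the value is an assumption

  #pragma omp parallel for schedule(dynamic, chunk_size)
  for (long row = 0; row < static_cast<long>(num_rows); ++row)
  {
    // ... compute one row of the sparse matrix-matrix product here ...
  }
}
```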
- May 07, 2015
Karl Rupp authored
The amount of work per group was computed incorrectly, so not all rows were visited.
Karl Rupp authored
When copying back from the device, there is no need to implicitly assume a square matrix, because the dimensions are known on the device.
Karl Rupp authored
Karl Rupp authored
Fully replaces the old OpenCL implementation. Uses shared memory rather than warp shuffles, and a fixed workgroup size of 32 for the merge kernels in order to get rid of the cost of barriers on AMD devices. Likely to perform better on AMD devices than on NVIDIA devices, but performance tests still need to be run.
- Apr 27, 2015
Karl Rupp authored
Template resolution picked up the incorrect type (char*) due to the way pointers are stored internally. See issue #133. Reported-by: Arijit Hazra <mailtohazra@gmail.com> via viennacl-support
- Apr 18, 2015