Commit 33d66246 authored Jul 30, 2013 by Karl Rupp

Removed static temporaries for inner_prod() and norm_X() for CUDA and OpenCL backends.

These optimizations caused race conditions in multithreaded settings.
The drawback is now a higher 'launch' overhead in these routines.
Benchmarking is required to quantify the overhead and to consider further steps (temporaries in the context).
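
The trade-off described in this message can be illustrated with a minimal host-side C++ sketch. It is not the actual CUDA or OpenCL backend code: `reduce_chunks`, `inner_prod_static_tmp`, `inner_prod_local_tmp`, `concurrent_inner_prods`, and `kChunks` are hypothetical names, and a plain `std::vector` stands in for the device temporary. The sketch assumes the reduction runs in two stages, with partial results stored in a temporary buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for a two-stage reduction: stage one writes per-chunk
// partial sums into a temporary buffer, stage two adds the partial sums up.
constexpr std::size_t kChunks = 4;

void reduce_chunks(const std::vector<double>& x, const std::vector<double>& y,
                   std::vector<double>& partial) {
  assert(x.size() == y.size() && partial.size() == kChunks);
  for (std::size_t c = 0; c < kChunks; ++c) {
    double sum = 0.0;
    for (std::size_t i = c; i < x.size(); i += kChunks) sum += x[i] * y[i];
    partial[c] = sum;
  }
}

// Static-temporary pattern: the buffer is allocated once and reused, which is
// cheap, but every caller shares it. Two threads running this concurrently
// would overwrite each other's partial sums (a data race).
double inner_prod_static_tmp(const std::vector<double>& x,
                             const std::vector<double>& y) {
  static std::vector<double> partial(kChunks);  // shared by all callers
  reduce_chunks(x, y, partial);
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Per-call pattern: each call pays for allocating its own buffer (the extra
// 'launch' overhead), but no state is shared between threads.
double inner_prod_local_tmp(const std::vector<double>& x,
                            const std::vector<double>& y) {
  std::vector<double> partial(kChunks);  // private to this call
  reduce_chunks(x, y, partial);
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Calls the per-call variant from several threads at once. Thread t builds its
// own inputs and writes only to its own slot in `results`.
std::vector<double> concurrent_inner_prods(std::size_t n_threads, std::size_t n) {
  std::vector<double> results(n_threads, 0.0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([t, n, &results] {
      std::vector<double> x(n, static_cast<double>(t + 1));
      std::vector<double> y(n, 2.0);
      results[t] = inner_prod_local_tmp(x, y);
    });
  }
  for (auto& w : workers) w.join();
  return results;
}
```

Both variants return the same value when called from a single thread; only the per-call variant is safe to call from several threads without external locking. Keeping the temporaries in a per-context object, as suggested above, would be a possible middle ground, but this sketch does not attempt it.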

parent db34bc39
