Removed static temporaries for inner_prod() and norm_X() for CUDA and OpenCL backends.
The static temporaries were an optimization, but they caused race conditions in multithreaded settings. The drawback of removing them is higher 'launch' overhead in these routines. Benchmarking is required to quantify the overhead and to decide on further steps (such as temporaries in the context).
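
The trade-off can be illustrated with a host-only sketch. This is not the code touched by this commit: the function names, the `Scratch` type, and the `groups` parameter are hypothetical, and a `std::vector` stands in for the device buffer that holds partial reduction results. The first variant shares one static buffer between all callers, which is unsafe when several threads call it at once. The second allocates the buffer on every call, which is safe but pays the allocation cost each time.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device-side buffer of partial results.
using Scratch = std::vector<double>;

// Pattern similar to what was removed: one static temporary shared by all
// callers. Cheap after the first call, but two threads calling this at the
// same time write into the same buffer (a data race).
// Assumes x.size() == y.size().
double inner_prod_static_tmp(const std::vector<double>& x,
                             const std::vector<double>& y,
                             std::size_t groups = 4)
{
    static Scratch partial;          // shared between all calls and threads
    partial.assign(groups, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        partial[i % groups] += x[i] * y[i];
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Per-call temporary: every invocation owns its buffer, so concurrent callers
// cannot overwrite each other's partial results. The price is one allocation
// per call, which corresponds to the higher overhead mentioned above.
// Assumes x.size() == y.size().
double inner_prod_local_tmp(const std::vector<double>& x,
                            const std::vector<double>& y,
                            std::size_t groups = 4)
{
    Scratch partial(groups, 0.0);    // private to this call
    for (std::size_t i = 0; i < x.size(); ++i)
        partial[i % groups] += x[i] * y[i];
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Both variants return the same value when called from a single thread; only the second remains correct under concurrent calls. Keeping one temporary per context would be a middle ground, avoiding the per-call allocation as long as a context is not shared between threads without synchronization.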