Scan: Refurbished CUDA and OpenCL implementations.
Now uses only three kernels and one temporary buffer rather than the previous approach with four kernels and two temporary vectors(!).
Also prepared an explicit API for in-place scans.

Possible further optimizations:
 - Non-in-place scans can run without a temporary buffer
 - Small vectors can run with only one kernel invocation and no temporary buffer
 - Test suite for scans needs more love
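For readers unfamiliar with the layout, here is a minimal CPU sketch in C++ of one common way to split a scan into three passes with a single temporary buffer. The commit does not say how its three kernels divide the work, so the split below is an assumption, and all function names are hypothetical. They are not the project's API. Each function stands in for one kernel launch, and each outer-loop iteration in pass 1 stands in for one work group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1 (assumed): every block scans its own chunk and stores the chunk
// total in the single temporary buffer `block_sums`.
// `in` and `out` may refer to the same vector (in-place use).
void scan_blocks(const std::vector<int>& in, std::vector<int>& out,
                 std::vector<int>& block_sums, std::size_t block_size) {
    for (std::size_t b = 0; b < block_sums.size(); ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, in.size());
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];   // read before write, so aliasing is safe
            out[i] = running;
        }
        block_sums[b] = running;
    }
}

// Pass 2 (assumed): exclusive scan over the block totals, in place, so that
// block_sums[b] becomes the sum of all elements in the blocks before b.
void scan_block_sums(std::vector<int>& block_sums) {
    int running = 0;
    for (int& s : block_sums) {
        const int total = s;
        s = running;
        running += total;
    }
}

// Pass 3 (assumed): add each block's offset to its partial results.
void add_block_offsets(std::vector<int>& out,
                       const std::vector<int>& block_sums,
                       std::size_t block_size) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] += block_sums[i / block_size];
    }
}

// Hypothetical driver: three passes, one temporary buffer.
void blocked_inclusive_scan(const std::vector<int>& in, std::vector<int>& out,
                            std::size_t block_size) {
    assert(block_size > 0);
    assert(out.size() == in.size());
    const std::size_t num_blocks = (in.size() + block_size - 1) / block_size;
    std::vector<int> block_sums(num_blocks);  // the only temporary
    scan_blocks(in, out, block_sums, block_size);
    scan_block_sums(block_sums);
    add_block_offsets(out, block_sums, block_size);
}

// Hypothetical explicit in-place entry point: input and output are the same.
void blocked_inclusive_scan_inplace(std::vector<int>& v,
                                    std::size_t block_size) {
    blocked_inclusive_scan(v, v, block_size);
}
```

In this sketch, the small-vector optimization listed above corresponds to the case where the whole input fits into one block: pass 1 alone already produces the final result, so passes 2 and 3 and the temporary buffer could be skipped.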