Device-Specific/GEMM: now using floatn* instead of float* + vload().
The fallback kernel still has to be used whenever simd_width > 1 && (ldstartA % simd_width > 0 || ldstartB % simd_width > 0). This is quite a mess, but thanks to this commit it will be easier to see whether vload()/vstore() cause any performance regression, and to switch to floatn* everywhere if they do.
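For illustration, a minimal OpenCL sketch of the two load paths. The kernel names, the out buffer, and the copy-only bodies are hypothetical (the real kernels load a full GEMM tile); simd_width == 4 is assumed:

```c
// Hypothetical sketch; gemm_load_vec4, gemm_load_fallback and out are
// illustrative names, not the repository's actual identifiers.

// Fast path: reinterpret the row start as a float4*. Dereferencing this
// pointer is only valid when A + ldstartA is 16-byte aligned, i.e.
// ldstartA % 4 == 0 (and likewise ldstartB for the B matrix).
__kernel void gemm_load_vec4(__global const float *A,
                             const int ldstartA,
                             __global float4 *out)
{
    const size_t i = get_global_id(0);
    __global const float4 *Av = (__global const float4 *)(A + ldstartA);
    out[i] = Av[i];                   // single aligned vector load
}

// Fallback path: vload4() accepts any float-aligned address, so it still
// works when ldstartA % 4 > 0, possibly at a performance cost.
__kernel void gemm_load_fallback(__global const float *A,
                                 const int ldstartA,
                                 __global float4 *out)
{
    const size_t i = get_global_id(0);
    out[i] = vload4(i, A + ldstartA); // element offset i, i.e. 4*i floats
}
```

On the host side, the dispatch would then follow the condition above: pick the fallback kernel whenever simd_width > 1 and either ldstartA or ldstartB is not a multiple of simd_width, and the vectorized kernel otherwise.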