GMRES: Improved kernel for first stage of pipelined orthogonalization.
Using thread-local variables is substantially slower than using shared memory directly in this case: a 2x difference on a Tesla C2050 for this particular kernel. Overall performance gains depend on the sparsity pattern of the matrix (as always).