Commit 4381e000 authored Dec 04, 2014 by Karl Rupp

GMRES: Improved kernel for first stage of pipelined orthogonalization.

Using thread-local variables is substantially slower than using
shared memory directly in this case: about a 2x difference on a Tesla C2050
for this particular kernel. Overall performance gains depend on the sparsity
pattern of the matrix (as always).
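
To illustrate the two accumulation strategies contrasted above, here is a
minimal CUDA sketch. It is not the kernel touched by this commit. It assumes
the first orthogonalization stage amounts to computing inner products of one
vector with k basis vectors, and every name in it (dots_thread_local,
dots_shared, reduce_and_store, BLOCK_SIZE, MAX_K) is hypothetical. Error
checking is omitted for brevity, and the sketch makes no performance claim of
its own.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // threads per block, power of two (assumed value)
#define MAX_K 4         // upper bound on basis vectors per launch (assumed value)

// Tree reduction of k partial sums per thread held in shared memory.
// Thread 0 writes one value per basis vector and block to 'partial'.
__device__ void reduce_and_store(double *buf, double *partial, unsigned k)
{
  for (unsigned stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
    __syncthreads();
    if (threadIdx.x < stride)
      for (unsigned j = 0; j < k; ++j)
        buf[j * BLOCK_SIZE + threadIdx.x] += buf[j * BLOCK_SIZE + threadIdx.x + stride];
  }
  if (threadIdx.x == 0)
    for (unsigned j = 0; j < k; ++j)
      partial[j * gridDim.x + blockIdx.x] = buf[j * BLOCK_SIZE];
}

// Variant A: accumulate in thread-local variables, copy to shared memory afterwards.
__global__ void dots_thread_local(const double *basis, const double *v,
                                  double *partial, unsigned n, unsigned k)
{
  __shared__ double buf[MAX_K * BLOCK_SIZE];
  double sums[MAX_K];
  for (unsigned j = 0; j < k; ++j) sums[j] = 0.0;
  for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
    double vi = v[i];
    for (unsigned j = 0; j < k; ++j) sums[j] += vi * basis[j * n + i];
  }
  for (unsigned j = 0; j < k; ++j) buf[j * BLOCK_SIZE + threadIdx.x] = sums[j];
  reduce_and_store(buf, partial, k);
}

// Variant B: accumulate directly in shared memory, no thread-local array.
__global__ void dots_shared(const double *basis, const double *v,
                            double *partial, unsigned n, unsigned k)
{
  __shared__ double buf[MAX_K * BLOCK_SIZE];
  for (unsigned j = 0; j < k; ++j) buf[j * BLOCK_SIZE + threadIdx.x] = 0.0;
  for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
    double vi = v[i];
    for (unsigned j = 0; j < k; ++j) buf[j * BLOCK_SIZE + threadIdx.x] += vi * basis[j * n + i];
  }
  reduce_and_store(buf, partial, k);
}

int main()
{
  const unsigned n = 1 << 16, k = 3, blocks = 32;
  std::vector<double> h_basis(k * n), h_v(n, 1.0), h_partial(k * blocks);
  for (unsigned j = 0; j < k; ++j)
    for (unsigned i = 0; i < n; ++i)
      h_basis[j * n + i] = j + 1.0;  // so that <v, basis_j> = (j + 1) * n

  double *basis, *v, *partial;
  cudaMalloc(&basis, k * n * sizeof(double));
  cudaMalloc(&v, n * sizeof(double));
  cudaMalloc(&partial, k * blocks * sizeof(double));
  cudaMemcpy(basis, h_basis.data(), k * n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(v, h_v.data(), n * sizeof(double), cudaMemcpyHostToDevice);

  int mismatches = 0;
  for (int variant = 0; variant < 2; ++variant) {
    if (variant == 0) dots_thread_local<<<blocks, BLOCK_SIZE>>>(basis, v, partial, n, k);
    else              dots_shared<<<blocks, BLOCK_SIZE>>>(basis, v, partial, n, k);
    cudaMemcpy(h_partial.data(), partial, k * blocks * sizeof(double), cudaMemcpyDeviceToHost);
    for (unsigned j = 0; j < k; ++j) {
      double dot = 0.0;  // final sum over per-block partial results on the host
      for (unsigned b = 0; b < blocks; ++b) dot += h_partial[j * blocks + b];
      if (dot != (j + 1.0) * n) ++mismatches;
      printf("variant %d, basis vector %u: %.1f\n", variant, j, dot);
    }
  }
  cudaFree(basis); cudaFree(v); cudaFree(partial);
  return mismatches;
}
```

Both variants produce the same per-block partial sums and differ only in
where the running sums live during the main loop. Timing the two kernels
(for example with cudaEvent timers) on the target GPU is the only reliable
way to tell which is faster for a given k and vector length.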

parent 9d8bae24
