Modified inner_prod() such that summation after multi-group reduction is...
Modified inner_prod() so that the summation after the multi-group reduction is performed on the CPU, unless the LHS is a GPU scalar. This yields a speedup of a few percent for CG and BiCGStab and eliminates messy temporaries.
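For illustration only, here is a minimal host-side sketch of the two-stage pattern described above. It is not the code from this commit: the function names partial_inner_prod and inner_prod_host_sum are hypothetical, the "work groups" are simulated with plain loops instead of a GPU kernel, and the GPU-scalar LHS case is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1 (stand-in for the GPU kernel; hypothetical name):
// each simulated work group reduces its own contiguous chunk of
// x[i] * y[i] to a single partial sum.
std::vector<double> partial_inner_prod(const std::vector<double>& x,
                                       const std::vector<double>& y,
                                       std::size_t num_groups)
{
    assert(x.size() == y.size());
    if (num_groups == 0)
        num_groups = 1;  // assumption: at least one group is always used

    std::vector<double> partials(num_groups, 0.0);
    const std::size_t n = x.size();
    const std::size_t chunk = (n + num_groups - 1) / num_groups;  // ceil(n / groups)

    for (std::size_t g = 0; g < num_groups; ++g) {
        const std::size_t begin = g * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        // If begin >= end the group has no elements and its partial stays 0.
        for (std::size_t i = begin; i < end; ++i)
            partials[g] += x[i] * y[i];
    }
    return partials;
}

// Stage 2 (hypothetical name): the handful of partial sums is added up
// on the host instead of launching a second reduction kernel.
double inner_prod_host_sum(const std::vector<double>& x,
                           const std::vector<double>& y,
                           std::size_t num_groups)
{
    const std::vector<double> partials = partial_inner_prod(x, y, num_groups);
    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

In an iterative solver such as CG, the result of the inner product is typically needed on the host anyway (for step sizes and convergence checks), which is one plausible reason a host-side final sum can be cheaper than a second device reduction.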