removed repetition from local tile sum knl because of weird performance...
removed repetition from local tile sum knl because of weird performance (compiler may be doing some kind of loop optimization)