Hoisting barriers for long-distance loop-carried dependencies
Right, I don't think there is a way (automatic or manual) to hoist barriers out of loops for loop-carried dependencies whose dependency distance is large enough to allow it, other than directly modifying the schedule.
- the bare minimum would be letting the user combine nosync with an explicitly inserted barrier instruction (currently incomplete, as we don't have local barrier instructions)
- a nicer solution would be to refine nosync scopes so they can express "no need to sync across these loops" (with corresponding support in barrier insertion)
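
To make the "distance is big enough" point concrete, here is a toy model in plain Python. It is only a sketch and does not use the scheduler's or barrier inserter's actual API: all function names are hypothetical, and it assumes a single sequential loop in which iteration i reads what iteration i - distance wrote (flow dependencies only, no buffer reuse). Under those assumptions, a barrier every `distance` iterations separates every writer from its reader, so the per-iteration barrier is stricter than necessary.

```python
def barrier_points_naive(n_iters):
    """Conservative placement: a barrier before every iteration but the first."""
    return set(range(1, n_iters))


def barrier_points_hoisted(n_iters, distance):
    """Hoisted placement: a barrier only before iterations that start a new
    block of `distance` iterations (as if the loop were split into an outer
    loop over blocks and the barrier sat in the outer loop)."""
    return set(range(distance, n_iters, distance))


def dependencies_satisfied(n_iters, distance, barriers):
    """Check that every dependency (i - distance) -> i is separated by a barrier.

    A barrier "before iteration k" runs after iteration k - 1 and before
    iteration k, so it separates source and sink when src < k <= i.
    """
    for i in range(distance, n_iters):
        src = i - distance
        if not any(src < k <= i for k in barriers):
            return False
    return True


if __name__ == "__main__":
    n, d = 12, 4
    naive = barrier_points_naive(n)
    hoisted = barrier_points_hoisted(n, d)
    print(len(naive), dependencies_satisfied(n, d, naive))
    print(len(hoisted), dependencies_satisfied(n, d, hoisted))
```

In this model the hoisted placement corresponds to what a refined nosync scope would have to let the barrier inserter conclude: no sync is needed inside a block of `distance` iterations, only between blocks.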