A reduction operation applies a binary operator to a sequence of elements and returns a single result.

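To apply parallel reduction, the binary operator must be associative, because a parallel reduction changes how the operations are grouped: partial results are computed independently and then combined.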
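A minimal sketch of why associativity matters, in plain Python rather than GPU code. The helper `pairwise_reduce` is a hypothetical name introduced here for illustration; it groups the operands the way a reduction tree would, while `functools.reduce` groups them left to right.

```python
from functools import reduce
import operator

data = [1, 2, 3, 4, 5, 6, 7, 8]

def pairwise_reduce(op, values):
    """Reduce by splitting in half recursively, the grouping a reduction tree uses."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return op(pairwise_reduce(op, values[:mid]), pairwise_reduce(op, values[mid:]))

# Addition is associative: both groupings agree.
sequential_sum = reduce(operator.add, data)
tree_sum = pairwise_reduce(operator.add, data)

# Subtraction is not associative: the two groupings disagree.
sequential_diff = reduce(operator.sub, data)
tree_diff = pairwise_reduce(operator.sub, data)

print(sequential_sum, tree_sum)
print(sequential_diff, tree_diff)
```

With addition the two results match; with subtraction they differ, so subtraction cannot be reduced in parallel this way.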

Note that the reduction can be done in place: partial results overwrite elements of the input array, so no extra buffer is needed.
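The in-place idea can be sketched as follows. This is a sequential Python simulation under stated assumptions, not a GPU kernel: the function name `reduce_in_place` and the doubling-stride scheme are choices made for this example, and the input is assumed to be non-empty.

```python
import operator

def reduce_in_place(data, op):
    """Tree reduction that overwrites `data`; the result ends up in data[0]."""
    n = len(data)
    stride = 1
    while stride < n:
        # The iterations of this inner loop touch disjoint elements, so on a
        # GPU each index i could be handled by a separate thread.
        for i in range(0, n - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]

values = [3, 1, 4, 1, 5, 9, 2, 6]
total = reduce_in_place(values, operator.add)
print(total)
```

Each pass halves the number of live partial results, so the loop runs about log2(n) passes. The other entries of `values` are left holding intermediate partial results, which is the price of not allocating a second array.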

Parallel reduction is not arithmetic intensive: it performs only one add per element loaded, so it is completely memory-bandwidth bound.
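As a rough worked estimate, assume a sum over N 32-bit floats (4 bytes each, an assumption made here) where every element is read from memory exactly once. The reduction then performs N - 1 adds while moving 4N bytes, giving the arithmetic intensity:

```latex
\[
  \text{arithmetic intensity}
  = \frac{\text{operations}}{\text{bytes moved}}
  = \frac{N - 1}{4N}
  \approx 0.25 \ \text{adds per byte} \quad (N \gg 1)
\]
```

At about a quarter of an add per byte, the run time is set by how fast data can be fetched rather than by how fast the adds can be computed.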