A reduction operation applies a binary operator to a sequence of elements and returns a single result.
To apply parallel reduction, the binary operator must be associative.
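To see why associativity is needed, consider four elements and a generic operator written here as $\oplus$ (the symbol and the element names $a_0, \dots, a_3$ are illustrative, not from the original text). A sequential reduction groups the operations left to right, while a parallel one wants to combine pairs independently. The two groupings agree only if the operator is associative:

$$
((a_0 \oplus a_1) \oplus a_2) \oplus a_3 \;=\; (a_0 \oplus a_1) \oplus (a_2 \oplus a_3)
$$

On the right-hand side, $a_0 \oplus a_1$ and $a_2 \oplus a_3$ do not depend on each other, so they can be computed at the same time.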
Note that the reduction can be done in place: partial results can overwrite elements of the input array, so no separate output buffer is needed.
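As a rough illustration of the in-place idea, here is a minimal C++ sketch of a tree-shaped sum reduction. It runs sequentially on the CPU and only simulates the parallel steps; the function name `reduce_in_place`, the use of `float`, and the choice of addition as the operator are assumptions made for this example, not something the text prescribes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place tree reduction using addition.
// Each pass of the outer loop corresponds to one parallel step. The
// iterations of the inner loop touch disjoint elements, so in a parallel
// implementation each one could be handled by a separate thread.
// The total ends up in data[0] and is also returned (0 for an empty input).
float reduce_in_place(std::vector<float>& data) {
    const std::size_t n = data.size();
    if (n == 0) return 0.0f;
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
            data[i] += data[i + stride];  // overwrite the input with a partial sum
        }
    }
    return data[0];
}
```

After the call, the array no longer holds the original values: the positions that were written to contain partial sums. The number of outer passes grows logarithmically with the input length.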
Parallel reduction is not arithmetic intensive: it performs only one add per element loaded, so it is entirely memory-bandwidth bound.
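A back-of-the-envelope estimate makes this concrete. Assuming 4-byte (32-bit float) elements, which is an assumption of this example rather than something stated above, reducing $N$ elements takes $N - 1$ adds and reads $4N$ bytes from memory:

$$
\text{arithmetic intensity} \;=\; \frac{N - 1 \ \text{adds}}{4N \ \text{bytes}} \;\approx\; 0.25 \ \text{adds per byte}
$$

With so little arithmetic per byte read, the time spent is governed mostly by how fast the data can be read from memory rather than by how fast the adds can be executed.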