For a single threaded scan algorithm, we can do a for loop like this:
out[0] = in[0];
for (int k = 1; k < n; ++k) {
out[k] = out[k = 1] + in[k];
}There are n-1 addition for an array of length n.
This looks very sequential, but there are efficient parallel versions 1.
Naive Parallel Version

Pseudocode:

This is an inclusive scan. And each thread writes one sum and reads two values.

Work-Efficient Parallel Scan
Work-efficient parallel scan use a binary search tree (in concept) to perform scan in two phases
- up-sweep (parallel reduction)
- down-sweep
Unlike the naive version, this is an exclusive scan.
Up-Sweep
Same as parallel reduction

At this stage, imaging array as a tree
- Array stores only left child
- right child is the element itself
- For node at index n
- Left child index = n / 2 (rounds down)
- right child index = n

Down-Sweep
- “Traverse” back down tree using partial sums to build the scan in place
- Set root to zero
- At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

Pseudocode:

Complexities
- Sequential Scan
- Naive Parallel Scan:
- Work-Efficient Parallel Scan:
- Up-Sweep:
adds - Down-sweep:
adds swaps
- Up-Sweep: