For a single threaded scan algorithm, we can do a for loop like this:
out[0] = in[0];
for (int k = 1; k < n; ++k) {
out[k] = out[k = 1] + in[k];
}
There are n-1
addition for an array of length n
.
This looks very sequential, but there are efficient parallel versions 1.
Naive Parallel Version
Pseudocode:
This is an inclusive scan. And each thread writes one sum and reads two values.
Work-Efficient Parallel Scan
Work-efficient parallel scan use a balanced binary tree (in concept) to perform scan in two phases
- up-sweep (parallel reduction)
- down-sweep
Unlike the naive version, this is an exclusive scan.
Up-Sweep
Same as parallel reduction
At this stage, imaging array as a tree
- Array stores only left child
- right child is the element itself
- For node at index n
- Left child index = n / 2 (rounds down)
- right child index = n
Down-Sweep
- “Traverse” back down tree using partial sums to build the scan in place
- Set root to zero
- At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value
Pseudocode:
Complexities
- Sequential Scan
- Naive Parallel Scan:
- Work-Efficient Parallel Scan:
- Up-Sweep:
adds - Down-sweep:
adds swaps
- Up-Sweep: