For a single threaded scan algorithm, we can do a for loop like this:

out[0] = in[0];
for (int k = 1; k < n; ++k) {
  out[k] = out[k = 1] + in[k];
}

There are n-1 addition for an array of length n.

This looks very sequential, but there are efficient parallel versions 1.

Naive Parallel Version

Pseudocode:

This is an inclusive scan. And each thread writes one sum and reads two values.

Work-Efficient Parallel Scan

Work-efficient parallel scan use a balanced binary tree (in concept) to perform scan in two phases

Unlike the naive version, this is an exclusive scan.

Up-Sweep

Same as parallel reduction

At this stage, imaging array as a tree

  • Array stores only left child
  • right child is the element itself
  • For node at index n
    • Left child index = n / 2 (rounds down)
    • right child index = n

Down-Sweep

  • “Traverse” back down tree using partial sums to build the scan in place
    • Set root to zero
    • At each pass, a node passes its value to its left child, and sets the right child to the sum of the previous left child’s value and its value

Pseudocode:

Complexities

  • Sequential Scan
  • Naive Parallel Scan:
  • Work-Efficient Parallel Scan:
    • Up-Sweep: adds
    • Down-sweep: adds swaps

References

Footnotes

  1. Parallel Prefix Sum (Scan) with CUDA