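CUDA allows threads in the same block to coordinate their activities with the barrier synchronization function __syncthreads(). A thread that calls __syncthreads() is held at that point until every thread in its block has reached the same call.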
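As an illustration, here is a minimal sketch of a kernel in which the threads of a block exchange data through shared memory and use __syncthreads() to separate the write phase from the read phase. The names reverseTile, runReverseTile, and BLOCK_SIZE are hypothetical and chosen only for this example, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed block size for this sketch

// Each block reverses its own BLOCK_SIZE-element tile of `data` in place.
// Assumes a launch with exactly BLOCK_SIZE threads per block.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];  // each thread stores one element

    // Barrier: no thread continues until every thread in this block
    // has stored its element into the shared tile.
    __syncthreads();

    // Safe to read an element that a different thread wrote.
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host helper: returns true if every tile came back reversed.
bool runReverseTile()
{
    const int numBlocks = 4;
    const int n = numBlocks * BLOCK_SIZE;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reverseTile<<<numBlocks, BLOCK_SIZE>>>(d);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < BLOCK_SIZE; ++t)
            if (h[b * BLOCK_SIZE + t] != b * BLOCK_SIZE + (BLOCK_SIZE - 1 - t))
                return false;
    return true;
}
```

Without the barrier, a thread could read a tile entry before the thread responsible for it had written it.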

A __syncthreads() statement must be executed by all threads in a block.

When a __syncthreads() statement is placed in an if-statement, either all or none of the threads in a block must execute the path that includes the __syncthreads(). For an if-else statement in which each path has its own __syncthreads() statement, either all threads in a block must execute the if-path or all of them must execute the else-path, because the two __syncthreads() calls are different barrier synchronization points.

A thread that never arrives at the barrier synchronization point can cause the other threads in its block to wait forever.
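One common way to respect this rule is to keep the barrier outside any branch whose condition can differ between threads of the same block. The sketch below assumes a hypothetical kernel rotateInTile whose last block is only partly in range; the bounds check guards the memory accesses, while __syncthreads() stays at a point that every thread reaches. The names rotateInTile, runRotateInTile, and TILE are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128  // assumed block size for this sketch

// Each thread outputs the value loaded by its right-hand neighbor in the block
// (wrapping around inside the block). Assumes TILE threads per block.
__global__ void rotateInTile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Divergent work is fine as long as it contains no barrier.
    tile[t] = (i < n) ? in[i] : 0.0f;

    // Unsafe alternative, shown only as a comment:
    //   if (i < n) { tile[t] = in[i]; __syncthreads(); }
    // In the last block only some threads satisfy i < n, so only some
    // of them would reach the barrier.

    __syncthreads();  // reached by every thread in the block

    if (i < n)
        out[i] = tile[(t + 1) % blockDim.x];
}

// Hypothetical host helper: returns true if the output matches the expected rotation.
bool runRotateInTile()
{
    const int n = 300;                            // deliberately not a multiple of TILE
    const int numBlocks = (n + TILE - 1) / TILE;  // last block is partly out of range
    std::vector<float> hin(n), hout(n);
    for (int i = 0; i < n; ++i) hin[i] = (float)i;

    float *din = nullptr, *dout = nullptr;
    if (cudaMalloc(&din, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dout, n * sizeof(float)) != cudaSuccess) { cudaFree(din); return false; }
    cudaMemcpy(din, hin.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    rotateInTile<<<numBlocks, TILE>>>(din, dout, n);
    cudaMemcpy(hout.data(), dout, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);

    for (int i = 0; i < n; ++i) {
        int t = i % TILE;
        int src = (i - t) + (t + 1) % TILE;       // neighbor's global index
        float expected = (src < n) ? (float)src : 0.0f;
        if (hout[i] != expected) return false;
    }
    return true;
}
```

A branch whose condition is the same for every thread in a block (for example, one that depends only on blockIdx or on a kernel argument) may contain a barrier, since then either all or none of the block's threads take it.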

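Threads in different blocks are not allowed to perform barrier synchronization with each other. An advantage of this restriction is that the CUDA runtime can execute blocks in any order relative to each other, since no block has to wait for another.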
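Because a barrier only covers one block, a computation that spans several blocks is often arranged so that each block produces an independent partial result. The following sketch assumes a hypothetical kernel blockSum in which each block reduces its own slice to a single value, and a host helper sumWithBlocks that combines the per-block results after the kernel has finished. It further assumes BLOCK is a power of two and that the input length is a multiple of BLOCK.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // assumed block size; must be a power of two for this reduction

// Each block sums its own BLOCK-element slice of `in` into partial[blockIdx.x].
// No block reads or waits on another block's data.
__global__ void blockSum(const int *in, int *partial)
{
    __shared__ int s[BLOCK];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            s[t] += s[t + stride];
        // The barrier sits outside the if, so every thread in the block
        // reaches it in every iteration.
        __syncthreads();
    }

    if (t == 0)
        partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums n ones and returns the total (or -1 on allocation failure).
int sumWithBlocks()
{
    const int numBlocks = 8;
    const int n = numBlocks * BLOCK;
    std::vector<int> hin(n, 1), hpartial(numBlocks, 0);

    int *din = nullptr, *dpartial = nullptr;
    if (cudaMalloc(&din, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dpartial, numBlocks * sizeof(int)) != cudaSuccess) { cudaFree(din); return -1; }
    cudaMemcpy(din, hin.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<numBlocks, BLOCK>>>(din, dpartial);
    // This copy completes only after the kernel has finished, so all blocks are done.
    cudaMemcpy(hpartial.data(), dpartial, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dpartial);

    int total = 0;
    for (int b = 0; b < numBlocks; ++b) total += hpartial[b];  // combine on the host
    return total;
}
```

Since each block writes only its own entry of partial, the result is the same whatever order the blocks run in.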