CUDA allows threads in the same block to coordinate their activities by using a barrier synchronization function __syncthreads(). For example,
Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i + 1]); // depends on Mds[i + 1], which is setup by another threadA thread that never arrives at the barrier synchronization point can cause everyone else to wait forever. When a __syncthreads() statement is placed in an if-statement, either all or none of the threads in a block execute the path that includes the _syncthreads().
if (cond) {
__syncthreads(); // OOps
}Threads in different blocks are not allowed to perform barrier synchronization. An advantage is that the CUDA runtime can execute blocks in any order.