On current processors, reading a value from memory can consume hundreds of cycles, and the ALUs sit idle while they wait for data to arrive. CPUs and GPUs take drastically different approaches to hiding this latency.
CPUs apply a variety of techniques to this task, including a large, hierarchical cache, out-of-order execution, and speculative execution.
GPUs, on the other hand, use aggressive thread switching to hide latency. [1] If one warp stalls, the warp scheduler can quickly switch to another warp to keep the execution units busy. Unlike on CPUs, this context switch is cheap: it can happen on a per-clock-cycle basis. [2]
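A minimal sketch of how this plays out in practice: the memory-bound kernel below (a hypothetical example, not from the original text) launches far more threads than the GPU has execution units, so whenever one warp stalls on its global-memory load, the scheduler has many other resident warps to issue from.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Memory-bound kernel: each thread loads one element, does a tiny
// amount of arithmetic, and stores the result. While one warp waits
// on its global-memory load, the SM's warp scheduler issues
// instructions from other resident warps, hiding the latency.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;  // global load -> ALU op -> global store
    }
}

int main() {
    const int n = 1 << 20;  // ~1M elements
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    // Oversubscribe: 4096 blocks x 256 threads is far more threads
    // than there are CUDA cores, giving the scheduler plenty of
    // warps to switch between while loads are in flight.
    int block = 256;
    int grid  = (n + block - 1) / block;
    scale<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that no explicit context-switch code appears anywhere: keeping enough warps resident per SM (occupancy) is what lets the hardware scheduler do the latency hiding.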