Streaming multiprocessor
CUDA memory and cache architecture
Registers
L1 cache and Shared Memory
L2 cache
Thread hierarchy
Block and warp
__syncthreads
Driver API vs Runtime API
Stream
Task graph
CUDA performance
Bank conflicts
Atomics
Copy/compute overlap