Textured Fetch vs. Global Memory Read

Device memory reads through texture fetching present several potential benefits over reads from global memory:

Unoptimized data shifts and Data shifts optimized by use of texture memory illustrate how textures can be used to avoid uncoalesced global memory accesses in the following variation of the offsetCopy kernel. This copy performs a shift in data, as demonstrated in the following kernel.

Unoptimized data shifts

__global__ void shiftCopy(float *odata, float *idata, int shift)
{
  int xid = blockIdx.x * blockDim.x + threadIdx.x;
  odata[xid] = idata[xid+shift];
}

This copy kernel applies a shift to the global memory location when reading from idata, but writes to unshifted global memory locations in odata. The amount of shift is specified as a function argument to the kernel. Some degradation of performance occurs when the shift is neither zero nor a multiple of 16 because reading from idata will be either uncoalesced (compute capability 1.1 or lower) or result in transactions with wasted bandwidth (compute capability 1.2 or higher). Note that regardless of compute capability, writing to odata is fully coalesced.

The version of this code that uses textures to perform the shifted read is shown in Data shifts optimized by use of texture memory.

Data shifts optimized by use of texture memory

__global__ void textureShiftCopy(float *odata, float *idata, 
                                 int shift)
{
  int xid = blockIdx.x * blockDim.x + threadIdx.x;
  odata[xid] = tex1Dfetch(texRef, xid+shift);
}

Here, the texture reference texRef is bound to the idata array in the host code and the function tex1Dfetch() reads the shifted memory locations of idata via a texture fetch. The results of both kernels (using global memory and textures for loads) on an NVIDIA GeForce GTX 280 and an NVIDIA GeForce GTX 8800 are given in Figure 1.

Figure 1. Results of Using Texture Memory to Avoid Uncoalesced Global Memory Access

The benefit of using textures for cases that are not optimally coalesced is clear. Textured reads can maintain effective bandwidth of the unshifted, fully coalesced cases within a few percent. Note that shifts that are neither zero nor multiples of 16 show greater effective bandwidth than the offsetCopy kernel in Figure 1. Because all the stores in the shift kernels are fully coalesced with no wasted bandwidth, the shift applies only to the loads.