NVIDIA Compute Visual Profiler Version 3.2
Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
Notice
BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:
ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE
BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND
SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND
EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY,
AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. These materials supersede and replace all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other
company and product names may be trademarks of the respective companies
with which they are associated.
Copyright (C) 2007-2010 by NVIDIA Corporation. All rights reserved.
PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.
List of supported features:
Execute a CUDA or OpenCL program (referred to as a Compute program in this document) with profiling enabled and view the profiler output as a table, with one row for each GPU method.
The main menu bar contains the following menus: File, Profile, Session, Options, Window, and Help. See the descriptions below for details on the menu options.
The second line of the window has four groups of toolbar icons.
Summary session information is displayed when a session is selected in the tree view.
Summary device information is displayed when a device is selected in the tree view.
Session context menu.
Session->Device context menu:
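Outside the GUI, the same counters can be collected by the CUDA command-line profiler and the resulting log imported into the Visual Profiler. The sketch below uses the documented COMPUTE_PROFILE environment-variable family from the CUDA 3.x era; the application name and the particular counter names in the configuration file are illustrative assumptions (only a few counters can be collected in a single run, so a full set requires multiple runs).

```shell
# Sketch: enabling the command-line profiler for one run.
# Counter names in the config file and "./myapp" are assumptions.

# Write a profiler configuration listing the counters to collect.
cat > profiler.cfg <<'EOF'
gld_request
gst_request
divergent_branch
instructions
EOF

export COMPUTE_PROFILE=1                    # enable profiling
export COMPUTE_PROFILE_CSV=1                # CSV output instead of key=value pairs
export COMPUTE_PROFILE_CONFIG=profiler.cfg  # which counters to collect
export COMPUTE_PROFILE_LOG=profile_%d.log   # %d expands to the device number

# ./myapp                                   # run the Compute program (hypothetical)
cat profiler.cfg                            # show the configuration we wrote
```

The log file named by COMPUTE_PROFILE_LOG can then be loaded into the Visual Profiler for the tabular views described above.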
In the table below, the numbered columns indicate whether each counter is supported on that compute capability (Y = supported, N = not supported).
Counter | Description | Type | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 |
---|---|---|---|---|---|---|---|---|
branch | Number of branches taken by threads executing a kernel. This counter is incremented by one if at least one thread in a warp takes the branch. Note that barrier instructions (__syncthreads()) are also counted as branches. | SM | Y | Y | Y | Y | Y | Y |
divergent branch | Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter will be incremented by one at each point of divergence in a warp. | SM | Y | Y | Y | Y | Y | Y |
instructions | Number of instructions executed. | SM | Y | Y | Y | Y | N | N |
warp serialize | If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. This counter gives the number of thread warps that serialize on address conflicts to either shared or constant memory. | SM | Y | Y | Y | Y | N | N |
sm cta launched | Number of thread blocks launched on a multiprocessor. | SM | Y | Y | Y | Y | Y | Y |
gld uncoalesced | Number of non-coalesced global memory loads. | TPC | Y | Y | N | N | N | N |
gld coalesced | Number of coalesced global memory loads. | TPC | Y | Y | N | N | N | N |
gld request | Number of global memory load requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. | TPC | N | N | Y | Y | Y | Y |
gld 32 byte | Number of 32 byte global memory load transactions. This increments by 1 for each 32 byte transaction. | TPC | N | N | Y | Y | N | N |
gld 64 byte | Number of 64 byte global memory load transactions. This increments by 1 for each 64 byte transaction. | TPC | N | N | Y | Y | N | N |
gld 128 byte | Number of 128 byte global memory load transactions. This increments by 1 for each 128 byte transaction. | TPC | N | N | Y | Y | N | N |
gst coalesced | Number of coalesced global memory stores. | TPC | Y | Y | N | N | N | N |
gst request | Number of global memory store requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. | TPC | N | N | Y | Y | Y | Y |
gst 32 byte | Number of 32 byte global memory store transactions. This increments by 2 for each 32 byte transaction. | TPC | N | N | Y | Y | N | N |
gst 64 byte | Number of 64 byte global memory store transactions. This increments by 4 for each 64 byte transaction. | TPC | N | N | Y | Y | N | N |
gst 128 byte | Number of 128 byte global memory store transactions. This increments by 8 for each 128 byte transaction. | TPC | N | N | Y | Y | N | N |
local load | Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction. | TPC | Y | Y | Y | Y | Y | Y |
local store | Number of local memory store transactions. For devices of compute capability 1.x this increments by 2 for each 32-byte transaction, by 4 for each 64-byte transaction, and by 8 for each 128-byte transaction. For devices of compute capability 2.x it increments by 1 irrespective of the size of the transaction. | TPC | Y | Y | Y | Y | Y | Y |
cta launched | Number of thread blocks launched on a TPC. | TPC | Y | Y | Y | Y | N | N |
texture cache hit | Number of texture cache hits. | TPC | Y | Y | Y | Y | N | N |
texture cache miss | Number of texture cache misses. | TPC | Y | Y | Y | Y | N | N |
prof triggers | There are eight such triggers that the user can profile. They are generic and can be inserted at any place in the code to collect related information. | TPC | Y | Y | Y | Y | Y | Y |
shared load | Number of executed shared load instructions per warp on a multiprocessor. | SM | N | N | N | N | Y | Y |
shared store | Number of executed shared store instructions per warp on a multiprocessor. | SM | N | N | N | N | Y | Y |
instructions issued | Number of instructions issued including replays. | SM | N | N | N | N | Y | Y |
instructions executed | Number of instructions executed, not including replays. | SM | N | N | N | N | Y | Y |
warps launched | Number of warps launched on a multiprocessor. | SM | N | N | N | N | Y | Y |
threads launched | Number of threads launched on a multiprocessor. | SM | N | N | N | N | Y | Y |
active cycles | Number of cycles a multiprocessor has at least one active warp. | SM | N | N | N | N | Y | Y |
active warps | Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 48. | SM | N | N | N | N | Y | Y |
l1 global load hit | Number of global load hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 global load miss | Number of global load misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local load hit | Number of local load hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local load miss | Number of local load misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local store hit | Number of local store hits in L1 cache. | SM | N | N | N | N | Y | Y |
l1 local store miss | Number of local store misses in L1 cache. | SM | N | N | N | N | Y | Y |
l1 shared bank conflicts | Number of shared bank conflicts. | SM | N | N | N | N | Y | Y |
uncached global load transaction | Number of uncached global load transactions. This increments by 1, 2, or 4 for 32-, 64-, and 128-byte accesses respectively. Non-zero values are seen only when the L1 cache is disabled at compile time. Please refer to the CUDA Programming Guide (Section G.4.2) for how to disable the L1 cache. | SM | N | N | N | N | Y | Y |
global store transaction | Number of global store transactions. This increments by 1, 2, or 4 for 32-, 64-, and 128-byte accesses respectively. | SM | N | N | N | N | Y | Y |
l2 read requests | Number of read requests from L1 to L2 cache. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
l2 write requests | Number of write requests from L1 to L2 cache. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
l2 read misses | Number of read misses in L2 cache. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
l2 write misses | Number of write misses in L2 cache. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
dram reads | Number of read requests to DRAM. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
dram writes | Number of write requests to DRAM. This increments by 1 for each 32-byte access. | FB | N | N | N | N | Y | Y |
tex cache requests | Number of texture cache requests. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
tex cache misses | Number of texture cache misses. This increments by 1 for each 32-byte access. | SM | N | N | N | N | Y | Y |
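Before the counters above can be inspected programmatically, the profiler log has to be parsed. The sketch below assumes the command-line profiler's default "key=[ value ]" line format; the sample line and kernel name are made up for illustration, not captured from a real run.

```python
import re

# Sketch: parsing one line of command-line profiler output in its
# default "key=[ value ]" format into a dict. The sample line below
# is illustrative, not captured from a real run.
LINE_RE = re.compile(r"(\w+)=\[\s*([^\]]*?)\s*\]")

def parse_profiler_line(line):
    """Return {field: value} with numeric fields converted to float."""
    out = {}
    for key, raw in LINE_RE.findall(line):
        try:
            out[key] = float(raw)
        except ValueError:
            out[key] = raw  # non-numeric fields, e.g. the method name
    return out

sample = ("method=[ myKernel ] gputime=[ 123.456 ] cputime=[ 145.0 ] "
          "occupancy=[ 0.667 ] divergent_branch=[ 42 ]")
counters = parse_profiler_line(sample)
print(counters["method"], counters["divergent_branch"])  # myKernel 42.0
```

A dict of this shape is a convenient input for computing the derived statistics listed in the next table.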
The following statistics are derived from the counters above; the numbered columns give the supported value range per compute capability (NA = not available).
Derived stats | Description | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 |
---|---|---|---|---|---|---|---|
glob mem read throughput | Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gld_32 * 32) + (gld_64 * 64) + (gld_128 * 128)) * TPC) / gputime. For compute capability >= 2.0 this is calculated as ((dram reads) * 32) / gputime. | * | * | * | * | * | * |
glob mem write throughput | Global memory write throughput in gigabytes per second. For compute capability < 2.0 this is calculated as (((gst_32 * 32) + (gst_64 * 64) + (gst_128 * 128)) * TPC) / gputime. For compute capability >= 2.0 this is calculated as ((dram writes) * 32) / gputime. | * | * | * | * | * | * |
glob mem overall throughput | Global memory overall throughput in gigabytes per second. This is calculated as glob mem read throughput + glob mem write throughput. | * | * | * | * | * | * |
gld efficiency | Global load efficiency | NA | NA | 0-1 | 0-1 | NA | NA |
gst efficiency | Global store efficiency | NA | NA | 0-1 | 0-1 | NA | NA |
instruction throughput | Instruction throughput ratio: the ratio of the achieved instruction rate to the peak single-issue instruction rate. The achieved instruction rate is calculated using the "instructions" profiler counter; the peak instruction rate is calculated from the GPU clock speed. When instruction dual-issue comes into play, this ratio can exceed 1. This is calculated as (instructions) / (gputime * clock_frequency). | 0-1 | 0-1 | 0-1 | 0-1 | NA | NA |
retire ipc | Retired instructions per cycle. This is calculated as (instructions executed) / (active cycles). | NA | NA | NA | NA | 0-2 | 0-4 |
active warps/active cycles | The average number of warps that are active on a multiprocessor per cycle. This is calculated as (active warps) / (active cycles). Supported only on GPUs with compute capability 2.0 or higher. | NA | NA | NA | NA | 0-48 | 0-48 |
l1 gld hit rate | L1 global load hit rate in percent. This is calculated as 100 * (l1 global load hit) / ((l1 global load hit) + (l1 global load miss)). Supported only on GPUs with compute capability 2.0 or higher. | NA | NA | NA | NA | 0-100 | 0-100 |
texture hit rate % | Texture cache hit rate in percent. This is calculated as 100 * (tex cache requests - tex cache misses) / (tex cache requests). Supported only on GPUs with compute capability 2.0 or higher. | NA | NA | NA | NA | 0-100 | 0-100 |
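The derived formulas above can be sketched as plain arithmetic over raw counter values. In the sketch below, gputime is assumed to be in microseconds and the clock rate in MHz; the counter values passed in are made up for illustration, and per-SM/TPC extrapolation of counter values to the whole chip is deliberately ignored.

```python
# Sketch: computing a few of the derived statistics defined above from
# raw counter values. gputime is assumed to be in microseconds; the
# input numbers are illustrative, not measured.

def read_throughput_gbps_cc2(dram_reads, gputime_us):
    # Compute capability >= 2.0: (dram reads * 32 bytes) / gputime.
    return (dram_reads * 32) / (gputime_us * 1e3)  # bytes/us -> GB/s

def instruction_throughput(instructions, gputime_us, clock_mhz):
    # (instructions) / (gputime * clock frequency): ratio of achieved
    # to peak single-issue instruction rate.
    return instructions / (gputime_us * clock_mhz)  # us * MHz = cycles

def l1_gld_hit_rate(hits, misses):
    # 100 * hits / (hits + misses), in percent.
    return 100.0 * hits / (hits + misses)

def retire_ipc(instructions_executed, active_cycles):
    return instructions_executed / active_cycles

print(read_throughput_gbps_cc2(dram_reads=1_000_000, gputime_us=500.0))  # 64.0
print(l1_gld_hit_rate(hits=750, misses=250))                             # 75.0
print(retire_ipc(instructions_executed=4000, active_cycles=2000))        # 2.0
```

Note that on real hardware most counters are collected for a single SM or TPC, so absolute throughput figures require scaling by the number of units, as the read/write throughput formulas above do with the TPC factor.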