This section describes processor performance monitor events available for performance analysis and tuning for AMD Athlon™ 64 and AMD Opteron™ processors. AMD Athlon 64 and AMD Opteron processors provide four 48-bit performance counters per available core, which allows four types of events to be monitored simultaneously. The performance counters are not guaranteed to be fully accurate and should be used as a relative measure of performance to assist in application tuning. Unlisted event numbers are reserved and their results are undefined.
The Event Select value is used to select the event to be monitored. The Unit Mask is used to further qualify the event selected by the Event Select value. The Mask Value given here is an index; each index corresponds to one bit of the actual 8-bit Unit Mask, as specified in the following table.
Mask value | Unit Mask |
0 | 0x01 |
1 | 0x02 |
2 | 0x04 |
3 | 0x08 |
4 | 0x10 |
5 | 0x20 |
6 | 0x40 |
7 | 0x80 |
Unless otherwise stated, the Unit Mask values shown may be combined to select any desired combination of the sub-events for a given event. For events where no Unit Mask table is shown, the Unit Mask is not applicable.
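To make the index-to-bit mapping concrete, the following minimal C sketch builds an 8-bit Unit Mask by OR-ing together the bits for a set of Mask Value indexes. The event chosen for the example (DC refills from L2/system, event 42h) is only an illustration; programming the actual event-select MSR is not shown and is covered by the BIOS and Kernel Developer's Guide referenced below.

```c
#include <stdint.h>
#include <stdio.h>

/* Build the 8-bit Unit Mask from a list of Mask Value indexes (0-7):
 * index 0 -> 0x01, index 1 -> 0x02, ..., index 7 -> 0x80. */
static uint8_t unit_mask_from_indexes(const unsigned *idx, int n)
{
    uint8_t mask = 0;
    for (int i = 0; i < n; i++)
        mask |= (uint8_t)(1u << idx[i]);
    return mask;
}

int main(void)
{
    /* Example: DC refills from L2/system (event 42h), selecting the four
     * refill-from-L2 coherency-state sub-events (Mask Values 1-4). */
    unsigned idx[] = { 1, 2, 3, 4 };
    uint8_t um = unit_mask_from_indexes(idx, 4);
    printf("Unit Mask = 0x%02X\n", (unsigned)um);   /* prints 0x1E */
    return 0;
}
```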
Speculative vs. Retired events: Several events may include speculative activity, meaning they may count false-path instructions that are ultimately discarded due to a branch misprediction. Events associated with Retire reflect actual program execution. Where the distinction matters, the event description explicitly labels the event as one or the other.
Dual-core operation: In AMD64 dual-core processors, each core has its own set of event counters. However, the cores share the event-select logic for events in the shared Northbridge logic, so one core can overwrite a Northbridge event select (including its unit mask) that was previously set up by the other core, changing the event that the first core thinks it is counting.
Note: This conflict between cores occurs between corresponding event counters, e.g., PMC0 vs. PMC0. So both cores cannot simultaneously monitor different Northbridge events using the same counter. When using the performance counters simultaneously in both cores, care must be taken to avoid this conflict, such as by having one core monitor the desired Northbridge events and the other core either monitor events internal to itself, or not use the corresponding event counters.
Rev E errata regarding dual-core processor operation: Rev E dual-core processors have an erratum whereby any write to an event select MSR, regardless of what event is being selected, will overwrite the Northbridge event selects for that counter. Hence the conflict described above exists even when the second core is being programmed for non-Northbridge events. This will be fixed in a future revision. The work-around for this is to have the Northbridge-monitoring core program its event counters after the other core has completed its own event counter setup.
For detailed information, refer to the BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 26094.
Abbreviation: FPU ops
The number of operations (uops) dispatched to the FPU execution pipelines. This event reflects how busy the FPU pipelines are. This includes all operations done by x87, MMX® and SSE instructions, including moves. Each increment represents a one-cycle dispatch event; packed 128-bit SSE operations count as two ops; scalar operations count as one. Speculative. (See also event CBh).
Note: Since this event includes non-numeric operations it is not suitable for measuring MFLOPs.
Value | Unit mask description |
0 | Add pipe ops |
1 | Multiply pipe ops |
2 | Store pipe ops |
3 | Add pipe load ops |
4 | Multiply pipe load ops |
5 | Store pipe load ops |
Abbreviation: No FPU op cycles
The number of cycles in which no FPU operations were retired. Invert this (set the Invert control bit in the event select MSR) to count cycles in which at least one FPU operation was retired.
Abbreviation: Fast flag FPU ops
The number of FPU operations that use the fast flag interface (e.g. FCOMI, COMISS, COMISD, UCOMISS, UCOMISD). Speculative.
Abbreviation: Seg reg loads
The number of segment register loads performed.
Value | Unit mask description |
0 | ES |
1 | CS |
2 | SS |
3 | DS |
4 | FS |
5 | GS |
6 | HS |
Abbreviation: Restart self-mod code
The number of pipeline restarts that were caused by self-modifying code (a store that hits any instruction that's been fetched for execution beyond the instruction doing the store).
Abbreviation: Restart probe hit
The number of pipeline restarts caused by an invalidating probe hitting on a speculative out-of-order load.
Abbreviation: LS2 buffer full
The number of cycles that the LS2 buffer is full. This buffer holds stores waiting to retire as well as requests that missed the data cache and are waiting on a refill. This condition will stall further data cache accesses, although such stalls may be overlapped by independent instruction execution.
Abbreviation: Locked ops
This event covers locked operations performed and their execution time. The execution time represented by the cycle counts is typically overlapped to a large extent with other instructions. The non-speculative cycles event is suitable for event-based profiling of lock operations that tend to miss in the cache.
Value | Unit mask description |
0 | Number of locked instructions executed |
1 | Number of cycles spent in speculative phase |
2 | Number of cycles spent in non-speculative phase |
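As an illustration of combining these sub-events, the sketch below derives average per-lock execution times from the three raw counts. The function name and the counter values in main() are hypothetical; reading the counters is not shown.

```c
#include <stdio.h>

/* Average per-lock execution time, derived from the three sub-events above.
 * Inputs are raw counts from three counters programmed with Unit Mask
 * values 0, 1 and 2 of this event; the values in main() are placeholders. */
static void locked_op_times(unsigned long long locked_insts,
                            unsigned long long spec_cycles,
                            unsigned long long nonspec_cycles)
{
    if (locked_insts == 0)
        return;
    printf("avg speculative cycles per locked op:     %.1f\n",
           (double)spec_cycles / locked_insts);
    printf("avg non-speculative cycles per locked op: %.1f\n",
           (double)nonspec_cycles / locked_insts);
}

int main(void)
{
    locked_op_times(100000ULL, 2500000ULL, 6000000ULL); /* placeholder counts */
    return 0;
}
```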
Abbreviation: Memory req
These events reflect accesses to uncacheable (UC) or write-combining (WC) memory regions (as defined by MTRR or PAT settings) and Streaming Store activity to WB memory. Both the WC and Streaming Store events reflect Write Combining buffer flushes, not individual store instructions. A WC buffer flush typically consists of one 64-byte write to the system (assuming software fills a buffer before it gets flushed); a partially-filled buffer requires two or more smaller writes to the system. The WC event reflects flushes of WC buffers that were filled by stores to WC memory or streaming stores to WB memory. The Streaming Store event reflects only flushes due to streaming stores (which are typically only to WB memory). The difference between counts of these two events reflects the true amount of write activity to WC memory.
Value | Unit mask description |
0 | Requests to non-cacheable (UC) memory |
1 | Requests to write-combining (WC) memory |
7 | Streaming store (SS) requests |
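A minimal sketch of the derived measure described above, assuming two counters programmed with Unit Mask values 1 (WC) and 7 (SS) of this event; the function name is illustrative only.

```c
/* Flushes of WC buffers filled by stores to WC memory, per the text above:
 * the WC sub-event (Unit Mask value 1) minus the Streaming Store sub-event
 * (Unit Mask value 7).  Inputs are raw counts from two counters. */
static unsigned long long wc_memory_flushes(unsigned long long wc_flushes,
                                            unsigned long long ss_flushes)
{
    return (wc_flushes > ss_flushes) ? wc_flushes - ss_flushes : 0;
}
```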
Abbreviation: DC accesses
The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. Speculative.
Abbreviation: DC misses
The number of data cache references which missed in the data cache. Speculative.
Except in the case of streaming stores, only the first miss for a given line is included - access attempts by other instructions while the refill is still pending are not included in this event. So in the absence of streaming stores, each event reflects one 64-byte cache line refill, and counts of this event are the same as, or very close to, the combined count for event 42h.
Streaming stores however will cause this event for every such store, since the target memory is not refilled into the cache. Hence this event should not be used as an indication of data cache refill activity - event 42h should be used for such measurements. (See event 65h for an indication of streaming store activity.) A large difference between events 41h (with all UNIT_MASK bits set) and 42h would be due mainly to streaming store activity.
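The relationships above can be turned into simple derived metrics. The sketch below assumes raw counts from events 40h (DC accesses), 41h (DC misses, all sub-events) and 42h (DC refills, all sub-events); the counts shown are placeholders.

```c
#include <stdio.h>

/* Derived data-cache metrics, per the text above.  Inputs are raw counts from
 * counters programmed as: event 40h (DC accesses), 41h (DC misses, all
 * sub-events) and 42h (DC refills from L2/system, all sub-events).  The
 * values in main() are placeholders. */
static void dc_metrics(unsigned long long dc_accesses,
                       unsigned long long dc_misses,
                       unsigned long long dc_refills)
{
    if (dc_accesses)
        printf("DC miss rate: %.2f%%\n", 100.0 * (double)dc_misses / dc_accesses);
    /* In the absence of streaming stores, 41h is the same as or very close to
     * 42h; a large excess of 41h over 42h is due mainly to streaming stores. */
    printf("approx. misses due to streaming stores: %llu\n",
           dc_misses > dc_refills ? dc_misses - dc_refills : 0ULL);
}

int main(void)
{
    dc_metrics(1000000ULL, 42000ULL, 41000ULL); /* placeholder counts */
    return 0;
}
```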
Abbreviation: DC refills L2/sys
The number of data cache refills satisfied from the L2 cache (and/or the system), per the UNIT_MASK. UNIT_MASK bits 4:1 allow a breakdown of refills from the L2 by coherency state. UNIT_MASK bit 0 reflects refills which missed in the L2, and provides the same measure as the combined sub-events of event 43h. Each increment reflects a 64-byte transfer. Speculative.
Value | Unit mask description |
0 | Refill from System |
1 | Shared-state line from L2 |
2 | Exclusive-state line from L2 |
3 | Owned-state line from L2 |
4 | Modified-state line from L2 |
Abbreviation: DC refills sys
The number of L1 cache refills satisfied from the system (system memory or another cache), as opposed to the L2. The UNIT_MASK selects lines in one or more specific coherency states. Each increment reflects a 64-byte transfer. Speculative.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DC evicted
The number of L1 data cache lines written to the L2 cache or system memory, having been displaced by L1 refills. The UNIT_MASK may be used to count only victims in specific coherency states. Each increment represents a 64-byte transfer. Speculative.
In most cases, L1 victims are moved to the L2 cache, displacing an older cache line there. Lines brought into the data cache by PrefetchNTA instructions, however, are evicted directly to system memory (if dirty) or invalidated (if clean). There is no provision for measuring this component by itself. The Invalid case (UNIT_MASK value 01h) reflects the replacement of lines that had been invalidated by probes for write operations from another processor or by DMA activity.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DTLB L1M L2H
The number of data cache accesses that miss in the L1 DTLB and hit in the L2 DTLB. Speculative.
Abbreviation: DTLB L1M L2M
The number of data cache accesses that miss in both the L1 and L2 DTLBs. Speculative.
Abbreviation: Misalign access
The number of data cache accesses that are misaligned. These are accesses which cross an eight-byte boundary. They incur an extra cache access (reflected in event 40h), and an extra cycle of latency on reads. Speculative.
Abbreviation: Late cancel
Abbreviation: Early cancel
Abbreviation: 1-bit ECC errors
The number of single-bit errors corrected by either of the error detection/correction mechanisms in the data cache.
Value | Unit mask description |
0 | Scrubber error |
1 | Piggyback scrubber errors |
Abbreviation: Prefetch inst
The number of prefetch instructions dispatched by the decoder. Speculative. Such instructions may or may not cause a cache line transfer. Dcache and L2 accesses, hits, and misses caused by prefetch instructions are included in the corresponding events.
Value | Unit mask description |
0 | Load (Prefetch, PrefetchT0/T1/T2) |
1 | Store (PrefetchW) |
2 | NTA (PrefetchNTA) |
Abbreviation: DC misses locked inst
The number of data cache misses incurred by locked instructions. (The total number of locked instructions may be obtained from event 24h.)
Such misses may be satisfied from the L2 or system memory, but there is no provision for distinguishing between the two. When used for event-based profiling, this event will tend to occur very close to the offending instructions. (See also event 24h.) This event is also included in the basic Dcache miss event (41h).
Value | Unit mask description |
1 | Data cache misses by locked instructions |
Abbreviation: Data prefetcher
These events reflect requests made by the data prefetcher. UNIT_MASK bit 1 counts total prefetch requests, while bit 0 counts requests where the target block is found in the L2 or data cache. The difference between the two represents actual data read (in units of 64-byte cache lines) from the system by the prefetcher. This is also included in the count of event 7Fh, UNIT_MASK bit 0 (combined with other L2 fill events).
Value | Unit mask description |
0 | Cancelled prefetches |
1 | Prefetch attempts |
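A minimal sketch of the difference described above, converting it to bytes read from the system; the function name is illustrative.

```c
/* Data actually read from the system by the hardware prefetcher, per the
 * text above: prefetch attempts (Unit Mask bit 1) minus cancelled prefetches
 * (bit 0), times 64 bytes per cache line.  Inputs are raw counter values. */
static unsigned long long prefetched_bytes(unsigned long long attempts,
                                           unsigned long long cancelled)
{
    unsigned long long lines = (attempts > cancelled) ? (attempts - cancelled) : 0;
    return lines * 64ULL;
}
```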
Abbreviation: Sys read resp
The number of responses from the system for cache refill requests. The UNIT_MASK may be used to select specific cache coherency states. Each increment represents one 64-byte cache line transferred from the system (DRAM or another cache, including another core on the same node) to the data cache, instruction cache or L2 cache (for data prefetcher and TLB table walks). Modified-state responses may be for Dcache store miss refills, PrefetchW software prefetches, hardware prefetches for a store-miss stream, or Change-to-Dirty requests that get a dirty (Owned) probe hit in another cache. Exclusive responses may be for any Icache refill, Dcache load miss refill, other software prefetches, hardware prefetches for a load-miss stream, or TLB table walks that miss in the L2 cache; Shared responses may be for any of those that hit a clean line in another cache.
Value | Unit mask description |
0 | Exclusive |
1 | Modified |
2 | Shared |
Abbreviation: Quad written to sys
The number of quadword (8-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response, each of which would cause eight increments; or a partial or complete Write Combining buffer flush (Sized Write), which could cause from one to eight increments.
Value | Unit mask description |
0 | Quadword write transfer |
Abbreviation: L2 requests
The number of requests to the L2 cache for Icache or Dcache fills, or page table lookups for the TLB. These events reflect only read requests to the L2; writes to the L2 are indicated by event 7Fh. These include some amount of retries associated with address or resource conflicts. Such retries tend to occur more as the L2 gets busier, and in certain extreme cases (such as large block moves that overflow the L2) these extra requests can dominate the event count.
These extra requests are not a direct indication of performance impact - they simply reflect opportunistic accesses that don't complete. Because of this, however, these counts are not a good indication of actual cache line movement. The Icache and Dcache miss and refill events (81h, 82h, 83h, 41h, 42h, 43h) provide a more accurate indication of this, and are the preferred way to measure such traffic.
Value | Unit mask description |
0 | IC fill |
1 | DC fill |
2 | TLB fill (page table walks) |
3 | Tag snoop request |
4 | Cancelled request |
Abbreviation: L2 misses
The number of requests that miss in the L2 cache. This may include some amount of speculative activity, as well as some amount of retried requests as described in event 7Dh. The IC-fill-miss and DC-fill-miss events tend to mirror the Icache and Dcache refill-from-system events (83h and 43h, respectively), and tend to include more speculative activity than those events.
Value | Unit mask description |
0 | IC fill |
1 | DC fill (includes possible replays) |
2 | TLB page table walk |
Abbreviation: L2 fill/write
The number of lines written into the L2 cache due to victim writebacks from the Icache or Dcache, TLB page table walks and the hardware data prefetcher (UNIT_MASK bit 0); or writebacks of dirty lines from the L2 to the system (UNIT_MASK bit 1). Each increment represents a 64-byte cache line transfer.
Note: Victim writebacks from the Dcache may be measured separately using event 44h. However this is not quite the same as the Dcache component of event 7Fh, the main difference being PrefetchNTA lines. When these are evicted from the Dcache due to replacement, they are written out to system memory (if dirty) or simply invalidated (if clean), rather than being moved to the L2 cache.
Value | Unit mask description |
0 | L2 fills |
1 | L2 writebacks to system |
Abbreviation: IC fetches
The number of instruction cache accesses by the instruction fetcher. Each access is an aligned 16-byte read, from which a varying number of instructions may be decoded.
Abbreviation: IC misses
The number of instruction fetches that miss in the instruction cache. This is typically equal to or very close to the sum of events 82h and 83h. Each miss results in a 64-byte cache line refill.
Abbreviation: IC refills from L2
The number of instruction cache refills satisfied from the L2 cache. Each increment represents one 64-byte cache line transfer.
Abbreviation: IC refills from sys
The number of instruction cache refills from system memory (or another cache). Each increment represents one 64-byte cache line transfer.
Abbreviation: ITLB L1M L2H
The number of instruction fetches that miss in the L1 ITLB but hit in the L2 ITLB.
Abbreviation: ITLB L1M L2M
The number of instruction fetches that miss in both the L1 and L2 ITLBs.
Abbreviation: Restart i-stream probe
The number of pipeline restarts caused by invalidating probes that hit on the instruction stream currently being executed. This would happen if the active instruction stream were being modified by another processor in an MP system - a highly unlikely event in typical code.
Abbreviation: Inst fetch stall
The number of cycles the instruction fetcher is stalled. This may be for a variety of reasons such as branch predictor updates, unconditional branch bubbles, far jumps and cache misses, among others. May be overlapped by instruction dispatch stalls or instruction execution, such that these stalls don't necessarily impact performance.
Abbreviation: RET stack hits
The number of near return instructions (RET or RET Iw) that get their return address from the return address stack (i.e., where the stack has not gone empty). This may include cases where the address is incorrect (return mispredicts). This may also include speculatively executed false-path returns. Return mispredicts are typically caused by the return address stack underflowing, however they may also be caused by an imbalance in calls vs. returns, such as doing a call but then popping the return address off the stack.
Note: This event cannot be reliably compared with events C9h and CAh (such as to calculate percentage of return mispredicts due to an empty return address stack), since it may include speculatively executed false-path returns that are not included in those retire-time events.
Abbreviation: RET stack overflows
The number of (near) call instructions that cause the return address stack to overflow. When this happens, the oldest entry is discarded. This count may include speculatively executed calls.
Abbreviation: Ret CLFLUSH inst
The number of CLFLUSH instructions retired.
Abbreviation: Ret CPUID inst
The number of CPUID instructions retired.
Abbreviation: CPU clocks
The number of clocks during which the CPU is not in a halted state (halted due to STPCLK or the HLT instruction). Note: this event allows system idle time to be automatically factored out of IPC (or CPI) measurements, provided the OS halts the CPU when going idle. If the OS spins in an idle loop rather than halting, such calculations will be influenced by the IPC of the idle loop.
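A hedged sketch of the idle-adjusted IPC calculation described in the note, using counts from this event and the Ret inst event below; the function name is illustrative and counter access over a common interval is not shown.

```c
/* Idle-adjusted IPC, per the note above: retired instructions (Ret inst)
 * divided by clocks during which the CPU was not halted (CPU clocks).
 * Both counts must cover the same measurement interval. */
static double ipc_excluding_idle(unsigned long long retired_instructions,
                                 unsigned long long unhalted_clocks)
{
    return unhalted_clocks ? (double)retired_instructions / (double)unhalted_clocks
                           : 0.0;
}
```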
Abbreviation: Ret inst
The number of instructions retired (execution completed and architectural state updated). This count includes exceptions and interrupts - each exception or interrupt is counted as one instruction.
Abbreviation: Ret uops
The number of micro-ops retired. This includes all processor activity (instructions, exceptions, interrupts, microcode assists, etc.).
Abbreviation: Ret branch
The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret misp branch
The number of branch instructions retired, of any type, that were not correctly predicted. This includes those for which prediction is not attempted (far control transfers, exceptions and interrupts).
Abbreviation: Ret taken branch
The number of taken branches that were retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret taken branch misp
The number of retired taken branch instructions that were mispredicted.
Abbreviation: Ret far xfers
The number of far control transfers retired including far call/jump/return, IRET, SYSCALL and SYSRET, plus exceptions and interrupts. Far control transfers are not subject to branch prediction.
Abbreviation: Ret branch resyncs
The number of resync branches. These reflect pipeline restarts due to certain microcode assists and events such as writes to the active instruction stream, among other things. Each occurrence reflects a restart penalty similar to a branch mispredict. Relatively rare.
Abbreviation: Ret near RET
The number of near return instructions (RET or RET Iw) retired.
Abbreviation: Ret near RET misp
The number of near returns retired that were not correctly predicted by the return address predictor. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction.
Abbreviation: Ret ind branch misp
The number of indirect branch instructions retired where the target address was not correctly predicted.
Abbreviation: Ret MMX/FP inst
The number of MMX®, SSE or x87 instructions retired. The UNIT_MASK allows the selection of the individual classes of instructions as given in the table. Each increment represents one complete instruction.
Note: Since this event includes non-numeric instructions it is not suitable for measuring MFLOPS.
Value | Unit mask description |
0 | x87 instructions |
1 | MMX and 3DNow! instructions |
2 | Packed SSE and SSE2 instructions |
3 | Scalar SSE and SSE2 instructions |
Abbreviation: Ret fastpath double op
Value | Unit mask description |
0 | With low op in position 0 |
1 | With low op in position 1 |
2 | With low op in position 2 |
Abbreviation: Int-masked cycles
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0). Using edge-counting with this event will give the number of times IF is cleared; dividing the cycle-count value by this value gives the average length of time that interrupts are disabled on each instance. Compare the edge count with event CFh to determine how often interrupts are disabled for interrupt handling vs. other reasons (e.g. critical sections).
Abbreviation: Int-masked pending
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0) and an interrupt is pending. Using edge-counting with this event and comparing the resulting count with the edge count for event CDh gives the proportion of interrupts for which handling is delayed due to prior interrupts being serviced, critical sections, etc. The cycle count value gives the total amount of time for such delays. The cycle count divided by the edge count gives the average length of each such delay.
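The ratios described for this event and the previous one can be computed as in the following sketch. It assumes each event has been measured both in cycle-counting and edge-counting mode over the same interval; the values in main() are placeholders.

```c
#include <stdio.h>

/* Interrupt-masking metrics, per the two event descriptions above.
 * "masked_*" refers to the Int-masked cycles event, "pending_*" to the
 * Int-masked pending event; each is assumed to have been measured once
 * counting cycles and once in edge-counting mode. */
static void int_mask_metrics(unsigned long long masked_cycles,
                             unsigned long long masked_edges,
                             unsigned long long pending_cycles,
                             unsigned long long pending_edges)
{
    if (masked_edges) {
        printf("avg cycles per interrupt-disabled interval: %.1f\n",
               (double)masked_cycles / masked_edges);
        printf("proportion of intervals with a delayed interrupt: %.2f%%\n",
               100.0 * (double)pending_edges / masked_edges);
    }
    if (pending_edges)
        printf("avg delay while an interrupt was pending: %.1f cycles\n",
               (double)pending_cycles / pending_edges);
}

int main(void)
{
    int_mask_metrics(500000ULL, 2000ULL, 40000ULL, 150ULL); /* placeholder counts */
    return 0;
}
```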
Abbreviation: Int taken
The number of hardware interrupts taken. This does not include software interrupts (INT n instruction).
Abbreviation: Decoder empty
The number of processor cycles where the decoder has nothing to dispatch (typically waiting on an instruction fetch that missed the Icache, or for the target fetch after a branch mispredict).
Abbreviation: Dispatch stalls
The number of processor cycles where the decoder is stalled for any reason (has one or more instructions ready but can't dispatch them due to resource limitations in execution). This is the combined effect of events D2h - DAh, some of which may overlap; this event reflects the net stall cycles. The more common stall conditions (events D5h, D6h, D7h, D8h, and to a lesser extent D2h) may overlap considerably. The occurrence of these stalls is highly dependent on the nature of the code being executed (instruction mix, memory reference patterns, etc.).
Abbreviation: Stall branch abort
The number of processor cycles the decoder is stalled waiting for the pipe to drain after a mispredicted branch. This stall occurs if the corrected target instruction reaches the dispatch stage before the pipe has emptied. See also event D1h.
Abbreviation: Stall serialization
The number of processor cycles the decoder is stalled due to a serializing operation, which waits for the execution pipeline to drain. Relatively rare; mainly associated with system instructions. See also event D1h.
Abbreviation: Stall seg load
The number of processor cycles the decoder is stalled due to a segment load instruction being encountered while execution of a previous segment load operation is still pending. Relatively rare except in 16-bit code. See also event D1h.
Abbreviation: Stall reorder full
The number of processor cycles the decoder is stalled because the reorder buffer is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall res station full
The number of processor cycles the decoder is stalled because a required integer unit reservation station is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall FPU full
The number of processor cycles the decoder is stalled because the scheduler for the Floating Point Unit is full. This condition can be caused by a lack of parallelism in FP-intensive code, or by cache misses on FP operand loads (which could also show up as event D8h instead, depending on the nature of the instruction sequences). May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall LS full
The number of processor cycles the decoder is stalled because the Load/Store Unit is full. This generally occurs due to heavy cache miss activity. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall waiting quiet
The number of processor cycles the decoder is stalled waiting for all outstanding requests to the system to be resolved. Relatively rare; associated with certain system instructions and types of interrupts. May partially overlap certain other stall conditions; see event D1h.
Abbreviation: Stall far/resync
The number of processor cycles the decoder is stalled waiting for the execution pipeline to drain before dispatching the target instructions of a far control transfer or a Resync (an instruction stream restart associated with certain microcode assists). Relatively rare; does not overlap with other stall conditions. See also event D1h.
Abbreviation: FPU except
The number of floating point unit exceptions for microcode assists. The UNIT_MASK may be used to isolate specific types of exceptions.
Value | Unit mask description |
0 | x87 reclass microfaults |
1 | SSE retype microfaults |
2 | SSE reclass microfaults |
3 | SSE and x87 microtraps |
Abbreviation: DR0 matches
The number of matches on the address in breakpoint register DR0, per the breakpoint type specified in DR7. The breakpoint does not have to be enabled. Each instruction breakpoint match incurs an overhead of about 120 cycles; load/store breakpoint matches do not incur any overhead.
Abbreviation: DR1 matches
The number of matches on the address in breakpoint register DR1. See notes for event DCh.
Abbreviation: DR2 matches
The number of matches on the address in breakpoint register DR2. See notes for event DCh.
Abbreviation: DR3 matches
The number of matches on the address in breakpoint register DR3. See notes for event DCh.
Abbreviation: DRAM accesses
The number of memory accesses performed by the local DRAM controller. The UNIT_MASK may be used to isolate the different DRAM page access cases. Page miss cases incur an extra latency to open a page; page conflict cases incur both page-close and page-open penalties. These penalties may be overlapped by DRAM accesses for other requests and don't necessarily represent lost DRAM bandwidth. The associated penalties, which are used in the overhead sketch following the Unit Mask table below, are as follows:
Page miss: Trcd (DRAM RAS-to-CAS delay)
Page conflict: Trp + Trcd (DRAM row-precharge time plus RAS-to-CAS delay)
Each DRAM access represents one 64-byte block of data transferred if the DRAM is configured for 64-byte granularity, or one 32-byte block if the DRAM is configured for 32-byte granularity. (The latter is only applicable to single-channel DRAM systems, which may be configured either way.)
Value | Unit mask description |
0 | Page hit |
1 | Page miss |
2 | Page conflict |
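The overhead sketch referenced above: a rough, upper-bound estimate of DRAM page-management overhead built from the penalties listed for this event. The function name is illustrative and the Trcd/Trp values must come from the actual DIMM timings.

```c
/* Rough upper-bound estimate of DRAM page-management overhead, per the
 * penalties above: each page miss costs Trcd and each page conflict costs
 * Trp + Trcd, in memory clock cycles.  The result ignores overlap with other
 * requests, so it overstates the real impact. */
static unsigned long long dram_page_overhead_memclks(unsigned long long page_misses,
                                                     unsigned long long page_conflicts,
                                                     unsigned trcd,
                                                     unsigned trp)
{
    return page_misses * (unsigned long long)trcd
         + page_conflicts * (unsigned long long)(trp + trcd);
}
```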
Abbreviation: Page table overflows
The number of page table overflows in the local DRAM controller. This table maintains information about which DRAM pages are open. An overflow occurs when a request for a new page arrives when the maximum number of pages are already open. Each occurrence reflects an access latency penalty equivalent to a page conflict.
Abbreviation: Turnarounds
The number of turnarounds on the local DRAM data bus. The UNIT_MASK may be used to isolate the different cases. These represent lost DRAM bandwidth, which may be calculated as follows (in bytes per occurrence):
DIMM turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 2
Read-to-write turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 1
Write-to-read turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * (Tcl-1)
where DRAM_width_in_bytes is 8 or 16 (for single- or dual-channel systems), and Tcl is the CAS latency of the DRAM in memory system clock cycles (the memory clock for DDR-400, or PC3200 DIMMs, for example, is 200 MHz). A sketch of this calculation follows the Unit Mask table below.
Value | Unit mask description |
0 | DIMM (chip select) turnaround |
1 | Read to write turnaround |
2 | Write to read turnaround |
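The sketch referenced above: lost-bandwidth arithmetic for the three turnaround cases. The assignment of the 1 and (Tcl-1) factors to the read-to-write and write-to-read cases follows the listing order above; the function name is illustrative.

```c
/* Lost DRAM bandwidth due to data-bus turnarounds, in bytes, per the formulas
 * above.  width_bytes is 8 or 16 (single- or dual-channel DRAM) and tcl is
 * the DRAM CAS latency in memory clock cycles; counts come from the three
 * sub-events of this event. */
static unsigned long long turnaround_bytes_lost(unsigned long long dimm_turnarounds,
                                                unsigned long long read_to_write,
                                                unsigned long long write_to_read,
                                                unsigned width_bytes,
                                                unsigned tcl)
{
    const unsigned edges_per_memclk = 2;            /* DDR: two data edges per memory clock */
    unsigned per_memclk = width_bytes * edges_per_memclk;
    return dimm_turnarounds * (unsigned long long)(per_memclk * 2)
         + read_to_write    * (unsigned long long)(per_memclk * 1)
         + write_to_read    * (unsigned long long)(per_memclk * (tcl - 1));
}
```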
Abbreviation: Bypass ctr sat
Value | Unit mask description |
0 | Memory controller high priority bypass |
1 | Memory controller low priority bypass |
2 | DRAM controller interface bypass |
3 | DRAM controller queue bypass |
Abbreviation: Sized blocks
This event covers two unrelated sets of measurements: DRAM request cancellation activity, and sized read/write block sizes.
Cancel activity: The number of MemCancel requests received by the local memory controller due to dirty probe hits. When a probe for a cache refill or DMA read hits a dirty line, the cache will provide the data, and a MemCancel request is sent to inhibit the return of the stale cache line from memory, conserving system bandwidth. These events reflect the total MemCancel requests seen by the memory controller (UNIT_MASK bit 0), and those requests which actually arrive in time to inhibit the stale data transfer (bit 1). These mask bits should be used separately - the combined count is not particularly meaningful.
Note: Successful cancels may or may not inhibit the DRAM access, depending on whether they arrive soon enough. Such accesses are reflected in event E0h, DRAM Accesses, but it is not possible to isolate this particular component (DRAM accesses that ideally would have been prevented). The upper bound would be the number of cancel requests seen, however this is typically far too pessimistic to be a useful approximation.
Sized Read/Write activity: The Sized Read/Write events reflect 32- or 64-byte transfers (as opposed to other sizes which could be anywhere between 1 and 64 bytes), from either the processor or the Hostbridge (on any node in an MP system). Such accesses from the processor would be due only to write combining buffer flushes, where 32-byte accesses would reflect flushes of partially-filled buffers. Event 65h provides a count of sized write requests associated with WC buffer flushes; comparing that with counts for these events (providing there is very little Hostbridge activity at the same time) will give an indication of how efficiently the write combining buffers are being used. Event 65h may also be useful in factoring out WC flushes when comparing these events with the Upstream Requests component of event ECh.
Value | Unit mask description |
2 | 32-byte sized writes (Rev D and later) |
3 | 64-byte sized writes (Rev D and later) |
4 | 32-byte sized reads (Rev D and later) |
5 | 64-byte sized reads (Rev D and later) |
Abbreviation: Thermal/ECC errors
Value | Unit mask description |
0 | Clocks CPU is active when HTC is active |
1 | Clocks CPU clock is inactive when HTC is active |
2 | Clocks when die temperature is higher than the software high temperature threshold |
3 | Clocks when high temperature threshold was exceeded |
7 | Correctable and uncorrectable DRAM ECC errors (Rev E) |
Abbreviation: CPU/IO req mem/IO
These events reflect request flow between units and nodes, as selected by the UNIT_MASK. The UNIT_MASK is divided into two fields: request type (CPU or I/O access to I/O or Memory) and source/target location (local vs. remote). One or more request types must be enabled via bits 3:0, and at least one source and one target location must be selected via bits 7:4. Each event reflects a request of the selected type(s) going from the selected source(s) to the selected target(s).
Not all possible paths are supported. The following table shows the UNIT_MASK values that are valid for each request type. Any of the mask values shown may be logically ORed to combine the events; a sketch of this combination follows the Unit Mask table below. For instance, local CPU requests to both local and remote nodes would be A8h | 98h = B8h. Any CPU to any I/O would be A4h | 94h | 64h = F4h (but remote CPU to remote I/O requests would not be included).
Note: It is not possible to tell from these events how much data is going in which direction, as there is no distinction between reads and writes. Also, particularly for I/O, the requests may be for varying amounts of data, anywhere from one to sixty-four bytes. Event E5h provides an indication of 32- and 64-byte read and write transfers for such requests (although from the target point of view). For a direct measure of the amount and direction of data flowing between nodes, use events F6h, F7h and F8h.
Value | Unit mask description |
0 | I/O to I/O |
1 | I/O to memory |
2 | CPU to I/O |
3 | CPU to memory |
4 | To remote node |
5 | To local node |
6 | From remote node |
7 | From local node |
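The combination sketch referenced above, reproducing the OR examples from the text with named bit constants; the constant names are illustrative.

```c
#include <stdio.h>

/* Composing Unit Mask values for this event, following the OR-combination
 * examples in the text above.  The bit positions mirror the Unit Mask table. */
enum {
    REQ_IO_TO_IO     = 1u << 0,
    REQ_IO_TO_MEM    = 1u << 1,
    REQ_CPU_TO_IO    = 1u << 2,
    REQ_CPU_TO_MEM   = 1u << 3,
    TO_REMOTE_NODE   = 1u << 4,
    TO_LOCAL_NODE    = 1u << 5,
    FROM_REMOTE_NODE = 1u << 6,
    FROM_LOCAL_NODE  = 1u << 7
};

int main(void)
{
    /* Local CPU requests to local memory (A8h) and to remote memory (98h). */
    unsigned local_to_local  = FROM_LOCAL_NODE | TO_LOCAL_NODE  | REQ_CPU_TO_MEM; /* 0xA8 */
    unsigned local_to_remote = FROM_LOCAL_NODE | TO_REMOTE_NODE | REQ_CPU_TO_MEM; /* 0x98 */
    printf("local CPU to local+remote memory: 0x%02X\n",
           local_to_local | local_to_remote);                                     /* 0xB8 */
    return 0;
}
```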
Abbreviation: Cache block cmd
The number of requests made to the system for cache line transfers or coherency state changes, by request type. Each increment represents one cache line transfer, except for Change-to-Dirty. If a Change-to-Dirty request hits on a line in another processor's cache that's in the Owned state, it will cause a cache line transfer, otherwise there is no data transfer associated with Change-to-Dirty requests.
Value | Unit mask description |
0 | Victim block (writeback) |
1 | Read block (Dcache load miss refill) |
2 | Read block Shared (ICache refill) |
3 | Read block Modified (DCache store miss refill) |
4 | Change to Dirty |
Abbreviation: Sized cmd
The number of Sized Read/Write commands handled by the System Request Interface (local processor and hostbridge interface to the system). These commands may originate from the processor or hostbridge. Typical uses of the various Sized Read/Write commands are given in the UNIT_MASK table. See also event E5h, which covers commonly-used block sizes for these requests, and event ECh, which provides a separate measure of Hostbridge accesses.
Value | Unit mask description |
0 | Non-posted SzWr byte (1-32 bytes) |
1 | Non-posted SzWr DWORD (1-16 DWORDs) |
2 | Posted SzWr byte (1-32 bytes) |
3 | Posted SzWr DWORD (1-16 DWORDs) |
4 | SzRd byte (4 bytes) |
5 | SzRd DWORD (1-16 DWORDs) |
6 | RdModWr |
Abbreviation: Probe resp/up req
This covers two unrelated sets of events: cache probe results, and requests received by the Hostbridge from devices on non-coherent links.
Probe results: These events reflect the results of probes sent from a memory controller to local caches. They provide an indication of the degree to which data and code are shared between processors (or moved between processors due to process migration). The dirty-hit events indicate the transfer of a 64-byte cache line to the requestor (for a read or cache refill) or the target memory (for a write). The system bandwidth used by these, in terms of bytes per unit of time, may be calculated as 64 times the event count, divided by the elapsed time. Sized writes to memory that cover a full cache line do not incur this cache line transfer - they simply invalidate the line and are reported as clean hits. Cache line transfers will occur for Change2Dirty requests that hit cache lines in the Owned state. (Such cache lines are counted as Modified-state refills for event 6Ch, System Read Responses.)
Upstream requests: The upstream read and write events reflect requests originating from a device on a local non-coherent HyperTransport™ link. The two read events allow display refresh traffic in a UMA system to be measured separately from other DMA activity. Display refresh traffic will typically be dominated by 64-byte transfers. Non-display-related DMA accesses may be anywhere from 1 to 64 bytes in size, but may be dominated by a particular size such as 32 or 64 bytes, depending on the nature of the devices. Event E5h can provide a measure of 32- and 64-byte accesses by the hostbridge (possibly combined with write combining buffer flush activity from the processor, although that can be factored out via event 65h).
Value | Unit mask description |
0 | Probe miss |
1 | Probe hit clean |
2 | Probe hit dirty without memory cancel |
3 | Probe hit dirty with memory cancel |
4 | Upstream display refresh reads |
5 | Upstream non-display refresh reads |
6 | Upstream writes (Rev D and later) |
Abbreviation: GART events
These events reflect GART activity, and in particular allow the GART TLB miss ratio to be calculated as GART_miss_count divided by GART_aperture_hit_count. GART aperture accesses typically come from I/O devices rather than the processor, and generally from a 3D graphics accelerator, but they can come from other devices when the GART is used as an I/O MMU.
Value | Unit mask description |
0 | GART aperture hit on access from CPU |
1 | GART aperture hit on access from I/O |
2 | GART miss |
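A one-line sketch of the miss-ratio calculation described above. Treating the CPU-originated and I/O-originated aperture hits as a single combined hit count is an assumption of this sketch.

```c
/* GART TLB miss ratio, per the text above: misses divided by aperture hits.
 * The two hit sub-events are combined into one hit count for this sketch. */
static double gart_tlb_miss_ratio(unsigned long long cpu_hits,
                                  unsigned long long io_hits,
                                  unsigned long long misses)
{
    unsigned long long hits = cpu_hits + io_hits;
    return hits ? (double)misses / (double)hits : 0.0;
}
```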
Abbreviation: HT0 bandwidth
The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport links. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by that value plus the Nop count (UNIT_MASK 08h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
Value | Unit mask description |
0 | Command DWORD sent |
1 | Data DWORD sent |
2 | Buffer release DWORD sent |
3 | NOP DWORD sent (idle) |
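A sketch of the utilization and bandwidth calculations described above, applicable to any of the three link events; the function name is illustrative and counter access is not shown.

```c
#include <stdio.h>

/* HyperTransport link utilization and bandwidth, per the text above.  Each
 * count is in dwords (4 bytes): "payload" is the combined Command, Data and
 * Buffer Release count (Unit Mask 07h), "nops" the Nop count (Unit Mask 08h).
 * elapsed_seconds covers the same measurement interval. */
static void ht_link_metrics(unsigned long long payload_dwords,
                            unsigned long long nop_dwords,
                            double elapsed_seconds)
{
    unsigned long long total = payload_dwords + nop_dwords; /* max transmission rate of the link */
    if (total)
        printf("link utilization: %.1f%%\n", 100.0 * (double)payload_dwords / total);
    if (elapsed_seconds > 0.0)
        printf("payload bandwidth: %.1f MB/s\n",
               (payload_dwords * 4.0) / elapsed_seconds / 1e6);
}
```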
Abbreviation: HT1 bandwidth
The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport links. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by that value plus the Nop count (UNIT_MASK 08h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
Value | Unit mask description |
0 | Command DWORD sent |
1 | Data DWORD sent |
2 | Buffer release DWORD sent |
3 | NOP DWORD sent (idle) |
Abbreviation: HT2 bandwidth
The number of dwords transmitted (or unused, in the case of Nops) on the outgoing side of the HyperTransport links. The sum of all four subevents (all four UNIT_MASK bits set) directly reflects the maximum transmission rate of the link. Link utilization may be calculated by dividing the combined Command, Data and Buffer Release count (UNIT_MASK 07h) by that value plus the Nop count (UNIT_MASK 08h). Bandwidth in terms of bytes per unit time for any one component or combination of components is calculated by multiplying the count by four and dividing by elapsed time.
The Data event provides a direct indication of the flow of data around the system. Translating this link-based view into a source/target node based view requires knowledge of the system layout (i.e. which links connect to which nodes).
Value | Unit mask description |
0 | Command DWORD sent |
1 | Data DWORD sent |
2 | Buffer release DWORD sent |
3 | NOP DWORD sent (idle) |