Architecture Instruction Set Extensions Programming Reference PDF
319433-041
OCTOBER 2020
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning
Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter
drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
All product plans and roadmaps are subject to change without notice.
The products described may contain design defects or errors known as errata which may cause the product to deviate from
published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness
for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing,
or usage in trade.
Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available.
These are not “commercial” names and not intended to function as trademarks.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained
by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.
Copyright © 2020, Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries.
*Other names and brands may be claimed as the property of others.
ii Ref. # 319433-041
Revision History
Revision Description Date
CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND FEATURES
1.1 About This Document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 DisplayFamily and DisplayModel for Future Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.3 Instruction Set Extensions and Feature Introduction in Intel® 64 and IA-32 Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.4 Detection of Future Instructions and Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.5 CPUID Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
CPUID — CPU Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.6 Compressed Displacement (disp8*N) Support in EVEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-45
1.7 bfloat16 Floating-Point Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-46
CHAPTER 2
INSTRUCTION SET REFERENCE, A-Z
2.1 Instruction Set Reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
CLUI — Clear User Interrupt Flag. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-2
ENQCMD — Enqueue Command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
ENQCMDS — Enqueue Command Supervisor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-6
HRESET — History Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-8
PCONFIG — Platform Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
SENDUIPI — Send User Interprocessor Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
SERIALIZE — Serialize Instruction Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
STUI — Set User Interrupt Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
TESTUI — Determine User Interrupt Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
UIRET — User-Interrupt Return. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
VCVTNE2PS2BF16 — Convert Two Packed Single Data to One Packed BF16 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
VCVTNEPS2BF16 — Convert Packed Single Data to Packed BF16 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26
VDPBF16PS — Dot Product of BF16 Pairs Accumulated into Packed Single Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28
VP2INTERSECTD/VP2INTERSECTQ — Compute Intersection Between DWORDS/QUADWORDS to a Pair of Mask Registers . . . . . . . . . . . . . . 2-30
VPDPBUSD — Multiply and Add Unsigned and Signed Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32
VPDPBUSDS — Multiply and Add Unsigned and Signed Bytes with Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
VPDPWSSD — Multiply and Add Signed Word Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
VPDPWSSDS — Multiply and Add Signed Word Integers with Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-36
WBNOINVD — Write Back and Do Not Invalidate Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
XRESLDTRK — Resume Tracking Load Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39
XSUSLDTRK — Suspend Tracking Load Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
CHAPTER 3
INTEL® AMX INSTRUCTION SET REFERENCE, A-Z
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.1 Tile Architecture Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-3
3.1.2 TMUL Architecture Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
3.1.3 Handling of Tile Row and Column Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
3.1.4 Exceptions and Interrupts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
3.2 Intel® AMX and the XSAVE Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.2.1 State Components for Intel® AMX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
3.2.2 XSAVE-Related Enumeration for Intel® AMX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
3.2.3 Enabling Intel® AMX As an XSAVE-Enabled Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
3.2.4 Loading of XTILECFG and XTILEDATA by XRSTOR and XRSTORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
3.2.5 Saving of XTILEDATA by XSAVE, XSAVEC, XSAVEOPT, and XSAVES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
3.2.6 Extended Feature Disable (XFD). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
3.3 Recommendations for System Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.4 Implementation Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.5 Helper Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.6 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
3.7 Exception Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
3.8 Instruction Set Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
LDTILECFG — Load Tile Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
STTILECFG — Store Tile Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
TDPBF16PS — Dot Product of BF16 Tiles Accumulated into Packed Single Precision Tile . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD — Dot Product of Signed/Unsigned Bytes with Dword Accumulation . . . . 3-19
TILELOADD/TILELOADDT1 — Load Tile. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
TILERELEASE — Release Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
TILESTORED — Store Tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
TILEZERO — Zero Tile. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
CHAPTER 4
ENQUEUE STORES AND PROCESS ADDRESS SPACE IDENTIFIERS (PASIDS)
4.1 The IA32_PASID MSR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.2 The PASID State Component for the XSAVE Feature Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.3 PASID Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.3.1 PASID Translation Structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.3.2 The PASID Translation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4.3.3 VMX Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
CHAPTER 5
INTEL® TSX SUSPEND LOAD ADDRESS TRACKING
CHAPTER 6
HYPERVISOR-MANAGED LINEAR ADDRESS TRANSLATION
6.1 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.2 VMCS Changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.3 Changes to EPT Paging-Structure Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.3.1 Reservation of a Guest Page Type in EPT Paging Structure Entry for Future Use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.4 Changes to VMX Support for Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.5 Protected Linear Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.6 Hypervisor-Managed Linear Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.6.1 HLAT Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
6.6.2 Operation of HLAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
6.6.3 Format of the HLAT L5E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
6.6.4 Format of the HLAT L4E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
6.6.5 Format of the HLAT L3E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.6.6 Format of the HLAT L2E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.6.7 Format of the HLAT L1E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
6.6.8 HLAT Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.6.9 HLAT Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.6.10 HLAT Interaction with IA and EPT A/D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.6.11 Cached HLAT Derived Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.7 Changes to Guest Physical Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
6.7.1 Paging-Write Interaction with EPT A/D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
6.7.2 IOMMU Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.8 Addition to EPT Violation Exit Qualification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.9 HLAT Interaction with Intel® SGX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.10 HLAT Interaction with Nested VT-x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.11 Changes to VM Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.12 Changes to VMX Capability Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
CHAPTER 7
ARCHITECTURAL LAST BRANCH RECORDS (LBRS)
7.1 Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.1.1 Logged Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.1.2 Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.1.2.1 Enabling and Disabling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.1.2.2 LBR Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.1.2.3 Branch Type Enabling and Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.1.2.4 Call-Stack Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Call-Stack Mode and LBR Freeze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.1.2.5 CPL Filtering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.1.3 Record Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.1.3.1 IP Fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.1.3.2 Branch Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.1.3.3 Cycle Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.1.3.4 Mispredict Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.1.3.5 Intel® TSX Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.1.4 Interaction with Other Processor Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.1.4.1 SMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
SMM Transfer Monitor (STM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.1.4.2 VMX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.1.4.3 Intel® SGX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.1.4.4 Debug Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.1.4.5 SMX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.1.4.6 MWAIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.1.4.7 Precise Event-Based Sampling (PEBS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.2 MSRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.3 Context Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.3.1 XSAVE and LBR Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.3.2 INIT and MOD Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.3.3 Fast LBR Read Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.3.4 XRSTORS Faulting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.4 Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.4.1 CPUID for Architectural LBRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.4.2 CPUID for XSAVE LBR Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.4.3 Enumeration for New VMCS Fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.5 other Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
7.5.1 Branch Trace Store on Intel Atom Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
7.5.2 IA32_DEBUGCTL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
7.5.3 IA32_PERF_CAPABILITIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
CHAPTER 8
NON-WRITE-BACK LOCK DISABLE ARCHITECTURE
8.1 Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2 Enabling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.3 Interaction with Intel® Software Guard Extensions (Intel® SGX) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
8.4 Interaction with VMX Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
8.5 Expected Software Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
8.6 Bus Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
CHAPTER 9
BUS LOCK AND VM NOTIFY
9.1 Bus Lock Debug Exception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.1.1 Bus Lock VM Exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2 Notify VM Exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
CHAPTER 10
INTEL® RESOURCE DIRECTOR TECHNOLOGY FEATURE UPDATES
10.1 Intel® RDT Feature Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.1.1 Intel® RDT on Processors Based on Ice Lake Server Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.1.2 Intel® RDT on Intel Atom® Processors Based on Tremont Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.1.3 Intel® RDT in Processors Based on Sapphire Rapids Server Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.2 Enumerable Memory Bandwidth Monitoring Counter Width. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.2.1 Memory Bandwidth Monitoring (MBM) Enabling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.2.2 Augmented MBM Enumeration and MSR Interfaces for Extensible Counter Width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3 Second Generation Memory Bandwidth Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3.1 MBA 2.0 Advantages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-3
10.3.2 MBA 2.0 Software-Visible Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
10.4 Third Generation Memory Bandwidth Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
10.4.1 MBA 3.0 Hardware Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
10.4.2 MBA 3.0 Software-Visible Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
CHAPTER 11
USER INTERRUPTS
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
11.2 Enumeration and Enabling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
11.3 User-Interrupt State and User-Interrupt MSRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
11.3.1 User-Interrupt State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
11.3.2 User-Interrupt MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
11.4 Evaluation and Delivery of User Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
11.4.1 User-Interrupt Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
11.4.2 User-Interrupt Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
11.5 User-Interrupt Notification Identification and Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
11.5.1 User-Interrupt Notification Identification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
11.5.2 User-Interrupt Notification Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-6
11.6 New Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
11.7 User IPIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
11.8 Legacy Instruction Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
11.8.1 Support by RDMSR and WRMSR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
11.8.2 Support by the XSAVE Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
11.8.2.1 User-Interrupt State Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
11.8.2.2 XSAVE-Related Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-9
11.8.2.3 XSAVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-9
11.8.2.4 XRSTORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-10
11.9 VMX Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.9.1 VMCS Changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-10
11.9.2 Changes to VMX Non-Root Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-10
11.9.2.1 Treatment of Ordinary Interrupts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-11
11.9.2.2 Treatment of Virtual Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-11
11.9.2.3 VM Exits Incident to New Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-12
11.9.2.4 Access to the User-Interrupt MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-12
11.9.2.5 Operation of SENDUIPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-12
11.9.3 Changes to VM Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.9.3.1 Checks on the Guest-State Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.9.3.2 Loading MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.9.3.3 Event Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.9.3.4 User-Interrupt Recognition After VM Entry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4 Changes to VM Exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.1 Recording VM-Exit Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.2 Saving Guest State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.3 Saving MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.4 Loading Host State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.5 Loading MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.4.6 User-Interrupt Recognition After VM Exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.9.5 Changes to VMX Capability Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-15
CHAPTER 12
PERFORMANCE MONITORING UPDATES
12.1 Performance Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
12.2 Processor Event Based Sampling (PEBS) Facility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2.1 Instruction-Accurate PDIR (PDIR++). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2.2 Precise Distribution (PDist) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2.3 Load Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
12.2.4 Store Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
12.3 Adaptive PEBS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
12.3.1 Memory Access Info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
CHAPTER 13
ENHANCED HARDWARE FEEDBACK INTERFACE (EHFI)
13.1 Enhanced Hardware Feedback Interface Intended Usage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.2 Hardware Feedback Interface Pointer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
13.3 Hardware Feedback Interface Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
13.4 Hardware Feedback Interface Notifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
13.5 Hardware Feedback interface Structure Dynamic update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.6 Logical Processor Scope Enhanced Hardware Feedback Interface Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.7 Implicit Reset of Package and Logical Processor Scope Configuration MSRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.8 Logical Processor Scope Enhanced Hardware Feedback Interface Run Time Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.9 Enhanced Hardware Feedback Interface Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
13.10 Logical Processor Scope History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
13.10.1 Hardware History Reset Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
13.10.2 Enabling Enhanced Hardware Feedback Interface History Reset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
13.10.3 Implicit Enhanced Hardware Feedback Interface History Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND
FEATURES
Table 1-2. Recent Instruction Set Extensions / Features Introduction in Intel® 64 and IA-32 Processors (see note 1)
Instruction Set Architecture / Feature                        Introduction
PCONFIG, WBNOINVD                                             Ice Lake Server
Intel® MKTME                                                  Ice Lake Server
ENCLV                                                         Tremont, Ice Lake Server
Direct stores: MOVDIRI, MOVDIR64B                             Tremont, Tiger Lake, Sapphire Rapids
AVX512_BF16                                                   Cooper Lake, Sapphire Rapids
CET: Control-flow Enforcement Technology                      Tiger Lake, Sapphire Rapids
AVX512_VP2INTERSECT                                           Tiger Lake, Sapphire Rapids
Enqueue Stores: ENQCMD and ENQCMDS                            Sapphire Rapids
CLDEMOTE                                                      Tremont, Alder Lake, Sapphire Rapids
PTWRITE                                                       Goldmont Plus, Alder Lake, Sapphire Rapids
User Wait: TPAUSE, UMONITOR, UMWAIT                           Tremont, Alder Lake, Sapphire Rapids
Architectural LBRs                                            Alder Lake, Sapphire Rapids
HLAT                                                          Alder Lake, Sapphire Rapids
SERIALIZE                                                     Alder Lake, Sapphire Rapids
Intel® TSX Suspend Load Address Tracking (TSXLDTRK)           Sapphire Rapids
Intel® Advanced Matrix Extensions (Intel® AMX)                Sapphire Rapids
  Includes CPUID Leaf 1EH, “TMUL Information Main Leaf”, and
  CPUID bits AMX-BF16, AMX-TILE, and AMX-INT8.
Key Locker (see note 2)                                       Tiger Lake, Alder Lake
AVX-VNNI                                                      Alder Lake (see note 3), Sapphire Rapids
Enhanced Hardware Feedback Interface (EHFI) and HRESET        Alder Lake
User Interrupts (UINTR)                                       Sapphire Rapids
Intel® Trust Domain Extensions (Intel® TDX) (see note 4)      Future Processors
NOTES:
1. Visit Intel Ark for Intel® product specifications, features and compatibility quick reference guide, and code name decoder.
2. Details on Key Locker can be found here:
https://software.intel.com/content/www/us/en/develop/download/intel-key-locker-specification.html.
3. Alder Lake Intel Hybrid Technology will not support Intel® AVX-512. ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2, and
UMONITOR/UMWAIT/TPAUSE are supported.
4. Details on Intel® Trust Domain Extensions can be found here:
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html.
CPUID—CPU Identification
Opcode   Instruction   64-Bit Mode   Compat/Leg Mode   Description
0F A2    CPUID         Valid         Valid             Returns processor identification and feature
                                                       information to the EAX, EBX, ECX, and EDX
                                                       registers, as determined by input entered in EAX
                                                       (in some cases, ECX as well).
Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can
set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction
operates the same in non-64-bit modes and 64-bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers (see
note 1). The instruction’s output is dependent on the contents of the EAX register upon execution (in some
cases, ECX as well).
For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value
and the Vendor Identification String in the appropriate registers:
MOV EAX, 00H
CPUID
"Caching Translation Information" in Chapter 4, “Paging,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.
1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
2. CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of CPUID leaf 1FH before
using leaf 0BH.
EDX Bits 03-00: Number of C0* sub C-states supported using MWAIT
Bits 07-04: Number of C1* sub C-states supported using MWAIT
Bits 11-08: Number of C2* sub C-states supported using MWAIT
Bits 15-12: Number of C3* sub C-states supported using MWAIT
Bits 19-16: Number of C4* sub C-states supported using MWAIT
Bits 23-20: Number of C5* sub C-states supported using MWAIT
Bits 27-24: Number of C6* sub C-states supported using MWAIT
Bits 31-28: Number of C7* sub C-states supported using MWAIT
NOTE:
* The definition of C0 through C7 states for MWAIT extension are processor-specific C-states, not
ACPI C-states.
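The eight EDX fields above are consecutive 4-bit counts, so one shift and mask extracts any of them. The following is a minimal sketch in C, assuming the caller has already read EDX from the MONITOR/MWAIT leaf; the helper name `mwait_sub_cstates` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: number of Cn* sub C-states supported using MWAIT,
 * read from the 4-bit field EDX[4n+3 : 4n]. Valid n is 0..7. */
static unsigned mwait_sub_cstates(uint32_t edx, unsigned n)
{
    if (n > 7)
        return 0;                       /* no such field in EDX */
    return (edx >> (4 * n)) & 0xFu;
}
```

For an EDX value of 00001120H, the sketch reports two C1* sub C-states and one each for C2* and C3*.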
*** The value of the “level type” field is not related to level numbers in any way, higher “level type”
values do not mean higher levels. Level type field has the following encoding:
0: invalid
1: SMT
2: Core
3-255: Reserved
Processor Extended State Enumeration Main Leaf (EAX = 0DH, ECX = 0)
0DH NOTES:
Leaf 0DH main leaf (ECX = 0).
EAX Bits 31-00: Reports the valid bit fields of the lower 32 bits of the XFEATURE_ENABLED_MASK register.
If a bit is 0, the corresponding bit field in XCR0 is reserved.
Bit 00: Legacy x87.
Bit 01: 128-bit SSE.
Bit 02: 256-bit AVX.
Bits 04-03: MPX state.
Bits 07-05: AVX-512 state.
Bit 08: Used for IA32_XSS.
Bit 09: PKRU state.
Bits 12-10: Reserved.
Bits 14-13: Used for IA32_XSS.
Bit 15: Reserved.
Bit 16: Used for IA32_XSS.
Bit 17: XTILECFG.
Bit 18: XTILEDATA.
Bits 31-19: Reserved.
EBX Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by
enabled features in XCR0. May be different than ECX if some features at the end of the XSAVE save
area are not enabled.
ECX Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) of the
XSAVE/XRSTOR save area required by all supported features in the processor, i.e., all the valid bit
fields in XCR0.
EDX Bits 31-00: Reports the valid bit fields of the upper 32 bits of the XCR0 register. If a bit is 0, the
corresponding bit field in XCR0 is reserved.
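As an illustration of how the EAX bit map of the main leaf might be consumed, the sketch below tests whether both tile state components (bits 17 and 18) are reported as valid XCR0 bits. It assumes the caller passes the EAX value returned for EAX = 0DH, ECX = 0; the constant and function names are hypothetical.

```c
#include <stdint.h>

/* Bit positions in CPUID.(EAX=0DH,ECX=0):EAX, as listed in the table. */
enum {
    XSTATE_X87       = 1 << 0,
    XSTATE_SSE       = 1 << 1,
    XSTATE_AVX       = 1 << 2,
    XSTATE_XTILECFG  = 1 << 17,
    XSTATE_XTILEDATA = 1 << 18
};

/* Hypothetical helper: nonzero if XCR0 may enable both tile state
 * components (XTILECFG and XTILEDATA). */
static int xcr0_allows_tile_state(uint32_t leaf0d_eax)
{
    const uint32_t need = (uint32_t)XSTATE_XTILECFG | (uint32_t)XSTATE_XTILEDATA;
    return (leaf0d_eax & need) == need;
}
```

A set bit only means the corresponding XCR0 bit is not reserved; whether the operating system has actually enabled the state is a separate question that this sketch does not answer.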
Processor Extended State Enumeration Sub-leaf (EAX = 0DH, ECX = 1)
0DH EAX Bit 00: XSAVEOPT is available.
Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set.
Bit 02: Supports XGETBV with ECX = 1 if set.
Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set.
Bit 04: Supports Extended Feature Disable (XFD) if set.
Bits 31-05: Reserved.
EBX Bits 31-00: The size in bytes of the XSAVE area containing all states enabled by XCR0 | IA32_XSS.
EAX Bits 04 - 00: Length of the capacity bit mask for the corresponding ResID using minus-one notation.
Bits 31 - 05: Reserved.
EBX Bits 31 - 00: Bit-granular map of isolation/contention of allocation units.
ECX Bits 31 - 00: Reserved.
EDX Bits 15 - 00: Highest COS number supported for this ResID.
Bits 31 - 16: Reserved.
Memory Bandwidth Allocation Enumeration Sub-leaf (EAX = 10H, ECX = ResID = 3)
10H NOTES:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 11 - 00: Reports the maximum MBA throttling value supported for the corresponding ResID
using minus-one notation.
Bits 31 - 12: Reserved.
EBX Bits 31 - 00: Reserved.
ECX Bit 00: Per-thread MBA controls are supported.
Bit 01: Reserved.
Bit 02: Reports whether the response of the delay values is linear.
Bits 31 - 03: Reserved.
EDX Bits 15 - 00: Highest COS number supported for this ResID.
Bits 31 - 16: Reserved.
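The memory bandwidth allocation sub-leaf packs its fields into three registers, one of them in minus-one notation. A minimal C sketch of the decoding follows, assuming the register values were already obtained with EAX = 10H, ECX = 3; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the fields of the MBA enumeration sub-leaf. */
struct mba_info {
    unsigned max_throttle;  /* EAX[11:0] + 1, undoing the minus-one notation */
    int      per_thread;    /* ECX bit 0: per-thread MBA controls supported  */
    int      linear;        /* ECX bit 2: delay value response is linear     */
    unsigned highest_cos;   /* EDX[15:0]: highest COS number for this ResID  */
};

static struct mba_info decode_mba(uint32_t eax, uint32_t ecx, uint32_t edx)
{
    struct mba_info info;
    info.max_throttle = (eax & 0xFFFu) + 1u;
    info.per_thread   = (int)(ecx & 1u);
    info.linear       = (int)((ecx >> 2) & 1u);
    info.highest_cos  = edx & 0xFFFFu;
    return info;
}
```

With EAX = 59H the sketch reports a maximum throttling value of 90, because the register holds the value minus one.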
Intel® Software Guard Extensions (Intel® SGX) Capability Enumeration Leaf, sub-leaf 0 (EAX = 12H, ECX = 0)
12H NOTES:
Leaf 12H sub-leaf 0 (ECX = 0) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
EAX Bit 00: SGX1. If 1, Indicates Intel SGX supports the collection of SGX1 leaf functions.
Bit 01: SGX2. If 1, Indicates Intel SGX supports the collection of SGX2 leaf functions.
Bits 04-02: Reserved.
Bit 05: If 1, indicates Intel SGX supports ENCLV instruction leaves EINCVIRTCHILD, EDECVIRTCHILD,
and ESETCONTEXT.
Bit 06: If 1, indicates Intel SGX supports ENCLS instruction leaves ETRACKC, ERDINFO, ELDBC, and
ELDUC.
Bits 31-07: Reserved.
EBX Bits 31-00: MISCSELECT. Bit vector of supported extended Intel SGX features.
ECX Bits 31-00: Reserved.
EDX Bits 07-00: MaxEnclaveSize_Not64. The maximum supported enclave size in non-64-bit mode is
2^(EDX[7:0]).
Bits 15-08: MaxEnclaveSize_64. The maximum supported enclave size in 64-bit mode is
2^(EDX[15:8]).
Bits 31-16: Reserved.
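The two enclave-size fields are exponents rather than sizes. The sketch below converts them to byte counts, assuming EDX comes from leaf 12H sub-leaf 0; the function names are hypothetical, and returning 0 for an exponent of 64 or more is a choice made for this sketch only.

```c
#include <stdint.h>

/* Returns 2^exp, or 0 when the result would not fit in 64 bits. */
static uint64_t pow2_or_zero(unsigned exp)
{
    return exp < 64 ? (uint64_t)1 << exp : 0;
}

/* MaxEnclaveSize_Not64: 2^(EDX[7:0]) bytes. */
static uint64_t sgx_max_enclave_size_not64(uint32_t edx)
{
    return pow2_or_zero(edx & 0xFFu);
}

/* MaxEnclaveSize_64: 2^(EDX[15:8]) bytes. */
static uint64_t sgx_max_enclave_size_64(uint32_t edx)
{
    return pow2_or_zero((edx >> 8) & 0xFFu);
}
```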
EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section.
EBX[31:20]: Reserved.
EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor
Reserved Memory.
EDX[31:20]: Reserved.
While a processor may support the Processor Frequency Information leaf, fields that return a value
of zero are not supported.
System-On-Chip Vendor Attribute Enumeration Main Leaf (EAX = 17H, ECX = 0)
17H NOTES:
Leaf 17H main leaf (ECX = 0).
Leaf 17H output depends on the initial value in ECX.
Leaf 17H sub-leaves 1 through 3 report SOC Vendor Brand String.
Leaf 17H is valid if MaxSOCID_Index >= 3.
Leaf 17H sub-leaves 4 and above are reserved.
EAX Bits 31-00: MaxSOCID_Index. Reports the maximum input value of supported sub-leaf in leaf 17H.
EBX Bits 15-00: SOC Vendor ID.
Bit 16: IsVendorScheme. If 1, the SOC Vendor ID field is assigned via an industry standard
enumeration scheme. Otherwise, the SOC Vendor ID field is assigned by Intel.
Bits 31-17: Reserved = 0.
ECX Bits 31-00: Project ID. A unique number an SOC vendor assigns to its SOC projects.
EDX Bits 31-00: Stepping ID. A unique number within an SOC project that an SOC vendor assigns.
System-On-Chip Vendor Attribute Enumeration Sub-leaf (EAX = 17H, ECX = 1..3)
17H EAX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EBX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
ECX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EDX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
NOTES:
Leaf 17H output depends on the initial value in ECX.
SOC Vendor Brand String is a UTF-8 encoded string padded with trailing bytes of 00H.
The complete SOC Vendor Brand String is constructed by concatenating in ascending order of
EAX:EBX:ECX:EDX and from the sub-leaf 1 fragment towards sub-leaf 3.
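The concatenation order described in the notes can be sketched in C as follows. The function name is hypothetical, the twelve register values are assumed to have been collected by the caller from sub-leaves 1 through 3, and the byte order within each register (least-significant byte first) is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* regs holds EAX, EBX, ECX, EDX of sub-leaf 1, then sub-leaf 2, then
 * sub-leaf 3 (12 values). out receives 48 string bytes plus a terminator. */
static void soc_brand_string(const uint32_t regs[12], char out[49])
{
    size_t pos = 0;
    for (size_t r = 0; r < 12; r++) {
        /* Assumed: each register is unpacked low byte first. */
        for (unsigned b = 0; b < 4; b++)
            out[pos++] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    }
    out[pos] = '\0';   /* the string itself is already padded with 00H */
}
```

Because the string is padded with trailing 00H bytes, the result can be treated as an ordinary C string once assembled.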
EAX Bits 31-00: Reports the maximum input value of supported sub-leaf in leaf 18H.
EBX Bit 00: 4K page size entries supported by this structure.
Bit 01: 2MB page size entries supported by this structure.
Bit 02: 4MB page size entries supported by this structure.
Bit 03: 1 GB page size entries supported by this structure.
Bits 07-04: Reserved.
Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).
Bits 15-11: Reserved.
Bits 31-16: W = Ways of associativity.
ECX Bits 31-00: S = Number of Sets.
EDX Bits 04-00: Translation cache type field.
00000b: Null (indicates this sub-leaf is not valid).
00001b: Data TLB.
00010b: Instruction TLB.
00011b: Unified TLB.
00100b: Load Only TLB. Hit on loads; fills on both loads and stores.
00101b: Store Only TLB. Hit on stores; fill on stores.
All other encodings are reserved.
Bits 07-05: Translation cache level (starts at 1).
Bit 08: Fully associative structure.
Bits 13-09: Reserved.
Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation
cache*
Bits 31-26: Reserved.
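A decoder for one such translation-structure sub-leaf might look like the sketch below. It assumes the EBX, ECX, and EDX values come from a sub-leaf of leaf 18H; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the fields described above. */
struct tlb_info {
    unsigned type;         /* EDX[4:0]: 1 data, 2 instruction, 3 unified, ... */
    unsigned level;        /* EDX[7:5]: translation cache level, starts at 1  */
    int      fully_assoc;  /* EDX bit 8                                       */
    unsigned ways;         /* EBX[31:16]: W = ways of associativity           */
    uint32_t sets;         /* ECX[31:0]:  S = number of sets                  */
};

/* Returns 0 when the type field is 00000b (sub-leaf not valid), else 1. */
static int decode_tlb_subleaf(uint32_t ebx, uint32_t ecx, uint32_t edx,
                              struct tlb_info *out)
{
    out->type        = edx & 0x1Fu;
    out->level       = (edx >> 5) & 0x7u;
    out->fully_assoc = (int)((edx >> 8) & 1u);
    out->ways        = ebx >> 16;
    out->sets        = ecx;
    return out->type != 0;
}
```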
*** The value of the “level type” field is not related to level numbers in any way, higher “level type”
values do not mean higher levels. Level type field has the following encoding:
0: Invalid.
1: SMT.
2: Core.
3: Module.
4: Tile.
5: Die.
6-255: Reserved.
Processor History Reset Sub-leaf (EAX = 20H, ECX = 0)
20H EAX Reports the maximum number of sub-leaves that are supported in leaf 20H.
EBX Indicates which bits may be set in the IA32_HRESET_ENABLE MSR to enable enhanced hardware
feedback interface history.
ECX Reserved.
EDX Reserved.
Unimplemented CPUID Leaf Functions
40000000H - 4FFFFFFFH Invalid. No existing or future CPU will return processor identification or feature
information if the initial EAX value is in the range 40000000H to 4FFFFFFFH.
Extended Function CPUID Information
80000000H EAX Maximum Input Value for Extended Function CPUID Information (see Table 1-4).
EBX Reserved
ECX Reserved
EDX Reserved
80000001H EAX Extended Processor Signature and Feature Bits.
EBX Reserved
ECX Bit 00: LAHF/SAHF available in 64-bit mode
Bits 04-01: Reserved
Bit 05: LZCNT available
Bits 07-06: Reserved
Bit 08: PREFETCHW
Bits 31-09: Reserved
EDX Bits 10-00: Reserved
Bit 11: SYSCALL/SYSRET available (when in 64-bit mode)
Bits 19-12: Reserved = 0
Bit 20: Execute Disable Bit available
Bits 25-21: Reserved = 0
Bit 26: 1-GByte pages are available if 1
Bit 27: RDTSCP and IA32_TSC_AUX are available if 1
Bit 28: Reserved = 0
Bit 29: Intel® 64 Architecture available if 1
Bits 31-30: Reserved = 0
** CPUID leaf 04H provides details of deterministic cache parameters, including the L2 cache in sub-leaf 2.
80000007H EAX Reserved = 0
EBX Reserved = 0
ECX Reserved = 0
EDX Bits 07-00: Reserved = 0
Bit 08: Invariant TSC available if 1
Bits 31-09: Reserved = 0
80000008H EAX Virtual/Physical Address size
Bits 07-00: #Physical Address Bits*
Bits 15-08: #Virtual Address Bits
Bits 31-16: Reserved = 0
NOTES:
* If CPUID.80000008H:EAX[7:0] is supported, the maximum physical address number supported
should come from this field.
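Extracting the two address widths is a matter of two byte masks. A minimal C sketch, assuming the caller supplies the EAX value returned for leaf 80000008H; the function names are hypothetical.

```c
#include <stdint.h>

/* EAX[7:0]: number of physical address bits. */
static unsigned phys_addr_bits(uint32_t eax)
{
    return eax & 0xFFu;
}

/* EAX[15:8]: number of virtual address bits. */
static unsigned virt_addr_bits(uint32_t eax)
{
    return (eax >> 8) & 0xFFu;
}
```

An EAX value of 00003027H, for example, decodes to 39 physical and 48 virtual address bits.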
INPUT EAX = 0H: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification
String
When CPUID executes with EAX set to 0H, the processor returns the highest input value that CPUID recognizes for
returning basic processor information. The value is returned in the EAX register (see Table 1-4) and is processor
specific.
A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is
“GenuineIntel” and is expressed:
EBX := 756e6547h (* “Genu”, with G in the low byte, BL *)
EDX := 49656e69h (* “ineI”, with i in the low byte, DL *)
ECX := 6c65746eh (* “ntel”, with n in the low byte, CL *)
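The register-to-string mapping can be sketched in C as shown here. The function names are hypothetical; the register order (EBX, EDX, ECX) and the low-byte-first layout follow the values listed above.

```c
#include <stdint.h>

/* Store the four bytes of v at dst, least-significant byte first. */
static void put_le32(char *dst, uint32_t v)
{
    for (unsigned i = 0; i < 4; i++)
        dst[i] = (char)((v >> (8 * i)) & 0xFFu);
}

/* Build the 12-character vendor string from the leaf 0H outputs.
 * Note the register order: EBX, then EDX, then ECX. */
static void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                char out[13])
{
    put_le32(out + 0, ebx);
    put_le32(out + 4, edx);
    put_le32(out + 8, ecx);
    out[12] = '\0';
}
```

Using shifts instead of copying the register memory keeps the result independent of the byte order of the machine that runs the decoder.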
INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information
When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recognizes
for returning extended processor information. The value is returned in the EAX register (see Table 1-4) and is
processor specific.
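Software typically compares a leaf number against these two maximum values before querying it. The sketch below shows one way to do that in C, assuming the caller has already read the maximum basic value (leaf 0H) and the maximum extended value (leaf 80000000H); the function name is hypothetical.

```c
#include <stdint.h>

/* Returns nonzero if the leaf lies within a range the processor reports. */
static int cpuid_leaf_supported(uint32_t leaf, uint32_t max_basic,
                                uint32_t max_extended)
{
    if (leaf >= 0x80000000u)
        return leaf <= max_extended;    /* extended function range */
    if (leaf >= 0x40000000u && leaf <= 0x4FFFFFFFu)
        return 0;                       /* range documented as invalid */
    return leaf <= max_basic;           /* basic information range */
}
```

On a processor that does not implement extended functions, the value passed as `max_extended` is below 80000000H, so every extended leaf is rejected.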
Table 1-4. Highest CPUID Source Operand for Intel 64 and IA-32 Processors
                                                                     Highest Value in EAX
Intel 64 or IA-32 Processors                                         Basic Information       Extended Function Information
Earlier Intel486 Processors                                          CPUID Not Implemented   CPUID Not Implemented
Later Intel486 Processors and Pentium Processors                     01H                     Not Implemented
Pentium Pro and Pentium II Processors, Intel® Celeron® Processors    02H                     Not Implemented
Pentium III Processors                                               03H                     Not Implemented
Pentium 4 Processors                                                 02H                     80000004H
Intel Xeon Processors                                                02H                     80000004H
Pentium M Processor                                                  02H                     80000004H
Pentium 4 Processor supporting Hyper-Threading Technology            05H                     80000008H
Pentium D Processor (8xx)                                            05H                     80000008H
Pentium D Processor (9xx)                                            06H                     80000008H
Intel Core Duo Processor                                             0AH                     80000008H
Intel Core 2 Duo Processor                                           0AH                     80000008H
Intel Xeon Processor 3000, 5100, 5300 Series                         0AH                     80000008H
Intel Xeon Processor 3000, 5100, 5200, 5300, 5400 Series             0AH                     80000008H
Intel Core 2 Duo Processor 8000 Series                               0DH                     80000008H
Intel Xeon Processor 5200, 5400 Series                               0AH                     80000008H
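[Figure: layout of a 32-bit register with field boundaries at bits 31-28, 27-20, 19-16, 15-14, 13-12, 11-8, 7-4,
and 3-0. Only the field label "Reserved" survives.]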
NOTE
See "Caching Translation Information" in Chapter 4, “Paging,” in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A, and Chapter 16 in the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for information on identifying earlier IA-32 processors.
The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display
using the following rule:
IF Family_ID ≠ 0FH
THEN Displayed_Family = Family_ID;
ELSE Displayed_Family = Extended_Family_ID + Family_ID;
(* Right justify and zero-extend 4-bit field. *)
FI;
(* Show Displayed_Family as HEX field. *)
The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a
display using the following rule:
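Both display rules can be sketched in C. The family rule is the one given in the pseudocode above. For the model, the sketch assumes the conventional rule: Displayed_Model = (Extended_Model_ID << 4) + Model_ID when Family_ID is 06H or 0FH, and Model_ID otherwise. The field positions are also assumptions of this sketch (Model in EAX[7:4], Family_ID in EAX[11:8], Extended_Model_ID in EAX[19:16], Extended_Family_ID in EAX[27:20]), and the function names are hypothetical.

```c
#include <stdint.h>

/* Displayed_Family: Family_ID, plus Extended_Family_ID when Family_ID is 0FH. */
static unsigned displayed_family(uint32_t eax)
{
    unsigned family     = (eax >> 8) & 0xFu;     /* assumed EAX[11:8]  */
    unsigned ext_family = (eax >> 20) & 0xFFu;   /* assumed EAX[27:20] */
    return family != 0xFu ? family : ext_family + family;
}

/* Displayed_Model (assumed rule): the extended model supplies the upper
 * four bits when Family_ID is 06H or 0FH. */
static unsigned displayed_model(uint32_t eax)
{
    unsigned family    = (eax >> 8) & 0xFu;
    unsigned model     = (eax >> 4) & 0xFu;      /* assumed EAX[7:4]   */
    unsigned ext_model = (eax >> 16) & 0xFu;     /* assumed EAX[19:16] */
    if (family == 0x6u || family == 0xFu)
        return (ext_model << 4) + model;
    return model;
}
```

For a signature of 000806C1H the sketch yields family 06H and model 8CH.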
NOTE
Software must confirm that a processor feature is present using feature flags returned by CPUID
prior to using the feature. Software should not depend on future offerings retaining all features.
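A feature check of this kind reduces to testing individual bits of a returned register. The sketch below uses three ECX bit positions from the feature-information leaf (XSAVE = 26, OSXSAVE = 27, AVX = 28); the macro and function names are hypothetical, and the caller is assumed to supply the ECX value.

```c
#include <stdint.h>

/* Bit positions in the feature-information ECX register. */
#define FEAT_ECX_XSAVE   26
#define FEAT_ECX_OSXSAVE 27
#define FEAT_ECX_AVX     28

/* Returns 1 if the given bit of reg is set, else 0. */
static int feature_bit(uint32_t reg, unsigned bit)
{
    return (int)((reg >> bit) & 1u);
}

/* Hypothetical check: the processor reports AVX and the operating system
 * has enabled XSAVE-managed state (OSXSAVE). */
static int avx_reported_and_osxsave(uint32_t ecx)
{
    return feature_bit(ecx, FEAT_ECX_AVX) && feature_bit(ecx, FEAT_ECX_OSXSAVE);
}
```

This is only the CPUID half of the check; a complete test would also confirm through XGETBV that the operating system has enabled the relevant state in XCR0.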
[Figure: feature information returned in the ECX register. Labels listed from bit 31 down to bit 0.]
Bit 31: 0
Bit 30: RDRAND
Bit 29: F16C
Bit 28: AVX
Bit 27: OSXSAVE
Bit 26: XSAVE
Bit 25: AES
Bit 24: TSC-Deadline
Bit 23: POPCNT
Bit 22: MOVBE
Bit 21: x2APIC
Bit 20: SSE4_2 (SSE4.2)
Bit 19: SSE4_1 (SSE4.1)
Bit 18: DCA (Direct Cache Access)
Bit 17: PCID (Process-context Identifiers)
Bit 16: Reserved
Bit 15: PDCM (Perf/Debug Capability MSR)
Bit 14: xTPR Update Control
Bit 13: CMPXCHG16B
Bit 12: FMA (Fused Multiply Add)
Bit 11: SDBG
Bit 10: CNXT-ID (L1 Context ID)
Bit 09: SSSE3 (SSSE3 Extensions)
Bit 08: TM2 (Thermal Monitor 2)
Bit 07: EST (Enhanced Intel SpeedStep® Technology)
Bit 06: SMX (Safer Mode Extensions)
Bit 05: VMX (Virtual Machine Extensions)
Bit 04: DS-CPL (CPL Qualified Debug Store)
Bit 03: MONITOR (MONITOR/MWAIT)
Bit 02: DTES64 (64-bit DS Area)
Bit 01: PCLMULQDQ (Carryless Multiplication)
Bit 00: SSE3 (SSE3 Extensions)
[Figure: feature information returned in the EDX register. Only the label "Reserved" survives.]
INPUT EAX = 02H: Cache and TLB Information Returned in EAX, EBX, ECX, EDX
When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal caches
and TLBs in the EAX, EBX, ECX, and EDX registers.
The encoding is as follows:
• The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction
must be executed with an input value of 02H to get a complete description of the processor’s caches and TLBs.
The first member of the family of Pentium 4 processors will return a 01H.
• The most significant bit (bit 31) of each register indicates whether the register contains valid information (set
to 0) or is reserved (set to 1).
• If a register contains valid information, the information is contained in 1-byte descriptors. Table 1-8 shows the
encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX registers is not
defined; that is, specific bytes are not designated to contain descriptors for specific cache or TLB types. The
descriptors may appear in any order.
Table 1-8. Encoding of Cache and TLB Descriptors
Descriptor Value Cache or TLB Description
00H Null descriptor
01H Instruction TLB: 4 KByte pages, 4-way set associative, 32 entries
02H Instruction TLB: 4 MByte pages, 4-way set associative, 2 entries
03H Data TLB: 4 KByte pages, 4-way set associative, 64 entries
04H Data TLB: 4 MByte pages, 4-way set associative, 8 entries
05H Data TLB1: 4 MByte pages, 4-way set associative, 32 entries
06H 1st-level instruction cache: 8 KBytes, 4-way set associative, 32 byte line size
08H 1st-level instruction cache: 16 KBytes, 4-way set associative, 32 byte line size
0AH 1st-level data cache: 8 KBytes, 2-way set associative, 32 byte line size
0BH Instruction TLB: 4 MByte pages, 4-way set associative, 4 entries
0CH 1st-level data cache: 16 KBytes, 4-way set associative, 32 byte line size
22H 3rd-level cache: 512 KBytes, 4-way set associative, 64 byte line size, 2 lines per sector
23H 3rd-level cache: 1 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector
25H 3rd-level cache: 2 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector
For example, executing CPUID with an input value of 02H may return the following register values:
EAX 66 5B 50 01H
EBX 0H
ECX 0H
EDX 00 7A 70 00H
Which means:
• The least-significant byte (byte 0) of register EAX is set to 01H. This indicates that CPUID needs to be executed
once with an input value of 2 to retrieve complete information about caches and TLBs.
• The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register
contains valid 1-byte descriptors.
• Bytes 1, 2, and 3 of register EAX indicate that the processor has:
— 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
— 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
— 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
• The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
• Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
— 00H - NULL descriptor.
— 70H - Trace cache: 12 K-μop, 8-way set associative.
— 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
— 00H - NULL descriptor.
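The interpretation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference decoder: the function name `decode_leaf2` is hypothetical, and it only splits the four register values into descriptor bytes following the three encoding rules listed earlier (AL is the iteration count, bit 31 marks a reserved register, 00H is the null descriptor).

```python
def decode_leaf2(eax, ebx, ecx, edx):
    """Return (iterations, descriptors) for one CPUID leaf 02H result."""
    iterations = eax & 0xFF              # AL: how many times CPUID(02H) must run
    descriptors = []
    for index, reg in enumerate((eax, ebx, ecx, edx)):
        if reg & 0x80000000:             # bit 31 set: register is reserved
            continue
        for byte_no in range(4):
            if index == 0 and byte_no == 0:
                continue                 # AL is the count, not a descriptor
            value = (reg >> (8 * byte_no)) & 0xFF
            if value != 0x00:            # skip null descriptors
                descriptors.append(value)
    return iterations, descriptors


# Register values taken from the example above.
count, found = decode_leaf2(0x665B5001, 0x0, 0x0, 0x007A7000)
print(count, [f"{d:02X}H" for d in found])
```

Each returned byte would then be looked up in Table 1-8; because the order of descriptors is not defined, the list should be treated as an unordered collection.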
INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data
that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid
index values start from 0.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an
index value of 0 and incrementing it until the cache type field reports 0. The architecturally
defined fields reported by deterministic cache parameters are documented in Table 1-3.
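The enumeration loop described above might look like the following sketch. The `cpuid` argument is a hypothetical callable that returns `(eax, ebx, ecx, edx)` for a given leaf and sub-leaf; the bit positions used for the cache type field (EAX[4:0]) and cache level (EAX[7:5]) are assumptions that should be checked against Table 1-3.

```python
def enumerate_cache_levels(cpuid):
    """Collect leaf 04H results until the cache type field reads 0."""
    results = []
    index = 0
    while True:
        eax, ebx, ecx, edx = cpuid(0x04, index)
        cache_type = eax & 0x1F          # assumed: cache type field is EAX[4:0]
        if cache_type == 0:              # type 0 ends the enumeration
            break
        cache_level = (eax >> 5) & 0x7   # assumed: cache level is EAX[7:5]
        results.append((index, cache_type, cache_level))
        index += 1
    return results


# Stand-in for the real instruction, using made-up register values.
def fake_cpuid(leaf, subleaf):
    table = {0: 0x121, 1: 0x122, 2: 0x143}
    return (table.get(subleaf, 0), 0, 0, 0)


print(enumerate_cache_levels(fake_cpuid))
```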
The CPUID leaf 4 also reports data that can be used to derive the topology of processor cores in a physical package.
This information is constant for all valid index values. Software can query the raw data reported by executing
CPUID with EAX=04H and ECX=0H and use it as part of the topology enumeration algorithm described in Chapter
8, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
When CPUID executes with EAX set to 07H and ECX = n (n ≥ 1 and less than the number of non-zero bits in
CPUID.(EAX=07H, ECX= 0H).EAX), the processor returns information about extended feature flags. See Table
1-3. In sub-leaf 0, only EAX has the number of sub-leaves. In sub-leaf 0, EBX, ECX & EDX all contain extended
feature flags.
See Table 1-3. Software can use the forward-extendable technique depicted below to query the valid sub-leaves
and obtain size and offset information for each processor extended state save area:
INPUT EAX = 0FH: Returns Intel Resource Director Technology (Intel RDT) Monitoring Enumeration Information
When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector
representation of QoS monitoring resource types that are supported in the processor and the maximum range of RMID
values the processor can use to monitor any supported resource type. Each bit, starting from bit 1, corresponds
to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that
software must use to query QoS monitoring capability available for that type. See Table 1-3.
When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns
information software can use to program the IA32_PQR_ASSOC and IA32_QM_EVTSEL MSRs before reading QoS data from the
IA32_QM_CTR MSR.
INPUT EAX = 10H: Returns Intel Resource Director Technology (Intel RDT) Allocation Enumeration Information
When CPUID executes with EAX set to 10H and ECX = 0, the processor returns information about the bit-vector
representation of QoS Enforcement resource types that are supported in the processor. Each bit, starting from bit
1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or
ResID) that software must use to query QoS enforcement capability available for that type. See Table 1-3.
When CPUID executes with EAX set to 10H and ECX = n (n >= 1, and is a valid ResID), the processor returns
information about available classes of service and the range of QoS mask MSRs that software can use to configure each
class of service using capability bit masks in the QoS Mask registers, IA32_resourceType_Mask_n.
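For both leaf 0FH and leaf 10H, sub-leaf 0 reports a bit vector in which each set bit from bit 1 upward names a valid ResID. A minimal sketch of turning that vector into a list of sub-leaf indices follows; the function name is hypothetical, and which register carries the vector is given in Table 1-3, so the function simply takes the raw 32-bit value.

```python
def supported_res_ids(bit_vector):
    """Return the sub-leaf indices (ResIDs) whose bits are set, starting at bit 1."""
    return [bit for bit in range(1, 32) if (bit_vector >> bit) & 1]


# Made-up vector with bits 1 and 2 set.
print(supported_res_ids(0b0110))
```

Software would then execute CPUID again with ECX set to each returned index to query that resource type.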
INPUT EAX = 15H: Returns Time Stamp Counter and Nominal Core Crystal Clock Information
When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp
Counter and Core Crystal Clock. See Table 1-3.
This method (introduced with Pentium 4 processors) returns an ASCII brand identification string and the maximum
operating frequency of the processor to the EAX, EBX, ECX, and EDX registers.
NOTE
When a frequency is given in a brand string, it is the maximum qualified frequency of the processor,
not the frequency at which the processor is currently running.
[Figure: Algorithm for extracting the maximum processor frequency. Execute CPUID with EAX = 0x80000000 to check
that the extended functions are supported. Scan the brand string in reverse for the substring "zHM", "zHG", or
"zHT"; if no substring matches, report an error. Set "Multiplier" to 1 x 10^6 for "zHM", 1 x 10^9 for "zHG", or
1 x 10^12 for "zHT". Scan the digits in reverse order until a blank, then reverse them to obtain the decimal
value "Freq" (digits "ZY.X" give "Freq" = X.YZ). Max. Qualified Frequency = "Freq" x "Multiplier".]
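The frequency-extraction algorithm can be approximated by the sketch below. It works on a forward copy of the brand string rather than scanning in reverse, which should be equivalent for well-formed strings; the function name and the sample brand strings are illustrative assumptions, not values taken from this document.

```python
from decimal import Decimal

MULTIPLIERS = {"MHz": 10**6, "GHz": 10**9, "THz": 10**12}


def max_qualified_frequency_hz(brand):
    """Return the frequency named in a brand string, in Hz, as an integer."""
    # Pick the unit that occurs last, mirroring a scan from the end of the string.
    position, unit = max((brand.rfind(u), u) for u in MULTIPLIERS)
    if position < 0:
        raise ValueError("no MHz/GHz/THz substring in brand string")
    # Walk backwards over the digits until a blank (or the start) is reached.
    start = position
    while start > 0 and brand[start - 1] != " ":
        start -= 1
    digits = brand[start:position]
    if not digits:
        raise ValueError("no digits before the frequency unit")
    # Decimal avoids binary floating-point rounding for values such as 3.20.
    return int(Decimal(digits) * MULTIPLIERS[unit])


print(max_qualified_frequency_hz("Intel(R) Pentium(R) 4 CPU 1500MHz"))
```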
Table 1-10 shows brand indices that have identification strings associated with them.
Table 1-10. Mapping of Brand Indices; and Intel 64 and IA-32 Processor Brand Strings
Brand Index Brand String
00H This processor does not support the brand identification feature
01H Intel(R) Celeron(R) processor1
02H Intel(R) Pentium(R) III processor1
03H Intel(R) Pentium(R) III Xeon(R) processor; If processor signature = 000006B1h, then Intel(R) Celeron(R)
processor
04H Intel(R) Pentium(R) III processor
06H Mobile Intel(R) Pentium(R) III processor-M
07H Mobile Intel(R) Celeron(R) processor1
08H Intel(R) Pentium(R) 4 processor
09H Intel(R) Pentium(R) 4 processor
0AH Intel(R) Celeron(R) processor1
0BH Intel(R) Xeon(R) processor; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor MP
0CH Intel(R) Xeon(R) processor MP
0EH Mobile Intel(R) Pentium(R) 4 processor-M; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor
0FH Mobile Intel(R) Celeron(R) processor1
11H Mobile Genuine Intel(R) processor
12H Intel(R) Celeron(R) M processor
13H Mobile Intel(R) Celeron(R) processor1
14H Intel(R) Celeron(R) processor
15H Mobile Genuine Intel(R) processor
16H Intel(R) Pentium(R) M processor
17H Mobile Intel(R) Celeron(R) processor1
18H – 0FFH RESERVED
NOTES:
1. Indicates versions of these processors that were introduced after the Pentium III.
Operation
CASE (EAX) OF
EAX = 0:
EAX := Highest basic function input value understood by CPUID;
EBX := Vendor identification string;
EDX := Vendor identification string;
ECX := Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] := Stepping ID;
EAX[7:4] := Model;
EAX[11:8] := Family;
EAX[13:12] := Processor type;
EAX[15:14] := Reserved;
EAX[19:16] := Extended Model;
EAX[27:20] := Extended Family;
EAX[31:28] := Reserved;
EBX[7:0] := Brand Index; (* Reserved if the value is zero. *)
EBX[15:8] := CLFLUSH Line Size;
EBX[16:23] := Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
EBX[24:31] := Initial APIC ID;
ECX := Feature flags; (* See Figure 1-2. *)
EDX := Feature flags; (* See Figure 1-3. *)
BREAK;
EAX = 2H:
EAX := Cache and TLB information;
EBX := Cache and TLB information;
ECX := Cache and TLB information;
EDX := Cache and TLB information;
BREAK;
EAX = 3H:
EAX := Reserved;
EBX := Reserved;
ECX := ProcessorSerialNumber[31:0];
(* Pentium III processors only, otherwise reserved. *)
EDX := ProcessorSerialNumber[63:32];
(* Pentium III processors only, otherwise reserved. *)
BREAK;
EAX = 4H:
EAX := Deterministic Cache Parameters Leaf; (* See Table 1-3. *)
EBX := Deterministic Cache Parameters Leaf;
ECX := Deterministic Cache Parameters Leaf;
EDX := Deterministic Cache Parameters Leaf;
BREAK;
EAX = 5H:
EAX := MONITOR/MWAIT Leaf; (* See Table 1-3. *)
EBX := MONITOR/MWAIT Leaf;
ECX := MONITOR/MWAIT Leaf;
EDX := MONITOR/MWAIT Leaf;
BREAK;
EAX = 6H:
EAX := Thermal and Power Management Leaf; (* See Table 1-3. *)
EBX := Thermal and Power Management Leaf;
ECX := Thermal and Power Management Leaf;
EDX := Thermal and Power Management Leaf;
BREAK;
EAX = 7H:
EAX := Structured Extended Feature Leaf; (* See Table 1-3. *)
EBX := Structured Extended Feature Leaf;
ECX := Structured Extended Feature Leaf;
EDX := Structured Extended Feature Leaf;
BREAK;
EAX = 8H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 9H:
EAX := Direct Cache Access Information Leaf; (* See Table 1-3. *)
EBX := Direct Cache Access Information Leaf;
ECX := Direct Cache Access Information Leaf;
EDX := Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
EAX := Architectural Performance Monitoring Leaf; (* See Table 1-3. *)
EBX := Architectural Performance Monitoring Leaf;
ECX := Architectural Performance Monitoring Leaf;
EDX := Architectural Performance Monitoring Leaf;
BREAK;
EAX = BH:
EAX := Extended Topology Enumeration Leaf; (* See Table 1-3. *)
EBX := Extended Topology Enumeration Leaf;
ECX := Extended Topology Enumeration Leaf;
EDX := Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = DH:
EAX := Processor Extended State Enumeration Leaf; (* See Table 1-3. *)
EBX := Processor Extended State Enumeration Leaf;
ECX := Processor Extended State Enumeration Leaf;
EDX := Processor Extended State Enumeration Leaf;
BREAK;
EAX = EH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = FH:
EAX := Platform Quality of Service Monitoring Enumeration Leaf; (* See Table 1-3. *)
EBX := Platform Quality of Service Monitoring Enumeration Leaf;
ECX := Platform Quality of Service Monitoring Enumeration Leaf;
EDX := Platform Quality of Service Monitoring Enumeration Leaf;
BREAK;
EAX = 10H:
EAX := Platform Quality of Service Enforcement Enumeration Leaf; (* See Table 1-3. *)
EBX := Platform Quality of Service Enforcement Enumeration Leaf;
ECX := Platform Quality of Service Enforcement Enumeration Leaf;
EDX := Platform Quality of Service Enforcement Enumeration Leaf;
BREAK;
EAX = 12H:
EAX := Intel SGX Enumeration Leaf; (* See Table 1-3. *)
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 80000006H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Cache information;
EDX := Reserved = 0;
BREAK;
EAX = 80000007H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 80000008H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
DEFAULT: (* EAX = Value outside of recognized range for CPUID. *)
(* If the highest basic information leaf data depend on ECX input value, ECX is honored.*)
EAX := Reserved; (* Information returned for highest basic information leaf. *)
EBX := Reserved; (* Information returned for highest basic information leaf. *)
ECX := Reserved; (* Information returned for highest basic information leaf. *)
EDX := Reserved; (* Information returned for highest basic information leaf. *)
BREAK;
ESAC;
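The EAX = 1H case above packs several fields into EAX and EBX. The following sketch unpacks them using the bit ranges listed in that case; the function name and the sample EBX value are hypothetical, and the sample EAX value reuses the processor signature 00000F13h mentioned in Table 1-10.

```python
def decode_leaf1(eax, ebx):
    """Split the CPUID leaf 01H EAX and EBX values into their named fields."""
    return {
        "stepping": eax & 0xF,                   # EAX[3:0]
        "model": (eax >> 4) & 0xF,               # EAX[7:4]
        "family": (eax >> 8) & 0xF,              # EAX[11:8]
        "processor_type": (eax >> 12) & 0x3,     # EAX[13:12]
        "extended_model": (eax >> 16) & 0xF,     # EAX[19:16]
        "extended_family": (eax >> 20) & 0xFF,   # EAX[27:20]
        "brand_index": ebx & 0xFF,               # EBX[7:0]
        "clflush_line_size": (ebx >> 8) & 0xFF,  # EBX[15:8]
        "initial_apic_id": (ebx >> 24) & 0xFF,   # EBX[31:24]
    }


fields = decode_leaf1(0x00000F13, 0x01000809)
print(fields["family"], fields["model"], fields["stepping"])
```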
Flags Affected
None.
1 64bit 1 {1tox} 8 8 8
0 32bit 0 none 8 16 32
Half Load+Op (Half Vector)
1 32bit 0 {1tox} 4 4 4
Table 1-12. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast
TupleType      InputSize  EVEX.W  N (VL=128)  N (VL=256)  N (VL=512)  Comment
Full Mem       N/A        N/A     16          32          64          Load/store or subDword full vector
Tuple1 Scalar  8bit       N/A     1           1           1           1Tuple
Tuple1 Scalar  16bit      N/A     2           2           2
Tuple1 Scalar  32bit      0       4           4           4
Tuple1 Scalar  64bit      1       8           8           8
Tuple1 Fixed   32bit      N/A     4           4           4           1 Tuple, memsize not affected by EVEX.W
Tuple1 Fixed   64bit      N/A     8           8           8
Half Mem       N/A        N/A     8           16          32          SubQword Conversion
Quarter Mem    N/A        N/A     4           8           16          SubDword Conversion
Eighth Mem     N/A        N/A     2           4           8           SubWord Conversion
Mem128         N/A        N/A     16          16          16          Shift count from memory
MOVDDUP        N/A        N/A     8           32          64          VMOVDDUP
NOTES:
1. Scalar
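As a rough illustration of how the N values in Table 1-12 are used, the sketch below scales a signed 8-bit displacement by N for a few of the tuple types. It assumes, as the DISP8*N name indicates, that the effective displacement is disp8 multiplied by N; the dictionary copies only rows whose N does not depend on InputSize, and all names are hypothetical.

```python
# N values for VL = 128, 256, 512, copied from Table 1-12.
DISP8_N = {
    "Full Mem": (16, 32, 64),
    "Half Mem": (8, 16, 32),
    "Quarter Mem": (4, 8, 16),
    "Eighth Mem": (2, 4, 8),
    "Mem128": (16, 16, 16),
    "MOVDDUP": (8, 32, 64),
}
VL_COLUMN = {128: 0, 256: 1, 512: 2}


def scaled_displacement(tuple_type, vector_length, disp8):
    """Return disp8 * N for the given tuple type and vector length."""
    if not -128 <= disp8 <= 127:
        raise ValueError("disp8 must fit in a signed byte")
    n = DISP8_N[tuple_type][VL_COLUMN[vector_length]]
    return disp8 * n


print(scaled_displacement("Half Mem", 256, 2))
```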
CHAPTER 2
INSTRUCTION SET REFERENCE, A-Z
Instructions described in this document follow the general documentation convention established in Intel® 64 and
IA-32 Architectures Software Developer’s Manual Volume 2A. Additionally, some instructions use notation
conventions as described below.
In the instruction encoding, the MODRM byte is represented several ways depending on the role it plays. The
MODRM byte has 3 fields: a 2-bit MODRM.MOD field, a 3-bit MODRM.REG field and a 3-bit MODRM.RM field. When all
bits of the MODRM byte have fixed values for an instruction, the 2-hex nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the MODRM byte
must contain fixed values, those values are specified as follows:
• If only the MODRM.MOD must be 0b11, and MODRM.REG and MODRM.RM fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr corresponds to the 3 bits of the MODRM.REG field and the bbb corresponds to
the 3 bits of the MODRM.RM field.
• If the MODRM.MOD field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then we use the notation !(11).
• If for example only the MODRM.REG field had a specific required value, e.g., 0b101, that would be denoted as
mm:101:bbb.
NOTE
Historically the Intel® 64 and IA-32 Architectures Software Developer’s Manual only specified the
MODRM.REG field restrictions with the notation /0 ... /7 and did not specify restrictions on the
MODRM.MOD and MODRM.RM fields in the encoding boxes.
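The three MODRM notations described above can be checked mechanically. The sketch below splits a MODRM byte into its fields and tests it against each notation; the helper names are hypothetical and exist only for illustration.

```python
def split_modrm(byte):
    """Return (mod, reg, rm) for a MODRM byte."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def is_register_form(byte):
    """True for the 11:rrr:bbb notation (MODRM.MOD must be 0b11)."""
    return split_modrm(byte)[0] == 0b11


def is_memory_form(byte):
    """True for the !(11) notation (MODRM.MOD is 0b00, 0b01, or 0b10)."""
    return split_modrm(byte)[0] != 0b11


def has_reg(byte, required):
    """True when MODRM.REG equals a required value, as in mm:101:bbb."""
    return split_modrm(byte)[1] == required


print(split_modrm(0xE8))
```

For instance, the byte E8H has MOD = 0b11 and REG = 0b101, so it satisfies both the 11:rrr:bbb and the mm:101:bbb patterns.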
Description
CLUI clears the user interrupt flag (UIF). Its effect takes place immediately: a user interrupt cannot be delivered on
the instruction boundary following CLUI.
An execution of CLUI inside a transactional region causes a transactional abort; the abort loads EAX as it would
have had it been caused due to an execution of CLI.
Operation
UIF := 0;
Flags Affected
None.
Description
The ENQCMD instruction allows software to write commands to enqueue registers, which are special device
registers accessed using memory-mapped I/O (MMIO).
Enqueue registers expect writes to have the following format:
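[Figure 2-1: format of the 64-byte write. Bits 511:32 hold the device-specific command, bit 31 holds the
privilege identification, bits 30:20 lie between the two and are written as zero, and bits 19:0 hold the PASID.]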
Bits 19:0 convey the process address space identifier (PASID), a value which system software may assign to
individual software threads. Bit 31 contains privilege identification (0 = user; 1 = supervisor). Devices implementing
enqueue registers may use these two values along with a device-specific command in the upper 60 bytes. Chapter
4 provides more details regarding how ENQCMD uses PASIDs.
The ENQCMD instruction begins by reading 64 bytes of command data from its source memory operand. This is an
ordinary load with cacheability and memory ordering implied normally by the memory type. The source operand
need not be aligned, and there is no guarantee that all 64 bytes are loaded atomically.
The instruction then formats those 64 bytes into command data with a format consistent with that given in
Figure 2-1:
• Command[19:0] get IA32_PASID[19:0].1
• Command[30:20] are zero.
• Command[31] is 0 (indicating user).
• Command[511:32] get bits 511:32 of the source operand that was read from memory.
(The instruction ignores bits 31:0 of the source operand.)
The ENQCMD instruction uses an enqueue store (defined below) to write this command data to the destination
operand. The address of the destination operand is specified in a general-purpose register as an offset into the ES
segment (the segment cannot be overridden).2 The destination linear address must be 64-byte aligned. The
operation of an enqueue store disregards the memory type of the destination memory address.
1. It is expected that system software will load the IA32_PASID MSR so that bits 19:0 contain the PASID of the current
software thread. The MSR’s valid bit, IA32_PASID[31], must be 1. The PASID MSR is discussed in more detail in Section 4.1.
2. In 64-bit mode, the width of the register operand is 64 bits (32 bits with a 67H prefix). Outside 64-bit mode when CS.D =
1, the width is 32 bits (16 bits with a 67H prefix). Outside 64-bit mode when CS.D=0, the width is 16 bits (32 bits with a
67H prefix).
An enqueue store is not ordered relative to older stores to WB or WC memory (including non-temporal stores) or
to executions of CLFLUSHOPT or CLWB (when applied to addresses other than that of the enqueue store).
Software can enforce such ordering by executing a fencing instruction such as SFENCE or MFENCE before the enqueue
store.
An enqueue store does not write the data into the cache hierarchy, nor does it fetch any data into the cache
hierarchy. An enqueue store’s command data is never combined with that of any other store to the same address.
Unlike other stores, an enqueue store returns a status, which the ENQCMD instruction loads into the ZF flag in the
RFLAGS register:
• ZF = 0 (success) reports that the 64-byte command data was written atomically to a device’s enqueue register
and has been accepted by the device. (It does not guarantee that the device has acted on the command; it may
have queued it for later execution.)
• ZF = 1 (retry) reports that the command data was not accepted. This status is returned if the destination
address is an enqueue register but the command was not accepted due to capacity or other temporal reasons.
This status is also returned if the destination address was not an enqueue register (including the case of a
memory address); in these cases, the store is dropped and is written neither to MMIO nor to memory.
Availability of the ENQCMD instruction is indicated by the presence of the CPUID feature flag ENQCMD
(CPUID.(EAX=07H, ECX=0H):ECX[bit 29]).
Operation
IF IA32_PASID[31] = 0
THEN #GP;
ELSE
COMMAND := (SRC & ~FFFFFFFFH) | (IA32_PASID & FFFFFH);
DEST := COMMAND;
FI;
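The command formatting in the Operation above can be modeled with Python integers standing in for the 512-bit operand. This is only a sketch of the bit manipulation: the exception class is a hypothetical stand-in for #GP, and the enqueue store itself is not modeled.

```python
MASK_512 = (1 << 512) - 1


class GeneralProtectionFault(Exception):
    """Stands in for #GP in this model."""


def enqcmd_command(src, ia32_pasid):
    """Build the 512-bit command that ENQCMD would write."""
    if not (ia32_pasid >> 31) & 1:                 # IA32_PASID valid bit is clear
        raise GeneralProtectionFault("IA32_PASID[31] = 0")
    upper = (src & MASK_512) & ~0xFFFFFFFF         # keep source bits 511:32
    return upper | (ia32_pasid & 0xFFFFF)          # bits 19:0 come from the MSR


# Bits 31:0 of the source are ignored; bits 31:20 of the result end up zero.
command = enqcmd_command((0xABCD << 32) | 0xFFFFFFFF, 0x80012345)
print(hex(command))
```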
Flags Affected
The ZF flag is set if the enqueue-store completion returns the retry status; otherwise it is cleared. All other flags
are cleared.
Description
The ENQCMDS instruction allows system software to write commands to enqueue registers, which are special
device registers accessed using memory-mapped I/O (MMIO).
Enqueue registers expect writes to have the format given in Figure 2-1 and explained in the section on “ENQCMD
— Enqueue Command.”
The ENQCMDS instruction begins by reading 64 bytes of command data from its source memory operand. This is
an ordinary load with cacheability and memory ordering implied normally by the memory type. The source operand
need not be aligned, and there is no guarantee that all 64 bytes are loaded atomically.
ENQCMDS formats its source data differently from ENQCMD. Specifically, it formats them into command data as
follows:
• Command[19:0] get bits 19:0 of the source operand that was read from memory. These 20 bits communicate
a process address-space identifier (PASID). Chapter 4 provides more details regarding how ENQCMDS uses
PASIDs.
• Command[30:20] are zero.
• Command[511:31] get bits 511:31 of the source operand that was read from memory. Bit 31 communicates a
privilege identification (0 = user; 1 = supervisor).
(The instruction ignores bits 30:20 of the source operand.)
The ENQCMDS instruction then uses an enqueue store (defined below) to write this command data to the
destination operand. The address of the destination operand is specified in a general-purpose register as an offset into
the ES segment (the segment cannot be overridden).1 The destination linear address must be 64-byte aligned. The
operation of an enqueue store disregards the memory type of the destination memory address.
An enqueue store is not ordered relative to older stores to WB or WC memory (including non-temporal stores) or
to executions of CLFLUSHOPT or CLWB (when applied to addresses other than that of the enqueue store).
Software can enforce such ordering by executing a fencing instruction such as SFENCE or MFENCE before the enqueue
store.
An enqueue store does not write the data into the cache hierarchy, nor does it fetch any data into the cache
hierarchy. An enqueue store’s command data is never combined with that of any other store to the same address.
Unlike other stores, an enqueue store returns a status, which the ENQCMDS instruction loads into the ZF flag in the
RFLAGS register:
• ZF = 0 (success) reports that the 64-byte command data was written atomically to a device’s enqueue register
and has been accepted by the device. (It does not guarantee that the device has acted on the command; it may
have queued it for later execution.)
• ZF = 1 (retry) reports that the command data was not accepted. This status is returned if the destination
address is an enqueue register but the command was not accepted due to capacity or other temporal reasons.
This status is also returned if the destination address was not an enqueue register (including the case of a
memory address); in these cases, the store is dropped and is written neither to MMIO nor to memory.
1. In 64-bit mode, the width of the register operand is 64 bits (32 bits with a 67H prefix). Outside 64-bit mode when CS.D =
1, the width is 32 bits (16 bits with a 67H prefix). Outside 64-bit mode when CS.D=0, the width is 16 bits (32 bits with a
67H prefix).
The ENQCMDS instruction may be executed only if CPL = 0. Availability of the ENQCMDS instruction is indicated by
the presence of the CPUID feature flag ENQCMD (CPUID.(EAX=07H, ECX=0H):ECX[bit 29]).
Operation
DEST := SRC & ~7FF00000H; // clear bits 30:20
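For comparison with ENQCMD, the ENQCMDS formatting keeps the PASID and privilege bit from the source operand and clears only bits 30:20. A minimal model, again treating the 512-bit operand as a Python integer and using a hypothetical function name:

```python
def enqcmds_command(src):
    """Build the 512-bit command that ENQCMDS would write."""
    mask_512 = (1 << 512) - 1
    return (src & mask_512) & ~0x7FF00000   # clear bits 30:20, keep everything else


# Source with bit 31, bits 30:20, a PASID, and some upper command data set.
source = (0xAB << 32) | (1 << 31) | (0x7FF << 20) | 0x12345
print(hex(enqcmds_command(source)))
```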
Flags Affected
The ZF flag is set if the enqueue-store completion returns the retry status; otherwise it is cleared. All other flags
are cleared.
Description
Provides a hint to the processor to selectively reset the prediction history of the current logical processor. HRESET
operation is controlled by the implicit EAX operand. The value of the explicit imm8 operand is ignored.
CPUID.07H.01H:EAX.HRESET[bit 22] indicates support of the HRESET instruction. This instruction can only be
executed at CPL 0.
The HRESET instruction is capable of providing a reset hint for multiple predictions. Prior to the execution of
HRESET, the system software must take the following steps:
1. Enumerate the HRESET capabilities via CPUID.20H.0H:EBX, which indicates what predictions can be reset.
2. Opt-in to reset a subset of the available capabilities by setting the respective bits in the IA32_HRESET_ENABLE
MSR. The opt-in bits in the IA32_HRESET_ENABLE MSR are aligned with the HRESET capabilities CPUID bits.
The implicit EAX operand must contain set bits that are a subset of those set in the IA32_HRESET_ENABLE MSR.
Otherwise, HRESET generates #GP(0). When EAX=0 this instruction is interpreted as NOP.
Any attempt to execute the HRESET instruction inside a transactional region will result in a transaction abort.
Operation
IF EAX = 0
THEN NOP
ELSE
FOREACH i such that EAX[i] = 1
Reset prediction history for feature i
FI
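The subset rule and the Operation above can be combined into a small model. In this sketch the exception class is a hypothetical stand-in for #GP(0), and the function returns the indices of the features whose prediction history would be reset rather than performing any reset.

```python
class GeneralProtectionFault(Exception):
    """Stands in for #GP(0) in this model."""


def hreset(eax, hreset_enable_msr):
    """Return the feature indices HRESET would act on for a given EAX."""
    if eax & ~hreset_enable_msr:             # a requested bit was not opted in
        raise GeneralProtectionFault("EAX is not a subset of IA32_HRESET_ENABLE")
    # EAX = 0 yields an empty list, matching the NOP case.
    return [i for i in range(32) if (eax >> i) & 1]


print(hreset(0b01, 0b11))
```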
Flags Affected
None.
Description
PCONFIG allows software to configure certain platform features. PCONFIG supports multiple leaf functions, with a
leaf function identified by the value in EAX. The registers RBX, RCX, and RDX may provide input information for
certain leaves. All leaves write status information to EAX but do not modify RBX, RCX, or RDX.
Each PCONFIG leaf function applies to a specific hardware block called a PCONFIG target, and each PCONFIG target
is associated with a numerical identifier. The identifiers of the PCONFIG targets supported by the CPU (which imply
the supported leaf functions) are enumerated in the sub-leaves of the PCONFIG-information leaf of CPUID (EAX =
1BH). An attempt to execute an undefined leaf function results in a general-protection exception (#GP).
Addresses and operands are 32 bits outside 64-bit mode and are 64 bits in 64-bit mode. The value of CS.D does not
affect operand size or address size.
Table 2-1 shows the leaf encodings for PCONFIG.
The MKTME_KEY_PROGRAM leaf of PCONFIG pertains to the MKTME target, which has target identifier 1. It is used
by software to manage the key associated with a KeyID. The leaf function is invoked by setting the leaf value of 0
in EAX and the address of MKTME_KEY_PROGRAM_STRUCT in RBX. Successful execution of the leaf sets RAX to zero
and clears ZF, CF, PF, AF, OF, and SF. In case of failure, the failure reason is indicated in RAX, ZF is set
to 1, and CF, PF, AF, OF, and SF are cleared. The MKTME_KEY_PROGRAM leaf uses the MKTME_KEY_PROGRAM_STRUCT
in memory shown in Table 2-2.
The encryption algorithm field (ENC_ALG) allows software to select one of the activated encryption algorithms
for the KeyID. The BIOS can activate a set of algorithms to allow for use when programming keys using the
IA32_TME_ACTIVATE MSR (does not apply to KeyID 0 which uses TME policy). The processor checks to
ensure that the algorithm selected by software is one of the algorithms that has been activated by the BIOS.
• KEY_FIELD_1: This field carries the software supplied data key to be used for the KeyID if the direct key
programming option is used (KEYID_SET_KEY_DIRECT). When the random key programming option is used
(KEYID_SET_KEY_RANDOM), this field carries the software supplied entropy to be mixed in the CPU generated
random data key. It is software's responsibility to ensure that the key supplied for the direct programming
option or the entropy supplied for the random programming option does not result in weak keys. There are no
explicit checks in the instruction to detect or prevent weak keys. When AES XTS-128 is used, the upper 48B are
treated as reserved and must be zeroed out by software before executing the instruction.
• KEY_FIELD_2: This field carries the software supplied tweak key to be used for the KeyID if the direct key
programming option is used (KEYID_SET_KEY_DIRECT). When the random key programming option is used
(KEYID_SET_KEY_RANDOM), this field carries the software supplied entropy to be mixed in the CPU generated
random tweak key. It is software's responsibility to ensure that the key supplied for the direct programming
option or the entropy supplied for the random programming option does not result in weak keys. There are no
explicit checks in the instruction to detect or prevent weak keys. When AES XTS-128 is used, the upper 48B are
treated as reserved and must be zeroed out by software before executing the instruction.
All KeyIDs use the TME key on MKTME activation. Software can at any point decide to change the key for a
KeyID using the PCONFIG instruction. Change of keys for a KeyID does NOT change the state of the TLB
caches or memory pipeline. It is software's responsibility to take appropriate actions to ensure correct
behavior.
Table 2-4 shows the return values associated with the MKTME_KEY_PROGRAM leaf of PCONFIG. On
instruction execution, RAX is populated with the return value.
PCONFIG Virtualization
Software in VMX root operation can control the execution of PCONFIG in VMX non-root operation using the
following VM-execution controls introduced for PCONFIG:
• PCONFIG_ENABLE: This control is a single bit control and enables the PCONFIG instruction in VMX non-root
operation. If 0, the execution of PCONFIG in VMX non-root operation causes #UD. Otherwise, execution of
PCONFIG works according to PCONFIG_EXITING.
• PCONFIG_EXITING: This is a 64b control and allows VMX root operation to cause a VM-exit for various leaf
functions of PCONFIG. This control does not have any effect if the PCONFIG_ENABLE control is clear. It is
recommended that VMMs intercept execution of any PCONFIG leaves with which they are not familiar and
convert such executions into #GP(0).
PCONFIG Concurrency
In a scenario where the MKTME_KEY_PROGRAM leaf of PCONFIG is executed concurrently on multiple logical
processors, only one logical processor will succeed in updating the key table. PCONFIG execution will return with an
error code (DEVICE_BUSY) on other logical processors and software must retry. In cases where the instruction
execution fails with a DEVICE_BUSY error code, the key table is not updated, thereby ensuring that either the key
table is updated in its entirety with the information for a KeyID, or it is not updated at all. In order to accomplish
this, the MKTME_KEY_PROGRAM leaf of PCONFIG maintains a writer lock for updating the key table. This lock is
referred to as the Key table lock and denoted in the instruction flows as KEY_TABLE_LOCK. The lock can either be
unlocked, when no logical processor is holding the lock (also the initial state of the lock) or be in an exclusive state
where a logical processor is trying to update the key table. There can be only one logical processor holding the lock
in exclusive state. The lock, being exclusive, can only be acquired when the lock is in unlocked state.
PCONFIG uses the following syntax to acquire KEY_TABLE_LOCK in exclusive mode and release the lock:
• KEY_TABLE_LOCK.ACQUIRE(WRITE)
• KEY_TABLE_LOCK.RELEASE()
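The all-or-nothing behavior described above resembles a non-blocking try-lock. The sketch below models it in software with `threading.Lock`; the dictionary used as a key table, the symbolic `DEVICE_BUSY` value, and the function name are assumptions made for illustration (actual return values are listed in Table 2-4).

```python
import threading

SUCCESS = 0
DEVICE_BUSY = "DEVICE_BUSY"      # symbolic placeholder, not the architectural value

key_table_lock = threading.Lock()
key_table = {}


def program_key(keyid, entry):
    """Update the key table only if the exclusive lock can be taken at once."""
    if not key_table_lock.acquire(blocking=False):
        return DEVICE_BUSY       # another updater holds the lock; table untouched
    try:
        key_table[keyid] = entry
        return SUCCESS
    finally:
        key_table_lock.release()


print(program_key(1, "example-key"))
```

A caller that receives `DEVICE_BUSY` is expected to retry, since the failed attempt leaves the table unchanged.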
Operation
IF (TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1.BYTES[63:16] != 0) #GP(0);
IF (TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2.BYTES[63:16] != 0) #GP(0);
}
(* Check that the KEYID being operated upon is a valid KEYID *)
IF (TMP_KEY_PROGRAM_STRUCT.KEYID >
2^IA32_TME_ACTIVATE.MK_TME_KEYID_BITS - 1
OR TMP_KEY_PROGRAM_STRUCT.KEYID >
IA32_TME_CAPABILITY.MK_TME_MAX_KEYS
OR TMP_KEY_PROGRAM_STRUCT.KEYID == 0)
{
RFLAGS.ZF = 1;
RAX = INVALID_KEYID;
goto EXIT;
}
(* Check that only one algorithm is requested for the KeyID and it is one of the activated algorithms *)
IF (NUM_BITS(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG) != 1 ||
(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG &
IA32_TME_ACTIVATE.MK_TME_CRYPTO_ALGS == 0))
{
RFLAGS.ZF = 1;
RAX = INVALID_ENC_ALG;
goto EXIT;
}
(* Try to acquire exclusive lock *)
IF (NOT KEY_TABLE_LOCK.ACQUIRE(WRITE))
{
//PCONFIG failure
RFLAGS.ZF = 1;
RAX = DEVICE_BUSY;
goto EXIT;
}
(* Lock is acquired and key table will be updated as per the command
Before this point no changes to the key table are made *)
switch(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.COMMAND)
{
case KEYID_SET_KEY_DIRECT:
<<Write
DATA_KEY=TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1,
TWEAK_KEY=TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2,
ENCRYPTION_MODE=ENCRYPT_WITH_KEYID_KEY,
to MKTME Key table at index TMP_KEY_PROGRAM_STRUCT.KEYID
>>
break;
case KEYID_SET_KEY_RANDOM:
TMP_RND_DATA_KEY = <<Generate a random key using hardware RNG>>
IF (NOT ENOUGH ENTROPY)
{
RFLAGS.ZF = 1;
RAX = ENTROPY_ERROR;
goto EXIT;
}
TMP_RND_TWEAK_KEY = <<Generate a random key using hardware RNG>>
<<Write
DATA_KEY=TMP_RND_DATA_KEY,
TWEAK_KEY=TMP_RND_TWEAK_KEY,
ENCRYPTION_MODE=ENCRYPT_WITH_KEYID_KEY,
to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
>>
break;
case KEYID_CLEAR_KEY:
<<Write
DATA_KEY='0,
TWEAK_KEY='0,
ENCRYPTION_MODE = ENCRYPT_WITH_TME_KEY,
to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
>>
break;
case KEYID_NO_ENCRYPT:
<<Write
ENCRYPTION_MODE=NO_ENCRYPTION,
to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
>>
break;
}
RAX = 0;
RFLAGS.ZF = 0;
//Release Lock
KEY_TABLE_LOCK.RELEASE();
EXIT:
RFLAGS.CF=0;
RFLAGS.PF=0;
RFLAGS.AF=0;
RFLAGS.OF=0;
RFLAGS.SF=0;
}
end_of_flow
Description
The SENDUIPI instruction takes a single register operand. The operand always has 64 bits; operand-size overrides
(e.g., the prefix 66) are ignored.
Although SENDUIPI may be executed at any privilege level, all of the instruction’s memory accesses are performed
with supervisor privilege.
Virtualization of the SENDUIPI instruction (in particular, that of the sending of the notification interrupt) is
discussed in Section 11.9.2.5.
The Operation section refers to the values UITTADDR and UITTSZ. The values are defined in Section 11.3.1. It also
includes operations on a user posted-interrupt descriptor (UPID). The format of a UPID is defined in Section 11.5.
Operation
IF reg > UITTSZ;
THEN #GP(0);
FI;
read tempUITTE from 16 bytes at UITTADDR + (reg << 4);
IF tempUITTE.V = 0 or tempUITTE sets any reserved bit (see Section 11.7.1)
THEN #GP(0);
FI;
read tempUPID from 16 bytes at tempUITTE.UPIDADDR;// under lock
IF tempUPID sets any reserved bits or bits that must be zero (see Table 11-1)
THEN #GP(0); // release lock
FI;
tempUPID.PIR[tempUITTE.UV] := 1;
IF tempUPID.SN = tempUPID.ON = 0
THEN
tempUPID.ON := 1;
sendNotify := 1;
ELSE sendNotify := 0;
FI;
write tempUPID to 16 bytes at tempUITTE.UPIDADDR;// release lock
IF sendNotify = 1
THEN
IF local APIC is in x2APIC mode
THEN send ordinary IPI with vector tempUPID.NV
to 32-bit physical APIC ID tempUPID.NDST;
ELSE send ordinary IPI with vector tempUPID.NV
to 8-bit physical APIC ID tempUPID.NDST[15:8];
FI;
FI;
Flags Affected
None.
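The posting step in the Operation section can be sketched in software. The following Python model is illustrative only: the Upid class, its field names, and post_user_interrupt are hypothetical names introduced here, the PIR field is modelled as a plain integer, and the UITT lookup, the reserved-bit checks, and the locked read-modify-write of the UPID are omitted.

```python
from dataclasses import dataclass


@dataclass
class Upid:
    """Hypothetical software model of a user posted-interrupt descriptor."""
    on: bool = False   # outstanding notification
    sn: bool = False   # suppress notification
    nv: int = 0        # notification vector
    ndst: int = 0      # notification destination (APIC ID)
    pir: int = 0       # posted-interrupt requests, one bit per user vector


def post_user_interrupt(upid, uv, x2apic):
    """Post user vector uv; return (vector, apic_id) of the IPI to send, or None."""
    upid.pir |= 1 << uv
    if not upid.sn and not upid.on:
        upid.on = True
        # x2APIC mode uses the full 32-bit NDST; otherwise NDST[15:8] is used.
        dest = upid.ndst if x2apic else (upid.ndst >> 8) & 0xFF
        return (upid.nv, dest)
    return None
```

A second post to the same descriptor while ON is still set only updates PIR and sends no further notification, which is the behavior the sendNotify variable expresses in the pseudocode.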
Description
Serializes instruction execution. Before the next instruction is fetched and executed, the SERIALIZE instruction
ensures that all modifications to flags, registers, and memory by previous instructions are completed, draining all
buffered writes to memory. This instruction is also a serializing instruction as defined in the section “Serializing
Instructions” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
SERIALIZE does not modify registers, arithmetic flags or memory.
The availability of the SERIALIZE instruction is indicated by the presence of the CPUID feature flag SERIALIZE, bit
14 of the EDX register in sub-leaf CPUID:7H.0H.
Operation
Wait_On_Fetch_And_Execution_Of_Next_Instruction_Until(preceding_instructions_complete_and_preceding_stores_globally_visible);
Other Exceptions
#UD If the LOCK prefix is used.
Description
STUI sets the user interrupt flag (UIF). Its effect takes place immediately; a user interrupt may be delivered on the
instruction boundary following STUI. (This is in contrast with STI, whose effect is delayed by one instruction).
An execution of STUI inside a transactional region causes a transactional abort; the abort loads EAX as it would
have had it been due to an execution of STI.
Operation
UIF := 1;
Flags Affected
None.
Description
TESTUI copies the current value of the user interrupt flag (UIF) into EFLAGS.CF. This instruction can be executed
regardless of CPL.
TESTUI may be executed normally inside a transactional region.
Operation
CF := UIF;
ZF := AF := OF := PF := SF := 0;
Flags Affected
The ZF, OF, AF, PF, and SF flags are cleared, and the CF flag is set to the value of the user interrupt flag.
Description
UIRET returns from the handling of a user interrupt. It can be executed regardless of CPL.
Execution of UIRET inside a transactional region causes a transactional abort; the abort loads EAX as it would have
had it been due to an execution of IRET.
UIRET can be tracked by Architectural Last Branch Records (LBRs), Intel Processor Trace (Intel PT), and Performance
Monitoring. For both Intel PT and LBRs, UIRET is recorded in precisely the same manner as IRET. Hence for
LBRs, UIRETs fall into the OTHER_BRANCH category, which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22]
must be set to record user-interrupt delivery, and that the IA32_LBR_x_INFO.BR_TYPE field will indicate
OTHER_BRANCH for any recorded user interrupt. For Intel PT, control flow tracing must be enabled by setting
IA32_RTIT_CTL.BranchEn[bit 13].
UIRET will also increment performance counters for which counting BR_INST_RETIRED.FAR_BRANCH is enabled.
Operation
Pop tempRIP;
Pop tempRFLAGS; // see below for how this is used to load RFLAGS
Pop tempRSP;
IF tempRIP is not canonical in current paging mode
THEN #GP(0);
FI;
IF shadow stack is enabled for CPL = 3
THEN
PopShadowStack SSRIP;
IF SSRIP ≠ tempRIP
THEN #CP (FAR-RET/IRET);
FI;
FI;
RIP := tempRIP;
// update in RFLAGS only CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID
RFLAGS := (RFLAGS & ~254DD5H) | (tempRFLAGS & 254DD5H);
RSP := tempRSP;
UIF := 1;
Clear any cache-line monitoring established by MONITOR or UMONITOR;
Flags Affected
See Operation section.
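The constant 254DD5H used in the Operation section is the union of the RFLAGS bit positions named in the preceding comment. The Python sketch below rebuilds the mask from those bit positions and applies the same merge; FLAG_BITS, UIRET_MASK, and uiret_rflags are names introduced for this illustration only.

```python
# RFLAGS bit positions of the flags that UIRET loads from the popped value.
FLAG_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "AC": 18, "ID": 21,
}

UIRET_MASK = 0
for bit in FLAG_BITS.values():
    UIRET_MASK |= 1 << bit


def uiret_rflags(current, popped):
    """Merge the popped image into RFLAGS, touching only the masked flags."""
    return (current & ~UIRET_MASK) | (popped & UIRET_MASK)
```

Flags outside the mask, such as IF (bit 9), keep their current value regardless of the popped image.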
VCVTNE2PS2BF16 — Convert Two Packed Single Data to One Packed BF16 Data
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Convert packed single data from xmm2 and xmm3/m128/m32bcst to packed BF16 data in xmm1 with writemask k1.
EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Convert packed single data from ymm2 and ymm3/m256/m32bcst to packed BF16 data in ymm1 with writemask k1.
EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F AVX512_BF16 | Convert packed single data from zmm2 and zmm3/m512/m32bcst to packed BF16 data in zmm1 with writemask k1.
Description
Converts two SIMD registers of packed single data into a single register of packed BF16 data.
This instruction does not support memory fault suppression.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated. No floating-point exceptions are
generated.
Operation
VCVTNE2PS2BF16 dest, src1, src2
VL = (128, 256, 512)
KL = VL/16
origdest := dest
FOR i := 0 to KL-1:
    IF k1[ i ] or *no writemask*:
        IF i < KL/2:
            IF src2 is memory and evex.b == 1:
                t := src2.fp32[0]
            ELSE:
                t := src2.fp32[ i ]
        ELSE:
            t := src1.fp32[ i-KL/2]
        dest.word[ i ] := convert_fp32_to_bfloat16(t) // helper defined in the following Operation section
    ELSE IF *zeroing*:
        dest.word[ i ] := 0
    ELSE: // Merge masking, dest element unchanged
        dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type E4NF.
Description
Converts one SIMD register of packed single data into a single register of packed BF16 data.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated.
As the instruction operand encoding table shows, the EVEX.vvvv field is not used for encoding an operand.
EVEX.vvvv is reserved and must be 0b1111 otherwise instructions will #UD.
Operation
Define convert_fp32_to_bfloat16(x):
    IF x is zero or denormal:
        dest[15] := x[31] // sign preserving zero (denormal go to zero)
        dest[14:0] := 0
    ELSE IF x is infinity:
        dest[15:0] := x[31:16]
    ELSE IF x is NAN:
        dest[15:0] := x[31:16] // truncate and set MSB of the mantissa to force QNAN
        dest[6] := 1
    ELSE // normal number
        LSB := x[16]
        rounding_bias := 0x00007FFF + LSB
        temp[31:0] := x[31:0] + rounding_bias // integer add
        dest[15:0] := temp[31:16]
    RETURN dest
origdest := dest
FOR i := 0 to KL/2-1:
    IF k1[ i ] or *no writemask*:
        IF src is memory and evex.b == 1:
            t := src.fp32[0]
        ELSE:
            t := src.fp32[ i ]
        dest.word[i] := convert_fp32_to_bfloat16(t)
    ELSE IF *zeroing*:
        dest.word[ i ] := 0
    ELSE: // Merge masking, dest element unchanged
        dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
See Exceptions Type E4.
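The convert_fp32_to_bfloat16 helper in the Operation section operates purely on bit patterns, so it can be modelled with integer arithmetic. The following Python sketch mirrors the four cases of the pseudocode; fp32_bits is a helper introduced here only to obtain the bit pattern of a Python float.

```python
import struct


def fp32_bits(value):
    """Return the IEEE-754 single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def convert_fp32_to_bfloat16(x):
    """x is a 32-bit FP32 bit pattern; the result is a 16-bit BF16 bit pattern."""
    exponent = (x >> 23) & 0xFF
    fraction = x & 0x7FFFFF
    if exponent == 0:
        # Zero or denormal input: signed zero.
        return (x >> 16) & 0x8000
    if exponent == 0xFF:
        if fraction == 0:
            return x >> 16              # infinity is truncated unchanged
        return (x >> 16) | 0x0040       # NaN: truncate and force a quiet NaN
    lsb = (x >> 16) & 1
    rounding_bias = 0x7FFF + lsb        # round to nearest, ties to even
    return ((x + rounding_bias) & 0xFFFFFFFF) >> 16
```

The rounding bias depends on the bit that becomes the BF16 least significant bit, so an input exactly halfway between two BF16 values rounds to the one whose low bit is zero.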
VDPBF16PS — Dot Product of BF16 Pairs Accumulated into Packed Single Precision
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Multiply BF16 pairs from xmm2 and xmm3/m128, and accumulate the resulting packed single precision results in xmm1 with writemask k1.
EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Multiply BF16 pairs from ymm2 and ymm3/m256, and accumulate the resulting packed single precision results in ymm1 with writemask k1.
EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F AVX512_BF16 | Multiply BF16 pairs from zmm2 and zmm3/m512, and accumulate the resulting packed single precision results in zmm1 with writemask k1.
Description
This instruction performs a SIMD dot-product of two BF16 pairs and accumulates into a packed single precision
register.
“Round to nearest even” rounding mode is used when doing each accumulation of the FMA. Output denormals are
always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated.
Operation
Define make_fp32(x):
    // The x parameter is bfloat16. Pack it in to upper 16b of a dword.
    // The bit pattern is a legal fp32 value. Return that bit pattern.
    dword := 0
    dword[31:16] := x
    RETURN dword
origdest := srcdest
FOR i := 0 to KL-1:
    IF k1[ i ] or *no writemask*:
        IF src2 is memory and evex.b == 1:
            t := src2.dword[0]
        ELSE:
            t := src2.dword[ i ]
        // FP32 FMA with daz in, ftz out and RNE rounding. MXCSR neither consulted nor updated.
    ELSE IF *zeroing*:
        srcdest.dword[ i ] := 0
    ELSE: // merge masking, dest element unchanged
        srcdest.dword[ i ] := origdest.dword[ i ]
srcdest[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type E4.
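As an illustration of how one destination element is formed, the Python sketch below widens each BF16 half of a source dword with make_fp32 and accumulates the two products. It is a simplified model under stated assumptions: the name dpbf16ps_element and the order in which the two products are added are choices made for this example, Python floats are double precision, and the denormal handling and per-step rounding described above are not modelled.

```python
import struct


def make_fp32(x):
    """Place a 16-bit BF16 pattern in the upper half of a dword and read it as FP32."""
    return struct.unpack("<f", struct.pack("<I", (x & 0xFFFF) << 16))[0]


def dpbf16ps_element(acc, a_dword, b_dword):
    """Accumulate the dot product of the two BF16 pairs held in a_dword and b_dword."""
    for half in (1, 0):  # assumed order: upper pair element first, then lower
        a = make_fp32(a_dword >> (16 * half))
        b = make_fp32(b_dword >> (16 * half))
        acc += a * b
    return acc
```

For example, with BF16 patterns 0x3F80 (1.0) and 0x4000 (2.0), the dwords 0x40003F80 and 0x3F804000 contribute 2.0 * 1.0 + 1.0 * 2.0 to the accumulator.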
Description
This instruction writes an even/odd pair of mask registers. The mask register destination indicated in the
MODRM.REG field is used to form the basis of the register pair. The low bit of that field is masked off (set to zero)
to create the first register of the pair.
EVEX.aaa and EVEX.z must be zero.
Operation
VP2INTERSECTD destmask, src1, src2
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR i := 0 to KL-1:
    FOR j := 0 to KL-1:
        match := (src1.dword[i] == src2.dword[j])
        maskregs[dest_base+0].bit[i] |= match
        maskregs[dest_base+1].bit[j] |= match
VP2INTERSECTQ destmask, src1, src2
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR i := 0 to KL-1:
    FOR j := 0 to KL-1:
        match := (src1.qword[i] == src2.qword[j])
        maskregs[dest_base+0].bit[i] |= match
        maskregs[dest_base+1].bit[j] |= match
Other Exceptions
See Exceptions Type E4NF.
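The all-pairs comparison can be sketched in Python as follows. The function names are introduced for this illustration, the sources are given as lists of element values, and both result masks are assumed to start at zero before the comparison.

```python
def mask_pair(modrm_reg):
    """Return the even/odd mask register numbers selected by MODRM.REG."""
    base = modrm_reg & ~1  # low bit masked off to form the first register
    return base, base + 1


def vp2intersect(src1, src2):
    """Return (even_mask, odd_mask) for two equally sized element lists."""
    even_mask = 0
    odd_mask = 0
    for i, a in enumerate(src1):
        for j, b in enumerate(src2):
            if a == b:
                even_mask |= 1 << i  # element i of src1 matched something
                odd_mask |= 1 << j   # element j of src2 matched something
    return even_mask, odd_mask
```

Bit i of the first mask records that src1 element i occurs somewhere in src2, and bit j of the second mask records that src2 element j occurs somewhere in src1.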
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.
This instruction supports memory fault suppression.
Operation
VPDPBUSD dest, src1, src2
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
VPDPBUSDS — Multiply and Add Unsigned and Signed Bytes with Saturation
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 51 /r VPDPBUSDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX_VNNI | Multiply groups of 4 pairs signed bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to doubleword result, with signed saturation in xmm1.
VEX.256.66.0F38.W0 51 /r VPDPBUSDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX_VNNI | Multiply groups of 4 pairs signed bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to doubleword result, with signed saturation in ymm1.
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand. If the intermediate sum overflows a 32b signed number, the result
is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPBUSDS dest, src1, src2
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
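The per-element arithmetic described above can be sketched in Python. The function names are introduced for this example; the accumulator is passed as a signed Python integer and the four source bytes of each operand as lists of values in the range 0..255. The wraparound used when saturation is not requested is an assumption of this model.

```python
def sign_extend8(b):
    """Interpret an 8-bit pattern as a signed byte."""
    return b - 256 if b & 0x80 else b


def saturate_i32(v):
    """Clamp to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, v))


def wrap_i32(v):
    """Reduce to 32 bits and reinterpret as signed."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def vpdpbusd_element(acc, src1_bytes, src2_bytes, saturate=False):
    """One dword lane: unsigned src1 bytes times signed src2 bytes, summed into acc."""
    total = acc
    for u, s in zip(src1_bytes, src2_bytes):
        total += u * sign_extend8(s)
    return saturate_i32(total) if saturate else wrap_i32(total)
```

With saturate=True the model follows the saturating form; with the default it follows the non-saturating form.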
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.
This instruction supports memory fault suppression.
Operation
VPDPWSSD dest, src1, src2
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
    p1dword := SRC1.word[2*i+0] * SRC2.word[2*i+0]
    p2dword := SRC1.word[2*i+1] * SRC2.word[2*i+1]
    DEST.dword[i] := ORIGDEST.dword[i] + p1dword + p2dword
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand. If the intermediate sum overflows a 32b signed number, the
result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPWSSDS dest, src1, src2
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
    p1dword := SRC1.word[2*i+0] * SRC2.word[2*i+0]
    p2dword := SRC1.word[2*i+1] * SRC2.word[2*i+1]
    DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
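A whole-vector model of the saturating word form can be written in a few lines of Python. This is a sketch for illustration: vpdpwssds, to_signed, and signed_dword_saturate are names chosen here, the destination is a list of signed dword values, and each source is a list of 16-bit word patterns twice as long as the destination.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def signed_dword_saturate(value):
    """Clamp to the range 0x8000_0000 .. 0x7FFF_FFFF (as signed values)."""
    return max(-(1 << 31), min((1 << 31) - 1, value))


def vpdpwssds(dest, src1, src2):
    """Return the updated destination dwords for word sources src1 and src2."""
    out = []
    for i, acc in enumerate(dest):
        p1 = to_signed(src1[2 * i], 16) * to_signed(src2[2 * i], 16)
        p2 = to_signed(src1[2 * i + 1], 16) * to_signed(src2[2 * i + 1], 16)
        out.append(signed_dword_saturate(acc + p1 + p2))
    return out
```

The case of four source words equal to 0x8000 shows why saturation is needed: the two products sum to 2^31, one more than the largest signed dword.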
Description
The WBNOINVD instruction writes back all modified cache lines in the processor’s internal cache to main memory
but does not invalidate (flush) the internal caches.
After executing this instruction, the processor does not wait for the external caches to complete their write-back
operation before proceeding with instruction execution. It is the responsibility of hardware to respond to the cache
write-back signal. The amount of time or cycles for WBNOINVD to complete will vary due to size and other factors
of different cache hierarchies. As a consequence, the use of the WBNOINVD instruction can have an impact on
logical processor interrupt/event response time.
The WBNOINVD instruction is a privileged instruction. When the processor is running in protected mode, the CPL of
a program or procedure must be 0 to execute this instruction. This instruction is also a serializing instruction (see
“Serializing Instructions” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).
In situations where cache coherency with main memory is not a concern, software can use the INVD instruction.
This instruction’s operation is the same in non-64-bit modes and 64-bit mode.
Operation
WriteBack(InternalCaches);
Continue; (* Continue execution *)
Flags Affected
None.
Description
The instruction marks the end of an Intel TSX (RTM) suspend load address tracking region. If the instruction is used
inside a suspend load address tracking region it will end the suspend region and all following load addresses will be
added to the transaction read set. If this instruction is used inside an active transaction but not in a suspend region
it will cause transaction abort.
If the instruction is used outside of a transactional region it behaves like a NOP.
Chapter 5 provides additional information on Intel® TSX Suspend Load Address Tracking.
Operation
XRESLDTRK
IF RTM_ACTIVE = 1:
    IF SUSLDTRK_ACTIVE = 1:
        SUSLDTRK_ACTIVE := 0
    ELSE:
        RTM_ABORT
ELSE:
    NOP
Flags Affected
None.
Other Exceptions
#UD If CPUID.(EAX=7, ECX=0):EDX.TSXLDTRK[bit 16] = 0.
If the LOCK prefix is used.
Description
The instruction marks the start of an Intel TSX (RTM) suspend load address tracking region. If the instruction is
used inside a transactional region, subsequent loads are not added to the read set of the transaction. If the
instruction is used inside a suspend load address tracking region it will cause transaction abort.
If the instruction is used outside of a transactional region it behaves like a NOP.
Chapter 5 provides additional information on Intel® TSX Suspend Load Address Tracking.
Operation
XSUSLDTRK
IF RTM_ACTIVE = 1:
    IF SUSLDTRK_ACTIVE = 0:
        SUSLDTRK_ACTIVE := 1
    ELSE:
        RTM_ABORT
ELSE:
    NOP
Flags Affected
None.
Other Exceptions
#UD If CPUID.(EAX=7, ECX=0):EDX.TSXLDTRK[bit 16] = 0.
If the LOCK prefix is used.
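The pairing rules of XSUSLDTRK and XRESLDTRK form a small state machine, sketched below in Python. The class RtmModel and its attribute names are hypothetical; in particular, the way an abort is modelled (clearing both state bits and recording an aborted flag) is an assumption made for this example.

```python
class RtmModel:
    """Toy model of the RTM_ACTIVE and SUSLDTRK_ACTIVE state bits."""

    def __init__(self):
        self.rtm_active = False
        self.susldtrk_active = False
        self.aborted = False

    def _abort(self):
        # Assumed abort model: leave the transaction and clear tracking state.
        self.aborted = True
        self.rtm_active = False
        self.susldtrk_active = False

    def xsusldtrk(self):
        if not self.rtm_active:
            return  # behaves like a NOP outside a transactional region
        if not self.susldtrk_active:
            self.susldtrk_active = True
        else:
            self._abort()  # suspend regions do not nest

    def xresldtrk(self):
        if not self.rtm_active:
            return  # behaves like a NOP outside a transactional region
        if self.susldtrk_active:
            self.susldtrk_active = False
        else:
            self._abort()  # resume without a matching suspend
```

A suspend followed by a resume inside a transaction leaves the model unaborted; a second resume, or a nested suspend, triggers the abort path.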
CHAPTER 3
INTEL® AMX INSTRUCTION SET REFERENCE, A-Z
3.1 INTRODUCTION
Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two
components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory
image, and an accelerator able to operate on tiles. The first accelerator implementation is called TMUL (tile
matrix multiply unit).
An Intel AMX implementation enumerates to the programmer how the tiles can be programmed by providing a
palette of options. Two palettes are supported; palette 0 represents the initialized state, and palette 1 consists of
8 KB of storage spread across 8 tile registers named TMM0..TMM7. Each tile has a maximum size of 16 rows x 64
bytes (1 KB); however, the programmer can configure each tile to smaller dimensions appropriate to their
algorithm. The tile dimensions supplied by the programmer (rows and bytes_per_row, i.e., colsb) are metadata that
drives the execution of tile and accelerator instructions. In this way, a single instruction can launch autonomous
multi-cycle execution in the tile and accelerator hardware. The palette value (palette_id) and metadata are held
internally in a tile related control register (TILECFG). The TILECFG contents will be commensurate with that
reported in the palette_table (see “CPUID—CPU Identification” in Chapter 1 for a description of the available
parameters).
Intel AMX is an extensible architecture. New accelerators can be added, or the TMUL accelerator may be enhanced
to provide higher performance. In these cases, the state (TILEDATA) provided by tiles may need to be made larger,
either in one of the metadata dimensions (more rows or colsb) and/or by supporting more names. The extensibility
is carried out by adding new palette entries describing the additional state. Since execution is driven through
metadata, an existing Intel AMX binary could take advantage of larger storage sizes and higher performance TMUL
units by selecting the most powerful palette indicated by CPUID and adjusting loop and pointer updates accordingly.
Figure 3-1 shows a conceptual diagram of the Intel AMX architecture. An Intel architecture host drives the
algorithm, the memory blocking, loop indices and pointer arithmetic. Tile loads and stores and accelerator commands
are sent to multi-cycle execution units. Status, if required, is reported back. Intel AMX instructions are
synchronous in the Intel architecture instruction stream and the memory loaded and stored by the tile instructions is
coherent with respect to the host’s memory accesses. There are no restrictions on interleaving of Intel architecture
and Intel AMX code or restrictions on the resources the host can use in parallel with Intel AMX (e.g., Intel
AVX-512). There is also no architectural requirement on the Intel architecture compute capability of the Intel
architecture host other than it supports 64-bit mode.
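[Figure 3-1: conceptual diagram of the Intel AMX architecture. Labels recoverable from the figure: TILECFG; tile
registers tmm0, tmm1, ..., tmm[n-1]; Accelerator 2; Coherent Memory Interface.]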
Intel AMX instructions use new registers and inherit basic behavior from Intel architecture in the same manner that
Intel SSE and Intel AVX did. Tile instructions include loads and stores using the traditional Intel architecture
register set as pointers. The TMUL instruction set (defined to be CPUID bits AMX-BF16 and AMX-INT8) only
supports reg-reg operations.
TILECFG is programmed using the LDTILECFG instruction. The selected palette defines the available storage and
general configuration while the rest of the memory data specifies the number of rows and column bytes for each
tile. Consistency checks are performed to ensure the TILECFG matches the restrictions of the palette. A General
Protection fault (#GP) is reported if the LDTILECFG fails consistency checks. A successful load of
TILECFG with a palette_id other than 0 is represented in this document with TILES_CONFIGURED = 1. When the
TILECFG is initialized (palette_id = 0), it is represented in the document as TILES_CONFIGURED = 0. Nearly all
Intel AMX instructions will generate a #UD exception if TILES_CONFIGURED is not equal to 1; the exceptions are
those that do TILECFG maintenance: LDTILECFG, STTILECFG and TILERELEASE.
If two tiles are configured to contain M rows by N columns of 4-byte data, and two tiles to contain M rows by N
columns of 8-byte data, LDTILECFG will ensure that the metadata values are appropriate to the palette (e.g., that
rows ≤ 16 and N ≤ 64 for palette 1). The four M and N values can all be different as long as they adhere to the
restrictions of the palette. Further dynamic checks are done in the tile and the TMUL instruction set to deal with
cases where a legally configured tile may be inappropriate for the instruction operation. Tile registers can be set to
‘invalid’ by configuring the rows and colsb to ‘0’.
Tile loads and stores are strided accesses from the application memory to packed rows of data. Algorithms are
expressed assuming row major data layout. Column major users should translate the terms according to their
orientation.
TILELOAD* and TILESTORE* instructions are restartable and can handle (up to) 2*rows page faults per instruction.
Restartability is provided by a start_row parameter in the TILECFG register.
The TMUL unit is conceptually a grid of fused multiply-add units able to read and write tiles. The dimensions of the
TMUL unit (tmul_maxk and tmul_maxn) are enumerated similar to the maximum dimensions of the tiles (see
“CPUID—CPU Identification” in Chapter 1 for details).
The matrix multiplications in the TMUL instruction set compute C[M][N] += A[M][K] * B[K][N]. The M, N and K
values will cause the TMUL instruction set to generate a #UD exception if the following constraints are not met:
• M : ≤ palette.max_rows
• K : ≤ colsb / element_size (A), ≤ palette.max_rows (B) and ≤ tmul_maxk
• N : ≤ colsb / element_size (C and B), ≤ tmul_maxn
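The three constraints above can be transcribed into a single check, shown here as a Python sketch. The function name, the use of one element_size for all three tiles, and the numeric limits used when calling it are assumptions for illustration; real limits would come from the palette and from the CPUID-enumerated tmul_maxk and tmul_maxn values.

```python
def tmul_shape_ok(m, k, n, a_colsb, b_colsb, c_colsb,
                  element_size, max_rows, tmul_maxk, tmul_maxn):
    """Return True if M, K and N satisfy the constraints listed above."""
    m_ok = m <= max_rows
    k_ok = (k <= a_colsb // element_size   # columns of A
            and k <= max_rows              # rows of B
            and k <= tmul_maxk)
    n_ok = (n <= c_colsb // element_size   # columns of C
            and n <= b_colsb // element_size  # columns of B
            and n <= tmul_maxn)
    return m_ok and k_ok and n_ok
```

For instance, with 64-byte rows, 4-byte elements, and hypothetical limits max_rows = tmul_maxk = tmul_maxn = 16, a 16 x 16 x 16 multiplication passes the check while M = 17 does not.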
In Figure 3-2, the number of rows in tile B matches the K dimension in the matrix multiplication pseudocode. K
dimensions smaller than that enumerated in the TMUL grid are also possible and any additional computation the
TMUL unit can support will not affect the result.
The number of elements specified by colsb of the B matrix is also less than or equal to tmul_maxn. Any remaining
values beyond that specified by the metadata will be set to zero.
[Figure 3-2: matrix multiplication C[M][N] += A[M][K] * B[K][N] on the TMUL unit. Labels recoverable from the
figure: tile C[M][N], tile A[M][K], tile B[K][N], and rows B[0][:N] through B[K-1][:N] of tile B.]
The XSAVE feature set supports context management of the new state defined for Intel AMX. This support is
described in Section 3.2.
LDTILECFG [rax]
// assume some outer loops driving the cache tiling (not shown)
{
TILELOADD tmm0, [rsi+rdi] // srcdst, RSI points to C, RDI is strided value
TILELOADD tmm1, [rsi+rdi+N] // second tile of C, unrolling in SIMD dimension N
MOV r14, 0
LOOP:
TILELOADD tmm2, [r8+r9] // src2 is strided load of A, reused for 2 TMUL instr.
TILELOADD tmm3, [r10+r11] // src1 is strided load of B
TDPBUSD tmm0, tmm2, tmm3 // update left tile of C
TILELOADD tmm3, [r10+r11+N] // src1 loaded with B from next rightmost tile
TDPBUSD tmm1, tmm2, tmm3 // update right tile of C
ADD r8, K // update pointers by constants known outside of loop
ADD r10, K*r11
ADD r14, K
CMP r14, LIMIT
JNE LOOP
using a 1 KByte tile with 64-byte rows, there would be 16 rows, so in this example, the last 6 rows would also be
zeroed.
Intel AMX instructions will always obey the metadata on reads and the zeroing rules on writes, and so a subsequent
XSAVE would see zeros in the appropriate locations. Tiles that are not written by Intel AMX instructions between
XRSTOR and XSAVE will write back with the same image they were loaded with regardless of the value of TILECFG.
When XFD causes an instruction to generate #NM, the processor loads the IA32_XFD_ERR MSR to identify the
disabled state component(s). Specifically, the MSR is loaded with the logical AND of the IA32_XFD MSR and the
bitmap corresponding to the state components required by the faulting instruction. (Intel AMX instructions require
XTILECFG state and XTILEDATA state to be enabled.)
Device-not-available exceptions that are not due to XFD — those resulting from setting CR0.TS to 1 — do not
modify the IA32_XFD_ERR MSR.
define palette_table[id]:
    uint16_t total_tile_bytes
    uint16_t bytes_per_tile
    uint16_t bytes_per_row
    uint16_t max_names
    uint16_t max_rows
define zero_tilecfg_start():
    tilecfg.start_row := 0
define zero_all_tile_data():
    if XCR0[XTILEDATA]:
        b := CPUID(0xD,XTILEDATA).EAX // size of feature
        for j in 0 ... b:
            TILEDATA.byte[j] := 0
define xcr0_supports_palette(palette_id):
    if palette_id == 0:
        return 1
    elif palette_id == 1:
        if XCR0[XTILECFG] and XCR0[XTILEDATA]:
            return 1
    return 0
3.6 NOTATION
Instructions described in this chapter follow the general documentation convention established in Intel® 64 and
IA-32 Architectures Software Developer’s Manual Volume 2A. Additionally, Intel® Advanced Matrix Extensions use
notation conventions as described below.
In the instruction encoding boxes, sibmem is used to denote an encoding where a MODRM byte and SIB byte are
used to indicate a memory operation where the base and displacement are used to point to memory, and the index
register (if present) is used to denote a stride between memory rows. The index register is scaled by the sib.scale
field as usual. The base register is added to the displacement, if present.
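The sibmem addressing rule can be illustrated with a short Python sketch that computes the address of each tile row. The function name and parameters are introduced for this example; scale is the 2-bit sib.scale field, which encodes a multiplier of 1, 2, 4, or 8, and addresses are truncated to 64 bits.

```python
def sibmem_row_address(base, displacement, index, scale, row):
    """Address of tile row `row` for a sibmem operand."""
    start = base + displacement   # base register plus displacement
    stride = index << scale       # index register scaled by sib.scale
    return (start + row * stride) & 0xFFFFFFFFFFFFFFFF
```

With base 0x1000, displacement 0x40, and an index register holding 64 at scale 0, row 3 is read from 0x1100; a missing index register corresponds to index 0, so every row maps to the same address.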
In the instruction encoding, the MODRM byte is represented several ways depending on the role it plays. The
MODRM byte has 3 fields: 2-bit MODRM.MOD field, a 3-bit MODRM.REG field and a 3-bit MODRM.RM field. When all
bits of the MODRM byte have fixed values for an instruction, the 2-hex nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the MODRM byte
must contain fixed values, those values are specified as follows:
• If only the MODRM.MOD must be 0b11, and MODRM.REG and MODRM.RM fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr corresponds to the 3 bits of the MODRM.REG field and the bbb corresponds to
the 3 bits of the MODRM.RM field.
• If the MODRM.MOD field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then we use the notation !(11).
• If the MODRM.REG field had a specific required value, e.g., 0b101, that would be denoted as mm:101:bbb.
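The field layout behind this notation can be shown with a small Python sketch that splits a MODRM byte into its three fields and prints them in the mm:rrr:bbb form used in the encoding boxes. The function names are introduced for this illustration.

```python
def split_modrm(byte):
    """Return (mod, reg, rm) for a MODRM byte."""
    mod = (byte >> 6) & 0b11    # 2-bit MODRM.MOD
    reg = (byte >> 3) & 0b111   # 3-bit MODRM.REG
    rm = byte & 0b111           # 3-bit MODRM.RM
    return mod, reg, rm


def modrm_notation(byte):
    """Format a MODRM byte as mod:reg:rm in binary."""
    mod, reg, rm = split_modrm(byte)
    return f"{mod:02b}:{reg:03b}:{rm:03b}"
```

For example, the byte 0xC8 splits into MOD = 0b11, REG = 0b001, RM = 0b000, which matches the pattern 11:rrr:bbb.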
NOTE
Historically, the Intel® 64 and IA-32 Architectures Software Developer’s Manual only specified the MODRM.REG
field restrictions with the notation /0 ... /7 and did not specify restrictions on the MODRM.MOD and
MODRM.RM fields in the encoding boxes.
Description
The LDTILECFG instruction takes an operand containing a pointer to a 64-byte memory location containing the
description of the tiles to be supported. In order to configure the tiles, the AMX-TILE bit in CPUID must be set and
the operating system has to have enabled the tiles architecture.
The memory area first describes the number of tiles selected and then selects from the palette of tile types.
Requests must be compatible with the restrictions provided by CPUID.
The memory area describes how many tiles are being used and defines each tile in terms of rows and columns; see
Table 3-1 below.
If a tile row and column pair is not used to specify tile parameters, they must have the value zero. All enabled tiles
(based on the palette) must be configured. Specifying tile parameters for more tiles than the implementation limit
or the palette limit results in a #GP fault.
If the palette_id is zero, that signifies the INIT state for both XTILECFG and XTILEDATA. Tiles are zeroed in the
INIT state. The only legal non-INIT value for palette_id is 1.
Any attempt to execute the LDTILECFG instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
LDTILECFG mem
error := False
buf := read_memory(mem, 64)
temp_tilecfg.palette_id := buf.byte[0]
if temp_tilecfg.palette_id > max_palette:
    error := True
if not xcr0_supports_palette(temp_tilecfg.palette_id):
    error := True
if temp_tilecfg.palette_id != 0:
    temp_tilecfg.start_row := buf.byte[1]
    if buf.byte[2..15] is nonzero:
        error := True
    p := 16
    # configure columns
    for n in 0 ... palette_table[temp_tilecfg.palette_id].max_names-1:
        temp_tilecfg.t[n].colsb := buf.word[p/2]
        p := p + 2
        if temp_tilecfg.t[n].colsb > palette_table[temp_tilecfg.palette_id].bytes_per_row:
            error := True
    if nonzero(buf[p...47]):
        error := True
    # configure rows
    p := 48
    for n in 0 ... palette_table[temp_tilecfg.palette_id].max_names-1:
        temp_tilecfg.t[n].rows := buf.byte[p]
        if temp_tilecfg.t[n].rows > palette_table[temp_tilecfg.palette_id].max_rows:
            error := True
        p := p + 1
    if nonzero(buf[p...63]):
        error := True
if error:
    #GP
elif temp_tilecfg.palette_id == 0:
    TILES_CONFIGURED := 0 // init state
    tilecfg := 0 // equivalent to 64B of zeros
    zero_all_tile_data()
else:
    tilecfg := temp_tilecfg
    zero_all_tile_data()
    TILES_CONFIGURED := 1
Flags Affected
None.
Exceptions
AMX-E1; see Section 3.7, “Exception Classes” for details.
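The layout checks performed by LDTILECFG can be mirrored in software, for example to validate a configuration block before loading it. The Python sketch below is a minimal model under stated assumptions: parse_tilecfg and TileCfgError are hypothetical names, the palette 1 limits are hard-coded from the values given in Section 3.1 rather than read from CPUID, the XCR0 check is omitted, and words are read in little-endian order.

```python
# Palette 1 limits as described in Section 3.1 (8 tiles, 16 rows x 64 bytes).
# Hard-coded for illustration; software would normally query CPUID.
PALETTE_1 = {"max_names": 8, "bytes_per_row": 64, "max_rows": 16}


class TileCfgError(ValueError):
    """Stands in for the #GP reported on a failed consistency check."""


def parse_tilecfg(buf):
    """Validate a 64-byte configuration block; return None for the INIT state."""
    if len(buf) != 64:
        raise TileCfgError("configuration block must be 64 bytes")
    palette_id = buf[0]
    if palette_id == 0:
        return None  # INIT state, TILES_CONFIGURED = 0
    if palette_id != 1:
        raise TileCfgError("unsupported palette_id")
    names = PALETTE_1["max_names"]
    if any(buf[2:16]):
        raise TileCfgError("bytes 2..15 must be zero")
    tiles = []
    for n in range(names):
        colsb = int.from_bytes(buf[16 + 2 * n:18 + 2 * n], "little")
        rows = buf[48 + n]
        if colsb > PALETTE_1["bytes_per_row"] or rows > PALETTE_1["max_rows"]:
            raise TileCfgError(f"tile {n} exceeds palette limits")
        tiles.append((rows, colsb))
    # Slots beyond max_names in the colsb and rows areas must be zero.
    if any(buf[16 + 2 * names:48]) or any(buf[48 + names:64]):
        raise TileCfgError("unused tile slots must be zero")
    return {"palette_id": palette_id, "start_row": buf[1], "tiles": tiles}
```

A block with palette_id 1, colsb 64 at byte offset 16, and rows 16 at byte offset 48 describes a single full-size tile 0 with all other tiles left at zero rows and zero column bytes.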
Description
The STTILECFG instruction takes a pointer to a 64-byte memory location (described in Table 3-1) that will, after
successful execution of this instruction, contain the description of the tiles that were configured. In order to
configure tiles, the AMX-TILE bit in CPUID must be set and the operating system has to have enabled the tiles
architecture.
If the tiles are not configured, then STTILECFG stores 64B of zeros to the indicated memory location.
Any attempt to execute the STTILECFG instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
STTILECFG mem
if TILES_CONFIGURED == 0:
    //write 64 bytes of zeros at mem pointer
    buf[0..63] := 0
    write_memory(mem, 64, buf)
else:
    buf.byte[0] := tilecfg.palette_id
    buf.byte[1] := tilecfg.start_row
    buf.byte[2..15] := 0
    p := 16
    for n in 0 ... palette_table[tilecfg.palette_id].max_names-1:
        buf.word[p/2] := tilecfg.t[n].colsb
        p := p + 2
    if p < 47:
        buf.byte[p..47] := 0
    p := 48
    for n in 0 ... palette_table[tilecfg.palette_id].max_names-1:
        buf.byte[p++] := tilecfg.t[n].rows
    if p < 63:
        buf.byte[p..63] := 0
    write_memory(mem, 64, buf)
Flags Affected
None.
Exceptions
AMX-E2; see Section 3.7, “Exception Classes” for details.
TDPBF16PS — Dot Product of BF16 Tiles Accumulated into Packed Single Precision Tile
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.F3.0F38.W0 5C 11:rrr:bbb TDPBF16PS tmm1, tmm2, tmm3 | A | V/N.E. | AMX-BF16 | Matrix multiply BF16 elements from tmm2 and tmm3, and accumulate the packed single precision elements in tmm1.
Description
This instruction performs a set of SIMD dot-products of two BF16 elements and accumulates the results into a
packed single precision tile. Each dword element in input tiles tmm2 and tmm3 is interpreted as a BF16 pair. For
each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD dot-products
on all corresponding BF16 pairs (one pair from tmm2 and one pair from tmm3), adds the results of those
dot-products, and then accumulates the result into the corresponding row and column of tmm1.
“Round to nearest even” rounding mode is used when doing each accumulation of the FMA. Output denormals are
always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated.
Any attempt to execute the TDPBF16PS instruction inside a TSX transaction will result in a transaction abort.
Operation
define make_fp32(x):
    // The x parameter is bfloat16. Pack it in to upper 16b of a dword.
    // The bit pattern is a legal fp32 value. Return that bit pattern.
    dword := 0
    dword[31:16] := x
    return dword
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tilecfg_start()
Flags Affected
None.
Exceptions
AMX-E4; see Section 3.7, “Exception Classes” for details.
Description
For each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD
dot-products on all corresponding four byte elements, one from tmm2 and one from tmm3, adds the results of those
dot-products, and then accumulates the result into the corresponding row and column of tmm1. Each dword in input
tiles tmm2 and tmm3 is interpreted as four byte elements. These may be signed or unsigned. Each letter in the
two-letter pattern SU, US, SS, UU indicates the signed/unsigned nature of the values in tmm2 and tmm3,
respectively.
Any attempt to execute the TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD instructions inside an Intel TSX transaction
will result in a transaction abort.
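As an illustration of the per-dword computation described above, the following Python sketch multiplies the four byte pairs of two dwords and adds the products to an accumulator. The function names extend and dpbd are hypothetical, and wrapping the result to 32 bits is an assumption of this sketch rather than something stated in this section.

```python
def extend(byte, signed):
    """Sign- or zero-extend an 8-bit value to a Python int."""
    byte &= 0xFF
    return byte - 0x100 if signed and (byte & 0x80) else byte

def dpbd(c, x, y, x_signed, y_signed):
    """Add the four byte products of dwords x and y to accumulator c.
    The final wrap to 32 bits is an assumption of this sketch."""
    for i in range(4):
        xb = extend(x >> (8 * i), x_signed)
        yb = extend(y >> (8 * i), y_signed)
        c += xb * yb
    return c & 0xFFFFFFFF
```

Under this model, an SU variant would call dpbd with x_signed=True and y_signed=False, so the byte FFH in the first source is treated as -1 rather than 255.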
Operation
define DPBD(c,x,y): // arguments are dwords
if *x operand is signed*:
extend_src1 := SIGN_EXTEND
else:
extend_src1 := ZERO_EXTEND
if *y operand is signed*:
extend_src2 := SIGN_EXTEND
else:
extend_src2 := ZERO_EXTEND
TDPBSSD, TDPBSUD, TDPBUSD, TDPBUUD tsrcdest, tsrc1, tsrc2 (Register Only Version)
// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)
tsrc1_elements_per_row := tsrc1.colsb / 4
tsrc2_elements_per_row := tsrc2.colsb / 4
tsrcdest_elements_per_row := tsrcdest.colsb / 4
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tilecfg_start()
Flags Affected
None.
Exceptions
AMX-E4; see Section 3.7, “Exception Classes” for details.
Description
This instruction is required to use SIB addressing. The index register serves as a stride indicator. If the SIB
encoding omits an index register, the value zero is assumed for the content of the index register.
This instruction loads a tile destination with rows and columns as specified by the tile configuration. The “T1”
version provides a hint to the implementation that the data will likely not be reused in the near future and the data
caching can be optimized accordingly.
The TILECFG.start_row in the XTILECFG data should be initialized to '0' in order to load the entire tile and is set to
zero on successful completion of the TILELOADD instruction. TILELOADD is a restartable instruction and the
TILECFG.start_row will be non-zero when restartable events occur during the instruction execution.
Only memory operands are supported and they can only be accessed using a SIB addressing mode, similar to the
V[P]GATHER*/V[P]SCATTER* instructions.
Any attempt to execute the TILELOADD/TILELOADDT1 instructions inside an Intel TSX transaction will result in a
transaction abort.
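The role of the index register as a stride can be illustrated with a small Python model. This is a sketch under stated assumptions: the function tile_load and its parameters are hypothetical, memory is modeled as a bytes object, and the address of each row is assumed to be base + row * stride. Restart handling via start_row is not modeled.

```python
def tile_load(memory, base, stride, rows, colsb):
    """Return the tile as a list of `rows` byte strings of `colsb` bytes.
    Assumption of this sketch: row r is read from base + r * stride."""
    tile = []
    for r in range(rows):
        addr = base + r * stride
        tile.append(bytes(memory[addr:addr + colsb]))
    return tile
```

With a stride of zero (the value assumed when the SIB encoding omits an index register), every row in this model is read from the same address.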
Operation
TILELOADD[,T1] tdest, tsib
start := tilecfg.start_row
zero_upper_rows(tdest,start)
Flags Affected
None.
Exceptions
AMX-E3; see Section 3.7, “Exception Classes” for details.
Description
This instruction returns TILECFG and TILEDATA to the INIT state.
Any attempt to execute the TILERELEASE instruction inside an Intel TSX transaction will result in a transaction
abort.
Operation
zero_all_tile_data()
tilecfg := 0 // equivalent to 64B of zeros
TILES_CONFIGURED := 0
Flags Affected
None.
Exceptions
AMX-E6; see Section 3.7, “Exception Classes” for details.
Description
This instruction is required to use SIB addressing. The index register serves as a stride indicator. If the SIB
encoding omits an index register, the value zero is assumed for the content of the index register.
This instruction stores a tile source of rows and columns as specified by the tile configuration.
The TILECFG.start_row in the XTILECFG data should be initialized to '0' in order to store the entire tile and is set
to zero on successful completion of the TILESTORED instruction. TILESTORED is a restartable instruction and the
TILECFG.start_row will be non-zero when restartable events occur during the instruction execution.
Only memory operands are supported and they can only be accessed using a SIB addressing mode, similar to the
V[P]GATHER*/V[P]SCATTER* instructions.
Any attempt to execute the TILESTORED instruction inside an Intel TSX transaction will result in a transaction
abort.
Operation
TILESTORED tsib, tsrc
start := tilecfg.start_row
Flags Affected
None.
Exceptions
AMX-E3; see Section 3.7, “Exception Classes” for details.
Description
This instruction zeroes the destination tile.
Any attempt to execute the TILEZERO instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
TILEZERO tdest
nbytes := palette_table[palette_id].bytes_per_row
Flags Affected
None.
Exceptions
AMX-E5; see Section 3.7, “Exception Classes” for details.
CHAPTER 4
ENQUEUE STORES AND PROCESS ADDRESS SPACE IDENTIFIERS
(PASIDS)
Chapter 2 described the ENQCMD and ENQCMDS instructions. These instructions perform enqueue stores, which
write command data to special device registers called enqueue registers.
Bits 19:0 of the 64-byte command data written by an enqueue store convey the process address space identifier
(PASID) associated with the command. Software can use PASIDs to identify individual software threads. Devices
supporting enqueue registers may use these PASIDs in responding to commands submitted through those
registers.
As explained in Chapter 2, an execution of ENQCMD formats the command data with the PASID specified in
bits 19:0 of the IA32_PASID MSR. It is expected that system software will configure that MSR to contain the PASID
associated with the software thread that is executing.
ENQCMDS can be executed only by system software operating with CPL = 0. It is the responsibility of system
software executing ENQCMDS to configure the command data with the appropriate PASID.
Section 4.1 provides details of the IA32_PASID MSR. Section 4.2 describes how the XSAVE feature set supports
that MSR. Section 4.3 presents PASID virtualization, a virtualization feature that allows a virtual-machine monitor
to control the PASID values produced by enqueue stores executed by software in a virtual machine.
An execution of WRMSR causes a general-protection exception (#GP) in response to an attempt to set any bit in
the ranges 30:20 or 63:32. Executions of RDMSR always return zero for those bits.
Because system software may associate a PASID with a software thread, it may choose to update the IA32_PASID
MSR on context switches. To facilitate such a usage, the XSAVE feature set is extended to manage the IA32_PASID
MSR. These extensions are detailed in Section 4.2.
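The reserved-bit rule for the IA32_PASID MSR can be sketched as a pair of masks in Python. This is a model for illustration only; the names wrmsr_pasid_faults and pasid_field are hypothetical, and the functions operate on plain integers rather than accessing any MSR.

```python
PASID_MASK = 0xFFFFF                                  # bits 19:0
RESERVED_MASK = (0x7FF << 20) | (0xFFFFFFFF << 32)    # bits 30:20 and 63:32

def wrmsr_pasid_faults(value):
    """True if writing `value` would set a reserved bit (modeled #GP)."""
    return (value & RESERVED_MASK) != 0

def pasid_field(value):
    """Extract the PASID held in bits 19:0 of the MSR value."""
    return value & PASID_MASK
```

Bit 31 lies outside both reserved ranges, so a value such as 80012345H is accepted by this model and yields PASID 12345H.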
4.2 THE PASID STATE COMPONENT FOR THE XSAVE FEATURE SET
As noted in Section 4.1, system software may choose to update the IA32_PASID MSR on context switches. This
usage is supported by extensions to the XSAVE feature set.
The XSAVE feature set supports the saving and restoring of state components. These state components are
organized using state-component bitmaps (each bit in such a bitmap corresponds to a state component).
A new state component is introduced called PASID state. PASID state comprises the IA32_PASID MSR. It is
defined to be state component 10, so PASID state is associated with bit 10 in state component bitmaps. It is a
supervisor state component, meaning that it can be managed only by the XSAVES and XRSTORS instructions.
System software can enable those instructions to manage PASID state by setting bit 10 in the IA32_XSS MSR.
Processor support for this management of PASID state is enumerated by the CPUID instruction as follows:
• CPUID function 0DH, sub-function 1, enumerates in EDX:ECX a bitmap of the supervisor state components.
ECX[10] will be enumerated as 1 to indicate that PASID state is supported.
• If PASID state is supported, CPUID function 0DH, sub-function 10 enumerates details for state component 10 as
follows:
— EAX enumerates 8 as the size (in bytes) required for PASID state. (The state component comprises only the
one MSR.)
— EBX enumerates value 0, as is the case for supervisor state components.
— ECX[0] enumerates 1, indicating that PASID state is a supervisor state component.
— ECX[1] enumerates 0, indicating that state component 10 is located immediately following the preceding
state component when the compacted format of the extended region of an XSAVE area is used.
— ECX[31:2] and EDX enumerate 0, as is the case for all state components.
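The enumeration above can be checked mechanically. The following Python sketch takes register values that software has already obtained from CPUID and compares them with the values listed; it does not execute CPUID itself, and the function names are hypothetical.

```python
def pasid_state_supported(leaf_0d_1_ecx):
    """ECX[10] of CPUID function 0DH, sub-function 1."""
    return bool((leaf_0d_1_ecx >> 10) & 1)

def check_pasid_leaf(eax, ebx, ecx, edx):
    """Compare CPUID function 0DH, sub-function 10 output with the list above."""
    return (eax == 8                      # size of PASID state in bytes
            and ebx == 0                  # 0 for supervisor state components
            and (ecx & 1) == 1            # supervisor state component
            and ((ecx >> 1) & 1) == 0     # immediately follows preceding component
            and (ecx >> 2) == 0           # ECX[31:2] enumerate 0
            and edx == 0)
```

A caller would typically test pasid_state_supported first and consult sub-function 10 only when it returns True.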
Like WRMSR, XRSTORS causes a general-protection exception (#GP) in response to an attempt to set any bit in the
IA32_PASID MSR in the ranges 30:20 or 63:32. Like RDMSR, XSAVES always saves zero for those bits.
The XSAVES instruction optimizes the amount of data that it writes to memory by not writing data for a state
component known to be in its initial configuration. PASID state is in its initial configuration if the IA32_PASID MSR
is 0.
1. See the Intel® Scalable I/O Virtualization Technical Specification for more details.
A PASID-translation hierarchy also includes up to 512 4-KByte PASID tables; these are referenced by PASID
directory entries (see above). A PASID table comprises 1024 4-byte entries, each of which has the following
format:
• Bits 19:0 are the host PASID specified by the entry.
• Bits 30:20 are reserved and must be 0.
• Bit 31 is the entry’s valid bit. The entry is used only if this bit is 1.
Section 4.3.2 explains how the PASID-translation hierarchies are used to translate the PASIDs used for enqueue
stores.
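The PASID table entry format listed above can be decoded as follows. This Python sketch is illustrative only; the function name decode_pasid_table_entry is hypothetical and the entry is passed in as a plain 32-bit integer.

```python
def decode_pasid_table_entry(entry):
    """Return the host PASID (bits 19:0), or None if the entry is unusable
    because the valid bit (bit 31) is 0 or a reserved bit (30:20) is set."""
    valid = (entry >> 31) & 1
    reserved = (entry >> 20) & 0x7FF
    if not valid or reserved:
        return None
    return entry & 0xFFFFF
```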
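[Figure: PASID translation. Guest PASID fields: bit 19, bits 18:10, bits 9:0. PASID table entry index = guest
PASID[9:0]. PASID table entry layout: V (bit 31), Rsvd (bits 30:20), Host PASID (bits 19:0). The directory
pointers are VMCS fields.]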
• Bits 18:10 of the guest PASID select an entry from the PASID directory. A VM exit occurs if the entry’s valid bit
is clear or if any reserved bit is set. Otherwise, bits M:0 of the entry (with bit 0 cleared) contain the physical
address of a PASID table, where M is the physical-address width supported by the processor.
• Bits 9:0 of the guest PASID select an entry from the PASID table. A VM exit occurs if the entry’s valid bit is
clear or if any reserved bit is set. Otherwise, bits 19:0 of the entry are the host PASID.
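The two index fields named in the steps above can be extracted with shifts and masks. The following Python sketch is illustrative; the function name split_guest_pasid is hypothetical, and bit 19 of the guest PASID is not handled here.

```python
def split_guest_pasid(guest_pasid):
    """Return (directory_index, table_index) for a guest PASID."""
    directory_index = (guest_pasid >> 10) & 0x1FF   # bits 18:10
    table_index = guest_pasid & 0x3FF               # bits 9:0
    return directory_index, table_index
```

The 9-bit directory index and 10-bit table index are consistent with up to 512 PASID tables of 1024 entries each.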
An execution of ENQCMD or ENQCMDS performs PASID translation only after checking for conditions that may
result in a general-protection exception (the check of IA32_PASID.Valid for ENQCMD; the check of CPL for
ENQCMDS) and after loading the instruction's source operand from memory. PASID translation occurs before the
actual enqueue store and thus before any faults or VM exits that it may cause (e.g., page faults or EPT violations).
CHAPTER 5
INTEL® TSX SUSPEND LOAD ADDRESS TRACKING
CHAPTER 6
HYPERVISOR-MANAGED LINEAR ADDRESS TRANSLATION
This chapter provides information about a new VT-x capability called Hypervisor-managed Linear Address
Translation (HLAT). This capability is intended to be used by a Hypervisor/Virtual Machine Monitor (VMM) to enforce guest
linear translation (to guest physical mappings). When combined with the existing Extended Page Table (EPT)
capability, HLAT enables the VMM to ensure the integrity of combined guest linear translation (mappings and
permissions) cached by the processor TLB, via a reduced software TCB managed by the VMM. The VMM-enforced guest
translations are therefore not subject to tampering by untrusted system software adversaries.
6.1 USAGE
This feature is intended to augment the security functionality for a type of Virtual Machine Monitor (VMM) that may
use legacy EPT read/write/execute (XWR) permission bits (bits 2:0 of the EPTE) as well as the new User-execute
(XU) access bit (bit 10 of the EPTE) to ensure the integrity of code/data resident in guest physical memory
assigned to the guest operating system. EPT permissions are also used in these VMMs to isolate memory; for
example, to host a Secure Kernel (SK) that can manage security properties for the General Purpose Kernel (GPK).
For such usages, it is important that the VMM ensure that the guest linear address mappings which are used by the
General Purpose Kernel to refer to the EPT monitored guest physical pages are access-controlled as well.
Figure 6-1 below shows an example software setup.
[Figure 6-1: example software setup. Labels recovered from the figure: Ordinary Paging Structures, SK Ordinary
Paging Structures, VMM.]
VMMs could enforce the integrity of these specific guest linear to guest physical mappings (paging structures) by
using legacy EPT permissions to mark the guest physical memory containing the relevant guest paging structures
as read-only. The intent of marking these guest paging structures as read-only is to ensure an invalid mapping is
not created by guest software. However, such page-table edit control techniques are known to cause very high
overheads due to the requirement that the VMM must monitor all paging contexts created by the (Guest) operating
system. HLAT enables a VMM to enforce the integrity of guest linear mappings without this high overhead.
This chapter describes a processor mechanism for the type of VMM described above consisting of:
• A Hypervisor-managed Linear Address Translation (HLAT) mechanism which uses an alternate IA paging
structure managed in guest physical memory (for example, by a Secure Kernel) that contains guest linear to
guest physical translations that the VMM/Secure Kernel wants to enforce.
• A new EPT control bit called “Paging-Write” specified in EPT leaf entries. The new bit specifies which guest
physical pages hold HLAT or legacy IA paging structures so that the processor can use the Paging-Write as
permission to perform A/D bit writes (instead of the software W permission in the EPTE). Typical usage for the
Paging-Write bit is with the legacy EPT Write bit cleared. In PAE paging, the PDPTE does not have A/D bits;
instead, the 4KB page-directory-pointer table page contains 4 PDPTR entries. Hence, in PAE paging, the
processor ignores the PW bit of leaf entries of CR3 EPT walks. Software note: in this case, the VMM will need to
monitor the page-directory-pointer table page for writes using EPT write permissions (or alternately the VMM
can emulate the PDPTR load into the VMCS for the guest on a MOV CR3 by configuring VM Exit on load to CR3
in PAE paging).
• A new EPT control bit called “Verify Paging-Write” specified in EPT leaf entries (that refer to the final host
physical page in the translation). The new bit specifies which guest physical pages should only be referenced
via translation (guest) paging structures that are marked as Paging-writable under EPT.
The new EPT control bits “Paging-Write” and “Verify Paging-Write” are enabled via tertiary processor-based
VM-execution controls. Note that Verify Paging-Write (VPW) relies on Paging-Write (PW) for its interpretation. Hence, if
PW is disabled, the VPW bit is ignored (and remains available to software, as in legacy behavior) and the
processor operates as if VPW is disabled; i.e., no EPT fault will occur due to VPW violations.
6.3.1 Reservation of a Guest Page Type in EPT Paging Structure Entry for Future Use
If either the CET “Enable EPT Kernel Shadow Stack Control” EPT control or the HLAT “Enable Paging-Write” control is
disabled, then the control bits in the EPTE for the disabled control remain available to software. (This is the same as
legacy behavior.)
If both the CET “Enable EPT Kernel Shadow Stack Control” EPT control and the “Enable Paging-Write” are enabled
together, then the encoding in the EPTE for both bits set (11b) may be used in the future for an additional guest
page type if needed. Defining this encoding (11b) will need an explicit opt-in control in the future. Note that there
is no special treatment for this encoding in the HLAT architecture.
Four Guest Page Types can be expressed via the CET “Enable EPT Kernel Shadow Stack Control” EPT control and
the “Enable Paging-Write” EPT control: ordinary guest page (SSS=0b, PW=0b), guest kernel supervisor shadow
stack (SSS=1b, PW=0b), guest paging structure (SSS=0b, PW=1b), and undefined (SSS=1b, PW=1b).
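The four encodings can be written as a lookup table. This Python sketch only restates the mapping given above; the function name guest_page_type is hypothetical.

```python
def guest_page_type(sss, pw):
    """Map the (SSS, PW) bit pair of an EPTE to the guest page type named above."""
    table = {
        (0, 0): "ordinary guest page",
        (1, 0): "guest kernel supervisor shadow stack",
        (0, 1): "guest paging structure",
        (1, 1): "undefined",
    }
    return table[(sss & 1, pw & 1)]
```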
It is the responsibility of the VMM (software) to avoid using the undefined (11b) setting of these two control bits.
Kernel Shadow Stack is to be used by the VMM for the final EPTE in the EPT translation for kernel
shadow stack GPAs, whereas Paging-Write is used for the final EPTE in the EPT translation of GPAs containing
guest paging structures (not for the final page referenced through the guest paging structures). If the VMM
specifies the 11b configuration for SSS and PW, the processor will enforce both Kernel Shadow Stack access
semantics and Paging-Write access semantics for those GPAs. See the table below for details.
walk, or the ordinary guest CR3-rooted IA paging structure walk; and never both. It is not necessary for a guest
linear address translation to be found in the HLAT paging structures (after a PLR match); this allows the VMM to
enforce sparse guest supervisor linear address translations via HLAT. The VMM may handle the missing translations
by the use of a control bit called a “restart walk” bit in the HLAT paging structures. When the processor
translation of a guest linear address through the HLAT paging structures encounters a “restart walk” bit, the
processor aborts the walk and performs a legacy walk starting at the ordinary guest CR3-rooted paging structures.
For a PLR match, if the guest linear address translation via the HLAT walk succeeds and a mapping to a guest
physical address is found in the HLAT, then the walk is completed successfully and the TLB is filled appropriately.
However, an HLAT walk may not complete for the following reasons:
• A guest linear address translation may be specified as not present in the HLAT paging structures. In this
case, the walk is terminated and a page fault is reported to software with a Page-Fault Error Code
indicator to indicate HLAT fault due to page not present.
• A guest linear address translation may encounter reserved bits set in the HLAT paging structures. In
this case, the walk is terminated and a page fault is reported to software with a Page-Fault Error Code
indicator to indicate HLAT fault due to reserved bit set.
• A guest linear address translation may be aborted by the processor encountering a “restart walk”
control bit during the walk. In this case, the walk is restarted from the ordinary guest CR3-rooted
paging structures. A translation may be found in the ordinary guest CR3-rooted paging structures and
the processor response to those conditions is the same as legacy address translation through IA and
EPT paging structures, i.e., either the TLB is filled or a page-fault or EPT violation is generated. If the
TLB is filled after a restart, the processor ensures that the TLB page size used matches the page size at
which the HLAT walk was aborted due to restart. Note that a guest linear address translation that starts
at the HLAT paging structures and encounters a “restart walk” switches to ordinary CR3 address
translation and cannot architecturally revert back to translation through the HLAT structures.
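The possible outcomes of an HLAT walk listed above can be summarized as a small decision procedure. The following Python sketch is a model under stated assumptions: each paging-structure entry is represented as a dictionary with hypothetical boolean keys, and the reserved-bit check is placed before the restart check to match the priority this chapter gives for entries with P=R=1 and reserved bits set.

```python
def hlat_walk(entries):
    """entries: HLAT entries visited in walk order, each a dict with the
    hypothetical boolean keys 'present', 'reserved', 'restart' and 'leaf'."""
    for entry in entries:
        if not entry["present"]:
            return "page fault: HLAT not present"
        if entry["reserved"]:
            return "page fault: HLAT reserved bit set"
        if entry["restart"]:
            return "restart from CR3-rooted paging structures"
        if entry["leaf"]:
            return "translated via HLAT"
    raise ValueError("walk ended without reaching a leaf entry")
```

Once the model returns the restart outcome, the remainder of the translation would follow ordinary paging and never re-enter this function.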
HLAT relieves the VMM from making guest physical pages that hold the ordinary guest CR3-rooted guest paging
structures read-only under EPT. However, to meet the security objective for HLAT, the VMM must make the guest
physical pages that hold the HLAT guest paging structures read-only under EPT; this requirement raises no legacy
compatibility issues. Processor page walk A/D bit updates occur as defined for ordinary paging structures
during guest linear address translation through HLAT paging structures. Per legacy behavior, these A/D bit writes
will cause EPT violations if the guest physical pages holding the guest paging structures are read-only under EPT.
To avoid this performance overhead, a new EPT control bit “Paging-Write” is defined, which can be enabled via the
new tertiary VM-execution control called “Enable Paging-Write”. Guest physical pages that have the Paging-Write
bit set under EPT allow the processor page walker to perform A/D bit writes without EPT violations (even if the
EPT entry Write permission is clear; effectively, EPTE.W||EPTE.PW is used as the leaf EPTE Write permission by the
processor). Software writes to guest physical pages are still subject to the EPT Write permission; Paging-Write is
ignored for writes from software. Note that Paging-Write can be used for HLAT paging structures or ordinary paging
structures irrespective of whether HLAT is enabled. Details of EPT violation behavior for Paging-Write are
described in Section 6.8, Table 6-9.
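The write-permission rule above can be expressed as a small predicate. This Python sketch is illustrative only; the function name ept_write_allowed and the parameter paging_write_enabled (standing for the “Enable Paging-Write” control) are hypothetical.

```python
def ept_write_allowed(w, pw, by_page_walker, paging_write_enabled=True):
    """Model of the leaf-EPTE write check described above.
    Page-walker A/D writes use W or PW; software writes use W only."""
    if by_page_walker and paging_write_enabled:
        return bool(w or pw)
    return bool(w)
```

In the typical usage (W=0, PW=1), the model allows A/D bit writes by the page walker while still denying software writes to the same page.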
Note that by using HLAT and Paging-Write the VMM can enforce guest linear address translation for specific guest
linear addresses; however, it cannot restrict guest linear address alias translations to guest physical
addresses. The VMM can restrict the effect of aliases by making guest physical pages non-writable under EPT.
However, there may be scenarios where the VMM may wish to restrict aliases to writable guest-physical pages. To
enable the VMM to restrict aliases, a new EPT control bit, “Verify Paging-Write”, is defined, which can be enabled via
the VM-execution control called “Enable Guest Paging Verification”. Guest physical pages that have the
Verify-Paging-Write bit set under EPT cause the processor page walker to check that only guest physical pages that have
the leaf EPT entry attribute Paging-Write were used to translate a guest linear address to that guest physical
address. Note that Verify-Paging-Write can be used for HLAT paging structures or ordinary paging structures
irrespective of whether HLAT is enabled.
Specific conditions may cause an EPT violation with a new Exit Qualification bit described in Section 6.8 when
“Verify Paging-Write” is enabled.
NOTE
Initial implementations may report a 1-bit prefix width in this capability MSR and will not support
32-bit mode paging.
2. A new 16-bit VMCS control field is defined called “HLAT PLR Prefix Size”. The VMCS index and encoding for this
field is 00000006H. VMM software should program this field to specify the GLA MSB prefix to apply to test the
GLA for a PLR match (to condition HLAT walks).
This 16-bit control field holds a value between 0 and 52 specifying the size of the all 1’s prefix (in number of
bits) that the processor should apply for matching the GLA to the PLR. A value specified by software higher than
what the processor enumerates in the HLAT prefix size value will be truncated by the processor, resulting in the
enforcement of an address prefix of size specified by the capability MSR.
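The prefix test can be sketched in Python as follows. Several details are assumptions of this sketch and are not stated in the text above: the prefix is taken from the most significant bits of a 64-bit guest linear address, and a prefix size of 0 is treated as matching every address. The function names are hypothetical.

```python
def effective_prefix_size(requested, enumerated):
    """Values above the enumerated size are truncated to the enumerated size."""
    return min(requested, enumerated)

def plr_match(gla, prefix_size, width=64):
    """True if the top `prefix_size` bits of the GLA are all 1s.
    Assumption: the prefix starts at bit width-1 of the address."""
    if prefix_size == 0:
        return True
    mask = ((1 << prefix_size) - 1) << (width - prefix_size)
    return (gla & mask) == mask
```

With a 1-bit prefix, this model selects every address with bit 63 set, which corresponds to the upper half of the canonical address space.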
linear addresses to guest physical addresses. The HLAT structure is identical for IA-32e (compatibility and 64-bit)
modes of VMX-non-root (guest) software. Additionally, VA48 and LA57 are supported via the same HLAT structure
format. For IA32 PAE and non-PAE modes (CR4.PAE=0) HLAT lookup is not performed even if the “Enable HLAT”
VM-execution control is 1. Disabling paging when HLAT is enabled disables HLAT; the VMM should restrict such
guest paging mode changes via CR exiting.
The format of the HLATP is shown in Table 6-2.
Processor access to the HLAT data structures (in guest physical memory) will use the memory type that the MTRRs
(memory-type range registers) and EPTs specify for the guest physical address of the access.
Software should ensure that the VMCS and referenced data structures are located at physical addresses that are
mapped to WB memory type by the MTRRs.
A 4KB naturally aligned HLAT L5 table is located at the guest physical address specified in bits 51:12 of the “Hyper-
visor-managed Linear Address Translation Pointer”, a 64-bit VM-execution control field. An HLAT L5 table
comprises 512 64-bit entries (HLAT L5Es). An HLAT L5E is selected at the guest physical address defined as
follows:
• Bits 63:52 are all 0.
• Bits 51:12 are from the HLATP.
• Bits 11:3 are bits 56:48 of the guest linear address.
• Bits 2:0 are all 0.
A 4KB naturally aligned HLAT L4 table is located at the guest physical address specified in bits N:12 of the HLAT
L5E. An HLAT L4 table comprises 512 64-bit entries (HLAT L4Es). An HLAT L4E is selected at the guest physical
address defined as follows:
• Bits 63:52 are all 0.
• Bits 51:12 are from the HLAT L5E.
• Bits 11:3 are bits 47:39 of the guest linear address.
• Bits 2:0 are all 0.
A 4KB naturally aligned HLAT L3 table is located at the guest physical address specified in bits N:12 of the HLAT
L4E. An HLAT L3 table comprises 512 64-bit entries (HLAT L3Es). An HLAT L3E is selected at the guest physical
address defined as follows:
• Bits 63:52 are all 0.
• Bits 51:12 are from the HLAT L4E.
• Bits 11:3 are bits 38:30 of the guest linear address.
• Bits 2:0 are all 0.
A 4KB naturally aligned HLAT L2 table is located at the guest physical address specified in bits N:12 of the HLAT
L3E. An HLAT L2 table comprises 512 64-bit entries (HLAT L2Es). An HLAT L2E is selected at the guest physical
address defined as follows:
• Bits 63:52 are all 0.
• Bits 51:12 are from the HLAT L3E.
• Bits 11:3 are bits 29:21 of the guest linear address.
• Bits 2:0 are all 0.
A 4KB naturally aligned HLAT L1 table is located at the guest physical address specified in bits N:12 of the HLAT
L2E. An HLAT L1 table comprises 512 64-bit entries (HLAT L1Es). An HLAT L1E is selected at the guest physical
address defined as follows:
• Bits 63:52 are all 0.
• Bits 51:12 are from the HLAT L2E.
• Bits 11:3 are bits 20:12 of the guest linear address.
• Bits 2:0 are all 0.
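The entry-address composition used at each level of the HLAT walk follows one pattern: bits 51:12 come from the table base and bits 11:3 from a 9-bit field of the guest linear address. The following Python sketch illustrates this; the function name hlat_entry_address is hypothetical, and the per-level shift is derived from the bit ranges in the lists above.

```python
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF    # bits 51:12

def hlat_entry_address(table_base, gla, level):
    """Guest-physical address of the HLAT entry selected by `gla`.
    level: 5 for an L5E down to 1 for an L1E; table_base supplies bits 51:12."""
    shift = 12 + 9 * (level - 1)        # L1 uses GLA bits 20:12, L5 uses 56:48
    index = (gla >> shift) & 0x1FF      # 9-bit index into a 512-entry table
    return (table_base & ADDR_MASK) | (index << 3)
```

Because each entry is 8 bytes, the index is shifted left by 3, which leaves bits 2:0 of the result as 0.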
Figure 6-3. New Bit in the IA-32e Paging Structures Recognized During HLAT Walks (see note 1)
NOTES:
1. This bit remains ignored in ordinary page walks to translate a guest linear address.
2. M is an abbreviation for MAXPHYADDR.
3. Reserved fields must be 0.
4. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
5. If CR4.PKE = 0, the protection key is ignored.
For HLAT page walks which encounter an HLAT entry with P=R=1, and other reserved bits set, the reserved bit
check faults are a higher priority than the restart operation and will be reported to software through a page fault
with PFEC bits set as described below.
The PFEC bit 7 flag is set (1) if the exception occurred during translation of a guest linear address, and:
1. VM-execution control: “enable HLAT” = 1, and
2. The access caused a page-fault exception (for code or data access) during HLAT lookup implying that the linear
address matched the PLR criteria.
When this PFEC bit 7 flag is set (1):
• Either the P flag (bit 0) is cleared (0), implying there is no translation in the HLAT for the linear address
because the P flag was 0 in one of the HLAT paging structure entries used to translate that linear
address, or the RSVD flag (bit 3) is 1.
• The PK flag (bit 5) and SGX flag (bit 15) are 0 in both cases.
• The I/D (bit 4), R/W (bit 1), U/S (bit 2) and CET SSS (bit 6) flags should be set appropriately.
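A fault handler could distinguish the two HLAT fault causes from the error code as sketched below. This Python model uses only the bit positions given above; the function name classify_hlat_fault and the returned strings are hypothetical.

```python
def classify_hlat_fault(pfec):
    """Classify a page-fault error code using bits 7 (HLAT), 3 (RSVD) and 0 (P)."""
    if not (pfec >> 7) & 1:
        return "not an HLAT fault"
    if (pfec >> 3) & 1:
        return "HLAT reserved bit set"
    if not pfec & 1:
        return "HLAT translation not present"
    return "unexpected combination"
```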
Note that no PFEC bit is set for faults generated when the HLAT translation does not permit the access (during
lookup or from cached HLAT mappings). To differentiate faults due to insufficient permissions in HLAT, the VMM can
leverage EPT permissions to cause an EPT violation. Alternatively, using Virtualization Exceptions, the VMM can
generate a #VE exception for page accesses violating the EPT permissions, thus differentiating permission
violations reported to the OS from legacy page faults due to a permission violation from ordinary paging (via
CR3-rooted paging structures).
Similarly, ASID management is not modified due to HLAT. If PCID is enabled by the guest, PCID always derives
from guest CR3 (not HLAT root) and follows the legacy approach. If HLAT mappings are global they will be treated
with global ASID; if HLAT mappings are not global, they will map to ASID determined by VPID/EPTP/PCID per
legacy behavior.
The paging structure caches (PxE caches), on the other hand, hold additional information that modifies the page
walk; hence the following changes are required to the PxE caches:
1. The restart bit is cached as a new output bit from the PxE cache lookup during page walks. A page walk that
hits the PxE cache and results in an entry for which the restart bit is cached aborts the walk, and causes the
walk to restart from the guest CR3 rooted guest physical address (which may hit other entries in the PxE
cache).
2. An HLAT tag bit is provided as input to the PxE lookup to ensure that the walk is performed in HLAT mode until
a restart is encountered.
An EPT misconfiguration will occur if, for the translation of a guest-physical address, the paging-write-access bit is
1 and the read access bit (EPTE bit 0) is clear (0) in the leaf EPT paging-structure entry used to translate the
guest-physical address.
Figure 6-4. Example of Paging-Write and Verify-Paging-Write EPT Control Bits Used for Guest Paging Structures
(HLAT or Ordinary Paging)
When the “Enable Paging-Write” VM-execution control is enabled, and the hypervisor makes guest paging
structures non-writable for OS software writes via EPT permissions, then, for inter-operation with EPT Accessed/Dirty
bits, the hypervisor must set the Paging-Write access EPT entry control bit (see Section 6.3) to allow the processor
paging architecture to safely consider guest paging structure accesses as writable for the page walker (while
continuing to enforce the writability for software based on the legacy EPTE.W bit).
CHAPTER 7
ARCHITECTURAL LAST BRANCH RECORDS (LBRS)
Architectural Last Branch Records (LBRs) enable recording of software path history by logging taken branches and
other control flow transfers within processor registers. Each LBR record or entry is comprised of three MSRs:
• IA32_LBR_x_FROM_IP − Holds the source IP of the operation.
• IA32_LBR_x_TO_IP − Holds the destination IP of the operation.
• IA32_LBR_x_INFO − Holds metadata for the operation, including mispredict, TSX, and elapsed cycle time
information.
The number of LBR records available varies across processor generations, and is specified in CPUID (see Section
7.4).
LBR records are stored in age order. The most recent LBR entry is stored in IA32_LBR_0_*, the next youngest in
IA32_LBR_1_*, and so on. When an operation to be recorded completes (retires) with LBRs enabled
(IA32_LBR_CTL.LBREn=1), older LBR entries are shifted in the LBR array by one entry, then a record of the new
operation is written into entry 0. See Section 7.1.1 for the list of recorded operations.
The number of LBR entries available for recording operations is dictated by the value in IA32_LBR_DEPTH.DEPTH.
By default, the DEPTH value matches the maximum number of LBRs supported by the processor, but software may
opt to use fewer in order to achieve reduced context switch latency. See Section 7.3.1 for more details.
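The age-ordered shifting of LBR entries can be modeled with a bounded double-ended queue. The following Python sketch is an illustration only; the class name LbrStack and its methods are hypothetical, and the depth argument stands for the value of IA32_LBR_DEPTH.DEPTH.

```python
from collections import deque

class LbrStack:
    """Model of the LBR array: entry 0 is the most recent record."""

    def __init__(self, depth):
        # With maxlen set, appendleft on a full deque drops the oldest record.
        self.records = deque(maxlen=depth)

    def record(self, from_ip, to_ip, info=0):
        """Shift older records by one entry and write the new one to entry 0."""
        self.records.appendleft((from_ip, to_ip, info))

    def entry(self, x):
        """Return the (FROM_IP, TO_IP, INFO) tuple held in entry x."""
        return self.records[x]
```

In this model a smaller depth means fewer tuples to save and restore, which mirrors the reduced context switch latency mentioned above.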
In addition to the LBRs, there is a single Last Event Record (LER). It records the last taken branch preceding the
last exception, hardware interrupt, or software interrupt. Like LBRs, the LER is comprised of three MSRs
(IA32_LER_FROM_IP, IA32_LER_TO_IP, IA32_LER_INFO), and is subject to the same dependencies on enabling
and filtering.
Which operations are recorded in LBRs depends upon a series of factors:
• Branch Type Filtering − Software must opt in to the types of branches to be logged; see Section 7.1.2.3.
• Current Privilege Level (CPL) − LBRs can be filtered based on CPL; see Section 7.1.2.5.
• LBR Freeze − LBR and LER recording can be suspended by setting IA32_PERF_GLOBAL_STATUS.LBR_FRZ to 1.
See Section 17.4.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B for details on
LBR_FRZ.
On some implementations, recording LBRs may require constraining the number of operations that can complete
in a cycle. As a result, on these implementations, enabling LBRs may have some performance overhead.
7.1 BEHAVIOR
7.1.2 Configuration
change the CPL, the operation is recorded in LBRs only if the CPL at the end of the operation is enabled for LBR
recording. In cases where the CPL transitions from a value that is filtered out to a value that is enabled for LBR
recording, the FROM_IP address for the recorded CPL transition branch or event will be 0FFFFFFFFFFFFFFFFH.
7.1.3.1 IP Fields
The source and destination IP values in IA32_LBR_x_[FROM|TO]_IP and IA32_LER_[FROM|TO]_IP may hold
effective IPs (EIPs) or linear IPs (LIPs), depending on the processor generation. EIP is the offset from the CSbase
address, while LIP includes the CSbase address. Which IP type is used is indicated in CPUID.(EAX=01CH,
ECX=0):EAX[bit 31].
The value read from this field will always be canonical. Note that this includes the case where a canonical violation
(#GP) results from executing sequential code that runs precisely to the end of the lower canonical address space
(where IP[63:MAXLINADDR-1] is 0, but IP[MAXLINADDR-2:0] is all ones). In this case, the FROM_IP will hold the
lowest canonical address in the upper canonical space, such that IP[63:MAXLINADDR-1] is all ones, and
IP[MAXLINADDR-2:0] is 0.
In some cases, due to CPL filtering, the FROM_IP of the recorded operation may be filtered out. In this case
0FFFFFFFFFFFFFFFFH will be recorded. See Section 7.1.2.5 for details.
Writes of these fields will be forced canonical, such that the processor ignores the value written to the upper bits
(IP[63:MAXLINADDR-1]).
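The forced-canonical behavior on writes can be illustrated with a minimal sketch. It assumes that the value is sign-extended from bit MAXLINADDR-1 (as Section 7.2 describes for address fields) and uses a default MAXLINADDR of 48; the function name and the default are illustrative assumptions, not part of the architecture.

```python
def force_canonical(ip: int, max_lin_addr: int = 48) -> int:
    """Return the value an IP field would hold after a write, assuming the
    upper bits are replaced by a sign-extension of bit (max_lin_addr - 1)."""
    mask64 = (1 << 64) - 1
    low_mask = (1 << max_lin_addr) - 1
    low = ip & low_mask                      # bits written above this are ignored
    if (low >> (max_lin_addr - 1)) & 1:      # sign bit set: upper canonical half
        return (low | ~low_mask) & mask64
    return low                               # sign bit clear: lower canonical half
```

Under these assumptions, writing 0000800000000000H with a 48-bit linear-address width would read back as FFFF800000000000H.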
For a list of branch operations that fall into the categories above, see Table 7-2. In future generations, BR_TYPE bits
2:0 may be used to distinguish between differing types of OTHER_BRANCH.
Some implementations may opt to reduce the granularity of the CYC_CNT field for larger values. The implication of
this is that the least significant bits may be forced to 1 in cases where the count has reached some minimum
threshold. It is guaranteed that this reduced granularity will never result in an inaccuracy of more than 10%.
7.1.4.1 SMM
IA32_LBR_CTL.LBREn is saved and cleared on #SMI, and restored on RSM. As a result of disabling LBRs, the #SMI
is not recorded. RSM is recorded only if IA32_DEBUGCTL.FREEZE_WHILE_SMM_EN is set to 0, and the FROM_IP
will be set to the same value as the TO_IP.
7.1.4.2 VMX
By default, LBR operation persists across VMX transitions. However, VMCS fields have been added to enable
constraining LBR usage to within non-root operation only. See details in Table 7-4.
To enable “guest-only” LBR use, a VMM should set both the “Load Guest IA32_LBR_CTL” entry control and the
“Clear IA32_LBR_CTL” exit control. For “system-wide” LBR use, where LBRs remain enabled across host and
guest(s), a VMM should keep both new VMCS controls clear.
VM-entry checks that, if the “Load Guest IA32_LBR_CTL” entry control is 1, bits reserved in the IA32_LBR_CTL
MSR must be 0 in the field for that register.
7.1.4.5 SMX
On GETSEC leaves SENTER or ENTERACCS, IA32_LBR_CTL is cleared. As a result, the operation is not recorded.
7.1.4.6 MWAIT
On an MWAIT that requests a C-state deeper than C1, IA32_LBR_x_* MSRs may be cleared to 0. IA32_LBR_CTL,
IA32_LBR_DEPTH, and IA32_LER_* MSRs will be preserved.
For an MWAIT that enters a C-state equal to or less deep than C1, and all C-states entered as a result of Hardware
Duty Cycling (HDC), all LBR MSRs are preserved.
7.2 MSRS
The MSRs that represent the LBR entries (IA32_LBR_x_[TO|FROM|INFO]) and the LER entry
(IA32_LER_[TO|FROM|INFO]) do not fault on writes. Any address field written will force sign-extension based on
the maximum linear address width supported by the processor, and any non-zero value written to undefined bits
may be ignored such that subsequent reads return 0.
On a warm reset, all LBR MSRs, including IA32_LBR_DEPTH, have their values preserved. However,
IA32_LBR_CTL.LBREn is cleared to 0, disabling LBRs. If a warm reset is triggered while the processor is in C6, also known
as warm init, all LBR MSRs will be reset to their initial values.
Regardless of the number of LBRs supported by the processor, the size of the LBR state save region is constant.
Note that this XSAVES behavior implies that the saved value of IA32_LBR_DEPTH could become stale while the rest
of the LBRs are INIT, since modifications to IA32_LBR_DEPTH do not affect INIT tracking. This will have no impact
on LBR behavior, as a subsequent XRSTORS that detects a depth mismatch will either ignore the IA32_LBR_DEPTH
value (if XSTATE_BV[bit 15]=0) or will re-initialize the IA32_LBR_x_* MSRs (if XSTATE_BV[bit 15]=1).
On XRSTORS with IA32_LBR_DEPTH mismatch, INIT tracking is not modified.
There is no MOD tracking for Architectural LBRs; they should be treated as modified anytime they are not in INIT
state.
It is recommended that software initialize the Architectural LBR State Component memory to all zeros and clear
XSTATE_BV[bit 15].
7.4 ENUMERATION
Table 7-8. CPUID Leaf 01CH Enumeration of Architectural LBR Capabilities (Contd.)
CPUID.(EAX=01CH, ECX=0)
Register  Bits  Name                         Description
ECX       0     Mispredict Bit Supported     IA32_LBR_x_INFO[63] holds indication of branch misprediction (MISPRED).
ECX       1     Timed LBRs Supported         IA32_LBR_x_INFO[15:0] holds CPU cycles since last LBR entry (CYC_CNT), and IA32_LBR_x_INFO[60] holds an indication of whether the value held there is valid (CYC_CNT_VALID).
ECX       2     Branch Type Field Supported  IA32_LBR_x_INFO[59:56] holds indication of the recorded operation's branch type (BR_TYPE).
ECX       31:3  Reserved                     Reserved.
EDX       31:0  Reserved                     Reserved.
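As an illustration of how the ECX capability bits gate the fields of an LBR INFO MSR, the sketch below decodes a raw IA32_LBR_x_INFO value. The function name and the dictionary keys are hypothetical; obtaining the CPUID and MSR values is left to the caller.

```python
def decode_lbr_info(info: int, cpuid_1c_ecx: int) -> dict:
    """Decode IA32_LBR_x_INFO, reporting only the fields that
    CPUID.(EAX=01CH, ECX=0):ECX enumerates as supported."""
    fields = {}
    if cpuid_1c_ecx & (1 << 0):              # Mispredict Bit Supported
        fields["mispred"] = bool((info >> 63) & 1)
    if cpuid_1c_ecx & (1 << 1):              # Timed LBRs Supported
        fields["cyc_cnt_valid"] = bool((info >> 60) & 1)
        fields["cyc_cnt"] = info & 0xFFFF
    if cpuid_1c_ecx & (1 << 2):              # Branch Type Field Supported
        fields["br_type"] = (info >> 56) & 0xF
    return fields
```

A caller would typically treat cyc_cnt as meaningful only when cyc_cnt_valid is set.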
Table 7-9. CPUID Leaf 0DH.0FH Enumeration of XSAVE Support for Architectural LBRs
CPUID.(EAX=0DH, ECX=0FH)
Register  Bits  Name              Description
EAX       31:0  Size              Size, in bytes, of the Arch LBR save area.
EBX       31:0  Offset            Offset, in bytes, of the start of the Arch LBR save area from the beginning of the XSAVE/XRSTOR area.
ECX       0     Supervisor State  Set if bit 15 is supported in the IA32_XSS MSR; it is clear if bit 15 is instead supported in XCR0.
ECX       1     Aligned           Set if, when the compacted format of an XSAVE area is used, this extended state component is located on the next 64-byte boundary following the preceding state component. Otherwise, it is located immediately following the preceding state component.
ECX       31:2  Reserved          Reserved.
EDX       31:0  Reserved          Reserved.
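The effect of the Aligned bit on placement in the compacted format can be sketched as follows. The function name is hypothetical, and prev_end (the offset just past the preceding state component) is assumed to be computed by the caller.

```python
def compacted_component_start(prev_end: int, cpuid_0d_ecx: int) -> int:
    """Start offset of a state component in the compacted XSAVE format,
    given the offset at which the preceding component ends."""
    if cpuid_0d_ecx & (1 << 1):          # Aligned: next 64-byte boundary
        return (prev_end + 63) & ~63
    return prev_end                      # otherwise placed immediately after
```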
7.5.2 IA32_DEBUGCTL
On processors that do not support model-specific LBRs, IA32_DEBUGCTL[bit 0] has no meaning. It can be written
to 0 or 1, but reads will always return 0.
7.5.3 IA32_PERF_CAPABILITIES
On processors that do not support model-specific LBRs, IA32_PERF_CAPABILITIES.LBR_FMT will have the value
03FH.
CHAPTER 8
NON-WRITE-BACK LOCK DISABLE ARCHITECTURE
A locked read-modify-write (RMW) memory operation is used explicitly by several Intel architecture instructions,
such as ADD with a LOCK prefix, and implicitly by other instructions and flows, such as updating a segment
access bit or page-table access/dirty bits.
A locked RMW access is usually handled within the lower levels of the processor cache hierarchy, and it impacts only
software running on the logical processors that share this cache.
If the memory type of the locked RMW is not write-back (WB), the processor cannot handle it within the internal cache and
will issue a bus lock operation. This operation will block all logical processors and devices from accessing memory
until the operation has completed.
A burst of bus locks from one logical processor may starve the remaining logical processors and devices.
The new architecture will allow software to disable non-WB lock operation. Once the feature is enabled, performing
a non-WB lock operation by software will generate a general protection fault (#GP).
8.1 ENUMERATION
The non-write-back lock disable capability will be enumerated through a model specific bit (bit 4) in the
IA32_CORE_CAPABILITIES MSR.
8.2 ENABLING
This model specific feature will add a new MSR control bit (bit 28) in the TEST_CTRL MSR in order to generate a
general protection fault (#GP) each time a non-WB load lock is detected.
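A minimal sketch of the enumerate-then-enable sequence follows. It operates on MSR values supplied by the caller rather than performing privileged MSR reads and writes itself; the function and constant names are illustrative assumptions, while the bit positions are the ones given above.

```python
CORE_CAPS_NON_WB_LOCK_DISABLE = 1 << 4    # IA32_CORE_CAPABILITIES bit 4
TEST_CTRL_NON_WB_LOCK_DISABLE = 1 << 28   # TEST_CTRL bit 28

def enable_non_wb_lock_disable(core_caps: int, test_ctrl: int) -> int:
    """Return the TEST_CTRL value to write back in order to enable #GP on
    non-WB locks, or raise if the capability is not enumerated."""
    if not core_caps & CORE_CAPS_NON_WB_LOCK_DISABLE:
        raise RuntimeError("non-write-back lock disable is not enumerated")
    return test_ctrl | TEST_CTRL_NON_WB_LOCK_DISABLE
```

The read-modify-write form preserves any other control bits already set in TEST_CTRL.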
CHAPTER 9
BUS LOCK AND VM NOTIFY
CHAPTER 10
INTEL® RESOURCE DIRECTOR TECHNOLOGY FEATURE UPDATES
Intel® Resource Director Technology (Intel® RDT) provides a number of monitoring and control capabilities for
shared resources in multiprocessor systems. This chapter covers updates to the feature that will be available in
future Intel processors.
10.2.2 Augmented MBM Enumeration and MSR Interfaces for Extensible Counter Width
A field is added to CPUID to enumerate the MBM counter width in platforms which support the extensible MBM
counter width feature.
Prior to this point, CPUID.0F.[ECX=1]:EAX was reserved. This CPUID output register (EAX) is redefined to provide
two new fields:
• The counter width, encoded as an offset from 24 bits, in bits[7:0].
• An overflow bit in bit[8].
See “CPUID—CPU Identification” in Chapter 1 for details.
In EAX bits 7:0, the counter width is encoded as an offset from 24b. A value of zero in this field means 24-bit
counters are supported. A value of 8 indicates that 32-bit counters are supported, as in processors based on Ice
Lake Server microarchitecture.
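The encoding can be illustrated with a short sketch that decodes the EAX value returned by this CPUID leaf. The function name and the returned keys are hypothetical; bit[8] is returned as-is without further interpretation.

```python
def decode_mbm_enumeration(eax: int) -> dict:
    """Decode the counter-width offset (bits 7:0) and the overflow bit (bit 8)."""
    return {
        "counter_width": 24 + (eax & 0xFF),    # offset from 24 bits
        "overflow_bit": bool((eax >> 8) & 1),
    }
```

For example, an offset of 8 decodes to a 32-bit counter width, matching the Ice Lake Server case mentioned above.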
With the addition of this enumerable counter width, the requirement that software poll at ≥ 1Hz is removed.
Software may poll at a varying rate with reduced risk of rollover, and under typical conditions rollover is likely to require
hundreds of seconds (though this value is not explicitly specified, and may vary and decrease over time). If
software seeks to ensure that rollover does not occur more than once between samples, then sampling at ≥ 1Hz while
consuming the enumerated counter widths' worth of data will provide this guarantee, for a specific platform and
counter width, under all conditions.
Software which uses the MBM event retrieval MSR interface should be updated to comprehend this new format,
which enables up to 62-bit MBM counters to be provided by future platforms. Higher-level software which
consumes the resulting bandwidth values is not expected to be affected.
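One way software might account for the enumerated width when differencing two samples is sketched below. It assumes at most one rollover between samples and that the caller has already extracted the counter value from the event retrieval MSR; the function name is hypothetical.

```python
def mbm_delta(prev: int, cur: int, counter_width: int) -> int:
    """Difference between two MBM counter samples, tolerating a single
    rollover of a counter that is counter_width bits wide."""
    return (cur - prev) % (1 << counter_width)
```

With a 24-bit counter, a sample of 000010H taken after a sample of FFFFF0H yields a delta of 20H.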
a change from MBA 1.0 which did not require this. Intel BIOS reference code includes a default configuration which
is recommended for general usage.
MBA 2.0 in Ice Lake Server and Tremont microarchitectures moves from static throttling at the core/uncore interface
to a more dynamic control scheme based on a hardware controller that tracks actual DRAM bandwidth. This
allows software that primarily uses the L3 cache to observe increased throughput for a given throttling level, and
may benefit software that exhibits L3-bound phases. Due to the closer consideration of memory bandwidth
loading, this enhancement may lead to an increase in system efficiency when using MBA 2.0, relative to MBA 1.0.
MBA 1.0 is established as a legacy feature. Backward compatibility of the software interfaces is preserved, and
MBA 2.0 changes manifest as enhancements atop the MBA 1.0 baseline.
As with the prior generation feature, MBA 2.0 uses CPUID for enumeration, and throttling is performed using a
mapping created from software thread-to-CLOS (in the IA32_PQR_ASSOC MSR) which is then mapped per-CLOS
to delay values via the IA32_L2_QoS_Ext_BW_Thrtl_n MSRs. User software specifies a per-CLOS delay value,
0-90% bandwidth throttling for instance, though the max and granularity are platform dependent and enumerated in
CPUID.
MBA 2.0 implementation is shown in Figure 10-1. MBA 2.0 operates through the use of an advanced new hardware
controller and feedback mechanism which allows automated hardware monitoring and control around the user-
provided delay value set point. This set point and associated delay value infrastructure remains unchanged from
MBA 1.0, preserving software compatibility.
MBA 2.0 enhancements over MBA 1.0, in addition to the new hardware controller, include:
1. Configurable delay selection across threads.
• MBA 1.0 implementation statically picks the max MBADelay across the threads running on a core (by
calculating value = max(delayValue(CLOS[thread0]),delayValue(CLOS[thread1]))).
• Software may have the option to pick either maximum or minimum delay to be resolved and applied
across the threads; maximum value remains the default.
2. Increasing CLOSIDs from 8 to 15.
• Skylake Server microarchitecture has 8 CLOSIDs for MBA 1.0.
• Ice Lake Server microarchitecture increases this value to 15 (also consistent with L3 CAT).
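The delay resolution across threads described in item 1 can be sketched as follows. The function and parameter names are hypothetical; the max case mirrors the MBA 1.0 formula above, and the use_min flag stands in for the model-specific min/max control.

```python
def resolve_core_delay(clos_delay: dict, thread_clos: list, use_min: bool = False) -> int:
    """Resolve the delay value applied to a core from the per-CLOS delay
    values of the threads running on it (maximum by default)."""
    delays = [clos_delay[clos] for clos in thread_clos]
    return min(delays) if use_min else max(delays)
```

For two threads whose classes of service map to delay values 10 and 50, the default resolves to 50 and the minimum option resolves to 10.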
Note that bit[0] for min/max configuration is supported in MBA 2.0, but is removed when MBA 3.0 makes the
controller logic per-thread capable. Because the feature exists only transiently, the min/max control remains
model-specific.
To enumerate and manage support for the model-specific min/max feature, software may use processor
family/model/stepping to match supported products, then CPUID to later detect MBA 3.0 support.
While this means that more direct throttling of threads is possible, future usage guidance may be necessary to help
explain the effects of Intel® Hyper-Threading Technology contention vs. cache and memory contention, and how
these effects may be understood by software.
CHAPTER 11
USER INTERRUPTS
11.1 INTRODUCTION
This chapter details an architectural feature called user interrupts.
This feature defines user interrupts as new events in the architecture. They are delivered to software operating in
64-bit mode with CPL = 3 without any change to segmentation state. Different user interrupts are distinguished by
a 6-bit user-interrupt vector, which is pushed on the stack as part of user-interrupt delivery. A new instruction,
UIRET (user-interrupt return) reverses user-interrupt delivery.
The user-interrupt architecture is configured by new supervisor-managed state. This state includes new MSRs. In
expected usages, an operating system (OS) will update the content of these MSRs when switching between
OS-managed threads.
One of the MSRs references a data structure called the user posted-interrupt descriptor (UPID). User interrupts
for an OS-managed thread can be posted in the UPID associated with that thread. Such user interrupts will
be delivered after receipt of an ordinary interrupt (also identified in the UPID) called a user-interrupt
notification.¹
System software can define operations to post user interrupts and to send user-interrupt notifications. In addition,
the user-interrupt architecture defines a new instruction, SENDUIPI, by which application software can send inter-
processor user interrupts (user IPIs). An execution of SENDUIPI posts a user interrupt in a UPID and sends a user-
interrupt notification.
(Platforms may include mechanisms to process external interrupts as either ordinary interrupts or user interrupts.
Those processed as user interrupts would be posted in UPIDs and may result in user-interrupt notifications. Specifics of
such mechanisms are outside of the scope of this document.)
Section 11.2 explains how a processor enumerates support for user interrupts and how they are enabled by system
software. Section 11.3 identifies the new processor state defined for user interrupts. Section 11.4 explains how a
processor identifies and delivers user interrupts. Section 11.5 describes how a processor identifies and processes
user-interrupt notifications. Section 11.7 defines new support for user inter-processor interrupts (user IPIs).
Section 11.8 details how existing instructions support the new processor state and presents instructions to be
introduced for user interrupts. Section 11.8.2 and Section 11.9 describe how user interrupts are supported by the
XSAVE feature set and the VMX extensions, respectively. Another section discusses interactions with
system-management mode (SMM).
1. For clarity, this chapter uses the term ordinary interrupts to refer to those events in the existing interrupt architecture, which are
typically delivered to system software operating with CPL = 0.
1. Execution of the STI instruction does not block delivery of user interrupts for one instruction as it does ordinary interrupts. If a user
interrupt is delivered immediately following execution of a STI instruction, ordinary interrupts are not blocked after delivery of the
user interrupt.
2. User-interrupt delivery occurs only if CPL = 3. Since the HLT and MWAIT instructions can be executed only if CPL = 0, a user
interrupt can never be delivered when a logical processor is in an activity state that was entered using one of those instructions.
User-interrupt delivery will also increment performance counters for which counting
BR_INST_RETIRED.FAR_BRANCH is enabled. Some implementations may have dedicated events for counting
user-interrupt delivery; see processor-specific event lists at https://download.01.org/perfmon/index/.
The notation PIR (posted-interrupt requests) refers to the 64 posted-interrupt requests in a UPID.
If an ordinary interrupt arrives while CR4.UINTR = IA32_EFER.LMA = 1, the logical processor determines whether
the interrupt is a user-interrupt notification. This process is called user-interrupt notification identification
and is described in Section 11.5.1.
Once a logical processor has identified a user-interrupt notification, it copies user interrupts in the UPID’s PIR into
UIRR. This process is called user-interrupt notification processing and is described in Section 11.5.2.
A logical processor is not interruptible during either user-interrupt notification identification or user-interrupt
notification processing or between those operations (when they occur in succession).
1. If the interrupt arrives between iterations of a REP-prefixed string instruction, the processor first updates state as follows: RIP is
loaded to reference the string instruction; RCX, RSI, and RDI are updated as appropriate to reflect the iterations completed; and
RFLAGS.RF is set to 1.
3. The processor writes zero to the EOI register in the local APIC; this dismisses the interrupt with vector V = UINV
from the local APIC.
User-interrupt notification identification involves acknowledgment of the local APIC and thus occurs only when
ordinary interrupts are not masked.
(The behavior described above may be modified in VMX non-root operation; see Section 11.9.2.2 and Section
11.9.3.3.)
If user-interrupt notification identification completes step #3, the logical processor then performs user-interrupt
notification processing as described in Section 11.5.2.
An ordinary interrupt that occurs during transactional execution causes the transactional execution to abort and
transition to a non-transactional execution. This occurs before user-interrupt notification identification.
An ordinary interrupt that occurs while software is executing inside an enclave causes an asynchronous enclave
exit (AEX). This AEX occurs before user-interrupt notification identification.
• In all other cases, the logical processor is in the active state following user-interrupt notification processing.
Section 11.9.2.3 discusses VM exits that may occur during user-interrupt notification processing.
interrupt state component by setting IA32_XSS.UINTR. (This implies that XSETBV will not allow XCR0.UINTR to be
set.)
The user-interrupt state component comprises 48 bytes in memory with the following layout:
• Bytes 7:0 are for UIHANDLER (the IA32_UINTR_HANDLER MSR).
• Bytes 15:8 are for UISTACKADJUST (the IA32_UINTR_STACKADJUST MSR).
• Bytes 23:16 are for UITTSZ and UINV (from the IA32_UINTR_MISC MSR) and for UIF, organized as follows:
— Bytes 19:16 are for UITTSZ (bits 31:0 of the IA32_UINTR_MISC MSR).
— Byte 20 is for UINV (bits 39:32 of the IA32_UINTR_MISC MSR).
— Bytes 22:21 (2 bytes) and bits 6:0 of byte 23 are reserved. (They may be used for bits 62:40 of the
IA32_UINTR_MISC MSR, if they are defined in the future.)
— Bit 7 of byte 23 is for UIF.
Because bit 7 of byte 23 is for UIF (which is not part of the IA32_UINTR_MISC MSR), software that reads a
value from bytes 23:16 should clear bit 63 of that 64-bit value before attempting to write it to the
IA32_UINTR_MISC MSR.
• Bytes 31:24 are for UPIDADDR (the IA32_UINTR_PD MSR).
• Bytes 39:32 are for UIRR (the IA32_UINTR_RR MSR).
• Bytes 47:40 are for UITTADDR (the IA32_UINTR_TT MSR, including bit 0, the valid bit).
The user-interrupt state component is in its initial state if all user-interrupt registers are zero.
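The 48-byte layout above can be illustrated with a minimal sketch that packs the registers into a state component image and recovers a value suitable for writing to the IA32_UINTR_MISC MSR. The function names are hypothetical; little-endian byte order is assumed, as is usual for memory images on this architecture.

```python
import struct

def pack_uintr_state(uihandler, uistackadjust, uittsz, uinv, uif,
                     upidaddr, uirr, uittaddr) -> bytes:
    """Build the 48-byte user-interrupt state component image."""
    # Bytes 23:16: UITTSZ in bits 31:0, UINV in bits 39:32, UIF in bit 63.
    misc = (uittsz & 0xFFFFFFFF) | ((uinv & 0xFF) << 32) | ((uif & 1) << 63)
    return struct.pack("<6Q", uihandler, uistackadjust, misc,
                       upidaddr, uirr, uittaddr)

def misc_msr_value(area: bytes) -> int:
    """Read bytes 23:16 and clear bit 63 (UIF), which is not part of
    the IA32_UINTR_MISC MSR."""
    (qword,) = struct.unpack_from("<Q", area, 16)
    return qword & ~(1 << 63)
```

Clearing bit 63 before the MSR write follows the guidance given above for software that reads bytes 23:16.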
Certain portions of a supervisor state component may be identified as master-enable state. XSAVES and
XRSTORS treat this state specially. UINV is the master-enable state for the user-interrupt state component. See
Section 11.8.2.3 and Section 11.8.2.4 for the treatment of this state by XSAVES and XRSTORS, respectively.
11.8.2.3 XSAVES
The management of the user-interrupt state component by XSAVES follows the architecture of the XSAVE feature
set. The following items identify points that are specific to saving the user-interrupt state component:
• XSAVES writes the user-interrupt registers to the user-interrupt state component using the format specified in
Section 11.8.2.1.
• XSAVES stores zeros to bits and bytes identified in Section 11.8.2.1 as reserved.
• The values saved for UIHANDLER, UPIDADDR, and UITTADDR are always canonical relative to the maximum
linear-address width enumerated by CPUID.¹
• After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
XSAVES does not modify the remainder of that MSR.)
11.8.2.4 XRSTORS
The management of the user-interrupt state component by XRSTORS follows the architecture of the XSAVE feature
set. The following items identify points that are specific to restoring the user-interrupt state component:
• Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS
causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV
is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)
• If the instruction mask and XSAVE area used by XRSTORS indicate that the user-interrupt state component
should be loaded from the XSAVE area, XRSTORS reads the user-interrupt registers from the XSAVE area using
the format identified in Section 11.8.2.1. The values read cause a general-protection fault (#GP) in any of the
following cases:
— If any of the bits and bytes identified as reserved is not zero;
— If the value to be loaded into any one of UIHANDLER, UISTACKADJUST, UPIDADDR, or UITTADDR is not
canonical relative to the maximum linear-address width enumerated by CPUID; or
— If the value to be loaded into either UPIDADDR or UITTADDR sets any of the bits reserved in that register
(the reserved bits are bits 5:0 of UPIDADDR and bits 3:1 of UITTADDR; bit 0 of UITTADDR is the valid bit for
SENDUIPI).
• If XRSTORS causes a fault or a VM exit after loading any part of the user-interrupt state component, XRSTORS
clears UINV before delivering the fault or VM exit. (Other elements of user-interrupt state, including other parts
of the IA32_UINTR_MISC MSR, may retain the values that were loaded by XRSTORS.)
• After a non-faulting execution of XRSTORS that loads the user-interrupt state component, the logical processor
recognizes a pending user interrupt if and only if some bit is set in the new value of UIRR (see Section 11.4.1).
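The checks that XRSTORS applies to the values read from the XSAVE area can be sketched as a validation routine over a 48-byte image. This is an illustrative model only: the function names and fault strings are hypothetical, a maximum linear-address width of 48 is assumed by default, and the separate check that UINV is 0 before the restore is not modeled.

```python
import struct

def is_canonical(addr: int, max_lin_addr: int = 48) -> bool:
    """True if bits 63:(max_lin_addr - 1) are all zeros or all ones."""
    upper = addr >> (max_lin_addr - 1)
    return upper == 0 or upper == (1 << (65 - max_lin_addr)) - 1

def xrstors_uintr_faults(area: bytes, max_lin_addr: int = 48) -> list:
    """List the #GP conditions that a 48-byte state image would trigger."""
    uihandler, uistackadjust, misc, upidaddr, _uirr, uittaddr = \
        struct.unpack_from("<6Q", area)
    faults = []
    if (misc >> 40) & 0x7FFFFF:          # bytes 22:21 and bits 6:0 of byte 23
        faults.append("reserved bits set in bytes 23:16")
    for name, value in (("UIHANDLER", uihandler),
                        ("UISTACKADJUST", uistackadjust),
                        ("UPIDADDR", upidaddr),
                        ("UITTADDR", uittaddr)):
        if not is_canonical(value, max_lin_addr):
            faults.append("non-canonical " + name)
    if upidaddr & 0x3F:                  # bits 5:0 of UPIDADDR are reserved
        faults.append("reserved bits set in UPIDADDR")
    if uittaddr & 0xE:                   # bits 3:1 of UITTADDR are reserved
        faults.append("reserved bits set in UITTADDR")
    return faults
```

An empty list corresponds to an image that passes all three checks listed above.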
1. They might not be canonical relative to the current paging mode if it supports only smaller linear addresses.
1. If virtual-interrupt delivery occurs between iterations of a REP-prefixed string instruction, the processor will first update state as
follows: RIP is loaded to reference the string instruction; RCX, RSI, and RDI are updated as appropriate to reflect the iterations
completed; and RFLAGS.RF is set to 1.
If this modified form of user-interrupt notification identification completes step #3, the logical processor then
performs user-interrupt notification processing as specified in Section 11.5.2.
A logical processor is not interruptible during this modified form of user-interrupt notification identification or
between it and any subsequent user-interrupt notification processing.
A virtual interrupt that occurs during transactional execution causes the transactional execution to abort and
transition to a non-transactional execution. This occurs before this modified form of user-interrupt notification
identification.
A virtual interrupt that occurs while software is executing inside an enclave normally causes an asynchronous
enclave exit (AEX). Such an AEX would occur before this modified form of user-interrupt notification identification.
Outside of VMX non-root operation, the logical processor will send this IPI by writing to the local APIC’s interrupt-
command register (ICR). In VMX non-root operation, behavior depends on the settings of the “use TPR shadow”
and “virtualize APIC accesses” VM-execution controls:
1. The new UIRET and SENDUIPI instructions also access memory with linear addresses. Because they are instructions, the existing
VMX architecture fully defines the operation of any resulting VM exits.
2. SPP-induced VM exits include both SPP misses and SPP misconfigurations.
1. If the “use TPR shadow” VM-execution control is 0, the behavior is not modified: the logical processor sends the
specified IPI by writing to the local APIC’s ICR as specified above.
2. If the “use TPR shadow” VM-execution control is 1 and the “virtualize APIC accesses” VM-execution control is 0,
the logical processor virtualizes the sending of an x2APIC-mode IPI by performing the following steps:
a. Writing the 64-bit value Z to offset 300H on the virtual-APIC page (VICR), where Z[7:0] = tempUPID.NV
(the 8-bit virtual vector), Z[63:32] = tempUPID.NDST (the 32-bit virtual APIC ID) and Z[31:8] = 000000H
(indicating a physically addressed fixed-mode IPI).
b. Causing an APIC-write VM exit with exit qualification 300H.
APIC-write VM exits are trap-like: the value of CS:RIP saved in the guest-state area of the VMCS references
the instruction after SENDUIPI. The basic exit reason for an APIC-write VM exit is “APIC write” (56). The exit
qualification is the page offset of the write access that led to the VM exit — 300H in this case.
3. If the “use TPR shadow” and “virtualize APIC accesses” VM-execution controls are both 1, the logical processor
virtualizes the sending of an xAPIC-mode IPI by performing the following steps:
a. Writing the 32-bit value X to offset 310H on the virtual-APIC page (VICR_HI), where X[31:24] =
tempUPID.NDST[15:8] (the 8-bit virtual APIC ID) and X[23:0] = 000000H.¹
b. Writing the 32-bit value Y to offset 300H on the virtual-APIC page (VICR_LO), where Y[7:0] =
tempUPID.NV (the 8-bit virtual vector) and Y[31:8] = 000000H (indicating a physically addressed fixed-
mode IPI).
c. Causing an APIC-write VM exit with exit qualification 300H (see above).
1. For xAPIC mode (which is virtualized if the “virtualize APIC accesses” VM-execution control is 1), the destination APIC ID is in byte 1
(not byte 0) of the UPID’s 4-byte NDST field.
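The values written to the virtual-APIC page in cases 2 and 3 above can be illustrated with a short sketch. The function names are hypothetical; nv and ndst stand for tempUPID.NV and tempUPID.NDST.

```python
def x2apic_vicr(nv: int, ndst: int) -> int:
    """64-bit value Z written to offset 300H when virtualizing an
    x2APIC-mode IPI: vector in bits 7:0, APIC ID in bits 63:32."""
    return ((ndst & 0xFFFFFFFF) << 32) | (nv & 0xFF)

def xapic_vicr(nv: int, ndst: int) -> tuple:
    """(X, Y) written to offsets 310H and 300H when virtualizing an
    xAPIC-mode IPI: APIC ID from NDST[15:8] in X[31:24], vector in Y[7:0]."""
    x = ((ndst >> 8) & 0xFF) << 24
    y = nv & 0xFF
    return x, y
```

In both cases the remaining bits are zero, which corresponds to a physically addressed fixed-mode IPI.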
2. If VM entry loaded UINV from the VMCS, the checking of UINV is based on the value loaded.
3. This step, writing zero to the EOI register in the local APIC, is omitted.
Because VM entry allows interrupt injection only when interrupts are not masked in a guest (e.g., when RFLAGS is
being loaded with a value that sets bit 9, IF), this modified form of user-interrupt notification identification occurs
only when virtual interrupts are not masked.
If user-interrupt notification identification completes step #2, the logical processor then performs user-interrupt
notification processing as detailed in Section 11.5.2.
A logical processor is not interruptible during this modified form of user-interrupt notification identification or
between it and any subsequent user-interrupt notification processing.
This change in VM-entry event injection occurs as long as UINTR is set to 1 in the CR4 field in the guest-state area
of the VMCS and the “IA-32e mode guest” VM-entry control is 1; the settings of the “external-interrupt exiting” and
“virtual-interrupt delivery” VM-execution controls do not affect this change.
CHAPTER 12
PERFORMANCE MONITORING UPDATES
This chapter covers performance monitoring updates for processors based on Alder Lake Client microarchitecture
and processors based on Sapphire Rapids Server microarchitecture.
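[Figure 12-1 bit layout: the lower half (bits 31:0) is divided into four 8-bit fields, whose labels are not reproduced here. The upper half holds Memory Bound (bits 63:56), Fetch Latency (bits 55:48), Branch Mispredicts (bits 47:40), and Heavy Operations (bits 39:32).]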
Figure 12-1. PERF_METRICS MSR Definition for Alder Lake Client and Sapphire Rapids Server Microarchitectures
The lower half of the register is the TMA level 1 metrics (legacy). The upper half is also divided into four 8-bit fields,
each of which is an integer fraction of 255. Additionally, each of the new level 2 metrics in the upper half is a subset
of the corresponding level 1 metric in the lower half (that is, its parent node per the TMA hierarchy). This enables
software to deduce the other four level 2 metrics by subtracting corresponding metrics as shown in Figure 12-2.
Figure 12-2. Deducing Implied Level 2 Metrics in the Core PMU for
Alder Lake Client and Sapphire Rapids Server Microarchitectures
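The subtraction described above can be sketched as follows. The sketch assumes that the level 2 field at byte position i of the upper half corresponds to the level 1 field at byte position i of the lower half; that pairing, and the function names, are assumptions made for illustration rather than a statement of the figure's contents.

```python
def split_perf_metrics(msr: int):
    """Split PERF_METRICS into level 1 fields (lower half), level 2 fields
    (upper half), and the implied level 2 values (parent minus child)."""
    level1 = [(msr >> (8 * i)) & 0xFF for i in range(4)]
    level2 = [(msr >> (32 + 8 * i)) & 0xFF for i in range(4)]
    implied = [l1 - l2 for l1, l2 in zip(level1, level2)]
    return level1, level2, implied

def as_fraction(field: int) -> float:
    """Each 8-bit field is an integer fraction of 255."""
    return field / 255
```

Lists are ordered from the least significant byte of each half upward.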
The PERF_METRICS MSR and fixed counter 3 of the core PMU for Alder Lake Client and Sapphire Rapids Server
microarchitectures feature 12 metrics in total that cover all level 1 and level 2 nodes of the TMA hierarchy.
NOTE
Macro-fusion may forbid a particular instruction from obtaining PEBS samples when using a fixed reload-value on a
tight endless loop. Therefore, it is recommended to “normalize” samples for each basic-block of instructions. This
implies distributing the total sample counts evenly over all instructions within a basic block.
NOTE
Precise events on PC0 may tune the reload value differently than general-purpose performance monitoring
counters 1-7 when attempting to improve accuracy via prime reload values.
Table 12-1. Data Source Encoding for Memory Accesses (Ice Lake and Later Microarchitectures) (Contd.)
Encoding Description
04H L3 HIT. This request was satisfied by the L3 cache with no coherency actions performed (snooping).
05H XCORE MISS. This request was satisfied by the L3 cache but involved a coherency check in some sibling core(s).
06H XCORE HIT. This request was satisfied by the L3 cache but involved a coherency check that hit a non-modified copy in
a sibling core.
07H XCORE FWD. This request was satisfied by a sibling core where either a modified (cross-core HITM) or a non-modified
(cross-core FWD) cache-line copy was found.
08H Local Far Memory. This request has missed the L3 cache and was serviced by local far memory.
09H Remote Far Memory. This request has missed the L3 cache and was serviced by remote far memory.
0AH Local Near Memory. This request has missed the L3 cache and was serviced by local near memory.
0BH Remote Near Memory. This request has missed the L3 cache and was serviced by remote near memory.
0CH Remote FWD. This request has missed the L3 cache and a non-modified cache-line copy was forwarded from a
remote cache.
0DH Remote HITM. This request has missed the L3 cache and a modified cache-line was forwarded from a remote cache.
0EH I/O. Request of input/output operation.
0FH UC. The request was to uncacheable memory.
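Software that post-processes PEBS records might map these encodings to short labels with a lookup table such as the sketch below. Only the encodings listed in this continued portion of the table are included, and the function name and fallback string are hypothetical.

```python
DATA_SOURCE = {
    0x04: "L3 HIT",
    0x05: "XCORE MISS",
    0x06: "XCORE HIT",
    0x07: "XCORE FWD",
    0x08: "Local Far Memory",
    0x09: "Remote Far Memory",
    0x0A: "Local Near Memory",
    0x0B: "Remote Near Memory",
    0x0C: "Remote FWD",
    0x0D: "Remote HITM",
    0x0E: "I/O",
    0x0F: "UC",
}

def describe_data_source(encoding: int) -> str:
    """Return the label for a data source encoding, if listed above."""
    return DATA_SOURCE.get(encoding, "not listed in this excerpt")
```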
The store latency information is written into a PEBS record as shown in Table 12-2.
The store latency relies on the PEBS facility, so the PEBS configuration must be completed first. Unlike load latency,
there is no option to filter on a subset of stores that exceed a certain threshold.
1. For more details about the method, refer to Section B.1, “Top-Down Analysis Method” of the Intel® 64 and IA-32 Architectures
Optimization Reference Manual.
To determine which fields are supported for certain performance monitoring events, consult the Memory Info
attribute in the event list at 01.org.
NOTE
There may be additional block reasons, even if DataBlk and AddressBlk are both clear, e.g., non-optimal instruction
latency.
CHAPTER 13
ENHANCED HARDWARE FEEDBACK INTERFACE (EHFI)
Intel processors that enumerate CPUID.06H.0H:EAX.HW_FEEDBACK[bit 19] as 1 support the hardware feedback
interface (HFI). The hardware feedback interface is described in Section 14.6 “Hardware Feedback Interface” of the
Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B.
Intel processors that enumerate CPUID.06H.0H:EAX[bit 23] as 1 support the enhanced hardware feedback inter-
face (EHFI). Hardware provides guidance to the Operating System (OS) scheduler to perform optimal workload
scheduling through a semi-static table in memory and software thread specific index (Class ID) that points into
that table and selects which data to use for that software thread. The table structure is shown below. Its size and
various pointers into it are computed immediately following Table 13-1.
When the two software threads in question belong to the same Class ID, the OS Scheduler can schedule to higher
performance logical processors within that class when in performance mode and to higher energy efficiency
logical processors within that class when in battery saving mode.
For the HFI, where all software threads effectively belong to the same class (class 0 in the EHFI), the OS
Scheduler can use similar logic and schedule to higher performance logical processors when in performance
mode, and to higher energy efficiency logical processors when in battery saving mode.
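The selection logic described above can be sketched in C as follows. This is a simplified illustration, not the table layout defined by this document: the `lp_caps` structure, the `sched_mode` enumeration, and `pick_lp` are hypothetical names, and the array is assumed to hold the performance and energy efficiency values of a single class (class 0 for HFI), one entry per logical processor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-class view: one entry per logical processor.
 * Higher values mean higher capability. */
struct lp_caps {
    uint8_t perf;  /* performance capability */
    uint8_t ee;    /* energy efficiency capability */
};

enum sched_mode { MODE_PERFORMANCE, MODE_BATTERY_SAVING };

/* Returns the index of the preferred logical processor for the given mode,
 * or -1 if the array is empty. Ties are resolved in favor of the lowest index. */
static int pick_lp(const struct lp_caps *caps, size_t n, enum sched_mode mode)
{
    int best = -1;
    unsigned best_score = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned score = (mode == MODE_PERFORMANCE) ? caps[i].perf : caps[i].ee;
        if (best < 0 || score > best_score) {
            best = (int)i;
            best_score = score;
        }
    }
    return best;
}
```

For EHFI, a scheduler would first use the software thread's Class ID to choose which class's values populate the array, then apply the same comparison. A real scheduler would also weigh load, affinity, and idle state, which this sketch ignores.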
The core ordering of the performance and energy columns may be different between HFI with thread-specific
hardware feedback supported classes.
Bit 1 is valid only if CPUID[6].EAX[bit 23] is set. Setting this bit when support is not enumerated causes the
hardware to generate #GP.
Table 13-2 summarizes the control options described above.
See Section 13.7 for details on scenarios where IA32_HW_FEEDBACK_CONFIG bits are implicitly reset by the
hardware.
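The validity rule for bit 1 can be guarded in software before the MSR is written. The sketch below computes a value for IA32_HW_FEEDBACK_CONFIG that never sets a bit whose support is not enumerated. It assumes that bit 0 is the HFI enable bit and bit 1 the EHFI enable bit; the macro and function names are hypothetical, and the actual MSR write is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments in IA32_HW_FEEDBACK_CONFIG (names are hypothetical). */
#define HW_FEEDBACK_CONFIG_HFI_EN   (1ull << 0)  /* assumption: HFI enable */
#define HW_FEEDBACK_CONFIG_EHFI_EN  (1ull << 1)  /* valid only if CPUID[6].EAX[23] = 1 */

/* Builds a config value from the requested features, dropping any request
 * that CPUID.06H.0H:EAX does not enumerate, so that the write cannot raise
 * #GP because of an unsupported bit 1. */
static uint64_t hw_feedback_config_value(bool want_hfi, bool want_ehfi,
                                         uint32_t cpuid6_eax)
{
    bool hfi_enumerated  = ((cpuid6_eax >> 19) & 1u) != 0;
    bool ehfi_enumerated = ((cpuid6_eax >> 23) & 1u) != 0;
    uint64_t value = 0;

    if (want_hfi && hfi_enumerated)
        value |= HW_FEEDBACK_CONFIG_HFI_EN;
    if (want_ehfi && ehfi_enumerated)
        value |= HW_FEEDBACK_CONFIG_EHFI_EN;
    return value;
}
```

Masking the request rather than failing lets the same code path run on processors that enumerate only HFI.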
the enable bit transitions from 0 to 1, hardware will generate an initial notification, with the
IA32_PACKAGE_THERM_STATUS bit 26 set to 1, to indicate that the OS should read the current HFI/EHFI structure.
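A handler that receives this notification needs to test bit 26 of the status value. The following is a minimal sketch under the assumption that the caller has already read IA32_PACKAGE_THERM_STATUS into a 64-bit value; the macro and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 26 of IA32_PACKAGE_THERM_STATUS, as given in the text.
 * The macro name is hypothetical. */
#define PKG_THERM_STATUS_HFI_UPDATE_BIT 26u

/* Returns true if the status value indicates that the OS should read
 * the current HFI/EHFI structure. */
static bool hfi_structure_updated(uint64_t pkg_therm_status)
{
    return ((pkg_therm_status >> PKG_THERM_STATUS_HFI_UPDATE_BIT) & 1u) != 0;
}
```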
Bit 0 of the logical processor scope configuration MSR can be cleared or set regardless of the state of the
HFI/EHFI package configuration MSR. Even when bit 0 of all logical processor configuration MSRs is clear,
the processor can still update the EHFI structure if it is still enabled in the IA32_HW_FEEDBACK_CONFIG package
scope MSR. When the operating system clears IA32_HW_FEEDBACK_THREAD_CONFIG[bit 0], hardware clears
the history accumulated on that logical processor which otherwise drives assigning the Class ID to the software
thread that executes on that logical processor. As long as IA32_HW_FEEDBACK_THREAD_CONFIG[bit 0] is set,
the Class ID is available for the operating system to read, independent of the state of the package scope
IA32_HW_FEEDBACK_CONFIG[1:0] bits.
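The independence of the two scopes can be summarized with two small predicates. This is only a model of the rules stated above, with hypothetical function names: one predicate depends solely on the logical processor scope MSR value, the other solely on the package scope MSR value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Class ID is available to the OS whenever
 * IA32_HW_FEEDBACK_THREAD_CONFIG[bit 0] is set,
 * regardless of the package scope bits. */
static bool class_id_available(uint64_t thread_config)
{
    return (thread_config & 1u) != 0;
}

/* The processor may still update the structure whenever either of
 * IA32_HW_FEEDBACK_CONFIG[1:0] is set, regardless of the logical
 * processor scope bits. */
static bool structure_may_update(uint64_t package_config)
{
    return (package_config & 3u) != 0;
}
```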
See Section 13.7 for details on scenarios where IA32_HW_FEEDBACK_CONFIG bits are implicitly reset by the
hardware.
C
Cache and TLB information 1-31
Cache Inclusiveness 1-5
CLFLUSH instruction
CPUID flag 1-30
CMOVcc flag 1-30
CMOVcc instructions
CPUID flag 1-30
CMPXCHG16B instruction
CPUID bit 1-28
CMPXCHG8B instruction
CPUID flag 1-30
CPUID instruction 1-3, 1-30
36-bit page size extension 1-30
APIC on-chip 1-30
basic CPUID information 1-4
cache and TLB characteristics 1-4, 1-31
CLFLUSH flag 1-30
CLFLUSH instruction cache line size 1-26
CMPXCHG16B flag 1-28
CMPXCHG8B flag 1-30
CPL qualified debug store 1-27
debug extensions, CR4.DE 1-29
debug store supported 1-30
deterministic cache parameters leaf 1-4, 1-7, 1-9, 1-10, 1-11, 1-12, 1-13, 1-14, 1-15, 1-16, 1-21
extended function information 1-22
feature information 1-29
FPU on-chip 1-29
FXSAVE flag 1-30
FXRSTOR flag 1-30
IA-32e mode available 1-22
input limits for EAX 1-24
L1 Context ID 1-28
local APIC physical ID 1-26
machine check architecture 1-30
machine check exception 1-30
memory type range registers 1-30
MONITOR feature information 1-34
MONITOR/MWAIT flag 1-27
MONITOR/MWAIT leaf 1-5, 1-6, 1-7, 1-10, 1-17, 1-21
MWAIT feature information 1-34
page attribute table 1-30
page size extension 1-29
performance monitoring features 1-34
physical address bits 1-23
physical address extension 1-30
power management 1-34, 1-35, 1-36
processor brand index 1-26, 1-36
processor brand string 1-23, 1-36
processor serial number 1-30
processor type field 1-25
RDMSR flag 1-29
returned in EBX 1-26
returned in ECX & EDX 1-26
self snoop 1-31
SpeedStep technology 1-27
SSE2 extensions flag 1-31
SSE extensions flag 1-31
SSE3 extensions flag 1-27
SSSE3 extensions flag 1-28
SYSENTER flag 1-30
SYSEXIT flag 1-30
thermal management 1-34, 1-35, 1-36
thermal monitor 1-27, 1-30, 1-31
time stamp counter 1-29
using CPUID 1-3
vendor ID string 1-24
version information 1-4, 1-33
virtual 8086 Mode flag 1-29
virtual address bits 1-23
WRMSR flag 1-29
F
Feature information, processor 1-3
FXRSTOR instruction
CPUID flag 1-30
FXSAVE instruction
CPUID flag 1-30
I
IA-32e mode
CPUID flag 1-22
Instruction set
grouped by processor 1-2
L
L1 Context ID 1-28
M
Machine check architecture
CPUID flag 1-30
description 1-30
MMX instructions
CPUID flag for technology 1-30
Model & family information 1-33
MONITOR instruction
CPUID flag 1-27
feature data 1-34
MWAIT instruction
CPUID flag 1-27
feature data 1-34
P
Pending break enable 1-31
Performance-monitoring counters
CPUID inquiry for 1-34
R
RDMSR instruction
CPUID flag 1-29
S
Self Snoop 1-31
SpeedStep technology 1-27
SSE extensions
CPUID flag 1-31
SSE2 extensions
CPUID flag 1-31
SSE3
CPUID flag 1-27
SSE3 extensions
CPUID flag 1-27
SSSE3 extensions
CPUID flag 1-28
Stepping information 1-33
SYSENTER instruction
CPUID flag 1-30
SYSEXIT instruction
CPUID flag 1-30
T
Thermal Monitor
CPUID flag 1-31
Thermal Monitor 2 1-27
CPUID flag 1-27
Time Stamp Counter 1-29
V
Version information, processor 1-3
VPMULTISHIFTQB – Select Packed Unaligned Bytes from Quadword Source 2-31
W
WBINVD instruction 2-37
WBINVD/INVD bit 1-5
WRMSR instruction
CPUID flag 1-29
X
XRSTOR 1-34
XSAVE 1-28, 1-34