Architecture Instruction Set Extensions Programming Reference
March 2024
319433-052
Notices & Disclaimers
This document contains information on products in the design phase of development. The information here is
subject to change without notice. Do not finalize a design with this information.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning
Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter
drafted which includes subject matter disclosed herein.
All product plans and roadmaps are subject to change without notice.
The products described may contain design defects or errors known as errata which may cause the product to deviate from
published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,
fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of
dealing, or usage in trade.
Code names are used by Intel to identify products, technologies, or services that are in development and not publicly
available. These are not “commercial” names and not intended to function as trademarks.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with
the sole exception that a) you may publish an unmodified copy and b) code included in this document is licensed subject to
the Zero-Clause BSD open source license (0BSD), https://opensource.org/licenses/0BSD. You may create software
implementations based on this document and in compliance with the foregoing that are intended to execute on the Intel
product(s) referenced in this document. No rights are granted to create modifications or derivatives of this document.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other
names and brands may be claimed as the property of others.
CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND FEATURES
1.1 About This Document
1.2 DisplayFamily and DisplayModel for Future Processors
1.3 Instruction Set Extensions and Feature Introduction in Intel® 64 and IA-32 Processors
1.4 Detection of Future Instructions and Features
    CPUID—CPU Identification
1.5 Compressed Displacement (disp8*N) Support in EVEX
1.6 bfloat16 Floating-Point Format
CHAPTER 2
INSTRUCTION SET REFERENCE, A-Z
2.1 Instruction Set Reference
    AADD—Atomically Add
    AAND—Atomically AND
    AOR—Atomically OR
    AXOR—Atomically XOR
    CMPccXADD—Compare and Add if Condition is Met
    PBNDKB—Platform Bind Key to Binary Large Object
    PCONFIG—Platform Configuration
    RDMSRLIST—Read List of Model Specific Registers
    URDMSR—User Read from Model-Specific Register
    UWRMSR—User Write to Model-Specific Register
    VBCSTNEBF162PS—Load BF16 Element and Convert to FP32 Element With Broadcast
    VBCSTNESH2PS—Load FP16 Element and Convert to FP32 Element with Broadcast
    VCVTNEEBF162PS—Convert Even Elements of Packed BF16 Values to FP32 Values
    VCVTNEEPH2PS—Convert Even Elements of Packed FP16 Values to FP32 Values
    VCVTNEOBF162PS—Convert Odd Elements of Packed BF16 Values to FP32 Values
    VCVTNEOPH2PS—Convert Odd Elements of Packed FP16 Values to FP32 Values
    VCVTNEPS2BF16—Convert Packed Single-Precision Floating-Point Values to BF16 Values
    VPDPB[SU,UU,SS]D[,S]—Multiply and Add Unsigned and Signed Bytes With and Without Saturation
    VPDPW[SU,US,UU]D[,S]—Multiply and Add Unsigned and Signed Words With and Without Saturation
    VPMADD52HUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit Products to Qword Accumulators
    VPMADD52LUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products to Qword Accumulators
    VSHA512MSG1—Perform an Intermediate Calculation for the Next Four SHA512 Message Qwords
    VSHA512MSG2—Perform a Final Calculation for the Next Four SHA512 Message Qwords
    VSHA512RNDS2—Perform Two Rounds of SHA512 Operation
    VSM3MSG1—Perform Initial Calculation for the Next Four SM3 Message Words
    VSM3MSG2—Perform Final Calculation for the Next Four SM3 Message Words
    VSM3RNDS2—Perform Two Rounds of SM3 Operation
    VSM4KEY4—Perform Four Rounds of SM4 Key Expansion
    VSM4RNDS4—Performs Four Rounds of SM4 Encryption
    WRMSRLIST—Write List of Model Specific Registers
    WRMSRNS—Non-Serializing Write to Model Specific Register
CHAPTER 3
INTEL® AMX INSTRUCTION SET REFERENCE, A-Z
3.1 Introduction
3.1.1 Tile Architecture Details
3.1.2 TMUL Architecture Details
CHAPTER 4
UC-LOCK DISABLE
4.1 Features to Disable Bus Locks
4.2 UC-Lock Disable
CHAPTER 5
INTEL® RESOURCE DIRECTOR TECHNOLOGY FEATURE UPDATES
5.1 Cache Bandwidth Allocation (CBA)
5.1.1 Introduction to Cache Bandwidth Allocation
5.1.2 Cache Bandwidth Allocation Enumeration
5.1.3 Cache Bandwidth Allocation Configuration
5.1.4 Cache Bandwidth Allocation Usage Considerations
CHAPTER 6
LINEAR ADDRESS MASKING (LAM)
6.1 Enumeration, Enabling, and Configuration
6.2 Treatment of Data Accesses with LAM Active for User Pointers
6.3 Treatment of Data Accesses with LAM Active for Supervisor Pointers
6.4 Canonicality Checking for Data Addresses Written to Control Registers and MSRs
6.5 Paging Interactions
6.6 VMX Interactions
6.6.1 Guest Linear Address
6.6.2 VM-Entry Checking of Values of CR3 and CR4
6.6.3 CR3-Target Values
6.6.4 Hypervisor-Managed Linear Address Translation (HLAT)
6.7 Debug and Tracing Interactions
6.7.1 Debug Registers
6.7.2 Intel® Processor Trace
6.8 Intel® SGX Interactions
6.9 System Management Mode (SMM) Interactions
CHAPTER 7
CODE PREFETCH INSTRUCTION UPDATES
    PREFETCHh—Prefetch Data or Code Into Caches
CHAPTER 8
NEXT GENERATION PERFORMANCE MONITORING UNIT (PMU)
8.1 New Enumeration Architecture
8.1.1 CPUID Sub-Leafing
8.1.2 Reporting Per Logical Processor
8.1.3 General-Purpose Counters Bitmap
8.1.4 Fixed-Function Counters True-View Bitmap
8.1.5 Architectural Performance Monitoring Events Bitmap
8.1.6 TMA Slots Per Cycle
8.1.7 Non-Architectural Performance Capabilities
CHAPTER 9
LINEAR ADDRESS SPACE SEPARATION (LASS)
9.1 Introduction
9.2 Enumeration and Enabling
9.3 Operation of Linear-Address Space Separation
9.3.1 Data Accesses
9.3.2 Instruction Fetches
CHAPTER 10
REMOTE ATOMIC OPERATIONS IN INTEL ARCHITECTURE
10.1 Introduction
10.2 Instructions
10.3 Alignment Requirements
10.4 Memory Ordering
10.5 Memory Type
10.6 Write Combining Behavior
10.7 Performance Expectations
10.7.1 Interaction Between RAO and Other Accesses
10.7.2 Updates of Contended Data
10.7.3 Updates of Uncontended Data
10.8 Examples
10.8.1 Histogram
10.8.2 Interrupt/Event Handler
CHAPTER 11
TOTAL STORAGE ENCRYPTION IN INTEL ARCHITECTURE
11.1 Introduction
11.1.1 Key Programming Overview
11.1.1.1 Key Wrapping Support: PBNDKB
CHAPTER 12
FLEXIBLE UIRET
12.1 Existing UIRET Functionality and UIF
12.2 Flexible Updates of UIF
12.3 UIRET Instruction Details
    UIRET—User-Interrupt Return
CHAPTER 13
USER-TIMER EVENTS AND INTERRUPTS
13.1 Enabling and Enumeration
13.2 User Deadline
13.3 User Timer: Architectural State
13.4 Pending and Processing of User-Timer Events
13.5 VMX Support
13.5.1 VMCS Changes
13.5.2 Changes to VMX Non-Root Operation
13.5.2.1 Treatment of Accesses to the IA32_UINTR_TIMER MSR
13.5.2.2 Treatment of User-Timer Events
13.5.3 Changes to VM Entries
CHAPTER 14
APIC-TIMER VIRTUALIZATION
14.1 Guest-Timer Hardware
14.1.1 Responding to Guest-Deadline Updates
14.1.2 Guest-Timer Events
14.2 VMCS Support
14.2.1 New VMX Control
14.2.2 New VMCS Fields
14.3 Changes to VM Entries
14.3.1 Checking VMX Controls
14.3.2 Loading the Guest Deadline
14.4 Changes to VMX Non-Root Operation
14.4.1 Accesses to the IA32_TSC_DEADLINE MSR
14.4.2 Processing of Guest-Timer Events
14.5 Changes to VM Exits
CHAPTER 15
VMX SUPPORT FOR THE IA32_SPEC_CTRL MSR
15.1 VMCS Changes
15.1.1 New VMX Controls
15.1.2 New VMCS Fields
15.2 Changes to VM Entries
15.2.1 Host-State Checking
15.2.2 Guest-State Checking
15.2.3 Guest-State Loading
15.3 Changes to VM Exits
15.3.1 Saving Guest State
CHAPTER 16
PROCESSOR TRACE TRIGGER TRACING
16.1 Enabling and Enumeration
16.2 Processor Trace Trigger Tracing Overview
16.2.1 Trigger (TRIG) Packet
16.3 MSR Changes
16.3.1 IA32_RTIT_TRIGGERx_CFG
16.3.2 IA32_PERFEVTSELx MSR Changes
16.3.3 DR7 Changes
16.3.4 IA32_RTIT_STATUS Changes
CHAPTER 17
MONITORLESS MWAIT
17.1 Using Monitorless MWAIT
17.2 Enumeration
17.3 Enabling
17.4 Virtualization
17.5 MWAIT Instruction Details
    MWAIT—Monitor Wait
CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND
FEATURES
Table 1-2. Recent Instruction Set Extensions / Features Introduction in Intel® 64 and IA-32 Processors1
Instruction Set Architecture / Feature Introduction
Direct stores: MOVDIRI, MOVDIR64B Tremont, Tiger Lake, Sapphire Rapids
AVX512_BF16 Cooper Lake, Sapphire Rapids
CET: Control-flow Enforcement Technology Tiger Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
AVX512_VP2INTERSECT Tiger Lake (not currently supported in any other processors)
Enqueue Stores: ENQCMD and ENQCMDS Sapphire Rapids, Sierra Forest, Grand Ridge
CLDEMOTE Tremont, Sapphire Rapids
PTWRITE Goldmont Plus, Alder Lake, Sapphire Rapids
User Wait: TPAUSE, UMONITOR, UMWAIT Tremont, Alder Lake, Sapphire Rapids
Architectural LBRs Alder Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
HLAT Alder Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
SERIALIZE Alder Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
Intel® TSX Suspend Load Address Tracking (TSXLDTRK) Sapphire Rapids
Intel® Advanced Matrix Extensions (Intel® AMX) Sapphire Rapids
Includes CPUID Leaf 1EH, “TMUL Information Main Leaf,” and
CPUID bits AMX-BF16, AMX-TILE, and AMX-INT8.
AVX-VNNI Alder Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
User Interrupts (UINTR) Sapphire Rapids, Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
Intel® Trust Domain Extensions (Intel® TDX)2 Emerald Rapids
Supervisor Memory Protection Keys (PKS)3 Alder Lake, Sapphire Rapids, Sierra Forest, Grand Ridge
Linear Address Masking (LAM) Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
IPI Virtualization Sapphire Rapids, Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
RAO-INT Future processors
PREFETCHIT0/1 Granite Rapids, Clearwater Forest, Panther Lake
AMX-FP16 Granite Rapids
CMPCCXADD Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
AVX-IFMA Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
AVX-NE-CONVERT Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
AVX-VNNI-INT8 Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
RDMSRLIST/WRMSRLIST/WRMSRNS Sierra Forest, Grand Ridge, Panther Lake
Linear Address Space Separation (LASS) Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
Virtualization of the IA32_SPEC_CTRL MSR: Specify Bits Cannot be Modified by Guest Software    Sapphire Rapids, Sierra Forest, Grand Ridge, Panther Lake
UC-Lock Disable via CPUID Enumeration Sierra Forest, Grand Ridge
LBR Event Logging Sierra Forest, Grand Ridge, Arrow Lake S (06_C6H), Lunar Lake
AMX-COMPLEX Granite Rapids D (06_AEH)
AVX-VNNI-INT16 Arrow Lake S (06_C6H), Lunar Lake, Clearwater Forest
SHA512 Arrow Lake S (06_C6H), Lunar Lake, Clearwater Forest
SM3 Arrow Lake S (06_C6H), Lunar Lake, Clearwater Forest
SM4 Arrow Lake S (06_C6H), Lunar Lake, Clearwater Forest
UIRET flexibly updates UIF Sierra Forest, Grand Ridge, Arrow Lake, Lunar Lake
Total Storage Encryption (TSE) and the PBNDKB instruction Lunar Lake
Intel® Advanced Vector Extensions 10 Version 1 (Intel® AVX10.1)4    Granite Rapids
USER_MSR Clearwater Forest
Flexible Return and Event Delivery (FRED) and the LKGS instruction5    Panther Lake, Clearwater Forest
NMI-Source Reporting5 Panther Lake, Clearwater Forest
User-Timer Events and Interrupts Clearwater Forest
APIC-Timer Virtualization Clearwater Forest
VMX Support for the IA32_SPEC_CTRL MSR Sierra Forest, Grand Ridge
Intel Processor Trace Trigger Tracing Clearwater Forest
Monitorless MWAIT Clearwater Forest
Intel® Advanced Performance Extensions (Intel® APX)6 Future processors
NOTES:
1. For Intel® product specifications, features and compatibility quick reference guide, and code name decoder, visit:
https://ark.intel.com/content/www/us/en/ark.html.
2. Details on Intel® Trust Domain Extensions can be found here:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html.
3. Details on Supervisor Memory Protection Keys (PKS) can be found in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.
4. Details on Intel® Advanced Vector Extensions 10 can be found here: https://cdrdv2.intel.com/v1/dl/getContent/784267.
5. Details on the LKGS (load into IA32_KERNEL_GS_BASE) instruction, NMI-source reporting, and Flexible Return and Event Delivery can
be found here: https://cdrdv2.intel.com/v1/dl/getContent/795033.
6. Details on Intel® Advanced Performance Extensions can be found here: https://cdrdv2.intel.com/v1/dl/getContent/784266
CPUID—CPU Identification
Opcode    Instruction    64-Bit Mode    Compat/Leg Mode    Description
0F A2    CPUID    Valid    Valid    Returns processor identification and feature information to the EAX, EBX, ECX, and EDX registers, as determined by input entered in EAX (in some cases, ECX as well).
Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can
set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction
operates the same in non-64-bit modes and 64-bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers.1 The
instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well).
For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value
and the Vendor Identification String in the appropriate registers:
MOV EAX, 00H
CPUID
1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
2. CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of CPUID leaf 1FH
before using leaf 0BH.
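As an informal companion to the ID-flag test described above (a sketch only, not part of this specification), the following C program uses the __readeflags() and __writeeflags() intrinsics provided by GCC and Clang to check whether software can toggle EFLAGS.ID; on any processor capable of 64-bit mode the test reports that CPUID is supported.

#include <stdio.h>
#include <x86intrin.h>   /* __readeflags()/__writeeflags(), provided by GCC and Clang */

#define EFLAGS_ID (1ULL << 21)   /* ID flag, bit 21 of EFLAGS */

/* Returns nonzero if software can toggle EFLAGS.ID, i.e., if CPUID is supported. */
static int cpuid_supported(void)
{
    unsigned long long before = __readeflags();
    __writeeflags(before ^ EFLAGS_ID);   /* attempt to flip the ID flag      */
    unsigned long long after = __readeflags();
    __writeeflags(before);               /* restore the original flags value */
    return ((before ^ after) & EFLAGS_ID) != 0;
}

int main(void)
{
    printf("CPUID is %ssupported on this processor.\n", cpuid_supported() ? "" : "not ");
    return 0;
}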
Processor Extended State Enumeration Main Leaf (Initial EAX Value = 0DH, ECX = 0)
EAX Bits 31-00: Reports the valid bit fields of the lower 32 bits of the XFEATURE_ENABLED_MASK register. If a bit is 0, the corresponding bit field in XCR0 is reserved.
Bit 00: x87 state.
Bit 01: SSE state.
Bit 02: AVX state.
Bits 04-03: MPX state
Bit 07-05: AVX-512 state.
Bit 08: Used for IA32_XSS.
Bit 09: PKRU state.
Bits 16-10: Used for IA32_XSS.
Bit 17: TILECFG state.
Bit 18: TILEDATA state.
Bits 31-19: Reserved.
EBX Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by
enabled features in XCR0. May be different than ECX if some features at the end of the XSAVE save
area are not enabled.
ECX Bit 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) of the
XSAVE/XRSTOR save area required by all supported features in the processor, i.e., all the valid bit
fields in XCR0.
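A small sketch of how this sub-leaf (CPUID.(EAX=0DH, ECX=0)) might be read follows. It is illustrative only, not part of this specification, and relies on the __cpuid_count helper from the GCC/Clang <cpuid.h> header.

#include <cpuid.h>   /* __cpuid_count, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=0DH, ECX=0): XCR0-supported state components and XSAVE area sizes. */
    __cpuid_count(0x0D, 0, eax, ebx, ecx, edx);
    (void)edx;

    printf("Valid XCR0 bit fields (low 32 bits): 0x%08X\n", eax);
    printf("  x87 state (bit 0):       %u\n", (eax >> 0) & 1);
    printf("  SSE state (bit 1):       %u\n", (eax >> 1) & 1);
    printf("  AVX state (bit 2):       %u\n", (eax >> 2) & 1);
    printf("  TILECFG state (bit 17):  %u\n", (eax >> 17) & 1);
    printf("  TILEDATA state (bit 18): %u\n", (eax >> 18) & 1);
    printf("XSAVE area size for features enabled in XCR0: %u bytes\n", ebx);
    printf("XSAVE area size for all supported features:   %u bytes\n", ecx);
    return 0;
}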
EAX Bits 04-00: Length of the capacity bit mask for the corresponding ResID. Add one to the return value
to get the result.
Bits 31-05: Reserved.
EBX Bits 31-00: Bit-granular map of isolation/contention of allocation units.
ECX Bit 00: Reserved.
Bit 01: If 1, indicates L3 CAT for non-CPU agents is supported.
Bit 02: If 1, indicates L3 Code and Data Prioritization Technology is supported.
Bit 03: If 1, indicates non-contiguous capacity bitmask is supported. The bits that are set in the vari-
ous IA32_L3_MASK_n registers do not have to be contiguous.
Bits 31-04: Reserved.
EDX Bits 15-00: Highest COS number supported for this ResID.
Bits 31-16: Reserved.
L2 Cache Intel® RDT Allocation Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = ResID = 2)
10H NOTE:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 04-00: Length of the capacity bit mask for the corresponding ResID. Add one to the return value
to get the result.
Bits 31-05: Reserved.
EBX Bits 31-00: Bit-granular map of isolation/contention of allocation units.
ECX Bits 01-00: Reserved.
Bit 02: CDP. If 1, indicates L2 Code and Data Prioritization Technology is supported.
Bit 03: If 1, indicates non-contiguous capacity bitmask is supported. The bits that are set in the vari-
ous IA32_L2_MASK_n registers do not have to be contiguous.
Bits 31-04: Reserved.
EDX Bits 15-00: Highest COS number supported for this ResID.
Bits 31-16: Reserved.
Memory Bandwidth Allocation Enumeration Sub-leaf (Initial EAX Value = 10H, ECX = ResID = 3)
10H NOTE:
Leaf 10H output depends on the initial value in ECX.
EAX Bits 11-00: Reports the maximum MBA throttling value supported for the corresponding ResID. Add
one to the return value to get the result.
Bits 31-12: Reserved.
EBX Bits 31-00: Reserved.
EAX Bits 07-00: Reports the maximum core throttling level supported for the corresponding ResID. Add
one to the return value to get the number of throttling levels supported.
Bits 11-08: If 1, indicates the logical processor scope of the IA32_QoS_Core_BW_Thrtl_n MSRs.
Other values are reserved.
Bits 31-12: Reserved.
EBX Bits 31-00: Reserved.
ECX Bits 02-00: Reserved.
Bit 03: If 1, the response of the bandwidth control is approximately linear. If 0, the response of the
bandwidth control is non-linear.
Bits 31-04: Reserved.
EDX Bits 15-00: Highest Class of Service (COS) number supported for this ResID.
Bits 31-16: Reserved.
Intel® Software Guard Extensions Capability Enumeration Leaf, Sub-leaf 0 (Initial EAX Value = 12H, ECX = 0)
12H NOTE:
Leaf 12H sub-leaf 0 (ECX = 0) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
EAX Bit 00: SGX1. If 1, indicates Intel SGX supports the collection of SGX1 leaf functions.
Bit 01: SGX2. If 1, indicates Intel SGX supports the collection of SGX2 leaf functions.
Bits 04-02: Reserved.
Bit 05: If 1, indicates Intel SGX supports ENCLV instruction leaves EINCVIRTCHILD, EDECVIRTCHILD,
and ESETCONTEXT.
Bit 06: If 1, indicates Intel SGX supports ENCLS instruction leaves ETRACKC, ERDINFO, ELDBC, and
ELDUC.
Bit 07: If 1, indicates Intel SGX supports ENCLU instruction leaf EVERIFYREPORT2.
Bits 09-08: Reserved.
Bit 10: If 1, indicates Intel SGX supports ENCLS instruction leaf EUPDATESVN.
Bit 11: If 1, indicates Intel SGX supports ENCLU instruction leaf EDECCSSA.
Bits 31-12: Reserved.
EBX Bits 31-00: MISCSELECT. Bit vector of supported extended Intel SGX features.
ECX Bits 31-00: Reserved.
EDX Bits 07-00: MaxEnclaveSize_Not64. The maximum supported enclave size in non-64-bit mode is
2^(EDX[7:0]).
Bits 15-08: MaxEnclaveSize_64. The maximum supported enclave size in 64-bit mode is
2^(EDX[15:8]).
Bits 31-16: Reserved.
Intel® SGX Attributes Enumeration Leaf, Sub-leaf 1 (Initial EAX Value = 12H, ECX = 1)
12H NOTE:
Leaf 12H sub-leaf 1 (ECX = 1) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.
EAX Bit 31-00: Reports the valid bits of SECS.ATTRIBUTES[31:0] that software can set with ECREATE.
EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section.
EBX[31:20]: Reserved.
EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor
Reserved Memory.
EDX[31:20]: Reserved.
Intel® Processor Trace Enumeration Main Leaf (Initial EAX Value = 14H, ECX = 0)
14H NOTE:
Leaf 14H main leaf (ECX = 0).
EAX Bits 31-00: Reports the maximum sub-leaf supported in leaf 14H.
While a processor may support the Processor Frequency Information leaf, fields that return a
value of zero are not supported.
System-On-Chip Vendor Attribute Enumeration Main Leaf (Initial EAX Value = 17H, ECX = 0)
17H NOTES:
Leaf 17H main leaf (ECX = 0).
Leaf 17H output depends on the initial value in ECX.
Leaf 17H sub-leaves 1 through 3 report the SOC Vendor Brand String.
Leaf 17H is valid if MaxSOCID_Index >= 3.
Leaf 17H sub-leaves 4 and above are reserved.
EAX Bits 31-00: MaxSOCID_Index. Reports the maximum input value of supported sub-leaf in leaf 17H.
EBX Bits 15-00: SOC Vendor ID.
Bit 16: IsVendorScheme. If 1, the SOC Vendor ID field is assigned via an industry standard
enumeration scheme. Otherwise, the SOC Vendor ID field is assigned by Intel.
Bits 31-17: Reserved = 0.
ECX Bits 31-00: Project ID. A unique number an SOC vendor assigns to its SOC projects.
EDX Bits 31-00: Stepping ID. A unique number within an SOC project that an SOC vendor assigns.
NOTE:
* The core type may only be used as an identification of the microarchitecture for this logical proces-
sor and its numeric value has no significance, neither large nor small. This field neither implies nor
expresses any other attribute to this logical processor and software should not assume any.
** CPUID leaf 04H provides details of deterministic cache parameters, including the L2 cache in sub-
leaf 2.
80000007H EAX Reserved = 0.
EBX Reserved = 0.
ECX Reserved = 0.
INPUT EAX = 0H: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification
String
When CPUID executes with EAX set to 0H, the processor returns the highest value the CPUID recognizes for
returning basic processor information. The value is returned in the EAX register and is processor specific.
A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “GenuineIntel” and is expressed:
EBX := 756e6547h (* “Genu”, with G in the low 4 bits of BL *)
EDX := 49656e69h (* “ineI”, with i in the low 4 bits of DL *)
ECX := 6c65746eh (* “ntel”, with n in the low 4 bits of CL *)
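A minimal C sketch of this query is shown below (not part of this specification). It uses the __cpuid helper from the GCC/Clang <cpuid.h> header and reassembles the 12-byte string in the EBX, EDX, ECX order given above.

#include <cpuid.h>   /* __cpuid, provided by GCC and Clang */
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    /* CPUID.(EAX=0): EAX = highest basic leaf; EBX/EDX/ECX = vendor ID string. */
    __cpuid(0, eax, ebx, ecx, edx);

    /* The 12-byte string is returned in the register order EBX, EDX, ECX. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    printf("Highest basic CPUID leaf: %u\n", eax);
    printf("Vendor identification string: %s\n", vendor);   /* "GenuineIntel" on Intel processors */
    return 0;
}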
INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information
When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recognizes for
returning extended processor information. The value is returned in the EAX register and is processor specific.
Figure 1-1. Version Information Returned by CPUID in EAX: Stepping ID in bits 3:0, Model in bits 7:4, Family ID in bits 11:8, Processor Type in bits 13:12, Extended Model ID in bits 19:16, and Extended Family ID in bits 27:20; bits 15:14 and 31:28 are reserved.
NOTE
See "Caching Translation Information" in Chapter 4, “Paging,” in the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 3A, and Chapter 20 in the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 1, for information on identifying earlier IA-32
processors.
The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display
using the following rule:
IF Family_ID ≠ 0FH
THEN Displayed_Family = Family_ID;
ELSE Displayed_Family = Extended_Family_ID + Family_ID;
FI;
(* Show Display_Family as HEX field. *)
The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a
display using the following rule:
IF (Family_ID = 06H or Family_ID = 0FH)
THEN Displayed_Model = (Extended_Model_ID << 4) + Model_ID;
ELSE Displayed_Model = Model_ID;
FI;
(* Show Display_Model as HEX field. *)
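Combining the two rules above, a C sketch of the DisplayFamily_DisplayModel computation might look as follows. This is illustrative only, not part of this specification; it uses the __cpuid helper from the GCC/Clang <cpuid.h> header and the EAX field positions listed in the Operation section later in this chapter.

#include <cpuid.h>   /* __cpuid, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid(1, eax, ebx, ecx, edx);
    (void)ebx; (void)ecx; (void)edx;

    unsigned int stepping   = eax & 0xF;          /* EAX[3:0]   */
    unsigned int model      = (eax >> 4) & 0xF;   /* EAX[7:4]   */
    unsigned int family     = (eax >> 8) & 0xF;   /* EAX[11:8]  */
    unsigned int ext_model  = (eax >> 16) & 0xF;  /* EAX[19:16] */
    unsigned int ext_family = (eax >> 20) & 0xFF; /* EAX[27:20] */

    /* Integrate the fields exactly as described by the rules above. */
    unsigned int display_family = (family != 0xF) ? family : (ext_family + family);
    unsigned int display_model  = (family == 0x6 || family == 0xF)
                                  ? (ext_model << 4) + model
                                  : model;

    printf("DisplayFamily_DisplayModel: %02X_%02XH (stepping %u)\n",
           display_family, display_model, stepping);
    return 0;
}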
• Brand index (low byte of EBX) — this number provides an entry into a brand string table that contains brand
strings for IA-32 processors. More information about this field is provided later in this section.
• CLFLUSH instruction cache line size (second byte of EBX) — this number indicates the size of the cache line
flushed with CLFLUSH instruction in 8-byte increments. This field was introduced in the Pentium 4 processor.
• Local APIC ID (high byte of EBX) — this number is the 8-bit ID that is assigned to the local APIC on the
processor during power up. This field was introduced in the Pentium 4 processor.
NOTE
Software must confirm that a processor feature is present using feature flags returned by CPUID
prior to using the feature. Software should not depend on future offerings retaining all features.
Figure 1-2. Feature Information Returned in the ECX Register:
Bit 0: SSE3 — SSE3 Extensions
Bit 1: PCLMULQDQ — Carryless Multiplication
Bit 2: DTES64 — 64-bit DS Area
Bit 3: MONITOR — MONITOR/MWAIT
Bit 4: DS-CPL — CPL Qualified Debug Store
Bit 5: VMX — Virtual Machine Extensions
Bit 6: SMX — Safer Mode Extensions
Bit 7: EST — Enhanced Intel SpeedStep® Technology
Bit 8: TM2 — Thermal Monitor 2
Bit 9: SSSE3 — SSSE3 Extensions
Bit 10: CNXT-ID — L1 Context ID
Bit 11: SDBG
Bit 12: FMA — Fused Multiply Add
Bit 13: CMPXCHG16B
Bit 14: xTPR Update Control
Bit 15: PDCM — Perf/Debug Capability MSR
Bit 16: Reserved
Bit 17: PCID — Process-context Identifiers
Bit 18: DCA — Direct Cache Access
Bit 19: SSE4_1 — SSE4.1
Bit 20: SSE4_2 — SSE4.2
Bit 21: x2APIC
Bit 22: MOVBE
Bit 23: POPCNT
Bit 24: TSC-Deadline
Bit 25: AES
Bit 26: XSAVE
Bit 27: OSXSAVE
Bit 28: AVX
Bit 29: F16C
Bit 30: RDRAND
Bit 31: 0 (always returns 0)
Figure 1-3. Feature Information Returned in the EDX Register. (See Table 1-6 for the bit assignments; unlisted bits are reserved.)
Table 1-6. More on Feature Information Returned in the EDX Register (Continued)
Bit # Mnemonic Description
3 PSE Page Size Extension. Large pages of size 4 MByte are supported, including CR4.PSE for controlling the
feature, the defined dirty bit in PDE (Page Directory Entries), optional reserved bit trapping in CR3, PDEs,
and PTEs.
4 TSC Time Stamp Counter. The RDTSC instruction is supported, including CR4.TSD for controlling privilege.
5 MSR Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR and WRMSR instructions are
supported. Some of the MSRs are implementation dependent.
6 PAE Physical Address Extension. Physical addresses greater than 32 bits are supported: extended page table
entry formats, an extra level in the page translation tables is defined, 2-MByte pages are supported instead
of 4 Mbyte pages if PAE bit is 1. The actual number of address bits beyond 32 is not defined, and is
implementation specific.
7 MCE Machine Check Exception. Exception 18 is defined for Machine Checks, including CR4.MCE for controlling
the feature. This feature does not define the model-specific implementations of machine-check error
logging, reporting, and processor shutdowns. Machine Check exception handlers may have to depend on
processor version to do model specific processing of the exception, or test for the presence of the Machine
Check feature.
8 CX8 CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly
locked and atomic).
9 APIC APIC On-Chip. The processor contains an Advanced Programmable Interrupt Controller (APIC), responding
to memory mapped commands in the physical address range FFFE0000H to FFFE0FFFH (by default - some
processors permit the APIC to be relocated).
10 Reserved Reserved.
11 SEP SYSENTER and SYSEXIT Instructions. The SYSENTER and SYSEXIT and associated MSRs are supported.
12 MTRR Memory Type Range Registers. MTRRs are supported. The MTRRcap MSR contains feature bits that
describe what memory types are supported, how many variable MTRRs are supported, and whether fixed
MTRRs are supported.
13 PGE Page Global Bit. The global bit is supported in paging-structure entries that map a page, indicating TLB
entries that are common to different processes and need not be flushed. The CR4.PGE bit controls this
feature.
14 MCA Machine Check Architecture. The Machine Check Architecture, which provides a compatible mechanism for
error reporting in P6 family, Pentium 4, Intel Xeon processors, and future processors, is supported. The
MCG_CAP MSR contains feature bits describing how many banks of error reporting MSRs are supported.
15 CMOV Conditional Move Instructions. The conditional move instruction CMOV is supported. In addition, if x87
FPU is present as indicated by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions are
supported
16 PAT Page Attribute Table. Page Attribute Table is supported. This feature augments the Memory Type Range
Registers (MTRRs), allowing an operating system to specify attributes of memory accessed through a linear
address on a 4KB granularity.
17 PSE-36 36-Bit Page Size Extension. 4-MByte pages addressing physical memory beyond 4 GBytes are supported
with 32-bit paging. This feature indicates that upper bits of the physical address of a 4-MByte page are
encoded in bits 20:13 of the page-directory entry. Such physical addresses are limited by MAXPHYADDR
and may be up to 40 bits in size.
18 PSN Processor Serial Number. The processor supports the 96-bit processor identification number feature and
the feature is enabled.
19 CLFSH CLFLUSH Instruction. CLFLUSH Instruction is supported.
20 Reserved Reserved.
21 DS Debug Store. The processor supports the ability to write debug information into a memory resident buffer.
This feature is used by the branch trace store (BTS) and precise event-based sampling (PEBS) facilities (see
Chapter 24, “Introduction to Virtual-Machine Extensions,” in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3C).
22 ACPI Thermal Monitor and Software Controlled Clock Facilities. The processor implements internal MSRs that
allow processor temperature to be monitored and processor performance to be modulated in predefined
duty cycles under software control.
23 MMX Intel MMX Technology. The processor supports the Intel MMX technology.
24 FXSR FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are supported for fast save
and restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available
for an operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.
25 SSE SSE. The processor supports the SSE extensions.
26 SSE2 SSE2. The processor supports the SSE2 extensions.
27 SS Self Snoop. The processor supports the management of conflicting memory types by performing a snoop
of its own cache structure for transactions issued to the bus.
28 HTT Max APIC IDs reserved field is Valid. A value of 0 for HTT indicates there is only a single logical processor
in the package and software should assume only a single APIC ID is reserved. A value of 1 for HTT indicates
the value in CPUID.1.EBX[23:16] (the Maximum number of addressable IDs for logical processors in this
package) is valid for the package.
29 TM Thermal Monitor. The processor implements the thermal monitor automatic thermal control circuitry (TCC).
30 Reserved Reserved.
31 PBE Pending Break Enable. The processor supports the use of the FERR#/PBE# pin when the processor is in
the stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the
processor should return to normal operation to handle the interrupt. Bit 10 (PBE enable) in the
IA32_MISC_ENABLE MSR enables this capability.
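Following the NOTE above about confirming features through CPUID before use, here is a minimal sketch (not part of this specification) that tests a few of the feature flags from Table 1-6 and Figure 1-2; it relies on the __cpuid helper from the GCC/Clang <cpuid.h> header.

#include <cpuid.h>   /* __cpuid, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid(1, eax, ebx, ecx, edx);
    (void)eax; (void)ebx;

    /* EDX bit positions are taken from Table 1-6; ECX bit positions from Figure 1-2. */
    printf("MMX     (EDX bit 23): %s\n", (edx & (1u << 23)) ? "supported" : "not supported");
    printf("SSE2    (EDX bit 26): %s\n", (edx & (1u << 26)) ? "supported" : "not supported");
    printf("AVX     (ECX bit 28): %s\n", (ecx & (1u << 28)) ? "supported" : "not supported");
    printf("OSXSAVE (ECX bit 27): %s\n", (ecx & (1u << 27)) ? "supported" : "not supported");
    return 0;
}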
INPUT EAX = 02H: Cache and TLB Information Returned in EAX, EBX, ECX, EDX
When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal caches
and TLBs in the EAX, EBX, ECX, and EDX registers.
The encoding is as follows:
• The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction
must be executed with an input value of 02H to get a complete description of the processor’s caches and TLBs.
The first member of the family of Pentium 4 processors will return a 01H.
• The most significant bit (bit 31) of each register indicates whether the register contains valid information (set
to 0) or is reserved (set to 1).
• If a register contains valid information, the information is contained in 1 byte descriptors. Table 1-7 shows the
encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX registers is not
defined; that is, specific bytes are not designated to contain descriptors for specific cache or TLB types. The
descriptors may appear in any order.
Table 1-7. Encoding of CPUID Leaf 2 Descriptors
Descriptor Value    Type    Cache or TLB Description
00H General Null descriptor, this byte contains no information.
01H TLB Instruction TLB: 4 KByte pages, 4-way set associative, 32 entries.
02H TLB Instruction TLB: 4 MByte pages, fully associative, 2 entries.
03H TLB Data TLB: 4 KByte pages, 4-way set associative, 64 entries.
For example, a processor might return the following register values when CPUID executes with an input value of 2:
EAX 66 5B 50 01H
EBX 0H
ECX 0H
EDX 00 7A 70 00H
Which means:
• The least-significant byte (byte 0) of register EAX is set to 01H. This indicates that CPUID needs to be executed
once with an input value of 2 to retrieve complete information about caches and TLBs.
• The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register
contains valid 1-byte descriptors.
• Bytes 1, 2, and 3 of register EAX indicate that the processor has:
— 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
— 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
— 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
• The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
• Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
— 00H - NULL descriptor.
— 70H - Trace cache: 12 K-μop, 8-way set associative.
— 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
— 00H - NULL descriptor.
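An interpretation like the one above can also be produced programmatically. The following is a minimal sketch (not part of this specification) using the __cpuid helper from the GCC/Clang <cpuid.h> header; it simply lists the non-null descriptor bytes, which would then be looked up in Table 1-7.

#include <cpuid.h>   /* __cpuid, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int regs[4];

    /* One execution with EAX = 2 is sufficient on recent processors;
     * AL reports how many executions are required (see the text above). */
    __cpuid(2, regs[0], regs[1], regs[2], regs[3]);

    unsigned int iterations = regs[0] & 0xFF;   /* least-significant byte of EAX */
    printf("Executions required: %u\n", iterations);

    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)              /* bit 31 set: register is reserved */
            continue;
        for (int b = (r == 0) ? 1 : 0; b < 4; b++) {   /* skip AL, the iteration count */
            unsigned char desc = (regs[r] >> (8 * b)) & 0xFF;
            if (desc != 0x00)                   /* 00H is the null descriptor */
                printf("Descriptor %02XH (look up in Table 1-7)\n", desc);
        }
    }
    return 0;
}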
INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data
that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid
index values start from 0.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an
index value of 0, until the parameters report the value associated with the cache type field is 0. The architecturally
defined fields reported by deterministic cache parameters are documented in Table 1-3.
The CPUID leaf 4 also reports data that can be used to derive the topology of processor cores in a physical package.
This information is constant for all valid index values. Software can query the raw data reported by executing
CPUID with EAX=04H and ECX=0H and use it as part of the topology enumeration algorithm described in Chapter
9, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3A.
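A sketch of this enumeration loop is shown below (not part of this specification). It relies on the __cpuid_count helper from the GCC/Clang <cpuid.h> header, and the field positions used are those documented for the deterministic cache parameters leaf in Table 1-3 (cache type in EAX[4:0], level in EAX[7:5], and the ways/partitions/line-size/sets fields in EBX and ECX, each reported minus one).

#include <cpuid.h>   /* __cpuid_count, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    /* Enumerate deterministic cache parameters: increment the ECX index
     * until the cache-type field (EAX bits 4:0) reads 0. */
    for (unsigned int index = 0; ; index++) {
        unsigned int eax, ebx, ecx, edx;
        __cpuid_count(4, index, eax, ebx, ecx, edx);
        (void)edx;

        unsigned int cache_type = eax & 0x1F;
        if (cache_type == 0)
            break;

        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int line_size  = (ebx & 0xFFF) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned int ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned int sets       = ecx + 1;

        printf("Index %u: L%u cache (type %u), %u bytes\n", index, level, cache_type,
               ways * partitions * line_size * sets);
    }
    return 0;
}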
INPUT EAX = 0FH: Returns Intel Resource Director Technology (Intel RDT) Monitoring Enumeration Information
When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector
representation of QoS monitoring resource types that are supported in the processor and maximum range of RMID
values the processor can use to monitor of any supported resource types. Each bit, starting from bit 1, corresponds
to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that soft-
ware must use to query QoS monitoring capability available for that type. See Table 1-3.
When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns infor-
mation software can use to program IA32_PQR_ASSOC, IA32_QM_EVTSEL MSRs before reading QoS data from the
IA32_QM_CTR MSR.
INPUT EAX = 10H: Returns Intel Resource Director Technology (Intel RDT) Allocation Enumeration Information
When CPUID executes with EAX set to 10H and ECX = 0, the processor returns information about the bit-vector
representation of QoS Enforcement resource types that are supported in the processor. Each bit, starting from bit
1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or
ResID) that software must use to query QoS enforcement capability available for that type. See Table 1-3.
When CPUID executes with EAX set to 10H and ECX = n (n >= 1, and is a valid ResID), the processor returns infor-
mation about available classes of service and range of QoS mask MSRs that software can use to configure each
class of services using capability bit masks in the QoS Mask registers, IA32_resourceType_Mask_n.
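The enumeration flow described above can be sketched as follows (illustrative only, not part of this specification). It uses the __cpuid_count helper from the GCC/Clang <cpuid.h> header and the sub-leaf field layouts shown earlier in Table 1-3.

#include <cpuid.h>   /* __cpuid_count, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=10H, ECX=0): EBX is a bit vector of supported allocation ResIDs,
     * with bit positions starting from bit 1, as described above. */
    __cpuid_count(0x10, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;

    for (unsigned int resid = 1; resid <= 3; resid++) {
        unsigned int a, b, c, d;

        if (!(ebx & (1u << resid)))
            continue;
        __cpuid_count(0x10, resid, a, b, c, d);
        (void)b; (void)c;

        if (resid == 1 || resid == 2) {
            /* L3/L2 CAT sub-leaves: EAX[4:0]+1 is the capacity bit mask length and
             * EDX[15:0] is the highest supported COS number. */
            printf("ResID %u: capacity bit mask length %u, highest COS %u\n",
                   resid, (a & 0x1F) + 1, d & 0xFFFF);
        } else {
            /* MBA sub-leaf (ResID 3): EAX[11:0]+1 is the maximum throttling value. */
            printf("ResID 3: maximum MBA throttling value %u, highest COS %u\n",
                   (a & 0xFFF) + 1, d & 0xFFFF);
        }
    }
    return 0;
}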
When CPUID executes with EAX set to 12H and ECX = 1H, the processor returns information about Intel SGX attri-
butes. See Table 1-3.
When CPUID executes with EAX set to 12H and ECX = n (n > 1), the processor returns information about Intel SGX
Enclave Page Cache. See Table 1-3.
INPUT EAX = 15H: Returns Time Stamp Counter and Nominal Core Crystal Clock Information
When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp
Counter and Core Crystal Clock. See Table 1-3.
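As an illustrative sketch only (not part of this specification), the following derives the nominal TSC frequency from this leaf, using the documented field meanings: EAX and EBX give the TSC to core-crystal-clock ratio as denominator and numerator, and ECX gives the nominal crystal frequency in Hz when it is enumerated. It relies on the __cpuid_count helper from the GCC/Clang <cpuid.h> header.

#include <cpuid.h>   /* __cpuid_count, provided by GCC and Clang */
#include <stdio.h>

int main(void)
{
    unsigned int denominator, numerator, crystal_hz, edx;

    /* CPUID.(EAX=15H, ECX=0): EAX/EBX give the TSC-to-crystal-clock ratio and
     * ECX gives the nominal core crystal clock in Hz (0 if not enumerated). */
    __cpuid_count(0x15, 0, denominator, numerator, crystal_hz, edx);
    (void)edx;

    if (denominator == 0 || numerator == 0 || crystal_hz == 0) {
        printf("TSC/crystal ratio or crystal frequency is not enumerated here.\n");
        return 0;
    }

    /* Nominal TSC frequency = crystal frequency * (numerator / denominator). */
    unsigned long long tsc_hz = (unsigned long long)crystal_hz * numerator / denominator;
    printf("Nominal TSC frequency: %llu Hz\n", tsc_hz);
    return 0;
}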
INPUT EAX = 24H: Returns Intel AVX10 Converged Vector ISA Information
When CPUID executes with EAX set to 24H, the processor returns Intel AVX10 converged vector ISA information.
See Table 1-3.
Figure: algorithm for extracting the maximum qualified frequency from the processor brand string.
1. Execute CPUID with EAX = 0x80000000 to verify that the extended brand-string functions are supported; if they are not, report an error.
2. Scan the brand string in reverse for the substring "zHM", "zHG", or "zHT" (the reversed forms of "MHz", "GHz", and "THz"); if no substring matches, report an error.
3. Determine the multiplier from the match: 1 x 10^6 for "zHM", 1 x 10^9 for "zHG", or 1 x 10^12 for "zHT".
4. Scan the digits that precede the matched substring until a blank is reached, reverse them, and convert them to a decimal value "Freq" (e.g., the reversed digits "ZY.X" become X.YZ).
5. Maximum Qualified Frequency = "Freq" x "Multiplier".
NOTE
When a frequency is given in a brand string, it is the maximum qualified frequency of the processor,
not the frequency at which the processor is currently running.
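A C rendering of the algorithm in the figure above is sketched below (not part of this specification). It scans the brand string forward for "MHz"/"GHz"/"THz", which is equivalent to the reverse scan for "zHM"/"zHG"/"zHT" shown in the figure, and relies on the __cpuid helper from the GCC/Clang <cpuid.h> header.

#include <cpuid.h>   /* __cpuid, provided by GCC and Clang */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    unsigned int max_ext, b, c, d;
    unsigned int regs[12];
    char brand[49] = {0};

    /* Step 1: confirm that the extended brand-string leaves are supported. */
    __cpuid(0x80000000, max_ext, b, c, d);
    if (max_ext < 0x80000004) {
        printf("Processor brand string is not supported.\n");
        return 0;
    }

    /* The 48-byte brand string is returned by leaves 80000002H-80000004H. */
    for (unsigned int i = 0; i < 3; i++)
        __cpuid(0x80000002 + i, regs[4 * i], regs[4 * i + 1], regs[4 * i + 2], regs[4 * i + 3]);
    memcpy(brand, regs, 48);

    /* Steps 2 and 3: locate the frequency substring and pick the multiplier. */
    double multiplier = 0.0;
    const char *unit;
    if ((unit = strstr(brand, "MHz")) != NULL)      multiplier = 1e6;
    else if ((unit = strstr(brand, "GHz")) != NULL) multiplier = 1e9;
    else if ((unit = strstr(brand, "THz")) != NULL) multiplier = 1e12;
    if (unit == NULL) {
        printf("Brand string \"%s\" does not contain a frequency.\n", brand);
        return 1;
    }

    /* Step 4: walk back over the digits (and decimal point) that precede the unit. */
    const char *start = unit;
    while (start > brand && (isdigit((unsigned char)start[-1]) || start[-1] == '.'))
        start--;
    double freq = strtod(start, NULL);

    /* Step 5: maximum qualified frequency = Freq x Multiplier. */
    printf("Brand string: %s\n", brand);
    printf("Maximum qualified frequency: %.0f Hz\n", freq * multiplier);
    return 0;
}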
Table 1-9. Mapping of Brand Indices; and Intel 64 and IA-32 Processor Brand Strings
Brand Index Brand String
00H This processor does not support the brand identification feature
01H Intel(R) Celeron(R) processor1
02H Intel(R) Pentium(R) III processor1
03H Intel(R) Pentium(R) III Xeon(R) processor; If processor signature = 000006B1h, then Intel(R) Celeron(R)
processor
04H Intel(R) Pentium(R) III processor
06H Mobile Intel(R) Pentium(R) III processor-M
07H Mobile Intel(R) Celeron(R) processor1
08H Intel(R) Pentium(R) 4 processor
09H Intel(R) Pentium(R) 4 processor
0AH Intel(R) Celeron(R) processor1
0BH Intel(R) Xeon(R) processor; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor MP
0CH Intel(R) Xeon(R) processor MP
0EH Mobile Intel(R) Pentium(R) 4 processor-M; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor
0FH Mobile Intel(R) Celeron(R) processor1
11H Mobile Genuine Intel(R) processor
12H Intel(R) Celeron(R) M processor
13H Mobile Intel(R) Celeron(R) processor1
14H Intel(R) Celeron(R) processor
15H Mobile Genuine Intel(R) processor
16H Intel(R) Pentium(R) M processor
17H Mobile Intel(R) Celeron(R) processor1
18H – 0FFH RESERVED
NOTES:
1. Indicates versions of these processors that were introduced after the Pentium III.
Operation
CASE (EAX) OF
EAX = 0:
EAX := Highest basic function input value understood by CPUID;
EBX := Vendor identification string;
EDX := Vendor identification string;
ECX := Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] := Stepping ID;
EAX[7:4] := Model;
EAX[11:8] := Family;
EAX[13:12] := Processor type;
EAX[15:14] := Reserved;
EAX[19:16] := Extended Model;
EAX[27:20] := Extended Family;
EAX[31:28] := Reserved;
EBX[7:0] := Brand Index; (* Reserved if the value is zero. *)
EBX[15:8] := CLFLUSH Line Size;
EBX[16:23] := Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
EBX[24:31] := Initial APIC ID;
ECX := Feature flags; (* See Figure 1-2. *)
EDX := Feature flags; (* See Figure 1-3. *)
BREAK;
EAX = 2H:
EAX := Cache and TLB information;
EBX := Cache and TLB information;
ECX := Cache and TLB information;
EDX := Cache and TLB information;
BREAK;
EAX = 3H:
EAX := Reserved;
EBX := Reserved;
ECX := ProcessorSerialNumber[31:0];
(* Pentium III processors only, otherwise reserved. *)
EDX := ProcessorSerialNumber[63:32];
(* Pentium III processors only, otherwise reserved. *)
BREAK;
EAX = 4H:
EAX := Deterministic Cache Parameters Leaf; (* See Table 1-3. *)
EBX := Deterministic Cache Parameters Leaf;
ECX := Deterministic Cache Parameters Leaf;
EDX := Deterministic Cache Parameters Leaf;
BREAK;
EAX = 5H:
EAX := MONITOR/MWAIT Leaf; (* See Table 1-3. *)
EBX := MONITOR/MWAIT Leaf;
ECX := MONITOR/MWAIT Leaf;
EDX := MONITOR/MWAIT Leaf;
BREAK;
EAX = 6H:
EAX := Thermal and Power Management Leaf; (* See Table 1-3. *)
EBX := Thermal and Power Management Leaf;
ECX := Thermal and Power Management Leaf;
EDX := Thermal and Power Management Leaf;
BREAK;
EAX = 7H:
EAX := Structured Extended Feature Leaf; (* See Table 1-3. *);
EBX := Structured Extended Feature Leaf;
ECX := Structured Extended Feature Leaf;
EDX := Structured Extended Feature Leaf;
BREAK;
EAX = 8H:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = 9H:
EAX := Direct Cache Access Information Leaf; (* See Table 1-3. *)
EBX := Direct Cache Access Information Leaf;
ECX := Direct Cache Access Information Leaf;
EDX := Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
EAX := Architectural Performance Monitoring Leaf; (* See Table 1-3. *)
EBX := Architectural Performance Monitoring Leaf;
ECX := Architectural Performance Monitoring Leaf;
EDX := Architectural Performance Monitoring Leaf;
BREAK
EAX = BH:
EAX := Extended Topology Enumeration Leaf; (* See Table 1-3. *)
EBX := Extended Topology Enumeration Leaf;
ECX := Extended Topology Enumeration Leaf;
EDX := Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = DH:
EAX := Processor Extended State Enumeration Leaf; (* See Table 1-3. *)
EBX := Processor Extended State Enumeration Leaf;
ECX := Processor Extended State Enumeration Leaf;
EDX := Processor Extended State Enumeration Leaf;
BREAK;
EAX = EH:
EAX := Reserved = 0;
EBX := Reserved = 0;
ECX := Reserved = 0;
EDX := Reserved = 0;
BREAK;
EAX = FH:
EAX := Platform Quality of Service Monitoring Enumeration Leaf; (* See Table 1-3. *)
EBX := Platform Quality of Service Monitoring Enumeration Leaf;
ECX := Platform Quality of Service Monitoring Enumeration Leaf;
EDX := Platform Quality of Service Monitoring Enumeration Leaf;
BREAK;
EAX = 10H:
EAX := Platform Quality of Service Enforcement Enumeration Leaf; (* See Table 1-3. *)
EBX := Platform Quality of Service Enforcement Enumeration Leaf;
ECX := Platform Quality of Service Enforcement Enumeration Leaf;
EDX := Platform Quality of Service Enforcement Enumeration Leaf;
BREAK;
EAX = 12H:
EAX := Intel SGX Enumeration Leaf; (* See Table 1-3. *)
EBX := Intel SGX Enumeration Leaf;
Flags Affected
None.
In earlier IA-32 processors that do not support the CPUID instruction, execution of the instruction results in an
invalid opcode (#UD) exception being generated.§
Table 1-11. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast
TupleType | InputSize | EVEX.W | N (VL=128) | N (VL=256) | N (VL=512) | Comment
Full Mem | N/A | N/A | 16 | 32 | 64 | Load/store or subDword full vector
Tuple1 Scalar | 8bit | N/A | 1 | 1 | 1 | 1 Tuple
Tuple1 Scalar | 16bit | N/A | 2 | 2 | 2 |
Tuple1 Scalar | 32bit | 0 | 4 | 4 | 4 |
Tuple1 Scalar | 64bit | 1 | 8 | 8 | 8 |
Tuple1 Fixed | 32bit | N/A | 4 | 4 | 4 | 1 Tuple, memsize not affected by EVEX.W
Tuple1 Fixed | 64bit | N/A | 8 | 8 | 8 |
Tuple1_4X | 32bit | 0 | 16 (1) | N/A | 16 | 4FMA(PS)
Tuple2 | 32bit | 0 | 8 | 8 | 8 | Broadcast (2 elements)
Tuple2 | 64bit | 1 | N/A | 16 | 16 |
Tuple4 | 32bit | 0 | N/A | 16 | 16 | Broadcast (4 elements)
Tuple4 | 64bit | 1 | N/A | N/A | 32 |
Tuple8 | 32bit | 0 | N/A | N/A | 32 | Broadcast (8 elements)
Half Mem | N/A | N/A | 8 | 16 | 32 | SubQword Conversion
Quarter Mem | N/A | N/A | 4 | 8 | 16 | SubDword Conversion
Eighth Mem | N/A | N/A | 2 | 4 | 8 | SubWord Conversion
Mem128 | N/A | N/A | 16 | 16 | 16 | Shift count from memory
MOVDDUP | N/A | N/A | 8 | 32 | 64 | VMOVDDUP
NOTES:
1. Scalar.
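As a worked example of the compressed displacement encoding summarized above, the 8-bit displacement stored in the instruction is scaled by the tuple-dependent N, so a Full Mem tuple at VL=512 (N=64) with an encoded disp8 of 2 addresses an effective offset of 128 bytes. A minimal C sketch of that scaling (the function name is illustrative):
#include <stdint.h>

/* Effective displacement of an EVEX memory operand using the compressed
 * disp8*N form: the signed 8-bit displacement is scaled by N, where N is
 * determined by the instruction's tuple type and vector length (see table).
 */
int64_t evex_effective_disp(int8_t disp8, unsigned n)
{
    return (int64_t)disp8 * (int64_t)n;   /* e.g., 2 * 64 = 128 for Full Mem at VL=512 */
}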
CHAPTER 2
INSTRUCTION SET REFERENCE, A-Z
Instructions described in this document follow the general documentation convention established in the Intel® 64
and IA-32 Architectures Software Developer’s Manual Volume 2A. Additionally, some instructions use notation
conventions as described below.
In the instruction encoding, the MODRM byte is represented several ways depending on the role it plays. The
MODRM byte has three fields: a 2-bit MODRM.MOD field, a 3-bit MODRM.REG field, and a 3-bit MODRM.RM field. When all
bits of the MODRM byte have fixed values for an instruction, the two-hex-nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the MODRM byte
must contain fixed values, those values are specified as follows:
• If only the MODRM.MOD field must be 0b11, and the MODRM.REG and MODRM.RM fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr corresponds to the 3 bits of the MODRM.REG field and the bbb corresponds to
the 3 bits of the MODRM.RM field.
• If the MODRM.MOD field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then the notation !(11) is used.
• If, for example, only the MODRM.REG field had a specific required value, e.g., 0b101, that would be denoted as
mm:101:bbb.
NOTE
Historically, the Intel® 64 and IA-32 Architectures Software Developer’s Manual only specified the
MODRM.REG field restrictions with the notation /0 ... /7 and did not specify restrictions on the
MODRM.MOD and MODRM.RM fields in the encoding boxes.
AADD—Atomically Add
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F38 FC !(11):rrr:bbb | AADD my, ry | A | V/V | RAO-INT | Atomically add my with ry and store the result in my.
Description
This instruction atomically adds the destination operand (first operand) and the source operand (second operand),
and then stores the result in the destination operand.
The destination operand is a memory location and the source operand is a register. In 64-bit mode, the instruc-
tion’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers
(R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. The destination operand must be
naturally aligned with respect to the data size, at a 4-byte boundary, or an 8-byte boundary if used with a REX.W
prefix in 64-bit mode.
This instruction requires that the destination operand has a write-back (WB) memory type and it is implemented
using the weakly-ordered memory consistency model of write combining (WC) memory type. Before the operation,
the cache line is written-back (if modified) and invalidated from the processor cache. When the operation
completes, the processor may optimize the cacheability of the destination address by writing the result only to
specific levels of the cache hierarchy. Because this instruction uses a weakly-ordered memory consistency model,
a fencing operation implemented with an LFENCE, SFENCE, or MFENCE instruction should be used in conjunction with
AADD if a stronger ordering is required. However, note that AADD is not ordered with respect to a younger LFENCE,
as this instruction is not loading data from memory into the processor.
Any attempt to execute the AADD instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
AADD dest, src
Flags Affected
None.
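The following is a minimal usage sketch of AADD from C. It assumes a compiler that exposes the RAO-INT intrinsics (for example, _aadd_i32 in recent GCC/Clang immintrin.h when compiling with RAO-INT support); that intrinsic name, the build flag shown, and the use of _mm_mfence to obtain stronger ordering are assumptions for illustration, not part of this specification. The same pattern applies to the AAND, AOR, and AXOR instructions described next.
/* Build (assumption): gcc -mraoint -O2 rao_demo.c */
#include <immintrin.h>
#include <stdio.h>

int counter = 0;   /* destination must be 4-byte aligned, in WB memory */

int main(void)
{
    /* Atomically add 5 to counter; AADD returns no value to the core. */
    _aadd_i32(&counter, 5);

    /* AADD is weakly ordered; if subsequent accesses must observe the
     * update in order, insert an explicit fence as described above. */
    _mm_mfence();

    printf("counter = %d\n", counter);
    return 0;
}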
AAND—Atomically AND
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F38 FC !(11):rrr:bbb | AAND my, ry | A | V/V | RAO-INT | Atomically AND my with ry and store the result in my.
Description
This instruction atomically performs a bitwise AND operation of the destination operand (first operand) and the
source operand (second operand), and then stores the result in the destination operand.
The destination operand is a memory location and the source operand is a register. In 64-bit mode, the instruc-
tion’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers
(R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. The destination operand must be
naturally aligned with respect to the data size, at a 4-byte boundary, or an 8-byte boundary if used with a REX.W
prefix in 64-bit mode.
This instruction requires that the destination operand has a write-back (WB) memory type and it is implemented
using the weakly-ordered memory consistency model of write combining (WC) memory type. Before the operation,
the cache line is written-back (if modified) and invalidated from the processor cache. When the operation
completes, the processor may optimize the cacheability of the destination address by writing the result only to
specific levels of the cache hierarchy. Because this instruction uses a weakly-ordered memory consistency model,
a fencing operation implemented with an LFENCE, SFENCE, or MFENCE instruction should be used in conjunction with
AAND if a stronger ordering is required. However, note that AAND is not ordered with respect to a younger LFENCE,
as this instruction is not loading data from memory into the processor.
Any attempt to execute the AAND instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
AAND dest, src
Flags Affected
None.
AOR—Atomically OR
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F38 FC !(11):rrr:bbb | AOR my, ry | A | V/V | RAO-INT | Atomically OR my with ry and store the result in my.
Description
This instruction atomically performs a bitwise OR operation of the destination operand (first operand) and the
source operand (second operand), and then stores the result in the destination operand.
The destination operand is a memory location and the source operand is a register. In 64-bit mode, the instruc-
tion’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers
(R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. The destination operand must be
naturally aligned with respect to the data size, at a 4-byte boundary, or an 8-byte boundary if used with a REX.W
prefix in 64-bit mode.
This instruction requires that the destination operand has a write-back (WB) memory type and it is implemented
using the weakly-ordered memory consistency model of write combining (WC) memory type. Before the operation,
the cache line is written-back (if modified) and invalidated from the processor cache. When the operation
completes, the processor may optimize the cacheability of the destination address by writing the result only to
specific levels of the cache hierarchy. Because this instruction uses a weakly-ordered memory consistency model,
a fencing operation implemented with an LFENCE, SFENCE, or MFENCE instruction should be used in conjunction with
AOR if a stronger ordering is required. However, note that AOR is not ordered with respect to a younger LFENCE,
as this instruction is not loading data from memory into the processor.
Any attempt to execute the AOR instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
AOR dest, src
Flags Affected
None.
AXOR—Atomically XOR
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F38 FC !(11):rrr:bbb | AXOR my, ry | A | V/V | RAO-INT | Atomically XOR my with ry and store the result in my.
Description
This instruction atomically performs a bitwise XOR operation of the destination operand (first operand) and the
source operand (second operand), and then stores the result in the destination operand.
The destination operand is a memory location and the source operand is a register. In 64-bit mode, the instruc-
tion’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers
(R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. The destination operand must be
naturally aligned with respect to the data size, at a 4-byte boundary, or an 8-byte boundary if used with a REX.W
prefix in 64-bit mode.
This instruction requires that the destination operand has a write-back (WB) memory type and it is implemented
using the weakly-ordered memory consistency model of write combining (WC) memory type. Before the operation,
the cache line is written-back (if modified) and invalidated from the processor cache. When the operation
completes, the processor may optimize the cacheability of the destination address by writing the result only to
specific levels of the cache hierarchy. Because this instruction uses a weakly-ordered memory consistency model,
a fencing operation implemented with an LFENCE, SFENCE, or MFENCE instruction should be used in conjunction with
AXOR if a stronger ordering is required. However, note that AXOR is not ordered with respect to a younger LFENCE,
as this instruction is not loading data from memory into the processor.
Any attempt to execute the AXOR instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
AXOR dest, src
Flags Affected
None.
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W0 E6 !(11):rrr:bbb | CMPBEXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If below or equal (CF=1 or ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E6 !(11):rrr:bbb | CMPBEXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If below or equal (CF=1 or ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E2 !(11):rrr:bbb | CMPBXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If below (CF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E2 !(11):rrr:bbb | CMPBXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If below (CF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EE !(11):rrr:bbb | CMPLEXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If less or equal (ZF=1 or SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EE !(11):rrr:bbb | CMPLEXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If less or equal (ZF=1 or SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EC !(11):rrr:bbb | CMPLXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If less (SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EC !(11):rrr:bbb | CMPLXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If less (SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E7 !(11):rrr:bbb | CMPNBEXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not below or equal (CF=0 and ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E3 !(11):rrr:bbb | CMPNBXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not below (CF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EF !(11):rrr:bbb | CMPNLEXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not less or equal (ZF=0 and SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EF !(11):rrr:bbb | CMPNLEXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not less or equal (ZF=0 and SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 ED !(11):rrr:bbb | CMPNLXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not less (SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 ED !(11):rrr:bbb | CMPNLXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not less (SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E1 !(11):rrr:bbb | CMPNOXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not overflow (OF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E1 !(11):rrr:bbb | CMPNOXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not overflow (OF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EB !(11):rrr:bbb | CMPNPXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not parity (PF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EB !(11):rrr:bbb | CMPNPXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not parity (PF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E9 !(11):rrr:bbb | CMPNSXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not sign (SF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E9 !(11):rrr:bbb | CMPNSXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not sign (SF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E5 !(11):rrr:bbb | CMPNZXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If not zero (ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E5 !(11):rrr:bbb | CMPNZXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If not zero (ZF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E0 !(11):rrr:bbb | CMPOXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If overflow (OF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E0 !(11):rrr:bbb | CMPOXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If overflow (OF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EA !(11):rrr:bbb | CMPPXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If parity (PF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EA !(11):rrr:bbb | CMPPXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If parity (PF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E8 !(11):rrr:bbb | CMPSXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If sign (SF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E8 !(11):rrr:bbb | CMPSXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If sign (SF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E4 !(11):rrr:bbb | CMPZXADD m32, r32, r32 | A | V/N.E. | CMPCCXADD | Compare value in r32 (second operand) with value in m32. If zero (ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E4 !(11):rrr:bbb | CMPZXADD m64, r64, r64 | A | V/N.E. | CMPCCXADD | Compare value in r64 (second operand) with value in m64. If zero (ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
Description
This instruction compares the value from memory with the value of the second operand. If the specified condition
is met, then the processor will add the third operand to the memory operand and write it into memory, else the
memory is unchanged by this instruction.
This instruction must have MODRM.MOD equal to 0, 1, or 2. The value 3 for MODRM.MOD is reserved and will cause
an invalid opcode exception (#UD).
The second operand is always updated with the original value of the memory operand. The EFLAGS conditions are
updated from the results of the comparison. The instruction uses an implicit lock. This instruction does not permit
the use of an explicit lock prefix.
Operation
CMPCCXADD srcdest1, srcdest2, src3
tmp1 := load lock srcdest1
tmp2 := tmp1 + src3
EFLAGS.CF,OF,SF,ZF,AF,PF := CMP tmp1, srcdest2
IF <condition>:
    srcdest1 := store unlock tmp2
ELSE:
    srcdest1 := store unlock tmp1
srcdest2 := tmp1
Flags Affected
The EFLAGS conditions are updated from the results of the comparison.
Exceptions
Protected and Compatibility, Virtual-8086, 64-bit, and Real modes: Exceptions Type 14; see Table 2-1.
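To make the data flow above concrete, here is an illustrative-only C model of a single CMPZXADD-style operation (condition ZF=1, i.e., the compared values are equal); the function name and types are assumptions. It is not atomic, whereas the real instruction performs the load, compare, conditional add, and store as one implicitly locked operation.
#include <stdbool.h>
#include <stdint.h>

/* Model of CMPZXADD m32, r32, r32 (condition: zero). In hardware this is a
 * single atomic, implicitly locked read-modify-write, and the arithmetic
 * flags are set from the comparison of the original memory value with
 * *cmp_inout.
 */
void cmpzxadd_model(int32_t *mem, int32_t *cmp_inout, int32_t addend)
{
    int32_t original = *mem;               /* tmp1 := load srcdest1   */
    int32_t summed   = original + addend;  /* tmp2 := tmp1 + src3     */
    bool    equal    = (original == *cmp_inout);

    *mem = equal ? summed : original;      /* conditional store back  */
    *cmp_inout = original;                 /* second operand := tmp1  */
}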
Description
The PBNDKB instruction allows software to bind information to a platform by encrypting it with a platform-specific
wrapping key. The encrypted data may later be used by the PCONFIG instruction to configure the total storage
encryption (TSE) engine.1
The instruction can be executed only in 64-bit mode. The registers RBX and RCX provide input information to the
instruction. Executions of PBNDKB may fail for platform-specific reasons. An execution reports failure by setting
the ZF flag and loading EAX with a non-zero failure reason; a successful execution clears ZF and EAX.
The instruction operates on 256-byte data structures called bind structures. It reads a bind structure at the linear
address in RBX and writes a modified bind structure to the linear address in RCX. The addresses in RBX and RCX
must be different from each other and must be 256-byte aligned.
The instruction encrypts a portion of the input bind structure and generates a MAC of parts of that structure. The
encrypted data and MAC are written out as part of the output bind structure.
The format of a bind structure is given in Table 2-1.
1. For details on Total Storage Encryption (TSE), see Chapter 11 of this document.
• BTDATA: This field contains additional control and data that are not encrypted. It has the following format:
— USER_SUPP_CHALLENGE (bytes 31:0): PBNDKB uses this value in the input bind structure to determine
the wrapping key (see below). It writes zero to this field in the output bind structure.
— KEY_GENERATION_CTRL (byte 32): PBNDKB uses this value in the input bind structure to determine
whether to randomize the keys being encrypted. The value must be 0 or 1 (otherwise, a #GP occurs).
— The remaining 95 bytes are reserved and must be zero.
PBNDKB determines a 256-bit wrapping key by computing an HMAC based on SHA-256 using a 256-bit platform-
specific key and the USER_SUPP_CHALLENGE in the BTDATA field in the input bind structure.
PBNDKB then uses the wrapping key and an AES GCM authenticated encryption function to encrypt BTENCDATA
and produce a MAC. The encryption function uses the following inputs:
• The 64-byte BTENCDATA to be encrypted (which may have been randomized; see above).
• The 256-bit wrapping key.
• The 96-bit IV randomly generated by PBNDKB.
• 176 bytes of additional authenticated data that are the concatenation of 8 bytes of zeroes, the IV, 28 bytes of
zeroes, and the BTDATA in the input bind structure.
• The length of the additional authenticated data (176).
The encryption function produces a structure with 64 bytes of encrypted data and a 16-byte MAC. PBNDKB saves
these values to the corresponding fields in its output bind structure. Other fields are copied from the input bind
structure or written as zero, except the IV (which receives the randomly generated value) and the
USER_SUPP_CHALLENGE in the BTDATA, which is written as zero.
Operation
(* #UD if PBNDKB is not enumerated, CPL > 0, or not in 64-bit mode*)
IF CPUID.(EAX=07H, ECX=01H):EBX.PBNDKB[bit 1] = 0 OR CPL > 0 OR not in 64-bit mode
THEN #UD; FI;
(* XOR the input keys with the random keys; this does not modify input bind structure in memory *)
TMP_BIND_STRUCT.BTENCDATA.DATA_KEY := RNG_DATA_KEY XOR TMP_BIND_STRUCT.BTENCDATA.DATA_KEY;
TMP_BIND_STRUCT.BTENCDATA.TWEAK_KEY := RNG_TWEAK_KEY XOR TMP_BIND_STRUCT.BTENCDATA.TWEAK_KEY;
FI;
(* Compose 176 bytes of additional authenticated data for use by authenticated decryption *)
AAD := Concatenation of bytes 63:16 and bytes 255:128 of TMP_BIND_STRUCT;
OUT_BIND_STRUCT.MAC := ENCRYPT_STRUCT.MAC;
OUT_BIND_STRUCT[bytes 23:16] := 0;
OUT_BIND_STRUCT.IV := TMP_IV;
OUT_BIND_STRUCT[bytes 63:36] := 0;
OUT_BIND_STRUCT.BTENCDATA := ENCRYPT_STRUCT.ENC_DATA;
OUT_BIND_STRUCT.BTDATA.USER_SUPP_CHALLENGE := 0;
OUT_BIND_STRUCT.BTDATA.KEY_GENERATION_CTRL := IN_BIND_STRUCT.BTDATA.KEY_GENERATION_CTRL;
OUT_BIND_STRUCT.BTDATA[bytes 127:33] := 0;
EXIT:
RFLAGS.CF := 0;
RFLAGS.PF := 0;
RFLAGS.AF := 0;
RFLAGS.OF := 0;
RFLAGS.SF := 0;
PCONFIG—Platform Configuration
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 01 C5 | PCONFIG | A | V/V | PCONFIG | This instruction is used to execute functions for configuring platform features.
Description
The PCONFIG instruction allows software to configure certain platform features. It supports these features with
multiple leaf functions, selecting a leaf function using the value in EAX.
Depending on the leaf function, the registers RBX, RCX, and RDX may be used to provide input information or for
the instruction to report output information. Addresses and operands are 32 bits outside 64-bit mode and are 64
bits in 64-bit mode. The value of CS.D does not affect operand size or address size.
Executions of PCONFIG may fail for platform-specific reasons. An execution reports failure by setting the ZF flag
and loading EAX with a non-zero failure reason; a successful execution clears ZF and EAX.
Each PCONFIG leaf function applies to a specific hardware block called a PCONFIG target. The leaf function is
supported only if the processor supports that target. Each target is associated with a numerical target identifier,
and CPUID leaf 1BH (PCONFIG information) enumerates the identifiers of the supported targets. An attempt to
execute an undefined leaf function, or a leaf function that applies to an unsupported target identifier, results in a
general-protection exception (#GP).
not change the state of the TLB caches or memory pipeline. Software is responsible for taking appropriate actions
to ensure correct behavior.
The key table used by TME-MK is shared by all logical processors in a platform. For this reason, execution of this
leaf function must gain exclusive access to the key table before updating it. The leaf function does this by acquiring
a lock (implemented in the platform) and retaining that lock until the execution completes. An execution of the leaf
function may fail to acquire the lock if it is already in use. In this situation, the leaf function will load EAX with failure
reason 5 (DEVICE_BUSY). When this happens, the key table is not updated, and software should retry execution of
PCONFIG.
• KEY_FIELD_2: If the direct key-programming command is used (TSE_SET_KEY_DIRECT), this field carries
the software supplied tweak key to be used for the KeyID. Otherwise, the field is ignored.
The TSE key table is shared by all logical processors in a platform. For this reason, execution of this leaf function
must gain exclusive access to the key table before updating it. The leaf function does this by acquiring a lock
(implemented in the platform) and retaining that lock until the execution completes. An execution of the leaf func-
tion may fail to acquire the lock if it is already in use. In this situation, the leaf function will load EAX with failure
reason 5 (DEVICE_BUSY). When this happens, the key table is not updated, and software should retry execution of
PCONFIG.
• IV: The initialization vector that PBNDKB used for encryption. The PCONFIG leaf function will use this in its
decryption of encrypted data and computation of the MAC.
• BTENCDATA: Data which had been encrypted by PBNDKB, containing the data and tweak keys to be used by
TSE.
• BTDATA: Data that was input to PBNDKB that was output without encryption. It has the following format:
— USER_SUPP_CHALLENGE (bytes 31:0): PBNDKB uses a value provided by software in its input bind
structure but writes zero to this field in the output bind structure to be used by PCONFIG. Software should
configure this field with the proper value before executing this PCONFIG leaf function.
— KEY_GENERATION_CTRL (byte 32): PBNDKB uses this value to determine whether to generate random
keys. The PCONFIG leaf function does not use this field.
— The remaining 95 bytes are reserved and must be zero.
The leaf function uses the entire BTDATA field when it computes the MAC.
The leaf function determines a 256-bit wrapping key by computing an HMAC based on SHA-256 using a 256-bit
platform-specific key and the USER_SUPP_CHALLENGE in the BTDATA field of the TSE_BIND_STRUCT.
Using the wrapping key, the leaf function uses an AES GCM authenticated decryption function to decrypt BTENC-
DATA and compute a MAC. The decryption function uses the following inputs:
• The 64-byte BTENCDATA from TSE_BIND_STRUCT to be decrypted.
• The 256-bit wrapping key.
• The 96-bit IV from TSE_BIND_STRUCT.
• Additional authenticated data that is the concatenation of bytes 63:16 and bytes 255:128 of the TSE_BIND_-
STRUCT. These 176 bytes comprise 8 bytes of zeroes, the 12-byte IV, 28 bytes of zeroes, and 128 bytes of
BTDATA (of which the upper 95 bytes are zero).
• The length of the additional authenticated data (176).
The decryption function produces a structure with 64 bytes of decrypted data and a 16-byte MAC. The decrypted
data comprises a 256-bit data key and a 256-bit tweak key.
If the MAC produced by the decryption function differs from that provided in the TSE_BIND_STRUCT, the leaf func-
tion will load EAX with failure reason 7 (UNWRAP_FAILURE). Otherwise, the leaf function will attempt to program
the TSE key table for the selected KeyID with the keys contained in the decrypted data.
The TSE key table is shared by all logical processors in a platform. For this reason, execution of this leaf function
must gain exclusive access to the key table before updating it. The leaf function does this by acquiring a lock
(implemented in the platform) and retaining that lock until the execution completes. An execution of the leaf func-
tion may fail to acquire the lock if it is already in use. In this situation, the leaf function will load EAX with failure
reason 5 (DEVICE_BUSY). When this happens, the key table is not updated, and software should retry execution of
PCONFIG.
Operation
(* #UD if PCONFIG is not enumerated or CPL > 0 *)
IF CPUID.(EAX=07H, ECX=0H):EDX.PCONFIG[bit 18] = 0 OR CPL > 0
THEN #UD; FI;
(* Check that only one encryption algorithm is requested for the KeyID and it is one of the activated algorithms *)
IF TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG does not set exactly one bit OR
(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG & IA32_TME_ACTIVATE[63:48]) = 0
THEN #GP(0); FI;
Attempt to acquire lock to gain exclusive access to platform key table for TME-MK;
IF attempt is unsuccessful
THEN (* PCONFIG failure *)
RFLAGS.ZF := 1;
RAX := DEVICE_BUSY; (* failure reason 5 *)
GOTO EXIT;
FI;
CASE (TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.COMMAND) OF
0 (KEYID_SET_KEY_DIRECT):
Update TME-MK table for TMP_KEY_PROGRAM_STRUCT.KEYID as follows:
Encrypt with the selected key
Use the encryption algorithm selected by TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG
(* The number of bytes used by the next two lines depends on selected encryption algorithm *)
DATA_KEY is TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1
TWEAK_KEY is TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2
BREAK;
1 (KEYID_SET_KEY_RANDOM):
Load TMP_RND_DATA_KEY with a random key using hardware RNG; (* key size depends on selected encryption algorithm *)
IF there was insufficient entropy
THEN (* PCONFIG failure *)
RFLAGS.ZF := 1;
RAX := ENTROPY_ERROR; (* failure reason 2 *)
Release lock on platform key table;
GOTO EXIT;
FI;
Load TMP_RND_TWEAK_KEY with a random key using hardware RNG; (* key size depends on selected encryption algorithm *)
IF there was insufficient entropy
2 (KEYID_CLEAR_KEY):
Update TME-MK table for TMP_KEY_PROGRAM_STRUCT.KEYID as follows:
Encrypt (or not) using the current configuration for TME
The specified encryption algorithm and key values are not used.
BREAK;
3 (KEYID_NO_ENCRYPT):
Update TME-MK table for TMP_KEY_PROGRAM_STRUCT.KEYID as follows:
Do not encrypt
The specified encryption algorithm and key values are not used.
BREAK;
ESAC;
Release lock on platform key table for TME-MK;
1 (TSE_KEY_PROGRAM):
IF CPUID function 1BH does not enumerate support for the TSE target (value 2)
THEN #GP(0); FI;
(* Check that only one encryption algorithm is requested for the KeyID and it is one of the activated algorithms *)
IF TMP_KEY_STRUCT.KEYID_CTRL.ENC_ALG does not set exactly one bit OR
(TMP_KEY_STRUCT.KEYID_CTRL.ENC_ALG & IA32_TSE_CAPABILITY[15:0]) = 0
THEN #GP(0); FI;
Attempt to acquire lock to gain exclusive access to platform key table for TSE;
IF attempt is unsuccessful
THEN (* PCONFIG failure *)
RFLAGS.ZF := 1;
RAX := DEVICE_BUSY; (* failure reason 5 *)
GOTO EXIT;
FI;
CASE (TMP_KEY_STRUCT.KEYID_CTRL.COMMAND) OF
0 (TSE_SET_KEY_DIRECT):
Update TSE table for TMP_KEY_STRUCT.KEYID as follows:
Encrypt with the selected key
Use the encryption algorithm selected by TMP_KEY_STRUCT.KEYID_CTRL.ENC_ALG
(* The number of bytes used by the next two lines depends on selected encryption algorithm *)
DATA_KEY is TMP_KEY_STRUCT.KEY_FIELD_1
TWEAK_KEY is TMP_KEY_STRUCT.KEY_FIELD_2
BREAK;
1 (TSE_NO_ENCRYPT):
Update TSE table for TMP_KEY_STRUCT.KEYID as follows:
Do not encrypt
The specified encryption algorithm and key values are not used.
BREAK;
ESAC;
Release lock on platform key table for TSE;
2 (TSE_KEY_PROGRAM_WRAPPED):
IF CPUID function 1BH does not enumerate support for the TSE target (value 2)
THEN #GP(0); FI;
(* Check that only one encryption algorithm is requested for the KeyID and it is one of the activated algorithms *)
IF RBX[39:24] does not set exactly one bit OR (RBX[39:24] & IA32_TSE_CAPABILITY[15:0]) = 0
THEN #GP(0); FI;
(* Compose 176 bytes of additional authenticated data for use by authenticated decryption *)
AAD := Concatenation of bytes 63:16 and bytes 255:128 of TMP_BIND_STRUCT;
Attempt to acquire lock to gain exclusive access to platform key table for TSE;
IF attempt is unsuccessful
THEN (* PCONFIG failure *)
RFLAGS.ZF := 1;
RAX := DEVICE_BUSY; (* failure reason 5 *)
GOTO EXIT;
FI;
ESAC;
RAX := 0;
RFLAGS.ZF := 0;
EXIT:
RFLAGS.CF := 0;
RFLAGS.PF := 0;
RFLAGS.AF := 0;
RFLAGS.OF := 0;
RFLAGS.SF := 0;
Description
This instruction reads a software-provided list of up to 64 MSRs and stores their values in memory.
RDMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address)1.
• RDI: Linear address of a table into which MSR data is stored (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, RDMSRLIST will read the MSR specified at entry [n] in the RSI
table and write it out to memory at the entry [n] in the RDI table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, RDMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the RIP register saved will point to the RDMSRLIST instruc-
tion while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR specific checks and respects the VMX MSR VM-execution controls in the same
manner as RDMSR.
Although RDMSRLIST accesses the entries in the two tables in order, the actual reads of the MSRs may be
performed out of order: for table entries m < n, the processor may read the MSR for entry n before reading the
MSR for entry m. (This may be true also for a sequence of executions of RDMSR.) Ordering is guaranteed if the
address of the IA32_BARRIER MSR (2FH) appears in the table of MSR addresses. Specifically, if IA32_BARRIER
appears at entry m, then the MSR read for any entry n with n > m will not occur until (1) all instructions prior to
RDMSRLIST have completed locally; and (2) MSRs have been read for all table entries before entry m.
The processor is allowed to (but not required to) “load ahead” in the list. Examples:
• Use old memory type or TLB translation for loads/stores to list memory despite an MSR written by a previous
iteration changing MTRR or invalidating TLBs.
• Cause a page fault or EPT violation for a memory access to an entry > “n” in MSR address or data tables,
despite the processor only having read or written “n” MSRs.2
1. Since MSR addresses are only 32 bits wide, bits 63:32 of each MSR address table entry are reserved.
2. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only
having completed the MSR writes up to entry 5.
• The value of the MSR address is in the range C0000000H–C0001FFFH and bit n in the read bitmap for high
MSRs is 1, where n is the value of the MSR address & 00001FFFH.
A VM exit for the above reasons for the RDMSRLIST instruction will specify exit reason 78 (decimal). The exit qual-
ification is set to the MSR address causing the VM exit if the “use MSR bitmaps” VM-execution control is 1. If the
“use MSR bitmaps” VM-execution control is 0, then the VM-exit qualification will be 0.
If software wants to emulate a single iteration of RDMSRLIST after a VM exit, it can use the exit qualification to
identify the MSR. Such software may need to write to the table of data. It can calculate the guest-linear address of
the table entry to write by using the values of RDI (the guest-linear address of the table) and RCX (the lowest bit
set in RCX identifies the specific table entry).
Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
IF RDMSR of the MSR with address MSR_address would #GP THEN #GP(0); FI;
Store the value of the MSR with address MSR_address into 8 bytes at the linear address RDI + (MSR_index * 8);
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;
Flags Affected
None.
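The sketch below shows only how ring-0 software could lay out the two tables and the RCX mask described above, including an IA32_BARRIER entry to order the reads; the array names and example MSR addresses are illustrative, and the actual execution of RDMSRLIST at CPL 0 (which needs assembler or compiler support) is not shown.
#include <stdint.h>

#define IA32_BARRIER 0x2FULL   /* ordering-barrier MSR address (see text above) */

/* Each table holds up to 64 entries of 8 bytes; entry n in the address table
 * pairs with entry n in the data table. Names are placeholders.
 */
uint64_t msr_addr_table[64];
uint64_t msr_data_table[64];

uint64_t build_rdmsrlist_request(void)
{
    msr_addr_table[0] = 0xE7;          /* example: IA32_MPERF                     */
    msr_addr_table[1] = IA32_BARRIER;  /* entry 2's read waits for entry 0's read */
    msr_addr_table[2] = 0xE8;          /* example: IA32_APERF                     */

    /* RCX bitmask: bit n selects entry n in both tables.
     * RSI := &msr_addr_table[0], RDI := &msr_data_table[0], RCX := this mask,
     * then execute RDMSRLIST at CPL 0 (not shown).
     */
    return (1ULL << 0) | (1ULL << 1) | (1ULL << 2);
}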
VEX.128.F2.MAP7:W0.F8 11:000:bbb | URDMSR r64, imm32 | MI | V/N.E. | USER_MSR | Load into register bbb the value of the MSR with address in the 32-bit immediate.
Description
URDMSR reads the contents of a 64-bit MSR specified in operand 2 into operand 1. Operand 1 is a general-purpose
register, while operand 2 may be either a general-purpose register or an immediate. URDMSR reads the indicated
MSR in the same manner as RDMSR.
MSRs readable by RDMSR with CPL = 0 can be read by URDMSR at any privilege level but under OS control. The OS
controls which MSRs can be accessed by the URDMSR and UWRMSR instructions with a 4-KByte bitmap located at
an aligned linear address specified in the IA32_USER_MSR_CTL MSR (MSR address 1CH). The URDMSR instruction
is enabled only if bit 0 of this MSR is 1.
The low 2 KBytes of the bitmap control URDMSR (they compose the URDMSR bitmap); these 2 KBytes include one
bit for each MSR address in the range 0H–3FFFH. URDMSR may read an MSR only if the bit corresponding to the
MSR has value 1; otherwise (or if the MSR address is outside that range), URDMSR causes a general-protection
exception (#GP).
URDMSR accesses to these bitmaps are implicit supervisor-mode accesses, which means they use supervisor
privilege regardless of CPL. The OS can create an alias to the bitmap in the user address space if it wants the
application to know which MSRs are permitted. Still, the alias should be mapped read-only to prevent the
application from overwriting the bitmap.
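A minimal sketch of the permission check implied by this bitmap layout: the low 2 KBytes govern URDMSR and the high 2 KBytes govern UWRMSR, one bit per MSR address in the range 0H–3FFFH. The byte and bit indexing convention shown here (bit addr mod 8 within byte addr/8) is an assumption for illustration.
#include <stdbool.h>
#include <stdint.h>

/* bitmap points to the 4-KByte structure whose linear address is given by
 * IA32_USER_MSR_CTL. Returns whether the MSR address would be permitted for
 * URDMSR (is_write = false) or UWRMSR (is_write = true), under the assumed
 * bit-numbering convention.
 */
bool user_msr_allowed(const uint8_t *bitmap, uint32_t msr_addr, bool is_write)
{
    if (msr_addr > 0x3FFF)
        return false;                                   /* outside the covered range: #GP */

    const uint8_t *half = bitmap + (is_write ? 2048 : 0);  /* UWRMSR bitmap is the high 2 KB */
    return (half[msr_addr >> 3] >> (msr_addr & 7)) & 1;
}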
Virtualization Behavior
Like RDMSR, execution of URDMSR in VMX non-root operation causes a VM exit if any of the following are true:
• The “use MSR bitmaps” VM-execution control is 0.
• The value of the MSR address is not in the range 00000000H–00001FFFH.
• The value of the MSR address is in the range 00000000H–00001FFFH, and bit n in the read bitmap for low MSRs
is 1, where n is the value of the MSR address.
Such VM exits have priority below a #GP due to an MSR address outside the bitmap range or whose bit is clear in
the bitmap. In enclave mode, URDMSR will cause a #GP(0) exception instead of a VM exit if any of the above condi-
tions are true.
A VM exit for the above reasons for the URDMSR instruction will specify exit reason 80 (decimal). The exit qualifi-
cation is set to the MSR address causing the VM exit. The VM-exit instruction length and VM-exit instruction infor-
mation fields will be populated for these VM exits; see Table 2-6 for details.
Table 2-6. Format of the VM-Exit Instruction Information Field Used for URDMSR and UWRMSR
Bit Position Content
2:0 Undefined.
6:3 Reg1: (ModR/M field, source / dest data operand)
0 = RAX / 1 = RCX / 2 = RDX / 3 = RBX / 4 = RSP / 5 = RBP / 6 = RSI / 7 = RDI.
8–15 represent R8–R15, respectively.
31:7 Undefined.
No new VMX execution controls are added for URDMSR; legacy MSR controls suffice. Legacy VMMs should not allow
guests to set IA32_USER_MSR_CTL.ENABLE and thus should not receive these VM exits.
Operation
DEST := MSR[SRC]
Flags Affected
None.
VEX.128.F3.MAP7:W0.F8 11:000:bbb | UWRMSR imm32, r64 | IM | V/N.E. | USER_MSR | Load into the MSR with address in the 32-bit immediate the value of register bbb.
Description
UWRMSR writes the contents of operand 2 into the 64-bit MSR specified in operand 1. Operand 2 is a general-
purpose register, while operand 1 may be either a general-purpose register or an immediate. UWRMSR writes the
indicated MSR in the same manner as WRMSR, but it is limited to a specific set of MSRs. Table 2-7 gives the list of
MSRs currently writeable by UWRMSR.
The MSRs enumerated in Table 2-7 can be written by UWRMSR at any privilege level but under OS control. The OS
controls which MSRs can be accessed by the URDMSR and UWRMSR instructions with a 4-KByte bitmap located at
an aligned linear address specified in the IA32_USER_MSR_CTL MSR (MSR address 1CH). The UWRMSR instruction
is enabled only if bit 0 of this MSR is 1.
The high 2 KBytes of the bitmap control UWRMSR (they compose the UWRMSR bitmap); these 2 KBytes include one
bit for each MSR address in the range 0H–3FFFH. UWRMSR may write to an MSR only if the bit corresponding to the
MSR has value 1; otherwise (or if the MSR address is outside that range), UWRMSR causes a general-protection
exception (#GP).
UWRMSR accesses to these bitmaps are implicit supervisor-mode accesses, which means they use supervisor priv-
ilege regardless of CPL. The OS can create an alias to the bitmap in the user address space if it wants the applica-
tion to know which MSRs are permitted. Still, the alias should be mapped read-only to prevent the application from
overwriting the bitmap. UWRMSR behaves like WRMSRNS and is not defined as a serializing instruction (see
“Serializing Instructions” in Chapter 9 of the Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 3A). Refer to the WRMSRNS instruction for a thorough explanation of what this implies.
Virtualization Behavior
Like WRMSR, execution of UWRMSR in VMX non-root operation causes a VM exit if any of the following are true:
• The “use MSR bitmaps” VM-execution control is 0.
• The value of the MSR address is not in the range 00000000H–00001FFFH.
• The value of the MSR address is in the range 00000000H–00001FFFH, and bit n in the write bitmap for low
MSRs is 1, where n is the value of the MSR address.
Such VM exits have priority below a #GP due to an MSR address outside the bitmap range or whose bit is clear in
the bitmap. In enclave mode, UWRMSR will cause a #GP(0) exception instead of a VM exit if any of the above
conditions are true.
A VM exit for the above reasons for the UWRMSR instruction will specify exit reason 81 (decimal). The exit qualifi-
cation is set to the MSR address causing the VM exit. The VM-exit instruction length and VM-exit instruction infor-
mation fields will be populated for these VM exits. See Table 2-6, found under the URDMSR instruction, for details.
No new VMX execution controls are added for UWRMSR; legacy MSR controls suffice. Legacy VMMs should not
allow guests to set IA32_USER_MSR_CTL.ENABLE and thus should not receive these VM exits.
Operation
MSR[DEST] := SRC
Flags Affected
None.
Description
This instruction loads one BF16 element from memory, converts it to FP32, and broadcasts it to a SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Since any BF16 number can be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VBCSTNEBF162PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exceptions Type 5.
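Because a BF16 value occupies the upper 16 bits of the corresponding FP32 encoding, the scalar conversion performed by this instruction can be expressed exactly as a 16-bit left shift; the C sketch below shows that (the function name is illustrative).
#include <stdint.h>
#include <string.h>

/* Exact BF16 -> FP32 conversion: place the 16 BF16 bits in the upper half of
 * a 32-bit pattern and reinterpret it as a float. No rounding occurs.
 */
float bf16_to_fp32(uint16_t bf16)
{
    uint32_t bits = (uint32_t)bf16 << 16;
    float result;
    memcpy(&result, &bits, sizeof(result));
    return result;
}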
VEX.256.66.0F38.W0 B1 !(11):rrr:bbb | VBCSTNESH2PS ymm1, m16 | A | V/V | AVX-NE-CONVERT | Load one FP16 element from m16, convert to FP32, and store result in ymm1.
Description
This instruction loads one FP16 element from memory, converts it to FP32, and broadcasts it to a SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Input FP16 denormals are converted to normal FP32 numbers and not treated as zero. Since any FP16 number can
be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VBCSTNESH2PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exception Type 5.
VEX.256.F3.0F38.W0 B0 !(11):rrr:bbb | VCVTNEEBF162PS ymm1, m256 | A | V/V | AVX-NE-CONVERT | Convert even elements of packed BF16 values from m256 to FP32 values and store in ymm1.
Description
This instruction loads packed BF16 elements from memory, converts the even elements to FP32, and writes the
result to the destination SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Since any BF16 number can be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VCVTNEEBF162PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exception Type 4.
VEX.256.66.0F38.W0 B0 !(11):rrr:bbb | VCVTNEEPH2PS ymm1, m256 | A | V/V | AVX-NE-CONVERT | Convert even elements of packed FP16 values from m256 to FP32 values and store in ymm1.
Description
This instruction loads packed FP16 elements from memory, converts the even elements to FP32, and writes the
result to the destination SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Input FP16 denormals are converted to normal FP32 numbers and not treated as zero. Since any FP16 number can
be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VCVTNEEPH2PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exception Type 4.
VEX.256.F2.0F38.W0 B0 !(11):rrr:bbb | VCVTNEOBF162PS ymm1, m256 | A | V/V | AVX-NE-CONVERT | Convert odd elements of packed BF16 values from m256 to FP32 values and store in ymm1.
Description
This instruction loads packed BF16 elements from memory, converts the odd elements to FP32, and writes the
result to the destination SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Since any BF16 number can be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VCVTNEOBF162PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exception Type 4.
VEX.256.NP.0F38.W0 B0 !(11):rrr:bbb | VCVTNEOPH2PS ymm1, m256 | A | V/V | AVX-NE-CONVERT | Convert odd elements of packed FP16 values from m256 to FP32 values and store in ymm1.
Description
This instruction loads packed FP16 elements from memory, converts the odd elements to FP32, and writes the
result to the destination SIMD register.
This instruction does not generate floating-point exceptions and does not consult or update MXCSR.
Input FP16 denormals are converted to normal FP32 numbers and not treated as zero. Since any FP16 number can
be represented in FP32, the conversion result is exact and no rounding is needed.
Operation
VCVTNEOPH2PS dest, src (VEX encoded version)
VL = (128, 256)
KL = VL/32
Flags Affected
None.
Other Exceptions
See Exception Type 4.
Description
This instruction loads packed FP32 elements from a SIMD register or memory, converts the elements to BF16, and
writes the result to the destination SIMD register.
The upper bits of the destination register beyond the down-converted BF16 elements are zeroed.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated.
Operation
define convert_fp32_to_bfloat16(x):
IF x is zero or denormal:
dest[15] := x[31] // sign-preserving zero (denormals go to zero)
dest[14:0] := 0
ELSE IF x is infinity:
dest[15:0] := x[31:16]
ELSE IF x is nan:
dest[15:0] := x[31:16] // truncate and set the MSB of the mantissa to force a QNaN
dest[6] := 1
ELSE // normal number
lsb := x[16]
rounding_bias := 0x00007FFF + lsb
temp[31:0] := x[31:0] + rounding_bias // integer add
dest[15:0] := temp[31:16]
return dest
FOR i := 0 to KL/2-1:
t := src.fp32[i]
dest.word[i] := convert_fp32_to_bfloat16(t)
DEST[MAXVL-1:VL/2] := 0
Flags Affected
None.
Other Exceptions
See Exceptions Type 4.
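For reference, here is a minimal C rendering of the convert_fp32_to_bfloat16 helper above, with the round-to-nearest-even bias, the NaN quieting, and the zero/denormal handling described in the text; the function name mirrors the pseudocode and the bit manipulation is a direct restatement of it.
#include <stdint.h>
#include <string.h>

uint16_t convert_fp32_to_bfloat16(float value)
{
    uint32_t x;
    memcpy(&x, &value, sizeof(x));

    uint32_t exponent = (x >> 23) & 0xFF;
    uint32_t mantissa = x & 0x7FFFFF;

    if (exponent == 0)                              /* zero or denormal input */
        return (uint16_t)((x >> 16) & 0x8000);      /* sign-preserving zero   */
    if (exponent == 0xFF) {                         /* infinity or NaN        */
        uint16_t hi = (uint16_t)(x >> 16);
        return mantissa ? (uint16_t)(hi | 0x0040) : hi;  /* quiet any NaN     */
    }
    /* Normal number: round to nearest even by adding a bias keyed to the LSB
     * of the result, then truncate to the upper 16 bits.
     */
    uint32_t lsb = (x >> 16) & 1;
    return (uint16_t)((x + 0x7FFF + lsb) >> 16);
}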
VPDPB[SU,UU,SS]D[,S]—Multiply and Add Unsigned and Signed Bytes With and Without
Saturation
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.F2.0F38.W0 50 /r | VPDPBSSD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding signed bytes of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.F2.0F38.W0 50 /r | VPDPBSSD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding signed bytes of ymm2, summing those products and adding them to the doubleword result in ymm1.
VEX.128.F2.0F38.W0 51 /r | VPDPBSSDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding signed bytes of xmm2, summing those products and adding them to the doubleword result, with signed saturation, in xmm1.
VEX.256.F2.0F38.W0 51 /r | VPDPBSSDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding signed bytes of ymm2, summing those products and adding them to the doubleword result, with signed saturation, in ymm1.
VEX.128.F3.0F38.W0 50 /r | VPDPBSUD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.F3.0F38.W0 50 /r | VPDPBSUD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1.
VEX.128.F3.0F38.W0 51 /r | VPDPBSUDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result, with signed saturation, in xmm1.
VEX.256.F3.0F38.W0 51 /r | VPDPBSUDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of signed bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result, with signed saturation, in ymm1.
VEX.128.NP.0F38.W0 50 /r | VPDPBUUD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of unsigned bytes in xmm3/m128 with corresponding unsigned bytes of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.NP.0F38.W0 50 /r | VPDPBUUD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT8 | Multiply groups of 4 pairs of unsigned bytes in ymm3/m256 with corresponding unsigned bytes of ymm2, summing those products and adding them to the doubleword result in ymm1.
Description
Multiplies the individual bytes of the first source operand by the corresponding bytes of the second source operand,
producing intermediate word results. The word results are then summed and accumulated in the destination dword
element size operand.
For unsigned saturation, when an individual result value is beyond the range of an unsigned doubleword (that is,
greater than FFFF_FFFFH), the saturated unsigned doubleword integer value of FFFF_FFFFH is stored in the
doubleword destination.
For signed saturation, when an individual result is beyond the range of a signed doubleword integer (that is,
greater than 7FFF_FFFFH or less than 8000_0000H), the saturated value of 7FFF_FFFFH or 8000_0000H, respec-
tively, is written to the destination operand.
Operation
VPDPB[SU,UU,SS]D[,S] dest, src1, src2 (VEX encoded version)
VL = (128, 256)
KL = VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
IF *src1 is signed*:
src1extend := SIGN_EXTEND // SU, SS
ELSE:
src1extend := ZERO_EXTEND // UU
IF *src2 is signed*:
src2extend := SIGN_EXTEND // SS
ELSE:
src2extend := ZERO_EXTEND // UU, SU
IF *saturating*:
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
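The per-doubleword computation described above can be illustrated with a scalar C model of one destination lane of VPDPBSSD (signed bytes in both sources, no saturation); the function name and types are illustrative.
#include <stdint.h>

/* One dword lane of VPDPBSSD: four signed-byte products are widened, summed,
 * and accumulated into the 32-bit destination element.
 */
int32_t vpdpbssd_lane(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    int32_t sum = acc;
    for (int j = 0; j < 4; j++)
        sum += (int32_t)a[j] * (int32_t)b[j];   /* each byte product fits in a word */
    return sum;
}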
VPDPW[SU,US,UU]D[,S]—Multiply and Add Unsigned and Signed Words With and Without
Saturation
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.F3.0F38.W0 D2 /r | VPDPWSUD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of signed words in xmm3/m128 with corresponding unsigned words of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.F3.0F38.W0 D2 /r | VPDPWSUD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of signed words in ymm3/m256 with corresponding unsigned words of ymm2, summing those products and adding them to the doubleword result in ymm1.
VEX.128.F3.0F38.W0 D3 /r | VPDPWSUDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of signed words in xmm3/m128 with corresponding unsigned words of xmm2, summing those products and adding them to the doubleword result, with signed saturation, in xmm1.
VEX.256.F3.0F38.W0 D3 /r | VPDPWSUDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of signed words in ymm3/m256 with corresponding unsigned words of ymm2, summing those products and adding them to the doubleword result, with signed saturation, in ymm1.
VEX.128.66.0F38.W0 D2 /r | VPDPWUSD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in xmm3/m128 with corresponding signed words of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.66.0F38.W0 D2 /r | VPDPWUSD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in ymm3/m256 with corresponding signed words of ymm2, summing those products and adding them to the doubleword result in ymm1.
VEX.128.66.0F38.W0 D3 /r | VPDPWUSDS xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in xmm3/m128 with corresponding signed words of xmm2, summing those products and adding them to the doubleword result, with signed saturation, in xmm1.
VEX.256.66.0F38.W0 D3 /r | VPDPWUSDS ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in ymm3/m256 with corresponding signed words of ymm2, summing those products and adding them to the doubleword result, with signed saturation, in ymm1.
VEX.128.NP.0F38.W0 D2 /r | VPDPWUUD xmm1, xmm2, xmm3/m128 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in xmm3/m128 with corresponding unsigned words of xmm2, summing those products and adding them to the doubleword result in xmm1.
VEX.256.NP.0F38.W0 D2 /r | VPDPWUUD ymm1, ymm2, ymm3/m256 | A | V/V | AVX-VNNI-INT16 | Multiply groups of 2 pairs of unsigned words in ymm3/m256 with corresponding unsigned words of ymm2, summing those products and adding them to the doubleword result in ymm1.
Description
Multiplies the individual words of the first source operand by the corresponding words of the second source
operand, producing intermediate dword results. The dword results are then summed and accumulated in the desti-
nation dword element size operand.
For unsigned saturation, when an individual result value is beyond the range of an unsigned doubleword (that is,
greater than FFFF_FFFFH), the saturated unsigned doubleword integer value of FFFF_FFFFH is stored in the double-
word destination.
For signed saturation, when an individual result is beyond the range of a signed doubleword integer (that is,
greater than 7FFF_FFFFH or less than 8000_0000H), the saturated value of 7FFF_FFFFH or 8000_0000H, respec-
tively, is written to the destination operand.
The EVEX version of VPDPWSSD[,S] was previously introduced with AVX512-VNNI. The VEX version of
VPDPWSSD[,S] was previously introduced with AVX-VNNI.
Operation
VPDPW[UU,SU,US]D[,S] dest, src1, src2
VL = (128, 256)
KL = VL/32
ORIGDEST := DEST
IF *src1 is signed*: // SU
src1extend := SIGN_EXTEND
ELSE: // UU, US
src1extend := ZERO_EXTEND
IF *src2 is signed*: // US
src2extend := SIGN_EXTEND
ELSE: // UU, SU
src2extend := ZERO_EXTEND
FOR i := 0 TO KL-1:
p1dword := src1extend(SRC1.word[2*i+0]) * src2extend(SRC2.word[2*i+0])
p2dword := src1extend(SRC1.word[2*i+1]) * src2extend(SRC2.word[2*i+1])
IF *saturating version*:
IF *UU instruction version*:
Other Exceptions
See Exceptions Type 4.
VPMADD52HUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the High 52-Bit
Products to Qword Accumulators
Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W1 B5 /r | VPMADD52HUQ xmm1, xmm2, xmm3/m128 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the high 52 bits of the 104-bit product to the qword unsigned integers in xmm1.
VEX.256.66.0F38.W1 B5 /r | VPMADD52HUQ ymm1, ymm2, ymm3/m256 | A | V/V | AVX-IFMA | Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the high 52 bits of the 104-bit product to the qword unsigned integers in ymm1.
Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second operand)
with the packed unsigned 52-bit integers in the corresponding elements of the second source operand (the third
operand) to form packed 104-bit intermediate results. The high 52-bit, unsigned integer of each 104-bit product is
added to the corresponding qword unsigned integer of the destination operand (the first operand).
Operation
VPMADD52HUQ srcdest, src1, src2 (VEX version)
VL = (128,256)
KL = VL/64
FOR i in 0 .. KL-1:
    temp128 := zeroextend64(src1.qword[i][51:0]) * zeroextend64(src2.qword[i][51:0])
    srcdest.qword[i] := srcdest.qword[i] + zeroextend64(temp128[103:52])
srcdest[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
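As a cross-check of the Operation above, here is a minimal scalar C model of one qword lane. The helper name is illustrative, and a compiler that provides the unsigned __int128 extension is assumed.
#include <stdint.h>

/* Scalar sketch of one qword lane of VPMADD52HUQ, per the Operation above. */
static uint64_t madd52hi(uint64_t acc, uint64_t a, uint64_t b)
{
    const uint64_t mask52 = (1ULL << 52) - 1;
    unsigned __int128 prod = (unsigned __int128)(a & mask52) *
                             (unsigned __int128)(b & mask52);
    return acc + (uint64_t)(prod >> 52);   /* add high 52 bits of the 104-bit product */
}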
VPMADD52LUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products
to Qword Accumulators
VEX.128.66.0F38.W1 B4 /r VPMADD52LUQ xmm1, xmm2, xmm3/m128
Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX-IFMA
Multiply unsigned 52-bit integers in xmm2 and xmm3/m128 and add the low 52 bits of the 104-bit product to the qword unsigned integers in xmm1.
VEX.256.66.0F38.W1 B4 /r VPMADD52LUQ ymm1, ymm2, ymm3/m256
Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX-IFMA
Multiply unsigned 52-bit integers in ymm2 and ymm3/m256 and add the low 52 bits of the 104-bit product to the qword unsigned integers in ymm1.
Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second operand)
with the packed unsigned 52-bit integers in the corresponding elements of the second source operand (the third
operand) to form packed 104-bit intermediate results. The low 52-bit, unsigned integer of each 104-bit product is
added to the corresponding qword unsigned integer of the destination operand (the first operand).
Operation
VPMADD52LUQ srcdest, src1, src2 (VEX version)
VL = (128,256)
KL = VL/64
FOR i in 0 .. KL-1:
    temp128 := zeroextend64(src1.qword[i][51:0]) * zeroextend64(src2.qword[i][51:0])
    srcdest.qword[i] := srcdest.qword[i] + zeroextend64(temp128[51:0])
srcdest[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type 4.
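VPMADD52LUQ is normally paired with VPMADD52HUQ so that the full 104-bit product is accumulated across two qword accumulators. A scalar C sketch of that pairing is shown below; the helper name is illustrative and the unsigned __int128 compiler extension is assumed.
#include <stdint.h>

/* Accumulate the full 104-bit product of two 52-bit values into two 64-bit
 * accumulators, mirroring VPMADD52LUQ (bits 51:0) and VPMADD52HUQ (bits 103:52). */
static void madd52(uint64_t *acc_lo, uint64_t *acc_hi, uint64_t a, uint64_t b)
{
    const uint64_t mask52 = (1ULL << 52) - 1;
    unsigned __int128 prod = (unsigned __int128)(a & mask52) *
                             (unsigned __int128)(b & mask52);
    *acc_lo += (uint64_t)(prod & mask52);  /* VPMADD52LUQ lane */
    *acc_hi += (uint64_t)(prod >> 52);     /* VPMADD52HUQ lane */
}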
VSHA512MSG1—Perform an Intermediate Calculation for the Next Four SHA512 Message Qwords
Description
The VSHA512MSG1 instruction is one of the two SHA512 message scheduling instructions. The instruction
performs an intermediate calculation for the next four SHA512 message qwords.
See https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf for more information on the SHA512 standard.
Operation
define ROR64(qword, n):
count := n % 64
dest := (qword >> count) | (qword << (64-count))
return dest
define s0(qword):
return ROR64(qword,1) ^ ROR64(qword, 8) ^ SHR64(qword, 7)
Flags Affected
None.
Other Exceptions
See Exceptions Type 6.
VSHA512MSG2—Perform a Final Calculation for the Next Four SHA512 Message Qwords
VEX.256.F2.0F38.W0 CD 11:rrr:bbb VSHA512MSG2 ymm1, ymm2
Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX SHA512
Performs the final calculation for the next four SHA512 message qwords using previous message qwords from ymm1 and ymm2, storing the result in ymm1.
Description
The VSHA512MSG2 instruction is one of the two SHA512 message scheduling instructions. The instruction
performs the final calculation for the next four SHA512 message qwords.
See https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf for more information on the SHA512 standard.
Operation
define ROR64(qword, n):
count := n % 64
dest := (qword >> count) | (qword << (64-count))
return dest
define s1(qword):
return ROR64(qword,19) ^ ROR64(qword, 61) ^ SHR64(qword, 6)
SRCDEST.qword[3] := W[19]
SRCDEST.qword[2] := W[18]
SRCDEST.qword[1] := W[17]
SRCDEST.qword[0] := W[16]
Flags Affected
None.
Other Exceptions
See Exceptions Type 6.
VSHA512RNDS2—Perform Two Rounds of SHA512 Operation
Description
The VSHA512RNDS2 instruction performs two rounds of SHA512 operation using initial SHA512 state (C,D,G,H)
from the first operand, an initial SHA512 state (A,B,E,F) from the second operand, and a pre-computed sum of the
next two round message qwords and the corresponding round constants from the third operand (only the two
lower qwords of the third operand). The updated SHA512 state (A,B,E,F) is written to the first operand, and the
second operand can be used as the updated state (C,D,G,H) in later rounds.
See https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf for more information on the SHA512 standard.
Operation
define ROR64(qword, n):
count := n % 64
dest := (qword >> count) | (qword << (64-count))
return dest
define cap_sigma0(qword):
return ROR64(qword,28) ^ ROR64(qword, 34) ^ ROR64(qword, 39)
define cap_sigma1(qword):
return ROR64(qword,14) ^ ROR64(qword, 18) ^ ROR64(qword, 41)
define MAJ(a,b,c):
return (a & b) ^ (a & c) ^ (b & c)
define CH(e,f,g):
return (e & f) ^ (g & ~e)
FOR i in 0..1:
A[i+1] := CH(E[i], F[i], G[i]) +
cap_sigma1(E[i]) + WK[i] + H[i] +
MAJ(A[i], B[i], C[i]) +
cap_sigma0(A[i])
B[i+1] := A[i]
C[i+1] := B[i]
D[i+1] := C[i]
E[i+1] := CH(E[i], F[i], G[i]) +
cap_sigma1(E[i]) + WK[i] + H[i] + D[i]
F[i+1] := E[i]
G[i+1] := F[i]
H[i+1] := G[i]
SRCDEST.qword[3] = A[2]
SRCDEST.qword[2] = B[2]
SRCDEST.qword[1] = E[2]
SRCDEST.qword[0] = F[2]
Flags Affected
None.
Other Exceptions
See Exceptions Type 6.
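For reference, the two-round update in the Operation above can be written as the following scalar C sketch. The array ordering of the (A,B,E,F) and (C,D,G,H) state and of the two message-plus-constant qwords is illustrative and does not claim to match the qword layout of the instruction operands.
#include <stdint.h>

static uint64_t ror64(uint64_t x, unsigned n) { n &= 63; return n ? (x >> n) | (x << (64 - n)) : x; }
static uint64_t csigma0(uint64_t x) { return ror64(x, 28) ^ ror64(x, 34) ^ ror64(x, 39); }
static uint64_t csigma1(uint64_t x) { return ror64(x, 14) ^ ror64(x, 18) ^ ror64(x, 41); }
static uint64_t maj(uint64_t a, uint64_t b, uint64_t c) { return (a & b) ^ (a & c) ^ (b & c); }
static uint64_t ch(uint64_t e, uint64_t f, uint64_t g)  { return (e & f) ^ (g & ~e); }

/* abef = {A,B,E,F}, cdgh = {C,D,G,H}, wk = two precomputed message+constant qwords. */
static void sha512_two_rounds(uint64_t abef[4], const uint64_t cdgh[4], const uint64_t wk[2])
{
    uint64_t a = abef[0], b = abef[1], e = abef[2], f = abef[3];
    uint64_t c = cdgh[0], d = cdgh[1], g = cdgh[2], h = cdgh[3];

    for (int i = 0; i < 2; i++) {
        uint64_t t     = ch(e, f, g) + csigma1(e) + wk[i] + h;
        uint64_t new_a = t + maj(a, b, c) + csigma0(a);
        uint64_t new_e = t + d;
        h = g; g = f; f = e; e = new_e;        /* rotate the right half of the state */
        d = c; c = b; b = a; a = new_a;        /* rotate the left half of the state  */
    }
    abef[0] = a; abef[1] = b; abef[2] = e; abef[3] = f;   /* updated (A,B,E,F) */
}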
VSM3MSG1—Perform Initial Calculation for the Next Four SM3 Message Words
VEX.128.NP.0F38.W0 DA /r VSM3MSG1 xmm1, xmm2, xmm3/m128
Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX SM3
Performs an initial calculation for the next four SM3 message words using previous message words from xmm2 and xmm3/m128, storing the result in xmm1.
Description
The VSM3MSG1 instruction is one of the two SM3 message scheduling instructions. The instruction performs an
initial calculation for the next four SM3 message words.
Operation
define ROL32(dword, n):
count := n % 32
dest := (dword << count) | (dword >> (32-count))
return dest
define P1(x):
return x ^ ROL32(x, 15) ^ ROL32(x, 23)
W[7] := SRCDEST.dword[0]
W[8] := SRCDEST.dword[1]
W[9] := SRCDEST.dword[2]
W[10] := SRCDEST.dword[3]
W[13] := SRC1.dword[0]
W[14] := SRC1.dword[1]
W[15] := SRC1.dword[2]
SRCDEST.dword[0] := P1(TMP0)
SRCDEST.dword[1] := P1(TMP1)
SRCDEST.dword[2] := P1(TMP2)
SRCDEST.dword[3] := P1(TMP3)
Flags Affected
None.
Other Exceptions
See Exceptions Type 4.
VSM3MSG2—Perform Final Calculation for the Next Four SM3 Message Words
VEX.128.66.0F38.W0 DA /r VSM3MSG2 xmm1, xmm2, xmm3/m128
Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX SM3
Performs the final calculation for the next four SM3 message words using previous message words from xmm2 and xmm3/m128, storing the result in xmm1.
Description
The VSM3MSG2 instruction is one of the two SM3 message scheduling instructions. The instruction performs the
final calculation for the next four SM3 message words.
Operation
//see the VSM3MSG1 instruction for definition of ROL32()
WTMP[0] := SRCDEST.dword[0]
WTMP[1] := SRCDEST.dword[1]
WTMP[2] := SRCDEST.dword[2]
WTMP[3] := SRCDEST.dword[3]
// Dword array W[] indices are based on the SM3 specification.
W[3] := SRC1.dword[0]
W[4] := SRC1.dword[1]
W[5] := SRC1.dword[2]
W[6] := SRC1.dword[3]
W[10] := SRC2.dword[0]
W[11] := SRC2.dword[1]
W[12] := SRC2.dword[2]
W[13] := SRC2.dword[3]
SRCDEST.dword[0] := W[16]
SRCDEST.dword[1] := W[17]
SRCDEST.dword[2] := W[18]
SRCDEST.dword[3] := W[19]
Flags Affected
None.
Other Exceptions
See Exceptions Type 4.
VSM3RNDS2—Perform Two Rounds of SM3 Operation
Description
The VSM3RNDS2 instruction performs two rounds of SM3 operation using an initial SM3 state (C, D, G, H) from the
first operand, an initial SM3 state (A, B, E, F) from the second operand, and pre-computed words from the third
operand. The first operand, with the initial SM3 state (C, D, G, H), assumes input of the non-rotated left variables
from the previous state. The updated SM3 state (A, B, E, F) is written to the first operand.
The imm8 should contain the even round number for the first of the two rounds computed by this instruction. The
computation masks the imm8 value by AND’ing it with 0x3E so that only even round numbers from 0 through 62
are used for this operation.
Operation
//see the VSM3MSG1 instruction for definition of ROL32()
define P0(dword):
return dword ^ ROL32(dword, 9) ^ ROL32(dword, 17)
W[5] := SRC2.dword[3]
C[0] := ROL32(C[0], 9)
D[0] := ROL32(D[0], 9)
G[0] := ROL32(G[0], 19)
H[0] := ROL32(H[0], 19)
FOR i in 0..1:
S1 := ROL32((ROL32(A[i], 12) + E[i] + CONST), 7)
S2 := S1 ^ ROL32(A[i],12)
T1 := FF(A[i], B[i], C[i], ROUND) + D[i] + S2 + (W[i]^W[i+4])
T2 := GG(E[i], F[i], G[i], ROUND) + H[i] + S1 + W[i]
D[i+1] := C[i]
C[i+1] := ROL32(B[i],9)
B[i+1] := A[i]
A[i+1] := T1
H[i+1] := G[i]
G[i+1] := ROL32(F[i], 19)
F[i+1] := E[i]
E[i+1] := P0(T2)
CONST := ROL32(CONST, 1)
SRCDEST.dword[3] := A[2]
SRCDEST.dword[2] := B[2]
SRCDEST.dword[1] := E[2]
SRCDEST.dword[0] := F[2]
Flags Affected
None.
Other Exceptions
See Exceptions Type 4.
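The Operation above leaves FF(), GG(), the round constant CONST, and the message words W[] as inputs. The following scalar C sketch fills those in from the SM3 specification rather than from this document; the state and message array layout is illustrative and does not claim to match the dword layout of the instruction operands.
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n) { n &= 31; return n ? (x << n) | (x >> (32 - n)) : x; }
static uint32_t p0(uint32_t x) { return x ^ rol32(x, 9) ^ rol32(x, 17); }

/* Standard SM3 boolean functions (per the SM3 specification). */
static uint32_t ff(uint32_t x, uint32_t y, uint32_t z, unsigned round)
{
    return (round < 16) ? (x ^ y ^ z) : ((x & y) | (x & z) | (y & z));
}
static uint32_t gg(uint32_t x, uint32_t y, uint32_t z, unsigned round)
{
    return (round < 16) ? (x ^ y ^ z) : ((x & y) | (~x & z));
}

/* s = {A,B,C,D,E,F,G,H}; w[0..1] and w[4..5] are the message words used by
 * these two rounds; round is the even round number (imm8 & 0x3E). */
static void sm3_two_rounds(uint32_t s[8], const uint32_t w[6], unsigned round)
{
    uint32_t t_const = (round < 16) ? 0x79CC4519u : 0x7A879D8Au;
    uint32_t konst = rol32(t_const, round % 32);     /* standard schedule T_j <<< j */

    for (unsigned i = 0; i < 2; i++) {
        uint32_t ss1 = rol32(rol32(s[0], 12) + s[4] + konst, 7);
        uint32_t ss2 = ss1 ^ rol32(s[0], 12);
        uint32_t tt1 = ff(s[0], s[1], s[2], round + i) + s[3] + ss2 + (w[i] ^ w[i + 4]);
        uint32_t tt2 = gg(s[4], s[5], s[6], round + i) + s[7] + ss1 + w[i];
        s[3] = s[2]; s[2] = rol32(s[1], 9);  s[1] = s[0]; s[0] = tt1;
        s[7] = s[6]; s[6] = rol32(s[5], 19); s[5] = s[4]; s[4] = p0(tt2);
        konst = rol32(konst, 1);
    }
}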
VSM4KEY4—Perform Four Rounds of SM4 Key Expansion
Description
The VSM4KEY4 instruction performs four rounds of SM4 key expansion. The instruction operates on independent
128-bit lanes.
Additional details can be found at: https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10.
Both SM4 instructions use a common sbox table:
BYTE sbox[256] = {
0xD6, 0x90, 0xE9, 0xFE, 0xCC, 0xE1, 0x3D, 0xB7, 0x16, 0xB6, 0x14, 0xC2, 0x28, 0xFB, 0x2C, 0x05,
0x2B, 0x67, 0x9A, 0x76, 0x2A, 0xBE, 0x04, 0xC3, 0xAA, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
0x9C, 0x42, 0x50, 0xF4, 0x91, 0xEF, 0x98, 0x7A, 0x33, 0x54, 0x0B, 0x43, 0xED, 0xCF, 0xAC, 0x62,
0xE4, 0xB3, 0x1C, 0xA9, 0xC9, 0x08, 0xE8, 0x95, 0x80, 0xDF, 0x94, 0xFA, 0x75, 0x8F, 0x3F, 0xA6,
0x47, 0x07, 0xA7, 0xFC, 0xF3, 0x73, 0x17, 0xBA, 0x83, 0x59, 0x3C, 0x19, 0xE6, 0x85, 0x4F, 0xA8,
0x68, 0x6B, 0x81, 0xB2, 0x71, 0x64, 0xDA, 0x8B, 0xF8, 0xEB, 0x0F, 0x4B, 0x70, 0x56, 0x9D, 0x35,
0x1E, 0x24, 0x0E, 0x5E, 0x63, 0x58, 0xD1, 0xA2, 0x25, 0x22, 0x7C, 0x3B, 0x01, 0x21, 0x78, 0x87,
0xD4, 0x00, 0x46, 0x57, 0x9F, 0xD3, 0x27, 0x52, 0x4C, 0x36, 0x02, 0xE7, 0xA0, 0xC4, 0xC8, 0x9E,
0xEA, 0xBF, 0x8A, 0xD2, 0x40, 0xC7, 0x38, 0xB5, 0xA3, 0xF7, 0xF2, 0xCE, 0xF9, 0x61, 0x15, 0xA1,
0xE0, 0xAE, 0x5D, 0xA4, 0x9B, 0x34, 0x1A, 0x55, 0xAD, 0x93, 0x32, 0x30, 0xF5, 0x8C, 0xB1, 0xE3,
0x1D, 0xF6, 0xE2, 0x2E, 0x82, 0x66, 0xCA, 0x60, 0xC0, 0x29, 0x23, 0xAB, 0x0D, 0x53, 0x4E, 0x6F,
0xD5, 0xDB, 0x37, 0x45, 0xDE, 0xFD, 0x8E, 0x2F, 0x03, 0xFF, 0x6A, 0x72, 0x6D, 0x6C, 0x5B, 0x51,
0x8D, 0x1B, 0xAF, 0x92, 0xBB, 0xDD, 0xBC, 0x7F, 0x11, 0xD9, 0x5C, 0x41, 0x1F, 0x10, 0x5A, 0xD8,
0x0A, 0xC1, 0x31, 0x88, 0xA5, 0xCD, 0x7B, 0xBD, 0x2D, 0x74, 0xD0, 0x12, 0xB8, 0xE5, 0xB4, 0xB0,
0x89, 0x69, 0x97, 0x4A, 0x0C, 0x96, 0x77, 0x7E, 0x65, 0xB9, 0xF1, 0x09, 0xC5, 0x6E, 0xC6, 0x84,
0x18, 0xF0, 0x7D, 0xEC, 0x3A, 0xDC, 0x4D, 0x20, 0x79, 0xEE, 0x5F, 0x3E, 0xD7, 0xCB, 0x39, 0x48
}
Operation
define ROL32(dword, n):
count := n % 32
dest := (dword << count) | (dword >> (32-count))
return dest
define lower_t(dword):
tmp.byte[0] := SBOX_BYTE(dword, 0)
tmp.byte[1] := SBOX_BYTE(dword, 1)
tmp.byte[2] := SBOX_BYTE(dword, 2)
tmp.byte[3] := SBOX_BYTE(dword, 3)
return tmp
define L_KEY(dword):
return dword ^ ROL32(dword, 13) ^ ROL32(dword, 23)
define T_KEY(dword):
return L_KEY(lower_t(dword))
for i in 0..KL-1:
P[0] := SRC1.xmm[i].dword[0]
P[1] := SRC1.xmm[i].dword[1]
P[2] := SRC1.xmm[i].dword[2]
P[3] := SRC1.xmm[i].dword[3]
DEST.xmm[i].dword[0] := C[0]
DEST.xmm[i].dword[1] := C[1]
DEST.xmm[i].dword[2] := C[2]
DEST.xmm[i].dword[3] := C[3]
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
See Exceptions Type 6.
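A scalar C sketch of the key-expansion transformation T_KEY defined above is shown below. It assumes that SBOX_BYTE(x, i) substitutes byte i of dword x through the sbox table, and that the table is available as an array (the name sm4_sbox is illustrative).
#include <stdint.h>

extern const uint8_t sm4_sbox[256];     /* the 256-byte table listed above */

static uint32_t rol32(uint32_t x, unsigned n) { n &= 31; return n ? (x << n) | (x >> (32 - n)) : x; }

/* lower_t: substitute each byte of the dword through the sbox. */
static uint32_t lower_t(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint32_t)sm4_sbox[(x >> (8 * i)) & 0xFF] << (8 * i);
    return r;
}

/* T_KEY: L_KEY(lower_t(x)) = t ^ (t <<< 13) ^ (t <<< 23). */
static uint32_t t_key(uint32_t x)
{
    uint32_t t = lower_t(x);
    return t ^ rol32(t, 13) ^ rol32(t, 23);
}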
VSM4RNDS4—Perform Four Rounds of SM4 Encryption
Description
The VSM4RNDS4 instruction performs four rounds of SM4 encryption. The instruction operates on independent
128-bit lanes.
Additional details can be found at: https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10.
See “VSM4KEY4—Perform Four Rounds of SM4 Key Expansion” for the sbox table.
Operation
// see the VSM4KEY4 instruction for the definition of ROL32, lower_t
define L_RND(dword):
tmp := dword
tmp := tmp ^ ROL32(dword, 2)
tmp := tmp ^ ROL32(dword, 10)
tmp := tmp ^ ROL32(dword, 18)
tmp := tmp ^ ROL32(dword, 24)
return tmp
define T_RND(dword):
return L_RND(lower_t(dword))
for i in 0..KL-1:
P[0] := SRC1.xmm[i].dword[0]
P[1] := SRC1.xmm[i].dword[1]
P[2] := SRC1.xmm[i].dword[2]
P[3] := SRC1.xmm[i].dword[3]
DEST.xmm[i].dword[0] := C[0]
DEST.xmm[i].dword[1] := C[1]
DEST.xmm[i].dword[2] := C[2]
DEST.xmm[i].dword[3] := C[3]
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
See Exceptions Type 6.
Description
This instruction writes a software-provided list of up to 64 MSRs with values loaded from memory.
WRMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address; since MSR addresses are only 32 bits wide, bits 63:32 of each table entry are reserved).
• RDI: Linear address of a table from which MSR data is loaded (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, WRMSRLIST will write the MSR specified at entry [n] in the RSI
table with the value read from memory at the entry [n] in the RDI table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, WRMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the RIP register saved will point to the WRMSRLIST instruction
while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR specific checks and respects the VMX MSR VM-execution controls in the same
manner as WRMSR.
Like WRMSRNS (and unlike WRMSR), WRMSRLIST is not defined as a serializing instruction (see “Serializing
Instructions” in Chapter 9 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This
means that software should not rely on WRMSRLIST to drain all buffered writes to memory before the next instruc-
tion is fetched and executed. For implementation reasons, some processors may serialize when writing certain
MSRs, even though that is not guaranteed.
Like WRMSR and WRMSRNS, WRMSRLIST will ensure that all operations before the WRMSRLIST do not use the new
MSR value and that all operations after the WRMSRLIST do use the new value. An exception to this rule is certain
store-related performance monitor events that only count when those stores are drained to memory. Since
WRMSRLIST is not a serializing instruction, if software is using WRMSRLIST to change the controls for such perfor-
mance monitor events, then stores before the WRMSRLIST may be counted with new MSR values written by
WRMSRLIST. Software can insert the SERIALIZE instruction before the WRMSRLIST if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRLIST.
In places where WRMSR is being used as a proxy for a serializing instruction, a different serializing instruction can
be used (e.g., SERIALIZE).
WRMSRLIST writes MSRs in order, which means the processor will ensure that an MSR in iteration “n” will be
written only after previous iterations (“n-1”). If the older MSR writes had a side effect that affects the behavior of
the next MSR, the processor will ensure that side effect is honored.
The processor is allowed to (but not required to) “load ahead” in the list. Examples:
• Use old memory type or TLB translation for loads from list memory despite an MSR written by a previous
iteration changing MTRRs or invalidating TLBs.
• Cause a page fault or EPT violation for a memory access to an entry > “n” in MSR address or data tables,
despite the processor only having read or written “n” MSRs.1
Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
Load MSR_data from 8 bytes at the linear address RDI + (MSR_index * 8);
IF WRMSR of MSR_data to the MSR with address MSR_address would #GP THEN #GP(0); FI;
Load the MSR with address MSR_address with MSR_data;
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;
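A software model of the loop above can help when reasoning about partial completion. In the C sketch below, wrmsr_model() is a hypothetical stand-in for the MSR write itself, the two tables and the bitmask model the RSI, RDI, and RCX inputs, and the GCC/Clang builtin __builtin_ctzll is assumed.
#include <stdint.h>

void wrmsrlist_model(const uint64_t *addr_table, const uint64_t *data_table,
                     uint64_t *valid_mask,
                     void (*wrmsr_model)(uint32_t msr, uint64_t value))
{
    while (*valid_mask != 0) {
        unsigned n = (unsigned)__builtin_ctzll(*valid_mask);  /* lowest set bit of RCX */
        uint64_t entry = addr_table[n];
        /* Bits 63:32 of an MSR-address table entry are reserved (nonzero values #GP). */
        uint32_t msr = (uint32_t)entry;
        wrmsr_model(msr, data_table[n]);
        *valid_mask &= ~(1ULL << n);    /* the processor clears RCX[n] after each MSR */
        /* Interrupts, exceptions, and traps may be delivered here; RCX then
         * reflects exactly the completed iterations. */
    }
}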
Flags Affected
None.
1. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only
having completed the MSR writes up to entry 5.
Description
WRMSRNS is an instruction that behaves exactly like WRMSR, with the only difference being that it is not a serial-
izing instruction by default.
Writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register.
The contents of the EDX register are copied to the high-order 32 bits of the selected MSR and the contents of the
EAX register are copied to the low-order 32 bits of the MSR. The high-order 32 bits of RAX, RCX, and RDX are
ignored.
This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general protection
exception #GP(0) is generated.
Unlike WRMSR, WRMSRNS is not defined as a serializing instruction (see “Serializing Instructions” in Chapter 9 of
the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This means that software should
not rely on it to drain all buffered writes to memory before the next instruction is fetched and executed. For imple-
mentation reasons, some processors may serialize when writing certain MSRs, even though that is not guaranteed.
Like WRMSR, WRMSRNS will ensure that all operations before it do not use the new MSR value and that all opera-
tions after the WRMSRNS do use the new value. An exception to this rule is certain store related performance
monitor events that only count when those stores are drained to memory. Since WRMSRNS is not a serializing
instruction, if software is using WRMSRNS to change the controls for such performance monitor events, then stores
before the WRMSRNS may be counted with new MSR values written by WRMSRNS. Software can insert the SERI-
ALIZE instruction before the WRMSRNS if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRNS.
In order to improve performance, software may replace WRMSR with WRMSRNS. In places where WRMSR is being
used as a proxy for a serializing instruction, a different serializing instruction can be used (e.g., SERIALIZE).
Operation
MSR[ECX] := EDX:EAX;
Flags Affected
None.
Exceptions
Same exceptions as WRMSR.
#UD If CPUID.(EAX=07H, ECX=01H):EAX.WRMSRNS[bit 19] = 0.
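Software can test for WRMSRNS support using the CPUID bit named in the exception condition above. A sketch for GCC or Clang, which provide <cpuid.h> and __get_cpuid_count():
#include <cpuid.h>
#include <stdbool.h>

/* WRMSRNS is enumerated by CPUID.(EAX=07H, ECX=01H):EAX[bit 19]. */
static bool cpu_has_wrmsrns(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx))
        return false;                 /* leaf 7 sub-leaf 1 not available */
    return (eax >> 19) & 1;
}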
CHAPTER 3
INTEL® AMX INSTRUCTION SET REFERENCE, A-Z
NOTES
The following Intel® AMX instructions have moved to the Intel® 64 and IA-32 Architectures
Software Developer’s Manual: LDTILECFG, STTILECFG, TDPBF16PS,
TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD, TILELOADD/TILELOADDT1, TILERELEASE,
TILESTORED, and TILEZERO.
The Intel Advanced Matrix Extensions introductory material and helper functions will be maintained
here, as well as in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, for the
reader’s convenience. For information on Intel AMX and the XSAVE feature set, and recommenda-
tions for system software, see the latest version of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual.
3.1 INTRODUCTION
Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two compo-
nents: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image,
and an accelerator able to operate on tiles; the first implementation is called TMUL (tile matrix multiply unit).
An Intel AMX implementation enumerates to the programmer how the tiles can be programmed by providing a
palette of options. Two palettes are supported; palette 0 represents the initialized state, and palette 1 consists of
8 KB of storage spread across 8 tile registers named TMM0..TMM7. Each tile has a maximum size of 16 rows x 64
bytes, (1 KB), however the programmer can configure each tile to smaller dimensions appropriate to their algo-
rithm. The tile dimensions supplied by the programmer (rows and bytes_per_row, i.e., colsb) are metadata that
drives the execution of tile and accelerator instructions. In this way, a single instruction can launch autonomous
multi-cycle execution in the tile and accelerator hardware. The palette value (palette_id) and metadata are held
internally in a tile related control register (TILECFG). The TILECFG contents will be commensurate with that
reported in the palette_table (see “CPUID—CPU Identification” in Chapter 1 for a description of the available
parameters).
Intel AMX is an extensible architecture. New accelerators can be added, or the TMUL accelerator may be enhanced
to provide higher performance. In these cases, the state (TILEDATA) provided by tiles may need to be made larger,
either in one of the metadata dimensions (more rows or colsb) and/or by supporting more tile registers (names).
The extensibility is carried out by adding new palette entries describing the additional state. Since execution is
driven through metadata, an existing Intel AMX binary could take advantage of larger storage sizes and higher
performance TMUL units by selecting the most powerful palette indicated by CPUID and adjusting loop and pointer
updates accordingly.
Figure 3-1 shows a conceptual diagram of the Intel AMX architecture. An Intel architecture host drives the algo-
rithm, the memory blocking, loop indices and pointer arithmetic. Tile loads and stores and accelerator commands
are sent to multi-cycle execution units. Status, if required, is reported back. Intel AMX instructions are synchro-
nous in the Intel architecture instruction stream and the memory loaded and stored by the tile instructions is
coherent with respect to the host’s memory accesses. There are no restrictions on interleaving of Intel architecture
and Intel AMX code or restrictions on the resources the host can use in parallel with Intel AMX (e.g., Intel AVX-
512). There is also no architectural requirement on the Intel architecture compute capability of the Intel architec-
ture host other than it supports 64-bit mode.
[Figure 3-1: conceptual diagram of the Intel AMX architecture showing TILECFG, tile registers tmm0 ... tmm[n-1], accelerators, and a coherent memory interface.]
Intel AMX instructions use new registers and inherit basic behavior from Intel architecture in the same manner that
Intel SSE and Intel AVX did. Tile instructions include loads and stores using the traditional Intel architecture
register set as pointers. The TMUL instruction set (defined to be CPUID bits AMX-BF16 and AMX-INT8) only
supports reg-reg operations.
TILECFG is programmed using the LDTILECFG instruction. The selected palette defines the available storage and
general configuration while the rest of the memory data specifies the number of rows and column bytes for each
tile. Consistency checks are performed to ensure the TILECFG matches the restrictions of the palette. A General
Protection fault (#GP) is reported if the LDTILECFG fails consistency checks. A successful load of
TILECFG with a palette_id other than 0 is represented in this document with TILES_CONFIGURED = 1. When the
TILECFG is initialized (palette_id = 0), it is represented in the document as TILES_CONFIGURED = 0. Nearly all
Intel AMX instructions will generate a #UD exception if TILES_CONFIGURED is not equal to 1; the exceptions are
those that do TILECFG maintenance: LDTILECFG, STTILECFG and TILERELEASE.
If a tile is configured to contain M rows by N column bytes, LDTILECFG will ensure that the metadata values are
appropriate to the palette (e.g., that M ≤ 16 and N ≤ 64 for palette 1). The four M and N values can all be different
as long as they adhere to the restrictions of the palette. Further dynamic checks are done in the tile and the TMUL
instruction set to deal with cases where a legally configured tile may be inappropriate for the instruction operation.
Tile registers can be set to ‘invalid’ by configuring the rows and colsb to ‘0’.
Tile loads and stores are strided accesses from the application memory to packed rows of data. Algorithms are
expressed assuming row major data layout. Column major users should translate the terms according to their
orientation.
TILELOAD* and TILESTORE* instructions are restartable and can handle (up to) 2*rows page faults per instruction.
Restartability is provided by a start_row parameter in the TILECFG register.
The TMUL unit is conceptually a grid of fused multiply-add units able to read and write tiles. The dimensions of the
TMUL unit (tmul_maxk and tmul_maxn) are enumerated similar to the maximum dimensions of the tiles (see
“CPUID—CPU Identification” in Chapter 1 for details).
The matrix multiplications in the TMUL instruction set compute C[M][N] += A[M][K] * B[K][N]. The M, N, and K
values will cause the TMUL instruction set to generate a #UD exception if the dimensions do not match for matrix
multiply or do not match the palette.
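For reference, the computation performed by the TMUL dot-product instructions corresponds to the following plain C loop nest. Element types, product widths, and any saturation or rounding depend on the specific instruction, so plain int is used here purely for illustration.
/* Reference computation: C[M][N] += A[M][K] * B[K][N], row-major layout. */
void tmul_reference(int M, int N, int K, int *C, const int *A, const int *B)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            for (int k = 0; k < K; k++)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
}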
In Figure 3-2, the number of rows in tile B matches the K dimension in the matrix multiplication pseudocode. K
dimensions smaller than that enumerated in the TMUL grid are also possible and any additional computation the
TMUL unit can support will not affect the result.
The number of elements specified by colsb of the B matrix is also less than or equal to tmul_maxn. Any remaining
values beyond that specified by the metadata will be set to zero.
[Figure 3-2: tile matrix multiplication accumulating A[M][K] times B[K][N] into C[M][N], with each row B[k][:N] contributing to every row of C.]
The XSAVE feature sets supports context management of the new state defined for Intel AMX. This support is
described in Section 3.2.
To facilitate handling of tile configuration data, there is a STTILECFG instruction. If the tile configuration is in the
INIT state (TILES_CONFIGURED == 0), then STTILECFG will write 64 bytes of zeros. Otherwise STTILECFG will
store the TILECFG to memory in the format used by LDTILECFG.
LDTILECFG [rax]
// assume some outer loops driving the cache tiling (not shown)
{
TILELOADD tmm0, [rsi+rdi] // srcdst, RSI points to C, RDI is strided value
TILELOADD tmm1, [rsi+rdi+N] // second tile of C, unrolling in SIMD dimension N
MOV r14, 0
LOOP:
TILELOADD tmm2, [r8+r9] // src2 is strided load of A, reused for 2 TMUL instr.
TILELOADD tmm3, [r10+r11] // src1 is strided load of B
TDPBUSD tmm0, tmm2, tmm3 // update left tile of C
TILELOADD tmm3, [r10+r11+N] // src1 loaded with B from next rightmost tile
TDPBUSD tmm1, tmm2, tmm3 // update right tile of C
ADD r8, K // update pointers by constants known outside of loop
ADD r10, K*r11
ADD r14, K
CMP r14, LIMIT
JNE LOOP
define palette_table[id]:
uint16_t total_tile_bytes
uint16_t bytes_per_tile
uint16_t bytes_per_row
uint16_t max_names
uint16_t max_rows
define zero_tilecfg_start():
tilecfg.start_row := 0
define zero_all_tile_data():
if XCR0[TILEDATA]:
b := CPUID(0xD,TILEDATA).EAX // size of feature
for j in 0 ... b:
TILEDATA.byte[j] := 0
define xcr0_supports_palette(palette_id):
if palette_id == 0:
return 1
elif palette_id == 1:
if XCR0[TILECFG] and XCR0[TILEDATA]:
return 1
return 0
3.5 NOTATION
Instructions described in this chapter follow the general documentation convention established in Intel® 64 and IA-
32 Architectures Software Developer’s Manual Volume 2A. Additionally, Intel® Advanced Matrix Extensions use
notation conventions as described below.
In the instruction encoding boxes, sibmem is used to denote an encoding where a MODRM byte and SIB byte are
used to indicate a memory operation where the base and displacement are used to point to memory, and the index
register (if present) is used to denote a stride between memory rows. The index register is scaled by the sib.scale
field as usual. The base register is added to the displacement, if present.
In the instruction encoding, the MODRM byte is represented several ways depending on the role it plays. The
MODRM byte has 3 fields: 2-bit MODRM.MOD field, a 3-bit MODRM.REG field and a 3-bit MODRM.RM field. When all
bits of the MODRM byte have fixed values for an instruction, the 2-hex nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the MODRM byte
must contain fixed values, those values are specified as follows:
• If only the MODRM.MOD must be 0b11, and MODRM.REG and MODRM.RM fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr correspond to the 3 bits of the MODRM.REG field and the bbb correspond to
the 3 bits of the MODRM.RM field.
• If the MODRM.MOD field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then we use the notation !(11).
• If the MODRM.REG field had a specific required value, e.g., 0b101, that would be denoted as mm:101:bbb.
NOTE
Historically the Intel® 64 and IA-32 Architectures Software Developer’s Manual only specified the
MODRM.REG field restrictions with the notation /0 ... /7 and did not specify restrictions on the
MODRM.MOD and MODRM.RM fields in the encoding boxes.
VEX.128.66.0F38.W0 6C 11:rrr:bbb TCMMIMFP16PS tmm1, tmm2, tmm3
Op/En: A; 64/32 bit Mode Support: V/N.E.; CPUID Feature Flag: AMX-COMPLEX
Matrix multiply complex elements from tmm2 and tmm3, and accumulate the imaginary part into single precision elements in tmm1.
VEX.128.NP.0F38.W0 6C 11:rrr:bbb TCMMRLFP16PS tmm1, tmm2, tmm3
Op/En: A; 64/32 bit Mode Support: V/N.E.; CPUID Feature Flag: AMX-COMPLEX
Matrix multiply complex elements from tmm2 and tmm3, and accumulate the real part into single precision elements in tmm1.
Description
These instructions perform matrix multiplication of two tiles containing complex elements and accumulate the
results into a packed single precision tile. Each dword element in input tiles tmm2 and tmm3 is interpreted as a
complex number with FP16 real part and FP16 imaginary part.
TCMMRLFP16PS calculates the real part of the result. For each possible combination of (row of tmm2, column of
tmm3), the instruction performs a set of multiplication and accumulations on all corresponding complex numbers
(one from tmm2 and one from tmm3). The real part of the tmm2 element is multiplied with the real part of the
corresponding tmm3 element, and the negated imaginary part of the tmm2 element is multiplied with the imagi-
nary part of the corresponding tmm3 elements. The two accumulated results are added, and then accumulated
into the corresponding row and column of tmm1.
TCMMIMFP16PS calculates the imaginary part of the result. For each possible combination of (row of tmm2, column
of tmm3), the instruction performs a set of multiplication and accumulations on all corresponding complex
numbers (one from tmm2 and one from tmm3). The imaginary part of the tmm2 element is multiplied with the real
part of the corresponding tmm3 element, and the real part of the tmm2 element is multiplied with the imaginary
part of the corresponding tmm3 elements. The two accumulated results are added, and then accumulated into the
corresponding row and column of tmm1.
“Round to nearest even” rounding mode is used when doing each accumulation of the FMA. Output denormals are
always flushed to zero but FP16 input denormals are not treated as zero.
MXCSR is not consulted nor updated.
Any attempt to execute these instructions inside an Intel TSX transaction will result in a transaction abort.
Operation
TCMMIMFP16PS tsrcdest, tsrc1, tsrc2
// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tileconfig_start()
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tileconfig_start()
Flags Affected
None.
Exceptions
AMX-E4; see Section 3.6, “Exception Classes” for details.
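The per-element accumulation described above corresponds to the following scalar C sketch. The actual instructions take FP16 inputs and accumulate FP32 tile elements with round-to-nearest-even at each step, which this float model only approximates; the helper name is illustrative.
/* a_re/a_im come from a tmm2 element, b_re/b_im from the corresponding tmm3 element. */
static void tcmm_accumulate(float *acc_re, float *acc_im,
                            float a_re, float a_im, float b_re, float b_im)
{
    *acc_re += a_re * b_re + (-a_im) * b_im;   /* TCMMRLFP16PS contribution (real)      */
    *acc_im += a_im * b_re + a_re * b_im;      /* TCMMIMFP16PS contribution (imaginary) */
}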
TDPFP16PS—Dot Product of FP16 Tiles Accumulated into Packed Single Precision Tile
VEX.128.F2.0F38.W0 5C 11:rrr:bbb TDPFP16PS tmm1, tmm2, tmm3
Op/En: A; 64/32 bit Mode Support: V/N.E.; CPUID Feature Flag: AMX-FP16
Matrix multiply FP16 elements from tmm2 and tmm3, and accumulate the packed single precision elements in tmm1.
Description
This instruction performs a set of SIMD dot-products of two FP16 elements and accumulates the results into a
packed single precision tile. Each dword element in input tiles tmm2 and tmm3 is interpreted as a FP16 pair. For
each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD dot-products
on all corresponding FP16 pairs (one pair from tmm2 and one pair from tmm3), adds the results of those dot-prod-
ucts, and then accumulates the result into the corresponding row and column of tmm1.
“Round to nearest even” rounding mode is used when doing each accumulation of the Fused Multiply-Add (FMA).
Output FP32 denormals are always flushed to zero. Input FP16 denormals are always handled and not treated as
zero.
MXCSR is not consulted nor updated.
Any attempt to execute the TDPFP16PS instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
TDPFP16PS tsrcdest, tsrc1, tsrc2
// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)
Flags Affected
None.
Exceptions
AMX-E4; see Section 3.6, “Exception Classes” for details.
CHAPTER 4
UC-LOCK DISABLE
NOTE
No processor will both set IA32_CORE_CAPABILITIES[4] and enumerate
CPUID.(EAX=07H, ECX=2):EDX[bit 6] as 1.
If a processor enumerates support for UC-lock disable (in either way), software can enable UC-lock disable by
setting MSR_MEMORY_CTRL[28]. When this bit is set, a locked access using a memory type other than WB causes
a fault. The locked access does not occur. The specific fault that occurs depends on how UC-lock disable is enumer-
ated:
• If IA32_CORE_CAPABILITIES[4] is read as 1, the UC lock results in a general-protection exception (#GP) with
a zero error code.
• If CPUID.(EAX=07H, ECX=2):EDX[bit 6] is enumerated as 1, the UC lock results in an #AC with an error code
with value 4.
1. The term “UC lock” is used because the most common situation regards accesses to UC memory. Despite the name, locked accesses
to WC, WP, and WT memory also cause bus locks.
2. Other alignment-check exceptions occur only if CR0.AM = 1, EFLAGS.AC = 1, and CPL = 3. The alignment-check exceptions resulting
from split-lock disable may occur even if CR0.AM = 0, EFLAGS.AC = 0, or CPL < 3.
CHAPTER 5
INTEL® RESOURCE DIRECTOR TECHNOLOGY FEATURE UPDATES
Intel® Resource Director Technology (Intel® RDT) provides several monitoring and control capabilities for shared
resources in multiprocessor systems. This chapter covers updates to the Cache Bandwidth Allocation feature of
Intel RDT.
Previous versions of this document contained additional information on Intel RDT. This information can now be
found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, as well as in a new docu-
ment titled “Intel® Resource Director Technology Architecture Specification,” available here:
https://cdrdv2.intel.com/v1/dl/getContent/789566.
• CPUID.(EAX=10H, ECX=ResID=5):EAX
— EAX[7:0] reports the maximum CBA throttling value supported.
— EAX[11:8] reports the scope of CBA IA32_QoS_Core_BW_Thrtl_n MSRs. If EAX[11:8]=1, this indicates the
logical processor scope of the MSRs.
— EAX[31:12] is reserved.
• CPUID.(EAX=10H, ECX=ResID=5):EBX
— EBX[31:0] is reserved.
• CPUID.(EAX=10H, ECX=ResID=5):ECX
— ECX[3] reports whether the response of the bandwidth control is approximately linear. If ECX[3] is 1, the
response of the bandwidth control is approximately linear. If ECX[3] is 0, the response of the bandwidth
control is non-linear.
— ECX[2:0] and ECX[31:4] are reserved.
• CPUID.(EAX=10H, ECX=ResID=5):EDX
— EDX[15:0] reports the number of Classes of Service supported for the feature. Add one to the return value
to get the result. For instance, a reported value of 15 implies a maximum of 16 supported CBA CLOS.
— EDX[31:16] is reserved.
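The enumeration above can be decoded as in the following sketch for GCC or Clang, which provide <cpuid.h> and __get_cpuid_count().
#include <cpuid.h>
#include <stdio.h>

static void print_cba_enumeration(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x10, 0x05, &eax, &ebx, &ecx, &edx))
        return;                                  /* leaf 10H sub-leaf 5 not available */
    unsigned max_throttle = eax & 0xFF;          /* EAX[7:0]  */
    unsigned msr_scope    = (eax >> 8) & 0xF;    /* EAX[11:8]; 1 = logical processor scope */
    unsigned linear       = (ecx >> 3) & 1;      /* ECX[3]    */
    unsigned num_clos     = (edx & 0xFFFF) + 1;  /* EDX[15:0], plus one */
    printf("CBA: max throttling value %u, MSR scope %u, %s response, %u CLOS\n",
           max_throttle, msr_scope,
           linear ? "approximately linear" : "non-linear", num_clos);
}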
[Figure: CPUID.(EAX=10H, ECX=5) register layout. EAX[7:0] = CBA_MAX_Throttle_Level, EAX[11:8] = scope of CBA MSRs, remaining EAX bits reserved; EBX reserved; ECX[3] = CBA_Lin_Rsp, remaining ECX bits reserved; EDX[15:0] = number of CBA CLOS.]
[Figure: IA32_QoS_Core_BW_Thrtl_n MSRs (64-bit), at base MSR address E00H for IA32_QoS_Core_BW_Thrtl_0, E01H for IA32_QoS_Core_BW_Thrtl_1, and so on.]
Note that the throttling values provided to the software are calibrated through specific traffic patterns; however, as
workload characteristics may vary, the response precision and linearity of the bandwidth threshold values will vary
across products and should be treated as approximate values only.
CHAPTER 6
LINEAR ADDRESS MASKING (LAM)
This chapter describes a new feature called linear-address masking (LAM). LAM modifies the checking that is
applied to 64-bit linear addresses, allowing software to make use of the untranslated address bits for metadata.
In 64-bit mode, linear addresses have 64 bits and are translated either with 4-level paging, which translates the low
48 bits of each linear address, or with 5-level paging, which translates 57 bits. The upper linear-address bits are
reserved through the concept of canonicality. A linear address is 48-bit canonical if bits 63:47 of the address are
identical; it is 57-bit canonical if bits 63:56 are identical. (Clearly, any linear address that is 48-bit canonical is also
57-bit canonical.) When 4-level paging is active, the processor requires all linear addresses used to access memory
to be 48-bit canonical; similarly, 5-level paging ensures that all linear addresses are 57-bit canonical.
Software usages that associate metadata with a pointer might benefit from being able to place metadata in the
upper (untranslated) bits of the pointer itself. However, the canonicality enforcement mentioned earlier implies
that software would have to mask the metadata bits in a pointer (making it canonical) before using it as a linear
address to access memory. LAM allows software to use pointers with metadata without having to mask the meta-
data bits. With LAM enabled, the processor masks the metadata bits in a pointer before using it as a linear address
to access memory.
LAM is supported only in 64-bit mode and applies only to addresses used for data accesses. LAM does not apply to
addresses used for instruction fetches or to those being loaded into the RIP register (e.g., as targets of jump and
call instructions).
6.2 TREATMENT OF DATA ACCESSES WITH LAM ACTIVE FOR USER POINTERS
Recall that, without LAM, canonicality checks are defined so that 4-level paging requires bits 63:47 of each pointer
to be identical, while 5-level paging requires bits 63:56 to be identical. LAM allows some of these bits to be used as
metadata by modifying canonicality checking.
When LAM48 is enabled for user pointers (see Section 6.1), the processor allows bits 62:48 of a user pointer to be
used as metadata. Regardless of the paging mode, the processor performs a modified canonicality check that
enforces that bit 47 of the pointer matches bit 63. As illustrated in Figure 6-1, bits 62:48 are not checked and are
thus available for software metadata. After this modified canonicality check is performed, bits 62:48 are masked
by sign-extending the value of bit 47 (0), and the resulting (48-bit canonical) address is then passed on for trans-
lation by paging.
(Note also that, without LAM, canonicality checking with 5-level paging does not apply to bit 47 of a user pointer;
when LAM48 is enabled for user pointers, bit 47 of a user pointer must be 0. Note also that linear-address
bits 56:47 are translated by 5-level paging. When LAM48 is enabled for user pointers, these bits are always 0 in
any linear address derived from a user pointer: bits 56:48 of the pointer contained metadata, while bit 47 is
required to be 0.)
Figure 6-1. Canonicality Check When LAM48 is Enabled for User Pointers
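As an illustration of the software model described above, the following C sketch tags and untags a user pointer under LAM48. The helper names are illustrative and a 64-bit uintptr_t is assumed; the untag helper only mirrors the masking that the processor performs, since with LAM enabled a tagged pointer can be dereferenced directly.
#include <stdint.h>

/* Place a 15-bit tag in bits 62:48 of a user pointer (bit 47 is 0, bit 63 is 0). */
static inline void *lam48_tag(void *p, uint16_t meta15)
{
    uintptr_t v = (uintptr_t)p & ~(((uintptr_t)0x7FFF) << 48);
    return (void *)(v | ((uintptr_t)(meta15 & 0x7FFF) << 48));
}

/* The masking the processor applies before translation: clear bits 62:48
 * (sign-extension of bit 47, which is 0 for user pointers). */
static inline void *lam48_untag(void *p)
{
    return (void *)((uintptr_t)p & ~(((uintptr_t)0x7FFF) << 48));
}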
When LAM57 is enabled for user pointers, the processor allows bits 62:57 of a user pointer to be used as metadata.
With 5-level paging, the processor performs a modified canonicality check that enforces only that bit 56 of the
pointer matches bit 63. As illustrated in Figure 6-2, bits 62:57 are not checked and are thus available for software
metadata. After this modified canonicality check is performed, bits 62:57 are masked by sign-extending the value
of bit 56 (0), and the resulting (57-bit canonical) address is then passed on for translation by 5-level paging.
Figure 6-2. Canonicality Check When LAM57 is Enabled for User Pointers with 5-Level Paging
When LAM57 is enabled for user pointers with 4-level paging, the processor performs a modified canonicality check
that enforces only that bits 56:47 of a user pointer match bit 63. As illustrated in Figure 6-3, bits 62:57 are not
checked and are thus available for software metadata. After this modified canonicality check is performed, bits
62:57 are masked by sign-extending the value of bit 56 (0), and the resulting (48-bit canonical) address is then
passed on for translation by 4-level paging.
Figure 6-3. Canonicality Check When LAM57 is Enabled for User Pointers with 4-Level Paging
Figure 6-4. Canonicality Check When LAM57 is Enabled for Supervisor Pointers with 5-Level Paging
When LAM48 is enabled for supervisor pointers (4-level paging), the processor performs a modified canonicality
check that enforces only that bit 47 of a supervisor pointer matches bit 63. As illustrated in Figure 6-5, bits 62:48
are not checked and are thus available for software metadata. After this modified canonicality check is performed,
bits 62:48 are masked by sign-extending the value of bit 47 (1), and the resulting (48-bit canonical) address is
then passed on for translation by 4-level paging.
Figure 6-5. Canonicality Check When LAM48 is Enabled for Supervisor Pointers with 4-Level Paging
• ATTRIBUTES.LAM_U48 (bit 9) - Activate LAM for user data pointers and use of bits 62:48 as masked metadata
in enclave mode. This bit can be set if CPUID.(EAX=12H, ECX=01H):EAX[9] is 1.
• ATTRIBUTES.LAM_U57 (bit 8) - Activate LAM for user data pointers and use of bits 62:57 as masked metadata
in enclave mode. This bit can be set if CPUID.(EAX=12H, ECX=01H):EAX[8] is 1.
ECREATE causes #GP(0) if the ATTRIBUTES.LAM_U48 bit is 1 and CPUID.(EAX=12H, ECX=01H):EAX[9] is 0, or if
the ATTRIBUTES.LAM_U57 bit is 1 and CPUID.(EAX=12H, ECX=01H):EAX[8] is 0.
During enclave execution, accesses using linear addresses are treated as if CR3.LAM_U48 =
SECS.ATTRIBUTES.LAM_U48, CR3.LAM_U57 = SECS.ATTRIBUTES.LAM_U57, and CR4.LAM_SUP = 0. The actual
value of CR3 is not changed. This implies that, during enclave execution, if SECS.ATTRIBUTES.LAM_U57 = 1,
LAM57 is enabled for user pointers during enclave execution and, if SECS.ATTRIBUTES.LAM_U57 = 0 and
SECS.ATTRIBUTES.LAM_U48 = 1, then LAM48 is enabled for user pointers. If SECS.ATTRIBUTES.LAM_U57 =
SECS.ATTRIBUTES.LAM_U48 = 0, LAM is not enabled for user pointers.
When in enclave mode, supervisor data pointers are not subject to any masking.
The following ENCLU leaf functions check for linear addresses to be within the ELRANGE. When LAM is active, this
check is performed on the linear addresses that result from masking metadata bits in user pointers used by the leaf
functions.
• EACCEPT
• EACCEPTCOPY
• EGETKEY
• EMODPE
• EREPORT
The following linear address fields in the Intel SGX data structures hold linear addresses that are either loaded into
the EPCM or are written out from the EPCM and do not contain any metadata.
• SECS.BASEADDR
• PAGEINFO.LINADDR
CHAPTER 7
CODE PREFETCH INSTRUCTION UPDATES
Description
Fetches the line of data or code (instructions’ bytes) from memory that contains the byte specified with the source
operand to a location in the cache hierarchy specified by a locality hint:
• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.
• T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.
• T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or
an implementation-specific choice.
• NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and
into a location close to the processor, minimizing cache pollution.
• IT0 (temporal code)—prefetch code into all levels of the cache hierarchy.
• IT1 (temporal code with respect to first level cache misses)—prefetch code into all but the first-level of the
cache hierarchy.
The source operand is a byte memory location. (The locality hints are encoded into the machine level instruction
using bits 3 through 5 of the ModR/M byte.) Some locality hints may prefetch only for RIP-relative memory
addresses; see additional details below. The address to prefetch is NextRIP + 32-bit displacement, where NextRIP
is the first byte of the instruction that follows the prefetch instruction itself.
If the line selected is already present in the cache hierarchy at a level closer to the processor, no data movement
occurs. Prefetches from uncacheable or WC memory are ignored.
The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction
moves data closer to the processor in anticipation of future use.
The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a
processor implementation. The amount of data or code lines prefetched is also processor implementation-depen-
dent. It will, however, be a minimum of 32 bytes. Additional details of the implementation-dependent locality hints
are described in Section 7.4 of Intel® 64 and IA-32 Architectures Optimization Reference Manual.
It should be noted that processors are free to speculatively fetch and cache data from system memory regions that
are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A
PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur
at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the
fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also
unordered with respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHh instructions, or any other
general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR.
PREFETCHIT0/1 apply when in 64-bit mode with RIP-relative addressing; they stay NOPs otherwise. For optimal
performance, the addresses used with these instructions should be the starting byte of a real instruction.
PREFETCHIT0/1 instructions are enumerated by CPUID.(EAX=07H, ECX=01H):EDX.PREFETCHI[bit 14]. The
encodings stay NOPs in processors that do not enumerate these instructions.
Operation
FETCH (m8);
Numeric Exceptions
None.
CHAPTER 8
NEXT GENERATION PERFORMANCE MONITORING UNIT (PMU)
The Performance Monitoring Unit (PMU) for Intel® Core™ Ultra processors offers additional enhancements beyond
what is available in both the 12th generation Intel® Core™ processor based on Alder Lake performance hybrid
architecture and the 13th generation Intel® Core™ processor:
• Timed PEBS
• New True-View Enumeration Architecture
— General-Purpose Counters
— Fixed-Function Counters
— Architectural Performance Monitoring Events
— Non-Architectural Capabilities
• Architectural Performance Monitoring Events
— Topdown Microarchitecture Analysis (TMA) Level 1
The next-generation Performance Monitoring Unit (PMU) offers additional enhancements beyond what is available
in the Intel® Core™ Ultra processor:
• New True-View Enumeration Architecture
— TMA Slots Per Cycle
• Architectural Performance Monitoring Events
— LBR Inserts
• Counters Snapshotting and PEBS Format 6
• Performance Monitoring MSR Enhancements
— MSR Aliasing
— UnitMask2
— EQ-bit
• RDPMC Metrics Clear Mode
• Auto Counter Reload
NOTE
CPUID leaf 0AH continues to report useful attributes, such as architectural performance monitoring
version ID and counter width (# bits).
NOTE
Locating a PMU feature under CPUID leaf 023H alerts software that the features may not be
supported uniformly across all logical processors.
The behavior of the fixed-function performance counters supported by next generation PMU is expected to be
consistent on all processors that support those counters, as defined in Table 8-3.
Table 8-3. Association of Fixed-Function Performance Counters with Architectural Performance Events
Fixed-Function Address Event Mask Mnemonic Description
Performance Counter
IA32_FIXED_CTR0 309H INST_RETIRED.ANY This event counts the number of instructions that retire
execution. For instructions that consist of multiple uops,
this event counts the retirement of the last uop of the
instruction. The counter continues counting during
hardware interrupts, traps, and inside interrupt handlers.
IA32_FIXED_CTR1 30AH CPU_CLK_UNHALTED.THREAD The CPU_CLK_UNHALTED.THREAD event counts the
CPU_CLK_UNHALTED.CORE number of core cycles while the logical processor is not in a
halt state.
If there is only one logical processor in a processor core,
CPU_CLK_UNHALTED.CORE counts the unhalted cycles of
the processor core.
The core frequency may change from time to time due to
transitions associated with Enhanced Intel SpeedStep
Technology or TM2. For this reason this event may have a
changing ratio with regards to time.
IA32_FIXED_CTR2 30BH CPU_CLK_UNHALTED.REF_TSC This event counts the number of reference cycles at the
TSC rate when the core is not in a halt state and not in a TM
stop-clock state. The core enters the halt state when it is
running the HLT instruction or the MWAIT instruction. This
event is not affected by core frequency changes (e.g., P
states) but counts at the same frequency as the time stamp
counter. This event can approximate elapsed time while the
core was not in a halt state and not in a TM stopclock state.
IA32_FIXED_CTR3 30CH TOPDOWN.SLOTS This event counts the number of available slots for an
unhalted logical processor. The event increments by
machine-width of the narrowest pipeline as employed by
the Top-down Microarchitecture Analysis method. The
count is distributed among unhalted logical processors
(hyper-threads) who share the same physical core.
Software can use this event as the denominator for the
top-level metrics of the Top-down Microarchitecture
Analysis method.
IA32_FIXED_CTR4 (see Note 1) 30DH TOPDOWN_BAD_SPECULATION This event counts Topdown slots that were not consumed
by the backend due to a pipeline flush, such as a
mispredicted branch or a machine clear. It provides a value
equivalent to a general-purpose counter configured with
UMask=00H and EventSelect=73H.
IA32_FIXED_CTR5 (see Note 1) 30EH TOPDOWN_FE_BOUND This event counts Topdown slots where uops were not
provided to the backend due to frontend limitations, such as
instruction cache/TLB miss delays or decoder limitations. It
provides a value equivalent to a general purpose counter
configured with UMask=01H and EventSelect=9CH.
IA32_FIXED_CTR6 (see Note 1) 30FH TOPDOWN_RETIRING This event counts Topdown slots that were committed
(retired) by the backend. It provides a value equivalent to a
general purpose counter configured with UMask=02H and
EventSelect=C2H.
NOTES:
1. If this counter is supported, it will be accessible in the following MSRs: IA32_PERF_GLOBAL_STATUS (38EH),
IA32_PERF_GLOBAL_CTRL (38FH), IA32_PERF_GLOBAL_STATUS_RESET (390H), and
IA32_PERF_GLOBAL_STATUS_SET (391H).
1. Refer to Chapter 19, “Last Branch Records,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. The next generation PMU incorporates PEBS_FMT=5h as described in Section 20.6.2.4.2 of the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3B.
The Retire Latency field reports the number of Unhalted Core Cycles between the retirement of the current instruc-
tion (as indicated by the Instruction Pointer field of the PEBS record) and the retirement of the prior instruction. All
ones are reported when the number exceeds 16 bits.
Processors that support this enhancement set a new bit: IA32_PERF_CAPABILITIES.PEBS_TIMING_INFO[bit 17].
NOTE
Timed PEBS is not supported when PEBS is programmed on fixed-function counter 0. The Retire
Latency field of such record is undefined.
[Figure: PEBS record groups (Memory Info, LBR Entries, Counters, Metrics, XMMs, GPRs, LBRs) and the MSR_PEBS_DATA_CFG register (address 3F2H), whose upper half holds the Include_PMCx and Include_Fixed_CTRx bit fields.]
Counters Group Header:
— FIXED_CTR BitVector [31:0]: Bit vector of IA32_FIXED_CTRx MSRs. IA32_FIXED_CTRx is recorded if bit x is set.
— Metrics BitVector [31:0]: Bit vector of the performance metrics counters.
— Reserved [31:0]: Reserved.
Counters/Metrics Values:
— PMCx [63:0]: PMCx will be captured if PMC BitVector x is set.
— ...
— FIXED_CTRx [63:0]: FIXED_CTRx will be captured if FIXED_CTRx BitVector x is set.
— ...
— Metrics Base [63:0]: The performance metrics base, mapped to IA32_FIXED_CTR3, if Metrics BitVector bit 0 is set.
— Metrics Data [63:0]: MSR_PERF_METRICS, if Metrics BitVector bit 1 is set.
IA32_PMCx will be captured if both Counters and MSR_PEBS_DATA_CFG bit 32 + x are set. In this case, the PMC
BitVector field bit x will be set too.
IA32_FIXED_CTRx will be captured if both Counters and MSR_PEBS_DATA_CFG bit 48 + x are set. In this case, the
FIXED_CTR BitVector field bit x will be set too.
The performance metrics will be recorded if both Metrics and MSR_PEBS_DATA_CFG bit 51 (the bit used for
IA32_FIXED_CTR3) are set. The Metrics record will have two 64-bit fields, MSR_PERF_METRICS and the
PERF_METRICS_BASE that is derived from IA32_FIXED_CTR3. In this case, the Metrics BitVector will be 3. Note
that MSR_PERF_METRICS and the IA32_FIXED_CTR3 MSR will be cleared after they are recorded.
Size of the group can be calculated in bytes by: 16 + popcount(BitVectors[127:0]) * 8.
Table 8-7. Data Source Encoding for Memory Accesses in Next Generation PMU
Encoding Data Source
0H Unknown source.
01H or 02H L1 Hit.
03H FB Merge (L1 mishandling buffer).
05H L2 Hit.
06H XQ Merge (L2 mishandling buffer).
08H L3 Hit.
0CH L3 Hit; x-core forward.
0DH L3 Hit; x-core modified.
0FH L3 Miss; x-core modified.
10H L3 Miss; MSC hit.
11H L3 Miss; memory.
1. This feature is also available in a subset of processors with a CPUID signature value of DisplayFamily_DisplayModel 06_C5H or
06_C6H (though they report IA32_PERF_CAPABILITIES.PEBS_FMT as 5).
An IA32_PMC_GPn_CTR MSR can be used to access the counter value for a GP (general-purpose) counter ‘n.’ On
processors that support CPUID leaf 23H, a GP counter ‘n’ that is enumerated in both CPUID leaf 23H and leaf 0AH
can be accessed through either IA32_PMC_GPn_CTR or the legacy MSR addresses (IA32_PMCn, IA32_A_PMCn). In
contrast, a counter ‘n’ that is only enumerated in CPUID leaf 23H can only be accessed through
IA32_PMC_GPn_CTR. This guideline also applies to the other MSR aliases described in this section (i.e.,
IA32_PMC_GPn_CFG_A and IA32_PERFEVTSELn, IA32_PMC_FXm_CTR and IA32_FIXED_CTRm). The
IA32_PMC_GPn_CTR MSR address1 for counter ‘n’ is 1900H + 4 * n, and this MSR has full-width write support.
The IA32_PMC_GPn_CFG_A MSR can be used to access the performance event select register for a GP counter ‘n’
and is at address2 1901H + 4 * n. The reload configuration MSRs for GP counter ‘n,’ IA32_PMC_GPn_CFG_B, is at
MSR address 1902H + 4 * n. There is no legacy MSR alias to this reload configuration register. Thus, the register
only exists when enumerated in CPUID leaf 23H. Similarly, no legacy MSR alias exists for the event-select extended
registers, IA32_PMC_GPn_CFG_C, which are at MSR address 1903H + 4 * n for GP counter ‘n.’
An IA32_PMC_FXm_CTR MSR can be used to access the counter value for a fixed-function counter ‘m’ if that
counter is enumerated in CPUID leaf 23H. The IA32_PMC_FXm_CTR MSR address for fixed-function counter ‘m’ is
1980H + 4 * m. There is no alias for the fixed-function counters' reload configuration or event select extended
registers (IA32_PMC_FXm_CFG_B at MSR address 1982H + 4 * m and IA32_PMC_FXm_CFG_C at MSR address
1983H + 4 * m, respectively).
The available general-purpose and fixed-function counters are reported by CPUID.(EAX = 23H, ECX = 01H):EAX
and CPUID.(EAX = 23H, ECX = 01H):EBX, respectively. Note that not all counters enumerated in CPUID leaf 23H
may have corresponding IA32_PMC_GPn_CFG_B, IA32_PMC_GPn_CFG_C, IA32_PMC_FXm_CFG_B, or
IA32_PMC_FXm_CFG_C MSRs. The enumeration and usage of these MSRs are described in Section 8.7, “Auto
Counter Reload.” The enumeration in CPUID leaf 23H is true-view; thus, the enumeration may be set on (and the
MSRs/counters it enumerates supported on) only a subset of the logical processors of the system.
1. As an example, the IA32_PMC_GP1_CTR MSR has MSR address 1904H. Note that the legacy full-width MSR addresses for the
counters, the IA32_A_PMCn MSRs, remain at MSR address 4C1H + n.
2. As an example, the IA32_PMC_GP1_CFG_A MSR has MSR address 1905H. Note that the legacy MSR addresses for the event select
registers, the IA32_PERFEVTSELn MSRs, remain at MSR address 186H + n.
8.7.3 MSRs
Table 8-9. Architectural MSRs

Register Address (Hex): 1902H, 1906H, 190AH, ... 1902H + (4*n)
Register Address (Dec): 6402, 6406, 6410, ... 6402 + (4*n)
Architectural MSR Name: IA32_PMC_GPx_CFG_B (ACR Reload Configuration for PMCx)
Bit fields (reset value 0 for every field):
• 0: PMC0. Reload of PMC_GPx on overflow of PMC0.
• 1: PMC1. Reload of PMC_GPx on overflow of PMC1.
• ...
• n: PMCn. Reload of PMC_GPx on overflow of PMCn.
• 31:n+1: Reserved.
• 32: FIXED_CTR0. Reload of PMC_GPx on overflow of FIXED_CTR0.
• 33: FIXED_CTR1. Reload of PMC_GPx on overflow of FIXED_CTR1.
• ...
• 32+m: FIXED_CTRm. Reload of PMC_GPx on overflow of FIXED_CTRm.
• 63:33+m: Reserved.
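As a hedged example of the bit layout above, the following C sketch builds a reload bitmap for IA32_PMC_GP0_CFG_B so that GP counter 0 reloads when PMC1 or FIXED_CTR0 overflows; the wrmsr64 wrapper is assumed to be provided by the surrounding ring-0 environment, and the feature must be enumerated as described in Section 8.7:

#include <stdint.h>

extern void wrmsr64(uint32_t msr, uint64_t value);   /* assumed ring-0 helper */

static void program_acr_reload_for_gp0(void)
{
    const uint32_t IA32_PMC_GP0_CFG_B = 0x1902;      /* 1902H + 4*0 */
    uint64_t reload_mask = (1ull << 1)               /* bit 1: reload on PMC1 overflow */
                         | (1ull << 32);             /* bit 32: reload on FIXED_CTR0 overflow */
    wrmsr64(IA32_PMC_GP0_CFG_B, reload_mask);
}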
CHAPTER 9
LINEAR ADDRESS SPACE SEPARATION (LASS)
This chapter describes a new feature called linear address space separation (LASS).
9.1 INTRODUCTION
Chapter 4 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A describes paging,
which is the process of translating linear addresses to physical addresses and determining, for each translation, the
linear address’s access rights; these determine what accesses to a linear address are allowed.
Every access to a linear address is either a supervisor-mode access or a user-mode access. A linear address’s
access rights include an indication of whether the address is a supervisor-mode address or a user-mode address.
Paging prevents user-mode accesses to supervisor-mode addresses; in addition, there are features that can
prevent supervisor-mode accesses to user-mode addresses. (These features are supervisor-mode execution
prevention — SMEP — and supervisor-mode access prevention — SMAP.) In most cases, the blocked accesses
cause page-fault exceptions (#PF); for some cases (e.g., speculative accesses), the accesses are dropped without
fault.
With these mode-based protections, paging can prevent malicious software from directly reading or writing
memory inappropriately. To enforce these protections, the processor must traverse the hierarchy of paging struc-
tures in memory. Unprivileged software can use timing information resulting from this traversal to determine
details about the paging structures, and these details may be used to determine the layout of supervisor memory.
Linear-address space separation (LASS) is an independent mechanism that enforces the same mode-based protec-
tions as paging but without traversing the paging structures. Because the protections enforced by LASS are applied
before paging, “probes” by malicious software will provide no paging-based timing information.
LASS is based on a linear-address organization established by many operating systems: all linear addresses whose
most significant bit is 0 (“low” or “positive” addresses) are user-mode addresses, while all linear addresses whose
most significant bit is 1 (“high” or “negative” addresses) are supervisor-mode addresses. An operating system
should enable LASS only if it uses this organization of linear addresses.
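A minimal C sketch of this address convention follows; it models only the software-visible split on bit 63 and is not the hardware check itself:

#include <stdbool.h>
#include <stdint.h>

/* LASS address convention: bit 63 = 0 => user-mode ("low") address,
 * bit 63 = 1 => supervisor-mode ("high") address. */
static inline bool lass_is_supervisor_address(uint64_t linear_address)
{
    return ((linear_address >> 63) & 1u) != 0;
}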
Some accesses do not cause faults when they would violate the mode-based protections established by paging.
These include prefetches (e.g., those resulting from execution of one of the PREFETCHh instructions), executions
of the CLDEMOTE instruction, and accesses resulting from the speculative fetch or execution of an instruction. Such
an access may cause a LASS violation; if it does, the access is not performed but no fault occurs. (When such an
access would violate the mode-based protections of paging, the access is not performed but no page fault occurs.)
In 64-bit mode, LASS violations have priority just below that of canonicality violations; in compatibility mode, they
have priority just below that of segment-limit violations.
The remainder of this section describes how LASS applies to different types of accesses to linear addresses.
Chapter 4, “Paging,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A provides
full definitions of these access types. The sections below discuss specific LASS violations based on bit 63 of a linear
address. For a linear address with only 32 bits (or 16 bits), the processor treats bit 63 as if it were 0.
1. The WRUSS instruction is an exception; although it can be executed only if CPL = 0, the processor treats its shadow-stack accesses
as user accesses.
CHAPTER 10
REMOTE ATOMIC OPERATIONS IN INTEL ARCHITECTURE
10.1 INTRODUCTION
Remote Atomic Operations (RAO) are a set of instructions to improve synchronization performance. RAO is espe-
cially useful in multiprocessor applications that have a set of characteristics commonly found together:
• A need to update, i.e., read and modify, one or more variables atomically, e.g., because multiple processors
may attempt to update the same variable simultaneously.
• Updates are not expected to be interleaved with other reads or writes of the variables.
• The order in which the updates happen is unimportant.
One example of this scenario is a multiprocessor histogram computation, where multiple processors cooperate to
compute a shared histogram, which is then used in the next phase of computation. This is described in more detail
in Section 10.8.1.
RAO instructions aim to provide high performance in this scenario by:
• Atomically updating memory without returning any information to the processor itself.
• Relaxing the ordering of RAO instructions with respect to other updates or writes to the variables.
RAO instructions are defined such that, unlike conventional atomics (e.g., LOCK ADD), their operations may be
performed closer to memory, such as at a shared cache or memory controller. Performing operations closer to
memory reduces or even eliminates movement of data between memory and the processor executing the instruc-
tion. They also have weaker ordering guarantees than conventional atomics. This facilitates execution closer to
memory, and can also lead to reduced stalls in the processor pipeline. These properties mean that using RAO
instead of conventional atomics may provide a significant performance boost for the scenario outlined above.
10.2 INSTRUCTIONS
The current set of RAO instructions can be found in Chapter 2, “Instruction Set Reference, A-Z.” These instructions
include integer addition and bitwise AND, OR, and XOR. These operations may be performed on 32-bit (double-
word) or 64-bit (quadword) data elements. The destination, which is also one of the inputs, is always a location in
memory. The other input is a general-purpose register, ry, in Table 10-1. The instructions do not change any regis-
ters or flags.
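As a hedged illustration, some toolchains expose C intrinsics for these instructions (for example, _aadd_i32 and _aor_i32 in GCC/Clang raointintrin.h, reachable through immintrin.h when building with RAO support); the intrinsic names and build options are compiler details and should be treated as assumptions of this sketch, not part of this specification:

#include <immintrin.h>   /* assumes a toolchain with RAO-INT intrinsics (e.g., built with -mraoint) */

/* Sketch: remote atomic OR of event bits and remote atomic ADD of a counter.
 * Neither operation returns data to the executing processor. */
static void rao_update(int *shared_flags, int *shared_counter, int event_bits)
{
    _aor_i32(shared_flags, event_bits);   /* AOR */
    _aadd_i32(shared_counter, 1);         /* AADD */
}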
10.8 EXAMPLES
10.8.1 Histogram
Histogram is a common computational pattern, including in multiprocessor programming, but achieving an effi-
cient parallel implementation can be tricky. In a conventional histogram computation, software sweeps over a set
of input values; it maps each input value to a histogram bin, and increments that bin.
Common multiprocessor histogram implementations partition the inputs across the processors, so each processor
works on a subset of the inputs. Straightforward implementations have each processor directly update the shared
histogram. To ensure correctness, since multiple processors may attempt updates to the same histogram bin
simultaneously, the updates must use atomics. As described above, using conventional atomics can be expensive,
especially when we have highly contended cache lines in the histogram. That may occur for small histograms or for
histograms where many inputs map to a small number of histogram bins.
A common alternative approach uses a technique called privatization, where each processor gets its own “local”
histogram; as each processor works on its subset of the inputs, it updates its local histogram. As a final “extra”
step, software must accumulate the local histograms into the globally shared histogram, a step called a reduction.
This reduction step is where processors synchronize and communicate; confining synchronization to it allows the
computation of the local histograms to be embarrassingly parallel, requiring no atomics or inter-processor
communication, and can often lead to good performance. However, privatization has downsides:
• The reduction step can take a lot of time if the histogram has many bins.
• The time for a reduction is relatively constant regardless of the number of processors. As the number of
processors grows, therefore, the fraction of time spent on the reduction tends to grow.
• The local histograms require extra memory, and that memory footprint grows with the number of processors.
• The reduction is an “extra” step that complicates the software.
With RAO, software can use the simpler multiprocessor algorithm and achieve reliably good performance. The
following pseudo-code lists a RAO-based histogram implementation.
// in each processor:
double *data; // “data” is a per-processor array, holding a subset of all inputs
data = get_data(); // populate “data” values
for (i = 0; i < num_inputs; i++) // sweep this processor's inputs (loop added for completeness;
    aadd(&histogram[bin(data[i])], 1); // bin(), aadd(), num_inputs, and histogram are illustrative names)
The above code can provide good performance under various scenarios, i.e., sizes of histograms and biases in
which histogram bins are updated. RAO avoids data “ping-ponging” between processors, even under high conten-
tion. Further, the weak ordering of RAO allows a series of AADD instructions to overlap with each other in the pipe-
line, and thus provide for instruction level parallelism.
In addition to the performance benefits, the RAO code is simple and is thus easier to maintain.
While we specifically show and discuss histogram above, this computation pattern is very common, e.g., software
packet processing workloads exhibit this in how they track statistics of the packets. Other algorithms exhibiting this
pattern should similarly see benefits from RAO.
// In other processors:
12: if (my_core->flags & SOME_EVENT) {
13: …… // react to the occurrence of SOME_EVENT
14: clear_bits(&my_core->flags, SOME_EVENT);
15: }
With conventional atomics (e.g., LOCK OR), a significant portion of execution time of handle_event would be spent
accessing core->flags (line 5) and core->extra_flags (line 7). It is likely that when handle_event begins, the two
fields are in another processor's cache, e.g., if that processor updated some bits in the fields. Therefore, the data
would need to migrate to the cache of the processor executing handle_event.
In contrast, for the above code example, for RAO implementations that perform updates close to memory, the RAO
AOR instruction should reduce data movement of core->flags and core->extra_flags and thus result in a lower
execution latency. Further, when other processors later access these fields (lines 12-15), they will also benefit from
a lower latency due to reduced data movement, since they may get the data from a more central location.
Also note that since the order of notifications does not matter in this case, the function further takes advantage of
RAO's weak ordering, allowing multiple RAO AOR instructions to be executed concurrently. It does, however,
include a memory fence at the end (line 10), to ensure that all updates are visible to all processors before leaving
the handler.
CHAPTER 11
TOTAL STORAGE ENCRYPTION IN INTEL ARCHITECTURE
11.1 INTRODUCTION
Total Storage Encryption (TSE) is an architecture that allows encryption of storage at high speed. TSE provides the
following capabilities:
• Protection (confidentiality) of data at rest in storage.
• NIST Standard AES-XTS Encryption.
• A mechanism for software to configure hardware keys (which are not software visible) or software keys.
• A consistent key interface to the crypto engine.
11.2 ENUMERATION
CPUID enumerates the existence of the IA32_TSE_CAPABILITY MSR and the PBNDKB instruction.
The IA32_TSE_CAPABILITY MSR enumerates supported cryptographic algorithms and keys.
• Target identifier 2: TSE.
If TSE is supported on the platform, CPUID.PCONFIG_LEAF will enumerate TSE as a supported target in sub-leaf 0,
ECX=TSE:
• TSE_KEY_PROGRAM leaf is available when TSE is enumerated by PCONFIG as a target.
• TSE_KEY_PROGRAM_WRAPPED is available when TSE is enumerated by PCONFIG as a target.
Bits 15:0 enumerate, as a bitmap, the encryption algorithms that are supported. As of this writing, the only
supported algorithm is 256-bit AES-XTS, which is enumerated by setting bit 0.
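A minimal sketch of decoding that bitmap, given a previously read IA32_TSE_CAPABILITY value (reading the MSR itself, and its address, are outside this excerpt and are assumed to be handled by ring-0 code):

#include <stdbool.h>
#include <stdint.h>

/* Bits 15:0 of IA32_TSE_CAPABILITY are a bitmap of supported algorithms;
 * bit 0 indicates 256-bit AES-XTS. */
static inline bool tse_supports_aes_xts_256(uint64_t tse_capability)
{
    return (tse_capability & 0x1u) != 0;
}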
CHAPTER 12
FLEXIBLE UIRET
This chapter documents an enhancement to the UIRET instruction (user-interrupt return). This enhancement
allows software control of the value of the user-interrupt flag (UIF) established by UIRET.
UIRET—User-Interrupt Return
Opcode/Instruction    Op/En    64/32 bit Mode Support    CPUID Feature Flag    Description
F3 0F 01 EC UIRET     ZO       V/I                       UINTR                 Return from handling a user interrupt.
Description
UIRET returns from the handling of a user interrupt. It can be executed regardless of CPL.
Execution of UIRET inside a transactional region causes a transactional abort; the abort loads EAX as it would have
been loaded had the abort been due to an execution of IRET.
UIRET can be tracked by Architectural Last Branch Records (LBRs), Intel Processor Trace (Intel PT), and Perfor-
mance Monitoring. For both Intel PT and LBRs, UIRET is recorded in precisely the same manner as IRET. Hence for
LBRs, UIRETs fall into the OTHER_BRANCH category, which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22]
must be set to record user-interrupt delivery, and that the IA32_LBR_x_INFO.BR_TYPE field will indicate
OTHER_BRANCH for any recorded user interrupt. For Intel PT, control flow tracing must be enabled by setting
IA32_RTIT_CTL.BranchEn[bit 13].
UIRET will also increment performance counters for which counting BR_INST_RETIRED.FAR_BRANCH is enabled.
Operation
Pop tempRIP;
Pop tempRFLAGS; // see below for how this is used to load RFLAGS
Pop tempRSP;
IF tempRIP is not canonical in current paging mode
THEN #GP(0);
FI;
IF ShadowStackEnabled(CPL)
THEN
PopShadowStack SSRIP;
IF SSRIP ≠ tempRIP
THEN #CP (FAR-RET/IRET);
FI;
FI;
RIP := tempRIP;
// update in RFLAGS only CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID
RFLAGS := (RFLAGS & ~254DD5H) | (tempRFLAGS & 254DD5H);
RSP := tempRSP;
IF CPUID.(EAX=07H, ECX=01H):EDX.UIRET_UIF[bit 17] = 1
THEN UIF := tempRFLAGS[1];
ELSE UIF := 1;
FI;
Clear any cache-line monitoring established by MONITOR or UMONITOR;
Flags Affected
See the Operation section.
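For reference, a C model of the RFLAGS merge shown in the Operation section; the 254DD5H mask covers CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID. This is a descriptive sketch, not executable UIRET behavior:

#include <stdint.h>

#define UIRET_RFLAGS_MASK 0x254DD5ull   /* CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, ID */

/* Only the masked flag bits are taken from the popped value; all other
 * RFLAGS bits are preserved. */
static inline uint64_t uiret_merge_rflags(uint64_t rflags, uint64_t temp_rflags)
{
    return (rflags & ~UIRET_RFLAGS_MASK) | (temp_rflags & UIRET_RFLAGS_MASK);
}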
CHAPTER 13
USER-TIMER EVENTS AND INTERRUPTS
• The write to the deadline may occur before TSC reaches the original deadline. In this case, no user-timer event
will occur based on the original deadline. Any subsequent user-timer event will be based on the new value of
the deadline.
Software writes to the user deadline using a new MSR described in Section 13.3.
1. Execution of MOV SS, POP SS, or STI may block the processing of user-timer events for one instruction.
2. A logical processor processes a user-timer event only if CPL = 3. Since the HLT and MWAIT instructions can be executed only if CPL
= 0, a user-timer event is never processed when a logical processor is in an activity state that was entered using one of those
instructions.
1. This conversion may not be meaningful if “RDTSC exiting” is 1. Software setting “RDTSC exiting” to 1 should ensure that any write to
the IA32_UINTR_TIMER MSR causes a VM exit.
CHAPTER 14
APIC-TIMER VIRTUALIZATION
• The guest deadline may be modified before the TSC reaches the original guest deadline. In this case, no guest-
timer event will occur based on the original guest deadline, and any subsequent guest-timer event will be based
on the new guest deadline.
Processing of a guest-timer event updates the virtual-APIC page to cause a virtual timer interrupt to become
pending. Specifically, the logical processor performs the following steps:
V := virtual timer vector;
VIRR[V] := 1; // update virtual IRR field on virtual-APIC page
RVI := max{RVI, V}; // update guest interrupt status field in VMCS
evaluate pending virtual interrupts; // a virtual interrupt may be delivered immediately after this processing
Guest deadline := 0;
Guest deadline shadow := 0;
The following items consider certain special cases:
• If a guest-timer event is processed between iterations of a REP-prefixed instruction (after at least one iteration
has completed but before all iterations have completed), the following items characterize processor state after
the steps indicated above and before guest execution resumes:
— RIP references the REP-prefixed instruction;
— RCX, RSI, and RDI are updated to reflect the iterations completed; and
— RFLAGS.RF = 1.
• If a guest-timer event is processed after partial execution of a gather instruction or a scatter instruction, the
destination register and the mask operand are partially updated and RFLAGS.RF = 1.
• If a guest-timer event is processed while the logical processor is in the state entered by HLT, the processor
returns to the HLT state after the steps indicated above (if a pending virtual interrupt was recognized, the
logical processor may immediately wake from the HLT state).
• If a guest-timer event is processed while the logical processor is in the state entered by MWAIT, TPAUSE, or
UMWAIT, the processor will be in the active state after the steps indicated above.
• A guest-timer event that becomes pending during transactional execution may abort the transaction and result
in a transition to non-transactional execution. If it does, the transactional abort loads EAX as it would have been
loaded had the abort been due to an interrupt.
• A guest-timer event that occurs while the logical processor is in enclave mode causes an asynchronous enclave
exit (AEX) to occur before the steps indicated above.
CHAPTER 15
VMX SUPPORT FOR THE IA32_SPEC_CTRL MSR
This chapter describes a VMX extension that supports virtualization of the IA32_SPEC_CTRL MSR.1 This feature
supports management of the IA32_SPEC_CTRL MSR across VMX transitions (saving it on VM exits and allowing it
to be loaded on VM entries and VM exits).
Section 15.1 presents details of the VMCS changes supporting the new feature. Section 15.2 and Section 15.3
specify how the feature affects VM entries and VM exits, respectively.
1. A related feature allows a virtual-machine monitor (VMM) to specify that certain bits of the MSR cannot be modified by guest soft-
ware. Details of that feature can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. That feature and
the feature described in this chapter can be implemented together or separately.
CHAPTER 16
PROCESSOR TRACE TRIGGER TRACING
This chapter documents the architecture for Processor Trace Trigger Tracing, an addition to the suite of capabilities
within Intel® Processor Trace (Intel® PT). Details on the Intel PT infrastructure and control flow trace capabilities
can be found in Chapter 33, “Intel® Processor Trace‚” of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3C.
Processor Trace Trigger Tracing (PTTT) is a capability that allows the use of performance counter events, perfor-
mance counter overflows, and debug register matches as trigger events and allows the user to configure program-
mable trigger actions for those events. Trigger actions allow Intel Processor Trace to be paused and resumed and
the IP attribution information for the trigger event to be saved in the Intel Processor Trace by inserting a new PT
packet type called the TRIG packet. The PTTT feature introduces the concept of a trigger unit whose trigger inputs
and actions can be configured by software. The presence of the PTTT capability, the number of trigger units avail-
able in the logical processor implementation, and their supported capabilities can be detected using CPUID.
Trigger pause and resume actions are reflected in IA32_RTIT_STATUS.Paused[bit 8]. If a resume action is
requested, then tracing resumes. If both pause and resume actions are set, then no action will be taken by
hardware and IA32_RTIT_STATUS.Paused[bit 8] remains unchanged. At any given cycle, multiple trigger events
may fire. In this case, the action of the youngest instruction
with the trigger firing determines the resume or pause action. In addition to the pause/resume trigger action, a
new PT packet called the TRIG packet is generated by hardware. The TRIG packet is illustrated in Section 16.2.1.
The TRIG packet captures information about the trigger(s) that occurred in that cycle. If an ICNT trigger action is
requested, the TRIG packet includes the IP attribution information for the instruction that retired in that cycle. The
ICNT field in the TRIG packet indicates the number of instructions retired since the last IP indication reference
packet sent earlier (FUP, TIP*, TNT, TRIG+ICNT). It is a 16-bit saturating counter, which is always positive. If there
is no reference packet, then an FUP packet is also generated after the TRIG packet, and the IP bit will be set in the
TRIG packet. The TRIG packet includes a bitmap of all the triggers that fired in that cycle in the TRBV field and indi-
cates if more than one trigger happened using the MULT bit. When the MULT bit is set, the ICNT value refers to the
first instruction in the cycle that fired from the lowest-order trigger unit.
If DR match is to be used as a trigger, software is expected to program the DRx register with the linear address and
other enable bits in DR7. Only code and data breakpoint matches are supported. I/O breakpoint match is not
supported. A DR match trigger event is recognized under conditions that are the same as DR exceptions. Similar to
DR exceptions, the DR match trigger may be pended and delivered under the correct contexts. A DRx resource can
be used either as a trigger tracing input or to cause a normal #DB exception. If a DR match on the same linear
address is to be configured as a trigger and #DB exception, then two separate DRx resources need to be used, one
for a trace trigger and one for a #DB exception.
Dependencies:
• The Action.En bit must be set to enable the trigger unit.
• If the Input is a Perfmon event or overflow, then the corresponding PERFEVTSELx.EN_PT_LOG bit must be set.
• If the Input is a DR match, then the corresponding DR7.DRx_PT_LOG bit must be set.
Generation Scenario: When any of the enabled trigger events are active, after completing the requested actions, a
TRIG packet is generated.
16.3.1 IA32_RTIT_TRIGGERx_CFG
The IA32_RTIT_TRIGGERx_CFG MSR allows the configuration of individual trigger units. Specifically, it allows the
user to select the input used for the trigger and the actions to be taken when the trigger event happens. The
number of supported IA32_RTIT_TRIGGERx_CFG MSRs is indicated in the CPUID.(EAX=14H, ECX=1):EAX[10:8]
field.
[Figure: MSR bit-field layout. Recoverable field labels from the figure: Counter Mask (CMASK) [31:24], INV [23],
EN [22], ANY [21], INT [20], PC [19], E [18], OS [17], USR [16], Unit Mask (UMASK) [15:8], Event Select [7:0];
the bits above 31 (the 63:38 tick region) are unlabeled in the recovered text.]
CHAPTER 17
MONITORLESS MWAIT
17.2 ENUMERATION
Existing processors indicate support for the MONITOR and MWAIT instructions by enumerating
CPUID.01H:ECX.MONITOR[bit 3] as 1. This enumeration also implies support for CPUID leaf 05H, the
MONITOR/MWAIT leaf. CPUID leaf 05H enumerates details of the operation of the MONITOR instruction (e.g., the
size of the monitored address range) and the capabilities of the MWAIT instruction (e.g., the extensions that can be
specified in ECX).
CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3] enumerates support for monitorless use of MWAIT. If this bit is
enumerated as 1, software can execute MWAIT with ECX[2] = 1 (see Section 17.1).
To allow virtualization of monitorless MWAIT (without the monitored form; see Section 17.4),
CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23] indicates support for the MWAIT instruction and for CPUID leaf
05H. The following items detail the implications of the value enumerated for this bit:
• If CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23] is enumerated as 0, MWAIT is still supported if
CPUID.01H:ECX.MONITOR[bit 3] is enumerated as 1. Monitorless MWAIT is supported only if
CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3] is enumerated as 1.
NOTE
Cores in a hybrid CPU support MWAIT consistently. A core will support monitorless MWAIT only if all
cores in the hybrid CPU do so.
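A hedged sketch of the enumeration checks in this section, again using the toolchain cpuid.h helper (a real driver would cache these CPUID results):

#include <cpuid.h>
#include <stdbool.h>

/* Monitorless MWAIT is usable when CPUID leaf 05H is reachable (via
 * CPUID.01H:ECX.MONITOR[bit 3] or CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23])
 * and CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3] is enumerated as 1. */
static bool monitorless_mwait_supported(void)
{
    unsigned eax, ebx, ecx, edx;
    bool have_leaf_05 = false;

    if (__get_cpuid(0x01, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 3)))
        have_leaf_05 = true;
    if (__get_cpuid_count(0x07, 1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 23)))
        have_leaf_05 = true;
    if (!have_leaf_05)
        return false;
    if (!__get_cpuid(0x05, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 3)) != 0;
}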
17.3 ENABLING
The MONITOR and MWAIT instructions are available only when
IA32_MISC_ENABLE.ENABLE_MONITOR_FSM[bit 18] = 1.
If IA32_MISC_ENABLE.ENABLE_MONITOR_FSM[bit 18] = 0, execution of MONITOR or MWAIT causes an invalid-
opcode exception (#UD). In addition, CPUID.01H:ECX.MONITOR[bit 3] and
CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23] are each enumerated as 0, and CPUID leaf 05H is not
supported.
17.4 VIRTUALIZATION
A virtual-machine monitor (VMM) may want to present the abstraction of a virtual machine that supports monitor-
less MWAIT but not the existing monitoring of address ranges.
Such a VMM can virtualize CPUID to enumerate CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23] as 1,
CPUID.01H:ECX.MONITOR[bit 3] as 0, and CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3] as 1. The VMM can
intercept executions of MONITOR and deliver a #UD to the guest; it can intercept executions of MWAIT and either
(1) deliver a #GP(0) to the guest if ECX[2] = 0; or (2) virtualize monitorless MWAIT if ECX[2] = 1.
MWAIT—Monitor Wait
Opcode      Instruction    Op/En    64-Bit Mode    Compat/Leg Mode    Description
0F 01 C9    MWAIT          ZO       Valid          Valid              A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events.
Description
The MWAIT instruction provides hints to allow the processor to enter an implementation-dependent optimized state.
There are two principal targeted usages: address-range monitor and advanced power management.
CPUID.01H:ECX.MONITOR[bit 3] and CPUID.(EAX=07H, ECX=01H):EDX.MWAIT[bit 23] both indicate the avail-
ability of MWAIT in the processor; the instruction is supported if either is enumerated as 1. When supported, MWAIT
may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode exception).
CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3] indicates the availability of MWAIT that does not use a monitored
address range (with ECX[2] set; “monitorless MWAIT”) but does not indicate availability of MONITOR or of non-
monitorless MWAIT (MWAIT with ECX[2] cleared).
The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE MSR;
disabling MWAIT clears the CPUID feature flag and causes execution to generate an invalid-opcode exception.
This instruction’s operation is the same in non-64-bit modes and 64-bit mode.
ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred opti-
mized state the processor should enter. The first processors to implement MWAIT supported only the zero value for
EAX and ECX. Later processors allowed setting ECX[0] to enable masked interrupts as break events for MWAIT or
setting ECX[2] to enable monitorless MWAIT (see below). Software can use the CPUID instruction to determine the
extensions and hints supported by the processor.
NOTE: Target C-states for MWAIT extensions are processor-specific C-states, not ACPI C-states.
Bits 31:8: Reserved.
Note that if MWAIT is used to enter any of the C-states that are numerically higher than C1, a store to the address
range armed by the MONITOR instruction will cause the processor to exit MWAIT only if the store originated from
another processor agent. A store from a non-processor agent might not cause the processor to exit MWAIT in such
cases.
If MWAIT is used with ECX[2] set, it will ignore any preceding MONITOR instruction and will ignore stores to any
address range that may have been monitored. Support for this is enumerated by
CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3].
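A hedged ring-0 sketch of issuing monitorless MWAIT (ECX[2] = 1); the zero extension of the EAX hint and the inline-assembly form are illustrative, and software should first confirm CPUID.05H:ECX.MONITORLESS_MWAIT[bit 3]:

/* CPL-0 only: MWAIT with ECX[2] set, so no MONITOR-armed address range is used. */
static inline void mwait_monitorless(unsigned int eax_hints)
{
    unsigned int ecx_ext = 1u << 2;   /* ECX[2]: monitorless MWAIT */
    __asm__ volatile("mwait" : : "a"(eax_hints), "c"(ecx_ext));
}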
For additional details of MWAIT extensions, see Chapter 15, “Power and Thermal Management,” of Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 3B.
Operation
(* MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension
MWAIT EAX, ECX *)
{
    WHILE ("Monitor Hardware is in armed state") {
        implementation_dependent_optimized_state(EAX, ECX);
    }
    Set the state of Monitor Hardware as triggered;
}
Numeric Exceptions
None.