## Arm® Cortex®-X2 Core

Revision: r2p1

# **Software Optimization Guide**

Non-Confidential

Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.

Issue 5.0

PJDOC-466751330-14955

#### Release information

#### Document history

| Issue | Date | Confidentiality | Change |
|-------|-------------|------------------|-------------------------|
| 1.0 | 31 Mar 2020 | Confidential | First release for r0p0 |
| 2.0 | 15 May 2020 | Confidential | Release for r1p0 |
| 3.0 | 21 Aug 2020 | Confidential | Release for r2p0 |
| 4.0 | 25 May 2021 | Non-Confidential | Second release for r2p0 |
| 5.0 | 10 Dec 2021 | Non-Confidential | First release for r2p1 |

### Non-Confidential Proprietary Notice

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED "AS IS". ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT.
For the avoidance of doubt, Arm makes no representation with respect to, has undertaken no analysis to identify or understand the scope and content of, patents, copyrights, trade secrets, or other rights. This document may include technical inaccuracies or typographical errors. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word "partner" in reference to Arm's customers is not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time and without notice. This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail. The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm Limited (or its affiliates) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow Arm's trademark usage guidelines at https://www.arm.com/company/policies/trademarks. Copyright © 2021 Arm Limited (or its affiliates). All rights reserved. Arm Limited. Company 02557590 registered in England. 110 Fulbourn Road, Cambridge, England CB1 9NJ. 
(LES-PRE-20349)

### Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

#### **Product Status**

The information in this document is final, that is for a developed product.

#### Web Address

developer.arm.com

### Inclusive language commitment

Arm values inclusive communities. Arm recognizes that we and our industry have used terms that can be offensive. Arm strives to lead the industry and create change.

This document includes terms that can be offensive. We will replace these terms in a future issue of this document. If you find offensive terms in this document, please email **terms@arm.com**.

# **Contents**

- 3.17 ASIMD floating-point instructions
- 3.18 ASIMD BFloat16 (BF16) instructions
- 3.19 ASIMD miscellaneous instructions
- 3.20 ASIMD load instructions
- 3.21 ASIMD store instructions
- 3.22 Cryptography extensions
- 3.23 CRC
- 3.24 SVE Predicate instructions
- 3.25 SVE integer instructions
- 3.26 SVE floating-point instructions
- 3.27 SVE BFloat16 (BF16) instructions
- 3.28 SVE Load instructions
- 3.29 SVE Store instructions
- 3.30 SVE Miscellaneous instructions
- 3.31 SVE Cryptographic instructions
- 4 Special considerations
- 4.1 Dispatch constraints
- 4.2 Optimizing general-purpose register spills and fills
- 4.3 Optimizing memory routines
- 4.4 Load/Store alignment
- 4.5 Store to Load Forwarding
- 4.6 AES encryption/decryption
- 4.7 Region based fast forwarding
- 4.8 Branch instruction alignment
- 4.9 FPCR self-synchronization
- 4.10 Special register access
- 4.11 Instruction fusion
- 4.12 Zero Latency MOVs
- 4.13 Cache maintenance operations
- 4.14 Memory Tagging - Tagging Performance
- 4.15 Memory Tagging - Synchronous Mode
- 4.16 Complex ASIMD and SVE instructions
- 4.17 MOVPRFX fusion

# 1 Introduction

## 1.1 Product revision status

The rmpn identifier indicates the revision status of the product described in this book, for example, r1p2, where:

- **rm** Identifies the major revision of the product, for example, r1.
- **pn** Identifies the minor revision or modification status of the product, for example, p2.

## 1.2 Intended audience

This document is for system designers, system integrators, and programmers who are designing or programming a System-on-Chip (SoC) that uses an Arm core.

## 1.3 Scope

This document describes aspects of the Cortex-X2 core micro-architecture that influence software performance. Micro-architectural detail is limited to that which is useful for software optimization.

Documentation extends only to software visible behavior of the Cortex-X2 core and not to the hardware rationale behind the behavior.

## 1.4 Conventions

The following subsections describe conventions used in Arm documents.

### 1.4.1 Glossary

The Arm Glossary is a list of terms used in Arm documentation, together with definitions for those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm meaning differs from the generally accepted meaning.

See the Arm Glossary for more information: https://developer.arm.com/glossary.

### 1.4.2 Terms and abbreviations

This document uses the following terms and abbreviations.
| Term | Meaning |
|-------|-----------------------------|
| ALU | Arithmetic and Logical Unit |
| ASIMD | Advanced SIMD |
| MOP | Macro-OPeration |
| µOP | Micro-OPeration |
| SQRT | Square Root |
| FP | Floating-point |

### 1.4.3 Typographical conventions

| Convention | Use |
|------------|-----|
| italic | Introduces citations. |
| bold | Highlights interface elements, such as menu names. Denotes signal names. Also used for terms in descriptive lists, where appropriate. |
| monospace | Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code. |
| monospace **bold** | Denotes language keywords when used outside example code. |
| monospace underline | Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name. |
| < and > | Encloses replaceable terms for assembler syntax where they appear in code or code fragments. For example: `MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>` |
| SMALL CAPITALS | Used in body text for a few terms that have specific technical meanings, that are defined in the Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION SPECIFIC, UNKNOWN, and UNPREDICTABLE. |
| Caution | This represents a recommendation which, if not followed, might lead to system failure or damage. |
| Warning | This represents a requirement for the system that, if not followed, might result in system failure or damage. |
| Danger | This represents a requirement for the system that, if not followed, will result in system failure or damage. |
| Note | This represents an important piece of information that needs your attention. |
| Tip | This represents a useful tip that might make it easier, better or faster to perform a task. |
| Remember | This is a reminder of something important that relates to the information you are reading. |

## 1.5 Additional reading

This document contains information that is specific to this product. See the following documents for other relevant information:

**Table 1-1 Arm publications**

| Document name | Document ID | Licensee only |
|---------------|-------------|---------------|
| Arm® Architecture Reference Manual, Armv8, for Armv8-A architecture profile | DDI 0487 | N |
| Arm® Architecture Reference Manual Supplement, Armv9, for Armv9-A architecture profile | DDI 0608 | N |
| Arm® Cortex-X2 Core Technical Reference Manual | 101803 | N |

## 1.6 Feedback

Arm welcomes feedback on this product and its documentation.

### 1.6.1 Feedback on this product

If you have any comments or suggestions about this product, contact your supplier and give:

- The product name.
- The product revision or version.
- An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate.

### 1.6.2 Feedback on content

If you have comments on content, send an email to errata@arm.com and give:

- The title Arm® Cortex®-X2 Core Software Optimization Guide.
- The number PJDOC-466751330-14955.
- If applicable, the page number(s) to which your comments refer.
- A concise explanation of your comments.

Arm also welcomes general suggestions for additions and improvements.

Arm tests the PDF only in Adobe Acrobat and Acrobat Reader and cannot guarantee the quality of the represented document when used with any other PDF reader.

# 2 Overview

The Cortex-X2 core is a high-performance and low-power product that implements the Armv9.0-A architecture and supports all previous Armv8-A architectures up to Armv8.5-A.
It targets large screen compute applications.

The key features of the Cortex-X2 core are:

- Implementation of the Armv9-A A64 instruction sets
- Memory Management Unit (MMU)
- 40-bit Physical Address (PA) and 48-bit Virtual Address (VA)
- Generic Interrupt Controller (GIC) CPU interface to connect to an external interrupt distributor
- Generic Timers that support 64-bit count input from an external system counter
- Implementation of the Reliability, Availability, and Serviceability (RAS) Extension
- Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and Scalable Vector Extension 2 (SVE2)
- Integrated execution unit with Advanced SIMD and floating-point support
- Support for the optional Cryptographic Extension, which is licensed separately
- Activity Monitoring Unit (AMU)
- Separate L1 data and instruction caches
- Private, unified data and instruction L2 cache
- Support for Memory System Resource Partitioning and Monitoring (MPAM)
- Armv9-A debug logic
- Performance Monitoring Unit (PMU)
- Embedded Trace Extension (ETE)
- Trace Buffer Extension (TRBE)
- Optional Embedded Logic Analyzer (ELA)

This document describes elements of the Cortex-X2 core micro-architecture that influence software performance so that software and compilers can be optimized accordingly.

## 2.1 Pipeline overview

The following figure describes the high-level Cortex-X2 instruction processing pipeline. Instructions are first fetched and then decoded into internal Macro-OPerations (MOPs). From there, the MOPs proceed through register renaming and dispatch stages. A MOP can be split into two Micro-OPerations (µOPs) further down the pipeline after the decode stage. Once dispatched, µOPs wait for their operands and issue out-of-order to one of fifteen issue pipelines. Each issue pipeline can accept one µOP per cycle.

Figure 2-1 Cortex-X2 core pipeline

The execution pipelines support different types of operations, as shown in the following table.
**Table 2-1 Cortex-X2 core operations**

| Instruction groups | Instructions |
|--------------------|--------------|
| Branch 0/1 | Branch µOPs |
| Integer Single-Cycle 0/1 | Integer ALU µOPs |
| Integer Single/Multi-cycle 0/1 | Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs |
| Load/Store 0/1 | Load, Store address generation and special memory µOPs |
| Load 2 | Load µOPs |
| Store data 0/1 | Store data µOPs |
| FP/ASIMD-0 | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto µOPs, store data µOPs |
| FP/ASIMD-1 | ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, store data µOPs, crypto µOPs |
| FP/ASIMD-2 | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto µOPs |
| FP/ASIMD-3 | ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, crypto µOPs |

# 3 Instruction characteristics

## 3.1 Instruction tables

This chapter describes high-level performance characteristics for most Armv9-A instructions. A series of tables summarize the effective execution latency and throughput (instruction bandwidth per cycle), pipelines utilized, and special behaviors associated with each group of instructions. Utilized pipelines correspond to the execution pipelines described in chapter 2.

In the tables below, Exec Latency is defined as the minimum latency seen by an operation dependent on an instruction in the described group.

In the tables below, Execution Throughput is defined as the maximum throughput (in instructions per cycle) of the specified instruction group that can be achieved in the entirety of the Cortex-X2 microarchitecture.
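As an illustration of how these two definitions combine, the following Python sketch estimates a lower bound on cycle count for a run of instructions from a single group. It is a simplified model, not a description of the core: the function names are hypothetical, and the model assumes the run is either fully dependent (each instruction consumes the previous result, so Exec Latency dominates) or fully independent (so Execution Throughput dominates). It ignores dispatch constraints, cache misses, and contention from other instruction groups.

```python
import math
from fractions import Fraction


def dependent_chain_cycles(n_ops, exec_latency):
    """Lower bound when every instruction consumes the previous result.

    Hypothetical helper: each result is available exec_latency cycles
    after its producer issues, so the chain cannot overlap.
    """
    return n_ops * exec_latency


def independent_cycles(n_ops, throughput):
    """Lower bound when no instruction depends on another.

    throughput is in instructions per cycle and may be fractional,
    for example Fraction(1, 5) for an operation that issues once
    every 5 cycles. Fractions keep the division exact.
    """
    return math.ceil(Fraction(n_ops) / Fraction(throughput))


if __name__ == "__main__":
    # Values from the "ALU, basic" row: Exec Latency 1, Execution Throughput 4.
    print(dependent_chain_cycles(8, 1))
    print(independent_cycles(8, 4))
    # Best-case value from the "Divide, W-form" row: Execution Throughput 1/5.
    print(independent_cycles(3, Fraction(1, 5)))
```

For example, with the values listed for the "ALU, basic" group (Exec Latency 1, Execution Throughput 4), eight dependent ADD instructions need at least eight cycles under this model, while eight independent ADD instructions need at least two.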
## 3.2 Legend for reading the utilized pipelines

**Table 3-1 Cortex-X2 core pipeline names and symbols**

| Pipeline name | Symbol used in tables |
|---------------|-----------------------|
| Branch 0/1 | B |
| Integer single Cycle 0/1 | S |
| Integer single Cycle 0/1 and single/multicycle 0/1 | I |
| Integer single/multicycle 0/1 | M |
| Integer multicycle 0 | M0 |
| Load/Store 0/1 | L01 |
| Load/Store 0/1 and Load 2 | L |
| Store data 0/1 | D |
| FP/ASIMD 0/1/2/3 | V |
| FP/ASIMD 0/1 | V01 |
| FP/ASIMD 0/2 | V02 |
| FP/ASIMD 1/3 | V13 |
| FP/ASIMD 0 | V0 |
| FP/ASIMD 1 | V1 |

## 3.3 Branch instructions

**Table 3-2 AArch64 Branch instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Branch, immed | B | 1 | 2 | B | - |
| Branch, register | BR, RET | 1 | 2 | B | - |
| Branch and link, immed | BL | 1 | 2 | B, S | - |
| Branch and link, register | BLR | 1 | 2 | B, S | - |
| Compare and branch | CBZ, CBNZ, TBZ, TBNZ | 1 | 2 | B | - |

## 3.4 Arithmetic and logical instructions

**Table 3-3 AArch64 Arithmetic and logical instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| ALU, basic | ADD, ADC, AND, BIC, EON, EOR, ORN, ORR, SUB, SBC | 1 | 4 | I | - |
| ALU, basic, flagset | ADDS, ADCS, ANDS, BICS, SUBS, SBCS | 1 | 3 | I | - |
| ALU, extend and shift | ADD{S}, SUB{S} | 2 | 2 | M | - |
| Arithmetic, LSL shift, shift <= 4 | ADD, SUB | 1 | 4 | I | - |
| Arithmetic, flagset, LSL shift, shift <= 4 | ADDS, SUBS | 1 | 3 | I | - |
| Arithmetic, LSR/ASR/ROR shift or LSL shift > 4 | ADD{S}, SUB{S} | 2 | 2 | M | - |
| Arithmetic, immediate to logical address tag | ADDG, SUBG | 2 | 2 | M | - |
| Conditional compare | CCMN, CCMP | 1 | 3 | I | - |
| Conditional select | CSEL, CSINC, CSINV, CSNEG | 1 | 3 | I | - |
| Convert floating-point condition flags | AXFLAG, XAFLAG | 1 | 1 | I | - |
| Flag manipulation instructions | SETF8, SETF16, RMIF, CFINV | 1 | 1 | I | - |
| Insert Random Tag | IRG | 2, 3 | 2, 1 | M, M0 | 1 |
| Insert Tag Mask | GMI | 1 | 4 | I | - |
| Logical, shift, no flagset | AND, BIC, EON, EOR, ORN, ORR | 1 | 4 | I | - |
| Logical, shift, flagset | ANDS, BICS | 2 | 2 | M | - |
| Subtract Pointer | SUBP | 1 | 4 | I | - |
| Subtract Pointer, flagset | SUBPS | 1 | 3 | I | - |

#### Notes:

1. The latency is 2, throughput is 2 and utilized pipeline is M when GCR\_EL1.RRND = 1. When GCR\_EL1.RRND = 0, latency is 3, throughput is 1 and pipeline utilized is M0.

## 3.5 Divide and multiply instructions

**Table 3-4 AArch64 Divide and multiply instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Divide, W-form | SDIV, UDIV | 5 to 12 | 1/12 to 1/5 | M0 | 1 |
| Divide, X-form | SDIV, UDIV | 5 to 20 | 1/20 to 1/5 | M0 | 1 |
| Multiply | MUL, MNEG | 2 | 2 | M | - |
| Multiply accumulate, W-form | MADD, MSUB | 2 (1) | 1 | M0 | 2 |
| Multiply accumulate, X-form | MADD, MSUB | 2 (1) | 1 | M0 | 2 |
| Multiply accumulate long | SMADDL, SMSUBL, UMADDL, UMSUBL | 2 (1) | 1 | M0 | 2 |
| Multiply high | SMULH, UMULH | 3 | 2 | M | 2 |
| Multiply long | SMNEGL, SMULL, UMNEGL, UMULL | 2 | 2 | M | - |

#### Notes:

1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses). Accumulator forwarding is not supported for consumers of 64-bit multiply high operations.

## 3.6 Pointer Authentication Instructions

**Table 3-5 AArch64 pointer authentication instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Authenticate data address | AUTDA, AUTDB, AUTDZA, AUTDZB | 5 | 1 | M0 | - |
| Authenticate instruction address | AUTIA, AUTIB, AUTIA1716, AUTIB1716, AUTIASP, AUTIBSP, AUTIAZ, AUTIBZ, AUTIZA, AUTIZB | 5 | 1 | M0 | - |
| Branch and link, register, with pointer authentication | BLRAA, BLRAAZ, BLRAB, BLRABZ | 6 | 1 | M0, B | - |
| Branch, register, with pointer authentication | BRAA, BRAAZ, BRAB, BRABZ | 6 | 1 | M0, B | - |
| Branch, return, with pointer authentication | RETA, RETB | 6 | 1 | M0, B | - |
| Compute pointer authentication code for data address | PACDA, PACDB, PACDZA, PACDZB | 5 | 1 | M0 | - |
| Compute pointer authentication code, using generic key | PACGA | 5 | 1 | M0 | - |
| Compute pointer authentication code for instruction address | PACIA, PACIB, PACIA1716, PACIB1716, PACIASP, PACIBSP, PACIAZ, PACIBZ, PACIZA, PACIZB | 5 | 1 | M0 | - |
| Load register, with pointer authentication | LDRAA, LDRAB | 9 | 1 | M0, L | - |
| Strip pointer authentication code | XPACD, XPACI, XPACLRI | 2 | 1 | M0 | - |

## 3.7 Miscellaneous data-processing instructions

**Table 3-6 AArch64 Miscellaneous data-processing instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Address generation | ADR, ADRP | 1 | 4 | I | - |
| Bitfield extract, one reg | EXTR | 1 | 4 | I | - |
| Bitfield extract, two regs | EXTR | 3 | 2 | I, M | - |
| Bitfield move, basic | SBFM, UBFM | 1 | 4 | I | - |
| Bitfield move, insert | BFM | 2 | 2 | M | - |
| Count leading | CLS, CLZ | 1 | 4 | I | - |
| Move immed | MOVN, MOVK, MOVZ | 1 | 4 | I | - |
| Reverse bits/bytes | RBIT, REV, REV16, REV32 | 1 | 4 | I | - |
| Variable shift | ASRV, LSLV, LSRV, RORV | 1 | 4 | I | - |

## 3.8 Load instructions

The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the maximum latency to load all the registers written by the instruction.
**Table 3-7 AArch64 Load instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Load register, literal | LDR, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, unscaled immed | LDUR, LDURB, LDURH, LDURSB, LDURSH, LDURSW, PRFUM | 4 | 3 | L | - |
| Load register, immed post-index | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW | 4 | 3 | L, I | - |
| Load register, immed pre-index | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW | 4 | 3 | L, I | - |
| Load register, immed unprivileged | LDTR, LDTRB, LDTRH, LDTRSB, LDTRSH, LDTRSW | 4 | 3 | L | - |
| Load register, unsigned immed | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, register offset, basic | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, register offset, scale by 4/8 | LDR, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, register offset, scale by 2 | LDRH, LDRSH | 4 | 3 | L | - |
| Load register, register offset, extend | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, register offset, extend, scale by 4/8 | LDR, LDRSW, PRFM | 4 | 3 | L | - |
| Load register, register offset, extend, scale by 2 | LDRH, LDRSH | 4 | 3 | L | - |
| Load pair, signed immed offset, normal, W-form | LDP, LDNP | 4 | 3 | L | - |
| Load pair, signed immed offset, normal, X-form | LDP, LDNP | 4 | 1.5 | L | - |
| Load pair, signed immed offset, signed words | LDPSW | 5 | 1 | I, L | - |
| Load pair, immed post-index or immed pre-index, normal, W-form | LDP | 4 | 3 | L, I | - |
| Load pair, immed post-index or immed pre-index, normal, X-form | LDP | 4 | 1.5 | L, I | - |
| Load pair, immed post-index or immed pre-index, signed words | LDPSW | 5 | 1 | I, L | - |

## 3.9 Store instructions

The following table describes performance characteristics for standard store instructions. Store MOPs are split into address and data µOPs. Once executed, stores are buffered and committed in the background.

**Table 3-8 AArch64 Store instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Store register, unscaled immed | STUR, STURB, STURH | 1 | 2 | L01, D | - |
| Store register, immed post-index | STR, STRB, STRH | 1 | 2 | L01, D, I | - |
| Store register, immed pre-index | STR, STRB, STRH | 1 | 2 | L01, D, I | - |
| Store register, immed unprivileged | STTR, STTRB, STTRH | 1 | 2 | L01, D | - |
| Store register, unsigned immed | STR, STRB, STRH | 1 | 2 | L01, D | - |
| Store register, register offset, basic | STR, STRB, STRH | 1 | 2 | L01, D | - |
| Store register, register offset, scaled by 4/8 | STR | 1 | 2 | L01, D | - |
| Store register, register offset, scaled by 2 | STRH | 1 | 2 | I, L01, D | - |
| Store register, register offset, extend | STR, STRB, STRH | 1 | 2 | L01, D | - |
| Store register, register offset, extend, scale by 4/8 | STR | 1 | 2 | L01, D | - |
| Store register, register offset, extend, scale by 2 | STRH | 1 | 2 | I, L01, D | - |
| Store pair, immed offset | STP, STNP | 1 | 2 | L01, D | - |
| Store pair, immed post-index | STP | 1 | 2 | L01, D, I | - |
| Store pair, immed pre-index | STP | 1 | 2 | L01, D, I | - |

## 3.10 Tag Load Instructions

**Table 3-9 AArch64 Tag load instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Load allocation tag | LDG | 4 | 3 | L | - |
| Load multiple allocation tags | LDGM | 4 | 3 | L | - |

## 3.11 Tag Store instructions

**Table 3-10 AArch64 Tag store instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Store allocation tags to one or two granules, post-index | STG, ST2G | 1 | 2 | L01, D, I | - |
| Store allocation tags to one or two granules, pre-index | STG, ST2G | 1 | 2 | L01, D, I | - |
| Store allocation tags to one or two granules, signed offset | STG, ST2G | 1 | 2 | L01, D | - |
| Store allocation tag to one or two granules, zeroing, post-index | STZG, STZ2G | 1 | 2 | L01, D, I | - |
| Store allocation tag to one or two granules, zeroing, pre-index | STZG, STZ2G | 1 | 2 | L01, D, I | - |
| Store allocation tag to two granules, zeroing, signed offset | STZG, STZ2G | 1 | 2 | L01, D | - |
| Store allocation tag and reg pair to memory, post-index | STGP | 1 | 2 | L01, D, I | - |
| Store allocation tag and reg pair to memory, pre-index | STGP | 1 | 2 | L01, D, I | - |
| Store allocation tag and reg pair to memory, signed offset | STGP | 1 | 2 | L01, D | - |
| Store multiple allocation tags | STGM | 1 | 2 | L01, D | - |
| Store multiple allocation tags, zeroing | STZGM | 1 | 2 | L01, D | - |

## 3.12 FP data processing instructions

**Table 3-11 AArch64 FP data processing instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| FP absolute value | FABS | 2 | 4 | V | - |
| FP arithmetic | FADD, FSUB | 2 | 4 | V | - |
| FP compare | FCCMP{E}, FCMP{E} | 2 | 1 | V0 | - |
| FP divide, H-form | FDIV | 7 | 8/7 | V02 | 1 |
| FP divide, S-form | FDIV | 7 to 10 | 8/9 to 8/7 | V02 | 1 |
| FP divide, D-form | FDIV | 7 to 15 | 2/7 to 4/7 | V02 | 1 |
| FP min/max | FMIN, FMINNM, FMAX, FMAXNM | 2 | 4 | V | - |
| FP multiply | FMUL, FNMUL | 3 | 4 | V | 2 |
| FP multiply accumulate | FMADD, FMSUB, FNMADD, FNMSUB | 4 (2) | 4 | V | 3 |
| FP negate | FNEG | 2 | 4 | V | - |
| FP round to integral | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ, FRINT32X, FRINT64X, FRINT32Z, FRINT64Z | 3 | 2 | V02 | - |
| FP select | FCSEL | 2 | 4 | V | - |
| FP square root, H-form | FSQRT | 7 | 8/7 | V02 | 1 |
| FP square root, S-form | FSQRT | 7 to 9 | 1 to 8/7 | V02 | 1 |
| FP square root, D-form | FSQRT | 7 to 16 | 4/15 to 4/7 | V02 | 1 |

#### Notes:

1. FP divide and square root operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µOPs to the accumulate operands of an FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the FP multiply µOP has been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).

## 3.13 FP miscellaneous instructions

**Table 3-12 AArch64 FP miscellaneous instructions**

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| FP convert, from gen to vec reg | SCVTF, UCVTF | 3 | 1 | M0 | - |
| FP convert, from vec to gen reg | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU | 3 | 1 | V | - |
| FP convert, Javascript from vec to gen reg | FJCVTZS | 3 | 1 | V0 | - |
| FP convert, from vec to vec reg | FCVT, FCVTXN | 3 | 2 | V02 | - |
| FP move, immed | FMOV | 2 | 4 | V | - |
| FP move, register | FMOV | 2 | 4 | V | - |
| FP transfer, from gen to low half of vec reg | FMOV | 3 | 1 | M0 | - |
| FP transfer, from gen to high half of vec reg | FMOV | 5 | 1 | M0, V | - |
| FP transfer, from vec to gen reg | FMOV | 2 | 1 | V01 | - |

## 3.14 FP load instructions

The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the maximum latency to load all the vector registers written by the instruction. Compared to standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.
Table 3-13 AArch64 FP load instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Load vector reg, literal, S/D/Q forms | LDR | 6 | 3 | L | - |
| Load vector reg, unscaled immed | LDUR | 6 | 3 | L | - |
| Load vector reg, immed post-index | LDR | 6 | 3 | L, I | - |
| Load vector reg, immed pre-index | LDR | 6 | 3 | L, I | - |
| Load vector reg, unsigned immed | LDR | 6 | 3 | L | - |
| Load vector reg, register offset, basic | LDR | 6 | 3 | L | - |
| Load vector reg, register offset, scale, S/D-form | LDR | 6 | 3 | L | - |
| Load vector reg, register offset, scale, H/Q-form | LDR | 7 | 3 | I, L | - |
| Load vector reg, register offset, extend | LDR | 6 | 3 | L | - |
| Load vector reg, register offset, extend, scale, S/D-form | LDR | 6 | 3 | L | - |
| Load vector reg, register offset, extend, scale, H/Q-form | LDR | 7 | 3 | I, L | - |
| Load vector pair, immed offset, S/D-form | LDP, LDNP | 6 | 3 | L | - |
| Load vector pair, immed offset, Q-form | LDP, LDNP | 6 | 3/2 | L | - |
| Load vector pair, immed post-index, S/D-form | LDP | 6 | 3 | I, L | - |
| Load vector pair, immed post-index, Q-form | LDP | 6 | 3/2 | L, I | - |
| Load vector pair, immed pre-index, S/D-form | LDP | 6 | 3 | I, L | - |
| Load vector pair, immed pre-index, Q-form | LDP | 6 | 3/2 | L, I | - |

### 3.15 FP store instructions

Store MOPs are split into store address and store data µOPs. Once executed, stores are buffered and committed in the background.

Table 3-14 AArch64 FP store instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Store vector reg, unscaled immed, B/H/S/D-form | STUR | 2 | 2 | L01, V01 | - |
| Store vector reg, unscaled immed, Q-form | STUR | 2 | 2 | L01, V01 | - |
| Store vector reg, immed post-index, B/H/S/D-form | STR | 2 | 2 | L01, V01, I | - |
| Store vector reg, immed post-index, Q-form | STR | 2 | 2 | L01, V01, I | - |
| Store vector reg, immed pre-index, B/H/S/D-form | STR | 2 | 2 | L01, V01, I | - |
| Store vector reg, immed pre-index, Q-form | STR | 2 | 2 | L01, V01, I | - |
| Store vector reg, unsigned immed, B/H/S/D-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, unsigned immed, Q-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, basic, B/H/S/D-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, basic, Q-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, scale, H-form | STR | 2 | 2 | I, L01, V01 | - |
| Store vector reg, register offset, scale, S/D-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, scale, Q-form | STR | 2 | 2 | I, L01, V01 | - |
| Store vector reg, register offset, extend, B/H/S/D-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, extend, Q-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, extend, scale, H-form | STR | 2 | 2 | I, L01, V01 | - |
| Store vector reg, register offset, extend, scale, S/D-form | STR | 2 | 2 | L01, V01 | - |
| Store vector reg, register offset, extend, scale, Q-form | STR | 2 | 2 | I, L01, V01 | - |
| Store vector pair, immed offset, S-form | STP, STNP | 2 | 2 | L01, V01 | - |
| Store vector pair, immed offset, D-form | STP, STNP | 2 | 2 | L01, V01 | - |
| Store vector pair, immed offset, Q-form | STP, STNP | 2 | 1 | L01, V01 | - |
| Store vector pair, immed post-index, S-form | STP | 2 | 2 | I, L01, V01 | - |
| Store vector pair, immed post-index, D-form | STP | 2 | 2 | I, L01, V01 | - |
| Store vector pair, immed post-index, Q-form | STP | 2 | 1 | I, L01, V01 | - |
| Store vector pair, immed pre-index, S-form | STP | 2 | 2 | I, L01, V01 | - |
| Store vector pair, immed pre-index, D-form | STP | 2 | 2 | I, L01, V01 | - |
| Store vector pair, immed pre-index, Q-form | STP | 2 | 1 | I, L01, V01 | - |

### 3.16 ASIMD integer instructions

Table 3-15 AArch64 ASIMD integer instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| ASIMD absolute diff | SABD, UABD | 2 | 4 | V | - |
| ASIMD absolute diff accum | SABA, UABA | 4(1) | 2 | V13 | 2 |
| ASIMD absolute diff accum long | SABAL(2), UABAL(2) | 4(1) | 2 | V13 | 2 |
| ASIMD absolute diff long | SABDL(2), UABDL(2) | 2 | 4 | V | - |
| ASIMD arith, basic | ABS, ADD, NEG, SADDL(2), SADDW(2), SHADD, SHSUB, SSUBL(2), SSUBW(2), SUB, UADDL(2), UADDW(2), UHADD, UHSUB, USUBL(2), USUBW(2) | 2 | 4 | V | - |
| ASIMD arith, complex | ADDHN(2), RADDHN(2), RSUBHN(2), SQABS, SQADD, SQNEG, SQSUB, SRHADD, SUBHN(2), SUQADD, UQADD, UQSUB, URHADD, USQADD | 2 | 4 | V | - |
| ASIMD arith, pair-wise | ADDP, SADDLP, UADDLP | 2 | 4 | V | - |
| ASIMD arith, reduce, 4H/4S | ADDV, SADDLV, UADDLV | 2 | 2 | V13 | - |
| ASIMD arith, reduce, 8B/8H | ADDV, SADDLV, UADDLV | 4 | 2 | V13, V | - |
| ASIMD arith, reduce, 16B | ADDV, SADDLV, UADDLV | 4 | 1 | V13 | - |
| ASIMD compare | CMEQ, CMGE, CMGT, CMHI, CMHS, CMLE, CMLT, CMTST | 2 | 4 | V | - |
| ASIMD dot product | SDOT, UDOT | 3(1) | 4 | V | 2 |
| ASIMD dot product using signed and unsigned integers | SUDOT, USDOT | 3(1) | 4 | V | 2 |
| ASIMD logical | AND, BIC, EOR, MOV, MVN, NOT, ORN, ORR | 2 | 4 | V | - |
| ASIMD matrix multiply-accumulate | SMMLA, UMMLA, USMMLA | 3(1) | 4 | V | 2 |
| ASIMD max/min, basic and pairwise | SMAX, SMAXP, SMIN, SMINP, UMAX, UMAXP, UMIN, UMINP | 2 | 4 | V | - |
| ASIMD max/min, reduce, 4H/4S | SMAXV, SMINV, UMAXV, UMINV | 2 | 2 | V13 | - |
| ASIMD max/min, reduce, 8B/8H | SMAXV, SMINV, UMAXV, UMINV | 4 | 2 | V13, V | - |
| ASIMD max/min, reduce, 16B | SMAXV, SMINV, UMAXV, UMINV | 4 | 1 | V13 | - |
| ASIMD multiply | MUL, SQDMULH, SQRDMULH | 4 | 2 | V02 | - |
| ASIMD multiply accumulate | MLA, MLS | 4(1) | 2 | V02 | 1 |
| ASIMD multiply accumulate high | SQRDMLAH, SQRDMLSH | 4(2) | 2 | V02 | 1 |
| ASIMD multiply accumulate long | SMLAL(2), SMLSL(2), UMLAL(2), UMLSL(2) | 4(1) | 2 | V02 | 1 |
| ASIMD multiply accumulate saturating long | SQDMLAL(2), SQDMLSL(2) | 4(2) | 2 | V02 | 1 |
| ASIMD multiply/multiply long (8x8) polynomial, D-form | PMUL, PMULL(2) | 3 | 2 | V23 | 3 |
| ASIMD multiply/multiply long (8x8) polynomial, Q-form | PMUL, PMULL(2) | 3 | 2 | V23 | 3 |
| ASIMD multiply long | SMULL(2), UMULL(2), SQDMULL(2) | 3 | 2 | V02 | - |
| ASIMD pairwise add and accumulate long | SADALP, UADALP | 4(1) | 2 | V13 | 2 |
| ASIMD shift accumulate | SSRA, SRSRA, USRA, URSRA | 4(1) | 2 | V13 | 2 |
| ASIMD shift by immed, basic | SHL, SHLL(2), SHRN(2), SSHLL(2), SSHR, SXTL(2), USHLL(2), USHR, UXTL(2) | 2 | 2 | V13 | - |
| ASIMD shift by immed and insert, basic | SLI, SRI | 2 | 2 | V13 | - |
| ASIMD shift by immed, complex | RSHRN(2), SQRSHRN(2), SQRSHRUN(2), SQSHL{U}, SQSHRN(2), SQSHRUN(2), SRSHR, UQRSHRN(2), UQSHL, UQSHRN(2), URSHR | 4 | 2 | V13 | - |
| ASIMD shift by register, basic | SSHL, USHL | 2 | 2 | V13 | - |
| ASIMD shift by register, complex | SRSHL, SQRSHL, SQSHL, URSHL, UQRSHL, UQSHL | 4 | 2 | V13 | - |

#### Notes:

1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of integer multiply-accumulate µOPs to issue one every cycle or one every other cycle (accumulate latency shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of such µOPs to issue one every cycle (accumulate latency shown in parentheses).
3. This category includes instructions of the form "PMULL Vd.8H, Vn.8B, Vm.8B" and "PMULL2 Vd.8H, Vn.16B, Vm.16B".
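
As a rough illustration of Notes 1 and 2, the following sketch estimates how long a chain of dependent accumulate µOPs takes when each µOP waits only on the accumulator written by the previous one. It is a simplified model, not a description of the actual scheduler: the function name `accumulate_chain_cycles` is hypothetical, and the model assumes a single accumulator register, no contention for pipelines, and multiplicand operands that are always ready. The latency pairs are the ones listed in Table 3-15.

```python
def accumulate_chain_cycles(n_ops: int, exec_latency: int, accum_latency: int) -> int:
    """Estimate cycles from issue of the first µOP to the final result.

    Simplified model (an assumption, not the real scheduler): each µOP in
    the chain can issue `accum_latency` cycles after the previous one, and
    the last µOP still needs the full `exec_latency` to produce its result.
    """
    if n_ops <= 0:
        return 0
    return (n_ops - 1) * accum_latency + exec_latency


# Latency pairs as listed in Table 3-15: (exec latency, accumulate latency).
LATENCIES = {
    "MLA": (4, 1),       # ASIMD multiply accumulate
    "SQRDMLAH": (4, 2),  # ASIMD multiply accumulate high
}

for name, (latency, accumulate) in LATENCIES.items():
    forwarded = accumulate_chain_cycles(8, latency, accumulate)
    # Without late forwarding, every link would cost the full exec latency.
    unforwarded = accumulate_chain_cycles(8, latency, latency)
    print(f"{name}: {forwarded} cycles with late forwarding, {unforwarded} without")
```

Under this model a single chain with an accumulate latency of 1 issues at most one µOP per cycle, so reaching the listed throughput of 2 would require at least two independent accumulators.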
### 3.17 ASIMD floating-point instructions

Table 3-16 AArch64 ASIMD floating-point instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| ASIMD FP absolute value/difference | FABS, FABD | 2 | 4 | V | - |
| ASIMD FP arith, normal | FADD, FSUB, FADDP | 2 | 4 | V | - |
| ASIMD FP compare | FACGE, FACGT, FCMEQ, FCMGE, FCMGT, FCMLE, FCMLT | 2 | 4 | V | - |
| ASIMD FP complex add | FCADD | 2 | 4 | V | - |
| ASIMD FP complex multiply add | FCMLA | 4(2) | 4 | V | 1 |
| ASIMD FP convert, long (F16 to F32) | FCVTL(2) | 4 | 1 | V02 | - |
| ASIMD FP convert, long (F32 to F64) | FCVTL(2) | 3 | 2 | V02 | - |
| ASIMD FP convert, narrow (F32 to F16) | FCVTN(2) | 4 | 1 | V02 | - |
| ASIMD FP convert, narrow (F64 to F32) | FCVTN(2), FCVTXN(2) | 3 | 2 | V02 | - |
| ASIMD FP convert, other, D-form F32 and Q-form F64 | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU, SCVTF, UCVTF | 3 | 2 | V02 | - |
| ASIMD FP convert, other, D-form F16 and Q-form F32 | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU, SCVTF, UCVTF | 4 | 1 | V02 | - |
| ASIMD FP convert, other, Q-form F16 | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU, SCVTF, UCVTF | 6 | 1/2 | V02 | - |
| ASIMD FP divide, D-form, F16 | FDIV | 7 | 2/7 | V02 | 3 |
| ASIMD FP divide, D-form, F32 | FDIV | 7 to 10 | 4/9 to 4/7 | V02 | 3 |
| ASIMD FP divide, Q-form, F16 | FDIV | 10 to 13 | 2/13 to 1/5 | V02 | 3 |
| ASIMD FP divide, Q-form, F32 | FDIV | 7 to 10 | 2/9 to 2/7 | V02 | 3 |
| ASIMD FP divide, Q-form, F64 | FDIV | 7 to 15 | 1/7 to 2/7 | V02 | 3 |
| ASIMD FP max/min, normal | FMAX, FMAXNM, FMIN, FMINNM | 2 | 4 | V | - |
| ASIMD FP max/min, pairwise | FMAXP, FMAXNMP, FMINP, FMINNMP | 2 | 4 | V | - |
| ASIMD FP max/min, reduce, F32 and D-form F16 | FMAXV, FMAXNMV, FMINV, FMINNMV | 4 | 2 | V | - |
| ASIMD FP max/min, reduce, Q-form F16 | FMAXV, FMAXNMV, FMINV, FMINNMV | 6 | 4/3 | V | - |
| ASIMD FP multiply | FMUL, FMULX | 3 | 4 | V | 2 |
| ASIMD FP multiply accumulate | FMLA, FMLS | 4(2) | 4 | V | 1 |
| ASIMD FP multiply accumulate long | FMLAL(2), FMLSL(2) | 4(2) | 4 | V | 1 |
| ASIMD FP negate | FNEG | 2 | 4 | V | - |
| ASIMD FP round, D-form F32 and Q-form F64 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ, FRINT32X, FRINT64X, FRINT32Z, FRINT64Z | 3 | 2 | V02 | - |
| ASIMD FP round, D-form F16 and Q-form F32 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ, FRINT32X, FRINT64X, FRINT32Z, FRINT64Z | 4 | 1 | V02 | - |
| ASIMD FP round, Q-form F16 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ, FRINT32X, FRINT64X, FRINT32Z, FRINT64Z | 6 | 1/2 | V02 | - |
| ASIMD FP square root, D-form, F16 | FSQRT | 7 | 2/7 | V02 | 3 |
| ASIMD FP square root, D-form, F32 | FSQRT | 7 to 10 | 4/9 to 4/7 | V02 | 3 |
| ASIMD FP square root, Q-form, F16 | FSQRT | 11 to 13 | 2/13 to 2/11 | V02 | 3 |
| ASIMD FP square root, Q-form, F32 | FSQRT | 7 to 10 | 2/9 to 2/7 | V02 | 3 |
| ASIMD FP square root, Q-form, F64 | FSQRT | 7 to 16 | 2/15 to 2/7 | V02 | 3 |

#### Notes:

1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µOPs to the accumulate operands of an ASIMD FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the ASIMD FP multiply µOP has been issued.
3. ASIMD divide and square root operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.

### 3.18 ASIMD BFloat16 (BF16) instructions

Table 3-17 AArch64 ASIMD BFloat16 (BF16) instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| ASIMD convert, F32 to BF16 | BFCVTN, BFCVTN2 | 4 | 1 | V02 | - |
| ASIMD dot product | BFDOT | 4(2) | 4 | V | 1 |
| ASIMD matrix multiply accumulate | BFMMLA | 5(3) | 4 | V | 1 |
| ASIMD multiply accumulate long | BFMLALB, BFMLALT | 4(2) | 4 | V | 1 |
| Scalar convert, F32 to BF16 | BFCVT | 3 | 2 | V02 | - |

### 3.19 ASIMD miscellaneous instructions

Table 3-18 AArch64 ASIMD miscellaneous instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| ASIMD bit reverse | RBIT | 2 | 4 | V | - |
| ASIMD bitwise insert | BIF, BIT, BSL | 2 | 4 | V | - |
| ASIMD count | CLS, CLZ, CNT | 2 | 4 | V | - |
| ASIMD duplicate, gen reg | DUP | 3 | 1 | M0 | - |
| ASIMD duplicate, element | DUP | 2 | 4 | V | - |
| ASIMD extract | EXT | 2 | 4 | V | - |
| ASIMD extract narrow | XTN(2) | 2 | 4 | V | - |
| ASIMD extract narrow, saturating | SQXTN(2), SQXTUN(2), UQXTN(2) | 4 | 2 | V13 | - |
| ASIMD insert, element to element | INS | 2 | 4 | V | - |
| ASIMD move, FP immed | FMOV | 2 | 4 | V | - |
| ASIMD move, integer immed | MOVI, MVNI | 2 | 4 | V | - |
| ASIMD reciprocal and square root estimate, D-form U32 | URECPE, URSQRTE | 3 | 2 | V02 | - |
| ASIMD reciprocal and square root estimate, Q-form U32 | URECPE, URSQRTE | 4 | 1 | V02 | - |
| ASIMD reciprocal and square root estimate, D-form F32 and scalar forms | FRECPE, FRSQRTE | 3 | 2 | V02 | - |
| ASIMD reciprocal and square root estimate, D-form F16 and Q-form F32 | FRECPE, FRSQRTE | 4 | 1 | V02 | - |
| ASIMD reciprocal and square root estimate, Q-form F16 | FRECPE, FRSQRTE | 6 | 1/2 | V02 | - |
| ASIMD reciprocal exponent | FRECPX | 3 | 2 | V02 | - |
| ASIMD reciprocal step | FRECPS, FRSQRTS | 4 | 4 | V | - |
| ASIMD reverse | REV16, REV32, REV64 | 2 | 4 | V | - |
| ASIMD table lookup, 1 or 2 table regs | TBL | 2 | 2 | V01 | - |
| ASIMD table lookup, 3 table regs | TBL | 4 | 1 | V01 | - |
| ASIMD table lookup, 4 table regs | TBL | 4 | 2/3 | V01 | - |
| ASIMD table lookup extension, 1 table reg | TBX | 2 | 4 | V | - |
| ASIMD table lookup extension, 2 table reg | TBX | 4 | 2 | V | - |
| ASIMD table lookup extension, 3 table reg | TBX | 6 | 4/3 | V | - |
| ASIMD table lookup extension, 4 table reg |
TBX | 6 | 4/5 | V | - | | ASIMD transfer, element to gen reg | UMOV, SMOV | 2 | 1 | V01 | - | | ASIMD transfer, gen reg to element | INS | 5 | 1 | M0, V | - | | ASIMD transpose | TRN1, TRN2 | 2 | 4 | V | - | | ASIMD unzip/zip | UZP1, UZP2,<br>ZIP1, ZIP2 | 2 | 4 | V | - | ### 3.20 ASIMD load instructions The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the maximum latency to load all the vector registers written by the instruction. Compared to standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines. Table 3-19 AArch64 ASIMD load instructions | Instruction Group | AArch64<br>Instructions | Exec<br>Latency | Execution<br>Throughput | Utilized<br>Pipelines | Notes | |----------------------------------------------------|-------------------------|-----------------|-------------------------|-----------------------|-------| | ASIMD load, 1 element, multiple,<br>1 reg, D-form | LD1 | 6 | 3 | L | - | | ASIMD load, 1 element, multiple,<br>1 reg, Q-form | LD1 | 6 | 3 | L | - | | ASIMD load, 1 element, multiple,<br>2 reg, D-form | LD1 | 6 | 3/2 | L | - | | ASIMD load, 1 element, multiple,<br>2 reg, Q-form | LD1 | 6 | 3/2 | L | - | | ASIMD load, 1 element, multiple,<br>3 reg, D-form | LD1 | 6 | 1 | L | - | | ASIMD load, 1 element, multiple,<br>3 reg, Q-form | LD1 | 6 | 1 | L | - | | ASIMD load, 1 element, multiple,<br>4 reg, D-form | LD1 | 7 | 3/4 | L | - | | ASIMD load, 1 element, multiple,<br>4 reg, Q-form | LD1 | 7 | 3/4 | L | - | | ASIMD load, 1 element, one lane,<br>B/H/S | LD1 | 8 | 3 | L, V | - | | ASIMD load, 1 element, one lane,<br>D | LD1 | 8 | 3 | L, V | - | | ASIMD load, 1 element, all lanes,<br>D-form, B/H/S | LD1R | 8 | 3 | L, V | - | | ASIMD load, 1 element, all lanes,<br>D-form, D | LD1R | 8 | 3 | L, V | - | | ASIMD load, 1 element, all lanes,<br>Q-form | LD1R | 8 | 3 | L, V | - | | ASIMD load, 2 element, multiple,<br>D-form, B/H/S | LD2 | 8 | 2 | L, V | - | | ASIMD load, 2 element, 
multiple,<br>Q-form, B/H/S | LD2 | 8 | 3/2 | L, V | - | | ASIMD load, 2 element, multiple,<br>Q-form, D | LD2 | 8 | 3/2 | L, V | - | | ASIMD load, 2 element, one lane,<br>B/H | LD2 | 8 | 2 | L, V | - | | ASIMD load, 2 element, one lane,<br>S | LD2 | 8 | 2 | L, V | - | | ASIMD load, 2 element, one lane,<br>D | LD2 | 8 | 2 | L, V | - | | ASIMD load, 2 element, all lanes,<br>D-form, B/H/S | LD2R | 8 | 2 | L, V | - | | ASIMD load, 2 element, all lanes,<br>D-form, D | LD2R | 8 | 2 | L, V | - | | Instruction Group | AArch64<br>Instructions | Exec<br>Latency | Execution<br>Throughput | Utilized<br>Pipelines | Notes | |----------------------------------------------------|-------------------------|-----------------|-------------------------|-----------------------|-------| | ASIMD load, 2 element, all lanes,<br>Q-form | LD2R | 8 | 2 | L, V | - | | ASIMD load, 3 element, multiple, D-form, B/H/S | LD3 | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, multiple, Q-form, B/H/S | LD3 | 8 | 1 | L, V | - | | ASIMD load, 3 element, multiple, Q-form, D | LD3 | 8 | 1 | L, V | - | | ASIMD load, 3 element, one lane, B/H | LD3 | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, one lane, S | LD3 | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, one lane, D | LD3 | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, all lanes,<br>D-form, B/H/S | LD3R | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, all lanes, D-form, D | LD3R | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, all lanes,<br>Q-form, B/H/S | LD3R | 8 | 4/3 | L, V | - | | ASIMD load, 3 element, all lanes,<br>Q-form, D | LD3R | 8 | 4/3 | L, V | - | | ASIMD load, 4 element, multiple, D-form, B/H/S | LD4 | 8 | 1 | L, V | - | | ASIMD load, 4 element, multiple,<br>Q-form, B/H/S | LD4 | 9 | 1/2 | L, V | - | | ASIMD load, 4 element, multiple,<br>Q-form, D | LD4 | 9 | 1/2 | L, V | - | | ASIMD load, 4 element, one lane, B/H | LD4 | 8 | 1 | L, V | - | | ASIMD load, 4 element, one lane, S | LD4 | 8 | 1 | L, V | - | | ASIMD load, 4 element, one lane, D | 
LD4 | 8 | 1 | L, V | - | | ASIMD load, 4 element, all lanes, D-form, B/H/S | LD4R | 8 | 1 | L, V | - | | ASIMD load, 4 element, all lanes, D-form, D | LD4R | 8 | 1 | L, V | - | | ASIMD load, 4 element, all lanes,<br>Q-form, B/H/S | LD4R | 8 | 1 | L, V | - | | ASIMD load, 4 element, all lanes,<br>Q-form, D | LD4R | 8 | 1 | L, V | - | | (ASIMD load, writeback form) | - | - | - | I | 1 | Notes: 1. Writeback forms of load instructions require an extra $\mu$ OP to update the base address. This update is typically performed in parallel with the load $\mu$ OP (update latency shown in parentheses). ### 3.21 ASIMD store instructions Stores MOPs are split into store address and store data $\mu$ OPs. Once executed, stores are buffered and committed in the background. Table 3-20 AArch64 ASIMD store instructions | Instruction Group | AArch64<br>Instructions | Exec<br>Latency | Execution<br>Throughput | Utilized Pipelines | Notes | |----------------------------------------------------|-------------------------|-----------------|-------------------------|--------------------|-------| | ASIMD store, 1 element,<br>multiple, 1 reg, D-form | ST1 | 2 | 2 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 1 reg, Q-form | ST1 | 2 | 2 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 2 reg, D-form | ST1 | 2 | 2 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 2 reg, Q-form | ST1 | 2 | 1 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 3 reg, D-form | ST1 | 2 | 1 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 3 reg, Q-form | ST1 | 2 | 2/3 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 4 reg, D-form | ST1 | 2 | 1 | L01, V01 | - | | ASIMD store, 1 element,<br>multiple, 4 reg, Q-form | ST1 | 2 | 1/2 | L01, V01 | - | | ASIMD store, 1 element, one lane, B/H/S | ST1 | 4 | 1 | L01, V01 | - | | ASIMD store, 1 element, one lane, D | ST1 | 4 | 1 | L01, V01 | - | | ASIMD store, 2 element,<br>multiple, D-form, B/H/S | ST2 | 4 | 1 | V01, L01 | - | | ASIMD 
store, 2 element, multiple, Q-form, B/H/S | ST2 | 4 | 1/2 | V01, L01 | - |
| ASIMD store, 2 element, multiple, Q-form, D | ST2 | 4 | 1/2 | V01, L01 | - |
| ASIMD store, 2 element, one lane, B/H/S | ST2 | 4 | 1 | V01, L01 | - |
| ASIMD store, 2 element, one lane, D | ST2 | 4 | 1 | V01, L01 | - |
| ASIMD store, 3 element, multiple, D-form, B/H/S | ST3 | 5 | 1/2 | V01, L01 | - |
| ASIMD store, 3 element, multiple, Q-form, B/H/S | ST3 | 6 | 1/3 | V01, L01 | - |
| ASIMD store, 3 element, multiple, Q-form, D | ST3 | 6 | 1/3 | V01, L01 | - |
| ASIMD store, 3 element, one lane, B/H | ST3 | 5 | 1/2 | V01, L01 | - |
| ASIMD store, 3 element, one lane, S | ST3 | 5 | 1/2 | V01, L01 | - |
| ASIMD store, 3 element, one lane, D | ST3 | 5 | 1/2 | V01, L01 | - |
| ASIMD store, 4 element, multiple, D-form, B/H/S | ST4 | 6 | 1/3 | V01, L01 | - |
| ASIMD store, 4 element, multiple, Q-form, B/H/S | ST4 | 7 | 1/6 | V01, L01 | - |
| ASIMD store, 4 element, multiple, Q-form, D | ST4 | 5 | 1/4 | V01, L01 | - |
| ASIMD store, 4 element, one lane, B/H/S | ST4 | 6 | 2/3 | V01, L01 | - |
| ASIMD store, 4 element, one lane, D | ST4 | 4 | 1/2 | V01, L01 | - |
| (ASIMD store, writeback form) | - | - | - | I | 1 |

#### Notes:

1. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically performed in parallel with the store µOP (update latency shown in parentheses).

### 3.22 Cryptography extensions

Table 3-21 AArch64 Cryptography extensions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Crypto AES ops | AESD, AESE, AESIMC, AESMC | 2 | 2 | V01 | - |
| Crypto polynomial (64x64) multiply long | PMULL(2) | 2 | 2 | V23 | - |
| Crypto SHA1 hash acceleration op | SHA1H | 2 | 1 | V0 | - |
| Crypto SHA1 hash acceleration ops | SHA1C, SHA1M, SHA1P | 4 | 1 | V0 | - |
| Crypto SHA1 schedule acceleration ops | SHA1SU0, SHA1SU1 | 2 | 1 | V0 | - |
| Crypto SHA256 hash acceleration ops | SHA256H, SHA256H2 | 4 | 1 | V0 | - |
| Crypto SHA256 schedule acceleration ops | SHA256SU0, SHA256SU1 | 2 | 1 | V0 | - |
| Crypto SHA512 hash acceleration ops | SHA512H, SHA512H2, SHA512SU0, SHA512SU1 | 2 | 1 | V0 | - |
| Crypto SHA3 ops | BCAX, EOR3, RAX1, XAR | 2 | 1 | V0 | - |
| Crypto SM3 ops | SM3PARTW1, SM3PARTW2, SM3SS1, SM3TT1A, SM3TT1B, SM3TT2A, SM3TT2B | 2 | 1 | V0 | - |
| Crypto SM4 ops | SM4E, SM4EKEY | 4 | 1 | V0 | - |

### 3.23 CRC

Table 3-22 AArch64 CRC

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| CRC checksum ops | CRC32, CRC32C | 2 | 1 | M0 | 1 |

#### Notes:

1. CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP. This results in a 1 cycle reduction in latency as seen by the consumer.
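
To make Note 1 concrete, the sketch below estimates the cycle count of a chain of dependent CRC µOPs, where each µOP consumes the checksum produced by the previous one. It is a simplified model under stated assumptions rather than a description of the hardware: the names `crc_chain_cycles`, `CRC_EXEC_LATENCY`, and `CRC_LATE_FORWARD_GAIN` are hypothetical, the one-cycle saving is assumed to apply to every CRC-to-CRC link, and no other instruction is assumed to compete for the M0 pipeline.

```python
CRC_EXEC_LATENCY = 2       # Exec latency listed in Table 3-22
CRC_LATE_FORWARD_GAIN = 1  # Cycles saved as seen by a CRC consumer (Note 1)


def crc_chain_cycles(n_ops: int, late_forwarding: bool = True) -> int:
    """Estimate cycles for a chain of n dependent CRC µOPs.

    Assumption: every link between two CRC µOPs benefits from late
    forwarding, while the final result is still observed after the full
    exec latency.
    """
    if n_ops <= 0:
        return 0
    gain = CRC_LATE_FORWARD_GAIN if late_forwarding else 0
    link_latency = CRC_EXEC_LATENCY - gain
    return (n_ops - 1) * link_latency + CRC_EXEC_LATENCY


# Compare a 16-µOP dependent chain with and without late forwarding.
print(crc_chain_cycles(16), crc_chain_cycles(16, late_forwarding=False))
```

Under these assumptions a dependent chain advances one CRC µOP per cycle, which is consistent with the listed execution throughput of 1 on M0.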
### 3.24 SVE Predicate instructions

Table 3-23 SVE Predicate instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Loop control, based on predicate | BRKA, BRKB | 2 | 2 | M | 1 |
| Loop control, based on predicate and flag setting | BRKAS, BRKBS | 3 | 2 | M | 1 |
| Loop control, propagating | BRKN, BRKPA, BRKPB | 2 | 1 | M0 | 1 |
| Loop control propagating and flag setting | BRKNS, BRKPAS, BRKPBS | 3 | 1 | M0, M | 1 |
| Loop control, based on GPR | WHILEGE, WHILEGT, WHILEHI, WHILEHS, WHILELE, WHILELO, WHILELS, WHILELT, WHILERW, WHILEWR | 3 | 1 | M | - |
| Loop terminate | CTERMEQ, CTERMNE | 1 | 1 | M | - |
| Predicate counting scalar | ADDPL, ADDVL, CNTB, CNTH, CNTW, CNTD, DECB, DECH, DECW, DECD, INCB, INCH, INCW, INCD, RDVL, SQDECB, SQDECH, SQDECW, SQDECD, SQINCB, SQINCH, SQINCW, SQINCD, UQDECB, UQDECH, UQDECW, UQDECD, UQINCB, UQINCH, UQINCW, UQINCD | 2 | 2 | M | - |
| Predicate counting scalar, ALL, {1,2,4} | INC, DEC | 1 | 4 | I | - |
| Predicate counting scalar, active predicate | CNTP, DECP, INCP, SQDECP, SQINCP, UQDECP, UQINCP | 2 | 2 | M | - |
| Predicate counting vector, active predicate | DECP, INCP, SQDECP, SQINCP, UQDECP, UQINCP | 7 | 1 | M, M0, V | - |
| Predicate logical | AND, BIC, EOR, MOV, NAND, NOR, NOT, ORN, ORR | 1 | 1 | M0 | 1 |
| Predicate logical, flag setting | ANDS, BICS, EORS, MOV, NANDS, NORS, NOTS, ORNS, ORRS | 1 | 1 | M0, M | 1 |
| Predicate reverse | REV | 2 | 2 | M | - |
| Predicate select | SEL | 1 | 1 | M0 | - |
| Predicate set | PFALSE, PTRUE | 2 | 2 | M | - |
| Predicate set/initialize, set flags | PTRUES | 3 | 2 | M | - |
| Predicate find first, next | PFIRST, PNEXT | 3 | 2 | M | - |
| Predicate test | PTEST | 1 | 2 | M | - |
| Predicate transpose | TRN1, TRN2 | 2 | 2 | M | - |
| Predicate unpack and widen | PUNPKHI, PUNPKLO | 2 | 2 | M | - |
| Predicate zip/unzip | ZIP1, ZIP2, UZP1, UZP2 | 2 | 2 | M | - |

#### Notes:

1. When the governing predicate is the same as the destination, the latency is increased by one cycle.

### 3.25 SVE integer instructions

Table 3-24 SVE integer instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Arithmetic, absolute diff | SABD, UABD | 2 | 4 | V | - |
| Arithmetic, absolute diff accum | SABA, UABA | 4(1) | 2 | V13 | 2 |
| Arithmetic, absolute diff accum long | SABALB, SABALT, UABALB, UABALT | 4(1) | 2 | V13 | 2 |
| Arithmetic, absolute diff long | SABDLB, SABDLT, UABDLB, UABDLT | 2 | 4 | V | - |
| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Arithmetic, basic | ABS, ADD, ADR, CNOT, NEG, SADDLB, SADDLT, SADDWB, SADDWT, SHADD, SHSUB, SSUBLB, SSUBLBT, SSUBLT, SSUBLTB, SSUBWB, SUBHNB, SUBHNT, UADDLB, UADDLT, UADDWB, UADDWT, UHADD, UHSUB, USUBLT, USUBWB, USUBWT | 2 | 4 | V | |
| Arithmetic, complex | ADDHNB, ADDHNT, RADDHNB, RADDHNT, RSUBHNB, RSUBHNT, SQABS, SQADD, SQNEG, SQSUB, SQSUBR, SRHADD, SUQADD, UQADD, UQSUB, UQSUBR, URHADD, USQADD | 2 | 4 | V | |
| Arithmetic, large integer | ADCLB, ADCLT, SBCLB, SBCLT | 2 | 4 | V | - |
| Arithmetic, pairwise add | ADDP | 2 | 4 | V | |
| Arithmetic, pairwise add and accum long | SADALP, UADALP | 4(1) | 2 | V13 | 2 |
| Arithmetic, shift | ASR, ASRR, LSL, LSLR, LSR, LSRR | 2 | 1 | V1 | - |
| Arithmetic, shift and accumulate | SRSRA, SSRA, URSRA, USRA | 4(1) | 2 | V13 | 2 |
| Arithmetic, shift by immediate | SHRNB, SHRNT, SSHLLB, SSHLLT, USHLLB, USHLLT | 2 | 2 | V13 | - |
| Arithmetic, shift by immediate and insert | SLI, SRI | 2 | 2 | V13 | |
| Arithmetic, shift complex | RSHRNB, RSHRNT, SQRSHL, SQRSHLR, SQRSHRNB, SQRSHRNT, SQRSHRUNT, SQSHL, SQSHLR, SQSHLU, SQSHRNB, SQSHRNT, SQSHRUNT, UQRSHL, UQRSHLR, UQRSHRNB, UQRSHRNT, UQSHRNT, UQSHRNB | 4 | 2 | V13 | - |
| Arithmetic, shift right for divide | ASRD | 4 | 2 | V13 | - |
| Arithmetic, shift rounding | SRSHL, SRSHLR, SRSHR, URSHL, URSHLR, URSHR | 4 | 2 | V13 | - |
| Bit manipulation | BDEP, BEXT, BGRP | 6 | 1/2 | V1 | - |
| Bitwise select | BSL, BSL1N, BSL2N, NBSL | 2 | 4 | V | - |
| Count/reverse bits | CLS, CLZ, CNT, RBIT | 2 | 4 | V | - |
| Broadcast logical bitmask immediate to vector | DUPM, MOV | 2 | 4 | V | - |
| Compare and set flags | CMPEQ, CMPGE, CMPGT, CMPHI, CMPHS, CMPLE, CMPLO, CMPLS, CMPLT, CMPNE | 4 | 1 | V0, M0 | 1 |
| Complex add | CADD, SQCADD | 2 | 4 | V | - |
| Complex dot product 8-bit element | CDOT | 3(1) | 4 | V | 2 |
| Complex dot product 16-bit element | CDOT | 4(1) | 2 | V02 | 2 |
| Complex multiply-add B, H, S element size | CMLA | 4(1) | 2 | V02 | 2 |
| Complex multiply-add D element size | CMLA | 5(3) | 1 | V02 | 2 |
| Conditional extract operations, scalar form | CLASTA, CLASTB | 8 | 1 | M0, V01 | - |
| Conditional extract operations, SIMD&FP scalar and vector forms | CLASTA, CLASTB, COMPACT, SPLICE | 3 | 1 | V1 | - |
| Convert to floating point, 64b to float or convert to double | SCVTF, UCVTF | 3 | 2 | V02 | - |
| Convert to floating point, 32b to single or half | SCVTF, UCVTF | 4 | 1 | V02 | - |
| Convert to floating point, 16b to half | SCVTF, UCVTF | 6 | 1/2 | V02 | - |
| Copy, scalar | CPY | 5 | 1 | M0, V | |
| Copy, scalar SIMD&FP or imm | CPY | 2 | 4 | V | |
| Divides, 32 bit | SDIV, SDIVR, UDIV, UDIVR | 7 to 12 | 1/11 to 1/7 | V0 | 3 |
| Divides, 64 bit | SDIV, SDIVR, UDIV, UDIVR | 7 to 20 | 1/20 to 1/7 | V0 | 3 |
| Dot product, 8 bit | SDOT, UDOT | 3(1) | 4 | V | 2 |
| Dot product, 8 bit, using signed and unsigned integers | SUDOT, USDOT | 3(1) | 4 | V | 2 |
| Dot product, 16 bit | SDOT, UDOT | 4(1) | 2 | V02 | 2 |
| Duplicate, immediate and indexed form | DUP, MOV | 2 | 4 | V | - |
| Duplicate, scalar form | DUP, MOV | 3 | 1 | M0 | - |
| Extend, sign or zero | SXTB, SXTH, SXTW, UXTB, UXTH, UXTW | 2 | 2 | V13 | - |
| Extract | EXT | 2 | 4 | V | - |
| Extract narrow saturating | SQXTNB, SQXTNT, SQXTUNB, SQXTUNT, UQXTNB, UQXTNT | 4 | 2 | V13 | - |
| Extract/insert operation, SIMD and FP scalar form | LASTA, LASTB, INSR | 3 | 1 | V1 | - |
| Extract/insert operation, scalar | LASTA, LASTB, INSR | 6 | 1 | V1, M0 | - |
| Histogram operations | HISTCNT, HISTSEG | 2 | 4 | V | |
| Horizontal operations, B, H, S form, immediate operands only | INDEX | 4 | 2 | V02 | - |
| Horizontal operations, B, H, S form, (scalar, immediate operands)/ scalar operands only / immediate, scalar operands | INDEX | 7 | 1 | M0, V02 | - |
| Horizontal operations, D form, immediate operands only | INDEX | 5 | 1 | V02 | - |
| Horizontal operations, D form, (scalar, immediate operands)/ scalar operands only / immediate, scalar operands | INDEX | 8 | 1/2 | M0, V02 | - |
| Logical | AND, BIC, EON, EOR, EORBT, EORTB, MOV, NOT, ORN, ORR | 2 | 4 | V | - |
| Max/min, basic and pairwise | SMAX, SMAXP, SMIN, SMINP, UMAX, UMAXP, UMIN, UMINP | 2 | 4 | V | - |
| Matching operations | MATCH, NMATCH | 2 | 1 | V0, M | 1,5 |
| Matrix multiply-accumulate | SMMLA, UMMLA, USMMLA | 3(1) | 4 | V | 2 |
| Move prefix | MOVPRFX | 2 | 4 | V | - |
| Multiply, B, H, S element size | MUL, SMULH, UMULH | 4 | 2 | V02 | - |
| Multiply, D element size | MUL, SMULH, UMULH | 5 | 1 | V02 | - |
| Multiply long | SMULLB, SMULLT, UMULLB, UMULLT | 4 | 2 | V02 | - |
| Multiply accumulate, B, H, S element size | MLA, MLS | 4(1) | 2 | V02 | 2 |
| Multiply accumulate, D element size | MLA, MLS, MAD, MSB | 5(3) | 1 | V02 | 2 |
| Multiply accumulate long | SMLALB, SMLSLB, SMLSLT, UMLALB, UMLALT, UMLSLB, UMLSLT | 4(1) | 2 | V02 | 2 |
| Multiply accumulate saturating doubling long regular | SQDMLALB, SQDMLALT, SQDMLALBT, SQDMLSLB, SQDMLSLT, SQDMLSLBT | 4(2) | 2 | V02 | 4 |
| Multiply saturating doubling high, B, H, S element size | SQDMULH | 4 | 2 | V02 | - |
| Multiply saturating doubling high, D element size | SQDMULH | 5 | 1 | V02 | - |
| Multiply saturating doubling long | SQDMULLB, SQDMULLT | 4 | 2 | V02 | - |
| Multiply saturating rounding doubling regular/complex accumulate, B, H, S element size | SQRDMLAH, SQRDMLSH, SQRDCMLAH | 4(2) | 2 | V02 | 4 |
| Multiply saturating rounding doubling regular/complex accumulate, D element size | SQRDMLAH, SQRDMLSH, SQRDCMLAH | 5(3) | 1 | V02 | 4 |
| Multiply saturating rounding doubling regular/complex, B, H, S element size | SQRDMULH | 4 | 2 | V02 | - |
| Multiply saturating rounding doubling regular/complex, D element size | SQRDMULH | 5 | 1 | V02 | - |
| Multiply/multiply long, (8x8) polynomial | PMUL, PMULLB, PMULLT | 2 | 2 | V23 | - |
| Predicate counting, vector | DECH, DECW, DECD, INCH, INCW, INCD, SQDECH, SQDECW, SQDECD, SQINCH, SQINCW, SQINCD, UQDECH, UQDECW, UQDECD, UQINCH, UQINCW, UQINCD | 2 | 4 | V | - |
| Reciprocal estimate | URECPE, URSQRTE | 4 | 1 | V02 | - |
| Reduction, arithmetic, B form | SADDV, UADDV, SMAXV, SMINV, UMAXV, UMINV | 11 | 1 | V, V13 | - |
| Reduction, arithmetic, H form | SADDV, UADDV, SMAXV, SMINV, UMAXV, UMINV | 9 | 1 | V, V13 | - |
| Reduction, arithmetic, S form | SADDV, UADDV, SMAXV, SMINV, UMAXV, UMINV | 8 | 8/5 | V, V13 | - |
| Reduction, logical | ANDV, EORV, ORV | 6 | 2 | V, V13 | - |
| Reverse, vector | REV, REVB, REVH, REVW | 2 | 4 | V | - |
| Select, vector form | MOV, SEL | 2 | 4 | V | - |
| Table lookup | TBL | 2 | 4 | V | - |
| Table lookup extension | TBX | 2 | 4 | V | |
| Transpose, vector form | TRN1, TRN2 | 2 | 4 | V | - |
| Unpack and extend | SUNPKHI, SUNPKLO, UUNPKHI, UUNPKLO | 2 | 4 | V | - |
| Zip/unzip | UZP1, UZP2, ZIP1, ZIP2 | 2 | 4 | V | - |

#### Notes:

1. When the governing predicate is the same as the destination, the latency is increased by one cycle.
2. SVE accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of such µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
3. SVE integer divide operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.
4. Same as 2, except that saturating instructions require an extra cycle of latency for late-forwarding accumulate operands.
5. If the consuming instruction has a flag source, the latency for this instruction is 4 cycles.

### 3.26 SVE floating-point instructions

Table 3-25 SVE floating-point instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Floating point absolute value/difference | FABD, FABS | 2 | 4 | V | - |
| Floating point arithmetic | FADD, FADDP, FNEG, FSUB, FSUBR | 2 | 4 | V | - |
| Floating point associative add, F16 | FADDA | 10 | 1/9 | V1 | - |
| Floating point associative add, F32 | FADDA | 6 | 1/5 | V1 | - |
| Floating point associative add, F64 | FADDA | 4 | 4 | V | - |
| Floating point compare | FACGE, FACGT, FACLE, FACLT, FCMEQ, FCMGE, FCMGT, FCMLE, FCMLT, FCMNE, FCMUO | 2 | 1 | V0 | - |
| Floating point complex add | FCADD | 3 | 4 | V | - |
| Floating point complex multiply add | FCMLA | 5(2) | 4 | V | 1 |
| Floating point convert, long or narrow (F16 to F32 or F32 to F16) | FCVT, FCVTLT, FCVTNT | 4 | 1 | V02 | - |
| Floating point convert, long or narrow (F16 to F64, F32 to F64, F64 to F32 or F64 to F16) | FCVT, FCVTLT, FCVTNT | 3 | 2 | V02 | - |
| Floating point convert, round to odd | FCVTX, FCVTXNT | 3 | 2 | V02 | - |
| Floating point base2 log, F16 | FLOGB | 6 | 1/2 | V02 | - |
| Floating point base2 log, F32 | FLOGB | 4 | 1 | V02 | - |
| Floating point base2 log, F64 | FLOGB | 3 | 2 | V02 | - |
| Floating point convert to integer, F16 | FCVTZS, FCVTZU | 6 | 1/2 | V02 | - |
| Floating point convert to integer, F32 | FCVTZS, FCVTZU | 4 | 1 | V02 | - |
| Floating point convert to integer, F64 | FCVTZS, FCVTZU | 3 | 2 | V02 | - |
| Floating point copy | FCPY, FDUP, FMOV | 2 | 4 | V | - |
| Floating point divide, F16 | FDIV, FDIVR | 10 to 13 | 1/6 to 1/5 | V02 | 2 |
| Floating point divide, F32 | FDIV, FDIVR | 7 to 10 | 2/9 to 2/7 | V02 | 2 |
| Floating point divide, F64 | FDIV, FDIVR | 7 to 15 | 2/14 to 2/7 | V02 | 2 |
| Floating point min/max pairwise | FMAXP, FMAXNMP, FMINP, FMINNMP | 2 | 4 | V | - |
| Floating point min/max | FMAX, FMIN, FMAXNM, FMINNM | 2 | 4 | V | - |
| Floating point multiply | FSCALE, FMUL, FMULX | 3 | 4 | V | - |
| Floating point multiply accumulate | FMLA, FMLS, FMAD, FMSB, FNMAD, FNMLA, FNMLS, FNMSB | 4(2) | 4 | V | 1 |
| Floating point multiply add/sub accumulate long | FMLALB, FMLALT, FMLSLB, FMLSLT | 4(2) | 4 | V | |
| Floating point reciprocal estimate, F16 | FRECPE, FRECPX, FRSQRTE | 6 | 1/2 | V02 | - |
| Floating point reciprocal estimate, F32 | FRECPE, FRECPX, FRSQRTE | 4 | 1 | V02 | - |
| Floating point reciprocal estimate, F64 | FRECPE, FRECPX, FRSQRTE | 3 | 2 | V02 | - |
| Floating point reciprocal step | FRECPS, FRSQRTS | 4 | 4 | V | - |
| Floating point reduction, F16 | FADDV, FMAXNMV, FMAXV, FMINNMV, FMINV | 6 | 4/3 | V | - |
| Floating point reduction, F32 | FADDV, FMAXNMV, FMAXV, FMINNMV, FMINV | 4 | 2 | V | - |
| Floating point reduction, F64 | FADDV, FMAXNMV, FMAXV, FMINNMV, FMINV | 2 | 4 | V | - |
| Floating point round to integral, F16 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 6 | 1/2 | V02 | - |
| Floating point round to integral, F32 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 4 | 1 | V02 | - |
| Floating point round to integral, F64 | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 3 | 2 | V02 | - |
| Floating point square root, F16 | FSQRT | 10 to 13 | 1/12 to 1/10 | V0 | 2 |
| Floating point square root, F32 | FSQRT | 7 to 10 | 1/9 to 1/7 | V0 | 2 |
| Floating point square root, F64 | FSQRT | 7 to 16 | 1/14 to 1/7 | V0 | 2 |
| Floating point trigonometric exponentiation | FEXPA | 3 | 1 | V1 | |
| Floating point trigonometric multiply add | FTMAD | 4 | 4 | V | |
| Floating point trigonometric, miscellaneous | FTSMUL, FTSSEL | 3 | 4 | V | - |

#### Notes:

1. SVE multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
2. SVE divide and square root operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.

### 3.27 SVE BFloat16 (BF16) instructions

Table 3-26 SVE BFloat16 (BF16) instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Convert, F32 to BF16 | BFCVT, BFCVTNT | 3 | 2 | V02 | - |
| Dot product | BFDOT | 4(2) | 4 | V | 1 |
| Matrix multiply accumulate | BFMMLA | 5(3) | 4 | V | 1 |
| Multiply accumulate long | BFMLALB, BFMLALT | 4(2) | 4 | V | 1 |

#### Notes:

1. SVE pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs, allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).

### 3.28 SVE Load instructions

The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the maximum latency to load all the vector registers written by the instruction.

Table 3-27 SVE Load instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Load vector | LDR | 6 | 3 | L | - |
| Load predicate | LDR | 6 | 2 | L, M | - |
| Contiguous load, scalar + imm | LD1B, LD1D, LD1H, LD1W, LD1SB, LD1SH, LD1SW | 6 | 3 | L | - |
| Contiguous load, scalar + scalar | LD1B, LD1D, LD1H, LD1W, LD1SB, LD1SH, LD1SW | 6 | 3 | L | - |

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Contiguous load broadcast, scalar + imm | LD1RB, LD1RH, LD1RD, LD1RW, LD1RSB, LD1RSH, LD1RSW, LD1RQB, LD1RQD, LD1RQH | 6 | 3 | L | - |
| Contiguous load broadcast, scalar + scalar | LD1RQB, LD1RQD, LD1RQH, LD1RQW | 6 | 3 | L | - |
| Non temporal load, scalar + imm | LDNT1B, LDNT1D, LDNT1H, LDNT1W | 6 | 3 | L | - |
| Non temporal load, scalar + scalar | LDNT1B, LDNT1D, LDNT1H, LDNT1W | 6 | 3 | L | - |
| Non temporal gather load, vector + scalar 32-bit element size | LDNT1B, LDNT1H, LDNT1W, LDNT1SB, LDNT1SH | 9 | 1 | L, V | - |
| Non temporal gather load, vector + scalar 64-bit element size | LDNT1B, LDNT1D, LDNT1H, LDNT1W, LDNT1SB, LDNT1SH, LDNT1SW | 9 | 1/2 | L, V1 | - |
| Contiguous first faulting load, scalar + scalar | LDFF1B, LDFF1D, LDFF1H, LDFF1W, LDFF1SB, LDFF1SD, LDFF1SH, LDFF1SW | 6 | 3 | L, S | - |
| Contiguous non faulting load, scalar + imm | LDNF1B, LDNF1D, LDNF1H, LDNF1W, LDNF1SB, LDNF1SH, LDNF1SW | 6 | 3 | L | - |
| Contiguous load two structures to two vectors, scalar + imm | LD2B, LD2D, LD2H, LD2W | 8 | 3/2 | V, L | - |
| Contiguous load two structures to two vectors, scalar + scalar | LD2B, LD2D, LD2H, LD2W | 9 | 3/2 | V, L, S | - |
| Contiguous load three structures to three vectors, scalar + imm | LD3B, LD3D, LD3H, LD3W | 9 | 3/2 | V, L | - |
| Contiguous load three structures to three vectors, scalar + scalar | LD3B, LD3D, LD3H, LD3W | 10 | 3/2 | V, L, S | - |
| Contiguous load four structures to four vectors, scalar + imm | LD4B, LD4D, LD4H, LD4W | 9 | 1/2 | V, L | - |
| Contiguous load four structures to four vectors, scalar + scalar | LD4B, LD4D, LD4H, LD4W | 10 | 1/2 | L, V, S | - |
| Gather load, vector + imm, 32-bit element size | LD1B, LD1H, LD1W, LD1SB, LD1SH, LD1SW, LDFF1B, LDFF1H, LDFF1W, LDFF1SB, LDFF1SH, LDFF1SW | 9 | 1 | L, V | - |
| Gather load, vector + imm, 64-bit element size | LD1B, LD1D, LD1H, LD1W, LD1SB, LD1SH, LD1SW, LDFF1B, LDFF1D, LDFF1H, LDFF1W, LDFF1SB, LDFF1SD, LDFF1SH, LDFF1SW | 9 | 1 | L, V | - |
| Gather load, 32-bit scaled offset | LD1H, LD1SH, LDFF1H, LDFF1SH, LD1W, LDFF1W, LDFF1SW | 10 | 1/2 | L, V | - |
| Gather load, 32-bit unpacked unscaled offset | LD1B, LD1SB, LDFF1B, LDFF1SB, LD1D, LDFF1D, LD1H, LD1SH, LDFF1H, LDFF1SH, LD1W, LD1SW, LDFF1W, LDFF1SW | 9 | 1 | L, V | - |

### 3.29 SVE Store instructions

Table 3-28 SVE Store instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Store from predicate reg | STR | 1 | 2 | L01 | - |
| Store from vector reg | STR | 2 | 2 | L01, V01 | - |
| Contiguous store, scalar + imm | ST1B, ST1H, ST1D, ST1W | 2 | 2 | L01, V01 | - |
| Contiguous store, scalar + scalar | ST1H | 2 | 2 | L01, S, V01 | - |
| Contiguous store, scalar + scalar | ST1B, ST1D, ST1W | 2 | 2 | L01, V01 | - |
| Contiguous store two structures from two vectors, scalar + imm | ST2B, ST2H, ST2D, ST2W | 4 | 1 | L01, V01 | - |
| Contiguous store two structures from two vectors, scalar + scalar | ST2H | 4 | 1 | L01, S, V01 | - |
| Contiguous store two structures from two vectors, scalar + scalar | ST2B, ST2D, ST2W | 4 | 1 | L01, V01 | - |
| Contiguous store three structures from three vectors, scalar + imm | ST3B, ST3D, ST3H, ST3W | 7 | 2/9 | L01, V01 | - |
| Contiguous store three structures from three vectors, scalar + scalar | ST3H | 7 | 2/9 | L01, S, V01 | - |
| Contiguous store three structures from three vectors, scalar + scalar | ST3B, ST3D, ST3W | 7 | 2/9 | L01, S, V01 | - |
| Contiguous store four structures from four vectors, scalar + imm | ST4B, ST4D, ST4H, ST4W | 11 | 1/9 | L01, V01 | - |
| Contiguous store four structures from four vectors, scalar + scalar | ST4H | 11 | 1/9 | L01, S, V01 | - |
| Contiguous store four structures from four vectors, scalar + scalar | ST4B, ST4D, ST4W | 11 | 1/9 | L01, S, V01 | - |
| Non temporal store, scalar + imm | STNT1B, STNT1D, STNT1H, STNT1W | 2 | 2 | L01, V01 | - |
| Non temporal store, scalar + scalar | STNT1H | 2 | 2 | L01, S, V01 | - |
| Non temporal store, scalar + scalar | STNT1B, STNT1D, STNT1W | 2 | 2 | L01, V01 | - |
| Scatter non temporal store, vector + scalar 32-bit element size | STNT1B, STNT1H, STNT1W | 4 | 1/2 | L01, V01 | - |
| Scatter non temporal store, vector + scalar 64-bit element size | STNT1B, STNT1D, STNT1H, STNT1W | 2 | 1 | L01, V01 | - |
| Scatter store vector + imm 32-bit element size | ST1B, ST1H, ST1W | 4 | 1/2 | L01, V01 | - |
| Scatter store vector + imm 64-bit element size | ST1B, ST1D, ST1H, ST1W | 2 | 1 | L01, V01 | - |
| Scatter store, 32-bit scaled offset | ST1H, ST1W | 4 | 1/2 | L01, V01 | - |
| Scatter store, 32-bit unpacked unscaled offset | ST1B, ST1D, ST1H, ST1W | 2 | 1 | L01, V01 | - |
| Scatter store, 32-bit unpacked scaled offset | ST1D, ST1H, ST1W | 2 | 1 | L01, V01 | - |
| Scatter store, 32-bit unscaled offset | ST1B, ST1H, ST1W | 4 | 1/2 | L01, V01 | - |
| Scatter store, 64-bit scaled offset | ST1D, ST1H, ST1W | 2 | 1 | L01, V01 | - |
| Scatter store, 64-bit unscaled offset | ST1B, ST1D, ST1H, ST1W | 2 | 1 | L01, V01 | - |

### 3.30 SVE Miscellaneous instructions

Table 3-29 SVE miscellaneous instructions

| Instruction Group | SVE Instruction | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Read first fault register, unpredicated | RDFFR | 2 | 1 | M0 | - |
| Read first fault register, predicated | RDFFR | 3 | 1 | M0, M | 1 |
| Read first fault register and set flags | RDFFRS | 4 | 1/2 | M0, M | 1 |
| Set first fault register | SETFFR | 2 | 1 | M0 | - |
| Write to first fault register | WRFFR | 2 | 1 | M0 | - |

#### Notes:

1. When the destination is the same as the governing predicate, the latency of the instruction increases by one cycle.
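
As a worked illustration of the accumulate latencies shown in parentheses in Tables 3-24 to 3-26 (for example `4(1)`), the following C sketch estimates when the last result of a chain of dependent accumulate µOPs becomes available. It is not taken from this guide: the function name is hypothetical, and it assumes an idealized model in which every µOP after the first issues exactly N cycles after its predecessor, with no other stalls.

```c
#include <assert.h>

/*
 * Hypothetical helper (an assumption for illustration, not from the guide):
 * estimated number of cycles until the last result of a chain of `count`
 * dependent accumulate uOPs is available.
 *
 *   exec_latency - full execution latency, e.g. 4 for an entry "4(1)"
 *   acc_latency  - accumulate latency N in parentheses, e.g. 1 for "4(1)"
 *
 * Idealized assumption: with late-forwarding of the accumulate operand,
 * each uOP after the first issues acc_latency cycles after its predecessor.
 */
static unsigned accumulate_chain_cycles(unsigned count,
                                        unsigned exec_latency,
                                        unsigned acc_latency)
{
    assert(exec_latency >= acc_latency);
    if (count == 0)
        return 0;
    return exec_latency + (count - 1) * acc_latency;
}
```

Under this model, eight dependent MLA µOPs with B, H, or S elements (latency `4(1)`) take 4 + 7 × 1 = 11 cycles, compared with 8 × 4 = 32 cycles if each µOP had to wait for the full execution latency of its predecessor.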
### 3.31 SVE Cryptographic instructions

Table 3-30 SVE cryptographic instructions

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
|---|---|---|---|---|---|
| Crypto AES ops | AESD, AESE, AESIMC, AESMC | 2 | 2 | V01 | - |
| Crypto SHA3 ops | BCAX, EOR3, RAX1, XAR | 2 | 1 | V0 | - |
| Crypto SM4 ops | SM4E, SM4EKEY | 4 | 1 | V0 | - |

# 4 Special considerations

### 4.1 Dispatch constraints

Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture is subject to several constraints. It is important to consider these constraints during code generation to maximize the effective dispatch bandwidth and subsequent execution bandwidth of Cortex-X2.

The dispatch stage can process up to 8 MOPs per cycle and dispatch up to 16 µOPs per cycle, with the following limitations on the number of µOPs of each type that may be simultaneously dispatched:

- Up to 4 µOPs utilizing the S or B pipelines
- Up to 4 µOPs utilizing the M pipelines
- Up to 2 µOPs utilizing the M0 pipelines
- Up to 2 µOPs utilizing the V0 pipeline
- Up to 2 µOPs utilizing the V1 pipeline
- Up to 6 µOPs utilizing the L pipelines

If more µOPs are available to be dispatched in a given cycle than the constraints above can support, µOPs are dispatched in oldest-to-youngest age order to the extent allowed by those constraints.

### 4.2 Optimizing general-purpose register spills and fills

Register transfers between general-purpose registers (GPR) and ASIMD registers (VPR) have lower latency than reads and writes to the cache hierarchy. It is therefore recommended that GPRs be spilled to and filled from the VPR rather than memory, when possible.

### 4.3 Optimizing memory routines

To achieve maximum throughput for memory copy (or similar loops), one should do the following:
- Unroll the loop to include multiple load and store operations per iteration, minimizing the overheads of looping.
- Align stores on a 32B boundary wherever possible.
- Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the example below:

```
Loop_start:
    SUBS x2, x2, #96          // x2 = remaining bytes; 96 bytes per iteration
    LDP  q3, q4, [x1, #0]     // x1 = source pointer
    STP  q3, q4, [x0, #0]     // x0 = destination pointer
    LDP  q3, q4, [x1, #32]
    STP  q3, q4, [x0, #32]
    LDP  q3, q4, [x1, #64]
    STP  q3, q4, [x0, #64]
    ADD  x1, x1, #96
    ADD  x0, x0, #96
    B.GT Loop_start
```

If the memory locations being copied are non-cacheable, the non-temporal version of LDPQ (LDNPQ) should be used. STPQ should still be used for the stores.

Similarly, it is recommended to use LDPQ to achieve maximum throughput for memcmp (memory compare) loops that compare cacheable memory. LDNPQ should be used for non-cacheable memory.

To achieve maximum throughput on memset, it is recommended that one do the following:

- Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping.

```
Loop_start:
    STP  q1, q3, [x0, #0]
    STP  q1, q3, [x0, #0x20]
    STP  q1, q3, [x0, #0x40]
    STP  q1, q3, [x0, #0x60]
    ADD  x0, x0, #0x80        // 128 bytes stored per iteration
    SUBS x2, x2, #0x80        // x2 = remaining bytes
    B.GT Loop_start
```

To achieve maximum performance on memset to zero, it is recommended that one use DC ZVA instead of STP. An optimal routine might look something like the following:

```
Loop_start:
    SUBS x2, x2, #0x80        // x2 = remaining bytes (assumed, as in the STP loop)
    DC   ZVA, x0              // zero one 64-byte block
    ADD  x0, x0, #0x40
    DC   ZVA, x0
    ADD  x0, x0, #0x40
    B.GT Loop_start
```

### 4.4 Load/Store alignment

The Armv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-X2 core handles most unaligned accesses without performance penalties. However, the following cases can reduce bandwidth or incur additional latency:

- Load operations that cross a cache-line (64-byte) boundary.
- Quad-word load operations that are not 4B aligned.
- Store operations that cross a 32B boundary.
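
The three cases above can be checked with simple address arithmetic. The following C sketch is illustrative only: the helper names are hypothetical, and the 64-byte, 32-byte, and 4-byte values are taken directly from the list above rather than from any Arm-provided API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when the byte range [addr, addr + size) spans more than one aligned
 * block of `boundary` bytes. `boundary` must be a non-zero power of two.
 */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    if (size == 0)
        return false;
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

/* Load that crosses a 64-byte cache-line boundary. */
static bool load_crosses_cache_line(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, 64);
}

/* Store that crosses a 32-byte boundary. */
static bool store_crosses_32b(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, 32);
}

/* Quad-word (16-byte) load whose address is not 4-byte aligned. */
static bool quad_load_misaligned(uintptr_t addr)
{
    return (addr & 3u) != 0;
}
```

A memory routine could use checks like these in a debug build to flag accesses that fall into one of the slower cases; they are not needed on the fast path itself.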
### 4.5 Store to Load Forwarding

The Cortex-X2 core allows data to be forwarded from store instructions to a load instruction with the following restrictions:

- The load start address should align with the start or middle address of the older store. This does not apply to LDPs that load two 32-bit registers.
- Loads of size greater than or equal to 8 bytes can get their data forwarded from a maximum of 2 stores. If there are 2 stores, each store should forward to either the first or the second half of the load.
- Loads of size less than or equal to 4 bytes can get their data forwarded from only 1 store.

### 4.6 AES encryption/decryption

Cortex-X2 can issue four AESE/AESMC/AESD/AESIMC instructions every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least four data chunks should be interleaved for maximum performance:

```
AESE  data0, key_reg
AESMC data0, data0
AESE  data1, key_reg
AESMC data1, data1
AESE  data2, key_reg
AESMC data2, data2
AESE  data3, key_reg
AESMC data3, data3
AESE  data0, key_reg
AESMC data0, data0
...
```

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions perform better when they are adjacent in the program code and both instructions use the same destination register.

### 4.7 Region based fast forwarding

The forwarding logic in the V pipelines is optimized to provide the lowest latency for instructions that are expected to commonly forward to one another. The effective latency of FP and ASIMD instructions as described in section 3 is increased by one cycle if the producer and consumer instructions are not part of the same forwarding region. These optimized forwarding regions are defined in the following table.
Table 4-1 Optimized forwarding regions

| Region | Instruction Types | Notes |
|---|---|---|
| 1 | ASIMD/SVE integer ALU, ASIMD/SVE integer shift, ASIMD/scalar insert and move, ASIMD/SVE integer abs/cmp/max/min and the ASIMD miscellaneous instructions in table 3-18. | 1 |
| 2 | FP/ASIMD/SVE floating-point multiply, FP/ASIMD/SVE floating point multiply-accumulate, FP/ASIMD/SVE compare, FP/ASIMD/SVE add/sub and the ASIMD miscellaneous instructions in table 3-18. | 1,2,3 |
| 3 | ASIMD/SVE Crypto and SHA1/SHA256 | - |
| 4 | ASIMD/SVE AES, ASIMD/SVE polynomial multiply and all the instruction types in region 1. | 1 |
| 5 | ASIMD/SVE BFDOT and BFMMLA instructions | - |

#### Notes:

1. Reciprocal step and estimate instructions are excluded from this region.
2. ASIMD/SVE extract narrow, saturating instructions are excluded from this region.
3. ASIMD miscellaneous instructions can only be consumers of this region.

The following instructions are not a part of any region:

- FP/ASIMD/SVE floating-point div/sqrt and SVE integer divides
- FP/ASIMD/SVE convert and rounding instructions that do not write to general purpose registers
- ASIMD/SVE integer mul/mac
- ASIMD/SVE integer reduction

In addition to the regions mentioned in the table above, all instructions in regions 1 and 2 can fast forward to FP/ASIMD/SVE stores, FP/ASIMD vector to integer register transfers and ASIMD converts that write to general purpose registers.

More special notes about the forwarding regions in table 4-1:

- Element sources (the non-vector operand in "by element" multiplies) used by ASIMD/SVE floating-point multiply and multiply-accumulate operations cannot be consumers.
- Complex shift by immediate/register and shift accumulate instructions cannot be producers (see sections 3.16 and 3.25) in region 1.
- Extract narrow, saturating instructions cannot be producers (see sections 3.19 and 3.25) in region 1.
- Absolute difference accumulate and pairwise add and accumulate instructions cannot be producers (see sections 3.16 and 3.25) in region 1.
- For floating-point producer-consumer pairs, the precision of the instructions should match (single, double or half) in region 2.
- Pair-wise floating-point instructions cannot be producers or consumers in region 2.

It is not advisable to interleave instructions belonging to different regions. Also, certain instructions can only be producers or consumers in a particular region but not both (see footnote 3 for table 4-1). For example, the code below interleaves producers and consumers from regions 1 and 2. This results in an additional latency of 1 cycle as seen by FMUL.

    FSUB v27.2s, v28.2s, v20.2s    // Region 2
    FADD v20.2s, v28.2s, v20.2s    // Region 2
    MOV  v27.s[1], v20.s[1]        // Region 2 consumer but not a region 2 producer
    FMUL v26.2s, v27.2s, v6.2s     // Region 2

### 4.8 Branch instruction alignment

Branch instruction and branch target instruction alignment and density can affect performance. For best-case performance, avoid placing more than four branch instructions within an aligned 32-byte instruction memory region.

### 4.9 FPCR self-synchronization

Programmers and compiler writers should note that writes to the FPCR register are self-synchronizing, that is, their effect on subsequent instructions can be relied upon without an intervening context synchronizing operation.

### 4.10 Special register access

The Cortex-X2 core performs register renaming for general purpose registers to enable speculative and out-of-order instruction execution. However, most special-purpose registers are not renamed. Instructions that read or write non-renamed registers are subject to one or more of the following additional execution constraints:

- Non-Speculative Execution: Instructions may only execute non-speculatively.
- In-Order Execution: Instructions must execute in-order with respect to other similar instructions or in some cases all instructions.
- Flush Side-Effects: Instructions trigger a flush side-effect after executing for synchronization.

The table below summarizes various special-purpose register read accesses and the associated execution constraints or side-effects.

Table 4-2 Special-purpose register read accesses

| Register Read | Non-Speculative | In-Order | Flush Side-Effect | Notes |
|---|---|---|---|---|
| APSR | Yes | Yes | No | 3 |
| CurrentEL | No | Yes | No | - |
| DAIF | No | Yes | No | - |
| DLR_EL0 | No | Yes | No | - |
| DSPSR_EL0 | No | Yes | No | - |
| ELR_* | No | Yes | No | - |
| FPCR | No | Yes | No | - |
| FPSCR | Yes | Yes | No | 2 |
| FPSR | Yes | Yes | No | 2 |
| NZCV | No | No | No | 1 |
| SP_* | No | No | No | 1 |
| SPSel | No | Yes | No | - |
| SPSR_* | No | Yes | No | - |
| FFR | No | Yes | No | - |

#### Notes:

1. The NZCV and SP registers are fully renamed.
2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to execute and retire.
3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.

The table below summarizes various special-purpose register write accesses and the associated execution constraints or side-effects.
Table 4-3 Special-purpose register write accesses

| Register Write | Non-Speculative | In-Order | Flush Side-Effect | Notes |
|----------------|-----------------|----------|-------------------|-------|
| APSR | Yes | Yes | No | 4 |
| DAIF | Yes | Yes | No | - |
| DLR_EL0 | Yes | Yes | No | - |
| DSPSR_EL0 | Yes | Yes | No | - |
| ELR_* | Yes | Yes | No | - |
| FPCR | Yes | Yes | Maybe | 2 |
| FPSCR | Yes | Yes | Maybe | 2, 3 |
| FPSR | Yes | Yes | No | 3 |
| NZCV | No | No | No | 1 |
| SP_* | No | No | No | 1 |
| SPSel | Yes | Yes | Yes | - |
| SPSR_* | Yes | Yes | No | - |
| FFR | Yes | Yes | No | - |

#### Notes:

1. The NZCV and SP registers are fully renamed.
2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a barrier which prevents subsequent instructions from executing. If the FPCR/FPSCR write is predicted to not change the control field values, it will execute without a barrier but trigger a flush if the values change.
3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent instructions from executing until the write completes.

### 4.11 Instruction fusion

Cortex-X2 can accelerate certain instruction pairs in an operation called fusion. Specific AArch64 instruction pairs that can be fused are as follows:

- AESE + AESMC (see Section 4.5 on AES Encryption/Decryption)
- AESD + AESIMC (see Section 4.5 on AES Encryption/Decryption)
- CMP/CMN (immediate) + B.cond
- CMP/CMN (register) + B.cond
- TST (immediate) + B.cond
- TST (register) + B.cond
- BICS (register) + B.cond
- NOP + Any instruction

These instruction pairs must be adjacent to each other in program code.
For CMP, CMN, TST and BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination register should be XZR or WZR if fusion is to take place.

### 4.12 Zero Latency MOVs

A subset of register-to-register move operations and move immediate operations are executed with zero latency. These instructions do not utilize the scheduling and execution resources of the machine. These are as follows:

- MOV Xd, #0
- MOV Xd, XZR
- MOV Wd, #0
- MOV Wd, WZR
- MOV Hd, WZR
- MOV Hd, XZR
- MOV Sd, WZR
- MOV Dd, XZR
- MOVI Dd, #0
- MOVI Vd.2D, #0
- MOV Wd, Wn
- MOV Xd, Xn

The last two instructions may not be executed with zero latency under certain conditions.

### 4.13 Cache maintenance operations

When using set/way invalidation operations on the L1 cache, it is recommended that software be written to traverse the sets in the inner loop and the ways in the outer loop.

### 4.14 Memory Tagging - Tagging Performance

To achieve maximum throughput for tag-only stores, it is recommended that one do the following.

- Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping.
- Use the STGM (or DCGVA) instruction as shown in the example below:

```
Loop_start:
    SUBS x2,x2,#0x80      // count down 0x80 per iteration; sets flags for B.GT
    STGM x1,[x0]          // store the tags held in x1 to the block at x0
    ADD  x0,x0,#0x40      // advance the address by 0x40
    STGM x1,[x0]
    ADD  x0,x0,#0x40
    B.GT Loop_start       // repeat while the remaining count is positive
```

To achieve maximum throughput for tag and zeroing out data, it is recommended that one do the following.

- Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping.
- Use the STZGM (or DCZGVA) instruction as shown in the example below:

```
Loop_start:
    SUBS  x2,x2,#0x80     // count down 0x80 per iteration; sets flags for B.GT
    STZGM x1,[x0]         // store tags from x1 and zero the data in the block at x0
    ADD   x0,x0,#0x40     // advance the address by 0x40
    STZGM x1,[x0]
    ADD   x0,x0,#0x40
    B.GT  Loop_start      // repeat while the remaining count is positive
```

To achieve maximum throughput for tag-loading, it is recommended that one do the following.

- Unroll the loop to include multiple load operations per iteration, minimizing the overheads of looping.
- Use the LDGM instruction as shown in the example below:

```
Loop_start:
    SUBS x2,x2,#0x80      // count down 0x80 per iteration; sets flags for B.GT
    LDGM x1,[x0]          // load the tags of the block at x0 into x1
    ADD  x0,x0,#0x40      // advance the address by 0x40
    LDGM x1,[x0]
    ADD  x0,x0,#0x40
    B.GT Loop_start       // repeat while the remaining count is positive
```

Also, it is recommended to use STZGM (or DCZGVA) to set tags if the data is not a concern.

### 4.15 Memory Tagging - Synchronous Mode

In synchronous tag checking mode, stores cannot be performed speculatively. Each store must complete a tag check before the next store can be executed non-speculatively. Thus, performance of stores in synchronous tag checking mode will be diminished. It is recommended to use asynchronous mode for better performance.

### 4.16 Complex ASIMD and SVE instructions

The bandwidth of the following ASIMD and SVE instructions is limited by decode constraints and it is advisable to avoid them when high performing code is desired.

#### **ASIMD**

- LD4R, post-indexed addressing, element size = 64b.
- LD4, single 4-element structure, post indexed addressing mode, element size = 64b.
- LD4, multiple 4-element structures, quad form.
- LD4, multiple 4-element structures, double word form.
- ST4, multiple 4-element structures, quad form, element size less than 64b.
- ST4, multiple 4-element structures, quad form, element size = 64b, post indexed addressing mode.

#### **SVE**

- LD1B gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b unscaled offset.
- LD1H gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
- LD1W gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
- LD3[B/H/W/D] contiguous (scalar + scalar addressing).
- LD4[B/H/D/W] contiguous (scalar + immediate addressing).
- LD4[B/H/D/W] contiguous (scalar + scalar addressing).
- LDFF1B gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b unscaled offset.
- LDFF1H gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
- LDFF1W gather (scalar + vector addressing) where vector index register is the same as the destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
- ST3[B/H/W/D] contiguous (scalar + scalar addressing).
- ST4[B/H/D/W] contiguous (scalar + immediate addressing).
- ST4[B/H/D/W] contiguous (scalar + scalar addressing).

### 4.17 MOVPRFX fusion

Under certain conditions, a mechanism called MOVPRFX fusion can be used to accelerate the execution of an instruction pair that consists of an SVE MOVPRFX instruction immediately followed in program order by an SVE integer, floating point or BF16 instruction. The SVE instructions and the conditions under which this fusion can be applied are listed in the table below.

| Instruction Group | SVE Instruction | Notes |
|-------------------|-----------------|-------|
| Integer Instructions | | |
| Arithmetic, absolute difference accumulate | SABA, SABALB, SABALT, UABA, UABALB, UABALT | - |
| Arithmetic, basic | ABS, ADD, CNOT, NEG, SHADD, SHSUB, SHSUBR, SUB, SUBR, UHADD, UHSUB, UHSUBR | For ADD and SUB, only the immediate and vector, predicated forms are fusible. |
| Arithmetic, complex | SQABS, SQADD, SQNEG, SQSUB, SQSUBR, SRHADD, SUQADD, UQADD, UQSUB, UQSUBR, URHADD, USQADD | For SQABS, SQSUB, UQADD and UQSUB, only the immediate and vector, predicated forms are fusible. |
| Arithmetic, large integer | ADCLB, ADCLT, SBCLB, SBCLT | - |
| Arithmetic, pairwise add | ADDP | - |
| Arithmetic, pairwise add and accum long | SADALP, UADALP | - |
| Arithmetic, shift | ASR, ASRR, LSL, LSLR, LSR, LSRR | For ASR, LSL and LSR, only the immediate, predicated and vector forms are fusible. |
| Arithmetic, shift and accumulate | SRSRA, SSRA, URSRA, USRA | - |
| Arithmetic, shift complex | SQRSHL, SQRSHLR, SQSHL, SQSHLR, SQSHLU, UQRSHL, UQRSHLR, UQSHL, UQSHLR | - |
| Arithmetic, shift right for divide | ASRD | - |
| Arithmetic, shift rounding | SRSHL, SRSHLR, SRSHR, URSHL, URSHLR, URSHR | - |
| Bitwise select | BSL, BSL1N, BSL2N, NBSL | - |
| Count/reverse bits | CLS, CLZ, CNT, RBIT | - |
| Complex add | CADD, SQCADD | - |
| Complex dot product | CDOT | - |
| Complex multiply-add | CMLA | - |
| Conditional extract operations | CLASTA, CLASTB, SPLICE | For CLASTA and CLASTB, only the vector forms are fusible. |
| Convert to floating point | SCVTF, UCVTF | - |
| Copy | CPY | All forms except the immediate, zeroing form are fusible. |
| Divides | SDIV, SDIVR, UDIV, UDIVR | - |
| Dot product | SDOT, UDOT, SUDOT, USDOT | - |
| Extend, sign or zero | SXTB, SXTH, SXTW, UXTB, UXTH, UXTW | - |
| Extract/insert operation | EXT, INSR | - |
| Logical | AND, BIC, EON, EOR, EORBT, EORTB, NOT, ORN, ORR | For AND, BIC, EOR and ORR, only the immediate and vector, predicated forms are fusible. |
| Max/min, basic and pairwise | SMAX, SMAXP, SMIN, SMINP, UMAX, UMAXP, UMIN, UMINP | - |
| Matrix multiply-accumulate | SMMLA, UMMLA, USMMLA | - |
| Multiply | MUL, SMULH, UMULH | For MUL, only the immediate and vector, predicated forms are fusible. For the others, only the predicated form is fusible. |
| Multiply accumulate | MLA, MLS | For the vector forms, only unpredicated and zeroing predicate forms of MOVPRFX are fusible. |
| Multiply accumulate long | SMLALB, SMLALT, SMLSLB, SMLSLT, UMLALB, UMLALT, UMLSLB, UMLSLT | - |
| Multiply accumulate saturating doubling long regular | SQDMLALB, SQDMLALT, SQDMLALBT, SQDMLSLB, SQDMLSLT, SQDMLSLBT | - |
| Multiply saturating rounding doubling regular/complex accumulate | SQRDMLAH, SQRDMLSH, SQRDCMLAH | - |
| Predicate counting, vector form | DECH, DECW, DECD, INCH, INCW, INCD, SQDECH, SQDECW, SQDECD, SQINCH, SQINCW, SQINCD, UQDECH, UQDECW, UQDECD, UQINCH, UQINCW, UQINCD | - |
| Reciprocal estimate | URECPE, URSQRTE | - |
| Reverse, vector | REV, REVB, REVH, REVW | - |
| Select, vector form | SEL | - |
| Floating point Instructions | | |
| Floating point absolute value/difference | FABD, FABS | - |
| Floating point arithmetic | FADD, FADDP, FNEG, FSUB, FSUBR | For FADD and FSUB, only the immediate and vector, predicated forms are fusible. |
| Floating point complex add | FCADD | - |
| Floating point complex multiply add | FCMLA | For the vector form, only unpredicated and zeroing predicate forms of MOVPRFX are fusible. |
| Floating point convert | FCVT, FCVTX | - |
| Floating point base2 log | FLOGB | - |
| Floating point convert to integer | FCVTZS, FCVTZU | - |
| Floating point copy | FCPY, FMOV | Only the predicated forms of FCPY are fusible. |
| Floating point divide | FDIV, FDIVR | - |
| Floating point min/max pairwise | FMAXP, FMAXNMP, FMINP, FMINNMP | - |
| Floating point min/max | FMAX, FMIN, FMAXNM, FMINNM | - |
| Floating point multiply | FSCALE, FMUL, FMULX | For FMUL, only the immediate and vector, predicated forms are fusible. |
| Floating point multiply accumulate | FMLA, FMLS, FMAD, FMSB, FNMAD, FNMLA, FNMLS, FNMSB | For FMLA and FMLS, only unpredicated and zeroing predicate forms of MOVPRFX are fusible. |
| Floating point multiply add/sub accumulate long | FMLALB, FMLALT, FMLSLB, FMLSLT | - |
| Floating point reciprocal estimate | FRECPX | - |
| Floating point round to integral | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | - |
| Floating point square root | FSQRT | - |
| Floating point trigonometric multiply add | FTMAD | - |
| BF16 Instructions | | |
| Dot product | BFDOT | - |
| Matrix multiply accumulate | BFMMLA | - |
| Multiply accumulate long | BFMLALB, BFMLALT | - |
| Cryptographic Instructions | | |
| Crypto SHA3 ops | BCAX, EOR3, XAR | - |
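
As an illustration of the MOVPRFX fusion entries listed above, the sketch below shows three instruction pairs whose forms match the Notes column for ADD and FMLA. The register numbers and the governing predicate `p0` are arbitrary choices made for this example and are assumed to be set up beforehand. The pairs are candidates for fusion only; whether the core actually fuses a given pair depends on the conditions described in this section.

```
// Unpredicated MOVPRFX followed by ADD (vector, predicated form).
MOVPRFX z0, z1
ADD     z0.s, p0/m, z0.s, z2.s

// Unpredicated MOVPRFX followed by ADD (immediate form).
MOVPRFX z3, z4
ADD     z3.s, z3.s, #1

// Zeroing-predicated MOVPRFX followed by FMLA (vector form).
// The MOVPRFX and the FMLA use the same governing predicate and element size.
MOVPRFX z5.s, p0/z, z6.s
FMLA    z5.s, p0/m, z7.s, z8.s
```

In each pair the MOVPRFX destination is also the destination of the instruction that follows, and the two instructions are adjacent in program order, as this section requires.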