

# Arm® Cortex®-A76

# **Software Optimization Guide**

Non-Confidential

Version 10.0 PJDOC-466751330-7215

Copyright © 2019-2021 Arm Limited (or its affiliates). All rights reserved




## Non-Confidential Proprietary Notice

This document is **NON-CONFIDENTIAL** and any use by you is subject to the terms of the agreement between you and Arm or the terms of the agreement between you and the party authorised by Arm to disclose this document to you.

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. **No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.** 

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information: (i) for the purposes of determining whether implementations infringe any third party patents; (ii) for developing technology or products which avoid any of Arm's intellectual property; or (iii) as a reference for modifying existing patents or patent applications or creating any continuation, continuation in part, or extension of existing patents or patent applications; or (iv) for generating data for publication or disclosure to third parties, which compares the performance or functionality of the Arm technology described in this document with any other products created by you or a third party, without obtaining Arm's prior written consent.

THIS DOCUMENT IS PROVIDED "AS IS". ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word "partner" in reference to Arm's customers is not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed written agreement covering this document with Arm, then the click through or signed written agreement prevails over and supersedes the conflicting provisions of these terms. This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.

The Arm corporate logo and words marked with <sup>®</sup> or <sup>™</sup> are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow Arm's trademark usage guidelines at http://www.arm.com/company/policies/trademarks.

Copyright © 2019-2021 Arm Limited or its affiliates. All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

LES-PRE-20349

### **Confidentiality Status**

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

### **Product Status**

The information in this document is Final, that is for a developed product.

### Web Address

http://www.arm.com/

# Contents

- 1 About this document
  - 1.1. References
  - 1.2. Terms and Abbreviations
  - 1.3. Scope
- 2 Introduction
  - 2.1. Pipeline Overview
- 3 Instruction characteristics
  - 3.1. Instruction tables
  - 3.2. Legend for reading the utilized pipelines
  - 3.3. Branch Instructions
  - 3.4. Arithmetic and Logical Instructions
  - 3.5. Move and Shift Instructions
  - 3.6. Divide and Multiply Instructions
  - 3.7. Saturating and Parallel Arithmetic Instructions
  - 3.8. Miscellaneous Data-Processing Instructions
  - 3.9. Load Instructions
  - 3.10. Store Instructions
  - 3.11. FP Data Processing Instructions
  - 3.12. FP Miscellaneous Instructions
  - 3.13. FP Load Instructions
  - 3.14. FP Store Instructions
  - 3.15. ASIMD Integer Instructions
  - 3.16. ASIMD Floating-Point Instructions
  - 3.17. ASIMD Miscellaneous Instructions
  - 3.18. ASIMD Load Instructions
  - 3.19. ASIMD Store Instructions
  - 3.20. Cryptography Extensions
  - 3.21. CRC
- 4 Special considerations
  - 4.1. Dispatch Constraints
  - 4.2. Dispatch Stall
  - 4.3. Optimizing General-Purpose Register Spills and Fills
  - 4.4. Optimizing Memory Copy
  - 4.5. Load/Store Alignment
  - 4.6. Store to Load Forwarding
  - 4.7. AES Encryption/Decryption
  - 4.8. Region Based Fast Forwarding
  - 4.9. Branch instruction alignment
  - 4.10. FPCR self-synchronization
  - 4.11. Special Register Access
  - 4.12. Register Forwarding Hazards
  - 4.13. IT Blocks

# **1** About this document

This document contains a guide to the Cortex-A76 micro-architecture with a view to aiding software optimization.

## **1.1. References**

| Reference | Document | Licensee only Y/N | Title                                                                       |
|-----------|----------|-------------------|-----------------------------------------------------------------------------|
| 1         | DDI 0487 | N                 | Arm® Architecture Reference Manual, Armv8, for Armv8-A architecture profile |
| 2         | 100798   | N                 | Arm® Cortex®-A76 Core Technical Reference Manual                            |

## **1.2. Terms and Abbreviations**

This document uses the following terms and abbreviations.

| Term  | Meaning                 |
|-------|-------------------------|
| ALU   | Arithmetic/Logical Unit |
| ASIMD | Advanced SIMD           |
| Mop   | Macro-Operation         |
| Uop   | Micro-Operation         |
| VFP   | Vector Floating Point   |

## **1.3. Scope**

This document provides high-level information about the Cortex-A76 pipeline, instruction performance characteristics, and special performance considerations. This information is intended to aid people who are optimizing software and compilers for Cortex-A76. For a more complete description of the Cortex-A76 processor, please refer to the *Cortex-A76 Technical Reference Manual*.

# **2** Introduction

## 2.1. Pipeline Overview

The following diagram describes the high-level Cortex-A76 instruction processing pipeline. Instructions are first fetched and then decoded into internal macro-operations (Mops). From there, the Mops proceed through the register renaming and dispatch stages. At the dispatch stage, a Mop can be split further into two micro-operations (Uops). Once dispatched, Uops wait for their operands and issue out of order to one of eight execution pipelines. Each execution pipeline can accept and complete one Uop per cycle.



The execution pipelines support different types of operations, as follows:

| Pipeline (mnemonic)                  | Supported functionality                                                                                                                   |
|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| Branch                               | Branch uops                                                                                                                               |
| Integer Single-Cycle 0/1             | Integer ALU uops                                                                                                                          |
| Integer Single/Multi-cycle           | Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences uops                                                             |
| Load/Store Address<br>Generation 0/1 | Load, store and special memory uops                                                                                                       |
| FP/ASIMD-0                           | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto uops, store data uops |
| FP/ASIMD-1                           | ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift uops, store data uops                                                    |

# **3 Instruction characteristics**

## 3.1. Instruction tables

This chapter describes high-level performance characteristics for most Armv8.2-A A32, T32 and A64 instructions. A series of tables summarizes the effective execution latency and throughput (instruction bandwidth per cycle), the pipelines utilized, and special behaviours associated with each group of instructions. Utilized pipelines correspond to the execution pipelines described in chapter 2.

In the tables below, Exec Latency is defined as the minimum latency seen by an operation dependent on an instruction in the described group.

In the tables below, Execution Throughput is defined as the maximum throughput (in instructions per cycle) of the specified instruction group that can be achieved in the entirety of the Cortex-A76 microarchitecture.
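
As a rough illustration of how the two figures interact, the following Python sketch estimates a lower bound on the cycle count of a dependent chain (limited by Exec Latency) and of a stream of independent instructions (limited by Execution Throughput). The function names are hypothetical, and the model is an assumption made for illustration: it ignores dispatch limits, cache misses and every other stall, so it is a back-of-the-envelope aid rather than a simulator.

```python
from fractions import Fraction
import math


def dependent_chain_cycles(count, exec_latency):
    """Lower bound for `count` instructions where each consumes the previous result."""
    return count * exec_latency


def independent_stream_cycles(count, throughput):
    """Lower bound for `count` mutually independent instructions of one group.

    `throughput` is in instructions per cycle and may be an int, a Fraction,
    or a string such as "3/2".
    """
    return math.ceil(Fraction(count) / Fraction(throughput))


# Basic ADD (section 3.4) is listed with latency 1 and throughput 3.
print(dependent_chain_cycles(12, 1))     # 12
print(independent_stream_cycles(12, 3))  # 4
```

Under these assumptions, the same twelve additions take about three times as long when each one depends on the previous result, which is why breaking long dependency chains often matters more than reducing the instruction count.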

## 3.2. Legend for reading the utilized pipelines

| Pipeline name                                  | Symbol used in tables |
|------------------------------------------------|-----------------------|
| Branch 0/1                                     | B                     |
| Integer single cycle 0/1                       | S                     |
| Integer single cycle 0/1 and single/multicycle | I                     |
| Integer single/multicycle                      | M                     |
| Integer single cycle 1 and Integer multicycle  | D                     |
| Load/Store 0/1                                 | L                     |
| FP/ASIMD 0/1                                   | V                     |
| FP/ASIMD 0                                     | V0                    |
| FP/ASIMD 1                                     | V1                    |

## **3.3. Branch Instructions**

| Instruction group                        | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Branch, immed                            | B                    | 1               | 1                       | B                     |       |
| Branch, register                         | BR, RET              | 1               | 1                       | B                     |       |
| Branch and link, immed                   | BL                   | 1               | 1                       | I, B                  |       |
| Branch and link, register (reg !=<br>lr) | BLR                  | 1               | 1                       | I, B                  |       |
| Branch and link, register (reg ==<br>lr) | BLR                  | 2               | 1                       | I, B                  |       |
| Compare and branch                       | CBZ, CBNZ, TBZ, TBNZ | 1               | 1                       | B                     |       |

| Instruction group                        | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Branch, immed                            | B                    | 1               | 1                       | B                     |       |
| Branch, register                         | BX                   | 1               | 1                       | B                     |       |
| Branch and link, immed                   | BL, BLX              | 1               | 1                       | I, B                  |       |
| Branch and link, register (reg !=<br>lr) | BLX                  | 1               | 1                       | I, B                  |       |
| Branch and link, register (reg ==<br>lr) | BLX                  | 2               | 1                       | I, B                  |       |
| Compare and branch                       | CBZ, CBNZ            | 1               | 1                       | B                     |       |

## 3.4. Arithmetic and Logical Instructions

| Instruction group                                 | AArch64 instructions                  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------------|---------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Arithmetic, basic                                 | ADD{S}, ADC{S}, SUB{S},<br>SBC{S}     | 1               | 3                       | I                     |       |
| Arithmetic, extend and shift                      | ADD{S}, SUB{S}                        | 2               | 1                       | M                     |       |
| Arithmetic, LSL shift, shift <= 4                 | ADD{S}, SUB{S}                        | 1               | 3                       | I                     |       |
| Arithmetic, LSR/ASR/ROR shift<br>or LSL shift > 4 | ADD{S}, SUB{S}                        | 2               | 1                       | M                     |       |
| Conditional compare                               | CCMN, CCMP                            | 1               | 3                       | I                     |       |
| Conditional select                                | CSEL, CSINC, CSINV, CSNEG             | 1               | 3                       | I                     |       |
| Logical, basic                                    | AND{S}, BIC{S}, EON, EOR, ORN,<br>ORR | 1               | 3                       | I                     |       |
| Logical, shift, no flagset                        | AND, BIC, EON, EOR, ORN, ORR          | 1               | 3                       | I                     |       |
| Logical, shift, flagset                           | ANDS, BICS                            | 2               | 1                       | M                     |       |
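
The shift-dependent rows for `ADD{S}` and `SUB{S}` in the AArch64 table above can be read as a simple decision rule. The Python sketch below encodes only those two rows; the function name `add_sub_shifted_cost` is hypothetical, and the function is a lookup aid, not a statement about how the hardware decodes the instruction.

```python
def add_sub_shifted_cost(shift_op, amount):
    """Return (exec_latency, throughput) for AArch64 ADD{S}/SUB{S} with a
    shifted register operand, as listed in the table above."""
    if shift_op not in ("LSL", "LSR", "ASR", "ROR"):
        raise ValueError(f"unknown shift operator: {shift_op}")
    if amount < 0:
        raise ValueError("shift amount must be non-negative")
    # Row "Arithmetic, LSL shift, shift <= 4": latency 1, throughput 3.
    if shift_op == "LSL" and amount <= 4:
        return (1, 3)
    # Row "Arithmetic, LSR/ASR/ROR shift or LSL shift > 4": latency 2, throughput 1.
    return (2, 1)


print(add_sub_shifted_cost("LSL", 3))  # (1, 3)
print(add_sub_shifted_cost("LSL", 5))  # (2, 1)
```

For example, an address computation that scales an index by 8 (`LSL #3`) falls in the faster row, while a scale of 32 (`LSL #5`) falls in the slower one.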

| Instruction group                                            | AArch32 instructions                                                                                                     | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| ALU, basic                                                   | ADD{S}, ADC{S}, ADR, AND{S},<br>BIC{S}, CMN, CMP, EOR{S},<br>ORN{S}, ORR{S}, RSB{S}, RSC{S},<br>SUB{S}, SBC{S}, TEQ, TST | 1               | 3                       | I                     |       |
| ALU, shift by register,<br>unconditional                     | (same as ALU, basic)                                                                                                     | 2               | 1                       | M                     |       |
| ALU, shift by register, conditional                          | (same as ALU, basic)                                                                                                     | 2               | 1                       | I, M                  |       |
| Arithmetic, LSL shift by immed,<br>shift <= 4, unconditional | ADD{S}, ADC{S}, RSB{S}, RSC{S},<br>SUB{S}, SBC{S}                                                                        | 1               | 3                       | I                     |       |

| Instruction group                                                      | AArch32 instructions                              | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------------------------------------|---------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Arithmetic, LSL shift by immed,<br>shift <= 4, conditional             | ADD{S}, ADC{S}, RSB{S}, RSC{S},<br>SUB{S}, SBC{S} | 1               | 1                       | M                     |       |
| Arithmetic, LSR/ASR/ROR shift<br>by immed or LSL shift by immed<br>> 4 | ADD{S}, ADC{S}, RSB{S}, RSC{S},<br>SUB{S}, SBC{S} | 2               | 1                       | M                     |       |
| Logical, shift by immed,<br>no flagset                                 | AND, BIC, EOR, ORN, ORR                           | 1               | 3                       | I                     |       |
| Logical, shift by immed, flagset                                       | AND{S}, BIC{S}, EOR{S}, ORN{S},<br>ORR{S}         | 2               | 1                       | M                     |       |
| Test/Compare, shift by immed                                           | CMN, CMP, TEQ, TST                                | 2               | 1                       | M                     |       |
| Branch forms                                                           |                                                   | +1              | 1                       | +B                    | 1     |

1. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch uop is required. This adds 1 cycle to the latency.

## 3.5. Move and Shift Instructions

| Instruction group                                      | AArch32 instructions                  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------------------|---------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Move, basic                                            | MOV{S}, MOVW, MVN{S}                  | 1               | 3                       | I                     |       |
| Move, shift by immed, no<br>setflags                   | ASR, LSL, LSR, ROR, RRX, MVN          | 1               | 3                       | I                     |       |
| Move, shift by immed, setflags                         | ASRS, LSLS, LSRS, RORS, RRXS,<br>MVNS | 2               | 1                       | M                     |       |
| Move, shift by register, no<br>setflags, unconditional | ASR, LSL, LSR, ROR, RRX, MVN          | 1               | 3                       | I                     |       |
| Move, shift by register, no<br>setflags, conditional   | ASR, LSL, LSR, ROR, RRX, MVN          | 2               | 3/2                     | I                     |       |
| Move, shift by register, setflags                      | ASRS, LSLS, LSRS, RORS, RRXS,<br>MVNS | 2               | 1                       | M                     |       |
| Move, top                                              | MOVT                                  | 1               | 3                       | I                     |       |
| Move, branch forms                                     |                                       | +1              | 1                       | +B                    |       |

## 3.6. Divide and Multiply Instructions

| Instruction group           | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Divide, W-form              | SDIV, UDIV           | 5 to 12         | 1/12 to 1/5             | M                     | 1     |
| Divide, X-form              | SDIV, UDIV           | 5 to 20         | 1/20 to 1/5             | M                     | 1     |
| Multiply accumulate, W-form | MADD, MSUB           | 2(1)            | 1                       | M                     | 2     |

| Instruction group           | AArch64 instructions              | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------|-----------------------------------|-----------------|-------------------------|-----------------------|-------|
| Multiply accumulate, X-form | MADD, MSUB                        | 4(3)            | 1/3                     | M                     | 2,4   |
| Multiply accumulate long    | SMADDL, SMSUBL, UMADDL,<br>UMSUBL | 2(1)            | 1                       | M                     | 2     |
| Multiply high               | SMULH, UMULH                      | 5(3)            | 1/4                     | M                     | 5     |

| Instruction group                                         | AArch32 instructions                                                                                      | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Divide                                                    | SDIV, UDIV                                                                                                | 5 to 12         | 1/12 to 1/5             | M                     | 1     |
| Multiply                                                  | MUL, SMULBB, SMULBT,<br>SMULTB, SMULTT, SMULWB,<br>SMULWT, SMMUL{R},<br>SMUAD{X}, SMUSD{X}                | 2               | 1                       | M                     |       |
| Multiply accumulate, conditional                          | MLA, MLS, SMLABB, SMLABT,<br>SMLATB, SMLATT, SMLAWB,<br>SMLAWT, SMLAD{X}, SMLSD{X},<br>SMMLA{R}, SMMLS{R} | 3               | 1                       | M, I                  |       |
| Multiply accumulate,<br>unconditional                     | MLA, MLS, SMLABB, SMLABT,<br>SMLATB, SMLATT, SMLAWB,<br>SMLAWT, SMLAD{X}, SMLSD{X},<br>SMMLA{R}, SMMLS{R} | 2(1)            | 1                       | M                     | 2     |
| Multiply accumulate accumulate long, conditional          | UMAAL                                                                                                     | 4               | 1                       | I, M                  |       |
| Multiply accumulate accumulate long, unconditional        | UMAAL                                                                                                     | 3               | 1                       | I, M                  |       |
| Multiply accumulate long                                  | SMLAL, SMLALBB, SMLALBT,<br>SMLALTB, SMLALTT,<br>SMLALD{X}, SMLSLD{X}, UMLAL                              | 3               | 1                       | M, I                  |       |
| Multiply long, all setflag,<br>conditional and no setflag | SMULL, UMULL                                                                                              | 3               | 1                       | M, I                  |       |
| Multiply long, unconditional and no setflag               | SMULL, UMULL                                                                                              | 2               | 1                       | M                     |       |
| (Multiply, setflags forms)                                |                                                                                                           | +1              | (Same as<br>above)      | +I                    | 3     |

- 1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until complete. Early termination is possible, depending upon the data values.
- 2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar uops, allowing a typical sequence of multiply-accumulate uops to issue one every N cycles (accumulate latency N shown in parentheses).
- 3. Multiplies that set the condition flags require an additional integer uop.
- 4. X-form multiply accumulates stall the multiplier pipeline for 2 extra cycles.
- 5. Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M uop can be issued to that pipeline, with N shown in parentheses.
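
To make the effect of late-forwarding (note 2) concrete, the Python sketch below estimates the length of a chain of multiply-accumulates in which each instruction depends on the previous one only through its accumulate operand. The function name `mac_chain_cycles` and the timing model are assumptions for illustration: each link is taken to issue N cycles after the previous one, the last result becomes available one full Exec latency after its issue, and the multiplicand operands are assumed to be ready in advance.

```python
def mac_chain_cycles(count, exec_latency, accumulate_latency):
    """Estimated cycles until the final result of an accumulator-dependent
    chain of `count` multiply-accumulates is available."""
    if count <= 0:
        return 0
    # Each link after the first waits only for the forwarded accumulator;
    # the final instruction still needs its full execution latency.
    return (count - 1) * accumulate_latency + exec_latency


# W-form MADD is listed as 2(1): Exec latency 2, accumulate latency 1.
print(mac_chain_cycles(8, 2, 1))  # 9
# The same chain if no late-forwarding were available (2 cycles per link).
print(mac_chain_cycles(8, 2, 2))  # 16
# X-form MADD is listed as 4(3).
print(mac_chain_cycles(8, 4, 3))  # 25
```

Under this model, a dot-product style loop with a single W-form accumulator is already close to one multiply-accumulate per cycle, so splitting it into several accumulators mainly helps the X-form case.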

## 3.7. Saturating and Parallel Arithmetic Instructions

| Instruction group                              | AArch32 instructions                                                     | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------------|--------------------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Parallel arith, unconditional                  | SADD16, SADD8, SSUB16,<br>SSUB8, UADD16, UADD8,<br>USUB16, USUB8         | 2               | 1                       | M                     |       |
| Parallel arith, conditional                    | SADD16, SADD8, SSUB16,<br>SSUB8, UADD16, UADD8,<br>USUB16, USUB8         | 2(4)            | 3/5                     | M, I                  | 1     |
| Parallel arith with exchange,<br>unconditional | SASX, SSAX, UASX, USAX                                                   | 3               | 1                       | I, M                  |       |
| Parallel arith with exchange, conditional      | SASX, SSAX, UASX, USAX                                                   | 3(5)            | 3/5                     | I, M                  | 1     |
| Parallel halving arith                         | SHADD16, SHADD8, SHSUB16,<br>SHSUB8, UHADD16, UHADD8,<br>UHSUB16, UHSUB8 | 2               | 1                       | M                     |       |
| Parallel halving arith with exchange           | SHASX, SHSAX, UHASX, UHSAX                                               | 3               | 1                       | I, M                  |       |
| Parallel saturating arith                      | QADD16, QADD8, QSUB16,<br>QSUB8, UQADD16, UQADD8,<br>UQSUB16, UQSUB8     | 2               | 2                       | M                     |       |
| Parallel saturating arith with exchange        | QASX, QSAX, UQASX, UQSAX                                                 | 3               | 1                       | I, M                  |       |
| Saturate                                       | SSAT, SSAT16, USAT, USAT16                                               | 2               | 1                       | M                     |       |
| Saturating arith                               | QADD, QSUB                                                               | 2               | 1                       | M                     |       |
| Saturating doubling arith                      | QDADD, QDSUB                                                             | 3               | 1                       | I, M                  |       |

Notes:

1. Conditional GE-setting instructions require three extra uops and two additional cycles to conditionally update the GE field (GE latency shown in parentheses).

## 3.8. Miscellaneous Data-Processing Instructions

| Instruction group          | AArch64 instructions    | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|----------------------------|-------------------------|-----------------|-------------------------|-----------------------|-------|
| Address generation         | ADR, ADRP               | 1               | 3                       | I                     |       |
| Bitfield extract, one reg  | EXTR                    | 1               | 3                       | I                     |       |
| Bitfield extract, two regs | EXTR                    | 3               | 1                       | I, M                  |       |
| Bitfield move, basic       | SBFM, UBFM              | 1               | 3                       | I                     |       |
| Bitfield move, insert      | BFM                     | 2               | 1                       | M                     |       |
| Count leading              | CLS, CLZ                | 1               | 3                       | I                     |       |
| Move immed                 | MOVN, MOVK, MOVZ        | 1               | 3                       | I                     |       |
| Reverse bits/bytes         | RBIT, REV, REV16, REV32 | 1               | 3                       | I                     |       |
| Variable shift             | ASRV, LSLV, LSRV, RORV  | 1               | 3                       | I                     |       |

| Instruction group                   | AArch32 instructions       | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------------------------|----------------------------|-----------------|-------------------------|-----------------------|-------|
| Bit field extract                   | SBFX, UBFX                 | 1               | 3                       | I                     |       |
| Bit field insert/clear              | BFI, BFC                   | 2               | 1                       | M                     |       |
| Count leading zeros                 | CLZ                        | 1               | 3                       | I                     |       |
| Pack halfword                       | PKH                        | 2               | 1                       | M                     |       |
| Reverse bits/bytes                  | RBIT, REV, REV16, REVSH    | 1               | 3                       | I                     |       |
| Select bytes, unconditional         | SEL                        | 1               | 3                       | I                     |       |
| Select bytes, conditional           | SEL                        | 2               | 3/2                     | I                     |       |
| Sign/zero extend, normal            | SXTB, SXTH, UXTB, UXTH     | 1               | 3                       | I                     |       |
| Sign/zero extend, parallel          | SXTB16, UXTB16             | 2               | 1                       | M                     |       |
| Sign/zero extend and add,<br>normal | SXTAB, SXTAH, UXTAB, UXTAH | 2               | 1                       | M                     |       |
| Sign/zero extend and add, parallel  | SXTAB16, UXTAB16           | 4               | 1/2                     | M                     |       |
| Sum of absolute differences         | USAD8, USADA8              | 2               | 1                       | M                     |       |

## **3.9. Load Instructions**

The latencies shown assume the memory access hits in the Level 1 (L1) data cache.

| Instruction group                                           | AArch64 instructions                                    | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------------------------------------------------|---------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Load register, literal                                      | LDR, LDRSW, PRFM                                        | 4               | 2                       | L                     |       |
| Load register, unscaled immed                               | LDUR, LDURB, LDURH,<br>LDURSB, LDURSH, LDURSW,<br>PRFUM | 4               | 2                       | L                     |       |
| Load register, immed post-index                             | LDR, LDRB, LDRH, LDRSB,<br>LDRSH, LDRSW                 | 4               | 2                       | L, I                  |       |
| Load register, immed pre-index                              | LDR, LDRB, LDRH, LDRSB,<br>LDRSH, LDRSW                 | 4               | 2                       | L, I                  |       |
| Load register, immed<br>unprivileged                        | LDTR, LDTRB, LDTRH, LDTRSB,<br>LDTRSH, LDTRSW           | 4               | 2                       | L                     |       |
| Load register, unsigned immed                               | LDR, LDRB, LDRH, LDRSB,<br>LDRSH, LDRSW, PRFM           | 4               | 2                       | L                     |       |
| Load register, register offset,<br>basic                    | LDR, LDRB, LDRH, LDRSB,<br>LDRSH, LDRSW, PRFM           | 4               | 2                       | L                     |       |
| Load register, register offset, scale by 4/8                | LDR, LDRSW, PRFM                                        | 4               | 2                       | L                     |       |
| Load register, register offset, scale by 2                  | LDRH, LDRSH                                             | 5               | 2                       | I, L                  |       |
| Load register, register offset,<br>extend                   | LDR, LDRB, LDRH, LDRSB,<br>LDRSH, LDRSW, PRFM           | 4               | 2                       | L                     |       |
| Load register, register offset,<br>extend, scale by 4/8     | LDR, LDRSW, PRFM                                        | 4               | 2                       | L                     |       |
| Load register, register offset,<br>extend, scale by 2       | LDRH, LDRSH                                             | 5               | 2                       | I, L                  |       |
| Load pair, signed immed offset,<br>normal, W-form           | LDP, LDNP                                               | 4               | 2                       | L                     |       |
| Load pair, signed immed offset,<br>normal, X-form           | LDP, LDNP                                               | 4               | 1                       | L                     |       |
| Load pair, signed immed offset,<br>signed words, base != SP | LDPSW                                                   | 5               | 1                       | I, L                  |       |
| Load pair, signed immed offset,<br>signed words, base = SP  | LDPSW                                                   | 5               | 1                       | I, L                  |       |
| Load pair, immed post-index,<br>normal                      | LDP                                                     | 4               | 1                       | L, I                  |       |
| Load pair, immed post-index,<br>signed words                | LDPSW                                                   | 5               | 1                       | I, L                  |       |

| Instruction group                           | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Load pair, immed pre-index, normal          | LDP                  | 4               | 1                       | L, I                  |       |
| Load pair, immed pre-index,<br>signed words | LDPSW                | 5               | 1                       | I, L                  |       |

| Instruction group                                       | AArch32 instructions                                  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------------------|-------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Load, immed offset                                      | LDR{T}, LDRB{T}, LDRD,<br>LDRH{T}, LDRSB{T}, LDRSH{T} | 4               | 2                       | L                     | 1,2   |
| Load, register offset, plus                             | LDR, LDRB, LDRD, LDRH,<br>LDRSB, LDRSH                | 4               | 2                       | L                     | 1,2   |
| Load, register offset, minus                            | LDR, LDRB, LDRD, LDRH,<br>LDRSB, LDRSH                | 5               | 2                       | I, L                  | 1,2   |
| Load, scaled register offset, plus,<br>LSL2             | LDR, LDRB                                             | 4               | 2                       | L                     | 1     |
| Load, scaled register offset,<br>other                  | LDR, LDRB, LDRH, LDRSB,<br>LDRSH                      | 5               | 2                       | I, L                  | 1     |
| Load, immed pre-indexed                                 | LDR, LDRB, LDRD, LDRH,<br>LDRSB, LDRSH                | 4               | 2                       | L, I                  | 1,2   |
| Load, register pre-indexed, shift<br>Rm, plus and minus | LDR, LDRB, LDRH, LDRSB,<br>LDRSH                      | 5               | 2                       | I, L, M               | 3     |
| Load, register pre-indexed                              | LDRD                                                  | 4               | 2                       | L, I                  |       |
| Load, register pre-indexed, cond                        | LDRD                                                  | 5               | 1 1/2                   | L, I                  |       |
| Load, scaled register pre-<br>indexed, plus, LSL2       | LDR, LDRB                                             | 4               | 2                       | L, I                  | 1     |
| Load, scaled register pre-<br>indexed, unshifted        | LDR, LDRB                                             | 4               | 2                       | L, I                  |       |
| Load, immed post-indexed                                | LDR{T}, LDRB{T}, LDRD,<br>LDRH{T}, LDRSB{T}, LDRSH{T} | 4               | 2                       | L, I                  | 1,2   |
| Load, register post-indexed                             | LDR, LDRB, LDRH{T}, LDRSB{T},<br>LDRSH{T}             | 5               | 2                       | I, L                  |       |
| Load, register post-indexed                             | LDRD                                                  | 4               | 2                       | L, I                  |       |
| Load, register post-indexed                             | LDRT, LDRBT                                           | 5               | 2                       | I, L                  |       |
| Load, scaled register post-<br>indexed                  | LDR, LDRB                                             | 4               | 2                       | L, M                  | 3     |
| Load, scaled register post-<br>indexed                  | LDRT, LDRBT                                           | 4               | 2                       | L, M                  | 3     |
| Preload, immed offset                                   | PLD, PLDW                                             | 4               | 2                       | L                     |       |
| Preload, register offset, plus                          | PLD, PLDW                                             | 4               | 2                       | L                     |       |

| Instruction group                                    | AArch32 instructions               | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes   |
|------------------------------------------------------|------------------------------------|-----------------|-------------------------|-----------------------|---------|
| Preload, register offset, minus                      | PLD, PLDW                          | 5               | 2                       | I, L                  |         |
| Preload, scaled register offset, plus LSL2           | PLD, PLDW                          | 5               | 2                       | I, L                  |         |
| Preload, scaled register offset, other               | PLD, PLDW                          | 5               | 2                       | I, L                  |         |
| Load multiple, no writeback,<br>base reg not in list | LDMIA, LDMIB, LDMDA,<br>LDMDB      | N               | 2/R                     | L                     | 1, 4, 5 |
| Load multiple, no writeback,<br>base reg in list     | LDMIA, LDMIB, LDMDA,<br>LDMDB      | 1+N             | 2/R                     | I, L                  | 1, 4, 5 |
| Load multiple, writeback                             | LDMIA, LDMIB, LDMDA,<br>LDMDB, POP | 1+N             | 2/R                     | L, I                  | 1, 4, 5 |
| (Load, all branch forms)                             |                                    | +1              |                         | + B                   | 6       |

1. Conditional loads have an extra uop that goes down pipeline I and have 1 cycle of extra latency compared to their unconditional counterparts.
2. The throughput of a conditional LDRD is 1, compared to a throughput of 2 for an unconditional LDRD.
3. For addressing forms that use a register plus a scaled register, or a register plus an extended register, the address update op goes down pipeline I if the shift is LSL and the shift value is less than or equal to 4.
4. N is floor[(num\_reg+3)/4].
5. R is floor[(num\_reg+1)/2].
6. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch uop is required. This adds 1 cycle to the latency.
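The N and R formulas and the latency adjustments in the notes above can be evaluated with a small calculator. The following Python sketch only transcribes the AArch32 load multiple rows and notes 1, 4, 5, and 6; the function name `load_multiple_cost` and its parameters are hypothetical, and the result is a table lookup under those assumptions, not a pipeline simulation.

```python
from fractions import Fraction


def load_multiple_cost(num_reg, writeback=False, base_in_list=False,
                       conditional=False, branch_form=False):
    """Return (latency_cycles, instructions_per_cycle) for an AArch32 LDM/POP.

    Hypothetical helper: it evaluates the table entries and notes only.
    """
    if num_reg < 1:
        raise ValueError("num_reg must be at least 1")
    n = (num_reg + 3) // 4      # note 4: N = floor((num_reg + 3) / 4)
    r = (num_reg + 1) // 2      # note 5: R = floor((num_reg + 1) / 2)
    latency = n                 # "no writeback, base reg not in list" row
    if writeback or base_in_list:
        latency += 1            # rows whose latency is listed as 1+N
    if conditional:
        latency += 1            # note 1: conditional loads take 1 extra cycle
    if branch_form:
        latency += 1            # note 6: extra branch uop when the PC is loaded
    return latency, Fraction(2, r)   # throughput column: 2/R
```

For a POP of five registers that includes the PC (`num_reg=5, writeback=True, branch_form=True`), N is 2 and R is 3, so the sketch returns a latency of 4 cycles and a throughput of 2/3 instructions per cycle.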

## **3.10. Store Instructions**

The following tables describe performance characteristics for standard store instructions. Store uops are split into address and data uops at dispatch time. Once executed, stores are buffered and committed in the background.

| Instruction group                     | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Store register, unscaled immed        | STUR, STURB, STURH   | 1               | 2                       | L, D                  |       |
| Store register, immed post-<br>index  | STR, STRB, STRH      | 1               | 2                       | L, D                  |       |
| Store register, immed pre-index       | STR, STRB, STRH      | 1               | 2                       | L, D                  |       |
| Store register, immed<br>unprivileged | STTR, STTRB, STTRH   | 1               | 2                       | L, D                  |       |
| Store register, unsigned immed        | STR, STRB, STRH      | 1               | 2                       | L, D                  |       |

| Instruction group                                        | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|----------------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Store register, register offset,<br>basic                | STR, STRB, STRH      | 1               | 2                       | L, D                  |       |
| Store register, register offset, scaled by 4/8           | STR                  | 1               | 2                       | L, D                  |       |
| Store register, register offset, scaled by 2             | STRH                 | 2               | 3/2                     | I, L, D               |       |
| Store register, register offset,<br>extend               | STR, STRB, STRH      | 1               | 2                       | L, D                  |       |
| Store register, register offset,<br>extend, scale by 4/8 | STR                  | 1               | 2                       | L, D                  |       |
| Store register, register offset,<br>extend, scale by 2   | STRH                 | 2               | 3/2                     | I, L, D               |       |
| Store pair, immed offset, W-<br>form                     | STP, STNP            | 1               | 2                       | L, D                  |       |
| Store pair, immed offset, X-form                         | STP, STNP            | 1               | 1                       | L, D                  |       |
| Store pair, immed post-index,<br>W-form                  | STP                  | 1               | 1                       | L, D                  |       |
| Store pair, immed post-index, X-<br>form                 | STP                  | 1               | 1                       | L, D                  |       |
| Store pair, immed pre-index, W-<br>form                  | STP                  | 1               | 1                       | L, D                  |       |
| Store pair, immed pre-index, X-<br>form                  | STP                  | 1               | 1                       | L, D                  |       |

| Instruction group                           | AArch32 instructions              | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------|-----------------------------------|-----------------|-------------------------|-----------------------|-------|
| Store, immed offset                         | STR{T}, STRB{T}, STRD,<br>STRH{T} | 1               | 2                       | L, D                  |       |
| Store, register offset, plus                | STR, STRB, STRD, STRH             | 1               | 2                       | L, D                  |       |
| Store, register offset, minus               | STR, STRB, STRD, STRH             | 1               | 2                       | L, D                  |       |
| Store, register offset, no shift,<br>plus   | STR, STRB                         | 1               | 2                       | L, D                  |       |
| Store, scaled register offset, plus<br>LSL2 | STR, STRB                         | 1               | 2                       | L, D                  |       |
| Store, scaled register offset, other        | STR, STRB                         | 2               | 3/2                     | I, L, D               |       |
| Store, scaled register offset, minus        | STR, STRB                         | 2               | 3/2                     | I, L, D               |       |
| Store, immed pre-indexed                    | STR, STRB, STRD, STRH             | 1               | 3/2                     | I, L, D               |       |

| Instruction group                                 | AArch32 instructions                | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------------|-------------------------------------|-----------------|-------------------------|-----------------------|-------|
| Store, register pre-indexed, plus, no shift       | STR, STRB, STRD, STRH               | 1               | 3/2                     | L, D                  |       |
| Store, register pre-indexed,<br>minus             | STR, STRB, STRD, STRH               | 2               | 1                       | I, L, D               |       |
| Store, scaled register pre-<br>indexed, plus LSL2 | STR, STRB                           | 1               | 3/2                     | L, D                  |       |
| Store, scaled register pre-<br>indexed, other     | STR, STRB                           | 2               | 1                       | I, L, D, M            | 1     |
| Store, immed post-indexed                         | STR{T}, STRB{T}, STRD,<br>STRH{T}   | 1               | 3/2                     | L, D                  |       |
| Store, register post-indexed                      | STRH{T}, STRD                       | 1               | 3/2                     | L, D                  |       |
| Store, register post-indexed                      | STR{T}, STRB{T}                     | 1               | 3/2                     | L, D                  |       |
| Store, scaled register post-<br>indexed           | STR{T}, STRB{T}                     | 1               | 3/2                     | L, D                  |       |
| Store multiple, no writeback                      | STMIA, STMIB, STMDA,<br>STMDB       | N               | 1/N                     | L, D                  | 2     |
| Store multiple, writeback                         | STMIA, STMIB, STMDA,<br>STMDB, PUSH | N               | 1/N                     | L, D                  | 2     |

1. For addressing forms that use a register plus a scaled register, or a register plus an extended register, the address update op goes down pipeline I if the shift is LSL and the shift value is less than or equal to 4.
2. For store multiple instructions, N = floor((num\_regs+3)/4).
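As a quick check of note 2, the sketch below evaluates N and the 1/N throughput entry for a store multiple. The helper name `store_multiple_cost` is hypothetical, and the code assumes nothing beyond the two store multiple rows and the formula above.

```python
from fractions import Fraction


def store_multiple_cost(num_regs):
    """Return (latency_cycles, instructions_per_cycle) for an AArch32 STM/PUSH.

    Hypothetical helper: it evaluates the table entries and note 2 only.
    """
    if num_regs < 1:
        raise ValueError("num_regs must be at least 1")
    n = (num_regs + 3) // 4     # note 2: N = floor((num_regs + 3) / 4)
    return n, Fraction(1, n)    # latency N, throughput 1/N
```

A PUSH of eight registers gives N = 2, so the table entries evaluate to a latency of 2 cycles and one such instruction every 2 cycles.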

## **3.11. FP Data Processing Instructions**

| Instruction group | AArch64 instructions          | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------|-------------------------------|-----------------|-------------------------|-----------------------|-------|
| FP absolute value | FABS                          | 2               | 2                       | V                     |       |
| FP arithmetic     | FADD, FSUB                    | 2               | 2                       | V                     |       |
| FP compare        | FCCMP{E}, FCMP{E}             | 2               | 1                       | V0                    |       |
| FP divide, H-form | FDIV                          | 7               | 4/7                     | V0                    | 1     |
| FP divide, S-form | FDIV                          | 7 to 10         | 4/9 to 4/7              | V0                    | 1     |
| FP divide, D-form | FDIV                          | 7 to 15         | 1/7 to 2/7              | V0                    | 1     |
| FP min/max        | FMIN, FMINNM, FMAX,<br>FMAXNM | 2               | 2                       | V                     |       |
| FP multiply       | FMUL, FNMUL                   | 3               | 2                       | V                     | 2     |

| Instruction group      | AArch64 instructions                                         | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------|--------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| FP multiply accumulate | FMADD, FMSUB, FNMADD,<br>FNMSUB                              | 4 (2)           | 2                       | V                     | 3     |
| FP negate              | FNEG                                                         | 2               | 2                       | V                     |       |
| FP round to integral   | FRINTA, FRINTI, FRINTM,<br>FRINTN, FRINTP, FRINTX,<br>FRINTZ | 3               | 1                       | V0                    |       |
| FP select              | FCSEL                                                        | 2               | 2                       | V                     |       |
| FP square root, H-form | FSQRT                                                        | 7               | 4/7                     | V0                    | 1     |
| FP square root, S-form | FSQRT                                                        | 7 to 10         | 4/9 to 4/7              | V0                    | 1     |
| FP square root, D-form | FSQRT                                                        | 7 to 17         | 1/8 to 2/7              | V0                    | 1     |

| Instruction group                    | AArch32 instructions                                         | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------|--------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| VFP absolute value                   | VABS                                                         | 2               | 2                       | V                     |       |
| VFP arith                            | VADD, VSUB                                                   | 2               | 2                       | V                     |       |
| VFP compare, unconditional           | VCMP, VCMPE                                                  | 2               | 1                       | V0                    |       |
| VFP compare, conditional             | VCMP, VCMPE                                                  | 4               | 1                       | V, V0                 |       |
| VFP convert                          | VCVT{R}, VCVTB, VCVTT,<br>VCVTA, VCVTM, VCVTN,<br>VCVTP      | 3               | 1                       | V0                    |       |
| VFP divide, H-form                   | VDIV                                                         | 7               | 4/7                     | V0                    | 1     |
| VFP divide, S-form                   | VDIV                                                         | 7 to 10         | 4/9 to 4/7              | V0                    | 1     |
| VFP divide, D-form                   | VDIV                                                         | 7 to 15         | 1/7 to 2/7              | V0                    | 1     |
| VFP max/min                          | VMAXNM, VMINNM                                               | 2               | 2                       | V                     |       |
| VFP multiply                         | VMUL, VNMUL                                                  | 3               | 2                       | V                     | 2     |
| VFP multiply accumulate<br>(chained) | VMLA, VMLS, VNMLA, VNMLS                                     | 5 (2)           | 2                       | V                     | 3     |
| VFP multiply accumulate (fused)      | VFMA, VFMS, VFNMA, VFNMS                                     | 4 (2)           | 2                       | V                     | 3     |
| VFP negate                           | VNEG                                                         | 2               | 2                       | V                     |       |
| VFP round to integral                | VRINTA, VRINTM, VRINTN,<br>VRINTP, VRINTR, VRINTX,<br>VRINTZ | 3               | 1                       | V0                    |       |
| VFP select                           | VSELEQ, VSELGE, VSELGT,<br>VSELVS                            | 2               | 2                       | V                     |       |
| VFP square root, H-form              | VSQRT                                                        | 7               | 4/7                     | V0                    | 1     |
| VFP square root, S-form              | VSQRT                                                        | 7 to 10         | 4/9 to 4/7              | V0                    | 1     |
| VFP square root, D-form              | VSQRT                                                        | 7 to 17         | 1/8 to 2/7              | V0                    | 1     |

1. FP divide and square root operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply uops to the accumulate operands of an FP multiply-accumulate uop. The latter can potentially be issued 1 cycle after the FP multiply uop has been issued.
3. FP multiply-accumulate pipelines support late forwarding of accumulate operands from similar uops, allowing a typical sequence of multiply-accumulate uops to issue one every N cycles (accumulate latency N shown in parentheses).
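The effect of the accumulate latency in note 3 can be illustrated with a little arithmetic. The Python sketch below is an idealized model, not a description of the hardware scheduler: it assumes the only constraints are the table values (exec latency 4, accumulate latency 2, throughput 2 per cycle for FMADD), and the names `dependent_fma_chain_cycles` and `chains_to_saturate` are hypothetical.

```python
FMA_LATENCY = 4       # FMADD exec latency from the table
FMA_ACC_LATENCY = 2   # accumulate latency, shown in parentheses in the table
FMA_THROUGHPUT = 2    # FMADD uops per cycle, from the throughput column


def dependent_fma_chain_cycles(k, latency=FMA_LATENCY,
                               acc_latency=FMA_ACC_LATENCY):
    """Cycles until the last result of k FMAs chained through the accumulator.

    With late forwarding, each uop issues acc_latency cycles after the
    previous one, and the final uop still needs its full exec latency.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * acc_latency + latency


def chains_to_saturate(acc_latency=FMA_ACC_LATENCY,
                       throughput=FMA_THROUGHPUT):
    """Independent accumulator chains needed to keep the pipelines busy.

    Assumption: each chain can issue one uop every acc_latency cycles, so
    acc_latency * throughput chains are needed to reach peak throughput.
    """
    return acc_latency * throughput
```

Under these assumptions, a chain of eight dependent FMADD uops completes in 7 × 2 + 4 = 18 cycles rather than 8 × 4 = 32, and about four independent accumulators are needed before the throughput of 2 per cycle, rather than the accumulate latency, becomes the limit. Passing `latency=5` models the chained AArch32 forms (VMLA, VMLS), which list 5 (2).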

## 3.12. FP Miscellaneous Instructions

| Instruction group                | AArch64 instructions                                                                    | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|----------------------------------|-----------------------------------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| FP convert, from vec to vec reg  | FCVT, FCVTXN                                                                            | 3               | 1                       | V0                    |       |
| FP convert, from gen to vec reg  | SCVTF, UCVTF                                                                            | 6               | 1                       | M, V0                 |       |
| FP convert, from vec to gen reg  | FCVTAS, FCVTAU, FCVTMS,<br>FCVTMU, FCVTNS, FCVTNU,<br>FCVTPS, FCVTPU, FCVTZS,<br>FCVTZU | 4               | 1                       | V0, V1                |       |
| FP move, immed                   | FMOV                                                                                    | 2               | 2                       | V                     |       |
| FP move, register                | FMOV                                                                                    | 2               | 2                       | V                     |       |
| FP transfer, from gen to vec reg | FMOV                                                                                    | 3               | 1                       | M                     |       |
| FP transfer, from vec to gen reg | FMOV                                                                                    | 2               | 1                       | V1                    |       |

| Instruction group                                                     | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| VFP move, immed                                                       | VMOV                 | 2               | 2                       | V                     |       |
| VFP move, register                                                    | VMOV                 | 2               | 2                       | V                     |       |
| VFP move, insert                                                      | VINS                 | 2               | 2                       | V                     |       |
| VFP move, extraction                                                  | VMOVX                | 2               | 2                       | V                     |       |
| VFP transfer, core to vfp, single<br>reg to S-reg, cond               | VMOV                 | 5               | 1                       | M, V                  |       |
| VFP transfer, core to vfp, single<br>reg to S-reg, uncond             | VMOV                 | 3               | 1                       | M                     |       |
| VFP transfer, core to vfp, single<br>reg to upper/lower half of D-reg | VMOV                 | 5               | 1                       | M, V                  |       |
| VFP transfer, core to vfp, 2 regs<br>to 2 S-regs, cond                | VMOV                 | 6               | 1/2                     | M, V                  |       |
| VFP transfer, core to vfp, 2 regs<br>to 2 S-regs, uncond              | VMOV                 | 4               | 1/2                     | M                     |       |

| Instruction group                                                                  | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| VFP transfer, core to vfp, 2 regs<br>to D-reg, cond                                | VMOV                 | 5               | 1                       | M, V                  |       |
| VFP transfer, core to vfp, 2 regs<br>to D-reg, uncond                              | VMOV                 | 3               | 1                       | M                     |       |
| VFP transfer, vfp S-reg or<br>upper/lower half of vfp D-reg to<br>core reg, cond   | VMOV                 | 3               | 1                       | V1, I                 |       |
| VFP transfer, vfp S-reg or<br>upper/lower half of vfp D-reg to<br>core reg, uncond | VMOV                 | 2               | 1                       | V1                    |       |
| VFP transfer, vfp 2 S-regs or D-<br>reg to 2 core regs, cond                       | VMOV                 | 3               | 1                       | V1, I                 |       |
| VFP transfer, vfp 2 S-regs or D-<br>reg to 2 core regs, uncond                     | VMOV                 | 2               | 1                       | V1                    |       |

## **3.13. FP Load Instructions**

The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

| Instruction group                                            | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Load vector reg, literal, S/D/Q<br>forms                     | LDR                  |                 | 2                       | L                     |       |
| Load vector reg, unscaled immed                              | LDUR                 | 5               | 2                       | L                     |       |
| Load vector reg, immed post-<br>index                        | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, immed pre-<br>index                         | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, unsigned<br>immed                           | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, register offset, basic                      | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, register offset, scale, S/D-form            | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, register offset, scale, H/Q-form            | LDR                  | 6               | 2                       | I, L                  |       |
| Load vector reg, register offset, extend                     | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, register offset,<br>extend, scale, S/D-form | LDR                  | 5               | 2                       | L, I                  |       |
| Load vector reg, register offset,<br>extend, scale, H/Q-form | LDR                  | 6               | 2                       | I, L                  |       |
| Load vector pair, immed offset,<br>S/D-form                  | LDP, LDNP            | 5               | 1                       | L, I                  |       |
| Load vector pair, immed offset,<br>Q-form                    | LDP, LDNP            | 7               | 1                       | L                     |       |
| Load vector pair, immed post-<br>index, S/D-form             | LDP                  | 5               | 1                       | I, L                  |       |
| Load vector pair, immed post-<br>index, Q-form               | LDP                  | 7               | 1                       | L, I                  |       |
| Load vector pair, immed pre-<br>index, S/D-form              | LDP                  | 5               | 1                       | I, L                  |       |
| Load vector pair, immed pre-<br>index, Q-form                | LDP                  | 7               | 1                       | L, I                  |       |

| Instruction group          | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes   |
|----------------------------|----------------------|-----------------|-------------------------|-----------------------|---------|
| FP load, register          | VLDR                 | 4               | 2                       | L                     | 1       |
| FP load multiple, S form   | VLDMIA, VLDMDB, VPOP | N               | 2/R                     | L                     | 1, 2, 3 |
| FP load multiple, D form   | VLDMIA, VLDMDB, VPOP | N + 2           | 1/R                     | L, V                  | 1, 2, 3 |
| (FP load, writeback forms) |                      | (1)             |                         | +                     | 4       |

- 1. Conditional loads have an extra uop that goes down pipeline V and incur 2 cycles of extra latency compared to their unconditional counterparts.
- 2. N is floor[(num\_reg+3)/4].
- 3. R is floor[(num\_reg+1)/2].
- 4. Writeback forms of load instructions require an extra uop to update the base address. This update is typically performed in parallel with or prior to the load uop (update latency shown in parentheses).
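As a rough illustration of notes 1 to 3, the sketch below evaluates the latency and throughput entries of the AArch32 FP load multiple rows for a given register count. The helper name `fp_load_multiple` and its parameters are hypothetical, the throughput is returned exactly as the table expresses it (2/R or 1/R), and the writeback uop of note 4 is not modelled.

```python
from fractions import Fraction


def fp_load_multiple(num_reg, d_form, conditional=False):
    """Return (exec_latency, throughput) for an AArch32 FP load multiple.

    Hypothetical helper: it only evaluates the table formulas and notes,
    assuming an L1 data cache hit and no writeback.
    """
    if num_reg < 1:
        raise ValueError("num_reg must be at least 1")
    n = (num_reg + 3) // 4            # note 2: N = floor[(num_reg+3)/4]
    r = (num_reg + 1) // 2            # note 3: R = floor[(num_reg+1)/2]
    latency = n + 2 if d_form else n  # table: N (S form), N + 2 (D form)
    if conditional:
        latency += 2                  # note 1: conditional loads add 2 cycles
    throughput = Fraction(1 if d_form else 2, r)  # table: 2/R or 1/R
    return latency, throughput


# Example: an 8-register load multiple in S form and in D form.
print(fp_load_multiple(8, d_form=False))
print(fp_load_multiple(8, d_form=True))
```

Using `Fraction` keeps the throughput in the same rational form as the table rather than rounding it to a float.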

## **3.14. FP Store Instructions**

Store Mops are split into store address and store data uops at dispatch time. Once executed, stores are buffered and committed in the background.

| Instruction group                                             | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| Store vector reg, unscaled<br>immed, B/H/S/D-form             | STUR                 | 2               | 2                       | L, I                  |       |
| Store vector reg, unscaled<br>immed, Q-form                   | STUR                 | 2               | 1                       | L, I                  |       |
| Store vector reg, immed post-<br>index, B/H/S/D-form          | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, immed post-<br>index, Q-form                | STR                  | 2               | 1                       | L, V                  |       |
| Store vector reg, immed pre-<br>index, B/H/S/D-form           | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, immed pre-<br>index, Q-form                 | STR                  | 2               | 1                       | L, V                  |       |
| Store vector reg, unsigned<br>immed, B/H/S/D-form             | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, unsigned<br>immed, Q-form                   | STR                  | 2               | 1                       | L, V                  |       |
| Store vector reg, register offset,<br>basic, B/H/S/D-form     | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, register offset,<br>basic, Q-form           | STR                  | 2               | 1                       | L, V                  |       |
| Store vector reg, register offset,<br>scale, H-form           | STR                  | 2               | 2                       | I, L, V               |       |
| Store vector reg, register offset,<br>scale, S/D-form         | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, register offset,<br>scale, Q-form           | STR                  | 2               | 1                       | I, L, V               |       |
| Store vector reg, register offset,<br>extend, B/H/S/D-form    | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, register offset,<br>extend, Q-form          | STR                  | 2               | 1                       | L, V                  |       |
| Store vector reg, register offset,<br>extend, scale, H-form   | STR                  | 2               | 2                       | I, L, V               |       |
| Store vector reg, register offset,<br>extend, scale, S/D-form | STR                  | 2               | 2                       | L, V                  |       |
| Store vector reg, register offset,<br>extend, scale, Q-form   | STR                  | 2               | 1                       | I, L, V               |       |
| Store vector pair, immed offset,<br>S-form      | STP, STNP            | 2               | 2                       | L, V                  |       |
| Store vector pair, immed offset,<br>D-form      | STP, STNP            | 2               | 1                       | L, V                  |       |
| Store vector pair, immed offset,<br>Q-form      | STP, STNP            | 3               | 1/2                     | L, V                  |       |
| Store vector pair, immed post-<br>index, S-form | STP                  | 2               | 1                       | L, V                  |       |
| Store vector pair, immed post-<br>index, D-form | STP                  | 2               | 1                       | L, V                  |       |
| Store vector pair, immed post-<br>index, Q-form | STP                  | 3               | 1                       | L, V                  |       |
| Store vector pair, immed pre-<br>index, S-form  | STP                  | 2               | 1                       | L, V                  |       |
| Store vector pair, immed pre-<br>index, D-form  | STP                  | 2               | 1                       | L, V                  |       |
| Store vector pair, immed pre-<br>index, Q-form  | STP                  | 3               | 1/2                     | L, V                  |       |

| Instruction group           | AArch32 instructions  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------|-----------------------|-----------------|-------------------------|-----------------------|-------|
| FP store, immed offset      | VSTR                  | 2               | 2                       | L, I                  |       |
| FP store multiple, S-form   | VSTMIA, VSTMDB, VPUSH | N+1             | 2/R                     | L, V                  | 1, 3  |
| FP store multiple, D-form   | VSTMIA, VSTMDB, VPUSH | P+1             | 1/R                     | L, V                  | 2, 3  |
| (FP store, writeback forms) |                       | (1)             |                         | +                     | 4     |

- 1. For store multiple instructions, N = floor[(num\_regs+3)/4].
- 2. For store multiple instructions, P = floor[(num\_regs+1)/2].
- 3. R = floor[(num\_regs+1)/2].
- 4. Writeback forms of store instructions require an extra uop to update the base address. This update is typically performed in parallel with or prior to the store uop (update latency shown in parentheses).
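The store multiple entries can be evaluated in the same way. The following is a minimal sketch, assuming only the formulas in notes 1 to 3 and the latency and throughput columns of the AArch32 FP store table. The name `fp_store_multiple` is hypothetical and the writeback uop of note 4 is ignored.

```python
from fractions import Fraction


def fp_store_multiple(num_regs, d_form):
    """Return (exec_latency, throughput) for an AArch32 FP store multiple.

    Hypothetical helper that only evaluates the table formulas.
    """
    if num_regs < 1:
        raise ValueError("num_regs must be at least 1")
    n = (num_regs + 3) // 4             # note 1: N = floor[(num_regs+3)/4]
    p = (num_regs + 1) // 2             # note 2: P = floor[(num_regs+1)/2]
    r = (num_regs + 1) // 2             # note 3: R = floor[(num_regs+1)/2]
    latency = (p if d_form else n) + 1  # table: N+1 (S-form), P+1 (D-form)
    throughput = Fraction(1 if d_form else 2, r)  # table: 2/R or 1/R
    return latency, throughput


# Tabulate a few register counts: (S-form result, D-form result).
for regs in (1, 2, 4, 8, 16):
    print(regs, fp_store_multiple(regs, False), fp_store_multiple(regs, True))
```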

## **3.15. ASIMD Integer Instructions**

| Instruction group                      | AArch64 instructions                                                                                                                          | Exec<br>latency     | Execution<br>throughput<br>(T) | Utilized<br>Pipelines | Notes |
|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------|--------------------------------|-----------------------|-------|
| ASIMD absolute diff                    | SABD, UABD                                                                                                                                    | 2                   | 2                              | V                     |       |
| ASIMD absolute diff accum              | SABA, UABA                                                                                                                                    | 4(1)                | 1                              | V1                    | 2     |
| ASIMD absolute diff accum long         | SABAL(2), UABAL(2)                                                                                                                            | 4(1)                | 1                              | V1                    | 2     |
| ASIMD absolute diff long               | SABDL(2), UABDL(2)                                                                                                                            | 2                   | 2                              | V                     |       |
| ASIMD arith, basic                     | ABS, ADD, NEG, SADDL(2),<br>SADDW(2), SHADD, SHSUB,<br>SSUBL(2), SSUBW(2), SUB,<br>UADDL(2), UADDW(2),<br>UHADD, UHSUB, USUBL(2),<br>USUBW(2) | 2                   | 2                              | V                     |       |
| ASIMD arith, complex                   | ADDHN(2), RADDHN(2),<br>RSUBHN(2), SQABS, SQADD,<br>SQNEG, SQSUB, SRHADD,<br>SUBHN(2), SUQADD, UQADD,<br>UQSUB, URHADD, USQADD                | 2                   | 2                              | V                     |       |
| ASIMD arith, pair-wise                 | ADDP, SADDLP, UADDLP                                                                                                                          | 2                   | 2                              | V                     |       |
| ASIMD arith, reduce, 4H/4S             | ADDV, SADDLV, UADDLV                                                                                                                          | 3                   | 1                              | V1                    |       |
| ASIMD arith, reduce, 8B/8H             | ADDV, SADDLV, UADDLV                                                                                                                          | 5                   | 1                              | V1, V                 |       |
| ASIMD arith, reduce, 16B               | ADDV, SADDLV, UADDLV                                                                                                                          | 6                   | 1/2                            | V1                    |       |
| ASIMD compare                          | CMEQ, CMGE, CMGT, CMHI,<br>CMHS, CMLE, CMLT, CMTST                                                                                            | 2                   | 2                              | V                     |       |
| ASIMD dot product                      | SDOT, UDOT                                                                                                                                    | 2                   | 2                              | V                     |       |
| ASIMD logical                          | AND, BIC, EOR, MOV, MVN,<br>ORN, ORR, NOT                                                                                                     | 2                   | 2                              | V                     |       |
| ASIMD max/min, basic and pair-<br>wise | SMAX, SMAXP, SMIN, SMINP,<br>UMAX, UMAXP, UMIN, UMINP                                                                                         | 2                   | 2                              | V                     |       |
| ASIMD max/min, reduce, 4H/4S           | SMAXV, SMINV, UMAXV,<br>UMINV                                                                                                                 | 3                   | 1                              | V1                    |       |
| ASIMD max/min, reduce, 8B/8H           | SMAXV, SMINV, UMAXV,<br>UMINV                                                                                                                 | 5                   | 1                              | V1, V                 |       |
| ASIMD max/min, reduce, 16B             | SMAXV, SMINV, UMAXV,<br>UMINV                                                                                                                 | 6                   | 1/2                            | V1                    |       |
| ASIMD multiply, D-form                 | MUL, SQDMULH, SQRDMULH                                                                                                                        | 4                   | 1                              | V0                    |       |
| ASIMD multiply, Q-form                 | MUL, SQDMULH, SQRDMULH                                                                                                                        | 5                   | 1/2                            | V0                    |       |
| ASIMD multiply accumulate, D-<br>form  | MLA, MLS                                                                                                                                      | 4(1)                | 1                              | V0                    | 1     |
| ASIMD multiply accumulate, Q-<br>form  | MLA, MLS                                                                                                                                      | 5(2)                | 1/2                            | V0                    | 1     |
| ASIMD multiply accumulate<br>high, D-form                | SQRDMLAH, SQRDMLSH                                                                                                         | 4                   | 1                              | V0                    | -     |
| ASIMD multiply accumulate<br>high, Q-form                | SQRDMLAH, SQRDMLSH                                                                                                         | 5                   | 1/2                            | V0                    | -     |
| ASIMD multiply accumulate long                           | SMLAL(2), SMLSL(2), UMLAL(2),<br>UMLSL(2)                                                                                  | 4(1)                | 1                              | V0                    | 1     |
| ASIMD multiply accumulate saturating long                | SQDMLAL(2), SQDMLSL(2)                                                                                                     | 4                   | 1                              | V0                    |       |
| ASIMD multiply/multiply long<br>(8x8) polynomial, D-form | PMUL, PMULL(2)                                                                                                             | 3                   | 1                              | V0                    | 3     |
| ASIMD multiply/multiply long<br>(8x8) polynomial, Q-form | PMUL, PMULL(2)                                                                                                             | 4                   | 1/2                            | V0                    | 3     |
| ASIMD multiply long                                      | SMULL(2), UMULL(2),<br>SQDMULL(2)                                                                                          | 4                   | 1                              | V0                    |       |
| ASIMD pairwise add and accumulate long                   | SADALP, UADALP                                                                                                             | 4(1)                | 1                              | V1                    | 2     |
| ASIMD shift accumulate                                   | SSRA, SRSRA, USRA, URSRA                                                                                                   | 4(1)                | 1                              | V1                    | 2     |
| ASIMD shift by immed, basic                              | SHL, SHLL(2), SHRN(2),<br>SSHLL(2), SSHR, SXTL(2),<br>USHLL(2), USHR, UXTL(2)                                              | 2                   | 1                              | V1                    |       |
| ASIMD shift by immed and insert, basic                   | SLI, SRI                                                                                                                   | 2                   | 1                              | V1                    |       |
| ASIMD shift by immed, complex                            | RSHRN(2), SQRSHRN(2),<br>SQRSHRUN(2), SQSHL{U},<br>SQSHRN(2), SQSHRUN(2),<br>SRSHR, UQRSHRN(2), UQSHL,<br>UQSHRN(2), URSHR | 4                   | 1                              | V1                    |       |
| ASIMD shift by register, basic                           | SSHL, USHL                                                                                                                 | 2                   | 1                              | V1                    |       |
| ASIMD shift by register, complex                         | SRSHL, SQRSHL, SQSHL,<br>URSHL, UQRSHL, UQSHL                                                                              | 4                   | 1                              | V1                    |       |

| Instruction group              | AArch32 instructions                            | Exec<br>latency | Execution<br>throughput<br>(T) | Utilized<br>Pipelines | Notes |
|--------------------------------|-------------------------------------------------|-----------------|--------------------------------|-----------------------|-------|
| ASIMD absolute diff            | VABD                                            | 2               | 2                              | V                     |       |
| ASIMD absolute diff accum      | VABA                                            | 4(1)            | 1                              | V1                    | 2     |
| ASIMD absolute diff accum long | VABAL                                           | 4(1)            | 1                              | V1                    | 2     |
| ASIMD absolute diff long       | VABDL                                           | 2               | 2                              | V                     |       |
| ASIMD arith, basic             | VADD, VADDL, VADDW, VNEG,<br>VSUB, VSUBL, VSUBW | 2               | 2                              | V                     |       |
| ASIMD arith, complex                                     | VABS, VADDHN, VHADD,<br>VHSUB, VQABS, VQADD,<br>VQNEG, VQSUB, VRADDHN,<br>VRHADD, VRSUBHN, VSUBHN | 2               | 2                              | V                     |       |
| ASIMD dot product                                        | VSDOT, VUDOT                                                                                      | 2               | 2                              | V                     |       |
| ASIMD arith, pair-wise                                   | VPADD, VPADDL                                                                                     | 2               | 2                              | V                     |       |
| ASIMD compare                                            | VCEQ, VCGE, VCGT, VCLE,<br>VTST                                                                   | 2               | 1                              | V                     |       |
| ASIMD logical                                            | VAND, VBIC, VMVN, VORR,<br>VORN, VEOR                                                             | 2               | 1                              | V                     |       |
| ASIMD max/min                                            | VMAX, VMIN, VPMAX, VPMIN                                                                          | 2               | 1                              | V                     |       |
| ASIMD multiply, D-form                                   | VMUL, VQDMULH,<br>VQRDMULH                                                                        | 4               | 1                              | V0                    |       |
| ASIMD multiply, Q-form                                   | VMUL, VQDMULH,<br>VQRDMULH                                                                        | 5               | 1/2                            | V0                    |       |
| ASIMD multiply accumulate, D-<br>form                    | VMLA, VMLS                                                                                        | 4(1)            | 1                              | V0                    | 1     |
| ASIMD multiply accumulate, Q-<br>form                    | VMLA, VMLS                                                                                        | 5(2)            | 1/2                            | V0                    | 1     |
| ASIMD multiply accumulate<br>high, D-form                | VQRDMLAH, VQRDMLSH                                                                                | 4               | 1                              | V0                    | -     |
| ASIMD multiply accumulate<br>high, Q-form                | VQRDMLAH, VQRDMLSH                                                                                | 5               | 1/2                            | V0                    | -     |
| ASIMD multiply accumulate long                           | VMLAL, VMLSL                                                                                      | 4(1)            | 1                              | V0                    | 1     |
| ASIMD multiply accumulate saturating long                | VQDMLAL, VQDMLSL                                                                                  | 4               | 1                              | V0                    |       |
| ASIMD multiply/multiply long<br>(8x8) polynomial, D-form | VMUL (.P8), VMULL (.P8)                                                                           | 3               | 1                              | V0                    |       |
| ASIMD multiply (8x8)<br>polynomial, Q-form               | VMUL (.P8)                                                                                        | 4               | 1/2                            | V0                    |       |
| ASIMD multiply long                                      | VMULL (.S, .I), VQDMULL                                                                           | 4               | 1                              | V0                    |       |
| ASIMD pairwise add and accumulate                        | VPADAL                                                                                            | 4(1)            | 1                              | V1                    | 1     |
| ASIMD shift accumulate                                   | VSRA, VRSRA                                                                                       | 4(1)            | 1                              | V1                    | 1     |
| ASIMD shift by immed, basic                              | VMOVL, VSHL, VSHLL, VSHR,<br>VSHRN                                                                | 2               | 1                              | V1                    |       |
| ASIMD shift by immed and insert, basic                   | VSLI, VSRI                                                                                        | 2               | 1                              | V1                    |       |
| ASIMD shift by immed, complex    | VQRSHRN, VQRSHRUN,<br>VQSHL{U}, VQSHRN,<br>VQSHRUN, VRSHR, VRSHRN | 4               | 1                              | V1                    |       |
| ASIMD shift by register, basic   | VSHL                                                              | 2               | 1                              | V1                    |       |
| ASIMD shift by register, complex | VQRSHL, VQSHL, VRSHL                                              | 4               | 1                              | V1                    |       |

- 1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar uops, allowing a typical sequence of integer multiply-accumulate uops to issue one every cycle or one every other cycle (accumulate latency shown in parentheses).
- 2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar uops, allowing a typical sequence of such uops to issue one every cycle (accumulate latency shown in parentheses).
- 3. This category includes instructions of the form "PMULL Vd.8H, Vn.8B, Vm.8B" and "PMULL2 Vd.8H, Vn.16B, Vm.16B".
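To make the effect of late-forwarding in notes 1 and 2 concrete, the sketch below estimates the latency of a chain of dependent accumulate uops that all target the same accumulator. It is a simplified model, not a description of the pipeline: it assumes the multiplicand operands are always ready, that every uop in the chain is of a "similar" type so forwarding applies, and that nothing else competes for the pipeline. The function names are hypothetical.

```python
def accumulate_chain_latency(count, exec_latency, accum_latency):
    """Estimated cycles until the last of `count` dependent accumulate uops
    produces its result, assuming late-forwarding of the accumulator.

    The first uop pays the full execution latency. Each later uop only
    waits the accumulate latency (the value in parentheses in the table).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return exec_latency + (count - 1) * accum_latency


def no_forwarding_latency(count, exec_latency):
    """Same chain if every uop had to wait for the full execution latency."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return count * exec_latency


# D-form MLA is listed as 4(1), Q-form MLA as 5(2).
print(accumulate_chain_latency(8, exec_latency=4, accum_latency=1))
print(accumulate_chain_latency(8, exec_latency=5, accum_latency=2))
print(no_forwarding_latency(8, exec_latency=4))
```

Under this model the accumulate latency matches the issue rates stated in note 1: one uop every cycle for the D-form entries and one every other cycle for the Q-form entries.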

## **3.16. ASIMD Floating-Point Instructions**

| Instruction group                                      | AArch64 instructions                                                                                  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD FP absolute<br>value/difference                  | FABS, FABD                                                                                            | 2               | 2                       | V                     |       |
| ASIMD FP arith, normal                                 | FABD, FADD, FSUB, FADDP                                                                               | 2               | 2                       | V                     |       |
| ASIMD FP compare                                       | FACGE, FACGT, FCMEQ,<br>FCMGE, FCMGT, FCMLE,<br>FCMLT                                                 | 2               | 2                       | V                     |       |
| ASIMD FP convert, long (F16 to<br>F32)                 | FCVTL(2)                                                                                              | 4               | 1/2                     | VO                    |       |
| ASIMD FP convert, long (F32 to<br>F64)                 | FCVTL(2)                                                                                              | 3               | 1                       | VO                    |       |
| ASIMD FP convert, narrow (F32<br>to F16)               | FCVTN(2)                                                                                              | 4               | 1/2                     | VO                    |       |
| ASIMD FP convert, narrow (F64<br>to F32)               | FCVTN(2), FCVTXN(2)                                                                                   | 3               | 1                       | VO                    |       |
| ASIMD FP convert, other, D-<br>form F32 and Q-form F64 | FCVTAS, FCVTAU, FCVTMS,<br>FCVTMU, FCVTNS, FCVTNU,<br>FCVTPS, FCVTPU, FCVTZS,<br>FCVTZU, SCVTF, UCVTF | 3               | 1                       | VO                    |       |
| ASIMD FP convert, other, D-<br>form F16 and Q-form F32 | FCVTAS, FCVTAU, FCVTMS,<br>FCVTMU, FCVTNS, FCVTNU,<br>FCVTPS, FCVTPU, FCVTZS,<br>FCVTZU, SCVTF, UCVTF | 4               | 1/2                     | V0                    |       |
| ASIMD FP convert, other, Q-<br>form F16                | FCVTAS, FCVTAU, FCVTMS,<br>FCVTMU, FCVTNS, FCVTNU,<br>FCVTPS, FCVTPU, FCVTZS,<br>FCVTZU, SCVTF, UCVTF | 6               | 1/4                     | V0                    |       |
| ASIMD FP divide, D-form, F16                           | FDIV                                                                                                  | 7               | 1/7                     | V0                    | 3     |
| ASIMD FP divide, D-form, F32                           | FDIV                                                                                                  | 7 to 10         | 2/9 to 2/7              | V0                    | 3     |
| ASIMD FP divide, Q-form, F16                           | FDIV                                                                                                  | 10 to<br>13     | 1/13 to 1/10            | V0                    | 3     |
| ASIMD FP divide, Q-form, F32                           | FDIV                                                                                                  | 7 to 10         | 1/9 to 1/7              | V0                    | 3     |
| ASIMD FP divide, Q-form, F64                           | FDIV                                                                                                  | 7 to 15         | 1/14 to 1/7             | V0                    | 3     |
| ASIMD FP max/min, normal                               | FMAX, FMAXNM, FMIN,<br>FMINNM                                                                         | 2               | 2                       | V                     |       |
| ASIMD FP max/min, pairwise                             | FMAXP, FMAXNMP, FMINP,<br>FMINNMP                                                                     | 2               | 2                       | V                     |       |
| ASIMD FP max/min, reduce                               | FMAXV, FMAXNMV, FMINV,<br>FMINNMV                                                                     | 5               | 2                       | V                     |       |
| ASIMD FP max/min, reduce, Q-<br>form F16               | FMAXV, FMAXNMV, FMINV,<br>FMINNMV                                                                     | 8               | 2/3                     | V                     |       |
| ASIMD FP multiply                                      | FMUL, FMULX                                                                                           | 3               | 2                       | V                     | 2     |
| ASIMD FP multiply accumulate                           | FMLA, FMLS                                                                                            | 4(2)            | 2                       | V                     | 1     |
| ASIMD FP multiply accumulate long                      | FMLAL(2), FMLSL(2)                                                                                    | 5(2)            | 2                       | V                     | 1     |
| ASIMD FP negate                                        | FNEG                                                                                                  | 2               | 2                       | V                     |       |
| ASIMD FP round, D-form F32<br>and Q-form F64           | FRINTA, FRINTI, FRINTM,<br>FRINTN, FRINTP, FRINTX,<br>FRINTZ                                          | 3               | 1                       | V0                    |       |
| ASIMD FP round, D-form F16<br>and Q-form F32           | FRINTA, FRINTI, FRINTM,<br>FRINTN, FRINTP, FRINTX,<br>FRINTZ                                          | 4               | 1/2                     | V0                    |       |
| ASIMD FP round, Q-form F16                             | FRINTA, FRINTI, FRINTM,<br>FRINTN, FRINTP, FRINTX,<br>FRINTZ                                          | 6               | 1/4                     | V0                    |       |
| ASIMD FP square root, D-form,<br>F16                   | FSQRT                                                                                                 | 7               | 1/7                     | V0                    | 3     |
| ASIMD FP square root, D-form,<br>F32                   | FSQRT                                                                                                 | 7 to 10         | 2/9 to 2/7              | V0                    | 3     |
| ASIMD FP square root, Q-form,<br>F16 | FSQRT                | 11 to<br>13     | 1/13 to 1/11            | V0                    | 3     |
| ASIMD FP square root, Q-form,<br>F32 | FSQRT                | 7 to 10         | 1/9 to 1/7              | V0                    | 3     |
| ASIMD FP square root, Q-form,<br>F64 | FSQRT                | 7 to 17         | 1/16 to 1/7             | V0                    | 3     |

| Instruction group                       | AArch32 instructions                                  | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-----------------------------------------|-------------------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD FP absolute value                 | VABS                                                  | 2               | 2                       | V                     |       |
| ASIMD FP arith                          | VABD, VADD, VPADD, VSUB                               | 2               | 2                       | V                     |       |
| ASIMD FP compare                        | VACGE, VACGT, VACLE, VACLT,<br>VCEQ, VCGE, VCGT, VCLE | 2               | 2                       | V                     |       |
| ASIMD FP convert, integer, D-<br>form   | VCVT, VCVTA, VCVTM, VCVTN,<br>VCVTP                   | 3               | 1                       | V0                    |       |
| ASIMD FP convert, integer, Q-<br>form   | VCVT, VCVTA, VCVTM, VCVTN,<br>VCVTP                   | 4               | 1/2                     | V0                    |       |
| ASIMD FP convert, fixed, D-<br>form     | VCVT                                                  | 3               | 1                       | V0                    |       |
| ASIMD FP convert, fixed, Q-<br>form     | VCVT                                                  | 4               | 1/2                     | V0                    |       |
| ASIMD FP convert, half-<br>precision    | VCVT                                                  | 4               | 1/2                     | V0                    |       |
| ASIMD FP max/min                        | VMAX, VMIN, VPMAX, VPMIN,<br>VMAXNM, VMINNM           | 2               | 2                       | V                     |       |
| ASIMD FP multiply                       | VMUL, VNMUL                                           | 3               | 2                       | V                     | 2     |
| ASIMD FP chained multiply accumulate    | VMLA, VMLS                                            | 5(2)            | 2                       | V                     | 1     |
| ASIMD FP fused multiply accumulate      | VFMA, VFMS                                            | 4(2)            | 2                       | V                     | 1     |
| ASIMD FP fused multiply accumulate long | VFMAL(2), VFMSL(2)                                    | 4(2)            | 2                       | V                     | 1     |
| ASIMD FP negate                         | VNEG                                                  | 2               | 2                       | V                     |       |
| ASIMD FP round to integral, D-<br>form  | VRINTA, VRINTM, VRINTN,<br>VRINTP, VRINTX, VRINTZ     | 3               | 1                       | V0                    |       |
| ASIMD FP round to integral, Q-<br>form  | VRINTA, VRINTM, VRINTN,<br>VRINTP, VRINTX, VRINTZ     | 4               | 1/2                     | V0                    |       |

- 1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar uops, allowing a typical sequence of floating-point multiply-accumulate uops to issue one every N cycles (accumulate latency N shown in parentheses).
- 2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply uops to the accumulate operands of an ASIMD FP multiply-accumulate uop. The latter can potentially be issued 1 cycle after the ASIMD FP multiply uop has been issued.
- 3. ASIMD divide and square root operations are performed using an iterative algorithm and block subsequent similar operations to the same pipeline until complete.
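
As a rough illustration of note 1, the following Python sketch turns the `4(2)` entry of the AArch64 FMLA/FMLS row into cycle estimates. It is a back-of-envelope model, not a description of the hardware scheduler: the function names are hypothetical, and it assumes that issue rate and accumulate latency are the only limits (no dispatch, load, or register-pressure constraints).

```python
import math
from fractions import Fraction

# Figures copied from the AArch64 "ASIMD FP multiply accumulate" row: 4(2), throughput 2.
FMLA_LATENCY = 4      # cycles until the result reaches an ordinary consumer
FMLA_ACC_LATENCY = 2  # cycles until it can be late-forwarded as the next accumulate operand
FMLA_THROUGHPUT = 2   # peak instructions issued per cycle


def chain_cycles(n, latency=FMLA_LATENCY, acc_latency=FMLA_ACC_LATENCY):
    """Cycles until the last of n FMLAs sharing one accumulator delivers its result."""
    if n <= 0:
        return 0
    # Each dependent FMLA can issue acc_latency cycles after the previous one;
    # only the final one pays the full execution latency.
    return (n - 1) * acc_latency + latency


def accumulators_to_saturate(throughput=FMLA_THROUGHPUT, acc_latency=FMLA_ACC_LATENCY):
    """Independent accumulator chains needed to reach the peak issue rate."""
    return throughput * acc_latency


def issue_cycles(n, chains, throughput=FMLA_THROUGHPUT, acc_latency=FMLA_ACC_LATENCY):
    """Issue-bound estimate for n FMLAs spread evenly over `chains` (>= 1) accumulators."""
    # Each chain contributes at most one FMLA every acc_latency cycles.
    rate = min(Fraction(throughput), Fraction(chains, acc_latency))
    return math.ceil(n / rate)


if __name__ == "__main__":
    print(chain_cycles(8))              # one accumulator, 8 dependent FMLAs
    print(accumulators_to_saturate())   # chains needed for peak throughput
    print(issue_cycles(16, chains=1))
    print(issue_cycles(16, chains=4))
```

Under these assumptions, eight FMLAs on a single accumulator finish in 18 cycles rather than the 32 that a full 4-cycle dependency would imply, and four independent accumulators are enough to sustain two FMLAs per cycle.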

## 3.17. ASIMD Miscellaneous Instructions

| Instruction group                                        | AArch64 instructions                        | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|----------------------------------------------------------|---------------------------------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD bit reverse                                        | RBIT                                        | 2               | 2                       | V                     |       |
| ASIMD bitwise insert                                     | BIF, BIT, BSL                               | 2               | 2                       | V                     |       |
| ASIMD count                                              | CLS, CLZ, CNT                               | 2               | 2                       | V                     |       |
| ASIMD duplicate, gen reg                                 | DUP                                         | 3               | 1                       | M                     |       |
| ASIMD duplicate, element                                 | DUP                                         | 2               | 2                       | V                     |       |
| ASIMD extract                                            | EXT                                         | 2               | 2                       | V                     |       |
| ASIMD extract narrow                                     | XTN                                         | 2               | 2                       | V                     |       |
| ASIMD extract narrow, saturating                         | SQXTN(2), SQXTUN(2),<br>UQXTN(2)            | 4               | 1                       | V1                    |       |
| ASIMD insert, element to element                         | INS                                         | 2               | 2                       | V                     |       |
| ASIMD move, FP immed                                     | FMOV                                        | 2               | 2                       | V                     |       |
| ASIMD move, integer immed                                | MOVI, MVNI                                  | 2               | 2                       | V                     |       |
| ASIMD reciprocal estimate, D-<br>form F32 and F64        | FRECPE, FRECPX, FRSQRTE,<br>URECPE, URSQRTE | 3               | 1                       | V0                    |       |
| ASIMD reciprocal estimate, D-<br>form F16 and Q-form F32 | FRECPE, FRECPX, FRSQRTE,<br>URECPE, URSQRTE | 4               | 1/2                     | V0                    |       |
| ASIMD reciprocal estimate, Q-<br>form F16                | FRECPE, FRECPX, FRSQRTE,<br>URECPE, URSQRTE | 6               | 1/4                     | V0                    |       |
| ASIMD reciprocal step                                    | FRECPS, FRSQRTS                             | 4               | 2                       | V                     |       |
| ASIMD reverse                                            | REV16, REV32, REV64                         | 2               | 2                       | V                     |       |
| ASIMD table lookup, 1 or 2<br>table regs                 | TBL                                         | 2               | 2                       | V                     |       |
| ASIMD table lookup, 3 table<br>regs                      | TBL                                         | 4               | 1/2                     | V                     |       |
| ASIMD table lookup, 4 table<br>regs          | TBL                    | 4               | 2/3                     | V                     |       |
| ASIMD table lookup extension,<br>1 table reg | TBX                    | 2               | 2                       | V                     |       |
| ASIMD table lookup extension,<br>2 table reg | TBX                    | 4               | 1/2                     | V                     |       |
| ASIMD table lookup extension,<br>3 table reg | TBX                    | 6               | 2/3                     | V                     |       |
| ASIMD table lookup extension,<br>4 table reg | TBX                    | 6               | 2/5                     | V                     |       |
| ASIMD transfer, element to gen reg           | UMOV, SMOV             | 2               | 1                       | V1                    |       |
| ASIMD transfer, gen reg to element           | INS                    | 5               | 1                       | M, V                  |       |
| ASIMD transpose                              | TRN1, TRN2             | 2               | 2                       | V                     |       |
| ASIMD unzip/zip                              | UZP1, UZP2, ZIP1, ZIP2 | 2               | 2                       | V                     |       |

| Instruction group                        | AArch32 instructions   | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|------------------------------------------|------------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD bitwise insert                     | VBIF, VBIT, VBSL       | 2               | 2                       | V                     |       |
| ASIMD count                              | VCLS, VCLZ, VCNT       | 2               | 2                       | V                     |       |
| ASIMD duplicate, core reg                | VDUP                   | 3               | 1                       | M                     |       |
| ASIMD duplicate, scalar                  | VDUP                   | 2               | 2                       | V                     |       |
| ASIMD extract                            | VEXT                   | 2               | 2                       | V                     |       |
| ASIMD move, immed                        | VMOV, VMVN             | 2               | 2                       | V                     |       |
| ASIMD move, register                     | VMOV                   | 2               | 2                       | V                     |       |
| ASIMD move, narrowing                    | VMOVN                  | 2               | 2                       | V                     |       |
| ASIMD move, saturating                   | VQMOVN, VQMOVUN        | 4               | 1                       | V1                    |       |
| ASIMD reciprocal estimate, D-<br>form    | VRECPE, VRSQRTE        | 3               | 1                       | V0                    |       |
| ASIMD reciprocal estimate, Q-<br>form    | VRECPE, VRSQRTE        | 4               | 1/2                     | V0                    |       |
| ASIMD reciprocal step                    | VRECPS, VRSQRTS        | 5               | 2                       | V                     |       |
| ASIMD reverse                            | VREV16, VREV32, VREV64 | 2               | 2                       | V                     |       |
| ASIMD swap                               | VSWP                   | 4               | 2/3                     | V                     |       |
| ASIMD table lookup, 1 or 2<br>table regs | VTBL                   | 2               | 2                       | V                     |       |
| ASIMD table lookup, 3 table<br>regs               | VTBL                 | 4               | 1/2                     | V                     |       |
| ASIMD table lookup, 4 table<br>regs               | VTBL                 | 4               | 2/3                     | V                     |       |
| ASIMD table lookup extension,<br>1 reg            | VTBX                 | 2               | 2                       | V                     |       |
| ASIMD table lookup extension,<br>2 table reg      | VTBX                 | 4               | 1/2                     | V                     |       |
| ASIMD table lookup extension,<br>3 table reg      | VTBX                 | 6               | 2/3                     | V                     |       |
| ASIMD table lookup extension,<br>4 table reg      | VTBX                 | 8               | 2/5                     | V                     |       |
| ASIMD transfer, scalar to core reg, word          | VMOV                 | 2               | 1                       | V1                    |       |
| ASIMD transfer, scalar to core<br>reg, byte/hword | VMOV                 | 3               | 1                       | V1, I                 |       |
| ASIMD transfer, core reg to scalar                | VMOV                 | 5               | 1                       | M, V                  |       |
| ASIMD transpose                                   | VTRN                 | 4               | 2/3                     | V                     |       |
| ASIMD unzip/zip                                   | VUZP, VZIP           | 4               | 2/3                     | V                     |       |

## 3.18. ASIMD Load Instructions

The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

| Instruction group                                 | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|---------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD load, 1 element,<br>multiple, 1 reg, D-form | LD1                  | 5               | 2                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 1 reg, Q-form | LD1                  | 5               | 2                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 2 reg, D-form | LD1                  | 5               | 1                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 2 reg, Q-form | LD1                  | 5               | 1                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 3 reg, D-form | LD1                  | 6               | 2/3                     | L                     |       |
| ASIMD load, 1 element,<br>multiple, 3 reg, Q-form | LD1                  | 6               | 2/3                     | L                     |       |
| ASIMD load, 1 element,<br>multiple, 4 reg, D-form  | LD1                  | 6               | 1/2                     | L                     |       |
| ASIMD load, 1 element,<br>multiple, 4 reg, Q-form  | LD1                  | 6               | 1/2                     | L                     |       |
| ASIMD load, 1 element, one<br>lane, B/H/S          | LD1                  | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, one<br>lane, D              | LD1                  | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, all<br>lanes, D-form, B/H/S | LD1R                 | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, all<br>lanes, D-form, D     | LD1R                 | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, all<br>lanes, Q-form        | LD1R                 | 7               | 2                       | L, V                  |       |
| ASIMD load, 2 element,<br>multiple, D-form, B/H/S  | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element,<br>multiple, Q-form, B/H/S  | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element,<br>multiple, Q-form, D      | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, one<br>lane, B/H            | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, one<br>lane, S              | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, one<br>lane, D              | LD2                  | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, all<br>lanes, D-form, B/H/S | LD2R                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, all<br>lanes, D-form, D     | LD2R                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, all<br>lanes, Q-form        | LD2R                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 3 element,<br>multiple, D-form, B/H/S  | LD3                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element,<br>multiple, Q-form, B/H/S  | LD3                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element,<br>multiple, Q-form, D      | LD3                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, one<br>lane, B/H            | LD3                  | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, one<br>lane, S              | LD3                  | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, one<br>lane, D              | LD3                  | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, all<br>lanes, D-form, B/H/S | LD3R                 | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, all<br>lanes, D-form, D     | LD3R                 | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, all<br>lanes, Q-form, B/H/S | LD3R                 | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 3 element, all<br>lanes, Q-form, D     | LD3R                 | 7               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element,<br>multiple, D-form, B/H/S  | LD4                  | 8               | 2/7                     | L, V                  |       |
| ASIMD load, 4 element,<br>multiple, Q-form, B/H/S  | LD4                  | 10              | 1/5                     | L, V                  |       |
| ASIMD load, 4 element,<br>multiple, Q-form, D      | LD4                  | 10              | 1/5                     | L, V                  |       |
| ASIMD load, 4 element, one<br>lane, B/H            | LD4                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, one<br>lane, S              | LD4                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, one<br>lane, D              | LD4                  | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, all<br>lanes, D-form, B/H/S | LD4R                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, all<br>lanes, D-form, D     | LD4R                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, all<br>lanes, Q-form, B/H/S | LD4R                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, all<br>lanes, Q-form, D     | LD4R                 | 8               | 1/2                     | L, V                  |       |
| (ASIMD load, writeback form)                       |                      | (1)             |                         | +                     | 1     |

| Instruction group                         | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD load, 1 element,<br>multiple, 1 reg | VLD1                 | 5               | 2                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 2 reg | VLD1                 | 5               | 2                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 3 reg     | VLD1                 | 5               | 1                       | L                     |       |
| ASIMD load, 1 element,<br>multiple, 4 reg     | VLD1                 | 5               | 1                       | L                     |       |
| ASIMD load, 1 element, one<br>lane            | VLD1                 | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, all<br>lanes, 1 reg    | VLD1                 | 7               | 2                       | L, V                  |       |
| ASIMD load, 1 element, all<br>lanes, 2 reg    | VLD1                 | 7               | 2/3                     | L, V                  |       |
| ASIMD load, 2 element,<br>multiple, 2 reg     | VLD2                 | 7               | 2/3                     | L, V                  |       |
| ASIMD load, 2 element,<br>multiple, 4 reg     | VLD2                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 2 element, one<br>lane, size 32   | VLD2                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, one<br>lane, size 8/16 | VLD2                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 2 element, all<br>lanes           | VLD2                 | 7               | 1                       | L, V                  |       |
| ASIMD load, 3 element,<br>multiple, 3 reg     | VLD3                 | 8               | 2/3                     | L, V                  |       |
| ASIMD load, 3 element, one<br>lane, size 32   | VLD3                 | 8               | 2/3                     | L, V                  |       |
| ASIMD load, 3 element, one<br>lane, size 8/16 | VLD3                 | 8               | 2/3                     | L, V                  |       |
| ASIMD load, 3 element, all<br>lanes           | VLD3                 | 8               | 2/3                     | L, V                  |       |
| ASIMD load, 4 element,<br>multiple, 4 reg     | VLD4                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, one<br>lane, size 32   | VLD4                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, one<br>lane, size 8/16 | VLD4                 | 8               | 1/2                     | L, V                  |       |
| ASIMD load, 4 element, all<br>lanes           | VLD4                 | 8               | 1/2                     | L, V                  |       |
| (ASIMD load, writeback form)                  |                      | (1)             |                         | +                     | 1     |

1. Writeback forms of load instructions require an extra uop to update the base address. This update is typically performed in parallel with the load uop (update latency shown in parentheses).
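
The execution-throughput column in the tables above can be read as instructions per cycle, so its reciprocal gives a rough issue-bound cost per instruction. The Python sketch below applies that reading to two AArch64 rows that each fill four Q registers. The helper name and dictionary are hypothetical, and the estimate assumes L1 hits and ignores everything except load issue rate. The two forms are not interchangeable (LD4 de-interleaves, LD1 does not), so this only compares the pipeline cost of each form.

```python
import math
from fractions import Fraction

# (exec latency, execution throughput) copied from the AArch64 ASIMD load rows.
ASIMD_LOADS = {
    "LD1, multiple, 4 reg, Q-form": (6, Fraction(1, 2)),
    "LD4, multiple, Q-form, B/H/S": (10, Fraction(1, 5)),
}


def issue_bound_cycles(count, throughput):
    """Cycles needed to issue `count` instructions at `throughput` per cycle."""
    return math.ceil(count / throughput)


if __name__ == "__main__":
    for name, (latency, throughput) in ASIMD_LOADS.items():
        cycles = issue_bound_cycles(100, throughput)
        print(f"{name}: latency {latency}, ~{cycles} cycles for 100 back-to-back loads")
```

With these table values, 100 back-to-back four-register LD1 loads cost about 200 issue cycles, against about 500 for the same number of Q-form LD4 loads.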

## **3.19. ASIMD Store Instructions**

Store Mops are split into store address and store data uops at dispatch time. Once executed, stores are buffered and committed in the background.

| Instruction group                                  | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|----------------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD store, 1 element,<br>multiple, 1 reg, D-form | ST1                  | 2               | 2                       | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 1 reg, Q-form | ST1                  | 2               | 1                       | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 2 reg, D-form | ST1                  | 2               | 1                       | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 2 reg, Q-form | ST1                  | 3               | 1/2                     | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 3 reg, D-form | ST1                  | 3               | 2/3                     | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 3 reg, Q-form | ST1                  | 4               | 1/3                     | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 4 reg, D-form | ST1                  | 3               | 1/2                     | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 4 reg, Q-form | ST1                  | 5               | 1/4                     | L, V                  |       |
| ASIMD store, 1 element, one<br>lane, B/H/S         | ST1                  | 4               | 1                       | V, L                  |       |
| ASIMD store, 1 element, one<br>lane, D             | ST1                  | 4               | 1                       | V, L                  |       |
| ASIMD store, 2 element,<br>multiple, D-form, B/H/S | ST2                  | 4               | 1                       | V, L                  |       |
| ASIMD store, 2 element,<br>multiple, Q-form, B/H/S | ST2                  | 5               | 1/2                     | V, L                  |       |
| ASIMD store, 2 element,<br>multiple, Q-form, D     | ST2                  | 5               | 1/2                     | V, L                  |       |
| ASIMD store, 2 element, one<br>lane, B/H/S         | ST2                  | 4               | 1                       | V, L                  |       |
| ASIMD store, 2 element, one<br>lane, D             | ST2                  | 4               | 1                       | V, L                  |       |
| ASIMD store, 3 element,<br>multiple, D-form, B/H/S | ST3                  | 5               | 1/2                     | V, L                  |       |
| ASIMD store, 3 element,<br>multiple, Q-form, B/H/S | ST3                  | 6               | 1/3                     | V, L                  |       |
| ASIMD store, 3 element,<br>multiple, Q-form, D     | ST3                  | 6               | 1/3                     | V, L                  |       |
| ASIMD store, 3 element, one<br>lane, B/H           | ST3                  | 4               | 1/2                     | V, L                  |       |
| ASIMD store, 3 element, one<br>lane, S             | ST3                  | 4               | 1/2                     | V, L                  |       |
| ASIMD store, 3 element, one<br>lane, D             | ST3                  | 5               | 1/2                     | V, L                  |       |
| ASIMD store, 4 element,<br>multiple, D-form, B/H/S | ST4                  | 7               | 1/3                     | V, L                  |       |
| ASIMD store, 4 element,<br>multiple, Q-form, B/H/S | ST4                  | 9               | 1/6                     | V, L                  |       |
| ASIMD store, 4 element,<br>multiple, Q-form, D     | ST4                  | 6               | 1/4                     | V, L                  |       |
| ASIMD store, 4 element, one<br>lane, B/H           | ST4                  | 5               |                         | V, L                  |       |
| ASIMD store, 4 element, one<br>lane, S             | ST4                  |                 | 2/3                     | V, L                  |       |
| ASIMD store, 4 element, one<br>lane, D             | ST4                  |                 |                         | V, L                  |       |
| (ASIMD store, writeback<br>form)                   |                      | (1)             |                         | Add I                 | 1     |

| Instruction group                          | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| ASIMD store, 1 element,<br>multiple, 1 reg | VST1                 | 2               | 2                       | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 2 reg | VST1                 | 2               | 2                       | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 3 reg | VST1                 | 3               | 2/3                     | L, V                  |       |
| ASIMD store, 1 element,<br>multiple, 4 reg | VST1                 | 3               | 1/2                     | L, V                  |       |
| ASIMD store, 1 element, one<br>lane        | VST1                 | 4               | 2                       | V, L                  |       |
| ASIMD store, 2 element,<br>multiple, 2 reg | VST2                 | 4               | 1                       | V, L                  |       |
| ASIMD store, 2 element,<br>multiple, 4 reg | VST2                 | 5               | 1/2                     | V, L                  |       |
| ASIMD store, 2 element, one<br>lane        | VST2                 | 4               | 2                       | V, L                  |       |
| ASIMD store, 3 element,<br>multiple, 3 reg     | VST3                 | 5               | 2/3                     | V, L                  |       |
| ASIMD store, 3 element, one<br>lane, size 32   | VST3                 | 4               | 1                       | V, L                  |       |
| ASIMD store, 3 element, one<br>lane, size 8/16 | VST3                 | 4               | 1                       | V, L                  |       |
| ASIMD store, 4 element,<br>multiple, 4 reg     | VST4                 | 8               | 1/2                     | V, L                  |       |
| ASIMD store, 4 element, one<br>lane, size 32   | VST4                 | 7               | 2                       | V, L                  |       |
| ASIMD store, 4 element, one<br>lane, size 8/16 | VST4                 | 7               | 2                       | V, L                  |       |
| (ASIMD store, writeback<br>form)               |                      | (1)             |                         | +                     | 1     |

1. Writeback forms of store instructions require an extra uop to update the base address. This update is typically performed in parallel with the store uop (update latency shown in parentheses).

## 3.20. Cryptography Extensions

| Instruction group                          | AArch64 instructions      | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|--------------------------------------------|---------------------------|-----------------|-------------------------|-----------------------|-------|
| Crypto AES ops                             | AESD, AESE, AESIMC, AESMC | 2               | 1                       | V0                    |       |
| Crypto polynomial (64x64)<br>multiply long | PMULL(2)                  | 2               | 1                       | V0                    |       |
| Crypto SHA1 hash acceleration op           | SHA1H                     | 2               | 1                       | V0                    |       |
| Crypto SHA1 hash acceleration ops          | SHA1C, SHA1M, SHA1P       | 4               | 1                       | V0                    |       |
| Crypto SHA1 schedule<br>acceleration ops   | SHA1SU0, SHA1SU1          | 2               | 1                       | V0                    |       |
| Crypto SHA256 hash<br>acceleration ops     | SHA256H, SHA256H2         | 4               | 1                       | V0                    |       |
| Crypto SHA256 schedule<br>acceleration ops | SHA256SU0, SHA256SU1      | 2               | 1                       | V0                    |       |

| Instruction group | AArch32 instructions      | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------|---------------------------|-----------------|-------------------------|-----------------------|-------|
| Crypto AES ops    | AESD, AESE, AESIMC, AESMC | 2               | 1                       | V0                    | 1     |
| Crypto polynomial (64x64)<br>multiply long | VMULL.P64            | 2               | 1                       | V0                    |       |
| Crypto SHA1 hash acceleration op           | SHA1H                | 2               | 1                       | V0                    |       |
| Crypto SHA1 hash acceleration ops          | SHA1C, SHA1M, SHA1P  | 4               | 1                       | V0                    |       |
| Crypto SHA1 schedule<br>acceleration ops   | SHA1SU0, SHA1SU1     | 2               | 1                       | V0                    |       |
| Crypto SHA256 hash<br>acceleration ops     | SHA256H, SHA256H2    | 4               | 1                       | V0                    |       |
| Crypto SHA256 schedule<br>acceleration ops | SHA256SU0, SHA256SU1 | 2               | 1                       | V0                    |       |

1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the performance characteristics described in Section 4.7.

## 3.21. CRC

| Instruction group | AArch64 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| CRC checksum ops  | CRC32, CRC32C        | 2               | 1                       | M                     | 1     |

| Instruction group | AArch32 instructions | Exec<br>latency | Execution<br>throughput | Utilized<br>Pipelines | Notes |
|-------------------|----------------------|-----------------|-------------------------|-----------------------|-------|
| CRC checksum ops  | CRC32, CRC32C        | 2               | 1                       | M                     | 1     |

### Notes:

1. CRC execution supports late forwarding of the result from a producer uop to a consumer uop. This results in a 1 cycle reduction in latency as seen by the consumer.

# **4** Special considerations

## 4.1. Dispatch Constraints

Dispatch of uops from the in-order portion to the out-of-order portion of the microarchitecture includes a number of constraints. It is important to consider these constraints during code generation in order to maximize the effective dispatch bandwidth and subsequent execution bandwidth of Cortex-A76.

The dispatch stage can process up to 4 Mops per cycle and dispatch up to 8 uops per cycle, with the following limitations on the number of uops of each type that may be simultaneously dispatched.

- Up to 2 uops utilizing B pipeline
- Up to 4 uops utilizing S pipelines
- Up to 2 uops utilizing M pipeline
- Up to 2 uops utilizing each of the V pipelines.
- Up to 2 uops utilizing each of the L pipelines

In the event there are more uops available to be dispatched in a given cycle than can be supported by the constraints above, uops will be dispatched in oldest to youngest age-order to the extent allowed by the above.
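The per-cycle limits above can be approximated with a simple counting model. The following is a minimal sketch, not a description of the actual hardware: the names `uop_class`, `class_cap` and `uops_dispatched_this_cycle` are hypothetical, each uop is assumed to belong to exactly one pipeline class, the V and L limits are assumed to total 4 per class (2 uops for each of two pipelines), and dispatch is assumed to stop at the first uop that does not fit.

```c
#include <assert.h>
#include <stddef.h>

/* Pipeline classes used in this sketch (illustrative names only). */
enum uop_class { UOP_B, UOP_S, UOP_M, UOP_V, UOP_L, UOP_CLASS_COUNT };

/* Per-cycle caps: B=2, S=4, M=2.
 * Assumption: V and L are capped at 2 uops x 2 pipelines = 4 each. */
static const unsigned class_cap[UOP_CLASS_COUNT] = { 2, 4, 2, 4, 4 };

enum { MAX_UOPS_PER_CYCLE = 8 };

/*
 * Returns how many of the n uops (ordered oldest first) dispatch in one cycle.
 * Assumption: dispatch stops at the first uop whose class cap is exhausted,
 * so younger uops never bypass an older one.
 */
size_t uops_dispatched_this_cycle(const enum uop_class *uops, size_t n)
{
    unsigned used[UOP_CLASS_COUNT] = { 0 };
    size_t count = 0;

    assert(uops != NULL || n == 0);

    while (count < n && count < MAX_UOPS_PER_CYCLE) {
        enum uop_class c = uops[count];
        if (used[c] == class_cap[c])
            break;          /* cap reached: this uop and all younger ones wait */
        used[c]++;
        count++;
    }
    return count;
}
```

Under these assumptions, a group that begins with three branch uops dispatches only the first two in the current cycle, which suggests spreading uops of the same class across neighbouring instructions where the code generator has the freedom to do so.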

## 4.2. Dispatch Stall

In the event of a V-pipeline uop containing more than 1 quad-word register source, a portion or all of which was previously written as one or multiple single words, that uop will stall in dispatch for three cycles. This stall occurs only on the first such instance, and subsequent consumers of the same register will not experience this stall.

## 4.3. Optimizing General-Purpose Register Spills and Fills

Register transfers between general-purpose registers (GPR) and ASIMD registers (VPR) have lower latency than reads and writes to the cache hierarchy. It is therefore recommended that GPRs be spilled to and filled from the VPR rather than memory, when possible.

## 4.4. Optimizing Memory Copy

The Cortex-A76 processor includes two load/store pipelines, which allow it to execute two 128-bit load uops and one 128-bit store uop every cycle.

To achieve maximum throughput for memory copy (or similar loops), one should do the following.

- Unroll the loop to include multiple load and store operations per iteration, minimizing the overheads of looping.
- Use discrete, non-writeback forms of load and store instructions while interleaving them.
- Align stores on 16B boundary wherever possible.

The following example shows a recommended instruction sequence for a long memory copy in AArch64 state:

    Loop_start:
        SUBS X2,X2,#192
        LDP  Q3,Q4,[x1,#0]
        STP  Q3,Q4,[x0,#0]
        LDP  Q3,Q4,[x1,#32]
        STP  Q3,Q4,[x0,#32]
        LDP  Q3,Q4,[x1,#64]
        STP  Q3,Q4,[x0,#64]
        LDP  Q3,Q4,[x1,#96]
        STP  Q3,Q4,[x0,#96]
        LDP  Q3,Q4,[x1,#128]
        STP  Q3,Q4,[x0,#128]
        LDP  Q3,Q4,[x1,#160]
        STP  Q3,Q4,[x0,#160]
        ADD  X1,X1,#192
        ADD  X0,X0,#192
        BGT  Loop_start

A recommended copy routine for AArch32 would look similar to the sequence above but would use LDRD/STRD instructions. Avoid load-/store-multiple instruction encodings (such as LDM and STM).

## 4.5. Load/Store Alignment

The Armv8.2-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A76 processor handles most unaligned accesses without performance penalties. However, the following cases reduce bandwidth or incur additional latency.

- Load operations that cross a cache-line (64-byte) boundary.
- Quad-word load operations that are not 4B aligned.
- Store operations that cross a 16B boundary.
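The three cases above can be expressed as address checks, for example when auditing buffer layouts in a test harness. The following is a minimal sketch; the function names are hypothetical, and a "quad-word" load is assumed to mean a 16-byte access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the access [addr, addr + size) spans a multiple of `boundary`.
 * `boundary` must be a power of two and `size` must be non-zero. */
bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    uintptr_t mask  = ~((uintptr_t)boundary - 1);
    uintptr_t first = addr & mask;                 /* block of the first byte */
    uintptr_t last  = (addr + size - 1) & mask;    /* block of the last byte  */
    return first != last;
}

/* True if a load falls into one of the listed slower cases. */
bool load_may_be_penalized(uintptr_t addr, size_t size)
{
    if (crosses_boundary(addr, size, 64))   /* crosses a 64-byte cache line */
        return true;
    if (size == 16 && (addr & 3) != 0)      /* quad-word load, not 4B aligned */
        return true;
    return false;
}

/* True if a store crosses a 16B boundary. */
bool store_may_be_penalized(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, 16);
}
```

For instance, a 16-byte store to an address ending in `0x8` spans two 16B blocks and is flagged, while the same store to a 16B-aligned address is not.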

## 4.6. Store to Load Forwarding

The Cortex-A76 core allows data to be forwarded from store instructions to a load instruction with the restrictions mentioned below:

- Load start address should align with the start or middle address of the older store
- Loads of size greater than or equal to 8 bytes can get the data forwarded from a maximum of 2 stores. If there are 2 stores, then each store should forward to either first or second half of the load
- Loads of size less than or equal to 4 bytes can get their data forwarded from only 1 store
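The first restriction can be sketched as an address check for the single-store case. This is only an illustration: the function name is hypothetical, the "middle address" is taken to be the store address plus half the store size, and the requirement that the load lies entirely within the bytes written by the store is an added assumption rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if a load could, under the assumptions in the text above, take all of
 * its data from one older store.
 */
bool single_store_can_forward(uintptr_t st_addr, size_t st_size,
                              uintptr_t ld_addr, size_t ld_size)
{
    assert(st_size > 0 && ld_size > 0);

    /* Load must start at the start or the middle of the store. */
    bool starts_ok = (ld_addr == st_addr) ||
                     (ld_addr == st_addr + st_size / 2);

    /* Assumption: the load must not read bytes beyond the store. */
    bool contained = ld_addr + ld_size <= st_addr + st_size;

    return starts_ok && contained;
}
```

With a 16-byte store at `0x100`, 8-byte loads from `0x100` or `0x108` satisfy the check, whereas an 8-byte load from `0x104` does not.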

## 4.7. AES Encryption/Decryption

Cortex-A76 can issue one AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least two data chunks should be interleaved for maximum performance:

    AESE  data0, key0
    AESMC data0, data0
    AESE  data1, key0
    AESMC data1, data1
    AESE  data0, key0
    AESMC data0, data0
    AESE  data1, key1
    AESMC data1, data1
    ...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when they are adjacent in the program code and both instructions use the same destination register.

## 4.8. Region Based Fast Forwarding

The forwarding logic in the V pipelines is optimized to provide optimal latency for instructions which are expected to commonly forward to one another. The effective latency of FP and ASIMD instructions as described in section 3 is increased by one cycle if the producer and consumer instructions are not part of the same forwarding region. These optimized forwarding regions are defined in the following table.

#### Table 1: Optimized forwarding regions

| Region | Instruction Types                                                                                                      | Notes |
|--------|------------------------------------------------------------------------------------------------------------------------|-------|
| 1      | ASIMD ALU, ASIMD shift, ASIMD/ scalar insert and move, ASIMD abs/cmp/max/min and the ASIMD miscellaneous instructions. | 1     |
| 2      | FP multiply, FP multiply-accumulate, FP compare, FP add/sub and the ASIMD miscellaneous instructions.                  | 1,2,3 |
| 3      | Crypto SHA1/SHA256.                                                                                                    |       |

Notes:

- 1. Reciprocal step and estimate instructions are excluded from this region.
- 2. ASIMD extract narrow, saturating instructions are excluded from this region.
- 3. ASIMD miscellaneous instructions can only be consumers of this region.

The following instructions are not a part of any region:

- FP div/sqrt
- FP convert and rounding
- ASIMD integer mul/mac
- ASIMD reduction.

In addition to the regions mentioned in the table above, all floating point and ASIMD instructions can fast forward to FP and ASIMD stores.

Additional notes about the forwarding regions in Table 1:

- Fast forwarding will not occur in AArch32 mode if the consuming register's width is greater than that of the producer.
- Element sources used by FP multiply and multiply-accumulate operations cannot be consumers.
- Complex ASIMD shift by immediate/register and shift accumulate instructions cannot be producers (see section 3.14) in region 1.
- ASIMD extract narrow, saturating instructions cannot be producers (see section 3.16) in region 1.
- ASIMD absolute difference accumulate and pairwise add and accumulate instructions cannot be producers (see section 3.14) in region 1.
- For FP producer-consumer pairs, the precision of the instructions should match (single, double or half) in region 2.
- Pair-wise FP instructions cannot be producers or consumers in region 2.

It is not advisable to interleave instructions belonging to different regions. Also, certain instructions can only be producers or consumers in a particular region but not both (see note 3 for Table 1). For example, the code below interleaves producers and consumers from regions 1 and 2. This will result in an additional latency of 1 cycle as seen by FMUL.

    FSUB v27.2s, v28.2s, v20.2s     // Region 2
    FADD v20.2s, v28.2s, v20.2s     // Region 2
    MOV  v27.s[1], v20.s[1]         // Region 2 producer but not a region 2 consumer
    FMUL v26.2s, v27.2s, v6.2s      // Region 2

## 4.9. Branch instruction alignment

Branch instruction and branch target instruction alignment and density can affect performance. For best-case performance, consider the following guidelines.

- Avoid placing more than 4 branch instructions within an aligned 32-byte instruction memory region.
- When possible, a branch and its target should be located within the same 2M aligned memory region.

Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions.

For loops which comprise 32 or fewer instruction bytes, it is preferred that the loop be located entirely within a single aligned 32-byte instruction memory region.
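The placement guidelines above reduce to simple address arithmetic that a linker script check or code-layout tool could apply. The following is a minimal sketch with hypothetical function names; it assumes that "2M" means 2 MiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a loop of `bytes` instruction bytes starting at `start` lies
 * entirely within one aligned 32-byte region. */
bool loop_fits_in_fetch_region(uintptr_t start, size_t bytes)
{
    assert(bytes > 0);
    return bytes <= 32 && (start / 32) == ((start + bytes - 1) / 32);
}

/* True if a branch and its target share one aligned 2 MiB region
 * (assumption: "2M" in the guideline means 2 MiB). */
bool same_2m_region(uintptr_t branch, uintptr_t target)
{
    const uintptr_t region = (uintptr_t)2 << 20;
    return (branch / region) == (target / region);
}

/* Rounds an address up to the next 32B boundary, for example to place a
 * subroutine entry point or branch target. */
uintptr_t align_up_32(uintptr_t addr)
{
    return (addr + 31) & ~(uintptr_t)31;
}
```

A 28-byte loop starting 4 bytes into a 32-byte region still fits, while a full 32-byte loop only fits when it starts on a 32B boundary.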

## 4.10. FPCR self-synchronization

Programmers and compiler writers should note that writes to the FPCR register are self-synchronizing, i.e. its effect on subsequent instructions can be relied upon without an intervening context synchronizing operation.

## 4.11. Special Register Access

The Cortex-A76 processor performs register renaming for general purpose registers to enable speculative and out-of-order instruction execution. But most special-purpose registers are not renamed. Instructions that read or write non-renamed registers are subjected to one or more of the following additional execution constraints.

- Non-Speculative Execution: Instructions may only execute non-speculatively.
- In-Order Execution: Instructions must execute in-order with respect to other similar instructions or in some cases all instructions.
- Flush Side-Effects: Instructions trigger a flush side-effect after executing for synchronization.

The table below summarizes various special-purpose register read accesses and the associated execution constraints or side-effects.

| Register Read | Non-Speculative | In-Order | Flush Side-Effect | Notes |
|---------------|-----------------|----------|-------------------|-------|
| APSR          | Yes             | Yes      | No                | 3     |
| CurrentEL     | No              | Yes      | No                |       |
| DAIF          | No              | Yes      | No                |       |
| DLR_EL0       | No              | Yes      | No                |       |
| DSPSR_EL0     | No              | Yes      | No                |       |
| ELR_*         | No              | Yes      | No                |       |
| FPCR          | No              | Yes      | No                |       |
| FPSCR         | Yes             | Yes      | No                | 2     |
| FPSR          | Yes             | Yes      | No                | 2     |
| NZCV          | No              | No       | No                | 1     |
| SP_*          | No              | No       | No                | 1     |
| SPSel         | No              | Yes      | No                |       |
| SPSR_*        | No              | Yes      | No                |       |

- 1. The NZCV and SP registers are fully renamed.
- 2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to execute and retire.
- 3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.

The table below summarizes various special-purpose register write accesses and the associated execution constraints or side-effects.

| Register Write | Non-Speculative | In-Order | Flush Side-Effect | Notes |
|----------------|-----------------|----------|-------------------|-------|
| APSR           | Yes             | Yes      | No                | 4     |
| DAIF           | Yes             | Yes      | No                |       |
| DLR_EL0        | Yes             | Yes      | No                |       |
| DSPSR_EL0      | Yes             | Yes      | No                |       |
| ELR_*          | Yes             | Yes      | No                |       |
| FPCR           | Yes             | Yes      | Maybe             | 2     |
| FPSCR          | Yes             | Yes      | Maybe             | 2, 3  |
| FPSR           | Yes             | Yes      | No                | 3     |
| NZCV           | No              | No       | No                | 1     |
| SP_*           | No              | No       | No                | 1     |
| SPSel          | Yes             | Yes      | Yes               |       |
| SPSR_*         | Yes             | Yes      | No                |       |

Notes:

- 1. The NZCV and SP registers are fully renamed.
- 2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a barrier which prevents subsequent instructions from executing. If the FPCR/FPSCR write is predicted to not change the control field values, it will execute without a barrier but trigger a flush if the values change.
- 3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
- 4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent instructions from executing until the write completes.

## 4.12. Register Forwarding Hazards

The Armv8-A architecture allows FP/ASIMD instructions to read and write 32-bit S-registers. In AArch32, each S-register corresponds to one half (upper or lower) of an overlaid 64-bit D-register. A Q-register in turn consists of two overlaid D-registers. Register forwarding hazards may occur when one uop reads a Q-register operand that has recently been written with one or more S-register results. Consider the following scenario.

    VADD S0, S1, S2
    VADD Q6, Q5, Q0

The first instruction writes S0, which corresponds to the lowest part of Q0. The second instruction then requires Q0 as an input operand. In this scenario, there is a RAW dependency between the first and the second instructions. In most cases, Cortex-A76 performs slightly worse in such situations.

Cortex-A76 is able to avoid this register-hazard condition for certain cases. The following rules describe the conditions under which a register-hazard can occur.

- The producer writes an S-register (not a D[x] scalar)
- The consumer reads an overlapping Q-register (not as a D[x] scalar)
- The consumer is an FP/ASIMD uop (not a store or MOV uop)

To avoid unnecessary hazards, it is recommended that the programmer use D[x] scalar writes when populating registers prior to ASIMD operations. For example, either of the following instruction forms would safely prevent a subsequent hazard.

    VLD1.32 D0[x], [address]
    VADD Q1, Q0, Q2

## 4.13. IT Blocks

The Armv8-A architecture performance deprecates some uses of the IT instruction in such a way that software may be written using multiple naïve single instruction IT blocks. It is preferred that software instead generate multi instruction IT blocks rather than single instruction blocks.