# Arm<sup>®</sup> Ethos<sup>™</sup>-U65 NPU

Revision: r0p0

**Technical reference manual** 



Copyright © 2020, 2021 Arm Limited or its affiliates. All rights reserved.

#### **Release Information**

#### **Document History**

| Issue   | Date             | Confidentiality  | Change                              |
|---------|------------------|------------------|-------------------------------------|
| 0000-01 | 31 March 2020    | Confidential     | First development release for r0p0. |
| 0000-02 | 24 June 2020     | Confidential     | First beta release for r0p0.        |
| 0000-03 | 20 August 2020   | Confidential     | First EAC release for r0p0.         |
| 0000-04 | 19 November 2020 | Non-Confidential | Second EAC release for r0p0.        |
| 0000-05 | 12 May 2021      | Non-Confidential | Third EAC release for r0p0.         |

#### **Non-Confidential Proprietary Notice**

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED "AS IS". ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word "partner" in reference to Arm's customers is not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed written agreement covering this document with Arm, then the click through or signed written agreement prevails over and supersedes the conflicting provisions of these terms. This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.

The Arm corporate logo and words marked with ® or TM are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow Arm's trademark usage guidelines at <a href="https://www.arm.com/company/policies/trademarks">https://www.arm.com/company/policies/trademarks</a>.

Copyright © 2020, 2021 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

(LES-PRE-20349)

#### **Confidentiality Status**

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

#### **Product Status**

The information in this document is Final, that is for a developed product.

#### Web Address

developer.arm.com

#### Progressive terminology commitment

Arm values inclusive communities. Arm recognizes that we and our industry have used terms that can be offensive. Arm strives to lead the industry and create change.

This document includes terms that can be offensive. We will replace these terms in a future issue of this document.

If you find offensive terms in this document, please contact terms@arm.com.

# Contents

# **Arm® Ethos™-U65 NPU Technical reference manual**

- Preface
  - About this book
  - Feedback
- Chapter 1 Introduction
  - 1.1 Description of the neural processing unit
  - 1.2 Interfaces
  - 1.3 Documentation
  - 1.4 Design process
  - 1.5 Product revisions
- Chapter 2 Functional description
  - 2.1 Control and data flow
  - 2.2 Security and boot flow
  - 2.3 Functional blocks
- Chapter 3 Programmers model
  - 3.1 Register characteristics
  - 3.2 Register page BASE
  - 3.3 Register page BASE_POINTERS
  - 3.4 Register page ID
  - 3.5 Register page PMU
  - 3.6 Command stream
  - 3.7 Weight stream format
  - 3.8 Operators and performance
  - 3.9 Block based operation
- Appendix A Signal descriptions
  - A.1 Clock and reset signals
  - A.2 Interrupt signals
  - A.3 Power management signals
  - A.4 AMBA® 5 AXI master signals
  - A.5 AMBA® 4 APB slave signals
  - A.6 DFT and MBIST signals
- Appendix B General neural network concepts
  - B.1 General neural network concepts
- Appendix C Boot flow information
  - C.1 Boot flow information
- Appendix D Revisions
  - D.1 Revisions

# **Preface**

This preface introduces the *Arm® Ethos™-U65 NPU Technical reference manual*.

It contains the following:

- About this book.
- Feedback.

#### About this book

This manual is for the Arm<sup>®</sup> Ethos<sup>™</sup>-U65 neural processing unit.

#### **Product revision status**

The rxpy identifier indicates the revision status of the product described in this book, for example, r1p2, where:

- rx Identifies the major revision of the product, for example, r1.
- py Identifies the minor revision or modification status of the product, for example, p2.
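The rxpy scheme can be illustrated with a short sketch. The helper name `parse_revision` is hypothetical and is not part of any Arm tool. It only shows how an identifier such as r1p2 splits into a major revision and a minor revision.

```python
import re
from typing import Tuple

# Hypothetical helper, for illustration only: "r<major>p<minor>".
_RXPY = re.compile(r"r(\d+)p(\d+)")


def parse_revision(identifier: str) -> Tuple[int, int]:
    """Return (major, minor) for an rxpy identifier such as "r0p0"."""
    match = _RXPY.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an rxpy identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_revision("r1p2"))  # major revision 1, minor revision 2
```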

#### Intended audience

This manual is for system designers, system integrators, and verification engineers who are designing a *System-on-Chip* (SoC) device that uses an Arm® Ethos™-U65 NPU.

#### Using this book

This book is organized into the following chapters:

#### **Chapter 1 Introduction**

This chapter introduces the processor.

#### **Chapter 2 Functional description**

This chapter describes the function and structure of the processor.

#### Chapter 3 Programmers model

This chapter describes the registers and register map of the NPU.

#### Appendix A Signal descriptions

This appendix describes the signals for the processor.

#### Appendix B General neural network concepts

This appendix describes the various concepts Arm uses to describe the NPU.

#### Appendix C Boot flow information

This appendix describes the various boot flows for the NPU.

#### Appendix D Revisions

This appendix describes the technical changes between releases of this book.

#### Glossary

The Arm Glossary is a list of terms used in Arm documentation, together with definitions for those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm meaning differs from the generally accepted meaning.

See the Arm® Glossary for more information.

#### Typographic conventions

#### italic

Introduces special terminology, and denotes cross-references and citations.

#### bold

Highlights interface elements, such as menu names. Denotes signal names. Also used for terms in descriptive lists, where appropriate.

#### monospace

Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code.

#### monospace

Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name.

#### monospace italic

Denotes arguments to monospace text where the argument is to be replaced by a specific value.

#### monospace bold

Denotes language keywords when used outside example code.

#### < and >

Encloses replaceable terms for assembler syntax where they appear in code or code fragments. For example:

#### SMALL CAPITALS

Used in body text for a few terms that have specific technical meanings, that are defined in the *Arm*® *Glossary*. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION SPECIFIC, UNKNOWN, and UNPREDICTABLE.

#### **Signals**

The signal conventions are:

#### Signal level

The level of an asserted signal depends on whether the signal is active-HIGH or active-LOW. Asserted means:

- HIGH for active-HIGH signals.
- LOW for active-LOW signals.

#### Lowercase n

At the start or end of a signal name, n denotes an active-LOW signal.

#### Additional reading

This book contains information that is specific to this product. See the following documents for other relevant information.

#### **Arm publications**

Non-confidential documents:

- AMBA® AXI and ACE Protocol Specification AXI3, AXI4, AXI5, ACE and ACE5 (Arm IHI 0022).
- AMBA® Low Power Interface Specification Arm® Q-Channel and P-Channel Interfaces (IHI 0068).
- Arm<sup>®</sup> Ethos<sup>™</sup>-U65 NPU Technical overview (102024).
- Arm<sup>®</sup> Ethos<sup>™</sup>-U NPU Application development overview (101888).

Confidential documents that are only available to licensees:

- Arm<sup>®</sup> Ethos<sup>™</sup>-U65 NPU Configuration and integration manual (102025).
- Arm<sup>®</sup> Ethos<sup>™</sup>-U NPU Functional model integration guide (101889).

#### Developer resources:

- https://developer.arm.com/solutions/machine-learning-on-arm.

#### Other publications

None.

#### **Feedback**

#### Feedback on this product

If you have any comments or suggestions about this product, contact your supplier and give:

- The product name.
- The product revision or version.
- An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate.

#### Feedback on content

If you have comments on content then send an e-mail to errata@arm.com. Give:

- The title *Arm Ethos-U65 NPU Technical reference manual*.
- The number 102023 0000 05 en.
- If applicable, the page number(s) to which your comments refer.
- A concise explanation of your comments.

Arm also welcomes general suggestions for additions and improvements.

**Note**

Arm tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the represented document when used with any other PDF reader.

# Chapter 1 **Introduction**

This chapter introduces the processor.

It contains the following sections:

- 1.1 Description of the neural processing unit.
- 1.2 Interfaces.
- 1.3 Documentation.
- 1.4 Design process.
- 1.5 Product revisions.

# 1.1 Description of the neural processing unit

The *Neural Processing Unit* (NPU) improves the inference performance of neural networks. The NPU targets quantized *Convolutional Neural Networks* (CNN) and 8-bit and 16-bit integer *Recurrent Neural Networks* (RNN). The NPU supports 8-bit weights.

Arm delivers the hardware *Register Transfer Level* (RTL) of the NPU with an open-source driver and compiler. A neural network must be compiled offline using the open-source compiler to produce a command stream. The application invokes the driver, which communicates with the NPU to tell it where the command stream is and initiates the network traversal. The command stream describes the steps necessary for the NPU to execute the operators compiled into the command stream autonomously. When complete, the NPU raises an IRQ to the driver.

The driver programs the memory location of the command stream and other payloads into registers in the NPU. The *Central Control* (CC) processes the command stream.

The NPU includes a *Direct Memory Access* (DMA) controller that can read and write to external memory. When the NPU performs inferences, the DMA controller reads the neural network description. This description contains:

- The command stream
- Network weights
- Bias information
- Scale information

The DMA controller also transfers the *Input Feature Maps* (IFMs), the *Output Feature Maps* (OFMs), and NPU-private intermediate data, which is also held in system memory.

The external interfaces that the NPU implements are:

- Two Arm AMBA 5 AXI master interfaces that provide the DMA controller with access to external memory. These are two read/write masters, M0 and M1, so the NPU can present two sets of transactions at the same time. The command, weight, bias, and scale channels can be mapped to either AXI master.

  **Note**

  The master interfaces are also AMBA 4 AXI compatible.

- An Arm AMBA 4 APB slave interface with wake-up signaling that allows the application processor to program the NPU.

The following figure shows a typical system configuration block diagram for the NPU.



Figure 1-1 Typical system configuration block diagram

The following figure shows the main components of the NPU.



Figure 1-2 Functional blocks diagram

This section contains the following subsection:

- 1.1.1 Supported application programming interfaces.

#### 1.1.1 Supported application programming interfaces

To program, test, and monitor the NPU, Arm deploys the open-source *TensorFlow Lite for Microcontrollers* (TFLµ) tool, which runs on an external host application processor. It uses the compiler offline to compile and optimize the neural network graph for the NPU. Its API generates a command stream for the NPU to process.

The compiler decides which parts of a network graph can be optimized and executed on the NPU. The NPU drivers manage the workloads that execute inferences on the NPU.

If the network maps exclusively to the NPU, then the power required by the external host application processor is negligible. If there is a requirement to process layers on the Cortex®-M core, then more performance is required.

# 1.2 Interfaces

The NPU has several external interfaces.

The external interfaces are:

- Arm AMBA 4 APB slave with wake-up signaling.
- Two Arm AMBA 5 AXI masters:
  - A read/write master, M0.
  - A read/write master, M1.
- An interrupt.
- Two Q-channels:
  - A Q-Channel for clock.
  - A Q-Channel for power.
- System configuration signals that determine the security level after boot.
- Clock.
- Reset.

## 1.3 Documentation

Arm Limited publishes documentation that describes the NPU, including this document.

#### **Technical overview**

The Technical overview (TO) describes the functionality of the NPU.

#### Technical reference manual

The *Technical reference manual* (TRM) describes the functionality and the effects of functional options on the behavior of the processor. It is required at all stages of the design flow. Design flow choices can mean that some behavior that the TRM describes is not relevant. If you are programming the processor, obtain additional information from:

- The implementer to determine the build configuration of the implementation.
- The integrator to determine the pin configuration of the device that you are using.

#### **Application development overview**

The Application development overview (ADO) describes the flow of data between an application and the NPU.

#### Configuration and integration manual

The Configuration and integration manual (CIM) describes the configuration and implementation of the NPU.

#### Functional model integration guide

The Functional model integration guide (FMIG) describes how to integrate the NPU functional model.

The CIM and FMIG are confidential books only available to licensees.

# 1.4 Design process

The NPU is delivered as synthesizable RTL. Before it can be used in a product, it must go through the design process.

#### **Implementation**

The implementer configures and synthesizes the RTL to produce a hard macrocell.

#### Integration

The integrator connects the configured design into an SoC, including a memory system and peripherals.

#### **Programming**

The system programmer uses the following to develop the SoC:

- The software to configure and initialize the NPU.
- The application software and the SoC tests.

# 1.5 Product revisions

Successive product revisions have differences in functionality.

r0p0

First release.

# **Chapter 2 Functional description**

This chapter describes the function and structure of the processor.

It contains the following sections:

- 2.1 Control and data flow.
- 2.2 Security and boot flow.
- 2.3 Functional blocks.

## 2.1 Control and data flow

The software stack manages the control and data flows between the application software running on an external host application processor and individual subcomponents of the NPU.

The components of the software stack communicate with each other to handle the control and data flow between the neural network application and the NPU.

The following figure shows the software stack for the NPU.



Figure 2-1 The software stack of the NPU

The NPU uses offline tools to optimize the code. At runtime, the application processor passes this optimized trained model to the NPU.



The following steps describe the offline tooling flow:

1. Pass your trained model through the quantization tool. This tool quantizes weights to 8-bit and activations to 8-bit or 16-bit values.
2. Pass the quantized model to the compiler. This tool optimizes the model for this NPU and outputs an optimized model that contains a command stream for the NPU.

The following steps describe the runtime control and data flow:

1. The optimized model is placed in system memory, which is accessible by the NPU.
2. At runtime, the TFLµ tool reads the model and dispatches the operators.
3. The NPU reads the optimized model and runs the command stream that is included in it. The application processor runs any parts that the NPU cannot execute.
4. When the inference is complete, the result is placed in the memory location that the driver specifies.
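The split of work in the runtime steps above can be pictured with a toy dispatcher. This is a sketch only and does not use the TFLµ or driver APIs. The function name `dispatch`, the `"npu"` and `"cpu"` tags, and the operator names are all hypothetical, and the sketch assumes that the offline compiler has already decided where each operator runs.

```python
from typing import List, Tuple


def dispatch(model: List[Tuple[str, str]]) -> List[str]:
    """Walk the operators in order and record where each one runs."""
    log = []
    for name, target in model:
        if target == "npu":
            # Stand-in for handing a command stream to the NPU driver.
            log.append(f"NPU: {name}")
        elif target == "cpu":
            # Stand-in for running the operator on the application processor.
            log.append(f"CPU: {name}")
        else:
            raise ValueError(f"unknown target {target!r} for {name}")
    return log


# Hypothetical optimized model: two operators mapped to the NPU, one left on the host.
optimized_model = [("conv_block", "npu"), ("custom_op", "cpu"), ("fc_block", "npu")]
for line in dispatch(optimized_model):
    print(line)
```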

The following figure shows the control and data flow.



Figure 2-2 Control and data flow

This section contains the following subsection:

- 2.1.1 Supported memory formats for feature maps.

#### 2.1.1 Supported memory formats for feature maps

The NPU supports the industry-standard NHWC format of feature-map data.

NHWC is used as an input and output format by the NPU for communication with TensorFlow Lite.

When the NPU processes multiple layers, it reformats NHWC-formatted feature maps into an internal NHCWB16 format when reading in data. The NPU also performs the reverse transformation on the final output layer.

#### **NHWC format**

The NHWC format has the following properties:

- H (height), W (width), and C (channels) data.
- The size of each element (ElemSize) is 1 byte or 2 bytes.
- Only a single batch is supported (N=1).
- The address of an element y, x, c is (BASE+y\*STRIDE\_Y+x\*STRIDE\_X+c\*ElemSize).
- The values BASE, STRIDE\_Y, and STRIDE\_X must be aligned to the element size.
- Only tile 0 can be used. The address of tile 0 is BASE.
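The NHWC address expression can be checked with a small worked example. The function name `nhwc_address` is hypothetical. The example strides assume a densely packed tensor, where STRIDE_X is C\*ElemSize and STRIDE_Y is W\*STRIDE_X. That packing is an assumption made for this sketch, not a requirement stated above.

```python
def nhwc_address(base: int, y: int, x: int, c: int,
                 stride_y: int, stride_x: int, elem_size: int) -> int:
    """Byte address of element (y, x, c) in an NHWC feature map with N=1."""
    if elem_size not in (1, 2):
        raise ValueError("ElemSize must be 1 or 2 bytes")
    # BASE, STRIDE_Y, and STRIDE_X must be aligned to the element size.
    for name, value in (("BASE", base), ("STRIDE_Y", stride_y), ("STRIDE_X", stride_x)):
        if value % elem_size != 0:
            raise ValueError(f"{name} is not aligned to the element size")
    return base + y * stride_y + x * stride_x + c * elem_size


# Assumed example: 4x4x3 feature map of 1-byte elements, densely packed.
width, channels, elem_size = 4, 3, 1
stride_x = channels * elem_size   # 3 bytes between neighbouring x positions
stride_y = width * stride_x       # 12 bytes between neighbouring rows
addr = nhwc_address(0x1000, y=1, x=2, c=1,
                    stride_y=stride_y, stride_x=stride_x, elem_size=elem_size)
print(hex(addr))  # 0x1013
```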

#### **NHCWB16 format**

The NHCWB16 format has the following properties:

- A block format consisting of 16 channels per block.
- Only a single batch is supported (N=1).
- The address of an element y, x, c is (BASE+y\*STRIDE\_Y+(c/16)\*STRIDE\_C+(x\*16+(c%16))\*ElemSize).
- The values BASE, STRIDE\_Y, and STRIDE\_C must be 16-byte aligned.
- Tiles can be used.
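The NHCWB16 address expression can be sketched in the same way. The function name `nhcwb16_address` is hypothetical, and the sketch treats c/16 as integer division, because it selects the 16-channel block. The example strides assume that each block row is stored contiguously, which is an assumption made for this sketch.

```python
def nhcwb16_address(base: int, y: int, x: int, c: int,
                    stride_y: int, stride_c: int, elem_size: int) -> int:
    """Byte address of element (y, x, c) in an NHCWB16 feature map with N=1."""
    # BASE, STRIDE_Y, and STRIDE_C must be 16-byte aligned.
    for name, value in (("BASE", base), ("STRIDE_Y", stride_y), ("STRIDE_C", stride_c)):
        if value % 16 != 0:
            raise ValueError(f"{name} is not 16-byte aligned")
    block = c // 16   # which 16-channel block holds channel c
    lane = c % 16     # position of channel c inside that block
    return base + y * stride_y + block * stride_c + (x * 16 + lane) * elem_size


# Assumed example: width 4, 32 channels (two blocks), 1-byte elements.
width, blocks, elem_size = 4, 2, 1
stride_c = width * 16 * elem_size   # 64 bytes for one block across a row
stride_y = blocks * stride_c        # 128 bytes for one full row
addr = nhcwb16_address(0x2000, y=1, x=2, c=17,
                       stride_y=stride_y, stride_c=stride_c, elem_size=elem_size)
print(hex(addr))  # 0x20e1
```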

# 2.2 Security and boot flow

The NPU can be set to different security and privilege modes during a reset. The host application processor cannot reset the NPU to a higher security level than its current level.

At any reset, all registers and memories in the NPU are cleared to prevent leakage between states.

When a soft reset is requested, the NPU ensures that all AMBA 5 AXI transactions are complete before issuing the reset.

When the NPU is powered up after a hard reset, it reads the **PORPL** signal to set its privilege level:

- LOW indicates user mode.
- HIGH indicates privileged mode.

When the NPU is powered up after a hard reset, it reads the **PORSL** signal to set its security level:

- LOW indicates Secure mode.
- HIGH indicates Non-secure mode.
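The mapping from the two signals to the state after a hard reset can be written as a small lookup. The function name `boot_state` is hypothetical, and the sketch models each signal as a Boolean where True means HIGH.

```python
from typing import Tuple


def boot_state(porpl: bool, porsl: bool) -> Tuple[str, str]:
    """Return (privilege level, security level) sampled after a hard reset."""
    privilege = "privileged" if porpl else "user"    # PORPL: LOW = user, HIGH = privileged
    security = "Non-secure" if porsl else "Secure"   # PORSL: LOW = Secure, HIGH = Non-secure
    return privilege, security


print(boot_state(porpl=True, porsl=False))  # privileged mode, Secure mode
```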

When the NPU is accessed, it uses the **PPROT** signal to check whether the access is permitted. On the AXI ports, the NPU presents its security and privilege level on the **ARPROT/AWPROT** signals. These signals can be used for memory protection at system level.

**Note**

The NPU assumes that the software on the host that has permission to access it is trusted. You must ensure that the system provides suitable protection from memory tampering (for example, by protecting the flash).

#### 2.3 Functional blocks

The NPU consists of the *Central Control* (CC), a DMA controller, a MAC unit, an Output unit, and the interconnect fabric.

The following are descriptions of the units of the NPU:

- The CC receives tasks from the external host application processor. The CC queues and dispatches units of work to the DMA and engines.
- The DMA controller uses its two Arm AMBA 5 AXI master interfaces to move data between external memory and internal shared memory.
- The MAC unit has various internal units for reading IFMs and for performing dot products and accumulations.

The following figure shows the main components of the NPU.



Figure 2-3 Functional blocks diagram

This section contains the following subsections:

- 2.3.1 External interfaces.
- 2.3.2 Central control.
- 2.3.3 DMA controller.
- 2.3.4 Clock and power module.
- 2.3.5 Weight decoder.
- 2.3.6 MAC unit.
- 2.3.7 Output unit.

#### 2.3.1 External interfaces

The NPU uses two AMBA 5 AXI master interfaces, an AMBA 4 APB slave interface with wake-up signaling, an interrupt interface, two Q-Channel interfaces, clock, and reset to enable access to and from external components.

#### Two AMBA 5 AXI master interfaces

These interfaces enable read-access and write-access to external memory for the DMA controller.

The NPU has two read/write masters, M0 and M1.

#### AMBA 4 APB slave interface with wake-up signaling

Enables the device driver that runs on the external host application processor to access the control registers of the NPU.

#### Interrupt interface

Sends interrupt requests to the external host application processor, usually to signal a completed job.

#### Two Q-Channel interfaces

These interfaces enable communication with an external clock controller and power controller. This communication enables the system to automatically disable the clock of the NPU or disable the power to it. The clock is otherwise free-running. The NPU does not quiesce while executing a task, and usually does not quiesce if there are any tasks in a job queue.

The NPU software stack partly manages the activity the Q-Channel reports on.

You can configure the NPU to request downclocking or powering down when it is idle. This request can occur when the command queue is empty or when the NPU is waiting to be restarted after being stopped.

#### Clock and reset

The NPU has one clock and one reset signal.

Arm recommends that the AXI and NPU clock be the same; however, a different clock ratio can be supported using the **CLKEN** signals.

#### 2.3.2 Central control

The *Central Control* (CC) is the main control unit inside the NPU. The CC controls how the NPU processes neural networks, maintains synchronization, and handles data dependencies.

The CC receives tasks from the external host application processor. The CC queues and dispatches units of work to the DMA controller, weight decoder, MAC unit, and Output unit. The DMA controller and MAC unit send events to the CC to signal the completion of work.

The CC contains multiple sets of operation settings to increase efficiency. This enables the CC to set up the next piece of work while the current one is being processed.

After completing scheduling, dispatching, or processing work, the CC checks for any events that have been triggered. If there are no new events, the CC requests underclocking or powerdown, depending on the configuration.

The CC includes a Traversal unit. The CC instructs this unit to handle commands that require traversal. This unit breaks commands down into smaller commands, performs synchronization as they execute, and implements the different data flows the NPU requires.

The CC also includes a Command unit. This unit receives commands and parses them. Traversal tasks are passed to the Traversal unit. Data dependencies can be coded into the NPU command stream by the Offline Compiler, so that data dependencies between commands are not broken. Measuring the data dependency is an NPU internal process.

Other commands can:

- Trigger interrupts.
- Cause the NPU to wait for a data dependency to be cleared.
- Set up internal registers with information relating to the next execution step.

The CC implements an Arm AMBA 4 APB slave interface. This interface enables the application processor to control the NPU. This interface also enables performance measurements.

#### 2.3.3 DMA controller

The DMA controller manages all transactions that use the Arm AMBA 5 AXI interfaces.

The channels that the DMA controller uses are:

#### **Command channel**

The NPU uses this channel to read the command stream, normally from external flash. The NPU moves the commands into the CC. The application processor activates the command channel when it sets up the location and size of the Command queue. It sets up the Command queue by using the registers that are mapped to the AMBA 4 APB interface.

#### IFM channel

The NPU uses the IFM channel to read input feature maps and store them in its shared RAM. Because the shared buffer must store activations from different x,y coordinates in different words, the DMA controller unpacks data that is stored in NHWC format. This might require extra internal buffering, but only for the initial layer of a job. Internal layers can use a more efficient format.

The DMA controller considers the kernel stride, because this affects which bank or address the DMA controller requires to store activations.

When the DMA controller is in vector-product mode, it supports fetching multiple batches.

The IFM channel is triggered once per block for blocks that require input feature maps.

#### **OFM** channel

The NPU uses the OFM channel to write output feature maps from shared RAM to external RAM. Because the output is double-buffered in the shared RAM, the DMA requires an interface to synchronize with the output module to notify the DMA which buffer is empty or full.

For the last layer of a job, the output must be written out in NHWC format. This might require the DMA to pack the data, depending on the depth of the layer. Because packing reduces the bandwidth, it can be performed in a small register bank inside the DMA.

The traversal unit triggers the OFM channel once per output block for blocks that require transfer to external memory.

#### Weight channel

The weight channel transfers compressed weights from external memory to the weight decoder. The DMA controller uses a read buffer to hide bus latency from the weight decoder and to enable the DMA to handle data arriving out of order.

The traversal unit triggers the weight channel for blocks that require the transfer of weights.

The weight stream must be quantized to 8 bits or fewer by an offline tool. When passed through the offline compiler, weights are compressed losslessly and reordered into an NPU-specific weight stream. This process is most effective if the quantizer uses fewer than 8 bits or uses clustering and pruning techniques. The quantizer can also employ all three methods. Using lossless compression, an average of ~2 bits per weight is possible in the final weight stream, especially if the weight stream has many zeros.
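To see why a weight stream with many zeros compresses well, the following sketch computes the Shannon entropy of a set of weight values. The entropy is a lower bound on the average bits per weight for a lossless code that treats the weights as independent symbols. This is a generic information-theory estimate, not the NPU weight stream format, and the function name and sample values are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy_bits_per_weight(weights: Sequence[int]) -> float:
    """Shannon entropy, in bits per weight, of the empirical value distribution."""
    counts = Counter(weights)
    total = len(weights)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical pruned stream: 75% zeros, the rest split between +1 and -1.
sparse = [0] * 12 + [1] * 2 + [-1] * 2
print(round(entropy_bits_per_weight(sparse), 2))  # about 1.06 bits per weight

# A stream that uses all 256 8-bit values equally often needs the full 8 bits.
uniform = list(range(-128, 128))
print(entropy_bits_per_weight(uniform))
```

In this sketch, the sparse stream needs far fewer than 8 bits per weight on average, while the uniform stream cannot be compressed by a symbol-wise code.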

#### mem2mem channel

The NPU uses this channel to stream general data from memory to memory. The main purpose of this channel is to read weights from slow, non-volatile memory and store them in the SRAM. This might be performed in preparation for a layer which reads the weights multiple times. Having the weights in SRAM saves power and improves performance compared to reads to non-volatile memory.

The traversal unit triggers mem2mem operations on specific API commands.

#### Bias and scale channel

This channel streams data to the Output unit. The data that it transmits is the scale and bias necessary for the block that the NPU is processing. Layers that pass through the Output unit are written to the external SRAM. As the layers pass through the Output unit, activation functions can be fused.

**Note**

Only the mem2mem DMA channel is controllable directly by the command stream. The other channels are used to load or store data required by NPU operations. Write DMA channels must always use AXI port 0. Read DMA channels can use AXI port 0 or 1 according to which region is configured for the memory.

#### 2.3.4 Clock and power module

The *Clock and Power Module* (CPM) handles hard and soft resets. It also contains the registers for the current security settings, the main clock gate, and the QLPI interface.

#### Clock and power module controlling reset

The **nRESET** input signal triggers a hard reset. A write to the APB RESET register triggers a soft reset, as long as write access is permitted. The APB PPROT value and the CPL and CSL register values determine whether a write is permitted.

Register access to APB RESET is permitted if (PPROT[0]>=CPL && PPROT[1]<=CNS). Otherwise the register access is not permitted.

At any reset, all registers and memories in the NPU are cleared to prevent leakage between Security states. The CPM triggers all soft resets. Hard resets must come from an external reset controller.

Both hard and soft resets use a similar procedure, which is:

- 1. If the reset is a soft reset:
  - a. With the DMA controller clock on, signal to the DMA that a soft reset is initiated.
  - b. Wait for the DMA to acknowledge the reset request.
- 2. With the internal NPU clock off, activate the system reset within two clock cycles.
- 3. Deactivate the system reset.
- 4. With the shared buffer and DMA controller clock on, the CPM signals to the shared buffer and the DMA that the RAMs must be cleared.
- 5. Update the setting in the CPL, CSL register.

#### QLPI for clock

To enable high-level clock gating, the NPU exposes a Q-Channel slave port. This slave port enables the system to automatically disable the clock of the NPU, which is free-running except during reset.

If the entire NPU is in stopped state, it indicates when the clock can be turned off. You can configure the NPU registers using the AMBA 4 APB, so that it keeps requesting a clock in stopped state.

#### QLPI for power

For high-level power gating, the NPU exposes a Q-Channel slave port. This slave port permits the system to automatically disable the power of the NPU.

If the entire NPU is in stopped state, it indicates when power can be turned off. You can configure the NPU using the AMBA 4 APB, so that it keeps requesting power in stopped state.

#### Clock and power module clock gates

The CPM contains one main clock gate. Other clock gating is performed inside each of the blocks, which the CPM can override. These clock gates are explicitly instantiated, with the CPM clock gate preceding the block level clock gates.

#### 2.3.5 Weight decoder

The Weight Decoder (WD) reads the weight stream from the DMA controller. The decoder decompresses and stores this stream in a double-buffered register, ready for the MAC unit to consume it.

#### 2.3.6 MAC unit

The MAC unit performs multiply-accumulate operations that are required for convolution, depth-wise pooling, vector products, and the max operation required for max pooling.

The MAC unit comprises:

- An IFM unit
- Dot product units
- An adder array

#### IFM unit

The IFM unit inside the MAC unit reads the input feature maps from the shared SRAM and stores them in register slices. These slices are fed into the multipliers in the dot product units. The IFM unit also performs some extra services as part of other operations.

The IFM unit handles zero-padding around the outside edge of feature maps and the upscaling that deconvolution requires. Deconvolution upscaling uses nearest neighbor or zero insertion.

#### Dot product units

The MAC unit contains several dot product units. These dot product units perform the multiply-accumulate operations that are required for convolutions.

The dot product units contain a max operator that they use for max pooling.

#### Adder array

The adder array reads a set of accumulators from the shared RAM buffer and updates them with partial accumulations from the dot product units. The adder array then writes the result back.

Accuracy is maintained throughout this process. The internal accumulators retain precision so that the output is bit-exact to the software reference, in this case TensorFlow Lite (TFL).

The compiler selects the accumulator format in the shared buffer. This format can be:

- 32-bit two's complement
- 40-bit two's complement

You can also configure the compiler to use 16-bit floating-point format, which improves performance but impacts accuracy.

These formats are only used internally.
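To make the difference between the two integer accumulator formats concrete, the following sketch computes the representable range of an N-bit two's complement value. The helper name is illustrative only.

```python
def twos_complement_range(bits):
    """Return (min, max) of a signed two's complement integer of the given width."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Ranges of the two accumulator widths listed above.
ACC32_RANGE = twos_complement_range(32)
ACC40_RANGE = twos_complement_range(40)
```

The 40-bit format provides 8 extra bits of headroom, that is, a range 256 times larger than the 32-bit format.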

#### 2.3.7 Output unit

The Output unit reads finished accumulators from the shared RAM and converts them into output activations. This process includes performing scaling for each OFM, adding the bias to values, and applying the activation function to each point.

Every layer is written to external SRAM, but the activation function and scaling are normally fused. There is no forwarding path from output to input inside the NPU. However, layers can be split into horizontal stripes and run in "cascade" to minimize the SRAM footprint. This means that the external SRAM footprint can be smaller than the largest layer.

The activation functions that the Output unit supports are:

- ReLU, ReLU1, ReLU6, and Leaky ReLU
- tanh
- sigmoid
- Configurable Lookup Table (LUT)
- None or bypass

The elementwise operations that the Output unit supports are:

- Elementwise ADD and SUB
- Elementwise Multiplication (MUL)
- Elementwise Min and Max
- Elementwise ABS
- Elementwise Shift Left (SHL) and Elementwise Shift Right (SHR)
- Elementwise Count Leading Zeros (CLZ)

When the Output unit has computed output activations, it writes them back into the shared RAM. The output activations are buffered in the shared RAM where they wait for the DMA controller to send them to external memory.

#### Scaling unit

The Scaling unit in the Output unit performs scaling in convolutions and division in average pooling.

The number of scaling operations that are performed per clock depends on the configuration. The number of outputs per clock varies, depending on the operation.

#### ReLU and Leaky ReLU

Rectified Linear Unit (ReLU) operations are typically performed after scaling and bias addition.

The number of ReLU operations that are done in parallel is the same as the number of parallel operations that the Scaling unit performs.

Leaky ReLU (LReLU) is a ReLU variant that, unlike standard ReLU functions, applies a small positive gradient to negative values instead of setting them to zero. Leaky ReLU is supported as long as the input and output quantization scales are the same. The most recent TensorFlow Lite allows the quantization scales to differ. In that case, we recommend using the LUT for 8-bit activations and elementwise operators for 16-bit activations.
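For reference, the real-valued definitions of ReLU and Leaky ReLU can be sketched as follows. This is plain floating-point math for illustration, not the quantized implementation inside the NPU, and the default `alpha` value is an arbitrary assumption.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return x if x > 0.0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: negative inputs are scaled by a small positive slope.

    alpha is a hypothetical default; the actual slope comes from the network.
    """
    return x if x >= 0.0 else alpha * x
```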

#### tanh, sigmoid, and LUT

The Output unit supports tanh and sigmoid functions using a hardwired table combined with bilinear interpolation. The same table is used for both functions, because they are mathematically related.

The Output unit can perform one tanh or sigmoid function per cycle.

There is also a *Configurable Lookup Table* (LUT) that can be used for any point-wise activation or function. For 8-bit activations, the LUT holds up to 256 8-bit values that are directly mapped from IFM to OFM. The LUT size increases to 512 for 16-bit values; however, the outputs are interpolated, bilinear values.

The LUT can be configured by setting up a mem2mem transfer. For more information, refer to 2.3.3 DMA controller on page 2-23.
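The direct IFM-to-OFM mapping for 8-bit activations can be pictured with the following sketch, which builds a 256-entry table for an arbitrary point-wise function and applies it by indexing. The function names, the symmetric quantization with zero points of 0, and the table ordering (index 0 holds the entry for input -128) are all assumptions made for illustration. The layout of the table in NPU memory is not described in this section.

```python
def build_lut_int8(func, in_scale, out_scale):
    """Build a 256-entry table for signed 8-bit inputs.

    Assumes symmetric quantization (zero point 0) on both input and output.
    """
    lut = []
    for q in range(-128, 128):
        real_out = func(q * in_scale)          # dequantize and evaluate
        q_out = round(real_out / out_scale)    # requantize
        lut.append(max(-128, min(127, q_out))) # saturate to the int8 range
    return lut


def apply_lut_int8(lut, values):
    """Map each signed 8-bit input directly to its table entry."""
    return [lut[v + 128] for v in values]
```

Because every possible 8-bit input has its own entry, no interpolation is needed in this case, which matches the direct mapping described above.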

# Chapter 3 **Programmers model**

This chapter describes the registers and the register map of the NPU.

It contains the following sections:

- 3.1 Register characteristics.
- 3.2 Register page BASE.
- 3.3 Register page BASE POINTERS.
- 3.4 Register page ID.
- 3.5 Register page PMU.
- 3.6 Command stream.
- 3.7 Weight stream format.
- 3.8 Operators and performance.
- 3.9 Block based operation.

# 3.1 Register characteristics

The registers in the NPU have common characteristics.

The following are the characteristics of the registers in the NPU:

- Register addresses are shown as offsets from the base address.
- Registers are 32-bit wide words.
- Register reads and writes use word accesses only.
- Register halfword and byte reads are UNDEFINED.
- Register halfword and byte writes are UNPREDICTABLE.
- Every access to the registers is compared with the *Current active Privilege Level* (CPL) and the active *Current Non-Secure level* (CNS) of the PROT register:
  - Register access is permitted if (PPROT[0]>=CPL && PPROT[1]<=CNS). Otherwise the register access is not permitted.
  - A read access that is not permitted, either due to privilege or being a write-only register, returns the value zero.
  - A write-access that is not permitted, either due to privilege or being a read-only register, is ignored.
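The access rule above can be modeled in a few lines. The sketch below is a simplified behavioral model, not a description of the hardware: the class and function names are hypothetical, and the model ignores read-only and write-only register attributes.

```python
def access_permitted(pprot, cpl, cns):
    """Evaluate (PPROT[0] >= CPL && PPROT[1] <= CNS) from the rule above."""
    pprot0 = pprot & 1
    pprot1 = (pprot >> 1) & 1
    return pprot0 >= cpl and pprot1 <= cns


class RegisterModel:
    """Toy register file (hypothetical) that applies the permission rule."""

    def __init__(self, cpl=0, cns=0):
        self.cpl = cpl
        self.cns = cns
        self.regs = {}

    def read(self, offset, pprot):
        # A read that is not permitted returns zero.
        if not access_permitted(pprot, self.cpl, self.cns):
            return 0
        return self.regs.get(offset, 0)

    def write(self, offset, value, pprot):
        # A write that is not permitted is ignored.
        if access_permitted(pprot, self.cpl, self.cns):
            self.regs[offset] = value & 0xFFFFFFFF
```

For example, with CPL=1 and CNS=0, an access with PPROT[0]=0 is rejected, while an access with PPROT[0]=1 and PPROT[1]=0 is accepted.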

# 3.2 Register page BASE

The NPU control registers bank.

Table 3-1 BASE registers

| Address    | Link                                       | Usage                                                                                         | Access     | Default                                                                                     |
|------------|--------------------------------------------|-----------------------------------------------------------------------------------------------|------------|---------------------------------------------------------------------------------------------|
| 0x00000000 | 3.2.1 Register ID on page 3-31             | ID register                                                                                   | Read-only  | 0x10066001                                                                                  |
| 0x00000004 | 3.2.2 Register STATUS on page 3-32         | Register describing the current operating status of the NPU                                   | Read-only  | 0x00000008                                                                                  |
| 0x00000008 | 3.2.3 Register CMD on page 3-34            | Command register, reads as last written command                                               | Read/write | 0x0000000C                                                                                  |
| 0x0000000C | 3.2.4 Register RESET on page 3-35          | Request Reset and new security mode                                                           | Read/write | 0x00000000                                                                                  |
| 0x00000010 | 3.2.5 Register QBASE0 on page 3-36         | Base address of Command queue bits[31:0]. The address is 4-byte-aligned                       | Read/write | 0x00000000                                                                                  |
| 0x00000014 | 3.2.6 Register QBASE1 on page 3-37         | Address extension bits[39:32] for the queue base                                              | Read/write | 0x00000000                                                                                  |
| 0x00000018 | 3.2.7 Register QREAD on page 3-37          | Read offset in the command stream in bytes. Multiples of 4 in the range 0-16 MB               | Read-only  | 0x00000000                                                                                  |
| 0x0000001C | 3.2.8 Register QCONFIG on page 3-37        | AXI configuration for the command stream in the range 0-3. Same encoding as for REGIONCFG     | Read/write | 0x00000000                                                                                  |
| 0x00000020 | 3.2.9 Register QSIZE on page 3-37          | Size of the command stream in bytes. Multiples of 4 in the range 0-16 MB                      | Read/write | 0x00000000                                                                                  |
| 0x00000024 | 3.2.10 Register PROT on page 3-38          | Protection level configured for the NPU when acting as an AXI master                          | Read-only  | 0x00000000                                                                                  |
| 0x00000028 | 3.2.11 Register CONFIG on page 3-39        | RTL configuration                                                                             | Read-only  | 0x10003008 for the 256 configuration, and 0x10006009 for the 512 configuration.             |
| 0x0000002C | 3.2.12 Register LOCK on page 3-40          | Lock register. This register is designed for driver use and does not affect NPU functionality | Read/write | 0x00000000                                                                                  |
| 0x0000003C | 3.2.13 Register REGIONCFG on page 3-40     | Base pointer configuration. Bits[2*k +1:2*k] give the memory type for REGION[k]               | Read/write | 0x00000000                                                                                  |
| 0x00000040 | 3.2.14 Register AXI_LIMIT0 on page 3-43    | AXI limits for port 0 counter 0                                                               | Read/write | 0x00000000                                                                                  |
| 0x00000044 | 3.2.15 Register AXI_LIMIT1 on page 3-44    | AXI limits for port 0 counter 1                                                               | Read/write | 0x00000000                                                                                  |
| 0x00000048 | 3.2.16 Register AXI_LIMIT2 on page 3-45    | AXI limits for port 1 counter 2                                                               | Read/write | 0x00000000                                                                                  |
| 0x0000004C | 3.2.17 Register AXI_LIMIT3 on page 3-46    | AXI limits for port 1 counter 3                                                               | Read/write | 0x00000000                                                                                  |
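The REGIONCFG row in the table above packs one 2-bit memory type per region into Bits[2*k+1:2*k]. The following sketch shows how such a field might be read and updated. The helper names are hypothetical, and the bound of 16 regions is simply the number of 2-bit fields that fit in a 32-bit word, not the number of regions that the NPU implements.

```python
MAX_FIELDS = 16  # assumption: 32 bits / 2 bits per region


def region_memory_type(regioncfg, k):
    """Return the 2-bit memory type for REGION[k]."""
    if not 0 <= k < MAX_FIELDS:
        raise ValueError("region index out of range")
    return (regioncfg >> (2 * k)) & 0x3


def set_region_memory_type(regioncfg, k, mem_type):
    """Return a new REGIONCFG value with REGION[k] set to mem_type."""
    if not 0 <= k < MAX_FIELDS:
        raise ValueError("region index out of range")
    if not 0 <= mem_type <= 3:
        raise ValueError("memory type must fit in 2 bits")
    cleared = regioncfg & ~(0x3 << (2 * k)) & 0xFFFFFFFF
    return cleared | (mem_type << (2 * k))
```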

This section contains the following subsections:

- 3.2.1 Register ID.
- 3.2.2 Register STATUS.
- 3.2.3 Register CMD.
- 3.2.4 Register RESET.
- 3.2.5 Register QBASE0.
- 3.2.6 Register QBASE1.
- 3.2.7 Register QREAD.
- 3.2.8 Register QCONFIG.
- 3.2.9 Register QSIZE.
- 3.2.10 Register PROT.
- 3.2.11 Register CONFIG.
- 3.2.12 Register LOCK.
- 3.2.13 Register REGIONCFG.
- 3.2.14 Register AXI\_LIMIT0.
- 3.2.15 Register AXI\_LIMIT1.
- 3.2.16 Register AXI\_LIMIT2.
- 3.2.17 Register AXI\_LIMIT3.

#### 3.2.1 Register ID

ID register.

The default value of this read-only register describes the product version. Refer to the individual fields for more information.

Table 3-2 Register BASE.ID layout

| Bits    | Link                        | Name           | Usage                                                                            | Default                    |
|---------|-----------------------------|----------------|----------------------------------------------------------------------------------|----------------------------|
| [31:28] | arch_major_rev on page 3-31 | arch_major_rev | This is the major architecture version number, a in the architecture version a.b | 1                          |
| [27:20] | arch_minor_rev on page 3-31 | arch_minor_rev | This is the minor architecture version number, b in the architecture version a.b | 0                          |
| [19:16] | arch_patch_rev on page 3-32 | arch_patch_rev | This is the patch number of the architecture version a.b                         | 6 (implementation defined) |
| [15:12] | product_major on page 3-32  | product_major  | This is the X-part of the ML00X product number                                   | 6 (implementation defined) |
| [11:8]  | version_major on page 3-32  | version_major  | This is the *n* for the R-part of an R*n*P*n* release number                     | 0x0                        |
| [7:4]   | version_minor on page 3-32  | version_minor  | This is the *n* for the P-part of an R*n*P*n* release number                     | 0x0                        |
| [3:0]   | version_status on page 3-32 | version_status | This is the version of the product                                               | 1 (implementation defined) |

#### Field arch\_major\_rev

This is the major architecture version number, a in the architecture version a.b.

arch\_major\_rev is stored in bits[31:28] and is a 4-bit unsigned integer. Its default value is 1 (implementation defined).

#### Field arch\_minor\_rev

This is the minor architecture version number, b in the architecture version a.b.

arch\_minor\_rev is stored in bits[27:20] and is an 8-bit unsigned integer. Its default value is 0 (implementation defined).

#### Field arch\_patch\_rev

This is the patch number of the architecture version a.b.

arch\_patch\_rev is stored in bits[19:16] and is a 4-bit unsigned integer. Its default value is 6 (implementation defined).

#### Field product\_major

This is the X-part of the ML00X product number.

product\_major is stored in bits[15:12] and is a 4-bit unsigned integer. Its default value is 6 (implementation defined).

#### Field version\_major

This is the n for the R-part of an RnPn release number.

version\_major is stored in bits[11:8] and is a 4-bit unsigned integer. Its default value is 0x0.

#### Field version\_minor

This is the n for the P-part of an RnPn release number.

version\_minor is stored in bits[7:4] and is a 4-bit unsigned integer. Its default value is 0x0.

#### Field version\_status

This is the version of the product.

version\_status is stored in bits[3:0] and is a 4-bit unsigned integer. Its default value is 1 (implementation defined).
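The field layout in Table 3-2 can be decoded with a small helper. The sketch below is illustrative: the dictionary and function names are hypothetical, while the bit positions and the default value 0x10066001 come from the tables above.

```python
# (msb, lsb) positions taken from Table 3-2.
ID_FIELDS = {
    "arch_major_rev": (31, 28),
    "arch_minor_rev": (27, 20),
    "arch_patch_rev": (19, 16),
    "product_major": (15, 12),
    "version_major": (11, 8),
    "version_minor": (7, 4),
    "version_status": (3, 0),
}


def decode_id(value):
    """Split a 32-bit BASE.ID value into its named fields."""
    fields = {}
    for name, (msb, lsb) in ID_FIELDS.items():
        width = msb - lsb + 1
        fields[name] = (value >> lsb) & ((1 << width) - 1)
    return fields
```

Decoding the documented default 0x10066001 yields architecture version 1.0 patch 6, product\_major 6, release r0p0, and version\_status 1, which agrees with the per-field defaults listed above.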

#### 3.2.2 Register STATUS

Register describing the current operating status of the NPU.

Table 3-3 Register BASE.STATUS layout

| Bits    | Link                            | Name               | Usage                                                                                                            | Default |
|---------|---------------------------------|--------------------|------------------------------------------------------------------------------------------------------------------|---------|
| [31:16] | irq_history_mask on page 3-33   | irq_history_mask   | IRQ History mask                                                                                                 | 0x0     |
| [15:12] | faulting_channel on page 3-33   | faulting_channel   | Faulting channel on a bus abort. Read: 0=Cmd, 1=IFM, 2=Weights, 3=Scale+Bias, 4=Mem2Mem; Write: 8=OFM, 9=Mem2Mem | 0x0     |
| [11]    | faulting_interface on page 3-33 | faulting_interface | Faulting interface on bus abort. 0=AXI-M0, 1=AXI-M1                                                              | 0x0     |
| [10:9]  | Reserved                        | -                  | -                                                                                                                | -       |
| [8]     | ecc_fault on page 3-33          | ecc_fault          | ECC state for internal RAMs: 0=no fault, 1=ECC fault signalled. Can only be cleared by reset.                    | 0x0     |
| [7]     | wd_fault on page 3-33           | wd_fault           | This bit will never be set in this product.                                                                      | -       |
| [6]     | pmu_irq_raised on page 3-33     | pmu_irq_raised     | 0=No PMU IRQ, 1=PMU IRQ raised. Cleared by using command register bit 1                                          | 0x0     |
| [5]     | cmd_end_reached on page 3-34    | cmd_end_reached    | 0=Not reached, 1=Reached. Cleared by writing QBASE or QSIZE when NPU is in stopped state                         | 0x0     |
| [4]  | cmd_parse_error on page 3-34 | cmd_parse_error | 0=No error, 1=Command-stream parsing error detected. Can only be cleared by a reset                                                                                                 | 0x0                                                   |
| [3]  | reset_status on page 3-34    | reset_status    | Reset is ongoing and only this register can be read (other registers read as 0 and writes are ignored). A value of 0 means the NPU is not being reset and can be accessed as normal | 0x1 if the reset operation is ongoing, otherwise 0x0. |
| [2]  | bus_status on page 3-34      | bus_status      | 0=OK, 1=Bus abort detected and processing halted (the NPU has reached IDLE state and does not start to process any more commands/AXI transactions). Can only be cleared by a reset  | 0x0                                                   |
| [1]  | irq_raised on page 3-34      | irq_raised      | Raw IRQ status, 0 = IRQ not raised, 1 = IRQ raised. IRQ is cleared using command register bit 1                                                                                     | 0x0                                                   |
| [0]  | state on page 3-34           | state           | NPU state, 0 = Stopped, 1 = Running                                                                                                                                                 | stopped                                               |

#### Field irq\_history\_mask

IRQ History mask.

irq\_history\_mask is stored in bits[31:16] and is a 16-bit unsigned integer. Its default value is 0x0.

This is used for debug purposes. Each IRQ or Event operation provides a 16-bit mask, which is logically ORed into these bits. The bits can be cleared using the command register.

#### Field faulting\_channel

Faulting channel on a bus abort. Read: 0=Cmd, 1=IFM, 2=Weights, 3=Scale+Bias, 4=Mem2Mem; Write: 8=OFM, 9=Mem2Mem.

faulting\_channel is stored in bits[15:12] and is a 4-bit unsigned integer. Its default value is 0x0.

#### Field faulting\_interface

Faulting interface on bus abort. 0=AXI-M0, 1=AXI-M1.

faulting\_interface is stored in bit[11] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field ecc\_fault

ECC state for internal RAMs: 0=no fault, 1=ECC fault signalled. Can only be cleared by reset.

ecc\_fault is stored in bit[8] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field wd\_fault

This bit will never be set in this product.

#### Field pmu\_irq\_raised

0=No PMU IRQ, 1=PMU IRQ raised. Cleared by using command register bit 1.

pmu\_irq\_raised is stored in bit[6] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field cmd\_end\_reached

0=Not reached, 1=Reached. Cleared by writing QBASE or QSIZE when the NPU is in stopped state.

cmd\_end\_reached is stored in bit[5] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field cmd\_parse\_error

0=No error, 1=Command stream parsing error detected. Can only be cleared by a reset.

cmd\_parse\_error is stored in bit[4] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field reset\_status

Reset is ongoing and only this register can be read (other registers read as 0 and writes are ignored). A value of 0 means the NPU is not being reset and can be accessed as normal.

reset\_status is stored in bit[3] and is a 1-bit unsigned integer. Its default value is 0x1 if the reset operation is ongoing, otherwise its default value is 0x0.

#### Field bus\_status

0=OK, 1=Bus abort detected and processing halted (the NPU has reached IDLE state and does not start to process any more commands/AXI transactions). Can only be cleared by a reset.

bus\_status is stored in bit[2] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field irq\_raised

Raw IRQ status, 0 = IRQ not raised, 1 = IRQ raised. IRQ is cleared using command register bit 1.

irq\_raised is stored in bit[1] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field state

NPU state, 0 = Stopped, 1 = Running.

state is stored in bit[0] and is a 1-bit enumeration. Its default value is stopped.

The field can contain the following values:

Table 3-4 Field state values

| Value       | Name    | Meaning                      |
|-------------|---------|------------------------------|
| 0 (default) | stopped | The NPU is in stopped state. |
| 1           | running | The NPU is in Running state. |
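A driver typically needs to split the STATUS word into its fields. The following sketch does this using the bit positions from Table 3-3. The names `STATUS_FIELDS`, `FAULTING_CHANNEL_NAMES`, and `decode_status` are hypothetical, and the channel names follow the faulting\_channel description above.

```python
# (msb, lsb) positions taken from Table 3-3; bits[10:9] are reserved.
STATUS_FIELDS = {
    "irq_history_mask": (31, 16),
    "faulting_channel": (15, 12),
    "faulting_interface": (11, 11),
    "ecc_fault": (8, 8),
    "wd_fault": (7, 7),
    "pmu_irq_raised": (6, 6),
    "cmd_end_reached": (5, 5),
    "cmd_parse_error": (4, 4),
    "reset_status": (3, 3),
    "bus_status": (2, 2),
    "irq_raised": (1, 1),
    "state": (0, 0),
}

# Encodings of faulting_channel as listed in the field description.
FAULTING_CHANNEL_NAMES = {
    0: "Cmd (read)",
    1: "IFM (read)",
    2: "Weights (read)",
    3: "Scale+Bias (read)",
    4: "Mem2Mem (read)",
    8: "OFM (write)",
    9: "Mem2Mem (write)",
}


def decode_status(value):
    """Split a 32-bit BASE.STATUS value into its named fields."""
    fields = {}
    for name, (msb, lsb) in STATUS_FIELDS.items():
        width = msb - lsb + 1
        fields[name] = (value >> lsb) & ((1 << width) - 1)
    return fields
```

For the register default 0x00000008 listed in Table 3-1, only reset\_status is set, which corresponds to a reset that is still in progress.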

#### 3.2.3 Register CMD

The Command register. It reads as the last written command.

Table 3-5 Register BASE.CMD layout

| Bits    | Link                           | Name              | Usage                                                           | Default |
|---------|--------------------------------|-------------------|-----------------------------------------------------------------|---------|
| [31:16] | clear_irq_history on page 3-35 | clear_irq_history | Clears the IRQ history mask                                     | 0x0     |
| [15:4]  | Reserved                       | -                 | -                                                               | -       |
| [3]     | power_q_enable on page 3-35    | power_q_enable    | Write 1 to this bit to enable power off using Power Q-interface | 0x1     |
| [2]  | clock_q_enable on page 3-35              | clock_q_enable              | Write 1 to this bit to enable clock off using Clock Q-interface and enable the main clock gate | 0x1     |
| [1]  | clear_irq on page 3-35                   | clear_irq                   | Write 1 to clear the IRQ status in the STATUS register. Writing 0 has no effect                | 0x0     |
| [0]  | transition_to_running_state on page 3-35 | transition_to_running_state | Write 1 to transition the NPU to running state. Writing 0 has no effect                        | 0x0     |

#### Field clear\_irq\_history

Clears the IRQ history mask.

clear\_irq\_history is stored in bits[31:16] and is a 16-bit unsigned integer. Its default value is 0x0.

When bit k is set, the corresponding bit k of the IRQ history mask in the STATUS register is cleared.

#### Field power\_q\_enable

Write 1 to this bit to enable power off using Power Q-interface.

power\_q\_enable is stored in bit[3] and is a 1-bit unsigned integer. Its default value is 0x1.

#### Field clock\_q\_enable

Write 1 to this bit to enable clock off using Clock Q-interface and enable the main clock gate.

clock\_q\_enable is stored in bit[2] and is a 1-bit unsigned integer. Its default value is 0x1.

#### Field clear\_irq

Write 1 to clear the IRQ status in the STATUS register. Writing 0 has no effect.

clear\_irq is stored in bit[1] and is a 1-bit unsigned integer. Its default value is 0x0.

#### Field transition\_to\_running\_state

Write 1 to transition the NPU to running state. Writing 0 has no effect.

transition\_to\_running\_state is stored in bit[0] and is a 1-bit unsigned integer. Its default value is 0x0.
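The sketch below composes a CMD value from the fields in Table 3-5. The constant and function names are hypothetical. The sketch assumes that each write replaces the two Q-interface enable bits, which is why they default to their reset value of 1. Check this assumption against the behavior of your integration.

```python
CMD_TRANSITION_TO_RUNNING = 1 << 0
CMD_CLEAR_IRQ = 1 << 1
CMD_CLOCK_Q_ENABLE = 1 << 2
CMD_POWER_Q_ENABLE = 1 << 3
CMD_CLEAR_IRQ_HISTORY_SHIFT = 16


def make_cmd(run=False, clear_irq=False, clock_q_enable=True,
             power_q_enable=True, clear_irq_history=0):
    """Build a 32-bit BASE.CMD value from the individual fields."""
    if not 0 <= clear_irq_history <= 0xFFFF:
        raise ValueError("clear_irq_history must fit in 16 bits")
    value = clear_irq_history << CMD_CLEAR_IRQ_HISTORY_SHIFT
    if power_q_enable:
        value |= CMD_POWER_Q_ENABLE
    if clock_q_enable:
        value |= CMD_CLOCK_Q_ENABLE
    if clear_irq:
        value |= CMD_CLEAR_IRQ
    if run:
        value |= CMD_TRANSITION_TO_RUNNING
    return value
```

With no arguments the helper returns 0x0000000C, which matches the register default in Table 3-1. `make_cmd(run=True)` adds bit 0 to request the transition to running state.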

#### 3.2.4 Register RESET

Request Reset and new security mode.

If this register is written to by a permitted master, the NPU is reset, which clears all internal RAMs, and the RESET register value is updated. Otherwise, the write to this register is ignored and the NPU is not reset.

The value written to this register sets the privilege level used by the NPU when the NPU acts as an AXI master. The host is permitted to set any level of privilege less than or equal to the host privilege level.

Table 3-6 Register BASE.RESET layout

| Bits   | Link                     | Name        | Usage                                         | Default |
|--------|--------------------------|-------------|-----------------------------------------------|---------|
| [31:2] | Reserved                 | -           | -                                             | -       |
| [1]    | pending_CSL on page 3-36 | pending_CSL | Current security level 0=Secure, 1=Non-secure | secure  |
| [0]    | pending_CPL on page 3-36 | pending_CPL | Current privilege level 0=User, 1=Privileged  | user    |

#### Field pending\_CSL

Current security level 0=Secure, 1=Non-secure.

pending\_CSL is stored in bit[1] and is a 1-bit enumeration. Its default value is secure.

The field can contain the following values:

Table 3-7 Field pending\_CSL values

| Value       | Name       | Meaning                                               |
|-------------|------------|-------------------------------------------------------|
| 0 (default) | secure     | The NPU's security level is configured as Secure.     |
| 1           | non_secure | The NPU's security level is configured as Non-Secure. |

#### Field pending\_CPL

Current privilege level 0=User, 1=Privileged.

pending\_CPL is stored in bit[0] and is a 1-bit enumeration. Its default value is user.

The field can contain the following values:

Table 3-8 Field pending\_CPL values

| Value       | Name       | Meaning                                            |
|-------------|------------|----------------------------------------------------|
| 0 (default) | user       | The NPU is configured for User-level access.       |
| 1           | privileged | The NPU is configured for Privileged-level access. |
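A host-side soft reset could be sketched as follows: write the requested levels to RESET, then poll STATUS.reset\_status until it reads 0. This is a minimal sketch under stated assumptions. `read32` and `write32` are hypothetical callbacks for 32-bit register access relative to the NPU base address, the timeout values are arbitrary, and polling is inferred from the reset\_status description rather than prescribed by this manual.

```python
import time

STATUS_OFFSET = 0x04          # from Table 3-1
RESET_OFFSET = 0x0C           # from Table 3-1
STATUS_RESET_STATUS = 1 << 3  # reset_status, bit[3] of STATUS


def soft_reset(read32, write32, privileged, non_secure,
               timeout_s=1.0, poll_interval_s=0.001):
    """Request a soft reset and wait until STATUS.reset_status reads 0.

    read32(offset) and write32(offset, value) are caller-supplied accessors.
    Returns the value that was written to RESET.
    """
    # pending_CSL is bit[1] (1=Non-secure), pending_CPL is bit[0] (1=Privileged).
    value = (int(bool(non_secure)) << 1) | int(bool(privileged))
    write32(RESET_OFFSET, value)
    deadline = time.monotonic() + timeout_s
    while read32(STATUS_OFFSET) & STATUS_RESET_STATUS:
        if time.monotonic() > deadline:
            raise TimeoutError("NPU reset did not complete")
        time.sleep(poll_interval_s)
    return value
```

If the write is not permitted, the NPU ignores it and no reset occurs, so a real driver might additionally read back RESET to confirm that the new levels were accepted.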

#### 3.2.5 Register QBASE0

Bits[31:0] of the base address of the command queue. The address is 4-byte-aligned.

Table 3-9 Register BASE.QBASE0 layout

| Bits   | Link                | Name   | Usage                                                                    | Default    |
|--------|---------------------|--------|--------------------------------------------------------------------------|------------|
| [31:0] | QBASE0 on page 3-36 | QBASE0 | The 4-byte-aligned lower bytes of the base address value for the command stream | 0x00000000 |

#### Field QBASE0

The 4-byte-aligned lower bytes of the base address value for the command stream.

QBASE0 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.

### 3.2.6 Register QBASE1

Address extension bits[39:32] for the queue base.

Table 3-10 Register BASE.QBASE1 layout

| Bits   | Link                | Name   | Usage                                                                           | Default    |
|--------|---------------------|--------|---------------------------------------------------------------------------------|------------|
| [31:0] | QBASE1 on page 3-37 | QBASE1 | The 4-byte-aligned upper bytes of the base address value for the command stream | 0x00000000 |

#### Field QBASE1

The 4-byte-aligned upper bytes of the base address value for the command stream.

QBASE1 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.
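The split of the command-queue base address across QBASE0 and QBASE1 can be sketched as follows. This is a minimal illustration, assuming a 40-bit address as implied by the bits[39:32] extension; the function names `split_queue_base` and `join_queue_base` are hypothetical and are not part of any Arm driver API.

```python
ADDR_BITS = 40  # assumption: QBASE1 carries address bits[39:32]


def split_queue_base(addr):
    """Return (qbase0, qbase1) for a 4-byte-aligned, 40-bit base address."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 40 bits")
    if addr % 4 != 0:
        raise ValueError("address must be 4-byte aligned")
    qbase0 = addr & 0xFFFFFFFF  # address bits[31:0]
    qbase1 = addr >> 32         # address bits[39:32]
    return qbase0, qbase1


def join_queue_base(qbase0, qbase1):
    """Rebuild the base address from the two register values."""
    return ((qbase1 & 0xFF) << 32) | (qbase0 & 0xFFFFFFFF)
```

For example, under these assumptions the address 0x1234567890 would be written as 0x34567890 to QBASE0 and 0x12 to QBASE1.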

### 3.2.7 Register QREAD

The Read offset in the command stream in bytes. Multiples of 4 in the range 0-16 MB.

Table 3-11 Register BASE.QREAD layout

| Bits   | Link               | Name  | Usage                                                  | Default    |
|--------|--------------------|-------|--------------------------------------------------------|------------|
| [31:0] | QREAD on page 3-37 | QREAD | The read offset of the current command under execution | 0x00000000 |

#### Field QREAD

The read offset of the current command under execution.

QREAD is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.

### 3.2.8 Register QCONFIG

The AXI configuration for the command stream in the range 0-3. Same encoding as for REGIONCFG.

Table 3-12 Register BASE.QCONFIG layout

| Bits   | Link                 | Name    | Usage                                                     | Default    |
|--------|----------------------|---------|-----------------------------------------------------------|------------|
| [31:0] | QCONFIG on page 3-37 | QCONFIG | AXI configuration for the command stream in the range 0-3 | 0x00000000 |

#### Field QCONFIG

The AXI configuration for the command stream in the range 0-3.

QCONFIG is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.

### 3.2.9 Register QSIZE

Size of the command stream in bytes. Multiples of 4 in the range 0-16 MB.

Table 3-13 Register BASE.QSIZE layout

| Bits   | Link               | Name  | Usage                                                     | Default    |
|--------|--------------------|-------|-----------------------------------------------------------|------------|
| [31:0] | QSIZE on page 3-38 | QSIZE | Size of the next command stream to be executed by the NPU | 0x00000000 |

#### Field QSIZE

Size of the next command stream to be executed by the NPU.

QSIZE is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.
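Both QREAD and QSIZE are described as multiples of 4 in the range 0-16 MB. A driver might check a value before writing it, roughly as sketched below. The helper name `is_valid_queue_value` is hypothetical, and treating 16 MB as 16 × 1024 × 1024 bytes with an inclusive upper bound is an assumption, because the text does not state either point.

```python
# Assumption: "16 MB" means 16 * 1024 * 1024 bytes and the bound is inclusive.
MAX_QUEUE_BYTES = 16 * 1024 * 1024


def is_valid_queue_value(value):
    """Check a QREAD offset or QSIZE size: in range and a multiple of 4."""
    return 0 <= value <= MAX_QUEUE_BYTES and value % 4 == 0
```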

### 3.2.10 Register PROT

The protection level configured for the NPU when acting as an AXI master.

Table 3-14 Register BASE.PROT layout

| Bits   | Link                    | Name       | Usage                                         | Default            |
|--------|-------------------------|------------|-----------------------------------------------|--------------------|
| [31:2] | Reserved                | -          |                                               |                    |
| [1]    | active_CSL on page 3-38 | active_CSL | Current security level 0=Secure, 1=Non-secure | Dependent on PORSL |
| [0]    | active_CPL on page 3-38 | active_CPL | Current privilege level 0=User, 1=Privileged  | Dependent on PORPL |

#### Field active\_CSL

Current security level 0=Secure, 1=Non-secure.

active\_CSL is stored in bit[1] and is a 1-bit enumeration. Its default value is dependent on PORSL.

This is used as AxPROT[1] when the NPU is a master, and is set from pending\_CSL after the reset is complete.

- After a hard reset, this is set to the Power-on-reset security level (PORSL), which allows for CPUs that do not support TrustZone.
- After a soft reset, this is set to pending\_CSL if PPROT[1]==0; otherwise it is set to 1. For this to be effective, there must be a memory-protection controller included in the system.

The field can contain the following values:

Table 3-15 Field active\_CSL values

| Value       | Name       | Meaning                                               |
|-------------|------------|-------------------------------------------------------|
| 0 (default) | secure     | The NPU's security level is configured as Secure.     |
| 1           | non_secure | The NPU's security level is configured as Non-Secure. |

#### Field active\_CPL

Current privilege level 0=User, 1=Privileged.

active\_CPL is stored in bit[0] and is a 1-bit enumeration. Its default value is dependent on PORPL.

This is used as AxPROT[0] when the NPU is a master.

- After a hard reset, this is set to the Power-on-reset privilege level (PORPL).
- After a soft reset, this is set to pending\_CPL if PPROT[0]==1; otherwise it is set to 0. For this to be effective, the system must include system-level memory protection.

The field can contain the following values:

Table 3-16 Field active\_CPL values

| Value       | Name       | Meaning                                            |
|-------------|------------|----------------------------------------------------|
| 0 (default) | user       | The NPU is configured for User-level access.       |
| 1           | privileged | The NPU is configured for Privileged-level access. |
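The soft-reset rules for active\_CSL and active\_CPL can be sketched as a small model. This is only an illustration of the two rules stated above, not a description of the hardware implementation. The function names are hypothetical, and `pprot` is assumed to be an integer whose bits[1:0] mirror PPROT[1:0].

```python
def decode_prot(prot):
    """Split a PROT register value into its two 1-bit fields."""
    return {"active_CSL": (prot >> 1) & 1, "active_CPL": prot & 1}


def prot_after_soft_reset(pending_csl, pending_cpl, pprot):
    """Model the PROT value after a soft reset, following the stated rules."""
    pprot1 = (pprot >> 1) & 1
    pprot0 = pprot & 1
    # active_CSL takes pending_CSL only if PPROT[1]==0; otherwise it is 1.
    active_csl = pending_csl if pprot1 == 0 else 1
    # active_CPL takes pending_CPL only if PPROT[0]==1; otherwise it is 0.
    active_cpl = pending_cpl if pprot0 == 1 else 0
    return (active_csl << 1) | active_cpl
```

In this model, a request for Secure and Privileged levels (pending\_CSL=0, pending\_CPL=1) is honored only when PPROT[1]==0 and PPROT[0]==1.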

### 3.2.11 Register CONFIG

The RTL configuration register.

The default value of this RO register describes the NPU configuration. Please refer to the individual fields for information.

Table 3-17 Register BASE.CONFIG layout

| Bits    | Link                            | Name               | Usage                                                                                                                  | Default                    |
|---------|---------------------------------|--------------------|------------------------------------------------------------------------------------------------------------------------|----------------------------|
| [31:28] | product on page 3-39            | product            | Product configuration                                                                                                  | 1 (implementation defined) |
| [27:16] | Reserved                        | -                  | -                                                                                                                      | -                          |
| [15:8]  | shram_size on page 3-39         | shram_size         | The size of SHRAM is 48KB for the 256 configuration, and 96KB for the 512 configuration.                               | -                          |
| [7:4]   | cmd_stream_version on page 3-39 | cmd_stream_version | Command-stream version accepted by this NPU                                                                            | 0x0                        |
| [3:0]   | macs_per_cc on page 3-39        | macs_per_cc        | The log2 (macs/clock cycle). The valid encoding range is 8 for the 256 configuration, and 9 for the 512 configuration. | -                          |

#### Field product

Product configuration.

product is stored in bits[31:28] and is a 4-bit unsigned integer. Its default value is 1 (implementation defined).

#### Field shram\_size

Size in KB of SHRAM in the range 48-96.

shram\_size is stored in bits[15:8] and is an 8-bit enumeration.

The field can contain the following values:

Table 3-18 Field shram\_size values

| Value | Name       | Meaning                     |
|-------|------------|-----------------------------|
| 0x30  | SHRAM_48kB | The available SHRAM is 48KB |
| 0x60  | SHRAM_96kB | The available SHRAM is 96KB |

#### Field cmd\_stream\_version

The command-stream version accepted by this NPU.

cmd\_stream\_version is stored in bits[7:4] and is a 4-bit unsigned integer. Its default value is 0x0.

#### Field macs\_per\_cc

The log2(macs/clock cycle). Valid encoding range is 8 and 9 for 256 and 512 MACs/clock cycle, respectively (each MAC is an 8-bit x 8-bit MAC).

macs\_per\_cc is stored in bits[3:0] and is a 4-bit enumeration.

The field can contain the following values:

Table 3-19 Field macs\_per\_cc values

| Value | Name             | Meaning                                       |
|-------|------------------|-----------------------------------------------|
| 0x8   | Macs_per_cc_is_8 | The number of MACs per clock cycle is $2^8$ . |
| 0x9   | Macs_per_cc_is_9 | The number of MACs per clock cycle is $2^9$ . |
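Decoding the CONFIG register into its fields can be sketched as follows, using the bit positions from Table 3-17. The function name `decode_config` and the dictionary keys are hypothetical; the derived `macs_per_clock` entry simply applies the log2 encoding of macs\_per\_cc.

```python
def decode_config(config):
    """Split a 32-bit CONFIG value into the fields of Table 3-17."""
    macs_per_cc = config & 0xF  # bits[3:0], log2(MACs per clock cycle)
    return {
        "product": (config >> 28) & 0xF,            # bits[31:28]
        "shram_size_kb": (config >> 8) & 0xFF,      # bits[15:8], size in KB
        "cmd_stream_version": (config >> 4) & 0xF,  # bits[7:4]
        "macs_per_cc": macs_per_cc,
        "macs_per_clock": 1 << macs_per_cc,         # 2**macs_per_cc
    }
```

For instance, a value with product=1, shram\_size=0x60, and macs\_per\_cc=0x9 decodes to 96KB of SHRAM and 512 MACs per clock cycle.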

### 3.2.12 Register LOCK

The Lock register. This register is designed for driver use and does not affect NPU functionality.

The register holds a 32-bit value which is cleared to 0 on a reset. The register has special write semantics. Suppose the current register value is "c" and the newly written register value is "w":

If (c==0 or w==0), then the register is updated to the newly written value "w".

Otherwise the write is ignored and the value remains unchanged.

- To try to claim the lock, write a nonzero ID value and read it back to see if the value was accepted.
- To release the lock (that contains your nonzero ID value), write the value 0 to the lock register.

Table 3-20 Register BASE.LOCK layout

| Bits   | Link              | Name | Usage                                   | Default     |
|--------|-------------------|------|-----------------------------------------|-------------|
| [31:0] | LOCK on page 3-40 | LOCK | 32-bit value for the LOCK configuration | 0x00000000  |

#### Field LOCK

32-bit value for the LOCK configuration.

LOCK is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.
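The write semantics of the LOCK register, and the claim and release sequence built on them, can be modeled in software as sketched below. The class `LockRegister` and the helpers `try_claim` and `release` are hypothetical names used only for illustration; a real driver would perform the same write and read-back against the memory-mapped register.

```python
class LockRegister:
    """Software model of the LOCK register write rule."""

    def __init__(self):
        self.value = 0  # cleared to 0 on reset

    def write(self, w):
        w &= 0xFFFFFFFF
        # The write takes effect only if the current value or the new value is 0.
        if self.value == 0 or w == 0:
            self.value = w

    def read(self):
        return self.value


def try_claim(lock, owner_id):
    """Write a nonzero ID and read it back to see whether it was accepted."""
    if owner_id == 0:
        raise ValueError("owner ID must be nonzero")
    lock.write(owner_id)
    return lock.read() == owner_id


def release(lock):
    """Release the lock by writing 0."""
    lock.write(0)
```

In this model, a second owner's claim is ignored while the first owner's ID is held, and succeeds after the first owner writes 0.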

### 3.2.13 Register REGIONCFG

Region memory type configuration. Bits[2\*k+1:2\*k] give the memory type for REGION[k].

Table 3-21 Register BASE.REGIONCFG layout

| Bits    | Link                 | Name    | Usage                              | Default                   |
|---------|----------------------|---------|------------------------------------|---------------------------|
| [31:16] | Reserved             | -       | -                                  | -                         |
| [15:14] | region7 on page 3-41 | region7 | Bits for the Region7 configuration | axi0_outstanding_counter0 |
| [13:12] | region6 on page 3-41 | region6 | Bits for the Region6 configuration | axi0_outstanding_counter0 |
| [11:10] | region5 on page 3-41 | region5 | Bits for the Region5 configuration | axi0_outstanding_counter0 |
| [9:8]   | region4 on page 3-42 | region4 | Bits for the Region4 configuration | axi0_outstanding_counter0 |
| [7:6]   | region3 on page 3-42 | region3 | Bits for the Region3 configuration | axi0_outstanding_counter0 |
| [5:4]   | region2 on page 3-42 | region2 | Bits for the Region2 configuration | axi0_outstanding_counter0 |
| [3:2]   | region1 on page 3-42 | region1 | Bits for the Region1 configuration | axi0_outstanding_counter0 |
| [1:0]   | region0 on page 3-43 | region0 | Bits for the Region0 configuration | axi0_outstanding_counter0 |

#### Field region7

Bits for the Region7 configuration.

region7 is stored in bits[15:14] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-22 Field region7 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region6

Bits for the Region6 configuration.

region6 is stored in bits[13:12] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-23 Field region6 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region5

Bits for the Region5 configuration.

region5 is stored in bits[11:10] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-24 Field region5 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region4

Bits for the Region4 configuration.

region4 is stored in bits[9:8] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-25 Field region4 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region3

Bits for the Region3 configuration.

region3 is stored in bits[7:6] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-26 Field region3 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region2

Bits for the Region2 configuration.

region2 is stored in bits[5:4] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-27 Field region2 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region1

Bits for the Region1 configuration.

region1 is stored in bits[3:2] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-28 Field region1 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |

#### Field region0

Bits for the Region0 configuration.

region0 is stored in bits[1:0] and is a 2-bit enumeration. Its default value is axi0\_outstanding\_counter0.

The field can contain the following values:

Table 3-29 Field region0 values

| Value       | Name                      | Meaning                                                                      |
|-------------|---------------------------|------------------------------------------------------------------------------|
| 0 (default) | axi0_outstanding_counter0 | AXI0 port, outstanding counter 0. AXI limits set by the AXI_LIMIT0 register. |
| 1           | axi0_outstanding_counter1 | AXI0 port, outstanding counter 1. AXI limits set by the AXI_LIMIT1 register. |
| 2           | axi1_outstanding_counter2 | AXI1 port, outstanding counter 2. AXI limits set by the AXI_LIMIT2 register. |
| 3           | axi1_outstanding_counter3 | AXI1 port, outstanding counter 3. AXI limits set by the AXI_LIMIT3 register. |
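Because the field for REGION[k] occupies bits[2\*k+1:2\*k] of REGIONCFG, reading or updating one region's setting reduces to a shift and a 2-bit mask. The following is a minimal sketch; the function names and the parameter name `counter_sel` are hypothetical.

```python
def get_region_cfg(regioncfg, k):
    """Return the 2-bit setting for REGION[k], taken from bits[2k+1:2k]."""
    if not 0 <= k <= 7:
        raise ValueError("region index must be in 0..7")
    return (regioncfg >> (2 * k)) & 0x3


def set_region_cfg(regioncfg, k, counter_sel):
    """Return a new REGIONCFG value with REGION[k] set to counter_sel (0..3)."""
    if not 0 <= k <= 7:
        raise ValueError("region index must be in 0..7")
    if not 0 <= counter_sel <= 3:
        raise ValueError("setting must be in 0..3")
    shift = 2 * k
    # Clear the old 2-bit field, keep only bits[15:0], then insert the new value.
    cleared = regioncfg & ~(0x3 << shift) & 0xFFFF
    return cleared | (counter_sel << shift)
```

For example, selecting axi1\_outstanding\_counter3 (value 3) for region 7 on a zeroed register yields 0xC000.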

### 3.2.14 Register AXI\_LIMIT0

The AXI limits for port 0 counter 0.

Table 3-30 Register BASE.AXI\_LIMIT0 layout

| Bits    | Link                                  | Name                     | Usage                                                                   | Default |
|---------|---------------------------------------|--------------------------|-------------------------------------------------------------------------|---------|
| [31:24] | max_outstanding_write_m1 on page 3-43 | max_outstanding_write_m1 | Maximum number of outstanding AXI write transactions - 1 in range 0-31  | 0x00    |
| [23:16] | max_outstanding_read_m1 on page 3-44  | max_outstanding_read_m1  | Maximum number of outstanding AXI read transactions - 1 in range 0-63   | 0x00    |
| [15:8]  | Reserved                              | -                        | -                                                                       | -       |
| [7:4]   | memtype on page 3-44                  | memtype                  | Memtype                                                                 | -       |
| [3:2]   | Reserved                              | -                        | -                                                                       | -       |
| [1:0]   | max_beats on page 3-44                | max_beats                | Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved | 0x0     |

#### Field max\_outstanding\_write\_m1

Maximum number of outstanding AXI write transactions - 1 in range 0-31.

max\_outstanding\_write\_m1 is stored in bits[31:24] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field max\_outstanding\_read\_m1

Maximum number of outstanding AXI read transactions - 1 in range 0-63.

max\_outstanding\_read\_m1 is stored in bits[23:16] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field memtype

Memtype is used to encode AxCACHE signals.

BASE.AXI\_LIMIT0.memtype is stored in bits[7:4] and is a 4-bit enumeration of type axi\_mem\_encoding\_type. Its default value is Device\_Non\_Bufferable.

The field can contain the following values:

Table 3-31 Field memtype values

| Value         | Name                                  | Meaning                    |
|---------------|---------------------------------------|----------------------------|
| 0x0 (default) | Device_Non_Bufferable                 | ARCACHE=0000, AWCACHE=0000 |
| 0x1           | Device_Bufferable                     | ARCACHE=0001, AWCACHE=0001 |
| 0x2           | Normal_Non_cacheable_Non_bufferable   | ARCACHE=0010, AWCACHE=0010 |
| 0x3           | Normal_Non_cacheable_Bufferable       | ARCACHE=0011, AWCACHE=0011 |
| 0x4           | Write_through_No_allocate             | ARCACHE=1010, AWCACHE=0110 |
| 0x5           | Write_through_Read_allocate           | ARCACHE=1110, AWCACHE=0110 |
| 0x6           | Write_through_Write_allocate          | ARCACHE=1010, AWCACHE=1110 |
| 0x7           | Write_through_Read_and_Write_allocate | ARCACHE=1110, AWCACHE=1110 |
| 0x8           | Write_back_No_allocate                | ARCACHE=1011, AWCACHE=0111 |
| 0x9           | Write_back_Read_allocate              | ARCACHE=1111, AWCACHE=0111 |
| 0xA           | Write_back_Write_allocate             | ARCACHE=1011, AWCACHE=1111 |
| 0xB           | Write_back_Read_and_Write_allocate    | ARCACHE=1111, AWCACHE=1111 |
| 0xC - 0xF     | Reserved_12_15                        | Reserved                   |

#### Field max\_beats

Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved.

max\_beats is stored in bits[1:0] and is a 2-bit unsigned integer. Its default value is 0x0.
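Packing an AXI\_LIMIT0 value from its fields can be sketched as follows. The two outstanding-transaction fields hold the limit minus 1, so the sketch takes the actual limits (1-32 writes, 1-64 reads) and subtracts 1 before shifting. The function names are hypothetical, and rejecting the reserved encodings (memtype 0xC-0xF, max\_beats 3) is a choice made for this example.

```python
def encode_axi_limit(max_writes, max_reads, memtype, max_beats):
    """Build an AXI_LIMIT register value from its field values."""
    if not 1 <= max_writes <= 32:
        raise ValueError("max_writes must be in 1..32")
    if not 1 <= max_reads <= 64:
        raise ValueError("max_reads must be in 1..64")
    if not 0x0 <= memtype <= 0xB:
        raise ValueError("memtype 0xC-0xF is reserved")
    if not 0 <= max_beats <= 2:
        raise ValueError("max_beats 3 is reserved")
    return (
        ((max_writes - 1) << 24)   # max_outstanding_write_m1, bits[31:24]
        | ((max_reads - 1) << 16)  # max_outstanding_read_m1, bits[23:16]
        | (memtype << 4)           # memtype, bits[7:4]
        | max_beats                # max_beats, bits[1:0]
    )


def burst_split_bytes(max_beats):
    """Burst-split alignment in bytes: 0 -> 64, 1 -> 128, 2 -> 256."""
    if not 0 <= max_beats <= 2:
        raise ValueError("max_beats 3 is reserved")
    return 64 << max_beats
```

With 32 outstanding writes, 64 outstanding reads, memtype 0x3 (Normal\_Non\_cacheable\_Bufferable), and 128-byte burst splitting, the sketch produces 0x1F3F0031.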

### 3.2.15 Register AXI\_LIMIT1

The AXI limits for port 0 counter 1.

Table 3-32 Register BASE.AXI\_LIMIT1 layout

| Bits    | Link                                  | Name                     | Usage                                                                   | Default |
|---------|---------------------------------------|--------------------------|-------------------------------------------------------------------------|---------|
| [31:24] | max_outstanding_write_m1 on page 3-45 | max_outstanding_write_m1 | Maximum number of outstanding AXI write transactions - 1 in range 0-31  | 0x00    |
| [23:16] | max_outstanding_read_m1 on page 3-45  | max_outstanding_read_m1  | Maximum number of outstanding AXI read transactions - 1 in range 0-63   | 0x00    |
| [15:8]  | Reserved                              | -                        | -                                                                       | -       |
| [7:4]   | memtype on page 3-45                  | memtype                  | Memtype                                                                 | -       |
| [3:2]   | Reserved                              | -                        | -                                                                       | -       |
| [1:0]   | max_beats on page 3-45                | max_beats                | Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved | 0x0     |

#### Field max\_outstanding\_write\_m1

Maximum number of outstanding AXI write transactions - 1 in range 0-31.

max\_outstanding\_write\_m1 is stored in bits[31:24] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field max\_outstanding\_read\_m1

Maximum number of outstanding AXI read transactions - 1 in range 0-63.

max\_outstanding\_read\_m1 is stored in bits[23:16] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field memtype

Memtype.

memtype is stored in bits[7:4] and is a 4-bit unsigned integer.

#### Field max\_beats

Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved.

max\_beats is stored in bits[1:0] and is a 2-bit unsigned integer. Its default value is 0x0.

### 3.2.16 Register AXI\_LIMIT2

The AXI limits for port 1 counter 2.

Table 3-33 Register BASE.AXI\_LIMIT2 layout

| Bits    | Link                                  | Name                     | Usage                                                                   | Default |
|---------|---------------------------------------|--------------------------|-------------------------------------------------------------------------|---------|
| [31:24] | max_outstanding_write_m1 on page 3-46 | max_outstanding_write_m1 | Maximum number of outstanding AXI write transactions - 1 in range 0-31  | 0x00    |
| [23:16] | max_outstanding_read_m1 on page 3-46  | max_outstanding_read_m1  | Maximum number of outstanding AXI read transactions - 1 in range 0-63   | 0x00    |
| [15:8]  | Reserved                              | -                        | -                                                                       | -       |
| [7:4]   | memtype on page 3-46                  | memtype                  | Memtype                                                                 | -       |
| [3:2]   | Reserved                              | -                        | -                                                                       | -       |
| [1:0]   | max_beats on page 3-46                | max_beats                | Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved | 0x0     |

#### Field max\_outstanding\_write\_m1

Maximum number of outstanding AXI write transactions - 1 in range 0-31.

max\_outstanding\_write\_m1 is stored in bits[31:24] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field max\_outstanding\_read\_m1

Maximum number of outstanding AXI read transactions - 1 in range 0-63.

max\_outstanding\_read\_m1 is stored in bits[23:16] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field memtype

Memtype.

memtype is stored in bits[7:4] and is a 4-bit unsigned integer.

#### Field max\_beats

Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved.

max\_beats is stored in bits[1:0] and is a 2-bit unsigned integer. Its default value is 0x0.

### 3.2.17 Register AXI\_LIMIT3

The AXI limits for port 1 counter 3.

Table 3-34 Register BASE.AXI\_LIMIT3 layout

| Bits    | Link                                  | Name                     | Usage                                                                   | Default |
|---------|---------------------------------------|--------------------------|-------------------------------------------------------------------------|---------|
| [31:24] | max_outstanding_write_m1 on page 3-46 | max_outstanding_write_m1 | Maximum number of outstanding AXI write transactions - 1 in range 0-31  | 0x00    |
| [23:16] | max_outstanding_read_m1 on page 3-46  | max_outstanding_read_m1  | Maximum number of outstanding AXI read transactions - 1 in range 0-63   | 0x00    |
| [15:8]  | Reserved                              | -                        | -                                                                       | -       |
| [7:4]   | memtype on page 3-47                  | memtype                  | Memtype                                                                 | -       |
| [3:2]   | Reserved                              | -                        | -                                                                       | -       |
| [1:0]   | max_beats on page 3-47                | max_beats                | Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved | 0x0     |

#### Field max\_outstanding\_write\_m1

Maximum number of outstanding AXI write transactions - 1 in range 0-31.

max\_outstanding\_write\_m1 is stored in bits[31:24] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field max\_outstanding\_read\_m1

Maximum number of outstanding AXI read transactions - 1 in range 0-63.

max\_outstanding\_read\_m1 is stored in bits[23:16] and is an 8-bit unsigned integer. Its default value is 0x00.

#### Field memtype

Memtype.

memtype is stored in bits[7:4] and is a 4-bit unsigned integer.

#### Field max\_beats

Burst-split alignment: 0=64 bytes, 1=128 bytes, 2=256 bytes, 3=reserved.

max\_beats is stored in bits[1:0] and is a 2-bit unsigned integer. Its default value is 0x0.

# 3.3 Register page BASE\_POINTERS

The NPU base-pointer register bank.

Table 3-35 BASE\_POINTERS registers

| Address    | Link                                 | Usage                                                                  | Access     | Default    |
|------------|--------------------------------------|------------------------------------------------------------------------|------------|------------|
| 0x00000080 | 3.3.1 Register BASEP0 on page 3-49   | Lower 32 bits of the Base pointer for region index 0                   | Read/write | 0x00000000 |
| 0x00000084 | 3.3.2 Register BASEP1 on page 3-49   | Upper 32 bits of the Base pointer for region index 0                   | Read/write | 0x00000000 |
| 0x00000088 | 3.3.3 Register BASEP2 on page 3-49   | Lower 32 bits of the Base pointer for region index 1                   | Read/write | 0x00000000 |
| 0x0000008C | 3.3.4 Register BASEP3 on page 3-49   | Upper 32 bits of the Base pointer for region index 1                   | Read/write | 0x00000000 |
| 0x00000090 | 3.3.5 Register BASEP4 on page 3-50   | Lower 32 bits of the Base pointer for region index 2                   | Read/write | 0x00000000 |
| 0x00000094 | 3.3.6 Register BASEP5 on page 3-50   | Upper 32 bits of the Base pointer for region index 2                   | Read/write | 0x00000000 |
| 0x00000098 | 3.3.7 Register BASEP6 on page 3-50   | Lower 32 bits of the Base pointer for region index 3                   | Read/write | 0x00000000 |
| 0x0000009C | 3.3.8 Register BASEP7 on page 3-51   | Upper 32 bits of the Base pointer for region index 3                   | Read/write | 0x00000000 |
| 0x000000A0 | 3.3.9 Register BASEP8 on page 3-51   | Lower 32 bits of the Base pointer for region index 4                   | Read/write | 0x00000000 |
| 0x000000A4 | 3.3.10 Register BASEP9 on page 3-51  | Upper 32 bits of the Base pointer for region index 4                   | Read/write | 0x00000000 |
| 0x000000A8 | 3.3.11 Register BASEP10 on page 3-51 | Lower 32 bits of the Base pointer for region index 5                   | Read/write | 0x00000000 |
| 0x000000AC | 3.3.12 Register BASEP11 on page 3-52 | Upper 32 bits of the Base pointer for region index 5                   | Read/write | 0x00000000 |
| 0x000000B0 | 3.3.13 Register BASEP12 on page 3-52 | Lower 32 bits of the Base pointer for region index 6                   | Read/write | 0x00000000 |
| 0x000000B4 | 3.3.14 Register BASEP13 on page 3-52 | Upper 32 bits of the Base pointer for region index 6                   | Read/write | 0x00000000 |
| 0x000000B8 | 3.3.15 Register BASEP14 on page 3-52 | Lower 32 bits of the Base pointer for region index 7                   | Read/write | 0x00000000 |
| 0x000000BC | 3.3.16 Register BASEP15 on page 3-53 | Upper 32 bits of the Base pointer for region index 7                   | Read/write | 0x00000000 |

This section contains the following subsections:

- 3.3.1 Register BASEP0 on page 3-49.
- 3.3.2 Register BASEP1 on page 3-49.
- 3.3.3 Register BASEP2 on page 3-49.
- 3.3.4 Register BASEP3 on page 3-49.
- 3.3.5 Register BASEP4 on page 3-50.
- 3.3.6 Register BASEP5 on page 3-50.
- 3.3.7 Register BASEP6 on page 3-50.
- 3.3.8 Register BASEP7 on page 3-51.
- 3.3.9 Register BASEP8 on page 3-51.
- 3.3.10 Register BASEP9 on page 3-51.
- 3.3.11 Register BASEP10 on page 3-51.
- 3.3.12 Register BASEP11 on page 3-52.
- 3.3.13 Register BASEP12 on page 3-52.
- 3.3.14 Register BASEP13 on page 3-52.
- 3.3.15 Register BASEP14 on page 3-52.
- 3.3.16 Register BASEP15 on page 3-53.
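Each region index n uses two consecutive registers: BASEP(2n) holds the low word and BASEP(2n+1) holds the upper bits, so the pair for region n starts at address 0x80 + 8n. The following is a minimal sketch of that mapping, not Arm driver code. It assumes that `regs` is a word pointer to the start of the NPU register page (how that page is mapped is platform-specific and not shown here), and the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPU_BASEP0_OFFSET 0x0080u /* address of BASEP0 in the register table */
#define NPU_NUM_REGIONS   8u      /* region index 0 to 7 */

/* Hypothetical helper: program the base pointer of one region.
 * regs is assumed to point at the start of the NPU register page. */
static void npu_set_region_base(volatile uint32_t *regs, unsigned region,
                                uint64_t addr)
{
    assert(regs != NULL);
    assert(region < NPU_NUM_REGIONS);

    size_t idx = NPU_BASEP0_OFFSET / 4u + 2u * region;
    regs[idx]      = (uint32_t)(addr & 0xFFFFFFFFu); /* BASEP(2n): low word */
    regs[idx + 1u] = (uint32_t)(addr >> 32);         /* BASEP(2n+1): upper bits */
}

/* Hypothetical helper: read the base pointer of one region back. */
static uint64_t npu_get_region_base(const volatile uint32_t *regs,
                                    unsigned region)
{
    assert(regs != NULL);
    assert(region < NPU_NUM_REGIONS);

    size_t idx = NPU_BASEP0_OFFSET / 4u + 2u * region;
    return ((uint64_t)regs[idx + 1u] << 32) | regs[idx];
}
```

For example, region index 3 maps to BASEP6 at 0x98 and BASEP7 at 0x9C.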

# 3.3.1 Register BASEP0

Lower 32 bits of the Base pointer for region index 0.

Table 3-36 Register BASE\_POINTERS.BASEP0 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-49 | addr_word | The low word of the 64-bit address |

# Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

### 3.3.2 Register BASEP1

Upper 32 bits of the Base pointer for region index 0.

Table 3-37 Register BASE\_POINTERS.BASEP1 layout

| Bits   | Link      | Name      | Usage                                  |
|--------|-----------|-----------|----------------------------------------|
| [31:0] | addr_word | addr_word | The upper 8 bits of the 40-bit address |

#### Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

#### 3.3.3 Register BASEP2

Lower 32 bits of the Base pointer for region index 1.

Table 3-38 Register BASE\_POINTERS.BASEP2 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-49 | addr_word | The low word of the 64-bit address |

### Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.4 Register BASEP3

Upper 32 bits of the Base pointer for region index 1.

Table 3-39 Register BASE\_POINTERS.BASEP3 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-50 | addr_word | The upper 8 bits of the 40-bit address |

#### Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.5 Register BASEP4

Lower 32 bits of the Base pointer for region index 2.

Table 3-40 Register BASE\_POINTERS.BASEP4 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-50 | addr_word | The low word of the 64-bit address |

# Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

#### 3.3.6 Register BASEP5

Upper 32 bits of the Base pointer for region index 2.

Table 3-41 Register BASE\_POINTERS.BASEP5 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-50 | addr_word | The upper 8 bits of the 40-bit address |

# Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.7 Register BASEP6

Lower 32 bits of the Base pointer for region index 3.

Table 3-42 Register BASE\_POINTERS.BASEP6 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-50 | addr_word | The low word of the 64-bit address |

#### Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.8 Register BASEP7

Upper 32 bits of the Base pointer for region index 3.

Table 3-43 Register BASE\_POINTERS.BASEP7 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-51 | addr_word | The upper 8 bits of the 40-bit address |

### Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

# 3.3.9 Register BASEP8

Lower 32 bits of the Base pointer for region index 4.

Table 3-44 Register BASE\_POINTERS.BASEP8 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-51 | addr_word | The low word of the 64-bit address |

### Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.10 Register BASEP9

Upper 32 bits of the Base pointer for region index 4.

Table 3-45 Register BASE\_POINTERS.BASEP9 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-51 | addr_word | The upper 8 bits of the 40-bit address |

# Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

## 3.3.11 Register BASEP10

Lower 32 bits of the Base pointer for region index 5.

Table 3-46 Register BASE\_POINTERS.BASEP10 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-52 | addr_word | The low word of the 64-bit address |

#### Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

#### 3.3.12 Register BASEP11

Upper 32 bits of the Base pointer for region index 5.

Table 3-47 Register BASE\_POINTERS.BASEP11 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-52 | addr_word | The upper 8 bits of the 40-bit address |

### Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

#### 3.3.13 Register BASEP12

Lower 32 bits of the Base pointer for region index 6.

Table 3-48 Register BASE\_POINTERS.BASEP12 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-52 | addr_word | The low word of the 64-bit address |

#### Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

# 3.3.14 Register BASEP13

Upper 32 bits of the Base pointer for region index 6.

Table 3-49 Register BASE\_POINTERS.BASEP13 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-52 | addr_word | The upper 8 bits of the 40-bit address |

#### Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

### 3.3.15 Register BASEP14

Lower 32 bits of the Base pointer for region index 7.

Table 3-50 Register BASE\_POINTERS.BASEP14 layout

| Bits   | Link                   | Name      | Usage                              |
|--------|------------------------|-----------|------------------------------------|
| [31:0] | addr_word on page 3-53 | addr_word | The low word of the 64-bit address |

## Field addr\_word

The low word of the 64-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

# 3.3.16 Register BASEP15

Upper 32 bits of the Base pointer for region index 7.

Table 3-51 Register BASE\_POINTERS.BASEP15 layout

| Bits   | Link                   | Name      | Usage                                  |
|--------|------------------------|-----------|----------------------------------------|
| [31:0] | addr_word on page 3-53 | addr_word | The upper 8 bits of the 40-bit address |

# Field addr\_word

The upper 8 bits of the 40-bit address.

addr\_word is stored in bits[31:0] and is a 32-bit unsigned integer.

# 3.4 Register page ID

The NPU ID-byte registers bank.

Table 3-52 ID registers

| Address    | Link                              | Usage                                                                                                              | Access    | Default    |
|------------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------|-----------|------------|
| 0x00000FD0 | 3.4.1 Register PID4 on page 3-54  | Peripheral ID byte 4 (Arm=code 4)                                                                                  | Read-only | 0x00000004 |
| 0x00000FD4 | 3.4.2 Register PID5 on page 3-55  | Peripheral ID byte 5 (reserved)                                                                                    | Read-only | 0x00000000 |
| 0x00000FD8 | 3.4.3 Register PID6 on page 3-55  | Peripheral ID byte 6 (reserved)                                                                                    | Read-only | 0x00000000 |
| 0x00000FDC | 3.4.4 Register PID7 on page 3-55  | Peripheral ID byte 7 (reserved)                                                                                    | Read-only | 0x00000000 |
| 0x00000FE0 | 3.4.5 Register PID0 on page 3-55  | Peripheral ID byte 0. This is bits[7:0] of the part number.                                                        | Read-only | 0x00000080 |
| 0x00000FE4 | 3.4.6 Register PID1 on page 3-56  | Peripheral ID byte 1. This is bits[11:8] of the part number in bits[3:0] and bits[3:0] of the Arm ID in bits[7:4]. | Read-only | 0x000000B5 |
| 0x00000FE8 | 3.4.7 Register PID2 on page 3-56  | Peripheral ID byte 2. This is bits[6:4] of the Arm ID in bits[2:0] and bit 3 indicates format B.                   | Read-only | 0x0000000B |
| 0x00000FEC | 3.4.8 Register PID3 on page 3-56  | Peripheral ID byte 3.                                                                                              | Read-only | 0x00000000 |
| 0x00000FF0 | 3.4.9 Register CID0 on page 3-56  | Component ID byte 0.                                                                                               | Read-only | 0x0000000D |
| 0x00000FF4 | 3.4.10 Register CID1 on page 3-57 | Component ID byte 1.                                                                                               | Read-only | 0x000000F0 |
| 0x00000FF8 | 3.4.11 Register CID2 on page 3-57 | Component ID byte 2.                                                                                               | Read-only | 0x00000005 |
| 0x00000FFC | 3.4.12 Register CID3 on page 3-57 | Component ID byte 3.                                                                                               | Read-only | 0x000000B1 |

This section contains the following subsections:

- 3.4.1 Register PID4 on page 3-54.
- 3.4.2 Register PID5 on page 3-55.
- 3.4.3 Register PID6 on page 3-55.
- 3.4.4 Register PID7 on page 3-55.
- 3.4.5 Register PID0 on page 3-55.
- 3.4.6 Register PID1 on page 3-56.
- 3.4.7 Register PID2 on page 3-56.
- 3.4.8 Register PID3 on page 3-56.
- 3.4.9 Register CID0 on page 3-56.
- 3.4.10 Register CID1 on page 3-57.
- 3.4.11 Register CID2 on page 3-57.
- 3.4.12 Register CID3 on page 3-57.
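The ID bytes are spread over several registers, so software usually reassembles them before use. The sketch below shows one way to do that under stated assumptions: `regs` is a word pointer to the start of the NPU register page, the type and function names are hypothetical, and packing CID0 to CID3 into a single 32-bit word (CID0 in the least significant byte) is an illustrative choice rather than something this manual defines. The bit positions follow the usage column of Table 3-52.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPU_PID0_OFFSET 0x0FE0u
#define NPU_PID1_OFFSET 0x0FE4u
#define NPU_PID2_OFFSET 0x0FE8u
#define NPU_CID0_OFFSET 0x0FF0u /* CID1..CID3 follow at 4-byte steps */

typedef struct {
    uint16_t part_number;  /* 12-bit part number */
    uint8_t  arm_id;       /* 7-bit Arm ID */
    uint8_t  format_b;     /* PID2 bit 3 */
    uint32_t component_id; /* CID3..CID0 packed, illustrative only */
} npu_id_t;

/* Only the lower 8 bits of each ID register are valid. */
static uint32_t npu_id_byte(const volatile uint32_t *regs, uint32_t offset)
{
    return regs[offset / 4u] & 0xFFu;
}

/* Hypothetical helper: collect the ID fields from the ID register page. */
static npu_id_t npu_read_id(const volatile uint32_t *regs)
{
    assert(regs != NULL);

    uint32_t pid0 = npu_id_byte(regs, NPU_PID0_OFFSET);
    uint32_t pid1 = npu_id_byte(regs, NPU_PID1_OFFSET);
    uint32_t pid2 = npu_id_byte(regs, NPU_PID2_OFFSET);

    npu_id_t id;
    /* Part number: bits[7:0] in PID0, bits[11:8] in PID1[3:0]. */
    id.part_number = (uint16_t)(pid0 | ((pid1 & 0x0Fu) << 8));
    /* Arm ID: bits[3:0] in PID1[7:4], bits[6:4] in PID2[2:0]. */
    id.arm_id = (uint8_t)((pid1 >> 4) | ((pid2 & 0x07u) << 4));
    id.format_b = (uint8_t)((pid2 >> 3) & 1u);

    id.component_id = 0u;
    for (unsigned i = 0u; i < 4u; i++) {
        id.component_id |= npu_id_byte(regs, NPU_CID0_OFFSET + 4u * i) << (8u * i);
    }
    return id;
}
```

With the default values listed in Table 3-52, this sketch yields a part number of 0x580, an Arm ID of 0x3B, and the format B bit set.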

### 3.4.1 Register PID4

Peripheral ID byte 4 (Arm=code 4).

Table 3-53 Register ID.PID4 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID4 on page 3-54 | PID4 | Byte 4 of the Peripheral ID (Lower 8 bits valid) | 0x04    |

#### Field PID4

Byte 4 of the Peripheral ID (Lower 8 bits valid).

PID4 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x04.

## 3.4.2 Register PID5

Peripheral ID byte 5 (reserved).

Table 3-54 Register ID.PID5 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID5 on page 3-55 | PID5 | Byte 5 of the Peripheral ID (Lower 8 bits valid) | 0x00    |

#### Field PID5

Byte 5 of the Peripheral ID (Lower 8 bits valid).

PID5 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00.

### 3.4.3 Register PID6

Peripheral ID byte 6 (reserved).

Table 3-55 Register ID.PID6 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID6 on page 3-55 | PID6 | Byte 6 of the Peripheral ID (Lower 8 bits valid) | 0x00    |

#### Field PID6

Byte 6 of the Peripheral ID (Lower 8 bits valid).

PID6 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00.

#### 3.4.4 Register PID7

Peripheral ID byte 7 (reserved).

Table 3-56 Register ID.PID7 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID7 on page 3-55 | PID7 | Byte 7 of the Peripheral ID (Lower 8 bits valid) | 0x00    |

#### Field PID7

Byte 7 of the Peripheral ID (Lower 8 bits valid).

PID7 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00.

## 3.4.5 Register PID0

Peripheral ID byte 0. This is bits[7:0] of the part number.

Table 3-57 Register ID.PID0 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID0 on page 3-56 | PID0 | Byte 0 of the Peripheral ID (Lower 8 bits valid) | 0x80    |

#### Field PID0

Byte 0 of the Peripheral ID (Lower 8 bits valid).

PID0 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x80.

#### 3.4.6 Register PID1

Peripheral ID byte 1. This is bits[11:8] of the part number in bits[3:0] and bits[3:0] of the Arm ID in bits[7:4].

Table 3-58 Register ID.PID1 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID1 on page 3-56 | PID1 | Byte 1 of the Peripheral ID (Lower 8 bits valid) | 0xB5    |

#### Field PID1

Byte 1 of the Peripheral ID (Lower 8 bits valid).

PID1 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0xB5.

## 3.4.7 Register PID2

Peripheral ID byte 2. This is bits[6:4] of the Arm ID in bits[2:0] and bit 3 indicates format B.

Table 3-59 Register ID.PID2 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID2 on page 3-56 | PID2 | Byte 2 of the Peripheral ID (Lower 8 bits valid) | 0x0B    |

### Field PID2

Byte 2 of the Peripheral ID (Lower 8 bits valid).

PID2 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x0B.

#### 3.4.8 Register PID3

Peripheral ID byte 3.

Table 3-60 Register ID.PID3 layout

| Bits   | Link              | Name | Usage                                            | Default |
|--------|-------------------|------|--------------------------------------------------|---------|
| [31:0] | PID3 on page 3-56 | PID3 | Byte 3 of the Peripheral ID (Lower 8 bits valid) | 0x0     |

#### Field PID3

Byte 3 of the Peripheral ID (Lower 8 bits valid).

PID3 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x0.

### 3.4.9 Register CID0

Component ID byte 0.

Table 3-61 Register ID.CID0 layout

| Bits   | Link              | Name | Usage                                           | Default |
|--------|-------------------|------|-------------------------------------------------|---------|
| [31:0] | CID0 on page 3-57 | CID0 | Byte 0 of the Component ID (Lower 8 bits valid) | 0x0D    |

#### Field CID0

Byte 0 of the Component ID (Lower 8 bits valid).

CID0 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x0D.

## 3.4.10 Register CID1

Component ID byte 1.

Table 3-62 Register ID.CID1 layout

| Bits   | Link              | Name | Usage                                           | Default |
|--------|-------------------|------|-------------------------------------------------|---------|
| [31:0] | CID1 on page 3-57 | CID1 | Byte 1 of the Component ID (Lower 8 bits valid) | 0xF0    |

#### Field CID1

Byte 1 of the Component ID (Lower 8 bits valid).

CID1 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0xF0.

## 3.4.11 Register CID2

Component ID byte 2.

Table 3-63 Register ID.CID2 layout

| Bits   | Link              | Name | Usage                                           | Default |
|--------|-------------------|------|-------------------------------------------------|---------|
| [31:0] | CID2 on page 3-57 | CID2 | Byte 2 of the Component ID (Lower 8 bits valid) | 0x05    |

#### Field CID2

Byte 2 of the Component ID (Lower 8 bits valid).

CID2 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x05.

## 3.4.12 Register CID3

Component ID byte 3.

Table 3-64 Register ID.CID3 layout

| Bits   | Link              | Name | Usage                                           | Default |
|--------|-------------------|------|-------------------------------------------------|---------|
| [31:0] | CID3 on page 3-57 | CID3 | Byte 3 of the Component ID (Lower 8 bits valid) | 0xB1    |

#### Field CID3

Byte 3 of the Component ID (Lower 8 bits valid).

CID3 is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0xB1.

# 3.5 Register page PMU

The Performance Monitoring Unit (PMU) control registers.

The PMU consists of a 48-bit cycle counter that can be enabled or disabled, reset, and read through APB. Also, there are programmable event counters controlled through APB.

The PMU has four event counters that log AXI-related events to monitor system performance. It can be configured to generate an interrupt on counter overflow. There is also an option to control the PMU through a command-stream operation.

| Note |
|------|
| The PMU uses the NPU clock after the top-level clock gate to count cycles. To get non-gated clock cycles, the NPU clock must be forced. To force the NPU clock gate, set bit[2] of the CMD register to LOW to disable clock-off through the QLPI interface and the main clock gate. |

Table 3-65 PMU registers

| Address | Link                                         | Usage                                           | Access     | Default    |
|---------|----------------------------------------------|-------------------------------------------------|------------|------------|
| 0x0180  | 3.5.1 Register PMCR on page 3-60             | PMU master control register                     | Read/write | 0x00002000 |
| 0x0184  | 3.5.2 Register PMCNTENSET on page 3-61       | Count-enable set register                       | Read/write | 0x00000000 |
| 0x0188  | 3.5.3 Register PMCNTENCLR on page 3-62       | Count-enable clear register                     | Read/write | 0x00000000 |
| 0x018C  | 3.5.6 Register PMOVSSET on page 3-66         | Overflow-flag status set register               | Read/write | 0x00000000 |
| 0x0190  | 3.5.7 Register PMOVSCLR on page 3-68         | Overflow-flag status clear register             | Read/write | 0x00000000 |
| 0x0194  | 3.5.8 Register PMINTSET on page 3-70         | Interrupt-enable set register                   | Read/write | 0x00000000 |
| 0x0198  | 3.5.9 Register PMINTCLR on page 3-72         | Interrupt-enable clear register                 | Read/write | 0x00000000 |
| 0x01A0  | 3.5.10 Register PMCCNTR_LO on page 3-74      | Performance-monitor cycle count low register    | Read/write | 0x00000000 |
| 0x01A4  | 3.5.11 Register PMCCNTR_HI on page 3-74      | Performance-monitor cycle count high register   | Read/write | 0x00000000 |
| 0x01AC  | 3.5.12 Register PMCAXI_CHAN on page 3-74     | Set which AXI channel to monitor                | Read/write | 0x00000000 |
| 0x0300  | 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64   | Performance-monitor event counters              | Read/write | 0x00000000 |
| 0x0380  | 3.5.5 PMU_EVTYPER0 PMU_EVTYPER3 on page 3-64 | Performance-monitor event-type control counters | Read/write | 0x00000000 |

This section contains the following subsections:

- 3.5.1 Register PMCR on page 3-60.
- 3.5.2 Register PMCNTENSET on page 3-61.
- 3.5.3 Register PMCNTENCLR on page 3-62.
- 3.5.4 PMU\_EVCNTR0 ... PMU\_EVCNTR3 on page 3-64.
- 3.5.5 PMU\_EVTYPER0 ... PMU\_EVTYPER3 on page 3-64.
- 3.5.6 Register PMOVSSET on page 3-66.
- 3.5.7 Register PMOVSCLR on page 3-68.
- 3.5.8 Register PMINTSET on page 3-70.
- 3.5.9 Register PMINTCLR on page 3-72.
- 3.5.10 Register PMCCNTR\_LO on page 3-74.
- 3.5.11 Register PMCCNTR\_HI on page 3-74.
- 3.5.12 Register PMCAXI\_CHAN on page 3-74.
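Because the 48-bit cycle counter is exposed through two 32-bit registers, PMCCNTR\_LO and PMCCNTR\_HI, software has to combine two reads. The sketch below is one possible approach and rests on several assumptions: `regs` is a word pointer to the start of the NPU register page, the upper 16 bits of the count sit in the low half-word of PMCCNTR\_HI, and the function name is hypothetical. The re-read of the high word is a general technique for guarding against a carry between the two reads. This section does not state whether the hardware needs it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMU_PMCCNTR_LO_OFFSET 0x01A0u
#define PMU_PMCCNTR_HI_OFFSET 0x01A4u

/* Hypothetical helper: read the 48-bit cycle count as one 64-bit value. */
static uint64_t pmu_read_cycle_count(const volatile uint32_t *regs)
{
    assert(regs != NULL);

    uint32_t hi, lo, hi_again;
    do {
        hi       = regs[PMU_PMCCNTR_HI_OFFSET / 4u];
        lo       = regs[PMU_PMCCNTR_LO_OFFSET / 4u];
        hi_again = regs[PMU_PMCCNTR_HI_OFFSET / 4u];
    } while (hi != hi_again); /* retry if the high word changed mid-read */

    /* Assumption: only the low 16 bits of PMCCNTR_HI carry count bits. */
    return ((uint64_t)(hi & 0xFFFFu) << 32) | lo;
}
```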

### 3.5.1 Register PMCR

The PMCR register is the master control register of the PMU.

Table 3-66 Register PMU.PMCR layout

| Bits    | Link                       | Name          | Usage                                                                   | Default |
|---------|----------------------------|---------------|-------------------------------------------------------------------------|---------|
| [31:16] | Reserved                   | -             | -                                                                       | -       |
| [15:11] | num_event_cnt on page 3-60 | num_event_cnt | Number of event counters available for performance measurement          | 0x04    |
| [10:4]  | Reserved                   | -             | -                                                                       | -       |
| [3]     | mask_en on page 3-60       | mask_en       | PMU can be enabled/disabled by command stream operation NPU_OP_PMU_MASK | 0x0     |
| [2]     | cycle_cnt_rst on page 3-60 | cycle_cnt_rst | Reset cycle counter                                                     | 0       |
| [1]     | event_cnt_rst on page 3-60 | event_cnt_rst | Reset event counter                                                     | 0       |
| [0]     | cnt_en on page 3-60        | cnt_en        | Enable counter                                                          | 0x0     |

### Field num\_event\_cnt

Number of event counters available for performance measurement.

num\_event\_cnt is stored in bits[15:11] and is a 5-bit unsigned integer. Its default value is 0x04.

The number of available event counters is hard-coded to four.

### Field mask\_en

PMU can be enabled/disabled by command stream operation NPU\_OP\_PMU\_MASK.

mask\_en is stored in bit[3] and is a 1-bit unsigned integer. Its default value is 0x0.

Note that field cnt\_en must be enabled for the PMU to be active.

## Field cycle\_cnt\_rst

Reset cycle counter.

cycle\_cnt\_rst is located in bit[2] and is a 1-bit unsigned integer. Its default value is 0.

Writing a 1 to this field resets the cycle counter. If the cycle counter is active, it will continue counting after reset. This register bit always reads a 0.

### Field event\_cnt\_rst

Reset event counter.

event\_cnt\_rst is located in bit[1] and is a 1-bit unsigned integer. Its default value is 0.

Writing a 1 to this field resets all event counters. If any counter is active, it will continue counting after reset. This register bit always reads a 0.

#### Field cnt\_en

Enable counter.

cnt\_en is stored in bit[0] and is a 1-bit unsigned integer. Its default value is 0x0.

This is the master switch. When the switch is disabled, the PMU is always off.
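The PMCR field layout can be expressed as bit masks. The following sketch is illustrative only: it assumes `regs` is a word pointer to the start of the NPU register page, and the macro and function names are hypothetical. It uses a read-modify-write so that the other PMCR bits are written back unchanged, which is a design choice of this sketch rather than a documented requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMU_PMCR_OFFSET          0x0180u
#define PMCR_CNT_EN              (1u << 0)
#define PMCR_EVENT_CNT_RST       (1u << 1)
#define PMCR_CYCLE_CNT_RST       (1u << 2)
#define PMCR_MASK_EN             (1u << 3)
#define PMCR_NUM_EVENT_CNT_SHIFT 11u
#define PMCR_NUM_EVENT_CNT_MASK  0x1Fu /* 5-bit field in bits[15:11] */

/* Extract num_event_cnt from a PMCR value. */
static unsigned pmcr_num_event_cnt(uint32_t pmcr)
{
    return (pmcr >> PMCR_NUM_EVENT_CNT_SHIFT) & PMCR_NUM_EVENT_CNT_MASK;
}

/* Hypothetical helper: reset the cycle and event counters and turn on
 * the master enable in a single write. */
static void pmu_reset_and_enable(volatile uint32_t *regs)
{
    assert(regs != NULL);

    uint32_t pmcr = regs[PMU_PMCR_OFFSET / 4u];
    pmcr |= PMCR_CNT_EN | PMCR_EVENT_CNT_RST | PMCR_CYCLE_CNT_RST;
    regs[PMU_PMCR_OFFSET / 4u] = pmcr;
}
```

Applied to the PMCR default value 0x00002000, `pmcr_num_event_cnt` returns 4, which matches the hard-coded number of event counters.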

### 3.5.2 Register PMCNTENSET

Count-enable set registers to activate the counters.

This register enables the dedicated cycle counter, PMCCNTR, and any implemented event counters PMU\_EVCNTRn.

3.5.2 Register PMCNTENSET on page 3-61 is used together with the 3.5.3 Register PMCNTENCLR on page 3-62 register. It is implemented in hardware with the same underlying state as the 3.5.3 Register PMCNTENCLR on page 3-62.

Writing to this register enables the counters as follows: writing 1 to bit[31] enables the cycle counter, and writing 1 to bit[n], where n is 0 to 3, enables event counter n.

Reading from 3.5.2 Register PMCNTENSET on page 3-61 or 3.5.3 Register PMCNTENCLR on page 3-62 gives the same value, which is the enable status of the counters.
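As a minimal sketch of the write behavior described above, the helper below builds an enable mask and writes it to PMCNTENSET. It assumes `regs` is a word pointer to the start of the NPU register page, and all names are hypothetical. Because writing 0 to a bit has no effect, the sketch writes the mask directly instead of doing a read-modify-write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMU_PMCNTENSET_OFFSET 0x0184u
#define PMCNTEN_CYCLE_CNT     (1u << 31) /* PMCCNTR enable bit */
#define PMU_NUM_EVENT_CNT     4u
#define PMCNTEN_EVENT_MASK    0x0000000Fu /* EVENT_CNT_0 to EVENT_CNT_3 */

/* Enable bit for event counter n, where n is 0 to 3. */
static uint32_t pmcnten_event_bit(unsigned n)
{
    assert(n < PMU_NUM_EVENT_CNT);
    return 1u << n;
}

/* Hypothetical helper: enable every counter selected in mask.
 * Bits outside the documented fields are dropped before the write. */
static void pmu_enable_counters(volatile uint32_t *regs, uint32_t mask)
{
    assert(regs != NULL);
    regs[PMU_PMCNTENSET_OFFSET / 4u] =
        mask & (PMCNTEN_CYCLE_CNT | PMCNTEN_EVENT_MASK);
}
```

For instance, `PMCNTEN_CYCLE_CNT | pmcnten_event_bit(0) | pmcnten_event_bit(2)` selects the cycle counter together with event counters 0 and 2.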

Table 3-67 Register PMU.PMCNTENSET layout

| Bits   | Link                     | Name        | Usage                                    | Default |
|--------|--------------------------|-------------|------------------------------------------|---------|
| [31]   | CYCLE_CNT on page 3-61   | CYCLE_CNT   | PMCCNTR enable bit                       | 0       |
| [30:4] | Reserved                 | -           | -                                        | -       |
| [3]    | EVENT_CNT_3 on page 3-61 | EVENT_CNT_3 | Event-counter enable bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2 on page 3-62 | EVENT_CNT_2 | Event-counter enable bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1 on page 3-62 | EVENT_CNT_1 | Event-counter enable bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0 on page 3-62 | EVENT_CNT_0 | Event-counter enable bit for PMU_EVCNTR0 | 0       |

## Field CYCLE\_CNT

PMCCNTR enable bit.

CYCLE\_CNT is stored in bit[31] and is a 1-bit flag. Its default value is 0.

Enables the dedicated cycle counter, PMCCNTR.

Table 3-68 Field CYCLE\_CNT values

| Value       | Meaning                                                                                       |
|-------------|-----------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means the cycle counter is disabled. When written, it has no effect.            |
| 1           | When read, it means the cycle counter is enabled. When written, it enables the cycle counter. |

## Field EVENT\_CNT\_3

Event-counter enable bit for PMU\_EVCNTR3.

EVENT\_CNT\_3 is stored in bit[3] and is a 1-bit flag. Its default value is 0.

Table 3-69 Field EVENT\_CNT\_3 values

| Value       | Meaning                                                                                                                                                            |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 is disabled. When written, it has no effect.                                                   |
| 1           | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter is enabled. When written, it enables 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64. |

### Field EVENT\_CNT\_2

Event-counter enable bit for PMU\_EVCNTR2.

EVENT\_CNT\_2 is stored in bit[2] and is a 1-bit flag. Its default value is 0.

Table 3-70 Field EVENT\_CNT\_2 values

| Value       | Meaning                                                                                                                                                            |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 is disabled. When written, it has no effect.                                                   |
| 1           | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter is enabled. When written, it enables 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64. |

### Field EVENT\_CNT\_1

Event-counter enable bit for PMU\_EVCNTR1.

EVENT\_CNT\_1 is stored in bit[1] and is a 1-bit flag. Its default value is 0.

Table 3-71 Field EVENT\_CNT\_1 values

| Value       | Meaning                                                                                                                                                            |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 is disabled. When written, it has no effect.                                                   |
| 1           | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter is enabled. When written, it enables 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64. |

#### Field EVENT\_CNT\_0

Event-counter enable bit for PMU\_EVCNTR0.

EVENT\_CNT\_0 is stored in bit[0] and is a 1-bit flag. Its default value is 0.

Table 3-72 Field EVENT\_CNT\_0 values

| Value       | Meaning                                                                                                                                                            |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 is disabled. When written, it has no effect.                                                   |
| 1           | When read, it means that 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter is enabled. When written, it enables 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64. |

#### 3.5.3 Register PMCNTENCLR

Count-enable clear registers to disable the counters.

This register disables the dedicated cycle counter, PMCCNTR, and any implemented event counters PMU\_EVCNTRn.

3.5.3 Register PMCNTENCLR on page 3-62 is used together with the 3.5.2 Register PMCNTENSET on page 3-61 register. It is implemented in hardware with the same underlying state as 3.5.2 Register PMCNTENSET on page 3-61.

Writing to this register disables the counters as follows: writing 1 to bit[31] disables the cycle counter and writing 1 to bit[0-3] disables event counter 0-3, respectively.

Reading from 3.5.2 Register PMCNTENSET on page 3-61 or 3.5.3 Register PMCNTENCLR on page 3-62 gives the same value, which is the enable status of the counters.
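
The set/clear pairing described above can be illustrated with a small software model. This is a sketch only, not driver code: the type `pmu_cnten_model` and all function and macro names are hypothetical, and the model assumes that writes to the reserved bits [30:4] are ignored, which this section does not state. Only the bit positions (bit[31] for the cycle counter, bit[0-3] for the event counters) are taken from the layout tables.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the PMCNTENSET/PMCNTENCLR layout tables. */
#define PMU_CYCLE_CNT_BIT (UINT32_C(1) << 31)
/* Assumption: only bit[31] and bit[3:0] hold state; reserved bits are ignored. */
#define PMU_IMPLEMENTED   (PMU_CYCLE_CNT_BIT | UINT32_C(0xF))

/* Mask for event counter k (k = 0..3). */
static inline uint32_t pmu_event_cnt_bit(unsigned k)
{
    assert(k < 4);
    return UINT32_C(1) << k;
}

/* Hypothetical model of the enable state shared by both registers. */
typedef struct {
    uint32_t enable;
} pmu_cnten_model;

/* Write to PMCNTENSET: each 1 bit enables a counter, 0 bits have no effect. */
static inline void pmcntenset_write(pmu_cnten_model *m, uint32_t value)
{
    m->enable |= value & PMU_IMPLEMENTED;
}

/* Write to PMCNTENCLR: each 1 bit disables a counter, 0 bits have no effect. */
static inline void pmcntenclr_write(pmu_cnten_model *m, uint32_t value)
{
    m->enable &= ~(value & PMU_IMPLEMENTED);
}

/* Reading either register returns the same enable status. */
static inline uint32_t pmcnten_read(const pmu_cnten_model *m)
{
    return m->enable;
}
```

Because a written 0 has no effect in either register, software can enable or disable a single counter without a read-modify-write sequence.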

#### Table 3-73 Register PMU.PMCNTENCLR layout

| Bits   | Link                     | Name        | Usage                                     | Default |
|--------|--------------------------|-------------|-------------------------------------------|---------|
| [31]   | CYCLE_CNT on page 3-63   | CYCLE_CNT   | PMCCNTR disable bit                       | 0       |
| [30:4] | Reserved                 | -           | -                                         | -       |
| [3]    | EVENT_CNT_3 on page 3-63 | EVENT_CNT_3 | Event-counter disable bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2 on page 3-63 | EVENT_CNT_2 | Event-counter disable bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1 on page 3-64 | EVENT_CNT_1 | Event-counter disable bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0 on page 3-64 | EVENT_CNT_0 | Event-counter disable bit for PMU_EVCNTR0 | 0       |

#### Field CYCLE\_CNT

PMCCNTR disable bit.

CYCLE\_CNT is stored in bit[31] and is a 1-bit flag. Its default value is 0.

Disables the dedicated cycle counter, PMCCNTR.

#### Table 3-74 Field CYCLE\_CNT values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means the cycle counter is disabled. When written, it has no effect. |
| 1           | When read, it means the cycle counter is enabled. When written, it disables the cycle counter. |

#### Field EVENT\_CNT\_3

Event-counter disable bit for PMU\_EVCNTR3.

EVENT\_CNT\_3 is stored in bit[3] and is a 1-bit flag. Its default value is 0.

#### Table 3-75 Field EVENT\_CNT\_3 values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that event counter PMU_EVCNTR3 is disabled. When written, it has no effect. |
| 1           | When read, it means that event counter PMU_EVCNTR3 is enabled. When written, it disables PMU_EVCNTR3. |

#### Field EVENT\_CNT\_2

Event-counter disable bit for PMU\_EVCNTR2.

EVENT\_CNT\_2 is stored in bit[2] and is a 1-bit flag. Its default value is 0.

#### Table 3-76 Field EVENT\_CNT\_2 values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that event counter PMU_EVCNTR2 is disabled. When written, it has no effect. |
| 1           | When read, it means that event counter PMU_EVCNTR2 is enabled. When written, it disables PMU_EVCNTR2. |

#### Field EVENT\_CNT\_1

Event-counter disable bit for PMU\_EVCNTR1.

EVENT\_CNT\_1 is stored in bit[1] and is a 1-bit flag. Its default value is 0.

#### Table 3-77 Field EVENT\_CNT\_1 values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that event counter PMU_EVCNTR1 is disabled. When written, it has no effect. |
| 1           | When read, it means that event counter PMU_EVCNTR1 is enabled. When written, it disables PMU_EVCNTR1. |

#### Field EVENT\_CNT\_0

Event-counter disable bit for PMU\_EVCNTR0.

EVENT\_CNT\_0 is stored in bit[0] and is a 1-bit flag. Its default value is 0.

#### Table 3-78 Field EVENT\_CNT\_0 values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that event counter PMU_EVCNTR0 is disabled. When written, it has no effect. |
| 1           | When read, it means that event counter PMU_EVCNTR0 is enabled. When written, it disables PMU_EVCNTR0. |

### 3.5.4 PMU\_EVCNTR0 ... PMU\_EVCNTR3

Performance-monitor event counters.

PMU\_EVCNTR[k] (k = 0-3) are the four 32-bit performance counters.
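
Because each event counter is 32 bits wide, the difference between two samples can be computed with unsigned arithmetic, which stays correct across a single wrap-around. The following is a minimal sketch; the function name `pmu_evcntr_delta` is hypothetical, and the result is only meaningful if the counter wrapped at most once between the two reads.

```c
#include <stdint.h>

/*
 * Number of events counted between two reads of the same 32-bit event
 * counter. Unsigned subtraction is performed modulo 2^32, so a single
 * wrap-around between the reads still gives the right difference.
 */
static inline uint32_t pmu_evcntr_delta(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}
```

If a measurement window can exceed 2^32 events, the overflow flags described in 3.5.6 and 3.5.7 are needed in addition to this subtraction.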

### 3.5.5 PMU\_EVTYPER0 ... PMU\_EVTYPER3

The performance-monitor event-type registers, which control the respective event counters.

PMU\_EVTYPER0 ... PMU\_EVTYPER3 select the events that are connected to the performance counters (see 3.5.4 PMU\_EVCNTR0 ... PMU\_EVCNTR3), where PMU\_EVTYPER[k] controls performance counter PMU\_EVCNTR[k].

An event is selected using a 10-bit word from the following table.

#### Field EV\_TYPE

Event type.

EV\_TYPE is stored as a 10-bit enumeration. Its default value is no\_event.

The field can contain the following values:

#### Table 3-79 Field EV\_TYPE values

| Value first core | Value second core | Name     | Meaning                                   |
|------------------|-------------------|----------|-------------------------------------------|
| 0x00 (default)   | -                 | no_event | No event counted (the event never occurs) |
| 0x11             | -                 | Cycle    | Event occurs every cycle.                 |
| 0x20             | -                 | NPU idle | NPU in stopped state                      |
| 0x23             | -                 | NPU running                  | NPU in running state                                                                      |
| 0x30             | 0x130             | MAC: ACTIVE (8 or 16 bit)    | MAC is doing block traversal. Valid blk_cmd and not stalled.                              |
| 0x31             | 0x131             | MAC: ACTIVE 8-bit            | MAC is doing 8-bit block traversal. Valid blk_cmd and not stalled                         |
| 0x32             | 0x132             | MAC: ACTIVE 16-bit           | MAC is doing 16-bit block traversal. Valid blk_cmd and not stalled                        |
| 0x40             | 0x140             | AO: ACTIVE (8-bit or 16-bit) | AO is doing block traversal of ACC or IB. Valid blk_cmd and not stalled                   |
| 0x41             | 0x141             | AO: ACTIVE 8-bit             | AO is doing 8-bit block traversal of ACC or IB. Valid blk_cmd and not stalled             |
| 0x42             | 0x142             | AO: ACTIVE 16-bit            | AO is doing 16-bit block traversal of ACC or IB. Valid blk_cmd and not stalled            |
| 0x50             | 0x150             | WD: ACTIVE                   | WD is decoding weight stream. Valid ofd_cmd and not stalled.                              |
| 0x80             | -                 | axi0_rd_trans_accepted       | AXI-0 read transfer accepted, arready & arvalid (number of read transfers)                |
| 0x81             | -                 | -                            | -                                                                                         |
| 0x82             | -                 | axi0_rd_data_beat_received   | AXI-0, rready & rvalid (read bandwidth)                                                   |
| 0x83             | -                 | axi0_rd_tran_req_stalled     | AXI-0, arvalid & ~arready (read stalls due to memory system)                              |
| 0x84             | -                 | axi0_wr_trans_accepted       | AXI-0, awready & awvalid (number of write transfers)                                      |
| 0x85-0x86        | -                 | -                            | -                                                                                         |
| 0x87             | -                 | axi0_wr_data_beat_written    | AXI-0, wvalid & wready (write bandwidth)                                                  |
| 0x88             | -                 | axi0_wr_tran_req_stalled     | AXI-0, awvalid & ~awready (write transfer stalls due to memory system)                    |
| 0x89             | -                 | axi0_wr_data_beat_stalled    | AXI-0, wvalid & ~wready (write beat stalls due to memory system)                          |
| 0x8A-0x8B        | -                 | -                            | -                                                                                         |
| 0x8C             | -                 | axi0_enabled_cycles          | AXI-0, aclken_input (memory system frequency)                                             |
| 0x8D             | -                 | -                            | -                                                                                         |
| 0x8E             | -                 | axi0_rd_stall_limit          | AXI-0, check if read stalled due to AXI counter limit reached                             |
| 0x8F             | -                 | axi0_wr_stall_limit          | AXI-0, check if write stalled due to AXI counter limit reached                            |
| 0xA0             | -                 | axi_latency_any              | Any latency; measures the total number of transactions for the specified ID and interface |
| 0xA1             | -                 | axi_latency_32               | Latency was ≥ 32 cycles                                                                   |
| 0xA2             | -                 | axi_latency_64               | Latency was ≥ 64 cycles                                                                   |
| 0xA3             | -                 | axi_latency_128              | Latency was ≥ 128 cycles                                                                  |
| 0xA4             | -                 | axi_latency_256              | Latency was ≥ 256 cycles                                                                  |
| 0xA5             | -                 | axi_latency_512            | Latency was ≥ 512 cycles                                                   |
| 0xA6             | -                 | axi_latency_1024           | Latency was ≥ 1024 cycles                                                  |
| 0xB0             | -                 | DMA ECC event              | DMA RAM error (corrected or uncorrected)                                   |
| 0xB1             | 0x1B1             | SB ECC event               | SB RAM error (corrected or uncorrected)                                    |
| 0x180            | -                 | axi1_rd_trans_accepted     | AXI-1 read transfer accepted, arready & arvalid (number of read transfers) |
| 0x181            | -                 | -                          | -                                                                          |
| 0x182            | -                 | axi1_rd_data_beat_received | AXI-1, rready & rvalid (read bandwidth)                                    |
| 0x183            | -                 | axi1_rd_tran_req_stalled   | AXI-1, arvalid & ~arready (read stalls due to memory system)               |
| 0x184            | -                 | axi1_wr_trans_accepted     | AXI-1, awready & awvalid (number of write transfers)                       |
| 0x185-0x186      | -                 | -                          | -                                                                          |
| 0x187            | -                 | axi1_wr_data_beat_written  | AXI-1, wvalid & wready (write bandwidth)                                   |
| 0x188            | -                 | axi1_wr_tran_req_stalled   | AXI-1, awvalid & ~awready (write transfer stalls due to memory system)     |
| 0x189            | -                 | axi1_wr_data_beat_stalled  | AXI-1, wvalid & ~wready (write beat stalls due to memory system)           |
| 0x18A-0x18B      | -                 | -                          | -                                                                          |
| 0x18C            | -                 | axi1_enabled_cycles        | AXI-1, aclken_input (memory system frequency)                              |
| 0x18D            | -                 | -                          | -                                                                          |
| 0x18E            | -                 | axi1_rd_stall_limit        | AXI-1, check if read stalled due to AXI counter limit reached              |
| 0x18F            | -                 | axi1_wr_stall_limit        | AXI-1, check if write stalled due to AXI counter limit reached             |
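
The axi_latency_* events are cumulative: each one counts transactions whose latency was at least the given threshold, and axi_latency_any counts all transactions. A per-range histogram can therefore be derived by subtracting adjacent counts. The sketch below assumes that all seven counts were collected for the same ID, interface, and workload; because only four event counters exist, this would take more than one run. The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* any, >=32, >=64, >=128, >=256, >=512, >=1024 (events 0xA0..0xA6). */
#define AXI_LATENCY_LEVELS 7

/*
 * at_least[i] holds the count read for event 0xA0 + i.
 * bucket[0] receives the transactions with latency below 32 cycles,
 * bucket[1] those in [32, 64), and so on; the last bucket receives the
 * transactions with latency of 1024 cycles or more.
 * Assumption: the counts are consistent (non-increasing), so the
 * subtractions do not wrap.
 */
static inline void axi_latency_histogram(const uint32_t at_least[AXI_LATENCY_LEVELS],
                                         uint32_t bucket[AXI_LATENCY_LEVELS])
{
    for (size_t i = 0; i + 1 < AXI_LATENCY_LEVELS; i++) {
        bucket[i] = at_least[i] - at_least[i + 1];
    }
    bucket[AXI_LATENCY_LEVELS - 1] = at_least[AXI_LATENCY_LEVELS - 1];
}
```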

**Note**

When NPU\_SET\_PARALLEL\_MODE is set to 1, which selects the two-core depth mode, the 0x1XY value gives the count for the second core and the 0x0XY value gives the count for the first core.
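
The relationship between first-core and second-core event values can be sketched as a small helper. The names `ev_type_second_core`, `EV_TYPE_SECOND_CORE_BIT`, and `EV_TYPE_MASK` are hypothetical; the list of events that have a second-core value is copied from Table 3-79. Note that the AXI-1 events (0x180-0x18F) also have bit 8 set but are listed as first-core values, so the helper works from an explicit list rather than testing that bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define EV_TYPE_MASK            UINT32_C(0x3FF) /* EV_TYPE is a 10-bit value */
#define EV_TYPE_SECOND_CORE_BIT UINT32_C(0x100) /* 0x0XY -> 0x1XY */

/*
 * If the first-core event value has a second-core counterpart in
 * Table 3-79, store that counterpart in *second and return true.
 * Otherwise leave *second untouched and return false.
 */
static inline bool ev_type_second_core(uint32_t first, uint32_t *second)
{
    switch (first) {
    case 0x30: case 0x31: case 0x32: /* MAC: ACTIVE events */
    case 0x40: case 0x41: case 0x42: /* AO: ACTIVE events */
    case 0x50:                       /* WD: ACTIVE */
    case 0xB1:                       /* SB ECC event */
        *second = (first | EV_TYPE_SECOND_CORE_BIT) & EV_TYPE_MASK;
        return true;
    default:
        return false;
    }
}
```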

### 3.5.6 Register PMOVSSET

The overflow-flag status set register.

Sets the state of the overflow bit for the dedicated cycle counter, PMCCNTR, and each of the implemented event counters PMU\_EVCNTR*n*.

PMOVSSET is used together with PMOVSCLR (see 3.5.7 Register PMOVSCLR). The two registers are implemented in hardware with the same underlying state.

This register sets the overflow bit as follows: writing 1 to bit[31] sets the overflow bit for the cycle counter and writing 1 to bit[0-3] sets the overflow bit for event counter 0-3, respectively. This register is not written to in normal operation.

#### Table 3-80 Register PMU.PMOVSSET layout

| Bits   | Link                         | Name            | Usage                                          | Default |
|--------|------------------------------|-----------------|------------------------------------------------|---------|
| [31]   | CYCLE_CNT_OVF on page 3-67   | CYCLE_CNT_OVF   | PMCCNTR overflow set bit                       | 0       |
| [30:4] | Reserved                     | -               | -                                              | -       |
| [3]    | EVENT_CNT_3_OVF on page 3-67 | EVENT_CNT_3_OVF | Event-counter overflow set bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2_OVF on page 3-67 | EVENT_CNT_2_OVF | Event-counter overflow set bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1_OVF on page 3-67 | EVENT_CNT_1_OVF | Event-counter overflow set bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0_OVF on page 3-68 | EVENT_CNT_0_OVF | Event-counter overflow set bit for PMU_EVCNTR0 | 0       |

#### Field CYCLE\_CNT\_OVF

PMCCNTR overflow set bit.

CYCLE\_CNT\_OVF is stored in bit[31] and is a 1-bit flag. Its default value is 0.

#### Table 3-81 Field CYCLE\_CNT\_OVF values

| Value       | Meaning                                                                                            |
|-------------|----------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means the cycle counter has not overflowed. When written, it has no effect.          |
| 1           | When read, it means the cycle counter has overflowed. When written, it sets the overflow bit to 1. |

#### Field EVENT\_CNT\_3\_OVF

Event-counter overflow set bit for PMU\_EVCNTR3.

EVENT\_CNT\_3\_OVF is stored in bit[3] and is a 1-bit flag. Its default value is 0.

#### Table 3-82 Field EVENT\_CNT\_3\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR3 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR3 has overflowed. When written, it sets the PMU_EVCNTR3 overflow bit to 1. |

#### Field EVENT\_CNT\_2\_OVF

Event-counter overflow set bit for PMU\_EVCNTR2.

EVENT\_CNT\_2\_OVF is stored in bit[2] and is a 1-bit flag. Its default value is 0.

#### Table 3-83 Field EVENT\_CNT\_2\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR2 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR2 has overflowed. When written, it sets the PMU_EVCNTR2 overflow bit to 1. |

#### Field EVENT\_CNT\_1\_OVF

Event-counter overflow set bit for PMU\_EVCNTR1.

EVENT\_CNT\_1\_OVF is stored in bit[1] and is a 1-bit flag. Its default value is 0.

#### Table 3-84 Field EVENT\_CNT\_1\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR1 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR1 has overflowed. When written, it sets the PMU_EVCNTR1 overflow bit to 1. |

#### Field EVENT\_CNT\_0\_OVF

Event-counter overflow set bit for PMU\_EVCNTR0.

EVENT\_CNT\_0\_OVF is stored in bit[0] and is a 1-bit flag. Its default value is 0.

#### Table 3-85 Field EVENT\_CNT\_0\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR0 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR0 has overflowed. When written, it sets the PMU_EVCNTR0 overflow bit to 1. |

### 3.5.7 Register PMOVSCLR

The overflow-flag status clear register.

Contains the status of the overflow bit for the dedicated cycle counter, PMCCNTR, and each of the implemented event counters PMU\_EVCNTRn.

PMOVSCLR is used together with PMOVSSET (see 3.5.6 Register PMOVSSET). The two registers are implemented in hardware with the same underlying state.

Writing to this register clears overflows as follows: writing a 1 to bit[31] clears overflow for the cycle counter and writing 1 to bit[0-3] clears overflow from event counter 0-3, respectively.

Reading from this register gives the overflow status.
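
One common way to use a write-1-to-clear status register is to read the flags and write the same value back, so that exactly the flags that were observed are cleared. The sketch below models this in software. The type `pmu_ovs_model` and all function and macro names are hypothetical, and the model assumes that the reserved bits [30:4] hold no state.

```c
#include <stdint.h>

/* Assumption: only bit[31] (cycle counter) and bit[3:0] (event counters) hold state. */
#define PMU_OVF_IMPLEMENTED ((UINT32_C(1) << 31) | UINT32_C(0xF))

/* Hypothetical model of the overflow state shared by PMOVSSET and PMOVSCLR. */
typedef struct {
    uint32_t overflow;
} pmu_ovs_model;

/* Write to PMOVSSET: each 1 bit sets an overflow flag. */
static inline void pmovsset_write(pmu_ovs_model *m, uint32_t value)
{
    m->overflow |= value & PMU_OVF_IMPLEMENTED;
}

/* Write to PMOVSCLR: each 1 bit clears an overflow flag. */
static inline void pmovsclr_write(pmu_ovs_model *m, uint32_t value)
{
    m->overflow &= ~(value & PMU_OVF_IMPLEMENTED);
}

/* Read of PMOVSCLR: returns the overflow status. */
static inline uint32_t pmovsclr_read(const pmu_ovs_model *m)
{
    return m->overflow;
}

/* Read the flags, then clear exactly the flags that were seen. */
static inline uint32_t pmu_ack_overflows(pmu_ovs_model *m)
{
    uint32_t seen = pmovsclr_read(m);
    pmovsclr_write(m, seen);
    return seen;
}
```

Writing back the value that was read, rather than a fixed mask of all flags, avoids clearing a flag that has not yet been examined.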

#### Table 3-86 Register PMU.PMOVSCLR layout

| Bits   | Link                         | Name            | Usage                                            | Default |
|--------|------------------------------|-----------------|--------------------------------------------------|---------|
| [31]   | CYCLE_CNT_OVF on page 3-69   | CYCLE_CNT_OVF   | PMCCNTR overflow clear bit                       | 0       |
| [30:4] | Reserved                     | -               | -                                                | -       |
| [3]    | EVENT_CNT_3_OVF on page 3-69 | EVENT_CNT_3_OVF | Event-counter overflow clear bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2_OVF on page 3-69 | EVENT_CNT_2_OVF | Event-counter overflow clear bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1_OVF on page 3-69 | EVENT_CNT_1_OVF | Event-counter overflow clear bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0_OVF on page 3-70 | EVENT_CNT_0_OVF | Event-counter overflow clear bit for PMU_EVCNTR0 | 0       |

#### Field CYCLE\_CNT\_OVF

PMCCNTR overflow clear bit.

CYCLE\_CNT\_OVF is stored in bit[31] and is a 1-bit flag. Its default value is 0.

#### Table 3-87 Field CYCLE\_CNT\_OVF values

| Value       | Meaning                                                                                              |
|-------------|------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means the cycle counter has not overflowed. When written, it has no effect.            |
| 1           | When read, it means the cycle counter has overflowed. When written, it clears the overflow bit to 0. |

#### Field EVENT\_CNT\_3\_OVF

Event-counter overflow clear bit for PMU\_EVCNTR3.

EVENT\_CNT\_3\_OVF is stored in bit[3] and is a 1-bit flag. Its default value is 0.

#### Table 3-88 Field EVENT\_CNT\_3\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR3 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR3 has overflowed. When written, it clears the PMU_EVCNTR3 overflow bit to 0. |

#### Field EVENT\_CNT\_2\_OVF

Event-counter overflow clear bit for PMU\_EVCNTR2.

EVENT\_CNT\_2\_OVF is stored in bit[2] and is a 1-bit flag. Its default value is 0.

#### Table 3-89 Field EVENT\_CNT\_2\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR2 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR2 has overflowed. When written, it clears the PMU_EVCNTR2 overflow bit to 0. |

#### Field EVENT\_CNT\_1\_OVF

Event-counter overflow clear bit for PMU\_EVCNTR1.

EVENT\_CNT\_1\_OVF is stored in bit[1] and is a 1-bit flag. Its default value is 0.

#### Table 3-90 Field EVENT\_CNT\_1\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR1 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR1 has overflowed. When written, it clears the PMU_EVCNTR1 overflow bit to 0. |

#### Field EVENT\_CNT\_0\_OVF

Event-counter overflow clear bit for PMU\_EVCNTR0.

EVENT\_CNT\_0\_OVF is stored in bit[0] and is a 1-bit flag. Its default value is 0.

#### Table 3-91 Field EVENT\_CNT\_0\_OVF values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that PMU_EVCNTR0 has not overflowed. When written, it has no effect. |
| 1           | When read, it means that PMU_EVCNTR0 has overflowed. When written, it clears the PMU_EVCNTR0 overflow bit to 0. |

### 3.5.8 Register PMINTSET

The interrupt-enable set register.

Enables the generation of interrupt requests on overflows from the dedicated cycle counter, PMCCNTR, and the event counters PMU\_EVCNTRn. Reading the register shows which overflow interrupt requests are enabled.

PMINTSET is used together with PMINTCLR (see 3.5.9 Register PMINTCLR). The two registers are implemented in hardware with the same underlying state.

Writing to this register enables overflow interrupt detection as follows: writing a 1 to bit[31] enables overflow interrupts from the cycle counter and writing a 1 to bit[0-3] enables overflow interrupts from event counter 0-3, respectively.

Reading from PMINTSET or PMINTCLR gives the same value, which is the overflow interrupt-enable status of the counters.
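
Since the interrupt-enable bits use the same bit positions as the overflow flags, a handler can combine the two register values to find which counters may have raised a request. The sketch below assumes that a request is generated for a counter only when both its overflow flag and its interrupt-enable bit are set; this section implies that behavior but does not state it explicitly. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Bit[31] is the cycle counter, bit[3:0] are event counters 0-3. */
#define PMU_INT_IMPLEMENTED ((UINT32_C(1) << 31) | UINT32_C(0xF))

/*
 * pmovs:   value read from PMOVSSET or PMOVSCLR (overflow status).
 * pminten: value read from PMINTSET or PMINTCLR (interrupt-enable status).
 * Returns a mask of the counters that have overflowed and have their
 * overflow interrupt request enabled (assumed interrupt condition).
 */
static inline uint32_t pmu_irq_sources(uint32_t pmovs, uint32_t pminten)
{
    return pmovs & pminten & PMU_INT_IMPLEMENTED;
}
```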

#### Table 3-92 Register PMU.PMINTSET layout

| Bits   | Link                         | Name            | Usage                                                               | Default |
|--------|------------------------------|-----------------|---------------------------------------------------------------------|---------|
| [31]   | CYCLE_CNT_INT on page 3-70   | CYCLE_CNT_INT   | PMCCNTR overflow interrupt-request enable bit                       | 0       |
| [30:4] | Reserved                     | -               | -                                                                   | -       |
| [3]    | EVENT_CNT_3_INT on page 3-71 | EVENT_CNT_3_INT | Event-counter overflow interrupt-request enable bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2_INT on page 3-71 | EVENT_CNT_2_INT | Event-counter overflow interrupt-request enable bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1_INT on page 3-71 | EVENT_CNT_1_INT | Event-counter overflow interrupt-request enable bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0_INT on page 3-71 | EVENT_CNT_0_INT | Event-counter overflow interrupt-request enable bit for PMU_EVCNTR0 | 0       |

#### Field CYCLE\_CNT\_INT

PMCCNTR overflow interrupt-request enable bit.

CYCLE\_CNT\_INT is stored in bit[31] and is a 1-bit flag. Its default value is 0.

#### Table 3-93 Field CYCLE\_CNT\_INT values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means the cycle-counter overflow interrupt request is disabled. When written, it has no effect. |
| 1           | When read, it means the cycle-counter overflow interrupt request is enabled. When written, it enables the cycle-counter overflow interrupt request. |

#### Field EVENT\_CNT\_3\_INT

Event-counter overflow interrupt-request enable bit for PMU\_EVCNTR3.

EVENT\_CNT\_3\_INT is stored in bit[3] and is a 1-bit flag. Its default value is 0.

#### Table 3-94 Field EVENT\_CNT\_3\_INT values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that the PMU_EVCNTR3 event-counter interrupt request is disabled. When written, it has no effect. |
| 1           | When read, it means that the PMU_EVCNTR3 event-counter interrupt request is enabled. When written, it enables the PMU_EVCNTR3 interrupt request. |

#### Field EVENT\_CNT\_2\_INT

Event-counter overflow interrupt-request enable bit for PMU\_EVCNTR2.

EVENT\_CNT\_2\_INT is stored in bit[2] and is a 1-bit flag. Its default value is 0.

#### Table 3-95 Field EVENT\_CNT\_2\_INT values

| Value       | Meaning |
|-------------|---------|
| 0 (default) | When read, it means that the PMU_EVCNTR2 event-counter interrupt request is disabled. When written, it has no effect. |
| 1           | When read, it means that the PMU_EVCNTR2 event-counter interrupt request is enabled. When written, it enables the PMU_EVCNTR2 interrupt request. |

## Field EVENT\_CNT\_1\_INT

Event-counter overflow interrupt-request enable bit for PMU\_EVCNTR1.

EVENT\_CNT\_1\_INT is stored in bit[1] and is a 1-bit flag. Its default value is 0.

# Table 3-96 Field EVENT\_CNT\_1\_INT values

| Value       | Meaning                                                                                                                                                                                                        |
|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is disabled. When written, it has no effect.                                                           |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is enabled. When written, it enables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

#### Field EVENT\_CNT\_0\_INT

Event-counter overflow interrupt-request enable bit for PMU\_EVCNTR0.

EVENT\_CNT\_0\_INT is stored in bit[0] and is a 1-bit flag. Its default value is 0.

### Table 3-97 Field EVENT\_CNT\_0\_INT values

| Value       | Meaning                                                                                                                                                                                                        |
|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is disabled. When written, it has no effect.                                                           |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is enabled. When written, it enables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

#### 3.5.9 Register PMINTCLR

The interrupt-enable clear register.

Disables the generation of interrupt requests on overflows from the dedicated cycle counter, PMCCNTR, and the event counters PMU\_EVCNTRn. Reading the register shows which overflow interrupt requests are enabled.

PMINTCLR is used together with the PMINTSET register, see 3.5.8 Register PMINTSET on page 3-70. Both registers are implemented in hardware with the same underlying state.

Writing to this register disables overflow interrupt requests as follows: writing a 1 to bit[31] disables overflow interrupts from the cycle counter, and writing a 1 to one of bits[3:0] disables overflow interrupts from the corresponding event counter, 0-3. Writing a 0 to a bit has no effect.

Reading from PMINTSET or PMINTCLR gives the same value, which is the overflow interrupt enable status of the counters.
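The set/clear behavior described above can be illustrated with a small software model. This is only a sketch of the documented semantics, not driver code: the class and method names are hypothetical, and the handling of writes to the reserved bits[30:4] (ignored here) is an assumption.

```python
CYCLE_CNT_INT = 1 << 31      # bit[31]: PMCCNTR overflow interrupt request
EVENT_CNT_INT_MASK = 0xF     # bits[3:0]: PMU_EVCNTR0-3 overflow interrupt requests
VALID_MASK = CYCLE_CNT_INT | EVENT_CNT_INT_MASK  # bits[30:4] are reserved


class PmuIntEnableModel:
    """Hypothetical model of the state shared by PMINTSET and PMINTCLR."""

    def __init__(self):
        # Default value of every enable bit is 0 (disabled).
        self.enabled = 0

    def write_pmintset(self, value):
        # A 1 enables the matching request, a 0 has no effect.
        # Assumption: writes to reserved bits are ignored.
        self.enabled |= value & VALID_MASK

    def write_pmintclr(self, value):
        # A 1 disables the matching request, a 0 has no effect.
        self.enabled &= ~value & VALID_MASK

    def read(self):
        # Reading PMINTSET or PMINTCLR returns the same enable status.
        return self.enabled


model = PmuIntEnableModel()
model.write_pmintset(CYCLE_CNT_INT | 0b0101)  # enable cycle counter, event counters 0 and 2
model.write_pmintclr(0b0001)                  # disable event counter 0 only
print(hex(model.read()))
```

Because a 0 written to either register leaves the state untouched, software can change one enable bit without a read-modify-write sequence.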

Table 3-98 Register PMU.PMINTCLR layout

| Bits   | Link                         | Name            | Usage                                                                | Default |
|--------|------------------------------|-----------------|----------------------------------------------------------------------|---------|
| [31]   | CYCLE_CNT_INT on page 3-72   | CYCLE_CNT_INT   | PMCCNTR overflow interrupt-request disable bit                       | 0       |
| [30:4] | Reserved                     | -               | -                                                                    | -       |
| [3]    | EVENT_CNT_3_INT on page 3-73 | EVENT_CNT_3_INT | Event-counter overflow interrupt-request disable bit for PMU_EVCNTR3 | 0       |
| [2]    | EVENT_CNT_2_INT on page 3-73 | EVENT_CNT_2_INT | Event-counter overflow interrupt-request disable bit for PMU_EVCNTR2 | 0       |
| [1]    | EVENT_CNT_1_INT on page 3-73 | EVENT_CNT_1_INT | Event-counter overflow interrupt-request disable bit for PMU_EVCNTR1 | 0       |
| [0]    | EVENT_CNT_0_INT on page 3-73 | EVENT_CNT_0_INT | Event-counter overflow interrupt-request disable bit for PMU_EVCNTR0 | 0       |

## Field CYCLE\_CNT\_INT

PMCCNTR overflow interrupt-request disable bit.

CYCLE\_CNT\_INT is stored in bit[31] and is a 1-bit flag. Its default value is 0.

#### Table 3-99 Field CYCLE\_CNT\_INT values

| Value       | Meaning                                                                                                                                            |
|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means the cycle-counter overflow interrupt-request is disabled. When written, it has no effect.                                      |
| 1           | When read, it means the cycle-counter overflow interrupt-request is enabled. When written, it disables the cycle count overflow interrupt request. |

# Field EVENT\_CNT\_3\_INT

Event-counter overflow interrupt-request disable bit for PMU\_EVCNTR3.

EVENT\_CNT\_3\_INT is stored in bit[3] and is a 1-bit flag. Its default value is 0.

# Table 3-100 Field EVENT\_CNT\_3\_INT values

| Value       | Meaning                                                                                                                                                                                                         |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter interrupt request is disabled. When written, it has no effect.                                                            |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event counter interrupt request is enabled. When written, it disables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

# Field EVENT\_CNT\_2\_INT

Event-counter overflow interrupt-request disable bit for PMU\_EVCNTR2.

EVENT\_CNT\_2\_INT is stored in bit[2] and is a 1-bit flag. Its default value is 0.

## Table 3-101 Field EVENT\_CNT\_2\_INT values

| Value       | Meaning                                                                                                                                                                                                         |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is disabled. When written, it has no effect.                                                            |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is enabled. When written, it disables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

## Field EVENT\_CNT\_1\_INT

Event-counter overflow interrupt-request disable bit for PMU\_EVCNTR1.

EVENT\_CNT\_1\_INT is stored in bit[1] and is a 1-bit flag. Its default value is 0.

# Table 3-102 Field EVENT\_CNT\_1\_INT values

| Value       | Meaning                                                                                                                                                                                                         |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is disabled. When written, it has no effect.                                                            |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is enabled. When written, it disables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

## Field EVENT\_CNT\_0\_INT

Event-counter overflow interrupt-request disable bit for PMU\_EVCNTR0.

EVENT\_CNT\_0\_INT is stored in bit[0] and is a 1-bit flag. Its default value is 0.

## Table 3-103 Field EVENT\_CNT\_0\_INT values

| Value       | Meaning                                                                                                                                                                                                         |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 (default) | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is disabled. When written, it has no effect.                                                            |
| 1           | When read, it means that the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 event-counter interrupt request is enabled. When written, it disables the 3.5.4 PMU_EVCNTR0 PMU_EVCNTR3 on page 3-64 interrupt request. |

# 3.5.10 Register PMCCNTR\_LO

Performance-monitor cycle count low register.

This represents the lower 32 bits of the dedicated 48-bit cycle counter, PMCCNTR.

Table 3-104 Register PMU.PMCCNTR\_LO layout

| Bits   | Link                      | Name         | Usage           | Default    |
|--------|---------------------------|--------------|-----------------|------------|
| [31:0] | CYCLE_CNT_LO on page 3-74 | CYCLE_CNT_LO | Cycle count low | 0x00000000 |

# Field CYCLE\_CNT\_LO

Cycle count low.

CYCLE\_CNT\_LO is stored in bits[31:0] and is a 32-bit unsigned integer. Its default value is 0x00000000.

## 3.5.11 Register PMCCNTR\_HI

Performance-monitor cycle count high register.

This represents the upper 16 bits of the dedicated 48-bit cycle counter, PMCCNTR.

Table 3-105 Register PMU.PMCCNTR\_HI layout

| Bits    | Link                      | Name         | Usage            | Default |
|---------|---------------------------|--------------|------------------|---------|
| [31:16] | Reserved                  | -            | -                | -       |
| [15:0]  | CYCLE_CNT_HI on page 3-74 | CYCLE_CNT_HI | Cycle count high | 0x0000  |

## Field CYCLE\_CNT\_HI

Cycle count high.

CYCLE\_CNT\_HI is stored in bits[15:0] and is a 16-bit unsigned integer. Its default value is 0x0000.
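As an illustration of how the two halves relate, the following sketch combines raw PMCCNTR\_LO and PMCCNTR\_HI register values into the 48-bit PMCCNTR count. The function name is hypothetical, and the sketch assumes both values were sampled consistently. This section does not state whether the hardware latches one half when the other is read.

```python
def combine_pmccntr(pmccntr_lo, pmccntr_hi):
    """Return the 48-bit cycle count from two raw 32-bit register values."""
    cycle_cnt_lo = pmccntr_lo & 0xFFFFFFFF  # CYCLE_CNT_LO, bits[31:0]
    cycle_cnt_hi = pmccntr_hi & 0xFFFF      # CYCLE_CNT_HI, bits[15:0]; bits[31:16] are reserved
    return (cycle_cnt_hi << 32) | cycle_cnt_lo


print(hex(combine_pmccntr(0x00000001, 0x0002)))
```

If the counter is running while it is read, software might need to guard against the low half wrapping between the two reads, for example by re-reading the high half and retrying when it has changed.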

# 3.5.12 Register PMCAXI\_CHAN

Set which AXI channel to monitor.

Monitors for AXI bandwidth (bw) events (0x80-0x89, 0x180-0x189) and AXI latency events (0xA0-0xA6).

Table 3-106 Register PMU.PMCAXI\_CHAN layout

| Bits    | Link                      | Name         | Usage                                                                                                                                                                                                                                     | Default  |
|---------|---------------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|
| [31:11] | Reserved                  | -            | -                                                                                                                                                                                                                                         | -        |
| [10]    | BW_CH_SEL_EN on page 3-75 | BW_CH_SEL_EN | Enable bandwidth channel selector: 0=AXI bw events measured for all channels, 1=AXI bw events measured for channel specified by CH_SEL                                                                                                    | 0x000000 |
| [9:8]   | AXI_CNT_SEL on page 3-75  | AXI_CNT_SEL  | Select AXI counter to monitor for latency measurements (0=AXI0 counter0, 1=AXI0 counter1, 2=AXI1 counter2, 3=AXI1 counter3)                                                                                                               | 0x000000 |
| [7:4]   | Reserved                  | -            | -                                                                                                                                                                                                                                         | -        |
| [3:0]   | CH_SEL on page 3-75       | CH_SEL       | Specify the type of traffic for bandwidth or latency measurements (Read: 0=command traffic, 1=IFM traffic, 2=Weight traffic, 3=Scale+Bias, 4=Mem2Mem traffic - read direction; Write: 8=OFM traffic, 9=Mem2Mem traffic - write direction) | 0x0      |

# Field BW\_CH\_SEL\_EN

Enable bandwidth channel selector: 0=AXI bw events measured for all channels, 1=AXI bw events measured for channel specified by CH\_SEL.

BW\_CH\_SEL\_EN is stored in bit[10] and is a 1-bit unsigned integer. Its default value is 0x000000.

#### Field AXI\_CNT\_SEL

Select AXI counter to monitor for latency measurements (0=AXI0 counter0, 1=AXI0 counter1, 2=AXI1 counter2, 3=AXI1 counter3).

AXI\_CNT\_SEL is stored in bits[9:8] and is a 2-bit unsigned integer. Its default value is 0x000000.

A maximum of two separate outstanding transaction queues can be connected to each AXI interface. The counters are used to express the maximum number of outstanding jobs per queue.

#### Field CH\_SEL

Specify the type of traffic for bandwidth or latency measurements (Read: 0=command traffic, 1=IFM traffic, 2=Weight traffic, 3=Scale+Bias, 4=Mem2Mem traffic - Read direction; Write: 8=OFM traffic, 9=Mem2Mem traffic - Write direction).

CH\_SEL is stored in bits[3:0] and is a 4-bit unsigned integer. Its default value is 0x0.
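The field layout above can be summarized as a short encoding sketch. The function and dictionary names are hypothetical. The field positions and the CH\_SEL encodings are taken from the PMU.PMCAXI\_CHAN layout, and rejecting the unlisted CH\_SEL values is an assumption of this sketch.

```python
# CH_SEL encodings listed for PMCAXI_CHAN (read channels 0-4, write channels 8-9).
CH_SEL_TRAFFIC = {
    "command": 0,
    "ifm": 1,
    "weight": 2,
    "scale_bias": 3,
    "mem2mem_read": 4,
    "ofm": 8,
    "mem2mem_write": 9,
}


def encode_pmcaxi_chan(ch_sel, axi_cnt_sel=0, bw_ch_sel_en=False):
    """Pack the three PMCAXI_CHAN fields into a 32-bit register value."""
    if ch_sel not in CH_SEL_TRAFFIC.values():
        # Assumption: encodings not listed in the table are treated as invalid.
        raise ValueError("CH_SEL value is not a listed traffic type")
    if not 0 <= axi_cnt_sel <= 3:
        raise ValueError("AXI_CNT_SEL is a 2-bit field")
    return (int(bw_ch_sel_en) << 10) | (axi_cnt_sel << 8) | ch_sel


# Measure bandwidth for OFM write traffic only, latency on AXI1 counter2.
value = encode_pmcaxi_chan(CH_SEL_TRAFFIC["ofm"], axi_cnt_sel=2, bw_ch_sel_en=True)
print(hex(value))
```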

#### 3.6 Command stream

The application processor uses a command stream to issue tasks to the NPU. The command stream is made from 32-bit words, and each command is identified by a 16-bit command code.

There are two command formats, cmd0 and cmd1. cmd0 is a 32-bit command with no data item. cmd1 is a 32-bit command followed by a single 32-bit data item. In the command stream, these commands must be aligned to start on a 32-bit boundary.

In each 32-bit command, bits[15:0] identify the command and bits[31:16] hold the command parameter.

The NPU processes commands in the order they are received.

The following table lists the command formats and their differences.

Table 3-107 Command stream formats

| Bit 15 | Bit 14 | Bit 13 | Bit 12 | Bit 11 | Bit 10 | Bits 9-0 | Data                                       |
|--------|--------|--------|--------|--------|--------|----------|--------------------------------------------|
| 0      | 0      | 0      | 0      | 0      | 0      | cmd0     | No data item                               |
| 0      | 1      | 0      | 0      | 0      | 0      | cmd1     | 32-bit data item payload after the command |

----- Note ------

All unused combinations of bits[15:0] are reserved.
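The two formats can be illustrated with a minimal encoder and decoder that works on a list of 32-bit words. This is a sketch under stated assumptions, not the driver implementation: the function names are hypothetical, the code argument is the full 16-bit value from bits[15:0], and any format other than the two rows of Table 3-107 is rejected as reserved.

```python
FORMAT_MASK = 0xFC00   # bits[15:10] select the command format
FORMAT_CMD0 = 0x0000   # no data item
FORMAT_CMD1 = 0x4000   # bit 14 set: one 32-bit data item follows


def encode_command(code, param, payload=None):
    """Return the 32-bit word(s) for one command."""
    word = ((param & 0xFFFF) << 16) | (code & 0xFFFF)
    fmt = code & FORMAT_MASK
    if fmt == FORMAT_CMD0 and payload is None:
        return [word]
    if fmt == FORMAT_CMD1 and payload is not None:
        return [word, payload & 0xFFFFFFFF]
    raise ValueError("code and payload do not match the cmd0 or cmd1 format")


def decode_stream(words):
    """Split a list of 32-bit words into (code, param, payload) tuples."""
    commands = []
    i = 0
    while i < len(words):
        code = words[i] & 0xFFFF   # bits[15:0]
        param = words[i] >> 16     # bits[31:16]
        fmt = code & FORMAT_MASK
        if fmt == FORMAT_CMD0:
            commands.append((code, param, None))
            i += 1
        elif fmt == FORMAT_CMD1:
            if i + 1 >= len(words):
                raise ValueError("cmd1 command is missing its data item")
            commands.append((code, param, words[i + 1]))
            i += 2
        else:
            raise ValueError("reserved command format")
    return commands


stream = (
    encode_command(0x0117, 15)             # cmd0.NPU_SET_OFM_BLK_DEPTH_M1
    + encode_command(0x4031, 0, 0x400)     # cmd1.NPU_SET_DMA0_DST
    + encode_command(0x0000, 65535)        # cmd0.NPU_OP_STOP
)
print([hex(w) for w in stream])
```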

The following is an example command stream for a Conv2D network with input tensor 8x8x16, weight tensor 16x2x2x16, stride 2x2, and output tensor 4x4x16. The following example applies to a configuration with 256 MAC units.

```
Code:   Command:                            Param:  Payload:
0x0130  cmd0.NPU_SET_DMA0_SRC_REGION
0x4030  cmd1.NPU_SET_DMA0_SRC               0       0x00000000 (0)
0x0131  cmd0.NPU_SET_DMA0_DST_REGION        0
0x4031  cmd1.NPU_SET_DMA0_DST               0       0x00000400 (1024)
0x4032  cmd1.NPU_SET_DMA0_LEN               0       0x000002e0 (736)
0x0010  cmd0.NPU_OP_DMA_START
0x0116  cmd0.NPU_SET_OFM_BLK_HEIGHT_M1      3
0x0115  cmd0.NPU_SET_OFM_BLK_WIDTH_M1       3
0x0117  cmd0.NPU_SET_OFM_BLK_DEPTH_M1       15
0x010d  cmd0.NPU_SET_IFM_IB_END             10
0x012d  cmd0.NPU_SET_AB_START               30
0x0124  cmd0.NPU_SET_ACC_FORMAT             0
0x0107  cmd0.NPU_SET_IFM_UPSCALE            0
0x0100  cmd0.NPU_SET_IFM_PAD_TOP            0
0x0101  cmd0.NPU_SET_IFM_PAD_LEFT           0
0x0103  cmd0.NPU_SET_IFM_PAD_BOTTOM         0
0x0102  cmd0.NPU_SET_IFM_PAD_RIGHT          0
0x0121  cmd0.NPU_SET_KERNEL_HEIGHT_M1       1
0x0120  cmd0.NPU_SET_KERNEL_WIDTH_M1        1
0x0122  cmd0.NPU_SET_KERNEL_STRIDE
0x4020  cmd1.NPU_SET_WEIGHT_BASE            0       0x00000400 (1024)
0x4021  cmd1.NPU_SET_WEIGHT_LENGTH          0       0x000002e0 (736)
0x0128  cmd0.NPU_SET_WEIGHT_REGION
0x4022  cmd1.NPU_SET_SCALE_BASE             0       0x000002e0 (736)
0x4023  cmd1.NPU_SET_SCALE_LENGTH           0       0x000000a0 (160)
0x0129  cmd0.NPU_SET_SCALE_REGION
0x0125  cmd0.NPU_SET_ACTIVATION
0x0126  cmd0.NPU_SET_ACTIVATION_MIN
0x0127  cmd0.NPU_SET_ACTIVATION_MAX         255
0x0112  cmd0.NPU_SET_OFM_HEIGHT_M1          3
0x0111  cmd0.NPU_SET_OFM_WIDTH_M1           3
0x0113  cmd0.NPU_SET_OFM_DEPTH_M1           15
0x0104  cmd0.NPU_SET_IFM_DEPTH_M1           15
0x0109  cmd0.NPU_SET_IFM_ZERO_POINT         128
0x010b  cmd0.NPU_SET_IFM_HEIGHT0_M1         7
0x010c  cmd0.NPU_SET_IFM_HEIGHT1_M1         7
0x010a  cmd0.NPU_SET_IFM_WIDTH0_M1          7
0x010f  cmd0.NPU_SET_IFM_REGION             1
0x4000  cmd1.NPU_SET_IFM_BASE0              0       0x00000000 (0)
0x4001  cmd1.NPU_SET_IFM_BASE1              0       0x00000000 (0)
0x4002  cmd1.NPU_SET_IFM_BASE2              0       0x00000000 (0)
0x4003  cmd1.NPU_SET_IFM_BASE3              0       0x00000000 (0)
0x4006  cmd1.NPU_SET_IFM_STRIDE_C           0       0x00000001 (1)
0x4004  cmd1.NPU_SET_IFM_STRIDE_X           0       0x00000010 (16)
0x4005  cmd1.NPU_SET_IFM_STRIDE_Y           0       0x00000080 (128)
0x0118  cmd0.NPU_SET_OFM_ZERO_POINT
0x011b  cmd0.NPU_SET_OFM_HEIGHT0_M1
0x011c  cmd0.NPU_SET_OFM_HEIGHT1_M1
0x011a  cmd0.NPU_SET_OFM_WIDTH0_M1
0x011f  cmd0.NPU_SET_OFM_REGION             1
0x4010  cmd1.NPU_SET_OFM_BASE0              0       0x000006e0 (1760)
0x4011  cmd1.NPU_SET_OFM_BASE1              0       0x00000000 (0)
0x4012  cmd1.NPU_SET_OFM_BASE2              0       0x00000000 (0)
0x4013  cmd1.NPU_SET_OFM_BASE3              0       0x00000000 (0)
0x4016  cmd1.NPU_SET_OFM_STRIDE_C           0       0x00000001 (1)
0x4014  cmd1.NPU_SET_OFM_STRIDE_X           0       0x00000010 (16)
0x4015  cmd1.NPU_SET_OFM_STRIDE_Y           0       0x00000040 (64)
0x0114  cmd0.NPU_SET_OFM_PRECISION          0
0x0105  cmd0.NPU_SET_IFM_PRECISION          0
0x0011  cmd0.NPU_OP_DMA_WAIT
0x012f  cmd0.NPU_SET_BLOCKDEP               3
0x0002  cmd0.NPU_OP_CONV
0x0000  cmd0.NPU_OP_STOP                    65535
```

The following is an example command stream for a MaxPool2D with 2x2 kernel and 8x8x16 tensor. The following example applies to a configuration with 256 MAC units.

```
Code:   Command:                            Param:  Payload:
0x0116  cmd0.NPU_SET_OFM_BLK_HEIGHT_M1
0x0115  cmd0.NPU_SET_OFM_BLK_WIDTH_M1
0x0117  cmd0.NPU_SET_OFM_BLK_DEPTH_M1       15
0x010d  cmd0.NPU_SET_IFM_IB_END             10
0x012d  cmd0.NPU_SET_AB_START               30
0x0124  cmd0.NPU_SET_ACC_FORMAT
0x0107  cmd0.NPU_SET_IFM_UPSCALE            0
0x0100  cmd0.NPU_SET_IFM_PAD_TOP            0
0x0101  cmd0.NPU_SET_IFM_PAD_LEFT           0
0x0103  cmd0.NPU_SET_IFM_PAD_BOTTOM         1
0x0102  cmd0.NPU_SET_IFM_PAD_RIGHT          1
0x0121  cmd0.NPU_SET_KERNEL_HEIGHT_M1       1
0x0120  cmd0.NPU_SET_KERNEL_WIDTH_M1        1
0x0122  cmd0.NPU_SET_KERNEL_STRIDE
0x0125  cmd0.NPU_SET_ACTIVATION
0x0126  cmd0.NPU_SET_ACTIVATION_MIN
0x0127  cmd0.NPU_SET_ACTIVATION_MAX
0x0112  cmd0.NPU_SET_OFM_HEIGHT_M1          7
0x0111  cmd0.NPU_SET_OFM_WIDTH_M1           7
0x0113  cmd0.NPU_SET_OFM_DEPTH_M1           15
0x0104  cmd0.NPU_SET_IFM_DEPTH_M1           15
0x0109  cmd0.NPU_SET_IFM_ZERO_POINT         128
0x010b  cmd0.NPU_SET_IFM_HEIGHT0_M1
0x010c  cmd0.NPU_SET_IFM_HEIGHT1_M1
0x010a  cmd0.NPU_SET_IFM_WIDTH0_M1
0x010f  cmd0.NPU_SET_IFM_REGION
0x4000  cmd1.NPU_SET_IFM_BASE0              0       0x00000000 (0)
0x4001  cmd1.NPU_SET_IFM_BASE1              0       0x00000000 (0)
0x4002  cmd1.NPU_SET_IFM_BASE2              0       0x00000000 (0)
0x4003  cmd1.NPU_SET_IFM_BASE3              0       0x00000000 (0)
0x4006  cmd1.NPU_SET_IFM_STRIDE_C           0       0x00000001 (1)
0x4004  cmd1.NPU_SET_IFM_STRIDE_X           0       0x00000010 (16)
0x4005  cmd1.NPU_SET_IFM_STRIDE_Y           0       0x00000080 (128)
0x0118  cmd0.NPU_SET_OFM_ZERO_POINT         128
0x011b  cmd0.NPU_SET_OFM_HEIGHT0_M1
0x011c  cmd0.NPU_SET_OFM_HEIGHT1_M1
0x011a  cmd0.NPU_SET_OFM_WIDTH0_M1
0x011f  cmd0.NPU_SET_OFM_REGION
0x4010  cmd1.NPU_SET_OFM_BASE0              0       0x00000400 (1024)
0x4011  cmd1.NPU_SET_OFM_BASE1              0       0x00000000 (0)
0x4012  cmd1.NPU_SET_OFM_BASE2              0       0x00000000 (0)
0x4013  cmd1.NPU_SET_OFM_BASE3              0       0x00000000 (0)
0x4016  cmd1.NPU_SET_OFM_STRIDE_C           0       0x00000001 (1)
0x4014  cmd1.NPU_SET_OFM_STRIDE_X           0       0x00000010 (16)
0x4015  cmd1.NPU_SET_OFM_STRIDE_Y           0       0x00000080 (128)
0x0114  cmd0.NPU_SET_OFM_PRECISION          0
0x0105  cmd0.NPU_SET_IFM_PRECISION          0
0x012f  cmd0.NPU_SET_BLOCKDEP               3
0x0005  cmd0.NPU_OP_POOL
0x0000  cmd0.NPU_OP_STOP                    65535
```

This section contains the following subsections:

- 3.6.1 Non-blocking command types on page 3-78.
- 3.6.2 Blocking command types on page 3-78.
- 3.6.3 Command dependency requirements on page 3-78.
- 3.6.4 cmd0 commands on page 3-78.
- 3.6.5 cmd1 commands on page 3-84.

# 3.6.1 Non-blocking command types

Commands can be non-blocking, which means that later commands can start before they are completed.

The following table lists the non-blocking command types and the criteria that must be met for the command to complete.

Table 3-108 Non-blocking command types

| Command | Completion criteria |
|---------|---------------------|
| NPU_OP_IRQ | An IRQ is raised. |
| NPU_OP_\<kernel\><br>\<kernel\> can be: <ul> <li>CONV for convolution operations</li> <li>DEPTHWISECONV for depth-wise convolution operations</li> <li>POOL for pooling operations</li> <li>ELEMENTWISE for elementwise operations</li> </ul> | The resulting tensor is calculated and written to memory. |

# 3.6.2 Blocking command types

Commands can be blocking, which means that later commands cannot start before these commands are completed.

The following table lists the blocking command types and the criteria that must be met for the command to complete.

Table 3-109 Blocking command types

| Command | Completion criteria |
|---------|---------------------|
| NPU_SET_\<state\> | The value is written to the appropriate internal state. This value is applied to all following operations, until a new command overwrites it. New values must not affect operations that are already in progress. |
| NPU_OP_STOP | The NPU enters a stopped state. |
| NPU_OP_DMA_START | The Direct Memory Access (DMA) instruction is accepted into the internal DMA queue. The DMA instruction does not need to complete. |
| NPU_OP_\<condition\>_WAIT | The wait condition is satisfied. |

#### 3.6.3 Command dependency requirements

When an operation is started, the NPU must know all the input data for it to be valid. If the NPU does not know all the input data, then the behavior is UNPREDICTABLE.

The NPU\_SET\_BLOCKDEP command sets the block dependency between NPU kernel operations.

The NPU\_OP\_DMA\_WAIT command causes the NPU to wait for certain results from previously started DMA operations to be completed and written to memory. During this wait, the NPU does not add later commands to the Command queue.
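The wait behavior described above can be illustrated with a toy software model. The following Python sketch is not part of the NPU programming interface: the class `DmaChannelModel` and its method names are hypothetical, the two-descriptor limit is taken from the `NPU_OP_DMA_WAIT` entry in the cmd0 operations table, and in-order completion of descriptors is an assumption made only to keep the model small.

```python
from collections import deque


class DmaChannelModel:
    """Toy model of one DMA channel's descriptor queue (hypothetical, not hardware)."""

    def __init__(self, max_descriptors=2):
        # Two descriptors per channel, per the NPU_OP_DMA_WAIT description.
        self.max_descriptors = max_descriptors
        self.outstanding = deque()

    def can_start(self):
        # NPU_OP_DMA_START blocks until the channel can accept a new descriptor.
        return len(self.outstanding) < self.max_descriptors

    def start(self, name):
        if not self.can_start():
            raise RuntimeError("channel full: NPU_OP_DMA_START would block")
        self.outstanding.append(name)

    def complete_oldest(self):
        # Assumption: descriptors complete in the order they were queued.
        return self.outstanding.popleft()

    def wait_satisfied(self, k):
        # NPU_OP_DMA_WAIT with parameter k: proceed once k or fewer are outstanding.
        return len(self.outstanding) <= k


channel = DmaChannelModel()
channel.start("weights")
channel.start("biases")
print(channel.wait_satisfied(0))  # False: two descriptors are outstanding
channel.complete_oldest()
print(channel.wait_satisfied(1))  # True: one descriptor is outstanding
```

In this model, a kernel operation that reads DMA output would be issued only after `wait_satisfied` returns true for the chosen `k`.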

#### 3.6.4 cmd0 commands

cmd0 commands have bits[15:10] = 0. cmd0 bits[9:0] indicate the command. cmd0 commands do not take additional data.

Use these commands to:

- Perform an action, for example, raising an IRQ or starting an operation.
- Set a state based on the 16-bit parameter value.
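As an illustration of the encoding, the following Python sketch packs a cmd0 command into a single 32-bit word. The placement of the command in bits[9:0] with bits[15:10] = 0 follows the text above. Placing the 16-bit parameter in bits[31:16] is an assumption, because this section does not state where the parameter sits in the word. The helper names `to_u16` and `encode_cmd0` are hypothetical.

```python
def to_u16(value):
    """Map a signed or unsigned 16-bit value onto its raw 16-bit encoding."""
    if not -0x8000 <= value <= 0xFFFF:
        raise ValueError("parameter does not fit in 16 bits")
    return value & 0xFFFF


def encode_cmd0(opcode, param=0):
    """Pack a cmd0 command word.

    bits[9:0]   = command (from the text above)
    bits[15:10] = 0 (from the text above)
    bits[31:16] = parameter (ASSUMPTION: not stated in this section)
    """
    if not 0 <= opcode <= 0x3FF:
        raise ValueError("cmd0 command must fit in bits[9:0]")
    return (to_u16(param) << 16) | opcode


# NPU_OP_POOL (0x005) with mode=1 (average pool).
print(hex(encode_cmd0(0x005, 1)))      # 0x10005
# NPU_SET_IFM_ZERO_POINT (0x109) with a signed zero point of -128.
print(hex(encode_cmd0(0x109, -128)))   # 0xff800109
```

Accepting both signed and unsigned values in `to_u16` mirrors the int16 or uint16 parameters in the cmd0 operations table. The caller remains responsible for choosing the interpretation that matches the activation type.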

The following table lists the cmd0 commands and their actions.

Table 3-110 cmd0 operations

| cmd0  | Enumerator         | Parameter      | Function                                                                                                                                                                                                                                                  |
|-------|--------------------|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0x000 | NPU_OP_STOP        | mask           | <ul> <li>(1) Set BASE_STATUS  = (mask&lt;&lt;16).</li> <li>(2) Move to stopped state.</li> <li>(3) Raise IRQ to host (regardless of mask value).</li> </ul>                                                                                               |
|       |                    |                | At the point the IRQ is raised, the NPU is stopped and all operations up to and including the STOP operation are complete.                                                                                                                                |
|       |                    |                | Operations after the STOP may have been buffered in the Command queue, but are not started (so no input or weight data is read).                                                                                                                          |
| 0x001 | NPU_OP_IRQ         | mask           | <ul> <li>(1) Set BASE_STATUS  = (mask&lt;&lt;16).</li> <li>(2) Remain in run state.</li> <li>(3) Raise IRQ to host (regardless of mask value).</li> </ul>                                                                                                 |
|       |                    |                | At the point the IRQ is raised, all operations are complete up to and including the IRQ operation. Operations after the IRQ may have been started (or even completed).                                                                                    |
| 0x002 | NPU_OP_CONV        | 0              | Start stripe with all-layer convolution or deconvolution.                                                                                                                                                                                                 |
| 0x003 | NPU_OP_DEPTHWISE   | 0              | Start stripe with depth-wise convolution or deconvolution operation.                                                                                                                                                                                      |
| 0x004 | -                  | -              | -                                                                                                                                                                                                                                                         |
| 0x005 | NPU_OP_POOL        | mode           | Start stripe with pooling operation. mode: 0=MaxPool, 1=Average pool.                                                                                                                                                                                     |
| 0x006 | NPU_OP_ELEMENTWISE | mode           | Start stripe with elementwise operation between two IFMs. mode: 0=Mul, 1=Add, 2=Sub, 3=Min, 4=Max, 5=LReLU, and 6=ABS.                                                                                                                                    |
| 0x007 | -                  | -              | -                                                                                                                                                                                                                                                         |
| 0x010 | NPU_OP_DMA_START   | 16*channel     | Queue new DMA for the given channel.                                                                                                                                                                                                                      |
|       |                    |                | The NPU contains one user channel. Therefore, channel=0.                                                                                                                                                                                                  |
|       |                    |                | This command blocks until the DMA channel can accept a new descriptor.                                                                                                                                                                                    |
|       |                    |                | This command is viewed as complete when the DMA has been queued and does not need to wait for the DMA to complete. (This is different to other NPU_OP commands that must have their final results written to memory before they are considered complete.) |
| 0x011 | NPU_OP_DMA_WAIT    | 16*channel + k | Wait for the DMA channel to have k or fewer active descriptors outstanding.                                                                                                                                                                               |
|       |                    |                | The NPU contains one user channel. Therefore, channel=0.                                                                                                                                                                                                  |
|       |                    |                | The NPU contains two descriptors per channel; therefore, k=0 or 1. Descriptors are not outstanding if they have completed, which means that the data is written to memory and can be read by the next command.                                            |

| cmd0  | Enumerator             | Parameter          | Function                                                                                                                                                                 |
|-------|------------------------|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0x012 | NPU_OP_KERNEL_WAIT     | n=0-3              | Wait for <i>n</i> or fewer kernel operations to be remaining (that is, not complete) before starting the next command.                                                   |
|       |                        |                    | A kernel operation is Conv, Depthwise, Pool, Elementwise.                                                                                                                |
|       |                        |                    | This command is typically placed before an NPU_OP_DMA_START command to prevent the DMA from starting until a previous kernel operation reading the memory has completed. |
| 0x100 | NPU_SET_IFM_PAD_TOP    | 0-127              | IFM top pad. Padding is applied after upscale, if ifm_upscale_mode!=none.                                                                                                |
| 0x101 | NPU_SET_IFM_PAD_LEFT   | 0-127              | IFM left pad. Padding is applied after upscale, if ifm_upscale_mode!=none.                                                                                               |
| 0x102 | NPU_SET_IFM_PAD_RIGHT  | 0-128              | IFM right pad. Padding is applied after upscale, if ifm_upscale_mode!=none.                                                                                              |
| 0x103 | NPU_SET_IFM_PAD_BOTTOM | 0-128              | IFM bottom pad. Padding is applied after upscale if ifm_upscale_mode!=none.                                                                                              |
| 0x104 | NPU_SET_IFM_DEPTH_M1   | 0-65535            | Number of input channels for convolution -1.                                                                                                                             |
| 0x105 | NPU_SET_IFM_PRECISION  | bitfield           | b0 = activation type 0=unsigned, 1=signed                                                                                                                                |
|       |                        |                    | b1 = reserved for weight size                                                                                                                                            |
|       |                        |                    | b[3:2] = activation precision 0=8 bit, 1=16 bit, 2=32 bit (only available for certain operations)                                                                        |
|       |                        |                    | b[7:6] = IFM format select 0=NHWC or 1=NHCWB16                                                                                                                           |
|       |                        |                    | b[9:8] = IFM scale mode for elementwise ADD and SUB: 0=16-bit OPA/OPB scale, 1=32-bit OPA scale applied to OPA, 2=32-bit OPA scale applied to OPB |
|       |                        |                    | b[15:14] = IFM round mode: 0=double rounding, 2=round to nearest with 0.5 round to +infinity                                                                             |
| 0x106 | -                      | -                  | -                                                                                                                                                                        |
| 0x107 | NPU_SET_IFM_UPSCALE    | 0, 1, 2            | b[1:0] = ifm_upscale_mode (0=none, 1=2x2 insert nearest, 2=2x2 insert zeros)                                                                                             |
| 0x108 | -                      | -                  | -                                                                                                                                                                        |
| 0x109 | NPU_SET_IFM_ZERO_POINT | int16 or<br>uint16 | IFM zero-point offset. Encoded as int16 if the activation is signed, or uint16 if the activation is unsigned.                                                            |
|       |                        |                    | Must be zero for 32-bit IFM and for CLZ operation.                                                                                                                       |
|       |                        |                    | Must be a valid activation value.                                                                                                                                        |
| 0x10A | NPU_SET_IFM_WIDTH0_M1  | 0-65535            | IFM Tile 0 and tile 2 (width-1)                                                                                                                                          |
| 0x10B | NPU_SET_IFM_HEIGHT0_M1 | 0-65535            | IFM Tile 0 (height-1)                                                                                                                                                    |
| 0x10C | NPU_SET_IFM_HEIGHT1_M1 | 0-65535            | IFM Tile 1 (height-1)                                                                                                                                                    |
| 0x10D | NPU_SET_IFM_IB_END     | 0-48               | End of IB0,IB1 buffers in the SHRAM in KB units. Multiples of 2.                                                                                                         |
| 0x10E | -                      | -                  | -                                                                                                                                                                        |
| 0x10F | NPU_SET_IFM_REGION     | 0-7                | Index <i>n</i> for IFM access: Region[ <i>n</i> ] is added to all IFM addresses.                                                                                         |
| 0x110 | -                      | -                  | -                                                                                                                                                                        |
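The bitfield parameter of NPU_SET_IFM_PRECISION (0x105) can be assembled as in the following Python sketch. The field positions and value encodings are taken from the rows above. The function name `ifm_precision_param` and its argument names are hypothetical, and b1 and all undocumented bits are simply left at 0.

```python
def ifm_precision_param(signed, bits, nhcwb16=False, scale_mode=0, round_mode=0):
    """Build the 16-bit NPU_SET_IFM_PRECISION parameter from the listed fields."""
    precision_codes = {8: 0, 16: 1, 32: 2}  # b[3:2] encodings from the table
    if bits not in precision_codes:
        raise ValueError("activation precision must be 8, 16, or 32 bits")
    if scale_mode not in (0, 1, 2):
        raise ValueError("scale mode must be 0, 1, or 2")
    if round_mode not in (0, 2):
        raise ValueError("only round modes 0 and 2 are listed in the table")

    value = int(bool(signed))            # b0: 0=unsigned, 1=signed
    value |= precision_codes[bits] << 2  # b[3:2]: activation precision
    value |= int(bool(nhcwb16)) << 6     # b[7:6]: 0=NHWC, 1=NHCWB16
    value |= scale_mode << 8             # b[9:8]: IFM scale mode
    value |= round_mode << 14            # b[15:14]: IFM round mode
    return value


print(hex(ifm_precision_param(signed=True, bits=8)))                  # 0x1
print(hex(ifm_precision_param(signed=False, bits=16, nhcwb16=True)))  # 0x44
```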

| cmd0  | Enumerator                | Parameter          | Function                                                                                               |  |
|-------|---------------------------|--------------------|--------------------------------------------------------------------------------------------------------|--|
| 0x111 | NPU_SET_OFM_WIDTH_M1      | 0-65535            | OFM width-1 (for the stripe to process)                                                                |  |
| 0x112 | NPU_SET_OFM_HEIGHT_M1     | 0-65535            | OFM height-1 (for the stripe to process)                                                               |  |
| 0x113 | NPU_SET_OFM_DEPTH_M1      | 0-65535            | OFM depth-1 for convolution                                                                            |  |
| 0x114 | NPU_SET_OFM_PRECISION     | bitfield           | b0 = activation type 0=unsigned, 1=signed                                                              |  |
|       |                           |                    | b[2:1] = activation precision type 0=8 bit, 1=16 bit, 2=32 bit (only available for certain operations) |  |
|       |                           |                    | b[7:6] = OFM format select 0=NHWC or 1=NHCWB16                                                         |  |
|       |                           |                    | b[8] = scaling, 0=Per channel scale/bias, 1=Global scale (SET_OFM_SCALE), no bias                      |  |
|       |                           |                    | b[15:14] = rounding mode, 0=double rounding, 1=truncate towards zero, 2=Natural rounding               |  |
| 0x115 | NPU_SET_OFM_BLK_WIDTH_M1  | 0-31               | OFM_BLOCK_WIDTH-1 (see 3.9 Block based operation)                                                      |  |
| 0x116 | NPU_SET_OFM_BLK_HEIGHT_M1 | 0-31               | OFM_BLOCK_HEIGHT-1 (see 3.9 Block based operation)                                                     |  |
| 0x117 | NPU_SET_OFM_BLK_DEPTH_M1  | 3-127              | OFM_BLOCK_DEPTH-1 (see 3.9 Block based operation)                                                      |  |
| 0x118 | NPU_SET_OFM_ZERO_POINT    | int16 or<br>uint16 | OFM zero-point offset. Encoded as int16 if the activation is signed, or uint16 if the activation is unsigned. |  |
|       |                           |                    | Must be a valid activation value given by ACTIVATION[15:12].                                           |  |
|       |                           |                    | Must be 0 for a 32-bit activation range or for CLZ.                                                    |  |
|       |                           |                    | Note: This can be nonzero if OFM is 32 bit but the ACTIVATION[15:12] range is 8 bit.                   |  |
| 0x119 | -                         | -                  | -                                                                                                      |  |
| 0x11A | NPU_SET_OFM_WIDTH0_M1     | 0-65535            | OFM Tile 0 and tile 2 (width-1)                                                                        |  |
| 0x11B | NPU_SET_OFM_HEIGHT0_M1    | 0-65535            | OFM Tile 0 (height-1)                                                                                  |  |
| 0x11C | NPU_SET_OFM_HEIGHT1_M1    | 0-65535            | OFM Tile 1 (height-1)                                                                                  |  |
| 0x11D | -                         | -                  | -                                                                                                      |  |
| 0x11E | -                         | -                  | -                                                                                                      |  |
| 0x11F | NPU_SET_OFM_REGION        | 0-7                | Index <i>n</i> for OFM access: Region[ <i>n</i> ] is added to all OFM addresses                        |  |
| 0x120 | NPU_SET_KERNEL_WIDTH_M1   | 0-65535            | Set (dilated_kernel_width-1) = (kernel_width-1)*kernel_x_dilation                                      |  |
| 0x121 | NPU_SET_KERNEL_HEIGHT_M1  | 0-65535            | Set (dilated_kernel_height-1) = (kernel_height-1)*kernel_y_dilation                                    |  |

| cmd0  | Enumerator              | Parameter          | Function                                                                                                                                      |
|-------|-------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| 0x122 | NPU_SET_KERNEL_STRIDE   | bitfield           | b0 = (kernel_x_stride-1)&1 (x stride low bit)                                                                                                 |
|       |                         |                    | b1 = (kernel_y_stride-1)&1 (y stride low bit)                                                                                                 |
|       |                         |                    | b2 = kernel_weight_order (0=depth-first weight order, 1=part kernel-first weight order)                                                       |
|       |                         |                    | b3 = kernel_x_dilation - 1 (0=no x dilation, 1=x dilation of x2)                                                                              |
|       |                         |                    | b4 = kernel_y_dilation -1 (0=no y dilation, 1=y dilation of x2)                                                                               |
|       |                         |                    | b5 = 0 for kernel_split_size=8, 1 for kernel_split_size=4 (8x8 or 4x4 kernel decomposition)                                                   |
|       |                         |                    | b[8:6] = (kernel_x_stride-1) >> 1 (stride extension bits – supported stride range is 1 to 3)                                                  |
|       |                         |                    | b[11:9] = (kernel_y_stride-1)>>1 (stride extension bits – supported stride range is 1 to 3)                                                   |
| 0x123 | NPU_SET_PARALLEL_MODE   | 0, 1               | 0=1-core and 1=2-core depth                                                                                                                   |
| 0x124 | NPU_SET_ACC_FORMAT      | 0-3                | Sets the accumulator format: 0=32-bit integer, 1=40-bit integer, 2=s5.10 floating point                                                       |
| 0x125 | NPU_SET_ACTIVATION      | 0, 3, 4,<br>0x10+n | 0=none/ReLU, 3=tanh, 4=sigmoid; 0x10+n for 0<=n<8 indicates a LUT operation starting at address n*256 bytes in the last 2KB page of the SHRAM |
|       |                         |                    | b[15:12] = Activation clip range (before table lookup). 0=OFM precision, 2=force to uint8, 3=force to int8, 5=force to int16                  |
| 0x126 | NPU_SET_ACTIVATION_MIN  | int16 or<br>uint16 | Lower bound clip for OFM activations – range is the OFM type range                                                                            |
| 0x127 | NPU_SET_ACTIVATION_MAX  | int16 or<br>uint16 | Upper bound clip for OFM activations – range is the OFM type range                                                                            |
| 0x128 | NPU_SET_WEIGHT_REGION   | 0-7                | Index <i>n</i> for weight access: Region[ <i>n</i> ] is added to all Weight stream offsets                                                    |
| 0x129 | NPU_SET_SCALE_REGION    | 0-7                | Index <i>n</i> for scale access: Region[ <i>n</i> ] is added to all scale stream offsets                                                      |
| 0x12A | -                       | -                  | -                                                                                                                                             |
| 0x12B | -                       | -                  | -                                                                                                                                             |
| 0x12C | -                       | -                  | -                                                                                                                                             |
| 0x12D | NPU_SET_AB_START        | 0-48               | Start of ACC0,ACC1 buffers in the SHRAM in KB units. Multiples of 2.                                                                          |
| 0x12E | -                       | -                  | -                                                                                                                                             |
| 0x12F | NPU_SET_BLOCKDEP        | 0-3                | Set the block dependency, as a number of blocks, between kernel operations.                                                                   |
| 0x130 | NPU_SET_DMA0_SRC_REGION | Bitmap             | If Bit[8]=0, Bit[7:0] = Region number in the range 0<= <i>n</i> <8 of SRC offset                                                              |
|       |                         |                    | Bit[8] = must be 0 for external                                                                                                               |
|       |                         |                    | Bit[10:9] = stride mode 0/1/2=1D/2D/3D                                                                                                        |

| cmd0        | Enumerator              | Parameter          | Function                                                                                                          |
|-------------|-------------------------|--------------------|-------------------------------------------------------------------------------------------------------------------|
| 0x131       | NPU_SET_DMA0_DST_REGION | Bitmap             | If Bit[8]=0, Bit[7:0] = Region number in the range 0<=n<8 of DST offset                                           |
|             |                         |                    | If Bit[8]=1, Bit[7:0] = Core mask to write to (bit k set for core k=0,1)                                          |
|             |                         |                    | Bit[8] = select external/internal=0/1.                                                                            |
|             |                         |                    | Bit[10:9] = stride mode 0/1/2=1D/2D/3D.                                                                           |
| 0x132       | NPU_SET_DMA0_SIZE0      | 0-65535            | Size of second dimension for 2D/3D transfers.                                                                     |
| 0x133       | NPU_SET_DMA0_SIZE1      | 0-65535            | Size of third dimension for 3D transfers.                                                                         |
| 0x134-0x17F | -                       | -                  | -                                                                                                                 |
| 0x180       | NPU_SET_IFM2_BROADCAST  | bitfield           | b0 = broadcast H dimension (if set, then any accesses to IFM2 sets y=0 and IFM2 height=1)                         |
|             |                         |                    | b1 = broadcast W dimension (if set, then any accesses to IFM2 sets x=0 and IFM2 width=1)                          |
|             |                         |                    | b2 = broadcast C dimension (if set, then any accesses to IFM2 sets c=0 and IFM2 depth=1)                          |
|             |                         |                    | b6 = operand order 0=IFM2 is second operand B, 1=IFM2 is first operand A.                                         |
|             |                         |                    | b7 = broadcast constant given by NPU_SET_IFM2_SCALAR and so ignore b0-b2                                          |
| 0x181       | NPU_SET_IFM2_SCALAR     | int16 or<br>uint16 | IFM2 scalar value at range IFM_PRECISION.                                                                         |
|             |                         |                    | The scalar is encoded with IFM2_ZERO_POINT.                                                                       |
|             |                         |                    | Values are encoded as signed or unsigned 16-bit values depending on whether IFM2_PRECISION is signed or unsigned. |
| 0x182-0x184 | -                       | -                  | -                                                                                                                 |
| 0x185       | NPU_SET_IFM2_PRECISION  | bitfield           | b[0] = activation type 0=unsigned, 1=signed – MUST MATCH IFM                                                      |
|             |                         |                    | b[3:2] = activation precision 0=8 bit, 1=16 bit, 2=32 bit – MUST MATCH IFM                                        |
|             |                         |                    | b[7:6] = IFM2 format, select 0=NHWC or 1=NHCWB16                                                                  |
| 0x186-0x188 | -                       | -                  | -                                                                                                                 |
| 0x189       | NPU_SET_IFM2_ZERO_POINT | int16 or<br>uint16 | IFM2 zero-point offset. Encoded as int16 if the activation is signed, or uint16 if the activation is unsigned.    |
|             |                         |                    | Must be zero for 32-bit IFM.                                                                                      |
|             |                         |                    | Must be a valid activation value.                                                                                 |
| 0x18A       | NPU_SET_IFM2_WIDTH0_M1  | 0-65535            | IFM2 Tile 0 and tile 2 (width-1)                                                                                  |
| 0x18B       | NPU_SET_IFM2_HEIGHT0_M1 | 0-65535            | IFM2 Tile 0 (height-1)                                                                                            |
| 0x18C       | NPU_SET_IFM2_HEIGHT1_M1 | 0-65535            | IFM2 Tile 1 (height-1)                                                                                            |
| 0x18D       | NPU_SET_IFM2_IB_START   | 0-48               | Start of IB0, IB1 buffers for IFM2 in SHRAM. In KB units, multiples of 2.                                         |

| cmd0  | Enumerator          | Parameter | Function                                                                          |
|-------|---------------------|-----------|-----------------------------------------------------------------------------------|
| 0x18E | -                   | -         | -                                                                                 |
| 0x18F | NPU_SET_IFM2_REGION | 0-7       | Index <i>n</i> for IFM2 access: Region[ <i>n</i> ] is added to all IFM2 addresses |
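The kernel-related parameters in Table 3-110 involve some arithmetic, which the following Python sketch works through. The bit positions for NPU_SET_KERNEL_STRIDE (0x122) and the dilated-size formula for NPU_SET_KERNEL_WIDTH_M1 and NPU_SET_KERNEL_HEIGHT_M1 come from the table. The function and argument names are hypothetical, and the range checks only reflect the limits that the table states.

```python
def kernel_size_m1_param(kernel_size, dilation=1):
    """(dilated_kernel_size - 1) = (kernel_size - 1) * dilation, per 0x120/0x121."""
    return (kernel_size - 1) * dilation


def kernel_stride_param(x_stride, y_stride, x_dilation=1, y_dilation=1,
                        part_kernel_first=False, split_size=8):
    """Build the NPU_SET_KERNEL_STRIDE bitfield from the fields in the table."""
    if not (1 <= x_stride <= 3 and 1 <= y_stride <= 3):
        raise ValueError("supported stride range is 1 to 3")
    if x_dilation not in (1, 2) or y_dilation not in (1, 2):
        raise ValueError("dilation must be 1 or 2")
    if split_size not in (4, 8):
        raise ValueError("kernel_split_size must be 4 or 8")

    xs, ys = x_stride - 1, y_stride - 1
    value = xs & 1                           # b0: x stride low bit
    value |= (ys & 1) << 1                   # b1: y stride low bit
    value |= int(bool(part_kernel_first)) << 2  # b2: weight order
    value |= (x_dilation - 1) << 3           # b3: x dilation
    value |= (y_dilation - 1) << 4           # b4: y dilation
    value |= (1 if split_size == 4 else 0) << 5  # b5: kernel split size
    value |= (xs >> 1) << 6                  # b[8:6]: x stride extension bits
    value |= (ys >> 1) << 9                  # b[11:9]: y stride extension bits
    return value


print(kernel_size_m1_param(3, dilation=2))   # 4
print(hex(kernel_stride_param(2, 2)))        # 0x3
print(hex(kernel_stride_param(3, 3)))        # 0x240
```

A stride of 3 sets only the extension bits because (3-1)&1 is 0 and (3-1)>>1 is 1, which is why the last example has b0 and b1 clear.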

#### 3.6.5 cmd1 commands

cmd1 commands have bits[15:14] = 1. cmd1 bits[9:0] indicate the command. cmd1 commands take a payload data item of 32 bits in addition to the 16-bit parameter field.

## **About the Parameter field**

Where payload items in the following table give an address offset, stride, or data length, the value is in bytes.
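The following Python sketch shows one way a cmd1 command could be laid out as two 32-bit words. Only bits[15:14] = 1, the command in bits[9:0], the 16-bit parameter, and the 32-bit payload come from the text above. Everything else is an assumption: that the parameter occupies bits[31:16] of the first word, that the payload follows as the second word, and that a parameter named `extu_47_32` carries bits[47:32] of an unsigned byte offset whose low 32 bits are the payload. The function names are hypothetical.

```python
def encode_cmd1(opcode, payload, param=0):
    """Return [command word, payload word] for a cmd1 command (assumed layout)."""
    if not 0 <= opcode <= 0x3FF:
        raise ValueError("cmd1 command must fit in bits[9:0]")
    if not 0 <= param <= 0xFFFF:
        raise ValueError("parameter must fit in 16 bits")
    if not 0 <= payload <= 0xFFFFFFFF:
        raise ValueError("payload must fit in 32 bits")
    # bits[15:14] = 1 marks a cmd1 command; ASSUMPTION: parameter in bits[31:16].
    word0 = (param << 16) | (1 << 14) | opcode
    return [word0, payload]


def encode_unsigned_offset(opcode, offset):
    """Split a byte offset of up to 48 bits across parameter and payload.

    ASSUMPTION: extu_47_32 means the parameter holds offset bits[47:32].
    """
    if not 0 <= offset < (1 << 48):
        raise ValueError("offset must fit in 48 bits")
    return encode_cmd1(opcode, offset & 0xFFFFFFFF, param=offset >> 32)


# NPU_SET_IFM_BASE0 (0x000) with a byte offset of 0x1000.
print([hex(w) for w in encode_unsigned_offset(0x000, 0x1000)])
# ['0x4000', '0x1000']
```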

The following table lists the cmd1 commands and their functionality.

Table 3-111 cmd1 operations

| cmd1        | Enumerator                 | Parameter                                                           | Payload data                                                    |  |
|-------------|----------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------|--|
| 0x000       | NPU_SET_IFM_BASE0          | extu_47_32                                                          | IFM tile0 byte offset (top left tile) from IFM_REGION start     |  |
| 0x001       | NPU_SET_IFM_BASE1          | extu_47_32                                                          | IFM tile1 byte offset (top right tile) from IFM_REGION start    |  |
| 0x002       | NPU_SET_IFM_BASE2          | extu_47_32                                                          | IFM tile2 byte offset (bottom left tile) from IFM_REGION start  |  |
| 0x003       | NPU_SET_IFM_BASE3          | extu_47_32                                                          | IFM tile3 byte offset (bottom right tile) from IFM_REGION start |  |
| 0x004       | NPU_SET_IFM_STRIDE_X       | exts_47_32                                                          | IFM byte stride between horizontal values                       |  |
| 0x005       | NPU_SET_IFM_STRIDE_Y       | exts_47_32                                                          | IFM byte stride between vertical values                         |  |
| 0x006       | NPU_SET_IFM_STRIDE_C       | exts_47_32                                                          | IFM byte stride between channel blocks (of 16 bytes each block) |  |
| 0x007-0x009 | -                          | -                                                                   | -                                                               |  |
| 0x00A-0x00F | -                          | -                                                                   | -                                                               |  |
| 0x010       | NPU_SET_OFM_BASE0          | extu_47_32                                                          | OFM tile0 byte offset (top left tile) from OFM_REGION           |  |
| 0x011       | NPU_SET_OFM_BASE1          | extu_47_32                                                          | OFM tile1 byte offset (top right tile) from OFM_REGION          |  |
| 0x012       | NPU_SET_OFM_BASE2          | extu_47_32                                                          | OFM tile2 byte offset (bottom left tile) from OFM_REGION        |  |
| 0x013       | NPU_SET_OFM_BASE3          | extu_47_32                                                          | OFM tile3 byte offset (bottom right tile) from OFM_REGION       |  |
| 0x014       | NPU_SET_OFM_STRIDE_X       | exts_47_32                                                          | OFM byte stride between horizontal values                       |  |
| 0x015       | NPU_SET_OFM_STRIDE_Y       | exts_47_32                                                          | OFM byte stride between vertical values                         |  |
| 0x016       | NPU_SET_OFM_STRIDE_C       | exts_47_32                                                          | OFM byte stride between channel blocks (of 16 bytes each block) |  |
| 0x017-0x019 | -                          | -                                                                   | -                                                               |  |
| 0x01A-0x01F | -                          | -                                                                   | -                                                               |  |
| 0x020       | NPU_SET_WEIGHT_BASE        | extu_47_32                                                          | Weight stream byte offset in WEIGHT_REGION                      |  |
| 0x021       | NPU_SET_WEIGHT_LENGTH      | 0                                                                   | Weight stream byte length (unsigned 32 bits)                    |  |
| 0x022       | NPU_SET_SCALE_BASE         | extu_47_32                                                          | Scale and bias stream input byte offset from SCALE_REGION       |  |
| 0x023       | NPU_SET_SCALE_LENGTH       | 0                                                                   | Scale and bias stream input byte length (unsigned 20 bits)      |  |

| cmd1        | Enumerator             | Parameter                 | Payload data                                                                                                                                                                                                                               |
|-------------|------------------------|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0x024       | NPU_SET_OFM_SCALE      | shift (6-bit<br>unsigned) | Unsigned scale (32 bit). Used by average pool with pad=0, elementwise MUL, ADD, SUB, ABS.  Note  For 32-bit operations scale is not applied but shift is.                                                                                  |
| 0x025       | NPU_SET_OPA_SCALE      | shift (6-bit<br>unsigned) | Unsigned input scale. The format depends on the IFM_PRECISION register:  If IFM scale mode is 0, then shift is ignored and scale is 16 bit.  If IFM scale mode is 1 or 2, then shift is 6 bit and scale is 32 bit or 16 bit, respectively. |
| 0x026       | NPU_SET_OPB_SCALE      | Reserved                  | Unsigned input scale. The format depends on the IFM_PRECISION register:  If IFM scale mode is 0, then scale is 16 bit.  If IFM scale mode is 1 or 2, then this register is not used.                                                       |
| 0x027-0x029 | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x02A-0x02F | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x030       | NPU_SET_DMA0_SRC       | extu_47_32                | DMA user channel 0 source byte offset from DMA0_SRC_REGION                                                                                                                                                                                 |
| 0x031       | NPU_SET_DMA0_DST       | extu_47_32                | DMA user channel 0 destination byte offset from DMA0_DST_REGION                                                                                                                                                                            |
| 0x032       | NPU_SET_DMA0_LEN       | extu_47_32                | DMA user channel 0 transfer length in bytes for 1D mode. For 2D/3D modes this is the size in bytes of the innermost (1D) transfer. The total transfer size for a 3D transfer is DMA0_LEN*DMA0_SIZE1*DMA_SIZE2                              |
| 0x033       | NPU_SET_DMA0_SKIP0     | extu_47_32                | Byte distance to skip after each inner (1D) transfer (2D/3D mode), any alignment                                                                                                                                                           |
| 0x034       | NPU_SET_DMA0_SKIP1     | extu_47_32                | Byte distance to skip after each 2D transfer (3D mode), any alignment                                                                                                                                                                      |
| 0x035-0x039 | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x03A-0x03F | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x080       | NPU_SET_IFM2_BASE0     | extu_47_32                | IFM2 tile0 byte offset (top left tile) from IFM2_REGION start                                                                                                                                                                              |
| 0x081       | NPU_SET_IFM2_BASE1     | extu_47_32                | IFM2 tile1 byte offset (top right tile) from IFM2_REGION start                                                                                                                                                                             |
| 0x082       | NPU_SET_IFM2_BASE2     | extu_47_32                | IFM2 tile2 byte offset (bottom left tile) from IFM2_REGION start                                                                                                                                                                           |
| 0x083       | NPU_SET_IFM2_BASE3     | extu_47_32                | IFM2 tile3 byte offset (bottom right tile) from IFM2_REGION start                                                                                                                                                                          |
| 0x084       | NPU_SET_IFM2_STRIDE_X  | exts_47_32                | IFM2 byte stride between horizontal values                                                                                                                                                                                                 |
| 0x085       | NPU_SET_IFM2_STRIDE_Y  | exts_47_32                | IFM2 byte stride between vertical values                                                                                                                                                                                                   |
| 0x086       | NPU_SET_IFM2_STRIDE_C  | exts_47_32                | IFM2 byte stride between channel blocks (of 16 bytes per block)                                                                                                                                                                            |
| 0x087-0x089 | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x08A-0x08F | -                      | -                         | -                                                                                                                                                                                                                                          |
| 0x090       | NPU_SET_WEIGHT1_BASE   | extu_47_32                | Weight stream byte offset in WEIGHT_REGION                                                                                                                                                                                                 |
| 0x091       | NPU_SET_WEIGHT1_LENGTH | 0                         | Weight stream byte length (unsigned 32 bits)                                                                                                                                                                                               |

| cmd1        | Enumerator            | Parameter  | Payload data                                               |
|-------------|-----------------------|------------|------------------------------------------------------------|
| 0x092       | NPU_SET_SCALE1_BASE   | extu_47_32 | Scale and bias stream input byte offset from SCALE_REGION  |
| 0x093       | NPU_SET_SCALE1_LENGTH | 0          | Scale and bias stream input byte length (unsigned 20 bits) |
| 0x094-0x099 | -                     | -          | -                                                          |
| 0x09A-0x09F | -                     | -          | -                                                          |
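
The cmd1 layout described at the start of this section can be illustrated with a short Python sketch. It assumes, beyond what this section states, that the 16-bit parameter field occupies bits[31:16] of the first 32-bit word and that the 32-bit payload follows as a second word. The helper name `encode_cmd1` is hypothetical and is not part of any Arm software interface.

```python
CMD1_FLAG = 1 << 14            # bits[15:14] = 1 marks a cmd1 command
NPU_SET_IFM_BASE0 = 0x000      # cmd1 codes taken from Table 3-111
NPU_SET_IFM_STRIDE_X = 0x004


def encode_cmd1(cmd, value):
    """Pack a 48-bit offset or stride into (command word, payload word).

    Assumption: the parameter field sits in bits[31:16] of the command word
    and carries bits[47:32] of the value (extu_47_32 or exts_47_32).
    """
    if not 0 <= cmd <= 0x3FF:
        raise ValueError("cmd1 code must fit in bits[9:0]")
    if not -(1 << 47) <= value < (1 << 48):
        raise ValueError("value must fit in 48 bits")
    param = (value >> 32) & 0xFFFF    # upper 16 bits (sign bits if negative)
    payload = value & 0xFFFFFFFF      # lower 32 bits
    word0 = (param << 16) | CMD1_FLAG | cmd
    return word0, payload


base_words = encode_cmd1(NPU_SET_IFM_BASE0, 0x1_2345_6789)
stride_words = encode_cmd1(NPU_SET_IFM_STRIDE_X, -16)
```

Under these assumptions, a negative stride is carried as the sign-extended upper bits in the parameter field and the low 32 bits in the payload.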

# 3.7 Weight stream format

The weight stream format encodes a sequence of signed weight values in the range -255 to +255. The weights are stored in a lossless compressed format.

The compression encodes sequences of zeros efficiently. Nonzero weight values are compressed using Golomb-Rice coding and a configurable lookup table. The weight stream is made from several bitstream slices, a slice header, and some Variable Length Coded (VLC) symbols. The VLC symbols are grouped into chunks. For each slice, the compression parameters are specified in the slice header and then kept for the duration of the slice.

This section contains the following subsections:

- 3.7.1 Bit order convention
- 3.7.2 Weight stream structure and slice header syntax
- 3.7.3 Coding modes
- 3.7.4 Chunk syntax
- 3.7.5 Weight blocks and ordering

#### 3.7.1 Bit order convention

In the weight stream, all bits are stored in ascending bit number order. The LSB is therefore the first bit read in a byte.

Syntax elements are stored with the LSB first. For example, writing the 5-bit value 0b10010 (0x12), then the 4-bit value 0b1011 (0xB), then the 7-bit value 0b1010101 (0x55) stores 0b10101011 01110010 from MSB to LSB. The content of the first byte is therefore 0b01110010 or 0x72, and the content of the second byte is 0b10101011 or 0xAB.
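
The bit order convention can be checked with a minimal Python sketch. The `BitWriter` class is a hypothetical helper written for this illustration. It packs a 5-bit, a 4-bit, and a 7-bit element LSB first and produces the two bytes 0x72 and 0xAB.

```python
class BitWriter:
    """Packs syntax elements LSB first, in ascending bit number order."""

    def __init__(self):
        self.acc = 0      # accumulated bits; bit 0 is the first bit written
        self.nbits = 0    # number of valid bits in acc

    def put(self, value, width):
        """Append a syntax element of the given width in bits."""
        assert 0 <= value < (1 << width)
        self.acc |= value << self.nbits
        self.nbits += width

    def to_bytes(self):
        """Return the stream as bytes; byte 0 holds the first 8 bits."""
        nbytes = (self.nbits + 7) // 8
        return self.acc.to_bytes(nbytes, "little")


writer = BitWriter()
writer.put(0b10010, 5)      # 0x12
writer.put(0b1011, 4)       # 0xB, straddles the byte boundary
writer.put(0b1010101, 7)    # 0x55
stream = writer.to_bytes()
```

The second element straddles the byte boundary: its three low bits land in bits 5 to 7 of the first byte and its top bit lands in bit 0 of the second byte.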

## 3.7.2 Weight stream structure and slice header syntax

The slice header indicates to the NPU when to switch coding mode. Using an extended header, the slice header can optionally be used to reload the palette (lookup table).

The encoder decides the frequency of slice headers. A higher frequency is a trade-off between improving the compression ratio when switching coding mode and the cost of inserting a header. Adding a slice header also affects the decoding throughput, particularly when a header signals a reload of the palette.

The following figure shows an example weight stream payload.



Figure 3-1 Example weight stream payload

The following example specifies the high-level weight bitstream structure and the slice header syntax. The number of bits used in the bitstream is listed next to each symbol.

```
weight_stream() {
  while( !end_of_stream() ) {
    zdiv                              // 3 bits
    if (zdiv == 7) {
      while( !byte_aligned() )
        bytealign                     // 1 bit
    } else {
      slice_header()
      chunks()
    }
  }
  assert( word_aligned() )
}

slice_header() {
  slicelen                            // 15 bits
  slice_length = slicelen + 1
  wdiv                                // 3 bits
  wtrunc                              // 1 bit
  newpal                              // 1 bit
  // When newpal is set, dirofs, palsize, palbits, and palette[i] follow
  // (see Table 3-112).
}
```

The byte\_aligned() function returns true if the current bit position is on a byte boundary, otherwise the return value is false. Similarly, the word\_aligned() function returns true if the current bit position is on a 128-bit boundary. The end\_of\_stream() function returns true if the weight stream has reached the end.

The following table lists the symbols in this bitstream and their meanings.

Table 3-112 Bitstream symbols

| Symbol                           | Valid values | Meaning                                                                                                                                                                                                                           |
|----------------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| zdiv                             | 0-3, 6, or 7 | <b>0-3</b> Specifies the zero-run GRC divisor to be `1<<zdiv`.                                                                                                                                                                    |
|                                  |              | 6 Indicates that zero run compression is not used for the slice.                                                                                                                                                                  |
|                                  |              | 7 Indicates that several bits follow to byte-align the next syntax element. This is used at the end of the weight stream to make sure the weight stream has a length that is a multiple of 128-bit words.                         |
| bytealign                        | 1            | Padding used to align the end of the stream. Must have the value 1.                                                                                                                                                               |
| slicelen,<br>slice_length        | All          | Number of weights in this slice. In alternate mode, this is the number of nonzero weights.                                                                                                                                        |
| wdiv                             | 0-5 or 7     | Weight GRC divisor. Possible values are:                                                                                                                                                                                          |
|                                  |              | <b>0-5</b> Specifies the weight index GRC divisor to be `1<<wdiv`.                                                                                                                                                                |
|                                  |              | 7 Uncompressed mode                                                                                                                                                                                                               |
| wtrunc                           | All          | If this is set, the weight GRC unary length is truncated to 2.                                                                                                                                                                    |
| newpal                           | All          | If this is set, a new palette mode is configured. If this is not set, then dirofs, palsize, palbits, and palette[i] keep the values from the previous slice. This must be set for the first slice in a stream.                    |
| dirofs                           | All          | Direct mode offset. For more information about direct mode, see <i>Palette mode and direct mode</i> on page 3-89.                                                                                                                 |
| palsize,<br>palette_size         | All          | Indicates the number of entries in the palette. A value of 0 means direct mode where the palette is not used.                                                                                                                     |
| palbits,<br>palette_bits         | All          | If the palette is used (palette_size>0), then palette_bits indicates the precision in bits of each palette entry.                                                                                                                 |
|                                  |              | In direct mode (palette_size==0), then palette_bits indicates the precision used in uncompressed mode.                                                                                                                            |
| palette[i]                       | All          | Weight value for palette entry with index i. The weight value is stored in sign-magnitude format. The LSB of palette[i] is the sign and the remainder of the bits (bit palette_bits-1 down to bit 1) indicate the absolute level. |
|                                  |              | The weight value is calculated with the following formula:                                                                                                                                                                        |
|                                  |              | `weight_value = palette[i] & 1 ? -(palette[i] >> 1) : (palette[i] >> 1)`                                                                                                                                                          |

# 3.7.3 Coding modes

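The weight stream combines several coding modes: palette mode or direct mode maps weight indices to weight values, weight indices are either Golomb-Rice coded or uncompressed, and an optional alternating mode codes runs of zero weights.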

#### Palette mode and direct mode

The weight stream encodes compressed weight indices. A weight index is a 9-bit unsigned integer in the range 0 to 511 which represents different weight values.

If the weight index is less than palette\_size, the weight index is used as an index into the palette, and the weight value is found in the palette entry for that index. Otherwise (if the weight index is greater than or equal to the palette\_size), the weight value is calculated directly from the weight index using a formula as indicated below. The first mode is called palette mode and the latter mode is called direct mode.

```
if ( weight_index < palette_size )
    tmp = palette[weight_index]
else
    tmp = weight_index - palette_size + dirofs
weight_value = tmp&1 ? -(tmp>>1) : +( tmp>>1)
```
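
The same mapping can be written as a small runnable Python sketch. The function name and the example palette are hypothetical and chosen only to exercise both branches. The palette is represented as a list, so `len(palette)` plays the role of palette_size.

```python
def weight_value(weight_index, palette, dirofs):
    """Map a weight index to a signed weight value (sign in the LSB)."""
    if weight_index < len(palette):
        tmp = palette[weight_index]                   # palette mode
    else:
        tmp = weight_index - len(palette) + dirofs    # direct mode
    magnitude = tmp >> 1
    return -magnitude if tmp & 1 else magnitude


# Hypothetical two-entry palette: 6 decodes to +3 and 7 decodes to -3.
palette = [6, 7]
values = [weight_value(i, palette, dirofs=0) for i in range(6)]
```

Indices 0 and 1 are resolved through the palette. Indices 2 to 5 fall through to direct mode, where consecutive values of tmp alternate in sign as the magnitude grows.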

# Weight index coding

Weight indices are either Golomb-Rice coded or uncompressed as indicated by wdiv.

#### Golomb-Rice coding

In Golomb-Rice coding, the weight index is represented as a quotient and a remainder.

Golomb-Rice coding is represented as follows:

```
wq = weight_index >> wdiv
wr = weight_index & ((1<<wdiv)-1)
```

The quotient wq must be less than or equal to 31. If wtrunc is set, then wq must be less than or equal to 2. It is the responsibility of the encoder to select the wdiv parameter so that this is the case. The quotient is unary coded in the bitstream and the remainder is stored as an unsigned binary in wdiv bits. Unary coding is a variable length coding where numbers are coded as zero-terminated strings of ones as follows:

Table 3-113 Example of unary coding structure

| wq | Unary coding                            |
|----|-----------------------------------------|
| 0  | 0                                       |
| 1  | 10                                      |
| 2  | 110                                     |
| 3  | 1110                                    |
| ...| ...                                     |
| 31 | 11111111111111111111111111111110        |

If truncated coding (wtrunc) is set, the coding is as follows:

Table 3-114 Truncated unary coding

| wq | Unary coding |
|----|--------------|
| 0  | 0            |
| 1  | 10           |
| 2  | 11           |

The unary part is coded in the wunary0 and wunary1 syntax elements and the remainder is encoded in the wremain syntax element as described later.
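
The quotient and remainder split and the two unary code tables can be sketched in Python as follows. The function names are hypothetical. `min_wdiv` illustrates one way an encoder could pick wdiv so that the quotient limit holds; it is an assumption for illustration, not a description of any particular encoder.

```python
def grc_split(weight_index, wdiv):
    """Split a weight index into Golomb-Rice quotient and remainder."""
    wq = weight_index >> wdiv
    wr = weight_index & ((1 << wdiv) - 1)
    return wq, wr


def unary(wq, wtrunc=False):
    """Return the unary code of a quotient as a string of bits."""
    limit = 2 if wtrunc else 31
    if not 0 <= wq <= limit:
        raise ValueError("quotient out of range for this mode")
    if wtrunc and wq == 2:
        return "11"               # truncated code has no terminating zero
    return "1" * wq + "0"


def min_wdiv(max_index, wtrunc=False):
    """Smallest wdiv in 0..5 that keeps every quotient legal, else None."""
    limit = 2 if wtrunc else 31
    for wdiv in range(6):
        if (max_index >> wdiv) <= limit:
            return wdiv
    return None


wq, wr = grc_split(13, 2)     # 13 = 3 * 4 + 1
code = unary(wq)
```

When `min_wdiv` returns None, no Golomb-Rice divisor satisfies the limit for that index range, and the encoder has to fall back on another option such as uncompressed coding.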

### **Uncompressed coding**

If wdiv indicates uncompressed coding, the weight\_index is coded directly as an unsigned binary integer.

The number of bits used, uncompressed\_bits, is derived from the palette size when the palette is non-empty. If the palette is empty, then palette\_bits is repurposed to indicate the uncompressed precision. This behavior is summarized in the following formula:

```
uncompressed_bits = palette_size>0 ? ceil( log2(palette_size) ) : palette_bits
```

The uncompressed weight index is coded in the wremain syntax element as described later.
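
As a minimal sketch, the uncompressed_bits formula can be evaluated in Python with integer arithmetic only. The function name mirrors the symbol in the formula. Using `bit_length` in place of a floating-point logarithm is a choice made for this example.

```python
def uncompressed_bits(palette_size, palette_bits):
    """Bits per weight index in uncompressed mode."""
    if palette_size > 0:
        # For a positive integer n, (n - 1).bit_length() == ceil(log2(n)).
        return (palette_size - 1).bit_length()
    return palette_bits


bits_for_16_entries = uncompressed_bits(16, palette_bits=7)
bits_for_17_entries = uncompressed_bits(17, palette_bits=7)
bits_direct_mode = uncompressed_bits(0, palette_bits=7)
```

A palette of 16 entries needs 4 bits per index, while 17 entries need 5 bits. With an empty palette, palette_bits is used directly.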

# Alternating mode (zero-run coding)

If zdiv<4, alternating mode is enabled. This mode is beneficial if weights with a value of 0 are frequent in the weight stream.

Let n be the number of nonzero weight values and let the array weight\_values (of length n) be the sequence of nonzero weight values. Let the array zruns (of length n+1) be the sequence of zero run lengths between the nonzero weights (zruns[0] is the initial zero run length and zruns[n] is the ending zero run length). For example, consider the following weight sequence:

```
0, 5, 6, 0, 0, 0, 7, 0
```

You then code the following:

```
n = 3
weight_values = {5, 6, 7}
zruns = {1, 0, 3, 1}
```

The original weight sequence can be reconstructed from these values.
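
A short Python sketch of this decomposition and its inverse, using the example sequence from the text. The function names are hypothetical.

```python
def split_zero_runs(weights):
    """Split a weight sequence into nonzero values and zero run lengths."""
    weight_values, zruns, run = [], [], 0
    for w in weights:
        if w == 0:
            run += 1
        else:
            zruns.append(run)      # zeros seen before this nonzero weight
            weight_values.append(w)
            run = 0
    zruns.append(run)              # ending zero run, possibly 0
    return weight_values, zruns


def merge_zero_runs(weight_values, zruns):
    """Rebuild the weight sequence from nonzero values and zero runs."""
    assert len(zruns) == len(weight_values) + 1
    weights = []
    for value, run in zip(weight_values, zruns):
        weights += [0] * run + [value]
    weights += [0] * zruns[-1]
    return weights


sequence = [0, 5, 6, 0, 0, 0, 7, 0]
weight_values, zruns = split_zero_runs(sequence)
restored = merge_zero_runs(weight_values, zruns)
```

The zruns list always has one more entry than weight_values, because it includes both the initial and the ending zero run.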

The weight values and the zrun values are potentially coded in multiple slices. The initial zero run is only coded for slices with newpal set (and in particular for the first slice in the weight stream, since the first slice must have newpal set). A slice is only allowed to change between alternating and non-alternating coding if newpal is set. So, a slice that does not set newpal must be of the same kind (alternating or non-alternating) as the previous slice.

The following formulas give the number of coded weight values and the number of zrun values in a slice.

```
n_weight_values = slice_length
n_zruns = slice_length + newpal
```

For example, suppose there are three slices, all coded in alternating mode, and newpal is set for slice 1 and slice 3:

```
0, 5, 6, 0, 0, 0, 7, 0, 8, 9, 10, 0, 11, 12, 13, 14, 15
<-- slice 1 ---> <-- slice 2 ----> <-- slice 3 ---->
```

The slices are then coded as follows:

```
Slice 1 (newpal=1):
    slice_length = 2
    weight_values = {5, 6 }
    zruns = {1, 0, 3 }
Slice 2 (newpal=0):
    slice_length = 4
    weight_values = {7, 8, 9, 10 }
    zruns = {1, 0, 0, 1 }
Slice 3 (newpal=1):
    slice_length = 5
    weight_values = {11, 12, 13, 14, 15 }
    zruns = {0, 0, 0, 0, 0, 0 }
```
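
The per-slice counts can be checked with the following Python sketch. It assumes that each slice is given the part of the weight sequence it covers, and that a slice without newpal starts on a nonzero weight because the preceding zeros were absorbed by the last zero run of the previous slice. The function name `code_slice` is hypothetical.

```python
def code_slice(segment, newpal):
    """Return (slice_length, weight_values, zruns) for one slice."""
    weight_values, zruns, run = [], [], 0
    for w in segment:
        if w == 0:
            run += 1
        else:
            zruns.append(run)
            weight_values.append(w)
            run = 0
    zruns.append(run)
    if not newpal:
        # The initial zero run is only coded when newpal is set.
        assert zruns[0] == 0, "slice without newpal must start on a weight"
        zruns = zruns[1:]
    # n_zruns = slice_length + newpal
    assert len(zruns) == len(weight_values) + int(newpal)
    return len(weight_values), weight_values, zruns


slices = [
    code_slice([0, 5, 6, 0, 0, 0], newpal=True),
    code_slice([7, 0, 8, 9, 10, 0], newpal=False),
    code_slice([11, 12, 13, 14, 15], newpal=True),
]
```

Slice 3 has five weight values and newpal set, so it carries six zero runs: one initial run plus one run after each weight.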

The nonzero weight values are coded using the direct and palette modes described in previous sections. The zero run values are Golomb-Rice coded using 1<<zdiv as divisor, so, each zero run value is represented as a quotient and a remainder as follows:

```
zq = zrun >> zdiv
zr = zrun & ((1<<zdiv)-1)
```

Note that unlike for wq, there is no upper bound for zq. That means zero runs of arbitrary length can be coded.

### 3.7.4 Chunk syntax

The slice header is followed by chunks that encode the weight indices and, in alternating mode, the zero runs that belong to the slice. Each chunk encodes 0 to 12 weight indices and 0 to 12 zero run values. The two counts generally differ, because the number of values that fit in a chunk depends on the quotient values, that is, on the unary lengths: long unary codes leave room for fewer values than short ones.

The number of chunks in the slice is not known in advance since this number depends on the weight and zrun values.

In alternating mode, a form of flow control keeps the number of coded weight indices and zrun values roughly equal. This is achieved by tracking a balance (the number of weight indices minus the number of zrun values so far). If the balance is greater than or equal to 8, then only zrun values are included in the chunk (so that zrun can catch up). Similarly, if the balance is less than 0, then only weight indices are included in the chunk. If the balance is between 0 and 7, then both weights and zrun values are included in the chunk.
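
The balance rule can be expressed as a small Python function that mirrors the w_enable and z_enable expressions in the chunk syntax. The function name `chunk_enables` is hypothetical.

```python
def chunk_enables(wq_i, zq_i, w_cnt, z_cnt, alternating_mode):
    """Decide whether the next chunk carries weight indices, zruns, or both.

    wq_i and zq_i count the weight indices and zrun values decoded so far;
    w_cnt and z_cnt are the totals for the slice.
    """
    balance = wq_i - zq_i
    w_enable = (balance < 8 or not alternating_mode) and wq_i < w_cnt
    z_enable = balance >= 0 and alternating_mode and zq_i < z_cnt
    return w_enable, z_enable


both = chunk_enables(0, 0, 20, 21, alternating_mode=True)
zruns_only = chunk_enables(12, 4, 20, 21, alternating_mode=True)
weights_only = chunk_enables(4, 6, 20, 21, alternating_mode=True)
finished = chunk_enables(20, 21, 20, 21, alternating_mode=True)
```

Once both counters reach their totals, neither kind of value is enabled. The slice then ends after the pending remainders have been read.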

The Golomb-Rice remainders are pipelined to the chunk after the chunk containing the corresponding quotient values.

The chunk bitstream syntax and parsing process are described below. The output of the process is the weight indices and the zruns arrays.

```
chunks() {
  w_cnt = slice_length
  z_cnt = slice_length + newpal
  zunary_len = zdiv < 3 ? 12 : 8
  alternating_mode = zdiv<4
  uncompressed_mode = wdiv==7
  wremain_bits = uncompressed_mode ? uncompressed_bits : wdiv
  uncompressed_per_chunk = uncompressed_bits<=5 ? 12 : 8
  wq = 0
  wq_i = 0
  wr_i = 0
  zq = 0
  zq_i = 0
  zr_i = 0
  prev_w_enable=0
  prev_z_enable=0
  do {
    // In alternating mode, make sure the rate of weight indices
    // and zrun are kept about the same.
    balance = wq_i - zq_i
    w_enable = (balance<8 || !alternating_mode) && wq_i < w_cnt
    z_enable = balance>=0 && alternating_mode && zq_i < z_cnt
    if (w_enable && !uncompressed_mode)
      wunary0                                   // 12 bits
    if (z_enable) {
      zunary                                    // zunary_len bits
      for(i=0; i<zunary_len; i++) {
        if ( (zunary>>i)&1 ) {
          zq++
        } else {
          zruns[zq_i++] = zq<<zdiv
          zq = 0
        }
      }
    }
    if (w_enable && !uncompressed_mode) {
      wunary1_len = 0
      for(i=0; i<12; i++)
        if ( (wunary0>>i)&1 || wtrunc )
          wunary1_len++
      wunary1                                   // wunary1_len bits
      for(i=0, j=0; i<12 && wq_i<w_cnt; i++) {
        c = 0
        if ( (wunary0>>i)&1 ) {
          c = 1 + ((wunary1>>j)&1)
          j++
        }
        wq += c
        if (c<2 || wtrunc) {
          assert(wq<32)
          weight_indices[wq_i++] = wq<<wdiv
          wq=0
        }
      }
    }
    if (w_enable && uncompressed_mode) {
      for(i=0; i<uncompressed_per_chunk && wq_i<w_cnt; i++) {
        weight_indices[wq_i++] = 0
      }
    }
    // Remainders corresponding to the quotients in the previous chunk
    if (prev_w_enable) {
      while( wr_i < prev_wq_i ) {
        wremain                                 // wremain_bits bits
        weight_indices[wr_i++] += wremain
      }
    }
    if (prev_z_enable) {
      while( zr_i < prev_zq_i ) {
        zremain                                 // zdiv bits
        zruns[zr_i++] += zremain
      }
    }
    prev_w_enable = w_enable
    prev_wq_i = wq_i
    prev_z_enable = z_enable
    prev_zq_i = zq_i
  } while( prev_w_enable || prev_z_enable )
}
```
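
The weight quotient decoding step of the chunk syntax can be sketched in Python. This is an illustration under stated assumptions, not a reference decoder. It assumes that each set bit in wunary0 consumes one bit of wunary1, that a position with a clear bit contributes 0, and that a partial quotient carries over to the next chunk. The function name and the `remaining` parameter (weight indices still to be decoded in the slice) are hypothetical.

```python
def decode_weight_quotients(wunary0, wunary1, wtrunc, remaining, wq=0):
    """Decode quotients from one chunk's wunary0 and wunary1 fields.

    Returns (quotients, carry), where carry is the partial quotient of a
    unary code that continues in the next chunk.
    """
    quotients = []
    j = 0                                    # next unread bit of wunary1
    for i in range(12):
        if len(quotients) >= remaining:
            break
        c = 0
        if (wunary0 >> i) & 1:
            c = 1 + ((wunary1 >> j) & 1)     # 1 or 2 ones at this position
            j += 1
        wq += c
        if c < 2 or wtrunc:                  # code terminated here
            assert wq < 32
            quotients.append(wq)
            wq = 0
    return quotients, wq


plain = decode_weight_quotients(0b110110, 0b0110, False, remaining=4)
truncated = decode_weight_quotients(0b110, 0b10, True, remaining=3)
```

In the first call, six positions encode the quotients 0, 1, 2, and 3: a position contributing 2 keeps the code open, and the next position contributing 0 or 1 closes it. In the second call, wtrunc closes every code after at most two ones.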

# 3.7.5 Weight blocks and ordering

The Ethos-U65 NPU requires weights to be supplied in a specific order to function correctly. This section describes that ordering.

The weights are also compressed, as described in 3.7 Weight stream format. This section describes how the 1D array that is the input to the weight encoder is ordered.

#### Overview

In addition to being reordered, the weights are padded to align to the full weight blocks that the weight decoder works on. Padding is done by inserting weights with the value 0 into the weight stream. Therefore, unless the stripe dimensions align perfectly to the internal work blocks of the NPU, the uncompressed weight stream is larger than the original weights.

The ordering is described below in pseudocode as nested loops. It is divided into depth-wise convolution, normal convolution with depth-first order, and normal convolution with part-kernel-first order. The three orderings are similar in most respects. For example, depth-wise convolution uses, with some exceptions, the same order as part-kernel-first convolution, but without the loops that traverse the IFM depth.

#### **Depth-wise convolution**

Table 3-115 Depth-wise convolution weight ordering

| Inputs/outputs    | Description                                        | Range       |
|-------------------|----------------------------------------------------|-------------|
|                   | Input                                              |             |
| weights           | 3D array of 9-bit signed weights in 2's complement |             |
|                   | Dimensions:                                        |             |
|                   | [ofm_z][ifm_z][kernel_x][kernel_y]                 |             |
|                   | Stripe-dependent input                             |             |
| ofm_depth         | Number of ofm channels                             | [1..65536]  |
| ofm_block_depth   | Number of ofm channels per block                   | [1..128]    |
| kernel_width      | Kernel width (before dilation)                     | [1..65536]  |
| kernel_height     | Kernel height (before dilation)                    | [1..65536]  |
| kernel_x_dilation | Kernel x dilation by 2 enabled                     | Boolean     |
| kernel_y_dilation | Kernel y dilation by 2 enabled                     | Boolean     |
| kernel_split_size | Kernel decomposition size                          | [4,8]       |
|                   | Configuration-dependent input                      |             |
| ublk_depth        | Microblock depth                                   | [4,8]       |
|                   | Output                                             |             |
| weight_stream     | 1D array of 9-bit signed weights                   | [-255..255] |

Example code of weight ordering for depth-wise convolution.

#### Convolution - depth-first weight order

Table 3-116 Depth-first weight ordering

| Inputs/outputs                | Description                                        | Range     |
|-------------------------------|----------------------------------------------------|-----------|
|                               | Input                                              |             |
| weights                       | 4D array of 9-bit signed weights in 2's complement | [-255..255] |
|                               | Dimensions:                                        |             |
|                               | [ofm-z][ifm_z][kernel_x][kernel_y]                 |             |
|                               | Stripe-dependent input                             |             |
| ofm_depth                     | Number of ofm channels                             | [1..65536]  |
| ofm_block_depth               | Number of ofm channels per block                   | [1..128]    |
| kernel_width                  | Kernel width (before dilation)                     | [1..65536]  |
| kernel_height                 | Kernel height (before dilation)                    | [1..65536]  |
| kernel_x_dilation             | Kernel x dilation by 2 enabled                     | Boolean     |
| kernel_y_dilation             | Kernel y dilation by 2 enabled                     | Boolean     |
| kernel_split_size             | Kernel decomposition size                          | [4,8]       |
| ifm_depth                     | Number of IFM channels                             | [1..65536]  |
| ifm_bitdepth                  | Bit depth for IFM elements                         | [8,16]      |
|                               | Configuration-dependent input                      |             |
| ublk_depth                    | Microblock depth                                   | [4,8]       |
|                               | Output                                             |             |
| weight_stream                 | 1D array of 9-bit signed weights in 2's complement | [-255..255] |

Example code for depth-first weight ordering.

```
else
weight_stream[w_idx++] = weights[ofm_z][ifm_z][ky][kx]
```

#### Convolution - part-kernel-first weight order

Table 3-117 Part-kernel-first weight ordering

| Inputs/outputs                | Description                                        | Range     |
|-------------------------------|----------------------------------------------------|-----------|
|                               | Input                                              |             |
| weights                       | 4D array of 9-bit signed weights in 2's complement | [-255..255] |
|                               | Dimensions:                                        |             |
|                               | [ofm-z][ifm_z][kernel_x][kernel_y]                 |             |
|                               | Stripe-dependent input                             |             |
| ofm_depth                     | Number of ofm channels                             | [1..65536]  |
| ofm_block_depth               | Number of ofm channels per block                   | [1..128]    |
| kernel_width                  | Kernel width (before dilation)                     | [1..65536]  |
| kernel_height                 | Kernel height (before dilation)                    | [1..65536]  |
| kernel_x_dilation             | Kernel x dilation by 2 enabled                     | Boolean     |
| kernel_y_dilation             | Kernel y dilation by 2 enabled                     | Boolean     |
| kernel_split_size             | Kernel decomposition size                          | [4,8]       |
| ifm_depth                     | Number of IFM channels                             | [1..65536]  |
| ifm_bitdepth                  | Bit depth for IFM elements                         | [8,16]      |
|                               | Configuration-dependent input                      |             |
| ublk_depth                    | Microblock depth                                   | [4,8]       |
|                               | Output                                             |             |
| weight_stream                 | 1D array of 9-bit signed weights in 2's complement | [-255..255] |

Example code for part-kernel-first weight ordering.

```
decomp_w = kernel_split_size
if ( kernel_x_dilation )
    decomp_w = decomp_w / 2
decomp_h = kernel_split_size
if ( kernel_y_dilation )
    decomp_h = decomp_h / 2
ifm_block_depth = 32
if ( ifm_bitdepth == 16 )
    ifm_block_depth = 16
w_idx = 0
for ( blk_z = 0; blk_z < ofm_depth; blk_z += ofm_block_depth )
  for ( iblk_z = 0; iblk_z < ifm_depth; iblk_z += ifm_block_depth )
    for ( kernel_x = 0; kernel_x < kernel_width; kernel_x += decomp_w )
      for ( kernel_y = 0; kernel_y < kernel_height; kernel_y += decomp_h )
        subkernel_width = min(kernel_width - kernel_x, decomp_w)
        subkernel_height = min(kernel_height - kernel_y, decomp_h)
        subkernel_size = subkernel_width * subkernel_height
        // Round up to a multiple of 2 (16-bit IFM) or 4 (8-bit IFM)
        if ( ifm_bitdepth == 16 )
            subkernel_size = ((subkernel_size + 1) / 2) * 2
        if ( ifm_bitdepth == 8 )
            subkernel_size = ((subkernel_size + 3) / 4) * 4
        iblk_d = min(16, ifm_depth - iblk_z)
        for ( iublk_z = 0; iublk_z < iblk_d; iublk_z += 8 )
          blk_d = min(ofm_block_depth, ofm_depth - blk_z)
          for ( ublk_z = 0; ublk_z < blk_d; ublk_z += ublk_depth )
            for ( kernel_i = 0; kernel_i < subkernel_size; kernel_i++ )
              subkernel_x = kernel_i % subkernel_width
              subkernel_y = kernel_i / subkernel_width
              for ( z = 0; z < ublk_depth; z++ )
                for ( iz = 0; iz < 8; iz++ )
                  kx = kernel_x + subkernel_x
                  ky = kernel_y + subkernel_y
                  ifm_z = iblk_z + iublk_z + iz
                  ofm_z = blk_z + ublk_z + z
                  // Positions outside the kernel or channel ranges are padded with 0
                  if ( subkernel_y >= subkernel_height ||
                       ifm_z >= ifm_depth ||
                       ofm_z >= ofm_depth )
                      weight_stream[w_idx++] = 0
                  else
                      weight_stream[w_idx++] = weights[ofm_z][ifm_z][ky][kx]
```
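The rounding applied to `subkernel_size` in the pseudocode above is one source of the zero padding in the weight stream. The following Python sketch isolates that step. The function name `padded_subkernel_size` is hypothetical and is not part of any Arm tool; only the rounding rule is taken from the pseudocode.

```python
def padded_subkernel_size(subkernel_width: int, subkernel_height: int,
                          ifm_bitdepth: int) -> int:
    """Kernel positions emitted for one subkernel, including padding.

    Hypothetical helper: mirrors only the subkernel_size rounding in the
    part-kernel-first pseudocode.
    """
    size = subkernel_width * subkernel_height
    if ifm_bitdepth == 16:
        return ((size + 1) // 2) * 2  # round up to a multiple of 2
    if ifm_bitdepth == 8:
        return ((size + 3) // 4) * 4  # round up to a multiple of 4
    raise ValueError("ifm_bitdepth must be 8 or 16")


# A 3x3 subkernel has 9 positions. With 8-bit IFM it is padded to 12,
# so 3 of the 12 positions per channel pair are zero weights.
print(padded_subkernel_size(3, 3, 8))
print(padded_subkernel_size(3, 3, 16))
```

Subkernels whose size is already a multiple of 4 (for example 2x2 or 4x4 with 8-bit IFM) need no padding from this step, although padding for the channel dimensions can still apply.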

#### Parallel mode

In parallel mode, both internal cores are run. Both cores work on the same stripe, but process different parts of the output feature maps (OFM) in parallel.

When parallel mode is enabled, the weights are divided into two weight streams so that half of the OFM channels go into each weight stream. The weight stream division is done so that even indexed OFM channels go into the first weight stream and odd indexed channels go into the second weight stream. The first stream contains ceil(ofm\_depth/2) channels and the second contains floor(ofm\_depth/2) channels.

The order of the two streams after the split is the same as in non-parallel mode and is described in the following pseudocode, except the inputs are modified accordingly. The affected inputs are weights, ofm\_depth, and ofm\_block\_depth where weights\_0/1, ofm\_depth\_0/1, and ofm\_block\_depth\_0/1 are the modified inputs for the two weight streams.

\_\_\_\_\_ Note \_\_\_\_\_

Parallel mode is only available for the 512 configuration of the Ethos-U65 NPU.
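The channel split described for parallel mode can be sketched in Python as follows. This is an illustration only: the function name `split_for_parallel_mode` is hypothetical, the weights are represented as a plain list indexed by OFM channel, and the sketch does not derive `ofm_block_depth_0/1`, because the text above does not state how those are modified.

```python
def split_for_parallel_mode(weights):
    """Split per-OFM-channel weights into the two parallel-mode streams.

    weights: sequence indexed by OFM channel (hypothetical representation).
    Returns (weights_0, weights_1): even-indexed channels go to the first
    stream and odd-indexed channels go to the second.
    """
    weights_0 = list(weights[0::2])
    weights_1 = list(weights[1::2])
    return weights_0, weights_1


# Five OFM channels: the first stream gets ceil(5/2) = 3 channels,
# the second gets floor(5/2) = 2 channels.
weights = [f"ofm{z}" for z in range(5)]
weights_0, weights_1 = split_for_parallel_mode(weights)
ofm_depth_0, ofm_depth_1 = len(weights_0), len(weights_1)
print(weights_0, weights_1, ofm_depth_0, ofm_depth_1)
```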

# 3.8 Operators and performance

This section provides information on supported data types, operators, and operations, and details the convolution and elementwise performance of the Ethos-U65 NPU.

This section contains the following subsections:

- 3.8.1 Supported data types and operators.
- 3.8.2 Operations.
- 3.8.3 Convolution performance.
- 3.8.4 Elementwise performance.

# 3.8.1 Supported data types and operators

The NPU design process supports the following data types and operators to enable a range of operations. The command-stream generator can construct additional operators.

#### **Data types**

The following data types and formats are supported.

Table 3-118 Supported data types

| Data type                                    | Range / values                                                                                                                                                                            |
|----------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Supported activation and weight combinations | Unsigned 8-bit activations with unsigned 8-bit weights. These allow unsigned zero point of range 0-255 on both activations and weights on a per-tensor basis.                             |
|                                              | Signed 8-bit activations with signed 8-bit weights. These allow signed zero point of range -128 to +127 on activations per tensor, but not zero point on weights (weights are symmetric). |
|                                              | Signed 16-bit activations with signed 8-bit weights. Both activations and weights are symmetric (zero point is not supported).                                                            |
| Output-channel bias-and-scale activations    | 8-bit activations per output-channel bias and scale<br>16-bit activations per output-channel bias and scale |
| Accumulator formats                          | 32-bit accumulators, 40-bit accumulators, 16-bit floating point (s5.10) accumulators                                                                                                      |
| Bit sizes                                    | 8x16-bit operations run at half the speed of 8x8-bit operations.                                                                                                                          |
| Tensor dimensions                            | Tensor height range 1–65536. Tensor width range 1–65536. Tensor depth range 1–65536.                                                                                                      |

\_\_\_\_\_ Note \_\_\_\_\_

The zero-point data type and range must match the corresponding weight or activation data type and range. For example:

- For int8\_t activations, the zero\_point is also int8\_t and both are in the range [-128, 127]. The minimum value of the range activation-zero\_point is -128 - (+127) = -255 and the maximum value is +127 - (-128) = +255.
- For uint8\_t activations, the zero\_point is also uint8\_t and both are in the range [0,255]. The minimum value of the range activation-zero\_point is 0 - (+255) = -255 and the maximum value is +255 - (0) = +255.
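The arithmetic in the note above can be checked with a short Python sketch. The helper names `difference_range` and `fits_signed` are hypothetical. The sketch shows that, for both 8-bit types, the difference between a value and its zero point spans [-255, +255], which does not fit in 8 bits but does fit in a 9-bit signed value.

```python
def difference_range(lo: int, hi: int) -> tuple[int, int]:
    """Range of (value - zero_point) when both lie in [lo, hi]."""
    return lo - hi, hi - lo


def fits_signed(lo: int, hi: int, bits: int) -> bool:
    """True if [lo, hi] fits in a two's complement integer of the given width."""
    return -(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1


int8_range = difference_range(-128, 127)
uint8_range = difference_range(0, 255)
print(int8_range, uint8_range)
print(fits_signed(*int8_range, bits=8), fits_signed(*int8_range, bits=9))
```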

The tensor size is limited by available memory; therefore, tensor dimensions cannot all have maximum values at the same time.

#### **Operators**

The command-stream generator can combine features of the NPU to create the following additional operators.

Table 3-119 Command-stream generated operators

| Operator                     | Construction                                                                                                                                             |
|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Concat                       | The Concatenation operator is constructed by using strides to lay out tensors.                                                                           |
| ExpandDims                   | The ExpandDims operator does not move data for packed NHWC, but adds a '1' dimension.                                                                    |
| GRU                          | The Gated Recurrent Unit (GRU) operation is constructed from vector products and point-wise MUL, ADD, and SUB.                                           |
| Identity                     | Identity can be realized as a 1x1 average pool with a 1x1 stride. This can be useful for rearranging data.                                               |
| Logistic                     | This is a different name for sigmoid activation, both are 1/(1+exp(-x)).                                                                                 |
| LSTM                         | The Long Short-Term Memory (LSTM) operation is constructed from vector products and point-wise MUL, ADD, and SUB.                                        |
| Pack                         | Same as Stack (see below).                                                                                                                               |
| Reshape                      | The Reshape operator does not move data for packed NHWC, but reinterprets the dimensions.                                                                |
| Split                        | The Split operator is the inverse of Concatenate and can be constructed by using strides to extract a subtensor.                                         |
| Squeeze                      | The Squeeze operator does not move data for packed NHWC, but removes a '1' dimension.                                                                    |
| Stack                        | The Stack operator is constructed by using strides. For example, stack NxHWC tensors to obtain one NHWC tensor.                                          |
| Unpack                       | Same as Unstack (see below).                                                                                                                             |
| Unstack                      | The inverse of Stack. This can be constructed by using strides to extract the lower dimension subtensors.                                                |
| Resize_Bilinear              | For a bilinear x2 upscale, this can be achieved by performing a nearest-neighbor upscale combined with a 2x2 average pool.                               |
| BatchRenorm                  | Average pool 1x1 with per-channel scale and bias to rescale data at inference time with fixed scaling only.                                              |
| StridedSlice, 1-strides only | StridedSlice with strides of 1 extracts a subtensor and can be implemented in NHWC format. (StridedSlice with strides not equal to 1 is not supported.) |

# 3.8.2 Operations

The following tables provide details of parameters that enable a number of convolution, depth-wise convolution, pooling, vector-product, elementwise, and reduction operations.

#### **Convolution operations**

A convolution has a weight matrix of size HxWxICxOC, where HxWxIC is the size of the convolution kernel, IC is the number of input channels, and OC is the number of convolutions to apply (= number of output channels).

Table 3-120 Convolution operations

| Parameter        | Range / values                                                                                                                                                                                                   |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kernels          | 1 <= kernel_x*kernel_y <= 64*64<br>1 <= kernel_y <= 64<br>(kernel limit applies after any kernel dilation)<br>The sum of absolute weights must not exceed 127*65536. |
| Precision        | Weight types: {int8, uint8}<br>{IFM types} → {OFM types} supported combinations:<br>{uint8, int8, int16} → {uint8, int8, int16, int32}, any pairing |
| Stride           | 1 <= stride_x <= 3<br>1 <= stride_y <= 3 |
| Kernel dilation  | 1x1, 1x2, 2x1, 2x2 |
| Input upscale    | None, 2x2 (nearest neighbor, insert zeros). A 2x2 upscale must use a stride of 1x1. |
| Input padding    | 0-31 top/left, 0-32 bottom/right |
| Fused activation | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>{int32}: None (linear output only)<br>If LUT is not used, the activation and OFM type must match. |
| Weight order     | Depth-first order, part-kernel-first order (either order can be used for any IFM depth)                                                                                                                          |
| Scaling          | Per output-channel scale and bias parameters                                                                                                                                                                     |
| Accumulators     | fp(s5.10), int32, int40                                                                                                                                                                                          |

\_\_\_\_\_ Note \_\_\_\_\_

The restrictions in the range / values column allow 2D convolutions of size up to 64x64 and 1D convolutions of size up to 1x4096. The condition on the sum of absolute weights ensures that a 32-bit accumulator does not overflow for 8-bit activation values and a 40-bit accumulator does not overflow for 16-bit activation values.
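The overflow claim above can be checked with simple worst-case arithmetic. The Python sketch below is illustrative only and makes two assumptions that are not stated in this table: the largest magnitude of an 8-bit activation after zero-point subtraction is 255, and the largest magnitude of a symmetric 16-bit activation is 32768. The bias term is ignored. The names are hypothetical.

```python
MAX_SUM_ABS_WEIGHTS = 127 * 65536  # limit on the sum of absolute weights


def worst_case_accumulation(max_abs_activation: int) -> int:
    """Largest possible |sum(weight * activation)| under the weight limit."""
    return MAX_SUM_ABS_WEIGHTS * max_abs_activation


def fits_signed(magnitude: int, bits: int) -> bool:
    """True if +/-magnitude fits in a two's complement integer of this width."""
    return magnitude <= (1 << (bits - 1)) - 1


# Assumed worst-case activation magnitudes (see the lead-in).
worst_8bit = worst_case_accumulation(255)
worst_16bit = worst_case_accumulation(32768)

print(worst_8bit, fits_signed(worst_8bit, 32))
print(worst_16bit, fits_signed(worst_16bit, 40))
```

Under these assumptions the 8-bit case stays just below 2^31, while the 16-bit case exceeds 32 bits but stays below 2^39.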

#### **Depth-wise convolution operations**

Depth-wise convolutions have a weight matrix of size HxWxC, where the kernel of size HxW is applied to each channel independently. Only one kernel is applied to each channel (depth\_multiplier=1).

Table 3-121 Depth-wise convolution operations

| Parameter        | Range / values                                                                                                                                                                                                   |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kernels          | 1 <= kernel_x*kernel_y <= 64*64<br>1 <= kernel_y <= 64<br>(kernel limit applies after any kernel dilation)<br>The sum of absolute weights must not exceed 127*65536. |
| Precision        | Weight types: {int8, uint8}<br>{IFM types} → {OFM types} supported combinations:<br>{uint8, int8, int16} → {uint8, int8, int16, int32}, any pairing |
| Stride           | 1 <= stride_x <= 3<br>1 <= stride_y <= 3 |
| Dilation         | 1x1, 1x2, 2x1, 2x2 |
| Input upscale    | None, 2x2 (nearest neighbor, insert zeros). A 2x2 upscale must use a stride of 1x1. |
| Input padding    | 0-31 top/left, 0-32 bottom/right |
| Fused activation | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>{int32}: None (linear output only)<br>If LUT is not used, the activation and OFM type must match. |
| Depth multiplier | 1                                                                                                                                                                                                                |
| Scaling          | Per output-channel scale and bias parameters                                                                                                                                                                     |
| Accumulators     | fp(s5.10), int32, int40                                                                                                                                                                                          |



The restrictions in the range / values column allow 2D convolutions of size up to 64x64 and 1D convolutions of size up to 1x4096. The condition on the sum of absolute weights ensures that a 32-bit accumulator does not overflow for 8-bit activation values and a 40-bit accumulator does not overflow for 16-bit activation values.

#### **Pooling operations**

Pooling operations are applied independently to each channel.

Table 3-122 Pooling operations

| Parameter        | Range / format                                                                                                                                                                           |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kernels          | Average pool with padding (for example SAME padding):<br>1 <= kernel_x <= 8<br>1 <= kernel_y <= 8<br>Average pool without padding and max pool, any padding:<br>1 <= kernel_x*kernel_y <= 256*256<br>1 <= kernel_y <= 256 |
| Precision        | Average pool without padding (VALID type):<br>{IFM types} → {OFM types} supported combinations: {uint8, int8, int16} → {uint8, int8, int16} (any pairing)<br>Average pool with padding or max pool: OFM type must equal IFM type. Supported types: {int8, uint8, int16} |
| Stride           | 1 <= stride_x <= 3<br>1 <= stride_y <= 3 |
| Input upscale    | Average pool: none, 2x2 nearest neighbor OR 2x2 insert zeros.<br>Max pool: none, 2x2 nearest neighbor (only for 2x2 mode).<br>A 2x2 upscale must use a stride of 1x1. |
| Input padding    | Average pool: 0-3 top/left, 0-4 bottom/right<br>Max pool: 0-127 top/left, 0-128 bottom/right |
| Fused activation | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>If LUT is not used, the activation and OFM type must match. |
| Scaling          | Average pool with padding or max pool has no scaling.<br>Average pool with pad=0 has selectable per-channel scale and bias or global scale. |
| Accumulators     | All pooling: int32<br>Average pool with no padding: int32, int40 |

#### **Vector-product operations**

The kernel for a (fully connected) vector product is 1x1xIC, where IC is the number of input channels. Multiple output vector products with the same weights can be executed in batches of up to eight.

Vector product is implemented as a 2D convolution with a 1x1 kernel size.
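The equivalence between a vector product and a 1x1 convolution can be illustrated with a small pure-Python sketch. It is not a model of the NPU: quantization, scaling, and bias are omitted, the function names are hypothetical, and laying the batch of input vectors out along the width of a 1xNxIC feature map is an assumption made for the illustration.

```python
def vector_product(vector, weights):
    """Fully connected layer: weights[oc][ic] applied to vector[ic]."""
    return [sum(w * x for w, x in zip(w_oc, vector)) for w_oc in weights]


def conv2d_1x1(ifm, weights):
    """1x1 convolution: ifm[h][w][ic], weights[oc][ic] -> ofm[h][w][oc]."""
    return [[[sum(w * x for w, x in zip(w_oc, pixel)) for w_oc in weights]
             for pixel in row]
            for row in ifm]


weights = [[1, 2, 3], [4, 5, 6]]   # OC = 2, IC = 3
vectors = [[1, 0, -1], [2, 2, 2]]  # batch of two input vectors

# Assumed layout: the batch becomes the width of a 1 x 2 x 3 IFM.
ifm = [vectors]
ofm = conv2d_1x1(ifm, weights)

# Each output pixel equals the vector product of the matching input vector.
expected = [vector_product(v, weights) for v in vectors]
print(ofm[0] == expected)
```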

Table 3-123 Vector-product operations

| Parameter | Range / format                                                                                                    |
|-----------|-------------------------------------------------------------------------------------------------------------------|
| Kernels          | 1x1x1 to 1x1x64K vector product |
| Precision        | Weight types: {int8, uint8}<br>{IFM types} → {OFM types} supported combinations:<br>{uint8, int8, int16} → {uint8,int8,int16,int32}, any pairing |
| Fused activation | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>{int32}: None (linear output only)<br>If LUT is not used, the activation and OFM type must match. |
| Scaling          | Per output-channel scale and bias parameters |
| Accumulators     | int32, int40 |

#### **Elementwise operations**

The following operations include both unary elementwise (or point-wise) and binary elementwise operations, which support two IFMs to produce one OFM.

Table 3-124 Elementwise operations

| Parameter                 | Range / format                                                                                                                              |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Kernels                   | Binary operations: Multiply, Add, Subtract, Minimum, Maximum, SHR, SHL.                                                                     |
|                           | Unary operations: ABS, Leaky ReLU, CLZ.                                                                                                     |
| Precision                 | Multiply, Add, Subtract {IFM}→{OFM}:                                                                                                        |
|                           | {uint8, int8, int16 int32} → {uint8, int16, int32}, any pairing                                                                             |
|                           | Minimum, Maximum, LReLU, ABS:                                                                                                               |
|                           | IFM and OFM must be of the same type, one of:                                                                                               |
|                           | {int8, uint8, int16}                                                                                                                        |
|                           | SHR $\{IFM\} \rightarrow \{OFM\}$ :                                                                                                         |
|                           | $\{int32\} \rightarrow \{int8, uint8, int32\}$ , any pairing                                                                                |
|                           | CLZ and SHL:                                                                                                                                |
|                           | $\{int32\} \rightarrow \{int32\}$ only                                                                                                      |
| Broadcast (for binary tensor operations) | Operand IFM2 can be one of the following: |
|                           | (a) A scalar constant broadcast to all elements for 8-bit or 16-bit IFM (scalar constant is not supported for a 32-bit IFM). |
|                           | (b) A tensor whose dimensions are either 1 or match IFM1. |
|                           | If (b), any dimension that is 1 is broadcast to match the dimension of IFM1. |
| Operand order             | Selectable if IFM2 is the first or second operand (A or B).                                                                                 |
| Fused activation          | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>{int32}: None (linear output only) |
|                           | If LUT is not used, the activation and OFM type must match.                                                                                 |

Table 3-124 Elementwise operations (continued)

| Parameter      | Range / format                                                                                                                                                   |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input scaling  | For ADD and SUB (only) the following input scales are supported when neither the IFM nor the activation type is 32-bit: |
|                | 1. 16-bit input scale on elementwise ADD and SUB operands.<br>2. 32-bit input scale applied to only one input (fixed shift for the other). |
| Output scaling | Global 32-bit output scale on elementwise MUL, ADD, SUB, ABS, LReLU, SHR. |
|                | Leaky ReLU scales only negative inputs.                                                                                                                          |
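
The tensor form of the IFM2 broadcast rule in Table 3-124 can be sketched as a simple shape check. This is an illustration only, not part of any NPU driver: the function name `ifm2_broadcast_ok` and the (height, width, depth) tuple layout are assumptions, and the scalar-constant case (a) is not modeled.

```python
def ifm2_broadcast_ok(ifm1_shape, ifm2_shape):
    """Return True if every IFM2 dimension is 1 or equals the IFM1 dimension."""
    if len(ifm1_shape) != len(ifm2_shape):
        return False
    return all(d2 == 1 or d2 == d1 for d1, d2 in zip(ifm1_shape, ifm2_shape))

# Shapes are (height, width, depth); the values are hypothetical.
print(ifm2_broadcast_ok((16, 16, 32), (1, 1, 32)))   # True: height and width are broadcast
print(ifm2_broadcast_ok((16, 16, 32), (16, 8, 32)))  # False: width is neither 1 nor 16
```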

## **Reduction operations**

The following table lists the supported options for the REDUCE\_SUM reduction operation, which reduces the channel dimension from an HWC tensor to an HW1 tensor.

Table 3-125 Reduction operations

| Parameter        | Range / format                                                                                                                                                                                                    |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kernels          | REDUCE_SUM                                                                                                                                                                                                        |
| Precision        | Supported {IFM types} → {OFM types}:<br>{uint8, int8, int16, int32} → {int32} (any pairing)                                                                                                                       |
| Input upscale    | 1x1 only                                                                                                                                                                                                          |
| Input padding    | None                                                                                                                                                                                                              |
| Fused activation | Available activations for {activation type}:<br>{int8, uint8, int16}: None, ReLU, ReLUX, tanh, sigmoid, LUT<br>{int32}: None (linear output only).<br>If LUT is not used, the activation and OFM type must match. |
| Scaling          | Global 32-bit scale.                                                                                                                                                                                              |
| Accumulators     | int32, int40                                                                                                                                                                                                      |

## 3.8.3 Convolution performance

The following tables detail the convolution performance of the Ethos-U65 NPU by configuration.

The convolution performance for the different configurations of the NPU depends on the operation and its parameters, such as the kernel height (kh) and kernel width (kw). It also depends on the dimensions of the tensors being processed.

**Important**

The purpose of these tables is to explain the architectural limitations of the MAC utilization of different convolutional operations. If layers are broken into small jobs, there may be more overhead at top level.

For shallow 1x1 convolutions, where IFM depth is <64 or OFM depth is <16, the overall performance is limited by the output and memory bandwidth.

# Convolution performance of the Ethos™-U65<sub>256</sub>

In the following tables k, h, w, d, and n should be integers to achieve the MACs per cycle as specified for the operation. For any non-integer values, the hardware effectively rounds up this value and the extra MACs computed as a consequence are lost.



Table 3-126 Convolution performance for 8-bit activations

| Operation             | Kernel size | OFM height | OFM width | OFM depth | IFM depth | MACs per cycle |
|-----------------------|-------------|------------|-----------|-----------|-----------|----------------|
| CONV2D (depth first)  | kh*kw=1*k   | 2*h        | 2*w       | 8*d       | 32*n      | 256            |
| CONV2D (kernel first) | kh*kw=4*k   | 2*h        | 2*w       | 8*d       | 8*n       | 256            |
| CONV1D (depth first)  | kh=1 kw=1*k | 1          | 4*w       | 8*d       | 32*n      | 256            |
| CONV1D (kernel first) | kh=1 kw=4*k | 1          | 4*w       | 8*d       | 8*n       | 256            |
| Fully connected       | kh=1 kw=1   | 1          | 1         | 8*d       | 32*n      | WB             |
| DepthwiseConv2D       | kh*kw=4*k   | 2*h        | 2*w       | 8*d       | 8*n       | 32             |
| DepthwiseConv1D       | kh=1 kw=4*k | 1          | 4*w       | 8*d       | 8*n       | 32             |

Table 3-127 Convolution performance for 16-bit activations

| Operation             | Kernel size | OFM height | OFM width | OFM depth | IFM depth | MACs per cycle |
|-----------------------|-------------|------------|-----------|-----------|-----------|----------------|
| CONV2D (depth first)  | kh*kw=1*k   | 2*h        | 2*w       | 8*d       | 16*n      | 128            |
| CONV2D (kernel first) | kh*kw=2*k   | 2*h        | 2*w       | 8*d       | 8*n       | 128            |
| CONV1D (depth first)  | kh=1 kw=1*k | 1          | 4*w       | 8*d       | 16*n      | 128            |
| CONV1D (kernel first) | kh=1 kw=2*k | 1          | 4*w       | 8*d       | 8*n       | 128            |
| Fully connected       | kh=1 kw=1   | 1          | 1         | 8*d       | 16*n      | WB             |
| DepthwiseConv2D       | kh*kw=4*k   | 2*h        | 2*w       | 8*d       | 8*n       | 16             |
| DepthwiseConv1D       | kh=1 kw=4*k | 1          | 4*w       | 8*d       | 8*n       | 16             |
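
The cost of non-integer values of h, w, d, and n can be estimated with a short calculation. The following sketch assumes the CONV2D (depth first) row of Table 3-126 for the Ethos-U65<sub>256</sub> with 8-bit activations, where the granules are 2 for OFM height and width, 8 for OFM depth, and 32 for IFM depth. The function names are hypothetical, the kernel size is omitted because any integer kernel area satisfies 1*k, and the model ignores memory bandwidth and top-level overhead.

```python
import math

def round_up(value, granule):
    """Round value up to the next multiple of granule."""
    return math.ceil(value / granule) * granule

def mac_utilization(ofm_h, ofm_w, ofm_d, ifm_d, granules=(2, 2, 8, 32)):
    """Fraction of computed MACs that are useful after each dimension is rounded up."""
    gh, gw, gd, gn = granules
    useful = ofm_h * ofm_w * ofm_d * ifm_d
    computed = (round_up(ofm_h, gh) * round_up(ofm_w, gw)
                * round_up(ofm_d, gd) * round_up(ifm_d, gn))
    return useful / computed

# Hypothetical layer with IFM depth 48: the depth is rounded up to 64,
# so a quarter of the MACs computed are lost.
print(mac_utilization(ofm_h=16, ofm_w=16, ofm_d=32, ifm_d=48))  # 0.75
```

Passing a different `granules` tuple models another row or configuration, for example `(2, 2, 16, 32)` for the same operation on the Ethos-U65<sub>512</sub>.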

# Convolution performance of the Ethos™-U65<sub>512</sub>

In the following tables k, h, w, d, and n should be integers to achieve the MACs per cycle as specified for the operation. For any non-integer values, the hardware effectively rounds this value up and the extra MACs computed as a consequence are lost.

| Note |
|------|
| Cells marked "WB" denote weight-bound values. The actual performance of weight-bound layers depends on the number of weights that can be compressed by the weight decoder per cycle. This number is affected by the compression ratio and the bandwidth of the memory available for the weights. (The capacity of the weight decoder itself is unaffected.) |

Table 3-128 Convolution performance for 8-bit activations

| Operation             | Kernel size | OFM height | OFM width | OFM depth | IFM depth | MACs per cycle |
|-----------------------|-------------|------------|-----------|-----------|-----------|----------------|
| CONV2D (depth first)  | kh*kw=1*k   | 2*h        | 2*w       | 16*d      | 32*n      | 512            |
| CONV2D (kernel first) | kh*kw=4*k   | 2*h        | 2*w       | 16*d      | 8*n       | 512            |
| CONV1D (depth first)  | kh=1 kw=1*k | 1          | 4*w       | 16*d      | 32*n      | 512            |
| CONV1D (kernel first) | kh=1 kw=4*k | 1          | 4*w       | 16*d      | 8*n       | 512            |
| Fully connected       | kh=1 kw=1   | 1          | 1         | 16*d      | 32*n      | WB             |
| DepthwiseConv2D       | kh*kw=4*k   | 2*h        | 2*w       | 16*d      | 8*n       | 64             |
| DepthwiseConv1D       | kh=1 kw=4*k | 1          | 4*w       | 16*d      | 8*n       | 64             |

Table 3-129 Convolution performance for 16-bit activations

| Operation             | Kernel size | OFM height | OFM width | OFM depth | IFM depth | MACs per cycle |
|-----------------------|-------------|------------|-----------|-----------|-----------|----------------|
| CONV2D (depth first)  | kh*kw=1*k   | 2*h        | 2*w       | 16*d      | 16*n      | 256            |
| CONV2D (kernel first) | kh*kw=2*k   | 2*h        | 2*w       | 16*d      | 8*n       | 256            |
| CONV1D (depth first)  | kh=1 kw=1*k | 1          | 4*w       | 16*d      | 16*n      | 256            |
| CONV1D (kernel first) | kh=1 kw=2*k | 1          | 4*w       | 16*d      | 8*n       | 256            |
| Fully connected       | kh=1 kw=1   | 1          | 1         | 16*d      | 16*n      | WB             |
| DepthwiseConv2D       | kh*kw=4*k   | 2*h        | 2*w       | 16*d      | 8*n       | 32             |
| DepthwiseConv1D       | kh=1 kw=4*k | 1          | 4*w       | 16*d      | 8*n       | 32             |

# 3.8.4 Elementwise performance

The following tables detail the elementwise performance of the Ethos-U65 NPU by configuration.

The performance of elementwise operations depends on the configuration of the NPU and on the operation performed, as shown in the following tables. Some operations are bound by the bandwidth required to read and write the operation's data in external SRAM.

## Elementwise performance of the Ethos™-U65 NPU

Table 3-130 Operations per cycle for 8-bit activations

| Ethos-U65 configuration | LReLU, ABS | MIN, MAX          | MUL  | Simple ADD,<br>SUB | Advanced ADD,<br>SUB | LUT, tanh,<br>sigmoid |
|-------------------------|------------|-------------------|------|--------------------|----------------------|-----------------------|
| 256                     | 4          | 4                 | 2.67 | 2                  | 1.33                 | 1                     |
| 512                     | 8          | 5.33 <sup>a</sup> | 5.33 | 4                  | 2.67                 | 2                     |

a This value is memory-bound.

Table 3-131 Operations per cycle for 16-bit activations

| Ethos-U65 configuration | LReLU, ABS     | MIN, MAX          | MUL   | Simple ADD,<br>SUB | Advanced ADD,<br>SUB | LUT, tanh,<br>sigmoid |
|-------------------------|----------------|-------------------|-------|--------------------|----------------------|-----------------------|
| 256                     | 4              | 2.67 <sup>a</sup> | 2.67  | 2                  | 1.33                 | 1                     |
| 512                     | 4 <sup>a</sup> | 2.67 <sup>a</sup> | 2.67 <sup>a</sup> | 2.67 <sup>a</sup>  | 2.67                 | 2                     |

Table 3-132 Operations per cycle for 32-bit activations

| Ethos-U65 configuration | MUL, 8 to<br>32-bit | ADD, SUB,<br>8 to 32-bit | MUL, 16 to<br>32-bit | ADD, SUB,<br>16 to 32-bit | MUL, 32-<br>bit   | ADD, SUB,<br>32-bit | CLZ            | SHL,<br>SHR       |
|-------------------------|---------------------|--------------------------|----------------------|---------------------------|-------------------|---------------------|----------------|-------------------|
| 256                     | 2                   | 2                        | 2                    | 2                         | 0.89              | 0.89                | 1.6            | 0.89              |
| 512                     | 2.67 <sup>a</sup>   | 2.67 <sup>a</sup>        | 2 <sup>a</sup>       | 2 <sup>a</sup>            | 1.33 <sup>a</sup> | 1.33 <sup>a</sup>   | 2 <sup>a</sup> | 1.33 <sup>a</sup> |
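
As a rough illustration of how the operations-per-cycle figures translate into run time, the following sketch divides the number of elements by the rate from Table 3-130. It assumes that one operation produces one output element, copies only two columns of the table, and ignores setup overhead; the dictionary and function names are hypothetical.

```python
import math

# Operations per cycle for 8-bit activations, copied from Table 3-130
# (only the MUL and simple ADD/SUB columns are included here).
OPS_PER_CYCLE_8BIT = {
    256: {"MUL": 2.67, "SIMPLE_ADD_SUB": 2},
    512: {"MUL": 5.33, "SIMPLE_ADD_SUB": 4},
}

def estimate_cycles(num_elements, config, operation):
    """Idealized cycle estimate: elements divided by operations per cycle."""
    return math.ceil(num_elements / OPS_PER_CYCLE_8BIT[config][operation])

elements = 16 * 16 * 32  # hypothetical 16x16x32 tensor
print(estimate_cycles(elements, 256, "SIMPLE_ADD_SUB"))  # 4096
print(estimate_cycles(elements, 512, "SIMPLE_ADD_SUB"))  # 2048
```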

# 3.9 Block based operation

Due to limited internal storage, the NPU must break down an operation into smaller jobs.

The stripe is divided into one or more blocks, and the jobs scheduled by the hardware are processed one block at a time. The size of each block is specified in the command stream, and each block must follow the restrictions described in this section. If the stripe size is not a multiple of the block size, the hardware runs partial blocks at the edge of the stripe.

## **Output feature map**

The NPU generates the *Output Feature Map* (OFM) of an operation in blocks which repeat in z, x, y order over the OFM. The size of each OFM block must not exceed the size of the available *SHared RAM* (SHRAM).

Each block is configured in the command stream according to the following restrictions:

- OFM\_BLOCK\_WIDTH must be in the range 1-64 and a multiple of MIN\_BLOCK\_WIDTH.
- OFM\_BLOCK\_HEIGHT must be in the range 1-32 and a multiple of MIN\_BLOCK\_HEIGHT.
- OFM\_BLOCK\_DEPTH must be in the range 1-128 and a multiple of MIN\_BLOCK\_DEPTH.
- If OFM\_BLOCK\_DEPTH is not a multiple of 16, then OFM\_DEPTH <= OFM\_BLOCK\_DEPTH.
- OFM\_MEMBLK\_DEPTH is set to OFM\_BLOCK\_DEPTH/(1+PARALLEL\_MODE).

The minimum block sizes are listed in the following table.

Table 3-133 Minimum block sizes

| Configuration | PARALLEL_MODE | MIN_BLOCK_HEIGHT | MIN_BLOCK_WIDTH | MIN_BLOCK_DEPTH |
|---------------|---------------|------------------|-----------------|-----------------|
| 256           | 0             | 2                | 2               | 8               |
| 512           | 0             | 2                | 2               | 8               |
| 512           | 1             | 2                | 2               | 16              |

## Input feature map

To generate an OFM block, the NPU reads one or more *Input Feature Map* (IFM) blocks. An upper limit on the size of an IFM block is derived from the OFM block size and the operation being performed, as listed in the following table.

| Note |
|------|
| The size of the IFM and OFM blocks must not exceed the size of the available SHRAM. For more information about the size of the available SHRAM, see 3.9.1 Internal shared RAM on page 3-109. |

Table 3-134 IFM block size limit

| Dimension        | OFM block size and operation                                                                                                                                                |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| IFM_BLOCK_HEIGHT | ALIGN_HEIGHT(min(ifm_get_height(OFM_BLOCK_HEIGHT, min(kernel_split_size, dilated_kernel_height)), ifm_get_height(OFM_HEIGHT, dilated_kernel_height - PAD_TOP - PAD_BOTTOM))) |
| IFM_BLOCK_WIDTH  | ALIGN_WIDTH(min(ifm_get_width(OFM_BLOCK_WIDTH, min(kernel_split_size, dilated_kernel_width)), ifm_get_width(OFM_WIDTH, dilated_kernel_width - PAD_LEFT - PAD_RIGHT)))        |
| IFM_MEMBLK_DEPTH | OFM_MEMBLK_DEPTH for a depth-wise convolution, max or average pooling and elementwise operations                                                                            |
|                  | ALIGN(min(32, IFM_DEPTH), 8) for conv2d, fully connected or reduce_sum with 8-bit activations and kernel_weight_order=0                                                     |
|                  | ALIGN(min(16, IFM_DEPTH), 8) for conv2d, fully connected or reduce_sum with 8-bit activation and kernel_weight_order=1                                                      |
|                  | ALIGN(min(16, IFM_DEPTH), 4) for conv2d, fully connected or reduce_sum with 16-bit activation tensor                                                                        |
|                  | ALIGN(min(8, IFM_DEPTH), 2) for reduce_sum with 32-bit activation                                                                                                           |

The definitions used in the preceding table are:

- ALIGN(x, n) = (int)ceil(x/(float)n)\*n = (x + (n-1)) &~ (n-1)
- ALIGN\_HEIGHT(h) = ALIGN(h, MIN\_BLOCK\_HEIGHT)
- ALIGN\_WIDTH(w) = ALIGN(w, MIN\_BLOCK\_WIDTH)
- ifm\_get\_height(ofm\_height, border\_height) = (int)ceil(((ofm\_height-1)\*kernel\_y\_stride + border\_height)/(float)upscaling\_factor\_y)
- ifm\_get\_width(ofm\_width, border\_width) = (int)ceil(((ofm\_width-1)\*kernel\_x\_stride + border\_width)/(float)upscaling\_factor\_x)
- dilated\_kernel\_height = (kernel\_height-1)\*kernel\_y\_dilation+1, dilated\_kernel\_width = (kernel\_width-1)\*kernel\_x\_dilation+1
- upscaling\_factor\_x = upscaling\_factor\_y = (ifm\_upscale\_mode!=0 ? 2 : 1)
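
The definitions above can be transcribed into a short sketch for the height dimension; the width follows the same pattern. The Python names are hypothetical, `kernel_split_size` is not defined in this section and is treated here as a caller-supplied value, and the default of 2 for `min_block_height` is taken from Table 3-133. Note that the bit-mask form of ALIGN, `(x + (n-1)) & ~(n-1)`, matches the ceiling form only when n is a power of two.

```python
import math

def align(x, n):
    """ALIGN(x, n): round x up to a multiple of n."""
    return math.ceil(x / n) * n

def ifm_get_size(ofm_size, border, stride, upscale):
    """Shared form of ifm_get_height and ifm_get_width."""
    return math.ceil(((ofm_size - 1) * stride + border) / upscale)

def ifm_block_height(ofm_block_height, ofm_height, kernel_height, *,
                     kernel_split_size, stride=1, dilation=1, upscale=1,
                     pad_top=0, pad_bottom=0, min_block_height=2):
    """Upper limit on IFM_BLOCK_HEIGHT, following Table 3-134."""
    dilated = (kernel_height - 1) * dilation + 1
    per_block = ifm_get_size(ofm_block_height,
                             min(kernel_split_size, dilated), stride, upscale)
    whole_ofm = ifm_get_size(ofm_height,
                             dilated - pad_top - pad_bottom, stride, upscale)
    return align(min(per_block, whole_ofm), min_block_height)

# Hypothetical 3x3 convolution, stride 1, padding 1, OFM block height 8.
print(ifm_block_height(8, 32, 3, kernel_split_size=8,
                       pad_top=1, pad_bottom=1))  # 10
```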

#### **Block dependency**

The output of one operation is the input of the following operation. The NPU breaks down the output and input operations into blocks, creating dependencies between each block. The dependency between blocks is specified in the command stream, which ensures the hardware writes the input data before the input data is read. Correctly setting the block dependency allows the hardware to run two operations back to back more efficiently without having to flush the hardware pipeline.

Each block operation reads an IFM block and updates or completes an OFM block. The order of block operations is:

- For depth-wise convolution, pooling, or elementwise operations, the block operations iterate over the IFM and OFM blocks at the same position in z, x, y order (depth, horizontal, then vertical). The IFM block position matches the OFM block position.
- For convolution-2D, the block operations iterate over the OFM blocks in z, x, y order and for each OFM block, the IFM block iterates over the IFM in z order. Each separate IFM block for the same OFM block counts as a separate block operation.

NPU\_SET\_BLOCKDEP takes a block offset k as a parameter. The block dependency guarantees that IFM block read n in the kernel does not start until all OFM block writes of the previous kernel operation, except the last max(k-n, 0), are complete and written to memory.

The following figure shows an example with two stripes, each of five blocks, A0-A4 and B0-B4. The B operation is applied to the output of the A operation but, due to the filter margin, the read of block B(i) depends on the write of block A(i+1), as indicated by the arrows.



Figure 3-2 Example blocks

The example shows blocks issued in normal order A0, A1, A2, A3, A4, B0, B1, B2, B3, B4, but B0 is not permitted to start until A1 is complete and written to memory. Similarly, B1 is not permitted to start until A2 is complete. This sequence continues until A4 is complete, and B3 is then permitted to start.

An example of how the dependency is expressed in the command stream is:

- NPU\_OP\_A issues operation A
- NPU\_SET\_BLOCKDEP #3 expresses the B->A dependency as three block operations
- NPU\_OP\_B issues operation B
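
The effect of the block offset in this example can be checked with a few lines of code. The sketch below reads the rule as "all but max(k-n, 0) of the OFM block writes of the previous operation must be complete", which is an interpretation that reproduces the A0-A4, B0-B4 example; the function name is hypothetical.

```python
def required_a_blocks(n, k, total_a_blocks):
    """Number of A block writes that must be complete before B block read n starts."""
    outstanding = max(k - n, 0)  # writes that are allowed to be still pending
    return max(total_a_blocks - outstanding, 0)

# Five A blocks and NPU_SET_BLOCKDEP #3, as in the example above.
for n in range(5):
    done = required_a_blocks(n, k=3, total_a_blocks=5)
    print(f"B{n} waits for A0..A{done - 1}")
```

With k=3, B0 waits for A1, B1 for A2, B2 for A3, and B3 and B4 for A4, which matches the sequence described for the figure.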

This section contains the following subsection:

- 3.9.1 Internal shared RAM.

#### 3.9.1 Internal shared RAM

The NPU has internal SHared RAM (SHRAM) that stores data.

#### SHRAM purpose and buffers

The purposes of the SHRAM are:

- To store data that the NPU is processing, for example *Input Feature Map* (IFM) blocks, accumulators, or *Lookup Table* (LUT) definitions which allow for data reuse.
- To store data being transferred to or from external memory by the *Direct Memory Access* (DMA) controller which absorbs memory read or write latency.

The following table lists the buffers that are placed within SHRAM.

Table 3-135 SHRAM buffers

| Buffer      | Buffer<br>entries | Buffer contents |
|-------------|-------------------|-----------------|
| IFM         | IB0, IB1          | Double-buffered input block buffers. The size in bytes must be at least IFM_BLOCK_HEIGHT * IFM_BLOCK_WIDTH * ALIGN(IFM_MEMBLK_DEPTH*IFM_BYTEWIDTH,8). |
| IFM2        | IB0, IB1          | Double-buffered IFM2 input block buffers. The size in bytes must be at least IFM2_BLOCK_HEIGHT * IFM2_BLOCK_WIDTH * ALIGN(IFM2_MEMBLK_DEPTH*IFM_BYTEWIDTH,8), where each IFM2 dimension is equal to the corresponding IFM dimension, or is set to one if the dimension is broadcast. |
| Accumulator | ACC0,<br>ACC1     | Double-buffered accumulator and output block buffers. The size in bytes must be at least min(OFM_HEIGHT, OFM_BLOCK_HEIGHT) * min(OFM_WIDTH, OFM_BLOCK_WIDTH) * ALIGN(OFM_MEMBLK_DEPTH,8) * ACC_BYTEWIDTH. |
| Output      | OB0, OB1          | Scale output to OFM streaming buffer. The size of OB0 and OB1 is fixed at 1KB. |
| LUT         | Tables            | A single 2KB buffer that, if used, must be in the last 2KB of the SHRAM. |
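
The minimum buffer sizes in Table 3-135 are straightforward to evaluate. The following sketch transcribes the IFM and accumulator formulas; the function names and the example block dimensions are hypothetical, and the byte widths passed in (1 byte per IFM element, 4 bytes per accumulator) are assumptions for an 8-bit IFM with 32-bit accumulators.

```python
def align(x, n):
    """ALIGN(x, n): round x up to a multiple of n using integer arithmetic."""
    return -(-x // n) * n

def ifm_buffer_bytes(block_h, block_w, memblk_depth, bytewidth):
    """IFM_BLOCK_HEIGHT * IFM_BLOCK_WIDTH * ALIGN(IFM_MEMBLK_DEPTH*IFM_BYTEWIDTH, 8)."""
    return block_h * block_w * align(memblk_depth * bytewidth, 8)

def acc_buffer_bytes(ofm_h, ofm_w, block_h, block_w, memblk_depth, acc_bytewidth):
    """min(OFM_HEIGHT, OFM_BLOCK_HEIGHT) * min(OFM_WIDTH, OFM_BLOCK_WIDTH)
    * ALIGN(OFM_MEMBLK_DEPTH, 8) * ACC_BYTEWIDTH."""
    return (min(ofm_h, block_h) * min(ofm_w, block_w)
            * align(memblk_depth, 8) * acc_bytewidth)

# Hypothetical 10x10x32 IFM block with 1-byte elements.
print(ifm_buffer_bytes(10, 10, 32, 1))         # 3200
# Hypothetical 8x8x16 OFM block of a 32x32 OFM with 4-byte accumulators.
print(acc_buffer_bytes(32, 32, 8, 8, 16, 4))   # 4096
```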

#### **SHRAM** format

The SHRAM is divided into 1KB units and each buffer is a whole number of kilobytes.

The following table lists the SHRAM layout for non-elementwise operations. For all configurations of the NPU, the value of t is set to one.

Table 3-136 Non-elementwise operations

| Bank address (KB) | Bank<br>+0KB | Bank<br>+1KB | Notes                          |                                                         |
|-------------------|--------------|--------------|--------------------------------|---------------------------------------------------------|
| 0                 | OB0          | OB1          | Output data buffer             | -                                                       |
| 2                 | IB0          | IB1          | IFM data buffer                | The IFM data buffer is allocated from bank address 2 to |
| 4                 | IB0          | IB1          | IFM data buffer                | IFM_IB_END-2 in steps of 2KB.                           |
|                   |              |              |                                |                                                         |
| IFM_IB_END-2      | IB0          | IB1          | IFM data buffer                |                                                         |
| IFM_IB_END+0      | -            | -            | Not used for the current block | Unallocated bank addresses can be zero or greater.      |
| AB_START+0        | ACC0         | ACC0         | Accumulator buffer 0           | The accumulator buffer is allocated from bank address   |
| AB_START+2        | ACC1         | ACC1         | Accumulator buffer 1           | AB_START+0 to SB_SIZE-2-2t in steps of 2KB.             |
| AB_START+4        | ACC0         | ACC0         | Accumulator buffer 0           |                                                         |
|                   |              |              |                                |                                                         |
| SB_SIZE-2-2t      | ACC1         | ACC1         | Accumulator buffer 1           |                                                         |

The following table lists the SHRAM layout for elementwise operations, where AB\_START=SB\_SIZE.

Table 3-137 Elementwise operations

| Bank address (KB) | Bank<br>+0KB | Bank<br>+1KB | Notes                          |                                                                                                      |  |
|-------------------|--------------|--------------|--------------------------------|------------------------------------------------------------------------------------------------------|--|
| 0                 | OB0          | OB1          | Output data buffer             | -                                                                                                    |  |
| 2                 | IB0          | IB1          | IFM data buffer                | The IFM data buffer is allocated from bank address 2 to                                              |  |
| 4                 | IB0          | IB1          | IFM data buffer                | IFM2_IB_START-2 in steps of 2KB.                                                                     |  |
|                   |              |              |                                |                                                                                                      |  |
| IFM2_IB_START-2   | IB0          | IB1          | IFM data buffer                |                                                                                                      |  |
| IFM2_IB_START+0   | IB0          | IB1          | IFM2 data buffer               | The IFM2 data buffer is allocated from bank address IFM2_IB_START+0 to IFM_IB_END-2 in steps of 2KB. |  |
|                   |              |              |                                |                                                                                                      |  |
| IFM_IB_END-2      | IB0          | IB1          | IFM2 data buffer               |                                                                                                      |  |
| IFM_IB_END+0      | -            | -            | Not used for the current block | IFM_IB_END+0 to SB_SIZE-2-2t are unallocated bank addresses.                                         |  |
| •••               |              |              |                                |                                                                                                      |  |
| SB_SIZE-2-2t      | -            | -            | Not used for the current block |                                                                                                      |  |

#### **Buffer restrictions**

The following table lists the restrictions on IB\_END, IFM2\_IB\_START, and AB\_START. The table also lists the total RAM size, SB\_SIZE, for each NPU configuration. The values n, m, and k are positive integers determining the size of the IFM, IFM2, and accumulator buffers respectively.

Table 3-138 Buffer restrictions

| NPU<br>configuration<br>(MAC/cycle) | Elementwise non-scalar:<br>IFM2_IB_START | Elementwise non-scalar:<br>IB_END | Other operations:<br>IB_END (KB) | AB_START (KB):<br>16-bit accumulator | AB_START (KB):<br>32-bit accumulator | AB_START (KB):<br>40-bit accumulator | AB_START (KB):<br>Elementwise | SB_SIZE in KB |
|-------------------------------------|------------------------------------------|-----------------------------------|----------------------------------|--------------------------------------|--------------------------------------|--------------------------------------|-------------------------------|---------------|
| 256                                 | 2+8*n                                    | 2+8*(n+m)                         | 2+8*n                            | 46-8*k                               | 46-16*k                              | 46-20*k                              | 46                            | 48            |
| 512                                 | 2+8*n                                    | 2+8*(n+m)                         | 2+8*n                            | 46-8*k                               | 46-16*k                              | 46-20*k                              | 46                            | 48            |

The values must satisfy IFM2\_IB\_START <= IB\_END <= AB\_START. The input and accumulator buffer regions must be large enough to hold the configured block size.
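
For non-elementwise operations, the expressions in Table 3-138 and the ordering constraint can be checked together. The sketch below assumes that the table columns are read as IB\_END = 2+8\*n and AB\_START = 46 minus a per-accumulator-width step times k, which are the same for both NPU configurations; the names are hypothetical and the check does not confirm that the regions are large enough for the configured block size.

```python
# AB_START step in KB per unit of k, by accumulator width in bits (Table 3-138).
AB_START_STEP_KB = {16: 8, 32: 16, 40: 20}

def non_elementwise_layout(n, k, acc_bits):
    """Return (IB_END, AB_START, valid) in KB for a non-elementwise operation."""
    ib_end = 2 + 8 * n
    ab_start = 46 - AB_START_STEP_KB[acc_bits] * k
    return ib_end, ab_start, ib_end <= ab_start

print(non_elementwise_layout(n=1, k=2, acc_bits=32))  # (10, 14, True)
print(non_elementwise_layout(n=3, k=2, acc_bits=32))  # (26, 14, False)
```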

#### **Buffer reconfiguration**

The SHRAM can be reconfigured between stripes and operations. The hardware uses IB\_END and AB\_START to ensure that data is not overwritten. The host processor must be aware that if IB\_END of the current operation is larger than AB\_START of the previous operation, a pipeline delay occurs.

| Note |
|------|
| Because the accumulators are not required for elementwise operations, set AB_START to SB_SIZE. |

# Appendix A **Signal descriptions**

This appendix describes the signals for the processor.

It contains the following sections:

- A.1 Clock and reset signals.
- A.2 Interrupt signals.
- A.3 Power management signals.
- A.4 AMBA® 5 AXI master signals.
- A.5 AMBA® 4 APB slave signals.
- A.6 DFT and MBIST signals.

#### A.1 Clock and reset signals

The processor has one clock signal and two reset signals.

The following table lists the clock and reset signals.

Table A-1 Clock and reset signals

| Signal      | Direction | Description                                                                                                                                                        |
|-------------|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CLK         | Input     | Clock input.                                                                                                                                                       |
| nRESET      | Input     | The reset. This signal is an asynchronous, active-LOW signal.                                                                                                      |
| nMBISTRESET | Input     | The reset that is used to prepare the IP for MBIST mode. This signal is an asynchronous, active-LOW signal.                                                        |
| PORPL       | Input     | The Power-On-Reset Privilege Level (PORPL). This signal sets the privilege level of the NPU after a hard reset. LOW means User level. HIGH means Privileged level. |
| PORSL       | Input     | The Power-On-Reset Security Level (PORSL). This signal sets the security level of the NPU after a hard reset. LOW means Secure. HIGH means Non-secure.             |
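The PORPL and PORSL encodings in the table can be summarized with a small lookup. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that the two pins are passed in as 0 (LOW) or 1 (HIGH).

```python
from typing import Tuple


def reset_levels(porpl: int, porsl: int) -> Tuple[str, str]:
    """Map the PORPL and PORSL pin levels (0 = LOW, 1 = HIGH) to the
    privilege and security levels that the NPU takes after a hard reset."""
    privilege = "Privileged" if porpl else "User"
    security = "Non-secure" if porsl else "Secure"
    return privilege, security
```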

#### Related information

A.4 AMBA® 5 AXI master signals on page Appx-A-116
A.6 DFT and MBIST signals on page Appx-A-123

#### A.2 Interrupt signals

The processor has an interrupt signal which you must connect to an interrupt controller.

The following table lists the interrupt signals.

Table A-2 Interrupt signals

| Signal | Direction | Edge or level trigger      |
|--------|-----------|----------------------------|
| IRQ    | Output    | Level triggered when HIGH. |

#### A.3 Power management signals

The processor has several signals for power management.

The following table lists the clock Q-Channel signals.

Table A-3 Clock Q-Channel signals

| Signal      | Direction | Description                                                                                         |
|-------------|-----------|-----------------------------------------------------------------------------------------------------|
| CLKQACTIVE  | Output    | This signal indicates that the NPU requires <b>CLK</b> to be active.                                |
| CLKQREQn    | Input     | This signal indicates that the clock controller wants to gate the clock. This signal is active-LOW. |
| CLKQACCEPTn | Output    | This signal indicates that the NPU accepts the clock controller request. This signal is active-LOW. |
| CLKQDENY    | Output    | This signal indicates that the NPU denies the clock controller request.                             |

The following table lists the power Q-Channel signals.

Table A-4 Power Q-Channel signals

| Signal      | Direction | Description                                                                                             |
|-------------|-----------|---------------------------------------------------------------------------------------------------------|
| PWRQACTIVE  | Output    | This signal indicates that the NPU requires power.                                                      |
| PWRQREQn    | Input     | This signal indicates that the power controller wants to power down the NPU. This signal is active-LOW. |
| PWRQACCEPTn | Output    | This signal indicates that the NPU accepts the power controller request. This signal is active-LOW.     |
| PWRQDENY    | Output    | This signal indicates that the NPU denies the power controller request.                                 |

#### A.4 AMBA® 5 AXI master signals

The master port implements a subset of AMBA 5 AXI which is compatible with AMBA 4 AXI, with the addition of **ACLKEN** and **AWAKEUP** signals.

#### M0 wake-up signal and clock enable signals

The following table lists the wake-up and clock enable signals for master 0.

Table A-5 M0 wake-up and clock enable signals

| Signal    | Direction | Description                                                                                                                       |
|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------|
| AWAKEUPM0 | Output    | This signal indicates if there is pending activity.                                                                               |
| ACLKENM0  | Input     | This signal is the clock enable. Inputs are sampled when this signal is HIGH and outputs are held stable when this signal is LOW. |

#### M0 write address channel signals

The following table lists the write address channel signals for master 0.

Table A-6 M0 write address channel signals

| Signal                 | Direction | Description                                            |
|------------------------|-----------|--------------------------------------------------------|
| AWVALIDM0              | Output    | This signal indicates that the write address is valid. |
| <b>AWIDM0</b> [7:0]    | Output    | This signal indicates the write address ID.            |
| <b>AWADDRM0</b> [39:0] | Output    | This signal indicates the write address.               |
| <b>AWLENM0</b> [7:0]   | Output    | This signal indicates the write burst length.          |
| AWSIZEM0[2:0]          | Output    | This signal indicates the write burst size.            |
| AWBURSTM0[1:0]         | Output    | This signal indicates the write burst type.            |
| AWCACHEM0[3:0]         | Output    | This signal indicates the write cache type.            |
| <b>AWPROTM0</b> [2:0]  | Output    | This signal indicates the write protection type.       |
| AWREADYM0              | Input     | This signal indicates that the write address is ready. |

#### M0 write data channel signals

The following table lists the write data channel signals for master 0.

Table A-7 M0 write data channel signals

| Signal                 | Direction | Description                                            |
|------------------------|-----------|--------------------------------------------------------|
| WVALIDM0               | Output    | This signal indicates that the write data is valid.    |
| <b>WDATAM0</b> [127:0] | Output    | This signal indicates the write data.                  |
| <b>WSTRBM0</b> [15:0]  | Output    | This signal indicates the write byte lane strobes.     |
| WLASTM0                | Output    | This signal is the write data last transfer indicator. |
| WREADYM0               | Input     | This signal indicates that the write data is ready.    |

#### M0 write response channel signals

The following table lists the write response channel signals for master 0.

Table A-8 M0 write response channel signals

| Signal             | Direction | Description                                             |
|--------------------|-----------|---------------------------------------------------------|
| BVALIDM0           | Input     | This signal indicates that the write response is valid. |
| <b>BIDM0</b> [7:0] | Input     | This signal indicates the write response ID.            |
| BRESPM0[1:0]       | Input     | This signal indicates the write response.               |
| BREADYM0           | Output    | This signal indicates that the write response is ready. |

#### M0 read address channel signals

The following table lists the read address channel signals for master 0.

Table A-9 M0 read address channel signals

| Signal                 | Direction | Description                                           |
|------------------------|-----------|-------------------------------------------------------|
| ARVALIDM0              | Output    | This signal indicates that the read address is valid. |
| <b>ARIDM0</b> [7:0]    | Output    | This signal indicates the read address ID.            |
| <b>ARADDRM0</b> [39:0] | Output    | This signal indicates the read address.               |
| <b>ARLENM0</b> [7:0]   | Output    | This signal indicates the read burst length.          |
| ARSIZEM0[2:0]          | Output    | This signal indicates the read burst size.            |
| ARBURSTM0[1:0]         | Output    | This signal indicates the read burst type.            |
| ARCACHEM0[3:0]         | Output    | This signal indicates the read cache type.            |
| ARPROTM0[2:0]          | Output    | This signal indicates the read protection type.       |
| ARREADYM0              | Input     | This signal indicates that the read address is ready. |

The DMA uses different ARID values to fetch data from external memories. The following tables list the ARIDM0 values that correspond to each stream used by the DMA.

Table A-10 ARIDM0 Ethos-U65 256

| ARID Values | Channel       |
|-------------|---------------|
| 0-3         | Cmd stream    |
| 4-31        | IFM stream    |
| 32-59       | Weight stream |
| 60-63       | Bias stream   |
| 64-115      | M2M stream    |

Table A-11 ARIDM0 Ethos-U65 512

| ARID values | Channel              |
|-------------|----------------------|
| 0-3         | Cmd stream           |
| 4-55        | IFM stream           |
| 56-83       | Weight stream core 0 |
| 84-87       | Bias stream core 0   |
| 88-139      | M2M stream           |
| 140-167     | Weight stream core 1 |
| 168-171     | Bias stream core 1   |
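When inspecting bus traces, it can be useful to map an observed ARID back to the DMA stream that issued the read. The following Python sketch encodes the ARID ranges listed for the Ethos-U65 256 and Ethos-U65 512 configurations. The table names and the `arid_to_stream` helper are hypothetical and are not part of any Arm software.

```python
from typing import List, Optional, Tuple

# (first ARID, last ARID, stream), copied from the ARID tables for each configuration.
ARID_MAP_U65_256: List[Tuple[int, int, str]] = [
    (0, 3, "Cmd stream"),
    (4, 31, "IFM stream"),
    (32, 59, "Weight stream"),
    (60, 63, "Bias stream"),
    (64, 115, "M2M stream"),
]

ARID_MAP_U65_512: List[Tuple[int, int, str]] = [
    (0, 3, "Cmd stream"),
    (4, 55, "IFM stream"),
    (56, 83, "Weight stream core 0"),
    (84, 87, "Bias stream core 0"),
    (88, 139, "M2M stream"),
    (140, 167, "Weight stream core 1"),
    (168, 171, "Bias stream core 1"),
]


def arid_to_stream(arid: int, table: List[Tuple[int, int, str]]) -> Optional[str]:
    """Return the stream whose ARID range contains arid, or None if no range matches."""
    for first, last, stream in table:
        if first <= arid <= last:
            return stream
    return None
```

The same ARID decodes differently in the two configurations: ARID 60 falls in the Bias stream range for the Ethos-U65 256 and in the Weight stream core 0 range for the Ethos-U65 512.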

#### M0 read data channel signals

The following table lists the read data channel signals for master 0.

Table A-12 M0 read data channel signals

| Signal         | Direction | Description                                           |
|----------------|-----------|-------------------------------------------------------|
| RVALIDM0       | Input     | This signal indicates that the read data is valid.    |
| RIDM0[7:0]     | Input     | This signal indicates the read data ID.               |
| RDATAM0[127:0] | Input     | This signal indicates the read data.                  |
| RRESPM0[1:0]   | Input     | This signal indicates the read data response.         |
| RLASTM0        | Input     | This signal is the read data last transfer indicator. |
| RREADYM0       | Output    | This signal indicates that the read data is ready.    |

#### M1 wake-up signal and clock enable signals

The following table lists the wake-up and clock enable signals for master 1.

Table A-13 M1 wake-up and clock enable signals

| Signal    | Direction | Description                                                                                                                       |
|-----------|-----------|-----------------------------------------------------------------------------------------------------------------------------------|
| AWAKEUPM1 | Output    | This signal indicates if there is pending activity.                                                                               |
| ACLKENM1  | Input     | This signal is the clock enable. Inputs are sampled when this signal is HIGH and outputs are held stable when this signal is LOW. |

#### M1 write address channel signals

The following table lists the write address channel signals for master 1.

Table A-14 M1 write address channel signals

| Signal               | Direction | Description                                            |
|----------------------|-----------|--------------------------------------------------------|
| AWVALIDM1            | Output    | This signal indicates that the write address is valid. |
| AWIDM1[7:0]          | Output    | This signal indicates the write address ID.            |
| AWADDRM1[39:0]       | Output    | This signal indicates the write address.               |
| <b>AWLENM1</b> [7:0] | Output    | This signal indicates the write burst length.          |
| AWSIZEM1[2:0]        | Output    | This signal indicates the write burst size.            |
| AWBURSTM1[1:0]       | Output    | This signal indicates the write burst type.            |
| AWCACHEM1[3:0]       | Output    | This signal indicates the write cache type.            |
| AWPROTM1[2:0]        | Output    | This signal indicates the write protection type.       |
| AWREADYM1            | Input     | This signal indicates that the write address is ready. |

#### M1 write data channel signals

The following table lists the write data channel signals for master 1.

Table A-15 M1 write data channel signals

| Signal                 | Direction | Description                                            |
|------------------------|-----------|--------------------------------------------------------|
| WVALIDM1               | Output    | This signal indicates that the write data is valid.    |
| <b>WDATAM1</b> [127:0] | Output    | This signal indicates the write data.                  |
| <b>WSTRBM1</b> [15:0]  | Output    | This signal indicates the write byte lane strobes.     |
| WLASTM1                | Output    | This signal is the write data last transfer indicator. |
| WREADYM1               | Input     | This signal indicates that the write data is ready.    |

#### M1 write response channel signals

The following table lists the write response channel signals for master 1.

Table A-16 M1 write response channel signals

| Signal       | Direction | Description                                             |
|--------------|-----------|---------------------------------------------------------|
| BVALIDM1     | Input     | This signal indicates that the write response is valid. |
| BIDM1[7:0]   | Input     | This signal indicates the write response ID.            |
| BRESPM1[1:0] | Input     | This signal indicates the write response.               |
| BREADYM1     | Output    | This signal indicates that the write response is ready. |

#### M1 read address channel signals

The following table lists the read address channel signals for master 1.

Table A-17 M1 read address channel signals

| Signal                 | Direction | Description                                           |
|------------------------|-----------|-------------------------------------------------------|
| ARVALIDM1              | Output    | This signal indicates that the read address is valid. |
| <b>ARIDM1</b> [7:0]    | Output    | This signal indicates the read address ID.            |
| <b>ARADDRM1</b> [39:0] | Output    | This signal indicates the read address.               |
| <b>ARLENM1</b> [7:0]   | Output    | This signal indicates the read burst length.          |
| ARSIZEM1[2:0]          | Output    | This signal indicates the read burst size.            |
| ARBURSTM1[1:0]         | Output    | This signal indicates the read burst type.            |
| ARCACHEM1[3:0]         | Output    | This signal indicates the read cache type.            |
| ARPROTM1[2:0]          | Output    | This signal indicates the read protection type.       |
| ARREADYM1              | Input     | This signal indicates that the read address is ready. |

The DMA uses different ARID values to fetch data from external memories. The following tables list the ARIDM1 values that correspond to each stream used by the DMA.

Table A-18 ARIDM1 Ethos-U65 256

| ARID Values | Channel       |
|-------------|---------------|
| 0-3         | Cmd stream    |
| 4-31        | IFM stream    |
| 32-59       | Weight stream |
| 60-63       | Bias stream   |
| 64-115      | M2M stream    |

Table A-19 ARIDM1 Ethos-U65 512

| ARID values | Channel              |
|-------------|----------------------|
| 0-3         | Cmd stream           |
| 4-55        | IFM stream           |
| 56-83       | Weight stream core 0 |
| 84-87       | Bias stream core 0   |
| 88-139      | M2M stream           |
| 140-167     | Weight stream core 1 |
| 168-171     | Bias stream core 1   |

#### M1 read data channel signals

The following table lists the read data channel signals for master 1.

Table A-20 M1 read data channel signals

| Signal             | Direction | Description                                           |
|--------------------|-----------|-------------------------------------------------------|
| RVALIDM1           | Input     | This signal indicates that the read data is valid.    |
| <b>RIDM1</b> [7:0] | Input     | This signal indicates the read data ID.               |
| RDATAM1[127:0]     | Input     | This signal indicates the read data.                  |
| RRESPM1[1:0]       | Input     | This signal indicates the read data response.         |
| RLASTM1            | Input     | This signal is the read data last transfer indicator. |
| RREADYM1           | Output    | This signal indicates that the read data is ready.    |

#### Related information

A.1 Clock and reset signals on page Appx-A-113
A.6 DFT and MBIST signals on page Appx-A-123

#### A.5 AMBA® 4 APB slave signals

The slave port implements AMBA 4 APB, with the addition of **PCLKEN** and **PWAKEUP** signals. The following table lists the AMBA 4 APB slave signals.

Table A-21 AMBA 4 APB signals

| Signal       | Direction | Description                                                                                                                                           |
|--------------|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| PWAKEUP      | Input     | This signal indicates if there is pending activity. This signal is input into an OR-gate that drives <b>CLKQACTIVE</b> .                              |
| PCLKEN       | Input     | This signal is the clock enable. Inputs are sampled when this signal is HIGH and outputs are held stable when this signal is LOW.                     |
| PSEL         | Input     | This signal indicates a transfer request.                                                                                                             |
| PENABLE      | Input     | This signal indicates the second and later cycles of an AMBA 4 APB transfer.                                                                          |
| PPROT[2:0]   | Input     | This signal indicates the transfer privilege and security level. <b>PPROT</b> [2] is an indicator for data or instruction and is not used by the NPU. |
| PWRITE       | Input     | This signal indicates a write transfer.                                                                                                               |
| PADDR[11:0]  | Input     | This signal indicates the transfer address.                                                                                                           |
| PWDATA[31:0] | Input     | This signal indicates the write data.                                                                                                                 |
| PSTRB[3:0]   | Input     | This signal indicates the write data byte strobes.                                                                                                    |
| PREADY       | Output    | This signal indicates that the slave is ready.                                                                                                        |
| PSLVERR      | Output    | This signal indicates the slave error response.                                                                                                       |
| PRDATA[31:0] | Output    | This signal indicates the slave read data.                                                                                                            |
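The following Python sketch shows one way to decode a **PPROT**[2:0] value when debugging APB accesses. The bit assignment is taken from the AMBA 4 APB protocol, not from the table above: bit 0 is privileged access, bit 1 is Non-secure access, and bit 2 is instruction access. The function name is hypothetical.

```python
from typing import Dict


def decode_pprot(pprot: int) -> Dict[str, bool]:
    """Decode a 3-bit PPROT value using the AMBA 4 APB bit assignment."""
    return {
        "privileged": bool(pprot & 0b001),   # PPROT[0]: 0 = normal, 1 = privileged
        "non_secure": bool(pprot & 0b010),   # PPROT[1]: 0 = Secure, 1 = Non-secure
        "instruction": bool(pprot & 0b100),  # PPROT[2]: data or instruction, not used by the NPU
    }
```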

#### A.6 DFT and MBIST signals

The NPU has several DFT and MBIST signals that you must connect.

The following table lists the DFT and MBIST signals.

Table A-22 DFT and MBIST signals

| Signal             | Direction | Description                                                                                     |
|--------------------|-----------|-------------------------------------------------------------------------------------------------|
| DFTCGEN            | Input     | This signal forces the clock gates on during scan shift.                                        |
| DFTRSTDISABLE[1:0] | Input     | This signal disables the internal synchronized reset during scan shift.                         |
| DFTRAMHOLD         | Input     | This signal disables the RAM chip select during scan shift.                                     |
| MBISTREQ           | Input     | This signal is the MBIST test request.                                                          |
| nMBISTRESET        | Input     | This signal is the MBIST reset for the whole NPU. This active-LOW signal overrides the system resets when the <b>MBISTREQ</b> signal is asserted. |

#### Related information

A.4 AMBA® 5 AXI master signals on page Appx-A-116
A.1 Clock and reset signals on page Appx-A-113

# Appendix B **General neural network concepts**

This appendix describes the various concepts Arm uses to describe the NPU.

It contains the following section:

• B.1 General neural network concepts.

#### B.1 General neural network concepts

Arm uses various concepts to describe the NPU.

The following list describes how Arm uses these architectural concepts in this document:

#### Feature map

A feature map is a 3D array of elements. Feature maps are the data that the layers of a neural network consume and produce. The NPU works with 8-bit or 16-bit integer elements. For example, the initial input to an image recognition network might be a three channel feature map. In this example, the channels correspond to the red, green, and blue color planes of an image. Each element contains an RGB value. Therefore, the feature maps for the first layer describe the image.

**Note**

Integer elements can also be described as activation values to distinguish them from weight values.

#### Layer

A neural network (NN) is composed of several layers; the input to one layer is the output from a prior layer. The NPU is designed to process the layers of a network without requiring interaction from the host application processor. There are various types of layers. Convolutional neural networks (CNNs) are so named because they make heavy use of convolutional layers.

#### NHWC and NCHW

NHWC and NCHW are standard memory formats of feature maps. Each letter in the NHWC and NCHW memory formats represents an axis of the feature map. The order of the letters represents the sequence of data when stored in memory. The letters of the memory formats represent:

N Number of batches.

H Height.

W Width.

C Channels.

NHWC is the standard format for the TensorFlow Lite stack used by the NPU.
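To make the difference between the two orderings concrete, the following Python sketch computes the linear element offset of the same feature map element under each format. It assumes a densely packed tensor with no padding or custom strides, which is a simplification and not necessarily how the NPU addresses feature maps. The function names are hypothetical.

```python
def nhwc_offset(n: int, h: int, w: int, c: int, H: int, W: int, C: int) -> int:
    """Element offset in NHWC order: channels vary fastest, then width, then height."""
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n: int, h: int, w: int, c: int, H: int, W: int, C: int) -> int:
    """Element offset in NCHW order: width varies fastest, then height, then channels."""
    return ((n * C + c) * H + h) * W + w
```

For a feature map with H=2, W=3, and C=4, the element at h=0, w=1, c=0 is at offset 4 in NHWC, because the four channels of the first pixel come first, and at offset 1 in NCHW, because the first row of channel 0 comes first.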

#### Weights, kernels, and filters

Weights, kernels, and filters are all related concepts. A filter is an operation on a signal. A kernel is a linear function that is used within a convolution as a filter. A kernel can be represented as a matrix. A weight is an individual element of this matrix.
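The relationship between a kernel and its weights can be illustrated with a single output element of a convolution: the kernel matrix is overlaid on an equally sized patch of the input, and each weight multiplies the element beneath it. The following Python sketch is a conceptual illustration only and does not model how the NPU computes convolutions. The function name is hypothetical.

```python
from typing import List


def apply_kernel(patch: List[List[int]], kernel: List[List[int]]) -> int:
    """Multiply each input element by the weight at the same position and sum the products."""
    return sum(
        value * weight
        for patch_row, kernel_row in zip(patch, kernel)
        for value, weight in zip(patch_row, kernel_row)
    )
```

For example, applying the 2x2 kernel `[[1, 0], [0, -1]]` to the patch `[[1, 2], [3, 4]]` gives 1 - 4 = -3.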

## Appendix C **Boot flow information**

This appendix describes the various boot flows for the NPU.

It contains the following section:

• *C.1 Boot flow information*.

#### **C.1 Boot flow information**

This appendix describes the software interactions needed to boot up the NPU, perform a soft reset of the NPU, and power down the NPU.

#### **Boot flow**

At system start-up, the NPU is normally powered down. You must do the following before the NPU can be used:

1. If Power Q-Channels are not supported, you must:

   a. Assert **nRESET**.

   b. Enable NPU power.

   c. Deassert **nRESET**.

   **Note**

   The software interface to control the deassertion of the **nRESET** signal and NPU power is platform-dependent.

2. Perform a write to the CMD register.

   **Note**

   To ensure the NPU demands power, set the field power\_q\_enable to 0x0.

   Setting the field clock\_q\_enable to 0x0 ensures the NPU demands that the clock is running. Setting the field clock\_q\_enable to 0x1 enables automatic high-level clock gating. Arm recommends setting the field clock\_q\_enable to 0x1.

   Set all other fields in the CMD register to 0x0. For more information about the CMD register, see *CMD* on page 3-34.

3. To ensure the NPU is now in a known state, Arm recommends doing a soft reset. A soft reset mitigates the risk that power was on before step 1 and **nRESET** was not asserted. For more information about the soft reset, see *Soft reset flow* on page Appx-C-127.

#### Soft reset flow

A soft reset is used for setting the NPU in a known state and to update the NPU security status. Do the following to perform a soft reset of the NPU:

1. To trigger a soft reset, write to the RESET register. For more information about setting the fields pending\_CSL and pending\_CPL, see *RESET* on page 3-35.

2. Read the STATUS register until the field reset\_status no longer yields the value 0x1.

   **Note**

   The value 0x1 indicates a soft reset phase is in progress. During this phase, no other APB accesses are allowed.

3. Write the CMD register. If the Power Q-Channel is used, set the field power\_q\_enable to 0x0 to keep power enabled.

   Setting the field clock\_q\_enable to 0x0 ensures the NPU demands that the clock is running. Setting the field clock\_q\_enable to 0x1 enables automatic high-level clock gating. Arm recommends setting the field clock\_q\_enable to 0x1.
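The soft reset sequence can be sketched in software as follows. This Python example runs against a toy `MockNpu` class rather than real hardware: the class, its methods, and the `max_polls` bound are all hypothetical, and no register addresses or bit positions are modeled. Only the field names pending\_CPL, pending\_CSL, reset\_status, power\_q\_enable, and clock\_q\_enable come from this appendix.

```python
class MockNpu:
    """Toy stand-in for the NPU register interface; not the real register map."""

    def __init__(self, reset_polls: int = 2):
        self.cmd = {}
        self.reset_written = False
        self._reset_polls = reset_polls  # number of reads that report "in progress"
        self._polls_left = 0

    def write_reset(self, pending_cpl: int, pending_csl: int) -> None:
        """Model a write to the RESET register, which starts a soft reset."""
        self.reset_written = True
        self._polls_left = self._reset_polls

    def read_reset_status(self) -> int:
        """Model the reset_status field of STATUS: 1 while the reset is in progress."""
        if self._polls_left > 0:
            self._polls_left -= 1
            return 1
        return 0

    def write_cmd(self, **fields: int) -> None:
        """Model a write to the CMD register; fields not named are treated as 0x0."""
        self.cmd = dict(fields)


def soft_reset(npu: MockNpu, pending_cpl: int, pending_csl: int, max_polls: int = 1000) -> None:
    # Step 1: trigger the soft reset by writing the RESET register.
    npu.write_reset(pending_cpl, pending_csl)
    # Step 2: poll only STATUS until reset_status no longer reads 0x1.
    for _ in range(max_polls):
        if npu.read_reset_status() != 1:
            break
    else:
        raise TimeoutError("soft reset did not complete")
    # Step 3: keep power enabled and use the recommended clock gating setting.
    npu.write_cmd(power_q_enable=0, clock_q_enable=1)
```

The polling loop is bounded so that a fault cannot hang the caller. A real driver would choose its own timeout policy and would issue no other APB accesses while reset\_status reads 0x1.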

#### Powering down flow

Do the following to power down the NPU:

1. Acknowledge any pending interrupts by writing to the CMD register.

   **Note**

   All interrupts must be cleared for power down to occur.

   For more information about the CMD register, see *CMD* on page 3-34.

2. Write to the CMD register.

   **Note**

   The field power\_q\_enable must be set to 0x1 to permit power down.

3. After the preceding sequence of register writes, the NPU starts powering down by handshaking with the power controller.

Related concepts

2.2 Security and boot flow on page 2-21

## Appendix D **Revisions**

This appendix describes the technical changes between releases of this book.

It contains the following section:

• D.1 Revisions.

#### D.1 Revisions

This appendix describes the technical changes between releases of this manual.

Table D-1 First development release for r0p0

| Change        | Location | Affects |
|---------------|----------|---------|
| First release | -        | -       |

Table D-2 First beta release for r0p0

| Change                                                                                             | Location                                                                       | Affects |
|----------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|---------|
| Clarified power requirements.                                                                      | 1.1.1 Supported application programming interfaces on page 1-12                | All     |
| Clarified signals that set security and privilege levels on the NPU.                               | 2.2 Security and boot flow on page 2-21                                        | All     |
| Improved description of clock settings.                                                            | 2.3.1 External interfaces on page 2-22                                         | All     |
| Improved description of the Central Control.                                                       | 2.3.2 Central control on page 2-23                                             | All     |
| Improved description of the weight and scaling and bias channels.                                  | 2.3.3 DMA controller on page 2-23                                              | All     |
| Extended description of the Output unit, added new activation function and elementwise operations. | 2.3.7 Output unit on page 2-26                                                 | All     |
| Improved description of each function.                                                             | tanh, sigmoid, and LUT on page 2-27                                            | All     |
| Updated register pages.                                                                            | NPU_REG on page 3-30, NPU_BP on page 3-48, NPU_IDS on page 3-54, PMU on page 3-59 | All     |
| Updated cmd0 and cmd1 commands.                                                                    | 3.6.4 cmd0 commands on page 3-78, 3.6.5 cmd1 commands on page 3-84                | All     |
| Updated weight blocks and ordering section.                                                        | 3.7.5 Weight blocks and ordering on page 3-92                                  | All     |

Table D-3 First EAC release for r0p0

| Change                                                            | Location                                  | Affects |
|-------------------------------------------------------------------|-------------------------------------------|---------|
| Added note on how the different channels are controlled and used. | 2.3.3 DMA controller on page 2-23         | All     |
| Added detail to the Leaky ReLU description.                       | ReLU and Leaky ReLU on page 2-27          | All     |
| Added bullet describing how access to the registers is granted.   | 3.1 Register characteristics on page 3-29 | All     |
| Updated register details throughout.                              | NPU_BP on page 3-48                       | All     |
| Added events to Field EV_TYPE table and reordered entries.        | PMEVTYPER0 on page 3-64                   | All     |
| Added command stream examples.                                    | 3.6 Command stream on page 3-76           | All     |
| Updated cmd1 information.                                         | 3.6.5 cmd1 commands on page 3-84          | All     |
| Updated the weight stream information.                            | 3.7 Weight stream format on page 3-87     | All     |
| Added coding modes information.                                   | 3.7.3 Coding modes on page 3-89           | All     |
| Added chunk syntax information.                                   | 3.7.4 Chunk syntax on page 3-91           | All     |
| Updated the operations information.                               | 3.8.2 Operations on page 3-98             | All     |
| Added convolution performance information. | 3.8.3 Convolution performance on page 3-103       | All     |
| Added elementwise performance information. | 3.8.4 Elementwise performance on page 3-105       | All     |
| Updated the AMBA 5 AXI master signals.     | A.4 AMBA® 5 AXI master signals on page Appx-A-116 | All     |
| Added boot flow information.               | C.1 Boot flow information on page Appx-C-127      | All     |

#### Table D-4 Second EAC release for r0p0

| Change                                                                                    | Location                                                     | Affects |
|-------------------------------------------------------------------------------------------|--------------------------------------------------------------|---------|
| Updated the description of the NPU.                                                       | 1.1 Description of the neural processing unit on page 1-11   | All     |
| Updated the supported memory formats information.                                         | 2.1.1 Supported memory formats for feature maps on page 2-20 | All     |
| Added a description of the OFM channel and updated the description of the weight channel. | 2.3.3 DMA controller on page 2-23                            | All     |
| Added block based operation information.                                                  | 3.9 Block based operation on page 3-107                      | All     |
| Added internal shared RAM information.                                                    | 3.9.1 Internal shared RAM on page 3-109                      | All     |

#### Table D-5 Third EAC release for r0p0

| Change                                                                              | Location                                          | Affects |
|-------------------------------------------------------------------------------------|---------------------------------------------------|---------|
| Updated the register information.                                                   | STATUS on page 3-32                               | All     |
| Updated the kernel size figures for convolution performance for 16-bit activations. | 3.8.3 Convolution performance on page 3-103       | All     |
| Updated the AXI ID widths.                                                          | A.4 AMBA® 5 AXI master signals on page Appx-A-116 | All     |