

# Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler

Version 7.5

# **User Guide**

Non-Confidential

Issue 00

Copyright © 2019–2022 Arm Limited (or its affiliates). All rights reserved.

101863\_7.5\_00\_en




### **Release information**

#### Document history

| Issue   | Date             | Confidentiality  | Change                 |
|---------|------------------|------------------|------------------------|
| 0700-00 | 30 October 2019  | Non-Confidential | New document for v7.0. |
| 0701-00 | 28 February 2020 | Non-Confidential | New document for v7.1. |
| 0702-00 | 26 August 2020   | Non-Confidential | New document for v7.2. |
| 0703-00 | 27 November 2020 | Non-Confidential | New document for v7.3. |
| 0704-00 | 26 August 2021   | Non-Confidential | New document for v7.4. |
| 0705-00 | 22 February 2022 | Non-Confidential | New document for v7.5. |

### **Proprietary Notice**

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED "AS IS". ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word "partner" in reference to Arm's customers is not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time and without notice.

This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.

The Arm corporate logo and words marked with ® or <sup>™</sup> are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow Arm's trademark usage guidelines at https://www.arm.com/company/policies/trademarks.

Copyright © 2019–2022 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

(LES-PRE-20349)

### **Confidentiality Status**

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

### **Product Status**

The information in this document is Final, that is for a developed product.

### Feedback

Arm<sup>®</sup> welcomes feedback on this product and its documentation. To provide feedback on the product, create a ticket on https://support.developer.arm.com.

To provide feedback on the document, complete the following survey: https://developer.arm.com/documentation-feedback-survey.

### Inclusive language commitment

Arm values inclusive communities. Arm recognizes that we and our industry have used language that can be offensive. Arm strives to lead the industry and create change.

We believe that this document contains no offensive language. To report offensive language in this document, email terms@arm.com.

# Contents

- 1 Introduction
  - 1.1 Conventions
  - 1.2 Other information
- 2 Platform support
  - 2.1 API support
  - 2.2 GPU support
  - 2.3 Binary generation support
- 3 Using Mali Offline Compiler
  - 3.1 Installation
  - 3.2 Querying compiler capabilities
  - 3.3 Compiling OpenGL ES shaders
  - 3.4 Compiling Vulkan shaders
  - 3.5 Compiling OpenCL C kernels
    - 3.5.1 Header includes
  - 3.6 Syntax error reporting
  - 3.7 Performance analysis
    - 3.7.1 IDVS shader variants
    - 3.7.2 Resource usage
    - 3.7.3 Performance table
    - 3.7.4 Shader properties
  - 3.8 Performance considerations
  - 3.9 Generating JSON reports
- 4 Mali GPU pipelines
  - 4.1 Mali Midgard architecture
    - 4.1.1 Midgard work register breakpoints
  - 4.2 Mali Bifrost architecture
    - 4.2.1 Bifrost work register breakpoints
    - 4.2.2 Bifrost shader core size
  - 4.3 Mali Valhall architecture
    - 4.3.1 Valhall work register breakpoints
    - 4.3.2 Valhall shader core size

# 1 Introduction

# 1.1 Conventions

The following subsections describe conventions used in Arm documents.

#### Glossary

The Arm Glossary is a list of terms used in Arm documentation, together with definitions for those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm meaning differs from the generally accepted meaning.

See the Arm<sup>®</sup> Glossary for more information: developer.arm.com/glossary.

#### Typographic conventions

Arm documentation uses typographical conventions to convey specific meaning.

| Convention | Use |
|------------|-----|
| italic | Citations. |
| bold | Interface elements, such as menu names. Signal names. Terms in descriptive lists, where appropriate. |
| monospace | Text that you can enter at the keyboard, such as commands, file and program names, and source code. |
| monospace bold | Language keywords when used outside example code. |
| monospace <u>underline</u> | A permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name. |
| `<and>` | Encloses replaceable terms for assembler syntax where they appear in code or code fragments. For example: `MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>` |
| SMALL CAPITALS | Terms that have specific technical meanings as defined in the <i>Arm</i><sup>®</sup> <i>Glossary</i>. For example, <b>IMPLEMENTATION DEFINED</b>, <b>IMPLEMENTATION SPECIFIC</b>, <b>UNKNOWN</b>, and <b>UNPREDICTABLE</b>. |
| Caution | Recommendations. Not following these recommendations might lead to system failure or damage. |
| Warning | Requirements for the system. Not following these requirements might result in system failure or damage. |
| Danger | Requirements for the system. Not following these requirements will result in system failure or damage. |
| Note | An important piece of information that needs your attention. |
| Tip | A useful tip that might make it easier, better or faster to perform a task. |
| Remember | A reminder of something important that relates to the information you are reading. |

# 1.2 Other information

See the Arm website for other relevant information.

- Arm<sup>®</sup> Developer.
- Arm<sup>®</sup> Documentation.
- Technical Support.
- Arm<sup>®</sup> Glossary.

# 2 Platform support

Mali<sup>™</sup> Offline Compiler is a command-line tool that provides static analysis of GPU shaders that are written in OpenGL ES Shading Language (ESSL), Vulkan SPIR-V intermediate representation, or OpenCL C.

It can be used to:

- Validate the syntax of shaders.
- Identify performance bottlenecks.
- Measure the performance impact of any changes.

# 2.1 API support

Mali<sup>™</sup> Offline Compiler supports compiling shaders for the OpenGL ES and Vulkan graphics APIs, and compiling kernels for the OpenCL compute API.

The following API versions are supported, subject to support being available for the targeted GPU core:

- OpenGL ES 2.0 and 3.0-3.2
- Vulkan 1.0-1.2
- OpenCL 1.0-1.2, 2.0, and 3.0

OpenCL support is only available on Linux and macOS host installations.

# 2.2 GPU support

Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler supports the following Mali GPU products:

#### Valhall architecture

- Mali-G710 (OpenGL ES, Vulkan, OpenCL)
- Mali-G610 (OpenGL ES, Vulkan, OpenCL)
- Mali-G510 (OpenGL ES, Vulkan, OpenCL)
- Mali-G310 (OpenGL ES, Vulkan, OpenCL)
- Mali-G78AE (OpenGL ES, Vulkan, OpenCL)
- Mali-G78 (OpenGL ES, Vulkan, OpenCL)
- Mali-G77 (OpenGL ES, Vulkan, OpenCL)
- Mali-G68 (OpenGL ES, Vulkan, OpenCL)
- Mali-G57 (OpenGL ES, Vulkan, OpenCL)

#### Bifrost architecture

- Mali-G76 (OpenGL ES, Vulkan, OpenCL)
- Mali-G72 (OpenGL ES, Vulkan, OpenCL)
- Mali-G71 (OpenGL ES, Vulkan, OpenCL)
- Mali-G52 (OpenGL ES, Vulkan, OpenCL)
- Mali-G51 (OpenGL ES, Vulkan, OpenCL)
- Mali-G31 (OpenGL ES, Vulkan, OpenCL)

#### Midgard architecture

- Mali-T880 (OpenGL ES, Vulkan, OpenCL)
- Mali-T860 (OpenGL ES, Vulkan, OpenCL)
- Mali-T830 (OpenGL ES, Vulkan, OpenCL)
- Mali-T820 (OpenGL ES, Vulkan, OpenCL)
- Mali-T760 (OpenGL ES, Vulkan, OpenCL)
- Mali-T720 (OpenGL ES, OpenCL)

Mali Offline Compiler targets the following driver versions for the supported GPUs:

- Bifrost and Valhall architectures use r36p0
- Midgard architecture uses r23p0

# 2.3 Binary generation support

Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler no longer provides the ability to generate binaries for graphics shaders or compute kernels.

Compile and link entire shader programs using the production driver on the target device, and then retrieve the binary using API calls such as glGetProgramBinary(). These whole-program binaries are often more efficient than the single shader stage binaries produced by legacy Mali Offline Compiler releases, as extra program-level optimizations can be applied.



Most compiled shader binaries are specific to a single pairing of GPU hardware version and driver version, so reliance on binary-only shader distribution is not recommended.

# 3 Using Mali Offline Compiler

To query the capabilities of the compiler, or of a specific GPU, and to compile the shader, invoke malioc with different command-line options. If compilation is successful, analyze the output performance report.

# 3.1 Installation

Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler is installed as part of Arm Mobile Studio.

See Install Arm Mobile Studio for instructions on how to download and install this package.

Before using Mali Offline Compiler, we recommend that you add the installation directory to your PATH environment variable. Otherwise, you must manually invoke the compiler from the installation directory.

# 3.2 Querying compiler capabilities

You can query information about the compiler configuration from the command line.

- The --list option lists all the valid combinations of supported driver versions, GPUs, and hardware revisions. The listing shows the full capabilities of the compiler, but a specific GPU might not support all the language versions and extensions that the compiler supports.
- The --info <gpu> option shows detailed capability information for a specific GPU. For example:

malioc --info -c Mali-G72

The output shows only the language versions and extensions that the GPU supports.

# 3.3 Compiling OpenGL ES shaders

Use the following command-line syntax to compile OpenGL ES shader programs:

```
malioc -c <target_gpu> [<shader_type>] <file1> [<file2> ...] \
[-o <file>]
```

target\_gpu is one of the GPUs that are listed in GPU support.

shader\_type is one of the following:

- --vertex
- --tessellation\_control
- --tessellation\_evaluation
- --geometry
- --fragment
- --compute

You must specify one or more input files that contain the ESSL source code to compile. To read input from stdin, instead of a file on disk, insert a single - character. If the input files use one of the following default file extensions, you do not need to explicitly specify the shader type:

.vert

OpenGL ES vertex shader.

.tesc

OpenGL ES tessellation control shader.

.tese

OpenGL ES tessellation evaluation shader.

.geom

OpenGL ES geometry shader.

.frag

OpenGL ES fragment shader.

.comp

OpenGL ES compute shader.

If you specify multiple input files:

- They are concatenated in the order in which they are specified, before compilation.
- They must all use the same extension if you do not explicitly specify the shader type.

By default, malioc emits reports to the stdout output stream. You can write directly to a file by specifying the -o <file> option. The destination directory must exist because it is not created.

Use the -D option to define a macro on the command line for use in shader source code. For example:

-Dfoo

Defines foo with a default value of 1.

```
-Dfoo=bar
```

Defines foo with the value bar.
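
The command-line syntax above can be assembled programmatically. The following Python sketch is illustrative only: the helper name `build_essl_command` and the example file names are hypothetical, and it assumes that `-D` options can be placed before the input files, which this guide does not state. It uses only the options described in this section.

```python
from pathlib import Path

# File extensions listed in this section that imply a shader type.
KNOWN_EXTENSIONS = {".vert", ".tesc", ".tese", ".geom", ".frag", ".comp"}


def build_essl_command(target_gpu, files, shader_type=None, output=None, defines=None):
    """Assemble a malioc argument list for OpenGL ES shaders (hypothetical helper)."""
    if not files:
        raise ValueError("at least one input file is required")
    if shader_type is None:
        # Without an explicit type, all files must share one recognized extension.
        extensions = {Path(name).suffix for name in files}
        if len(extensions) != 1 or not extensions <= KNOWN_EXTENSIONS:
            raise ValueError("specify a shader type or use one recognized extension")
    command = ["malioc", "-c", target_gpu]
    if shader_type is not None:
        command.append("--" + shader_type)
    for name, value in (defines or {}).items():
        # -Dfoo defines foo as 1; -Dfoo=bar defines foo as bar.
        # Assumption: malioc accepts -D options ahead of the input files.
        command.append(f"-D{name}" if value is None else f"-D{name}={value}")
    command.extend(files)
    if output is not None:
        command.extend(["-o", output])
    return command


print(build_essl_command("Mali-G72", ["shader.frag"], defines={"foo": None}))
```

The returned list can be passed to a process launcher such as `subprocess.run`. Files are kept in the order given because malioc concatenates them in that order.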

# 3.4 Compiling Vulkan shaders

Use the following command-line syntax to compile Vulkan shaders:

```
malioc --vulkan -c <target_gpu> [<shader_type>] [--spirv] [-n <name>] \
<file1> [<file2> ...] [-o <file>]
```

target\_gpu is one of the GPUs that are listed in GPU support.

shader\_type is one of the following:

- --vertex
- --tessellation\_control
- --tessellation\_evaluation
- --geometry
- --fragment
- --compute

The input files are either:

- One or more ESSL source shaders.
- A single SPIR-V binary module that has been compiled using Vulkan semantics.

To read input from stdin, instead of a file on disk, insert a single - character. You do not need to explicitly specify the source shader type if the input files use one of the supported file extensions:

.vert

OpenGL ES vertex shader.

.tesc

OpenGL ES tessellation control shader.

.tese

OpenGL ES tessellation evaluation shader.

.geom

OpenGL ES geometry shader.

.frag

OpenGL ES fragment shader.

.comp

OpenGL ES compute shader.



For binary modules containing a single shader stage, malioc automatically detects that they are SPIR-V binary modules, and attempts to deduce the shader type and entry point name. For target binary modules containing multiple entry points, you must specify them manually. You can provide shader type information either by using an auto-detected file extension, or a manually specified shader type flag. The supported file extensions are appended with .spv, for example .vert.spv. You can force interpretation of a file as SPIR-V by passing in the --spirv option.

If you specify multiple input files:

- They are concatenated in the order in which they are specified, before compilation.
- If you do not explicitly specify the shader type, they must all use the same extension.

If you pass an ESSL source file, it is automatically converted into a SPIR-V binary module using the version of glslang that is provided in the installation. The resulting SPIR-V module is passed to the Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler backend.

Use the -n <name> option to specify a custom SPIR-V entry point for binary module inputs. The default entry point is called main.

By default, malioc emits reports to the stdout output stream. You can write directly to a file by specifying the -o <file> option. The destination directory must exist because it is not created.

Use the -D option to define a macro on the command line for use in shader source code. For example:

-Dfoo

Defines foo with a default value of 1.

-Dfoo=bar

Defines foo with the value bar.
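
The file naming convention described in this section, where a shader extension is optionally followed by .spv, can be mirrored in a build script. The Python sketch below is a hypothetical helper, `classify_vulkan_input`, that inspects only the file name. It does not reproduce the detection that malioc performs on SPIR-V binary modules themselves.

```python
# Shader extensions from this section, mapped to the matching shader type flag name.
SHADER_EXTENSIONS = {
    ".vert": "vertex",
    ".tesc": "tessellation_control",
    ".tese": "tessellation_evaluation",
    ".geom": "geometry",
    ".frag": "fragment",
    ".comp": "compute",
}


def classify_vulkan_input(filename):
    """Return (shader_type, is_spirv) inferred from the file name alone.

    shader_type is None when the name carries no recognized extension,
    in which case a shader type flag would need to be passed explicitly.
    """
    is_spirv = filename.endswith(".spv")
    # Strip the .spv suffix so that "x.vert.spv" is examined as "x.vert".
    stem = filename[: -len(".spv")] if is_spirv else filename
    for extension, shader_type in SHADER_EXTENSIONS.items():
        if stem.endswith(extension):
            return shader_type, is_spirv
    return None, is_spirv


for name in ("shadow.vert.spv", "blur.frag", "module.spv"):
    print(name, classify_vulkan_input(name))
```

A script could use the result to decide whether to add a shader type flag or the --spirv option before invoking malioc.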

# 3.5 Compiling OpenCL C kernels

Use the following command-line syntax to compile OpenCL C kernels:

```
malioc -c <target_gpu> [--opencl <version>] [--kernel] [-n <name>] \
<file1> [<file2> ...] [-o <file>]
```

target\_gpu is one of the GPUs that are listed in GPU support.

Use the --opencl option to specify the targeted version of OpenCL:

1.1

Targets OpenCL 1.1.

1.2

Targets OpenCL 1.2.

2.0

Targets OpenCL 2.0.

3.0

Targets OpenCL 3.0.

If you do not explicitly specify --opencl, the compiler defaults to targeting OpenCL 1.2.

To read input from stdin, instead of a file on disk, insert a single - character. If the input filename has a .cl extension, which is the default for an OpenCL kernel, you do not need to explicitly specify the API as --opencl or the shader type as --kernel.

Use the -n <name> option to specify the entry point of the kernel to be compiled.

If you specify multiple input files:

- They are concatenated in the order in which they are specified, before compilation.
- They must all have a .cl extension if you do not explicitly specify --kernel.

By default, malioc emits reports to the stdout output stream. You can write directly to a file by specifying the -o <file> option. The destination directory must exist because it is not created.

Use the -D option to define a macro on the command line for use in kernel source code. For example:

-Dfoo

Defines foo with a default value of 1.

-Dfoo=bar

Defines foo with the value bar.

### 3.5.1 Header includes

The OpenCL C language allows you to use header files in your source code, with the #include preprocessor directive.

Relative path header inclusions use the current working directory as the root of the search path:

#include "my\_header.h"

You can also use absolute path header inclusions:

#include "/work/my\_header.h"

# 3.6 Syntax error reporting

If Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler fails to compile a shader program due to an error in the code, it produces a compilation error and emits an error message to the console.

Error messages report only a line number. This number is the line position after all input source files have been concatenated, not a position within an individual file.
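
When several input files are compiled together, a reported line number can be mapped back to its source file. The Python sketch below shows one way to do this. The function name `locate_line` is hypothetical, and the sketch assumes that files are joined with no extra lines inserted between them and that every file ends with a newline. Neither assumption is confirmed by this guide.

```python
def locate_line(file_line_counts, reported_line):
    """Map a 1-based line number in the concatenated source to (file, line).

    file_line_counts is a list of (file_name, number_of_lines) pairs in the
    order in which the files were passed on the command line.
    """
    if reported_line < 1:
        raise ValueError("line numbers start at 1")
    remaining = reported_line
    for name, count in file_line_counts:
        if remaining <= count:
            return name, remaining
        # The line lies beyond this file, so skip over its lines.
        remaining -= count
    raise ValueError("line number is beyond the end of the concatenated source")


# Hypothetical inputs: a 10-line shared file followed by a 20-line shader.
print(locate_line([("common.glsl", 10), ("main.frag", 20)], 12))
```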

# 3.7 Performance analysis

If compilation is successful, Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler emits a static analysis report outlining the shader performance on the target GPU.

For example:

```
Configuration
_____________

Hardware: Mali-T880 r2p0
Driver: Midgard r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
___________

Work registers: 2
Uniform registers: 2
Stack spilling: false

                               A     LS      T   Bound
Total Instruction Cycles:    6.0    1.0    0.0       A
Shortest Path Cycles:        1.7    1.0    0.0       A
Longest Path Cycles:         1.7    1.0    0.0       A

A = Arithmetic, LS = Load/Store, T = Texture

Shader properties
_________________

Has uniform computation: true
```

### 3.7.1 IDVS shader variants

On Arm<sup>®</sup> Mali<sup>™</sup> GPUs in the Bifrost and Valhall families, vertex shaders are executed using an optimized shading flow called Index-Driven Vertex Shading (IDVS).

In the IDVS pipeline, vertex shaders are compiled into two binaries:

- A position shader, which computes only position.
- A varying shader, which computes the remaining non-position vertex attribute outputs.



The position shader is executed for every index vertex, but the varying shader is only executed for vertices that are part of a visible primitive that survives culling. Mali Offline Compiler reports separate performance tables for each of these variants.
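
Because the two variants run for different numbers of vertices, their cycle counts cannot simply be added together. The Python sketch below shows the arithmetic under a deliberately simple model: the position cost applies to every vertex and the varying cost only to the visible fraction. The function name, the parameters, and the numbers in the example are illustrative assumptions, not figures from a real report.

```python
def idvs_vertex_cycles(position_cycles, varying_cycles, vertices, visible_fraction):
    """Estimate total vertex shading cycles for an IDVS draw (simplified model).

    position_cycles and varying_cycles are per-vertex costs taken from the two
    performance tables. visible_fraction is the share of vertices that belong
    to primitives surviving culling, which must be measured or estimated.
    """
    if not 0.0 <= visible_fraction <= 1.0:
        raise ValueError("visible_fraction must be between 0 and 1")
    visible_vertices = vertices * visible_fraction
    # Position shading runs for every vertex; varying shading only for visible ones.
    return position_cycles * vertices + varying_cycles * visible_vertices


# Illustrative values only: 1000 vertices, half of them visible.
print(idvs_vertex_cycles(2.0, 3.0, 1000, 0.5))
```

Under this model, moving work out of the position variant reduces cost for every vertex, while work in the varying variant is paid only for vertices that survive culling.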

### 3.7.2 Resource usage

The resource usage section of the report shows how the shader program uses resources. It reports register usage, including whether the shader spills to stack memory, the use of shared memory, and the use of the 16-bit data path in the arithmetic unit.

Demand for work registers can limit the number of threads that the shader core can execute simultaneously, because the available physical register pool is divided among the executing shader threads. Reducing the work register usage per thread can increase the number of threads that can be executed, which is often beneficial. See Mali GPU pipelines for more details about work register usage for each Arm<sup>®</sup> Mali<sup>™</sup> GPU architecture.

Shaders that spill to stack are expensive for a GPU to process. Reduce register pressure to help stop the shaders from spilling. You can reduce register pressure in one of the following ways:

- By reducing variable precision.
- By reducing the live ranges of variables.
- By simplifying the shader program.

Shared storage allows threads within a single compute work group to share data. Mali GPUs use cached system RAM to back shared memory, so it will have the same performance as any other buffer access. Use shared storage only where you need algorithmic data sharing across threads.

16-bit arithmetic is more energy efficient, and higher performance, than 32-bit arithmetic. For most operations, Mali can either submit a vec2 SIMD 16-bit operation or a scalar fp32 operation, so in the best case using 16-bit operations will be twice as fast. Even in cases where overall performance does not increase, a higher percentage of 16-bit operations will improve energy efficiency.

### 3.7.3 Performance table

The performance table gives an indication of the potential performance of the shader program for a single shader core.

It contains the following rows:

#### **Total Instruction Cycles**

The cumulative number of execution cycles for all instructions that are generated for the program, irrespective of program control flow.

#### Shortest Path Cycles

An estimate of the number of cycles for the shortest control flow path through the shader program. This row normalizes the cycle cost based on the number of functional units present in the design.

#### Longest Path Cycles

An estimate of the number of cycles for the longest control flow path through the shader program. This row normalizes the cycle cost based on the number of functional units present in the design. It is not always possible to determine the longest path based on static analysis, for example if a uniform variable controls a loop iteration limit. In such cases, this row might indicate an unknown cycle count ("N/A").

The reported statistics are broken down by functional unit. The unit column with the highest cycle cost in either or both of the Shortest Path Cycles and Longest Path Cycles rows is a good candidate to optimize. For example, a shader whose highest values are in the A (Arithmetic) column is arithmetic bound. Optimize such a shader by reducing the number of, or the precision of, the mathematical operations that it performs. The <code>Bound</code> column lists the functional units with the highest cycle count, which allows you to quickly identify the units that are a bottleneck in your shader code.
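
The rule behind the Bound column can be expressed in a few lines. The Python sketch below picks the functional units with the highest cycle count from one row of the table. The function name `bound_units` is hypothetical, and the values are those of the Shortest Path Cycles row in the example report earlier in this chapter.

```python
def bound_units(cycles):
    """Return the functional units that share the highest cycle count."""
    highest = max(cycles.values())
    # More than one unit can be listed when cycle counts tie.
    return [unit for unit, value in cycles.items() if value == highest]


# Shortest Path Cycles row from the example report.
shortest_path = {"A": 1.7, "LS": 1.0, "T": 0.0}
print(bound_units(shortest_path))
```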

The functional unit columns that are displayed depend on the architecture of the GPU being targeted. See Mali GPU pipelines for more details. In addition, there are some important considerations to be aware of when reviewing the performance data. See Performance considerations for more details.

### 3.7.4 Shader properties

The Shader properties section provides information about behavioral properties of the shader program.

It can contain the following entries:

#### Has uniform computation

Shows if there was any optimized uniform computation. This is computation that depends only on literal constants or uniform values, and therefore produces the same result for every thread in a draw call or compute dispatch. While the drivers can optimize this, it still has a cost, so where possible, move it from the shader into application logic on the CPU.

#### Has side-effects

Shows if this shader has side-effects that are visible in memory, outside of the fixed graphics pipeline. They can be caused by:

- Writes into shader storage buffers
- Stores into images
- Uses of atomics

Side-effecting shaders cannot be optimized away by techniques such as hidden surface removal, so their use should be minimized.

#### Modifies coverage

Shows if a fragment shader has a coverage mask that can be changed by shader execution, for example by using a discard statement. Shaders with modifiable coverage must use a late-ZS update, which reduces efficiency of early ZS testing for later fragments at the same coordinate.



Other API-side behaviors, such as setting of alpha-to-coverage, can also impact coverage masks and are not considered here.

#### Uses late ZS test

Shows if a fragment shader contains logic that forces a late ZS test, for example by writing to gl\_FragDepth. This disables use of early-ZS testing and hidden surface removal, which can be a significant efficiency loss.



Other API-side behaviors, such as disabling depth testing, can override this behavior.

#### Uses late ZS update

Shows if a fragment shader contains logic that forces a late ZS update, for example by reading the old depth value in the shader by using gl\_LastFragDepthARM. This can reduce efficiency of early ZS testing for later fragments at the same coordinate.

#### Reads color buffer

Shows if a fragment shader contains logic that programmatically reads from the color buffer, for example by reading from gl\_LastFragColorARM. Shaders that read from the color buffer in this manner are treated as transparent, and cannot be used as hidden-surface removal occluders.

# 3.8 Performance considerations

There are several important considerations to be aware of when analyzing the data in the performance table:

- The cycle measurements are purely based on the execution cost of the instructions in the program. The actual performance is also dependent on inputs that are not visible in the instruction sequence, such as texture sampler configuration and texture format.

  For example, using trilinear filtering for all texture samples halves the filtering rate. Therefore it would double the texture cycle count compared to the value that is reported in the T (Texture) column in the performance table.

- The shortest and longest control flow measurements are based on what is possible in the shader source code. They are not based on the real run-time inputs, such as uniform values, that are used for a specific draw call. These costings therefore define the flight-envelope of performance possibilities but are not accurate for any single specific use of the shader.
- Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler only processes single shaders at a time. The on-device Mali driver compilation process optimizes whole programs and pipelines, including use of pipeline state information in the case of Vulkan. This optimization can result in the reported performance being different to the performance that would be seen in a production device, although it should be indicative.
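
The trilinear filtering case can be folded into a reported row before deciding which unit is the bottleneck. The Python sketch below doubles the texture cost when every sample is trilinear, as described in this section. Scaling linearly for a partial share of trilinear samples is an assumption made for illustration, and the function name and input values are hypothetical.

```python
def adjust_texture_cycles(cycles, trilinear_fraction):
    """Return a copy of a performance row with T scaled for trilinear filtering.

    A fraction of 1.0 doubles the texture cycle count, because trilinear
    filtering halves the filtering rate. Intermediate fractions are scaled
    linearly, which is a simplifying assumption.
    """
    if not 0.0 <= trilinear_fraction <= 1.0:
        raise ValueError("trilinear_fraction must be between 0 and 1")
    adjusted = dict(cycles)
    adjusted["T"] = cycles["T"] * (1.0 + trilinear_fraction)
    return adjusted


# Illustrative row: the shader looks arithmetic bound until filtering is considered.
row = {"A": 3.0, "LS": 1.0, "T": 2.0}
print(adjust_texture_cycles(row, 1.0))
```

In this illustrative row the adjusted texture cost of 4.0 exceeds the arithmetic cost of 3.0, so the limiting unit changes once the sampler configuration is taken into account.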



You can directly measure pipeline activity on the target platform using the Arm Streamline profiling tools. Profiling with Streamline can provide a useful comparison with the static analysis that Mali Offline Compiler provides.

# 3.9 Generating JSON reports

By default, Arm<sup>®</sup> Mali<sup>™</sup> Offline Compiler generates reports in a human readable text format. To allow easier integration into other tooling or scripted workflows, it also supports generating machine-readable JSON reports. These reports are enabled by adding the --format json command-line option to any of the operations.

There are four types of JSON output report that Mali Offline Compiler can generate, identified by a schema identifier field in the root JSON object:

#### list

For --list operations.

#### info

For --info operations.

#### error

For compile operations that fail with a compilation error.

#### performance

For compile operations that succeed.

To aid writing parsers, sample reports and JSON Schema definitions are provided for all four of the supported output reports. These files are in <install\_directory>/samples/json\_reports and <install\_directory>/samples/json\_schemas respectively.

To help with JSON parsing, the command line utility can return three possible process return codes:

0

The operation was successful and returns a list, info, or performance (compilation) JSON report.

1

Compilation failed because of a shader syntax error. In this case, the utility emits an error JSON report.

2

The tool failed because of a configuration error, such as a bad command line option. In this case, the utility always emits human-readable text output, not a JSON report.
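A calling script can use the return code to decide whether the output is safe to parse as JSON. The following is a minimal sketch of that decision, not a definitive integration. The helper name `interpret_result`, the executable name `malioc`, and the assumption that the report is written to standard output are not taken from this section and should be checked against your installation.

```python
import json


def interpret_result(returncode, stdout):
    """Classify tool output using the process return code.

    Returns a (kind, payload) tuple, where kind is "report", "error",
    or "text".
    """
    if returncode == 0:
        # Success: the output is a list, info, or performance report.
        return "report", json.loads(stdout)
    if returncode == 1:
        # Compilation failure: the output is an error report.
        return "error", json.loads(stdout)
    # Return code 2 (configuration error) produces plain text, not JSON.
    # Any unexpected code is treated the same way rather than parsed.
    return "text", stdout


# Usage sketch (assumes the tool is on PATH as `malioc`):
#
#   import subprocess
#   proc = subprocess.run(["malioc", "--format", "json", "--list"],
#                         capture_output=True, text=True)
#   kind, payload = interpret_result(proc.returncode, proc.stdout)
```

Checking the return code first avoids a JSON parse failure on the text output of a configuration error. The schema identifier field in the root object can then be used to select the matching parser.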

# 4 Mali GPU pipelines

The internal microarchitecture of the shader core can influence both the register usage and the processing pipelines that are reported in the performance analysis report.

Correct identification of the shader pipeline with the highest load is critical in performance analysis. Optimizing that pipeline is more likely to give a performance benefit. This section provides a brief summary of the register thresholds and processing pipelines for each supported Arm<sup>®</sup> Mali<sup>™</sup> GPU architecture.

## 4.1 Mali Midgard architecture

Arm<sup>®</sup> Mali<sup>™</sup> Midgard GPU shader cores have three parallel pipeline classes:

#### Arithmetic unit (A)

The arithmetic pipeline executes all types of shader arithmetic instructions. There can be multiple parallel arithmetic pipelines. The number present depends on the Mali GPU being targeted. Data presented in the tool is normalized based on the number of pipelines in the design.

#### Load/store unit (LS)

The load/store pipeline handles all non-texture memory access, including buffer access, image access, and atomic operations. In addition, this pipeline implements the Midgard varying interpolator.

#### Texture unit (T)

The texture pipeline handles all texture sampling and filtering operations.



### 4.1.1 Midgard work register breakpoints

Arm<sup>®</sup> Mali<sup>™</sup> Midgard GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.

#### 0-4 registers

Maximum thread capacity

#### 5-8 registers

Half thread capacity

#### 9-16 registers

Quarter thread capacity

Usually, running more threads simultaneously helps a GPU to keep busy. A good objective is to stay at 0-4 registers for fragment shaders and 0-8 registers for other shader types.

The most effective way to reduce register pressure is to minimize the precision of stored variables. Use mediump precision in preference to highp whenever possible.


## 4.2 Mali Bifrost architecture

Arm<sup>®</sup> Mali<sup>™</sup> Bifrost GPU shader cores have four parallel pipeline classes:

#### Arithmetic unit (A)

The arithmetic pipeline, also known as the execution engine, executes all types of shader instructions. There can be multiple parallel arithmetic pipelines. The number present depends on the Mali GPU being targeted. To give an overall cost for the targeted shader core, data presented in the tool is normalized based on the number of engines in the design.

#### Load/store unit (LS)

The load/store pipeline handles all non-texture memory access, including buffer access, image access, and atomic operations.

#### Varying unit (V)

The varying pipeline is a dedicated pipeline which implements the varying interpolator.

#### Texture unit (T)

The texture pipeline handles all texture sampling and filtering operations.



### 4.2.1 Bifrost work register breakpoints

Arm<sup>®</sup> Mali<sup>™</sup> Bifrost GPU shader cores allow you to create variable numbers of threads, depending on the number of work registers that are used by the in-flight shader programs:

#### 0-32 registers

Maximum thread capacity

#### 33-64 registers

Half thread capacity

Usually, running more threads simultaneously helps a GPU to work effectively. Aim to use 0-32 registers for fragment shaders.

The most effective way to reduce register pressure is to minimize the precision of stored variables. Use mediump precision in preference to highp whenever possible.

### 4.2.2 Bifrost shader core size

The early-generation Bifrost shader cores, Arm<sup>®</sup> Mali<sup>™</sup>-G71 and Mali-G72, implement a single texel-per-clock and single pixel-per-clock shader core. Later shader cores in the Bifrost family implement a two texel-per-clock and two pixel-per-clock shader core, with an increase in arithmetic performance to compensate. However, not every later GPU doubled the available performance.

Mali Offline Compiler reports results per shader core. It is expected, for example, that performance results for a Mali-G76 have approximately half the cycle count of the results for a Mali-G72. Silicon implementations using a Mali-G76 generally implement fewer shader cores than an equivalent Mali-G72 design. Remember therefore that the results must be scaled by the shader core count in your target device.
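The scaling step can be sketched as follows. This is an approximation that assumes work spreads evenly across all shader cores. The helper name, cycle counts, and core counts are hypothetical and do not describe any specific device.

```python
def gpu_cycles_per_item(cycles_per_core, shader_cores):
    """Approximate whole-GPU cost, assuming work spreads evenly over all cores."""
    if shader_cores < 1:
        raise ValueError("a GPU needs at least one shader core")
    return cycles_per_core / shader_cores


# Hypothetical designs: the cycle counts and core counts are illustrative only.
# The newer core halves the per-core cost but ships with half as many cores.
older_design = gpu_cycles_per_item(cycles_per_core=8.0, shader_cores=16)
newer_design = gpu_cycles_per_item(cycles_per_core=4.0, shader_cores=8)
print(older_design, newer_design)  # prints "0.5 0.5"
```

In this example the two designs deliver the same overall throughput, even though the per-core report for the newer core shows half the cycle count.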

## 4.3 Mali Valhall architecture

Arm<sup>®</sup> Mali<sup>™</sup> Valhall GPU shader cores have six parallel pipeline classes, comprising three arithmetic pipelines and three fixed-function support pipelines.

All Valhall GPUs implement two parallel processing engines, each containing their own set of arithmetic pipelines. Data presented in the tool is normalized based on the number of engines in the design, to give an overall cost for the targeted shader core, not just for a single engine.

#### Arithmetic fused multiply accumulate unit (FMA)

The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.

#### Arithmetic convert unit (CVT)

The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

#### Arithmetic special functions unit (SFU)

The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.

#### Load/store unit (LS)

The load/store pipeline handles all non-texture memory access, including buffer access, image access, and atomic operations.

#### Varying unit (V)

The varying pipeline is a dedicated pipeline which implements the varying interpolator.

#### Texture unit (T)

The texture pipeline handles all texture sampling and filtering operations.
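The issue widths described for the three arithmetic pipelines imply different per-warp costs. The sketch below derives those costs and applies them to a hypothetical instruction mix. It deliberately ignores the two-per-cycle issue rate for 16-bit operations and the normalization across the two processing engines, so it is not a model of how the tool computes its figures.

```python
WARP_WIDTH = 16  # threads per warp, as stated for the FMA, CVT, and SFU pipelines

# Threads issued per clock cycle for one 32-bit operation.
ISSUE_WIDTH = {"FMA": 16, "CVT": 16, "SFU": 4}


def cycles_per_warp(pipeline):
    """Clock cycles needed to issue one operation for a whole warp."""
    return WARP_WIDTH // ISSUE_WIDTH[pipeline]


def pipeline_cycles(instruction_counts):
    """Per-pipeline cycles for a warp, given instruction counts per pipeline."""
    return {
        name: count * cycles_per_warp(name)
        for name, count in instruction_counts.items()
    }


# Hypothetical instruction mix for one shader.
mix = {"FMA": 10, "CVT": 4, "SFU": 2}
print(pipeline_cycles(mix))  # prints "{'FMA': 10, 'CVT': 4, 'SFU': 8}"
```

Because an SFU operation takes four cycles per warp, a small number of reciprocal or transcendental operations can approach the cost of a much larger number of FMA operations.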



### 4.3.1 Valhall work register breakpoints

Arm<sup>®</sup> Mali<sup>™</sup> Valhall GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.

#### 0-32 registers

Maximum thread capacity

#### 33-64 registers

Half thread capacity

Usually, running more threads simultaneously helps a GPU to work effectively. Aim to use 0-32 registers for fragment shaders.

The most effective way to reduce register pressure is to minimize the precision of stored variables. Use mediump precision in preference to highp whenever possible.
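For scripted checks, the work register breakpoints for the three architectures in this chapter can be captured in one lookup table. The following is a minimal sketch. The table layout and function name are illustrative, and a count of exactly 8 registers on Midgard is treated as belonging to the half-capacity band.

```python
# (upper register bound, fraction of maximum thread capacity), in ascending order.
BREAKPOINTS = {
    "Midgard": [(4, 1.0), (8, 0.5), (16, 0.25)],
    "Bifrost": [(32, 1.0), (64, 0.5)],
    "Valhall": [(32, 1.0), (64, 0.5)],
}


def thread_capacity(architecture, work_registers):
    """Return the thread capacity fraction for a work register count."""
    if work_registers < 0:
        raise ValueError("register count cannot be negative")
    for upper_bound, fraction in BREAKPOINTS[architecture]:
        if work_registers <= upper_bound:
            return fraction
    raise ValueError("register count exceeds the documented range")


print(thread_capacity("Valhall", 32), thread_capacity("Valhall", 33))  # prints "1.0 0.5"
```

A check like this can flag a shader that crosses a breakpoint after a code change, which is often a cue to review variable precision.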

### 4.3.2 Valhall shader core size

The early-generation Valhall shader cores (Arm<sup>®</sup> Mali<sup>™</sup>-G57, Mali-G68, Mali-G77, and Mali-G78) implement a four texel-per-clock and two pixel-per-clock shader core, with a variable amount of arithmetic performance depending on GPU model.

The Mali-G610 and Mali-G710 shader cores double the shader core throughput to eight texels per clock and four pixels per clock, so the per-core cycle counts reported by Mali Offline Compiler are expected to halve. However, Mali-G710 designs are likely to ship with fewer shader cores to offset the increase in shader core size.

The Mali-G510 and Mali-G310 shader cores support configurable amounts of arithmetic, texturing, and pixel throughput. This allows a silicon design to optimize the shader core for the expected workload, which is ideal for cost-sensitive markets. However, the performance per core is not consistent across configurations. The performance reports for Mali Offline Compiler assume the following configurations:

#### Mali-G310

32 FMA/cycle, 4 texture ops/cycle, 4 pixels/cycle

#### Mali-G510

48 FMA/cycle, 8 texture ops/cycle, 4 pixels/cycle

You may need to rescale the reported performance if your target device uses a different configuration. Check your chipset documentation for the correct configuration.
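One simple way to rescale is to assume that cost is inversely proportional to throughput. The sketch below applies that assumption to arithmetic cycles only. The linear model, the function name, and the example device configuration of 64 FMA/cycle are assumptions for illustration, and the same approach would need checking before being applied to texture or pixel throughput.

```python
# FMA throughput that the performance reports assume, taken from the list above.
ASSUMED_FMA_PER_CYCLE = {"Mali-G310": 32, "Mali-G510": 48}


def rescale_arithmetic_cycles(reported_cycles, gpu, device_fma_per_cycle):
    """Scale a reported arithmetic cycle count to a device's FMA throughput.

    Assumes cycles are inversely proportional to FMA operations per cycle.
    """
    if device_fma_per_cycle <= 0:
        raise ValueError("FMA throughput must be positive")
    return reported_cycles * ASSUMED_FMA_PER_CYCLE[gpu] / device_fma_per_cycle


# Hypothetical device: a Mali-G510 configured with 64 FMA/cycle.
print(rescale_arithmetic_cycles(3.0, "Mali-G510", 64))  # prints "2.25"
```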