ARM Mobile Studio - Mali Offline Compiler - Official Video Tutorial - Notes

Purpose

To avoid rewatching the video over and over later.
Screenshots plus subtitle notes, captured as a document.

If you prefer, you can also watch it directly on YouTube: search for the video and use the subtitles to jump to the relevant timestamp.

Mali Offline Compiler usage example video

Reference: Arm Mali GPU Training - Episode 3.5: Mali Offline Compiler

Analyzing a shader

Use ls on the command line to list the files.
For example, this shader implements a character dissolve effect in a game.

An example analysis using Mali Offline Compiler

The malioc command

Specify the shader you want to analyze.

Specify the GPU you want to analyze the shader for.
Here the Mali-G72 GPU is used.
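
The invocation described above can be sketched as a small Python helper that only assembles the argument list. It assumes that the `-c` option selects the target core and that malioc infers the shader stage from the `.frag` extension; `dissolve.frag` is a hypothetical file name, not the one used in the video. Check `malioc --help` for the options your version supports.

```python
import shlex
from typing import List


def build_malioc_command(shader_path: str, core: str) -> List[str]:
    """Assemble a malioc argument list for one shader and one target GPU."""
    # Assumption: "-c <core>" selects the GPU the shader is analyzed for.
    return ["malioc", "-c", core, shader_path]


# Hypothetical shader file name; Mali-G72 is the GPU used in these notes.
command = build_malioc_command("dissolve.frag", "Mali-G72")
print(shlex.join(command))
```

Building the command as a list rather than a single string makes it straightforward to pass to `subprocess.run` later without worrying about shell quoting.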

Walking through the example output

Hardware, architecture, driver, and shader type information

The first part is:

Hardware : Mali-G72 r0p3
Architecture : Bifrost
Driver : r25p0-00rel0
Shader type : OpenGL ES Fragment (an OpenGL ES fragment shader)

Registers, stack spilling, and 16-bit arithmetic share

Next come:

Work registers used : 64
Uniform registers : 26
Stack spilling : false (no variables spill to the stack)
16-bit arithmetic : 2% (2% of arithmetic runs at 16-bit precision)

Reducing register usage

Note, however, that this GPU only reaches its maximum thread capacity when a shader uses 32 work registers or fewer.

The available physical register pool is divided among the shader threads that are executing.

So reducing work register usage can increase the number of threads that can be executed simultaneously, helping to keep the GPU busy.

Floating-point precision optimization

To reduce work register usage, try reducing precision from highp (32-bit) to mediump (16-bit).

This enables the GPU to store twice as many variables per register.
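
As a rough illustration of the storage argument, the sketch below uses Python's `struct` module, with IEEE 754 half precision standing in for a 16-bit mediump float. It models only the bit widths; it says nothing about how the Mali compiler actually allocates registers.

```python
import struct

REGISTER_BITS = 32  # width of one 32-bit register, as in the text above

# One highp (32-bit float) value fills the whole register.
highp_bytes = struct.pack("<f", 0.75)

# Two mediump (16-bit float) values fit in the same space.
mediump_bytes = struct.pack("<2e", 0.75, 0.25)

values_per_register_highp = REGISTER_BITS // (8 * struct.calcsize("<f"))
values_per_register_mediump = REGISTER_BITS // (8 * struct.calcsize("<e"))

print(len(highp_bytes), len(mediump_bytes))
print(values_per_register_highp, values_per_register_mediump)
```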

Stack spilling

(jave.lin: from what I have read, this reports whether some variables spill to the stack for reads and writes. If they do, performance drops; reading and writing from registers is much faster.)

For Valhall and Bifrost GPUs, we can see whether any variables are spilled to stack memory and, if so, how many bytes per thread are spilled.

Shaders that spill to stack are expensive for a GPU to process, so if you see stack spilling, try to reduce register pressure by reducing variable precision, reducing the live ranges of variables, or simplifying the shader program.

16-bit arithmetic share

Here you can see the percentage of arithmetic operations that are performed at 16-bit precision or lower.

A higher number here is better, because 16-bit precision, selected with mediump, is twice as fast as highp at 32-bit precision. We can see that only 2% of arithmetic computation is done at 16-bit precision. This means that the shader is using 32-bit highp precision for most of its calculations. The shader will operate more efficiently if we reduce precision from highp to mediump. This reduces both energy consumption and register pressure, and can double the performance. There are situations where highp is always required, such as for position and depth calculations, but in many cases there is little noticeable difference on-screen when reducing precision to mediump.
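
To get a feel for why colors usually tolerate mediump while positions do not, the sketch below rounds values to IEEE 754 half precision, used here as an assumed stand-in for a 16-bit mediump float. The two sample values are hypothetical.

```python
import struct


def to_mediump(value: float) -> float:
    """Round a value to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def rounding_error(value: float) -> float:
    """Absolute error introduced by storing the value in 16 bits."""
    return abs(to_mediump(value) - value)


color_channel = 0.3       # a typical color value in the [0, 1] range
world_position = 4097.3   # a hypothetical large coordinate

print(rounding_error(color_channel))   # far below 1/255, invisible on screen
print(rounding_error(world_position))  # more than a whole unit
```

The error grows with the magnitude of the value, which is consistent with the advice to keep highp for position and depth calculations.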

Arithmetic, load/store, varying, and texture units

Next comes an approximate cycle cost breakdown for the major functional units in the Mali shader core: the arithmetic unit, the load/store unit, the varying unit, and the texture unit.

A = Arithmetic
LS = Load/Store
V = Varying
T = Texture

The unit with the highest cycle cost in either or both of the shortest path cycles and longest path cycles is a good candidate to optimize.
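
The selection rule above can be sketched as a small helper. The cycle counts below are hypothetical placeholders, not the figures from the video.

```python
from typing import Dict, List


def highest_cost_units(cycles: Dict[str, float]) -> List[str]:
    """Return the unit(s) with the highest cycle cost, keeping ties."""
    peak = max(cycles.values())
    return [unit for unit, cost in cycles.items() if cost == peak]


# Hypothetical per-unit cycle counts for the two paths.
shortest_path = {"A": 2.5, "LS": 0.0, "V": 0.6, "T": 1.0}
longest_path = {"A": 4.1, "LS": 0.0, "V": 0.6, "T": 1.0}

print(highest_cost_units(shortest_path))
print(highest_cost_units(longest_path))
```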

Locating the bottleneck

Here, we can see that the arithmetic unit is the most heavily used. We can optimize it by reducing the number of mathematical operations that it performs, or the precision of those calculations.

The Bound column also tells you which unit is the bottleneck. In this example every entry is A, that is, the arithmetic unit.

(jave.lin: this analysis only looks at per-unit resource usage; in practice you still need to consider the scenario in which the shader is used.)

Switching to a Mali-G78 r1p1 GPU

FMA (fused multiply-accumulate), CVT (arithmetic conversion), SFU (special functions unit)

For Valhall-based GPUs, such as the Mali-G78, arithmetic cost is shown broken down by FMA, CVT and SFU pipelines.

FMA - Fused multiply accumulate: an instruction that combines a multiplication and an addition, so an accumulation step such as a * b + c takes a single instruction.
- General background: three-operand instructions and fused multiply-accumulate (FMAC) instructions can both greatly improve efficiency and simplify code.
CVT - Arithmetic conversion: on Mali, this pipeline implements simple operations such as format conversion and integer addition.
- General background: in programming languages, arithmetic conversion is one kind of implicit ("standard") type conversion, where the widest data type in a mixed-type arithmetic expression becomes the target type.
SFU - Special functions unit: handles special functions such as sin, cos, tan, atan, log, exp, and so on.

How the cycle figures are reported

All Valhall GPUs implement two parallel processing engines, each containing their own set of arithmetic pipelines.

The data presented in the report is normalized based on the number of engines in the design, to give an overall cost for the targeted shader core, not just for a single engine.

Shader properties

The shader properties section gives information about the shader's use of language features that can impact the performance of shader execution.

Whether uniform computation is used

Move it to the CPU where possible

We can see that this shader contains uniform computation. This kind of computation produces the same answer for every thread in the draw call or compute dispatch. The Mali driver can optimize this, but it still has a cost, so you should move this kind of computation out of the shader code and onto the CPU, to be executed at draw time.
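
The idea of hoisting uniform computation can be sketched in plain Python, with a list of values standing in for the threads of one draw call. The function names and the `0.5 + 0.5 * sin(time * speed)` expression are hypothetical; they only illustrate an expression that depends on uniforms alone.

```python
import math
from typing import List


def shade_naive(pixels: List[float], time: float, speed: float) -> List[float]:
    """Every 'thread' recomputes the same uniform-only expression."""
    return [p * (0.5 + 0.5 * math.sin(time * speed)) for p in pixels]


def compute_dissolve_factor(time: float, speed: float) -> float:
    """CPU side: evaluate the uniform-only expression once per draw."""
    return 0.5 + 0.5 * math.sin(time * speed)


def shade_hoisted(pixels: List[float], dissolve_factor: float) -> List[float]:
    """Each 'thread' only consumes the precomputed uniform."""
    return [p * dissolve_factor for p in pixels]


pixels = [0.1, 0.4, 0.9]
factor = compute_dissolve_factor(time=1.25, speed=2.0)
print(shade_hoisted(pixels, factor) == shade_naive(pixels, 1.25, 2.0))
```

Both versions produce the same values; the hoisted one evaluates the sine once per draw instead of once per thread.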

Whether alpha test or alpha-to-coverage is used

This tells us that the shader can modify the fragment coverage mask (jave.lin: in other words, fragments may be discarded; you can think of it as today's alpha test). The fragment coverage mask determines which sample points in each pixel are covered by a fragment. Here it is modified by using the discard statement to drop fragments that are below an alpha threshold.

Whether late ZS test and late ZS update are used, and their performance impact

(jave.lin: alpha test) Shaders with modifiable coverage must use a late ZS update, which can reduce the efficiency of early ZS testing and of fragment scheduling for later fragments at the same coordinate.

Use discard as little as possible

You should minimize the use of discard statements and alpha-to-coverage in fragment shaders where possible.

Fortunately, this shader does not force a late ZS test, which would completely disable early ZS testing and hidden surface removal (HSR), a significant efficiency loss.

Exporting the report

You can export this report to a file by running the command with this extra command line option.

Exporting a JSON report to build your own dashboard

If you are working within a continuous integration environment, you can also export Mali Offline Compiler reports as machine-readable JSON files, so that you can build your own dashboard to monitor shader performance over time.

Sample reports and JSON schema definitions are provided in the installation directory.

Check the help page to see a list of all available options.
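
A continuous integration check built on such a report might look like the sketch below. The JSON layout and every key name in it are hypothetical and invented for illustration; the real structure is defined by the schema files in the installation directory, so the field lookups would need to be adapted to it.

```python
import json
from typing import List

# Hypothetical, simplified report. This is NOT the real malioc JSON schema.
SAMPLE_REPORT = """
{
  "shader": "dissolve.frag",
  "work_registers": 64,
  "stack_spilling": false,
  "sixteen_bit_arithmetic_percent": 2
}
"""


def find_problems(report_text: str, max_work_registers: int = 32) -> List[str]:
    """Return human-readable findings for one (hypothetical) shader report."""
    report = json.loads(report_text)
    problems = []
    if report["work_registers"] > max_work_registers:
        problems.append(
            f"work registers {report['work_registers']} > {max_work_registers}"
        )
    if report["stack_spilling"]:
        problems.append("stack spilling")
    return problems


print(find_problems(SAMPLE_REPORT))
```

A CI job could fail the build whenever the returned list is non-empty, or log the values over time to feed a dashboard.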

What a single cycle may involve

Instruction execution cycles, or instruction cycles. Reference: Instruction Cycle

A program residing in the memory unit of a computer consists of a sequence of instructions. These instructions are executed by the processor by going through a cycle for each instruction.

In a basic computer, each instruction cycle consists of the following phases:

Fetch instruction from memory.
Decode the instruction.
Read the effective address from memory.
Execute the instruction.

Valhall architecture

Mali Valhall architecture

Mali Valhall GPU shader cores have six parallel pipeline classes, comprising three arithmetic pipelines and three fixed-function support pipelines.

All Valhall GPUs implement two parallel processing engines, each containing their own set of arithmetic pipelines. Data presented in the tool is normalized based on the number of engines in the design, to give an overall cost for the targeted shader core, not just for a single engine.

Arithmetic fused multiply accumulate unit (FMA)

The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.

Arithmetic convert unit (CVT)

The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

Arithmetic special functions unit (SFU)

The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.

Load/store unit (LS)

The load/store pipeline handles all non-texture memory access, including buffer access, image access, and atomic operations.

Varying unit (V)

The varying pipeline is a dedicated pipeline which implements the varying interpolator.

Texture unit (T)

The texture pipeline handles all texture sampling and filtering operations.
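
The warp and issue widths quoted above for the arithmetic pipelines translate into cycles per warp as follows. This is a back-of-the-envelope sketch; treating the FMA and CVT pipelines as 16 lanes wide is my reading of "implements a 16-wide warp", not a figure stated separately in the text.

```python
WARP_WIDTH = 16  # threads per warp, from the text above


def cycles_per_warp(issue_width: int, warp_width: int = WARP_WIDTH) -> int:
    """Clock cycles needed to issue one operation for every thread in a warp."""
    return -(-warp_width // issue_width)  # ceiling division


fma_cycles = cycles_per_warp(issue_width=16)  # assumed 16 lanes
sfu_cycles = cycles_per_warp(issue_width=4)   # 4-wide issue path

print(fma_cycles, sfu_cycles)
```

This matches the statement that the SFU executes a 16-wide warp over 4 clock cycles, and suggests why special functions are comparatively expensive.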

Figure 3-3 Valhall shader core

This section contains the following subsections:

Valhall work register breakpoints.
Shader core size.

Valhall work register breakpoints

Mali Valhall GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.

0-32 registers
- Maximum thread capacity
33-64 registers
- Half thread capacity

Usually, running more threads simultaneously helps a GPU to keep busy. A good objective is to stay at 0-32 registers for fragment shaders.

The most effective way to reduce register pressure is to minimize the precision of stored variables. Use mediump precision in preference to highp whenever possible.
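
The two breakpoints listed above can be written as a small lookup. The function name is hypothetical, and the result is expressed as a fraction of maximum thread capacity because the text gives no absolute thread counts.

```python
def thread_capacity_fraction(work_registers: int) -> float:
    """Fraction of maximum thread capacity for a given work register count."""
    if not 0 <= work_registers <= 64:
        raise ValueError("work register count must be between 0 and 64")
    # 0-32 registers: maximum thread capacity; 33-64 registers: half.
    return 1.0 if work_registers <= 32 else 0.5


for count in (20, 32, 33, 64):
    print(count, thread_capacity_fraction(count))
```

Going from 33 registers down to 32 restores full capacity, which is why trimming even one register at the boundary can matter.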

Shader Core Size

All Valhall GPU cores implement a four texel-per-clock and two pixel-per-clock shader core.

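References

Arm Mali GPU Training - Episode 3.5: Mali Offline Compiler
使用Mali Compiler对Unity Shader进行优化 (Optimizing Unity shaders with Mali Compiler) - an article by Neo, an experienced TA, with fairly detailed explanations of the A, L/S, T, SFU and other figures

Further reading

Mali GPU Counter
Mali Offline Compiler
Mali Graphics Debugger
Using Mali Offline Compiler - covers the impact of registers, stack spilling, A, LS, V, and T
使用Mali Graphics Debugger调优Unity程序（Killer示例） (Tuning a Unity application with Mali Graphics Debugger, Killer sample)
使用Mali Compiler对Unity Shader进行优化 (Optimizing Unity shaders with Mali Compiler)
MaliCompiler
Shader指令优化 (Shader instruction optimization) - optimization under Xcode, by 大眼
如何评测shader的效率？ (How do you measure shader efficiency?)
Unity Shader Performance: How to Quickly Measure the GPU Cycles Your Shaders Take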
