Refusing to be fooled: a full interpretation of mobile GPUs (II)

[Editor's Note]: This article was written by igao7 special contributor and technical expert "Gun God".

The previous article introduced the architecture and key parameters of mobile GPUs. This installment covers the mobile GPU's shaders, GPU compatibility, the truth behind "multi-core" claims, and the problem of benchmark scores.

 

The often-neglected Shader

Now let's return to the Shader. The Shader is the part of the GPU responsible for computation; it also occupies the largest die area and consumes the most power. Today's desktop GPUs rarely quote triangle setup rates or pixel fill rates any more; the headline figure is the Shader's computing power in GFLOPS. Clearly, shader performance is becoming more and more important, and mobile GPUs follow the same trend. Let's look at the GLBenchmark Egypt HD 1080p Offscreen scores of the GPUs tested by AnandTech:

The green numbers on the right are each GPU's approximate computing performance at FP16 precision, in GFLOPS. Apart from a few individual GPUs, the correlation between the Egypt score and shader computing power is fairly obvious.

 

First, some groundwork:

First, one floating-point addition or multiplication counts as a single operation, i.e. 1 FLOP. Floating-point numbers have a given precision: a 16-bit floating-point number is FP16; the higher 32-bit format, FP32, is commonly called single precision; higher still is 64-bit double precision, FP64. Strictly speaking, only FP32 and FP64 operations are usually counted as FLOPS.

In OpenGL ES on mobile platforms, three precisions can be specified for shader variables: high, medium and low (highp, mediump, lowp). What each qualifier actually maps to differs slightly from GPU to GPU, as shown in the figure below:

For the Adreno and GC series, whichever precision is selected, calculations are carried out at FP32 precision. The pixel shaders of the Mali-400 and of Tegra's ULP GeForce do not support high precision; at most they support medium (FP16) precision. The pixel shader calculations of most games use medium (FP16) precision, while vertex shader calculations generally use FP32.
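As a minimal sketch of what this precision difference means in practice (Python, assuming NumPy is available; the values are only illustrative): a normalized color channel survives FP16 rounding with error far below one 8-bit display step, while a large vertex coordinate loses whole units.

```python
import numpy as np

# A normalized color channel (0..1): FP16 rounding error stays far below
# one 8-bit display step (1/255 ~= 0.0039), so mediump is fine for colors.
color = 0.7372549               # 188/255
print(np.float32(color))        # ~0.7372549
print(np.float16(color))        # ~0.7373

# A large vertex coordinate: FP16 carries only ~11 significant bits, so
# values around 1000 are representable only in steps of 1.0.
x = 1234.567
print(np.float32(x))            # ~1234.567
print(np.float16(x))            # ~1235.0 -- whole units of error
```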

Secondly, on unified versus separate shaders. With a unified shader the same units handle both vertex and pixel work, as in the PowerVR, Adreno and GC series. With separate shaders, the vertex shaders and pixel shaders are distinct units, as in the Mali-400 and ULP GeForce. Relatively speaking, a unified architecture achieves higher shader utilization: when a scene has a great many triangles and few pixels, or the other way round, shader computing power is less likely to go to waste.

Finally, because vertex coordinates (xyzw) and pixel colors (rgba) each have four components, shaders are often designed as Vec4 SIMD units for efficiency: four values can be packed and processed with a single instruction. Of course, if fewer than four values are needed, some of that computing power is wasted. There are also scalar units designed to process only one value at a time.

 

Shader composition of each GPU

1. Qualcomm Adreno Series

The Adreno series uses a unified rendering architecture, and its shader ALU is a typical Vec4+Scalar design. The Vec4 part can process four FP32 MAD operations (a multiply-add, counted as 2 FLOPs) per cycle; the scalar unit cannot do MAD. So:

one Adreno shader unit can deliver 4 × 2 + 1 = 9 FLOPs of floating-point throughput per cycle.

Mainstream Adreno GPU computing capability:

Adreno 200, 2 Vec4+1, 133 MHz, 2.4 GFLOPS

Adreno 205, 4 Vec4+1, 266 MHz, 9.5 GFLOPS

Adreno 220, 8 Vec4+1, 266 MHz, 19.1 GFLOPS

Adreno 225, 8 Vec4+1, 400 MHz, 28.8 GFLOPS

Adreno 320, if it is 16 Vec4+1 running at 400 MHz, about 57 GFLOPS

All of the above figures are FP32 computing power. In OpenGL ES, Adreno computes at FP32 whether high, medium or low precision is requested, so lower precision brings it no performance gain.
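A rough sketch of how these figures come about (the helper peak_gflops below is hypothetical, just the arithmetic described above spelled out in Python):

```python
# Peak FP32 throughput for a Vec4+Scalar shader unit as described above:
# 4 MADs (8 FLOPs) on the vector part + 1 FLOP on the scalar part = 9 FLOPs/cycle.
def peak_gflops(shader_units, clock_mhz, flops_per_cycle=9):
    return shader_units * flops_per_cycle * clock_mhz / 1000.0

print(peak_gflops(2, 133))     # Adreno 200 -> ~2.4
print(peak_gflops(8, 266))     # Adreno 220 -> ~19.1
print(peak_gflops(8, 400))     # Adreno 225 -> 28.8
print(peak_gflops(16, 400))    # Adreno 320 (if 16 Vec4+1) -> ~57.6
```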

 

2. PowerVR SGX series

2.1 Old SGX5 series

Including the SGX530/531/535/540/545, whose shader computing unit is the USSE. In one cycle a USSE can perform a MAD on 4 FX10 values (10-bit fixed point, lower precision than FP16), 2 FP16 values, or 1 FP32 value. Since only FP32 counts as FLOPS in the usual sense, that is 2 FLOPs per cycle. However, when two FP32 operations share an operand, the USSE can process both in one cycle, i.e. two FP32 MADs, or 4 FLOPs. Therefore USSE FP32 throughput is 2-4 FLOPs per cycle.

Mainstream SGX5 GPU computing capability:

SGX530, 2 USSE, 200 MHz, 0.8-1.6 GFLOPS

MTK's SGX531, 2 USSE, 300 MHz, 1.2-2.4 GFLOPS

Samsung Hummingbird's SGX540, 4 USSE, 200 MHz, 1.6-3.2 GFLOPS

The 400 MHz SGX540 in the OMAP4460 and Atom Z2460, 4 USSE, 3.2-6.4 GFLOPS

However, at FP16, the precision used by most games' pixel shaders, throughput doubles relative to the FP32 worst case, and doubles again at the lower-precision FX10.
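The SGX5 ranges can be spelled out the same way (again a hypothetical Python helper; the best case assumes the shared-operand trick described above):

```python
# Old USSE: 1 FP32 MAD per cycle (2 FLOPs), or 2 FP32 MADs when an operand
# is shared (4 FLOPs), hence the quoted ranges; FP16 doubles the worst case.
def usse_gflops_range(usse_count, clock_mhz):
    worst = usse_count * 2 * clock_mhz / 1000.0
    best  = usse_count * 4 * clock_mhz / 1000.0
    return worst, best

print(usse_gflops_range(2, 200))   # SGX530               -> (0.8, 1.6)
print(usse_gflops_range(4, 200))   # SGX540 (Hummingbird) -> (1.6, 3.2)
print(usse_gflops_range(4, 400))   # SGX540 at 400 MHz    -> (3.2, 6.4)
```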

 

2.2 SGX 5XT series

Including the SGX543/544/554 and their multi-core versions. The shader computing unit is the USSE2. Unlike its predecessor, the USSE2 is a Vec4+Scalar design: it handles four FP32 MAD operations per cycle, plus one simple scalar operation (ADD/MUL). Like Adreno, that is 9 FLOPs per cycle.

A single 543 or 544 contains four USSE2s with essentially the same performance; the 544 adds DirectX API support. A single 554 contains eight USSE2s.

Mainstream SGX5XT GPU computing capability:

The 543MP2 in the iPhone 4S, 2 × 4 = 8 USSE2, 200 MHz, 14.4 GFLOPS

The single 544 in the OMAP4470, 4 USSE2 at 384 MHz, is similar to the above

The 544MP2 in the Allwinner A31 (the advertised "8 pipelines" are its 8 USSE2s), 300 MHz, 21.6 GFLOPS

The A5X in the iPad 3, 543MP4, 16 USSE2, 250 MHz, 36 GFLOPS

The A6X in the iPad 4, 554MP4, 32 USSE2, 280 MHz, over 80 GFLOPS


USSE2 performance can also be improved to some extent when computing at the lower FP16 precision.

 

3. ARM Mali series

3.1 Mali-400

The Mali-400 does not use a unified shader; vertex and pixel processing are separate.

A vertex processor contains one Vec4 vertex shader and supports FP32 precision.

A pixel processor contains one Vec4 pixel shader plus a TMU; the shader supports FP16 precision.

Mainstream Mali GPU computing capability:

A Mali-400 "single core" at 400 MHz has a computing capacity of 6.4 GFLOPS

The Mali-400 MP4 at 266 MHz in the Exynos 4210: 10.6 GFLOPS

The Mali-400 MP4 at 440 MHz in the Galaxy S3: 17.6 GFLOPS

If the Mali-400 MP4 in the Note 2 runs at 533 MHz: about 21 GFLOPS

Of course, these are FP16 figures, because the Mali-400's pixel shader does not support FP32 precision.
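For a separate-shader design the peak is simply the sum of the vertex and pixel parts. A minimal sketch (hypothetical Python helper; it assumes a Mali-400 MP4 pairs one vertex processor with four pixel processors, and that each Vec4 unit does 4 MADs, i.e. 8 FLOPs, per cycle):

```python
# Separate shaders: total peak = (vertex Vec4 units + pixel Vec4 units)
# * 8 FLOPs/cycle * clock. The pixel part is FP16-only, so these are the
# FP16 figures quoted above.
def mali400_gflops(vertex_procs, pixel_procs, clock_mhz):
    return (vertex_procs + pixel_procs) * 8 * clock_mhz / 1000.0

print(mali400_gflops(1, 1, 400))   # "single core"            -> 6.4
print(mali400_gflops(1, 4, 266))   # Exynos 4210 Mali-400 MP4 -> ~10.6
print(mali400_gflops(1, 4, 440))   # Galaxy S3 Mali-400 MP4   -> ~17.6
```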

 

3.2 Mali-T6xx series

The T6xx adopts a new, unified rendering architecture. For the T604/624/628, one core contains two ALUs, while the T658/678 are compute-enhanced parts with four ALUs per core.

 

Each ALU is composed of a 128-bit wide vector unit and a 32-bit scalar unit.

Therefore, single-precision (FP32) throughput is 9 FLOPs per ALU per cycle, the same as the USSE2.

So the 533 MHz quad-core Mali-T604 in the Exynos 5250 has an FP32 computing capacity of about 38.4 GFLOPS.

Similarly, because the pixel shaders used in games run at FP16 precision, the vector unit of the T604's ALU can then handle twice as many MADs (8 per cycle), giving 8 × 2 + 1 = 17 FLOPs per cycle. This matches ARM's claimed figures of 17 GFLOPS per T604 core and 68 GFLOPS for four cores at 500 MHz.

So the 533 MHz quad-core Mali-T604 in the Exynos 5250 has an FP16 computing capacity of about 72.5 GFLOPS.
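The FP32/FP16 difference can be checked the same way (hypothetical Python helper; it assumes 2 ALUs per T604 core, as stated above):

```python
# Mali-T6xx ALU: the 128-bit vector unit fits 4 FP32 MADs (8 FLOPs) or
# 8 FP16 MADs (16 FLOPs), plus one scalar op -> 9 or 17 FLOPs per cycle.
def t6xx_gflops(cores, alus_per_core, clock_mhz, fp16=False):
    flops_per_cycle = 17 if fp16 else 9
    return cores * alus_per_core * flops_per_cycle * clock_mhz / 1000.0

print(t6xx_gflops(4, 2, 533))             # Exynos 5250, FP32   -> ~38.4
print(t6xx_gflops(4, 2, 533, fp16=True))  # Exynos 5250, FP16   -> ~72.5
print(t6xx_gflops(4, 2, 500, fp16=True))  # ARM's 500 MHz claim -> 68
```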

 

4. GeForce ULP

The GeForce ULP, like the Mali-400, uses a separate-shader architecture. Its vertex shaders and pixel shaders are scalar, not Vec4. The vertex shaders support FP32 precision, and the pixel shaders support FP20 and FX10. So:

"8-core" Tegra 2, 4VS+4PS, 300MHz, computing power 4.8 GFLOPS

"12 core" Terga3, 4VS+8PS, 520MHz, 12.5 GFLOPS computing power

 

5. Vivante's GC series

Similar to Adreno, it also uses a Vec4+1 structure. Likewise, high, medium and low precision are all computed at FP32, so there is no gain at lower precision.

The GC800 in the RK29, 1 Vec4+1, 450 MHz, 4 GFLOPS

The GC2000 in Freescale's i.MX6, 4 Vec4+1, 600 MHz, 21.6 GFLOPS

The GC4000 in HiSilicon's K3V2, 8 Vec4+1, 480 MHz, 34.6 GFLOPS

 

GPU "Compatibility"

GPU "compatibility" is another issue that comes up often nowadays. Here we look at the texture formats each GPU supports.

First is ETC1, the texture compression format specified by OpenGL ES 2.0, which every GPU supports. Its drawback is that it has no alpha channel, so a texture with alpha must be split into two textures and read separately, which is inefficient and wastes bandwidth.

PVRTC is PowerVR's own texture format, and ATITC is Qualcomm Adreno's. In addition there is S3TC, the DXT format common on the desktop and used by Microsoft Direct3D, which supports the alpha channel.

PowerVR GPUs support their own PVRTC plus the universal ETC1 (under iOS, PowerVR GPUs only support PVRTC); Adreno supports its own ATITC plus ETC1; NVIDIA's GeForce ULP and Vivante's GC series support DXT and ETC1; the Mali-400 supports only ETC1. This is why games ship different data packages for different GPUs. The universal package generally uses ETC1; universal as it is, its lack of an alpha channel forces two texture reads, which is actually a disadvantage for the non-Mali GPUs. A GPU using its own supported format avoids this penalty. The Adreno 2xx series, with relatively few TMUs (texture mapping units), may suffer even more.

Of course, texture support is only one aspect of compatibility, not the whole story.

 

"Multi core" of each family

That covers the GPU hardware. Here is a table summarizing what each GPU vendor officially defines as one "core". It should be clear at a glance whose cores have more substance and who is being less honest. Faced with all the "16-core" and "8-pipeline" marketing, you should now be able to see through it.

 

Scores don't match actual performance? Optimization matters!

Finally, the specification is only one aspect of a GPU; actual performance has a lot to do with the architecture. Moreover, even GPUs with similar benchmark scores will perform differently in different games.

First, benchmark programs aim to be fair, so they are essentially "zero-optimization" programs. For fairness, their textures use RGBA PNG, TGA or ETC1, and never a GPU's own proprietary format.

Games are different: a game can be optimized for each GPU. For example, on PowerVR GPUs it can use 4 bpp or even 2 bpp PVRTC textures, saving 8 or even 16 times the bandwidth compared with uncompressed textures. Without such optimization it may be stuck, like Mali, with alpha-less ETC1 and two texture reads, wasting bandwidth. Some developers even ship uncompressed textures in the universal package, widening the gap further. This optimization is why, for the same game, GPUs with similar benchmark scores deliver better effects and smoother frame rates on iOS.
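The bandwidth claim is easy to check from the bits-per-pixel of each format (a quick sketch in Python; mipmaps and caching are ignored):

```python
# Texture size in MB for a given bits-per-pixel (bpp):
# uncompressed RGBA8888 = 32 bpp; ETC1 = 4 bpp but no alpha, so an RGBA
# texture needs two ETC1 textures; PVRTC = 4 or 2 bpp with alpha included.
def texture_mb(width, height, bpp):
    return width * height * bpp / 8 / 1024 / 1024

w, h = 1024, 1024
print(texture_mb(w, h, 32))       # RGBA8888                -> 4.0 MB
print(texture_mb(w, h, 4) * 2)    # ETC1 color + ETC1 alpha -> 1.0 MB
print(texture_mb(w, h, 4))        # PVRTC 4bpp              -> 0.5 MB (1/8)
print(texture_mb(w, h, 2))        # PVRTC 2bpp              -> 0.25 MB (1/16)
```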

Secondly, benchmarks are to some extent ahead of their time: most GPUs cannot run them at smooth frame rates (if every GPU hit the frame cap, how could we measure the difference?). Earlier benchmarks leaned more on texturing and pixel work; the newer generation adds scene complexity and puts more pressure on polygon and shader computation, as in the step from GLBenchmark 2.1 to 2.5. As a result, GPUs with strong triangle setup and native shader computing power, such as the Adreno 220/225, gained noticeably in score, while the Mali-400 hit its triangle setup bottleneck in 2.5 and no longer scores as well as before.

Games are made for people to play, so terminal or SoC vendors can work with game developers to optimize for their GPU's strengths. Different GPUs emphasize different things: the Mali-400 is weak on triangles but has a strong pixel fill rate, while Qualcomm's Adreno 2xx and Vivante's GPUs are strong on polygons and shader computation but weaker on fill rate. A game optimized for Mali will cut the polygon count and lean on textures and pixel effects for its look, which hurts the Adreno 2xx series; one optimized for Adreno can raise scene complexity and use more triangles for finer modeling, which hurts Mali. Beyond that, many details can be further tuned, and each GPU vendor provides corresponding tools.

Finally, GPU benchmark scores reflect real performance to a degree, but the final in-game result still depends on the developer's optimization. So don't stare at scores alone; ask friends who have played the game and look at real-world tests, which will help more.

 

References:

1. Tom Olson, Triangles Per Second: Performance Metric or Chocolate Teapot?

2. Zhu Jun, Graphics Development on the i.MX 6 Series

3. Kari Pulli, Jani Vaarala (Nokia), Ville Miettinen, Robert Simpson, Tomi Aarnio, Mark Callow, The Mobile 3D Ecosystem

4. Imagination Technologies, PowerVR Series 5 Architecture Guide for Developers

5. Renaldas Zioma, Unity: iOS and Android - Cross-Platform Challenges and Solutions

6. Anandtech, Google Nexus 4 Review

7. Anandtech, Qualcomm Snapdragon S4 (Krait) Performance Preview

8. Hiroyuki Ogasawara, Mobile GPU Comparison

9. Qualcomm, Snapdragon S4 Processors: System on Chip Solutions for a New Mobile Age

10. Vivante, Vivante Graphics Cores

11. Hiroshige Goto, Mali T-604 & T-658 Shader Core

 


 

Related links:

Refusing to be fooled: a full interpretation of mobile GPUs (I)

A rational view of 1080p full-HD phones

Mobile CPU asynchronous multi-core and big/little cores

How to improve the image quality of Android games?

What does "multi-core" mean for a mobile GPU?

Mobile RAM, ROM and memory card

How to better protect the power key of your phone?

What can OTG on phones and tablets do?

 
