[Editor's Note]: The author of this article is "Gun God", a special contributor and technical expert at Love to play computer games.

This article gives readers a comprehensive look at mobile GPUs, covering their architecture, parameters, compatibility and benchmark scores. This is the first part, which explains the architecture and related parameters of mobile GPUs.

Preface

The "core war" on mobile devices is becoming more and more intense and has spread from the CPU to the GPU. As a result, eye-catching marketing terms such as "16 cores", "8 pipelines", "MP4", "triangle generation rate" and "fill rate" are everywhere. I have long hoped that someone would write an introductory article explaining them, but perhaps professionals consider the topic too basic. In the end, as a semi-professional, I could not help writing something myself. This article draws on and cites material from the Internet, which is not cited point by point in the text due to limited space. I have tried to keep the content as simple as possible, but it is limited by my own knowledge, understanding and ability to express it. If you find any inaccuracies or errors, please point them out.

Basic 3D pipeline

Let's start with a brief look at how 3D images are generated. A basic 3D pipeline works as follows:

First, the game engine running on the CPU generates a series of primitives according to some parameters in the game, and sends their vertex data to the GPU.

Second, the vertex processor applies a series of transformations and lighting calculations to the vertex data. Consider that the coordinates of all objects in a game are expressed in the game's world coordinate system, while the picture actually displayed is seen from the player's or the camera's point of view. Getting from one to the other involves several coordinate system transformations. The vertex processor does this work, and the result is the scene as seen from the required viewpoint.
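To make the coordinate transformations concrete, here is a minimal sketch in plain Python of what a vertex processor does to one point. The function names, the rotation-free camera and the OpenGL-style projection matrix are my own assumptions for illustration; real engines also apply model and rotation matrices.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def view_matrix(cam_x, cam_y, cam_z):
    """World space -> camera space for a camera looking down -Z (no rotation)."""
    return [[1, 0, 0, -cam_x],
            [0, 1, 0, -cam_y],
            [0, 0, 1, -cam_z],
            [0, 0, 0, 1]]

def perspective_matrix(fov_y_deg, aspect, near, far):
    """Camera space -> clip space, using the OpenGL-style convention."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]

def world_to_screen(point, view, proj, width, height):
    """Transform one world-space point to pixel coordinates plus depth."""
    x, y, z = point
    clip = mat_vec(proj, mat_vec(view, [x, y, z, 1.0]))
    ndc = [clip[i] / clip[3] for i in range(3)]   # perspective divide
    screen_x = (ndc[0] + 1.0) * 0.5 * width
    screen_y = (1.0 - ndc[1]) * 0.5 * height      # screen y grows downward
    return screen_x, screen_y, ndc[2]

view = view_matrix(0.0, 0.0, 0.0)
proj = perspective_matrix(60.0, 800 / 480, 0.1, 100.0)
center = world_to_screen((0.0, 0.0, -5.0), view, proj, 800, 480)
right = world_to_screen((1.0, 0.0, -5.0), view, proj, 800, 480)
```

A point straight ahead of the camera lands in the middle of the screen, and a point further to the right in world space lands further to the right on screen.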

At this stage, however, the picture consists only of polygons, while what is actually displayed on the screen is pixels. In the third step, the rasterizer rasterizes the polygons, turning the picture into a pixel image.
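Rasterization can be sketched with the common edge-function test: a pixel belongs to a triangle if its center lies on the inner side of all three edges. This is only an illustration under my own assumptions (function names, sampling at pixel centers, brute-force scan of the whole screen); hardware rasterizers are far more elaborate.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(tri, width, height):
    """Return the pixels whose centers are covered by a 2D triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []                      # degenerate triangle covers nothing
    covered = []
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.append((x, y))
    return covered

big = rasterize([(0, 0), (4, 0), (0, 4)], 4, 4)
tiny = rasterize([(0.1, 0.1), (0.3, 0.1), (0.1, 0.3)], 4, 4)
```

The second triangle is smaller than a pixel and misses every pixel center, so it produces no pixels at all.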

The fourth step is to color these pixels. The pixel shader in the fragment processor calculates the color of each pixel according to the algorithm specified by the program. In the fifth step, the results are written to memory and displayed once the frame is complete. The whole process also involves texture mapping, the sixth element of the pipeline: a texture (a static two-dimensional image) is applied to the triangle surfaces in the game according to a given algorithm.

Are parameters reliable?

When device manufacturers advertise how powerful the GPU in their mobile chips is, they often quote a few parameters, most commonly the triangle generation rate and the fill rate. In fact, these theoretical figures are not directly comparable between GPUs from different companies. Some GPUs have very high theoretical figures but mediocre real performance, even worse than GPUs with lower figures. This is because the GPU suppliers, such as Imagination Technologies (PowerVR SGX), Qualcomm (Adreno), Vivante (GC series), ARM (Mali) and nVIDIA (GeForce), may each measure these theoretical figures in a different way.

The triangle generation rate, for example, depends on many details of the test. Some triangles are culled early (because they are off screen, or too small to cover even a single pixel); they are never displayed and need very little processing. Should these triangles be counted? If they are, the triangle generation rate is naturally higher. Alternatively, the test program may submit pre-computed coordinates to the GPU, so that the vertex shader has no complex calculations to perform, and the figure is again higher. Finally, generating a triangle does not necessarily require processing three vertices. When triangles share vertices, indexed drawing reduces the number of vertices to process: two triangles that share an edge need only 4 vertices. If a test relies heavily on this method, it also raises the triangle generation rate.
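The effect of indexed drawing on the quoted figure can be sketched as follows. The function names are hypothetical, the 100 M vertices/s throughput is an arbitrary example value, and the indexed case assumes every shared vertex is processed only once (an ideal vertex cache).

```python
def vertices_processed(triangles, indexed):
    """Count vertex-shader invocations for a list of index triples."""
    if not indexed:
        return 3 * len(triangles)          # every corner is processed again
    # Assumption: an ideal cache, so each unique vertex is processed once.
    return len({v for tri in triangles for v in tri})

def triangle_rate(vertex_rate, triangles, indexed):
    """Triangles per second for a GPU limited only by vertex throughput."""
    return vertex_rate * len(triangles) / vertices_processed(triangles, indexed)

quad = [(0, 1, 2), (2, 1, 3)]              # two triangles sharing an edge
plain = vertices_processed(quad, indexed=False)
shared = vertices_processed(quad, indexed=True)
rate_plain = triangle_rate(100e6, quad, indexed=False)
rate_shared = triangle_rate(100e6, quad, indexed=True)
```

With the same vertex throughput, the indexed test reports 1.5 times as many triangles per second for this quad, without the hardware being any faster.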

Similarly, the fill rate reflects the pixel output capability of the GPU. However, many of the theoretical values given by manufacturers are measured without texturing or shader calculations. They only reflect the ability to output untextured, unshaded pixels, which is far from real use. As another example, the fill rate that Imagination Technologies quotes for the PowerVR SGX series is not the raw value but the raw value multiplied by a factor of 2.5. This stems from a particular feature of the PowerVR architecture, which can discard the occluded parts of the picture without rendering them, reducing wasted work. The raw fill rate of a 200 MHz SGX540 is 400 Mpixels/s; because of this technology, IMG considers its effective fill rate to be equivalent to 1000 Mpixels/s. In a real scene with heavy occlusion, the factor may far exceed 2.5; with little occlusion, it is correspondingly smaller.
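The SGX540 arithmetic can be written out as a small sketch. The figure of 2 pixels per clock is not stated by the manufacturer here; it is simply what 400 Mpixels/s at 200 MHz implies, and the function names are my own.

```python
def raw_fill_rate(clock_mhz, pixels_per_clock):
    """Raw fill rate in Mpixels/s."""
    return clock_mhz * pixels_per_clock

def effective_fill_rate(raw_mpixels, overdraw_factor):
    """Fill rate an overdraw-prone GPU would need to do the same visible work."""
    return raw_mpixels * overdraw_factor

# Assumption: 2 pixels per clock, inferred from 400 Mpixels/s at 200 MHz.
sgx540_raw = raw_fill_rate(200, 2)
sgx540_quoted = effective_fill_rate(sgx540_raw, 2.5)
sgx540_flat_scene = effective_fill_rate(sgx540_raw, 1.0)  # no occlusion at all
```

The quoted number therefore depends entirely on the assumed scene: with no occlusion the factor is 1 and the effective value falls back to the raw 400 Mpixels/s.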

In fact, triangle generation rate and pixel fill rate were abandoned as measures of GPU performance on the PC platform several years ago. Since DirectX 8, GPUs have implemented effects with programmable shaders instead of fixed-function units, so shader computing power has become a key metric, and the same holds for mobile platforms.

Characteristics of mobile platform and architecture of mobile GPU

However, mobile platforms differ from PC platforms in many ways, essentially because they are constrained by power consumption and size. For graphics processing, two points matter most:

First, bandwidth is limited. Raising computing power by adding cores is not difficult as long as the power budget allows it, and many SoCs already integrate quad-core or even "16-core" GPUs. The difficulty is providing enough bandwidth to feed such a powerful GPU and keep it from "starving". On a mobile platform, the CPU, GPU and bus are integrated on a single chip, called an SoC, and the entire SoC, including its CPU and GPU, shares a limited memory bandwidth. Even relatively high-end SoCs with a 64-bit memory interface, such as the Samsung 4412 and Qualcomm 8064, offer only 6.4 to 8.5 GB/s. That is meager next to the PC platform, where main memory delivers more than a dozen GB/s and the GDDR5 video memory of many GPUs exceeds 100 GB/s. In the iPad 4, Apple paired the A6X chip with 128-bit LPDDR2-1066 memory, for a bandwidth of 17 GB/s, to feed the powerful SGX 554 MP4 GPU, yet even that is small by PC standards. Mobile platforms therefore have to achieve reasonable performance within limited bandwidth, and in many cases the bottleneck is not computing power but bandwidth.
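The bandwidth figures above follow from bus width multiplied by transfer rate. The sketch below reproduces them; the function name is my own, and the 800 MT/s rate for the lower 64-bit figure is an assumption inferred from 6.4 GB/s rather than something stated for a specific chip.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000

low_64bit = peak_bandwidth_gb_s(64, 800)     # assumed 800 MT/s memory
high_64bit = peak_bandwidth_gb_s(64, 1066)   # LPDDR2-1066 on a 64-bit bus
ipad4 = peak_bandwidth_gb_s(128, 1066)       # 128-bit LPDDR2-1066
```

These are theoretical peaks shared by the whole SoC; the bandwidth actually available to the GPU is lower.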

Second, the floating-point performance of mobile CPUs is weak compared with PC CPUs. Although the Cortex-A9 is an improvement, there is still a significant gap between its 64-bit NEON unit and the 128-bit or even 256-bit SIMD units of desktop CPUs, on top of the difference in clock frequency. As a result, more of the computation has to rely on the hardware vertex shader.

For these reasons, mobile GPUs differ from PC GPUs. Let's look at the main mobile GPU architectures.

First, the traditional IMR (Immediate Mode Rendering) architecture

At present, almost all desktop GPUs (nVIDIA, AMD) use the IMR architecture. In the mobile field, nVIDIA's GeForce ULP and Vivante's GC series GPUs are IMR designs. After an IMR GPU renders an object, the result is written to the frame buffer in system memory. The GPU may therefore spend a lot of time rendering an object that turns out to be hidden, only for the result to be overwritten when the object in front of it is rendered, so the effort is wasted. This problem is called overdraw. Modern IMR GPUs can mitigate it to some extent, but doing so requires the application to submit the triangles in the scene in strict front-to-back order, so overdraw is still hard to avoid completely.

In addition, because an IMR GPU frequently reads, modifies and writes the frame buffer, it needs high bandwidth, which also increases power consumption.
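The overdraw problem and its dependence on submission order can be illustrated with a toy model: full-screen opaque layers are drawn one after another with a depth test, and every fragment that passes the test is shaded. The function name and the model itself are simplifying assumptions, not a description of any particular GPU.

```python
def imr_shaded_fragments(layer_depths, width, height):
    """Count shaded fragments when full-screen layers are drawn in order.

    layer_depths: depth of each opaque full-screen layer, in submit order
    (smaller means closer to the camera).
    """
    depth = [[float("inf")] * width for _ in range(height)]
    shaded = 0
    for z in layer_depths:
        for y in range(height):
            for x in range(width):
                if z < depth[y][x]:    # depth test passes: shade and write
                    depth[y][x] = z
                    shaded += 1
    return shaded

back_to_front = imr_shaded_fragments([3.0, 2.0, 1.0], 4, 4)
front_to_back = imr_shaded_fragments([1.0, 2.0, 3.0], 4, 4)
```

Drawn back to front, every layer is shaded in full even though only the nearest one is visible; drawn front to back, the depth test rejects the hidden layers before shading.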

Therefore, most mobile GPUs adopt TBR (Tile Based Rendering) architecture

ARM's Mali GPUs and Qualcomm's Adreno GPUs use the TBR (tile-based rendering) architecture, and IMG's PowerVR is in fact tile-based as well. Before the triangle scene is turned into a pixel image (rasterized), a TBR GPU divides the whole picture into small tiles. Each tile is rendered in on-chip memory on the GPU, which avoids frequent reads, modifications and writes of the frame buffer in system memory. Since a triangle may span several tiles, the triangle data (geometry data) may have to be read more than once, but overall this approach greatly reduces accesses to system memory, saving bandwidth and power.

Tile sizes differ between GPUs. PowerVR and Mali generally use tiles of 16 x 16 pixels, while most Qualcomm Adreno GPUs have 256 KB of on-chip memory and render in tiles sized to fit it, an approach Qualcomm calls binning.
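The tiling step can be sketched as assigning each screen-space triangle to every 16 x 16 tile that its bounding box touches. This is a simplified illustration with names of my own choosing; real hardware may test tile coverage more precisely than a bounding box does.

```python
def bin_triangles(triangles, width, height, tile=16):
    """Map each tile (tx, ty) to the indices of triangles that may touch it."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Conservative: use the bounding box, clamped to the screen.
        tx0 = max(0, int(min(xs) // tile))
        tx1 = min(tiles_x - 1, int(max(xs) // tile))
        ty0 = max(0, int(min(ys) // tile))
        ty1 = min(tiles_y - 1, int(max(ys) // tile))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins

scene = [
    [(1, 1), (10, 1), (1, 10)],       # fits inside a single tile
    [(10, 10), (40, 10), (10, 20)],   # spans several tiles
]
bins = bin_triangles(scene, 64, 32)
references = sum(len(indices) for indices in bins.values())
```

The second triangle appears in six tile lists, which is the repeated reading of geometry data mentioned above; in exchange, each tile's pixels stay in on-chip memory until the tile is finished.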

However, apart from PowerVR, TBR GPUs cannot avoid overdraw any more than IMR GPUs can.

PowerVR differs in that its TBDR (Tile Based Deferred Rendering) architecture can completely avoid overdraw.

In addition to tiling, TBDR has an HSR (Hidden Surface Removal) hardware unit after rasterization. It tests the triangles in a tile, discards the occluded ones, and assembles a picture made up only of the visible parts, which is handed to the rest of the pipeline for rendering. The invisible parts therefore need no pixel shader calculations and no texture fetches, which saves both computation and bandwidth and is a great help on mobile devices.
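The idea of deferring shading until visibility is known can be sketched for a single tile. The model below is my own simplification, assuming axis-aligned opaque rectangles instead of triangles: a first pass only records which surface is nearest at each pixel, and a second pass shades each covered pixel exactly once.

```python
def hsr_then_shade(rects, width, height):
    """Resolve visibility for one tile first, then count shaded pixels.

    rects: (x0, y0, x1, y1, depth) opaque rectangles covering the
    half-open pixel ranges [x0, x1) and [y0, y1).
    """
    depth = [[float("inf")] * width for _ in range(height)]
    visible = [[None] * width for _ in range(height)]
    # Pass 1: hidden surface removal, no shading work is done here.
    for index, (x0, y0, x1, y1, z) in enumerate(rects):
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                if z < depth[y][x]:
                    depth[y][x] = z
                    visible[y][x] = index
    # Pass 2: one pixel-shader invocation per covered pixel.
    shaded = sum(1 for row in visible for v in row if v is not None)
    return shaded, visible

tile_rects = [
    (0, 0, 16, 16, 3.0),   # far background
    (0, 0, 16, 16, 2.0),   # nearer full-tile layer
    (4, 4, 12, 12, 1.0),   # small object in front
]
shaded, visible = hsr_then_shade(tile_rects, 16, 16)
shaded_reversed, _ = hsr_then_shade(list(reversed(tile_rects)), 16, 16)
```

The shading cost is 256 pixels for the 16 x 16 tile regardless of the order in which the surfaces were submitted, whereas a GPU that shades as it draws would shade 576 fragments for this back-to-front order.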

References:

1. Tom Olson, Triangles Per Second: Performance Metric or Chocolate Teapot?

2. Zhu Jun, Graphics Development on the i.MX 6 Series

3. Kari Pulli, Jani Vaarala Nokia, Ville Miettinen, Robert Simpson, Tomi Aarnio, Mark Callow, The Mobile 3D Ecosystem

4. Imagination Technologies, PowerVR Series 5 Architecture Guide for Developers

5. Renaldas Zioma, Unity: iOS and Android - Cross-Platform Challenges and Solutions

6. Anandtech, Google Nexus 4 Review

7. Anandtech, Qualcomm Snapdragon S4 (Krait) Performance Preview

8. Hiroyuki Ogasawara, Mobile GPU Comparison

9. Qualcomm, Snapdragon S4 Processors: System on Chip Solutions for a New Mobile Age

10. Vivante, Vivante Graphics Cores

11. Hiroshige Goto, Mali T-604 & T-658 Shader Core
