At CES 2015, Nvidia opened with a blockbuster: the Tegra X1. At launch, Nvidia claimed that the new X1 processor delivers twice the performance of the previous-generation K1, which would make the Tegra X1 the most powerful mobile processor on the market today. Join us for a look at how strong the X1 is and where that strength comes from.
Let's start with Nvidia's home turf, the GPU. As early as GTC 2014, Nvidia announced that the next generation of Tegra processors would use a Maxwell-architecture GPU. Maxwell had already appeared in desktop GPUs, and Nvidia put considerable effort into bringing the architecture to a mobile processor. Unlike the Kepler GPU in the Tegra K1, the Maxwell GPU in the X1 can be regarded as a design built from scratch rather than a casual port.
When Nvidia decided to put its mobile processor business first, the company's ambition was obvious. For Tegra, this high priority meant that Nvidia's latest and strongest GPUs would reach mobile processors sooner: the Tegra X1 arrived only about a year after the first Maxwell parts, a much shorter gap than the two years between Kepler and the K1.
In addition, the high priority means Nvidia now optimizes power for mobile processors from the lowest level of the architecture. This benefits Tegra and also significantly reduces the energy consumption of desktop GPUs.
The Tegra X1 is thus the first product of this strategy, which gives it far-reaching significance for Nvidia. Thanks to this strategy, the Tegra X1 builds on the already powerful Tegra K1, and many of its improvements come from the Maxwell architecture. On the CPU side, Nvidia was determined to have the strongest CPU on the market, so it turned to ARM and licensed the Cortex-A57 core. (However, since high-end CPUs will largely be A57-based in the future, the Tegra X1's biggest weapon is still its formidable GPU.)
Looking deeper into the Tegra X1's GPU, what we find is a Maxwell 2 GPU designed for Tegra. The Maxwell 2 architecture adds a series of new features, including third-generation delta color compression, and improves the energy efficiency of each CUDA core. Other graphics features include conservative rasterization, volume tiled resources, and multi-frame anti-aliasing. All of these impressive-sounding features are included in the Tegra X1.
Of all the improvements in the X1, those to memory bandwidth and overall efficiency matter most, because these two areas are the main bottlenecks of mobile processors. To raise memory bandwidth on high-end parts, mobile processor vendors often widen the memory bus (to 96 or 128 bits). This brute-force approach is certainly the most effective and intuitive, but a wider bus also raises cost and adds complexity to both the processor and the components around it. The X1 still uses a 64-bit memory bus, so to avoid starving its powerful GPU, Nvidia added data compression and moved to LPDDR4, allowing the X1's GPU to reach its full potential.
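To make the trade-off concrete, peak memory bandwidth is simply the bus width multiplied by the transfer rate. The sketch below uses the 64-bit bus mentioned above; the 3200 MT/s LPDDR4 rate and the 1866 MT/s LPDDR3 rate are assumptions chosen for illustration and do not come from this article.

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth in GB/s: bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec * 1e6 / 1e9

# Assumed data rates (illustrative, not taken from the article).
LPDDR3_MTS = 1866
LPDDR4_MTS = 3200

lpddr3_64bit = peak_bandwidth_gbs(64, LPDDR3_MTS)    # narrow bus, older memory
lpddr3_128bit = peak_bandwidth_gbs(128, LPDDR3_MTS)  # the "widen the bus" route
lpddr4_64bit = peak_bandwidth_gbs(64, LPDDR4_MTS)    # the route described for X1

for name, value in [("64-bit LPDDR3", lpddr3_64bit),
                    ("128-bit LPDDR3", lpddr3_128bit),
                    ("64-bit LPDDR4", lpddr4_64bit)]:
    print(f"{name}: {value:.1f} GB/s")
```

Under these assumed rates, faster memory on a 64-bit bus recovers most of what a 128-bit bus would give, and compression then reduces how much of that bandwidth the GPU actually needs.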
The thermal design power (TDP) of a mobile processor is also a limiting factor, and improving it pays off handsomely: lower power consumption leaves room for higher performance, and better heat control lets the processor sustain that performance under continuous load. This is why the X1 is built on TSMC's 20 nm process, which helps reduce Maxwell's power consumption.
Last but not least, the X1 has a feature specific to the mobile GPU that does not appear on the desktop GPUs. Nvidia calls it "Double Speed FP16". With this feature, the CUDA cores achieve higher throughput on FP16 work, which is useful in certain applications.
Like Kepler and Fermi before it, Maxwell has only dedicated FP32 and FP64 CUDA cores, and the X1 is no exception. Recognizing the importance of FP16, Nvidia gave the X1 a special way to handle FP16 work. On the K1, FP16 operations are simply promoted to FP32 and executed on the FP32 cores, whereas the X1 packs two FP16 operations into a single Vec2 and issues it to one FP32 CUDA core.
In short, the X1 can pack two FP16 operations of the same kind together, which lets it use its CUDA cores more fully and flexibly.
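The packing idea can be sketched in software. The example below only emulates the data layout, with two FP16 values sharing one 32-bit word and the same operation applied to both halves; it says nothing about how the hardware schedules the work. The function names `pack_half2`, `unpack_half2`, and `add_half2` are hypothetical and chosen for illustration.

```python
import struct

def pack_half2(a, b):
    """Pack two FP16 values into one 32-bit word (a in the low 16 bits)."""
    lo, hi = struct.unpack("<HH", struct.pack("<ee", a, b))
    return lo | (hi << 16)

def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<ee", struct.pack("<I", word))

def add_half2(x, y):
    """Apply the same operation (addition) to both lanes of two packed words."""
    x0, x1 = unpack_half2(x)
    y0, y1 = unpack_half2(y)
    return pack_half2(x0 + y0, x1 + y1)

a = pack_half2(1.0, 2.0)
b = pack_half2(0.5, 0.25)
result = add_half2(a, b)
print(hex(a), unpack_half2(result))
```

One packed word carries two results per operation, which is where the "double speed" in the feature's name comes from, provided both lanes need the same operation.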
In fact, this is not a new idea; Nvidia's competitors already do something similar. The approach is admittedly something of a shortcut, but ARM and Imagination already support FP16 in their current GPUs (through either dedicated FP16 units or more flexible ALU configurations), and even AMD is set to join them. It is reasonable for Nvidia to follow.
But why does FP16 matter? It is a long story, but in short: FP16 is widely used in Android's display composition, where such low-precision computation is critical to saving power. FP16 also has a place in mobile games, and it appears in image recognition applications (such as Nvidia's own Drive PX platform).
FP16 has its limitations, since 16 bits do not leave much room for a floating-point number. Even so, it plays an important role in the applications mentioned above, so processing FP16 quickly and accurately matters.
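The limits of a 16-bit float are easy to demonstrate. The sketch below uses Python's standard `struct` module, whose `e` format is IEEE 754 half precision, to round values to FP16 and back; the helper names `to_fp16` and `fits_in_fp16` are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fits_in_fp16(x):
    """Return False when x is too large to be encoded as a finite FP16 value."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

# 0.1 cannot be stored exactly; the error is far larger than in FP32.
rounding_error = abs(to_fp16(0.1) - 0.1)

# With an 11-bit significand, odd integers above 2048 are not representable.
rounded_2049 = to_fp16(2049.0)

print(rounding_error, rounded_2049, fits_in_fp16(65504.0), fits_in_fp16(70000.0))
```

Roughly three decimal digits of precision and a maximum finite value of 65504 are enough for colors and many image-processing workloads, which is why the format suits the uses listed above.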
That covers the features. Now it is time to let the numbers speak.
Overall, the X1's GPU consists of two Maxwell SMMs inside a single GPC, for a total of 256 CUDA cores. Compared with the single SMX of the K1, the number of streaming multiprocessors has doubled, which means that per-multiprocessor resources such as geometry and texture units have doubled as well. The X1's more energy-efficient CUDA cores also put it well ahead of Kepler.
Besides the CUDA core count, Nvidia has also changed the raster operation (ROP) units. The X1 has 16 ROPs, four times as many as the K1, which brings it level with GM107. This improvement is also critical for the X1's support of 4K at 60 Hz. At the same time, the improved bandwidth handling (both in efficiency and in raw bandwidth) ensures that these ROPs are not starved under heavy workloads.
Finally, we inevitably come back to clock speed and expected performance. Nvidia has not officially announced the X1's GPU clock for now, but its published performance figures give us a clue: Nvidia claims that the X1's FP16 throughput reaches 1 TFLOPS, from which we can infer that the GPU's maximum clock is about 1 GHz (1 GHz × 2 FP16 × 2 FLOPs per FMA × 256 cores ≈ 1 TFLOPS).
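The inference above is simple enough to check in a few lines. The sketch below reproduces the arithmetic using the 256 CUDA cores and the 1 TFLOPS claim from the text; the 1 GHz clock is the article's guess, not an official figure, and the function names are hypothetical.

```python
CUDA_CORES = 256         # from the text
FLOPS_PER_FMA = 2        # a fused multiply-add counts as two floating-point ops
FP16_PER_CORE = 2        # two packed FP16 values per FP32 core ("Double Speed FP16")
ASSUMED_CLOCK_HZ = 1e9   # inferred in the article, not officially confirmed

def peak_flops(clock_hz, values_per_core):
    """Theoretical peak throughput for a given clock and packing factor."""
    return clock_hz * CUDA_CORES * FLOPS_PER_FMA * values_per_core

def clock_for_target(target_flops, values_per_core):
    """Clock needed to hit a target throughput, inverting peak_flops."""
    return target_flops / (CUDA_CORES * FLOPS_PER_FMA * values_per_core)

fp16_peak = peak_flops(ASSUMED_CLOCK_HZ, FP16_PER_CORE)
fp32_peak = peak_flops(ASSUMED_CLOCK_HZ, 1)
clock_for_1_tflops = clock_for_target(1e12, FP16_PER_CORE)

print(f"FP16 peak: {fp16_peak / 1e12:.3f} TFLOPS")
print(f"FP32 peak: {fp32_peak / 1e9:.0f} GFLOPS")
print(f"Clock for exactly 1 TFLOPS FP16: {clock_for_1_tflops / 1e9:.3f} GHz")
```

At exactly 1 GHz the formula gives 1.024 TFLOPS for FP16, so a round 1 TFLOPS claim is consistent with a clock slightly under 1 GHz.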
This clock speed is essentially desktop-class and very aggressive for a mobile processor. It is still unknown in what form the X1 will eventually reach consumers. For now, the only certainty is that devices equipped with the Tegra X1 will not arrive soon (though Nvidia's own products may be an exception). When such a monster runs at full speed, power consumption and heat dissipation will also be unavoidable problems.
Via
Update: Field performance test by
The 3DMark score reached 43241, twice that of the Apple A8X.
In GFXBench, both the scores and the frame rates are off the charts.
The average power consumption chart shows 2.651 watts for the Apple A8X and 1.498 watts for the X1. If power consumption is reined in a little further, it is really not impossible for a mobile phone to use the Tegra X1.