After the previous generation's puzzling decision to use the Snapdragon 765G, expectations for the Google Pixel series had fallen in many people's minds. The Pixel 6 series is different: it brings Google's own SoC, Google Tensor (named after the tensors at the heart of AI/ML), camera hardware that has finally caught up with the times, and relatively aggressive pricing.
With its return to the flagship market, the computational-photography powerhouse is finally willing to use a modern CMOS sensor! The news spread quickly through the phone-enthusiast community, and once overseas users got hold of retail units, AnandTech published its test results and analysis of Google Tensor.
Without altering AnandTech's original meaning, we have reorganized and compiled their coverage of this important and interesting SoC.
Fully in-house, or a heavily modified (semi-custom) design?
Google says Tensor is the starting point of a journey into new workloads that existing chip solutions could not serve. Drawing on years of machine-learning research, Google built Tensor as an SoC differentiated by machine learning, one that is said to enable many new features unique to the Pixel.
The first debate around Google Tensor: is it fully in-house, or a heavily modified, semi-custom design? The answer largely depends on how you define "in-house". Google and Samsung appear to have worked closely together, blurring the line between a traditional in-house design and a semi-custom one.
Internally, Google Tensor's codename is GS101, where GS may stand for Google SoC or Google Silicon. The previously rumored "Whitechapel", however, has no evidence tying it to an actual chip.
Google Tensor largely follows Samsung's Exynos naming scheme, with the chip ID "0x09845000". A teardown shows the package silkscreened as S5P9845 (Editor's note: when the original article was first published we assumed the ID corresponded to S5E9845, but TechInsights' teardown confirmed S5P9845). For reference, the Samsung Exynos 2100 is S5E9840 and the Exynos 1080 is S5E9815.
Reports of Samsung offering semi-custom chip services go back several years, including news of cooperation with Cisco and Google. An ETNews article from August 2020 said Samsung would provide "customized" technologies and features according to customer needs, starting as early as the chip-design stage.
In this model, Samsung is no longer just a chip manufacturer but is deeply involved in chip design, comparable to an ASIC design service. It is a rather special case: Samsung runs a foundry business like TSMC's, but also designs its own SoCs.
Google Tensor and Samsung's Exynos share a great deal. Beyond the headline blocks that get most of the attention (CPU, GPU, NPU), many of the chip's basic building blocks are common to both. On paper, Samsung, MediaTek, HiSilicon, and even Qualcomm (CPU only) all use Arm's off-the-shelf Cortex CPUs and Mali GPUs, yet the underlying designs of their chips differ greatly.
Google Tensor is built on Samsung's Exynos foundation: it shares the same clock and power-management architecture, the PHY IP for the memory controllers and external interfaces, and even larger functional IP blocks such as the ISP and the media codecs. Interestingly, public information on GS101 is already available on GitHub, so its structure can be compared 1:1 against Exynos.
However, even though the basic blocks and scaffolding come from Exynos, the definition of the SoC is genuinely controlled by Google, and the way the fabric and IP blocks are wired together differs between Google Tensor and Samsung's Exynos.
For example, on Exynos the CPU clusters hang off Samsung's own bus, while Google Tensor places its CPU cluster behind a larger CCI interconnect; externally this may be a different bus design or an entirely different IP. The way the memory controllers are connected also differs.
Wild performance specifications
The CPU alone shows how Google Tensor differs from mainstream parts: 2x Cortex-X1 + 2x Cortex-A76 + 4x Cortex-A55. This "2+2+4" layout appeared before in Samsung's Exynos 9820 and Exynos 990, but among today's Android flagship SoCs a 1+3+4 arrangement is the absolute mainstream. Google is the only one daring to stack two X1 cores.
In theory, with two X1 super cores its multi-core CPU performance should be stronger than single-X1 designs. On frequency, Tensor's X1 runs at 2.80GHz, slightly below the Snapdragon 888's 2.84GHz and the Exynos 2100's 2.91GHz. Google also gave each X1 a 1MB L2 cache like the Snapdragon 888, versus the cut-down 512KB L2 of the Exynos 2100's X1.
On the big-core side, Google chose the old Cortex-A76 (2.25GHz, 256KB L2), a controversial decision: the A77 and A78 offer both higher performance and better energy efficiency, and even AnandTech could not get a clear explanation from Google.
AnandTech speculated that the chip may have been designed years ago, when Samsung had no newer IP for Google to choose from, or that by the time the super-core slot was swapped to the X1 there was no time left to update the big cores as well. But Google surely did not pick the A76 on purpose, because the tests below show it is genuinely behind the times.
On the little-core side there are four 1.80GHz Cortex-A55s. Google chose 128KB of L2 per core instead of the 64KB Samsung uses on Exynos, which again makes the CPU look more like the Snapdragon 888. Strangely, though, Google ties the cluster's L3 cache frequency to the A55s, which causes latency and power problems and differs from how the Exynos 2100 clocks its L3.
Google Tensor's GPU is a Mali-G78 MP20, second only to the Kirin 9000's G78 MP24 (Editor's note: the G78's maximum configuration). People initially assumed Google would run it at low frequency for better energy efficiency, but Google surprisingly pushed the shader cores to 845MHz and the tiler/L2 to 996MHz, which is wild. It is also the first product to use the G78's split-frequency capability.
For reference, the Exynos 2100's G78 MP14 runs at "only" 854MHz, and its peak power draw is already very high. Google added 42% more cores while keeping the frequency high, so peak performance should be excellent, but peak power will be brutal as well. The memory controller appears to be the same as the Exynos 2100's: 4x 16-bit LPDDR5 with a theoretical bandwidth of 51.2GB/s.
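That bandwidth figure follows directly from the bus configuration, assuming LPDDR5 running at the standard 6400MT/s data rate of this generation:

$$4 \times 16\,\mathrm{bit} \times 6400\,\mathrm{MT/s} \div 8\,\mathrm{bit/byte} = 51.2\,\mathrm{GB/s}$$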
There is also an 8MB system-level cache (SLC), though it is unclear whether it is the same IP as in the Samsung Exynos 2100, since its architecture and behavior differ. Google makes extensive use of the SLC to boost SoC performance, including for its own custom blocks. The SLC can be partitioned, dedicating portions of its SRAM to specific IP blocks on the SoC so that, depending on the use case, they get exclusive use of all or part of the cache.
ISP and TPU: the glory of Google
When people talk about the ISP integrated in an SoC, they often describe it as a single IP. In reality an ISP is a collection of specialized IP blocks, each handling a different stage of the imaging pipeline. Google Tensor is interesting here because it integrates pieces from Samsung's Exynos chips alongside custom blocks of Google's own design, as Google highlighted when presenting the SoC.
Parts of the imaging system are shared with Exynos, such as the phase-detection processing unit, contrast-autofocus processing unit, image scalers, distortion-correction block, and texture/occlusion processing block. What it lacks relative to Exynos appears to be some of Samsung's image post-processing modules.
Google added its own 3AA block (auto exposure, auto white balance, auto focus) and a pair of its own temporal noise-reduction IP blocks (for image alignment and merging) to the ISP. These are likely the blocks Google says accelerate image processing; they underpin the Pixel's computational photography and are undoubtedly a very important part of the imaging pipeline.
The TPU is what puts the "Tensor" in Google Tensor. Google has been developing its own TPUs for years, and at the driver level it calls Tensor's TPU the "Edge TPU". That is an interesting signal, because it suggests a link to the Edge TPU Google released in 2018, an ASIC Google designed for edge inference (cloud.google.com/edge-tpu).
That original Edge TPU claimed 4TOPS of compute within 2W. Google has not published performance figures for Tensor's TPU, but in some tests its peak power is around 5W. So if the two are indeed related, then given the process-node and IP advances of recent years, the TPU in Google Tensor should be significantly faster.
The TPU is the pride of Google's silicon team. It uses Google's latest machine-learning processing architecture, optimized for the way Google runs ML internally, and is said to enable new, unique use cases, which was one of Google's main goals in building a custom SoC in the first place. In the tests later on, this TPU's numbers really are impressive. Since little is public about the TPU, we can only make rough guesses from its driver; it appears to contain a quad-core Cortex-A32 CPU.
Other modules: baseband and audio/video decoder
For media encoding, Google Tensor uses Samsung's Multi-Function Codec (the same block as in the Exynos series), plus an in-house IP block that appears to handle AV1 decoding. This is a bit odd, because Samsung's marketing claims AV1 decode for the Exynos 2100 and the capability seems to be present in the kernel driver, yet on the Galaxy S21 series it has never been exposed at the Android level.
Google calls its dedicated AV1 decoder "BigOcean", and it gives the Android system genuine hardware AV1 decoding. Oddly, it really does handle only AV1; every other format is encoded and decoded by Samsung's MFC.
Google Tensor's audio subsystem also differs: Google replaced Samsung's low-power audio decoding subsystem with its own IP block, which can play audio at low power without waking the rest of the SoC. We suspect this block also serves as a coprocessor, another point of difference between Google Tensor and Exynos.
Google also includes a hardware memory compressor called Emerald Hill, which accelerates LZ77 compression of memory pages and can in turn speed up swapping pages out to zram. It is not certain whether the Pixel phones actually enable this block, but its presence can be confirmed by the "lz77eh" entry in "/sys/block/zram0/comp_algorithm". As a side note, Samsung integrated similar hardware compression IP into its SoCs five years ago, but for whatever reason it was never enabled, perhaps because the energy efficiency was not as good as hoped.
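As a rough illustration of that check, here is a minimal Kotlin sketch; it assumes an environment with read access to the sysfs node (SELinux may block ordinary apps, so on a stock device an adb shell `cat` is the simpler route), and is not a confirmed Pixel-specific tool:

```kotlin
import java.io.File

fun main() {
    // zram lists its available compression algorithms here; the active one appears in [brackets].
    val node = File("/sys/block/zram0/comp_algorithm")
    if (!node.canRead()) {
        println("Cannot read ${node.path}; try `adb shell cat ${node.path}` instead.")
        return
    }
    val algorithms = node.readText().trim()
    println("comp_algorithm: $algorithms")
    // "lz77eh" showing up here would indicate the Emerald Hill hardware compressor is exposed to the kernel.
    println(if ("lz77eh" in algorithms) "lz77eh is present" else "lz77eh is not listed")
}
```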
Image source: PBKReviews
Google has also built the first non-Qualcomm mmWave phone: the Pixel 6 series uses Samsung's Exynos 5123 baseband. Samsung showed off its mmWave RF and antenna modules back in 2019 and said they would appear in production phones in 2020 (whether the Pixel 6 was originally planned for 2020 is unclear). The Pixel 6 series' peak downlink is rated at 3200Mbps, but in many tests its real-world speeds are only about half those of Qualcomm-based products.
Although it is essentially the same baseband, it is not integrated into the SoC as on the Exynos 2100 but sits as an external chip. The reason may be that Tensor's CPU and GPU are so large (and the TPU's size is unknown); even with the modem moved off-die, Google Tensor is still quite a large chip compared with the Exynos 2100.
Overall, Google did design and define Tensor, and there are many Google-specific blocks that differentiate the chip. But from a lower-level perspective, Tensor and Exynos have a great deal in common and share many Samsung-specific building blocks, so "semi-custom" may be the more appropriate label.
Actual performance: unsatisfactory
In testing, Google Tensor's DRAM latency is higher than the Exynos 2100's and worse still compared with the Snapdragon 888. Google changed the memory-controller behavior: it scales the memory controller and DRAM frequency based on load and the percentage of memory-stall cycles in the cores. This differs from Samsung's approach, and its effective utilization is lower than Samsung's memory controller achieves. Whether the issue lies with the CPU or somewhere else inside the SoC is unclear, but it definitely affects the tests that follow.
L3 latency is also quite high, much higher than on the Exynos 2100 or Snapdragon 888. Google does not give the DSU and its L3 cache their own frequency, but ties them to the A55 little cores. Strangely, even with the X1 or A76 cores fully loaded, the A55s and the L3 idle along at low frequency, whereas in the same situation both the Exynos 2100 and the Snapdragon 888 raise their L3 frequency.
In the latency curves, the system-level cache shows up in the 11-13MB region (1MB L2 + 4MB L3 + 8MB SLC). For ordinary memory accesses Tensor is slower than Exynos, which may be related to modifications in the individual cache pipelines.
Because the L3 is clocked together with the A55s, Google Tensor's A55 little cores see the lowest L3 latency of any of these SoCs, as if there were no asynchronous clock bridge in the way.
On the CPU side, Google Tensor behaves more like the Snapdragon 888 than the Exynos 2100. Its X1 L2 cache is twice the size of the Exynos 2100's, but its frequency is 3.7% (110MHz) lower.
Tensor's weakness is memory latency, which makes many SPEC sub-tests slower than on the Snapdragon 888 and Exynos 2100 while burning more energy (the CPU sits waiting on memory). In overall SPEC score, Tensor lands slightly below the Exynos 2100 and 12.2% behind the Snapdragon 888; because the run takes longer, it ends up using 13.8% more energy. Converted to average power, the gap to the Snapdragon 888 works out to roughly 1.4%.
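Presumably the conversion goes like this (our reading of the arithmetic, not spelled out by AnandTech): treat the 12.2% score deficit as a 12.2% longer runtime, then the average-power gap is the energy ratio over the runtime ratio:

$$\frac{1.138}{1.122} \approx 1.014 \;\Rightarrow\; \text{about } 1.4\% \text{ higher average power than the Snapdragon 888}$$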
Tensor also throttles like the Exynos 2100, though somewhat less severely. With adequate cooling its performance would be roughly 5-9% higher (the results in the figure above were obtained in an 11°C environment).
Then there are the poor A76 cores: the Snapdragon 888's A78 is 46% faster while also drawing less power, and the actual IPC gap is 34%, consistent with the difference between the two architectures. If power saving were really the goal, a low-clocked A78 would do it; instead Google shipped two A76s that are high-frequency, power-hungry, and slow, so one can only conclude it was done out of necessity rather than by deliberate choice.
Toward the lower right, the worse the energy efficiency; toward the upper left, the better the energy efficiency ↑
The A55 little cores are no good either. Their performance is only 11% higher than the same-clocked A55s in the Snapdragon 888 (thanks to the L3 and SLC), but their power draw is nearly twice that of the Exynos's already power-hungry A55s; their energy efficiency is barely better than Tensor's own A76 big cores. Compare that with the A55s in the MediaTek Dimensity 1200 or the efficiency cores of Apple's A14, and it is a cruel world indeed.
Because of the A76's performance, even Google Tensor's two X1s cannot make up the deficit, and the overall score suffers. The X1 itself is a touch slower than its rivals, with energy efficiency mostly on par with the Exynos 2100's X1. But the A76 is far behind the times in both performance and efficiency, and the A55 inherits Samsung's tradition of poor little-core efficiency.
The GPU is large and clocked high, yet its peak 3DMark Wild Life score is only 21% above the Exynos 2100's. In GFXBench's Aztec Ruins tests it leads the Exynos 2100 by 14% and edges out the Snapdragon 888. Despite the split-clock design, the bottleneck seems to lie elsewhere in the GPU.
Tensor's GPU peaks at a whopping 9-10W, and the phone throttles almost as soon as the load starts (before even one test run finishes...), which pulls the average power down to 7.28W. The Pixel 6 series has no heat pipe, and its cooling setup and body construction are closer to an iPhone than to a typical Android flagship: under load, the left side of the phone over the SoC hits 45°C while the right side sits at only 30-33°C. Heat dissipation is genuinely weak.
What is puzzling is that this year's SoCs all set unrealistically high GPU frequencies that get dialed back as soon as a sustained load starts. Perhaps it is to handle short bursts of GPU work, or perhaps there is some other reason, but real-world energy efficiency suffers for it.
Beyond Tensor's power draw, the Pixel 6 Pro also uses an LTPO panel, but it behaves quite differently from Samsung's flagship: full-screen peak brightness is 750 nits versus 942 nits on the S21 Ultra, and its baseline power consumption in normal use may well be higher. These unfavorable factors add up to mediocre battery life on the Pixel 6 Pro, whereas the 90Hz Pixel 6's battery life is actually good:
TPU: extremely strong inference performance
This is where Google Tensor redeems itself. In MLPerf, the Pixel ran through NNAPI while the other vendors used their own libraries: Qualcomm with SNPE (recently optimized for MLPerf 1.1, boosting its scores), Samsung with EDEN, and MediaTek with Neuron; Apple's numbers suffered because Core ML acceleration was not used.
In the image-classification, object-detection, and image-segmentation workloads Tensor trails Qualcomm but beats Samsung. In natural-language processing (the MobileBERT model), Google Tensor delivers three times the performance of the Snapdragon 888, extremely strong inference. In its marketing Google did single out real-time transcription, translation, and similar use cases as its differentiators.
In the not-yet-released Geekbench ML test, the TensorFlow-based backend reflects the GPU's machine-learning performance, and there Google Tensor is weaker than the Exynos 2100. With the NNAPI backend, a hybrid CPU+GPU+NPU workload, Google Tensor pulls clearly ahead of the Snapdragon 888.
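For context, "running through NNAPI" means the app hands a TensorFlow Lite model to Android's Neural Networks API and lets the vendor driver (TPU, GPU, or DSP) pick up whatever layers it supports, with the rest falling back to the CPU. A minimal Kotlin sketch of that path; the function name, model file, and tensor shapes are placeholders, only the TFLite/NNAPI classes are the real API:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Placeholder model and shapes for illustration; a real app supplies its own model and buffers.
fun runWithNnApi(modelFile: File, input: FloatArray, outputSize: Int): FloatArray {
    val nnApiDelegate = NnApiDelegate()  // routes supported ops to the NNAPI accelerator driver
    val options = Interpreter.Options().addDelegate(nnApiDelegate)
    val output = arrayOf(FloatArray(outputSize))
    Interpreter(modelFile, options).use { interpreter ->
        // Ops the NNAPI driver cannot handle fall back to TFLite's CPU kernels.
        interpreter.run(arrayOf(input), output)
    }
    nnApiDelegate.close()
    return output[0]
}
```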
Beyond raw performance, the Pixel 6 Pro's power draw during the AI tests is close to that of the Exynos 2100 Galaxy S21 Ultra. During pure inference bursts, the Exynos 2100 peaks at 14W and the Snapdragon 888 at 12W; but because Google Tensor's AI performance is higher, its final energy efficiency comes out ahead.
However, Google has no plans to release an SDK that would let third-party developers tap this powerful TPU. Then again, Samsung's NPU has been shipping for two years and still has no SDK either... For now the TPU's strength shows up mainly in Google's own apps, such as the extra machine-learning features in the camera and the various translation features.
Summary
Google says the main reason it developed its own SoC was that existing SoCs offered too little machine-learning performance and energy efficiency. Tensor's ML performance and efficiency underpin new use cases and experiences, namely the many machine-learning features on the Pixel 6 series: real-time transcription, live translation, and image-processing algorithms all run on Tensor's TPU.
Although Google may not want to admit or discuss it, Google Tensor really is a product of cooperation with Samsung: much of it comes from Exynos and inherits Samsung's weaknesses in energy efficiency. The CPU is held back by the old A76 cores, and the large GPU is held back by cooling, but the TPU genuinely performs, especially in natural-language processing, where it leaves every competitor far behind.
On the whole, though, we believe Google has achieved its initial goal with Tensor. We don't know what direction Google's next-generation SoC will take, but we are very curious to find out.