[Science Popularization] Confused? What are TPU, IPU and NPU?

With the rise of AI over the past few years, names like "TPU", "IPU" and "NPU" keep appearing on SoCs from Qualcomm, Apple, Samsung, Kirin, MediaTek and Google. What is the difference between all these "xPUs"? Are there really that many distinct architectures, or is it just vendor marketing?

To answer this question, SemiEngineering collected opinions from a large number of industry insiders and summarized them in an article. The original link is: . We have condensed and compiled it here, but the content is still quite hardcore, so let's get started!


Judged against the CPU and the way it developed, most of these "xPUs" are not processors in the strict sense. Machine learning accelerators are a class of processors, but the processing elements they use to do the accelerating vary widely. They are closer to GPUs: accelerators built to run special workloads, and they come in many varieties themselves.

The essence of a processor comes down to three things, and it always comes back to the instruction set architecture (ISA): first the ISA defines what the processor does, then the I/O and memory support that ISA and the tasks it is trying to complete. Going forward, we will see far more innovation and change here than in the past two or three years.

Many new architectures are not single processors. They are combinations of different processor types or programmable engines that live in the same SoC or the same system, with software tasks assigned to different pieces of hardware or flexible programmable engines. All of these processors may share a common API, but their execution domains differ. At this level there really are architectures of many different kinds.

But the reality is that most "xPU" names are marketing, and these names and abbreviations refer to two things at once: one describes the processor's architecture, such as SIMD (Single Instruction, Multiple Data); the other defines the application segment it targets. So a name can describe a processor architecture, but it can equally be a brand, such as "Tensor Processing Unit" (TPU). After all, manufacturers are not naming a single processor but their whole architecture.

 

History

Forty years ago the naming problem was much simpler. Start with the processor we know best, the central processing unit (CPU). Although it has gone through many evolutions, CPUs are basically Von Neumann, Turing-complete processors, each with a different instruction set aimed at improving processing efficiency. Back then there was also a very extensive debate about the advantages and disadvantages of complex instruction set (CISC) versus reduced instruction set (RISC) designs.

RISC-V later brought a lot of attention back to the ISA. The ISA determines how well a processor is optimized for its defined tasks, and you can look at an ISA and start counting cycles. For example, if one ISA has a native instruction for a function and runs at 1 GHz, we can compare it against a processor with another ISA that needs two instructions for the same function but runs at 1.5 GHz. It quickly becomes clear which one is stronger.
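To make that comparison concrete, here is a quick back-of-the-envelope calculation in Python using the numbers above; the assumption that each instruction completes in one cycle is ours, purely for illustration.

def time_per_operation_ns(instructions, freq_ghz):
    # Time to finish the operation in nanoseconds: cycle count divided by cycles per nanosecond.
    return instructions / freq_ghz

a = time_per_operation_ns(instructions=1, freq_ghz=1.0)  # one native instruction at 1 GHz
b = time_per_operation_ns(instructions=2, freq_ghz=1.5)  # two instructions at 1.5 GHz
print(f"A: {a:.2f} ns, B: {b:.2f} ns")  # A: 1.00 ns, B: 1.33 ns -- A wins despite the lower clock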


CPUs come in multiple packaging styles; sometimes I/O or memory is placed in the same package, and such parts are called microcontroller units (MCUs). When modems became popular, digital signal processors (DSPs) appeared. The difference is that DSPs use the Harvard architecture, which separates the instruction bus from the data bus; some also use a SIMD architecture to improve the efficiency of data processing.

Separating instructions and data improves throughput (although it does rule out corner cases such as self-modifying code). Usually the limiting factor here is not computation but I/O or memory, and the industry's focus has shifted from increasing raw compute to making sure enough data keeps arriving so that computation can continue and performance is sustained.

When the performance of a single processor could no longer be improved, multiple processors were connected together, usually sharing memory, which keeps each processor and the cluster as a whole Turing complete. It does not matter which core any part of the program runs on; the result is the same either way.

The next major development was the graphics processing unit (GPU). The GPU breaks with convention because each processing unit, or pipeline, has its own memory that cannot be addressed from outside that unit. Because this memory is limited in size, only tasks that fit into it can be executed, which limits the kinds of tasks the GPU can take on.

For some types of tasks GPUs are very powerful, but their pipelines are very long, which introduces latency and unpredictability. The pipelines let the GPU keep processing data continuously, but if a pipeline has to be flushed, efficiency drops sharply.

The GPU, and later the general-purpose GPU (GPGPU), defined a programming paradigm and software stack that made them easier to use than earlier accelerators. For many years certain work has been specialized: CPUs run sequential programs, while graphics processors focus on displaying images and took us into a highly parallel world, using many small processing units to carry out tasks (including today's machine learning tasks).

Is there any architectural rule that can explain all the new architectures? Perhaps the network-on-chip (NoC) is a suitable framing. In the past, processor arrays were usually connected through memory or a fixed network topology (a mesh or a ring), whereas a NoC lets distributed, heterogeneous processors communicate in a more flexible way. In the future they may also be able to communicate without going through memory at all.

Today's NoCs carry data, while future NoCs could also carry commands, notifications and other traffic, extending into areas where accelerators exchange more than just data. The communication requirements of an accelerator array or cluster may differ from those of a CPU or a standard SoC, but a NoC does not lock designers into a subset; they can optimize and improve performance by meeting the particular communication needs of different accelerators.

 

Execution architecture

Another way to distinguish processors is to look at the operating environment they are optimized for. For example, the cloud and tiny IoT devices may run the same software, but the architectures used in these environments are completely different: their requirements for performance, power consumption, cost, and the ability to operate under extreme conditions all differ.

Whether for lower latency or lower power, some software originally targeted at the cloud is gradually moving to the device side. Even though the hardware architectures differ, people naturally want the same software stack so that the software can run in both places. The cloud needs to provide flexibility because it runs many kinds of applications for many users, which requires that server hardware be optimizable for different applications and able to scale.

Machine learning tasks also have their own requirements. When building systems with neural networks and machine learning, you need software frameworks and a general software stack to program the network and map it onto hardware; you can then adapt the software to different hardware from a PPA (power, performance, area) standpoint. This drives the need to "fit different types of processing and processors onto all kinds of hardware".

These requirements are defined by the application. For example, one company designed a processor for graph operations: it optimizes and accelerates graph traversal and operations such as graph reordering, alongside the brute-force parts, such as the matrix multiplication that dominates machine learning.

Memory access is a particular problem for each of these architectures, because when you build an accelerator the most important goal is to keep it busy for as long as possible: you have to move as much data as possible to the ALUs so that they can process as much data as possible.
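As a rough illustration of that goal, the sketch below shows the classic double-buffering pattern: while the accelerator works on one tile of data, the next tile is already on its way, so the ALUs are not left waiting on memory. The two helper callables are hypothetical stand-ins for a DMA transfer and a kernel launch, not a real device API; in real hardware the copy would be asynchronous so that transfer and compute genuinely overlap.

def stream_tiles(tiles, copy_to_device, launch_compute):
    # Hypothetical double-buffering loop: keep one tile in flight while another is computed.
    if not tiles:
        return
    on_chip = copy_to_device(tiles[0])       # stage the first tile
    for nxt in tiles[1:]:
        in_flight = copy_to_device(nxt)      # start moving the next tile early
        launch_compute(on_chip)              # keep the ALUs busy on the tile already on chip
        on_chip = in_flight                  # the prefetched tile is up next
    launch_compute(on_chip)                  # drain the last tile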

These designs have a lot in common: they all have local memory and a network-on-chip for communication, each processor executing the algorithm works on a small piece of data, and the operations are scheduled by an operating system running on the CPU.

For hardware designers, the tricky part is predicting the workload. Although similar operation types appear at some levels, people are studying how the levels differ. Processing a neural network requires several kinds of processing capability: one part of the network is handled one way, another layer may need a different kind of operation, and the data movement and data volume also change from layer to layer.

You need to build a whole set of different accelerators for the processing pipeline. Understanding and analyzing the algorithms and defining the optimization process are tasks that span the complete architecture. Genome sequencing is similar: certain processing is required, but no single type of accelerator can speed up everything. The CPU manages the execution pipeline, sets it up, drives the DMA, and makes the decisions.
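To picture that division of labour, here is a toy sketch (not any vendor's API) in which the CPU simply sequences the pipeline and decides which engine handles each stage; all of the stage and engine names are invented for illustration.

# Each pipeline stage is assigned to the engine best suited to it.
PIPELINE = [
    ("preprocess",  "cpu"),   # irregular, control-heavy work stays on the CPU
    ("conv_layers", "npu"),   # dense matrix math goes to the ML accelerator
    ("postprocess", "dsp"),   # signal-style filtering suits the DSP
]

def run_pipeline(pipeline, engines, data):
    # engines maps an engine name to a callable; the CPU just sequences the stages.
    for stage, engine in pipeline:
        data = engines[engine](stage, data)   # hand the stage off and collect the result
    return data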

This can come down to how the implementation is partitioned. No processor can be optimal for every task; the FPGA, CPU, GPU and DSP each fall short somewhere. Chip designers can build chips that contain all of these processors, but the difficulty for customer applications is deciding where each part of the system should run: on the CPU? On the FPGA? Or on the GPU?

There always has to be a CPU in there, though, because the CPU runs the irregular parts of the program, and its versatility is an advantage in itself. Conversely, for specialized data structures or mathematical operations the CPU falls short; after all, it is a general-purpose processor, optimized for nothing in particular and outstanding at nothing in particular.

 

Changes in the abstraction layer

The hardware/software boundary used to be defined by the ISA, with memory that was contiguously addressable; with multiprocessors, memory was generally kept coherent as well. But one can imagine that coherence matters far less in a dataflow engine, because data is passed directly from one accelerator to the next.

(Figure: Speedster 7t FPGA architecture diagram)

If you partition the data set, coherence becomes an obstacle: the data has to be compared and updated, which costs extra compute cycles. So different memory structures have to be considered. After all, there is only so much memory close by; you may be able to reach into adjacent memory, but it too runs out quickly and then cannot be reached in time. This has to be understood at design time, and the architecture has to be designed with that understanding.

We also need a higher level of abstraction. Some frameworks can map or compile a known network onto target hardware, for example onto a set of low-level kernels or APIs that are used by the software stack and ultimately by the neural network mapper. At the bottom you may be running on different kinds of hardware depending on what you want to achieve; either way, the same function is realized on different hardware with different PPA.
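Here is a minimal sketch of what such a mapping layer does, with made-up kernel names: the same network description is lowered onto whichever kernel library the target hardware provides.

# Op-to-kernel tables per backend; every identifier here is invented for illustration.
KERNELS = {
    "gpu": {"conv2d": "gpu_conv2d", "matmul": "gpu_matmul"},
    "npu": {"conv2d": "npu_conv2d", "matmul": "npu_matmul"},
}

def lower(graph, target):
    # graph is an ordered list of op names; return the kernel chosen for each op on the target.
    return [KERNELS[target][op] for op in graph]

print(lower(["conv2d", "matmul"], target="npu"))  # same graph, different kernels per backend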

This puts a lot of pressure on the compiler. The central question is how accelerators will be programmed in the future. Do you have a hardwired engine like the earliest GPUs, or have you built small programmable engines with their own instruction sets? Then you have to program each of these things separately, connect the engines along the data flow, and execute the task.

One processor may implement one subset of the full instruction set and another processor a different subset; they will all share some overlapping parts needed for control flow. The compiler has to understand the available libraries and map onto them.


Conclusion

(Figure: Google's TPU)

In fact, processor architecture has not really changed; it still follows the rules it has followed for the past 40 years. What has changed is how chips are constructed: they now contain large numbers of heterogeneous processors, with memory and communication optimized for their respective tasks. Each chip makes different choices about processor performance, optimization targets, required data throughput and data flow.

Every hardware vendor wants to distinguish its chips from everyone else's, and it is much easier to promote a brand than to explain internal technical details. Manufacturers give their chips an "xPU" name and tie it to a specific class of applications, but "xPU" is not the name of a specific hardware architecture.

Google, for example, calls its ASIC (application-specific integrated circuit) a Tensor Processing Unit (TPU), and in that case the name really does refer to a specific hardware architecture.


