Today we’re going to dig a little deeper into some of the differences between a CPU and a GPU.
If you haven’t checked out our previous post on this subject, you can do so here.
Last time we discussed that the CPU and the GPU are both processors, but that they are built for different purposes and therefore solve different problems. We also discussed how the architecture of the GPU differs from that of the CPU. Today we’re going to explore those differences in more detail. By the end of this post you will:
- have a better understanding of the four main categories of processors out there;
- know which categories the GPU and the CPU fall into;
- be able to identify tasks that are best suited for the GPU on your own.
Let’s get started.
To begin, we need to lay some foundations. In computing, everything revolves around processing data. The only question is how we want that data processed. Sometimes we want it processed sequentially, other times in parallel, and in some cases we want to verify with absolute certainty that the calculated result is indeed correct. Which of these we value the most dictates the computer architecture we end up using.
To process data, the computer is issued commands – a.k.a. instructions – that tell it what to do with, say, a pair of values. The component that does the actual data processing inside the processor is the ALU – the Arithmetic Logic Unit. (There are other processing units inside the processor, such as the FPU and the SFU, but they work on much the same principles, so without loss of generality we will focus on the simplest of them all.)
The ALU (pictured above) is a piece of hardware that receives data from memory – say, two integers on its two input ports A and B. It is then given an opcode, which is essentially an instruction (Add, Subtract, Multiply, Divide); it carries out the requested operation and provides the result at its output Y, at which point the result is stored back in memory.
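A minimal sketch of that idea in code – the function name and opcode strings here are illustrative, not from any real instruction set:

```python
# Toy ALU model: two operands in, one opcode, one result out.
# Opcode names are made up for illustration, not a real ISA.
def alu(a, b, opcode):
    if opcode == "ADD":
        return a + b
    if opcode == "SUB":
        return a - b
    if opcode == "MUL":
        return a * b
    if opcode == "DIV":
        return a // b  # integer division, as on an integer ALU
    raise ValueError(f"unknown opcode: {opcode}")

# One instruction, one pair of data, one result:
result = alu(6, 7, "MUL")  # -> 42
```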
And for many years this is how computers worked: a single instruction stream and a single data stream. The computer is issued millions, even billions, of such instructions each second, and it diligently carries out the calculations requested of it without ever tiring. As clock frequencies increased, the number of operations a computer could perform per second grew linearly, so this architecture was sufficient for most needs.
So where’s the catch?
In the early 2000s we hit a limit on processor clock frequencies.
This graph shows how computers used to scale in frequency year after year – until somewhere around 2005, when they stopped. What did that mean for us as programmers? It meant that the free lunch was over: we could no longer rely on the same old computer architecture and wait for computers to magically become faster. We had to search for a new way to scale.

Thankfully, the computer manufacturing industry had anticipated this problem long in advance and was ready on time. The solution: if processors’ frequencies could not be pushed any higher, more processing power could be put inside them instead. So the first commercial multi-core processors were introduced. These were processors with multiple distinct computational cores, each capable of doing calculations on its own, independently of the others, in parallel. That allowed us to begin processing multiple instructions (one on each core) on multiple data (each core can work on different data). This is the architecture used by all modern PCs.
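The multi-core idea can be sketched like this – independent workers, each running a *different* instruction stream on *different* data. Python threads are used purely for illustration; the function names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# MIMD sketch: each worker runs its own instruction stream
# (a different function) on its own data, independently.
def sum_list(data):
    return sum(data)

def max_of(data):
    return max(data)

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(sum_list, [1, 2, 3, 4])   # "core 1": one task
    f2 = pool.submit(max_of, [10, 7, 42, 5])   # "core 2": another task
    total, biggest = f1.result(), f2.result()  # -> 10, 42
```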
Flynn’s taxonomy is a classification of computer architectures proposed by Michael J. Flynn in 1966, and it has been widely used ever since.
It breaks computer architectures down into four distinct classes, based on the number of instruction and data streams available to the architecture. We have already dissected two of these:
- Single Instruction / Single Data (SISD)
- Multiple Instructions / Multiple Data (MIMD)
So which are the other two?
The two architectures we haven’t discussed so far are MISD and SIMD (marked with * in the table below).
| | Single Instruction (SI) | Multiple Instructions (MI) |
|---|---|---|
| Single Data (SD) | SISD | MISD* |
| Multiple Data (MD) | SIMD* | MIMD |
Let’s try to see what those are.
- Multiple Instructions / Single Data (MISD)
This architecture means we have multiple instruction streams – multiple cores – all operating on the same data. Remember at the beginning of this article when I said that the choice of architecture serves the need? In real life, if you need to be absolutely certain you did a task correctly, what do you do? You double-check it! You triple-check it! Well, that’s what MISD does as well. You run the same program on two or more different cores and compare the results at the end. If they’re not identical, you rerun the program. This is not a very common use case, but it has its applications. For example, space missions: without the protective shell of the Earth’s atmosphere, the strong radiation in outer space can spontaneously flip a bit and thus invalidate all further computations.
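A toy sketch of that double-check idea – the redundant copies run sequentially here for illustration, where real MISD hardware would run them on separate cores at once:

```python
# MISD sketch: run the same computation several times over the same
# data and accept the result only if every copy agrees.
def compute(data):
    return sum(x * x for x in data)

def redundant_run(data, copies=3):
    results = [compute(data) for _ in range(copies)]
    if len(set(results)) == 1:          # all copies agree
        return results[0]
    return redundant_run(data, copies)  # disagreement: rerun

answer = redundant_run([1, 2, 3])  # -> 14  (1 + 4 + 9)
```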
- Single Instruction / Multiple Data (SIMD)
I intentionally left SIMD for last. So what is SIMD? Single Instruction, Multiple Data: a single instruction stream that works on many pairs of data at once.
This architecture allows us, for example, to bundle multiple pairs of integers (say, 4) together and add them pair-wise all at once; the result is 4 integers – the sums of all 4 pairs.
Note, however, that although we have a single instruction stream – a single core – we achieved parallelism, performing 4 additions at once. The width of the SIMD lane, as we call it (4 in this example), is a design decision made by the manufacturer of the architecture.
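Conceptually, that 4-wide pair-wise add looks like this. The sketch uses plain Python lists, so the additions actually run one after another; on real SIMD hardware the whole batch is a single instruction:

```python
# SIMD sketch: one "add instruction" applied to 4 data pairs at once.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
y = [ai + bi for ai, bi in zip(a, b)]  # -> [11, 22, 33, 44]
```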
The reason the modern GPU is capable of processing so many pixels all at once… is SIMD. The core of the GPU is a SIMD core, and a lane width some manufacturers currently use is 32, so let’s use that as an example. Each time you issue an add, subtract, or divide instruction on the GPU, you know that the operation will be performed on 32 data pairs and will produce 32 results. When you work with images, you have millions of pixels that you want to add or subtract, and that is why the SIMD architecture (the GPU architecture) is so good for graphics processing. The GPU has many SIMD units inside, and each of them takes a batch of the total work and operates on it in increments of 32.

With the introduction of GPGPU frameworks, we programmers can now write any code we like and execute it on the GPU, so we’re no longer limited to just pixels. Any job that is large in nature, where it’s easy to find 32 pairs of inputs you want to perform the same operation on, is perfect for the GPU. This includes physics simulations, biological simulations, computer vision, machine learning, and many more. These do not necessarily produce an image as a result of their calculations, but they operate on large datasets – large enough to be a perfect fit for the architecture of the GPU.
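To see why large datasets fit this model so well, here is a sketch of processing a big array in lane-width batches. The batching is simulated sequentially in plain Python; on a GPU each 32-element batch would be one SIMD instruction:

```python
LANE_WIDTH = 32  # one SIMD instruction processes this many elements

def simd_add(a, b):
    """Simulate issuing one SIMD add per 32-element batch."""
    assert len(a) == len(b) and len(a) % LANE_WIDTH == 0
    out = []
    for i in range(0, len(a), LANE_WIDTH):
        # In hardware, this whole batch is a single instruction.
        batch = zip(a[i:i + LANE_WIDTH], b[i:i + LANE_WIDTH])
        out.extend(x + y for x, y in batch)
    return out

pixels_a = list(range(128))  # 128 elements -> only 4 SIMD instructions
pixels_b = [1] * 128
result = simd_add(pixels_a, pixels_b)
```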
Now I hope you have a deeper understanding of how the CPU and the GPU differ, and that you can better identify the tasks that fit one and those that fit the other.
Remember that the modern computer has both a CPU and a GPU inside. These are two distinct architectures, each with its pros and cons, and when you write your program you don’t have to choose just one of them: you can break your program down into sections and run each section on the architecture that fits it best. You have a heterogeneous machine (heterogeneous meaning you have cores of different architectures inside).
So program it heterogeneously!