link
I recommend clicking on the link, because the original article is backed up by numerous graphics, which I am not going to load on this site.
Thoughts.
ICube UPU, the next step in processor evolution?
Reported by Nebojsa Novakovic on Friday, January 13 2012 4:34 pm
Any processor guru will tell you - it's very difficult to establish a new instruction set architecture, even after overcoming all the technical difficulties, as the whole software stack has to be created from scratch: BIOS, compilers, libraries, operating system ports, drivers and basic applications, not to mention convincing the critical third-party software vendors to actually support the new architecture. All this requires substantial engineering, marketing and financial resources, as well as a lot of guts.
Ever since Alpha in 1991, no major new instruction set architecture has appeared in the general market. In fact, since then, most of the non-X86 architectures have disappeared from the scene, leaving X86 - even though widely agreed to be technically the worst - as the predominant one. Power and Sparc still hold a part of the server field, while ARM is, of course, the king of the hill right now in the mobile arena, with its old RISC competitor, MIPS, making some inroads as well.
Now, for the first time in two decades, there's a company openly promoting its own new instruction set, and launching a processor based on it right into the hot waters of the mobile device market. Furthermore, UPU is a brand new philosophy too - for the first time, CPU and GPU are truly fused into one processor core, the MVP (Multi-thread Virtual Pipeline), where even the register file is shared! And the extra surprise element is that the whole thing is fully designed and made in China, a true 'China Core' right from the instruction set definition, without any licensing or other dependencies on US technology.
ICube was set up by Fred Chow and Simon Moy, two industry veterans. Simon was behind the world's first 64-bit MIPS processors at SGI, and after that was a principal engineer at Nvidia for 7 years until 2004, in charge of all the initial GPU, shader and GPU computing efforts. Fred was chief scientist at SGI in its golden days of funky coloured superworkstations and a principal engineer at MIPS, later developing the Pathscale compiler suite that provided the AMD Opteron with its first 64-bit X86 support. He is the chief architect of the open-source Open64 compiler suite.
So, interestingly, here you have two CPU designers actually driving the initial CPU design and instruction set - in the past, that role often fell to memory companies (Intel was a DRAM maker when designing its first CPUs) or system companies like IBM, DEC or HP.
The UPU (Unified Processor Unit) approach in their 'Harmony' architecture is the first case where CPU and GPU threads share the same execution units, register file and many instructions. In a sense, it is a 'total fusion' of the two, unlike the AMD Fusion APU approach, where CPU and GPU are still distinct, with separate instruction sets, registers, execution units and such. You could consider the UPU an example of 'homogeneous computing', just like any standard processor we are used to, while the APU belongs to 'heterogeneous computing', where threads of different natures run separately on the CPU and GPU portions.
A very simple, elegant 32-bit RISC core, not unlike the original MIPS, does both functions, and a single register file of 32 registers, each 32 bits wide, is there for all operations. To support further parallelism, 4-way multithreading per core is supported, with optimised logic to remove the need for 4 separate register files. The compromises in the initial version? No SIMD vector stuff like Intel AVX, and no double-precision FP either. If you want more performance, you use more cores, which can be piled up together easily thanks to the comparatively very small core footprint - only 2.7 square mm in the old 65 nm process. In a 32 nm process, it would likely be only 1 square mm. This means that an average current 200 square mm CPU chip in a 32 nm process could hold over 100 of these cores, plus interconnect logic and a huge multimegabyte shared cache.
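The core-count arithmetic in that paragraph can be checked with a short sketch. Everything beyond the 2.7 square mm, 65 nm, 32 nm, 1 square mm and 200 square mm figures is an assumption made for illustration: the function names are hypothetical, the shrink is modelled as ideal (area scaling with the square of the feature size, which real processes rarely achieve), and the 50% share of the die given to cores is a guess, not a number from the article.

```python
def scaled_area_mm2(area_mm2, from_nm, to_nm):
    """Ideal shrink (assumption): area scales with the square of the feature size."""
    return area_mm2 * (to_nm / from_nm) ** 2


def cores_that_fit(die_mm2, core_mm2, core_fraction):
    """Whole cores that fit in the share of the die reserved for cores."""
    return int(die_mm2 * core_fraction // core_mm2)


# Figures from the article: 2.7 mm^2 per core at 65 nm, 200 mm^2 die at 32 nm.
ideal_core = scaled_area_mm2(2.7, 65, 32)  # about 0.65 mm^2 under ideal scaling
article_core = 1.0                         # the article's more cautious estimate

# Assumption: half the die goes to cores, the rest to cache and interconnect.
print(round(ideal_core, 2))
print(cores_that_fit(200, article_core, 0.5))
print(cores_that_fit(200, ideal_core, 0.5))
```

With the article's 1 square mm estimate, 100 cores occupy half of a 200 square mm die, which leaves the other half for the shared cache and interconnect. Under ideal scaling the same area would hold roughly half as many cores again.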
The result is a very small, compact dual-core chip even in the initial IC1 iteration, which they plan to scale to a quad-core IC2 next year, using a 40 nm or better process. Simon Moy told us that, since China needs to catch up fast, we are likely to see its CPU vendors jumping two process generations in one go to reach higher performance. That also applies to the high-end MIPS-based Loongson processors there.
The first chip specs are not bad, considering this is the very first iteration of a brand new instruction set:
While this first iteration is aimed squarely at inexpensive smartphones and tablets that do not need a Full HD display or encoding, the potential of the architecture is there. A small, very low power core with a clean, elegant instruction set, and the ability to put thousands of cores in a single rack, can cover both the client and server sides of a cloud: many-user web page serving doesn't require 64-bitness or fast cores with large memory, just many cores, so that each of many threads has its own resource without context switching overhead.
What they could obviously add in some future iteration is 64-bitness with SIMD support, as it would ultimately help both integer and GPU performance and address the high-end market. Also, a single channel of DDR2-533 is enough for a low-end phone, but multichannel DDR3 support will be important for market expansion to higher-end devices like servers, where the new instruction set is not a problem, as open-source software can usually be recompiled quickly.
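To put the memory comment in numbers, here is a rough sketch of theoretical peak bandwidth. The article gives only "single channel of DDR2-533"; the bus widths and the DDR3-1333 dual-channel configuration below are assumptions chosen for comparison, and the function name is hypothetical.

```python
def peak_bandwidth_mb_s(transfers_mt_s, bus_bits, channels=1):
    """Theoretical peak: transfers per second x bytes per transfer x channels."""
    return transfers_mt_s * (bus_bits // 8) * channels


# DDR2-533 performs 533 million transfers per second.
# Assumption: a 32-bit bus, as is common on low-end mobile parts.
print(peak_bandwidth_mb_s(533, 32))       # single channel, 32-bit

# Assumption: DDR3-1333 on two 64-bit channels, a typical server-class setup.
print(peak_bandwidth_mb_s(1333, 64, 2))   # dual channel, 64-bit
```

Under these assumptions the gap is about a factor of ten, which is why multichannel DDR3 matters once many cores have to share the memory interface.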
Once the Android port is fully tuned over the next few months, we'll have a look at the device's performance under a real OS, and its real potential. ICube has a tough job on its hands convincing partners that a brand new instruction set should be supported, but what they have created here can ultimately apply across many market segments, from smartphone to superserver. Let's see how they do in their initial chosen market segments first.