NVIDIA demos CUDA
Posted: 2007-02-16 04:13pm
Beyond3D, who else?
A rather more in-depth look.
What's CUDA?
NVIDIA CUDA stands for "Compute Unified Device Architecture" and represents their new software and hardware solution for what they call "GPU Computing," or general-purpose computation on a GPU. It is based on the G80 architecture, and will be compatible with all future G8x derivatives and even NVIDIA's next-generation architectures.
GPGPU solutions are primarily aimed at the scientific, professional and financial markets, although they might eventually have an impact on the PC gaming market and other consumer applications too. GPGPU can fundamentally be defined as using a GPU for anything other than real-time image rendering, and it could, financially speaking, become a very large market.
From the software side of things, CUDA is exposed through extensions to the ANSI C language, with only a minimal set of deviations from the standard (no recursion, no denormal support, etc.). On the hardware side of things, it simply exposes NVIDIA's DX10 hardware with two key features beyond what the graphics APIs offer: on-chip shared memory (the "parallel data cache") and inter-thread synchronization. Both help improve efficiency beyond that of current GPGPU solutions. More on that in a second.
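To make the "extended C" point concrete, here is a minimal sketch of what a CUDA program looks like. This is our own illustrative example rather than code from the article or NVIDIA's documentation (the kernel name vecAdd and the launch parameters are made up): __global__ marks a function that runs on the GPU, the <<<blocks, threads>>> syntax launches it, and built-in variables like blockIdx and threadIdx tell each thread which element it owns.

```cuda
// Illustrative example only -- not code from the article or NVIDIA's docs.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// __global__ marks a kernel: a C function that executes on the GPU,
// once per thread, and is launched from the host.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Built-in index variables give each thread a unique element to process.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Plain C on the host side.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // CUDA gives direct read/write access to video memory, rather than
    // forcing data through textures and render targets.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f (expected %f)\n", h_c[100], 300.0f);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The whole thing is compiled as one source file by NVIDIA's nvcc compiler, which splits host and device code for you; that is essentially all the deviation from ANSI C a simple program needs.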
It should also be said that NVIDIA CUDA and AMD CTM ("Close To the Metal") are both very similar, but also very different. The former is directly aimed at developers, and exposes a C-like programming language. On the other hand, AMD CTM corresponds to the assembly language interface for low-level access to the ATI R5xx product family. As such, CUDA and CTM are not directly comparable without an appropriate backend for CTM. And, sadly, we are not aware of any proper and fully-featured public backend for CTM at this point in time.
Why CUDA?
There are three big disadvantages to using OpenGL and Direct3D for GPGPU development. The first, and most obvious one, is that these APIs are made with rendering in mind, not GPGPU programming. As such, they might be less efficient and a lot less straightforward to use for such workloads. Secondly, new drivers might introduce bugs which could significantly affect general-purpose programs, even more so than rendering. And finally, no modern rendering API exposes direct and arbitrary read/write access to video memory.
Both CUDA and CTM fix all of these issues, and very nicely at that. Furthermore, they can achieve very high efficiency on fundamentally "stream-like" workloads, which is what SIMD processors have traditionally been good at. They can also achieve much higher performance than CPUs could ever dream of for such massively parallel computations.
Generally speaking, CPUs are good at single-threaded workloads, although they have recently been adding a small level of thread parallelism with HyperThreading and multi-core chips. GPUs, on the other hand, are inherently massively parallel: there are literally thousands of threads in flight on the GPU at any given time. That's why it's also very unlikely your GPU would be very good at word processing - there's not much parallelism to extract there.
NVIDIA aims to further differentiate and distance itself from CTM, however, by going above and beyond what traditional SIMD machines can do, potentially gaining greater efficiency for certain workloads.
CUDA Hardware Model
CUDA introduces "shared memory" (also known as the "parallel data cache"), a pool of memory shared between the threads running on each multiprocessor (ALU cluster; every processor is basically an ALU with little control logic of its own).
You can think of shared memory as something very much like a miniature version of the local store on the Cell architecture: a manually managed cache which lets a clever programmer significantly reduce the amount of memory bandwidth his or her program needs.
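As a rough illustration of that manually managed cache (our own sketch; the kernel name smooth and the BLOCK/RADIUS constants are invented for the example), a block of threads can stage its slice of the input into __shared__ memory once, then compute a 3-point average by reading neighbours from the fast on-chip pool instead of issuing extra global-memory loads:

```cuda
// Illustrative sketch of shared memory as a manually managed cache.
#define BLOCK  256   // launch with 256 threads per block
#define RADIUS 1     // one halo element on each side

__global__ void smooth(const float *in, float *out, int n)
{
    // Per-multiprocessor "parallel data cache": the block's slice plus halos.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    // Each global value is fetched from video memory once per block...
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + BLOCK;
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the barrier discussed in the next paragraph

    // ...and is then reused by three neighbouring threads from shared memory.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Without shared memory, each thread would issue three separate reads straight to video memory; with it, the block reads each value roughly once, which is exactly the kind of bandwidth saving described above.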
Furthermore, CUDA introduces inter-thread synchronization: any thread can synchronize with the other threads running on the same multiprocessor, which allows a number of algorithms to run on the GPU with much nicer performance characteristics, in theory. And in practice too, as far as we can see. Overall, the hardware model is very easy to program for, and quite efficient too.
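One concrete example of the kind of algorithm this barrier enables (again our own sketch, with an invented blockSum kernel) is a block-wide tree reduction, where __syncthreads() guarantees that every partial sum has been written before the next step reads it:

```cuda
// Illustrative sketch of inter-thread synchronization on one multiprocessor.
#define BLOCK 256  // launch with 256 threads per block (a power of two)

__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK];

    // Each thread loads one element into shared memory.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    // The barrier ensures all writes from one step are visible in the next.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One result per block; a second kernel (or the CPU) combines them.
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = partial[0];
}
```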
The paradigm is also fundamentally different from that of current CPUs, which are single-threaded and memory latency-intolerant. Thus, they hide latency with large caches. GPUs, on the other hand, are massively multithreaded and memory latency-tolerant; they simply hide latency by switching to another execution thread. This is a fundamental difference, and some researchers and engineers are already predicting that it is fundamentally impossible to create a single architecture that is well suited to both kinds of workloads.
Conclusion
In the future, an increasingly high proportion of the GPU's transistors will be dedicated to arithmetic processing, rather than texturing or rasterization. As such, there is tremendous potential for them to improve GPGPU performance significantly faster than Moore's Law in the next few years.
Furthermore, new features such as double precision processing (at quarter-speed or below, but still with very impressive raw performance) will extend the reach of the GPGPU market further inside the realm of scientific supercomputing. Needless to say, the future of GPU Computing is very exciting, and extremely promising.
While many applications and workloads are NOT suited to CUDA or CTM, because they are inherently not massively parallel, the number of potential applications for the technology remains incredibly high. Much of that market has also historically been part of the server CPU market, which has very high margins and is very lucrative. So, unsurprisingly, NVIDIA is quite excited, financially speaking, to enter that market. Hopefully the goodies will benefit PC consumers just as much as scientists and researchers in the longer term, however.