Let's look at two articles that recently hit the fan:
The first is a regurgitation of a presentation made by Intel to developers. In short, they're considering a move to processors with tens, if not hundreds, of discrete computational units on a single die. The second article is a cry for Intel to stop, or at least slow down, on the argument that we "have to get the architecture right first."
OK, so the architecture can slow us down. But what slows us down even more is the fact that most programming languages have no support for concurrency, and therefore most programmers think "concurrency" means "threads and spinlocks". The two most interesting explorations we've made in the last few months are the Multiterpreter and the Cell Transterpreter.
If you check out the full Transterpreter project from our Subversion repository, you'll see a multiterpreter branch among the branches. This exploration was a quick proof-of-concept that Christian and I poked at for a few days. While the threading is terribly inefficient, it demonstrated that we can spawn multiple OS threads on demand from the Transterpreter runtime, and distribute computation across those threads without any modification to the source program. Put simply, if you write:
    PAR
      process1()
      process2()
      process3()
      process4()
on a quad-processor machine, then all four processors will be utilized by Transterpreter instances. We haven't publicized this branch because the approach was the simplest to implement but also the least efficient one possible. Our explorations into native code generation are, in part, an investigation of which parts of the code base would need to be refactored to allow efficient multithreading on big machines.
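To make concrete what a PAR block asks of a runtime, here is a minimal sketch in Go rather than occam-pi, chosen only because Go can be run anywhere and its goroutines descend from the same CSP ideas. It is not how the multiterpreter branch works internally. The names `par` and `squares` are hypothetical, and the squaring workers are stand-ins for `process1()` through `process4()`. The Go runtime decides how goroutines map onto OS threads, so the program itself never mentions threads.

```go
package main

import (
	"fmt"
	"sync"
)

// par runs every process concurrently and returns only when all of them
// have finished, mirroring how a PAR block terminates.
func par(processes ...func()) {
	var wg sync.WaitGroup
	for _, p := range processes {
		wg.Add(1)
		go func(run func()) {
			defer wg.Done()
			run()
		}(p)
	}
	wg.Wait()
}

// squares is a hypothetical workload: n independent processes, each
// writing to its own slot, so no locking is required.
func squares(n int) []int {
	results := make([]int, n)
	procs := make([]func(), n)
	for i := range procs {
		i := i // per-iteration copy (required before Go 1.22)
		procs[i] = func() { results[i] = (i + 1) * (i + 1) }
	}
	par(procs...)
	return results
}

func main() {
	fmt.Println(squares(4)) // prints [1 4 9 16]
}
```

The point of the sketch is that the caller only states which pieces of work are independent. Whether they land on one core or four is left to the runtime, which is the property the multiterpreter branch demonstrated for occam-pi programs.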
The other branch that might be of interest is the Cell branch. The work being carried out here is described fully in the paper A Cell Transterpreter. Damian Dimmich has successfully run nine Transterpreters on a single Cell Broadband Engine. Although we only have the Cell simulator, he has demonstrated that you can have multiple Transterpreters running in parallel, all over the device, while preserving CSP channel semantics across the cores. Put another way, his proof-of-concept demonstrates that we can write occam-pi programs that work seamlessly across multiple, heterogeneous cores.
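For readers unfamiliar with CSP channel semantics, the following is a rough sketch, again in Go as a stand-in for occam-pi, and in no way a picture of the Cell implementation. An unbuffered channel makes each send wait until the matching receive happens, and that rendezvous is what has to be preserved whichever core each process runs on. The names `producer`, `doubler`, and `pipeline` are hypothetical, and closing a channel is a Go convenience with no direct occam-pi counterpart.

```go
package main

import "fmt"

// producer sends 0..n-1 on out. Each send blocks until the receiver
// is ready, because the channel is unbuffered.
func producer(n int, out chan<- int) {
	for i := 0; i < n; i++ {
		out <- i
	}
	close(out)
}

// doubler reads values from in and passes twice each value to out.
func doubler(in <-chan int, out chan<- int) {
	for v := range in {
		out <- 2 * v
	}
	close(out)
}

// pipeline wires producer -> doubler -> collector using unbuffered
// channels and returns the values in the order they arrive.
func pipeline(n int) []int {
	a := make(chan int) // unbuffered: synchronous rendezvous
	b := make(chan int)
	go producer(n, a)
	go doubler(a, b)

	var got []int
	for v := range b {
		got = append(got, v)
	}
	return got
}

func main() {
	fmt.Println(pipeline(5)) // prints [0 2 4 6 8]
}
```

Nothing in `producer` or `doubler` says where the other end of the channel lives. If the runtime keeps the rendezvous intact, the same processes behave identically on one core or spread across several.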
As we look to unify various compiler explorations within the group, we expect to target these kinds of platforms more directly. If your language makes it easy to express ideas in parallel, it should be easy for the compiler to automatically distribute work units to many different processors. In the case of occam-pi, this is absolutely the case, so if we have a processor with 8, 80, or 800 cores, we should be able to take advantage of that power without significant effort.
Certainly, we can take advantage of this kind of hardware far more easily than a programmer writing in a sequential language like C, C++, C#, or Java.