Not too long ago, we made a few interesting changes to the Transterpreter.
The first was that we updated the Multiterpreter. Donkeys-months ago (like donkeys-years, but not quite as long) we modified the interpreter to map occam-pi processes to operating system processes. This meant that we could run code in true parallel on multiple-CPU or multi-core machines. Unfortunately, the cost of process switching was so high that we couldn't show this to anyone on the planet for fear of being shamed.
About two months ago, Adam and I sat down (I rode copilot) and rewrote that code to use POSIX threads instead of OS processes. In fact, we didn't even do anything savvy, like use a thread-pool; instead, we just create operating system threads when they're needed, and reclaim them when they're done. Not the most efficient way to parallelize the virtual machine, but we like to do things in the simplest way possible first. In this case, it had the smallest impact on the codebase, so we did it that way.
On my laptop ("Lyra", a dual-core 2.0GHz Intel Core Duo MacBook), I ran some code that varies three things: the number of parallel processes (abyg in the code below), the number of communication rounds each process performs (pleng), and the amount of work done between communications (granu).
In a multi-threaded or multi-core system, this will stress our run-time and tell us at what workload our parallel virtual machine is more efficient than running a single-threaded runtime on a single core. The core of the code comes from Kevin Vella's doctoral thesis from the University of Kent (not available online).
PAR i = 0 FOR abyg
  CHAN INT chan:
  SEQ j = 0 FOR pleng
    PAR
      SEQ
        SEQ k = 0 FOR granu
          array[(granu * i) + k] := array[(granu * i) + k] + i
        chan ! i
      INT t:
      chan ? t
So, what did we find?
(Note that in each of these graphs I've varied both the y- and x-axes; this is bad reporting practice, and something I will correct when the tech report is put together---for now, these are the graphs I have, and I'm using them as-is.)
To be clear, the different glyphs represent the maximum number of POSIX threads that were allowed to be running at the same time to support a single virtual machine. On the y-axis is the time in seconds needed to compute the replicated PAR shown above (averaged over two runs), and the x-axis is the granularity of the work packets---larger granularity means more work gets done before a context switch is allowed to take place.
On my MacBook (above), we had to have each thread do a significant amount of work (process a 65K element array in a non-trivial manner) with relatively small amounts of communication (long work periods) before we saw a speedup due to parallelism. My suspicion is that OS X is such a hungry operating system that it and other application processes were constantly asking for attention, and therefore the Transterpreter threads rarely had the opportunity to run for a significant amount of time.
So, to belabor this just a bit, we can see that with one thread of control (a single-threaded interpreter), we compute packets of granularity 1 approximately 20x faster than if we run a 2-threaded interpreter. On a two-core machine, this doesn't necessarily make sense... unless we consider the possibility that the two POSIX threads are spending all of their time contending for an opportunity to run, in which case it makes sense that we're not seeing any benefit of having a parallel runtime.
As the granularity approaches 50 (which we could convert into a specific number of VM instruction cycles, if we wanted to), we see that the single-threaded and dual-threaded interpreters run at almost the same speed. This suggests that, with enough work in each packet, the threads are no longer fighting each other---and the rest of the operating system---for a chance to execute.
On Hadar, a SunFire 880 with four 750MHz UltraSparc III processors and 8GB of RAM, I found the results were... a bit odd:
Ignoring a granularity of one for a moment (which is effectively measuring context switch time of the POSIX implementation on a given machine), we can see that a two- and four-threaded interpreter are always slower than a single-threaded interpreter. However, if we increase the number of threads the Transterpreter is allowed to spawn to 8 or 16, the Transterpreter is always faster than a single-threaded Transterpreter, even for small workloads.
I have not fully investigated this yet; I believe this is because Solaris handles pthreads differently than other operating systems. In particular, it has a notion of virtual CPUs, and I think (but am not sure) that with too few threads, it assigns them all to a single virtual CPU, which is assigned to a single physical CPU. Therefore, even though we have four cores (most of which are idle), there is massive contention on a single CPU. When the OS sees more threads in play, it actually farms them out to multiple virtual (and physical) CPUs.
I have to read more on this (and have some excellent resources from Sun that I picked up while at SIGCSE), and do not doubt that this situation can be improved. Our initial testing was only over the course of a few days, and I need to investigate this further before turning the work into a tech report. (From exploration, to blog post, to tech report, to publication, I suppose.) In other words, I'm sure that Solaris isn't completely borked---instead, I assume I have to do some more work to make sure it does what I want, instead of whatever the current default is.
The last system we did some tests on was Ninja, an Intel SR1200 server with two PIII 1.4GHz processors and 2GB of RAM running Debian GNU/Linux with an SMP kernel (rev. 2.6.17-2-686).
Ninja produced the curves I expected from the other two machines, actually. We see that a single-threaded Transterpreter is faster than a multi-threaded interpreter for very small work packets. However, with a granularity of 5, we see that a single-threaded, dual-threaded, and quad-threaded interpreter all run at roughly the same speed. From that point forward, it is better to have multiple threads in the interpreter than a single thread, as more work gets done in unit time. And, I believe that this easily represents "real world" work packet sizes.
What I don't quite understand is why four threads is better than two for very small granularities. Like the other platforms, there is a good deal more investigation to be done before I understand the implementation of POSIX threads in a given operating system (OSX/BSD vs. Solaris vs. Debian GNU/Linux 2.6) and how that affects the execution of a parallel Transterpreter.
The Transterpreter runs on the Texas Instruments MSP430 (an 8MHz, 10KB RAM embedded processor), the LEGO Mindstorms (a 16MHz embedded system with 32KB of RAM for both the interpreter and code), as well as my MacBook, Christian's G4 Powerbook, our Linux dual-processor server, quad-processor Suns, and most likely any machine with a C compiler. The code I ran to test our "POSIXterpreter" does not exercise any features of the language that would not run on all of these platforms---put another way, the bytecode for my tests could be executed on the MSP430 without modification; we might blow the RAM if we make the replicated PAR too big, but that's about it. That's how portable the code is across Transterpreter instances right now.
Tyan recently announced their 40-processor "desktop supercomputer." This is a set of five, dual quad-core (8-processor) blades in a box on wheels. So, you get five computers in a box, each with eight processors, and a max of 60GB of RAM (12GB per blade). And the price? $20,000.
I haven't written it up yet, but the Transterpreter also does OpenMPI. That means that we can actually spread computation across machines in a cluster as well as across SMP cores. This is way alpha code, but I've demonstrated to the group that we can split things up on the Darwin H4 supercomputing cluster (a 2.4GHz Dell and a 3.2GHz Dell under adjacent desks on a 100 Mb network). Given a bit of time, this could become a first-class part of the Transterpreter release, and we'd have an excellent environment for controlling parallelism in a heterogeneous cluster environment.
Or, as the case may be, an excellent way of exploiting the resources of a 40-processor supercomputer-in-a-box. With five POSIXterpreters running large thread pools, adding some intelligent (batch) scheduling, and a more fully-featured set of MPI bindings, we'd have a seriously smart setup for doing parallel computing. All the pieces are there, but we don't have the financial justification to dedicate the time to the development. So, we continue stealing a little bit of time here and there to see if we can demonstrate that this is a really powerful way to orchestrate the use of high-perf libraries (like BLAST, LAPACK, etc.) in a robust, safe, and semantically clear way in a truly parallel environment.