Message Passing Interface
I've been mucking around in the world of distributed programming recently. In my defense, I was investigating this topic BEFORE Scott got his hands on a stack of rack mount computers from Intel. But I will admit the sight of all those idle CPUs put the angst into my code-writing fingers. So I've gone ahead and actually written a distributed application.
A little background on distributed computing. First off, we need the obligatory "Linux is awesome." There is a kernel module for Linux called MOSIX that will transparently move CPU/Memory bound applications away from the machine they were launched on, and return those processes as soon as they need to make a system call. That's awesome.
Next I need to introduce the term "grain-size". In a parallel application, there are usually sections that cannot be parallelized in between the easily parallel parts. For example, in a genetic algorithm, a population of candidate solutions is held in a single list. That one list needs to be modified by exactly one process for things like member mutation and crossovers. This activity happens once per "generation" and is reasonably quick to do. When that is over, each member needs to be evaluated for fitness before the next generation can be processed. This evaluation happens one member at a time, and there is no interaction between the members.
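To make the shape of the problem concrete, here's a minimal sketch of that loop in Python. The fitness function and mutation scheme are made up for illustration; the point is which part touches the one shared list and which part is independent per member.

```python
import random

def evaluate(member):
    # Stand-in fitness function (hypothetical): just sum the genes.
    return sum(member)

def next_generation(population):
    # Serial section: exactly one process modifies the single shared list
    # (selection, crossover, mutation). Quick, but can't be parallelized.
    population.sort(key=evaluate, reverse=True)
    survivors = population[: len(population) // 2]
    children = [[g + random.uniform(-0.1, 0.1) for g in parent]  # mutation
                for parent in survivors]
    return survivors + children

population = [[random.random() for _ in range(4)] for _ in range(8)]
for generation in range(3):
    # Embarrassingly parallel section: each member's fitness is independent
    # of every other member's, so this map is the part worth distributing.
    fitnesses = [evaluate(m) for m in population]
    population = next_generation(population)

print(len(population))  # population size stays constant at 8
```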
In a single-process application, the time the fitness evaluation will take is computed by multiplying the time it takes to evaluate one member by the number of members. If we had enough processors, however, we could theoretically reduce this to the time it takes to compute just one evaluation. This length of time is the grain-size of the application. The smaller the grain, the quicker (theoretically) the parallel job can be finished. The limit to this notion comes from the speed of the network. The overhead in distributed computing is network communication. If the grain size is too small, all the processes spend more time waiting for the network to move bits around instead of crunching numbers.
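The arithmetic is simple enough to spell out. Assuming a hypothetical 2-second evaluation and 100 members:

```python
t_eval = 2.0      # seconds to evaluate one member (assumed for illustration)
members = 100
workers = 100     # processors available

# Single-process time: one evaluation times the number of members.
serial_time = t_eval * members

# Ideal parallel time: with enough processors, the whole evaluation phase
# shrinks to the cost of a single evaluation -- the grain-size.
ideal_parallel_time = serial_time / min(workers, members)

print(serial_time, ideal_parallel_time)  # 200.0 2.0
```

Network overhead eats into that ideal figure; the smaller `t_eval` gets, the larger the share of each grain that is spent on communication rather than computation.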
The MOSIX approach doesn't perform well when the grain size of a parallel application is smaller than about 10 to 20 seconds. This is because the process starts on a local machine, and there is an algorithm waiting for your process to qualify as "CPU bound." Then, while it's in the middle of crunching numbers, the process is paused, wrapped up, and shipped to a remote computer where it can get its work done in peace. Then, when the process needs a system resource, it's paused again and shipped home to do its work. A programmer is likely to see worse performance using MOSIX with too small a grain size than if he had just written a single-process application.
I'd like to see an improvement in performance down to 1- to 2-second grains. Enter stage left: MPI, the Message Passing Interface. After running 'emerge openmpi' I had the libraries and programs necessary to compile and run an MPI-capable application. I love Portage. After tooling around with "hello world" MPI style, I knew I had the right tool for the job.
[Scott Notes: Open MPI is included as a stock package in many distributions these days including your editor's favorites, those based on Red Hat.]
MPI is commonly considered a "low-level" distributed computing library. It really only gives basic communication tools. It's up to the developer to make those tools work in an application. My goal was to use any configuration of networked computers to improve the performance of a single application, such as a genetic algorithm that is evolving neural networks.
Enter stage right: the thread pool. A thread pool is a tool used on multiprocessor systems to leverage multiple cores on a set of jobs. It's a common parallel programming tool. I just needed to make it work with MPI, and I would have a distributed computing framework.
A thread pool works by using a single queue of jobs to feed a static set of threads. A static thread is one that sticks around. Starting and killing threads to do work is slow, because the operating system has some work to do to make such things happen. If a programmer can just keep the same threads active and keep them well fed with new jobs, the application will run faster than spawning new threads every time a job needs to be done.
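Here's a minimal thread pool along those lines, sketched in Python with the standard library. The squaring "job" is a placeholder; the structure is what matters: static threads that stick around, fed from a single shared queue, shut down with sentinel values.

```python
import queue
import threading

NUM_THREADS = 4
jobs = queue.Queue()     # single queue of jobs feeding all threads
results = queue.Queue()

def worker():
    # Each static thread loops forever, pulling jobs off the shared queue.
    # No thread is spawned or killed per job -- that's the whole point.
    while True:
        job = jobs.get()
        if job is None:            # sentinel: time to shut down
            jobs.task_done()
            break
        results.put(job * job)     # stand-in for real work
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for n in range(10):
    jobs.put(n)
for _ in threads:
    jobs.put(None)                 # one sentinel per thread

jobs.join()
for t in threads:
    t.join()

total = sum(results.get() for _ in range(10))
print(total)  # 0 + 1 + 4 + ... + 81 = 285
```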
Hopefully the design of my application is starting to become apparent. The program itself runs in parallel on all the MPI machines that are configured. This is managed entirely by the MPI support programs. Inside the program, the index number (the "rank") of the program instance is known, and the zero index is the master node. The master node runs all non-parallel code and uses a thread pool to control how jobs are delivered to the other copies of the program.
Each thread in the thread pool takes the job data and sends it to a particular remote process. One thread -> one remote process. The thread on the master node doesn't do any actual work, it just waits for the remote process to finish and send back the results. It's a lot like a remote procedure call (RPC), only faster. Lots faster.
That's it. There are still some bugs to work out, and my gosh if my code isn't ugly, then I'd hate to see what ugly code looks like.
Ta-ta for now,