Sunday, April 1, 2012

CPU Parallel Computing via parfor

To have a second baseline to compare against, we modified the code to execute the BPM tests and fit checks in parallel via MATLAB's parfor. parfor divides the loop iterations among a pool of workers opened via matlabpool (8 for me, since my Dell XPS 17's CPU exposes 8 logical cores through hyperthreading). The result was roughly a 2.4x speedup (77 s vs. 186 s). Comparison charts and output below. Next up, experiments with GPU acceleration!
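The conversion itself is tiny. Here's a minimal sketch of what it looks like (the variable and function names below are made up for illustration; the real loops live in DancingMonkeys.m):

    matlabpool open 8                       % start a pool of 8 workers (pre-R2013b syntax)

    numCandidates = length(bpmCandidates);  % hypothetical list of BPM candidates to test
    scores = zeros(1, numCandidates);
    parfor i = 1:numCandidates
        % Each candidate is evaluated independently, so iterations can run
        % on separate workers without sharing state.
        scores(i) = testBpmCandidate(bpmCandidates(i), peaks);
    end

    matlabpool close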

Complete output of running DancingMonkeys_parfor.m with "smooooch・∀・"
Timing comparisons between the original base program and the modified program using parfor
Update: Re-ran the tests with different matlabpool worker counts (graph below). It turns out that (for my computer at least) a worker count of 6 yielded the shortest total time (1 worker = 1 CPU core = 1 dedicated MATLAB thread). For timeTest, 4 workers gave the best result. This is probably the point where the multi-threading overhead catches up: going from 1 to 2 workers and from 2 to 4 workers halved the run time each time, but after 4, the times increased linearly. For timeFit, however, 8 workers gave the best result. The timeFit pattern isn't a perfect fit, but it does decrease roughly linearly. The deviating behaviour can be explained by the fact that not every iteration in the fit-checking loop has the same runtime: some iterations end prematurely, while those that don't undergo various size-dependent operations, including sorting. Without more than 8 cores, though, it is hard to tell the exact timeFit trend and at what point the returns are balanced out by the threading overhead.

Timing comparisons with different CPU core counts
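The sweep itself was just a loop over pool sizes. A rough sketch, with runBpmTests and runFitChecks standing in for the actual timed sections of DancingMonkeys_parfor.m:

    workerCounts = [1 2 4 6 8];
    times = zeros(length(workerCounts), 2);
    for k = 1:length(workerCounts)
        matlabpool('open', workerCounts(k));        % function form of matlabpool
        tic; runBpmTests();  times(k, 1) = toc;     % timeTest
        tic; runFitChecks(); times(k, 2) = toc;     % timeFit
        matlabpool close
    end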






4 comments:

  1. Cool. Out of curiosity, do you know if your application performs better with or without hyperthreading?

    ReplyDelete
    Replies
    1. The Dell XPS 17 BIOS is locked down somehow and doesn't allow disabling hyperthreading :/ Will have to test on a different machine.

      Delete
  2. Why only 2.4x improvement? Is there locking, a load imbalance, or memory contention?

    ReplyDelete
    Replies
    1. Re-ran the test with different CPU core counts (results are in the updated post). Turns out that past 4 cores, the threading overhead catches up. From 1 to 2 cores and from 2 to 4 cores, though, performance doubled each time (for timeTest at least; timeFit is not so straightforward), so I guess one could say you can reach a 4x improvement before other factors (e.g. the threading overhead) start kicking in.

      Delete