To have a second baseline to compare against, we modified the code to execute the BPM tests and fit checks in parallel via MATLAB's
parfor. This effectively divides the loop among a pool of workers defined by
matlabpool (8 for me, since my Dell XPS 17 has 8 hyperthreaded CPU cores). The result was roughly a 2.4x speedup (77 s vs. 186 s). Comparison charts and output below. Next up, experiments with GPU acceleration!
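For readers unfamiliar with parfor, here is a minimal sketch of the kind of change involved. The function and variable names are hypothetical, not the actual DancingMonkeys code; the point is that each fit check is independent, so the iterations can be farmed out to the worker pool.

```matlab
% Hypothetical sketch (illustrative names, not the real DancingMonkeys code).
% Open a pool of workers, then let parfor split the independent
% fit checks among them.
matlabpool('open', 8);        % newer MATLAB versions use parpool(8) instead

nFits = numel(candidateBPMs); % one independent fit check per candidate BPM
err = zeros(1, nFits);
parfor i = 1:nFits
    % Each iteration touches only its own slice of err, so MATLAB is
    % free to schedule the iterations on any worker in any order.
    err(i) = checkFit(candidateBPMs(i), beatData);
end

matlabpool('close');
```

Note that parfor only helps when the loop iterations are truly independent; any shared state written inside the loop would force it back to serial semantics or fail MATLAB's parfor analysis.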
Complete output of running DancingMonkeys_parfor.m with "smooooch・∀・"

Timing comparisons between the original base program and the modified program using parfor
Update: Re-ran the tests with different matlabpool Worker counts
(graph below). It turns out that, for my computer at least, a Worker
count of 6 yielded the shortest total time (1 Worker = 1 CPU core = 1 dedicated MATLAB thread). For timeTest, 4 threads gave
the best result. This is probably the multi-threading overhead
catching up past 4 threads: going from 1 to 2 threads and from 2 to 4
threads halved the run time, but after 4, the times increased linearly.
For timeFit, however, 8 threads gave the best result. The timeFit pattern isn't always a perfect fit, but it does decrease roughly linearly. The deviating behaviour can be explained by the fact that not every iteration of the fit-checking loop has the same runtime: some iterations end prematurely, while those that don't undergo various size-dependent operations, including sorting. Without more than 8 cores, however, it is hard to tell the exact timeFit trend and at what point the returns are balanced by the threading overhead.
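The worker-count sweep above can be sketched as a simple benchmarking loop. This is a hedged illustration: runDancingMonkeysTests is a placeholder name for the BPM-test/fit-check workload, not an actual function in the codebase.

```matlab
% Hypothetical sketch of the worker-count sweep (illustrative names).
% Re-run the same workload with differently sized pools and record
% the total elapsed time for each.
workerCounts = [1 2 4 6 8];
elapsed = zeros(size(workerCounts));
for k = 1:numel(workerCounts)
    matlabpool('open', workerCounts(k));   % parpool(n) on newer versions
    tic;
    runDancingMonkeysTests();              % placeholder for the BPM/fit runs
    elapsed(k) = toc;
    matlabpool('close');
end
plot(workerCounts, elapsed, '-o');
xlabel('Workers'); ylabel('Total time (s)');
```

Opening and closing the pool inside the loop keeps each measurement independent, at the cost of paying the pool startup time each run; since startup is excluded from the tic/toc window, it doesn't skew the timings.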
Timing comparisons with different CPU core counts
Cool. Out of curiosity, do you know if your application performs better with or without hyperthreading?
The Dell XPS 17 BIOS is locked down somehow and doesn't allow for disabling hyperthreading ;/ Will have to test on a different machine.
Why only 2.4x improvement? Is there locking, a load imbalance, or memory contention?
Re-ran the test with different CPU core counts (results are in the updated post). Turns out that past 4 cores, the threading overhead catches up. From 1 to 2 cores and from 2 to 4 cores, though, the performance doubles each time (for timeTest at least; timeFit is not so straightforward), so I guess one could say you can reach a 4x improvement before other factors (e.g. threading overhead) start kicking in.