Sunday, April 22, 2012

Jacket Install and Experiments

What is Jacket?
http://www.accelereyes.com/jacket_tour?idx=0

A problem with install:

When installing the 64-bit Jacket on a 64-bit OS with a 32-bit MATLAB student version (the MATLAB student version does not offer a 64-bit build for Windows), Jacket will not run correctly. MATLAB calls Jacket's mexw32 files, which are built against the 32-bit OS's NVIDIA drivers, so an error occurs and the program fails.

How to solve that?

According to Jacket's wiki, we can use the files in "<Jacket_Root>/engine/bin" to overwrite the default dll/mexw32 files so that they work correctly with a 32-bit MATLAB on a 64-bit OS.
However, in the current version of Jacket, the "bin" folder is missing after the install.

Where is it?

I traced Jacket's install wizard step by step and found something interesting.
During installation, the wizard actually puts folders called "bin" and "bin64" under the "engine" folder.
But they are deleted at the end of the install for an unknown reason.
So, I stopped the install wizard and rescued the "bin" folder from its destiny of doom. Then I used everything in it to overwrite everything under the "engine" folder.
Finally, it seems to work.
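For reference, the overwrite step can be done from MATLAB itself with copyfile. The sketch below assumes a typical install path for jacketRoot; adjust it to your own Jacket location.

% Copy the rescued 32-bit binaries from engine/bin over the defaults in engine.
jacketRoot = 'C:\Program Files\AccelerEyes\Jacket';   % assumed path; adjust to your install
srcDir = fullfile(jacketRoot, 'engine', 'bin');
dstDir = fullfile(jacketRoot, 'engine');
copyfile(fullfile(srcDir, '*'), dstDir, 'f');         % 'f' forces the overwrite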

Experiments

Jacket provides a powerful "gfor" construct that we can use much like MATLAB's "parfor".
However, not all code is happy with gfor.
We still need to handle memory copies between the GPU and the CPU.

There are two ways to handle those memory copies (see the sketch after this list):
1. Initialize data directly on the GPU by using "gzeros", "gdouble", "gint32", etc. instead of MATLAB's default data types.
2. Use "LOCAL" in the "gfor" to give each kernel its own copy of some CPU data. However, performance seems to be terrible if we send a large array to each kernel as "local" data.

Some other problems:
Logical indexing cannot be used inside gfor, so we modified that code to use a multiplication mask instead (see the sketch below).
Subscripting into CPU variables fails the program because there is no direct conversion from gfor's GPU data to CPU data; we are using "LOCAL" for that as well for now.
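For reference, the mask rewrite looks roughly like the snippet below; the variable names are hypothetical, and the point is only the general pattern of replacing a logical-indexed assignment with an element-wise multiply by a 0/1 mask.

x = rand(1000, 1);
thresh = 0.5;

% Logical indexing (not allowed inside gfor):
%   x(x > thresh) = 0;

% Mask version: zero out the same entries with an element-wise multiply,
% which maps cleanly onto gfor/GPU arrays.
mask = (x <= thresh);   % 1 where the value is kept, 0 where it should be zeroed
x = x .* mask;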

After handling all of that, we successfully converted the first loop into a "gfor" loop.

Result
However, the timing is slower than "parfor".
I guess it is because the "LOCAL" data sent to each kernel is too large.

Sunday, April 1, 2012

Midpoint Presentation Slides

PPTX:
http://beatsportable.com/static/monkeys/2012-04-02%20CIS%20565%20Project%20Midpoint%20Presentation.pptx

PDF:
http://beatsportable.com/static/monkeys/2012-04-02%20CIS%20565%20Project%20Midpoint%20Presentation.pdf

CPU Parallel Computing via parfor

To have a second baseline to compare against, we modified the code to execute the BPM tests and fit checks in parallel via MATLAB's parfor. This effectively divides the loop among a pool of workers defined by matlabpool (8 for me, since my Dell XPS 17 has 8 hyperthreaded logical CPU cores). The result was roughly a 2.4x speedup (77s vs 186s). Comparison charts and output below. Next up, experiments with GPU acceleration!
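For context, the parfor change boils down to the pattern below; the loop body, numTests, and runSingleTest are placeholders rather than the actual Dancing Monkeys code.

matlabpool open 8            % start 8 workers (2012-era syntax; match it to your core count)

numTests = 100;              % placeholder for the number of BPM tests / fit checks
results = zeros(1, numTests);
parfor i = 1:numTests
    % Each iteration runs on whichever worker is free; iterations must be
    % independent of one another for parfor to be valid.
    results(i) = runSingleTest(i);   % hypothetical per-iteration function
end

matlabpool close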

Complete output of running DancingMonkeys_parfor.m with "smooooch・∀・"
Timing comparisons between the original base program and the modified program using parfor
Update: Re-ran the tests with different matlabpool Worker counts (graph below). It turns out that (for my computer at least) a Worker count of 6 yielded the shortest total time (1 Worker = 1 CPU core = 1 dedicated MATLAB thread). For timeTest, 4 threads gave the best result. This is probably due to the multi-threading overhead catching up after 4 threads; going from 1 to 2 threads and from 2 to 4 threads halved the run time, but after 4 the times increased linearly. For timeFit, however, 8 threads gave the best result. The timeFit pattern isn't a perfect fit, but it does decrease roughly linearly. The deviating behaviour can be explained by the fact that not every iteration of the fit-checking loop has the same runtime - some iterations end prematurely, while those that don't undergo various size-dependent operations, including sorting. Without more than 8 cores, however, it is hard to tell the exact timeFit trend and at what point the returns are balanced by the threading overhead.

Timing comparisons with different CPU core counts
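The worker-count sweep above was done along these lines; the script is a rough sketch using the 2012-era matlabpool syntax, with runBenchmark() standing in as a hypothetical wrapper around the actual parfor test run.

workerCounts = [1 2 4 6 8];
totalTimes = zeros(size(workerCounts));

for w = 1:numel(workerCounts)
    matlabpool('open', workerCounts(w));   % start a pool of the given size
    tic;
    runBenchmark();                        % hypothetical wrapper around the parfor run
    totalTimes(w) = toc;
    matlabpool('close');
end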