Compiler on crack(less diet)

So, I decided to finally make my blazing-fast "hierarchical loopy belief propagation on a grid" code extra-fast. An obvious improvement is to speed up the computation of the messages going in the four directions.
I chose to replace 4 of the following little snippets (and others):

float minh= h[0];
for (int i= 1; i< LABELS; ++i)
    minh= min(minh, h[i]);

with the corresponding SSE intrinsics version, such as this one:

__m128 minh= _mm_load_ps(h);
for (int i= 1; i< LABELS; ++i) {
    __m128 t= _mm_load_ps(h+ i* 4);
    minh= _mm_min_ps(minh, t);
}

Instant 4x speedup! Not. The code now runs at only 60% of its previous speed. "WTF?" - you say, and I agree.

It would help to know that in the above, LABELS is a small integer (3 or 4) that is a template argument to the function that contains the above code.
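For concreteness, here is a minimal sketch of what the two enclosing functions might look like. The names min_scalar and min_sse are made up for illustration, and the data layout is an assumption: the SSE version treats h as LABELS groups of 4 floats (one group per label, 4 independent problems side by side), starting at a 16-byte-aligned address as _mm_load_ps requires. Note that it returns 4 minimums, one per lane, rather than a single float.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_min_ps

// Scalar reference: smallest of LABELS consecutive floats.
template <int LABELS>
float min_scalar(const float* h) {
    float minh = h[0];
    for (int i = 1; i < LABELS; ++i)
        minh = h[i] < minh ? h[i] : minh;
    return minh;
}

// SSE version (hypothetical wrapper around the loop shown above).
// Assumed layout: h points at LABELS groups of 4 floats, 16-byte aligned.
// Lane k of the result is the minimum over all labels for lane k.
template <int LABELS>
__m128 min_sse(const float* h) {
    __m128 minh = _mm_load_ps(h);
    for (int i = 1; i < LABELS; ++i) {
        __m128 t = _mm_load_ps(h + i * 4);
        minh = _mm_min_ps(minh, t);
    }
    return minh;
}
```

With LABELS = 3 the loop body runs exactly twice, which is the case discussed next.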

Apparently, VC 7.1 (aka .Net 2003) decided not to unroll the second version of the loop. Great! A loop with a constant number of iterations (2) now neatly envelops the grand total of 2 SSE instructions that comprise its body. Ugh! Manual unrolling produces the much more acceptable 2x speedup.
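Since LABELS is a template argument, one way to do the unrolling by hand without writing a separate copy for every label count is to recurse on the template parameter. This is only a sketch: MinUnrolled is a made-up name, it is not necessarily how the unrolling was done here, and it assumes h holds groups of 4 floats at a 16-byte-aligned address. There is no loop counter left in the source, but whether a particular compiler inlines the whole chain is something to confirm in the disassembly.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_min_ps

// Hypothetical helper: lane-wise minimum over groups 0..N-1 of 4 floats.
// Each instantiation handles the last group and defers the rest to N-1,
// so the operations happen in the same order as in the loop version.
template <int N>
struct MinUnrolled {
    static __m128 apply(const float* h) {
        return _mm_min_ps(MinUnrolled<N - 1>::apply(h),
                          _mm_load_ps(h + (N - 1) * 4));
    }
};

// Base case: a single group is its own minimum.
template <>
struct MinUnrolled<1> {
    static __m128 apply(const float* h) { return _mm_load_ps(h); }
};
```

Inside the templated function the call would read __m128 minh= MinUnrolled<LABELS>::apply(h);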

You'd think there would be a way to really, really insist that the compiler unroll your loop, but there is no #pragma for it. And this just as I had gotten used to not over-optimizing stuff that modern compilers can handle.

Anyhow, with SSE intrinsics the code is now much faster for the typical case of 3 or 4 labels. I suspect that VC 7.1 does not schedule the code very well - it seems to maintain the (arbitrary) order of the intrinsics in my code. The next step is to rearrange some of the independent shifts, loads and stores to see if I can get better throughput. Unfortunately, inline assembly does not support SSE, so I have no lazy solution available.

At 30 full iterations (60 checkerboard pattern iterations, t= 60), the code does 3.7 frames per second on a 3-label problem on a 320x240 grid. More than 3 full iterations is overkill, so in practice the performance gets as high as 8 frames per second. 80% of the time is now spent in the data cost evaluations (9 evaluations of a 3-variate Gaussian per pixel).