September 29, 2009

Trivial Destructors Grief

Recently I had a really strange issue with VS 2008 - a method marked __forceinline was not being inlined in Release mode. Messing with #pragma inline_depth and #pragma inline_recursion had no effect whatsoever. This was a big performance issue since a call chain 5 levels deep was not being inlined, and this was in a tight inner loop - VTune showed that the call overhead was larger than the amount of work in the methods. The code was trivial, and the deep expansion is the unfortunate result of a specific recursive template.

All of the un-inlined calls look like:

T operator()(const T &in) {
    T result(in);
    // do something trivial to result
    return result;
}

The problem turned out to be the fact that T has the following declaration inside:

struct T {
    ~T() { /* empty */ }
};

The compiler does not figure out that the destructor is useless and so avoids doing the NRV optimization in each of the five calls - it tries to ensure that any side effects of calling the destructor (none in this case) are preserved - compilers are conservative this way. This in itself is not enough to prevent __forceinline from working, but for some reason, ~T() has a different linkage than the methods that call it, which prevents inlining.

I do not feel I fully understand what happened there (the code has many __m128 members, and some posts on Microsoft Connect seem to indicate issues with inlining such code due to the requirement for a 16-byte aligned stack). However, removing the trivial destructor did fix the issue - the methods are all inlined (even without __forceinline), and both NRV and copy elision happen in the right spots. The code ran 2x faster as a result.
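For what it's worth, the language draws the same line the compiler did: a destructor with an empty body still counts as user-provided, so the type is not "trivially destructible" in the standard's sense, however trivial it looks. A minimal sketch of that distinction, assuming a C++11 compiler (newer than VS 2008); WithEmptyDtor, WithoutDtor, bump and destructor_demo are hypothetical stand-ins, not the real T or its operator():

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for T as declared in the post.
struct WithEmptyDtor {
    float v[4];
    ~WithEmptyDtor() { /* empty, but still user-provided */ }
};

// The same type with the destructor removed.
struct WithoutDtor {
    float v[4];
};

// An empty user-provided destructor is enough to lose trivial destructibility.
static_assert(!std::is_trivially_destructible<WithEmptyDtor>::value,
              "user-provided destructor, even if empty");
static_assert(std::is_trivially_destructible<WithoutDtor>::value,
              "implicit destructor is trivial");

// Same shape as the un-inlined operator() above: copy, tweak, return by value.
template <typename T>
T bump(const T& in) {
    T result(in);
    result.v[0] += 1.0f;  // "do something trivial to result"
    return result;
}

// Small self-check; call it from main() to run the sketch.
inline void destructor_demo() {
    WithoutDtor a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    assert(bump(a).v[0] == 2.0f);
}
```

This only shows how the two types are classified. Whether a given compiler inlines bump() or applies NRV for either type still has to be checked in the generated assembly.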

June 29, 2008

CMake problems on Vista

So, I was trying to build vxl 1.10 and kept running into problems with CMake 2.6.

I thought it was related to their VS 2008 build generator, but it ended up being something stupid - CMake uses its Program Files\Cmake directory as temporary space to build small test programs and test the selected build option.

On Vista, programs don't have write access to the Program Files\ subtree unless run as administrator. So, check the "Run as administrator" option and CMake should run. The problem has nothing to do with VS 2008 and everything to do with the fact that nobody really uses Vista that much, so a lot of Windows software does not run quite right on it.

Setting up Visual Studio 2008 as a MATLAB MEX compiler

This was recently useful:

http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=18508

It allows mex -setup to find your VS 9.0 installation.

September 5, 2006

Blast from the past

I had forgotten about the tons of fun I had as a member of the leet demo group Trilogy :). These two links reminded me about it: Cubic Releases (see digg.com and weep about the days of 4K software-rendered, specular-shaded rotating tori). A venerable collection of demoscene-related stuff can be found at the Hornet Archive, which is finally back online as of yesterday. Lastly, if you've had your fun writing truecolor demos using Prometheus TrueColor (PTC) or its sibling TinyPTC, check out Pixel Toaster.

My 4K intro for The North Pole BBS is sadly lost forever, but I remember having lots of fun trying to cram a 10-pattern, 2-instrument FM-music track into it along with the requisite OPL-3 driver and the snazzy 256-color graphics. All of it was compressed with my very own LZSS-based executable compressor with a 128-byte decompression routine.

Sadly, today I have much more knowledge about how to make cool effects, compress data and so on, but don't have the time to do anything as fun as writing a 4K demo and spending nights in a row trying to replace the sine tables with minimal second-order approximation routines that were more compact...

August 9, 2006

Time synchronization with Windows XP

In the last post I wrote that all I needed was to make a bunch of machines sync their time with one specific Windows XP box. That turned out to be easier than a million blog and forum postings led me to believe. Just fire up regedit on your newly appointed time server and navigate to:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config

Change the AnnounceFlags value to 5 (from 10) - this tells the Windows Time service to consider the machine as "always a time server" and "always a reliable time source". Now, you should be able to synchronize with that machine as follows (run on a client machine):

w32tm /config /syncfromflags:MANUAL /manualpeerlist:URL_OR_IP_OF_SERVER,0x8

w32tm /config /update

w32tm /resync

The magic ",0x8" after the IP of the timeserver indicates that the request should be made as a client. This seems to avoid the annoying "peer stratum is less than host stratum" message that confuses a lot of people (by default Windows tries to be "smart" about it, so it refuses to sync to a workgroup computer as if it's a server).
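The jump from 10 to 5 in AnnounceFlags looks arbitrary until you read the value as a bitmask. A small sketch of the decoding, assuming the bit meanings are as I recall them from Microsoft's W32Time registry documentation (0x1 always time server, 0x2 automatic time server, 0x4 always reliable, 0x8 automatic reliable); the enum and describe_announce_flags are hypothetical names used only for illustration, so double-check the values against the official docs before relying on them:

```cpp
#include <cassert>
#include <string>

// Assumed bit layout of the W32Time AnnounceFlags registry value.
enum AnnounceFlag : unsigned {
    kAlwaysTimeServer    = 0x1,
    kAutomaticTimeServer = 0x2,
    kAlwaysReliable      = 0x4,
    kAutomaticReliable   = 0x8
};

// Turns an AnnounceFlags value into a readable list of the bits that are set.
std::string describe_announce_flags(unsigned flags) {
    static const struct { unsigned bit; const char* name; } kNames[] = {
        { kAlwaysTimeServer,    "always time server" },
        { kAutomaticTimeServer, "automatic time server" },
        { kAlwaysReliable,      "always reliable time server" },
        { kAutomaticReliable,   "automatic reliable time server" }
    };
    std::string out;
    for (const auto& entry : kNames) {
        if (flags & entry.bit) {
            if (!out.empty()) out += ", ";
            out += entry.name;
        }
    }
    return out.empty() ? std::string("not a time server") : out;
}

// Small self-check; call it from main() to run the sketch.
inline void announce_flags_demo() {
    // Default: 10 = 0x2 | 0x8, both "automatic" bits.
    assert(describe_announce_flags(10) ==
           "automatic time server, automatic reliable time server");
    // After the edit: 5 = 0x1 | 0x4, both "always" bits.
    assert(describe_announce_flags(5) ==
           "always time server, always reliable time server");
}
```

Read this way, the registry edit simply swaps the two "automatic" bits for the two "always" bits.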

August 8, 2006

Rdesktop good

Well, the good folks at my CS dept have thankfully installed rdesktop on the Linux machines on campus, so a simple rdesktop -a 16 mylaptop worked :) (the -a 16 should be specified for Windows XP machines, otherwise everything looks ugly and works slower). The source of my happiness is here: http://www.rdesktop.org/.

Windows has a built-in NTP server

This was painless. I was looking for a free NTP server for Windows, and it turns out there is a built-in time server on XP and Win2k.

How to enable a local Windows NTP server by modifying the registry. The site appears to be useful, despite the copious amount of ads on it: Windows Registry Guide.

I feel like I have discovered hot water...

(Edited at 8:20 pm on the same day)
Apparently the above is not a reliable guide. It appears to be accurate for Win2k, but the story for Windows XP is not exactly the same. "Using Windows XP Professional with Service Pack 2 in a Managed Environment: Controlling Communication with the Internet" seems to be the reliable guide for Windows XP. I will report progress with this later - the goal is to have every machine (a mix of WinXP and Win2k embedded) on a domain-less Windows network synchronize its clock with one specific (WinXP) machine on the same network. It is OK for the time to be incorrect, it just has to be relatively consistent.

August 4, 2006

.NET Remoting and callbacks

Just a brief note about something that caused me a lot of grief recently: .NET remoting, especially with events. The first problem I had was that callbacks were not being sent, because the client application had not registered a listening channel. The MSDN docs have a note about that (so you know it's a problem), but don't go out of their way to point to the solution:

using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;

// Port 0 lets the runtime pick any free port for the callback channel
IChannel chan= new TcpChannel(0);
ChannelServices.RegisterChannel(chan);

As it turns out, this is all you need for things to work with .NET 1.0. My problems did not disappear, however, because of security exceptions. This is the place where I wish whoever wrote the MSDN docs about code access security a very, very short life. I have never seen so many ways to avoid answering the first and foremost question that a programmer can have (i.e. "How do I configure the code to have the necessary permissions?") in a single piece of 'technical' writing. Anyhow, since I couldn't care less about security in this particular instance (physically private and secure network), and since I am OK with all types being sent across using remoting, I did the following:

using System.Collections;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Runtime.Serialization.Formatters;

Hashtable props= new Hashtable();
// High priority >100 to make sure that this channel gets used to receive the callbacks
props["priority"]= "200";
// Port 0 lets the runtime pick any free port
props["port"]= "0";
BinaryServerFormatterSinkProvider servProvider= new BinaryServerFormatterSinkProvider();
// The default is Low - so only certain types can be remoted, set it to Full to avoid the whole issue
servProvider.TypeFilterLevel= TypeFilterLevel.Full;
IChannel chan= new TcpChannel(props, null, servProvider);
ChannelServices.RegisterChannel(chan);

Aaaargh! The solution is so easy, but you have to dig it out of the documentation character by character. In the end, it was a random blog post somewhere that answered my question (sorry, no link - I already lost it. That's why this is here.) Finally, a friendly reminder - make sure that the callback handlers are public (or accessible to whoever fires them).

Compiler on crack(less diet)

So, I decided to finally make my blazing-fast "hierarchical loopy belief propagation on a grid" code extra-fast. An obvious improvement is to speed up the computation of messages going in the four directions.
I chose to replace 4 of the following little snippets (and others):

float minh= h[0];
for (int i= 1; i< LABELS; ++i)
    minh= min(minh, h[i]);

with the corresponding SSE intrinsics version, such as this one:

__m128 minh= _mm_load_ps(h);
for (int i= 1; i< LABELS; ++i) {
    __m128 t= _mm_load_ps(h+ i* 4);
    minh= _mm_min_ps(minh, t);
}

Instant 4x speedup! Not. The code is now running at only 60% of the original speed. "WTF?" - you say, and I agree.

It would help to know that in the above, LABELS is a small integer (3 or 4) that is a template argument to the function that contains the above code.

Apparently, VC 7.1 (aka .NET 2003) decided not to unroll the second version of the loop. Great! A loop with a constant number of iterations (2) now neatly envelops the grand total of 2 SSE instructions that comprise its body. Ugh! Manual unrolling produces the much more acceptable 2x speedup.

You'd think there is a way to really, really insist that the compiler unrolls your loop, but there is no #pragma for it. And this just as I had gotten used to not over-optimizing stuff that modern compilers can handle.
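Since LABELS is already a template argument, one way to get the unrolling without any #pragma is to let the template machinery expand the min chain at compile time, so there is no loop left for the compiler to ignore. A minimal sketch of that idea, not the code from this post: LaneMin and lane_min_demo are hypothetical names, h is assumed to hold LABELS groups of 4 floats on a 16-byte boundary, and alignas needs a C++11 compiler (newer than VC 7.1, where __declspec(align(16)) would be the equivalent):

```cpp
#include <cassert>
#include <xmmintrin.h>

// Lane-wise minimum over LABELS packed groups of 4 floats.
// The recursion is resolved at compile time, one _mm_min_ps per extra label.
template <int LABELS>
struct LaneMin {
    static __m128 apply(const float* h) {
        __m128 rest = LaneMin<LABELS - 1>::apply(h);
        return _mm_min_ps(rest, _mm_load_ps(h + (LABELS - 1) * 4));
    }
};

// Base case: a single label is just a load.
template <>
struct LaneMin<1> {
    static __m128 apply(const float* h) { return _mm_load_ps(h); }
};

// Small self-check; call it from main() to run the sketch.
inline void lane_min_demo() {
    // 3 labels x 4 lanes; _mm_load_ps requires 16-byte alignment.
    alignas(16) float h[3 * 4] = { 5, 1, 7, 3,
                                   2, 4, 6, 8,
                                   9, 0, 6, 3 };
    alignas(16) float out[4];
    _mm_store_ps(out, LaneMin<3>::apply(h));
    assert(out[0] == 2 && out[1] == 0 && out[2] == 6 && out[3] == 3);
}
```

Whether the nested apply() calls actually get inlined is still up to the compiler, so it is worth checking the generated assembly before trusting the speedup.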

Anyhow, the code is now much faster for the typical case of 3 or 4 labels being assigned when SSE intrinsics are used. I suspect that VC 7.1 does not schedule the code very well - it seems to maintain the (arbitrary) order of intrinsics in my code. The next step is to rearrange some of the independent shifts, loads and stores to see if I can get better throughput. Unfortunately, inline assembly does not support SSE, so I have no lazy solution available. At 30 full iterations (60 checkerboard pattern iterations, t= 60), the code does 3.7 frames per second on a 3-label problem on a 320x240 grid. More than 3 full iterations are overkill, so the performance gets to be as high as 8 frames per second. 80% of the time is now spent in the data cost evaluations (9 evaluations of a 3-variate Gaussian per pixel).