September 29, 2009

Trivial Destructors Grief

Recently I had a really strange issue with VS 2008 - a method marked __forceinline was not being inlined in Release mode. Messing with #pragma inline_depth and #pragma inline_recursion had no effect whatsoever. This was a big performance issue since a call chain 5 levels deep was not being inlined, and this was in a tight inner loop - VTune showed that the call overhead was larger than the amount of work in the methods. The code was trivial, and the deep expansion is the unfortunate result of a specific recursive template.

All of the un-inlined calls look like:

T operator()(const T &in) {
    T result(in);
    // do something trivial to result
    return result;
}

The problem turned out to be the fact that T has the following declaration inside:

struct T {
    ~T() { /* empty */ }
    // ... other members ...
};

The compiler does not figure out that the destructor is useless and so avoids doing the NRV optimization in each of the five calls - it tries to ensure that any side effects of calling the destructor (none in this case) are preserved - compilers are conservative this way. This in itself is not enough to prevent __forceinline from working, but for some reason, ~T() has a different linkage than the methods that call it, which prevents inlining.

I do not feel I fully understand what happened there (the code has many __m128 members, and some posts on Microsoft Connect seem to indicate issues with inlining such code due to the requirement for a 16-byte-aligned stack). However, removing the trivial destructor did fix the issue - the methods are all inlined (even without __forceinline), and the NRV and copy elision optimizations both happen in the right spots. The code ran 2x faster as a result.
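A way to catch a stray empty destructor before it costs anything is to check triviality at compile time. This is only a minimal sketch and does not reproduce the VS 2008 inlining behavior: it assumes a C++11 compiler (the type traits used here are not available in VS 2008), and WithEmptyDtor, WithoutDtor and Scale2 are hypothetical stand-ins for the real types.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for T as it was: the destructor body is empty,
// but because it is user-provided the type is not trivially destructible.
struct WithEmptyDtor {
    float v[4];
    ~WithEmptyDtor() { /* empty */ }
};

// Same data, no destructor declared: the implicit destructor is trivial.
struct WithoutDtor {
    float v[4];
};

// Compile-time guards: the build fails if someone adds a destructor
// to WithoutDtor later on.
static_assert(std::is_trivially_destructible<WithoutDtor>::value,
              "WithoutDtor should stay trivially destructible");
static_assert(!std::is_trivially_destructible<WithEmptyDtor>::value,
              "a user-provided empty destructor is still non-trivial");

// Functor with the same shape as the operator() shown above:
// copy the input, do something trivial to it, return it by value.
template <typename T>
struct Scale2 {
    T operator()(const T &in) const {
        T result(in);
        for (int i = 0; i < 4; ++i)
            result.v[i] *= 2.0f;
        return result;
    }
};
```

Whether a given compiler inlines either variant still has to be checked in the generated code; the assertions only guarantee that the destructor stays trivial.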

June 29, 2008

CMake problems on Vista

So, I was trying to build vxl 1.10 and kept running into problems with CMake 2.6.

I thought it was related to their VS 2008 build generator, but it ended up being something stupid - CMake uses its Program Files\Cmake directory as temporary space to build small test programs and test the selected build option.

On Vista, programs don't have write access to the Program Files\ subtree unless run as administrator. So, check the "Run as Administrator" option and CMake should run. The problem has nothing to do with VS 2008 and everything to do with the fact that nobody really uses Vista that much, so a lot of Windows software does not run quite right on it.

Setting up VisualStudio 2008 as a MATLAB MEX compiler

This was recently useful:

It allows mex -setup to find your VS 9.0 installation.

MATLAB 2007b and Java heap woes

I had the issue where MATLAB 2007b would only start once after a cold reboot, and then refuse to start (the splash screen would pop up and immediately close). I finally sat on my posterior and traced it with process monitor; the very last step was writing to a java.log file in my AppData/temp; the contents suggested that the Java VM could not allocate the requested amount of contiguous heap space. This gave me enough keywords to find the following:

I reproduce the relevant part of the thread (which cites a Mathcentral tech support message):

This error is currently under investigation. The current
workaround is to set an environment variable that will
bypass the error. Please try the following to start MATLAB:

1) Go to the Start menu to Control Panel.

2) Double click on "System" and go to the Advanced tab (for
Vista go to "System and Maintenance" to "System" to
"Advanced System Settings").

3) Click the Environment Variables button.

4) Click "New" under either option (System or User
variables) to create a new variable. The variable name will
be MATLAB_RESERVE_LO. The variable value will be the number
0 (zero). Press OK to save the changes.


Setting MATLAB_RESERVE_LO=0 tells MATLAB to bypass the
functionality introduced in R2007b (for Windows only) that
tries to reserve the largest available contiguous space for
MATLAB arrays. This process guarantees that at least 256MB
is left available for use by Java for the Heap and PermGen
spaces. It appears that Windows is either loading a DLL or
doing a malloc somewhere in this 256MB space, so that when
Java tries to reserve the Heap and PermGen address space, it
fails, since both the Heap and PermGen spaces must be
contiguous. If MATLAB_RESERVE_LO is set to 0, the reserve
is for a fixed amount of space, not the largest available space.

May 19, 2008

AwesomeBar considered harmful

The "Awesome Bar" in Firefox 3 is so bad that I had to post about it. Let me count the ways it blows:

1. The search is against web page titles, not just URLs. The bar only shows the URLs of pages you visit, so the principle of least surprise demands that you only autocomplete text entered in the bar with URLs. What is worse, there is no way to disable that behavior.

2. The search is not prefix-only. Whatever you type into the bar will be matched against any part of a URL or web page title. While word boundaries are matched preferentially, this is still bad, because it doesn't seem to work or put enough preference on the prefixes.

3. Each result is displayed on two lines, with different size fonts for the two lines. This is just bad typography and design. It is also an unnecessary waste of space.

4. The underlining of the match positions. This just blatantly points out all of the previous deficiencies, with bells and whistles. Yes, please draw my attention to the fact that you matched page titles!

Don't get me wrong - I have nothing against "adaptive learning algorithms". In fact, I am doing a Ph.D. thesis on a related topic. But every time I "train" the Firefox URL bar by correcting its mistakes, it trains ME not to use the bar.

September 5, 2006

Blast from the past

I had forgotten about the tons of fun I had as a member of the leet demo group Trilogy :). These two links reminded me about it: Cubic Releases (see and weep about the days of 4K software-rendered, specular-shaded rotating tori). A venerable collection of demoscene-related stuff can be found at the Hornet Archive, which is finally back online as of yesterday. Lastly, if you've had your fun writing truecolor demos using Prometheus TrueColor (PTC) or its sibling TinyPTC, check out Pixel Toaster.

My 4K intro for The North Pole BBS is sadly lost forever, but I remember having lots of fun trying to cram a 10-pattern, 2-instrument FM-music track into it along with the requisite OPL-3 driver and the snazzy 256-color graphics. All of it was compressed with my very own LZSS-based executable compressor with a 128-byte decompression routine.

Sadly, today I have much more knowledge about how to make cool effects, compress data and so on, but don't have the time to do anything as fun as writing a 4K demo and spending nights in a row trying to replace the sine tables with minimal second-order approximation routines that were more compact...

August 19, 2006

Good, Fast, Cheap

Apparently the saying "Good, Fast, Cheap - pick any two" is wrong. I got my Chinese visa today (Saturday), when I was told that it would be mailed out Monday, so this works out to three days earlier than expected.

So far, good and fast. When I opened the envelope, I also found a wad of cash in there, a refund. For some unexplained reason, the visa fee of $30 was waived, so the whole thing cost me under $20, including express mail delivery. Cheap.

While I'm heaping praise on governmental organizations, USPS did pretty well with its express mail delivery - the mailman missed me, so it looked like I would not get my mail today - the post office closes at 3pm on Saturdays. However, the shift manager told me to stop by around 6pm, when all mailmen are back from their routes. He met me outside the closed post office and gave me my mail. Nice.

August 17, 2006

Stuff you need for a Chinese visa

I'm about to go to a conference in Beijing this October, and it turns out I need a business (F) visa to do so. The process is very simple - a one-page application form, and a letter of invitation is all it takes. Except, of course, that you cannot mail in your application, but must have somebody deliver the documents in person.

Having returned from a trip to Chicago, where the closest Chinese consulate is, all I can say is that you should not trust the consulate website to have all the information you need. While cash is indeed one way to pay the visa fee, it only works if you come to get the visa yourself, in person. To get it by mail, you can only pay by cashier's check or money order (no credit cards or personal checks), add $5 on top of the visa fee, and bring a self-addressed envelope (pre-paid or otherwise). This tiny glitch prolonged the process by only about 20 minutes, as the consulate staff had provided good directions to the closest USPS office (a mere two blocks away).

All in all, this was by far the easiest visa application process I have been through, despite the little omission on the consular site. Having spent months in pursuit of the all-elusive Schengen visas before, and having been through the US visa process several times before, I had braced for long waiting lines and weird questions.

August 9, 2006

Time synchronization with Windows XP

In the last post I wrote that all I needed was to make a bunch of machines sync their time with one specific Windows XP box. That turned out to be easier than a million blog and forum postings led me to believe. Just fire up regedit on your newly appointed time server and navigate to:


Change the AnnounceFlags value to 5 (from 10) - this tells the Windows Time service to consider the machine as "always a time server" and "always a reliable time source". Now, you should be able to synchronize with that machine as follows (run on a client machine):

w32tm /config /syncfromflags:MANUAL

w32tm /config /update

w32tm /resync

The magic ",0x8" after the IP of the timeserver indicates that the request should be made as a client. This seems to avoid the annoying "peer stratum is less than host stratum" message that confuses a lot of people (by default Windows tries to be "smart" about it, so it refuses to sync to a workgroup computer as if it were a server).

August 8, 2006

Rdesktop good

Well, the good folks at my CS dept have thankfully installed rdesktop on the Linux machines on campus, so a simple rdesktop -a 16 mylaptop worked :) (the -a 16 should be specified for Windows XP machines, otherwise everything looks ugly and works slower). The source of my happiness is here:

Windows has a built-in NTP server

This was painless. I was looking for a free NTP server for Windows, and it turns out there is a built-in time server on XP and Win2k.

How to enable a local Windows NTP server by modifying the registry. The site appears to be useful, despite the copious amount of ads on it: Windows Registry Guide

I feel like I have discovered hot water...

(Edited at 8:20 pm on the same day)
Apparently the above is not a reliable guide. It appears to be accurate for Win2k, but the story for Windows XP is not exactly the same. Using Windows XP Professional with Service Pack 2 in a Managed Environment: Controlling Communication with the Internet seems to be the reliable guide for Windows XP. I will report progress with this later - the goal is to have every machine (mix of WinXP and Win2k embedded) on a domain-less Windows network synchronize its clock with one specific (WinXP) machine on the same network. It is OK for the time to be incorrect, it just has to be relatively consistent.

August 4, 2006

.NET Remoting and callbacks

Just a brief note about something that caused me a lot of grief recently: .NET remoting, especially with events. The first problem I had was that callbacks were not being sent, because the client application had not registered a listening channel. The MSDN docs have a note about that (so you know it's a problem), but don't go out of their way to point to the solution:

using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;

// Port 0 lets the system pick any free port for the callback channel
IChannel chan= new TcpChannel(0);
// The channel only starts listening once it is registered
ChannelServices.RegisterChannel(chan);

As it turns out, this is all you need for things to work with .Net 1.0. My problems did not disappear, however, because of security exceptions. This is the place where I wish whoever wrote the MSDN docs about code access security a very, very short life. I have never seen so many ways to avoid answering the first and foremost question that a programmer can have (i.e. "How do I configure the code to have the necessary permissions?") in a single piece of 'technical' writing. Anyhow, since I couldn't care less about security in this particular instance (physically private and secure network), and since I am OK with all types being sent across using remoting, I did the following:

using System.Collections;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Runtime.Serialization.Formatters;

Hashtable props= new Hashtable();
// High priority (>100) to make sure that this channel gets used to receive the callbacks
props["priority"]= "200";
// Port 0 lets the system pick any free port
props["port"]= "0";
BinaryServerFormatterSinkProvider servProvider= new BinaryServerFormatterSinkProvider();
// The default is Low - so only certain types can be remoted, set it to Full to avoid the whole issue
servProvider.TypeFilterLevel= TypeFilterLevel.Full;
IChannel chan= new TcpChannel(props, null, servProvider);
// As before, the channel has to be registered to start listening
ChannelServices.RegisterChannel(chan);

Aaaargh! The solution is so easy, but you have to dig it out of the documentation character by character. In the end, it was a random blog post somewhere that answered my question (sorry, no link - I already lost it. That's why this is here.) Finally, a friendly reminder - make sure that the callback handlers are public (or accessible to whoever fires them).

Compiler on crack(less diet)

So, I decided to finally make my blazing-fast "hierarchical loopy belief propagation on a grid" code extra-fast. An obvious improvement is to speed up the computation of messages going to the four directions.
I chose to replace 4 of the following little snippets (and others):

float minh= h[0];
for (int i= 1; i< LABELS; ++i)
    minh= min(minh, h[i]);

with the corresponding SSE intrinsics version, such as this one:

__m128 minh= _mm_load_ps(h);
for (int i= 1; i< LABELS; ++i) {
    __m128 t= _mm_load_ps(h+ i* 4);
    minh= _mm_min_ps(minh, t);
}

Instant 4x speedup! Not. The code is now running at only 60% of its previous speed. "WTF?" - you say, and I agree.

It would help to know that in the above, LABELS is a small integer (3 or 4) that is a template argument to the function that contains the above code.

Apparently, VC 7.1 (aka .Net 2003) decided not to unroll the second version of the loop. Great! A loop with a constant number of iterations (2) now neatly envelops the grand total of 2 SSE instructions that comprise its body. Ugh! Manual unrolling produces the much more acceptable 2x speedup.

You'd think there is a way to really, really insist that the compiler unrolls your loop, but no, there is no #pragma for it. And that just as I had gotten used to not over-optimizing stuff that modern compilers can handle.
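Since LABELS is a template argument anyway, one pragma-free way to get the unrolled form is to recurse on it at compile time instead of looping. This is a minimal sketch, not the code from my project: PackedMin and packed_min are hypothetical names, and h is assumed to point at LABELS groups of four floats, 16-byte aligned.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_min_ps, _mm_store_ps

// Element-wise minimum over N packed groups of 4 floats.
// The recursion is resolved at compile time, so there is no loop
// left for the compiler to decide whether to unroll.
template <int N>
struct PackedMin {
    static inline __m128 apply(const float *h) {
        // Minimum of the first N-1 groups, then fold in group N-1.
        return _mm_min_ps(PackedMin<N - 1>::apply(h),
                          _mm_load_ps(h + (N - 1) * 4));
    }
};

// Base case: a single group is its own minimum.
template <>
struct PackedMin<1> {
    static inline __m128 apply(const float *h) {
        return _mm_load_ps(h);
    }
};

// Convenience wrapper: h and out must both be 16-byte aligned;
// h holds LABELS * 4 floats, out receives 4 floats.
template <int LABELS>
inline void packed_min(const float *h, float *out) {
    _mm_store_ps(out, PackedMin<LABELS>::apply(h));
}
```

For LABELS = 3 this expands to one load followed by two load/min pairs, which is the same thing the manual unrolling spells out by hand. Whether the compiler actually inlines every level still has to be checked in the disassembly.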

Anyhow, the code is now much faster for the typical case of 3 or 4 labels being assigned when SSE intrinsics are used. I suspect that VC 7.1 does not schedule the code very well - it seems to maintain the (arbitrary) order of intrinsics in my code. Next step is to rearrange some of the independent shifts, loads and stores to see if I can get better throughput. Unfortunately, inline assembly does not support SSE, so I have no lazy solution available. At 30 full iterations (60 checkerboard pattern iterations, t= 60), the code does 3.7 frames per second on a 3-label problem on a 320x240 grid. More than 3 full iterations are overkill, so the performance gets to be as high as 8 frames per second. 80% of the time is now spent in the data cost evaluations (9 evaluations of a 3-variate Gaussian per pixel).

First Post !!!

So, I have joined the revolution. Yes, I have a blog. Let's see if I have the stamina to keep it up :) I will try to use it to record my daily frustrations, discoveries and so on. Oh, and comments will be disabled for now. If you really want me to hear what you think about my posts - feel free to figure out how to contact me and use old-fashioned e-mail.