Sorting interview question revisited

| 1 Comment
I had a lot of thought provoking answers to my question posed a few weeks ago.
Basically the problem posed was you have data in the format of  /^[a-zA-Z]+[0-9]+$/ and you want to sort it. The problem is that you want abc2 to be before abc10 because 2<10

By far the most beautiful solution was moritz's solution:
use Sort::Naturally qw(nsort);
nsort @list;
Today I had to rewrite a similar sort, so I decided to revisit this again. Today I was sorting timestamps in the following format:"MM/DD/YY H:MM [AM|PM]" Which is very similar to the above problem once you break the constituent part away.

Here I will show my sort function for the simpler data above, since it becomes trivial to add the date formatting once . The sort function is only two lines, which I think is really clever, and even readable :

@sorted = sort {
"$a|$b" =~ /^([a-zA-Z]+)([0-9]+)\|([a-zA-Z]+)([0-9]+)$/;
return $1 cmp $3 || $2 <=> $4 ; 
} @array

Basically we get $1-4 set by concatenating the two strings, and then breaking out the alpha parts and the digit parts, and returning the appropriate code depending on the compare values.

1 Comment

I'm sure I won't be the only person to mention it, but whenever you have a sort function that is doing more than just a compare, you really ought to use a Swartzian Transform.

In simple terms, you map from an array of the original data to an array of data-structures that contain your sort keys and your original data, then you sort that second array, before transforming back to just an array of originals.

This way you're only calculating the sort keys once, rather than once per comparision.

The number of comparisions that occur in a sort increases dramatically as you increase the length of the array, so this is always a big saving.

So your code could be rewritten as:

@sorted = map { [ $_, ( /^([a-zA-Z]+)([0-9]+)/ ) ] } @array;
@sorted = sort { ( $a->[ 1 ] cmp $b->[ 1 ] ) or ( $a->[ 2 ] $b->[ 2 ] ) } @sorted;
@sorted = map { $_->[ 0 ] } @sorted;

That's done as three lines for clarity of what the three stages are doing, but you can easily make it into a single statement if you feel the need.

It's tempting not to bother doing this work for small arrays, but even on small sets of data it's usually much faster.

It's also a useful habit to get into: don't work something out more than once unless it's going to have changed.

Btw, greetings from Bristol. ;)

About this Entry

This page contains a single entry by leonard published on May 7, 2010 10:27 AM.

tabs vs spaces? was the previous entry in this blog.

Yapc 2010 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.