2008-10-06

Python Sort, Sort Keys, and Keys, and Keys...

The Path to “Keys”

My first practical experience of more involved sorting (more than just invoking xs.sort()) was probably in Java, which was the Comparable††/Comparator††† metaphor, something that I had also seen in C/C++ (qsort).

Then I had my first real-world use for such sorting, and it was in Perl, and I happily employed the familiar comparator metaphor. Stupid me had missed the most important thing, the Schwartzian transform.

In retrospect, the Schwartzian transform was exactly what I had needed. It would have fixed part of the horrid performance, because we had to perform some heavy transformations for the sort.

Anyway, I learned about Schwartzian transform from the awesome folks at Perl Seminar NY (feel so stupid that I hadn't got help earlier) and almost soon after discovered the "key" parameters in Python's sort.

To make a long story short, I was hooked... on sort-keys. :-)

Python Sort

Python provides sort facilities in two forms:

  1. sort method on list objects
  2. sorted built-in (2.4 and later)
sort method on List

The sort method on list sorts the list in place, that is, it re-arranges the contents of the list. It has the following signature:

list.sort(cmp=None, key=None, reverse=False)
         --> None 

For example:

xs = [1, 5, 3, 0, 8, 1, 2]
xs.sort()
print(xs) # prints [0, 1, 1, 2, 3, 5, 8]
The sorted Built-in

The sorted built-in, on the other hand, takes any iterable and builds a new list with sorted elements that iterable. It has the following signature:

sorted(iterable, cmp=None, key=None, reverse=False) 
           --> new sorted list

For example:

xs = [1, 5, 3, 0, 8, 1, 2]
print(sorted(xs)) # prints [0, 1, 1, 2, 3, 5, 8]

For Python 2.3 and earlier, sorted function can be defined as follows[]:

def sorted(iterable, cmpfunc=None):
  xs = list(iterable)
  xs.sort(cmpfunc)
  return xs

Sort Options

Note that both the sort method on list and the sorted built-in takes the same optional parameters.

reverse
sorts in the opposite direction of natural ordering (e.g., descending order instead of ascending).
cmp
a comparator function to override the natural sort order
key
a sort-key selection function

These parameters modify how the sort is performed. If you do not specify any sort options, the natural ordering defined by the elements in the collection will define the way the elements are sorted.

Natural Ordering

The natural ordering on custom classes can be defined by defining the special method __cmp__[††] which compares that object to another and indicates that it is equal to (by returning 0), greater than (by returning something greater than 0), or less than (by returning something less than 0) the other object.

Here is an example (a rather naïve one):

class Blah(object):
  ...
  def __cmp__(self, other):
    return cmp(self.x, other.x)

Custom Ordering

As described earlier, there are two methods to override the natural ordering---cmp and key.

The Comparator Function (cmp option)

The cmp option is a comparator function[†††]. This has the same specification as the __cmp__ method, except that instead of asking an object to compare itself to another, we give both objects to a function object (or any callable) to compare each other.

Example (a naïve one)

# the name can be anything
def compare(one, other):
  return cmp(one.x, other.x)
The Sort Key Function (key option)

The key option takes a sort-key selection function. If this is specified, the items will be sorted on the natural ordering of this sort key generated by this function.

This is like SQL's ORDER BY directive, but arbitrary mixing of ordering directions is a lot easier in SQL than in our sort-key function.

To see how this works, consider typical use cases of sorting.

We usually have a collection of something. The most simple case are numbers or strings, while more complex ones include compound objects, tuples, or records.

A Somewhat Complex Example

As an example, consider the following:

We have a collection of compound objects of type Person, called crowd, with attributes last_name, first_name, age, height, and aliases.

The Person type defines its natural ordering based on the concatenation of the attributes last_name and first_name.

We want to sort Person objects in different ways:

  1. by first name (case insensitive)
  2. by first name, then last name,
  3. by age
  4. by age, then by last name

and so on...

Here we notice a pattern. We are always picking a field or fields from a record or an attribute or a bunch of attributes from a compound object, maybe normalizing some or all according to some rule, and then we want them all to be sorted according the values we picked.

Let us consider each of the 4 criterion that we described earlier and write sort-key selection functions (the function names can be anything) for each:

  1. by first name (case insensitive)
    def key(person):
      return person.first_name.lower()
    
  2. by first name, then last name,
    def key(person):
      return person.first_name, person.last_name
    
  3. by age
    def key(person):
      return person.age
    
  4. by age, then by last name
    def key(person):
      return person.age, person.last_name
    

In each case we returned the values that will be used for comparison, instead of the actual elements.

The process used in the sort-key based sorting can be described as follows, but not how it is implemented:

def sortyby(key, xs):
  decorated = list((k(e), e) for e in xs)
  sorted = decorated.sort(
    cmp=lambda (ak, a), (bk, b): cmp(ak, bk)
  )
  undecorated = list(e for k, e in sorted)
  return undecorated

This method is known Decorate-Sort-Undecorate or Schwartzian Transform (named after Randall Schwartz[††††]), and the details of this approach is discussed in great detail on Perl Monks discussion board and in Python sort mini-howto (see Resources).

I am not competent enough to discuss the intricacies of this, so I will leave that to the experts (see Resources). I will, however, present a somewhat naïve implementation of sortby in Perl:

=pod
Simple sort-by-key using Schwartzian Transform.

C<<< sortby(keyfunc, listref, 
  [reverse=>(0|1), numeric=>(0|1)]) >>>

keyfunc=key extraction function (required)
listref=reference to a list (required)
reverse=sort in reverse order (default: 0)
numeric=use numeric instead of string comparison on key (default: 0)

e.g:

C<<< @ss = sortby(sub{$_->lname}, \@xs) >>>
C<<< @ss = sortby(sub{$_->lname}, \@xs, reverse=>1) >>>
C<<< @ss = sortby(sub{$_->age}, \@xs, numeric=>1) >>>
C<<< @ss = sortby(sub{$_->[2] + $_->[1]}, 
     \@xs, numeric=>1, reverse=>1) >>>

=cut
sub sortby {
  my $key = shift @_ or die "sort key required";
  my $list = shift @_ or die "list to sort required";
  my %kwargs = (@_);
  my $reverse = $kwargs{reverse} || 0;
  my $numeric = $kwargs{numeric} || 0;

  my @decorated = map {[$_, $key->($_)]} @$list;
  my @sorted = 
      $reverse 
        ? ($numeric 
          ? sort {$b->[1] <=> $a->[1]} @decorated
          : sort {$b->[1] cmp $a->[1]} @decorated)
        : ($numeric 
          ? sort {$a->[1] <=> $b->[1]} @decorated
          : sort {$a->[1] cmp $b->[1]} @decorated);

  return map { $_->[0] } @sorted;
}

Sort Key vs. Comparator

Since there are two ways to override the natural ordering of elements for a sort, which one is preferred?

In Python

The Python documentation recommends using the sort-key method:

  1. in most cases, it is faster
  2. cmp option will be removed in 3.0 branch

This makes sense in Python because there is no additional cost (in code ugliness or clarity) of using the sort-key method. This is not the case in many other languages, which do not provide built-in support for sort-keys.

In General

In general, the sort-key method is employed for its performance benefits. The potential performance advantages are discussed in detail in various places (see Resources).

Remember that, in many cases, the additional effort to write the nested map-sort-map transformation is not worth the imperceptible performance gain or even a degradation due to the overhead.

Personal Reasons

I really liked the sort-key method for its ease of use, rather than for its performance.

Easier
It is somewhat declarative and for most common cases (80 to 90 per cent), it is just a matter of picking and ordering the components from the element; can't get easier than that.
Less error prone
sort keys make 80 to 90 per cent of sorting use cases less error prone. It is quite easy to make a mistake while writing a comparator function, more so than when picking a key. Furthermore, when writing a comparator, you still have to pick the key and write a correct comparator. When using a sort-key, we leave the comparisons to the natural ordering of the sort-keys and focus on just picking the key.

As an example (ok, this might be a straw-man :-) ), consider the common case of selecting and normalizing some data from the element for sorting. Compare the sort-key function and the corresponding comparator:

def sortkey(elem):
  return normalize(select(elem))

def comparator(ea, eb):
  return cmp(normalize(select(ea)), normalize(select(eb)))

On the downside, there is always that 10 to 20 per cent use cases, where generating a key might be trickier than performing some comparison. Then we have stand on our heads and do the Macarena with the hands tied back. (I might be overestimating these use cases in an attempt to compensate for my bias towards the sort-key method.)

Square Peg in a Round Hole

As always, beware of trying to hammer a square peg into a round hole. There will always be cases, where the built-in sort facility of Python is not enough or where you need fancier sorting. This discussion focused on the mundane and normal use cases only, especially only those that fit my ordinary brain.

Keys in Other Places

Sorting is not the only place where key-generating functions can be used.

For instance, the Python built-ins max and min, both accept key arguments to select the key to find the minimum. This allows finding maximum and minimum using something other than the natural ordering of the elements.

Another place is itertools.groupby. This handy function can be used to group a sorted iterable into groups and accepts a key that defines how the object should be grouped. Since this depends on the iterable being sorted, you should generally use the same key used for sorting.

Resources

Notes

Python 2.3 list.sort did not have key and reverse options
††
In Java land, this is like implementing the Comparable<T> interface in Java or mixing in the Ordered[T] trait in Scala.
†††
In Java land, this would be something that implements the Comparator<T> interface in Java or a function object in Scala. The comparator functions and function objects can also be found in C++ STL and C.
††††
Randall Schwartz is a Perl luminary who is credited with discovering/inventing/popularizing the Schwartzian Transform by Tom Christiansen