How do you make a recursive merge sort more efficient in Python?

5 minute read


This is re-worked from an excerpt of an answer of mine (1) to an SO question that made me discover some interesting things about Python : turns out it’s not easy to make recursion faster by doing tail call elimination, even by hand. (2)

We mean here to look at methods of tail call elimination and their effect on performance, demonstrated within the limits of the language, and with the example of merge sort. Tail call elimination aims at turning tail recursive calls into iterative loops, providing the benefit of not having to manage a recursive call stack, and thus, hopefully, of making things a bit faster. It is also used to deal with functions who are mutually recursive, or, more generally, whose number of recursive calls make the call stack grow very fast.

Python is an interesting example to show tail call elimination within the language itself because it does not have tail call optimization built-in. Some languages (generally functional) have built-in tail call elimination, meaning every tail call you make gets compiled to an iterative loop. If you want an iterative merge sort in one of those languages, all you need is to re-write the classic version to make your recursive calls tail recursive, using a CPS translation, as we will show below.

Of course, if what you want is a fast sort in Python, the native sort, or numpy’s alternative, or even an in-place quicksort are better alternatives. Even more impressively, faster implementations of a sort which closely (3) mimics the recursive pattern of the merge sort have been proposed on reddit in response to this post. And finally, if you wish to break the limits of what the language intends “natively”, it should be interesting to turn to one of the famous tricks for achieving TCO in Python. Our goal here is to apply tail call elimination within language boundaries, and maybe see if we obtain a TCO’ed function faster that the function it was derived from.

The exhaustive benchmark with timeit of all implementations discussed in this answer (including bubblesort) is posted as a gist here. Just to give a starting point, I find the following results for the average duration of a single run :

  • Python’s native (Tim)sort : 0.0144600081444
  • Bubblesort : 26.9620819092

Those are seconds (but since only the relative speed matters, this isn’t very important). As can be expected, Python is much faster. And to think that Timsort is supposed to be a (C) merge sort !

  • For starters, we can write a relatively efficient, non-tail-recursive merge sort in the following way:

      def mergeSort(array):
          n  = len(array)
          if n  array2[-1]:
          return merged_array

    It runs, as expected, quite faster than the Bubblesort. Average run time : 0.126505711079

Now, our example may not be the best test of the difficulty of managing a huge call stack depth. The number of recursive calls R(n) on a list of size n is (2n-2) : for $k \geq 2$, $R(k) = 2+2* R(k/2)$, and $R(1)=0$. But in depth, the stack blowup of the recursive merge sort algorithm is in fact modest : as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (& removing frames). So for 10K-sized lists, you get a function stack that is at any given time at most of size $log_2(10 000) = 14$ … pretty modest. But we hope it might still give some glimpse of a hint as to what our TCO methods entail.

  • The call to merge is not tail-recursive; it calls both (mergeSort left) and (mergeSort right) recursively while there is remaining work in the call (merge).

But we can make the merge tail-recursive by using CPS. Note that as-is, this would only be useful as-is if we were using a language that does tail call optimization, without it , it’s a bad idea. Indeed, this is only workable in Python if we do further work on that intermediary function, which, by itself, runs out of stack space for even modest lists:

    def cps_merge_sort(array):
        return cpsmergeSort(array,lambda x:x)

    def cpsmergeSort(array,continuation):
        n  = len(array)
        if n  1:
            maxn,n = n,n/2
        return l[:maxn],leftcomb

    def tcomergesort(l):
        l,stack = leftcomb(l)
        while stack: # l sorted, stack contains tagged slices
            i,ordered = stack.pop()
            if ordered:
                l = fastmerge(l,i)
                stack.append((l,True)) # store return call
                rsub,ssub = leftcomb(i)
                stack.extend(ssub) #recurse
                l = rsub
        return l

This goes a tad faster than the inefficient trampolined version above (trampolined mergesort: 0.170638551712, this stack-based version:0.144994809628). This is not surprising. The interesting fact is that this final iterative version goes slower that the basic, non-tail recursive run-of-the mill merge sort you would have written as CS 101 homework. Now this is cool : CPython’s function stack manipulations are faster than my explicit array manipulations. This is more a statement on the size of the recursive call stack (mentioned above) and the efficiency of CPython than on the quality of my CS homework (indeed, an even faster recursive version has been suggested on reddit). But still. It’s surprising that it would take an iterative function that does not do the basic, systematic recursion-to-stack transformation above to beat the run-of-the-mill recursive version.

The final result summary ? on my machine (Ubuntu natty’s stock Python 2.7.1+), the average run timings (out of of 100 runs -except for Bubblesort-, list of size 10000, containing random integers of size 0-10000000) are:

  • Python’s native (Tim)sort : 0.0144600081444
  • Bubblesort : 26.9620819092
  • non-tail-recursive Mergesort : 0.126505711079
  • trampolined CPS non-recursive mergesort : 0.170638551712
  • stack-based non-recursive mergesort : 0.144994809628

  1. Initially fraught with errors, and therefore duly ill-scored. 

  2. This has also benefited a lot of a discussion with Carey Underwood below and in a reddit thread on this post. Many thanks ! 

  3. But sometimes not exactly