# How do you make a recursive merge sort more efficient in Python?

**Published:**

This is re-worked from an excerpt of an answer of mine[^1] to an SO question that made me discover some interesting things about Python: it turns out it’s not easy to make recursion faster by doing tail call elimination, even by hand.[^2]

We mean here to look at *methods of tail call elimination* and their effect on performance, *demonstrated within the limits of the language*, and *with the example of merge sort*. Tail call elimination aims at turning *tail recursive calls* into iterative loops, providing the benefit of not having to manage a recursive call stack, and thus, hopefully, of making things a bit faster. It is also used to deal with functions that are mutually recursive, or, more generally, whose number of recursive calls makes the call stack grow very fast.

Python is an interesting example to show tail call elimination within the language itself because it does not have tail call optimization built-in. Some languages (generally functional ones) have built-in tail call elimination, meaning every tail call you make is compiled down to a jump rather than a new stack frame, effectively an iterative loop. If you want an *iterative* merge sort in one of those languages, all you need to do is re-write the classic version to *make your recursive calls tail recursive*, using a CPS translation, as we will show below.

Of course, if what you want is *a fast sort* in Python, the native `sort`, numpy’s alternative, or even an in-place quicksort are better options. Even more impressively, faster implementations of a sort which closely[^3] mimics the recursive pattern of the merge sort have been proposed on reddit in response to this post. And finally, if you wish to break the limits of what the language intends “natively”, it should be interesting to turn to one of the famous tricks for achieving TCO in Python. **Our goal here is to apply tail call elimination within language boundaries, and maybe see if we obtain a TCO’ed function faster than the function it was derived from**.

The exhaustive benchmark with `timeit` of all implementations discussed in this answer (including bubblesort) is posted as a gist here. Just to give a starting point, I find the following results for the average duration of a single run:
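The gist itself is not reproduced here, but a minimal `timeit` harness in the same spirit can be sketched as follows (a Python 3 sketch; the `bench` helper and its parameters are mine, not the gist’s):

```python
import random
import timeit

def bench(sort_fn, size=10000, runs=5):
    # Average seconds per run of `sort_fn` on a list of random integers.
    data = [random.randint(0, 10000000) for _ in range(size)]
    # Sort a fresh copy each run so in-place sorts don't see pre-sorted input.
    return timeit.timeit(lambda: sort_fn(list(data)), number=runs) / runs

# Example: bench(sorted) times Python's native Timsort.
```

Since only relative speeds matter, the same harness is reused for every implementation.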

- Python’s native (Tim)sort : 0.0144600081444
- Bubblesort : 26.9620819092

Those are seconds (but since only the relative speed matters, this isn’t very important). As can be expected, Python’s native sort is much faster. And to think that Timsort is supposed to be a (C) merge sort!

For starters, we can write a relatively efficient, non-tail-recursive merge sort in the following way:

```
def mergeSort(array):
    n = len(array)
    if n <= 1:
        return array
    left = mergeSort(array[:n/2])
    right = mergeSort(array[n/2:])
    return merge(left, right)

def merge(array1, array2):
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop())
        elif (not array2) or array1[-1] > array2[-1]:
            merged_array.append(array1.pop())
        else:
            merged_array.append(array2.pop())
    merged_array.reverse()
    return merged_array
```

It runs, as expected, much faster than the bubblesort. Average run time: 0.126505711079.

Now, our example may not be the best test of the difficulty of managing a huge call stack **depth**. The number of recursive calls $R(n)$ on a list of size $n$ is $2n-2$: for $k \geq 2$, $R(k) = 2 + 2R(k/2)$, and $R(1) = 0$. **But in depth, the stack blowup of the recursive merge sort algorithm is in fact modest**: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (& removing frames). So for 10K-sized lists, you get a function stack that is at any given time at most of size $\log_2(10\,000) \approx 14$ … pretty modest. But we hope it might still give some glimpse of a hint as to what our TCO methods entail.
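These figures can be sanity-checked with an instrumented sort (a Python 3 sketch of my own, with the merge inlined; note that this counting convention includes the root frame, so the peak depth comes out as $\log_2 n + 1$):

```python
def counted_mergesort(array, stats):
    # Top-down merge sort instrumented to count calls and track how many
    # frames are live on the stack at any one time.
    stats["calls"] += 1
    stats["depth"] += 1
    stats["max_depth"] = max(stats["max_depth"], stats["depth"])
    n = len(array)
    if n <= 1:
        stats["depth"] -= 1
        return array
    left = counted_mergesort(array[:n // 2], stats)
    right = counted_mergesort(array[n // 2:], stats)
    stats["depth"] -= 1
    # Plain merge of the two sorted halves.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

stats = {"calls": 0, "depth": 0, "max_depth": 0}
counted_mergesort(list(range(1024, 0, -1)), stats)
# Recursive calls (excluding the root) = stats["calls"] - 1 = 2*1024 - 2,
# while the peak depth is only log2(1024) + 1 = 11 frames.
```

So the number of calls grows linearly, but the live stack stays logarithmic.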

- The call to `merge` is not tail-recursive; both `mergeSort(left)` and `mergeSort(right)` are called recursively while there is remaining work in the call to `merge`.

But we can make the recursive calls tail calls by using CPS. **Note that this would only be useful as-is if we were using a language that does tail call optimization**; without it, it’s a bad idea. Indeed, this is only workable in Python if we do further work on that intermediary function, which, by itself, runs out of stack space for even modest lists:

```
def cps_merge_sort(array):
    return cpsmergeSort(array, lambda x: x)

def cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    return cpsmergeSort(array[:n/2],
        lambda left: cpsmergeSort(array[n/2:],
            lambda right: continuation(fastmerge(left, right))))
```

(Here `fastmerge` is a merge of two sorted lists, in the spirit of the `merge` above.) We can also go further and eliminate the call stack entirely, by managing an explicit stack of tagged slices:

```
def leftcomb(l):
    maxn, n = len(l), len(l)/2
    leftcomb = []
    while maxn > 1:
        leftcomb.append((l[n:maxn], False))
        maxn, n = n, n/2
    return l[:maxn], leftcomb

def tcomergesort(l):
    l, stack = leftcomb(l)
    while stack:  # l sorted, stack contains tagged slices
        i, ordered = stack.pop()
        if ordered:
            l = fastmerge(l, i)
        else:
            stack.append((l, True))  # store return call
            rsub, ssub = leftcomb(i)
            stack.extend(ssub)  # recurse
            l = rsub
    return l
```
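The trampolined version itself is not shown here; generically, it is obtained by making the CPS function return zero-argument thunks and having a driver loop bounce on them. A minimal Python 3 sketch (the `merge` helper and all names are mine, not necessarily the benchmark’s code):

```python
def merge(a, b):
    # Merge two sorted lists (a stand-in for fastmerge).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def tramp_mergesort(array):
    # CPS merge sort that returns thunks instead of making tail calls;
    # the while-loop below is the trampoline that bounces on them.
    def step(arr, k):
        n = len(arr)
        if n <= 1:
            return lambda: k(arr)
        return lambda: step(arr[:n // 2],
                            lambda left: step(arr[n // 2:],
                                              lambda right: k(merge(left, right))))
    result = step(array, lambda x: x)
    while callable(result):  # a thunk means more work; a list means done
        result = result()
    return result
```

Each bounce unwinds the stack before the next call, so the call depth no longer grows with the number of recursive calls.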

This goes a tad faster than the inefficient trampolined version above (trampolined mergesort: 0.170638551712; this stack-based version: 0.144994809628). This is not surprising. The interesting fact is that this final iterative version goes **slower** than the basic, non-tail-recursive, run-of-the-mill merge sort you would have written as CS 101 homework. Now this is cool: **CPython’s function stack manipulations are faster than my explicit array manipulations**. This is more a statement on the size of the recursive call stack (mentioned above) and the efficiency of CPython than on the quality of my CS homework (indeed, an even faster recursive version has been suggested on reddit). But still: it’s surprising that it would take an iterative function that *does not* do the basic, *systematic* recursion-to-stack transformation above to beat the run-of-the-mill recursive version.

The final result summary? On my machine (Ubuntu Natty’s stock Python 2.7.1+), the average run timings (out of 100 runs, except for Bubblesort; lists of size 10000, containing random integers between 0 and 10000000) are:

- Python’s native (Tim)sort : 0.0144600081444
- Bubblesort : 26.9620819092
- non-tail-recursive Mergesort : 0.126505711079
- trampolined CPS non-recursive mergesort : 0.170638551712
- stack-based non-recursive mergesort : 0.144994809628

[^1]: Initially fraught with errors, and therefore duly ill-scored.

[^2]: This has also benefited *a lot* from a discussion with Carey Underwood below and in a reddit thread on this post. Many thanks!

[^3]: But sometimes not exactly.