Merge Sort | Dheerendra Varma

Merge sort is the canonical divide-and-conquer sorting algorithm. Where bubble sort and insertion sort make incremental progress by comparing adjacent elements, merge sort steps back and asks a different question: if I already have two sorted sequences, can I combine them into one sorted sequence cheaply? The answer is yes — in linear time. The rest of the algorithm is just making sure you always have two sorted sequences to work with.

The core insight is that merging two sorted arrays of total size n takes exactly O(n) work: maintain two pointers, one into each half, and repeatedly pick the smaller front element. This is far more efficient than sorting from scratch.

Step-by-Step Trace

Input: [38, 27, 43, 3, 9, 82, 10]

Phase 1 — Recursive splitting (divide)

[38, 27, 43, 3, 9, 82, 10]
         /           \
  [38, 27, 43, 3]   [9, 82, 10]
      /     \          /     \
 [38, 27]  [43, 3]  [9, 82]  [10]
   /   \    /   \    /   \
 [38] [27] [43] [3] [9] [82]

Every leaf is a single element — trivially sorted.

Phase 2 — Merging back up (conquer)

Step 1:  merge [38] + [27]    → [27, 38]
Step 2:  merge [43] + [3]     → [3, 43]
Step 3:  merge [9]  + [82]    → [9, 82]

Step 4:  merge [27, 38] + [3, 43]
         i→27  j→3   pick 3   → [3]
         i→27  j→43  pick 27  → [3, 27]
         i→38  j→43  pick 38  → [3, 27, 38]
         i=end j→43  pick 43  → [3, 27, 38, 43]

Step 5:  merge [9, 82] + [10]
         i→9   j→10  pick 9   → [9]
         i→82  j→10  pick 10  → [9, 10]
         i→82  j=end pick 82  → [9, 10, 82]

Step 6:  merge [3, 27, 38, 43] + [9, 10, 82]
         i→3   j→9   pick 3   → [3]
         i→27  j→9   pick 9   → [3, 9]
         i→27  j→10  pick 10  → [3, 9, 10]
         i→27  j→82  pick 27  → [3, 9, 10, 27]
         i→38  j→82  pick 38  → [3, 9, 10, 27, 38]
         i→43  j→82  pick 43  → [3, 9, 10, 27, 38, 43]
         i=end j→82  pick 82  → [3, 9, 10, 27, 38, 43, 82]

Result: [3, 9, 10, 27, 38, 43, 82]

Each merge pass touches every element exactly once, so each level of the tree costs O(n) total.

Naive Implementation

Top-down recursive merge sort. Each call allocates a new Vec for the merged result and returns it. Simple to read, but every merge allocates heap memory.

fn merge_sort(arr: &[i32]) -> Vec<i32> {
    // Base case: a slice of 0 or 1 elements is already sorted.
    if arr.len() <= 1 {
        return arr.to_vec();
    }
 
    let mid = arr.len() / 2;
 
    // Recursively sort both halves. Each call returns a freshly
    // allocated Vec containing the sorted sub-sequence.
    let left  = merge_sort(&arr[..mid]);
    let right = merge_sort(&arr[mid..]);
 
    merge(&left, &right)
}
 
fn merge(left: &[i32], right: &[i32]) -> Vec<i32> {
    let mut result = Vec::with_capacity(left.len() + right.len());
    let (mut i, mut j) = (0, 0);
 
    // Pick the smaller front element from whichever half still has items.
    while i < left.len() && j < right.len() {
        if left[i] <= right[j] {
            result.push(left[i]);
            i += 1;
        } else {
            result.push(right[j]);
            j += 1;
        }
    }
 
    // Append whatever remains in either half (at most one of these loops runs).
    result.extend_from_slice(&left[i..]);
    result.extend_from_slice(&right[j..]);
 
    result
}
 
fn main() {
    let data = vec![38, 27, 43, 3, 9, 82, 10];
    let sorted = merge_sort(&data);
    println!("{:?}", sorted); // [3, 9, 10, 27, 38, 43, 82]
}

The <= 1 base case handles both empty and single-element slices. The stable <= comparison in the merge loop preserves the original relative order of equal elements, making this a stable sort.

Optimized Implementation

Bottom-up iterative merge sort with ping-pong buffers. Instead of recursing to find base cases and merging on the way back, we start with width-1 runs and double the width each pass. Two buffers trade roles — one is the source, the other the destination — so no extra allocation happens inside the loop.

fn merge_sort_optimized(arr: &mut Vec<i32>) {
    let n = arr.len();
    if n <= 1 {
        return;
    }
 
    // Allocate one auxiliary buffer the same size as the input.
    // During each pass we merge from `src` into `dst`, then swap them.
    let mut buf = arr.clone();
 
    // `width` is the size of each sorted run at the start of a pass.
    // We begin with runs of length 1 (every single element is trivially sorted)
    // and double after each full pass.
    let mut width = 1usize;
 
    // Track which buffer is currently the "source of truth".
    // false → arr is source, buf is dest
    // true  → buf is source, arr is dest
    let mut src_is_buf = false;
 
    while width < n {
        {
            // Reborrow as raw pointers so we can name both slices simultaneously.
            let (src, dst): (&[i32], &mut Vec<i32>) = if src_is_buf {
                (buf.as_slice(), arr)
            } else {
                (arr.as_slice(), &mut buf)
            };
 
            // Sweep through the source in chunks of 2*width.
            let mut lo = 0;
            while lo < n {
                let mid = (lo + width).min(n);
                let hi  = (lo + 2 * width).min(n);
 
                // Merge src[lo..mid] and src[mid..hi] into dst[lo..hi].
                let (mut i, mut j, mut k) = (lo, mid, lo);
                while i < mid && j < hi {
                    if src[i] <= src[j] {
                        dst[k] = src[i];
                        i += 1;
                    } else {
                        dst[k] = src[j];
                        j += 1;
                    }
                    k += 1;
                }
                while i < mid { dst[k] = src[i]; i += 1; k += 1; }
                while j < hi  { dst[k] = src[j]; j += 1; k += 1; }
 
                lo += 2 * width;
            }
        }
 
        src_is_buf = !src_is_buf; // swap roles for next pass
        width *= 2;
    }
 
    // If the sorted result ended up in `buf`, copy it back into `arr`.
    if src_is_buf {
        arr.clone_from(&buf);
    }
}
 
fn main() {
    let mut data = vec![38, 27, 43, 3, 9, 82, 10];
    merge_sort_optimized(&mut data);
    println!("{:?}", data); // [3, 9, 10, 27, 38, 43, 82]
}

Key improvements over the naive version:

No recursion stack — no risk of stack overflow on very large inputs, and no function-call overhead per level.
Single allocation — one auxiliary Vec for the whole sort instead of O(n log n) small allocations spread across all recursive calls.
Ping-pong buffers — instead of copying back after every merge, the buffers simply swap roles. Only one copy-back is needed at the very end, and only when the result happens to land in the wrong buffer.
Cache-friendly access pattern — the bottom-up order processes memory in sequential strides, which is friendlier to hardware prefetchers than the recursive top-down order.

Time Complexity

The recurrence relation for merge sort is:

T(n) = 2 T(n/2) + O(n)

The 2 T(n/2) term comes from the two recursive halves. The O(n) term is the cost of merging them — we scan every element in both halves exactly once.

Applying the Master Theorem (case 2):

a = 2,  b = 2,  f(n) = O(n)

log_b(a) = log_2(2) = 1

f(n) = O(n^1) = O(n^{log_b a})   ← Case 2 applies

Therefore:  T(n) = O(n log n)

Unlike quick sort, this bound holds for all inputs — best, average, and worst case — because the split is always at the exact midpoint regardless of the data values.

Verifying by counting work per level of the recursion tree:

Level 0:  1 merge  of size n         = n  comparisons
Level 1:  2 merges of size n/2 each  = n  comparisons
Level 2:  4 merges of size n/4 each  = n  comparisons
...
Level k:  2^k merges of size n/2^k  = n  comparisons

Number of levels = log_2(n)

Total = n * log_2(n) = O(n log n)

Space Complexity

Auxiliary space: O(n)

Merge sort cannot sort entirely in place using the classic algorithm. The merge step needs a buffer to hold the combined output before writing it back. In the naive implementation this is O(n) per level but the allocations at each level are freed before descending further in the recursion tree, so at any one time the live auxiliary memory is O(n). In the optimized version it is explicitly one buffer of size n allocated once.

Recursion stack: O(log n)

The naive recursive version keeps at most O(log n) stack frames alive simultaneously, one per level of the divide tree. The optimized bottom-up version eliminates the recursion stack entirely, achieving O(1) stack space.

Total:

Version	Auxiliary heap	Stack
Naive recursive	O(n)	O(log n)
Bottom-up iterative	O(n)	O(1)