Email Threading

How email clients reconstruct conversations from messages that don't natively know they're related. Almost every modern client groups replies into threads; the way they do it is mostly the JWZ algorithm from 1997 (named after Jamie Zawinski), with provider-specific shortcuts where available.

The headers that make threading possible

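RFC 5322 defines three headers that, between them, let you reconstruct the parent-child relationship of any reply chain: Message-ID (a unique identifier for the message itself), In-Reply-To (the Message-ID of the message being replied to), and References (the Message-IDs of the message's ancestors, oldest first).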

When you click "reply" in any reasonable mail client, your message gets In-Reply-To set to the parent's Message-ID, and References set to the parent's References + the parent's Message-ID. The chain grows.
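As a rough illustration of how the chain grows, here is a minimal sketch of building the threading headers for a reply. The `ThreadHeaders` struct and `reply_headers` function are hypothetical names used only for this example. The fallback to the parent's In-Reply-To when it has no References follows the RFC 5322 guidance for replies.

```rust
/// Threading-relevant headers of one message.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ThreadHeaders {
    pub message_id: String,
    pub in_reply_to: Option<String>,
    pub references: Vec<String>,
}

/// Build the headers a client would typically set on a reply to `parent`.
pub fn reply_headers(parent: &ThreadHeaders, new_message_id: &str) -> ThreadHeaders {
    // Start from the parent's References chain.
    let mut references = parent.references.clone();

    // If the parent has no References but does have In-Reply-To,
    // seed the chain from that instead.
    if references.is_empty() {
        if let Some(grandparent) = &parent.in_reply_to {
            references.push(grandparent.clone());
        }
    }

    // Append the parent's own Message-ID: the chain grows by one.
    references.push(parent.message_id.clone());

    ThreadHeaders {
        message_id: new_message_id.to_string(),
        in_reply_to: Some(parent.message_id.clone()),
        references,
    }
}
```

Replying to a reply therefore yields a References list containing every ancestor, which is what lets a threader recover links even when intermediate messages are missing.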

The JWZ algorithm

Jamie Zawinski's original write-up (1997) is still the canonical reference. The shape:

  1. Pass 1: For every message, walk its References header. Each adjacent pair is a parent-child link. Build a graph keyed by Message-ID.
  2. Pass 2: For every message, set its parent to the last entry in References, OR the value of In-Reply-To if present and different.
  3. Pass 3: Find roots — messages with no parent, OR whose parent is "phantom" (referenced but never seen, e.g., the original is in someone else's archive).
  4. Pass 4: Build the trees. Sort siblings by date. Optionally collapse phantom containers and merge subjects (Re: Re: Re: hello deduplicated to one root).

Phantom containers are key: when message X references message Y but you don't have Y, JWZ creates a placeholder for Y and treats it as a node, so X's siblings (other replies to Y) still group correctly.
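The linking passes and the phantom-container idea can be sketched roughly as follows. This is a simplified illustration, not JWZ's original code and not mxr's: the `Msg` and `Links` types and the `link`, `children`, and `is_phantom` functions are hypothetical, cycle checks are omitted, and siblings are sorted by ID (rather than by date) only to keep the output deterministic.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal message shape for this sketch (hypothetical).
pub struct Msg {
    pub id: String,
    pub references: Vec<String>,
    pub in_reply_to: Option<String>,
}

/// Child -> parent links, plus the IDs of messages we actually hold.
pub struct Links {
    pub parent: HashMap<String, String>,
    pub seen: HashSet<String>,
}

pub fn link(messages: &[Msg]) -> Links {
    let seen: HashSet<String> = messages.iter().map(|m| m.id.clone()).collect();
    let mut parent: HashMap<String, String> = HashMap::new();

    for m in messages {
        // Effective chain: References, plus In-Reply-To if it differs
        // from the last reference.
        let mut chain = m.references.clone();
        if let Some(irt) = &m.in_reply_to {
            if chain.last() != Some(irt) {
                chain.push(irt.clone());
            }
        }

        // Pass 1: each adjacent pair is a parent-child link. IDs we have
        // never seen still get entries; those are the phantoms.
        for pair in chain.windows(2) {
            if pair[0] != pair[1] {
                parent
                    .entry(pair[1].clone())
                    .or_insert_with(|| pair[0].clone());
            }
        }

        // Pass 2: the message's own parent is the last entry in the chain.
        if let Some(p) = chain.last() {
            if p != &m.id {
                parent.insert(m.id.clone(), p.clone());
            }
        }
    }

    Links { parent, seen }
}

/// Invert the links: parent -> children. Phantom IDs appear as keys too,
/// which is what keeps their replies grouped together.
pub fn children(links: &Links) -> HashMap<String, Vec<String>> {
    let mut out: HashMap<String, Vec<String>> = HashMap::new();
    for (child, parent) in &links.parent {
        out.entry(parent.clone()).or_default().push(child.clone());
    }
    for kids in out.values_mut() {
        kids.sort(); // a real threader would sort by date
    }
    out
}

/// A phantom is an ID that was referenced but never received.
pub fn is_phantom(links: &Links, id: &str) -> bool {
    !links.seen.contains(id)
}
```

With two replies to a message Y that is not in the local store, both replies end up as children of the placeholder for Y, so they still appear as siblings in one thread.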

Cycle prevention

Adversarial or buggy mailers occasionally produce circular References chains. JWZ's algorithm explicitly checks for cycles before adding a parent edge — if adding A→B would make A its own ancestor, drop the edge.

Mxr's implementation does this in crates/sync/threading.rs: a would_create_cycle() check before edge insertion.
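A check of this kind might look like the sketch below. Only the name would_create_cycle() comes from the text above; the signature, the child-to-parent map representation, and the `link_checked` helper are assumptions made for illustration and are not taken from mxr's source.

```rust
use std::collections::{HashMap, HashSet};

/// Returns true if making `parent` the parent of `child` would make
/// `child` one of its own ancestors.
/// `parents` maps child ID -> parent ID (assumed representation).
pub fn would_create_cycle(
    parents: &HashMap<String, String>,
    child: &str,
    parent: &str,
) -> bool {
    let mut visited: HashSet<&str> = HashSet::new();
    let mut current = Some(parent);

    // Walk upward from the proposed parent through its ancestors.
    while let Some(id) = current {
        if id == child {
            return true; // child is already above parent
        }
        if !visited.insert(id) {
            return true; // the existing map is already cyclic: refuse
        }
        current = parents.get(id).map(String::as_str);
    }
    false
}

/// Insert the edge only if it keeps the graph acyclic.
/// Returns whether the edge was added.
pub fn link_checked(
    parents: &mut HashMap<String, String>,
    child: &str,
    parent: &str,
) -> bool {
    if would_create_cycle(parents, child, parent) {
        return false; // drop the edge
    }
    parents.insert(child.to_string(), parent.to_string());
    true
}
```

The walk costs time proportional to the depth of the thread, which is usually small, so running it before every edge insertion is a reasonable trade-off in this sketch.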

Subject-based fallback

JWZ also describes a fallback: if a message has no In-Reply-To and no References (older clients, mailing list digests), match by subject. Strip "Re:", "Fwd:", "[ListName]" prefixes, normalise whitespace, then group by normalised subject.
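The normalisation step could be sketched as below. The function name `normalize_subject` and the exact prefix set ("Re:", "Fwd:", "Fw:", and any leading bracketed tag) are assumptions for this example; real clients handle more variants, including localised reply prefixes.

```rust
/// Strip reply/forward prefixes and leading "[ListName]" tags, then
/// collapse runs of whitespace. Hypothetical helper for illustration.
pub fn normalize_subject(subject: &str) -> String {
    let mut s = subject.trim();

    // Prefixes can be stacked in any order, so repeat until nothing changes.
    loop {
        let before_len = s.len();

        // Leading "[ListName]" tag.
        if s.starts_with('[') {
            if let Some(end) = s.find(']') {
                s = s[end + 1..].trim_start();
            }
        }

        // "Re:", "Fwd:", "Fw:" in any letter case.
        for prefix in ["re:", "fwd:", "fw:"] {
            if let Some(head) = s.get(..prefix.len()) {
                if head.eq_ignore_ascii_case(prefix) {
                    s = s[prefix.len()..].trim_start();
                }
            }
        }

        if s.len() == before_len {
            break;
        }
    }

    // Collapse internal whitespace to single spaces.
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Messages whose normalised subjects are equal would then be grouped together, which is exactly where the false positives come from: "lunch?" and "Re: lunch?" from unrelated conversations map to the same key.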

This is increasingly disabled in modern clients because it produces false positives (two unrelated emails titled "lunch?" get threaded). Provider-side threading (Gmail, Microsoft) is preferred when available.

Provider shortcuts

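Some providers assign thread identifiers server-side: the Gmail API exposes a threadId on each message, and Microsoft Graph exposes a conversationId. A client can group on those directly. When the provider doesn't supply thread IDs, the client runs JWZ on the messages it has.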
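The choice between the two sources can be sketched as a simple fallback. The `ThreadKey` enum and `choose_thread_key` function are hypothetical and exist only to illustrate the decision; they are not part of any provider API or of mxr.

```rust
/// Where a message's thread grouping came from (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ThreadKey {
    /// Thread ID assigned by the provider.
    Provider(String),
    /// Root Message-ID computed locally by a JWZ-style pass.
    Local(String),
}

/// Prefer the provider's thread ID; fall back to the locally computed root.
pub fn choose_thread_key(provider_thread_id: Option<&str>, jwz_root_id: &str) -> ThreadKey {
    match provider_thread_id {
        Some(id) if !id.is_empty() => ThreadKey::Provider(id.to_string()),
        _ => ThreadKey::Local(jwz_root_id.to_string()),
    }
}
```

Keeping the two cases as distinct variants avoids accidentally merging a provider thread with a locally computed one that happens to share the same string.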

What mxr does

Per mxr's crates/sync/threading.rs, the internal type is:

pub struct ThreadTree {
    pub root_message_id: MessageId,
    pub messages: Vec<MessageId>,    // breadth-first walk
}

Used by the sync engine to populate Envelope.thread_id per message, which the UI then groups on.
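To make the "breadth-first walk" comment concrete, here is a sketch of how such a list could be produced from a parent-to-children map. Everything beyond the two struct fields is an assumption: the `MessageId` newtype, the `build_tree` function, the children-map input, and the choice to include the root as the first entry are illustrative and not taken from mxr's source. The walk also assumes the links are already acyclic.

```rust
use std::collections::{HashMap, VecDeque};

/// Assumed shape of MessageId for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct MessageId(pub String);

pub struct ThreadTree {
    pub root_message_id: MessageId,
    pub messages: Vec<MessageId>, // breadth-first walk
}

/// Walk the thread level by level, starting at `root`.
/// `children` maps a parent ID to its (already sorted) replies.
pub fn build_tree(
    root: MessageId,
    children: &HashMap<MessageId, Vec<MessageId>>,
) -> ThreadTree {
    let mut messages = Vec::new();
    let mut queue = VecDeque::from([root.clone()]);

    while let Some(id) = queue.pop_front() {
        // Enqueue this node's replies before moving to the next sibling.
        if let Some(kids) = children.get(&id) {
            queue.extend(kids.iter().cloned());
        }
        messages.push(id);
    }

    ThreadTree { root_message_id: root, messages }
}
```

Every ID in `messages` can then be assigned the same thread identifier, which matches the grouping behaviour described above.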

Common pitfalls

See also