Email Threading
How email clients reconstruct conversations from messages that don't natively know they're related. Almost every modern client groups replies into threads; the way they do it is mostly the JWZ algorithm from 1997 (named after Jamie Zawinski), with provider-specific shortcuts where available.
The headers that make threading possible
RFC 5322 defines three headers that, between them, let you reconstruct the parent-child relationship of any reply chain:
- Message-ID: <abc123@example.com> is the globally unique identifier for this message. RFC 5322 makes it a SHOULD rather than a MUST, but virtually all modern mail carries one. Format: <id@host>.
- In-Reply-To: <abc123@example.com> carries the Message-ID of the message this one directly replies to.
- References: <abc123@example.com> <def456@example.com> is a whitespace-separated list of Message-IDs along the reply chain, oldest first. It is the thread's spine.
When you click "reply" in any reasonable mail client, your message gets In-Reply-To set to the parent's Message-ID, and References set to the parent's References + the parent's Message-ID. The chain grows.
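The reply rule above can be sketched in a few lines. This is a minimal illustration, not any client's actual code: the ThreadHeaders type and both function names are hypothetical, and real header parsing also has to deal with folded lines, comments, and malformed IDs.

```rust
/// Threading-related headers, already extracted from a message.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ThreadHeaders {
    pub message_id: String,
    pub in_reply_to: Option<String>,
    pub references: Vec<String>,
}

/// Split a raw References value into Message-IDs.
/// Keeps only tokens shaped like <...>; anything else is ignored.
pub fn parse_references(raw: &str) -> Vec<String> {
    raw.split_whitespace()
        .filter(|tok| tok.starts_with('<') && tok.ends_with('>'))
        .map(str::to_string)
        .collect()
}

/// Build the threading headers for a reply to `parent`:
/// In-Reply-To is the parent's Message-ID, and References is the
/// parent's References with the parent's Message-ID appended.
pub fn reply_headers(parent: &ThreadHeaders, new_message_id: &str) -> ThreadHeaders {
    let mut references = parent.references.clone();
    references.push(parent.message_id.clone());
    ThreadHeaders {
        message_id: new_message_id.to_string(),
        in_reply_to: Some(parent.message_id.clone()),
        references,
    }
}
```

Replying twice in a row shows the chain growing: the second reply's References holds the root and the first reply, in that order.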
The JWZ algorithm
Jamie Zawinski's original write-up (1997) is still the canonical reference. The shape:
- Pass 1: For every message, walk its References header. Each adjacent pair is a parent-child link. Build a graph keyed by Message-ID.
- Pass 2: For every message, set its parent to the last entry in References, or to the value of In-Reply-To if present and different.
- Pass 3: Find roots: messages with no parent, or whose parent is a "phantom" (referenced but never seen, e.g., the original is in someone else's archive).
- Pass 4: Build the trees. Sort siblings by date. Optionally collapse phantom containers and merge subjects (Re: Re: Re: hello deduplicated to one root).
Phantom containers are key: when message X references message Y but you don't have Y, JWZ creates a placeholder for Y and treats it as a node, so X's siblings (other replies to Y) still group correctly.
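A simplified sketch of the linking step shows how phantoms fall out naturally. The Container type and link function are hypothetical names, messages are reduced to (Message-ID, References) pairs, and the sketch keeps the first parent it sees for each node. It does not reproduce every rule of JWZ's write-up, and it only guards against self-links, not longer loops.

```rust
use std::collections::HashMap;

/// One node in the thread graph. `present` stays false for a phantom:
/// an ID that was referenced but whose message was never received.
#[derive(Debug, Default)]
pub struct Container {
    pub present: bool,
    pub parent: Option<String>,
    pub children: Vec<String>,
}

/// Link containers from (message_id, references) pairs.
pub fn link(messages: &[(&str, Vec<&str>)]) -> HashMap<String, Container> {
    let mut table: HashMap<String, Container> = HashMap::new();
    for (id, refs) in messages {
        // The message itself is real, so mark its container as present.
        table.entry(id.to_string()).or_default().present = true;

        // Treat the chain as refs[0] -> refs[1] -> ... -> id.
        let chain: Vec<&str> = refs.iter().copied().chain(std::iter::once(*id)).collect();
        for pair in chain.windows(2) {
            let (parent, child) = (pair[0], pair[1]);
            // Creates a phantom container if the parent was never seen.
            table.entry(parent.to_string()).or_default();
            let child_node = table.entry(child.to_string()).or_default();
            // Keep an existing link; skip trivial self-references.
            if child_node.parent.is_none() && parent != child {
                child_node.parent = Some(parent.to_string());
                table.get_mut(parent).unwrap().children.push(child.to_string());
            }
        }
    }
    table
}
```

With two replies to a message that was never received, both replies end up as children of the same phantom, so they still group together.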
Cycle prevention
Adversarial or buggy mailers occasionally produce circular References chains. JWZ's algorithm explicitly checks for cycles before adding a parent edge — if adding A→B would make A its own ancestor, drop the edge.
Mxr's implementation does this in crates/sync/threading.rs: a would_create_cycle() check before edge insertion.
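One way such a check can work is to walk up from the proposed parent and see whether the child is already an ancestor. The sketch below is not mxr's code: it borrows the function name for readability, but the signature, the parent map, and the try_link helper are all assumptions.

```rust
use std::collections::HashMap;

/// Returns true if making `parent` the parent of `child` would close a loop,
/// i.e. `child` is `parent` itself or already one of `parent`'s ancestors.
/// `parents` maps each message ID to its current parent.
/// Assumes `parents` is acyclic, which holds if every edge goes through try_link.
pub fn would_create_cycle(parents: &HashMap<String, String>, child: &str, parent: &str) -> bool {
    let mut current = Some(parent);
    while let Some(id) = current {
        if id == child {
            return true;
        }
        // Step to the next ancestor, if there is one.
        current = parents.get(id).map(String::as_str);
    }
    false
}

/// Insert the edge child -> parent unless it would create a cycle.
/// Returns whether the edge was added.
pub fn try_link(parents: &mut HashMap<String, String>, child: &str, parent: &str) -> bool {
    if would_create_cycle(parents, child, parent) {
        return false;
    }
    parents.insert(child.to_string(), parent.to_string());
    true
}
```

The walk costs one step per ancestor, which is cheap for typical reply chains.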
Subject-based fallback
JWZ also describes a fallback: if a message has no In-Reply-To and no References (older clients, mailing list digests), match by subject. Strip "Re:", "Fwd:", "[ListName]" prefixes, normalise whitespace, then group by normalised subject.
This is increasingly disabled in modern clients because it produces false positives (two unrelated emails titled "lunch?" get threaded). Provider-side threading (Gmail, Microsoft) is preferred when available.
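A minimal normaliser along these lines might look like the sketch below. The prefix list is deliberately short and ASCII-only, and the function names are hypothetical. It also treats any leading bracketed tag as a list name, which is itself a source of false positives.

```rust
/// Reply/forward markers to strip. Illustrative and incomplete.
const PREFIXES: [&str; 4] = ["re:", "fwd:", "fw:", "aw:"];

/// Strip `prefix` from the start of `s`, ignoring ASCII case.
fn strip_prefix_ci<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    // `get` returns None when the range is too long or splits a character.
    let head = s.get(..prefix.len())?;
    if head.eq_ignore_ascii_case(prefix) {
        Some(&s[prefix.len()..])
    } else {
        None
    }
}

/// Reduce a subject to a grouping key: no leading markers or list tags,
/// single spaces, lowercase.
pub fn normalize_subject(subject: &str) -> String {
    let mut rest = subject.trim();
    loop {
        // Strip a leading "[ListName]" tag.
        if rest.starts_with('[') {
            if let Some(end) = rest.find(']') {
                rest = rest[end + 1..].trim_start();
                continue;
            }
        }
        // Strip one leading reply/forward marker, then try again.
        let stripped = PREFIXES.iter().find_map(|p| strip_prefix_ci(rest, p));
        match stripped {
            Some(tail) => rest = tail.trim_start(),
            None => break,
        }
    }
    // Collapse runs of whitespace and lowercase the result.
    rest.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}
```

Messages whose keys compare equal would be grouped together, which is exactly how two unrelated "lunch?" emails end up in one thread.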
Provider shortcuts
- Gmail assigns its own threadId to every message at receive time. mxr's Gmail provider uses this directly: ThreadId::from_provider_id("gmail", thread_id). No JWZ needed.
- Microsoft Exchange uses Thread-Topic and Thread-Index headers (proprietary). Fewer clients honour these.
- IMAP THREAD extension (RFC 5256): the server can return threaded results. Spotty server support; clients usually do their own threading regardless.
When the provider doesn't supply thread IDs, the client runs JWZ on the messages it has.
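The decision itself is a simple fallback. The sketch below is purely illustrative: thread_key and its string format are invented here and are not mxr's ThreadId API.

```rust
/// Pick a grouping key for a message. Hypothetical helper.
/// Prefers the provider's own thread ID; otherwise falls back to the
/// Message-ID of the root that client-side JWZ threading produced.
pub fn thread_key(provider: &str, provider_thread_id: Option<&str>, jwz_root: &str) -> String {
    match provider_thread_id {
        // Namespace by provider so IDs from different services cannot collide.
        Some(id) => format!("{}:{}", provider, id),
        None => format!("jwz:{}", jwz_root),
    }
}
```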
What mxr does
Per mxr's crates/sync/threading.rs:
- Implements JWZ in three passes (build → roots → trees).
- Cycle prevention via would_create_cycle().
- Roots sorted by message date (earliest first), then by ID for determinism.
- Phantom containers preserved for thread structure but not displayed.
- For Gmail, threading is skipped: Gmail's threadId is used directly.
- For IMAP without server-side threading, JWZ runs after each sync batch.
The internal type:
pub struct ThreadTree {
    pub root_message_id: MessageId,
    pub messages: Vec<MessageId>, // breadth-first walk
}
Used by the sync engine to populate Envelope.thread_id per message, which the UI then groups on.
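To make the "breadth-first walk" comment concrete, here is a sketch of how such a list could be produced from a parent-to-children map. MessageId is aliased to String as a stand-in, and the flatten function and the shape of the children map are assumptions, not mxr's actual code.

```rust
use std::collections::{HashMap, VecDeque};

/// Stand-in for the real MessageId type.
pub type MessageId = String;

pub struct ThreadTree {
    pub root_message_id: MessageId,
    pub messages: Vec<MessageId>, // breadth-first walk
}

/// Flatten one thread into breadth-first order, starting at `root`.
/// `children` maps a message ID to its replies, already sorted by date.
/// Assumes the structure is a tree (no cycles).
pub fn flatten(root: &str, children: &HashMap<MessageId, Vec<MessageId>>) -> ThreadTree {
    let mut messages = Vec::new();
    let mut queue = VecDeque::new();
    queue.push_back(root.to_string());
    while let Some(id) = queue.pop_front() {
        // Enqueue this message's replies before recording it.
        if let Some(kids) = children.get(&id) {
            queue.extend(kids.iter().cloned());
        }
        messages.push(id);
    }
    ThreadTree {
        root_message_id: root.to_string(),
        messages,
    }
}
```

Every ID in messages would then share the same thread_id, which is what lets the UI group on it.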
Common pitfalls
- Reply storms across mailing lists: when a list rewrites Message-IDs (some do), threading breaks. Often shows up as one giant thread or many tiny threads.
- Replies that strip In-Reply-To: older Outlook versions did this. Forces subject-based matching, which is unreliable.
- Replies to multiple parents: RFC 5322 lets In-Reply-To list several Message-IDs but leaves References undefined for that case, and a JWZ tree gives each message exactly one parent (usually the latest in References).
- Attached forwarded messages: MIME (RFC 2046) allows message/rfc822 parts. These have their own Message-ID + References but they're nested inside another message. Threading them is undefined.
- Subject normalisation locale issues: "Re:" / "Fwd:" / "AW:" / "RE:" / "Fw:" / "答复:" / "回复:". A thorough normaliser handles many languages.
See also
- MIME: provides the headers threading reads
- Email Internal Model: what ThreadTree becomes
- How Email Actually Works: synthesis
- Mxr: concrete implementation
- JWZ's original: https://www.jwz.org/doc/threading.html
- RFC 5322 §3.6.4 (identification headers): https://datatracker.ietf.org/doc/html/rfc5322#section-3.6.4
- RFC 5256 (IMAP THREAD): https://datatracker.ietf.org/doc/html/rfc5256