Length-Prefixed Framing

#resources #resources/programming #resources/programming/ipc #resources/programming/networking

A technique for sending discrete messages over a byte stream. Stream protocols (TCP, Unix stream sockets) deliver bytes — no message boundaries. To send messages, you have to mark where each one begins and ends. The simplest, fastest, hardest-to-screw-up approach: prefix each message with its length.

The format

[ 4 bytes: message length N (big-endian) ][ N bytes: message body ]
[ 4 bytes: next message length M         ][ M bytes: next body    ]
...

Reader:

Read exactly 4 bytes. Decode as N.
Read exactly N bytes. That's the message body.
Hand the body to the parser.
Loop.

Writer:

Encode the message body. Get N bytes.
Write 4-byte big-endian N.
Write the N bytes.

That's the entire algorithm. Use read_exact and write_all so that a short read or a short write never leaves you stranded mid-frame. The kernel handles the byte stream; the framing handles message boundaries.
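The reader and writer steps above, as a minimal blocking sketch on std::io only. The names write_frame and read_frame are made up for illustration; the size cap discussed under "Common pitfalls" is left out here to keep the shape of the algorithm visible.

```rust
use std::io::{self, Read, Write};

/// Write one frame: 4-byte big-endian length, then the body.
/// (Hypothetical helper name, not from any library.)
pub fn write_frame<W: Write>(w: &mut W, body: &[u8]) -> io::Result<()> {
    // Refuse bodies whose length does not fit in the 4-byte header.
    if body.len() > u32::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "body does not fit in a 4-byte length",
        ));
    }
    w.write_all(&(body.len() as u32).to_be_bytes())?;
    w.write_all(body)
}

/// Read one frame: exactly 4 header bytes, then exactly N body bytes.
/// A stream that ends anywhere, including cleanly between frames,
/// surfaces as ErrorKind::UnexpectedEof.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    // No size cap in this sketch: do not use as-is on untrusted input.
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    Ok(body)
}
```

Writing b"hello" puts 00 00 00 05 on the wire followed by the five body bytes. An empty body is a valid frame: four zero bytes and nothing else.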

Why "big-endian"

Network byte order is big-endian. The convention from BSD sockets and IETF RFCs. Doesn't matter for local IPC functionally, but matches expectations.

In Rust:

// Writer (assumes body.len() fits in a u32; `as` silently truncates otherwise):
let len_bytes = (body.len() as u32).to_be_bytes();
writer.write_all(&len_bytes).await?;
writer.write_all(&body).await?;

// Reader:
let mut len_buf = [0u8; 4];
reader.read_exact(&mut len_buf).await?;
let len = u32::from_be_bytes(len_buf) as usize;
let mut body = vec![0u8; len];
reader.read_exact(&mut body).await?;
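What a byte-order mismatch looks like in practice, as a small sketch. The two function names are hypothetical; they only wrap the standard from_be_bytes / from_le_bytes conversions.

```rust
/// Decode a 4-byte header the way this note recommends (big-endian).
pub fn decode_len_be(header: [u8; 4]) -> u32 {
    u32::from_be_bytes(header)
}

/// Decode the same 4 bytes as a peer that wrongly assumes little-endian.
pub fn decode_len_le(header: [u8; 4]) -> u32 {
    u32::from_le_bytes(header)
}
```

A writer sending a 5-byte body emits the header 00 00 00 05. The big-endian reader gets 5; the little-endian reader gets 0x05000000, about 80 MiB, and then blocks waiting for a body that never arrives.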

Why 4 bytes

Standard for most use cases:

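4 bytes = up to 4 GiB minus one byte (2^32 - 1) per message. Plenty for any realistic IPC payload.
Fixed header size = simpler reader (always read 4 bytes first).
Decodes straight into a native u32 on every mainstream architecture. (Headers are not guaranteed to stay 4-byte aligned within the stream, since bodies have arbitrary lengths.)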

Variants:

2-byte length: capped at 65,535 bytes (64 KiB minus one). Used for protocols where messages are small (DNS over TCP, the length field in the BGP header, some game protocols).
8-byte length: ridiculous overkill for IPC; useful only if individual messages might exceed 4 GiB.
Variable-length encoded length (varint): saves a few bytes per message. Common in protobuf framing. Rarely worth the complexity.

mxr and lazydap use 4-byte big-endian. Standard, simple, sufficient.
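For comparison, a sketch of the varint variant mentioned above, using the unsigned LEB128 layout that protobuf uses for its varints. The function names are hypothetical. It shows where the extra complexity comes from: the header has no fixed size, so the reader cannot just read_exact 4 bytes.

```rust
/// Append `n` as an unsigned LEB128 varint: 7 payload bits per byte,
/// high bit set on every byte except the last.
pub fn encode_varint(mut n: u64, out: &mut Vec<u8>) {
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Decode a varint from the front of `buf`.
/// Returns (value, bytes consumed), or None if `buf` ends before the
/// varint does or the encoding runs past 10 bytes (the most a u64 needs).
pub fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

A length of 300 encodes as two bytes (AC 02) instead of four; anything under 128 takes one byte. The saving is real but small next to a typical JSON body.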

Comparison to other framing schemes

`Content-Length: N\r\n\r\n` headers (LSP / DAP style)

Content-Length: 119\r\n
\r\n
{"jsonrpc":"2.0","id":1,...}

Two \r\n terminate the headers. Body is exactly Content-Length bytes. More verbose than length-prefix but human-readable in transit (great for debugging — open the socket with socat and watch).

DAP uses this. Its Content-Length headers can be inspected with socat UNIX-CONNECT:lazydap.sock STDIO while the daemon talks.
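A minimal sketch of a reader for this header style, on std::io only. The function name is hypothetical, and the sketch ignores any header other than Content-Length and applies no size cap.

```rust
use std::io::{self, BufRead, Read};

/// Read one Content-Length framed message and return its body bytes.
/// (Hypothetical helper; other headers are skipped.)
pub fn read_content_length_message<R: BufRead>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut content_length: Option<usize> = None;
    loop {
        let mut line = String::new();
        if r.read_line(&mut line)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "stream ended inside headers",
            ));
        }
        let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
        if line.is_empty() {
            break; // the blank line ends the header block
        }
        if let Some((name, value)) = line.split_once(':') {
            if name.trim().eq_ignore_ascii_case("Content-Length") {
                let n = value
                    .trim()
                    .parse::<usize>()
                    .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
                content_length = Some(n);
            }
        }
    }
    let len = content_length.ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "missing Content-Length header")
    })?;
    // The body is exactly `len` bytes; it may itself contain \r\n.
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    Ok(body)
}
```

Compared with the 4-byte prefix, the reader now has to scan for line endings and parse decimal text before it knows how many bytes to read. That is the cost of being readable in socat.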

Newline-delimited (`\n` between messages)

Each message is one line. JSON over newline-delimited is "JSONL" or "ndjson".

Pros: dead simple, human-readable, easy to grep.

Cons: messages can't contain newlines unless escaped. Adds parsing burden.
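A sketch of the newline-delimited scheme, with hypothetical function names. The writer has to police its own payloads, which is the "parsing burden" in concrete form.

```rust
use std::io::{self, BufRead, Write};

/// Write one message followed by '\n'. Rejects payloads that contain a
/// raw newline, since the reader would split them into two messages.
pub fn write_line_message<W: Write>(w: &mut W, msg: &str) -> io::Result<()> {
    if msg.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "message contains a raw newline",
        ));
    }
    w.write_all(msg.as_bytes())?;
    w.write_all(b"\n")
}

/// Read every line-delimited message until end of stream.
pub fn read_line_messages<R: BufRead>(r: R) -> io::Result<Vec<String>> {
    r.lines().collect()
}
```

Compact JSON passes the check because a JSON serializer escapes newlines inside strings as \n (backslash, n). Pretty-printed JSON does not.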

Sentinel-delimited

Special byte sequence marks message boundaries. Telnet uses IAC (0xFF). SMTP uses \r\n.\r\n for end of DATA.

Cons: messages can't contain the sentinel; usually requires escaping.
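What "requires escaping" means in code, as a sketch. The scheme here is invented for illustration (it is not Telnet's or SMTP's): SENTINEL ends a frame, and ESCAPE followed by the byte XOR 0x20 stands in for a literal SENTINEL or ESCAPE inside the body.

```rust
const SENTINEL: u8 = 0xFF; // hypothetical frame terminator
const ESCAPE: u8 = 0xFE; // hypothetical escape byte

/// Append `body` to `out` with byte stuffing, then the terminator.
pub fn encode_frame(body: &[u8], out: &mut Vec<u8>) {
    for &b in body {
        if b == SENTINEL || b == ESCAPE {
            // The escaped form (0xDF or 0xDE) is neither special byte,
            // so a raw SENTINEL on the wire always means end of frame.
            out.push(ESCAPE);
            out.push(b ^ 0x20);
        } else {
            out.push(b);
        }
    }
    out.push(SENTINEL);
}

/// Split `wire` into frames, undoing the escaping.
/// Bytes after the last SENTINEL (an incomplete frame) are dropped.
pub fn decode_frames(wire: &[u8]) -> Vec<Vec<u8>> {
    let mut frames = Vec::new();
    let mut current = Vec::new();
    let mut escaped = false;
    for &b in wire {
        if escaped {
            current.push(b ^ 0x20);
            escaped = false;
        } else if b == ESCAPE {
            escaped = true;
        } else if b == SENTINEL {
            frames.push(std::mem::take(&mut current));
        } else {
            current.push(b);
        }
    }
    frames
}
```

Every byte of every message passes through a branch on both sides, and the encoded size depends on the content. Length-prefixing has neither cost.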

Self-describing protocols (gRPC / protobuf)

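Protocol Buffers tag and length-prefix the fields inside a message, but a serialized message is not self-delimiting: the wire format does not record where the top-level message ends. A stream of protobuf messages still needs outer framing, typically a varint length prefix. gRPC supplies its own: each message is preceded by a 1-byte compression flag and a 4-byte big-endian length, carried over HTTP/2. So the codec does not remove the framing layer; the RPC framework provides it for you.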

Why length-prefix wins for local IPC

Fastest — no parsing the framing, just two read_exact calls
No escaping — body can contain anything, including null bytes
Trivial implementation — every language has bytes-to-int conversion
Easy to debug — read the 4 bytes, you know the body length

For human-readable streams (LSP/DAP), Content-Length is preferred for debuggability. For fast local IPC where you control both sides (mxr, lazydap), length-prefix is the right answer.

Common pitfalls

Partial reads — TCP and Unix sockets can return fewer bytes than requested. Use read_exact, never read. (Tokio's read_exact loops internally.)
Endianness mismatches — if the writer and reader disagree, you get garbage lengths and run off the rails. Standardise on big-endian.
Maximum message size — without a cap, a malicious or buggy peer can send a length of 4 GiB and fill your memory. Always validate the length against a sane cap (e.g., 16 MiB) and reject larger.
Mixed framing on the same socket — don't switch framing schemes mid-stream. Pick one, stick to it.
Async safety — partial reads in tokio::select! branches need cancel-safe semantics. read_exact is cancellation-unsafe; if cancelled mid-read, you've consumed some bytes from the stream and can't recover. Either don't cancel reads, or use a buffered reader that owns the read state.
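A sketch of the last option above, combined with the size cap: a decoder that owns every partially received byte, so the read feeding it can be cancelled at any point without losing data. FrameDecoder and MAX_FRAME_LEN are hypothetical names; the 16 MiB value follows the example cap in the list.

```rust
use std::io;

/// Hypothetical cap, using the 16 MiB example from the text.
pub const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Incremental decoder. All partial state lives in `buf`, not in a
/// half-finished read_exact future.
#[derive(Default)]
pub struct FrameDecoder {
    buf: Vec<u8>,
}

impl FrameDecoder {
    /// Append whatever a (possibly short) read returned.
    pub fn push(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Pop one complete frame, or Ok(None) if more bytes are needed.
    /// Errors if the declared length exceeds the cap.
    pub fn next_frame(&mut self) -> io::Result<Option<Vec<u8>>> {
        if self.buf.len() < 4 {
            return Ok(None);
        }
        let header = [self.buf[0], self.buf[1], self.buf[2], self.buf[3]];
        let len = u32::from_be_bytes(header) as usize;
        // Validate before buffering or allocating anything for the body.
        if len > MAX_FRAME_LEN {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "frame exceeds size cap",
            ));
        }
        if self.buf.len() < 4 + len {
            return Ok(None);
        }
        let body = self.buf[4..4 + len].to_vec();
        self.buf.drain(..4 + len);
        Ok(Some(body))
    }
}
```

The caller does a plain read into a scratch buffer, pushes whatever arrived, then drains next_frame until it returns None. A single read that either completes or returns nothing is far easier to cancel safely than a read_exact that may be halfway through a frame.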

What mxr and lazydap do

Both:

// crates/protocol/src/codec.rs

pub async fn write_message<W: AsyncWrite + Unpin>(w: &mut W, msg: &IpcMessage) -> io::Result<()> {
    let body = serde_json::to_vec(msg)?;
    let len = (body.len() as u32).to_be_bytes();
    w.write_all(&len).await?;
    w.write_all(&body).await?;
    w.flush().await?;
    Ok(())
}

pub async fn read_message<R: AsyncRead + Unpin>(r: &mut R) -> io::Result<IpcMessage> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf).await?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_MESSAGE_SIZE {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "message too large"));
    }
    let mut body = vec![0u8; len];
    r.read_exact(&mut body).await?;
    Ok(serde_json::from_slice(&body)?)
}

Both validate length against a 16 MiB cap. Both use serde_json for the body. Both run over Unix Domain Sockets.
