Designing a real-time text system

This page covers the design of real-time text (RTT) systems – systems where a sender’s keystrokes are transmitted to recipients as they are typed. RTT is distinct from collaborative document editing (which uses techniques like operational transformation and CRDTs to reconcile concurrent edits): in RTT, only one party authors a message at a time, so no conflict resolution is needed.

1. Prior work

Refer to “A survey of real-time text systems” and “Standards and protocols for real-time text”.

2. Considerations

When designing a real-time text system, consider the following factors:

Transport protocol
What does the real-time text protocol run over? TCP guarantees ordering and delivery at the cost of latency (a retransmission blocks all data behind it). UDP is “fire and forget”, which lowers latency, but the protocol running over it must itself handle packet loss and reordering. WebSocket is a common choice for browser-based systems, providing a persistent full-duplex channel over TCP.
Correctness and recoverability
Should the receiver’s copy of the message always match the sender’s exactly, or is occasional divergence acceptable? What happens when the connection drops mid-message – is the partial message preserved or lost? If the receiver falls out of sync (due to packet loss or a reconnection), how is the correct state restored? The answers depend on the transmission approach (see section 3).
Update frequency
How often are updates transmitted? More frequent updates (e.g., on every keystroke) feel more responsive but consume more bandwidth. Less frequent updates (e.g., batched at fixed intervals) are more efficient but introduce perceptible delay.
Bandwidth and scalability
Bandwidth cost depends on the transmission approach (see section 3), update frequency, and message length. An approach that transmits the entire message on every change (3.2) is cheap for short messages but expensive for long ones. A diff-based approach (3.3) scales better with message length but adds complexity. Consider the expected message sizes, number of concurrent conversations, and available bandwidth.
Protocol flexibility
Can the protocol be extended to accommodate features like rich text, emojis, media attachments, replies, user tagging, and so on?
Platform
Browser-based systems are limited to the transports the browser exposes (chiefly WebSocket, HTTP-based transports, and WebRTC data channels) and have weaker cryptographic guarantees. Desktop clients have more flexibility in transport and encryption. Mobile clients must account for intermittent connectivity and background process restrictions.
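
To make the bandwidth trade-off described under “Bandwidth and scalability” concrete, here is a rough back-of-the-envelope sketch. It compares total bytes sent while a message is typed one character at a time, once when the full text is resent on every change and once when only the new character is sent. The 40-byte per-update overhead and the one-byte-per-character encoding are illustrative assumptions, not measurements; real figures depend on the transport and framing.

```python
def full_message_bytes(length: int, overhead: int = 40) -> int:
    """Total bytes if the whole text is resent after each of `length` keystrokes."""
    # Update k carries k characters, so the payloads sum to 1 + 2 + ... + length.
    return length * overhead + length * (length + 1) // 2


def incremental_bytes(length: int, overhead: int = 40) -> int:
    """Total bytes if each update carries only the newly typed character."""
    return length * overhead + length


# Assumed message lengths, chosen only to show how the two totals diverge.
for n in (20, 200, 2000):
    print(n, full_message_bytes(n), incremental_bytes(n))
```

Under these assumptions the full-message total grows quadratically with message length while the incremental total grows linearly, which is why the gap is negligible for short messages and large for long ones.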

3. Transmission approaches

There are at least three approaches to transmitting real-time text, differing in bandwidth cost, complexity, and resilience to packet loss. All three require some way to signal the end of a message (a “send” action) in addition to the in-progress updates.

3.1. Transmit exact keystrokes

The sender transmits each character (or small batch of characters) as it is typed. The receiver appends incoming characters to reconstruct the message.
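
A minimal sketch of the receiving side of this approach follows. The sequence numbers, the `in_sync` flag, and the use of a backspace character to represent deletions are assumptions made for illustration; the approach itself does not prescribe them.

```python
from dataclasses import dataclass

BACKSPACE = "\b"  # assumption: a deletion is sent as a backspace character


@dataclass
class KeystrokeReceiver:
    text: str = ""
    next_seq: int = 0      # sequence number expected next
    in_sync: bool = True   # becomes False once a gap is detected

    def receive(self, seq: int, chars: str) -> None:
        if seq != self.next_seq:
            # A packet was lost or reordered, so the local copy can no
            # longer be trusted. Recovery is out of scope for this sketch.
            self.in_sync = False
            return
        self.next_seq += 1
        for ch in chars:
            if ch == BACKSPACE:
                self.text = self.text[:-1]
            else:
                self.text += ch
```

Because every update depends on all earlier ones, a single missing packet leaves the receiver out of sync until the state is restored some other way.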

3.2. Transmit entire message

On every change, the sender transmits the full current text of the message. The receiver replaces its displayed text with each incoming update.
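
A receiver for this approach can be sketched in a few lines. The sequence number is an assumption added here so that a late-arriving snapshot (possible over an unordered transport such as UDP) does not overwrite a newer one.

```python
class SnapshotReceiver:
    def __init__(self) -> None:
        self.text = ""
        self.last_seq = -1  # sequence number of the newest snapshot applied

    def receive(self, seq: int, full_text: str) -> bool:
        """Apply a snapshot; return False if it was stale and ignored."""
        if seq <= self.last_seq:
            return False
        self.last_seq = seq
        self.text = full_text
        return True
```

A lost update needs no special handling in this sketch: the next snapshot that arrives carries the complete text.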

3.3. Transmit a diff

The sender transmits only what changed since the last update: insertions, deletions, and optionally cursor movements. The receiver applies these operations to its local copy of the message.
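
One simple way to produce such a diff, shown here as a sketch rather than a recommended wire format, is to strip the common prefix and suffix of the old and new text and send the single changed span. The `Edit` structure and function names are hypothetical.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    pos: int      # index where the change starts
    delete: int   # number of characters removed at pos
    insert: str   # text inserted at pos


def make_edit(old: str, new: str) -> Edit:
    # Length of the common prefix.
    p = 0
    while p < len(old) and p < len(new) and old[p] == new[p]:
        p += 1
    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < len(old) - p and s < len(new) - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return Edit(pos=p, delete=len(old) - p - s, insert=new[p:len(new) - s])


def apply_edit(text: str, edit: Edit) -> str:
    return text[:edit.pos] + edit.insert + text[edit.pos + edit.delete:]
```

For example, changing "hello world" to "hello brave world" yields `Edit(pos=6, delete=0, insert="brave ")`. The receiver must hold exactly the text the sender diffed against, so a lost edit desynchronizes the two copies just as a lost keystroke does in 3.1.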

4. Message association

Every update needs enough information to identify the sender and the message it belongs to. Supplementary data – attachments, emoji, reply references, tagged users – can be sent once at the point it is added and omitted from subsequent updates. Receiving clients keep this data in memory alongside the in-progress message. If the update carrying it is lost, the receiver will miss it, but including it again in the final update that signals the end of the message (the “send” action) provides a fallback.
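
The scheme described above might look like the following sketch. The field names (`sender_id`, `message_id`, `extras`, `final`) and the choice to carry the full text in each update are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Update:
    sender_id: str
    message_id: str
    seq: int
    text: str
    extras: dict = field(default_factory=dict)  # e.g. {"reply_to": "m0"}; usually empty
    final: bool = False                         # True for the "send" action


class InProgressMessages:
    def __init__(self) -> None:
        self._state: dict = {}

    def receive(self, u: Update) -> Optional[dict]:
        """Merge an update; return the finished message on the final update."""
        key = (u.sender_id, u.message_id)
        msg = self._state.setdefault(key, {"text": "", "extras": {}})
        msg["text"] = u.text
        msg["extras"].update(u.extras)  # keep supplementary data sent earlier
        if u.final:
            return self._state.pop(key)
        return None
```

If the sender repeats the supplementary data in the final update, the `update` call merges it again, which covers the case where the earlier update was lost.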

5. Encryption

Key establishment is out of scope for this page. The following assumes that participants in an encrypted RTT channel have agreed on a shared key for encrypting and decrypting message payloads.

5.1. Transport encryption vs end-to-end encryption

Transport encryption (TLS, WSS) protects data in transit between client and server, but the server sees plaintext. End-to-end encryption (E2EE) keeps message content private from the server as well, but complicates server-side functionality such as search, moderation, spam filtering, and push notification previews. RTT’s high update frequency amplifies the cost of E2EE: each update must be encrypted by the sender and decrypted by every recipient, and the server must relay ciphertext it cannot inspect.

5.2. Cipher choice

A stream cipher (or a block cipher in a streaming mode such as CTR or GCM) is a natural fit for RTT, where updates are small and arrive incrementally. Block ciphers in modes like CBC require input to be padded to a multiple of the block size, adding overhead to every small update: a single character encrypted with AES-128-CBC still produces at least 16 bytes of ciphertext plus a 16-byte IV.
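
The overhead can be worked through with simple arithmetic. The sketch below only computes sizes and performs no encryption. It assumes PKCS#7 padding for CBC, a fresh IV or nonce transmitted with every update, a 16-byte counter block for CTR, and GCM's commonly used 12-byte nonce and 16-byte authentication tag; a real protocol may derive nonces from a sequence number instead of sending them.

```python
BLOCK = 16  # AES block size in bytes


def cbc_size(plaintext_len: int, iv_len: int = 16) -> int:
    """Bytes on the wire for AES-CBC with PKCS#7 padding and a per-update IV."""
    # PKCS#7 always adds between 1 and 16 bytes of padding.
    padded = (plaintext_len // BLOCK + 1) * BLOCK
    return iv_len + padded


def ctr_size(plaintext_len: int, nonce_len: int = 16) -> int:
    """Bytes on the wire for AES-CTR with a per-update nonce (unauthenticated)."""
    return nonce_len + plaintext_len


def gcm_size(plaintext_len: int, nonce_len: int = 12, tag_len: int = 16) -> int:
    """Bytes on the wire for AES-GCM: nonce + ciphertext + authentication tag."""
    return nonce_len + plaintext_len + tag_len


for n in (1, 10, 16):
    print(n, cbc_size(n), ctr_size(n), gcm_size(n))
```

Under these assumptions a one-character update costs 32 bytes with CBC, 17 with CTR, and 29 with GCM, so per-update nonces and authentication tags can matter as much as block padding for very small updates.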

5.3. Traffic analysis

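Encryption protects message content, but RTT traffic is unusually vulnerable to metadata analysis. Even without decrypting any payload, an observer of encrypted RTT traffic can see when each packet is sent, how large it is, and whether the sender is active or idle.

From this metadata alone, an attacker can potentially infer information about what is being typed (5.3.1) and who is typing it (5.3.2).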

5.3.1. Keystroke timing attacks

Song, Wagner, and Tian (2001) demonstrated that inter-keystroke timing in encrypted SSH sessions leaks approximately 1 bit of information per keystroke pair. SSH in interactive mode sends each keystroke as a separate packet immediately after it is pressed, so packet timing directly mirrors typing rhythm. Using a Hidden Markov Model trained on typing patterns, their system could narrow password search spaces by a factor of 50.

RTT systems are susceptible to the same class of attack. Depending on the transmission approach, RTT may send updates per keystroke (3.1) or per edit (3.3), producing packet timing that mirrors the sender’s typing rhythm in the same way.

5.3.2. User identification via typing patterns

Keystroke dynamics – the study of typing rhythm as a behavioral biometric – is a well-established research area. Typing patterns are influenced by factors like typing speed, hand size, keyboard familiarity, and habitual key sequences. Two commonly studied timing features are dwell time (how long a key is held) and flight time (the interval between releasing one key and pressing the next).

One approach to extracting a stable typing signature is to compute the frequency spectrum of inter-keystroke intervals via the Fast Fourier Transform. Dominant peaks in the spectrum may correspond to rhythmic constants in a person’s typing, such as a base typing speed or characteristic inter-word pauses. If such a signature is consistent across sessions, an eavesdropper could potentially identify users across encrypted RTT conversations without decrypting any content.
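
The spectral idea above can be sketched as follows. For self-containment this uses a naive discrete Fourier transform rather than an FFT; both compute the same transform, and the FFT is simply faster. The function name and the choice to subtract the mean first are assumptions of this sketch.

```python
import cmath


def interval_spectrum(intervals_ms: list[float]) -> list[float]:
    """Magnitude spectrum of a sequence of inter-keystroke intervals (naive DFT)."""
    n = len(intervals_ms)
    mean = sum(intervals_ms) / n
    centered = [x - mean for x in intervals_ms]  # remove the constant (DC) component
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            centered[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


# Synthetic data (not a real recording): a strictly alternating fast/slow rhythm.
spec = interval_spectrum([100.0, 300.0] * 8)
print(spec.index(max(spec)))
```

For this synthetic alternating rhythm the energy concentrates in the highest frequency bin. Real typing data is far noisier, and whether a stable signature emerges is an empirical question.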

5.4. Countermeasures

Several techniques can mitigate timing and traffic analysis:

Batching at fixed intervals. Rather than transmitting on each keystroke, updates are buffered and sent at a constant interval, which hides inter-keystroke timing at any granularity finer than the interval. XEP-0301 recommends 700 ms intervals for interoperability; OpenSSH’s ObscureKeystrokeTiming option uses 20 ms intervals. The trade-off is added latency: the receiver sees updates in steps rather than character-by-character.
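
A minimal sketch of fixed-interval batching is shown below. The class and method names are hypothetical, and the timer that calls `on_tick` at a constant interval is assumed to exist elsewhere.

```python
class FixedIntervalBatcher:
    def __init__(self, send) -> None:
        self._send = send              # callable that transmits one batched string
        self._buffer: list[str] = []

    def on_keystroke(self, ch: str) -> None:
        # Keystrokes are only buffered here; nothing is sent immediately.
        self._buffer.append(ch)

    def on_tick(self) -> None:
        # Called by an external timer at a constant interval (e.g. every 700 ms).
        if self._buffer:
            self._send("".join(self._buffer))
            self._buffer.clear()
```

This sketch sends nothing on an empty tick, so an observer can still tell typing from silence; that residual signal is what the dummy traffic countermeasure addresses.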

Timing metadata in the encrypted payload. If updates are batched, the original keystroke timestamps can be included inside the encrypted payload. The receiver then plays back the text with natural rhythm while the eavesdropper sees only uniform packet spacing. This preserves the real-time feel of RTT without leaking timing information on the wire.
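
As an illustration, the sketch below encodes each keystroke in a batch with the delay since the previous one, and turns a received batch into a playback schedule. The JSON layout and field names (`wait`, `ch`) are assumptions; the payload returned here is the plaintext that would then be encrypted.

```python
import json


def encode_batch(keystrokes: list[tuple[float, str]]) -> bytes:
    """Encode (timestamp_ms, char) pairs, in typing order, as a plaintext payload."""
    events = []
    prev = keystrokes[0][0] if keystrokes else 0.0
    for ts, ch in keystrokes:
        events.append({"wait": ts - prev, "ch": ch})  # delay since the previous key
        prev = ts
    return json.dumps(events).encode("utf-8")


def playback_schedule(payload: bytes) -> list[tuple[float, str]]:
    """Return (offset_ms_from_batch_start, char) pairs for the receiver to replay."""
    offset = 0.0
    schedule = []
    for event in json.loads(payload.decode("utf-8")):
        offset += event["wait"]
        schedule.append((offset, event["ch"]))
    return schedule
```

The receiver would display each character after its offset has elapsed. Note that the payload grows with the timing data, so packet size still reflects the number of keystrokes unless padding is applied.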

Padding. Padding each update to a fixed size before encryption obscures how much text the update contains. Without padding, packet size reveals whether the sender typed one character or ten. Padding eliminates this signal at the cost of bandwidth.
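
One possible padding scheme, sketched under assumptions: a 2-byte big-endian length prefix followed by the update and zero fill, applied to the plaintext before encryption. The 64-byte packet size is arbitrary.

```python
import struct

PACKET_SIZE = 64  # assumption: fixed plaintext size per update, in bytes


def pad(update: bytes, size: int = PACKET_SIZE) -> bytes:
    """Prefix a 2-byte length, then fill with zero bytes up to `size`."""
    if len(update) > size - 2:
        raise ValueError("update too large for one packet")
    filler = b"\x00" * (size - 2 - len(update))
    return struct.pack(">H", len(update)) + update + filler


def unpad(packet: bytes) -> bytes:
    """Recover the original update from a padded packet."""
    (length,) = struct.unpack(">H", packet[:2])
    return packet[2:2 + length]
```

With this scheme a one-character update and a ten-character update both occupy 64 bytes before encryption, so their ciphertexts are the same length.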

Dummy traffic. Sending packets at a constant rate even when the sender is idle prevents an eavesdropper from distinguishing typing from silence. This is the most expensive countermeasure but provides the strongest protection.
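
A constant-rate sender can be sketched as a variation on fixed-interval batching that emits a packet on every tick. The names and the `dummy` flag are hypothetical; in practice the flag would sit inside the encrypted payload so that real and dummy packets are indistinguishable on the wire.

```python
class ConstantRateSender:
    def __init__(self, send) -> None:
        self._send = send               # callable that transmits one packet (a dict here)
        self._pending: list[str] = []

    def on_keystroke(self, ch: str) -> None:
        self._pending.append(ch)

    def on_tick(self) -> None:
        # Exactly one packet per tick, whether or not anything was typed.
        text = "".join(self._pending)
        self._pending.clear()
        self._send({"dummy": text == "", "text": text})
```

For the dummy packets to be effective they must also match real packets in size, which is why this countermeasure is usually paired with padding.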

These countermeasures can be combined. For example, batching at fixed intervals with padded payloads and embedded timing metadata addresses the three main side channels (timing, size, and activity) while preserving the real-time experience for the recipient.