Fault Tolerance

OIP v0.5 defines how messages survive network partitions, node failures, and transient errors without violating cross-chain state consistency. This page specifies OIP fault tolerance at the message-delivery layer: the at-least-once semantics, idempotency requirements, TTL mechanics, deduplication windows, per-type retry policies, the dead letter queue (DLQ) recovery protocol, and the primary-finality recovery path that absorbs cross-chain propagation failures.

Atomicity-level failure handling (what happens when PREPARE, COMMIT, or FINALIZE fails inside the cross-chain coordination) is covered on the Cross-chain Protocol page. This page focuses on per-message delivery resilience and on the recovery mechanisms that span the gap between primary and secondary finality. Every message must arrive at least once, must produce the same observable effect under duplicate delivery, and must escalate to operator review when retries are exhausted on a DLQ-enabled message type. Throughout, the protocol’s primary-finality record remains the source of truth across all recovery paths.

Fault Tolerance Stack

OIP fault tolerance is a layered stack. Each layer absorbs a specific class of fault and passes everything else upward. Lower layers handle transient network conditions silently; upper layers escalate to human operators only after lower layers have exhausted their guarantees.

Five layered defenses; lower layers absorb transient faults silently. Faults that a layer cannot absorb pass to the next row.

Layer	Absorbs	Mechanism
Delivery	Network drops, transient peer unreachability, in-flight retransmits	At-least-once delivery, TTL freshness check
Idempotency	Duplicate deliveries, retransmits, replay attempts within window	`messageId` dedup cache, `currentStateHash` check, attestation `transactionId` check
Retry	Transient validation failures, temporary lock contention, downstream timeouts	Per-type retry budgets, exponential backoff with jitter, lock TTL alignment
DLQ	Permanent failures (exhausted retries, structural rejections)	Dead letter queue, four-step recovery protocol (notify, analyze, decide, record)
Finality	Cross-chain propagation failures between primary and secondary finality	Pending sync queue, primary finality as source of truth, automatic re-propagation

CRITICAL regulatory actions bypass DLQ and retry indefinitely.

Figure 1: OIP Fault Tolerance Stack

The remainder of this page specifies each layer in order, with normative requirements expressed using RFC 2119 keywords (MUST, SHOULD, MAY).

At-Least-Once Delivery

OIP message delivery follows at-least-once semantics. Every message reaches its destination at least once, and duplicate deliveries can occur due to network retransmits or node-failure recovery. The protocol does not attempt exactly-once delivery at the network layer; instead, it pushes the burden to the receiver via mandatory idempotency.

This choice reflects a deliberate tradeoff. Exactly-once semantics in a distributed system requires either two-phase commit at the network layer (which OIP rejects in favor of the BVC pattern at the application layer) or coordinated state across all senders and receivers (which prevents independent failure recovery). At-least-once plus receiver-side idempotency gives the same end-to-end guarantee with strictly weaker liveness assumptions.

Idempotency Requirement (MUST)

The processing of every state-mutating message MUST be idempotent. Processing the same messageId twice MUST produce the same observable effect as processing it once. Conformant implementations enforce idempotency through four complementary mechanisms.

Mechanism	Applied To	Implementation
`messageId` deduplication	All state-mutating messages	Lookup against the deduplication cache before processing. Cache hit is silently ignored.
`currentStateHash` check	`STATE_SYNC`, `REGULATORY_ACTION`	Apply the message only when the message-declared `currentStateHash` matches the actual state hash.
nonce-based ordering	`LOCK_MANAGEMENT` (optional)	Enforce sequential application of LOCK messages targeting the same asset.
Attestation `transactionId` check	Messages originating from CANTON	Reject duplicate attestations with `ATTESTATION_REPLAY_DETECTED` when `cantonContext.transactionId` repeats.

The fourth mechanism deserves emphasis. Single-key messageId deduplication alone does not stop an attacker from wrapping the same CANTON transaction in two different envelopes. The CANTON Driver attestation carries the original transactionId, and conformant implementations MUST check this field independently of the messageId cache.

Defense-in-Depth Sequence

The four mechanisms above form a defense-in-depth sequence. An attacker or transient duplicate must pass every applicable check before causing a state change. Figure 2 shows how each check absorbs a different class of duplicate or replay attempt.

Four checks block four classes of duplicate attempt before state mutation. Checks run in the order listed.

Check	Applied To	Blocks	Rejection Code
`messageId` cache	All state-mutating messages	Same messageId delivered twice (network retransmit, sender retry)	`DEDUPLICATION_DETECTED`
`currentStateHash`	`STATE_SYNC`, `REGULATORY_ACTION`	Stale message that no longer matches current asset state	`STATE_HASH_MISMATCH`
nonce ordering	`LOCK_MANAGEMENT` (optional)	Out-of-order LOCK messages on the same asset	(reordered or rejected)
Attestation `transactionId`	CANTON-origin messages	Same CANTON transaction wrapped in two messageIds	`ATTESTATION_REPLAY_DETECTED`

A message that passes every applicable check is accepted for processing.

Figure 2: Idempotency Defense-in-Depth Sequence
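
The sequence in Figure 2 can be sketched as a single receiver-side gate. This is a minimal illustration rather than the normative algorithm: the `IncomingMessage` and `ReceiverState` shapes and the `checkIdempotency` function are hypothetical names introduced here, the optional nonce check is omitted, and recording identifiers after acceptance is left to the caller.

```typescript
// Hypothetical, simplified message shape; the real layout is defined on the Message Protocol page.
interface IncomingMessage {
  messageId: string;
  messageType: "STATE_SYNC" | "REGULATORY_ACTION" | "LOCK_MANAGEMENT";
  currentStateHash?: string;    // declared by STATE_SYNC and REGULATORY_ACTION
  cantonTransactionId?: string; // stands in for cantonContext.transactionId
}

// Hypothetical receiver-side view of what has already been observed.
interface ReceiverState {
  seenMessageIds: Set<string>;
  seenCantonTransactionIds: Set<string>;
  actualStateHash: string;
}

type IdempotencyVerdict =
  | { accepted: true }
  | {
      accepted: false;
      code: "DEDUPLICATION_DETECTED" | "STATE_HASH_MISMATCH" | "ATTESTATION_REPLAY_DETECTED";
    };

// Runs the checks in the order shown in Figure 2 and stops at the first failure.
function checkIdempotency(msg: IncomingMessage, state: ReceiverState): IdempotencyVerdict {
  if (state.seenMessageIds.has(msg.messageId)) {
    return { accepted: false, code: "DEDUPLICATION_DETECTED" };
  }
  const declaresStateHash =
    msg.messageType === "STATE_SYNC" || msg.messageType === "REGULATORY_ACTION";
  if (declaresStateHash && msg.currentStateHash !== state.actualStateHash) {
    return { accepted: false, code: "STATE_HASH_MISMATCH" };
  }
  if (
    msg.cantonTransactionId !== undefined &&
    state.seenCantonTransactionIds.has(msg.cantonTransactionId)
  ) {
    return { accepted: false, code: "ATTESTATION_REPLAY_DETECTED" };
  }
  return { accepted: true };
}
```

The attestation check runs even when the messageId is new, which is what stops the same CANTON transaction from being replayed inside a second envelope.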

Delivery Guarantee Interface

The combined delivery contract that conformant implementations expose to message senders is captured by the following interface. Senders configure delivery semantics per message via this contract, and the runtime enforces the chosen guarantees.

interface MessageDeliveryGuarantee {
  // Delivery semantics. v0.5 supports AT_LEAST_ONCE only.
  semantics: "AT_LEAST_ONCE";

  // Time-to-live, computed as max(senderTtl, assetMinTtl).
  ttlSeconds: number;

  // Idempotency strategy applied at the receiver.
  idempotencyStrategy: "MESSAGE_ID" | "MESSAGE_ID_PLUS_STATE_HASH"
                     | "MESSAGE_ID_PLUS_NONCE";

  // Whether attestation transactionId must be checked.
  // MUST be true for messages originating from CANTON.
  requireAttestationCheck: boolean;

  // Retry budget. Defaults are derived from the message type
  // (see Retry Policies below).
  maxRetries: number;
  baseBackoffMs: number;
  maxBackoffMs: number;

  // Whether failed messages enter the DLQ after exhausting retries.
  // MUST be false for CRITICAL regulatory actions.
  dlqEnabled: boolean;
}

The interface exists in the specification as a normative contract, not a wire-format type. Wire formats for messages are defined in the Message Protocol page, and this contract describes how delivery parameters are derived and applied at the receiver.

TTL (Time-to-Live)

Every OIP message carries a ttlSeconds field in its routing metadata that bounds its freshness. Receivers MUST reject any message where header.timestamp + routing.ttlSeconds < now with the error code MESSAGE_TTL_EXPIRED. TTL prevents stale messages from re-entering the system after long delivery delays and bounds the deduplication-cache memory footprint.
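
The freshness rule can be written as a small predicate. This sketch assumes `header.timestamp` is an ISO 8601 string, which this page does not state explicitly; the function name `isTtlExpired` and the explicit `nowMs` parameter are illustrative choices that keep the check testable.

```typescript
// Returns true when header.timestamp + routing.ttlSeconds < now, i.e. the
// receiver should reject the message with MESSAGE_TTL_EXPIRED.
// Assumption: the timestamp is an ISO 8601 string; nowMs is epoch milliseconds.
function isTtlExpired(timestampIso: string, ttlSeconds: number, nowMs: number): boolean {
  const sentMs = Date.parse(timestampIso);
  if (Number.isNaN(sentMs)) {
    throw new Error(`unparseable header.timestamp: ${timestampIso}`);
  }
  return sentMs + ttlSeconds * 1000 < nowMs;
}
```

The comparison is strict, so a message that arrives exactly at the TTL boundary is still accepted.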

Two Layers of Timeout

OIP defines two distinct timeout layers that conformant implementations MUST manage independently. Per-message TTL bounds individual message freshness. Phase timeouts bound the wall-clock duration of each step in cross-chain coordination. The two layers are configured separately because their failure modes differ: a TTL-expired message is rejected with no retry, while a phase-timed-out coordination triggers atomicity-level failure handling.

Two Timeout Layers

Per-message freshness vs cross-chain phase coordination

Aspect

Per-Message TTLThis page

Phase TimeoutsCross-chain Protocol

Field

routing.ttlSeconds

routing.phaseTimeouts

Scope

Single message freshness from sender to receiver

Wall-clock for PREPARE, COMMIT, FINALIZE

Range

15 s (HEARTBEAT) to 600 s (REGULATORY_ACTION)

prepare 5000 ms, commit 10000 ms, finalize 5000 ms

Failure

MESSAGE_TTL_EXPIRED, REJECT (no retry)

DELIVERY_TIMEOUT per phase, atomicity-level handling

Figure 3: Per-Message TTL vs Cross-Chain Phase Timeouts

Phase timeouts apply only to GUARANTEED and ALL_OR_NOTHING atomicity. BEST_EFFORT coordination has no PREPARE/FINALIZE phases and uses only per-message TTL. The remainder of this section specifies per-message TTL; phase timeout values and their failure handling are detailed on the Cross-chain Protocol page.

TTL Determination

TTL is determined by a two-input rule: a sender-recommended TTL and an asset-class minimum TTL enforced by OSS. The final value is the maximum of the two.

function resolveTtl(senderTtl: number, assetMinTtl: number | null): number {
  // assetMinTtl is null when no asset-class policy applies.
  return assetMinTtl !== null
    ? Math.max(senderTtl, assetMinTtl)
    : senderTtl;
}

Senders pick a value from the recommended TTL table per message type. OSS enforces a minimum based on the asset class to prevent stale-state hazards on high-frequency assets and to give regulatory actions enough delivery margin.

Recommended TTL by Message Type

Message Type	Recommended TTL	Rationale
`STATE_SYNC`	300 s	State synchronization tolerates moderate delivery delay; consistency matters more than freshness.
`REGULATORY_ACTION`	600 s	Regulatory actions require ample delivery margin to survive cross-domain coordination.
`LOCK_MANAGEMENT`	120 s	Locks need fresh delivery; a stale lock acquisition would defeat preemptive lock semantics.
`QUERY`	30 s	Queries are read-only and time-sensitive; senders re-issue if expired.
`ACK`	60 s	Acknowledgments must reach the sender before its own retry timer fires.
`HEARTBEAT`	15 s	Heartbeats are periodic; the next heartbeat replaces a missed one.
`GOVERNANCE` PROPOSAL	600 s	Proposals must reach all eligible voters before the voting period opens.
`GOVERNANCE` VOTE	300 s	Votes are aggregated over the voting window; per-message TTL is shorter than the voting window.
`GOVERNANCE` EXECUTE	600 s	Execution messages must apply within a bounded window after a successful vote.

Asset-class minimum TTLs run on a separate axis. High-frequency-trading assets carry a 5-second minimum to bound stale-state risk; assets under active regulatory action carry a 600-second minimum so that FREEZE, SEIZE, and similar messages cannot expire mid-coordination. Asset classes without an explicit policy fall back to the sender’s recommended TTL.
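
A few worked cases show how the two inputs combine. The sketch repeats `resolveTtl` so it is self-contained; the constant names and the asset-class keys are hypothetical labels for the values quoted in the text.

```typescript
// resolveTtl as defined above, repeated so the example is self-contained.
function resolveTtl(senderTtl: number, assetMinTtl: number | null): number {
  return assetMinTtl !== null ? Math.max(senderTtl, assetMinTtl) : senderTtl;
}

// Recommended TTLs in seconds (subset of the table above).
const RECOMMENDED_TTL_SECONDS = { STATE_SYNC: 300, LOCK_MANAGEMENT: 120, QUERY: 30 } as const;

// Asset-class minimums quoted in the text; the key names are hypothetical.
const ASSET_MIN_TTL_SECONDS = { HIGH_FREQUENCY_TRADING: 5, UNDER_REGULATORY_ACTION: 600 } as const;

// STATE_SYNC on an asset under active regulatory action: the 600 s minimum wins.
const stateSyncUnderRegulatoryAction = resolveTtl(
  RECOMMENDED_TTL_SECONDS.STATE_SYNC,
  ASSET_MIN_TTL_SECONDS.UNDER_REGULATORY_ACTION,
); // 600

// QUERY on a high-frequency-trading asset: the sender's 30 s already exceeds the 5 s minimum.
const queryOnHftAsset = resolveTtl(
  RECOMMENDED_TTL_SECONDS.QUERY,
  ASSET_MIN_TTL_SECONDS.HIGH_FREQUENCY_TRADING,
); // 30

// LOCK_MANAGEMENT on an asset class with no policy: the sender's value is used as-is.
const lockWithoutPolicy = resolveTtl(RECOMMENDED_TTL_SECONDS.LOCK_MANAGEMENT, null); // 120
```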

Message TTL versus Procedure Duration

One distinction matters for governance flows. The TTL of a GOVERNANCE message is not the duration of the procedure it triggers. votingPeriodSeconds on a proposal is the time during which votes are accepted, often 24 hours or more. The 600-second TTL on the proposal message itself is the freshness window for that single delivery. The two values are independent: a long voting period can coexist with a short per-message TTL because each VOTE is its own independent message with its own independent TTL.

Deduplication

Deduplication is the receiver-side mechanism that absorbs duplicate deliveries. The messageId cache holds recently observed identifiers, and a cache hit causes the duplicate to be silently ignored with the diagnostic code DEDUPLICATION_DETECTED.

Deduplication Configuration

interface DeduplicationConfig {
  // Cache window in seconds. MUST be >= max(per-type TTL) + 2 * maxClockSkew.
  windowSize: number;

  // The dedup key. v0.5 specifies MESSAGE_ID.
  strategy: "MESSAGE_ID";

  // Maximum number of messageIds retained. Default 100,000.
  capacity: number;

  // Eviction policy when capacity is reached.
  evictionPolicy: "LRU" | "FIFO";
}

Window ≥ Max TTL Invariant (MUST)

The deduplication window MUST be at least the maximum TTL of any message type the receiver accepts, plus a clock-skew margin. Specifically:

windowSize ≥ max(perTypeTtl) + 2 × maxClockSkew

With v0.5 default values, this floor is 610 seconds (max TTL of 600 seconds plus 2 × 5 seconds of clock-skew margin). The recommended operational default is 700 seconds: the 600-second maximum TTL, plus one worst-case STATE_SYNC backoff interval (60-second max backoff), plus an additional 40-second margin. This leaves 90 seconds of headroom above the floor.

The reason is structural. If windowSize < maxTtl, a duplicate message can arrive after its messageId has aged out of the cache but before its TTL expires. The receiver would treat it as new, violating idempotency. The + 2 × maxClockSkew term accounts for senders and receivers disagreeing on absolute time within the allowed skew. Implementations MAY run a separate window per message type rather than one global window, in which case each per-type window must individually satisfy the invariant for its type.
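
The invariant is simple enough to check at configuration load time. The following is a sketch under that assumption; `minDedupWindowSeconds` and `satisfiesWindowInvariant` are hypothetical helper names, and the inputs are the per-type TTLs and clock-skew value quoted on this page.

```typescript
// Smallest window (in seconds) that satisfies
// windowSize >= max(perTypeTtl) + 2 * maxClockSkew.
function minDedupWindowSeconds(perTypeTtlSeconds: number[], maxClockSkewSeconds: number): number {
  if (perTypeTtlSeconds.length === 0) {
    throw new Error("at least one per-type TTL is required");
  }
  return Math.max(...perTypeTtlSeconds) + 2 * maxClockSkewSeconds;
}

// True when a configured window is large enough for the given TTLs and skew.
function satisfiesWindowInvariant(
  windowSizeSeconds: number,
  perTypeTtlSeconds: number[],
  maxClockSkewSeconds: number,
): boolean {
  return windowSizeSeconds >= minDedupWindowSeconds(perTypeTtlSeconds, maxClockSkewSeconds);
}

// Recommended TTLs from the table above, with the 5-second default clock skew.
const recommendedTtls = [300, 600, 120, 30, 60, 15];
const floorSeconds = minDedupWindowSeconds(recommendedTtls, 5); // 610
```

A receiver running per-type windows would call the same check once per message type, passing a single-element TTL list.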

Cache-Poisoning Defense (MUST)

The deduplication cache must be written only after signature verification succeeds. Reading from the cache can run in parallel with other pre-lock checks, but recording a messageId before its signature is verified opens a denial-of-service vector: an attacker observes a valid messageId in flight, forges a message with that messageId and an invalid signature, and submits it first. If the cache records the messageId before signature verification, the legitimate message is later rejected as a duplicate.

The corresponding rule, restated normatively: a messageId MUST NOT be inserted into the deduplication cache until the message has passed signature verification (covered in detail on the Validation page).
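
The ordering rule can be illustrated with a small admission function. This is a sketch, not the Validation page's algorithm: `DedupCache`, `admit`, and the injected `verifySignature` callback are hypothetical, and `SIGNATURE_INVALID` is a placeholder outcome, not an error code defined on this page.

```typescript
// Minimal in-memory cache; a real one would also enforce windowSize and capacity.
class DedupCache {
  private readonly seen = new Set<string>();
  has(messageId: string): boolean {
    return this.seen.has(messageId);
  }
  record(messageId: string): void {
    this.seen.add(messageId);
  }
}

type AdmitOutcome = "ACCEPTED" | "DEDUPLICATION_DETECTED" | "SIGNATURE_INVALID";

function admit(
  msg: { messageId: string },
  cache: DedupCache,
  verifySignature: (m: { messageId: string }) => boolean,
): AdmitOutcome {
  // Reading the cache early is allowed; it never changes cache state.
  if (cache.has(msg.messageId)) return "DEDUPLICATION_DETECTED";
  // A forged message stops here and leaves the cache untouched.
  if (!verifySignature(msg)) return "SIGNATURE_INVALID";
  // Only a verified message may claim its messageId.
  cache.record(msg.messageId);
  return "ACCEPTED";
}
```

Because the write happens last, a forged message that arrives first cannot cause the legitimate message with the same messageId to be dropped as a duplicate.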

Retry Policies

Retry policy is differentiated per message type. State-mutating messages with strong consistency requirements get larger retry budgets; transient or self-healing messages get smaller budgets or none. The policy table below is normative.

Retry Policy Table

Message Type	Max Retries	Base Backoff	Max Backoff	DLQ
`REGULATORY_ACTION` (CRITICAL)	Unlimited	500 ms	5 s	Disabled
`REGULATORY_ACTION` (other)	10	1000 ms	30 s	Enabled
`STATE_SYNC`	5	2000 ms	60 s	Enabled
`LOCK_MANAGEMENT`	7	1000 ms	30 s	Enabled
`QUERY`	3	1000 ms	10 s	Disabled
`ACK`	3	500 ms	5 s	Disabled
`HEARTBEAT`	0	N/A	N/A	Disabled
`GOVERNANCE` PROPOSAL	5	2000 ms	30 s	Enabled
`GOVERNANCE` VOTE	3	1000 ms	10 s	Enabled
`GOVERNANCE` EXECUTE	7	2000 ms	60 s	Enabled
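
For implementations that load the policy as configuration, the table can be transcribed into a typed constant. The sketch below is one possible encoding: the `RetryPolicy` shape, the key names, the use of `Number.POSITIVE_INFINITY` for "Unlimited", and `null` for "N/A" are assumptions made here, while the numeric values are copied from the table.

```typescript
interface RetryPolicy {
  maxRetries: number;           // Number.POSITIVE_INFINITY encodes "Unlimited"
  baseBackoffMs: number | null; // null encodes "N/A"
  maxBackoffMs: number | null;  // null encodes "N/A"
  dlqEnabled: boolean;
}

// Hypothetical keys; one per row of the policy table.
type PolicyKey =
  | "REGULATORY_ACTION_CRITICAL"
  | "REGULATORY_ACTION_OTHER"
  | "STATE_SYNC"
  | "LOCK_MANAGEMENT"
  | "QUERY"
  | "ACK"
  | "HEARTBEAT"
  | "GOVERNANCE_PROPOSAL"
  | "GOVERNANCE_VOTE"
  | "GOVERNANCE_EXECUTE";

const RETRY_POLICIES: Record<PolicyKey, RetryPolicy> = {
  REGULATORY_ACTION_CRITICAL: {
    maxRetries: Number.POSITIVE_INFINITY,
    baseBackoffMs: 500,
    maxBackoffMs: 5000,
    dlqEnabled: false,
  },
  REGULATORY_ACTION_OTHER: { maxRetries: 10, baseBackoffMs: 1000, maxBackoffMs: 30000, dlqEnabled: true },
  STATE_SYNC: { maxRetries: 5, baseBackoffMs: 2000, maxBackoffMs: 60000, dlqEnabled: true },
  LOCK_MANAGEMENT: { maxRetries: 7, baseBackoffMs: 1000, maxBackoffMs: 30000, dlqEnabled: true },
  QUERY: { maxRetries: 3, baseBackoffMs: 1000, maxBackoffMs: 10000, dlqEnabled: false },
  ACK: { maxRetries: 3, baseBackoffMs: 500, maxBackoffMs: 5000, dlqEnabled: false },
  HEARTBEAT: { maxRetries: 0, baseBackoffMs: null, maxBackoffMs: null, dlqEnabled: false },
  GOVERNANCE_PROPOSAL: { maxRetries: 5, baseBackoffMs: 2000, maxBackoffMs: 30000, dlqEnabled: true },
  GOVERNANCE_VOTE: { maxRetries: 3, baseBackoffMs: 1000, maxBackoffMs: 10000, dlqEnabled: true },
  GOVERNANCE_EXECUTE: { maxRetries: 7, baseBackoffMs: 2000, maxBackoffMs: 60000, dlqEnabled: true },
};
```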

Backoff Formula

All retries use exponential backoff with jitter. The delay before retry attempt n, passed to the function as retryCount, is:

function nextBackoffMs(
  baseBackoffMs: number,
  maxBackoffMs: number,
  retryCount: number,
): number {
  const jitterFactor = 1 + (Math.random() * 0.2 - 0.1); // ±10%
  const delay = baseBackoffMs * Math.pow(2, retryCount) * jitterFactor;
  return Math.min(maxBackoffMs, delay);
}

The ±10% jitter spreads simultaneous retries across senders to avoid thundering-herd amplification when many nodes recover at once.
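
To make the schedule concrete, the sketch below repeats the formula with an injectable random source so the nominal delays can be reproduced. The `random` parameter and the name `backoffWithJitterMs` are additions for illustration, and the example assumes `retryCount` is 0 for the first retry, which the formula above does not state.

```typescript
// Same formula as nextBackoffMs above, with the random source injectable.
function backoffWithJitterMs(
  baseBackoffMs: number,
  maxBackoffMs: number,
  retryCount: number,
  random: () => number = Math.random,
): number {
  const jitterFactor = 1 + (random() * 0.2 - 0.1); // ±10%
  const delay = baseBackoffMs * Math.pow(2, retryCount) * jitterFactor;
  return Math.min(maxBackoffMs, delay);
}

// random() === 0.5 gives a jitter factor of exactly 1, i.e. the nominal schedule.
const midpoint = (): number => 0.5;

// STATE_SYNC: 5 retries, base 2000 ms, cap 60000 ms.
const stateSyncDelays = [0, 1, 2, 3, 4].map((n) => backoffWithJitterMs(2000, 60000, n, midpoint));
// [2000, 4000, 8000, 16000, 32000]

// LOCK_MANAGEMENT: 7 retries, base 1000 ms, cap 30000 ms.
const lockDelays = [0, 1, 2, 3, 4, 5, 6].map((n) => backoffWithJitterMs(1000, 30000, n, midpoint));
// [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Because the cap is applied after jitter, delays that reach the cap are all exactly `maxBackoffMs`; jitter only spreads the uncapped attempts.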

Cumulative Retry Window by Type

The combination of max retries and max backoff produces materially different worst-case retry windows. Figure 4 shows the bounded retry duration for each message type, which directly drives the lock-TTL alignment requirement specified later in this section.

Worst-case time from first send to retry exhaustion (max retries × max backoff).

Message Type	Worst-Case Retry Window
`REGULATORY_ACTION` (CRITICAL)	unlimited
`GOVERNANCE` EXECUTE	420 s
`STATE_SYNC`	300 s
`REGULATORY_ACTION` (other)	300 s
`LOCK_MANAGEMENT`	210 s
`GOVERNANCE` PROPOSAL	150 s
`QUERY`	30 s
`GOVERNANCE` VOTE	30 s
`ACK`	15 s
`HEARTBEAT`	0 s

Figure 4: Cumulative Retry Window by Message Type
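
The figures above are upper bounds: they assume every retry waits the full max backoff. As a rough comparison, the sketch below also sums the jitter-free delays from the backoff formula. It assumes `retryCount` runs from 0 to `maxRetries - 1`, and the helper names are hypothetical.

```typescript
// Jitter-free form of the backoff formula.
function cappedBackoffMs(baseBackoffMs: number, maxBackoffMs: number, retryCount: number): number {
  return Math.min(maxBackoffMs, baseBackoffMs * Math.pow(2, retryCount));
}

// Upper bound used in Figure 4: every retry waits the full max backoff.
function retryWindowUpperBoundMs(maxRetries: number, maxBackoffMs: number): number {
  return maxRetries * maxBackoffMs;
}

// Sum of the jitter-free delays, assuming retryCount runs 0 .. maxRetries - 1.
function nominalRetryWindowMs(maxRetries: number, baseBackoffMs: number, maxBackoffMs: number): number {
  let total = 0;
  for (let n = 0; n < maxRetries; n++) {
    total += cappedBackoffMs(baseBackoffMs, maxBackoffMs, n);
  }
  return total;
}

// STATE_SYNC: 5 retries, base 2000 ms, cap 60000 ms.
const stateSyncBoundMs = retryWindowUpperBoundMs(5, 60000);      // 300000
const stateSyncNominalMs = nominalRetryWindowMs(5, 2000, 60000); // 62000

// GOVERNANCE EXECUTE: 7 retries, base 2000 ms, cap 60000 ms.
const executeBoundMs = retryWindowUpperBoundMs(7, 60000);        // 420000
const executeNominalMs = nominalRetryWindowMs(7, 2000, 60000);   // 182000
```

Sizing locks against the upper bound rather than the nominal sum is the conservative choice, since it stays valid however the backoff delays are distributed.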

Per-Type Rationale

REGULATORY_ACTION (CRITICAL) retries indefinitely. A delivery failure on a CRITICAL action is a system-wide safety problem; abandoning it would leave a regulator’s order unenforced. DLQ is disabled because the message must keep trying until accepted or until governance explicitly cancels it.

REGULATORY_ACTION (other) retries 10 times, then escalates to DLQ. Non-critical regulatory actions still warrant human review on permanent failure but do not justify infinite retries.

STATE_SYNC retries 5 times. State synchronization failures usually indicate a deeper consistency problem (cross-chain disagreement, locked-out asset, or stale state hash), so blind retries are less helpful than operator review.

LOCK_MANAGEMENT retries 7 times because lock contention is usually transient and resolves within a few backoff cycles.

QUERY and ACK retry 3 times without DLQ. Queries are read-only and the sender bears the responsibility to re-issue. ACK loss is detected by the sender’s own timeout and re-attempted from the source side rather than carried in a DLQ.

HEARTBEAT never retries. The next heartbeat in the periodic schedule replaces a missed one; queuing retries would only consume bandwidth without improving liveness signal.

GOVERNANCE retries vary by sub-type. PROPOSAL gets 5 attempts because reaching all voters quickly matters. VOTE gets 3 because the voting window aggregates many votes and a single lost vote is rarely decisive. EXECUTE gets 7 because an approved decision must apply, and permanent failure here triggers governance re-vote.

Retry Identity

A retry MUST reuse the original messageId. Changing the messageId on retry breaks idempotency: the receiver would treat the second attempt as a fresh message and could apply the underlying state change twice. The only exception is manual re-issue from the DLQ, which is treated as a new message with a new messageId and re-runs all authority checks.

Lock TTL Alignment (MUST)

If a message acquires a preemptive lock, the lock must remain held throughout the retry window. Otherwise, a retry could complete after the lock expires, leaving the asset modifiable by a competing transaction during the gap. Conformant implementations MUST set the lock TTL such that:

lockTtl ≥ messageTtl + (maxRetries × maxBackoff)

The longest bounded retry windows in the v0.5 policy table (visible in Figure 4) are GOVERNANCE EXECUTE at 7 × 60 s = 420 s, followed by STATE_SYNC at 5 × 60 s = 300 s and REGULATORY_ACTION (other) at 10 × 30 s = 300 s. Locks held for these message types must be sized accordingly. The lock-protocol details are on the State Management page.
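
A worked calculation of the alignment rule follows. It is a sketch that takes the recommended TTLs as `messageTtl` (that is, it assumes no asset-class minimum raises them); `minLockTtlSeconds` is a hypothetical helper. This page does not say how locks are sized for unlimited-retry messages, so the sketch simply refuses that input.

```typescript
// lockTtl >= messageTtl + (maxRetries × maxBackoff), all in seconds.
function minLockTtlSeconds(
  messageTtlSeconds: number,
  maxRetries: number,
  maxBackoffSeconds: number,
): number {
  if (!Number.isFinite(maxRetries)) {
    // Not covered by the formula on this page; treated as an error in this sketch.
    throw new Error("no finite lock TTL exists for an unlimited retry budget");
  }
  return messageTtlSeconds + maxRetries * maxBackoffSeconds;
}

const stateSyncLockTtl = minLockTtlSeconds(300, 5, 60);       // 300 + 300 = 600
const governanceExecuteLockTtl = minLockTtlSeconds(600, 7, 60); // 600 + 420 = 1020
const lockManagementLockTtl = minLockTtlSeconds(120, 7, 30);  // 120 + 210 = 330
```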

Lock Timeout Safety

OIP guarantees that every preemptive lock is released within a bounded time, regardless of the holding node’s liveness. This guarantee is enforced by lock TTL: when a lock TTL expires, the lock is automatically released even if the holder has crashed, become network-partitioned, or otherwise stopped processing. The receiving chain treats an expired lock as released and accepts new lock acquisitions on the same asset.

The consequence for fault tolerance is direct: a node failure cannot leave assets indefinitely unusable. Once lockTtl elapses, the lock is reclaimed and the asset is available for new transactions, with the failed message either retried by another sender or escalated to the DLQ. This finite-lock-lifetime guarantee is one of the safety properties verified in formal verification work on preemptive lock correctness.

Dead Letter Queue (DLQ)

When retries are exhausted on a message type with DLQ enabled, the message moves to the DLQ for operator review. The DLQ is the protocol’s escalation path for permanent failures.

DLQ Entry Conditions

A message enters the DLQ when all of the following hold: the retry count reaches the message type’s maxRetries value (the failure terminates with MAX_RETRIES_EXCEEDED), the message type has DLQ enabled in the retry policy table, and the message is not a CRITICAL REGULATORY_ACTION (these never enter the DLQ; they retry indefinitely).
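
The three conditions combine into a single predicate. The sketch below is illustrative; `shouldEnterDlq` and its parameter shapes are hypothetical names.

```typescript
interface DlqPolicyView {
  maxRetries: number;  // Number.POSITIVE_INFINITY for unlimited
  dlqEnabled: boolean;
}

// True only when retries are exhausted, the type has DLQ enabled,
// and the message is not a CRITICAL REGULATORY_ACTION.
function shouldEnterDlq(
  retryCount: number,
  policy: DlqPolicyView,
  isCriticalRegulatoryAction: boolean,
): boolean {
  const retriesExhausted = retryCount >= policy.maxRetries;
  return retriesExhausted && policy.dlqEnabled && !isCriticalRegulatoryAction;
}
```

For example, a STATE_SYNC message (5 retries, DLQ enabled) enters the DLQ after its fifth failed retry, while a QUERY (DLQ disabled) is dropped with MAX_RETRIES_EXCEEDED and never queued.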

DLQ Entry Structure

interface DLQEntry {
  // The full original message, preserved verbatim.
  originalMessage: OIPMessage;

  // History of every failed delivery attempt.
  failureHistory: FailureRecord[];

  // ISO 8601 timestamp when the message entered the DLQ.
  enteredDLQAt: string;

  // ISO 8601 timestamp at which retention ends. Default 30 days.
  retentionUntil: string;

  // Lifecycle state.
  status: "PENDING_REVIEW" | "RETRY_QUEUED" | "ABANDONED" | "RECOVERED";
}

interface FailureRecord {
  attemptNumber: number;
  failedAt: string;          // ISO 8601
  errorCode: string;         // see Error Reference page
  errorMessage: string;
  routingSlipSnapshot?: RoutingSlipEntry[];
}

Default retention is 30 days. After expiry, an implementation MAY archive the entry to long-term storage or delete it; the choice is operational policy and outside the OIP normative scope.
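
Deriving `retentionUntil` from `enteredDLQAt` is a plain date calculation. The helper names below are hypothetical, and the sketch treats a day as exactly 24 hours in UTC.

```typescript
const DEFAULT_RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the ISO 8601 timestamp at which retention ends.
function computeRetentionUntil(
  enteredDLQAt: string,
  retentionDays: number = DEFAULT_RETENTION_DAYS,
): string {
  const enteredMs = Date.parse(enteredDLQAt);
  if (Number.isNaN(enteredMs)) {
    throw new Error(`invalid ISO 8601 timestamp: ${enteredDLQAt}`);
  }
  return new Date(enteredMs + retentionDays * MS_PER_DAY).toISOString();
}

// True once the retention period has passed; archival or deletion is then permitted.
function isRetentionOver(retentionUntil: string, nowMs: number): boolean {
  return nowMs > Date.parse(retentionUntil);
}

const retentionExample = computeRetentionUntil("2025-01-01T00:00:00.000Z");
// "2025-01-31T00:00:00.000Z"
```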

DLQ Alerting

interface DLQAlertConfig {
  enabled: boolean;
  alertThreshold: number;       // DLQ entries per hour
  criticalThreshold: number;    // immediate-alert ceiling
  alertChannels: AlertChannel[];
}

type AlertChannel = "EMAIL" | "SLACK" | "PAGERDUTY" | "WEBHOOK";

The alerting subsystem itself is an operational concern outside OIP, but the alertThreshold and criticalThreshold values are governance parameters. Adjusting them requires a GOVERNANCE message rather than unilateral operator action, because a tampered threshold could mask systemic delivery failures.
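
One way to evaluate the two thresholds is sketched below. It assumes both thresholds are expressed in DLQ entries per hour and that reaching a threshold (not only exceeding it) triggers the alert; neither detail is fixed by this page. `classifyDlqRate` and `DlqAlertLevel` are hypothetical names.

```typescript
type AlertChannel = "EMAIL" | "SLACK" | "PAGERDUTY" | "WEBHOOK";

interface DLQAlertConfig {
  enabled: boolean;
  alertThreshold: number;    // DLQ entries per hour
  criticalThreshold: number; // immediate-alert ceiling
  alertChannels: AlertChannel[];
}

type DlqAlertLevel = "NONE" | "ALERT" | "CRITICAL";

// Classifies the observed DLQ entry rate against the governance-set thresholds.
function classifyDlqRate(entriesInLastHour: number, config: DLQAlertConfig): DlqAlertLevel {
  if (!config.enabled) return "NONE";
  if (entriesInLastHour >= config.criticalThreshold) return "CRITICAL";
  if (entriesInLastHour >= config.alertThreshold) return "ALERT";
  return "NONE";
}

// Illustrative values only; real thresholds are set through GOVERNANCE messages.
const exampleAlertConfig: DLQAlertConfig = {
  enabled: true,
  alertThreshold: 10,
  criticalThreshold: 50,
  alertChannels: ["EMAIL", "PAGERDUTY"],
};
```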

DLQ Recovery Protocol

DLQ entries follow a four-step recovery protocol: notify, analyze, decide, record. Automating the decision step is explicitly out of scope for v0.5; the protocol assumes a human (operator or governance) reviews each entry.

Four sequential steps from notification to permanent record.

Step	Name	Action
1	Notify	Alert operators with messageId, failure history, last error code, and routing slip snapshot.
2	Analyze	Classify failure (transient or permanent), scope the impact, assess retry viability.
3	Decide	Choose retry, abandon, or governance escalation based on the analysis.
4	Record	Persist the decision with rationale and decision-maker signature for audit.

Step 3 branches into one of:

Outcome	Effect
Retry	Re-issue with new messageId, optional field edits. Original entry: RETRY_QUEUED.
Abandon	ABANDONED. Sender notified, asset stays at pre-message state, locks expire by TTL.
Escalate	Trigger governance rollback under D-quencer authority for system-wide impact cases.

Figure 5: DLQ Recovery Protocol

Step 1: Notify

Entering the DLQ triggers an alert to the configured channels. The notification carries the original message’s messageId, messageType, and sourceChainId, the full failureHistory, the last error code with its message, and a routingSlip snapshot for debugging. Operators or automated triage systems consume the notification and proceed to analysis.

Step 2: Analyze

Analysis answers four questions: Is the failure permanent (structural rejection, missing authority) or transient (network, downstream timeout)? Does it indicate broader systemic impact across many messages or a single isolated case? Is a retry likely to succeed if the underlying condition is fixed? Is governance intervention warranted because the failure crosses authority boundaries?

Step 3: Decide

One of three outcomes is selected. Retry re-issues the message with a fresh messageId, optionally with edits to fields like priority, atomicity, or scope. Because the messageId changes, all authority and signature checks are re-run from the start. Abandon closes the entry permanently with status ABANDONED: the sender is notified, the targeted asset remains at its pre-message state (the original message never applied), and any locks held by the original message release at TTL expiry. Governance escalation triggers a rollback under D-quencer authority when the failure has system-wide impact, after which governance determines the asset’s final state.

Step 4: Record

Every DLQ decision is persisted in an audit-grade record signed by the decision-maker.

interface DLQDecisionRecord {
  dlqEntryId: string;
  decisionAt: string;            // ISO 8601
  decisionMaker: string;         // OCID or "GOVERNANCE"
  decisionType: "RETRY" | "ABANDON" | "GOVERNANCE_ESCALATION";
  newMessageId?: string;         // present when decisionType is RETRY
  rationale: string;
  signature: string;             // decision-maker signature over the record
}

Records are retained indefinitely and form the auditable trail for every permanent failure. Auditors and regulators rely on this trail to verify that operator interventions were legitimate and bounded.
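
Before a record is persisted, an implementation might run a structural sanity check like the sketch below. Only the first two rules come from the text (a RETRY carries a `newMessageId`, and that identifier is fresh); the remaining rules are assumptions added here, and the function does not verify the signature cryptographically. `findDecisionRecordProblems` is a hypothetical name.

```typescript
interface DLQDecisionRecord {
  dlqEntryId: string;
  decisionAt: string;
  decisionMaker: string;
  decisionType: "RETRY" | "ABANDON" | "GOVERNANCE_ESCALATION";
  newMessageId?: string;
  rationale: string;
  signature: string;
}

// Returns a list of structural problems; an empty list means the record looks well-formed.
function findDecisionRecordProblems(record: DLQDecisionRecord, originalMessageId: string): string[] {
  const problems: string[] = [];
  if (record.decisionType === "RETRY") {
    if (record.newMessageId === undefined) {
      problems.push("RETRY requires newMessageId");
    } else if (record.newMessageId === originalMessageId) {
      problems.push("re-issued message must carry a fresh messageId");
    }
  } else if (record.newMessageId !== undefined) {
    // Assumption: newMessageId is meaningless outside RETRY.
    problems.push("newMessageId is only expected for RETRY");
  }
  // Assumptions: an audit record needs a non-empty rationale and signature.
  if (record.rationale.trim() === "") problems.push("rationale is empty");
  if (record.signature === "") problems.push("signature is missing");
  return problems;
}
```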

DLQ Self-Defense

The DLQ is itself a security-sensitive surface. The protocol imposes four guardrails: DLQ access is restricted to operators with governance-granted permissions; re-issued messages from the DLQ pass through fresh authority verification (the re-issuer must hold the same authority as the original sender, or governance must approve the substitution); DLQ entries themselves are integrity-protected with a hash and a signature, so tampering with stored entries is detectable; detected integrity violations trigger automatic governance notification.

Primary Finality Recovery Path

The mechanisms above resolve faults that occur before a message reaches primary finality. A separate fault class arises after primary finality is issued but before secondary finality is reached on Base L2: cross-chain propagation can fail in flight, leaving some chains updated while others lag. OIP defines a recovery path for this gap.

Why a Separate Recovery Path

Once primary finality is issued, the OSS state root is committed and the message is treated as applied for trading purposes. The DLQ recovery protocol does not apply because the message did not fail in the conventional sense; it succeeded at the L3 OSS layer but its external propagation to participating chains is incomplete. Treating this as a DLQ case would misclassify a successful primary-finality message as failed and would risk reverting a state change that already cleared its safety checks.

Instead, OIP requires conformant implementations to maintain a pending sync queue that tracks messages whose primary finality is issued but whose secondary finality is incomplete, and to re-attempt propagation until either secondary finality is reached or governance intervenes.

Pending Sync Queue Invariant (MUST)

OIP v0.5 does not specify the queue’s data structure or algorithm; OSS implementations have full latitude. The specification only requires four invariants that any conformant pending sync queue MUST satisfy.

Invariant	Requirement
Deterministic tracking	After primary finality issuance, which message has been applied to which chain `MUST` be deterministically traceable.
Idempotency on re-propagation	Re-propagation `MUST` preserve `messageId`-based idempotency so that re-attempts do not double-apply the state change.
Long-term primary-finality preservation	The `primaryFinalityRecord` `MUST` be retained in verifiable form long after lock expiry or reconciliation. Compaction, archival, or Merkle-root summarization is permitted as long as verifiability is preserved.
Source of truth on resolution	When DIVERGED or ROLLBACK_INCOMPLETE states are resolved, the `primaryFinalityRecord` `MUST` be used as the authoritative reference.

Implementation freedom (out of scope for OIP) covers exact data structures (RocksDB persistent queue, in-memory plus backup, distributed queue), queueing algorithms (FIFO, priority queue, deadline-based), re-propagation scheduling, and queue sharding or compression.
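
As one illustration of the first two invariants, the sketch below tracks which chains have applied each message and makes re-application a no-op. It is an in-memory toy, not a conformant queue: it has no persistence, no `primaryFinalityRecord` storage, and no scheduling, and `PendingSyncTracker` with its method names is hypothetical.

```typescript
class PendingSyncTracker {
  // messageId -> chains the message must reach
  private readonly targets = new Map<string, Set<string>>();
  // messageId -> chains confirmed to have applied it
  private readonly applied = new Map<string, Set<string>>();

  // Called once primary finality is issued; repeated calls for the same messageId are ignored.
  track(messageId: string, targetChains: string[]): void {
    if (this.targets.has(messageId)) return;
    this.targets.set(messageId, new Set<string>(targetChains));
    this.applied.set(messageId, new Set<string>());
  }

  // Idempotent: marking the same chain twice has no further effect.
  markApplied(messageId: string, chainId: string): void {
    const targetSet = this.targets.get(messageId);
    const appliedSet = this.applied.get(messageId);
    if (targetSet === undefined || appliedSet === undefined) {
      throw new Error(`untracked messageId: ${messageId}`);
    }
    if (!targetSet.has(chainId)) {
      throw new Error(`chain ${chainId} is not a target of ${messageId}`);
    }
    appliedSet.add(chainId);
  }

  // Deterministic (sorted) list of chains that still need re-propagation.
  chainsAwaiting(messageId: string): string[] {
    const targetSet = this.targets.get(messageId);
    const appliedSet = this.applied.get(messageId);
    if (targetSet === undefined || appliedSet === undefined) return [];
    return Array.from(targetSet).filter((c) => !appliedSet.has(c)).sort();
  }

  isFullyPropagated(messageId: string): boolean {
    return this.targets.has(messageId) && this.chainsAwaiting(messageId).length === 0;
  }
}
```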

Re-Propagation Procedure

When the gap between primary and secondary finality exceeds a configurable threshold (default 300 s, signaled by HEARTBEAT.syncProgress.lagSeconds), the receiver triggers re-propagation. The procedure has four steps.

First, gap detection: HEARTBEAT messages report the lag, and a value above the threshold opens recovery. Second, re-propagation attempt: messages still in the pending sync queue are re-sent to chains where they have not been applied; messageId-based idempotency prevents double-application on chains that already received them. Third, re-propagation under existing lock: if the original lock has not expired, re-propagation runs under the same lock so no competing transaction can interleave. After lock expiry, recovery acquires a new lock before re-attempting. Fourth, reconciliation trigger: if re-propagation fails repeatedly, the chain is marked DIVERGED and forced reconciliation against the primaryFinalityRecord begins (covered in State Management).
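
The four steps can be condensed into a decision function. This sketch is an interpretation, not the normative procedure: the text says only that re-propagation "fails repeatedly" before a chain is marked DIVERGED, so the `maxFailedAttempts` limit of 3 is a hypothetical value, as are the type and function names.

```typescript
type RecoveryAction =
  | "NONE"
  | "REPROPAGATE_UNDER_EXISTING_LOCK"
  | "ACQUIRE_LOCK_THEN_REPROPAGATE"
  | "MARK_DIVERGED";

interface GapObservation {
  lagSeconds: number;     // as reported by HEARTBEAT.syncProgress.lagSeconds
  lockExpired: boolean;   // whether the original lock has expired
  failedAttempts: number; // re-propagation attempts that have already failed
}

function nextRecoveryAction(
  obs: GapObservation,
  lagThresholdSeconds: number = 300, // default threshold from the text
  maxFailedAttempts: number = 3,     // hypothetical; not specified on this page
): RecoveryAction {
  // Step 1: recovery opens only when the lag exceeds the threshold.
  if (obs.lagSeconds <= lagThresholdSeconds) return "NONE";
  // Step 4: repeated failure hands over to forced reconciliation.
  if (obs.failedAttempts >= maxFailedAttempts) return "MARK_DIVERGED";
  // Steps 2 and 3: re-send, under the original lock if it is still held.
  return obs.lockExpired ? "ACQUIRE_LOCK_THEN_REPROPAGATE" : "REPROPAGATE_UNDER_EXISTING_LOCK";
}
```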

Source-of-Truth Principle

If primary and secondary finality disagree, or if any participating chain shows a state that differs from the primaryFinalityRecord, the primary-finality record is authoritative. Secondary-finality observations that conflict with the primary record are treated as artifacts of an incomplete or corrupt propagation, and recovery brings the conflicting chain back into alignment with the primary record. This principle keeps OIP’s safety guarantees consistent across all recovery paths: regardless of which fault layer surfaces an inconsistency, the resolution always converges on the same authoritative state.

Atomicity-Level Failure Handling

The fault-tolerance mechanisms above operate per message and at the primary-secondary finality gap. When a message participates in a cross-chain coordinated transaction, additional failure handling applies based on the chosen AtomicityStrategy. The full matrix (what happens when PREPARE, COMMIT, or FINALIZE fails under BEST_EFFORT, GUARANTEED, or ALL_OR_NOTHING) is specified on the Cross-chain Protocol page.

Two boundary points connect the two layers. First, retry policy applies to the message that carries each phase: a failed PREPARE, COMMIT, or FINALIZE can be retried per the message’s retry budget before the atomicity-level failure handler kicks in. Second, lock TTL MUST exceed the retry window so that locks acquired in PREPARE remain valid throughout retries, as specified in the State Management page.

Error Codes Emitted by This Layer

The fault-tolerance layer emits or accepts the following error codes. Full disposition rules (REJECT, RETRY, ROLLBACK, ESCALATE, DIVERGE) are catalogued on the Error Reference page.

Error Code	Emitted By	Disposition
`MESSAGE_TTL_EXPIRED`	TTL freshness check at receiver	REJECT
`DEDUPLICATION_DETECTED`	Deduplication cache hit	silently ignored, idempotent
`ATTESTATION_REPLAY_DETECTED`	CANTON attestation transactionId duplicate	REJECT
`STATE_HASH_MISMATCH`	currentStateHash check	REJECT (or silently ignored if dedup hit)
`DELIVERY_TIMEOUT`	Per-message or phase timeout	RETRY
`MAX_RETRIES_EXCEEDED`	Retry exhaustion	DLQ (when enabled)
`ROLLBACK_INCOMPLETE_REJECT`	New message arrives during incomplete rollback	REJECT

Conformance Checklist

The following items summarize the MUST and SHOULD obligations imposed by this page. A v0.5-compliant implementation can self-check against this list; the comprehensive cross-page conformance criteria are on the Conformance page.

#	Requirement	Level
1	Implement at-least-once delivery with mandatory receiver-side idempotency for every state-mutating message.	MUST
2	Apply `messageId` deduplication, `currentStateHash` check, and (for CANTON-origin) attestation `transactionId` check.	MUST
3	Reject messages with `header.timestamp + routing.ttlSeconds < now` using `MESSAGE_TTL_EXPIRED`.	MUST
4	Set deduplication `windowSize ≥ max(perTypeTtl) + 2 × maxClockSkew` globally, or per-type if separate windows are run.	MUST
5	Insert a `messageId` into the deduplication cache only after signature verification succeeds.	MUST
6	Apply per-type retry policies from the policy table; reuse the original `messageId` on every retry.	MUST
7	Use exponential backoff with ±10% jitter; never exceed the per-type max backoff.	MUST
8	Set `lockTtl ≥ messageTtl + (maxRetries × maxBackoff)` for any message that acquires a preemptive lock.	MUST
9	Retry CRITICAL `REGULATORY_ACTION` indefinitely; never enter the DLQ.	MUST
10	Move messages with DLQ-enabled types to the DLQ on `MAX_RETRIES_EXCEEDED` and run the four-step recovery protocol.	MUST
11	Maintain a pending sync queue satisfying the four invariants (deterministic tracking, idempotency, long-term preservation, source-of-truth on resolution).	MUST
12	Sign every `DLQDecisionRecord` by its decision-maker and retain records indefinitely.	MUST
13	Trigger re-propagation when `HEARTBEAT.syncProgress.lagSeconds` exceeds the configured threshold.	SHOULD
14	Send DLQ entry alerts to configured channels at `alertThreshold` and immediately at `criticalThreshold`.	SHOULD
15	Run a separate deduplication window per message type for tighter memory bounds.	MAY

References

Internal pages: Message Protocol for header and routing-metadata field definitions; State Management for preemptive lock semantics, lock TTL alignment, and DIVERGED reconciliation; Validation for the signature-then-cache ordering rule; Cross-chain Protocol for atomicity-level failure handling and phase timeouts; Error Reference for the complete error-code catalog with disposition codes; Conformance for the comprehensive v0.5 compliance checklist.

External standards: RFC 2119: Key words for use in RFCs to Indicate Requirement Levels; RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words; ISO 8601: Date and time representations.
