Fault Tolerance
OIP v0.5 defines how messages survive network partitions, node failures, and transient errors without violating cross-chain state consistency. This page specifies OIP fault tolerance at the message-delivery layer: the at-least-once semantics, idempotency requirements, TTL mechanics, deduplication windows, per-type retry policies, the dead letter queue (DLQ) recovery protocol, and the primary-finality recovery path that absorbs cross-chain propagation failures.
Atomicity-level failure handling (what happens when PREPARE, COMMIT, or FINALIZE fails inside the cross-chain coordination) is covered in the Cross-chain Protocol page. This page focuses on per-message delivery resilience and the recovery mechanisms that span the gap between primary and secondary finality: every message must arrive at least once, must produce the same observable effect under duplicate delivery, and must escalate to operator review when retries are exhausted, while the protocol’s primary-finality record remains the source of truth across all recovery paths.
Fault Tolerance Stack
OIP fault tolerance is a layered stack. Each layer absorbs a specific class of fault and passes everything else upward. Lower layers handle transient network conditions silently; upper layers escalate to human operators only after lower layers have exhausted their guarantees.
Figure 1: OIP Fault Tolerance Stack. Five layered defenses; lower layers absorb transient faults silently. Labels recovered from the figure: messageId dedup cache, currentStateHash check, attestation transactionId check; CRITICAL regulatory actions bypass DLQ and retry indefinitely.
The remainder of this page specifies each layer in order, with normative requirements expressed using RFC 2119 keywords (MUST, SHOULD, MAY).
At-Least-Once Delivery
OIP message delivery follows at-least-once semantics. Every message reaches its destination at least once, and duplicate deliveries can occur due to network retransmits or node-failure recovery. The protocol does not attempt exactly-once delivery at the network layer; instead, it pushes the burden to the receiver via mandatory idempotency.
This choice reflects a deliberate tradeoff. Exactly-once semantics in a distributed system requires either two-phase commit at the network layer (which OIP rejects in favor of the BVC pattern at the application layer) or coordinated state across all senders and receivers (which prevents independent failure recovery). At-least-once plus receiver-side idempotency gives the same end-to-end guarantee with strictly weaker liveness assumptions.
Idempotency Requirement (MUST)
The processing of every state-mutating message MUST be idempotent. Processing the same messageId twice MUST produce the same observable effect as processing it once. Conformant implementations enforce idempotency through four complementary mechanisms.
| Mechanism | Applied To | Implementation |
|---|---|---|
| messageId deduplication | All state-mutating messages | Lookup against the deduplication cache before processing. Cache hit is silently ignored. |
| currentStateHash check | STATE_SYNC, REGULATORY_ACTION | Apply the message only when the message-declared currentStateHash matches the actual state hash. |
| Nonce-based ordering | LOCK_MANAGEMENT (optional) | Enforce sequential application of LOCK messages targeting the same asset. |
| Attestation transactionId check | Messages originating from CANTON | Reject duplicate attestations with ATTESTATION_REPLAY_DETECTED when cantonContext.transactionId repeats. |
The fourth mechanism deserves emphasis. Single-key messageId deduplication alone does not stop an attacker from wrapping the same CANTON transaction in two different envelopes. The CANTON Driver attestation carries the original transactionId, and conformant implementations MUST check this field independently of the messageId cache.
Defense-in-Depth Sequence
The four mechanisms above form a defense-in-depth sequence. An attacker or transient duplicate must pass every applicable check before causing a state change. Figure 2 shows how each check absorbs a different class of duplicate or replay attempt.
Figure 2: Idempotency Defense-in-Depth Sequence. Four checks block four classes of duplicate attempt before state mutation.
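The check sequence can be sketched as a single receiver-side gate. This is a minimal illustration, not the normative algorithm: the types `SketchEnvelope` and `SketchReceiverState`, the function `checkIdempotency`, and the flattened `cantonTransactionId` field (standing in for cantonContext.transactionId) are hypothetical names. The check order is an assumption, signature verification is assumed to have already passed, and the optional nonce-based ordering check is omitted.

```typescript
// Hypothetical shapes for illustration only; wire formats live on the Message Protocol page.
interface SketchEnvelope {
  messageId: string;
  messageType: "STATE_SYNC" | "REGULATORY_ACTION" | "LOCK_MANAGEMENT";
  currentStateHash?: string;    // declared by STATE_SYNC and REGULATORY_ACTION
  cantonTransactionId?: string; // stands in for cantonContext.transactionId
}

interface SketchReceiverState {
  seenMessageIds: Set<string>;
  seenCantonTxIds: Set<string>;
  actualStateHash: string;
}

type IdempotencyVerdict =
  | "APPLY"
  | "DEDUPLICATION_DETECTED"
  | "ATTESTATION_REPLAY_DETECTED"
  | "STATE_HASH_MISMATCH";

function checkIdempotency(
  msg: SketchEnvelope,
  state: SketchReceiverState,
): IdempotencyVerdict {
  // 1. messageId deduplication: a cache hit is silently ignored by the caller.
  if (state.seenMessageIds.has(msg.messageId)) {
    return "DEDUPLICATION_DETECTED";
  }
  // 2. Attestation transactionId, checked independently of the messageId cache,
  //    so the same CANTON transaction in a new envelope is still caught.
  if (
    msg.cantonTransactionId !== undefined &&
    state.seenCantonTxIds.has(msg.cantonTransactionId)
  ) {
    return "ATTESTATION_REPLAY_DETECTED";
  }
  // 3. currentStateHash must match the actual state for the types that declare it.
  const needsHash =
    msg.messageType === "STATE_SYNC" || msg.messageType === "REGULATORY_ACTION";
  if (needsHash && msg.currentStateHash !== state.actualStateHash) {
    return "STATE_HASH_MISMATCH";
  }
  // All applicable checks passed: record the identifiers and let the message apply.
  state.seenMessageIds.add(msg.messageId);
  if (msg.cantonTransactionId !== undefined) {
    state.seenCantonTxIds.add(msg.cantonTransactionId);
  }
  return "APPLY";
}
```

In this sketch, a second envelope that wraps an already-seen CANTON transaction under a new messageId passes the first check and is stopped by the second.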
Delivery Guarantee Interface
The combined delivery contract that conformant implementations expose to message senders is captured by the following interface. Senders configure delivery semantics per message via this contract, and the runtime enforces the chosen guarantees.
interface MessageDeliveryGuarantee {
  // Delivery semantics. v0.5 supports AT_LEAST_ONCE only.
  semantics: "AT_LEAST_ONCE";

  // Time-to-live, computed as max(senderTtl, assetMinTtl).
  ttlSeconds: number;

  // Idempotency strategy applied at the receiver.
  idempotencyStrategy: "MESSAGE_ID" | "MESSAGE_ID_PLUS_STATE_HASH"
    | "MESSAGE_ID_PLUS_NONCE";

  // Whether attestation transactionId must be checked.
  // MUST be true for messages originating from CANTON.
  requireAttestationCheck: boolean;

  // Retry budget. Defaults are derived from the message type
  // (see Retry Policies below).
  maxRetries: number;
  baseBackoffMs: number;
  maxBackoffMs: number;

  // Whether failed messages enter the DLQ after exhausting retries.
  // MUST be false for CRITICAL regulatory actions.
  dlqEnabled: boolean;
}
The interface exists in the specification as a normative contract, not a wire-format type. Wire formats for messages are defined in the Message Protocol page, and this contract describes how delivery parameters are derived and applied at the receiver.
TTL (Time-to-Live)
Every OIP message carries a ttlSeconds field in its routing metadata that bounds its freshness. Receivers MUST reject any message where header.timestamp + routing.ttlSeconds < now with the error code MESSAGE_TTL_EXPIRED. TTL prevents stale messages from re-entering the system after long delivery delays and bounds the deduplication-cache memory footprint.
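The freshness rule reduces to a single comparison. The helper below is a minimal sketch: the name `isTtlExpired` is hypothetical, and it assumes header.timestamp has already been converted to epoch milliseconds by the caller.

```typescript
// Hypothetical helper. timestampMs is header.timestamp as epoch milliseconds;
// ttlSeconds is routing.ttlSeconds; nowMs is the receiver's clock.
function isTtlExpired(
  timestampMs: number,
  ttlSeconds: number,
  nowMs: number,
): boolean {
  // Mirrors the normative condition: header.timestamp + routing.ttlSeconds < now.
  return timestampMs + ttlSeconds * 1000 < nowMs;
}
```

Because the comparison is strict, a message that arrives exactly at its expiry instant is still accepted; a receiver that returns `true` here would reject with MESSAGE_TTL_EXPIRED and not retry.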
Two Layers of Timeout
OIP defines two distinct timeout layers that conformant implementations MUST manage independently. Per-message TTL bounds individual message freshness. Phase timeouts bound the wall-clock duration of each step in cross-chain coordination. The two layers are configured separately because their failure modes differ: a TTL-expired message is rejected with no retry, while a phase-timed-out coordination triggers atomicity-level failure handling.
Figure 3: Per-Message TTL vs Cross-Chain Phase Timeouts. Per-message freshness (routing.ttlSeconds; MESSAGE_TTL_EXPIRED, REJECT, no retry) versus cross-chain phase coordination (routing.phaseTimeouts; DELIVERY_TIMEOUT per phase, atomicity-level handling).
Phase timeouts apply only to GUARANTEED and ALL_OR_NOTHING atomicity. BEST_EFFORT coordination has no PREPARE/FINALIZE phases and uses only per-message TTL. The remainder of this section specifies per-message TTL; phase timeout values and their failure handling are detailed on the Cross-chain Protocol page.
TTL Determination
TTL is determined by a two-input rule: a sender-recommended TTL and an asset-class minimum TTL enforced by OSS. The final value is the maximum of the two.
function resolveTtl(senderTtl: number, assetMinTtl: number | null): number {
  // assetMinTtl is null when no asset-class policy applies.
  return assetMinTtl !== null
    ? Math.max(senderTtl, assetMinTtl)
    : senderTtl;
}
Senders pick a value from the recommended TTL table per message type. OSS enforces a minimum based on the asset class to prevent stale-state hazards on high-frequency assets and to give regulatory actions enough delivery margin.
Recommended TTL by Message Type
| Message Type | Recommended TTL | Rationale |
|---|---|---|
| STATE_SYNC | 300 s | State synchronization tolerates moderate delivery delay; consistency matters more than freshness. |
| REGULATORY_ACTION | 600 s | Regulatory actions require ample delivery margin to survive cross-domain coordination. |
| LOCK_MANAGEMENT | 120 s | Locks need fresh delivery; a stale lock acquisition would defeat preemptive lock semantics. |
| QUERY | 30 s | Queries are read-only and time-sensitive; senders re-issue if expired. |
| ACK | 60 s | Acknowledgments must reach the sender before its own retry timer fires. |
| HEARTBEAT | 15 s | Heartbeats are periodic; the next heartbeat replaces a missed one. |
| GOVERNANCE PROPOSAL | 600 s | Proposals must reach all eligible voters before the voting period opens. |
| GOVERNANCE VOTE | 300 s | Votes are aggregated over the voting window; per-message TTL is shorter than the voting window. |
| GOVERNANCE EXECUTE | 600 s | Execution messages must apply within a bounded window after a successful vote. |
Asset-class minimum TTLs run on a separate axis. High-frequency-trading assets carry a 5-second minimum to bound stale-state risk; assets under active regulatory action carry a 600-second minimum so that FREEZE, SEIZE, and similar messages cannot expire mid-coordination. Asset classes without an explicit policy fall back to the sender’s recommended TTL.
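A short worked example shows how the two inputs combine. It is a sketch under stated assumptions: `resolveTtl` is repeated from above so the block stands alone, while the asset-class keys `HIGH_FREQUENCY` and `UNDER_REGULATORY_ACTION`, the map `ASSET_MIN_TTL_SECONDS`, and the helper `ttlFor` are hypothetical names chosen for illustration. The numeric values come from the text and the recommended TTL table.

```typescript
// Repeated from the TTL Determination rule so this block is self-contained.
function resolveTtl(senderTtl: number, assetMinTtl: number | null): number {
  return assetMinTtl !== null ? Math.max(senderTtl, assetMinTtl) : senderTtl;
}

// Hypothetical asset-class keys; the minimums (5 s and 600 s) are from the text.
const ASSET_MIN_TTL_SECONDS = new Map<string, number>([
  ["HIGH_FREQUENCY", 5],
  ["UNDER_REGULATORY_ACTION", 600],
]);

// Looks up the asset-class minimum (null when no policy exists) and resolves the TTL.
function ttlFor(senderTtlSeconds: number, assetClass: string): number {
  const assetMin = ASSET_MIN_TTL_SECONDS.get(assetClass) ?? null;
  return resolveTtl(senderTtlSeconds, assetMin);
}

// LOCK_MANAGEMENT (120 s) on a high-frequency asset: max(120, 5) = 120.
const lockOnHft = ttlFor(120, "HIGH_FREQUENCY");
// QUERY (30 s) on an asset under regulatory action: max(30, 600) = 600.
const queryUnderAction = ttlFor(30, "UNDER_REGULATORY_ACTION");
// HEARTBEAT (15 s) on an asset class with no policy: sender value, 15.
const heartbeatNoPolicy = ttlFor(15, "UNLISTED_CLASS");
```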
Message TTL versus Procedure Duration
One distinction matters for governance flows. The TTL of a GOVERNANCE message is not the duration of the procedure it triggers. votingPeriodSeconds on a proposal is the time during which votes are accepted, often 24 hours or more. The 600-second TTL on the proposal message itself is the freshness window for that single delivery. The two values are independent: a long voting period can coexist with a short per-message TTL because each VOTE is its own independent message with its own independent TTL.
Deduplication
Deduplication is the receiver-side mechanism that absorbs duplicate deliveries. The messageId cache holds recently observed identifiers, and a cache hit causes the duplicate to be silently ignored with the diagnostic code DEDUPLICATION_DETECTED.
Deduplication Configuration
interface DeduplicationConfig {
  // Cache window in seconds. MUST be >= max(per-type TTL) + 2 * maxClockSkew.
  windowSize: number;

  // The dedup key. v0.5 specifies MESSAGE_ID.
  strategy: "MESSAGE_ID";

  // Maximum number of messageIds retained. Default 100,000.
  capacity: number;

  // Eviction policy when capacity is reached.
  evictionPolicy: "LRU" | "FIFO";
}
Window ≥ Max TTL Invariant (MUST)
The deduplication window MUST be at least the maximum TTL of any message type the receiver accepts, plus a clock-skew margin. Specifically:
windowSize ≥ max(perTypeTtl) + 2 × maxClockSkew
With v0.5 default values, this floor is 610 seconds (the 600-second maximum TTL plus 2 × 5 seconds of clock-skew margin). The recommended operational default is 700 seconds: the 600-second maximum TTL, plus one 60-second STATE_SYNC max backoff, plus a 40-second margin that includes the 10-second clock-skew term.
The reason is structural. If windowSize < maxTtl, a duplicate message can arrive after its messageId has aged out of the cache but before its TTL expires. The receiver would treat it as new, violating idempotency. The + 2 × maxClockSkew term accounts for senders and receivers disagreeing on absolute time within the allowed skew. Implementations MAY run a separate window per message type rather than one global window, in which case each per-type window MUST individually satisfy the invariant for its type.
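The invariant is easy to check mechanically at configuration time. The sketch below assumes a global window; the function names `minDedupWindowSeconds` and `satisfiesWindowInvariant` and the constant `RECOMMENDED_TTLS_SECONDS` are hypothetical, and the TTL values are copied from the recommended TTL table.

```typescript
// Recommended per-type TTLs from the table above, in seconds.
const RECOMMENDED_TTLS_SECONDS = [300, 600, 120, 30, 60, 15, 600, 300, 600];

// Floor for the deduplication window: max(perTypeTtl) + 2 * maxClockSkew.
function minDedupWindowSeconds(
  perTypeTtlSeconds: number[],
  maxClockSkewSeconds: number,
): number {
  if (perTypeTtlSeconds.length === 0) {
    throw new Error("at least one message-type TTL is required");
  }
  return Math.max(...perTypeTtlSeconds) + 2 * maxClockSkewSeconds;
}

// True when a configured windowSize meets or exceeds the floor.
function satisfiesWindowInvariant(
  windowSizeSeconds: number,
  perTypeTtlSeconds: number[],
  maxClockSkewSeconds: number,
): boolean {
  return (
    windowSizeSeconds >=
    minDedupWindowSeconds(perTypeTtlSeconds, maxClockSkewSeconds)
  );
}

// With a 5-second maxClockSkew the floor is 600 + 2 * 5 = 610 seconds.
const v05Floor = minDedupWindowSeconds(RECOMMENDED_TTLS_SECONDS, 5);
```

A 700-second window passes this check, while a 600-second window (equal to the maximum TTL but without the skew margin) does not.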
Cache-Poisoning Defense (MUST)
The deduplication cache must be written only after signature verification succeeds. Reading from the cache can run in parallel with other pre-lock checks, but recording a messageId before its signature is verified opens a denial-of-service vector: an attacker observes a valid messageId in flight, forges a message with that messageId and an invalid signature, and submits it first. If the cache records the messageId before signature verification, the legitimate message is later rejected as a duplicate.
The corresponding rule, restated normatively: a messageId MUST NOT be inserted into the deduplication cache until the message has passed signature verification (covered in detail on the Validation page).
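The ordering rule can be illustrated with a small gate that reads the cache early but writes it only after verification. This is a sketch, not the Validation page's algorithm: `SignedSketchMessage`, `DedupGate`, and the `SIGNATURE_INVALID` label are hypothetical, and the verifier is injected as a plain function rather than a real signature scheme.

```typescript
interface SignedSketchMessage {
  messageId: string;
  payload: string;
  signature: string;
}

type GateResult = "APPLIED" | "DEDUPLICATION_DETECTED" | "SIGNATURE_INVALID";

class DedupGate {
  private readonly seen = new Set<string>();
  private readonly verify: (m: SignedSketchMessage) => boolean;

  constructor(verify: (m: SignedSketchMessage) => boolean) {
    this.verify = verify;
  }

  accept(m: SignedSketchMessage): GateResult {
    // Reading the cache before verification is allowed.
    if (this.seen.has(m.messageId)) return "DEDUPLICATION_DETECTED";
    // A failed verification leaves the cache untouched, so a forged message
    // cannot reserve the messageId of a legitimate one.
    if (!this.verify(m)) return "SIGNATURE_INVALID";
    // Only a verified message is recorded.
    this.seen.add(m.messageId);
    return "APPLIED";
  }
}
```

With this ordering, a forged message that arrives first is rejected without side effects, and the legitimate message carrying the same messageId is still applied when it arrives.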
Retry Policies
Retry policy is differentiated per message type. State-mutating messages with strong consistency requirements get larger retry budgets; transient or self-healing messages get smaller budgets or none. The policy table below is normative.
Retry Policy Table
| Message Type | Max Retries | Base Backoff | Max Backoff | DLQ |
|---|---|---|---|---|
| REGULATORY_ACTION (CRITICAL) | Unlimited | 500 ms | 5 s | Disabled |
| REGULATORY_ACTION (other) | 10 | 1000 ms | 30 s | Enabled |
| STATE_SYNC | 5 | 2000 ms | 60 s | Enabled |
| LOCK_MANAGEMENT | 7 | 1000 ms | 30 s | Enabled |
| QUERY | 3 | 1000 ms | 10 s | Disabled |
| ACK | 3 | 500 ms | 5 s | Disabled |
| HEARTBEAT | 0 | N/A | N/A | Disabled |
| GOVERNANCE PROPOSAL | 5 | 2000 ms | 30 s | Enabled |
| GOVERNANCE VOTE | 3 | 1000 ms | 10 s | Enabled |
| GOVERNANCE EXECUTE | 7 | 2000 ms | 60 s | Enabled |
Backoff Formula
All retries use exponential backoff with jitter. The delay before retry attempt n (passed as retryCount) is:
function nextBackoffMs(
  baseBackoffMs: number,
  maxBackoffMs: number,
  retryCount: number,
): number {
  const jitterFactor = 1 + (Math.random() * 0.2 - 0.1); // ±10%
  const delay = baseBackoffMs * Math.pow(2, retryCount) * jitterFactor;
  return Math.min(maxBackoffMs, delay);
}
The ±10% jitter spreads simultaneous retries across senders to avoid thundering-herd amplification when many nodes recover at once.
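To see what the formula produces, the sketch below computes a full schedule for one message type. It restates `nextBackoffMs` with an injectable random source so the result can be pinned for inspection; that extra parameter, the helper `backoffSchedule`, and the choice to start retryCount at 0 for the first retry are assumptions made for illustration.

```typescript
// Same formula as above, with the random source injectable (an assumption
// added here so the jitter can be pinned in examples and tests).
function nextBackoffMs(
  baseBackoffMs: number,
  maxBackoffMs: number,
  retryCount: number,
  random: () => number = Math.random,
): number {
  const jitterFactor = 1 + (random() * 0.2 - 0.1); // ±10%
  const delay = baseBackoffMs * Math.pow(2, retryCount) * jitterFactor;
  return Math.min(maxBackoffMs, delay);
}

// Delays for retries 0 .. maxRetries - 1 (zero-based retryCount is assumed).
function backoffSchedule(
  baseBackoffMs: number,
  maxBackoffMs: number,
  maxRetries: number,
  random: () => number = Math.random,
): number[] {
  return Array.from({ length: maxRetries }, (_, retryCount) =>
    nextBackoffMs(baseBackoffMs, maxBackoffMs, retryCount, random),
  );
}

// random() = 0.5 makes the jitter factor exactly 1.
const noJitter = () => 0.5;
// STATE_SYNC policy: 2000 ms base, 60 s cap, 5 retries
// → [2000, 4000, 8000, 16000, 32000]
const stateSyncSchedule = backoffSchedule(2000, 60000, 5, noJitter);
```

Under these assumptions the STATE_SYNC delays sum to 62 seconds, well inside the 5 × 60 s worst-case bound, because the cap is never reached within five retries. REGULATORY_ACTION (other), with a 1000 ms base and 30 s cap, hits the cap from the sixth retry onward.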
Cumulative Retry Window by Type
The combination of max retries and max backoff produces materially different worst-case retry windows. Figure 4 shows the bounded retry duration for each message type, which directly drives the lock-TTL alignment requirement specified later in this section.
Figure 4: Cumulative Retry Window by Message Type. Worst-case time from first send to retry exhaustion (max_retries × max_backoff). Types shown, in figure order: REGULATORY_ACTION (CRITICAL), GOVERNANCE EXECUTE, STATE_SYNC, REGULATORY_ACTION (other), LOCK_MANAGEMENT, GOVERNANCE PROPOSAL, QUERY, GOVERNANCE VOTE, ACK, HEARTBEAT.
Per-Type Rationale
REGULATORY_ACTION (CRITICAL) retries indefinitely. A delivery failure on a CRITICAL action is a system-wide safety problem; abandoning it would leave a regulator’s order unenforced. DLQ is disabled because the message must keep trying until accepted or until governance explicitly cancels it.
REGULATORY_ACTION (other) retries 10 times, then escalates to DLQ. Non-critical regulatory actions still warrant human review on permanent failure but do not justify infinite retries.
STATE_SYNC retries 5 times. State synchronization failures usually indicate a deeper consistency problem (cross-chain disagreement, locked-out asset, or stale state hash), so blind retries are less helpful than operator review.
LOCK_MANAGEMENT retries 7 times because lock contention is usually transient and resolves within a few backoff cycles.
QUERY and ACK retry 3 times without DLQ. Queries are read-only and the sender bears the responsibility to re-issue. ACK loss is detected by the sender’s own timeout and re-attempted from the source side rather than carried in a DLQ.
HEARTBEAT never retries. The next heartbeat in the periodic schedule replaces a missed one; queuing retries would only consume bandwidth without improving liveness signal.
GOVERNANCE retries vary by sub-type. PROPOSAL gets 5 attempts because reaching all voters quickly matters. VOTE gets 3 because the voting window aggregates many votes and a single lost vote is rarely decisive. EXECUTE gets 7 because an approved decision must apply, and permanent failure here triggers governance re-vote.
Retry Identity
A retry MUST reuse the original messageId. Changing the messageId on retry breaks idempotency: the receiver would treat the second attempt as a fresh message and could apply the underlying state change twice. The only exception is manual re-issue from the DLQ, which is treated as a new message with a new messageId and re-runs all authority checks.
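A toy simulation makes the consequence concrete. In the sketch below the message always reaches the receiver, but the first few acknowledgments are lost, so the sender retries. `CountingReceiver` and `sendUntilAcked` are hypothetical names, and the receiver's only idempotency mechanism is the messageId cache.

```typescript
// Toy receiver that applies a state change once per unseen messageId.
class CountingReceiver {
  private readonly seen = new Set<string>();
  applied = 0;

  receive(messageId: string): void {
    if (this.seen.has(messageId)) return; // duplicate: silently ignored
    this.seen.add(messageId);
    this.applied += 1;
  }
}

// Every attempt is delivered, but the ACKs for the first `lostAcks` attempts
// never reach the sender. `idForAttempt` picks the messageId for each attempt.
// Returns the number of attempts made.
function sendUntilAcked(
  receiver: CountingReceiver,
  lostAcks: number,
  maxRetries: number,
  idForAttempt: (attempt: number) => string,
): number {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    receiver.receive(idForAttempt(attempt));
    if (attempt >= lostAcks) return attempt + 1; // ACK arrived
  }
  return maxRetries + 1; // retries exhausted
}

// Conformant sender: every retry carries the original messageId.
const conformant = new CountingReceiver();
sendUntilAcked(conformant, 2, 5, () => "msg-1");

// Non-conformant sender: a new messageId per retry defeats deduplication.
const nonConformant = new CountingReceiver();
sendUntilAcked(nonConformant, 2, 5, (attempt) => `msg-1-retry-${attempt}`);
```

With two lost acknowledgments, the conformant sender causes one state change across three deliveries, while the non-conformant sender causes three.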
Lock TTL Alignment (MUST)
If a message acquires a preemptive lock, the lock must remain held throughout the retry window. Otherwise, a retry could complete after the lock expires, leaving the asset modifiable by a competing transaction during the gap. Conformant implementations MUST set the lock TTL such that:
lockTtl ≥ messageTtl + (maxRetries × maxBackoff)
The longest bounded retry windows in the v0.5 policy table (visible in Figure 4) are GOVERNANCE EXECUTE at 7 × 60 s = 420 s, followed by STATE_SYNC at 5 × 60 s = 300 s and REGULATORY_ACTION (other) at 10 × 30 s = 300 s. Locks held for these message types must be sized accordingly. The lock-protocol details are on the State Management page.
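The alignment rule is plain arithmetic, sketched below for three message types. The names `LockSizingInput` and `minLockTtlSeconds` are hypothetical, and the example assumes each message uses its recommended TTL from the TTL table; a larger resolved TTL (for example, from an asset-class minimum) raises the floor accordingly.

```typescript
interface LockSizingInput {
  messageTtlSeconds: number;
  maxRetries: number;
  maxBackoffSeconds: number;
}

// lockTtl >= messageTtl + (maxRetries * maxBackoff)
function minLockTtlSeconds(p: LockSizingInput): number {
  return p.messageTtlSeconds + p.maxRetries * p.maxBackoffSeconds;
}

// STATE_SYNC: 300 + 5 * 60 = 600 s
const stateSyncLockFloor = minLockTtlSeconds({
  messageTtlSeconds: 300,
  maxRetries: 5,
  maxBackoffSeconds: 60,
});

// GOVERNANCE EXECUTE: 600 + 7 * 60 = 1020 s
const governanceExecuteLockFloor = minLockTtlSeconds({
  messageTtlSeconds: 600,
  maxRetries: 7,
  maxBackoffSeconds: 60,
});

// LOCK_MANAGEMENT: 120 + 7 * 30 = 330 s
const lockManagementLockFloor = minLockTtlSeconds({
  messageTtlSeconds: 120,
  maxRetries: 7,
  maxBackoffSeconds: 30,
});
```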
Lock Timeout Safety
OIP guarantees that every preemptive lock is released within a bounded time, regardless of the holding node’s liveness. This guarantee is enforced by lock TTL: when a lock TTL expires, the lock is automatically released even if the holder has crashed, become network-partitioned, or otherwise stopped processing. The receiving chain treats an expired lock as released and accepts new lock acquisitions on the same asset.
The consequence for fault tolerance is direct: a node failure cannot leave assets indefinitely unusable. Once lockTtl elapses, the lock is reclaimed and the asset is available for new transactions, with the failed message either retried by another sender or escalated to the DLQ. This finite-lock-lifetime guarantee is one of the safety properties verified in formal verification work on preemptive lock correctness.
Dead Letter Queue (DLQ)
When retries are exhausted on a message type with DLQ enabled, the message moves to the DLQ for operator review. The DLQ is the protocol’s escalation path for permanent failures.
DLQ Entry Conditions
A message enters the DLQ when all of the following hold: the retry count reaches the message type’s maxRetries value (the failure terminates with MAX_RETRIES_EXCEEDED), the message type has DLQ enabled in the retry policy table, and the message is not a CRITICAL REGULATORY_ACTION (these never enter the DLQ; they retry indefinitely).
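The three entry conditions combine into one predicate. The sketch below is illustrative only: `DlqPolicySketch` and `shouldEnterDlq` are hypothetical names, and representing the unlimited retry budget as `null` is an assumption of this example.

```typescript
interface DlqPolicySketch {
  maxRetries: number | null; // null models the "Unlimited" budget
  dlqEnabled: boolean;
}

function shouldEnterDlq(
  retryCount: number,
  policy: DlqPolicySketch,
  isCriticalRegulatoryAction: boolean,
): boolean {
  // CRITICAL regulatory actions never enter the DLQ; they retry indefinitely.
  if (isCriticalRegulatoryAction) return false;
  // The message type must have DLQ enabled in the retry policy table.
  if (!policy.dlqEnabled) return false;
  // An unlimited budget is never exhausted.
  if (policy.maxRetries === null) return false;
  // Retries are exhausted (MAX_RETRIES_EXCEEDED) once the count reaches maxRetries.
  return retryCount >= policy.maxRetries;
}
```

For example, a STATE_SYNC message (5 retries, DLQ enabled) enters the DLQ when its retry count reaches 5, while a QUERY (3 retries, DLQ disabled) is dropped on exhaustion and left to the sender to re-issue.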
DLQ Entry Structure
interface DLQEntry {
  // The full original message, preserved verbatim.
  originalMessage: OIPMessage;

  // History of every failed delivery attempt.
  failureHistory: FailureRecord[];

  // ISO 8601 timestamp when the message entered the DLQ.
  enteredDLQAt: string;

  // ISO 8601 timestamp at which retention ends. Default 30 days.
  retentionUntil: string;

  // Lifecycle state.
  status: "PENDING_REVIEW" | "RETRY_QUEUED" | "ABANDONED" | "RECOVERED";
}

interface FailureRecord {
  attemptNumber: number;
  failedAt: string; // ISO 8601
  errorCode: string; // see Error Reference page
  errorMessage: string;
  routingSlipSnapshot?: RoutingSlipEntry[];
}
Default retention is 30 days. After expiry, an implementation MAY archive the entry to long-term storage or delete it; the choice is operational policy and outside the OIP normative scope.
DLQ Alerting
interface DLQAlertConfig {
  enabled: boolean;
  alertThreshold: number; // DLQ entries per hour
  criticalThreshold: number; // immediate-alert ceiling
  alertChannels: AlertChannel[];
}

type AlertChannel = "EMAIL" | "SLACK" | "PAGERDUTY" | "WEBHOOK";
The alerting subsystem itself is an operational concern outside OIP, but the alertThreshold and criticalThreshold values are governance parameters. Adjusting them requires a GOVERNANCE message rather than unilateral operator action, because a tampered threshold could mask systemic delivery failures.
DLQ Recovery Protocol
DLQ entries follow a four-step recovery protocol: notify, analyze, decide, record. Automating the decision step is explicitly out of scope for v0.5; the protocol assumes a human (operator or governance) reviews each entry.
Figure 5: DLQ Recovery Protocol. Four sequential steps from notification to permanent record.
Step 1: Notify
Entering the DLQ triggers an alert to the configured channels. The notification carries the original message’s messageId, messageType, and sourceChainId, the full failureHistory, the last error code with its message, and a routingSlip snapshot for debugging. Operators or automated triage systems consume the notification and proceed to analysis.
Step 2: Analyze
Analysis answers four questions: Is the failure permanent (structural rejection, missing authority) or transient (network, downstream timeout)? Does it indicate broader systemic impact across many messages or a single isolated case? Is a retry likely to succeed if the underlying condition is fixed? Is governance intervention warranted because the failure crosses authority boundaries?
Step 3: Decide
One of three outcomes is selected.
Retry re-issues the message with a fresh messageId, optionally with edits to fields like priority, atomicity, or scope. Because the messageId changes, all authority and signature checks are re-run from the start.
Abandon closes the entry permanently with status ABANDONED: the sender is notified, the targeted asset remains at its pre-message state (the original message never applied), and any locks held by the original message release at TTL expiry.
Governance escalation triggers a rollback under D-quencer authority when the failure has system-wide impact, after which governance determines the asset’s final state.
Step 4: Record
Every DLQ decision is persisted in an audit-grade record signed by the decision-maker.
interface DLQDecisionRecord {
  dlqEntryId: string;
  decisionAt: string; // ISO 8601
  decisionMaker: string; // OCID or "GOVERNANCE"
  decisionType: "RETRY" | "ABANDON" | "GOVERNANCE_ESCALATION";
  newMessageId?: string; // present when decisionType is RETRY
  rationale: string;
  signature: string; // decision-maker signature over the record
}
Records are retained indefinitely and form the auditable trail for every permanent failure. Auditors and regulators rely on this trail to verify that operator interventions were legitimate and bounded.
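Before a record is persisted, an implementation might run a structural pre-check such as the one sketched below. The function `decisionRecordProblems` and its specific checks are assumptions for illustration; the sketch only confirms that a signature field is present and does not verify it cryptographically. The interface is repeated so the block stands alone.

```typescript
// Repeated from above so this block is self-contained.
interface DLQDecisionRecord {
  dlqEntryId: string;
  decisionAt: string; // ISO 8601
  decisionMaker: string; // OCID or "GOVERNANCE"
  decisionType: "RETRY" | "ABANDON" | "GOVERNANCE_ESCALATION";
  newMessageId?: string; // present when decisionType is RETRY
  rationale: string;
  signature: string; // decision-maker signature over the record
}

// Returns a list of structural problems; an empty list means the record
// passed these (hypothetical) pre-persistence checks.
function decisionRecordProblems(record: DLQDecisionRecord): string[] {
  const problems: string[] = [];
  if (record.decisionType === "RETRY" && !record.newMessageId) {
    problems.push("RETRY decision is missing newMessageId");
  }
  if (record.signature.trim() === "") {
    problems.push("record is unsigned");
  }
  if (Number.isNaN(Date.parse(record.decisionAt))) {
    problems.push("decisionAt is not a parseable timestamp");
  }
  return problems;
}
```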
DLQ Self-Defense
The DLQ is itself a security-sensitive surface. The protocol imposes four guardrails: DLQ access is restricted to operators with governance-granted permissions; re-issued messages from the DLQ pass through fresh authority verification (the re-issuer must hold the same authority as the original sender, or governance must approve the substitution); DLQ entries themselves are integrity-protected with a hash and a signature, so tampering with stored entries is detectable; detected integrity violations trigger automatic governance notification.
Primary Finality Recovery Path
The mechanisms above resolve faults that occur before a message reaches primary finality. A separate fault class arises after primary finality is issued but before secondary finality is reached on Base L2: cross-chain propagation can fail in flight, leaving some chains updated while others lag. OIP defines a recovery path for this gap.
Why a Separate Recovery Path
Once primary finality is issued, the OSS state root is committed and the message is treated as applied for trading purposes. The DLQ recovery protocol does not apply because the message did not fail in the conventional sense; it succeeded at the L3 OSS layer but its external propagation to participating chains is incomplete. Treating this as a DLQ case would misclassify a successful primary-finality message as failed and would risk reverting a state change that already cleared its safety checks.
Instead, OIP requires conformant implementations to maintain a pending sync queue that tracks messages whose primary finality is issued but whose secondary finality is incomplete, and to re-attempt propagation until either secondary finality is reached or governance intervenes.
Pending Sync Queue Invariant (MUST)
OIP v0.5 does not specify the queue’s data structure or algorithm; OSS implementations have full latitude. The specification only requires four invariants that any conformant pending sync queue MUST satisfy.
| Invariant | Requirement |
|---|---|
| Deterministic tracking | After primary finality issuance, which message has been applied to which chain MUST be deterministically traceable. |
| Idempotency on re-propagation | Re-propagation MUST preserve messageId-based idempotency so that re-attempts do not double-apply the state change. |
| Long-term primary-finality preservation | The primaryFinalityRecord MUST be retained in verifiable form long after lock expiry or reconciliation. Compaction, archival, or Merkle-root summarization is permitted as long as verifiability is preserved. |
| Source of truth on resolution | When DIVERGED or ROLLBACK_INCOMPLETE states are resolved, the primaryFinalityRecord MUST be used as the authoritative reference. |
Implementation freedom (out of scope for OIP) covers exact data structures (RocksDB persistent queue, in-memory plus backup, distributed queue), queueing algorithms (FIFO, priority queue, deadline-based), re-propagation scheduling, and queue sharding or compression.
Re-Propagation Procedure
When the gap between primary and secondary finality exceeds a configurable threshold (default 300 s, signaled by HEARTBEAT.syncProgress.lagSeconds), the receiver triggers re-propagation. The procedure has four steps.
First, gap detection: HEARTBEAT messages report the lag, and a value above the threshold opens recovery.
Second, re-propagation attempt: messages still in the pending sync queue are re-sent to chains where they have not been applied; messageId-based idempotency prevents double-application on chains that already received them.
Third, re-propagation under existing lock: if the original lock has not expired, re-propagation runs under the same lock so no competing transaction can interleave. After lock expiry, recovery acquires a new lock before re-attempting.
Fourth, reconciliation trigger: if re-propagation fails repeatedly, the chain is marked DIVERGED and forced reconciliation against the primaryFinalityRecord begins (covered in State Management).
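The four steps can be sketched as small decision functions over a pending-sync entry. Everything here is illustrative: OIP v0.5 leaves the queue's data structure open, so `PendingSyncSketch` and the helper names are hypothetical, and the `maxFailures` parameter is an assumption because the text says only that re-propagation "fails repeatedly" without fixing a count.

```typescript
interface PendingSyncSketch {
  messageId: string;
  targetChains: string[];
  appliedChains: Set<string>;
  lockExpiresAtMs: number;
}

const DEFAULT_LAG_THRESHOLD_SECONDS = 300;

// Step 1: recovery opens when the reported lag exceeds the threshold.
function shouldOpenRecovery(
  lagSeconds: number,
  thresholdSeconds: number = DEFAULT_LAG_THRESHOLD_SECONDS,
): boolean {
  return lagSeconds > thresholdSeconds;
}

// Step 2: only chains that have not applied the message are re-sent to.
function chainsNeedingRepropagation(entry: PendingSyncSketch): string[] {
  return entry.targetChains.filter((chain) => !entry.appliedChains.has(chain));
}

// Step 3: reuse the original lock while it is live, otherwise acquire a new one.
function lockPlan(
  entry: PendingSyncSketch,
  nowMs: number,
): "REUSE_EXISTING_LOCK" | "ACQUIRE_NEW_LOCK" {
  return nowMs < entry.lockExpiresAtMs
    ? "REUSE_EXISTING_LOCK"
    : "ACQUIRE_NEW_LOCK";
}

// Step 4: after too many failures (hypothetical limit), mark the chain DIVERGED.
function nextStep(
  entry: PendingSyncSketch,
  consecutiveFailures: number,
  maxFailures: number,
): "SECONDARY_FINALITY_REACHED" | "REPROPAGATE" | "MARK_DIVERGED" {
  if (chainsNeedingRepropagation(entry).length === 0) {
    return "SECONDARY_FINALITY_REACHED";
  }
  return consecutiveFailures >= maxFailures ? "MARK_DIVERGED" : "REPROPAGATE";
}
```

Re-sending to a chain that already applied the message would be harmless under messageId-based idempotency; filtering by `appliedChains` simply avoids the redundant traffic.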
Source-of-Truth Principle
If primary and secondary finality disagree, or if any participating chain shows a state that differs from the primaryFinalityRecord, the primary-finality record is authoritative. Secondary-finality observations that conflict with the primary record are treated as artifacts of an incomplete or corrupt propagation, and recovery brings the conflicting chain back into alignment with the primary record. This principle keeps OIP’s safety guarantees consistent across all recovery paths: regardless of which fault layer surfaces an inconsistency, the resolution always converges on the same authoritative state.
Atomicity-Level Failure Handling
The fault-tolerance mechanisms above operate per message and at the primary-secondary finality gap. When a message participates in a cross-chain coordinated transaction, additional failure handling applies based on the chosen AtomicityStrategy. The full matrix (what happens when PREPARE, COMMIT, or FINALIZE fails under BEST_EFFORT, GUARANTEED, or ALL_OR_NOTHING) is specified on the Cross-chain Protocol page.
Two boundary points connect the two layers. First, retry policy applies to the message that carries each phase: a failed PREPARE, COMMIT, or FINALIZE can be retried per the message’s retry budget before the atomicity-level failure handler kicks in. Second, lock TTL MUST exceed the retry window so that locks acquired in PREPARE remain valid throughout retries, as specified in the State Management page.
Error Codes Emitted by This Layer
The fault-tolerance layer emits or accepts the following error codes. Full disposition rules (REJECT, RETRY, ROLLBACK, ESCALATE, DIVERGE) are catalogued on the Error Reference page.
| Error Code | Emitted By | Disposition |
|---|---|---|
| MESSAGE_TTL_EXPIRED | TTL freshness check at receiver | REJECT |
| DEDUPLICATION_DETECTED | Deduplication cache hit | silently ignored, idempotent |
| ATTESTATION_REPLAY_DETECTED | CANTON attestation transactionId duplicate | REJECT |
| STATE_HASH_MISMATCH | currentStateHash check | REJECT (or silently ignored if dedup hit) |
| DELIVERY_TIMEOUT | Per-message or phase timeout | RETRY |
| MAX_RETRIES_EXCEEDED | Retry exhaustion | DLQ (when enabled) |
| ROLLBACK_INCOMPLETE_REJECT | New message arrives during incomplete rollback | REJECT |
Conformance Checklist
The following items summarize the MUST, SHOULD, and MAY provisions defined on this page. A v0.5-compliant implementation can self-check against this list; the comprehensive cross-page conformance criteria are on the Conformance page.
| # | Requirement | Level |
|---|---|---|
| 1 | Implement at-least-once delivery with mandatory receiver-side idempotency for every state-mutating message. | MUST |
| 2 | Apply messageId deduplication, currentStateHash check, and (for CANTON-origin) attestation transactionId check. | MUST |
| 3 | Reject messages with header.timestamp + routing.ttlSeconds < now using MESSAGE_TTL_EXPIRED. | MUST |
| 4 | Set deduplication windowSize ≥ max(perTypeTtl) + 2 × maxClockSkew globally, or per-type if separate windows are run. | MUST |
| 5 | Insert a messageId into the deduplication cache only after signature verification succeeds. | MUST |
| 6 | Apply per-type retry policies from the policy table; reuse the original messageId on every retry. | MUST |
| 7 | Use exponential backoff with ±10% jitter; never exceed the per-type max backoff. | MUST |
| 8 | Set lockTtl ≥ messageTtl + (maxRetries × maxBackoff) for any message that acquires a preemptive lock. | MUST |
| 9 | Retry CRITICAL REGULATORY_ACTION indefinitely; never enter the DLQ. | MUST |
| 10 | Move messages with DLQ-enabled types to the DLQ on MAX_RETRIES_EXCEEDED and run the four-step recovery protocol. | MUST |
| 11 | Maintain a pending sync queue satisfying the four invariants (deterministic tracking, idempotency, long-term preservation, source-of-truth on resolution). | MUST |
| 12 | Sign every DLQDecisionRecord by its decision-maker and retain records indefinitely. | MUST |
| 13 | Trigger re-propagation when HEARTBEAT.syncProgress.lagSeconds exceeds the configured threshold. | SHOULD |
| 14 | Send DLQ entry alerts to configured channels at alertThreshold and immediately at criticalThreshold. | SHOULD |
| 15 | Run a separate deduplication window per message type for tighter memory bounds. | MAY |
References
Internal pages: Message Protocol for header and routing-metadata field definitions; State Management for preemptive lock semantics, lock TTL alignment, and DIVERGED reconciliation; Validation for the signature-then-cache ordering rule; Cross-chain Protocol for atomicity-level failure handling and phase timeouts; Error Reference for the complete error-code catalog with disposition codes; Conformance for the comprehensive v0.5 compliance checklist.
External standards: RFC 2119: Key words for use in RFCs to Indicate Requirement Levels; RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words; ISO 8601: Date and time representations.
Related patterns: Dead Letter Channel (Hohpe and Woolf, 2003); Idempotent Receiver (Hohpe and Woolf, 2003); Guaranteed Delivery (Hohpe and Woolf, 2003).
