rsync

ReorderBuffer spill-to-tempfile for stalled successors

Tracking issue: oc-rsync task #1884. Branch: docs/reorderbuffer-spill-design-1884.

TL;DR (recommendation)

The bounded-memory spill-to-tempfile layer requested by #1884 is already implemented as SpillableReorderBuffer<T> in crates/engine/src/concurrent_delta/spill.rs:218, including the hardening for ENOSPC and temp-dir vanish requested by the parent task list. The hot-zone heuristic, RAII cleanup, length-prefixed binary codec, and directory-backed flavour are all present (spill.rs:1-687) and the SpillCodec trait is implemented for DeltaResult (types.rs:457). PR history confirms the original feature shipped as PR #3982 and the hardening as PR #2247 (cross-referenced from the prompt as #4247).

What is not yet done:

  1. The consumer pipeline still uses the bare ReorderBuffer (consumer.rs:174-176) and never constructs a SpillableReorderBuffer. The spill machinery is therefore dead code from the receiver’s perspective; the in-RAM ring is load-bearing for the whole transfer.
  2. Neither #4195 stall metrics nor #4204 memory benches expose a max_depth reading large enough to justify pulling the spill layer onto the critical path. The 1M-item bench prints max_depth values that track 4 * drift (see reorderbuffer_memory.rs:127); on the largest configured drift (16 384) with the heaviest DeltaResult payload (~200 B) that is ~13 MB - two orders of magnitude under the 64 MB default threshold at spill.rs:61.

The recommendation in section 9 is therefore: do not wire the spill layer in by default. Instead, (a) close the dead-code gap by opt-in opt-in CLI plumbing for adversarial transfers, (b) extend the existing reorderbuffer_memory bench with a synthetic huge-drift scenario that actually exercises the spill path, and (c) gate further work on bench evidence that max_depth × per-item bytes exceeds the threshold on realistic transfers.

Scope

This note covers the receiver-side bounded-memory plan for the concurrent_delta reorder buffer when a slow worker stalls delivery and successor results accumulate. It is a pure receiver-side concern: no wire-protocol changes, no on-disk artefacts visible to a peer, tcpdump-replay against upstream rsync must remain byte-identical with or without the spill layer engaged.

Current spill module layout

The decomposition tracked by SPL-1 (docs/audits/spill-rs-decomposition-plan.md) is in flight. The current layout is:

Path Responsibility
crates/engine/src/concurrent_delta/spill.rs Buffer, codec trait, error type, stats, tempfile backend (residue).
crates/engine/src/concurrent_delta/spill/policy.rs SpillPolicy + ReclaimMode, SpillCompression, SpillGranularity.

The remaining submodules (error.rs, codec.rs, stats.rs, tempfile.rs, buffer.rs, tests/) are tracked by SPL-2 through SPL-8; see the migration status table in docs/audits/spill-rs-decomposition-plan.md for PR-by-PR status. The line-number citations in the rest of this document still resolve against spill.rs until the remaining extractions land; the SPL-9 audit (docs/audits/spl-9-mod-reexports.md) pins the public symbol contract that every extraction PR must satisfy.

Source citations

All paths repository-relative.

ReorderBuffer (in-RAM ring)

Existing SpillableReorderBuffer

DeltaResult codec

Consumer wiring (the dead-code gap)

Stall metrics (#4195) and memory bench (#4204)

Cache bench (#4180) and wire-format audit (#4205)

Hardening already in tree (PR #2247, referenced as “#4247”)

1. Current memory cap

The on-the-wire reorder path has two memory budgets, neither of which is strictly bounded today:

  1. Fixed-capacity ring. ReorderBuffer::new(capacity) allocates exactly capacity Option<T> slots (reorder/mod.rs:150-167). Out-of-window inserts fail with CapacityExceeded at reorder/mod.rs:308.
  2. Adaptive ring. with_adaptive_policy accepts a [min, max] range; try_adaptive_preinsert_grow (reorder/mod.rs:328-349) extends the ring inside max before surrendering to CapacityExceeded.

The consumer’s behaviour on overflow today is re-allocate, not block:

Outcome: a single stalled successor combined with sustained backpressure can grow the ring monotonically until the process runs out of address space. There is no upper bound on the ring’s RAM use, and the doubling step also doubles the relocation cost. PR #4195 exposed max_depth so an operator can finally measure that growth, but no policy reacts to it.

SpillableReorderBuffer exists to bound that growth by serialising the highest-sequence items to a tempfile, but no consumer constructs it (mod.rs:185 exports it, consumer.rs:50 does not import it, consumer.rs:173-177 builds a bare ReorderBuffer). The spill layer is therefore present in the crate and absent from the runtime path.

2. Spill trigger

Threshold inputs:

The existing spill layer triggers on bytes alone (spill.rs:371: if self.memory_used > self.threshold). That behaviour is sound and matches the requested “N MB” threshold. Default: 64 MB (spill.rs:61). The bench’s worst-case readings sit comfortably under that ceiling (see TL;DR), which is the empirical basis for the recommendation in section 9 not to enable the spill on the default path.

For an opt-in adversarial scenario the recommended threshold pair is:

A future tightening, predicated on bench evidence from section 6, could parameterise the threshold per transfer from a CLI flag (--reorder-spill-mb). That belongs to a follow-up task; the design here keeps the threshold a single compile-time constant to match the existing SpillableReorderBuffer::new signature.

3. Spill layout

The on-disk format is already specified by SpillCodec and spill.rs:614:

+--------+--------+--------+--------+----- ... -----+
|        u32 little-endian length    | payload bytes |
+--------+--------+--------+--------+----- ... -----+

Records are appended sequentially. The memory-resident index is spill_index: BTreeMap<u64, u64> (spill.rs:226), mapping sequence number to record start offset. BTreeMap gives O(log S) reload lookup and ordered iteration that matches the highest-sequence-first spill order chosen in section 4.

The codec format is opaque to the spill layer (spill.rs:140): only encode/decode must agree. That keeps the on-disk layout free to evolve with DeltaResult and any future spillable result type without touching the spill machinery.

Backend choice:

4. Drain

Drain is fully handled by the existing next_in_order (spill.rs:412-438):

  1. Check the in-memory ring first.
  2. If empty for next_expected, consult spill_index.
  3. If a spilled record exists, reload_item (spill.rs:628-647) seeks to the offset, reads the length prefix and payload, decodes the item, removes the index entry, increments reload_count, then force_inserts the item back at offset 0 of the in-RAM ring and drives a single next_in_order to advance the cursor.

The force_insert-at-offset-0 step is unconditionally safe because the ring always has at least one free slot at the head after the previous take (the slot we are about to re-fill). The debug_assert! on spill.rs:433-436 guards that invariant.

Drain ordering invariant: spill_excess only spills sequences above the hot zone (spill.rs:524-538). The hot zone always includes next_expected, so the head slot can never be on disk unless the spill threshold itself shrinks below one item’s payload - the existing tests pin that boundary (spill.rs:858-879, spill.rs:893-908).

5. Hazards

5.1 Tempfile cleanup on panic or drop

The default backend (SpooledTempFile) drops its File handle on Drop and the OS unlinks the underlying inode (spill.rs:174-186). A panic that unwinds through the spill buffer therefore leaves no on-disk debris. The directory-backed flavour uses tempfile::tempfile_in, which creates an anonymous handle (already unlinked before any data is written), so the same property holds.

Test coverage: spill.rs:816-829 (cleanup_on_drop) and the directory round-trip at spill.rs:1202-1217.

5.2 ENOSPC during spill

Hardened in PR #2247. spill_item (spill.rs:570-607) encodes the payload into a Vec<u8> before any disk I/O, so a codec failure cannot leave a partial record. ENOSPC from the kernel is surfaced as io::ErrorKind::StorageFull and propagates as SpillError::Io, which the caller can match via SpillError::is_out_of_space (spill.rs:99). The failing item is re-inserted into the ring at spill.rs:555 so the caller can retry or shut down without losing the result.

The receiver must map SpillError::Io to rsync exit code 11 (FileIo) to mirror upstream’s I/O-failure handling. That is the contract documented at spill.rs:75-77.

Test coverage: enospc_during_spill_propagates_as_io_error (spill.rs:1039-1074), partial_write_surfaces_as_write_zero (spill.rs:1077-1108).

5.3 Temp directory vanish mid-transfer

Hardened in PR #2247. For the directory-backed flavour, recreate_spill_dir (spill.rs:656-671) runs exactly once and only when no prior items had been spilled; otherwise those stranded items would be silently lost and the transfer would commit an incomplete merge. dir_recreate_count (spill.rs:240) is exposed via SpillStats for operator diagnostics.

Test coverage: temp_dir_vanish_recreates_when_no_prior_spills (spill.rs:1111-1139), temp_dir_vanish_after_prior_spills_returns_error (spill.rs:1142-1180), dir_recreate_failure_surfaces_io_error (spill.rs:1183-1199).

5.4 Concurrent reordering vs spill writes

SpillableReorderBuffer is !Sync by construction: it takes &mut self on every mutator (spill.rs:362, :387, :412). The consumer thread is already the sole owner of the buffer (consumer.rs:170-216 spawns one delta-reorder thread per pipeline), so no additional locking is required. The spill writes happen on the same thread that performs reorder admission, so a slow disk back-pressures the upstream stream_rx channel (consumer.rs:158) naturally.

Caveat: if a future refactor moves drain onto a second thread - for example to parallelise commit alongside reorder - the spill buffer will need either an outer Mutex or a split owner/reader API. That is out of scope for #1884 and would be a sibling task to the shared-ring serialisation work tracked in MEMORY for io_uring.

5.5 Memory-accounting drift

memory_used is updated by estimated_size(), which the codec documents as “approximate” (spill.rs:156-161). Tests exercise the boundary at spill.rs:858-879 to guarantee a single insert past the threshold triggers a spill. The accounting tolerates under-estimates (the worst case is one extra in-RAM item before the next insert spills) and over-estimates (the worst case is an unnecessary disk round-trip). Neither outcome is data-lossy.

6. Bench plan

Extend crates/engine/benches/reorderbuffer_memory.rs rather than introduce a new bench file. The existing scaffolding already builds deterministic drifted permutations (reorderbuffer_memory.rs:70-83) and prints max_depth alongside Criterion timings (reorderbuffer_memory.rs:131-136); reusing it keeps the comparison fair.

Proposed additions:

  1. Spill cell. Add a fourth bench function, bench_spill_path, that runs the same drifted_permutation inputs through SpillableReorderBuffer::new(capacity, threshold) with threshold set to a fraction of the projected max_depth * mean_item_bytes. Initial sweep: threshold in [1 MB, 8 MB, 64 MB] against drift = 16_384 and count = 100_000. The 64 MB cell verifies the default never spills under the same load that the in-RAM bench already proves comfortable.
  2. Adversarial drift. Add a single synthetic case where one sequence (the head) is deliberately withheld until all successors are inserted - the canonical pathological scenario that motivates the spill. This is a closed-form input (Vec::from_iter((1..N).chain(std::iter::once(0)))), not a permutation, and it exercises spill_excess deterministically.
  3. DeltaResult payload. Replace u64 with DeltaResult so the SpillCodec for the real production type is exercised end to end (types.rs:457). Item size grows from 8 B to ~96-160 B depending on status and error-message presence; the spill threshold is then crossed at realistic item counts.
  4. Spill stats reporting. Print spill_stats.spill_events, reload_events, dir_recreate_events, and memory_used alongside the existing max_depth line so the bench output documents which cells actually exercised the disk path.

Gate the spill cells behind a new BENCH_REORDER_SPILL=1 env var (matches the pattern at reorder_buffer_cache.rs:35) so the default cargo bench -p engine --bench reorderbuffer_memory remains a no-op for the disk-touching cases.

Success criterion: for the threshold = 64 MB cell at count = 100_000 / drift = 16_384, spill_stats.spill_events == 0. That is the empirical proof that the default threshold is not on the critical path for realistic transfers and that opt-in plumbing is sufficient.

7. Wiring plan (deferred)

Wiring SpillableReorderBuffer into the consumer is a single-file change in scope but a behavioural change in impact, so it is explicitly deferred until section 6 produces bench evidence:

This is recorded for completeness; it is not the deliverable for #1884.

8. What this design replaces / supersedes

9. Recommendation

Do not enable the spill layer on the default consumer path without bench evidence from section 6. Concretely:

  1. Land the extended reorderbuffer_memory bench cells from section 6 (separate PR, prefix bench:).
  2. Run those cells in the rsync-profile container with the adversarial drift scenario and the DeltaResult payload.
  3. If any realistic cell triggers spill_events > 0 at the 64 MB threshold, file a follow-up to wire SpillableReorderBuffer into consumer.rs per section 7.
  4. If no realistic cell triggers a spill, expose the spill layer only behind an explicit --reorder-spill-mb=<N> CLI knob for the adversarial transfers (one tiny file followed by a multi-GB delta) that motivated #1884 in the multi-file pipeline note.

The bounded-memory plan requested by #1884 is already in tree at crates/engine/src/concurrent_delta/spill.rs:218, the hardening asked for is already in tree via PR #2247, and the metrics needed to size the threshold are already in tree via PR #4195. The remaining work is bench-evidence and a wiring decision, both of which belong to follow-up tasks once section 6 produces data.

Cross-references