Tracking issue: oc-rsync task #1884. Branch:
docs/reorderbuffer-spill-design-1884.
The bounded-memory spill-to-tempfile layer requested by #1884 is
already implemented as SpillableReorderBuffer<T> in
crates/engine/src/concurrent_delta/spill.rs:218, including the
hardening for ENOSPC and temp-dir vanish requested by the parent task
list. The hot-zone heuristic, RAII cleanup, length-prefixed binary
codec, and directory-backed flavour are all present (spill.rs:1-687)
and the SpillCodec trait is implemented for DeltaResult
(types.rs:457). PR history confirms the original feature shipped as
PR #3982 and the hardening as PR #2247 (cross-referenced from the
prompt as #4247).
What is not yet done:
ReorderBuffer
(consumer.rs:174-176) and never constructs a
SpillableReorderBuffer. The spill machinery is therefore dead
code from the receiver’s perspective; the in-RAM ring is
load-bearing for the whole transfer.max_depth reading large enough to justify pulling the spill
layer onto the critical path. The 1M-item bench prints
max_depth values that track 4 * drift (see
reorderbuffer_memory.rs:127); on the largest configured drift
(16 384) with the heaviest DeltaResult payload (~200 B) that is
~13 MB - two orders of magnitude under the 64 MB default
threshold at spill.rs:61.The recommendation in section 9 is therefore: do not wire the spill
layer in by default. Instead, (a) close the dead-code gap by
opt-in opt-in CLI plumbing for adversarial transfers, (b) extend the
existing reorderbuffer_memory bench with a synthetic huge-drift
scenario that actually exercises the spill path, and (c) gate further
work on bench evidence that max_depth × per-item bytes exceeds the
threshold on realistic transfers.
This note covers the receiver-side bounded-memory plan for the
concurrent_delta reorder buffer when a slow worker stalls delivery
and successor results accumulate. It is a pure receiver-side concern:
no wire-protocol changes, no on-disk artefacts visible to a peer,
tcpdump-replay against upstream rsync must remain byte-identical with
or without the spill layer engaged.
The decomposition tracked by SPL-1 (docs/audits/spill-rs-decomposition-plan.md)
is in flight. The current layout is:
| Path | Responsibility |
|---|---|
crates/engine/src/concurrent_delta/spill.rs |
Buffer, codec trait, error type, stats, tempfile backend (residue). |
crates/engine/src/concurrent_delta/spill/policy.rs |
SpillPolicy + ReclaimMode, SpillCompression, SpillGranularity. |
The remaining submodules (error.rs, codec.rs, stats.rs,
tempfile.rs, buffer.rs, tests/) are tracked by SPL-2 through
SPL-8; see the migration status table in
docs/audits/spill-rs-decomposition-plan.md
for PR-by-PR status. The line-number citations in the rest of this
document still resolve against spill.rs until the remaining
extractions land; the SPL-9 audit
(docs/audits/spl-9-mod-reexports.md)
pins the public symbol contract that every extraction PR must
satisfy.
All paths repository-relative.
ReorderBuffer (in-RAM ring)crates/engine/src/concurrent_delta/reorder/mod.rs:89 -
ReorderBuffer<T> struct with slots: Box<[Option<T>]> and a
capacity field.crates/engine/src/concurrent_delta/reorder/mod.rs:150 -
ReorderBuffer::new(capacity) panics if capacity == 0 and
pre-allocates the ring.crates/engine/src/concurrent_delta/reorder/mod.rs:297 - insert
returns Err(CapacityExceeded) when the sequence is outside the
current ring window unless an adaptive policy is configured.crates/engine/src/concurrent_delta/reorder/mod.rs:328 -
try_adaptive_preinsert_grow extends the ring inside the
AdaptiveCapacityPolicy [min, max] bounds before rejecting.crates/engine/src/concurrent_delta/reorder/mod.rs:502 -
force_insert ignores the capacity bound and, for sequences past
the ring, calls grow which doubles capacity (fixed-cap) or grows
by one slot per call (adaptive). This is the unbounded growth path
the spill layer is meant to displace.crates/engine/src/concurrent_delta/reorder/mod.rs:544 - grow
doubles for the non-adaptive case (min_capacity.max(self.capacity * 2)).crates/engine/src/concurrent_delta/reorder/mod.rs:555 -
resize_to linearises occupied entries into a fresh Box<[Option<T>]>.crates/engine/src/concurrent_delta/reorder/mod.rs:582 - take
removes a slot without advancing the delivery cursor; the spill
layer uses this to extract candidates without disturbing ordering
state.crates/engine/src/concurrent_delta/reorder/mod.rs:464 -
metrics() (added by PR #4195) returns current_depth,
max_depth, and stall_duration.SpillableReorderBuffercrates/engine/src/concurrent_delta/spill.rs:61 -
DEFAULT_SPILL_THRESHOLD = 64 * 1024 * 1024 (64 MB).crates/engine/src/concurrent_delta/spill.rs:68 - HOT_ZONE = 16
items pinned near next_expected to avoid thrashing.crates/engine/src/concurrent_delta/spill.rs:80 - SpillError
with Capacity and Io variants; is_out_of_space matches
io::ErrorKind::StorageFull.crates/engine/src/concurrent_delta/spill.rs:141 - SpillCodec
trait: encode, decode, estimated_size.crates/engine/src/concurrent_delta/spill.rs:174 -
SpillBackend::{Spooled, Directory} - default backend is
tempfile::SpooledTempFile (in-RAM up to 1 MB, then disk);
directory-backed flavour uses tempfile::tempfile_in.crates/engine/src/concurrent_delta/spill.rs:218 -
SpillableReorderBuffer<T: SpillCodec> wrapper with the same
public API surface as ReorderBuffer.crates/engine/src/concurrent_delta/spill.rs:226 -
spill_index: BTreeMap<u64, u64> maps sequence number to byte
offset; reload is O(log S) where S is the spilled-item count.crates/engine/src/concurrent_delta/spill.rs:286 - new(capacity, threshold).crates/engine/src/concurrent_delta/spill.rs:315 -
with_spill_dir allows an explicit on-disk scratch directory.crates/engine/src/concurrent_delta/spill.rs:362 - insert calls
spill_excess whenever memory_used > threshold.crates/engine/src/concurrent_delta/spill.rs:412 - next_in_order
reloads from disk transparently when the spilled sequence equals
next_expected.crates/engine/src/concurrent_delta/spill.rs:515 - spill_excess
walks candidates from highest sequence downward, skipping the hot
zone, and stops as soon as memory_used <= threshold.crates/engine/src/concurrent_delta/spill.rs:570 - spill_item
encodes the payload up front so a codec failure never leaves a
partial record; on NotFound for a directory-backed buffer it
calls recreate_spill_dir and retries once.crates/engine/src/concurrent_delta/spill.rs:656 -
recreate_spill_dir drops the stale file handle, runs
create_dir_all, and only succeeds when no prior items had been
spilled (otherwise those items are unrecoverable and the error is
surfaced).crates/engine/src/concurrent_delta/spill.rs:675 - open_backend
returns SpooledTempFile for the default flavour, which the OS
deletes when the buffer is dropped (RAII).DeltaResult codeccrates/engine/src/concurrent_delta/types.rs:457 -
impl super::spill::SpillCodec for DeltaResult with binary
encoding for ndx, sequence, status, byte counters, and error
messages.crates/engine/src/concurrent_delta/mod.rs:177 - pub mod spill.crates/engine/src/concurrent_delta/mod.rs:185 - pub use
re-exports SpillCodec, SpillError, SpillStats,
SpillableReorderBuffer.crates/engine/src/concurrent_delta/consumer.rs:50 - the consumer
imports the bare ReorderBuffer, not SpillableReorderBuffer.crates/engine/src/concurrent_delta/consumer.rs:173-177 -
delta-reorder thread constructs ReorderBuffer::passthrough()
or ReorderBuffer::new(reorder_capacity). Neither path admits
the spill layer.crates/engine/src/concurrent_delta/consumer.rs:193 - when the
ring fills and next_expected is missing, the loop calls
force_insert. This is the “consumer.rs force_insert smell”
tracked in memory: under sustained backpressure the ring grows
without bound rather than spilling to disk.crates/engine/src/concurrent_delta/reorder/mod.rs:43-52 -
Metrics struct (PR #4195).crates/engine/src/concurrent_delta/reorder/mod.rs:118-125 -
max_depth, stall_duration, stall_started_at fields.crates/engine/src/concurrent_delta/reorder/mod.rs:251 -
refresh_stall_state only samples Instant::now() on the stall
edges, keeping the hot path allocation-free.crates/engine/benches/reorderbuffer_memory.rs:58 -
DRIFTS: [usize; 4] = [32, 256, 2_048, 16_384].crates/engine/benches/reorderbuffer_memory.rs:61-64 -
DEFAULT_COUNTS = [100_000, 500_000], HEAVY_COUNT = 1_000_000
gated behind BENCH_REORDER_MEMORY_1M.crates/engine/benches/reorderbuffer_memory.rs:127 -
let capacity = (drift * 4).max(64); - capacity is provisioned at
4x the drift window so the steady-state insert stays on the
fast path.crates/engine/benches/reorderbuffer_memory.rs:90-109 -
run_cycle calls metrics() and returns the peak max_depth
alongside the elements sum.crates/engine/benches/reorder_buffer_cache.rs:1-49 - cache-miss
/ LLC behaviour at 1M items, gated behind BENCH_REORDER_CACHE=1.
Establishes the storage-layout baseline the spill design must not
regress.docs/audits/parallel-dispatch-wire-format.md (PR #4205) - confirms
parallel dispatch and the reorder buffer do not perturb the wire
protocol; spill-to-tempfile inherits that property because all
serialisation is local-only.crates/engine/src/concurrent_delta/spill.rs:1039 -
enospc_during_spill_propagates_as_io_error.crates/engine/src/concurrent_delta/spill.rs:1077 -
partial_write_surfaces_as_write_zero exercises the WriteZero
contract that the standard library promises on partial writes.crates/engine/src/concurrent_delta/spill.rs:1111 -
temp_dir_vanish_recreates_when_no_prior_spills.crates/engine/src/concurrent_delta/spill.rs:1142 -
temp_dir_vanish_after_prior_spills_returns_error proves we do
not silently swallow data when prior spills are stranded.crates/engine/src/concurrent_delta/spill.rs:1183 -
dir_recreate_failure_surfaces_io_error.docs/design/multi-file-delta-apply-pipeline.md:82-86 -
“spill-to-tempfile pending” risk note that this design retires.The on-the-wire reorder path has two memory budgets, neither of which is strictly bounded today:
ReorderBuffer::new(capacity) allocates
exactly capacity Option<T> slots
(reorder/mod.rs:150-167). Out-of-window inserts fail with
CapacityExceeded at reorder/mod.rs:308.with_adaptive_policy accepts a [min, max]
range; try_adaptive_preinsert_grow
(reorder/mod.rs:328-349) extends the ring inside max before
surrendering to CapacityExceeded.The consumer’s behaviour on overflow today is re-allocate, not block:
consumer.rs:182-196 drains ready items first; if no drain made
progress (head missing) it falls through to
reorder.force_insert(seq, result.clone()) at line 193.ReorderBuffer::force_insert (reorder/mod.rs:502-536) writes
the slot if it fits and otherwise calls grow. grow
(reorder/mod.rs:544-550) doubles capacity for non-adaptive
buffers and adds one slot for adaptive buffers, then resize_to
(reorder/mod.rs:555-571) reallocates a fresh Box<[Option<T>]>
and relocates every occupied entry.Outcome: a single stalled successor combined with sustained
backpressure can grow the ring monotonically until the process runs
out of address space. There is no upper bound on the ring’s RAM use,
and the doubling step also doubles the relocation cost. PR #4195
exposed max_depth so an operator can finally measure that growth,
but no policy reacts to it.
SpillableReorderBuffer exists to bound that growth by serialising
the highest-sequence items to a tempfile, but
no consumer constructs it (mod.rs:185 exports it,
consumer.rs:50 does not import it, consumer.rs:173-177 builds a
bare ReorderBuffer). The spill layer is therefore present in the
crate and absent from the runtime path.
Threshold inputs:
max_depth from Metrics::max_depth
(reorder/mod.rs:50), surfaced by PR #4195.SpillCodec::estimated_size
(spill.rs:160) - for DeltaResult this is dominated by the
status discriminant and optional error-message string
(types.rs:457+).reorderbuffer_memory bench data
(benches/reorderbuffer_memory.rs) for empirical depth versus
drift.The existing spill layer triggers on bytes alone
(spill.rs:371: if self.memory_used > self.threshold). That
behaviour is sound and matches the requested “N MB” threshold.
Default: 64 MB (spill.rs:61). The bench’s worst-case readings sit
comfortably under that ceiling (see TL;DR), which is the empirical
basis for the recommendation in section 9 not to enable the spill on
the default path.
For an opt-in adversarial scenario the recommended threshold pair is:
DEFAULT_SPILL_THRESHOLD = 64 MB.
The 1 MB SpooledTempFile rollover at spill.rs:683 means
transient spikes below ~1 MB never touch disk anyway.capacity field bounds the offset window, and force_insert
growth is the symptom the spill is replacing - once the spill is
on the critical path, force_insert becomes the spill trigger of
last resort rather than an unbounded growth signal. Adding a
second cap is configuration surface without behavioural lift.A future tightening, predicated on bench evidence from section 6,
could parameterise the threshold per transfer from a CLI flag
(--reorder-spill-mb). That belongs to a follow-up task; the design
here keeps the threshold a single compile-time constant to match the
existing SpillableReorderBuffer::new signature.
The on-disk format is already specified by SpillCodec and
spill.rs:614:
+--------+--------+--------+--------+----- ... -----+
| u32 little-endian length | payload bytes |
+--------+--------+--------+--------+----- ... -----+
Records are appended sequentially. The memory-resident index is
spill_index: BTreeMap<u64, u64> (spill.rs:226), mapping
sequence number to record start offset. BTreeMap gives O(log S)
reload lookup and ordered iteration that matches the
highest-sequence-first spill order chosen in section 4.
The codec format is opaque to the spill layer (spill.rs:140):
only encode/decode must agree. That keeps the on-disk layout
free to evolve with DeltaResult and any future spillable result
type without touching the spill machinery.
Backend choice:
SpooledTempFile at 1 MB rollover (spill.rs:683). The
OS deletes the tempfile on drop, so RAII cleanup is automatic
whether the transfer succeeds, fails, or panics.tempfile_in(dir) (spill.rs:677) when the operator
needs the scratch on a dedicated volume (encrypted scratch,
tmpfs, dedicated SSD). This flavour also benefits from
recreate_spill_dir recovery (spill.rs:656-671).Drain is fully handled by the existing next_in_order
(spill.rs:412-438):
next_expected, consult spill_index.reload_item
(spill.rs:628-647) seeks to the offset, reads the length
prefix and payload, decodes the item, removes the index entry,
increments reload_count, then force_inserts the item back
at offset 0 of the in-RAM ring and drives a single
next_in_order to advance the cursor.The force_insert-at-offset-0 step is unconditionally safe because
the ring always has at least one free slot at the head after the
previous take (the slot we are about to re-fill). The
debug_assert! on spill.rs:433-436 guards that invariant.
Drain ordering invariant: spill_excess only spills sequences
above the hot zone (spill.rs:524-538). The hot zone always
includes next_expected, so the head slot can never be on disk
unless the spill threshold itself shrinks below one item’s payload -
the existing tests pin that boundary
(spill.rs:858-879, spill.rs:893-908).
The default backend (SpooledTempFile) drops its File handle on
Drop and the OS unlinks the underlying inode (spill.rs:174-186).
A panic that unwinds through the spill buffer therefore leaves no
on-disk debris. The directory-backed flavour uses
tempfile::tempfile_in, which creates an anonymous handle (already
unlinked before any data is written), so the same property holds.
Test coverage: spill.rs:816-829
(cleanup_on_drop) and the directory round-trip at
spill.rs:1202-1217.
Hardened in PR #2247. spill_item (spill.rs:570-607) encodes the
payload into a Vec<u8> before any disk I/O, so a codec failure
cannot leave a partial record. ENOSPC from the kernel is surfaced
as io::ErrorKind::StorageFull and propagates as
SpillError::Io, which the caller can match via
SpillError::is_out_of_space (spill.rs:99). The failing item is
re-inserted into the ring at spill.rs:555 so the caller can
retry or shut down without losing the result.
The receiver must map SpillError::Io to rsync exit code 11
(FileIo) to mirror upstream’s I/O-failure handling. That is the
contract documented at spill.rs:75-77.
Test coverage: enospc_during_spill_propagates_as_io_error
(spill.rs:1039-1074),
partial_write_surfaces_as_write_zero (spill.rs:1077-1108).
Hardened in PR #2247. For the directory-backed flavour,
recreate_spill_dir (spill.rs:656-671) runs exactly once and
only when no prior items had been spilled; otherwise those
stranded items would be silently lost and the transfer would commit
an incomplete merge. dir_recreate_count (spill.rs:240) is
exposed via SpillStats for operator diagnostics.
Test coverage: temp_dir_vanish_recreates_when_no_prior_spills
(spill.rs:1111-1139),
temp_dir_vanish_after_prior_spills_returns_error
(spill.rs:1142-1180),
dir_recreate_failure_surfaces_io_error
(spill.rs:1183-1199).
SpillableReorderBuffer is !Sync by construction: it takes
&mut self on every mutator (spill.rs:362, :387, :412).
The consumer thread is already the sole owner of the buffer
(consumer.rs:170-216 spawns one delta-reorder thread per
pipeline), so no additional locking is required. The spill writes
happen on the same thread that performs reorder admission, so a
slow disk back-pressures the upstream stream_rx channel
(consumer.rs:158) naturally.
Caveat: if a future refactor moves drain onto a second thread - for
example to parallelise commit alongside reorder - the spill buffer
will need either an outer Mutex or a split owner/reader API. That
is out of scope for #1884 and would be a sibling task to the
shared-ring serialisation work tracked in MEMORY for io_uring.
memory_used is updated by estimated_size(), which the codec
documents as “approximate” (spill.rs:156-161). Tests exercise
the boundary at spill.rs:858-879 to guarantee a single insert
past the threshold triggers a spill. The accounting tolerates
under-estimates (the worst case is one extra in-RAM item before the
next insert spills) and over-estimates (the worst case is an
unnecessary disk round-trip). Neither outcome is data-lossy.
Extend crates/engine/benches/reorderbuffer_memory.rs rather than
introduce a new bench file. The existing scaffolding already builds
deterministic drifted permutations
(reorderbuffer_memory.rs:70-83) and prints max_depth alongside
Criterion timings (reorderbuffer_memory.rs:131-136); reusing it
keeps the comparison fair.
Proposed additions:
bench_spill_path, that runs the same drifted_permutation
inputs through SpillableReorderBuffer::new(capacity, threshold)
with threshold set to a fraction of the projected
max_depth * mean_item_bytes. Initial sweep:
threshold in [1 MB, 8 MB, 64 MB] against
drift = 16_384 and count = 100_000. The 64 MB cell verifies
the default never spills under the same load that the in-RAM
bench already proves comfortable.Vec::from_iter((1..N).chain(std::iter::once(0)))), not a
permutation, and it exercises spill_excess deterministically.u64 with DeltaResult so the
SpillCodec for the real production type is exercised end to
end (types.rs:457). Item size grows from 8 B to ~96-160 B
depending on status and error-message presence; the spill
threshold is then crossed at realistic item counts.spill_stats.spill_events, reload_events,
dir_recreate_events, and memory_used alongside the existing
max_depth line so the bench output documents which cells
actually exercised the disk path.Gate the spill cells behind a new BENCH_REORDER_SPILL=1 env var
(matches the pattern at reorder_buffer_cache.rs:35) so the
default cargo bench -p engine --bench reorderbuffer_memory
remains a no-op for the disk-touching cases.
Success criterion: for the threshold = 64 MB cell at
count = 100_000 / drift = 16_384,
spill_stats.spill_events == 0. That is the empirical proof that
the default threshold is not on the critical path for realistic
transfers and that opt-in plumbing is sufficient.
Wiring SpillableReorderBuffer into the consumer is a single-file
change in scope but a behavioural change in impact, so it is
explicitly deferred until section 6 produces bench evidence:
consumer.rs:50 would gain use super::spill::SpillableReorderBuffer;consumer.rs:173-177 would branch on a SpillPolicy enum
(Off, On { threshold }, Auto { threshold, max_depth_trigger })
rather than hard-coding ReorderBuffer::new.consumer.rs:182-203 insert and drain_ready calls return
Result<_, SpillError> instead of bare values; capacity errors
remain non-fatal (the existing loop) and I/O errors propagate to
the result_tx consumer via a new DeltaResult::failed mapping
to exit code 11.DeltaConsumer::spawn_with_spill(rx, capacity, threshold)
constructor matches the existing spawn /
spawn_bypass factory pair (consumer.rs:128-145).This is recorded for completeness; it is not the deliverable for #1884.
docs/design/multi-file-delta-apply-pipeline.md:82-86 “spill-to-
tempfile pending” risk - retired by the existing
SpillableReorderBuffer implementation; the note should be
updated in a follow-up to point at this doc.Do not enable the spill layer on the default consumer path without bench evidence from section 6. Concretely:
reorderbuffer_memory bench cells from
section 6 (separate PR, prefix bench:).rsync-profile container with the
adversarial drift scenario and the DeltaResult payload.spill_events > 0 at the 64 MB
threshold, file a follow-up to wire SpillableReorderBuffer
into consumer.rs per section 7.--reorder-spill-mb=<N> CLI knob for
the adversarial transfers (one tiny file followed by a multi-GB
delta) that motivated #1884 in the multi-file pipeline note.The bounded-memory plan requested by #1884 is already in tree at
crates/engine/src/concurrent_delta/spill.rs:218, the hardening
asked for is already in tree via PR #2247, and the metrics
needed to size the threshold are already in tree via PR #4195.
The remaining work is bench-evidence and a wiring decision, both of
which belong to follow-up tasks once section 6 produces data.
feat: add bounded-memory spill-to-tempfile for ReorderBuffer (#1884).fix(engine): harden reorder spill against ENOSPC and temp-dir vanish (referenced in the prompt as #4247).feat(engine): expose ReorderBuffer stall and queue-depth metrics.bench(engine): profile ReorderBuffer memory at 100K/500K/1M.bench(engine): profile ReorderBuffer cache behavior at 1M items.docs(audits): wire-format unchanged under parallel dispatch.--delay-updates is off (sibling design).docs/design/reorderbuffer-metrics-and-bypass.md - sibling
metrics and bypass design.docs/design/multi-file-delta-apply-pipeline.md - the surrounding
pipeline design that flagged #1884 as “pending”.