COW Dump — A High-Level Design
How CRIU's Copy-on-Write dump path shortens live-migration downtime from seconds to tens of milliseconds.
TL;DR
Traditional CRIU freezes the source process for the entire memory dump. COW dump keeps the process running, copies memory to the destination while tracking writes, and only freezes for a short final pass that captures the last-moment dirty pages and process-tree state. Result: downtime drops from "full-dump" to "short flush."
Why we built this
Classic CRIU
Freeze, then dump everything
The process is frozen for the entire dump. Downtime scales with total memory size — a 64 GB workload is unusable during the whole transfer. For live migration this is the dominant cost.
COW dump
Copy while running, flush when frozen
Memory moves to the destination while the process keeps running. The kernel tracks which pages are written. Only the last pass runs frozen — downtime depends on the working set, not total memory.
The core idea, in three moves
1. Ask the kernel to tell us which pages get written
We register the process's memory with userfaultfd in write-protect mode. From that moment on, the kernel records (via PAGEMAP_SCAN) every page that gets written to. The process keeps running normally (a registration sketch follows this list).
2. Copy memory to the destination while the process runs
Dedicated threads stream the entire memory contents over the network in parallel, compressed. Because the process is still running, some of what we send will be stale by the time it lands.
3. Freeze briefly, capture the last dirty pages + the final state
Once most of memory is at the destination, we freeze the process, grab the final list of dirty pages (small), grab the process-tree metadata, and send both. The replica already has the bulk — it only needs these last pieces to unfreeze.
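Below is a minimal sketch of the first move, assuming the dump side can run this code inside the target (in practice via the injected parasite, since a userfaultfd covers the caller's own address space). The ioctl sequence and flag names are the Linux 6.7+ uapi; the helper name `track_writes` is illustrative, not CRIU code, and error handling is trimmed.

```c
/* Sketch: register one memory range for asynchronous write-protect tracking.
 * Requires Linux >= 6.7 (UFFD_FEATURE_WP_ASYNC). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int track_writes(void *addr, size_t len)   /* hypothetical helper */
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* Async WP: write faults are resolved by the kernel itself and the page
	 * is merely remembered as "written" — the process never stalls on us. */
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_WP_ASYNC };
	if (ioctl(uffd, UFFDIO_API, &api) < 0)
		return -1;

	/* Register the range in write-protect mode... */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	/* ...and arm the protection so the first write to each page is recorded. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
		return -1;

	return uffd;
}
```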
The five phases at a glance
- Phase 1 — Set up tracking (running): inject parasite, open userfaultfd, write-protect all memory.
- Phase 2a — Bulk copy (running): stream every page to the destination in parallel, compressed.
- Phase 2b — Iterate dirty (running): ask the kernel which pages changed; re-send them (overwriting older copies).
- Phase 3 — Final flush (frozen): freeze; capture last-moment dirty pages + process-tree metadata.
- Phase 4 — Hand off (running): the replica applies the pages; the primary unfreezes or terminates.
Only Phase 3 is frozen — that's the observable downtime. Everything else happens with the process live.
What each side is doing, phase by phase
Phase 1 — Preparation
Primary: briefly freeze to inject the tracking hook, then unfreeze immediately. The process resumes.
Replica: waits for the primary to start streaming.
Phase 2a — Bulk copy
Primary: dedicated worker threads walk all of memory and push it out over many parallel sockets, compressed. The process itself is untouched — it runs normally on its own cores.
Replica: receives, decompresses, and accumulates pages in a staging buffer. No pages are installed in the target yet — the replica is just a giant receive queue at this point.
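A rough sketch of one Phase 2a worker, under assumptions this page does not spell out: pages are read through /proc/&lt;pid&gt;/mem, compressed with LZ4_compress_default, and framed as a destination address plus compressed length. The names (`stream_range`, `mem_fd`, `sock_fd`) and the framing are illustrative, not the real wire protocol.

```c
/* Sketch: one bulk-copy worker streaming a range of the target's memory
 * while the target keeps running. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <lz4.h>

#define BATCH_SIZE (1 << 20)            /* 1 MB of pages per batch */

static int stream_range(int mem_fd, int sock_fd, uint64_t start, uint64_t len)
{
	char *raw    = malloc(BATCH_SIZE);
	char *packed = malloc(LZ4_COMPRESSBOUND(BATCH_SIZE));
	int ret = -1;

	if (!raw || !packed)
		goto out;

	for (uint64_t off = 0; off < len; off += BATCH_SIZE) {
		size_t chunk = (len - off < BATCH_SIZE) ? (size_t)(len - off) : BATCH_SIZE;

		/* The process is still running: this copy may already be stale;
		 * later dirty passes will overwrite it on the replica. */
		if (pread(mem_fd, raw, chunk, start + off) != (ssize_t)chunk)
			goto out;

		int packed_len = LZ4_compress_default(raw, packed, chunk,
						      LZ4_COMPRESSBOUND(BATCH_SIZE));
		if (packed_len <= 0)
			goto out;

		/* Minimal framing: destination address + compressed length + payload. */
		uint64_t hdr[2] = { start + off, (uint64_t)packed_len };
		if (write(sock_fd, hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr) ||
		    write(sock_fd, packed, packed_len) != (ssize_t)packed_len)
			goto out;
	}
	ret = 0;
out:
	free(raw);
	free(packed);
	return ret;
}
```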
Phase 2b — Iterate the dirty set
Primary: ask the kernel "which pages were written since last time?", re-send only those pages, and repeat until the dirty set is small. The process keeps running the whole time.
Replica: each re-sent page overwrites the older copy in the staging buffer. Still no installation into the target.
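A sketch of the dirty query itself, assuming a pagemap fd opened on the target's /proc/&lt;pid&gt;/pagemap and the WP-async registration shown earlier. PM_SCAN_WP_MATCHING re-protects each reported page during the same walk, so the next iteration only sees pages written after this one. The struct and flag names are the Linux 6.7 uapi; the helper name `scan_dirty` and the fixed vector size are illustrative.

```c
/* Sketch: ask the kernel which pages were written since the last scan,
 * re-arming write-protect as we go. Linux >= 6.7, <linux/fs.h> uapi. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define MAX_REGIONS 256

static long scan_dirty(int pagemap_fd, uint64_t start, uint64_t end,
		       struct page_region *out /* MAX_REGIONS entries */)
{
	struct pm_scan_arg arg = {
		.size          = sizeof(arg),
		.start         = start,
		.end           = end,
		.vec           = (uintptr_t)out,
		.vec_len       = MAX_REGIONS,
		/* Only report pages dirtied since the last write-protect pass... */
		.category_mask = PAGE_IS_WRITTEN,
		.return_mask   = PAGE_IS_WRITTEN,
		/* ...write-protect them again in the same walk, and sanity-check
		 * that the range really is under async WP tracking. */
		.flags         = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
	};

	/* Returns the number of page_region entries filled into 'out';
	 * each entry is a contiguous run of dirty pages to re-send. */
	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}
```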
Phase 3 — Final flush (the only frozen step)
Primary: freeze the process. This is the downtime. Run one last dirty scan, dump the process-tree metadata (file descriptors, namespaces, threads, ...), and send the final batch of dirty pages plus the metadata to the replica.
Replica: absorbs the final dirty batches into the staging buffer.
Phase 4 — Hand-off
Send "all pages sent." Wait for ACK. Unfreeze (or terminate, if this was a migration).
CRIU restore connects and catches the restored tasks frozen.
Now the drain workers install every staged page into the restored process using userfaultfd.
Drain completes → restore unfreezes the restored tasks → the workload is live on the destination.
Important ordering detail: on the replica, pages are accumulated during Phases 2–3 and only installed in Phase 4. CRIU restore does not start until all pages are received, and it waits for the drain to finish before unfreezing the migrated process. This keeps the hand-off atomic and avoids page-race bugs.
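A sketch of what installing one staged page might look like during the drain, assuming the restore side holds a userfaultfd registered for the restored (still frozen) task. UFFDIO_COPY copies the bytes and maps the page in a single operation; the helper name is illustrative.

```c
/* Sketch: install one staged page into the restored (frozen) task. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int install_page(int uffd, uint64_t dst_addr, void *staged_page,
			size_t page_size)
{
	struct uffdio_copy copy = {
		.dst = dst_addr,                /* address in the restored task */
		.src = (uintptr_t)staged_page,  /* latest version from staging  */
		.len = page_size,
		.mode = 0,
	};

	/* The kernel copies the bytes and maps the page in one step, so the
	 * restored task can never observe a half-installed page. */
	if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
		return -1;
	return 0;
}
```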
Why this is faster than it looks
- Many parallel sockets — throughput scales with cores.
- LZ4 compression on every batch — less wire, less CPU.
- 1 MB page batching — amortizes header and syscall cost.
- The dirty set shrinks each iteration — convergence is cheap.
The key trick: overwrite, don't append
If a page is modified twice during the copy, the second version simply overwrites the first in the replica's staging buffer — in place, with no allocation. By the time Phase 3 ends, the replica already has the most recent version of every page that was re-sent. The final drain is just "copy this ready-made buffer into the restored process's address space."
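A sketch of the overwrite rule, assuming a flat staging buffer that mirrors the source region one-to-one (the actual data structure is part of the detailed design). Because the slot is derived from the source address, a re-sent page lands exactly on top of its older copy — no append, no allocation.

```c
/* Sketch: staging buffer that mirrors one source memory region.
 * A page re-sent in a later pass simply overwrites its older copy. */
#include <stdint.h>
#include <string.h>

struct staging {
	uint64_t base;   /* start address of the mirrored region on the source */
	char    *buf;    /* region-sized buffer, allocated once up front       */
};

static void stage_page(struct staging *s, uint64_t src_addr,
		       const void *data, size_t page_size)
{
	/* Same source address => same slot, whether this is the first copy
	 * from Phase 2a or the fifth re-send from Phase 2b/3. */
	memcpy(s->buf + (src_addr - s->base), data, page_size);
}
```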
The key constraint: the freeze window must stay small
Phase 3 is the only part the user sees as downtime. To keep it short, we make sure Phase 2b ends with a small dirty set. If the workload writes memory faster than we can copy it, the dirty set never shrinks and the freeze grows. This is a classic live-migration trade-off; the convergence threshold and iteration cap are tunable.
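The convergence policy, as a sketch: the threshold and iteration-cap names are placeholders for the tunables mentioned above, and the two helpers are hypothetical stand-ins for the dirty scan/re-send and the frozen final pass.

```c
/* Sketch: the Phase 2b convergence loop. Names and defaults are illustrative;
 * the real knobs are tunable. */
#define DIRTY_THRESHOLD_PAGES 1024   /* "small enough to flush while frozen"   */
#define MAX_ITERATIONS        16     /* stop iterating on write-heavy workloads */

/* Hypothetical helpers — stand-ins for the real scan/send and final flush. */
extern long scan_and_resend_dirty(void);
extern void freeze_and_final_flush(void);

static void precopy_loop(void)
{
	for (int iter = 0; iter < MAX_ITERATIONS; iter++) {
		long dirty = scan_and_resend_dirty();
		if (dirty <= DIRTY_THRESHOLD_PAGES)
			break;   /* converged: the frozen final flush is cheap */
	}
	/* Either way we freeze now; if we never converged, the final flush
	 * (and thus the downtime) is simply larger. */
	freeze_and_final_flush();
}
```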
What COW dump buys you
- Sub-second downtime for typical workloads — dominated by Phase 3 (final scan + metadata dump), not total memory size.
- Network and disk decoupled from freeze time — the big, slow transfer happens with the process still serving requests.
- Predictable hand-off — the replica is always in a well-defined state before CRIU restore runs, so restore itself is the same deterministic operation it is today.
- Backwards-compatible dump format — the replica reads the same image files as a normal CRIU restore; the only thing that's different is how they got there.
Limitations to keep in mind
- Write-heavy workloads can outrun the convergence phase. In that case the final freeze grows — typically a few seconds for data sets of hundreds of GBs.
- Requires Linux 6.7 or newer — that's when UFFD_FEATURE_WP_ASYNC (async userfaultfd write-protect) and the PAGEMAP_SCAN ioctl both landed. (A runtime probe sketch follows this list.)
- Currently single-process-tree in this experimental fork; multi-tree support is future work.
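Rather than parsing a kernel version string, the requirement can be probed at runtime. A sketch, assuming it is acceptable to open a throwaway userfaultfd just to query its feature bits (a similar probe can be done for PAGEMAP_SCAN against /proc/self/pagemap); the helper name is illustrative.

```c
/* Sketch: detect at runtime whether the kernel supports async uffd-wp
 * (present since Linux 6.7). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int cow_dump_supported(void)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return 0;

	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_WP_ASYNC };
	int ok = ioctl(uffd, UFFDIO_API, &api) == 0 &&
		 (api.features & UFFD_FEATURE_WP_ASYNC);

	close(uffd);
	return ok;
}
```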
Where to go next
For the full technical design — thread counts, data structures, every protocol message, code pointers — see Detailed Design.