COW Dump — A High-Level Design

How CRIU's Copy-on-Write dump path shortens live-migration downtime from seconds to tens of milliseconds.

TL;DR

Traditional CRIU freezes the source process for the entire memory dump. COW dump keeps the process running, copies memory to the destination while tracking writes, and only freezes for a short final pass that captures the last-moment dirty pages and process-tree state. Result: downtime shrinks from the length of a full dump to the length of a short flush.

Why we built this

Classic CRIU

Freeze, then dump everything

The process is frozen for the entire dump. Downtime scales with total memory size — a 64 GB workload is unusable during the whole transfer. For live migration this is the dominant cost.

COW dump

Copy while running, flush when frozen

Memory moves to the destination while the process keeps running. The kernel tracks which pages are written. Only the last pass runs frozen — downtime depends on the working set, not total memory.

[Timeline figure: Classic CRIU is frozen for the entire dump; COW dump runs during the bulk copy and dirty re-sends, with only a short final freeze before migration completes. The difference between the two frozen spans is the downtime saved.]
Green = process running and serving traffic. Red = frozen (observable downtime).

The core idea, in three moves

  1. Ask the kernel to tell us which pages get written

    We register the process's memory with userfaultfd in write-protect mode. From that moment on, the kernel marks every page that gets written; we read the accumulated set back with the PAGEMAP_SCAN ioctl. The process keeps running normally.

  2. Copy memory to the destination while the process runs

    Dedicated threads stream the entire memory contents over the network in parallel, compressed. Because the process is still running, some of what we send will be stale by the time it lands.

  3. Freeze briefly, capture the last dirty pages + the final state

    Once most of memory is at the destination, we freeze the process, grab the final list of dirty pages (small), grab the process-tree metadata, and send both. The replica already has the bulk — it only needs these last pieces to unfreeze.

The five phases at a glance

Phase 1 — Set up tracking (running). Inject parasite; open userfaultfd; write-protect all memory.

Phase 2a — Bulk copy (running). Stream every page to the destination in parallel, compressed.

Phase 2b — Iterate dirty (running). Ask the kernel which pages changed; re-send them (overwriting older copies).

Phase 3 — Final flush (frozen). Freeze; capture last-moment dirty pages + process-tree metadata.

Phase 4 — Hand off (running). Replica applies the pages; primary unfreezes or terminates.

Only Phase 3 is frozen — that's the observable downtime. Everything else happens with the process live.

What each side is doing, phase by phase

Phase 1 — Preparation

Primary (source)
Briefly freeze to inject the tracking hook, then unfreeze; the process resumes immediately.
Replica (destination)
Waiting for the primary to start streaming.

Phase 2a — Bulk copy

Primary
Dedicated worker threads walk all of memory and push it out over many parallel sockets, compressed.
The process itself is untouched — it runs normally on its own cores.
Replica
Receive, decompress, accumulate pages in a staging buffer.
No pages are installed in the target yet — the replica is just a giant receive queue at this point.

Phase 2b — Iterate the dirty set

Primary
Ask the kernel: "which pages were written since last time?" Re-send those pages only. Repeat until the dirty set is small.
The process keeps running the whole time.
Replica
Each re-sent page overwrites the older copy in the staging buffer. Still no installation into the target.

Phase 3 — Final flush (the only frozen step)

Primary
Freeze the process. This is the downtime.
One last dirty scan, plus dump the process-tree metadata (file descriptors, namespaces, threads, ...).
Send the final dirty batch + metadata to the replica.
Replica
Absorb the final dirty batches into the staging buffer.

Phase 4 — Hand-off

Primary
Send "all pages sent." Wait for ACK. Unfreeze (or terminate, if this was a migration).
Replica
CRIU restore runs on the replica and holds the restored tasks frozen.
Now the drain workers install every staged page into the restored process using userfaultfd.
Drain completes → restore unfreezes the restored tasks → the workload is live on the destination.
Important ordering detail: on the replica, pages are accumulated during Phases 2–3 and only installed in Phase 4. CRIU restore does not start until all pages are received, and it waits for the drain to finish before unfreezing the migrated process. This keeps the hand-off atomic and avoids page-race bugs.

Why this is faster than it looks

Many parallel sockets — throughput scales with cores.
LZ4 compression on every batch — less wire, less CPU.
1 MB page batches — amortize header and syscall cost.
A dirty set that shrinks each iteration — convergence is cheap.

The key trick: overwrite, don't append

If a page is modified twice during the copy, the second version simply overwrites the first in the replica's staging buffer — in place, with no allocation. By the time Phase 3 ends, the replica already has the most recent version of every page that was re-sent. The final drain is just "copy this ready-made buffer into the restored process's address space."

The key constraint: the freeze window must stay small

Phase 3 is the only part the user sees as downtime. To keep it short, we make sure Phase 2b ends with a small dirty set. If the workload writes memory faster than we can copy it, the dirty set never shrinks and the freeze grows. This is a classic live-migration trade-off; the convergence threshold and iteration cap are tunable.

What COW dump buys you

Limitations to keep in mind

Where to go next

For the full technical design — thread counts, data structures, every protocol message, code pointers — see Detailed Design.