COW Dump — A High-Level Design
How CRIU's Copy-on-Write dump path shortens live-migration downtime from seconds to tens of milliseconds.
TL;DR
Traditional CRIU freezes the source process for the entire memory dump. COW dump keeps the process running, copies memory to the destination while tracking writes, and only freezes for a short final pass that captures the last-moment dirty pages and process-tree state. Result: downtime drops from "full-dump" to "short flush."
Why we built this
Classic CRIU
Freeze, then dump everything
The process is frozen for the entire dump. Downtime scales with total memory size — a 64 GB workload is unusable during the whole transfer. For live migration this is the dominant cost.
COW dump
Copy while running, flush when frozen
Memory moves to the destination while the process keeps running. The kernel tracks which pages are written. Only the last pass runs frozen — downtime depends on the working set, not total memory.
The core idea, in three moves
1. Ask the kernel to tell us which pages get written
We register the process's memory with userfaultfd in write-protect mode. From that moment on, the kernel records (via PAGEMAP_SCAN) every page that gets written to. The process keeps running normally (a registration sketch follows this list).
2. Copy memory to the destination while the process runs
Dedicated threads stream the entire memory contents over the network in parallel, compressed. Because the process is still running, some of what we send will be stale by the time it lands.
3. Freeze briefly, capture the last dirty pages + the final state
Once most of memory is at the destination, we freeze the process, grab the final list of dirty pages (small), grab the process-tree metadata, and send both. The replica already has the bulk — it only needs these last pieces to unfreeze.
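Below is a minimal sketch of the first move, assuming the dump side can run this code inside the target (in practice via the injected parasite, since a userfaultfd covers the caller's own address space). The ioctl sequence and flag names are the Linux 6.7+ uapi; the helper name `track_writes` is illustrative, not CRIU code, and error handling is trimmed.

```c
/* Sketch: register one memory range for asynchronous write-protect tracking.
 * Requires Linux >= 6.7 (UFFD_FEATURE_WP_ASYNC). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int track_writes(void *addr, size_t len)   /* hypothetical helper */
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* Async WP: write faults are resolved by the kernel itself and the page
	 * is merely remembered as "written" — the process never stalls on us. */
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_WP_ASYNC };
	if (ioctl(uffd, UFFDIO_API, &api) < 0)
		return -1;

	/* Register the range in write-protect mode... */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	/* ...and arm the protection so the first write to each page is recorded. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
		return -1;

	return uffd;
}
```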
The five phases at a glance
- Phase 1 — Set up tracking (running): inject parasite, open userfaultfd, write-protect all memory.
- Phase 2a — Bulk copy (running): stream every page to the destination in parallel, compressed.
- Phase 2b — Iterate dirty (running): ask the kernel which pages changed; re-send them (overwriting older copies).
- Phase 3 — Final flush (frozen): freeze; capture last-moment dirty pages + process-tree metadata.
- Phase 4 — Hand off (running): the replica applies the pages; the primary unfreezes or terminates.
Only Phase 3 is frozen — that's the observable downtime. Everything else happens with the process live.
What each side is doing, phase by phase
Phase 1 — Preparation
Primary: briefly freeze to inject the tracking hook, then unfreeze immediately. The process resumes.
Replica: waits for the primary to start streaming.
Phase 2a — Bulk copy
Primary: dedicated worker threads walk all of memory and push it out over many parallel sockets, compressed. The process itself is untouched — it runs normally on its own cores.
Replica: receives, decompresses, and accumulates pages in a staging buffer. No pages are installed in the target yet — the replica is just a giant receive queue at this point.
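A rough sketch of one Phase 2a worker, under assumptions this page does not spell out: pages are read through /proc/&lt;pid&gt;/mem, compressed with LZ4_compress_default, and framed as a destination address plus compressed length. The names (`stream_range`, `mem_fd`, `sock_fd`) and the framing are illustrative, not the real wire protocol.

```c
/* Sketch: one bulk-copy worker streaming a range of the target's memory
 * while the target keeps running. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <lz4.h>

#define BATCH_SIZE (1 << 20)            /* 1 MB of pages per batch */

static int stream_range(int mem_fd, int sock_fd, uint64_t start, uint64_t len)
{
	char *raw    = malloc(BATCH_SIZE);
	char *packed = malloc(LZ4_COMPRESSBOUND(BATCH_SIZE));
	int ret = -1;

	if (!raw || !packed)
		goto out;

	for (uint64_t off = 0; off < len; off += BATCH_SIZE) {
		size_t chunk = (len - off < BATCH_SIZE) ? (size_t)(len - off) : BATCH_SIZE;

		/* The process is still running: this copy may already be stale;
		 * later dirty passes will overwrite it on the replica. */
		if (pread(mem_fd, raw, chunk, start + off) != (ssize_t)chunk)
			goto out;

		int packed_len = LZ4_compress_default(raw, packed, chunk,
						      LZ4_COMPRESSBOUND(BATCH_SIZE));
		if (packed_len <= 0)
			goto out;

		/* Minimal framing: destination address + compressed length + payload. */
		uint64_t hdr[2] = { start + off, (uint64_t)packed_len };
		if (write(sock_fd, hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr) ||
		    write(sock_fd, packed, packed_len) != (ssize_t)packed_len)
			goto out;
	}
	ret = 0;
out:
	free(raw);
	free(packed);
	return ret;
}
```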
Phase 2b — Iterate the dirty set
Primary: ask the kernel "which pages were written since last time?", re-send only those pages, and repeat until the dirty set is small. The process keeps running the whole time.
Replica: each re-sent page overwrites the older copy in the staging buffer. Still no installation into the target.
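A sketch of the dirty query itself, assuming a pagemap fd opened on the target's /proc/&lt;pid&gt;/pagemap and the WP-async registration shown earlier. PM_SCAN_WP_MATCHING re-protects each reported page during the same walk, so the next iteration only sees pages written after this one. The struct and flag names are the Linux 6.7 uapi; the helper name `scan_dirty` and the fixed vector size are illustrative.

```c
/* Sketch: ask the kernel which pages were written since the last scan,
 * re-arming write-protect as we go. Linux >= 6.7, <linux/fs.h> uapi. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define MAX_REGIONS 256

static long scan_dirty(int pagemap_fd, uint64_t start, uint64_t end,
		       struct page_region *out /* MAX_REGIONS entries */)
{
	struct pm_scan_arg arg = {
		.size          = sizeof(arg),
		.start         = start,
		.end           = end,
		.vec           = (uintptr_t)out,
		.vec_len       = MAX_REGIONS,
		/* Only report pages dirtied since the last write-protect pass... */
		.category_mask = PAGE_IS_WRITTEN,
		.return_mask   = PAGE_IS_WRITTEN,
		/* ...write-protect them again in the same walk, and sanity-check
		 * that the range really is under async WP tracking. */
		.flags         = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
	};

	/* Returns the number of page_region entries filled into 'out';
	 * each entry is a contiguous run of dirty pages to re-send. */
	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}
```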
Phase 3 — Final flush (the only frozen step)
Primary: freeze the process. This is the downtime. Run one last dirty scan, dump the process-tree metadata (file descriptors, namespaces, threads, ...), and send the final batch of dirty pages plus the metadata to the replica.
Replica: absorbs the final dirty batches into the staging buffer.
Phase 4 — Hand-off
Send "all pages sent." Wait for ACK. Unfreeze (or terminate, if this was a migration).
CRIU restore connects and catches the restored tasks frozen.
Now the drain workers install every staged page into the restored process using userfaultfd.
Drain completes → restore unfreezes the restored tasks → the workload is live on the destination.
Important ordering detail: on the replica, pages are accumulated during Phases 2–3 and only installed in Phase 4. CRIU restore does not start until all pages are received, and it waits for the drain to finish before unfreezing the migrated process. This keeps the hand-off atomic and avoids page-race bugs.
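A sketch of what installing one staged page might look like during the drain, assuming the restore side holds a userfaultfd registered for the restored (still frozen) task. UFFDIO_COPY copies the bytes and maps the page in a single operation; the helper name is illustrative.

```c
/* Sketch: install one staged page into the restored (frozen) task. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int install_page(int uffd, uint64_t dst_addr, void *staged_page,
			size_t page_size)
{
	struct uffdio_copy copy = {
		.dst = dst_addr,                /* address in the restored task */
		.src = (uintptr_t)staged_page,  /* latest version from staging  */
		.len = page_size,
		.mode = 0,
	};

	/* The kernel copies the bytes and maps the page in one step, so the
	 * restored task can never observe a half-installed page. */
	if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
		return -1;
	return 0;
}
```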
Why this is faster than it looks
- Many parallel sockets — throughput scales with cores.
- LZ4 compression on every batch — less wire, less CPU.
- 1 MB page batching — amortizes header and syscall cost.
- The dirty set shrinks each iteration — convergence is cheap.
The key trick: overwrite, don't append
If a page is modified twice during the copy, the second version simply overwrites the first in the replica's staging buffer — in place, with no allocation. By the time Phase 3 ends, the replica already has the most recent version of every page that was re-sent. The final drain is just "copy this ready-made buffer into the restored process's address space."
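A sketch of the overwrite rule, assuming a flat staging buffer that mirrors the source region one-to-one (the actual data structure is part of the detailed design). Because the slot is derived from the source address, a re-sent page lands exactly on top of its older copy — no append, no allocation.

```c
/* Sketch: staging buffer that mirrors one source memory region.
 * A page re-sent in a later pass simply overwrites its older copy. */
#include <stdint.h>
#include <string.h>

struct staging {
	uint64_t base;   /* start address of the mirrored region on the source */
	char    *buf;    /* region-sized buffer, allocated once up front       */
};

static void stage_page(struct staging *s, uint64_t src_addr,
		       const void *data, size_t page_size)
{
	/* Same source address => same slot, whether this is the first copy
	 * from Phase 2a or the fifth re-send from Phase 2b/3. */
	memcpy(s->buf + (src_addr - s->base), data, page_size);
}
```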
The key constraint: the freeze window must stay small
Phase 3 is the only part the user sees as downtime. To keep it short, we make sure Phase 2b ends with a small dirty set. If the workload writes memory faster than we can copy it, the dirty set never shrinks and the freeze grows. This is a classic live-migration trade-off; the convergence threshold and iteration cap are tunable.
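The convergence policy, as a sketch: the threshold and iteration-cap names are placeholders for the tunables mentioned above, and the two helpers are hypothetical stand-ins for the dirty scan/re-send and the frozen final pass.

```c
/* Sketch: the Phase 2b convergence loop. Names and defaults are illustrative;
 * the real knobs are tunable. */
#define DIRTY_THRESHOLD_PAGES 1024   /* "small enough to flush while frozen"   */
#define MAX_ITERATIONS        16     /* stop iterating on write-heavy workloads */

/* Hypothetical helpers — stand-ins for the real scan/send and final flush. */
extern long scan_and_resend_dirty(void);
extern void freeze_and_final_flush(void);

static void precopy_loop(void)
{
	for (int iter = 0; iter < MAX_ITERATIONS; iter++) {
		long dirty = scan_and_resend_dirty();
		if (dirty <= DIRTY_THRESHOLD_PAGES)
			break;   /* converged: the frozen final flush is cheap */
	}
	/* Either way we freeze now; if we never converged, the final flush
	 * (and thus the downtime) is simply larger. */
	freeze_and_final_flush();
}
```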
What COW dump buys you
- Sub-second downtime for typical workloads — dominated by Phase 3 (final scan + metadata dump), not total memory size.
- Network and disk decoupled from freeze time — the big, slow transfer happens with the process still serving requests.
- Predictable hand-off — the replica is always in a well-defined state before CRIU restore runs, so restore itself is the same deterministic operation it is today.
- Backwards-compatible dump format — the replica reads the same image files as a normal CRIU restore; the only thing that's different is how they got there.
Limitations to keep in mind
- Write-heavy workloads can outrun the convergence phase. In that case the final freeze grows — typically a few seconds for data sets of hundreds of GBs.
- Requires Linux 6.7 or newer — that's when UFFD_FEATURE_WP_ASYNC (async userfaultfd write-protect) and the PAGEMAP_SCAN ioctl both landed. (A runtime probe sketch follows this list.)
- Currently single-process-tree in this experimental fork; multi-tree support is future work.
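Rather than parsing a kernel version string, the requirement can be probed at runtime. A sketch, assuming it is acceptable to open a throwaway userfaultfd just to query its feature bits (a similar probe can be done for PAGEMAP_SCAN against /proc/self/pagemap); the helper name is illustrative.

```c
/* Sketch: detect at runtime whether the kernel supports async uffd-wp
 * (present since Linux 6.7). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int cow_dump_supported(void)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return 0;

	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_WP_ASYNC };
	int ok = ioctl(uffd, UFFDIO_API, &api) == 0 &&
		 (api.features & UFFD_FEATURE_WP_ASYNC);

	close(uffd);
	return ok;
}
```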
Where to go next
For the full technical design — thread counts, data structures, every protocol message, code pointers — see Detailed Design.