COW Dump Design Document
Overview
The COW (Copy-on-Write) dump implementation is an experimental feature in this CRIU fork designed to minimize source process downtime during live migration. Instead of freezing the process for the entire dump duration, COW dump uses Linux's userfaultfd write-protect (UFFD_FEATURE_WP_ASYNC) to track memory writes while the process continues running, enabling incremental page transfer.
Goals
- Minimize downtime: Keep the source process running during the bulk of the memory transfer; only freeze for a short skeleton dump + final dirty flush.
- Parallel, compressed transfer: Multiple sender/receiver sockets, LZ4 compression, batched I/O.
- Convergence: Iteratively re-send dirty pages until a threshold is reached (overwriting older copies in the replica batch buffer), then do a single final frozen scan.
- Completeness: Capture dynamically created VMAs as well as the originally-tracked set.
Top-Level Architecture
Phase 1 (primary)
- Seize tree: `collect_pstree`, `pre_dump_one_task`
- Parasite opens UFFD with `WP_ASYNC` inside target
- `UFFDIO_WRITEPROTECT` in parallel (64MB chunks)
- Unfreeze — process runs with dirty tracking live
P3 Senders (`COW_NUM_P3_THREADS_BULK`)
- Work-steal from `g_work_queue` (atomic fetch-add)
- Chunk = `COW_WORK_CHUNK_SIZE` (32MB)
- Batch = `COW_BATCH_PAGES` (256) = 1MB
- `process_vm_readv` → LZ4 → per-thread socket
- Scanners block on `g_bulk_transfer_done_count` — they cannot start Phase 2b until every bulk sender exhausts the work queue.
P3 Receivers (N threads, 1 per socket)
- LZ4 decompress into page-pool buffer
- `cow_page_buffer_add_batch()` inserts new 1MB-aligned entry in hash table
Batch Buffer (hash, 256K buckets)
- Entry = 1MB region + bitmap of present pages
- Pages only accumulate — no drain
Scanners (`COW_NUM_PRE_SCANNERS` active)
- `PAGEMAP_SCAN` (`PM_SCAN_WP_MATCHING`), 4MB sub-chunks
- Iterate until dirty count < `COW_DIRTY_SCAN_FREEZE_THRESHOLD` or max iterations
- Push `dirty_region_entry` into convergence queue
MPMC Convergence Queue
- Flat array, atomic head/tail
- CAS-claim one region per pop
Pack → Read → Compress → Send
- Pack regions into `COW_BATCH_PAGES` batches
- `process_vm_readv` → LZ4 → per-thread socket
P3 Receivers — Overwrite Path
- On re-send: `cow_page_buffer_get_data_ptr()` finds existing 1MB entry
- LZ4 decompress directly into `entry->data` at page offset
- Older bulk copy is overwritten in place (zero alloc, zero memcpy) — `cow-p3-receiver.c:126-152`
Phase 3 (freeze + skeleton dump)
- `reseize_pstree`, `collect_pstree_ids`
- `cow_detect_new_vmas()` vs Phase 1 tracked set
- `cow_set_new_vma_ranges()`; `cow_signal_last_scan()`
- Scanners do frozen final `PAGEMAP_SCAN` → MPMC queue
- `dump_one_task()` per item (skeleton: no pages)
- `write_img_inventory()`
- Replica: receivers keep buffering, overwriting existing entries on re-send; drain has still not started.
Phase 4 (unfreeze + drain)
- Unfreeze tasks
- Send `PS_IOV_ALL_PAGES_SENT` →
- Wait for `PS_IOV_ALL_PAGES_SENT_ACK` ←
- `cow_cleanup_async_uffd()`
- Close sockets
1. Restore connects — `handle_lazy_accept()`; tasks still running, no drain.
2. Catch tasks → TASKS_FROZEN — restore `PTRACE_INTERRUPT`s all tasks and sends `LAZY_PAGES_TASKS_FROZEN` to the lazy-pages daemon.
3. Drain threads start — `cow_start_drain_thread()` on `TASKS_FROZEN` (`uffd.c:1545`); walk `chunk_index` (work-steal via `next_drain_chunk`); `UFFDIO_COPY` whole 1MB batches, freeing pool chunks as they empty; `EAGAIN` → retry queue, `EEXIST`/`ENOENT` → soft-handle.
4. Restore waits for drain — `while (cow_drain_thread_running() || cow_page_buffer_count() > 0)` (`uffd.c:1796`); restore does NOT run concurrently with drain.
5. Drain done → restore unfreezes — restore unfreezes the task tree and sends `PS_IOV_ALL_PAGES_SENT_ACK` back to the primary.
Phases
Phase 1 — Seize + WP_ASYNC Initialization
Entry: cr_dump_tasks_cow_phased() in criu/cr-dump.c (redirected from cr_dump_tasks() when opts.cow_dump && opts.lazy_pages).
Steps:
- Seize process tree (`collect_pstree`, `pre_dump_one_task` for each item).
- Parasite RPC `PARASITE_CMD_COW_DUMP_INIT` opens a userfaultfd with `UFFD_FEATURE_WP_ASYNC` inside the target and returns the fd via `compel_util_recv_fd()` — `cow_dump_init_async()` in `criu/cow/cow-dump.c`.
- `cow_register_vmas()` walks the VMA list and records eligible ones (writable, private or anon-shared, not guard/VVAR/droppable). For each it issues `UFFDIO_REGISTER` in `UFFDIO_REGISTER_MODE_WP` mode (register failures are tolerated — `PAGEMAP_SCAN` is the source of truth).
- `cow_apply_writeprotect()` splits all tracked VMAs into `COW_WP_CHUNK_SIZE` (64MB) ranges and issues `UFFDIO_WRITEPROTECT` in parallel from `sysconf(_SC_NPROCESSORS_ONLN)` threads.
- `arch_set_thread_regs()` + `pstree_switch_state(TASK_ALIVE)` — the process resumes with write-protection in effect.
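The parallel write-protect step boils down to chunking each tracked range so worker threads can issue one `UFFDIO_WRITEPROTECT` per chunk. A minimal sketch of that chunking, with hypothetical names (`split_wp_chunks`, `struct wp_range`) — only `COW_WP_CHUNK_SIZE` comes from the design:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COW_WP_CHUNK_SIZE (64ULL << 20) /* 64MB, as in cow-conf.h */

struct wp_range {
	uint64_t start;
	uint64_t len;
};

/* Fill 'out' (capacity 'cap') with the 64MB chunks covering [start, end).
 * Each chunk would be handed to a worker thread for UFFDIO_WRITEPROTECT.
 * Returns the number of chunks, or -1 if 'cap' is too small. */
static long split_wp_chunks(uint64_t start, uint64_t end,
			    struct wp_range *out, size_t cap)
{
	long n = 0;

	while (start < end) {
		uint64_t len = end - start;

		if (len > COW_WP_CHUNK_SIZE)
			len = COW_WP_CHUNK_SIZE;
		if ((size_t)n >= cap)
			return -1;
		out[n].start = start;
		out[n].len = len;
		start += len;
		n++;
	}
	return n;
}
```

The final partial chunk keeps its true length, so a 192MB+4KB VMA yields three full chunks plus one 4KB tail.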
Phase 2a — Bulk Transfer
cr_page_server() establishes the PRIMARY side page-server and spawns the unified thread (cow-unified-thread.c) which accepts N P3 connections and calls cow_start_p3_threads().
Bulk transfer (cow-bulk-send.c)
- A shared work queue `g_work_queue` is built by splitting each lazy VMA into `COW_WORK_CHUNK_SIZE` (32MB) chunks (bounded by `COW_MAX_WORK_ITEMS` = 16384).
- Only `COW_NUM_P3_THREADS_BULK` threads participate in bulk; the rest sit idle until Phase 2b. Rationale: during bulk the receiver is usually the bottleneck, so extra senders contend for CPU.
- Each thread pulls items via `__atomic_fetch_add(&g_work_queue_next)` and, for each item, reads `COW_BATCH_PAGES` (256) pages with `process_vm_readv`, LZ4-compresses them, and sends a `PS_IOV_ADD_F_COMPRESS` batch on its socket.
- Scanner threads are started but block on `g_bulk_transfer_done_count` until every bulk sender has exhausted the work queue (`cow-bulk-send.c:158-159, 304-317`). There is no interleaving — Phase 2b begins only after all bulk work is done.
Replica side (Phase 2a): P3 receivers LZ4-decompress each batch into a page-pool buffer and insert it as a new 1MB-aligned entry in the batch buffer hash table (cow_page_buffer_add_batch). No draining happens yet — pages only accumulate in the batch buffer.
Phase 2b — Pre-Scan + Send Dirty (Overwrite Older Pages)
Gated by COW_PRE_SCAN (default on). Triggered when all bulk senders have completed.
- `cow_start_scanner_thread()` starts `COW_NUM_SCANNERS` scanner threads at the start of Phase 2; only the first `COW_NUM_PRE_SCANNERS` actively scan in Phase 2b and are woken once bulk is done. The rest wait directly for the freeze signal.
- Each scanner walks `global_lazy_vmas`, divides each VMA by pre-scanner count, and runs `PAGEMAP_SCAN` in `COW_PAGEMAP_SCAN_RANGE_SIZE` (4MB) sub-chunks with `PM_SCAN_WP_MATCHING`. The 4MB cap bounds `mmap_lock` hold time in the kernel.
- Dirty regions become `struct dirty_region_entry` and are pushed into the MPMC convergence queue (a flat array `g_conv_slots[CONV_QUEUE_CAP]` with atomic `g_conv_head`/`g_conv_tail`; pop = CAS on tail, one region per claim).
- The same P3 sender threads (now all `COW_NUM_P3_THREADS` of them) consume the queue, pack regions into `COW_BATCH_PAGES`-sized batches of `dirty_slices`, and issue a single multi-iov `process_vm_readv` per batch followed by LZ4 + send.
- Scanners iterate until total dirty pages drop below `COW_DIRTY_SCAN_FREEZE_THRESHOLD` (2,000,000) or `COW_PRE_SCAN_MAX_ITERATIONS` is reached, then scanner 0 waits for the queue to drain and sets `g_last_scan_flag`. `cow_all_threads_below_threshold()` returns true → the main thread moves to Phase 3.
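As an illustration of the 4MB sub-chunking, here is a sketch of how a scanner might fill `struct pm_scan_arg` for one sub-chunk. The struct mirrors the Linux 6.7 `PAGEMAP_SCAN` uAPI (real code would take it from `<linux/fs.h>`); `prep_scan_subchunk` and its parameters are hypothetical, and the flag/mask values are passed in rather than hardcoded:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the PAGEMAP_SCAN ioctl argument (Linux 6.7 uAPI). */
struct pm_scan_arg {
	uint64_t size, flags, start, end, walk_end;
	uint64_t vec, vec_len, max_pages;
	uint64_t category_inverted, category_mask,
		 category_anyof_mask, return_mask;
};

#define COW_PAGEMAP_SCAN_RANGE_SIZE (4ULL << 20) /* 4MB, bounds mmap_lock */
#define COW_PAGEMAP_SCAN_VEC_LEN    1000

/* Prepare the i-th 4MB sub-chunk of [start, end) for one ioctl.
 * Returns 0 when the sub-chunk lies past the end of the range. */
static int prep_scan_subchunk(struct pm_scan_arg *a, uint64_t start,
			      uint64_t end, unsigned int i, uint64_t vec,
			      uint64_t flags, uint64_t return_mask)
{
	uint64_t s = start + (uint64_t)i * COW_PAGEMAP_SCAN_RANGE_SIZE;
	uint64_t e;

	if (s >= end)
		return 0;
	e = s + COW_PAGEMAP_SCAN_RANGE_SIZE;
	if (e > end)
		e = end;

	*a = (struct pm_scan_arg) {
		.size = sizeof(*a),
		.flags = flags,			/* e.g. PM_SCAN_WP_MATCHING */
		.start = s,
		.end = e,
		.vec = vec,			/* user buffer of page_region */
		.vec_len = COW_PAGEMAP_SCAN_VEC_LEN,
		.category_mask = return_mask,	/* e.g. PAGE_IS_WRITTEN */
		.return_mask = return_mask,
	};
	return 1;
}
```

The real scanner would then issue `ioctl(pagemap_fd, PAGEMAP_SCAN, &arg)` per sub-chunk and convert each returned region into a `dirty_region_entry`.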
Replica side (Phase 2b): on re-send of a page whose 1MB region is already present in the batch buffer, the receiver calls cow_page_buffer_get_data_ptr() to find the existing entry and LZ4-decompresses directly into entry->data at the page offset, overwriting the older bulk copy in place (cow-p3-receiver.c:126-152). Zero allocation, zero memcpy. Still no drain.
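The overwrite path works because a page's destination inside an existing 1MB entry is pure arithmetic: offset within the region, plus a bit in the entry's 256-bit presence bitmap. A sketch with illustrative names (`entry_page_ptr`, `entry_mark_present`); only the 1MB/4KB geometry is from the design:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define COW_BATCH_SHIFT	20
#define COW_BATCH_SIZE	(1ULL << COW_BATCH_SHIFT)	/* 1MB */
#define PAGE_SHIFT	12				/* 4KB */

struct batch_entry {
	uint64_t base;		 /* 1MB-aligned region start */
	char *data;		 /* 256 contiguous 4KB pages in the pool */
	uint64_t page_bitmap[4]; /* 256 bits: which pages have arrived */
};

/* Destination for LZ4 decompression of the page at 'vaddr' —
 * decompressing here overwrites any older copy in place. */
static char *entry_page_ptr(struct batch_entry *e, uint64_t vaddr)
{
	return e->data + (vaddr - e->base);
}

static void entry_mark_present(struct batch_entry *e, uint64_t vaddr)
{
	unsigned int idx =
		(unsigned int)((vaddr & (COW_BATCH_SIZE - 1)) >> PAGE_SHIFT);

	e->page_bitmap[idx / 64] |= 1ULL << (idx % 64);
}
```

Because the pointer is computed into already-allocated pool memory, a re-send costs no allocation and no extra memcpy, exactly as the text claims.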
Phase 3 — Freeze + Skeleton Dump
Measured wall-clock is the primary metric here; this is the downtime.
- `reseize_pstree()` + `collect_pstree_ids()`.
- `cow_set_dst_id(vpid(root_item))` — dst_id wasn't valid in Phase 1.
- `collect_mappings()` → `cow_detect_new_vmas()` compares against the Phase 1 tracked set and emits ranges for VMAs that appeared while the process was running. New ranges are also appended to `tracked_vmas` and `global_lazy_vmas` (`add_lazy_vma_for_new_region`).
- `cow_set_new_vma_ranges()` hands the new-range array to senders.
- `cow_signal_last_scan()` sets `g_scanner_freeze_signal`. Scanners do a frozen `PAGEMAP_SCAN` (no range chunking, no WP_MATCHING — just read `PAGE_IS_WRITTEN`) over their slice of each VMA and push to the MPMC queue.
- Skeleton dump loop: `dump_one_task()` for every item. The phase flag (`COW_PHASE_SCAN`) makes `cow_is_phased_skeleton_dump()` return true, so page-dump paths short-circuit (`generate_iovs`, etc.).
- `cr_dump_post_task_operations()`.
- `cow_wait_p3_threads()` — senders drain the MPMC queue and then send any pages from the new-VMA ranges (`send_new_vma_pages`, split round-robin by thread id).
- `cow_free_new_vma_ranges()`; `cow_set_phase(COW_PHASE_DONE)`.
- `write_img_inventory()` (still frozen).
Replica side (Phase 3): receivers keep buffering final-scan batches into the batch buffer, overwriting existing entries on re-sends. Drain has still not started.
Phase 4 — Unfreeze + Drain + Replica Restore
Primary:
`cr_dump_finish` unfreezes tasks, sends `PS_IOV_ALL_PAGES_SENT`, waits for `PS_IOV_ALL_PAGES_SENT_ACK`, then calls `cow_cleanup_async_uffd()` (currently guarded by a hardcoded 15s sleep in the code; the cleanup path unregisters VMAs in chunks with 10ms yields every 10 VMAs, then closes the uffd).
Replica (serial — restore does NOT run concurrently with drain):
1. `criu restore` connects to the lazy socket (`handle_lazy_accept` in `criu/uffd.c:1616`). Drain is not started here — tasks are still running (`criu/uffd.c:1667-1680`).
2. Restore catches all tasks via `PTRACE_INTERRUPT`, then sends `LAZY_PAGES_TASKS_FROZEN` over the lazy socket.
3. `lazy_sk_read_event` receives `TASKS_FROZEN` and calls `cow_handle_lazy_accept_post_connect()` → `cow_start_drain_thread()` (`criu/uffd.c:1545-1551`, `cow-uffd.c:1842-1853`). Drain threads walk `chunk_index` (work-stealing via `next_drain_chunk`) and issue `UFFDIO_COPY` for whole 1MB batches, freeing page-pool chunks as each empties. `EAGAIN` → retry queue; `EEXIST`/`ENOENT` → soft-handle.
4. `cow_phase3_restore_loop` blocks on `while (cow_drain_thread_running() || cow_page_buffer_count() > 0)` (`criu/uffd.c:1796`). Restore only proceeds past this point after the batch buffer is empty.
5. Restore unfreezes the tasks and sends `PS_IOV_ALL_PAGES_SENT_ACK` back to the primary; the primary closes the sockets.
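The drain threads' control flow can be sketched with the `UFFDIO_COPY` outcome abstracted into a per-chunk result array so the steal/retry logic is testable on its own. Everything here is illustrative except the cursor name `next_drain_chunk` and the error policy from the design:

```c
#include <assert.h>
#include <errno.h>

struct drain_state {
	int nr_chunks;
	unsigned long next_drain_chunk;	/* shared work-steal cursor */
	int retry[16];			/* chunks to revisit on EAGAIN */
	int nr_retry;
};

/* One drain thread's pass: claim chunks by atomic fetch-add, attempt the
 * 1MB copy (simulated by copy_result[c], standing in for the real
 * cow_uffd_copy_pages() call), and queue EAGAIN chunks for retry.
 * Returns the number of chunks completed (copied or soft-handled). */
static int drain_loop(struct drain_state *s, const int *copy_result)
{
	int done = 0;

	for (;;) {
		unsigned long c = __atomic_fetch_add(&s->next_drain_chunk, 1,
						     __ATOMIC_RELAXED);
		int err;

		if (c >= (unsigned long)s->nr_chunks)
			break;
		err = copy_result[c];
		if (err == -EAGAIN)	/* kernel busy: retry later */
			s->retry[s->nr_retry++] = (int)c;
		else if (err == 0 || err == -EEXIST || err == -ENOENT)
			done++;		/* copied, or soft-handled */
	}
	return done;
}
```

After the main pass, the real threads re-walk the retry queue until it empties, then the page-pool chunks' refcounts release the memory.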
Key Data Structures
Dump-side state — struct cow_dump_info (cow-dump.c)
Tracks the active dump: source pid, destination id (valid only after Phase 3 pstree collection), the WP_ASYNC uffd fd returned by the parasite, and the list of tracked VMAs (which grows when new VMAs are detected in Phase 3). A phase enum cycles IDLE → ASYNC_BULK → SCAN → DONE.
Lazy VMA — struct lazy_vma_entry (cow-mem.c)
One per tracked VMA on the dump side: [start, end) range, back-pointer to the vma_area (NULL for regions discovered in Phase 3), total page count, and the dst_id / source_pid that identify which task the VMA belongs to. Entries form a global list used by scanners and bulk senders.
Dirty region — struct dirty_region_entry (cow-bulk-send.h)
Minimal record pushed from scanners into the convergence queue: the [start, end) dirty range plus dst_id / source_pid. Senders read these and issue process_vm_readv to grab the pages.
MPMC convergence queue (cow-bulk-send.c)
A flat 8M-slot array (CONV_QUEUE_CAP) of region pointers with two atomic cursors. Scanners produce by fetch-add on the head and store their region pointer in the claimed slot. P3 senders consume by CAS on the tail — one region per successful CAS, then a short busy-read until the slot's pointer becomes visible. No empty-queue polling waste and no per-slot lock.
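A minimal single-file sketch of that queue follows: producers fetch-add the head and store into the claimed slot; consumers CAS the tail forward, then read the slot. Names and the tiny capacity are illustrative; the real queue busy-reads until the producer's store lands, which the NULL sentinel below stands in for:

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 8 /* real queue: CONV_QUEUE_CAP (8M) */

struct conv_queue {
	void *slots[QCAP];
	unsigned long head;	/* next slot to fill */
	unsigned long tail;	/* next slot to drain */
};

static void conv_push(struct conv_queue *q, void *region)
{
	unsigned long h = __atomic_fetch_add(&q->head, 1, __ATOMIC_ACQ_REL);

	__atomic_store_n(&q->slots[h % QCAP], region, __ATOMIC_RELEASE);
}

static void *conv_pop(struct conv_queue *q)
{
	for (;;) {
		unsigned long t = __atomic_load_n(&q->tail, __ATOMIC_ACQUIRE);
		void *r;

		if (t >= __atomic_load_n(&q->head, __ATOMIC_ACQUIRE))
			return NULL;	/* queue empty */
		if (!__atomic_compare_exchange_n(&q->tail, &t, t + 1, 0,
						 __ATOMIC_ACQ_REL,
						 __ATOMIC_ACQUIRE))
			continue;	/* lost the race, retry */
		/* busy-read until the producer's store becomes visible */
		while (!(r = __atomic_load_n(&q->slots[t % QCAP],
					     __ATOMIC_ACQUIRE)))
			;
		/* clearing the slot is a sketch simplification for reuse */
		__atomic_store_n(&q->slots[t % QCAP], NULL, __ATOMIC_RELEASE);
		return r;
	}
}
```

The CAS on the tail is what guarantees one-region-per-claim across all sender threads; an empty queue is detected by comparing the cursors, so idle consumers never spin on a lock.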
Replica batch buffer — struct batch_buffer_entry (cow-uffd.c)
One entry per 1MB-aligned region of the target process: a data pointer into the page pool (256 contiguous 4KB pages), a page_bitmap tracking which of the 256 pages have arrived, plus a secondary bitmap for drain accounting.
- Primary index: hash table, `COW_BATCH_BUFFER_HASH_SIZE` = 256K buckets, one spinlock per bucket.
- Secondary index: `chunk_index[COW_MAX_POOL_CHUNKS]` maps each page-pool chunk to the list of batches that came out of it — drain can free chunks progressively rather than waiting for the whole buffer to empty.
Page pool (page-pool.c)
- Chunk = `COW_CHUNK_SIZE` (64MB), chunk-aligned, max `COW_MAX_POOL_CHUNKS` (8192) ⇒ 512GB ceiling.
- Per-thread bump allocator (`COW_MAX_THREADS` slots); ownership transfer via `page_pool_get_pages()` / `page_pool_put()`.
- Atomic refcount per chunk header (page 0 of each chunk); `munmap()` once the refcount drops to 0.
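The chunk lifetime rule reduces to a small refcount protocol. In this sketch `munmap()` is replaced by a flag so the logic is testable; the function names are illustrative stand-ins for the pool's get/put pair:

```c
#include <assert.h>

struct pool_chunk {
	unsigned long refcount;	/* lives in page 0 of the real 64MB chunk */
	int unmapped;		/* stand-in for the chunk being munmap()ed */
};

/* Take a reference while a batch still holds pages in this chunk. */
static void chunk_get(struct pool_chunk *c)
{
	__atomic_add_fetch(&c->refcount, 1, __ATOMIC_RELAXED);
}

/* Drop a reference; the last dropper frees the whole chunk, which is
 * how drain releases memory progressively as batches empty. */
static void chunk_put(struct pool_chunk *c)
{
	if (__atomic_sub_fetch(&c->refcount, 1, __ATOMIC_ACQ_REL) == 0)
		c->unmapped = 1; /* real code: munmap(chunk, COW_CHUNK_SIZE) */
}
```

Because the count is atomic, any drain thread can be the one that drops the last reference without coordination.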
Configuration
All tunables live in criu/include/cow/cow-conf.h. Key constants:
| Constant | Value | Purpose |
|---|---|---|
| `COW_BATCH_PAGES` | 256 | Pages per transfer batch (1MB) |
| `COW_BATCH_SIZE` | 1MB | Size of a batch (aligned on replica) |
| `COW_BATCH_SHIFT` | 20 | log2(batch size) |
| `COW_WP_CHUNK_SIZE` | 64MB | Parallel `UFFDIO_WRITEPROTECT` chunk |
| `COW_WORK_CHUNK_SIZE` | 32MB | Bulk work-stealing chunk |
| `COW_MAX_WORK_ITEMS` | 16384 | Bulk work queue cap |
| `COW_PAGEMAP_SCAN_RANGE_SIZE` | 4MB | Per-ioctl scan range (bounds `mmap_lock`) |
| `COW_PAGEMAP_SCAN_VEC_LEN` | 1000 | Max regions per `PAGEMAP_SCAN` ioctl |
| `COW_DIRTY_SCAN_FREEZE_THRESHOLD` | 2,000,000 | Pre-scan convergence target |
| `COW_PRE_SCAN_MAX_ITERATIONS` | 2 | Cap on pre-scan iterations |
| `COW_CHUNK_SIZE` | 64MB | Page-pool chunk size |
| `COW_MAX_POOL_CHUNKS` | 8192 | 64MB × 8192 = 512GB pool cap |
| `COW_BATCH_BUFFER_HASH_SIZE` | 256K | Replica batch-buffer buckets |
| `CONV_QUEUE_CAP` (in `cow-bulk-send.c`) | 8M | MPMC slot count |
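The size constants are interlocked; the checks below restate the table's arithmetic as compile-time assertions (the macro names match `cow-conf.h`, but the values are copied from the table here, not included from it):

```c
#include <assert.h>

#define PAGE_SIZE_4K        4096ULL
#define COW_BATCH_PAGES     256ULL
#define COW_BATCH_SHIFT     20
#define COW_BATCH_SIZE      (1ULL << COW_BATCH_SHIFT)	/* 1MB  */
#define COW_CHUNK_SIZE      (64ULL << 20)		/* 64MB */
#define COW_MAX_POOL_CHUNKS 8192ULL

_Static_assert(COW_BATCH_PAGES * PAGE_SIZE_4K == COW_BATCH_SIZE,
	       "256 x 4KB pages == one 1MB batch");
_Static_assert(COW_CHUNK_SIZE / COW_BATCH_SIZE == 64,
	       "each 64MB pool chunk holds 64 batches");
_Static_assert(COW_CHUNK_SIZE * COW_MAX_POOL_CHUNKS == (512ULL << 30),
	       "8192 x 64MB chunks == 512GB pool ceiling");
```

Keeping checks like these next to the tunables would catch a change to one constant that silently breaks a dependent one (e.g. shrinking the batch without adjusting the pool chunk geometry).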
Thread counts are profile-gated:
| Constant | COW_PROFILE_SMALL (default) | COW_PROFILE_LARGE |
|---|---|---|
| `COW_NUM_P3_THREADS` | 4 | 15 |
| `COW_NUM_P3_THREADS_BULK` | 1 | 15 |
| `COW_NUM_SCANNERS` | 4 | 20 |
| `COW_NUM_PRE_SCANNERS` | 1 | 1 |
| `COW_NUM_DRAIN_THREADS` | 4 | 10 |
| `COW_MAX_THREADS` | 16 | 33 |
Notable compile-time feature flags (also in cow-conf.h):
- `COW_PRE_SCAN` — iterative pre-freeze scan (on by default).
- `CONFIG_PAGE_STATE_TRACKER`, `CONFIG_HUNG_PAGE_TRACKER`, `CONFIG_COW_COMPARE`, `CONFIG_COW_COMPARE_PAGES` — diagnostics, off by default.
Protocol
Three distinct message channels are in use:
- TCP page-server — PRIMARY (dumper) ↔ REPLICA (lazy-pages daemon), over the network. One main control socket plus `COW_NUM_P3_THREADS` parallel P3 data sockets. Frames are `struct page_server_iov` headers with optional trailing payload.
- Lazy-pages UNIX socket — REPLICA lazy-pages daemon ↔ REPLICA `criu restore`, same host. Carries fixed-size `uint32_t` signal values.
- Parasite compel RPC — `criu dump` ↔ parasite thread injected into the target process (same host). Used during Phase 1 only.
TCP Page-server messages (cow-page-xfer.h)
Four message types are actively used on the page-server TCP channel. All share the same struct page_server_iov header; only PS_IOV_ADD_F_COMPRESS carries a trailing payload.
| Cmd | Dir | Socket | Sent during | Purpose |
|---|---|---|---|---|
| `PS_IOV_GET_ALL` (8) | R → P | main | Phase 2a start (once per task, from `cow_phase2` in `cow-lazy-pages.c:196-198`) | Replica requests bulk transfer for dst_id. Header-only (no payload). |
| `PS_IOV_ADD_F_COMPRESS` (10) | P → R | P3 (×N) | Phase 2a (bulk), Phase 2b (pre-scan dirty re-sends), Phase 3 (frozen final-scan dirty + new-VMA pages) | Compressed page batch. Header + 4-byte compressed_size + LZ4 payload. |
| `PS_IOV_ALL_PAGES_SENT` (16) | P → R | main | Phase 4 start — `send_all_pages_sent_signal()` from `cr_dump_finish` (`cr-dump.c:2401`) | Primary tells replica no more pages will be sent; replica may now transition from receiving into the drain/restore stage. Header-only. |
| `PS_IOV_ALL_PAGES_SENT_ACK` (17) | R → P | main | When the replica's Phase-2 event loop observes `cow_is_all_pages_sent_received()` (`cow-lazy-pages.c:294`) | ACK back to primary so it can close the page-server connection. Header-only. |
Reserved/not used in the COW path:
- `PS_IOV_ADD_F_PF` (9) — placeholder for per-fault push; the current COW flow uses `PS_IOV_ADD_F_COMPRESS` batches exclusively.
- `PS_IOV_START_RESTORE` (12) — defined in the header, never sent on the wire in the COW path.
PS_IOV_ADD_F_COMPRESS wire format (cow-bulk-send.c:send_pages_batch_compressed): the 24-byte page_server_iov header carries the encoded cmd, nr_pages (≤ COW_BATCH_PAGES = 256), base vaddr (may not be batch-aligned), and dst_id. Immediately after it comes a 4-byte compressed_size followed by that many bytes of LZ4-compressed page payload.
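The framing can be sketched with a local mirror of the header. The field names follow CRIU's `page_server_iov` (cmd, nr_pages, vaddr, dst_id); whether this fork's layout matches exactly is an assumption, and `compress_frame_len` is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 24-byte wire header, mirroring CRIU's page_server_iov. */
struct ps_iov_wire {
	uint32_t cmd;		/* encoded command, e.g. PS_IOV_ADD_F_COMPRESS */
	uint32_t nr_pages;	/* <= COW_BATCH_PAGES (256) */
	uint64_t vaddr;		/* batch base vaddr (may be unaligned) */
	uint64_t dst_id;	/* target task id on the replica */
} __attribute__((packed));

/* Total bytes on the wire for one PS_IOV_ADD_F_COMPRESS frame:
 * header + 4-byte compressed_size + LZ4 payload. */
static uint64_t compress_frame_len(uint32_t compressed_size)
{
	return sizeof(struct ps_iov_wire) + sizeof(uint32_t) + compressed_size;
}
```

A receiver thus reads a fixed 24 bytes, then 4 bytes of length, then exactly that many payload bytes — no delimiter scanning on the hot path.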
LZ4 acceleration is 1 on both the pre-freeze bulk path and the frozen convergence/new-VMA path. An experiment with acceleration=99 during convergence regressed total P3 wall-clock from 2.3s to 4.8s (the larger wire size shifted the bottleneck to `tcp_sendmsg`/`skb_page_frag_refill`).
UNIX Lazy-pages UNIX socket signals (criu/uffd.c)
Each signal is a fixed-size `uint32_t` cookie sent in one direction over the in-host UNIX socket between the lazy-pages daemon and criu restore. Two cookies (TASKS_FROZEN, DRAIN_COMPLETE) are COW-only; RESTORE_FINISHED exists in the non-COW path as well.
| Signal | Dir | Sent during | Purpose |
|---|---|---|---|
| `LAZY_PAGES_TASKS_FROZEN` | restore → lazy-pages | Phase 4 (after restore catches all tasks via `PTRACE_INTERRUPT`) — `uffd.c:1470` | Tells the lazy-pages daemon it is safe to start drain threads; the daemon responds by calling `cow_start_drain_thread()` (`uffd.c:1545-1551`, `cow-uffd.c:1842`). |
| `LAZY_PAGES_DRAIN_COMPLETE` | lazy-pages → restore | Phase 4, once the batch buffer is empty — `uffd.c:1866` | Tells restore it is safe to unfreeze the task tree; restore blocks on this signal in `lazy_pages_finish_restore()` (`uffd.c:1478-1489`). |
| `LAZY_PAGES_RESTORE_FINISHED` | restore → lazy-pages | End of restore — `uffd.c:1449, 1492` | Normal restore-completion signal (present in the non-COW path as well); terminates the lazy-pages event loop. |
RPC Parasite compel RPC (criu/pie/parasite.c)
| Cmd | Caller | Phase | Purpose |
|---|---|---|---|
| `PARASITE_CMD_COW_DUMP_INIT` | `cow_dump_init_async()` in `cow-dump.c` | Phase 1 | Inside the target process, open a userfaultfd with `UFFD_FEATURE_WP_ASYNC` and return the fd over the compel socket via `compel_util_recv_fd()`. |
Per-phase sequence
The sequence below shows who sends what in each phase. Channel tags: TCP page-server, UNIX lazy-pages socket, RPC parasite compel.

- Phase 1 — RPC: `PARASITE_CMD_COW_DUMP_INIT` → target process
- Phase 2a — TCP: `PS_IOV_GET_ALL` (main socket, once per task); `PS_IOV_ADD_F_COMPRESS` × many (N P3 sockets, LZ4 bulk batches)
- Phase 2b — TCP: `PS_IOV_ADD_F_COMPRESS` × iterations (dirty re-sends)
- Phase 3 — TCP: `PS_IOV_ADD_F_COMPRESS` (final frozen scan + new-VMA pages)
- Phase 4: `PS_IOV_ALL_PAGES_SENT` (main socket, TCP); restore `PTRACE_INTERRUPT`s tasks; `LAZY_PAGES_TASKS_FROZEN` (UNIX); drain via `UFFDIO_COPY` 1MB batches; `PS_IOV_ALL_PAGES_SENT_ACK` (main socket, TCP); `LAZY_PAGES_DRAIN_COMPLETE` (UNIX); `LAZY_PAGES_RESTORE_FINISHED` (UNIX)
Soft Error Handling (UFFDIO_COPY on replica)
Wrapper: cow_uffd_copy_pages() / cow_uffd_copy() in cow-uffd.c.
| Error | Meaning | Action |
|---|---|---|
| `EAGAIN` | Kernel busy | Queue for retry (drain EAGAIN queue or lpi queue) |
| `EEXIST` | Page already mapped | `PAGE_STATE_DISCARDED`; strict mode = BUG |
| `ENOENT` | Page unmapped between phases | Mark in `unmapped_tracker`, discard |
| other | Hard error | `BUG()` |
Flags: `COW_TRACK_STRICT` (BUG on EEXIST/ERROR, used during drain), `COW_TRACK_RETRY` (retry mode, returns -EAGAIN instead of queueing).
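The policy in the table can be restated as a pure decision function; the enum and function below are illustrative (the real wrapper also touches trackers and retry queues), with only the errno semantics taken from the table:

```c
#include <assert.h>
#include <errno.h>

enum copy_action { COPY_OK, COPY_RETRY, COPY_DISCARD, COPY_BUG };

/* Map a UFFDIO_COPY result to the soft-error policy from the table.
 * 'strict' models COW_TRACK_STRICT (drain mode): EEXIST becomes a BUG. */
static enum copy_action classify_uffd_copy_err(int err, int strict)
{
	switch (err) {
	case 0:
		return COPY_OK;
	case EAGAIN:
		return COPY_RETRY;	/* kernel busy: queue for retry */
	case EEXIST:			/* page already mapped */
		return strict ? COPY_BUG : COPY_DISCARD;
	case ENOENT:			/* page unmapped between phases */
		return COPY_DISCARD;	/* mark in unmapped_tracker */
	default:
		return COPY_BUG;	/* hard error */
	}
}
```

Separating classification from action keeps the retry queues and trackers out of the error-policy logic, which makes the strict/lenient split easy to audit.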
File Organization
criu/cow/ — C implementations
| File | ~Lines | Purpose |
|---|---|---|
| `cow-dump.c` | 1,043 | COW dump init/fini, WP apply, new-VMA detect, cleanup |
| `cow-mem.c` | 196 | `global_lazy_vmas` list management |
| `cow-page-xfer.c` | 258 | COW-specific PS_IOV commands, ACK handling |
| `cow-unified-thread.c` | 244 | Primary page-server thread; spawns P3 senders |
| `cow-bulk-send.c` | 2,300 | P3 senders, scanner threads, MPMC queue, work-stealing |
| `cow-bulk-recv.c` | 185 | Replica control-message receive (`PS_IOV_ALL_PAGES_SENT`) |
| `cow-p3-receiver.c` | 476 | Replica P3 receiver threads (LZ4 decompress → batch buffer) |
| `cow-uffd.c` | 1,856 | Replica batch buffer, drain threads, `UFFDIO_COPY`, stats |
| `cow-lazy-pages.c` | 308 | Lazy-page integration glue |
| `cow-compare.c` | 725 | PRIMARY/REPLICA state-compare debug mode |
| `page-pool.c` | 481 | Per-thread 64MB-chunk bump allocator |
| `page-state-tracker.c` | 851 | Optional per-page state machine (debug) |
| `hung-page-tracker.c` | 250 | Optional hung-page detector (debug) |
| `unmapped-tracker.c` | 225 | Replica-side unmapped-range set |
| `cow-bitmap.c` | 69 | Misc bitmap helpers |
criu/include/cow/ — headers
cow-conf.h (central tunables), cow-dump.h, cow-mem.h, cow-bulk-send.h, cow-bulk-recv.h, cow-page-xfer.h, cow-unified-thread.h, cow-uffd.h, cow-lazy-pages.h, cow-compare.h, cow-batch-bitmap.h, cow-bitmap.h, page-pool.h, page-state-tracker.h, hung-page-tracker.h, unmapped-tracker.h, pf-tracker.h, spsc-queue.h, spmc-queue.h, mpsc-queue.h.
`mpsc-queue.h` and `spsc-queue.h` are kept for the page-request queue (`cow-unified-thread.c`) and legacy interfaces. The scanner→sender hot path now uses the flat-array MPMC convergence queue defined directly in `cow-bulk-send.c`.
Integration Points
criu/cr-dump.c
- `cr_dump_tasks()` redirects to `cr_dump_tasks_cow_phased()` when `opts.cow_dump && opts.lazy_pages`.
- Inside `cr_dump_tasks_cow_phased()`:
  - Phase 1: `pre_dump_one_task` loop, `pstree_switch_state(TASK_ALIVE)`.
  - Phase 2a + 2b: `cr_page_server(false, true, -1)` (which internally invokes `cow_dump_init_async` + the parasite RPC earlier, then starts P3 threads via the unified thread). Bulk senders exhaust `g_work_queue` first (Phase 2a); scanners then wake and iterate `PAGEMAP_SCAN`/convergence-queue sends (Phase 2b). The loop `while (!cow_all_threads_below_threshold()) usleep(10ms)` holds until both complete.
  - Phase 3: `reseize_pstree`, `collect_pstree_ids`, `cow_set_dst_id`, `collect_mappings` + `cow_detect_new_vmas` + `cow_set_new_vma_ranges`, `cow_signal_last_scan`, skeleton `dump_one_task` loop, `cow_wait_p3_threads`, `cow_free_new_vma_ranges`, `cow_set_phase(COW_PHASE_DONE)`, `write_img_inventory`.
- `cr_dump_finish()`: sends `PS_IOV_ALL_PAGES_SENT`, calls `cow_cleanup_async_uffd()`, `cow_dump_fini()`.
criu/config.c
Option --cow-dump → opts.cow_dump (case 1105).
criu/mem.c
When opts.cow_dump, generate_iovs() calls cow_mem_add_lazy_vma() to append lazy-capable VMAs to global_lazy_vmas. In Phase 3 skeleton dump mode, it short-circuits page iteration (mdc.cow_skeleton_non_lazy = true).
criu/pie/parasite.c
PARASITE_CMD_COW_DUMP_INIT handler (parasite_cow_dump_init) creates the uffd with the requested features (UFFD_FEATURE_WP_ASYNC) and returns the fd over the compel socket.
Debugging
- `TIMING:` prefix on wall-clock logs at every phase boundary and major step; `TIMING @X.XXXXXX:` tags deltas relative to freeze start.
- `VMA_TRACE:` prefix on VMA-lifecycle logs (registration, scanning, new-VMA detection).
- `DEBUG_PERF:` per-scanner/sender iteration timing.
- `check_and_print_uffd_stats()` — UFFD fault histogram by batch size (`cow_get_histogram_bucket`).
- `g_compress_uncompressed_bytes` / `g_compress_compressed_bytes` — aggregate compression ratio. Thread-local counters are flushed at thread exit (one atomic per counter, not per send).
- `page-state-tracker.c` (off by default) keeps a 16-entry history per page for diagnosing migration divergence.
Usage
# PRIMARY (source)
sudo criu dump -t <pid> -D /images --cow-dump --lazy-pages \
--page-server --address <replica-ip> --port 27
# REPLICA (destination)
sudo criu page-server --images-dir /images --port 27 --lazy-pages
sudo criu restore --images-dir /images --lazy-pages
Known Limitations
- Single process tree: tracking metadata is shared-global (`g_cow_info`, `global_lazy_vmas`).
- Kernel requirements: `UFFD_FEATURE_WP_ASYNC` and the `PAGEMAP_SCAN` ioctl (both mainline as of Linux 6.7), or a fallback path on older kernels.
- Memory overhead: the replica uses up to 512GB worth of 64MB chunks (`COW_MAX_POOL_CHUNKS`), refcount-freed as drain progresses.
- UFFD cleanup: page-table walks during `UFFDIO_UNREGISTER` can take seconds on large memory; `cow_cleanup_async_uffd()` chunks the work and relies on close-on-exit for the rest. A hardcoded 15s sleep currently precedes cleanup (investigation outstanding).
- Network lock skipped: the COW path does not call `network_lock()`.
- New-VMA metadata gap: VMAs created between Phase 1 and Phase 3 have their pages transferred, but their VMA records are not re-dumped — they won't exist on the replica. Logged as `pr_err` at Phase 3.
Future Improvements
- Multi-process tree support (drop global singletons).
- Adaptive convergence threshold driven by measured write rate.
- Better handling of rapidly-dirtying workloads (e.g. the pin-to-CPU heuristic during bulk already in `COW_P3_SENDER_CPU`).
- Integration with container runtimes (containerd, CRI-O).