COW Dump Design Document

Generated from COW_DUMP_DESIGN.md. Covers the experimental COW dump path in this CRIU fork (--cow-dump).

Overview

The COW (Copy-on-Write) dump implementation is an experimental feature in this CRIU fork designed to minimize source process downtime during live migration. Instead of freezing the process for the entire dump duration, COW dump uses Linux's userfaultfd write-protect (UFFD_FEATURE_WP_ASYNC) to track memory writes while the process continues running, enabling incremental page transfer.

Goals

  1. Minimize downtime: Keep the source process running during the bulk of the memory transfer; only freeze for a short skeleton dump + final dirty flush.
  2. Parallel, compressed transfer: Multiple sender/receiver sockets, LZ4 compression, batched I/O.
  3. Convergence: Iteratively re-send dirty pages until a threshold is reached (overwriting older copies in the replica batch buffer), then do a single final frozen scan.
  4. Completeness: Capture dynamically created VMAs as well as the originally-tracked set.

Top-Level Architecture

The primary (source) and replica (destination) talk over per-thread TCP sockets (N connections). Each phase below lists what both sides are doing at that point.

Phase 1 — Seize + WP_ASYNC Initialization

Primary — seize + dirty-tracking init:
  • collect_pstree, pre_dump_one_task
  • Parasite opens UFFD with WP_ASYNC inside target
  • UFFDIO_WRITEPROTECT in parallel (64MB chunks)
  • Unfreeze — process runs with dirty tracking live
Replica: waiting for P3 connections.

Phase 2a — Bulk Transfer (senders only; scanners idle)

Primary — bulk transfer, P3 senders (COW_NUM_P3_THREADS_BULK):
  • Work-steal from g_work_queue (atomic fetch-add)
  • Chunk = COW_WORK_CHUNK_SIZE (32MB)
  • Batch = COW_BATCH_PAGES (256) = 1MB
  • process_vm_readv → LZ4 → per-thread socket
  • Scanner threads are spawned but blocked on g_bulk_transfer_done_count — they cannot start Phase 2b until every bulk sender exhausts the work queue
Replica — P3 receivers (N threads, 1 per socket):
  • LZ4 decompress into page-pool buffer
  • cow_page_buffer_add_batch() inserts a new 1MB-aligned entry into the batch buffer (hash, 256K buckets)
  • Entry = 1MB region + bitmap of present pages
  • Pages only accumulate — no drain, no restore yet

Phase 2b — Pre-Scan Convergence (after all bulk senders finish)

Primary — pre-scan + send dirty:
  • Scanners (COW_NUM_PRE_SCANNERS active): PAGEMAP_SCAN(PM_SCAN_WP_MATCHING) in 4MB sub-chunks
  • Iterate until dirty count < COW_DIRTY_SCAN_FREEZE_THRESHOLD or max iterations
  • Push dirty_region_entry into the MPMC convergence queue (flat array, atomic head/tail, CAS-claim one region per pop)
  • Consumers = all P3 sender threads (now released): pack regions into COW_BATCH_PAGES batches, then process_vm_readv → LZ4 → per-thread socket
Replica — P3 receivers, overwrite path:
  • On re-send, cow_page_buffer_get_data_ptr() finds the existing 1MB entry
  • LZ4 decompress directly into entry->data at the page offset
  • Older bulk copy is overwritten in place — zero alloc, zero memcpy (cow-p3-receiver.c:126-152)
  • Still no drain, still no restore

Phase 3 — Freeze + Skeleton Dump (primary wall-clock downtime)

Primary — freeze + skeleton dump + final flush:
  • reseize_pstree, collect_pstree_ids
  • cow_detect_new_vmas() vs Phase 1 tracked set
  • cow_set_new_vma_ranges(); cow_signal_last_scan()
  • Scanners do frozen final PAGEMAP_SCAN → MPMC queue
  • dump_one_task() per item (skeleton: no pages)
  • write_img_inventory()
Replica: receivers keep buffering final-scan batches, overwriting existing entries on re-send. Drain has still not started.

Phase 4 — Unfreeze + Drain + Restore (sequential on replica)

Primary — cr_dump_finish:
  • Unfreeze tasks
  • Send PS_IOV_ALL_PAGES_SENT  →
  • Wait for PS_IOV_ALL_PAGES_SENT_ACK  ←
  • cow_cleanup_async_uffd()
  • Close sockets
Replica:
  1. Restore connects — handle_lazy_accept(); tasks still running, no drain
  2. Catch tasks → TASKS_FROZEN — restore PTRACE_INTERRUPTs all tasks, then sends LAZY_PAGES_TASKS_FROZEN to the lazy-pages daemon
  3. Drain threads start — cow_start_drain_thread() on TASKS_FROZEN (uffd.c:1545); walk chunk_index (work-steal via next_drain_chunk); UFFDIO_COPY whole 1MB batches, freeing pool chunks as they empty; EAGAIN → retry queue; EEXIST/ENOENT → soft-handle
  4. Restore waits for drain — while (cow_drain_thread_running() || cow_page_buffer_count() > 0) (uffd.c:1796); restore does NOT run concurrently with drain
  5. Drain done — restore unfreezes the task tree and sends PS_IOV_ALL_PAGES_SENT_ACK back to the primary

Phases

Phase 1 — Seize + WP_ASYNC Initialization

Entry: cr_dump_tasks_cow_phased() in criu/cr-dump.c (redirected from cr_dump_tasks() when opts.cow_dump && opts.lazy_pages).

Steps:

  1. Seize process tree (collect_pstree, pre_dump_one_task for each item).
  2. Parasite RPC PARASITE_CMD_COW_DUMP_INIT opens a userfaultfd with UFFD_FEATURE_WP_ASYNC inside the target and returns the fd via compel_util_recv_fd() — see cow_dump_init_async() in criu/cow/cow-dump.c.
  3. cow_register_vmas() walks the VMA list and records eligible ones (writable, private or anon-shared, not guard/VVAR/droppable). For each it issues UFFDIO_REGISTER in UFFDIO_REGISTER_MODE_WP mode (register failures are tolerated — PAGEMAP_SCAN is the source of truth).
  4. cow_apply_writeprotect() splits all tracked VMAs into COW_WP_CHUNK_SIZE (64MB) ranges and issues UFFDIO_WRITEPROTECT in parallel from sysconf(_SC_NPROCESSORS_ONLN) threads.
  5. arch_set_thread_regs() + pstree_switch_state(TASK_ALIVE) — process resumes with write-protection in effect.
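
Step 4 boils down to a loop of UFFDIO_WRITEPROTECT ioctls over 64MB windows. Below is a minimal sketch of one worker's share of that work; the helper name and the way ranges are handed to threads are illustrative (the real code splits all tracked VMAs across sysconf(_SC_NPROCESSORS_ONLN) workers):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    #define COW_WP_CHUNK_SIZE (64UL << 20)  /* 64MB, as in cow-conf.h */

    /*
     * Write-protect [start, end) on the WP_ASYNC uffd in 64MB chunks.
     * Illustrative helper: the real code runs many such ranges in parallel,
     * one worker thread per online CPU.
     */
    static int wp_protect_range(int uffd, uint64_t start, uint64_t end)
    {
        uint64_t addr = start;

        while (addr < end) {
            uint64_t len = end - addr;
            struct uffdio_writeprotect wp;

            if (len > COW_WP_CHUNK_SIZE)
                len = COW_WP_CHUNK_SIZE;

            wp.range.start = addr;
            wp.range.len = len;
            wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;  /* set the WP bit */

            if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                return -1;

            addr += len;
        }
        return 0;
    }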

Phase 2a — Bulk Transfer

cr_page_server() establishes the PRIMARY side page-server and spawns the unified thread (cow-unified-thread.c) which accepts N P3 connections and calls cow_start_p3_threads().

Bulk transfer (cow-bulk-send.c)

  • A shared work queue g_work_queue is built by splitting each lazy VMA into COW_WORK_CHUNK_SIZE (32MB) chunks (bounded by COW_MAX_WORK_ITEMS = 16384).
  • Only COW_NUM_P3_THREADS_BULK threads participate in bulk; the rest sit idle until Phase 2b. Rationale: during bulk the receiver is usually the bottleneck, so extra senders contend for CPU.
  • Each thread pulls items via __atomic_fetch_add(&g_work_queue_next) and, for each item, reads COW_BATCH_PAGES (256) pages with process_vm_readv, LZ4-compresses them, and sends a PS_IOV_ADD_F_COMPRESS batch on its socket.
  • Scanner threads are started but block on g_bulk_transfer_done_count until every bulk sender has exhausted the work queue (cow-bulk-send.c:158-159, 304-317). There is no interleaving — Phase 2b begins only after all bulk work is done.
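
Put together, a bulk sender's inner loop is roughly the sketch below. The work_item layout, the exact types of the g_work_queue globals, and the send_compressed_batch() framing helper are assumptions standing in for the real code in cow-bulk-send.c:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/uio.h>
    #include <lz4.h>

    #define COW_BATCH_PAGES 256
    #define PAGE_SIZE_4K    4096

    struct work_item {            /* one 32MB chunk of a lazy VMA (layout assumed) */
        uint64_t start, end;
        int pid;                  /* source pid for process_vm_readv */
        uint64_t dst_id;
    };

    extern struct work_item g_work_queue[];
    extern unsigned long g_work_queue_len;
    extern unsigned long g_work_queue_next;   /* atomic work-steal cursor */

    /* assumed helper: frames header + 4-byte size + LZ4 payload on this thread's socket */
    int send_compressed_batch(int sk, uint64_t vaddr, uint64_t dst_id,
                              const char *buf, int clen, unsigned nr_pages);

    static int bulk_sender_loop(int sk)
    {
        static __thread char raw[COW_BATCH_PAGES * PAGE_SIZE_4K];
        static __thread char comp[LZ4_COMPRESSBOUND(COW_BATCH_PAGES * PAGE_SIZE_4K)];

        for (;;) {
            /* work-steal: claim the next 32MB chunk with an atomic fetch-add */
            unsigned long idx = __atomic_fetch_add(&g_work_queue_next, 1, __ATOMIC_RELAXED);
            if (idx >= g_work_queue_len)
                return 0;                         /* queue exhausted */

            struct work_item *w = &g_work_queue[idx];

            for (uint64_t addr = w->start; addr < w->end; addr += sizeof(raw)) {
                uint64_t len = w->end - addr;
                if (len > sizeof(raw))
                    len = sizeof(raw);

                struct iovec liov = { .iov_base = raw, .iov_len = len };
                struct iovec riov = { .iov_base = (void *)(uintptr_t)addr, .iov_len = len };

                /* read up to 256 pages straight out of the running process */
                if (process_vm_readv(w->pid, &liov, 1, &riov, 1, 0) != (ssize_t)len)
                    continue;                     /* e.g. VMA shrank; real code handles this */

                int clen = LZ4_compress_fast(raw, comp, len, sizeof(comp), 1);
                if (clen <= 0)
                    return -1;

                if (send_compressed_batch(sk, addr, w->dst_id, comp, clen,
                                          len / PAGE_SIZE_4K))
                    return -1;
            }
        }
    }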

Replica side (Phase 2a): P3 receivers LZ4-decompress each batch into a page-pool buffer and insert it as a new 1MB-aligned entry in the batch buffer hash table (cow_page_buffer_add_batch). No draining happens yet — pages only accumulate in the batch buffer.

Phase 2b — Pre-Scan + Send Dirty (Overwrite Older Pages)

Gated by COW_PRE_SCAN (default on). Triggered when all bulk senders have completed.

  • cow_start_scanner_thread() starts COW_NUM_SCANNERS scanner threads at the start of Phase 2. All of them are woken once bulk finishes, but only the first COW_NUM_PRE_SCANNERS actively scan in Phase 2b; the rest wait directly for the freeze signal.
  • Each scanner walks global_lazy_vmas, divides each VMA by pre-scanner count, and runs PAGEMAP_SCAN in COW_PAGEMAP_SCAN_RANGE_SIZE (4MB) sub-chunks with PM_SCAN_WP_MATCHING. The 4MB cap bounds mmap_lock hold time in the kernel.
  • Dirty regions become struct dirty_region_entry and are pushed into the MPMC convergence queue (a flat array g_conv_slots[CONV_QUEUE_CAP] with atomic g_conv_head/g_conv_tail; pop = CAS on tail, one region per claim).
  • Same P3 sender threads (now all COW_NUM_P3_THREADS of them) consume the queue, pack regions into COW_BATCH_PAGES-sized batches of dirty_slices, and issue a single multi-iov process_vm_readv per batch followed by LZ4 + send.
  • Scanners iterate until total dirty pages drop below COW_DIRTY_SCAN_FREEZE_THRESHOLD (2,000,000) or COW_PRE_SCAN_MAX_ITERATIONS is reached, then scanner 0 waits for the queue to drain and sets g_last_scan_flag. cow_all_threads_below_threshold() returns true → main thread moves to Phase 3.
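
The dirty query itself is a PAGEMAP_SCAN ioctl on /proc/<pid>/pagemap. A simplified single sub-chunk call might look like the sketch below, using the pm_scan_arg / page_region UAPI from recent kernels' linux/fs.h; the 4MB chunking loop, vec post-processing, and the queue push are omitted:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* struct pm_scan_arg, PAGEMAP_SCAN, PAGE_IS_WRITTEN (recent kernels) */

    #define COW_PAGEMAP_SCAN_VEC_LEN 1000

    /*
     * Scan [start, end) for pages written since Phase 1 and re-arm their
     * write-protection in the same call (PM_SCAN_WP_MATCHING).
     * Returns the number of dirty regions filled into `vec`, or -1.
     */
    static long scan_dirty_range(int pagemap_fd, uint64_t start, uint64_t end,
                                 struct page_region *vec)
    {
        struct pm_scan_arg arg = {
            .size          = sizeof(arg),
            .start         = start,
            .end           = end,
            .vec           = (uintptr_t)vec,
            .vec_len       = COW_PAGEMAP_SCAN_VEC_LEN,
            .flags         = PM_SCAN_WP_MATCHING,  /* re-write-protect matched pages */
            .category_mask = PAGE_IS_WRITTEN,      /* match only written (dirty) pages */
            .return_mask   = PAGE_IS_WRITTEN,
        };

        return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
    }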

Replica side (Phase 2b): on re-send of a page whose 1MB region is already present in the batch buffer, the receiver calls cow_page_buffer_get_data_ptr() to find the existing entry and LZ4-decompresses directly into entry->data at the page offset, overwriting the older bulk copy in place (cow-p3-receiver.c:126-152). Zero allocation, zero memcpy. Still no drain.
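
A sketch of that overwrite path follows. Only cow_page_buffer_get_data_ptr() comes from the text; the mark-present bitmap helper and the function shape are assumptions:

    #include <stdint.h>
    #include <lz4.h>

    #define PAGE_SIZE_4K 4096

    /* From the text: pointer into the existing 1MB entry at the page offset for
     * `vaddr`, or NULL if no entry covers it yet. */
    void *cow_page_buffer_get_data_ptr(uint64_t dst_id, uint64_t vaddr);
    /* assumed helper: set the corresponding bits in the entry's page_bitmap */
    void cow_page_buffer_mark_present(uint64_t dst_id, uint64_t vaddr, unsigned nr_pages);

    /* Re-send path: overwrite the stale bulk copy in place — no allocation, no memcpy. */
    static int receive_resend_batch(uint64_t dst_id, uint64_t vaddr,
                                    const char *comp, int clen, unsigned nr_pages)
    {
        char *dst = cow_page_buffer_get_data_ptr(dst_id, vaddr);
        if (!dst)
            return 1;   /* not present yet: fall back to cow_page_buffer_add_batch() */

        int want = nr_pages * PAGE_SIZE_4K;
        /* decompress straight into the entry's data at the page offset */
        if (LZ4_decompress_safe(comp, dst, clen, want) != want)
            return -1;

        cow_page_buffer_mark_present(dst_id, vaddr, nr_pages);
        return 0;
    }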

Phase 3 — Freeze + Skeleton Dump

The wall-clock time of this phase is the primary metric — this is the downtime window.

  1. reseize_pstree() + collect_pstree_ids().
  2. cow_set_dst_id(vpid(root_item)) — dst_id wasn't valid in Phase 1.
  3. collect_mappings() → cow_detect_new_vmas() compares against the Phase 1 tracked set and emits ranges for VMAs that appeared while the process was running. New ranges are also appended to tracked_vmas and global_lazy_vmas (add_lazy_vma_for_new_region).
  4. cow_set_new_vma_ranges() hands the new-range array to senders.
  5. cow_signal_last_scan() sets g_scanner_freeze_signal. Scanners do a frozen PAGEMAP_SCAN (no range chunking, no WP_MATCHING — just read PAGE_IS_WRITTEN) over their slice of each VMA and push to the MPMC queue.
  6. Skeleton dump loop: dump_one_task() for every item. Phase flag (COW_PHASE_SCAN) makes cow_is_phased_skeleton_dump() return true and page-dump paths short-circuit (generate_iovs, etc.).
  7. cr_dump_post_task_operations().
  8. cow_wait_p3_threads() — senders drain the MPMC queue and then send any pages from the new-VMA ranges (send_new_vma_pages, split round-robin by thread id).
  9. cow_free_new_vma_ranges(); cow_set_phase(COW_PHASE_DONE).
  10. write_img_inventory() (still frozen).

Replica side (Phase 3): receivers keep buffering final-scan batches into the batch buffer, overwriting existing entries on re-sends. Drain has still not started.

Phase 4 — Unfreeze + Drain + Replica Restore

Primary:

  • cr_dump_finish unfreezes tasks, sends PS_IOV_ALL_PAGES_SENT, waits for PS_IOV_ALL_PAGES_SENT_ACK, then calls cow_cleanup_async_uffd() (currently guarded by a hardcoded 15s sleep in the code; the cleanup path unregisters VMAs in chunks with 10ms yields every 10 VMAs, then closes the uffd).
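
The chunked unregister amounts to something like the sketch below. Only the UFFDIO_UNREGISTER ioctl and the 10-VMA/10ms rhythm come from the text; the tracked_vma layout and helper name are assumptions:

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    struct tracked_vma { uint64_t start, end; };  /* layout assumed */

    /*
     * Unregister tracked VMAs from the WP_ASYNC uffd, yielding every 10 VMAs so
     * a long page-table walk does not monopolize the CPU, then close the fd.
     */
    static void cleanup_async_uffd(int uffd, struct tracked_vma *vmas, int nr)
    {
        for (int i = 0; i < nr; i++) {
            struct uffdio_range r = {
                .start = vmas[i].start,
                .len   = vmas[i].end - vmas[i].start,
            };

            ioctl(uffd, UFFDIO_UNREGISTER, &r);   /* failures tolerated on teardown */

            if (i % 10 == 9)
                usleep(10 * 1000);                /* 10ms yield every 10 VMAs */
        }
        close(uffd);
    }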

Replica (serial — restore does NOT run concurrently with drain):

  1. criu restore connects to the lazy socket (handle_lazy_accept in criu/uffd.c:1616). Drain is not started here — tasks are still running (criu/uffd.c:1667-1680).
  2. Restore catches all tasks via PTRACE_INTERRUPT, then sends LAZY_PAGES_TASKS_FROZEN over the lazy socket.
  3. lazy_sk_read_event receives TASKS_FROZEN and calls cow_handle_lazy_accept_post_connect() → cow_start_drain_thread() (criu/uffd.c:1545-1551, cow-uffd.c:1842-1853). Drain threads walk chunk_index (work-stealing via next_drain_chunk) and issue UFFDIO_COPY for whole 1MB batches, freeing page-pool chunks as each empties. EAGAIN → retry queue; EEXIST/ENOENT → soft-handle.
  4. cow_phase3_restore_loop blocks on while (cow_drain_thread_running() || cow_page_buffer_count() > 0) (criu/uffd.c:1796). Restore only proceeds past this point after the batch buffer is empty.
  5. Restore unfreezes the tasks and sends PS_IOV_ALL_PAGES_SENT_ACK back to the primary; primary closes the sockets.

Key Data Structures

Dump-side state — struct cow_dump_info (cow-dump.c)

Tracks the active dump: source pid, destination id (valid only after Phase 3 pstree collection), the WP_ASYNC uffd fd returned by the parasite, and the list of tracked VMAs (which grows when new VMAs are detected in Phase 3). A phase enum cycles IDLE → ASYNC_BULK → SCAN → DONE.

Lazy VMA — struct lazy_vma_entry (cow-mem.c)

One per tracked VMA on the dump side: [start, end) range, back-pointer to the vma_area (NULL for regions discovered in Phase 3), total page count, and the dst_id / source_pid that identify which task the VMA belongs to. Entries form a global list used by scanners and bulk senders.

Dirty region — struct dirty_region_entry (cow-bulk-send.h)

Minimal record pushed from scanners into the convergence queue: the [start, end) dirty range plus dst_id / source_pid. Senders read these and issue process_vm_readv to grab the pages.

MPMC convergence queue (cow-bulk-send.c)

A flat 8M-slot array (CONV_QUEUE_CAP) of region pointers with two atomic cursors. Scanners produce by fetch-add on the head and store their region pointer in the claimed slot. P3 senders consume by CAS on the tail — one region per successful CAS, then a short busy-read until the slot's pointer becomes visible. No empty-queue polling waste and no per-slot lock.
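
A minimal sketch of that scheme is below; the real queue also deals with slot reuse, bounds, and back-pressure, which are skipped here:

    #include <stdint.h>

    #define CONV_QUEUE_CAP (8UL << 20)   /* 8M slots, as in cow-bulk-send.c */

    struct dirty_region_entry;           /* [start, end) + dst_id / source_pid */

    static struct dirty_region_entry *g_conv_slots[CONV_QUEUE_CAP];
    static unsigned long g_conv_head;    /* producers (scanners) bump this */
    static unsigned long g_conv_tail;    /* consumers (P3 senders) claim from this */

    /* Producer: claim a slot with fetch-add, then publish the region pointer. */
    static void conv_push(struct dirty_region_entry *e)
    {
        unsigned long slot = __atomic_fetch_add(&g_conv_head, 1, __ATOMIC_ACQ_REL);
        __atomic_store_n(&g_conv_slots[slot % CONV_QUEUE_CAP], e, __ATOMIC_RELEASE);
    }

    /* Consumer: CAS-claim one index, then busy-read until the producer's
     * pointer store becomes visible. Returns NULL when the queue is empty. */
    static struct dirty_region_entry *conv_pop(void)
    {
        for (;;) {
            unsigned long tail = __atomic_load_n(&g_conv_tail, __ATOMIC_ACQUIRE);
            if (tail >= __atomic_load_n(&g_conv_head, __ATOMIC_ACQUIRE))
                return NULL;                                  /* nothing queued */
            if (__atomic_compare_exchange_n(&g_conv_tail, &tail, tail + 1,
                                            false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
                struct dirty_region_entry *e;
                /* slot may not be published yet: spin briefly until it appears */
                while (!(e = __atomic_load_n(&g_conv_slots[tail % CONV_QUEUE_CAP],
                                             __ATOMIC_ACQUIRE)))
                    ;
                return e;
            }
        }
    }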

Replica batch buffer — struct batch_buffer_entry (cow-uffd.c)

One entry per 1MB-aligned region of the target process: a data pointer into the page pool (256 contiguous 4KB pages), a page_bitmap tracking which of the 256 pages have arrived, plus a secondary bitmap for drain accounting.
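
As a rough picture, an entry has the following shape; the field names and the bucket chaining are assumptions, not the fork's actual definitions:

    #include <stdint.h>

    #define COW_BATCH_PAGES 256

    /* Sketch of a replica batch-buffer entry as described above. */
    struct batch_buffer_entry {
        uint64_t base;                                  /* 1MB-aligned target address */
        uint64_t dst_id;                                /* which task it belongs to */
        char *data;                                     /* 256 x 4KB pages inside a page-pool chunk */
        uint64_t page_bitmap[COW_BATCH_PAGES / 64];     /* which of the 256 pages have arrived */
        uint64_t drained_bitmap[COW_BATCH_PAGES / 64];  /* secondary bitmap: drain accounting */
        struct batch_buffer_entry *next;                /* hash-bucket chain (256K buckets) */
    };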

Page pool (page-pool.c)

A per-thread bump allocator that hands out COW_CHUNK_SIZE (64MB) chunks, capped at COW_MAX_POOL_CHUNKS (512GB total). Replica batch-buffer entries point into these chunks; chunks are refcount-freed as drain empties the entries they back.

Configuration

All tunables live in criu/include/cow/cow-conf.h. Key constants:

Constant | Value | Purpose
COW_BATCH_PAGES | 256 | Pages per transfer batch (1MB)
COW_BATCH_SIZE | 1MB | Size of a batch (aligned on replica)
COW_BATCH_SHIFT | 20 | log2(batch size)
COW_WP_CHUNK_SIZE | 64MB | Parallel UFFDIO_WRITEPROTECT chunk
COW_WORK_CHUNK_SIZE | 32MB | Bulk work-stealing chunk
COW_MAX_WORK_ITEMS | 16384 | Bulk work queue cap
COW_PAGEMAP_SCAN_RANGE_SIZE | 4MB | Per-ioctl scan range (bounds mmap_lock)
COW_PAGEMAP_SCAN_VEC_LEN | 1000 | Max regions per PAGEMAP_SCAN ioctl
COW_DIRTY_SCAN_FREEZE_THRESHOLD | 2,000,000 | Pre-scan convergence target
COW_PRE_SCAN_MAX_ITERATIONS | 2 | Cap on pre-scan iterations
COW_CHUNK_SIZE | 64MB | Page-pool chunk size
COW_MAX_POOL_CHUNKS | 8192 | 64MB × 8192 = 512GB pool cap
COW_BATCH_BUFFER_HASH_SIZE | 256K | Replica batch-buffer buckets
CONV_QUEUE_CAP (in cow-bulk-send.c) | 8M | MPMC slot count

Thread counts are profile-gated:

Constant | COW_PROFILE_SMALL (default) | COW_PROFILE_LARGE
COW_NUM_P3_THREADS | 4 | 15
COW_NUM_P3_THREADS_BULK | 1 | 15
COW_NUM_SCANNERS | 4 | 20
COW_NUM_PRE_SCANNERS | 1 | 1
COW_NUM_DRAIN_THREADS | 4 | 10
COW_MAX_THREADS | 16 | 33

Notable compile-time feature flags also live in cow-conf.h, e.g. COW_PRE_SCAN (gates Phase 2b, default on), the COW_PROFILE_SMALL / COW_PROFILE_LARGE thread profiles, and COW_P3_SENDER_CPU (sender CPU pinning).

Protocol

Three distinct message channels are in use:

  1. Page-server TCP — PRIMARY (dumper) ↔ REPLICA (lazy-pages daemon), over the network. One main control socket plus COW_NUM_P3_THREADS parallel P3 data sockets. Frames are struct page_server_iov headers with optional trailing payload.
  2. Lazy-pages UNIX socket — REPLICA lazy-pages daemon ↔ REPLICA criu restore, same host. Carries fixed-size uint32_t signal values.
  3. Parasite compel RPC — criu dump ↔ parasite thread injected into the target process (same host). Used during Phase 1 only.

TCP Page-server messages (cow-page-xfer.h)

Four message types are actively used on the page-server TCP channel. All share the same struct page_server_iov header; only PS_IOV_ADD_F_COMPRESS carries a trailing payload.

PS_IOV_GET_ALL (8) — R → P, main socket. Sent at Phase 2a start (once per task, from cow_phase2 in cow-lazy-pages.c:196-198). Replica requests bulk transfer for dst_id. Header-only (no payload).

PS_IOV_ADD_F_COMPRESS (10) — P → R, P3 sockets (×N). Sent during Phase 2a (bulk), Phase 2b (pre-scan dirty re-sends), and Phase 3 (frozen final-scan dirty + new-VMA pages). Compressed page batch: header + 4-byte compressed_size + LZ4 payload.

PS_IOV_ALL_PAGES_SENT (16) — P → R, main socket. Sent at Phase 4 start by send_all_pages_sent_signal() from cr_dump_finish (cr-dump.c:2401). Primary tells the replica no more pages will be sent; the replica may now transition from receiving into the drain/restore stage. Header-only.

PS_IOV_ALL_PAGES_SENT_ACK (17) — R → P, main socket. Sent when the replica's Phase-2 event loop observes cow_is_all_pages_sent_received() (cow-lazy-pages.c:294). ACK back to the primary so it can close the page-server connection. Header-only.

Reserved/not used in the COW path:

PS_IOV_ADD_F_COMPRESS wire format (cow-bulk-send.c:send_pages_batch_compressed): the 24-byte page_server_iov header carries the encoded cmd, nr_pages (≤ COW_BATCH_PAGES = 256), base vaddr (may not be batch-aligned), and dst_id. Immediately after it comes a 4-byte compressed_size followed by that many bytes of LZ4-compressed page payload.
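
On the receive side this translates into: read the 24-byte header, read the 4-byte size, then read and decompress the payload. A hedged sketch, assuming recv_all() and buffer_slot_for() helpers that stand in for the real socket loop and batch-buffer lookup:

    #include <stdint.h>
    #include <stddef.h>
    #include <lz4.h>

    #define PAGE_SIZE_4K    4096
    #define COW_BATCH_PAGES 256

    /* 24-byte header as described above (mirrors CRIU's page_server_iov layout) */
    struct page_server_iov {
        uint32_t cmd;
        uint32_t nr_pages;
        uint64_t vaddr;
        uint64_t dst_id;
    };

    int recv_all(int sk, void *buf, size_t len);    /* assumed: loop over recv() */
    char *buffer_slot_for(uint64_t dst_id, uint64_t vaddr, uint32_t nr_pages); /* assumed */

    /* Read one PS_IOV_ADD_F_COMPRESS frame from a P3 socket and decompress it. */
    static int recv_compressed_batch(int sk)
    {
        static __thread char comp[LZ4_COMPRESSBOUND(COW_BATCH_PAGES * PAGE_SIZE_4K)];
        struct page_server_iov h;
        uint32_t clen;

        if (recv_all(sk, &h, sizeof(h)) || recv_all(sk, &clen, sizeof(clen)))
            return -1;
        if (clen > sizeof(comp) || recv_all(sk, comp, clen))
            return -1;

        char *dst = buffer_slot_for(h.dst_id, h.vaddr, h.nr_pages);
        int want = h.nr_pages * PAGE_SIZE_4K;

        return LZ4_decompress_safe(comp, dst, clen, want) == want ? 0 : -1;
    }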

LZ4 acceleration: 1 on both the pre-freeze bulk path and the frozen convergence/new-VMA path. An experiment with acceleration=99 during convergence regressed total P3 wall-clock from 2.3s → 4.8s (larger wire size shifted the bottleneck to tcp_sendmsg/skb_page_frag_refill).

Lazy-pages UNIX socket signals (criu/uffd.c)

Single-direction uint32_t cookie values over the in-host UNIX socket between the lazy-pages daemon and criu restore. Two cookies (TASKS_FROZEN, DRAIN_COMPLETE) are COW-only; RESTORE_FINISHED exists in the non-COW path as well.

LAZY_PAGES_TASKS_FROZEN — restore → lazy-pages. Sent in Phase 4, after restore catches all tasks via PTRACE_INTERRUPT (uffd.c:1470). Tells the lazy-pages daemon it is safe to start drain threads; the daemon responds by calling cow_start_drain_thread() (uffd.c:1545-1551, cow-uffd.c:1842).

LAZY_PAGES_DRAIN_COMPLETE — lazy-pages → restore. Sent in Phase 4, once the batch buffer is empty (uffd.c:1866). Tells restore it is safe to unfreeze the task tree; restore was blocked on this signal in lazy_pages_finish_restore() (uffd.c:1478-1489).

LAZY_PAGES_RESTORE_FINISHED — restore → lazy-pages. Sent at the end of restore (uffd.c:1449, 1492). Normal restore-completion signal (present in the non-COW path as well); terminates the lazy-pages event loop.

Parasite compel RPC (criu/pie/parasite.c)

PARASITE_CMD_COW_DUMP_INIT — called by cow_dump_init_async() in cow-dump.c during Phase 1. Inside the target process, opens a userfaultfd with UFFD_FEATURE_WP_ASYNC and returns the fd over the compel socket via compel_util_recv_fd().

Per-phase sequence

The sequence below shows who sends what in each phase, across the three parties (primary criu dump, replica lazy-pages daemon, replica criu restore). Channel tags: TCP = page-server, UNIX = lazy-pages socket, RPC = parasite compel.

Phase 1 — Seize + WP_ASYNC Initialization
  • RPC: primary → target process: PARASITE_CMD_COW_DUMP_INIT
  • Primary (local): UFFDIO_WRITEPROTECT

Phase 2a — Bulk Transfer
  • TCP: replica lazy-pages → primary: PS_IOV_GET_ALL (main socket, once per task)
  • TCP: primary → replica lazy-pages: PS_IOV_ADD_F_COMPRESS × many (N P3 sockets, LZ4 bulk batches)

Phase 2b — Pre-Scan + Send Dirty (overwrite older)
  • TCP: primary → replica lazy-pages: PS_IOV_ADD_F_COMPRESS × iterations (dirty re-sends)

Phase 3 — Freeze + Skeleton Dump (primary downtime)
  • TCP: primary → replica lazy-pages: PS_IOV_ADD_F_COMPRESS (final frozen scan + new-VMA pages)

Phase 4 — Unfreeze + Drain + Restore
  • TCP: primary → replica lazy-pages: PS_IOV_ALL_PAGES_SENT (main socket)
  • Replica criu restore (which was waiting) connects to the lazy-pages daemon over AF_UNIX
  • Restore catches tasks via PTRACE_INTERRUPT
  • UNIX: restore → lazy-pages: LAZY_PAGES_TASKS_FROZEN
  • Lazy-pages drain threads run: UFFDIO_COPY in 1MB batches (restore stays blocked, waiting for drain)
  • TCP: replica lazy-pages → primary: PS_IOV_ALL_PAGES_SENT_ACK (main socket)
  • Drain done, batch buffer empty
  • UNIX: lazy-pages → restore: LAZY_PAGES_DRAIN_COMPLETE
  • Restore unfreezes tasks
  • UNIX: restore → lazy-pages: LAZY_PAGES_RESTORE_FINISHED
  • Primary closes the page-server sockets

Soft Error Handling (UFFDIO_COPY on replica)

Wrapper: cow_uffd_copy_pages() / cow_uffd_copy() in cow-uffd.c.

Error | Meaning | Action
EAGAIN | Kernel busy | Queue for retry (drain EAGAIN queue or lpi queue)
EEXIST | Page already mapped | PAGE_STATE_DISCARDED; strict mode = BUG
ENOENT | Page unmapped between phases | Mark in unmapped_tracker, discard
other | Hard error | BUG()

Flags: COW_TRACK_STRICT (BUG on EEXIST/ERROR, used during drain), COW_TRACK_RETRY (retry mode, returns -EAGAIN instead of queueing).
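
A sketch of that policy around a single UFFDIO_COPY call; the retry-queue and tracker hooks are reduced here to return codes and comments:

    #include <errno.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Illustrative wrapper showing the soft-error decisions from the table above. */
    static int uffd_copy_soft(int uffd, uint64_t dst, void *src, uint64_t len)
    {
        struct uffdio_copy c = {
            .dst  = dst,
            .src  = (uintptr_t)src,
            .len  = len,
            .mode = 0,
        };

        if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
            return 0;

        switch (errno) {
        case EAGAIN:
            return -EAGAIN;     /* caller queues the batch for retry */
        case EEXIST:
            return 0;           /* page already mapped: discard (BUG() in strict mode) */
        case ENOENT:
            return 0;           /* range unmapped between phases: record and discard */
        default:
            return -1;          /* hard error: BUG() in the real code */
        }
    }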

File Organization

criu/cow/ — C implementations

File | ~Lines | Purpose
cow-dump.c | 1,043 | COW dump init/fini, WP apply, new-VMA detect, cleanup
cow-mem.c | 196 | global_lazy_vmas list management
cow-page-xfer.c | 258 | COW-specific PS_IOV commands, ACK handling
cow-unified-thread.c | 244 | Primary page-server thread; spawns P3 senders
cow-bulk-send.c | 2,300 | P3 senders, scanner threads, MPMC queue, work-stealing
cow-bulk-recv.c | 185 | Replica control-message receive (PS_IOV_ALL_PAGES_SENT)
cow-p3-receiver.c | 476 | Replica P3 receiver threads (LZ4 decompress → batch buffer)
cow-uffd.c | 1,856 | Replica batch buffer, drain threads, UFFDIO_COPY, stats
cow-lazy-pages.c | 308 | Lazy-page integration glue
cow-compare.c | 725 | PRIMARY/REPLICA state-compare debug mode
page-pool.c | 481 | Per-thread 64MB-chunk bump allocator
page-state-tracker.c | 851 | Optional per-page state machine (debug)
hung-page-tracker.c | 250 | Optional hung-page detector (debug)
unmapped-tracker.c | 225 | Replica-side unmapped-range set
cow-bitmap.c | 69 | Misc bitmap helpers

criu/include/cow/ — headers

cow-conf.h (central tunables), cow-dump.h, cow-mem.h, cow-bulk-send.h, cow-bulk-recv.h, cow-page-xfer.h, cow-unified-thread.h, cow-uffd.h, cow-lazy-pages.h, cow-compare.h, cow-batch-bitmap.h, cow-bitmap.h, page-pool.h, page-state-tracker.h, hung-page-tracker.h, unmapped-tracker.h, pf-tracker.h, spsc-queue.h, spmc-queue.h, mpsc-queue.h.

mpsc-queue.h and spsc-queue.h are kept for the page-request queue (cow-unified-thread.c) and legacy interfaces. The scanner→sender hot path now uses the flat-array MPMC convergence queue defined directly in cow-bulk-send.c.

Integration Points

criu/cr-dump.c

cr_dump_tasks() redirects to cr_dump_tasks_cow_phased() when opts.cow_dump && opts.lazy_pages; cr_dump_finish() later sends PS_IOV_ALL_PAGES_SENT, waits for the ACK, and runs cow_cleanup_async_uffd().

criu/config.c

Option --cow-dump → opts.cow_dump (case 1105).

criu/mem.c

When opts.cow_dump, generate_iovs() calls cow_mem_add_lazy_vma() to append lazy-capable VMAs to global_lazy_vmas. In Phase 3 skeleton dump mode, it short-circuits page iteration (mdc.cow_skeleton_non_lazy = true).

criu/pie/parasite.c

PARASITE_CMD_COW_DUMP_INIT handler (parasite_cow_dump_init) creates the uffd with the requested features (UFFD_FEATURE_WP_ASYNC) and returns the fd over the compel socket.
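
In plain (non-parasite) C the handler's core is roughly the open + UFFDIO_API handshake below; the real parasite code has to use PIE-safe syscall wrappers and then pass the fd over the compel socket, which is omitted here:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/userfaultfd.h>

    /* Open a userfaultfd inside the target and negotiate WP_ASYNC (needs recent
     * kernel headers for UFFD_FEATURE_WP_ASYNC). Returns the fd, or -1. */
    static int open_wp_async_uffd(void)
    {
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0)
            return -1;

        struct uffdio_api api = {
            .api      = UFFD_API,
            .features = UFFD_FEATURE_WP_ASYNC,   /* async write-protect tracking */
        };

        if (ioctl(uffd, UFFDIO_API, &api) || !(api.features & UFFD_FEATURE_WP_ASYNC)) {
            close(uffd);
            return -1;
        }

        return uffd;
    }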

Debugging

Debug aids include cow-compare.c (PRIMARY/REPLICA state-compare mode), page-state-tracker.c (optional per-page state machine), and hung-page-tracker.c (optional hung-page detector); see the file table above.

Usage

# PRIMARY (source)
sudo criu dump -t <pid> -D /images --cow-dump --lazy-pages \
    --page-server --address <replica-ip> --port 27

# REPLICA (destination)
sudo criu page-server --images-dir /images --port 27 --lazy-pages
sudo criu restore   --images-dir /images --lazy-pages

Known Limitations

  1. Single process tree: tracking metadata is shared-global (g_cow_info, global_lazy_vmas).
  2. Kernel requirements: Linux 5.7+ for UFFD_FEATURE_WP_ASYNC. PAGEMAP_SCAN (kernel 6.6+) or fallback path needed.
  3. Memory overhead: replica uses up to 512GB worth of 64MB chunks (COW_MAX_POOL_CHUNKS), refcount-freed as drain progresses.
  4. UFFD cleanup: page-table walks during UFFDIO_UNREGISTER can take seconds on large memory; cow_cleanup_async_uffd() chunks the work and relies on close-on-exit for the rest. A hardcoded 15s sleep currently precedes cleanup (investigation outstanding).
  5. Network lock skipped: COW path does not call network_lock().
  6. New-VMA metadata gap: VMAs created between Phase 1 and Phase 3 have their pages transferred but their VMA records are not re-dumped — they won't exist on the replica. Logged as pr_err at Phase 3.

Future Improvements

  1. Multi-process tree support (drop global singletons).
  2. Adaptive convergence threshold driven by measured write rate.
  3. Better handling of rapidly-dirtying workloads (pin-to-CPU heuristics for bulk senders already exist via COW_P3_SENDER_CPU).
  4. Integration with container runtimes (containerd, CRI-O).