Operations
Day-2 concerns: auditing, offboarding, high availability, backpressure, WAL retention, and lake maintenance.
Verify
modak-worker verify --table public.my_dim
Compares heap against lake for mirrored tables: row count, tier-key min/max, and a PK checksum. The comparison is exact on an idle table; on a live table the current replication drift is reported alongside it. It also audits journaled row counts per tiered partition. Exits non-zero on mismatch. Run it after onboarding and periodically from cron.
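For the cron case, a small wrapper can check several tables in one run. The following is a minimal sketch: it relies only on the verify --table invocation and the non-zero exit code described above. The MODAK_WORKER variable and the function name are illustrative assumptions, not part of Modak.

```shell
# Sketch: verify each table passed as an argument; fail overall if any fails.
# MODAK_WORKER is a hypothetical override for the binary path, not a Modak setting.
verify_tables() {
  worker="${MODAK_WORKER:-modak-worker}"
  failed=0
  for table in "$@"; do
    # verify exits non-zero on mismatch; keep going so every table is reported.
    if ! "$worker" verify --table "$table"; then
      echo "verify FAILED for $table" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling verify_tables with one or more table names from a script that cron runs gives a non-zero status when any table mismatches, which cron mail or an external alerting hook can pick up.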
Unregister
modak-worker unregister --table public.my_dim # keep the lake table
modak-worker unregister --table public.my_dim --drop-lake # purge it
Removes the catalog rows (cutline, partitions, delta, pins, and journal all
cascade), drops the replication slot and publication, and resets
REPLICA IDENTITY. The lake table survives by default, because for tiered
tables it holds the only copy of reclaimed rows. --drop-lake purges it.
Unregistering an unknown table is a no-op that still cleans up any leftover
slot or publication.
High availability
Run as many workers as you like against the same database. They campaign for
a leader lease (a session-scoped Postgres advisory lock), so exactly one is
active. The rest stand by, and when the leader's lock session dies one of them
takes over within MODAK_CAMPAIGN_INTERVAL_SECONDS. Takeover includes the
mirrored tables' replication slots, whose stale holders are evicted
automatically. Every advance is guarded by protocol (monotonic catalog
updates, single-consumer slots, the pre-commit gap probe), so a delayed
step-down cannot corrupt state.
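Because the lease is an ordinary advisory lock, it is visible in the standard Postgres catalog views. The sketch below prints a query that lists every session holding an advisory lock. It does not assume which lock key Modak uses, so identifying the leader's row (for example by application name) is left to the operator; the function name is illustrative.

```shell
# Sketch: print SQL listing sessions that hold advisory locks.
# It lists every advisory lock in the cluster; Modak's lock key is not assumed.
advisory_lock_sql() {
  cat <<'SQL'
SELECT l.pid, a.application_name, a.client_addr, a.backend_start,
       l.classid, l.objid, l.granted
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
ORDER BY a.backend_start;
SQL
}
```

Pipe the output into psql against the database the workers share. If the leader's backend disappears from this list, a standby should acquire the lease on its next campaign.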
Backpressure
Mirror pump memory: a single transaction touching more than
MODAK_MIRROR_MAX_BUFFERED_ROWS rows (default 100k) is folded into the lake
in intermediate commits that never advance the frontier or the slot. Readers
never see a partial transaction, and worker memory stays flat even for a
100M-row UPDATE.
Delta backlog: the sweep logs WARN when a table's correction backlog passes
MODAK_DELTA_BACKLOG_WARN_ROWS (default 100k), and ERROR with a runbook at
four times that. A growing backlog means compaction is blocked (usually a
long-held read pin) or under-provisioned.
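The two-level threshold can be sketched as a small function. This is an illustration of the rule as described, not Modak's code: it assumes "passes" means strictly greater than, and that ERROR starts strictly above four times the warn threshold.

```shell
# Sketch: classify a measured value against a warn threshold.
# Assumption: WARN when value > threshold, ERROR when value > 4 * threshold.
alert_level() {
  value="$1"
  warn_at="$2"
  if [ "$value" -gt $((warn_at * 4)) ]; then
    echo ERROR
  elif [ "$value" -gt "$warn_at" ]; then
    echo WARN
  else
    echo OK
  fi
}

alert_level 250000 100000   # prints WARN
alert_level 400001 100000   # prints ERROR
```

With the default MODAK_DELTA_BACKLOG_WARN_ROWS of 100k, a backlog of 250k rows is a WARN and anything above 400k is an ERROR.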
Read pins: pins carry expires_at. Expired pins are ignored by the reclaim
and compaction horizon and deleted by the sweep, so a crashed reader can only
stall reclaim until its pin expires.
Slot WAL retention guard
A logical slot pins WAL until its consumer advances. If a mirror worker stays
down, Postgres keeps WAL segments and the disk fills. On each sweep the daemon
measures every modak_* slot and exports
modak_slot_retained_wal_bytes{slot}. Above MODAK_SLOT_WARN_BYTES it logs
WARN; above four times that it logs ERROR with this runbook. The guard never
drops a slot itself. When the alert fires:
- Mirror worker down? Start it (or a standby, takeover is automatic). The slot drains and WAL is recycled.
- Table abandoned? unregister it.
- As a hard cap, set max_slot_wal_keep_size so a runaway slot can never fill the disk. Past it the slot is invalidated and the table needs a fresh register.
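To check by hand what the guard measures, the standard pg_replication_slots view is enough. The sketch below prints two pieces of SQL: one listing retained WAL per modak_* slot, one setting the hard cap. It assumes PostgreSQL 13 or later (for wal_status and max_slot_wal_keep_size) and a connection to the primary. The 50GB value and the function names are placeholders, not recommendations from Modak.

```shell
# Sketch: print SQL that lists retained WAL per modak_* slot, largest first.
# Run on the primary; pg_current_wal_lsn() is not available during recovery.
slot_retention_sql() {
  cat <<'SQL'
SELECT slot_name,
       active,
       wal_status,  -- PostgreSQL 13+
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name LIKE 'modak\_%'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
SQL
}

# Sketch: print SQL that sets the hard cap. The 50GB value is a placeholder.
slot_cap_sql() {
  cat <<'SQL'
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
SQL
}
```

Pipe either function into psql. A slot whose wal_status reads lost has been invalidated by the cap, which is the case where the table needs a fresh register.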
Lake maintenance
The cold tier needs upkeep (small-file compaction, delete-debt compaction, snapshot expiry, metadata hygiene) and Modak owns it by default, with per-pass and per-table opt-outs for teams that already run their own. It is a topic of its own: see Lake maintenance, including what must stay with Modak (anything that deletes files, snapshot expiry above all) and what you can safely run externally.
Monitoring
Scrape /metrics (served by either binary) with Prometheus. The key series and
their meanings are listed in the metrics reference. The useful alerts are: the
sum of modak_leader not equal to 1, growth in modak_slot_retained_wal_bytes,
growth in modak_delta_backlog_rows, and sustained high modak_mirror_lag_bytes.
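Before the Prometheus alert is wired up, the leader check can be approximated from the command line. This is a rough sketch: it assumes modak_leader is exposed in the Prometheus text format as a 0/1 sample per worker with no trailing timestamp, and the function name is illustrative.

```shell
# Sketch: sum modak_leader samples read from stdin (Prometheus text format).
# Assumes samples carry no trailing timestamp, so the value is the last field.
leader_sum() {
  awk '$1 ~ /^modak_leader([{]|$)/ { sum += $NF } END { print sum + 0 }'
}
```

Concatenate the /metrics output of every worker (fetched with curl, for example) and pipe it into leader_sum. Any result other than 1 corresponds to the first alert in the list above.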