Pick a DuckLake Cache Refresh Mode

You have a products table sitting in Parquet on local disk (or any object store DuckDB can read). You want a fast, snapshot-aware API in front of it. Which refresh mode should you pick?

flAPI doesn't have a strategy: knob. The refresh mode is derived from your cache block — specifically the combination of cursor and primary-key — per CacheManager::determineCacheMode in cache_manager.cpp:

no cursor             -> full
cursor only           -> append
cursor + primary-key  -> merge

This recipe takes the same Parquet backend and shows all three options side-by-side, with the trade-offs spelled out so you can pick the cheapest mode that still answers your business question.

The Common Backend

Every variant below uses the same connection. We pin to Parquet because it makes the example deterministic — every variant has identical I/O up to the WHERE clause.

# flapi.yaml
project-name: products-api
project-description: Three refresh modes against the same Parquet products table

template:
  path: './sqls'

connections:
  products-parquet:
    properties:
      path: './data/products.parquet'

duckdb:
  access_mode: READ_WRITE
  threads: 8

ducklake:
  enabled: true
  alias: cache
  metadata-path: ./data/cache.ducklake
  data-path: ./data/cache
  retention:
    keep-last-snapshots: 10
    max-snapshot-age: 30d
  compaction:
    enabled: true
    schedule: '@daily'
  scheduler:
    enabled: true
    scan-interval: 5m

The ducklake.alias: cache value becomes {{cache.catalog}} everywhere the refresh template runs.
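
For example, with the alias above and the endpoint cache block used throughout this recipe (schema: catalog, table: products_cache), the fully-qualified name in a refresh template renders like this (illustrative rendering; the ... elision follows the same convention as the templates below):

-- Template:
--   CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS ...
-- Rendered:
CREATE OR REPLACE TABLE cache.catalog.products_cache AS ...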

Assume the Parquet file has these columns:

Column        Type        Notes
id            bigint      Stable primary key
sku           varchar     Vendor SKU
name          varchar     Display name
category      varchar     e.g. electronics, apparel
price_cents   bigint      Current price
created_at    timestamp   Row insert time, never updated
updated_at    timestamp   Bumped on any change
deleted_at    timestamp   NULL for live rows, set on soft delete

Mode 1: Full Refresh

When: the table is a small reference table (under a few million rows), the schema is unstable, you don't trust your updated_at column, or you simply want the dumbest possible refresh.

Triggered by: no cursor. flAPI picks full automatically.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1h
  retention:
    keep-last-snapshots: 5
    max-snapshot-age: 7d

No cursor, no primary-key. CacheManager::determineCacheMode returns "full".

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: full. Rebuild the cache table from scratch on every schedule.
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
    id,
    sku,
    name,
    category,
    price_cents,
    created_at,
    updated_at,
    deleted_at,
    CURRENT_TIMESTAMP AS cache_updated_at,
    '{{cache.snapshotId}}' AS cache_snapshot_id
FROM read_parquet('{{conn.path}}')
WHERE deleted_at IS NULL
ORDER BY id;

Notes:

  • We do not reference {{cache.previousSnapshotTimestamp}} — full mode has no concept of "since last time".
  • {{cache.mode}} will be full here, available if you want to fork behaviour in the template.

Cost characteristic

O(N) per refresh. Every snapshot scans the entire Parquet file. Cheap with a few million rows, expensive at a billion.

Caveats

  • Every refresh produces a fully-rewritten DuckLake snapshot, so retention rules will eat through disk faster than incremental modes.
  • API readers see snapshot isolation, but you cannot answer "what changed since yesterday" from the cache itself — only from the audit table.

Mode 2: Incremental Append

When: append-only data (events, logs, orders that never get updated in place). You have a monotonically increasing timestamp or ID and don't care about updates or deletes.

Triggered by: cursor present, primary-key absent. flAPI picks append.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 5m
  cursor:
    column: created_at
    type: timestamp
  retention:
    keep-last-snapshots: 20
    max-snapshot-age: 14d

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: append. Pull rows created since the previous snapshot.
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT
    id,
    sku,
    name,
    category,
    price_cents,
    created_at,
    updated_at,
    deleted_at,
    CURRENT_TIMESTAMP AS cache_updated_at,
    '{{cache.snapshotId}}' AS cache_snapshot_id
FROM read_parquet('{{conn.path}}')
WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
  AND deleted_at IS NULL;

On the very first run, {{cache.previousSnapshotTimestamp}} is empty, so most teams ship a small bootstrap branch:

{{#if cache.previousSnapshotTimestamp}}
-- Incremental
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT ... FROM read_parquet('{{conn.path}}')
WHERE created_at > TIMESTAMP '{{cache.previousSnapshotTimestamp}}';
{{else}}
-- First load
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT ... FROM read_parquet('{{conn.path}}');
{{/if}}

Cost characteristic

O(delta) per refresh — only new rows are scanned and inserted. Cache grows monotonically until retention expires old snapshots.

Caveats

  • No update detection. If a row's price_cents changes in Parquet, the cache will not see it. Append mode only watches the cursor.
  • No delete detection. Hard deletes upstream are invisible. Soft deletes (deleted_at set) can be filtered at read time on the API side.
  • Pick cursor.column carefully: it must be monotonically non-decreasing. created_at is safe; updated_at is not (it can move backwards relative to insertion order during backfills).
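
A cheap way to catch drift from a bad cursor is an occasional reconciliation count. This is an ad-hoc sanity check, not a flAPI feature; it assumes the concrete names from this recipe and a DuckDB session with the DuckLake catalog attached:

-- Live source rows the append cursor has skipped (e.g. backfilled rows whose
-- created_at already lies behind the previous snapshot timestamp).
SELECT count(*) AS missed_rows
FROM read_parquet('./data/products.parquet') src
LEFT JOIN cache.catalog.products_cache c ON c.id = src.id
WHERE c.id IS NULL
  AND src.deleted_at IS NULL;

If missed_rows is persistently non-zero, your cursor is not monotonic and you should fall back to full or merge.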

Mode 3: Incremental Merge / Upsert

When: mutable rows. You have a reliable primary key, a trustworthy updated_at column, and you need the cache to reflect upstream changes.

Triggered by: both cursor and primary-key. flAPI picks merge.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1m
  primary-key: [id]
  cursor:
    column: updated_at
    type: timestamp
  rollback-window: 2d
  retention:
    keep-last-snapshots: 30
    max-snapshot-age: 7d
  delete-handling: soft

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: merge. Upsert by primary key, using updated_at as the cursor.
MERGE INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS tgt
USING (
    SELECT
        id,
        sku,
        name,
        category,
        price_cents,
        created_at,
        updated_at,
        deleted_at,
        CURRENT_TIMESTAMP AS cache_updated_at,
        '{{cache.snapshotId}}' AS cache_snapshot_id
    FROM read_parquet('{{conn.path}}')
    WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
    sku = src.sku,
    name = src.name,
    category = src.category,
    price_cents = src.price_cents,
    updated_at = src.updated_at,
    deleted_at = src.deleted_at,
    cache_updated_at = src.cache_updated_at,
    cache_snapshot_id = src.cache_snapshot_id
WHEN NOT MATCHED THEN INSERT VALUES (
    src.id, src.sku, src.name, src.category, src.price_cents,
    src.created_at, src.updated_at, src.deleted_at,
    src.cache_updated_at, src.cache_snapshot_id
);

Note we use updated_at as the cursor (not created_at) — that's the whole point of merge mode. {{cache.primaryKeys}} is available if you want to drive a generic template across multiple endpoints.
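
Because {{cache.primaryKeys}} renders to a comma-separated column list, it drops straight into clauses that accept one. A sketch of one generic use, under the assumption that your source slice can contain the same key twice (e.g. two updates inside one refresh window): dedupe inside the USING subquery so the MERGE never matches a target row twice.

-- Inside the USING (...) subquery; {{cache.primaryKeys}} renders to e.g. "id".
SELECT *
FROM read_parquet('{{conn.path}}')
WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
QUALIFY row_number() OVER (
    PARTITION BY {{cache.primaryKeys}}
    ORDER BY {{cache.cursorColumn}} DESC
) = 1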

Cost characteristic

O(updates + inserts) per refresh. Cheap as long as your upstream system updates only a small fraction of rows per cycle.

Caveats

  • Hard deletes still need help. A MERGE only sees rows that come back from the source query; physically removed rows are invisible. If you can't trust upstream to soft-delete, you must periodically reconcile (e.g. a daily full sweep; see the sketch after this list). flAPI's delete-handling: soft|hard controls only how the cache stores deletes it has been told about; it does not invent delete detection.
  • The cursor.column must be set whenever a row changes. Stale updated_at values are silent data corruption in merge mode.
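
A minimal sketch of such a sweep, using the concrete names from this recipe. This is a statement you would schedule yourself (flAPI has no built-in hard-delete detection); it tombstones cache rows whose id has vanished from the source:

-- Daily reconciliation sweep: soft-delete cache rows that no longer exist upstream.
UPDATE cache.catalog.products_cache
SET deleted_at = CURRENT_TIMESTAMP
WHERE deleted_at IS NULL
  AND NOT EXISTS (
      SELECT 1
      FROM read_parquet('./data/products.parquet') src
      WHERE src.id = products_cache.id
  );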

Side-by-Side Comparison

Property                  Full                   Append                      Merge
Detects inserts           yes                    yes                         yes
Detects updates           yes                    no                          yes
Detects deletes           yes (via re-scan)      no                          only soft deletes
Scan cost per refresh     O(N)                   O(delta)                    O(delta)
Cache write cost          rewrite entire table   append rows                 upsert matching rows
Required source columns   none                   a monotonic cursor          primary key + cursor
Template complexity       lowest                 medium (bootstrap branch)   highest (MERGE statement)
Triggered by              no cursor              cursor only                 cursor + primary-key

Rule of thumb: start with full for anything under ~1M rows. Move to append when refresh time becomes annoying. Move to merge when you actually need to see updates.

Retention, Rollback, and Delete Handling

The same DuckLake controls apply to all three modes:

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1m
  primary-key: [id]
  cursor:
    column: updated_at
    type: timestamp
  retention:
    keep-last-snapshots: 30   # keep the 30 most recent snapshots
    max-snapshot-age: 7d      # ... or anything younger than 7 days
  rollback-window: 2d         # time-travel window for ad-hoc reads
  delete-handling: soft       # soft | hard

What each does:

  • retention.keep-last-snapshots — flAPI invokes ducklake_expire_snapshots(..., versions => ARRAY[0:N]) after each refresh.
  • retention.max-snapshot-age — flAPI invokes ducklake_expire_snapshots(..., older_than => CURRENT_TIMESTAMP - INTERVAL '...').
  • rollback-window — snapshots inside this window remain queryable for time-travel even if they would otherwise be expired.
  • delete-handling: soft — deleted rows are tombstoned in the cache (kept with a deletion marker).
  • delete-handling: hard — deleted rows are physically removed.
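
The expiry function is also callable by hand. A sketch using the alias from this recipe and the parameters named above; treat the exact signature as version-dependent and check your DuckLake release's documentation:

-- Manual, ad-hoc equivalent of what flAPI runs after a refresh:
CALL ducklake_expire_snapshots('cache', older_than => now() - INTERVAL '7 days');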

Every refresh — successful or failed — is recorded in <catalog>.audit.sync_events with sync_type set to full, append, merge, or garbage_collection. The audit table is created automatically by CacheManager::initializeAuditTables.
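
Since sync_type is the one column documented here, a first query against the audit table is safest when it leans only on that (DESCRIBE the table to see what else CacheManager writes):

-- How often each refresh mode has actually run:
SELECT sync_type, count(*) AS events
FROM cache.audit.sync_events
GROUP BY sync_type
ORDER BY events DESC;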

Reading the Cache from the API

Regardless of mode, the API endpoint template reads from the cache table, not the source. That's how you get sub-5ms responses.

-- sqls/products/products.sql
SELECT
    id,
    sku,
    name,
    category,
    price_cents
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#if request.category}}
  AND category = '{{request.category}}'
{{/if}}
{{#if request.id}}
  AND id = {{request.id}}
{{/if}}
  AND deleted_at IS NULL
ORDER BY id
LIMIT {{request.limit}};

Three things to notice:

  1. {{cache.catalog}}.{{cache.schema}}.{{cache.table}} is the same triple injected during refresh, so the API and refresh templates stay in sync via configuration, not copy-paste.
  2. Filtering deleted_at IS NULL at read time is what gives delete-handling: soft its name — the rows are still in the cache, just hidden.
  3. None of the {{cache.previousSnapshot*}} variables are populated on read-time requests. They only exist during a refresh. Don't reference them in API templates.
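
To make the rendering concrete, this is roughly what the template above produces for GET /products/?category=electronics&limit=10 (illustrative; whitespace and the skipped request.id branch collapse differently in practice):

SELECT
    id,
    sku,
    name,
    category,
    price_cents
FROM cache.catalog.products_cache
WHERE 1=1
  AND category = 'electronics'
  AND deleted_at IS NULL
ORDER BY id
LIMIT 10;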

Cache Template Variables Reference

These are the variables flAPI injects into your cache refresh template, verified against src/sql_template_processor.cpp:

Variable                              Populated when
{{cache.catalog}}                     always (from ducklake.alias)
{{cache.schema}}                      always (defaults to main)
{{cache.table}}                       always
{{cache.schedule}}                    when cache.schedule is set
{{cache.mode}}                        always (full, append, or merge)
{{cache.snapshotId}}                  once a current snapshot exists
{{cache.snapshotTimestamp}}           once a current snapshot exists
{{cache.previousSnapshotId}}          once a previous snapshot exists (i.e. second refresh onwards)
{{cache.previousSnapshotTimestamp}}   once a previous snapshot exists
{{cache.cursorColumn}}                when cursor is configured
{{cache.cursorType}}                  when cursor is configured
{{cache.primaryKeys}}                 when primary-key is configured (comma-separated)

Any other {{cache.*}} you see in a tutorial is wrong. There is no {{cache.strategy}}, no {{cache.lastRefresh}}, no {{cache.delta}}.

Choosing in 30 Seconds

Answer these in order and stop at the first "yes":

  1. Is the table under ~1M rows or do you not have a trustworthy timestamp? Use full.
  2. Is the table append-only (or do you genuinely not care about updates)? Use append.
  3. Do you have a stable primary key and an updated_at that is bumped on every change? Use merge.

If you can't answer "yes" to any of these, you have a data-model problem upstream that no cache can hide. Fix the source.
