Pick a DuckLake Cache Refresh Mode

You have a products table sitting in Parquet on local disk (or any object store DuckDB can read). You want a fast, snapshot-aware API in front of it. Which refresh mode should you pick?

flAPI doesn't have a strategy: knob. The refresh mode is derived from your cache block — specifically the combination of cursor and primary-key — per CacheManager::determineCacheMode in cache_manager.cpp:

no cursor             -> full
cursor only           -> append
cursor + primary-key  -> merge

This recipe takes the same Parquet backend and shows all three options side-by-side, with the trade-offs spelled out so you can pick the cheapest mode that still answers your business question.

The Common Backend

Every variant below uses the same connection. We pin to Parquet because it makes the example deterministic — every variant has identical I/O up to the WHERE clause.

# flapi.yaml
project-name: products-api
project-description: Three refresh modes against the same Parquet products table

template:
  path: './sqls'

connections:
  products-parquet:
    properties:
      path: './data/products.parquet'

duckdb:
  access_mode: READ_WRITE
  threads: 8

ducklake:
  enabled: true
  alias: cache
  metadata-path: ./data/cache.ducklake
  data-path: ./data/cache
  retention:
    keep-last-snapshots: 10
    max-snapshot-age: 30d
  compaction:
    enabled: true
    schedule: '@daily'
  scheduler:
    enabled: true
    scan-interval: 5m

The ducklake.alias: cache value becomes {{cache.catalog}} everywhere the refresh template runs.
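
For example, with the alias above and the endpoint cache block used throughout this recipe (schema: catalog, table: products_cache), the fully-qualified name in a refresh template renders like this (illustrative rendering; the ... elision follows the same convention as the templates below):

-- Template:
--   CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS ...
-- Rendered:
CREATE OR REPLACE TABLE cache.catalog.products_cache AS ...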

Assume the Parquet file has these columns:

Column        Type        Notes
id            bigint      Stable primary key
sku           varchar     Vendor SKU
name          varchar     Display name
category      varchar     e.g. electronics, apparel
price_cents   bigint      Current price
created_at    timestamp   Row insert time, never updated
updated_at    timestamp   Bumped on any change
deleted_at    timestamp   NULL for live rows, set on soft delete

Mode 1: Full Refresh

When: the table is a small reference table (under a few million rows), the schema is unstable, you don't trust your updated_at column, or you simply want the dumbest possible refresh.

Triggered by: no cursor. flAPI picks full automatically.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1h
  retention:
    keep-last-snapshots: 5
    max-snapshot-age: 7d

No cursor, no primary-key. CacheManager::determineCacheMode returns "full".

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: full. Rebuild the cache table from scratch on every schedule.
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
    id,
    sku,
    name,
    category,
    price_cents,
    created_at,
    updated_at,
    deleted_at,
    CURRENT_TIMESTAMP AS cache_updated_at,
    '{{cache.snapshotId}}' AS cache_snapshot_id
FROM read_parquet('{{conn.path}}')
WHERE deleted_at IS NULL
ORDER BY id;

Notes:

  • We do not reference {{cache.previousSnapshotTimestamp}} — full mode has no concept of "since last time".
  • {{cache.mode}} will be full here, available if you want to fork behaviour in the template.

Cost characteristic

O(N) per refresh. Every snapshot scans the entire Parquet file. Cheap with a few million rows, expensive at a billion.

Caveats

  • Every refresh produces a fully-rewritten DuckLake snapshot, so retention rules will eat through disk faster than incremental modes.
  • API readers see snapshot isolation, but you cannot answer "what changed since yesterday" from the cache itself — only from the audit table.

Mode 2: Incremental Append

When: append-only data (events, logs, orders that never get updated in place). You have a monotonically increasing timestamp or ID and don't care about updates or deletes.

Triggered by: cursor present, primary-key absent. flAPI picks append.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 5m
  cursor:
    column: created_at
    type: timestamp
  retention:
    keep-last-snapshots: 20
    max-snapshot-age: 14d

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: append. Pull rows created since the previous snapshot.
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT
    id,
    sku,
    name,
    category,
    price_cents,
    created_at,
    updated_at,
    deleted_at,
    CURRENT_TIMESTAMP AS cache_updated_at,
    '{{cache.snapshotId}}' AS cache_snapshot_id
FROM read_parquet('{{conn.path}}')
WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
  AND deleted_at IS NULL;

On the very first run, {{cache.previousSnapshotTimestamp}} is empty, so most teams ship a small bootstrap branch:

{{#if cache.previousSnapshotTimestamp}}
-- Incremental
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT ... FROM read_parquet('{{conn.path}}')
WHERE created_at > TIMESTAMP '{{cache.previousSnapshotTimestamp}}';
{{else}}
-- First load
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT ... FROM read_parquet('{{conn.path}}');
{{/if}}

Cost characteristic

O(delta) per refresh — only new rows are scanned and inserted. Cache grows monotonically until retention expires old snapshots.

Caveats

  • No update detection. If a row's price_cents changes in Parquet, the cache will not see it. Append mode only watches the cursor.
  • No delete detection. Hard deletes upstream are invisible. Soft deletes (deleted_at set) can be filtered at read time on the API side.
  • Pick cursor.column carefully: it must be monotonically non-decreasing. created_at is safe; updated_at is not (it can move backwards relative to insertion order during backfills).
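
A cheap way to catch drift from a bad cursor is an occasional reconciliation count. This is an ad-hoc sanity check, not a flAPI feature; it assumes the concrete names from this recipe and a DuckDB session with the DuckLake catalog attached:

-- Live source rows the append cursor has skipped (e.g. backfilled rows whose
-- created_at already lies behind the previous snapshot timestamp).
SELECT count(*) AS missed_rows
FROM read_parquet('./data/products.parquet') src
LEFT JOIN cache.catalog.products_cache c ON c.id = src.id
WHERE c.id IS NULL
  AND src.deleted_at IS NULL;

If missed_rows is persistently non-zero, your cursor is not monotonic and you should fall back to full or merge.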

Mode 3: Incremental Merge / Upsert

When: mutable rows. You have a reliable primary key, a trustworthy updated_at column, and you need the cache to reflect upstream changes.

Triggered by: both cursor and primary-key. flAPI picks merge.

Cache configuration

# sqls/products/products-rest.yaml
url-path: /products/
method: GET
connection:
  - products-parquet
template-source: products.sql

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1m
  primary-key: [id]
  cursor:
    column: updated_at
    type: timestamp
  rollback-window: 2d
  retention:
    keep-last-snapshots: 30
    max-snapshot-age: 7d
  delete-handling: soft

Refresh SQL

-- sqls/products/products_cache.sql
-- Mode: merge. Upsert by primary key, using updated_at as the cursor.
MERGE INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS tgt
USING (
    SELECT
        id,
        sku,
        name,
        category,
        price_cents,
        created_at,
        updated_at,
        deleted_at,
        CURRENT_TIMESTAMP AS cache_updated_at,
        '{{cache.snapshotId}}' AS cache_snapshot_id
    FROM read_parquet('{{conn.path}}')
    WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
    sku = src.sku,
    name = src.name,
    category = src.category,
    price_cents = src.price_cents,
    updated_at = src.updated_at,
    deleted_at = src.deleted_at,
    cache_updated_at = src.cache_updated_at,
    cache_snapshot_id = src.cache_snapshot_id
WHEN NOT MATCHED THEN INSERT VALUES (
    src.id, src.sku, src.name, src.category, src.price_cents,
    src.created_at, src.updated_at, src.deleted_at,
    src.cache_updated_at, src.cache_snapshot_id
);

Note we use updated_at as the cursor (not created_at) — that's the whole point of merge mode. {{cache.primaryKeys}} is available if you want to drive a generic template across multiple endpoints.
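
Because {{cache.primaryKeys}} renders to a comma-separated column list, it drops straight into clauses that accept one. A sketch of one generic use, under the assumption that your source slice can contain the same key twice (e.g. two updates inside one refresh window): dedupe inside the USING subquery so the MERGE never matches a target row twice.

-- Inside the USING (...) subquery; {{cache.primaryKeys}} renders to e.g. "id".
SELECT *
FROM read_parquet('{{conn.path}}')
WHERE {{cache.cursorColumn}} > TIMESTAMP '{{cache.previousSnapshotTimestamp}}'
QUALIFY row_number() OVER (
    PARTITION BY {{cache.primaryKeys}}
    ORDER BY {{cache.cursorColumn}} DESC
) = 1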

Cost characteristic

O(updates + inserts) per refresh. Cheap as long as your upstream system updates only a small fraction of rows per cycle.

Caveats

  • Hard deletes still need help. A MERGE only sees rows that come back from the source query; physically removed rows are invisible. If you can't trust upstream to soft-delete, you must periodically reconcile (e.g. a daily full sweep; see the sketch after this list). flAPI's delete-handling: soft|hard controls only how the cache stores deletes it has been told about; it does not invent delete detection.
  • The cursor.column must be set whenever a row changes. Stale updated_at values are silent data corruption in merge mode.
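
A minimal sketch of such a sweep, using the concrete names from this recipe. This is a statement you would schedule yourself (flAPI has no built-in hard-delete detection); it tombstones cache rows whose id has vanished from the source:

-- Daily reconciliation sweep: soft-delete cache rows that no longer exist upstream.
UPDATE cache.catalog.products_cache
SET deleted_at = CURRENT_TIMESTAMP
WHERE deleted_at IS NULL
  AND NOT EXISTS (
      SELECT 1
      FROM read_parquet('./data/products.parquet') src
      WHERE src.id = products_cache.id
  );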

Side-by-Side Comparison

Property                  Full                   Append                      Merge
Detects inserts           yes                    yes                         yes
Detects updates           yes                    no                          yes
Detects deletes           yes (via re-scan)      no                          only soft deletes
Scan cost per refresh     O(N)                   O(delta)                    O(delta)
Cache write cost          rewrite entire table   append rows                 upsert matching rows
Required source columns   none                   a monotonic cursor          primary key + cursor
Template complexity       lowest                 medium (bootstrap branch)   highest (MERGE statement)
Triggered by              no cursor              cursor only                 cursor + primary-key

Rule of thumb: start with full for anything under ~1M rows. Move to append when refresh time becomes annoying. Move to merge when you actually need to see updates.

Retention, Rollback, and Delete Handling

The same DuckLake controls apply to all three modes:

cache:
  enabled: true
  table: products_cache
  schema: catalog
  schedule: 1m
  primary-key: [id]
  cursor:
    column: updated_at
    type: timestamp
  retention:
    keep-last-snapshots: 30   # keep the 30 most recent snapshots
    max-snapshot-age: 7d      # ... or anything younger than 7 days
  rollback-window: 2d         # time-travel window for ad-hoc reads
  delete-handling: soft       # soft | hard

What each does:

  • retention.keep-last-snapshots — flAPI invokes ducklake_expire_snapshots(..., versions => ARRAY[0:N]) after each refresh.
  • retention.max-snapshot-age — flAPI invokes ducklake_expire_snapshots(..., older_than => CURRENT_TIMESTAMP - INTERVAL '...').
  • rollback-window — snapshots inside this window remain queryable for time-travel even if they would otherwise be expired.
  • delete-handling: soft — deleted rows are tombstoned in the cache (kept with a deletion marker).
  • delete-handling: hard — deleted rows are physically removed.
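
The expiry function is also callable by hand. A sketch using the alias from this recipe and the parameters named above; treat the exact signature as version-dependent and check your DuckLake release's documentation:

-- Manual, ad-hoc equivalent of what flAPI runs after a refresh:
CALL ducklake_expire_snapshots('cache', older_than => now() - INTERVAL '7 days');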

Every refresh — successful or failed — is recorded in <catalog>.audit.sync_events with sync_type set to full, append, merge, or garbage_collection. The audit table is created automatically by CacheManager::initializeAuditTables.
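
Since sync_type is the one column documented here, a first query against the audit table is safest when it leans only on that (DESCRIBE the table to see what else CacheManager writes):

-- How often each refresh mode has actually run:
SELECT sync_type, count(*) AS events
FROM cache.audit.sync_events
GROUP BY sync_type
ORDER BY events DESC;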

Reading the Cache from the API

Regardless of mode, the API endpoint template reads from the cache table, not the source. That's how you get sub-5ms responses.

-- sqls/products/products.sql
SELECT
    id,
    sku,
    name,
    category,
    price_cents
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#if request.category}}
  AND category = '{{request.category}}'
{{/if}}
{{#if request.id}}
  AND id = {{request.id}}
{{/if}}
  AND deleted_at IS NULL
ORDER BY id
LIMIT {{request.limit}};

Three things to notice:

  1. {{cache.catalog}}.{{cache.schema}}.{{cache.table}} is the same triple injected during refresh, so the API and refresh templates stay in sync via configuration, not copy-paste.
  2. Filtering deleted_at IS NULL at read time is what gives delete-handling: soft its name — the rows are still in the cache, just hidden.
  3. None of the {{cache.previousSnapshot*}} variables are populated on read-time requests. They only exist during a refresh. Don't reference them in API templates.
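
To make the rendering concrete, this is roughly what the template above produces for GET /products/?category=electronics&limit=10 (illustrative; whitespace and the skipped request.id branch collapse differently in practice):

SELECT
    id,
    sku,
    name,
    category,
    price_cents
FROM cache.catalog.products_cache
WHERE 1=1
  AND category = 'electronics'
  AND deleted_at IS NULL
ORDER BY id
LIMIT 10;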

Cache Template Variables Reference

These are the variables flAPI injects into your cache refresh template, verified against src/sql_template_processor.cpp:

Variable                              Populated when
{{cache.catalog}}                     always (from ducklake.alias)
{{cache.schema}}                      always (defaults to main)
{{cache.table}}                       always
{{cache.schedule}}                    when cache.schedule is set
{{cache.mode}}                        always (full, append, or merge)
{{cache.snapshotId}}                  once a current snapshot exists
{{cache.snapshotTimestamp}}           once a current snapshot exists
{{cache.previousSnapshotId}}          once a previous snapshot exists (i.e. second refresh onwards)
{{cache.previousSnapshotTimestamp}}   once a previous snapshot exists
{{cache.cursorColumn}}                when cursor is configured
{{cache.cursorType}}                  when cursor is configured
{{cache.primaryKeys}}                 when primary-key is configured (comma-separated)

Any other {{cache.*}} you see in a tutorial is wrong. There is no {{cache.strategy}}, no {{cache.lastRefresh}}, no {{cache.delta}}.

Choosing in 30 Seconds

Answer these in order and stop at the first "yes":

  1. Is the table under ~1M rows or do you not have a trustworthy timestamp? Use full.
  2. Is the table append-only (or do you genuinely not care about updates)? Use append.
  3. Do you have a stable primary key and an updated_at that is bumped on every change? Use merge.

If you can't answer "yes" to any of these, you have a data-model problem upstream that no cache can hide. Fix the source.
