Deep Dive · Storage Engine Internals

Inside SNKV: How a B-tree-Backed Key-Value Store Works at Every Layer

A technical walkthrough of SNKV's kvstore layer — from BLOBKEY encoding and persistent read transactions to two-level mutex locking, cached read cursors, and WAL auto-checkpointing.

SNKV Engineering · 2025 · C · B-Tree · WAL · Systems

The Four-Layer Stack

SNKV is a key-value store built directly on SQLite's B-tree engine, bypassing SQLite's SQL and VM layers entirely. Application code interacts exclusively through the kvstore_* public API defined in kvstore.c. Below that, pages flow through the pager's cache and journal, and ultimately reach disk via the VFS abstraction.

      Application Code                    kvstore_open / put / get / close ···
            ↕ public API
L4    kvstore.c ← this document           persistent read txn · cached cursors ·
                                          blob encoding · 2-level mutex
            ↕ BtreeInsert / IndexMoveto / ···
L3    btree.c — B-tree Engine             page traversal · cell splits · cursor management
            ↕ PagerGet / PagerWrite / ···
L2    pager.c — Page Cache & Journal      LRU cache · WAL frames · rollback journal
            ↕ OsRead / OsWrite / OsSync
L1    os.c / os_unix.c / os_win.c / os_kv.c — VFS
                                          read() · write() · fsync() · fcntl() locks

Because SNKV opens the B-tree via sqlite3BtreeOpen rather than sqlite3Open, it bypasses sqlite3OpenTail — the function that normally installs SQLite's default 1000-frame WAL auto-checkpoint hook. This has significant consequences for WAL management, addressed in the WAL Growth and Auto-Checkpoint section below.

Key Design Decision

Skipping the SQL/VDBE layer removes a substantial amount of per-query overhead (query planning, bytecode dispatch, schema locking) but requires SNKV to own all transaction lifecycle management that SQLite would otherwise handle automatically.

Core Data Structures

Three structs form the backbone of the runtime state:

kvstore.c — struct layout
KVStore
  ├─ pBt          → Btree*             B-tree handle
  ├─ db           → sqlite3*           required by btree internals
  ├─ pKeyInfo     → KeyInfo*           shared BLOBKEY comparator (all CFs)
  ├─ iMetaTable   → int                INTKEY B-tree root for CF metadata
  ├─ inTrans      → int                0=none  1=read  2=write
  ├─ isCorrupted  → int                set by integrity_check on failure
  ├─ closing      → int                set to 1 in kvstore_close()
  ├─ zErrMsg      → char*              last error string (owned)
  ├─ pMutex       → sqlite3_mutex*     SQLITE_MUTEX_RECURSIVE (store-level)
  ├─ pDefaultCF   → KVColumnFamily*    always-open default column family
  ├─ apCF[]       → KVColumnFamily**   open named column families
  ├─ stats        → KVStoreStats       {nPuts nGets nDeletes nIterations nErrors}
  ├─ walSizeLimit → int                auto-checkpoint every N commits (0=off)
  └─ walCommits   → int                commits since last auto-checkpoint

KVColumnFamily
  ├─ pKV          → KVStore*           parent store (back-pointer)
  ├─ zName        → char*              CF name (heap-allocated)
  ├─ iTable       → int                B-tree root page for this CF's data
  ├─ refCount     → int                reference count
  ├─ pMutex       → sqlite3_mutex*     SQLITE_MUTEX_RECURSIVE (CF-level)
  └─ pReadCur     → BtCursor*          cached read cursor (NULL = not yet open)

KVIterator
  ├─ pCF          → KVColumnFamily*    column family being iterated
  ├─ pCur         → BtCursor*          own cursor (NOT pCF->pReadCur)
  ├─ eof          → int                1 = past last entry
  ├─ isValid      → int                1 = cursor can be used
  ├─ ownsTrans    → int                1 = iterator opened its own read txn
  ├─ pKeyBuf/nKeyBuf                   reusable key buffer (grows via realloc)
  ├─ pValBuf/nValBuf                   reusable value buffer
  └─ pPrefix/nPrefix                   prefix filter (heap copy, or NULL)

BLOBKEY Encoding: How Every Key-Value Pair Lives on Disk

All user data is stored in SQLite's BLOBKEY B-tree tables — a variant where the entire sort key is an opaque byte string rather than a typed integer rowid. SNKV packs each key-value pair into a single blob cell with a 4-byte big-endian length prefix on the key:

BLOBKEY Cell Layout
┌─────────────┬──────────────────────┬──────────────────────┐
│  key_len    │     key_bytes        │    value_bytes       │
│  (4 bytes,  │    (nKey bytes)      │   (nValue bytes)     │
│  big-endian)│                      │                      │
└─────────────┴──────────────────────┴──────────────────────┘
 ◄── 4B ─────►◄──── nKey ───────────►◄──── nValue ─────────►

Total encoded size = 4 + nKey + nValue

B-tree comparison reads only bytes [4 .. 4+nKey]
→ entries sorted lexicographically by key
→ same key on BtreeInsert = overwrite (upsert semantics)

Example: put("user:1", 6, "Alice", 5)

┌──────────────┬────────────────────┬─────────────────────┐
│ 00 00 00 06  │  u  s  e  r  :  1  │  A  l  i  c  e     │
│  key_len=6   │   (6 bytes)        │   (5 bytes)         │
└──────────────┴────────────────────┴─────────────────────┘
 offset:  0      4                   10                   15

The encoding function kvstoreEncodeBlob() uses a caller-supplied stack buffer for small payloads (≤ 512 bytes) and falls back to sqlite3Malloc only when the encoded size exceeds that threshold. This eliminates the heap round-trip on virtually all short key-value pairs in typical workloads.
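
A minimal sketch of that strategy, assuming the cell layout above. The real signature of kvstoreEncodeBlob is not shown in this document, so the parameter order and names here are illustrative:

#include <string.h>   /* memcpy */

/* Encode [key_len 4B BE][key][value] into aStack when it fits,
** otherwise into a fresh heap allocation. The caller frees the
** result only when it is not aStack. */
static unsigned char *encodeBlob(
  const void *pKey, int nKey,
  const void *pVal, int nVal,
  unsigned char *aStack, int nStack,
  int *pnEncoded
){
  int n = 4 + nKey + nVal;
  unsigned char *p = (n <= nStack) ? aStack : sqlite3Malloc(n);
  if( p==0 ) return 0;
  p[0] = (unsigned char)(nKey >> 24);   /* big-endian length prefix */
  p[1] = (unsigned char)(nKey >> 16);
  p[2] = (unsigned char)(nKey >> 8);
  p[3] = (unsigned char)(nKey);
  memcpy(&p[4], pKey, (size_t)nKey);
  memcpy(&p[4+nKey], pVal, (size_t)nVal);
  *pnEncoded = n;
  return p;
}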

CF Metadata Table (Internal)

Column family metadata uses an INTKEY B-tree — SQLite's row-id table variant. The rowid for each CF entry is the FNV-1a hash of its name, with linear probing on collision. The data blob uses an analogous length-prefixed layout: [name_len][name_bytes][root_page_4B_BE].

CF Metadata Table (INTKEY, root page stored in file header meta[3])

rowid = FNV-1a("logs")
data  = [00 00 00 04][l][o][g][s][00 00 00 05]
         name_len=4   name bytes   root_pgno=5

rowid = FNV-1a("metrics")
data  = [00 00 00 07][m][e][t][r][i][c][s][00 00 00 07]
         name_len=7   name bytes             root_pgno=7

Page-1 file header meta slots used by SNKV:
  meta[1] = Default CF root page number
  meta[2] = CF count (total named CFs)
  meta[3] = CF metadata table root page number
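
The hash itself is standard 32-bit FNV-1a; a sketch, with the probing strategy paraphrased from the description above:

#include <stdint.h>

/* Standard 32-bit FNV-1a over the CF name bytes. */
static uint32_t fnv1a32(const char *zName, int nName){
  uint32_t h = 2166136261u;            /* FNV offset basis */
  int i;
  for(i=0; i<nName; i++){
    h ^= (unsigned char)zName[i];
    h *= 16777619u;                    /* FNV prime */
  }
  return h;
}

/* On collision, probe rowids h, h+1, h+2, ... until the stored
** name matches (lookup) or an empty slot is found (insert). */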

After opening a fresh database, the page layout is deterministic:

Page layout after first open (new database, auto-vacuum enabled):

  Page 1: SQLite file header + schema root
  Page 2: Default CF table root (BLOBKEY, initially empty)
  Page 3: CF metadata table root (INTKEY, initially empty)
  Page 4: Pointer-map page (required by incremental auto-vacuum)

Opening a Store: kvstore_open_v2

kvstore_open_v2 is the primary open function. It accepts a KVStoreConfig struct controlling journal mode, sync level, cache size, page size, read-only access, busy-retry timeout, and WAL auto-checkpoint threshold. kvstore_open is a thin backward-compatible shim.

kvstore_open_v2 — initialization sequence
Step 1  sqlite3_initialize()

Step 2  Resolve config defaults (zero → default)
        journalMode == 0   →  KVSTORE_JOURNAL_WAL
        syncLevel   == 0   →  KVSTORE_SYNC_NORMAL
        cacheSize   == 0   →  2000 pages   (~8 MB at 4 KiB/page)
        pageSize    == 0   →  4096 bytes   (only applies to new DBs)
        busyTimeout == 0   →  no busy retry
        walSizeLimit== 0   →  WAL auto-checkpoint disabled

Step 3  Allocate KVStore { db, pMutex=RECURSIVE, inTrans=0, closing=0 }

Step 4  sqlite3BtreeOpen(vfs, path, db, &pBt, 0, flags)
        flags = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE
        (SQLITE_OPEN_READONLY if cfg.readOnly)

Step 5  sqlite3BtreeSetPageSize(pBt, pageSize, -1, 0)
        New DB  → sets page size before first write
        Old DB  → silently ignored; existing page size wins

Step 6  sqlite3BtreeSetCacheSize(pBt, cacheSize)

Step 7  Set pager sync flags
        syncFlag = (syncLevel + 1) | PAGER_CACHESPILL
        SYNC_OFF    (0) → 0x01  SYNCHRONOUS_OFF
        SYNC_NORMAL (1) → 0x02  SYNCHRONOUS_NORMAL     ← default
        SYNC_FULL   (2) → 0x03  SYNCHRONOUS_FULL

Step 8  Install busy handler (if busyTimeout > 0)
        KVBusyCtx { timeoutMs, pVfs } → heap-alloc via sqlite3_malloc
        db->busyHandler.xBusyHandler = kvstoreBusyHandler
        kvstoreBusyHandler(ctx, nBusy):
          if nBusy * 1ms >= timeoutMs → return 0 (give up)
          sqlite3OsSleep(pVfs, 1000 µs)      (sleep 1 ms)
          return 1                            (retry)

Step 9  sqlite3BtreeSetAutoVacuum(pBt, BTREE_AUTOVACUUM_INCR)

Step 10 sqlite3BtreeSetVersion(pBt, ver)
        ver=1 → Rollback journal (DELETE mode)
        ver=2 → WAL mode                         ← default
        sqlite3BtreeCommit(pBt)

Step 11 Detect new vs. existing database
        BtreeBeginTrans(0) → GetMeta(META_DEFAULT_CF_ROOT) → BtreeCommit
        root == 0  → NEW database: createDefaultCF() + initCFMetadataTable()
        root != 0  → EXISTING: restore KVColumnFamily{iTable=root}

Step 12 Allocate shared KeyInfo { enc=UTF8, nKeyField=1, db=pKV->db }
        Used by ALL BLOBKEY cursor comparisons across all CFs

Step 13 pKV->walSizeLimit = cfg.walSizeLimit
        pKV->walCommits   = 0

Step 14 Open persistent read transaction
        sqlite3BtreeBeginTrans(pBt, 0, 0)
        pKV->inTrans = 1         ← held for lifetime of store
        *ppKV = pKV
        return KVSTORE_OK
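
As a usage sketch: the KVStoreConfig field names below are inferred from Step 2 and may not match the header exactly.

KVStore *kv = 0;
KVStoreConfig cfg = {0};               /* zeroed fields resolve to defaults */
cfg.journalMode  = KVSTORE_JOURNAL_WAL;
cfg.syncLevel    = KVSTORE_SYNC_NORMAL;
cfg.cacheSize    = 2000;               /* pages (~8 MB at 4 KiB/page) */
cfg.busyTimeout  = 250;                /* ms of 1 ms retries on busy */
cfg.walSizeLimit = 1000;               /* PASSIVE checkpoint every 1000 commits */

if( kvstore_open_v2("app.db", &cfg, &kv)!=KVSTORE_OK ){
  /* inspect the store's error message, abort */
}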

Closing a Store

kvstore_close sets closing=1 under the mutex before freeing anything. Every public API function checks this flag immediately after acquiring the mutex and returns an error if it is set. A thread already mid-operation holds the mutex, so close blocks until it finishes; any thread that arrives afterward sees the flag and bails before it can touch freed memory.

kvstore_close — teardown order
sqlite3_mutex_enter(pKV->pMutex)
pKV->closing = 1                   ← racing threads bail here

if inTrans > 0:
  sqlite3BtreeRollback(pBt, SQLITE_OK, 0)
  pKV->inTrans = 0

for each open CF (default + named):
  if pCF->pReadCur: kvstoreFreeCursor(pCF->pReadCur)
  sqlite3_mutex_free(pCF->pMutex)
  free(pCF->zName), free(pCF)

sqlite3BtreeClose(pBt)             ← PagerClose → OsClose → close(fd)
if db->busyHandler.pBusyArg:
  sqlite3_free(db->busyHandler.pBusyArg)  ← free KVBusyCtx
free(pKeyInfo), free(zErrMsg)
sqlite3_mutex_free(db->mutex)
free(db)

sqlite3_mutex_leave(pKV->pMutex)
sqlite3_mutex_free(pKV->pMutex)    ← freed LAST, after all resources gone
free(pKV)
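
The flag check that every entry point performs looks roughly like this; the exact error code returned for a closing store is an assumption:

sqlite3_mutex_enter(pKV->pMutex);
if( pKV->closing ){                    /* close already in progress */
  sqlite3_mutex_leave(pKV->pMutex);
  return KVSTORE_ERROR;                /* exact code not shown in this doc */
}
/* ... operation body runs only while the store is live ... */
sqlite3_mutex_leave(pKV->pMutex);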

Transactions: Persistent Read and the WAL Slot Problem

SNKV holds a persistent read transaction for the entire lifetime of an open store. This means that kvstore_get and kvstore_exists — without any explicit kvstore_begin — pay zero BtreeBeginTrans/BtreeCommit overhead. The state machine tracks this with three inTrans values: 0 (none), 1 (read), 2 (write).

The WAL "read slot 0" problem is subtle: SQLite forbids upgrading a connection holding slot 0 to a write transaction, returning SQLITE_BUSY. The fix is to commit the read first, then begin a fresh write.

kvstore_begin — state machine
kvstore_begin(kv, wrflag)

  inTrans=2          → error: write transaction already active

  inTrans=1, wrflag=0 → no-op: persistent read already satisfies request
                         return KVSTORE_OK

  inTrans=1, wrflag=1 → must release read slot before upgrading:
                         sqlite3BtreeCommit(pBt)   ← free WAL read slot
                         inTrans = 0
                         → fall through to BeginTrans(write)

  inTrans=0, wrflag=0 → sqlite3BtreeBeginTrans(pBt, 0, 0) → inTrans=1

  inTrans=0, wrflag=1 → sqlite3BtreeBeginTrans(pBt, 1, 0) → inTrans=2

WHY commit before upgrading to write (WAL mode)?
  sqlite3BtreeOpen bypasses sqlite3OpenTail → no WAL hook registered.
  WAL read slot 0 is assigned when the WAL is empty (e.g. right after DB
  creation). walBeginWriteTransaction() returns SQLITE_BUSY if slot 0 is
  held. Committing the read releases the slot; the next BeginTrans(write)
  succeeds unconditionally.
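
In code, the upgrade path reduces to a commit-then-begin pair; a sketch using the same Btree calls named above, error handling elided:

if( pKV->inTrans==1 ){
  sqlite3BtreeCommit(pKV->pBt);            /* release WAL read slot 0 */
  pKV->inTrans = 0;
}
sqlite3BtreeBeginTrans(pKV->pBt, 1, 0);    /* wrflag=1: write transaction */
pKV->inTrans = 2;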

Transaction State Diagram

  kvstore_open()
        │
        ▼
   inTrans = 1  ─── get / exists ──────────────► inTrans = 1
 (persistent read)                                (unchanged)
        │
        │ begin(write)
        │   BtreeCommit(read) → inTrans=0
        │   BtreeBeginTrans(1) → inTrans=2
        ▼
   inTrans = 2
  (write active)
        │
        │ commit() or rollback()
        │   BtreeCommit / BtreeRollback → inTrans=0
        │   BtreeBeginTrans(0) ─────────► inTrans=1
        ▼
   inTrans = 1  (persistent read restored)

Auto-Transaction Pattern

Write operations that have no enclosing kvstore_begin manage their own single-operation transaction. They commit the persistent read, begin a write, perform the operation, commit, and restore the read — all atomically under the store mutex.

auto-transaction lifecycle (put / delete / cf_create / cf_drop)
[inTrans == 2]  write already active → proceed, autoTrans=0

[inTrans == 1]  persistent read active:
  sqlite3BtreeCommit(pBt)      ← release read lock / WAL slot
  inTrans = 0
  → fall through ↓

[inTrans == 0]  no transaction:
  sqlite3BtreeBeginTrans(pBt, 1, 0)
  autoTrans = 1
  inTrans = 2

··· perform operation ···

[SUCCESS + autoTrans]:
  sqlite3BtreeCommit(pBt)
  inTrans = 0                  ← TRANS_NONE window begins
  kvstoreAutoCheckpoint(pKV)   ← optional PASSIVE checkpoint in window
  sqlite3BtreeBeginTrans(pBt, 0, 0)
  inTrans = 1                  ← persistent read restored

[FAILURE + autoTrans]:
  sqlite3BtreeRollback(pBt, SQLITE_OK, 0)
  inTrans = 0
  // put/delete: persistent read NOT restored on error path
  // rollback(): persistent read IS always restored if !closing
  //   → kvstore_rollback always calls BtreeBeginTrans(0) after rollback

Benefit:
  kvstore_put(kv, "k", 1, "v", 1)          ← single call, no begin/commit
  // internally: commit_read → begin_write → insert → commit → begin_read

  kvstore_begin(kv, 1)                      ← explicit batch
  kvstore_put(kv, "k1", 2, "v1", 2)        // autoTrans=0, no commit
  kvstore_put(kv, "k2", 2, "v2", 2)        // autoTrans=0, no commit
  kvstore_commit(kv)                        // one fsync for both writes

Put, Get, Delete, Exists

Put — Stack-First Encoding + Write Cursor

Every kvstore_put acquires the CF mutex, then the store mutex (in that order — always), validates key/value sizes (key: 1–64 KiB; value: 0–10 MiB), and encodes the blob on the stack before touching the B-tree.

kvstore_put — critical path
sqlite3_mutex_enter(pCF->pMutex)        ← CF lock first
sqlite3_mutex_enter(pKV->pMutex)        ← then store lock

// Encode on stack — no heap alloc for small payloads
unsigned char stackBuf[512];
kvstoreEncodeBlob("user:1", 6, "Alice", 5, stackBuf, 512, &pEncoded)
// 4 + 6 + 5 = 15 bytes → fits stackBuf
// [00 00 00 06][user:1][Alice]

// Separate write cursor (not pCF->pReadCur)
kvstoreAllocCursor()
sqlite3BtreeCursor(pBt, iTable, wrflag=1, pKeyInfo, pCur)

BtreePayload payload = { .pKey = pEncoded, .nKey = 15 };
sqlite3BtreeInsert(pCur, &payload, 0, 0)
  │
  ├── IndexMoveto → traverse root → internal pages → leaf
  │     Compare search key using pKeyInfo comparator at each level
  ├── Key exists?  YES → overwrite cell in-place (upsert)
  │               NO  → insertCell(), rebalance if leaf overflows
  └── PagerWrite() on each modified page → marks page dirty

kvstoreFreeCursor(pCur)                 ← write cursor freed immediately
                                           pCF->pReadCur is UNTOUCHED
pKV->stats.nPuts++

// autoTrans commit + persistent read restore (see §auto-txn)

sqlite3_mutex_leave(pKV->pMutex)
sqlite3_mutex_leave(pCF->pMutex)

Get — Zero-Malloc Fast Path

The get path is the most performance-critical operation. SNKV eliminates two major sources of per-call overhead: (1) the cached read cursor means no BtreeCursor open/close, and (2) the stack-allocated UnpackedRecord avoids the heap-allocated version that SQLite's VDBE layer normally uses.

Operation           Before (naive)                        After (SNKV)
──────────────────  ────────────────────────────────────  ────────────────────────
BtreeCursor open    sqlite3MallocZero(cursorSz) + open    cached — zero cost
UnpackedRecord      VdbeAllocUnpackedRecord() → heap      stack alloc — zero cost
BtreeBeginTrans     called every get                      inTrans=1 → no-op
BtreeCommit         called every get                      autoTrans=0 → no-op
BtreeIndexMoveto    O(log n) — unavoidable                O(log n)
BtreePayload ×2     read key_len, read value              read key_len, read value
BtreeCloseCursor    called every get                      cursor stays open
kvstore_get — annotated hot path
// Get (or open) cached read cursor — zero alloc after first call
kvstoreGetReadCursor(pCF):
  if pCF->pReadCur != NULL → return it           // ← hot path: always true
  // cold path (first call only):
  kvstoreAllocCursor()
  BtreeCursor(pBt, iTable, 0, pKeyInfo, pCur)
  pCF->pReadCur = pCur

// Seek — entirely stack-allocated
UnpackedRecord idxKey;                           // ← stack
Mem            memField;                         // ← stack
idxKey.u.z = "user:1";
idxKey.n   = 6;
sqlite3BtreeIndexMoveto(pCur, &idxKey, &res)
// res==0 → exact match; res≠0 → key not present

if res != 0: return KVSTORE_NOTFOUND

// Decode value from cell payload
// Cell: [00 00 00 06][user:1][Alice]
BtreePayload(pCur, 0, 4, hdr)      → storedKeyLen = 6
valLen = payloadSz - 4 - 6         = 5
BtreePayload(pCur, 4+6, 5, buf)    → "Alice"

// pReadCur stays open — no close, no free
pKV->stats.nGets++
*ppValue = heap copy of value       // caller owns and must free
*pnValue = valLen
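
Caller-side, the only contract to remember is value ownership; a usage sketch, assuming returned values are released with sqlite3_free (the document does not name the free function):

void *pVal = 0; int nVal = 0;
int rc = kvstore_get(kv, "user:1", 6, &pVal, &nVal);
if( rc==KVSTORE_OK ){
  /* nVal==5, pVal → "Alice" (heap copy owned by the caller) */
  sqlite3_free(pVal);
}else if( rc==KVSTORE_NOTFOUND ){
  /* seek returned res != 0: key absent */
}
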
Cursor Safety After Writes

SQLite cursors transition to the CURSOR_REQUIRESEEK state when pages they reference are modified by an intervening write. The next call to BtreeIndexMoveto transparently restores the cursor's position before seeking. This makes reusing pReadCur across writes correct; no explicit invalidation is needed.

Delete

Delete follows the same auto-transaction pattern as put but uses a write cursor to seek and remove the cell. If the targeted leaf page underflows after removal, SQLite's B-tree engine rebalances by merging with a sibling. Freed pages go onto the internal freelist; the file does not shrink until a vacuum is performed.

Leaf before: [user:0][user:1][user:2]
                       ▲ BtreeDelete

Leaf after:  [user:0][user:2]
             (if page underflows → merge with sibling)
             freed pages → internal freelist
             (file size unchanged until kvstore_incremental_vacuum)

Exists — Cheapest Point Lookup

kvstore_exists is the cheapest operation: it uses the cached read cursor and performs the normal B-tree traversal, but reads zero bytes of payload; the key comparisons during the descent are the only work done. The caller gets back a boolean, and nothing is allocated on the hot path.


Column Families: Isolated B-Tree Tables with a Shared Store

Each column family is a separate BLOBKEY B-tree table rooted at its own page number. They share the same B-tree handle, pager, WAL file, and transaction context — meaning a single kvstore_begin/kvstore_commit can write atomically to multiple CFs.
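
That shared transaction context is what makes cross-CF batches atomic; a usage sketch (a handle-based kvstore_cf_put signature is an assumption — the document elsewhere abbreviates its arguments):

kvstore_begin(kv, 1);                              /* one write txn */
kvstore_cf_put(logsCF,    "2025-01-01", 10, "boot ok", 7);
kvstore_cf_put(metricsCF, "uptime",      6, "42",      2);
kvstore_commit(kv);                                /* one fsync, both CFs */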

kvstore_cf_create — creation sequence
kvstore_cf_create(kv, "logs", &cf)

  Validate name: non-null · 1–255 chars · not "default"

  Check CF does not already exist:
    kvstoreMetaSeekKey("logs") → found? → error (already exists)

  Create new BLOBKEY table:
    sqlite3BtreeCreateTable(pBt, &pgno, BTREE_BLOBKEY)
    // Before: pages [1][2-default][3-meta][4-ptrmap]
    // After:  pages [1][2][3][4][5 ← "logs" root]
    // pgno = 5

  Encode metadata payload (stack buffer, no heap):
    [00 00 00 04][l][o][g][s][00 00 00 05]
     name_len=4   "logs"       root_pgno=5

  Insert metadata row:
    slot = FNV-1a("logs")         ← 32-bit hash, linear probe on collision
    BtreeInsert(metaCur, { key=slot, data=encodedMeta })

  Update CF count:
    BtreeGetMeta(META_CF_COUNT) → N
    BtreeUpdateMeta(META_CF_COUNT, N+1)

  Allocate KVColumnFamily {
    iTable   = 5      (new root page)
    refCount = 1
    pMutex   = sqlite3_mutex_alloc(SQLITE_MUTEX_RECURSIVE)
    pReadCur = NULL   ← opened lazily on first get
  }

kvstore_cf_create with the name "default" does not create a new CF — it silently redirects to kvstore_cf_get_default() and returns the existing default handle. Callers should not rely on creation semantics for that name.

kvstore_cf_open resolves the CF by name through the INTKEY metadata table using FNV-1a hashing with linear probing for collision resolution. If inTrans == 0 at call time (the persistent read is absent), it starts a fresh read transaction with autoTrans=1 and commits it after the metadata lookup. It does not assume the persistent read is always available.

kvstore_cf_drop performs its steps in this exact order: (1) delete the metadata row from the INTKEY table via BtreeDelete, (2) drop the CF's data B-tree via BtreeDropTable (moves all pages to freelist), (3) decrement the CF count via UpdateMeta. The metadata row is removed before the table is dropped — not after — so a crash mid-sequence leaves an orphaned metadata entry rather than an unreferenced live table, which is the safer failure mode.
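
Sketched with the Btree calls involved (cursor positioning and error paths elided; META_CF_COUNT follows the naming used earlier):

u32 nCF; int iMoved;
sqlite3BtreeDelete(pMetaCur, 0);                   /* 1. meta row first  */
sqlite3BtreeDropTable(pBt, pCF->iTable, &iMoved);  /* 2. pages → freelist */
sqlite3BtreeGetMeta(pBt, META_CF_COUNT, &nCF);     /* 3. decrement count */
sqlite3BtreeUpdateMeta(pBt, META_CF_COUNT, nCF-1);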


Iterators: Own Cursor, Shared Transaction

Each iterator owns a separate B-tree cursor — distinct from pCF->pReadCur. This allows concurrent iteration and point-lookups on the same CF without one invalidating the other. The iterator's cursor is allocated at creation time and freed at close.

kvstore_iterator_create — key decisions
KVIterator {
  eof      = 1   (not positioned — call iterator_first to begin)
  isValid  = 1
  ownsTrans = ?  (see below)
  pCur     = own BtCursor*   ← NOT pCF->pReadCur
}

Transaction decision:
  inTrans >= 1 → use existing transaction, ownsTrans = 0
  inTrans == 0 → BtreeBeginTrans(0) → inTrans=1, ownsTrans = 1

Iterator close — transaction cleanup:
  ownsTrans == 0 → nothing to commit; leave inTrans unchanged
  ownsTrans == 1 → BtreeCommit(pBt); inTrans = 0
                   (persistent read not restored — store had none before)

Buffer reuse (key and value reads):
  pKeyBuf / pValBuf grown via sqlite3_realloc when needed
  pointers returned to caller are owned by the iterator (do NOT free)
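
A full-scan usage sketch; the accessor names (eof/key/value/next) are inferred from the struct above rather than quoted from the header:

KVIterator *it = 0;
kvstore_iterator_create(kv, &it);
for( kvstore_iterator_first(it); !kvstore_iterator_eof(it);
     kvstore_iterator_next(it) ){
  const void *k, *v; int nk, nv;
  kvstore_iterator_key(it, &k, &nk);     /* points into pKeyBuf */
  kvstore_iterator_value(it, &v, &nv);   /* points into pValBuf */
  /* use k/v now; the buffers are reused on the next step, do NOT free */
}
kvstore_iterator_close(it);              /* commits its read txn if ownsTrans */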

Prefix Iteration — O(log n) Seek, Not O(n) Scan

kvstore_prefix_iterator_create does not scan from the beginning of the B-tree. It calls kvstore_iterator_first, which uses sqlite3BtreeIndexMoveto with the prefix as the search key to land directly at the first matching entry. Note: unlike the point-lookup path in kvstoreSeekKey, the prefix seek allocates an UnpackedRecord on the heap via sqlite3VdbeAllocUnpackedRecord(), then frees it immediately after the seek. Subsequent next() calls check the prefix bound and set eof=1 the moment a key falls outside it.

B-tree entries (sorted lexicographically):
  [admin:1][admin:2][user:1][user:2][user:9][zzz:1]

kvstore_prefix_iterator_create(kv, "user:", 5, &iter)

kvstore_iterator_first():
  pIdxKey = sqlite3VdbeAllocUnpackedRecord(pKeyInfo)  ← heap alloc
  pIdxKey->u.z = "user:", pIdxKey->n = 5
  sqlite3BtreeIndexMoveto(pCur, pIdxKey, &res)
  sqlite3DbFree(db, pIdxKey)                          ← freed immediately

  [admin:1][admin:2][user:1][user:2][user:9][zzz:1]
                     ▲
                     cursor lands at first entry ≥ "user:"
  if res < 0: BtreeNext() to advance past the seek point
  kvstoreIterCheckPrefix() → key starts with "user:"? → eof=0

Subsequent next() calls:
  "user:1" → memcmp(key[0..4], "user:", 5) == 0 → continue
  "user:2" → match → continue
  "user:9" → match → continue
  "zzz:1"  → mismatch → eof = 1 (never scanned beyond prefix range)

Complexity: O(log n) seek + O(k) scan where k = matching keys
Note: contrast with kvstoreSeekKey() (get/exists/delete) which uses
      a stack-allocated UnpackedRecord — zero heap alloc per lookup.

Two-Level Recursive Mutex Locking

SNKV uses two levels of SQLITE_MUTEX_RECURSIVE mutexes. The CF-level mutex protects per-CF state (pReadCur, refCount); the store-level mutex protects global state (inTrans, stats, closing). The lock order is always CF mutex first, then store mutex — consistent ordering eliminates deadlock.

Thread 1: kvstore_cf_put("logs", k, v)    Thread 2: kvstore_cf_put("metrics", k, v)
─────────────────────────────────────     ──────────────────────────────────────────
lock(logs.pMutex)                         lock(metrics.pMutex)
lock(pKV->pMutex)   ← contended ──┐      lock(pKV->pMutex)   acquired ─────────────┐
BtreeInsert(...)                   │      BtreeInsert(...)                           │
unlock(pKV->pMutex)   ◄────────── ┘      unlock(pKV->pMutex) ──────────────────────┘
unlock(logs.pMutex)                       unlock(metrics.pMutex)
Why SQLITE_MUTEX_RECURSIVE?

kvstore_iterator_close calls kvstore_cf_close, which may re-enter pKV->pMutex while the caller already holds it. A non-recursive mutex would deadlock here. SQLITE_MUTEX_RECURSIVE maps to PTHREAD_MUTEX_RECURSIVE on POSIX and a spin-count CRITICAL_SECTION on Windows — both safe for same-thread re-entry.
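
The property in isolation (this is the stock sqlite3_mutex API, not SNKV-specific code):

sqlite3_mutex *m = sqlite3_mutex_alloc(SQLITE_MUTEX_RECURSIVE);
sqlite3_mutex_enter(m);     /* outer hold: e.g. kvstore_iterator_close */
sqlite3_mutex_enter(m);     /* same-thread re-entry: kvstore_cf_close */
sqlite3_mutex_leave(m);
sqlite3_mutex_leave(m);     /* released only at the outermost leave */
sqlite3_mutex_free(m);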

Mutex Design: sqlite3_mutex with RECURSIVE type

The current implementation uses SQLite's native sqlite3_mutex API throughout, allocated with SQLITE_MUTEX_RECURSIVE. Three problems the current design solves:

Problem 1: non-recursive default mutex
  Old behavior: pthread_mutex_init(NULL) creates a DEFAULT mutex, for which
                same-thread re-entry is undefined behavior (deadlock or
                silent corruption).
  Fix:          sqlite3_mutex_alloc(SQLITE_MUTEX_RECURSIVE) is safe for
                same-thread re-entry.

Problem 2: use-after-free on close
  Old behavior: pMutex was freed while another thread was about to wake
                and dereference it.
  Fix:          closing=1 is set under the mutex before any memory is
                freed; all API functions check the flag and bail early.

Problem 3: code duplication
  Old behavior: custom mutex code reimplemented what SQLite already
                provides (platform-aware, well-tested).
  Fix:          use sqlite3_mutex_alloc/enter/leave/free directly; SQLite
                handles POSIX, Windows, and no-op builds.

WAL Growth and Auto-Checkpoint

Why the WAL Grows Without Bound by Default

SQLite's built-in 1000-frame WAL auto-checkpoint hook is registered inside sqlite3OpenTail via sqlite3_wal_hook. Because SNKV opens via sqlite3BtreeOpen — not sqlite3Open — that hook is never installed. The WAL appends frames indefinitely until kvstore_close triggers sqlite3WalClose, which performs a final checkpoint.

Default SQLite (via sqlite3Open):          SNKV (via sqlite3BtreeOpen):
┌───────────────────────────────┐          ┌────────────────────────────────┐
│ sqlite3Open()                 │          │ sqlite3BtreeOpen()             │
│  └── sqlite3OpenTail()        │          │   (no OpenTail → no wal hook)  │
│        └── sqlite3_wal_hook() │          │                                │
│              registers        │          │  WAL frames accumulate freely  │
│              1000-frame        │          │  until kvstore_close() calls   │
│              auto-checkpoint  │          │  sqlite3WalClose() on last conn │
└───────────────────────────────┘          └────────────────────────────────┘

kvstoreAutoCheckpoint — the TRANS_NONE Window

SNKV solves this with a commit counter checked in the TRANS_NONE window — the brief period after BtreeCommit releases the write lock but before the persistent read transaction is restored. A PASSIVE checkpoint can run safely here because no lock is held.

The implementation counts commits rather than WAL frames because Btree is an opaque typedef in kvstore.c: dereferencing pBt->pPager to read the WAL frame counter would be a compile error without pulling in the internal headers. A commit counter is sufficient and cannot overflow: it resets to 0 every time the checkpoint fires, so its stored value never exceeds walSizeLimit - 1.

kvstoreAutoCheckpoint — called after every committed write
kvstoreAutoCheckpoint(pKV):
  if walSizeLimit == 0: return        ← disabled

  walCommits++
  if walCommits < walSizeLimit: return  ← threshold not reached

  walCommits = 0                       ← reset counter
  sqlite3BtreeCheckpoint(pKV->pBt, SQLITE_CHECKPOINT_PASSIVE, NULL, NULL)
  // runs in TRANS_NONE window:
  //   write lock already released by BtreeCommit
  //   persistent read not yet opened
  //   → checkpoint can acquire WAL write lock without contention

Timeline with walSizeLimit=3:
  commit #1 → walCommits=1  (no checkpoint)
  commit #2 → walCommits=2  (no checkpoint)
  commit #3 → walCommits=3  ≥ limit → PASSIVE checkpoint → walCommits=0
  commit #4 → walCommits=1  (no checkpoint)
  ···

kvstore_checkpoint — public API

For explicit control, kvstore_checkpoint exposes all four SQLite checkpoint modes. An active write transaction causes an immediate KVSTORE_BUSY return. The persistent read is silently released before the checkpoint and restored after.

Mode          Behavior                                           WAL after
────────────  ─────────────────────────────────────────────────  ─────────────────────────
PASSIVE (0)   Copy frames without blocking readers or writers;   may have remaining frames
              may not copy all if a reader is active
FULL (1)      Wait for all readers, then copy all frames         fully checkpointed
              atomically
RESTART (2)   Like FULL + reset WAL write position to frame 0    not truncated; write
              (reuse the WAL file)                               position reset
TRUNCATE (3)  Like RESTART + truncate the WAL file to 0 bytes    0 bytes
State around kvstore_checkpoint():

  inTrans=1               inTrans=0               inTrans=1
(persistent read)   BtreeCommit   │   BtreeCheckpoint   │   BtreeBeginTrans(0)
     ──────────────────────────► ─┼────────────────────┼─────────────────────►
                                   │  TRANS_NONE window  │

Write guard:
  kvstore_begin(kv, 1)                     → inTrans=2
  kvstore_checkpoint(kv, PASSIVE, ...)     → KVSTORE_BUSY (write not committed)
  kvstore_rollback(kv)                     → inTrans=1
  kvstore_checkpoint(kv, PASSIVE, ...)     → OK
Choosing walSizeLimit vs. kvstore_checkpoint

walSizeLimit=N (auto, PASSIVE): set-and-forget; fires every N committed
writes with no caller involvement. Best for most workloads.

kvstore_checkpoint (any mode): explicit control at shutdown, after large
batches, or to shrink the WAL with TRUNCATE mode. Combine with
walSizeLimit=0 for fully manual checkpoint scheduling.
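
For the manual route, a sketch: the mode constant's spelling and the two out-parameters mirror sqlite3BtreeCheckpoint and are assumptions about the public API.

int nLog = 0, nCkpt = 0;
int rc = kvstore_checkpoint(kv, KVSTORE_CHECKPOINT_TRUNCATE, &nLog, &nCkpt);
if( rc==KVSTORE_BUSY ){
  /* a write transaction is still open: commit or roll back first */
}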


Incremental Vacuum: Recovering Deleted Space

Deleted data pages are placed on SQLite's internal freelist. The file does not shrink automatically. kvstore_incremental_vacuum requires a write transaction with wrflag=2 (exclusive), not the normal wrflag=1 used by put/delete. Importantly, the vacuum path does not commit an active read transaction first — if inTrans==1 it calls BtreeBeginTrans(pBt, 2, 0) directly to attempt an upgrade. This differs from the put/delete pattern which always commits the read before beginning a write.

Transaction handling in kvstore_incremental_vacuum:

  inTrans == 0 → BtreeBeginTrans(pBt, 2, 0)  ← wrflag=2 (exclusive)
                 autoTrans = 1, inTrans = 2

  inTrans == 1 → BtreeBeginTrans(pBt, 2, 0)  ← upgrade attempt (no commit first)
  (read active)  autoTrans = 1, inTrans = 2
                 NOTE: unlike put/delete which commit the read before writing,
                 vacuum calls BeginTrans(2) directly over the read transaction.

  inTrans == 2 → already in write txn, proceed (autoTrans = 0)

BtreeIncrVacuum step (each call reclaims one page):
  1. Identify last page (P8)
  2. P8 is a data page → copy content to free slot P4
  3. Update pointer-map so references to P8 now point to P4
  4. Decrement page count

File BEFORE: [P1][P2][P3][P4 free][P5][P6][P7][P8]
File AFTER:  [P1][P2][P3][P8'][P5][P6][P7]   (truncated on commit)

File size timeline:
  insert 2000 records              → file ~500 KB
  delete 1800 records              → file still ~500 KB (freelist)
  kvstore_incremental_vacuum(kv, 0) → file shrinks to ~50 KB
  (nPage=0 frees ALL unused pages; nPage=N frees up to N pages)

Full Stack Trace: kvstore_put Through All Four Layers

A single kvstore_put("hello", 5, "world", 5) call traverses all four layers. Here is the complete call chain with layer annotations:

kvstore_put("hello", 5, "world", 5) — complete trace
[L4 kvstore.c]
  lock CF mutex → lock KV mutex
  if inTrans==1: BtreeCommit(read)     → inTrans=0
  BtreeBeginTrans(1, 0)                → inTrans=2   (autoTrans)
  kvstoreEncodeBlob()   [stack buf]    → [00 00 00 05][hello][world]
  BtreeCursor(write, pCur)

[L3 btree.c]
  sqlite3BtreeInsert(pCur, {pKey=blob, nKey=14})
    ├── BtreeIndexMoveto()             → traverse root → internal → leaf
    │     sqlite3VdbeRecordCompare()   at each B-tree level
    ├── insertCell()                   → write 14-byte cell to leaf page
    ├── balance_nonroot()              → split/merge if page overflows
    └── PagerWrite(pPage)             → mark page dirty in page cache

[L2 pager.c]
  pager_write()
    ├── [Rollback] writeJournalHdr()   → OsWrite(journal fd)
    └── [WAL]      walWriteLock()      → append WAL frame

[L1 os.c / os_unix.c / os_win.c / os_kv.c]
  sqlite3OsWrite()   → pwrite(fd, buf, amt, offset)
  sqlite3OsSync()    → fsync(fd)      (on commit, SYNC_NORMAL or FULL)
  sqlite3OsLock()    → fcntl(fd, F_SETLK, ...)

[L4 kvstore.c — return path]
  BtreeCloseCursor(write cursor)
  BtreeCommit()                        → autoTrans commit
  kvstoreAutoCheckpoint(pKV)           → PASSIVE ckpt if walSizeLimit hit
  BtreeBeginTrans(0, 0)                → inTrans=1 (persistent read restored)
  unlock KV mutex → unlock CF mutex

kvstore → btree mapping (quick reference)

kvstore function            Primary btree calls
──────────────────────────  ──────────────────────────────────────────────────────
kvstore_open_v2             BtreeOpen, SetPageSize, SetCacheSize, SetPagerFlags,
                            SetAutoVacuum, SetVersion, GetMeta, CreateTable,
                            UpdateMeta, BeginTrans(0)
kvstore_close               BtreeRollback (if active), BtreeClose
kvstore_begin               BtreeCommit (if inTrans=1) + BtreeBeginTrans(wrflag)
kvstore_commit              BtreeCommit + AutoCheckpoint + BtreeBeginTrans(0)
kvstore_put                 BtreeCursor(write), BtreeInsert, BtreeCloseCursor
kvstore_get                 GetReadCursor (cached), BtreeIndexMoveto, BtreePayload ×2
kvstore_delete              BtreeCursor(write), BtreeIndexMoveto, BtreeDelete
kvstore_exists              GetReadCursor (cached), BtreeIndexMoveto
kvstore_cf_create           BtreeCreateTable(BLOBKEY), BtreeInsert (meta), UpdateMeta
kvstore_cf_drop             BtreeDelete (meta row first), BtreeDropTable,
                            UpdateMeta (CF count)
kvstore_iterator_*          BtreeCursor (own), BtreeFirst, BtreeNext, BtreePayload;
                            prefix seek uses heap-allocated UnpackedRecord
                            (sqlite3VdbeAllocUnpackedRecord)
kvstore_incremental_vacuum  BtreeBeginTrans(wrflag=2, exclusive), BtreeIncrVacuum
                            (loop until DONE or nPage reached)
kvstore_integrity_check     BtreeIntegrityCheck (all root pages)
kvstore_checkpoint          BtreeCommit (drop read) + BtreeCheckpoint(mode) +
                            BtreeBeginTrans(0)