The Four-Layer Stack
SNKV is a key-value store built directly on SQLite's B-tree engine, bypassing SQLite's SQL and VM layers entirely. Application code interacts exclusively through the kvstore_* public API defined in kvstore.c. Below that, pages flow through the pager's cache and journal, and ultimately reach disk via the VFS abstraction.
Because SNKV opens the B-tree via sqlite3BtreeOpen rather than sqlite3Open, it bypasses sqlite3OpenTail — the function that normally installs SQLite's default 1000-frame WAL auto-checkpoint hook. This has significant consequences for WAL management, addressed in the WAL & Checkpointing section.
Skipping the SQL/VDBE layer removes a substantial amount of per-query overhead (query planning, bytecode dispatch, schema locking) but requires SNKV to own all transaction lifecycle management that SQLite would otherwise handle automatically.
Core Data Structures
Three structs form the backbone of the runtime state:
KVStore
├─ pBt → Btree* B-tree handle
├─ db → sqlite3* required by btree internals
├─ pKeyInfo → KeyInfo* shared BLOBKEY comparator (all CFs)
├─ iMetaTable → int INTKEY B-tree root for CF metadata
├─ inTrans → int 0=none 1=read 2=write
├─ isCorrupted → int set by integrity_check on failure
├─ closing → int set to 1 in kvstore_close()
├─ zErrMsg → char* last error string (owned)
├─ pMutex → sqlite3_mutex* SQLITE_MUTEX_RECURSIVE (store-level)
├─ pDefaultCF → KVColumnFamily* always-open default column family
├─ apCF[] → KVColumnFamily** open named column families
├─ stats → KVStoreStats {nPuts nGets nDeletes nIterations nErrors}
├─ walSizeLimit → int auto-checkpoint every N commits (0=off)
└─ walCommits → int commits since last auto-checkpoint
KVColumnFamily
├─ pKV → KVStore* parent store (back-pointer)
├─ zName → char* CF name (heap-allocated)
├─ iTable → int B-tree root page for this CF's data
├─ refCount → int reference count
├─ pMutex → sqlite3_mutex* SQLITE_MUTEX_RECURSIVE (CF-level)
└─ pReadCur → BtCursor* cached read cursor (NULL = not yet open)
KVIterator
├─ pCF → KVColumnFamily* column family being iterated
├─ pCur → BtCursor* own cursor (NOT pCF->pReadCur)
├─ eof → int 1 = past last entry
├─ isValid → int 1 = cursor can be used
├─ ownsTrans → int 1 = iterator opened its own read txn
├─ pKeyBuf/nKeyBuf reusable key buffer (grows via realloc)
├─ pValBuf/nValBuf reusable value buffer
└─ pPrefix/nPrefix prefix filter (heap copy, or NULL)
BLOBKEY Encoding: How Every Key-Value Pair Lives on Disk
All user data is stored in SQLite's BLOBKEY B-tree tables — a variant where the entire sort key is an opaque byte string rather than a typed integer rowid. SNKV packs each key-value pair into a single blob cell with a 4-byte big-endian length prefix on the key:
BLOBKEY Cell Layout
┌─────────────┬──────────────────────┬──────────────────────┐
│ key_len │ key_bytes │ value_bytes │
│ (4 bytes, │ (nKey bytes) │ (nValue bytes) │
│ big-endian)│ │ │
└─────────────┴──────────────────────┴──────────────────────┘
◄── 4B ─────►◄──── nKey ───────────►◄──── nValue ─────────►
Total encoded size = 4 + nKey + nValue
B-tree comparison reads only bytes [4 .. 4+nKey]
→ entries sorted lexicographically by key
→ same key on BtreeInsert = overwrite (upsert semantics)
Example: put("user:1", 6, "Alice", 5)
┌──────────────┬────────────────────┬─────────────────────┐
│ 00 00 00 06 │ u s e r : 1 │ A l i c e │
│ key_len=6 │ (6 bytes) │ (5 bytes) │
└──────────────┴────────────────────┴─────────────────────┘
offset: 0 4 10 15
The encoding function kvstoreEncodeBlob() uses a caller-supplied stack buffer for small payloads (≤ 512 bytes) and falls back to sqlite3Malloc only when the encoded size exceeds that threshold. This eliminates the heap round-trip on virtually all short key-value pairs in typical workloads.
CF Metadata Table (Internal)
Column family metadata uses an INTKEY B-tree — SQLite's row-id table variant. The rowid for each CF entry is its FNV-1a hash (with linear probing for collisions). The data blob reuses the same [name_len][name_bytes][root_page_4B_BE] layout.
CF Metadata Table (INTKEY, root page stored in file header meta[3])
rowid = FNV-1a("logs")
data = [00 00 00 04][l][o][g][s][00 00 00 05]
name_len=4 name bytes root_pgno=5
rowid = FNV-1a("metrics")
data = [00 00 00 07][m][e][t][r][i][c][s][00 00 00 07]
name_len=7 name bytes root_pgno=7
Page-1 file header meta slots used by SNKV:
meta[1] = Default CF root page number
meta[2] = CF count (total named CFs)
meta[3] = CF metadata table root page number
After opening a fresh database, the page layout is deterministic:
Page layout after first open (new database, auto-vacuum enabled):
  Page 1: SQLite file header + schema root
  Page 2: Default CF table root (BLOBKEY, initially empty)
  Page 3: CF metadata table root (INTKEY, initially empty)
  Page 4: Pointer-map page (required by incremental auto-vacuum)
Opening a Store: kvstore_open_v2
kvstore_open_v2 is the primary open function. It accepts a KVStoreConfig struct controlling journal mode, sync level, cache size, page size, read-only access, busy-retry timeout, and WAL auto-checkpoint threshold. kvstore_open is a thin backward-compatible shim.
Step 1 sqlite3_initialize()
Step 2 Resolve config defaults (zero → default)
journalMode == 0 → KVSTORE_JOURNAL_WAL
syncLevel == 0 → KVSTORE_SYNC_NORMAL
cacheSize == 0 → 2000 pages (~8 MB at 4 KiB/page)
pageSize == 0 → 4096 bytes (only applies to new DBs)
busyTimeout == 0 → no busy retry
walSizeLimit== 0 → WAL auto-checkpoint disabled
Step 3 Allocate KVStore { db, pMutex=RECURSIVE, inTrans=0, closing=0 }
Step 4 sqlite3BtreeOpen(vfs, path, db, &pBt, 0, flags)
flags = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE
(SQLITE_OPEN_READONLY if cfg.readOnly)
Step 5 sqlite3BtreeSetPageSize(pBt, pageSize, -1, 0)
New DB → sets page size before first write
Old DB → silently ignored; existing page size wins
Step 6 sqlite3BtreeSetCacheSize(pBt, cacheSize)
Step 7 Set pager sync flags
syncFlag = (syncLevel + 1) | PAGER_CACHESPILL
SYNC_OFF (0) → 0x01 SYNCHRONOUS_OFF
SYNC_NORMAL (1) → 0x02 SYNCHRONOUS_NORMAL ← default
SYNC_FULL (2) → 0x03 SYNCHRONOUS_FULL
Step 8 Install busy handler (if busyTimeout > 0)
KVBusyCtx { timeoutMs, pVfs } → heap-alloc via sqlite3_malloc
db->busyHandler.xBusyHandler = kvstoreBusyHandler
kvstoreBusyHandler(ctx, nBusy):
if nBusy * 1ms >= timeoutMs → return 0 (give up)
sqlite3OsSleep(pVfs, 1000 µs) (sleep 1 ms)
return 1 (retry)
Step 9 sqlite3BtreeSetAutoVacuum(pBt, BTREE_AUTOVACUUM_INCR)
Step 10 sqlite3BtreeSetVersion(pBt, ver)
ver=1 → Rollback journal (DELETE mode)
ver=2 → WAL mode ← default
sqlite3BtreeCommit(pBt)
Step 11 Detect new vs. existing database
BtreeBeginTrans(0) → GetMeta(META_DEFAULT_CF_ROOT) → BtreeCommit
root == 0 → NEW database: createDefaultCF() + initCFMetadataTable()
root != 0 → EXISTING: restore KVColumnFamily{iTable=root}
Step 12 Allocate shared KeyInfo { enc=UTF8, nKeyField=1, db=pKV->db }
Used by ALL BLOBKEY cursor comparisons across all CFs
Step 13 pKV->walSizeLimit = cfg.walSizeLimit
pKV->walCommits = 0
Step 14 Open persistent read transaction
sqlite3BtreeBeginTrans(pBt, 0, 0)
pKV->inTrans = 1 ← held for lifetime of store
*ppKV = pKV
return KVSTORE_OK
Closing a Store
kvstore_close sets closing=1 under the mutex before freeing anything. Every public API function checks this flag immediately after acquiring the mutex and returns an error if set. This prevents any use-after-free scenario when another thread is mid-operation as close begins.
sqlite3_mutex_enter(pKV->pMutex)
pKV->closing = 1                            ← racing threads bail here
if inTrans > 0:
    sqlite3BtreeRollback(pBt, SQLITE_OK, 0)
    pKV->inTrans = 0
for each open CF (default + named):
    if pCF->pReadCur: kvstoreFreeCursor(pCF->pReadCur)
    sqlite3_mutex_free(pCF->pMutex)
    free(pCF->zName), free(pCF)
sqlite3BtreeClose(pBt)                      ← PagerClose → OsClose → close(fd)
if db->busyHandler.pBusyArg:
    sqlite3_free(db->busyHandler.pBusyArg)  ← free KVBusyCtx
free(pKeyInfo), free(zErrMsg)
sqlite3_mutex_free(db->mutex)
free(db)
sqlite3_mutex_leave(pKV->pMutex)
sqlite3_mutex_free(pKV->pMutex)             ← freed LAST, after all resources gone
free(pKV)
Transactions: Persistent Read and the WAL Slot Problem
SNKV holds a persistent read transaction for the entire lifetime of an open store, so kvstore_get and kvstore_exists — even without an explicit kvstore_begin — incur no BtreeBeginTrans/BtreeCommit overhead. The state machine tracks this with three inTrans values: 0 (none), 1 (read), 2 (write).
The WAL "read slot 0" problem is subtle: SQLite forbids upgrading a connection holding slot 0 to a write transaction, returning SQLITE_BUSY. The fix is to commit the read first, then begin a fresh write.
kvstore_begin(kv, wrflag)
inTrans=2 → error: write transaction already active
inTrans=1, wrflag=0 → no-op: persistent read already satisfies request
return KVSTORE_OK
inTrans=1, wrflag=1 → must release read slot before upgrading:
sqlite3BtreeCommit(pBt) ← free WAL read slot
inTrans = 0
→ fall through to BeginTrans(write)
inTrans=0, wrflag=0 → sqlite3BtreeBeginTrans(pBt, 0, 0) → inTrans=1
inTrans=0, wrflag=1 → sqlite3BtreeBeginTrans(pBt, 1, 0) → inTrans=2
WHY commit before upgrading to write (WAL mode)?
sqlite3BtreeOpen bypasses sqlite3OpenTail → no WAL hook registered.
WAL read slot 0 is assigned when the WAL is empty (e.g. right after DB
creation). walBeginWriteTransaction() returns SQLITE_BUSY if slot 0 is
held. Committing the read releases the slot; the next BeginTrans(write)
succeeds unconditionally.
Transaction State Diagram
kvstore_open()
│
▼
inTrans = 1 ─── get / exists ──────────────► inTrans = 1
(persistent read) (unchanged)
│
│ begin(write)
│ BtreeCommit(read) → inTrans=0
│ BtreeBeginTrans(1) → inTrans=2
▼
inTrans = 2
(write active)
│
│ commit() or rollback()
│ BtreeCommit / BtreeRollback → inTrans=0
│ BtreeBeginTrans(0) ─────────► inTrans=1
▼
inTrans = 1 (persistent read restored)
Auto-Transaction Pattern
Write operations that have no enclosing kvstore_begin manage their own single-operation transaction. They commit the persistent read, begin a write, perform the operation, commit, and restore the read — all atomically under the store mutex.
[inTrans == 2]  write already active → proceed, autoTrans=0
[inTrans == 1]  persistent read active:
    sqlite3BtreeCommit(pBt)        ← release read lock / WAL slot
    inTrans = 0                    → fall through ↓
[inTrans == 0]  no transaction:
    sqlite3BtreeBeginTrans(pBt, 1, 0)
    autoTrans = 1
    inTrans = 2

··· perform operation ···

[SUCCESS + autoTrans]:
    sqlite3BtreeCommit(pBt)
    inTrans = 0                    ← TRANS_NONE window begins
    kvstoreAutoCheckpoint(pKV)     ← optional PASSIVE checkpoint in window
    sqlite3BtreeBeginTrans(pBt, 0, 0)
    inTrans = 1                    ← persistent read restored
[FAILURE + autoTrans]:
    sqlite3BtreeRollback(pBt, SQLITE_OK, 0)
    inTrans = 0
    // put/delete: persistent read NOT restored on error path
    // rollback(): persistent read IS always restored if !closing
    //   → kvstore_rollback always calls BtreeBeginTrans(0) after rollback

Benefit:
    kvstore_put(kv, "k", 1, "v", 1)    ← single call, no begin/commit
    // internally: commit_read → begin_write → insert → commit → begin_read

    kvstore_begin(kv, 1)               ← explicit batch
    kvstore_put(kv, "k1", 2, "v1", 2)  // autoTrans=0, no commit
    kvstore_put(kv, "k2", 2, "v2", 2)  // autoTrans=0, no commit
    kvstore_commit(kv)                 // one fsync for both writes
Put, Get, Delete, Exists
Put — Stack-First Encoding + Write Cursor
Every kvstore_put acquires the CF mutex, then the store mutex (in that order — always), validates key/value sizes (key: 1–64 KiB; value: 0–10 MiB), and encodes the blob on the stack before touching the B-tree.
sqlite3_mutex_enter(pCF->pMutex) ← CF lock first
sqlite3_mutex_enter(pKV->pMutex) ← then store lock
// Encode on stack — no heap alloc for small payloads
unsigned char stackBuf[512];
kvstoreEncodeBlob("user:1", 6, "Alice", 5, stackBuf, 512, &pEncoded)
// 4 + 6 + 5 = 15 bytes → fits stackBuf
// [00 00 00 06][user:1][Alice]
// Separate write cursor (not pCF->pReadCur)
kvstoreAllocCursor()
sqlite3BtreeCursor(pBt, iTable, wrflag=1, pKeyInfo, pCur)
BtreePayload payload = { .pKey = pEncoded, .nKey = 15 };
sqlite3BtreeInsert(pCur, &payload, 0, 0)
│
├── IndexMoveto → traverse root → internal pages → leaf
│ Compare search key using pKeyInfo comparator at each level
├── Key exists? YES → overwrite cell in-place (upsert)
│ NO → insertCell(), rebalance if leaf overflows
└── PagerWrite() on each modified page → marks page dirty
kvstoreFreeCursor(pCur) ← write cursor freed immediately
pCF->pReadCur is UNTOUCHED
pKV->stats.nPuts++
// autoTrans commit + persistent read restore (see §auto-txn)
sqlite3_mutex_leave(pKV->pMutex)
sqlite3_mutex_leave(pCF->pMutex)
Get — Zero-Malloc Fast Path
The get path is the most performance-critical operation. SNKV eliminates two major sources of per-call overhead: (1) the cached read cursor means no BtreeCursor open/close, and (2) the stack-allocated UnpackedRecord avoids the heap-allocated version that SQLite's VDBE layer normally uses.
| Operation | Before (naive) | After (SNKV) |
|---|---|---|
| BtreeCursor open | sqlite3MallocZero(cursorSz) + open | cached — zero cost |
| UnpackedRecord | VdbeAllocUnpackedRecord() → heap | stack alloc — zero cost |
| BtreeBeginTrans | called every get | inTrans=1 → no-op |
| BtreeCommit | called every get | autoTrans=0 → no-op |
| BtreeIndexMoveto | O(log n) — unavoidable | O(log n) |
| BtreePayload ×2 | read key_len, read value | read key_len, read value |
| BtreeCloseCursor | called every get | cursor stays open |
// Get (or open) cached read cursor — zero alloc after first call
kvstoreGetReadCursor(pCF):
    if pCF->pReadCur != NULL → return it   // ← hot path: always true
    // cold path (first call only):
    kvstoreAllocCursor()
    BtreeCursor(pBt, iTable, 0, pKeyInfo, pCur)
    pCF->pReadCur = pCur

// Seek — entirely stack-allocated
UnpackedRecord idxKey;   // ← stack
Mem memField;            // ← stack
idxKey.u.z = "user:1"; idxKey.n = 6;
sqlite3BtreeIndexMoveto(pCur, &idxKey, &res)
// res==0 → exact match; res≠0 → key not present
if res != 0: return KVSTORE_NOTFOUND

// Decode value from cell payload
// Cell: [00 00 00 06][user:1][Alice]
BtreePayload(pCur, 0, 4, hdr)    → storedKeyLen = 6
valLen = payloadSz - 4 - 6 = 5
BtreePayload(pCur, 4+6, 5, buf)  → "Alice"

// pReadCur stays open — no close, no free
pKV->stats.nGets++
*ppValue = heap copy of value    // caller owns and must free
*pnValue = valLen
SQLite cursors transition to REQUIRESSEEK state when pages they reference are modified by a concurrent write. The next call to BtreeIndexMoveto transparently restores the cursor position. This makes reusing pReadCur across writes correct — no explicit invalidation is needed.
Delete
Delete follows the same auto-transaction pattern as put but uses a write cursor to seek and remove the cell. If the targeted leaf page underflows after removal, SQLite's B-tree engine rebalances by merging with a sibling. Freed pages go onto the internal freelist; the file does not shrink until a vacuum is performed.
Leaf before: [user:0][user:1][user:2]
▲ BtreeDelete
Leaf after: [user:0][user:2]
(if page underflows → merge with sibling)
freed pages → internal freelist
(file size unchanged until kvstore_incremental_vacuum)
Exists — Cheapest Point Lookup
kvstore_exists is the cheapest operation: it uses the cached read cursor, performs the B-tree traversal, but reads zero bytes of payload. Only the key comparison at the leaf node is performed. The caller gets a boolean — no allocation occurs at all on the hot path.
Column Families: Isolated B-Tree Tables with a Shared Store
Each column family is a separate BLOBKEY B-tree table rooted at its own page number. They share the same B-tree handle, pager, WAL file, and transaction context — meaning a single kvstore_begin/kvstore_commit can write atomically to multiple CFs.
kvstore_cf_create(kv, "logs", &cf)
Validate name: non-null · 1–255 chars · not "default"
Check CF does not already exist:
kvstoreMetaSeekKey("logs") → found? → error (already exists)
Create new BLOBKEY table:
sqlite3BtreeCreateTable(pBt, &pgno, BTREE_BLOBKEY)
// Before: pages [1][2-default][3-meta][4-ptrmap]
// After: pages [1][2][3][4][5 ← "logs" root]
// pgno = 5
Encode metadata payload (stack buffer, no heap):
[00 00 00 04][l][o][g][s][00 00 00 05]
name_len=4 "logs" root_pgno=5
Insert metadata row:
slot = FNV-1a("logs") ← 32-bit hash, linear probe on collision
BtreeInsert(metaCur, { key=slot, data=encodedMeta })
Update CF count:
BtreeGetMeta(META_CF_COUNT) → N
BtreeUpdateMeta(META_CF_COUNT, N+1)
Allocate KVColumnFamily {
iTable = 5 (new root page)
refCount = 1
pMutex = sqlite3_mutex_alloc(SQLITE_MUTEX_RECURSIVE)
pReadCur = NULL ← opened lazily on first get
}
kvstore_cf_create with the name "default" does not create a new CF — it silently redirects to kvstore_cf_get_default() and returns the existing default handle. Callers should not rely on creation semantics for that name.
kvstore_cf_open resolves the CF by name through the INTKEY metadata table using FNV-1a hashing with linear probing for collision resolution. If inTrans == 0 at call time (the persistent read is absent), it starts a fresh read transaction with autoTrans=1 and commits it after the metadata lookup. It does not assume the persistent read is always available.
kvstore_cf_drop performs its steps in this exact order: (1) delete the metadata row from the INTKEY table via BtreeDelete, (2) drop the CF's data B-tree via BtreeDropTable (moves all pages to freelist), (3) decrement the CF count via UpdateMeta. The metadata row is removed before the table is dropped — not after — so a crash mid-sequence leaves an orphaned metadata entry rather than an unreferenced live table, which is the safer failure mode.
Iterators: Own Cursor, Shared Transaction
Each iterator owns a separate B-tree cursor — distinct from pCF->pReadCur. This allows concurrent iteration and point-lookups on the same CF without one invalidating the other. The iterator's cursor is allocated at creation time and freed at close.
KVIterator {
eof = 1 (not positioned — call iterator_first to begin)
isValid = 1
ownsTrans = ? (see below)
pCur = own BtCursor* ← NOT pCF->pReadCur
}
Transaction decision:
inTrans >= 1 → use existing transaction, ownsTrans = 0
inTrans == 0 → BtreeBeginTrans(0) → inTrans=1, ownsTrans = 1
Iterator close — transaction cleanup:
ownsTrans == 0 → nothing to commit; leave inTrans unchanged
ownsTrans == 1 → BtreeCommit(pBt); inTrans = 0
(persistent read not restored — store had none before)
Buffer reuse (key and value reads):
pKeyBuf / pValBuf grown via sqlite3_realloc when needed
pointers returned to caller are owned by the iterator (do NOT free)
Prefix Iteration — O(log n) Seek, Not O(n) Scan
kvstore_prefix_iterator_create does not scan from the beginning of the B-tree. It calls kvstore_iterator_first, which uses sqlite3BtreeIndexMoveto with the prefix as the search key to land directly at the first matching entry. Note: unlike the point-lookup path in kvstoreSeekKey, the prefix seek allocates an UnpackedRecord on the heap via sqlite3VdbeAllocUnpackedRecord(), then frees it immediately after the seek. Subsequent next() calls check the prefix bound and set eof=1 the moment a key falls outside it.
B-tree entries (sorted lexicographically):
[admin:1][admin:2][user:1][user:2][user:9][zzz:1]
kvstore_prefix_iterator_create(kv, "user:", 5, &iter)
kvstore_iterator_first():
pIdxKey = sqlite3VdbeAllocUnpackedRecord(pKeyInfo) ← heap alloc
pIdxKey->u.z = "user:", pIdxKey->n = 5
sqlite3BtreeIndexMoveto(pCur, pIdxKey, &res)
sqlite3DbFree(db, pIdxKey) ← freed immediately
[admin:1][admin:2][user:1][user:2][user:9][zzz:1]
▲
cursor lands at first entry ≥ "user:"
if res < 0: BtreeNext() to advance past the seek point
kvstoreIterCheckPrefix() → key starts with "user:"? → eof=0
Subsequent next() calls:
"user:1" → memcmp(key[0..4], "user:", 5) == 0 → continue
"user:2" → match → continue
"user:9" → match → continue
"zzz:1" → mismatch → eof = 1 (never scanned beyond prefix range)
Complexity: O(log n) seek + O(k) scan where k = matching keys
Note: contrast with kvstoreSeekKey() (get/exists/delete) which uses
a stack-allocated UnpackedRecord — zero heap alloc per lookup.
Two-Level Recursive Mutex Locking
SNKV uses two levels of SQLITE_MUTEX_RECURSIVE mutexes. The CF-level mutex protects per-CF state (pReadCur, refCount); the store-level mutex protects global state (inTrans, stats, closing). The lock order is always CF mutex first, then store mutex — consistent ordering eliminates deadlock.
Thread 1: kvstore_cf_put("logs", k, v) Thread 2: kvstore_cf_put("metrics", k, v)
───────────────────────────────────── ──────────────────────────────────────────
lock(logs.pMutex) lock(metrics.pMutex)
lock(pKV->pMutex) ← contended ──┐ lock(pKV->pMutex) acquired ─────────────┐
BtreeInsert(...) │ BtreeInsert(...) │
unlock(pKV->pMutex) ◄────────── ┘ unlock(pKV->pMutex) ──────────────────────┘
unlock(logs.pMutex) unlock(metrics.pMutex)
kvstore_iterator_close calls kvstore_cf_close, which may re-enter pKV->pMutex while the caller already holds it. A non-recursive mutex would deadlock here. SQLITE_MUTEX_RECURSIVE maps to PTHREAD_MUTEX_RECURSIVE on POSIX and a spin-count CRITICAL_SECTION on Windows — both safe for same-thread re-entry.
Mutex Design: sqlite3_mutex with RECURSIVE type
The current implementation uses SQLite's native sqlite3_mutex API throughout, allocated with SQLITE_MUTEX_RECURSIVE. Three problems the current design solves:
| Problem | Old behavior | Fix |
|---|---|---|
| Non-recursive default mutex | pthread_mutex_init(NULL) creates a DEFAULT mutex — same-thread re-entry is UB (deadlock or silent corruption) | sqlite3_mutex_alloc(SQLITE_MUTEX_RECURSIVE) — safe same-thread re-entry |
| Use-after-free on close | Freed pMutex while another thread was about to wake and dereference it | closing=1 set under mutex before any memory freed; all API functions check and bail early |
| Code duplication | Custom mutex code reimplemented what SQLite already provides (platform-aware, well-tested) | Use sqlite3_mutex_alloc/enter/leave/free directly; SQLite handles POSIX/Windows/no-op |
WAL Growth and Auto-Checkpoint
Why the WAL Grows Unboundedly by Default
SQLite's built-in 1000-frame WAL auto-checkpoint hook is registered inside sqlite3OpenTail via sqlite3_wal_hook. Because SNKV opens via sqlite3BtreeOpen — not sqlite3Open — that hook is never installed. The WAL appends frames indefinitely until kvstore_close triggers sqlite3WalClose, which performs a final checkpoint.
Default SQLite (via sqlite3Open):       SNKV (via sqlite3BtreeOpen):
┌───────────────────────────────┐       ┌────────────────────────────────┐
│ sqlite3Open()                 │       │ sqlite3BtreeOpen()             │
│  └── sqlite3OpenTail()        │       │ (no OpenTail → no wal hook)    │
│       └── sqlite3_wal_hook()  │       │                                │
│           registers           │       │ WAL frames accumulate freely   │
│           1000-frame          │       │ until kvstore_close() calls    │
│           auto-checkpoint     │       │ sqlite3WalClose() on last conn │
└───────────────────────────────┘       └────────────────────────────────┘
kvstoreAutoCheckpoint — the TRANS_NONE Window
SNKV solves this with a commit counter checked in the TRANS_NONE window — the brief period after BtreeCommit releases the write lock but before the persistent read transaction is restored. A PASSIVE checkpoint can run safely here because no lock is held.
The implementation counts commits rather than WAL frames because Btree is an opaque typedef in kvstore.c — accessing pBt->pPager to read the WAL frame counter would be a compile error against the internal header. A commit counter is sufficient and cannot overflow: it resets to 0 each time the checkpoint fires, so it never exceeds walSizeLimit.
kvstoreAutoCheckpoint(pKV):
    if walSizeLimit == 0: return          ← disabled
    walCommits++
    if walCommits < walSizeLimit: return  ← threshold not reached
    walCommits = 0                        ← reset counter
    sqlite3BtreeCheckpoint(pKV->pBt, SQLITE_CHECKPOINT_PASSIVE, NULL, NULL)
    // runs in TRANS_NONE window:
    //   write lock already released by BtreeCommit
    //   persistent read not yet opened
    //   → checkpoint can acquire WAL write lock without contention

Timeline with walSizeLimit=3:
    commit #1 → walCommits=1 (no checkpoint)
    commit #2 → walCommits=2 (no checkpoint)
    commit #3 → walCommits=3 ≥ limit → PASSIVE checkpoint → walCommits=0
    commit #4 → walCommits=1 (no checkpoint)
    ···
kvstore_checkpoint — public API
For explicit control, kvstore_checkpoint exposes all four SQLite checkpoint modes. An active write transaction causes an immediate KVSTORE_BUSY return. The persistent read is silently released before the checkpoint and restored after.
| Mode | Behavior | WAL after |
|---|---|---|
| PASSIVE (0) | Copy frames without blocking readers or writers; may not copy all if a reader is active | may have remaining frames |
| FULL (1) | Wait for all readers, then copy all frames atomically | fully checkpointed |
| RESTART (2) | Like FULL + reset WAL write position to frame 0 (reuse the WAL file) | file not truncated, write position reset |
| TRUNCATE (3) | Like RESTART + truncate WAL file to 0 bytes | 0 bytes |
State around kvstore_checkpoint():
inTrans=1 inTrans=0 inTrans=1
(persistent read) BtreeCommit │ BtreeCheckpoint │ BtreeBeginTrans(0)
──────────────────────────► ─┼────────────────────┼─────────────────────►
│ TRANS_NONE window │
Write guard:
kvstore_begin(kv, 1) → inTrans=2
kvstore_checkpoint(kv, PASSIVE, ...) → KVSTORE_BUSY (write not committed)
kvstore_rollback(kv) → inTrans=1
kvstore_checkpoint(kv, PASSIVE, ...) → OK
walSizeLimit=N (auto, PASSIVE) — set-and-forget; fires every N committed writes with no caller involvement. Best for most workloads.
kvstore_checkpoint (any mode) — explicit control at shutdown, after large batches, or to shrink the WAL with TRUNCATE mode. Combine with walSizeLimit=0 for fully manual checkpoint scheduling.
Incremental Vacuum: Recovering Deleted Space
Deleted data pages are placed on SQLite's internal freelist. The file does not shrink automatically. kvstore_incremental_vacuum requires a write transaction with wrflag=2 (exclusive), not the normal wrflag=1 used by put/delete. Importantly, the vacuum path does not commit an active read transaction first — if inTrans==1 it calls BtreeBeginTrans(pBt, 2, 0) directly to attempt an upgrade. This differs from the put/delete pattern which always commits the read before beginning a write.
Transaction handling in kvstore_incremental_vacuum:
inTrans == 0 → BtreeBeginTrans(pBt, 2, 0) ← wrflag=2 (exclusive)
autoTrans = 1, inTrans = 2
inTrans == 1 → BtreeBeginTrans(pBt, 2, 0) ← upgrade attempt (no commit first)
(read active) autoTrans = 1, inTrans = 2
NOTE: unlike put/delete which commit the read before writing,
vacuum calls BeginTrans(2) directly over the read transaction.
inTrans == 2 → already in write txn, proceed (autoTrans = 0)
BtreeIncrVacuum step (each call reclaims one page):
1. Identify last page (P8)
2. P8 is a data page → copy content to free slot P4
3. Update pointer-map so references to P8 now point to P4
4. Decrement page count
File BEFORE: [P1][P2][P3][P4 free][P5][P6][P7][P8]
File AFTER: [P1][P2][P3][P8'][P5][P6][P7] (truncated on commit)
File size timeline:
insert 2000 records → file ~500 KB
delete 1800 records → file still ~500 KB (freelist)
kvstore_incremental_vacuum(kv, 0) → file shrinks to ~50 KB
(nPage=0 frees ALL unused pages; nPage=N frees up to N pages)
Full Stack Trace: kvstore_put Through All Four Layers
A single kvstore_put("hello", 5, "world", 5) call traverses all four layers. Here is the complete call chain with layer annotations:
[L4 kvstore.c]
lock CF mutex → lock KV mutex
if inTrans==1: BtreeCommit(read) → inTrans=0
BtreeBeginTrans(1, 0) → inTrans=2 (autoTrans)
kvstoreEncodeBlob() [stack buf] → [00 00 00 05][hello][world]
BtreeCursor(write, pCur)
[L3 btree.c]
sqlite3BtreeInsert(pCur, {pKey=blob, nKey=14})
├── BtreeIndexMoveto() → traverse root → internal → leaf
│ sqlite3VdbeRecordCompare() at each B-tree level
├── insertCell() → write 14-byte cell to leaf page
├── balance_nonroot() → split/merge if page overflows
└── PagerWrite(pPage) → mark page dirty in page cache
[L2 pager.c]
pager_write()
├── [Rollback] writeJournalHdr() → OsWrite(journal fd)
└── [WAL] walWriteLock() → append WAL frame
[L1 os.c / os_unix.c / os_win.c / os_kv.c]
sqlite3OsWrite() → pwrite(fd, buf, amt, offset)
sqlite3OsSync() → fsync(fd) (on commit, SYNC_NORMAL or FULL)
sqlite3OsLock() → fcntl(fd, F_SETLK, ...)
[L4 kvstore.c — return path]
BtreeCloseCursor(write cursor)
BtreeCommit() → autoTrans commit
kvstoreAutoCheckpoint(pKV) → PASSIVE ckpt if walSizeLimit hit
BtreeBeginTrans(0, 0) → inTrans=1 (persistent read restored)
unlock KV mutex → unlock CF mutex
kvstore → btree mapping (quick reference)
| kvstore function | Primary btree calls |
|---|---|
| kvstore_open_v2 | BtreeOpen, SetPageSize, SetCacheSize, SetPagerFlags, SetAutoVacuum, SetVersion, GetMeta, CreateTable, UpdateMeta, BeginTrans(0) |
| kvstore_close | BtreeRollback (if active), BtreeClose |
| kvstore_begin | BtreeCommit (if inTrans=1) + BtreeBeginTrans(wrflag) |
| kvstore_commit | BtreeCommit + AutoCheckpoint + BtreeBeginTrans(0) |
| kvstore_put | BtreeCursor(write), BtreeInsert, BtreeCloseCursor |
| kvstore_get | GetReadCursor (cached), BtreeIndexMoveto, BtreePayload ×2 |
| kvstore_delete | BtreeCursor(write), BtreeIndexMoveto, BtreeDelete |
| kvstore_exists | GetReadCursor (cached), BtreeIndexMoveto |
| kvstore_cf_create | BtreeCreateTable(BLOBKEY), BtreeInsert (meta), UpdateMeta |
| kvstore_cf_drop | BtreeDelete (meta row first), BtreeDropTable, UpdateMeta (CF count) |
| kvstore_iterator_* | BtreeCursor (own), BtreeFirst, BtreeNext, BtreePayload; prefix seek uses heap-allocated UnpackedRecord (sqlite3VdbeAllocUnpackedRecord) |
| kvstore_incremental_vacuum | BtreeBeginTrans(wrflag=2, exclusive), BtreeIncrVacuum (loop until DONE or nPage reached) |
| kvstore_integrity_check | BtreeIntegrityCheck (all root pages) |
| kvstore_checkpoint | BtreeCommit (drop read) + BtreeCheckpoint(mode) + BtreeBeginTrans(0) |