Biscuit Index Architecture
Overview
Biscuit is a PostgreSQL index access method (AM) optimized for pattern matching operations (LIKE, ILIKE, NOT LIKE, NOT ILIKE). It uses compressed bitmap indices with UTF-8-aware character-position indexing to achieve faster query performance on pattern matching workloads.
Core Design Principles
1. UTF-8 Character-Level Position-Based Indexing
Biscuit indexes strings at the character level, not byte level, correctly handling multi-byte UTF-8 sequences:
String: "café" = [63 61 66 C3 A9] (5 bytes)
Character 0: 'c' → bitmap of record IDs
Character 1: 'a' → bitmap of record IDs
Character 2: 'f' → bitmap of record IDs
Character 3: 'é' (bytes C3 A9) → bitmap of record IDs
Key Innovation: Multi-byte UTF-8 characters are indexed with ALL their bytes at the SAME character position. This enables correct pattern matching for international text while maintaining O(1) character lookup.
Why this matters:
'café'has 4 characters but 5 bytesPattern
'caf_'(4 characters) should match'café'Character counts are used for length filters, not byte counts
2. Dual Indexing (Forward + Reverse)
Biscuit maintains two complementary index structures:
Positive Index: Character positions from the start (0, 1, 2, …)
Negative Index: Character positions from the end (-1, -2, -3, …)
This allows anchored patterns to be resolved instantly:
'abc%'→ Query positive index at character positions 0, 1, 2'%xyz'→ Query negative index at character positions -3, -2, -1'abc%xyz'→ Query both indices and intersect results
3. Roaring Bitmap Compression
All bitmaps use Roaring Bitmaps for space efficiency:
Run-length encoding for consecutive IDs
Array containers for sparse data
Bitmap containers for dense data
4. Dual Case Sensitivity Architecture
Each column maintains two completely separate index structures with their own length bitmaps:
typedef struct ColumnIndex {
// ==================== CASE-SENSITIVE (LIKE, NOT LIKE) ====================
CharIndex pos_idx[256]; // Forward character positions
CharIndex neg_idx[256]; // Reverse character positions
RoaringBitmap *char_cache[256]; // Character presence cache
RoaringBitmap **length_bitmaps; // Exact length (based on original string)
RoaringBitmap **length_ge_bitmaps; // Length >= N (based on original string)
int max_length; // Max character count (original)
// ==================== CASE-INSENSITIVE (ILIKE, NOT ILIKE) ====================
CharIndex pos_idx_lower[256]; // Lowercase forward positions
CharIndex neg_idx_lower[256]; // Lowercase reverse positions
RoaringBitmap *char_cache_lower[256]; // Lowercase character cache
RoaringBitmap **length_bitmaps_lower; // Exact length (lowercase string)
RoaringBitmap **length_ge_bitmaps_lower; // Length >= N (lowercase string)
int max_length_lower; // Max character count (lowercase)
} ColumnIndex;
Critical Design Decision: Case-sensitive and case-insensitive indices have separate length bitmaps because lowercasing can change character counts (e.g., German 'ß' → 'ss').
Why separate data caches?
// Original data (for LIKE)
data_cache[rec_idx] = "Café"; // 4 characters
// Lowercase data (for ILIKE)
data_cache_lower[rec_idx] = biscuit_str_tolower("Café"); // "café", still 4 chars
Lowercase conversion is done once during index build using PostgreSQL’s locale-aware lower() function, making ILIKE queries as fast as LIKE.
Index Structure
Main Index Object
typedef struct BiscuitIndex {
// ==================== MULTI-COLUMN SUPPORT ====================
int num_columns;
Oid *column_types;
FmgrInfo *output_funcs;
char ***column_data_cache; // [col][record] - original data
// Per-column indices (multi-column mode)
ColumnIndex *column_indices; // Each column has dual indices
// ==================== SINGLE-COLUMN LEGACY (BACKWARD COMPAT) ====================
// Case-sensitive
CharIndex pos_idx_legacy[256];
CharIndex neg_idx_legacy[256];
RoaringBitmap *char_cache_legacy[256];
RoaringBitmap **length_bitmaps_legacy;
RoaringBitmap **length_ge_bitmaps_legacy;
int max_length_legacy;
// Case-insensitive
CharIndex pos_idx_lower[256];
CharIndex neg_idx_lower[256];
RoaringBitmap *char_cache_lower[256];
RoaringBitmap **length_bitmaps_lower;
RoaringBitmap **length_ge_bitmaps_lower;
int max_length_lower;
char **data_cache_lower; // Lowercase data cache
// ==================== RECORD STORAGE ====================
ItemPointerData *tids; // Heap tuple IDs
char **data_cache; // Original strings (case-sensitive)
int num_records;
int capacity;
int max_len; // Max character count (case-sensitive)
// ==================== CRUD SUPPORT ====================
RoaringBitmap *tombstones; // Deleted records
uint32_t *free_list; // Reusable slots
int free_count;
int free_capacity;
int tombstone_count;
// ==================== STATISTICS ====================
int64 insert_count;
int64 update_count;
int64 delete_count;
} BiscuitIndex;
Character Index Structure
typedef struct {
int pos; // Character position (or negative offset)
RoaringBitmap *bitmap; // Record IDs matching this char/pos
} PosEntry;
typedef struct {
PosEntry *entries;
int count;
int capacity;
} CharIndex;
Each character (0-255) has a CharIndex containing all character positions where it appears.
Length Bitmaps (Dual Architecture)
// ==================== CASE-SENSITIVE LENGTH BITMAPS ====================
// Exact length queries: "___" (3 underscores)
RoaringBitmap *length_bitmaps[max_length + 1];
// Range length queries: "%___" (length >= 3)
RoaringBitmap *length_ge_bitmaps[max_length + 1];
// ==================== CASE-INSENSITIVE LENGTH BITMAPS (SEPARATE!) ====================
// Exact lowercase length: ILIKE '___'
RoaringBitmap *length_bitmaps_lower[max_length_lower + 1];
// Lowercase length range: ILIKE '%___'
RoaringBitmap *length_ge_bitmaps_lower[max_length_lower + 1];
Why separate? Lowercasing can change character counts:
German:
'ß'(1 char) →'ss'(2 chars)Turkish:
'İ'(1 char) →'i̇'(potentially 2 chars depending on locale)
Each set of length bitmaps is populated based on the actual character count of the indexed strings (original or lowercase).
UTF-8 Character Handling
Character Counting
static int
biscuit_utf8_char_count(const char *str, int byte_len)
{
int char_count = 0;
int pos = 0;
while (pos < byte_len) {
int char_len = biscuit_utf8_char_length((unsigned char)str[pos]);
// Safety: Don't read past buffer
if (pos + char_len > byte_len)
char_len = byte_len - pos;
pos += char_len;
char_count++;
}
return char_count;
}
Example:
"café" = [0x63 0x61 0x66 0xC3 0xA9]
'c' 'a' 'f' 'é' (multi-byte)
byte_len = 5
char_count = 4 ✓ Correct!
Multi-Byte Character Indexing
// Index UTF-8 character at character position
int byte_pos = 0;
int char_pos = 0;
while (byte_pos < byte_len) {
unsigned char first_byte = str[byte_pos];
int char_len = biscuit_utf8_char_length(first_byte);
// ✓ CRITICAL: Index ALL bytes at SAME character position
for (int b = 0; b < char_len; b++) {
unsigned char byte_val = str[byte_pos + b];
// Positive position
bitmap = get_pos_bitmap(idx, byte_val, char_pos);
bitmap_add(bitmap, rec_idx);
// Negative position
int remaining_chars = utf8_char_count(str + byte_pos, byte_len - byte_pos);
int neg_offset = -remaining_chars;
bitmap = get_neg_bitmap(idx, byte_val, neg_offset);
bitmap_add(bitmap, rec_idx);
}
byte_pos += char_len;
char_pos++; // Increment ONCE per character
}
Example: Indexing “café”
char_pos=0: 'c' (0x63) → pos_bitmap[0x63][0]
char_pos=1: 'a' (0x61) → pos_bitmap[0x61][1]
char_pos=2: 'f' (0x66) → pos_bitmap[0x66][2]
char_pos=3: 'é' (0xC3 0xA9) → pos_bitmap[0xC3][3] AND pos_bitmap[0xA9][3]
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^
Both bytes at SAME position!
This ensures that pattern 'caf_' correctly matches 'café' by checking character position 3, not byte position 3.
Query Processing
1. UTF-8-Aware Pattern Analysis
typedef struct ParsedPattern {
char **parts; // Pattern parts split by %
int *part_lens; // CHARACTER counts (for length filters)
int *part_byte_lens; // BYTE lengths (for string operations)
int part_count;
bool starts_percent;
bool ends_percent;
} ParsedPattern;
Critical Fix: Store both character counts and byte lengths:
Character counts used for length filters (
length_ge_bitmaps)Byte lengths used for string operations (
memcmp, offsets)
2. Query Optimization (Multi-Column)
typedef struct {
int column_index;
char *pattern;
ScanKey scan_key;
// Pattern analysis
bool has_percent;
bool is_prefix; // 'abc%'
bool is_suffix; // '%abc'
bool is_exact; // 'abc'
bool is_substring; // '%abc%'
int concrete_chars; // UTF-8 character count
int underscore_count;
int percent_count;
int partition_count;
int anchor_strength; // 0-100
double selectivity_score; // Lower = more selective
int priority; // Execution order
} QueryPredicate;
The query optimizer reorders predicates to minimize candidates:
Priority Tiers:
0-10: Exact matches, many underscores
10-20: Non-% patterns with underscores
20-30: Strong anchored patterns (>30 concrete chars)
30-40: Weak anchored patterns
40-50: Multi-partition patterns
50+: Substring patterns (least selective)
3. Fast Paths (UTF-8-Aware)
Empty Pattern (''):
// Returns only zero-length strings
if (length_bitmaps[0])
return copy(length_bitmaps[0]);
Universal Match ('%'):
// Matches everything except deleted records
for (i = 0; i < num_records; i++)
if (!is_tombstoned(i))
add_to_result(i);
Pure Underscore Wildcards ('___'):
// Exact CHARACTER length constraint: 3 characters
return copy(length_bitmaps[3]); // Uses CHARACTER counts!
Character Length Range ('%___'):
// Any string with CHARACTER length >= 3
return copy(length_ge_bitmaps[3]); // Uses CHARACTER counts!
Exact Match ('abc'):
// Must match at character position 0 with exact CHARACTER length 3
result = pos_bitmap['a'][0];
result &= pos_bitmap['b'][1];
result &= pos_bitmap['c'][2];
result &= length_bitmaps[3]; // CHARACTER length = 3
Prefix Match ('abc%'):
// Must start with 'abc', any CHARACTER length >= 3
result = pos_bitmap['a'][0];
result &= pos_bitmap['b'][1];
result &= pos_bitmap['c'][2];
// No exact length constraint
Suffix Match ('%abc'):
// Must end with 'abc', any CHARACTER length >= 3
result = neg_bitmap['a'][-3]; // 3rd CHARACTER from end
result &= neg_bitmap['b'][-2]; // 2nd CHARACTER from end
result &= neg_bitmap['c'][-1]; // Last CHARACTER
Substring Match ('%abc%' or '%café%'):
// CRITICAL FIX: UTF-8-aware substring search
// Strategy:
// 1. Get candidates using first byte of pattern
// 2. Validate UTF-8 boundaries with character-aware matching
result = empty_bitmap();
// Get candidates (fast filter)
if (char_cache[first_byte])
candidates = copy(char_cache[first_byte]);
// Apply CHARACTER length filter (not byte length!)
int pattern_char_count = utf8_char_count(pattern, pattern_byte_len);
length_filter = get_length_ge(idx, pattern_char_count);
bitmap_and(candidates, length_filter);
// Validate each candidate (correct but slower)
for each candidate rec_idx {
haystack = data_cache[rec_idx];
haystack_char_count = utf8_char_count(haystack);
// Try matching at each CHARACTER position
for (char_pos = 0; char_pos <= haystack_char_count - pattern_char_count; char_pos++) {
// Convert character position to byte offset
byte_offset = utf8_char_to_byte_offset(haystack, char_pos);
// Match pattern bytes at this position
if (match_utf8_pattern(haystack + byte_offset, pattern)) {
bitmap_add(result, rec_idx);
break;
}
}
}
Why substring is slow:
Must validate UTF-8 character boundaries
Cannot rely solely on byte-level bitmap intersections
Uses hybrid approach: bitmap filtering + validation
4. Windowed Matching (Complex Patterns)
For patterns like 'abc%def%ghi', Biscuit uses UTF-8-aware recursive windowed matching:
static void
biscuit_recursive_windowed_match(
RoaringBitmap *result, BiscuitIndex *idx,
const char **parts, int *part_lens, // CHARACTER counts
int part_count, bool ends_percent,
int part_idx, int min_pos, // CHARACTER position
RoaringBitmap *current_candidates, int max_len)
{
// Base case: all parts matched
if (part_idx >= part_count) {
bitmap_or(result, current_candidates);
return;
}
// Calculate remaining CHARACTER length needed
int remaining_chars = 0;
for (int i = part_idx + 1; i < part_count; i++)
remaining_chars += part_lens[i]; // CHARACTER counts
// Last part without trailing % must match at end
if (part_idx == part_count - 1 && !ends_percent) {
end_match = match_part_at_end(idx, parts[part_idx]);
bitmap_and(end_match, current_candidates);
// CHARACTER length constraint
min_length = min_pos + part_lens[part_idx];
length_filter = get_length_ge(idx, min_length);
bitmap_and(end_match, length_filter);
bitmap_or(result, end_match);
return;
}
// Try all valid CHARACTER positions
int current_part_chars = part_lens[part_idx];
int max_pos = max_len - current_part_chars - remaining_chars;
for (int char_pos = min_pos; char_pos <= max_pos; char_pos++) {
part_match = match_part_at_pos(idx, parts[part_idx], char_pos);
next_candidates = bitmap_copy(current_candidates);
bitmap_and(next_candidates, part_match);
if (!bitmap_is_empty(next_candidates)) {
// Recurse with next CHARACTER position
recurse(result, idx, parts, part_lens, part_count,
ends_percent, part_idx + 1,
char_pos + current_part_chars, // CHARACTER arithmetic
next_candidates, max_len);
}
}
}
Key invariant: All position arithmetic uses character offsets, not byte offsets.
ILIKE Implementation
Architecture
// ILIKE query processing
static RoaringBitmap*
biscuit_query_pattern_ilike(BiscuitIndex *idx, const char *pattern) {
// Step 1: Convert pattern to lowercase ONCE
char *pattern_lower = biscuit_str_tolower(pattern, strlen(pattern));
// Step 2: Parse lowercase pattern
ParsedPattern *parsed = biscuit_parse_pattern(pattern_lower);
// Step 3: Query lowercase indices
result = match_using_lowercase_indices(idx, parsed);
// Cleanup
free_parsed_pattern(parsed);
pfree(pattern_lower);
return result;
}
Case Folding
static char *
biscuit_str_tolower(const char *str, int len)
{
text *input = cstring_to_text_with_len(str, len);
Oid collation = get_typcollation(TEXTOID); // Database default
text *result_text = DatumGetTextP(
DirectFunctionCall2Coll(
lower, // PostgreSQL's locale-aware lower()
collation,
PointerGetDatum(input),
collation
)
);
char *result = text_to_cstring(result_text);
return result; // Allocated with palloc
}
Why use PostgreSQL’s lower()?
Locale-aware (handles Turkish İ/i, German ß, etc.)
Consistent with database collation
Stable across index builds
Data Flow
Index Build:
1. Original: "Café" → data_cache[idx] = "Café"
2. Lowercase: "Café" → data_cache_lower[idx] = biscuit_str_tolower("Café") = "café"
3. Index both:
- pos_idx[...] indexes "Café" (case-sensitive)
- pos_idx_lower[...] indexes "café" (case-insensitive)
LIKE Query:
"WHERE col LIKE 'Caf%'"
→ Query pos_idx with "Caf%" → Matches "Café" ✓
ILIKE Query:
"WHERE col ILIKE 'caf%'"
→ Convert pattern: "caf%" → "caf%"
→ Query pos_idx_lower with "caf%" → Matches "café" ✓
Performance Optimizations
1. Skip Wildcard Intersections
'_' wildcards match everything at a character position, so they’re skipped:
for (i = 0; i < part_len; i++) {
if (part[i] == '_') {
char_pos++; // Skip wildcard, advance CHARACTER position
continue;
}
bm = get_pos_bitmap(idx, part[i], char_pos);
result = bitmap_and(result, bm);
char_pos++;
}
2. Early Termination
Stop processing immediately if any intersection yields zero results:
bitmap_and_inplace(result, char_bitmap);
if (bitmap_is_empty(result))
return result; // No point continuing
3. Character vs. Byte Length Separation
typedef struct ParsedPattern {
char **parts;
int *part_lens; // ✓ CHARACTER counts (for length filters)
int *part_byte_lens; // ✓ BYTE lengths (for string operations)
int part_count;
} ParsedPattern;
// Parsing:
part_lens[i] = utf8_char_count(part, part_byte_len); // Characters
part_byte_lens[i] = part_byte_len; // Bytes
Why both?
Character counts for length filters:
length_ge_bitmaps[char_count]Byte lengths for memory operations:
memcmp(str, part, byte_len)
4. TID Sorting for Sequential I/O
Results are sorted by (block, offset) to enable sequential heap scans:
if (count < 5000) {
qsort(tids, count, sizeof(ItemPointer), compare_tids);
} else {
radix_sort_tids(tids, count); // O(n) for large sets
}
5. Skip Sorting for Aggregates
COUNT(*) and EXISTS queries don’t need sorted TIDs:
bool is_aggregate = !scan->xs_want_itup;
if (is_aggregate) {
// Return unsorted TIDs for bitmap heap scan
// Saves O(n log n) sorting cost
}
6. Parallel TID Collection
Large result sets (10k+ TIDs) are collected using multiple workers:
if (count >= 10000) {
int num_workers = (count < 100000) ? 2 : 4;
// Distribute bitmap iteration across workers
}
7. LIMIT-Aware Collection
Stop collecting TIDs early when LIMIT is detected:
int limit_hint = estimate_limit(scan);
if (limit_hint > 0 && idx >= limit_hint)
break; // Collected enough
Multi-Column Support
Architecture
Each column maintains its own completely independent Biscuit index:
BiscuitIndex {
ColumnIndex column_indices[N]; // Each column: dual indices + dual length bitmaps
}
Query Execution
Multi-column queries use intelligent predicate reordering:
WHERE col1 LIKE 'abc%' AND col2 ILIKE '%xyz'
Execution plan:
Analyze: Score each predicate for selectivity
col1 'abc%': priority=20, selectivity=0.15 (strong prefix) col2 '%xyz': priority=30, selectivity=0.25 (weak suffix)
Reorder: Execute most selective first
Execute col1 first (lower priority value = higher priority)
Execute:
// Execute col1 LIKE (case-sensitive indices) candidates = query_column_pattern(idx, col1_idx, 'abc%'); // Execute col2 ILIKE (case-insensitive indices) col2_result = query_column_pattern_ilike(idx, col2_idx, '%xyz'); // Intersect bitmap_and(candidates, col2_result);
Filter tombstones: Remove deleted records
Collect TIDs: Convert bitmap to tuple IDs
Strategy Routing
// Route based on scan key strategy
switch (key->sk_strategy) {
case BISCUIT_LIKE_STRATEGY:
result = query_column_pattern(idx, col_idx, pattern);
break;
case BISCUIT_NOT_LIKE_STRATEGY:
result = query_column_pattern(idx, col_idx, pattern);
bitmap_invert(result); // Negate
break;
case BISCUIT_ILIKE_STRATEGY:
result = query_column_pattern_ilike(idx, col_idx, pattern);
break;
case BISCUIT_NOT_ILIKE_STRATEGY:
result = query_column_pattern_ilike(idx, col_idx, pattern);
bitmap_invert(result); // Negate
break;
}
CRUD Operations
Insert
bool biscuit_insert(values[], isnull[], ht_ctid) {
// 1. Try to reuse deleted slot
if (pop_free_slot(&rec_idx)) {
remove_from_tombstones(rec_idx);
} else {
rec_idx = num_records++;
}
// 2. Store TID and data
tids[rec_idx] = *ht_ctid;
data_cache[rec_idx] = copy_string(value);
// 3. Create lowercase version
data_cache_lower[rec_idx] = biscuit_str_tolower(value, strlen(value));
// 4. Update case-sensitive indices
int byte_pos = 0, char_pos = 0;
while (byte_pos < byte_len) {
int char_len = utf8_char_length(str[byte_pos]);
// Index ALL bytes at SAME character position
for (int b = 0; b < char_len; b++) {
add_to_pos_bitmap(str[byte_pos + b], char_pos, rec_idx);
add_to_neg_bitmap(str[byte_pos + b], -(len_chars - char_pos), rec_idx);
}
byte_pos += char_len;
char_pos++;
}
// 5. Update case-insensitive indices (same logic with lowercase data)
// ... (similar to step 4 but using data_cache_lower)
// 6. Update length bitmaps (BOTH case-sensitive and case-insensitive)
int char_count = utf8_char_count(str, byte_len);
add_to_length_bitmaps(char_count, rec_idx);
int lower_char_count = utf8_char_count(data_cache_lower[rec_idx]);
add_to_length_bitmaps_lower(lower_char_count, rec_idx);
}
Delete (Lazy + Cleanup)
bool callback(ItemPointer tid) {
// Step 1: Mark as tombstone (lazy delete)
bitmap_add(tombstones, rec_idx);
push_free_slot(rec_idx);
tombstone_count++;
// Step 2: Periodic cleanup when threshold reached
if (tombstone_count >= 1000) {
// Remove from ALL indices (case-sensitive + case-insensitive)
for (ch = 0; ch < 256; ch++) {
for (j = 0; j < pos_idx[ch].count; j++)
bitmap_andnot(pos_idx[ch].entries[j].bitmap, tombstones);
for (j = 0; j < neg_idx[ch].count; j++)
bitmap_andnot(neg_idx[ch].entries[j].bitmap, tombstones);
bitmap_andnot(char_cache[ch], tombstones);
// Same for lowercase indices
for (j = 0; j < pos_idx_lower[ch].count; j++)
bitmap_andnot(pos_idx_lower[ch].entries[j].bitmap, tombstones);
// ... etc
}
// Clean up length bitmaps (BOTH sets)
for (i = 0; i <= max_length; i++) {
bitmap_andnot(length_bitmaps[i], tombstones);
bitmap_andnot(length_ge_bitmaps[i], tombstones);
}
for (i = 0; i <= max_length_lower; i++) {
bitmap_andnot(length_bitmaps_lower[i], tombstones);
bitmap_andnot(length_ge_bitmaps_lower[i], tombstones);
}
// Reset tombstone bitmap
bitmap_free(tombstones);
tombstones = bitmap_create();
tombstone_count = 0;
}
}
Update
Updates are delete + insert:
// PostgreSQL calls:
// 1. bulkdelete(old_tid) → Marks old version as tombstone
// 2. insert(new_values, new_tid) → Creates new version
Memory Management
Cache Strategy
Indices are cached in CacheMemoryContext for persistence across queries:
static BiscuitIndexCacheEntry *cache_head = NULL;
// On first access:
idx = load_index(relation);
cache_insert(relation_oid, idx);
// On subsequent access:
idx = cache_lookup(relation_oid);
Size Calculation
PG_FUNCTION_INFO_V1(biscuit_index_memory_size);
Datum biscuit_index_memory_size(Oid indexoid) {
BiscuitIndex *idx = load_index(indexoid);
size_t total = 0;
// Base structure
total += sizeof(BiscuitIndex);
// TID array
total += capacity * sizeof(ItemPointerData);
// String data caches (original + lowercase)
for (i = 0; i < num_records; i++) {
total += strlen(data_cache[i]) + 1;
total += strlen(data_cache_lower[i]) + 1; // Lowercase cache
}
// Case-sensitive indices (256 characters)
for (ch = 0; ch < 256; ch++) {
total += charindex_size(&pos_idx[ch]);
total += charindex_size(&neg_idx[ch]);
total += bitmap_size(char_cache[ch]);
}
// Case-insensitive indices (256 characters)
for (ch = 0; ch < 256; ch++) {
total += charindex_size(&pos_idx_lower[ch]);
total += charindex_size(&neg_idx_lower[ch]);
total += bitmap_size(char_cache_lower[ch]);
}
// Length bitmaps (BOTH sets)
for (i = 0; i <= max_length; i++) {
total += bitmap_size(length_bitmaps[i]);
total += bitmap_size(length_ge_bitmaps[i]);
}
for (i = 0; i <= max_length_lower; i++) {
total += bitmap_size(length_bitmaps_lower[i]);
total += bitmap_size(length_ge_bitmaps_lower[i]);
}
// Tombstones + free list
total += bitmap_size(tombstones);
total += free_capacity * sizeof(uint32_t);
return total;
}
Cleanup
Biscuit registers callbacks for safe cleanup:
void biscuit_register_callback() {
// Invalidate cache when relation changes
CacheRegisterRelcacheCallback(biscuit_relcache_callback);
// Clean up on process exit
on_proc_exit(biscuit_module_unload_callback);
}
Disk Persistence
Metadata Only
Biscuit stores only a metadata marker on disk:
typedef struct BiscuitMetaPageData {
uint32 magic; // 0x42495343 ("BISC")
uint32 version; // 1
BlockNumber root; // 0
uint32 num_records;
} BiscuitMetaPageData;
Rebuild Strategy
Bitmaps are not serialized. On index load:
Read metadata marker
Scan heap table
Rebuild all bitmaps (case-sensitive + case-insensitive) in memory
Generate lowercase data cache
Cache in
CacheMemoryContext
Rationale:
Bitmap serialization is complex and large
Memory representation is optimal for queries
Dual indices (LIKE + ILIKE) would double disk size
Limitations & Tradeoffs
What Biscuit Does Well
Pattern matching (
LIKE,ILIKE,NOT LIKE,NOT ILIKE)UTF-8 multi-byte character support
Prefix/suffix queries (
'abc%','%xyz')Complex multi-wildcard patterns
Multi-column pattern searches
Case-sensitive and case-insensitive (separate indices)
Aggregates (
COUNT(*),EXISTS)International text (German, Turkish, Chinese, etc.)
What Biscuit Doesn’t Do
Equality/range queries: Use B-tree
Full-text search: Use GIN with tsvector
Regex: Not supported
Very long strings: Memory usage scales with character length
Online serialization: Requires rebuild on load
Locale changes: Lowercase cache is locale-dependent