Biscuit Index Architecture

Overview

Biscuit is a PostgreSQL index access method (AM) optimized for pattern matching operations (LIKE, ILIKE, NOT LIKE, NOT ILIKE). It uses compressed bitmap indices with UTF-8-aware character-position indexing to achieve faster query performance on pattern matching workloads.

Core Design Principles

1. UTF-8 Character-Level Position-Based Indexing

Biscuit indexes strings at the character level, not byte level, correctly handling multi-byte UTF-8 sequences:

String: "café" = [63 61 66 C3 A9] (5 bytes)
Character 0: 'c' → bitmap of record IDs
Character 1: 'a' → bitmap of record IDs  
Character 2: 'f' → bitmap of record IDs
Character 3: 'é' (bytes C3 A9) → bitmap of record IDs

Key Innovation: Multi-byte UTF-8 characters are indexed with ALL their bytes at the SAME character position. This enables correct pattern matching for international text while maintaining O(1) character lookup.

Why this matters:

'café' has 4 characters but 5 bytes
Pattern 'caf_' (4 characters) should match 'café'
Character counts are used for length filters, not byte counts

2. Dual Indexing (Forward + Reverse)

Biscuit maintains two complementary index structures:

Positive Index: Character positions from the start (0, 1, 2, …)
Negative Index: Character positions from the end (-1, -2, -3, …)

This allows anchored patterns to be resolved instantly:

'abc%' → Query positive index at character positions 0, 1, 2
'%xyz' → Query negative index at character positions -3, -2, -1
'abc%xyz' → Query both indices and intersect results

3. Roaring Bitmap Compression

All bitmaps use Roaring Bitmaps for space efficiency:

Run-length encoding for consecutive IDs
Array containers for sparse data
Bitmap containers for dense data

4. Dual Case Sensitivity Architecture

Each column maintains two completely separate index structures with their own length bitmaps:

typedef struct ColumnIndex {
    // ==================== CASE-SENSITIVE (LIKE, NOT LIKE) ====================
    CharIndex pos_idx[256];               // Forward character positions
    CharIndex neg_idx[256];               // Reverse character positions
    RoaringBitmap *char_cache[256];       // Character presence cache
    RoaringBitmap **length_bitmaps;       // Exact length (based on original string)
    RoaringBitmap **length_ge_bitmaps;    // Length >= N (based on original string)
    int max_length;                       // Max character count (original)
    
    // ==================== CASE-INSENSITIVE (ILIKE, NOT ILIKE) ====================
    CharIndex pos_idx_lower[256];         // Lowercase forward positions
    CharIndex neg_idx_lower[256];         // Lowercase reverse positions
    RoaringBitmap *char_cache_lower[256]; // Lowercase character cache
    RoaringBitmap **length_bitmaps_lower;    // Exact length (lowercase string)
    RoaringBitmap **length_ge_bitmaps_lower; // Length >= N (lowercase string)
    int max_length_lower;                 // Max character count (lowercase)
} ColumnIndex;

Critical Design Decision: Case-sensitive and case-insensitive indices have separate length bitmaps because lowercasing can change character counts (e.g., German 'ß' → 'ss').

Why separate data caches?

// Original data (for LIKE)
data_cache[rec_idx] = "Café";  // 4 characters

// Lowercase data (for ILIKE)
data_cache_lower[rec_idx] = biscuit_str_tolower("Café");  // "café", still 4 chars

Lowercase conversion is done once during index build using PostgreSQL’s locale-aware lower() function, making ILIKE queries as fast as LIKE.

Index Structure

Main Index Object

typedef struct BiscuitIndex {
    // ==================== MULTI-COLUMN SUPPORT ====================
    int num_columns;
    Oid *column_types;
    FmgrInfo *output_funcs;
    char ***column_data_cache;      // [col][record] - original data
    
    // Per-column indices (multi-column mode)
    ColumnIndex *column_indices;    // Each column has dual indices
    
    // ==================== SINGLE-COLUMN LEGACY (BACKWARD COMPAT) ====================
    // Case-sensitive
    CharIndex pos_idx_legacy[256];
    CharIndex neg_idx_legacy[256];
    RoaringBitmap *char_cache_legacy[256];
    RoaringBitmap **length_bitmaps_legacy;
    RoaringBitmap **length_ge_bitmaps_legacy;
    int max_length_legacy;
    
    // Case-insensitive
    CharIndex pos_idx_lower[256];
    CharIndex neg_idx_lower[256];
    RoaringBitmap *char_cache_lower[256];
    RoaringBitmap **length_bitmaps_lower;
    RoaringBitmap **length_ge_bitmaps_lower;
    int max_length_lower;
    char **data_cache_lower;        // Lowercase data cache
    
    // ==================== RECORD STORAGE ====================
    ItemPointerData *tids;          // Heap tuple IDs
    char **data_cache;              // Original strings (case-sensitive)
    int num_records;
    int capacity;
    int max_len;                    // Max character count (case-sensitive)
    
    // ==================== CRUD SUPPORT ====================
    RoaringBitmap *tombstones;      // Deleted records
    uint32_t *free_list;            // Reusable slots
    int free_count;
    int free_capacity;
    int tombstone_count;
    
    // ==================== STATISTICS ====================
    int64 insert_count;
    int64 update_count;
    int64 delete_count;
} BiscuitIndex;

Character Index Structure

typedef struct {
    int pos;                        // Character position (or negative offset)
    RoaringBitmap *bitmap;          // Record IDs matching this char/pos
} PosEntry;

typedef struct {
    PosEntry *entries;
    int count;
    int capacity;
} CharIndex;

Each character (0-255) has a CharIndex containing all character positions where it appears.

Length Bitmaps (Dual Architecture)

// ==================== CASE-SENSITIVE LENGTH BITMAPS ====================
// Exact length queries: "___" (3 underscores)
RoaringBitmap *length_bitmaps[max_length + 1];

// Range length queries: "%___" (length >= 3)
RoaringBitmap *length_ge_bitmaps[max_length + 1];

// ==================== CASE-INSENSITIVE LENGTH BITMAPS (SEPARATE!) ====================
// Exact lowercase length: ILIKE '___'
RoaringBitmap *length_bitmaps_lower[max_length_lower + 1];

// Lowercase length range: ILIKE '%___'
RoaringBitmap *length_ge_bitmaps_lower[max_length_lower + 1];

Why separate? Lowercasing can change character counts:

German: 'ß' (1 char) → 'ss' (2 chars)
Turkish: 'İ' (1 char) → 'i̇' (potentially 2 chars depending on locale)

Each set of length bitmaps is populated based on the actual character count of the indexed strings (original or lowercase).

UTF-8 Character Handling

Character Counting

static int 
biscuit_utf8_char_count(const char *str, int byte_len)
{
    int char_count = 0;
    int pos = 0;
    
    while (pos < byte_len) {
        int char_len = biscuit_utf8_char_length((unsigned char)str[pos]);
        
        // Safety: Don't read past buffer
        if (pos + char_len > byte_len)
            char_len = byte_len - pos;
        
        pos += char_len;
        char_count++;
    }
    
    return char_count;
}

Example:

"café" = [0x63 0x61 0x66 0xC3 0xA9]
         'c'  'a'  'f'  'é' (multi-byte)
         
byte_len = 5
char_count = 4  ✓ Correct!

Multi-Byte Character Indexing

// Index UTF-8 character at character position
int byte_pos = 0;
int char_pos = 0;

while (byte_pos < byte_len) {
    unsigned char first_byte = str[byte_pos];
    int char_len = biscuit_utf8_char_length(first_byte);
    
    // ✓ CRITICAL: Index ALL bytes at SAME character position
    for (int b = 0; b < char_len; b++) {
        unsigned char byte_val = str[byte_pos + b];
        
        // Positive position
        bitmap = get_pos_bitmap(idx, byte_val, char_pos);
        bitmap_add(bitmap, rec_idx);
        
        // Negative position
        int remaining_chars = utf8_char_count(str + byte_pos, byte_len - byte_pos);
        int neg_offset = -remaining_chars;
        bitmap = get_neg_bitmap(idx, byte_val, neg_offset);
        bitmap_add(bitmap, rec_idx);
    }
    
    byte_pos += char_len;
    char_pos++;  // Increment ONCE per character
}

Example: Indexing “café”

char_pos=0: 'c' (0x63) → pos_bitmap[0x63][0]
char_pos=1: 'a' (0x61) → pos_bitmap[0x61][1]
char_pos=2: 'f' (0x66) → pos_bitmap[0x66][2]
char_pos=3: 'é' (0xC3 0xA9) → pos_bitmap[0xC3][3] AND pos_bitmap[0xA9][3]
                              ^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^
                              Both bytes at SAME position!

This ensures that pattern 'caf_' correctly matches 'café' by checking character position 3, not byte position 3.

Query Processing

1. UTF-8-Aware Pattern Analysis

typedef struct ParsedPattern {
    char **parts;              // Pattern parts split by %
    int *part_lens;            // CHARACTER counts (for length filters)
    int *part_byte_lens;       // BYTE lengths (for string operations)
    int part_count;
    bool starts_percent;
    bool ends_percent;
} ParsedPattern;

Critical Fix: Store both character counts and byte lengths:

Character counts used for length filters (length_ge_bitmaps)
Byte lengths used for string operations (memcmp, offsets)

2. Query Optimization (Multi-Column)

typedef struct {
    int column_index;
    char *pattern;
    ScanKey scan_key;
    
    // Pattern analysis
    bool has_percent;
    bool is_prefix;      // 'abc%'
    bool is_suffix;      // '%abc'
    bool is_exact;       // 'abc'
    bool is_substring;   // '%abc%'
    
    int concrete_chars;       // UTF-8 character count
    int underscore_count;
    int percent_count;
    int partition_count;
    int anchor_strength;      // 0-100
    
    double selectivity_score; // Lower = more selective
    int priority;             // Execution order
} QueryPredicate;

The query optimizer reorders predicates to minimize candidates:

Priority Tiers:
0-10:  Exact matches, many underscores
10-20: Non-% patterns with underscores
20-30: Strong anchored patterns (>30 concrete chars)
30-40: Weak anchored patterns
40-50: Multi-partition patterns
50+:   Substring patterns (least selective)

3. Fast Paths (UTF-8-Aware)

Empty Pattern (''):

// Returns only zero-length strings
if (length_bitmaps[0])
    return copy(length_bitmaps[0]);

Universal Match ('%'):

// Matches everything except deleted records
for (i = 0; i < num_records; i++)
    if (!is_tombstoned(i))
        add_to_result(i);

Pure Underscore Wildcards ('___'):

// Exact CHARACTER length constraint: 3 characters
return copy(length_bitmaps[3]);  // Uses CHARACTER counts!

Character Length Range ('%___'):

// Any string with CHARACTER length >= 3
return copy(length_ge_bitmaps[3]);  // Uses CHARACTER counts!

Exact Match ('abc'):

// Must match at character position 0 with exact CHARACTER length 3
result = pos_bitmap['a'][0];
result &= pos_bitmap['b'][1];
result &= pos_bitmap['c'][2];
result &= length_bitmaps[3];  // CHARACTER length = 3

Prefix Match ('abc%'):

// Must start with 'abc', any CHARACTER length >= 3
result = pos_bitmap['a'][0];
result &= pos_bitmap['b'][1];
result &= pos_bitmap['c'][2];
// No exact length constraint

Suffix Match ('%abc'):

// Must end with 'abc', any CHARACTER length >= 3
result = neg_bitmap['a'][-3];  // 3rd CHARACTER from end
result &= neg_bitmap['b'][-2];  // 2nd CHARACTER from end
result &= neg_bitmap['c'][-1];  // Last CHARACTER

Substring Match ('%abc%' or '%café%'):

// CRITICAL FIX: UTF-8-aware substring search
// Strategy:
// 1. Get candidates using first byte of pattern
// 2. Validate UTF-8 boundaries with character-aware matching

result = empty_bitmap();

// Get candidates (fast filter)
if (char_cache[first_byte])
    candidates = copy(char_cache[first_byte]);

// Apply CHARACTER length filter (not byte length!)
int pattern_char_count = utf8_char_count(pattern, pattern_byte_len);
length_filter = get_length_ge(idx, pattern_char_count);
bitmap_and(candidates, length_filter);

// Validate each candidate (correct but slower)
for each candidate rec_idx {
    haystack = data_cache[rec_idx];
    haystack_char_count = utf8_char_count(haystack);
    
    // Try matching at each CHARACTER position
    for (char_pos = 0; char_pos <= haystack_char_count - pattern_char_count; char_pos++) {
        // Convert character position to byte offset
        byte_offset = utf8_char_to_byte_offset(haystack, char_pos);
        
        // Match pattern bytes at this position
        if (match_utf8_pattern(haystack + byte_offset, pattern)) {
            bitmap_add(result, rec_idx);
            break;
        }
    }
}

Why substring is slow:

Must validate UTF-8 character boundaries
Cannot rely solely on byte-level bitmap intersections
Uses hybrid approach: bitmap filtering + validation

4. Windowed Matching (Complex Patterns)

For patterns like 'abc%def%ghi', Biscuit uses UTF-8-aware recursive windowed matching:

static void 
biscuit_recursive_windowed_match(
    RoaringBitmap *result, BiscuitIndex *idx,
    const char **parts, int *part_lens,      // CHARACTER counts
    int part_count, bool ends_percent,
    int part_idx, int min_pos,               // CHARACTER position
    RoaringBitmap *current_candidates, int max_len)
{
    // Base case: all parts matched
    if (part_idx >= part_count) {
        bitmap_or(result, current_candidates);
        return;
    }
    
    // Calculate remaining CHARACTER length needed
    int remaining_chars = 0;
    for (int i = part_idx + 1; i < part_count; i++)
        remaining_chars += part_lens[i];  // CHARACTER counts
    
    // Last part without trailing % must match at end
    if (part_idx == part_count - 1 && !ends_percent) {
        end_match = match_part_at_end(idx, parts[part_idx]);
        bitmap_and(end_match, current_candidates);
        
        // CHARACTER length constraint
        min_length = min_pos + part_lens[part_idx];
        length_filter = get_length_ge(idx, min_length);
        bitmap_and(end_match, length_filter);
        
        bitmap_or(result, end_match);
        return;
    }
    
    // Try all valid CHARACTER positions
    int current_part_chars = part_lens[part_idx];
    int max_pos = max_len - current_part_chars - remaining_chars;
    
    for (int char_pos = min_pos; char_pos <= max_pos; char_pos++) {
        part_match = match_part_at_pos(idx, parts[part_idx], char_pos);
        next_candidates = bitmap_copy(current_candidates);
        bitmap_and(next_candidates, part_match);
        
        if (!bitmap_is_empty(next_candidates)) {
            // Recurse with next CHARACTER position
            recurse(result, idx, parts, part_lens, part_count,
                    ends_percent, part_idx + 1, 
                    char_pos + current_part_chars,  // CHARACTER arithmetic
                    next_candidates, max_len);
        }
    }
}

Key invariant: All position arithmetic uses character offsets, not byte offsets.

ILIKE Implementation

Architecture

// ILIKE query processing
static RoaringBitmap* 
biscuit_query_pattern_ilike(BiscuitIndex *idx, const char *pattern) {
    // Step 1: Convert pattern to lowercase ONCE
    char *pattern_lower = biscuit_str_tolower(pattern, strlen(pattern));
    
    // Step 2: Parse lowercase pattern
    ParsedPattern *parsed = biscuit_parse_pattern(pattern_lower);
    
    // Step 3: Query lowercase indices
    result = match_using_lowercase_indices(idx, parsed);
    
    // Cleanup
    free_parsed_pattern(parsed);
    pfree(pattern_lower);
    
    return result;
}

Case Folding

static char *
biscuit_str_tolower(const char *str, int len)
{
    text *input = cstring_to_text_with_len(str, len);
    Oid collation = get_typcollation(TEXTOID);  // Database default
    
    text *result_text = DatumGetTextP(
        DirectFunctionCall2Coll(
            lower,              // PostgreSQL's locale-aware lower()
            collation,
            PointerGetDatum(input),
            collation
        )
    );
    
    char *result = text_to_cstring(result_text);
    return result;  // Allocated with palloc
}

Why use PostgreSQL’s lower()?

Locale-aware (handles Turkish İ/i, German ß, etc.)
Consistent with database collation
Stable across index builds

Data Flow

Index Build:
1. Original: "Café" → data_cache[idx] = "Café"
2. Lowercase: "Café" → data_cache_lower[idx] = biscuit_str_tolower("Café") = "café"
3. Index both:
   - pos_idx[...] indexes "Café" (case-sensitive)
   - pos_idx_lower[...] indexes "café" (case-insensitive)

LIKE Query:
"WHERE col LIKE 'Caf%'"
→ Query pos_idx with "Caf%" → Matches "Café" ✓

ILIKE Query:
"WHERE col ILIKE 'caf%'"
→ Convert pattern: "caf%" → "caf%"
→ Query pos_idx_lower with "caf%" → Matches "café" ✓

Performance Optimizations

1. Skip Wildcard Intersections

'_' wildcards match everything at a character position, so they’re skipped:

for (i = 0; i < part_len; i++) {
    if (part[i] == '_') {
        char_pos++;  // Skip wildcard, advance CHARACTER position
        continue;
    }
    
    bm = get_pos_bitmap(idx, part[i], char_pos);
    result = bitmap_and(result, bm);
    char_pos++;
}

2. Early Termination

Stop processing immediately if any intersection yields zero results:

bitmap_and_inplace(result, char_bitmap);
if (bitmap_is_empty(result))
    return result;  // No point continuing

3. Character vs. Byte Length Separation

typedef struct ParsedPattern {
    char **parts;
    int *part_lens;         // ✓ CHARACTER counts (for length filters)
    int *part_byte_lens;    // ✓ BYTE lengths (for string operations)
    int part_count;
} ParsedPattern;

// Parsing:
part_lens[i] = utf8_char_count(part, part_byte_len);  // Characters
part_byte_lens[i] = part_byte_len;                    // Bytes

Why both?

Character counts for length filters: length_ge_bitmaps[char_count]
Byte lengths for memory operations: memcmp(str, part, byte_len)

4. TID Sorting for Sequential I/O

Results are sorted by (block, offset) to enable sequential heap scans:

if (count < 5000) {
    qsort(tids, count, sizeof(ItemPointer), compare_tids);
} else {
    radix_sort_tids(tids, count);  // O(n) for large sets
}

5. Skip Sorting for Aggregates

COUNT(*) and EXISTS queries don’t need sorted TIDs:

bool is_aggregate = !scan->xs_want_itup;
if (is_aggregate) {
    // Return unsorted TIDs for bitmap heap scan
    // Saves O(n log n) sorting cost
}

6. Parallel TID Collection

Large result sets (10k+ TIDs) are collected using multiple workers:

if (count >= 10000) {
    int num_workers = (count < 100000) ? 2 : 4;
    // Distribute bitmap iteration across workers
}

7. LIMIT-Aware Collection

Stop collecting TIDs early when LIMIT is detected:

int limit_hint = estimate_limit(scan);
if (limit_hint > 0 && idx >= limit_hint)
    break;  // Collected enough

Multi-Column Support

Architecture

Each column maintains its own completely independent Biscuit index:

BiscuitIndex {
    ColumnIndex column_indices[N];  // Each column: dual indices + dual length bitmaps
}

Query Execution

Multi-column queries use intelligent predicate reordering:

WHERE col1 LIKE 'abc%' AND col2 ILIKE '%xyz'

Execution plan:

Analyze: Score each predicate for selectivity

col1 'abc%': priority=20, selectivity=0.15 (strong prefix)
col2 '%xyz': priority=30, selectivity=0.25 (weak suffix)

Reorder: Execute most selective first

Execute col1 first (lower priority value = higher priority)

Execute:

// Execute col1 LIKE (case-sensitive indices)
candidates = query_column_pattern(idx, col1_idx, 'abc%');

// Execute col2 ILIKE (case-insensitive indices)
col2_result = query_column_pattern_ilike(idx, col2_idx, '%xyz');

// Intersect
bitmap_and(candidates, col2_result);

Filter tombstones: Remove deleted records
Collect TIDs: Convert bitmap to tuple IDs

Strategy Routing

// Route based on scan key strategy
switch (key->sk_strategy) {
    case BISCUIT_LIKE_STRATEGY:
        result = query_column_pattern(idx, col_idx, pattern);
        break;
        
    case BISCUIT_NOT_LIKE_STRATEGY:
        result = query_column_pattern(idx, col_idx, pattern);
        bitmap_invert(result);  // Negate
        break;
        
    case BISCUIT_ILIKE_STRATEGY:
        result = query_column_pattern_ilike(idx, col_idx, pattern);
        break;
        
    case BISCUIT_NOT_ILIKE_STRATEGY:
        result = query_column_pattern_ilike(idx, col_idx, pattern);
        bitmap_invert(result);  // Negate
        break;
}

CRUD Operations

Insert

bool biscuit_insert(values[], isnull[], ht_ctid) {
    // 1. Try to reuse deleted slot
    if (pop_free_slot(&rec_idx)) {
        remove_from_tombstones(rec_idx);
    } else {
        rec_idx = num_records++;
    }
    
    // 2. Store TID and data
    tids[rec_idx] = *ht_ctid;
    data_cache[rec_idx] = copy_string(value);
    
    // 3. Create lowercase version
    data_cache_lower[rec_idx] = biscuit_str_tolower(value, strlen(value));
    
    // 4. Update case-sensitive indices
    int byte_pos = 0, char_pos = 0;
    while (byte_pos < byte_len) {
        int char_len = utf8_char_length(str[byte_pos]);
        
        // Index ALL bytes at SAME character position
        for (int b = 0; b < char_len; b++) {
            add_to_pos_bitmap(str[byte_pos + b], char_pos, rec_idx);
            add_to_neg_bitmap(str[byte_pos + b], -(len_chars - char_pos), rec_idx);
        }
        
        byte_pos += char_len;
        char_pos++;
    }
    
    // 5. Update case-insensitive indices (same logic with lowercase data)
    // ... (similar to step 4 but using data_cache_lower)
    
    // 6. Update length bitmaps (BOTH case-sensitive and case-insensitive)
    int char_count = utf8_char_count(str, byte_len);
    add_to_length_bitmaps(char_count, rec_idx);
    
    int lower_char_count = utf8_char_count(data_cache_lower[rec_idx]);
    add_to_length_bitmaps_lower(lower_char_count, rec_idx);
}

Delete (Lazy + Cleanup)

bool callback(ItemPointer tid) {
    // Step 1: Mark as tombstone (lazy delete)
    bitmap_add(tombstones, rec_idx);
    push_free_slot(rec_idx);
    tombstone_count++;
    
    // Step 2: Periodic cleanup when threshold reached
    if (tombstone_count >= 1000) {
        // Remove from ALL indices (case-sensitive + case-insensitive)
        for (ch = 0; ch < 256; ch++) {
            for (j = 0; j < pos_idx[ch].count; j++)
                bitmap_andnot(pos_idx[ch].entries[j].bitmap, tombstones);
            
            for (j = 0; j < neg_idx[ch].count; j++)
                bitmap_andnot(neg_idx[ch].entries[j].bitmap, tombstones);
            
            bitmap_andnot(char_cache[ch], tombstones);
            
            // Same for lowercase indices
            for (j = 0; j < pos_idx_lower[ch].count; j++)
                bitmap_andnot(pos_idx_lower[ch].entries[j].bitmap, tombstones);
            // ... etc
        }
        
        // Clean up length bitmaps (BOTH sets)
        for (i = 0; i <= max_length; i++) {
            bitmap_andnot(length_bitmaps[i], tombstones);
            bitmap_andnot(length_ge_bitmaps[i], tombstones);
        }
        
        for (i = 0; i <= max_length_lower; i++) {
            bitmap_andnot(length_bitmaps_lower[i], tombstones);
            bitmap_andnot(length_ge_bitmaps_lower[i], tombstones);
        }
        
        // Reset tombstone bitmap
        bitmap_free(tombstones);
        tombstones = bitmap_create();
        tombstone_count = 0;
    }
}

Update

Updates are delete + insert:

// PostgreSQL calls:
// 1. bulkdelete(old_tid)  → Marks old version as tombstone
// 2. insert(new_values, new_tid) → Creates new version

Memory Management

Cache Strategy

Indices are cached in CacheMemoryContext for persistence across queries:

static BiscuitIndexCacheEntry *cache_head = NULL;

// On first access:
idx = load_index(relation);
cache_insert(relation_oid, idx);

// On subsequent access:
idx = cache_lookup(relation_oid);

Size Calculation

PG_FUNCTION_INFO_V1(biscuit_index_memory_size);

Datum biscuit_index_memory_size(Oid indexoid) {
    BiscuitIndex *idx = load_index(indexoid);
    
    size_t total = 0;
    
    // Base structure
    total += sizeof(BiscuitIndex);
    
    // TID array
    total += capacity * sizeof(ItemPointerData);
    
    // String data caches (original + lowercase)
    for (i = 0; i < num_records; i++) {
        total += strlen(data_cache[i]) + 1;
        total += strlen(data_cache_lower[i]) + 1;  // Lowercase cache
    }
    
    // Case-sensitive indices (256 characters)
    for (ch = 0; ch < 256; ch++) {
        total += charindex_size(&pos_idx[ch]);
        total += charindex_size(&neg_idx[ch]);
        total += bitmap_size(char_cache[ch]);
    }
    
    // Case-insensitive indices (256 characters)
    for (ch = 0; ch < 256; ch++) {
        total += charindex_size(&pos_idx_lower[ch]);
        total += charindex_size(&neg_idx_lower[ch]);
        total += bitmap_size(char_cache_lower[ch]);
    }
    
    // Length bitmaps (BOTH sets)
    for (i = 0; i <= max_length; i++) {
        total += bitmap_size(length_bitmaps[i]);
        total += bitmap_size(length_ge_bitmaps[i]);
    }
    
    for (i = 0; i <= max_length_lower; i++) {
        total += bitmap_size(length_bitmaps_lower[i]);
        total += bitmap_size(length_ge_bitmaps_lower[i]);
    }
    
    // Tombstones + free list
    total += bitmap_size(tombstones);
    total += free_capacity * sizeof(uint32_t);
    
    return total;
}

Cleanup

Biscuit registers callbacks for safe cleanup:

void biscuit_register_callback() {
    // Invalidate cache when relation changes
    CacheRegisterRelcacheCallback(biscuit_relcache_callback);
    
    // Clean up on process exit
    on_proc_exit(biscuit_module_unload_callback);
}

Disk Persistence

Metadata Only

Biscuit stores only a metadata marker on disk:

typedef struct BiscuitMetaPageData {
    uint32 magic;        // 0x42495343 ("BISC")
    uint32 version;      // 1
    BlockNumber root;    // 0
    uint32 num_records;
} BiscuitMetaPageData;

Rebuild Strategy

Bitmaps are not serialized. On index load:

Read metadata marker
Scan heap table
Rebuild all bitmaps (case-sensitive + case-insensitive) in memory
Generate lowercase data cache
Cache in CacheMemoryContext

Rationale:

Bitmap serialization is complex and large
Memory representation is optimal for queries
Dual indices (LIKE + ILIKE) would double disk size

Limitations & Tradeoffs

What Biscuit Does Well

Pattern matching (LIKE, ILIKE, NOT LIKE, NOT ILIKE)
UTF-8 multi-byte character support
Prefix/suffix queries ('abc%', '%xyz')
Complex multi-wildcard patterns
Multi-column pattern searches
Case-sensitive and case-insensitive (separate indices)
Aggregates (COUNT(*), EXISTS)
International text (German, Turkish, Chinese, etc.)

What Biscuit Doesn’t Do

Equality/range queries: Use B-tree
Full-text search: Use GIN with tsvector
Regex: Not supported
Very long strings: Memory usage scales with character length
Online serialization: Requires rebuild on load
Locale changes: Lowercase cache is locale-dependent

Code Navigation

Critical UTF-8 Functions

biscuit_utf8_char_length() - Get UTF-8 character byte length
biscuit_utf8_char_count() - Count characters (not bytes)
biscuit_utf8_char_to_byte_offset() - Convert char position to byte offset
biscuit_str_tolower() - Locale-aware lowercase conversion

Index Build

biscuit_build() - Single-column index construction
biscuit_build_multicolumn() - Multi-column index construction
biscuit_load_index() - Load/rebuild index from disk

Query Processing

biscuit_rescan() - Query execution entry (single-column)
biscuit_rescan_multicolumn() - Query execution (multi-column)
biscuit_query_pattern() - LIKE query (case-sensitive)
biscuit_query_pattern_ilike() - ILIKE query (case-insensitive)
biscuit_match_part_at_pos() - Windowed matching (forward)
biscuit_match_part_at_end() - Windowed matching (reverse)
create_query_plan() - Multi-column optimizer
biscuit_collect_tids_optimized() - Result collection

CRUD

biscuit_insert() - Insert with dual indexing
biscuit_bulkdelete() - Lazy delete with cleanup
biscuit_remove_from_all_indices() - Remove from all bitmaps

Diagnostics

biscuit_index_stats() - Runtime statistics
biscuit_index_memory_size() - Memory footprint calculation
biscuit_has_roaring() - Check Roaring support
biscuit_build_info() - Build configuration