Incremental Indexing

Incremental indexing makes SCIP index updates O(changed files) instead of O(entire repo). After editing a file, the index updates in seconds instead of requiring a full reindex.

Availability: Go, plus TypeScript, JavaScript, Python, Dart, and Rust from v7.5. Other languages fall back to full reindexing.

v1.1 (v7.3): Adds incremental callgraph maintenance—outgoing calls from changed files are always accurate.

v2.0 (v7.3): Adds transitive invalidation—files depending on changed files can be automatically queued for rescanning.

v4.0 (v7.3): Adds CI-generated delta artifacts for O(delta) server-side ingestion.

v5.0 (v7.5): Adds multi-language support via indexer registry pattern.

Why Incremental Indexing?

Full SCIP indexing scans your entire codebase, which can take 30+ seconds for large projects. This creates friction:

During development: You edit one file but wait 30s for the index to update
In CI/CD: Every commit triggers a full reindex even if only one file changed
With watch mode: Frequent reindexes burn CPU and slow down your machine

Incremental indexing solves this by only processing changed files.

How It Works

┌─────────────────┐     ┌──────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│ Change Detection│ ──► │ SCIP Extraction  │ ──► │ Delta Application │ ──► │ Transitive       │
│ (git diff -z)   │     │ (symbols + calls)│     │ (delete+insert)   │     │ Invalidation (v2)│
└─────────────────┘     └──────────────────┘     └───────────────────┘     └──────────────────┘

1. Change Detection

CKB detects changes using git:

git diff --name-status -z <last-indexed-commit> HEAD

The -z flag uses NUL separators, correctly handling paths with spaces or special characters.

Tracked change types:

Added - New .go files
Modified - Changed .go files
Deleted - Removed .go files
Renamed - Moved/renamed .go files (tracks old path for cleanup)
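
As an illustration of why the NUL-separated format is easy to consume, here is a minimal Python sketch that parses `git diff --name-status -z` output into the change types listed above. The `Change` class and `parse_name_status_z` function are hypothetical names for this example, not part of CKB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Change:
    kind: str                       # "added", "modified", "deleted", or "renamed"
    path: str                       # current path (new path for renames)
    old_path: Optional[str] = None  # previous path, set only for renames


# Single-letter git status codes mapped to the change types listed above.
KINDS = {"A": "added", "M": "modified", "D": "deleted"}


def parse_name_status_z(raw: bytes) -> List[Change]:
    """Parse the raw output of `git diff --name-status -z`."""
    fields = raw.decode("utf-8", errors="surrogateescape").split("\0")
    if fields and fields[-1] == "":
        fields.pop()  # the output ends with a trailing NUL

    changes: List[Change] = []
    i = 0
    while i < len(fields):
        code = fields[i][:1]
        if code in ("R", "C"):
            # Renames and copies carry a score (e.g. R100) followed by two paths.
            old, new = fields[i + 1], fields[i + 2]
            if code == "R":
                changes.append(Change("renamed", new, old_path=old))
            else:
                changes.append(Change("added", new))  # a copy creates a new file
            i += 3
        else:
            # Any other status (e.g. T for a type change) is treated as modified.
            changes.append(Change(KINDS.get(code, "modified"), fields[i + 1]))
            i += 2
    return changes
```

Because fields are split on NUL rather than whitespace or newlines, a path such as `dir/a b.go` survives intact.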

Fallback: For non-git repos, CKB falls back to hash-based comparison against stored file hashes.
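
The hash-based fallback can be pictured as a comparison of two path-to-hash maps. The sketch below is an assumption about the general technique, not CKB's actual code: the function names are hypothetical, and SHA-256 is chosen only for illustration because this page does not name the hash function.

```python
import hashlib
from typing import Dict, List, Tuple


def hash_bytes(content: bytes) -> str:
    # SHA-256 is an assumption made for this example.
    return hashlib.sha256(content).hexdigest()


def diff_by_hash(
    stored: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, modified, deleted) paths by comparing path -> hash maps."""
    added = sorted(p for p in current if p not in stored)
    deleted = sorted(p for p in stored if p not in current)
    modified = sorted(
        p for p in current if p in stored and stored[p] != current[p]
    )
    return added, modified, deleted
```

In this simple form a rename shows up as one deleted path plus one added path, since there is no git metadata to link the two.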

2. SCIP Extraction

CKB runs scip-go to regenerate the full SCIP index (protobuf doesn't support partial updates), but then:

Loads the index into memory
Iterates documents, only processing those in the changed set
Extracts symbols, references, and call edges for changed files only
Resolves caller symbols (which function contains each call site)
Skips unchanged documents entirely

This means even though scip-go runs on the full codebase, CKB only does the expensive database work for changed files.
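
The filtering step can be sketched as follows. Documents are modelled as plain dicts whose keys mirror SCIP field names (`relative_path`, `occurrences`, `symbol`, `symbol_roles`); the `extract_changed` function and its return shape are hypothetical and only illustrate the idea of skipping unchanged documents.

```python
from typing import Dict, Iterable, List, Set

DEFINITION = 0x1  # SCIP SymbolRole.Definition bit


def extract_changed(
    documents: Iterable[dict], changed: Set[str]
) -> Dict[str, Dict[str, List[str]]]:
    """Collect definitions and references for documents in the changed set."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for doc in documents:
        path = doc["relative_path"]
        if path not in changed:
            continue  # unchanged documents are skipped entirely
        definitions: List[str] = []
        references: List[str] = []
        for occ in doc["occurrences"]:
            is_definition = occ["symbol_roles"] & DEFINITION
            (definitions if is_definition else references).append(occ["symbol"])
        result[path] = {"definitions": definitions, "references": references}
    return result
```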

Call Edge Extraction (v1.1): For each reference to a callable symbol (function/method), CKB:

Detects callables using symbol kind or the (). pattern in symbol IDs
Resolves the enclosing function as the caller
Stores edges with location info: (caller_file, line, column, callee_id)
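
A rough sketch of these three steps is shown below. It assumes callable symbol IDs end with the `().` suffix and that each function's line span is already known. `FuncSpan`, `enclosing_function`, and `call_edges` are hypothetical names, and the edge tuple adds the resolved caller ID to the stored fields for readability.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class FuncSpan(NamedTuple):
    symbol: str
    start_line: int
    end_line: int  # inclusive


def is_callable(symbol_id: str) -> bool:
    # Assumption: function and method symbol IDs end with "().".
    return symbol_id.endswith("().")


def enclosing_function(spans: List[FuncSpan], line: int) -> Optional[str]:
    """Return the innermost function whose span contains the given line."""
    containing = [s for s in spans if s.start_line <= line <= s.end_line]
    if not containing:
        return None
    # The smallest containing span wins, which handles nested functions.
    return min(containing, key=lambda s: s.end_line - s.start_line).symbol


def call_edges(
    path: str, spans: List[FuncSpan], refs: Iterable[Tuple[int, int, str]]
) -> List[Tuple[str, int, int, Optional[str], str]]:
    """Build (caller_file, line, column, caller_id, callee_id) edges."""
    edges = []
    for line, column, symbol in refs:
        if not is_callable(symbol):
            continue  # references to types, fields, etc. are not call edges
        edges.append((path, line, column, enclosing_function(spans, line), symbol))
    return edges
```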

3. Delta Application

For each changed file, CKB applies updates using delete+insert:

Modified file.go:
  1. DELETE FROM file_symbols WHERE file_path = 'file.go'
  2. DELETE FROM indexed_files WHERE path = 'file.go'
  3. DELETE FROM callgraph WHERE caller_file = 'file.go'  -- v1.1
  4. DELETE FROM file_deps WHERE dependent_file = 'file.go'  -- v2
  5. INSERT new symbols, file state, call edges, and dependencies

Renamed old.go → new.go:
  1. DELETE using old path (including callgraph, file_deps)
  2. INSERT using new path

This approach is simple and correct—no complex diffing logic. The caller-owned edges invariant means call edges are always deleted and rebuilt with their owning file.
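
To make the delete+insert pattern concrete, the following sketch runs it against an in-memory SQLite database. The table names and key columns come from the steps above; every other column, the `SCHEMA` string, and the `apply_file_delta` function are assumptions made for this example.

```python
import sqlite3

# Only the table names and key columns are taken from the steps above;
# the remaining columns are placeholders.
SCHEMA = """
CREATE TABLE file_symbols  (file_path TEXT, symbol_id TEXT);
CREATE TABLE indexed_files (path TEXT PRIMARY KEY, hash TEXT);
CREATE TABLE callgraph     (caller_file TEXT, line INTEGER, col INTEGER, callee_id TEXT);
CREATE TABLE file_deps     (dependent_file TEXT, defining_file TEXT);
"""

OWNED_ROWS = (
    ("file_symbols", "file_path"),
    ("indexed_files", "path"),
    ("callgraph", "caller_file"),
    ("file_deps", "dependent_file"),
)


def apply_file_delta(db, old_path, new_path, file_hash, symbols, calls, deps):
    """Delete everything owned by old_path, then insert fresh rows for new_path.

    Modified file: old_path == new_path. Rename: they differ. Delete: new_path is None.
    """
    with db:  # one transaction: commits on success, rolls back on error
        for table, column in OWNED_ROWS:
            db.execute(f"DELETE FROM {table} WHERE {column} = ?", (old_path,))
        if new_path is None:
            return
        db.execute("INSERT INTO indexed_files VALUES (?, ?)", (new_path, file_hash))
        db.executemany(
            "INSERT INTO file_symbols VALUES (?, ?)", [(new_path, s) for s in symbols]
        )
        db.executemany(
            "INSERT INTO callgraph VALUES (?, ?, ?, ?)",
            [(new_path, line, col, callee) for line, col, callee in calls],
        )
        db.executemany(
            "INSERT INTO file_deps VALUES (?, ?)", [(new_path, d) for d in deps]
        )
```

Wrapping both phases in one transaction means a failure midway leaves the previous rows for that file untouched.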

4. Transitive Invalidation (v2)

When a file changes, other files that depend on it may have stale references. v2 adds transitive invalidation to track and optionally rescan these dependent files.

File Dependency Tracking:

CKB maintains a file_deps table: (dependent_file, defining_file)
When a.go references a symbol defined in b.go, CKB records a.go → b.go
Only internal dependencies are tracked (not stdlib/external packages)
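
One way to picture how file_deps rows are derived: map each referenced symbol to the file that defines it, and drop anything with no internal definition. This is a sketch under assumptions; `build_file_deps` is a hypothetical name, and skipping self-references is a choice made for this example.

```python
from typing import Dict, Iterable, Set, Tuple


def build_file_deps(
    path: str, referenced_symbols: Iterable[str], defining_file: Dict[str, str]
) -> Set[Tuple[str, str]]:
    """Return (dependent_file, defining_file) pairs for one source file."""
    deps: Set[Tuple[str, str]] = set()
    for symbol in referenced_symbols:
        target = defining_file.get(symbol)  # None for stdlib/external symbols
        if target is not None and target != path:
            deps.add((path, target))
    return deps
```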

Rescan Queue:

When b.go changes, files depending on it (a.go) are enqueued for rescanning
The queue tracks: file path, reason, BFS depth, and attempt count
Queue processing respects configurable budgets (max files, max time)
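
The queue behaviour described above can be sketched as a breadth-first walk over reverse dependencies followed by a budgeted drain. All names here (`QueueEntry`, `enqueue_dependents`, `drain`) are hypothetical; the defaults reuse the values documented in the configuration section (depth 1, 200 files, 1500 ms, with 0 meaning no time limit).

```python
import time
from collections import deque
from typing import Callable, Dict, Iterable, List, NamedTuple, Set


class QueueEntry(NamedTuple):
    path: str
    reason: str
    depth: int
    attempts: int = 0


def enqueue_dependents(
    changed: Iterable[str], dependents: Dict[str, Set[str]], max_depth: int = 1
) -> List[QueueEntry]:
    """BFS from the changed files over the defining_file -> dependents map."""
    seen = set(changed)
    frontier = deque((path, 0) for path in sorted(seen))
    queue: List[QueueEntry] = []
    while frontier:
        path, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not cascade past the configured depth
        for dep in sorted(dependents.get(path, ())):
            if dep in seen:
                continue
            seen.add(dep)
            queue.append(QueueEntry(dep, f"depends on {path}", depth + 1))
            frontier.append((dep, depth + 1))
    return queue


def drain(
    queue: List[QueueEntry],
    rescan: Callable[[str], None],
    max_files: int = 200,
    max_ms: int = 1500,
) -> List[QueueEntry]:
    """Rescan queued files until a budget runs out; return what is left."""
    start = time.monotonic()
    done = 0
    while queue and done < max_files:
        if max_ms and (time.monotonic() - start) * 1000 >= max_ms:
            break  # time budget exhausted (max_ms == 0 means unlimited)
        rescan(queue.pop(0).path)
        done += 1
    return queue
```

Entries left in the returned list stay pending until the next drain run or a full reindex.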

Usage

Default Behavior (Supported Languages)

Incremental indexing is enabled by default for supported languages:

Go - scip-go
TypeScript/JavaScript - scip-typescript
Python - scip-python
Dart - scip_dart
Rust - rust-analyzer

# Incremental by default for supported languages
ckb index

# Output for incremental update:
Incremental Index Complete
--------------------------
Files:   3 modified, 1 added, 0 deleted
Symbols: 15 added, 8 removed
Refs:    42 updated
Calls:   127 edges updated
Time:    1.2s
Commit:  abc1234 (+dirty)
Pending: 5 files queued for rescan

Accuracy:
  OK  Go to definition     - accurate
  OK  Find refs (forward)  - accurate
  !!  Find refs (reverse)  - may be stale
  OK  Callees (outgoing)   - accurate
  !!  Callers (incoming)   - may be stale

Run 'ckb index --force' for full accuracy (47 files since last full)

Force Full Reindex

# Full reindex (ignores incremental)
ckb index --force

Use --force when:

You need 100% accurate reverse references
You need accurate caller information (who calls a function)
After major refactoring across many files
When incremental reports issues
To clear the rescan queue and start fresh

Transitive Invalidation Modes (v2)

CKB supports four invalidation modes:

Mode	Behavior
`none`	Disabled—no dependency tracking or invalidation
`lazy`	Enqueue dependents, drain on next full reindex (default)
`eager`	Enqueue and drain immediately (with budgets)
`deferred`	Enqueue and drain periodically in background

Lazy Mode (Default)

In lazy mode, dependent files are queued but not immediately rescanned:

Low overhead during incremental indexing
Queue drains automatically on next ckb index --force
Best for development workflows where occasional staleness is acceptable

Eager Mode

In eager mode, CKB rescans dependent files immediately:

Higher accuracy after incremental updates
Respects budget limits to prevent runaway processing
Best when accuracy is critical

Configuration

{
  "incremental": {
    "threshold": 50,
    "indexTests": false,
    "excludes": ["vendor", "testdata"]
  },
  "transitive": {
    "enabled": true,
    "mode": "lazy",
    "depth": 1,
    "maxRescanFiles": 200,
    "maxRescanMs": 1500
  }
}

Setting	Default	Description
`enabled`	true	Enable transitive invalidation
`mode`	`lazy`	Invalidation mode: `none`, `lazy`, `eager`, `deferred`
`depth`	1	BFS cascade depth (1 = direct dependents only)
`maxRescanFiles`	200	Max files to rescan per drain run
`maxRescanMs`	1500	Max time (ms) per drain run (0 = unlimited)

Accuracy Guarantees

Incremental indexing maintains forward accuracy but may have stale reverse references. With v1.1, call graph accuracy is improved: outgoing calls (callees) are always accurate. With v2 in eager mode with queue drained, all queries are accurate.

Query Type	After Incremental	After Queue Drained
Go to definition	Always accurate	Always accurate
Find refs FROM changed files	Always accurate	Always accurate
Find refs TO symbols in changed files	May be stale	Accurate
Call graph (callees)	Always accurate	Always accurate
Call graph (callers)	May be stale	Accurate
Symbol search	Always accurate	Always accurate

Why Reverse References May Be Stale

Consider this scenario:

// utils.go (unchanged)
func Helper() { ... }

// main.go (changed - removed call to Helper)
func main() {
    // Helper()  <- removed this line
}

After incremental indexing:

main.go is re-indexed correctly (no longer references Helper)
utils.go is NOT re-indexed (unchanged)
CKB's stored references still show main.go → Helper from utils.go's perspective

This is the "caller-owned edges" invariant: references are owned by the FROM file, not the TO file.

Impact: When you ask "what calls Helper?", CKB might still show the deleted call from main.go until you run ckb index --force.

With v2 eager mode: If you change utils.go, files that depend on it are automatically rescanned, keeping reverse references accurate.

Index State Tracking

CKB tracks index state in the database:

Index State:
  State: partial (3 files since last full)
  Commit: abc1234
  Dirty: yes (uncommitted changes)
  Pending: 5 files queued for rescan

States:

full - Complete reindex, all references accurate, queue empty
partial - Incremental updates applied, reverse refs may be stale
pending - Work queued in rescan queue (v2)
full_dirty / partial_dirty - Uncommitted changes detected

When Full Reindex Is Required

CKB automatically triggers a full reindex when:

Condition	Reason
No previous index	Nothing to diff against
Schema version mismatch	Database structure changed
No tracked commit	Can't compute git diff
>50% files changed	Incremental overhead exceeds full reindex

You'll see messages like:

Full reindex required: schema version mismatch (have 7, need 8)
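
The conditions in the table can be read as a short decision function. The sketch below is illustrative only: `full_reindex_reason` and its parameters are hypothetical, and the default threshold of 50 matches the `threshold` value shown in the configuration section.

```python
from typing import Optional


def full_reindex_reason(
    has_index: bool,
    schema_have: int,
    schema_need: int,
    tracked_commit: Optional[str],
    changed_files: int,
    total_files: int,
    threshold_percent: int = 50,
) -> Optional[str]:
    """Return why a full reindex is needed, or None if incremental is fine."""
    if not has_index:
        return "no previous index"
    if schema_have != schema_need:
        return f"schema version mismatch (have {schema_have}, need {schema_need})"
    if not tracked_commit:
        return "no tracked commit"
    # Strictly more than the threshold, matching ">50% files changed".
    if total_files and changed_files * 100 > total_files * threshold_percent:
        return f"{changed_files} of {total_files} files changed"
    return None
```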

Performance Characteristics

Scenario	Full Index	Incremental
Small project (100 files)	~2s	~0.5s
Medium project (1000 files)	~15s	~1-2s
Large project (10000 files)	~60s	~2-5s
Single file change (large project)	~60s	~1s

The key insight: incremental time is proportional to changed files, not total files.

Transitive invalidation overhead (v2):

Lazy mode: negligible (~1ms to enqueue dependents)
Eager mode: depends on cascade size and budgets

Limitations

Current limitations:

Some languages unsupported - Java, Kotlin, C++, Ruby, C#, PHP always do full reindex (build complexity)
Reverse refs may be stale in lazy mode - Use eager mode or --force when accuracy is critical
Callers may be stale - Incoming calls to changed symbols may be outdated until queue drains
No partial SCIP - Still runs full indexer, just processes less output
External deps not tracked - Only internal file dependencies are tracked
Indexer must be installed - Missing indexers fall back to full reindex with install hint

Troubleshooting

"Full reindex required" every time

Check that:

You're in a git repository
The previous index completed successfully
Schema version matches (may need --force after CKB upgrade)

Incremental seems slow

If incremental takes as long as full reindex:

Check how many files changed (git status)
If more than 50% of files changed, CKB automatically falls back to a full reindex
Large individual files still take time to process

Stale references causing issues

If you're seeing phantom references:

# Force full reindex (also clears rescan queue)
ckb index --force

This rebuilds all references from scratch.

Too many pending rescans

If the rescan queue grows large:

# Check queue status
ckb status

# Force full reindex to clear queue
ckb index --force

Or increase budgets in configuration to process more files per run.

Delta Artifacts (v4)

Delta artifacts enable O(delta) server-side ingestion by pre-computing the diff in CI. Instead of the server comparing databases, CI generates a manifest of exactly what changed.

Why Delta Artifacts?

Traditional incremental indexing computes diffs by comparing the staging DB to the current DB—O(N) over all symbols/refs/calls. For repos with 500k+ symbols, this becomes a bottleneck.

Delta artifacts solve this by having CI emit the diff alongside the index:

┌──────────┐     ┌──────────────┐     ┌─────────────┐
│ CI Build │ ──► │ ckb diff     │ ──► │ delta.json  │
│ (scip)   │     │ (compare DBs)│     │ (manifest)  │
└──────────┘     └──────────────┘     └─────────────┘
                                             │
                                             ▼
┌──────────────┐     ┌─────────────────┐     ┌──────────────┐
│ CKB Server   │ ◄── │ POST /delta     │ ◄── │ CI Upload    │
│ (apply delta)│     │ /ingest         │     │ (artifact)   │
└──────────────┘     └─────────────────┘     └──────────────┘

Generating Delta Artifacts

Use ckb diff to generate a delta manifest:

# Compare two snapshot databases
ckb diff \
  --base /path/to/old-snapshot.db \
  --new /path/to/new-snapshot.db \
  --output delta.json

# Output: delta.json with changes

Delta JSON Schema

{
  "delta_schema_version": 1,
  "base_snapshot_id": "sha256:abc123...",
  "new_snapshot_id": "sha256:def456...",
  "commit": "def456789",
  "timestamp": 1703260800,
  "deltas": {
    "symbols": {
      "added": ["scip-go...NewFunc()."],
      "modified": ["scip-go...ChangedFunc()."],
      "deleted": ["scip-go...RemovedFunc()."]
    },
    "refs": {
      "added": [{"pk": "f_abc:42:12:scip-go...Foo().", "data": {...}}],
      "deleted": ["f_abc:50:5:scip-go...Old()."]
    },
    "callgraph": { "added": [...], "deleted": [...] },
    "files": { "added": [...], "modified": [...], "deleted": [...] }
  },
  "stats": { "total_added": 45, "total_modified": 12, "total_deleted": 8 }
}

Ingesting Delta Artifacts

Upload delta artifacts to CKB server via the API:

# Validate delta without applying
curl -X POST http://localhost:8080/delta/validate \
  -H "Content-Type: application/json" \
  -d @delta.json

# Ingest delta artifact
curl -X POST http://localhost:8080/delta/ingest \
  -H "Content-Type: application/json" \
  -d @delta.json

Server Validation

Before applying a delta, the server validates:

Schema version - delta_schema_version must be supported
Base snapshot - base_snapshot_id must match current active snapshot
Counts - Entity counts must match stats
Hashes - Spot-check hashes for modified entities
Integrity - Foreign key relationships must be valid

If validation fails, the server rejects the delta and requires a full snapshot.
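
The first three checks can be sketched against the delta JSON shape shown earlier. This is a hedged illustration, not the server's implementation: `validate_delta` is a hypothetical name, the set of supported versions is assumed, and the count check assumes each `stats.total_*` value is the sum of that operation across all entity types. Hash spot-checks and foreign-key integrity are left out.

```python
from typing import List

SUPPORTED_SCHEMA_VERSIONS = {1}  # assumption for this example


def validate_delta(delta: dict, active_snapshot_id: str) -> List[str]:
    """Return a list of validation errors; an empty list means the delta passes."""
    errors: List[str] = []

    if delta.get("delta_schema_version") not in SUPPORTED_SCHEMA_VERSIONS:
        errors.append("unsupported delta_schema_version")

    if delta.get("base_snapshot_id") != active_snapshot_id:
        errors.append("base_snapshot_id does not match active snapshot")

    # Count entries per operation across symbols, refs, callgraph, and files.
    totals = {"added": 0, "modified": 0, "deleted": 0}
    for entity in delta.get("deltas", {}).values():
        for op in totals:
            totals[op] += len(entity.get(op, []))

    stats = delta.get("stats", {})
    for op, counted in totals.items():
        declared = stats.get(f"total_{op}")
        if declared != counted:
            errors.append(f"stats.total_{op} is {declared}, counted {counted}")

    return errors
```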

Configuration

{
  "ingestion": {
    "deltaArtifacts": true,
    "deltaValidation": "strict",
    "fallbackToStagingDiff": true
  }
}

Setting	Default	Description
`deltaArtifacts`	true	Enable delta artifact ingestion
`deltaValidation`	`strict`	Validation mode: `strict` or `permissive`
`fallbackToStagingDiff`	true	Fall back to staging diff if delta fails

CI Integration Example (GitHub Actions)

name: Index and Upload Delta

on:
  push:
    branches: [main]

jobs:
  index:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

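      # Note: actions/download-artifact@v4 only looks in the current workflow run by
      # default; fetching a snapshot from an earlier run also needs the run-id and
      # github-token inputs.
      - name: Download previous snapshot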
        uses: actions/download-artifact@v4
        with:
          name: ckb-snapshot
          path: .ckb/
        continue-on-error: true

      - name: Run SCIP indexer
        run: ckb index

      - name: Generate delta
        run: |
          if [ -f .ckb/previous.db ]; then
            ckb diff --base .ckb/previous.db --new .ckb/ckb.db --output delta.json
          fi

      - name: Upload delta to CKB server
        if: hashFiles('delta.json') != ''
        run: |
          curl --fail -X POST "${{ secrets.CKB_SERVER_URL }}/delta/ingest" \
            -H "Authorization: Bearer ${{ secrets.CKB_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d @delta.json

      - name: Save snapshot for next run
        run: cp .ckb/ckb.db .ckb/previous.db

      - uses: actions/upload-artifact@v4
        with:
          name: ckb-snapshot
          path: .ckb/previous.db
          retention-days: 7

Performance Impact

Repo Size	Traditional Diff	Delta Artifact
10k symbols	50ms	5ms
100k symbols	500ms	10ms
500k symbols	5s	20ms

Delta artifacts shift the diff computation to CI (where it runs once per build) instead of the server (where it would run on every ingestion).

Related

CI-CD-Integration - Using incremental indexing in CI pipelines
User Guide - CLI commands including ckb index
Performance - Latency targets and benchmarks
Configuration - All configuration options
