
Hybrid Retrieval (v7.4)

CKB v7.4 introduces hybrid retrieval, which combines graph-based ranking with traditional full-text search. In the results below, Recall@10 rises from 62.1% to 100% and MRR from 0.546 to 0.914, with no meaningful change in latency.

Overview

Traditional code search relies on full-text search (FTS), which finds symbols by name but knows nothing about how those symbols relate to each other in code. Hybrid retrieval adds Personalized PageRank (PPR) over the symbol graph to boost results that are structurally related to the top text matches for your query.

Results

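| Metric | Before | After | Improvement (relative) |
|--------|--------|-------|------------------------|
| Recall@10 | 62.1% | 100% | +61% |
| MRR | 0.546 | 0.914 | +67% |
| Latency | 29.4 ms | 29.0 ms | ~0% |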

Note: Latency remains similar because PPR computation is cheap. The improvement is in search quality, not speed.

How It Works

1. Initial Search (FTS)

When you search for a symbol, CKB first uses SQLite FTS5 for fast text matching:

Query: "Engine"
FTS Results: Engine, Engine#logger, Engine#config, EngineMock, ...

2. Graph-Based Re-ranking (PPR)

CKB then builds a symbol graph from SCIP edges and runs Personalized PageRank:

Seeds: Top FTS hits (Engine, Engine#logger, ...)
Graph: Call edges, reference edges, type edges
PPR: Propagate importance through graph
Output: Re-ranked by graph proximity + FTS score

Seed Expansion

When FTS returns struct fields (e.g., Engine#logger), seed expansion automatically includes related methods:

FTS seeds: Engine#logger, Engine#config, Engine#db
Expanded: + Engine#SearchSymbols(), Engine#GetCallGraph(), ...

This helps PPR discover cross-module dependencies through method calls, not just field references.
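
The expansion step can be pictured with a minimal sketch. It assumes the `Container#member` naming shown above and treats a trailing `()` as the marker for a method. The function name `expand_seeds` and the flat symbol list are hypothetical, not CKB's actual API.

```python
def expand_seeds(seeds, all_symbols):
    """Add methods that share a container with any field seed.

    Assumption: symbols look like 'Container#member' and methods end in '()'.
    """
    # Containers that appear among the seeds via a field (non-method) member.
    containers = {
        s.split("#", 1)[0]
        for s in seeds
        if "#" in s and not s.endswith("()")
    }
    expanded = list(seeds)
    seen = set(seeds)
    for sym in all_symbols:
        # Only methods are candidates for expansion.
        if "#" not in sym or not sym.endswith("()"):
            continue
        if sym.split("#", 1)[0] in containers and sym not in seen:
            expanded.append(sym)
            seen.add(sym)
    return expanded


seeds = ["Engine#logger", "Engine#config", "Engine#db"]
symbols = seeds + [
    "Engine#SearchSymbols()",
    "Engine#GetCallGraph()",
    "Cache#Get()",  # different container, so it is not added
]
expanded = expand_seeds(seeds, symbols)
print(expanded)
```

The methods of `Engine` join the seed set, while `Cache#Get()` is left out because no seed belongs to `Cache`.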

3. Combined Scoring

The final ranking combines FTS position with PPR scores:

combined = 0.6 * position_score + 0.4 * ppr_score

Where:

  • position_score = 1 / (rank + 1): bonus from the original FTS ranking
  • ppr_score: normalized PPR importance from graph traversal

This simple weighted sum reaches 100% Recall@10 in the results above without the complexity of multi-signal fusion.
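
As a worked illustration of the formula, the sketch below re-ranks three FTS hits. It makes two assumptions the text does not confirm: `rank` is zero-based, and PPR scores are normalized by dividing by the largest score. The function name and the sample PPR values are made up for the example.

```python
def combined_scores(fts_ranked, ppr_scores, w_pos=0.6, w_ppr=0.4):
    """Blend FTS position with PPR importance and sort best-first.

    Assumptions: rank starts at 0; PPR is normalized by its maximum.
    """
    max_ppr = max(ppr_scores.values(), default=0.0)
    results = []
    for rank, sym in enumerate(fts_ranked):
        position_score = 1.0 / (rank + 1)
        ppr = ppr_scores.get(sym, 0.0) / max_ppr if max_ppr > 0 else 0.0
        results.append((sym, w_pos * position_score + w_ppr * ppr))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


fts_ranked = ["EngineMock", "Engine", "Engine#logger"]
ppr_scores = {"Engine": 0.5, "Engine#logger": 0.2, "EngineMock": 0.05}
ranked = combined_scores(fts_ranked, ppr_scores)
for sym, score in ranked:
    print(f"{sym}: {score:.2f}")
```

Here `Engine` overtakes `EngineMock` (0.70 against 0.64) even though FTS ranked it second, because its graph importance outweighs the position bonus.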

Eval Suite

CKB includes an evaluation framework to measure retrieval quality.

Running Eval

# Run built-in tests
ckb eval

# Custom fixtures
ckb eval --fixtures=./my-tests.json

# JSON output
ckb eval --format=json

Test Types

Needle tests - Find at least one expected symbol in top-K:

{
  "id": "find-engine",
  "type": "needle",
  "query": "Engine",
  "expectedSymbols": ["Engine", "query.Engine"],
  "topK": 10
}

Ranking tests - Verify expected symbol is highly ranked:

{
  "id": "engine-first",
  "type": "ranking",
  "query": "query engine",
  "expectedSymbols": ["Engine"],
  "topK": 3
}

Expansion tests - Check graph connectivity:

{
  "id": "engine-connects-backends",
  "type": "expansion",
  "query": "Engine",
  "expectedSymbols": ["Engine", "Orchestrator", "SCIPAdapter"],
  "topK": 20
}

Metrics

  • Recall@K - Percentage of tests in which an expected symbol appears in the top K results
  • MRR - Mean Reciprocal Rank: the average of 1/rank of the first expected symbol (higher means it was found earlier)
  • Latency - Average query time
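
The first two metrics can be computed as in this minimal sketch. It follows the standard definitions rather than CKB's eval code. Each case is a hypothetical `(results, expected)` pair: the ranked result list and the set of acceptable symbols.

```python
def reciprocal_rank(results, expected):
    """Return 1/rank of the first expected symbol, or 0.0 if none is found."""
    for position, sym in enumerate(results, start=1):
        if sym in expected:
            return 1.0 / position
    return 0.0


def recall_at_k(cases, k):
    """Fraction of cases with at least one expected symbol in the top k."""
    hits = sum(
        1 for results, expected in cases
        if any(sym in expected for sym in results[:k])
    )
    return hits / len(cases)


def mean_reciprocal_rank(cases):
    """Average reciprocal rank over all cases."""
    return sum(reciprocal_rank(r, e) for r, e in cases) / len(cases)


cases = [
    (["Engine", "EngineMock"], {"Engine", "query.Engine"}),  # found at rank 1
    (["EngineMock", "Engine"], {"Engine"}),                  # found at rank 2
    (["Foo", "Bar"], {"Engine"}),                            # not found
]
recall = recall_at_k(cases, 10)
mrr = mean_reciprocal_rank(cases)
print(f"Recall@10 = {recall:.3f}, MRR = {mrr:.3f}")
```

For these three cases, Recall@10 is 2/3 and MRR is (1 + 0.5 + 0) / 3 = 0.5.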

PPR Algorithm

Personalized PageRank computes importance scores relative to seed nodes.

Algorithm

Input:
  - seeds: FTS hit symbol IDs
  - graph: SCIP call/reference edges
  - damping: 0.85 (probability of following edge)
  - iterations: 20 (max power iterations)

Process:
  1. Initialize scores: seeds get 1/n, others get 0
  2. Iterate: score[i] = damping * Σ(edge_weight * score[neighbor])
                        + (1-damping) * teleport[i]
  3. Stop when converged or max iterations

Output:
  - Ranked nodes with scores
  - Backtracked paths explaining "why"
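
The process above can be sketched as a short power iteration. This is not CKB's implementation, and it makes several assumptions: edges are directed, each node's outgoing weights are normalized to sum to 1, the teleport vector is uniform over the seeds, and convergence is measured by the L1 change between iterations. Path backtracking is omitted.

```python
def personalized_pagerank(edges, seeds, damping=0.85, max_iter=20, tol=1e-6):
    """Power-iteration PPR. edges is a list of (src, dst, weight) triples."""
    seeds = set(seeds)
    nodes = set(seeds)
    out_weight = {}
    for src, dst, w in edges:
        nodes.update((src, dst))
        out_weight[src] = out_weight.get(src, 0.0) + w

    # Teleport distribution: uniform over seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(teleport)

    for _ in range(max_iter):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for src, dst, w in edges:
            # Push a weighted share of src's score along each outgoing edge.
            nxt[dst] += damping * score[src] * w / out_weight[src]
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return score


edges = [
    ("Engine", "Orchestrator", 1.0),       # call edge
    ("Orchestrator", "SCIPAdapter", 1.0),  # call edge
    ("Engine", "EngineMock", 0.3),         # same-module edge
]
scores = personalized_pagerank(edges, seeds={"Engine"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph, `Orchestrator` and `SCIPAdapter` outrank `EngineMock` because they are reached through full-weight call edges, while `EngineMock` is tied to the seed only by a weak same-module edge. Nodes with no outgoing edges simply drop their score in this sketch, so the scores do not sum to 1.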

Edge Weights

| Edge Type | Weight | Meaning |
|-----------|--------|---------|
| Call | 1.0 | Function calls function |
| Definition | 0.9 | Reference to definition |
| Reference | 0.8 | General reference |
| Implements | 0.7 | Type implements interface |
| Type-of | 0.6 | Instance of type |
| Same-module | 0.3 | Co-located symbols |
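
One way to picture how these weights enter the graph is a lookup applied to typed edges before PPR runs. The weights below are taken from the table. The key spellings, the `(src, dst, kind)` tuple shape and the function name are assumptions made for illustration.

```python
# Weights from the table above; the lowercase keys are a hypothetical encoding.
EDGE_WEIGHTS = {
    "call": 1.0,
    "definition": 0.9,
    "reference": 0.8,
    "implements": 0.7,
    "type-of": 0.6,
    "same-module": 0.3,
}


def weight_edges(typed_edges):
    """Convert (src, dst, kind) triples into (src, dst, weight) triples."""
    return [(src, dst, EDGE_WEIGHTS[kind]) for src, dst, kind in typed_edges]


typed_edges = [
    ("Engine", "Orchestrator", "call"),
    ("SCIPAdapter", "Backend", "implements"),
    ("Engine", "EngineMock", "same-module"),
]
weighted = weight_edges(typed_edges)
print(weighted)
```

An unknown edge kind raises `KeyError` here, which keeps typos in edge labels from silently becoming zero-weight edges.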

Export Organizer

The exportForLLM tool now includes an organizer step that structures output for better LLM comprehension.

Before (v7.3)

## internal/query/
  ! engine.go
    $ Engine
    # SearchSymbols()
    # GetSymbol()
  ! symbols.go
    # rankSearchResults()

After (v7.4)

## Module Map

| Module | Symbols | Files | Key Exports |
|--------|---------|-------|-------------|
| internal/query | 150 | 12 | Engine, SearchSymbols |
| internal/backends | 80 | 8 | Orchestrator, SCIPAdapter |

## Cross-Module Connections

- internal/query → internal/backends
- internal/mcp → internal/query

## Module Details

### internal/query/

**engine.go**
  $ Engine
  # SearchSymbols() [c=12] ★★
  # GetSymbol() [c=5] ★

Benefits

  • Module Map - Overview of codebase structure at a glance
  • Cross-Module Connections - Key integration points highlighted
  • Importance Ordering - Most important symbols first
  • Context Efficiency - LLMs understand structure before details

Configuration

No configuration required. Hybrid retrieval is automatic when:

  1. SCIP index is available (ckb index was run)
  2. Search returns more than 3 results
  3. Symbol graph has nodes
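
Read as a single gate, the conditions above might look like the sketch below. It assumes all three must hold at once and folds in the `enablePPR` setting described under Disabling PPR. The function and parameter names are hypothetical.

```python
def should_rerank_with_ppr(has_scip_index, fts_result_count,
                           graph_node_count, enable_ppr=True):
    """Return True when PPR re-ranking would apply.

    Assumption: every listed condition is required, not just one of them.
    """
    return (
        enable_ppr
        and has_scip_index
        and fts_result_count > 3   # "more than 3 results"
        and graph_node_count > 0   # "symbol graph has nodes"
    )


print(should_rerank_with_ppr(True, 5, 100))
print(should_rerank_with_ppr(True, 3, 100))
```

With exactly three FTS results the gate stays closed, so such a query would keep its plain FTS order.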

Disabling PPR

If you need to disable PPR re-ranking (not recommended):

// .ckb/config.json
{
  "queryPolicy": {
    "enablePPR": false
  }
}

Research Basis

Hybrid retrieval is based on 2024-2025 research:

| Paper | Key Insight |
|-------|-------------|
| HippoRAG 2 (ICML 2025) | PPR over knowledge graphs improves associative retrieval |
| CodeRAG (Sep 2025) | Multi-path retrieval + reranking beats single-path |
| GraphCoder (Jun 2024) | Code context graphs for repo-level retrieval |
| GraphRAG surveys | Explicit organizer step improves context packing |

What's NOT Included

Per CKB's "structured over semantic" principle:

| Feature | Why Skipped |
|---------|-------------|
| Embeddings | Adds complexity, PPR sufficient for code navigation |
| Learned reranker | Deterministic scoring works well |
| External vector DB | Violates single-binary principle |

Troubleshooting

Low Recall@K

  1. Index freshness - Run ckb index to rebuild
  2. FTS population - Check ckb status for FTS symbol count
  3. Query specificity - More specific queries work better

Slow Queries

  1. Graph size - Very large codebases may need graph pruning
  2. PPR iterations - Default 20 is usually sufficient
  3. Cache - Subsequent queries benefit from caching

Debugging

# Check index status
ckb status

# Run diagnostics
ckb doctor

# Verbose eval output
ckb eval --verbose