Hybrid Retrieval (v7.4)

CKB v7.4 introduces hybrid retrieval, which combines graph-based ranking with traditional text search to improve search quality: in the measurements below, Recall@10 rises from 62.1% to 100% and MRR from 0.546 to 0.914.

Overview

Traditional code search relies on full-text search (FTS), which finds symbols by name but does not take code relationships into account. Hybrid retrieval adds Personalized PageRank (PPR) over the symbol graph to boost results that are structurally related to the symbols your query matched.

Results

Metric	Before	After	Improvement
Recall@10	62.1%	100%	+61%
MRR	0.546	0.914	+67%
Latency	29.4ms	29.0ms	~0%

Note: Improvement is the relative change from Before to After. Latency is essentially unchanged because PPR computation is cheap; the gain is in search quality, not speed.

How It Works

1. Initial Search (FTS)

When you search for a symbol, CKB first uses SQLite FTS5 for fast text matching:

Query: "Engine"
FTS Results: Engine, Engine#logger, Engine#config, EngineMock, ...
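The FTS step can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The single-column symbols table and the fts_prefix_search helper are hypothetical and chosen only for illustration; CKB's actual schema and query are not shown on this page. The sketch assumes the SQLite build includes FTS5.

```python
import sqlite3

def fts_prefix_search(names, query, limit=10):
    """Return symbol names with a token starting with `query`, best match first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical one-column index; not CKB's real schema.
    conn.execute("CREATE VIRTUAL TABLE symbols USING fts5(name)")
    conn.executemany("INSERT INTO symbols(name) VALUES (?)", [(n,) for n in names])
    # Quote the term and append * for a prefix match, so "Engine" also hits EngineMock.
    match = '"' + query.replace('"', '""') + '"*'
    rows = conn.execute(
        "SELECT name FROM symbols WHERE symbols MATCH ? ORDER BY rank LIMIT ?",
        (match, limit),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

hits = fts_prefix_search(
    ["Engine", "Engine#logger", "Engine#config", "EngineMock", "Orchestrator"],
    "Engine",
)
print(hits)
```

The default FTS5 tokenizer treats # as a separator, which is why Engine#logger matches a query for Engine in this sketch.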

2. Graph-Based Re-ranking (PPR)

CKB then builds a symbol graph from SCIP edges and runs Personalized PageRank:

Seeds: Top FTS hits (Engine, Engine#logger, ...)
Graph: Call edges, reference edges, type edges
PPR: Propagate importance through graph
Output: Re-ranked by graph proximity + FTS score

Seed Expansion

When FTS returns struct fields (e.g., Engine#logger), seed expansion automatically adds methods of the owning type to the seed set:

FTS seeds: Engine#logger, Engine#config, Engine#db
Expanded: + Engine#SearchSymbols(), Engine#GetCallGraph(), ...

This helps PPR discover cross-module dependencies through method calls, not just field references.
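Seed expansion might look roughly like the sketch below. It assumes symbol names follow the Type#member convention used in the examples above, with methods ending in (); the expand_seeds helper and the Orchestrator#Run() symbol are hypothetical.

```python
def expand_seeds(seeds, all_symbols):
    """Add methods of any type whose fields appear among the seeds."""
    def owner(sym):
        return sym.split("#", 1)[0]

    def is_method(sym):
        return "#" in sym and sym.endswith("()")

    # Types that contributed at least one field seed (assumed naming convention).
    owners = {owner(s) for s in seeds if "#" in s and not is_method(s)}

    expanded = list(seeds)
    seen = set(seeds)
    for sym in all_symbols:
        if is_method(sym) and owner(sym) in owners and sym not in seen:
            expanded.append(sym)
            seen.add(sym)
    return expanded

seeds = ["Engine#logger", "Engine#config", "Engine#db"]
index = seeds + ["Engine#SearchSymbols()", "Engine#GetCallGraph()", "Orchestrator#Run()"]
print(expand_seeds(seeds, index))
```

Only methods of Engine are added here; Orchestrator#Run() is left out because no Orchestrator field was among the seeds.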

3. Combined Scoring

The final ranking combines FTS position with PPR scores:

combined = 0.6 * position_score + 0.4 * ppr_score

Where:

position_score = 1 / (rank + 1) — Original FTS ranking bonus
ppr_score — Normalized PPR importance from graph traversal

This simple two-signal approach reaches 100% Recall@10 (see Results) without the complexity of multi-signal fusion.
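As a worked illustration of the formula, here is a minimal sketch. It makes two assumptions not stated above: rank is 0-based (so the top FTS hit gets a position_score of 1.0), and ppr_score is normalized by dividing by the largest PPR score. The combine_scores helper is hypothetical, and it only re-ranks symbols that FTS returned.

```python
def combine_scores(fts_ranked, ppr_scores, w_pos=0.6, w_ppr=0.4):
    """Re-rank FTS hits by 0.6 * position_score + 0.4 * ppr_score."""
    max_ppr = max(ppr_scores.values(), default=0.0)
    combined = {}
    for rank, symbol in enumerate(fts_ranked):  # rank is 0-based (assumption)
        position_score = 1.0 / (rank + 1)
        raw = ppr_scores.get(symbol, 0.0)
        # Max-normalization is an assumption; the page only says "normalized".
        ppr_score = raw / max_ppr if max_ppr > 0 else 0.0
        combined[symbol] = w_pos * position_score + w_ppr * ppr_score
    return sorted(combined, key=combined.get, reverse=True)

fts_ranked = ["EngineMock", "Engine", "Engine#logger"]
ppr = {"Engine": 0.5, "Engine#logger": 0.25, "EngineMock": 0.0}
print(combine_scores(fts_ranked, ppr))
```

In this example EngineMock scores 0.6 from position alone, while Engine scores 0.3 + 0.4 = 0.7 and moves to the top because it is the best-connected symbol in the graph.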

Eval Suite

CKB includes an evaluation framework to measure retrieval quality.

Running Eval

# Run built-in tests
ckb eval

# Custom fixtures
ckb eval --fixtures=./my-tests.json

# JSON output
ckb eval --format=json

Test Types

Needle tests - Find at least one expected symbol in top-K:

{
  "id": "find-engine",
  "type": "needle",
  "query": "Engine",
  "expectedSymbols": ["Engine", "query.Engine"],
  "topK": 10
}

Ranking tests - Verify expected symbol is highly ranked:

{
  "id": "engine-first",
  "type": "ranking",
  "query": "query engine",
  "expectedSymbols": ["Engine"],
  "topK": 3
}

Expansion tests - Check graph connectivity:

{
  "id": "engine-connects-backends",
  "type": "expansion",
  "query": "Engine",
  "expectedSymbols": ["Engine", "Orchestrator", "SCIPAdapter"],
  "topK": 20
}
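The pass criteria for these fixtures could be checked along the following lines. This is a sketch, not CKB's evaluator: the fixture_passes helper is hypothetical, needle and ranking tests are assumed to pass when at least one expected symbol is in the top-K, and expansion tests are assumed to require every expected symbol in the top-K.

```python
def fixture_passes(case, results):
    """Check one eval fixture against a ranked list of result symbols."""
    top = results[: case["topK"]]
    found = [s for s in case["expectedSymbols"] if s in top]
    if case["type"] == "expansion":
        # Assumption: connectivity means all expected symbols are reachable.
        return len(found) == len(case["expectedSymbols"])
    # needle and ranking: at least one expected symbol within top-K.
    return len(found) >= 1

needle = {
    "id": "find-engine",
    "type": "needle",
    "query": "Engine",
    "expectedSymbols": ["Engine", "query.Engine"],
    "topK": 10,
}
print(fixture_passes(needle, ["EngineMock", "Engine", "Engine#logger"]))
```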

Metrics

Recall@K - Percentage of tests where an expected symbol appears in the top-K results
MRR - Mean Reciprocal Rank: the average of 1/rank of the first expected symbol (higher = expected found earlier)
Latency - Average query time
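The two quality metrics can be computed as in the sketch below, which uses the standard definitions. The input shape (a list of ranked results paired with a set of expected symbols) and the function names are assumptions for illustration, not the format ckb eval uses internally.

```python
def recall_at_k(cases, k):
    """Fraction of cases with at least one expected symbol in the top k."""
    hits = sum(1 for results, expected in cases if any(s in expected for s in results[:k]))
    return hits / len(cases)

def mean_reciprocal_rank(cases):
    """Average of 1/rank of the first expected symbol (0 when none is found)."""
    total = 0.0
    for results, expected in cases:
        for index, symbol in enumerate(results):
            if symbol in expected:
                total += 1.0 / (index + 1)  # ranks are 1-based in MRR
                break
    return total / len(cases)

cases = [
    (["Engine", "EngineMock"], {"Engine"}),         # found at rank 1
    (["Foo", "Bar", "Orchestrator"], {"Orchestrator"}),  # found at rank 3
    (["A", "B"], {"Z"}),                            # not found
]
print(recall_at_k(cases, 2), mean_reciprocal_rank(cases))
```

With k = 2, only the first case counts as a hit, so Recall@2 is 1/3; MRR is (1 + 1/3 + 0) / 3.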

PPR Algorithm

Personalized PageRank computes importance scores relative to seed nodes.

Algorithm

Input:
  - seeds: FTS hit symbol IDs
  - graph: SCIP call/reference edges
  - damping: 0.85 (probability of following an edge instead of teleporting back to a seed)
  - iterations: 20 (max power iterations)

Process:
  1. Initialize scores: each of the n seeds gets 1/n, all other nodes get 0
  2. Iterate: score[i] = damping * Σ(edge_weight * score[neighbor])
                        + (1-damping) * teleport[i]
     (teleport[i] is 1/n for seeds and 0 otherwise)
  3. Stop when scores converge or the iteration limit is reached

Output:
  - Ranked nodes with scores
  - Backtracked paths explaining "why"
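The iteration above can be sketched as a short power-iteration loop. This is an illustrative implementation, not CKB's code: it assumes outgoing edge weights are normalized per source node, uses an L1 convergence tolerance of 1e-6, lets score leak at nodes without outgoing edges, and omits path backtracking. The graph format (source mapped to a list of (target, weight) pairs) and the Helper, Other, and Other2 nodes are hypothetical.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=20, tol=1e-6):
    """graph: {src: [(dst, weight), ...]}; returns {node: score}."""
    seeds = set(seeds)
    nodes = set(seeds)
    for src, outs in graph.items():
        nodes.add(src)
        nodes.update(dst for dst, _ in outs)

    # Teleport distribution: uniform over seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(teleport)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src, outs in graph.items():
            total = sum(w for _, w in outs)
            if total <= 0:
                continue
            for dst, w in outs:
                # Push score along each edge, proportional to its weight.
                nxt[dst] += damping * scores[src] * (w / total)
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores

graph = {
    "Engine": [("SearchSymbols", 1.0), ("Helper", 0.3)],
    "Other": [("Other2", 1.0)],
}
scores = personalized_pagerank(graph, ["Engine"])
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Nodes that cannot be reached from a seed (Other and Other2 here) keep a score of zero, which is what makes the ranking personalized to the query.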

Edge Weights

Edge Type	Weight	Meaning
Call	1.0	Function calls function
Definition	0.9	Reference to definition
Reference	0.8	General reference
Implements	0.7	Type implements interface
Type-of	0.6	Instance of type
Same-module	0.3	Co-located symbols
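One way to apply these weights when building the graph is sketched below. The lowercase edge-type keys, the (source, target, type) input tuples, and the rule of keeping the strongest weight when two symbols are linked by several edge types are all assumptions made for illustration.

```python
# Weights copied from the table above; key spelling is hypothetical.
EDGE_WEIGHTS = {
    "call": 1.0,
    "definition": 0.9,
    "reference": 0.8,
    "implements": 0.7,
    "type-of": 0.6,
    "same-module": 0.3,
}

def build_weighted_graph(typed_edges):
    """Turn (src, dst, edge_type) tuples into {src: [(dst, weight), ...]}."""
    best = {}
    for src, dst, kind in typed_edges:
        weight = EDGE_WEIGHTS[kind]
        # Assumption: keep only the strongest relation per (src, dst) pair.
        if weight > best.get((src, dst), 0.0):
            best[(src, dst)] = weight
    graph = {}
    for (src, dst), weight in best.items():
        graph.setdefault(src, []).append((dst, weight))
    return graph

edges = [
    ("Engine", "Orchestrator", "call"),
    ("Engine", "Orchestrator", "same-module"),
    ("SCIPAdapter", "Backend", "implements"),
]
print(build_weighted_graph(edges))
```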

Export Organizer

The exportForLLM tool now includes an organizer step that structures output for better LLM comprehension.

Before (v7.3)

## internal/query/
  ! engine.go
    $ Engine
    # SearchSymbols()
    # GetSymbol()
  ! symbols.go
    # rankSearchResults()

After (v7.4)

## Module Map

| Module | Symbols | Files | Key Exports |
|--------|---------|-------|-------------|
| internal/query | 150 | 12 | Engine, SearchSymbols |
| internal/backends | 80 | 8 | Orchestrator, SCIPAdapter |

## Cross-Module Connections

- internal/query → internal/backends
- internal/mcp → internal/query

## Module Details

### internal/query/

**engine.go**
  $ Engine
  # SearchSymbols() [c=12] ★★
  # GetSymbol() [c=5] ★

Benefits

Module Map - Overview of codebase structure at a glance
Cross-Module Connections - Key integration points highlighted
Importance Ordering - Most important symbols first
Context Efficiency - LLMs understand structure before details
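To make the Module Map step concrete, the sketch below groups symbol records by module and emits a table in the same shape as the v7.4 output. The record layout (module, file, symbol, importance), the importance values, the orchestrator.go file name, and the choice of two key exports per module are hypothetical; how exportForLLM actually ranks symbols is not described here.

```python
from collections import defaultdict

def module_map(symbols, top_n=2):
    """Render a Module Map table from (module, file, name, importance) records."""
    by_module = defaultdict(list)
    for module, file, name, importance in symbols:
        by_module[module].append((file, name, importance))

    lines = [
        "| Module | Symbols | Files | Key Exports |",
        "|--------|---------|-------|-------------|",
    ]
    # Largest modules first; ties keep input order because sorted() is stable.
    for module in sorted(by_module, key=lambda m: len(by_module[m]), reverse=True):
        entries = by_module[module]
        files = {f for f, _, _ in entries}
        ranked = sorted(entries, key=lambda e: e[2], reverse=True)
        exports = ", ".join(name for _, name, _ in ranked[:top_n])
        lines.append(f"| {module} | {len(entries)} | {len(files)} | {exports} |")
    return "\n".join(lines)

records = [
    ("internal/query", "engine.go", "Engine", 0.9),
    ("internal/query", "engine.go", "SearchSymbols", 0.8),
    ("internal/query", "symbols.go", "rankSearchResults", 0.2),
    ("internal/backends", "orchestrator.go", "Orchestrator", 0.7),
]
print(module_map(records))
```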

Configuration

No configuration required. Hybrid retrieval is automatic when:

SCIP index is available (ckb index was run)
Search returns more than 3 results
Symbol graph has nodes
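Read together, these conditions amount to a simple gate. The sketch below assumes all three must hold at once and also honors the enablePPR flag described under Disabling PPR; the should_use_hybrid function and its parameter names are hypothetical.

```python
def should_use_hybrid(has_scip_index, fts_result_count, graph_node_count, enable_ppr=True):
    """Return True when PPR re-ranking would apply (assumes all conditions are required)."""
    return bool(
        enable_ppr
        and has_scip_index
        and fts_result_count > 3   # "more than 3 results"
        and graph_node_count > 0   # symbol graph is not empty
    )

print(should_use_hybrid(True, 12, 5000))   # True
print(should_use_hybrid(True, 3, 5000))    # False: exactly 3 results is not enough
```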

Disabling PPR

If you need to disable PPR re-ranking (not recommended), set enablePPR to false under queryPolicy in .ckb/config.json:

{
  "queryPolicy": {
    "enablePPR": false
  }
}

Research Basis

Hybrid retrieval is based on 2024-2025 research:

Paper	Key Insight
HippoRAG 2 (ICML 2025)	PPR over knowledge graphs improves associative retrieval
CodeRAG (Sep 2025)	Multi-path retrieval + reranking beats single-path
GraphCoder (Jun 2024)	Code context graphs for repo-level retrieval
GraphRAG surveys	Explicit organizer step improves context packing

What's NOT Included

Per CKB's "structured over semantic" principle:

Feature	Why Skipped
Embeddings	Adds complexity, PPR sufficient for code navigation
Learned reranker	Deterministic scoring works well
External vector DB	Violates single-binary principle

Troubleshooting

Low Recall@K

Index freshness - Run ckb index to rebuild
FTS population - Check ckb status for FTS symbol count
Query specificity - More specific queries work better

Slow Queries

Graph size - Very large codebases may need graph pruning
PPR iterations - Default 20 is usually sufficient
Cache - Subsequent queries benefit from caching

Debugging

# Check index status
ckb status

# Run diagnostics
ckb doctor

# Verbose eval output
ckb eval --verbose

Related Pages

Architecture - System design overview
Performance - Query latency and caching
API-Reference - Search and query tools
Practical-Limits - Accuracy and limitations