Learning Objectives
- Understand what code search means in the context of AI coding agents and why token efficiency matters
- Identify the engineering pattern Semble uses (chunking + dual retrieval + reranking) and why it produces dramatic token savings
- Evaluate when Semble is the right choice versus grep, ripgrep, traditional embedding-based search, or proprietary alternatives
What Is Semble?
Semble is an open-source code search library built by MinishLab, a two-person non-profit NLP lab. The pitch is narrow and concrete: when an AI coding agent needs to find relevant code inside a repository, the conventional grep + read pipeline — search for keywords, then read the matching files into the model's context window — burns enormous amounts of tokens. Semble replaces that pipeline with a code-aware retrieval index that returns just the relevant snippets, slashing token cost without sacrificing recall.
The headline benchmark: Semble hits 94% recall at just 2,000 tokens on the project's evaluation set, while a baseline grep-plus-read pipeline needs a full 100,000-token context window to reach 85% recall on the same queries. That's roughly 50-times fewer tokens for higher recall — a structural change in how an agent reads a codebase, not a marginal optimization.
💡Key Concept
Why this matters for agents: Frontier-model agents like Claude Code, Cursor, and ChatGPT Codex spend a meaningful fraction of every coding session re-reading the same files over and over. With a 200K-token context window and per-million-token pricing, an agent that loads 80,000 tokens of repository context on every turn becomes expensive fast. Semble lets the agent retrieve only the specific lines it needs — typically a few hundred tokens — for the same job.
✅Tip
Visit Semble: github.com/MinishLab/semble — Apache-2.0 licensed, install via pip install semble. Latest release v0.1.7 shipped May 12, 2026.
Pricing
- Apache 2.0 license
- Self-host on your hardware
- No usage caps
- Full feature parity with future releases
Semble is fully open-source under the Apache 2.0 license. There is no hosted SaaS offering — the entire library runs locally inside your AI agent's process. The only cost is whatever compute you use for indexing and query embeddings, both of which run on CPU at production speeds.
How Semble Works
Semble chains four stages into a single ranked retrieval pipeline:
1. Code-Aware Chunking via Tree-sitter
Rather than splitting source files into fixed-size character windows (the naive approach), Semble uses tree-sitter to parse each file into its abstract syntax tree, then chunks at semantically meaningful boundaries — function definitions, class declarations, method bodies, top-level statements. The resulting chunks are coherent units of code that mean something on their own, not arbitrary slices.
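To make the idea of syntax-boundary chunking concrete, here is a minimal sketch. It is not Semble's implementation: it swaps in Python's standard-library `ast` module for tree-sitter so that it runs without extra dependencies, and it only handles Python source. The function name `chunk_source` and the chunk dictionary layout are illustrative assumptions.

```python
import ast

def chunk_source(source: str) -> list[dict]:
    """Split Python source into chunks at top-level statement boundaries.

    Stand-in for tree-sitter: the stdlib `ast` module plays the parser role here.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start at the earliest one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        chunks.append({
            "kind": type(node).__name__,        # e.g. FunctionDef, ClassDef, Import
            "name": getattr(node, "name", None),
            "start_line": start,
            "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks

SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_source(SAMPLE)
for c in chunks:
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

Each chunk is a whole function, class, or top-level statement, so a retrieved chunk can be read on its own. A fixed-size character window over the same file could cut `load` in half.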
2. Static Semantic Embeddings via Model2Vec
Each chunk is embedded using potion-code-16M, a MinishLab static embedding model tuned for source code. Static embeddings are dramatically faster than transformer-based embeddings (sentence-transformers, OpenAI text-embedding-3, etc.) — Semble indexes a typical repository in roughly 250 milliseconds and queries return in around 1.5 milliseconds. The quality trade-off is small: Semble's NDCG-at-10 score on the project benchmark is 0.854, which the README claims is roughly 99% of transformer-quality retrieval at a tiny fraction of the cost.
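The speed difference comes from what a static embedding does at inference time: a table lookup per token followed by pooling, with no transformer forward pass. The sketch below illustrates that mechanism only. The three-dimensional vectors in `TOKEN_VECTORS` are made up for the example and are not potion-code-16M weights, and the simple regex tokenizer is an assumption rather than the real model's tokenizer.

```python
import math
import re

# Toy lookup table: hypothetical 3-d vectors, NOT real potion-code-16M weights.
TOKEN_VECTORS = {
    "auth": [0.9, 0.1, 0.0],
    "login": [0.8, 0.2, 0.0],
    "token": [0.7, 0.0, 0.3],
    "parse": [0.0, 0.9, 0.1],
    "json": [0.1, 0.8, 0.2],
}
DIM = 3

def embed(text: str) -> list[float]:
    """Static embedding: look up each known token and mean-pool the vectors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOKEN_VECTORS[t] for t in tokens if t in TOKEN_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = embed("login auth")
chunk_a = embed("def check_token(auth): ...")
chunk_b = embed("def parse_json(raw): ...")

print("auth chunk:", round(cosine(query, chunk_a), 3))
print("json chunk:", round(cosine(query, chunk_b), 3))
```

The authentication chunk scores higher than the JSON chunk even though neither contains the word "login", which is the fuzzy-intent matching the semantic stage is meant to provide.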
3. Lexical Retrieval via BM25
In parallel with the semantic search, Semble runs a BM25 keyword search with identifier stemming — catching exact symbol matches (function names, variable names) that semantic search alone might miss. This is the same trick mature retrieval systems use: semantic captures fuzzy intent, lexical captures exact symbols, and combining them does better than either alone.
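A rough sketch of the lexical side, under stated assumptions: the scoring is the textbook Okapi BM25 formula with the common defaults `k1=1.5` and `b=0.75`, and "identifier stemming" is approximated by splitting snake_case and camelCase names into lowercase words. Semble's actual tokenization and parameters may differ; `split_identifiers` and `bm25_scores` are hypothetical names.

```python
import math
import re
from collections import Counter

def split_identifiers(text: str) -> list[str]:
    """Break snake_case and camelCase identifiers into lowercase word tokens."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):  # underscores/digits act as separators
        # "parseUserToken" -> parse, User, Token; "HTTPServer" -> HTTP, Server
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        tokens.extend(p.lower() for p in parts)
    return tokens

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [split_identifiers(d) for d in docs]
    avg_len = (sum(len(t) for t in tokenized) / len(tokenized)) or 1.0
    n_docs = len(docs)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = split_identifiers(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "def parseUserToken(raw): return decode(raw)",
    "def render_page(template): return template.format()",
    "class UserSession: token = None",
]
scores = bm25_scores("parse_user_token", docs)
print([round(s, 3) for s in scores])
```

Because identifiers are split before scoring, the snake_case query `parse_user_token` still matches the camelCase definition `parseUserToken`, which a plain whole-word keyword match would miss.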
4. Reciprocal-Rank Fusion + Reranking
Semantic and lexical results are merged via reciprocal-rank fusion, then reranked using signals like definition boosts (a function definition outranks a function call in most cases), file coherence (chunks from the same file get a small group bonus), and noise penalties (auto-generated files, test fixtures, vendored dependencies get downweighted).
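The fusion and reranking step can be sketched as follows. Reciprocal-rank fusion is shown in its standard form, where each list contributes `1 / (k + rank)` per item. The constant `k=60` is the conventional default, and the boost and penalty multipliers, the metadata flags, and the chunk IDs are all illustrative assumptions rather than Semble's actual values. The file-coherence bonus is left out to keep the sketch short.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Merge ranked lists: each list adds 1 / (k + rank) to an item's score."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return fused

def rerank(fused: dict[str, float], metadata: dict[str, dict],
           definition_boost: float = 1.2, noise_penalty: float = 0.5) -> list[str]:
    """Apply hypothetical multiplicative signals, then sort best-first."""
    adjusted = {}
    for chunk_id, score in fused.items():
        meta = metadata.get(chunk_id, {})
        if meta.get("is_definition"):   # definitions outrank call sites
            score *= definition_boost
        if meta.get("is_noise"):        # fixtures, generated or vendored code
            score *= noise_penalty
        adjusted[chunk_id] = score
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Hypothetical result lists from the semantic and lexical retrievers.
semantic = ["auth.py:login", "tests/fixture.py:login", "views.py:call_login"]
lexical = ["tests/fixture.py:login", "auth.py:login", "utils.py:hash"]
metadata = {
    "auth.py:login": {"is_definition": True},
    "tests/fixture.py:login": {"is_noise": True},
}

fused = reciprocal_rank_fusion([semantic, lexical])
ranked = rerank(fused, metadata)
print(ranked)
```

In this toy input the definition and the test fixture tie after fusion, since each is ranked first in one list and second in the other. The reranking signals break the tie in favour of the definition.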
💡Key Concept
Hybrid retrieval is the standard for production search systems — neither pure embedding similarity nor pure keyword matching is enough on its own. Semble's contribution is making the hybrid stack fast enough to run inside an agent's tool-call loop without measurable latency overhead.
Benchmarks
| Metric | Semble | grep + read baseline | Notes |
|---|---|---|---|
| Recall at 2,000 tokens | 94% | — | Same query set; grep + read scores lower at 2K tokens (figure not given) |
| Recall at 100,000 tokens | — | 85% | Full context window for grep+read |
| Token efficiency | ~50-times fewer | Baseline | At equivalent or higher recall |
| NDCG at 10 | 0.854 | — | Roughly 99% of transformer-quality retrieval |
| Index time (typical repo) | ~250 ms | N/A | Static embeddings on CPU |
| Query latency | ~1.5 ms | N/A | Single-query, single-threaded |
| Indexing speed vs CodeRankEmbed | 218-times faster | — | CodeRankEmbed is a transformer-based code-embedding baseline |
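For readers unfamiliar with the NDCG at 10 row, the metric rewards placing relevant results near the top of the list. The sketch below computes it in the usual way, with one simplification worth flagging: the ideal ordering is derived from the retrieved list only, whereas a full evaluation would use every relevant item in the corpus. The relevance lists are invented for illustration and are not taken from Semble's benchmark.

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# 1 = relevant chunk, 0 = irrelevant chunk, in the order the retriever returned them.
perfect = ndcg_at_k([1, 1, 0, 0])
swapped = ndcg_at_k([0, 1, 1, 0])
print(round(perfect, 3), round(swapped, 3))
```

A score of 1.0 means every relevant chunk was ranked ahead of every irrelevant one. Pushing the relevant chunks down by a single position already lowers the score noticeably.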
The headline number (98% fewer tokens) is the marketing line; the more useful number for agent budgeting is the per-query token cost, which drops from tens of thousands of tokens for grep-plus-read into a few hundred tokens for Semble retrieval. For a coding agent that runs 50-plus tool calls per session, that's an order-of-magnitude reduction in input-token spend.
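The two headline figures describe the same ratio, and a quick calculation shows how they translate into a session budget. The 2,000 and 100,000 token budgets come from the benchmark above. The per-call figures in the second half are hypothetical values chosen to match "tens of thousands" and "a few hundred", not measurements.

```python
# Benchmark budgets quoted in the article.
SEMBLE_TOKENS = 2_000       # budget at which Semble reaches 94% recall
BASELINE_TOKENS = 100_000   # budget at which grep + read reaches 85% recall

ratio = BASELINE_TOKENS / SEMBLE_TOKENS         # how many times fewer tokens
savings = 1 - SEMBLE_TOKENS / BASELINE_TOKENS   # fraction of tokens saved

print(f"{ratio:.0f}x fewer tokens, {savings:.0%} saved")

# Hypothetical session: the per-call costs below are illustrative assumptions.
CALLS_PER_SESSION = 50
GREP_READ_TOKENS_PER_CALL = 20_000
SEMBLE_TOKENS_PER_CALL = 500

grep_total = CALLS_PER_SESSION * GREP_READ_TOKENS_PER_CALL
semble_total = CALLS_PER_SESSION * SEMBLE_TOKENS_PER_CALL

print(f"grep + read: {grep_total:,} input tokens per session")
print(f"Semble:      {semble_total:,} input tokens per session")
```

Multiplying either total by your model's input-token price gives the per-session cost difference.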
Best Use Cases
Semble is purpose-built for AI coding agents — every design choice optimizes for the specific shape of agent workloads. Use it when:
- Building an agentic coding tool — Claude Agent SDK, custom OpenAI Agents SDK pipelines, or in-house assistants that need to search a private monorepo without burning context
- Replacing grep-plus-read inside an existing agent loop — drop-in replacement that preserves recall while collapsing per-query token spend
- Running large-scale code analysis — bulk repository scanning where embedding-transformer cost would dominate the run
- Self-hosting code search for compliance reasons — no external API calls, no data leaves your infrastructure
When to choose alternatives:
- For interactive human search — GitHub Copilot's native search, JetBrains AI Assistant, or Cursor's built-in indexer give a smoother UX with editor integration. Semble is a library, not an IDE feature.
- For natural-language Q&A over code — Sourcegraph Cody and similar systems combine retrieval with generation in one product. Semble is just the retrieval half.
- For tiny codebases (under a few thousand lines) — plain ripgrep is faster and simpler. Semble's payoff scales with repository size and query volume.
Limitations and Considerations
- No hosted offering. Semble is a library — you embed it in your own agent or service. There's no managed SaaS API endpoint to call.
- Indexing is offline. Semble currently re-indexes on demand or on a schedule; real-time incremental updates as developers edit files are not yet a built-in feature.
- Single language at the indexer level. Tree-sitter grammars are language-specific, so the indexer must be configured per language. Most common languages (Python, JavaScript/TypeScript, Go, Rust, Java, C/C++) are supported out of the box.
- Static embeddings have a quality ceiling. Roughly 99% of transformer-quality retrieval (as measured by NDCG at 10) is excellent, but the remaining 1% may matter for very specific queries that depend on deep code semantics. Worth A/B-testing against a transformer baseline if your use case is recall-critical.
- Early-stage project. Latest release at the time of writing is v0.1.7 (May 12, 2026). API stability and long-term maintenance are open questions, though MinishLab's track record on Model2Vec (over 2,000 GitHub stars, 4 million-plus downloads on Hugging Face) suggests credible ongoing investment.
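Since indexing is offline, one way to decide when a rebuild is due is to compare content hashes of the source files between runs. The sketch below is a generic approach using only the standard library. It is not a Semble feature, and `snapshot` and `changed_files` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root: str, suffixes: tuple[str, ...] = (".py",)) -> dict[str, str]:
    """Map each matching file under root to a SHA-256 digest of its contents."""
    digests = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests

def changed_files(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Files that were added, removed, or modified between two snapshots."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}

# Demo on a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "a.py").write_text("x = 1\n")
    before = snapshot(repo)
    (Path(repo) / "a.py").write_text("x = 2\n")   # modified file
    (Path(repo) / "b.py").write_text("y = 1\n")   # new file
    after = snapshot(repo)
    stale = changed_files(before, after)

print(sorted(stale))
```

A scheduler could trigger a re-index whenever `stale` is non-empty, or only once it grows past a threshold that suits the repository's churn rate.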
Strengths
- Token efficiency at production speed: Roughly 98% fewer tokens than grep-plus-read at higher recall — a structural cost shift for agent workloads, not a marginal optimization
- Sub-2-millisecond query latency: Runs inside an agent's tool-call loop with no perceptible overhead, on commodity CPU hardware
- Hybrid retrieval out of the box: Tree-sitter chunking + static embeddings + BM25 + reciprocal-rank fusion + reranking — production-grade search pipeline as a single library
- Apache 2.0 license: Commercially usable, self-hostable, no usage caps, no API rate limits
- Static embeddings via potion-code-16M: 218-times faster indexing than transformer-based code-embedding models at roughly 99% of the quality
- MinishLab's broader portfolio: Same lab also publishes Model2Vec, SemHash, and Vicinity — a coherent stack of fast, efficient retrieval infrastructure
Getting Started
- Install the library: `pip install semble`
- Point the indexer at your repository: `semble index /path/to/repo` (completes in roughly 250 milliseconds for a typical repo)
- Query from your agent's tool-call loop: `semble.search("how does authentication work?", top_k=10)` returns the top 10 most relevant code chunks, typically a few hundred tokens total
- Wire the returned chunks into your prompt instead of `read_file` calls; this is where the token savings come from
- For production use, persist the index to disk and reload it; rebuild on a schedule that matches your repo's churn rate
For a working example, see the examples/ directory in the Semble repository — there's a reference integration with the Anthropic Python SDK demonstrating the full agent loop.
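As a rough illustration of the prompt-wiring step, the sketch below formats retrieved chunks into a context block under a token budget. It deliberately does not call Semble: `fake_search` is a stand-in for whatever retrieval call your agent uses. The chunk fields (`path`, `start_line`, `text`) and the four-characters-per-token estimate are assumptions made for the example.

```python
from typing import Callable

def build_context(query: str, search: Callable[[str, int], list[dict]],
                  top_k: int = 10, max_tokens: int = 2000) -> str:
    """Format retrieved chunks into a prompt section, stopping at a rough token budget."""
    sections, used = [], 0
    for chunk in search(query, top_k):
        cost = len(chunk["text"]) // 4 + 1   # crude estimate: about 4 characters per token
        if used + cost > max_tokens:
            break
        sections.append(f"# {chunk['path']}:{chunk['start_line']}\n{chunk['text']}")
        used += cost
    return "\n\n".join(sections)

def fake_search(query: str, top_k: int) -> list[dict]:
    """Hypothetical stand-in for the real retrieval call; returns canned chunks."""
    results = [
        {"path": "auth/login.py", "start_line": 12,
         "text": "def login(user, password):\n    ..."},
        {"path": "auth/session.py", "start_line": 3,
         "text": "class Session:\n    ..."},
    ]
    return results[:top_k]

question = "how does authentication work?"
context = build_context(question, fake_search)
prompt = f"Relevant code:\n\n{context}\n\nQuestion: {question}"
print(prompt)
```

Swapping `fake_search` for a thin wrapper around the real retrieval call keeps the budgeting logic independent of the library's exact return type.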
Key Takeaways
- Semble replaces the conventional `grep + read` pipeline inside AI coding agents with a code-aware retrieval index: roughly 98% fewer tokens at higher recall
- The architectural pattern is a four-stage hybrid pipeline: tree-sitter chunking, Model2Vec static embeddings, BM25 lexical retrieval, and reciprocal-rank fusion plus reranking
- Static embeddings are the secret to running the full pipeline at sub-2-millisecond query latency on commodity CPU — transformer-based embeddings would be too slow to drop into an agent's tool-call loop
- Open-source under Apache 2.0, self-hosted, no API rate limits — the only cost is whatever local compute you use for indexing and embeddings
- Best fit for agent builders who need code search inside their own pipeline; for interactive human use, an IDE-integrated alternative like GitHub Copilot or Cursor will be a smoother UX