Name: Semble
Author: MinishLab

Learning Objectives

Understand what code search means in the context of AI coding agents and why token efficiency matters
Identify the engineering pattern Semble uses (chunking + dual retrieval + reranking) and why it produces dramatic token savings
Evaluate when Semble is the right choice versus grep, ripgrep, traditional embedding-based search, or proprietary alternatives

What Is Semble?

Semble is an open-source code search library built by MinishLab, a two-person non-profit NLP lab. The pitch is narrow and concrete: when an AI coding agent needs to find relevant code inside a repository, the conventional grep + read pipeline — search for keywords, then read the matching files into the model's context window — burns enormous amounts of tokens. Semble replaces that pipeline with a code-aware retrieval index that returns just the relevant snippets, slashing token cost without sacrificing recall.

The headline benchmark: Semble hits 94% recall at just 2,000 tokens on the project's evaluation set, while a baseline grep-plus-read pipeline needs a full 100,000-token context window to reach 85% recall on the same queries. That's roughly 50-times fewer tokens for higher recall — a structural change in how an agent reads a codebase, not a marginal optimization.
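The ratio follows directly from the two token budgets quoted above. A quick check in Python, using only those two figures, shows that "50-times fewer" and "98% fewer" are the same claim:

```python
# Token budgets quoted in the benchmark above.
semble_tokens = 2_000
baseline_tokens = 100_000

ratio = baseline_tokens / semble_tokens          # how many times fewer tokens
reduction = 1 - semble_tokens / baseline_tokens  # fraction of tokens saved

print(f"{ratio:.0f}x fewer tokens, {reduction:.0%} reduction")
```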

💡Key Concept

Why this matters for agents: Coding agents such as Claude Code, Cursor, and OpenAI Codex spend a meaningful fraction of every coding session re-reading the same files. With per-million-token pricing, an agent that loads 80,000 tokens of repository context on every turn becomes expensive fast, even when it fits inside a 200K-token context window. Semble lets the agent retrieve only the specific lines it needs, typically a few hundred tokens, for the same job.

✅Tip

Visit Semble: github.com/MinishLab/semble — Apache-2.0 licensed, install via pip install semble. Latest release v0.1.7 shipped May 12, 2026.

Pricing

Plan	Price	Features
Open Source	$0 forever	Apache 2.0 license; self-host on your hardware; no usage caps; full feature parity with future releases

Semble is fully open-source under the Apache 2.0 license. There is no hosted SaaS offering — the entire library runs locally inside your AI agent's process. The only cost is whatever compute you use for indexing and query embeddings, both of which run on CPU at production speeds.

How Semble Works

Semble combines four stages in a single ranked pipeline: one chunking step, two retrieval strategies, and a fusion-plus-reranking step.

1. Code-Aware Chunking via Tree-sitter

Rather than splitting source files into fixed-size character windows (the naive approach), Semble uses tree-sitter to parse each file into its abstract syntax tree, then chunks at semantically meaningful boundaries — function definitions, class declarations, method bodies, top-level statements. The resulting chunks are coherent units of code that mean something on their own, not arbitrary slices.
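To make the idea concrete, here is a minimal sketch of boundary-aware chunking. It is not Semble's implementation: it swaps tree-sitter for Python's standard-library ast module so that it runs without extra dependencies, and it handles Python source only. The function name chunk_python_source and the chunk fields are hypothetical.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level definition or statement."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Start at the first decorator so a decorated function stays in one chunk.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        end = node.end_lineno
        chunks.append({
            "kind": type(node).__name__,        # e.g. FunctionDef, ClassDef, Import
            "name": getattr(node, "name", None),
            "start_line": start,
            "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks

SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_python_source(SAMPLE)
for c in chunks:
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

Each chunk here is a whole function, class, or statement, so a retrieved chunk never starts or ends mid-body the way a fixed-size character window can.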

2. Static Semantic Embeddings via Model2Vec

Each chunk is embedded using potion-code-16M, a MinishLab static embedding model tuned for source code. Static embeddings are dramatically faster than transformer-based embeddings (sentence-transformers, OpenAI text-embedding-3, etc.) — Semble indexes a typical repository in roughly 250 milliseconds and queries return in around 1.5 milliseconds. The quality trade-off is small: Semble's NDCG-at-10 score on the project benchmark is 0.854, which the README claims is roughly 99% of transformer-quality retrieval at a tiny fraction of the cost.
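The speed difference comes from what a static embedding does at query time: it looks up a stored vector per token and pools them, with no neural network forward pass. The sketch below illustrates that mechanism only. The three-dimensional vectors are invented for the example, and none of this is the potion-code-16M model or the Model2Vec API.

```python
import math
import re

# Toy 3-dimensional token vectors, invented for illustration only.
TOKEN_VECTORS = {
    "auth":     [0.9, 0.1, 0.0],
    "login":    [0.8, 0.2, 0.0],
    "password": [0.7, 0.3, 0.1],
    "parse":    [0.0, 0.9, 0.1],
    "json":     [0.1, 0.8, 0.2],
}
DIM = 3

def embed(text):
    """Average the stored vectors of known tokens (a table lookup, no model inference)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOKEN_VECTORS[t] for t in tokens if t in TOKEN_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

chunks = {
    "auth.py": "def login(user, password): check auth",
    "config.py": "def parse(path): load json",
}
query = embed("login password")
ranked = sorted(chunks, key=lambda name: cosine(query, embed(chunks[name])), reverse=True)
print(ranked)
```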

3. Lexical Retrieval via BM25

In parallel with the semantic search, Semble runs a BM25 keyword search with identifier stemming — catching exact symbol matches (function names, variable names) that semantic search alone might miss. This is the same trick mature retrieval systems use: semantic captures fuzzy intent, lexical captures exact symbols, and combining them does better than either alone.
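A minimal sketch of the lexical side follows, assuming the standard Okapi BM25 formula with common default parameters (k1 = 1.5, b = 0.75). The source does not describe Semble's identifier stemming in detail, so this example substitutes a simpler technique, splitting camelCase and snake_case identifiers into lowercase sub-tokens. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

def split_identifiers(text):
    """Lowercased sub-tokens: getUserToken and get_user_token both give get, user, token."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):   # underscores act as separators
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", word)
        tokens.extend(p.lower() for p in parts)
    return tokens

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [split_identifiers(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in split_identifiers(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "def getUserToken(session): return session.token",
    "def parse_config(path): return json.load(open(path))",
    "def refresh_user_token(user): ...",
]
scores = bm25_scores("user token", docs)
print(scores)
```

Because identifiers are split before scoring, the query "user token" matches both getUserToken and refresh_user_token, while the unrelated parse_config chunk scores zero.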

4. Reciprocal-Rank Fusion + Reranking

Semantic and lexical results are merged via reciprocal-rank fusion, then reranked using signals like definition boosts (a function definition outranks a function call in most cases), file coherence (chunks from the same file get a small group bonus), and noise penalties (auto-generated files, test fixtures, vendored dependencies get downweighted).
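The fusion step can be sketched in a few lines. Reciprocal-rank fusion scores each chunk as the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is a conventional default, not a confirmed Semble setting, and the boost and penalty weights below are placeholders chosen for illustration. File coherence is left out to keep the sketch short.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranked list a chunk appears in (rank starts at 1)."""
    fused = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return fused

def rerank(fused, metadata, definition_boost=1.2, noise_penalty=0.5):
    """Apply multiplicative adjustments; the weights are placeholders, not Semble's values."""
    adjusted = {}
    for chunk_id, score in fused.items():
        meta = metadata.get(chunk_id, {})
        if meta.get("is_definition"):
            score *= definition_boost
        if meta.get("is_noise"):
            score *= noise_penalty
        adjusted[chunk_id] = score
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Hypothetical chunk identifiers, best match first in each list.
semantic = ["auth.login", "auth.call_site", "vendor.auth_stub"]
lexical = ["auth.call_site", "auth.login", "tests.fixture"]
metadata = {
    "auth.login": {"is_definition": True},
    "vendor.auth_stub": {"is_noise": True},
    "tests.fixture": {"is_noise": True},
}

fused = reciprocal_rank_fusion([semantic, lexical])
order = rerank(fused, metadata)
print(order)
```

In this example the definition and the call site tie after fusion, since each is ranked first in one list and second in the other. The definition boost breaks the tie, and the vendored and fixture chunks fall to the bottom.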

💡Key Concept

Hybrid retrieval is the standard for production search systems — neither pure embedding similarity nor pure keyword matching is enough on its own. Semble's contribution is making the hybrid stack fast enough to run inside an agent's tool-call loop without measurable latency overhead.

Benchmarks

Metric	Semble	grep + read baseline	Notes
Recall at 2,000 tokens	94%	—	Same query set; the baseline scores lower at 2K tokens
Recall at 100,000 tokens	—	85%	Full context window for grep+read
Token efficiency	~50-times fewer	Baseline	At equivalent or higher recall
NDCG at 10	0.854	—	Roughly 99% of transformer-quality retrieval
Index time (typical repo)	~250 ms	N/A	Static embeddings on CPU
Query latency	~1.5 ms	N/A	Single-query, single-threaded
Indexing speed vs CodeRankEmbed	218-times faster	—	Transformer-based code-embedding baseline
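For readers unfamiliar with the NDCG at 10 row: the metric rewards placing relevant results near the top of the first ten, and a score of 1.0 means the ranking is ideal. A minimal sketch of the standard formula, assuming binary relevance labels and an ideal ordering taken from the same result list:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each hit is discounted by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the top k results divided by the DCG of the best possible ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    best = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / best if best else 0.0

# 1 = relevant chunk, 0 = irrelevant chunk, in the order the search returned them.
perfect = ndcg_at_k([1, 1, 0, 0, 0])
imperfect = ndcg_at_k([1, 0, 1, 0, 0])
print(round(perfect, 3), round(imperfect, 3))
```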

The headline number (98% fewer tokens, the same ratio as the 50-times figure) is the marketing line. The more useful number for agent budgeting is the per-query token cost, which drops from tens of thousands of tokens for grep-plus-read to a few hundred tokens for Semble retrieval. For a coding agent that runs 50-plus tool calls per session, that's an order-of-magnitude reduction in input-token spend.
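A rough session budget makes the difference tangible. Only the 50 tool calls come from the text. The per-call token counts are stand-ins for "tens of thousands" and "a few hundred", and the price is a hypothetical figure, so substitute your own model's rate.

```python
TOOL_CALLS_PER_SESSION = 50             # "50-plus tool calls" from the text
GREP_READ_TOKENS_PER_CALL = 20_000      # assumed stand-in for "tens of thousands"
SEMBLE_TOKENS_PER_CALL = 400            # assumed stand-in for "a few hundred"
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # hypothetical USD price, not a quoted rate

def session_cost(tokens_per_call):
    """Return (total input tokens, USD cost) for one session of retrieval calls."""
    total_tokens = tokens_per_call * TOOL_CALLS_PER_SESSION
    return total_tokens, total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

grep_tokens, grep_cost = session_cost(GREP_READ_TOKENS_PER_CALL)
semble_tokens, semble_cost = session_cost(SEMBLE_TOKENS_PER_CALL)

print(f"grep + read: {grep_tokens:,} tokens, ${grep_cost:.2f}")
print(f"Semble:      {semble_tokens:,} tokens, ${semble_cost:.2f}")
```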

Best Use Cases

Semble is purpose-built for AI coding agents — every design choice optimizes for the specific shape of agent workloads. Use it when:

Building an agentic coding tool — Claude Agent SDK, custom OpenAI Agents SDK pipelines, or in-house assistants that need to search a private monorepo without burning context
Replacing grep-plus-read inside an existing agent loop — drop-in replacement that preserves recall while collapsing per-query token spend
Running large-scale code analysis — bulk repository scanning where embedding-transformer cost would dominate the run
Self-hosting code search for compliance reasons — no external API calls, no data leaves your infrastructure

When to choose alternatives:

For interactive human search — GitHub Copilot's native search, JetBrains AI Assistant, or Cursor's built-in indexer give a smoother UX with editor integration. Semble is a library, not an IDE feature.
For natural-language Q&A over code — Sourcegraph Cody and similar systems combine retrieval with generation in one product. Semble is just the retrieval half.
For tiny codebases (under a few thousand lines) — plain ripgrep is faster and simpler. Semble's payoff scales with repository size and query volume.

Limitations and Considerations

No hosted offering. Semble is a library — you embed it in your own agent or service. There's no managed SaaS API endpoint to call.
Indexing is offline. Semble currently re-indexes on demand or on a schedule; real-time incremental updates as developers edit files are not yet a built-in feature.
Single language at the indexer level. Tree-sitter grammars are language-specific, so the indexer must be configured per language. Most common languages (Python, JavaScript/TypeScript, Go, Rust, Java, C/C++) are supported out of the box.
Static embeddings have a quality ceiling. Roughly 99% of transformer-quality retrieval (measured by NDCG at 10) is excellent, but the remaining 1% may matter for very specific queries that depend on deep code semantics. Worth A/B-testing against a transformer baseline if your use case is recall-critical.
Early-stage project. Latest release at the time of writing is v0.1.7 (May 12, 2026). API stability and long-term maintenance are open questions, though MinishLab's track record on Model2Vec (over 2,000 GitHub stars, 4 million-plus downloads on Hugging Face) suggests credible ongoing investment.

Strengths

Token efficiency at production speed: Roughly 98% fewer tokens than grep-plus-read at higher recall — a structural cost shift for agent workloads, not a marginal optimization
Sub-2-millisecond query latency: Runs inside an agent's tool-call loop with no perceptible overhead, on commodity CPU hardware
Hybrid retrieval out of the box: Tree-sitter chunking + static embeddings + BM25 + reciprocal-rank fusion + reranking — production-grade search pipeline as a single library
Apache 2.0 license: Commercially usable, self-hostable, no usage caps, no API rate limits
Static embeddings via potion-code-16M: 218-times faster indexing than transformer-based code-embedding models at roughly 99% of the quality
MinishLab's broader portfolio: Same lab also publishes Model2Vec, SemHash, and Vicinity — a coherent stack of fast, efficient retrieval infrastructure

Getting Started

Install the library: pip install semble
Point the indexer at your repository: semble index /path/to/repo — completes in roughly 250 milliseconds for a typical repo
Query from your agent's tool-call loop: semble.search("how does authentication work?", top_k=10) returns the top 10 most relevant code chunks, typically a few hundred tokens total
Wire the returned chunks into your prompt instead of read_file calls — this is where the token savings come from
For production use, persist the index to disk and reload it; rebuild on a schedule that matches your repo's churn rate

For a working example, see the examples/ directory in the Semble repository — there's a reference integration with the Anthropic Python SDK demonstrating the full agent loop.
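Independent of the exact retrieval API, the wiring step usually looks like this: take the ranked chunks, pack them into the prompt until a token budget is reached, and send that instead of whole files. The sketch below is an assumption-laden illustration, not code from the Semble repository. fake_search is a hypothetical stand-in for the retrieval call, the chunk fields (path, start_line, text) are invented, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
def estimate_tokens(text):
    """Rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def build_context(chunks, budget_tokens=2_000):
    """Concatenate ranked chunks, best first, until the token budget is used up."""
    parts, used = [], 0
    for chunk in chunks:
        block = f"# {chunk['path']}:{chunk['start_line']}\n{chunk['text']}"
        cost = estimate_tokens(block)
        if used + cost > budget_tokens:
            break
        parts.append(block)
        used += cost
    return "\n\n".join(parts), used

def fake_search(query, top_k=10):
    """Hypothetical stand-in for the retrieval call; returns pre-ranked chunks."""
    results = [
        {"path": "auth/session.py", "start_line": 12,
         "text": "def login(user, password):\n    ..."},
        {"path": "auth/tokens.py", "start_line": 40,
         "text": "def issue_token(user):\n    ..."},
    ]
    return results[:top_k]

question = "how does authentication work?"
context, used = build_context(fake_search(question))
prompt = f"Relevant code:\n\n{context}\n\nQuestion: {question}"
print(used)
print(prompt)
```

Keeping the file path and start line in each block lets the agent follow up with a targeted read if a chunk turns out to need more surrounding context.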

Key Takeaways

Semble replaces the conventional grep + read pipeline inside AI coding agents with a code-aware retrieval index — roughly 98% fewer tokens at higher recall
The architectural pattern is a four-stage hybrid pipeline: tree-sitter chunking, Model2Vec static embeddings, BM25 lexical retrieval, and reciprocal-rank fusion plus reranking
Static embeddings are the secret to running the full pipeline at sub-2-millisecond query latency on commodity CPU — transformer-based embeddings would be too slow to drop into an agent's tool-call loop
Open-source under Apache 2.0, self-hosted, no API rate limits — the only cost is whatever local compute you use for indexing and embeddings
Best fit for agent builders who need code search inside their own pipeline; for interactive human use, an IDE-integrated alternative like GitHub Copilot or Cursor will be a smoother UX
