Est. 2024
Nonprofit

MinishLab

Two-person non-profit open-source NLP lab specializing in fast, efficient models for embedding, semantic deduplication, and code search.

📋 About MinishLab

Updated June 15, 2026

MinishLab (branded "Minish") is a two-person non-profit open-source lab focused on natural language processing, founded in 2024 by Thomas van Dongen and Stéphan Tulkens. Its tagline is "Solving big problems with small models," and its core philosophy is that "if you make models fast enough, you unlock new possibilities." The benchmark targets the lab routinely meets include embedding the entirety of English Wikipedia in roughly five minutes, classifying tens of thousands of documents per second on CPU, and deduplicating large datasets in minutes.

MinishLab's published work centers on a family of "potion" static embedding models — including potion-base-2M, potion-base-4M, potion-base-8M, potion-multilingual-128M, potion-retrieval-32M, and potion-code-16M — which power most of the lab's downstream tools. The potion-base-8M and potion-base-4M models alone have crossed roughly 700,000 downloads each on Hugging Face, and the full MinishLab catalog has surpassed 4 million combined package downloads. Across GitHub, the lab's repositories have accumulated over 5,500 stars.

The lab's product portfolio sits in the agentic-developer tooling space:
  • Semble: code search optimized for AI agents, using roughly 98% fewer tokens than grep-plus-read pipelines
  • Model2Vec: the family of static embedding models that achieve state-of-the-art speed at a fraction of the compute cost of sentence-transformers
  • SemHash: multimodal semantic deduplication and dataset filtering
  • Vicinity: a unified interface across approximate-nearest-neighbor backends
  • Tokenlearn: a method for pre-training static word embeddings

Tools and models are released under permissive open-source licenses, and the lab funds itself through community sponsorships and grants rather than traditional venture capital, a deliberate choice that keeps the research direction focused on speed and accessibility rather than commercial scale.

🛠️ Products & Tools (1)

Semble · Open Source · AI Coding

Open-source code search library purpose-built for AI coding agents — 98% fewer tokens than grep-plus-read at higher recall, with sub-2-millisecond query latency.

📰 MinishLab in the News
