A local CLI tool for indexing and searching company documents with hybrid retrieval. Combines BM25 full-text search, semantic vector search, and optional reranking into one local-first pipeline for knowledge retrieval across messy internal files. Built with Bun and TypeScript, backed by SQLite with FTS5 and sqlite-vec — no external API calls, no cloud dependency, everything runs on the machine.
What was built
- Hybrid retrieval engine — BM25 full-text search via SQLite FTS5 and semantic vector search via sqlite-vec, combined into a single ranked result set with optional second-stage local reranking
- Local embedding pipeline — ONNX-based embedding model running locally, converting document chunks into vectors without round-tripping to an API
- Multi-format document ingestion — parsers for PDF, Word, PowerPoint, HTML, Excel, CSV, email (.eml), images, and plaintext, handling the messy reality of enterprise file systems
- Resilient indexing pipeline — content hashing to skip unchanged files, crash isolation so heavy parsing workloads can't take down the indexer, and queued processing to keep indexing stable at scale
- Live re-indexing — file watcher that detects changes and automatically re-indexes affected documents, keeping the search index current without manual intervention
- Configurable chunking — document splitting with adjustable chunk size and overlap, feeding both the BM25 and vector indexes from a single chunking pass
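The source doesn't specify how the BM25 and vector result lists are merged into "a single ranked result set"; one common choice for this is Reciprocal Rank Fusion (RRF). A minimal sketch under that assumption, with `Hit` and `rrfFuse` as illustrative names rather than the tool's actual API:

```typescript
// Hypothetical shape of a ranked hit from either index.
interface Hit {
  id: string;
  score: number;
}

// Reciprocal Rank Fusion: merge two ranked lists, rewarding documents
// that rank well in either one. k dampens the weight of top ranks
// (60 is the value commonly used in the literature).
function rrfFuse(bm25: Hit[], vector: Hit[], k = 60): Hit[] {
  const scores = new Map<string, number>();
  for (const list of [bm25, vector]) {
    list.forEach((hit, rank) => {
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A document appearing in both lists accumulates score from each, so it outranks a document that tops only one list; scores from FTS5 and the vector index never need to be calibrated against each other, which is why rank-based fusion is a popular default for hybrid search.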
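The content-hashing skip described above can be sketched as follows. The `seen` map stands in for whatever persisted store the real tool uses (an assumption; it presumably lives in SQLite), and the hash uses Node's `crypto` module, which Bun supports:

```typescript
import { createHash } from "node:crypto";

// Stand-in for the persisted path -> content-hash table (assumption:
// the real tool keeps this in SQLite alongside the index).
const seen = new Map<string, string>();

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Re-index only when a file's content hash changed since the last run.
function needsReindex(path: string, content: string): boolean {
  const hash = sha256(content);
  if (seen.get(path) === hash) return false; // unchanged: skip parsing/embedding
  seen.set(path, hash);
  return true;
}
```

Hashing content rather than comparing mtimes means a `touch` or a copy with fresh timestamps doesn't trigger a wasted parse-and-embed pass, which matters when embedding runs locally on CPU.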
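The configurable chunking can be sketched like this; splitting by character count and the particular size/overlap defaults are illustrative assumptions, not the tool's actual settings (a token-based splitter would follow the same pattern):

```typescript
// Split text into fixed-size chunks with overlap, measured in
// characters here for simplicity. Each chunk feeds both the FTS5
// table and the vector index in a single pass.
function chunk(text: string, size = 512, overlap = 64): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final chunk reached end of text
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated index entries.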
Architecture
Codebase by Layer — 117 total
Tech stack
Bun · TypeScript · SQLite · FTS5 · sqlite-vec · ONNX Runtime · CLI
Why this project matters
This is retrieval infrastructure, not a chatbot wrapper. It covers the full arc of a RAG system — document parsing, chunking, embedding, hybrid search, reranking — built to work on real enterprise files without depending on cloud services. It demonstrates search systems design, document intelligence, and the practical engineering of making AI retrieval reliable on messy, real-world data.