helgesen.dev

A local CLI tool for indexing and searching company documents with hybrid retrieval. Combines BM25 full-text search, semantic vector search, and optional reranking into one local-first pipeline for knowledge retrieval across messy internal files. Built with Bun and TypeScript, backed by SQLite with FTS5 and sqlite-vec. There are no external API calls and no cloud dependency: everything runs on the local machine.

117 TypeScript source files

15 test files

Hybrid: BM25 + vector search

Local: ONNX embeddings

12+ file formats supported

SQLite: FTS5 + sqlite-vec

What was built

Hybrid retrieval engine — BM25 full-text search via SQLite FTS5 and semantic vector search via sqlite-vec, combined into a single ranked result set with optional second-stage local reranking
Local embedding pipeline — ONNX-based embedding model running locally, converting document chunks into vectors without round-tripping to an API
Multi-format document ingestion — parsers for PDF, Word, PowerPoint, HTML, Excel, CSV, email (.eml), images, and plaintext, handling the messy reality of enterprise file systems
Resilient indexing pipeline — content hashing to skip unchanged files, crash isolation for heavy parsing loads, and queued processing to keep the system stable under scale
Live re-indexing — file watcher that detects changes and automatically re-indexes affected documents, keeping the search index current without manual intervention
Configurable chunking — document splitting with adjustable chunk size and overlap, feeding both the BM25 and vector indexes from a single chunking pass
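
As a rough illustration of the indexing side described above, the following TypeScript sketch shows fixed-size chunking with overlap and a content-hash check that skips unchanged files. Everything in it is hypothetical: `chunkText`, `contentHash`, and `HashIndex` are names invented for this example, chunk size is measured in characters (the project may well count tokens instead), and the FNV-1a-style hash is only a dependency-free stand-in for whatever hash the tool actually uses. A cryptographic hash such as SHA-256 would be a more typical choice.

```typescript
// Hypothetical sketch: names, units, and hash choice are assumptions, not the project's code.

interface Chunk {
  index: number; // position of the chunk within the document
  start: number; // character offset into the source text
  text: string;
}

// Split text into windows of `chunkSize` characters, each sharing `overlap`
// characters with the previous window.
function chunkText(text: string, chunkSize: number, overlap: number): Chunk[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be in [0, chunkSize)");
  }
  const step = chunkSize - overlap;
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      index: chunks.length,
      start,
      text: text.slice(start, start + chunkSize),
    });
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (stand-in for a real content hash).
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Remembers the last indexed hash per path so unchanged files can be skipped.
class HashIndex {
  private seen = new Map<string, string>();

  needsReindex(path: string, content: string): boolean {
    const hash = contentHash(content);
    if (this.seen.get(path) === hash) return false;
    this.seen.set(path, hash);
    return true;
  }
}

const index = new HashIndex();
const body = "abcdefghij";
if (index.needsReindex("notes.txt", body)) {
  // With chunkSize 4 and overlap 1 the chunks are "abcd", "defg", "ghij".
  console.log(chunkText(body, 4, 1).map((chunk) => chunk.text));
}
```

Because the same chunk list would feed both the full-text and the vector index, a single pass like this keeps the two indexes aligned on chunk boundaries.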

BM25 full-text search via SQLite FTS5

Semantic vector search via sqlite-vec

Optional second-stage local reranker

Local ONNX embedding model — no API calls

PDF, Word, PowerPoint, HTML, Excel, CSV, email, image parsing

Content hashing — unchanged files skipped on re-index

File watcher for automatic re-indexing on changes

Crash isolation for resilient parsing under heavy loads

Chunking pipeline with configurable overlap and size

CLI for indexing, searching, and configuration
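
One common way to merge a BM25 ranking and a vector ranking into a single result list is reciprocal rank fusion (RRF). The page does not say which fusion method this project uses, so the sketch below is an assumption that illustrates the general idea. The function name `reciprocalRankFusion`, the chunk ids, and the constant `k = 60` (a conventional default for RRF) are all hypothetical.

```typescript
// Hypothetical sketch of reciprocal rank fusion; not confirmed as the project's method.

type RankedList = string[]; // chunk ids, best match first

interface FusedHit {
  id: string;
  score: number;
}

// Each list contributes 1 / (k + rank) for every id it contains, where rank
// starts at 1. Ids that rank well in several lists accumulate higher scores.
function reciprocalRankFusion(lists: RankedList[], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, position) => {
      const rank = position + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    // Highest score first; ties broken by id for a stable order.
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}

const bm25Hits: RankedList = ["doc-a", "doc-b", "doc-c"];
const vectorHits: RankedList = ["doc-b", "doc-d", "doc-a"];

const fused = reciprocalRankFusion([bm25Hits, vectorHits]);
console.log(fused.map((hit) => hit.id));
// logs doc-b, doc-a, doc-d, doc-c
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and vector distances live on different scales. An optional reranker could then reorder only the top few fused hits.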

Architecture

Codebase by Layer

117 total

Parsers (PDF, Word, PPTX, HTML, Excel, CSV, email, images): 24

Search / Retrieval / Reranker: 18

Indexer / Chunking / Embeddings: 22

CLI / Config / Watcher / DB: 38

Tests: 15

Tech stack

Bun · TypeScript · SQLite · FTS5 · sqlite-vec · ONNX Runtime · CLI

Development

Feb 2026: Repo created, core retrieval pipeline and parser architecture

Mar 2026: Hybrid search, reranker, file watcher, crash isolation, 12+ format support

Why this project matters

This is retrieval infrastructure, not a chatbot wrapper. It covers the full arc of a RAG system — document parsing, chunking, embedding, hybrid search, reranking — built to work on real enterprise files without depending on cloud services. It demonstrates search systems design, document intelligence, and the practical engineering of making AI retrieval reliable on messy, real-world data.