sist2
Sist2 gives you a lightning-fast file system indexer and search tool on your own infrastructure.
Open-source file system search, honestly reviewed. GPL-3.0, C-powered, and not for everyone.
TL;DR
- What it is: A fast, self-hosted file system indexer and search tool written in C with a Vue.js frontend. Points at a directory, extracts text from every supported file type — including OCR on scanned PDFs — and gives you a web interface to search it all [README][1].
- Who it’s for: Homelab operators sitting on large collections of documents, repair manuals, ebooks, or media who need Google-style content search across their own files, not just filename matching [1].
- Cost savings: sist2 itself is free (GPL-3.0). No SaaS tier exists. Infrastructure cost is a VPS plus Elasticsearch (or the lighter SQLite backend). Managed document search services can run from tens to hundreds of dollars per month at volume; sist2 replaces that bill with the cost of the server it runs on.
- Key strength: Genuinely fast C-based core, incremental scanning, OCR via Tesseract, broad format coverage (PDF, EPUB, DOCX, audio, video, archives), and a live demo you can try before installing [README][4].
- Key weakness: The README still carries a “Warning: sist2 is in early development” notice. Elasticsearch as a dependency consumes 2GB of JVM heap minimum. Community is thin — 1,239 GitHub stars, sparse third-party coverage [README][merged profile].
What is sist2
sist2 (Simple Incremental Search Tool) is a file system indexer built in C with a Vue.js web interface [1]. The pitch is direct: point it at a directory, let the scan complete, then search every file in it — not just by filename but by content. PDFs get text-extracted and OCR’d. Audio files expose their ID3 tags. Video files surface title and artist metadata. Archives (zip, tar, 7z, rar) get recursively unpacked and scanned [README][4].
The tool runs as a three-piece stack: the sist2 binary that handles scanning and indexing, a search backend (Elasticsearch or SQLite), and a web interface for search and job administration. Docker Compose bundles all three for most deployments [README][1].
The project lives at github.com/sist2app/sist2 (1,239 stars, GPL-3.0 license). There’s a live demo at sist2.simon987.net. The community is on Discord. Active development snapshots are published via a CI server at files.simon987.net [README].
One honest flag upfront: the README says “Warning: sist2 is in early development.” That notice has been there for a while. The tool is functional — users are indexing tens of thousands of files — but expect incomplete documentation, unstable APIs, and limited support if something breaks.
Why people choose it
Third-party coverage for sist2 is sparse; it’s not a heavily reviewed tool. But the use cases that do appear are consistent.
The dominant use case is large document collections. The noted.lol review [1] describes the author sitting on approximately 4,000 PDF repair manuals from an iFixit archive dump shared publicly on Reddit. Every general approach to organizing that — folders, tags, a NAS — fails when you need to find which manual covers a specific board revision. sist2 indexed the entire collection with OCR enabled and made the content searchable. The author describes setup and indexing as “remarkably efficient.” That’s the core story: a problem that’s hard to solve any other way without paying for a managed service.
NAS integration is the second pattern. The TrueNAS Community thread [2] shows a user trying to hook sist2 into a TrueNAS Scale system to index files stored on their NAS, with sist2 and Elasticsearch both running in Docker. The setup isn’t plug-and-play — the thread is asking for advice on exposing NAS datasets to Docker containers — but the fact that users are pushing it into production NAS environments says something about the use case it covers.
What’s notably absent from third-party coverage: independent benchmarks, reliability reports after extended operation, or migration stories from commercial alternatives. The reviews that exist are “I tried it and it worked” rather than “I’ve run this for a year in production.” That gap matters if you’re making a long-term infrastructure bet.
Features
Core scanning engine:
- Multi-threaded scanning with a low memory footprint (the C binary itself is lean relative to Elasticsearch) [README][4]
- Incremental scanning — only processes changed files on subsequent runs [README][1]
- Recursive scanning inside archive files (zip, tar, 7z, rar, ar) [README][4]
- Manual tagging from the web UI; automatic tagging via user scripts based on file attributes [README][1]
- Named-entity recognition (NER) running client-side in the browser [README]
- Stats page with disk utilization visualization [README][1]
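To make the incremental scanning idea from the list above concrete, here is a minimal Python sketch. It is not sist2's implementation (sist2 is written in C, and the sources do not describe how it detects changes); it assumes a file counts as changed when its modification time or size differs from the previous scan. The names take_snapshot, files_to_reindex and deleted_files are hypothetical.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical snapshot type: path -> (modification time in ns, size in bytes).
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(root: str) -> Snapshot:
    """Walk root and record mtime and size for every file found."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot


def files_to_reindex(previous: Snapshot, current: Snapshot) -> List[str]:
    """Paths that are new, or whose (mtime, size) changed since the last scan."""
    return sorted(p for p, meta in current.items() if previous.get(p) != meta)


def deleted_files(previous: Snapshot, current: Snapshot) -> List[str]:
    """Paths that were indexed before but no longer exist."""
    return sorted(p for p in previous if p not in current)
```

On a second run over an unchanged directory, files_to_reindex returns an empty list, which is why repeat scans of a large collection can be much cheaper than the first one.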
Format support:
- PDF, XPS, FB2, EPUB: text extraction + OCR + thumbnails via MuPDF [README][4]
- CBZ, CBR: thumbnails [README]
- Audio (audio/*): thumbnails + ID3 tags via ffmpeg [README][4]
- Video (video/*): thumbnails + title/artist/comment metadata via ffmpeg [README][4]
- Images: thumbnails + common EXIF tags [README][4]
- Fonts (TTF, OTF, WOFF): thumbnails via FreeType2 [README][4]
- Plain text, HTML: full text [README][4]
- DOCX, XLSX, PPTX: text + creator/title metadata [README][4]
- MOBI, AZW, AZW3: text + author/title via libmobi [README][4]
- Archives: recursively indexed via libarchive [README][4]
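The recursive archive indexing in the last item can be illustrated with a short Python sketch. sist2 does this through libarchive and covers several formats; the sketch below is an assumption-laden stand-in that handles zip files only, using Python's standard zipfile module. The function name iter_archive_members and the max_depth guard are hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_archive_members(
    stream: BinaryIO, prefix: str = "", max_depth: int = 5
) -> Iterator[Tuple[str, bytes]]:
    """Yield (virtual_path, content) for each file, descending into nested zips."""
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            virtual_path = prefix + info.filename
            nested = io.BytesIO(data)
            looks_nested = (
                max_depth > 0
                and virtual_path.lower().endswith(".zip")
                and zipfile.is_zipfile(nested)
            )
            if looks_nested:
                nested.seek(0)
                # Treat the inner archive as a directory in the virtual path.
                yield from iter_archive_members(
                    nested, virtual_path + "/", max_depth - 1
                )
            else:
                yield virtual_path, data
```

Each yielded pair could then be handed to a text extractor. The depth limit is there so that a maliciously nested archive cannot recurse without bound.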
OCR: Powered by Tesseract. Pre-installed language packs in the Docker image: English, French, Spanish, Russian, Japanese, Hindi. Additional languages installable via apt [README][4]. OCR applies to PDF, XPS, CBZ/CBR, FB2, EPUB.
Search backends:
- Elasticsearch — the original backend, version >= 6.8.X, ideally >= 7.14.0. Required for full-feature operation [README]
- SQLite — added as a lighter alternative, no external service required, lower RAM overhead. Precise feature delta between backends isn’t documented in available sources [README]
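The Elasticsearch version requirement above (at least 6.8, ideally 7.14.0 or later) is easy to check before pointing sist2 at an existing cluster. The following is a small sketch, not part of sist2; the names parse_version and classify are hypothetical, and it assumes plain numeric version strings such as 7.17.9 with no suffix.

```python
from typing import Tuple

MINIMUM = (6, 8, 0)       # "version >= 6.8.X" in the requirement above
RECOMMENDED = (7, 14, 0)  # "ideally >= 7.14.0"


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '7.17.9' into (7, 17, 9); missing parts default to 0."""
    parts = text.strip().split(".")
    major, minor, patch = (int(p) for p in (parts + ["0", "0"])[:3])
    return (major, minor, patch)


def classify(version: str) -> str:
    """Return 'unsupported', 'supported' or 'recommended' for a version string."""
    parsed = parse_version(version)
    if parsed < MINIMUM:
        return "unsupported"
    if parsed < RECOMMENDED:
        return "supported"
    return "recommended"
```

The version string itself is reported by Elasticsearch in the version.number field of the JSON document served at its root endpoint.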
Web interface: Mobile-friendly, with job scheduling and configuration via an admin panel on port 8080 and the search frontend on port 4090 [README][1]. The README warns clearly: don’t expose the admin panel on port 8080 publicly.
REST API is listed as a canonical feature in the merged profile, but isn’t comprehensively documented in available sources.
Pricing: SaaS vs self-hosted math
sist2 has no SaaS tier. It’s purely self-hosted, GPL-3.0 licensed.
What you pay for:
- sist2 binary: $0 [README]
- VPS: $10–20/month. The binding constraint is RAM for Elasticsearch: the docker-compose example configures -Xms2g -Xmx2g (2GB heap minimum for ES), meaning your total server RAM budget needs to be 4GB+ [README][1]. A Hetzner CX21 (4GB RAM) runs around $7–10/month; a DigitalOcean Droplet with comparable specs is $18/month.
- SQLite backend alternative: If RAM is the constraint, SQLite eliminates Elasticsearch entirely. A 2GB VPS ($5–6/month) may be sufficient, though performance at scale isn’t independently documented.
What you’d pay for the alternative: Managed document search services (Elastic Cloud, Algolia, or similar) typically start at $15–50/month for small deployments and climb steeply with data volume — specific current pricing for those services isn’t included in the source data for this review, so treat those figures as directional. For 4,000 PDFs with full-text search, the self-hosted math is clear: a $10–20/month VPS running sist2 is likely cheaper than any managed equivalent once you’re past free tiers.
The honest framing: sist2 saves money compared to managed search services, but costs more than doing nothing. If filename search covers your use case, tools built into your OS or NAS are free and zero-maintenance. sist2 earns its infrastructure cost when you need OCR-powered full-text content search across mixed file types.
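The comparison in this section can be worked through as arithmetic. The sketch below uses only the monthly ranges quoted above, which are directional rather than verified prices; the constant and function names are hypothetical.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high) in USD per month

# Figures quoted in this section; treat them as directional.
VPS_WITH_ELASTICSEARCH: Range = (10.0, 20.0)
VPS_WITH_SQLITE: Range = (5.0, 6.0)
MANAGED_SEARCH_ENTRY: Range = (15.0, 50.0)


def annual(monthly: Range) -> Range:
    """Convert a monthly (low, high) range into a yearly one."""
    low, high = monthly
    return (low * 12, high * 12)


def annual_savings(self_hosted: Range, managed: Range) -> Range:
    """Yearly saving from self-hosting as (worst case, best case).

    Worst case pairs the cheapest managed plan with the priciest VPS;
    best case pairs the priciest managed plan with the cheapest VPS.
    """
    sh_low, sh_high = annual(self_hosted)
    m_low, m_high = annual(managed)
    return (m_low - sh_high, m_high - sh_low)
```

With these ranges, annual_savings(VPS_WITH_ELASTICSEARCH, MANAGED_SEARCH_ENTRY) spans from 60 dollars more expensive to 480 dollars cheaper per year, while the SQLite setup comes out ahead at both ends. That matches the "likely cheaper" wording above: the saving is real at volume but not guaranteed at the smallest managed tier.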
Deployment reality check
The noted.lol author [1] describes setup as “remarkably efficient.” That’s roughly accurate for the happy path on a clean Linux host. The less rosy version:
What you actually need:
- Linux host with Docker and docker-compose installed
- 4GB+ RAM (2GB reserved for Elasticsearch’s JVM) [README][1]
- Disk space for your files, Elasticsearch data directory, and sist2 index files
- Correct permissions on the Elasticsearch data volume — must be writable by UID/GID 1000, or you configure PUID/PGID explicitly [README][1]
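The permissions requirement in the last item is a common source of failed first starts, so it can be worth checking the data directory before running docker-compose up. This is a rough sketch for a Linux host, assuming plain owner/group/other mode bits decide access; it ignores ACLs, supplementary groups and root. The function name writable_by is hypothetical.

```python
import os
import stat


def writable_by(path: str, uid: int = 1000, gid: int = 1000) -> bool:
    """Approximate check: can uid/gid create files inside the directory at path?

    Creating files in a directory needs both write and execute bits.
    """
    st = os.stat(path)
    mode = st.st_mode
    if st.st_uid == uid:
        return bool(mode & stat.S_IWUSR) and bool(mode & stat.S_IXUSR)
    if st.st_gid == gid:
        return bool(mode & stat.S_IWGRP) and bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IWOTH) and bool(mode & stat.S_IXOTH)
```

If the check fails for the host directory you plan to mount as the Elasticsearch data volume, either change its ownership to 1000:1000 or set PUID/PGID as described above.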
Setup flow:
Two docker-compose services: elasticsearch and sist2-admin. After docker-compose up, navigate to port 8080 to configure your first scan job — name the job, set the path (the container-internal path, not your host path), choose search backend and OCR language, then kick off indexing. When complete, switch to the frontend config tab to wire the job to the search UI [1].
Gotchas from sources:
- Volume mapping trips up first-timers: the directory path you configure in the sist2 admin panel is the container-internal path, not the host path [1]. Map /path/on/host:/ifixit in docker-compose, then configure /ifixit in the admin panel.
- TrueNAS Scale users face an additional challenge: exposing NAS datasets to Docker containers requires TrueNAS-specific configuration that sist2 doesn’t smooth over [2].
- No authentication on either the admin panel (port 8080) or search frontend (port 4090) by default. Exposing either on a public IP means your files and admin controls are open. You need a reverse proxy with authentication — Caddy or nginx with basic auth, at minimum [README][1].
- The “early development” notice is real — documentation is incomplete, and some features like the REST API lack comprehensive docs in available sources.
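The volume mapping gotcha is easier to reason about as a path translation. The sketch below is illustrative only and not part of sist2 or Docker; it assumes a single mapping of the form /path/on/host:/ifixit, as in the example above, and the function name container_path is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def container_path(
    host_path: str, host_root: str, container_root: str
) -> Optional[str]:
    """Translate a host path to what the container sees for host_root:container_root.

    Returns None when host_path lies outside the mapped directory, which
    means the container cannot see that file at all.
    """
    try:
        relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_root))
    except ValueError:
        return None
    return str(PurePosixPath(container_root) / relative)
```

For a file at /path/on/host/manuals/a.pdf, the function returns /ifixit/manuals/a.pdf. Paths of that second form, starting at the container root of the mapping, are what the admin panel expects.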
Realistic time estimate: 30–60 minutes for a technical user on a fresh VPS with Docker already installed. Add 30–120 minutes for the initial index, depending on file count and whether OCR is enabled (OCR is significantly slower than plain text extraction). For a non-technical user with no Docker experience: this is not the right tool without help.
Pros and cons
Pros
- Fast C-based scanner. Multi-threaded, low memory overhead from the binary itself. Indexing 4,000 OCR’d PDFs described as “remarkably efficient” [1].
- Incremental scanning. Subsequent runs only reprocess changed files — essential for large collections [README][1].
- Broad format coverage. PDF, EPUB, DOCX, audio, video, images, ebooks, archives — the file types that accumulate on a homelab are mostly covered [README][4].
- OCR built in. Tesseract with multi-language support is included in the Docker image; enabling it is a single checkbox [README][1].
- Recursive archive scanning. Contents of zip/tar/7z files indexed without manual extraction [README][4].
- SQLite backend. The lighter option eliminates Elasticsearch for smaller or RAM-constrained deployments [README].
- Live demo before committing. sist2.simon987.net lets you test the search UI before spending an afternoon on deployment [README].
Cons
- “Early development” warning is real. Not boilerplate — documentation has gaps, APIs can change, edge cases may fail silently [README].
- Elasticsearch is heavy. 2GB JVM heap minimum pushes your server requirements up and increases monthly VPS cost [README][1].
- No default authentication. Both the admin panel and search frontend are unauthenticated. You must add a reverse proxy yourself — this is a non-optional security step [README][1].
- GPL-3.0, not MIT. Commercial embedding or redistribution has license implications. Not a concern for personal homelab use; potentially significant for any commercial context.
- Thin community. 1,239 GitHub stars is functional but small. Third-party documentation is sparse. Unusual problems land you in Discord or reading C source code [merged profile].
- NAS integration requires manual work. No first-class integration with TrueNAS, Synology, or other NAS platforms [2].
- REST API documentation is incomplete. Listed as a feature but not comprehensively documented in available sources.
- No user management. No multi-user access, no per-user permissions, no sharing controls.
Who should use this / who shouldn’t
Use sist2 if:
- You have a large collection of documents — PDFs, repair manuals, ebooks, technical docs — that you need to search by content, not just filename.
- You’re a homelab operator comfortable with Docker Compose and basic Linux administration.
- You want OCR-powered search across scanned documents without paying for a managed service.
- You’re willing to run early-development software and troubleshoot via source code or Discord when things break.
Skip it if:
- You need enterprise reliability, documented SLAs, or official support. This is not a product with a support contract.
- Your context is commercial and GPL-3.0 creates licensing complications.
- You have no Docker experience. The setup will be frustrating without it.
- Your use case is application-level search (building search into a web app). Meilisearch or Typesense are designed for that and have much better documentation.
- You need multi-user access, authentication, or sharing controls.
Alternatives worth considering
- Recoll — mature, well-documented desktop document indexer. No web UI by default; works best for single-machine use rather than network storage. More stable than sist2.
- Meilisearch — fast, developer-focused search engine with a clean REST API. Not a file-system crawler — you push documents to it programmatically. Different category than sist2, but worth knowing.
- Typesense — similar to Meilisearch, slightly different trade-offs. Also not a file-system scanner.
- Apache Tika + Elasticsearch — the components sist2 replaces, assembled manually. More control, significantly more configuration work.
- Nextcloud + full-text search app — if your underlying problem is file organization plus search rather than search alone, a full file management platform solves more of the problem. Heavier stack.
- Perkeep (Camlistore) — content-addressed personal storage with indexing. Interesting architecture but low-activity community.
For the specific niche sist2 targets — crawling an existing file system with OCR and full-text search via a web UI — the practical shortlist is sist2 versus assembling your own Tika + Elasticsearch pipeline. sist2 wins on setup simplicity; the DIY approach wins on documentation and long-term stability.
Bottom line
sist2 solves a real problem: making a large, mixed file collection searchable by content without paying for a managed service. The C core is fast, OCR via Tesseract works out of the box, and Docker Compose gets you from zero to a working search index in under an hour if you know what you’re doing. For a homelab operator sitting on thousands of PDFs or repair manuals, it’s probably the most direct path to a working solution.
The caveats are genuine. Early-development software means incomplete docs and potential breakage on upgrades. Elasticsearch’s memory appetite means you need a server with real headroom. No default authentication means you must handle that yourself before exposing it to a network. And a community of 1,239 stars means you’re largely on your own when the unusual case hits.
Try the live demo at sist2.simon987.net first. If the search UI does what you need, the setup is worth the afternoon.
Sources
- noted.lol — “Index and Search Every File on Your Homelab Server using Sist2”. https://noted.lol/sist2/
- TrueNAS Community — “Exposing files in TrueNAS Scale to a Docker-based file indexer? (sist2)”. https://www.truenas.com/community/threads/exposing-files-in-truenas-scale-to-a-docker-based-file-indexer-sist2.89926/
- leviwheatcroft.github.io — “sist2 — selfhosted-awesome-unlist”. https://leviwheatcroft.github.io/selfhosted-awesome-unlist/sist2.html
Primary sources:
- GitHub repository and README: https://github.com/sist2app/sist2 (1,239 stars, GPL-3.0)
- Live demo: https://sist2.simon987.net
- Community Discord: https://discord.gg/2PEjDy3Rfs