unsubbed.co

Paperless AI

Released under the MIT license, Paperless AI is an automated document analyzer for Paperless-ngx that runs AI on self-hosted infrastructure.

Automated document classification for self-hosted archives, honestly reviewed. What actually changes when you add AI to your document stack.

TL;DR

  • What it is: An AI-powered extension for Paperless-ngx that automatically classifies, tags, and indexes your scanned documents using OpenAI, Ollama, or any compatible AI backend [README].
  • Who it’s for: Paperless-ngx users drowning in untagged documents who want automation — particularly those already running or considering a local LLM setup for privacy [README].
  • Cost: The software is MIT-licensed and free. Real cost is your AI backend: $0/mo with Ollama running locally, or OpenAI API fees that work out to pennies per document. VPS is whatever you’re already paying for Paperless-ngx [README].
  • Key strength: RAG-based document chat — ask “when did I sign my rental agreement?” in natural language and get an actual answer pulled from your archive [README].
  • Key weakness: Requires Paperless-ngx already running. Not a standalone product. If you haven’t set up Paperless-ngx yet, you’re signing up for two things, not one.

What is Paperless AI

Paperless AI is not a document management system. It’s an AI layer that sits on top of Paperless-ngx — the popular open-source document management platform that handles scanning, OCR, and archiving of physical documents. If you’re not already using Paperless-ngx, this tool is not where you start. If you are, it’s the component that removes the part everyone eventually abandons: manual tagging.

The core function is straightforward: when a new document arrives in your Paperless-ngx inbox — a scanned receipt, a utility bill, a lease — Paperless AI intercepts it, sends the content to an AI model of your choice, and writes back the classified metadata. Title, tags, document type, correspondent — all filled in automatically, without you touching a thing [README].
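The write-back half of that loop can be sketched against the Paperless-ngx REST API. This is a minimal sketch, not Paperless AI's actual code: the URL, token, and the example AI result are assumptions, though the PATCH endpoint and the fact that Paperless-ngx expects tag/type/correspondent IDs (not names) come from the Paperless-ngx API itself.

```python
import json
import urllib.request

PAPERLESS_URL = "http://localhost:8000"   # assumption: default Paperless-ngx port
API_TOKEN = "your-paperless-api-token"    # generated in the Paperless-ngx admin panel

def build_update(ai_result: dict) -> dict:
    """Map an AI classification result onto Paperless-ngx document fields."""
    return {
        "title": ai_result["title"],
        "tags": ai_result["tag_ids"],              # Paperless-ngx wants tag IDs, not names
        "document_type": ai_result["type_id"],
        "correspondent": ai_result["correspondent_id"],
    }

def patch_document(doc_id: int, payload: dict) -> urllib.request.Request:
    """Build the PATCH /api/documents/{id}/ request that writes metadata back."""
    req = urllib.request.Request(
        f"{PAPERLESS_URL}/api/documents/{doc_id}/",
        data=json.dumps(payload).encode(),
        method="PATCH",
    )
    req.add_header("Authorization", f"Token {API_TOKEN}")
    req.add_header("Content-Type", "application/json")
    return req  # pass to urllib.request.urlopen() against a live instance

# hypothetical AI output for a scanned gas bill
payload = build_update(
    {"title": "Gas bill 2024-11", "tag_ids": [3], "type_id": 2, "correspondent_id": 7}
)
```

The request is built but not sent here; against a running instance you would call `urllib.request.urlopen()` on the returned request object.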

Beyond the automation layer, the project added a RAG (Retrieval-Augmented Generation) chat interface that turns your document archive into something you can actually query. Instead of hunting through folders or guessing tags, you ask in plain language: “What was the amount of my last electricity bill?” or “Which documents mention my health insurance?” The system searches semantically across your full archive and returns precise, grounded answers [README].
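The retrieval step behind any RAG setup works roughly like this (a generic sketch of the technique, not Paperless AI's actual internals): embed the query and the documents as vectors, rank documents by cosine similarity, and hand the top matches to the LLM as context for the answer.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list:
    """Return ids of the k documents whose embeddings sit closest to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)[:k]
```

In a real pipeline the vectors come from an embedding model and the winning documents' text is prepended to the LLM prompt; this is why the answer is grounded in your archive rather than the model's general knowledge.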

The project is MIT-licensed with 5,470 GitHub stars as of this review. It’s maintained by a solo developer (GitHub: clusterzx) with funding through Patreon and Ko-Fi, a Discord community for support, and active Docker Hub distribution [README].


Why people choose it

No independent third-party review articles were available for this tool in the source data. What follows draws from the README, the project website, and the intent visible in the tool’s feature design.

The tagging problem is the real problem. Paperless-ngx gives you a searchable archive — but only if documents are tagged consistently. Most users start strong and abandon the habit within a month. A scanned gas bill sits untitled as something like scan_2024_11_04_0012 forever. Doing manual metadata entry for 200 backlogged documents is exactly tedious enough that people stop. Automating it is the obvious fix, and Paperless AI is currently the most adopted tool built specifically for this gap [README].

Ollama makes full privacy possible. The built-in support for Ollama (Mistral, Llama, Phi-3, Gemma-2) means the entire AI pipeline can run on your own hardware. No document content leaves your network. For users archiving medical records, financial statements, legal agreements, or anything sensitive, this isn’t a nice-to-have — it’s the requirement that makes self-hosting worthwhile in the first place [README].

Backend flexibility without lock-in. The tool supports OpenAI, Ollama, DeepSeek R1, Azure OpenAI, OpenRouter, Perplexity, Together.ai, LiteLLM, VLLM, Fastchat, and Google Gemini — plus any OpenAI API-compatible endpoint [README]. If one provider raises prices or disappears, you swap a config value. This flexibility is genuine, not marketing.
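In practice the swap is a config change, not a code change, because every supported backend speaks the same OpenAI-style chat-completions shape. A sketch of what that looks like (the Ollama and DeepSeek base URLs match their documented OpenAI-compatible endpoints; the model names are illustrative):

```python
# One request shape, many providers: only base_url and model change.
BACKENDS = {
    "openai":   {"base_url": "https://api.openai.com/v1",    "model": "gpt-4o-mini"},
    "ollama":   {"base_url": "http://localhost:11434/v1",    "model": "mistral"},
    "deepseek": {"base_url": "https://api.deepseek.com/v1",  "model": "deepseek-chat"},
}

def chat_url(backend: str) -> str:
    """Chat-completions endpoint for the configured backend."""
    return f"{BACKENDS[backend]['base_url']}/chat/completions"
```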

The RAG chat is a qualitative shift. Keyword search on scanned documents is fragile — it misses synonyms, fails on poor OCR, and returns results you still have to read. Semantic search backed by an LLM that understands document context produces answers, not lists. The README example queries (“When did I sign my rental agreement?”, “What was the amount of the last electricity bill?”) aren’t hypothetical — they’re the exact tasks that make the tool worth running [README].


Features

Based on the README and project website:

Automated document processing:

  • Polls Paperless-ngx for new documents automatically
  • Analyzes content via your configured AI backend and writes back title, tags, document type, and correspondent
  • Custom tag rules: define which documents get processed and with what parameters
  • Manual processing interface at /manual for reviewing sensitive documents before AI analysis [README]

RAG-based document chat:

  • Natural language Q&A across your full document archive
  • Semantic search — understands context, not just keywords
  • Privacy-safe: fully local with Ollama, no data leaves the network
  • Covers your entire historical archive, not just recently processed documents [README]

AI backend support (confirmed in README):

  • Ollama (Mistral, Llama, Phi-3, Gemma-2)
  • OpenAI
  • DeepSeek R1
  • Azure OpenAI
  • OpenRouter.ai, Perplexity.ai, Together.ai
  • LiteLLM, VLLM, Fastchat
  • Google Gemini
  • Any OpenAI API-compatible endpoint [README]

Web interface:

  • AI Playground for experimenting with different model configurations and prompts
  • History view showing all processing activity and AI decisions
  • Settings panel for configuration via browser rather than config files [website]

Docker deployment:

  • Single container with health monitoring and auto-restart
  • Persistent volumes for data retention across restarts
  • Minimal setup — pull, run, configure via web [README]

Pricing: SaaS vs self-hosted math

Paperless AI has no SaaS version. The cost question is entirely about your AI backend choice.

With Ollama (local LLM):

  • Paperless AI software: $0 (MIT)
  • Ollama: $0
  • Hardware requirement: Phi-3 mini runs on 4GB RAM; Llama 3 8B needs 8GB+
  • If running on an existing home server: $0/mo ongoing
  • If on a cloud VPS: $5–15/mo for an instance with enough RAM
  • Total ongoing cost: $0–15/mo

With OpenAI API (GPT-4o mini):

  • GPT-4o mini: approximately $0.15 per 1M input tokens
  • A typical scanned document: roughly 500–2,000 tokens to process
  • At 100 documents/month: under $0.05 — negligible
  • At 1,000 documents/month (heavy user): under $0.50/mo
  • Total ongoing cost: under $1/mo for most households
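The arithmetic behind those numbers, taking the pessimistic end of the token range:

```python
PRICE_PER_MTOK = 0.15   # GPT-4o mini input pricing, USD per 1M tokens (approx)
TOKENS_PER_DOC = 2000   # high end of the 500-2,000 token range per document

def monthly_cost(docs_per_month: int) -> float:
    """Estimated USD per month to classify this many documents."""
    return docs_per_month * TOKENS_PER_DOC * PRICE_PER_MTOK / 1_000_000

monthly_cost(100)    # ~ $0.03/mo
monthly_cost(1000)   # ~ $0.30/mo
```

Even at the pessimistic token count, a heavy user stays well under a dollar a month; output tokens add a little on top, but classification responses are short.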

With DeepSeek API:

  • Roughly 10–20x cheaper than OpenAI for equivalent classification tasks
  • Total ongoing cost: well under $0.50/mo for typical use

Paperless-ngx (prerequisite) self-hosted:

  • Software: $0 (GPL licensed)
  • VPS: $5–15/mo on Hetzner or similar; $0 if home server

What the commercial document management market looks like:

  • DocuWare (enterprise DMS): pricing not publicly listed, typically $300–500+/mo for small teams
  • Adobe Acrobat Pro + manual organization: $23/mo plus your time
  • Paying a virtual assistant to tag 100 documents/month: $50–150/mo

The honest comparison isn’t Paperless AI vs. a SaaS competitor — no direct SaaS alternative does exactly this for consumer/prosumer use. The real comparison is AI automation vs. your own time. If you spend two hours a month manually tagging documents, the math is obvious regardless of your billing rate.


Deployment reality check

The single most common setup surprise: Paperless-ngx must already be running and accessible via its REST API before Paperless AI does anything. If you don’t have that, you’re deploying two things, not one.

What you actually need:

  • A running Paperless-ngx instance (REST API enabled by default)
  • Docker on the same host or accessible network
  • An AI backend: Ollama running locally, or API credentials for OpenAI/DeepSeek/etc.
  • A Paperless-ngx API key (generated in the Paperless-ngx admin panel)

Docker quick start:

docker pull clusterzx/paperless-ai:latest
docker run -d \
  --name paperless-ai \
  --network bridge \
  -v paperless-ai_data:/app/data \
  -p 3000:3000 \
  --restart unless-stopped \
  clusterzx/paperless-ai

Then navigate to http://your-server:3000/setup to configure API keys and preferences [website][README].

Critical first-run note: The README explicitly flags that after completing initial setup (API keys, preferences), you must restart the container to build the RAG index. Not required for subsequent updates — only the first configuration [README].

What can go sideways:

  • Docker network isolation between Paperless-ngx and Paperless AI containers — bridge networking requires explicit configuration if containers weren’t set up together
  • Ollama connectivity — if Ollama runs on a different machine or container, the URL must be reachable from Paperless AI’s network namespace
  • Initial RAG indexing time — for large existing archives (thousands of documents), the first index build takes significant time and shouldn’t be interrupted
  • Solo developer maintenance — release cadence and bug response depend on one person’s availability; the Discord community helps, but there’s no SLA
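The first two gotchas above reduce to one question: can the Paperless AI container reach the other services at all? A quick smoke test (the hostnames and the Ollama `/api/tags` endpoint reflect common setups; substitute your own container names and ports):

```python
import urllib.request
import urllib.error

def reachable(url: str, timeout: float = 3.0) -> bool:
    """True if something answers HTTP at this URL from our network namespace."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # got an HTTP response (even 401/404): the host is reachable
    except (urllib.error.URLError, OSError):
        return False  # DNS failure, refused connection, or timeout

# Run from inside the Paperless AI container (e.g. via `docker exec`):
# reachable("http://paperless-ngx:8000/api/")   # Paperless-ngx REST API
# reachable("http://ollama:11434/api/tags")     # Ollama's model-list endpoint
```

If either probe fails, fix Docker networking (a shared user-defined network usually suffices) before touching any Paperless AI settings.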

Realistic time estimate:

  • Already running Paperless-ngx with Docker: 30–60 minutes to a working Paperless AI instance
  • Setting up Paperless-ngx from scratch first: 3–5 hours total for both systems
  • New to Docker and Linux server administration: budget a full day and follow a community guide

Pros and cons

Pros

  • MIT licensed. No commercial restrictions, no license-change risk, no vendor leverage over your deployment [README].
  • Solves the actual failure mode. Manual tagging is why most Paperless-ngx deployments decay. Automating it removes the friction that kills the habit [README].
  • Genuinely private path available. Full Ollama integration means zero data leaves your network. For medical, legal, or financial documents, this is the feature that justifies the whole setup [README].
  • Ten-plus AI backends supported. You’re not locked to any provider. Swap as prices and models change [README].
  • RAG chat is useful, not decorative. Semantic document Q&A is a real capability improvement over keyword search, particularly for documents with inconsistent OCR quality [README].
  • Browser-based configuration. Setup happens at /setup in a web interface — not by editing config files manually [website].
  • Active development. 5,470 GitHub stars, active commit history, Docker Hub distribution, and a Discord community indicate this is being maintained and used [README].

Cons

  • Requires Paperless-ngx as a prerequisite. There’s no standalone version. Evaluating this tool requires evaluating the whole Paperless-ngx stack first [README].
  • Solo developer project. The entire project depends on one person’s continued interest and availability. No company, no funded team, no guaranteed roadmap [README].
  • OCR quality is upstream. If your scans are poor, or Paperless-ngx’s OCR produces junk text, the AI classification will be poor too. Paperless AI inherits whatever quality Paperless-ngx produces.
  • RAG index restart required on first setup. A papercut, but worth knowing before you build your configuration [README].
  • No multi-user access control documented. The web interface appears single-user. Multiple people managing the same Paperless-ngx instance through Paperless AI isn’t explicitly addressed.
  • No paid support tier. GitHub Issues and Discord are your options when something breaks [README].
  • Variable API cost with commercial backends. Negligible for most users, but high-volume archives with expensive models can accumulate cost. Monitor usage.

Who should use this / who shouldn’t

Use Paperless AI if:

  • You’re running Paperless-ngx and have a backlog of untagged documents you’ve been meaning to sort for months.
  • You want automated document classification without building the pipeline yourself.
  • Privacy requirements mean your documents can’t leave your network — Ollama gives you a fully local AI stack.
  • You’re comfortable with Docker and can troubleshoot basic networking between containers.
  • You want to query your document archive in natural language without manually building a search index.

Not ready for it yet if:

  • You don’t have Paperless-ngx running. Start there — it’s the actual document management layer. Paperless AI adds nothing without it.
  • You’re expecting a turnkey solution for scanning and managing documents from zero. This is an enhancement to an existing system, not a foundation.

Wrong tool entirely if:

  • You need enterprise document management with audit trails, version control, RBAC, compliance features, and vendor SLAs. Look at DocuWare, M-Files, or similar.
  • You want a managed cloud service. No hosted version of Paperless AI exists.
  • You’re not willing to maintain a self-hosted stack through the occasional upgrade or breaking change.

Alternatives worth considering

If you want Paperless-ngx without the AI addon:

  • Paperless-ngx alone handles scanning, OCR, storage, and keyword search well. Add Paperless AI only once you’ve confirmed the base system works for you [README].

If you want an integrated document management system with AI (no separate addon):

  • OpenPaper.work — an alternative self-hosted DMS with AI tagging built in, no Paperless-ngx dependency. Less community traction but worth evaluating.
  • Teedy — GPL-licensed document management with basic tagging and full-text search, no AI layer. Simpler stack, less automation.

If you want document AI without managing a full DMS:

  • Stirling PDF — open-source PDF tools including OCR processing, but not an archive system.
  • LlamaIndex or LangChain — build a custom RAG pipeline over any document store. Maximum control, maximum work.

Commercial cloud options (for non-technical users who won’t self-host):

  • Google Drive + Gemini — zero setup, AI search built in, but every document transits Google’s servers.
  • Microsoft 365 + Copilot — same trade-off, higher cost. Powerful for teams already in the Microsoft stack.

For the target reader of this site, the honest shortlist is: Paperless AI + Ollama (full privacy, full control, near-zero cost) vs. Paperless-ngx alone (if you don’t want AI complexity) vs. Google Drive (if you’re not willing to self-host anything). The self-hosted option wins on privacy and cost at the price of setup effort.


Bottom line

Paperless AI is a well-scoped tool that does one thing correctly: it removes the manual tagging work from Paperless-ngx and hands it to an AI model you control. The RAG chat layer turns a storage archive into something you can actually interrogate. The MIT license and multi-backend AI support mean no vendor lock-in of any kind. The trade-offs are clear: it’s an addon that requires Paperless-ngx, it’s maintained by one developer, and setup requires Docker fluency and a running prerequisite system. For someone already in the Paperless-ngx ecosystem who’s tired of manual tagging or avoiding the backlog, this is the obvious next layer to add. The cost with Ollama is effectively zero. The barrier is an afternoon, not a budget.

If the afternoon of setup is the blocker, that’s exactly what unsubbed.co’s parent studio upready.dev deploys for clients. One-time fee, done, you own the infrastructure.


Sources

  1. GitHub — clusterzx/paperless-ai (README) · MIT license · 5,470 stars. https://github.com/clusterzx/paperless-ai
  2. Paperless AI — Official project website. https://clusterzx.github.io/paperless-ai
  3. Docker Hub — clusterzx/paperless-ai. https://hub.docker.com/r/clusterzx/paperless-ai
  4. Paperless AI — Installation Wiki. https://github.com/clusterzx/paperless-ai/wiki/2.-Installation
  5. Paperless-ngx — Prerequisite project (GPL). https://github.com/paperless-ngx/paperless-ngx

No independent third-party review articles for Paperless AI were available in the source data for this review. All claims above are grounded in primary sources.
