Most of my AI work used to bounce between three tabs: a web chat, a code-assist plugin, and a separate scratch app I’d built to test ideas. They all spoke to remote APIs. They were all metered. None of them remembered anything between sessions in a way I trusted.
I wanted something quieter. Local where it could be local. Cloud where it had to be. One front door from anywhere.
The shape of it
The stack ended up being three boring pieces stitched together:
- Ollama runs the models. It manages weights, exposes an HTTP API, handles the GPU plumbing, and stays out of the way. Llama 3.1 8B for fast turns, a 70B quant when I want it to actually think.
- Open WebUI is the chat interface. It points at Ollama, supports multi-model chats, has a usable RAG pipeline, and lives at a URL on my home network. It’s the closest thing to “ChatGPT but mine.”
- A small Discord bot is the away-from-home access layer. It bridges a private channel into the same Open WebUI backend so I can ping it from my phone without setting up a VPN every time.
Boring, on purpose. Each piece does one thing. None of them care if I swap the others out.
What it’s actually for
Mostly first drafts and scratch thinking. Stuff where I don’t want a turn to cost anything, and don’t want a transcript living on someone else’s server. Reading code, summarizing logs, talking through a half-formed product idea before it’s worth anyone else’s time.
For the things where the local model just isn’t sharp enough — long-context coding, architectural arguments — I still go to the frontier APIs. The local stack isn’t trying to replace those. It’s trying to replace the first turn of every conversation, which is usually the cheapest thing the smart models can do.
What surprised me
How fast it got useful. I assumed the setup would take a weekend. It took an afternoon, and most of that was downloading weights.
How much I lean on RAG over a folder of markdown notes. Open WebUI’s “knowledge” feature, pointed at my notes directory, is now the answer to half the “where did I write that thing about…” questions I used to lose 20 minutes to.
How quickly I stopped caring about benchmarks. Once the latency was tolerable and the answers were good enough for first-pass work, the size of the model stopped feeling like the point. The point was that it was there, with my context, instantly.
What I’d change
I’d skip the Discord bot if I were starting over. Tailscale on my phone gets me to the Open WebUI instance directly, no middle layer. The bot was fun to build, less fun to maintain.
I’d also reach for a smaller model first next time. The 8B class is good enough that the 70B mostly sits idle now.