apr 3, 2026 · 2 min read

Building my local AI stack

Open WebUI, Ollama, and a Discord bot — running models on my own hardware, reachable from anywhere.

My AI work used to bounce between three tabs — a web chat, a code-assist plugin, a scratch app for testing ideas. All metered, all talking to remote APIs, none of them remembering anything between sessions in a way I trusted. I wanted something quieter: local where it can be, cloud where it has to be, one front door from anywhere.

The shape of it

Three boring pieces:

Ollama runs the models — weights, HTTP API, GPU plumbing, out of the way. Llama 3.1 8B for fast turns, a 70B quant when I want it to think.
Open WebUI is the interface. Points at Ollama, multi-model chats, a usable RAG pipeline, lives on my home network. ChatGPT but mine.
A small Discord bot bridges a private channel to the same backend, so I can ping it from my phone without standing up a VPN.

Each piece does one thing and doesn’t care if I swap the others out.

What it’s for

First drafts and scratch thinking — reading code, summarizing logs, talking through a half-formed idea before it’s worth anyone’s time. Stuff where I don’t want a turn to cost anything or live on someone else’s server. For long-context coding and real architectural arguments I still go to the frontier APIs; the local stack just replaces the first turn, which is the cheapest thing the smart models do anyway.

What surprised me

It got useful fast — an afternoon, most of it downloading weights. And I lean on RAG over my notes folder constantly: Open WebUI’s “knowledge” feature pointed at that directory answers half the “where did I write that” questions I used to lose twenty minutes to.

If I started over I’d skip the Discord bot — Tailscale gets me to the instance directly — and reach for the 8B first. The 70B mostly sits idle.

← all notes