Local LLM
homelab selfhosting privacy

I’m not going to spend too much time arguing the finer points of LLM usage ethics; all of my opinions should be somewhat evident to anyone reading the whole thing. Suffice it to say I think it’s a technology like any other, and in order to understand it, for benefit and criticism, you need to study and use it. This is part of the story of how I’ve tried to further my understanding.
Rationale
- I don’t want ads on a service I pay for
- I don’t want to pay sub fees to a sector that is committed to multiple forms of the 10x engineer myth
- I don’t want to pay sub fees to a sector that works closely and dangerously with the US DoD
- I don’t want to pay sub fees to hyperscalers with unsustainable and wasteful growth models
- I want to know how to run production-shaped LLM workloads because bubble or no, I don’t think they are going away
- I want my prompts (written and audio) to be entirely private
- I want the responses to be entirely private
- I want to constrain myself to a small-scale power & emissions footprint
- I still want small-scale copy/paste and code editor completion-style workflows
- I don’t want to run any unsupervised/high-level LLM coding harness
- I occasionally want to use voice chat with intelligible TTS
Initial Forays
Self-host, and deal with any degradation or performance hits I take as the price of ticking most/all of the boxes on the above wish list.
I initially used GPT4All, which is mainly a desktop application. You can also expose the model(s) via an API for use by other GPT4All clients on the LAN. Ultimately, this is not what I stayed with, because I wanted to converge on a solution that runs entirely as a web service. Nevertheless, this was an important learning step for me: I became accustomed to the available models, quantizations, resource constraints, and quality tweaks I would need to make in order to get a viable self-hosted system. This phase lasted about a year, with some residual reliance on frontier models and commercial APIs.
Fast-forward to the present. Anthropic and OpenAI recently made the news for their Pentagon dealings. In a long overdue reaction, I finally deleted my OpenAI account, and I resolved to revisit my home setup to ensure I could get rid of Claude as well. I’m not under any illusions that Anthropic is an ethical actor just because they held a line that OpenAI was eager to cross; the logic is subtle, but I think this op-ed is in line with my thinking. The sooner I could delete my account, the better.
Permanent Solution
Since I had last evaluated software, some new entries in the “open LLM chatbot” market had emerged or matured. I was immediately impressed by Open WebUI, which is a self-hosted clone of the mainstream chatbot UX. It is capable of talking to many backends, including Ollama. And the most important development, which I had already applied to my GPT4All instance: OpenAI released new open-weight LLMs (gpt-oss) which showed skill exceeding all other models I had been using locally; enough to cut the cord completely for search aggregation-type queries.
My hardware didn’t change. I’m still running on a relatively modest GPU with Tensor Cores (GeForce RTX 4070 Ti Super, 16 GB). This is able to run gpt-oss:20b. Moreover, Open WebUI supports an offboard vector DB for supporting retrieval of personal data. GPT4All only supported local file document retrieval, requiring more custom plumbing to get RAG working properly (shared and per-user doc stores). Open WebUI seemed like a natural fit for this requirement. With document store RAG, I stood to have a feature I never leveraged on Claude or OpenAI, because I was never willing to upload anything private to either of them.
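For reference, here is a minimal sketch of wiring Open WebUI to a local Ollama backend via Docker. The port mapping, volume name, and image tag are illustrative defaults from the project’s docs, not necessarily my exact deployment:

```shell
# Sketch: run Open WebUI in a container, pointed at an Ollama server
# listening on the host's default port (11434). Adjust ports/volumes to taste.
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

The `--add-host` flag lets the container reach the host’s Ollama instance; if Ollama runs in its own container on a shared Docker network, `OLLAMA_BASE_URL` would point at that container’s name instead.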
The features I wanted basically were:
- general knowledge chat w/ web RAG
- private data RAG
- prompt-to-code syntax generation (small-scale only)
- vim-ai integration (OpenAI API completions)
- Speech-to-text/Text-to-speech (STT/TTS) interaction support with a non-robotic voice
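The vim-ai requirement only needs an OpenAI-compatible completions endpoint, and Ollama exposes one. A sketch of exercising it directly (hostname, port, and prompt are illustrative; the model name matches the coder model below):

```shell
# Sketch: hit Ollama's OpenAI-compatible chat endpoint, the same API
# surface that vim-ai and similar editor clients can be pointed at.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder:14b",
        "messages": [
          {"role": "user", "content": "Write a shell one-liner that counts lines across *.log files"}
        ]
      }'
```

Once this responds, the editor integration is just a matter of setting the plugin’s endpoint URL to the same address and using a dummy API key.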
That led directly to this architecture:
The video card can be found for ~1000 USD circa March 2026. I run it with a 750 W PSU. I’m starting by using the following models:
- gpt-oss:20b
- qwen2.5-coder:14b
These seem to be the most skillful for general chat and coding, respectively.
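Fetching both models is a one-time step with Ollama (names as published in the Ollama model library; default quantizations apply):

```shell
# Sketch: pull the two use case-specific models and confirm availability.
ollama pull gpt-oss:20b        # general chat / web RAG
ollama pull qwen2.5-coder:14b  # code completion and syntax generation
ollama list                    # both should appear in the local model list
```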
Compromises
It can be slow. Once a model is loaded, time to first token (TTFT) is very snappy and on par with commercial services. But having use case-specific models and only 16 GB of VRAM means occasional model swaps occur, and that sometimes involves frustrating latency (~25 s). Idle unloading can supposedly be controlled with Ollama configuration flags, but I still see some problems with that.
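The knobs in question are Ollama’s keep-alive and loaded-model limits. These environment variables are real, but the values here are illustrative, not a recommendation:

```shell
# Sketch: keep loaded models resident to avoid reload latency.
export OLLAMA_KEEP_ALIVE=24h       # don't unload a model after the default idle timeout
export OLLAMA_MAX_LOADED_MODELS=2  # allow more than one model resident at once
ollama serve
```

The catch on 16 GB of VRAM is that a 20B MoE model and a 14B coder model won’t both fit comfortably, so swaps still happen regardless of how long each model is kept alive.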
Running both Kokoro and STT on CPU is slower than GPU, and that introduces some lag on the chat interactions, but it’s usable.
Chat is right at the edge of useful skill when using gpt-oss:20b, especially when web RAG is used. I may need to tune this more across different parameters: search API, threads, concurrent searches, context size, etc. The MoE model does well enough though, especially when compared to Qwen2.5 or DeepSeek-r1, which aren’t good for general chat + RAG at quantizations that fit in my available VRAM.
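Context size is the parameter I expect to matter most for web RAG, since retrieved pages eat context quickly. One way to tune it in Ollama is to derive a variant model with a larger `num_ctx`; the derived name and the value 16384 are illustrative, and larger contexts cost VRAM:

```shell
# Sketch: create a RAG-oriented variant of the chat model with a bigger
# context window. Trade-off: higher num_ctx -> more VRAM per loaded model.
cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 16384
EOF
ollama create gpt-oss-ragctx -f Modelfile
```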
It’s still an OpenAI model, and “open weight” is not really “open source”.
Benefits Realized
- private prompt and response history
- private search engine use for RAG
- private data RAG possible
- voice chat works (slow)
- vim-ai workflow seamlessly swapped out and working well
- chat + web RAG workflow seamlessly swapped out, if slower and less skillful than frontier models
The long-term success of the switchover will depend on whether I can continue to survive without resorting to 3rd party services. So far, this setup is much more usable than my GPT4All setup was, and I’ve been able to wean myself off of Claude. For me this is a pretty good compromise between completely abandoning LLM use and completely ceding my privacy and autonomy to hyperscalers. “Data center repatriation” at the nano-scale seems to be a workable model, especially for GPU workloads.
It was pretty simple to set up, but if you want more details, please contact me via mast.
Postscript: How I Use Generative Models
This kind of solution might not work for you if you don’t use this tech in the same way. Here’s my user profile:
- code completion/syntax generation on the function/method scale
- LLM-assisted web search for broad questions (not fact-finding); occasionally via STT/TTS if handsfree
- I don’t generate images or music
- I don’t use automated coding harnesses
- I don’t generate copy for writing
- I don’t train my own (useful) models or do LoRA
- Experimenting with LLM behavior, jailbreaking, prompt injection
My use case is pretty limited, in other words, and hasn’t really changed much since the GPT3 era. People claiming that “LLMs turned a corner last year” genuinely confuse me. I don’t notice the change, maybe because I’m constraining my use and context in this way. But if “using them right” means using them as bulk NLP -> code compilers, email writers and decision proxies, then I don’t really want that for myself anyway. I’m happy to just age out of the industry if that’s really where it’s going. I suspect, however, that transformer tech as it exists now cannot really be truly “agentic” without additional guardrails and capabilities yet to be developed or even imagined, and that the end state of “spec-to-code” plateaus much earlier than folks are claiming.