Engineering · Software · Technology · Spring 2026

Is It Time to Self-Host LLMs Already?

Self-hosting LLMs is finally realistic for high-volume, private workloads, but not a frontier replacement. The H200 economics, quantization, and a hybrid playbook.

Jose Bastos · May 31, 2026 · 17 min read · 3,712 words · Filed under Engineering
Self-hosting GPUs: a dark data center with rows of server racks lit by red and navy light.

For the last two years, “self-hosting LLMs” has lived in a strange place.

On one side, it has been the dream of every infrastructure-minded AI team: own the stack, control the data, avoid runaway token bills, tune models to your environment, and stop routing sensitive engineering or customer data through third-party APIs. On the other side, it has often been a trap. Teams underestimate serving complexity, compare GPU rental cost to API token cost without including utilization, operations, context length, concurrency, evaluation, or model quality, and assume “open model” means “frontier replacement.” They get a model running in a notebook, then discover that production inference is a very different game.

So the real question is not whether you can self-host an LLM. Of course you can. The better question is whether self-hosting is finally realistic for serious production workloads, and the honest answer is yes, but not in the way people usually mean.

Self-hosting is now realistic for specific, high-volume, well-understood workloads: private coding agents, enterprise search, document processing, classification, summarization, internal copilots, and domain-specific automation. It works when you have enough utilization, enough infrastructure maturity, and enough discipline to route the right tasks to the right model. What it is not, yet, is a clean replacement for frontier managed models like GPT-5.5 or top-tier Claude Opus-class systems. If your expectation is “buy a few GPUs and get the best commercial AI model, privately, forever,” you will be disappointed.

A better framing is this: self-hosting is becoming a serious part of the enterprise AI stack, not the whole stack. The future is hybrid.

Part I
The State of Self-Hosted AI
Where the conversation really stands, and the mental models that decide whether teams get it right.

Why this conversation is happening now

The self-hosting conversation has changed because three things have improved at the same time. The first is memory. NVIDIA’s H200 ships 141GB of HBM3e per GPU, which quietly changes what is practical on a single accelerator: a lone H200 is no longer just a small-model box, two of them give you 282GB, and an 8× H200 node reaches 1,128GB. The second is the models themselves. Open-weight systems such as gpt-oss-120b, Qwen3-235B-A22B, the DeepSeek-R1/V3 family, and Kimi K2-class models now make it possible to build genuinely capable private systems without leaning entirely on closed APIs. They are not all equal, and they do not match frontier commercial systems on every task, but they are good enough for a great deal of valuable production work. The third is quantization, which has gone from exotic to routine. We are no longer only talking about BF16 weights; in practice most deployments now plan around Q4, Q6, Q8, FP4, FP8, or model-specific schemes, and that alone reshapes the economics.

But this is also where people start making mistakes, because a model “fitting” on a GPU is not the same as a model being production-ready on it. Weight memory is only the first constraint; you still need room for KV cache, batching, context length, framework overhead, concurrency, and your latency targets. A demo can fit. A product has to serve. So the realistic question is never “what is the largest model I can cram onto a card?” It is:

What is the best model I can serve reliably for my workload?

Why “equivalent to GPT-5” is the wrong frame

Before the hardware, fix the mental model, because it is the assumption most self-hosting plans get wrong. People love to say things like “Model X is equivalent to GPT-5,” and that claim is almost always wrong, because equivalence depends entirely on the task. A given self-hosted model might be strong at code completion, decent at RAG, weak at long-horizon planning, strong at extraction, mediocre at tool use, and poor at multimodal reasoning, all at once. That is why the only honest language is directional: a model is “GPT-5 mini-ish” or “Claude Haiku-ish” at the small end, “Sonnet-ish” in the middle, and “GPT-5 reasoning-ish” or even “Sonnet-to-Opus-ish” at the top, and even those labels deserve caution.

The most honest version is simply this: a model can play a similar role in the stack for some workloads without being the same product. Commercial frontier models are far more than raw weights. They bundle system prompts, tool behavior, safety tuning, context management, multimodal support, scaling infrastructure, reliability work, and continuous post-training, and all of that matters. It is why even an 8× H200 deployment can be strategically powerful and still not be a GPT-5.5 replacement. Keep that in mind as the numbers start to get tempting.

The real reason to self-host is not just cost

Cost is usually how the conversation starts, but it is rarely the only reason self-hosting wins. The stronger arguments are about control: control over your data, control over latency, and control over a cost line that no longer swings with someone else’s pricing. Layer on the ability to customize models to your own domain, to satisfy sovereignty and data-residency requirements, and to optimize hard for a handful of high-volume, repeatable workloads, and the case stops being about the token price at all.

Picture an engineering organization that leans on coding agents. When every agent loop goes to a frontier API, spend becomes unpredictable: some days are cheap, and some days a few large refactors or CI-debugging loops burn through enormous amounts of context and output. A self-hosted model may not crack the hardest tasks as well as GPT-5.5 or Claude Opus, but it is often perfectly adequate for the work that dominates the day, from explaining CI failures and summarizing diffs to generating first-pass tests, writing migration notes, answering questions over internal docs, drafting boilerplate, triaging logs, classifying support issues, extracting structured data, and routing tickets. When tasks like these run thousands or millions of times a month, self-hosting becomes less about ideology and more about unit economics.

Even then, the metric that matters is not cost per token. It is cost per successful outcome. For coding that might mean cost per accepted PR, per passing test, or per human hour saved; for support, cost per correctly resolved ticket; for enterprise search, cost per grounded answer at acceptable latency. The reason is simple, and the chart below makes it concrete: cheap output that needs rework is expensive, while expensive output that lands first time can be the cheapest thing you buy.

[Figure 01: Cost per accepted outcome. The cheapest token is not the cheapest outcome. Compares token cost and human-fix cost for three setups: frontier-only, cheap-only, and hybrid (routed).]
Fig. 01 · A cheap model that gets reworked can cost more per accepted change than a frontier model that lands first time. Optimize the total, not the token price.
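
The cost-per-outcome argument can be put in a few lines of arithmetic. This is a minimal sketch rather than a costing model: the `Path` class, its field names, and every dollar figure and acceptance rate below are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One way of handling a task. Every figure here is an illustrative assumption."""
    name: str
    model_cost: float      # USD of tokens or amortized GPU time per attempt
    accept_rate: float     # share of attempts accepted without rework (0..1)
    human_fix_cost: float  # USD of engineer time when an attempt needs rework


def cost_per_accepted_outcome(path: Path) -> float:
    """Expected total cost of getting one accepted result.

    Each task pays the model cost once; the share of attempts that is not
    accepted first time additionally pays for a human fix.
    """
    rework_rate = 1.0 - path.accept_rate
    return path.model_cost + rework_rate * path.human_fix_cost


# Hypothetical inputs: the cheap path costs far less per attempt but is reworked more often.
frontier = Path("frontier-only", model_cost=0.60, accept_rate=0.90, human_fix_cost=25.0)
cheap = Path("cheap-only", model_cost=0.05, accept_rate=0.60, human_fix_cost=25.0)

for p in (frontier, cheap):
    print(f"{p.name}: ${cost_per_accepted_outcome(p):.2f} per accepted outcome")
```

With these made-up inputs the cheap path comes out roughly three times more expensive per accepted outcome, even though its token cost is an order of magnitude lower. The conclusion flips as soon as the cheap model's acceptance rate approaches the frontier model's, which is exactly why the acceptance rate has to be measured rather than assumed.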

Part II
Hardware and Economics
What the GPUs cost, what they can actually serve, and the assumptions that quietly decide the answer.

The economics: buy vs rent vs API

Start with money, because the hardware specs only matter once you know how you are paying for them, and the economics of self-hosting are unusually easy to manipulate. Buying hardware looks cheap if you only show depreciation. Renting cloud GPUs looks expensive if you assume continuous use. APIs look expensive if you model heavy usage, and cheap if you ignore data control and high-volume workflows. Self-hosting looks great right up until you add staff cost. Every one of those framings is true as far as it goes and misleading on its own.

A fair comparison cannot stop at a single number. It has to add up four separate buckets, and most pitches quietly drop at least one of them.

[Figure 02: What a fair comparison counts. Four buckets, not one number: API cost, cloud GPU cost, owned hardware cost, and operational cost.]
Fig. 02 · Buy-versus-rent-versus-API is only honest when all four buckets are on the table. Most comparisons show one and hide the rest.

API cost is more than the token price: it folds in the model tier you choose, how much prompt caching and batching you actually capture, and the retries, failed attempts, and tool-call overhead that real agents generate. Cloud GPU cost layers the hourly rate, reserved capacity, and utilization on top of storage, networking, orchestration, and availability. Owned hardware trades the hourly rate for depreciation but drags in power, colocation, networking, support, staff, spare capacity, hardware failures, and opportunity cost. And operational cost sits over all of it, from evaluation and routing to monitoring, security, model upgrades, incident response, and the developer experience itself. The self-hosting pitch only becomes credible once it counts all four.

It helps to reduce each side to a single line you can actually fill in. For the API side:

Monthly API cost = users × workdays per month × daily tokens per user × price per token

And for the self-hosted side:

Monthly self-hosting cost = depreciation or rental + power/colo + platform ops + fallback API spend + staff overhead

Then compare the two against outcomes, not tokens.
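
The two one-line formulas above translate directly into code. The sketch below assumes a single blended price per million tokens (real pricing separates input, output, and cached tokens), and every number in the example call is a placeholder to be replaced with your own measurements.

```python
def monthly_api_cost(users: int, workdays: int,
                     tokens_per_user_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """users x workdays x daily token usage x model price (price quoted per 1M tokens)."""
    total_tokens = users * workdays * tokens_per_user_per_day
    return total_tokens / 1_000_000 * usd_per_million_tokens


def monthly_self_hosting_cost(depreciation_or_rental: float, power_colo: float,
                              platform_ops: float, fallback_api_spend: float,
                              staff_overhead: float) -> float:
    """Sum of the self-hosting buckets, all in USD per month."""
    return (depreciation_or_rental + power_colo + platform_ops
            + fallback_api_spend + staff_overhead)


# Placeholder inputs, not benchmarks: 100 users, 21 workdays, 2M tokens per user per day.
api = monthly_api_cost(users=100, workdays=21,
                       tokens_per_user_per_day=2_000_000,
                       usd_per_million_tokens=5.0)
self_hosted = monthly_self_hosting_cost(depreciation_or_rental=2_500, power_colo=800,
                                        platform_ops=3_000, fallback_api_spend=4_000,
                                        staff_overhead=8_000)

print(f"API: ${api:,.0f}/mo, self-hosted: ${self_hosted:,.0f}/mo")
```

Note that fallback API spend sits inside the self-hosting total: a hybrid deployment still pays a frontier bill for whatever it escalates, and staff overhead is usually the largest line, not the hardware.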

The quantization assumption matters more than people think

Many self-hosting arguments mislead because they hide the quantization assumption. The moment someone says “you can run this model on one GPU,” the right reply is a question:

At what precision, with what context length, and with what concurrency?

For planning purposes there are three rough tiers, and they answer three different questions.

| Quantization tier | How to think about it | Practical meaning |
|---|---|---|
| Q4 / FP4 | Aggressive and cost-efficient | “Can we make it fit?” |
| Q6 | Practical production compromise | “Can we run it well?” |
| Q8 / FP8 | Quality-first / near-lossless | “Can we reduce quantization compromise?” |

Q4 or FP4 is genuinely the right answer for some models and workloads. OpenAI’s gpt-oss-120b, for instance, was designed around MXFP4-style quantized MoE weights and runs on a single 80GB-class GPU, which is a strong signal that one H200 can serve it comfortably. But Q4 is not Q6 or Q8, and the difference is the whole game once you remember that weights are only part of the budget.

[Figure 03: One GPU’s memory budget. “Fits” is not “serves”. Demo: weights plus unused memory. Production: weights, KV cache, batching, and headroom.]
Fig. 03 · A demo only has to load weights. A product pays for KV cache, batching, concurrency, and latency headroom out of the same memory budget.
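
A back-of-the-envelope version of that memory budget is sketched below. It is deliberately simplified: weight memory is taken as parameters times bits per weight (ignoring quantization scales and any layers kept at higher precision), and the KV cache uses the standard-attention formula, which overstates models that use compressed or sliding-window attention. The layer count, KV-head count, head size, and 10GB overhead are hypothetical values, not the configuration of any model named in this article.

```python
GB = 1e9  # decimal gigabytes, the unit HBM capacity is quoted in


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_value: int = 2) -> float:
    """KV cache for standard attention: one key and one value vector per layer per token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_seqs / GB


def serving_budget(capacity_gb: float, weights: float, kv_cache: float,
                   overhead_gb: float = 10.0) -> dict:
    """Add up weights, KV cache, and an assumed framework overhead against capacity."""
    used = weights + kv_cache + overhead_gb
    return {"used_gb": used, "free_gb": capacity_gb - used, "fits": used <= capacity_gb}


# Hypothetical 120B-parameter model at 4 bits on one 141GB GPU.
weights = weight_gb(params_billion=120, bits_per_weight=4)
for seqs in (16, 64):
    kv = kv_cache_gb(layers=64, kv_heads=8, head_dim=128,
                     context_tokens=8192, concurrent_seqs=seqs)
    print(seqs, "concurrent sequences:", serving_budget(141, weights, kv))
```

Under these assumptions the same weights that load with room to spare at 16 concurrent sequences no longer fit at 64, which is the difference between the demo bar and the production bar in the figure.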

For enterprise production, especially coding, reasoning, or high-stakes workflows, Q6/Q8 is the more credible default, and it tells a more conservative and more useful story than the aggressive Q4 pitch. On that footing, a single H200 is a clean fit for gpt-oss-120b, two H200s make Qwen3-235B-A22B a believable general-purpose deployment, and eight H200s bring DeepSeek-R1/V3 671B-class and Kimi K2-class models into realistic reach. That is the version I would trust for planning, and it is the ladder the rest of this section walks.

The H200 self-hosting ladder

To make it concrete, picture three rungs: a single H200, a pair of them, and a full eight-GPU node. It is not the only way to think about GPU infrastructure, but it gives a useful ladder from a small private deployment to a serious open-model platform. For cost, treat the buy option as a four-year asset depreciation rather than a subscription, because owning hardware is not “paying monthly” the way a cloud bill is; you are buying an asset and amortizing it. The figures below also exclude power, colocation, networking, support, staffing, repairs, financing, and utilization risk, all of which are real and belong in a full TCO model, but separating hardware depreciation from operational overhead keeps the baseline honest.

| H200 setup | Buy hardware estimate | 4-year depreciation | AWS-equivalent monthly cost | Practical model tier |
|---|---|---|---|---|
| 1× H200 | $45k-$65k | $938-$1,354/mo | ~$3,632/mo pro-rated | gpt-oss-120b |
| 2× H200 | $85k-$120k | $1,771-$2,500/mo | ~$7,264/mo pro-rated | Qwen3-235B-A22B |
| 8× H200 | $310k-$400k OEM/HGX or $450k-$600k DGX-style | $6,458-$8,333/mo OEM or $9,375-$12,500/mo DGX | ~$29,054/mo | DeepSeek-R1/V3 671B-class, Kimi K2-class |

[Figure 04: GPU memory, to scale. What actually fits on each rung: 1× H200 (141 GB), gpt-oss-120b; 2× H200 (282 GB), Qwen3-235B-A22B; 8× H200 (1,128 GB), DeepSeek-R1/V3 and Kimi K2.]
Fig. 04 · HBM scales fast, but memory only decides what fits. What you can serve reliably is a different question, taken rung by rung below.
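
The depreciation and pro-rated columns in the table are simple arithmetic, and it is worth being able to rerun them with your own quotes. The sketch below assumes straight-line depreciation over 48 months and pro-rates the table's 8-GPU AWS figure by GPU count; the function names are illustrative.

```python
def monthly_depreciation(purchase_price_usd: float, years: int = 4) -> float:
    """Straight-line depreciation: purchase price spread evenly over the asset's life."""
    return purchase_price_usd / (years * 12)


def prorated_monthly(node_monthly_usd: float, gpus_used: int,
                     gpus_per_node: int = 8) -> float:
    """Share of an 8-GPU node's monthly cost attributed to the GPUs actually used."""
    return node_monthly_usd * gpus_used / gpus_per_node


AWS_8X_H200_MONTHLY = 29_054  # the table's estimate for a full 8x H200 node

setups = [("1x H200", 1, 45_000, 65_000),
          ("2x H200", 2, 85_000, 120_000),
          ("8x H200 OEM/HGX", 8, 310_000, 400_000)]

for label, gpus, low, high in setups:
    print(f"{label}: ${monthly_depreciation(low):,.0f}-${monthly_depreciation(high):,.0f}/mo "
          f"owned vs ~${prorated_monthly(AWS_8X_H200_MONTHLY, gpus):,.0f}/mo cloud-equivalent")
```

These figures match the table to rounding, and they carry the same exclusions: no power, colocation, staffing, or utilization risk on the owned side.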

There is an important AWS caveat, too: H200 capacity there is naturally an eight-GPU node story, since P5e/P5en instances expose up to 8× H200, so the 1× and 2× monthly figures are pro-rated equivalents rather than cleanly rentable shapes. That nuance reshapes the buy-versus-rent decision. If you need one or two GPUs running constantly, owned hardware looks attractive on depreciation; if your demand is bursty, managed cloud usually wins; and if you need guaranteed H200 cloud capacity, you may be renting eight-GPU blocks whether or not your workload needs all eight. Through all of it, the dominant variable is not the sticker price but utilization. A GPU running hot on valuable work is compelling; a GPU sitting idle is just expensive furniture.
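
To see why utilization dominates, divide the monthly cost by the hours the GPU actually spends on useful work. This is a rough sketch: the 730-hour month is an average, the $1,354 figure is the top of the 1× H200 depreciation range above, and the utilization levels are hypothetical.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def cost_per_busy_gpu_hour(monthly_cost_usd: float, gpus: int,
                           utilization: float) -> float:
    """Monthly cost divided by the GPU-hours spent on useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return monthly_cost_usd / (gpus * HOURS_PER_MONTH * utilization)


# Same owned GPU, same depreciation, different amounts of real work.
for utilization in (0.10, 0.40, 0.80):
    hourly = cost_per_busy_gpu_hour(1_354, gpus=1, utilization=utilization)
    print(f"{utilization:.0%} utilized: ${hourly:.2f} per busy GPU-hour")
```

The hardware bill is identical in all three cases; only the denominator moves. Operational costs are excluded here, and adding them widens the gap further because they do not shrink when the GPU sits idle.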

What you get with 1× H200

A single H200 is now a meaningful inference box, and the model to anchor around is gpt-oss-120b. It is not a GPT-5.5 replacement and not a top Claude Opus replacement, but it can be a strong private model across a wide range of internal tasks. Think of this tier as the home for private summarization, classification, and extraction, for lightweight coding help and internal Q&A, for RAG over a controlled corpus, and for structured workflow automation and narrow agents with bounded scope.

Commercially, that maps most closely to GPT-5 mini-ish or Claude Haiku-ish behavior depending on the workload, which is not a claim of identical quality so much as a similar role in the stack: fast, economical, useful for many tasks, but not the model you reach for on the hardest reasoning. A one-H200 setup shines when you have enough internal volume to keep it busy, when data control matters, or when you need predictable cost for repetitive work. Its limits are just as clear: it is not where you put your most complex coding agent, not where long-horizon reasoning will match a frontier API, and not where you prove your company can replace every commercial model. It is where you start building a private inference layer.

What you get with 2× H200

Two H200s are where self-hosting starts to feel genuinely serious. At 282GB of combined HBM you can move into models like Qwen3-235B-A22B under realistic quantization, a different category from running a small local model, and one that supports stronger coding, multilingual work, reasoning, tool use, and enterprise automation. Directionally this lands around GPT-5-ish or Claude Sonnet-ish performance for many tasks, not GPT-5.5 frontier performance, and that distinction matters.

For a great deal of internal enterprise work, “Sonnet-ish” is already very valuable. Most companies do not need GPT-5.5 for every PR summary, CI explanation, extraction job, test-generation task, support triage, or internal knowledge query; they need reliable-enough output at predictable cost with strong privacy and observability, and two H200s can be a powerful foundation for exactly that. The workload still decides, though. Huge monorepo coding agents with long contexts, frequent retries, test execution, and heavy tool use can outgrow two GPUs on latency and concurrency, whereas more bounded work like coding assistance, RAG, analysis, and structured transformations fits well. This is also where routing becomes essential: the 2× H200 deployment should not try to do everything, but rather take the tasks it is good at and escalate the rest.

What you get with 8× H200

Eight H200s change the conversation again. At 1,128GB of HBM you can seriously consider DeepSeek-R1/V3 671B-class models and Kimi K2-class models. That is no longer hobby territory but serious open-model infrastructure: enough to stand up private coding platforms and internal agent systems, run domain-specific reasoning and high-volume enterprise inference, keep sensitive-data workloads in house, and support model customization, controlled deployment, and large-scale RAG and tool-using assistants. Directionally the comparison is GPT-5 reasoning-ish or Claude Sonnet-to-Opus-ish for selected workloads, but here the caveat needs to be loud:

8× H200 does not automatically mean GPT-5.5 parity.

Frontier managed models are not just weights; they include post-training, tool integrations, safety systems, product infrastructure, context engineering, multimodal handling, routing, eval infrastructure, fast iteration, and an enormous amount of operational polish. An 8× H200 open-model deployment can still be extremely useful and strategically important, and it can beat commercial APIs on privacy, control, cost predictability, customization, and specific domain workflows. But if the question is whether it replaces GPT-5.5 on the hardest general-purpose tasks, the answer is usually no. It gives you serious open-model performance; it does not buy you automatic frontier parity.

Part III
The Hybrid Architecture
The pattern almost every serious org converges on, and how the routing actually works.

The hybrid stack is the obvious answer

The best AI stacks will be neither purely self-hosted nor purely API-based. They will be hybrid, and in practice a sensible enterprise architecture sorts workloads by what each actually needs.

| Workload | Default model path |
|---|---|
| PR summaries | local/open model |
| CI failure explanation | local/open model first, frontier fallback |
| Unit test generation | local/open model, escalate on repeated failures |
| Internal documentation Q&A | local/open model with RAG |
| Codebase-wide refactor | GPT-5.5 / Claude Opus-class |
| Security-sensitive review | frontier model plus human review |
| Architecture planning | frontier model |
| High-volume extraction | local/open model |
| Regulated data processing | self-hosted or private deployment |

The pattern behind that table is the whole point: use the frontier models where they matter, and stop wasting frontier tokens where they do not. The piece of software that makes it real is a routing layer, which becomes the control point of the entire stack.

[Figure 05: The routing layer. Cheap by default, frontier on demand. Incoming AI and coding tasks pass through a router that weighs type, sensitivity, cost, context, and expected difficulty. Self-hosted open models take PR summaries, docs, tests, extraction, CI, and RAG; the frontier API takes refactors, security, architecture, and incidents, plus anything escalated.]
Fig. 05 · The router classifies each task and sends it down the cheapest path that still clears the quality and risk bar, escalating to a frontier model only when value or sensitivity demands it.

A router earns its place by deciding which model handles which task. To do that well it has to know what kind of task it is looking at, whether the data is sensitive, how much context is required, what each path will cost, whether latency matters, how expensive failure would be, how the local model has performed on similar work before, and when and where to fall back. That, rather than any argument about whether open or closed models are “better,” is where the enterprise value lives: in making the routing decision explicit.
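
A first version of such a router can be a plain rule table rather than anything learned. The sketch below is one possible shape under stated assumptions: the `Task` fields, the task-kind names, the 32,000-token local context limit, and the difficulty and success-rate thresholds are all hypothetical and would come from your own measurements.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "self-hosted open model"
    FRONTIER = "frontier API"


@dataclass
class Task:
    kind: str                          # e.g. "pr_summary", "architecture_planning"
    regulated_data: bool = False       # data that must not leave your environment
    context_tokens: int = 0
    expected_difficulty: float = 0.0   # 0..1, from a hypothetical upstream classifier
    local_success_rate: float = 1.0    # measured history of the local tier on this kind


# Assumed policy values; tune them against your own evaluation data.
FRONTIER_KINDS = {"codebase_refactor", "architecture_planning", "security_review"}
LOCAL_CONTEXT_LIMIT = 32_000
MAX_LOCAL_DIFFICULTY = 0.7
MIN_LOCAL_SUCCESS_RATE = 0.8


def route(task: Task) -> Route:
    """Send each task down the cheapest path that still clears the quality and risk bar."""
    if task.regulated_data:
        return Route.LOCAL            # data control overrides every other rule
    if task.kind in FRONTIER_KINDS:
        return Route.FRONTIER         # high-value work goes straight to the frontier
    if task.context_tokens > LOCAL_CONTEXT_LIMIT:
        return Route.FRONTIER         # too much context for the local deployment
    if (task.expected_difficulty > MAX_LOCAL_DIFFICULTY
            or task.local_success_rate < MIN_LOCAL_SUCCESS_RATE):
        return Route.FRONTIER         # the local tier is unlikely to land it
    return Route.LOCAL


print(route(Task(kind="pr_summary", context_tokens=4_000)))
print(route(Task(kind="architecture_planning")))
print(route(Task(kind="extraction", regulated_data=True, expected_difficulty=0.9)))
```

The ordering of the rules is itself a policy decision. Here regulated data is pinned to the self-hosted path even when the task looks hard, which trades some quality for data control; a different organization might instead route those tasks to a human.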

A concrete routing example for coding agents

Take a software team that leans on AI coding tools. The naive setup sends everything to the strongest model, which feels great until the bill arrives. A cost-aware setup routes by task and only escalates when a clear trigger fires.

| Task | Default | Escalate when |
|---|---|---|
| PR summary | local model | high-risk subsystem or security-sensitive change |
| Test generation | local model | tests fail twice or coverage misses target |
| CI failure explanation | local model | failure spans multiple services |
| Dependency upgrade notes | local model | breaking API surface detected |
| Code review comments | Sonnet/GPT-5 tier | auth, crypto, payments, infra |
| Architecture refactor | GPT-5.5 / Opus-class | always high-value |
| Production incident analysis | frontier model + human | always high-risk |

This is how you cut cost without cutting quality: you never ask a cheaper model to do everything, only the work it can do reliably, and you let explicit triggers escalate the rest. Then you measure, because the routing only stays honest if you watch the right numbers. Task success rate and fallback rate tell you whether the local tier is holding; cost per accepted change and tests passing after an agent edit tell you whether cheap output is actually cheap; and human review time saved, the rate of bad diffs, latency, user satisfaction, and production-incident impact tell you whether any of it is worth doing. Once those are instrumented, self-hosting stops being a philosophical debate and becomes an engineering and finance decision.

Part IV
The Enterprise Playbook
How to decide, how to start, and how to know it is working.

Where self-hosting succeeds, and where it fails

Self-hosting is not a yes-or-no question; it succeeds under specific conditions and fails under others, and the work is knowing which side of the line you are on before you spend anything.

When it makes sense

Several conditions tend to line up when self-hosting pays. The foundation is high, predictable usage, because if demand is low or bursty an API is almost always simpler and cheaper. On top of that you want repeatable workloads, since the approach rewards bounded, measurable tasks like extraction, summarization, coding support, RAG, and classification far more than open-ended exploration. It helps enormously if your quality bar is task-specific rather than “the best model on Earth,” because if GPT-5.5 quality is required everywhere, local models will disappoint. Data control can justify the whole exercise on its own, since sensitive code, customer data, and regulated or sovereign workloads can tip the decision even when the raw token math is a wash.

The rest are about whether you can actually run it. You need infrastructure maturity, because someone has to own serving, monitoring, scaling, security, upgrades, quantization, evaluation, and incident response. You need enough volume to keep the GPUs busy, since utilization is precisely what separates a sound investment from an expensive science project. And you need the willingness to evaluate models on your own tasks, because public benchmarks are useful but your workloads are what truly decide.

When it does not, yet

It is just as important to recognize when not to. Self-hosting is probably the wrong first move if you have fewer than about 25 active AI users, if you cannot yet measure your token usage, or if your workloads are still experimental. It is wrong if your team cannot operate GPU infrastructure, if you need the best available model for nearly every task, or if you depend on the multimodal polish of a commercial product. And it is wrong if your usage is highly bursty, if you cannot build or buy observability and evaluation tooling, or if the savings case quietly rests on optimistic utilization and a comparison of hardware depreciation to API spend that leaves out staff and operations.

The most common mistake of all is building infrastructure before understanding the shape of the workload. Before buying a single GPU, you should be able to answer the basics: how many AI tasks you run per day, what your top ten task categories are, and how many input and output tokens each consumes. You should know which tasks genuinely demand frontier quality and which tolerate a cheaper model, what your fallback rate and cost per successful result look like, what latency users expect, how much context you truly need, and who will own and operate the platform. If those answers are not on hand, self-hosting is premature.

What an enterprise self-hosting rollout should look like

So do not start by buying a rack. Start by measuring. A productive first month looks less like a procurement project and more like an instrumentation exercise, and it falls naturally into four weeks.

[Figure 06: The first 30 days. Measure before you buy. Week 1, instrument: track every tool; log tokens, model, task, latency, cost. Week 2, classify: group into 6-10 categories; flag low-risk volume. Week 3, evaluate: replay ~100 tasks across 3-5 models; score cost and quality. Week 4, route: ship low-risk routes only; keep frontier fallback.]
Fig. 06 · Thirty days of measurement beats any benchmark. Instrument, classify, evaluate, then route only the lowest-risk work, keeping a frontier fallback throughout.

The first week is pure instrumentation, and it is worth being exhaustive about it: every coding tool, chat tool, raw API key, CI bot, support workflow, internal assistant, and document pipeline, each logged with its tokens, model, task type, team, latency, and cost. From there the work compounds. Classifying that traffic into six to ten categories reveals which are high-volume and low-risk; replaying around a hundred real tasks across three to five models, from a frontier API down to a quantized local candidate, shows where a cheaper model holds up and where it falls apart; and only the lowest-risk categories get routed first, always with a frontier fallback behind them. After thirty days you will know far more than any benchmark could tell you, and only then should expanding into self-hosted infrastructure be on the table.
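
The week-one log and the week-two classification can start as something this small. The record fields mirror the ones listed above (tool, model, task type, team, tokens, latency, cost); the class name, function name, and sample rows are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    tool: str
    model: str
    task_type: str
    team: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def summarize_by_task_type(records: list) -> dict:
    """Total calls, tokens, and spend per task type, busiest category first."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})
    for r in records:
        row = totals[r.task_type]
        row["calls"] += 1
        row["tokens"] += r.input_tokens + r.output_tokens
        row["cost_usd"] += r.cost_usd
    return dict(sorted(totals.items(), key=lambda kv: kv[1]["calls"], reverse=True))


# Made-up sample rows; in practice these come from a gateway or proxy log.
records = [
    UsageRecord("ide-assistant", "frontier-a", "pr_summary", "payments", 3_000, 400, 2_100.0, 0.02),
    UsageRecord("ci-bot", "frontier-a", "ci_explanation", "platform", 12_000, 800, 5_300.0, 0.07),
    UsageRecord("ide-assistant", "frontier-a", "pr_summary", "search", 2_500, 350, 1_900.0, 0.02),
]

for task_type, row in summarize_by_task_type(records).items():
    print(task_type, row)
```

Once every call carries a task type, the high-volume, low-risk categories fall out of a simple sort, and the same records become the replay set for the week-three evaluation.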

The enterprise self-hosting playbook

All of which reduces to a single decision with three possible postures. Match your situation to the closest column, start there, and revisit as your usage grows.

Cheat sheet
Choose API-first
  • usage is low or unpredictable
  • you need frontier quality immediately
  • you have no infrastructure staff
  • workflows are still experimental
  • latency and data residency are not constraints
  • quality matters more than cost
Choose hybrid
  • you already have meaningful AI usage
  • some workloads are repetitive
  • some tasks require frontier quality
  • cost is becoming material
  • data control matters for some workflows
  • you can build a routing layer
Choose self-hosting-first
  • usage is high and predictable
  • privacy or sovereignty is a core requirement
  • workloads are bounded and measurable
  • open-model quality is sufficient
  • you can keep GPUs utilized
  • you own MLOps / platform
  • you optimize for cost per outcome

Most serious organizations will end up in the hybrid column, and that is not a compromise. It is the right architecture.

So, is it time?

If the question means “is it time to include self-hosted LLMs in the enterprise AI architecture?”, the answer is an unqualified yes. If it means “is it time to replace every frontier API with GPUs in your own rack?”, the answer is not yet, and probably not for everyone.

The opportunity is real because open models are now good enough for many valuable tasks, H200-class hardware has the memory to serve meaningful models, and quantization has made deployment far more practical. But the winning strategy is not to self-host everything. It is to measure your workloads first, route routine tasks to cheaper models, self-host the repeatable, high-volume, privacy-sensitive work, keep GPT-5.5 and Claude Opus-class models for the hardest problems, and judge all of it by cost per successful outcome rather than cost per token.

One H200 can run a useful private model tier, two can support a serious internal open-model deployment, and eight can deliver serious open-model performance, but frontier is still frontier. The real shift is not that self-hosting has beaten managed AI; it is that self-hosting has become good enough to deserve a seat in the architecture. For many companies, that is the milestone that matters. The question is no longer whether self-hosting is possible. It is which workloads deserve it.

TensorOps · Blog · 2026