
Self-Hosting Large Language Models in China: Limitations and Possibilities


While cloud vendors are ramping up their supply of AI models and GPU hardware in China, companies that are required to self-host these models face challenges of their own. Even in the West, hosting models on GPUs is a complex and costly endeavor; in China, export limitations and regulatory restrictions make it even more difficult.


In this article, we explore the landscape of self-hosted large language models (LLMs) in China for commercial inference. We examine the hurdles companies encounter—ranging from limited access to high-end server-grade GPUs to the technical intricacies of deploying state-of-the-art models like Gemma 3 and LLaMA 4 on multi-GPU configurations. Additionally, we assess the performance impacts of using several lower-end GPUs in parallel, the role of advanced quantization techniques, and the challenges posed by constrained GPU memory and TFLOP limits.


While the regulatory context is only briefly touched upon, it remains an important factor in understanding the overall ecosystem. This preview provides a snapshot of the current state of the art, helping businesses navigate the complexities of self-hosting AI models in a rapidly evolving market.


Hardware Considerations

GPU Options for LLM Inference in China

When self-hosting large language models, the choice of GPU hardware is critical. In China, access to top-tier GPUs is complicated by export restrictions, but there are both server-grade and consumer-grade options available. Table 1 compares several representative GPUs for LLM inference, including NVIDIA’s data-center GPUs (and their China-specific variants) and domestic Chinese accelerators, alongside high-end consumer GPUs that can be used for local deployment:

| GPU Model | Memory | Compute (FP16/BF16) | Interconnect BW | Availability / Notes |
|---|---|---|---|---|
| NVIDIA A100 (80GB) | 80 GB HBM2e | ~312 TFLOPS (FP16, no sparsity) | NVLink ~600 GB/s | High-end server GPU for AI (export-restricted to China) |
| NVIDIA A800 (80GB) | 80 GB HBM2e | ~218 TFLOPS (≈70% of A100) | NVLink ~400 GB/s | China-specific A100 variant (meets U.S. export rules) |
| Biren BR100 | 64 GB HBM2e | ~1000 TFLOPS (FP16) | PCIe 5.0 / CXL (multi-die) | Chinese GPGPU, slightly faster than A100 on paper |
| NVIDIA RTX 4090 | 24 GB GDDR6X | ~66 TFLOPS (FP16) | PCIe 4.0 ~32 GB/s | Consumer GPU, widely available (no export restrictions) |

Table 1: Key GPU options for self-hosting LLMs in China, comparing memory, compute throughput, and interconnect bandwidth. Data-center GPUs like the A100 are powerful but face export controls, whereas consumer GPUs offer local availability with less memory and bandwidth.


Server-Grade GPUs: NVIDIA’s A100 (and newer H100) GPUs have been the workhorses for training and deploying LLMs. In China, however, direct access to these is limited by U.S. export regulations. NVIDIA introduced the A800 (Ampere architecture) as a China-tailored version of the A100. The A800 delivers roughly 70% of the performance of a standard A100; its reduced chip-to-chip communication bandwidth (400 GB/s vs. 600 GB/s on the A100) is designed to comply with export limits. This change mainly impacts large multi-GPU deployments — as one analyst noted, the A800’s interconnect slowdown is a “clear performance downgrade for a data center where thousands of chips are used together”. NVIDIA likewise offers an H100 variant (the H800) for China, with scaled-down interconnects and likely a moderate performance reduction versus the H100. As of late 2023, export rules tightened further, reportedly banning even some of these cut-down models, prompting NVIDIA to plan new variants in 2024 that “straddle the line” of the regulations.


Domestic Alternatives: Chinese companies have been developing their own high-performance chips to reduce reliance on NVIDIA. For example, Biren Technology’s BR100 GPU is a multi-die accelerator with 64 GB of HBM2e memory. It delivers on the order of 1000 TFLOPS (FP16), slightly outperforming NVIDIA’s A100 on paper​. A cut-down BR104 model (32 GB) also exists for lower power usage​. Likewise, Huawei’s Ascend series AI accelerators have improved: the latest Ascend 910B (7nm) is reported to offer performance between the A100 and H100 GPUs​, and has been adopted by some Chinese cloud providers. These domestic chips can handle LLM workloads, though they often require specialized software stacks (e.g. Huawei MindSpore for Ascend), and their ecosystem is less mature than NVIDIA’s CUDA.


Consumer-Grade GPUs: In the absence of unlimited access to server GPUs, many practitioners in China turn to high-end consumer GPUs (like the NVIDIA GeForce RTX 3090/4090 or professional RTX A6000 cards) to run LLMs locally. These GPUs are widely available on the retail market and not subject to the same export bans. A single RTX 4090, for instance, provides 24 GB of VRAM and substantial compute power (roughly 2× a 3090’s TFLOPS), making it capable of serving smaller models (up to ~13B parameters) or quantized larger models. However, consumer cards have limitations: notably less memory than data-center GPUs and much lower multi-GPU communication bandwidth (standard PCIe instead of an NVLink/NVSwitch fabric). They also typically lack the server-grade reliability and cooling optimizations needed for 24/7 heavy workloads. Still, for many smaller companies and research groups, assembling a multi-GPU workstation with consumer cards is a cost-effective way to self-host an LLM, as discussed further below.


Impact of Regulations and GPU Availability

Although a full policy analysis is outside our scope, it’s important to note current restrictions at a high level. Since 2022, U.S. export controls have limited the sale of cutting-edge AI chips to China, citing national security. This initially barred GPUs like the A100/H100 and AMD’s MI250 from the Chinese market. NVIDIA’s response was the A800/H800 series, which lowered interconnect speeds to stay just under the threshold (e.g. 400 GB/s on A800 vs the 600 GB/s limit). These variants allowed Chinese companies to buy modern NVIDIA silicon for AI, albeit with a performance hit in large-scale distributed training/inference. By 2023, further tightening of rules aimed to close loopholes, and reports emerged of planned bans on even A800/H800-class products​. NVIDIA has allegedly developed next-gen “China-specific” GPUs for 2024 that meet the letter of the law while still improving performance.


For commercial inference of LLMs within China, the practical effect is that organizations must either:

  • Use approved hardware (like A800s or domestic GPUs), possibly needing more chips to achieve the same throughput as unrestricted hardware;

  • Rely on consumer GPUs (which are not banned, but also not as scalable for big models); or

  • Utilize cloud services in China that offer compliant AI accelerators.


The GPU supply crunch has in fact led to massive demand for NVIDIA’s allowed chips – Chinese tech firms reportedly spent billions to stockpile A800/H800 GPUs in 2023. Meanwhile, domestic GPU projects (Biren, Huawei, Alibaba’s T-Head, etc.) are racing to catch up so that viable home-grown options exist. In summary, high-end GPUs are available in China but in limited forms, and one may need to be flexible in mixing hardware or optimizing models to work within memory/compute constraints.


Multi-GPU Inference with Limited VRAM GPUs

Because large language models often exceed the memory of a single consumer-grade GPU, a common strategy is to split the model across multiple GPUs. This can be done via model parallelism techniques – for example, assigning different layers (or portions of layers) to different devices. In practice, this means even if you only have lower-end or mid-range GPUs (each with, say, 16–24 GB VRAM), you can still host a big model by using several of them working in tandem. The technical feasibility of this is well-established: frameworks like PyTorch with HuggingFace Accelerate, DeepSpeed, or Megatron-LM can shard model weights across devices. For instance, a 70B-parameter model like LLaMA2-70B typically requires ~140 GB of memory in 16-bit precision, which could be served by 2× 80GB GPUs, 4× 40GB GPUs, or 8× 20–24GB GPUs in parallel​. Table 2 shows illustrative minimum GPU memory requirements for different model sizes:

| Model Size | Typical Minimum GPU Memory | Example Multi-GPU Configurations |
|---|---|---|
| ~7B parameters | ~14–16 GB (fp16) | Runs on 1× consumer GPU (e.g. 1× RTX 4090 24GB) |
| ~13B parameters | ~28–30 GB (fp16) | Requires ≥2 GPUs (e.g. 2× 24GB GPUs) |
| ~30B parameters | ~60–80 GB (fp16) | Requires 2–4 GPUs (e.g. 4× 20GB, or 2× 48GB) |
| ~70B parameters | ~120–140 GB (fp16) | Requires 4–8 GPUs (e.g. 8× 24GB, or 4× 48GB) |
| 100B+ parameters | 200 GB+ (fp16) | Requires large GPU clusters (or heavy quantization) |

Table 2: Feasibility of model deployment by size – approximate memory needs and example GPU setups. Quantization can reduce these requirements (see later section).


As shown above, splitting a model across multiple GPUs is technically possible even for very large models. Libraries like Hugging Face’s Accelerate can automatically distribute layers across devices (via device_map="auto"), and more advanced parallelism schemes (tensor parallelism as used in Megatron, pipeline parallelism, etc.) are supported in libraries like vLLM and DeepSpeed. For example, one guide suggests that a 65B model (LLaMA 65B) with ~120GB of weights can run inference on 2× 80GB data-center GPUs, or 8× 24GB GPUs, achieving near real-time generation speeds. This means a small company could potentially deploy a 70B-class model using a server with 8 consumer GPUs, if they manage the splitting and memory carefully.
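As a minimal sketch of this kind of layer-wise sharding with Hugging Face Transformers and Accelerate (the checkpoint name and per-GPU memory caps below are illustrative assumptions; any sufficiently large open model works the same way):

```python
# Minimal sketch: shard a large checkpoint across several GPUs with device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"   # example checkpoint; swap in any open model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Accelerate places layers on GPU 0, GPU 1, ... (spilling to CPU only as a last
# resort) based on each device's free memory; max_memory caps each card explicitly.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={i: "22GiB" for i in range(torch.cuda.device_count())},
)

inputs = tokenizer("The main challenges of self-hosting LLMs are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```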


Performance Implications: The catch is that using multiple weaker GPUs in place of one big GPU can come with performance overhead. Without optimization, inference speed may actually degrade when a model is sharded across GPUs. This is because of additional communication and synchronization overhead between the devices each time a new token is generated. In fact, a baseline test showed that naively running a model on two GPUs could be 5–8× slower per token than running on a single GPU​, due to inefficient handling of the model’s internal state across devices. For example, one experiment found a 13B model took 0.067 seconds/token on a single RTX 4090, but a much slower 0.34 seconds/token when split across two 4090s (over 5× slower)​. After fixing the data transfer bottlenecks in software, the two-GPU setup was brought back to ~0.07 sec/token (essentially regaining the single-GPU speed)​.

Table 3 illustrates some performance measurements from community tests of multi-GPU inference:

| Model (quantization) | Setup | Generation Speed | Notes |
|---|---|---|---|
| 13B LLM (8-bit) | 1× RTX 4090 | ~0.067 s/token | Baseline single-GPU performance |
| 13B LLM (8-bit) | 2× RTX 4090 (naïve split) | ~0.34 s/token | ~5× slower – overhead without optimizations |
| 13B LLM (8-bit) | 2× RTX 4090 (optimized) | ~0.07 s/token | Parity restored – optimized parallelism |
| 65B LLM (4-bit GPTQ) | 4× RTX 3090 (naïve split) | >4.0 s/token | Very slow without specialized optimization |

Table 3: Example of multi-GPU inference performance. Naïve sharding can drastically slow down generation, but proper optimizations (e.g. efficient handling of the attention cache across devices) can eliminate most of the overhead.


These results highlight that software optimizations are key to make multi-GPU inference efficient. Techniques such as overlapping communication with computation, smarter placement of the model’s key/value caches, and using faster interconnects (NVLink/NVSwitch when available) can greatly improve throughput. Modern inference engines (discussed later, e.g. vLLM, DeepSpeed-Inference, Hugging Face’s Text Generation Inference) incorporate many of these optimizations. With the right setup, multiple mid-range GPUs can nearly match the inference latency of a single big GPU for a single query, and they enable scaling to model sizes that individually wouldn’t fit on those smaller cards.



It’s also worth noting that while splitting a model across GPUs doesn’t accelerate a single query (beyond what the fastest single GPU could do on its own), it can improve throughput if handled as a pipeline. For instance, with pipeline parallelism, each GPU holds a different set of layers and one can stream multiple requests such that all GPUs stay busy (like an assembly line). In a 4-GPU pipeline, up to 4 generation tasks can be in flight in different stages, improving overall request throughput – important for commercial scenarios serving many users. The latency for one query is still bounded by the sum of the stage times, but throughput scales more linearly. In data-center environments, model parallelism is often combined with data parallelism (multiple replicas of the model to serve more queries) to reach both memory capacity and throughput goals.
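A back-of-the-envelope calculation makes the latency-versus-throughput distinction concrete (the per-stage time below is a made-up number purely for illustration):

```python
# Hypothetical 4-stage pipeline: each GPU holds a quarter of the layers and
# spends ~25 ms per token on its stage.
stages = 4
stage_time_s = 0.025

# A single request still passes through every stage, so its per-token latency
# is roughly the sum of the stage times.
latency_s_per_token = stages * stage_time_s       # ~0.10 s/token for one query

# With enough requests in flight to keep all stages busy, a finished token
# leaves the pipeline once per stage time.
aggregate_throughput_tok_s = 1 / stage_time_s     # ~40 tokens/s across all requests

print(latency_s_per_token, aggregate_throughput_tok_s)
```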


Interconnect Considerations: When using multiple GPUs, especially consumer cards, the limited bandwidth can become a bottleneck. Server-grade GPUs in HGX systems benefit from NVSwitch or NVLink high-speed links (e.g. 600+ GB/s on A100, or ~900 GB/s on H100). Consumer GPUs typically only have PCIe (e.g. ~32 GB/s for PCIe4 x16), and newer RTX 40-series lack NVLink entirely. This means that if a model layer on GPU0 needs to send activations to GPU1 every step, the transfer is relatively slow. For smaller models or batches, this might not dominate, but for large batch inference or very large models, it can hurt performance. The A800’s reduced NVLink bandwidth is a factor here too – an A800 cluster will not scale as efficiently as an A100 cluster across many nodes​. In practice, if using multiple consumer GPUs, one may favor a pipeline parallel approach (which sends intermediate outputs once per layer) over a tensor-parallel approach (which would require frequent all-reduce communications for every matrix multiplication). The pipeline approach minimizes communication events, partly mitigating the bandwidth disadvantage of PCIe-only setups.


Thermal and Power Constraints (Local vs. Server Deployment)

Running LLM inference at scale is a power-hungry and heat-intensive task, which raises practical considerations depending on the deployment setting. Data center GPU servers are designed for high thermal loads: accelerators like A100/A800 are usually passively cooled, expecting strong chassis fans and facility air conditioning to keep temperatures in check. They can each consume 300–400W (and H100-class up to 700W) under heavy load. Racks full of such GPUs need robust cooling (air or liquid) and significant electrical power provision.

If a company in China opts to self-host an LLM on-premises but not in a proper data center, they might use a tower or workstation with multiple consumer GPUs (each RTX 4090 can draw ~450W). Packing many of these into a small space can quickly lead to thermal throttling if not well-cooled. Enthusiasts building multi-GPU rigs for local LLMs have found that beyond 2–3 high-end GPUs, standard air cooling may be insufficient; custom water-cooling loops or open-frame setups become necessary to manage the heat​. One discussion noted that water cooling was the “sensible solution if we scale beyond two cards” in a multi-GPU rig​. In contrast, data center deployments often use purpose-built cooling (for example, Alibaba’s data centers using liquid cooling plates for AI servers, etc.), which can handle 4–8 GPUs per server chassis continuously.


Another factor is noise and space: Server GPUs and coolers are loud and intended for machine rooms. A local office environment running a rack of GPUs would need to account for noise or isolate the hardware. Power supply is also a concern – a single PC with 4× RTX 4090 could draw 1.5–2 kW of power, requiring enterprise-grade PSUs and possibly special power circuits (in some cases, 240V lines are used for better efficiency). This is manageable for a company’s server room, but it’s far beyond a typical desktop’s consumption.


In summary, thermal and electrical infrastructure must be considered for self-hosting LLMs. If using cloud or colocation in China (offered by providers like Alibaba Cloud, Tencent Cloud, etc.), these issues are handled by the provider. But for truly on-premises deployments (for data privacy or control), ensuring adequate cooling (potentially liquid cooling for multi-GPU setups) and power delivery is essential to reliably operate the hardware. The difference can be boiled down to: data center GPUs are designed for that environment (high airflow, 24/7 utilization), whereas consumer GPUs can do the job but require DIY solutions to run them at full tilt continuously. Commercial users in China will have to weigh these practical costs and modifications when deciding between local GPU clusters vs. renting GPU time from a data center.



Model Compatibility and Deployment Options

Deploying Modern LLMs (Gemma, LLaMA and more) in China

With the hardware in place, the next consideration is choosing and deploying the LLM model itself. There are a few families of models relevant to commercial use in China:

  • Meta’s LLaMA series: LLaMA 2 (released 2023) is an openly available model (7B, 13B, and 70B variants) with a permissive license suitable for commercial use. It is primarily an English-trained model but includes a mix of languages from its Common Crawl pretraining. Meta provides a fine-tuned chat version, and further tuning can adapt it for Chinese if needed. Rumors suggested LLaMA 3 was in development for 2024, potentially aiming at GPT-4-level capabilities, with speculation that it could also come in at around 70B or more parameters. (Meta has since released LLaMA 3 in 8B and 70B sizes, followed by the multimodal LLaMA 3.2 line, whose vision variants reach 11B and 90B parameters.) For now, LLaMA-2 70B remains one of the most powerful openly licensed models and has seen adoption in China by companies looking to build ChatGPT-like services without accessing overseas APIs.


  • Google’s Gemma models: Gemma is a family of open models from Google (related to their Gemini project). Gemma 2 was introduced with 9B and 27B parameter versions. It is noteworthy for its permissive license that allows redistribution and commercial use, making it attractive to developers globally, including in China. Gemma 3, released subsequently, focuses on multimodality and efficiency: it comes in sizes from 1B up to 27B parameters, with enhanced training (more data) and features such as a long 128K context window and support for 140+ languages. In effect, Gemma 3 aims to be “the world’s best single-accelerator model” at 27B, meaning it is designed to run on a single high-end GPU, which greatly eases self-hosting. For Chinese commercial use, Gemma 3’s multilinguality (including Chinese) is a plus, though its smaller size means it may not reach the absolute quality of larger models on very complex tasks. Still, it demonstrates a trend toward smaller, more efficient models that can be self-hosted on limited hardware.


  • Chinese-developed models: Several open models have come from Chinese institutions and companies. Notable examples include BLOOMzh (a Chinese-trained version of the BLOOM model), ChatGLM (6B bilingual model from Tsinghua/ZhipuAI, with an open-source release), Baichuan (a 13B Chinese-focused model open-sourced by Baichuan Intelligence), InternLM (an open 7B model from Shanghai AI Lab), and GLM-130B (130B bilingual model from Tsinghua). Many of these are available for commercial use under open licenses. For instance, Baichuan-13B is Apache-2.0 licensed. However, larger ones like GLM-130B (130B parameters) are so memory-heavy that they require massive hardware (130B fp16 ~260 GB memory). In practice, Chinese companies might use these open models as a starting point and fine-tune them on Chinese instructions and content to build a proprietary model that complies with local guidelines. When considering deployment, the feasibility concerns (memory, inference speed) for these models are similar to their Western counterparts of comparable size – i.e. a 130B model, whether it’s GLM or LLaMA, will need multiple GPUs or aggressive compression to serve.

The table below summarizes a few modern LLM options and their deployment feasibility:

| Model | Size (parameters) | Commercial Use | Hardware Needs (for inference) | Notes |
|---|---|---|---|---|
| LLaMA 2 | 7B / 13B / 70B | Yes (open license) | 7B: ~16GB VRAM; 13B: ~30GB; 70B: ~140GB (fp16) | Strong general LLM by Meta; the 70B version rivals GPT-3.5 in quality. Multilingual to a degree, but not specifically Chinese-tuned. |
| LLaMA 3 (rumored) | (e.g. 30B, 70B, 100B?) | Expected (open) | Likely similar to or higher than LLaMA 2’s requirements | Aimed at GPT-4-level ability; possibly includes vision or longer context. Details TBD at the time of writing. |
| Google Gemma 2 | 9B / 27B | Yes (permissive) | 9B: ~18GB; 27B: ~50GB (fp16) | Open models from Google based on early Gemini tech. 8K context. Efficient (27B can be 8-bit quantized into ~25GB). |
| Google Gemma 3 | 1B / 4B / 12B / 27B | Yes (permissive) | 27B: ~50GB (fp16) – fits on 1× 80GB GPU (or 2× 24GB GPUs with 8-bit quantization) | Multimodal (image+text) support, 128K context. 27B is highly optimized; supports 140+ languages (good Chinese ability). |
| ChatGLM 2 (ZhipuAI) | 6B | Yes (AGPL v3*) | ~12GB (fp16) – easily on 1 GPU | Chinese-English bilingual chat model. Small enough to be affordable to run (and even fine-tune). |
| GLM-130B | 130B | Limited (research) | ~260GB (fp16) – requires 2× 80GB (with int8) or 4+ GPUs | Very large bilingual model (released for research). Demonstrates capabilities near GPT-3, but deployment is hardware-intensive. |

*Table 4: Examples of modern LLMs and their self-hosting feasibility. (Note: ChatGLM-2’s AGPL license permits commercial use if the user’s modifications are also open-sourced.)


As shown, smaller models (under ~13B) are quite feasible to deploy on a single commodity GPU – which is appealing for on-premises setups that can’t obtain ultra-powerful hardware. Models in the 6B–13B range (e.g. LLaMA-13B, Gemma-9B, ChatGLM-6B) can even run on a high-end laptop or desktop with enough RAM by leveraging CPU inference with 4-bit quantization (though slowly). The trade-off is quality: these smaller models may lag behind the largest models in complex reasoning, coding, or knowledge breadth. Still, for many applications (basic chatbots, classification, etc.), a 7B or 13B model fine-tuned on domain data can be sufficient and much easier to serve.
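For example, a minimal sketch of CPU-only inference on a 4-bit model using the llama-cpp-python bindings might look like the following (the GGUF file name is a placeholder, and generation will be slow compared to a GPU):

```python
# Minimal sketch: run a 4-bit quantized model on CPU via llama-cpp-python.
# "model-13b.Q4_K_M.gguf" is a hypothetical local file, not a specific release.
from llama_cpp import Llama

llm = Llama(
    model_path="./model-13b.Q4_K_M.gguf",  # 4-bit GGUF weights on disk
    n_ctx=4096,                            # context window to allocate
    n_threads=8,                           # CPU threads to use
)

result = llm("Question: What is quantization? Answer:", max_tokens=128)
print(result["choices"][0]["text"])
```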


For larger models (30B, 70B, 100B+), Chinese organizations will need to invest in stronger hardware or creative optimizations. This could mean using the NVIDIA A800 or domestic accelerators in a server (as discussed, e.g. 2× A800 80GB can host a 70B model comfortably), or networking together multiple consumer GPUs. Cloud offerings in China sometimes provide such multi-GPU instances – for example, one could rent 4× NVIDIA A10 GPUs (24 GB each) on a Chinese cloud and distribute a 70B model across them. The feasibility is there, but the cost and complexity rise with model size. That’s why models like Gemma-3 (27B) are interesting – they intentionally target a sweet spot of being small enough to deploy easily while trying to maximize performance via efficient training. We might expect more “medium-large” models in the 20B–40B range to emerge, striking a balance between capability and deployability under hardware constraints.


A key consideration for commercial use in China is also content filtering and alignment with local regulations. Open models like LLaMA or Gemma can be fine-tuned on Chinese data, but they do not come pre-aligned to Chinese content standards (which forbid certain topics). Companies often build a moderation layer or fine-tune the model with instruction data that steers it away from sensitive outputs. Technically, this doesn’t affect the hardware requirements, but it’s part of the deployment pipeline. Some models from China, like Baidu’s ERNIE or Alibaba’s Tongyi, are not fully open-weight releases, partly due to these concerns. Thus, truly self-hosted commercial models in China likely involve open base models with custom fine-tuning – a workflow that is fully doable on local hardware (e.g. using low-rank adaptation (LoRA) on a smaller GPU, as sketched below).
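As a rough sketch of that fine-tuning workflow, attaching LoRA adapters with the PEFT library looks roughly like this (the base model name and hyperparameters are illustrative assumptions; the actual training loop, e.g. with the Transformers Trainer, is omitted):

```python
# Minimal sketch: wrap an open base model with LoRA adapters so that only a
# small fraction of parameters needs training (and fits on a modest GPU).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "your-org/open-7b-base"  # hypothetical open-weight base model

model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```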


Inference Frameworks and Optimization Techniques

To make self-hosting practical, especially on limited hardware, it’s crucial to leverage efficient inference frameworks and model compression techniques. Several approaches can dramatically reduce the memory and compute footprint of LLMs:

  • 8-bit Quantization (LLM.int8): Pioneered in the paper “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, this technique stores model weights in 8-bit precision instead of 16-bit floating point, with minimal loss in accuracy. The popular bitsandbytes library provides an implementation of this for PyTorch. Using 8-bit quantization cuts memory use roughly in half without significant degradation in inference speed or model quality. For example, a 13B model that might need ~26GB in fp16 can fit in ~13GB with 8-bit weights, enabling it to run on a single 16GB GPU. bitsandbytes has become a standard tool for loading large models on smaller GPUs, and it is integrated with Hugging Face Transformers (one can pass load_in_8bit=True, or a BitsAndBytesConfig, to from_pretrained; see the sketch after this list). Many Chinese deployments use this to maximize their GPU usage (since high-VRAM GPUs are scarce). It’s important to note that 8-bit here refers to the weights; activations are still kept in higher precision. Thus the speed is similar to full precision (and sometimes even faster due to better memory bandwidth utilization). NVIDIA also offers its own 8-bit and even 4-bit support in TensorRT and CUDA libraries, but bitsandbytes made it very accessible in Python.

  • 4-bit and Lower Quantization: For even greater compression, 4-bit weight quantization can be used. One approach is GPTQ, an offline post-training quantization method that finds optimal 4-bit scaling for each weight group. With 4-bit weights, a model’s size is quartered (a 70B model ~140GB fp16 becomes ~35GB). This can be the difference between needing 4 GPUs versus 1 GPU. However, extreme quantization comes with some accuracy loss and sometimes slower generation: 4-bit arithmetic isn’t natively supported on all GPUs, so custom kernels (like in the exllama or GPTQ-for-LLaMa projects) are used. That said, community evaluations have found that GPTQ 4-bit models can retain high accuracy if done carefully, often only a few points worse in perplexity than 16-bit models​. From a performance view, 4-bit models can be slightly slower per token because decompressing weights or doing more bit-level operations adds overhead. For instance, one user observed that a 13B model quantized to 4-bit ran ~2× slower than 8-bit on their GPU, even though it saved memory (the trade-off of heavier compute for memory)​. In practice, one might use 4-bit quant for model sizes that otherwise wouldn’t fit at all on available hardware, accepting a speed hit. Tools like llama.cpp (based on the GGML library) even support 3-bit or 2-bit weights, enabling LLMs to run on pure CPU or mobile devices, albeit with further quality degradation. For commercial inference serving multiple users, 8-bit tends to be the sweet spot for maintaining quality and speed, whereas 4-bit is a niche to get something running in tight memory.

  • Memory-efficient Attention / KV Cache Management: Another major aspect of inference is the memory used by the attention key/value (KV) cache, which grows with the number of tokens processed. Long contexts (like 8K or 16K tokens) can consume more memory than the model itself if many requests are in flight. The vLLM library addresses this with a technique called PagedAttention. It treats GPU memory like virtual memory, storing the KV cache in fixed-size pages that can be selectively swapped in and out, rather than having each query allocate a huge contiguous chunk regardless of actual use. This eliminates a lot of memory fragmentation and waste. In real workloads, vLLM can serve more requests on the same hardware by recycling and sharing memory for the KV cache. It also supports continuous batching, meaning incoming requests are dynamically batched to keep the GPU fed without waiting for a batch to fill up. These innovations yield impressive throughput – up to 24× higher than naive Hugging Face Transformers serving in some benchmarks. For a company in China deploying an LLM, using vLLM or a similar high-throughput server (like TGI or FasterTransformer) can significantly reduce the number of GPUs needed to handle a given QPS (queries per second). The downside is that these systems add complexity and are geared more toward batched API serving than interactive single-user sessions. But they show how software can maximize hardware utilization.

  • Distributed Inference Frameworks: Tools like DeepSpeed-Inference and Tensor Parallel (Megatron) allow splitting models across GPUs with optimized communication. For example, Megatron-LM’s tensor parallelism can shard the computations of each layer across multiple GPUs, which is useful if you have GPUs connected via NVLink. DeepSpeed’s inference engine can do kernel fusions and specially handle large model layers across multiple GPUs to reduce latency. These frameworks are especially useful when serving ultra-large models on a cluster of GPUs (common in a data center scenario). In a Chinese commercial environment, if one has a server with 8× A800 GPUs, using such frameworks would be essential to get near-real-time inference from a 100B+ model. DeepSpeed’s Zero-Inference also enables offloading to CPU or even NVMe, which can be a trick to run models bigger than GPU memory by swapping weights in and out for layers. This comes at a cost of latency but can enable, say, a 175B model to run with only 80GB of GPU memory by streaming layers from CPU memory.

  • Model architecture optimizations: Some newer LLM architectures are designed to be more parameter-efficient or hardware-friendly. For example, MPT-7B (by MosaicML) uses FlashAttention and ALiBi positional encoding to support long contexts (variants extend to 8K and beyond) with only 7B parameters. Mixture-of-Experts (MoE) models, of the kind GPT-4 is rumored to use, could potentially scale output quality without putting all the load on one device (experts can be distributed). While not mainstream yet, these approaches could influence what kinds of models are practical to self-host. If an open-source MoE model appears, a company might run different expert shards on different GPUs to effectively increase model size without any single GPU needing to hold the entire model.
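As a concrete illustration of the quantization workflow referenced above, here is a minimal sketch of loading a model with 8-bit weights through the bitsandbytes integration in Hugging Face Transformers (the checkpoint name is a placeholder, and argument names can vary slightly between library versions):

```python
# Minimal sketch: load a mid-sized LLM with 8-bit weights via bitsandbytes.
# "your-org/your-13b-model" is a hypothetical checkpoint name, not a real repo.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-13b-model"

# 8-bit weights roughly halve VRAM versus fp16; load_in_4bit=True would
# quarter it again at some cost in accuracy and per-token speed.
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",   # shard across available GPUs if one is not enough
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "List three considerations for self-hosting an LLM:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```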


In practice, many deployments in China (as elsewhere) combine quantization + optimized serving. For example, a startup might take LLaMA-70B, quantize it to 4-bit using GPTQ (bringing it down to ~35GB), and then serve it on a server with 4 GPUs using vLLM to handle multiple user queries efficiently. This kind of setup dramatically lowers the barrier: instead of needing eight 80GB A100s (which are nearly impossible to obtain in China due to cost and restrictions), they could do with four prosumer GPUs and clever software. The trade-offs are a small drop in model accuracy and some added system complexity, but it makes self-hosting feasible within the constraints.
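To make that setup concrete, the following is a minimal sketch using vLLM’s offline inference API; the model path, quantization format, and GPU count are assumptions for illustration rather than a recommended configuration:

```python
# Minimal sketch: serve a GPTQ-quantized 70B-class model across 4 GPUs with vLLM.
# The local path below is hypothetical; point it at your own quantized weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-2-70b-gptq-4bit",  # hypothetical path to GPTQ weights
    quantization="gptq",                    # weights are 4-bit GPTQ
    tensor_parallel_size=4,                 # shard the model across 4 GPUs
    gpu_memory_utilization=0.90,            # leave headroom for the paged KV cache
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short introduction for our new product."], sampling)
print(outputs[0].outputs[0].text)
```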

Performance and Cost Trade-offs

Finally, it’s worth synthesizing the trade-offs involved in self-hosting LLMs in China given the above considerations:

  • Using high-end server GPUs (A800/H800 or domestic equivalents) provides the best performance per GPU and easier scaling (thanks to high-bandwidth interconnects), but these are expensive (an A800 80GB card costs on the order of ¥100k RMB​) and can be hard to procure in quantity. For companies that can afford them, the upside is you might serve a model with fewer GPUs and simpler software (maybe even no quantization needed if you have 80GB cards for a 70B model). The downside is reliance on supply constraints and potential future regulation changes.

  • Leveraging multiple consumer GPUs is a viable path for smaller players: e.g. assembling a few RTX 4090s or RTX 6000 Ada cards. This requires careful optimization (to avoid the multi-GPU slowdowns discussed) and robust cooling/power setups, but it allows usage of readily available hardware. The cost in China for a 4090 is significantly lower per unit (~¥15k RMB each) than an A800, but you may need several to match one A800’s capability. Still, for moderate concurrency or offline batch inference, a handful of 4090s can do a lot – and they can be repurposed for other tasks like fine-tuning models as well.

  • Quantization and efficient serving can substantially reduce the needed hardware. By using 8-bit or 4-bit models, one can double or quadruple the effective capacity of a given GPU memory at the cost of some speed or accuracy. This is often a worthwhile trade, especially for deployment (since slightly slower inference is acceptable as long as it’s within real-time bounds). Moreover, high-throughput servers like vLLM mean that a single GPU can handle many requests in parallel, reducing the total number of GPUs to serve an application. The combination of these software techniques can be seen as “getting more out of what you have”, which is particularly important in China’s context of constrained hardware access.

  • Thermal/power overhead is an often underappreciated cost: Running a rack of GPUs will incur electricity costs and cooling costs. In a commercial setting, this affects the operational expenditure (OpEx) of self-hosting. Sometimes, outsourcing inference to a cloud service might be more cost-efficient if the cloud provider has better economy of scale (even if they themselves are using A800s, for example). However, due to data sovereignty or privacy, many Chinese enterprises prefer keeping sensitive AI models on-premise. Therefore, investing in efficient hardware utilization (to handle more load with fewer GPUs) pays off in lower ongoing costs.

  • Future-proofing: The landscape is quickly evolving. If a company spends a fortune on an array of GPUs now, they must consider that newer models or hardware could obsolete those within a couple of years. For instance, if NVIDIA or Biren release a next-gen chip that is 2× faster and still exportable, or if a future LLaMA 4 model is so well-optimized that a smaller version outperforms today’s larger models, one might regret over-investing in a huge deployment of an older model. Thus, scalability and flexibility (e.g. using standard hardware that can be upgraded, and containerized deployment that can be switched to a new model easily) are part of strategic planning.


In conclusion, self-hosting LLMs in China is possible and increasingly common, but it requires navigating hardware limitations with clever strategies. Companies must mix and match available GPUs (whether it’s buying approved server GPUs like A800s, or repurposing gaming GPUs), leverage model optimizations (quantization, parallelism frameworks), and plan for engineering work to integrate everything. The possibilities are expanding as open-source models improve and new chips come to market. While a few years ago serving a GPT-3 level model locally was impractical, today a single server with multiple GPUs in China can indeed host models approaching GPT-3.5 or GPT-4 in capability. The remaining limitations are primarily around cost, efficiency, and compliance – all of which are being actively addressed by the fast-moving AI tech ecosystem.


Sources:

  1. TechPowerUp – “NVIDIA A800 China-Tailored GPU Performance within 70% of A100” (techpowerup.com)

  2. Reuters – “Nvidia offers new advanced chip for China that meets U.S. export controls” (reuters.com)

  3. NotebookCheck – “New Chinese Biren BR100 GPGPU apparently beats Nvidia’s Ampere A100” (notebookcheck.net)

  4. SemiAnalysis – “Huawei Ascend 910B … lands between the A100 and H100 in performance” (semianalysis.com)

  5. AIME Blog – “How to run 30B/65B LLaMA on Multi-GPU” (GPU requirements) (aime.info)

  6. GitHub (Hugging Face Accelerate issue) – multi-GPU inference latency tests (github.com)

  7. Hacker News discussion – on multi-GPU rig cooling (water cooling beyond 2 GPUs) (news.ycombinator.com)

  8. Hugging Face Blog – “Welcome Gemma 2 – Google’s new open LLM” (huggingface.co)

  9. Google AI Developer Page – Gemma 3 features (128K context, multilingual) (ai.google.dev)

  10. Decrypt – “Meta’s Llama 3 Rumored for Early 2024” (decrypt.co)

  11. Hugging Face docs – bitsandbytes int8 inference memory benefits (huggingface.co)

  12. Red Hat Blog – “Meet vLLM: faster, more efficient LLM inference” (redhat.com)

