$ pwd
[$ ] use-case: ai-inference
// NAME
ai-inference — ai inference hosting (open-weight llms).
// SYNOPSIS
xmrhost-cli playbook describe --workload=ai-inference
xmrhost-cli provision --workload=ai-inference --region=<is|ro>
// TL;DR
$ head -n1 README
// open-weight llm serving on offshore gpu — vllm + ollama preinstalled, outside us export-control gating.
// DESCRIPTION
$ man playbook(ai-inference)
// open-weight inference outside US export-control gating
Open-weight LLMs (Llama, Mistral, Qwen, Gemma) cleared the bar for production inference on most non-frontier tasks somewhere between mid-2024 and late-2025. The constraint moved from API access to GPU availability, and GPU availability is now a geopolitical question. US export-control classifications around H100/H200-class accelerators have made allocation political; mid-tier consumer / pro GPUs (RTX 4090, RTX A6000) in non-US datacenters sidestep the gating entirely.
Memory budget determines what fits. Rule of thumb: VRAM (GB) ≈ params (B) × 2 for FP16, × 1 for 8-bit, × 0.5 for 4-bit quantization, plus headroom for KV cache. RTX 4090 (24 GB) runs 7B FP16 or 30B 4-bit comfortably; A6000 (48 GB) runs 70B 4-bit; H100 (80 GB) runs 70B 8-bit, or a 30B-class model at FP16 with serious context. vLLM batches at the request level for production throughput; Ollama is the right answer for single-user / development workflows; both ship preinstalled with CUDA 12 + PyTorch on every gpu-* tier.
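The rule of thumb above can be sketched as a quick estimator. This is a hypothetical helper, not part of xmrhost-cli, and the 20% overhead margin for KV cache and CUDA context is an assumption — real usage varies with context length and batch size.

```python
# Weights-only VRAM rule of thumb: bytes per parameter by precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_b: float, precision: str, overhead: float = 1.2) -> float:
    """Estimate VRAM in GB for params_b billion parameters.

    overhead (assumed 20%) covers KV cache, activations, and CUDA
    context; it is a rough margin, not a measured figure.
    """
    return params_b * BYTES_PER_PARAM[precision] * overhead

def fits(params_b: float, precision: str, card_gb: float) -> bool:
    """Does the model plausibly fit on a card with card_gb of VRAM?"""
    return vram_gb(params_b, precision) <= card_gb

# 30B at 4-bit on a 24 GB RTX 4090: 30 * 0.5 * 1.2 = 18 GB
print(fits(30, "int4", 24))   # True
# 70B at 4-bit on a 48 GB A6000: 70 * 0.5 * 1.2 = 42 GB
print(fits(70, "int4", 48))   # True
```

Treat the output as a sizing sanity check before provisioning, not a guarantee — long-context serving can blow past the margin.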
The data-sovereignty argument is real and underweighted. Frontier API providers reserve the right to retain prompts and completions for safety, abuse, or training purposes, depending on the SKU. A self-hosted endpoint inside an Iceland or Netherlands GPU plan keeps the prompt corpus, the model weights, and the inference logs all under one operator. Customer data does not flow into a third-party analytics fabric, and the OpenAI-compatible HTTP server vLLM ships makes a customer-facing inference endpoint a clean, self-contained deployment.
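As a sketch of what "OpenAI-compatible" buys you: any client that speaks the /v1/chat/completions wire format can target the self-hosted node by swapping the base URL. The host, port, and model name below are placeholders, not values xmrhost provisions.

```python
import json

# Placeholder base URL for the self-hosted node; vLLM's OpenAI-compatible
# server exposes /v1/chat/completions (host and port are assumptions).
BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, user_prompt: str, max_tokens: int = 256) -> str:
    """Build the JSON body for a /v1/chat/completions call.

    The schema mirrors the OpenAI chat-completions wire format, which is
    what makes the self-hosted endpoint a drop-in for existing clients.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# POST this body to f"{BASE_URL}/chat/completions" with any HTTP client.
payload = chat_request("meta-llama/Llama-3-8B-Instruct", "Summarise the AUP.")
print(payload)
```

Nothing in the request path leaves the operator's box: prompt, weights, and logs stay on the node, which is the sovereignty point the paragraph above makes.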
// REFERENCES
- vLLM — Efficient Memory Management for LLM Serving (Kwon et al., SOSP 2023)
- Llama 2 / Llama 3 Model Cards (ai.meta.com)
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- US BIS — Advanced Computing Export Controls (federalregister.gov)
// RECOMMENDED NODES
$ xmrhost-cli list --workload=ai-inference
// 6 plans flagged for this workload. all xmr-billed.
// RECOMMENDED REGIONS
$ xmrhost-cli regions list --workload=ai-inference
- is — iceland : Geothermal energy at <$0.05/kWh wholesale — best $/inference profile in EEA. EU AI Act applies via EEA agreement; no extra-territorial US compute-export gating.
- ro — romania : EU member with the EU AI Act applicable. Lower-cost GPU compute than Iceland for FP16 inference on RTX-class accelerators; deeper carrier mix for inference-endpoint latency.
// THREAT MODEL + AUP BOUNDARY
$ xmrhost-cli scope --workload=ai-inference
// the hosting layer is one component of the threat model. what we cover, and what we explicitly don't:
// scope: in
- Bare-metal GPU access (no virtualisation tax, full CUDA driver flexibility)
- vLLM + Ollama + PyTorch + CUDA 12 preinstalled, version-pinned per release tag
- Predictable egress for OpenAI-compatible HTTP endpoint serving end users
- Outside US export-control gating for RTX 4090 / A6000 / H100 procurement
// scope: out
- Model weights — we host the GPU; you bring the .safetensors
- Content moderation on the inference endpoint (EU AI Act compliance is the operator-customer's responsibility)
- Fine-tuning data licensing analysis — Llama / Mistral licenses have specific clauses, read them
- Multi-GPU NVLink configurations (custom procurement, 1-2 week lead time)
// AUP boundary
Customers running inference endpoints serving end users remain responsible for their own content moderation, age-gating, and applicable AI-act compliance (EU AI Act, deployment-jurisdiction equivalents).
// SEE ALSO
// playbook — full workload list, node — full catalog, location — region posture, why-monero — billing rationale.