[$ xmrhost] _

$ pwd

/playbook/ai-inference

[$ ] use-case: ai-inference

// NAME

ai-inference — ai inference hosting (open-weight llms).

// SYNOPSIS

xmrhost-cli playbook describe --workload=ai-inference
xmrhost-cli provision --workload=ai-inference --region=<is|ro>

// TL;DR

$ head -n1 README

// open-weight llm serving on offshore gpu — vllm + ollama preinstalled, outside us export-control gating.

// DESCRIPTION

$ man playbook(ai-inference)

// open-weight inference outside US export-control gating

Open-weight LLMs (Llama, Mistral, Qwen, Gemma) cleared the bar for production inference on most non-frontier tasks somewhere between mid-2024 and late-2025. The constraint moved from API access to GPU availability, and GPU availability is now a geopolitical question. US export-control classifications around H100/H200-class accelerators have made allocation political; mid-tier consumer / pro GPUs (RTX 4090, RTX A6000) in non-US datacenters sidestep the gating entirely.

Memory budget determines what fits. Rule of thumb: weights alone need VRAM (GB) ≈ params (B) × 2 for FP16, × 0.5 for 4-bit quantization, plus headroom for KV cache and activations. RTX 4090 (24 GB) runs 7B FP16 or 30B 4-bit comfortably; A6000 (48 GB) runs 70B 4-bit; H100 (80 GB) runs 70B 4-bit with long-context headroom, or a 30B-class model in FP16. vLLM does continuous batching with paged KV-cache management for production throughput; Ollama is the right answer for single-user / development workflows; both ship preinstalled with CUDA 12 + PyTorch on every gpu-* tier.
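
A back-of-the-envelope check of that rule of thumb, sketched below. The filename is illustrative, and the flat 4 GB headroom allowance for KV cache and activations is an assumption; real budgets depend on context length and batch size.

$ cat vram_check.py

# vram_check.py — illustrative filename, not shipped with the image.
# Weights-only estimate plus a flat headroom allowance for KV cache and
# activations; real budgets depend on context length and batch size.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits(params_b: float, dtype: str, vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus a flat headroom allowance fit in the card's VRAM."""
    weights_gb = params_b * BYTES_PER_PARAM[dtype]
    return weights_gb + headroom_gb <= vram_gb

for params, dtype in [(7, "fp16"), (30, "int4"), (70, "int4")]:
    for gpu, vram in [("RTX 4090", 24), ("A6000", 48), ("H100", 80)]:
        verdict = "fits" if fits(params, dtype, vram) else "does not fit"
        print(f"{params}B {dtype:>4} on {gpu} ({vram} GB): {verdict}")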

The data-sovereignty argument is real and underweighted. Frontier API providers reserve the right to retain prompts and completions for safety review, abuse monitoring, or training, depending on the SKU. A self-hosted endpoint on an Iceland or Romania GPU plan keeps the prompt corpus, the model weights, and the inference logs all under one operator. Customer data does not flow into a third-party analytics fabric, and the OpenAI-compatible HTTP server vLLM ships exposes a customer-facing inference endpoint cleanly, so existing client integrations move over with little more than a base-URL change.
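
A sketch of what that swap looks like from the client side, assuming the `openai` Python package and a vLLM OpenAI-compatible server already running on the node. The launch command, hostname, and model name below are illustrative placeholders, not provisioned defaults.

$ cat client_example.py

# client_example.py — illustrative; hostname and model name are placeholders.
# Assumes a vLLM OpenAI-compatible server was started on the node, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="https://gpu-node.example.net:8000/v1",  # self-hosted endpoint, not api.openai.com
    api_key="unused",  # vLLM ignores the key unless the server is started with --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise this support ticket in one sentence: ..."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)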

// references

  • vLLM — Efficient Memory Management for LLM Serving (Kwon et al., SOSP 2023)
  • Llama 2 / Llama 3 Model Cards (ai.meta.com)
  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
  • US BIS — Advanced Computing Export Controls (federalregister.gov)

// THREAT MODEL + AUP BOUNDARY

$ xmrhost-cli scope --workload=ai-inference

// the hosting layer is one component of the threat model. what we cover, and what we explicitly don't:

// scope: in

  • Bare-metal GPU access (no virtualisation tax, full CUDA driver flexibility)
  • vLLM + Ollama + PyTorch + CUDA 12 preinstalled, version-pinned per release tag (sanity check after this list)
  • Predictable egress for OpenAI-compatible HTTP endpoint serving end users
  • Outside US export-control gating for RTX 4090 / A6000 / H100 procurement
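
A minimal post-provision sanity check of that stack, sketched below. The filename is illustrative; exact pinned versions depend on the release tag of the image.

$ cat stack_check.py

# stack_check.py — illustrative filename; pinned versions vary by release tag.
import shutil
import subprocess
import torch

# PyTorch must see the GPU through the preinstalled CUDA 12 driver stack.
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"
print("torch", torch.__version__, "| cuda", torch.version.cuda,
      "| device", torch.cuda.get_device_name(0))

# vLLM and Ollama should already be on PATH on gpu-* tiers.
for tool in ("vllm", "ollama", "nvidia-smi"):
    print(f"{tool}: {'ok' if shutil.which(tool) else 'MISSING'}")

# Raw device report straight from the driver.
subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"], check=True)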

// scope: out

  • Model weights — we host the GPU; you bring the .safetensors
  • Content moderation on the inference endpoint (EU AI Act compliance stays with the operator-customer)
  • Fine-tuning data licensing analysis — Llama / Mistral licenses have specific clauses, read them
  • Multi-GPU NVLink configurations (custom procurement, 1-2 week lead time)

// AUP boundary

Customers running inference endpoints serving end users remain responsible for their own content moderation, age-gating, and applicable AI-act compliance (EU AI Act, deployment-jurisdiction equivalents).

// SEE ALSO

// playbook — full workload list, node — full catalog, location — region posture, why-monero — billing rationale.