[$ xmrhost] _

$ pwd

/playbook/ai-inference

[$ ] use-case: ai-inference

// NAME

ai-inference — ai inference hosting (open-weight llms).

// SYNOPSIS

xmrhost-cli playbook describe --workload=ai-inference
xmrhost-cli provision --workload=ai-inference --region=<is|ro>

// TL;DR

$ head -n1 README

// open-weight llm serving on offshore gpu — vllm + ollama preinstalled, outside us export-control gating.

// DESCRIPTION

$ man playbook(ai-inference)

// open-weight inference outside US export-control gating

Open-weight LLMs (Llama, Mistral, Qwen, Gemma) cleared the bar for production inference on most non-frontier tasks somewhere between mid-2024 and late-2025. The constraint moved from API access to GPU availability, and GPU availability is now a geopolitical question. US export-control classifications around H100/H200-class accelerators have made allocation political; mid-tier consumer / pro GPUs (RTX 4090, RTX A6000) in non-US datacenters sidestep the gating entirely.

Memory budget determines what fits. Rule of thumb: weights alone need VRAM (GB) ≈ params (B) × 2 for FP16, × 0.5 for 4-bit quantization, plus headroom for KV cache and activations. RTX 4090 (24 GB) runs 7B FP16 or 30B 4-bit comfortably; A6000 (48 GB) runs 70B 4-bit; H100 (80 GB) runs 70B 4-bit with long-context headroom, or a 30B-class model in FP16. vLLM does continuous batching with paged KV-cache management for production throughput; Ollama is the right answer for single-user / development workflows; both ship preinstalled with CUDA 12 + PyTorch on every gpu-* tier.
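
A back-of-the-envelope check of that rule of thumb, sketched below. The filename is illustrative, and the flat 4 GB headroom allowance for KV cache and activations is an assumption; real budgets depend on context length and batch size.

$ cat vram_check.py

# vram_check.py — illustrative filename, not shipped with the image.
# Weights-only estimate plus a flat headroom allowance for KV cache and
# activations; real budgets depend on context length and batch size.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits(params_b: float, dtype: str, vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus a flat headroom allowance fit in the card's VRAM."""
    weights_gb = params_b * BYTES_PER_PARAM[dtype]
    return weights_gb + headroom_gb <= vram_gb

for params, dtype in [(7, "fp16"), (30, "int4"), (70, "int4")]:
    for gpu, vram in [("RTX 4090", 24), ("A6000", 48), ("H100", 80)]:
        verdict = "fits" if fits(params, dtype, vram) else "does not fit"
        print(f"{params}B {dtype:>4} on {gpu} ({vram} GB): {verdict}")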

The data-sovereignty argument is real and underweighted. Frontier API providers reserve the right to retain prompts and completions for safety review, abuse monitoring, or training, depending on the SKU. A self-hosted endpoint on an Iceland or Romania GPU plan keeps the prompt corpus, the model weights, and the inference logs all under one operator. Customer data does not flow into a third-party analytics fabric, and the OpenAI-compatible HTTP server vLLM ships exposes a customer-facing inference endpoint cleanly, so existing client integrations move over with little more than a base-URL change.
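
A sketch of what that swap looks like from the client side, assuming the `openai` Python package and a vLLM OpenAI-compatible server already running on the node. The launch command, hostname, and model name below are illustrative placeholders, not provisioned defaults.

$ cat client_example.py

# client_example.py — illustrative; hostname and model name are placeholders.
# Assumes a vLLM OpenAI-compatible server was started on the node, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="https://gpu-node.example.net:8000/v1",  # self-hosted endpoint, not api.openai.com
    api_key="unused",  # vLLM ignores the key unless the server is started with --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise this support ticket in one sentence: ..."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)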

// references

  • vLLM — Efficient Memory Management for LLM Serving (Kwon et al., SOSP 2023)
  • Llama 2 / Llama 3 Model Cards (ai.meta.com)
  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
  • US BIS — Advanced Computing Export Controls (federalregister.gov)

// THREAT MODEL + AUP BOUNDARY

$ xmrhost-cli scope --workload=ai-inference

// the hosting layer is one component of the threat model. what we cover, and what we explicitly don't:

// scope: in

  • Bare-metal GPU access (no virtualisation tax, full CUDA driver flexibility)
  • vLLM + Ollama + PyTorch + CUDA 12 preinstalled, version-pinned per release tag (sanity check after this list)
  • Predictable egress for OpenAI-compatible HTTP endpoint serving end users
  • Outside US export-control gating for RTX 4090 / A6000 / H100 procurement
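
A minimal post-provision sanity check of that stack, sketched below. The filename is illustrative; exact pinned versions depend on the release tag of the image.

$ cat stack_check.py

# stack_check.py — illustrative filename; pinned versions vary by release tag.
import shutil
import subprocess
import torch

# PyTorch must see the GPU through the preinstalled CUDA 12 driver stack.
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"
print("torch", torch.__version__, "| cuda", torch.version.cuda,
      "| device", torch.cuda.get_device_name(0))

# vLLM and Ollama should already be on PATH on gpu-* tiers.
for tool in ("vllm", "ollama", "nvidia-smi"):
    print(f"{tool}: {'ok' if shutil.which(tool) else 'MISSING'}")

# Raw device report straight from the driver.
subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"], check=True)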

// scope: out

  • Model weights — we host the GPU; you bring the .safetensors
  • Content moderation on the inference endpoint (EU AI Act compliance stays with the operator-customer)
  • Fine-tuning data licensing analysis — Llama / Mistral licenses have specific clauses, read them
  • Multi-GPU NVLink configurations (custom procurement, 1-2 week lead time)

// AUP boundary

Customers running inference endpoints serving end users remain responsible for their own content moderation, age-gating, and applicable AI-act compliance (EU AI Act, deployment-jurisdiction equivalents).

// SEE ALSO

// playbook — full workload list, node — full catalog, location — region posture, why-monero — billing rationale.