Overview & Backend Selection

InterGenOS runs local AI inference on the hardware in front of it. The local assistant, InterGen, detects the host’s RAM and GPU at startup and selects an appropriately sized model, then serves it through a llama-server subprocess from llama.cpp over a local HTTP API. The compute backend that subprocess uses to talk to your GPU is the subject of this page.

The guiding principle is the same one that runs through the rest of the system: a machine you understand, can modify, and can trust. The default backend is chosen to work on the widest range of hardware without a vendor-specific driver stack. Vendor-tuned backends are available, but they are opt-in.

Why a vendor-neutral default

A default that pulls in a large proprietary driver stack expands the trusted surface of every install, whether or not that hardware is present. A vendor-neutral backend keeps the default install small, inspectable, and portable across GPUs from different vendors.

For this reason InterGenOS ships with a Vulkan-based compute backend as the default. Vulkan is supported across AMD, Intel, and NVIDIA GPUs through their respective open drivers, so the same default image runs accelerated inference on a broad range of machines without requiring a vendor SDK at build or install time.

See Vulkan (default) for details on the default backend.

Opt-in vendor backends

Two vendor-tuned backends can be enabled when the matching hardware and toolkit are present:

ROCm for AMD GPUs. See ROCm (AMD).
CUDA for NVIDIA GPUs. See CUDA (NVIDIA).

These typically deliver higher throughput than the vendor-neutral path on their respective hardware, at the cost of pulling in a vendor-specific runtime. Because that runtime enlarges the trusted surface, neither backend is enabled by default. You choose to add it.

How a backend is selected

At a high level, backend selection follows the same local-first, hardware-detected philosophy as model selection:

The host’s GPU and RAM are probed at startup.
The default Vulkan backend is used unless a vendor backend has been explicitly enabled.
If a vendor backend (ROCm or CUDA) is configured and the matching hardware is present, inference is routed through it.

This keeps the out-of-the-box experience predictable and vendor-neutral while leaving the higher-performance path one deliberate step away for users who want it.

Relationship to model tiers

Backend selection is independent of, but interacts with, the model tier InterGen picks for your hardware. InterGen scales the model to the detected RAM and GPU: a smaller model on low-memory systems, a larger one where a discrete GPU and ample RAM are present. The compute backend determines how that model is accelerated; the tier determines which model runs. A faster backend does not change the tier, but a discrete GPU influences both.

For the model catalog and tier thresholds, see the InterGen assistant documentation. For per-GPU notes, see Per-GPU guidance.

What this is not

This page covers the local inference backend only. It is unrelated to optional cloud features:

InterGen Sentinel, the pluggable security scanner, runs entirely on-device by default (a fast local-rules pass plus an optional local Qwen classifier). Cloud scanning, backed by one of six opt-in providers, is configured by you and off by default.
Phone-A-Friend (Frontier/Cloud Escalation) is an optional, consent-first path that hands a request to a frontier model in the cloud. It is off by default; no provider is configured out of the box.

Neither feature depends on the GPU backend, and neither is enabled out of the box. The default install runs and reasons entirely on your own hardware.

InterGenOS Wiki

Overview & Backend Selection

Why a vendor-neutral default

Opt-in vendor backends

How a backend is selected

Relationship to model tiers

What this is not

See also

Keyboard shortcuts

InterGenOS Wiki

Overview & Backend Selection

Why a vendor-neutral default

Opt-in vendor backends

How a backend is selected

Relationship to model tiers

What this is not

See also