Language-Runtime Foundations

InterGenOS ships a built-from-source toolchain and a set of language runtimes that the AI subsystem and other workloads run on. This page explains what those foundations are, where they come from, and how the local AI assistant uses them. The goal is the same one that runs through the whole system: a machine you understand, can modify, and can trust.

This is version 1.0-dev (build id v1.0-dev1). Package counts below are derived from the live package tree and drift over time; treat them as a snapshot, not a fixed contract.

Where runtimes come from

Everything is built from source through a staged pipeline rather than assembled from prebuilt binaries. The build runs in 20 phases:

validate → verify-sources → setup → toolchain → chroot-prep → chroot-tools → core → config → core-extra → base → kernel → desktop → ai → extra → bootloader → image → manifest → squashfs → ukis-verity → iso

A publish phase is optional and runs after the others.

The runtime foundations are laid early. The toolchain phase builds the compiler and core build tooling that every later phase depends on; the core and base phases bring in the system libraries and language runtimes that user-facing software links against; and the ai phase builds the assistant on top of those runtimes.

Packages are organized into six tiers. As of this writing the live tree holds roughly:

Tier	Packages (approx.)
toolchain	28
core	272
base	23
desktop	420
extra	112
ai	2
Total	~857

These numbers move as packages are added and consolidated. Derive the current counts from the package tree rather than quoting these as permanent.

The runtime the AI subsystem uses

The local assistant, InterGen, is a tiered, hardware-detected, offline-first local assistant with zero telemetry. It runs entirely on the local machine by default, with no cloud dependency, and selects a model sized to the hardware it detects.

InterGen has two distinct runtime layers, and separating them helps because they place different demands on the system:

The model-serving layer. InterGen serves its selected model through llama-server, the HTTP server from llama.cpp, managed as a subprocess (intergen/llama_manager.py). This is the native, compiled inference runtime that does the heavy numeric work and is the component that benefits from GPU acceleration. It exposes the model over a local HTTP API.
The orchestration layer. The router, safety classifier, hardware probe, model manager, memory store, D-Bus service, and MCP client are organized as modules (intergen/router.py, intergen/safety.py, intergen/hardware.py, intergen/model_manager.py, intergen/memory.py, intergen/state_cache.py, intergen/dbus_daemon.py, intergen/mcp_client.py). These coordinate the assistant: deciding how to answer a request, classifying its safety, and managing state.

These orchestration modules run on the shipping Python runtime, which is Python 3.14 (3.14.3 in the current build).

The split matters for runtime planning. The orchestration layer is light and predictable. The model-serving layer is where memory and GPU capacity decide which model can run at all.

Hardware tiers and what they need

InterGen probes RAM and GPU at startup (intergen/hardware.py) and assigns a tier; the model catalog and selection live in intergen/model_manager.py. The shipping tiers are:

Tier 1 (2B, Basic) — Qwen3.5-2B Q4_K_M, about 1.5 GB. Selected on systems with under 8 GB of RAM (GPU is not a factor at this RAM level). Handles semantic matching, system queries, and keyword extraction. Not used for complex code generation.
Tier 2 (9B, Standard) — Qwen3.5-9B Q4_K_M, about 5.5 GB. Selected on systems with 8 to 15 GB of RAM. The default daily driver: coding, configuration, and reasoning. On a Tier 2 machine without a discrete GPU, InterGen falls back to the 2B model to keep response latency usable.
Tier 3 (35B, Advanced) — Qwen3.5-35B-A3B Q4_K_M, a Mixture-of-Experts model, about 21 GB. Selected on systems with 16 GB or more of RAM and a discrete GPU. Used for deep multi-file analysis and complex architectural reasoning.

A small embedding model (nomic-embed-text, Apache-2.0) ships alongside every tier to power the router’s semantic-matching layer.

The GPU is what unlocks the larger tiers and keeps response latency usable: without a discrete GPU, a system that would otherwise reach Tier 2 falls back to the smaller model. For detecting and configuring the GPU and its compute backend, see the Overview & Backend Selection page.

How the runtime resolves a request

InterGen does not hand every prompt to the model. A priority-ordered router (intergen/router.py) tries to satisfy each request with the cheapest, most predictable method first and only escalates the runtime cost when it has to:

Priority 0 (Decomposition) — splits compound requests into sub-tasks and routes each in turn.
Priority 1 (Keyword/Regex Match) — fast pattern matches for common system commands, dispatched to a built-in tool without invoking the model.
Priority 2 (Semantic Embedding Match) — lightweight embedding search against a precomputed capability catalog; a high-confidence match dispatches to a built-in tool.
Priority 3 (LLM Tool Calling) — for Tier 2 and above, the query plus a tool schema goes to the model, which selects a tool and arguments for the router to execute.
Priority 4 (LLM Free Response) — the fallback: the model answers conversationally from its own knowledge and the conversation context.

The practical effect is that most interactions never touch the model-serving runtime at all, which keeps the system responsive even on modest hardware.

Safety classification at the runtime boundary

Every action the router proposes passes through the classifier in intergen/safety.py, which sorts it into one of three tiers:

AUTO — read-only or harmless operations (for example ls, grep, systemctl status). Run immediately.
CONFIRM — state-changing operations (for example systemctl restart, pkm install, or editing a config file). The assistant pauses and asks for approval first.
BLOCKED — destructive or security-bypassing operations (for example rm -rf /). Refused outright, with an explanation.

Nothing that changes the system runs without explicit approval, and the most dangerous commands cannot run at all.

Crossing the network boundary

Two features can take a request off-device, and both run only when you ask:

InterGen Sentinel is a pluggable security scanner that inspects content crossing the ingress and egress boundaries. Its default configuration runs two local stages: a fast local-rules pass and an optional deep pass backed by a small local Qwen classifier. For deeper analysis you may opt in to one of six cloud providers: Claude (Anthropic), Gemini (Google), Copilot (Microsoft), ChatGPT (OpenAI), Grok (xAI), or DeepSeek. None is configured by default, so a default install scans entirely on-device.
Phone-A-Friend (Frontier/Cloud Escalation) is an optional, consent-first path that hands a request to a more capable frontier model in the cloud when the local assistant cannot satisfy it. It is off by default; the same six providers can be configured, with API keys stored in the system keyring. Every outbound payload is scanned by Sentinel’s egress policy before it leaves the machine.

The rest of the platform

Beyond the AI subsystem, the same source-built foundations support the shipping system: pkm (the package manager), Forge (the installer), a signed Secure Boot chain, dm-verity integrity, and UKI signing. The shipping desktop is GNOME 49 on Wayland.

InterGenOS Wiki