Research · 2026

Unhosted — Decentralized LLM inference

A Rust-based runtime that pools heterogeneous hardware (MacBooks, gaming PCs, home servers, and an opt-in public swarm of strangers' GPUs) into a single inference cluster behind one endpoint, across CUDA, Metal, and ROCm. It defines three trust radii (local / trusted / public): the first two are free forever; the third is a USDC-priced safety net used only when hardware in the first two can't fulfill a request. Currently pre-alpha: single-host inference, LAN clustering, mDNS peer discovery and pairing, and model management have shipped; VRAM pooling via llama.cpp's RPC backend is under active development. Co-maintained with the original author, Ankur Sinha; my scope covers infrastructure, developer experience, and documentation.

Technology stack

Rust 2021, llama.cpp, GGUF, mDNS, WireGuard, CUDA, Metal, ROCm, AGPL-3.0

01

Problem statement

Frontier-class language-model inference is almost always rented from someone else's datacenter. That's a defensible business model and a bad default: it ties personal compute, privacy, and cost to a handful of hosted endpoints. Collectively, people already own enough silicon (laptops, gaming PCs, home servers, idle workstations) to run a 70B-class model, if the orchestration existed to pool it. Unhosted is that orchestration: one endpoint that routes inference across hardware you own first, friends and family second, and a paid public swarm only as a last resort.

02

Dataset & data

Not a model-training project. The "data" here is the network topology: heterogeneous nodes across CUDA / Metal / ROCm, variable VRAM budgets, intermittent availability, and latency that ranges from sub-millisecond on LAN to tens of milliseconds across the internet. Routing decisions depend on per-node capacity introspection, peer-trust state, and a configurable price ceiling for the public-swarm fallback (USDC per token).
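The routing inputs described above can be sketched in Rust. This is a minimal illustration under stated assumptions, not Unhosted's actual router: the `Radius`, `Node`, and `pick_node` names, the micro-USDC price unit, and the tie-breaking order are all hypothetical and chosen only for the example.

```rust
/// How far a request may travel. Variant order matters:
/// the derived `Ord` gives Local < Trusted < Public.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Radius {
    Local,
    Trusted,
    Public,
}

/// Hypothetical per-node state a router might track.
#[derive(Debug, Clone)]
struct Node {
    name: &'static str,
    radius: Radius,
    free_vram_mib: u64,
    online: bool,
    /// Assumed unit: micro-USDC per token. Zero for local and trusted nodes.
    price_micro_usdc: u64,
}

/// Pick a node that is online, has enough free VRAM, and costs no more than
/// the caller's price ceiling. Among candidates, prefer the closest radius,
/// then the lowest price, then the most free VRAM.
fn pick_node(nodes: &[Node], needed_vram_mib: u64, ceiling_micro_usdc: u64) -> Option<&Node> {
    nodes
        .iter()
        .filter(|n| n.online && n.free_vram_mib >= needed_vram_mib)
        .filter(|n| n.price_micro_usdc <= ceiling_micro_usdc)
        .min_by_key(|n| (n.radius, n.price_micro_usdc, std::cmp::Reverse(n.free_vram_mib)))
}
```

In this sketch a ceiling of zero keeps every request inside the free radii, because any node with a non-zero price is filtered out before ranking. Raising the ceiling is what lets a request reach the public swarm.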

03

Architecture & design

A Rust runtime wraps llama.cpp's llama-server for per-host inference and uses its RPC backend (GGML_RPC) for cross-node VRAM pooling, so the layers of a single model can be split across, say, a MacBook and an RTX 4090. mDNS handles LAN peer discovery; a one-click pairing flow connects trusted peers over WireGuard-style end-to-end encryption with no public exposure. The public swarm is designed around optimistic verification plus redundancy for now, with zero-knowledge proofs as a future option once they become affordable. The three trust radii (local / trusted / public) sit behind one CLI surface: the user picks how far the radius extends by setting a price ceiling, not by choosing a backend.
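To make the layer-splitting idea concrete, here is a hedged sketch of one way to divide a model's layers across nodes in proportion to their free VRAM, using largest-remainder rounding so the shares always sum to the layer count. The `split_layers` function is hypothetical; it is not how llama.cpp or Unhosted computes the split, and it ignores KV-cache size, per-layer size differences, and network latency.

```rust
/// Split `total_layers` across nodes in proportion to free VRAM (MiB).
/// Uses the largest-remainder method; ties go to the lower index.
fn split_layers(total_layers: u32, free_vram_mib: &[u64]) -> Vec<u32> {
    let total_vram: u64 = free_vram_mib.iter().sum();
    if total_vram == 0 {
        // No capacity anywhere (or no nodes): assign nothing.
        return vec![0; free_vram_mib.len()];
    }

    let mut shares: Vec<u32> = Vec::with_capacity(free_vram_mib.len());
    let mut remainders: Vec<(u64, usize)> = Vec::with_capacity(free_vram_mib.len());
    for (i, &vram) in free_vram_mib.iter().enumerate() {
        let scaled = total_layers as u64 * vram;
        // Floor of the exact proportional share.
        shares.push((scaled / total_vram) as u32);
        remainders.push((scaled % total_vram, i));
    }

    // The floors can under-assign; hand leftovers to the largest remainders.
    let assigned: u32 = shares.iter().sum();
    remainders.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    for &(_, i) in remainders.iter().take((total_layers - assigned) as usize) {
        shares[i] += 1;
    }
    shares
}
```

For example, `split_layers(80, &[24_576, 16_384])` assigns 48 layers to a 24 GiB card and 32 to a 16 GiB machine. A real scheduler would likely also weigh interconnect speed, since every split point adds a network hop per token.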

04

Training pipeline

No training. The engineering work is in the runtime, the routing layer, and the trust model: hot-reloading request routing when peers come and go, layer-splitting that survives node disappearance, idempotent retries across the swarm, and an audit trail for public-swarm requests that protects both the requester and the GPU provider. The license is AGPL-3.0 specifically so that the code stays auditable and forkable but can't be wrapped as a closed paid service.
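Idempotent retries can be illustrated with a small sketch: each request carries an ID, a completed result is cached under that ID, and a retry with the same ID returns the cached result instead of running the work again. The `IdempotentExecutor` type and its fields are hypothetical; a real implementation would also need cache eviction, persistence, and handling for requests that are still in flight.

```rust
use std::collections::HashMap;

/// Hypothetical executor that runs each request ID to completion at most once.
struct IdempotentExecutor {
    /// Results of requests that finished successfully, keyed by request ID.
    completed: HashMap<String, String>,
    /// How many times work was actually attempted (for illustration).
    executions: u32,
}

impl IdempotentExecutor {
    fn new() -> Self {
        Self { completed: HashMap::new(), executions: 0 }
    }

    /// Run `work` for `request_id` unless a result is already cached.
    /// Failures are not cached, so the caller can retry with the same ID.
    fn run<F>(&mut self, request_id: &str, work: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        if let Some(done) = self.completed.get(request_id) {
            return Ok(done.clone());
        }
        self.executions += 1;
        let output = work()?;
        self.completed.insert(request_id.to_string(), output.clone());
        Ok(output)
    }
}
```

Because only successes are stored, a request that failed when a peer vanished can be retried on another node under the same ID, while a duplicate of an already-completed request never triggers a second run.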

05

Results & performance

Pre-alpha. Single-host inference (v0.0.1), LAN clustering with round-robin routing (v0.0.2), and mDNS peer discovery, pairing, and model management (v0.0.3) have all shipped end to end. Next are VRAM pooling (v0.0.4+), trusted-peer pairing (v0.1.0), a web UI and Tauri desktop app (v0.1.0+ / v0.2.0+), and the USDC-priced public swarm (v0.3.0+). No public benchmarks yet: reproducible scripts and honest tokens-per-second numbers will land in benchmarks/ once there are real results to report.
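For readers unfamiliar with the round-robin routing mentioned for v0.0.2, the following is a minimal sketch of a rotation that tolerates the peer list changing between requests. The `RoundRobin` type is hypothetical and is not taken from the Unhosted codebase.

```rust
/// Hypothetical round-robin selector over a peer list that may change.
struct RoundRobin {
    peers: Vec<String>,
    next: usize,
}

impl RoundRobin {
    fn new(peers: Vec<String>) -> Self {
        Self { peers, next: 0 }
    }

    /// Replace the peer list (for example after a discovery update),
    /// keeping the cursor within bounds.
    fn set_peers(&mut self, peers: Vec<String>) {
        self.peers = peers;
        if self.peers.is_empty() {
            self.next = 0;
        } else {
            self.next %= self.peers.len();
        }
    }

    /// Return the next peer in rotation, or None if no peers are known.
    fn pick(&mut self) -> Option<&str> {
        if self.peers.is_empty() {
            return None;
        }
        let i = self.next % self.peers.len();
        self.next = (i + 1) % self.peers.len();
        Some(self.peers[i].as_str())
    }
}
```

The cursor is clamped whenever the list changes, so a peer leaving mid-rotation cannot cause an out-of-bounds pick.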