On-device personal AI
Selfweave is a privacy-first personal AI. Local inference, a private memory of your work, and — when you want it — distributed compute: trusted LAN devices today, a cooperative WAN you help build by contributing. Weave yourself into the network.
Selfweave separates what must stay private from what can be shared. You decide where the line sits — and how much of yourself you weave into the network.
Local LLM inference via llama.cpp. Private indexing of your code, documents, and photos. A personal adapter trained from your patterns, never leaving the device.
Run larger models by pooling VRAM across devices. Selfweave's roadmap extends this from LAN peers you choose to a cooperative pool of strangers — governed by attestation tiering, Sybil defense, and contribution-based incentives. The trust-and-economics layer is the differentiator, not the splitting itself.
An optional shared behavioural layer, trained via SPARTA + DiLoCo. Your gradients contribute to a collective intelligence without ever exposing your raw data.
Selfweave is built in four layers. Personal data lives above the encryption boundary and never crosses it in the clear. Distributed operations happen below it — with cryptographic guarantees, not promises.
The BIL is a small LoRA adapter — about 8–50 MB — that encodes aggregate human behaviour patterns: when people focus, how communication tone shifts across contexts, what people typically need at different times of day. It sits between the frozen base model and your private personal adapter. You get the benefit of the collective without the collective ever seeing your data.
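To make the layering concrete, here is a minimal sketch of how the three pieces compose at a single weight matrix; the dimensions, rank, and variable names are illustrative, not Selfweave's actual shapes.

```python
import torch

# Effective weight = frozen base + shared BIL delta + private personal delta.
# Each LoRA delta is a rank-r product B @ A; B starts at zero, the standard init.
d, r = 4096, 16                                        # hidden size, LoRA rank (illustrative)
W_base = torch.randn(d, d)                             # frozen base-model weight, never updated
A_bil, B_bil = torch.randn(r, d), torch.zeros(d, r)    # shared behavioural layer
A_me,  B_me  = torch.randn(r, d), torch.zeros(d, r)    # personal adapter, stays on-device

def forward(x: torch.Tensor) -> torch.Tensor:
    # The two deltas add independently, so the BIL can be updated or dropped
    # without retraining the personal adapter.
    W = W_base + B_bil @ A_bil + B_me @ A_me
    return x @ W.T
```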
The technique is borrowed, not invented. OpenFedLLM showed federated LoRA exchanging 0.06% of parameters per round can outperform a frontier model in a domain (Ye et al., 2024). Selfweave uses 0.1% via SPARTA sparse aggregation, with DiLoCo-style outer-loop synchronisation from DeepMind's distributed-training work and noise-budgeting from (ε, δ)-differential privacy (Dwork & Roth, 2014). Status: roadmap — cryptographic primitives ship in v0.2; federation activates when the peer-defense stack lands in v0.3.
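As a hedged sketch of what one sparse round might exchange: the 0.1% fraction is from the text above, but the magnitude top-k selection rule and tensor shapes here are illustrative stand-ins, not SPARTA's actual selection scheme.

```python
import torch

def sparse_round_payload(delta: torch.Tensor, frac: float = 0.001):
    # Only ~0.1% of adapter-delta coordinates leave the device per round,
    # as (index, value) pairs; the rest stay local.
    flat = delta.flatten()
    k = max(1, int(frac * flat.numel()))
    idx = flat.abs().topk(k).indices          # illustrative rule: largest magnitudes
    return idx, flat[idx]

# An 8M-parameter adapter yields ~8k values per round, noised against the
# (eps, delta)-DP budget before transmission.
idx, vals = sparse_round_payload(torch.randn(8_000_000))
```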
Selfweave's capacity story made concrete: per-user experience as the network grows, and what one peer's contribution covers. Privacy and cost are Selfweave properties at every scale.
| Config | Model | Decode tok/s | Context | Cost / Mtok (in / out) | Privacy | Phase |
|---|---|---|---|---|---|---|
| 1-node RTX 3080 Ti | Llama 3.1 8B Q4 | 89.7 | 128k | $0 | Local | shipped |
| 1-node + speculative | 8B Q4 + 1B draft | ~160–225 (proj.) | 128k | $0 | Local | v0.1.x |
| 3-node LAN | 14B Q4 split | ~12 (proj.) | 32k | $0 | LAN-trusted | v0.2.5 |
| 10-node cooperative pool | 70B Q4 sharded | ~10–30 (proj.) | 128k | $0 + contribute | Attestation-tiered | v0.3+ |
| 100-node mature pool | DeepSeek V3 (671B / 37B-MoE) | similar per-user | 128k+ | $0 + contribute | Attestation-tiered | v0.3+ mature |
| 1k-node mature pool | 405B+ dense | usable, large aggregate | 128k+ | $0 + contribute | Attestation-tiered | v0.3+ mature |
| ChatGPT (GPT-5) | proprietary | 135.8 | 128k | $1.25 / $10.00 | Provider-side | now |
| Claude Sonnet 4.6 | proprietary | 45.3 | 1M | $3.75 / $15.00 | Provider-side | now |
| DeepSeek V3 (hosted) | 671B / 37B-MoE | ~34 (provider median) | 128k | $0.40 / $0.89 | Provider-side | now |
Agentic loops compound WAN RTT across both LLM steps and tool I/O, so naïve split inference suits them poorly; the five layers below stack rather than substitute (level 2 is sketched after the table). Numbers are per-user tok/s on a 70B-class model over a Petals-class swarm (tech-spec §4.8).
| Stack level | Technique | Effective per-user tok/s | Status |
|---|---|---|---|
| 0 — baseline | Naïve split inference | ~5 | — |
| 1 — affinity | KV cache session affinity | ~8–10 | design (RT-008) |
| 2 — token spec | Token-level speculative decoding | ~15–20 | wire-up shipped |
| 3 — step spec | Step-level lookahead reasoning | ~30–50 | design |
| 4 — batching | Cross-loop batching at swarm peers | same per-user, 5–10× per peer | v0.2.5 |
| 5 — overlap | Tool-I/O ⇄ next-step prefill overlap | hides tool latency | design |
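Level 2 is the easiest to picture. A minimal greedy sketch, where callables stand in for the real models; in production, verification is batched into a single target forward pass rather than looped.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # argmax next-token fn, stands in for a model

def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    # Draft proposes k cheap tokens; target verifies and keeps the longest
    # agreeing prefix. With greedy decoding this is lossless: the output is
    # identical to running the target alone, just with fewer target passes.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft(proposal))
    accepted = list(tokens)
    for i in range(k):
        t = target(accepted)             # batched into one pass in practice
        accepted.append(t)               # target's token is always correct
        if t != proposal[len(tokens) + i]:
            break                        # first mismatch ends the accepted run
    return accepted
```

Level 3 applies the same accept-or-correct idea to whole reasoning steps instead of tokens, which is why the two stack.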
Capacity grows along three axes: peer count, model hidden-state size (smaller is better — egress, not compute, is the bottleneck per tech-spec §4.6), and the RT-009 bandwidth-efficiency lever stack (INT4 activation quantisation, adaptive compression at slow links, cross-loop batching at swarm peers, tier stratification). Cells show users servable at 5% concurrency with the §4.7 30–60% jitter discount baked in — a contributing peer still covers more user-pipeline work than its own household consumes. The 405B+ dense row is honest about being structurally bandwidth-heavy: the same swarm serves an order of magnitude more users on MoE-class models. Numbers are projections, not measurements; see the sketch after the table for the arithmetic behind the ratios.
| Model class (hidden state) | Phase 3 launch (v0.3 · ~100 peers · no levers) | Phase 3 + RT-009 initial (v0.3.x · ~1k peers · ~5× stack) | Phase 4 mature, full RT-009 (v0.4+ · ~10k peers · ~10× stack) |
|---|---|---|---|
| 8B (8 KB) | ~120–240 | ~6k–12k | ~120k–240k |
| 14B (10 KB) | ~95–190 | ~4.7k–9.5k | ~95k–190k |
| 70B (16 KB, baseline) | ~60–120 | ~3k–6k | ~60k–120k |
| DeepSeek V3 (14 KB, 671B / 37B-MoE) | ~70–140 | ~3.5k–7k | ~70k–140k |
| 405B+ dense (32 KB) | ~30–60 | ~1.5k–3k | ~30k–60k |
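The hidden-state column drives those ratios. A back-of-envelope sketch, with an illustrative 1 MB/s of usable per-peer egress standing in for the real §4.6 ceiling, which also accounts for pipeline depth and concurrency:

```python
def token_passes_per_s(egress_bytes_per_s: float, hidden_state_bytes: int) -> float:
    # Every generated token must cross each pipeline boundary once, so a
    # peer's egress budget divides directly by the activation it forwards.
    return egress_bytes_per_s / hidden_state_bytes

for name, hs in [("8B", 8_192), ("70B baseline", 16_384), ("405B+ dense", 32_768)]:
    print(f"{name}: ~{token_passes_per_s(1e6, hs):.0f} token-passes/s per peer")
# Halving the hidden state doubles capacity: the same 2x : 1x : 0.5x spread
# as the 8B, 70B, and 405B+ rows above.
```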
Sources: 1-node decode from a community llama.cpp benchmark on RTX 3080 Ti running Llama 3.1 8B Q4_K_M (localscore.ai). Speculative-decoding 1.8–2.5× from inference-speedups design §3.1 (projected). Multi-node figures derived from _config/tech-spec.md §4.6–4.8: per-peer egress ceiling, swarm capacity model, agentic-loop workaround stack. Effective tok/s discounts theoretical peak by 30–60% per the §4.7 jitter / churn / tail-latency caveat — Petals in production runs at ~6 tok/s/user, so this discount is empirical, not pessimistic. Phase-3 lever multipliers (~5× initial, ~10× full stack) reflect cross-loop batching at swarm peers (§4.8 layer 4: same per-user latency, ~5–10× per-peer throughput) compounded with INT4 activation quantisation (~2× egress per token-pass) and AdaTopK adaptive compression at slow links — all tracked under RT-009; benchmarks pending, ranges are conservative midpoints. Cloud numbers from artificialanalysis.ai and vendor pricing pages, May 2026. Projections labelled (proj.) are design targets, not measurements.
Selfweave exposes an OpenAI-compatible API and a local editor proxy with project-scoped RAG. Your codebase is indexed on-device; completions, fill-in-the-middle, and chat all run against the same private context.
No cloud round-trip. No prompts mined for training. No telemetry of what you type.
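From the client side it looks like any hosted endpoint. A minimal sketch; the port, key, and model name here are illustrative, so check your local Selfweave config.

```python
from openai import OpenAI

# Same client code you'd use against a cloud provider, pointed at localhost.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="selfweave-local")
resp = client.chat.completions.create(
    model="llama-3.1-8b-q4",
    messages=[{"role": "user", "content": "Summarise my notes on RT-009."}],
)
print(resp.choices[0].message.content)
```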
Selfweave is in active development. Shipped features work today on Windows; roadmap items are being built in the open.
Completions, fill-in-the-middle, embeddings. Project-scoped RAG with a debounced file watcher. OpenAI-compatible, editor-agnostic.
Query your own document library with citation-grounded responses. Widened search parameters and a structured five-element prompt.
Dual FAISS indices — CLIP for photos, MiniLM for text. Search your life in natural language, offline.
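A hedged sketch of the text half of that pattern; the model name and 384-dim size are real MiniLM defaults, the corpus is a stand-in, and the photo half is the same shape with 512-dim CLIP image embeddings.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # 384-dim text embeddings
docs = ["trip itinerary", "2024 tax notes", "beach photo shot list"]
vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(384)                             # cosine via inner product on unit vectors
index.add(np.asarray(vecs, dtype="float32"))

q = model.encode(["where did I plan the beach trip?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), k=2)
print([docs[i] for i in ids[0]])                           # nearest documents, fully offline
```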
Honesty is measurable. Lexical overlap, agreement markers, and a benchmark harness gate every release. No yes-men.
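A toy sketch of what such a gate checks; the markers and threshold are illustrative, and the real harness derives from the ELEPHANT benchmark cited in the research section.

```python
AGREEMENT = ("you're right", "great point", "absolutely", "i agree")

def lexical_overlap(user: str, reply: str) -> float:
    u, r = set(user.lower().split()), set(reply.lower().split())
    return len(u & r) / max(1, len(r))      # share of reply words echoed from the user

def flags_sycophancy(user: str, reply: str) -> bool:
    opener = reply.strip().lower()
    return opener.startswith(AGREEMENT) or lexical_overlap(user, reply) > 0.6

print(flags_sycophancy("My plan is perfect, right?",
                       "You're right, the plan is perfect."))   # True
```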
Three profiles — Strict, Basic, Permissive. Windows Job Objects + AppContainer + restricted tokens. Consent modal before elevation.
Tag-based tool use with a registry + executor. Extend the assistant with local tools — no remote webhook required.
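A minimal sketch of the registry-plus-executor pattern; the tag grammar and function names here are illustrative, not Selfweave's actual schema.

```python
import re
from typing import Callable, Dict

REGISTRY: Dict[str, Callable[[str], str]] = {}

def tool(name: str):
    # Registering a local function makes it callable from model output.
    def register(fn: Callable[[str], str]):
        REGISTRY[name] = fn
        return fn
    return register

@tool("word_count")
def word_count(args: str) -> str:
    return str(len(args.split()))

TAG = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.S)

def execute(model_output: str) -> str:
    # The executor swaps each tag for its tool's result; nothing leaves the device.
    return TAG.sub(lambda m: REGISTRY[m.group(1)](m.group(2)), model_output)

print(execute('Count: <tool name="word_count">four words right here</tool>'))  # Count: 4
```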
Pool VRAM across household devices via QR-paired peers. Pipeline-parallel split over libp2p Noise transport — the same primitive that extends to the cooperative pool when the trust layer ships.
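As a rough sketch of the splitting primitive: assign contiguous layer ranges in proportion to each peer's free VRAM. Peer names and figures are made up, and the real planner also weighs link speed.

```python
def split_layers(n_layers: int, peers: dict[str, float]) -> dict[str, range]:
    # Proportional split by free VRAM (GB); the last peer absorbs rounding.
    total = sum(peers.values())
    plan, start = {}, 0
    names = list(peers)
    for i, name in enumerate(names):
        count = (n_layers - start if i == len(names) - 1
                 else round(n_layers * peers[name] / total))
        plan[name] = range(start, start + count)
        start += count
    return plan

print(split_layers(40, {"desktop": 12.0, "laptop": 8.0, "htpc": 4.0}))
# {'desktop': range(0, 20), 'laptop': range(20, 33), 'htpc': range(33, 40)}
```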
Contribute idle GPU to the network, earn priority routing. Contribution-based incentive, no tokens, no speculation.
Shared behavioural intelligence trained across users via SPARTA + DiLoCo. You get the benefit of the collective without sharing the data.
These are not promises — they are properties of the system.
Selfweave is an integration of peer-reviewed research and proven open-source projects, not a stack of hopes. Each capability below traces back to published work — verifiable, attributable, replicable.
OpenFedLLM (Ye et al., 2024) — federated LoRA exchanging 0.06% of parameters per round outperformed GPT-4 in financial-domain benchmarks. Selfweave uses 0.1% via SPARTA sparse aggregation.
DiLoCo (DeepMind) and INTELLECT-1 / INTELLECT-2 (Prime Intellect, arXiv 2505.07291, 2025) — 32B-parameter distributed RL across continents on consumer-class infrastructure. Validates Selfweave's post-v1.0 pre-training direction.
(ε, δ)-DP (Dwork & Roth, 2014). ε = 1.0 matches Apple iOS analytics and academic federated-healthcare research — strong enough to be a meaningful guarantee, loose enough for the BIL to actually learn.
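For scale, the classic Gaussian-mechanism calibration from that book shows the noise level such a budget implies for a single release; a sketch only, since the actual BIL accountant must compose the budget across training rounds.

```python
from math import sqrt, log

def gaussian_sigma(eps: float, delta: float, sensitivity: float = 1.0) -> float:
    # Dwork & Roth's analytic bound: sigma >= sqrt(2 ln(1.25/delta)) * S / eps.
    return sqrt(2 * log(1.25 / delta)) * sensitivity / eps

print(round(gaussian_sigma(1.0, 1e-5), 2))   # ~4.84x the clip norm at eps = 1.0
```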
EAGLE-3 (Li et al., arXiv 2503.01840, 2025) and Speculative Streaming (Bhendawade et al., arXiv 2402.11131, 2024). Selfweave wires llama.cpp's --model-draft mechanism for lossless decode speedup.
Lookahead Reasoning (ICLR 2026) — orthogonal to token-level speculation; both stack. Aimed at agentic loops where reasoning steps, not tokens, are the speculation unit.
Petals (Borzunov et al., arXiv 2312.08361, 2023) and Parallax (Gradient Network, arXiv 2509.26182, 2025). Pipeline-parallel inference over commodity links — Selfweave inherits the pattern, adds attestation tiering and Sybil defense.
HBCP / BCIO (Mac Aonghusa & Michie, 2020) — the Human Behaviour-Change Project's ontology. Selfweave's 30–50-concept behavioural map mirrors its role: a unified pipeline-friendly representation of diverse natural-language behaviours.
Artificial Behaviour Intelligence (Jo et al., arXiv 2505.03315, 2025) — formalises behaviour understanding in cultural and situational context, with calibrated uncertainty. Direct inspiration for the BIL's context modelling.
ELEPHANT benchmark (Cheng et al., arXiv 2505.13995, 2025) — measures social sycophancy in LLM responses. Selfweave runs a derivative harness before every adapter release and base-model upgrade.
Selfweave's core features run forever on your own hardware at no cost. Paid tiers cover managed conveniences we can't run on your laptop.
The complete local AI. Your hardware, your rules.
End-to-end sync + managed conveniences. Privacy intact.
Everything in Plus, at professional scale.
Selfweave is a long-term solo project. Progress is steady, not flashy. Here's what's real and what's next.
Selfweave is in closed pre-release. Join the waitlist for a build, a changelog entry per week, and a direct line to the developer. Then, when you're ready, weave yourself into the network.