The operating layer for on-device AI

On-device AI that scales.

One API for every backend. Models that ship over the air. Runs anywhere with a chip, and falls back to the cloud when the device can't serve a request.

Swift · run on-device
iOS · macOS · watchOS · tvOS · visionOS
let model = try XybridModelLoader.fromRegistry(modelId: "smollm2-360m").load()

let result = try model.run(envelope: Envelope.text(text: "Explain quantum computing"))

print(result.text)
● 112 tok/s · on-device
Text generation · smollm2-360m
Run a small language model on-device with cloud fallback.
Apache-2.0 · v0.1.0 · day-one SOTA

Built for leading open-weight models

−80% inference cost vs. cloud-only
14 ms p50 latency on-device (360m)
Day 0 support for new SOTA models
7 first-class SDK languages
How it routes

The router picks where each call runs.

Each request is routed based on model size, available memory, battery, and your latency budget. Requests run on-device by default and fall back to the cloud when the device can't serve them.
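A minimal sketch of what a per-request routing decision could look like, assuming the router weighs the four signals named above. Every type, field, and threshold here (Route, DeviceState, RequestProfile, pickRoute, the 15% battery floor) is hypothetical and is not the Xybrid router's actual API or policy.

```swift
// Hypothetical types for illustration; none of these names come from the Xybrid SDK.
enum Route: Equatable {
    case onDevice
    case hybrid
    case cloud
}

struct DeviceState {
    var freeMemoryMB: Int
    var batteryLevel: Double   // 0.0 ... 1.0
    var isCharging: Bool
}

struct RequestProfile {
    var modelSizeMB: Int
    var estimatedOnDeviceLatencyMs: Int
    var latencyBudgetMs: Int
    var canSplit: Bool         // true if a light stage (e.g. embeddings) can stay local
}

func pickRoute(request: RequestProfile, device: DeviceState) -> Route {
    let fitsInMemory = request.modelSizeMB <= device.freeMemoryMB
    // Assumed policy: avoid heavy local work below 15% battery unless charging.
    let batteryOK = device.isCharging || device.batteryLevel >= 0.15
    let meetsBudget = request.estimatedOnDeviceLatencyMs <= request.latencyBudgetMs

    if fitsInMemory && batteryOK && meetsBudget {
        return .onDevice       // default path
    }
    if request.canSplit && batteryOK {
        return .hybrid         // keep the light stage local, send the rest to the cloud
    }
    return .cloud              // fallback
}
```

The checks run from cheapest to most expensive route, so the cloud is only chosen when neither local option qualifies.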

[Diagram: chat() (text in), transcribe() (audio in), and embed() (vectors in) feed route(), which sends each call to On-device (default), Hybrid (mixed), or Cloud (fallback).]

On-device (default)

Phone, laptop, console. Zero round-trip, zero spend.

Hybrid (mixed)

Embeddings on-device; long-context generation in the cloud.

Cloud (fallback)

Used when the device can't serve the request. Same API, same envelope.

No compromise

Everything you get from cloud. None of the bills.

One runtime that ships natively to every device, with cloud as a soft fallback — billed only when the device truly couldn't handle it.

Capability | Cloud only | On-prem | Full on-device | Xybrid (recommended)
Runs offline
Zero per-token cost
Same SDK across platforms
Telemetry + cost analytics
OTA model updates
Vendor lock-in · high · low
Falls back when device can't
Platform

Integrate once. Ship forever.

Six primitives that make on-device AI shippable. Use the runtime alone, or pair it with the platform to operate models at fleet scale.

Every backend, one API

Bring your own SDK target — iOS, Android, Flutter, Unity, Linux/Edge. The same envelope runs everywhere.

// runtime
Seamless device ↔ cloud

Route per request based on model size, battery, and latency budget. Fall back to cloud automatically when needed.

// router
OTA model updates

Ship new models without touching code. Roll out as a canary, region-scoped, or fleet-wide. Roll back in one click.

// platform
Evals across runtimes

Model-aware harness with prompt libraries. Compare quality across chips, OS versions, and quantization.

// platform
Per-model optimization

Same intent, tuned per backend. The same prompt runs cheaper on every chip — automatically.

// platform
Private by default

Data stays on-device. Telemetry is opt-in and aggregated. SOC-2 Type II in progress.

// runtime

One SDK. Every platform.

Write your AI pipeline once and deploy it natively across mobile, desktop, and game engines — with hardware acceleration on every target.

Console

Operate fleets without leaving a tab.

One UI for telemetry, devices, models, registry, keys, and settings — built for operators running on-device AI at fleet scale.

console.xybrid.dev / telemetry

Telemetry

Last 24h · 1,284 traces · 3 routes
p50 latency: 14 ms (on-device)
p95 latency: 92 ms (fallback)
route mix: 78% on-device · 14% hybrid · 8% cloud
cost saved: $1,847 vs. cloud-only
[Chart: throughput, req/min]
Trace · req_8f4a · 96 ms total · success
Spans: request, guardrails.in, router.pick, embed, on-device.run, decode, guardrails.out, telemetry.flush
Open the console ↗
Open source

The runtime is open source.

Read the code. Run it locally. Fork the harness. The platform is the optional layer; the runtime is yours forever.

xybrid-ai/xybrid
Apache-2.0 · no copyleft
Questions

Common questions.

What runs on-device vs. in the cloud?
Anything that fits in your device's memory budget runs locally. The router decides per-request based on model size, available memory, battery state and your latency budget. If a request can't be served on-device, it falls back to a hosted endpoint — same response shape, same SDK call.
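As a rough illustration of "same response shape, same SDK call", the sketch below tries a local runner first and falls back to a hosted one when the device reports a capacity error. Response, OnDeviceError, and runWithFallback are hypothetical names for this example; the source does not show how the SDK implements fallback internally.

```swift
// Hypothetical types; the source only states that both paths share a response shape.
struct Response: Equatable {
    var text: String
    var servedBy: String   // "on-device" or "cloud"
}

enum OnDeviceError: Error {
    case modelTooLarge
    case lowMemory
}

func runWithFallback(
    prompt: String,
    onDevice: (String) throws -> Response,
    cloud: (String) throws -> Response
) throws -> Response {
    do {
        return try onDevice(prompt)
    } catch is OnDeviceError {
        // Only capacity-style failures fall back; any other error propagates to the caller.
        return try cloud(prompt)
    }
}
```

Because both closures return the same Response type, calling code does not need to know which path served the request.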
Which models are supported on day one?
Every model we ship goes through an automated harness that produces native artifacts for Core ML, NNAPI, TFLite, ONNX, and Metal. New SOTA open-weight models are typically available within 24h of release.
What does an OTA model update look like?
Push a new model version through the console (or the API). Define a rollout — canary, region-scoped, or fleet-wide — and your devices fetch the artifact on next launch. No app store review. Rollback is one click.
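One way a device could decide at launch whether a rollout applies to it is sketched below. This is an assumption-laden illustration: RolloutScope, Rollout, Device, stableBucket, and shouldFetch are hypothetical names, and the source does not describe how the platform actually assigns devices to a canary.

```swift
// Hypothetical rollout model; not the Xybrid console or API schema.
enum RolloutScope {
    case canary(percent: Int)     // share of the fleet, 0...100
    case regions(Set<String>)     // e.g. ["eu", "us"]
    case fleetWide
}

struct Rollout {
    var modelId: String
    var version: String
    var scope: RolloutScope
}

struct Device {
    var id: String
    var region: String
}

// FNV-1a hash, so a device lands in the same bucket on every launch.
// Swift's built-in hashValue is seeded per process and would not be stable.
func stableBucket(_ deviceId: String) -> Int {
    var hash: UInt32 = 2166136261
    for byte in deviceId.utf8 {
        hash ^= UInt32(byte)
        hash = hash &* 16777619   // wrapping multiply
    }
    return Int(hash % 100)        // bucket in 0..<100
}

func shouldFetch(_ rollout: Rollout, on device: Device) -> Bool {
    switch rollout.scope {
    case .canary(let percent):
        return stableBucket(device.id) < percent
    case .regions(let regions):
        return regions.contains(device.region)
    case .fleetWide:
        return true
    }
}
```

With stable bucketing, raising a canary from 5% to 25% only adds devices and never removes ones that already fetched the new version.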
Is this just a wrapper around llama.cpp?
No. The runtime is a backend abstraction: Core ML, NNAPI, ONNX, llama.cpp, Metal and others sit underneath. You write to one API and we pick the cheapest, fastest path for the request.
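To make the "backend abstraction" idea concrete, here is a small sketch of one API in front of several backends, picking the fastest one that supports a given model format. InferenceBackend, StubBackend, and pickBackend are hypothetical; the real runtime's backend interface and selection criteria are not shown in the source.

```swift
// Hypothetical protocol; not the runtime's actual backend interface.
protocol InferenceBackend {
    var name: String { get }
    var estimatedLatencyMs: Int { get }   // assumed to be known per device and model
    func supports(modelFormat: String) -> Bool
    func run(prompt: String) throws -> String
}

// Stand-in backend so the example runs without any real inference engine.
struct StubBackend: InferenceBackend {
    let name: String
    let estimatedLatencyMs: Int
    let formats: Set<String>

    func supports(modelFormat: String) -> Bool { formats.contains(modelFormat) }
    func run(prompt: String) throws -> String { "[\(name)] \(prompt)" }
}

// Returns the lowest-latency backend that can load the format, or nil if none can.
func pickBackend(for modelFormat: String, from backends: [InferenceBackend]) -> InferenceBackend? {
    backends
        .filter { $0.supports(modelFormat: modelFormat) }
        .min { $0.estimatedLatencyMs < $1.estimatedLatencyMs }
}
```

A nil result is the natural point to hand the request to a cloud fallback.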
How is pricing structured?
The runtime is free and open-source. The platform — OTA updates, evals, fleet metrics — is priced per active device per month with a generous free tier for teams under 10k MAU.
Do you handle telemetry / privacy?
Data stays on-device unless a request is routed to the cloud. Telemetry is opt-in and aggregated; no prompt or response content leaves the device by default. SOC-2 Type II in progress.
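The sketch below shows one way "opt-in and aggregated" telemetry could work: traces are reduced to counts and a median latency, and prompt text is never copied into the summary. Trace, TelemetrySummary, and summarize are hypothetical names, not the SDK's telemetry schema.

```swift
// Hypothetical telemetry shapes; field names are illustrative only.
struct Trace {
    var route: String      // "on-device", "hybrid", or "cloud"
    var latencyMs: Int
    var prompt: String     // stays on the device; never copied into a summary
}

struct TelemetrySummary: Equatable {
    var requestCount: Int
    var countsByRoute: [String: Int]
    var p50LatencyMs: Int
}

// Returns nil when the user has not opted in, so nothing is available to upload.
func summarize(_ traces: [Trace], optedIn: Bool) -> TelemetrySummary? {
    guard optedIn, !traces.isEmpty else { return nil }

    var counts: [String: Int] = [:]
    for trace in traces {
        counts[trace.route, default: 0] += 1
    }

    let sorted = traces.map { $0.latencyMs }.sorted()
    let p50 = sorted[(sorted.count - 1) / 2]   // lower median

    return TelemetrySummary(requestCount: traces.count, countsByRoute: counts, p50LatencyMs: p50)
}
```

Since TelemetrySummary has no text fields, prompt and response content cannot leak through this path by construction.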
v0.1.0 · free to prototype

Integrate once. Ship anywhere.

Get an API key in 30 seconds. The SDK is open source — you're never blocked on us.

xybrid-ai/xybrid · Apache-2.0