Most founders building an AI feature default to one decision without realizing they made it: route everything to a cloud API. It's the obvious move — until the monthly bill scales with usage, the feature stalls on bad connectivity, and a privacy review asks why user data is leaving the device at all.
The more interesting question in 2026 isn't “which model.” It's “where does the inference run.” The phone in your user's pocket can often return a result faster than a round-trip to a rented cloud instance, and running on it costs essentially nothing per call. Here's how we decide between on-device, cloud, or both for a real product.
Key takeaways
- The first AI decision is placement — on-device vs cloud — not model choice. It drives cost, latency, and privacy.
- On-device inference costs essentially nothing per call; cloud AI is metered, commonly $5–$150+ per month per workload, and scales with usage.
- Recent iPhones (iPhone 15 Pro and newer) ship 30+ TOPS of ML throughput and hit sub-20ms latency for production vision models, often faster than a cloud round-trip.
- Use on-device for privacy-sensitive, high-frequency, or offline features; use the cloud for large generative models; combine them deliberately.
The real first decision: where does inference run?
Founders frame AI as a model-selection problem. In a mobile product, placement matters more. The same feature behaves completely differently depending on whether it runs on the device or in the cloud — different cost curve, different latency floor, different privacy story. Pick placement first; the model follows.
And the hardware has quietly made on-device the serious default. iPhone 15 Pro and newer ship 30+ TOPS of on-device ML throughput (the base iPhone 15's older A16 chip is rated at roughly half that), and production computer-vision models now run in under 20 milliseconds locally. By most accounts the majority of AI features shipping in 2026 run on-device, not as a research curiosity, but because the economics and the experience are better for a large class of features.
On-device: free per call, private, instant — within limits
Once a model is on the device, each inference costs essentially nothing: no per-query charge, no token bill at month-end. Data never leaves the phone, which turns a privacy liability into a feature. And there's no network round-trip, so it works offline and feels instant.
The limits are real, though. You're bounded by model size and the device's memory, the app download grows with bundled models, and the largest generative models still don't fit. On-device shines for computer vision, classification, transcription, and small language models — not for running a frontier-scale LLM on a phone.
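To make the memory limit concrete, here is a rough sketch of the arithmetic: weight storage is approximately parameter count times bits per weight, divided by eight. The 2 GB budget and the three example models are assumptions chosen for illustration, not measurements from any shipped app, and the estimate ignores activations and runtime overhead.

```python
def model_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters x bits per weight / 8."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_on_device(params_billion: float, bits_per_weight: int,
                   memory_budget_gb: float) -> bool:
    """True if the weights alone fit inside the assumed memory budget."""
    return model_footprint_gb(params_billion, bits_per_weight) <= memory_budget_gb


# Hypothetical budget: RAM the app can reasonably devote to model weights.
BUDGET_GB = 2.0

# Hypothetical models: (label, parameters in billions, bits per weight).
candidates = [
    ("small vision model", 0.025, 16),
    ("3B language model, 4-bit", 3.0, 4),
    ("70B language model, 4-bit", 70.0, 4),
]

for label, params_b, bits in candidates:
    size = model_footprint_gb(params_b, bits)
    verdict = "fits" if fits_on_device(params_b, bits, BUDGET_GB) else "too large"
    print(f"{label}: {size:.2f} GB -> {verdict}")
```

Under these assumptions the vision model and the quantized 3B model fit, while the 70B model does not, which is the pattern the paragraph above describes.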
Cloud: power and flexibility, at a metered cost
The cloud is where you reach for large generative models, frequently updated models, and anything too heavy to ship in an app bundle. The trade-offs are the mirror image of on-device: a bill that scales with usage (commonly $5–$150+ per month per workload), latency that depends on the network, and user data leaving the device, which becomes a compliance question the moment that data is health-related.
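A quick way to see how a metered bill scales is to multiply it out. The sketch below assumes a flat per-call price; the $0.001 figure and the usage numbers are hypothetical placeholders, so substitute your provider's actual per-call or per-token rate.

```python
def monthly_cloud_cost(users: int, calls_per_user_per_day: float,
                       price_per_call_usd: float, days: int = 30) -> float:
    """Metered cost: every call is billed, so cost grows linearly with usage."""
    return users * calls_per_user_per_day * days * price_per_call_usd


# Hypothetical price, not a quote from any provider.
PRICE_PER_CALL_USD = 0.001

for users in (100, 1_000, 10_000):
    cost = monthly_cloud_cost(users, calls_per_user_per_day=5,
                              price_per_call_usd=PRICE_PER_CALL_USD)
    print(f"{users:>6} users: ${cost:,.2f}/month in the cloud, "
          f"$0.00 per call on-device")
```

The point is the shape of the curve rather than the specific numbers: ten times the users means ten times the cloud bill, while the on-device per-call cost stays at zero.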
How we actually decide
We run four questions against the specific feature, not the product as a whole — most apps end up mixing both placements.
- Is the data sensitive? Health, biometric, or personal data leans hard toward on-device, especially in a regulated build.
- How often does it run? A feature invoked constantly is a runaway cloud bill and an obvious on-device candidate.
- Does it need to work offline or instantly? If yes, on-device is the only honest answer.
- How big is the model? Frontier-scale generation stays in the cloud; the rest can likely run locally.
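The four questions above can be sketched as a small per-feature decision function. This is one possible encoding, not a definitive rule set: the `Feature` fields, the feature names, and the order of the checks (model size first, since a model that cannot fit cannot run locally whatever the other answers are) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    sensitive_data: bool            # health, biometric, or personal data
    high_frequency: bool            # invoked constantly
    needs_offline_or_instant: bool  # must work without a network or feel instant
    model_fits_on_device: bool      # small enough to ship and run locally


@dataclass(frozen=True)
class Decision:
    placement: str  # "on-device" or "cloud"
    reason: str


def decide_placement(feature: Feature) -> Decision:
    # Assumed ordering: size is checked first because it is a hard constraint.
    if not feature.model_fits_on_device:
        return Decision("cloud", "model is too large to run locally")
    if feature.sensitive_data:
        return Decision("on-device", "sensitive data stays on the phone")
    if feature.needs_offline_or_instant:
        return Decision("on-device", "must work offline or instantly")
    if feature.high_frequency:
        return Decision("on-device", "constant use would run up a metered bill")
    return Decision("on-device", "default: nothing requires the cloud")


# Hypothetical features, for illustration only.
features = [
    Feature("label_scan", True, True, True, True),
    Feature("open_ended_chat", False, False, False, False),
]

for f in features:
    d = decide_placement(f)
    print(f"{f.name}: {d.placement} ({d.reason})")
```

Running the function per feature rather than per app is what produces the mixed result most products end up with.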
| Dimension | On-device (Core ML / MLX) | Cloud API |
|---|---|---|
| Cost per inference | Essentially free after download | Metered; scales with usage |
| Latency | Sub-20ms for vision; no round-trip | Network-dependent |
| Privacy | Data stays on device | Data leaves device — a compliance question |
| Offline | Works | Fails without connectivity |
| Model size ceiling | Bounded by device memory | Effectively unbounded |
The recommendation: default on-device, reach for cloud deliberately
The stance: start by assuming a feature runs on-device, and justify each move to the cloud rather than the reverse. The default most teams use — cloud-everything — is the one that quietly creates the cost, latency, and privacy problems they later have to engineer around. Reaching for the cloud only when the model genuinely demands it gives you a faster, cheaper, more private product by construction.
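One way a team might enforce that default in a codebase is to make the cloud the placement that needs a written reason. The sketch below is a hypothetical convention, not a description of any particular framework: `FeaturePlacement` and the feature names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeaturePlacement:
    feature: str
    placement: str = "on-device"              # the default needs no argument
    cloud_justification: Optional[str] = None

    def __post_init__(self) -> None:
        if self.placement not in ("on-device", "cloud"):
            raise ValueError(f"unknown placement: {self.placement!r}")
        # Moving to the cloud is allowed, but only with a stated reason.
        if self.placement == "cloud" and not (self.cloud_justification or "").strip():
            raise ValueError(f"{self.feature}: cloud placement needs a justification")


# Hypothetical registry of features.
registry = [
    FeaturePlacement("label_scan"),
    FeaturePlacement("open_ended_chat", placement="cloud",
                     cloud_justification="generative model too large to bundle"),
]

try:
    FeaturePlacement("photo_tagging", placement="cloud")
except ValueError as err:
    print(f"rejected: {err}")
```

The design choice is that on-device requires no explanation and the cloud does, so the justification becomes part of code review rather than an afterthought.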
What this looks like in practice
The products we build lean on this. For FungeeLLC, we built IngrediCheck, which scans barcodes and labels for dietary and allergy needs: computer vision that has to be instant and works best locally. We also built KIN Calendar, a voice-first app that parses photos into events, where on-device handling keeps a family's personal data private. Our own MedLogsRx scans prescriptions, exactly the kind of sensitive, high-frequency task that argues against shipping raw data to a server.
The placement decision also interacts with your stack and your timeline. Heavy native ML work is one of the things that pushes a product toward native iOS, which we walk through in our native vs hybrid framework. It also changes the budget, covered in what an AI-heavy build actually costs. If the data is health-related, on-device placement and HIPAA architecture are the same conversation.
FAQ
Should my app's AI run on-device or in the cloud?
Decide per feature, not per app. Use on-device for privacy-sensitive, high-frequency, or offline work — it costs essentially nothing per call and runs in milliseconds. Use the cloud for large generative models too big to ship on a phone. Most real products combine both, defaulting to on-device.
Is on-device AI cheaper than calling a cloud API?
For high-volume features, almost always. Once a model is downloaded, each inference has no per-query cost, while cloud AI is metered and commonly runs $5–$150+ per month per workload, scaling with usage. The trade-off is that on-device models are bounded by device memory, so the largest generative models still need the cloud.
Can an iPhone really run AI models locally?
Yes. iPhone 15 Pro and newer ship 30+ TOPS of on-device ML throughput, and production computer-vision models run in under 20 milliseconds locally, often faster than a cloud round-trip. Apple's Core ML and MLX frameworks make vision, transcription, and small language models practical on the device today.
Is on-device AI better for privacy and HIPAA?
Generally yes. When inference runs on-device, sensitive data never leaves the phone, which removes an entire class of privacy and compliance exposure. For health data, that often makes on-device the default and the cloud the exception you justify. Confirm any specific HIPAA requirement with counsel before relying on it.
Building an AI feature into your mobile product? Book a free 30-min call →
