A managed on-device AI runtime for Flutter that solves the real-world problems of thermal throttling, memory crashes, and session instability that plague most mobile AI demos.
Edge-Veda is tackling one of the most frustrating problems in mobile AI development: demos that look great in controlled environments but completely fall apart when real users start interacting with them. The project has identified that modern on-device AI applications suffer from three critical failure modes—thermal throttling that collapses throughput, memory spikes that cause silent crashes, and sessions that become unstable after just 60 seconds of use.
What makes Edge-Veda different is its focus on behavior over time rather than benchmark bursts. Instead of just making AI models run on phones, it's building a supervised runtime that keeps models alive across long sessions, adapts automatically to thermal and memory pressure, and provides structured observability for debugging. The runtime is private by default with zero cloud dependencies during inference.
Core Capabilities
The system supports text, vision, and speech models running fully on device. For text generation, it maintains persistent workers that keep models loaded in memory across entire sessions, using a streaming token generation architecture with pull-based processing. Multi-turn chat sessions are managed with automatic context summarization when the conversation exceeds available memory.
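As a rough illustration of that pull-based streaming pattern, a consumer might look something like the sketch below; the EdgeVeda class, its load constructor, and generateStream are hypothetical names standing in for the real SDK surface:

```dart
import 'dart:io';
import 'package:edge_veda/edge_veda.dart'; // hypothetical import

Future<void> demo() async {
  // Hypothetical: the persistent worker loads the model once and keeps it
  // resident in memory for the rest of the session.
  final veda = await EdgeVeda.load(modelPath: 'models/llama-3.2-1b-q4.gguf');

  // Tokens arrive as a Dart Stream; the loop awaits each one, so the UI
  // consumes output at its own pace instead of buffering a whole reply.
  await for (final token in veda.generateStream('Summarize my day in one line')) {
    stdout.write(token);
  }
}
```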
Speech-to-text comes via whisper.cpp with Metal GPU acceleration, providing real-time streaming transcription: audio is captured natively at 48kHz, downsampled to 16kHz, and transcribed in 3-second chunks. Each chunk takes approximately 670ms on an iPhone with Metal GPU using the whisper-tiny.en model (77MB).
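Those numbers imply a simple 3:1 reduction from the 48kHz capture rate to the 16kHz that whisper.cpp expects, with each 3-second chunk holding 48,000 samples. A self-contained sketch of that step (not the SDK's actual resampler, which would apply proper filtering):

```dart
import 'dart:typed_data';

/// Naive 48kHz -> 16kHz downsample by averaging each group of 3 samples.
/// A real resampler would low-pass filter first; this is only a sketch.
Float32List downsampleTo16k(Float32List pcm48k) {
  final out = Float32List(pcm48k.length ~/ 3);
  for (var i = 0; i < out.length; i++) {
    final j = i * 3;
    out[i] = (pcm48k[j] + pcm48k[j + 1] + pcm48k[j + 2]) / 3.0;
  }
  return out;
}

/// A 3-second chunk at 16kHz: each such buffer is handed to the whisper
/// worker, which transcribes it in roughly 670ms on device.
const samplesPerChunk = 16000 * 3; // 48,000 samples
```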
For structured output and function calling, Edge-Veda implements GBNF grammar-constrained generation for structured JSON output, along with tool/function calling support through ToolDefinition and ToolRegistry systems. This enables multi-round tool chains with configurable maximum rounds and automatic tool call/result cycling.
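GBNF is llama.cpp's grammar notation, so a constraint for a fixed JSON shape might look like the grammar below; the grammar syntax is standard GBNF, while embedding it as a Dart string and the commented-out generate call are assumptions about how the SDK accepts it:

```dart
// Standard GBNF (llama.cpp grammar format). Sampling is restricted so the
// model can only emit {"city": "<letters>", "celsius": <integer>}.
const weatherGrammar = r'''
root   ::= "{" ws "\"city\"" ws ":" ws string ws "," ws "\"celsius\"" ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z ]+ "\""
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
''';

// Hypothetical call shape: the grammar is compiled natively and applied as a
// token mask during sampling.
// final json = await veda.generate(prompt, grammar: weatherGrammar);
```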
Runtime Supervision
The most innovative aspect of Edge-Veda is its combination of compute budget contracts and an adaptive runtime policy. The system continuously monitors device thermal state (nominal to critical), available memory, and battery level, then dynamically adjusts the quality-of-service level:
- Full: 2 FPS, 640px resolution, 100 tokens, no pressure
- Reduced: 1 FPS, 480px resolution, 75 tokens under thermal warning or low battery
- Minimal: 1 FPS, 320px resolution, 50 tokens under serious thermal or critical battery
- Paused: 0 FPS when thermal is critical or memory is dangerously low
Thermal escalation is immediate because spikes are dangerous, but restoration requires cooldown periods (60 seconds per level) and happens gradually to prevent oscillation between quality levels.
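A compact sketch of that policy, mirroring the tiers and the asymmetric escalate-fast, recover-slowly behavior described above (the enum and class are illustrative, not the shipped scheduler):

```dart
enum Qos { full, reduced, minimal, paused }

class AdaptivePolicy {
  Qos current = Qos.full;
  DateTime _lastChange = DateTime.now();
  static const cooldown = Duration(seconds: 60); // per recovery step

  /// Map raw pressure signals to a target tier (see the list above).
  Qos _target({required int thermal, required bool lowBattery,
               required bool criticalBattery, required bool lowMemory}) {
    if (thermal >= 3 || lowMemory) return Qos.paused;        // critical
    if (thermal == 2 || criticalBattery) return Qos.minimal; // serious
    if (thermal == 1 || lowBattery) return Qos.reduced;      // warning
    return Qos.full;
  }

  void update({required int thermal, required bool lowBattery,
               required bool criticalBattery, required bool lowMemory}) {
    final target = _target(thermal: thermal, lowBattery: lowBattery,
        criticalBattery: criticalBattery, lowMemory: lowMemory);
    if (target.index > current.index) {
      // Escalate immediately: thermal and memory spikes are dangerous.
      current = target;
      _lastChange = DateTime.now();
    } else if (target.index < current.index &&
        DateTime.now().difference(_lastChange) >= cooldown) {
      // Recover one tier at a time, at most once per cooldown window,
      // to avoid oscillating between quality levels.
      current = Qos.values[current.index - 1];
      _lastChange = DateTime.now();
    }
  }
}
```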
Architecture
Edge-Veda uses a layered architecture where Flutter apps communicate with persistent isolates for inference, vision, and speech processing. The key constraint is that Dart FFI is synchronous, so all inference runs in background isolates to prevent UI freezing. Native pointers never cross isolate boundaries, and workers maintain persistent contexts so models load once and stay in memory.
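The isolate pattern itself is plain Dart; a generic sketch of it is shown below, with the commented-out ev_* bindings standing in as assumptions for the real native calls. Only plain Dart values cross the isolate boundary:

```dart
import 'dart:isolate';

/// Runs in a background isolate. The native context lives only here, so the
/// model loads once and native pointers never leave this isolate.
void inferenceWorker(SendPort replies) async {
  final commands = ReceivePort();
  replies.send(commands.sendPort); // handshake: hand the main isolate our inbox
  // Hypothetical synchronous FFI calls happen here, off the UI thread:
  //   final ctx = bindings.ev_load_model('model.gguf');
  await for (final prompt in commands) {
    //   final text = bindings.ev_generate(ctx, prompt as String);
    replies.send('response to: $prompt'); // only plain Dart objects are sent
  }
}

Future<void> main() async {
  final replies = ReceivePort();
  await Isolate.spawn(inferenceWorker, replies.sendPort);
  final fromWorker = replies.asBroadcastStream();
  final commands = await fromWorker.first as SendPort; // complete the handshake
  commands.send('Hello on-device model');
  print(await fromWorker.first); // the UI isolate never blocks on native work
}
```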
The C API provides 50 functions via DynamicLibrary.process(), wrapping llama.cpp for text inference, libmtmd for vision, and whisper.cpp for speech. The system includes a central scheduler that arbitrates concurrent workloads with priority-based degradation and structured performance tracing that writes JSONL flight recorder data for offline analysis.
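DynamicLibrary.process() resolves symbols that are statically linked into the app binary rather than loading a separate shared library. Binding one such function with Dart FFI looks roughly like this; the ev_tokenize symbol and its signature are assumptions for illustration only:

```dart
import 'dart:ffi';
import 'package:ffi/ffi.dart';

// Symbols compiled into the app (the llama.cpp / whisper.cpp wrappers) are
// looked up from the process itself instead of a separate .so/.dylib.
final DynamicLibrary native = DynamicLibrary.process();

// Hypothetical C signature: int ev_tokenize(const char* text);
typedef EvTokenizeC = Int32 Function(Pointer<Utf8> text);
typedef EvTokenizeDart = int Function(Pointer<Utf8> text);

final evTokenize =
    native.lookupFunction<EvTokenizeC, EvTokenizeDart>('ev_tokenize');

int tokenCount(String text) {
  final cText = text.toNativeUtf8(); // allocate native memory for the string
  try {
    return evTokenize(cText);        // synchronous FFI call
  } finally {
    malloc.free(cText);              // always release the native buffer
  }
}
```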
Performance and Stability
Edge-Veda has been validated through extensive soak testing. For text generation, it achieves 42–43 tokens per second with steady-state memory usage of 400–550MB and no degradation over conversations of 10+ turns. Vision processing sustained a 12.6-minute session of 254 frames with p50/p95/p99 latencies of 1,412/2,283/2,597ms and zero crashes.
Memory optimization has been substantial: the KV cache was reduced from ~64MB to ~32MB using Q8_0 quantization, and steady-state memory dropped from a ~1,200MB peak to 400–550MB. Speech-to-text achieves ~670ms transcription latency per 3-second chunk.
Supported Models
The project includes pre-configured models in ModelRegistry with download URLs and SHA-256 checksums:
- Llama 3.2 1B Instruct (668MB) - General chat and instruction following
- Qwen3 0.6B (397MB) - Tool/function calling
- Phi 3.5 Mini Instruct (2.3GB) - Reasoning and longer context
- Gemma 2 2B Instruct (1.6GB) - General purpose
- SmolVLM2 500M (417MB + 190MB) - Image description
- All MiniLM L6 v2 (46MB) - Document embeddings for RAG
- Whisper Tiny English (77MB) - Speech-to-text transcription
Any GGUF model compatible with llama.cpp can be loaded by file path.
Getting Started
Installation requires adding edge_veda: ^2.1.0 to pubspec.yaml. The API provides both streaming and blocking generation methods, multi-turn conversation management with built-in templates, function calling support, and speech-to-text capabilities.
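Putting that together, a minimal blocking-generation and chat example might look like the following; only the pubspec entry comes from the documentation above, while EdgeVeda.load, generate, startChat, and the template name are hypothetical stand-ins for the real API:

```dart
// pubspec.yaml:
//   dependencies:
//     edge_veda: ^2.1.0
import 'package:edge_veda/edge_veda.dart'; // hypothetical import

Future<void> main() async {
  // Hypothetical API names; the real SDK surface may differ.
  final veda = await EdgeVeda.load(modelPath: 'models/llama-3.2-1b-q4.gguf');

  // Blocking (non-streaming) generation returns the full completion at once.
  final reply = await veda.generate('Draft a two-line standup update.');
  print(reply);

  // Multi-turn chat with a built-in template; older turns are summarized
  // automatically when the conversation outgrows available context.
  final chat = veda.startChat(template: 'llama3');
  print(await chat.send('What can you do while offline?'));
}
```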
For runtime supervision, developers can declare compute budget contracts with explicit p95 latency, battery drain, thermal, and memory ceilings, or use adaptive profiles that auto-calibrate to measured device performance after warm-up.
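A budget contract of that kind could be declared as a small value object like the sketch below; every field name here is an assumption about the shape of the API, not the documented one:

```dart
// Hypothetical shape of a compute budget contract. The runtime would degrade
// quality of service (see Runtime Supervision) to stay inside these ceilings.
class ComputeBudget {
  final Duration p95Latency;           // e.g. 95% of generations under 2s
  final double maxBatteryDrainPerHour; // fraction of battery per hour
  final int thermalCeiling;            // highest tolerated thermal state
  final int memoryCeilingMb;           // cap on resident memory

  const ComputeBudget({
    required this.p95Latency,
    required this.maxBatteryDrainPerHour,
    required this.thermalCeiling,
    required this.memoryCeilingMb,
  });
}

const interactiveChat = ComputeBudget(
  p95Latency: Duration(seconds: 2),
  maxBatteryDrainPerHour: 0.05, // at most ~5% battery per hour
  thermalCeiling: 1,            // back off beyond the "warning" state
  memoryCeilingMb: 600,
);
```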
Platform Status
Edge-Veda is currently validated on iOS, with the Metal GPU path fully supported and the simulator working on CPU (no microphone). Android support is scaffolded: CPU inference works and Vulkan GPU acceleration is planned. The codebase comprises 50 C API functions, roughly 11,750 lines of Dart SDK code, and 14 unit tests.
Who This Is For
Edge-Veda targets teams building on-device AI assistants, continuous perception apps, privacy-sensitive AI systems, long-running edge agents, voice-first applications, and regulated or offline-first applications. It's particularly valuable for scenarios where reliability and observability matter more than one-time benchmark performance.

The project is actively seeking contributions in platform validation (especially Android), runtime policy improvements, trace analysis tools, model support testing, and example applications for specific use cases like document scanning or visual QA systems.
