An open-source project pushes real-time YOLOv8n UAV detection to 46 FPS on a 2 GB Rockchip board in under 140 MB of RAM, then chains an on-device LLM to write natural-language incident summaries. The twist: it is not a monolith. It is a chain of small Unix processes, each pinned to a dedicated piece of silicon.
There is a recurring pattern in edge AI projects: the demo runs great on paper, then falls apart when someone tries to scale it. Frame rates drop, RAM balloons, and the CPU is so busy handling I/O that the NPU sits idle half the time. The khadas_yolov8n_multithread project takes a different approach. It treats the RK3588S SoC as a collection of independent hardware blocks and wires them together with Unix sockets, not shared memory. The result is a 46 FPS YOLOv8n detection pipeline that runs on a board selling for under 100 euros.
The Sensor Is the Bottleneck Now
The headline number is 46 FPS, which happens to be the maximum frame rate of the OS08A10 camera sensor. The pipeline hits that ceiling. A naive single-threaded RKNN inference loop gets about 31 FPS at 1080p with YOLOv8n 640x640. By splitting inference across all three NPU cores using rknn_dup_context and rknn_set_core_mask, the project lifts throughput by 48 percent. The pipeline stops being the constraint; the camera is.
That matters because it changes how you think about the system. When the pipeline is the bottleneck, adding features means losing frames. When the sensor is the bottleneck, you can chain downstream processes (tracking, temporal analysis, even an LLM summary) without touching the detection frame rate.
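To make the multi-core idea concrete, here is a minimal sketch in Python rather than the project's C++. It assumes one worker thread per NPU core and a stand-in inference function, so fake_infer, run_pipeline and speedup_percent are hypothetical names, not part of the project or the RKNN API. It shows frames fanned out to three workers with results collected in frame order, plus the arithmetic behind the 48 percent figure.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

NUM_NPU_CORES = 3  # the article describes three NPU cores


def fake_infer(frame_id):
    """Stand-in for one inference call (hypothetical, no NPU involved)."""
    return frame_id, threading.current_thread().name


def run_pipeline(frame_ids):
    # One worker thread per core. map() yields results in submission order,
    # so downstream stages still see frames in sequence.
    with ThreadPoolExecutor(max_workers=NUM_NPU_CORES,
                            thread_name_prefix="npu") as pool:
        return list(pool.map(fake_infer, frame_ids))


def speedup_percent(baseline_fps, parallel_fps):
    """Relative throughput gain of the parallel pipeline over the baseline."""
    return (parallel_fps / baseline_fps - 1.0) * 100.0


results = run_pipeline(range(9))
gain = speedup_percent(31, 46)  # figures quoted in the article
```

In the real project each worker would presumably own a context created with rknn_dup_context and bound to one core with rknn_set_core_mask; the sketch deliberately leaves those calls out because their exact usage is not shown in the article.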
Not a Monolith, a Chain of Processes
The architecture is deliberately fragmented. The main capture and inference loop runs as one process. ByteTrack runs as a second. Temporal feature extraction runs as a third. A presence FSM and an on-demand LLM summary run as a fourth. Each communicates over Unix-domain sockets, one per camera device.
This is not the typical "just link everything together" approach. Unix sockets mean each stage can be restarted independently. They can run on different CPU cores. They can be swapped out without recompiling the rest of the pipeline. The bounded queue implementation in src/ipc/bounded_queue.h is a drop-oldest queue, so a slow consumer does not backpressure the producer.
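The drop-oldest behavior is easy to picture with a small sketch. The version below is written in Python for brevity and is not the code in src/ipc/bounded_queue.h; the class name DropOldestQueue and its methods are assumptions made for illustration. When the queue is full, the producer evicts the oldest entry and carries on instead of blocking.

```python
import threading
from collections import deque


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest item instead of blocking the producer."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = deque()
        self._capacity = capacity
        self._cond = threading.Condition()
        self.dropped = 0  # how many items have been evicted so far

    def put(self, item):
        # Never blocks: a full queue loses its oldest entry.
        with self._cond:
            if len(self._items) == self._capacity:
                self._items.popleft()
                self.dropped += 1
            self._items.append(item)
            self._cond.notify()

    def get(self, timeout=None):
        # Blocks until an item is available or the timeout expires.
        with self._cond:
            if not self._cond.wait_for(lambda: self._items, timeout=timeout):
                raise TimeoutError("queue is empty")
            return self._items.popleft()


q = DropOldestQueue(capacity=3)
for frame_id in range(5):  # producer outruns the consumer
    q.put(frame_id)
remaining = [q.get(timeout=0) for _ in range(3)]
```

The design choice is that a consumer that falls behind sees the freshest data and the producer's latency stays constant, at the price of silently losing stale items, which is why the sketch keeps a dropped counter.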
The tradeoff is serialization overhead on the detection data path. Each detection message gets marshaled, sent over a socket, and unmarshaled by the next stage. For a pipeline producing YOLOv8n detections from 1080p video at 46 FPS, that overhead is non-trivial but manageable, and the project has evidently judged it worth the architectural cleanliness.
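What marshaling over a Unix socket involves can be sketched in a few lines. The wire format below (a length prefix, a frame header, then fixed-size detection records) is an assumption made for illustration and is not the project's actual protocol; pack_message, recv_exact and recv_message are hypothetical helpers.

```python
import socket
import struct

LENGTH = struct.Struct("<I")   # body length in bytes (uint32)
HEADER = struct.Struct("<IH")  # frame_id (uint32), detection count (uint16)
DET = struct.Struct("<5fH")    # x1, y1, x2, y2, score (float32), class_id (uint16)


def pack_message(frame_id, detections):
    """Marshal one frame's detections into a length-prefixed byte string."""
    body = HEADER.pack(frame_id, len(detections))
    for det in detections:
        body += DET.pack(*det)
    return LENGTH.pack(len(body)) + body


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        data += chunk
    return bytes(data)


def recv_message(sock):
    """Unmarshal one length-prefixed message back into Python values."""
    (length,) = LENGTH.unpack(recv_exact(sock, LENGTH.size))
    body = recv_exact(sock, length)
    frame_id, count = HEADER.unpack_from(body, 0)
    detections = [DET.unpack_from(body, HEADER.size + i * DET.size)
                  for i in range(count)]
    return frame_id, detections


producer, consumer = socket.socketpair()  # connected Unix-domain stream pair on Linux
sent = [(100.0, 50.0, 300.0, 200.0, 0.875, 0)]
wire = pack_message(7, sent)
producer.sendall(wire)
frame_id, received = recv_message(consumer)
producer.close()
consumer.close()
```

Under this assumed format a frame with one detection costs 32 bytes on the wire, so the cost per stage comes mostly from the extra copies and system calls, not from bandwidth.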

Every Frame Stays Off the CPU
The memory story is the one that changes the economics. The project targets 2GB RK3588S boards selling for around 90 euros. At 140 MB RSS per stream, two cameras running side by side use about 290 MB. That is well within the RAM of the cheapest boards in the RK3588S range.
The trick is that no heavy per-frame operation runs on the CPU. Capture goes through the camera ISP. Color conversion and resize go through the RGA (Raster Graphic Acceleration), Rockchip's 2D accelerator block. Inference goes through the NPU. The CPU just orchestrates. Pre-allocated buffer pools in BufPool (defined in src/main.cc) recycle memory instead of allocating per frame, so RSS stays flat and bounded.
This is not a new idea. Hardware acceleration pipelines have existed for years. What is different here is the granularity. The project does not just offload inference. It offloads every step of the frame lifecycle, including the I/O that most projects leave to the CPU.
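The buffer-recycling idea can be illustrated with a short sketch. It is written in Python and is not the BufPool from src/main.cc; the class name FrameBufferPool, its methods and the buffer size are assumptions. All buffers are allocated once up front, and each frame borrows one and hands it back, so nothing is allocated on the per-frame path.

```python
from collections import deque


class FrameBufferPool:
    """Fixed set of pre-allocated buffers that are recycled, never reallocated."""

    def __init__(self, count, size):
        self.size = size
        # All memory is allocated here, once, before the pipeline starts.
        self._free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        """Return a free buffer, or None if every buffer is in flight."""
        return self._free.popleft() if self._free else None

    def release(self, buf):
        """Hand a buffer back to the pool for reuse."""
        if len(buf) != self.size:
            raise ValueError("buffer does not belong to this pool")
        self._free.append(buf)


# Size chosen for illustration: one 640x640 RGB frame.
pool = FrameBufferPool(count=2, size=640 * 640 * 3)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()  # pool is empty: caller must wait or drop the frame
pool.release(first)
reused = pool.acquire()     # the same memory comes back, no new allocation
```

Because the pool never grows, memory use is bounded by count times size no matter how long the pipeline runs, which is the property behind a flat RSS.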
The LLM Twist
When a tracked UAV leaves the scene, the pipeline triggers an on-device LLM. The NPU enters a blackout mode, reserving all three cores for Qwen2.5-0.5B inference via the RKLLM runtime. The LLM writes a natural-language assessment of what happened, then releases the NPU back to the detection pipeline.
The blackout/resume control plane is the interesting piece. It means the LLM does not share NPU time with the detection model. It gets exclusive access, runs fast, then gets out of the way. For a 0.5B parameter model on three NPU cores, that is probably under a second of latency for a short summary.
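One way to picture a blackout/resume handoff is a small arbiter that the detection loop consults before each inference. The sketch below is a guess at the shape of such a control plane, not the project's implementation; NpuArbiter, summarize and fake_llm are hypothetical names, and the LLM is replaced by a plain function.

```python
import threading
from contextlib import contextmanager


class NpuArbiter:
    """Hypothetical control plane: detection yields the NPU while the LLM runs."""

    def __init__(self):
        self._detection_allowed = threading.Event()
        self._detection_allowed.set()   # detection owns the NPU by default
        self._lock = threading.Lock()   # only one blackout at a time

    def detection_may_run(self):
        return self._detection_allowed.is_set()

    @contextmanager
    def blackout(self):
        with self._lock:
            self._detection_allowed.clear()    # enter blackout
            try:
                yield
            finally:
                self._detection_allowed.set()  # resume, even if the LLM fails


def summarize(arbiter, track_events, llm):
    # Run the LLM with exclusive access and record whether detection was paused.
    with arbiter.blackout():
        return llm(track_events), arbiter.detection_may_run()


def fake_llm(events):
    """Stand-in for the on-device LLM call."""
    return f"{len(events)} events observed"


arbiter = NpuArbiter()
summary, detection_ran_during_llm = summarize(
    arbiter, ["enter", "hover", "exit"], fake_llm)
```

A real control plane would also have to wait for in-flight inference to drain before handing the cores over; the sketch only models the ownership flag and the guaranteed release.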
Is this practical? For a UAV detection scenario, maybe. A security camera that can describe what it saw in plain English is more useful to a human operator than a JSON blob of bounding boxes. But the LLM stage is clearly positioned as a research extension, not a production feature. The project disclaimer makes that explicit.
Who This Is Actually For
The project targets RK3588S boards, specifically the Khadas Edge2 but also anything with the same SoC. The build system supports both native compilation on the board and cross-compilation from x86-64/WSL. The installation tree is self-contained: drop it on the board and run.
The detection class is "UAV" for benchmarking purposes, but the YOLOv8n model is swappable. The RKNN training pipeline in the companion repository handles training and exporting any YOLOv8 model to the RKNN format. The architecture does not care what the model detects.
The Broader Pattern
This project sits at a crossroads in edge AI. On one side, you have the Jetson ecosystem, which is powerful but expensive and locked to NVIDIA hardware. On the other, you have cheap Rockchip boards with capable NPUs but fragmented software support. Projects like this one bridge that gap by building the full pipeline, from capture through inference to I/O, rather than just a model benchmark.
The multi-process, socket-based architecture is also worth watching. It is more complex than a single-threaded loop, but it scales differently. Adding a new downstream stage means writing a new process, not refactoring the main loop. That is the kind of design that survives contact with real requirements.
The project is Apache 2.0 licensed, with training pipelines for both the vision and language models available in companion repositories: RKNN_TRAIN_YOLO for the detection model and RKLLM_LLAMA_QWEN for the LLM runtime. The architecture docs include Mermaid diagrams of both the internal pipeline and the multi-process topology.
