Scaling to 5,000 Virtual IoT Devices: The KVM-Powered Erlang Breakthrough on Ampere
Running thousands of virtual IoT devices for realistic testing has long been a resource-intensive challenge. But when Underjord's Lars Wikman paired Ampere Computing's 192-core ARM server with a custom bootloader and KVM acceleration, he achieved a staggering 5,100 concurrent virtual Nerves devices—each running a full Erlang/Elixir stack. This milestone offers profound insights for embedded developers scaling cloud-based simulations.
Why This Matters for Embedded Development
Nerves revolutionizes IoT development by treating the BEAM virtual machine as the primary OS layer, using Linux only for kernel/driver functionality. This enables memory-safe, fault-tolerant applications—but simulating thousands of devices demands extreme efficiency. Traditional emulation consumed ~650MB per instance, making large-scale testing impractical:
# Legacy QEMU command (no acceleration)
qemu-system-aarch64 -machine virt -cpu cortex-a53 ...
The breakthrough came through two key innovations:
1. little_loader: Frank Hunleth's minimalist bootloader replaced U-Boot, enabling direct kernel loading from Nerves' A/B partition structure while slashing boot times to seconds
2. KVM Acceleration: By running guests directly on the host's ARM cores via -machine virt,accel=kvm (or the equivalent -accel kvm) and -cpu host, QEMU bypassed emulation overhead:
# Accelerated configuration
qemu-system-aarch64 \
-machine virt,accel=kvm \
-cpu host \
-m 110M \
-kernel little_loader.elf
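Because accel=kvm only works when the host exposes a usable KVM device, a launcher script may want to fall back to software emulation when it does not. The following is a minimal sketch, not the author's tooling: pick_qemu_args is a hypothetical helper name, and the fallback CPU model is simply the one from the legacy command above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print QEMU machine/CPU
# flags depending on whether the given KVM device node is readable and
# writable by the current user. Defaults to /dev/kvm.
pick_qemu_args() {
    kvm_dev="${1:-/dev/kvm}"
    if [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        # Hardware virtualization: the guest runs on the host's own cores.
        echo "-machine virt,accel=kvm -cpu host"
    else
        # Fallback: TCG software emulation with an emulated Cortex-A53.
        echo "-machine virt,accel=tcg -cpu cortex-a53"
    fi
}

# Dry run: show the command line that would be used on this host.
echo "qemu-system-aarch64 $(pick_qemu_args) -m 110M -kernel little_loader.elf"
```

Passing the device path as an argument keeps the check easy to exercise on machines without KVM; a real launcher would also need the disk and network options elided in the article.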
The Performance Transformation
Results defied expectations:
- 500MB+ memory reduction per instance
- Boot times cut from more than 10 seconds to under 10 seconds
- 3,389 devices sustained before OOM kills (pre-tuning)
- 5,100 devices achieved after aggressive optimization
Critical tuning included:
- BEAM Allocators: Switched to reduced-memory variants
- Linux Kernel: Adjusted vm.swappiness, vm.dirty_ratio, and vm.vfs_cache_pressure
- ZRAM: Enabled compression for in-memory blocks
- Erlang Mode: Ran the BEAM in interactive rather than embedded mode, so modules load on demand instead of all at boot, for leaner startup
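The article names the kernel knobs but not the values that were used, so the sketch below only shows the shape of such a configuration. render_sysctl_conf is a hypothetical helper, and the numbers passed to it are placeholders rather than settings from the original experiment.

```shell
#!/bin/sh
# Hypothetical helper: emit a sysctl.d-style fragment for the three VM knobs
# named above. The caller supplies every value; none come from the source.
render_sysctl_conf() {
    swappiness="$1"
    dirty_ratio="$2"
    cache_pressure="$3"
    printf 'vm.swappiness = %s\n' "$swappiness"
    printf 'vm.dirty_ratio = %s\n' "$dirty_ratio"
    printf 'vm.vfs_cache_pressure = %s\n' "$cache_pressure"
}

# Placeholder values for illustration only; tune against your own workload.
render_sysctl_conf 100 10 50
```

The output could be saved as a file under /etc/sysctl.d/ and loaded with sysctl --system, or a single knob can be tried temporarily with sysctl -w. ZRAM setup is distribution-specific and is left out of this sketch.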
Implications for Developers
- Cloud-Based Device Testing: ARM servers now enable realistic thousand-device simulations for CI/CD pipelines
- Cross-Platform Parity: Identical workflows function on Apple Silicon (using HVF) and ARM servers
- Resource Efficiency: 150-160MB of resident memory per device makes large-scale simulations economically viable
- Nerves Ecosystem Growth: The upcoming nerves_system_qemu_aarch64 package will democratize these capabilities
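To put the per-device figure in context, some back-of-the-envelope arithmetic helps. The sketch below is a rough estimate only: fleet_memory_mb and max_devices are hypothetical helpers, the host size and reserve in the last call are made-up inputs, and the calculation ignores any savings from ZRAM compression or page sharing.

```shell
#!/bin/sh
# Hypothetical helper: total resident memory in MB for a fleet.
# Usage: fleet_memory_mb <device_count> <mb_per_device>
fleet_memory_mb() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: devices that fit in a host memory budget.
# Usage: max_devices <host_mb> <mb_per_device> <reserved_mb_for_host>
max_devices() {
    echo $(( ($1 - $3) / $2 ))
}

fleet_memory_mb 5100 150        # prints 765000
fleet_memory_mb 5100 160        # prints 816000

# Made-up host: 1,000,000 MB total with 50,000 MB held back for the host OS.
max_devices 1000000 160 50000   # prints 5937
```

By this estimate, 5,100 devices at 150-160MB each need roughly 765-816 GB of resident memory, which is why the memory tuning described above matters as much as the core count.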
Pushing Boundaries Further
While NUMA and core pinning remain unexplored, CPU utilization stayed below 20% with thousands of idle devices running, which suggests room for even denser packing. The real triumph? Transforming a "stunt" into production-grade tooling. As Wikman notes: "The result is something we should get good mileage out of." Deep dives into bootloaders and memory tuning can yield unexpected practical dividends.
Source: Underjord