Automating model design for edge AI
We built an automated model design system using neural architecture search, the DeepGate compiler, and real-hardware measurements. On MLPerf Tiny benchmarks, our models run up to 45× faster and use up to 11× less RAM than the reference models while maintaining high accuracy.
Building models for microcontrollers is still largely a manual process. Teams either design models from scratch or adapt existing architectures, iteratively modifying them to fit the target hardware. On resource-constrained devices, they often face a trade-off between models that are too large or slow to run and models that fit on the device but make too many mistakes to be useful.
We’ve built the foundations of an automated model design system. By combining neural architecture search, the DeepGate compiler, and real-hardware measurements obtained through our development platform, we can automatically search for models tailored to a target microcontroller. Across the four standard MLPerf Tiny benchmark tasks, ranging from detecting spoken words in audio to identifying the presence of a person in an image, the resulting models ran up to 45× faster and used up to 11× less RAM than the reference models. For example, on the MLPerf Tiny keyword spotting benchmark running on the Analog Devices MAX32655, our search reduced inference latency from 104.3 ms to 2.3 ms and RAM usage from 23.7 KB to 2.1 KB, while maintaining over 90% classification accuracy.
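As a quick check on those headline figures, the ratios follow directly from the keyword spotting measurements quoted above. The helper name `improvement_factor` is ours for illustration and is not part of any DeepGate tooling.

```python
def improvement_factor(reference: float, searched: float) -> float:
    """Return how many times smaller the searched value is than the reference."""
    if searched <= 0:
        raise ValueError("searched value must be positive")
    return reference / searched


# Keyword spotting on the MAX32655, figures as reported in this post.
latency_speedup = improvement_factor(104.3, 2.3)  # milliseconds
ram_reduction = improvement_factor(23.7, 2.1)     # kilobytes

print(f"latency: {latency_speedup:.1f}x faster")
print(f"RAM: {ram_reduction:.1f}x smaller")
```

Both ratios are consistent with the 45× and 11× figures quoted for this benchmark.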
Such gains can enable machine learning models to run on cheaper hardware, extend battery life, and free up memory and compute for other tasks. By pushing the efficiency frontier, we move more advanced AI workloads within reach of microcontrollers, bringing increasingly capable intelligence to billions of devices.
Outperforming the reference models on the same hardware
We evaluated our search on MLPerf Tiny v1.4, the standard benchmark suite for machine learning on microcontrollers. The benchmark covers four representative edge workloads: keyword spotting, visual wake words, CIFAR-10 image classification, and anomaly detection. Each task has a predefined quality target, from 90% top-1 accuracy for keyword spotting to 0.85 AUC for anomaly detection. For each workload, the goal was to meet the target while producing the smallest and fastest model possible, with input dimensions kept fixed to ensure a fair comparison against the reference models.
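The selection rule described here, meet the quality target first and then minimise latency and RAM, can be sketched as a filter followed by a Pareto-front computation. This is a minimal illustration rather than our actual selection code: the `Candidate` type, the `pareto_front` helper, and every candidate number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # task metric, e.g. top-1 accuracy or AUC
    latency_ms: float
    ram_kb: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both costs and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.ram_kb <= b.ram_kb
    strictly_better = a.latency_ms < b.latency_ms or a.ram_kb < b.ram_kb
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate], quality_target: float) -> list[Candidate]:
    """Keep candidates that meet the target and are not dominated on cost."""
    feasible = [c for c in candidates if c.quality >= quality_target]
    return [
        c for c in feasible
        if not any(dominates(other, c) for other in feasible)
    ]


# Hypothetical candidates, invented purely for illustration.
candidates = [
    Candidate("a", quality=0.93, latency_ms=40.0, ram_kb=20.0),  # dominated by b
    Candidate("b", quality=0.91, latency_ms=5.0, ram_kb=6.0),
    Candidate("c", quality=0.90, latency_ms=8.0, ram_kb=3.0),
    Candidate("d", quality=0.88, latency_ms=1.0, ram_kb=1.0),    # misses the target
    Candidate("e", quality=0.92, latency_ms=9.0, ram_kb=7.0),    # dominated by b
]

front = pareto_front(candidates, quality_target=0.90)
print([c.name for c in front])
```

Candidates `b` and `c` both survive because neither beats the other on latency and RAM at once. Choosing between them depends on which resource is scarcer on the target board.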
Across the evaluated boards, our search system and compiler delivered up to 45× faster inference and up to 11× lower RAM usage. Because memory is often the primary constraint on microcontrollers, these memory reductions can be especially important: in some cases, models that exceeded memory limits under the vendor toolchain were able to fit and run successfully after search and compilation.
The results below compare the MLPerf Tiny reference model compiled with each vendor’s toolchain against architectures automatically discovered by our search system and deployed with the DeepGate compiler, with all results measured on the same hardware. RAM is measured as the tensor arena plus peak stack size.
[Interactive chart: reference model (ST Edge AI) versus searched model (DeepGate compiler) on the STM32H7A3 (Cortex-M7 @ 280 MHz). DeepGate runs up to 36.1× faster.]
How we did it: two complementary search methods
We ran two search systems side by side and, for each task, kept whichever performed better. On the MLPerf Tiny workloads, three of the four final models came from our neural architecture search (NAS) system, while the anomaly detection model came from our agentic search.
Agentic architecture search uses an LLM agent that proposes one change at a time – either to the architecture or the training recipe – trains the resulting model, benchmarks it on real hardware, and keeps the change only if the target metric improves. The approach is open-ended and can explore ideas outside any predefined search space, but it operates greedily, improving one model at a time.
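The accept-or-revert loop described above is a form of greedy hill climbing. The sketch below shows only the control flow. `propose_change` and `evaluate` are hypothetical stand-ins for the LLM agent and for the train-and-benchmark step, and the toy objective at the end is invented purely to make the example runnable.

```python
import random
from typing import Callable, TypeVar

Model = TypeVar("Model")


def greedy_search(
    initial: Model,
    propose_change: Callable[[Model], Model],
    evaluate: Callable[[Model], float],
    steps: int,
) -> tuple[Model, float]:
    """Keep a proposed change only if it lowers the target metric."""
    best, best_score = initial, evaluate(initial)
    for _ in range(steps):
        candidate = propose_change(best)  # one change at a time
        score = evaluate(candidate)       # stands in for train + benchmark
        if score < best_score:            # lower is better, e.g. latency
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: the "model" is a single channel width and the metric is
# a made-up cost that is lowest at width 24.
rng = random.Random(0)


def propose_change(width: int) -> int:
    return max(1, width + rng.choice([-4, 4]))


def evaluate(width: int) -> float:
    return abs(width - 24)


best, score = greedy_search(64, propose_change, evaluate, steps=50)
print(best, score)
```

Because a change is kept only when it improves the metric, the score never gets worse, but the loop can stall in a local optimum. A practical loop would presumably also enforce the task's quality target; that check is omitted here.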
Supernet NAS builds on the Once-for-All and MCUNet approaches, which we adapted for microcontroller deployment with int8 quantization-aware training, keeping input resolution fixed for a fair comparison against the reference models. Rather than training every candidate architecture independently, the approach trains a single supernet that can be specialised into many different models with different size, speed, and accuracy trade-offs.
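To make the idea of one search space yielding many models concrete, the sketch below enumerates sub-network configurations along the three dimensions this post lists for the predefined architecture space (depth, kernel size, expansion ratio) and picks the best one under a RAM budget. It does not model weight sharing or training. The option values, both `estimate_*` functions, and the budgets are hypothetical placeholders; in the system described here, cost comes from on-device measurement rather than a formula.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetConfig:
    depth: int            # number of blocks per stage
    kernel_size: int
    expansion_ratio: int


# Hypothetical option lists, chosen only for illustration.
DEPTHS = [2, 3, 4]
KERNEL_SIZES = [3, 5, 7]
EXPANSION_RATIOS = [2, 4, 6]


def search_space() -> list[SubnetConfig]:
    """Every sub-network the (toy) supernet could be specialised into."""
    return [
        SubnetConfig(d, k, e)
        for d, k, e in itertools.product(DEPTHS, KERNEL_SIZES, EXPANSION_RATIOS)
    ]


def estimate_ram_kb(cfg: SubnetConfig) -> float:
    """Placeholder cost model, standing in for a measurement on hardware."""
    return 0.5 * cfg.depth * cfg.expansion_ratio + 0.1 * cfg.kernel_size ** 2


def estimate_quality(cfg: SubnetConfig) -> float:
    """Placeholder proxy: bigger sub-networks score higher."""
    return cfg.depth + 0.5 * cfg.expansion_ratio + 0.1 * cfg.kernel_size


def best_under_budget(ram_budget_kb: float) -> SubnetConfig:
    """Pick the highest-scoring sub-network that fits the RAM budget."""
    fitting = [c for c in search_space() if estimate_ram_kb(c) <= ram_budget_kb]
    if not fitting:
        raise ValueError("no sub-network fits the budget")
    return max(fitting, key=estimate_quality)


# Different budgets yield different models from the same space.
small = best_under_budget(4.0)
large = best_under_budget(16.0)
print(small)
print(large)
```

The same space serves both budgets without retraining anything in this toy, which mirrors why a supernet suits multiple hardware targets.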
The two approaches offer complementary strengths:
| | Agentic search | Supernet NAS |
| --- | --- | --- |
| What it can change | Anything in code – architecture and training recipe | A predefined architecture space (depth, kernel size, expansion ratio) |
| What you get out | One model, improved step by step | A family of models spanning different size, speed, and accuracy trade-offs |
| Best when | The problem is open-ended or the design space is poorly understood | The design space is well understood and you need optimised models for multiple hardware targets |
Both approaches run on the same in-house infrastructure. Each model is compiled into an efficient static binary by the DeepGate compiler and deployed to target microcontrollers through our development platform, which provides a unified benchmarking API across multiple boards. The resulting latency and memory usage are measured directly on the target hardware.
DeepGate’s hardware-in-the-loop benchmarking rig profiles machine learning models on real microcontrollers across major silicon vendors. Models are automatically compiled and deployed via our API.
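The compile, deploy, and measure flow described above can be pictured as a small interface that every board implements. Everything in this sketch is hypothetical: `Board`, `Measurement`, `FakeBoard`, and `benchmark` are illustrative names, not the DeepGate API, and the fake boards return canned numbers in place of real measurements. RAM is computed as tensor arena plus peak stack, the definition used for the results in this post.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Measurement:
    latency_ms: float
    tensor_arena_bytes: int
    peak_stack_bytes: int

    @property
    def ram_bytes(self) -> int:
        # RAM as defined in this post: tensor arena plus peak stack.
        return self.tensor_arena_bytes + self.peak_stack_bytes


class Board(Protocol):
    """Hypothetical unified interface: any board can run a compiled binary."""
    name: str

    def run(self, binary: bytes) -> Measurement: ...


@dataclass
class FakeBoard:
    """Stand-in for real hardware; returns a canned measurement."""
    name: str
    canned: Measurement

    def run(self, binary: bytes) -> Measurement:
        return self.canned


def benchmark(binary: bytes, boards: Sequence[Board]) -> dict[str, Measurement]:
    """Run one binary on every board through the same interface."""
    return {board.name: board.run(binary) for board in boards}


# Canned values, invented for illustration only.
boards = [
    FakeBoard("board-a", Measurement(2.0, 1536, 512)),
    FakeBoard("board-b", Measurement(3.5, 1536, 768)),
]
results = benchmark(b"\x00", boards)
for name, m in results.items():
    print(name, m.latency_ms, m.ram_bytes)
```

Keeping the search code behind one interface like this is what lets the same loop target several boards without changes.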
What’s next
Our long-term goal is to automate the design of highly efficient models, from defining a task to deploying an optimised model on an edge device. To achieve this, we are exploring how to combine our NAS and agentic search methods into a single optimisation loop that unifies the strengths of both approaches.
At the same time, we’re expanding the set of neural network layers available to the search system, including novel DeepGate layers designed to use less memory and run faster than conventional neural network layers. We expect that incorporating these layers into the search space will unlock further efficiency gains on resource-constrained devices, enabling AI workloads once thought beyond the reach of microcontrollers – and ultimately bringing increasingly capable intelligence to billions of devices.
If you’re interested in shrinking your own models – or accessing our optimised vision and audio models – we’d love to hear from you.
References
Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR 2018.
Cai, Gan, Wang, Zhang, Han, Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR 2020.
Lin, Chen, Lin, Cohn, Gan, Han, MCUNet: Tiny Deep Learning on IoT Devices, NeurIPS 2020.