Optimizing Small Language Models on the Raspberry Pi 5 with a Hailo NPU

Edge AI is shifting from a niche experiment to a practical reality. The ability to run language models locally — without round-tripping to a cloud API — opens up use cases where latency, privacy, and connectivity constraints make hosted inference impractical. Think offline assistants, local document summarization on air-gapped networks, or real-time voice command processing in embedded systems.

The Raspberry Pi 5 paired with a Hailo-8L NPU is one of the more compelling platforms for this kind of work. The Pi 5 brings a quad-core Cortex-A76 at 2.4 GHz with up to 8 GB of RAM, while the Hailo-8L M.2 module adds 13 TOPS of dedicated neural network acceleration over PCIe. Together, they form a sub-$150 edge inference stack that fits in the palm of your hand.

Hardware Setup

Getting the hardware configured correctly is the first hurdle. The Hailo module connects via the Pi 5's M.2 HAT+, which exposes a single PCIe lane (Gen 2 by default, with Gen 3 selectable in config.txt). Driver support comes from the HailoRT runtime library and the Hailo PCIe kernel driver, both distributed through Hailo's SDK. Here's the component list for the full setup:

  • Raspberry Pi 5 (8 GB) — the base compute platform running Raspberry Pi OS (64-bit, Bookworm)
  • Hailo-8L M.2 module — 13 TOPS neural accelerator in M.2 2242 form factor
  • Raspberry Pi M.2 HAT+ — adapter board providing the PCIe M.2 slot
  • Active cooler — essential for sustained inference workloads; the Pi 5 will thermal-throttle without one
  • 27W USB-C power supply — the Hailo module draws additional power; a standard 15W supply is insufficient
  • High-endurance microSD or NVMe boot drive — model weights benefit from faster read speeds
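Before installing the SDK, it is worth confirming the PCIe link itself. On Raspberry Pi OS Bookworm the external PCIe port runs at Gen 2 by default, and Hailo's documentation recommends enabling Gen 3 for full throughput. A minimal sketch, assuming the standard Bookworm config.txt location:

```shell
# /boot/firmware/config.txt -- enable the external PCIe port at Gen 3
dtparam=pciex1
dtparam=pciex1_gen=3

# After a reboot, confirm the module enumerates on the bus
lspci | grep -i hailo

# With the SDK installed, query the device firmware
hailortcli fw-control identify
```

If `lspci` shows nothing, reseat the M.2 module and the HAT+ ribbon cable before debugging software.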

Model Optimization

Running a small language model on this hardware requires aggressive optimization. The Hailo-8L operates on quantized integer models compiled into its proprietary HEF (Hailo Executable Format). The typical workflow involves taking a pre-trained model in ONNX format, calibrating it against a representative dataset, and then quantizing it down to INT8 or INT4 using the Hailo Dataflow Compiler. The quantization step is where most of the performance is gained — and where most of the accuracy is lost if done carelessly.

For SLMs in the 0.5B–1.5B parameter range, mixed-precision quantization tends to yield the best tradeoff. Attention layers and the final projection head benefit from staying at INT8, while feedforward blocks can tolerate INT4 with minimal perplexity degradation. The Hailo DFC handles layer-wise precision assignment, but you can override it with a custom quantization configuration. Below is an example of compiling a model with the Hailo Dataflow Compiler:

# Parse the ONNX model into Hailo's archive format (HAR)
hailo parser onnx model.onnx --hw-arch hailo8l

# Quantize to INT8, calibrating against a representative dataset
hailo optimize model.har --hw-arch hailo8l \
    --calib-set-path ./calibration_dataset/

# Compile the optimized model down to HEF
# (output file names can differ between DFC versions)
hailo compiler model_optimized.har --hw-arch hailo8l \
    --output-dir ./compiled/

# Verify the compiled model on the device
hailortcli run ./compiled/model.hef \
    --measure-latency
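The custom quantization configuration mentioned above takes the form of a Hailo model script (.alls) passed to the optimize step. A hedged sketch of forcing 4-bit weights on the feedforward blocks while pinning attention and the output projection at 8 bits — the layer name patterns here are hypothetical and must match what the parser reports for your model:

```
# slm_opt.alls -- per-layer precision overrides (layer names hypothetical)
# Keep attention and the final projection at 8-bit weights
quantization_param([attn_qkv*, lm_head*], precision_mode=a8_w8)
# Allow 4-bit weights in the feedforward blocks
quantization_param([ffn_up*, ffn_down*], precision_mode=a8_w4)
```

Start with everything at INT8, confirm perplexity, then relax individual layer groups to INT4 and re-measure rather than flipping the whole network at once.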

Benchmarks

After compiling and deploying a quantized 1.1B parameter model (based on a TinyLlama-class architecture), the results are surprisingly usable for edge hardware. With INT8 quantization across all layers, the Hailo-8L achieves approximately 12.4 tokens per second at a batch size of 1, with a time-to-first-token of 180 ms. Switching the feedforward layers to INT4 mixed precision pushes throughput to roughly 18.7 tokens per second, though perplexity increases by about 0.8 points on a WikiText-2 evaluation set. For comparison, running the same model in FP16 on the Pi 5 CPU alone yields only 1.6 tokens per second — an order of magnitude slower. Peak power draw during sustained inference sits at around 11.2W for the entire system, making battery-powered deployments feasible with a modest power bank.
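Throughput and time-to-first-token are easy to measure consistently regardless of runtime. A minimal sketch — `generate_fn` is a hypothetical streaming interface that yields one token at a time; adapt it to whatever inference loop wraps your HEF:

```python
import time

def measure_throughput(generate_fn, prompt, n_tokens):
    """Measure time-to-first-token (seconds) and sustained tokens/sec.

    generate_fn(prompt) is assumed to be an iterator yielding one
    decoded token at a time (hypothetical interface).
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _token in generate_fn(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        count += 1
        if count >= n_tokens:
            break
    elapsed = time.perf_counter() - start
    return ttft, count / elapsed
```

Run it against a few hundred tokens rather than a handful, since the first forward pass often includes one-time setup cost that skews short runs.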


The Raspberry Pi 5 and Hailo-8L combination is not going to replace cloud inference for production workloads, but that is not the point. It demonstrates that meaningful language model inference is possible on a sub-$150 stack that draws less power than a phone charger. As model architectures continue to shrink and quantization tooling improves, the gap between edge and cloud will keep narrowing. For anyone building local-first AI applications, embedded assistants, or privacy-sensitive pipelines, this is hardware worth paying attention to.