THIS WEEK: We ran a local LLM on a single-board computer. Here's how we did it
TECH
Most AI architecture guides start with a silent assumption: there will be internet. A cloud API call, a hosted model endpoint, a reliable connection somewhere in the pipeline. When Crisis Cognition came to us, that assumption was off the table entirely. Their team builds decision-support tools for humanitarian responders operating in disaster zones and conflict areas, where connectivity is typically the first thing to fail. They needed an AI assistant that would keep running when everything else went down.
What offline AI actually means
Offline AI means running the full inference stack (model, runtime, and interface) locally, with no dependency on cloud services or an external network. No API calls. No fallback. That distinction matters because most "edge AI" guides still assume some connectivity exists, even if reduced. True offline-first AI removes that safety net entirely. The model has to fit on your hardware, the runtime has to execute without reaching out, and the system has to remain useful within those constraints.
It's a growing space. According to Grand View Research, the global edge AI market sits at $24.91 billion in 2025 and is projected to reach $118.69 billion by 2033. Market growth, however, doesn't mean the engineering is straightforward. The tradeoffs are real, and understanding them upfront is what shapes every decision that follows.
The four decisions that shaped the build
Every offline AI deployment comes down to a handful of foundational choices made in sequence. Get one wrong and no amount of optimisation downstream will fix it.
Hardware first. We selected the Orange Pi 5 Max, built around the Rockchip RK3588 processor, specifically because of its integrated NPU (neural processing unit): dedicated silicon for AI workloads that dramatically outperforms CPU-only inference. As the Edge AI and Vision Alliance notes, dedicated hardware acceleration is what makes on-device inference viable at scale. NPU availability and runtime support should determine your hardware choice before anything else.
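If you're targeting the RK3588 yourself, it's worth confirming the NPU driver is actually loaded before committing to a runtime. Here's a minimal sketch of that check; it assumes a Rockchip BSP kernel that exposes the rknpu driver version through debugfs, and that exact path can vary between images, so treat it as an assumption rather than a given.

```python
# Quick NPU sanity check on a Rockchip board.
# Assumption: the BSP kernel exposes the rknpu driver version at this
# debugfs path (reading it usually requires root); other images may not.
from pathlib import Path

RKNPU_VERSION = Path("/sys/kernel/debug/rknpu/version")

def npu_driver_version():
    """Return the rknpu driver version string, or None if it isn't found."""
    try:
        return RKNPU_VERSION.read_text().strip()
    except (FileNotFoundError, PermissionError):
        return None

if __name__ == "__main__":
    version = npu_driver_version()
    print(f"rknpu driver: {version}" if version else "rknpu driver not found")
```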
Constrain the model to the task. We used Qwen2.5-3B-Instruct, optimised for the RK3588 and quantised to a w8a8 configuration (8-bit weights, 8-bit activations). A 3-billion-parameter model is modest by current standards, but it hits the right balance for constrained hardware. The smallest model that meets your accuracy floor for the specific task is the right model. Hugging Face is the best starting point for finding quantised model variants.
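For reference, the conversion and quantisation step is done off-device with Rockchip's rkllm-toolkit. The sketch below follows the shape of the toolkit's published export examples; argument names and defaults shift between toolkit versions, and the model path is a placeholder, so check it against your installed version before relying on it.

```python
# Sketch: convert a Hugging Face checkpoint to a w8a8 .rkllm artefact
# for the RK3588 NPU. Based on rkllm-toolkit export examples; arguments
# may differ in your toolkit version.
from rkllm.api import RKLLM

MODEL_DIR = "./Qwen2.5-3B-Instruct"  # placeholder: local HF checkpoint

llm = RKLLM()

# Load the Hugging Face model from disk.
if llm.load_huggingface(model=MODEL_DIR) != 0:
    raise SystemExit("failed to load model")

# Quantise to 8-bit weights / 8-bit activations, targeting the RK3588 NPU.
if llm.build(do_quantization=True,
             quantized_dtype="w8a8",
             target_platform="rk3588") != 0:
    raise SystemExit("quantisation failed")

# Write the NPU-ready artefact that the on-device runtime loads.
if llm.export_rkllm("./qwen2.5-3b-instruct-w8a8.rkllm") != 0:
    raise SystemExit("export failed")
```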
Match runtime to silicon. We used rkllm, a runtime built specifically for the RK3588 NPU. For teams not targeting a specific NPU, llama.cpp supports GGUF-format quantised models across a wide range of hardware and has become the de facto standard for local LLM deployment. A chip-native runtime outperforms a generic one — always check whether one exists for your target hardware first.
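If you're on more generic hardware, local inference with llama.cpp's Python bindings takes only a few lines. This is a minimal sketch, assuming you've already downloaded a GGUF-quantised instruct model; the file path, context size, and prompt are illustrative, not part of our build.

```python
# Minimal local inference with a GGUF-quantised model via llama-cpp-python.
# The model path is a placeholder; any instruct-tuned GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-3b-instruct-q8_0.gguf",  # placeholder path
    n_ctx=2048,  # context window sized for constrained RAM
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "List the first three steps of a flood evacuation checklist.",
    }],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```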
Design for the end user, not the engineer. We configured the Orange Pi 5 Max as a Wi-Fi access point and implemented a captive portal that routes any connected device directly to the Open WebUI interface. No app install, no login friction, no instructions needed. The model's capability is irrelevant if the people who need it can't access it quickly under pressure.
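The captive-portal mechanics are simpler than they sound: the access point's DNS answers every lookup with the board's own address, and a small web server answers every HTTP request with a redirect to the chat UI, which is what makes a phone pop its "sign in to network" page. One common way to implement the redirect half is sketched below; the 192.168.4.1 address and port 8080 for Open WebUI are assumptions for illustration, not our exact configuration.

```python
# Minimal captive-portal redirect service.
# Assumptions: dnsmasq (or similar) already resolves every hostname to the
# board's address, and Open WebUI is listening on port 8080. Binding to
# port 80 needs root or CAP_NET_BIND_SERVICE.
from http.server import BaseHTTPRequestHandler, HTTPServer

OPEN_WEBUI_URL = "http://192.168.4.1:8080/"  # assumed Open WebUI location

class PortalRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        # Any request, including OS connectivity probes, gets redirected
        # to the chat interface, which triggers the captive-portal prompt.
        self.send_response(302)
        self.send_header("Location", OPEN_WEBUI_URL)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the console quiet on a headless device

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 80), PortalRedirect).serve_forever()
```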
The honest tradeoffs
The prototype confirmed what we set out to demonstrate: LLMs can run reliably on the RK3588 NPU, offline inference is viable for crisis environments, and non-technical users can access it through a browser within seconds of connecting.
The tradeoffs are worth naming plainly. A 3B parameter model has a capability ceiling. It handles structured, task-specific queries well, but it's not a replacement for a frontier model on open-ended reasoning. There's no live data. Keeping the model updated requires a manual sync process when connectivity becomes available.
For teams weighing fully offline against a hybrid approach, research from early 2025 found that hybrid edge/cloud setups can achieve energy savings of up to 75% and cost reductions exceeding 80% compared to pure cloud processing. For Crisis Cognition's context, fully offline was the only viable option. For your use case, it may not be.
What this means
Edge AI isn't a niche concern. As AI gets embedded into tools used in the field by first responders, field engineers, logistics operators, and teams in low-connectivity environments, the question of what happens when the connection drops becomes a real engineering requirement. The architecture decisions we made for Crisis Cognition apply to any team building for those conditions: hardware, model, runtime, and access layer.
About this newsletter
As passionate small business owners, we share effective strategies straight from our own experiences. Stuff we usually share:
Proven growth tactics: Practical steps to boost your team, revenue, and impact.
Winning marketing strategies: Tips to get your product to market and outshine competitors.
Insider insights: Secrets from successful businesses in your niche.
Exclusive perks: Access to valuable tools and resources for subscribers only.
