You will learn how to build a fully offline, privacy-first local RAG implementation on Android using ExecuTorch and 4-bit quantized Small Language Models (SLMs). We will cover the end-to-end pipeline, from model optimization for mobile NPUs to integrating an on-device vector database.
- Quantizing Llama 4 models to 4-bit weights for mobile NPU acceleration.
- Setting up the ExecuTorch runtime within an Android production environment.
- Implementing an on-device vector database using ObjectBox or LanceDB-mobile.
- Optimizing context retrieval to fit within strict mobile RAM constraints.
Introduction
Sending your users' most sensitive data to a cloud-based LLM is no longer just a privacy risk; in 2026, it is a competitive disadvantage. As mobile chipsets from Qualcomm, Samsung, and Google now feature dedicated NPUs capable of double-digit TOPS (Tera Operations Per Second), the era of "Cloud-First AI" is being replaced by "Local-First AI."
By June 2026, the shift toward "Privacy-First AI" has peaked, and new NPU-heavy mobile chipsets now allow developers to run full RAG pipelines entirely offline without cloud latency. We are no longer limited to simple chat interfaces; we can now build an offline, private AI assistant on Android that understands a user's entire document library, health data, or private messages without a single byte leaving the device. This is a win for both security and user experience.
In this guide, we are moving past the theoretical. We will walk through a production-oriented local RAG implementation on Android. You will learn how to use ExecuTorch, the evolution of PyTorch Edge, to run 4-bit quantized SLMs like Llama 4-3B at interactive speeds, without a network round trip, while keeping the memory footprint small.
How Local RAG Actually Works on Mobile
Think of local RAG like giving your AI model a high-speed, private library card. Instead of trying to cram every piece of world knowledge into the model's weights during training, we provide it with a searchable index of local files that it can reference in real-time.
The standard RAG pipeline involves three main components: an embedding model to turn text into vectors, a vector database to store those vectors, and an LLM to "read" the retrieved context. On a desktop, you have gigabytes of VRAM to play with. On Android, we are fighting for every megabyte. This is why we use 4-bit quantization to shrink our models, and a specialized runtime like ExecuTorch with NPU delegation to keep power draw under control.
Real-world teams in healthcare and finance are already using this. Imagine a medical app that lets a doctor query patient records offline in a remote clinic. The data never hits a server, compliance is easier to demonstrate because nothing is transmitted, and responses arrive without network latency. That is the power of moving the RAG stack to the edge.
ExecuTorch differs from its predecessor, PyTorch Mobile, by providing a much leaner runtime and direct access to hardware-specific delegates like Qualcomm's QNN and MediaTek's NeuroPilot.
Key Features and Concepts
4-Bit Quantization: The Mobile Sweet Spot
We use 4-bit quantization because it offers the best trade-off between model intelligence and memory usage. By reducing weights from 16-bit floats to 4-bit integers, we shrink a 3B parameter model from ~6GB to roughly 1.6GB, allowing it to fit comfortably in the background RAM of a modern Android device.
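The size figures above follow from simple arithmetic. The sketch below assumes one 16-bit scale per group of 64 weights and no zero-points; both are illustrative assumptions, and real formats add their own metadata, so treat the result as an estimate.

```python
def quantized_size_gb(num_params, bits_per_weight, group_size=0, scale_bits=16):
    """Estimate weight storage in GB (1 GB = 10^9 bytes)."""
    total_bits = num_params * bits_per_weight
    if group_size > 0:
        # Group-wise quantization stores one scale per group of weights
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1e9

fp16_gb = quantized_size_gb(3e9, 16)                # 16-bit float baseline
int4_gb = quantized_size_gb(3e9, 4, group_size=64)  # 4-bit weights plus scales

print(f"fp16: {fp16_gb:.2f} GB, int4: {int4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights alone take 1.5GB, and the per-group scales add roughly another 0.1GB.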
On-Device Vector Databases
An on-device vector database needs to be lightweight. We aren't using Pinecone or Milvus here. Instead, we use embedded solutions like ObjectBox Vector Search or FAISS (a C++ library) compiled for Android. These libraries allow us to perform "K-Nearest Neighbor" searches in milliseconds directly against data stored on the local filesystem.
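To make the K-Nearest Neighbor idea concrete, here is a minimal brute-force sketch using cosine similarity. The function names and sample vectors are invented for illustration. Embedded vector databases typically replace this linear scan with an approximate index, so this shows the concept rather than how any particular library works internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=5):
    """chunks is a list of (text, vector); returns up to k (score, text) pairs, best first."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions
chunks = [
    ("invoice totals", [0.9, 0.1, 0.0]),
    ("blood pressure log", [0.0, 1.0, 0.1]),
    ("contract clause", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand chunks; beyond that, an indexed search is usually worth the extra storage.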
NPU Delegation
Running LLMs on the CPU alone drains the battery quickly and turns the phone into a pocket heater. We use ExecuTorch delegates to offload the heavy matrix multiplication to the NPU. This is the "secret sauce" for reaching generation speeds on the order of 30+ tokens per second on recent flagship devices.
Always use "Group-wise quantization" when preparing models for mobile. It maintains higher accuracy for SLMs compared to standard per-tensor quantization.
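The effect of group-wise scaling is easy to see on a toy weight vector with one outlier. The sketch below uses symmetric round-to-nearest quantization in pure Python; it is an illustration of the idea under simplified assumptions, not the kernel any quantization library actually runs.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric round-to-nearest quantization using one scale for all weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit values
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def groupwise_quantize_dequantize(weights, group_size, bits=4):
    """Quantize each consecutive group of weights with its own scale."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_dequantize(weights[start:start + group_size], bits))
    return out

def mean_abs_error(original, restored):
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)

# Mostly small weights plus a single large outlier
weights = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 4.0]
per_tensor_err = mean_abs_error(weights, quantize_dequantize(weights))
group_err = mean_abs_error(weights, groupwise_quantize_dequantize(weights, group_size=4))
print(f"per-tensor error: {per_tensor_err:.5f}, group-wise error: {group_err:.5f}")
```

With a single scale, the outlier forces every small weight to round to zero. With groups of four, only the group containing the outlier is affected, so the average error drops.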
Implementation Guide
We are building a "Private Document Assistant." The app will index PDF files stored on the device and allow the user to ask questions about them. We assume you have a basic Android project set up with NDK support and have installed the ExecuTorch Python package and its export tools on your development machine.
Step 1: Optimizing Llama 4 for Mobile
First, we need to convert our model into a format ExecuTorch understands. We will use 4-bit quantization and export the model as a .pte file.
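# Import PyTorch, the ExecuTorch export helpers, and the torchao quantization tools
import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torchao.quantization import quantize_, int4_weight_only
# Load your Llama 4-3B model (assumes the file holds a fully pickled nn.Module)
model = torch.load("llama4_3b_base.pt", weights_only=False).eval()
# Apply 4-bit weight-only quantization
# This is crucial for optimizing Llama 4 for mobile devices
# (depending on your torchao version, the int4 path may require specific dtypes or devices)
quantize_(model, int4_weight_only(group_size=64))
# Token IDs are integers, so build the example input with randint, not randn
# (32000 is a placeholder vocabulary size; use your tokenizer's real value)
example_inputs = (torch.randint(0, 32000, (1, 128), dtype=torch.long),)
# Capture the graph, lower it to Edge IR, then compile it to an ExecuTorch program
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported, compile_config=EdgeCompileConfig(_check_ir_validity=False))
et_program = edge_program.to_executorch()
# Save as .pte file for Android deployment
with open("llama4_int4.pte", "wb") as f:
    f.write(et_program.buffer)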
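This Python script takes a standard PyTorch model, applies 4-bit quantization using the torchao library, and exports it to the ExecuTorch format. Note that we use int4_weight_only to keep the activation tensors in higher precision, which helps preserve output quality in smaller models. Exporting alone does not target the NPU: to run on one, the program must also be lowered to a hardware delegate such as QNN during export.
Weight-only round-to-nearest quantization, as used here, needs no calibration data. Calibration matters if you switch to GPTQ or to static activation quantization; skip it there and your model might start producing gibberish once it hits the device. In either case, compare the quantized model's outputs against the original before shipping.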
Step 2: Setting up the On-Device Vector Store
Next, we need a way to store and retrieve document snippets. We will use Kotlin to interface with a local vector store.
// Initialize ObjectBox for local vector storage
// This handles the "retrieval" part of our local RAG implementation on Android
// DocumentChunk is assumed to be an @Entity whose vector field carries @HnswIndex(dimensions = 384)
val store = MyObjectBox.builder().androidContext(context).build()
val box = store.boxFor(DocumentChunk::class.java)
fun addDocumentToStore(text: String, embedding: FloatArray) {
    val chunk = DocumentChunk(
        content = text,
        vector = embedding // This is the 384-dim vector from our embedding model
    )
    box.put(chunk)
}
fun searchContext(queryVector: FloatArray): List<DocumentChunk> {
    // Perform a nearest neighbor search on the device (up to 5 results)
    return box.query(DocumentChunk_.vector.nearestNeighbors(queryVector, 5))
        .build()
        // findWithScores() returns results ordered nearest-first
        .findWithScores()
        .map { it.get() }
}
This Kotlin code manages the local vector database. When a user adds a document, we generate an embedding (using a small sentence-embedding model such as all-MiniLM-L6-v2, which produces 384-dimensional vectors, also running via ExecuTorch) and store it. The nearestNeighbors query is the core of RAG retrieval, finding the 5 most relevant text snippets for any given query.
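Before documents can be embedded, they have to be split into the snippets that the vector store holds. The sketch below shows one common approach, fixed-size word windows with overlap. The function name and the sizes are illustrative assumptions, not something this guide's pipeline prescribes; production apps often split on sentences or tokens instead.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into chunks of chunk_size words; neighbors share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of storing some words twice.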
Step 3: Integrating ExecuTorch in Android
Now we link everything together. We will call the ExecuTorch C++ API through JNI to run our quantized model.
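// Loading the ExecuTorch runtime in Kotlin
// ExecuTorchModule is a placeholder for your own JNI wrapper, not a class shipped with ExecuTorch.
// It is assumed to tokenize the prompt, run the decode loop, and return the decoded text.
class LocalLLM(modelPath: String) {
    // Load the .pte file we generated in Step 1
    private val module: ExecuTorchModule? = ExecuTorchModule.load(modelPath)

    fun generateResponse(prompt: String, context: String): String {
        // Prepend the retrieved context to the user's question
        val fullPrompt = "Context: $context\n\nQuestion: $prompt\nAnswer:"
        // Run inference (on the NPU only if the .pte was lowered to an NPU delegate)
        val output = module?.forward(fullPrompt)
        return output?.toString() ?: "Error: Inference failed"
    }
}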
This is a simplified view of the LocalLLM wrapper. In a real app, you would handle tokenization in C++ or use a library like SentencePiece. The key takeaway is the generateResponse method, which prepends the retrieved context to the user's query before sending it to the 4-bit quantized model.
Always run inference on a background thread. Even with NPU acceleration, LLM inference is a blocking operation that will freeze your UI if run on the main thread.
Best Practices and Common Pitfalls
Memory-Mapped Files (mmap)
Do not load the entire 1.6GB model into the JVM heap. ExecuTorch supports memory-mapping (mmap), which allows the OS to load only the parts of the model currently needed for computation. This prevents "Out of Memory" (OOM) crashes on devices with 6GB or 8GB of RAM.
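The operating-system mechanism behind this can be demonstrated with Python's standard mmap module. The file name and contents below are stand-ins, and this illustrates memory mapping in general rather than ExecuTorch's own loader API.

```python
import mmap
import os
import tempfile

# Write a stand-in "model" file: a 4-byte header, 1 MiB of zeros, a 4-byte footer
path = os.path.join(tempfile.mkdtemp(), "fake_model.pte")
with open(path, "wb") as f:
    f.write(b"HEAD" + bytes(1024 * 1024) + b"TAIL")

with open(path, "rb") as f:
    # Map the whole file read-only; pages are loaded lazily when touched
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)
        header = mapped[:4]              # touches only the first page
        footer = mapped[size - 4:size]   # touches only the last page

print(size, header, footer)
```

Only the pages that are actually read need to be resident, and because the mapping is read-only and file-backed, the OS can evict those pages under memory pressure and reload them later.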
Handling Context Window Limits
Mobile SLMs usually have a context window of 2k to 4k tokens. In a local RAG setup, it's tempting to shove as much context as possible into the prompt. Don't. Every extra token increases the Time To First Token (TTFT). Use a re-ranking step to ensure only the most relevant 512 tokens are sent to the LLM.
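One way to enforce such a budget is to keep the highest-scoring chunks greedily until the token limit is reached. The sketch below is a simple stand-in for a re-ranking step: it reuses retrieval scores rather than a dedicated re-ranking model, and it estimates tokens by counting words, which is a crude assumption. Use your real tokenizer for accurate counts.

```python
def estimate_tokens(text):
    # Crude assumption: one token per whitespace-separated word
    return len(text.split())

def select_context(scored_chunks, max_tokens=512):
    """scored_chunks is a list of (score, text); keeps the best chunks that fit."""
    selected = []
    used = 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used += cost
    return selected, used

def build_prompt(question, context_chunks):
    context = "\n".join(context_chunks)
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

candidates = [
    (0.91, "alpha " * 300),
    (0.85, "beta " * 300),
    (0.80, "gamma " * 200),
]
selected, used = select_context(candidates, max_tokens=512)
prompt = build_prompt("What changed?", selected)
print(f"{len(selected)} chunks selected, about {used} tokens")
```

Here the second-best chunk is skipped because it would overflow the budget, while the smaller third chunk still fits.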
Battery Impact
Frequent RAG operations can drain the battery. Implement a "Batch Indexing" strategy where new documents are embedded and indexed only when the device is charging or idle. For the user query, the retrieval step is cheap, but the LLM generation is expensive — limit response lengths where possible.
Real-World Example: The "Private Vault" App
Consider a fictional legal tech company, "LexiLocal." They built an Android app for attorneys to review case files during flights or in high-security courtrooms where Wi-Fi is banned. Using the local RAG implementation described here, they allow attorneys to ask questions like "What was the defendant's statement regarding the 2024 contract?"
In this scenario, by using ExecuTorch and a 4-bit quantized Llama 4 model, LexiLocal reaches sub-second retrieval times and around 40 tokens/sec generation on a Pixel 9 (illustrative figures, not a measured benchmark). Most importantly, their marketing highlights that "Your client files never leave your phone," a claim that a cloud-based OpenAI or Anthropic integration could not make.
Future Outlook and What's Coming Next
The next 18 months will bring "Unified Memory" architectures to mobile, where the NPU and CPU share a high-bandwidth memory pool even more efficiently. We are also seeing the rise of 1-bit and 2-bit quantization (BitNet), which could allow 7B parameter models to run on mid-range hardware by 2027.
Furthermore, the ExecuTorch roadmap includes "Multi-modal Delegates." This means our local RAG won't just be about text. We will soon be indexing local images and videos, allowing users to ask, "Find the video where I was talking about the project roadmap and summarize my points."
Conclusion
Local RAG is no longer a futuristic concept; it is a practical reality for Android developers in 2026. By combining ExecuTorch's efficient runtime with 4-bit quantized SLMs, you can build applications that avoid network latency, cost nothing per query in cloud inference, and keep user data entirely on the device.
The transition from cloud-centric AI to edge-centric AI is the biggest architectural shift in mobile development since the introduction of the smartphone itself. Don't wait for the ecosystem to mature further. Start by quantizing a small model, setting up a local vector store, and seeing the performance for yourself.
Build something today that works even when the world is offline.
- ExecuTorch is PyTorch's on-device runtime for running models on Android, with delegates for mobile NPUs.
- 4-bit quantization is the "Goldilocks" zone for mobile SLMs, balancing size and intelligence.
- Local RAG requires a three-part stack: Embedding model, Vector DB, and Quantized LLM.
- Install ExecuTorch and try running a pre-quantized Llama 4 model on your test device today.