What Is a Local LLM?
A quick explainer before we dive into setup.
When you use OpenClaw with a service like Claude or ChatGPT, every message you send goes over the internet to that company's servers — and you pay per token. A local LLM (Large Language Model) runs the AI model directly on your own computer instead. Nothing leaves your machine.
🖥️ Local LLM
- ✅ Completely free — no API costs ever
- ✅ 100% private — data never leaves your machine
- ✅ Works offline — no internet needed
- ✅ No rate limits — run as many queries as you want
- ⚠️ Slower than cloud models on most hardware
- ⚠️ Requires 8GB+ RAM (16GB+ recommended)
- ⚠️ Smaller models = less capable than GPT-4/Claude Sonnet
☁️ Cloud API (Claude, GPT-4, etc.)
- ✅ More capable, especially for complex reasoning
- ✅ Fast responses — no hardware bottleneck
- ✅ Always up to date with latest models
- 💸 Costs $5–$30/month depending on usage
- 🌐 Requires internet connection
- 🔒 Data processed on provider's servers
- ⏳ Subject to rate limits on free tiers
You can switch between local and cloud models in your OpenClaw config at any time — even run different models for different tasks. Many users start with a local model to experiment for free, then switch to a cloud model for production use.
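For example, the switch is just an edit to the `llm` block of `openclaw.json` (the model name here is illustrative; see the setup guides below for each provider's exact values):

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434"
  }
}
```

To move to a cloud provider, change `provider` and `model` and add your `apiKey`; nothing else in your project needs to change.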
Hardware Requirements
What you need before picking a model. The key number is your available RAM (and VRAM if you have a GPU).
| Your Machine | RAM | GPU? | Tier | Best Models | Expected Speed |
|---|---|---|---|---|---|
| Older laptop / basic PC | 8 GB | ✗ CPU only | Basic | Phi-3 Mini (3.8B), Gemma 2B | 3–8 tokens/sec |
| Modern laptop / mid PC | 16 GB | ✗ CPU only | Mid | Llama 3.1 8B, Mistral 7B, Phi-3 Medium | 8–20 tokens/sec |
| Mac with Apple Silicon (M1/M2/M3) | 8–24 GB unified | ✓ Integrated GPU | Power | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B | 25–50 tokens/sec |
| PC with dedicated GPU (RTX 3060+) | 16 GB+ | ✓ 8–12 GB VRAM | Power | Llama 3.1 8B–13B, Mixtral 8x7B | 40–80 tokens/sec |
| PC with RTX 4090 / workstation GPU | 32 GB+ | ✓ 24 GB VRAM | Power | Llama 3.1 70B (Q4), Mixtral, Qwen 72B | 30–60 tokens/sec on large models |
| Low-end PC / Chromebook | 4 GB or less | ✗ | Not recommended | Use Groq free tier instead (cloud but free) | — |
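Not sure which row you fall into? Here is a quick way to check your total RAM from a terminal — a sketch covering Linux and macOS, not an OpenClaw tool:

```shell
# Print total system RAM so you can match it against the table above
if command -v free >/dev/null 2>&1; then
  # Linux: "free -g" reports totals in GB on the "Mem:" line
  free -g | awk '/^Mem:/ { print $2 " GB total RAM" }'
else
  # macOS: hw.memsize is in bytes, so convert to GB
  sysctl -n hw.memsize | awk '{ printf "%d GB total RAM\n", $1 / (1024^3) }'
fi
```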
A "quantized" model (e.g. q4_0 or q5_k_m) is a compressed version that uses 3–4× less RAM with minimal quality loss. On 16 GB RAM, a q4_0 Llama 3.1 8B uses ~5 GB rather than ~16 GB. Always pick a quantized version unless you have 32+ GB RAM.
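The ~5 GB figure comes from a simple back-of-envelope estimate. As a sketch (the 4.5 bits-per-weight figure is a rough average for q4-class quantizations, not an exact value):

```shell
# size ≈ parameters (billions) × bits-per-weight / 8, plus ~1 GB of runtime overhead
size_gb=$(awk 'BEGIN { printf "%.1f", 8 * 4.5 / 8 + 1 }')   # Llama 3.1 8B at q4
echo "Estimated RAM for llama3.1:8b (q4): ${size_gb} GB"
```

The same arithmetic explains why a 70B model needs a 32 GB-class machine even at q4.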
Local LLM Software Options
These tools handle downloading, running, and serving local models. OpenClaw connects to any of them. Pick the one that suits you best.
Ollama
- One command to pull & run any model
- Automatic GPU acceleration (Apple Silicon, NVIDIA, AMD)
- Exposes an OpenAI-compatible API on port 11434
- Huge model library — Llama, Mistral, Phi, Gemma and more
- Lightest on resources, fastest startup time

LM Studio
- Visual model browser — search and download from HuggingFace
- Built-in chat UI to test models before connecting to OpenClaw
- OpenAI-compatible local server (enable in app settings)
- GPU acceleration on Mac (Metal), Windows (CUDA / Vulkan)
- Familiar chat interface — easy for non-technical users

Jan.ai
- One-click model install from their curated hub
- Runs an OpenAI-compatible API server at localhost:1337
- Supports extensions for custom integrations

GPT4All
- Best performance on CPU-only machines
- Built-in "LocalDocs" — chat with your own files
- Very low resource usage — works on older hardware
- Provides an OpenAI-compatible API server
Groq offers free API access to Llama 3.1 and Mixtral with extremely fast inference (using their custom LPU hardware). It's not truly local, but it's free, requires no setup, and is much faster than running models on a CPU. Use "provider": "groq" in your OpenClaw config with a free Groq API key from console.groq.com.
Step-by-Step Setup
Select your local LLM software below for a tailored setup guide.
Ollama

1. Install Ollama: download from ollama.com and run the installer. On Mac, drag to Applications. On Linux, run the one-liner:

   ```shell
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. Pull a model: open your terminal and pull a model. For most machines, Llama 3.1 8B is the best starting point:

   ```shell
   ollama pull llama3.1:8b

   # For low-end hardware (under 8 GB RAM), use a smaller model:
   ollama pull phi3:mini

   # For Apple Silicon / a powerful GPU, try a larger model:
   ollama pull llama3.1:8b-instruct-q5_K_M
   ```

3. Verify Ollama is running: Ollama starts automatically after install. Confirm it's running and your model responds:

   ```shell
   ollama run llama3.1:8b "Say hello in one sentence"
   ```

   You should get a response within a few seconds.

4. Configure OpenClaw to use Ollama: add this to your `openclaw.json` (or `config.yaml`):

   ```json
   {
     "llm": {
       "provider": "ollama",
       "model": "llama3.1:8b",
       "baseUrl": "http://localhost:11434"
     }
   }
   ```

   No `apiKey` is needed for a local Ollama server.

5. Start OpenClaw: run `openclaw start` from your project folder. Your agent is now running 100% locally at no cost.

Run `ollama list` to see your downloaded models. Browse the full library at ollama.com/library — there are hundreds of models including coding specialists, multilingual models, and vision models.
LM Studio

1. Download and install LM Studio: go to lmstudio.ai, download for your OS, and install it. Launch the app.

2. Download a model: click the Search tab (magnifying glass icon). Search for `llama3.1` or `mistral`, pick a `Q4_K_M` quantized version that fits your RAM, and click Download.

3. Start the local API server: go to the Local Server tab (the `<->` icon). Select your downloaded model from the dropdown, then click Start Server. It runs on `http://localhost:1234` by default.

4. Configure OpenClaw: add this to your `openclaw.json`:

   ```json
   {
     "llm": {
       "provider": "openai",
       "model": "local-model",
       "apiKey": "lm-studio",
       "baseUrl": "http://localhost:1234/v1"
     }
   }
   ```

   LM Studio uses the OpenAI API format. The `apiKey` value doesn't matter — just set it to anything non-empty.
Jan.ai

1. Download and install Jan.ai: go to jan.ai, download the installer for your OS, and install it. Launch the Jan app.

2. Download a model from the Hub: click Hub in the left sidebar. Browse or search for a model (Llama 3.1 8B or Mistral 7B are good starting points). Click Download next to a `Q4` variant.

3. Enable the local API server: go to Settings (gear icon) → Advanced → turn on API Server. The server starts on `http://localhost:1337`.

4. Configure OpenClaw: add this to your `openclaw.json`:

   ```json
   {
     "llm": {
       "provider": "openai",
       "model": "mistral-7b-instruct-v0.2-q4",
       "apiKey": "jan-local",
       "baseUrl": "http://localhost:1337/v1"
     }
   }
   ```

   Use the exact model name shown in Jan.ai — it appears in the model card. The `apiKey` value can be anything.
Groq

Groq runs open-source models (Llama, Mixtral, Gemma) on their custom LPU chips in the cloud. It's free with generous rate limits, has no data retention by default, and is dramatically faster than running models locally on CPU. Great if your hardware is limited.

1. Get a free Groq API key: sign up at console.groq.com (free, no credit card needed). Go to API Keys and create a new key.

2. Configure OpenClaw:

   ```json
   {
     "llm": {
       "provider": "groq",
       "model": "llama-3.1-8b-instant",
       "apiKey": "$GROQ_API_KEY"
     }
   }
   ```

3. Available free models on Groq:
   - `llama-3.1-8b-instant` — fast, good for most tasks
   - `llama-3.1-70b-versatile` — much more capable, still free
   - `mixtral-8x7b-32768` — great for long-context tasks
   - `gemma2-9b-it` — Google's efficient model
Which Model Should You Use?
The best model depends on your hardware and use case. Here are our picks for each situation.
| Use Case | Recommended Model | Min RAM | Notes |
|---|---|---|---|
| General assistant / chat | `llama3.1:8b` | 8 GB | Best balance of quality and speed for most tasks |
| Low-end hardware | `phi3:mini` | 4 GB | Microsoft's 3.8B model — surprisingly capable for its size |
| Writing & summarization | `mistral:7b` | 8 GB | Excellent instruction following and long-form writing |
| Coding assistant | `qwen2.5-coder:7b` | 8 GB | Purpose-built for code generation and debugging |
| Multilingual tasks | `qwen2.5:7b` | 8 GB | Strong across 30+ languages including Chinese, Japanese, Arabic |
| Complex reasoning | `llama3.1:70b-q4` | 32 GB | Near GPT-4 quality, but requires powerful hardware |
| Vision / image understanding | `llava:13b` | 16 GB | Can analyze images as well as text — multimodal |
| Maximum privacy / air-gapped | Any of the above | — | All local models are fully private — nothing leaves your machine |
Performance Tips
Getting the most speed out of your local model setup.
GPU offloading is the single biggest performance improvement. Ollama, LM Studio, and Jan.ai all enable GPU acceleration automatically when they detect a compatible GPU. On a Mac with Apple Silicon, models run on the GPU via Metal by default, which is much faster than CPU-only inference on Intel Macs.
Quantization levels trade quality for speed and RAM:
- `q4_0` — smallest/fastest, minor quality reduction — great for most uses
- `q4_k_m` — slightly better quality than q4_0, same RAM — our top pick
- `q5_k_m` — excellent quality/speed balance if you have 16+ GB RAM
- `q8_0` — near-full quality, needs lots of RAM — only if you have 32+ GB
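To make the trade-off concrete, here is a rough size comparison for an 8B-parameter model at each level. The bits-per-weight figures are approximate community averages, not exact values:

```shell
# size ≈ parameters (billions) × bits-per-weight / 8
for entry in "q4_0:4.5" "q4_k_m:4.6" "q5_k_m:5.7" "q8_0:8.5"; do
  level=${entry%%:*}   # quantization level name
  bits=${entry##*:}    # approximate bits per weight
  awk -v l="$level" -v b="$bits" 'BEGIN { printf "%-7s ~%.1f GB\n", l, 8 * b / 8 }'
done
```

Doubling the bits per weight roughly doubles the RAM, which is why q8_0 is only practical on 32+ GB machines.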
Local LLMs compete directly with other applications for RAM. Before running a large model, close your browser tabs, video apps, and any other memory-intensive software. On macOS, Activity Monitor shows your "Memory Pressure" — keep it in the green zone for best performance.
Local models are slower with longer conversation histories. For automation tasks (where you don't need persistent chat), set a low maxTokens value and clear conversation history between tasks to keep responses snappy.
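As a sketch, that might look like the following in `openclaw.json` — the `maxTokens` field name and placement are assumptions, so check your OpenClaw version's config reference:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434",
    "maxTokens": 512
  }
}
```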