  • Set up in under 10 minutes
  • Works with Llama, Mistral, Phi, Gemma & more
  • Runs on Mac, Windows, and Linux
  • GPU acceleration supported

What Is a Local LLM?

A quick explainer before we dive into setup.

When you use OpenClaw with a service like Claude or ChatGPT, every message you send goes over the internet to that company's servers — and you pay per token. A local LLM (Large Language Model) runs the AI model directly on your own computer instead. Nothing leaves your machine.

🖥️ Local LLM

  • Completely free — no API costs ever
  • 100% private — data never leaves your machine
  • Works offline — no internet needed
  • No rate limits — run as many queries as you want
  • ⚠️ Slower than cloud models on most hardware
  • ⚠️ Requires 8GB+ RAM (16GB+ recommended)
  • ⚠️ Smaller models = less capable than GPT-4/Claude Sonnet

☁️ Cloud API (Claude, GPT-4, etc.)

  • More capable, especially for complex reasoning
  • Fast responses — no hardware bottleneck
  • Always up to date with latest models
  • 💸 Costs $5–$30/month depending on usage
  • 🌐 Requires internet connection
  • 🔒 Data processed on provider's servers
  • Subject to rate limits on free tiers
💡 You don't have to choose one forever

You can switch between local and cloud models in your OpenClaw config at any time — even run different models for different tasks. Many users start with a local model to experiment for free, then switch to a cloud model for production use.
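Concretely (a sketch assuming the openclaw.json format used in the setup steps later on this page), switching is a matter of editing the llm block:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434"
  }
}
```

To move to a cloud model, change provider and model (for example "groq" and "llama-3.1-8b-instant") and add an apiKey; typically nothing else needs to change.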

Hardware Requirements

What you need before picking a model. The key number is your available RAM (and VRAM if you have a GPU).

| Your Machine | RAM | GPU? | Tier | Best Models | Expected Speed |
|---|---|---|---|---|---|
| Older laptop / basic PC | 8 GB | CPU only | Basic | Phi-3 Mini (3.8B), Gemma 2B | 3–8 tokens/sec |
| Modern laptop / mid PC | 16 GB | CPU only | Mid | Llama 3.1 8B, Mistral 7B, Phi-3 Medium | 8–20 tokens/sec |
| Mac with Apple Silicon (M1/M2/M3) | 8–24 GB unified | Integrated GPU | Power | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B | 25–50 tokens/sec |
| PC with dedicated GPU (RTX 3060+) | 16 GB+ | 8–12 GB VRAM | Power | Llama 3.1 8B–13B, Mixtral 8x7B | 40–80 tokens/sec |
| PC with RTX 4090 / workstation GPU | 32 GB+ | 24 GB VRAM | Power | Llama 3.1 70B (Q4), Mixtral, Qwen 72B | 30–60 tokens/sec on large models |
| Low-end PC / Chromebook | 4 GB or less | — | Not recommended | — | Use Groq free tier instead (cloud but free) |

⚠️ RAM tip: always use a quantized model

A "quantized" model (e.g. q4_0 or q5_k_m) is a compressed version that uses 3–4× less RAM with minimal quality loss. On 16 GB RAM, a q4_0 Llama 3.1 8B uses ~5 GB rather than ~16 GB. Always pick a quantized version unless you have 32+ GB RAM.
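As a back-of-envelope check (the bits-per-weight figure is an approximation, not an exact file size), you can estimate a quantized model's RAM footprint as parameters × bits-per-weight ÷ 8, plus roughly 1 GB of runtime overhead:

```shell
# Rough RAM estimate for a quantized model.
# Assumptions: ~4.5 bits/weight for q4_K_M, ~1 GB runtime overhead.
awk 'BEGIN {
  params_b = 8      # model size in billions of parameters (Llama 3.1 8B)
  bits     = 4.5    # approx. bits per weight for a q4 quantization
  printf "Estimated RAM: ~%.1f GB\n", params_b * bits / 8 + 1
}'
```

That lands near the ~5 GB figure above; at full 16-bit precision the same model needs 8 × 16 ÷ 8 = 16 GB.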

Local LLM Software Options

These tools handle downloading, running, and serving local models. OpenClaw connects to any of them (including Ollama, the command-line option covered first in the setup guide below). Pick the one that suits you best.

LM Studio
A polished desktop app with a built-in model browser, chat interface, and local server — great for beginners who prefer a GUI.
Free GUI Mac Windows Linux
  • Visual model browser — search and download from HuggingFace
  • Built-in chat UI to test models before connecting to OpenClaw
  • OpenAI-compatible local server (enable in app settings)
  • GPU acceleration on Mac (Metal), Windows (CUDA / Vulkan)
lmstudio.ai →
Jan.ai
An open-source ChatGPT alternative that runs locally. Clean UI, model hub, and a local API server built in.
Free GUI Mac Windows Linux
  • Familiar chat interface — easy for non-technical users
  • One-click model install from their curated hub
  • Runs an OpenAI-compatible API server at localhost:1337
  • Supports extensions for custom integrations
jan.ai →
GPT4All
Designed for simplicity. Excellent on low-end hardware, with a built-in local docs feature for RAG workflows.
Free GUI Mac Windows Linux
  • Best performance on CPU-only machines
  • Built-in "LocalDocs" — chat with your own files
  • Very low resource usage — works on older hardware
  • Provides an OpenAI-compatible API server
gpt4all.io →
🎁 No GPU? Try Groq's free tier instead

Groq offers free API access to Llama 3.1 and Mixtral with extremely fast inference (using their custom LPU hardware). It's not truly local, but it's free, requires no setup, and is much faster than running models on a CPU. Use "provider": "groq" in your OpenClaw config with a free Groq API key from console.groq.com.

Step-by-Step Setup

Each supported option has its own tailored setup guide below; find yours and follow along.

Ollama

  1. Install Ollama

    Download from ollama.com and run the installer. On Mac, drag to Applications. On Linux, run the one-liner:

    curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model

    Open your terminal and pull a model. For most machines, Llama 3.1 8B is the best starting point:

    ollama pull llama3.1:8b
    
    # For low-end hardware (under 8GB RAM), use a smaller model:
    ollama pull phi3:mini
    
    # For Apple Silicon / powerful GPU, try a larger model:
    ollama pull llama3.1:8b-instruct-q5_k_m
  3. Verify Ollama is running

    Ollama starts automatically after install. Confirm it's running and your model responds:

    ollama run llama3.1:8b "Say hello in one sentence"

    You should get a response within a few seconds.
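    You can also query Ollama's REST API directly — this is the same local endpoint OpenClaw will talk to (a sketch assuming the default port 11434; /api/chat is part of Ollama's standard API):

```shell
# Send a chat request to the local Ollama server.
# "stream": false returns one complete JSON response instead of chunks.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Say hello in one sentence"}],
  "stream": false
}'
```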

  4. Configure OpenClaw to use Ollama

    Add this to your openclaw.json (or config.yaml):

    {
      "llm": {
        "provider": "ollama",
        "model": "llama3.1:8b",
        "baseUrl": "http://localhost:11434"
      }
    }

    No apiKey is needed: Ollama runs without authentication. (Note that standard JSON does not allow comments, so don't add any to openclaw.json.)
  5. Start OpenClaw

    Run openclaw start from your project folder. Your agent is now running 100% locally at no cost.

📋 View all available Ollama models

Run ollama list to see your downloaded models. Browse the full library at ollama.com/library — there are hundreds of models including coding specialists, multilingual models, and vision models.

LM Studio

  1. Download and install LM Studio

    Go to lmstudio.ai, download for your OS, and install it. Launch the app.

  2. Download a model

    Click the Search tab (magnifying glass icon). Search for llama3.1 or mistral, pick a Q4_K_M quantized version for your RAM size, and click Download.

  3. Start the local API server

    Go to the Local Server tab (the <-> icon). Select your downloaded model from the dropdown, then click Start Server. It runs on http://localhost:1234 by default.

  4. Configure OpenClaw

    Add this to your openclaw.json:

    {
      "llm": {
        "provider": "openai",
        "model": "local-model",
        "apiKey": "lm-studio",
        "baseUrl": "http://localhost:1234/v1"
      }
    }

    LM Studio uses the OpenAI API format. The apiKey value doesn't matter — just set it to anything non-empty.
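    To confirm the server is up before pointing OpenClaw at it, you can hit the OpenAI-compatible chat completions route directly (a sketch assuming LM Studio's default port 1234):

```shell
# Test LM Studio's local OpenAI-compatible server.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```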

Jan.ai

  1. Download and install Jan.ai

    Go to jan.ai, download the installer for your OS, and install it. Launch the Jan app.

  2. Download a model from the Hub

    Click Hub in the left sidebar. Browse or search for a model (Llama 3.1 8B or Mistral 7B are good starting points). Click Download next to a Q4 variant.

  3. Enable the local API server

    Go to Settings (gear icon) → Advanced → turn on API Server. The server starts on http://localhost:1337.

  4. Configure OpenClaw
    {
      "llm": {
        "provider": "openai",
        "model": "mistral-7b-instruct-v0.2-q4",
        "apiKey": "jan-local",
        "baseUrl": "http://localhost:1337/v1"
      }
    }

    Use the exact model name shown in Jan.ai — it appears in the model card. The apiKey value can be anything.
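    If you're unsure of the exact name, OpenAI-compatible servers like Jan's typically expose a model listing; querying it shows the id to put in your config (assuming Jan's default port 1337):

```shell
# List the models the local server knows about.
curl http://localhost:1337/v1/models
```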

Groq (Free Cloud Alternative)

⚡ Groq isn't local — but it's free and incredibly fast

Groq runs open-source models (Llama, Mixtral, Gemma) on their custom LPU chips in the cloud. It's free with generous rate limits, has no data retention by default, and is dramatically faster than running models locally on CPU. Great if your hardware is limited.

  1. Get a free Groq API key

    Sign up at console.groq.com (free, no credit card needed). Go to API Keys and create a new key.

  2. Configure OpenClaw
    {
      "llm": {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "apiKey": "$GROQ_API_KEY"
      }
    }
  3. Available free models on Groq
    • llama-3.1-8b-instant — fast, good for most tasks
    • llama-3.1-70b-versatile — much more capable, still free
    • mixtral-8x7b-32768 — great for long-context tasks
    • gemma2-9b-it — Google's efficient model
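    Before wiring the key into OpenClaw, you can sanity-check it against Groq's OpenAI-compatible endpoint (replace $GROQ_API_KEY with your key, or export it first):

```shell
# Quick smoke test of a Groq API key against the chat completions endpoint.
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```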

Which Model Should You Use?

The best model depends on your hardware and use case. Here are our picks for each situation.

| Use Case | Recommended Model | Min RAM | Notes |
|---|---|---|---|
| General assistant / chat | llama3.1:8b | 8 GB | Best balance of quality and speed for most tasks |
| Low-end hardware | phi3:mini | 4 GB | Microsoft's 3.8B model — surprisingly capable for its size |
| Writing & summarization | mistral:7b | 8 GB | Excellent instruction following and long-form writing |
| Coding assistant | qwen2.5-coder:7b | 8 GB | Purpose-built for code generation and debugging |
| Multilingual tasks | qwen2.5:7b | 8 GB | Strong across 30+ languages including Chinese, Japanese, Arabic |
| Complex reasoning | llama3.1:70b-q4 | 32 GB | Near GPT-4 quality, but requires powerful hardware |
| Vision / image understanding | llava:13b | 16 GB | Can analyze images as well as text — multimodal |
| Maximum privacy / air-gapped | Any of the above | — | All local models are fully private — nothing leaves your machine |

Performance Tips

Getting the most speed out of your local model setup.

✅ Use GPU acceleration

GPU offloading is the single biggest performance improvement. Ollama, LM Studio, and Jan.ai all enable GPU acceleration automatically if they detect a compatible GPU. On a Mac with Apple Silicon, models run on the integrated GPU via Metal by default — much faster than CPU-only inference on Intel Macs.

🧮 Choose the right quantization level

Quantization levels trade quality for speed and RAM:

  • q4_0 — smallest/fastest, minor quality reduction — great for most uses
  • q4_k_m — slightly better quality than q4_0, same RAM — our top pick
  • q5_k_m — excellent quality/speed balance if you have 16+ GB RAM
  • q8_0 — near-full quality, needs lots of RAM — only if you have 32+ GB
⚠️ Close other RAM-heavy apps

Local LLMs compete directly with other applications for RAM. Before running a large model, close your browser tabs, video apps, and any other memory-intensive software. On macOS, Activity Monitor shows your "Memory Pressure" — keep it in the green zone for best performance.

💡 Keep your OpenClaw context short

Local models are slower with longer conversation histories. For automation tasks (where you don't need persistent chat), set a low maxTokens value and clear conversation history between tasks to keep responses snappy.
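As a sketch (assuming maxTokens is accepted inside the llm block — check your OpenClaw version's config reference), capping response length looks like:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434",
    "maxTokens": 512
  }
}
```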