  • Set up in under 10 minutes
  • Works with Llama, Mistral, Phi, Gemma & more
  • Runs on Mac, Windows, and Linux
  • GPU acceleration supported

What Is a Local LLM?

A quick explainer before we dive into setup.

When you use OpenClaw with a service like Claude or ChatGPT, every message you send goes over the internet to that company's servers — and you pay per token. A local LLM (Large Language Model) runs the AI model directly on your own computer instead. Nothing leaves your machine.

🖥️ Local LLM

  • Completely free — no API costs ever
  • 100% private — data never leaves your machine
  • Works offline — no internet needed
  • No rate limits — run as many queries as you want
  • ⚠️ Slower than cloud models on most hardware
  • ⚠️ Requires 8GB+ RAM (16GB+ recommended)
  • ⚠️ Smaller models = less capable than GPT-4/Claude Sonnet

☁️ Cloud API (Claude, GPT-4, etc.)

  • More capable, especially for complex reasoning
  • Fast responses — no hardware bottleneck
  • Always up to date with latest models
  • 💸 Costs $5–$30/month depending on usage
  • 🌐 Requires internet connection
  • 🔒 Data processed on provider's servers
  • Subject to rate limits on free tiers
💡 You don't have to choose one forever

You can switch between local and cloud models in your OpenClaw config at any time — even run different models for different tasks. Many users start with a local model to experiment for free, then switch to a cloud model for production use.
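Concretely (a sketch assuming the openclaw.json format used in the setup steps later on this page), switching is a matter of editing the llm block:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434"
  }
}
```

To move to a cloud model, change provider and model (for example "groq" and "llama-3.1-8b-instant") and add an apiKey; typically nothing else needs to change.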

Hardware Requirements

What you need before picking a model. The key number is your available RAM (and VRAM if you have a GPU).

| Your Machine | RAM | GPU? | Tier | Best Models | Expected Speed |
|---|---|---|---|---|---|
| Older laptop / basic PC | 8 GB | CPU only | Basic | Phi-3 Mini (3.8B), Gemma 2B | 3–8 tokens/sec |
| Modern laptop / mid PC | 16 GB | CPU only | Mid | Llama 3.1 8B, Mistral 7B, Phi-3 Medium | 8–20 tokens/sec |
| Mac with Apple Silicon (M1/M2/M3) | 8–24 GB unified | Integrated GPU | Power | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B | 25–50 tokens/sec |
| PC with dedicated GPU (RTX 3060+) | 16 GB+ | 8–12 GB VRAM | Power | Llama 3.1 8B–13B, Mixtral 8x7B | 40–80 tokens/sec |
| PC with RTX 4090 / workstation GPU | 32 GB+ | 24 GB VRAM | Power | Llama 3.1 70B (Q4), Mixtral, Qwen 72B | 30–60 tokens/sec on large models |
| Low-end PC / Chromebook | 4 GB or less | — | Not recommended | — | Use Groq free tier instead (cloud but free) |

⚠️ RAM tip: always use a quantized model

A "quantized" model (e.g. q4_0 or q5_k_m) is a compressed version that uses 3–4× less RAM with minimal quality loss. On 16 GB RAM, a q4_0 Llama 3.1 8B uses ~5 GB rather than ~16 GB. Always pick a quantized version unless you have 32+ GB RAM.
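As a back-of-envelope check (the bits-per-weight figure is an approximation, not an exact file size), you can estimate a quantized model's RAM footprint as parameters × bits-per-weight ÷ 8, plus roughly 1 GB of runtime overhead:

```shell
# Rough RAM estimate for a quantized model.
# Assumptions: ~4.5 bits/weight for q4_K_M, ~1 GB runtime overhead.
awk 'BEGIN {
  params_b = 8      # model size in billions of parameters (Llama 3.1 8B)
  bits     = 4.5    # approx. bits per weight for a q4 quantization
  printf "Estimated RAM: ~%.1f GB\n", params_b * bits / 8 + 1
}'
```

That lands near the ~5 GB figure above; at full 16-bit precision the same model needs 8 × 16 ÷ 8 = 16 GB.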

Local LLM Software Options

These tools handle downloading, running, and serving local models. OpenClaw connects to any of them (including Ollama, the command-line option covered first in the setup guide below). Pick the one that suits you best.

LM Studio
A polished desktop app with a built-in model browser, chat interface, and local server — great for beginners who prefer a GUI.
Free GUI Mac Windows Linux
  • Visual model browser — search and download from HuggingFace
  • Built-in chat UI to test models before connecting to OpenClaw
  • OpenAI-compatible local server (enable in app settings)
  • GPU acceleration on Mac (Metal), Windows (CUDA / Vulkan)
lmstudio.ai →
Jan.ai
An open-source ChatGPT alternative that runs locally. Clean UI, model hub, and a local API server built in.
Free GUI Mac Windows Linux
  • Familiar chat interface — easy for non-technical users
  • One-click model install from their curated hub
  • Runs an OpenAI-compatible API server at localhost:1337
  • Supports extensions for custom integrations
jan.ai →
GPT4All
Designed for simplicity. Excellent on low-end hardware, with a built-in local docs feature for RAG workflows.
Free GUI Mac Windows Linux
  • Best performance on CPU-only machines
  • Built-in "LocalDocs" — chat with your own files
  • Very low resource usage — works on older hardware
  • Provides an OpenAI-compatible API server
gpt4all.io →
🎁 No GPU? Try Groq's free tier instead

Groq offers free API access to Llama 3.1 and Mixtral with extremely fast inference (using their custom LPU hardware). It's not truly local, but it's free, requires no setup, and is much faster than running models on a CPU. Use "provider": "groq" in your OpenClaw config with a free Groq API key from console.groq.com.

Step-by-Step Setup

Each supported option has its own tailored setup guide below; find yours and follow along.

Ollama

  1. Install Ollama

    Download from ollama.com and run the installer. On Mac, drag to Applications. On Linux, run the one-liner:

    curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model

    Open your terminal and pull a model. For most machines, Llama 3.1 8B is the best starting point:

    ollama pull llama3.1:8b
    
    # For low-end hardware (under 8GB RAM), use a smaller model:
    ollama pull phi3:mini
    
    # For Apple Silicon / powerful GPU, try a larger model:
    ollama pull llama3.1:8b-instruct-q5_k_m
  3. Verify Ollama is running

    Ollama starts automatically after install. Confirm it's running and your model responds:

    ollama run llama3.1:8b "Say hello in one sentence"

    You should get a response within a few seconds.
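    You can also query Ollama's REST API directly — this is the same local endpoint OpenClaw will talk to (a sketch assuming the default port 11434; /api/chat is part of Ollama's standard API):

```shell
# Send a chat request to the local Ollama server.
# "stream": false returns one complete JSON response instead of chunks.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Say hello in one sentence"}],
  "stream": false
}'
```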

  4. Configure OpenClaw to use Ollama

    Add this to your openclaw.json (or config.yaml):

    {
      "llm": {
        "provider": "ollama",
        "model": "llama3.1:8b",
        "baseUrl": "http://localhost:11434"
      }
    }

    No apiKey is needed: Ollama runs without authentication. (Note that standard JSON does not allow comments, so don't add any to openclaw.json.)
  5. Start OpenClaw

    Run openclaw start from your project folder. Your agent is now running 100% locally at no cost.

📋 View all available Ollama models

Run ollama list to see your downloaded models. Browse the full library at ollama.com/library — there are hundreds of models including coding specialists, multilingual models, and vision models.

LM Studio

  1. Download and install LM Studio

    Go to lmstudio.ai, download for your OS, and install it. Launch the app.

  2. Download a model

    Click the Search tab (magnifying glass icon). Search for llama3.1 or mistral, pick a Q4_K_M quantized version for your RAM size, and click Download.

  3. Start the local API server

    Go to the Local Server tab (the <-> icon). Select your downloaded model from the dropdown, then click Start Server. It runs on http://localhost:1234 by default.

  4. Configure OpenClaw

    Add this to your openclaw.json:

    {
      "llm": {
        "provider": "openai",
        "model": "local-model",
        "apiKey": "lm-studio",
        "baseUrl": "http://localhost:1234/v1"
      }
    }

    LM Studio uses the OpenAI API format. The apiKey value doesn't matter — just set it to anything non-empty.
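    To confirm the server is up before pointing OpenClaw at it, you can hit the OpenAI-compatible chat completions route directly (a sketch assuming LM Studio's default port 1234):

```shell
# Test LM Studio's local OpenAI-compatible server.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```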

Jan.ai

  1. Download and install Jan.ai

    Go to jan.ai, download the installer for your OS, and install it. Launch the Jan app.

  2. Download a model from the Hub

    Click Hub in the left sidebar. Browse or search for a model (Llama 3.1 8B or Mistral 7B are good starting points). Click Download next to a Q4 variant.

  3. Enable the local API server

    Go to Settings (gear icon) → Advanced → turn on API Server. The server starts on http://localhost:1337.

  4. Configure OpenClaw
    {
      "llm": {
        "provider": "openai",
        "model": "mistral-7b-instruct-v0.2-q4",
        "apiKey": "jan-local",
        "baseUrl": "http://localhost:1337/v1"
      }
    }

    Use the exact model name shown in Jan.ai — it appears in the model card. The apiKey value can be anything.
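    If you're unsure of the exact name, OpenAI-compatible servers like Jan's typically expose a model listing; querying it shows the id to put in your config (assuming Jan's default port 1337):

```shell
# List the models the local server knows about.
curl http://localhost:1337/v1/models
```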

Groq (Free Cloud Alternative)

⚡ Groq isn't local — but it's free and incredibly fast

Groq runs open-source models (Llama, Mixtral, Gemma) on their custom LPU chips in the cloud. It's free with generous rate limits, has no data retention by default, and is dramatically faster than running models locally on CPU. Great if your hardware is limited.

  1. Get a free Groq API key

    Sign up at console.groq.com (free, no credit card needed). Go to API Keys and create a new key.

  2. Configure OpenClaw
    {
      "llm": {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "apiKey": "$GROQ_API_KEY"
      }
    }
  3. Available free models on Groq
    • llama-3.1-8b-instant — fast, good for most tasks
    • llama-3.1-70b-versatile — much more capable, still free
    • mixtral-8x7b-32768 — great for long-context tasks
    • gemma2-9b-it — Google's efficient model
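    Before wiring the key into OpenClaw, you can sanity-check it against Groq's OpenAI-compatible endpoint (replace $GROQ_API_KEY with your key, or export it first):

```shell
# Quick smoke test of a Groq API key against the chat completions endpoint.
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```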

Which Model Should You Use?

The best model depends on your hardware and use case. Here are our picks for each situation.

| Use Case | Recommended Model | Min RAM | Notes |
|---|---|---|---|
| General assistant / chat | llama3.1:8b | 8 GB | Best balance of quality and speed for most tasks |
| Low-end hardware | phi3:mini | 4 GB | Microsoft's 3.8B model — surprisingly capable for its size |
| Writing & summarization | mistral:7b | 8 GB | Excellent instruction following and long-form writing |
| Coding assistant | qwen2.5-coder:7b | 8 GB | Purpose-built for code generation and debugging |
| Multilingual tasks | qwen2.5:7b | 8 GB | Strong across 30+ languages including Chinese, Japanese, Arabic |
| Complex reasoning | llama3.1:70b-q4 | 32 GB | Near GPT-4 quality, but requires powerful hardware |
| Vision / image understanding | llava:13b | 16 GB | Can analyze images as well as text — multimodal |
| Maximum privacy / air-gapped | Any of the above | — | All local models are fully private — nothing leaves your machine |

Performance Tips

Getting the most speed out of your local model setup.

✅ Use GPU acceleration

GPU offloading is the single biggest performance improvement. Ollama, LM Studio, and Jan.ai all enable GPU acceleration automatically if they detect a compatible GPU. On a Mac with Apple Silicon, models run on the integrated GPU via Metal by default — much faster than CPU-only inference on Intel Macs.

🧮 Choose the right quantization level

Quantization levels trade quality for speed and RAM:

  • q4_0 — smallest/fastest, minor quality reduction — great for most uses
  • q4_k_m — slightly better quality than q4_0, same RAM — our top pick
  • q5_k_m — excellent quality/speed balance if you have 16+ GB RAM
  • q8_0 — near-full quality, needs lots of RAM — only if you have 32+ GB
⚠️ Close other RAM-heavy apps

Local LLMs compete directly with other applications for RAM. Before running a large model, close your browser tabs, video apps, and any other memory-intensive software. On macOS, Activity Monitor shows your "Memory Pressure" — keep it in the green zone for best performance.

💡 Keep your OpenClaw context short

Local models are slower with longer conversation histories. For automation tasks (where you don't need persistent chat), set a low maxTokens value and clear conversation history between tasks to keep responses snappy.
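As a sketch (assuming maxTokens is accepted inside the llm block — check your OpenClaw version's config reference), capping response length looks like:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "baseUrl": "http://localhost:11434",
    "maxTokens": 512
  }
}
```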