
How Voice Mode Works

Voice Mode chains four stages together. Each one is independently configurable — you can use any combination, or just one stage at a time.

🎤

Wake Word

Say a trigger phrase ("Hey OpenClaw" or anything you choose) and the agent starts listening. Available on macOS and iOS. Uses native on-device detection — no audio is sent anywhere until after the wake word fires.

🗣️

Speech-to-Text (STT)

Your voice is transcribed by OpenAI Whisper — one of the most accurate speech recognition models available. Works with heavy accents, technical terms, and noisy environments. About $0.006 per minute.

🦞

AI Processing

The transcribed text is sent to your configured LLM exactly like a typed message. The agent runs its full capabilities — tools, memory, skills — and generates a response.

🔊

Text-to-Speech (TTS)

The response is converted to speech by ElevenLabs and streamed back with incremental playback — so you hear the first sentence while the rest is still being generated. Latency under 1 second.

💡

Start with just Whisper STT. Most people add text-to-speech later once they've got STT working. The agent's text replies are still visible on screen — adding ElevenLabs just means you hear them aloud too. Get Stage 2 working first.
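The four stages above compose into one loop. Here's a rough sketch of that flow — the function names and data shapes are illustrative assumptions, not OpenClaw's actual internals:

```python
# Illustrative stubs only: each function stands in for the real wake-word
# engine, Whisper, the LLM, and ElevenLabs. Stage boundaries match the
# four cards above.

def detect_wake_word(frame: str, phrases=("hey openclaw",)) -> bool:
    # Stage 1: runs on-device; no audio leaves the machine before this fires.
    return any(p in frame.lower() for p in phrases)

def transcribe(audio: bytes) -> str:
    # Stage 2: stand-in for a Whisper API call.
    return audio.decode("utf-8")

def run_agent(text: str) -> str:
    # Stage 3: the transcript is handled exactly like a typed message.
    return f"reply to: {text}"

def speak(reply: str) -> str:
    # Stage 4: stand-in for streamed ElevenLabs playback.
    return reply

def voice_turn(frame: str, audio: bytes):
    if not detect_wake_word(frame):
        return None  # keep listening; nothing is transcribed or sent anywhere
    return speak(run_agent(transcribe(audio)))
```

Because each stage is a separate config block, any step can effectively be a no-op — for example, if you only configure STT, the `speak` step simply never runs and replies stay on screen.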

Voice Features by Platform

Voice Mode capabilities differ slightly across platforms. Here's exactly what works where.

🍎

macOS

Full support
✅ Wake word detection
✅ Talk Mode (continuous)
✅ Whisper STT
✅ ElevenLabs TTS
✅ Push-to-talk
✅ Interrupt on speech

📱

iOS

Full support
✅ Wake word detection
✅ Talk Mode (continuous)
✅ Whisper STT
✅ ElevenLabs TTS
✅ Push-to-talk
⚠️ Background audio (limited)

🤖

Android

Partial
🚫 Wake word (not yet)
✅ Talk Mode (manual mic)
✅ Whisper STT
✅ ElevenLabs TTS
✅ Push-to-talk
⚠️ System assistant integration

🪟

Windows / Linux

CLI only
🚫 Native wake word
✅ Talk Mode (web UI)
✅ Whisper STT
✅ ElevenLabs TTS
✅ Push-to-talk (web UI)
⚠️ Via localhost:7799 web UI

🗣️ Set Up Whisper Speech-to-Text

Whisper is the easiest place to start. You'll need an OpenAI API key — even if you use Claude as your main LLM, Whisper is an OpenAI product and needs its own key.

Whisper STT Configuration

⏱ ~5 minutes

Get an OpenAI API key

Go to platform.openai.com → API keys → Create new secret key. You only need the whisper-1 model — a $5 credit lasts the average person several months of voice use.

Add the STT block to openclaw.json

~/.openclaw/openclaw.json
{
  "voice": {
    "stt": {
      "provider": "whisper",
      "model":    "whisper-1",
      "apiKey":   "sk-proj-xxxxxxxxxxxx",
      "language": "en"  // optional — auto-detect if omitted
    }
  }
}
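Before restarting, it's worth sanity-checking the block you just pasted. This helper is not part of OpenClaw — it's a hedged sketch of the checks worth making (known provider name, plausible key prefix):

```python
import json

def check_stt(stt: dict) -> list:
    """Return a list of likely problems with an stt config block."""
    problems = []
    # Provider names taken from the complete config reference on this page.
    if stt.get("provider") not in ("whisper", "native", "deepgram"):
        problems.append("unknown provider: %r" % stt.get("provider"))
    # OpenAI secret keys start with "sk-"; anything else is probably a typo.
    if stt.get("provider") == "whisper" and not str(stt.get("apiKey", "")).startswith("sk-"):
        problems.append("apiKey doesn't look like an OpenAI key")
    return problems

stt = json.loads('{"provider": "whisper", "model": "whisper-1", "apiKey": "sk-proj-xxxx"}')
print(check_stt(stt))  # an empty list means no obvious problems
```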

Restart OpenClaw and test

Restart OpenClaw and open the web UI at localhost:7799. You'll see a microphone icon in the input bar. Click and hold it, speak, then release — your words should appear as text in the message field.

Terminal confirmation
✅ Voice STT: Whisper (whisper-1) — ready
   Click the mic icon in the web UI to start speaking

Enable Talk Mode for continuous conversation

Talk Mode keeps the microphone open after each reply so you can have a back-and-forth without clicking the mic button each time. Enable it with:

~/.openclaw/openclaw.json
"voice": {
  "talkMode": {
    "enabled":           true,
    "silenceTimeoutMs":  1500,  // ms of silence before submitting
    "interruptOnSpeech": true   // speaking cancels current TTS reply
  }
}
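To build intuition for silenceTimeoutMs, here is a toy model of the submit decision (an assumption about the behavior implied by the comment above, not OpenClaw's actual voice-activity code): the transcript is submitted once the gap since the last detected speech reaches the timeout.

```python
def should_submit(ms_since_last_speech: int, silence_timeout_ms: int = 1500) -> bool:
    # A pause shorter than the timeout is treated as mid-sentence thinking;
    # reaching the timeout submits the transcript to the agent.
    return ms_since_last_speech >= silence_timeout_ms

print(should_submit(800))   # False — still composing the sentence
print(should_submit(1500))  # True — submitted to the agent
```

Raise the timeout if the agent keeps cutting you off mid-thought; lower it for snappier turn-taking.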

🔊 Set Up ElevenLabs Text-to-Speech

ElevenLabs gives your agent a natural, human-sounding voice. It streams audio incrementally so replies start playing within about a second of generation.

ElevenLabs TTS Configuration

⏱ ~10 minutes

Create an ElevenLabs account

Go to elevenlabs.io and sign up. The free tier gives you 10,000 characters per month — enough for testing. For daily use, the Starter plan ($5/month, 30,000 characters) is plenty for most people.

Get your API key and choose a voice

In the ElevenLabs dashboard: go to Profile → API Key and copy your key. Then go to Voice Library and pick a voice you like — copy its Voice ID from the voice card. Popular choices: Rachel, Adam, Aria.

Add the TTS block to openclaw.json

~/.openclaw/openclaw.json
{
  "voice": {
    "tts": {
      "provider":    "elevenlabs",
      "apiKey":      "sk_xxxxxxxxxxxxxxxx",
      "voiceId":     "21m00Tcm4TlvDq8ikWAM",  // Rachel
      "modelId":     "eleven_turbo_v2_5",        // fast + cheap
      "outputFormat":"mp3_44100_128",
      "streaming":   true                       // incremental playback
    }
  }
}
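Under the hood, the streaming flag maps to ElevenLabs' /stream endpoint variant. The sketch below shows roughly the request that config produces — the endpoint path and xi-api-key header follow ElevenLabs' public HTTP API, but treat the exact body fields as assumptions and check their docs before relying on this:

```python
import json

def build_tts_request(cfg: dict, text: str):
    # The /stream variant returns audio chunks as they render, which is
    # what enables incremental playback.
    url = "https://api.elevenlabs.io/v1/text-to-speech/%s/stream?output_format=%s" % (
        cfg["voiceId"], cfg["outputFormat"])
    headers = {"xi-api-key": cfg["apiKey"], "Content-Type": "application/json"}
    body = json.dumps({"text": text, "model_id": cfg["modelId"]})
    return url, headers, body

url, headers, body = build_tts_request(
    {"voiceId": "21m00Tcm4TlvDq8ikWAM", "modelId": "eleven_turbo_v2_5",
     "outputFormat": "mp3_44100_128", "apiKey": "sk_xxxx"},
    "Hello from OpenClaw")
print(url)
```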

Choose your model

ElevenLabs has several models. Here's how to choose:

Model comparison
eleven_turbo_v2_5    // ← Recommended. Fast, cheap, great quality
eleven_multilingual_v2 // Best quality, supports 29 languages, slower
eleven_monolingual_v1  // Older, English only — avoid unless on free tier

Test it

Restart OpenClaw and send any message via the web UI or Telegram. The response should play as audio on your speakers. You'll see a speaker icon in the message if TTS is active.

🎭

Clone your own voice. ElevenLabs lets you upload 1–5 minutes of your own voice recordings to create a personal voice clone. You can then make your OpenClaw agent speak in your own voice — handy for generating audio notes or dictating messages that sound like you.

🎤 Wake Word Setup

Wake word lets you activate your agent without touching your keyboard or phone. Say the phrase, pause, then give your command — all hands-free.

⚠️

macOS and iOS only. Wake word uses native on-device speech detection. Android uses manual push-to-talk instead. Wake word detection runs fully locally — no audio is sent to any server until after you speak the trigger phrase.

Wake Word Configuration

⏱ ~2 minutes

Add wake words to openclaw.json

Set one or more trigger phrases. You can use anything — "Hey OpenClaw", your agent's name, or something less likely to fire accidentally. All phrases are normalized to lowercase.

~/.openclaw/openclaw.json
{
  "voice": {
    "wakeWord": {
      "enabled": true,
      "phrases": [
        "hey openclaw",
        "hey claw",
        "ok lobster"  // add as many as you like
      ],
      "confirmationSound": true  // plays a soft chime when triggered
    }
  }
}
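Since all phrases are normalized to lowercase, "Hey OpenClaw" and "hey openclaw" in your config behave identically. A minimal sketch of that normalization (the whitespace collapsing is an assumption for illustration, not documented behavior):

```python
def normalize(phrase: str) -> str:
    # Lowercase and collapse runs of whitespace so phrase matching
    # is insensitive to how the trigger was typed in the config.
    return " ".join(phrase.lower().split())

print(normalize("  Hey   OpenClaw "))  # "hey openclaw"
```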

Grant microphone permission

On macOS: System Settings → Privacy & Security → Microphone — enable OpenClaw. On iOS: you'll be prompted automatically on first use.

Test it

Say your wake phrase clearly. You'll hear a confirmation chime (if enabled), then a short silence. Speak your command. The agent transcribes and responds.

Example interaction
You:     "Hey OpenClaw..."
Agent:   *chime* [listening...]
You:     "What's on my calendar today?"
Agent:   *speaks reply via ElevenLabs*

Complete Voice Config

The full voice block in openclaw.json — every option in one place with comments.

~/.openclaw/openclaw.json — complete voice block
{
  "voice": {

    // ── Speech-to-Text ──────────────────────────────────
    "stt": {
      "provider": "whisper",       // "whisper" | "native" | "deepgram"
      "model":    "whisper-1",
      "apiKey":   "sk-proj-xxxx",
      "language": "en"              // omit for auto-detect
    },

    // ── Text-to-Speech ──────────────────────────────────
    "tts": {
      "provider":     "elevenlabs",   // "elevenlabs" | "openai" | "native"
      "apiKey":       "sk_xxxx",
      "voiceId":      "21m00Tcm4TlvDq8ikWAM",
      "modelId":      "eleven_turbo_v2_5",
      "outputFormat": "mp3_44100_128",
      "streaming":    true,          // start playing before full text generated
      "speed":        1.0            // 0.7–1.2 playback speed
    },

    // ── Talk Mode ───────────────────────────────────────
    "talkMode": {
      "enabled":           true,
      "silenceTimeoutMs":  1500, // pause length before sending (ms)
      "interruptOnSpeech": true, // speaking stops current TTS playback
      "pushToTalk":        false // true = hold key/button instead of VAD
    },

    // ── Wake Word ───────────────────────────────────────
    "wakeWord": {
      "enabled":            true,
      "phrases":            ["hey openclaw", "hey claw"],
      "confirmationSound":  true,
      "listenDurationMs":   8000 // how long to listen after wake word (ms)
    },

    // ── Quiet Hours ─────────────────────────────────────
    "quietHours": {
      "enabled": true,
      "start":   "23:00",  // local time
      "end":     "07:00"   // TTS is silenced during this window
    }

  }
}
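Note that the quietHours window above crosses midnight, so a naive "start ≤ now ≤ end" comparison won't work. A sketch of the check, matching the semantics the config comments describe:

```python
from datetime import time

def in_quiet_hours(now: time, start=time(23, 0), end=time(7, 0)) -> bool:
    if start <= end:
        # Same-day window, e.g. 13:00-15:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 23:00-07:00:
    # quiet if after start OR before end.
    return now >= start or now < end

print(in_quiet_hours(time(23, 30)))  # True — TTS silenced
print(in_quiet_hours(time(12, 0)))   # False — audio plays
```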

What Does Voice Mode Cost?

Voice Mode uses third-party APIs for STT and TTS. Here's what each costs and how to keep bills low.

| Service | Free tier | Paid pricing | Typical monthly cost |
|---|---|---|---|
| Whisper STT | — | $0.006 / minute of audio | $0.50–$2 (light use) |
| ElevenLabs TTS (Turbo) | 10k chars/mo | $0.50 / 1,000 chars (Starter) | $5–$11 (Starter plan) |
| ElevenLabs TTS (Multilingual) | 10k chars/mo | $0.55 / 1,000 chars (Starter) | $5–$11 (Starter plan) |
| OpenAI TTS (alternative) | — | $0.015 / 1,000 chars (tts-1) | $1–$3 (light use) |
| Native TTS (macOS/iOS) | 100% free | Included with OS | $0 always |
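To turn those rates into a monthly estimate, multiply your usage by the per-unit prices. The rates below come from the table above; the usage numbers are made up for illustration:

```python
def monthly_cost(stt_minutes: float, tts_chars: int,
                 stt_per_min: float = 0.006, tts_per_1k: float = 0.50) -> float:
    # Whisper bills per minute of audio; ElevenLabs Turbo per 1,000 characters.
    return stt_minutes * stt_per_min + (tts_chars / 1000) * tts_per_1k

# ~100 minutes of speaking plus ~20,000 characters of spoken replies:
print(round(monthly_cost(100, 20_000), 2))  # 10.6
```

In practice STT is a rounding error; TTS characters dominate the bill, which is why the free native provider is the cheapest way to experiment.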
💰

Use native TTS to start for free. macOS and iOS have built-in text-to-speech that's completely free. Set "tts": { "provider": "native" } — the quality isn't as natural as ElevenLabs, but there's zero cost. Switch to ElevenLabs once you know you'll use Voice Mode regularly.

Get the Best Out of Voice Mode

  • 🎙️
    Use a headset or AirPods for Talk Mode Built-in laptop speakers cause echo and feedback loops where the mic picks up TTS output. Any earbuds eliminate this entirely — Talk Mode works much better with audio isolated to your ears.
  • 🔇
    Set quiet hours so TTS doesn't play at 3am If you use HEARTBEAT.md for automated tasks, the agent may reply via TTS even when you're asleep. Set quietHours in your config to silence audio between 23:00 and 07:00.
  • ⏸️
    Enable interruptOnSpeech for natural conversation With "interruptOnSpeech": true, speaking while the agent is talking immediately stops playback and starts listening. This makes conversations feel natural instead of waiting for the agent to finish.
  • 🌐
    Set the language for faster Whisper transcription If you always speak English, set "language": "en" in your STT config. Whisper skips language detection and transcribes faster. Supports 50+ languages — use your ISO code (e.g. "fr", "de", "ja").
  • 🎭
    Pick a distinct wake phrase to avoid false triggers "Hey OpenClaw" is reliable, but if you often say "hey" in conversation, add something more unique like "ok lobster" or your agent's custom name. Fewer false triggers means less frustration.
  • 🔋
    Disable wake word on battery power Always-on wake word detection uses a small but constant amount of CPU. If you're on a laptop away from power, set a keyboard shortcut for push-to-talk instead to save battery.

Frequently Asked Questions

Do I need an OpenAI API key if I use Claude as my main LLM?

Yes — Whisper is an OpenAI model and requires its own API key regardless of which LLM you use for the main agent. The good news is you only need a small amount of credit: $5 typically lasts months of regular voice use. Alternatively, you can use native macOS/iOS STT for free, or a third-party provider like Deepgram.

Is my audio sent to external servers?

Yes — audio is sent to OpenAI for Whisper transcription, and text is sent to ElevenLabs for speech synthesis. Wake word detection happens entirely on-device (nothing is sent until the wake word fires). If privacy is a concern, use native STT/TTS — macOS Siri voices and on-device speech recognition process everything locally. Check OpenAI's and ElevenLabs' data retention policies if you're handling sensitive conversations.

Can I use Voice Mode without paying for text-to-speech?

Yes — TTS is entirely optional. Without it, the agent's replies appear as text (same as normal). You can also use the free native provider (macOS system voices, iOS speech engine, or Windows SAPI) or OpenAI's TTS API as alternatives to ElevenLabs. Native is free; OpenAI TTS is cheaper than ElevenLabs but slightly less natural sounding.

How well does Whisper handle accents and technical terms?

Whisper is one of the most accent-resilient models available. It handles non-native English, regional accents, and technical jargon very well. For domain-specific terms (code function names, medical terminology), speaking clearly at a moderate pace helps. You can also add a prompt hint in your config that mentions common terms so Whisper biases toward them.

How do I change my agent's voice?

Browse the ElevenLabs Voice Library at elevenlabs.io/voice-library, preview voices, and copy the Voice ID of your favourite. Replace the voiceId in your openclaw.json and restart OpenClaw. The change takes effect immediately.

Can I use voice from a phone without the OpenClaw app?

Yes — if you've connected OpenClaw to Telegram or WhatsApp via Agent Channels, you can speak into those apps' built-in voice note feature. The audio note is sent to OpenClaw, transcribed (if you have STT enabled), and the agent responds. Full native Voice Mode with wake word requires the OpenClaw app on your device, but voice notes via messaging apps work on any phone.
