Say It Out Loud.
Your Agent Listens.
Voice Mode turns OpenClaw into a hands-free AI assistant. Speak naturally, get spoken replies, and trigger your agent with a wake word — no keyboard, no phone, no friction.
How Voice Mode Works
Voice Mode chains four stages together. Each one is independently configurable — you can use any combination, or just one stage at a time.
Wake Word
Say a trigger phrase ("Hey OpenClaw" or anything you choose) and the agent starts listening. Available on macOS and iOS. Uses native on-device detection — no audio is sent anywhere until after the wake word fires.
Speech-to-Text (STT)
Your voice is transcribed by OpenAI Whisper — one of the most accurate speech recognition models available. Works with heavy accents, technical terms, and noisy environments. About $0.006 per minute.
AI Processing
The transcribed text is sent to your configured LLM exactly like a typed message. The agent runs its full capabilities — tools, memory, skills — and generates a response.
Text-to-Speech (TTS)
The response is converted to speech by ElevenLabs and streamed back with incremental playback — so you hear the first sentence while the rest is still being generated. Latency under 1 second.
Start with just Whisper STT. Most people add text-to-speech later once they've got STT working. The agent's text replies are still visible on screen — adding ElevenLabs just means you hear them aloud too. Get Stage 2 working first.
Voice Features by Platform
Voice Mode capabilities differ slightly across platforms. Here's exactly what works where.
| Platform | Support |
|---|---|
| macOS | Full support |
| iOS | Full support |
| Android | Partial |
| Windows / Linux | CLI only |

🗣️ Set Up Whisper Speech-to-Text
Whisper is the easiest place to start. You'll need an OpenAI API key — even if you use Claude as your main LLM, Whisper is an OpenAI product and needs its own key.
Whisper STT Configuration
⏱ ~5 minutes

Get an OpenAI API key
Go to platform.openai.com → API keys → Create new secret key. You only need the whisper-1 model — a $5 credit lasts the average person several months of voice use.
Add the STT block to openclaw.json
```json
{
  "voice": {
    "stt": {
      "provider": "whisper",
      "model": "whisper-1",
      "apiKey": "sk-proj-xxxxxxxxxxxx",
      "language": "en"  // optional — auto-detect if omitted
    }
  }
}
```
Restart OpenClaw and test
Restart OpenClaw and open the web UI at localhost:7799. You'll see a microphone icon in the input bar. Click it, speak, and release — your words should appear as text in the message field.
```
✅ Voice STT: Whisper (whisper-1) — ready
```
Click the mic icon in the web UI to start speaking.
Enable Talk Mode for continuous conversation
Talk Mode keeps the microphone open after each reply so you can have a back-and-forth without clicking the mic button each time. Enable it with:
```json
"voice": {
  "talkMode": {
    "enabled": true,
    "silenceTimeoutMs": 1500,   // ms of silence before submitting
    "interruptOnSpeech": true   // speaking cancels current TTS reply
  }
}
```
🔊 Set Up ElevenLabs Text-to-Speech
ElevenLabs gives your agent a natural, human-sounding voice. It streams audio incrementally so replies start playing within about a second of generation.
ElevenLabs TTS Configuration
⏱ ~10 minutes

Create an ElevenLabs account
Go to elevenlabs.io and sign up. The free tier gives you 10,000 characters per month — enough for testing. For daily use, the Starter plan ($5/month, 30,000 characters) is plenty for most people.
Get your API key and choose a voice
In the ElevenLabs dashboard: go to Profile → API Key and copy your key. Then go to Voice Library and pick a voice you like — copy its Voice ID from the voice card. Popular choices: Rachel, Adam, Aria.
Add the TTS block to openclaw.json
```json
{
  "voice": {
    "tts": {
      "provider": "elevenlabs",
      "apiKey": "sk_xxxxxxxxxxxxxxxx",
      "voiceId": "21m00Tcm4TlvDq8ikWAM",  // Rachel
      "modelId": "eleven_turbo_v2_5",     // fast + cheap
      "outputFormat": "mp3_44100_128",
      "streaming": true                   // incremental playback
    }
  }
}
```
Choose your model
ElevenLabs has several models. Here's how to choose:
```
eleven_turbo_v2_5        // ← Recommended. Fast, cheap, great quality
eleven_multilingual_v2   // Best quality, supports 29 languages, slower
eleven_monolingual_v1    // Older, English only — avoid unless on free tier
```
Test it
Restart OpenClaw and send any message via the web UI or Telegram. The response should play as audio on your speakers. You'll see a speaker icon in the message if TTS is active.
Clone your own voice. ElevenLabs lets you upload 1–5 minutes of your own voice recordings to create a personal voice clone. You can then make your OpenClaw agent speak in your own voice — handy for generating audio notes or dictating messages that sound like you.
🎤 Wake Word Setup
Wake word lets you activate your agent without touching your keyboard or phone. Say the phrase, pause, then give your command — all hands-free.
macOS and iOS only. Wake word uses native on-device speech detection. Android uses manual push-to-talk instead. Wake word detection runs fully locally — no audio is sent to any server until after you speak the trigger phrase.
Wake Word Configuration
⏱ ~2 minutes

Add wake words to openclaw.json
Set one or more trigger phrases. You can use anything — "Hey OpenClaw", your agent's name, or something less likely to fire accidentally. All phrases are normalized to lowercase.
```json
{
  "voice": {
    "wakeWord": {
      "enabled": true,
      "phrases": [
        "hey openclaw",
        "hey claw",
        "ok lobster"             // add as many as you like
      ],
      "confirmationSound": true  // plays a soft chime when triggered
    }
  }
}
```
Grant microphone permission
On macOS: System Settings → Privacy & Security → Microphone — enable OpenClaw. On iOS: you'll be prompted automatically on first use.
Test it
Say your wake phrase clearly. You'll hear a confirmation chime (if enabled), then a short silence. Speak your command. The agent transcribes and responds.
```
You:   "Hey OpenClaw..."
Agent: *chime* [listening...]
You:   "What's on my calendar today?"
Agent: *speaks reply via ElevenLabs*
```
Complete Voice Config
The full voice block in openclaw.json — every option in one place with comments.
```json
{
  "voice": {
    // ── Speech-to-Text ──────────────────────────────────
    "stt": {
      "provider": "whisper",      // "whisper" | "native" | "deepgram"
      "model": "whisper-1",
      "apiKey": "sk-proj-xxxx",
      "language": "en"            // omit for auto-detect
    },
    // ── Text-to-Speech ──────────────────────────────────
    "tts": {
      "provider": "elevenlabs",   // "elevenlabs" | "openai" | "native"
      "apiKey": "sk_xxxx",
      "voiceId": "21m00Tcm4TlvDq8ikWAM",
      "modelId": "eleven_turbo_v2_5",
      "outputFormat": "mp3_44100_128",
      "streaming": true,          // start playing before full text generated
      "speed": 1.0                // 0.7–1.2 playback speed
    },
    // ── Talk Mode ───────────────────────────────────────
    "talkMode": {
      "enabled": true,
      "silenceTimeoutMs": 1500,   // pause length before sending (ms)
      "interruptOnSpeech": true,  // speaking stops current TTS playback
      "pushToTalk": false         // true = hold key/button instead of VAD
    },
    // ── Wake Word ───────────────────────────────────────
    "wakeWord": {
      "enabled": true,
      "phrases": ["hey openclaw", "hey claw"],
      "confirmationSound": true,
      "listenDurationMs": 8000    // how long to listen after wake word (ms)
    },
    // ── Quiet Hours ─────────────────────────────────────
    "quietHours": {
      "enabled": true,
      "start": "23:00",           // local time
      "end": "07:00"              // TTS is silenced during this window
    }
  }
}
```
What Does Voice Mode Cost?
Voice Mode uses third-party APIs for STT and TTS. Here's what each costs and how to keep bills low.
| Service | Free tier | Paid pricing | Typical monthly cost |
|---|---|---|---|
| Whisper STT | None | $0.006 / minute of audio | $0.50–$2 (light use) |
| ElevenLabs TTS (Turbo) | 10k chars/mo | $0.50 / 1,000 chars (Starter) | $5–$11 (Starter plan) |
| ElevenLabs TTS (Multilingual) | 10k chars/mo | $0.55 / 1,000 chars (Starter) | $5–$11 (Starter plan) |
| OpenAI TTS (alternative) | None | $0.015 / 1,000 chars (tts-1) | $1–$3 (light use) |
| Native TTS (macOS/iOS) | 100% free | Included with OS | $0 always |
Use native TTS to start for free. macOS and iOS have built-in text-to-speech that's completely free. Set "tts": { "provider": "native" } — the quality isn't as natural as ElevenLabs, but there's zero cost. Switch to ElevenLabs once you know you'll use Voice Mode regularly.
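That free-tier swap is a single-key change. A minimal sketch, assuming the native engine needs none of the ElevenLabs keys:

```json
{
  "voice": {
    "tts": {
      "provider": "native"   // built-in macOS/iOS speech engine — no API key, $0
    }
  }
}
```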
Get the Best Out of Voice Mode
- **Use a headset or AirPods for Talk Mode.** Built-in laptop speakers cause echo and feedback loops where the mic picks up TTS output. Any earbuds eliminate this entirely — Talk Mode works much better with audio isolated to your ears.
- **Set quiet hours so TTS doesn't play at 3am.** If you use HEARTBEAT.md for automated tasks, the agent may reply via TTS even when you're asleep. Set `quietHours` in your config to silence audio between 23:00 and 07:00.
- **Enable interruptOnSpeech for natural conversation.** With `"interruptOnSpeech": true`, speaking while the agent is talking immediately stops playback and starts listening. This makes conversations feel natural instead of waiting for the agent to finish.
- **Set the language for faster Whisper transcription.** If you always speak English, set `"language": "en"` in your STT config. Whisper skips language detection and transcribes faster. Supports 50+ languages — use your ISO code (e.g. "fr", "de", "ja").
- **Pick a distinct wake phrase to avoid false triggers.** "Hey OpenClaw" is reliable, but if you often say "hey" in conversation, add something more unique like "ok lobster" or your agent's custom name. Fewer false triggers means less frustration.
- **Disable wake word on battery power.** Always-on wake word detection uses a small but constant amount of CPU. If you're on a laptop away from power, set a keyboard shortcut for push-to-talk instead to save battery.
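The battery tip above can be sketched as a config fragment, using the `pushToTalk` and `wakeWord` options shown in the complete config:

```json
{
  "voice": {
    "wakeWord": {
      "enabled": false      // stop always-on detection to save CPU
    },
    "talkMode": {
      "enabled": true,
      "pushToTalk": true    // hold a key/button to talk instead of voice activation
    }
  }
}
```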
Frequently Asked Questions
**How do I change my agent's voice?**
Change the `voiceId` in your openclaw.json and restart OpenClaw. The change takes effect immediately.