OpenClaw Voice Options Compared
A comprehensive look at every voice option available to OpenClaw users — built-in features, third-party integrations, and how CrabCallr fits in.
Information valid as of Mon Mar 2, 2026. OpenClaw is evolving rapidly — features may have changed since this was written.
Overview
OpenClaw has a rich ecosystem of voice capabilities. Some ship with the core platform or its official companion apps, while others are third-party integrations built by companies like Deepgram and ElevenLabs. This page documents every option we're aware of, what each one can and can't do, and how CrabCallr compares.
We've organized options into three groups: built-in (ships with OpenClaw or its companion apps), third-party (requires external accounts and setup), and managed (CrabCallr's hosted platform).
Built-in Options
1. Voice-Call Plugin
Built-in · Phone calls · Source on GitHub
The voice-call plugin is OpenClaw's official telephony integration. It runs inside the OpenClaw Gateway and connects to a telephony provider that you manage.
How it works
The plugin starts a webhook server on port 3334. Your telephony provider (Twilio, Telnyx, or Plivo) sends inbound call events to this webhook. The plugin uses the provider's native speech recognition (<Gather> TwiML for Twilio) to capture speech, passes the text to your OpenClaw agent, and plays back the response via TTS.

User speaks → Provider detects silence → Provider's ASR returns text → OpenClaw responds → TTS plays

Capabilities
- Telephony providers: Twilio, Telnyx, or Plivo — you bring your own account and phone number
- Inbound calls: Allowlist-based — only approved numbers can reach your agent
- Outbound calls: One-way notifications and multi-turn conversations
- TTS: OpenAI voices or ElevenLabs (ElevenLabs support was buggy initially but has been improved)
- Mid-call tool execution: Your agent can invoke tools (web search, file operations, etc.) during a call
- Barge-in: Limited — depends on provider-level silence detection; you can't interrupt mid-TTS on all providers
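The per-turn webhook exchange described above can be sketched in a few lines. This is an illustrative stand-in, not the plugin's actual code: the agent_reply helper, the /voice/turn action path, and the handler shape are assumptions; only the <Gather>-based TwiML pattern comes from the flow described above.

```python
# Sketch of one conversational turn on the webhook side: receive the
# provider's ASR transcript, get the agent's reply, and answer with
# TwiML that speaks the reply and re-arms <Gather> for the next turn.
import xml.etree.ElementTree as ET


def agent_reply(text: str) -> str:
    """Stand-in for the round trip to the OpenClaw agent."""
    return f"You said: {text}"


def twiml_for_turn(speech_result: str) -> str:
    """Build a Twilio-style TwiML response for one turn of the call."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = agent_reply(speech_result)
    # Re-open speech capture so the caller's next utterance comes back
    # to the webhook (the action path here is a made-up example).
    ET.SubElement(response, "Gather", input="speech", action="/voice/turn")
    return ET.tostring(response, encoding="unicode")
```

In the real plugin, a web server on port 3334 would return this XML body to the provider on each inbound event.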
Limitations
- User-managed infrastructure: Requires your own telephony provider account, phone number, and a publicly reachable webhook URL (ngrok, Tailscale, or static IP)
- Turn-based ASR: The system waits for you to stop speaking before it processes
- No browser calling: Phone-only — a WebRTC feature request was closed as "not planned"
- Basic VAD: Relies on the provider's silence detection, not sophisticated voice activity detection
- No noise suppression: No built-in audio cleanup
- No custom vocabulary: Cannot boost domain-specific terms for better recognition
- Security: Anyone who knows the phone number can attempt to call unless you carefully maintain the allowlist
Best for: Users who want full control over their telephony stack, already have a Twilio/Telnyx/Plivo account, and are comfortable managing webhooks and public URLs.
2. Talk Mode
Built-in · Companion apps · OpenClaw README
Talk Mode provides continuous voice conversation through OpenClaw's companion apps for macOS, iOS, and Android. The companion app connects to the OpenClaw Gateway via WebSocket.
How it works
- ASR: Runs locally on the device (Apple Speech Recognition on macOS/iOS; platform speech on Android)
- Turn detection: Silence-based — the system waits for you to stop speaking before sending the transcript
- Gateway communication: Transcript is sent as text over WebSocket to the Gateway, which calls the LLM
- TTS: Runs locally on the device (calls ElevenLabs API directly, or uses Edge TTS as a free fallback)
- Barge-in: Yes — if you speak while TTS is playing, playback stops immediately and your new speech is captured
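The silence-based turn detection described above can be illustrated with a toy simulation. The frame granularity and silence threshold are invented for the example; the real implementation lives inside the companion apps.

```python
# Toy model of silence-based turn detection: accumulate recognized text
# and emit a completed "turn" only after a run of silent frames.
SILENCE_FRAMES_TO_END_TURN = 3  # e.g. three consecutive silent frames


def detect_turns(frames):
    """frames: iterable of recognized text per frame, or None for silence."""
    turns, current, silent = [], [], 0
    for frame in frames:
        if frame is None:
            silent += 1
            if silent >= SILENCE_FRAMES_TO_END_TURN and current:
                turns.append(" ".join(current))  # pause long enough: turn ends
                current = []
        else:
            silent = 0
            current.append(frame)
    if current:  # flush a trailing partial turn at end of stream
        turns.append(" ".join(current))
    return turns
```

This is why Talk Mode feels turn-based: nothing is sent to the Gateway until the pause threshold is crossed.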
Capabilities
- Platforms: macOS (menu bar app), iOS (Bridge node), Android (Bridge node)
- Barge-in: Yes — interruption stops TTS immediately
- Push-to-talk: Available as an alternative to always-on listening
- Works with local models: Since ASR/TTS run on-device, the Gateway can use a local LLM
Limitations
- Requires companion app: Must install the macOS menu bar app or pair an iOS/Android device
- Not phone-based: Cannot call from a landline, car Bluetooth, or any phone
- No browser calling: No WebRTC — you need the native app
- Setup complexity for remote use: To use away from home, you need Tailscale, SSH tunnels, or a publicly exposed Gateway
- Silence-based turn detection: No streaming interim results; waits for a full pause before processing
- Platform-specific ASR: Uses Apple Speech Recognition (macOS/iOS) or platform speech (Android) — not available on Windows or Linux
- No noise suppression: No built-in Krisp-style audio cleanup
- No custom vocabulary: Cannot boost domain-specific terms
Best for: Users who want hands-free voice on their Mac or phone, are okay with installing a companion app, and primarily use OpenClaw from home or on their local network.
3. Voice Wake
Built-in · Wake word detection · Official docs
Voice Wake provides always-on wake word detection that triggers Talk Mode. Say a custom wake phrase and your assistant starts listening.
How it works
- Wake words: Configurable global list managed by the Gateway (stored at ~/.openclaw/settings/voicewake.json)
- Detection: Runs locally on each device — no cloud connection for audio detection
- Protocol: voicewake.get, voicewake.set, and voicewake.changed events keep all devices in sync
- Activation: When a wake word is detected, Talk Mode activates automatically
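As a rough illustration of the sync protocol, here is what the wake-word messages might look like. The method names come from the protocol description above; the payload shape (a phrases list under params) is an assumption, not the real wire format.

```python
# Illustrative message builders for the voicewake sync protocol.
import json


def voicewake_set(phrases):
    """Build a voicewake.set message replacing the global wake-word list."""
    return json.dumps({"method": "voicewake.set", "params": {"phrases": phrases}})


def voicewake_changed(phrases):
    """Event the Gateway would broadcast so all devices stay in sync."""
    return json.dumps({"method": "voicewake.changed", "params": {"phrases": phrases}})
```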
Capabilities
- Platforms: macOS, iOS, and Android
- Custom wake words: You can set any phrase as a trigger
- Multi-device sync: Changes to wake words propagate to all connected devices via the Gateway
- Low resource usage: Minimal CPU and battery impact
Limitations
- Requires companion app: Only available through the macOS, iOS, or Android companion apps
- Not available on Windows/Linux: No desktop Voice Wake outside macOS
- Triggers Talk Mode only: Inherits all Talk Mode limitations (silence-based turn detection, etc.)
Best for: Users who want a hands-free "Hey Siri"-style experience with their OpenClaw assistant from their Mac or phone.
4. TTS & STT Configuration
Built-in · Voice messages & replies · TTS docs
OpenClaw has a configurable TTS and STT pipeline for voice messages across all its messaging channels (WhatsApp, Telegram, Slack, Discord, etc.). This is separate from Talk Mode — it controls how your agent speaks and understands voice notes within chat conversations.
TTS providers
- OpenAI: Voices include alloy, echo, fable, onyx, nova, shimmer. Model: gpt-4o-mini-tts
- ElevenLabs: Highest quality. Dozens of pre-made voices plus custom voice cloning. Configurable stability, similarity boost, style, and speed
- Edge TTS: Free baseline using Microsoft's neural voices (e.g., en-US-MichelleNeural). No API key required. Auto-fallback when no other keys are configured
STT providers
Auto-detected in priority order: OpenAI → Groq → Deepgram → Google. Local Whisper CLI is available as a fallback.
- OpenAI Whisper: Default is gpt-4o-mini-transcribe
- Deepgram: Streaming mode with Nova-2 model
- Local Whisper: Run offline via CLI. Options include standard Whisper, faster-whisper (4–6x faster), and Whisper MLX (Apple Silicon optimized)
TTS modes
Controlled via messages.tts.auto in config or the /tts slash command:
- off — No TTS (default)
- inbound — Reply with voice only when the user sends a voice message
- always — Every reply is spoken
- tagged — Only speak when explicitly requested
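The four modes reduce to a small decision function. This is an illustrative sketch, not OpenClaw source — only the mode names come from the docs.

```python
# Decide whether a reply should be rendered with TTS, per the four
# documented messages.tts.auto modes.
def should_speak(mode, user_sent_voice=False, explicitly_requested=False):
    """Return True if the reply should be spoken aloud."""
    if mode == "always":
        return True
    if mode == "inbound":
        return user_sent_voice  # mirror the user's own modality
    if mode == "tagged":
        return explicitly_requested
    return False  # "off" (the default)
```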
Limitations
- Not real-time conversation: This is asynchronous voice messaging, not live phone or WebRTC calls
- Latency: Local Whisper on CPU can take 30–60 seconds per voice note
- No barge-in: You send a voice note, wait for processing, then hear the reply
Best for: Users who want voice messages in their existing chat channels (WhatsApp, Telegram, etc.) without needing a live call.
5. Voice Message Transcription
Built-in · All messaging channels · Audio docs
OpenClaw automatically transcribes incoming voice messages and audio files so your agent can process them as text. This works across all supported messaging channels.
Key features
- Auto-detection: Audio transcription is enabled by default. OpenClaw tries local CLIs first, then provider APIs
- Fallback chain: If the first STT provider fails (size, timeout, auth), the next one is tried automatically
- Group mention detection: When requireMention: true is set, voice notes are transcribed before checking for mentions — so you can mention your agent in a voice message
- Configurable limits: Max character length, timeout settings, and per-provider overrides
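The fallback chain behaves roughly like the sketch below. The provider callables are stand-ins for the real OpenAI, Groq, Deepgram, Google, and local Whisper clients; only the try-in-priority-order behavior comes from the description above.

```python
# Try each configured STT provider in priority order; on failure
# (size limit, timeout, auth error) move to the next one.
def transcribe_with_fallback(audio, providers):
    """providers: ordered list of (name, callable) pairs."""
    errors = {}
    for name, stt in providers:
        try:
            return stt(audio)  # first success wins
        except Exception as exc:
            errors[name] = str(exc)  # record and fall through
    raise RuntimeError(f"all STT providers failed: {errors}")
```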
Limitations
- Not real-time: Processes completed audio files, not streaming speech
- One direction: Transcribes inbound voice → text. Replies are text (unless TTS auto mode is enabled)
Third-Party Integrations
6. DeepClaw by Deepgram
Third-party · Phone calls · GitHub · Blog post
DeepClaw is an open-source integration by Deepgram that lets you call your OpenClaw agent over the phone using Deepgram's Voice Agent API. It's the most mature third-party voice integration for OpenClaw.
How it works
DeepClaw runs a ~400-line Python voice agent server. You install it as an OpenClaw skill, tell your agent "I want to call you on the phone," and it walks you through setting up a Deepgram account and Twilio number.
Capabilities
- STT: Deepgram Flux with semantic turn detection — understands when you're done talking instead of waiting for silence
- TTS: Deepgram Aura-2 at ~90ms time-to-first-byte
- Guided setup: Your OpenClaw agent walks you through configuration
- Open source: Fully open, ~400 lines of Python
Limitations
- Requires two external accounts: Deepgram and Twilio
- Phone only: No browser calling or WebRTC
- Latency: Deepgram's STT/TTS adds ~200–300ms, but end-to-end latency per turn is reported at 2.2–3.4 seconds due to OpenClaw Gateway processing time
- Requires running server: The Python voice agent must stay running to accept calls
- No noise suppression: No built-in Krisp-style audio cleanup
Best for: Users who want higher-quality phone calling than the built-in Voice-Call plugin and prefer Deepgram's semantic turn detection.
7. ElevenLabs Conversational AI
Third-party · Phone & widget · ElevenLabs platform
ElevenLabs Conversational AI can use OpenClaw as a "custom LLM" backend. ElevenLabs handles the full voice pipeline (STT, TTS, turn management) while routing the text through your OpenClaw instance's /v1/chat/completions endpoint.
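The text round trip uses the standard OpenAI-style chat-completions schema. Here is a sketch of the request body an ElevenLabs-style caller would POST to your endpoint; the model name is a placeholder, and the schema shown is the generic /v1/chat/completions shape rather than anything OpenClaw-specific.

```python
# Build an OpenAI-compatible chat-completions request body of the kind
# a "custom LLM" integration would POST to your OpenClaw endpoint.
import json


def build_chat_request(user_text, history=()):
    """history: optional prior messages as {"role": ..., "content": ...} dicts."""
    messages = list(history) + [{"role": "user", "content": user_text}]
    return json.dumps({"model": "openclaw", "messages": messages, "stream": True})
```

The response streams back as text, which ElevenLabs then renders with its own TTS.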
Capabilities
- TTS quality: Industry-leading ElevenLabs voices with low latency
- Phone calling: Available via ElevenLabs + Twilio integration
- Embeddable widget: Web-based calling via ElevenLabs' widget
Limitations
- Requires up to three accounts: ElevenLabs, Twilio, and potentially ngrok
- Requires public URL: Your OpenClaw's /v1/chat/completions endpoint must be publicly reachable
- Complex setup: Seven steps — see our "Why Not DIY" section on the homepage
- Runs on ElevenLabs' platform: Your text goes through their servers, not yours
- No OpenClaw tool execution mid-call: ElevenLabs agents have their own tool system, separate from OpenClaw's skills
Best for: Users who prioritize ElevenLabs' voice quality and are comfortable with the DIY setup.
8. Other Third-Party Integrations
The OpenClaw ecosystem is growing rapidly. A few more integrations have emerged or been requested:
- LiveKit + LemonSlice: A community project combining LiveKit for WebRTC, Deepgram for STT, ElevenLabs for TTS, and LemonSlice for a lip-synced avatar. Demonstrates that OpenClaw can power a real-time avatar, but requires significant DIY assembly.
- Cartesia & Inworld: Community skills that give OpenClaw access to TTS, voice cloning, and audio transcription via these providers.
- Jupiter Voice: A community project for fully local, offline voice using local Whisper and Piper TTS. No cloud dependency at all.
- Real-time voice conversation (Feature Request #7200): An active request for native real-time bidirectional voice with WebRTC, LiveKit Agents, Pipecat bridge, and OpenAI Realtime API support. Not yet implemented.
- OpenClaw Voice (Purple-Horizons): A browser-based voice interface that uses WebSocket streaming, local Whisper for STT, and ElevenLabs for TTS. Self-hosted (Python), requires ElevenLabs and OpenAI API keys, and uses Silero VAD for basic noise filtering. No barge-in, no custom vocabulary, and no WebRTC (WebSocket only). Early-stage project (~60 stars).
Managed Platform
9. CrabCallr
Managed service · Browser + phone · Getting started
CrabCallr is a managed voice platform purpose-built for OpenClaw. The open-source plugin connects outbound via WebSocket to CrabCallr's cloud infrastructure — no open ports, no webhooks, no tunnels.
What's included
- Browser calling: WebRTC via LiveKit — works in any modern browser, no app install
- Phone calling: Managed Twilio integration with caller ID routing (Basic plan)
- Natural voices: 12+ curated ElevenLabs voices with low-latency streaming TTS
- Barge-in: Interrupt anytime — the assistant stops immediately and listens
- Noise suppression: Krisp-powered audio cleanup for clear calls even in noisy environments
- Custom vocabulary: Add domain-specific keyterms to boost speech recognition accuracy
- Session isolation: Per-channel dmScope keeps voice conversations private from other channels
- Built-in authentication: API key — no open ports, no allowlist management
Setup
- Create a CrabCallr account and get your API key
- Install: openclaw plugins install @wooters/crabcallr
- Add your API key to ~/.openclaw/openclaw.json and start talking
Best for: Users who want voice to just work — browser and phone calling with zero infrastructure management. Free tier includes browser calling.
Full Comparison Table
Every OpenClaw voice option, side by side.
| Capability | CrabCallr | Voice-Call Plugin | Talk Mode | DeepClaw (Deepgram) | ElevenLabs Agents |
|---|---|---|---|---|---|
| Browser WebRTC calling | ✓ | ✗ | ✗ | ✗ | Widget only |
| Phone calling | ✓ | User manages | ✗ | User manages | User manages |
| Works without app install | ✓ | ✗ | ✗ | ✗ | Widget only |
| Barge-in support | ✓ | Limited | ✓ | ✓ | ✓ |
| Noise suppression | ✓ | ✗ | ✗ | ✗ | ✗ |
| Custom vocabulary | ✓ | ✗ | ✗ | ✗ | ✗ |
| No open ports / tunnels | ✓ | ✗ | Local only | ✗ | ✗ |
| External accounts needed | 0 | 1–2 | 0–1 | 2 | 2–3 |
| Setup steps | 3 | 5+ | 3–4 | 4+ | 7 |
| Mid-call tool execution | ✓ | ✓ | ✓ | ✓ | ✗ |
| Works on Windows/Linux | ✓ | ✓ | ✗ | ✓ | ✓ |
| Open source | Plugin is OSS | ✓ | ✓ | ✓ | ✗ |
Which Option Should You Choose?
It depends on what matters most to you:
"I want voice to just work."
CrabCallr — 3 setup steps, browser + phone, no infrastructure to manage. Get started →
"I want hands-free voice on my Mac."
Talk Mode + Voice Wake — Great for local use with the macOS companion app. No external accounts needed. But doesn't work from a browser or phone.
"I want full control over my telephony."
Voice-Call Plugin or DeepClaw — Bring your own Twilio/Telnyx account and manage everything yourself. DeepClaw offers better ASR (semantic turn detection) but adds another account (Deepgram).
"I want voice messages in WhatsApp/Telegram."
OpenClaw's built-in TTS/STT — No extra tools needed. Send voice notes to your agent and get voice or text replies. Not real-time conversation, but great for async use.
"I want everything fully local and offline."
Talk Mode + local Whisper + Piper TTS — Check out the Jupiter Voice community project for a fully offline setup.
Sources & Further Reading
- OpenClaw GitHub Repository — Source code, README, and feature list
- OpenClaw TTS Documentation — Full TTS configuration reference
- OpenClaw Voice Wake Documentation — Wake word setup and protocol
- OpenClaw Audio & Voice Notes — STT configuration and voice message handling
- DeepClaw by Deepgram — Open-source phone calling for OpenClaw
- Deepgram Blog: Call Your OpenClaw — DeepClaw setup guide
- Deepgram Blog: Voice Is Now First-Class in OpenClaw — Overview of voice ecosystem
- Discussion #10588: Reducing Latency — Latency analysis for real-time voice agents
- Issue #7200: Real-time Voice Conversation — Feature request for native WebRTC
- Issue #8088: Bidirectional Audio — Feature request for real-time voice call support
- OpenClaw on Wikipedia — Project history and background
Ready to try CrabCallr?
Start with free browser calling. No credit card required, no infrastructure to manage.