OpenClaw Voice Options Compared
A comprehensive look at every voice option available to OpenClaw users — built-in features, third-party integrations, and how CrabCallr fits in.
Information valid as of Mon Mar 2, 2026. OpenClaw is evolving rapidly — features may have changed since this was written.
Overview
OpenClaw has a rich ecosystem of voice capabilities. Some ship with the core platform or its official companion apps, while others are third-party integrations built by companies like Deepgram and ElevenLabs. This page documents every option we're aware of, what each one can and can't do, and how CrabCallr compares.
We've organized options into three groups: built-in (ships with OpenClaw or its companion apps), third-party (requires external accounts and setup), and managed (CrabCallr's hosted platform).
Built-in Options
1. Voice-Call Plugin
Built-in · Phone calls · Source on GitHub
The voice-call plugin is OpenClaw's official telephony integration. It runs inside the OpenClaw Gateway and connects to a telephony provider that you manage.
How it works
The plugin starts a webhook server on port 3334. Your telephony provider (Twilio, Telnyx, or Plivo) sends inbound call events to this webhook. The plugin uses the provider's native speech recognition (<Gather> TwiML for Twilio) to capture speech, passes the text to your OpenClaw agent, and plays back the response via TTS.

User speaks → Provider detects silence → Provider's ASR returns text → OpenClaw responds → TTS plays

Capabilities
- Telephony providers: Twilio, Telnyx, or Plivo — you bring your own account and phone number
- Inbound calls: Allowlist-based — only approved numbers can reach your agent
- Outbound calls: One-way notifications and multi-turn conversations
- TTS: OpenAI voices or ElevenLabs (ElevenLabs support was buggy initially but has been improved)
- Mid-call tool execution: Your agent can invoke tools (web search, file operations, etc.) during a call
- Barge-in: Limited — depends on provider-level silence detection; you can't interrupt mid-TTS on all providers
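The per-turn webhook exchange described above can be sketched in a few lines. This is an illustrative stand-in, not the plugin's actual code: the agent_reply helper, the /voice/turn action path, and the handler shape are assumptions; only the <Gather>-based TwiML pattern comes from the flow described above.

```python
# Sketch of one conversational turn on the webhook side: receive the
# provider's ASR transcript, get the agent's reply, and answer with
# TwiML that speaks the reply and re-arms <Gather> for the next turn.
import xml.etree.ElementTree as ET


def agent_reply(text: str) -> str:
    """Stand-in for the round trip to the OpenClaw agent."""
    return f"You said: {text}"


def twiml_for_turn(speech_result: str) -> str:
    """Build a Twilio-style TwiML response for one turn of the call."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = agent_reply(speech_result)
    # Re-open speech capture so the caller's next utterance comes back
    # to the webhook (the action path here is a made-up example).
    ET.SubElement(response, "Gather", input="speech", action="/voice/turn")
    return ET.tostring(response, encoding="unicode")
```

In the real plugin, a web server on port 3334 would return this XML body to the provider on each inbound event.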
Limitations
- User-managed infrastructure: Requires your own telephony provider account, phone number, and a publicly reachable webhook URL (ngrok, Tailscale, or static IP)
- Turn-based ASR: The system waits for you to stop speaking before it processes
- No browser calling: Phone-only — a WebRTC feature request was closed as "not planned"
- Basic VAD: Relies on the provider's silence detection, not sophisticated voice activity detection
- No noise suppression: No built-in audio cleanup
- No custom vocabulary: Cannot boost domain-specific terms for better recognition
- Security: Anyone who knows the phone number can attempt to call unless you carefully maintain the allowlist
Best for: Users who want full control over their telephony stack, already have a Twilio/Telnyx/Plivo account, and are comfortable managing webhooks and public URLs.
2. Talk Mode
Built-in · Companion apps · OpenClaw README
Talk Mode provides continuous voice conversation through OpenClaw's companion apps for macOS, iOS, and Android. The companion app connects to the OpenClaw Gateway via WebSocket.
How it works
- ASR: Runs locally on the device (Apple Speech Recognition on macOS/iOS; platform speech on Android)
- Turn detection: Silence-based — the system waits for you to stop speaking before sending the transcript
- Gateway communication: Transcript is sent as text over WebSocket to the Gateway, which calls the LLM
- TTS: Runs locally on the device (calls ElevenLabs API directly, or uses Edge TTS as a free fallback)
- Barge-in: Yes — if you speak while TTS is playing, playback stops immediately and your new speech is captured
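The silence-based turn detection described above can be illustrated with a toy simulation. The frame granularity and silence threshold are invented for the example; the real implementation lives inside the companion apps.

```python
# Toy model of silence-based turn detection: accumulate recognized text
# and emit a completed "turn" only after a run of silent frames.
SILENCE_FRAMES_TO_END_TURN = 3  # e.g. three consecutive silent frames


def detect_turns(frames):
    """frames: iterable of recognized text per frame, or None for silence."""
    turns, current, silent = [], [], 0
    for frame in frames:
        if frame is None:
            silent += 1
            if silent >= SILENCE_FRAMES_TO_END_TURN and current:
                turns.append(" ".join(current))  # pause long enough: turn ends
                current = []
        else:
            silent = 0
            current.append(frame)
    if current:  # flush a trailing partial turn at end of stream
        turns.append(" ".join(current))
    return turns
```

This is why Talk Mode feels turn-based: nothing is sent to the Gateway until the pause threshold is crossed.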
Capabilities
- Platforms: macOS (menu bar app), iOS (Bridge node), Android (Bridge node)
- Barge-in: Yes — interruption stops TTS immediately
- Push-to-talk: Available as an alternative to always-on listening
- Works with local models: Since ASR/TTS run on-device, the Gateway can use a local LLM
Limitations
- Requires companion app: Must install the macOS menu bar app or pair an iOS/Android device
- Not phone-based: Cannot call from a landline, car Bluetooth, or any phone
- No browser calling: No WebRTC — you need the native app
- Setup complexity for remote use: To use away from home, you need Tailscale, SSH tunnels, or a publicly exposed Gateway
- Silence-based turn detection: No streaming interim results; waits for a full pause before processing
- Platform-specific ASR: Uses Apple Speech Recognition (macOS/iOS) or platform speech (Android) — not available on Windows or Linux
- No noise suppression: No built-in Krisp-style audio cleanup
- No custom vocabulary: Cannot boost domain-specific terms
Best for: Users who want hands-free voice on their Mac or phone, are okay with installing a companion app, and primarily use OpenClaw from home or on their local network.
3. Voice Wake
Built-in · Wake word detection · Official docs
Voice Wake provides always-on wake word detection that triggers Talk Mode. Say a custom wake phrase and your assistant starts listening.
How it works
- Wake words: Configurable global list managed by the Gateway (stored at ~/.openclaw/settings/voicewake.json)
- Detection: Runs locally on each device — no cloud connection for audio detection
- Protocol: voicewake.get, voicewake.set, and voicewake.changed events keep all devices in sync
- Activation: When a wake word is detected, Talk Mode activates automatically
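As a rough illustration of the sync protocol, here is what the wake-word messages might look like. The method names come from the protocol description above; the payload shape (a phrases list under params) is an assumption, not the real wire format.

```python
# Illustrative message builders for the voicewake sync protocol.
import json


def voicewake_set(phrases):
    """Build a voicewake.set message replacing the global wake-word list."""
    return json.dumps({"method": "voicewake.set", "params": {"phrases": phrases}})


def voicewake_changed(phrases):
    """Event the Gateway would broadcast so all devices stay in sync."""
    return json.dumps({"method": "voicewake.changed", "params": {"phrases": phrases}})
```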
Capabilities
- Platforms: macOS, iOS, and Android
- Custom wake words: You can set any phrase as a trigger
- Multi-device sync: Changes to wake words propagate to all connected devices via the Gateway
- Low resource usage: Minimal CPU and battery impact
Limitations
- Requires companion app: Only available through the macOS, iOS, or Android companion apps
- Not available on Windows/Linux: No desktop Voice Wake outside macOS
- Triggers Talk Mode only: Inherits all Talk Mode limitations (silence-based turn detection, etc.)
Best for: Users who want a hands-free "Hey Siri"-style experience with their OpenClaw assistant from their Mac or phone.
4. TTS & STT Configuration
Built-in · Voice messages & replies · TTS docs
OpenClaw has a configurable TTS and STT pipeline for voice messages across all its messaging channels (WhatsApp, Telegram, Slack, Discord, etc.). This is separate from Talk Mode — it controls how your agent speaks and understands voice notes within chat conversations.
TTS providers
- OpenAI: Voices include alloy, echo, fable, onyx, nova, shimmer. Model: gpt-4o-mini-tts
- ElevenLabs: Highest quality. Dozens of pre-made voices plus custom voice cloning. Configurable stability, similarity boost, style, and speed
- Edge TTS: Free baseline using Microsoft's neural voices (e.g., en-US-MichelleNeural). No API key required. Auto-fallback when no other keys are configured
STT providers
Auto-detected in priority order: OpenAI → Groq → Deepgram → Google. Local Whisper CLI is available as a fallback.
- OpenAI Whisper: Default is gpt-4o-mini-transcribe
- Deepgram: Streaming mode with Nova-2 model
- Local Whisper: Run offline via CLI. Options include standard Whisper, faster-whisper (4–6x faster), and Whisper MLX (Apple Silicon optimized)
TTS modes
Controlled via messages.tts.auto in config or the /tts slash command:
- off — No TTS (default)
- inbound — Reply with voice only when the user sends a voice message
- always — Every reply is spoken
- tagged — Only speak when explicitly requested
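The four modes reduce to a small decision function. This is an illustrative sketch, not OpenClaw source — only the mode names come from the docs.

```python
# Decide whether a reply should be rendered with TTS, per the four
# documented messages.tts.auto modes.
def should_speak(mode, user_sent_voice=False, explicitly_requested=False):
    """Return True if the reply should be spoken aloud."""
    if mode == "always":
        return True
    if mode == "inbound":
        return user_sent_voice  # mirror the user's own modality
    if mode == "tagged":
        return explicitly_requested
    return False  # "off" (the default)
```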
Limitations
- Not real-time conversation: This is asynchronous voice messaging, not live phone or WebRTC calls
- Latency: Local Whisper on CPU can take 30–60 seconds per voice note
- No barge-in: You send a voice note, wait for processing, then hear the reply
Best for: Users who want voice messages in their existing chat channels (WhatsApp, Telegram, etc.) without needing a live call.
5. Voice Message Transcription
Built-in · All messaging channels · Audio docs
OpenClaw automatically transcribes incoming voice messages and audio files so your agent can process them as text. This works across all supported messaging channels.
Key features
- Auto-detection: Audio transcription is enabled by default. OpenClaw tries local CLIs first, then provider APIs
- Fallback chain: If the first STT provider fails (size, timeout, auth), the next one is tried automatically
- Group mention detection: When requireMention: true is set, voice notes are transcribed before checking for mentions — so you can mention your agent in a voice message
- Configurable limits: Max character length, timeout settings, and per-provider overrides
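The fallback chain behaves roughly like the sketch below. The provider callables are stand-ins for the real OpenAI, Groq, Deepgram, Google, and local Whisper clients; only the try-in-priority-order behavior comes from the description above.

```python
# Try each configured STT provider in priority order; on failure
# (size limit, timeout, auth error) move to the next one.
def transcribe_with_fallback(audio, providers):
    """providers: ordered list of (name, callable) pairs."""
    errors = {}
    for name, stt in providers:
        try:
            return stt(audio)  # first success wins
        except Exception as exc:
            errors[name] = str(exc)  # record and fall through
    raise RuntimeError(f"all STT providers failed: {errors}")
```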
Limitations
- Not real-time: Processes completed audio files, not streaming speech
- One direction: Transcribes inbound voice → text. Replies are text (unless TTS auto mode is enabled)
Third-Party Integrations
6. DeepClaw by Deepgram
Third-party · Phone calls · GitHub · Blog post
DeepClaw is an open-source integration by Deepgram that lets you call your OpenClaw agent over the phone using Deepgram's Voice Agent API. It's the most mature third-party voice integration for OpenClaw.
How it works
DeepClaw runs a ~400-line Python voice agent server. You install it as an OpenClaw skill, tell your agent "I want to call you on the phone," and it walks you through setting up a Deepgram account and Twilio number.
Capabilities
- STT: Deepgram Flux with semantic turn detection — understands when you're done talking instead of waiting for silence
- TTS: Deepgram Aura-2 at ~90ms time-to-first-byte
- Guided setup: Your OpenClaw agent walks you through configuration
- Open source: Fully open, ~400 lines of Python
Limitations
- Requires two external accounts: Deepgram and Twilio
- Phone only: No browser calling or WebRTC
- Latency: Deepgram's STT/TTS adds ~200–300ms, but end-to-end latency per turn is reported at 2.2–3.4 seconds due to OpenClaw Gateway processing time
- Requires running server: The Python voice agent must stay running to accept calls
- No noise suppression: No built-in Krisp-style audio cleanup
Best for: Users who want higher-quality phone calling than the built-in Voice-Call plugin and prefer Deepgram's semantic turn detection.
7. ElevenLabs Conversational AI
Third-party · Phone & widget · ElevenLabs platform
ElevenLabs Conversational AI can use OpenClaw as a "custom LLM" backend. ElevenLabs handles the full voice pipeline (STT, TTS, turn management) while routing the text through your OpenClaw instance's /v1/chat/completions endpoint.
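The text round trip uses the standard OpenAI-style chat-completions schema. Here is a sketch of the request body an ElevenLabs-style caller would POST to your endpoint; the model name is a placeholder, and the schema shown is the generic /v1/chat/completions shape rather than anything OpenClaw-specific.

```python
# Build an OpenAI-compatible chat-completions request body of the kind
# a "custom LLM" integration would POST to your OpenClaw endpoint.
import json


def build_chat_request(user_text, history=()):
    """history: optional prior messages as {"role": ..., "content": ...} dicts."""
    messages = list(history) + [{"role": "user", "content": user_text}]
    return json.dumps({"model": "openclaw", "messages": messages, "stream": True})
```

The response streams back as text, which ElevenLabs then renders with its own TTS.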
Capabilities
- TTS quality: Industry-leading ElevenLabs voices with low latency
- Phone calling: Available via ElevenLabs + Twilio integration
- Embeddable widget: Web-based calling via ElevenLabs' widget
Limitations
- Requires up to three accounts: ElevenLabs, Twilio, and potentially ngrok
- Requires public URL: Your OpenClaw's /v1/chat/completions endpoint must be publicly reachable
- Complex setup: Seven steps — see our "Why Not DIY" section on the homepage
- Runs on ElevenLabs' platform: Your text goes through their servers, not yours
- No OpenClaw tool execution mid-call: ElevenLabs agents have their own tool system, separate from OpenClaw's skills
Best for: Users who prioritize ElevenLabs' voice quality and are comfortable with the DIY setup.
8. Other Third-Party Integrations
The OpenClaw ecosystem is growing rapidly. A few more integrations have emerged or been requested:
- LiveKit + LemonSlice: A community project combining LiveKit for WebRTC, Deepgram for STT, ElevenLabs for TTS, and LemonSlice for a lip-synced avatar. Demonstrates that OpenClaw can power a real-time avatar, but requires significant DIY assembly.
- Cartesia & Inworld: Community skills that give OpenClaw access to TTS, voice cloning, and audio transcription via these providers.
- Jupiter Voice: A community project for fully local, offline voice using local Whisper and Piper TTS. No cloud dependency at all.
- Real-time voice conversation (Feature Request #7200): An active request for native real-time bidirectional voice with WebRTC, LiveKit Agents, Pipecat bridge, and OpenAI Realtime API support. Not yet implemented.
- OpenClaw Voice (Purple-Horizons): A browser-based voice interface that uses WebSocket streaming, local Whisper for STT, and ElevenLabs for TTS. Self-hosted (Python), requires ElevenLabs and OpenAI API keys, and uses Silero VAD for basic noise filtering. No barge-in, no custom vocabulary, and no WebRTC (WebSocket only). Early-stage project (~60 stars).
Managed Platform
9. CrabCallr
Managed service · Browser + phone · Getting started
CrabCallr is a managed voice platform purpose-built for OpenClaw. The open-source plugin connects outbound via WebSocket to CrabCallr's cloud infrastructure — no open ports, no webhooks, no tunnels.
What's included
- Browser calling: WebRTC via LiveKit — works in any modern browser, no app install
- Phone calling: Managed Twilio integration with caller ID routing (Basic plan)
- Natural voices: 12+ curated ElevenLabs voices with low-latency streaming TTS
- Barge-in: Interrupt anytime — the assistant stops immediately and listens
- Noise suppression: Krisp-powered audio cleanup for clear calls even in noisy environments
- Custom vocabulary: Add domain-specific keyterms to boost speech recognition accuracy
- Session isolation: Per-channel dmScope keeps voice conversations private from other channels
- Built-in authentication: API key — no open ports, no allowlist management
Setup
- Create a CrabCallr account and get your API key
- Install: openclaw plugins install @wooters/crabcallr
- Add your API key to ~/.openclaw/openclaw.json and start talking
Best for: Users who want voice to just work — browser and phone calling with zero infrastructure management. Free tier includes browser calling.
Full Comparison Table
Every OpenClaw voice option, side by side.
| Capability | CrabCallr | Voice-Call Plugin | Talk Mode | DeepClaw (Deepgram) | ElevenLabs Agents |
|---|---|---|---|---|---|
| Browser WebRTC calling | ✓ | ✗ | ✗ | ✗ | Widget only |
| Phone calling | ✓ | User manages | ✗ | User manages | User manages |
| Works without app install | ✓ | ✗ | ✗ | ✗ | Widget only |
| Barge-in support | ✓ | Limited | ✓ | ✓ | ✓ |
| Noise suppression | ✓ | ✗ | ✗ | ✗ | ✗ |
| Custom vocabulary | ✓ | ✗ | ✗ | ✗ | ✗ |
| No open ports / tunnels | ✓ | ✗ | Local only | ✗ | ✗ |
| External accounts needed | 0 | 1–2 | 0–1 | 2 | 2–3 |
| Setup steps | 3 | 5+ | 3–4 | 4+ | 7 |
| Mid-call tool execution | ✓ | ✓ | ✓ | ✓ | ✗ |
| Works on Windows/Linux | ✓ | ✓ | ✗ | ✓ | ✓ |
| Open source | Plugin is OSS | ✓ | ✓ | ✓ | ✗ |
Which Option Should You Choose?
It depends on what matters most to you:
"I want voice to just work."
CrabCallr — 3 setup steps, browser + phone, no infrastructure to manage. Get started →
"I want hands-free voice on my Mac."
Talk Mode + Voice Wake — Great for local use with the macOS companion app. No external accounts needed. But doesn't work from a browser or phone.
"I want full control over my telephony."
Voice-Call Plugin or DeepClaw — Bring your own Twilio/Telnyx account and manage everything yourself. DeepClaw offers better ASR (semantic turn detection) but adds another account (Deepgram).
"I want voice messages in WhatsApp/Telegram."
OpenClaw's built-in TTS/STT — No extra tools needed. Send voice notes to your agent and get voice or text replies. Not real-time conversation, but great for async use.
"I want everything fully local and offline."
Talk Mode + local Whisper + Piper TTS — Check out the Jupiter Voice community project for a fully offline setup.
Sources & Further Reading
- OpenClaw GitHub Repository — Source code, README, and feature list
- OpenClaw TTS Documentation — Full TTS configuration reference
- OpenClaw Voice Wake Documentation — Wake word setup and protocol
- OpenClaw Audio & Voice Notes — STT configuration and voice message handling
- DeepClaw by Deepgram — Open-source phone calling for OpenClaw
- Deepgram Blog: Call Your OpenClaw — DeepClaw setup guide
- Deepgram Blog: Voice Is Now First-Class in OpenClaw — Overview of voice ecosystem
- Discussion #10588: Reducing Latency — Latency analysis for real-time voice agents
- Issue #7200: Real-time Voice Conversation — Feature request for native WebRTC
- Issue #8088: Bidirectional Audio — Feature request for real-time voice call support
- OpenClaw on Wikipedia — Project history and background
Ready to try CrabCallr?
Start with free browser calling. No credit card required, no infrastructure to manage.