Best MCP servers for voice and audio
Voice and audio work with an AI agent splits into two jobs: turning text into natural speech, and turning recorded audio into structured text and insight. A capable setup covers both, plus the extras around them (voice cloning, sound effects, speaker diarization, and audio intelligence), so an agent can narrate, transcribe, and analyze without you gluing together separate SDKs. The servers below span text-to-speech and conversational voice on one side, and speech-to-text with audio understanding on the other. Whether you are building a voice agent, captioning recordings, or mining call audio, these are the picks. Each is a real MCP server with a verified, current install config.
ElevenLabs
ElevenLabs
ElevenLabs' official MCP server: text-to-speech, voice cloning, speech-to-text, sound effects, and conversational AI agents from your editor.
ElevenLabs' official server brings text-to-speech, voice cloning, speech-to-text, sound effects, and conversational AI agents into your editor, covering most of the voice stack from one tool.
AssemblyAI
AssemblyAI
AssemblyAI's official server lets coding agents search and read its speech-to-text and audio-intelligence documentation on demand.
AssemblyAI's official server lets an agent search and read its speech-to-text and audio-intelligence docs on demand, so it builds correct transcription and analysis calls without guesswork.
Replicate
Replicate
Replicate's official MCP server: discover, compare, and run thousands of hosted AI models — image, video, audio, and language — straight from your agent.
Replicate's official server runs thousands of hosted models, including audio and speech models, which makes it a flexible fallback when you need a specific TTS, music, or audio model that the other servers do not cover.
fal.ai
Raveen Beemsingh
Community MCP server for fal.ai: generate and edit images, video, music, and audio with 600+ fast generative models from your agent.
This community server for fal.ai includes fast generative music and audio models alongside image and video ones, which is useful when an agent needs to generate sound or music at low latency.
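For readers who have not seen an install config before, here is a minimal sketch of the general shape many MCP clients expect: a JSON object with an `mcpServers` key that maps a server name to the command used to launch it. The server name `example-voice-server`, the package `example-voice-mcp`, and the variable `EXAMPLE_API_KEY` are hypothetical placeholders, not the real values for any server listed above. Take the actual command, arguments, and environment variables from each server's own documentation.

```python
import json


def build_mcp_entry(command, args, env=None):
    """Build one server entry: the launch command, its arguments, and optional env vars."""
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    return entry


def build_client_config(servers):
    """Wrap named server entries under the top-level "mcpServers" key."""
    return {"mcpServers": dict(servers)}


# Hypothetical placeholder values, for illustration only.
config = build_client_config({
    "example-voice-server": build_mcp_entry(
        command="npx",
        args=["-y", "example-voice-mcp"],
        env={"EXAMPLE_API_KEY": "<your-api-key>"},
    ),
})

# Print the JSON that would go into the client's MCP config file.
print(json.dumps(config, indent=2))
```

Keeping API keys in the `env` block rather than in the arguments is a common pattern, but where the file lives and which keys it accepts depend on the client you use.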