Best MCP servers for voice and audio

Voice and audio work with an AI agent splits into two jobs: turning text into natural speech, and turning recorded audio into structured text and insight. A capable setup covers both, plus the extras around them (voice cloning, sound effects, speaker diarization, and audio intelligence), so an agent can narrate, transcribe, and analyze without you gluing together separate SDKs. The servers below span text-to-speech and conversational voice on one side, and speech-to-text with audio understanding on the other. Whether you are building a voice agent, captioning recordings, or mining call audio, these are the picks, each a real MCP server with a verified, current install config.

Top pick

ElevenLabs

Official

ElevenLabs' official MCP server: text-to-speech, voice cloning, speech-to-text, sound effects, and conversational AI agents from your editor.

ai-ml

ElevenLabs' official server brings text-to-speech, voice cloning, speech-to-text, sound effects, and conversational AI agents into your editor, covering most of the voice stack from one tool.

ElevenLabs for voice and audio →

Pick 2

AssemblyAI

Official

AssemblyAI's official server lets coding agents search and read its speech-to-text and audio-intelligence documentation on demand.

ai-ml

AssemblyAI's official server lets an agent search and read its speech-to-text and audio-intelligence docs on demand, so it can write transcription and analysis calls from the current documentation instead of guessing.

AssemblyAI for voice and audio →

Pick 3

Replicate

Official

Replicate's official MCP server: discover, compare, and run thousands of hosted AI models — image, video, audio, and language — straight from your agent.

ai-ml

Replicate's official server runs thousands of hosted models, audio and speech models among them, which makes it a flexible fallback when you need a specific TTS, music, or audio model not covered elsewhere.

Replicate for voice and audio →

Pick 4

fal.ai

Raveen Beemsingh

Community

Community MCP server for fal.ai: generate and edit images, video, music, and audio with 600+ fast generative models from your agent.

ai-ml

This community server for fal.ai exposes fast generative music and audio models alongside image and video ones, which is useful when an agent needs to generate sound or music at low latency.

fal.ai for voice and audio →