How We Built a Voice Agent on Chatzy.ai in Just One Week
We’ve all been in situations where a client requests a feature, and if it’s a high-ticket client, you just have to build it.
But this time, things were different.
At Chatzy.ai, we already had a conversational AI agent in production, used by several clients and handling millions of chat flows every month. The platform was stable and scaled well for text-based use cases.
The agent interacted with users in real time and used RAG (Retrieval-Augmented Generation) to ground its answers in PDFs, websites, or structured data.
It was designed to replicate a customer support experience while sounding convincingly human.
Then came a turning point.
One of our clients asked, “Can you build a voice agent for us?”
At first, it sounded doable. But here was the catch:
They needed it within one week, or they’d move to another provider.
Typically, a product like this would take about five weeks: roughly a month to build and another week to test. But in B2B, losing a key client isn’t an option, so we decided to go all in.
We set out to build a fully functional voice AI agent in one week. To move fast, we leveraged our existing infrastructure and reused components from our conversational stack. Some of our team members had also built voice agents before, which saved a lot of time.
Defining the MVP
The first step was clarity: what exactly would the Minimum Viable Product (MVP) include?
Our product manager outlined the essentials, and we decided to focus on what was critical to get a first version running rather than on a full feature set.
Before diving in, the team studied other products for inspiration and to avoid known pitfalls.
Understanding How Voice Agents Differ
Both voice agents and text-based conversational agents rely on LLMs (Large Language Models) to generate responses. Both draw from structured or unstructured data sources like PDFs or websites.
The main difference lies in architecture and cost.
A text-based agent follows this structure:
Data → RAG → Embedding → Response
A voice-based agent, on the other hand, adds two critical layers:
ASR (Automatic Speech Recognition) → Data → RAG → Embedding → Response → TTS (Text-to-Speech)
These extra layers make voice systems far more complex — and up to 10x costlier due to continuous streaming and audio processing.
Still, the user experience advantage made it worth it.
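To make the layering concrete, here is a minimal sketch of a single voice turn, assuming the existing text agent can be called as a plain function. It is an illustration only, not Chatzy.ai's actual code: the `VoicePipeline` class and the three stand-in stage functions are hypothetical, and a real system would stream audio rather than pass whole byte strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Wraps a text agent with an ASR stage in front and a TTS stage behind."""
    transcribe: Callable[[bytes], str]   # ASR: caller audio -> text
    answer: Callable[[str], str]         # existing text agent (RAG + LLM)
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio

    def handle_turn(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        text_out = self.answer(text_in)
        return self.synthesize(text_out)


# Stand-in stages so the sketch runs without any speech provider.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_text_agent(question: str) -> str:
    return f"You asked: {question}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_asr, fake_text_agent, fake_tts)
reply_audio = pipeline.handle_turn(b"what are your opening hours?")
```

The text agent in the middle is untouched, which is why reusing the conversational stack is possible at all. The two outer stages run on every turn, and that is where the extra complexity and cost come from.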
Building the First Version
The first Product Requirements Document (PRD) took around 1–2 hours to draft. It covered:
- The integration flow
- The core logic
- The end-to-end deployment plan
- The expected user behavior and interaction sequence
The UI and Flow
1. Home Section
In Chatzy.ai, users can create agents — either conversational or voice — from scratch or using templates.
The first step is choosing between the two.
2. From Template
Templates enable quick setup. Users just replace datasets — everything else is preconfigured.
This reduces setup time from minutes to seconds.
3. From Scratch
Building from scratch gives complete freedom to define datasets, prompts, and workflows.
It’s ideal for enterprise users with specialized use cases.
4. Settings and Voice Configuration
This is where voice agents differ from text-based ones. It includes:
- Base Prompt: Defines the first message and tone
- Model Selection: Specifies the LLM powering the agent
- Voice Settings: Language, provider (ElevenLabs, Azure, etc.), and tone configuration
- Speed Settings: Adjusts speaking rate (default 1x, up to 1.5x)
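As a rough illustration, the settings above could be modeled as one validated configuration object. This is a sketch under assumptions, not the platform's real schema: the `VoiceAgentSettings` class, its field names, and the placeholder model name are hypothetical. Only the 1x default and the 1.5x ceiling come from the list above, and since no lower bound is stated, the sketch only rejects non-positive speeds.

```python
from dataclasses import dataclass

DEFAULT_SPEED = 1.0  # default speaking rate (1x)
MAX_SPEED = 1.5      # upper limit on speaking rate (1.5x)


@dataclass(frozen=True)
class VoiceAgentSettings:
    base_prompt: str      # first message and tone
    model: str            # LLM powering the agent
    language: str
    voice_provider: str   # e.g. "elevenlabs" or "azure"
    speed: float = DEFAULT_SPEED

    def __post_init__(self) -> None:
        # Reject speeds outside the supported range before they reach TTS.
        if not 0 < self.speed <= MAX_SPEED:
            raise ValueError(
                f"speed must be greater than 0 and at most {MAX_SPEED}, got {self.speed}"
            )


settings = VoiceAgentSettings(
    base_prompt="Hi, thanks for calling. How can I help?",
    model="example-llm",  # placeholder, not a real model name
    language="en",
    voice_provider="elevenlabs",
)
```

Validating at construction time means a bad value fails when the agent is saved, not in the middle of a live call.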
Managing Cost and Efficiency
Voice processing can quickly drive up costs.
To keep things predictable, we added an auto-disconnect feature that terminates idle user sessions to prevent unnecessary usage.
This small detail saved significant compute costs at scale.
Call Handling and Controls
Key MVP features included:
- Telephone Provider Integration for call routing
- Voicemail Detection for automatic hangups
- Silence-based Call Hangup (after 10s of inactivity)
- Max Call Duration limit (default 5 minutes)
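The hangup rules can be expressed as one small decision function that the call loop checks periodically. The sketch below assumes the telephony layer already reports elapsed time, current silence duration, and a voicemail flag. The names `CallLimits` and `hangup_reason` are hypothetical, and the order in which the rules are checked is our assumption; only the 10-second and 5-minute defaults come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallLimits:
    silence_timeout_s: float = 10.0   # hang up after 10s of inactivity
    max_duration_s: float = 300.0     # default cap of 5 minutes


def hangup_reason(
    elapsed_s: float,
    silent_for_s: float,
    voicemail_detected: bool,
    limits: CallLimits = CallLimits(),
) -> Optional[str]:
    """Return why the call should end, or None if it should continue."""
    if voicemail_detected:
        return "voicemail"
    if elapsed_s >= limits.max_duration_s:
        return "max_duration"
    if silent_for_s >= limits.silence_timeout_s:
        return "silence"
    return None


# A call 30s in, with the caller silent for 12s, gets disconnected.
reason = hangup_reason(elapsed_s=30.0, silent_for_s=12.0, voicemail_detected=False)
```

Returning a reason instead of a boolean makes it easy to log why each call ended, which helps when tuning the thresholds later.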
Playground and Testing
The Playground was a critical component that made iteration fast.
Users could:
- Launch live voice tests
- Evaluate tone and clarity
- Cross-check text interactions with the same LLM
This environment made debugging and tuning the experience much faster.
How It Is Billed
Voice agent billing involves multiple cost components:
- Transcriber (ASR): Converts speech to text
- LLM Component: Generates the actual response
- TTS (Text-to-Speech): Synthesizes a human-like reply
- Platform Fee: Covers orchestration and hosting
- Carrier Fee: Comes from the telecom connection provider
Depending on the models selected, costs range roughly from $0.001 to $0.005 per interaction.
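To show how these components add up, here is a small sketch that sums them for one interaction. The `InteractionCost` class is hypothetical, and every per-component amount below is a made-up placeholder chosen only to land inside the range quoted above. None of them are Chatzy.ai's actual rates.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InteractionCost:
    """Per-interaction cost in USD, split by billing component."""
    transcriber: Decimal    # ASR
    llm: Decimal            # response generation
    tts: Decimal            # speech synthesis
    platform_fee: Decimal   # orchestration and hosting
    carrier_fee: Decimal    # telecom connection provider

    def total(self) -> Decimal:
        parts = (self.transcriber, self.llm, self.tts,
                 self.platform_fee, self.carrier_fee)
        return sum(parts, Decimal("0"))


# Placeholder amounts for illustration only, not real pricing.
example = InteractionCost(
    transcriber=Decimal("0.0005"),
    llm=Decimal("0.0010"),
    tts=Decimal("0.0008"),
    platform_fee=Decimal("0.0003"),
    carrier_fee=Decimal("0.0004"),
)
total_cost = example.total()
```

`Decimal` is used instead of `float` so that sub-cent amounts add up exactly when they are aggregated over many interactions.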
Final Thoughts
After multiple late nights and quick iterations, we pulled it off.
The Chatzy.ai team built and tested a complete, production-ready voice agent — in just one week.
You can try the voice agent now on Chatzy.ai.
New users get 14 days of free credits to explore it.
Try it out, share your thoughts, and tell us how it performs for your use case.
We’re constantly evolving and love hearing from our community.
