How We Built a Voice Agent on Chatzy.ai in Just One Week

Sourabh Kumar

10 November 2025 · 5 min read

We’ve all been in situations where a client requests a feature, and if it’s a high-ticket client, you just have to build it.

But this time, things were different.

At Chatzy.ai, we already had a well-built conversational AI agent used by several clients that handled millions of chat flows every month. The platform was stable, scalable, and performing well for conversational use cases.

Our conversational agent was already state-of-the-art — capable of interacting with users in real time, answering queries intelligently, and using RAG (Retrieval-Augmented Generation) to provide precise answers from PDFs, websites, or structured data.

It was designed to replicate a customer support experience while sounding convincingly human.

Then came a turning point.
One of our clients asked, “Can you build a voice agent for us?”

At first, it sounded doable. But here was the catch:
They needed it within one week, or they’d move to another provider.

Typically, such a product would take five weeks — one month to build and another week to test. But in B2B, losing a key client isn’t an option. So, we decided to go all in.

We took on the challenge of building a fully functional voice AI agent in one week. We leveraged our existing infrastructure, reused components from our conversational stack, and moved fast. Thankfully, some of our team members had prior experience building voice agents, which was a big time saver.

Defining the MVP

The first step was clarity: what exactly would the Minimum Viable Product (MVP) include?

Our product manager outlined the essentials: we would focus on what was critical to get a first version running rather than building the full feature set.

Before diving in, the team studied other products for inspiration and to avoid known pitfalls.

Understanding How Voice Agents Differ

Both voice and conversational agents rely on LLMs (Large Language Models) to generate responses. Both draw from structured or unstructured data sources like PDFs or websites.

The main difference lies in architecture and cost.

A text-based agent follows this structure:

Data → Embedding → RAG → Response

A voice-based agent, on the other hand, adds two critical layers:

ASR (Automatic Speech Recognition) → Data → Embedding → RAG → Response → TTS (Text-to-Speech)

These extra layers make voice systems far more complex — and up to 10x costlier due to continuous streaming and audio processing.
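To make the difference concrete, here is a minimal sketch of a voice pipeline that wraps an existing text path. It is not our production code: the class name, the four stage callables, and the stub implementations are all assumptions chosen for illustration, and the stubs stand in for real ASR, retrieval, LLM, and TTS providers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoicePipeline:
    """Wraps a text agent with speech-to-text in front and text-to-speech behind."""
    transcribe: Callable[[bytes], str]          # ASR: audio in, text out
    retrieve: Callable[[str], List[str]]        # RAG lookup: query in, passages out
    generate: Callable[[str, List[str]], str]   # LLM: query + passages in, reply out
    synthesize: Callable[[str], bytes]          # TTS: text in, audio out

    def answer_text(self, query: str) -> str:
        """The text-agent path: retrieval followed by generation."""
        return self.generate(query, self.retrieve(query))

    def answer_audio(self, audio: bytes) -> bytes:
        """The voice-agent path: the same text path with ASR and TTS around it."""
        return self.synthesize(self.answer_text(self.transcribe(audio)))


# Stub stages (hypothetical) so the sketch runs without any speech or LLM provider.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    retrieve=lambda query: ["Support is open 9am to 5pm."],
    generate=lambda query, passages: f"{passages[0]} (asked: {query})",
    synthesize=lambda text: text.encode("utf-8"),
)
```

The point of the shape is that `answer_audio` reuses `answer_text` unchanged, which is roughly why an existing conversational stack is such a head start: only the two outer stages are new.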

Still, the user experience advantage made it worth it.

Building the First Version

The first Product Requirements Document (PRD) took around 1–2 hours to draft. It covered:

The integration flow
The core logic
The end-to-end deployment plan
The expected user behavior and interaction sequence

The UI and Flow

1. Home Section

In Chatzy.ai, users can create agents — either conversational or voice — from scratch or using templates.
The first step is choosing between the two.

2. From Template

Templates enable quick setup. Users just replace datasets — everything else is preconfigured.
This reduces setup time from minutes to seconds.
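One way to picture "replace the datasets, keep everything else" is as a copy of a frozen config with a single field swapped. The sketch below is an assumption about how such a template could be modelled, not a description of Chatzy.ai internals; `AgentConfig`, `from_template`, the model name, and the file names are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AgentConfig:
    kind: str                   # "conversational" or "voice"
    base_prompt: str
    model: str
    datasets: Tuple[str, ...]   # e.g. PDF paths or website URLs


def from_template(template: AgentConfig, datasets: Tuple[str, ...]) -> AgentConfig:
    """Copy a template, swapping in the user's datasets and nothing else."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    return replace(template, datasets=tuple(datasets))


support_template = AgentConfig(
    kind="voice",
    base_prompt="You are a friendly support agent.",
    model="example-llm",            # placeholder model name
    datasets=("sample-faq.pdf",),   # placeholder dataset
)
my_agent = from_template(support_template, ("acme-handbook.pdf",))
```

Because the config is frozen, the template itself is never modified, so many agents can be created from the same starting point.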

3. From Scratch

Building from scratch gives complete freedom to define datasets, prompts, and workflows.
It’s ideal for enterprise users with specialized use cases.

4. Settings and Voice Configuration

This is where voice agents differ from text-based ones. It includes:

Base Prompt: Defines the first message and tone
Model Selection: Specifies the LLM powering the agent
Voice Settings: Language, provider (ElevenLabs, Azure, etc.), and tone configuration
Speed Settings: Adjusts speaking rate (default 1x, up to 1.5x)
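The settings above map naturally onto a small validated config object. This is a sketch under assumptions: the field names and the `VoiceSettings` class are made up for illustration, the default tone value is a placeholder, and the only limits taken from the list are the 1x default and the 1.5x ceiling (the lower bound of "anything above zero" is our guess, since no minimum is stated).

```python
from dataclasses import dataclass

DEFAULT_SPEED = 1.0   # default speaking rate from the list above
MAX_SPEED = 1.5       # upper bound from the list above


@dataclass(frozen=True)
class VoiceSettings:
    base_prompt: str            # first message and tone
    model: str                  # LLM powering the agent
    language: str
    provider: str               # e.g. "ElevenLabs" or "Azure"
    tone: str = "neutral"       # placeholder default
    speed: float = DEFAULT_SPEED

    def __post_init__(self) -> None:
        # Reject rates outside the supported range before they reach the TTS layer.
        if not 0 < self.speed <= MAX_SPEED:
            raise ValueError(f"speed must be above 0 and at most {MAX_SPEED}x")
```

Validating at construction time means a bad speed value fails when the agent is configured, not in the middle of a live call.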

Managing Cost and Efficiency

Voice processing can quickly drive up costs.
To keep things predictable, we added an auto-disconnect feature that terminates idle user sessions to prevent unnecessary usage.

This small detail saved significant compute costs at scale.
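A simple way to implement this kind of auto-disconnect is to record the last activity time per session and periodically sweep out anything that has gone quiet. The sketch below assumes a 60-second timeout in the usage example and a `SessionRegistry` class of our own naming; neither is the actual Chatzy.ai value or API. Time is passed in explicitly so the logic is easy to test; a real service would likely feed it `time.monotonic()`.

```python
from typing import Dict, List


class SessionRegistry:
    """Tracks the last activity time of each session and drops idle ones."""

    def __init__(self, idle_timeout_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._last_seen: Dict[str, float] = {}

    def touch(self, session_id: str, now: float) -> None:
        """Record activity (speech, a click, a keep-alive) for a session."""
        self._last_seen[session_id] = now

    def sweep(self, now: float) -> List[str]:
        """Remove and return every session idle for longer than the timeout."""
        expired = [sid for sid, seen in self._last_seen.items()
                   if now - seen > self.idle_timeout_s]
        for sid in expired:
            # A real system would also close the audio stream here.
            del self._last_seen[sid]
        return expired

    def active(self) -> List[str]:
        """Return the ids of sessions that are still tracked."""
        return sorted(self._last_seen)


registry = SessionRegistry(idle_timeout_s=60.0)  # hypothetical timeout
registry.touch("a", now=0.0)
registry.touch("b", now=50.0)
expired = registry.sweep(now=70.0)  # "a" has been idle for 70s, "b" for 20s
```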

Call Handling and Controls

Key MVP features included:

Telephone Provider Integration for call routing
Voicemail Detection for automatic hangups
Silence-based Call Hangup (after 10s of inactivity)
Max Call Duration limit (default 5 minutes)
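The hangup rules in this list can be expressed as one small decision function. This is an illustrative sketch, not our call-handling code: the function name, the reason strings, and the order in which the rules are checked are assumptions, while the 10-second silence window and 5-minute cap come from the list above.

```python
from typing import Optional

SILENCE_TIMEOUT_S = 10.0         # silence-based hangup, from the list above
MAX_CALL_DURATION_S = 5 * 60.0   # default max call duration, from the list above


def hangup_reason(now: float, call_started: float, last_speech: float,
                  voicemail_detected: bool = False) -> Optional[str]:
    """Return why the call should end, or None if it should continue.

    All times are in seconds on the same clock.
    """
    if voicemail_detected:
        return "voicemail"
    if now - call_started >= MAX_CALL_DURATION_S:
        return "max_duration"
    if now - last_speech >= SILENCE_TIMEOUT_S:
        return "silence"
    return None
```

Keeping this as a pure function of timestamps makes each rule easy to unit test without placing a real call.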

Playground and Testing

The Playground was a critical component that made iteration fast.
Users could:

Launch live voice tests
Evaluate tone and clarity
Cross-check text interactions with the same LLM

This environment made debugging and tuning the experience much faster.

How It Is Billed

Voice agent billing involves multiple cost components:

Transcriber (ASR): Converts speech to text
LLM Component: Generates the actual response
TTS (Text-to-Speech): Synthesizes a human-like reply
Platform Fee: Covers orchestration and hosting
Carrier Fee: Comes from the telecom connection provider

Depending on models selected, costs range roughly between $0.001 and $0.005 per interaction.
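As a rough illustration of how the components add up, the sketch below sums the five line items for a single interaction. The per-component amounts are placeholders invented for the example, not Chatzy.ai's actual rates, and `CostBreakdown` is a hypothetical name. Amounts are stored as integer micro-dollars (millionths of a dollar) to avoid floating-point rounding on such small values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostBreakdown:
    """Per-interaction cost components, in micro-dollars (1 USD = 1_000_000)."""
    transcriber: int    # ASR
    llm: int
    tts: int
    platform_fee: int
    carrier_fee: int

    def total_micro_usd(self) -> int:
        return (self.transcriber + self.llm + self.tts
                + self.platform_fee + self.carrier_fee)

    def total_usd(self) -> str:
        """Format the total as a dollar string with four decimal places."""
        return f"${self.total_micro_usd() / 1_000_000:.4f}"


# Placeholder amounts for illustration only.
example = CostBreakdown(transcriber=500, llm=1000, tts=800,
                        platform_fee=400, carrier_fee=300)
```

With these placeholder values the total is 3000 micro-dollars, so `example.total_usd()` returns `"$0.0030"`. Swapping in a cheaper or pricier model changes only the relevant field.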

Final Thoughts

After multiple late nights and quick iterations, we pulled it off.
The Chatzy.ai team built and tested a complete, production-ready voice agent — in just one week.

You can try the voice agent now on Chatzy.ai.
New users get 14 days of free credits to explore it.

Try it out, share your thoughts, and tell us how it performs for your use case.
We’re constantly evolving and love hearing from our community.
