Build Your First Voice Agent
This guide explains how to create a Voice AI Agent on the Flametree platform. By the end, you will have a working agent that can handle calls, understand speech, and reply with a natural-sounding voice.
Your Voice AI agent will:
- Make and receive phone calls via SIP
- Follow a defined conversation flow (workflow)
- Collect key details like the user’s name, intent, and callback time
Prerequisites
You need a configured SIP integration before creating a Voice AI agent. If you haven’t set it up yet, follow the SIP integration guide. This integration allows your agent to make and receive calls through the SIP protocol.
Configuration Overview
To create and configure a Voice AI Agent, you will complete these five steps:
- Create a Voice Agent – start a new agent in the dashboard.
- Configure Core Agent Settings, Workflow, and Models – define identity, task, speech style, and conversation flow, and choose LLM, TTS, and STT engines.
- Connect SIP Integration – link the agent to a voice channel.
- Set Max Open Sessions – limit the number of simultaneous outbound calls.
- Configure Environment Variables – fine-tune timing, recognition, and session behavior.
Step 1: Create a Voice Agent
- Go to the AI Agents section in your Flametree dashboard.
- Click Create new agent.
- Select the Voice Agent type:
  - Inbound Voice Agent – answers incoming calls
  - Outbound Voice Agent – makes calls and can also receive them
Tip: Outbound agents are useful for campaigns, surveys, and callback workflows. They can also handle inbound calls automatically.
Step 2: Configure Core Agent Settings, Workflow, and Models
Set up how your Voice Agent looks, speaks, and thinks. In this step, you’ll define its identity, conversation flow, and AI models.
Main parameters
- Identity – the agent’s name or role
- Speech Style and Language – how the agent talks
- Task – the agent’s main purpose or goal
Workflow (conversation logic)
The Workflow defines your voice agent’s conversation steps. Update the description section to outline how the dialogue should progress. For details, see Workflow configuration.
Models
In the Models section, choose which models your agent will use to understand and respond during calls.
| Category | Description | Example |
|---|---|---|
| LLM | Main language model that drives reasoning and response generation | gpt-4.1, qwen2.5-instruct, qwen3-instruct |
| Text-to-Speech (TTS) | Converts text into natural-sounding voice | Female Azure TTS |
| Speech-to-Text (STT) | Converts caller speech into text | Flametree Whisper |
⚡ Note: The Voice Agent supports only instruct models (for example, gpt-4.1, qwen2.5-instruct, qwen3-instruct). The LLM speed affects response delay, so choose the fastest model that meets your quality needs.
Step 3: Connect SIP Integration
Your Voice Agent uses SIP to make or receive calls. To connect an existing integration:
- Go to the Communication Channels section.
- Click the + (plus) button.
- Select SIP.
- Select the SIP integration you created earlier.
- Click Save.
Step 4: Set Max Open Sessions
In Advanced Settings in the right panel, set Max Opened Sessions. This value limits the number of simultaneous outbound calls.
Step 5: Configure Environment Variables
Use environment variables to fine-tune how your Voice Agent behaves during calls. These parameters control timing, recognition quality, session behavior, and call flow.
Core Parameters
These variables define how the Voice Agent operates and interacts with callers.
| Variable | Description | Recommended value | Required / Recommended |
|---|---|---|---|
| CODECS_PRIORITY | List or dictionary defining the preferred audio codecs for SIP calls. Leave {} for automatic negotiation. | {} | Required |
| SESSION_TIMEOUT_SEC | Time (in seconds) before a session closes after the last message. Should not be shorter than the longest speech segment in the dialogue. | 120 | Required |
| WHISPER_LANGUAGE | Language code for Whisper STT (set it if you know the expected caller language, for better recognition). | en | Recommended |
| USER_SALIENCE_TIMEOUT_MS | Time (in milliseconds) before the session closes after the last human message. Should not be shorter than the longest AI speech in the dialogue. | 100000 | Recommended |
| INTERRUPTIONS_ARE_ALLOWED | Allow users to interrupt AI speech. | False | Recommended |
| START_PHRASE | Agent’s opening phrase. Keeps greetings consistent. | "Hello, this is Anna. How can I assist you today?" | Recommended |
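As a rough illustration, the two timeout constraints from the table can be checked before you enter the values in the dashboard. This is a minimal sketch, not part of Flametree: the `check_core_settings` helper, the `longest_speech_sec` argument, and the choice to store every value as a string are assumptions made for the example.

```python
# Recommended values copied from the table above. Storing them as strings
# is an assumption; how the platform parses them is not documented here.
CORE_SETTINGS = {
    "CODECS_PRIORITY": "{}",
    "SESSION_TIMEOUT_SEC": "120",
    "WHISPER_LANGUAGE": "en",
    "USER_SALIENCE_TIMEOUT_MS": "100000",
    "INTERRUPTIONS_ARE_ALLOWED": "False",
    "START_PHRASE": "Hello, this is Anna. How can I assist you today?",
}


def check_core_settings(settings, longest_speech_sec):
    """Hypothetical helper: return a list of problems (empty if none).

    longest_speech_sec is your own estimate of the longest single
    speech segment in the dialogue, in seconds.
    """
    problems = []

    # SESSION_TIMEOUT_SEC should not be shorter than the longest speech segment.
    timeout_sec = int(settings["SESSION_TIMEOUT_SEC"])
    if timeout_sec < longest_speech_sec:
        problems.append(
            f"SESSION_TIMEOUT_SEC={timeout_sec} is shorter than "
            f"the longest speech segment ({longest_speech_sec} s)"
        )

    # USER_SALIENCE_TIMEOUT_MS is in milliseconds, so convert before comparing.
    salience_ms = int(settings["USER_SALIENCE_TIMEOUT_MS"])
    if salience_ms < longest_speech_sec * 1000:
        problems.append(
            f"USER_SALIENCE_TIMEOUT_MS={salience_ms} is shorter than "
            f"the longest speech segment ({longest_speech_sec * 1000} ms)"
        )

    return problems


if __name__ == "__main__":
    for problem in check_core_settings(CORE_SETTINGS, longest_speech_sec=30):
        print(problem)
```

With the recommended values and a longest speech segment of 30 seconds, the helper reports no problems; raising the estimate above 120 seconds flags both timeouts.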
Voice Detection and Timing
These variables adjust how the Voice Activity Detection (VAD) system detects speech and pauses.
| Variable | Description | Default | Required / Recommended |
|---|---|---|---|
| VAD_THRESHOLD | Speech sensitivity. Higher = less sensitive. | 0.65 | Recommended |
| VAD_SPEECH_PROB_WINDOW | Window size for calculating speech probability. | 3 | Recommended |
| VAD_MIN_SPEECH_DURATION_MS | Minimum speech duration (in milliseconds) to register as valid speech. | 250 | Recommended |
| VAD_MIN_SILENCE_DURATION_MS | Minimum silence duration (in milliseconds) to register as a pause. | 350 | Recommended |
| VAD_SPEECH_PAD_MS | Additional buffer (in milliseconds) before and after speech segments. | 700 | Recommended |
| LONG_PAUSE_OFFSET_MS | Defines a long pause (used to detect intent to end or wait). | 850 | Recommended |
| SHORT_PAUSE_OFFSET_MS | Defines a short pause (used for natural conversation pacing). | 200 | Recommended |
| VAD_CORRECTION_ENTER_THRESHOLD | Threshold for entering speech detection mode. | 0.6 | Recommended |
| VAD_CORRECTION_EXIT_THRESHOLD | Threshold for exiting speech detection mode. | 0.35 | Recommended |
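A pair of enter and exit thresholds is commonly used for hysteresis: detection switches on at the higher value and switches off only below the lower one, which avoids flickering when the speech probability hovers in between. The sketch below illustrates that general technique with the default values of VAD_CORRECTION_ENTER_THRESHOLD and VAD_CORRECTION_EXIT_THRESHOLD. It is an assumption about how such thresholds typically behave, not a description of Flametree's internal VAD, and `speech_states` is a hypothetical function name.

```python
def speech_states(probabilities, enter=0.6, exit_=0.35):
    """Generic hysteresis sketch (not Flametree's actual implementation).

    probabilities: per-frame speech probabilities in the range 0..1.
    Returns one boolean per frame: True while in speech detection mode.
    """
    in_speech = False
    states = []
    for p in probabilities:
        if not in_speech and p >= enter:
            # Probability rose above the enter threshold: speech starts.
            in_speech = True
        elif in_speech and p < exit_:
            # Probability fell below the exit threshold: speech ends.
            in_speech = False
        states.append(in_speech)
    return states


if __name__ == "__main__":
    frames = [0.2, 0.7, 0.5, 0.4, 0.3, 0.5]
    print(speech_states(frames))
```

In this example the frames at 0.5 and 0.4 stay in speech mode because they are above the exit threshold, while the final 0.5 does not re-enter speech mode because it is below the enter threshold.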
SIP and Logging Parameters
These settings control SIP signaling and logging verbosity.
| Variable | Description | Default | Required / Recommended |
|---|---|---|---|
| SIP_EARLY_EOC | Enables early end-of-call signaling in SIP. Usually disabled. | false | Recommended |
| PJSIP_LOG_LEVEL | SIP log verbosity. Higher value = more detailed logs. | 4 | Recommended |
Recommended Setup
At minimum, set the following:
- CODECS_PRIORITY
- SESSION_TIMEOUT_SEC
It is strongly recommended to also set:
- WHISPER_LANGUAGE
- USER_SALIENCE_TIMEOUT_MS
- INTERRUPTIONS_ARE_ALLOWED
- START_PHRASE
- SESSION_VERSION
All other parameters can stay at default values unless you need fine-tuning.
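If you keep a copy of your agent's variables outside the dashboard, a short script can confirm that nothing from the lists above is missing. This is a minimal sketch under that assumption; `missing_variables` is a hypothetical helper, not a Flametree API, and the variable names are taken from this section.

```python
# Variable names as listed in this section.
REQUIRED = ("CODECS_PRIORITY", "SESSION_TIMEOUT_SEC")
RECOMMENDED = (
    "WHISPER_LANGUAGE",
    "USER_SALIENCE_TIMEOUT_MS",
    "INTERRUPTIONS_ARE_ALLOWED",
    "START_PHRASE",
    "SESSION_VERSION",
)


def missing_variables(configured):
    """Hypothetical helper: report which variables are not set.

    configured: mapping of variable name to value, as you plan to enter it.
    Returns (missing_required, missing_recommended) as two lists.
    """
    missing_required = [name for name in REQUIRED if name not in configured]
    missing_recommended = [name for name in RECOMMENDED if name not in configured]
    return missing_required, missing_recommended


if __name__ == "__main__":
    planned = {"CODECS_PRIORITY": "{}", "WHISPER_LANGUAGE": "en"}
    required, recommended = missing_variables(planned)
    print("Missing required:", required)
    print("Missing recommended:", recommended)
```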
Your Voice Agent is now ready to handle calls 🚀