Quick Start: Build Your First Voice Agent
This step-by-step guide walks you through creating a Voice AI Agent on the Flametree platform.
By the end of this tutorial, you’ll have a fully functional agent capable of handling inbound or outbound calls, understanding speech, and responding naturally with synthesized voice.
What You’ll Build
Your Voice AI agent will be able to:
- Answer or initiate phone calls via SIP
- Follow a defined conversation flow (workflow)
- Collect key data such as customer name, intent, and callback time
Prerequisites
- A SIP integration, prepared and configured (if you don't have one yet, create it first)
Step 1: Create a Voice Agent
- Go to the AI Agents section in your Flametree dashboard.
- Click Create Agent.
- Choose the Voice Agent type:
  - Inbound Voice Agent – answers incoming calls
  - Outbound Voice Agent – initiates calls to users and can also handle inbound calls
Tip: Outbound agents are often used in campaigns, surveys, or callback flows, but can also handle inbound calls automatically.
Step 2: Define Main Agent Parameters
Set the core configuration parameters of your Voice Agent.
Identity
Speech Style and Language
The Task describes the agent’s high-level purpose and goals.
Step 3: Configure the Workflow (Conversation Logic)
The Workflow defines your voice agent’s conversation steps.
You can edit the description section in the workflow to specify how the conversation should proceed.
Step 4: Select Models
In the Models section, select models for understanding and speaking.
Category | Description | Example |
---|---|---|
LLM | Main language model | gpt-4.1, qwen2.5-instruct, qwen3-instruct |
Text-to-Speech (TTS) | Converts text to voice output | Female Azure TTS |
Speech-to-Text (STT) | Converts caller speech into text | Flametree Whisper |
⚡ Note: The Voice Agent supports only instruct models (e.g., gpt-4.1, qwen2.5-instruct, qwen3-instruct).
The speed of the LLM directly affects response delay — choose the fastest model that meets your quality requirements.
Step 5: Connect SIP Integration
Your Voice Agent uses SIP to send or receive calls.
Configuration Steps
- Go to the Communication Channels section.
- Click the + (plus) button.
- Select SIP.
- Choose the SIP integration you prepared earlier.
- Click Save.
Step 6: Set Max Opened Sessions
In the Advanced Settings section of the right panel, set Max Opened Sessions. This value limits the number of simultaneous outbound calls.
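To illustrate what a cap on simultaneous calls means in practice, here is a minimal Python sketch that uses a semaphore to keep no more than a fixed number of calls active at once. It is not Flametree code: the `run_calls` and `place_call` names, the phone numbers, and the simulated call duration are all hypothetical, and the platform enforces the limit for you once Max Opened Sessions is set.

```python
import asyncio


async def run_calls(numbers, max_opened_sessions):
    """Simulate outbound calls with at most max_opened_sessions active at once.

    Returns the highest number of calls that were active at the same time.
    """
    limit = asyncio.Semaphore(max_opened_sessions)
    active = 0
    peak = 0

    async def place_call(number):
        nonlocal active, peak
        async with limit:              # wait here while the cap is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the duration of a call
            active -= 1

    await asyncio.gather(*(place_call(n) for n in numbers))
    return peak


# Five hypothetical numbers, cap of two simultaneous sessions.
peak = asyncio.run(
    run_calls(["1001", "1002", "1003", "1004", "1005"], max_opened_sessions=2)
)
print(peak)
```

With five queued calls and a cap of two, the remaining calls wait until a session slot is released, so the observed peak never exceeds the cap.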
Agent Customization
Configure additional environment variables to fine-tune how your Voice Agent behaves during live calls.
These parameters control timing, recognition quality, session behavior, and call flow.
Variable | Description | Recommended value |
---|---|---|
CODECS_PRIORITY | List or dictionary defining the preferred audio codecs for SIP calls. Leave {} for automatic negotiation. | {} |
SESSION_TIMEOUT_SEC | Time (in seconds) before a session is closed after the last message. Should not be shorter than the longest speech segment in the dialogue. | 120 |
WHISPER_LANGUAGE | Language code for Whisper STT (use if you know the expected caller language for better recognition). | en |
USER_SALIENCE_TIMEOUT_MS | Time (in milliseconds) before the session closes after the last human message. Should not be shorter than the longest AI speech in the dialogue. | 100000 |
INTERRUPTIONS_ARE_ALLOWED | Allow the human speaker to interrupt AI speech. | False |
START_PHRASE | Fixed opening phrase for the agent. Helps control token usage and ensures consistent greetings. | "Hello, this is Anna. How can I assist you today?" |
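The two timeout variables use different units (seconds and milliseconds), which makes them easy to misconfigure. The sketch below is a hypothetical pre-flight check, not part of the Flametree platform: `check_timeouts` and the `longest_speech_sec` estimate are assumptions introduced for illustration. It applies the rule from the table that neither timeout should be shorter than the longest speech segment in the dialogue.

```python
def check_timeouts(session_timeout_sec, user_salience_timeout_ms, longest_speech_sec):
    """Return warnings for timeouts shorter than the longest speech segment.

    longest_speech_sec is your own estimate of the longest uninterrupted
    speech segment in the dialogue, in seconds.
    """
    warnings = []
    if session_timeout_sec < longest_speech_sec:
        warnings.append("SESSION_TIMEOUT_SEC is shorter than the longest speech segment")
    # Convert milliseconds to seconds before comparing.
    if user_salience_timeout_ms / 1000 < longest_speech_sec:
        warnings.append("USER_SALIENCE_TIMEOUT_MS is shorter than the longest AI speech")
    return warnings


# Recommended values from the table, with an assumed 45-second longest segment.
ok = check_timeouts(120, 100000, 45)
# A 30000 ms user timeout would cut off a 45-second AI speech.
too_short = check_timeouts(120, 30000, 45)
print(ok)
print(too_short)
```

With the recommended values the check returns no warnings; lowering USER_SALIENCE_TIMEOUT_MS to 30000 triggers the second warning.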
Voice Detection and Timing Parameters
These settings control the Voice Activity Detection (VAD) system, which determines when the agent detects the start of speech, pauses, and the end of speech.
Variable | Description | Default |
---|---|---|
VAD_THRESHOLD | Sensitivity threshold for detecting speech. Higher = less sensitive. | 0.65 |
VAD_SPEECH_PROB_WINDOW | Window size for calculating speech probability. | 3 |
VAD_MIN_SPEECH_DURATION_MS | Minimum speech duration to register as valid speech. | 250 |
VAD_MIN_SILENCE_DURATION_MS | Minimum silence duration to register as pause. | 350 |
VAD_SPEECH_PAD_MS | Additional buffer (in ms) before and after speech segments. | 700 |
LONG_PAUSE_OFFSET_MS | Defines a long pause threshold (used to detect intent to end or wait). | 850 |
SHORT_PAUSE_OFFSET_MS | Defines short pause timing (used for natural conversation pacing). | 200 |
VAD_CORRECTION_ENTER_THRESHOLD | Threshold for entering speech detection mode. | 0.6 |
VAD_CORRECTION_EXIT_THRESHOLD | Threshold for exiting speech detection mode. | 0.35 |
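An enter threshold that is higher than the exit threshold is a common pattern known as hysteresis: it keeps the detector from flickering between speech and silence when the probability hovers near a single cutoff. The sketch below shows that general idea in Python using the default values 0.6 and 0.35. It is a generic illustration under that assumption, not Flametree's actual VAD implementation; `label_frames` and the sample probabilities are hypothetical.

```python
def label_frames(probs, enter_threshold=0.6, exit_threshold=0.35):
    """Label each frame as speech (True) or silence (False) using two thresholds.

    A frame sequence enters the speech state when the probability reaches
    enter_threshold and leaves it only when the probability drops below
    exit_threshold.
    """
    in_speech = False
    labels = []
    for p in probs:
        if not in_speech and p >= enter_threshold:
            in_speech = True
        elif in_speech and p < exit_threshold:
            in_speech = False
        labels.append(in_speech)
    return labels


# Hypothetical per-frame speech probabilities.
probs = [0.1, 0.5, 0.7, 0.5, 0.4, 0.3, 0.5, 0.65]
labels = label_frames(probs)
print(labels)
# → [False, False, True, True, True, False, False, True]
```

Note that the frames with probability 0.5 are labeled differently depending on the current state: 0.5 is not enough to enter speech, but it is enough to stay in it.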
SIP and Logging Parameters
Variable | Description | Default |
---|---|---|
SIP_EARLY_EOC | Enables early end-of-call signaling in SIP. Usually left disabled. | false |
PJSIP_LOG_LEVEL | Verbosity level for SIP library logs. Higher value = more detailed logs. | 4 |
Recommended Configuration Summary
At minimum, you must set:
- CODECS_PRIORITY
- SESSION_TIMEOUT_SEC
It is strongly recommended to also set:
- WHISPER_LANGUAGE
- USER_SALIENCE_TIMEOUT_MS
- INTERRUPTIONS_ARE_ALLOWED
- START_PHRASE
- SESSION_VERSION
All other parameters can remain at default values unless advanced tuning is required.
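If you keep a copy of your agent's variables outside the dashboard, a small check can confirm that the required and recommended variables listed above are present. The following Python sketch is hypothetical: `audit_config` and the `my_config` dictionary are illustrative names, and how you export or store the variables is an assumption.

```python
# Variable names taken from the summary above.
REQUIRED = ["CODECS_PRIORITY", "SESSION_TIMEOUT_SEC"]
RECOMMENDED = [
    "WHISPER_LANGUAGE",
    "USER_SALIENCE_TIMEOUT_MS",
    "INTERRUPTIONS_ARE_ALLOWED",
    "START_PHRASE",
    "SESSION_VERSION",
]


def audit_config(config):
    """Return (missing_required, missing_recommended) for a dict of variables."""
    missing_required = [name for name in REQUIRED if name not in config]
    missing_recommended = [name for name in RECOMMENDED if name not in config]
    return missing_required, missing_recommended


# Hypothetical local copy of an agent's variables.
my_config = {
    "CODECS_PRIORITY": {},
    "SESSION_TIMEOUT_SEC": 120,
    "WHISPER_LANGUAGE": "en",
    "START_PHRASE": "Hello, this is Anna. How can I assist you today?",
}

missing_required, missing_recommended = audit_config(my_config)
print(missing_required)
print(missing_recommended)
```

Here both required variables are present, and the check reports the three recommended variables that are still unset.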