# Deep Analysis: Create YAML configurations to analyze chatbot conversations using LLM-based metrics

A complete guide to creating YAML configurations that analyze chatbot conversations using LLM-based metrics. The system extracts business insights from user-chatbot interactions through structured analysis and visualization.
## Prerequisites

- Understanding of YAML syntax and indentation rules
- Basic knowledge of LLM prompting techniques
- Familiarity with business metrics and KPIs
- Access to conversation dialog data
## Quick Start

### Your First Configuration in 5 Minutes

This minimal working configuration analyzes user satisfaction from conversation tone:
```yaml
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30

analysis_types:
  - id: "BasicSatisfaction"
    order: 0
    description: "Analyze user satisfaction from conversation tone"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Determine if the user was satisfied with the chatbot interaction.
    llm_metrics:
      - id: "user_satisfaction"
        name: "User Satisfaction"
        kind: llm
        type: Literal['satisfied','neutral','dissatisfied']
        description: "Overall user satisfaction level"
        values:
          - id: "satisfied"
            name: "Satisfied"
            color: "green"
          - id: "neutral"
            name: "Neutral"
            color: "gray"
          - id: "dissatisfied"
            name: "Dissatisfied"
            color: "red"
        prompt: |-
          Classify user satisfaction:
          - satisfied: User expressed gratitude, positive feedback, or achieved their goal
          - neutral: User completed interaction without clear positive/negative sentiment
          - dissatisfied: User expressed frustration, complaints, or left unsatisfied
          Return exactly one label: satisfied, neutral, or dissatisfied — nothing else.

visualization:
  tabs:
    - id: "satisfaction_overview"
      title: "User Satisfaction"
      plots:
        - id: "satisfaction_summary"
          kind: summary
          type: detailed_bars
          title: "Satisfaction Distribution"
          metrics:
            - id: "user_satisfaction"
              features:
                - id: unit
                  aggregation: sum
```
**What this does:**

- Analyzes the last 30 days of conversations, daily at midnight
- Classifies each conversation by user satisfaction level
- Creates a bar chart showing the satisfaction distribution
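Before deploying a configuration, it can help to sanity-check that the three required top-level sections are present and in order. A minimal sketch using only the standard library; the check is purely textual (a real validator would parse the YAML properly):

```python
# Naive structural check: a top-level YAML key starts at column 0
# and ends with a colon. This is a sketch, not a full YAML parser.
REQUIRED_SECTIONS = ["scheduling_rules", "analysis_types", "visualization"]

def check_top_level_sections(config_text: str) -> list[str]:
    """Return the top-level section names found, in order of appearance."""
    found = []
    for line in config_text.splitlines():
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            found.append(line.rstrip().rstrip(":"))
    return found

config_text = """\
scheduling_rules:
  cron_exp: "0 0 * * *"
analysis_types:
  - id: "BasicSatisfaction"
visualization:
  tabs: []
"""
assert check_top_level_sections(config_text) == REQUIRED_SECTIONS
```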
## Core Concepts

### System Architecture

The dialog analysis system works in three stages:

1. **Data Collection**: Gather conversation dialogs based on scheduling rules
2. **Metric Extraction**: Process each dialog through LLM or code-based analysis
3. **Visualization**: Generate dashboards from the extracted metrics
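The three stages can be sketched end to end as a plain Python pipeline. All function names and the stubbed classifier below are illustrative, not the system's actual API:

```python
# Hypothetical sketch of the three-stage pipeline.
from datetime import date, timedelta

def collect_dialogs(depth_days: int) -> list[list[str]]:
    """Stage 1: gather dialogs from the last `depth_days` days (stubbed)."""
    cutoff = date.today() - timedelta(days=depth_days)
    # The real system would query stored conversations newer than `cutoff`.
    return [["user: hi", "bot: hello"], ["user: thanks!", "bot: welcome"]]

def extract_metrics(dialog: list[str]) -> dict:
    """Stage 2: analyze one dialog (keyword stub in place of an LLM call)."""
    text = " ".join(dialog).lower()
    label = "satisfied" if "thanks" in text else "neutral"
    return {"user_satisfaction": label}

def build_dashboard(results: list[dict]) -> dict:
    """Stage 3: aggregate per-dialog metrics into plot-ready counts."""
    counts: dict = {}
    for r in results:
        label = r["user_satisfaction"]
        counts[label] = counts.get(label, 0) + 1
    return counts

results = [extract_metrics(d) for d in collect_dialogs(depth_days=30)]
counts = build_dashboard(results)  # {'neutral': 1, 'satisfied': 1}
```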
Key Principles
-
🎯 One Metric, One Purpose Each metric should measure exactly one concept. Don't combine satisfaction + engagement in a single metric.
-
📊 Business-First Design Design metrics around business questions: "Are users satisfied?" not "What sentiment words appear?"
-
🔍 Explicit Classification LLM prompts must be extremely specific with definitions, examples, and edge cases.
-
⚡ Performance Optimized All metrics in one analysis_type are processed together (~1 second per dialog regardless of metric count).
## Configuration Structure

### File Structure Requirements

**CRITICAL:** Every YAML file must use exactly 2 spaces per indentation level and contain these three sections, in order:

```yaml
scheduling_rules:   # When and how much data to analyze
analysis_types:     # What metrics to extract and how
visualization:      # How to display results
```
### Scheduling Rules Section

Controls when analysis runs and how much historical data to process.

```yaml
scheduling_rules:
  cron_exp: "<cron_expression>"
  depth: <integer>
```

**Fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `cron_exp` | string | Yes | Standard cron expression, in quotes |
| `depth` | integer | Yes | Number of days of historical data to analyze |
**Examples:**

```yaml
# Daily analysis at midnight, covering the last 30 days
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30
```

```yaml
# Weekly analysis on Sundays at 2 AM, covering the last 7 days
scheduling_rules:
  cron_exp: "0 2 * * 0"
  depth: 7
```
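Any standard five-field cron expression works. For instance, an hourly run over the most recent day of data (an illustrative schedule, not a recommendation):

```yaml
# Hourly analysis at the top of each hour, covering the last day
scheduling_rules:
  cron_exp: "0 * * * *"
  depth: 1
```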
### Analysis Types Section

Defines which metrics to extract from conversations. Each analysis type processes dialogs with a focused set of related metrics.

#### Basic Structure

```yaml
analysis_types:
  - id: "<unique_identifier>"
    order: <integer>
    description: "<purpose_description>"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        <overall_analysis_description>
    llm_metrics:
      - <metric_definition>
```
`order` specifies the order in which analysis types are calculated. Currently unused.
#### LLM Metrics Definition

```yaml
- id: "<metric_identifier>"
  name: "<display_name>"
  kind: llm
  type: <data_type>
  description: "<metric_purpose>"
  prompt: |-
    <detailed_classification_instructions>
  values:  # For categorical metrics only
    - id: "<value_id>"
      name: "<display_name>"
      color: "<color_name>"
```
#### Data Types

| Type | Description | Example |
|---|---|---|
| `Literal['val1','val2']` | Fixed set of string values | `Literal['positive','negative','neutral']` |
| `str` | Free-form text response | Analysis explanations, summaries |
| `bool` | Boolean true/false | `true`, `false` |
| `list[str]` | Array of strings | `["topic1", "topic2"]` |
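Non-categorical types follow the same metric schema, just without a `values` list. For example (the ids, names, and prompts below are illustrative):

```yaml
llm_metrics:
  - id: "escalation_requested"
    name: "Escalation Requested"
    kind: llm
    type: bool
    description: "Whether the user asked for a human agent"
    prompt: |-
      Return true if the user explicitly asked to speak with a human
      agent or requested escalation; otherwise return false.
  - id: "discussed_topics"
    name: "Discussed Topics"
    kind: llm
    type: list[str]
    description: "Short labels for the topics covered in the dialog"
    prompt: |-
      List the distinct topics discussed, as short lowercase labels.
      Return an empty list if no clear topic emerges.
```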
### Visualization Section

Defines the dashboard structure, with tabs and plots that display the extracted metrics.

```yaml
visualization:
  tabs:
    - id: "<tab_identifier>"
      title: "<tab_display_name>"
      plots:
        - id: "<plot_identifier>"
          kind: <plot_kind>
          type: <plot_type>
          title: "<plot_title>"
          metrics:
            - id: "<metric_id>"
              features:
                - id: unit
                  aggregation: <aggregation_type>
```
**Plot Types:**

| Kind | Type | Description |
|---|---|---|
| `trend` | `bar_time_series` | Time-series visualization |
| `summary` | `detailed_bars` | Categorical distribution |

More plot types will be added in the future.
## Best Practices

### Effective Prompt Writing

🎯 **Single Metric Focus.** Each prompt analyzes ONLY one metric. Never reference other metrics.

📋 **Exhaustive Categories.** For `Literal` types, define ALL possible values with:

- **Definition**: Clear, unambiguous criteria
- **Examples**: Concrete examples from real dialogs
- **Notes**: Edge cases and clarifications
✅ **Correct prompt structure:**

```yaml
prompt: |-
  Classify user sentiment based on their messages:
  - positive:
    • Definition: Explicit gratitude, satisfaction, or achievement of goals
    • Examples: "Thank you!", "This helped me", "Perfect solution"
    • Notes: Any appreciation without complaints indicates positive
  - negative:
    • Definition: Complaints, frustration, anger, or explicit dissatisfaction
    • Examples: "This is terrible", "Waste of time", profanity
    • Notes: Sarcasm with clear negativity counts as negative
  - neutral:
    • Definition: No clear emotional indicators either way
    • Examples: "Okay", "I understand", factual questions only
    • Notes: Default for ambiguous cases
  Return exactly one label: positive, negative, or neutral — nothing else.
```
### Metric Design Strategy

Always start with the business question, then design metrics:

❌ Wrong: "Analyze sentiment words in conversations"
✅ Right: "Are users satisfied with support quality?"

**Business categories:**

- **User Experience Metrics**: Satisfaction, sentiment, engagement
- **Chatbot Performance**: Goal achievement, quality issues, efficiency
- **Business Outcomes**: Conversion, escalation, risk assessment
## Configuration Examples

### Customer Support Analysis

```yaml
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 7

analysis_types:
  - id: "SupportQuality"
    order: 0
    description: "Analyze customer support interaction quality"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Evaluate the quality of customer support provided in this conversation.
    llm_metrics:
      - id: "issue_resolution"
        name: "Issue Resolution"
        kind: llm
        type: Literal['resolved','partially_resolved','unresolved']
        description: "Whether the customer's issue was addressed"
        values:
          - id: "resolved"
            name: "Fully Resolved"
            color: "green"
          - id: "partially_resolved"
            name: "Partially Resolved"
            color: "yellow"
          - id: "unresolved"
            name: "Unresolved"
            color: "red"
        prompt: |-
          Assess issue resolution:
          - resolved: Customer's problem was completely addressed and confirmed
          - partially_resolved: Some progress made but issue not fully addressed
          - unresolved: No meaningful progress on customer's core issue
          Return exactly one label: resolved, partially_resolved, or unresolved — nothing else.

visualization:
  tabs:
    - id: "support_overview"
      title: "Support Quality Overview"
      plots:
        - id: "resolution_trend"
          kind: trend
          type: bar_time_series
          title: "Issue Resolution Trends"
          metrics:
            - id: "issue_resolution"
              features:
                - id: unit
                  aggregation: sum
                  percentage: true
```
### Sales Conversation Analysis

```yaml
analysis_types:
  - id: "SalesOutcome"
    order: 0
    description: "Analyze sales conversation outcomes"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Analyze this sales conversation for lead qualification and interest level.
    llm_metrics:
      - id: "lead_interest"
        name: "Lead Interest Level"
        kind: llm
        type: Literal['high_interest','medium_interest','low_interest','not_interested']
        description: "Customer's level of interest in the product/service"
        values:
          - id: "high_interest"
            name: "High Interest"
            color: "green"
          - id: "medium_interest"
            name: "Medium Interest"
            color: "blue"
          - id: "low_interest"
            name: "Low Interest"
            color: "yellow"
          - id: "not_interested"
            name: "Not Interested"
            color: "red"
        prompt: |-
          Assess customer interest level:
          - high_interest: Strong engagement, asks detailed questions, requests next steps
          - medium_interest: Shows curiosity but has concerns or needs more information
          - low_interest: Minimal engagement, generic responses, seems distracted
          - not_interested: Explicit disinterest or attempts to end conversation
          Return exactly one label: high_interest, medium_interest, low_interest, or not_interested — nothing else.
```
## Common Issues & Solutions

### YAML file fails to parse with indentation errors

**Problem:** Using 4 spaces or tabs instead of 2 spaces

```yaml
analysis_types:
    - id: "wrong"    # 4 spaces - WRONG
```

**Solution:** Always use exactly 2 spaces for indentation

```yaml
analysis_types:
  - id: "correct"    # 2 spaces - CORRECT
    order: 0         # 2 spaces - CORRECT
```
### LLM returns inconsistent or unexpected classifications

**Problem:** Abstract instructions without clear criteria

```yaml
prompt: "Analyze if the user was happy"
```

**Solution:** Provide specific criteria with examples and edge cases

```yaml
prompt: |-
  Classify user happiness:
  - happy: Explicit positive expressions or goal achievement
    Examples: "Thank you so much!", "Perfect!", successful completion
  - unhappy: Complaints, frustration, or unresolved issues
    Examples: "This doesn't work", "Frustrated", abandoning conversation
  - neutral: No clear emotional indicators
    Examples: "OK", factual questions, simple acknowledgments
  Return exactly one label: happy, unhappy, or neutral — nothing else.
```
### Visualization shows no data or missing metrics

**Problem:** Referencing non-existent metric IDs

```yaml
llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "sentiment"    # WRONG - doesn't match above
```

**Solution:** Ensure exact ID matching throughout the configuration

```yaml
llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "user_sentiment"    # CORRECT - exact match
```
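This kind of mismatch is easy to catch programmatically. A sketch that assumes the configuration has already been parsed into a dict (e.g. with a YAML library) and that key names follow the schema described above:

```python
# Cross-check metric ids between analysis_types and visualization.
def find_unknown_metric_ids(config: dict) -> set[str]:
    """Return visualization metric ids that no llm_metric defines."""
    defined = {
        metric["id"]
        for analysis in config.get("analysis_types", [])
        for metric in analysis.get("llm_metrics", [])
    }
    referenced = {
        m["id"]
        for tab in config.get("visualization", {}).get("tabs", [])
        for plot in tab.get("plots", [])
        for m in plot.get("metrics", [])
    }
    return referenced - defined

config = {
    "analysis_types": [{"llm_metrics": [{"id": "user_sentiment"}]}],
    "visualization": {
        "tabs": [{"plots": [{"metrics": [{"id": "sentiment"}]}]}]
    },
}
assert find_unknown_metric_ids(config) == {"sentiment"}  # mismatch caught
```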
## FAQ

### How many metrics can I include in one analysis type?

While there's no hard limit, keep analysis types focused on related metrics (3-5 metrics max). This improves performance and maintains logical grouping. Use multiple analysis types for different business areas.

### Can I use custom Python code in expressions?

Yes, code metrics support Python expressions with access to dialog data and built-in libraries. However, only standard Python modules are available; no external packages can be imported.
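As an illustration of the kind of expression a code metric might use: the variable exposed to code metrics is assumed here to be `dialog`, a list of message dicts with `"role"` and `"text"` keys, which may differ from the real system's naming:

```python
# Illustrative code-metric style computation; `dialog` and its field
# names are assumptions, not the system's documented interface.
dialog = [
    {"role": "user", "text": "My order is late"},
    {"role": "assistant", "text": "Let me check that for you."},
    {"role": "user", "text": "Thanks"},
]

# Count user messages using only built-in functionality.
user_message_count = sum(1 for m in dialog if m["role"] == "user")
assert user_message_count == 2
```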
### How do I handle edge cases where conversations have no clear classification?

Always include a "neutral" or "unknown" category in your `Literal` types. Define it as the default for ambiguous cases and provide clear criteria for when to use it.

### What's the difference between 'kind: trend' and 'kind: summary' plots?

- **Trend plots** show data over time; useful for tracking changes and patterns
- **Summary plots** show the current state distribution; useful for understanding overall composition
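Both kinds can sit side by side in one tab. For instance, assuming a `user_sentiment` metric defined in `llm_metrics` (ids and titles below are illustrative):

```yaml
visualization:
  tabs:
    - id: "sentiment_tab"
      title: "Sentiment"
      plots:
        - id: "sentiment_over_time"
          kind: trend
          type: bar_time_series
          title: "Sentiment Over Time"
          metrics:
            - id: "user_sentiment"
              features:
                - id: unit
                  aggregation: sum
        - id: "sentiment_breakdown"
          kind: summary
          type: detailed_bars
          title: "Current Sentiment Breakdown"
          metrics:
            - id: "user_sentiment"
              features:
                - id: unit
                  aggregation: sum
```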