
Deep Analysis: Create YAML configurations to analyze chatbot conversations using LLM-based metrics

This guide explains how to create YAML configurations that analyze chatbot conversations with LLM-based metrics. The system extracts business insights from user-chatbot interactions through structured analysis and visualization.

Prerequisites

  • Understanding of YAML syntax and indentation rules
  • Basic knowledge of LLM prompting techniques
  • Familiarity with business metrics and KPIs
  • Access to conversation dialog data

Quick Start

Your First Configuration in 5 Minutes

This minimal working configuration analyzes user satisfaction from conversation tone:

scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30

analysis_types:
  - id: "BasicSatisfaction"
    order: 0
    description: "Analyze user satisfaction from conversation tone"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Determine if the user was satisfied with the chatbot interaction.
    llm_metrics:
      - id: "user_satisfaction"
        name: "User Satisfaction"
        kind: llm
        type: Literal['satisfied','neutral','dissatisfied']
        description: "Overall user satisfaction level"
        values:
          - id: "satisfied"
            name: "Satisfied"
            color: "green"
          - id: "neutral"
            name: "Neutral"
            color: "gray"
          - id: "dissatisfied"
            name: "Dissatisfied"
            color: "red"
        prompt: |-
          Classify user satisfaction:
          - satisfied: User expressed gratitude, positive feedback, or achieved their goal
          - neutral: User completed interaction without clear positive/negative sentiment
          - dissatisfied: User expressed frustration, complaints, or left unsatisfied
          Return exactly one label: satisfied, neutral, or dissatisfied — nothing else.

visualization:
  tabs:
    - id: "satisfaction_overview"
      title: "User Satisfaction"
      plots:
        - id: "satisfaction_summary"
          kind: summary
          type: detailed_bars
          title: "Satisfaction Distribution"
          metrics:
            - id: "user_satisfaction"
          features:
            - id: unit
              aggregation: sum

What this does:

  • Analyzes last 30 days of conversations daily at midnight
  • Classifies each conversation by user satisfaction level
  • Creates a bar chart showing satisfaction distribution

Core Concepts

System Architecture

The dialog analysis system works in three stages:

  1. Data Collection: Gather conversation dialogs based on scheduling rules
  2. Metric Extraction: Process each dialog through LLM or code-based analysis
  3. Visualization: Generate dashboards from extracted metrics
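
Each stage corresponds to one top-level section of the configuration file. The sketch below shows the mapping; the comments are illustrative annotations, not part of the schema:

scheduling_rules:   # Stage 1: when analysis runs and how much history to collect
analysis_types:     # Stage 2: which metrics to extract from each dialog
visualization:      # Stage 3: how extracted metrics are rendered as dashboards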

Key Principles

  • 🎯 One Metric, One Purpose: Each metric should measure exactly one concept. Don't combine satisfaction + engagement in a single metric.

  • 📊 Business-First Design: Design metrics around business questions: "Are users satisfied?" not "What sentiment words appear?"

  • 🔍 Explicit Classification: LLM prompts must be extremely specific with definitions, examples, and edge cases.

  • ⚡ Performance Optimized: All metrics in one analysis_type are processed together (~1 second per dialog regardless of metric count).
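
As a sketch of the last principle, two related metrics defined in one analysis_type are extracted in a single pass per dialog. The layout follows the schema documented below; the IDs are illustrative, and the angle-bracket placeholders stand for full metric definitions:

analysis_types:
  - id: "UserExperience"
    order: 0
    description: "Related user-experience metrics processed in one pass"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Evaluate the user's experience in this conversation.
    llm_metrics:
      - <satisfaction_metric_definition>   # both metrics share one pass
      - <engagement_metric_definition>     # adds no per-dialog latency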

Configuration Structure

File Structure Requirements

CRITICAL: Every YAML file must use exactly 2 spaces for indentation and contain these three sections in order:

scheduling_rules:   # When and how much data to analyze
analysis_types:     # What metrics to extract and how
visualization:      # How to display results

Scheduling Rules Section

Controls when analysis runs and how much historical data to process.

scheduling_rules:
  cron_exp: "<cron_expression>"
  depth: <integer>

Fields:

Field      Type     Required  Description
cron_exp   string   Yes       Standard cron expression in quotes
depth      integer  Yes       Number of days of historical data to analyze

Examples:

# Daily analysis at midnight, last 30 days
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30

# Weekly analysis on Sundays at 2 AM, last 7 days
scheduling_rules:
  cron_exp: "0 2 * * 0"
  depth: 7

Analysis Types Section

Defines what metrics to extract from conversations. Each analysis type processes dialogs with a focused set of related metrics.

Basic Structure

analysis_types:
  - id: "<unique_identifier>"
    order: <integer>
    description: "<purpose_description>"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        <overall_analysis_description>
    llm_metrics:
      - <metric_definition>

The order field specifies the order in which this analysis type is calculated. It is currently unused.

LLM Metrics Definition

- id: "<metric_identifier>"
name: "<display_name>"
kind: llm
type: <data_type>
description: "<metric_purpose>"
prompt: |-
<detailed_classification_instructions>
values: # For categorical metrics only
- id: "<value_id>"
name: "<display_name>"
color: "<color_name>"

Data Types

Type                     Description                 Example
Literal['val1','val2']   Fixed set of string values  Literal['positive','negative','neutral']
str                      Free-form text response     Analysis explanations, summaries
bool                     Boolean true/false          true, false
list[str]                Array of strings            ["topic1", "topic2"]
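
Non-categorical types omit the values block entirely. A minimal sketch with illustrative metric IDs (not taken from the source):

- id: "goal_achieved"
  name: "Goal Achieved"
  kind: llm
  type: bool
  description: "Whether the user accomplished their stated goal"
  prompt: |-
    Return true if the user accomplished their stated goal, false otherwise.
- id: "discussed_topics"
  name: "Discussed Topics"
  kind: llm
  type: list[str]
  description: "Topics raised by the user during the conversation"
  prompt: |-
    List the distinct topics the user raised, as short lowercase phrases.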

Visualization Section

Defines dashboard structure with tabs and plots to display extracted metrics.

visualization:
  tabs:
    - id: "<tab_identifier>"
      title: "<tab_display_name>"
      plots:
        - id: "<plot_identifier>"
          kind: <plot_kind>
          type: <plot_type>
          title: "<plot_title>"
          metrics:
            - id: "<metric_id>"
          features:
            - id: unit
              aggregation: <aggregation_type>

Plot Types:

Kind     Type             Description
trend    bar_time_series  Time-series visualization
summary  detailed_bars    Categorical distribution

More plot types will be added in the future.
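
Both kinds can display the same metric. A sketch of a tab that pairs a trend view with a summary view (the plot IDs are illustrative):

plots:
  - id: "satisfaction_over_time"
    kind: trend
    type: bar_time_series
    title: "Satisfaction Over Time"
    metrics:
      - id: "user_satisfaction"
    features:
      - id: unit
        aggregation: sum
  - id: "satisfaction_distribution"
    kind: summary
    type: detailed_bars
    title: "Satisfaction Distribution"
    metrics:
      - id: "user_satisfaction"
    features:
      - id: unit
        aggregation: sum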

Best Practices

Effective Prompt Writing

🎯 Single Metric Focus: Each prompt analyzes ONLY one metric. Never reference other metrics.

📋 Exhaustive Categories: For Literal types, define ALL possible values with:

  • Definition: Clear, unambiguous criteria
  • Examples: Concrete examples from real dialogs
  • Notes: Edge cases and clarifications

✅ Correct Prompt Structure:

prompt: |-
  Classify user sentiment based on their messages:
  - positive:
    • Definition: Explicit gratitude, satisfaction, or achievement of goals
    • Examples: "Thank you!", "This helped me", "Perfect solution"
    • Notes: Any appreciation without complaints indicates positive
  - negative:
    • Definition: Complaints, frustration, anger, or explicit dissatisfaction
    • Examples: "This is terrible", "Waste of time", profanity
    • Notes: Sarcasm with clear negativity counts as negative
  - neutral:
    • Definition: No clear emotional indicators either way
    • Examples: "Okay", "I understand", factual questions only
    • Notes: Default for ambiguous cases
  Return exactly one label: positive, negative, or neutral — nothing else.

Metric Design Strategy

Always start with the business question, then design metrics:

❌ Wrong: "Analyze sentiment words in conversations"
✅ Right: "Are users satisfied with support quality?"

Business Categories

  1. User Experience Metrics: Satisfaction, sentiment, engagement
  2. Chatbot Performance: Goal achievement, quality issues, efficiency
  3. Business Outcomes: Conversion, escalation, risk assessment
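
For example, a Business Outcomes question such as "Does this conversation need human escalation?" maps naturally onto a boolean metric. A sketch (the metric below is illustrative, not taken from the source):

- id: "needs_escalation"
  name: "Needs Escalation"
  kind: llm
  type: bool
  description: "Whether the conversation should be escalated to a human agent"
  prompt: |-
    Return true if the user explicitly asked for a human agent, or if the
    chatbot repeatedly failed to address the user's core issue.
    Return false otherwise.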

Configuration Examples

Customer Support Analysis

scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 7

analysis_types:
  - id: "SupportQuality"
    order: 0
    description: "Analyze customer support interaction quality"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Evaluate the quality of customer support provided in this conversation.
    llm_metrics:
      - id: "issue_resolution"
        name: "Issue Resolution"
        kind: llm
        type: Literal['resolved','partially_resolved','unresolved']
        description: "Whether the customer's issue was addressed"
        values:
          - id: "resolved"
            name: "Fully Resolved"
            color: "green"
          - id: "partially_resolved"
            name: "Partially Resolved"
            color: "yellow"
          - id: "unresolved"
            name: "Unresolved"
            color: "red"
        prompt: |-
          Assess issue resolution:
          - resolved: Customer's problem was completely addressed and confirmed
          - partially_resolved: Some progress made but issue not fully addressed
          - unresolved: No meaningful progress on customer's core issue
          Return exactly one label: resolved, partially_resolved, or unresolved — nothing else.

visualization:
  tabs:
    - id: "support_overview"
      title: "Support Quality Overview"
      plots:
        - id: "resolution_trend"
          kind: trend
          type: bar_time_series
          title: "Issue Resolution Trends"
          metrics:
            - id: "issue_resolution"
          features:
            - id: unit
              aggregation: sum
              percentage: true

Sales Conversation Analysis

analysis_types:
  - id: "SalesOutcome"
    order: 0
    description: "Analyze sales conversation outcomes"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Analyze this sales conversation for lead qualification and interest level.
    llm_metrics:
      - id: "lead_interest"
        name: "Lead Interest Level"
        kind: llm
        type: Literal['high_interest','medium_interest','low_interest','not_interested']
        description: "Customer's level of interest in the product/service"
        values:
          - id: "high_interest"
            name: "High Interest"
            color: "green"
          - id: "medium_interest"
            name: "Medium Interest"
            color: "blue"
          - id: "low_interest"
            name: "Low Interest"
            color: "yellow"
          - id: "not_interested"
            name: "Not Interested"
            color: "red"
        prompt: |-
          Assess customer interest level:
          - high_interest: Strong engagement, asks detailed questions, requests next steps
          - medium_interest: Shows curiosity but has concerns or needs more information
          - low_interest: Minimal engagement, generic responses, seems distracted
          - not_interested: Explicit disinterest or attempts to end conversation
          Return exactly one label: high_interest, medium_interest, low_interest, or not_interested — nothing else.

Common Issues & Solutions

YAML file fails to parse with indentation errors

Problem: Using 4 spaces or tabs instead of 2 spaces

analysis_types:
    - id: "wrong"    # 4-space indent - WRONG

Solution: Always use exactly 2 spaces for indentation

analysis_types:
  - id: "correct"    # 2-space indent - CORRECT
    order: 0         # nested with 2-space steps - CORRECT

LLM returns inconsistent or unexpected classifications

Problem: Abstract instructions without clear criteria

prompt: "Analyze if the user was happy"

Solution: Provide specific criteria with examples and edge cases

prompt: |-
  Classify user happiness:
  - happy: Explicit positive expressions or goal achievement
    Examples: "Thank you so much!", "Perfect!", successful completion
  - unhappy: Complaints, frustration, or unresolved issues
    Examples: "This doesn't work", "Frustrated", abandoning conversation
  - neutral: No clear emotional indicators
    Examples: "OK", factual questions, simple acknowledgments
  Return exactly one label: happy, unhappy, or neutral — nothing else.

Visualization shows no data or missing metrics

Problem: Referencing non-existent metric IDs

llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "sentiment"    # WRONG - doesn't match the metric ID above

Solution: Ensure exact ID matching throughout configuration

llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "user_sentiment"    # CORRECT - exact match

FAQ

How many metrics can I include in one analysis type?

While there's no hard limit, keep analysis types focused on related metrics (3-5 metrics max). This improves performance and maintains logical grouping. Use multiple analysis types for different business areas.
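
A sketch of splitting metrics into two focused analysis types (the IDs and descriptions are illustrative):

analysis_types:
  - id: "UserExperience"
    order: 0
    description: "Satisfaction, sentiment, and engagement metrics"
    # prompt_template, prompt_parts, llm_metrics as shown earlier
  - id: "BusinessOutcomes"
    order: 1
    description: "Conversion, escalation, and risk metrics"
    # prompt_template, prompt_parts, llm_metrics as shown earlier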

Can I use custom Python code in expressions?

Yes, code metrics support Python expressions with access to dialog data and built-in libraries. However, only standard Python modules are available - no external packages can be imported.
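
The exact YAML schema for code metrics is not documented in this guide; the sketch below is an assumption for illustration only. Both the kind value and the expression field name are hypothetical:

# Hypothetical sketch: these field names are NOT confirmed by this guide
- id: "long_dialog"
  name: "Long Dialog"
  kind: code                                # assumed counterpart to kind: llm
  type: bool
  description: "Whether the dialog exceeds 20 messages"
  expression: "len(dialog.messages) > 20"   # hypothetical field and dialog accessor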

How do I handle edge cases where conversations have no clear classification?

Always include a "neutral" or "unknown" category in your Literal types. Define this as the default for ambiguous cases and provide clear criteria for when to use it.
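
A sketch of this pattern, reusing the satisfaction metric from the Quick Start:

type: Literal['satisfied','dissatisfied','unknown']
prompt: |-
  Classify user satisfaction:
  - satisfied: Explicit positive feedback or goal achievement
  - dissatisfied: Complaints, frustration, or an abandoned issue
  - unknown: Default for ambiguous cases
    Examples: single-message dialogs, test messages, unintelligible input
  Return exactly one label: satisfied, dissatisfied, or unknown.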

What's the difference between 'kind: trend' and 'kind: summary' plots?

  • Trend plots: Show data over time, useful for tracking changes and patterns
  • Summary plots: Show current state distribution, useful for understanding overall composition