# Deep Analysis: Create YAML configurations to analyze chatbot conversations using LLM-based metrics

A complete guide to creating YAML configurations that analyze chatbot conversations using LLM-based metrics. The system extracts business insights from user-chatbot interactions through structured analysis and visualization.
## Prerequisites

- Understanding of YAML syntax and indentation rules
- Basic knowledge of LLM prompting techniques
- Familiarity with business metrics and KPIs
- Access to conversation dialog data
## Quick Start

### Your First Configuration in 5 Minutes

This minimal working configuration analyzes user satisfaction from conversation tone:
```yaml
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30

analysis_types:
  - id: "BasicSatisfaction"
    order: 0
    description: "Analyze user satisfaction from conversation tone"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Determine if the user was satisfied with the chatbot interaction.
    llm_metrics:
      - id: "user_satisfaction"
        name: "User Satisfaction"
        kind: llm
        type: Literal['satisfied','neutral','dissatisfied']
        description: "Overall user satisfaction level"
        values:
          - id: "satisfied"
            name: "Satisfied"
            color: "green"
          - id: "neutral"
            name: "Neutral"
            color: "gray"
          - id: "dissatisfied"
            name: "Dissatisfied"
            color: "red"
        prompt: |-
          Classify user satisfaction:
          - satisfied: User expressed gratitude, positive feedback, or achieved their goal
          - neutral: User completed interaction without clear positive/negative sentiment
          - dissatisfied: User expressed frustration, complaints, or left unsatisfied
          Return exactly one label: satisfied, neutral, or dissatisfied — nothing else.

visualization:
  tabs:
    - id: "satisfaction_overview"
      title: "User Satisfaction"
      plots:
        - id: "satisfaction_summary"
          kind: summary
          type: detailed_bars
          title: "Satisfaction Distribution"
          metrics:
            - id: "user_satisfaction"
              features:
                - id: unit
                  aggregation: sum
```
**What this does:**

- Analyzes the last 30 days of conversations, daily at midnight
- Classifies each conversation by user satisfaction level
- Creates a bar chart showing the satisfaction distribution
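Before deploying a configuration, it can help to sanity-check that the three required top-level sections are present and in order. A minimal sketch using only the standard library; the check is purely textual (a real validator would parse the YAML properly):

```python
# Naive structural check: a top-level YAML key starts at column 0
# and ends with a colon. This is a sketch, not a full YAML parser.
REQUIRED_SECTIONS = ["scheduling_rules", "analysis_types", "visualization"]

def check_top_level_sections(config_text: str) -> list[str]:
    """Return the top-level section names found, in order of appearance."""
    found = []
    for line in config_text.splitlines():
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            found.append(line.rstrip().rstrip(":"))
    return found

config_text = """\
scheduling_rules:
  cron_exp: "0 0 * * *"
analysis_types:
  - id: "BasicSatisfaction"
visualization:
  tabs: []
"""
assert check_top_level_sections(config_text) == REQUIRED_SECTIONS
```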
## Core Concepts

### System Architecture

The dialog analysis system works in three stages:

1. **Data Collection**: Gather conversation dialogs based on scheduling rules
2. **Metric Extraction**: Process each dialog through LLM or code-based analysis
3. **Visualization**: Generate dashboards from the extracted metrics
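The three stages can be sketched end to end as a plain Python pipeline. All function names and the stubbed classifier below are illustrative, not the system's actual API:

```python
# Hypothetical sketch of the three-stage pipeline.
from datetime import date, timedelta

def collect_dialogs(depth_days: int) -> list[list[str]]:
    """Stage 1: gather dialogs from the last `depth_days` days (stubbed)."""
    cutoff = date.today() - timedelta(days=depth_days)
    # The real system would query stored conversations newer than `cutoff`.
    return [["user: hi", "bot: hello"], ["user: thanks!", "bot: welcome"]]

def extract_metrics(dialog: list[str]) -> dict:
    """Stage 2: analyze one dialog (keyword stub in place of an LLM call)."""
    text = " ".join(dialog).lower()
    label = "satisfied" if "thanks" in text else "neutral"
    return {"user_satisfaction": label}

def build_dashboard(results: list[dict]) -> dict:
    """Stage 3: aggregate per-dialog metrics into plot-ready counts."""
    counts: dict = {}
    for r in results:
        label = r["user_satisfaction"]
        counts[label] = counts.get(label, 0) + 1
    return counts

results = [extract_metrics(d) for d in collect_dialogs(depth_days=30)]
counts = build_dashboard(results)  # {'neutral': 1, 'satisfied': 1}
```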
Key Principles
-
🎯 One Metric, One Purpose Each metric should measure exactly one concept. Don't combine satisfaction + engagement in a single metric.
-
📊 Business-First Design Design metrics around business questions: "Are users satisfied?" not "What sentiment words appear?"
-
🔍 Explicit Classification LLM prompts must be extremely specific with definitions, examples, and edge cases.
-
⚡ Performance Optimized All metrics in one analysis_type are processed together (~1 second per dialog regardless of metric count).
## Configuration Structure

### File Structure Requirements

**CRITICAL:** Every YAML file must use exactly 2 spaces per indentation level and contain these three sections, in order:

```yaml
scheduling_rules:   # When and how much data to analyze
analysis_types:     # What metrics to extract and how
visualization:      # How to display results
```
### Scheduling Rules Section

Controls when analysis runs and how much historical data to process.

```yaml
scheduling_rules:
  cron_exp: "<cron_expression>"
  depth: <integer>
```

**Fields:**

| Field | Type | Required | Description |
|---|---|---|---|
| `cron_exp` | string | Yes | Standard cron expression, in quotes |
| `depth` | integer | Yes | Number of days of historical data to analyze |
**Examples:**

```yaml
# Daily analysis at midnight, covering the last 30 days
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 30
```

```yaml
# Weekly analysis on Sundays at 2 AM, covering the last 7 days
scheduling_rules:
  cron_exp: "0 2 * * 0"
  depth: 7
```
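Any standard five-field cron expression works. For instance, an hourly run over the most recent day of data (an illustrative schedule, not a recommendation):

```yaml
# Hourly analysis at the top of each hour, covering the last day
scheduling_rules:
  cron_exp: "0 * * * *"
  depth: 1
```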
### Analysis Types Section

Defines which metrics to extract from conversations. Each analysis type processes dialogs with a focused set of related metrics.

#### Basic Structure

```yaml
analysis_types:
  - id: "<unique_identifier>"
    order: <integer>
    description: "<purpose_description>"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        <overall_analysis_description>
    llm_metrics:
      - <metric_definition>
```
`order` specifies the order in which analysis types are calculated. Currently unused.
#### LLM Metrics Definition

```yaml
- id: "<metric_identifier>"
  name: "<display_name>"
  kind: llm
  type: <data_type>
  description: "<metric_purpose>"
  prompt: |-
    <detailed_classification_instructions>
  values:  # For categorical metrics only
    - id: "<value_id>"
      name: "<display_name>"
      color: "<color_name>"
```
#### Data Types

| Type | Description | Example |
|---|---|---|
| `Literal['val1','val2']` | Fixed set of string values | `Literal['positive','negative','neutral']` |
| `str` | Free-form text response | Analysis explanations, summaries |
| `bool` | Boolean true/false | `true`, `false` |
| `list[str]` | Array of strings | `["topic1", "topic2"]` |
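Non-categorical types follow the same metric schema, just without a `values` list. For example (the ids, names, and prompts below are illustrative):

```yaml
llm_metrics:
  - id: "escalation_requested"
    name: "Escalation Requested"
    kind: llm
    type: bool
    description: "Whether the user asked for a human agent"
    prompt: |-
      Return true if the user explicitly asked to speak with a human
      agent or requested escalation; otherwise return false.
  - id: "discussed_topics"
    name: "Discussed Topics"
    kind: llm
    type: list[str]
    description: "Short labels for the topics covered in the dialog"
    prompt: |-
      List the distinct topics discussed, as short lowercase labels.
      Return an empty list if no clear topic emerges.
```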
### Visualization Section

Defines the dashboard structure, with tabs and plots that display the extracted metrics.

```yaml
visualization:
  tabs:
    - id: "<tab_identifier>"
      title: "<tab_display_name>"
      plots:
        - id: "<plot_identifier>"
          kind: <plot_kind>
          type: <plot_type>
          title: "<plot_title>"
          metrics:
            - id: "<metric_id>"
              features:
                - id: unit
                  aggregation: <aggregation_type>
```
**Plot Types:**

| Kind | Type | Description |
|---|---|---|
| `trend` | `bar_time_series` | Time-series visualization |
| `summary` | `detailed_bars` | Categorical distribution |

More plot types will be added in the future.
## Best Practices

### Effective Prompt Writing

🎯 **Single Metric Focus.** Each prompt analyzes ONLY one metric. Never reference other metrics.

📋 **Exhaustive Categories.** For `Literal` types, define ALL possible values with:

- **Definition**: Clear, unambiguous criteria
- **Examples**: Concrete examples from real dialogs
- **Notes**: Edge cases and clarifications
✅ **Correct prompt structure:**

```yaml
prompt: |-
  Classify user sentiment based on their messages:
  - positive:
    • Definition: Explicit gratitude, satisfaction, or achievement of goals
    • Examples: "Thank you!", "This helped me", "Perfect solution"
    • Notes: Any appreciation without complaints indicates positive
  - negative:
    • Definition: Complaints, frustration, anger, or explicit dissatisfaction
    • Examples: "This is terrible", "Waste of time", profanity
    • Notes: Sarcasm with clear negativity counts as negative
  - neutral:
    • Definition: No clear emotional indicators either way
    • Examples: "Okay", "I understand", factual questions only
    • Notes: Default for ambiguous cases
  Return exactly one label: positive, negative, or neutral — nothing else.
```
### Metric Design Strategy

Always start with the business question, then design metrics:

❌ Wrong: "Analyze sentiment words in conversations"
✅ Right: "Are users satisfied with support quality?"

**Business categories:**

- **User Experience Metrics**: Satisfaction, sentiment, engagement
- **Chatbot Performance**: Goal achievement, quality issues, efficiency
- **Business Outcomes**: Conversion, escalation, risk assessment
## Configuration Examples

### Customer Support Analysis

```yaml
scheduling_rules:
  cron_exp: "0 0 * * *"
  depth: 7

analysis_types:
  - id: "SupportQuality"
    order: 0
    description: "Analyze customer support interaction quality"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Evaluate the quality of customer support provided in this conversation.
    llm_metrics:
      - id: "issue_resolution"
        name: "Issue Resolution"
        kind: llm
        type: Literal['resolved','partially_resolved','unresolved']
        description: "Whether the customer's issue was addressed"
        values:
          - id: "resolved"
            name: "Fully Resolved"
            color: "green"
          - id: "partially_resolved"
            name: "Partially Resolved"
            color: "yellow"
          - id: "unresolved"
            name: "Unresolved"
            color: "red"
        prompt: |-
          Assess issue resolution:
          - resolved: Customer's problem was completely addressed and confirmed
          - partially_resolved: Some progress made but issue not fully addressed
          - unresolved: No meaningful progress on customer's core issue
          Return exactly one label: resolved, partially_resolved, or unresolved — nothing else.

visualization:
  tabs:
    - id: "support_overview"
      title: "Support Quality Overview"
      plots:
        - id: "resolution_trend"
          kind: trend
          type: bar_time_series
          title: "Issue Resolution Trends"
          metrics:
            - id: "issue_resolution"
              features:
                - id: unit
                  aggregation: sum
                  percentage: true
```
### Sales Conversation Analysis

```yaml
analysis_types:
  - id: "SalesOutcome"
    order: 0
    description: "Analyze sales conversation outcomes"
    prompt_template: |
      # Task
      {task}
      # Chat history
      {chat_history}
      # Response JSON schema
      {formatting}
    prompt_parts:
      task: |
        Analyze this sales conversation for lead qualification and interest level.
    llm_metrics:
      - id: "lead_interest"
        name: "Lead Interest Level"
        kind: llm
        type: Literal['high_interest','medium_interest','low_interest','not_interested']
        description: "Customer's level of interest in the product/service"
        values:
          - id: "high_interest"
            name: "High Interest"
            color: "green"
          - id: "medium_interest"
            name: "Medium Interest"
            color: "blue"
          - id: "low_interest"
            name: "Low Interest"
            color: "yellow"
          - id: "not_interested"
            name: "Not Interested"
            color: "red"
        prompt: |-
          Assess customer interest level:
          - high_interest: Strong engagement, asks detailed questions, requests next steps
          - medium_interest: Shows curiosity but has concerns or needs more information
          - low_interest: Minimal engagement, generic responses, seems distracted
          - not_interested: Explicit disinterest or attempts to end conversation
          Return exactly one label: high_interest, medium_interest, low_interest, or not_interested — nothing else.
```
## Common Issues & Solutions

### YAML file fails to parse with indentation errors

**Problem:** Using 4 spaces or tabs instead of 2 spaces

```yaml
analysis_types:
    - id: "wrong"    # 4 spaces - WRONG
```

**Solution:** Always use exactly 2 spaces for indentation

```yaml
analysis_types:
  - id: "correct"    # 2 spaces - CORRECT
    order: 0         # 2 spaces - CORRECT
```
### LLM returns inconsistent or unexpected classifications

**Problem:** Abstract instructions without clear criteria

```yaml
prompt: "Analyze if the user was happy"
```

**Solution:** Provide specific criteria with examples and edge cases

```yaml
prompt: |-
  Classify user happiness:
  - happy: Explicit positive expressions or goal achievement
    Examples: "Thank you so much!", "Perfect!", successful completion
  - unhappy: Complaints, frustration, or unresolved issues
    Examples: "This doesn't work", "Frustrated", abandoning conversation
  - neutral: No clear emotional indicators
    Examples: "OK", factual questions, simple acknowledgments
  Return exactly one label: happy, unhappy, or neutral — nothing else.
```
### Visualization shows no data or missing metrics

**Problem:** Referencing non-existent metric IDs

```yaml
llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "sentiment"    # WRONG - doesn't match above
```

**Solution:** Ensure exact ID matching throughout the configuration

```yaml
llm_metrics:
  - id: "user_sentiment"

visualization:
  metrics:
    - id: "user_sentiment"    # CORRECT - exact match
```
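This kind of mismatch is easy to catch programmatically. A sketch that assumes the configuration has already been parsed into a dict (e.g. with a YAML library) and that key names follow the schema described above:

```python
# Cross-check metric ids between analysis_types and visualization.
def find_unknown_metric_ids(config: dict) -> set[str]:
    """Return visualization metric ids that no llm_metric defines."""
    defined = {
        metric["id"]
        for analysis in config.get("analysis_types", [])
        for metric in analysis.get("llm_metrics", [])
    }
    referenced = {
        m["id"]
        for tab in config.get("visualization", {}).get("tabs", [])
        for plot in tab.get("plots", [])
        for m in plot.get("metrics", [])
    }
    return referenced - defined

config = {
    "analysis_types": [{"llm_metrics": [{"id": "user_sentiment"}]}],
    "visualization": {
        "tabs": [{"plots": [{"metrics": [{"id": "sentiment"}]}]}]
    },
}
assert find_unknown_metric_ids(config) == {"sentiment"}  # mismatch caught
```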
## FAQ

### How many metrics can I include in one analysis type?

While there's no hard limit, keep analysis types focused on related metrics (3-5 metrics max). This improves performance and maintains logical grouping. Use multiple analysis types for different business areas.

### Can I use custom Python code in expressions?

Yes, code metrics support Python expressions with access to dialog data and built-in libraries. However, only standard Python modules are available; no external packages can be imported.
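As an illustration of the kind of expression a code metric might use: the variable exposed to code metrics is assumed here to be `dialog`, a list of message dicts with `"role"` and `"text"` keys, which may differ from the real system's naming:

```python
# Illustrative code-metric style computation; `dialog` and its field
# names are assumptions, not the system's documented interface.
dialog = [
    {"role": "user", "text": "My order is late"},
    {"role": "assistant", "text": "Let me check that for you."},
    {"role": "user", "text": "Thanks"},
]

# Count user messages using only built-in functionality.
user_message_count = sum(1 for m in dialog if m["role"] == "user")
assert user_message_count == 2
```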
### How do I handle edge cases where conversations have no clear classification?

Always include a "neutral" or "unknown" category in your `Literal` types. Define it as the default for ambiguous cases and provide clear criteria for when to use it.

### What's the difference between 'kind: trend' and 'kind: summary' plots?

- **Trend plots** show data over time; useful for tracking changes and patterns
- **Summary plots** show the current state distribution; useful for understanding overall composition
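Both kinds can sit side by side in one tab. For instance, assuming a `user_sentiment` metric defined in `llm_metrics` (ids and titles below are illustrative):

```yaml
visualization:
  tabs:
    - id: "sentiment_tab"
      title: "Sentiment"
      plots:
        - id: "sentiment_over_time"
          kind: trend
          type: bar_time_series
          title: "Sentiment Over Time"
          metrics:
            - id: "user_sentiment"
              features:
                - id: unit
                  aggregation: sum
        - id: "sentiment_breakdown"
          kind: summary
          type: detailed_bars
          title: "Current Sentiment Breakdown"
          metrics:
            - id: "user_sentiment"
              features:
                - id: unit
                  aggregation: sum
```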