Scripted Evaluation Framework for Large Language Models: A Controlled Approach to Comparative Analysis

Jan 9, 2025

Abstract

We present a novel framework for evaluating Large Language Models (LLMs) through controlled scripted interactions inspired by the recently published paper Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models.

Our approach addresses the limitations of traditional model-to-model conversational evaluations by implementing structured scenarios with predefined interaction patterns. We demonstrate the effectiveness of this methodology through three distinct evaluation scenarios: restaurant service, technical interviews, and sales interactions.

Our results indicate significant variations in performance between different LLM providers, particularly in areas of context awareness, instruction following, and complex function calling.

1. Introduction

Evaluating Large Language Models (LLMs) for conversational settings poses challenges in ensuring consistency and comparability across models and test runs. Traditional benchmarks like AlpacaEval, CommonEval, and OpenBookQA focus on single-shot Q&A tasks with predetermined answers, which fail to reflect the dynamic nature of modern conversational agents. Although newer benchmarks like MT-bench aim to address these shortcomings, they still do not fully capture the complexities of real-world conversational use cases.

Model-to-model interaction benchmarks take a more dynamic approach by having the benchmarked LLM interact with another LLM simulating specific roles (e.g., a confused user or an angry client). While this method better represents conversational adaptability, it suffers from inconsistencies as the simulated user often deviates from its initial prompt. This unpredictability complicates performance evaluation and makes it difficult to attribute outcomes solely to the benchmarked LLM.

The paper Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models introduced 11 tasks to address these issues, such as prompting an LLM to remember the user’s favorite color or annotate lists of names. While these tasks bridge the gap between benchmarks and real-world applications, they still fall short of reflecting how LLMs are used in voice conversational agents.

Among these tasks, the "restaurant task" is the most representative of current use cases. Here, the LLM plays the role of a diner ordering from a menu and later addresses receiving the wrong order, closely mirroring real-world interactions, especially in customer service.

Building on this concept, we propose a scripted evaluation framework to ensure controlled and consistent conversational flow. By using predefined scenarios with predictable inputs and responses, this framework isolates the LLM's performance in a structured environment, while still allowing for dynamic responses and scenarios that map directly onto how conversational agents are used today.

2. Methodology

2.1 Scripted Interaction Framework

Our evaluation framework employs scripted testing to ensure controlled, consistent conversational interactions with LLMs. By scripting, we dictate the behavior of one conversational party (referred to as Coval) while testing the evaluated agent (e.g., Gemini or GPT). This approach enables fine-grained control over the conversation flow and isolates the evaluated LLM's performance within structured scenarios. Scripts can be both static and dynamic, leveraging memory systems or external functions for increased flexibility.

Script Structure and Execution

Scripts consist of predefined prompts and responses that Coval follows during the interaction. The structure includes memory updates, logic-based decision-making, and external function calls where necessary. Below are key examples to illustrate:

Static Scripting

A straightforward script with predefined inputs and outputs.

Example Script: Ordering a Drink

Coval: Hello, what will you have for a drink? 
LLM: (response) 
Script memory update: ordered_drink = (parsed LLM response) 
Coval: Here you go, a {ordered_drink}

Example Run: Ordering a Drink

Coval: Hello, what will you have for a drink? 
LLM: A Martini would be nice, thanks 
Coval: Here you go, a Martini
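
For concreteness, the static pattern can be sketched in a few lines of Python. The llm_respond callback and parse_drink helper below are illustrative placeholders rather than Coval's actual implementation.

def parse_drink(reply):
    # Naive placeholder parser: return the first known drink mentioned in the reply.
    for drink in ("Martini", "Coke", "Water"):
        if drink.lower() in reply.lower():
            return drink
    return reply.strip()

def run_static_drink_script(llm_respond):
    memory = {}
    # Coval's scripted opening line; the evaluated LLM replies freely.
    llm_reply = llm_respond("Hello, what will you have for a drink?")
    # Script memory update from the parsed LLM response.
    memory["ordered_drink"] = parse_drink(llm_reply)
    # Coval's next scripted line is filled in from memory.
    return f"Here you go, a {memory['ordered_drink']}"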

Dynamic Scripting with Memory Updates

Scripts can include dynamic elements, such as intentional errors, based on memory.

Example Script: Delivering the wrong Drink

Coval: Hello, what will you have for a drink?
LLM: (response)
Script memory update: ordered_drink = (parsed LLM response)
Script calls create_wrong_drink: wrong_drink = coke if ordered_drink == martini
Coval: Here you go, a {wrong_drink}

Example Run: Delivering the wrong Drink

Coval: Hello, what will you have for a drink?  
LLM: A martini, please.  
Coval: Here you go, a coke
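
A minimal sketch of the same step with an intentional error, assuming a create_wrong_drink helper along the lines described above (again illustrative, not Coval's real code):

def create_wrong_drink(ordered_drink):
    # Deliberately swap the order so we can test whether the LLM notices.
    return "coke" if ordered_drink.lower() == "martini" else "martini"

def run_wrong_drink_script(llm_respond):
    memory = {}
    llm_reply = llm_respond("Hello, what will you have for a drink?")
    # Reuses parse_drink from the static sketch above.
    memory["ordered_drink"] = parse_drink(llm_reply)
    # External function call injects the intentional error into the script.
    memory["wrong_drink"] = create_wrong_drink(memory["ordered_drink"])
    return f"Here you go, a {memory['wrong_drink']}"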

Unscripted Flexibility with AI

In some cases, the LLM's unpredictable responses require an AI-generated reply to keep the conversation flowing. A script consisting only of AI responses reduces to the traditional model-to-model evaluation pattern.

Example Script: Asking a Random Question

Coval: Ask me a random question.  
LLM: (evaluated response)  
Script calls answer = call_chat_gpt_to_reply(message_history)  
Coval: {answer}

Example Run: Asking a Random Question

Coval: Ask me a random question.  
LLM: What is the meaning of life  
Coval: 42.
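
A sketch of this unscripted step, where Coval's own reply comes from a model call over the running message history. Here call_chat_gpt_to_reply is passed in as a placeholder for whatever chat-completion client is used; it is not a specific API.

def run_random_question_script(llm_respond, call_chat_gpt_to_reply):
    message_history = []
    coval_line = "Ask me a random question."
    message_history.append({"role": "assistant", "content": coval_line})
    # The evaluated LLM asks something we cannot predict in advance.
    llm_reply = llm_respond(coval_line)
    message_history.append({"role": "user", "content": llm_reply})
    # Hand the open-ended turn to another model so the conversation keeps flowing.
    return call_chat_gpt_to_reply(message_history)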

2.2 Script Design Principles

Scripts are designed so that deviations from expected behavior can be reliably attributed to the evaluated LLM. Four principles guide their design:

  1. No Ambiguity: Each prompt admits only one correct interpretation. For example, “Ask questions A and B to an interviewee” is ambiguous because the LLM could ask question B before A; a better scenario is “Ask questions A and B to an interviewee, in that order.”

  2. Predictable Answers: The expected response from the LLM is straightforward to evaluate.

  3. Strict Adherence to Script: Going off script is either impermissible or easily recoverable without distorting results.

  4. Diverse Evaluations: Each scenario tests different aspects of the LLM.

By using this methodology, we ensure that performance differences among LLMs are attributable to their capabilities rather than inconsistencies in the testing process.

3. Evaluation Scenarios

We implemented three distinct evaluation scenarios:

3.1 Restaurant Service Scenario (5 points)

The Restaurant Service Scenario, adapted from Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models, evaluates the LLM's ability to handle a multi-turn conversation with dynamic elements and unexpected challenges. In this scenario, the LLM is presented with a menu and is guided through the following sequence of interactions:

  1. Drink Order: The LLM is asked to select a preferred drink from the menu (1 point).

  2. Main Course Selection: After receiving its drink, the LLM is prompted to choose a main course (1 point).

  3. Second Main Course Selection: The initial choice is unavailable, so the LLM must choose an alternative (1 point).

  4. Incorrect Order Handling: The LLM is brought an incorrect main course and is expected to recognize the discrepancy and react appropriately (1 point).

  5. Drink Refill: At the end of the interaction, the LLM is offered a refill; the point is awarded only if it correctly recalls the drink it originally ordered (1 point).
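
As a rough sketch of how a single run might be scored, each checkpoint contributes one point. The checkpoint names below are illustrative, not the framework's actual identifiers.

RESTAURANT_CHECKPOINTS = [
    "drink_ordered",
    "main_course_ordered",
    "alternative_chosen",
    "wrong_order_flagged",
    "original_drink_recalled",
]

def score_restaurant_run(step_results):
    # One point per checkpoint the evaluated LLM passed in this run.
    return sum(1 for name in RESTAURANT_CHECKPOINTS if step_results.get(name, False))

# Example: a run where the model missed the incorrect-order step scores 4/5.
score_restaurant_run({
    "drink_ordered": True,
    "main_course_ordered": True,
    "alternative_chosen": True,
    "wrong_order_flagged": False,
    "original_drink_recalled": True,
})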

Evaluation Dimensions:

  • Context awareness

  • Short-term memory

  • Resolution of conflicting information

Results (9 runs):

  • GPT (text): 45/45

  • GPT (voice): 45/45

  • Gemini (text): 43/45

Notable observation: Gemini failed to recognize incorrect orders in 7/9 test runs.

3.2 Technical Interview Scenario (9 points)

In the technical interview scenario, the LLM role-plays the head recruiter at a tech company hiring for a full-stack position. It is given a long initial prompt, and the conversation spans over seven minutes. The script proceeds as follows:

  1. Company Introduction: The LLM is prompted to provide an introduction about the company (1 point)

  2. Questioning Sequence: The LLM is instructed to ask four specific questions to the simulated candidate in a predefined order. 1 point for each correctly ordered question (4 points total).

  3. Follow-Up Questions: The LLM must ask a follow-up question whenever the candidate provides an answer exceeding 60 words. This happens twice during the interaction (2 points total).

  4. Candidate Inquiry: The simulated candidate asks for information about the company that was provided earlier in the conversation. The LLM must accurately recall and respond (1 point)

  5. Candidate Evaluation: At the end of the interview, the LLM is tasked with determining if the candidate is a good fit for the company. The LLM must factor in the candidate's stated discomfort with Java, a required technical skill mentioned in the initial prompt. (1 point)
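
The follow-up rule in step 3 is easy to express as a script-side check. The sketch below assumes the transcript is available as (candidate_answer, llm_reply) pairs and uses a crude question-mark heuristic to detect a follow-up; the real evaluation may judge this differently.

def needs_follow_up(candidate_answer, word_limit=60):
    # The script flags any candidate answer longer than 60 words.
    return len(candidate_answer.split()) > word_limit

def score_follow_ups(turns):
    # Award one point per long answer that the LLM actually followed up on.
    score = 0
    for candidate_answer, llm_reply in turns:
        if needs_follow_up(candidate_answer) and "?" in llm_reply:
            score += 1
    return score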

Evaluation Dimensions:

  • Long-context instruction following

  • Needle-in-haystack

  • Context-awareness

Results:

  • GPT (text): 65/81

  • GPT (voice): 56/81

  • Gemini (text): 65/81

Key findings:

  • Only one simulation (Gemini) successfully identified missing Java experience

  • Gemini showed consistent first follow-up question generation but failed second follow-ups in 89% of cases

  • GPT text achieved complete follow-up success in 44% of runs

3.3 Sales Interaction Scenario (5 points)

In the sales scenario, the LLM tries to upsell insurance to an interested customer who does not have time to talk right now. The LLM is provided with six functions (add_number_to_blacklist, end_conversation, add_name, add_surname, register_number, end_call_reason). The interaction proceeds as follows:

  1. Surname Identification: The customer provides their surname first, and the LLM must call the function add_surname correctly (1 point)

  2. Name and Phone Number Registration: The customer simultaneously provides their name and phone number. The LLM is expected to:

    1. Call add_name with the correct argument for the customer's name (1 point).

    2. Call register_number with the correct argument for the phone number (1 point).

  3. End Call Reason: The LLM must call end_call_reason with the argument "call back later" (1 point).

  4. End Call: The LLM must appropriately call end_conversation to conclude the interaction (1 point)
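
Because this scenario is scored purely on function calls, the evaluator only needs the sequence of calls the LLM emitted. Below is a hedged sketch, assuming each call is recorded as a (function_name, arguments) pair with the argument keys shown; these keys are assumptions, not the framework's actual schema.

def score_sales_run(tool_calls, expected_surname, expected_name, expected_number):
    calls = {name: args for name, args in tool_calls}  # last call per function wins
    score = 0
    if calls.get("add_surname", {}).get("surname") == expected_surname:
        score += 1
    if calls.get("add_name", {}).get("name") == expected_name:
        score += 1
    if calls.get("register_number", {}).get("number") == expected_number:
        score += 1
    if calls.get("end_call_reason", {}).get("reason") == "call back later":
        score += 1
    if "end_conversation" in calls:
        score += 1
    return score  # out of 5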

Evaluation Criteria:

  • Multiple and composite function calling 

Results:

  • GPT (text): 21/35

  • GPT (voice): 15/35

  • Gemini (text): 12/35

4. Results Analysis

4.1 Model Performance Comparison

Our evaluation reveals distinct performance patterns across models:

  1. Context Awareness:

  • Gemini demonstrated consistency in following instructions but struggled with context-dependent tasks

  • GPT models showed superior performance in resolution of conflicting information.

  2. Instruction Following:

  • Gemini exhibited more consistent adherence to initial instructions

  • GPT models showed higher variability in instruction compliance

  3. Task Completion:

  • GPT text consistently outperformed other implementations in complex task scenarios

  • Voice implementations showed decreased performance compared to text counterparts in almost all tests.

  4. Function Calling:

  • Both GPT voice and GPT text showed stronger function calling than Gemini.

5. Discussion

The results demonstrate the effectiveness of scripted evaluation in highlighting specific strengths and weaknesses of different LLM implementations. The framework successfully identified consistent patterns in model behavior, particularly in areas of context awareness and instruction following.

Key findings include:

  • Superior performance of text-based implementations over voice

  • Consistent instruction following in Gemini despite context awareness limitations

  • Variable performance in complex task scenarios across all models

6. Conclusion

Our scripted evaluation framework provides a structured approach to LLM assessment, enabling direct comparison between models and implementations. The results suggest that while current LLMs show promising capabilities in structured interactions, significant variations exist in their handling of context-dependent tasks and complex instructions.

Appendix A: Sample Script Implementation

class ScriptedEvaluation:
    """Base harness for running a scripted scenario against an evaluated LLM."""

    def __init__(self):
        self.memory = {}  # script memory, e.g. {"ordered_drink": "Martini"}
        self.score = 0    # points accumulated across scenario checkpoints

    def run_restaurant_scenario(self):
        # Walks the restaurant script turn by turn, updating memory and
        # awarding points at each checkpoint. Scenario details omitted.
        pass

    def update_memory(self, key, value):
        # Store information parsed from the LLM's reply for later script turns.
        self.memory[key] = value

    def evaluate_response(self, response, expected):
        # Compare the LLM's reply against the expected behavior for this turn
        # and award a point on a match. Evaluation logic omitted.
        pass

This benchmarking was performed by Brooke Hopkins (CEO @ Coval) & Juan Guevara (Software Engineer @ Coval)

© 2025 – Datawave Inc.