
LLMs and GenAI now assist professionals across a growing range of workflows in many settings: large companies, financial institutions, academic research, and even high-stakes industries such as healthcare and law. The outputs that LLMs produce within these workflows influence decisions with real consequences. Standards for evaluating the performance of these AI systems and workflows, however, have not kept pace with the speed of real-world deployment.
Generative LLM tasks typically don't have a single, absolute 'correct' answer. Standard metrics such as accuracy, F1 score (the harmonic mean of precision and recall), and BLEU (an n-gram overlap measure originally designed for machine translation) are often applied to LLM outputs, but their usefulness is limited. F1 scores, for instance, help evaluate how well an LLM performs on classification tasks such as spam detection or sentiment analysis, but say nothing about the quality of the LLM/AI system's reasoning, contextual relevance, clarity of writing, or instruction following. High scores on these standard metrics can create a false sense of security about the LLM's performance, which is dangerous for critical business workflows, where a more nuanced and thorough approach is needed to judge whether an LLM/GenAI workflow is actually performing well.
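As a minimal illustration of the point, the short Python sketch below computes precision, recall, and F1 for a toy spam classifier; the labels are made up for demonstration. A respectable F1 here tells you nothing about reasoning quality, tone, or whether a domain expert would trust the output.

```python
# Toy illustration: precision, recall, and F1 for a binary spam classifier.
# The labels below are invented for demonstration; 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# A decent F1 says nothing about whether the model's reasoning or
# explanations would satisfy a domain expert.
```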
Evals (short for evaluations) help you assess how well an LLM or AI system's output aligns with what the task or the user actually needs. You need evals to measure how effectively your LLM is performing in the context of the workflow it is embedded in and the task it is supposed to perform. Evals go beyond correctness or factual accuracy to other dimensions of LLM/GenAI workflow performance such as usefulness, clarity, instruction following, reliability, actionability, and ethical alignment (for a deeper dive, see AI ethics and LLMs).
Because different domains prioritize different kinds of quality, evals become the subjective anchor of quality, helping answer the question 'Would a domain expert trust and use this output?' They also give businesses a way to capture the complexity and nuances of the specific business processes in which these LLM- and GenAI-powered workflows are embedded, and to measure whether those workflows are doing what they are supposed to do.
Given the high stakes of GenAI-powered workflows and the investment going into setting them up, it is incumbent upon businesses, and the roles responsible for these systems, to create robust evaluations so they can get meaningful signals about the performance of their many different agents and AI systems at scale.
The place to start when designing evals is to think about the dimensions relevant to evaluating the LLM's performance within the scenario or business process it is plugged into.
Let's take a use case where many of us have already benefited from LLMs and GenAI: a meeting transcription and summarization service, one of the most common enterprise applications. Some of the dimensions we would want to evaluate its performance on include:
- Faithfulness: does the summary reflect what was actually said, without hallucinated statements?
- Completeness: are the key decisions, topics, and action items all captured?
- Attribution: are statements and commitments attributed to the right participants?
- Clarity and concision: can someone who missed the meeting understand the summary quickly?
- Instruction following: does the output respect the requested format, length, and tone?
To assess whether the GenAI-powered meeting summary is meeting the needs of its consumers, we can go a step deeper with our evaluations. Employees who consume meeting summaries are often looking for a clear record of who owns which action items based on the meeting discussions. So if we needed to evaluate the quality of the action items captured in the summary, we would want to measure (a simple sketch of such checks follows this list):
- Coverage: is every action item agreed on in the meeting present in the summary?
- Ownership: does each action item name a specific, correct owner?
- Precision: are there any action items in the summary that were never actually discussed?
- Actionability: is each item specific enough (what, who, and by when) for the owner to act on it?
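As an illustration of how such checks can be turned into automated evals, here is a minimal Python sketch. It assumes a hypothetical format in which each action item is a dict with "task" and "owner" fields, compared against a human-labeled reference list for the same meeting; the field names and exact-match logic are illustrative choices, not a prescribed schema.

```python
# Minimal sketch of action-item evals for a meeting summary.
# Assumes a hypothetical schema: each item is {"task": str, "owner": str}.
# Reference items come from a human-labeled "gold" list for the same meeting.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def action_item_scores(summary_items, reference_items):
    """Return coverage, precision, and ownership accuracy for extracted action items."""
    ref_tasks = {normalize(item["task"]): item["owner"] for item in reference_items}
    sum_tasks = {normalize(item["task"]): item["owner"] for item in summary_items}

    matched = set(ref_tasks) & set(sum_tasks)
    coverage = len(matched) / len(ref_tasks) if ref_tasks else 1.0    # did we capture every agreed item?
    precision = len(matched) / len(sum_tasks) if sum_tasks else 1.0   # did we invent items that never came up?
    owners_ok = sum(1 for t in matched if normalize(ref_tasks[t]) == normalize(sum_tasks[t]))
    ownership = owners_ok / len(matched) if matched else 1.0          # is each item assigned to the right person?

    return {"coverage": coverage, "precision": precision, "ownership": ownership}

reference = [{"task": "Send revised budget to finance", "owner": "Priya"},
             {"task": "Schedule customer follow-up call", "owner": "Marco"}]
extracted = [{"task": "Send revised budget to finance", "owner": "Priya"},
             {"task": "Draft Q3 roadmap", "owner": "Marco"}]

print(action_item_scores(extracted, reference))
# {'coverage': 0.5, 'precision': 0.5, 'ownership': 1.0}
```

In practice, exact string matching is too brittle; teams typically use fuzzy matching, embedding similarity, or an LLM judge to decide whether two action items refer to the same commitment.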
The core idea is to think carefully about the different dimensions a task's output needs to be evaluated on in order to judge whether the system is performing well. A correct output is not always the right output for your use case, workflow, or process. For inspiration, Stanford's benchmarking effort, the Holistic Evaluation of Language Models (HELM), has shown how multidimensional scoring surfaces weaknesses that are invisible to single-number metrics.
If you're interested in instrumenting evaluations in your own work setting, there are three broad ways to do it:
- Human evaluation: domain experts review a sample of outputs against a rubric; this is the gold standard for subjective quality but is slow and expensive to scale.
- Code-based (programmatic) evaluation: deterministic checks and metrics, such as format validation, reference matching, and scores like F1, run automatically on every output.
- LLM-as-a-judge: a separate LLM scores outputs against a written rubric, trading some reliability for the ability to evaluate subjective dimensions at scale (a minimal sketch of this approach follows this list).
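Here is a minimal sketch of the LLM-as-a-judge approach. The call_llm function is a placeholder for whichever model client you use (it should take a prompt string and return the model's text response), and the rubric and 1-5 scale are illustrative choices rather than a standard.

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a placeholder for whatever
# model client you use; the rubric and 1-5 scale below are illustrative.
import json

JUDGE_PROMPT = """You are evaluating a meeting summary.
Score each dimension from 1 (poor) to 5 (excellent) and reply with JSON only,
e.g. {{"faithfulness": 4, "completeness": 3, "actionability": 5}}.

Dimensions:
- faithfulness: the summary contains no claims absent from the transcript
- completeness: key decisions and action items are all captured
- actionability: action items name an owner and a concrete next step

Transcript:
{transcript}

Summary under evaluation:
{summary}
"""

def judge_summary(transcript: str, summary: str, call_llm) -> dict:
    """Ask a judge model to score a summary; returns a dict of dimension scores."""
    response = call_llm(JUDGE_PROMPT.format(transcript=transcript, summary=summary))
    return json.loads(response)  # in production, validate and retry on malformed JSON

# Example usage with a stubbed judge, so the sketch runs without any API key:
fake_judge = lambda prompt: '{"faithfulness": 4, "completeness": 3, "actionability": 5}'
print(judge_summary("...meeting transcript...", "...candidate summary...", fake_judge))
```

Whichever judge model you use, it is worth periodically spot-checking its scores against human ratings, since an LLM judge inherits the blind spots of the model behind it.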
Evals offer a framework for assessing LLM outputs against user and task requirements that extends beyond traditional metrics. By focusing on subjective quality and domain-specific nuances, evals are crucial for ensuring that high-stakes GenAI applications perform as intended. Implementing robust evaluations is vital for businesses that want meaningful insight into AI system performance and informed decisions in critical workflows.
Rajat is a Senior Staff Product Manager for Analytics and AI at ServiceNow where he leads the building of AI & ML products that provide insights and Sales-ready information for 1000+ strong Sales and Product teams and leaders. Rajat is a prolific contributor to the Business Technology, SaaS and IT communities, as a judge, speaker and writer.
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE's position, nor that of the Computer Society, nor its Leadership.