G-Eval
G-Eval is a framework that uses LLMs with chain-of-thoughts (CoT) to evaluate LLM outputs based on ANY custom criteria. The G-Eval metric is the most versatile type of metric deepeval
has to offer, and is capable of evaluating almost any use case with human-like accuracy.
You can run real-time evaluations in production on metrics such as GEval
using Confident AI.
Required Arguments
To use the GEval
, you'll have to provide the following arguments when creating an LLMTestCase
:
input
actual_output
You'll also need to supply any additional arguments such as expected_output
and context
if your evaluation criteria depends on these parameters.
Example
To create a custom metric that uses LLMs for evaluation, simply instantiate an GEval
class and define an evaluation criteria in everyday language:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
correctness_metric = GEval(
name="Correctness",
criteria="Determine whether the actual output is factually correct based on the expected output.",
# NOTE: you can only provide either criteria or evaluation_steps, and not both
evaluation_steps=[
"Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
"You should also heavily penalize omission of detail",
"Vague language, or contradicting OPINIONS, are OK"
],
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
There are three mandatory and six optional parameters required when instantiating an GEval
class:
name
: name of metriccriteria
: a description outlining the specific evaluation aspects for each test case.evaluation_params
: a list of typeLLMTestCaseParams
. Include only the parameters that are relevant for evaluation.- [Optional]
evaluation_steps
: a list of strings outlining the exact steps the LLM should take for evaluation. Ifevaluation_steps
is not provided,GEval
will generate a series ofevaluation_steps
on your behalf based on the providedcriteria
. You can only provide eitherevaluation_steps
ORcriteria
, and not both. - [Optional]
threshold
: the passing threshold, defaulted to 0.5. - [Optional]
model
: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of typeDeepEvalBaseLLM
. Defaulted to 'gpt-4o'. - [Optional]
strict_mode
: a boolean which when set toTrue
, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted toFalse
. - [Optional]
async_mode
: a boolean which when set toTrue
, enables concurrent execution within themeasure()
method. Defaulted toTrue
. - [Optional]
verbose_mode
: a boolean which when set toTrue
, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted toFalse
.
For accurate and valid results, only the parameters that are mentioned in criteria
should be included as a member of evaluation_params
.
As mentioned in the metrics introduction section, all of deepeval
's metrics return a score ranging from 0 - 1, and a metric is only successful if the evaluation score is equal to or greater than threshold
, and GEval
is no exception. You can access the score
and reason
for each individual GEval
metric:
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(
input="The dog chased the cat up the tree, who ran up the tree?",
actual_output="It depends, some might consider the cat, while others might argue the dog.",
expected_output="The cat."
)
correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)
How Is It Calculated?
G-Eval is a two-step algorithm that first generates a series of evaluation_steps
using chain of thoughts (CoTs) based on the given criteria
, before using the generated steps to determine the final score using the parameters presented in an LLMTestCase
.
When you provide evaluation_steps
, the GEval
metric skips the first step and uses the provided steps to determine the final score instead.
In the original G-Eval paper, the authors used the the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation.
This step was introduced in the paper because it minimizes bias in LLM scoring. This normalization step is automatically handled by deepeval
by default (unless you're using a custom model).