Evaluating Datasets
Quick Summary
You can pull evaluation datasets created on Confident AI like how you would pull code from a GitHub repo by specifying the dataset alias
, which is unique dataset name within a project.
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="My Confident Dataset")
Once the dataset is pulled locally, you can start running evaluations using deepeval
's metrics you defined as demonstrated in the datasets section.
from deepeval.metrics import AnswerRelevancyMetric
...
dataset.evaluate(metrics=[AnswerRelevancyMetric()])
Here we're just showing the simplest example. In reality, you will likely have to generate actual_output
s before running evaluations as the datasets you store on Confident AI will likely not have pre-computed actual_output
s.
Pull Your Dataset From Confident AI
Pull datasets from Confident by specifying its alias
:
from deepeval.dataset import EvaluationDataset
# Initialize empty dataset object
dataset = EvaluationDataset()
# Pull from Confident
dataset.pull(alias="My Confident Dataset")
An EvaluationDataset
accepts one mandatory and one optional argument:
alias
: the alias of your dataset on Confident. A dataset alias is unique for each project.- [Optional]
auto_convert_goldens_to_test_cases
: Defaulted toTrue
. When set toTrue
,dataset.pull()
will automatically convert all goldens that were fetched from Confident into test cases and override all test cases you currently have in yourEvaluationDataset
instance.
Essentially, auto_convert_goldens_to_test_cases
is convenient if you have a complete, pre-computed dataset on Confident ready for evaluation. However, this might not always be the case. To disable the automatic conversion of goldens to test cases within your dataset, set auto_convert_goldens_to_test_cases
to False
. You might find this useful if you:
- have goldens in your
EvaluationDataset
that are missing essential components, such as theactual_output
, to be converted to test cases for evaluation. This may be the case if you're looking to generateactual_output
s at evaluation time. Remember, a golden does not require anactual_output
, but a test case does. - have extra data preprocessing to do before converting goldens to test cases. Remember, goldens have an extra
additional_metadata
field, which is a dictionary that contains additional metadata for you to generate custom outputs.
Here is an example of how you can use goldens as an intermediary variable to generate an `actual_output before converting them to test cases for evaluation:
# Pull from Confident
dataset.pull(
alias="My Confident Dataset",
# Don't convert goldens to test cases yet
auto_convert_goldens_to_test_cases=False
)
Then, process the goldens and convert them into test cases:
# A hypothetical LLM application example
from chatbot import query
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...
def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
test_cases = []
for golden in goldens:
test_case = LLMTestCase(
input=golden.input,
# Generate actual output using the 'input' and 'additional_metadata'
actual_output=query(golden.input, golden.additional_metadata),
expected_output=golden.expected_output,
context=golden.context,
)
test_cases.append(test_case)
return test_cases
# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)
Finally, define metric(s) like in previous examples in this documentation and evaluate:
from deepeval.metrics import HallucinationMetric
...
metric = HallucinationMetric()
dataset.evaluate([metric])
Evaluate Your Dataset
You can start running evaluations as usual once you have your dataset pulled from Confident AI. Remember, a dataset is simply a list of test cases, so what you previously learned on evaluating test cases still applies.
The term "evaluations" and "test run" means the same and is often used interchangebly throughout this documentation.
With Pytest (highly recommended)
You can view a complete example here.
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
# Initialize empty dataset object
dataset = EvaluationDataset()
# Pull from Confident
dataset.pull(alias="My Confident Dataset")
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(threshold=0.3)
assert_test(test_case, [hallucination_metric])
Don't forget to run deepeval test run
in the CLI:
deepeval test run test_example.py
Without Pytest
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.dataset import EvaluationDataset
hallucination_metric = HallucinationMetric(threshold=0.3)
# Initialize empty dataset object and pull from Confident
dataset = EvaluationDataset()
dataset.pull(alias="My Confident Dataset")
dataset.evaluate([hallucination_metric])
# You can also call the evaluate() function directly
evaluate(dataset, [hallucination_metric, answer_relevancy_metric])
You can learn more about the evaluate()
method here.