Introduction
Quick Summary
In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on specific criteria of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:
- G-Eval
- Summarization
- Faithfulness
- Answer Relevancy
- Contextual Relevancy
- Contextual Precision
- Contextual Recall
- Ragas
- Hallucination
- Toxicity
- Bias
deepeval also offers conversational metrics, which are metrics used to evaluate conversations instead of individual, granular LLM interactions. These include:
- Conversation Completeness
- Conversation Relevancy
- Knowledge Retention
You can also easily develop your own custom evaluation metrics in deepeval. All metrics are measured on a test case. Visit the test cases section to learn how to apply any metric on test cases for evaluation.
Types of Metrics
A custom metric is a type of metric you can easily create by implementing abstract methods and properties of base classes provided by deepeval. Custom metrics are extremely versatile and integrate with Confident AI without requiring any additional setup. As you'll see later, a custom metric can be either an LLM-Eval (LLM-evaluated) or a classic metric. A classic metric is a type of metric whose criteria aren't evaluated using an LLM.
deepeval also offers default metrics, which can be either conversational or non-conversational. Non-conversational metrics are used to evaluate LLMTestCases, while conversational metrics can be used to evaluate ConversationalTestCases.
Since deepeval has far more non-conversational metrics than conversational ones, you should assume the term 'metrics' refers to non-conversational metrics.
Almost all default metrics offered by deepeval are LLM-Evals, which means they are evaluated using LLMs. This is deliberate because LLM-Evals are versatile in nature and better align with human expectations when compared to traditional model-based approaches.
deepeval's LLM-Evals are a step up from other implementations because they:
- are extra reliable as LLMs are only used for extremely specific tasks during evaluation to greatly reduce stochasticity and flakiness in scores.
- provide a comprehensive reason for the scores computed.
- can be computed using any LLM.
All of deepeval's default metrics output a score between 0 and 1, and require a threshold argument to instantiate. A default metric is only successful if the evaluation score is equal to or greater than threshold.
All GPT models from OpenAI are available for LLM-Evals (metrics that use LLMs for evaluation). You can switch between models by providing a string corresponding to OpenAI's model names via the optional model argument when instantiating an LLM-Eval.
Using OpenAI
To use OpenAI for deepeval's LLM-Evals (metrics evaluated using an LLM), supply your OPENAI_API_KEY in the CLI:
export OPENAI_API_KEY=<your-openai-api-key>
Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:
%env OPENAI_API_KEY=<your-openai-api-key>
Please do not include quotation marks when setting your OPENAI_API_KEY if you're working in a notebook environment.
Using Azure OpenAI
deepeval also allows you to use Azure OpenAI for metrics that are evaluated using an LLM. Run the following command in the CLI to configure your deepeval environment to use Azure OpenAI for all LLM-based metrics.
deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>
Note that the model-version is optional. If you ever wish to stop using Azure OpenAI and move back to regular OpenAI, simply run:
deepeval unset-azure-openai
Using A Custom LLM
deepeval allows you to use ANY custom LLM for evaluation. This includes LLMs from langchain's chat_model module, Hugging Face's transformers library, or even LLMs in GGML format.
We CANNOT guarantee that evaluations will work as expected when using a custom model. This is because evaluation requires high levels of reasoning and the ability to follow instructions such as outputting responses in valid JSON format. To help custom LLMs output valid JSON, read this guide.
Alternatively, if you find yourself running into JSON errors and would like to ignore them, use the -c and -i flags during deepeval test run:
deepeval test run test_example.py -i -c
The -i flag ignores errors, while the -c flag uses the local deepeval cache, so after a partially successful test run you don't have to rerun the test cases that didn't error.
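As a rough illustration of the JSON issue described above, a custom model's generate() method could post-process its raw text before returning it. The sketch below is only one possible approach, not deepeval's own mechanism: the helper name extract_json is hypothetical, and it simply tries to parse whatever sits between the first "{" and the last "}".

```python
import json
from typing import Any, Optional


def extract_json(raw_output: str) -> Optional[Any]:
    """Parse the text between the first '{' and the last '}' in raw_output.

    Hypothetical helper: returns the parsed value, or None if that span
    is missing or is not valid JSON.
    """
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_output[start : end + 1])
    except json.JSONDecodeError:
        return None


# Models often wrap JSON in extra chatter; the helper strips it away.
parsed = extract_json('Sure, here you go: {"verdict": "yes"} Hope that helps!')
missing = extract_json("I could not produce JSON.")
print(parsed, missing)
```

A None result is a signal to retry the generation or surface the error, rather than pass malformed text on to a metric.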
Azure OpenAI Example
Here is an example of creating a custom Azure OpenAI model through langchain's AzureChatOpenAI module for evaluation:
from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = AzureChatOpenAI(
    openai_api_version=openai_api_version,
    azure_deployment=azure_deployment,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)
azure_openai = AzureOpenAI(model=custom_model)
print(azure_openai.generate("Write me a joke"))
When creating a custom LLM evaluation model you should ALWAYS:
- inherit DeepEvalBaseLLM.
- implement the get_model_name() method, which simply returns a string representing your custom model name.
- implement the load_model() method, which will be responsible for returning a model object.
- implement the generate() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
- the generate() method should return the final output string of your custom LLM. Note that we called chat_model.invoke(prompt).content to access the model generations in this particular example, but this could be different depending on the implementation of your custom model object.
- implement the a_generate() method, with the same function signature as generate(). Note that this is an async method. In this example, we called await chat_model.ainvoke(prompt), which is an asynchronous wrapper provided by LangChain's chat models.
The a_generate() method is what deepeval uses to generate LLM outputs when you execute metrics / run evaluations asynchronously.
If your custom model object does not have an asynchronous interface, simply reuse the same code from generate() (scroll down to the Mistral7B example for more details). However, this makes a_generate() a blocking call, regardless of whether you've turned on async_mode for a metric.
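To make the cost of a blocking a_generate() concrete, here is a small standard-library sketch. Everything in it is a stand-in rather than deepeval code: generate() fakes inference with time.sleep, a_generate_blocking() reuses it as described above, and a_generate_native() imitates a client with a real async interface.

```python
import asyncio
import time


def generate(prompt: str) -> str:
    """Hypothetical synchronous model call; time.sleep stands in for inference."""
    time.sleep(0.2)
    return f"output for {prompt}"


async def a_generate_blocking(prompt: str) -> str:
    # Reuses the synchronous code path, so the event loop is blocked while it runs.
    return generate(prompt)


async def a_generate_native(prompt: str) -> str:
    # Stand-in for a model client that exposes a real async interface.
    await asyncio.sleep(0.2)
    return f"output for {prompt}"


async def timed(a_generate) -> float:
    """Run three generations through asyncio.gather and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(a_generate("a"), a_generate("b"), a_generate("c"))
    return time.perf_counter() - start


blocking_seconds = asyncio.run(timed(a_generate_blocking))
native_seconds = asyncio.run(timed(a_generate_native))
print(f"blocking: {blocking_seconds:.2f}s, native async: {native_seconds:.2f}s")
```

The blocking variant runs the three calls back to back even though they were gathered, while the native variant overlaps them.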
Lastly, to use it for evaluation for an LLM-Eval:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=azure_openai)
While the Azure OpenAI command configures deepeval to use Azure OpenAI globally for all LLM-Evals, a custom LLM has to be set each time you instantiate a metric. Remember to provide your custom LLM instance through the model parameter for metrics you wish to use it for.
Mistral 7B Example
Here is an example of creating a custom Mistral 7B model through Hugging Face's transformers library for evaluation:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda" # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b.generate("Write me a joke"))
Note that for this particular implementation, we initialized our Mistral7B model with an additional tokenizer parameter, as this is required in the decoding step of the generate() method.
You'll notice we simply reused generate() in a_generate(), because unfortunately there's no asynchronous interface for Hugging Face's transformers library. This makes all metric executions a synchronous, blocking process.
However, you can try offloading the generation process to a separate thread instead:
import asyncio

class Mistral7B(DeepEvalBaseLLM):
    # ... (existing code) ...

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)
Some additional considerations and reasons why you should be extra careful with this implementation:
- Running the generation in a separate thread may not fully utilize GPU resources if the model is GPU-based.
- There could be potential performance implications of frequently switching between threads.
- You'd need to ensure thread safety if multiple async generations are happening concurrently and sharing resources.
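One way to address the thread-safety point above is to guard the shared model with a lock. The sketch below uses only the standard library, and its names are hypothetical: SharedModelWrapper is not a deepeval class, and its generate() just uppercases the prompt in place of a real model call.

```python
import asyncio
import threading


class SharedModelWrapper:
    """Hypothetical wrapper; generate() stands in for a non-thread-safe model call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.calls = 0

    def generate(self, prompt: str) -> str:
        # Only one worker thread at a time may touch the shared model state.
        with self._lock:
            self.calls += 1
            return prompt.upper()

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        # Offload to the default thread pool so the event loop stays responsive.
        return await loop.run_in_executor(None, self.generate, prompt)


async def main() -> list:
    wrapper = SharedModelWrapper()
    # Three concurrent requests share one wrapper; the lock serializes them.
    return await asyncio.gather(*(wrapper.a_generate(p) for p in ["a", "b", "c"]))


results = asyncio.run(main())
print(results)
```

The trade-off is that the lock serializes generations, so the benefit is limited to keeping the event loop free rather than running several generations in parallel.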
Lastly, to use your custom Mistral7B model for evaluation:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=mistral_7b)
You need to specify the custom evaluation model you created via the model argument when creating a metric.
Google VertexAI Example
Here is an example of creating a custom Google Gemini model through langchain's ChatVertexAI module for evaluation:
from langchain_google_vertexai import (
    ChatVertexAI,
    HarmBlockThreshold,
    HarmCategory
)
from deepeval.models.base_model import DeepEvalBaseLLM

class GoogleVertexAI(DeepEvalBaseLLM):
    """Class to implement Vertex AI for DeepEval"""
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Vertex AI Model"

# Initialize safety filters for the Vertex AI model
# This is important to ensure no evaluation responses are blocked
safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

# TODO: Add values for project and location below
custom_model_gemini = ChatVertexAI(
    model_name="gemini-1.0-pro-002",
    safety_settings=safety_settings,
    project="<project-id>",
    location="<region>",  # example: us-central1
)

# Initialize the wrapper class
vertexai_gemini = GoogleVertexAI(model=custom_model_gemini)
print(vertexai_gemini.generate("Write me a joke"))
To use it for evaluation for an LLM-Eval:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=vertexai_gemini)
AWS Bedrock Example
Here is an example of creating a custom AWS Bedrock model through the langchain_community.chat_models module for evaluation:
from langchain_community.chat_models import BedrockChat
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom AWS Bedrock Model"

# Replace these with real values
custom_model = BedrockChat(
    credentials_profile_name="<your-profile-name>", # e.g. "default"
    region_name="<your-region-name>", # e.g. "us-east-1"
    endpoint_url="<your-bedrock-endpoint>", # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id="<your-model-id>", # e.g. "anthropic.claude-v2"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock.generate("Write me a joke"))
Finally, supply the newly created aws_bedrock model to LLM-Evals:
from deepeval.metrics import AnswerRelevancyMetric
...
metric = AnswerRelevancyMetric(model=aws_bedrock)
Measuring A Metric
All metrics in deepeval
, including custom metrics that you create:
- can be executed via the
metric.measure()
method - can have its score accessed via
metric.score
, which ranges from 0 - 1 - can have its score reason accessed via
metric.reason
- can have its status accessed via
metric.is_successful()
- can be used to evaluate test cases or entire datasets, with or without Pytest
- has a
threshold
that acts as the threshold for success.metric.is_successful()
is only true ifmetric.score
is above/belowthreshold
- has a
strict_mode
property, which when turned on enforcesmetric.score
to a binary one - has a
verbose_mode
property, which when turned on prints metric logs whenever a metric is executed
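The shared interface listed above can be pictured with a toy class. This is a hypothetical stand-in, not a deepeval class: ToyMetric takes a raw score directly instead of a test case, it assumes a higher score is better, and its strict-mode rule (1 only for a perfect raw score, 0 otherwise) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMetric:
    """Hypothetical stand-in for a metric; not part of deepeval."""

    threshold: float = 0.5
    strict_mode: bool = False
    score: Optional[float] = None
    reason: Optional[str] = None

    def measure(self, raw_score: float) -> float:
        if self.strict_mode:
            # Assumption: strict mode collapses the score to 1 (perfect) or 0.
            self.score = 1.0 if raw_score >= 1.0 else 0.0
        else:
            self.score = raw_score
        self.reason = f"raw score {raw_score} recorded as {self.score}"
        return self.score

    def is_successful(self) -> bool:
        # Assumption: higher is better, so success means score >= threshold.
        return self.score is not None and self.score >= self.threshold


lenient = ToyMetric(threshold=0.7)
lenient.measure(0.8)

strict = ToyMetric(threshold=0.7, strict_mode=True)
strict.measure(0.8)

print(lenient.score, lenient.is_successful())
print(strict.score, strict.is_successful())
print(strict.reason)
```

The same raw score passes the lenient metric but fails the strict one, which is the practical effect of a binary score.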
In addition, all metrics in deepeval execute asynchronously by default. You can configure this behavior via the async_mode parameter when instantiating a metric.
Visit an individual metric's page to learn how it is calculated, and what is required when creating an LLMTestCase in order to execute it.
Here's a quick example.
export OPENAI_API_KEY=<your-openai-api-key>
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize a test case
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    retrieval_context=["..."]
)

# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)
Using this metric, you can either execute it directly as a standalone to get its score and reason:
...
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Or you can assert a test case using assert_test() via deepeval test run:
from deepeval import assert_test
...
def test_answer_relevancy():
    assert_test(test_case, [metric])
deepeval test run test_file.py
Or using the evaluate function:
from deepeval import evaluate
...
evaluate([test_case], [metric])
Measuring Metrics in Async
When a metric's async_mode=True (which is the default value for all metrics), invocations of metric.measure() will execute its internal algorithms concurrently. However, it's important to note that while operations INSIDE measure() execute concurrently, the metric.measure() call itself still blocks the main thread.
Let's take the FaithfulnessMetric algorithm for example:
1. Extract all factual claims made in the actual_output
2. Extract all factual truths found in the retrieval_context
3. Compare extracted claims and truths to generate a final score and reason.
from deepeval.metrics import FaithfulnessMetric
...
metric = FaithfulnessMetric(async_mode=True)
metric.measure(test_case)
print("Metric finished!")
When async_mode=True, steps 1 and 2 execute concurrently (i.e. at the same time) since they are independent of each other, while async_mode=False will cause steps 1 and 2 to execute sequentially instead (i.e. one after the other).
In both cases, "Metric finished!" will wait for metric.measure() to finish running before printing, but setting async_mode to True would make the print statement appear earlier, as async_mode=True allows metric.measure() to run faster.
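The behavior described above can be sketched with plain asyncio. This is a toy illustration, not the FaithfulnessMetric implementation: the three functions are hypothetical stand-ins, asyncio.sleep takes the place of LLM calls, and the scoring rule is a simple exact-match ratio chosen only to keep the example short.

```python
import asyncio
import time


async def extract_claims(actual_output: str) -> list:
    await asyncio.sleep(0.2)  # stands in for an LLM call (step 1)
    return actual_output.split(". ")


async def extract_truths(retrieval_context: list) -> list:
    await asyncio.sleep(0.2)  # stands in for an LLM call (step 2)
    return list(retrieval_context)


async def _a_measure(actual_output: str, retrieval_context: list) -> float:
    # Steps 1 and 2 are independent, so they run concurrently.
    claims, truths = await asyncio.gather(
        extract_claims(actual_output), extract_truths(retrieval_context)
    )
    # Step 3 needs both results: here, the fraction of claims found among the truths.
    supported = [claim for claim in claims if claim in truths]
    return len(supported) / len(claims) if claims else 0.0


def measure(actual_output: str, retrieval_context: list) -> float:
    # Synchronous entry point: blocks the caller until the coroutine finishes.
    return asyncio.run(_a_measure(actual_output, retrieval_context))


start = time.perf_counter()
score = measure("Paris is in France. Paris is in Spain", ["Paris is in France"])
elapsed = time.perf_counter() - start
print(f"score={score}, elapsed={elapsed:.2f}s")
print("Metric finished!")
```

The two extraction steps overlap, so the call takes roughly one simulated LLM round trip instead of two, yet the final print still waits for measure() to return.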
To measure multiple metrics at once and NOT block the main thread, use the asynchronous a_measure() method instead.
import asyncio
...

# Remember to use async
async def long_running_function():
    # These will all run at the same time
    await asyncio.gather(
        metric1.a_measure(test_case),
        metric2.a_measure(test_case),
        metric3.a_measure(test_case),
        metric4.a_measure(test_case)
    )
    print("Metrics finished!")

asyncio.run(long_running_function())
Debugging A Metric
You can turn on verbose_mode for ANY deepeval metric at metric initialization to debug a metric whenever the measure() or a_measure() method is called:
...
metric = AnswerRelevancyMetric(verbose_mode=True)
metric.measure(test_case)
Turning verbose_mode on will print the inner workings of a metric whenever measure() or a_measure() is called.