Synthesizer
Quick Summary
deepeval offers a data Synthesizer for anyone to easily generate evaluation datasets from scratch. The Synthesizer class is a synthetic data generator that first uses an LLM to generate a series of inputs, before evolving each input to make it more complex and realistic. These evolved inputs are then used to create a list of synthetic Goldens, which make up your synthetic EvaluationDataset.
deepeval's Synthesizer uses the data evolution method to generate large volumes of data across various complexity levels, which makes the synthetic data more realistic. This method was originally introduced by the developers of Evol-Instruct and WizardLM.
For those interested, here is a great article on how deepeval's synthesizer was built.
Creating A Synthesizer
deepeval's Synthesizer can be used standalone or within an EvaluationDataset. To begin, create a Synthesizer:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
There are three optional parameters when creating a Synthesizer:
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-4o.
- [Optional] multithreading: a boolean which, when set to True, enables concurrent generation of goldens. Defaulted to True.
- [Optional] embedder: a string specifying which of OpenAI's embedding models to use, OR any custom embedding model of type DeepEvalBaseEmbeddingModel. Defaulted to 'text-embedding-3-small'.
As you'll learn later, an embedding model is only used by the generate_goldens_from_docs() method, so you don't need to worry about the embedder parameter unless you're looking to use your own embedding model.
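As a rough illustration of what concurrent generation means here, the sketch below fans a per-context generation step out over a thread pool. The function generate_for_context is a hypothetical stand-in for an LLM call and is not part of deepeval; this is not necessarily how deepeval implements its multithreading option.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def generate_for_context(context: List[str]) -> str:
    # Hypothetical stand-in for an LLM call that turns one context into an input.
    return f"Question about: {context[0]}"


def generate_all(contexts: List[List[str]], multithreading: bool = True) -> List[str]:
    if not multithreading:
        # Sequential path: handle one context at a time.
        return [generate_for_context(c) for c in contexts]
    # Concurrent path: executor.map returns results in input order.
    with ThreadPoolExecutor() as executor:
        return list(executor.map(generate_for_context, contexts))


contexts = [
    ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
    ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
]
results = generate_all(contexts)
```

Both paths return the same results in the same order; only the wall-clock time differs when the per-context step is slow.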
Using Synthesizer As A Standalone
There are four ways deepeval's Synthesizer can generate synthetic Goldens:
- Generating synthetic Goldens using context extracted from documents.
- Generating synthetic Goldens from a list of provided contexts.
- Generating synthetic adversarial Goldens for red teaming.
- Generating synthetic Goldens from scratch.
1. Generating From Documents
To generate synthetic Goldens from documents, simply provide a list of document paths:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    max_goldens_per_document=2
)
The generate_goldens_from_docs method employs a token-based text splitter to manage document chunking, meaning the chunk_size and chunk_overlap parameters do not guarantee exact context sizes. This approach is designed to ensure meaningful and coherent context extraction, but might lead to variations in the expected size of each context.
There are one mandatory and six optional parameters when using the generate_goldens_from_docs method:
- document_paths: a list of strings, representing the paths to the documents from which contexts will be extracted. Supported document types include .txt, .docx, and .pdf.
- [Optional] include_expected_output: a boolean which, when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to False.
- [Optional] max_goldens_per_document: the maximum number of goldens to be generated per document. Defaulted to 5.
- [Optional] chunk_size: specifies the size of text chunks (in characters) to be considered for context extraction within each document. Defaulted to 1024.
- [Optional] chunk_overlap: an int that determines the overlap size between consecutive text chunks during context extraction. Defaulted to 0.
- [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
- [Optional] evolutions: a list of Evolutions, specifying the type of data evolution used. Defaulted to all Evolutions.
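To picture how chunk_size and chunk_overlap interact, here is a minimal sketch of a sliding-window splitter. It tokenizes on whitespace purely for illustration, and split_into_chunks is a hypothetical helper; deepeval's actual splitter and tokenizer may behave differently.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = text.split()  # assumption: whitespace tokens stand in for real tokens
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


chunks = split_into_chunks("a b c d e f g", chunk_size=3, chunk_overlap=1)
# chunks == ["a b c", "c d e", "e f g"]
```

Because the window counts tokens rather than characters, the character length of each chunk varies with the text, which is one way to read the caveat that chunk sizes are not exact.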
Evolution is an ENUM that specifies the different data evolution techniques you wish to employ to make synthetic Goldens more realistic.
from deepeval.synthesizer import Evolution
available_evolutions = [
    Evolution.REASONING,
    Evolution.MULTICONTEXT,
    Evolution.CONCRETIZING,
    Evolution.CONSTRAINED,
    Evolution.COMPARATIVE,
    Evolution.HYPOTHETICAL,
    Evolution.IN_BREADTH
]
For those interested in what these evolutions mean, you can read this article here.
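The interplay between num_evolutions and evolutions can be sketched as follows. This is an illustration only: the Evolution enum below is a local stand-in with a few of the member names listed above, evolve_once is a hypothetical placeholder for the LLM rewrite, and picking an evolution at random for each step is an assumption rather than confirmed deepeval behavior.

```python
import random
from enum import Enum
from typing import List, Optional


class Evolution(Enum):
    # Local stand-in mirroring some of the names above; not deepeval's enum.
    REASONING = "reasoning"
    CONCRETIZING = "concretizing"
    HYPOTHETICAL = "hypothetical"


def evolve_once(text: str, evolution: Evolution) -> str:
    # Hypothetical placeholder: a real implementation would ask an LLM to rewrite `text`.
    return f"{text} [{evolution.value}]"


def evolve_input(text: str, num_evolutions: int = 1,
                 evolutions: Optional[List[Evolution]] = None,
                 seed: Optional[int] = None) -> str:
    rng = random.Random(seed)
    choices = evolutions or list(Evolution)  # default: all evolutions
    for _ in range(num_evolutions):
        # Assumption: one evolution type is picked at random per step.
        text = evolve_once(text, rng.choice(choices))
    return text


evolved = evolve_input("What is the boiling point of water?",
                       num_evolutions=2, evolutions=[Evolution.REASONING])
```

Each step feeds the previous step's output back in, so a higher num_evolutions compounds the rewrites rather than producing independent variants.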
2. Generating From Provided Contexts
deepeval also allows you to generate synthetic Goldens from a manually provided list of contexts instead of generating directly from your documents.
This is especially helpful if you already have an embedded knowledge base. For example, if you already have documents parsed and stored in an existing vector database, you may consider handling the logic to retrieve text chunks yourself.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens(
    # Provide a list of contexts for synthetic data generation
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)
There are one mandatory and four optional parameters when using the generate_goldens method:
- contexts: a list of contexts, where each context is itself a list of strings, ideally sharing a common theme or subject area.
- [Optional] include_expected_output: a boolean which, when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to False.
- [Optional] max_goldens_per_context: the maximum number of goldens to be generated per context. Defaulted to 2.
- [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
- [Optional] evolutions: a list of Evolutions, specifying the type of data evolution used. Defaulted to all Evolutions.
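Because contexts is a list of lists, it is easy to pass a flat list of strings by mistake. The hypothetical helper below (not part of deepeval) checks the shape and computes the upper bound on the number of goldens implied by max_goldens_per_context.

```python
from typing import List


def max_possible_goldens(contexts: List[List[str]], max_goldens_per_context: int = 2) -> int:
    for context in contexts:
        # Each context must itself be a list of strings, not a bare string.
        if isinstance(context, str) or not all(isinstance(chunk, str) for chunk in context):
            raise TypeError("each context must be a list of strings")
    # At most max_goldens_per_context goldens are produced for every context.
    return len(contexts) * max_goldens_per_context


contexts = [
    ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
    ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
]
upper_bound = max_possible_goldens(contexts)  # 2 contexts * 2 goldens each
```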
While the previous methods first use an LLM to generate a series of inputs based on the provided context before evolving them, generate_goldens_from_inputs simply evolves the provided list of inputs into more complex and diverse Goldens. It's also important to note that this method will only populate the input field of each generated Golden.
3. Generating Adversarial Goldens for Red Teaming
deepeval also allows you to generate adversarial Goldens to red team LLM applications. You can optionally provide a list of contexts to keep each generated synthetic Golden grounded in your data.
This is especially helpful for assessing your LLM application's risks and vulnerabilities.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_red_teaming_goldens(
    # Optionally provide a list of contexts to keep adversarial goldens grounded
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)
There are six optional parameters when using the generate_red_teaming_goldens method:
- [Optional] contexts: a list of contexts, where each context is itself a list of strings, ideally sharing a common theme or subject area. When contexts is supplied, the Synthesizer generates adversarial goldens based on the information presented in contexts. If not, they are generated from scratch.
- [Optional] include_expected_output: a boolean which, when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to False.
- [Optional] max_goldens: the maximum number of goldens to be generated if contexts isn't supplied. If contexts is supplied, max_goldens will be confined to the length of contexts. Defaulted to 2.
- [Optional] num_evolutions: the number of evolution steps to apply to each generated Golden. Defaulted to 1.
- [Optional] attacks: a list of RTAdversarialAttacks, specifying the types of adversarial attacks you want generated for your synthetic Goldens. Defaulted to all RTAdversarialAttacks.
- [Optional] vulnerabilities: a list of RTVulnerabilitys, representing the undesirable vulnerabilities you are trying to elicit. Defaulted to all RTVulnerabilitys.
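One plausible reading of the max_goldens rule above is sketched below as a hypothetical helper. The exact behavior inside deepeval is not confirmed here; the sketch assumes that "confined to the length of contexts" means the value is capped at the number of contexts.

```python
from typing import List, Optional


def effective_max_goldens(max_goldens: int = 2,
                          contexts: Optional[List[List[str]]] = None) -> int:
    if contexts is None:
        # No contexts: goldens are generated from scratch, up to max_goldens.
        return max_goldens
    # Assumption: with contexts, the limit is capped at len(contexts).
    return min(max_goldens, len(contexts))


capped = effective_max_goldens(5, contexts=[["fact A"], ["fact B"]])
```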
RTAdversarialAttack is an ENUM that specifies the different types of red teaming attacks (prompt injection, prompt probing, etc.) you wish to include in your final dataset, while the RTVulnerability ENUM specifies the type of vulnerability and risk you are trying to assess your LLM application on (hallucination, bias, etc.):
from deepeval.synthesizer import RTAdversarialAttack, RTVulnerability
available_red_teaming_attacks = [
    RTAdversarialAttack.PROMPT_INJECTION,
    RTAdversarialAttack.PROMPT_PROBING,
    RTAdversarialAttack.GRAY_BOX_ATTACK,
    RTAdversarialAttack.JAIL_BREAKING
]
available_red_teaming_vulnerabilities = [
    RTVulnerability.HALLUCINATION,
    RTVulnerability.OFFENSIVE,
    RTVulnerability.BIAS,
    RTVulnerability.DATA_LEAKAGE,
    RTVulnerability.UNFORMATTED
]
Click here to learn more about the different types of adversarial attacks and vulnerabilities used for red teaming.
4. Generating From Scratch
If you do not have a list of example prompts, or wish to rely solely on LLM generation for synthesis, you can also generate synthetic Goldens simply by specifying the subject, task, and output format you wish your prompts to follow.
Generating goldens from scratch is especially helpful when you wish to evaluate your LLM on a specific task, such as red-teaming or text-to-SQL use cases!
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_scratch(
    subject="Harmful and toxic prompts, with emphasis on dark humor",
    task="Red-team LLMs",
    output_format="string",
    num_initial_goldens=25,
    num_evolutions=20
)
This method is a 2-step function that first generates a list of prompts about a given subject for a certain task and in a certain output format, before using the generated list of prompts to generate more prompts through data evolution.
The subject, task, and output format parameters are all strings that are inserted into a predefined prompt template, meaning these parameters are flexible and will need to be iterated on for optimal results.
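To make the template idea concrete, here is a sketch using a made-up template. Both PROMPT_TEMPLATE and build_generation_prompt are hypothetical; the wording of deepeval's real template is not shown in this document, and the subject and task values are illustrative only.

```python
# Hypothetical template; the placeholders are filled with the user's strings.
PROMPT_TEMPLATE = (
    "Generate {n} prompts about the subject: {subject}\n"
    "Purpose: {task}\n"
    "Each prompt should be formatted as: {output_format}"
)


def build_generation_prompt(subject: str, task: str, output_format: str, n: int) -> str:
    # Plain string substitution: the parameters are not validated or typed.
    return PROMPT_TEMPLATE.format(n=n, subject=subject, task=task,
                                  output_format=output_format)


prompt = build_generation_prompt(
    subject="Questions about a company's refund policy",
    task="Evaluate a customer support chatbot",
    output_format="string",
    n=25,
)
```

Since the strings are substituted verbatim, small changes in wording reach the LLM directly, which is why iterating on them tends to pay off.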
There are four mandatory and two optional parameters when using the generate_goldens_from_scratch method:
- subject: a string, specifying the subject and nature of your generated Goldens.
- task: a string, representing the purpose of these evaluation Goldens.
- output_format: a string, representing the expected output format. This is not equivalent to Python types but simply gives you more control over the structure of your synthetic data.
- num_initial_goldens: the number of goldens generated before subsequent evolutions.
- [Optional] num_evolutions: the number of evolution steps to apply to each generated prompt. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
- [Optional] evolution_types: a list of PromptEvolutions, specifying the methods used during data evolution. Defaulted to all PromptEvolutions.
PromptEvolution is an ENUM that specifies the different data evolution techniques you wish to employ to make synthetic contextless Goldens more realistic.
from deepeval.synthesizer import PromptEvolution
available_evolutions = [
    PromptEvolution.REASONING,
    PromptEvolution.CONCRETIZING,
    PromptEvolution.CONSTRAINED,
    PromptEvolution.COMPARATIVE,
    PromptEvolution.HYPOTHETICAL,
    PromptEvolution.IN_BREADTH
]
For those interested in what these evolutions mean, you can read this article here.
Saving Generated Goldens
To avoid accidentally losing any generated synthetic Goldens, you can use the save_as() method:
synthesizer.save_as(
    file_type='json',  # or 'csv'
    directory="./synthetic_data"
)
Using Synthesizer Within An Evaluation Dataset
An EvaluationDataset also has the generate_goldens_from_docs and generate_goldens methods, which under the hood are powered by the Synthesizer's implementation.
Except for an additional option to accept a custom Synthesizer as an argument, the generate_goldens_from_docs and generate_goldens methods in an EvaluationDataset accept the exact same arguments as those on a Synthesizer.
You can optionally specify a custom Synthesizer when calling generate_goldens_from_docs and generate_goldens through the EvaluationDataset interface if, for example, you wish to use a custom LLM to generate synthetic data. If no Synthesizer is provided, the default Synthesizer configuration is used.
To begin, optionally create a custom Synthesizer:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer(model="gpt-3.5-turbo")
Then, provide it as an argument to generate_goldens_from_docs:
from deepeval.dataset import EvaluationDataset
...
dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(
    synthesizer=synthesizer,
    document_paths=['example.pdf'],
)
Or, to generate_goldens:
...
dataset.generate_goldens(
    synthesizer=synthesizer,
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)
Or, to generate_red_teaming_goldens:
...
dataset.generate_red_teaming_goldens(
    synthesizer=synthesizer,
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)
Lastly, don't forget to call save_as() to preserve any generated synthetic Goldens:
saved_path = dataset.save_as(
    file_type='json',  # or 'csv'
    directory="./synthetic_data"
)
The save_as() method returns the path the dataset was saved to as a string, in case you need to use it in code later on.
Using a Custom Embedding Model
Under the hood, only the generate_goldens_from_docs() method uses an embedding model. This is because, in order to generate goldens from documents, the Synthesizer uses cosine similarity to select the relevant context needed for data synthesis.
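For readers unfamiliar with cosine similarity, the sketch below shows the calculation and a simple top-k selection over chunk embeddings. The helper names and the toy two-dimensional vectors are illustrative assumptions; how deepeval groups chunks into contexts internally is not specified here.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


def top_k_similar(query: Sequence[float], chunks: List[Sequence[float]], k: int) -> List[int]:
    # Return the indices of the k chunks whose embeddings are closest to the query.
    scores = [cosine_similarity(query, c) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]


chunk_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k_similar([1.0, 0.0], chunk_embeddings, k=2)
# nearest == [0, 1]
```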
Using Azure OpenAI
You can use Azure's OpenAI embedding models by running the following commands in the CLI:
deepeval set-azure-openai --openai-endpoint=<endpoint> \
    --openai-api-key=<api_key> \
    --deployment-name=<deployment_name> \
    --openai-api-version=<openai_api_version> \
    --model-version=<model_version>
Then, run this to set the Azure OpenAI embedder:
deepeval set-azure-openai-embedding --embedding_deployment-name=<embedding_deployment_name>
The first command configures deepeval to use Azure OpenAI LLM globally, while the second command configures deepeval to use Azure OpenAI's embedding models globally.
Using Any Custom Model
Alternatively, you can create a custom embedding model in code by inheriting the base DeepEvalBaseEmbeddingModel class. Here is an example that uses the same Azure OpenAI embedding model, but created in code using langchain's langchain_openai module:
from typing import List
from langchain_openai import AzureOpenAIEmbeddings
from deepeval.models import DeepEvalBaseEmbeddingModel

class CustomEmbeddingModel(DeepEvalBaseEmbeddingModel):
    def __init__(self):
        pass

    def load_model(self):
        # Build the underlying langchain embedding client
        return AzureOpenAIEmbeddings(
            openai_api_version="...",
            azure_deployment="...",
            azure_endpoint="...",
            openai_api_key="...",
        )

    def embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return embedding_model.embed_query(text)

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return embedding_model.embed_documents(texts)

    async def a_embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_query(text)

    async def a_embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_documents(texts)

    def get_model_name(self):
        # Return the name as a string
        return "Custom Azure Embedding Model"
When creating a custom embedding model, you should ALWAYS:
- inherit DeepEvalBaseEmbeddingModel.
- implement the get_model_name() method, which simply returns a string representing your custom model name.
- implement the load_model() method, which will be responsible for returning the model object instance.
- implement the embed_text() method with one and only one parameter of type str as the text to be embedded, and return a vector of type List[float]. We called embedding_model.embed_query(text) to access the embedded text in this particular example, but this could be different depending on the implementation of your custom model object.
- implement the embed_texts() method with one and only one parameter of type List[str] as the list of strings to be embedded, and return a list of vectors of type List[List[float]].
- implement the asynchronous a_embed_text() and a_embed_texts() methods, with the same function signatures as their respective synchronous versions. Since these are asynchronous methods, remember to use async/await.
If an asynchronous version of your embedding model does not exist, simply reuse the synchronous implementation:
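class CustomEmbeddingModel(DeepEvalBaseEmbeddingModel):
    ...
    async def a_embed_text(self, text: str) -> List[float]:
        return self.embed_text(text)

    async def a_embed_texts(self, texts: List[str]) -> List[List[float]]:
        return self.embed_texts(texts)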
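Note that calling a blocking method directly inside a coroutine holds up the event loop while it runs. If that matters for your workload, one option is to hand the call to a worker thread with asyncio.to_thread (available from Python 3.9). The sketch below uses a stand-in class, SketchEmbeddingModel, with a toy embedding instead of a real DeepEvalBaseEmbeddingModel subclass, so that it runs without any dependencies.

```python
import asyncio
from typing import List


class SketchEmbeddingModel:
    # Stand-in for a DeepEvalBaseEmbeddingModel subclass; the embedding is a toy.
    def embed_text(self, text: str) -> List[float]:
        return [float(len(text))]

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_text(t) for t in texts]

    async def a_embed_text(self, text: str) -> List[float]:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.embed_text, text)

    async def a_embed_texts(self, texts: List[str]) -> List[List[float]]:
        return await asyncio.to_thread(self.embed_texts, texts)


vectors = asyncio.run(SketchEmbeddingModel().a_embed_texts(["hi", "hello"]))
# vectors == [[2.0], [5.0]]
```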
Lastly, provide the custom embedding model through the embedder parameter when creating a Synthesizer:
from deepeval.synthesizer import Synthesizer
...
synthesizer = Synthesizer(embedder=CustomEmbeddingModel())
If you run into invalid JSON errors using custom models, you may want to consult this guide on using custom LLMs for evaluation, as synthetic data generation also supports pydantic confinement for custom models.