LLM Application Monitoring in Production
Quick Summary
deepeval allows you to monitor live LLM responses in production with a single API call. By monitoring responses in production, you can use Confident AI to identify unsatisfactory LLM responses, have users and human annotators leave feedback on them, and incorporate real-world data to improve your evaluation dataset over time.
Additionally, you may enable real-time evaluations, which run automatically under the hood to give you a preliminary view of failing responses, as well as LLM tracing, which helps you debug your LLM applications.
Monitoring Live Responses
To monitor LLM responses, call the deepeval.monitor(...) method in your LLM application.
import deepeval

# At the end of your LLM call,
# usually in your backend API handler
deepeval.monitor(
    event_name="Chatbot",
    model="gpt-4",
    input="input",
    response="response"
)
There are four mandatory and ten optional parameters when using the monitor() function to monitor responses in production:
- event_name: type str specifying the type of response event monitored. The event_name can be thought of as an identifier for the different functionalities your LLM application performs.
- model: type str specifying the name of the LLM model used.
- input: type str.
- response: type str.
- [Optional] retrieval_context: type list[str] that indicates the contexts that were retrieved in your RAG pipeline.
- [Optional] additional_data: type dict[str, Union[str, dict, Link]]. See below for more details.
- [Optional] hyperparameters: type dict[str, Union[str, int, float]]. You can provide a dictionary to specify any additional hyperparameters used to generate the response.
- [Optional] distinct_id: type str to identify end users using your LLM application.
- [Optional] conversation_id: type str to group together multiple messages under a single conversation thread.
- [Optional] completion_time: type float that indicates how many seconds it took your LLM application to complete.
- [Optional] token_usage: type float.
- [Optional] token_cost: type float.
- [Optional] fail_silently: type bool. You should set this to False in development to check if monitor() is working properly. Defaulted to False.
- [Optional] raise_exception: type bool. You should set this to False in production if you don't want to raise exceptions in production. Defaulted to True.
Please do NOT provide placeholder values for optional parameters. Leave them unset instead.
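One way to follow this advice is to collect the keyword arguments in a dictionary and drop anything that was never set, so that no placeholder such as None or an empty stand-in reaches monitor(). The sketch below assumes a hypothetical helper, build_monitor_kwargs, which is not part of deepeval; only the parameter names come from the list above.

```python
from typing import Any, Dict, Optional


def build_monitor_kwargs(
    event_name: str,
    model: str,
    input: str,
    response: str,
    **optional: Optional[Any],
) -> Dict[str, Any]:
    """Collect monitor() arguments, omitting optional ones that are None.

    Hypothetical helper, not part of deepeval.
    """
    kwargs: Dict[str, Any] = {
        "event_name": event_name,
        "model": model,
        "input": input,
        "response": response,
    }
    # Keep only the optional values the caller actually has.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


kwargs = build_monitor_kwargs(
    event_name="Chatbot",
    model="gpt-4",
    input="What is your refund policy?",
    response="Refunds are available within 30 days.",
    distinct_id="user-123",
    conversation_id=None,  # unknown, so it is left out entirely
)
# In an application you would then call: deepeval.monitor(**kwargs)
```

Building the arguments this way keeps the "is this value known?" decision in one place instead of scattering conditionals around every call site.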
The monitor() function returns a response_id upon a successful API request to Confident's servers, which you can later use to send human feedback regarding a particular LLM response you've monitored.
import deepeval
response_id = deepeval.monitor(...)
Congratulations! With a few lines of code, deepeval will now automatically log all LLM responses in production to Confident AI.
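The completion_time parameter can be measured by timing the LLM call yourself. Below is a minimal sketch, assuming a hypothetical wrapper monitored_call and hypothetical stand-ins fake_llm and fake_monitor. The monitor_fn argument is where you would pass deepeval.monitor in a real application; it is injected here only so the example runs without network access.

```python
import time
from typing import Any, Callable, Dict, Tuple


def monitored_call(
    llm_fn: Callable[[str], str],
    prompt: str,
    monitor_fn: Callable[..., Any],
    event_name: str,
    model: str,
) -> Tuple[str, Any]:
    """Run llm_fn, time it, and report the result through monitor_fn.

    Hypothetical wrapper; monitor_fn is expected to accept the same
    keyword arguments as deepeval.monitor.
    """
    start = time.perf_counter()
    response = llm_fn(prompt)
    elapsed = time.perf_counter() - start  # seconds, as a float

    response_id = monitor_fn(
        event_name=event_name,
        model=model,
        input=prompt,
        response=response,
        completion_time=elapsed,
    )
    return response, response_id


# Stand-ins so the sketch runs offline; both are assumptions.
def fake_llm(prompt: str) -> str:
    return "echo: " + prompt


recorded: Dict[str, Any] = {}


def fake_monitor(**kwargs: Any) -> str:
    recorded.update(kwargs)
    return "response-id-placeholder"


answer, response_id = monitored_call(
    fake_llm, "hello", fake_monitor, event_name="Chatbot", model="gpt-4"
)
```

Returning the response_id alongside the answer lets the caller store it with the message, so that later human feedback can be tied to the right response.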
Logging Custom Hyperparameters
In addition to logging which model was used to generate each response, you can also associate any custom hyperparameters with each response you're monitoring.
import deepeval

deepeval.monitor(
    ...
    model="gpt-4",
    hyperparameters={
        "prompt template": "...",
        "temperature": 1,
        "chunk size": 500
    }
)
Logging hyperparameters allows you to more easily filter and search for the different responses on Confident AI.
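Because hyperparameters is typed as dict[str, Union[str, int, float]], it can help to check values before logging them. The sketch below uses a hypothetical helper, check_hyperparameters, which is not a deepeval function. How deepeval itself treats other value types is not covered here, so rejecting them (including bool) is a conservative assumption rather than documented behavior.

```python
from typing import Dict, Union

HyperparameterValue = Union[str, int, float]


def check_hyperparameters(
    params: Dict[str, object],
) -> Dict[str, HyperparameterValue]:
    """Return params unchanged if every value is a str, int, or float.

    Hypothetical helper, not part of deepeval.
    """
    for key, value in params.items():
        if not isinstance(key, str):
            raise TypeError(f"hyperparameter name {key!r} must be a str")
        # bool is a subclass of int in Python; reject it explicitly
        # (a conservative choice made for this sketch).
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise TypeError(
                f"hyperparameter {key!r} has unsupported type "
                f"{type(value).__name__}"
            )
    return params  # type: ignore[return-value]


hyperparameters = check_hyperparameters(
    {"prompt template": "...", "temperature": 1, "chunk size": 500}
)
# In an application: deepeval.monitor(..., hyperparameters=hyperparameters)
```

Flat, scalar values also tend to be the easiest to filter and search on later.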
Logging Additional Custom Data
Similar to hyperparameters, you can associate additional custom data with each response.
import deepeval
from deepeval.monitor import Link

deepeval.monitor(
    ...
    additional_data={
        "Example Text": "...",
        # the Link class allows you to access the link directly on Confident AI
        "Example Link": Link(value="https://www.youtube.com"),
        "Example list of Links": [Link(value="https://www.instagram.com")],
        "Example JSON": {"Any Key": "Any Value"}
    },
)
Note that you can log text, a Link, a list of Links, or any custom dict (as shown in "Example JSON"). As with hyperparameters, you can also filter and search for different responses based on the provided custom data.
Although you can technically log a link as a plain string, wrapping it in the Link(value="...") class allows you to open the link directly on Confident AI. You should also aim to have your Links start with 'https://...'.
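The 'https://' recommendation can be enforced before a URL is wrapped in a Link. This is a minimal sketch using a hypothetical helper, ensure_https, built on the standard library's urllib.parse. Raising an error is stricter than the guidance above, which only says you should aim for https; logging a warning instead would be an equally reasonable choice.

```python
from urllib.parse import urlparse


def ensure_https(url: str) -> str:
    """Return url if it is an absolute https:// URL, else raise ValueError.

    Hypothetical helper, not part of deepeval.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https:// URL, got {url!r}")
    return url


checked = ensure_https("https://www.youtube.com")
# In an application, with Link imported from deepeval.monitor:
#   additional_data={"Example Link": Link(value=checked)}
```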
Responses on Confident AI
Confident AI allows you to view and manage your monitored responses in the Observatory. On this page, you can inspect how these responses were generated, view entire conversational threads, view and debug traces, leave human feedback, and add unsatisfactory responses to your evaluation dataset.
While conversation_id is logged here, it is an optional parameter that will appear as None if not supplied to deepeval's monitor() function during live monitoring.
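To avoid None here, you can generate one conversation_id per thread and reuse it for every message in that thread. The following is a minimal sketch, assuming a hypothetical Conversation class and UUID strings as identifiers; any stable string would do. The event name and model are the same illustrative values used earlier.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Conversation:
    """Holds the identifiers shared by every message in one thread.

    Hypothetical class, not part of deepeval.
    """

    distinct_id: str
    conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def monitor_kwargs(self, user_input: str, response: str) -> Dict[str, Any]:
        """Build monitor() keyword arguments for one message in this thread."""
        return {
            "event_name": "Chatbot",
            "model": "gpt-4",
            "input": user_input,
            "response": response,
            "distinct_id": self.distinct_id,
            "conversation_id": self.conversation_id,
        }


thread = Conversation(distinct_id="user-123")
first = thread.monitor_kwargs("Hi", "Hello! How can I help?")
second = thread.monitor_kwargs("Where is my order?", "Let me check.")
# In an application: deepeval.monitor(**first), then deepeval.monitor(**second)
```

Because both calls carry the same conversation_id, the two messages should be grouped under a single conversation thread.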
Viewing A Response
To view a response in more detail, click on the response you want to examine, which opens a dropdown panel. Here, you can view the event name, Distinct ID (user ID), Conversation ID, time and completion time (automatically tracked), token cost, and token usage, as well as your usual LLM parameters, including input, response, and retrieval context.
You can view logged hyperparameters and custom data by toggling the tabs next to the real-time evaluation metric results, and you can click the inspect button to toggle the "Response Details" side panel.
Inspecting A Response
On the side panel, you may inspect the response's information in greater detail.
You may also view the associated trace or leave human feedback by clicking on the respective buttons on the side panel.
Filtering for Responses
Filtering for responses is vital because it lets you replay or recreate the real-world conditions under which a response was generated. For example, with Confident AI you can easily filter for responses where data was retrieved from a particular knowledge base or data lake.
On Confident AI, you can filter responses by a single criterion or a combination of criteria, ranging from Distinct (user) IDs and Conversation IDs to associated human feedback information such as the rating, provider, explanation, and expected response. You can also filter responses based on the hyperparameters and/or custom data you have logged.
Use the date selector to choose a timeframe for the responses you wish to view. This feature is particularly useful when assigning data annotation tasks to your labelers.
Here is an example of a list of filtered responses with a rating of 1!