LLM Application Monitoring in Production

Quick Summary

deepeval allows you to monitor live LLM responses in production with a single API call. By monitoring responses in production, you can use Confident AI to identify unsatisfactory LLM responses, have users and human annotators leave feedback on those responses, and incorporate real-world data to improve your evaluation dataset over time.

tip

Additionally, you may enable real-time evaluations, which run automatically under the hood to help you identify failing responses early, as well as LLM tracing, which helps you debug your LLM applications.

Monitoring Live Responses

To monitor LLM responses, call deepeval.monitor(...) in your LLM application.

import deepeval

# At the end of your LLM call,
# usually in your backend API handler
deepeval.monitor(
    event_name="Chatbot",
    model="gpt-4",
    input="input",
    response="response"
)

The monitor() function takes four mandatory and ten optional parameters:

  • event_name: type str specifying the type of response event monitored. The event_name can be thought of as an identifier for different functionalities your LLM application performs.
  • model: type str specifying the name of the LLM model used.
  • input: type str.
  • response: type str.
  • [Optional] retrieval_context: type list[str] that holds the contexts retrieved in your RAG pipeline.
  • [Optional] additional_data: type dict[str, Union[str, dict, Link]]. See below for more details.
  • [Optional] hyperparameters: type dict[str, Union[str, int, float]]. You can provide a dictionary to specify any additional hyperparameters used to generate the response.
  • [Optional] distinct_id: type str to identify end users using your LLM application.
  • [Optional] conversation_id: type str to group together multiple messages under a single conversation thread.
  • [Optional] completion_time: type float that indicates how many seconds it took your LLM application to complete.
  • [Optional] token_usage: type float
  • [Optional] token_cost: type float
  • [Optional] fail_silently: type bool. You should set this to False in development to check if monitor() is working properly. Defaults to False.
  • [Optional] raise_expection: type bool. You should set this to False in production if you don't want monitor() to raise exceptions. Defaults to True.

caution

Do NOT provide placeholder values for optional parameters. Omit them instead.
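
One way to follow this advice is to collect the optional arguments in a dictionary and drop anything that was never set before calling monitor(). The sketch below is only an illustration: build_monitor_kwargs is a hypothetical helper, not part of deepeval, and the commented call at the end assumes deepeval is installed and connected to Confident AI.

```python
from typing import Any, Dict


def build_monitor_kwargs(
    event_name: str,
    model: str,
    input: str,
    response: str,
    **optional: Any,
) -> Dict[str, Any]:
    """Return keyword arguments for monitor(), omitting unset optional values.

    Hypothetical helper (not part of deepeval): any optional argument whose
    value is None is left out entirely rather than sent as a placeholder.
    """
    kwargs: Dict[str, Any] = {
        "event_name": event_name,
        "model": model,
        "input": input,
        "response": response,
    }
    # Keep only the optional values that were actually provided.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


kwargs = build_monitor_kwargs(
    event_name="Chatbot",
    model="gpt-4",
    input="input",
    response="response",
    distinct_id=None,      # unknown user: dropped, not sent as a placeholder
    completion_time=1.2,   # known value: kept
)
# With deepeval installed, the call would then be:
#   deepeval.monitor(**kwargs)
```

Filtering on `is not None` rather than on truthiness keeps legitimate falsy values such as fail_silently=False.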

The monitor() function returns a response_id upon a successful API request to Confident AI's servers, which you can later use to send human feedback on that particular LLM response.

import deepeval

response_id = deepeval.monitor(...)
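
A common pattern is to time the LLM call yourself, pass the result as completion_time, and keep the returned response_id for later feedback. The following is a sketch under stated assumptions, not deepeval's own implementation: call_and_monitor, generate, and monitor_fn are hypothetical names, and monitor_fn stands in for deepeval.monitor so the example runs without network access.

```python
import time
from typing import Callable, Optional, Tuple


def call_and_monitor(
    generate: Callable[[str], str],
    monitor_fn: Callable[..., Optional[str]],
    user_input: str,
    event_name: str = "Chatbot",
    model: str = "gpt-4",
) -> Tuple[str, Optional[str]]:
    """Run the LLM call, time it, log it, and return (response, response_id)."""
    start = time.perf_counter()
    response = generate(user_input)
    elapsed = time.perf_counter() - start  # seconds, as completion_time expects

    response_id = monitor_fn(
        event_name=event_name,
        model=model,
        input=user_input,
        response=response,
        completion_time=elapsed,
    )
    return response, response_id


# Offline stand-ins; in production pass your real LLM call and deepeval.monitor.
logged_events = []


def stub_monitor(**kwargs):
    logged_events.append(kwargs)
    return "stub-response-id"


response, response_id = call_and_monitor(
    lambda text: text.upper(), stub_monitor, "hello"
)
```

Passing the monitor function in as an argument keeps the wrapper easy to test and lets you swap in deepeval.monitor unchanged.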

Congratulations! With a few lines of code, deepeval will now automatically log all LLM responses in production to Confident AI.

Logging Custom Hyperparameters

In addition to logging which model generated each response, you can associate any custom hyperparameters with each response you're monitoring.

import deepeval

deepeval.monitor(
    ...
    model="gpt-4",
    hyperparameters={
        "prompt template": "...",
        "temperature": 1,
        "chunk size": 500
    }
)

info

Logging hyperparameters allows you to more easily filter and search for the different responses on Confident AI.
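
Because the hyperparameters dictionary is documented as dict[str, Union[str, int, float]], it can help to check the values before logging them. The sketch below is a minimal illustration; check_hyperparameters is a hypothetical helper, not part of deepeval, and it only enforces the types listed in the parameter description above.

```python
from typing import Dict, Union

HyperparameterValue = Union[str, int, float]


def check_hyperparameters(params: dict) -> Dict[str, HyperparameterValue]:
    """Hypothetical helper: reject names or values outside the documented types."""
    for key, value in params.items():
        if not isinstance(key, str):
            raise TypeError(f"hyperparameter name {key!r} must be a str")
        if not isinstance(value, (str, int, float)):
            raise TypeError(
                f"hyperparameter {key!r} has unsupported type {type(value).__name__}"
            )
    return params


hyperparameters = check_hyperparameters(
    {"prompt template": "...", "temperature": 1, "chunk size": 500}
)
# With deepeval installed, pass the result as:
#   deepeval.monitor(..., hyperparameters=hyperparameters)
```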

Logging Additional Custom Data

As with hyperparameters, you can associate additional custom data with each response.

import deepeval
from deepeval.monitor import Link

deepeval.monitor(
    ...
    additional_data={
        "Example Text": "...",
        # the Link class allows you to access the link directly on Confident AI
        "Example Link": Link(value="https://www.youtube.com"),
        "Example list of Links": [Link(value="https://www.instagram.com")],
        "Example JSON": {"Any Key": "Any Value"}
    },
)

Note that you can log text, a Link, a list of Links, or any custom dict (as shown in "Example JSON"). As with hyperparameters, you can also filter and search for responses based on the custom data you provide.

Did you know?

Although you can technically log a link as a plain string, the Link(value="...") class allows you to access that link directly on Confident AI. You should also aim to have your Links start with 'https://...'.
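
To catch links that do not start with 'https://' before they are logged, you could validate each URL first. This is a small sketch using only the Python standard library; require_https is a hypothetical helper, not part of deepeval, and the commented line assumes deepeval is installed.

```python
from urllib.parse import urlparse


def require_https(url: str) -> str:
    """Hypothetical helper: return url unchanged if it is an https URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https:// URL, got {url!r}")
    return url


checked_url = require_https("https://www.youtube.com")
# With deepeval installed, the checked value can then be wrapped:
#   Link(value=checked_url)
```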

Responses on Confident AI

Confident AI allows you to view and manage your monitored responses in the Observatory. On this page, you can inspect how these responses were generated, view entire conversation threads, view and debug traces, leave human feedback, and add unsatisfactory responses to your evaluation dataset.

[Screenshot: monitored responses in the Observatory]

info

While conversation_id is logged here, it is an optional parameter that will appear as None if not supplied to deepeval's monitor() function during live monitoring.
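
To have messages grouped into one thread, each call to monitor() for that conversation needs the same conversation_id. The sketch below generates one ID per conversation with Python's uuid module and reuses it for every turn. It is an illustration only: monitor_conversation and monitor_fn are hypothetical names, and monitor_fn stands in for deepeval.monitor so the example runs offline.

```python
import uuid
from typing import Callable, List, Optional, Tuple


def monitor_conversation(
    turns: List[Tuple[str, str]],
    monitor_fn: Callable[..., Optional[str]],
) -> str:
    """Log every (input, response) turn under one shared conversation_id."""
    conversation_id = str(uuid.uuid4())  # one ID per conversation thread
    for user_input, response in turns:
        monitor_fn(
            event_name="Chatbot",
            model="gpt-4",
            input=user_input,
            response=response,
            conversation_id=conversation_id,
        )
    return conversation_id


# Offline stand-in for deepeval.monitor that records what would be sent.
recorded_calls = []


def recording_monitor(**kwargs):
    recorded_calls.append(kwargs)
    return None


thread_id = monitor_conversation(
    [("Hi", "Hello!"), ("What can you do?", "I answer questions.")],
    recording_monitor,
)
```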

Viewing A Response

To view a response in more detail, click on the response you want to examine, which opens a dropdown panel. Here, you can view the event name, Distinct ID (user ID), Conversation ID, time and completion time (automatically tracked), token cost, and token usage, as well as your usual LLM parameters, including input, response, and retrieval context.

tip

You can view logged hyperparameters and custom data by toggling the tabs next to the real-time evaluation metric results. Click the inspect button to open the "Response Details" side panel.

Inspecting A Response

On the side panel, you can inspect the response data in greater detail.

tip

You may also view the associated trace or leave human feedback by clicking on the respective buttons on the side panel.

Filtering for Responses

Filtering for responses is vital because it allows you to replay or recreate the real-world conditions under which a response was generated. For example, with Confident AI you can filter for responses where data was retrieved from a particular knowledge base or data lake.

On Confident AI, you can filter responses on many criteria, individually or in combination, ranging from your Distinct (user) IDs and Conversation IDs to associated human feedback information such as the rating, provider, explanation, and expected response. You can also filter responses based on the hyperparameters and/or custom data you have logged.

info

Use the date selector to choose a timeframe for the responses you wish to view. This feature is particularly useful when assigning data annotation tasks to your labelers.

Here is an example of a list of filtered responses with a rating of 1!

[Screenshot: list of filtered responses with a rating of 1]