
Evaluation and Monitoring

Basic concept

Evaluation and Monitoring is a key aspect of staying on top of your LLM-based application. It is necessary if you want to make claims about your LLM's performance that are backed by data rather than guesswork. Like most things in GuardOps, Evaluation and Monitoring is directly linked to a project.

Evaluation

An evaluation can be something like a predefined test that determines a model's performance in summarizing text. The evaluation itself defines the steps required to perform one iteration of the evaluation, called a run or evaluation item. For every evaluation, multiple runs can be performed at any interval and then monitored over time. This allows data-backed judgement of your application's performance.
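
To make this relationship concrete, the following minimal Python sketch models an evaluation that accumulates runs over time. The class and field names are purely illustrative and are not part of the GuardOps API.

```python
# Illustrative only: these class names are not part of the GuardOps API.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EvaluationRun:
    """One iteration of an evaluation (a run / evaluation item)."""
    executed_at: datetime
    score: float


@dataclass
class Evaluation:
    """A predefined test, e.g. summarization quality, tracked over multiple runs."""
    name: str
    runs: list[EvaluationRun] = field(default_factory=list)

    def record_run(self, score: float) -> None:
        # Every run adds one data point that can later be monitored over time.
        self.runs.append(EvaluationRun(executed_at=datetime.now(), score=score))


summarization = Evaluation(name="text-summarization")
summarization.record_run(score=0.82)  # e.g. triggered by a nightly evaluation job
```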

Evaluation can be done in two ways: you can use the Python Library provided with GuardOps, or you can build simplified, less tunable evaluations using Flows in the Frontend.

Evaluation using the Python Library

The Python Library provides a more granular framework with predefined evaluation methods already implemented. In addition to these readily available methods, the library's core components are designed so that users can extend them with their own implementations of evaluation methods. Refer to the How-To-Guide and Usage Examples for code examples.
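
To illustrate what such an extension could look like, here is a hedged sketch of a custom evaluation method. The protocol shape, class names, and scoring logic are assumptions made for illustration and do not reflect the actual GuardOps interface; consult the How-To-Guide and Usage Examples for the real API.

```python
# Hypothetical sketch: the protocol, class names, and scoring logic below are
# assumptions for illustration, not the actual GuardOps interface.
from typing import Protocol


class EvaluationMethod(Protocol):
    """Assumed shape of an evaluation method: score one model output."""
    def evaluate(self, reference: str, prediction: str) -> float: ...


class KeywordCoverage:
    """Custom method example: fraction of expected keywords found in the output."""

    def __init__(self, keywords: list[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, reference: str, prediction: str) -> float:
        hits = sum(1 for k in self.keywords if k in prediction.lower())
        return hits / len(self.keywords) if self.keywords else 0.0


method: EvaluationMethod = KeywordCoverage(["revenue", "forecast", "q3"])
print(method.evaluate(reference="...", prediction="Q3 revenue forecast improved."))
```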

Evaluation using Flows

The Frontend provides a flow-based integration for the predefined methods offered by the Python Library. This allows users without Python knowledge to use the library in a simplified form. In the future, Flows for established evaluation methods will be available for one-click import.

Monitoring

Monitoring is important to get an overview of an application's performance over time. Monitoring in GuardOps happens entirely on the frontend side and does not use the Python Library. For every project, monitoring shows a chart for each evaluation and the results of its runs.

In the future, a sandbox-based approach for picking the exact data you want to display is planned; for now, monitoring focuses on visualizing an evaluation's total score over time.