Evaluation Testing
Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.
One method of evaluating the response is to use an AI model itself as the evaluator. Select the AI model best suited for the evaluation, which may not be the same model used to generate the response.
The Spring AI interface for evaluating responses is Evaluator, defined as:
@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}
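For illustration, a custom Evaluator can be as simple as the following sketch, which passes whenever the response content is non-blank. This class is not part of Spring AI; the EvaluationResponse constructor used here (pass flag, feedback, metadata map) and the getResponseContent() accessor are assumptions about the current API, so verify them against your version.
public class NonEmptyResponseEvaluator implements Evaluator {

    @Override
    public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
        // Trivial check: pass only when the model actually produced some content
        String responseContent = evaluationRequest.getResponseContent();
        boolean pass = responseContent != null && !responseContent.isBlank();
        return new EvaluationResponse(pass, pass ? "" : "The response was empty", Collections.emptyMap());
    }
}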
The input to the evaluation is the EvaluationRequest, defined as:
public class EvaluationRequest {

    private final String userText;
    private final List<Content> dataList;
    private final String responseContent;

    public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
        this.userText = userText;
        this.dataList = dataList;
        this.responseContent = responseContent;
    }

    ...
}
- userText: The raw input from the user as a String.
- dataList: Contextual data, such as from Retrieval Augmented Generation, appended to the raw input.
- responseContent: The AI model’s response content as a String.
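As a minimal sketch (not taken from the sample application), an EvaluationRequest can be assembled by hand from these three pieces. The context and answer strings below are placeholders, and Document is used for the context entries because it implements Content in this version, mirroring the cast in the RelevancyEvaluator test further below.
String userText = "What is the purpose of Carina?";
List<Content> dataList = List.<Content>of(new Document("Placeholder text retrieved from the vector store"));
String responseContent = "Placeholder answer produced by the chat model";

EvaluationRequest evaluationRequest = new EvaluationRequest(userText, dataList, responseContent);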
RelevancyEvaluator
One implementation is the RelevancyEvaluator, which uses the AI model for evaluation. More implementations will be available in future releases.
The RelevancyEvaluator uses the input (userText) and the AI model’s output (chatResponse) to ask the question:
Your task is to evaluate if the response for the query
is in line with the context information provided.\n
You have two options to answer. Either YES/ NO.\n
Answer - YES, if the response for the query
is in line with context information otherwise NO.\n
Query: \n {query}\n
Response: \n {response}\n
Context: \n {context}\n
Answer: "
Here is an example of a JUnit test that performs a RAG query over a PDF document loaded into a Vector Store and then evaluates whether the response is relevant to the user text.
@Test
void testEvaluation() {

    dataController.delete();
    dataController.load();

    String userText = "What is the purpose of Carina?";

    ChatResponse response = ChatClient.builder(chatModel)
            .build().prompt()
            .advisors(new QuestionAnswerAdvisor(vectorStore, SearchRequest.defaults()))
            .user(userText)
            .call()
            .chatResponse();
    String responseContent = response.getResult().getOutput().getContent();

    var relevancyEvaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationRequest evaluationRequest = new EvaluationRequest(userText,
            (List<Content>) response.getMetadata().get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS), responseContent);

    EvaluationResponse evaluationResponse = relevancyEvaluator.evaluate(evaluationRequest);

    assertTrue(evaluationResponse.isPass(), "Response is not relevant to the question");
}
The code above is from the example application located here.
FactCheckingEvaluator
The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying whether a given statement (claim) is logically supported by the provided context (document).
The 'claim' and 'document' are presented to the AI model for evaluation. Smaller and more efficient AI models dedicated to this purpose are available, such as Bespoke’s Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available for use through Ollama.
Usage
The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:
public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
    this.chatClientBuilder = chatClientBuilder;
}
The evaluator uses the following prompt template for fact-checking:
Document: {document}
Claim: {claim}
Where {document} is the context information, and {claim} is the AI model’s response to be evaluated.
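Outside a test, the flow mirrors the RelevancyEvaluator, and as noted earlier the evaluation model does not have to be the model that generated the answer. The following sketch assumes evaluationChatModel, ragContext, and aiResponseContent are defined elsewhere in your application; it is not taken from the sample code.
var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(evaluationChatModel));

EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(
        new EvaluationRequest(ragContext, Collections.emptyList(), aiResponseContent));

if (!evaluationResponse.isPass()) {
    // The claim is not supported by the context; flag or regenerate the answer
}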
Example
Here’s an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:
@Test
void testFactChecking() {
    // Set up the Ollama API
    OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

    ChatModel chatModel = new OllamaChatModel(ollamaApi,
            OllamaOptions.builder().withModel(BESPOKE_MINICHECK).withNumPredict(2).withTemperature(0.0d).build());

    // Create the FactCheckingEvaluator
    var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

    // Example context and claim
    String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
    String claim = "The Earth is the fourth planet from the Sun.";

    // Create an EvaluationRequest
    EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

    // Perform the evaluation
    EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

    assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}