Evaluation Testing

Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.

One approach to evaluating a response is to use an AI model itself for the evaluation. Select the AI model best suited for the evaluation, which may be a different model than the one used to generate the response.

The Spring AI interface for evaluating a response is Evaluator, defined as:

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}

The input to the evaluation is defined by EvaluationRequest:

public class EvaluationRequest {

	private final String userText;

	private final List<Content> dataList;

	private final String responseContent;

	public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
		this.userText = userText;
		this.dataList = dataList;
		this.responseContent = responseContent;
	}

  ...
}
  • userText: the original input from the user, as a String.

  • dataList: contextual data appended to the original input, such as data from Retrieval Augmented Generation.

  • responseContent: the AI model's response content, as a String (see the construction sketch below).
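
A minimal sketch (with hypothetical values and no framework wiring) of how these three fields map onto the EvaluationRequest constructor:

List<Content> retrievedDocs = List.of();   // dataList: context documents, e.g. retrieved from a vector store
String answer = "...";                     // responseContent: the AI model's answer to be evaluated
EvaluationRequest request = new EvaluationRequest(
        "What is the purpose of Carina?",  // userText: the user's original query
        retrievedDocs,
        answer);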

RelevancyEvaluator

One implementation is the RelevancyEvaluator, which uses an AI model for evaluation. Additional implementations will be available in future releases.

The RelevancyEvaluator uses the input (userText) and the AI model's output (chatResponse) to ask the question:

Your task is to evaluate if the response for the query
is in line with the context information provided.\n
You have two options to answer. Either YES/ NO.\n
Answer - YES, if the response for the query
is in line with context information otherwise NO.\n
Query: \n {query}\n
Response: \n {response}\n
Context: \n {context}\n
Answer: "

Below is an example of a JUnit test that performs a RAG query against a PDF document loaded into a vector store and then evaluates whether the response is relevant to the user text.

@Test
void testEvaluation() {

    // Reset and reload the reference documents into the vector store
    dataController.delete();
    dataController.load();

    String userText = "What is the purpose of Carina?";

    // Perform a RAG query using the QuestionAnswerAdvisor backed by the vector store
    ChatResponse response = ChatClient.builder(chatModel)
            .build().prompt()
            .advisors(new QuestionAnswerAdvisor(vectorStore, SearchRequest.defaults()))
            .user(userText)
            .call()
            .chatResponse();
    String responseContent = response.getResult().getOutput().getContent();

    // Use a second ChatClient call to judge the relevancy of the answer
    var relevancyEvaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    // Build the evaluation request from the query, the retrieved context documents, and the answer
    EvaluationRequest evaluationRequest = new EvaluationRequest(userText,
            (List<Content>) response.getMetadata().get(QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS), responseContent);

    EvaluationResponse evaluationResponse = relevancyEvaluator.evaluate(evaluationRequest);

    assertTrue(evaluationResponse.isPass(), "Response is not relevant to the question");

}

The code above comes from the sample application located here.

FactCheckingEvaluator

The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI output by verifying whether a given statement (the claim) is logically supported by the provided context (the document).

The claim and the document are submitted to an AI model for evaluation. Smaller, more efficient models dedicated to this purpose are available, such as Bespoke's Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact checking:

Document: {document}
Claim: {claim}

where {document} is the context information and {claim} is the AI model's response to be evaluated.

Here is an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically with the Bespoke-Minicheck model:

@Test
void testFactChecking() {
  // Set up the Ollama API
  OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

  ChatModel chatModel = new OllamaChatModel(ollamaApi,
      OllamaOptions.builder().withModel(BESPOKE_MINICHECK).withNumPredict(2).withTemperature(0.0d).build());

  // Create the FactCheckingEvaluator
  var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

  // Example context and claim
  String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
  String claim = "The Earth is the fourth planet from the Sun.";

  // Create an EvaluationRequest
  EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

  // Perform the evaluation
  EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

  assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");

}
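
For comparison, a claim that the context does support should pass the evaluation. A minimal sketch of the complementary case, assuming the same evaluator setup as in the test above:

  // Hypothetical variation: a claim that is supported by the context
  String supportedClaim = "The Earth is the third planet from the Sun.";

  EvaluationRequest supportedRequest = new EvaluationRequest(context, Collections.emptyList(), supportedClaim);
  EvaluationResponse supportedResponse = factCheckingEvaluator.evaluate(supportedRequest);

  assertTrue(supportedResponse.isPass(), "The claim should be supported by the context");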