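[observability] [evaluation01] AIMon's LlamaIndex Extension for LLM Response Evaluation

Case Objectives

This case shows how to use AIMon's evaluators to assess the quality and accuracy of responses generated by a large language model (LLM) in the LlamaIndex framework. The main objectives are to:

- Demonstrate how AIMon's hallucination evaluator identifies generated information that is not supported by the context
- Show how the guideline evaluator checks that the model's response follows predefined instructions and guidelines
- Introduce the context relevance evaluator, which assesses how relevant and accurate the provided context is in supporting the model's response
- Provide a complete evaluation workflow for a RAG (retrieval-augmented generation) application

Note: This case focuses specifically on using the hallucination, guideline, and context relevance evaluators to evaluate a RAG application.

Tech Stack and Core Dependencies

This case uses the following tech stack and core dependencies:

- LlamaIndex
- AIMon
- OpenAI API
- datasets
- requests

Main packages:

```shell
pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai
```

Core components:

- AIMon evaluators: the hallucination, guideline, and context relevance evaluators
- LlamaIndex: the core framework used to build the RAG application
- OpenAI models: gpt-4o-mini as the LLM and text-embedding-3-small as the embedding model
- MeetingBank dataset: a dataset of meeting transcripts, used as the context information

Environment Setup

1. Install dependencies

```python
%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai
```

2. Configure API keys

```python
import os
import json
from google.colab import userdata

# Read the OpenAI key from Colab secrets and expose it to the OpenAI client
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
```

Important: OPENAI_API_KEY and AIMON_API_KEY must both be configured in Google Colab secrets, and the notebook must be granted access to them. An AIMon API key can be obtained from the AIMon website.

3. Load the dataset

```python
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")
```

Case Implementation

1. Data preparation

Extract the meeting transcripts from the MeetingBank dataset and convert them into LlamaIndex Document objects:

```python
from llama_index.core import Document


def extract_and_create_documents(transcripts):
    """Wrap each transcript string in a LlamaIndex Document."""
    documents = []
    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)
        except Exception as e:
            # Report the failure and continue with the remaining transcripts
            print(f"Failed to create document: {e}")
    return documents


transcripts = [meeting["transcript"] for meeting in meetingbank["train"]]
documents = extract_and_create_documents(transcripts[:5])  # Use only 5 transcripts to keep the example small
```

2. Build the vector index

Set up the embedding model, generate the document embeddings, and build the index and retriever:

```python
from llama_index.embeddings.openai import OpenAIEmbedding
from aimon_llamaindex import generate_embeddings_for_docs, build_index, build_retriever

embedding_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100,
    max_retries=3,
)

nodes = generate_embeddings_for_docs(documents, embedding_model)
index = build_index(nodes)
retriever = build_retriever(index, similarity_top_k=5)
```

3. Configure the LLM

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="Please be professional and polite. Answer the user's question in a single line. Even if the context lacks information to answer the question, make sure that you answer the user's question based on your own knowledge.",
)
```

4. Define the query and instructions

```python
user_query = "Which council bills were amended for zoning regulations?"
user_instructions = ["Keep the response concise, preferably under the 100 word limit."]

# Update the LLM's system prompt by appending the user instructions
llm.system_prompt += (
    f" Please comply to the following instructions {user_instructions}."
)
```

5. Get the response

```python
from aimon_llamaindex import get_response

llm_response = get_response(user_query, retriever, llm)
```

6. Configure the AIMon client

```python
from aimon import Client

aimon_client = Client(
    auth_header="Bearer {}".format(userdata.get("AIMON_API_KEY"))
)
```

7. Run the evaluations

Guideline evaluation:

```python
from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(
    user_query, llm_response, user_instructions
)
```

Hallucination detection:

```python
from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, llm_response)
```

Context relevance evaluation:

```python
from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = (
    "Find the relevance of the context data used to generate this response."
)
evaluation_result = evaluator.evaluate(user_query, llm_response, task_definition)
```

Case Results

Guideline evaluation result

The guideline evaluator checks whether the model's response follows the instructions supplied by the user. The result is:

```json
{
  "extractions": [],
  "instructions_list": [
    {
      "explanation": "",
      "follow_probability": 0.982,
      "instruction": "Keep the response concise, preferably under the 100 word limit.",
      "label": true
    }
  ],
  "score": 1.0
}
```

The score of 1.0, together with a follow probability of 0.982 for the single instruction, indicates that the model followed the instruction to keep the response concise.

Hallucination detection result

The hallucination evaluator identifies generated information that is not supported by the context:

```json
{
  "is_hallucinated": "False",
  "score": 0.22446,
  "sentences": [
    {
      "score": 0.22446,
      "text": "The council bills amended for zoning regulations include the small lot moratorium and the text amendment related to off-street parking exemptions for preexisting small lots. These amendments aim to balance the interests of local neighborhoods, health institutions, and developers."
    }
  ]
}
```

The hallucination score is 0.22446 on a scale of 0.0 to 1.0, and the response is not flagged as hallucinated. This suggests the response contains little unsupported content and is relatively reliable.

Context relevance evaluation result

The context relevance evaluator assesses how relevant the context data used to generate the response is:

```
[
  {
    "explanations": [
      "Document 1 discusses a council bill related to zoning regulations, specifically mentioning a text amendment that aims to balance neighborhood interests with developer needs. However, it primarily focuses on parking issues and personal experiences rather than detailing specific zoning regulation amendments or the council bills directly related to them, which makes it less relevant to the query.",
      "Document 2 mentions zoning and development issues, including the need for mass transit and affordability, but it does not provide specific information on which council bills were amended for zoning regulations...",
      // ... explanations for the other documents
    ],
    "query": "Which council bills were amended for zoning regulations?",
    "relevance_scores": [40.5, 40.25, 44.25, 38.5, 43.0]
  }
]
```

The evaluation returns a relevance score and an explanation for each retrieved document, which helps the user understand how closely the context data relates to the query. Here all five scores fall between 38.5 and 44.25, and the explanations describe the documents as only partly relevant to the query.

Implementation Approach

The implementation follows these steps:

- Environment setup: install the required libraries and configure the API keys so that the OpenAI and AIMon services are accessible.
- Data preparation: load meeting transcripts from the MeetingBank dataset and convert them into Document objects that LlamaIndex can process.
- Vector index construction: use OpenAI's embedding model to generate vector representations of the documents, then build a vector index to support efficient retrieval.
- LLM configuration: set up OpenAI's gpt-4o-mini model and configure the system prompt and user instructions so that the model generates responses as required.
- Response generation: use the retriever and the LLM to generate a response to the user query.
- Evaluation setup: configure the AIMon client in preparation for running the evaluators.
- Multi-dimensional evaluation: evaluate the generated response with the guideline, hallucination, and context relevance evaluators.
- Result analysis: analyze the evaluation results to understand the quality and reliability of the model's response.

Key idea: evaluating along several dimensions gives a fuller picture of the RAG system's performance, exposes likely problems, and guides optimization.

Extension Suggestions

1. Extend the evaluation dimensions

Beyond the three evaluators used in this case, AIMon provides others that are worth considering:

- Completeness evaluator: checks whether the response fully addresses every aspect of the query or task
- Conciseness evaluator: assesses whether the response is brief yet complete, without unnecessary verbosity
- Toxicity evaluator: flags harmful, offensive, or inappropriate language in the response

2. Batch evaluation

Extend the case to evaluate multiple queries and responses in a batch, giving a more comprehensive assessment of system performance:

```python
# Batch evaluation example.
# Reuses retriever, llm, user_instructions, task_definition, and the three
# evaluators created in the steps above (`evaluator` is the
# ContextRelevanceEvaluator).
queries = ["Query 1", "Query 2", "Query 3"]
results = []

for query in queries:
    response = get_response(query, retriever, llm)
    guideline_result = guideline_evaluator.evaluate(query, response, user_instructions)
    hallucination_result = hallucination_evaluator.evaluate(query, response)
    context_result = evaluator.evaluate(query, response, task_definition)

    results.append({
        "query": query,
        "response": response,
        "guideline_score": guideline_result["score"],
        "hallucination_score": hallucination_result["score"],
        "context_relevance": context_result,
    })
```

3. Visualize the evaluation results

Add visualizations to make the evaluation results easier to read:

- Use charts to show the score distribution for each evaluation dimension
- Create a dashboard that displays overall system performance metrics
- Add historical trend analysis of the evaluation results

4. Custom evaluators

Develop custom evaluators for specific needs:

- Evaluation criteria for a particular domain
- Evaluation rules that incorporate business logic
- Evaluation with multilingual support

5. Integrate a feedback loop

Feed the evaluation results back into the RAG system to optimize it automatically:

- Adjust the retrieval strategy based on the evaluation results
- Refine the prompts and system instructions
- Adjust model parameters dynamically

Summary

This case showed how to use AIMon's evaluators to assess the response quality of a RAG application built with LlamaIndex. With the guideline, hallucination, and context relevance evaluators, we can check that a generated response follows its instructions, is supported by the context, and draws on relevant context.

This kind of multi-dimensional evaluation matters for building reliable AI applications: it helps developers identify potential problems in the system and points to where improvement is needed. Continuous evaluation and optimization can then improve the RAG system's performance and user experience over time.

Core value: AIMon's evaluators give LLM applications a quality evaluation framework that helps developers build more reliable, accurate, and useful AI systems.

References

- AIMon official website
- AIMon documentation
- MeetingBank: A Benchmark Dataset for Meeting Summarization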
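
As a follow-up to the batch evaluation and feedback-loop suggestions above, the following is a minimal sketch of how collected evaluation records might be summarized. It does not call AIMon. It assumes each record is a plain dict shaped like the entries of `results` in the batch example, with `context_relevance` reduced to a list of per-document relevance scores as in the sample output. The function name `summarize_results` and both thresholds are hypothetical choices made for illustration, not AIMon recommendations.

```python
from statistics import mean

# Hypothetical thresholds, chosen only for illustration.
HALLUCINATION_THRESHOLD = 0.5  # flag responses scoring above this
RELEVANCE_THRESHOLD = 40.0     # flag retrieved documents scoring below this


def summarize_results(results):
    """Summarize a list of evaluation records.

    Each record is assumed to be a dict with the keys query,
    guideline_score, hallucination_score, and context_relevance
    (here: a list of per-document relevance scores).
    """
    if not results:
        raise ValueError("results must contain at least one record")

    # Queries whose response exceeds the hallucination threshold
    flagged_queries = [
        r["query"] for r in results
        if r["hallucination_score"] > HALLUCINATION_THRESHOLD
    ]
    # For each query, the indices of retrieved documents with low relevance
    low_relevance_docs = {
        r["query"]: [
            i for i, score in enumerate(r["context_relevance"])
            if score < RELEVANCE_THRESHOLD
        ]
        for r in results
    }
    return {
        "mean_guideline_score": mean(r["guideline_score"] for r in results),
        "mean_hallucination_score": mean(r["hallucination_score"] for r in results),
        "flagged_queries": flagged_queries,
        "low_relevance_docs": low_relevance_docs,
    }


# Scores copied from the sample outputs shown in the results section.
sample = [{
    "query": "Which council bills were amended for zoning regulations?",
    "guideline_score": 1.0,
    "hallucination_score": 0.22446,
    "context_relevance": [40.5, 40.25, 44.25, 38.5, 43.0],
}]
summary = summarize_results(sample)
print(summary)
```

With the sample scores from this case, no query is flagged for hallucination, and only the fourth retrieved document (index 3, score 38.5) falls below the illustrative relevance threshold. A summary like this could feed the retrieval or prompt adjustments described in the feedback-loop suggestion.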