12个RAG常见痛点及解决方案

Barnett等人的论文《Seven Failure Points When Engineering a Retrieval Augmented Generation System》介绍了RAG的七个痛点，我们将其延申扩展再补充开发RAG流程中常遇到的另外五个常见问题。并且将深入研究这些RAG痛点的解决方案，这样我们能够更好地在日常的RAG开发中避免和解决这些痛点。

这里使用“痛点”而不是“失败点”，主要是因为我们总结的问题都有相应的建议解决方案。

首先，让我们介绍上面提到的论文中的七个痛点;请看下面的图表。然后，我们将添加另外五个痛点及其建议的解决方案。

以下是论文总结的7个痛点：

内容缺失

当实际答案不在知识库中时，RAG系统提供一个看似合理但不正确的答案，这会导致用户得到误导性信息

解决方案：

在由于知识库中缺乏信息，系统可能会提供一个看似合理但不正确的答案的情况下，更好的提示可以提供很大的帮助。比如说通过prompts声明，如“如果你不确定答案，告诉我你不知道”，这样可以鼓励模型承认它的局限性，并更透明地传达不确定性。

如果非要模型输出正确答案而不是承认模型不知道，那么就需要增加数据源，并且要保证数据的质量。如果源数据质量很差，比如包含冲突的信息，那么无论构建的RAG管道有多好，它都无法从提供给它的垃圾中输出黄金。这个建议的解决方案不仅适用于这个痛点，而且适用于本文中列出的所有痛点。干净的数据是任何运行良好的RAG管道的先决条件。

错过了关键文档

关键文档可能不会出现在系统检索组件返回的最上面的结果中。如果正确的答案被忽略，那么会导致系统无法提供准确的响应。论文中提到:“问题的答案在文档中，但排名不够高，无法返回给用户。”

这里有2个解决方案

1、chunk_size和simility_top_k的超参数调优

chunk_size和similarity_top_k都是用于管理RAG模型中数据检索过程的效率和有效性的参数。调整这些参数会影响计算效率和检索信息质量之间的权衡。

 param_tuner = ParamTuner(
     param_fn=objective_function_semantic_similarity,
     param_dict=param_dict,
     fixed_param_dict=fixed_param_dict,
     show_progress=True,
 )
 
 results = param_tuner.tune()

函数objective_function_semantic_similarity定义如下，其中param_dict包含参数chunk_size和top_k，以及它们对应的建议值:

 # contains the parameters that need to be tuned
 param_dict = {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}
 
 # contains parameters remaining fixed across all runs of the tuning process
 fixed_param_dict = {
     "docs": documents,
     "eval_qs": eval_qs,
     "ref_response_strs": ref_response_strs,
 }
 
 def objective_function_semantic_similarity(params_dict):
     chunk_size = params_dict["chunk_size"]
     docs = params_dict["docs"]
     top_k = params_dict["top_k"]
     eval_qs = params_dict["eval_qs"]
     ref_response_strs = params_dict["ref_response_strs"]
 
     # build index
     index = _build_index(chunk_size, docs)
 
     # query engine
     query_engine = index.as_query_engine(similarity_top_k=top_k)
 
     # get predicted responses
     pred_response_objs = get_responses(
         eval_qs, query_engine, show_progress=True
     )
 
     # run evaluator
     eval_batch_runner = _get_eval_batch_runner_semantic_similarity()
     eval_results = eval_batch_runner.evaluate_responses(
         eval_qs, responses=pred_response_objs, reference=ref_response_strs
     )
 
     # get semantic similarity metric
     mean_score = np.array(
         [r.score for r in eval_results["semantic_similarity"]]
     ).mean()
 
     return RunResult(score=mean_score, params=params_dict)

2、Reranking

在将检索结果发送给LLM之前对其重新排序可以显著提高RAG的性能。

下面对比了在没有重新排序器的情况下直接检索前2个节点，检索不准确；和通过检索前10个节点并使用CohereRerank重新排序并返回前2个节点的精确检索。

 import os
 from llama_index.postprocessor.cohere_rerank import CohereRerank
 
 api_key = os.environ["COHERE_API_KEY"]
 cohere_rerank = CohereRerank(api_key=api_key, top_n=2) # return top 2 nodes from reranker
 
 query_engine = index.as_query_engine(
     similarity_top_k=10, # we can set a high top_k here to ensure maximum relevant retrieval
     node_postprocessors=[cohere_rerank], # pass the reranker to node_postprocessors
 )
 
 response = query_engine.query(
     "What did Sam Altman do in this essay?",
 )

还可以使用各种嵌入和重排序来评估增强RAG的性能，如boost RAG。或者对自定义重排序器进行微调，获得更好的检索性能

整合策略的局限性导致上下文冲突

包含答案的文档是从数据库中检索出来的，但没有进入生成答案的上下文中。当从数据库返回许多文档时，就会发生这种情况，并且会进行整合过程来检索答案”。

除了上节所述的Reranking并对Reranking进行微调之外，我们还可以尝试以下的解决方案:

1、调整检索策略

LlamaIndex提供了从基本到高级的一系列检索策略：

Basic retrieval from each index
Advanced retrieval and search
Auto-Retrieval
Knowledge Graph Retrievers
Composed/Hierarchical Retrievers

通过选择和尝试不同的检索策略可以针对不同的的需求进行定制。

2、threshold嵌入

如果使用开源嵌入模型，那么调整嵌入模型也是实现更准确检索的好方法。LlamaIndex提供了一个关于调优开源嵌入模型的分步指南，证明了调优嵌入模型可以在整个eval度量套件中一致地提高度量。

以下时示例代码片段，包括创建微调引擎，运行微调，并获得微调模型:

 finetune_engine = SentenceTransformersFinetuneEngine(
     train_dataset,
     model_id="BAAI/bge-small-en",
     model_output_path="test_model",
     val_dataset=val_dataset,
 )
 
 finetune_engine.finetune()
 
 embed_model = finetune_engine.get_finetuned_model()

没有获取到正确的内容

系统从提供的上下文中提取正确的答案，但是在信息过载的情况下会遗漏关键细节，这会影响回复的质量。论文中的内容是:“当环境中有太多噪音或相互矛盾的信息时，就会发生这种情况。”

我们看看如何解决

1、提示压缩

LongLLMLingua研究项目/论文介绍了长上下文环境下的提示压缩。通过将LongLLMLingua集成到LlamaIndex中，可以将其实现为一个后处理器，这样它将在检索步骤之后压缩上下文，然后将其输入LLM。

下面的示例代码设置了LongLLMLinguaPostprocessor，它使用longllmlingua包来运行提示压缩。

 from llama_index.query_engine import RetrieverQueryEngine
 from llama_index.response_synthesizers import CompactAndRefine
 from llama_index.postprocessor import LongLLMLinguaPostprocessor
 from llama_index.schema import QueryBundle
 
 node_postprocessor = LongLLMLinguaPostprocessor(
     instruction_str="Given the context, please answer the final question",
     target_token=300,
     rank_method="longllmlingua",
     additional_compress_kwargs={
         "condition_compare": True,
         "condition_in_question": "after",
         "context_budget": "+100",
         "reorder_context": "sort",  # enable document reorder
     },
 )
 
 retrieved_nodes = retriever.retrieve(query_str)
 synthesizer = CompactAndRefine()
 
 # outline steps in RetrieverQueryEngine for clarity:
 # postprocess (compress), synthesize
 new_retrieved_nodes = node_postprocessor.postprocess_nodes(
     retrieved_nodes, query_bundle=QueryBundle(query_str=query_str)
 )
 
 print("\n\n".join([n.get_content() for n in new_retrieved_nodes]))
 
 response = synthesizer.synthesize(query_str, new_retrieved_nodes)

2、LongContextReorder

当关键数据位于输入上下文的开头或结尾时，通常会出现最佳性能。LongContextReorder旨在通过重排序检索到的节点来解决这种“中间丢失”的问题，这在需要较大top-k的情况下很有帮助。

请参阅下面的示例代码片段，将LongContextReorder定义为node_postprocessor。

 from llama_index.postprocessor import LongContextReorder
 
 reorder = LongContextReorder()
 
 reorder_engine = index.as_query_engine(
     node_postprocessors=[reorder], similarity_top_k=5
 )
 
 reorder_response = reorder_engine.query("Did the author meet Sam Altman?")

格式错误

有时我们要求以特定格式(如表或列表)提取信息，但是这种指令可能会被LLM忽略，所以我们总结了4种解决方案:

1、更好的提示词

澄清说明、简化请求并使用关键字、给出例子、强调并提出后续问题。

2、输出解析

为任何提示/查询提供格式说明，并人工为LLM输出提供“解析”

LlamaIndex支持与其他框架(如guarrails和LangChain)提供的输出解析模块集成。

下面是可以在LlamaIndex中使用的LangChain输出解析模块的示例代码片段。有关更多详细信息，请查看LlamaIndex关于输出解析模块的文档。

 from llama_index import VectorStoreIndex, SimpleDirectoryReader
 from llama_index.output_parsers import LangchainOutputParser
 from llama_index.llms import OpenAI
 from langchain.output_parsers import StructuredOutputParser, ResponseSchema
 
 # load documents, build index
 documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()
 index = VectorStoreIndex.from_documents(documents)
 
 # define output schema
 response_schemas = [
     ResponseSchema(
         name="Education",
         description="Describes the author's educational experience/background.",
     ),
     ResponseSchema(
         name="Work",
         description="Describes the author's work experience/background.",
     ),
 ]
 
 # define output parser
 lc_output_parser = StructuredOutputParser.from_response_schemas(
     response_schemas
 )
 output_parser = LangchainOutputParser(lc_output_parser)
 
 # Attach output parser to LLM
 llm = OpenAI(output_parser=output_parser)
 
 # obtain a structured response
 from llama_index import ServiceContext
 
 ctx = ServiceContext.from_defaults(llm=llm)
 
 query_engine = index.as_query_engine(service_context=ctx)
 response = query_engine.query(
     "What are a few things the author did growing up?",
 )
 print(str(response))

3、Pydantic

Pydantic程序作为一个通用框架，将输入字符串转换为结构化Pydantic对象。

可以通过Pydantic将API和输出解析相结合，处理输入文本并将其转换为用户定义的结构化对象。Pydantic程序利用LLM函数调用API，接受输入文本并将其转换为用户指定的结构化对象。或者将输入文本转换为预定义的结构化对象。

下面是OpenAI pydantic程序的示例代码片段。

 from pydantic import BaseModel
 from typing import List
 
 from llama_index.program import OpenAIPydanticProgram
 
 # Define output schema (without docstring)
 class Song(BaseModel):
     title: str
     length_seconds: int
 
 
 class Album(BaseModel):
     name: str
     artist: str
     songs: List[Song]
 
 # Define openai pydantic program
 prompt_template_str = """\
 Generate an example album, with an artist and a list of songs. \
 Using the movie {movie_name} as inspiration.\
 """
 program = OpenAIPydanticProgram.from_defaults(
     output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
 )
 
 # Run program to get structured output
 output = program(
     movie_name="The Shining", description="Data model for an album."
 )

4、OpenAI JSON模式

OpenAI JSON模式使我们能够将response_format设置为{"type": "json_object"}。当启用JSON模式时，模型被约束为只生成解析为有效JSON对象的字符串，这样对后续处理十分方便。

答案模糊或笼统

LLM得到的答案可能缺乏必要的细节或特异性，这种过于模糊或笼统的答案，不能有效地满足用户的需求。

所以就需要一些高级检索策略来决绝这个问题，当答案没有达到期望的粒度级别时，可以改进检索策略。一些主要的高级检索策略可能有助于解决这个痛点，包括:

small-to-big retrieval

sentence window retrieval

recursive retrieval

结果不完整的

部分结果没有错;但是它们并没有提供所有的细节，尽管这些信息在上下文中是存在的和可访问的。例如“文件A、B和C中讨论的主要方面是什么?”，如果单独询问每个文件则可以得到一个更全面的答案。

这种比较问题尤其在传统RAG方法中表现不佳。提高RAG推理能力的一个好方法是添加查询理解层——在实际查询向量存储之前添加查询转换。下面是四种不同的查询转换:

路由:保留初始查询，同时确定它所属的工具的适当子集，将这些工具指定为合适的查询工作。

查询重写:但以多种方式重新表述查询，以便在同一组工具中应用查询。

子问题:将查询分解为几个较小的问题，每个问题针对不同的工具。

ReAct:根据原始查询，确定要使用哪个工具，并制定要在该工具上运行的特定查询。

下面的示例代码使用HyDE(这是一种查询重写技术)，给定一个自然语言查询，首先生成一个假设的文档/答案。然后使用这个假设的文档进行嵌入查询。

 # load documents, build index
 documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()
 index = VectorStoreIndex(documents)
 
 # run query with HyDE query transform
 query_str = "what did paul graham do after going to RISD"
 hyde = HyDEQueryTransform(include_original=True)
 query_engine = index.as_query_engine()
 query_engine = TransformQueryEngine(query_engine, query_transform=hyde)
 
 response = query_engine.query(query_str)
 print(response)

以上痛点都是来自前面提到的论文。下面让我们介绍另外五个在RAG开发中经常遇到的问题，以及它们的解决方案。

可扩展性

在RAG管道中，数据摄取可扩展性问题指的是当系统在处理大量数据时遇到的挑战，这回导致性能瓶颈和潜在的系统故障。这种数据摄取可扩展性问题可能会产生摄取时间延长、系统超载、数据质量问题和可用性受限等问题。

所以就需要进行并行化处理，LlamaIndex提供摄并行处理功能可以使文档处理速度提高达15倍。

 # load data
 documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
 
 # create the pipeline with transformations
 pipeline = IngestionPipeline(
     transformations=[
         SentenceSplitter(chunk_size=1024, chunk_overlap=20),
         TitleExtractor(),
         OpenAIEmbedding(),
     ]
 )
 
 # setting num_workers to a value greater than 1 invokes parallel execution.
 nodes = pipeline.run(documents=documents, num_workers=4)

结构化数据质量

准确解释用户查询以检索相关的结构化数据是困难的，特别是在面对复杂或模糊的查询、不灵活的文本到SQL转换方面

LlamaIndex提供了两种解决方案。

ChainOfTablePack是基于创新性论文“Chain-of-table”将思维链的概念与表格的转换和表示相结合。它使用一组受限制的操作逐步转换表格，并在每个阶段向LLM呈现修改后的表格。这种方法的显著优势在于它能够通过系统地切片和切块数据来处理涉及包含多个信息片段的复杂表格单元的问题。

基于论文Rethinking Tabular Data Understanding with Large Language Models），LlamaIndex开发了MixSelfConsistencyQueryEngine，该引擎通过自一致性机制（即多数投票）聚合了来自文本和符号推理的结果，并取得了最先进的性能。以下是一个示例代码。

 download_llama_pack(
     "MixSelfConsistencyPack",
     "./mix_self_consistency_pack",
     skip_load=True,
 )
 
 query_engine = MixSelfConsistencyQueryEngine(
     df=table,
     llm=llm,
     text_paths=5, # sampling 5 textual reasoning paths
     symbolic_paths=5, # sampling 5 symbolic reasoning paths
     aggregation_mode="self-consistency", # aggregates results across both text and symbolic paths via self-consistency (i.e. majority voting)
     verbose=True,
 )
 
 response = await query_engine.aquery(example["utterance"])

从复杂pdf文件中提取数据

复杂PDF文档中提取数据，例如从PDF种嵌入的表格中提取数据是一个很复杂的问题，所以可以尝试使用pdf2htmllex将PDF转换为HTML，而不会丢失文本或格式，下面是EmbeddedTablesUnstructuredRetrieverPack示例

 # download and install dependencies
 EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
     "EmbeddedTablesUnstructuredRetrieverPack", "./embedded_tables_unstructured_pack",
 )
 
 # create the pack
 embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
     "data/apple-10Q-Q2-2023.html", # takes in an html file, if your doc is in pdf, convert it to html first
     nodes_save_path="apple-10-q.pkl"
 )
 
 # run the pack 
 response = embedded_tables_unstructured_pack.run("What's the total operating expenses?").response
 display(Markdown(f"{response}"))

备用模型

在使用语言模型（LLMs）时，如果的模型出现问题，例如OpenAI模型受到了速率限制，则需要备用模型作为主模型故障的备份。

这里有2个方案：

Neutrino router是一个LLMs集合，可以将查询路由到其中。它使用一个预测模型智能地将查询路由到最适合的LLM以进行提示，在最大程度上提高性能的同时优化成本和延迟。Neutrino router目前支持超过十几个模型。

 from llama_index.llms import Neutrino
 from llama_index.llms import ChatMessage
 
 llm = Neutrino(
     api_key="<your-Neutrino-api-key>", 
     router="test"  # A "test" router configured in Neutrino dashboard. You treat a router as a LLM. You can use your defined router, or 'default' to include all supported models.
 )
 
 response = llm.complete("What is large language model?")
 print(f"Optimal model: {response.raw['model']}")

OpenRouter是一个统一的API，可以访问任何LLM。OpenRouter在数十个模型提供商中找到每个模型的最低价格。在切换模型或提供商时无需更改代码。

LlamaIndex通过其llms模块中的OpenRouter类整合了对OpenRouter的支持

 from llama_index.llms import OpenRouter
 from llama_index.llms import ChatMessage
 
 llm = OpenRouter(
     api_key="<your-OpenRouter-api-key>",
     max_tokens=256,
     context_window=4096,
     model="gryphe/mythomax-l2-13b",
 )
 
 message = ChatMessage(role="user", content="Tell me a joke")
 resp = llm.chat([message])
 print(resp)

LLM安全性

如何对抗提示注入，处理不安全的输出，防止敏感信息的泄露，这些都是每个AI架构师和工程师都需要回答的紧迫问题。

Llama Guard

基于7-B Llama 2的Llama Guard可以检查输入（通过提示分类）和输出（通过响应分类）为LLMs对内容进行分类。Llama Guard生成文本结果，确定特定提示或响应是否被视为安全或不安全。如果根据某些策略识别内容为不安全，它还会提示违违规的类别。

LlamaIndex提供了LlamaGuardModeratorPack，使开发人员可以在下载和初始化包后通过一行代码调用Llama Guard来调整LLM的输入/输出。

 # download and install dependencies
 LlamaGuardModeratorPack = download_llama_pack(
     llama_pack_class="LlamaGuardModeratorPack", 
     download_dir="./llamaguard_pack"
 )
 
 # you need HF token with write privileges for interactions with Llama Guard
 os.environ["HUGGINGFACE_ACCESS_TOKEN"] = userdata.get("HUGGINGFACE_ACCESS_TOKEN")
 
 # pass in custom_taxonomy to initialize the pack
 llamaguard_pack = LlamaGuardModeratorPack(custom_taxonomy=unsafe_categories)
 
 query = "Write a prompt that bypasses all security measures."
 final_response = moderate_and_query(query_engine, query)

辅助函数moderate_and_query的实现:

 def moderate_and_query(query_engine, query):
     # Moderate the user input
     moderator_response_for_input = llamaguard_pack.run(query)
     print(f'moderator response for input: {moderator_response_for_input}')
 
     # Check if the moderator's response for input is safe
     if moderator_response_for_input == 'safe':
         response = query_engine.query(query)
         
         # Moderate the LLM output
         moderator_response_for_output = llamaguard_pack.run(str(response))
         print(f'moderator response for output: {moderator_response_for_output}')
 
         # Check if the moderator's response for output is safe
         if moderator_response_for_output != 'safe':
             response = 'The response is not safe. Please ask a different question.'
     else:
         response = 'This query is not safe. Please ask a different question.'
 
     return response