2024 RetrievalAugmentedGenerationfor

(Gao et al., 2024) ⇒ Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. (2024). “Retrieval-Augmented Generation for Large Language Models: A Survey.” doi:10.48550/arXiv.2312.10997

Subject Headings: RAG-based Algorithm, RAG-based System.

Notes

It details RAG's integration in LLMs for handling challenges like hallucination and outdated knowledge, enhancing accuracy by merging intrinsic knowledge with dynamic external databases.
It explores RAG's evolution across Naive, Advanced, and Modular frameworks, focusing on improvements in retrieval, generation, and augmentation techniques.
It highlights RAG's role in mitigating LLM limitations for domain-specific queries through external data retrieval, enhancing response accuracy and relevance.
It delineates the progression of RAG research, from initial knowledge assimilation efforts to a hybrid approach combining RAG and fine-tuning for LLM controllability.
It emphasizes RAG's systematic approach, incorporating cutting-edge retrieval and integration methods, and introduces evaluation metrics for RAG models.
It breaks down RAG's framework into distinct paradigms, discussing improvements in retrieval quality and the introduction of novel modules like Search and Memory.
It delves into RAG's generation phase, discussing strategies for post-retrieval processing and LLM fine-tuning to enhance response quality and relevance.
It discusses RAG's augmentation stage, detailing pre-training, fine-tuning, and inference stages, and the use of structured and unstructured data for improved context.
It compares RAG and fine-tuning in LLM optimization, highlighting their differences in knowledge updates, model customization, and computational resource requirements.
It concludes with future prospects for RAG, outlining ongoing challenges, expansion into multimodal domains, and the growing ecosystem of RAG technologies.

Cited By

http://scholar.google.com/scholar?q=%222024%22+Retrieval-Augmented+Generation+for+Large+Language+Models%3A+A+Survey

Quotes

Abstract

Large Language Models (LLMs) demonstrate significant capabilities but face challenges such as hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the models, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs' intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation, and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces the metrics and benchmarks for assessing RAG models, along with the most up-to-date evaluation framework. In conclusion, the paper delineates prospective avenues for research, including the identification of challenges, the expansion of multi-modalities, and the progression of the RAG infrastructure and its ecosystem.

1 Introduction

Large language models (LLMs) such as the GPT series [Brown et al., 2020, OpenAI, 2023] and the LLaMA series [Touvron et al., 2023], along with other models like Gemini [Google, 2023], have achieved remarkable success in natural language processing, demonstrating superior performance on various benchmarks including SuperGLUE [Wang et al., 2019], MMLU [Hendrycks et al., 2020], and BIG-bench [Srivastava et al., 2022]. Despite these advancements, LLMs exhibit notable limitations, particularly in handling domain-specific or highly specialized queries [Kandpal et al., 2023]. A common issue is the generation of incorrect information, or “hallucinations” [Zhang et al., 2023b], especially when queries extend beyond the model’s training data or necessitate up-to-date information. These shortcomings underscore the impracticality of deploying LLMs as black-box solutions in real-world production environments without additional safeguards. One promising approach to mitigate these limitations is Retrieval-Augmented Generation (RAG), which integrates external data retrieval into the generative process, thereby enhancing the model’s ability to provide accurate and relevant responses. RAG, introduced by Lewis et al. [Lewis et al., 2020] in mid-2020, stands as a paradigm within the realm of LLMs, enhancing generative tasks. Specifically, RAG involves an initial retrieval step where the LLMs query an external data source to obtain relevant information before proceeding to answer questions or generate text. This process not only informs the subsequent generation phase but also ensures that the responses are grounded in retrieved evidence, thereby significantly enhancing the accuracy and relevance of the output. The dynamic retrieval of information from knowledge bases during the inference phase allows RAG to address issues such as the generation of factually incorrect content, commonly referred to as “hallucinations.” The integration of RAG into LLMs has seen rapid adoption and has become a pivotal technology in refining the capabilities of chatbots and rendering LLMs more viable for practical applications.

The evolutionary trajectory of RAG unfolds across four distinctive phases, as illustrated in Figure 1. In its inception in 2017, aligned with the emergence of the Transformer architecture, the primary thrust was on assimilating additional knowledge through Pre-Training Models (PTM) to augment language models. This epoch witnessed RAG’s foundational efforts predominantly directed at optimizing pre-training methodologies.

Following this initial phase, a period of relative dormancy ensued before the advent of ChatGPT, during which there was minimal advancement in related research for RAG. The subsequent arrival of ChatGPT marked a pivotal moment in the trajectory, propelling LLMs into the forefront. The community’s focal point shifted towards harnessing the capabilities of LLMs to attain heightened controllability and address evolving requirements. Consequently, the lion’s share of RAG endeavors concentrated on inference, with a minority dedicated to fine-tuning processes. As LLM capabilities continued to advance, especially with the introduction of GPT-4, the landscape of RAG technology underwent a significant transformation. The emphasis evolved into a hybrid approach, combining the strengths of RAG and fine-tuning, alongside a dedicated minority continuing the focus on optimizing pre-training methodologies.

Figure 1: Technology tree of RAG research development featuring representative works

Despite the rapid growth of RAG research, there has been a lack of systematic consolidation and abstraction in the field, which poses challenges in understanding the comprehensive landscape of RAG advancements. This survey aims to outline the entire RAG process and encompass the current and future directions of RAG research, by providing a thorough examination of retrieval augmentation in LLMs.

Therefore, this paper aims to comprehensively summarize and organize the technical principles, developmental history, content, and, in particular, the relevant methods and applications after the emergence of LLMs, as well as the evaluation methods and application scenarios of RAG. It seeks to provide a comprehensive overview and analysis of existing RAG technologies and offer conclusions and prospects for future development methods. This survey intends to furnish readers and practitioners with a thorough and systematic comprehension of large models and RAG, elucidate the progression and key technologies of retrieval augmentation, clarify the merits and limitations of various technologies along with their suitable contexts, and forecast potential future developments.

Our contributions are as follows:

We present a thorough and systematic review of the state-of-the-art RAG, delineating its evolution through paradigms including naive RAG, advanced RAG, and modular RAG. This review contextualizes the broader scope of RAG research within the landscape of LLMs.
We identify and discuss the central technologies integral to the RAG process, specifically focusing on the aspects of “Retrieval”, “Generation” and “Augmentation”, and delve into their synergies, elucidating how these components intricately collaborate to form a cohesive and effective RAG framework.
We construct a thorough evaluation framework for RAG, outlining the evaluation objectives and metrics. Our comparative analysis clarifies the strengths and weaknesses of RAG compared to fine-tuning from various perspectives. Additionally, we anticipate future directions for RAG, emphasizing potential enhancements to tackle current challenges, expansions into multi-modal settings, and the development of its ecosystem.

The paper unfolds as follows: Sections 2 and 3 define RAG and detail its developmental process. Sections 4 through 6 explore the core components, “Retrieval”, “Generation”, and “Augmentation”, highlighting diverse embedded technologies. Section 7 focuses on RAG’s evaluation system. Section 8 compares RAG with other LLM optimization methods and suggests potential directions for its evolution. The paper concludes in Section 9.

2 Definition

The definition of RAG can be summarized from its workflow. Figure 2 depicts a typical RAG application workflow. In this scenario, a user asks ChatGPT about a recent high-profile event (i.e., the abrupt dismissal and reinstatement of OpenAI’s CEO) which generated considerable public discourse. ChatGPT, as the most renowned and widely utilized LLM, is constrained by its pretraining data and lacks knowledge of recent events. RAG addresses this gap by retrieving up-to-date document excerpts from external knowledge bases. In this instance, it procures a selection of news articles pertinent to the inquiry. These articles, alongside the initial question, are then amalgamated into an enriched prompt that enables ChatGPT to synthesize an informed response. This example illustrates the RAG process, demonstrating its capability to enhance the model’s responses with real-time information retrieval.

Figure 2: A representative instance of the RAG process applied to question answering

Technologically, RAG has been enriched through various innovative approaches addressing pivotal questions such as “what to retrieve”, “when to retrieve”, and “how to use the retrieved information”. For “what to retrieve”, research has progressed from simple token [Khandelwal et al., 2019] and entity retrieval [Nishikawa et al., 2022] to more complex structures like chunks [Ram et al., 2023] and knowledge graphs [Kang et al., 2023], with studies focusing on the granularity of retrieval and the level of data structuring. Coarse granularity brings more information but with lower precision. Retrieving structured text provides more information while sacrificing efficiency. The question of “when to retrieve” has led to strategies ranging from single [Wang et al., 2023e, Shi et al., 2023] to adaptive [Jiang et al., 2023b, Huang et al., 2023] and multiple retrieval [Izacard et al., 2022] methods. A higher frequency of retrieval brings more information but lower efficiency. As for “how to use” the retrieved data, integration techniques have been developed across various levels of the model architecture, including the input [Khattab et al., 2022], intermediate [Borgeaud et al., 2022], and output layers [Liang et al., 2023]. Although the “intermediate” and “output layers” are more effective, they require training and suffer from low efficiency.

RAG is a paradigm that enhances LLMs by integrating external knowledge bases. It employs a synergistic approach, combining information retrieval mechanisms and In-Context Learning (ICL) to bolster the LLM’s performance. In this framework, a query initiated by a user prompts the retrieval of pertinent information via search algorithms. This information is then woven into the LLM’s prompts, providing additional context for the generation process. RAG’s key advantage lies in its obviation of the need for retraining of LLMs for task-specific applications. Developers can instead append an external knowledge repository, enriching the input and thereby refining the model’s output precision. RAG has become one of the most popular architectures in LLMs’ systems, due to its high practicality and low barrier to entry, with many conversational products being built almost entirely on RAG.

The RAG workflow comprises three key steps. First, the corpus is partitioned into discrete chunks, upon which vector indices are constructed utilizing an encoder model. Second, RAG identifies and retrieves chunks based on the vector similarity between the query and the indexed chunks. Finally, the model synthesizes a response conditioned on the contextual information gleaned from the retrieved chunks. These steps form the fundamental framework of the RAG process, underpinning its information retrieval and context-aware generation capabilities. Next, we will provide an introduction to the RAG research framework.

3 RAG Framework

The RAG research paradigm is continuously evolving, and this section primarily delineates its progression. We categorize it into three types: Naive RAG, Advanced RAG, and Modular RAG. While early RAG methods were cost-effective and surpassed the performance of the native LLM, they also exhibited several limitations. The development of Advanced RAG and Modular RAG was a response to these specific shortcomings in Naive RAG.

3.1 Naive RAG

The Naive RAG research paradigm represents the earliest methodology, which gained prominence shortly after the widespread adoption of ChatGPT. The Naive RAG follows a traditional process that includes indexing, retrieval, and generation. It is also characterized as a “Retrieve-Read” framework [Ma et al., 2023a].

Indexing

The indexing process is a crucial initial step in data preparation that occurs offline and involves several stages. It begins with data indexing, where original data is cleansed and extracted, and various file formats such as PDF, HTML, Word, and Markdown are converted into standardized plain text. In order to fit within the context limitations of language models, this text is then segmented into smaller, more manageable chunks in a process known as chunking. These chunks are subsequently transformed into vector representations through an embedding model, chosen for its balance between inference efficiency and model size. This facilitates similarity comparisons during the retrieval phase. Finally, an index is created to store these text chunks and their vector embeddings as key-value pairs, which allows for efficient and scalable search capabilities.

Retrieval

Upon receipt of a user query, the system employs the same encoding model utilized during the indexing phase to transcode the input into a vector representation. It then proceeds to compute the similarity scores between the query vector and the vectorized chunks within the indexed corpus. The system prioritizes and retrieves the top K chunks that demonstrate the greatest similarity to the query. These chunks are subsequently used as the expanded contextual basis for addressing the user’s request.
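The indexing and retrieval steps described above can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `build_index` and `retrieve` as well as the sample chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Online step: encode the query with the same encoder and return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Hypothetical corpus, already split into chunks.
chunks = [
    "The board dismissed the CEO on Friday.",
    "The CEO was reinstated five days later.",
    "Quarterly rainfall totals were above average.",
]
index = build_index(chunks)
top = retrieve("Why was the CEO dismissed?", index, k=2)
```

In a real system the toy encoder would be replaced by a dense embedding model and the linear scan by a vector index; the control flow stays the same.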

Generation

The posed query and selected documents are synthesized into a coherent prompt to which a large language model is tasked with formulating a response. The model’s approach to answering may vary depending on task-specific criteria, allowing it to either draw upon its inherent parametric knowledge or restrict its responses to the information contained within the provided documents. In cases of ongoing dialogues, any existing conversational history can be integrated into the prompt, enabling the model to engage in multi-turn dialogue interactions effectively.
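As a rough illustration of this generation step, the sketch below assembles the query, the retrieved documents, and optional conversation history into a single prompt. The function name `build_prompt`, the `restrict_to_context` flag, and the instruction wording are all hypothetical choices for the example; the survey does not prescribe a prompt template.

```python
def build_prompt(query, documents, history=(), restrict_to_context=True):
    """Assemble retrieved documents, optional chat history, and the query into one prompt."""
    if restrict_to_context:
        # The model is asked to answer only from the provided documents.
        instruction = "Answer using only the context below. If the context is insufficient, say so."
    else:
        # The model may also draw on its own parametric knowledge.
        instruction = "Answer the question. Use the context below where it helps."
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    parts = [instruction, "Context:\n" + context]
    dialogue = "\n".join(f"{role}: {text}" for role, text in history)
    if dialogue:
        parts.append("Conversation so far:\n" + dialogue)
    parts.append("Question: " + query)
    return "\n\n".join(parts)

prompt = build_prompt(
    "Who is the CEO?",
    ["Doc A", "Doc B"],
    history=[("user", "hi"), ("assistant", "hello")],
)
```

The resulting string would then be sent to whichever LLM the system uses; that call is left out here because it depends on the provider.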

Drawbacks in Naive RAG

Naive RAG faces significant challenges in three key areas: “Retrieval,” “Generation,” and “Augmentation”.

Retrieval quality poses diverse challenges, including low precision, leading to misaligned retrieved chunks and potential issues like hallucination or mid-air drop. Low recall also occurs, resulting in the failure to retrieve all relevant chunks, thereby hindering the LLMs’ ability to craft comprehensive responses. Outdated information further compounds the problem, potentially yielding inaccurate retrieval results.
Response generation quality presents the hallucination challenge, where the model generates answers not grounded in the provided context, as well as issues of irrelevant context and potential toxicity or bias in the model’s output.

The augmentation process presents challenges in effectively integrating context from retrieved passages with the current generation task, potentially leading to disjointed or incoherent output. Redundancy and repetition are also concerns, especially when multiple retrieved passages contain similar information, resulting in repetitive content in the generated response.
Discerning the importance and relevance of multiple retrieved passages to the generation task is another challenge, requiring the proper balance of each passage’s value. Additionally, reconciling differences in writing styles and tones to ensure consistency in the output is crucial.
Lastly, there’s a risk of generation models overly depending on augmented information, potentially resulting in outputs that merely reiterate the retrieved content without providing new value or synthesized information.

3.2 Advanced RAG

Advanced RAG has been developed with targeted enhancements to address the shortcomings of Naive RAG. In terms of retrieval quality, Advanced RAG implements pre-retrieval and post-retrieval strategies. To address the indexing challenges experienced by Naive RAG, Advanced RAG has refined its indexing approach using techniques such as sliding window, fine-grained segmentation, and metadata. It has also introduced various methods to optimize the retrieval process [ILIN, 2023].

Pre-Retrieval Process

Optimizing Data Indexing. The goal of optimizing data indexing is to enhance the quality of the content being indexed. This involves five primary strategies: enhancing data granularity, optimizing index structures, adding metadata, alignment optimization, and mixed retrieval.

Enhancing data granularity aims to elevate text standardization, consistency, factual accuracy, and rich context to improve the RAG system’s performance. This includes removing irrelevant information, dispelling ambiguity in entities and terms, confirming factual accuracy, maintaining context, and updating outdated documents.
Optimizing index structures involves adjusting the size of chunks to capture relevant context, querying across multiple index paths, and incorporating information from the graph structure to capture relevant context by leveraging relationships between nodes in a graph data index.
Adding metadata information involves integrating referenced metadata, such as dates and purposes, into chunks for filtering purposes, and incorporating metadata like chapters and subsections of references to improve retrieval efficiency.
Alignment optimization addresses alignment issues and disparities between documents by introducing “hypothetical questions” [Li et al., 2023d] into documents to rectify alignment issues and differences.
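Two of the indexing refinements named above, sliding-window chunking and metadata attached to chunks for filtering, can be sketched as follows. This is only an illustration under assumptions: windows are measured in whitespace-separated words, and the function names, the `start_word` field, and the `section` metadata key are hypothetical.

```python
def sliding_window_chunks(text, size=50, overlap=10, metadata=None):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        # Each chunk carries its position plus any caller-supplied metadata.
        chunks.append({"text": " ".join(window), "start_word": start, **(metadata or {})})
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

def filter_by_metadata(chunks, **wanted):
    """Keep only chunks whose metadata matches every requested key/value pair."""
    return [c for c in chunks if all(c.get(key) == value for key, value in wanted.items())]

sample = " ".join(f"w{i}" for i in range(12))
windows = sliding_window_chunks(sample, size=5, overlap=2, metadata={"section": "3.2"})
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk when it is shorter than the overlap, at the cost of some duplicated text in the index.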

Retrieval

During the retrieval stage, the primary focus is on identifying the appropriate context by calculating the similarity between the query and chunks. The embedding model is central to this process. In the advanced RAG, there is potential for optimization of the embedding models.

Fine-tuning Embedding. Fine-tuning the embedding model significantly impacts the relevance of retrieved content in RAG systems. This process involves customizing embedding models to enhance retrieval relevance in domain-specific contexts, especially for professional domains dealing with evolving or rare terms. The BGE embedding model [BAAI, 2023], such as BGE-large-EN developed by BAAI [2], is an example of a high-performance embedding model that can be fine-tuned to optimize retrieval relevance. Training data for fine-tuning can be generated using language models like GPT-3.5-turbo to formulate questions grounded on document chunks, which are then used as fine-tuning pairs.
Dynamic Embedding adapts to the context in which words are used, unlike static embedding, which uses a single vector for each word [Karpukhin et al., 2020]. For example, in transformer models like BERT, the same word can have varied embeddings depending on surrounding words. OpenAI’s text-embedding-ada-002 model [3], built upon the principles of LLMs like GPT, is a sophisticated dynamic embedding model that captures contextual understanding. However, it may not exhibit the same sensitivity to context as the latest full-size language models like GPT-4.

Post-Retrieval Process

After retrieving valuable context from the database, it is essential to merge it with the query as an input into LLMs while addressing challenges posed by context window limits. Simply presenting all relevant documents to the LLM at once may exceed the context window limit, introduce noise, and hinder the focus on crucial information. Additional processing of the retrieved content is necessary to address these issues.

Re-Ranking. Re-ranking the retrieved information to relocate the most relevant content to the edges of the prompt is a key strategy. This concept has been implemented in frameworks such as LlamaIndex [4], LangChain [5], and HayStack [Blagojevi, 2023]. For example, Diversity Ranker [6] prioritizes reordering based on document diversity, while LostInTheMiddleRanker alternates placing the best document at the beginning and end of the context window. Additionally, approaches like cohereAI rerank [Cohere, 2023], bge-rerank [7], and LongLLMLingua [Jiang et al., 2023a] recalculate the semantic similarity between relevant text and the query, addressing the challenge of interpreting vector-based simulated searches for semantic similarity.
Prompt Compression. Research indicates that noise in retrieved documents adversely affects RAG performance. In post-processing, the emphasis lies in compressing irrelevant context, highlighting pivotal paragraphs, and reducing the overall context length. Approaches such as Selective Context and LLMLingua [Litman et al., 2020, Anderson et al., 2022] utilize small language models to calculate prompt mutual information or perplexity, estimating element importance. Recomp [Xu et al., 2023a] addresses this by training compressors at different granularities, while Long Context [Xu et al., 2023b] and “Walking in the Memory Maze” [Chen et al., 2023a] design summarization techniques to enhance LLM’s key information perception, particularly in dealing with extensive contexts.

3.3 Modular RAG

The modular RAG structure diverges from the traditional Naive RAG framework, providing greater versatility and flexibility. It integrates various methods to enhance functional modules, such as incorporating a search module for similarity retrieval and applying a fine-tuning approach in the retriever [Lin et al., 2023]. Restructured RAG modules [Yu et al., 2022] and iterative methodologies like [Shao et al., 2023] have been developed to address specific issues. The modular RAG paradigm is increasingly becoming the norm in the RAG domain, allowing for either a serialized pipeline or an end-to-end training approach across multiple modules. The comparison of the three RAG paradigms is depicted in Figure 3. However, Modular RAG is not standalone. Advanced RAG is a specialized form of Modular RAG, and further, Naive RAG itself is a special case of Advanced RAG. The relationship among the three paradigms is one of inheritance and development.

[2] https://huggingface.co/BAAI/bge-large-en
[3] https://platform.openai.com/docs/guides/embeddings
[4] https://www.llamaindex.ai
[5] https://www.langchain.com/
[6] https://haystack.deepset.ai/blog/enhancing-rag-pipelines-in-haystack
[7] https://huggingface.co/BAAI/bge-reranker-large

Figure 3: Comparison between the three paradigms of RAG

New Modules

Search Module. In contrast to the similarity retrieval in Naive/Advanced RAG, the Search Module is tailored to specific scenarios and incorporates direct searches on additional corpora. This integration is achieved using code generated by the LLM, query languages such as SQL or Cypher, and other custom tools. The data sources for these searches can include search engines, text data, tabular data, and knowledge graphs [Wang et al., 2023d].

Memory Module. This module harnesses the memory capabilities of the LLM to guide retrieval. The approach involves identifying memories most similar to the current input. Selfmem [Cheng et al., 2023b] utilizes a retrieval-enhanced generator to create an unbounded memory pool iteratively, combining the “original question” and “dual question”. By employing a retrieval-enhanced generative model that uses its own outputs to improve itself, the text becomes more aligned with the data distribution during the reasoning process. Consequently, the model’s own outputs are utilized instead of the training data [Wang et al., 2022a].
Fusion. RAG-Fusion [Raudaschl, 2023] enhances traditional search systems by addressing their limitations through a multi-query approach that expands user queries into multiple, diverse perspectives using an LLM. This approach not only captures the explicit information users seek but also uncovers deeper, transformative knowledge. The fusion process involves parallel vector searches of both original and expanded queries, intelligent re-ranking to optimize results, and pairing the best outcomes with new queries. This sophisticated method ensures search results that align closely with both the explicit and implicit intentions of the user, leading to more insightful and relevant information discovery.
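The re-ranking step of such a multi-query pipeline can be illustrated with reciprocal rank fusion (RRF), a common way to merge several ranked lists. This sketch assumes RRF as the fusion rule, which the text above does not specify; the constant k=60 is a conventional default, and the document ids and rankings are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so documents ranked well by many queries rise.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical search results: one list per query variant.
rankings = [
    ["d1", "d2", "d3"],  # original user query
    ["d2", "d1", "d4"],  # first LLM-generated rewrite
    ["d2", "d5", "d1"],  # second LLM-generated rewrite
]
fused = reciprocal_rank_fusion(rankings)
```

Because RRF uses only ranks, it can merge results from retrievers whose raw scores are not comparable.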

Routing. The RAG system’s retrieval process utilizes diverse sources, differing in domain, language, and format, which can be either alternated or merged based on the situation [Li et al., 2023b]. Query routing decides the subsequent action to a user’s query, with options ranging from summarization, searching specific databases, or merging different pathways into a single response. The query router also chooses the appropriate data store for the query, which may include various sources like vector stores, graph databases, or relational databases, or a hierarchy of indices (for instance, a summary index and a document block vector index for multi-document storage). The query router’s decision-making is predefined and executed via LLM calls, which direct the query to the chosen index.

Extra Generation Module. The “Extra Generation Module” addresses the common issues of redundancy and noise in retrieved content. Instead of directly retrieving from a data source, this module utilizes the LLM to generate the necessary context [Yu et al., 2022]. The content produced by the LLM is more likely to contain pertinent information compared to that obtained through direct retrieval.

Task Adaptable Module. This module focuses on adapting RAG to a variety of downstream tasks. UPRISE automates the retrieval of prompts for zero-shot task inputs from a pre-constructed data pool, thereby enhancing universality across tasks and models [Cheng et al., 2023a]. Meanwhile, PROMPTAGATOR [Dai et al., 2022] utilizes LLM as a few-shot query generator and, based on the generated data, creates task-specific retrievers. By leveraging the generalization capability of LLMs, it enables the development of task-specific end-to-end retrievers with minimal examples.

New Patterns

The organizational structure of Modular RAG is highly adaptable, allowing for the substitution or rearrangement of modules within the RAG process to suit specific problem contexts.

Naive RAG and Advanced RAG can both be considered as being composed of some fixed modules. As illustrated in Figure 3, Naive RAG primarily consists of the “Retrieve” and “Read” modules. A typical pattern of Advanced RAG builds upon the foundation of Naive RAG by adding “Rewrite” and “Rerank” modules. However, on the whole, modular RAG enjoys greater diversity and flexibility.
Current research primarily explores two organizational paradigms. The first involves adding or replacing modules, while the second focuses on adjusting the organizational flow between modules. This flexibility enables tailoring the RAG process to effectively address a wide array of tasks.
Adding or Replacing Modules. The strategy of introducing or substituting modules involves maintaining the core structure of the Retrieve-Read process while integrating additional modules to enhance specific functionalities. The RRR model [Ma et al., 2023a] introduces the Rewrite-Retrieve-Read process, utilizing the LLM performance as a reinforcement learning incentive for a rewriting module. This enables the rewriter to fine-tune retrieval queries, thereby improving the downstream task performance of the reader.
Similarly, modules can be selectively swapped in methodologies like Generate-Read [Yu et al., 2022], where the LLM’s generation module takes the place of the retrieval module. The Recite-Read approach [Sun et al., 2022] transforms external retrieval into retrieval from model weights, requiring the LLM to initially memorize task-specific information and subsequently produce output capable of handling knowledge-intensive natural language processing tasks.
Adjusting the Flow between Modules. In the realm of module flow adjustment, there is a focus on enhancing the interaction between language models and retrieval models. DSP [Khattab et al., 2022] introduces the Demonstrate-Search-Predict framework, treating the context learning system as an explicit program rather than a final task prompt, leading to more effective handling of knowledge-intensive tasks. The ITER-RETGEN [Shao et al., 2023] approach utilizes generated content to guide retrieval, iteratively implementing “retrieval-enhanced generation” and “generation-enhanced retrieval” within a Retrieve-Read-Retrieve-Read flow. This method demonstrates an innovative way of using one module’s output to improve the functionality of another.

Optimizing the RAG Pipeline

The optimization of the retrieval process aims to enhance the efficiency and quality of information in RAG systems. Current research focuses on integrating diverse search technologies, refining retrieval steps, incorporating cognitive backtracking, implementing versatile query strategies, and leveraging embedding similarity. These efforts collectively strive to achieve a balance between retrieval efficiency and the depth of contextual information in RAG systems.

Hybrid Search Exploration. The RAG system optimizes its performance by intelligently integrating various techniques, including keyword-based search, semantic search, and vector search. This approach leverages the unique strengths of each method to accommodate diverse query types and information needs, ensuring consistent retrieval of highly relevant and context-rich information. The use of hybrid search serves as a robust supplement to retrieval strategies, thereby enhancing the overall efficacy of the RAG pipeline.
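One simple way to combine keyword-based and vector search results is a weighted sum of normalized scores. The sketch below assumes that form of fusion, which the text above does not prescribe; the weight `alpha`, the min-max normalization, and the sample scores (imagined as BM25-style keyword scores and cosine similarities) are illustrative assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(keyword_scores, vector_scores, alpha=0.5):
    """Rank documents by alpha * keyword score + (1 - alpha) * vector score."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    docs = kw.keys() | vec.keys()
    combined = {d: alpha * kw.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0) for d in docs}
    # Sort by descending score; ties are broken by document id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))

# Hypothetical raw scores from two retrievers over the same three documents.
keyword_scores = {"a": 12.0, "b": 3.0, "c": 0.0}
vector_scores = {"a": 0.2, "b": 0.9, "c": 0.8}
ranking = hybrid_rank(keyword_scores, vector_scores, alpha=0.5)
```

Setting `alpha` to 1.0 reduces this to pure keyword ranking and 0.0 to pure vector ranking, which makes the trade-off easy to tune per query type.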
Recursive Retrieval and Query Engine. Recursive retrieval involves acquiring smaller chunks during the initial retrieval phase to capture key semantic meanings. Subsequently, larger chunks containing more contextual information are provided to the LLM in later stages of the process. This two-step retrieval method helps to strike a balance between efficiency and the delivery of contextually rich responses.
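The two-step idea can be sketched as follows: small chunks are matched against the query, and the larger parent text they came from is what gets returned. The word-overlap scorer stands in for a real embedding similarity, and all names and the chunk size are assumptions of this sketch.

```python
def split_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(parents, child_size=8):
    """Index small child chunks, each remembering the id of its parent text."""
    return [(child, pid)
            for pid, parent in enumerate(parents)
            for child in split_words(parent, child_size)]

def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, parents, index, top_k=1):
    """Match on small chunks, then return the larger parent texts."""
    ranked = sorted(index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    seen, result = set(), []
    for _child, pid in ranked:
        if pid not in seen:          # return each parent at most once
            seen.add(pid)
            result.append(parents[pid])
        if len(result) == top_k:
            break
    return result

parents = [
    "hybrid search combines keyword matching with dense vector similarity to cover different query types",
    "reranking moves the most relevant documents to the top before they are passed to the language model",
]
index = build_index(parents)
context = retrieve_parents("dense vector similarity", parents, index)
```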
StepBack-prompt. The StepBack-prompt approach encourages the LLM to move away from specific instances and engage in reasoning around broader concepts and principles [Zheng et al., 2023]. Experimental results demonstrate a significant performance increase in various challenging, inference-based tasks when step-back prompts are used, highlighting their natural adaptability to the RAG process. These retrieval-enhancing steps can be applied both in generating responses to step-back prompts and in the final question-answering process.
Sub-Queries. Depending on the scenario, various query strategies can be employed, such as using query engines provided by frameworks like LlamaIndex, leveraging tree queries, utilizing vector queries, or executing simple sequential querying of chunks.
Hypothetical Document Embeddings. HyDE operates on the belief that the answers generated might be closer in the embedding space than a direct query. Using the LLM, HyDE creates a hypothetical document (answer) in response to a query, embeds this document, and uses the resulting embedding to retrieve real documents similar to the hypothetical one. Instead of seeking embedding similarity based on the query, this approach focuses on the embedding similarity from one answer to another [Gao et al., 2022]. However, it might not consistently produce desirable outcomes, especially when the language model is unfamiliar with the subject matter, potentially leading to more instances with errors.
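The HyDE steps described above (generate a hypothetical answer, embed it, retrieve real documents near it) can be sketched as follows. The bag-of-words `embed` function and the `fake_llm` stub are hypothetical placeholders for a real embedding model and a real LLM; only the control flow is the point.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hyde_retrieve(query, corpus, generate, top_k=1):
    hypothetical = generate(query)   # step 1: the LLM writes a hypothetical answer
    probe = embed(hypothetical)      # step 2: embed the answer instead of the query
    # step 3: rank real documents by similarity to the hypothetical answer
    ranked = sorted(corpus, key=lambda doc: cosine(probe, embed(doc)), reverse=True)
    return ranked[:top_k]

def fake_llm(query):
    # Hypothetical stub; a real system would call an LLM here.
    return "the capital of france is paris"

corpus = [
    "paris is the capital city of france",
    "berlin is the capital city of germany",
]
hits = hyde_retrieve("which city hosts the french government", corpus, fake_llm)
```

If the stub returned a wrong answer, the probe embedding would pull retrieval toward the wrong documents, which mirrors the failure mode noted above for subjects the language model does not know well.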

4 Retrieval

In the context of RAG, it is crucial to efficiently retrieve relevant documents from the data source. However, creating a proficient retriever presents significant challenges. This section delves into three fundamental questions: 1) How can we achieve accurate semantic representations? 2) What methods can align the semantic spaces of queries and documents? 3) How can the retriever’s output be aligned with the preferences of the Large Language Model?

4.1 Enhancing Semantic Representations

In RAG, the semantic space is essential as it involves the multidimensional mapping of queries and documents. Retrieval accuracy in this semantic space significantly impacts RAG outcomes. This section will present two methods for building accurate semantic spaces.

Chunk Optimization

When managing external documents, the initial step involves breaking them down into smaller chunks to extract fine-grained features, which are then embedded to represent their semantics. However, embedding overly large or excessively small text chunks may lead to sub-optimal outcomes. Therefore, identifying the optimal chunk size for documents within the corpus is crucial to ensuring the accuracy and relevance of the retrieved results.

Choosing an appropriate chunking strategy requires careful consideration of several vital factors, such as the nature of the indexed content, the embedding model and its optimal block size, the expected length and complexity of user queries, and the specific application’s utilization of the retrieved results. For instance, the selection of a chunking model should be based on the content’s length, that is, whether the content is long or short. Additionally, different embedding models demonstrate distinct performance characteristics at varying block sizes. For example, sentence-transformer performs better with single sentences, while text-embedding-ada-002 excels with blocks containing 256 or 512 tokens.
Additionally, factors like the length and complexity of user input questions, and the specific needs of the application (e.g., semantic search or question answering), affect the choice of a chunking strategy. This choice can be directly influenced by the token limits of the selected LLMs, requiring adjustments to the block size. In practice, getting precise query results involves flexibly applying different chunking strategies. There is no one-size-fits-all “best” strategy, only the most appropriate one for a particular context.
Current research in RAG explores various block optimization techniques aimed at improving both retrieval efficiency and accuracy. One such approach involves the use of sliding window technology, enabling layered retrieval by merging globally related information across multiple retrieval processes. Another strategy, known as the “small2big” method, utilizes small text blocks during the initial search phase and subsequently provides larger related text blocks to the language model for processing.
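A common reading of the sliding window idea is chunking with overlap, so that text near a chunk boundary appears in two neighboring chunks. The sketch below works under that assumption; the function name and the `size` and `stride` parameters are illustrative, and tokens are represented by plain list items.

```python
def sliding_window_chunks(tokens, size, stride):
    """Split a token list into windows of `size` tokens that start every `stride` tokens.

    With stride < size, neighboring chunks overlap by size - stride tokens.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = list(range(10))                      # stand-in for 10 tokens
chunks = sliding_window_chunks(tokens, size=4, stride=2)
```

Choosing `size` near the block size at which the embedding model works best, and `stride` somewhat smaller, is one way to trade index size against boundary coverage.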
The abstract embedding technique prioritizes top K retrieval based on document abstracts (or summaries), offering a comprehensive understanding of the entire document context. Additionally, the metadata filtering technique leverages document metadata to enhance the filtering process. An innovative approach, the graph indexing technique, transforms entities and relationships into nodes and connections, significantly improving relevance, particularly in the context of multi-hop problems.

The combination of these diverse methods has led to notable advancements, resulting in enhanced retrieval outcomes and improved performance for RAG.

Fine-tuning Embedding Models

Once the appropriate size of chunks is determined, the next crucial step involves embedding these chunks and the query into the semantic space using an embedding model. The effectiveness of the embedding is critical as it impacts the model’s ability to represent the corpus. Recent research has introduced prominent embedding models such as AnglE, Voyage, BGE, etc. [Li and Li, 2023, VoyageAI, 2023, BAAI, 2023]. These models have undergone pre-training on extensive corpora. However, their capability to accurately capture domain-specific information may be limited when applied to specialized domains.

Moreover, task-specific fine-tuning of embedding models is essential to ensure that the model comprehends the user query in terms of content relevance. A model without fine-tuning may not adequately address the requirements of a specific task. Consequently, fine-tuning an embedding model becomes crucial for downstream applications. There are two primary paradigms in embedding fine-tuning methods.
Domain Knowledge Fine-tuning. To ensure that an embedding model accurately captures domain-specific information, it is imperative to utilize domain-specific datasets for fine-tuning. This process diverges from standard language model fine-tuning, chiefly in the nature of the datasets involved. Typically, the dataset for embedding model fine-tuning encompasses three principal elements: queries, a corpus, and relevant documents. The model employs these queries to identify pertinent documents within the corpus. The efficacy of the model is then gauged based on its ability to retrieve these relevant documents in response to the queries. The dataset construction, model fine-tuning, and evaluation phases each present distinct challenges. LlamaIndex [Liu, 2023] introduces a suite of pivotal classes and functions designed to enhance the embedding model fine-tuning workflow, thereby simplifying these intricate processes. By curating a corpus infused with domain knowledge and leveraging the methodologies offered, one can adeptly fine-tune an embedding model to align closely with the specific requirements of the target domain.
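The evaluation step described above, checking whether the relevant documents are retrieved for each query, can be sketched with a simple hit-rate measure. The dictionary layout, the `retrieve` callable, and the choice of hit rate at k as the metric are assumptions of this sketch and do not reproduce the LlamaIndex API.

```python
def hit_rate_at_k(queries, relevant, retrieve, k=5):
    """Fraction of queries for which at least one relevant document is in the top k.

    queries:  dict query_id -> query text
    relevant: dict query_id -> set of relevant document ids
    retrieve: callable (query text, k) -> list of document ids (hypothetical interface)
    """
    hits = 0
    for query_id, text in queries.items():
        retrieved = retrieve(text, k)
        if any(doc_id in relevant[query_id] for doc_id in retrieved):
            hits += 1
    return hits / len(queries) if queries else 0.0

# Hypothetical toy data: two queries with one relevant document each.
queries = {"q1": "what is hybrid search", "q2": "what is reranking"}
relevant = {"q1": {"d1"}, "q2": {"d9"}}

def toy_retrieve(text, k):
    # Stand-in retriever that always returns the same ranking.
    return ["d1", "d2", "d3"][:k]

rate = hit_rate_at_k(queries, relevant, toy_retrieve, k=2)
```

Running the same measure before and after fine-tuning on a held-out query set gives a direct comparison of the two embedding models.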
Fine-tuning for Downstream Tasks. Fine-tuning embedding models for downstream tasks is a critical step in enhancing model performance. In the realm of utilizing RAG for these tasks, innovative methods have emerged to fine-tune embedding models by harnessing the capabilities of LLMs. For example, PROMPTAGATOR [Dai et al., 2022] utilizes the LLM as a few-shot query generator to create task-specific retrievers, addressing challenges in supervised fine-tuning, particularly in data-scarce domains. Another approach, LLM-Embedder [Zhang et al., 2023a], exploits LLMs to generate reward signals for data across multiple downstream tasks. The retriever is fine-tuned with two types of supervised signals: hard labels for the dataset and soft rewards from the LLMs. This dual-signal approach fosters a more effective fine-tuning process, tailoring the embedding model to diverse downstream applications.

While these methods improve semantic representation by incorporating domain knowledge and task-specific fine-tuning, retrievers may not always exhibit optimal compatibility with certain LLMs. To address this, some researchers have explored direct supervision of the fine-tuning process using feedback from LLMs. This direct supervision seeks to align the retriever more closely with the LLM, thereby improving performance on downstream tasks. A more comprehensive discussion on this topic is presented in Section 4.3.

4.2 Aligning Queries and Documents

In the context of RAG applications, retrievers may utilize a single embedding model for encoding both the query and the documents, or employ separate models for each. Additionally, the user’s original query may suffer from imprecise phrasing and lack of semantic information. Therefore, it is crucial to align the semantic space of the user’s query with those of the documents. This section introduces two fundamental techniques aimed at achieving this alignment.

Query Rewriting

Query rewriting is a fundamental approach for aligning the semantics of a query and a document. Methods such as Query2Doc and ITER-RETGEN leverage LLMs to create a pseudo-document by combining the original query with additional guidance [Wang et al., 2023c, Shao et al., 2023]. HyDE constructs query vectors using textual cues to generate a “hypothetical” document capturing essential patterns [Gao et al., 2022]. RRR introduces a framework that reverses the traditional retrieval and reading order, focusing on query rewriting [Ma et al., 2023a]. STEP-BACK PROMPTING enables LLMs to perform abstract reasoning and retrieval based on high-level concepts [Zheng et al., 2023]. Additionally, the multi-query retrieval method utilizes LLMs to generate and execute multiple search queries simultaneously, which is advantageous for addressing complex problems with multiple sub-problems.

Embedding Transformation

Beyond broad strategies such as query rewriting, there exist more granular techniques specifically designed for embedding transformations. LlamaIndex [Liu, 2023] exemplifies this by introducing an adapter module that can be integrated following the query encoder. This adapter facilitates fine-tuning, thereby optimizing the representation of query embeddings to map them into a latent space that is more closely aligned with the intended tasks.

The challenge of aligning queries with structured external documents, particularly when addressing the incongruity between structured and unstructured data, is addressed by SANTA [Li et al., 2023d]. It enhances the retriever’s sensitivity to structured information through two pre-training strategies: first, by leveraging the intrinsic alignment between structured and unstructured data to inform contrastive learning in a structured-aware pre-training scheme; and second, by implementing Masked Entity Prediction. The latter utilizes an entity-centric masking strategy that encourages language models to predict and fill in the masked entities, thereby fostering a deeper understanding of structured data.

4.3 Aligning Retriever and LLM

In the RAG pipeline, enhancing retrieval hit rate through various techniques may not necessarily improve the final outcome, as the retrieved documents may not align with the specific requirements of the LLMs. Therefore, this section introduces two methods aimed at aligning the retriever outputs with the preferences of the LLMs.

Fine-tuning Retrievers

Several studies utilize feedback signals from LLMs to refine retrieval models. For instance, AAR [Yu et al., 2023b] introduces supervisory signals for a pre-trained retriever using an encoder-decoder architecture. This is achieved by identifying the LM’s preferred documents through FiD cross-attention scores. Subsequently, the retriever undergoes fine-tuning with hard negative sampling and standard cross-entropy loss. Ultimately, the refined retriever can be directly applied to enhance unseen target LMs, resulting in improved performance in the target task. Additionally, it is suggested that LLMs may have a preference for focusing on readable rather than information-rich documents.

REPLUG [Shi et al., 2023] utilizes a retriever and an LLM to calculate the probability distributions of the retrieved documents and then performs supervised training by computing the KL divergence. This straightforward and effective training method enhances the performance of the retrieval model by using an LM as the supervisory signal, eliminating the need for specific cross-attention mechanisms.
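The KL-based supervision signal can be sketched numerically: retriever scores and LM scores over the same retrieved documents are each turned into a distribution, and the divergence between the two is the quantity to minimize. The direction of the divergence, the temperatures `gamma` and `beta`, and all function names are assumptions of this sketch, and the gradient update of the retriever is omitted.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same documents."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def retriever_lm_kl(retriever_scores, lm_scores, gamma=1.0, beta=1.0):
    """Divergence between the retriever's and the LM's preference over documents.

    retriever_scores: similarity of each retrieved document to the query
    lm_scores: how strongly the LM favors the correct output given each document
    """
    p_retriever = softmax(retriever_scores, gamma)
    q_lm = softmax(lm_scores, beta)
    return kl_divergence(p_retriever, q_lm)

aligned = retriever_lm_kl([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])     # same preference
misaligned = retriever_lm_kl([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])  # opposite preference
```

The value is close to zero when retriever and LM rank the documents alike and grows as their preferences diverge, which is what makes it usable as a training signal.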
UPRISE [Cheng et al., 2023a] also employs frozen LLMs to fine-tune the prompt retriever. Both the LLM and the retriever take prompt-input pairs as inputs and utilize the scores provided by the LLM to supervise the retriever’s training, effectively treating the LLM as a dataset labeler. In addition, Atlas [Izacard et al., 2022] proposes four methods for supervised fine-tuning of embedding models:

• Attention Distillation. This approach employs cross-attention scores generated by the LLM during output to distill the model’s knowledge.
• EMDR2. By using the Expectation-Maximization algorithm, this method trains the model with retrieved documents as latent variables.
• Perplexity Distillation. This method directly trains the model using the perplexity of generated tokens as an indicator.
• LOOP. This method presents a novel loss function based on the impact of document deletion on LLM prediction, offering an efficient training strategy to better adapt the model to specific tasks.

These approaches aim to improve the synergy between the retriever and the LLM, leading to enhanced retrieval performance and more accurate responses to user inquiries.

Adapters

Fine-tuning models may present challenges, such as integrating functionality through an API or addressing constraints arising from limited local computational resources. Consequently, some approaches opt to incorporate an external adapter to aid in alignment.

PRCA trains the adapter through a context extraction phase and a reward-driven phase. The retriever’s output is then optimized using a token-based autoregressive strategy [Yang et al., 2023b]. The token filtering approach employs cross-attention scores to efficiently filter tokens, selecting only the highest-scoring input tokens [Berchansky et al., 2023]. RECOMP introduces both extractive and generative compressors for summary generation. These compressors either select relevant sentences or synthesize document information, creating summaries tailored to multi-document queries [Xu et al., 2023a].
Furthermore, PKG introduces an innovative method for integrating knowledge into white-box models via directive fine-tuning [Luo et al., 2023]. In this approach, the retriever module is directly substituted to generate relevant documents according to a query. This method assists in addressing the difficulties encountered during the fine-tuning process and enhances model performance.

5 Generation

A crucial component of RAG is its generator, which is responsible for converting retrieved information into coherent and fluent text. Unlike traditional language models, RAG’s generator sets itself apart by improving accuracy and relevance via the incorporation of retrieved data. In RAG, the generator’s input encompasses not only typical contextual information but also relevant text segments obtained through the retriever. This comprehensive input enables the generator to gain a deep understanding of the question’s context, resulting in more informative and contextually relevant responses. Furthermore, the generator is guided by the retrieved text to ensure coherence between the generated content and the obtained information. The diverse input data has led to targeted efforts during the generation phase, all aimed at refining the adaptation of the large model to the input data derived from queries and documents. In the following subsections, we introduce the generator by examining post-retrieval processing and fine-tuning.

5.1 Post-retrieval with Frozen LLM

In the realm of untunable LLMs, many studies rely on well-established models like GPT-4 [OpenAI, 2023] to harness their comprehensive internal knowledge for systematically synthesizing retrieved information from various documents.

However, challenges persist with these large models, including limitations on context length and susceptibility to redundant information. To tackle these issues, certain research endeavors have turned their focus to post-retrieval processing.

Post-retrieval processing involves treating, filtering, or optimizing the relevant information retrieved by the retriever from a large document database. Its main goal is to enhance the quality of retrieval results, aligning them more closely with user needs or subsequent tasks. It can be viewed as a reprocessing of the documents obtained during the retrieval phase. Common operations in post-retrieval processing typically include information compression and result reranking.

Information Compression

The retriever excels at retrieving relevant information from a vast knowledge base, but managing the substantial amount of information within retrieval documents is a challenge. Ongoing research aims to extend the context length of large language models to tackle this issue. However, current large models still struggle with context limitations. Therefore, there are scenarios where condensing information becomes necessary. Information condensation is significant for reducing noise, addressing context length restrictions, and enhancing generation effects.

PRCA tackled this issue by training an information extractor [Yang et al., 2023b]. In the context extraction phase, when provided with an input text S_input, it is capable of producing an output sequence C_extracted that represents the condensed context from the input document. The training process is designed to minimize the difference between C_extracted and the actual context C_truth.
Similarly, RECOMP adopts a comparable approach by training an information condenser using contrastive learning [Xu et al., 2023a]. Each training data point consists of one positive sample and five negative samples, and the encoder undergoes training using contrastive loss throughout this process [Karpukhin et al., 2020].
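A contrastive objective over one positive and several negative samples can be sketched as the negative log-probability of the positive under a softmax over all similarity scores. This is a generic formulation assumed here for illustration; the exact loss, the score function, and the temperature used by RECOMP are not specified in the text above.

```python
import math

def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """Negative log-softmax of the positive score against the negative scores."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)
    # log of the softmax denominator, computed stably
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# One positive and five negatives per training point, as described above.
uninformed = contrastive_loss(0.0, [0.0] * 5)   # the encoder cannot tell them apart
separated = contrastive_loss(5.0, [0.0] * 5)    # the positive scores clearly higher
```

The loss falls as the encoder assigns the positive sample a higher similarity than the negatives, which is the behavior the training is meant to encourage.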
Another study has taken a different approach by aiming to reduce the number of documents in order to improve the accuracy of the model’s answers. [Ma et al., 2023b] propose the “Filter-Reranker” paradigm, which combines the strengths of LLMs and Small Language Models (SLMs). In this paradigm, SLMs serve as filters, while LLMs function as reordering agents. The research shows that instructing LLMs to rearrange challenging samples identified by SLMs leads to significant improvements in various Information Extraction (IE) tasks.

Reranking

The re-ranking model is pivotal in optimizing the document set retrieved from the retriever. Language models often face performance declines when additional context is introduced, and re-ranking effectively addresses this issue. The core concept involves rearranging document records to prioritize the most relevant items at the top, thereby limiting the total number of documents. This not only resolves the challenge of context window expansion during retrieval but also enhances retrieval efficiency and responsiveness.
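The core concept, reordering the retrieved documents and keeping only the most relevant ones, can be sketched as follows. The word-overlap scorer is a hypothetical stand-in for a trained re-ranking model, and the function names and `top_n` parameter are assumptions.

```python
def overlap(query, document):
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def rerank(query, documents, score, top_n=2):
    """Order documents by score (highest first) and keep the top_n."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]

candidates = [
    "chunk size affects embedding quality",
    "reranking places relevant documents first",
    "relevant documents improve generation",
]
top = rerank("which documents are relevant", candidates, overlap, top_n=2)
```

Only the truncated list is passed on to the generator, which is how re-ranking limits the amount of context the LLM has to process.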

The re-ranking model assumes a dual role throughout the information retrieval process, functioning as both an optimizer and a refiner. It provides more effective and accurate input for subsequent language model processing [Zhuang et al., 2023].

Contextual compression is incorporated into the reordering process to offer more precise retrieval information. This method entails reducing the content of individual documents and filtering the entire document, with the ultimate goal of presenting the most relevant information in the search results for a more focused and accurate display of pertinent content.

5.2 Fine-tuning LLM for RAG

Optimizing the generator within the RAG model is a critical aspect of its architecture. The generator’s role is to take the retrieved information and produce relevant text, forming the final output of the model. The optimization of the generator aims to ensure that the generated text is both natural and effectively leverages the retrieved documents to better meet the user’s query needs.

In standard LLM generation tasks, the input typically consists of a query. RAG stands out by incorporating into the input not only a query but also various documents (structured/unstructured) retrieved by the retriever. This additional information can significantly influence the model’s understanding, particularly for smaller models. In such cases, fine-tuning the model to adapt to the input of both query and retrieved documents becomes crucial. Before presenting the input to the fine-tuned model, post-retrieval processing usually occurs for the documents retrieved by the retriever. It is essential to note that the fine-tuning method for the generator in RAG aligns with the general fine-tuning approach for LLMs. In the following, we will briefly describe some representative works involving data (formatted/unformatted) and optimization functions.

General Optimization Process

As part of the general optimization process, the training data typically consists of input-output pairs, aiming to train the model to produce the output y given the input x. In the work of Self-Mem [Cheng et al., 2023b], a traditional training process is employed, where given the input x, relevant documents z are retrieved (selecting Top-1 in the paper), and after integrating (x, z), the model generates the output y. The paper utilizes two common paradigms for fine-tuning, namely Joint-Encoder and Dual-Encoder [Arora et al., 2023, Wang et al., 2022b, Lewis et al., 2020, Xia et al., 2019, Cai et al., 2021, Cheng et al., 2022].

In the Joint-Encoder paradigm, a standard model based on an encoder-decoder is used. Here, the encoder initially encodes the input, and the decoder, through attention mechanisms, combines the encoded results to generate tokens in an autoregressive manner. On the other hand, in the Dual-Encoder paradigm, the system sets up two independent encoders, with each encoder encoding the input (query, context) and the document, respectively. The resulting outputs undergo bidirectional cross-attention processing by the decoder in sequence. Both architectures utilize the Transformer [Vaswani et al., 2017] as the foundational block and optimize with Negative Log-Likelihood loss.
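For reference, the Negative Log-Likelihood objective for autoregressive generation can be written in its standard form, using the input x, retrieved documents z, and output y from the general optimization process. The parameter symbol θ, the token index t, and the prefix notation y_{<t} are notational assumptions added here.

```latex
\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{t=1}^{|y|} \log P_{\theta}\left(y_t \mid x, z, y_{<t}\right)
```

Each term penalizes the model for assigning low probability to the next reference token given the query, the retrieved documents, and the tokens generated so far.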

Utilizing Contrastive Learning

In the phase of preparing training data for language models, interaction pairs of input and output are usually created. This traditional method can lead to “exposure bias,” where the model is only trained on individual, correct output examples, thus restricting its exposure to a range of possible outputs. This limitation can hinder the model’s real-world performance by causing it to overfit to the particular examples in the training set, thereby reducing its ability to generalize across various contexts.

To mitigate exposure bias, SURGE [Kang et al., 2023] proposes the use of graph-text contrastive learning. This method includes a contrastive learning objective that prompts the model to produce a range of plausible and coherent re- sponses, expanding beyond the instances encountered in the training data. This approach is crucial in reducing overfitting and strengthening the model’s ability to generalize.
For retrieval tasks that engage with structured data, the SANTA framework [Li et al., 2023d] implements a tripartite training regimen to effectively encapsulate both structural and semantic nuances. The initial phase focuses on the retriever, where contrastive learning is harnessed to refine the query and document embeddings.
Subsequently, the generator’s preliminary training stage employs contrastive learning to align the structured data with its unstructured document descriptions. In a further stage of generator training, the model acknowledges the critical role of entity semantics in the representation learning of textual data for retrieval, as highlighted by [Sciavolino et al., 2021, Zhang et al., 2019]. This process commences with the identification of entities within the structured data, followed by the application of masks over these entities within the generator’s input data, thus setting the stage for the model to anticipate and predict these masked elements.
The training regimen progresses with the model learning to reconstruct the masked entities by leveraging contextual information. This exercise cultivates the model’s comprehension of the textual data’s structural semantics and facilitates the alignment of pertinent entities within the structured data. The overarching optimization goal is to train the language model to accurately restore the obscured spans, thereby enriching its understanding of entity semantics [Ye et al., 2020].

6 Augmentation in RAG

This section is structured around three key aspects: the augmentation stage, sources of augmentation data, and the augmentation process. These facets elucidate the critical technologies pivotal to RAG’s development. A taxonomy of RAG’s core components is presented in Figure 4.

6.1 RAG in Augmentation Stages

RAG, a knowledge-intensive endeavor, incorporates a variety of technical methodologies across the pre-training, fine-tuning, and inference stages of language model training.

Figure 4: Taxonomy of RAG’s core components

Pre-training Stage

During the pre-training stage, researchers have investigated methods to bolster pre-trained models (PTMs) for open-domain QA through retrieval-based strategies. The REALM model adopts a structured, interpretable method for knowledge embedding, framing pre-training and fine-tuning as a retrieve-then-predict workflow within the masked language model (MLM) framework [Arora et al., 2023].

RETRO [Borgeaud et al., 2022] leverages retrieval augmentation for large-scale pre-training from scratch, achieving a reduction in model parameters while surpassing standard GPT models in terms of perplexity. RETRO distinguishes itself with an additional encoder designed to process features of entities retrieved from an external knowledge base, building on the foundational structure of GPT models.
Atlas [Izacard et al., 2022] also incorporates a retrieval mechanism into the T5 architecture [Raffel et al., 2020] in both the pre-training and fine-tuning stages. It uses a pre-trained T5 to initialize the encoder-decoder language model and a pre-trained Contriever for the dense retriever, improving its efficiency for complex language modeling tasks.

Furthermore, COG [Lan et al., 2022] introduces a novel text generation methodology that emulates copying text fragments from pre-existing collections. Utilizing efficient vector search tools, COG computes and indexes contextually meaningful representations of text fragments, demonstrating superior performance in domains such as question-answering and domain adaptation when compared to RETRO.

 The advent of scaling laws has catalyzed the growth of model parameters, propelling autoregressive models into the mainstream. Researchers are expanding the RAG approach to pretrained larger models, with RETRO++ exemplifying this trend by scaling up the model parameters while preserving or enhancing performance [Wang et al., 2023b].
Empirical evidence underscores marked improvements in text generation quality, factual accuracy, reduced toxicity, and downstream task proficiency, especially in knowledge-intensive applications like open-domain QA. These results imply that integrating retrieval mechanisms into the pre-training of autoregressive language models constitutes a promising avenue, marrying sophisticated retrieval techniques with expansive language models to yield more precise and efficient language generation.

The benefits of augmented pre-training include a robust foundational model that outperforms standard GPT models in perplexity, text generation quality, and task-specific performance, all while utilizing fewer parameters. This method is particularly adept at handling knowledge-intensive tasks and facilitates the development of domain-specific models through training on specialized corpora.
Nonetheless, this approach faces challenges such as the necessity for extensive pre-training datasets and resources, as well as diminished update frequencies with increasing model sizes. Despite these hurdles, the approach offers significant advantages in model resilience. Once trained, retrieval-enhanced models can operate independently of external libraries, enhancing generation speed and operational efficiency. The potential gains identified render this methodology a compelling subject for ongoing investigation and innovation in artificial intelligence and machine learning.

Fine-tuning Stage

RAG and Fine-tuning are powerful tools for enhancing LLMs, and combining the two can meet the needs of more specific scenarios. On one hand, fine-tuning allows for the retrieval of documents with a unique style, achieving better semantic expression and aligning the differences between queries and documents. This ensures that the output of the retriever is more aptly suited to the scenario at hand. On the other hand, fine-tuning can fulfill the generation needs of making stylized and targeted adjustments. Furthermore, fine-tuning can also be used to align the retriever and generator for improved model synergy.

The main goal of fine-tuning the retriever is to improve the quality of semantic representations, achieved by directly fine-tuning the Embedding model using a corpus [Liu, 2023]. By aligning the retriever’s capabilities with the preferences of the LLMs through feedback signals, both can be better coordinated [Yu et al., 2023b, Izacard et al., 2022, Yang et al., 2023b, Shi et al., 2023]. Fine-tuning the retriever for specific downstream tasks can lead to improved adaptability. The introduction of task-agnostic fine-tuning aims to enhance the retriever’s versatility in multi-task scenarios [Cheng et al., 2023a].
Fine-tuning the generator can result in outputs that are more stylized and customized. On one hand, it allows for specialized adaptation to different input data formats, for example, fine-tuning LLMs to fit the structure of knowledge graphs [Kang et al., 2023], the structure of text pairs [Kang et al., 2023, Cheng et al., 2023b], and other specific structures [Li et al., 2023d]. On the other hand, by constructing directive datasets, one can require LLMs to generate content in specific formats. For instance, in adaptive or iterative retrieval scenarios, LLMs are fine-tuned to generate content that will help determine the timing for the next step of action [Jiang et al., 2023b, Asai et al., 2023].
By synergistically fine-tuning both the retriever and the generator, we can enhance the model’s generalization capabilities and avoid overfitting that may arise from training them separately. However, joint fine-tuning also leads to increased resource consumption. RA-DIT [Lin et al., 2023] presents a lightweight, dual-instruction tuning framework that can effectively add retrieval capabilities to any LLM. The retrieval-enhanced directive fine-tuning updates the LLM, guiding it to make more efficient use of the information retrieved and to disregard distracting content.

Despite its advantages, fine-tuning has limitations, including the need for specialized datasets for RAG fine-tuning and the requirement for significant computational resources. However, this stage allows for customizing models to specific needs and data formats, potentially reducing resource usage compared to the pre-training phase while still being able to fine-tune the model’s output style.
In summary, the fine-tuning stage is essential for the adaptation of RAG models to specific tasks, enabling the refinement of both retrievers and generators. This stage enhances the model’s versatility and adaptability to various tasks, despite the challenges presented by resource and dataset requirements. The strategic fine-tuning of RAG models is therefore a critical component in the development of efficient and effective retrieval-augmented systems.

Inference Stage

The inference stage in RAG models is crucial, as it involves extensive integration with LLMs. Traditional RAG approaches, also known as Naive RAG, involve incorporating retrieval content at this stage to guide the generation process. To overcome the limitations of Naive RAG, advanced techniques introduce more contextually rich information during inference. The DSP framework [Khattab et al., 2022] utilizes a sophisticated exchange of natural language text between frozen LMs and retrieval models (RMs), enriching the context and thereby improving generation outcomes. The PKG [Luo et al., 2023] method equips LLMs with a knowledge-guided module that allows for the retrieval of pertinent information without modifying the LMs’ parameters, enabling more complex task execution. CREA-ICL [Li et al., 2023b] employs a synchronous retrieval of cross-lingual knowledge to enhance context, while RECITE [Sun et al., 2022] generates context by sampling paragraphs directly from LLMs.

Further refinement of the RAG process during inference is seen in approaches that cater to tasks necessitating multi-step reasoning. ITRG [Feng et al., 2023] iteratively retrieves information to identify the correct reasoning paths, thereby improving task adaptability. ITER-RETGEN [Shao et al., 2023] follows an iterative strategy, merging retrieval and generation in a cyclical process that alternates between “retrieval-enhanced generation” and “generation-enhanced retrieval”. For non-knowledge-intensive (NKI) tasks, PGRA [Guo et al., 2023] proposes a two-stage framework, starting with a task-agnostic retriever followed by a prompt-guided reranker to select and prioritize evidence. In contrast, IRCOT [Trivedi et al., 2022] combines RAG with Chain of Thought (CoT) methodologies, alternating CoT-guided retrievals with retrieval-informed CoT processes, significantly boosting GPT-3’s performance across various question-answering tasks.

In essence, these inference-stage enhancements provide lightweight, cost-effective alternatives that leverage the capabilities of pre-trained models without necessitating further training. The principal advantage is maintaining static LLM parameters while supplying contextually relevant information to meet specific task demands. Nevertheless, this approach is not without limitations, as it requires meticulous data processing and optimization, and is bound by the foundational model’s intrinsic capabilities. To address diverse task requirements effectively, this method is often paired with procedural optimization techniques such as step-wise reasoning, iterative retrieval, and adaptive retrieval strategies.
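The simplest inference-stage pattern, in which retrieved content is placed in the prompt of a frozen LLM, can be sketched as follows. The word-overlap retriever, the prompt template, and the toy corpus are stand-ins chosen for illustration; a real system would typically use an embedding-based retriever and pass the resulting prompt to an LLM.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (stand-in for a real retriever)."""
    query_terms = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_terms & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question; the LLM stays unchanged."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

corpus = [
    "RETRO retrieves neighbours at the chunk level.",
    "FLARE triggers retrieval when token confidence is low.",
    "Paris is the capital of France.",
]
query = "when does FLARE trigger retrieval"
top = retrieve(query, corpus, k=1)
prompt = build_prompt(query, top)  # this string would be sent to the frozen LLM
```

No model parameters are touched at any point, which is what makes this family of methods lightweight; answer quality then depends on what the retriever returns and on the base model's own capabilities.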

6.2 Augmentation Source

The effectiveness of RAG models is heavily impacted by the selection of data sources for augmentation. Different levels of knowledge and dimensions require distinct processing techniques. They are categorized as unstructured data, structured data, and content generated by LLMs. The technology tree of representative RAG research with different augmentation aspects is depicted in Figure 5. The leaves, colored in three different shades, represent enhancements using various types of data: unstructured data, structured data, and content generated by LLMs. The diagram clearly shows that initially, augmentation was mainly achieved through unstructured data, such as pure text. This approach later expanded to include the use of structured data (e.g. knowledge graph) for further improvement. More recently, there has been a growing trend in research that utilizes content generated by the LLMs themselves for retrieval and augmentation purposes.

Augmented with Unstructured Data

Unstructured text is gathered from corpora, such as prompt data for fine-tuning large models [Cheng et al., 2023a] and cross-lingual data [Li et al., 2023b]. Retrieval units vary from tokens (e.g., kNN-LM [Khandelwal et al., 2019]) to phrases (e.g., NPM, COG [Lee et al., 2020, Lan et al., 2022]) and document paragraphs, with finer granularities offering precision at the cost of increased retrieval complexity. FLARE [Jiang et al., 2023b] introduces an active retrieval approach, triggered by the LM’s generation of low-probability words. It creates a temporary sentence for document retrieval, then regenerates the sentence with the retrieved context to predict subsequent sentences. RETRO uses the previous chunk to retrieve the nearest neighbor at the chunk level; combined with the previous chunk’s context, this neighbor guides the generation of the next chunk. To preserve causality, the generation of the next block C_i only utilizes the nearest neighbor of the previous block, N(C_{i-1}), and not N(C_i).

Augmented with Structured Data

Structured data, such as knowledge graphs (KGs), provide high-quality context and mitigate model hallucinations. RET-LLMs [Modarressi et al., 2023] constructs a knowledge graph memory from past dialogues for future reference. SUGRE [Kang et al., 2023] employs Graph Neural Networks (GNNs) to encode relevant KG subgraphs, ensuring consistency between retrieved facts and generated text through multi-modal contrastive learning. KnowledGPT [Wang et al., 2023d] generates KB search queries and stores knowledge in a personalized base, enhancing the RAG model’s knowledge richness and contextuality.

LLMs-Generated Content in RAG

Addressing the limitations of external auxiliary information in RAG, some research has focused on exploiting LLMs’ internal knowledge. SKR [Wang et al., 2023e] classifies questions as known or unknown, applying retrieval enhancement selectively. GenRead [Yu et al., 2022] replaces the retriever with an LLM generator, finding that LLM-generated contexts often contain more accurate answers due to better alignment with the pre-training objectives of causal language modeling. Selfmem [Cheng et al., 2023b] iteratively creates an unbounded memory pool with a retrieval-enhanced generator, using a memory selector to choose outputs that serve as dual problems to the original question, thus self-enhancing the generative model.

These methodologies underscore the breadth of innovative data source utilization in RAG, striving to improve model performance and task effectiveness.

6.3 Augmentation Process

In the domain of RAG, the standard practice often involves a singular retrieval step followed by generation, which can lead to inefficiencies. A notable issue, termed the “lost in the middle” phenomenon, arises when a single retrieval yields redundant content that may dilute or contradict essential information, thereby degrading the generation quality [Liu et al., 2023a]. Furthermore, such singular retrieval is typically insufficient for complex problems demanding multi-step reasoning, as it provides a limited scope of information [Yoran et al., 2023].

As illustrated in Figure 5, to circumvent these challenges, contemporary research has proposed methods for refining the retrieval process: iterative retrieval, recursive retrieval and adaptive retrieval. Iterative retrieval allows the model to engage in multiple retrieval cycles, enhancing the depth and relevance of the information obtained. Recursive retrieval is a process where the results of one retrieval operation are used as the input for the subsequent retrieval. It helps to delve deeper into relevant information, particularly when dealing with complex or multi-step queries. Recursive retrieval is often used in scenarios where a gradual approach is needed to converge on a final answer, such as in academic research, legal case analysis, or certain types of data mining tasks. Adaptive retrieval, on the other hand, offers a dynamic adjustment mechanism, tailoring the retrieval process to the specific demands of varying tasks and contexts.

Iterative Retrieval

Iterative retrieval in RAG models is a process where documents are repeatedly collected based on the initial query and the text generated thus far, providing a more comprehensive knowledge base for LLMs [Borgeaud et al., 2022, Arora et al., 2023]. This approach has been shown to enhance the robustness of subsequent answer generation by offering additional contextual references through multiple retrieval iterations. However, it may suffer from semantic discontinuity and the accumulation of irrelevant information, as it typically relies on a sequence of n tokens to demarcate the boundaries between generated text and retrieved documents. To address specific data scenarios, recursive retrieval and multi-hop retrieval techniques are utilized. Recursive retrieval involves a structured index to process and retrieve data in a hierarchical manner, which may include summarizing sections of a document or lengthy PDF before performing a retrieval based on this summary. Subsequently, a secondary retrieval within the document refines the search, embodying the recursive nature of the process. In contrast, multi-hop retrieval is designed to delve deeper into graph-structured data sources, extracting interconnected information [Li et al., 2023c].
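The loop described above, in which each retrieval is conditioned on the initial query plus the text generated so far, can be sketched in a few lines. The word-overlap retriever and the placeholder `generate` function are assumptions made for illustration; they stand in for a real retriever and an LLM call.

```python
def overlap(a, b):
    """Number of shared lowercase words between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query, corpus, seen):
    """Return the best unseen passage by word overlap, or None if nothing matches."""
    candidates = [p for p in corpus if p not in seen]
    best = max(candidates, key=lambda p: overlap(query, p), default=None)
    if best is None or overlap(query, best) == 0:
        return None
    return best

def generate(query, evidence):
    """Placeholder generator: a real system would call an LLM here."""
    return " ".join(evidence)

def iterative_rag(query, corpus, max_rounds=3):
    evidence, draft = [], ""
    for _ in range(max_rounds):
        # The retrieval query is the original question plus the text generated so far.
        passage = retrieve(f"{query} {draft}", corpus, evidence)
        if passage is None:
            break
        evidence.append(passage)
        draft = generate(query, evidence)
    return draft, evidence

corpus = [
    "Marie Curie was born in Warsaw",
    "Warsaw is the capital of Poland",
    "Canberra is the capital of Australia",
]
draft, evidence = iterative_rag("where was Marie Curie born", corpus, max_rounds=2)
```

In this toy run the second passage is reachable only because the first round's draft introduces the word "Warsaw", which the original question never mentions. The same mechanism also explains the risk noted above: any irrelevant text in the draft feeds straight into the next retrieval query.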

Figure 5: Technology tree of representative RAG research with different augmentation aspects

Additionally, some methodologies integrate the steps of retrieval and generation. ITER-RETGEN [Shao et al., 2023] employs a synergistic approach that leverages “retrieval-enhanced generation” alongside “generation-enhanced retrieval” for tasks that necessitate the reproduction of specific information. The model harnesses the content required to address the input task as a contextual basis for retrieving pertinent knowledge, which in turn facilitates the generation of improved responses in subsequent iterations.

Recursive Retrieval

Recursive Retrieval is often used in information retrieval and NLP to improve the depth and relevance of search results. The process involves iteratively refining search queries based on the results obtained from previous searches. Recursive Retrieval aims to enhance the search experience by gradually converging on the most pertinent information through a feedback loop. IRCoT [Trivedi et al., 2022] uses chain-of-thought to guide the retrieval process and refines the CoT with the obtained retrieval results. ToC [Kim et al., 2023] creates a clarification tree that systematically optimizes the ambiguous parts in the query. It can be particularly useful in complex search scenarios where the user’s needs are not entirely clear from the outset or where the information sought is highly specialized or nuanced. The recursive nature of the process allows for continuous learning and adaptation to the user’s requirements, often resulting in improved satisfaction with the search outcomes.
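The structured-index form of recursive retrieval mentioned earlier, where a first lookup over document summaries is followed by a second lookup inside the selected document, might look like the sketch below. The index layout (`title`, `summary`, `chunks`) and the word-overlap scoring are hypothetical choices made only to keep the example self-contained.

```python
def overlap(a, b):
    """Number of shared lowercase words between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def recursive_retrieve(query, index):
    """Two-level lookup: pick a document by its summary, then a chunk inside it."""
    doc = max(index, key=lambda d: overlap(query, d["summary"]))
    chunk = max(doc["chunks"], key=lambda c: overlap(query, c))
    return doc["title"], chunk

# A tiny hierarchical index: each document carries a summary and its chunks.
index = [
    {
        "title": "retrieval-notes",
        "summary": "notes on iterative and adaptive retrieval methods",
        "chunks": [
            "iterative retrieval repeats the search with generated text",
            "adaptive retrieval lets the model decide when to search",
        ],
    },
    {
        "title": "evaluation-notes",
        "summary": "notes on evaluation metrics for retrieval and generation",
        "chunks": [
            "hit rate and MRR measure retrieval quality",
            "faithfulness measures agreement with the context",
        ],
    },
]
title, chunk = recursive_retrieve("which metrics measure retrieval quality", index)
```

The first stage narrows the search to one document using only short summaries, so the second stage compares the query against far fewer chunks than a flat search over the whole collection would.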

Adaptive Retrieval

Adaptive retrieval methods, exemplified by FLARE and Self-RAG [Jiang et al., 2023b, Asai et al., 2023], refine the RAG framework by enabling LLMs to actively determine the optimal moments and content for retrieval, thus enhancing the efficiency and relevance of the information sourced.

These methods are part of a broader trend wherein LLMs employ active judgment in their operations, as seen in model agents like AutoGPT, Toolformer, and Graph-Toolformer [Yang et al., 2023c, Schick et al., 2023, Zhang, 2023]. Graph-Toolformer, for instance, divides its retrieval process into distinct steps where LLMs proactively use retrievers, apply Self-Ask techniques, and employ few-shot prompts to initiate search queries. This proactive stance allows LLMs to decide when to search for necessary information, akin to how an agent utilizes tools.

WebGPT [Nakano et al., 2021] integrates a reinforcement learning framework to train the GPT-3 model in autonomously using a search engine during text generation. It navigates this process using special tokens that facilitate actions such as search engine queries, browsing results, and citing references, thereby expanding GPT-3’s capabilities through the use of external search engines.
FLARE automates the timing of retrieval by monitoring the confidence of the generation process, as indicated by the probability of generated terms [Jiang et al., 2023b]. When the probability falls below a certain threshold, the retrieval system is activated to collect relevant information, thus optimizing the retrieval cycle.
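A minimal sketch of this confidence-triggered behaviour is given below. The threshold value, the rule of checking the lowest token probability in a tentative sentence, and the `fake_retrieve` and `fake_regenerate` helpers are all assumptions for illustration rather than the published procedure.

```python
def needs_retrieval(token_probs, threshold=0.4):
    """Trigger retrieval if any token in the tentative sentence is low-confidence."""
    return min(token_probs) < threshold

def flare_step(tentative_sentence, token_probs, retrieve, regenerate, threshold=0.4):
    """Keep a confident sentence; otherwise retrieve with it and regenerate."""
    if not needs_retrieval(token_probs, threshold):
        return tentative_sentence, False
    documents = retrieve(tentative_sentence)
    return regenerate(tentative_sentence, documents), True

# Stand-ins for a retriever and an LLM call (both hypothetical).
def fake_retrieve(query):
    return ["The Eiffel Tower was completed in 1889."]

def fake_regenerate(sentence, documents):
    return documents[0]

confident = flare_step(
    "Paris is in France.", [0.9, 0.8, 0.95, 0.9], fake_retrieve, fake_regenerate
)
uncertain = flare_step(
    "The tower opened in 1901.", [0.9, 0.8, 0.7, 0.6, 0.2], fake_retrieve, fake_regenerate
)
```

The second element of each returned pair records whether retrieval was triggered, which makes it easy to measure how often the model falls back on external evidence for a given threshold.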

Self-RAG [Asai et al., 2023] introduces “reflection tokens” that allow the model to introspect its outputs. These tokens come in two varieties: “retrieve” and “critic”. The model autonomously decides when to activate retrieval, or alternatively, a predefined threshold may trigger the process. During retrieval, the generator conducts a fragment-level beam search across multiple paragraphs to derive the most coherent sequence. Critic scores are used to update the subdivision scores, with the flexibility to adjust these weights during inference, tailoring the model’s behavior. Self-RAG’s design obviates the need for additional classifiers or reliance on Natural Language Inference (NLI) models, thus streamlining the decision-making process for when to engage retrieval mechanisms and improving the model’s autonomous judgment capabilities in generating accurate responses.
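The role of adjustable critic weights can be illustrated with a small scoring sketch: each candidate segment receives its generation log-probability plus a weighted sum of critic scores, and changing the weights at inference time changes which candidate wins. The critic names (`relevance`, `support`), the additive form, and all numeric values are assumptions for illustration.

```python
def segment_score(log_prob, critic_scores, weights):
    """Combine the generation log-probability with weighted critic scores."""
    critique = sum(weights[name] * value for name, value in critic_scores.items())
    return log_prob + critique

def best_segment(candidates, weights):
    """Pick the candidate segment with the highest combined score."""
    return max(
        candidates,
        key=lambda c: segment_score(c["log_prob"], c["critic"], weights),
    )

candidates = [
    {"text": "fluent but unsupported", "log_prob": -1.0,
     "critic": {"relevance": 0.9, "support": 0.1}},
    {"text": "supported by passage", "log_prob": -1.5,
     "critic": {"relevance": 0.8, "support": 0.9}},
]

# Equal weights favour the segment that the retrieved passage supports.
default_pick = best_segment(candidates, {"relevance": 1.0, "support": 1.0})
# Setting the support weight to zero lets the more fluent segment win instead.
fluency_pick = best_segment(candidates, {"relevance": 1.0, "support": 0.0})
```

Because the weights are plain inference-time parameters, the trade-off between fluency and grounding can be tuned per application without retraining.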

6.4 RAG vs Fine-Tuning

LLM optimization has received significant attention due to its increasing prevalence. Techniques such as prompt engineering, Fine-Tuning (FT), and RAG each have distinct characteristics, visually represented in Figure 6. While prompt engineering leverages a model’s inherent capabilities, optimizing LLMs often requires the application of both RAG and FT methods. The choice between RAG and FT should be based on the specific requirements of the scenario and the inherent properties of each approach. A detailed comparison of RAG and FT is presented in Table 1.

RAG is like giving a model a textbook for tailored information retrieval, perfect for specific queries. On the other hand, FT is like a student internalizing knowledge over time, better for replicating specific structures, styles, or formats. FT can improve model performance and efficiency by reinforcing base model knowledge, adjusting outputs, and teaching complex instructions. However, it is not as good for integrating new knowledge or rapidly iterating new use cases.
The two methods, RAG and FT, are not mutually exclusive and can be complementary, augmenting a model’s capabilities at different levels. In some cases, their combined use may yield optimal performance. The optimization process involving RAG and FT can necessitate multiple iterations to achieve satisfactory results.

7 RAG Evaluation

The rapid advancement and growing adoption of RAG in the field of Natural Language Processing (NLP) have propelled the evaluation of RAG models to the forefront of research in the LLMs community. The primary objective of this evaluation is to comprehend and optimize the performance of RAG models across diverse application scenarios.

Historically, RAG model assessments have centered on their execution in specific downstream tasks. These evaluations employ established metrics suitable to the tasks at hand. For instance, question answering evaluations might rely on EM and F1 scores [Wang et al., 2023a, Shi et al., 2023, Feng et al., 2023, Ma et al., 2023a], whereas fact-checking tasks often hinge on accuracy as the primary metric [Lewis et al., 2020, Izacard et al., 2022, Shao et al., 2023]. Tools like RALLE, designed for the automatic evaluation of RAG applications, similarly base their assessments on these task-specific metrics [Hoshi et al., 2023]. Despite this, there is a notable paucity of research dedicated to evaluating the distinct characteristics of RAG models, with only a handful of related studies.
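For reference, the EM and F1 scores mentioned above are commonly computed on normalized answer strings, roughly as in the following sketch. The normalization steps (lowercasing, stripping punctuation and English articles) follow a widespread question-answering convention and are an assumption here; individual benchmarks may differ in detail.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "eiffel tower.")  # 1.0 after normalization
f1 = token_f1("in Paris France", "Paris")              # precision 1/3, recall 1
```

EM rewards only a perfect match, whereas F1 gives partial credit to an answer that contains the reference among extra words, as in the second example.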
The following section shifts the focus from task-specific evaluation methods and metrics to provide a synthesis of the existing literature based on their unique attributes. This exploration covers the objectives of RAG evaluation, the aspects along which these models are assessed, and the benchmarks and tools available for such evaluations. The aim is to offer a comprehensive overview of RAG model evaluation, outlining the methodologies that specifically address the unique aspects of these advanced generative systems.

7.1 Evaluation Targets

The assessment of RAG models mainly revolves around two key components: the retrieval and generation modules. This division ensures a thorough evaluation of both the quality of context provided and the quality of content produced.

Retrieval Quality

Evaluating the retrieval quality is crucial for determining the effectiveness of the context sourced by the retriever component. Standard metrics from the domains of search engines, recommendation systems, and information retrieval systems are employed to measure the performance of the RAG retrieval module. Metrics such as Hit Rate, MRR, and NDCG are commonly utilized for this purpose [Liu, 2023, Nguyen, 2023].

Generation Quality

The assessment of generation quality centers on the generator’s capacity to synthesize coherent and relevant answers from the retrieved context. This evaluation can be categorized based on the content’s objectives: unlabeled and labeled content. For unlabeled content, the evaluation encompasses the faithfulness, relevance, and non-harmfulness of the generated answers. In contrast, for labeled content, the focus is on the accuracy of the information produced by the model [Liu, 2023]. Additionally, both retrieval and generation quality assessments can be conducted through manual or automatic evaluation methods [Liu, 2023, Lan et al., 2022, Leng et al., 2023].

Figure 6: RAG compared with other model optimization methods

7.2 Evaluation Aspects

Contemporary evaluation practices of RAG models emphasize three primary quality scores and four essential abilities, which collectively inform the evaluation of the two principal targets of the RAG model: retrieval and generation.

Quality Scores

Quality scores include context relevance, answer faithfulness, and answer relevance. These quality scores evaluate the efficiency of the RAG model from different perspectives in the process of information retrieval and generation [Es et al., 2023, Saad-Falcon et al., 2023, Jarvis and Allard, 2023].

Context Relevance evaluates the precision and specificity of the retrieved context, ensuring relevance and minimizing processing costs associated with extraneous content.

Answer Faithfulness ensures that the generated answers remain true to the retrieved context, maintaining consistency and avoiding contradictions.

Answer Relevance requires that the generated answers are directly pertinent to the posed questions, effectively addressing the core inquiry.

Required Abilities

RAG evaluation also encompasses four abilities indicative of its adaptability and efficiency: noise robustness, negative rejection, information integration, and counterfactual robustness [Chen et al., 2023b, Liu et al., 2023b]. These abilities are critical for the model’s performance under various challenges and complex scenarios, impacting the quality scores.

Noise Robustness appraises the model’s capability to manage noise documents that are question-related but lack substantive information.

Negative Rejection assesses the model’s discernment in refraining from responding when the retrieved documents do not contain the necessary knowledge to answer a question.

Information Integration evaluates the model’s proficiency in synthesizing information from multiple documents to address complex questions.

Counterfactual Robustness tests the model’s ability to recognize and disregard known inaccuracies within documents, even when instructed about potential misinformation.

Context relevance and noise robustness are important for evaluating the quality of retrieval, while answer faithfulness, answer relevance, negative rejection, information integration, and counterfactual robustness are important for evaluating the quality of generation.
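As a concrete illustration of how one of these abilities might be quantified, the sketch below measures negative rejection as the share of unanswerable cases in which the model declines to answer. The fixed rejection string, the exact-match detection rule, and the example records are assumptions made for this sketch; benchmarks define their own prompts and matching rules.

```python
# Hypothetical refusal message that the model is instructed to output.
REJECTION = "I cannot answer this question from the provided documents."

def negative_rejection_rate(examples):
    """Share of unanswerable cases in which the model declined to answer.

    Each example is a dict with 'answerable' (bool) and 'response' (str).
    """
    unanswerable = [ex for ex in examples if not ex["answerable"]]
    if not unanswerable:
        return 0.0
    rejected = sum(ex["response"].strip() == REJECTION for ex in unanswerable)
    return rejected / len(unanswerable)

examples = [
    {"answerable": False, "response": REJECTION},            # correct refusal
    {"answerable": False, "response": "The answer is 42."},  # should have refused
    {"answerable": True, "response": "Paris"},               # not counted
]
rate = negative_rejection_rate(examples)
```

Only the cases whose retrieved documents lack the needed knowledge enter the denominator, so the score is not inflated by ordinary answerable questions.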

Table 1: Comparison between RAG and Fine-Tuning

Feature Comparison | RAG | Fine-Tuning
Knowledge Updates | Directly updating the retrieval knowledge base ensures that the information remains current without the need for frequent retraining, making it well-suited for dynamic data environments. | Stores static data, requiring retraining for knowledge and data updates.
External Knowledge | Proficient in leveraging external resources, particularly suitable for accessing documents or other structured/unstructured databases. | Can be utilized to align the externally acquired knowledge from pretraining with large language models, but may be less practical for frequently changing data sources.
Data Processing | Involves minimal data processing and handling. | Depends on the creation of high-quality datasets, and limited datasets may not result in significant performance improvements.
Model Customization | Focuses on information retrieval and integrating external knowledge but may not fully customize model behavior or writing style. | Allows adjustments of LLM behavior, writing style, or specific domain knowledge based on specific tones or terms.
Interpretability | Responses can be traced back to specific data sources, providing higher interpretability and traceability. | Similar to a black box, it is not always clear why the model reacts a certain way, resulting in relatively lower interpretability.
Computational Resources | Depends on computational resources to support retrieval strategies and technologies related to databases. Additionally, it requires the maintenance of external data source integration and updates. | The preparation and curation of high-quality training datasets, defining fine-tuning objectives, and providing corresponding computational resources are necessary.
Latency Requirements | Involves data retrieval, which may lead to higher latency. | LLM after fine-tuning can respond without retrieval, resulting in lower latency.
Reducing Hallucinations | Inherently less prone to hallucinations as each answer is grounded in retrieved evidence. | Can help reduce hallucinations by training the model based on specific domain data but may still exhibit hallucinations when faced with unfamiliar input.
Ethical and Privacy Issues | Ethical and privacy concerns arise from the storage and retrieval of text from external databases. | Ethical and privacy concerns may arise due to sensitive content in the training data.

The specific metrics for each evaluation aspect are summarized in Table 2. It is essential to recognize that these metrics, derived from related work, are traditional measures and do not yet represent a mature or standardized approach for quantifying RAG evaluation aspects. Custom metrics tailored to the nuances of RAG models, though not included here, have also been developed in some evaluation studies.
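Three of the traditional retrieval measures, Hit Rate, reciprocal rank (the per-query term averaged by MRR), and NDCG, can be computed for a single ranked result list as sketched below. The function signatures and the document ids are illustrative choices; the formulas follow the standard textbook definitions.

```python
import math

def hit_rate(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item; MRR is the mean of this over queries."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_ids, gains, k):
    """gains maps doc id -> graded relevance; ids missing from gains count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc, 0) for doc in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

ranked = ["d3", "d1", "d7"]   # retriever output, best first
relevant = {"d1"}             # ground-truth relevant documents

hit_at_1 = hit_rate(ranked, relevant, k=1)
hit_at_2 = hit_rate(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
score = ndcg(ranked, {"d1": 1}, k=3)
```

Hit Rate only asks whether a relevant document was found within the cutoff, whereas reciprocal rank and NDCG also penalize it for appearing lower in the list.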

7.3 Evaluation Benchmarks and Tools

This section delineates the evaluation framework for RAG models, comprising benchmark tests and automated evaluation tools. These instruments furnish quantitative metrics that not only gauge RAG model performance but also enhance comprehension of the model’s capabilities across various evaluation aspects. Prominent benchmarks such as RGB and RECALL [Chen et al., 2023b, Liu et al., 2023b] focus on appraising the essential abilities of RAG models. Concurrently, state-of-the-art automated tools like RAGAS [Es et al., 2023], ARES [Saad-Falcon et al., 2023], and TruLens8 employ LLMs to adjudicate the quality scores. These tools and benchmarks collectively form a robust framework for the systematic evaluation of RAG models, as summarized in Table 3.

8 https://www.trulens.org/trulens_eval/core_concepts_rag_triad/

Table 2: Summary of metrics applicable for evaluation aspects of RAG

Relevance Relevance

EM Recall ✓ Precision ✓ R-Rate ✓ ✓ Cosine Similarity ✓ Hit Rate ✓ MRR ✓ NDCG ✓

Table 3: Summary of evaluation frameworks

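Evaluation Framework | Evaluation Targets | Evaluation Aspects | Quantitative Metrics
RGB† | Retrieval Quality, Generation Quality | Noise Robustness | Accuracy
RGB† | Retrieval Quality, Generation Quality | Negative Rejection | EM
RGB† | Retrieval Quality, Generation Quality | Information Integration | Accuracy
RGB† | Retrieval Quality, Generation Quality | Counterfactual Robustness | Accuracy
RECALL† | Generation Quality | Counterfactual Robustness | R-Rate (Reappearance Rate)
RAGAS‡ | Retrieval Quality, Generation Quality | Context Relevance | *
RAGAS‡ | Retrieval Quality, Generation Quality | Faithfulness | *
RAGAS‡ | Retrieval Quality, Generation Quality | Answer Relevance | Cosine Similarity
ARES‡ | Retrieval Quality, Generation Quality | Context Relevance | Accuracy
ARES‡ | Retrieval Quality, Generation Quality | Faithfulness | Accuracy
ARES‡ | Retrieval Quality, Generation Quality | Answer Relevance | Accuracy
TruLens‡ | Retrieval Quality, Generation Quality | Context Relevance | *
TruLens‡ | Retrieval Quality, Generation Quality | Faithfulness | *
TruLens‡ | Retrieval Quality, Generation Quality | Answer Relevance | *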

† represents a benchmark, and ‡ represents a tool. * denotes customized quantitative metrics, which deviate from traditional metrics. Readers are encouraged to consult pertinent literature for the specific quantification formulas associated with these metrics, as required.

8 Future Prospects

This section explores three future prospects for RAG: future challenges, modality expansion, and the RAG ecosystem.

8.1 Future Challenges of RAG

Despite the considerable progress in RAG technology, several challenges persist that warrant in-depth research:

Context Length: RAG’s efficacy is limited by the context window size of Large Language Models (LLMs). Balancing the trade-off between a window that is too short, risking insufficient information, and one that is too long, risking information dilution, is crucial. With ongoing efforts to expand LLM context windows to virtually unlimited sizes, the adaptation of RAG to these changes presents a significant research question [Xu et al., 2023c, Packer et al., 2023, Xiao et al., 2023].

Robustness: The presence of noise or contradictory information during retrieval can detrimentally affect RAG’s output quality. This situation is figuratively referred to as “Misinformation can be worse than no information at all”. Improving RAG’s resistance to such adversarial or counterfactual inputs is gaining research momentum and has become a key performance metric [Yu et al., 2023a, Glass et al., 2021, Baek et al., 2023].

Hybrid Approaches (RAG+FT): Combining RAG with fine-tuning is emerging as a leading strategy. Determining the optimal integration of RAG and fine-tuning (whether sequential, alternating, or through end-to-end joint training) and how to harness both parameterized and non-parameterized advantages are areas ripe for exploration [Lin et al., 2023].

Expanding LLM Roles: Beyond generating final answers, LLMs are leveraged for retrieval and evaluation within RAG frameworks. Identifying ways to further unlock LLMs’ potential in RAG systems is a growing research direction.

Scaling Laws: While scaling laws [Kaplan et al., 2020] are established for LLMs, their applicability to RAG remains uncertain. Initial studies [Wang et al., 2023b] have begun to address this, yet the parameter count in RAG models still lags behind that of LLMs. The possibility of an Inverse Scaling Law9, where smaller models outperform larger ones, is particularly intriguing and merits further investigation.

Production-Ready RAG: RAG’s practicality and alignment with engineering requirements have facilitated its adoption. However, enhancing retrieval efficiency, improving document recall in large knowledge bases, and ensuring data security (such as preventing inadvertent disclosure of document sources or metadata by LLMs) are critical engineering challenges that remain to be addressed [Alon et al., 2022].

Modality Extension of RAG

RAG has transcended its initial text-based question-answering confines, embracing a diverse array of modal data. This expansion has spawned innovative multimodal models that integrate RAG concepts across various domains:

Image. RA-CM3 [Yasunaga et al., 2022] stands as a pioneering multimodal model that both retrieves and generates text and images. BLIP-2 [Li et al., 2023a] leverages frozen image encoders alongside LLMs for efficient visual language pre-training, enabling zero-shot image-to-text conversions. The “Visualize Before You Write” method [Zhu et al., 2022] employs image generation to steer the LM’s text generation, showing promise in open-ended text generation tasks.

Audio and Video. The GSS method retrieves and stitches together audio clips to convert machine-translated data into speech-translated data [Zhao et al., 2022]. UEOP marks a significant advancement in end-to-end automatic speech recognition by incorporating external, offline strategies for voice-to-text conversion [Chan et al., 2023]. Additionally, KNN-based attention fusion leverages audio embeddings and semantically related text embeddings to refine ASR, thereby accelerating domain adaptation. Vid2Seq augments language models with specialized temporal markers, facilitating the prediction of event boundaries and textual descriptions within a unified output sequence [Yang et al., 2023a].

Code. RBPS [Nashid et al., 2023] excels in small-scale learning tasks by retrieving code examples that align with developers’ objectives through encoding and frequency analysis. This approach has demonstrated efficacy in tasks such as test assertion generation and program repair. For structured knowledge, the CoK method [Li et al., 2023c] first extracts facts pertinent to the input query from a knowledge graph, then integrates these facts as hints within the input, enhancing performance in knowledge graph question-answering tasks.

8.2 Ecosystem of RAG

Downstream Tasks and Evaluation

RAG has shown considerable promise in enriching language models with the capacity to handle intricate queries and produce detailed responses by leveraging extensive knowledge bases. Empirical evidence suggests that RAG excels in a variety of downstream tasks, including open-ended question answering and fact verification. The integration of RAG not only bolsters the precision and relevance of responses but also their diversity and depth.

9https://github.com/inverse-scaling/prize

The scalability and versatility of RAG across multiple domains warrant further investigation, particularly in specialized fields such as medicine, law, and education. In these areas, RAG could potentially reduce training costs and enhance performance compared to traditional fine-tuning approaches in professional domain knowledge question answering.

Concurrently, refining the evaluation framework for RAG is essential to maximize its efficacy and utility across different tasks. This entails the development of nuanced metrics and assessment tools that can gauge aspects such as contextual relevance, creativity of content, and non-maleficence.

Furthermore, improving the interpretability of RAG-driven models continues to be a key goal. Doing so would allow users to understand the reasoning behind the responses generated by the model, thereby promoting trust and transparency in the use of RAG applications.

Technical Stack

The development of the RAG ecosystem is greatly impacted by the progression of its technical stack. Key tools like LangChain and LlamaIndex have quickly gained popularity with the emergence of ChatGPT, providing extensive RAG-related APIs and becoming essential in the realm of LLMs.

Emerging technical stacks, while not as feature-rich as LangChain and LlamaIndex, distinguish themselves with specialized offerings. For instance, Flowise AI (footnote 10) prioritizes a low-code approach, enabling users to deploy AI applications, including RAG, through a user-friendly drag-and-drop interface. Other technologies like Haystack, Meltano (footnote 11), and Cohere Coral (footnote 12) are also gaining attention for their unique contributions to the field.

In addition to AI-focused providers, traditional software and cloud service providers are expanding their offerings to include RAG-centric services. Verba (footnote 13) from Weaviate is designed for personal assistant applications, while Amazon's Kendra (footnote 14) provides an intelligent enterprise search service, allowing users to navigate through various content repositories using built-in connectors. During the evolution of the RAG technology landscape, there has been a clear divergence towards different specializations, such as: 1) Customization. Tailoring RAG to meet specific requirements. 2) Simplification. Making RAG easier to use, thereby reducing the initial learning curve. 3) Specialization. Refining RAG to serve production environments more effectively.

The mutual growth of RAG models and their technical stack is evident; technological advancements consistently establish new standards for the existing infrastructure. In turn, enhancements to the technical stack drive the evolution of RAG capabilities. The RAG toolkit is converging into a foundational technical stack, laying the groundwork for advanced enterprise applications. However, the concept of a fully integrated, comprehensive platform remains on the horizon, pending further innovation and development.

10 https://flowiseai.com
11 https://meltano.com
12 https://cohere.com/coral
13 https://github.com/weaviate/Verba
14 https://aws.amazon.com/cn/kendra/

Figure 7: Summary of RAG ecosystem

9 Conclusion

The summary of this paper, as depicted in Figure 7, highlights RAG's significant advancement in enhancing the capabilities of LLMs through the integration of parameterized knowledge from language models with extensive non-parameterized data from external knowledge bases. Our survey illustrates the evolution of RAG technologies and their impact on knowledge-intensive tasks. Our analysis delineates three developmental paradigms within the RAG framework: Naive, Advanced, and Modular RAG, each marking a progressive enhancement over its predecessors. The Advanced RAG paradigm extends beyond the Naive approach by incorporating sophisticated architectural elements, including query rewriting, chunk reranking, and prompt summarization. These innovations have led to a more nuanced and modular architecture that enhances both the performance and the interpretability of LLMs. RAG's technical integration with other AI methodologies, such as fine-tuning and reinforcement learning, has further expanded its capabilities. In content retrieval, a hybrid methodology that leverages both structured and unstructured data sources is emerging as a trend, providing a more enriched retrieval process. Cutting-edge research within the RAG framework is exploring novel concepts such as self-retrieval from LLMs and the dynamic timing of information retrieval.
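To make the three Advanced RAG elements named above more concrete (query rewriting, chunk reranking, and prompt summarization), the following is a minimal sketch under strong simplifying assumptions. Every function here is a hypothetical toy stand-in: `rewrite_query` uses a fixed abbreviation table instead of an LLM rewriter, `rerank` scores chunks by token overlap instead of a learned reranker, and `compress` truncates chunks instead of summarizing them.

```python
import re
from typing import List, Set

# Hypothetical abbreviation table used by the toy query rewriter.
EXPANSIONS = {
    "rag": "retrieval augmented generation",
    "llm": "large language model",
}

CHUNKS = [
    "Retrieval augmented generation combines a retriever with a generator.",
    "Bananas are rich in potassium.",
    "A large language model can hallucinate facts without retrieval.",
]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rewrite_query(query: str) -> str:
    """Pre-retrieval step: expand known abbreviations in the query."""
    out = []
    for word in query.split():
        key = word.strip("?.,!").lower()
        out.append(EXPANSIONS.get(key, word))
    return " ".join(out)


def rerank(query: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Post-retrieval step: keep the top_k chunks by token overlap with the query."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:top_k]


def compress(chunks: List[str], max_words: int = 6) -> List[str]:
    """Stand-in for prompt summarization: truncate each chunk to max_words words."""
    return [" ".join(c.split()[:max_words]) for c in chunks]


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble the final prompt from the compressed context and the query."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


rewritten = rewrite_query("How does RAG help LLM accuracy?")
top = rerank(rewritten, CHUNKS, top_k=2)
compressed = compress(top, max_words=6)
prompt = build_prompt(rewritten, compressed)
print(prompt)
```

A Naive RAG pipeline in this framing would skip `rewrite_query`, `rerank`, and `compress` and pass the retrieved chunks to `build_prompt` unchanged; the sketch only shows where the additional steps sit, not how production systems implement them.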

Despite the strides made in RAG technology, research opportunities abound in improving its robustness and its ability to manage extended contexts. RAG's application scope is also widening into multimodal domains, adapting its principles to interpret and process diverse data forms such as images, videos, and code. This expansion underscores RAG's significant practical implications for AI deployment, attracting interest from both academic and industrial sectors. The growing ecosystem of RAG is underscored by an increase in RAG-centric AI applications and the ongoing development of supportive tools. However, as RAG's application landscape expands, there is an imperative need to refine evaluation methodologies to keep pace with its evolution. Ensuring that performance assessments remain accurate and representative is crucial for capturing the full extent of RAG's contributions to the AI research and development community.

References

[Alon et al., 2022] Uri Alon, Frank Xu, Junxian He, Sudipta Sengupta, Dan Roth, and Graham Neubig. Neuro-symbolic language modeling with automaton-augmented retrieval. In International Conference on Machine Learning, pages 468–485. PMLR, 2022.
[Anderson et al., 2022] Nathan Anderson, Caleb Wilson, and Stephen D. Richardson. Lingua: Addressing scenarios for live interpretation and automatic dubbing. In Janice Campbell, Stephen Larocca, Jay Marciano, Konstantin Savenkov, and Alex Yanishevsky, editors, Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), pages 202–209, Orlando, USA, September 2022. Association for Machine Translation in the Americas.
[Arora et al., 2023] Daman Arora, Anush Kini, Sayak Ray Chowdhury, Nagarajan Natarajan, Gaurav Sinha, and Amit Sharma. Gar-meets-rag paradigm for zero-shot information retrieval. arXiv preprint arXiv:2310.20158, 2023.
[Asai et al., 2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
[BAAI, 2023] BAAI. Flagembedding. https://github.com/FlagOpen/FlagEmbedding, 2023.
[Baek et al., 2023] Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C Park, and Sung Ju Hwang. Knowledge-augmented language model verification. arXiv preprint arXiv:2310.12836, 2023.
[Berchansky et al., 2023] Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, and Moshe Wasserblat. Optimizing retrieval-augmented reader models via token elimination. arXiv preprint arXiv:2310.13682, 2023.
[Blagojevi, 2023] Vladimir Blagojevi. Enhancing rag pipelines in haystack: Introducing diversityranker and lostinthemiddleranker. https://towardsdatascience.com/enhancing-rag-pipelines-in-haystack-45f14e2bc9f5, 2023.
[Borgeaud et al., 2022] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[Cai et al., 2021] Deng Cai, Yan Wang, Huayang Li, Wai Lam, and Lemao Liu. Neural machine translation with monolingual translation memory. arXiv preprint arXiv:2105.11269, 2021.
[Chan et al., 2023] David M Chan, Shalini Ghosh, Ariya Rastrow, and Björn Hoffmeister. Using external off-policy speech-to-text mappings in contextual end-to-end automated speech recognition. arXiv preprint arXiv:2301.02736, 2023.
[Chen et al., 2023a] Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029, 2023.
[Chen et al., 2023b] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431, 2023.

[Cheng et al., 2022] Xin Cheng, Shen Gao, Lemao Liu, Dongyan Zhao, and Rui Yan. Neural machine translation with contrastive translation memories. arXiv preprint arXiv:2212.03140, 2022.
[Cheng et al., 2023a] Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518, 2023.
[Cheng et al., 2023b] Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. Lift yourself up: Retrieval-augmented text generation with self memory. arXiv preprint arXiv:2305.02437, 2023.
[Cohere, 2023] Cohere. Say goodbye to irrelevant search results: Cohere rerank is here. https://txt.cohere.com/rerank/, 2023.
[Dai et al., 2022] Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples. arXiv preprint arXiv:2209.11755, 2022.
[Es et al., 2023] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
[Feng et al., 2023] Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, and Bing Qin. Retrieval-generation synergy augmented large language models. arXiv preprint arXiv:2310.05149, 2023.
[Gao et al., 2022] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. arXiv preprint arXiv:2212.10496, 2022.
[Glass et al., 2021] Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, and Alfio Gliozzo. Robust retrieval augmented generation for zero-shot slot filling. arXiv preprint arXiv:2108.13934, 2021.
[Google, 2023] Google. Gemini: A family of highly capable multimodal models. https://goo.gle/GeminiPaper, 2023.
[Guo et al., 2023] Zhicheng Guo, Sijie Cheng, Yile Wang, Peng Li, and Yang Liu. Prompt-guided retrieval augmentation for non-knowledge-intensive tasks. arXiv preprint arXiv:2305.17653, 2023.
[Hendrycks et al., 2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[Hoshi et al., 2023] Yasuto Hoshi, Daisuke Miyashita, Youyang Ng, Kento Tatsuno, Yasuhiro Morioka, Osamu Torii, and Jun Deguchi. Ralle: A framework for developing and evaluating retrieval-augmented large language models. arXiv preprint arXiv:2308.10633, 2023.
[Huang et al., 2023] Jie Huang, Wei Ping, Peng Xu, Mohammad Shoeybi, Kevin Chen-Chuan Chang, and Bryan Catanzaro. Raven: In-context learning with retrieval augmented encoder-decoder language models. arXiv preprint arXiv:2308.07922, 2023.
[ILIN, 2023] IVAN ILIN. Advanced rag techniques: an illustrated overview. https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6, 2023.
[Izacard et al., 2022] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
[Jarvis and Allard, 2023] Colin Jarvis and John Allard. A survey of techniques for maximizing llm performance. https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213#a-survey-of-techniques-for-maximizing-llm-performance-2, 2023.
[Jiang et al., 2023a] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023.
[Jiang et al., 2023b] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023.
[Kandpal et al., 2023] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696–15707. PMLR, 2023.
[Kang et al., 2023] Minki Kang, Jin Myung Kwak, Jinheon Baek, and Sung Ju Hwang. Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846, 2023.
[Kaplan et al., 2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[Karpukhin et al., 2020] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[Khandelwal et al., 2019] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
[Khattab et al., 2022] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022.
[Kim et al., 2023] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. arXiv preprint arXiv:2310.14696, 2023.
[Lan et al., 2022] Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. Copy is all you need. In The Eleventh International Conference on Learning Representations, 2022.
[Lee et al., 2020] Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. Learning dense representations of phrases at scale. arXiv preprint arXiv:2012.12624, 2020.
[Leng et al., 2023] Quinn Leng, Kasey Uhlenhuth, and Alkis Polyzotis. Best practices for llm evaluation of rag applications. https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG, 2023.
[Lewis et al., 2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[Li and Li, 2023] Xianming Li and Jing Li. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023.
[Li et al., 2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[Li et al., 2023b] Xiaoqian Li, Ercong Nie, and Sheng Liang. From classification to generation: Insights into crosslingual retrieval augmented icl. arXiv preprint arXiv:2311.06595, 2023.
[Li et al., 2023c] Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Lidong Bing, Shafiq Joty, and Soujanya Poria. Chain of knowledge: A framework for grounding large language models with structured knowledge bases. arXiv preprint arXiv:2305.13269, 2023.
[Li et al., 2023d] Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, and Ge Yu. Structure-aware language model pretraining improves dense retrieval on structured data. arXiv preprint arXiv:2305.19912, 2023.
[Liang et al., 2023] Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions. arXiv preprint arXiv:2304.05684, 2023.
[Lin et al., 2023] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023.
[Litman et al., 2020] Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, and R Manmatha. Scatter: selective context attentional scene text recognizer. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11962–11972, 2020.

[Liu et al., 2023a] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
[Liu et al., 2023b] Yi Liu, Lianzhe Huang, Shicheng Li, Sishuo Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Recall: A benchmark for llms robustness against external counterfactual knowledge. arXiv preprint arXiv:2311.08147, 2023.
[Liu, 2023] Jerry Liu. Building production-ready rag applications. https://www.ai.engineer/summit/schedule/building-production-ready-rag-applications, 2023.
[Luo et al., 2023] Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Augmented large language models with parametric knowledge guiding. arXiv preprint arXiv:2305.04757, 2023.
[Ma et al., 2023a] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283, 2023.
[Ma et al., 2023b] Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! ArXiv, abs/2303.08559, 2023.
[Modarressi et al., 2023] Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322, 2023.
[Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[Nashid et al., 2023] Noor Nashid, Mifta Sintaha, and Ali Mesbah. Retrieval-based prompt selection for code-related few-shot learning. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2450–2462, 2023.
[Nguyen, 2023] Isabelle Nguyen. Evaluating rag part i: How to evaluate document retrieval. https://www.deepset.ai/blog/rag-evaluation-retrieval, 2023.
[Nishikawa et al., 2022] Sosuke Nishikawa, Ryokan Ri, Ikuya Yamada, Yoshimasa Tsuruoka, and Isao Echizen. Ease: Entity-aware contrastive learning of sentence embedding. arXiv preprint arXiv:2205.04260, 2022.
[OpenAI, 2023] OpenAI. Gpt-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf, 2023.
[Packer et al., 2023] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[Raffel et al., 2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[Ram et al., 2023] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.
[Raudaschl, 2023] Adrian H. Raudaschl. Forget rag, the future is rag-fusion. https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1, 2023.
[Saad-Falcon et al., 2023] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. Ares: An automated evaluation framework for retrieval-augmented generation systems. arXiv preprint arXiv:2311.09476, 2023.
[Schick et al., 2023] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[Sciavolino et al., 2021] Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. Simple entity-centric questions challenge dense retrievers. arXiv preprint arXiv:2109.08535, 2021.
[Shao et al., 2023] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023.
[Shi et al., 2023] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023.
[Srivastava et al., 2022] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[Sun et al., 2022] Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022.
[Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[Trivedi et al., 2022] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[VoyageAI, 2023] VoyageAI. Voyage's embedding models. https://docs.voyageai.com/embeddings/, 2023.
[Wang et al., 2019] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
[Wang et al., 2022a] Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and Michael Zeng. Training data is more valuable than you think: A simple and effective method by retrieving from training data. arXiv preprint arXiv:2203.08773, 2022.
[Wang et al., 2022b] Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and Michael Zeng. Training data is more valuable than you think: A simple and effective method by retrieving from training data. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3170–3179, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[Wang et al., 2023a] Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, and Bryan Catanzaro. Instructretro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023.
[Wang et al., 2023b] Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, et al. Shall we pretrain autoregressive language models with retrieval? a comprehensive study. arXiv preprint arXiv:2304.06762, 2023.
[Wang et al., 2023c] Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023.
[Wang et al., 2023d] Xintao Wang, Qianwen Yang, Yongting Qiu, Jiaqing Liang, Qianyu He, Zhouhong Gu, Yanghua Xiao, and Wei Wang. Knowledgpt: Enhancing large language models with retrieval and storage access on knowledge bases. arXiv preprint arXiv:2308.11761, 2023.
[Wang et al., 2023e] Yile Wang, Peng Li, Maosong Sun, and Yang Liu. Self-knowledge guided retrieval augmentation for large language models. arXiv preprint arXiv:2310.05002, 2023.
[Xia et al., 2019] Mengzhou Xia, Guoping Huang, Lemao Liu, and Shuming Shi. Graph based translation memory for neural machine translation. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 7297–7304, 2019.

[Xiao et al., 2023] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[Xu et al., 2023a] Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408, 2023.
[Xu et al., 2023b] Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025, 2023.
[Xu et al., 2023c] Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025, 2023.
[Yang et al., 2023a] Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023.
[Yang et al., 2023b] Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao. Prca: Fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter. arXiv preprint arXiv:2310.18347, 2023.
[Yang et al., 2023c] Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
[Yasunaga et al., 2022] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561, 2022.
[Ye et al., 2020] Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. Coreferential reasoning learning for language representation. arXiv preprint arXiv:2004.06870, 2020.
[Yoran et al., 2023] Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. Making retrieval-augmented language models robust to irrelevant context. arXiv preprint arXiv:2310.01558, 2023.
[Yu et al., 2022] Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063, 2022.
[Yu et al., 2023a] Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023.

[Yu et al., 2023b] Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, 2023.
[Zhang et al., 2019] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
[Zhang et al., 2023a] Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023.
[Zhang et al., 2023b] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
[Zhang, 2023] Jiawei Zhang. Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt. arXiv preprint arXiv:2304.11116, 2023.
[Zhao et al., 2022] Jinming Zhao, Gholamreza Haffar, and Ehsan Shareghi. Generating synthetic speech from spokenvocab for speech translation. arXiv preprint arXiv:2210.08174, 2022.
[Zheng et al., 2023] Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi, Quoc V Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117, 2023.
[Zhu et al., 2022] Wanrong Zhu, An Yan, Yujie Lu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, and William Yang Wang. Visualize before you write: Imagination-guided open-ended text generation. arXiv preprint arXiv:2210.03765, 2022.
[Zhuang et al., 2023] Shengyao Zhuang, Bing Liu, Bevan Koopman, and Guido Zuccon. Open-source large language models are strong zero-shot query likelihood models for document ranking. arXiv preprint arXiv:2310.13243, 2023.

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2024 RetrievalAugmentedGenerationfor	Meng Wang Yunfan Gao Yun Xiong Xinyu Gao Kangxiang Jia Jinliu Pan Yuxi Bi Yi Dai Jiawei Sun Qianyu Guo Haofen Wang			Retrieval-Augmented Generation for Large Language Models: A Survey				10.48550/arXiv.2312.10997		2024