
LLMs | GraphRAG: Translation and Commentary on "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

Published: 2025-12-24 11:57

Overview: The paper proposes a knowledge-graph-based retrieval-augmented generation (Graph RAG) approach for answering global questions posed over an entire text corpus, supporting comprehensive human sensemaking over large volumes of data.

Background and pain points: Conventional retrieval-augmented generation (RAG) is designed for local question answering and handles global questions directed at an entire text corpus poorly. Conversely, conventional query-focused summarization (QFS) methods do not scale to the quantities of text indexed by typical RAG systems.

Core idea: Graph RAG answers global questions as follows:
>> Build a two-stage graph-based index. First, use an LLM to extract entities and relationships from the source documents and assemble them into a knowledge graph; second, apply a community detection algorithm to partition the graph into communities of closely related entities.
>> Use an LLM to generate a report-like summary for each community, producing a modular graph index that covers the source documents and the knowledge graph built from them.
>> At query time, first have the LLM generate a partial answer from each community summary independently; then use the LLM again to reduce all relevant partial answers into a global answer returned to the user.

Data flow: source documents → text chunks → entity and relationship instances → entity and relationship descriptions → knowledge graph → graph communities → community summaries → community answers → global answer.

In short, Graph RAG builds a hierarchical knowledge-graph index, exploits the graph's inherent modularity to process communities in parallel, and then applies a map-reduce strategy to answer global queries, improving efficiency while preserving answer comprehensiveness. That is the core of how it handles global question answering.

Key features:
>> Exploits the inherent modularity of the knowledge graph to enable parallel processing.
>> Entities and relationships within each community are richly described, which helps generate more comprehensive and diverse answers.
>> Compared with retrieving source documents directly, the graph index uses far fewer context tokens and answers queries more efficiently.

Advantages: Experiments show that, compared with conventional RAG and direct global summarization of source texts, Graph RAG significantly improves answer comprehensiveness and diversity while using far fewer context tokens; root-level community summaries in particular deliver strong query performance at very low cost, making complex question-answering tasks scalable. In summary, the proposed Graph RAG method combines knowledge graphs, RAG, and query-focused summarization to answer global questions over large text corpora, supporting deep understanding and a big-picture grasp of the data.

"From Local to Global: A Graph RAG Approach to Query-Focused Summarization": Translation and Commentary

Paper: https://arxiv.org/abs/2404.16130
Date: April 24, 2024
Authors: Microsoft

Abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed.
Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pre-generate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, covariates) that the LLM can summarize in parallel at both indexing time and query time.
The "global answer" to a given query is produced using a final round of query-focused summarization over all community summaries reporting relevance to that query.
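To make the Figure 1 dataflow concrete, here is a deliberately tiny end-to-end sketch in Python. Everything in it is illustrative: the chunker, the capitalized-word "entity extractor", and the connected-components "community detection" are crude stand-ins for the LLM prompts and the Leiden algorithm used in the actual pipeline.

```python
# Illustrative sketch of the Figure 1 dataflow; every function is a toy
# stand-in, not the real GraphRAG implementation.
from collections import Counter
from itertools import combinations

def chunk(docs, size=4):
    # Source documents -> text chunks (toy: fixed number of words per chunk).
    words = " ".join(docs).split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def extract_entities(chunk_text):
    # Text chunks -> entity instances (toy: capitalized words stand in for
    # the LLM's entity-extraction prompt).
    return [w.strip(".,") for w in chunk_text.split() if w[:1].isupper()]

def build_graph(chunks):
    # Entity instances -> knowledge graph (edge weight = co-occurrence count).
    edges = Counter()
    for c in chunks:
        for a, b in combinations(sorted(set(extract_entities(c))), 2):
            edges[(a, b)] += 1
    return edges

def communities(edges):
    # Knowledge graph -> graph communities (toy: connected components via
    # union-find stand in for hierarchical Leiden partitioning).
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

def summarize(community):
    # Graph communities -> community summaries (stub for the LLM call).
    return "Summary of: " + ", ".join(sorted(community))

def global_answer(query, summaries):
    # Community summaries -> community answers -> global answer (map-reduce,
    # with keyword overlap standing in for the LLM's partial answers).
    partial = [s for s in summaries if any(w in s for w in query.split())]
    return " | ".join(partial) or "No relevant communities."

docs = ["Alice met Bob in Paris.", "Bob phoned Carol.", "Dave visited Erin."]
summaries = [summarize(c) for c in communities(build_graph(chunk(docs)))]
print(global_answer("Where did Alice go", summaries))
```

Each function corresponds to one arrow in the data flow above; swapping the stubs for LLM prompts and Leiden yields the real pipeline shape.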


1 Introduction

Human endeavors across a range of domains rely on our ability to read and reason about large collections of documents, often reaching conclusions that go beyond anything stated in the source texts themselves. With the emergence of large language models (LLMs), we are already witnessing attempts to automate human-like sensemaking in complex domains like scientific discovery (Microsoft, 2023) and intelligence analysis (Ranade and Joshi, 2023), where sensemaking is defined as "a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively" (Klein et al., 2006a). Supporting human-led sensemaking over entire text corpora, however, needs a way for people to both apply and refine their mental model of the data (Klein et al., 2006b) by asking questions of a global nature.

Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions over entire datasets, but it is designed for situations where these answers are contained locally within regions of text whose retrieval provides sufficient grounding for the generation task. Instead, a more appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel et al., 2018; Laskar et al., 2020; Yao et al., 2017).
In recent years, however, such distinctions between summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document versus multi-document, have become less relevant. While early applications of the transformer architecture showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, all of which can use in-context learning to summarize any content provided in their context window.

The challenge remains, however, for query-focused abstractive summarization over an entire corpus. Such volumes of text can greatly exceed the limits of LLM context windows, and the expansion of such windows may not be enough given that information can be "lost in the middle" of longer contexts (Kuratov et al., 2024; Liu et al., 2023).
In addition, although the direct retrieval of text chunks in naïve RAG is likely inadequate for QFS tasks, it is possible that an alternative form of pre-indexing could support a new RAG approach specifically targeting global summarization.

In this paper, we present a Graph RAG approach based on global summarization of an LLM-derived knowledge graph (Figure 1). In contrast with related work that exploits the structured retrieval and traversal affordances of graph indexes (subsection 4.2), we focus on a previously unexplored quality of graphs in this context: their inherent modularity (Newman, 2006) and the ability of community detection algorithms to partition graphs into modular communities of closely-related nodes (e.g., Louvain, Blondel et al., 2008; Leiden, Traag et al., 2019). LLM-generated summaries of these community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.

To evaluate this approach, we used an LLM to generate a diverse set of activity-centered sensemaking questions from short descriptions of two representative real-world datasets, containing podcast transcripts and news articles respectively.
For the target qualities of comprehensiveness, diversity, and empowerment (defined in subsection 3.4) that develop understanding of broad issues and themes, we both explore the impact of varying the hierarchical level of community summaries used to answer queries, as well as compare to naïve RAG and global map-reduce summarization of source texts. We show that all global approaches outperform naïve RAG on comprehensiveness and diversity, and that Graph RAG with intermediate- and low-level community summaries shows favorable performance over source text summarization on these same metrics, at lower token costs.

2 Graph RAG Approach & Pipeline

We now unpack the high-level data flow of the Graph RAG approach (Figure 1) and pipeline, describing key design parameters, techniques, and implementation details for each step.

2.1 Source Documents → Text Chunks

A fundamental design decision is the granularity with which input texts extracted from source documents should be split into text chunks for processing. In the following step, each of these chunks will be passed to a set of LLM prompts designed to extract the various elements of a graph index. Longer text chunks require fewer LLM calls for such extraction, but suffer from the recall degradation of longer LLM context windows (Kuratov et al., 2024; Liu et al., 2023). This behavior can be observed in Figure 2 in the case of a single extraction round (i.e., with zero gleanings): on a sample dataset (HotPotQA, Yang et al., 2018), using a chunk size of 600 tokens extracted almost twice as many entity references as using a chunk size of 2400.
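The chunk-size trade-off above can be illustrated with a minimal chunker. This is our own sketch (word-level "tokens" stand in for the tokenizer tokens the paper counts, and `chunk_text` is not a GraphRAG function):

```python
def chunk_text(tokens, chunk_size, overlap=0):
    """Split a token list into chunks of `chunk_size`, sharing `overlap`
    tokens between consecutive chunks (sketch, not the real GraphRAG
    chunker, which operates on tokenizer tokens)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)
            if tokens[i:i + chunk_size]]

tokens = [f"t{i}" for i in range(1400)]
small = chunk_text(tokens, 600, overlap=100)   # more chunks, more LLM calls
large = chunk_text(tokens, 1200, overlap=100)  # fewer calls, longer contexts
print(len(small), len(large))
```

Halving the chunk size roughly doubles the number of extraction calls, which is the cost side of the recall gain reported in Figure 2.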
While more references are generally better, any extraction process needs to balance recall and precision for the target activity.

2.2 Text Chunks → Element Instances

The baseline requirement for this step is to identify and extract instances of graph nodes and edges from each chunk of source text. We do this using a multipart LLM prompt that first identifies all entities in the text, including their name, type, and description, before identifying all relationships between clearly-related entities, including the source and target entities and a description of their relationship. Both kinds of element instance are output in a single list of delimited tuples.

The primary opportunity to tailor this prompt to the domain of the document corpus lies in the choice of few-shot examples provided to the LLM for in-context learning (Brown et al., 2020). For example, while our default prompt extracting the broad class of "named entities" like people, places, and organizations is generally applicable, domains with specialized knowledge (e.g., science, medicine, law) will benefit from few-shot examples specialized to those domains. We also support a secondary extraction prompt for any additional covariates we would like to associate with the extracted node instances.
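The "single list of delimited tuples" might be parsed as follows; note that the record and field delimiters here (";" and "|"), the field layout, and the sample text are all invented for illustration and are not the format used by the actual GraphRAG prompts:

```python
# Parse a hypothetical delimited-tuple output of the element-extraction
# prompt into entity and relationship instances.
def parse_elements(raw):
    entities, relationships = [], []
    for record in raw.strip().split(";"):
        fields = [f.strip() for f in record.split("|")]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append({"name": fields[1], "type": fields[2],
                             "description": fields[3]})
        elif fields[0] == "rel" and len(fields) == 4:
            relationships.append({"source": fields[1], "target": fields[2],
                                  "description": fields[3]})
    return entities, relationships

sample = ("entity | NeoChem | ORGANIZATION | A fictional chemicals firm;"
          "entity | Ada Quist | PERSON | Fictional CEO of NeoChem;"
          "rel | Ada Quist | NeoChem | Ada Quist leads NeoChem")
ents, rels = parse_elements(sample)
print(len(ents), len(rels))
```

Keeping both element kinds in one flat list lets a single prompt emit entities and the relationships between them in the order it discovers them.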
Our default covariate prompt aims to extract claims linked to detected entities, including the subject, object, type, description, source text span, and start and end dates.

To balance the needs of efficiency and quality, we use multiple rounds of "gleanings", up to a specified maximum, to encourage the LLM to detect any additional entities it may have missed on prior extraction rounds. This is a multi-stage process in which we first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision. If the LLM responds that entities were missed, then a continuation indicating that "MANY entities were missed in the last extraction" encourages the LLM to glean these missing entities. This approach allows us to use larger chunk sizes without a drop in quality (Figure 2) or the forced introduction of noise.

2.3 Element Instances → Element Summaries

The use of an LLM to "extract" descriptions of entities, relationships, and claims represented in source texts is already a form of abstractive summarization, relying on the LLM to create independently meaningful summaries of concepts that may be implied but not stated by the text itself (e.g., the presence of implied relationships).
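Stepping back to the gleaning mechanism of section 2.2, its control flow can be sketched as follows. Both LLM calls are stubs: the real yes/no check is forced with a logit bias rather than simulated, and the per-round results here are fabricated for illustration.

```python
# Sketch of the multi-round "gleaning" loop from section 2.2.
MAX_GLEANINGS = 2

def extract_entities_stub(chunk, round_no):
    # Stand-in for the extraction prompt: pretends to find one more
    # entity on the first gleaning round, then nothing.
    found = {0: ["Alice", "Bob"], 1: ["Carol"]}
    return found.get(round_no, [])

def missed_entities_stub(chunk, found):
    # Stand-in for the logit-biased yes/no "were entities missed?" check;
    # pretends the chunk contains exactly three entities.
    return len(found) < 3

def extract_with_gleanings(chunk):
    entities = extract_entities_stub(chunk, 0)
    for round_no in range(1, MAX_GLEANINGS + 1):
        if not missed_entities_stub(chunk, entities):
            break
        # "MANY entities were missed in the last extraction" continuation
        entities += extract_entities_stub(chunk, round_no)
    return entities

print(extract_with_gleanings("...some chunk text..."))
```

The cap on rounds bounds cost per chunk, while the forced yes/no check is what lets the loop terminate early once extraction is judged complete.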
To convert all such instance-level summaries into single blocks of descriptive text for each graph element (i.e., entity node, relationship edge, and claim covariate) requires a further round of LLM summarization over matching groups of instances.

A potential concern at this stage is that the LLM may not consistently extract references to the same entity in the same text format, resulting in duplicate entity elements and thus duplicate nodes in the entity graph. However, since all closely-related "communities" of entities will be detected and summarized in the following step, and given that LLMs can understand the common entity behind multiple name variations, our overall approach is resilient to such variations provided there is sufficient connectivity from all variations to a shared set of closely-related entities.

Overall, our use of rich descriptive text for homogeneous nodes in a potentially noisy graph structure is aligned with both the capabilities of LLMs and the needs of global, query-focused summarization. These qualities also differentiate our graph index from typical knowledge graphs, which rely on concise and consistent knowledge triples (subject, predicate, object) for downstream reasoning tasks.

2.4 Element Summaries → Graph Communities

The index created in the previous step can be modelled as a homogeneous undirected weighted graph in which entity nodes are connected by relationship edges, with edge weights representing the normalized counts of detected relationship instances.
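A sketch of that weighted graph, built from hypothetical relationship instances using nothing beyond the standard library (real element summaries would also carry the descriptive text, and the real index normalizes the counts):

```python
# Build the undirected weighted entity graph: the weight of an edge is the
# number of relationship instances detected for that entity pair.
from collections import Counter

relationship_instances = [  # illustrative extraction output
    ("Alice", "Bob"), ("Bob", "Alice"), ("Bob", "Carol"),
    ("Alice", "Bob"), ("Dave", "Erin"),
]

# Sorting each pair makes the graph undirected: (A, B) and (B, A) merge.
edge_weights = Counter(tuple(sorted(pair)) for pair in relationship_instances)
nodes = {n for pair in edge_weights for n in pair}
print(edge_weights[("Alice", "Bob")], len(nodes))
```

Repeatedly detected relationships thus accumulate weight, which is exactly the signal community detection exploits in the next step.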
Given such a graph, a variety of community detection algorithms may be used to partition the graph into communities of nodes with stronger connections to one another than to the other nodes in the graph (e.g., see the surveys by Fortunato, 2010 and Jin et al., 2021). In our pipeline, we use Leiden (Traag et al., 2019) on account of its ability to recover hierarchical community structure of large-scale graphs efficiently (Figure 3). Each level of this hierarchy provides a community partition that covers the nodes of the graph in a mutually-exclusive, collective-exhaustive way, enabling divide-and-conquer global summarization.

2.5 Graph Communities → Community Summaries

The next step is to create report-like summaries of each community in the Leiden hierarchy, using a method designed to scale to very large datasets. These summaries are independently useful in their own right as a way to understand the global structure and semantics of the dataset, and may themselves be used to make sense of a corpus in the absence of a question. For example, a user may scan through community summaries at one level looking for general themes of interest, then follow links to the reports at the lower level that provide more details for each of the subtopics.
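A sketch of the summary-generation pass over such a hierarchy, with the LLM call stubbed out and a two-level toy hierarchy standing in for the Leiden output:

```python
# Generate a report-like summary per community at each hierarchy level
# (the LLM prompt is stubbed; the hierarchy and members are illustrative).
hierarchy = {
    0: [{"Alice", "Bob", "Carol", "Dave", "Erin"}],    # root-level community
    1: [{"Alice", "Bob", "Carol"}, {"Dave", "Erin"}],  # sub-communities
}

def summarize_community_stub(members):
    # Stand-in for the LLM summarization prompt over element summaries.
    return (f"Report on {len(members)} closely-related entities: "
            + ", ".join(sorted(members)))

community_summaries = {
    level: [summarize_community_stub(c) for c in parts]
    for level, parts in hierarchy.items()
}
print(community_summaries[1][1])
```

Because communities at one level are disjoint, the inner list comprehension could be parallelized, which is the modularity advantage the paper highlights.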
Here, however, we focus on their utility as part of a graph-based index used for answering global queries. Community summaries are generated bottom-up: leaf-level communities are summarized from the element summaries of their nodes, edges, and covariates, while higher-level communities fall back on the shorter summaries of their sub-communities when the element summaries alone would exceed the context window.

2.6 Community Summaries → Community Answers → Global Answer

Given a user query, the community summaries generated in the previous step can be used to generate a final answer in a multi-stage process. The hierarchical nature of the community structure also means that questions can be answered using the community summaries from different levels, raising the question of whether a particular level in the hierarchical community structure offers the best balance of summary detail and scope for general sensemaking questions (evaluated in section 3).

For a given community level, the global answer to any user query is generated as follows:
>> Prepare community summaries. Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.
>> Map community answers. Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0-100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.
>> Reduce to global answer. Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached.
This final context is used to generate the global answer returned to the user.

3 Evaluation

3.1 Datasets

We selected two datasets in the one million token range, each equivalent to about 10 novels of text and representative of the kind of corpora that users may encounter in their real world activities:
>> Podcast transcripts. Compiled transcripts of podcast conversations between Kevin Scott, Microsoft CTO, and other technology leaders (Behind the Tech, Scott, 2024). Size: 1669 × 600-token text chunks, with 100-token overlaps between chunks (∼1 million tokens).
>> News articles. Benchmark dataset comprising news articles published from September 2013 to December 2023 in a range of categories, including entertainment, business, sports, technology, health, and science (MultiHop-RAG; Tang and Yang, 2024). Size: 3197 × 600-token text chunks, with 100-token overlaps between chunks (∼1.7 million tokens).

3.2 Queries

Many benchmark datasets for open-domain question answering exist, including HotPotQA (Yang et al., 2018), MultiHop-RAG (Tang and Yang, 2024), and MT-Bench (Zheng et al., 2024). However, the associated question sets target explicit fact retrieval rather than summarization for the purpose of data sensemaking, i.e., the process through which people inspect, engage with, and contextualize data within the broader scope of real-world activities (Koesten et al., 2021).
Similarly, methods for extracting latent summarization queries from source texts also exist (Xu and Lapata, 2021), but such extracted questions can target details that betray prior knowledge of the texts.

To evaluate the effectiveness of RAG systems for more global sensemaking tasks, we need questions that convey only a high-level understanding of dataset contents, and not the details of specific texts. We used an activity-centered approach to automate the generation of such questions: given a short description of a dataset, we asked the LLM to identify N potential users and N tasks per user, then for each (user, task) combination, we asked the LLM to generate N questions that require understanding of the entire corpus. For our evaluation, a value of N = 5 resulted in 125 test questions per dataset. Table 1 shows example questions for each of the two evaluation datasets.

3.3 Conditions

We compare six different conditions in our analysis, including Graph RAG using four levels of graph communities (C0, C1, C2, C3), a text summarization method applying our map-reduce approach directly to source texts (TS), and a naïve "semantic search" RAG approach (SS).

3.4 Metrics

LLMs have been shown to be good evaluators of natural language generation, achieving state-of-the-art or competitive results compared against human judgements (Wang et al., 2023a; Zheng et al., 2024).
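Win rates of the kind reported later in section 3.6 come from exactly this sort of pairwise tally. In this sketch the judge is a stub that simply prefers longer answers, standing in for the LLM-as-a-judge prompt, and the answer lists are fabricated:

```python
# Tally head-to-head win rates between two conditions with a stubbed judge.
def judge_stub(answer_a, answer_b):
    # Stand-in preference: longer answer "wins" (the real judge is an LLM
    # prompted with a per-metric rubric such as comprehensiveness).
    if len(answer_a) == len(answer_b):
        return "tie"
    return "a" if len(answer_a) > len(answer_b) else "b"

answers_a = ["broad, multi-theme answer", "another wide-ranging answer", "x"]
answers_b = ["short answer", "terse", "also terse but longer"]

wins = sum(judge_stub(a, b) == "a" for a, b in zip(answers_a, answers_b))
win_rate = 100 * wins / len(answers_a)
print(f"{win_rate:.0f}%")
```

The paper additionally averages over repeated judgements and shuffles answer order to reduce position bias in the comparison prompt.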
While this approach can generate reference-based metrics when gold standard answers are known, it is also capable of measuring the qualities of generated texts (e.g., fluency) in a reference-free style (Wang et al., 2023a) as well as in head-to-head comparison of competing outputs (LLM-as-a-judge, Zheng et al., 2024). LLMs have also shown promise at evaluating the performance of conventional RAG systems, automatically evaluating qualities like context relevance, faithfulness, and answer relevance (RAGAS, Es et al., 2023).

3.6 Results

The indexing process resulted in a graph consisting of 8564 nodes and 20691 edges for the Podcast dataset, and a larger graph of 15754 nodes and 19520 edges for the News dataset. Table 3 shows the number of community summaries at different levels of each graph community hierarchy.

Global approaches vs. naïve RAG. As shown in Figure 4, global approaches consistently outperformed the naïve RAG (SS) approach in both comprehensiveness and diversity metrics across datasets. Specifically, global approaches achieved comprehensiveness win rates between 72-83% for Podcast transcripts and 72-80% for News articles, while diversity win rates ranged from 75-82% and 62-71% respectively. Our use of directness as a validity test also achieved the expected results, i.e., that naïve RAG produces the most direct responses across all comparisons.

Community summaries vs. source texts.
When comparing community summaries to source texts using Graph RAG, community summaries generally provided a small but consistent improvement in answer comprehensiveness and diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text summarization: for low-level community summaries (C3), Graph RAG required 26-33% fewer context tokens, while for root-level community summaries (C0), it required over 97% fewer tokens. For a modest drop in performance compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity (62% win rate) over naïve RAG.

Empowerment. Empowerment comparisons showed mixed results for both global approaches versus naïve RAG (SS) and Graph RAG approaches versus source text summarization (TS). Ad-hoc LLM use to analyze LLM reasoning for this measure indicated that the ability to provide specific examples, quotes, and citations was judged to be key to helping users reach an informed understanding.
Tuning element extraction prompts may help to retain more of these details in the Graph RAG index.

4 Related Work

4.1 RAG Approaches and Systems

When using LLMs, RAG involves first retrieving relevant information from external data sources, then adding this information to the context window of the LLM along with the original query (Ram et al., 2023). Naïve RAG approaches (Gao et al., 2023) do this by converting documents to text, splitting text into chunks, and embedding these chunks into a vector space in which similar positions represent similar semantics. Queries are then embedded into the same vector space, with the text chunks of the nearest k vectors used as context. More advanced variations exist, but all solve the problem of what to do when an external dataset of interest exceeds the LLM's context window.

Advanced RAG systems include pre-retrieval, retrieval, and post-retrieval strategies designed to overcome the drawbacks of naïve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future generation cycles, while our parallel generation of community answers from these summaries is a kind of iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation strategy.
Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings (RAPTOR, Sarthi et al., 2024) or generating a "tree of clarifications" to answer multiple interpretations of ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the kind of self-generated graph index that enables Graph RAG.

4.2 Graphs and LLMs

Use of graphs in connection with LLMs and RAG is a developing research area, with multiple directions already established. These include using LLMs for knowledge graph creation (Trajanoska et al., 2023) and completion (Yao et al., 2023), as well as for the extraction of causal graphs (Ban et al., 2023; Zhang et al., 2024) from source texts.
They also include forms of advanced RAG (Gao et al., 2023) where the index is a knowledge graph (KAPING, Baek et al., 2023), where subsets of the graph structure (G-Retriever, He et al., 2024) or derived graph metrics (Graph-ToolFormer, Zhang, 2023) are the objects of enquiry, where narrative outputs are strongly grounded in the facts of retrieved subgraphs (SURGE, Kang et al., 2023), where retrieved event-plot subgraphs are serialized using narrative templates (FABULA, Ranade and Joshi, 2023), and where the system supports both creation and traversal of text-relationship graphs for multi-hop question answering (Wang et al., 2023b). In terms of open-source software, a variety of graph databases are supported by both the LangChain (LangChain, 2024) and LlamaIndex (LlamaIndex, 2024) libraries, while a more general class of graph-based RAG applications is also emerging, including systems that can create and reason over knowledge graphs in both Neo4J (NaLLM, Neo4J, 2024) and NebulaGraph (GraphRAG, NebulaGraph, 2024) formats. Unlike our Graph RAG approach, however, none of these systems use the natural modularity of graphs to partition data for global summarization.

5 Discussion

Limitations of evaluation approach. Our evaluation to date has only examined a certain class of sensemaking questions for two corpora in the region of 1 million tokens.
More work is needed to understand how performance varies across different ranges of question types, data types, and dataset sizes, as well as to validate our sensemaking questions and target metrics with end users. Comparison of fabrication rates, e.g., using approaches like SelfCheckGPT (Manakul et al., 2023), would also improve on the current analysis.

Trade-offs of building a graph index. We consistently observed Graph RAG achieve the best head-to-head results against other methods, but in many cases the graph-free approach to global summarization of source texts performed competitively. The real-world decision about whether to invest in building a graph index depends on multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value obtained from other aspects of the graph index (including the generic community summaries and the use of other graph-related RAG approaches).

Future work. The graph index, rich text annotations, and hierarchical community structure supporting the current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports before employing our map-reduce summarization mechanisms.
This “roll-up” operation could also be extended across more levels of the community hierarchy, as well as implemented as a more exploratory “drill down” mechanism that follows the information scent contained in higher-level community summaries.

6 Conclusion

We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. Initial evaluations show substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce source text summarization. For situations requiring many global queries over the same dataset, summaries of root-level communities in the entity-based graph index provide a data index that is both superior to naïve RAG and achieves competitive performance to other global methods at a fraction of the token cost.

An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.
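The pipeline the paper describes — extract an entity graph, partition it into communities, summarize each community, then map-reduce community-level partial answers into one global answer — can be sketched in a few dozen lines. This is a minimal illustrative sketch, not the authors' implementation: `llm` is a stub standing in for a real model call, and community detection is approximated here by connected components (the paper uses the Leiden algorithm on a weighted graph).

```python
from collections import defaultdict

def llm(prompt: str) -> str:
    # Stand-in for a real LLM call; a deployment would query a model here.
    return prompt

def build_entity_graph(triples):
    """Adjacency map built from (head, relation, tail) triples,
    as would be extracted from text chunks by an LLM."""
    adj = defaultdict(set)
    for head, _relation, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj

def detect_communities(adj):
    """Group entities into communities. Connected components are a
    minimal stand-in for the Leiden partitioning used in the paper."""
    seen, communities = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        communities.append(comp)
    return communities

def answer_global_query(adj, triples, query):
    """Map: one partial answer per community summary. Reduce: combine
    the partial answers into a single global answer."""
    partials = []
    for comp in detect_communities(adj):
        facts = "; ".join(f"{h} {r} {t}" for h, r, t in triples
                          if h in comp and t in comp)
        summary = llm(f"Summarize: {facts}")
        partials.append(llm(f"Answer '{query}' from: {summary}"))
    return llm(f"Combine into one answer to '{query}': " + " | ".join(partials))

# Hypothetical toy corpus: two unrelated entity clusters.
triples = [
    ("Alice", "works_at", "Acme"), ("Acme", "based_in", "Berlin"),
    ("Carol", "studies_at", "MIT"), ("MIT", "located_in", "Boston"),
]
adj = build_entity_graph(triples)
print(len(detect_communities(adj)))  # 2: the two clusters are disjoint
```

The key property the conclusion highlights falls out of this structure: community summaries are computed once at indexing time, so many global queries can reuse them, and the map step over communities parallelizes naturally.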