Long-Document MapReduce-based Summarization Algorithm

From GM-RKB

A Long-Document MapReduce-based Summarization Algorithm is an LLM-based long document summarization algorithm that employs the MapReduce programming model.



References

2024

  • "Text Summarization of Large Documents using LangChain." Github Notebook.
    • NOTES:
      • The MapReduce method summarizes an extensive text in multiple stages: it breaks the document into smaller chunks, summarizes each chunk, and then combines the chunk summaries into a unified summary, which makes it well suited to large documents.
      • The notebook uses LangChain's MapReduceDocumentsChain through the load_summarize_chain function, with chain_type set to map_reduce, to manage and summarize large pieces of text.
      • For a 32-page document, the map_reduce chain segments the document into chunks of at most 1,024 tokens and applies an initial (map) prompt to each chunk to generate an individual summary.
      • The map prompt used for initial summarization of each chunk: {text}. BULLET POINT SUMMARY:
      • Following the generation of chunk summaries, a combine prompt merges these into a comprehensive document summary: ```Write a summary of the entire document that includes the main points from all of the individual summaries.```
      • Prompts are defined using PromptTemplate with specified templates for mapping and combining phases to guide the summarization process.
      • The map_reduce_chain is initialized with parameters including vertex_llm_text, chain_type, map_prompt, combine_prompt, and return_intermediate_steps set to True.
      • Summaries are generated by running the map_reduce_chain over the input documents, which are tokenized with a default limit of 1,024 tokens.
      • The results are organized and validated through a Pandas DataFrame, listing input documents alongside their corresponding summaries, facilitating easy review and analysis.
      • This method overcomes the context-length limitation of the stuffing method and allows chunks to be processed in parallel, though it requires multiple calls to the model and risks losing context between document sections.
      • Despite potential context loss, the MapReduce method provides a scalable solution for summarizing large documents efficiently.
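
The map and combine stages described in the notes above can be sketched in plain Python without LangChain. This is a minimal illustration under stated assumptions, not the notebook's implementation: the `llm` argument is any hypothetical callable that maps a prompt string to a completion string, `toy_llm` is a stand-in for a real model, chunking counts whitespace-separated words as an approximation of tokens, and both templates are assumed forms (the map template approximates the fragment quoted above, and the `{text}` slot appended to the combine instruction is an addition for illustration).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed templates: the map template approximates the fragment quoted in the
# notes; the {text} slot appended to the combine instruction is hypothetical.
MAP_TEMPLATE = "{text}\nBULLET POINT SUMMARY:"
COMBINE_TEMPLATE = (
    "Write a summary of the entire document that includes the main points "
    "from all of the individual summaries.\n{text}"
)


def split_into_chunks(text: str, max_tokens: int = 1024) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words stand in for tokens; a real tokenizer counts differently.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def map_reduce_summarize(
    text: str,
    llm: Callable[[str], str],
    max_tokens: int = 1024,
) -> Dict[str, object]:
    """Summarize each chunk independently (map), then merge the results (reduce)."""
    chunks = split_into_chunks(text, max_tokens)

    # Map step: chunk summaries do not depend on each other, so the calls can
    # run in parallel. pool.map preserves the order of the input chunks.
    with ThreadPoolExecutor() as pool:
        intermediate = list(
            pool.map(lambda chunk: llm(MAP_TEMPLATE.format(text=chunk)), chunks)
        )

    # Reduce step: one further model call over the concatenated chunk summaries.
    final = llm(COMBINE_TEMPLATE.format(text="\n".join(intermediate)))

    # Key names mirror the idea of returning intermediate steps with the result.
    return {"intermediate_steps": intermediate, "output_text": final}


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: returns the first five words of the prompt."""
    return " ".join(prompt.split()[:5])
```

Calling `map_reduce_summarize(document_text, toy_llm)` returns the per-chunk summaries alongside the final summary, which makes it possible to inspect where context may have been lost between sections. The total number of model calls is the number of chunks plus one, which is the cost the notes refer to.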

2023