PDF Table Extraction System

A PDF Table Extraction System is a PDF file processing system that can solve a PDF table extraction task (extracting tables from PDF files into more accessible and editable formats).

  • Context:
    • It can vary in complexity, from simple tools with limited features to more advanced systems that can handle complex table structures and formatting.
    • It can sometimes be incorporated as a feature within broader PDF editing or document management software.
    • It can be particularly useful for dealing with documents where copying and pasting is not viable due to formatting issues.
    • It can work with Optical Character Recognition (OCR) Systems to handle scanned, image-based PDF files.
    • ...
  • Example(s):
    • Tabula, an open-source tool that allows for the extraction of tables from text-based PDF files into CSV or Microsoft Excel format.
    • Adobe Acrobat, which provides tools for exporting tables from PDFs to Excel.
    • Able2Extract, a PDF converter tool that can extract PDF tables to Excel.
    • ...
  • Counter-Example(s):
  • See: PDF, Data Extraction, Tabular Data, Optical Character Recognition, CSV, Microsoft Excel.
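The final step such a system performs, writing recovered table cells to an editable format such as CSV, can be illustrated with a minimal sketch. It assumes the table has already been recovered as a list of rows of cell strings; the function name `table_to_csv` and the sample data are hypothetical and not taken from any of the tools listed above.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize rows (each a list of cell strings) to CSV text."""
    # Widest row determines the column count (0 for an empty table).
    width = max((len(r) for r in rows), default=0)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Pad short rows so every record has the same number of columns.
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buf.getvalue()


# Hypothetical cells recovered from a PDF table; the last row is missing a cell.
rows = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.50"],
    ["Bolt", "10"],
]
csv_text = table_to_csv(rows)
print(csv_text)
```

The `csv` module quotes the cell that contains a comma, which is one reason to prefer it over joining strings by hand.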


References

2021

  • (Fayyaz et al., 2021) ⇒ Fayyaz, N., Khusro, S., & Ullah, S. (2021). Accessibility of tables in PDF documents. Information Technology and Libraries, 40(2).
    • QUOTE: ... People access and share information over the web and in other digital environments, including digital libraries, in the form of documents such as books, articles, technical reports ...

2018

  • (Shigarov et al., 2018) ⇒ Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov, Evgeniy Cherkashin. (2018). "TabbyPDF: Web-Based System for PDF Table Extraction". In: International Conference on Information Science and Technology. Springer.
    • QUOTE: ... Table extraction as a part of table understanding [6] includes ... Many table extraction methods traditionally deal with only document ... Several methods and tools for PDF table extraction are ...
    • We develop TabbyPDF, a novel web-based system for PDF table extraction from machine-readable untagged documents. This extends our previous work for table structure recognition [23]. The system exploits a set of customizable ad-hoc heuristics for table detection and cell structure reconstruction based on features of text and ruling lines presented in PDF documents. Most of them such as horizontal and vertical distances, fonts, and rulings are well known and used in the existing methods. Additionally, we propose to exploit the feature of appearance of text printing instruction in PDF files and positions of a drawing cursor. We also demonstrate experimental results based on the existing competition dataset, “ICDAR 2013 Table Competition”. The standard metric F-score is 93.64% for the structure recognition phase and 83.18% for the table extraction with automatic table detection. The results are comparable with the state-of-the-art academic solutions.
    • ...The process of PDF table extraction involves the following phases:
      1. data preparation, to recover text blocks presented words and ruling lines from instructions of a source PDF document;
      2. text line and paragraph extraction, to recover text blocks presented lines and paragraphs;
      3. table detection, to recover a bounding box of each table located on a page;
      4. table structure recognition, to recover a cell structure of a detected table.
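The phases listed above can be sketched in a much-simplified form. This is not TabbyPDF's method: it assumes phase 1 has already produced words with coordinates, ignores ruling lines and fonts, and uses only horizontal gaps as a heuristic. The `Word` class, the function names, the thresholds `y_tol` and `gap`, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position, assumed to be measured from the page top


def group_into_lines(words, y_tol=2.0):
    """Phase 2 (simplified): group words with nearly equal y into text lines."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y, w.x0)):
        if lines and abs(lines[-1][0].y - w.y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w.x0) for line in lines]


def split_into_cells(line, gap=15.0):
    """Phase 4 (simplified): start a new cell wherever the horizontal gap is wide."""
    cells = [[line[0]]]
    for prev, cur in zip(line, line[1:]):
        if cur.x0 - prev.x1 > gap:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(w.text for w in cell) for cell in cells]


def extract_table(words, gap=15.0):
    """Phase 3 (simplified): keep the longest run of consecutive multi-cell lines."""
    best, run = [], []
    for line in group_into_lines(words):
        cells = split_into_cells(line, gap)
        if len(cells) >= 2:
            run.append(cells)
            if len(run) > len(best):
                best = list(run)
        else:
            run = []  # a single-cell line (e.g. a heading) ends the current run
    return best


# Hypothetical output of phase 1: a heading followed by a three-row table.
words = [
    Word("Quarterly", 50, 100, 10), Word("report", 105, 140, 10),
    Word("Region", 50, 90, 30), Word("Sales", 150, 180, 30),
    Word("North", 50, 80, 42), Word("America", 84, 125, 42), Word("120", 150, 170, 42),
    Word("Europe", 50, 90, 54), Word("95", 150, 165, 54),
]
table = extract_table(words)
for row in table:
    print(row)
```

A real system would combine several such cues (rulings, fonts, vertical distances) rather than rely on a single gap threshold, which fails on tables with narrow column spacing.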