PDF Table Extraction Task

A PDF Table Extraction Task is a PDF information extraction task and a table extraction task that locates tables in PDF files and converts them into more accessible and editable formats, such as spreadsheets or databases.

  • Context:
    • It can be performed manually by copying and pasting, which is often tedious and prone to errors, or automated using specialized PDF Table Extraction Systems.
    • It can involve challenges such as varied table structures, merged cells, rotated text, and poor-quality scans.
    • It often requires recognizing table structure, including rows, columns, and headers.
    • When dealing with scanned PDFs, OCR (Optical Character Recognition) Systems may be used to convert the image-based content into selectable text before extraction.
    • It can be useful in various domains such as finance, research, data analytics, and journalism, where PDFs are a common format for reports and publications.
    • ...
  • Example(s):
    • Extracting financial tables from annual PDF reports to perform data analysis.
    • Retrieving tables from scientific PDF papers for research data aggregation.
    • Converting government PDF reports into structured datasets for journalistic investigation.
    • ...
  • Counter-Example(s):
    • Extracting plain text data, not in table format, from a PDF document.
    • Converting a PDF document into an image file.
    • ...
  • See: Data Extraction, Tabular Data.
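The structure-recognition step mentioned in the context above (recovering rows and columns, then exporting to an editable format) can be illustrated with a minimal sketch. It assumes that text chunks have already been read from a PDF page as hypothetical (x, y, text) tuples, one per cell, with y increasing down the page; the function names and the tolerances y_tol and x_tol are illustrative choices, not part of any particular PDF Table Extraction System. Merged cells, rotated text, and multi-line cells are not handled.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical input type: (left x, baseline y, text) for one cell-sized chunk.
Chunk = Tuple[float, float, str]


def group_rows(chunks: List[Chunk], y_tol: float = 3.0) -> List[List[Chunk]]:
    """Group chunks into rows: a chunk joins the current row when its y is
    within y_tol of that row's first chunk."""
    rows: List[List[Chunk]] = []
    for chunk in sorted(chunks, key=lambda c: (c[1], c[0])):
        if rows and abs(chunk[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(chunk)
        else:
            rows.append([chunk])
    # Order each row left to right.
    return [sorted(row, key=lambda c: c[0]) for row in rows]


def column_starts(rows: List[List[Chunk]], x_tol: float = 10.0) -> List[float]:
    """Infer column left edges by merging x positions closer than x_tol."""
    starts: List[float] = []
    for x in sorted(c[0] for row in rows for c in row):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_grid(chunks: List[Chunk], y_tol: float = 3.0, x_tol: float = 10.0) -> List[List[str]]:
    """Place every chunk in the column whose left edge is nearest to it."""
    rows = group_rows(chunks, y_tol)
    starts = column_starts(rows, x_tol)
    grid: List[List[str]] = []
    for row in rows:
        cells = [""] * len(starts)
        for x, _, text in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        grid.append(cells)
    return grid


def to_csv(grid: List[List[str]]) -> str:
    """Serialize the recovered grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()
```

For example, calling to_grid on chunks for a two-column table with slightly misaligned coordinates, then passing the result to to_csv, yields one CSV line per recovered row. Real documents usually need more robust clustering than fixed tolerances.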


References

2018

  • (Shigarov et al., 2018) ⇒ Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov, Evgeniy Cherkashin. (2018). "TabbyPDF: Web-Based System for PDF Table Extraction". In: International Conference on Information and Software Technologies (ICIST 2018). Springer.
    • ABSTRACT: PDF is one of the most widespread ways to represent non-editable documents. Many of PDF documents are machine-readable but remain untagged. They have no tags for identifying layout items such as paragraphs, columns, or tables. One of the important challenges with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g. in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It mainly relies on recovering a human reading order of text, including document paragraphs and table cells. A prototype of the system was evaluated, using the methodology and dataset of “ICDAR 2013 Table Competition”. The standard metric F-score is 93.64% for the structure recognition phase and 83.18% for the table extraction with automatic table detection. The results are comparable with the state-of-the-art academic solutions.
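
The F-score reported in the abstract above is the harmonic mean of precision and recall. The sketch below shows that calculation over sets of items; representing the items as pairs of neighboring cell texts is an assumption made for illustration, and the function name f_score is hypothetical. It is not the evaluation code used by the cited authors.

```python
from typing import Hashable, Set


def f_score(predicted: Set[Hashable], expected: Set[Hashable]) -> float:
    """Return the F1 score of a predicted set against a ground-truth set."""
    if not predicted or not expected:
        return 0.0
    correct = len(predicted & expected)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predictions that are right
    recall = correct / len(expected)      # share of ground truth that was found
    return 2 * precision * recall / (precision + recall)
```

With four ground-truth items and four predictions of which three are correct, precision and recall are both 0.75, so the F-score is 0.75.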