Canonical Data File Format

From GM-RKB

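A Canonical Data File Format is a data file format that serves as a single, standard representation for captured documents or datasets, so that disparate clients can read and write them without format conversion.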



References

2017

  • (Big Data Patterns, 2017) ⇒ http://www.bigdatapatterns.org/design_patterns/canonical_data_format Retrieved on 2017-05-28
    • Canonical Data Format: A canonical and extensible serialization format is chosen to save data such that disparate clients are able to read and write data. This saves from having to perform any data format conversion or keeping multiple copies of a dataset in different formats. The canonical serialization format is generally based on a schema-driven format that provides information about the structure of the data.

      A dataset is serialized into a common format that is then consumed by three disparate clients without the need to perform any data format conversion.
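
      The pattern quoted above can be illustrated with a minimal Python sketch. It is not taken from the cited source: the schema layout, the field names (`sensor_id`, `celsius`), and the three client functions are hypothetical, and JSON stands in for whichever schema-driven serialization format a real system would choose. The dataset is written once, together with its schema, and each client reads that same payload directly.

```python
import json

# Hypothetical schema for illustration; names and types are assumptions.
SCHEMA = {"name": "Reading", "fields": {"sensor_id": "str", "celsius": "float"}}
_TYPES = {"str": str, "float": float, "int": int}


def serialize(records, schema=SCHEMA):
    """Validate records against the schema, then emit one canonical JSON payload."""
    for rec in records:
        if set(rec) != set(schema["fields"]):
            raise ValueError(f"fields do not match schema: {sorted(rec)}")
        for field, type_name in schema["fields"].items():
            if not isinstance(rec[field], _TYPES[type_name]):
                raise TypeError(f"{field} must be of type {type_name}")
    # The schema travels with the data, so readers can discover its structure.
    return json.dumps({"schema": schema, "records": records}, sort_keys=True)


def deserialize(payload):
    """Return (schema, records) from a canonical payload."""
    doc = json.loads(payload)
    return doc["schema"], doc["records"]


# Three disparate (hypothetical) clients consume the same payload unchanged.
def reporting_client(payload):
    _, records = deserialize(payload)
    return len(records)  # number of records


def analytics_client(payload):
    _, records = deserialize(payload)
    return sum(r["celsius"] for r in records) / len(records)  # mean temperature


def catalog_client(payload):
    schema, _ = deserialize(payload)
    return sorted(schema["fields"])  # field names, read from the embedded schema


payload = serialize([
    {"sensor_id": "a1", "celsius": 20.0},
    {"sensor_id": "b2", "celsius": 22.0},
])
```

      Because every client parses the same payload, no per-client copy or conversion step is needed; a production system would more likely rely on an established schema-driven format than on this hand-rolled layout.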

2009

  • (Bloechle et al., 2009) ⇒ Bloechle, J. L., Lalanne, D., & Ingold, R. (2009, July). OCD: an optimized and canonical document format. In Document Analysis and Recognition, 2009. ICDAR'09. 10th International Conference on (pp. 236-240). IEEE.
    • Abstract: Revealing and being able to manipulate the structured content of PDF documents is a difficult task, requiring pre-processing and reverse engineering techniques. In this paper, we present OCD, an optimized, easy-to-process and canonical format for representing structured electronic documents. The system and methods used for reverse engineering PDF documents into the OCD format are presented as well as the techniques to optimize it. We finally expose concrete evaluations of our OCD format compactness and restructuring performances.

2006

  • (Bloechle et al., 2006) ⇒ Bloechle, J. L., Rigamonti, M., Hadjar, K., Lalanne, D., & Ingold, R. (2006, February). XCDF: a canonical and structured document format. In International Workshop on Document Analysis Systems (pp. 141-152). Springer Berlin Heidelberg. DOI: 10.1007/11669487_13
    • Abstract: Accessing the structured content of PDF document is a difficult task, requiring pre-processing and reverse engineering techniques. In this paper, we first present different methods to accomplish this task, which are based either on document image analysis, or on electronic content extraction. Then, XCDF, a canonical format with well-defined properties is proposed as a suitable solution for representing structured electronic documents and as an entry point for further researches and works. The system and methods used for reverse engineering PDF document into this canonical format are also presented. We finally present current applications of this work into various domains, spacing from data mining to multimedia navigation, and consistently benefiting from our canonical format in order to access PDF document content and structures.