- (Panov, 2012) ⇒ Panče Panov. (2012). “A Modular Ontology of Data Mining”. Doctoral Dissertation.
The domain of data mining (DM) deals with analyzing different types of data. The data typically used in data mining is in the format of a single table, with primitive datatypes as attributes. However, structured (complex) data, such as graphs, sequences, networks, text, image, multimedia and relational data, are receiving an increasing amount of interest in data mining. A major challenge is to treat and represent the mining of different types of structured data in a uniform fashion. A theoretical framework that unifies different data mining tasks on different types of data can help to formalize the knowledge about the domain and provide a base for future research, unification and standardization. A second important challenge is the automation and overall support of the Knowledge Discovery in Databases (KDD) process. A formalization of the domain of data mining addresses both challenges. It can directly support the development of a general framework for data mining, support the representation of the process of mining structured data, and allow the representation of the complete process of knowledge discovery.
In this thesis, we propose a reference modular ontology for the domain of data mining, OntoDM, directly motivated by the need for formalization of the data mining domain. The OntoDM ontology is designed and implemented by following ontology best practices and design principles. Its distinguishing feature is that it uses Basic Formal Ontology (BFO) as an upper-level ontology and a template, uses a set of formally defined relations from Relational Ontology (RO) and other state-of-the-art ontologies, and reuses classes and relations from the Ontology of Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), and the Software Ontology (SWO). This ensures compatibility and connections with other ontologies and allows cross-domain reasoning. The OntoDM ontology is composed of three modules covering different aspects of data mining: OntoDT, which supports the representation of knowledge about datatypes and is based on an accepted ISO standard for datatypes in computer systems; OntoDM-core, which formalizes the key data mining entities for representing the mining of structured data in the context of a general framework for data mining; and OntoDM-KDD, which formalizes the knowledge discovery process based on the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.
The OntoDT module provides a representation of the datatype entity, defines a taxonomy of datatype characterizing operations, and a taxonomy of datatype qualities. Furthermore, it defines a datatype taxonomy comprising classes and instances of primitive datatypes, generated datatypes (non-aggregate and aggregated datatypes), subtypes, and defined datatypes. With this structure, the module provides a generic mechanism for representing arbitrarily complex datatypes.
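The idea of building arbitrarily complex datatypes from primitive and aggregate ones can be illustrated with a minimal Python sketch. This is not how OntoDT itself is implemented (it is an ontology, not a code library); the class names `PrimitiveDatatype` and `AggregateDatatype`, the `generator` labels, and the helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class PrimitiveDatatype:
    """A datatype with no internal structure, e.g. boolean or real."""
    name: str


@dataclass(frozen=True)
class AggregateDatatype:
    """A generated datatype built from component datatypes."""
    generator: str                      # hypothetical label, e.g. "record" or "sequence"
    components: Tuple["Datatype", ...]  # components may themselves be aggregates


Datatype = Union[PrimitiveDatatype, AggregateDatatype]


def describe(dt: Datatype) -> str:
    """Render a datatype as a nested expression."""
    if isinstance(dt, PrimitiveDatatype):
        return dt.name
    inner = ", ".join(describe(c) for c in dt.components)
    return f"{dt.generator}({inner})"


def depth(dt: Datatype) -> int:
    """Nesting depth: 0 for a primitive, 1 + deepest component otherwise."""
    if isinstance(dt, PrimitiveDatatype):
        return 0
    return 1 + max((depth(c) for c in dt.components), default=0)


real = PrimitiveDatatype("real")
boolean = PrimitiveDatatype("boolean")

# A record whose first field is a sequence of reals and whose second is a boolean.
example = AggregateDatatype(
    "record", (AggregateDatatype("sequence", (real,)), boolean)
)

print(describe(example))  # record(sequence(real), boolean)
```

Because aggregates accept other aggregates as components, the same two classes cover a time series label pair, a set of tuples, or deeper nestings without further definitions.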
The OntoDM-core module formalizes the key data mining entities needed for the representation of mining structured data in the context of a general framework for data mining. These include the entities dataset, data mining task, generalization, data mining algorithm, and others. More specifically, it provides a representation of datasets, and a taxonomy of datasets based on the type of data. Next, it provides a representation of data mining tasks, and proposes a taxonomy of data mining tasks, predictive modeling tasks and hierarchical classification tasks. Furthermore, it provides a representation for generalizations, and proposes a taxonomy of generalizations and predictive models based on the types of data and generalization language. Moreover, it provides a representation of data mining algorithms, proposes a taxonomy of data mining algorithms, predictive modeling algorithms, and hierarchical classification algorithms, and generalizes the mechanism for representing data mining algorithms to represent general algorithms in computer science. In addition, the OntoDM-core module provides a representation of constraints and constraint-based data mining tasks and proposes a taxonomy thereof. Finally, the module provides a representation of data mining scenarios that includes data mining scenarios as a specification, data mining workflows, and the process of executing a data mining workflow.
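How the core entities relate to one another can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the attribute names (`solves`, `produced_by`, `derived_from`), the string-valued `datatype` field, and the `execute` function are hypothetical and are not the relations or classes defined in OntoDM-core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    datatype: str  # simplified: the type of data items, given as a label


@dataclass(frozen=True)
class DataMiningTask:
    name: str
    datatype: str  # the type of data the task is defined on


@dataclass(frozen=True)
class DataMiningAlgorithm:
    name: str
    solves: DataMiningTask


@dataclass(frozen=True)
class Generalization:
    kind: str                         # e.g. "predictive model"
    produced_by: DataMiningAlgorithm
    derived_from: Dataset


def execute(algorithm: DataMiningAlgorithm, dataset: Dataset, kind: str) -> Generalization:
    """Record that running an algorithm on a dataset yields a generalization.

    Raises ValueError when the dataset's datatype does not match the datatype
    of the task the algorithm solves.
    """
    if algorithm.solves.datatype != dataset.datatype:
        raise ValueError(
            f"{algorithm.name} solves a task on {algorithm.solves.datatype!r}, "
            f"but {dataset.name} contains {dataset.datatype!r}"
        )
    return Generalization(kind, algorithm, dataset)


task = DataMiningTask("classification", "tuple of primitives")
algo = DataMiningAlgorithm("decision tree induction", task)
table = Dataset("toy table", "tuple of primitives")

model = execute(algo, table, "predictive model")
print(model.produced_by.name, "->", model.kind)
```

Keying the task on a datatype is what lets one description cover both single-table data and structured data: only the `datatype` value changes, not the shape of the entities.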
The OntoDM-KDD module supports the representation of data mining investigations. It provides a representation of data mining investigation by directly extending classes from the OBI and IAO ontologies. Furthermore, it models each of the phases in a data mining investigation (such as application understanding, data understanding, data preparation, modeling, DM process evaluation, and deployment), and their inputs and outputs.
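The phases and their inputs and outputs can be pictured as a simple chain in which each phase consumes artifacts produced earlier. The sketch below uses the phase names from the text; the artifact names (such as "goals" or "prepared dataset"), the strictly linear order, and the `missing_inputs` helper are assumptions made for illustration and are not taken from OntoDM-KDD.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Phase:
    name: str
    inputs: Tuple[str, ...]   # hypothetical artifact names consumed by the phase
    outputs: Tuple[str, ...]  # hypothetical artifact names produced by the phase


PHASES = [
    Phase("application understanding", (), ("goals",)),
    Phase("data understanding", ("goals",), ("data description",)),
    Phase("data preparation", ("data description",), ("prepared dataset",)),
    Phase("modeling", ("prepared dataset",), ("model",)),
    Phase("DM process evaluation", ("model", "goals"), ("evaluation report",)),
    Phase("deployment", ("model", "evaluation report"), ("deployed model",)),
]


def missing_inputs(phases: List[Phase]) -> List[Tuple[str, str]]:
    """Return (phase name, artifact) pairs for inputs no earlier phase produced."""
    available = set()
    problems = []
    for phase in phases:
        for item in phase.inputs:
            if item not in available:
                problems.append((phase.name, item))
        available.update(phase.outputs)
    return problems


print(missing_inputs(PHASES))      # [] : every input is produced upstream
print(missing_inputs(PHASES[1:]))  # skipping the first phase leaves "goals" unavailable
```

A check of this kind is one thing an explicit representation of phase inputs and outputs makes possible: an annotated investigation can be inspected for steps whose required artifacts were never produced.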
The OntoDM ontology and its three modules (OntoDT, OntoDM-core, and OntoDM-KDD) were evaluated in order to assess their quality. The evaluation was performed by assessing the ontology against a set of design principles and best practices, and assessing whether the competency questions posed in the design phase were implemented in the language of the ontology. In addition, we provided a domain coverage assessment by comparing the OntoDM data mining tasks taxonomy with a data mining topic ontology constructed in a semi-automatic fashion from abstracts of articles from data mining conferences and journals.
The developed ontology supports a large variety of applications. We demonstrate the use and the application of the ontology by describing six use cases. The OntoDM ontology is used for the annotation of data mining algorithms; for the representation of data mining scenarios; for the annotation of data mining investigations; in a cross-domain application, to support the ontology-based representation of QSAR modeling for drug discovery; as a mid-level ontology by the Expose ontology; and for the annotation of articles containing data mining terms in combination with text mining tools.
Three novelties distinguish the OntoDM ontology from related ontologies: it allows the representation of the mining of structured data and of the general process of data mining in a principled way; it is based on a theoretical ontological framework; and, because of this framework, it can be connected to other domain ontologies to support cross-domain applications. The OntoDM ontology is also the first ontology that supports the representation of the complete process of knowledge discovery.
In future developments of the OntoDM ontology, we plan to focus on several aspects. First, we would like to align and map our ontology to other upper-level ontologies. Second, we plan to extend the established ontological framework to represent components of data mining algorithms, such as distance functions and kernel functions. Next, we plan to populate the ontology downward with instances. Furthermore, we plan to extend the representational framework to represent experiments for mining structured data in the context of experiment databases. Finally, we plan to include more contributors from the domain of data mining in the development of OntoDM and to apply the OntoDM design principles to the development of ontologies for other areas of computer science.