2003 MLandDMforYeastFunctionalGenomics

Subject Headings: Hierarchical Decision Trees, C4.5H

Notes

Cited By

  • ~35

Quotes

Abstract

Hierarchical multilabel classification (HMC) is an extension of binary classification where an instance can be labelled with multiple classes that are organised in a hierarchy. A well-known application of this kind of problem is gene function prediction. A gene can have multiple functions at the same time, and these functions are hierarchically organised: a gene predicted to have a certain class should also be predicted to have all its superclasses, as given by the hierarchy. A straightforward approach to solve this problem would be to learn a binary classifier for each class separately and then to combine the predictions. However, this has several disadvantages: (1) learning is not very efficient, since a separate classifier has to be learned for each class, (2) binary classifiers have known problems with skewed class distributions and (3) the hierarchy constraint, implying that a class should be predicted along with all its superclasses, is not automatically fulfilled. The obvious alternative is to learn a single model that predicts all the different classes at once. In this paper we propose a method for learning decision trees that predicts for each instance a set of classes instead of a single class.

  • multi-label learning (where each example belongs to more than one possible class)
  • learning with both hierarchically-structured data and classes
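The hierarchy constraint from the abstract (a predicted class must be accompanied by all of its superclasses) can be stated directly as code. Below is a minimal Python sketch, assuming the hierarchy is given as a child-to-parent map and classes are simple strings; the function names and the FunCat-style toy hierarchy are illustrative, not taken from the thesis.

```python
def ancestors(cls, parent):
    """Yield every superclass of cls by walking up a child -> parent map."""
    while cls in parent:
        cls = parent[cls]
        yield cls

def enforce_hierarchy(predicted, parent):
    """Close a predicted label set under the hierarchy constraint:
    each predicted class brings in all of its superclasses."""
    closed = set(predicted)
    for cls in predicted:
        closed.update(ancestors(cls, parent))
    return closed

# Toy FunCat-style hierarchy: '1/1/3' is below '1/1', which is below '1'.
parent = {"1/1/3": "1/1", "1/1": "1"}
print(enforce_hierarchy({"1/1/3"}, parent))  # {'1/1/3', '1/1', '1'}
```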

Kaufman and Michalski (1996) present ideas for dealing with structured attributes, including the use of generalisation rules, and “anchor nodes” that allow the user to mark nodes “at preferable levels of abstraction” in the hierarchy. They apply their ideas in the INLEN-2 algorithm and discover simpler rules as a result.

Mitchell (1998) demonstrated that a hierarchical Bayesian classifier would have the same performance as a flat Bayesian classifier under certain assumptions: smoothing is not used to estimate the probabilities and the same features are used by different classifiers in the hierarchy.

Most work in this area has been done in relation to classifying large volumes of text documents, for example to create Yahoo-like topic hierarchies. The classification algorithms used in these text-processing applications tend to be either clustering methods or very simple statistical algorithms such as naive Bayes, applied to high volumes of data.

More recently, a few papers have been published that address the combined problem of both hierarchical classes and multiple labels. Wang et al. (2001) describe a method for classifying documents that is based on association rule mining.

Recently, Blockeel et al. (2002) have also designed an algorithm to tackle the problem of hierarchical multi-classification. They construct a type of decision tree called a “clustering tree” to do the classification, where the criterion for splitting a node is based on minimizing the intra-cluster variance of the data within the node. The distance measure used to calculate intra-cluster variance works on sets of class labels and takes into account both the class hierarchy and the multiple labels.
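The paragraph above leaves the distance measure abstract. One plausible concrete form, in the spirit of the weighted Euclidean distances used in later clustering-tree work on HMC, down-weights each class by its depth so that disagreements near the root of the hierarchy cost more. The sketch below is illustrative only; the weight w0, the path encoding of classes, and the function names are assumptions, not Blockeel et al.'s exact definition.

```python
import math

def depth(cls):
    """Depth of a class in a FunCat-style path encoding such as '1/1/3'
    (illustrative; a root class like '1' has depth 0 here)."""
    return cls.count("/")

def label_distance(a, b, classes, w0=0.75):
    """Weighted Euclidean distance between two label sets a and b over the
    given class vocabulary; a disagreement on class c contributes
    w0**depth(c), so differences high in the hierarchy cost more."""
    total = 0.0
    for c in classes:
        diff = (c in a) - (c in b)   # booleans coerce to 0/1
        total += (w0 ** depth(c)) * diff * diff
    return math.sqrt(total)

# Intra-cluster variance can then be taken as the mean squared distance
# between the label sets in a node and the node's prototype label set.
classes = ["1", "1/1", "1/1/3", "2"]
print(label_distance({"1", "1/1"}, {"1", "2"}, classes))
```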

To adapt C4.5 to make use of a class hierarchy several modifications are needed (the last two items are sketched in code after the list):

  • reading and storing the class hierarchy;
  • testing for membership of a class;
  • finding the best class or classes to represent a node;
  • performing entropy calculations.
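As a rough illustration of the last two items, one common way to generalise the entropy to multiple labels is to sum a binary entropy per class, with class membership tested hierarchically (an example labelled with a class also counts as a member of every superclass). The Python sketch below assumes FunCat-style path labels and a child-to-parent map; it is a plausible reading of the modifications, not a transcription of the thesis code.

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli variable with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def is_member(labels, cls, parent):
    """Hierarchical membership: True if any label equals cls
    or has cls as an ancestor in the child -> parent map."""
    for lab in labels:
        while True:
            if lab == cls:
                return True
            if lab not in parent:
                break
            lab = parent[lab]
    return False

def multilabel_entropy(examples, classes, parent):
    """Sum of per-class binary entropies over a (non-empty) node,
    with one term for each class in the hierarchy."""
    n = len(examples)
    return sum(
        binary_entropy(sum(is_member(ls, c, parent) for ls in examples) / n)
        for c in classes
    )

parent = {"1/1": "1"}
examples = [{"1/1"}, {"2"}, {"1/1", "2"}]
print(multilabel_entropy(examples, ["1", "1/1", "2"], parent))
```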

Original contributions to knowledge

This thesis makes the following original contributions to knowledge:

Computer Science:

  1. An extension of C4.5 to multi-label learning. The standard machine learning program C4.5 was extended to allow multiple class labels to be specified for each ORF. This is not common in the field of machine learning and would normally be handled by producing many classifiers, one per class.
  2. A distributed first order association mining algorithm. This is a version of the WARMR algorithm which is designed to allow processing to be distributed across a Beowulf cluster of computers without shared memory (the worker/farmer counting pattern is sketched after this list). This was necessary to process the volume of data associated with the yeast genome. PolyFARM is freely available for non-commercial use from http://www.aber.ac.uk/compsci/Research/bio/dss/polyfarm.
  3. An extension of C4.5 to hierarchical class learning. C4.5 was modified again to be able to use a class hierarchy. There has been little prior work on hierarchical learning except in conjunction with document clustering and simple Bayesian methods. Our version of C4.5 can now deal with problems of both multiple and hierarchical class labels, which is a common real-world problem.
  4. An extension of the distributed first order association mining algorithm which allows hierarchically structured attributes. This was a method of directly implementing the hierarchically structured attributes rather than having to explicitly list all ancestors for each item. This was necessary to reduce the size of our yeast database by allowing the hierarchy to become background knowledge and to avoid duplication. This should provide faster access to the data than the Prolog-style alternative of allowing the hierarchy to be specified and searched by recursive relations.
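To make contribution 2 concrete: the essential pattern that lets the mining run on a shared-nothing Beowulf cluster is that each worker counts the support of the current level's candidates on its own data partition, and a farmer process sums the partial counts and prunes infrequent candidates. The Python sketch below shows only this counting-and-merge pattern, with itemset containment standing in for first-order query evaluation; PolyFARM's actual representation and communication are not reproduced here, and all names are illustrative.

```python
from collections import Counter

def count_support(candidates, partition):
    """Worker step: count, on one data partition, how many examples
    satisfy each candidate (here: itemset containment)."""
    counts = Counter()
    for example in partition:
        for cand in candidates:
            if cand <= example:  # candidate itemset holds for this example
                counts[cand] += 1
    return counts

def merge_and_prune(partial_counts, min_support):
    """Farmer step: sum the workers' partial counts and keep only
    candidates whose total support reaches min_support."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {cand: n for cand, n in total.items() if n >= min_support}

# Two 'workers', each with its own partition; frozensets stand in for
# the first-order queries that WARMR/PolyFARM actually evaluate.
candidates = [frozenset({"a"}), frozenset({"a", "b"})]
partitions = [
    [frozenset({"a", "b"}), frozenset({"a"})],
    [frozenset({"a", "b", "c"})],
]
partials = [count_support(candidates, p) for p in partitions]
print(merge_and_prune(partials, min_support=2))
```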


Amanda Clare (2003). “Machine learning and data mining for yeast functional genomics.” Doctoral Dissertation. http://users.aber.ac.uk/afc/papers/AClarePhDThesis.pdf