2006 KernelBasedLearningOfHMCModels

From GM-RKB

Subject Headings: Kernel Methods, Hierarchical Classification, Text Categorization, Convex Optimization, Structured Outputs

Quotes

Abstract

  • We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms.
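The optimization idea in the abstract can be illustrated schematically. Below is a minimal sketch of one conditional-gradient (Frank-Wolfe style) ascent step restricted to a single example's marginal dual variables; the `linear_oracle` and `curvature` callables are assumed placeholders standing in for the paper's dynamic-programming search and its quadratic objective, not the authors' implementation.

```python
import numpy as np

def conditional_gradient_step(mu_i, grad_i, linear_oracle, curvature):
    """One conditional-gradient (Frank-Wolfe style) ascent step restricted to
    the marginal dual variables of a single training example.

    Assumed, not from the paper:
      - linear_oracle(grad_i) returns the vertex of the feasible set that
        maximizes the linearized objective (the paper uses dynamic
        programming over the Markov tree for this search).
      - curvature(d) returns the quadratic term d^T H d of the concave
        objective along direction d, enabling an exact line search.
    """
    vertex = linear_oracle(grad_i)      # best feasible corner point
    d = vertex - mu_i                   # feasible ascent direction
    num = float(grad_i @ d)             # linear gain along d
    den = float(curvature(d))           # curvature along d
    if num <= 0.0:                      # no ascent possible in this subspace
        return mu_i
    tau = 1.0 if den <= 0.0 else min(1.0, num / den)
    return mu_i + tau * d               # stays in the feasible set by convexity
```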

Introduction

  • In many application fields, taxonomies and hierarchies are natural ways to organize and classify objects, and hence they are widely used for tasks such as text classification. In contrast, machine learning research has largely focused on flat target prediction, where the output is a single binary or multivalued scalar variable. Naively encoding a large hierarchy either as a series of binary problems or as a single multiclass problem with many possible class values suffers from the fact that dependencies between the classes cannot be represented well. For example, if a news article belongs to category MUSIC, it is very likely that the article also belongs to category ENTERTAINMENT. The failure to represent these relationships leads to a steep decline in predictive accuracy as the number of possible categories grows. In recent years, methods that utilize the hierarchy in learning the classification have been proposed by several authors (Koller and Sahami, 1997; McCallum et al., 1998; Dumais and Chen, 2000). Very recently, new hierarchical classification approaches utilizing kernel methods have been introduced (Hofmann et al., 2003; Cai and Hofmann, 2004; Dekel et al., 2004). The main idea behind these methods is to map the documents (or document–labeling pairs) into a potentially high-dimensional feature space where linear maximum margin separation of the documents becomes possible.
  • Most of the above mentioned methods assume that the object to be classified belongs to exactly one (leaf) node in the hierarchy. In this paper we consider the more general case where a single object can be classified into several categories in the hierarchy; specifically, the multilabel is a union of partial paths in the hierarchy. For example, a news article about David and Victoria Beckham could belong to the partial paths SPORT, FOOTBALL and ENTERTAINMENT, MUSIC but might not belong to any leaf categories such as CHAMPIONS LEAGUE. The problem of multiple partial paths was also considered in Cesa-Bianchi et al. (2004).
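To make the "union of partial paths" constraint concrete, the following sketch checks whether a multilabel is closed under ancestors, i.e. whether every positive node's parent is also positive; the toy hierarchy and category names are illustrative, not taken from the datasets in the paper.

```python
# Toy hierarchy: child -> parent (None marks a root). Names are illustrative.
parent = {
    "SPORT": None,
    "FOOTBALL": "SPORT",
    "CHAMPIONS LEAGUE": "FOOTBALL",
    "ENTERTAINMENT": None,
    "MUSIC": "ENTERTAINMENT",
}

def is_union_of_partial_paths(multilabel, parent):
    """A multilabel is a union of partial paths iff it is closed under
    ancestors: whenever a node is positive, its parent is positive too."""
    labels = set(multilabel)
    return all(parent[node] is None or parent[node] in labels for node in labels)

# The Beckham example: two partial paths that stop short of any leaf category.
print(is_union_of_partial_paths({"SPORT", "FOOTBALL", "ENTERTAINMENT", "MUSIC"}, parent))  # True
print(is_union_of_partial_paths({"CHAMPIONS LEAGUE", "MUSIC"}, parent))                    # False
```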

2.3 Feature Representations for Hierarchical Outputs

  • When the input features are used in hierarchical classification, they need to be associated with the labelings of the hierarchy. In our setting, this is done via constructing a joint feature map $f : \mathcal{X} \times \mathcal{Y} \mapsto \mathcal{F}_{xy}$.
  • There are important design choices to be made in how the hierarchical structure should be reflected in the feature representation. Two general types of features can be distinguished:
    • Global features are given by the feature map $f_x : \mathcal{X} \mapsto \mathcal{F}_x$. They are not tied to a particular vertex or edge but represent the structured object as a whole. For example, the bag-of-words or the substring spectrum of a document is not tied to a single class of documents in a hierarchy, but a given word can relate to different classes with different importances.
    • Local features are given by a feature map $f_x^{j} : \mathcal{X} \mapsto \mathcal{F}_x^{j}$ tied to a particular vertex $j$ or edge of the structure. For example, given a structured representation of a scientific article, we can distinguish between elements occurring within the title, abstract, article body and references, and construct local feature maps for each of these components (a sketch combining global features with the hierarchy follows this list).
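As a concrete illustration of how global input features can be combined with the hierarchy, the sketch below builds a joint feature vector by copying the document's global feature vector into one block per positively labeled vertex. This is one plausible orthogonal construction consistent with the description above, not necessarily the exact feature map used in the paper.

```python
import numpy as np

def joint_feature_map(x_features, multilabel, vertices):
    """Map (x, y) to a joint feature vector with one block per vertex.

    x_features : 1-D array of global input features (e.g. bag-of-words).
    multilabel : set of vertices labeled positive for this document.
    vertices   : ordered list of all vertices in the hierarchy.

    The block for vertex j holds x_features if j is positive and zeros
    otherwise, so a linear model can weight the same word differently
    for different classes.
    """
    d = len(x_features)
    phi = np.zeros(len(vertices) * d)
    for j, v in enumerate(vertices):
        if v in multilabel:
            phi[j * d:(j + 1) * d] = x_features
    return phi

# Example: 3-word vocabulary, toy hierarchy with 3 vertices.
vertices = ["ROOT", "SPORT", "FOOTBALL"]
x = np.array([1.0, 0.0, 2.0])
print(joint_feature_map(x, {"ROOT", "SPORT"}, vertices))
```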

5. Experiments

  • We tested the presented learning approach on three datasets that have an associated classification hierarchy:
  • REUTERS Corpus Volume 1, RCV1 (Lewis et al., 2004). 2500 documents were used for training and 5000 for testing. As the label hierarchy we used the 'CCAT' family of categories (Corporate/Industrial news articles), which had a total of 34 nodes organized in a tree with maximum depth 3. The tree is quite unbalanced, with half of the nodes residing at depth 1 and very few nodes at depth 3.
  • WIPO-alpha patent dataset (WIPO, 2001). The dataset consisted of the 1372 training and 358 testing documents comprising the D section of the hierarchy. The number of nodes in the hierarchy was 188, with maximum depth 3.
  • ENZYME classification dataset. The training data consisted of 7700 protein sequences with hierarchical classification given by the Enzyme Classification (EC) system. The hierarchy consisted of 236 nodes organized into a tree of depth three. Test data consisted of 1755 sequences.
  • [The approach was tested against three baseline algorithms:]
    • SVM denotes an SVM trained for each microlabel separately
    • H-SVM denotes the case where the SVM for a microlabel is trained only on examples for which the ancestor labels are positive (see the sketch after this list).
    • H-RLS is a batch version of the hierarchical least squares algorithm described in Cesa-Bianchi et al. (2004).
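A small sketch of how the H-SVM baseline's per-node training sets can be assembled under the description above: an example is used for a node only if the node's parent label is positive. The data layout and helper name are illustrative, not the authors' code.

```python
def hsvm_training_sets(examples, parent):
    """Build per-node training sets for an H-SVM style baseline.

    examples : list of (x, multilabel) pairs; multilabel is the set of
               positive nodes for that example.
    parent   : dict mapping each node to its parent (None for a root).

    For node j, only examples whose parent label is positive are kept;
    the binary target is +1 if j itself is positive, -1 otherwise.
    """
    training_sets = {node: [] for node in parent}
    for x, labels in examples:
        for node, par in parent.items():
            if par is None or par in labels:        # ancestor condition
                y = +1 if node in labels else -1
                training_sets[node].append((x, y))
    return training_sets
```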

References

  • S. M. Aji and R. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.
  • Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In: Proceedings of the International Conference on Machine Learning, 2003.
  • D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In: Proceedings of the 13th ACM CIKM, 2004.
  • N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-sequence kernels. Journal of Machine Learning Research, 3:1059–1082, 2003.
  • N. Cesa-Bianchi, C. Gentile, A. Tironi, and L. Zaniboni. Incremental algorithms for hierarchical classification. In Neural Information Processing Systems, 2004.
  • N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
  • O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In ICML'04, pages 209–216, 2004.
  • S. T. Dumais and H. Chen. Hierarchical classification of web content. In SIGIR'00, pages 256–263, 2000.
  • T. Gärtner. A survey of kernels for structured data. ACM SIGKDD Explorations, pages 49–58, 2003.
  • T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003.
  • T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the European Conference on Machine Learning, pages 137–142, Berlin, 1998. Springer.
  • D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In ICML'97, pages 170–178, 1997.
  • F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.
  • J. Lafferty, X. Zhu, and Y. Liu. Kernel conditional random fields: representation and clique selection. In: Proceedings of the 21st International Conference on Machine Learning, pages 504–511, 2004.
  • C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, 2002.
  • D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
  • H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
  • A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML'98, pages 359–367, 1998.
  • J. Rousu and J. Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6:1323–1344, 2005.
  • G. Salton. Automatic Text Processing. Addison-Wesley, Massachusetts, 1989.
  • C. Saunders, H. Tschach, and J. Shawe-Taylor. Syllables and other string kernel extensions. In ICML'02, pages 530–537, 2002.
  • J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In: Proceedings of Neural Information Processing Systems 2003, 2003.
  • B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In: Proceedings of the 21st International Conference on Machine Learning, pages 807–814, 2004.
  • I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In: Proceedings of the 21st International Conference on Machine Learning, pages 823–830, 2004.
  • S. V. N. Vishwanathan and A. J. Smola. Fast kernels on strings and trees. In: Proceedings of Neural Information Processing Systems 2002, 2002. URL http://users.rsise.anu.edu.au/~vishy/papers/VisSmo02.pdf.
  • M. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, Department of Statistics, University of California, Berkeley, 2003.
  • M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:1120–1146, 2003.


J. Rousu, C. Saunders, S. Szedmak, and John Shawe-Taylor (2006). Kernel-based Learning of Hierarchical Multilabel Classification Models. Journal of Machine Learning Research, 7. http://www.jmlr.org/papers/volume7/rousu06a/rousu06a.pdf