# 2004 HierarchicalDocumentCategorizat

## Quotes

### Abstract

Automatically categorizing documents into pre-defined topic hierarchies or taxonomies is a crucial step in knowledge and content management. Standard machine learning techniques like Support Vector Machines and related large margin methods have been successfully applied for this task, albeit the fact that they ignore the inter-class relationships. In this paper, we propose a novel hierarchical classification method that generalizes Support Vector Machine learning and that is based on discriminant functions that are structured in a way that mirrors the class hierarchy. Our method can work with arbitrary, not necessarily singly connected taxonomies and can deal with task-specific loss functions. All parameters are learned jointly by optimizing a common objective function corresponding to a regularized upper bound on the empirical loss. We present experimental results on the WIPO-alpha patent collection to show the competitiveness of our approach.

### = 1. INTRODUCTION

Document categorization is a crucial and well-proven instrument for organizing large volumes of textual information. Being no brainchild of our age, comprehensive classification systems have been developed and maintained by librarians since the 19th century and are today in widespread use. The advent of the Web and the enormous growth of digital content in intranets, databases, and archives, have further increased the demand for categorization. In the face of the pace and complexity of this process, manual categorization often lacks economic efficiency and automatic tools are indispensable to supplement human efforts.

In most cases, the use of statistical or machine learning techniques has been proven to be successful in this context, since it is typically more feasible to induce categorization rules based on example documents, than to elicit such rules from domain experts. The wide range of methods applied to this problem include nearest neighbor classifiers [23], neural networks [19], generative probabilistic classifiers [6, 7], and – more recently – boosting [12] and Support Vector Machines (SVMs) [5], to name just a few. Extensive experimental comparisons (e.g. [ 5, 24, 1]) have evidenced that among the methods available today, SVMs are highly competitive in their classification accuracy and can therefore be considered the state-of-the art in document categorization.

A potential drawback of all of the above mentioned classification methods is that they treat the category structure as ‘flat’ and that they do not consider relationships between categories, which are commonly expressed in concept hierarchies or taxonomies. Such structures, however, are the preferred way in which concepts, subject headings, or categories are arranged in practice. Taxonomies offer clear advantages in supporting tasks like browsing, searching or visualization. They are also easier to maintain and alleviate the [[manual annotation process. This is witnessed by the fact that virtually all real world classification systems have complex hierarchical structure. This includes traditional systems like Dewey or Library of Congress subject headings, the International Patent Classification (IPC) scheme [21], the Medical Subject Headings (MeSH) maintained by NIH, as well as Web catalogs created by Yahoo!, the Open Directory Project (DMOZ) or LookSmart, to name some of the most important ones.

There is reason to believe that taxonomies offer valuable information which learning methods should be able to capitalize on, in particular since the number of training examples for individual classes may be relatively small when dealing with tens of thousands of classes. The potential loss of valuable information suffered from ignoring class hierarchies has been pointed out many times before and has led to a number of approaches that deal with ways to exploit hierarchies, e.g. [6, 8, 19, 18, 4].

In this paper, we present a generalization of the SVM classification architecture [17] that directly incorporates prior knowledge about class relationships. This is accomplished by using discriminant functions that decompose into contributions from different levels of the hierarchy. Compared to previous approaches, our main contribution is a novel formulation of hierarchical classification as a joint large margin problem (cf. Section 2), for which we derive an efficient training algorithm (cf. Section 4). Moreover, the proposed method is not restricted to the zero-one classification loss, but is able to directly incorporate specific loss functions, in particular ones derived from the taxonomy (cf. Section 3).

### 2. HIERARCHICAL SVM CLASSIFICATION

There are two slightly different settings for document categorization: problems involving multiple overlapping binary classes and multiclass problems. In the latter case, each document belongs to exactly one category, whereas in the former case a document may belong to multiple categories. We will present a hierarchical model for both of these settings, but will focus on the multiclass case with the experimental evaluation on the WIPO-alpha collection [22] (cf. Section 6).

#### 2.1 Multiclass SVM Classification

We take the multiclass SVM learning approach of [3] as our starting point. Let ${ (x_i, y_i) }^n_{i=1}$ be a set of n labeled training documents. Here $x_i \in R^d$ denotes a standard vector representation for the i-th training document. Each label yi refers to a unique category encoded as an integer, $y_i \ in \ mathcal{Y} \ equiv {1,..., q}$, where q is the total number of categories.

...

### 2.2 SVM Learning with Class Attributes =

We would like to extend the above multiclass SVM formulation to cases, where classes are not just arbitrary numbers, but can be characterized by attribute vectors ¤ (y) 2 <s. This should be carried out in a way that recovers the standard multiclass setting as a special case of an orthogonal attribute representation with s = q and ¸r (y) = ±yr, i.e. a case where each class is interpreted as a binary attribute of its own. To that extent, we propose to redefine a more general version of the discriminant functions F in (1) as

...

It is of conceptual and computational interest to derive the dual quadratic program (QP) of the above formulation of large margin learning with class attributes. To that extent one first forms the Lagrangian function

...

#### 2.4 Class Attributes from Taxonomies

The application of the method presented in the previous section to classification problems with pre-defined taxonomies is straightforward. The main idea is to encode the relationship between classes, expressed in the taxonomy, in terms of a class attribute representation. We define a taxonomy as an arbitrary lattice (e.g. tree) whose minimal elements (e.g. leaves) correspond to the categories. Cases where interior nodes represent categories can be easily handled by adding a single (terminal) node to every inner node. Elements of the lattice, i.e. terminal as well as non-terminal nodes are denoted by z 2 Z = {z1,..., zp} in the following, with p ¸ q and where we identify yk = zk for k = 1,..., q.

...

#### 2.5 Hierarchical Multilabel Classification

The same idea can be carried out in the multilabel case of overlapping binary classes, where document may be assigned to multiple categories. While the standard learning approach amounts to treating every binary problem for class y as an independent problem with weight vector wy, the binary problems are coupled in the hierarchical setting. The latter happens through the shared weight vectors we for common ancestors z. Since we have not carried out experiments with this formulation, we only present a brief sketch of the formulation.

...

### 3. LOSS-SENSITIVE LEARNING

A shortcoming of the approach presented so far is that it is based on the standard misclassification loss. More precisely, it is well known that the average margin violation or hinge loss $1/n \ Sum_i \ Ei$ provides an upper bound on the empirical misclassification rate. However, in many applications the actual loss of an incorrect prediction will depend on the relation of the classes. In particular, it is reasonable to assume that confusing classes that are “nearby” in the taxonomy is less costly or severe than predicting a class that is “far away” from the correct class. Hence, we would like to work with general loss functions 4: Y ×Y ! < where 4 (y, ˆy) denotes the loss of predicting ˆy, when the true class is y. We assume that 4 (y, y) = 0 and that 4 (y, ˆy) > 0 for y 6= ˆy. Two problems need to be addressed: (i) how to define meaningful loss functions for taxonomies and (ii) how to modify the SVM formulation to more directly minimize (an upper bound on) the desired loss.

#### 3.1 Hierarchical Loss Functions

The first problem of designing suitable metrics that take the position of classes in the taxonomy into account has been investigated in [18], where 4 (y, y0) is defined via the length of the shortest (undirected) path connecting y and y0. Another related proposal based on similarities between categories is due to [13], but the latter does not make use of the taxonomy and rather computes the similarity between classes from the cosine-similarity between class centroid vectors.

...

...

### 7. CONCLUSIONS

We have proposed a large margin architecture for hierarchical categorization that extends the strengths of Support Vector Machine classification to also take advantage of information about class relationships encoded in a taxonomy. It is possible to minimize upper bounds on arbitrary loss functions, in particular ones that quantifies the severeness of incorrect categorizations based on the taxonomy. We believe this to be a valuable feature in many real-world applications. Furthermore, we have derived a column generation algorithm to more efficiently deal with the large number of margin constraints. Results on patent classification have confirmed the competitiveness of our overall approach.

## References

;

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2004 HierarchicalDocumentCategorizatHierarchical Document Categorization with Support Vector Machines10.1145/1031171.10311862004
 Author Lijuan Cai + and Thomas Hofmann + doi 10.1145/1031171.1031186 + title Hierarchical Document Categorization with Support Vector Machines + year 2004 +