- (Boutell et al., 2004) ⇒ Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. (2004). “Learning Multi-label Scene Classification.” In: Pattern recognition Journal, 37(9). doi:10.1016/j.patcog.2004.03.009
- Image understanding; Semantic scene classification; Multi-label classification; Multi-label training; Multi-label evaluation; Image organization; Cross-training; Jaccard similarity
In classic pattern recognition problems, classes are mutually exclusive by definition. Classification errors occur when the classes overlap in the feature space. We examine a different situation, occurring when the classes are, by definition, not mutually exclusive. Such problems arise in semantic scene and document classification and in medical diagnosis. We present a framework to handle such problems and apply it to the problem of semantic scene classification, where a natural scene may contain multiple objects such that the scene can be described by multiple class labels (e.g., a field scene with a mountain in the background). Such a problem poses challenges to the classic pattern recognition paradigm and demands a different treatment. We discuss approaches for training and testing in this scenario and introduce new metrics for evaluating individual examples, class recall and precision, and overall accuracy. Experiments show that our methods are suitable for scene classification; furthermore, our work appears to generalize to other classification problems of the same nature.
In traditional classification tasks, classes are mutually exclusive by definition. Let [math]\Chi[/math] be the domain of examples to be classified, [math]Y[/math] be the set of labels, and [math]H[/math] be the set of classifiers for [math]\Chi \rarr Y[/math]. The goal is to find the classifier [math]h \in H[/math] maximizing the probability of [math]h(x)=y[/math], where [math]y \in Y[/math] is the ground truth label of [math]x[/math], i.e.,
- [math]y = \arg\max_{y_i \in Y} P(y_i|x).[/math]
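The single-label decision rule above can be sketched in a few lines of Python. This is only an illustration: it assumes the posteriors [math]P(y_i|x)[/math] have already been estimated by some classifier, and the function name and probability values are hypothetical, not taken from the paper.

```python
def classify_single_label(posteriors):
    """Return the one label with the highest estimated posterior P(y_i | x)."""
    if not posteriors:
        raise ValueError("at least one class posterior is required")
    return max(posteriors, key=posteriors.get)


# Hypothetical posteriors for one image (illustrative values only).
posteriors = {"beach": 0.55, "urban": 0.40, "sunset": 0.05}
predicted = classify_single_label(posteriors)
print(predicted)
```

Because exactly one label is returned, an image that is genuinely both a beach and an urban scene is forced into a single class under this rule.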
Classification errors occur when the classes overlap in the selected feature space (Fig. 2a). Various classification methods have been developed to provide different operating characteristics, including linear discriminant functions, artificial neural networks (ANN), k-nearest-neighbor (k-NN), [[radial basis functions (RBF)]] and support vector machines (SVM).
However, in some classification tasks, it is likely that some data belongs to multiple classes, causing the actual classes to overlap by definition. In text or music categorization, documents may belong to multiple genres, such as government and health, or rock and blues [2,3]. Architecture may belong to multiple genres as well. In medical diagnosis, a disease may belong to multiple categories, and genes may have multiple functions, yielding multiple labels.
A problem domain receiving renewed attention is semantic scene classification [5–18], categorizing images into semantic classes such as beaches, sunsets or parties. Semantic scene classification finds application in many areas, including content-based indexing and organization and content-sensitive image enhancement.
Many current digital library systems allow a user to specify a query image and search for images “similar” to it, where similarity is often defined only by color or texture properties. This so-called “query by example” process has often proved to be inadequate. Knowing the category of a scene helps narrow the search space dramatically, simultaneously increasing the hit rate and reducing the false alarm rate.
Fig. 1. Examples of multi-label images.
Knowledge about the scene category can also find application in context-sensitive image enhancement. While an algorithm might enhance the quality of some classes of pictures, it can degrade others. Rather than applying a generic algorithm to all images, we could customize it to the scene type (allowing us, for example, to retain or enhance the brilliant colors of sunset images while reducing the warm-colored cast from tungsten-illuminated scenes). In the scene classification domain, many images may belong to multiple semantic classes. Fig. 1(a) shows an image that had been classified by a human as a beach scene. However, it is clearly both a beach scene and an urban scene. It is not a fuzzy member of each (due to ambiguity), but is a full member of each class (due to multiplicity). Fig. 1(b) (beach and mountains) is similar.
Much research has been done on scene classification recently, e.g., [5–18]. Most systems are exemplar-based, learning patterns from a training set using statistical pattern recognition techniques. A variety of features and classifiers have been proposed; most systems use low-level features (e.g., color, texture). However, none addresses the use of multi-label images.
When choosing their data sets, most researchers either avoid such images, label them subjectively with the base (single-label) class most obvious to them, or consider “beach+urban” as a new class. The last method is unrealistic in most cases because it would increase the number of classes to be considered substantially and the data in such combined classes is usually sparse. The first two methods have limitations as well. For example, in content-based image indexing and retrieval applications, it would be more difficult for a user to retrieve a multiple-class image (e.g., beach+urban) if we only have exclusive beach or urban labels. It may require that two separate queries be conducted respectively and the intersection of the retrieved images be taken. In a content-sensitive image enhancement application, it may be desirable for the system to have different settings for beach, urban, and beach+urban scenes. This is impossible using exclusive single labels.
In this work, we consider the following problem: The base classes are non-mutually exclusive and may overlap by definition (Fig. 2b). As before, let [math]\Chi[/math] be the domain of examples to be classified and [math]Y[/math] be the set of labels. Now let [math]B[/math] be a set of binary vectors, each of length [math]|Y|[/math]. Each vector [math]b \in B[/math] indicates membership in the base classes in [math]Y[/math] (+1 = member; -1 = non-member). [math]H[/math] is the set of classifiers for [math]\Chi \rarr B[/math]. The goal is to find the classifier [math]h \in H[/math] that minimizes a distance (e.g., Hamming) between [math]h(x)[/math] and [math]b_x[/math] for a newly observed example [math]x[/math].
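As a rough illustration of this formulation, the sketch below encodes label sets as +1/-1 membership vectors and measures the Hamming distance between a prediction and the ground truth. The class list and helper names are hypothetical and chosen only for the example.

```python
LABELS = ("beach", "urban", "mountain")  # hypothetical base classes Y


def to_membership_vector(labels, all_labels=LABELS):
    """Encode a label set as a vector of +1 (member) / -1 (non-member)."""
    return tuple(1 if name in labels else -1 for name in all_labels)


def hamming_distance(a, b):
    """Count the positions at which two membership vectors disagree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(1 for u, v in zip(a, b) if u != v)


truth = to_membership_vector({"beach", "urban"})  # ground truth b_x
guess = to_membership_vector({"beach"})           # classifier output h(x)
print(truth, guess, hamming_distance(truth, guess))
```

Here the prediction misses one of the two true labels, so the two vectors differ in exactly one position.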
In a probabilistic formulation, the goal of classifying x is to find one or more base class labels in a set C and for a threshold T such that
- [math]P(c|x) \gt T, \forall c \in C.[/math]
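A minimal sketch of this thresholding rule follows, assuming per-class scores that approximate [math]P(c|x)[/math] are already available. The function name, scores, and threshold value are hypothetical.

```python
def classify_multi_label(posteriors, threshold):
    """Return every label c whose estimated P(c | x) exceeds the threshold T."""
    return {label for label, p in posteriors.items() if p > threshold}


# Hypothetical per-class scores; unlike the single-label case they need not sum to 1.
scores = {"beach": 0.8, "urban": 0.7, "sunset": 0.1}
labels = classify_multi_label(scores, threshold=0.5)
print(sorted(labels))
```

Note that with this rule alone the returned set is empty when no score exceeds the threshold.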
Clearly, the mathematical formulation and its physical meaning are distinctively different from those used in classic pattern recognition. Few papers address this problem (see Section 2), and most of these are specialized for text classification or bioinformatics. Based on the multi-label model, we investigate several methods of training and propose a novel training method, “cross-training”. We also propose three classification criteria in testing. When applying our methods to scene classification, our experiments show that our approach is successful on multi-label images even without an abundance of training data. We also propose a generic evaluation metric that can be tailored to applications needing different error forgiveness.
It is worth noting that multi-label classification is different from fuzzy logic-based classification. Fuzzy logics are used as a means to cope with ambiguity in the feature space between multiple classes for a given sample, not as the end for achieving multi-label classification. The fuzzy membership stems from ambiguity and often a de-fuzzification step is eventually used to derive a crisp decision (typically by choosing the class with the highest membership value). For example, a foliage scene and a sunset scene may share some warm, bright colors, therefore there is confusion between the two scene classes in the selected feature space if color features are used; fuzzy logic would be suitable for solving this problem.
Fig. 2. Figure (a) is the typical pattern recognition problem. Two classes contain examples that are difficult to separate in the feature space. (b) is the multi-label problem. The * data belongs to both of the other two classes simultaneously.
In contrast, multi-label classification is a unique problem in that a sample may possess multiple properties of multiple classes. The content for different classes can be quite distinct: for example, there is little confusion between beach (sand, water) and city (buildings).
The only commonality between fuzzy-logic classification and multi-label classification is the use of membership functions. However, there is correlation between fuzzy membership functions: when one membership takes low values, the other also takes low values or high values and vice versa. On the other hand, the membership functions in the multi-label case are largely coincidental (e.g., resort on the beach). In practice, the sum of fuzzy memberships usually is normalized to 1, while no such constraints apply to the multi-label problem (e.g., a beach resort scene is both a beach scene and a city scene, each with certainty).
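The normalization contrast can be made concrete with a small sketch. The helper below is hypothetical and simply rescales memberships to sum to 1, which is how fuzzy memberships are described above; it is not a method from the paper.

```python
def normalize_memberships(memberships):
    """Rescale membership values so that they sum to 1 (fuzzy-style constraint)."""
    total = sum(memberships.values())
    if total == 0:
        raise ValueError("cannot normalize memberships that sum to zero")
    return {name: value / total for name, value in memberships.items()}


# Multi-label view: a beach resort is a full member of both classes.
multi_label = {"beach": 1.0, "city": 1.0}

# Fuzzy-style view of the same values: forced to share a total of 1.
fuzzy = normalize_memberships(multi_label)
print(multi_label, fuzzy)
```

Under the sum-to-one constraint neither class can be asserted with full certainty, whereas the multi-label representation keeps both memberships at 1.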
With these differences aside, it is conceivable that one could use the learning strategies described in this paper in combination with a fuzzy classifier in a similar way as they were used with the pattern classifiers in this study.
In this paper, we first review past work related to multi-label classification. In Section 3, we describe our training models and testing criteria. Section 4 contains the proposed evaluation methods. Section 5 contains the experimental results obtained by applying our approaches to multi-labeled scene classification. We conclude with a discussion and suggestions for future work.
2. Related work
The sparse literature on multi-label classification is primarily geared to text classification or bioinformatics. For text classification, Schapire and Singer proposed BoosTexter, extending AdaBoost to handle multi-label text categorization. However, they note that controlling complexity due to overfitting in their model is an open issue. McCallum proposed a mixture model trained by EM, selecting the most probable set of labels from the power set of possible classes and using heuristics to overcome the associated computational complexity. However, his generative model is based on learning text frequencies in documents, and is thus specific to text applications. Joachims’ approach is most similar to ours in that he uses a set of binary SVM classifiers. He finds that SVM classifiers achieve higher accuracy than others. However, he does not discuss multi-label training models or specific testing criteria. In bioinformatics, Clare and King extended the definition of entropy to include multi-label data (gene expression in their case), but they used a decision tree as their baseline algorithm. As they stated, they chose a decision tree because of the sparseness of the data and because they needed to learn accurate rules, not a complete classification. However, we desire to use Support Vector Machines for their high accuracy in classification.
A related approach to image classification consists of segmenting and classifying image regions (e.g., sky, grass) [22,23]. A seemingly natural approach to multi-label scene classification is to model such scenes using combinations of these labels. For example, if a mountain scene is defined as one containing rocks and sky and a field scene as one containing grass and sky, then an image with grass, rocks, and sky would be considered both a field scene and a mountain scene.
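The region-combination idea in this example can be sketched as a simple subset check. The scene definitions mirror the mountain and field example in the text; the function and variable names are hypothetical, and real region labels would come from a separate segmentation step that is not shown.

```python
# Scene definitions taken from the example in the text.
SCENE_DEFINITIONS = {
    "mountain": {"rocks", "sky"},
    "field": {"grass", "sky"},
}


def scenes_from_regions(region_labels, definitions=SCENE_DEFINITIONS):
    """Return every scene whose required region labels are all present."""
    found = set(region_labels)
    return {scene for scene, required in definitions.items() if required <= found}


print(sorted(scenes_from_regions({"grass", "rocks", "sky"})))
```

An image with grass, rocks, and sky satisfies both definitions, so it receives both scene labels.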
However, this approach has drawbacks. First, region labeling has only been applied with success to constrained environments with a limited number of predictable objects (e.g., outdoor images captured from a moving vehicle). Second, because scenes consist of groups of regions, there is a combinatorial explosion in the number of region combinations. Third, scene modeling is a difficult problem in its own right, encompassing more than mere presence or absence of objects. For example, a scene with sky, water and sand could be best described as a lake or a beach scene, depending on the relative size and placement of the components.