2004 EditorialSpeclIssOnLearnImb

From GM-RKB

Subject Headings: Imbalanced Supervised Classification Algorithm, Imbalanced Training Dataset.

Notes

Quotes

1. INTRODUCTION

The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science into an applied technology, widely used in business, industry and scientific research. Although practitioners had probably long been aware of the problem, it made its appearance in machine learning/data mining research circles about a decade ago. Its importance grew as more and more researchers realized that their data sets were imbalanced and that this imbalance caused suboptimal classification performance. This increase in interest gave rise to two workshops, held in 2000 [1] and 2003 [3] at the AAAI and ICML conferences, respectively. These workshops, and the e-mail discussions and information-seeking requests that followed them, allowed us to note two points of importance:

  1. The class imbalance problem is pervasive and ubiquitous, causing trouble to a large segment of the data mining community.
  2. Despite the fact that two workshops have already been held on the topic, a large number of practitioners plagued by the problem are still working in isolation, not knowing that a large part of the research community is actively looking into ways to alleviate it.

The purpose of this special issue is to communicate and present some of the latest research carried out in this area, while reviewing other important recent developments in the field. In this Editorial, we begin by reviewing the class imbalance problem as well as an array of general solutions previously proposed to deal with it. We then discuss the progression of ideas from the 2000 workshop to today. In order to give a comprehensive picture of the state of the art in the field, we give a short overview of the papers presented at the 2003 workshop as well as a short description of the papers contained in this volume. The excellent overview paper by Gary Weiss [55], published in this volume, completes this short picture.

2. THE CLASS IMBALANCE PROBLEM

The class imbalance problem typically occurs when, in a classification problem, there are many more instances of some classes than of others. In such cases, standard classifiers tend to be overwhelmed by the large classes and to ignore the small ones. In practical applications, the ratio of the small to the large classes can be drastic, such as 1 to 100, 1 to 1,000, or 1 to 10,000 (and sometimes even more); see, for example, [41], [57]. As mentioned earlier, this problem is prevalent in many applications, including fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring, among many others. It is worth noting that in certain domains (like those just mentioned) the class imbalance is intrinsic to the problem. For example, within a given setting, there are typically very few cases of fraud compared to the large number of honest uses of the offered facilities. However, class imbalances sometimes occur in domains that do not have an intrinsic imbalance. This happens when the data collection process is limited (e.g., for economic or privacy reasons), creating "artificial" imbalances. Conversely, in certain cases the data abounds, and it is up to the scientist to decide which examples to select and in what quantity [56]. In addition, there can also be an imbalance in the costs of making different errors, which can vary per case [3].
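The effect described above is easy to quantify. The following minimal sketch (in Python; the labels and the 1-to-100 ratio are hypothetical) shows that a degenerate classifier which always predicts the majority class still reaches roughly 99% accuracy while recovering none of the minority class, which is why plain accuracy is a misleading yardstick on imbalanced data:

    # Illustrative only: hypothetical labels at a 1-to-100 class ratio.
    import numpy as np

    y = np.array([1] * 100 + [0] * 10_000)   # 100 minority, 10,000 majority
    y_pred = np.zeros_like(y)                # classifier that always predicts the majority class

    accuracy = (y_pred == y).mean()                  # ~0.990
    minority_recall = (y_pred[y == 1] == 1).mean()   # 0.000: every minority case is missed

    print(f"accuracy={accuracy:.3f}  minority recall={minority_recall:.3f}")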

A number of solutions to the class-imbalance problem were previously proposed both at the data and algorithmic levels. At the data level, these solutions include many different forms of re-sampling such as random oversampling with replacement, random undersampling, directed oversampling (in which no new examples are created, but the choice of samples to replace is informed rather than random), directed undersampling (where, again, the choice of examples to eliminate is informed), oversampling with informed generation of new samples, and combinations of the above techniques. At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather than discrimination-based (two-class) learning. Many of these solutions are discussed in the papers presented at the workshops [1][3] or are referred to in the active bibliography on the topic (http://www.site.uottawa.ca/~nat/Research/class_imbalance_bibli.html).
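As a concrete illustration of the two simplest data-level solutions listed above, the following sketch (in Python with NumPy; the arrays and the 1-to-100 ratio are hypothetical) implements random oversampling with replacement and random undersampling; the directed and generative variants differ only in how the indices are chosen:

    # Illustrative sketch of random oversampling (with replacement) and random
    # undersampling; X, y and the class ratio are hypothetical. Directed variants
    # would choose the indices in an informed rather than random way.
    import numpy as np

    def random_oversample(X, y, minority_label, rng):
        """Duplicate randomly chosen minority examples until the classes balance."""
        minority = np.flatnonzero(y == minority_label)
        majority = np.flatnonzero(y != minority_label)
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        keep = np.concatenate([majority, minority, extra])
        return X[keep], y[keep]

    def random_undersample(X, y, minority_label, rng):
        """Discard randomly chosen majority examples until the classes balance."""
        minority = np.flatnonzero(y == minority_label)
        majority = np.flatnonzero(y != minority_label)
        kept_majority = rng.choice(majority, size=len(minority), replace=False)
        keep = np.concatenate([kept_majority, minority])
        return X[keep], y[keep]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_100, 5))
    y = np.array([1] * 100 + [0] * 10_000)
    X_over, y_over = random_oversample(X, y, minority_label=1, rng=rng)
    print(np.bincount(y_over))   # [10000 10000]: now balanced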

3.2 One-class Learning

When negative examples greatly outnumber the positive ones, certain discriminative learners have a tendency to over-fit. A recognition-based approach provides an alternative to discrimination: the model is built from the examples of the target class alone. One attempts to measure (either implicitly or explicitly) the amount of similarity between a query object and the target class, and classification is accomplished by imposing a threshold on this similarity value [26].
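A minimal sketch of this recognition-based scheme follows; the single-Gaussian similarity model and the percentile-based threshold are illustrative assumptions for the sketch, not the method of [26]:

    # Illustrative recognition-based classifier: a single Gaussian fit to the
    # target class serves as the similarity measure (via log-density), and a
    # percentile of the training scores serves as the acceptance threshold.
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    X_target = rng.normal(loc=0.0, size=(500, 3))   # hypothetical target-class data

    model = multivariate_normal(mean=X_target.mean(axis=0),
                                cov=np.cov(X_target, rowvar=False))
    threshold = np.percentile(model.logpdf(X_target), 5)   # accept ~95% of training data

    def is_target(x):
        return model.logpdf(x) >= threshold

    print(is_target(np.zeros(3)))    # True: close to the target class
    print(is_target(np.full(3, 8)))  # False: far from the target class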

Two classes of learners have mainly been studied in the context of the recognition-based one-class approach, namely SVMs [50][49] and autoencoders [26][38], and both were found to be competitive [38].
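For the SVM variant, the sketch below uses scikit-learn's OneClassSVM as a stand-in for the implementations studied in [50][49]; the data and the nu/gamma settings are assumptions for the sketch:

    # Illustrative one-class SVM, trained on positive examples only.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_target = rng.normal(loc=0.0, size=(500, 10))   # hypothetical target class

    clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_target)

    X_query = np.vstack([rng.normal(loc=0.0, size=(5, 10)),    # target-like queries
                         rng.normal(loc=6.0, size=(5, 10))])   # outlier queries
    print(clf.predict(X_query))   # +1 = recognized as target class, -1 = rejected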

An interesting aspect of one-class (recognition-based) learning is that, under certain conditions such as multi-modality of the domain space, one-class approaches to solving the classification problem may in fact be superior to discriminative (two-class) approaches (such as decision trees or neural networks) [26]. This is supported in the current volume by [48], who demonstrate the superiority of one-class SVMs over two-class ones in certain important imbalanced-data domains, including genomic data. In particular, [48] shows that one-class learning is particularly useful on extremely unbalanced data sets with a high-dimensional, noisy feature space. They argue that the one-class approach is related to aggressive feature selection methods, but is more practical, since feature selection can often be too expensive to apply.



Nathalie Japkowicz, Nitesh V. Chawla, and Aleksander Kołcz. (2004). "Editorial: Special Issue on Learning from Imbalanced Data Sets." SIGKDD Explorations, 6(1). doi:10.1145/1007730.1007733. http://www.sigkdd.org/explorations/issues/6-1-2004-06/edit intro.pdf