- (Zhu, 2008) ⇒ Xiaojin Zhu. (2008). “Semi-Supervised Learning Literature Survey (revised edition).” Technical Report 1530, Department of Computer Sciences, University of Wisconsin, Madison.
- This survey has been updated several times. It was begun in 2005 and was updated in 2007.
- Author comment on Homepage:
- "We review the literature on semi-supervised learning, i.e., machine learning from both labeled and unlabeled data. This online paper is updated frequently to incorporate the latest development in the field."
- Q: What is semi-supervised learning?
- A: In this survey we focus on semi-supervised classification. It is a special form of classification. Traditional classifiers use only labeled data (feature / label pairs) to train. …
- Q: Does unlabeled data always help?
- A: No, there’s no free lunch. Bad matching of problem structure with model assumption can lead to degradation in classifier performance. For example, quite a few semi-supervised learning methods assume that the decision boundary should avoid regions with high p(x). These methods include transductive support vector machines (TSVMs), information regularization, Gaussian processes with null category noise model, and graph-based methods if the graph weights are determined by pairwise distance. Nonetheless, if the data is generated from two heavily overlapping Gaussians, the decision boundary would go right through the densest region, and these methods would perform badly. On the other hand, EM with generative mixture models, another semi-supervised learning method, would have easily solved the problem. Detecting a bad match in advance, however, is hard and remains an open question.
- Q: How many semi-supervised learning methods are there?
- A: Many. Some often-used methods include: EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods. See the following sections for more methods.
- Self-training is a commonly used technique for semi-supervised learning. In self-training a classifier is first trained with the small amount of labeled data. The classifier is then used to classify the unlabeled data. Typically the most confident unlabeled points, together with their predicted labels, are added to the training set. The classifier is re-trained and the procedure repeated. Note the classifier uses its own predictions to teach itself. The procedure is also called self-teaching or bootstrapping (not to be confused with the statistical procedure with the same name). The generative model and EM approach of section 2 can be viewed as a special case of ‘soft’ self-training. One can imagine that a classification mistake can reinforce itself. Some algorithms try to avoid this by ‘unlearning’ unlabeled points if the prediction confidence drops below a threshold.
- Self-training has been applied to several natural language processing tasks. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word ‘plant’ means a living organism or a factory in a given context. Riloff et al. (2003) use it to identify subjective nouns. Maeireizo et al. (2004) classify dialogues as ‘emotional’ or ‘non-emotional’ with a procedure involving two classifiers. Self-training has also been applied to parsing and machine translation. Rosenberg et al. (2005) apply self-training to object detection systems from images, and show the semi-supervised technique compares favorably with a state-of-the-art detector.
- Self-training is a wrapper algorithm, and is hard to analyze in general. However, for specific base learners, there has been some analysis of convergence. See e.g. (Haffari & Sarkar, 2007; Culp & Michailidis, 2007).
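The self-training loop described above can be sketched as follows. This is a minimal illustration, not the survey's implementation: the nearest-centroid base learner, the distance-margin confidence score, and the names `fit_centroids`, `predict_with_confidence`, `self_train`, `per_round`, and `max_rounds` are all assumptions chosen for brevity. It needs at least two classes in the labeled data and omits the 'unlearning' step.

```python
import math


def fit_centroids(points, labels):
    """Base learner (assumed): return {label: centroid} for the labeled points."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_with_confidence(centroids, p):
    """Predict the nearest centroid's label.

    Confidence (assumed heuristic) is the gap between the distances to the
    second-nearest and the nearest centroid.
    """
    dists = sorted((math.dist(p, c), y) for y, c in centroids.items())
    (d0, y0), (d1, _) = dists[0], dists[1]
    return y0, d1 - d0


def self_train(labeled, labels, unlabeled, per_round=2, max_rounds=10):
    """Wrapper loop: train, label the pool, absorb the most confident points."""
    X, y = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        centroids = fit_centroids(X, y)
        scored = [(predict_with_confidence(centroids, p), p) for p in pool]
        # Most confident predictions first.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        for (label, _), p in scored[:per_round]:
            X.append(p)       # the classifier's own prediction
            y.append(label)   # becomes a training label
            pool.remove(p)
    return fit_centroids(X, y), X, y
```

Because the predicted labels are never revisited, an early mistake stays in the training set and shifts the centroids, which is the self-reinforcing error the text warns about.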
4 Co-Training and Multiview Learning
- Co-training (Blum & Mitchell, 1998; Mitchell, 1999) assumes that (i) features can be split into two sets; (ii) each sub-feature set is sufficient to train a good classifier; (iii) the two sets are conditionally independent given the class. Initially two separate classifiers are trained with the labeled data, on the two sub-feature sets respectively. Each classifier then classifies the unlabeled data, and ‘teaches’ the other classifier with the few unlabeled examples (and the predicted labels) about which it feels most confident. Each classifier is retrained with the additional training examples given by the other classifier, and the process repeats.
- In co-training, unlabeled data helps by reducing the version space size. In other words, the two classifiers (or hypotheses) must agree on the much larger unlabeled data as well as the labeled data. …
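A minimal sketch of the co-training loop, under simplifying assumptions that are not from the survey: each view is a single number, the per-view classifier is a 1-D nearest-mean rule, and each classifier passes exactly one example to the other per round. The names `fit_means`, `predict`, `co_train`, and `rounds` are hypothetical.

```python
def fit_means(values, labels):
    """1-D nearest-mean classifier (assumed base learner): {label: mean value}."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    return {y: sum(vs) / len(vs) for y, vs in groups.items()}


def predict(means, v):
    """Return (label, confidence); confidence is the gap between the two nearest means."""
    ranked = sorted((abs(v - m), y) for y, m in means.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]


def train_view(train_set, k):
    """Fit the view-k classifier on a list of ((view_0, view_1), label) pairs."""
    return fit_means([x[k] for x, _ in train_set], [y for _, y in train_set])


def co_train(labeled, labels, unlabeled, rounds=10):
    """labeled and unlabeled are lists of (view_0, view_1) pairs."""
    # train[k] is the training set seen by the classifier on view k.
    train = [list(zip(labeled, labels)), list(zip(labeled, labels))]
    pool = list(unlabeled)
    for _ in range(rounds):
        for k in (0, 1):
            if not pool:
                break
            means = train_view(train[k], k)
            # Classifier k picks the unlabeled example it is most confident about...
            best = max(pool, key=lambda x: predict(means, x[k])[1])
            label, _ = predict(means, best[k])
            # ...and teaches it, with its predicted label, to the other classifier.
            train[1 - k].append((best, label))
            pool.remove(best)
    return [train_view(train[k], k) for k in (0, 1)]
```

Keeping a separate training set per view follows the "teaches the other classifier" wording above; a variant that adds the newly labeled examples to one shared pool is also common.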
11.9 Metric-based Model Selection
- Metric-based model selection (Schuurmans & Southey, 2001) is a method to detect hypothesis inconsistency with unlabeled data. We may have two hypotheses which are consistent on [math]L[/math], for example they both have zero training set error. However they may be inconsistent on the much larger [math]U[/math]. If so we should reject at least one of them, e.g. the more complex one if we employ Occam’s razor.
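The consistency check can be illustrated with a toy example. This is only a sketch of the idea of comparing hypotheses on unlabeled data, not the full procedure of Schuurmans & Southey; the two hypotheses, the data, the 5% tolerance, and the names `h_simple`, `h_complex`, `training_error`, `disagreement`, and `choose` are all assumptions made for illustration.

```python
def h_simple(x):
    """Simple hypothesis (assumed): a single threshold at 2.0."""
    return int(x > 2.0)


def h_complex(x):
    """More complex hypothesis (assumed): same threshold plus an extra positive band."""
    return int(x > 2.0 or 1.2 < x < 1.8)


def training_error(h, xs, ys):
    """Fraction of labeled points that h misclassifies."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)


def disagreement(h1, h2, xs):
    """Fraction of points on which the two hypotheses predict different labels."""
    return sum(h1(x) != h2(x) for x in xs) / len(xs)


def choose(simple, complex_, L_x, L_y, U_x, tolerance=0.05):
    """Prefer the simpler hypothesis when both fit L but disagree on U."""
    both_fit = (training_error(simple, L_x, L_y) == 0
                and training_error(complex_, L_x, L_y) == 0)
    if both_fit and disagreement(simple, complex_, U_x) > tolerance:
        return simple      # Occam's razor: reject the more complex one
    return complex_        # no inconsistency detected on U


# Small labeled set L and a larger unlabeled set U on a grid over [0, 4].
L_x, L_y = [0.0, 1.0, 3.0, 4.0], [0, 0, 1, 1]
U_x = [i / 10 for i in range(41)]
```

Both hypotheses have zero error on the four labeled points, but they disagree on the five unlabeled grid points that fall inside the extra band, so `choose` returns the simpler one.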
|2008 SemiSupervisedLearningLitReview||Xiaojin Zhu||Semi-Supervised Learning Literature Survey (revised edition)||http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf||2008|