1997 AProbabilisticAnalOfTheRocchioAlg

Subject Headings: Text Categorization Task, Rocchio Algorithm.

Notes

The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.

The representation of a problem has a strong impact on the generalization accuracy of a learning system. For categorization a document, which typically is a string of characters, has to be transformed into a representation which is suitable for the learning algorithm and the classification task. IR research suggests that words work well as representation units and that their ordering in a document is of minor importance for many tasks. This leads to a representation of documents as bags of words.
This bag-of-words representation is equivalent to an attribute-value representation as used in machine learning. Each distinct word corresponds to a feature with the number of times the word occurs in the document as its value. Figure 1 shows an example feature vector for a particular document. To avoid unnecessarily large feature vectors words are considered as features only if they occur in the training data at least [math]\displaystyle{ m }[/math] (e.g. [math]\displaystyle{ m }[/math] = 3) times. The set of considered features (i.e. words) will be called F.

This type of classifier is based on the relevance feedback algorithm originally proposed by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991]. Due to its heuristic components, there are a number of similar algorithms corresponding to the particular choice of those heuristics. The three main design choices are
- the word weighting method
- the document length normalization
- the similarity measure.
An overview of some heuristics is given in [Salton, Buckley, 1988]. In the following the most popular combination will be used (known as tfc): tf word weights [Salton, Buckley, 1988], document length normalization using Euclidian vector length and cosine similarity.
Originally developed for information retrieval, the algorithm returns a ranking of documents without providing a threshold to de ne a decision rule for class membership. Therefore the algorithm has to be adapted to be used for text categorization. The variant presented here seems to be the most straightforward adaptation of the Rocchio algorithm to text

…
[Rocchio, 1971] J. Rocchio. (1971). “Relevance Feedback in Information Retrieval.” In: Salton: The SMART Retrieval System: Experiments in Automatic Document Processing, Chapter 14, Prentice-Hall.
[Salton, 1991] G. Salton. (1991). “Developments in Automatic Text Retrieval.” In: Science, Vol. 253.
[Salton, Buckley, 1988] G. Salton, and C. Buckley. (1988). “Term Weighting Approaches in Automatic Text Retrieval.” In: Information Processing and Management, Vol. 24, No. 5.
…,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
1997 AProbabilisticAnalOfTheRocchioAlg	Thorsten Joachims			A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization		Proceedings of the Fourteenth International Conference on Machine Learning	http://www.cs.cornell.edu/People/tj/publications/joachims 97a.pdf			1997