Frequency-based Feature Selection Algorithm

References

http://nlp.stanford.edu/IR-book/html/htmledition/frequency-based-feature-selection-1.html
- A third feature selection method is frequency-based feature selection, that is, selecting the terms that are most common in the class. Frequency can be either defined as document frequency (the number of documents in the class [math]\displaystyle{ c }[/math] that contain the term $\tcword$) or as collection frequency (the number of tokens of $\tcword$ that occur in documents in [math]\displaystyle{ c }[/math]). Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
- Frequency-based feature selection selects some frequent terms that have no specific information about the class, for example, the days of the week (Monday, Tuesday, ...), which are frequent across classes in newswire text. When many thousands of features are selected, then frequency-based feature selection often does well. Thus, if somewhat suboptimal accuracy is acceptable, then frequency-based feature selection can be a good alternative to more complex methods. However, Figure 13.8 is a case where frequency-based feature selection performs a lot worse than MI and $\chi ^2$ and should not be used.