Text-Item Clustering Task
- input: a Text-Item Set.
- output: a set of Text Document Clusters.
- It can be solved by a Text Clustering System (that applies a text clustering algorithm).
- It can range from being a Heuristic Text-Item Clustering Task to being a Data-Driven Text-Item Clustering Task.
- It can range from (typically) being a Semantic Text-Item Clustering Task to being a Syntactic Text-Item Clustering Task.
- It can (often) be a High-Dimensional Clustering Task.
- It can range from being a Text Document Clustering Task to being a Paragraph Clustering Task to being a Sentence Clustering Task to being a Phrase Clustering Task to being a Context Window Clustering Task to being a Word Clustering Task.
- It can support: a Corpus Browsing Task, a Topic Modeling Task, and an Information Retrieval Task (though the evidence that clustering improves retrieval performance is weak).
- See: Image Clustering Task; Clustering; Information Retrieval; Text Mining; Unsupervised Learning.
- (Zhao & Karypis, 2011) ⇒ Ying Zhao, and George Karypis. (2011). “Document Clustering.” In: (Sammut & Webb, 2011).
- QUOTE: At a high-level, the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a predetermined number of k subsets S1, S2, …, Sk, such that the documents assigned to each subset are more similar to each other than the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. Document clustering faces two big challenges: the dimensionality of the feature space tends to be high (i.e., a document collection often consists of thousands or tens of thousands unique words) and the size of a document collection tends to be large.
- (Beil et al., 2002) ⇒ Florian Beil, Martin Ester, and Xiaowei Xu. (2002). “Frequent Term-based Text Clustering.” In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002). doi:10.1145/775047.775110
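
The partitioning formulation quoted from (Zhao & Karypis, 2011) above, which assigns n documents to a predetermined number of k subsets so that documents within a subset are more similar to each other than to documents in other subsets, can be illustrated with a minimal sketch. It assumes whitespace tokenization, term-frequency vectors, cosine similarity, and a spherical k-means style loop with hand-picked seed documents. None of these choices come from the cited sources, and the names `tf_vector`, `cosine`, `centroid`, `cluster_documents`, and `DOCS` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Return an L2-normalised term-frequency vector as a sparse dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def centroid(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}


def cluster_documents(docs, k, seeds, max_iterations=10):
    """Partition docs into k clusters; seeds are indices of the initial centroids."""
    if len(seeds) != k:
        raise ValueError("exactly k seed documents are required")
    vectors = [tf_vector(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(vec, centroids[j]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so stop early
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels


# Illustrative toy corpus (assumed, not taken from the cited sources).
DOCS = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]
labels = cluster_documents(DOCS, k=2, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Even this toy corpus yields one vector dimension per distinct word, which hints at the high dimensionality noted in the quote. A practical system would more likely use sparse matrices, term weighting such as TF-IDF, and a principled seeding strategy.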