1992 ScatterGatherAClusterBasedApprDocumentColls

From GM-RKB

Subject Headings: Document Clustering Algorithm, Corpus Browsing Task, Linear-Time Algorithm.

Notes

Cited By

Quotes

Abstract

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
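The browsing loop the abstract describes (cluster the collection, let the user pick groups, re-cluster the selection) can be illustrated with a minimal sketch. This is not the paper's own linear-time algorithm: the clustering step here is a simple k-means-style stand-in over bag-of-words vectors, and the names `vectorize`, `cosine`, `scatter`, and `gather`, the evenly spaced seeding, and the fixed number of rounds are all assumptions made for illustration. Each round computes k similarities per document, so the cost per round grows linearly with the number of documents for a fixed k.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def scatter(docs, k, rounds=3):
    """Partition docs into at most k groups.

    Assumption: a k-means-style stand-in, seeded with evenly spaced
    documents. Each round costs O(k * n) similarity computations.
    """
    if not docs:
        return []
    k = max(1, min(k, len(docs)))
    vecs = [vectorize(d) for d in docs]
    n = len(vecs)
    centroids = [Counter(vecs[i * n // k]) for i in range(k)]
    groups = []
    for _ in range(rounds):
        groups = [[] for _ in centroids]
        # Assign each document to its most similar centroid.
        for i, v in enumerate(vecs):
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            groups[best].append(i)
        # Recompute each centroid as the summed term counts of its members.
        for c, members in enumerate(groups):
            if members:
                centroids[c] = sum((vecs[i] for i in members), Counter())
    return [[docs[i] for i in g] for g in groups if g]


def gather(groups, chosen):
    """Merge the groups the user selected into one smaller collection."""
    return [d for g in chosen for d in groups[g]]


docs = [
    "apple banana fruit",
    "banana fruit salad",
    "car engine wheel",
    "engine wheel tire",
    "apple fruit pie",
    "car tire road",
]
groups = scatter(docs, 2)        # scatter: cluster the whole collection
selected = gather(groups, [0])   # gather: the user picks the first group
regrouped = scatter(selected, 2) # scatter again on the smaller collection
```

In an interactive setting, each group would be shown to the user with a short summary, and the scatter and gather steps would repeat until the selection is small enough to read directly.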

References

  • 1. Chris Buckley, Alan F. Lewit, Optimization of inverted vector searches, Proceedings of the 8th ACM SIGIR Conference retrieval, p.97-110, June 05-07, 1985, Montreal, Quebec, Canada doi:10.1145/253495.253515
  • 2. W.B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.
  • 3. A. El-Hamdouchi, P. Willett, Hierarchic document classification using Ward's clustering method, Proceedings of the 9th ACM SIGIR Conference retrieval, p.149-156, September 1986, Palazzo dei Congressi, Pisa, Italy doi:10.1145/253168.253200
  • 4. A. Griffiths, H.C. Luckhurst, and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.
  • 5. Anil K. Jain, Richard C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc., Upper Saddle River, NJ, 1988
  • 6. N. Jardine and C.J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.
  • 7. J. O. Pedersen, D. R. Cutting, and J. W. Tukey. Snippet search: a single phrase approach to text access. In: Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association, (1991). Also available as Xerox PARC technical report SSL-91-08.
  • 8. Gerard M. Salton. The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, N.J., 1971.
  • 9. Gerard M. Salton, Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986
  • 10. R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34, 1973.
  • 11. C. J. Van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA, 1979
  • 12. C. J. van Rijsbergen and W.B. Croft. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11:171-182, 1975.
  • 13. P. Willett. Document clustering using an inverted file approach. Journal of Information Science, 2:223-231, 1980.
  • 14. P. Willett. A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing & Management, 17:53-60, 1981.
  • 15. Peter Willett, Recent trends in hierarchic document clustering: a critical review, Information Processing and Management: an International Journal, v.24 n.5, p.577-597, 1988 doi:10.1016/0306-4573(88)90027-1


1992 ScatterGatherAClusterBasedApprDocumentColls
Author: Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey
title: Scatter/Gather: a cluster-based approach to browsing large document collections
journal: Proceedings of the 15th ACM SIGIR Conference retrieval
titleUrl: http://www.jopedersen.com/Publications/cutting92scattergather.pdf
doi: 10.1145/133160.133214
year: 1992