2006 AComprehensiveComparStudyOfDocClustForBiomedDigiLibMEDLINE


Subject Headings:

Notes

Cited By

Quotes

Abstract

Document clustering has been used for better document retrieval, document browsing, and text mining in digital libraries. In this paper, we perform a comprehensive comparison study of various document clustering approaches, namely three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering, in terms of efficiency, effectiveness, and scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to address traditional information retrieval problems such as the synonym, hypernym, and hyponym problems. We conducted fairly extensive experiments, using evaluation metrics such as misclassification index, F-measure, cluster purity, and entropy, on very large article sets from MEDLINE, the largest biomedical digital library.
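To make two of the evaluation metrics named in the abstract concrete, the following is a minimal Python sketch of cluster purity and entropy computed against gold class labels. It uses the common textbook definitions (size-weighted majority fraction for purity, size-weighted Shannon entropy in bits for entropy, with no normalization by the number of classes); the paper's exact formulas may differ. The function name `purity_and_entropy` and its arguments are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against gold class labels.

    Assumed definitions (not taken from the paper):
      purity  = sum over clusters of (majority-class count) / N
      entropy = sum over clusters of (cluster size / N) * H(cluster),
                where H is Shannon entropy (base 2) of the class mix.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("at least one document is required")

    # Count how many documents of each class fall into each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    purity = 0.0
    entropy = 0.0
    for counts in per_cluster.values():
        size = sum(counts.values())
        # Purity credits each cluster with its most frequent class.
        purity += max(counts.values()) / n
        # Entropy of the class distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy
```

For example, `purity_and_entropy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])` scores a clustering in which one cluster mixes two classes. Higher purity and lower entropy both indicate clusters that align better with the reference classes.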





Page: 2006 AComprehensiveComparStudyOfDocClustForBiomedDigiLibMEDLINE
Author: Xiaohua Hu, Illhoi Yoo
Title: A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE
Journal: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries
Title URL: http://www.mendeley.com/research/a-comprehensive-comparison-study-of-document-clustering-for-a-biomedical-digital-library-medline/
Year: 2006