2006 IntegrationOfSemBipGraphRepAndMutRefForBioLitClust


Subject Headings: Text Clustering Algorithm, MEDLINE Abstract, Biomedical Ontology, Bipartite Graph, Mutual Refinement.

Notes

Cited By

Quotes

Abstract

We introduce a novel document clustering approach that overcomes those problems by combining a semantic-based bipartite graph representation and a mutual refinement strategy. The primary contributions of this paper are the following. First, we introduce a new representation of documents using a bipartite graph between documents and co-occurrence concepts in the documents. Second, we show how to enhance clustering quality by applying the mutual refinement strategy to the initial clustering results. Third, through experiments on MEDLINE documents, we show that our integrated method significantly enhances cluster quality and clustering reliability compared to existing clustering methods. On average, our approach improves cluster quality by 29.5% and clustering reliability by 26.3%, in terms of misclassification index, over Bisecting K-means with the best parameters.
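The abstract names the two ingredients (a document-concept bipartite graph and a mutual refinement pass over an initial clustering) but does not give their details. The following is only a minimal sketch of the general idea, not the authors' method: the function names `build_bipartite` and `refine`, the concept-weighting rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def build_bipartite(doc_concepts):
    """Return the concept -> documents side of the document-concept bipartite graph."""
    concept_docs = defaultdict(set)
    for doc, concepts in doc_concepts.items():
        for concept in concepts:
            concept_docs[concept].add(doc)
    return dict(concept_docs)


def refine(doc_concepts, labels, n_iter=10):
    """Alternately re-score concepts and reassign documents (hypothetical scheme).

    Assumed rule: a concept's affinity to a cluster is the fraction of its
    documents currently in that cluster; a document moves to the cluster with
    the highest summed affinity over its concepts.
    """
    concept_docs = build_bipartite(doc_concepts)
    labels = dict(labels)
    for _ in range(n_iter):
        # Step 1: documents' labels determine each concept's cluster affinities.
        weight = {}
        for concept, docs in concept_docs.items():
            counts = defaultdict(int)
            for doc in docs:
                counts[labels[doc]] += 1
            weight[concept] = {k: n / len(docs) for k, n in counts.items()}

        # Step 2: concepts' affinities determine each document's new label.
        new_labels = {}
        for doc, concepts in doc_concepts.items():
            score = defaultdict(float)
            for concept in concepts:
                for cluster, w in weight[concept].items():
                    score[cluster] += w
            current = labels[doc]
            if not score:
                new_labels[doc] = current  # no concepts: keep the label
                continue
            best = max(score.values())
            if score.get(current, 0.0) == best:
                new_labels[doc] = current  # ties favour the current label
            else:
                new_labels[doc] = min(k for k, s in score.items() if s == best)

        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


# Toy data (invented): d3 starts in the wrong cluster.
doc_concepts = {
    "d1": {"gene", "protein"},
    "d2": {"gene", "protein"},
    "d3": {"gene", "protein"},
    "d4": {"heart", "artery"},
    "d5": {"heart", "artery"},
    "d6": {"heart", "artery"},
}
initial = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1, "d6": 1}
refined = refine(doc_concepts, initial)
```

In this toy run, `d3` shares both of its concepts with `d1` and `d2`, so the refinement pass moves it into their cluster while the other assignments stay unchanged.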




Author: Xiaohua Hu, Illhoi Yoo, Il-Yeol Song
Title: Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering
Journal: Proceedings of the ACM SIGKDD Conference
Title URL: http://www.cis.drexel.edu/faculty/thu/research-papers/rtpp705 Yoo.pdf
DOI: 10.1145/1150402.1150505
Year: 2006