2007 TheGoogleSimilarityDistance

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Semantic Similarity Measure, Normalized Google Distance.

Notes

Cited By

2008

Quotes

Abstract

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database". We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in an a mean agreement of 87 percent with the expert crafted WordNet categories


References

  • [1] J.P. Bagrow, D. ben-Avraham, On the Google-fame of scientists and other populations, AIP Conference Proceedings 779:1(2005), 81–89.
  • [2] C.H. Bennett, P. G´acs, M. Li, P.M.B. Vit´anyi, W. Zurek, Information Distance, IEEE Trans. Information Theory, 44: 4 (1998), 1407–1423.
  • [3] C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories, Scientific American, June 2003, 76–81.
  • [4] C.J.C. Burges. A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2:2(1998),121–167.
  • [5] Automatic Meaning Discovery Using Google: 100 Experiments in Learning WordNet Categories, 2004, http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf
  • [6] R. Cilibrasi, Complearn Home, http://www.complearn.org/
  • [7] R. Cilibrasi, R. de Wolf, P. Vitanyi. Algorithmic clustering of music based on string compression, Computer Music J., 28:4(2004), 49-67.
  • [8] R. Cilibrasi, P. Vitanyi. Clustering by compression, IEEE Trans. Information Theory, 51:4(2005), 1523- 1545.
  • [9] R. Cilibrasi, P. Vitanyi, Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).
  • [10] R. Cilibrasi, P. Vitanyi, A New Quartet Tree Heuristic for Hierarchical Clustering, http://arxiv.org/abs/cs.DS/0606048
  • [11] P. Cimiano, S. Staab, Learning by Googling, SIGKDD Explorations, 6:2(2004), 24–33.
  • [12] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
  • [13] J.-P. Delahaye, Classer musiques, langues, images, textes et genomes, Pour La Science, 317(March 2004), 98–103.
  • [14] The basics of Google search, http://www.google.com/help/basics.html.
  • [15] L.G. Kraft, A device for quantizing, grouping and coding amplitude modulated pulses. Master’s thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, Mass., 1949.
  • [16] D. Graham-Rowe, A search for meaning, New Scientist, 29 January 2005, p.21.
  • [17] Slashdot, From January 29, 2005: http://science.slashdot.org/article.pl?sid=05/01/29/1815242&tid=217&tid=14
  • [18] Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm
  • [19] P. Cimiano, S. Staab, Learning by googling, ACM SIGKDD Explorations Newsletter, 6:2 (December 2004), 24 – 33
  • [20] H. Muir, Software to unzip identity of unknown composers, New Scientist, 12 April 2003.
  • [21] K. Patch, Software sorts tunes, Technology Research News, April 23/30, 2003.
  • [22] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure, Comm. ACM, 38:11(1995),33–38.
  • [23] F Keller, M Lapata, Using the web to obtain frequencies for unseen bigrams, Computational Linguistics, 29:3(2003), 459–484.
  • [24] A. N. Kolmogorov. Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7.
  • [25] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149–154.
  • [26] M. Li, X. Chen, X. Li, B. Ma, P. Vitanyi. The similarity metric, Iaa IEEE Trans. Information Theory, 50:12(2004), 3250- 3264.
  • [27] M. Li, P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications, 2nd Ed., Springer-Verlag, New York, 1997.
  • [28] M. Li and P.M.B. Vit´anyi. Algorithmic Complexity, pp. 376–382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
  • [29] M. Li and P.M.B. Vit´anyi, Reversibility and adiabatic computation: trading time and space for energy, Proceedings of Royal Society of London, Series A, 452(1996), 769-789.
  • [30] S. L. Reed, D. B. Lenat. Mapping ontologies into cyc. Proceedings of AAAI Conference 2002 Workshop on Ontologies for the Semantic Web, Edmonton, Canada. http://citeseer.nj.nec.com/509238.html
  • [31] D. H. Rumsfeld, The digital revolution, originally published June 9, 2001, following a European trip. In: H. Seely, The Poetry of D.H. Rumsfeld, 2003, http://slate.msn.com/id/2081042/
  • [32] C. E. Shannon. A mathematical theory of communication. Bell Systems Technical J., 27(1948), 379–423 and 623–656.
  • [33] George A. Miller et.al, WordNet, A Lexical Database for the English Language, Cognitive Science Lab, Princeton University, http://www.cogsci.princeton.edu/ wn
  • [34] E. Terra and C. L. A. Clarke. Frequency Estimates for Statistical Word Similarity Measures. HLT/NAACL 2003, Edmonton, Alberta, May 2003. 37/162
  • [35] M. E. Lesk, Word-word associations in document retrieval systems, American Documentation, 20:1(1969), 27–38.
  • [36] P.-N. Tan, Vipin Kumar, J. Srivastava, Selecting the right interestingness measure for associating patterns. Proceedings of ACM-SIGKDD Conference Knowledge Discovery and Data Mining, 2002, 491–502.
  • [37] Thomas K. Landauer and S. Dumais, A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge, Psychol. Rev., 104(1997), 211–240.
  • [38] Corpus collosal: How well does the world wide web represent human language? The Economist, January 20, 2005. http://www.economist.com/science/displayStory.cfm?story id=3576374
  • [39] Eamonn Keogh, S. Lonardi, and C.A. Rtanamahatana, Toward parameter-free data mining, In: Proceedings10th ACM SIGKDD Intn’l Conference Knowledge Discovery and Data Mining, 2004, 206–215.
  • [40] C. Costa Santos, J. Bernardes, P.M.B. Vitanyi, L. Antunes, Clustering fetal heart rate tracings by compression, In: Proceedings19th IEEE Symp. Computer-based Medical Systems, 2006, 685-690.

,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2007 TheGoogleSimilarityDistanceRudi L. Cilibrasi
Paul M.B. Vitanyi
The Google Similarity Distancehttp://arxiv.org/PS cache/cs/pdf/0412/0412098v3.pdf10.1109/TKDE.2007.48