Semantic Similarity Measure

From GM-RKB

A semantic similarity measure is a similarity measure that approximates a semantic relationship (between two (or more) meaning carriers).



References

2015

  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Semantic_similarity#Topological_similarity Retrieved:2015-2-10.
    • There are essentially two types of approaches that calculate topological similarity between ontological concepts:
      • Edge-based: which use the edges and their types as the data source;
      • Node-based: in which the main data sources are the nodes and their properties.
    • Other measures calculate the similarity between ontological instances:
      • Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
      • Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts the instances represent
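The edge-based family above can be illustrated with a minimal sketch: treat the ontology as a graph, count the edges on the shortest path between two concepts, and convert that distance into a similarity score. The toy taxonomy, the concept names, and the specific scoring function sim = 1 / (1 + path length) below are illustrative assumptions, not a fixed standard.

```python
from collections import deque

# Toy is-a taxonomy (hypothetical concepts) as an undirected adjacency list.
taxonomy = {
    "entity": ["animal", "vehicle"],
    "animal": ["entity", "dog", "cat"],
    "vehicle": ["entity", "car"],
    "dog": ["animal"],
    "cat": ["animal"],
    "car": ["vehicle"],
}

def shortest_path_length(graph, start, goal):
    """Breadth-first search: number of edges between two concepts."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # concepts are not connected

def path_similarity(graph, a, b):
    """Edge-based similarity: 1 / (1 + shortest path length)."""
    d = shortest_path_length(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

# Siblings (dog, cat) share a direct parent, so they score higher
# than concepts linked only through the root (dog, car).
print(path_similarity(taxonomy, "dog", "cat"))  # 2 edges -> 1/3
print(path_similarity(taxonomy, "dog", "car"))  # 4 edges -> 1/5
```

Node-based measures would instead weight each concept by a property such as its information content, rather than relying on raw edge counts.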

2014

  • (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/Semantic_similarity#Statistical_similarity Retrieved:2014-12-10.
    • Statistical Similarity
      • LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
      • PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
      • SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
      • GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
      • ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
      • NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
      • NCD (Normalized Compression Distance)
      • ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP
      • SSA (Salient Semantic Analysis) which indexes terms using salient concepts found in their immediate context.
      • n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest-path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
      • VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms (−) performance depends on choosing specific dimensions
      • BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high-dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another (−) highly experimental, requires nontrivial SOM calculation
      • SimRank
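Several of the measures above (PMI, SOC-PMI, NGD) are built on co-occurrence statistics. As a minimal sketch of the core PMI idea, the snippet below estimates pointwise mutual information from a tiny hand-made set of co-occurrence windows; the corpus, window representation, and word choices are hypothetical, and a real system would use counts from a large corpus or search-engine hit counts.

```python
import math

# Tiny toy corpus (hypothetical): each set is one co-occurrence window.
windows = [
    {"coffee", "cup", "morning"},
    {"coffee", "morning", "news"},
    {"cup", "tea", "morning"},
    {"tea", "leaves"},
]

def pmi(w1, w2, windows):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ),
    with probabilities estimated as window frequencies."""
    n = len(windows)
    p1 = sum(w1 in w for w in windows) / n
    p2 = sum(w2 in w for w in windows) / n
    p12 = sum(w1 in w and w2 in w for w in windows) / n
    if p12 == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(p12 / (p1 * p2))

# Positive PMI: "coffee" and "morning" co-occur more often than chance.
print(round(pmi("coffee", "morning", windows), 3))
```

SOC-PMI extends this by comparing the lists of high-PMI neighbor words of the two target words, and NGD replaces corpus counts with search-engine page counts.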

2008

2007

2006

1989