Semantic Similarity Measure

A Semantic Similarity Measure is a similarity measure that approximates a semantic relationship between two or more meaning carriers.




  • (Wikipedia, 2021) ⇒ Retrieved:2021-5-29.
    • Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.[1] [2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.

      For example, "car" is similar to "bus", but is also related to "road" and "driving".

      Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms/concepts. For example, a naive metric for comparing concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. Proposed semantic similarity / relatedness measures are evaluated in two main ways. The first is based on datasets designed by experts and composed of word pairs with estimated degrees of semantic similarity / relatedness. The second is based on integrating the measures into specific applications such as information retrieval, recommender systems, natural language processing, etc.

  1. Harispe S.; Ranwez S.; Janaqi S.; Montmain J. (2015). "Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8 (1): 1–254.
  2. Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). "The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1–30. doi:10.1017/S0269888917000029.
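
The two estimation routes described in the excerpt above (a shortest-path metric over a taxonomy, and a vector space comparison of textual contexts) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the toy taxonomy `IS_A`, the context counts, and the function names `build_undirected`, `path_length`, and `cosine` are all hypothetical and chosen only for the example.

```python
import math
from collections import Counter, deque

# Hypothetical toy taxonomy (child -> "is a" parents); not taken from a real ontology.
IS_A = {
    "car": ["motor_vehicle"],
    "bus": ["motor_vehicle"],
    "motor_vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["entity"],
    "road": ["way"],
    "way": ["entity"],
}

def build_undirected(is_a):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph

def path_length(graph, source, target):
    """Number of edges on the shortest path (breadth-first search), or None."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

TAXONOMY = build_undirected(IS_A)
print(path_length(TAXONOMY, "car", "bus"))   # 2 edges: car, motor_vehicle, bus
print(path_length(TAXONOMY, "car", "road"))  # 5 edges, passing through "entity"

# Hypothetical context-word counts, standing in for corpus statistics.
car_ctx = Counter({"drive": 4, "road": 3, "wheel": 2})
bus_ctx = Counter({"drive": 3, "road": 2, "passenger": 4})
print(cosine(car_ctx, bus_ctx))
```

In this toy setting the path metric places "car" closer to "bus" than to "road", which matches the similarity reading in the excerpt, while the context-count cosine reflects relatedness through shared contexts. Real systems would typically weight edges, use information content, or reduce the vector space first.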




  • (Wikipedia, 2015) ⇒ Retrieved:2014-12-10.
    • Statistical Similarity.
      • LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
      • PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
      • SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
      • GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
      • ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
      • NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
      • NCD (Normalized Compression Distance)
      • ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP.
      • SSA (Salient Semantic Analysis) which indexes terms using salient concepts found in their immediate context.
      • n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then used to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
      • VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
      • BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high-dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation
      • SimRank
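
Three of the statistical measures listed above (PMI, NGD, and NCD) reduce to short formulas over counts or compressed lengths, and can be sketched as below. This is a hedged illustration only: the function names and parameters are hypothetical, the counts would in practice come from a corpus or search engine, and zlib is used here merely as one convenient stand-in for the compressor that NCD requires.

```python
import math
import zlib

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information in bits: log2(p(x,y) / (p(x) p(y))).

    Arguments are raw counts; n_total is the number of observations.
    """
    if n_xy == 0:
        return float("-inf")
    return math.log2(n_xy * n_total / (n_x * n_y))

def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google distance from hit counts.

    f_x, f_y: pages containing each term; f_xy: pages containing both;
    n_pages: assumed index size, taken to be larger than f_x and f_y.
    """
    if f_xy == 0:
        return float("inf")
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))

def ncd(x, y):
    """Normalized compression distance between two byte strings, using zlib."""
    c_x = len(zlib.compress(x))
    c_y = len(zlib.compress(y))
    c_xy = len(zlib.compress(x + y))
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)

# Illustrative inputs only; the numbers are made up for the example.
print(pmi(20, 20, 50, 100))            # terms co-occur more often than chance
print(ngd(1000, 1000, 10, 10**6))      # terms rarely appear together
text_a = b"the car drives on the road " * 20
text_b = b"semantic similarity of documents and terms " * 20
print(ncd(text_a, text_a), ncd(text_a, text_b))
```

Because a real compressor is imperfect, the NCD of a text with itself is small but generally not exactly zero; what matters in practice is the relative ordering of distances.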