Jaro-Winkler Distance Measure


A Jaro-Winkler Distance Measure is a unit edit distance measure that modifies the weights of poorly matching string pairs that share a common string prefix.



References

2015

  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Jaro–Winkler_distance Retrieved:2015-3-24.
    • In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

2009


  • http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
    • QUOTE: String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers. In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.

      Jaro-Winkler Distance

      There are a family of distance measures defined by the U.S. Census Bureau for comparing single person names. The original metric was defined by Matt Jaro and later refined by Bill Winkler.

2006

2003

  • (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
    • QUOTE: A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings [math]s = a_1\; \cdots \; a_K[/math] and [math]t= b_1 \;\cdots\;b_L[/math], define a character [math]a_i[/math] in [math]s[/math] to be common with [math]t[/math] if there is a [math]b_j = a_i[/math] in [math]t[/math] such that [math]i - H \leq j \leq i + H[/math], where [math]H = \frac{\min(|s|, |t|)}{2}[/math]. Let [math]s' = a_1' \; \cdots \; a_{K'}'[/math] be the characters in [math]s[/math] which are common with [math]t[/math] (in the same order they appear in [math]s[/math]) and let [math]t'= b_1' \;\cdots\;b_{L'}'[/math] be analogous; now define a transposition for [math]s'[/math], [math]t'[/math] to be a position [math]i[/math] such that [math]a'_i \neq b'_i[/math]. Let [math]T_{s',t'}[/math] be half the number of transpositions for [math]s'[/math] and [math]t'[/math]. The Jaro similarity metric for [math]s[/math] and [math]t[/math] is

      [math]\text{Jaro}(s, t) = \frac{1}{3}\cdot \left( \frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|}\right)[/math]

      A variant of this due to Winkler (1999) also uses the length [math]P[/math] of the longest common prefix of [math]s[/math] and [math]t[/math]. Letting [math]P'= \min(P, 4)[/math] we define

      [math]\text{Jaro-Winkler}(s,t) = \text{Jaro}(s, t) + \frac{P'}{10}\cdot (1 - \text{Jaro}(s, t))[/math].

      The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names.)
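
The definitions quoted above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `jaro` and `jaro_winkler` and the parameters `max_prefix` and `scale` are hypothetical; the window half-width H is computed with integer division; common characters are paired greedily one-to-one (as most implementations do, so that s' and t' have the same length); and the prefix length is capped at 4, i.e. P' = min(P, 4), as in Winkler's usual formulation. Many libraries use a slightly different window, so their scores can differ from this sketch on some inputs.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1], following the quoted Cohen et al. description."""
    if not s or not t:
        # Assumption: two empty strings count as identical; otherwise no similarity.
        return 1.0 if s == t else 0.0

    # Window half-width H = min(|s|, |t|) / 2 (integer division is an assumption).
    h = min(len(s), len(t)) // 2

    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    for i, ch in enumerate(s):
        # Look for the first unused equal character of t within [i - H, i + H].
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                break

    # s' and t': the common characters, each in its original order.
    s_common = [c for c, used in zip(s, s_matched) if used]
    t_common = [c for c, used in zip(t, t_matched) if used]
    m = len(s_common)
    if m == 0:
        return 0.0

    # T is half the number of positions where s' and t' disagree.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3


def jaro_winkler(s: str, t: str, max_prefix: int = 4, scale: float = 0.1) -> float:
    """Jaro score boosted by the common-prefix length, capped at max_prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


if __name__ == "__main__":
    print(jaro("MARTHA", "MARHTA"))
    print(jaro_winkler("MARTHA", "MARHTA"))
```

For the pair MARTHA / MARHTA, all six characters are common and the two swapped letters give T = 1, so the Jaro score is (1 + 1 + 5/6) / 3 = 17/18; the shared prefix "MAR" (P' = 3) then moves the score three tenths of the way toward 1.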

1999

1997

  • (Porter & Winkler, 1997) ⇒ Edward H. Porter, and William E. Winkler. (1997). “Approximate String Comparison and its Effect on an Advanced Record Linkage System.” U.S. Bureau of the Census, Research Report.

1995

1990

  • (Winkler, 1990) ⇒ William E. Winkler. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” In: Proceedings of the Section on Survey Research Methods, American Statistical Association.

1989