2012 MatchingProductTitlesUsingWebba

From GM-RKB

Subject Headings: Product Record Entity Resolution.

Notes

Cited By

Quotes

Author Keywords

Abstract

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the Cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.

1. INTRODUCTION

Consumers are increasingly turning to online shopping sites (e.g., shopping.yahoo.com, shopping.bing.com) to research products prior to making buying decisions. These sites leverage data from multiple sources that include aggregator sites (e.g., PriceGrabber, CNET), merchant sites (e.g., buy.com, amazon.com) containing sales offers, reviews sites (e.g., epinions.com, newegg.com) containing reviews and ratings information, and auction sites (e.g., eBay, Amazon marketplace). The sites consolidate the specifications, prices, ratings, and reviews information for each product present in the various source data feeds, and present a unified view of each product to the user. Thus, the sites allow users to compare prices offered by different vendors for a product or browse through reviews for a product from different web sites. A key challenge to providing a unified product view is matching records for the same product from different data feeds. This record matching problem has been extensively studied in the literature under different names like record linkage [13], duplicate detection [12, 20], entity resolution [2, 1], and merge/purge [14]. Much of this previous work has focused on matching records with multiple overlapping attributes. The general approach is to first compute similarity scores for each attribute using traditional similarity metrics like Jaccard similarity, Cosine similarity, edit distance, etc. [17, 12], and then combine these attribute-level similarity scores to derive record-level matching scores using unsupervised and supervised techniques. However, as discussed below, prior approaches may not work well in our product record matching scenario.

In our product setting, records in the different data feeds may have heterogeneous schemas and thus contain diverse sets of attribute values. For example, sales offer feeds from merchants may contain information about prices but not reviews, while a feed from a blogging or a product forum site may contain ratings and reviews information but not prices. Furthermore, even if these attributes are present, their values for the same product may vary widely across feeds making them unreliable to use for matching. This is because different vendors may price the same product very differently; similarly, product reviews across feeds may be written by diverse users with disparate viewpoints and writing styles. In the absence of universally agreed upon unique product identifiers between the various information providers, the only attribute that uniquely identifies a product and that is consistently present in all the feeds is product title. So in this paper, we rely on product titles for matching records in the different data feeds.

A product title is a short unstructured textual description that uniquely identifies a product. Previous work has used similarity metrics like Jaccard similarity, Cosine similarity, edit distance, etc. for determining the similarity between string-valued attributes like person names or addresses. However, different feeds may use diverse product title representations for the same product. Consequently, traditional string similarity metrics may not work well for matching product titles.

Consider the 3 product title pairs from product aggregators PriceGrabber and CNET in rows (a)-(c) of Table 1. The pair of titles in row (a) are different representations of the same camera product with model number “d200”. The brand name “nikon” is missing in Title 1 while Title 2 does not contain relevant descriptive keywords like “digital slr camera” and other product specifications. Even though the two titles correspond to the same product, the Jaccard similarity between their token sets (using white space as delimiter) is only 1/22 ≈ 0.045. In contrast, the pair of titles in row (b) correspond to different camera models but have a much higher Jaccard similarity of 0.125. Now, based on the title pairs in rows (a) and (b), it may be tempting to declare two titles as matching if they have identical model numbers. However, besides the obvious difficulty of identifying a wide range of model number formats within titles, this strategy will not work in many cases. For instance, the two titles in row (c) have the same model number but they represent different products. Specifically, Title 2 corresponds to a camera while Title 1 refers to its accessory – a camera battery charger.
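To make the token-set computation concrete, the following is a minimal Python sketch of Jaccard similarity over white-space-delimited tokens. The two titles are hypothetical stand-ins chosen for illustration (they are not the Table 1 titles), and the function names `tokenize` and `jaccard` are likewise assumptions of this sketch.

```python
from __future__ import annotations


def tokenize(title: str) -> set[str]:
    """Lower-case a title and split it on white space."""
    return set(title.lower().split())


def jaccard(title_a: str, title_b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the two token sets."""
    a, b = tokenize(title_a), tokenize(title_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical titles for the same product: one long and brand-less,
# one short with the brand. They share only the model number.
long_title = "d200 10.2 megapixel digital slr camera body with lens kit"
short_title = "nikon d200"

print(round(jaccard(long_title, short_title), 3))  # 1 shared token / 11 total -> 0.091
```

As in row (a), a single shared token among many distinct ones yields a very low score even though both strings describe the same product.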

Clearly, traditional similarity metrics like Jaccard similarity fare very poorly at the task of detecting matching product titles – the first matching title pair in Table 1 has a much lower similarity score compared to the other two non-matching pairs. The problem with simple Jaccard similarity is that it treats all tokens equally. However, as we saw earlier, tokens like model number are more important and thus should be assigned a higher weight. A popular weight assignment method in the IR literature [17] is Inverse Document Frequency (IDF) – it assigns each token [math]\displaystyle{ w }[/math] a weight of [math]\displaystyle{ I_w = \log \frac{N}{N_w} }[/math], where [math]\displaystyle{ N }[/math] is the total number of records in the feeds and [math]\displaystyle{ N_w }[/math] is the number of records that contain token [math]\displaystyle{ w }[/math]. The Cosine similarity metric (with IDF token weights) between a pair [math]\displaystyle{ t, t' }[/math] of titles is then given by [math]\displaystyle{ \frac{\sum_{w \in (t \cap t')} I_w^2}{\sqrt{\sum_{w \in t} I_w^2} \, \sqrt{\sum_{w \in t'} I_w^2}} }[/math]. We show the Cosine similarity scores for the title pairs in the final column of Table 1 (the IDF scores are computed over a corpus of 30M product titles obtained from PriceGrabber). As can be seen, taking IDF weights into account, the Cosine similarity score of the matching title pair in row (a) increases but is still much below the similarity scores for non-matching titles in rows (b) and (c).
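The IDF-weighted Cosine similarity described above can be sketched in a few lines of Python. This is an illustrative sketch only: the four-title corpus is hypothetical (the 30M-title PriceGrabber corpus is not available here), the names `idf_weights` and `idf_cosine` are assumptions, and tokens absent from the corpus are given weight 0 as a simplification.

```python
from __future__ import annotations

import math
from collections import Counter


def idf_weights(corpus: list[str]) -> dict[str, float]:
    """IDF weight log(N / N_w) for every token w in the corpus."""
    n = len(corpus)
    doc_freq: Counter = Counter()
    for title in corpus:
        doc_freq.update(set(title.lower().split()))  # count each token once per title
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def idf_cosine(t1: str, t2: str, idf: dict[str, float]) -> float:
    """Cosine similarity between two token sets, each token weighted by its IDF."""
    a, b = set(t1.lower().split()), set(t2.lower().split())
    num = sum(idf.get(w, 0.0) ** 2 for w in a & b)
    norm_a = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in a))
    norm_b = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in b))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus standing in for the product feed.
corpus = [
    "nikon d200 camera",
    "nikon d40 camera",
    "canon eos camera",
    "canon eos lens",
]
idf = idf_weights(corpus)

# Rare tokens such as a model number outweigh common ones such as "camera".
print(idf["d200"] > idf["camera"])  # True
print(idf_cosine(corpus[0], corpus[1], idf) > idf_cosine(corpus[0], corpus[2], idf))  # True
```

Note that each weight depends only on global token frequency, so the same token receives the same weight in every title regardless of context.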

One of the main reasons for the poor performance of the Cosine similarity metric is that product titles for the same entity across feeds have fairly diverse representations. Some titles could be missing important tokens like brand (e.g., "nikon" in Title 1 of row (a)), while others may contain extraneous tokens corresponding to product specifications that are not critical for identifying the product (e.g., Title 1 in row (a)). This hurts the similarity score of the matching title pair in row (a). Furthermore, in some instances, IDF weights do not accurately capture the importance of tokens. In Table 2, we show the IDF weights (in parentheses) assigned to tokens belonging to the titles in Table 1.

As can be seen, in Title 1 of row (a), several extraneous tokens like "3872" and "2592" are assigned higher weights than the model number "d200". Similarly, in Title 1 of row (c), important tokens like "charger", "2800mah" and "rechargeable" that describe the battery charger accessory are assigned low weights compared to the model number "fe140". A shortcoming of IDF is that it assigns a single weight to each token proportional to the inverse of its global frequency in the feeds independent of the context. Essentially, IDF does not take into account the relevance of a token to its product title context when computing its weight.

To overcome the above-mentioned challenges associated with using traditional similarity metrics to match product titles, we leverage the web. Our approach to matching product titles involves three key steps: (1) Enrich each product title by filling in missing tokens that frequently occur in the broader context of the product title, (2) Compute an importance score for each token in the enriched product title based on its power to identify the underlying product, and (3) Match enriched titles with tokens weighted by their importance scores. We use web search engines in the first two steps to determine an expanded context for each product title in order to enrich it, and to compute importance scores. Web search engines allow us to efficiently sift through billions of pages to identify the most relevant web pages for a product.
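The three steps can be illustrated with a toy Python sketch. Everything concrete in it is an assumption made for illustration and not the paper's actual method: the in-memory `PAGES` list stands in for a web search engine, the `min_fraction` threshold for "frequent" tokens is arbitrary, and the importance score used here (the average share of the other enriched tokens found in pages retrieved by a token on its own) is just one simple reading of "power to identify the underlying product".

```python
from __future__ import annotations

import math
from collections import Counter

# Hypothetical mini "web" standing in for pages a real search engine would index.
PAGES = [
    "nikon d200 digital slr camera review",
    "nikon d200 digital slr camera body price",
    "buy nikon d200 camera",
    "canon eos 40d digital slr camera",
    "camera battery charger for all brands",
]


def search(query: str, k: int = 3) -> list[set[str]]:
    """Toy search: rank pages by number of shared query tokens, return top-k token sets."""
    q = set(query.lower().split())
    pages = [set(p.split()) for p in PAGES]
    ranked = sorted(range(len(pages)), key=lambda i: (-len(q & pages[i]), i))
    return [pages[i] for i in ranked[:k] if q & pages[i]]


def enrich(title: str, min_fraction: float = 0.6) -> set[str]:
    """Step 1: add tokens that occur in at least min_fraction of the title's results."""
    tokens = set(title.lower().split())
    results = search(title)
    if not results:
        return tokens
    counts = Counter(w for page in results for w in page)
    return tokens | {w for w, c in counts.items() if c / len(results) >= min_fraction}


def importance(enriched: set[str]) -> dict[str, float]:
    """Step 2 (assumed scoring): how well each token alone retrieves the other tokens."""
    scores = {}
    for w in enriched:
        others = enriched - {w}
        results = search(w)
        if not results or not others:
            scores[w] = 0.0
            continue
        scores[w] = sum(len(others & p) / len(others) for p in results) / len(results)
    return scores


def weighted_cosine(wa: dict[str, float], wb: dict[str, float]) -> float:
    """Step 3: Cosine similarity between two importance-weighted token vectors."""
    num = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    den = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(sum(v * v for v in wb.values()))
    return num / den if den else 0.0


def match_score(title_a: str, title_b: str) -> float:
    return weighted_cosine(importance(enrich(title_a)), importance(enrich(title_b)))


# Hypothetical titles for the same product: one lacks the brand, the other the description.
print(sorted(enrich("nikon d200")))  # ['camera', 'd200', 'digital', 'nikon', 'slr']
print(round(match_score("d200 digital slr camera 10.2 megapixel", "nikon d200"), 3))  # 1.0
```

On this toy data the brand-less title gains "nikon", the short title gains "digital slr camera", and the specification tokens "10.2" and "megapixel" receive zero importance because they retrieve nothing, so the two enriched titles match despite sharing only one raw token.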

Our main contributions can be summarized as follows:

1) We present an end-to-end system architecture for matching product titles with varying formats. The system is completely unsupervised, and performs a sequence of tasks starting with enrichment of product titles and computation of token weights, followed by matching the potentially co-referent enriched title pairs.

2) We propose a method for enriching product titles with (missing) keywords that appear frequently in the context defined by search engine results.

3) We assign an importance score to each token in the enriched title based on its ability to retrieve other tokens (of enriched title) in search results. Our matching algorithm calculates the Cosine similarity between enriched title pairs with tokens weighted by their importance scores.

4) We propose an optimization for reducing the number of calls to the search engine. Our optimization predicts approximate enrichments and importance scores for a product title based on past enrichments for similar-structured product titles.

5) In experiments with real-life shopping datasets, our matching algorithm achieves higher F1 scores compared to IDF-based Cosine similarity. Furthermore, our enrichment and importance score prediction optimizations reduce the number of search queries significantly without adversely impacting matching accuracy.

References

2012 MatchingProductTitlesUsingWebba
Author: Rajeev Rastogi, Vishrawas Gopalakrishnan, Suresh Parthasarathy Iyengar, Amit Madaan, Srinivasan Sengamedu
Title: Matching Product Titles Using Web-based Enrichment
DOI: 10.1145/2396761.2396839