2012 TailoringEntityResolutionforMat


Subject Headings:

Notes

Cited By

2014

Quotes

Abstract

Product matching is a challenging variation of entity resolution to identify representations and offers referring to the same product. Product matching is highly difficult due to the broad spectrum of products, many similar but different products, frequently missing or wrong values, and the textual nature of product titles and descriptions. We propose the use of tailored approaches for product matching based on a preprocessing of product offers to extract and clean new attributes usable for matching. In particular, we propose a new approach to extract and use so-called product codes to identify products and distinguish them from similar product variations. We evaluate the effectiveness of the proposed approaches with challenging real-life datasets with product offers from online shops. We also show that the UPC information in product offers is often error-prone and can lead to insufficient match decisions.

1. INTRODUCTION

Product matching deals with the identification of different descriptions or offers referring to the same real-world product. Given that many thousands of online shops sell millions of diverse products over the web, product matching has become increasingly important. For example, it is a critical task for aggregating offers for the same product within price comparison portals (e.g., PriceGrabber), online marketplaces (e.g., Amazon.com), or product search engines (e.g., Google Product Search) [5, 10]. Product matching is a special case of entity resolution (matching) that is needed to identify equivalent entities or duplicates within a data source or between data sources. While entity resolution has received a great deal of research attention (see [3, 7] for recent surveys), little work has been devoted to product matching. Product matching for e-commerce websites introduces several specific challenges that make this problem much harder than other forms of entity resolution, e.g., matching records about publications. In particular, there is a huge degree of heterogeneity since product offers come from thousands of merchants using different names and descriptions of the products. Furthermore, offers frequently have missing or wrong values and are mostly not well structured but mix different product characteristics in text fields such as product name or description [5].

Figure 1: Product offers related to the Canon VIXIA HF camcorder in Google Product Search

For illustration, we show in Figure 1 some result offers for the product search engine Google Product Search and a specific camcorder. The offers refer to different merchants that use heterogeneous names, descriptions, and other attributes for the same product and may also contain misspellings and other errors. For example, the product names for the considered product Canon Vixia HF S10 partially include specific technical details that may complicate product matching, e.g., to find out that (only) the first three entries refer to the same product. Note that Google already performs a product matching since the first entry refers to offers from 52 merchants. The remaining duplicates in the example show that Google's clustering is imperfect and needs to be improved.

A recent benchmark study [8] evaluated current entity matching prototypes and a commercial tool on different real-world match tasks including e-commerce product matching. Despite the use of small-sized datasets, current solutions could achieve only about 30-70% F-measure for product matching (compared to up to 98% F-measure for publication matching), underlining the high difficulty of product matching. In [5] a machine learning approach is presented to match product offers to comprehensive product descriptions. Their evaluation …

3. PRODUCT CODE EXTRACTION

One key observation is the frequent existence of specific product codes for certain product types that can help to differentiate similar but different products. A product code is a manufacturer-specific identifier that typically appears in the product title and description. In general, it can be any sequence consisting of alphabetic, special, and numeric characters split by an arbitrary number of white spaces. In the example of Figure 1 the term HF S10 is a product code for the first three entries. A product code is under full control of the manufacturer and thus we observe very good data quality, i.e., the product code is usually correct if it is available. Unfortunately, product codes are generally not provided as a separate attribute but appear only within the product title or description.

Figure 5: Example code extraction for Hahnel HL-XF51 7.2V 680mAh for Sony NP-FF51 (manufacturer: Hahnel).

The extraction of the product code of the offered product is non-trivial as the title and the description of the product offer contain a good deal of unstructured information. Furthermore, accessory products may also contain multiple product codes, e.g., one for the accessory itself and one for the target product. Product code extraction is a special case of product attribute extraction that identifies attribute-value pairs out of unstructured textual descriptions (e.g., [4]). However, such approaches typically require labeled (tagged) training data whereas our focused product code extraction does not need any training data but employs the rich knowledge of search engines. The product code extraction algorithm is illustrated in Algorithm 1 and will be described next. For illustration purposes Figure 5 demonstrates the extraction workflow for the sample product title Hahnel HL-XF51 7.2V 680mAh for Sony NP-FF51.

The first step, feature extraction, applies regular expressions to extract common features such as dimensions, weight specifications, colors, etc. In our example the voltage (7.2V) and energy (680mAh) are extracted. The next step, tokenization, breaks the title string into words. Tokens are separated by white space and punctuation.
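The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two feature patterns and the function names are hypothetical, and keeping hyphens inside tokens is an assumption made here so that a code such as HL-XF51 survives as one token.

```python
import re

# Hypothetical feature patterns; the paper does not list its actual expressions.
FEATURE_PATTERNS = {
    "voltage": re.compile(r"\b\d+(?:\.\d+)?\s?V\b"),
    "capacity": re.compile(r"\b\d+\s?mAh\b"),
}

def extract_features(title):
    """Return (features, remaining title) with matched feature strings cut out."""
    features = {}
    for name, pattern in FEATURE_PATTERNS.items():
        match = pattern.search(title)
        if match:
            features[name] = match.group(0)
            # Replace the matched feature with a space so it is not tokenized later.
            title = title[:match.start()] + " " + title[match.end():]
    return features, title

def tokenize(text):
    """Split on white space and punctuation, keeping hyphens inside tokens (assumption)."""
    return [token for token in re.split(r"[^\w\-]+", text) if token]

features, rest = extract_features("Hahnel HL-XF51 7.2V 680mAh for Sony NP-FF51")
tokens = tokenize(rest)
print(features)  # {'voltage': '7.2V', 'capacity': '680mAh'}
print(tokens)    # ['Hahnel', 'HL-XF51', 'for', 'Sony', 'NP-FF51']
```

Removing the extracted features before tokenization keeps strings such as 7.2V out of the later candidate generation, where a mix of digits and letters could otherwise be mistaken for a product code.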

Filtering comprises the removal of stop words as well as other tokens that appear frequently in product offers of several different manufacturers. For this we calculate a manufacturer-based frequency for each token t appearing in any offer representation. Let N(t,m) be the number of product offers of manufacturer m containing the token t and let N(t) be the overall number of product offers containing t. For any product offer of m only tokens t with a ratio N(t,m)/N(t) above a given threshold are considered for product code extraction. In our experiments we employ a threshold of 50%, i.e., at least 50% of all product offers containing t must be from manufacturer m. In the running example the term for will thus be excluded from further steps.
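A minimal sketch of the manufacturer-based frequency filter, assuming offers are available as (manufacturer, tokens) pairs. The function names and the four toy offers are invented for illustration, and stop-word removal is left out.

```python
from collections import Counter

def manufacturer_ratios(offers):
    """Map (token, manufacturer) to N(t,m)/N(t) over a list of (manufacturer, tokens)."""
    n_t = Counter()   # N(t): offers containing token t
    n_tm = Counter()  # N(t,m): offers of manufacturer m containing token t
    for manufacturer, tokens in offers:
        for token in set(tokens):  # count each offer at most once per token
            n_t[token] += 1
            n_tm[(token, manufacturer)] += 1
    return {(t, m): count / n_t[t] for (t, m), count in n_tm.items()}

def filter_tokens(tokens, manufacturer, ratios, threshold=0.5):
    """Keep tokens for which at least `threshold` of the containing offers are from the manufacturer."""
    return [t for t in tokens if ratios.get((t, manufacturer), 0.0) >= threshold]

# Toy offers, invented for this example.
offers = [
    ("Hahnel", ["Hahnel", "HL-XF51", "for", "Sony", "NP-FF51"]),
    ("Hahnel", ["Hahnel", "HL-XF51", "battery"]),
    ("Sony", ["Sony", "NP-FF51", "battery", "for", "camcorder"]),
    ("Canon", ["Canon", "charger", "for", "HF", "S10"]),
]
ratios = manufacturer_ratios(offers)
kept = filter_tokens(offers[0][1], "Hahnel", ratios)
print(kept)  # ['Hahnel', 'HL-XF51', 'Sony', 'NP-FF51']
```

In this toy data the token for appears in three offers, only one of which is from Hahnel, so its ratio of 1/3 falls below the 50% threshold and it is dropped.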

Figure 6: Number of product offers for 10 product categories (5 accessory and 5 non-accessory categories).

Afterwards we generate candidates for product codes. In general, a candidate consists of up to 3 consecutive tokens. To reduce the possibly large number of candidates, regular expressions are employed to find "interesting" candidates, e.g., candidates that contain both letters and numbers. To this end we use a manually created list of regular expressions that captures knowledge on the syntactical structure of common product codes. Furthermore, string type frequencies can be computed to identify types that frequently occur with a particular manufacturer. For example, a significant number of candidates for the manufacturer Hahnel follows the pattern [A-Z]{2}\-[A-Z]{2}[0-9]{2} ("two capital letters, minus, two capital letters, two digits") which indicates that such strings can be product codes.
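Candidate generation can be sketched as an n-gram enumeration followed by a pattern check. The first pattern below is the Hahnel pattern from the text; the second is a hypothetical addition made here to cover a code such as HF S10. The paper's actual, manually created pattern list is not given, so the list and the function name are assumptions.

```python
import re

# Illustrative pattern list, not the authors' list.
CODE_PATTERNS = [
    re.compile(r"[A-Z]{2}-[A-Z]{2}[0-9]{2}"),  # e.g. HL-XF51 (pattern from the text)
    re.compile(r"[A-Z]{2} [A-Z][0-9]{2}"),     # e.g. HF S10 (hypothetical)
]

def generate_candidates(tokens, max_len=3):
    """Join up to max_len consecutive tokens and keep those that fully match a pattern."""
    candidates = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if any(p.fullmatch(candidate) for p in CODE_PATTERNS):
                candidates.append(candidate)
    return candidates

print(generate_candidates(["Hahnel", "HL-XF51", "Sony", "NP-FF51"]))
# ['HL-XF51', 'NP-FF51']
print(generate_candidates(["Canon", "VIXIA", "HF", "S10"]))
# ['HF S10']
```

Using a full match rather than a substring search keeps longer n-grams such as Hahnel HL-XF51 out of the candidate list, which is one simple way to limit the number of candidates passed on to verification.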

Finally, a web verification step utilizes the web as an external knowledge source to verify the extracted candidates. For each of the determined candidates a query is submitted to a web search engine. The correctness of a code candidate is verified by the ratio of the results containing the corresponding manufacturer. Figure 5 illustrates for the two candidates HL-XF51 and NP-FF51 the retrieved top 2 query results. For the first candidate, HL-XF51, all results contain the manufacturer name Hahnel, giving an overlap of 100%. The term HL-XF51 is therefore considered a valid product code. On the other hand, NP-FF51 is rejected as a product code for this offer because none of the results contains the manufacturer name Hahnel.
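The verification ratio can be sketched with a pluggable search function. No real search engine API is called here: `search` is a hypothetical callable that returns result snippets for a query, the stubbed snippets are invented stand-ins for the results in Figure 5, and the acceptance threshold `min_overlap` is an assumption since the text does not state one.

```python
def manufacturer_overlap(candidate, manufacturer, search, top_k=2):
    """Fraction of the top_k search results that mention the manufacturer."""
    results = search(candidate)[:top_k]
    if not results:
        return 0.0
    hits = sum(manufacturer.lower() in result.lower() for result in results)
    return hits / len(results)

def verify_candidates(candidates, manufacturer, search, min_overlap=0.5):
    """Keep candidates whose overlap reaches min_overlap (threshold is an assumption)."""
    return [c for c in candidates
            if manufacturer_overlap(c, manufacturer, search) >= min_overlap]

# Stub standing in for a web search engine; the snippets are invented.
FAKE_RESULTS = {
    "HL-XF51": ["Hahnel HL-XF51 battery for Sony", "Hahnel HL-XF51 Li-Ion replacement"],
    "NP-FF51": ["Sony NP-FF51 InfoLithium battery", "Sony NP-FF51 rechargeable pack"],
}

def fake_search(query):
    return FAKE_RESULTS.get(query, [])

print(manufacturer_overlap("HL-XF51", "Hahnel", fake_search))  # 1.0
print(manufacturer_overlap("NP-FF51", "Hahnel", fake_search))  # 0.0
print(verify_candidates(["HL-XF51", "NP-FF51"], "Hahnel", fake_search))  # ['HL-XF51']
```

Passing the search function as a parameter keeps the ratio computation testable offline and independent of any particular search provider.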

Figure 10: Match quality for different reference mappings

5. CONCLUSIONS

Matching product offers is a challenging problem requiring sophisticated and tailored entity resolution approaches. We outlined and evaluated such an approach that is based on machine learning and a comprehensive pre-processing. In particular, we proposed a new approach for improving product matching based on a pattern-based extraction and web-based verification of so-called product codes. Our evaluation with a large real-life dataset showed the high benefit of product code matching, especially for non-accessory products. Furthermore, we found that category-specific match strategies should be applied. We also analyzed the use of UPC values for evaluating match strategies and for product matching and observed significant limitations. In particular, UPC-based match evaluations tend to be too pessimistic and UPC-based matching may leave many matching or comparable offers unmatched. In future work we will investigate techniques to further improve product matching and

References

2012 TailoringEntityResolutionforMat
Author: Erhard Rahm, Hanna Köpcke, Andreas Thor, Stefan Thomas
Title: Tailoring Entity Resolution for Matching Product Offers
DOI: 10.1145/2247596.2247662
Year: 2012