2017 ExtractingAttributeValuePairsfr

(Petrovski & Bizer, 2017) ⇒ Petar Petrovski, and Christian Bizer. (2017). “Extracting Attribute-value Pairs from Product Specifications on the Web.” In: Proceedings of the International Conference on Web Intelligence. ISBN:978-1-4503-4951-2 doi:10.1145/3106426.3106449

Subject Headings: Product Offering Title.

Notes

Cited By

Quotes

Abstract

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation.

In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specification or not. In order to extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning to classify columns as attribute column or value column. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score up to 10% for extracting attribute-value pairs from tables and up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.
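To make the notion of an attribute-value pair concrete, the following is a minimal sketch that reads pairs out of a two-column HTML table. It is not the supervised approach described in the abstract: it assumes, as a fixed heuristic, that the first cell of each two-cell row is the attribute and the second is the value, which is exactly the assumption the paper's column classifier is meant to replace. The names `TableRowParser`, `extract_pairs`, and `SPEC_HTML` are hypothetical, and the sample table is invented for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_pairs(html):
    """Naive heuristic (an assumption, not the paper's classifier):
    in rows with exactly two non-empty cells, cell 0 is the attribute
    and cell 1 is the value. All other rows are skipped."""
    parser = TableRowParser()
    parser.feed(html)
    return [(r[0], r[1]) for r in parser.rows if len(r) == 2 and r[0] and r[1]]


# Invented sample specification table.
SPEC_HTML = """
<table>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Screen size</td><td>15.6 in</td></tr>
  <tr><td colspan="2">Shipping information</td></tr>
</table>
"""

pairs = extract_pairs(SPEC_HTML)
print(pairs)
```

The single-cell row is skipped because it does not fit the two-column assumption. Tables with more than two columns would need a column-level decision of the kind the abstract describes.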

…

3 Related Work

This section gives an overview of the existing research on product feature extraction from free-text product descriptions, as well as existing work on feature extraction from product specifications.

Feature Extraction from Product Descriptions

Several methods for extracting attribute-value pairs from product descriptions have been developed for the use case of product matching. The methods either use bag-of-words approaches to extract attribute-value pairs from the descriptions [3, 4, 12, 24, 25], a dictionary-based approach [6], or a combination of both [9, 18, 19].

In contrast, named entity recognition based feature extraction models are developed in (Melli, 2014) and [17, 23]. All approaches use similar models for feature extraction. (Melli, 2014) proposes an approach for annotating product descriptions based on a sequence BIO tagging model, following an NLP text chunking process. Specifically, the author trains a linear-chain conditional random field model on a manually annotated training dataset to identify only eight general classes of terms. However, the approach is not able to extract explicit attribute-value pairs. Ristoski and Mika [23] improved upon this shortcoming by employing a CRF model using a comprehensive set of discrete features that comes from the standard distribution of the Stanford NER model. Ortona et al. [17] propose a three-fold approach that performs the following functions: validation of the offers' values, blocking to reduce the number of compared offers, and scoring of the pairwise offers. For the validation, an annotator is used which performs NER extraction (places, locations, names, organizations), together with an ontology which contains some domain-specific constraints. In the blocking step, all pairs of products that violate some of the ontology constraints are clustered in different clusters. In the third step, pairwise scores are calculated for the offers in each cluster.
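As an illustration of the BIO tagging scheme mentioned above, the sketch below groups tokens into labelled chunks from a given tag sequence. It covers only the decoding of tags, not the training of a conditional random field, and the label names (`BRAND`, `MODEL`, `COLOR`), the example title, and the function name `bio_to_spans` are hypothetical and are not taken from any of the cited papers.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (label, text) chunks from BIO tags.

    B-X starts a new chunk of label X, I-X continues an open chunk of
    the same label, and O (or an I-X that does not match the open
    chunk) closes any open chunk; that token is not added to a chunk.
    """
    spans = []
    label, chunk = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            chunk.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(chunk)))
            label, chunk = None, []
    if label is not None:
        spans.append((label, " ".join(chunk)))
    return spans


# Invented product title with hand-assigned tags.
tokens = ["Acme", "UltraBook", "15", "laptop", "silver"]
tags = ["B-BRAND", "B-MODEL", "I-MODEL", "O", "B-COLOR"]

spans = bio_to_spans(tokens, tags)
print(spans)
```

The output is a list of labelled term chunks rather than explicit attribute-value pairs, which mirrors the limitation noted in the text.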

Recently, several approaches employ word embeddings as additional knowledge for extracting features from product descriptions for the use cases of product matching [7, 26], product recommendation [5, 13, 28], and product classification [10]. However, the approaches cannot bypass the problem of free-text product descriptions often covering only a small number of features.

Feature Extraction from Product Specifications

While there is a relatively large body of research for extracting product features from product descriptions, only a handful of works have studied the problem of feature extraction from semi-structured data within web pages such as HTML tables and HTML lists. Etzioni et al. [2] rely on an approach from [8] to extract plane ticket prices from HTML tables within web pages. Specifically, the method involves automatically learning wrappers relying on so-called "landmarks" (i.e., groups of consecutive tokens) that enable

…

References

1. Lidong Bing, Tak-Lam Wong, and Wai Lam. 2016. Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews. ACM Trans. Internet Technol. 16, 2, Article 12 (April 2016), 17 Pages. doi:10.1145/2857054
2. Oren Etzioni, Rattapoom Tuchinda, Craig A. Knoblock, Alexander Yates, To Buy Or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, D.C. doi:10.1145/956750.956767
3. Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema, Andrew Fano, Text Mining for Product Attribute Extraction, ACM SIGKDD Explorations Newsletter, v.8 n.1, p.41-48, June 2006 doi:10.1145/1147234.1147241
4. Vishrawas Gopalakrishnan, Suresh Parthasarathy Iyengar, Amit Madaan, Rajeev Rastogi, Srinivasan Sengamedu, Matching Product Titles Using Web-based Enrichment, Proceedings of the 21st ACM International Conference on Information and Knowledge Management, October 29-November 02, 2012, Maui, Hawaii, USA doi:10.1145/2396761.2396839
5. Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, Doug Sharp, E-commerce in Your Inbox: Product Recommendations at Scale, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 10-13, 2015, Sydney, NSW, Australia doi:10.1145/2783258.2788627
6. Anitha Kannan, Inmar E Givoni, Rakesh Agrawal, and Ariel Fuxman. 2011. Matching Unstructured Product Offers to Structured Product Specifications. In 17th ACM SIGKDD.
7. M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg, Where to Buy It: Matching Street Clothing Photos in Online Shops, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), p.3343-3351, December 07-13, 2015 doi:10.1109/ICCV.2015.382
8. Craig A. Knoblock, Kristina Lerman, Steven Minton, and Ion Muslea. 2003. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. Physica-Verlag HD, Heidelberg, 275--287. doi:10.1007/978-3-7908-1772-0_17
9. Hanna Köpcke, Andreas Thor, Stefan Thomas, Erhard Rahm, Tailoring Entity Resolution for Matching Product Offers, Proceedings of the 15th International Conference on Extending Database Technology, March 27-30, 2012, Berlin, Germany doi:10.1145/2247596.2247662
10. Zornitsa Kozareva. 2015. Everyone Likes Shopping! Multi-class Product Categorization for E-Commerce. In The 2015 Annual Conference of the North American Chapter of the ACL. 1329--1333. doi:10.3115/v1/N15-1147
11. Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, Christian Bizer, The Mannheim Search Join Engine, Web Semantics: Science, Services and Agents on the World Wide Web, v.35 n.P3, p.159-166, December 2015 doi:10.1016/j.websem.2015.05.001
12. Nikhil Londhe, Vishrawas Gopalakrishnan, Aidong Zhang, Hung Q. Ngo, Rohini Srihari, Matching Titles with Cross Title Web-search Enrichment and Community Detection, Proceedings of the VLDB Endowment, v.7 n.12, p.1167-1178, August 2014 doi:10.14778/2732977.2732990
13. Julian McAuley, Christopher Targett, Qinfeng Shi, Anton Van Den Hengel, Image-Based Recommendations on Styles and Substitutes, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 09-13, 2015, Santiago, Chile doi:10.1145/2766462.2767755
14. (Melli, 2014) ⇒ Gabor Melli, Shallow Semantic Parsing of Product Offering Titles (for Better Automatic Hyperlink Insertion), Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2014, New York, New York, USA doi:10.1145/2623330.2623343
15. Robert Meusel, Petar Petrovski, and Christian Bizer. 2014. The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In The Semantic Web - ISWC. 277--292.
16. Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, Rakesh Agrawal, Synthesizing Products for Online Catalogs, Proceedings of the VLDB Endowment, v.4 n.7, p.409-418, April 2011 doi:10.14778/1988776.1988777
17. Stefano Ortona, An Analysis of Duplicate on Web Extracted Objects, Proceedings of the 23rd International Conference on World Wide Web, April 07-11, 2014, Seoul, Korea doi:10.1145/2567948.2579708
18. Petar Petrovski, Volha Bryl, Christian Bizer, Integrating Product Data from Websites Offering Microdata Markup, Proceedings of the 23rd International Conference on World Wide Web, April 07-11, 2014, Seoul, Korea doi:10.1145/2567948.2579704
19. Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata. (2014).
20. Petar Petrovski, Anna Primpeli, Robert Meusel, and Christian Bizer. 2017. The WDC Gold Standards for Product Feature Extraction and Product Matching. Springer International Publishing, Cham, 73--86. doi:10.1007/978-3-319-53676-7_6
21. Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, Divesh Srivastava, Dexter: Large-scale Discovery and Extraction of Product Specifications on the Web, Proceedings of the VLDB Endowment, v.8 n.13, p.2194-2205, September 2015 doi:10.14778/2831360.2831372
22. Daniel Rinser, Dustin Lange, Felix Naumann, Cross-lingual Entity Matching and Infobox Alignment in Wikipedia, Information Systems, v.38 n.6, p.887-907, September, 2013 doi:10.1016/j.is.2012.10.003
23. Petar Ristoski, Peter Mika, Enriching Product Ads with Metadata from HTML Annotations, Proceedings of the 13th International Conference on The Semantic Web. Latest Advances and New Domains, May 29-June 02, 2016 doi:10.1007/978-3-319-34129-3_10
24. Ronald Van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, and Flavius Frasincar. 2015. Multi-component Similarity Method for Web Product Duplicate Detection. (2015).
25. Damir Vandic, Jan-Willem Van Dam, Flavius Frasincar, Faceted Product Search Powered by the Semantic Web, Decision Support Systems, v.53 n.3, p.425-437, June, 2012 doi:10.1016/j.dss.2012.02.010
26. Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, Yu-Gang Jiang, Matching User Photos to Online Products with Robust Deep Features, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, June 06-09, 2016, New York, New York, USA doi:10.1145/2911996.2912002
27. Tak-Lam Wong, Wai Lam, Tik-Shun Wong, An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2008, Singapore, Singapore doi:10.1145/1390334.1390343
28. W. X. Zhao, S. Li, Y. He, E. Chang, J. R. Wen, and X. Li. 2015. Connecting Social Media to E-Commerce: Cold-Start Product Recommendation On Microblogs. IEEE Transactions on Knowledge and Data Engineering PP, 99 (2015), 1--1. doi:10.1109/TKDE.2015.2508816
