2007 ExtractingandUsingAttributeValu

(Probst et al., 2007) ⇒ Katharina Probst, Rayid Ghani, Marko Krema, Andy Fano, and Yan Liu. (2007). “Extracting and Using Attribute-Value Pairs from Product Descriptions on the Web.” In: From Web to Social Web: Discovering and Deploying User and Content Profiles. doi:10.1007/978-3-540-74951-6_3

Subject Headings: Product Description Phrase; Product Term Dictionary

Notes

This is a journal paper-version of (Ghani et al., 2006).

Cited By

Quotes

Abstract

We describe an approach to extract attribute-value pairs from product descriptions in order to augment product databases by representing each product as a set of attribute-value pairs. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic entity. We formulate the extraction task as a classification problem and use Naïve Bayes combined with a multi-view semi-supervised algorithm (co-EM). The extraction system requires very little initial user supervision: using unlabeled data, we automatically extract an initial seed list that serves as training data for the semi-supervised classification algorithm. The extracted attributes and values are then linked to form pairs using dependency information and co-location scores. We present promising results on product descriptions in two categories of sporting goods products. The extracted attribute-value pairs can be useful in a variety of applications, including product recommendations, product comparisons, and demand forecasting. In this paper, we describe one practical application of the extracted attribute-value pairs: a prototype of an Assortment Comparison Tool that allows retailers to compare their product assortments to those of their competitors. As the comparison is based on attributes and values, we can draw meaningful conclusions at a very fine-grained level. We present the details and research issues of such a tool, as well as the current state of our prototype.

1 Introduction

Retailers have been collecting a growing amount of sales data containing customer information and related transactions. These data warehouses also contain product information, but that information is often very sparse and limited. Specifically, most retailers treat their products as atomic entities with very few related attributes (typically brand, size, or color). Treating products as atomic entities hinders the effectiveness of many applications for which businesses currently use transactional data, such as product recommendation, demand forecasting, assortment optimization, and assortment comparison. If a business could represent its products in terms of attributes and attribute values, all of the above applications could be improved significantly.

…

4 Evaluation

4.1 Attribute-Value Pairs Extraction

…

The evaluation of this task is not straightforward. The main problem is that people often do not agree on what the ‘correct’ attribute-value pair should be. Consider the example phrase “Audio/JPEG navigation menu”. This phrase can be expressed as an attribute-value pair in multiple ways:

| Possible Attribute         | Possible Value        |
|----------------------------|-----------------------|
| navigation menu            | Audio/JPEG            |
| menu                       | Audio/JPEG navigation |
| Audio/JPEG navigation menu | #true#                |

In the last case, the entire phrase is considered a binary attribute. All three are potentially useful attribute-value pairs. The implication is that a human annotator will make one decision, while the system may make a different decision (with both decisions being consistent). For this reason, we give partial credit to an automatically extracted attribute-value pair, even if it does not completely match the human annotation. In some cases, an extracted pair deserves only partial credit, while in other cases, the automatically extracted pair is an equally valid attribute-value pair.
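The excerpt does not say how partial credit is computed. As a rough illustration only, the sketch below assumes one possible scheme: compare the set of words covered by the extracted pair with the set covered by the annotated pair, ignoring where the attribute/value boundary falls. The function names `covered_words` and `word_overlap_credit` and the Jaccard-style score are assumptions made for this example, not the paper's metric.

```python
def covered_words(pair):
    """Return the set of words a pair covers, ignoring the attribute/value split.

    The '#true#' marker for binary attributes carries no words of its own.
    """
    attribute, value = pair
    words = attribute.lower().split()
    if value != "#true#":
        words += value.lower().split()
    return set(words)


def word_overlap_credit(extracted, annotated):
    """Hypothetical partial-credit score in [0, 1] (Jaccard overlap of words)."""
    a, b = covered_words(extracted), covered_words(annotated)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# The three readings from the table cover the same words, so under this
# assumed scheme each one receives full credit against any of the others.
annotated = ("navigation menu", "Audio/JPEG")
full = word_overlap_credit(("Audio/JPEG navigation menu", "#true#"), annotated)

# A pair that covers only part of the phrase receives partial credit.
partial = word_overlap_credit(("menu", "#true#"), annotated)
```

Under this assumption, alternative splits of the same phrase count as equally valid, while an extraction that misses part of the phrase is only partly rewarded, which mirrors the two cases described above.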

For each of the metrics, we report type and token performance. Type performance (at the data item level, i.e., at the level of individual product description phrases) refers to performance for unique examples (each example contributes the same regardless of frequency). The data sets contain a number of duplicates, as many attributes apply to more than one product. Token performance refers to performance including duplicates, thereby emphasizing those examples that occur more frequently than others.
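The difference between the two views can be shown with a small sketch. It assumes a simplified setting in which each phrase has already been judged correct or incorrect, and it uses a plain fraction-correct score in place of the paper's actual metrics. The function name `type_and_token_score` and the sample phrases are made up for illustration.

```python
def type_and_token_score(judged):
    """Compute (type_score, token_score) from (phrase, is_correct) judgments.

    `judged` keeps duplicates: a phrase that occurs in three product
    descriptions appears three times. Assumes every occurrence of a phrase
    has the same judgment.
    """
    # Token level: every occurrence counts, so frequent phrases weigh more.
    token_score = sum(ok for _, ok in judged) / len(judged)

    # Type level: collapse duplicates so each distinct phrase counts once.
    unique = dict(judged)
    type_score = sum(unique.values()) / len(unique)
    return type_score, token_score


# Made-up data: one correct phrase seen three times, one incorrect phrase once.
judged = [("steel frame", True)] * 3 + [("tapered handle", False)]
type_score, token_score = type_and_token_score(judged)
# type_score is 0.5 (1 of 2 distinct phrases), token_score is 0.75 (3 of 4 occurrences)
```

The gap between the two numbers shows how a frequently repeated phrase dominates the token view while counting only once in the type view.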

…

References

|   | Author | title | doi | year |
|---|--------|-------|-----|------|
| 2007 ExtractingandUsingAttributeValu | Yan Liu, Rayid Ghani, Katharina Probst, Marko Krema, Andy Fano | Extracting and Using Attribute-Value Pairs from Product Descriptions on the Web | 10.1007/978-3-540-74951-6_3 | 2007 |