2009 ImprovingClassificationAccuracy

(Fuxman et al., 2009) ⇒ Ariel Fuxman, Anitha Kannan, Andrew B. Goldberg, Rakesh Agrawal, Panayiotis Tsaparas, and John Shafer. (2009). “Improving Classification Accuracy Using Automatically Extracted Training Data.” In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009). doi:10.1145/1557019.1557143

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used.

We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.
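
The labeling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the seed site lists, the `(query, clicked_url)` log format, the majority-vote rule, and the function names are all assumptions made for illustration. The abstract names Amazon and Wikipedia only as examples of predominantly commercial and non-commercial sites.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed lists (assumption): the abstract gives only these two examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def on_site(url, sites):
    """True if the URL's host is one of `sites` or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in sites)


def label_queries(click_log):
    """Derive weak labels for queries from the sites their clicks landed on.

    click_log: iterable of (query, clicked_url) pairs (assumed format).
    Returns {query: 1} for commercial and {query: 0} for non-commercial.
    Queries with no clicks on seed sites, or with tied votes, are skipped.
    """
    votes = {}
    for query, url in click_log:
        if on_site(url, COMMERCIAL_SITES):
            votes.setdefault(query, Counter())[1] += 1
        elif on_site(url, NONCOMMERCIAL_SITES):
            votes.setdefault(query, Counter())[0] += 1

    labels = {}
    for query, counts in votes.items():
        if counts[1] != counts[0]:  # drop ambiguous queries
            labels[query] = 1 if counts[1] > counts[0] else 0
    return labels


# Toy click log; the queries and URLs are made up for illustration.
log = [
    ("buy digital camera", "https://www.amazon.com/s?k=camera"),
    ("history of photography", "https://en.wikipedia.org/wiki/History_of_photography"),
    ("weather", "https://example.com/"),
]
labels = label_queries(log)
```

Labels produced this way are noisy, since a click on a commercial site does not guarantee commercial intent. That is the lower-quality, higher-quantity trade-off the abstract sets out to study.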

References

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 ImprovingClassificationAccuracy	Ariel Fuxman; Anitha Kannan; Panayiotis Tsaparas; John Shafer; Andrew B. Goldberg; Rakesh Agrawal			Improving Classification Accuracy Using Automatically Extracted Training Data		KDD-2009 Proceedings		10.1145/1557019.1557143		2009