2005 EmailDataCleaning

From GM-RKB

Subject Headings: Email Data, Data Cleaning Task.

Notes

Cited By

Quotes

Abstract

  • Addressed in this paper is the issue of 'email data cleaning' for text mining. Many text mining applications need to take emails as input. Email data is usually noisy and thus it is necessary to clean it before mining. Several products offer email cleaning features; however, the types of noise that can be eliminated are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent from any specific text mining processing. A cascaded approach is proposed, which cleans up an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVM) have also been proposed in this paper. Features in the models have been defined. Experimental results indicate that the proposed SVM-based methods can significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.
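
The four-pass cascade described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the paper proposes SVM-based models for these decisions, whereas the sketch below substitutes simple hand-written rules so that it stays self-contained. All function names and regular expressions here are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rule-based stand-ins; the paper uses SVM models for these decisions.
HEADER_RE = re.compile(r"^(From|To|Cc|Subject|Date|Sent):", re.IGNORECASE)
QUOTE_RE = re.compile(r"^\s*>")
SIGNATURE_RE = re.compile(r"^--\s*$")


def filter_non_text(lines):
    """Pass 1: drop header lines and quoted replies; stop at a signature marker."""
    kept = []
    for line in lines:
        if SIGNATURE_RE.match(line):
            break
        if HEADER_RE.match(line) or QUOTE_RE.match(line):
            continue
        kept.append(line)
    return kept


def normalize_paragraphs(lines):
    """Pass 2: join hard-wrapped lines; a blank line ends a paragraph."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def normalize_sentences(paragraph):
    """Pass 3: split after sentence-ending punctuation; capitalize each sentence."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p[0].upper() + p[1:] for p in parts if p]


def normalize_words(sentence):
    """Pass 4: collapse repeated punctuation and redundant whitespace."""
    sentence = re.sub(r"([!?.])\1+", r"\1", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def clean_email(raw):
    """Run the four passes in order and return one cleaned string per paragraph."""
    lines = filter_non_text(raw.splitlines())
    cleaned = []
    for paragraph in normalize_paragraphs(lines):
        sentences = [normalize_words(s) for s in normalize_sentences(paragraph)]
        cleaned.append(" ".join(sentences))
    return cleaned
```

Each pass consumes the output of the previous one, which mirrors the cascaded design: non-text blocks are removed first so that the later normalization passes only see body text. In a learned version, each rule above would presumably be replaced by a trained classifier over features of the line, paragraph, or token.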

2005 EmailDataCleaning

  • Author: Hang Li, Jie Tang, Yunbo Cao, Zhaohui Tang
  • Title: Email Data Cleaning
  • Title URL: http://keg.cs.tsinghua.edu.cn/persons/tj/publications/f142-tang.pdf
  • DOI: 10.1145/1081870.1081926