2008 MetaAnalysisWithinAuthorshipVerification

From GM-RKB

Subject Headings: Authorship Verification Task, Plagiarism Analysis.

Quotes

Abstract

In an authorship verification problem one is given writing examples from an author A, and one is asked to determine whether or not a questioned text was in fact written by A. In a more general form of the authorship verification problem one is given a single document d only, and the question is whether or not d contains sections from other authors. The heart of authorship verification is the quantification of an author's writing style along with an outlier analysis to identify anomalies. Human readers are well versed in detecting such spurious sections, since they combine a highly developed sense for wording with context-dependent meta knowledge in their analysis.

The intention of this paper is to compile an overview of the algorithmic building blocks for authorship verification. In particular, we introduce authorship verification problems as decision problems, discuss possibilities for the use of meta knowledge, and apply meta analysis to postprocess unreliable style analysis results. Our meta analysis combines a confidence-based majority decision with the unmasking approach of Koppel and Schler. With this strategy we can improve the analysis quality in our experiments by 33% in terms of the F-measure.

1. Introduction

Authorship verification is a one-class classification problem. In a one-class classification problem one is given a target class for which a certain number of examples exist. Objects outside the target class are called outliers, and the one-class classification task is to tell outliers apart from target class members. Actually, the set of "outliers" can be much bigger than the target class, and an arbitrary number of outlier examples could be collected. Hence a one-class classification problem may look like a two-class discrimination problem; however, there is an important difference: members of the target class can be considered as representatives for their class, whereas one will not be able to compile a set of outliers that is representative for some kind of "non-target class". This fact is rooted in the enormous number and diversity of the non-target objects. Put another way, solving a one-class classification problem means learning a concept (the concept of the target class) in the absence of discriminating features. (In rare cases, knowledge about outliers can be used to construct "representative counter examples" with respect to the target class. Then a standard discrimination approach can be applied.)
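The one-class setting can be illustrated with a toy sketch: model the target class by the centroid of its feature vectors and accept only candidates within a fixed radius. The 2-d "style" features, the distance test, and the radius are illustrative assumptions only, not the style models discussed in the paper.

```python
# Toy one-class classifier: the target class is summarized by the
# centroid of its examples; anything farther than a fixed radius is
# treated as an outlier. No outlier examples are needed for training,
# which is the defining property of the one-class setting.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def one_class_classify(target_examples, candidates, radius):
    """Return True for each candidate accepted as a target-class member."""
    c = centroid(target_examples)
    return [distance(v, c) <= radius for v in candidates]

# Hypothetical 2-d style features for author A's known texts
known_A = [[0.9, 1.1], [1.0, 0.9], [1.1, 1.0]]
# Two candidates: one near A's style, one far away (an outlier)
candidates = [[1.0, 1.0], [3.0, 3.0]]
print(one_class_classify(known_A, candidates, radius=0.5))  # [True, False]
```

Note that the radius must be chosen from target-class examples alone, e.g. by cross-validation on held-out texts of A; this is exactly where the absence of representative counter examples makes the problem hard.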

Within authorship verification the target class is comprised of writing examples of a certain author A, whereas each piece of text written by another author B, B ≠ A, is an outlier. In their excellent paper [10] Koppel and Schler give an illustrative discussion of authorship verification as a one-class classification problem. In the same paper they introduce a new approach, called unmasking, to determine with a high probability whether a set of writing examples is a subset of the target class. Observe the term "set" in this connection: the unmasking approach does not solve the one-class classification problem for a single object but requires a batch of objects all of which must stem either from the target class or not (see details in Section 3).
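The unmasking idea can be sketched as a loop: repeatedly separate the two batches, record the separation accuracy, then remove the most discriminating features. For same-author batches only a few superficial features keep the texts apart, so the accuracy curve is expected to collapse quickly. The single-feature threshold classifier below is a stand-in assumption; Koppel and Schler use an SVM over function-word frequencies.

```python
# Sketch of the unmasking loop: separate batch X1 from batch X2,
# log the accuracy, drop the most discriminating feature, repeat.

def best_feature(X1, X2):
    """Index of the feature whose class means differ the most."""
    dims = range(len(X1[0]))
    m1 = [sum(v[i] for v in X1) / len(X1) for i in dims]
    m2 = [sum(v[i] for v in X2) / len(X2) for i in dims]
    return max(dims, key=lambda i: abs(m1[i] - m2[i]))

def accuracy(X1, X2, i):
    """Accuracy of a midpoint-threshold classifier on feature i."""
    m1 = sum(v[i] for v in X1) / len(X1)
    m2 = sum(v[i] for v in X2) / len(X2)
    t = (m1 + m2) / 2
    lo, hi = (X1, X2) if m1 < m2 else (X2, X1)
    hits = sum(v[i] < t for v in lo) + sum(v[i] >= t for v in hi)
    return hits / (len(X1) + len(X2))

def unmask(X1, X2, rounds):
    """Return the accuracy curve while top features are removed."""
    X1, X2 = [list(v) for v in X1], [list(v) for v in X2]
    curve = []
    for _ in range(rounds):
        i = best_feature(X1, X2)
        curve.append(accuracy(X1, X2, i))
        for v in X1 + X2:
            del v[i]
    return curve

# Two batches kept apart by a single strong feature (feature 0):
# once it is removed, separation accuracy drops at once.
curve = unmask([[5.0, 1.0, 1.0], [4.8, 1.1, 0.9]],
               [[1.0, 1.0, 1.0], [1.2, 0.9, 1.1]], rounds=2)
print(curve)  # [1.0, 0.75]
```

The shape of this degradation curve, rather than the initial accuracy itself, is what the batch-level decision is based on.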

1.1 Authorship Verification Problems

The complexity of an authorship verification problem can vary significantly, depending on the given constraints and assumptions. To organize existing research and the developed approaches we introduce, for the first time, three authorship verification problems — formulated as decision problems.

Problem. AVFIND

  • Given. A text d, allegedly written by author A.
  • Q. Does d contain sections written by an author B, B ≠ A?

Problem. AVOUTLIER

  • Given. A set of texts D = {d1, . . ., dn}, allegedly written by author A.
  • Q. Does D contain texts written by an author B, B ≠ A?

Problem. AVBATCH

  • Given. Two sets of texts, D1 = {d11, . . ., d1k} and D2 = {d21, . . ., d2l}, each of which is written by a single author.
  • Q. Are the texts in D1 and D2 written by the same author?

Note that the problems can be transformed into each other, for example:

  • AVFIND → AVOUTLIER → AVBATCH (1)

Given a document d an AVFIND problem can be transformed into an AVOUTLIER problem by extracting suspicious sections from d. The AVOUTLIER problem in turn can be transformed into an AVBATCH problem by forming two sets D1 and D2 containing the suspicious and the nonsuspicious sections respectively.
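The reduction chain (1) can be sketched as two small data transformations. The style-deviation score used below to mark a section as "suspicious" is a hypothetical stand-in (fraction of long words); the paper's style analysis is of course richer.

```python
# AVFIND -> AVOUTLIER: extract the suspicious sections of a document.
# AVOUTLIER -> AVBATCH: split the sections into two batches D1, D2.

def avfind_to_avoutlier(sections, deviation, threshold):
    """Sections whose style deviation exceeds the threshold."""
    return [s for s in sections if deviation(s) > threshold]

def avoutlier_to_avbatch(sections, suspicious):
    """Split sections into D1 (suspicious) and D2 (the rest)."""
    d1 = [s for s in sections if s in suspicious]
    d2 = [s for s in sections if s not in suspicious]
    return d1, d2

# Hypothetical deviation score: fraction of words longer than 6 letters.
def deviation(section):
    words = section.split()
    return sum(len(w) > 6 for w in words) / len(words)

doc = ["the cat sat on the mat",
       "heterogeneous stylometric idiosyncrasies notwithstanding",
       "a dog ran in the park"]
susp = avfind_to_avoutlier(doc, deviation, threshold=0.5)
d1, d2 = avoutlier_to_avbatch(doc, susp)
print(len(d1), len(d2))  # 1 2
```

The resulting pair (D1, D2) is exactly an AVBATCH instance: unmasking can now decide whether both batches stem from the same author.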

Note that the authorship verification problem AVFIND and intrinsic plagiarism analysis represent two sides of the same coin: the goal of intrinsic plagiarism analysis is to identify potentially plagiarized sections by analyzing a document with respect to changes in writing style [14].

1.2 Existing Research

Research related to authorship verification divides into the following areas: (i) models for the quantification of writing style — using classical measures for text complexity and grading level assessment [1, 7, 8, 6, 3, 4, 24] as well as author-specific stylometric analyses [18, 19, 11, 10, 9], (ii) technology for outlier analysis and machine learning [22, 23, 15, 12], and (iii) meta knowledge processing. Regarding the last area we refer to techniques for knowledge representation, deduction, and symbolic knowledge processing [17, 20].

4. Discussion

The improvements achieved by the meta analysis within the post-processing stage are substantial (see Table 2). We would like to point out that the approach of Koppel and Schler unfolds its power especially if an impure document is mistakenly classified as a document from a single author. The case that |D1| is in the uncertainty domain happens in 3% of all AVBATCH instances, and in 30% of all impure AVBATCH instances.

Finally, observe the following tradeoff: with increasing θ the solution of AVOUTLIER becomes more difficult, but the solution of AVBATCH becomes simpler. Rationale for the former is that an increasing θ masks possible style deviations from a document’s averaged writing style model. Rationale for the latter is the availability of more sample texts to apply unmasking.

References

[1] J. Chall and E. Dale. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books, 1995.
[2] F. Choi. Advances in Domain Independent Linear Text Segmentation. In: Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics. Morgan Kaufmann, 2000.
[3] E. Dale and J. Chall. A Formula for Predicting Readability. Educational Research Bulletin, 1948.
[4] R. Flesch. A New Readability Yardstick. Journal of Applied Psychology, 1948.
[5] N. Graham, G. Hirst, and B. Marthi. Segmenting a Document by Stylistic Character. Natural Language Engineering, 2005.
[6] R. Gunning. The Technique of Clear Writing. McGraw-Hill, 1952.
[7] A. Honore. Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 1979.
[8] J. Kincaid, R. Fishburne, R. Rogers, and B. Chissom. Derivation of New Readability Formulas for Navy Enlisted Personnel. Research Branch Report 8-75, Millington, TN: Naval Technical Training, US Naval Air Station, 1975.
[9] M. Koppel and J. Schler. Exploiting Stylistic Idiosyncrasies for Authorship Attribution. In: Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.
[10] M. Koppel and J. Schler. Authorship Verification as a One-Class Classification Problem. In: Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.
[11] M. Koppel, J. Schler, S. Argamon, and E. Messeri. Authorship Attribution with Thousands of Candidate Authors. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2006.
[12] L. Manevitz and M. Yousef. One-Class SVMs for Document Classification. Journal of Machine Learning Research, 2001.
[13] S. Meyer zu Eissen and B. Stein. Genre Classification of Web Pages: User Study and Feasibility Analysis. In: KI 2004: Advances in Artificial Intelligence. Springer, 2004.
[14] S. Meyer zu Eissen and B. Stein. Intrinsic Plagiarism Detection. In: Proceedings of the European Conference on Information Retrieval (ECIR 2006). Springer, 2006.
[15] G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller. Constructing Boosting Algorithms from SVMs: An Application to One-Class Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[16] J. Reynar. Topic Segmentation: Algorithms and Applications. PhD thesis, University of Pennsylvania, 1998.
[17] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
[18] E. Stamatatos. Author Identification Using Imbalanced and Limited Training Texts. In: 18th International Conference on Database and Expert Systems Applications (DEXA 07), 2007.
[19] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Computer-Based Authorship Attribution Without Lexical Measures. Computers and the Humanities, 2001.
[20] M. Stefik. Introduction to Knowledge Systems. Morgan Kaufmann, 1995.
[21] B. Stein and S. Meyer zu Eissen. Intrinsic Plagiarism Analysis with Meta Learning. In: SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN 07), 2007.
[22] D. Tax. One-Class Classification. PhD thesis, Technische Universiteit Delft, 2001.
[23] D. Tax and R. Duin. Combining One-Class Classifiers. In: Proceedings of the Second International Workshop on Multiple Classifier Systems. Springer, 2001.
[24] G. Yule. The Statistical Study of Literary Vocabulary. Cambridge University Press, 1944.


Citation

Benno Stein, Nedim Lipka, and Sven Meyer zu Eissen. "Meta Analysis within Authorship Verification." In: Proceedings of the Fifth International Workshop on Text-based Information Retrieval, 2008.
http://www.uni-weimar.de/medien/webis/publications/downloads/papers/stein 2008g.pdf