====5.1 Experimental Setup====
<B>[[Definitional Sentences Dataset|Dataset]]s</B>. We conducted experiments on two different datasets:
* A corpus of 4,619 Wikipedia sentences, containing 1,908 definitional and 2,711 nondefinitional sentences. The former were obtained from a random selection of the first sentences of Wikipedia articles<ref>The first sentence of Wikipedia entries is, in the large majority of cases, a definition of the page title.</ref>. The defined terms belong to different Wikipedia domain categories<ref>en.wikipedia.org/wiki/Wikipedia:Categories</ref>, so as to capture a representative, cross-domain sample of lexical and syntactic patterns for definitions. These sentences were manually annotated with DEFINIENDUM, DEFINITOR, DEFINIENS and REST fields by an expert annotator, who also marked the hypernyms. The associated set of negative examples (“syntactically plausible” false definitions) was obtained by extracting, from the same Wikipedia articles, the sentences in which the page title occurs (a sketch of this step follows below).
* A subset of the ukWaC Web corpus (Ferraresi et al., 2008), a large corpus of the English language constructed by crawling the .uk domain of the Web. The subset includes over 300,000 sentences containing any of 239 terms selected from the terminology of four different domains (COMPUTER SCIENCE, ASTRONOMY, CARDIOLOGY, AVIATION).
The reason for using the ukWaC corpus is that, unlike the “clean” Wikipedia dataset, in which relatively simple patterns can achieve good results, ukWaC represents a real-world test, with many complex cases. For example, there are sentences that should be classified as definitional according to Section 3.1 but are rather uninformative, like “dynamic programming was the brainchild of an american mathematician”, as well as informative sentences that are not definitional (e.g., they do not have a hypernym), like “cubism was characterised by muted colours and fragmented images”. Even more frequently, the dataset includes sentences which are not definitions but have a definitional pattern (“A Pacific Northwest tribe’s saga refers to a young woman who [..]”), or sentences with very complex definitional patterns (“white body cells are the body’s clean up squad” and “joule is also an expression of electric energy”). These cases can be correctly handled only with fine-grained patterns. Additional details on the corpus and a more thorough linguistic analysis of complex cases can be found in Navigli et al. (2010).
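Returning to the Wikipedia dataset above, the negative-example construction is simple enough to sketch. The following is a minimal illustration (Python; the function name and inputs are our assumptions, and sentence segmentation is assumed to be done elsewhere):

<syntaxhighlight lang="python">
# Sketch (assumed inputs): collect "syntactically plausible" negatives,
# i.e. sentences of an article that mention the page title but are not
# among the annotated definitional sentences.
def negative_examples(page_title, article_sentences, definitional):
    title = page_title.lower()
    return [s for s in article_sentences
            if title in s.lower() and s not in definitional]
</syntaxhighlight>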
<B>Systems</B>. For definition extraction, we experimented with the following systems:
* <B>[[WCL-1]] and [[WCL-3]]</B>: these two classifiers are based on our [[Word-Class Lattice model]]. WCL-1 learns from the training set a lattice for each cluster of sentences, whereas WCL-3 identifies clusters (and lattices) separately for each sentence field (DEFINIENDUM, DEFINITOR and DEFINIENS) and classifies a sentence as a definition if any combination from the three sets of lattices matches (cf. Section 3.2.4; the best combination is selected). A sketch of the decision rules of this and the following two systems is given after this list.
* <B>[[Star pattern]]s</B>: a simple classifier based on the patterns learned as a result of step 1 of our [[WCL learning algorithm]] (cf. Section 3.2.1): a sentence is classified as a definition if it matches any of the star patterns in the model.
* <B>[[(Cui & al, 2007) Algorithm|Bigrams]]</B>: an implementation of the bigram classifier for soft pattern matching proposed by Cui et al. (2007). The classifier selects as definitions all the sentences whose probability is above a specific threshold. The probability is calculated as a mixture of bigram and unigram probabilities, with Laplace smoothing on the latter. We use the very same settings of Cui et al. (2007), including threshold values. While the authors propose a second soft-pattern approach based on Profile HMM (cf. Section 2), their results do not show significant improvements over the bigram language model.
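For concreteness, the following minimal sketch (Python) illustrates the decision rule of each of the three systems just listed. Every name, interface and parameter here is our own assumption (e.g., lattice objects with a <code>matches</code> method, precomputed n-gram statistics, and the interpolation weight), not code or settings from the original papers:

<syntaxhighlight lang="python">
from itertools import product
import math

# WCL-3 (assumed interface): a sentence is a definition if some combination
# of learned DEFINIENDUM, DEFINITOR and DEFINIENS lattices matches it;
# among the matching combinations, the best-scoring one is retained.
def wcl3_classify(sentence, df_set, vf_set, gf_set, score):
    matches = [(score(df, vf, gf, sentence), df, vf, gf)
               for df, vf, gf in product(df_set, vf_set, gf_set)
               if df.matches(sentence) and vf.matches(sentence)
               and gf.matches(sentence)]
    return max(matches, key=lambda m: m[0], default=None)  # None: no match

# Star patterns: words outside a frequent-word vocabulary are generalized
# to '*'; a sentence is a definition iff its star pattern was learned.
def star_pattern(tokens, frequent_words):
    return tuple(t if t in frequent_words else '*' for t in tokens)

def star_classify(tokens, frequent_words, learned_patterns):
    return star_pattern(tokens, frequent_words) in learned_patterns

# Bigrams: a soft-pattern score mixing bigram probabilities with
# Laplace-smoothed unigram probabilities; sentences scoring above a
# threshold are selected. lam=0.7 is illustrative, not Cui et al.'s value.
def bigram_score(tokens, p_bigram, uni_count, n_tokens, vocab_size, lam=0.7):
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p_uni = (uni_count.get(cur, 0) + 1) / (n_tokens + vocab_size)
        logp += math.log(lam * p_bigram.get((prev, cur), 0.0)
                         + (1 - lam) * p_uni)
    return logp / max(len(tokens) - 1, 1)  # length-normalized log-probability
</syntaxhighlight>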
For hypernym extraction, we compared WCL-1 and WCL-3 with Hearst’s patterns, a system that extracts hypernyms from sentences based on the lexico-syntactic patterns specified in Hearst’s seminal work (1992). These include (hypernym in italics): “such ''NP'' as {NP ,}* {(or | and)} NP”, “NP {, NP}* {,} or other ''NP''”, “''NP'' {,} including {NP ,}* {or | and} NP”, “''NP'' {,} especially {NP ,}* {or | and} NP”, and variants thereof.
However, it should be noted that hypernym extraction methods in the literature do not extract hypernyms from definitional sentences, as we do, but rather from specific patterns like “X such as Y”. Therefore a direct comparison with these methods is not possible. Nonetheless, we decided to implement Hearst’s patterns for the sake of completeness. We could not replicate the more refined approach by Snow et al. (2004) because it requires the annotation of a possibly very large dataset of sentence fragments. In any case, Snow et al. (2004) reported the following performance figures on a corpus comparable in size and complexity to ukWaC: the recall-precision graph indicates 85% precision at 10% recall and 25% precision at 30% recall for the hypernym classifier. A variant of the classifier that includes evidence from coordinate terms (terms with a common ancestor in a taxonomy) obtains an increased precision of 35% at 30% recall. We see no reason why these figures should vary dramatically on ukWaC.
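As a rough illustration of how such lexico-syntactic patterns are matched, here is a sketch of the first pattern above (our own simplification, not Hearst's implementation: it assumes noun phrases have already been collapsed into single, possibly underscore-joined tokens, which sidesteps the NP chunking that is the hard part in practice):

<syntaxhighlight lang="python">
import re

# Sketch of the Hearst pattern "such NP as {NP ,}* {(or | and)} NP",
# assuming each NP is a single token.
SUCH_AS = re.compile(r'such (\w+) as ((?:\w+ ?, ?)*\w+(?: (?:or|and) \w+)?)')

def such_as_hypernym(text):
    m = SUCH_AS.search(text)
    if m is None:
        return None
    hypernym = m.group(1)                                # first NP: hypernym
    hyponyms = re.split(r'\s*,\s*|\s+(?:or|and)\s+', m.group(2))
    return hypernym, hyponyms

# such_as_hypernym("such sports as tennis, squash or badminton")
# -> ('sports', ['tennis', 'squash', 'badminton'])
</syntaxhighlight>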
Finally, we compare all systems with the random baseline, which classifies a sentence as a definition with probability <math>\tfrac{1}{2}</math>.
<B>Measures</B>. To assess the performance of our systems, we calculated the following measures:
* precision – the number of definitional sentences correctly retrieved by the system over the number of sentences marked by the system as definitional;
* recall – the number of definitional sentences correctly retrieved by the system over the number of definitional sentences in the dataset;
* the F1-measure – the harmonic mean of precision (P) and recall (R), given by <math>\frac{2PR}{P+R}</math>;
* accuracy – the number of correctly classified sentences (either as definitional or nondefinitional) over the total number of sentences in the dataset.
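These four measures translate directly into code; a minimal sketch from raw counts (the count names are ours):

<syntaxhighlight lang="python">
# Sketch: the four evaluation measures from raw counts.
# tp: definitional sentences correctly retrieved
# fp: sentences wrongly marked as definitional
# fn: definitional sentences missed
# tn: nondefinitional sentences correctly rejected
def evaluate(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
</syntaxhighlight>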
====5.2 Results and Discussion====
<B>Definition Extraction</B>. ...
<B>Hypernym Extraction</B>. ...
===6 Conclusions===
In [[2010 LearningWordClassLatticesforDef|this paper]], we have presented a lattice-based approach to definition and hypernym extraction. The novelty of our approach is:
# The use of a lattice structure to generalize over lexico-syntactic definitional patterns;
# The ability of the system to jointly identify definitions and extract hypernyms;
# The generality of the method, which applies to generic Web documents in any domain and style, and needs no parameter tuning;
# The high performance compared with the best-known methods for both definition and hypernym extraction. Our approach outperforms the other systems particularly where the task is more complex, as in real-world documents (i.e., the ukWaC corpus).
Even though definitional patterns are learned from a manually annotated dataset, the size and heterogeneity of the training dataset ensure that training need not be repeated for specific domains<ref>Of course, it would need some additional work if applied to languages other than English. However, the approach itself does not need to be adapted to the language of interest.</ref>, as demonstrated by the cross-domain evaluation on the ukWaC corpus.
The datasets used in our experiments are available from http://lcl.uniroma1.it/wcl. We also plan to release our system to the research community. In the near future, we aim to apply the output of our classifiers to the task of automated taxonomy building, and to test the WCL approach on other information extraction tasks, such as hypernym extraction from generic sentence fragments, as in Snow et al. (2004).
===Footnotes===