# 2008 Two Languages Are Better Than One (for Syntactic Parsing)

Subject Headings: Feature Ablation Study, Bitext Corpus.

## Quotes

### Abstract

We show that jointly parsing a bitext can substantially improve parse quality on both sides. In a maximum entropy bitext parsing model, we define a distribution over source trees, target trees, and node-to-node alignments between them. Features include monolingual parse scores and various measures of syntactic divergence. Using the translated portion of the Chinese treebank, our model is trained iteratively to maximize the marginal likelihood of training tree pairs, with alignments treated as latent variables. The resulting bitext parser outperforms state-of-the-art monolingual parser baselines by 2.5 F1 at predicting English side trees and 1.8 F1 at predicting Chinese side trees (the highest published numbers on these corpora). Moreover, these improved trees yield a 2.4 BLEU increase when used in a downstream MT evaluation.

### 1 Introduction

Methods for machine translation (MT) have increasingly leveraged not only the formal machinery of syntax (Wu, 1997; Chiang, 2007; Zhang et al., 2008), but also linguistic tree structures of either the source side (Huang et al., 2006; Marton and Resnik, 2008; Quirk et al., 2005), the target side (Yamada and Knight, 2001; Galley et al., 2004; Zollmann et al., 2006; Shen et al., 2008), or both (Och et al., 2003; Aue et al., 2004; Ding and Palmer, 2005). These methods all rely on automatic parsing of one or both sides of input bitexts and are therefore impacted by parser quality. Unfortunately, parsing general bitexts well can be a challenge for newswire-trained treebank parsers for many reasons, including out-of-domain input and tokenization issues.

On the other hand, the presence of translation pairs offers a new source of information: bilingual constraints. For example, Figure 1 shows a case where a state-of-the-art English parser (Petrov and Klein, 2007) has chosen an incorrect structure which is incompatible with the (correctly chosen) output of a comparable Chinese parser. Smith and Smith (2004) previously showed that such bilingual constraints can be leveraged to transfer parse quality from a resource-rich language to a resource-impoverished one. In this paper, we show that bilingual constraints and reinforcement can be leveraged to substantially improve parses on both sides of a bitext, even for two resource-rich languages. Formally, we present a log-linear model over triples of source trees, target trees, and node-to-node tree alignments between them. We consider a set of core features which capture the scores of monolingual parsers as well as measures of syntactic alignment. Our model conditions on the input sentence pair and so features can and do reference input characteristics such as posterior distributions from a word-level aligner (Liang et al., 2006; DeNero and Klein, 2007).
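As a reading aid, the log-linear model over triples described above can be written out roughly as follows. This is a sketch consistent with the prose, not necessarily the paper's own notation: the symbols $s, s'$ (input sentence pair), $t, t'$ (source and target trees), $a$ (node-to-node alignment), $w$ (weight vector), and $\phi$ (feature vector) are assumed names.

$$
P_w(t, a, t' \mid s, s') = \frac{\exp\left(w^\top \phi(t, a, t', s, s')\right)}{\sum_{\bar{t}, \bar{a}, \bar{t}'} \exp\left(w^\top \phi(\bar{t}, \bar{a}, \bar{t}', s, s')\right)}
$$

Because gold alignments are not observed, the probability of a tree pair is obtained by summing the alignment out:

$$
P_w(t, t' \mid s, s') = \sum_{a} P_w(t, a, t' \mid s, s')
$$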

Our training data is the translated section of the Chinese treebank (Xue et al., 2002; Bies et al., 2007), so at training time correct trees are observed on both the source and target side. Gold tree alignments are not present and so are induced as latent variables using an iterative training procedure. To make the process efficient and modular to existing monolingual parsers, we introduce several approximations: use of k-best lists in candidate generation, an adaptive bound to avoid considering all $k^2$ combinations, and Viterbi approximations to alignment posteriors.
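The candidate-generation step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Candidate`, `best_tree_pair`, `align_score`, and `keep_pair` are hypothetical names, trees are placeholder strings, and all feature weights are fixed to 1. The pruning predicate is left abstract because the exact form of the adaptive bound is not given in this section. `align_score` stands in for the score of the single best (Viterbi) alignment between two trees.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    tree: str          # placeholder for a parse tree (e.g. a bracketed string)
    mono_score: float  # log-score assigned by the monolingual parser


def best_tree_pair(
    src_kbest: Sequence[Candidate],
    tgt_kbest: Sequence[Candidate],
    align_score: Callable[[str, str], float],
    keep_pair: Callable[[int, int], bool] = lambda i, j: True,
) -> Tuple[Optional[Tuple[Candidate, Candidate]], float]:
    """Return the highest-scoring (source, target) pair from two k-best lists.

    keep_pair(i, j) is a hypothetical pruning hook: pairs of ranks it rejects
    are skipped, so fewer than k^2 combinations need to be scored.
    """
    best_pair, best_score = None, float("-inf")
    for (i, src), (j, tgt) in product(enumerate(src_kbest), enumerate(tgt_kbest)):
        if not keep_pair(i, j):
            continue
        # Joint score: both monolingual scores plus the alignment-based score
        # (unit weights are an assumption made for this sketch).
        score = src.mono_score + tgt.mono_score + align_score(src.tree, tgt.tree)
        if score > best_score:
            best_pair, best_score = (src, tgt), score
    return best_pair, best_score


# Toy example: the second source tree aligns much better with the first target tree.
src = [Candidate("A1", -1.0), Candidate("A2", -1.5)]
tgt = [Candidate("B1", -2.0), Candidate("B2", -2.2)]
bonus = {("A2", "B1"): 2.0}
pair, score = best_tree_pair(src, tgt, lambda s, t: bonus.get((s, t), 0.0))
print(pair[0].tree, pair[1].tree, score)
```

In the toy example the jointly best pair is not the pair of monolingual 1-best trees, which is the kind of correction bilingual evidence is meant to provide.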

### 6 Statistical Parsing Experiments

All the data used to train the joint parsing model and to evaluate parsing performance were taken from articles 1-325 of the Chinese treebank, which all have English translations with gold-standard parse trees. The articles were split into training, development, and test sets according to the standard breakdown for Chinese parsing evaluations. Not all sentence pairs could be included for various reasons, including one-to-many Chinese-English sentence alignments, sentences omitted from the English translations, and low-fidelity translations. Additional sentence pairs were dropped from the training data because they had unambiguous parses in at least one of the two languages. Table 1 shows how many sentences were included in each dataset.

We had two training setups: rapid and full. In the rapid training setup, only 1000 sentence pairs from the training set were used, and we used fixed alignments for each tree pair rather than iterating (see §4.1). The full training setup used the iterative training procedure on all 2298 training sentence pairs.

We used the English and Chinese parsers in Petrov and Klein (2007)[1] to generate all k-best lists and as our evaluation baseline. Because our bilingual data is from the Chinese treebank, and the data typically used to train a Chinese parser contains the Chinese side of our bilingual training data, we had to train a new Chinese grammar using only articles 400-1151 (omitting articles 1-270). This modified grammar was used to generate the k-best lists that we trained our model on. However, as we tested on the same set of articles used for monolingual Chinese parser evaluation, there was no need to use a modified grammar to generate k-best lists at test time, and so we used a regularly trained Chinese parser for this purpose.

We also note that since all parsing evaluations were performed on Chinese treebank data, the Chinese test sentences were in-domain, whereas the English sentences were very far out-of-domain for the Penn Treebank-trained baseline English parser. Hence, in these evaluations, Chinese scores tend to be higher than English ones.

Posterior word alignment probabilities were obtained from the word aligner of Liang et al. (2006) and DeNero and Klein (2007)[2], trained on approximately 1.7 million sentence pairs. For our alignment model we used an HMM in each direction, trained to agree (Liang et al., 2006), and we combined the posteriors using DeNero and Klein’s (2007) soft union method.

Unless otherwise specified, the maximum value of k was set to 100 for both training and testing, and all experiments used a value of 25 as the $\epsilon$ parameter for training set pruning and a cutoff rank of 500 for test set pruning.

#### 6.1 Feature Ablation

To verify that all our features were contributing to the model’s performance, we did an ablation study, removing one group of features at a time. Table 2 shows the $F_1$ scores on the bilingual development data resulting from training with each group of features removed.[3] Note that though head word features seemed to be detrimental in our rapid training setup, earlier testing had shown a positive effect, so we reran the comparison using our full training setup, where we again saw an improvement when including these features.

| Setup | Features | Ch $F_1$ | Eng $F_1$ | Tot $F_1$ |
|---|---|---|---|---|
| Baseline parsers | Monolingual | 84.95 | 76.75 | 81.15 |
| Rapid training | All | 86.37 | 78.92 | 82.91 |
| Rapid training | −Hard align | 85.83 | 77.92 | 82.16 |
| Rapid training | −Scaled align | 86.21 | 78.62 | 82.69 |
| Rapid training | −Span diff | 86.00 | 77.49 | 82.07 |
| Rapid training | −Num children | 86.26 | 78.56 | 82.69 |
| Rapid training | −Child labels | 86.35 | 78.45 | 82.68 |
| Full training | All | 86.76 | 79.41 | 83.34 |

Table 2: Feature ablation study. $F_1$ on dev set after training with individual feature groups removed. Performance with individual baseline parsers included for reference.
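To make the ablation results easier to read, the drop in total $F_1$ caused by removing each feature group can be computed directly from the rapid-training rows of Table 2. The snippet below is only a convenience calculation over the reported numbers; the variable names are illustrative.

```python
# Total dev-set F1 with all features (rapid training), from Table 2.
ALL_TOT_F1 = 82.91

# Total dev-set F1 with one feature group removed (rapid training), from Table 2.
ablated_tot_f1 = {
    "Hard align": 82.16,
    "Scaled align": 82.69,
    "Span diff": 82.07,
    "Num children": 82.69,
    "Child labels": 82.68,
}

# Drop in total F1 when each group is removed; a larger drop means a more useful group.
drops = {name: round(ALL_TOT_F1 - f1, 2) for name, f1 in ablated_tot_f1.items()}
ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

for name, drop in ranked:
    print(f"-{name:<13} costs {drop:.2f} Tot F1")
```

On these numbers, removing the span difference and hard alignment groups hurts most, while the remaining groups each account for roughly a quarter of a point.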

| $\epsilon$ | Ch $F_1$ | Eng $F_1$ | Tot $F_1$ | Tree Pairs |
|---|---|---|---|---|
| 15 | 85.78 | 77.75 | 82.05 | 1,463,283 |
| 20 | 85.88 | 77.27 | 81.90 | 1,819,261 |
| 25 | 86.37 | 78.92 | 82.91 | 2,204,988 |
| 30 | 85.97 | 79.18 | 82.83 | 2,618,686 |
| 40 | 86.10 | 78.12 | 82.40 | 3,521,423 |
| 50 | 85.95 | 78.50 | 82.50 | 4,503,554 |
| 100 | 86.28 | 79.02 | 82.91 | 8,997,708 |

Table 3: Training set pruning study. $F_1$ on dev set after training with different values of the $\epsilon$ parameter for training set pruning.


#### 6.2 Training Set Pruning

To find a good value of the $\epsilon$ parameter for training set pruning we tried several different values, using our rapid training setup and testing on the dev set. The results are shown in Table 3. We selected 25 as it showed the best performance/speed tradeoff, on average performing as well as if we had done no pruning at all, while requiring only a quarter the memory and CPU time.
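The tradeoff claimed here can be checked against Table 3 with a short calculation. This sketch assumes that the $\epsilon = 100$ row stands in for the effectively unpruned setting and that the number of tree pairs is a reasonable proxy for memory and CPU cost; neither assumption is stated explicitly in this section.

```python
# (epsilon, total dev F1, number of training tree pairs), copied from Table 3.
rows = [
    (15, 82.05, 1_463_283),
    (20, 81.90, 1_819_261),
    (25, 82.91, 2_204_988),
    (30, 82.83, 2_618_686),
    (40, 82.40, 3_521_423),
    (50, 82.50, 4_503_554),
    (100, 82.91, 8_997_708),
]
by_eps = {eps: (tot_f1, pairs) for eps, tot_f1, pairs in rows}

f1_25, pairs_25 = by_eps[25]
f1_100, pairs_100 = by_eps[100]

# Fraction of tree pairs kept at epsilon = 25 relative to epsilon = 100.
ratio = pairs_25 / pairs_100

print(f"tree pairs kept at eps=25: {ratio:.1%}")
print(f"Tot F1 difference (eps=25 minus eps=100): {f1_25 - f1_100:+.2f}")
```

Under those assumptions, $\epsilon = 25$ keeps about a quarter of the tree pairs while matching the total $F_1$ of the largest setting.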

### 8 Conclusions

By jointly parsing (and aligning) sentences in a translation pair, it is possible to exploit mutual constraints that improve the quality of syntactic analyses over independent monolingual parsing. We presented a joint log-linear model over source trees, target trees, and node-to-node alignments between them, which is used to select an optimal tree pair from a k-best list. On Chinese treebank data, this procedure improves F1 by 1.8 on Chinese sentences and by 2.5 on out-of-domain English sentences. Furthermore, by using this joint parsing technique to preprocess the input to a syntactic MT system, we obtain a 2.4 BLEU improvement.

## References

• 1. Anthony Aue, Arul Menezes, Bob Moore, Chris Quirk, and Eric Ringger. 2004. Statistical Machine Translation Using Labeled Semantic Dependency Graphs. In TMI.
• 2. Ann Bies, Martha Palmer, Justin Mott, and Colin Warner. 2007. English Chinese Translation Treebank V 1.0. Web Download. LDC2007T02.
• 3. Daniel M. Bikel, David Chiang, Two Statistical Parsing Models Applied to the Chinese Treebank, Proceedings of the Second Workshop on Chinese Language Processing: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, October 8, 2000, Hong Kong doi:10.3115/1117769.1117771
• 4. Eugene Charniak, Mark Johnson, Coarse-to-fine n-best Parsing and MaxEnt Discriminative Reranking, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, p.173-180, June 25-30, 2005, Ann Arbor, Michigan doi:10.3115/1219840.1219862
• 5. David Chiang, Hierarchical Phrase-Based Translation, Computational Linguistics, v.33 n.2, p.201-228, June 2007 doi:10.1162/coli.2007.33.2.201
• 6. Michael Collins, Head-Driven Statistical Models for Natural Language Parsing, Computational Linguistics, v.29 n.4, p.589-637, December 2003 doi:10.1162/089120103322753356
• 7. John DeNero and Dan Klein. 2007. Tailoring Word Alignments to Syntactic Machine Translation. In ACL.
• 8. Yuan Ding, Martha Palmer, Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, p.541-548, June 25-30, 2005, Ann Arbor, Michigan doi:10.3115/1219840.1219907
• 9. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a Translation Rule? In HLT-NAACL.
• 10. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, Ignacio Thayer, Scalable Inference and Training of Context-rich Syntactic Translation Models, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, p.961-968, July 17-18, 2006, Sydney, Australia doi:10.3115/1220175.1220296
• 11. Liang Huang, Kevin Knight, and Aravind Joshi. 2006. Statistical Syntax-directed Translation with Extended Domain of Locality. In HLT-NAACL.
• 12. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst, Moses: Open Source Toolkit for Statistical Machine Translation, Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, June 25-27, 2007, Prague, Czech Republic
• 13. Percy Liang, Ben Taskar, Dan Klein, Alignment by Agreement, Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, p.104-111, June 04-09, 2006, New York, New York doi:10.3115/1220835.1220849
• 14. Yuval Marton and Philip Resnik. 2008. Soft Syntactic Constraints for Hierarchical Phrase-based Translation. In ACL.
• 15. Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2003. Syntax for Statistical Machine Translation. Technical Report, CLSP, Johns Hopkins University.
• 16. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: A Method for Automatic Evaluation of Machine Translation. Research Report, IBM. RC22176.
• 17. Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In HLT-NAACL.
• 18. Chris Quirk, Arul Menezes, Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, p.271-279, June 25-30, 2005, Ann Arbor, Michigan doi:10.3115/1219840.1219874
• 19. Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A New String-to-dependency Machine Translation Algorithm with a Target Dependency Language Model. In ACL.
• 20. David A. Smith and Noah A. Smith. 2004. Bilingual Parsing with Factored Estimation: Using English to Parse Korean. In EMNLP.
• 21. Leslie G. Valiant. 1979. The Complexity of Computing the Permanent. In Theoretical Computer Science 8.
• 22. Wen Wang, Andreas Stolcke, and Jing Zheng. 2007. Reranking Machine Translation Hypotheses with Structured and Web-based Language Models. In IEEE ASRU Workshop.
• 23. Dekai Wu, Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora, Computational Linguistics, v.23 n.3, p.377-403, September 1997
• 24. Nianwen Xue, Fu-Dong Chiou, Martha Palmer, Building a Large-scale Annotated Chinese Corpus, Proceedings of the 19th International Conference on Computational Linguistics, p.1-8, August 24-September 01, 2002, Taipei, Taiwan doi:10.3115/1072228.1072373
• 25. Kenji Yamada, Kevin Knight, A Syntax-based Statistical Translation Model, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, p.523-530, July 06-11, 2001, Toulouse, France doi:10.3115/1073012.1073079
• 26. Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. 2008. Bayesian Learning of Non-compositional Phrases with Synchronous Parsing. In ACL.
• 27. Andreas Zollmann, Ashish Venugopal, Stephan Vogel, and Alex Waibel. 2006. The CMU-UKA Syntax Augmented Machine Translation System for IWSLT-06. In IWSLT.

1. Available at http://nlp.cs.berkeley.edu.
2. Available at http://nlp.cs.berkeley.edu.
3. We do not have a test with the basic alignment features removed because they are necessary to compute $a_0(t, t')$.
Authors: David Burkett and Dan Klein. Title: Two Languages Are Better Than One (for Syntactic Parsing). Year: 2008.