1996 AStochasticFSWordSegAlgForChi


Keyword: Word Segmentation Task, Chinese Language, Chinese Word Segmentation, Writing System.



The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, however, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes and, since the primary intended application of this model is text-to-speech synthesis, provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgments of a pool of human segmenters, and the system is shown to perform quite well.

1. Introduction

Any NLP application that presumes as input unrestricted text requires an initial phase of text analysis; such applications involve problems as diverse as machine translation, information retrieval, and text-to-speech synthesis (TTS). An initial step of any text analysis task is the tokenization of the input into words. For a language like English, this problem is generally regarded as trivial since words are delimited in English text by whitespace or marks of punctuation. Thus in an English sentence such as "I'm going to show up at the ACL" one would reasonably conjecture that there are eight words separated by seven spaces. A moment's reflection will reveal that things are not quite that simple. There are clearly eight orthographic words in the example given, but if one were doing syntactic analysis one would probably want to consider "I'm" to consist of two syntactic words, namely "I" and "am". If one is interested in translation, one would probably want to consider "show up" as a single dictionary word since its semantic interpretation is not trivially derivable from the meanings of "show" and "up". And if one is interested in TTS, one would probably consider the single orthographic word "ACL" to consist of three phonological words, /eɪ si ɛl/, corresponding to the pronunciation of each of the letters in the acronym. Space- or punctuation-delimited orthographic words are thus only a starting point for further analysis and can only be regarded as a useful hint at the desired division of the sentence into words.

Figure 1 - A Chinese sentence in (a) illustrating the lack of word boundaries. In (b) is a plausible segmentation for this sentence; in (c) is an implausible segmentation.

  • a)
    • 日文章魚怎麼說
    • How do you say octopus in Japanese?
  • b) Plausible segmentation
    • [日文] [章魚] [怎麼] [說]
    • [Japanese] [octopus] [how] [say]
  • c) Implausible segmentation
    • [日] [文章] [魚] [怎麼] [說]
    • [Japan] [essay] [fish] [how] [say]
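
The ambiguity in Figure 1 can be reproduced with a few lines of code. The following is a minimal sketch, not the paper's method: it assumes a hypothetical toy lexicon (`LEXICON`) containing only the words needed for this sentence, and it simply enumerates every way of covering the sentence with lexicon entries. A lexicon alone yields both segmentations, which is why some further source of information is needed to choose between them.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this prefix as a word and segment the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon: only the entries needed for Figure 1.
LEXICON = {"日", "日文", "文章", "章魚", "魚", "怎麼", "說"}

for seg in segmentations("日文章魚怎麼說", LEXICON):
    print(" ".join(f"[{w}]" for w in seg))
```

Both the plausible and the implausible bracketing from Figure 1 are produced; nothing in a plain word list ranks one above the other.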

Whether a language even has orthographic words is largely dependent on the writing system used to represent the language (rather than the language itself); the notion “orthographic word” is not universal. Most languages that use Roman, Greek, Cyrillic, Armenian, or Semitic scripts, and many that use Indian-derived scripts, mark orthographic word boundaries; however, languages written in a Chinese-derived writing system, including Chinese and Japanese, as well as Indian-derived writing systems of languages like Thai, do not delimit orthographic words.

Put another way, written Chinese simply lacks orthographic words. In Chinese text, individual characters of the script, to which we shall refer by their traditional name of hanzi, are written one after another with no intervening spaces; a Chinese sentence is shown in Figure 1. Partly as a result of this, the notion "word" has never played a role in Chinese philological tradition, and the idea that Chinese lacks anything analogous to words in European languages has been prevalent among Western sinologists; see DeFrancis (1984). Twentieth-century linguistic work on Chinese (Chao 1968; Li and Thompson 1981; Tang 1988, 1989, inter alia) has revealed the incorrectness of this traditional view. All notions of word, with the exception of the orthographic word, are as relevant in Chinese as they are in English, and just as is the case in other languages, a word in Chinese may correspond to one or more symbols in the orthography: 人 ren2 'person' is a fairly uncontroversial case of a monographemic word, and 中國 zhong1-guo2 (middle country) 'China' a fairly uncontroversial case of a digraphemic word. The relevance of the distinction between, say, phonological words and, say, dictionary words is shown by an example like 中華人民共和國 zhong1-hua2 ren2-min2 gong4-he2-guo2 (China people republic) 'People's Republic of China.' Arguably this consists of about three phonological words. On the other hand, in a translation system one probably wants to treat this string as a single dictionary word since it has a conventional and somewhat unpredictable translation into English.

Thus, if one wants to segment words, for any purpose, from Chinese sentences, one faces a more difficult task than one does in English since one cannot use spacing as a guide. For example, suppose one is building a TTS system for Mandarin Chinese. For that application, at a minimum, one would want to know the phonological word boundaries. Now, for this application one might be tempted to simply bypass the segmentation problem and pronounce the text character-by-character. However, there are several reasons why this approach will not in general work:

  • 1. Many hanzi have more than one pronunciation, where the correct pronunciation depends upon word affiliation: 的 is pronounced de0 when it is a prenominal modification marker, but di4 in the word 目的 mu4-di4 'goal'; 乾 is normally gan1 'dry,' but qian2 in a person's given name.
  • 2. Some phonological rules depend upon correct word segmentation, including Third Tone Sandhi (Shih 1986), which changes a 3 (low) tone into a 2 (rising) tone before another 3 tone: 小老鼠 xiao3 [lao3-shu3] 'little rat,' becomes xiao3 [lao2-shu3], rather than xiao2 [lao2-shu3], because the rule first applies within the word lao3-shu3 'rat,' blocking its phrasal application.
  • 3. In various dialects of Mandarin certain phonetic rules apply at the word level. For example, in Northern dialects (such as Beijing), a full tone (1, 2, 3, or 4) is changed to a neutral tone (0) in the final syllable of many words: 冬瓜 dong1-gua1 'winter melon' is often pronounced dong1-gua0. The high 1 tone of 瓜 would not normally neutralize in this fashion if it were functioning as a word on its own.
  • 4. TTS systems in general need to do more than simply compute the pronunciations of individual words; they also need to compute intonational phrase boundaries in long utterances and assign relative prominence to words in those utterances. It has been shown for English (Wang and Hirschberg 1992; Hirschberg 1993; Sproat 1994, inter alia) that grammatical part of speech provides useful information for these tasks.
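
The interaction between segmentation and Third Tone Sandhi described in point 2 can be sketched in code. This is a deliberately simplified illustration under stated assumptions: the function names (`apply_sandhi`, `pronounce`) and the two-pass scheme are hypothetical, the rule is applied to all adjacent pairs at once within a word, and real sandhi behaviour over longer phrases is more intricate than this. The sketch only shows that the same three syllables come out differently depending on whether word boundaries are known.

```python
def apply_sandhi(tones):
    """Change a 3 tone to a 2 tone when the following tone is 3."""
    return [
        2 if t == 3 and i + 1 < len(tones) and tones[i + 1] == 3 else t
        for i, t in enumerate(tones)
    ]


def pronounce(words):
    """`words` is a list of words, each a list of (syllable, tone) pairs."""
    # Pass 1: apply the rule inside each word.
    processed = []
    for word in words:
        syllables = [s for s, _ in word]
        tones = apply_sandhi([t for _, t in word])
        processed.append(list(zip(syllables, tones)))
    # Pass 2: apply the rule across word boundaries, using the tones
    # that survive pass 1 (so a word-internal change can block it).
    for i in range(len(processed) - 1):
        last_syllable, last_tone = processed[i][-1]
        next_tone = processed[i + 1][0][1]
        if last_tone == 3 and next_tone == 3:
            processed[i][-1] = (last_syllable, 2)
    return " ".join("-".join(f"{s}{t}" for s, t in w) for w in processed)


# With word boundaries: xiao3 + [lao3-shu3]
segmented = [[("xiao", 3)], [("lao", 3), ("shu", 3)]]
# Character by character: every hanzi treated as its own word
unsegmented = [[("xiao", 3)], [("lao", 3)], [("shu", 3)]]

print(pronounce(segmented))
print(pronounce(unsegmented))
```

With the word lao3-shu3 grouped, the first syllable keeps its 3 tone, as the text describes; treating each hanzi as its own word changes it to a 2 tone.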

Given that part-of-speech labels are properties of words rather than morphemes, it follows that one cannot do part-of-speech assignment without having access to word-boundary information. Making the reasonable assumption that similar information is relevant for solving these problems in Chinese, it follows that a prerequisite for intonation-boundary assignment and prominence assignment is word segmentation.

The points enumerated above are particularly related to TTS, but analogous arguments can easily be given for other applications; see for example Wu and Tseng's (1993) discussion of the role of segmentation in information retrieval. There are thus some very good reasons why segmentation into words is an important task.

A minimal requirement for building a Chinese word segmenter is obviously a dictionary; furthermore, as has been argued persuasively by Fung and Wu (1994), one will perform much better at segmenting text by using a dictionary constructed with text of the same genre as the text to be segmented. For novel texts, no lexicon that consists simply of a list of word entries will ever be entirely satisfactory, since the list will inevitably omit many constructions that should be considered words. Among these are words derived by various productive processes, including:

  • Morphologically derived words such as 學生們 xue2-sheng1+men0 (student+plural) 'students,' which is derived by the affixation of the plural affix 們 men0 to the noun 學生 xue2-sheng1.
  • Personal names such as 周恩來 zhou1-en1-lai2 'Zhou Enlai.' Of course, we can expect famous names like Zhou Enlai's to be in many dictionaries, but names such as 石基琳 shi2-ji1-lin2, the name of the second author of this paper, will not be found in any dictionary.
  • Transliterated foreign names such as 馬來西亞 ma3-lai2-xi1-ya3 'Malaysia.' Again, famous place names will most likely be found in the dictionary, but less well-known names, such as 布朗士維克 bu4-lang3-shi4-wei2-ke4 'Brunswick' (as in the New Jersey town name 'New Brunswick') will not generally be found.

In this paper we present a stochastic finite-state model for segmenting Chinese text into words, both words found in a (static) lexicon as well as words derived via the above-mentioned productive processes. The segmenter handles the grouping of hanzi into words and outputs word pronunciations, with default pronunciations for hanzi it cannot group; we focus here primarily on the system's ability to segment text appropriately (rather than on its pronunciation abilities). The model incorporates various recent techniques for incorporating and manipulating linguistic knowledge using finite-state transducers. It also incorporates the Good-Turing method (Baayen 1989; Church and Gale 1991) in estimating the likelihoods of previously unseen constructions, including morphological derivatives and personal names. We will evaluate various specific aspects of the segmentation, as well as the overall segmentation performance. This latter evaluation compares the performance of the system with that of several human judges since, as we shall show, even people do not agree on a single correct way to segment a text.
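
To make the idea of a stochastic segmenter concrete, the sketch below picks the lowest-cost segmentation by dynamic programming. It is an illustration under assumptions, not the system described here: the word frequencies in `FREQS` are invented, the cost of a word is assumed to be the negative log of its relative frequency, and a plain dynamic program stands in for weighted finite-state transducer machinery. Unseen words (derived words, names) are not handled; that is where an estimator such as Good-Turing would come in.

```python
import math


def word_costs(freqs):
    """Assumed weighting: cost = -log(relative frequency)."""
    total = sum(freqs.values())
    return {w: -math.log(f / total) for w, f in freqs.items()}


def best_segmentation(text, costs):
    """Return the minimum-total-cost split of `text`, or None if none exists."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of text[:i]
    back = [None] * (n + 1)      # back[i]: start of the last word in text[:i]
    best[0] = 0.0
    max_len = max(len(w) for w in costs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if math.isinf(best[n]):
        return None  # some stretch of text is covered by no lexicon entry
    words, pos = [], n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Invented frequencies, for illustration only.
FREQS = {"日": 50, "日文": 30, "文章": 40, "章魚": 20, "魚": 60, "怎麼": 80, "說": 100}

print(best_segmentation("日文章魚怎麼說", word_costs(FREQS)))
```

Under these invented counts the plausible segmentation of Figure 1 has the lower total cost; with different counts the ranking could differ, which is why the quality of the frequency estimates matters.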

Finally, this effort is part of a much larger program that we are undertaking to develop stochastic finite-state methods for text analysis with applications to TTS and other areas; in the final section of this paper we will briefly discuss this larger program so as to situate the work discussed here in a broader context.

2. A Brief Introduction to the Chinese Writing System

The first point we need to address is what type of linguistic object a hanzi represents. Much confusion has been sown about Chinese writing by the use of the term ideograph, suggesting that hanzi somehow directly represent ideas. The most accurate characterization of Chinese writing is that it is morphosyllabic (DeFrancis 1984): each hanzi represents one morpheme lexically and semantically, and one syllable phonologically.

Thus in a two-hanzi word like 中國 zhong1-guo2 (middle country) 'China' there are two syllables, and at the same time two morphemes. Of course, since the number of attested (phonemic) Mandarin syllables (roughly 1400, including tonal distinctions) is far smaller than the number of morphemes, it follows that a given syllable could in principle be written with any of several different hanzi, depending upon which morpheme is intended: the syllable zhong1 could be 中 'middle,' 鐘 'clock,' 終 'end,' or 忠 'loyal.' A morpheme, on the other hand, usually corresponds to a unique hanzi, though there are a few cases where variant forms are found. Finally, quite a few hanzi are homographs, meaning that they may be pronounced in several different ways, and in extreme cases apparently represent different morphemes: the prenominal modification marker 的 de0 is presumably a different morpheme from the second morpheme of 目的 mu4-di4, even though they are written the same way.

7. Conclusions

Despite these limitations, a purely finite-state approach to Chinese word segmentation enjoys a number of strong advantages. The model we use provides a simple framework in which to incorporate a wide variety of lexical information in a uniform way. The use of weighted transducers in particular has the attractive property that the model, as it stands, can be straightforwardly interfaced to other modules of a larger speech or natural language system: presumably one does not want to segment Chinese text for its own sake but instead with a larger purpose in mind. As described in Sproat (1995), the Chinese segmenter presented here fits directly into the context of a broader finite-state model of text analysis for speech synthesis. Furthermore, by inverting the transducer so that it maps from phonemic transcriptions to hanzi sequences, one can apply the segmenter to other problems, such as speech recognition (Pereira, Riley, and Sproat 1994). Since the transducers are built from human-readable descriptions using a lexical toolkit (Sproat 1995), the system is easily maintained and extended. While the size of the resulting transducers may seem daunting (the segmenter described here, as it is used in the Bell Labs Mandarin TTS system, has about 32,000 states and 209,000 arcs), recent work on minimization of weighted machines and transducers (cf. Mohri [1995]) shows promise for improving this situation. The model described here thus demonstrates great potential for use in widespread applications. This flexibility, along with the simplicity of implementation and expansion, makes this framework an attractive base for continued research.
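
The remark about inverting the transducer can be illustrated with a toy relation. The sketch below is an assumption-laden simplification: it treats the mapping from hanzi to pronunciations as a plain dictionary of sets (the names `HANZI_TO_PRON` and `invert` are hypothetical), whereas a real weighted transducer relates whole sequences and carries weights. The pronunciations are the ones cited earlier in this section.

```python
from collections import defaultdict

# Toy relation: each hanzi maps to the set of pronunciations it may have.
HANZI_TO_PRON = {
    "中": {"zhong1"},
    "鐘": {"zhong1"},
    "終": {"zhong1"},
    "忠": {"zhong1"},
    "的": {"de0", "di4"},
}


def invert(relation):
    """Swap input and output sides of a relation stored as key -> set."""
    inverse = defaultdict(set)
    for source, targets in relation.items():
        for target in targets:
            inverse[target].add(source)
    return dict(inverse)


PRON_TO_HANZI = invert(HANZI_TO_PRON)
print(sorted(PRON_TO_HANZI["zhong1"]))
```

Inverting twice returns the original relation, which reflects the point that the same lexical knowledge can be read in either direction.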

(In Chinese, numerals and demonstratives cannot modify nouns directly, and must be accompanied by a classifier. The particular classifier used depends upon the noun.)


  • 1. Antworth, Evan. (1990). PC-KIMMO: A Two-level Processor for Morphological Analysis. Occasional Publications in Academic Computing, 16. Summer Institute of Linguistics, Dallas, TX.
  • 2. Baayen, Harald. 1989. A Corpus-based Approach to Morphological Productivity: Statistical Analysis and Psycholinguistic Interpretation. Ph.D. thesis, Free University, Amsterdam.
  • 3. Becker, Richard, John Chambers, and Allan Wilks. 1988. The New S Language. Wadsworth and Brooks, Pacific Grove.
  • 4. Chang, Chao-Huang and Cheng-Der Chen. (1993). A study on integrating Chinese word segmentation and part-of-speech tagging. Communications of the Chinese and Oriental Languages Information Processing Society, 3(2):69--77.
  • 5. Chang, Jyun-Shen, C.-D. Chen, and Shun-De Chen. (1991). Xianzhishi manzu ji jilu zuijiahua de zhongwen duanci fangfa {Chinese word segmentation through constraint satisfaction and statistical optimization}. In: Proceedings of ROCLING IV, pages 147--165, Taipei. ROCLING.
  • 6. Chang, Jyun-Shen, Shun-De Chen, Ying Zheng, Xian-Zhong Liu, and Shu-Jin Ke. (1992). Large-corpus-based methods for Chinese personal name recognition. Journal of Chinese Information Processing, 6(3):7--15.
  • 7. Chao, Yuen-Ren. 1968. A Grammar of Spoken Chinese. University of California Press, Berkeley, CA.
  • 8. Keh-Jiann Chen, Shing-Huan Liu, Word identification for Mandarin Chinese sentences, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992066.992085
  • 9. Kenneth W. Church and William A. Gale. (1991). A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5(1):19--54.
  • 10. Kenneth Ward Church, Patrick Hanks, Word association norms, mutual information, and lexicography, Proceedings of the 27th annual meeting on Association for Computational Linguistics, p.76-83, June 26-29, 1989, Vancouver, British Columbia, Canada doi:10.3115/981623.981633
  • 11. DeFrancis, John. 1984. The Chinese Language. University of Hawaii Press, Honolulu.
  • 12. Fan, C.-K. and W.-H. Tsai. 1988. Automatic word identification in Chinese sentences by the relaxation technique. Computer Processing of Chinese and Oriental Languages, 4:33--56.
  • 13. Fung, Pascale and Dekai Wu. (1994). Statistical augmentation of a Chinese machine-readable dictionary. In WVLC-94, Second Annual Workshop on Very Large Corpora.
  • 14. Gan, Kok-Wee. (1994). Integrating Word Boundary Disambiguation with Sentence Understanding. Ph.D. thesis, National University of Singapore.
  • 15. Gu, Ping and Yuhang Mao. (1994). Hanyu zidong fenci de jinlin pipei suanfa ji qi zai QHFY hanying jiqi fanyi xitong zhong de shixian {The adjacent matching algorithm of Chinese automatic word segmentation and its implementation in the QHFY Chinese-English system}. In: Proceedings of The International Conference on Chinese Computing, Singapore.
  • 16. Julia Hirschberg, Pitch accent in context: predicting intonational prominence from text, Artificial Intelligence, v.63 n.1-2, p.305-340, Oct. 1993 doi:10.1016/0004-3702(93)90020-C
  • 17. Huang, Chu-Ren, Kathleen Ahrens, and Keh-jiann Chen. (1993). A data-driven approach to psychological reality of the mental lexicon: Two studies on Chinese corpus linguistics. Presented at the conference on Language and its Psychobiological Bases, December.
  • 18. Ronald M. Kaplan, Martin Kay, Regular models of phonological rule systems, Computational Linguistics, v.20 n.3, September 1994
  • 19. Lauri Karttunen, Ronald M. Kaplan, Annie Zaenen, Two-level morphology with composition, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992066.992091
  • 20. Koskenniemi, Kimmo. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. thesis, University of Helsinki, Helsinki.
  • 21. Kupiec, Julian. (1992). Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language. Submitted.
  • 22. Li, B. Y., S. Lin, C. F. Sun, and M. S. Sun. (1991). Yi zhong zhuyao shiyong yuliaoku biaoji jinxing qiyi jiaozheng de zuida pipei hanyu zidong fenci suanfa sheji {A maximum-matching word segmentation algorithm using corpus tags for disambiguation}. In ROCLING IV, pages 135--146, Taipei. ROCLING.
  • 23. Li, Charles and Sandra Thompson. 1981. Mandarin Chinese: A Functional Reference Grammar. University of California Press, Berkeley, CA.
  • 24. Liang, Nanyuan. 1986. Shumian hanyu zidong fenci xitong-CDWS {A written Chinese automatic segmentation system-CDWS}. Journal of Chinese Information Processing, 1(1):44--52.
  • 25. Lin, Ming-Yu, Tung-Hui Chiang, and Keh-Yi Su. (1993). A preliminary study on unknown word problem in Chinese word segmentation. In ROCLING 6, pages 119--141. ROCLING.
  • 26. Mohri, Mehryar. (1993). Analyse et représentation par automates de structures syntaxiques composées. Ph.D. thesis, University of Paris 7, Paris.
  • 27. Mohri, Mehryar. (1995). Minimization algorithms for sequential transducers. Theoretical Computer Science. Submitted.
  • 28. Masaaki Nagata, A stochastic Japanese morphological analyzer using a forward-DP backward-A* N-best search algorithm, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan doi:10.3115/991886.991920
  • 29. Nie, Jian-Yun, Wanying Jin, and Marie-Louise Hannan. (1994). A hybrid approach to unknown word detection and segmentation of Chinese. In: Proceedings of The International Conference on Chinese Computing, Singapore.
  • 30. Peng, Z.-Y. and J-S. Chang. (1993). Zhongwen cihui qiyi zhi yanjiu---duanci yu cixing biaoshi {Research on Chinese lexical ambiguity---segmentation and part-of-speech tagging}. In ROCLING 6, pages 173--193. ROCLING.
  • 31. Fernando Pereira, Michael Riley, Richard Sproat, Weighted rational transductions and their application to human language processing, Proceedings of the workshop on Human Language Technology, March 08-11, 1994, Plainsboro, NJ doi:10.3115/1075812.1075870
  • 32. PRCNSC, 1994. Contemporary Chinese Language Word Segmentation Specification for Information Processing. People's Republic of China National Standards Committee. In Chinese.
  • 33. ROCLING. (1993). Jisuan yuyanxue tongxun {Computational linguistics communications}. Newsletter of the Republic of China Computational Linguistics Society (ROCLING), April. In Chinese.
  • 34. Shih, Chilin. 1986. The Prosodic Domain of Tone Sandhi in Chinese. Ph.D. thesis, UCSD, La Jolla, CA.
  • 35. Sproat, Richard. (1992). Morphology and Computation. MIT Press, Cambridge, MA.
  • 36. Sproat, Richard. (1994). English noun-phrase accent prediction for text-to-speech. Computer Speech and Language, 8:79--94.
  • 37. Richard Sproat, Barbara Brunson, Constituent-based morphological parsing: a new approach to the problem of word-recognition, Proceedings of the 25th annual meeting on Association for Computational Linguistics, p.65-72, July 06-09, 1987, Stanford, California doi:10.3115/981175.981185
  • 38. Sproat, Richard and Chilin Shih. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4:336--351.
  • 39. Sproat, Richard and Chilin Shih. (1995). A corpus-based analysis of Mandarin nominal root compounds. Journal of East Asian Linguistics, 4(1):1--23.
  • 40. Tang, Ting-Chih. 1988. Hanyu Cifa Jufa Lunji {Studies on Chinese Morphology and Syntax}, volume 2. Student Book Company, Taipei. In Chinese.
  • 41. Tang, Ting-Chih. 1989. Hanyu Cifa Jufa Xuji {Studies on Chinese Morphology and Syntax:2}, volume 2. Student Book Company, Taipei. In Chinese.
  • 42. Torgerson, Warren. 1958. Theory and Methods of Scaling. Wiley, New York.
  • 43. Evelyne Tzoukermann, Mark Y. Liberman, A finite-state morphological processor for Spanish, Proceedings of the 13th conference on Computational linguistics, p.277-282, August 20-25, 1990, Helsinki, Finland doi:10.3115/991146.991195
  • 44. Liang-Jyh Wang, Wei-Chuan Li, Chao-Huang Chang, Recognizing unregistered names for Mandarin word identification, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992424.992473
  • 45. Wang, Michelle and Julia Hirschberg. (1992). Automatic classification of intonational phrase boundaries. Computer Speech and Language, 6:175--196.
  • 46. Wang, Yongheng, Haiju Su, and Yan Mo. (1990). Automatic processing of Chinese words. Journal of Chinese Information Processing, 4(4):1--11.
  • 47. Wieger, L. 1965. Chinese Characters. Dover, New York. Republication of second edition, published 1927 by Catholic Mission Press.
  • 48. Dekai Wu, Pascale Fung, Improving Chinese tokenization with linguistic filters on statistical lexical acquisition, Proceedings of the fourth Conference on Applied Natural Language Processing, October 13-15, 1994, Stuttgart, Germany doi:10.3115/974358.974399
  • 49. Zimin Wu, Gwyneth Tseng, Chinese text segmentation for text retrieval: achievements and problems, Journal of the American Society for Information Science, v.44 n.9, p.532-542, Oct. 1993 doi:10.1002/(SICI)1097-4571(199310)44:9<532::AID-ASI3>3.0.CO;2-M
  • 50. Yeh, Ching-long and Hsi-Jian Lee. (1991). Rule-based word identification for Mandarin Chinese sentences --- a unification approach. Computer Processing of Chinese and Oriental Languages, 5(2):97--118.


1996 AStochasticFSWordSegAlgForChi
Author: William A. Gale, Richard Sproat, Chilin Shih, Nancy Chang
Title: A Stochastic Finite-state Word-Segmentation Algorithm for Chinese
Journal: Computational Linguistics (CL) Research Area
URL: http://acl.ldc.upenn.edu/J/J96/J96-3004.pdf
Year: 1996