2014 LearningLanguagefromaLargeUnann

(Vepstas & Goertzel, 2014) ⇒ Linas Vepstas, and Ben Goertzel. (2014). "Learning Language from a Large (Unannotated) Corpus". eprint arXiv:1401.3372

Subject Headings: RelEx System; OpenCog System; Link Grammar; Natural Language Comprehension System; Natural Language Generation System.

Notes

Cited By

Google Scholar: ~ 8 Citations (Retrieved:2019-10-17).
Semantic Scholar: ~ 5 Citations (Retrieved:2019-10-17).

Quotes

Abstract

A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus.

1 Introduction

2 Algorithmic Overview

3 Assumed Linguistic Infrastructure

4 Linguistic Content To Be Learned

4.1 Deep Comprehension

5 A Methodology for Unsupervised Language Learning from a Large Corpus

5.1 A High Level Perspective on Language Learning

5.2 Learning Syntax

5.2.1 Loose language

5.2.2 Elaboration Of the Syntactic Learning Loop

5.3 Learning Semantics

Syntactic relationships provide only the shallowest interpretation of language; semantics comes next. One may view semantic relationships (including semantic relationships close to the syntax level, which we may call "syntactico-semantic" relationships) as ensuing from syntactic relationships, via a similar but separate learning process to the one proposed above. Just as our approach to syntax learning is heavily influenced by our work with Link Grammar, our approach to semantics is heavily influenced by our work on the RelEx system [RVG05, LGE10, GPPG06, LGK+12], which maps the output of the Link Grammar parser into a more abstract semantic form. Prototype systems [GPA+10, LGK+12] have also been written mapping the output of RelEx into even more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks [GIGH08] formalism as implemented in the OpenCog [HG08] framework. These systems are largely based on hand-coded rules, and thus not in the spirit of language learning pursued in this proposal. However, they display the same structure that we assume here; the difference being that here we specify a mechanism for learning the linguistic content that fills in the structure via unsupervised corpus learning, obviating the need for hand-coding.
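The layered structure described in this passage (a syntactic parse mapped by rules into a more abstract semantic form) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Link` and `Relation` types, the `TOY_RULES` table, and the `links_to_relations` function are hypothetical and do not reproduce the actual Link Grammar or RelEx code or rule sets. The table stands in for the "linguistic content that fills in the structure", which the hand-coded systems write by hand and which the proposal aims to learn from a corpus instead.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One typed link between two words in a toy syntactic parse."""
    label: str  # link type, e.g. "S" (subject) or "O" (object)
    left: str   # word on the left end of the link
    right: str  # word on the right end of the link


class Relation(NamedTuple):
    """One semantic-style relation produced from a link."""
    name: str
    head: str
    dependent: str


# Hypothetical rule table: link label -> (relation name, head_is_right).
# In a hand-coded system this content is written by people; the proposal
# is to learn the equivalent content from an unannotated corpus.
TOY_RULES = {
    "S": ("_subj", True),   # subject link: the verb (right word) is the head
    "O": ("_obj", False),   # object link: the verb (left word) is the head
}


def links_to_relations(links, rules=TOY_RULES):
    """Map each link that has a rule to a relation; skip links without one."""
    relations = []
    for link in links:
        rule = rules.get(link.label)
        if rule is None:
            continue  # no rule for this link type in the toy table
        name, head_is_right = rule
        if head_is_right:
            head, dependent = link.right, link.left
        else:
            head, dependent = link.left, link.right
        relations.append(Relation(name, head, dependent))
    return relations


# Toy parse of "the cat chased mouse"; the "D" link has no rule and is skipped.
parse = [
    Link("D", "the", "cat"),
    Link("S", "cat", "chased"),
    Link("O", "chased", "mouse"),
]

for relation in links_to_relations(parse):
    print(f"{relation.name}({relation.head}, {relation.dependent})")
```

The point of the sketch is only the separation of structure from content: the mapping function is fixed, while the rule table is data that could in principle be replaced by rules induced from a corpus.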

(...)

5.3.1 Elaboration Of the Semantic Learning Loop

6 The Importance of Incremental Learning

7 Conclusion

Appendix B: Mutual Information

Appendix A: Meaning-Text Theory

References

[Ash65] Robert B. Ash. Information Theory. Dover Publications, 1965.
[Bel03] Anthony J. Bell. The co-information lattice. Somewhere or other, 2003.
[BN99] Franz Baader and Tobias Nipkow. Term rewriting and all that. Cambridge University Press, 1999.
[CS10] Shay B. Cohen and Noah A. Smith. Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11:3117–3151, 2010.
[dS77] Ferdinand de Saussure. Course in General Linguistics. Fontana/Collins, 1977. Orig. published 1916 as "Cours de linguistique générale".
[Gib98] Edward Gibson. Linguistic complexity: locality of syntactic dependencies. Cognition, 68:1–76, 1998.
[GIGH08] B. Goertzel, M. Ikle, I. Goertzel, and A. Heljakka. Probabilistic Logic Networks. Springer, 2008.
[Goe94] Ben Goertzel. Chaotic Logic. Plenum, 1994.
[Goe08] Ben Goertzel. A pragmatic path toward endowing virtually-embodied ais with human-level linguistic capability. IEEE World Congress on Computational Intelligence (WCCI), 2008.
[GPA+10] Ben Goertzel, Cassio Pennachin, Samir Araujo, Ruiting Lian, Fabricio Silva, Murilo Queiroz, Welter Silva, Mike Ross, Linas Vepstas, and Andre Senna. A general intelligence oriented architecture for embodied natural language processing. In: Proceedings of the Third Conference on Artificial General Intelligence. Springer, 2010.
[GPPG06] Ben Goertzel, Hugo Pinto, Cassio Pennachin, and Izabela Freire Goertzel. Using dependency parsing and probabilistic inference to extract relationships between genes, proteins and malignancies implicit among multiple biomedical research abstracts. In Proc. of Bio-NLP 2006, 2006.
[HG08] David Hart and Ben Goertzel. Opencog: A software framework for integrative artificial general intelligence. In: Proceedings of the First Conference on Artificial General Intelligence. IOS Press, 2008.
[Hod97] Wilfred Hodges. A Shorter Model Theory. Cambridge University Press, 1997.
[Hud84] Richard Hudson. Word Grammar. Oxford: Blackwell, 1984.
[Hud07] Richard Hudson. Language Networks: The New Word Grammar. Oxford Linguistics, 2007.
[iC06] R. Ferrer i Cancho. Why do syntactic links not cross? EPL (Europhysics Letters), 76(6):1228–1234, 2006.
[Kah03] Sylvain Kahane. The meaning-text theory. Dependency and Valency. An International Handbook of Contemporary Research, 1:546–570, 2003.
[KM04] Dan Klein and Christopher D. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL ’04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 479–486. Association for Computational Linguistics, 2004.
[KSPC13] Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, Stephen Pulman, and Bob Coecke. Reasoning about meaning in natural language with compact closed categories and frobenius algebras. 2013.
[LGE10] Ruiting Lian, Ben Goertzel, et al. Language generation via glocal similarity matching. Neurocomputing, 2010.
[LGK+12] Ruiting Lian, Ben Goertzel, Shujing Ke, Jade O'Neill, Keyvan Sadeghi, Simon Shiu, Dingjie Wang, Oliver Watkins, and Gino Yu. Syntax-semantic mapping for general intelligence: Language comprehension as hypergraph homomorphism, language generation as constraint satisfaction. In Artificial General Intelligence: Lecture Notes in Computer Science Volume 7716. Springer, 2012.
[Liu08] Haitao Liu. Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2):159–191, 2008.
[LP01] Dekang Lin and Patrick Pantel. Dirt: Discovery of inference rules from text. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), pages 323–328. ACM Press, 2001.
[Mih05] Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 411–418, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[Mil06] Jasmina Milićević. A short guide to the meaning-text linguistic theory. Journal of Koralex, (8):187–233, 2006.
[MLP06] Ryan McDonald, Kevin Lerman, and Fernando Pereira. Multilingual dependency analysis with a two-stage discriminative parser. In CoNLL-X ’06: Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 216–220, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[MP87] Igor A. Melcuk and Alain Polguere. A formal lexicon in meaning-text theory. Computational Linguistics, 13:261–275, 1987.
[MPRH05] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP 05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[MTF04] Rada Mihalcea, Paul Tarau, and Elizabeth Figa. Pagerank on semantic networks, with application to word sense disambiguation. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[PD09] Hoifung Poon and Pedro Domingos. Unsupervised semantic parsing. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1–10, Singapore, August 2009. Association for Computational Linguistics.
[Pro13] The Univalent Foundations Program. Homotopy Type Theory: Univalent Foundations of Mathematics. Institute for Advanced Study, 2013.
[RVG05] Mike Ross, Linas Vepstas, and Ben Goertzel. http://opencog.org/wiki/RelEx, 2005.
[SM07] Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In ICSC ’07: Proceedings of the International Conference on Semantic Computing, pages 363–369, Washington, DC, USA, 2007. IEEE Computer Society.
[ST91] Daniel Sleator and Davy Temperley. Parsing english with a link grammar. Technical report, Carnegie Mellon University Computer Science technical report CMU-CS-91-196, 1991.
[ST93] Daniel D. Sleator and Davy Temperley. Parsing english with a link grammar. In Proc. Third International Workshop on Parsing Technologies, pages 277–292, 1993.
[Ste90] James Steele, editor. Meaning-Text Theory: Linguistics, Lexicography, and Implications. University of Ottawa Press, 1990.
[Tem07] David Temperley. Minimization of dependency length in written english. Cognition, 105:300–333, 2007.
[Tes59] Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959.
[WP-a] Argument. http://en.wikipedia.org/wiki/Arguments(linguistics).
[WP-b] Boolean satisfiability problem. http://en.wikipedia.org/wiki/Boolean_satisfiability_problem.
[WP-c] Conditional mutual information. http://en.wikipedia.org/wiki/Conditional_mutual_information.
[WP-d] Dpll algorithm. http://en.wikipedia.org/wiki/DPLL_algorithm.
[WP-e] Predicate. http://en.wikipedia.org/wiki/Predicate(grammar).
[Yur98] Deniz Yuret. Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, MIT, 1998.
