2003 ANeuralProbabilisticLanguageMod

(Bengio et al., 2003a) ⇒ Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. (2003). “A Neural Probabilistic Language Model.” In: The Journal of Machine Learning Research, 3.

Subject Headings: Distributional Word Representation, Neural Probabilistic Language Model.

Notes

Cited By

Google Scholar: ~ 7,516 Citations.
ACM DL: ~ 595 Citations.

2018

https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a
- QUOTE: ... 12 years before Tomas et al. introduces Word2Vec, Bengio et al. published a paper (Bengio et al., 2003a) to tackle language modeling and it is the initial idea of word embedding. At that time, they named this process as “learning a distributed representation for words”.

2013

(Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: Proceedings of the International Conference on Learning Representations Workshop.
- QUOTE: Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in (Bengio et al., 2003a), where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.
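
The architecture named in the quote above (a linear projection layer feeding a non-linear hidden layer, followed by a distribution over the next word) can be sketched as a single forward pass. This is a minimal illustration, not the authors' code: the layer sizes, the initialisation range, and the names `init_model`, `matvec`, and `next_word_probs` are assumptions made here, the optional direct projection-to-output connections of the original model are left out, and no training step is shown.

```python
import math
import random

def init_model(vocab_size, embed_dim, context_size, hidden_dim, seed=0):
    """Randomly initialise parameters (all sizes are illustrative assumptions)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": mat(vocab_size, embed_dim),                 # one feature vector per word
        "H": mat(hidden_dim, context_size * embed_dim),  # projection -> hidden weights
        "d": [0.0] * hidden_dim,                         # hidden bias
        "U": mat(vocab_size, hidden_dim),                # hidden -> output weights
        "b": [0.0] * vocab_size,                         # output bias
    }

def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def next_word_probs(model, context_ids):
    """Return a probability for every vocabulary word given the context word ids."""
    # Linear projection layer: look up and concatenate the context word vectors.
    x = [value for i in context_ids for value in model["C"][i]]
    # Non-linear hidden layer.
    h = [math.tanh(a + d) for a, d in zip(matvec(model["H"], x), model["d"])]
    # One unnormalised score per vocabulary word.
    y = [s + b for s, b in zip(matvec(model["U"], h), model["b"])]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(y)
    exps = [math.exp(v - top) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: a 10-word vocabulary and a 3-word context (hypothetical ids).
model = init_model(vocab_size=10, embed_dim=4, context_size=3, hidden_dim=8)
probs = next_word_probs(model, [1, 5, 7])
```

Because the table `C` is shared across all context positions and is a trainable parameter alongside the other weights, fitting this network to text would learn the word vectors and the language model jointly, which is the point the quote makes.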

Quotes

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
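
The "joint probability function of sequences of words" mentioned in the abstract is conventionally factored with the chain rule, and a model with a fixed context window approximates each factor using only the previous n − 1 words. The notation below (w_t for the t-th word, T for the sequence length, n for the window size) is chosen here for illustration:

```latex
\[
P(w_1, \dots, w_T)
  = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
  \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
\]
```

An n-gram model estimates each conditional factor from counts of short word sequences seen in training, whereas the approach described in the abstract expresses it as a function of learned word representations.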

References

BibTeX

@article{2003_ANeuralProbabilisticLanguageMod,
  author    = {Yoshua Bengio and
               R{\'e}jean Ducharme and
               Pascal Vincent and
               Christian Janvin},
  title     = {A Neural Probabilistic Language Model},
  journal   = {Journal of Machine Learning Research},
  volume    = {3},
  pages     = {1137--1155},
  year      = {2003},
  url       = {http://jmlr.org/papers/v3/bengio03a.html},
}
