The Mathematical Derivation of NNLM

Notes on the NNLM paper

Posted by JL on June 1, 2020

A Neural Probabilistic Language Model

NNLM was proposed by Bengio et al. in the 2003 paper A Neural Probabilistic Language Model.

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.
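
For reference, the joint probability the abstract refers to factorizes by the chain rule into next-word conditionals, each of which the language model must estimate:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$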

Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.

That is, the traditional but very successful n-gram approaches generalize by concatenating very short overlapping sequences observed in the training set.
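
Concretely, an n-gram model assumes that only the previous $n-1$ words matter, so each conditional can be estimated from counts of short sequences that actually appear in the training corpus:

$$
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
$$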

We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

That is, the authors propose fighting the curse of dimensionality by learning a distributed representation for words, so that each training sentence informs the model about an exponential number of semantically neighbouring sentences.

The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations.

That is, the model simultaneously learns (1) a distributed representation for each word and (2) the probability function for word sequences, expressed in terms of those representations.
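
To make (1) and (2) concrete, here is a minimal sketch in PyTorch of the architecture the paper describes: a shared word feature matrix $C$, a tanh hidden layer, and a softmax output, i.e. $y = b + Wx + U\tanh(d + Hx)$. The vocabulary size, context length, and layer widths below are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn


class NNLM(nn.Module):
    """Sketch of the model: a shared word feature matrix C, one tanh hidden
    layer, and (optional) direct connections from the features to the output."""

    def __init__(self, vocab_size, context_size, emb_dim=60, hidden_dim=50):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)               # (1) word features C(w)
        self.H = nn.Linear(context_size * emb_dim, hidden_dim)   # hidden layer: d + Hx
        self.U = nn.Linear(hidden_dim, vocab_size)               # hidden-to-output
        self.W = nn.Linear(context_size * emb_dim, vocab_size)   # direct connections: b + Wx

    def forward(self, context):
        # context: (batch, context_size) indices of the previous n-1 words
        x = self.C(context).flatten(start_dim=1)                 # concatenate the context embeddings
        y = self.W(x) + self.U(torch.tanh(self.H(x)))            # y = b + Wx + U tanh(d + Hx)
        return torch.log_softmax(y, dim=-1)                      # (2) log P(w_t | context)


# Toy usage: a batch of 4 three-word contexts over a 10,000-word vocabulary.
model = NNLM(vocab_size=10_000, context_size=3)
ctx = torch.randint(0, 10_000, (4, 3))
log_probs = model(ctx)                                           # shape: (4, 10000)
```

In the paper the direct connections are optional (the model may set $W = 0$); dropping `self.W` above gives the purely nonlinear variant.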

Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.

1. Introduction

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task).
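
For a sense of scale: with a vocabulary of 100,000 words, a full table for the joint distribution of 10 consecutive words would need on the order of

$$
100000^{10} - 1 = 10^{50} - 1
$$

free parameters, which is why probability mass has to be shared across related sequences rather than estimated cell by cell.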

In high dimensions, it is crucial to distribute probability mass where it matters rather than uniformly in all directions around each training point.

We will show in this paper that the way in which the approach proposed here generalizes is fundamentally different from the way in which previous state-of-the-art statistical language modeling approaches are generalizing.