自然语言处理最新论文

什么是自然语言处理

自然语言处理（英语：natural language processing，缩写作 NLP）是人工智能和语言学领域的分支学科。此领域探讨如何处理及运用自然语言；自然语言认知则是指让电脑“懂”人类的语言。自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化为计算机程序更易于处理的形式。

自然语言处理有四大类常见的任务

什么是命名实体识别

命名实体识别（NER）是信息提取（Information Extraction）的一个子任务，主要涉及如何从文本中提取命名实体并将其分类至事先划定好的类别，如在招聘信息中提取具体招聘公司、岗位和工作地点的信息，并将其分别归纳至公司、岗位和地点的类别下。命名实体识别往往先将整句拆解为词语并对每个词语进行此行标注，根据习得的规则对词语进行判别。这项任务的关键在于对未知实体的识别。基于此，命名实体识别的主要思想在于根据现有实例的特征总结识别和分类规则。这些方法可以被分为有监督（supervised）、半监督（semi-supervised）和无监督（unsupervised）三类。有监督学习包括隐形马科夫模型（HMM）、决策树、最大熵模型（ME）、支持向量机（SVM）和条件随机场（CRF）。这些方法主要是读取注释语料库，记忆实例并进行学习，根据这些例子的特征生成针对某一种实例的识别规则。

什么是词性标注

词性标注 (pos tagging) 是指为分词结果中的每个单词标注一个正确的词性的程序，也即确定每个词是名词、动词、形容词或其他词性的过程。

什么是文本分类

该技术可被用于理解、组织和分类结构化或非结构化文本文档。文本挖掘所使用的模型有词袋（BOW）模型、语言模型（ngram）和主题模型。隐马尔可夫模型通常用于词性标注（POS）。其涵盖的主要任务有句法分析、情绪分析和垃圾信息检测。

GLUE benchmark

General Language Understanding Evaluation benchmark，通用语言理解评估基准，用于测试模型在广泛自然语言理解任务中的鲁棒性。

LM：Language Model

语言模型，一串词序列的概率分布，通过概率模型来表示文本语义。语言模型有什么作用？通过语言模型，可以量化地衡量一段文本存在的可能性。对于一段长度为n的文本，文本里每个单词都有上文预测该单词的过程，所有单词的概率乘积便可以用来评估文本。在实践中，如果文本很长，P(wi|context(wi))的估算会很困难，因此有了简化版：N元模型。在N元模型中，通过对当前词的前N个词进行计算来估算该词的条件概率。

重要文献与资料

我们介绍词的向量表征，也称为 word embedding 。词向量是自然语言处理中常见的一个操作，是搜索引擎、广告系统、推荐系统等互联网服务背后常见的基础技术。

在这些互联网服务里，我们经常要比较两个词或者两段文本之间的相关性。为了做这样的比较，我们往往先要把词表示成计算机适合处理的方式。最自然的方式恐怕莫过于向量空间模型(vector space model)。在这种方式里，每个词被表示成一个实数向量（one-hot vector），其长度为字典大小，每个维度对应一个字典里的每个词，除了这个词对应维度上的值是1，其他元素都是0。

One-hot vector虽然自然，但是用处有限。比如，在互联网广告系统里，如果用户输入的query是“母亲节”，而有一个广告的关键词是“康乃馨”。虽然按照常理，我们知道这两个词之间是有联系的——母亲节通常应该送给母亲一束康乃馨；但是这两个词对应的one-hot vectors之间的距离度量，无论是欧氏距离还是余弦相似度(cosine similarity)，由于其向量正交，都认为这两个词毫无相关性。得出这种与我们相悖的结论的根本原因是：每个词本身的信息量都太小。所以，仅仅给定两个词，不足以让我们准确判别它们是否相关。要想精确计算相关性，我们还需要更多的信息——从大量数据里通过机器学习方法归纳出来的知识。

在机器学习领域里，各种“知识”被各种模型表示，词向量模型(word embedding model)就是其中的一类。通过词向量模型可将一个 one-hot vector映射到一个维度更低的实数向量（embedding vector），如embedding(母亲节)=[0.3,4.2,−1.5,...],embedding(康乃馨)=[0.2,5.6,−2.3,...]。在这个映射到的实数向量表示中，希望两个语义（或用法）上相似的词对应的词向量“更像”，这样如“母亲节”和“康乃馨”的对应词向量的余弦相似度就不再为零了。

词向量模型可以是概率模型、共生矩阵(co-occurrence matrix)模型或神经元网络模型。在用神经网络求词向量之前，传统做法是统计一个词语的共生矩阵X。 X是一个|V|×|V| 大小的矩阵，Xij表示在所有语料中，词汇表V(vocabulary)中第i个词和第j个词同时出现的词数，|V|为词汇表的大小。对X做矩阵分解（如奇异值分解），得到的U即视为所有词的词向量：

但这样的传统做法有很多问题：

基于神经网络的模型不需要计算和存储一个在全语料上统计产生的大表，而是通过学习语义信息得到词向量，因此能很好地解决以上问题。

神经网络

当词向量训练好后，我们可以用数据可视化算法t-SNE[ 4 ]画出词语特征在二维上的投影（如下图所示）。从图中可以看出，语义相关的词语（如a, the, these; big, huge）在投影上距离很近，语意无关的词（如say, business; decision, japan）在投影上的距离很远。

另一方面，我们知道两个向量的余弦值在[−1,1]的区间内：两个完全相同的向量余弦值为1, 两个相互垂直的向量之间余弦值为0，两个方向完全相反的向量余弦值为-1，即相关性和余弦值大小成正比。因此我们还可以计算两个词向量的余弦相似度。

模型概览

语言模型

在介绍词向量模型之前，我们先来引入一个概念：语言模型。语言模型旨在为语句的联合概率函数P(w1,...,wT)建模, 其中wi表示句子中的第i个词。语言模型的目标是，希望模型对有意义的句子赋予大概率，对没意义的句子赋予小概率。这样的模型可以应用于很多领域，如机器翻译、语音识别、信息检索、词性标注、手写识别等，它们都希望能得到一个连续序列的概率。以信息检索为例，当你在搜索“how long is a football bame”时（bame是一个医学名词），搜索引擎会提示你是否希望搜索"how long is a football game", 这是因为根据语言模型计算出“how long is a football bame”的概率很低，而与bame近似的，可能引起错误的词中，game会使该句生成的概率最大。

对语言模型的目标概率P(w1,...,wT)，如果假设文本中每个词都是相互独立的，则整句话的联合概率可以表示为其中所有词语条件概率的乘积，即：

然而我们知道语句中的每个词出现的概率都与其前面的词紧密相关, 所以实际上通常用条件概率表示语言模型：

N-gram neural model

在计算语言学中，n-gram是一种重要的文本表示方法，表示一个文本中连续的n个项。基于具体的应用场景，每一项可以是一个字母、单词或者音节。 n-gram模型也是统计语言模型中的一种重要方法，用n-gram训练语言模型时，一般用每个n-gram的历史n-1个词语组成的内容来预测第n个词。

Yoshua Bengio等科学家就于2003年在著名论文 Neural Probabilistic Language Models [ 1 ] 中介绍如何学习一个神经元网络表示的词向量模型。文中的神经概率语言模型（Neural Network Language Model，NNLM）通过一个线性映射和一个非线性隐层连接，同时学习了语言模型和词向量，即通过学习大量语料得到词语的向量表达，通过这些向量得到整个句子的概率。因所有的词语都用一个低维向量来表示，用这种方法学习语言模型可以克服维度灾难（curse of dimensionality）。注意：由于“神经概率语言模型”说法较为泛泛，我们在这里不用其NNLM的本名，考虑到其具体做法，本文中称该模型为N-gram neural model。

在上文中已经讲到用条件概率建模语言模型，即一句话中第t个词的概率和该句话的前t−1个词相关。可实际上越远的词语其实对该词的影响越小，那么如果考虑一个n-gram, 每个词都只受其前面n-1个词的影响，则有：

给定一些真实语料，这些语料中都是有意义的句子，N-gram模型的优化目标则是最大化目标函数:

其中f(wt,wt−1,...,wt−n+1)表示根据历史n-1个词得到当前词wt的条件概率，R(θ)表示参数正则项。

Continuous Bag-of-Words model(CBOW)

CBOW模型通过一个词的上下文（各N个词）预测当前词。当N=2时，模型如下图所示：

具体来说，不考虑上下文的词语输入顺序，CBOW是用上下文词语的词向量的均值来预测当前词。

其中xt为第t个词的词向量，分类分数（score）向量 z=U∗context，最终的分类y采用softmax，损失函数采用多类分类交叉熵。

Skip-gram model

CBOW的好处是对上下文词语的分布在词向量上进行了平滑，去掉了噪声，因此在小数据集上很有效。而Skip-gram的方法中，用一个词预测其上下文，得到了当前词上下文的很多样本，因此可用于更大的数据集。

如上图所示，Skip-gram模型的具体做法是，将一个词的词向量映射到2n个词的词向量（2n表示当前输入词的前后各n个词），然后分别通过softmax得到这2n个词的分类损失值之和。

我们介绍了词向量、语言模型和词向量的关系、以及如何通过训练神经网络模型获得词向量。在信息检索中，我们可以根据向量间的余弦夹角，来判断query和文档关键词这二者间的相关性。在句法分析和语义分析中，训练好的词向量可以用来初始化模型，以得到更好的效果。在文档分类中，有了词向量之后，可以用聚类的方法将文档中同义词进行分组，也可以用 N-gram 来预测下一个词。希望大家在本章后能够自行运用词向量进行相关领域的研究。

参考：

NLP领域必读的8篇论文

推荐下NLP领域内最重要的8篇论文吧（依据学术范标准评价体系得出的8篇名单）：

一、Deep contextualized word representations

摘要：We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

全文链接： Deep contextualized word representations——学术范

二、Glove: Global Vectors for Word Representation

摘要：Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

全文链接： Glove: Global Vectors for Word Representation——学术范

三、SQuAD: 100,000+ Questions for Machine Comprehension of Text

摘要：We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at this https URL

全文链接： SQuAD: 100,000+ Questions for Machine Comprehension of Text——学术范

四、GloVe: Global Vectors for Word Representation

摘要：Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

全文链接： GloVe: Global Vectors for Word Representation——学术范

五、Sequence to Sequence Learning with Neural Networks

摘要：Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

全文链接： Sequence to Sequence Learning with Neural Networks——学术范

六、The Stanford CoreNLP Natural Language Processing Toolkit

摘要：We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.

全文链接： The Stanford CoreNLP Natural Language Processing Toolkit——学术范

七、Distributed Representations of Words and Phrases and their Compositionality

摘要：The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

全文链接： Distributed Representations of Words and Phrases and their Compositionality——学术范

八、Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

摘要：Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.

全文链接： Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank——学术范

希望可以对大家有帮助，学术范是一个新上线的一站式学术讨论社区，在这里，有海量的计算机外文文献资源与研究领域最新信息、好用的文献阅读及管理工具，更有无数志同道合的同学以及学术科研工作者与你一起，展开热烈且高质量的学术讨论！快来加入我们吧！

一文看尽2018全年AI技术大突破：NLP跨过分水岭、CV研究效果惊人

量子位出品 | 公众号 QbitAI

2018，仍是AI领域激动人心的一年。

这一年成为NLP研究的分水岭，各种突破接连不断；CV领域同样精彩纷呈，与四年前相比GAN生成的假脸逼真到让人不敢相信；新工具、新框架的出现，也让这个领域的明天特别让人期待……近日，Analytics Vidhya发布了一份2018人工智能技术总结与2019趋势预测报告，原文作者PRANAV DAR。量子位在保留这个报告架构的基础上，对内容进行了重新编辑和补充。这份报告总结和梳理了全年主要AI技术领域的重大进展，同时也给出了相关的资源地址，以便大家更好的使用、查询。报告共涉及了五个主要部分：

下面，我们就逐一来盘点和展望，嘿喂狗~

2018年在NLP 历史上的特殊地位，已经毋庸置疑。

这份报告认为，这一年正是NLP的分水岭。2018年里，NLP领域的突破接连不断：ULMFiT、ELMo、最近大热的BERT……

迁移学习成了NLP进展的重要推动力。从一个预训练模型开始，不断去适应新的数据，带来了无尽的潜力，甚至有“NLP领域的ImageNet时代已经到来”一说。

正是这篇论文，打响了今年NLP迁移学习狂欢的第一枪。论文两名作者一是Fast.ai创始人Jeremy Howard，在迁移学习上经验丰富；一是自然语言处理方向的博士生Sebastian Ruder，他的NLP博客几乎所有同行都在读。两个人的专长综合起来，就有了ULMFiT。想要搞定一项NLP任务，不再需要从0开始训练模型，拿来ULMFiT，用少量数据微调一下，它就可以在新任务上实现更好的性能。

他们的方法，在六项文本分类任务上超越了之前最先进的模型。详细的说明可以读他们的论文：网站上放出了训练脚本、模型等：

这个名字，当然不是指《芝麻街》里那个角色，而是“语言模型的词嵌入”，出自艾伦人工智能研究院和华盛顿大学的论文Deep contextualized word representations，NLP顶会NAACL HLT 2018的优秀论文之一。

ELMo用语言模型（language model）来获取词嵌入，同时也把词语所处句、段的语境考虑进来。

这种语境化的词语表示，能够体现一个词在语法语义用法上的复杂特征，也能体现它在不同语境下如何变化。

当然，ELMo也在试验中展示出了强大功效。把ELMo用到已有的NLP模型上，能够带来各种任务上的性能提升。比如在机器问答数据集SQuAD上，用ELMo能让此前最厉害的模型成绩在提高4.7个百分点。

这里有ELMo的更多介绍和资源：

它由Google推出，全称是 B idirectional E ncoder R epresentations from T ransformers，意思是来自Transformer的双向编码器表示，也是一种预训练语言表示的方法。从性能上来看，没有哪个模型能与BERT一战。它在11项NLP任务上都取得了最顶尖成绩，到现在，SQuAD 2.0前10名只有一个不是BERT变体：

如果你还没有读过BERT的论文，真的应该在2018年结束前补完这一课：另外，Google官方开源了训练代码和预训练模型：如果你是PyTorch党，也不怕。这里还有官方推荐的PyTorch重实现和转换脚本：

BERT之后，NLP圈在2018年还能收获什么惊喜？答案是，一款新工具。

就在上周末，Facebook开源了自家工程师们一直在用的NLP建模框架PyText。这个框架，每天要为Facebook旗下各种应用处理超过10亿次NLP任务，是一个工业级的工具包。

（Facebook开源新NLP框架：简化部署流程，大规模应用也OK）

PyText基于PyTorch，能够加速从研究到应用的进度，从模型的研究到完整实施只需要几天时间。框架里还包含了一些预训练模型，可以直接拿来处理文本分类、序列标注等任务。

想试试？开源地址在此：

它能主动打电话给美发店、餐馆预约服务，全程流畅交流，简直以假乱真。Google董事长John Hennessy后来称之为“非凡的突破”，还说：“在预约领域，这个AI已经通过了图灵测试。”Duplex在多轮对话中表现出的理解能力、合成语音的自然程度，都是NLP目前水平的体现。如果你还没看过它的视频……

NLP在2019年会怎么样？我们借用一下ULMFiT作者Sebastian Ruder的展望：

今年9月，当搭载BigGAN的双盲评审中的ICLR 2019论文现身，行家们就沸腾了：简直看不出这是GAN自己生成的。

在计算机图像研究史上，BigGAN的效果比前人进步了一大截。比如在ImageNet上进行128×128分辨率的训练后，它的Inception Score（IS）得分166.3，是之前最佳得分52.52分 3倍。

除了搞定128×128小图之外，BigGAN还能直接在256×256、512×512的ImageNet数据上训练，生成更让人信服的样本。

在论文中研究人员揭秘，BigGAN的惊人效果背后，真的付出了金钱的代价，最多要用512个TPU训练，费用可达11万美元，合人民币76万元。

不止是模型参数多，训练规模也是有GAN以来最大的。它的参数是前人的2-4倍，批次大小是前人的8倍。

研究论文：

前前后后，Fast.ai团队只用了16个AWS云实例，每个实例搭载8块英伟达V100 GPU，结果比Google用TPU Pod在斯坦福DAWNBench测试上达到的速度还要快40%。这样拔群的成绩，成本价只需要 40美元，Fast.ai在博客中将其称作人人可实现。