论文: Efficient Neural Architecture Search via Parameter Sharing
神经网络结构搜索(NAS)目前在图像分类的模型结构设计上有很大的成果,但十分耗时,主要花在搜索到的网络(child model)的训练。论文的主要工作是提出 Efficient Neural Architecture Search (ENAS),强制所有的child model进行权重共享,避免从零开始训练,从而达到提高效率的目的。虽然不同的模型使用不同的权重,但从迁移学习和多任务学习的研究结果来看,将当前任务的模型A学习到的参数应用于别的任务的模型B是可行的。从实验看来,不仅共享参数是可行的,而且能带来很强的表现,实验仅用单张1080Ti,相对与NAS有1000x倍加速
NAS的搜索结果可以看作是大图中的子图,可以用单向无环图(DAG)来表示搜索空间,每个搜索的结构可以认为是图2的DAG一个子网。ENAS定义的DAG为所有子网的叠加,其中每个节点的每种计算类型都有自己的参数,当特定的计算方法激活时,参数才使用。因此,ENAS的设计允许子网进行参数共享,下面会介绍具体细节
为了设计循环单元(recurrent cell),采用 节点的DAG,节点代表计算类型,边代表信息流向,ENAS的controller也是RNN,主要定义:1) 激活的边 2) 每个节点的计算类型。在NAS(Zoph 2017),循环单元的搜索空间在预先定义结构的拓扑结构(二叉树)上,仅学习每个节点的计算类型,而NAS则同时学习拓扑结构和计算类型,更灵活
为了创建循环单元,the controller RNN首先采样 个block的结果,取 , 为当前单元输入信息(例如word embedding), 为前一个time step的隐藏层输出,具体步骤如下:
注意到每对节点( )都有独立的参数 ,根据选择的索引决定使用哪个参数,因此,ENAS的所有循环单元能同一个共享参数集合。论文的搜索空间包含指数数量的配置,假设有N个节点和4种激活函数,则共有 种配置
ENAS的controller为100个隐藏单元的LSTM,通过softmax分类器以自回归(autoregressive fashion)的方式进行选择的决定,上一个step的输出作为下一个step的输入embedding,controller的第一个step则接受空embedding输入。学习的参数主要有controller LSTM的参数 和子网的共享权重 ,ENAS的训练分两个交叉的阶段,第一阶段在完整的训练集上进行共享权重 学习,第二阶段训练controller LSTM的参数
固定controller的策略 ,然后进行 进行随机梯度下降(SGD)来最小化交叉熵损失函数的期望 , 为模型 在mini-batch上的交叉熵损失,模型 从 采样而来
梯度的计算如公式1, 上从 采样来的,集合所有模型的梯度进行更新。公式1是梯度的无偏估计,但有一个很高的方差(跟NAS一样,采样的模型性能差异),而论文发现,当 时,训练的效果还行
固定 然后更新策略参数 ,目标是最大化期望奖励 ,使用Adam优化器,梯度计算使用Williams的REINFORCE方法,加上指数滑动平均来降低方差, 的计算在独立的验证集上进行,整体基本跟Zoph的NAS一样
训练好的ENAS进行新模型构造,首先从训练的策略 采样几个新的结构,对于每个采样的模型,计算其在验证集的minibatch上的准确率,取准确率最高的模型进行从零开始的重新训练,可以对所有采样的网络进行从零训练,但是论文的方法准确率差不多,经济效益更大
对于创建卷积网络,the controller每个decision block进行两个决定,这些决定构成卷积网络的一层:
做 次选择产生 层的网络,共 种网络,在实验中,L取12
NASNet提出设计小的模块,然后堆叠成完整的网络,主要设计convolutional cell和reduction cell
使用ENAS生成convolutional cell,构建B节点的DAG来代表单元内的计算,其中node 1和node 2代表单元输入,为完整网络中前两个单元的输出,剩余的 个节点,预测两个选择:1) 选择两个之前的节点作为当前节点输入 2) 选择用于两个输入的计算类型,共5种算子:identity, separable convolution with kernel size 3 × 3 and 5 × 5, and average pooling and max pooling with kernel size 3×3,然后将算子结果相加。对于 ,搜索过程如下:
对于reduction cell,可以同样地使用上面的搜索空间生成: 1) 如图5采样一个计算图 2) 将所有计算的stride改为2。这样reduction cell就能将输入缩小为1/2,controller共预测 blocks 最后计算下搜索空间的复杂度,对于node i ,troller选择前 个节点中的两个,然后选择五种算子的两种,共 种坑的单元。因为两种单元是独立的,所以搜索空间的大小最终为 ,对于 ,大约 种网络
节点的计算做了一点修改,增加highway connections,例如 修改为 ,其中 , 为elementwise乘法。搜索到的结果如图6所示,有意思的是:1) 激活方法全部为tanh或ReLU 2) 结构可能为局部最优,随机替换节点的激活函数都会造成大幅的性能下降 3) 搜索的输出是6个node的平均,与mixture of contexts(MoC)类似
单1080Ti训练了10小时,Penn Treebank上的结果如表1所示,PPL越低则性能越好,可以看到ENAS不准复杂度低,参数量也很少
表2的第一块为最好的分类网络DenseNet的结构,第二块为ENAS设计整个卷积网络的结果(感觉这里不应有micro search space),第三块为设计单元的结果
全网络搜索的最优结构如图7所示,达到4.23%错误率,比NAS的效果要好,大概单卡搜索7小时,相对NAS有50000x倍加速
单元搜索的结构如图8所示,单卡搜索11.5小时, ,错误率为3.54%,加上CutOut增强后比NASNet要好。论文发现ENAS搜索的结构都是局部最优的,修改都会带来性能的降低,而ENAS不采样多个网络进行训练,这个给NAS带来很大性能的提升
NAS是自动设计网络结构的重要方法,但需要耗费巨大的资源,导致不能广泛地应用,而论文提出的 Efficient Neural Architecture Search (ENAS),在搜索时对子网的参数进行共享,相对于NAS有超过1000x倍加速,单卡搜索不到半天,而且性能并没有降低,十分值得参考
推荐下NLP领域内最重要的8篇论文吧(依据学术范标准评价体系得出的8篇名单): 一、Deep contextualized word representations 摘要:We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals. 全文链接: Deep contextualized word representations——学术范 二、Glove: Global Vectors for Word Representation 摘要:Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. 全文链接: Glove: Global Vectors for Word Representation——学术范 三、SQuAD: 100,000+ Questions for Machine Comprehension of Text 摘要:We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at this https URL 全文链接: SQuAD: 100,000+ Questions for Machine Comprehension of Text——学术范 四、GloVe: Global Vectors for Word Representation 摘要:Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. 全文链接: GloVe: Global Vectors for Word Representation——学术范 五、Sequence to Sequence Learning with Neural Networks 摘要:Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. 全文链接: Sequence to Sequence Learning with Neural Networks——学术范 六、The Stanford CoreNLP Natural Language Processing Toolkit 摘要:We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage. 全文链接: The Stanford CoreNLP Natural Language Processing Toolkit——学术范 七、Distributed Representations of Words and Phrases and their Compositionality 摘要:The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible. 全文链接: Distributed Representations of Words and Phrases and their Compositionality——学术范 八、Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank 摘要:Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases. 全文链接: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank——学术范 希望可以对大家有帮助, 学术范 是一个新上线的一站式学术讨论社区,在这里,有海量的计算机外文文献资源与研究领域最新信息、好用的文献阅读及管理工具,更有无数志同道合的同学以及学术科研工作者与你一起,展开热烈且高质量的学术讨论!快来加入我们吧!
324 浏览 4 回答
208 浏览 5 回答
94 浏览 3 回答
144 浏览 3 回答
288 浏览 5 回答
92 浏览 2 回答
86 浏览 2 回答
319 浏览 5 回答
118 浏览 5 回答
348 浏览 7 回答
205 浏览 3 回答
227 浏览 5 回答
275 浏览 4 回答
254 浏览 7 回答
185 浏览 3 回答