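Chapter 1 Introduction
Basic concepts of information; The emergence of information theory; Basic concepts of information; Topics studied in Shannon information theory; Communication system model; Main content of Shannon information theory; Progress and applications of Shannon information theory; Background to the founding of Shannon information theory; Shannon's main contributions; Research progress in Shannon information theory; Applications of Shannon information theory; Review questions

Chapter 2 Measures of Discrete Information
Self-information and mutual information; Self-information; Mutual information; Information entropy; Definition and calculation of information entropy; Conditional entropy and joint entropy; Basic properties of entropy; Average mutual information; Definition of average mutual information; Properties of average mutual information; Average conditional mutual information; Chapter summary; Review questions; Exercises

Chapter 3 Discrete Sources
Classification and mathematical models of discrete sources; Classification of discrete sources; Mathematical model of discrete memoryless sources; Mathematical model of discrete sources with memory; Entropy of discrete memoryless sources; Entropy of single-symbol discrete memoryless sources; Entropy of the N-th extension of a discrete memoryless source; Entropy of discrete stationary sources; Discrete stationary sources; Entropy of discrete stationary sources with memory; Finite-state Markov chains; Basic concepts of Markov chains; Homogeneous Markov chains; Classification of Markov chain states; Stationary distribution of a Markov chain; Markov sources; Basic concepts of Markov sources; Generation model of Markov sources; Computing the entropy of the N-th extension of a Markov chain; Computing the symbol entropy of a Markov source; Source correlation and redundancy; Source correlation; Source redundancy; Correlation and redundancy of natural language; Chapter summary; Review questions; Exercises

Chapter 4 Continuous Information and Continuous Sources
Entropy of sets of continuous random variables; Discretization of continuous random variables; Entropy of a continuous random variable set; Conditional entropy of continuous random variable sets; Joint entropy of continuous random variable sets; Properties of the differential entropy of continuous random variable sets; Information divergence of continuous random variable sets; Entropy of discrete-time Gaussian sources; Entropy of a one-dimensional Gaussian random variable set; Entropy of multidimensional independent Gaussian random variable sets; Entropy of multidimensional correlated Gaussian random variable sets; Maximum entropy theorems for continuous sources; Peak-limited maximum entropy theorem; Power-limited maximum entropy theorem; Entropy power and redundancy; Average mutual information of continuous random variable sets; Average mutual information of continuous random variable sets; Properties of the average mutual information of continuous random variable sets; Mutual information between discrete and continuous sets; Mutual information between discrete events and continuous events; Average mutual information between discrete and continuous sets; Chapter summary; Review questions; Exercises

Chapter 5 Lossless Source Coding
Overview; Source encoder; Classification of source codes; Block codes; Fixed-length codes; Conditions for lossless coding; Source sequence grouping theorem; Source coding theorem for fixed-length codes; Variable-length codes; Properties of prefix codes; Source coding theorem for variable-length codes; Huffman coding; Binary Huffman coding; Multi-ary Huffman coding; Coding of Markov sources; *Several practical coding methods; Arithmetic coding; Run-length coding; one further coding method (name missing in the source); Chapter summary; Review questions; Exercises

Chapter 6 Discrete Channels and Their Capacity
Overview; Classification of channels; Mathematical model of discrete channels; Definition of channel capacity; Single-symbol discrete channels and their capacity; Capacity of discrete noiseless channels; Capacity of discrete symmetric channels; Capacity of general discrete channels; Cascaded channels and their capacity; Multidimensional vector channels and their capacity; Input and output properties of multidimensional vector channels; Discrete memoryless extension channels and their capacity; Parallel channels and their capacity; Sum channels and their capacity; Iterative computation of channel capacity; Chapter summary; Review questions; Exercises

Chapter 7 Noisy Channel Coding
Overview; Basic concepts of channel coding; Decision and decoding rules; Decoding error probability; Optimal decision and decoding criteria; Maximum a posteriori probability criterion; Maximum likelihood criterion; Channel coding and optimal decoding; Linear block codes; Sequence maximum-likelihood decoding; Several simple block codes; Fano's inequality; Channel equivocation; Fano's inequality; Noisy channel coding theorem; Jointly typical sequences; Noisy channel coding theorem; Lossless source-channel coding theorem; Brief introduction to error-correcting coding techniques; Encoding and decoding of linear block codes; Several important block codes; Introduction to convolutional codes; *Performance bounds of channel coding; Hamming sphere-packing bound; two further bounds (names missing in the source); Chapter summary; Review questions; Exercises

Chapter 8 Waveform Channels
Discrete-time continuous channels; Model of the time-discrete continuous channel; Stationary memoryless continuous channels; Properties of multidimensional vector continuous channels; Capacity of discrete-time continuous channels; Additive noise channels and capacity; Capacity of additive noise channels; Capacity of additive Gaussian noise channels; Capacity bounds for general additive noise channels; Capacity of parallel additive Gaussian noise channels; Capacity of a channel (type missing in the source); Additive Gaussian noise waveform channels; Mutual information and capacity of waveform channels; Capacity of a channel (type missing in the source); Coding theorem for Gaussian noise channels; Relation between power efficiency and spectral efficiency; Colored Gaussian noise channels; Capacity of colored Gaussian noise channels; Further discussion of channel capacity; *Channel capacity of digital modulation systems; Chapter summary; Review questions; Exercises

Chapter 9 Rate-Distortion Function
Overview; System model; Distortion measures; Rate-distortion function of discrete sources; Rate-distortion function; Properties of the R(D) function; Lossy source coding theorems; Rate compression; Lossy source coding theorem; Lossy source-channel coding theorem; Computing the rate-distortion function of discrete sources; Parametric method for solving R(D); Summary of the R(D) solution procedure; Meaning of the parameter s; Rate-distortion function of continuous sources; Rate-distortion function and its properties; Computing the R(D) function; Difference distortion measures; R(D) function of Gaussian sources; Discrete-time memoryless Gaussian sources; Independent parallel Gaussian sources; R(D) function of general continuous sources; *Brief introduction to lossy data compression techniques; Quantization; Predictive coding; Subband coding; Transform coding; Chapter summary; Review questions; Exercises

Chapter 10 Constrained Channels and Their Coding
Properties of labeled graphs; Basic concepts of labeled graphs; Transformations of labeled graphs; Capacity of constrained channels; Definition of constrained channel capacity; Capacity of constrained channels with equal-duration symbols; Capacity of unconstrained channels with unequal-duration symbols; Capacity of constrained channels with unequal-duration symbols; Properties of constrained sequences; Channel constraints on transmitted sequences; Run-length-limited (RLL) sequences; Partial-response maximum-likelihood (PRML) sequences; DC-balanced sequences; Other frequency-domain-constrained sequences; Constrained channel coding theorems; Description of encoders; Constrained channel coding theorem; Finite-state coding theorem; Encoder performance measures; *Constrained sequence coding and applications; Block encoders; Practical DC-balanced sequences; Common constrained sequence codes and their applications; Chapter summary; Review questions; Exercises

Chapter 11 Introduction to Network Information Theory
Overview; Multiple-access channels; Capacity of the two-user multiple-access channel; Capacity analysis of access channels under different multiple-access schemes; Capacity of multiple-access channels; Broadcast channels; Degraded broadcast channels; Capacity region of degraded broadcast channels; Correlated source coding; Typical models of correlated source coding; Correlated source coding theorem; Chapter summary; Review questions; Exercises

*Chapter 12 Information-Theoretic Methods and Their Applications
Estimation of source entropy; Entropy estimation for discrete source sequences; Entropy estimation for continuous sources; Maximum entropy principle; Description of the maximum entropy principle; Entropy concentration theorem; Several important maximum entropy distributions; Minimum cross-entropy principle; Minimum cross-entropy principle; Properties of cross-entropy; Properties of minimum cross-entropy inference; Cross-entropy method; Applications of information-theoretic methods; Entropy estimation and compression of sequences; Maximum entropy spectral estimation and minimum cross-entropy spectral estimation; Maximum entropy modeling and its application in natural language processing; Applications of the maximum entropy principle in economics; Outlook for applications of information-theoretic methods; Chapter summary; Review questions; Exercises

References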
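01 Origin
Information theory was first proposed by the American mathematician Claude Shannon. In 1948 Shannon published "A Mathematical Theory of Communication", which is regarded as the founding work of information theory.

02 Viewpoint
In this work he stated a fundamental problem of communication: reproducing at one end of a communication link, exactly or approximately, the message selected at the other end. Carried over to communication studies, the question becomes whether the communicator's intent can be understood and interpreted by the receiver. Shannon proposed a model of communication that captures the intrinsic elements of communication and the relationships among them, together with key concepts such as the information source, the channel, the destination, information, encoding, and the signal.

03 Influence
Information theory has influenced communication studies in several ways. First, communication studies borrowed core concepts of information theory, such as information and encoding/decoding. Second, although the Shannon-Weaver model of communication was aimed at engineering, many early scholars benefited from it, and it strongly inspired later research on models of communication.

[Reference: Introduction to Communication Studies (《传播学概论》)]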
We propose a new learning paradigm, Local to Global Learning (LGL), for Deep Neural Networks (DNNs) to improve performance on classification problems. The core of LGL is to train a DNN model gradually from fewer categories (local) to more categories (global) within the entire training set. LGL is most closely related to the Self-Paced Learning (SPL) algorithm, but its formulation differs from SPL: SPL trains on data from simple to complex, while LGL trains from local to global. In this paper, we incorporate the idea of LGL into the learning objective of DNNs and explain why LGL works better from an information-theoretic perspective. Experiments on toy data, CIFAR-10, CIFAR-100, and ImageNet show that LGL outperforms the baseline and SPL-based algorithms.

Researchers have spent decades developing the theory and techniques of Deep Neural Networks (DNNs). DNNs are now widely used in many areas, including speech recognition [9], computer vision [16, 20], and natural language processing [30]. Some techniques have proved effective, such as data augmentation [32, 29] and identity mapping between layers [10, 11]. Recently, some researchers have focused on improving the performance of DNNs by selecting training data in a certain order, as in curriculum learning [3] and self-paced learning [17]. Curriculum learning (CL) was first introduced in 2009 by Bengio et al. [3]. CL is inspired by human and animal learning, which suggests that a model should learn samples gradually from a simple level to a complex level. However, the curriculum often relies on prior, human-supplied knowledge that is independent of the subsequent learning process.
To alleviate the issues of CL, Self-Paced Learning (SPL) [17] was proposed to generate the curriculum automatically during training. SPL assigns a binary weight to each training sample, and whether a sample is chosen is decided by its loss at each training iteration. Since [17], many modifications of the basic SPL algorithm have emerged. [13] introduces a new regularization term that incorporates both easiness and diversity in learning. [12] designs soft weighting methods (instead of binary weights), such as linear soft weighting and logarithmic soft weighting. [14] proposes a framework called self-paced curriculum learning (SPCL), which exploits both prior knowledge available before training and information extracted dynamically during training.

However, some challenges for SPL-based methods remain: 1) It is hard to define simple and complex levels. CL defines these levels according to prior knowledge, which must be annotated by humans. This process is complicated and time-consuming, especially when the number of categories is large. Another solution is to choose simple samples according to their loss, as SPL does. However, a sample's loss depends on the choice of model and hyper-parameters: the loss of a sample may be large for one model but small for another. 2) SPL-based algorithms always bring additional hyper-parameters.
One must tune these hyper-parameters very carefully to generate a good curriculum, which makes the model harder to train.

To address the above two problems, we propose a new learning paradigm called Local to Global Learning (LGL). LGL trains the neural network model gradually from fewer categories (local) to more categories (global) within the entire training set, and brings only one hyper-parameter to the DNN (inversely proportional to the number of classes added each time). This new hyper-parameter is also easy to tune: in general, increasing its value improves the performance of the DNN. The intuition behind LGL is that a network usually does better by first memorizing fewer categories and then gradually learning from more categories, which is consistent with the way people learn. The formulation of LGL can be better understood by comparing it with transfer learning, as shown in Figure 1. In transfer learning, the initial weights of a DNN are transferred from another dataset. In LGL, the initial weights are transferred from within the same domain, without knowledge of other datasets. Traditional methods initialize the weights randomly, which ignores the distribution of the training data and may end in a bad local minimum, whereas LGL initializes the weights so that they capture the distribution of the data already trained on. LGL can therefore also be seen as an initialization strategy for DNNs. In this paper, we explain the methodology of LGL in detail through its mathematical formulation. Instead of concentrating on sample loss (as in SPL), we focus on training a DNN effectively by continually adding new classes to it.
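The local-to-global idea described above can be illustrated with a small class schedule. This is only a sketch under stated assumptions, not the authors' implementation: the names `lgl_class_schedule`, `select_indices`, `initial`, and `step` are hypothetical, the random order in which classes are added is an assumption, and `step` (classes added per stage) merely stands in for the single hyper-parameter the text mentions.

```python
import random


def lgl_class_schedule(classes, initial=2, step=1, seed=0):
    """Yield growing lists of active classes, from local (few) to global (all)."""
    if initial < 1 or step < 1:
        raise ValueError("initial and step must be positive")
    order = list(classes)
    # Assumption: classes are added in a random order fixed by `seed`.
    random.Random(seed).shuffle(order)
    count = min(initial, len(order))
    while True:
        yield order[:count]
        if count == len(order):
            break
        count = min(count + step, len(order))


def select_indices(labels, active_classes):
    """Return the indices of training samples whose label is currently active."""
    active = set(active_classes)
    return [i for i, y in enumerate(labels) if y in active]


labels = [0, 1, 2, 3, 0, 1, 2, 3]
stages = list(lgl_class_schedule(range(4), initial=2, step=1))
for active in stages:
    subset = select_indices(labels, active)
    # A real run would keep training the same network on `subset` here,
    # starting from the weights learned in the previous stage.
    print(sorted(active), len(subset))
```

Each stage reuses the classes of the previous one and adds `step` more, so the final stage always covers the whole training set.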
There are three main contributions from this paper:
• We propose a new learning paradigm called Local to Global Learning (LGL) and incorporate the idea of LGL into the learning objective of DNNs. Unlike SPL, LGL guides a DNN to learn gradually from fewer categories (local) to more categories (global) within the entire training set.
• From an information-theoretic perspective (conditional entropy), we confirm that LGL can make a DNN more stable to train from the beginning.
• We run the LGL algorithm on toy data, CIFAR-10, CIFAR-100, and ImageNet. The experiments on toy data show that the loss curve of LGL is more stable and that the algorithm converges faster than the SPL algorithm when the model or the data distribution varies. The experiments on CIFAR-10, CIFAR-100, and ImageNet show that the classification accuracy of LGL exceeds that of the baseline and SPL-based algorithms.

SPL has been applied to many research fields. [24] uses SPL for long-term tracking problems to automatically select the right frames for the model to learn. [28] integrates the SPL method into a multiple-instance learning framework for selecting efficient training samples.
[27] proposes multi-view SPL for clustering, which overcomes the drawback of getting stuck in bad local minima during optimization. [31] introduces a new matrix factorization framework that combines the SPL methodology with traditional factorization methods. [8] proposes a framework named self-paced sparse coding, which combines the self-paced learning methodology with sparse coding and manifold regularization; the proposed method can effectively relieve the effect of nonconvexity. [21] designs a new co-training algorithm called self-paced co-training. It differs from the standard co-training algorithm, which does not remove falsely labeled instances from training. [18] brings the idea of SPL into multi-task learning and proposes a framework that learns the tasks by simultaneously taking into account the complexity of both the tasks and the instances per task.

Recently, some researchers have combined SPL with modern DNNs. [19] proposes the self-paced convolutional network (SPCN), which improves CNNs with SPL to enhance learning robustness. In SPCN, each sample is assigned a weight that reflects its easiness. A dynamic self-paced function is incorporated into the learning objective of the CNN to jointly learn the CNN parameters and the latent weight variable. However, SPCN seems to work well only on simple datasets such as MNIST. [2] shows that CNNs with the SPL strategy show no actual improvement on the CIFAR dataset. [15] shows that when the CNN has fewer layers, an SPL-based algorithm may work better on CIFAR, but when the number of layers increases, as in VGG [23], the SPL algorithm performs almost the same as traditional CNN training. [25] proposes a variant of self-paced learning to improve the performance of neural networks. However, the method is complicated and cannot be applied to large datasets such as ImageNet.
Based on the above analysis of SPL's limitations, we develop a new data selection method for CNNs called Local to Global Learning (LGL). LGL brings only one hyper-parameter (easy to tune) to the CNN and performs better than the SPL-based algorithms.

Two other learning regimes are similar to our work: Active Learning [6] and Co-training [4], which also select data according to some strategy. In active learning, however, not all sample labels are known when the samples are chosen. Co-training deals with semi-supervised learning, in which some labels are missing. Thus, these two learning regimes differ from our setting, in which the labels of all training data are known.

Self-Paced Learning

Let us first briefly review SPL before introducing LGL. Let $L(y_i, g(x_i, w))$ denote the loss between the ground-truth label $y_i$ and the estimated label $g(x_i, w)$, where $w$ represents the parameters of the model. The goal of SPL is to jointly learn the model parameters $w$ and the latent variable $v = [v_1, \ldots, v_n]^T$ by minimizing:

$$\min_{w,\; v \in [0,1]^n} E(w, v; \lambda) = \sum_{i=1}^{n} v_i \, L(y_i, g(x_i, w)) + f(v; \lambda) \tag{1}$$

In the above, $v$ denotes the weight variables reflecting the samples' importance; $\lambda$ is a parameter controlling the learning pace; $f$ is called the self-paced function, which controls the learning scheme. SPL-based algorithms modify $f$ to generate a good curriculum automatically during learning. In the original SPL algorithm [17], $v \in \{0, 1\}^n$ and $f$ is chosen as:

$$f(v; \lambda) = -\lambda \|v\|_1 = -\lambda \sum_{i=1}^{n} v_i \tag{2}$$

Another popular algorithm is SPLD (self-paced learning with diversity) [13], which considers both $\|v\|_1$ and the sum of the group-wise $\|v\|_2$ norms. In SPLD, $f$ is chosen as:

$$f(v; \lambda, \gamma) = -\lambda \|v\|_1 - \gamma \|v\|_{2,1} \tag{3}$$

In general, iterative methods such as Alternate Convex Search (ACS) are used to solve (1), where $w$ and $v$ are optimized alternately. When $v$ is fixed, we can use existing supervised learning methods to minimize the first term in (1) and obtain the optimal $w^*$. Then, when $w$ is fixed and $f$ is taken from (2), the global optimum $v^* = [v_1^*, \ldots, v_n^*]^T$ can be calculated explicitly as:

$$v_i^* = \begin{cases} 1, & L(y_i, g(x_i, w)) < \lambda \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

From (4), $\lambda$ is a parameter that determines the difficulty of the sampled training data: when $\lambda$ is small, "easy" samples with small losses are sent to the model for training; as $\lambda$ is gradually increased, "complex" samples are provided to the model until the entire training set is used. From the above analysis, the key step in an SPL algorithm is to adjust the hyper-parameter $\lambda$ at each training iteration. In reality, however, we do not know the loss of each sample before training. Therefore one sometimes needs to run a baseline (a training algorithm without SPL) first to observe the average loss at each iteration, and then set an empirical schedule for increasing $\lambda$.
For more complex algorithms such as SPLD in (3), researchers must control two parameters, λ and γ, which makes training difficult. To avoid the difficulty of tuning parameters in SPL-based algorithms, we introduce our easy-to-train LGL algorithm.
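To make the role of λ concrete, the sample selection step of the original SPL can be sketched in a few lines. The sketch assumes the hard-threshold rule (a sample is kept when its loss is below λ) and a multiplicative pace; the names `spl_weights`, `spl_pace`, `lam0`, and `growth` are hypothetical. For illustration the losses are held fixed, whereas in real training they change every time the model parameters w are updated.

```python
def spl_weights(losses, lam):
    """Hard SPL weights: keep a sample only if its loss is below lam."""
    return [1 if loss < lam else 0 for loss in losses]


def spl_pace(losses, lam0, growth, max_rounds=50):
    """Raise lam each round until every sample is selected.

    Yields (lam, weights) per round. Assumption: lam grows by a constant
    factor `growth`; real schedules are usually tuned empirically.
    """
    lam = lam0
    for _ in range(max_rounds):
        v = spl_weights(losses, lam)
        yield lam, v
        if all(v):
            break
        lam *= growth


# Fixed per-sample losses, purely for illustration.
losses = [0.2, 0.9, 1.7, 0.4]
rounds = list(spl_pace(losses, lam0=0.5, growth=2.0))
for lam, v in rounds:
    print(f"lambda={lam:.1f} selected={v}")
```

The sketch also shows why λ is awkward to tune: a sensible starting value and growth rate depend on the scale of the losses, which is not known before training.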