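To compute a high-dimensional similarity matrix (All Pairs Similarity Search, or Pairwise Similarity), or to mine the top-K most similar items in a large-scale dataset (K-Nearest Neighbor Graph Construction, or TopK Set Expansion), there are mainly the following approaches (using Document Similarity as the example):

Brute Force: the most direct approach. Two nested for loops compute the similarity between every pair of documents, giving a time complexity of O(n^2) in the number of documents n. The results are exact, but the computation is too expensive for large collections, so this method usually serves as a baseline.

Inverted Index Based: a large share of document pairs have no terms in common, so performance can be improved by computing similarity only between documents that contain at least one shared term.
[Figure: algorithm pseudocode]
[Figure: MapReduce-based distributed computation framework]
To further reduce computation and save space, researchers have proposed a series of pruning strategies and approximation algorithms. For details see "Scaling Up All Pairs Similarity Search", "Pairwise document similarity in large collections with MapReduce", and "Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce".

Locality Sensitive Hashing (LSH): documents are hashed into buckets according to some similarity measure, so that documents that are highly similar under that measure are likely to land in the same bucket. It is mainly used for near-duplicate detection, image similarity identification, and similar tasks. For details see "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" and "Google news personalization: scalable online collaborative filtering".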
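The difference between the brute-force and inverted-index approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pseudocode from the original post or from the cited papers: documents are tokenized on whitespace, weighted by raw term frequency, and compared with cosine similarity. The function names (`normalized_tf`, `brute_force`, `inverted_index`) and the sample documents are made up for this example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def normalized_tf(text):
    """Term-frequency vector scaled to unit length (assumed weighting scheme)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def brute_force(vectors):
    """Score every pair of documents: O(n^2) pair comparisons."""
    sims = {}
    for i, j in combinations(range(len(vectors)), 2):
        a, b = vectors[i], vectors[j]
        score = sum(w * b[t] for t, w in a.items() if t in b)
        if score > 0:
            sims[(i, j)] = score
    return sims


def inverted_index(vectors):
    """Score only the pairs that appear together in some posting list."""
    postings = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            postings[term].append((doc_id, w))
    sims = defaultdict(float)
    for plist in postings.values():
        # doc ids were appended in increasing order, so i < j in every pair
        for (i, wi), (j, wj) in combinations(plist, 2):
            sims[(i, j)] += wi * wj
    return dict(sims)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply",
    "markets rallied after the news",
]
vectors = [normalized_tf(d) for d in docs]
exact = brute_force(vectors)
indexed = inverted_index(vectors)
```

Both functions return the same nonzero similarities, but the indexed version never touches pairs such as documents 0 and 2 that share no term. A very frequent term (here "the") still produces a long posting list and therefore many pairs, which is the kind of cost that pruning strategies try to reduce.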
① Research on Similar Matrices ② This article expounds the concept, properties, and applications of similar matrices, and summarizes the methods for proving that matrices are similar. ③ Similarity; similar matrices; properties of similar matrices; methods of proving matrix similarity.
Title: A Study of Similar Matrices. Abstract: This article elaborates on the definition, properties, and applications of similar matrices, and summarizes the methods for proving that matrices are similar. Keywords: similarity; similar matrices; properties of similar matrices; proof of matrix similarity. For your reference.