International Business Machines Corporation
Low-complexity methods for assessing distances between pairs of documents
Abstract:
Two sets X₂ and X₁ of histograms of words, and a vocabulary V, are accessed. Each of the two sets is representable as a sparse matrix, each row of which corresponds to a histogram. Each histogram is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary. Two phases compute distances between pairs of histograms. The first phase includes computations performed for each histogram and for each word in the vocabulary to obtain a dense, floating-point vector y. The second phase includes computing, for each histogram, a sparse-matrix, dense-vector multiplication between a matrix representation of the set X₁ of histograms and the vector y. The multiplication is performed to obtain distances between all histograms of the set X₁ and each histogram X₂[j]. Distances between all pairs of histograms are thus obtained, based on which distances between documents can subsequently be assessed.
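The two phases described above can be illustrated with a minimal Python sketch. The abstract does not say what the per-word computation in the first phase is, so the sketch assumes one plausible choice: each vocabulary word has a coordinate vector (a hypothetical `embeddings` table), and y[w] is the Euclidean distance from word w to the closest word present in histogram X₂[j]. All function names, the dictionary-based sparse rows, and the sample data are illustrative assumptions, not taken from the patent.

```python
import math

def phase_one(hist_j, embeddings):
    """Phase 1 (assumed form): build the dense vector y for one histogram.

    y[w] is the Euclidean distance from vocabulary word w to the nearest
    word that actually occurs in hist_j.
    """
    present = [embeddings[w] for w in hist_j]
    return [min(math.dist(e, p) for p in present) for e in embeddings]

def phase_two(X1, y):
    """Phase 2: sparse-matrix, dense-vector product X1 * y.

    Each row of X1 is a sparse histogram stored as {word_index: weight},
    so only the non-zero entries of a row are visited.
    """
    return [sum(weight * y[w] for w, weight in row.items()) for row in X1]

def all_pair_distances(X1, X2, embeddings):
    """result[j][i] is the distance between X1[i] and X2[j]."""
    return [phase_two(X1, phase_one(hist_j, embeddings)) for hist_j in X2]

# Hypothetical vocabulary of three words with 2-D coordinates.
embeddings = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Sparse histograms: word index -> normalized weight.
X1 = [{0: 1.0}, {0: 0.5, 2: 0.5}]
X2 = [{1: 1.0}, {0: 1.0}]

distances = all_pair_distances(X1, X2, embeddings)
print(distances)
```

Under these assumptions the vector y is computed once per histogram of X₂ and then reused for every row of X₁, which is the reason the second phase reduces to a single matrix-vector product per histogram.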
Utility
12 Mar 2018
11 Jan 2022