Recently, Latent Semantic Indexing (LSI) based on Singular Value Decomposition (SVD) has been proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, although LSI has been validated as having good representative quality, it is often criticized for low discriminative power in representing documents. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is threefold. Firstly, we survey existing linear algebra methods for LSI, including both SVD based and non-SVD based methods. Secondly, we propose SVD on clusters for LSI and explain theoretically that SVD on clusters involves two manipulations: dimension expansion of document vectors and dimension projection using SVD. Moreover, we develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Thirdly, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performances of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of inter-document similarity measurement in comparison with other SVD based LSI methods.
As computer networks become the backbones of science and the economy, enormous quantities of machine-readable documents are becoming available. The fact that about 80 percent of business is conducted on unstructured information [
Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user’s query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy and homonymy), so terms in a user’s query will literally match terms in irrelevant documents. For these reasons, a better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document [
Latent Semantic Indexing (LSI) was proposed to overcome the problem of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval [
In this paper, we propose SVD on clusters (SVDC) to improve the discriminative power of LSI. The contribution of this paper is threefold. Firstly, we survey existing linear algebra methods for LSI, including both SVD based and non-SVD based methods. Secondly, we explain theoretically that SVD on clusters involves two manipulations: dimension expansion of document vectors and dimension projection using SVD. We also develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Thirdly, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performances of the proposed methods.
The rest of this paper is organized as follows. Section
The singular value decomposition is commonly used in the solution of unconstrained linear least square problems, matrix rank estimation, and canonical correlation analysis [
Here
Recently, a series of methods based on different matrix decompositions have been proposed for LSI. A common point of these decomposition methods is to find a rank-deficient matrix in the decomposed space to approximate the original matrix, so that the term frequency distortion in the term-document matrix can be adjusted. Basically, these methods fall into two categories: matrix decompositions based on SVD and matrix decompositions not based on SVD. Table
Existing linear algebra methods for LSI.
Category                                               Abbreviation  Full name
SVD based decomposition for term-document matrix       IRR           Iterative Residual Rescaling
                                                       SVR           Singular Value Rescaling
                                                       ADE           Approximate Dimension Equalization
Non-SVD based decomposition for term-document matrix   SDD           Semidiscrete Decomposition
                                                       LPI           Locality Preserving Indexing
                                                       RSVD          Riemannian SVD
The SVD based LSI methods include IRR [
The non-SVD based LSI methods include SDD [
Recently, two methods in [
Although the two methods are very similar to SVD on clusters, they were proposed for different uses and with different motivations. Firstly, this research presents a complete theory of SVD on clusters, including its theoretical motivation, a theoretical analysis of its effectiveness, and an updating process, none of which is mentioned in either of the two referred methods. Secondly, this research describes the detailed procedures of using SVD on clusters and attempts to use different clustering methods (
The motivation for SVD on clusters can be specified in the following four aspects:
The huge computational complexity involved in traditional SVD. According to [
Clusters existing in a document collection. Usually, different topics are scattered across the documents of a text collection. Even if all documents in a collection concern the same topic, we can divide them into several subtopics. Although SVD has the ability to uncover the most representative vectors for text representation, it might not be optimal in discriminating documents with different semantics. In information retrieval, as many documents relevant to the query as possible should be retrieved; on the other hand, as few documents irrelevant to the query as possible should be retrieved. If principal clusters, in which documents have closely related semantics, can be extracted automatically, then relevant documents can be retrieved within a cluster under the assumption that closely associated documents tend to be relevant to the same request; that is, relevant documents are more like one another than they are like nonrelevant documents.
Contextual information and co-occurrence of index terms in documents. Classic weighting schemes [
Divideandconquer strategy as theoretical support. The singular values in
With all of the above observations from both practice and theoretical analysis, SVD on clusters is proposed in this paper to improve the discriminative power of LSI.
To proceed, the basic concepts adopted in SVD on clusters are defined below to clarify the remainder of this paper.
Assuming that
Assuming that
With the above two definitions of the cluster submatrix and the SVDC approximation matrix, we propose two versions of SVD on clusters by using
Algorithm of SVD on clusters with
Input:
Output:
Method:
Cluster the document vectors
Allocate the document vectors according to vectors’ cluster labels from
Conduct SVD for each of the cluster submatrices of
Merge all the SVD approximation matrices of the cluster submatrices to construct the SVDC approximation matrix of
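The four steps above can be sketched in Python. This is a minimal sketch, not the paper's exact implementation: the clustering step is a toy k-means-style routine standing in for the (unspecified here) clustering algorithm, and the cluster count and rank are illustrative assumptions.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal k-means-style clustering (illustrative stand-in for the
    paper's clustering step): returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def svdc(A, n_clusters=2, rank=2, seed=0):
    """SVD on clusters: cluster the document vectors (columns of the
    term-document matrix A), run a truncated SVD inside each cluster
    submatrix, and merge the per-cluster approximations."""
    labels = kmeans_labels(A.T, n_clusters, seed=seed)      # steps 1-2
    A_hat = np.zeros_like(A, dtype=float)
    for c in range(n_clusters):
        cols = np.where(labels == c)[0]
        if cols.size == 0:
            continue
        U, s, Vt = np.linalg.svd(A[:, cols], full_matrices=False)  # step 3
        k = min(rank, len(s))
        A_hat[:, cols] = (U[:, :k] * s[:k]) @ Vt[:k, :]     # step 4: merge
    return A_hat
```

With `rank` at least the rank of each cluster submatrix, the merged approximation reproduces the original matrix; smaller ranks smooth the term frequencies within each cluster independently.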
For simplicity, here we only consider the case that the term-document
We can conclude that there are actually two manipulations involved in SVD on clusters: the first is dimension expansion of document vectors and the second is dimension projection using SVD.
On the one hand, notice that
Theoretically, according to this explanation, document vectors that are not in the same cluster submatrix will have zero cosine similarity. However, in fact, all document vectors are represented over the same terms, and the dimension expansion of document vectors is derived by merely copying the original space of
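The dimension expansion can be made concrete with a small sketch (toy values; the block layout is one direct reading of the construction described above): each document vector is copied into the coordinate block owned by its cluster, so vectors from different clusters occupy disjoint blocks and their inner product, hence their cosine similarity, is exactly zero.

```python
import numpy as np

def expand(doc_vecs, labels, n_clusters):
    """Dimension expansion: copy each document vector into the block of
    coordinates owned by its cluster; all other blocks stay zero."""
    n, m = doc_vecs.shape
    out = np.zeros((n, n_clusters * m))
    for i, c in enumerate(labels):
        out[i, c * m:(c + 1) * m] = doc_vecs[i]
    return out

docs = np.array([[1., 2.], [2., 4.], [3., 1.]])   # toy document vectors
X = expand(docs, labels=[0, 0, 1], n_clusters=2)
# Cross-cluster pairs have zero dot product; same-cluster pairs do not.
```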
Algorithm of SVD on clusters with SOMs clustering to approximate the term-document matrix for LSI is as follows:
Input:
Output:
Method:
Cluster the document vectors
Allocate the document vectors according to their cluster labels from
Conduct SVD using a predefined preservation rate for each cluster submatrix of
Merge all the SVD approximation matrices of the cluster submatrices to construct the SVDC approximation matrix of
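The preservation rate can then drive the per-cluster rank choice. A minimal sketch, assuming (since the exact definition is not spelled out in this excerpt) that the rate is the fraction of the total singular value mass to keep:

```python
import numpy as np

def rank_for_preservation(singular_values, pr):
    """Smallest k such that the leading k singular values account for at
    least a fraction `pr` of their total sum (one plausible reading of
    the preservation rate; the paper's exact definition may differ)."""
    s = np.asarray(singular_values, dtype=float)
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, pr) + 1)
```

Under this reading, a preservation rate of 1.0 keeps every singular value, so no truncation occurs.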
On the other hand, when using SVD for
The computation complexity of SVDC is
In rapidly changing environments such as the World Wide Web, the document collection is frequently updated, with new documents and terms constantly being added, and there is a need to find the latent-concept subspace for the updated document collection. In order to avoid recomputing the matrix decomposition, there are two kinds of updates for an established latent subspace of LSI: folding in new documents and folding in new terms.
Let
Despite this, to fold in
As for folding in these
Second,
Third, (
Here,
Thus, we finish the process of folding in a new document vector into SVDC decomposition and the centroid of
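For intuition, the per-cluster document fold-in can be sketched as follows. This is a simplified sketch: it uses the conventional LSI fold-in projection, the nearest-centroid cluster choice is inferred from the description above, the centroid update is omitted, and all variable names are illustrative.

```python
import numpy as np

def fold_in_document(d, centroids, cluster_factors):
    """Fold a new document vector d into an existing SVDC decomposition:
    pick the cluster with the nearest centroid (by cosine similarity),
    then project d into that cluster's latent subspace via the standard
    LSI fold-in d_hat = Sigma_k^{-1} U_k^T d."""
    sims = [d @ c / (np.linalg.norm(d) * np.linalg.norm(c) + 1e-12)
            for c in centroids]
    c = int(np.argmax(sims))
    U_k, s_k = cluster_factors[c]    # truncated factors (U_k, s_k) of cluster c
    d_hat = (U_k.T @ d) / s_k
    return c, d_hat
```

Folding in a column that already belongs to the cluster recovers its existing coordinates: if A = U diag(s) V^T, then Sigma^{-1} U^T A[:, j] equals the j-th row of V.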
Let
Although the method specified above inherits a disadvantage of SVD for folding in new terms, we do not yet have a better method to tackle this problem if no recomputation of SVD is desired. To fold in
Concerning folding in an element
Here,
Then, each
Finally, the approximation term-document
Thus, we finish the process of folding
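Folding in a term row is symmetric. Again a hedged sketch using the conventional LSI term fold-in t_hat = t V_k Sigma_k^{-1}; variable names are illustrative:

```python
import numpy as np

def fold_in_term(t, V_k, s_k):
    """Fold a new term's frequency row t (over one cluster's documents)
    into that cluster's truncated SVD: t_hat = t V_k Sigma_k^{-1}."""
    return (t @ V_k) / s_k
```

As with documents, folding in a term row already present in the matrix recovers its existing coordinates (the corresponding row of U).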
The Reuters-21578 distribution 1.0 collection is used as the English corpus for performance evaluation, and it is available online (
TanCorpV1.0 is used as the Chinese corpus in this research; it is available on the Internet (
We use similarity measurement as the method for performance evaluation. The basic assumption behind the similarity measure is that document similarity should be higher for any document pair relevant to the same topic (intra-topic pair) than for any pair relevant to different topics (cross-topic pair). This assumption is based on consideration of how the documents would be used by applications. For instance, in text clustering by
In this research, documents in the same category are regarded as having the same topic, and document pairs from different categories are regarded as cross-topic pairs. Firstly, document pairs are produced by iteratively coupling each document vector in a predefined category with another document vector in the whole corpus. Secondly, cosine similarity is computed for each document pair, and all the document pairs are sorted in descending order of similarity. Finally, (
Here,
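The procedure above can be turned into a single score by averaging the precision observed at the rank of each intra-topic pair in the sorted list. This is a sketch of one standard average-precision computation over pairs; the paper's exact formula may differ.

```python
import numpy as np

def pairwise_average_precision(similarities, is_intra_topic):
    """Sort document pairs by cosine similarity (descending) and average
    the precision at each intra-topic pair's rank."""
    order = np.argsort(-np.asarray(similarities, dtype=float))
    flags = np.asarray(is_intra_topic, dtype=bool)[order]
    ranks = np.flatnonzero(flags) + 1       # 1-based ranks of intra-topic pairs
    hits = np.arange(1, ranks.size + 1)     # intra-topic pairs seen so far
    return float(np.mean(hits / ranks))
```

A perfect indexing method ranks every intra-topic pair above every cross-topic pair and scores 1.0; interleaving the two kinds of pairs lowers the score.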
For both the Chinese and English corpora, we carried out experiments measuring the similarities of documents in each category. When using SVDC in Algorithm
Corpus    (…) clustering   SOMs clustering
Chinese   0.7367           0.6046
English   0.7697           0.6534
Average precision (see (
Similarity measure on English documents of SVD on clusters and other SVD based LSI methods. PR is the abbreviation for “preservation rate” and the best performances (measured by average precision) are marked in bold type.
PR   SVD              SVDC (…)         SVDC (SOMs)      SVR              ADE              IRR
1.0  –                –                –                0.4202 ± 0.0156  0.3720 ± 0.0253  0.3927 ± 0.0378
0.9  0.4382 ± 0.0324  0.4394 ± 0.0065  –                0.4202 ± 0.0197  0.2890 ± 0.0271  0.3929 ± 0.0207
0.8  0.4398 ± 0.0185  0.4425 ± 0.0119  –                0.4202 ± 0.0168  0.3293 ± 0.0093  0.3927 ± 0.0621
0.7  0.4420 ± 0.0056  –                0.4385 ± 0.0287  0.4089 ± 0.0334  0.3167 ± 0.0173  0.3928 ± 0.0274
0.6  0.4447 ± 0.0579  –                0.4462 ± 0.0438  0.4201 ± 0.0132  0.3264 ± 0.0216  0.3942 ± 0.0243
0.5  0.4475 ± 0.0431  –                0.4487 ± 0.0367  0.4203 ± 0.0369  0.3338 ± 0.0295  0.3946 ± 0.0279
0.4  0.4499 ± 0.0089  –                0.4498 ± 0.0194  0.4209 ± 0.0234  0.3377 ± 0.0145  0.3951 ± 0.0325
0.3  0.4516 ± 0.0375  –                0.4396 ± 0.0309  0.4222 ± 0.0205  0.3409 ± 0.0247  0.3970 ± 0.0214
0.2  0.4538 ± 0.0654  –                0.4372 ± 0.0243  0.4227 ± 0.0311  0.3761 ± 0.0307  0.3990 ± 0.0261
0.1  0.4553 ± 0.0247  –                0.4298 ± 0.0275  0.4229 ± 0.0308  0.4022 ± 0.0170  0.3956 ± 0.0185
Similarity measure on Chinese documents of SVD on clusters and other SVD based LSI methods. PR is the abbreviation for “preservation rate” and the best performances (measured by average precision) are marked in bold type.
PR   SVD              SVDC (…)         SVDC (SOMs)      SVR              ADE              IRR
1.0  –                –                –                0.4272 ± 0.0200  0.3632 ± 0.0286  0.2730 ± 0.0168
0.9  0.4312 ± 0.0279  –                0.4463 ± 0.0245  0.4272 ± 0.0186  0.3394 ± 0.0303  0.2735 ± 0.0238
0.8  0.4358 ± 0.0422  –                0.4458 ± 0.0239  0.4273 ± 0.0209  0.3136 ± 0.0137  0.2735 ± 0.0109
0.7  0.4495 ± 0.0387  –                0.4573 ± 0.0146  0.4273 ± 0.0128  0.3075 ± 0.0068  0.2732 ± 0.0127
0.6  0.4550 ± 0.0176  –                0.4547 ± 0.0294  0.4273 ± 0.0305  0.3006 ± 0.0208  0.2730 ± 0.0134
0.5  0.4573 ± 0.0406  –                0.4588 ± 0.0164  0.4273 ± 0.0379  0.2941 ± 0.0173  0.2729 ± 0.0141
0.4  0.4587 ± 0.0395  0.4624 ± 0.0098  –                0.4275 ± 0.0294  0.2857 ± 0.0194  0.2726 ± 0.0290
0.3  0.4596 ± 0.0197  –                0.4582 ± 0.0203  0.4285 ± 0.0305  0.2727 ± 0.0200  0.2666 ± 0.0242
0.2  0.4602 ± 0.0401  –                0.4432 ± 0.0276  0.4305 ± 0.0190  0.2498 ± 0.0228  0.2672 ± 0.0166
0.1  0.4617 ± 0.0409  –                0.4513 ± 0.0188  0.4343 ± 0.0193  0.3131 ± 0.0146  0.2557 ± 0.0188
To compare document indexing methods at different parameter settings, the preservation rate is varied from 0.1 to 1.0 in increments of 0.1 for SVD, SVDC, SVR, and ADE. For SVR, the rescaling factor is set to 1.35, as suggested in [
We can see from Tables
Considering the variances of the average precisions on different categories, we admit that SVDC may not be a robust approach, since its superiority over SVD is not obvious (as pointed out by one of the reviewers). However, we regard the variances of the mentioned methods as comparable to each other because they have similar values.
Moreover, SVDC with
To better illustrate the effectiveness of each method, the classic
Results of
Method                      SVDC with SOMs clustering   SVD
SVDC with (…) clustering    –                           –
SVDC with SOMs clustering                               >
Results of
Method                      SVDC with SOMs clustering   SVD
SVDC with (…) clustering    >                           >
SVDC with SOMs clustering                               ~
Figure
Similarity measure of SVDC with
We can see from Figure
In folding in new terms, SVDC with
This paper proposes SVD on clusters as a new indexing method for Latent Semantic Indexing. Based on a review of current linear algebraic methods for LSI, we claim that the state of the art of LSI roughly follows two disciplines: SVD based LSI methods and non-SVD based LSI methods. Then, after specifying its motivation, SVD on clusters is proposed. We describe the algorithm of SVD on clusters with two different clustering algorithms:
The possible applications of SVD on clusters include the automatic categorization of large amounts of Web documents, where LSI is an alternative for document indexing but has huge computational complexity, and the refinement of document clustering, where inter-document similarity measurement is decisive for performance. We admit that this paper covers merely linear algebra methods for latent semantic indexing. In the future, we will compare SVD on clusters with topic based methods for Latent Semantic Indexing on inter-document similarity measurement, such as Probabilistic Latent Semantic Indexing [
The authors declare that they have no competing interests.
This research was supported in part by the National Natural Science Foundation of China under Grant nos. 71101138, 61379046, 91218301, 91318302, and 61432001; the Beijing Natural Science Fund under Grant no. 4122087; and the Fundamental Research Funds for the Central Universities (buctrc201504).