A Heterogeneous System Based on Latent Semantic Analysis Using GPU and Multi-CPU

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing the term-by-document matrix using the Singular Value Decomposition (SVD) technique. However, LSA has a high computational cost for analyzing large amounts of information. The goals of this work are (i) to improve the execution time of the semantic space construction, dimensionality reduction, and information retrieval stages of LSA based on heterogeneous systems and (ii) to evaluate the accuracy and recall of the information retrieval stage. We present a heterogeneous Latent Semantic Analysis (hLSA) system, which has been developed using a General-Purpose computing on Graphics Processing Units (GPGPU) architecture, which can solve large numeric problems faster through the thousands of concurrent threads on the multiple CUDA cores of GPUs, and a multi-CPU architecture, which can solve large text problems faster through a multiprocessing environment. We execute the hLSA system with documents from the PubMed Central (PMC) database. The results of the experiments show that the hLSA system, for large matrices with one hundred and fifty thousand million values, is around eight times faster than the standard LSA version, with an accuracy of 88% and a recall of 100%.


Introduction
Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing a term-by-document matrix using term weighting schemes such as Log Entropy or Term Frequency-Inverse Document Frequency (TF-IDF) and using the Singular Value Decomposition (SVD) technique. LSA addresses one of the main problems of information retrieval techniques, that is, handling polysemous words, by assuming there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice [1]. LSA uses statistical techniques to estimate this latent structure and get rid of the obscuring "noise." Also, LSA has been considered as a new general theory of acquisition of similarities and knowledge representation, which is helpful in simulating the learning of vocabulary and other psycholinguistic phenomena [2].
Latent Semantic Analysis, from its beginnings to the present, has been applied in several research topics, for example, in applications to predict a reader's interest in a selection of news articles, based on their reported interest in other articles [3]; in the self-diagnosis of diseases through the description of medical imaging [4]; in applications to detect cyberbullying in teens and young adults [5]; in the field of computer vision by improving techniques for tracking moving people [6]; and in applications for the classification of less popular websites [7].
LSA has a computational complexity of $O(d^2 k^3)$, where $d$ is the smaller value between the number of documents and the number of terms and $k$ is the number of singular values [8]. LSA takes a considerable amount of time to index and to compute the semantic space when it is applied to large-scale datasets [9][10][11].
An initial parallel LSA implementation based on a GPU achieved acceleration of five to seven times for large matrices whose dimensions are divisible by 16, and twofold for matrices of other sizes. In that work, the GPU is used for the tridiagonalization of matrices, while the routines that compute the eigenvalues and eigenvectors of matrices are still implemented on the CPU. The results indicate that accuracy and speed need further research in order to produce an effective, fully implementable LSA algorithm [12].
A technique called index interpolation is presented for a rapid computation of the term-by-document matrix for large document collections; the associated symmetric eigenvector problem is then solved by distributing its computation among any number of computational units without increasing the overall number of multiplications. The experiments took 42.5 hours to compute 300,000 terms on 16 CPUs [13].
We present a fully heterogeneous system based on Latent Semantic Analysis (hLSA), which utilizes the resources of both GPU and CPU architectures to accelerate execution time. Our aim is to compute, reduce, and retrieve information faster than standard LSA versions and to evaluate the accuracy and recall of the information retrieval procedure in the hLSA system. The performance of hLSA has been evaluated, and the results show that acceleration as high as eight times could be achieved, with an accuracy of 88% and a recall of 100%. An early version of the hLSA system has been presented as a poster at the GPU Technology Conference [14].
The rest of the paper is organized as follows. Section 2 introduces the related background of LSA. In Section 3, we present our heterogeneous Latent Semantic Analysis system. Section 4 gives a description of the design of experiments. Section 5 presents the results of the experiments. Finally, Section 6 concludes the work.

Background
LSA takes a term-by-document matrix (M) and constructs a semantic space wherein terms and documents that are closely associated are placed near one another. Normally, the constructed semantic space has as many dimensions as unique terms. Additionally, instead of working with count data, the entries of matrix M are weighted with a representation of the occurrence of a word token within a document. Hence, LSA uses a normalized matrix which can be large and rather sparse. For this research, two types of weighting schemes are used.
(i) The first is a logarithmic local and global entropy weighting, known as a Log Entropy scheme. That is, if $tf_{[i,j]}$ denotes the number of times (frequency) that a word $i$ appears in document $j$ and $N$ is the total number of documents in the dataset, then
$$A_{[i,j]} = \log\left(tf_{[i,j]} + 1\right) \times \left(1 + \sum_{j=1}^{N} \frac{p_{[i,j]} \log p_{[i,j]}}{\log N}\right), \tag{1}$$
where $p_{[i,j]}$ is the fraction of documents containing the $i$th term; for example,
$$p_{[i,j]} = \frac{tf_{[i,j]}}{\sum_{l=1}^{N} tf_{[i,l]}}. \tag{2}$$
This particular term weighting scheme has been very successful for many LSA studies [15], but other functions are possible.
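(ii) The second is a term frequency and inverse document frequency weighting, known as a TF-IDF scheme, which assigns to word $i$ a weight in document $j$, where $N$ is the total number of documents in the dataset and $tf$ is the term frequency by documents matrix; for example, in the standard form of the scheme,
$$A_{[i,j]} = tf_{[i,j]} \times \log\left(\frac{N}{n_i}\right), \tag{3}$$
with $n_i$ the number of documents that contain word $i$. This particular term weighting scheme has been very successful for many LSA studies [8].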
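The two weighting schemes can be illustrated with a minimal sketch in plain Python. It assumes the standard textbook forms of Log Entropy and TF-IDF (natural logarithms, document frequency counted as the number of documents with a nonzero count); the function names `log_entropy` and `tf_idf` are illustrative and are not taken from the hLSA implementation, which runs these steps as CUDA kernels.

```python
import math

def log_entropy(tf):
    """Log Entropy weighting of a count matrix (rows = terms, columns = documents)."""
    n_docs = len(tf[0])
    weighted = []
    for row in tf:
        total = sum(row)  # global frequency of the term over all documents
        entropy = 0.0
        for count in row:
            if count > 0:
                p = count / total
                entropy += p * math.log(p)
        # Global weight: 0 for a term spread evenly over all documents,
        # 1 for a term concentrated in a single document.
        global_w = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        weighted.append([math.log(count + 1) * global_w for count in row])
    return weighted

def tf_idf(tf):
    """TF-IDF weighting of a count matrix (rows = terms, columns = documents)."""
    n_docs = len(tf[0])
    weighted = []
    for row in tf:
        df = sum(1 for count in row if count > 0)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted
```

In both schemes a term that occurs equally often in every document receives a weight of zero, which is the intended effect: such a term does not help to discriminate between documents.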
To reflect the major associative patterns in matrix A and ignore the smaller, less important influences, a reduced-rank approximation of matrix A is computed using the truncated Singular Value Decomposition [16]. Note that the SVD of the original weighted matrix can be written as
$$A = T S D^{T}, \tag{4}$$
where $T$ and $D$ are orthogonal matrices whose columns are the left and right singular vectors, respectively, and $S$ is the diagonal matrix of singular values. To obtain the truncated SVD denoted by $A_k$, it is necessary to restrict the SVD matrices to their first $k < \min(\text{terms}, \text{documents})$ dimensions, as revealed by
$$A_k = T_k S_k D_k^{T}. \tag{5}$$
Choosing the appropriate number of dimensions $k$ is an open research problem. It has been proposed that the optimum value of $k$ is in a range from 50 to 500 dimensions, depending on the size of the dataset. As described in [17], if the number of dimensions is too small, significant semantic content will remain uncaptured, and if the number of dimensions is too large, random noise in word usage will also be modeled. Note that the truncated SVD represents both terms and documents as vectors in $k$-dimensional space.
Finally, for information retrieval purposes, the $k$-dimensional semantic space is used. Thus, the terms of a user query are folded into the $k$-dimensional semantic space to identify a point in the space. This can be accomplished by parsing a query into a vector denoted by $q$ whose nonzero values correspond to the term weights of all unique valid words of the user query. Then, the query folding process denoted by $q_k$ can be represented as
$$q_k = q^{T} T_k S_k^{-1}. \tag{6}$$
Then, this $q_k$ vector can be compared with any or all document/term vectors of the $k$-dimensional semantic space. To compare vectors, the dot product or cosine between points is used. For example,
$$\cos(q_k, V) = \frac{q_k \cdot V}{\lVert q_k \rVert \, \lVert V \rVert}, \tag{7}$$
where $V$ is a vector representation of the $k$-dimensional space.
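Query folding and the cosine comparison can be sketched as follows. The sketch assumes the usual LSA folding rule, in which the query vector is projected onto the first $k$ left singular vectors and each coordinate is divided by the corresponding singular value; the names `fold_query`, `cosine`, `T_k`, and `s_k` are illustrative, and the matrices are plain Python lists rather than GPU buffers.

```python
import math

def fold_query(q, T_k, s_k):
    """Project a term-space query into the k-dimensional semantic space.

    q   : list of m term weights
    T_k : m x k matrix (list of rows) of left singular vectors
    s_k : list of the k largest singular values
    """
    k = len(s_k)
    return [
        sum(q[i] * T_k[i][j] for i in range(len(q))) / s_k[j]
        for j in range(k)
    ]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A folded query can then be compared with each document vector in turn, for instance `cosine(fold_query(q, T_k, s_k), doc_vector)`.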
LSA proposes the retrieval of information in two ways: (1) by establishing a minimum value of similarity, for example, all the similarities greater than 0.90, and (2) by obtaining the top-ranked values of similarity, for example, the top 10 or the top 5.
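Both retrieval modes can be expressed with one small helper. This is a minimal sketch; the function name `retrieve` and the dictionary layout (document identifier mapped to similarity value) are assumptions made for illustration.

```python
def retrieve(similarities, threshold=None, top_n=None):
    """Rank documents by similarity, then apply a threshold and/or a top-n cut.

    similarities : dict mapping a document identifier to its similarity value
    threshold    : keep only similarities strictly greater than this value
    top_n        : keep only the first top_n ranked documents
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(doc, sim) for doc, sim in ranked if sim > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```

The two options can be combined, for example `retrieve(sims, threshold=0.90, top_n=10)` returns at most ten documents, all above the threshold.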

Heterogeneous Latent Semantic Analysis System
The hLSA system has several technical challenges: (1) how to construct the semantic space using the multi-CPU architecture to speed up the text processing; (2) how to reduce the dimensionality of the term-by-document matrix using the GPU architecture to accelerate the matrix processing; and (3) how to retrieve information from the semantic space using GPU mechanisms to speed up the matrix and text computations. Figure 1 presents the proposed hLSA system for constructing, reducing, and retrieving relevant documents using heterogeneous architectures. Notably, one of the features of the hLSA system is the use of GPU architecture, which is able to execute SIMD (Single Instruction, Multiple Data) operations, such as matrix and vector multiplications, very efficiently with high parallelism. In addition, we take advantage of the multi-CPU architecture, which is able to execute SIMD operations, such as map and reduce functions. On the one hand, the hLSA system utilizes the GPU architecture to solve matrix operations, especially in the stages of dimensionality reduction and information retrieval. On the other hand, the hLSA system utilizes CPU and multi-CPU architectures to solve text processing in the semantic space stage. Thus, with the use of these mechanisms the hLSA system is able to achieve accelerated performance.

Semantic Space Stage.
The input for constructing the semantic space is a knowledge base; this is represented by a dataset of raw texts, which is generally stored in the disk drive of a personal computer. Therefore, the hLSA system first needs to find the text documents. This execution has no major computational cost and therefore is executed by the CPU in a sequential manner. As a result, a document list is generated which is loaded into the main memory of the CPU.
From the document list obtained, the hLSA system reads each document text and starts the preprocessing: it eliminates the blank spaces at the beginning and the end, eliminates special characters such as "?!(),;:.⋅, and ignores common words known as stopwords, such as "the", "for", and "you". As a result, the hLSA system generates a list of preprocessed texts, which is stored in the main CPU memory. This procedure is performed at CPU runtime in a sequential manner.
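The preprocessing step can be sketched in a few lines of Python. The stopword set below contains only the three examples named in the text, and matching stopwords case-insensitively is an assumption; the name `preprocess` is illustrative and not taken from the hLSA code.

```python
import re

# Only the three example stopwords from the text; a real stopword file is much longer.
STOPWORDS = {"the", "for", "you"}

# Special characters to eliminate, replaced by a blank so words stay separated.
SPECIAL = re.compile(r'["?!(),;:.]')

def preprocess(text, stopwords=STOPWORDS):
    """Trim the text, remove special characters, and drop stopwords."""
    cleaned = SPECIAL.sub(" ", text.strip())
    words = [w for w in cleaned.split() if w.lower() not in stopwords]
    return " ".join(words)
```

For instance, `preprocess("  The cat, for you (and me)!  ")` yields the text `cat and me`.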
Then, the hLSA system starts to generate the count matrix of term frequency by documents denoted as A, where rows represent the terms and columns represent the documents.

Figure 2: The architecture used in the semantic space stage of the hLSA system. The blocks in light yellow represent the principal procedures executed by the CPU, the blocks in light blue represent the procedures executed by multi-CPUs, and the blocks in light green represent the procedures executed by GPU, respectively. The multi-CPUs include shared memory models for multiprocessing programming. (Diagram labels: read file from the hard disk drive; distribute each text to a core of the multi-CPUs; process mapping module.)
In order to do this, the hLSA system analyzes the list of preprocessed documents. It takes each preprocessed text of the list and splits the content wherever a blank space appears. For example, the text "This is an example" is split at each blank space. As a result, a list of words, ["This"; "is"; "an"; "example"], is generated.
Next, the hLSA system iterates over the new list of words and, each time a word appears in a document, adds one to the corresponding cell value of matrix A. This procedure has a high computational cost.
Thus, we implement a parallel multiprocessing model using a multi-CPU architecture. Each CPU core is in charge of processing one element of the list of preprocessed texts, as shown in Figure 2. As a result of the semantic space stage, the hLSA system generates matrix A through a combination of parallel and sequential procedures, where each cell value corresponds to the frequency with which a word appears in a document of the knowledge base.
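The construction of the count matrix can be sketched as a map step (count the words of one text) followed by a sequential assembly step. This is a minimal sketch assuming Python's standard `multiprocessing.Pool`; the names `count_words` and `build_count_matrix` are illustrative, and the ordering of terms (sorted) is an arbitrary choice made here for reproducibility.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(text):
    """Map step: split one preprocessed text on blank spaces and count its words."""
    return Counter(text.split())

def build_count_matrix(texts, processes=None):
    """Return (terms, A), where A[i][j] is the frequency of terms[i] in texts[j].

    With processes=None the texts are counted sequentially; with an integer,
    each text is handed to a worker of a pool with that many processes.
    """
    if processes:
        with Pool(processes) as pool:
            counts = pool.map(count_words, texts)
    else:
        counts = [count_words(text) for text in texts]
    # Sequential assembly: collect the unique terms, then fill the matrix.
    terms = sorted(set().union(*counts))
    A = [[c[term] for c in counts] for term in terms]
    return terms, A
```

When a pool is requested, the call should be made from a script guarded by `if __name__ == "__main__":`, for example `build_count_matrix(texts, processes=4)`, so that worker processes can import `count_words` safely on every platform.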

Dimensionality Reduction Stage.
The hLSA system uses the CUDA programming model in order to process the dimensionality reduction of matrix A. As a consequence, the hLSA system uses two dimensions (x, y) of the CUDA architecture; the x dimension is associated with the rows of matrices and the y dimension with the columns of matrices. Each block of the CUDA architecture executes 32 concurrent threads, and the total number of blocks executed depends on the total size of the matrix to be processed.
The hLSA system needs to normalize matrix A. Therefore, two term weighting schemes are used to normalize matrix A: the Log Entropy and the Term Frequency-Inverse Document Frequency schemes. In order to compute the term weighting schemes, the hLSA system loads matrix A into the GPU global memory and then executes the CUDA kernel functions; the resulting matrix is saved into the GPU global memory as $A_{norm}$.
Figure 3: The architecture used in the dimensionality reduction stage of the hLSA system. The blocks in light yellow represent the principal procedures executed by the CPU, the blocks in light blue represent the procedures executed by multi-CPUs, and the blocks in light green represent the procedures executed by GPU, respectively. The GPU executions include CUDA kernels and CUDA functions using floating-point numbers.
The hLSA system uses the SVD algorithm, which decomposes matrix $A_{norm}$ into three matrices. Matrix $T$ is an orthogonal matrix of left singular vectors. Matrix $S$ is a diagonal matrix and contains the singular values ordered by magnitude. Matrix $D^T$ is an orthogonal matrix of right singular vectors. Therefore, the hLSA system needs to reserve space in the GPU global memory for the resultant matrices $T$, $S$, and $D^T$. Then, it executes the compute svd CUDA function. At the end of the execution the hLSA system frees the GPU memory space of matrix $A_{norm}$.
Finally, the hLSA system truncates the resultant matrices of the SVD function to their first k dimensions. To do so, it executes the truncate svd CUDA function. As a result, the hLSA system generates three matrices denoted by Sy, Ty, and Dy. At the end of the execution the matrices are copied into the main memory of the CPU and then the GPU global memory is freed. The dimensionality reduction stage of the hLSA system is shown in Figure 3.
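The truncation itself amounts to keeping the first k columns of the left singular vectors, the first k singular values, and the first k rows of the transposed right singular vectors. The following is a minimal host-side sketch on plain Python lists; the name `truncate_svd` and its argument layout are assumptions for illustration and do not describe the CUDA function used by hLSA.

```python
def truncate_svd(T, s, Dt, k):
    """Keep the first k dimensions of an SVD given as plain lists.

    T  : m x r matrix (list of rows) of left singular vectors
    s  : list of r singular values, ordered by magnitude
    Dt : r x n matrix (list of rows) of transposed right singular vectors
    k  : number of dimensions to keep, 0 < k < min(m, n)
    """
    n_terms, n_docs = len(T), len(Dt[0])
    if not 0 < k < min(n_terms, n_docs):
        raise ValueError("k must satisfy 0 < k < min(terms, documents)")
    T_k = [row[:k] for row in T]  # first k columns of T
    s_k = s[:k]                   # k largest singular values
    Dt_k = Dt[:k]                 # first k rows of D^T
    return T_k, s_k, Dt_k
```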

Information Retrieval Stage.
The hLSA system uses a user query in order to retrieve the relevant information from the semantic space. Therefore, the hLSA system creates a one-dimensional vector denoted by q, with a size equal to the total number of terms obtained in the semantic space. Then, it analyzes the query; namely, if a word of the query corresponds to a word in the semantic space, the hLSA system increments by one the corresponding cell value in the vector q. This execution has no major computational cost and therefore is executed by the CPU in a sequential manner.
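The query parser described above can be sketched as follows. The name `query_vector` is illustrative, and the sketch assumes the query has already been preprocessed in the same way as the documents.

```python
def query_vector(query, terms):
    """Build the vector q: one cell per term of the semantic space.

    Each query word that matches a term increments that term's cell by one;
    words that do not occur in the semantic space are ignored.
    """
    index = {term: i for i, term in enumerate(terms)}
    q = [0] * len(terms)
    for word in query.split():
        i = index.get(word)
        if i is not None:
            q[i] += 1
    return q
```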
Then the hLSA system folds the vector q into the semantic space by using the query folding equation (6). Therefore, the hLSA system executes the query folding CUDA function using the matrices Sy, Ty, and Dy loaded in the GPU global memory. As a result, the hLSA system generates a vector $q_k$, whose values correspond to the term weights of all the unique words of the query. Then, the hLSA system compares the vector $q_k$ with all document vectors inside the semantic space in order to find their similarity.
To compute the similarity, the hLSA system uses the cosine similarity equation (7). Hence, it executes the cosine similarity CUDA function using the vector $q_k$ and the document vectors of matrix Dy. Ultimately, the hLSA system presents the most relevant documents based on the similarity value between the vector $q_k$ and each document vector. To present the results, the hLSA system establishes a threshold of similarity, for example, all the similarities greater than 0.90, and/or obtains the top-ranked values of similarity, for example, the top 20, the top 10, and the top 5. The information retrieval stage of the hLSA system is shown in Figure 4.

Experiments
The objective of this section is to evaluate the performance of the proposed hLSA system. The design of the experiments is detailed in the following numbered items. Each document of the knowledge base is named and identified by a unique number (PMCID) defined by an <article-id> element. Also, each NXML file has XML elements such as <article-title>, <doi>, <abstract>, <fig>, and <publisher-name>. For experimental purposes, we only use the abstract of the document. This information is extracted from each NXML file and copied to a new text file (.txt). Each text file is stored in a working directory with a file name equal to the PMCID identifier.
(2) Definition of Dataset Size. The experiments conducted with the hLSA system use a total of five thousand documents of the PMC database. In order to observe the variation of the total execution time of the hLSA system with large-scale datasets, we partition the five thousand documents into twelve subsets of different sizes, ranging from five hundred to five thousand documents. In Table 1, we present a summary of each subset. The first column represents the total number of documents. The second column represents the total number of words in a dataset after preprocessing the raw texts. The third column presents the total number of unique words in the dataset. The fourth column represents the total number of elements in the term-by-document matrix, and the last column represents the size in megabytes of the term-by-document matrix.
(4) Definition of k Dimensions. The experiments conducted with the hLSA system use several values of k to find the value with which the hLSA system is most accurate. In total, twenty values of k were used, ranging from twenty-five to five hundred dimensions in increments of twenty-five.
(5) Definition of Use Cases.The experiments in the hLSA system were conducted with three use cases: (a) bipolar disorders, (b) lupus disease, and (c) topiramate weight-loss.
(a) Bipolar Disorders.Bipolar disorders could be presented in patients in their middle age with obese disorders, who struggle with their weight and eat when feeling depressed, excessively anxious, agitated, and irritable and when having suicidal thoughts and/or difficulty sleeping.
(b) Lupus Disease. Lupus disease could be presented in patients with alopecia, rash around their nose and cheeks, delicate nonpalpable purpura on their calves, and swelling and tenderness of their wrists and ankles and with normocytic anemia, thrombocytopenia, and being positive for protein and RBC casts.
(6) Evaluation Methods. To evaluate our hLSA system, we have used two methods: (a) we have built an LSA system using a CPU sequential-only architecture, in order to specifically evaluate the acceleration reached by the hLSA system. Thus, we have compared the execution time of the hLSA and LSA (CPU sequential-only) systems. (b) We have evaluated the accuracy and recall of the information retrieval stage.
(a) Time Execution. We have executed the hLSA and LSA systems a total of four hundred eighty times to complete all the experiments with the different subsets, k dimensions, weighting schemes, and use cases. Therefore, in order to compare the execution time of the hLSA system versus the LSA system, we have collected the execution time of each procedure involved in each stage of both systems. We present the mean values of these executions in Section 5.
(b) Similarity Accuracy. We have executed the hLSA and LSA systems a total of four hundred eighty times to complete all the experiments with the different subsets, k dimensions, weighting schemes, and use cases. Therefore, in order to evaluate the information retrieval process of the hLSA system, we have presented the top 10 ranked documents, and we have compared the similarity values generated by the hLSA system in each experiment with the subset of five thousand documents.

Results
In this section, we present the results of the experiments with the hLSA system. First, we show the execution time for the semantic space stage. Second, we present the execution time for the dimensionality reduction stage. Third, we show the execution time for the information retrieval stage. Fourth, we present the execution time for the full hLSA system. Finally, the results to evaluate the similarity accuracy and recall of the information retrieval stage are presented.
The hLSA system has been executed on a Linux openSUSE 13.2 (3.16.6-2 kernel) system with an Intel Core i7 processor at 2.50 GHz with 16 GB of DDR3L main memory and 4 cores with 8 threads for each core and an NVIDIA GeForce GTX 970M. The GPU has 10 SMs with 128 CUDA cores for each SM, a total of 1280 CUDA cores. The GPU has a total amount of global memory of 3064 Mbytes and a clock speed of 2.5 GHz. Also, the maximum number of threads per multiprocessor is 2048, and the maximum number of threads per block is 1024. The GPU maximum dimension size of a thread block is "1024, 1024, and 64" in the (x, y, z) dimensions.
(1) Runtime Results: Semantic Space Stage. The procedures executed in the semantic space stage are to read the stopwords file, find documents in the hard disk drive, read documents from the hard disk drive, parse documents into the main CPU memory, and build the matrix of term frequency by documents. The hLSA system uses the benefits of multi-CPU architecture to parse the documents into the main CPU memory. Table 2 presents the execution time of the procedures executed in CPU sequential-only mode. In Table 3, we compare the execution time of the Document Parser procedure using CPU sequential-only and multi-CPU architecture with four, eight, and sixteen processors. Additionally, in Figure 5, we present the overall execution time using CPU and multi-CPU architectures. The execution times were obtained as mean values of the four hundred eighty executions and are presented in milliseconds.
The procedures with high execution time in the LSA system are Read Documents and Document Parser. On the other hand, the procedure with the highest execution time in the hLSA system is Read Documents. This is due to the fact that the Read Documents procedure has to access the disk drive in order to read the text documents, and the time to access the disk drive depends on the hardware. Therefore, there is a bottleneck imposed by the disk I/O hardware, which goes beyond the scope of our work.
Moreover, we have found a substantial improvement in the Document Parser procedure in the hLSA system using multi-CPU architecture. Using four processors, acceleration of 2.50 times is seen with five hundred documents and maximum acceleration of 2.90 times with five thousand documents. When using eight processors, acceleration of 3.00 times is seen with five hundred documents and maximum acceleration of 3.85 times with five thousand documents. Finally, for sixteen processors, acceleration of 3.30 times is seen with five hundred documents and maximum acceleration of 3.96 times with five thousand documents. Thus, the best overall performance is found with sixteen processors, and the acceleration grows as the document corpus increases.
(2) Runtime Results: Dimensionality Reduction Stage. The weighting schemes Log Entropy and Term Frequency-Inverse Document Frequency have a high computational cost in the LSA system, as shown in Table 4. The hLSA system reaches acceleration of thirty times for the smaller datasets (five hundred to two thousand five hundred documents) and acceleration of eight hundred fifty times for the larger datasets (three thousand to five thousand documents) using the TF-IDF scheme. Also, using the Log Entropy scheme the hLSA system has reached acceleration of sixty-two times for the smaller datasets and acceleration of one thousand fifty-two times for the larger datasets. The computational complexity of Log Entropy is O(d) and that of the TF-IDF scheme is O(dw), where w is the total number of individual words and d is the total number of documents in the corpus. As shown in the results, the TF-IDF scheme in the hLSA system has reduced the execution time from more than 160 seconds to less than 0.20 seconds, and the Log Entropy scheme in the hLSA system has reduced the execution time from more than 321 seconds to less than 0.20 seconds.
Meanwhile, an exact SVD of a $w \times d$ matrix has a computational complexity of $O(\min(wd^2, w^2 d))$, and this is expensive for large matrices. In the hLSA system, we implement the GPU-SVD function, which gives us acceleration of up to three times compared to the LSA system. In Figure 6, we present a comparison of the SVD execution time between the hLSA system and the LSA system.
Table 4: Results of the execution time in the hLSA system using GPU-accelerated procedures and in the LSA system using sequential-only procedures for the Log Entropy scheme and the TF-IDF scheme.
Figure 6: Results of the execution time of the SVD procedure applied in the hLSA system using GPU-accelerated procedures and in the LSA system using CPU sequential-only procedures (horizontal axis: number of documents).
In the smallest dataset, we measure an execution time of 1.60 seconds in the LSA system and, in the largest dataset, we measure an execution time of 122.10 seconds.On the other hand, in the hLSA system, we measure an execution time of 0.50 seconds in the smallest dataset and an execution time of 53.40 seconds in the largest dataset.Hence, we reach acceleration of two to three times in most cases.
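The acceleration factors follow directly from the four execution times quoted above, taking acceleration as the ratio of the LSA time to the hLSA time. A short check (the helper name `speedup` is illustrative):

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Acceleration factor: time of the baseline divided by time of the accelerated run."""
    return baseline_seconds / accelerated_seconds

# SVD execution times reported in the text (LSA time, hLSA time), in seconds.
smallest = speedup(1.60, 0.50)
largest = speedup(122.10, 53.40)
print(f"smallest dataset: {smallest:.2f}x, largest dataset: {largest:.2f}x")
```

The ratios are about 3.2 for the smallest dataset and about 2.3 for the largest, consistent with the stated range of two to three times.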
We therefore analyzed the SVD procedure with the NVIDIA Visual Profiler. We found that the SVD accounts for about 77 percent of the total processing time and launches around six thousand seven hundred thirty kernel instances in the CUDA architecture. In addition, the profiler showed an optimization problem in memory access for computing the SVD procedure. For shared memory, the SVD utilizes a low bandwidth of 39.193 GB/s for 273,163 total load/store transactions, and, for device memory, the SVD utilizes a low bandwidth of 42.814 GB/s for 1,193,617 read/write transactions.
Figure 7: Results of a complete execution of the dimensionality reduction stage with k-dimension equal to 300 and the TF-IDF weighting scheme applied in the hLSA system using GPU-accelerated procedures and the LSA system using CPU sequential-only procedures.
Meanwhile, the truncated SVD procedure takes less than one second in the LSA system using the biggest dataset. Even so, in the hLSA system, we have accelerated it up to nine times for the smaller datasets and up to four times for the larger datasets. In Table 5, we present the results of an execution with the dataset of two thousand documents. In this execution, our results present acceleration of four times.
We have presented the results of the procedures Log Entropy, TF-IDF, SVD, and Truncated SVD. As shown in the results, the dimensionality reduction stage has the highest computational cost in our experiments. Therefore, we now present in Figure 7 the results of a complete execution of the dimensionality reduction stage with k = 300 and the TF-IDF weighting scheme. We obtain acceleration from six to ten times for the smaller datasets and from five to eight times for the larger datasets.
(3) Runtime Results: Information Retrieval Stage. We have implemented the hLSA system not just for indexing documents, but also for information retrieval. Thus, we now present the results of the execution time of the information retrieval procedures, which are as follows: query parser, query folding, and cosine similarity. We obtain the process time for the query parser procedure, which has been implemented in CPU sequential-only mode for both the hLSA and LSA systems. We present, in Table 6, the results of the query parser procedure on each dataset using the three use cases.
Table 7: Results of execution time for the query folding and cosine similarity procedures applied in the hLSA system using GPU-accelerated procedures and the LSA system using sequential-only procedures for 5000 documents and the topiramate weight-loss use case.
The process with the highest computational cost in the information retrieval stage is the query parser procedure. For the biggest dataset, with the use case of bipolar disorders it took around 34 seconds, with the use case of lupus disease it took around 27 seconds, and with the use case of topiramate weight-loss it took more than 120 seconds. This is due to the number of unique words in the queries. The first use case has an average of 31 unique words, the second use case has an average of 22 unique words, and the last use case has an average of 136 unique words.
Additionally, we now present the execution times for the query folding procedure and the cosine similarity procedure. We present the best results using the use case of topiramate weight-loss. These procedures do not have a high computational cost; in most cases, the execution time is less than one second for all datasets in the hLSA and LSA systems. Table 7 shows several results of the query folding and cosine similarity procedures.
(4) Runtime Results: hLSA versus LSA. We present in Table 8 the overall execution time for the hLSA system versus the LSA system with five thousand documents, k-dimension equal to three hundred, three use cases, and two weighting schemes. Also, for the use case of bipolar disorders, we present results with four processors; for the use case of lupus disease, we present results with eight processors; and, for the use case of topiramate weight-loss, we present results with sixteen processors. We reach overall acceleration of five to eight times.
(5) Information Retrieval Results. To evaluate the accuracy, we compare the documents retrieved by the hLSA system based on a text query related to each use case versus the relevant documents defined by the experts. The experts define that the most relevant documents for the use case of bipolar disorders are the articles with identifiers 1087494, 1434505, and 2031887; for the use case of lupus disease the most relevant documents are the articles with identifiers 1065341, 1459118, and 1526641; and for the use case of topiramate weight-loss the most relevant document is the article with identifier 1087494. We present the results of the similarities for each relevant document covering all the documents of the knowledge base with two weighting schemes and twenty values of k, for the use case of bipolar disorders in Figure 8, for the use case of lupus disease in Figure 9, and for the use case of topiramate weight-loss in Figure 10.
As shown in the results, the best similarities found in the experiments with the hLSA system for the use case of bipolar disorders are found with k = 50 and accuracy = 0.88, for the use case of lupus disease with k = 25 and accuracy = 0.56, and for the use case of topiramate weight-loss with k = 25 and accuracy = 0.98. In any case, the similarity values are normalized in the range from one hundred fifty to three hundred.
Moreover, we present the results of the top ten documents for the use case of bipolar disorders in Table 9, for the use case of lupus disease in Table 10, and for the use case of topiramate weight-loss in Table 11. As shown in the results, in several of the experiments the identifiers for the relevant documents defined by the experts are retrieved in the top ten.
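The recall of a top-ten list against the expert-defined relevant documents can be computed as sketched below. This assumes the common definition of recall (relevant documents retrieved divided by all relevant documents); the text does not spell out the formula used, and the function name `recall_at_n` is hypothetical.

```python
def recall_at_n(ranked_ids, relevant_ids, n=10):
    """Fraction of the relevant documents that appear in the first n ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    retrieved = set(ranked_ids[:n])
    return len(relevant & retrieved) / len(relevant)
```

Under this definition, a recall of 100% means that every document the experts marked as relevant appears within the first n results.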

Conclusions
The paper has introduced a heterogeneous system based on the Latent Semantic Analysis method. The proposed system allows the indexing of text documents from a knowledge base faster than the standard LSA versions. In addition, the hLSA system allows the retrieval of relevant documents based on a query. The results we have found are feasible and promising. The hLSA system is divided into three main stages: semantic space, dimensionality reduction, and information retrieval. Acceleration of four times is shown in the semantic space stage, acceleration of eight times is shown in the dimensionality reduction stage, and in the information retrieval stage we do not find acceleration. However, the hLSA system is limited by the main and global memory of the GPU and CPU. Therefore, in future work, we propose adding a multi-GPU functionality to increase the size of matrices to be processed.
We have achieved our main goal of improving the execution time of the Latent Semantic Analysis method through the use of heterogeneous architectures such as GPUs and multi-CPU. As shown in the experimental tests, we have achieved an overall acceleration of around eight times over standard LSA systems for five thousand documents.
We have retrieved documents using twelve subsets of different sizes, from five hundred to five thousand documents, of the Open Access Subset of PubMed Central, using two weighting schemes (Log Entropy and Term Frequency-Inverse Document Frequency) and twenty different values of k ranging from 25 to 250, in order to determine the value with which the hLSA system is most accurate. The hLSA system has achieved an accuracy of 88% with the use case of bipolar disorders, an accuracy of 56% with the use case of lupus disease, and an accuracy of 98% with the use case of topiramate weight-loss. Moreover, we can infer that in our experiments the Log Entropy scheme yields a higher similarity value than the TF-IDF scheme in two out of three use cases.
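To make the two weighting schemes concrete, the following is a minimal sketch of both applied to a small term-by-document count matrix. This section does not restate the formulas, so the definitions below are the common textbook forms (an assumption) and may differ from the exact variants used in the hLSA system. The function names and the toy matrix are illustrative only.

```python
import math


def tf_idf(counts):
    """TF-IDF weights; counts[i][j] is the raw frequency of term i in document j."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)            # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0   # assumed form: log(N / df)
        weighted.append([c * idf for c in row])
    return weighted


def log_entropy(counts):
    """Log Entropy weights: local log(1 + tf) times global (1 + entropy / log N)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)                                # global frequency of the term
        entropy = 0.0
        for c in row:
            if c > 0:
                p = c / gf
                entropy += p * math.log(p)           # non-positive contribution
        g = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        weighted.append([g * math.log(1 + c) for c in row])
    return weighted


# Toy matrix (made up): term 0 occurs in one document only, term 1 in every document.
counts = [
    [2, 0, 0],
    [1, 1, 1],
]
print(tf_idf(counts))
print(log_entropy(counts))
```

Under both schemes, a term spread evenly over every document receives a weight of (approximately) zero, while a term concentrated in a single document keeps its full local weight.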

Figure 1: The hLSA system presents three stages: semantic space, dimensionality reduction, and information retrieval. Also, the hLSA system works with a user query, as well as a knowledge base, and presents the relevant documents.

(1) Definition of the Knowledge Base, (2) Definition of Dataset Size, (3) Definition of Stopword Document, (4) Definition of k Dimensions, (5) Definition of Use Cases, and (6) Evaluation Methods.
(1) Definition of Knowledge Base. The experiments conducted with the hLSA system use documents of the Open Access Subset of PubMed Central (PMC) as the knowledge base. PMC is an online digital database of freely available full-text biomedical literature, extensively used in the Text REtrieval Conference (TREC; cosponsored by the National Institute of Standards and Technology (NIST) and the US Department of Defense, see http://trec.nist.gov). The text of each article in the open access subset is represented as an NXML file, which is an XML format used by the NLM Journal Archiving and Interchange Tag Set (the Journal Archiving and Interchange Tag Set defines elements and attributes that describe the content and metadata of journal articles, including research and nonresearch articles, letters, editorials, and book and product reviews).
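Because each article is an XML document, its text can be pulled out with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and assumes the article text sits under a `body` element, as in the Journal Archiving and Interchange Tag Set. The function name `extract_body_text` and the sample document are made up for illustration, and this is not the Document Parser of the hLSA system. Real NXML files may also carry DOCTYPE declarations and named entities that need additional handling.

```python
import xml.etree.ElementTree as ET


def extract_body_text(nxml_string):
    """Return the whitespace-normalized text found under the first <body> element."""
    root = ET.fromstring(nxml_string)
    body = root.find(".//body")
    if body is None:
        return ""
    # Join text fragments with spaces, then collapse runs of whitespace.
    # Note: this also separates inline markup (e.g. subscripts) from adjacent text.
    return " ".join(" ".join(body.itertext()).split())


# Made-up sample in the spirit of an NXML article; not taken from PMC.
sample = """<article>
  <front><article-meta><article-id pub-id-type="pmc">0000000</article-id></article-meta></front>
  <body><sec><title>Background</title><p>Topiramate and <italic>weight</italic> loss.</p></sec></body>
</article>"""

print(extract_body_text(sample))
```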

Figure 4: The architecture used in the information retrieval stage of the hLSA system. The blocks in light yellow represent the principal procedures executed by the CPU, the blocks in light blue represent the procedures executed by multi-CPUs, and the blocks in light green represent the procedures executed by the GPU. The GPU executions include CUDA kernels and CUDA functions using floating-point numbers.

Figure 5: Results of the overall execution time in seconds of the semantic space stage using CPU sequential-only architecture and multi-CPU architecture with four, eight, and sixteen processors.

Figure 8: Results of the similarity for relevant document 1: 1087494, relevant document 2: 1434505, and relevant document 3: 2031887 in the use case of bipolar disorder using the subset of 5000 documents, twenty values of k, and two weighting schemes (TF-IDF and Log Entropy).

Table 1: Subsets used in the experiments for the hLSA system.

Table 2: Results of the CPU sequential-only procedures in milliseconds of the semantic space stage in the hLSA system.

Table 3: Results of the Document Parser procedure in milliseconds of the semantic space stage in the LSA system using CPU sequential-only architecture and in the hLSA system using multi-CPU architecture with four, eight, and sixteen processors.

Table 5: Results of execution time in a truncated SVD procedure applied in the hLSA system using GPU-accelerated procedures and the LSA system using CPU sequential-only procedures for 2000 documents.

Table 6: Results of the execution time of the query parser procedure applied in the hLSA system with the three use cases: bipolar disorders, lupus disease, and topiramate weight-loss.

Table 8: Results of the total execution time in the hLSA system versus the LSA system. Results include all procedures using 5000 documents, k = 300, Log Entropy, TF-IDF, and four, eight, and sixteen multi-CPUs.