Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

TF-IDF (term frequency-inverse document frequency) is one of the traditional statistics-based text similarity calculation methods. Because TF-IDF does not consider the semantic information of words, it cannot accurately reflect the similarity between texts, while semantically enhanced methods distinguish text documents poorly because extending vectors with semantically similar terms aggravates the curse of dimensionality. Aiming at this problem, this paper advances a hybrid method combining semantic understanding and TF-IDF to calculate the similarity of texts. Based on the term similarity weighting tree (TSWT) data structure and the definition of semantic similarity information from HowNet, the paper first discusses text preprocessing and filtering and then utilizes the semantic information of key terms to calculate the similarity of text documents according to the features whose weight is greater than a given threshold. The experimental results show that the hybrid method is better than pure TF-IDF and the semantic understanding method in terms of accuracy, recall, and F1-metric under different K-means clustering methods.


Introduction
Text similarity measurement quantifies the degree of semantic similarity between two texts, and it is a very important task in natural language processing. Text similarity measures have extremely widespread application in many fields, such as duplicate text detection, image retrieval, information retrieval, automatic text generation, and text classification. Traditional text similarity measurement methods comprise statistical approaches and semantic analysis algorithms. In statistics-based text similarity measurement, the whole text is regarded as a set of words. By analyzing the number of occurrences of each term, the text model vector is constructed from effective word frequency information, and the similarity of text vectors is calculated by cosine similarity or the Jaccard coefficient. The model based on attribute theory, the semantic index model, and the vector space model belong to the statistical methods. Statistical similarity measurement expresses the text as a vector to simplify the complex relationships between the keywords in the text, so the model is calculated easily [1]. However, the method ignores the meaning and semantic relationships of word items, and it needs a large-scale word corpus for support. Due to the large number of words and texts, the vector dimension in the text representation model is extremely high, so it is difficult to handle directly. The TF-IDF method is a traditional statistics-based text similarity measurement algorithm, which constructs a model from the text word frequency vector; the similarity of texts is then calculated through the cosine similarity measure.
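The two statistical similarity measures mentioned above can be illustrated with a minimal sketch over word frequency vectors (the example texts are hypothetical, and simple whitespace tokenization is assumed):

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, normalized by the vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(a: Counter, b: Counter) -> float:
    # Ratio of shared distinct terms to all distinct terms.
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())

t1 = Counter("the cat sat on the mat".split())
t2 = Counter("the cat lay on the rug".split())
print(round(cosine_similarity(t1, t2), 3))   # shared terms: the, cat, on
print(round(jaccard_similarity(t1, t2), 3))
```

Note that both measures depend only on term overlap: texts using different but synonymous vocabulary score low, which is exactly the weakness the hybrid method targets.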
In text similarity measurement based on semantic analysis, the semantic relationships of text words (e.g., synonymy, redundancy, and inclusion) are established using specified domain knowledge [2], which also determines the degree of text similarity. The advantage of this method is that its accuracy is very high and it does not depend on a large corpus for support. However, it is not easy to establish a knowledge base, which requires large-scale and complex work. Thus, current research generally adopts a complete dictionary of words rather than a knowledge base. Literature [3] introduced the resolving process of text similarity based on WordNet and HowNet. Literature [4] put forward text similarity measurement via the sememe space of HowNet. Literature [5] introduced text similarity measurement in terms of a weighted semantic web. These methods considered the semantic information of word terms, but they ignored the differing degrees of importance of terms to the various texts. These methods increased the vector dimension of the text representation, and they cannot accurately reflect the similarity between two texts.
To address the defects of the above methods, a method that can effectively reduce the dimension of the text representation model and combine the semantic information of word terms is proposed. The proposed algorithm can efficiently and automatically calculate the semantic similarity of texts, and there is a broad application prospect for the hybrid similarity measurement method.

Related Works
The TF-IDF method is the most typical text similarity measurement algorithm; it represents the text as a vector composed of n weighted word terms that appear in the text, following the empirical observations below [6].
(1) Term Frequency. The more frequently a word appears in a text, the more relevant it is to the topic of the text. Many words in a specific linguistic environment do not have this property and should be excluded, such as "a" and "an". (2) Inverse Document Frequency. The more texts of a collection a term appears in, the less discriminative the term is. For example, in a collection of 10,000 texts, if a term A is present in 1,000 texts and another term B appears in only 10 texts, then term B discriminates better than A. By using the above concepts, the TF-IDF value of every term ω_i can be calculated according to equation (1):

w_ij = tf(ω_i) × log(N / n_i),  (1)

where tf(ω_i) is the occurrence frequency of term ω_i in text d_j, n_i is the number of texts containing ω_i, and N is the total number of texts in the collection {d_j}. With the development of the Internet, how to acquire accurate information from massive amounts of text data is a challenge for approaches (e.g., TF-IDF) that ignore term semantics. We should analyze, capture, and characterize the meaning of the text precisely rather than rely only on term occurrence frequency. For example, consider an article about a gift (present) and another article about a gift (talent). The two articles will be regarded as similar if they are measured by a term frequency method. On the other hand, an article about a girl and another about a boy are regarded as dissimilar because of their different terms (boy and girl).
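Equation (1) can be sketched as a minimal implementation (the natural logarithm and the toy documents below are assumptions of this sketch, not the paper's):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: w = tf(term) * log(N / n_term)."""
    N = len(docs)
    # n_term: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: f * math.log(N / df[t]) for t, f in tf.items()})
    return weights

docs = [
    ["gift", "box", "ribbon"],
    ["gift", "talent", "genius"],
    ["box", "crate", "box"],
]
w = tf_idf(docs)
# "gift" appears in 2 of 3 documents, so its IDF is lower than "ribbon"'s.
```

As the gift/talent example shows, such weights are purely frequency-based: the two "gift" articles would receive identical treatment for that term despite their different meanings.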
Therefore, term similarity has gradually been researched. Similarity measurement over terms requires organizing all words into a semantic network (e.g., WordNet), and it is realized by exploiting the information of edges and vertexes among terms.
Literature [7] described an approach for domain-specific WSD by selecting the predominant sense (synset from WordNet) of ambiguous words. To achieve this, the method uses two corpora: a domain-specific test corpus and a domain-specific auxiliary corpus. Literature [8] put forward a method considering vertex information and edge relationships, which is helpful for similarity applications involving nouns or verbs. However, it is difficult to organize hierarchical relationships like those of nouns for adjectives or adverbs. Literature [9] discussed local correlation information to determine the similarity of texts via WordNet. Literature [10] defined similarity among terms by applying information theory on the premise that the text vocabulary follows a specified probability distribution. Literature [11] put forward a semantic similarity measurement method to improve the traditional term frequency-based text similarity result, but the method does not reduce the dimension of the text model. Literature [12] discussed a method for determining sentence similarity, applied to automatic text summarization. Literature [13] recalculated the text relevance of results returned by a search engine using an ontology. Literature [14] introduced a term similarity measurement method combined with WordNet and applied it to improve the vector representation model of the text. By analyzing text concepts, synonyms, and term hyponymy relations, it built an extended frequency vector including these relations and realized text clustering by computing the cosine similarity among the vectors. Literature [15] designed a supply chain information oriented mining model based on the TF-IDF algorithm to obtain the required supply chain information. Literature [16] put forward a new method integrating the advantages of TF-IDF and semantic information from HowNet, and the method worked out the text similarity value via Hamming distance to avoid direct processing of a high-dimensional sparse matrix. Literature [17]
proposed the scientific research project TF-IDF (SRP-TF-IDF) model, which combined TF-IDF with a weight balance algorithm designed to recalculate candidate keywords. Literature [18] improved the Bayes algorithm with the TF-IDF method, introducing a decentralized word frequency factor and a feature word position factor to enhance the accuracy of feature weights. Literature [19] proposed a method based on the combination of contents and their semantic similarities; the method uses a collection of synonyms and inverse document frequency, combining semantic similarity via WordNet synonym sets. These methods did not reduce the dimension of the text representation vector, and their text similarity calculation is also the traditional cosine similarity between vectors. Building on the analysis of the above text similarity methods, this paper first preprocesses the text with natural language processing techniques. After that, key terms with high TF-IDF values are selected using the TF-IDF method. The similarity of two texts is then calculated via external dictionary term analysis, the term similarity weighting tree structure, and the text semantic definition. The method in this paper makes text similarity measurement more efficient and accurate, and it also decreases the dimension of the text similarity model. In text clustering experiments on a benchmark data set, the algorithm discussed in this paper is better than pure TF-IDF and the semantic understanding method in terms of accuracy, recall, and F1-metric under different K-means clustering methods.

Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

Text Preprocessing.
Current natural language processing techniques cannot easily deal with the full original information of a text. Text data is usually unstructured or semistructured, so a machine cannot handle it directly. Therefore, it is necessary to properly preprocess the text first, then establish the frequency vector of the terms in the text, and finally transform the text into a structured form.

Text Segmentation.
The first key steps of text preprocessing are text segmentation, root reduction, and stop word deletion. English contains different tenses, words are divided into singular and plural forms, and most words appear in several other inflected forms. If words with the same root appear as different entries in different forms, then texts with the same theme may have a very low similarity, which directly affects the quality of text clustering, so root reduction processing is required. Stop words are words that have little significance for identifying text content but appear very frequently; they lead to large errors in calculating text similarity or in training model parameters and are usually regarded as noise. For example, the indefinite articles "a" and "an" appear in almost any text but make little substantive contribution to the textual meaning. Therefore, it is necessary to remove these stop words from the original text, a process called stop word deletion. Stop word deletion is achieved by establishing a stop word list: each term of the text is queried against the list item by item, and all terms appearing in the list are deleted. A typical text preprocessing example is shown in Figure 1.
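The pipeline above (segmentation, stop word deletion via a list lookup, root reduction) might be sketched as follows; the stop word list and the suffix-stripping stemmer here are simplified stand-ins for a real lexicon and a Porter-style stemmer:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}  # illustrative subset

def naive_stem(word: str) -> str:
    # Crude root reduction for illustration; a real system would use
    # a Porter-style stemmer instead of stripping a few suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())         # segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word deletion
    return [naive_stem(t) for t in tokens]               # root reduction

print(preprocess("The cats played in the gardens"))  # ['cat', 'play', 'garden']
```

After this stage, "cats" and "cat" map to the same entry, so texts on the same theme no longer diverge merely because of inflection.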

Special Word Deletion.
The method in this paper requires semantic analysis of terms, so the following three preprocessing steps are necessary in addition to stop word deletion.
(1) Special terms (such as people's names, place names, organization names, etc.) in the text need to be processed. These special terms often have high TF-IDF values in the TF-IDF computing process, and they are incorrectly selected as key terms of the text. In addition, these special word terms have a great impact on the similarity result. In this paper, such terms are processed by named entity recognition technology [20], and the recognized special word items are replaced with a specific string. To avoid possible adverse effects of these word terms on text clustering during feature selection, the special word terms are ignored when selecting feature terms.
(2) Synonyms may appear simultaneously in a document, and synonyms appearing in the text should be treated consistently. In other words, word terms with the same meaning should be merged and represented by a single name to reduce the cost of calculating the semantic similarity of texts. (3) Since the most important elements for characterizing the meaning of the text are its substantive words, the final step is to perform part-of-speech analysis of all terms in the text. The semantic properties of all terms should be judged to distinguish nouns, verbs, adjectives, adverbs, etc.

(1) Text Preprocessing Process. By comprehensively considering word segmentation, root reduction, stop word deletion, and special term filtering technology, the text preprocessing process is shown in Figure 2.

Key Terms Selection.
After text preprocessing is completed, the TF-IDF values of the terms in each text are calculated, and each text is represented as a vector of term TF-IDF values to support text similarity computing. The text vector is high-dimensional and extremely sparse. According to information theory, the value of IDF is the cross entropy of the term probability distribution under a special condition, and TF is used to increase the weight of words to describe the information features of words in the text. Thus, several important words from each text can be selected to represent the text. This reduces the text feature vector representation without affecting text feature extraction. The approach in detail is shown below: (1) All terms in the text are sorted according to their TF-IDF value. (2) Noun and verb terms whose TF-IDF value ranks in the top p (p is a percentage) are selected as key word terms, and each selected key term is added to the vector.

(3) The final key term vector is regarded as the feature representation of the text. Compared with the traditional TF-IDF method, the key term vector dimension decreases by a factor of (1 − p), which is a large increase in efficiency.
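Steps (1)-(3) can be sketched as below; the TF-IDF weights and part-of-speech tags are hypothetical inputs, with tagging assumed to come from an external tool:

```python
def select_key_terms(weights: dict[str, float], pos: dict[str, str], p: float) -> list[str]:
    """Keep the top-p fraction of noun/verb terms, ranked by TF-IDF value.

    `pos` maps each term to a part-of-speech tag; only 'n' (noun) and
    'v' (verb) terms are candidates, per the selection rule above.
    """
    candidates = [t for t in weights if pos.get(t) in ("n", "v")]
    candidates.sort(key=lambda t: weights[t], reverse=True)
    k = max(1, round(len(candidates) * p))
    return candidates[:k]

# Hypothetical TF-IDF weights and tags for one text.
weights = {"network": 3.2, "compute": 2.5, "fast": 2.9, "model": 1.1, "run": 0.7}
pos = {"network": "n", "compute": "v", "fast": "adj", "model": "n", "run": "v"}
print(select_key_terms(weights, pos, 0.5))  # ['network', 'compute']
```

Note how "fast" is excluded despite its high weight, since only nouns and verbs qualify, and the output vector has half the dimension of the candidate set.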

Text Similarity Calculation.
After the eigenvector of each text is determined, the next problem is how to calculate the similarity of two texts. The most important information in an article is carried by its characteristic word terms, so text similarity measurement can be translated into similarity calculation between the feature term vectors. As a result, the similarity between the original texts can be regarded as the similarity between the vectors of feature word terms. To ensure that the vector similarity meets the basic requirements of a similarity measure, the influence of dimension must be removed.
If Sim(x, y) is the similarity of data points x and y, the following conditions should be satisfied: 0 ≤ Sim(x, y) ≤ 1, Sim(x, x) = 1, and Sim(x, y) = Sim(y, x).
Let v_i and v_j represent key term vectors, where v_i = (w_i1, w_i2, ..., w_ik, ..., w_im) and v_j = (w_j1, w_j2, ..., w_jk, ..., w_jm), and the similarity of two texts is defined as

Sim(v_i, v_j) = k_w × vectSim(v_i, v_j),  (4)

where k_w is the weight coefficient of key term vectors v_i and v_j. Similar terms determine the TF-IDF values in the document: the more similar terms there are, the higher the TF-IDF values, indicating that these terms better reflect their importance in the text. Thus, the weighting is determined by the proportion of the TF-IDF values of the keyword terms in the keyword vector to the sum of the TF-IDF values of the whole text. The weight coefficient k_w is calculated by equations (5) and (6).
k_w = ave(i, j),  (5)

ave(i, j) = (1/2) [ Σ_{w_ik ∈ Λ_i} TFIDF(w_ik) / Σ_{k=1}^{m} TFIDF(w_ik) + Σ_{w_js ∈ Λ_j} TFIDF(w_js) / Σ_{s=1}^{m} TFIDF(w_js) ],  (6)

where TFIDF(w_ik) is the TF-IDF value of key term w_ik, and ave(i, j) represents the proportion of the TF-IDF values of the keyword terms in the keyword vector to the sum of the TF-IDF values of the whole text, averaged over the two texts. Sets Λ_i and Λ_j are defined as below.
In key term vector v_i, keyword w_ik is put into set Λ_i if the similarity of key terms w_ik and w_js exceeds a set threshold value, where Sim(w_ik, w_js) is the semantic similarity of keywords w_ik and w_js. vectSim(v_i, v_j) is determined by the term similarity of vectors v_i and v_j: terms with a high similarity degree must appear in similar vectors, and vectors containing only terms of low similarity degree are obviously dissimilar. The weighting coefficient is calculated according to the term similarity weighting hierarchical tree data structure and the sememe similarity formula. There are leaf nodes and nonleaf nodes in the three-layer weighting tree; all terms with similarity exceeding the threshold value θ are sorted in descending order and saved in leaf nodes. Figure 3 shows the construction process of the weighting tree.
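A sketch of the weight coefficient described above (the proportion of key-term TF-IDF mass in each text, averaged over the two texts; the exact form of equations (5) and (6) is an assumption reconstructed from the surrounding description):

```python
def ave_weight(tfidf_i: dict[str, float], key_i: set[str],
               tfidf_j: dict[str, float], key_j: set[str]) -> float:
    """Average, over both texts, of the fraction of total TF-IDF mass
    carried by the selected keyword terms (sets Lambda_i and Lambda_j)."""
    frac_i = sum(tfidf_i[t] for t in key_i) / sum(tfidf_i.values())
    frac_j = sum(tfidf_j[t] for t in key_j) / sum(tfidf_j.values())
    return (frac_i + frac_j) / 2

# Hypothetical TF-IDF tables and matched keyword sets for two texts.
tfidf_i = {"network": 3.0, "model": 1.0}
tfidf_j = {"compute": 2.0, "run": 2.0}
print(ave_weight(tfidf_i, {"network"}, tfidf_j, {"compute"}))  # 0.625
```

The more TF-IDF mass the matched keywords carry in their texts, the closer the coefficient is to 1, so pairs whose agreement rests on important terms are weighted up.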
(1) The initialization of the TSWT. The three-layer similarity weighting tree of feature terms is constructed according to the user's concrete task, and each feature term whose similarity is greater than the special threshold value is put into a leaf node.
(2) The weighting and update of the TSWT. In the calculation of eigenvector similarity, the similarity result of eigenvectors v_i and v_j is processed if a pair of feature terms (w_ik and w_js) satisfies one of the following conditions.
(a) w_ik and w_js both belong to the ordered term queue of a certain leaf node in the weighting tree. (b) w_ik belongs to the ordered term queue of a certain leaf node, w_js is foreign to that queue, and their similarity is above the threshold k_w; according to the similarity of w_js to the other terms, the sequence location of w_js in the ordered queue containing w_ik is determined. (c) Neither w_ik nor w_js belongs to the ordered term queue of any leaf node in the weighting tree, and the maximum and minimum similarity values with w_ik and w_js are computed; if the similarity degree is less than the threshold k_w, a new branch with the terms of maximum and minimum similarity is constructed, and w_ik and w_js are inserted into the new branch. (d) Neither w_ik nor w_js belongs to the ordered term queue of any leaf node in the weighting tree, and the similarity degree exceeds the threshold k_w; the sequence locations of w_ik and w_js are determined by their similarity to the other terms.
(3) Text similarity calculation. The similarity of two key term vectors is calculated by equation (4) and the TSWT.
The text similarity measurement hybrid algorithm with term semantic information and the TF-IDF method is shown below.
Input: Feature term vectors v_i and v_j; the term similarity weighting tree; the threshold value k_w.
Output: Similarity Sim(v_i, v_j) of key term vectors v_i and v_j.
Step 1. The term similarity weighting tree is initialized.
Step 2. Starting from w_il in vector v_i, the term w_jk in vector v_j most similar to w_il is found by the sememe similarity equation, and the similarity of w_il and w_jk is recorded.
Step 3. The weighting coefficient k_w is calculated by the TSWT weighting principle, and it is determined whether w_il and w_jk are added to the weighting tree according to the TSWT updating principle.
Step 4. Repeat the procedure of Steps 2 and 3 for other terms in vector v i until all terms in vector v i find the corresponding most similar term in vector v j .
Step 5. The similarity values of Steps 2, 3, and 4 are accumulated, and the result is divided by the number of terms in vector v_i (the dimension of v_i); thus the similarity Sim(v_i, v_j) of vectors v_i and v_j is determined.
Step 6. Start from w_jl in vector v_j and repeat Steps 2 to 5. Thus, the similarity Sim(v_j, v_i) of vectors v_j and v_i is determined. The goal of this step is to keep the dimension of vector v_i consistent with that of v_j.
Step 7. The average of Sim(v_i, v_j) and Sim(v_j, v_i) is calculated, and vectSim(v_i, v_j) is regarded as the semantic similarity of vectors v_i and v_j.
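The best-match-and-average structure of these steps can be sketched as follows; the toy term similarity function stands in for the sememe-based similarity, which this sketch does not implement:

```python
def vect_sim(vi: list[str], vj: list[str], term_sim) -> float:
    """Symmetric vector similarity: each term is matched to its most
    similar counterpart in the other vector, the matches are averaged,
    and both directions are then averaged (as in the step outline)."""
    def directed(a, b):
        return sum(max(term_sim(x, y) for y in b) for x in a) / len(a)
    return (directed(vi, vj) + directed(vj, vi)) / 2

# Toy term similarity: 1.0 for identical terms, 0.5 for same first letter.
def toy_sim(x: str, y: str) -> float:
    return 1.0 if x == y else (0.5 if x[0] == y[0] else 0.0)

print(vect_sim(["cat", "dog"], ["cow", "dog", "fish"], toy_sim))  # 0.625
```

Averaging the two directed scores is what removes the dimension asymmetry: a short vector matched against a long one would otherwise score differently depending on which vector is scanned first.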
Step 8. According to the accumulation in the above steps, the sum weighting coefficient ω_f is determined.
Step 9. The similarity of vectors v_i and v_j is weighted according to the text similarity definition, and the text similarity of v_i and v_j is determined.

Case Study
To verify the effectiveness of the hybrid algorithm in solving the text similarity measurement problem, this paper collected 500 articles from HowNet as the data set; the text set involves multiple fields, including computer science, economy, organism, physics, and mechanics. The total numbers of the five text classes are, respectively, 131 (computer), 117 (economy), 113 (organism), 91 (physics), and 73 (mechanics). The features of each data set are shown in Figure 4.
The above text set is first preprocessed with the natural language processing software LingPipe from Alias-i. Segmentation and part-of-speech tagging of each text are performed by LingPipe, and then the relevant person names, place names, and organization names involved in the text collection are identified. The weight of terms in each text is calculated by the TF-IDF algorithm, and the top terms at a specific percentage are selected from the computed result. The similarity of the experimental texts is calculated by the hybrid method of this paper, and the text similarity matrix is determined. Based on the text similarity matrix and the TF-IDF matrix, clustering experiments are performed, and the results of the direct K-means algorithm (DKM), binary K-means algorithm (BKM), agglomerative K-means algorithm (AKM), and the hybrid algorithm of this paper are analyzed and compared. To make the experimental results more objective, this paper measures text similarity by multiple indexes, including accuracy, recall ratio, F1-metric, and macroaveraging.
It is necessary to select different percentages of top characteristic terms in the similarity calculation to understand how the top characteristic terms affect the result. For this purpose, the characteristic term similarity threshold value k_w is set to zero so that all characteristic terms are equally important. Figures 5-7 describe the influence of various percentages of top characteristic terms on the similarity result. If the percentage of top characteristic terms lies in the interval from 30% to 50%, the accuracy for computer, economy, organism, physics, and mechanics is the highest, about 6 percentage points higher than for other percentages. For recall ratio, the best interval is likewise [0.4, 0.5]. The F1-metric has an inflection point at 40% top characteristic terms. According to the above statistical analysis, the text clustering result is best when the percentage of top characteristic terms is selected as about 40%.
To determine the influence of the threshold value k_w on similarity computing, the experiment selects the 40% top characteristic terms as the text feature vector, and DKM is selected as the clustering algorithm. k_w is distributed between 0.6 and 0.9, and Figures 8-10 describe the influence of the threshold value on the similarity result in terms of accuracy, recall ratio, and F1-metric. Macroaveraging assigns the same weight to each category, and it is calculated by the following formula to prove the method's validity.
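Macroaveraging can be sketched as follows (the per-class counts are hypothetical; the point is that each class contributes equally regardless of its size):

```python
def macro_average(per_class: dict[str, tuple[int, int, int]]) -> tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1.

    per_class maps each category to (true positives, false positives,
    false negatives); per-class scores are averaged with equal weight.
    """
    ps, rs, fs = [], [], []
    for tp, fp, fn in per_class.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f1)
    n = len(per_class)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

stats = {"computer": (90, 10, 20), "economy": (80, 20, 10)}
print(macro_average(stats))
```

Because each category gets equal weight, a method that performs well only on the largest class (computer, with 131 texts) cannot inflate the macroaveraged score.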
The three-method text similarity measurement under the DKM clustering algorithm is shown in Figures 11-13. For macroaveraged accuracy, the hybrid algorithm of this paper is optimal: it is two percentage points higher than the TF-IDF and term semantic methods.
The hybrid algorithm put forward in this paper also has the same advantages in both macroaveraged recall ratio and F1-metric. The results of the three-method text similarity under AKM and BKM were analyzed, and the conclusion is the same as for the DKM experiment. This shows that the method used in this paper has a better clustering effect than the two traditional algorithms, effectively avoids the disadvantages of the traditional methods to some extent, and confirms the validity of the method.

Conclusion
Terms with high TF-IDF values are selected as feature keywords in the hybrid algorithm, and the method reduces the impact of the high dimensionality of the traditional vector representation. Besides, it also decreases computing time. It fully combines the semantic similarity of the feature keywords in the text with external dictionary word analysis to compute the semantic similarity between two texts via the term similarity weighting tree structure. Based on the TF-IDF model and the semantic information of keywords in the text, a new text similarity measure is discussed. The probability distribution of the terms in the text is fully discussed, and the experimental results show that the clustering method of this paper is better than traditional methods, such as TF-IDF or the semantic method, in terms of accuracy, recall rate, and F1-metric. The work of this paper improves on the two traditional types of text similarity measures; however, there are still many shortcomings to be overcome. The cosine angle problem is not fully considered when calculating the cosine similarity of texts, and much work remains to mine the semantic characteristics involved in text similarity analysis, such as the semantic information of statements, paragraphs, and chapters in the text.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The author declares no competing interests.

Figure 5: Accuracy comparison of various percentage top characteristic terms.

Figure 7: F1-metric comparison of various percentage top characteristic terms.