Sentence Similarity Calculation Based on Probabilistic Tolerance Rough Sets

Sentence similarity calculation is one of the important foundations of natural language processing. Existing sentence similarity measurements are based either on shallow semantics, which inadequately capture latent semantic information, or on deep learning algorithms, which require supervision. In this paper, we improve the traditional tolerance rough set model, giving it lower time complexity and making it incremental compared with the traditional one. We then propose a sentence similarity computation model, based on the probabilistic tolerance rough set model, that approaches text data from the perspective of uncertainty. It can mine latent semantic information and is unsupervised. Experiments on the SICK2014 task and the STSbenchmark dataset identify a significant and efficient performance of our model.


Introduction
With the rapid development of information technology, innumerable text data are continuously growing. Unlike numerical data, the processing of text data is more complex and difficult. Sentence similarity aims at calculating the degree of resemblance or distance between two sentences. It plays an important role in natural language processing (NLP) applications such as text summarization [1,2], machine translation [3], question answering systems [4], and information retrieval [5]. These applications rest on sentence similarity to a certain extent, and their development makes research on sentence similarity urgent.
Text data are characterized by uncertainty, inaccuracy, and incompleteness. Existing sentence similarity computation methods are almost all based either on the relations among the words in the sentences or on deep learning algorithms. Methods based on word-word relations, such as word cooccurrence, mainly consider sentence semantics at a shallow level and cannot capture the latent semantic information behind sentences. Methods based on deep learning algorithms, such as convolutional neural networks (CNNs), can capture deep semantic information, but most have high time complexity and require supervision. In addition, neither class of methods can adequately handle the uncertainty and imprecision of text sentences. In this paper, we start from the uncertainty and imprecision of text data. We improve the tolerance rough set model of Ho et al. [6] and present a sentence similarity computation model based on the probabilistic tolerance rough set model. Our model can not only process the uncertainty and imprecision of text data but also overcome the shortcomings mentioned above.
This paper is organized as follows. Some related works on sentence similarity measures are reviewed in Section 2. Section 3 presents our proposed probabilistic tolerance rough sets-based model for sentence similarity computation in detail. Section 4 demonstrates and discusses the experimental results on sentence similarity tasks. In Section 5, some conclusions are drawn.

Related Work
The main work of this paper is to improve the traditional tolerance rough set model and then establish a sentence similarity computation model based on the probabilistic tolerance rough set model. In this section, we discuss related works on sentence similarity calculation methods and on tolerance rough set models in NLP.

Sentence Similarity Calculation.
Traditional works on sentence similarity are generally categorized into two classes: methods based on shallow semantics and methods based on deep learning algorithms. The idea of shallow semantics methods is to calculate the similarity between words. Methods based on word cooccurrence and methods based on corpora are two representatives. Methods based on word cooccurrence are mentioned in [7][8][9]. Han et al. used the Bag-of-Words (BoW) technique [8], and Jones et al. [7] applied the term frequency-inverse document frequency (TF-IDF) technique to represent sentences; the cosine distance or Euclidean distance was then utilized to calculate the similarity between sentences. A keyword-based approach was proposed in [9], which scores the ranking of keywords extracted from the sentences. Methods based on corpora such as WordNet and HowNet are mentioned in [10,11]. In [12], Prasad et al. combined common words and semantic features for measuring sentence similarity. They extracted both syntactic features, by searching for common words between sentences, and semantic features, by utilizing the information content of sentences. Methods based on shallow semantics can only obtain the literal meaning of sentences and fail to capture the high-level semantic information behind them.
Nowadays, neural networks and deep learning have been widely used in NLP and have made great achievements. By training on sentences with deep learning algorithms, deep semantic information can be captured in the computation of sentence similarity. In [13], a CNN-based parallel semantic matching model was established: two parallel CNNs were built to train on two sentences, respectively. Then, the two CNNs were cascaded into one multilayer neural network for matching the similarity of sentences. An elaborate convolutional network (ConvNet) variant was presented in [14], which inferred sentence similarity by integrating differences of convolutions at different scales. For the problem of variable-length and complex sentences, Mueller et al. proposed a Siamese network on the basis of the long short-term memory (LSTM) model [15]. The methods mentioned above mainly concentrate on the similar information of two sentences; by contrast, methods concentrating on the dissimilar information of two sentences have also been proposed. Wang et al. developed a sentence similarity learning model by decomposing and composing lexical semantics, which considers both the similar and dissimilar information between sentences [16]. In [17], a context-aligned recurrent neural network (CA-RNN) model was put forward; in this model, the contextual information of the aligned words is integrated into the neural network. Liu et al. incorporated shallow semantics and deep information to evaluate sentence similarity [18]: the shallow part is represented by lexical similarity based on keywords and sentence lengths, and the deep part is modeled by a parallel CNN that extracts both the whole sentences and their context as features. However, most sentence similarity learning algorithms based on neural networks and deep learning are supervised and need to be trained on a dataset first. Devlin et al. [19] proposed the Bidirectional Encoder Representations from Transformers (BERT) model, pretrained without supervision, which has reached excellent results for language representation.
It is undeniable that text data possess uncertainty, imprecision, and incompleteness. However, the methods mentioned above do not measure the similarity between sentences from the perspective of uncertainty and imprecision. Fuzzy set theory and rough set theory were created to process such uncertainty and imprecision. A fuzzy sets and rough sets-based approach was developed for measuring cross-lingual semantic similarity [20]. In [1], Chatterjee et al. proposed a fuzzy rough sets-based model in which sentence similarity is computed according to the upper and lower approximations of two sentences.
We improve the traditional tolerance rough set model and propose a sentence similarity computation model based on the probabilistic tolerance rough set model. By processing text data from the viewpoint of uncertainty, the model not only overcomes the inability of shallow semantics methods to obtain high-level semantic information but also removes the supervision requirement of deep learning-based methods, with the advantages of capturing more latent semantic information without supervision.

Tolerance Rough Sets in NLP.
Rough set theory was proposed by the Polish scholar Pawlak in 1982 for handling uncertainty, imprecision, and fuzziness [21]. It has been effectively applied in the fields of machine learning, data mining, and NLP [22][23][24]. Rough sets partition a set X using an equivalence relation. Whether a certain object belongs to a set X is represented by a pair of concepts called the lower approximation and the upper approximation. The possible part, the upper approximation minus the lower approximation, is called the boundary region. Researchers have generalized rough set theory into several expanded models according to different requirements, including the probabilistic rough set model [25], the decision-theoretic rough set model [26], and the tolerance rough set model [27]. An equivalence relation has the three properties of reflexivity, symmetry, and transitivity, and the requirement of transitivity makes it inapplicable in some cases. Since some applications cannot satisfy transitivity, Skowron et al. introduced the tolerance relation to replace the equivalence relation; the corresponding model is the tolerance rough set model [27].
With the tolerance rough set model applied in NLP, a search result clustering method was put forward [28], in which the tolerance relation was defined via the number of word cooccurrences in documents. In [29], a tolerance rough sets-based semantic clustering algorithm was introduced by Meng et al. for web search results, extending the original text semantics and addressing the limitation of data sparsity. A nonhierarchical document clustering algorithm based on a tolerance rough set model was established by Ho et al. [6] for information retrieval, which can capture more potential semantic information. Patra and Nandi developed a single-link clustering algorithm on the basis of the tolerance rough set model to obtain better clustering results [30]. In this paper, we adopt the tolerance rough set model by expressing each sentence as a pair of upper and lower approximations and separately computing the upper approximation similarity and lower approximation similarity.

Proposed Method
In this section, we first briefly describe the traditional tolerance rough set model. Then, we introduce the probabilistic tolerance rough sets-based sentence similarity calculation model in detail.

Tolerance Rough Set Theory.
A tolerance space is defined as a quadruple R = (U, I, ν, P) [6], where U = {x_1, x_2, . . . , x_n} is the universe of all objects, I: U → 2^U is an uncertainty function, ν is a vague inclusion function, and P is a structural function. The uncertainty function I: U → 2^U assigns to each object x its tolerance class I(x): an object belongs to I(x) if it shares similar information with x. Any function satisfying reflexivity and symmetry can be used as an uncertainty function, that is, for arbitrary x, y ∈ U, x ∈ I(x) and y ∈ I(x) iff x ∈ I(y). The vague inclusion ν is monotonic, i.e., for any X, Y, Z ⊆ U with Y ⊆ Z, ν(X, Y) ≤ ν(X, Z). It measures the degree to which a set X contains the tolerance class I(x) of an object x ∈ U. The structural function P classifies the tolerance classes I(x), x ∈ U, into structural subsets (P(I(x)) = 1) and nonstructural subsets (P(I(x)) = 0) [6]. The upper approximation U(R, X) and lower approximation L(R, X) of any X ⊆ U are defined as

U(R, X) = {x ∈ U | ν(I(x), X) > 0},
L(R, X) = {x ∈ U | ν(I(x), X) = 1}.

If the upper and lower approximations are parameterized by α and β,

U_α(R, X) = {x ∈ U | ν(I(x), X) > α},
L_β(R, X) = {x ∈ U | ν(I(x), X) ≥ β},

where α ∈ [0, 1), β ∈ (0, 1], α ≤ β, then the model is called the probabilistic tolerance rough set model [25].
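As a concrete sketch, assuming the standard vague inclusion ν(X, Y) = |X ∩ Y|/|X| used in [6] (function names below are illustrative, not from the paper), the parameterized approximations can be computed directly over Python sets:

```python
def vague_inclusion(X, Y):
    # nu(X, Y) = |X ∩ Y| / |X|: degree to which X is included in Y
    return len(X & Y) / len(X) if X else 0.0

def upper_approximation(universe, I, X, alpha=0.0):
    # U_alpha(R, X) = {x in U | nu(I(x), X) > alpha}
    return {x for x in universe if vague_inclusion(I[x], X) > alpha}

def lower_approximation(universe, I, X, beta=1.0):
    # L_beta(R, X) = {x in U | nu(I(x), X) >= beta}
    return {x for x in universe if vague_inclusion(I[x], X) >= beta}
```

With α = 0 and β = 1 this reduces to the unparameterized definitions above.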

Probabilistic Tolerance Rough Sets-Based Sentence Similarity Model.
Firstly, we introduce the definition of the quadruple of tolerance rough sets in our model. Suppose that W = {w_1, w_2, . . . , w_N} is the set of all distinct words in the corpus, where N is the vocabulary size. Then, we define the universe as U = W. Determining the tolerance relation and the tolerance classes is the essential step in formulating a tolerance rough set model. In the tolerance rough set model proposed by Ho et al. [6], the cooccurrence of terms over all documents in the corpus was used to construct the tolerance relation. However, it suffers from two disadvantages: (1) whenever the corpus changes, even by adding or removing a single document, all the procedures need to be recalculated; (2) the time complexity is relatively high. Hence, we choose the word similarity between words as the tolerance relation. Generally, the semantic similarity between two words is defined as the cosine similarity between their word vectors [31]. When a document is added to or removed from the corpus, the cooccurrence counts of all the words change and recalculation is needed, but the word similarities between words do not change. This makes the model employing the new tolerance relation incremental and reduces the time complexity of the algorithm flow of the tolerance rough set model. For a positive threshold θ, 0 ≤ θ ≤ 1, the uncertainty function I_θ of w_i is defined as

I_θ(w_i) = {w_j ∈ U | sim(w_i, w_j) ≥ θ},

where sim(w_i, w_j) denotes the cosine similarity degree between the words w_i and w_j.
The cosine similarity is computed as sim(w_i, w_j) = (w_i · w_j)/(‖w_i‖‖w_j‖), where w_i and w_j here denote the word vectors of the words w_i and w_j, respectively. It is evident that the uncertainty function I_θ satisfies the conditions of reflexivity and symmetry.
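A minimal sketch of computing I_θ from pairwise cosine similarities, using toy 2-dimensional vectors in place of real word embeddings (the helper names are illustrative):

```python
import math

def cosine(u, v):
    # cos(u, v) = u . v / (||u|| ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tolerance_classes(vectors, theta):
    # I_theta(w_i) = {w_j | sim(w_i, w_j) >= theta}; reflexive and symmetric by construction
    return {wi: {wj for wj, vj in vectors.items() if cosine(vi, vj) >= theta}
            for wi, vi in vectors.items()}
```

Because cosine similarity is symmetric and every vector has similarity 1 with itself, the resulting classes satisfy exactly the two required properties, while transitivity can fail as the counterexample below shows.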
Here, we give a counterexample to illustrate that I_θ does not satisfy the property of transitivity. Using the word2vec embeddings trained by Google [32], we obtain similarity('beautiful', 'nice') = 0.5341, similarity('nice', 'pretty') = 0.5106, and similarity('beautiful', 'pretty') = 0.3299. Let the cosine similarity degree threshold be θ = 0.5; it is obvious that similarity('beautiful', 'nice') > θ and similarity('nice', 'pretty') > θ, but similarity('beautiful', 'pretty') < θ. So, we conclude that the uncertainty function I_θ does not satisfy the condition of transitivity. The vague inclusion function ν is defined the same as in [6]:

ν(X, Y) = |X ∩ Y| / |X|.

Let S = {S_1, S_2, . . . , S_n} be a collection of sentences, where each S_i is represented by a group of words of the universe. Then, the fuzzy membership function μ for w_j ∈ W, S_i ∈ S is expressed as

μ(w_j, S_i) = ν(I_θ(w_j), S_i) = |I_θ(w_j) ∩ S_i| / |I_θ(w_j)|.

Suppose that all the tolerance classes of words are structural subsets in the whole process, i.e., for any w_i ∈ W, P(I_θ(w_i)) = 1. Then, we define the upper approximation U(R, S_i) and lower approximation L(R, S_i) in R of any S_i ∈ S as

U(R, S_i) = {w_j ∈ W | μ(w_j, S_i) > α},
L(R, S_i) = {w_j ∈ W | μ(w_j, S_i) ≥ β},

where α ∈ [0, 1), β ∈ (0, 1], α < β. U(R, S_i) and L(R, S_i) in R are also written as S̄_i and S_i, respectively. If S_i is regarded as a certain concept vaguely described by its features, then U(R, S_i) can be explained as the collection of concepts that share some semantics with S_i, and L(R, S_i) as the collection of the core concepts of S_i. The probability values α and β can be used to adjust the accuracy of the upper and lower approximations.
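Combining the pieces above, the membership degree μ(w_j, S_i) = |I_θ(w_j) ∩ S_i|/|I_θ(w_j)| and the parameterized sentence approximations can be sketched as follows (the tolerance classes are given as plain sets; function names are illustrative):

```python
def membership(I_theta, w, sentence):
    # mu(w, S) = |I_theta(w) ∩ S| / |I_theta(w)|
    cls = I_theta[w]
    return len(cls & sentence) / len(cls)

def sentence_approximations(I_theta, sentence, alpha, beta):
    # upper: words whose membership exceeds alpha
    # lower: words whose membership reaches at least beta
    upper = {w for w in I_theta if membership(I_theta, w, sentence) > alpha}
    lower = {w for w in I_theta if membership(I_theta, w, sentence) >= beta}
    return upper, lower
```

Note that a word outside the sentence can still enter the upper approximation when its tolerance class overlaps the sentence, which is exactly how latent semantics such as "children" is mined in Example 1 below.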
Each sentence is denoted by two fuzzy sets, one on the upper approximation and one on the lower approximation. Assume that a sentence S_1 is made up of a collection of words {w_1, w_2, . . . , w_m}; then the upper and lower approximations of S_1, considering the membership degrees only, can be written as

S̄_1 = {u_11/w_1, u_12/w_2, . . . , u_1n/w_n},
S_1 = {l_11/w_1, l_12/w_2, . . . , l_1n/w_n},

where u_1k and l_1k denote the membership degrees of w_k to S̄_1 and S_1, respectively. The upper approximation represents the expanded semantics of sentence S_1, capturing the latent semantics that S_1 contains. The similarity between two sentences can be measured by both the upper approximation similarity and the lower approximation similarity of the two sentences. From these two perspectives, both the expanded semantics similarity and the core semantics similarity can be captured sufficiently. Since each sentence is represented by two fuzzy sets, we employ two measurements to calculate the similarity between the two fuzzy sets, defined as follows.
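The exact formulas for Measurements 1 and 2 are given in the equations referenced above. As an illustration only, and not the paper's formulas, a common fuzzy-set similarity over such membership dictionaries is the min/max (fuzzy Jaccard) ratio:

```python
def fuzzy_jaccard(mu1, mu2):
    # sim(A, B) = sum_k min(A(w_k), B(w_k)) / sum_k max(A(w_k), B(w_k))
    # mu1, mu2: dicts mapping each word to its membership degree
    words = set(mu1) | set(mu2)
    num = sum(min(mu1.get(w, 0.0), mu2.get(w, 0.0)) for w in words)
    den = sum(max(mu1.get(w, 0.0), mu2.get(w, 0.0)) for w in words)
    return num / den if den else 0.0
```

Any measure of this kind is symmetric and equals 1 when the two fuzzy sets coincide, which is the behavior the lower approximation similarity exhibits in Example 1.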
On the basis of the upper and lower approximations of the two sentences, besides representing each sentence by a pair of fuzzy sets, we propose another method to measure the similarity. Assume that the elements of the upper and lower approximations of sentences S_1 and S_2 are given as sets of words; then a new similarity degree measurement is defined as follows.

Measurement 3. Consider
The lower similarity determines the degree to which two sentences are assuredly similar. Correspondingly, the upper similarity determines the degree to which two sentences are possibly similar. To measure the final similarity degree of the two sentences, we utilize the linear combination of the upper and lower approximation similarities:

sim_i(S_1, S_2) = λ · sim_i(S̄_1, S̄_2) + (1 − λ) · sim_i(S_1, S_2), i = 1, 2, 3,

where λ is the linear coefficient. λ indicates the proportion of the upper approximation similarity degree, and (1 − λ) indicates the proportion of the lower approximation similarity degree. Because the lower approximation is composed of the core semantics, the proportion of the lower approximation similarity degree is assigned a higher value than that of the upper approximation similarity degree; generally, 0 ≤ λ ≤ 0.5. The whole procedure is summarized in Algorithm 1.

Example 1. Here, we give an example of our proposed methods to calculate the sentence similarity. Assume that the corpus contains four sentences as follows: (i) Three boys are jumping in the leaves. (ii) Three kids are jumping in the leaves.
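The final score is thus a convex combination of the two similarities; a one-line sketch:

```python
def combined_similarity(sim_upper, sim_lower, lam=0.4):
    # sim(S1, S2) = lam * upper similarity + (1 - lam) * lower similarity,
    # with 0 <= lam <= 0.5 so that the core (lower) semantics dominates
    return lam * sim_upper + (1.0 - lam) * sim_lower
```

For instance, with λ = 0.4, an upper similarity of 0.5 and a lower similarity of 1.0 combine to 0.8, weighting the core semantics more heavily.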
(iii) Three kids are sitting in the leaves.
(iv) Children in red shirts are playing in the leaves.
After preprocessing every sentence, 9 words are included in the corpus. Then, let the universe be the set of words U = {boys, jumping, leaves, kids, sitting, children, red, shirts, playing}.
Then, we illustrate the proposed probabilistic tolerance rough sets-based sentence similarity model by computing the similarity degree of the following sentences: (i) S_1: Three boys are jumping in the leaves.
(ii) S 2 : Three kids are jumping in the leaves.
Here, we set the similarity degree threshold θ = 0.6 and the probabilistic values α = 0 and β = 0.7. Then, the upper and lower approximations of these two sentences are shown in Table 1. The upper approximation similarity degrees and lower approximation similarity degrees given by the proposed three measurements are listed in Table 2.
Let the linear combination coefficient λ = 0.4; the final similarity degrees between S_1 and S_2 under the three measurements then follow from equation (18). It is apparent that our proposed probabilistic tolerance rough sets-based sentence similarity algorithm reflects the similarity relation between sentences commendably. Firstly, from the sentences S_1 and S_2, it is evident that both of them express the core semantics of "jumping" and "leaves," just like the lower approximation obtained by our algorithm. Secondly, the lower approximation similarity degree is computed to be 1, which means that S_1 and S_2 share the same core meaning. Thirdly, from the upper approximation of S_2, it can be seen that the word "children" did not originally belong to S_2, but the meaning of "children" is mined through our method. The new meaning "children" comes from the tolerance class of the word "kids," so, in a sense, "children" is an explanation of "kids." Therefore, our proposed methods can capture some latent semantics behind texts from the upper approximation, which helps distinguish whether two sentences are similar from a more general perspective. Analogously, our proposed algorithms can refine the core semantics of texts via the lower approximation, which analyzes sentence similarity from a more accurate perspective.
Example 2. We use the traditional tolerance rough set model [6] on Example 1 for comparison. The word cooccurrence degree is set as 2. Then, the upper and lower approximations can be seen in Table 3. Table 4 displays the corresponding upper and lower approximation similarity degrees. Then, the sentence similarity degrees under the three measurements are as follows: (i) sim_1(S_1, S_2) = 0.517, (ii) sim_2(S_1, S_2) = 0.476, (iii) sim_3(S_1, S_2) = 0.799.
From the results, we can see that the traditional model performs worse than our methods. Then, we discuss the condition that one sentence, "Three boys are sitting in the leaves," is added to the corpus of Example 1. With the probabilistic tolerance rough set model, the whole computational process and results do not alter. With the traditional tolerance rough set model, however, the procedures have to be repeated from the calculation of the uncertainty function; the new upper and lower approximations of S_1 and S_2 are illustrated in Table 5. Thus, the applicability of the model in [6] is greatly reduced.

Algorithm 1: Probabilistic tolerance rough sets-based sentence similarity model.
Input: A collection of sentences S = {S_1, S_2, . . . , S_n}.
Parameters: the cosine similarity degree threshold θ; the probabilistic values α, β; the linear combination parameter λ.
Output: The similarity degree between S_i and S_l.
(1) Preprocess the sentence corpus S = {S_1, S_2, . . . , S_n} and generate the universe including all the distinct words of the corpus.
(2) Compute the uncertainty function I_θ(w_i) of each word in the universe according to equation (3).
(3) Suppose that the similarity degree between sentences S_i and S_l is to be calculated. Apply equation (6) to calculate the fuzzy membership degree μ(w_j, S_i) of each word in sentence S_i, 1 ≤ j ≤ N, 1 ≤ i ≤ n.
(4) Obtain the upper approximation U(R, S_i) and lower approximation L(R, S_i) of each sentence S_i ∈ S according to equations (7) and (8). Similarly, acquire U(R, S_l) and L(R, S_l).
(5) Represent the upper and lower approximations of S_i and S_l as fuzzy sets according to equations (9) and (10), written as S̄_i, S_i, S̄_l, and S_l.
(6) Calculate the upper approximation similarity sim(S̄_i, S̄_l) and the lower approximation similarity sim(S_i, S_l) according to equations (12)-(17) for the three measurements, respectively.
(7) Obtain the final sentence similarity degree sim(S_i, S_l) utilizing the linear combination in equation (18).

Experimental Results and Discussion
In this section, we use the SICK2014 task and the STSbenchmark dataset to evaluate the performance of our methods.

Dataset and Preprocessing.
SICK2014 [33] is a dataset for the similarity evaluation of sentence pairs, which contains a training set, a trial set, and a testing set for a total of 15000 sentence pairs. Since our proposed model is unsupervised and does not require additional training on the dataset, we select the 5000 sentence pairs of the training set for the experiments. Each sentence pair has been assigned a similarity score from 0 to 5 by experts. Table 6 shows two examples from the SICK2014 dataset. STS is the abbreviation for Semantic Textual Similarity. The SemEval STS datasets from 2012 [34] to 2017 [35] were selected for this benchmark; each sentence pair has been assigned a similarity score from 0 to 5 by experts. STS-train, STS-dev, STS-test, and MSRvid are chosen for the experiments.

For better comparison with our experimental results, we have normalized the similarity scores. We take the word embeddings trained by Google [23] as the word vectors in the experiments.

Evaluation Metrics.
We exploit the Pearson correlation coefficient (Pcc) [36] and mean square error (MSE) [37] to evaluate the performance of sentence similarity measurements.
Pcc is a linear correlation coefficient that reflects the linear correlation of two variables. For two variables X and Y, the mathematical expression of Pcc is

Pcc(X, Y) = Cov(X, Y) / (D(X) · D(Y)) = E[(X − EX)(Y − EY)] / (D(X) · D(Y)),

where Cov(X, Y) is the covariance of X and Y, D(X) and D(Y) denote the standard deviations of X and Y individually, and EX refers to the mathematical expectation of X. The greater the absolute value of Pcc, the stronger the correlation. MSE is a measure reflecting the degree of difference between the estimated value and the real value. The definition of MSE is

MSE = (1/N) Σ_i (y_i − ŷ_i)²,

where N is the sample size, y_i is the real value, and ŷ_i is the estimated value. A smaller value of MSE indicates a smaller deviation between the estimated value and the real value.
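Both metrics can be computed in a few lines of plain Python (a minimal sketch; in practice library routines would be used):

```python
import math

def pearson(xs, ys):
    # Pcc = Cov(X, Y) / (std(X) * std(Y))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def mse(ys, y_hats):
    # MSE = (1/N) * sum_i (y_i - yhat_i)^2
    return sum((y - yh) ** 2 for y, yh in zip(ys, y_hats)) / len(ys)
```

A perfectly linear relation yields Pcc = 1, and identical predicted and real values yield MSE = 0, matching the interpretations above.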

Experimental Results and Analysis.
We proposed three sentence similarity measurements based on the probabilistic tolerance rough set model. The performances on the SICK2014 dataset are displayed in Table 7. In the table, BERT-687 and BERT-1024 are two different BERT models for sentence representation, and the sentence similarity is calculated by the cosine similarity. Fuzzy rough is the model proposed in [12]. As can be seen in Table 7, on the whole, the three measurements perform much better than the other three models. The results of Measurement 3 achieve the optimal performance, with a Pcc of 0.725 and an MSE of 0.033. Particularly for the value of MSE, it is evident that there is a very small error between the sentence similarity degrees calculated by our methods and the real values. Tables 8 and 9 show the Pcc and MSE results on the STSbenchmark dataset. From the tables, we can see that all three measurements perform much better than BERT on the four datasets of STSbenchmark. The reason is that more latent semantics behind sentences can be captured by our models. Therefore, the experimental results confirm the efficiency and applicability of our methods.

Cosine Similarity Degree Threshold.
In our improved probabilistic tolerance rough set model, the cosine similarity degree threshold θ controls the accuracy of the uncertainty function. The higher the value of θ, the more precise the uncertainty function. However, too high a value of θ results in inadequate semantics mining, while too small a value leads to more redundant and noisy information. The effect of θ on Example 1 can be identified in Table 10. Figures 1 and 2 reveal the effects of different cosine similarity threshold values θ on Pcc and MSE, respectively. In this experiment, α is set to 0, β to 0.6, and λ to 0.3, and θ ranges from 0.5 to 1. As shown in Figure 1, the value of Pcc increases from θ = 0.5, peaks at θ = 0.9, and then decreases. Similarly, the value of MSE decreases from θ = 0.55, reaches its minimum at θ = 0.95, and then increases. We can conclude that the effect of θ matches the regularity analyzed above.

Probability Value.
The values of α and β are used for adjusting the precision of the upper and lower approximations. α determines the range of the upper approximation: the smaller the value of α is, the more elements the upper approximation set has. In the traditional tolerance rough sets, α is 0, for which the upper approximation contains the most information. Adjusting the value of α can reduce the generation of redundant information without losing too much potential semantic information. The influence of α on Example 1 can be observed in Table 11. Similarly, β determines the range of the lower approximation: a larger value of β leads to fewer elements in the lower approximation set. When β = 1, the fewest elements are included in the lower approximation, which may cause the loss of some core semantic information. Adjusting β properly may mine the core semantic information better and more adequately. The effect of β on Example 1 can be observed in Table 12.

Conclusion
In this paper, owing to the uncertainty inherent in text data, we incorporate probabilistic tolerance rough sets to establish a novel sentence similarity computation model. Because the traditional tolerance rough set model is not incremental and has high complexity, we improve it, making the model incremental and reducing its time complexity. Through introducing the probability values α and β, the accuracy of the upper and lower approximations can be adjusted. The upper and lower approximations serve to represent every sentence, and on this basis, three sentence similarity calculation measurements are proposed. The upper approximation similarity and lower approximation similarity are calculated individually for each sentence pair, and their linear combination indicates the total sentence similarity. On the one hand, the model can dig out more latent semantic information than traditional methods based on shallow semantics. On the other hand, it is unsupervised, which relieves the defect of supervised deep learning-based methods. We carry out experiments on the SICK2014 task and the STSbenchmark dataset to evaluate the performance of our proposed model. The results verify the efficiency and applicability of the proposed models. The proposed model is established without considering word order, which our future work will take into account.

Data Availability
The SICK2014 task data used to support the findings of this study are available from clic.cimec.unitn.it/composes/sick.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.