Active Learning for Constrained Document Clustering with Uncertainty Region

Constrained clustering is intended to improve accuracy and personalization based on constraints expressed by an Oracle. In this paper, a new constrained clustering algorithm is proposed in which informative data pairs are selected during an iterative process. These pairs are presented to the Oracle, which answers whether their relation is "Must-link (ML)" or "Cannot-link (CL)." In each iteration, first, a support vector machine (SVM) is trained on the labels produced by the current clustering. A distance matrix is created from the distance of each document to the hyperplane, and a similarity matrix is created from the cosine similarity of the word2vec representations of the documents. Two types of probability (similarity and degree of similarity) of belonging to neighborhoods are calculated and smoothed. Neighborhoods consist of the samples that the Oracle has confirmed to be in the same cluster. Finally, at the end of each iteration, the data with the greatest uncertainty (in terms of probability) are selected for questioning the Oracle. For evaluation, the proposed method is compared with well-known state-of-the-art methods on two criteria over standard datasets. The results demonstrate increased accuracy and stability with fewer questions.


Introduction
Clustering is one of the most important methods in machine learning [1] and can be applied to different datasets such as document sets. In the common clustering methods, no prior information is available, and, as such, clustering is called an unsupervised learning method [2,3]; however, in the real world, some information [4] is normally available or can be obtained from an Oracle. This information can take different forms and can be used in the clustering process [5][6][7][8][9][10][11][12][13].
If the information is presented as pairwise constraints (a document pair that must be in the same cluster (ML), or a document pair that must not be in the same cluster (CL)), and these constraints are used during clustering, the method is called pairwise constrained clustering [6,14,15]. Pairwise constraints can be useful in the clustering process in two ways: when enough informative pairwise constraints exist, the accuracy and efficiency of the clustering can be improved; and when we want to change the process of clustering, it can be personalized [10,12,16].
Generally, it is important to select valuable data pairs as informative pairwise constraints. Active learning selects the informative pairwise constraints and sends them to the Oracle for a response (Must-link/Cannot-link) [15]. Active learning offers the greatest improvement in clustering accuracy while saving time and cost, for a minimum number of pairwise constraints [1,11,14,[17][18][19][20]. Active learning is performed frequently in classification and has provided better results there. However, active learning in clustering has been used only to a limited extent. The aim of these methods is to select data pairs that are, most of the time, not correctly clustered by the current clustering [14,[21][22][23].
Often, pairwise constraints are selected once and sent to the Oracle, so the current clustering cannot play an effective role in the selection. For this reason, iterative methods and concepts such as neighborhoods and uncertainty are used for the selection of informative pairwise constraints [14,[24][25][26].
The main objective of this study is to present active learning for pairwise constrained document clustering with an uncertainty region. Traditionally, active learning is used to select valuable data while asking the Oracle fewer questions. Active learning depends on a variety of methods for better results. These methods are usually statistical and complex.
There are few effective methods that are also simple and intuitive. In this study, the uncertainty region is an effective, simple, and novel method. Active learning has the most impact on the selection of valuable documents. In constrained document clustering, obtaining a pairwise constraint requires the user to do an excessive amount of work, reading the documents in question and indicating their relationship, which is feasible but time consuming. For this reason, we use document datasets for evaluation. These datasets highlight the impact of active learning.
Since the dataset is a document collection, preprocessing is a necessary step for achieving the best result. Document representation and word embedding constitute the core of this step. A common approach to represent a document is bag-of-words (N-grams weighted by TF-IDF). This method is useful for capturing word frequency; however, structural and semantic information is ignored. Some methods, such as graph representation [27], wikification, WordNet, and others, are useful for enrichment and semantic representation. Nowadays, neural language models significantly outperform traditional methods because they can preserve the semantic relationships between the words of documents. For example, word2vec is a simple and effective method for word (and document) representation and dimension reduction.
In this study, some data pairs are selected as informative pairwise constraints in each iteration and, based on the response received from the Oracle, the set of constraints is updated. Iteration continues until the number of allowed queries to the Oracle reaches zero. In each iteration, constrained clustering is performed. Then, an SVM is trained on the labels assigned to each data point by the current constrained clustering. From the SVM model, the distance matrix and the similarity matrix are created, based respectively on the distance of each document from the hyperplane (HP) and on the cosine similarity between the semantic representations (word2vec) of the documents [31,32,34]. Afterwards, the probability of each data point belonging to every neighborhood is calculated from these matrices. The similarity and degree-of-similarity methods are implemented for obtaining the probabilities. Our method suggests a new concept, the uncertainty region, for expressing the degree of similarity. In each iteration, the boundary of this region is obtained.
Uncertainty estimation, such as the silhouette index, is a widely used approach in active learning, where data points are ranked by their level of uncertainty regarding their probability of belonging to neighborhoods. To select valuable data as informative pairwise constraints, we use the silhouette index, with smaller values representing a greater level of uncertainty [26]. Experimental results reveal the stability and improved accuracy of our method over five different datasets compared to four well-known state-of-the-art methods. Our main contributions are summarized as follows: (1) presentation of a new concept (the uncertainty region) to detect uncertainty instead of the complicated methods in the literature; (2) automatic creation of the boundary of the uncertainty region and of the penalty for violating constraints (by histogram thresholding); (3) development of an adaptive and consensus ensemble method (local search for selecting uncertain data pairs instead of global search) to balance clusters and achieve stable results. Concerning the organization of the paper, we address the related works in Section 2. In Section 3, the proposed method and materials are described. Section 4 presents the experimental method, Section 5 presents experimental results and discussion, and Section 6 concludes the paper and presents future work.

Related Works
One class of methods that has rarely been considered is the use of a support vector machine, deep learning concepts, and neighborhoods, especially in constrained clustering [22,31,35]. First, we explain these concepts and then introduce methods similar to our proposed method.
Generally, let x_i, x_j, x_k be three data points of a dataset, let the Oracle response be in {Must-link, Cannot-link}, and let the current clustering label be expressed as lab.
Then, equation (1) must hold in order for the neighborhood concept to be well defined [14,24,26]: if ML(x_i, x_j) and ML(x_j, x_k), then ML(x_i, x_k); and if ML(x_i, x_j) and CL(x_j, x_k), then CL(x_i, x_k). Each neighborhood includes data samples that lie in the same cluster. The essential assumption is that data in different neighborhoods must lie in different clusters. Neighborhoods can be written as neighbor = {N_1, N_2, N_3, ..., N_h}, where h neighborhoods exist [24,26].
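As a minimal sketch of this consistency requirement (the function name and the set-based representation of neighborhoods are our own, not from the paper):

```python
from itertools import combinations

def consistent_neighborhoods(neighborhoods, cannot_links):
    """Check the neighborhood assumption around equation (1): points inside
    one neighborhood are implicitly must-linked, so no cannot-link pair may
    fall inside a single neighborhood."""
    for nh in neighborhoods:
        for a, b in combinations(nh, 2):
            if (a, b) in cannot_links or (b, a) in cannot_links:
                return False  # a CL pair fell inside one neighborhood
    return True

neigh = [{0, 1, 2}, {3, 4}]
cl = {(2, 3)}
print(consistent_neighborhoods(neigh, cl))   # True: the CL pair spans two neighborhoods
print(consistent_neighborhoods([{2, 3}], cl))  # False
```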
Neighborhoods are used because they are economical for questioning the Oracle: after selecting informative data in each iteration, we can present each data point only against one sample from each neighborhood. If the result is Must-link, we add the data point to that neighborhood, and it becomes Must-link with all the data in that neighborhood. If the data point is not Must-link with a member of any neighborhood, then a new neighborhood is created and the data point is placed in it [26].
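The querying procedure above can be sketched as follows; the `oracle` here is a toy stand-in that answers from ground-truth labels, and all names are illustrative:

```python
def place_point(x, neighborhoods, oracle):
    """Query x against one representative per neighborhood; append it to the
    first neighborhood whose representative is must-linked with x, otherwise
    open a new neighborhood."""
    for nh in neighborhoods:
        rep = next(iter(nh))          # any member stands in for the whole set
        if oracle(x, rep) == "ML":
            nh.add(x)
            return neighborhoods
    neighborhoods.append({x})          # no ML answer: x starts a new neighborhood
    return neighborhoods

# toy oracle: ground-truth labels decide ML/CL
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
oracle = lambda p, q: "ML" if labels[p] == labels[q] else "CL"
neigh = [{0}]
for x in (1, 2, 3):
    place_point(x, neigh, oracle)
print(neigh)  # [{0, 1}, {2, 3}]
```

Note that only one query per existing neighborhood is needed, which is exactly the economy the text describes.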
Deep learning can be trained on huge datasets and outperform traditional methods. It is useful for dimension reduction and for preserving the semantics as well as the structure of the dataset. There is little research that has applied deep learning methods in constrained clustering. Via this approach, we can apply dimension reduction, calculate the similarity matrix, and more. Our paper uses a deep learning method effectively in the constrained clustering process [11,13,31,32].
Here, the dataset D = {x_1, x_2, x_3, ..., x_n} has constraints given as data pairs with Must-link and Cannot-link relationships. In this case, the objective function can be changed to combine and apply pairwise constraints in the clustering process. In this change, a penalty can be imposed for violating pairwise constraints [17]. For example, the PCKmeans algorithm [36] uses both the standard objective function and a penalty for the violation of constraints, by changing the objective function. These two parts together constitute the objective function, which is minimized locally.
In contrast, in the COPKmeans algorithm, no violation of pairwise constraints is permitted. This is called hard constrained clustering, as opposed to the previous, soft constrained clustering algorithm [36].
Along with algorithms that develop new methods of constrained clustering, there are weaker methods for selecting informative pairwise constraints [6][7][8]36]. Active learning is widely used in classification when little labeled data is available for the training set [9,11,13,18,19,22,35,37].
In this regard, the first active learning algorithm for constrained clustering was developed by Basu et al. [36].
This algorithm has two main phases: exploration and consolidation. In the first phase, data points are gradually selected based on the farthest-first strategy. After selection, each data point is presented to the Oracle against a sample of each neighborhood; if it is not located in any neighborhood, a new neighborhood is created. The second phase selects data iteratively and randomly, after which the data are presented to the Oracle against a sample of each neighborhood until they lie in one neighborhood. In the first phase, the objective is to grow the number of neighborhoods, while in the second phase the objective is to grow the number of data samples in each neighborhood. This method is the basis of other methods, and several developments of the algorithm exist. Examples include [17], where informative data selection was not performed randomly.
Greene and Cunningham [24] performed informative data selection using another method. Their algorithm has two main phases, similar to the previous one. First, the dataset is clustered with different algorithms; then, for each data pair, a similarity matrix is constructed from the frequency with which the pair lies in the same cluster. At this stage, two thresholds are determined from the values of this matrix, dividing data pairs into three categories. The pairs with values above the upper threshold are selected as Must-link pairs, whose transitive closure forms the neighborhoods. In the first phase, the mean of the data in each neighborhood is calculated; then, initial clustering is formed by assigning data to these means. In the second phase, from the similarity matrix, the probability of each data point belonging to each cluster is calculated, and the data with the greatest uncertainty are selected as informative data.
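The co-association similarity step of this approach can be sketched as follows, assuming scikit-learn's k-means as the base clusterer (the paper does not specify which algorithms are ensembled):

```python
import numpy as np
from sklearn.cluster import KMeans

def coassociation(X, n_runs=10, k=2, seed=0):
    """Co-association similarity in the spirit of Greene and Cunningham:
    run k-means several times and count how often each pair of points
    lands in the same cluster."""
    n = len(X)
    S = np.zeros((n, n))
    for r in range(n_runs):
        lab = KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit_predict(X)
        S += (lab[:, None] == lab[None, :])   # 1 where the pair shares a cluster
    return S / n_runs

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
S = coassociation(X)
print(S[0, 1], S[0, 2])  # near 1.0 for same-cluster pairs, near 0.0 otherwise
```

Thresholding this matrix from above yields candidate Must-link pairs, as the text describes.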
Xiong et al. [26] introduced the framework closest to our proposed method in terms of active learning background. In that paper, constrained clustering is used as a black box, and in each iteration only the results of the performed clustering are used. In each iteration, the result of clustering determines a class label for each data point; then, a random forest classifier is trained. The ratio of the number of times a data pair is placed in the same leaf to the total number of model iterations is taken as an element of the similarity matrix between data points. The probability of each data point belonging to each neighborhood is calculated from the similarity matrix. Then, informative data for each question are obtained from these probabilities using uncertainty sampling estimation methods such as entropy and expected cost.
Recently, Xiong et al. [38] developed a new online framework for active clustering with model-based uncertainty detection. This method uses semisupervised spectral clustering as a black box and selects pairwise constraints during the clustering process based on an uncertainty detection principle. The main idea is based on the concepts of "certain sample sets" and "uncertainty estimation." Certain sample sets are approximately similar to neighborhood sets. For estimating the uncertainty, a novel first-order model approximation decomposes the expected uncertainty into two components: a gradient and a step-scale factor. The calculations of this framework are complicated and time consuming in terms of runtime, and the framework struggles to preserve semantics and reduce dimensions in unstructured datasets such as documents.
Oliveira et al. [39] proposed new hybrid methods that combine a random-key genetic algorithm with local search heuristics and column generation with path relinking. They found that a genetic algorithm with local search can act as an alternative and efficient method to solve the constrained clustering problem. Yang et al. [40] analyzed the theoretical effect of the diversity and quality of the ensemble and then proposed a unified framework to solve the clustering ensemble selection problem with three criteria. Wei et al. [41] introduced a semisupervised clustering ensemble approach that involves both pairwise constraints and metric learning. In this method, using supervised information, different base clustering partitions are generated by constraint-based and metric-based semisupervised clustering, respectively. Then, a consensus function smooths the result of each independent clustering. Yu et al. [42] developed a new ensemble clustering approach with active learning and selected constraint projection. In this method, first, random subspace datasets are generated, after which, with the constraint set, high-dimensional data are mapped to a low-dimensional space. After subspace generation and dimension reduction, different weights are generated for each constrained cluster. Finally, a consensus function combines the clustering results.
Another category of algorithms also exists in this field, including active and iterative production of pairwise constraints [8,18,19,24,43], genetic heuristic-based algorithms [39], communication between constraints for enriching constraint sets [44], and constraint space transfer with kernels [45]. These methods have often used previously published algorithms as a black box, trying to reduce uncertainty with novel methods.
Methods such as uncertainty sampling [24,26], committee (or ensemble) and hybrid methods [40,46], and lowering the error rate of the main model along with distance from the hyperplane in SVM [30] are used in active learning and constrained algorithms [15,19]. These methods have been used in the literature in different ways [8,26]. The main drawbacks of the mentioned algorithms include unstable results, ignoring semantic representation, using a weak method to measure similarity, heavy calculations, broad use of random selection, weak dimension reduction, and weak uncertainty detection methods.

Materials and Methods
Since the clustering dataset consists of documents, it is first necessary to convert the document set into a document-term matrix. For this purpose, preprocessing is applied, such as removing empty documents, numbers, and stop words. In order to extract all terms of a document, tokenization is required, in which a document is tokenized into a batch of terms, with each term being given a weight. Afterwards, to decrease the dimensions of this matrix, only informative terms are preserved.
Furthermore, the word2vec method is used for preserving the semantics and structure of the document dataset. Word2vec uses a raw dataset to generate a vector for each word in a document. Then, simply by averaging the vectors of the words in a document, the document vector is generated. This method is used to create a similarity-document-term matrix. Each row of this matrix is a document vector, and the number of rows is equal to that of the document-term matrix [32].
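The averaging scheme can be sketched as follows; the dict `wv` stands in for a trained word2vec model, and the helper name is ours:

```python
import numpy as np

def doc_vector(tokens, word_vectors, dim):
    """Average the word2vec vectors of a document's tokens (the averaging
    scheme described above); words without a vector are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)           # empty document falls back to the origin
    return np.mean(vecs, axis=0)

wv = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
v = doc_vector(["cat", "dog", "unknown"], wv, dim=2)
print(v)  # [0.5 0.5]
```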
Afterwards, we use PCKmeans clustering as a black box. In this paper, two essential changes have been applied to PCKmeans: (i) in the initialization step and (ii) in the calculation of cluster centers. Furthermore, the penalty for violating constraints is determined automatically [12,47].

Problem Preliminaries.
The set of documents is shown as D = {x_1, x_2, ..., x_n}, in which x_i represents a document and i = 1, ..., n. Then, by applying preprocessing, the terms of these documents are converted into weights with different values by TF-IDF (which gives the best result in this case) [12,14,19,29,33,48]. Each document can then be shown as x_i = {w_1, w_2, w_3, ..., w_t}, b = 1, ..., t, where w_b represents the weight of term b, obtained as

w_ib = tf_ib × log(n / df_b).   (2)

In this formula, tf_ib is the frequency of term b in document i, and df_b is the number of documents in which this term exists. In order to reduce the term dimensions of the matrix, the mean-tfidf method is used: initially, the mean weight is calculated for each term; then, the terms whose mean weight is higher than the overall mean remain in the matrix, while the other terms are removed. Equation (3) shows this method:

mean-tfidf(b) = (1/n) Σ_{i=1..n} w_ib.   (3)

After creating the document-term matrix and reducing its dimension (in addition to using word2vec to create the similarity matrix), the Minkowski distance has been used in the clustering algorithm in this paper; it is one of the most common distance measures for clustering.
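A sketch of the TF-IDF weighting and mean-tfidf pruning as we read them (the exact variants of equations (2) and (3) in the paper may differ):

```python
import math
import numpy as np

def tfidf_matrix(docs):
    """TF-IDF weights w_ib = tf_ib * log(n / df_b), the standard form that
    the weighting in equation (2) presumably takes."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}
    M = np.zeros((n, len(vocab)))
    for i, d in enumerate(docs):
        for j, t in enumerate(vocab):
            M[i, j] = d.count(t) * math.log(n / df[t])
    return M, vocab

def mean_tfidf_prune(M, vocab):
    """mean-tfidf reduction: keep only terms whose mean weight exceeds the
    overall mean weight (our reading of equation (3))."""
    col_means = M.mean(axis=0)
    keep = col_means > col_means.mean()
    return M[:, keep], [t for t, k in zip(vocab, keep) if k]

docs = [["apple", "apple", "pie"], ["apple", "cake"], ["pie", "pie"]]
M, vocab = tfidf_matrix(docs)
M2, vocab2 = mean_tfidf_prune(M, vocab)
print(vocab, "->", vocab2)
```

On this toy corpus the rare term "cake" falls below the mean weight and is pruned.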

3.2. The Proposed Clustering Algorithm. The support vector machine based pairwise constrained clustering algorithm, SVBPCKmeans, is presented in Algorithm 1. The objective function of this algorithm is minimized locally, as in the PCKmeans method. In the proposed method, the penalty for violating constraints is calculated and normalized at each stage (in contrast to PCKmeans).
In this algorithm, first, C (the set of pairwise constraints), neigh (the set of neighborhoods), and h (the current number of neighborhoods) are initialized. Then, Algorithm 2, "Cons_set_initial," is called only once at the beginning of the algorithm, and its results are used subsequently. Afterwards, the while loop continues until q (the number of remaining questions to the Oracle) reaches zero. The center of each neighborhood is calculated at the beginning of each loop, and these neighborhood centers are introduced into the clustering algorithm as the initial cluster centers. Note that if the number of neighborhoods is lower than the required number of cluster centers, the remaining centers are selected randomly.
Next, we enter an iterative process that continues until the objective function with the violation penalty converges. In this iterative process, new cluster centers are obtained in Algorithm 1 and smoothed, by a coefficient, with the neighborhood centers obtained from the previous stage. The main reason for this smoothing is to establish a balance between the centers of the newly created clusters and the neighborhood centers at each stage. After convergence, Algorithm 3, "Cons_set_develop," is called. The results of this algorithm are of the same kind as those of Algorithm 2, but they are used at each repetition. Figure 1 depicts an overview of the proposed algorithm and its main steps.
(Algorithm 1 initializes the k centers with the neighborhood centroids, plus randomly chosen points if needed, and repeats an assign_cluster step, assigning each data point x_i to the cluster p* that minimizes the penalized objective, until convergence.) The task of Algorithm 2 is to explore the neighborhoods. This algorithm continues while questions to the Oracle are allowed and the number of neighborhoods is not larger than the number of clusters. The first neighborhood, with h = 1, is formed from a random data point. The strategy in this algorithm is to use the cluster centers obtained from a simple clustering algorithm such as k-means: clustering is performed on the dataset, the cluster centers are obtained, and finally the data points nearest to the cluster centers are selected.
Each of these nearest data points is presented to the Oracle iteratively against a data sample of each neighborhood. If the answer is ML, the point is appended to the corresponding neighborhood, the constraint set is updated, and the loop breaks; otherwise, if no neighborhood yields an ML response, a new neighborhood is formed.
In this strategy, the main objective is to find the maximum number of neighborhoods, benefitting from the clustering algorithm; for example, the cluster centers tend to have Cannot-link relations. Thus, this method is better than strategies such as farthest-first or random selection. In our method, fewer questions are required to reach the maximum number of neighborhoods, and the remaining questions can be used in the next algorithm, Algorithm 3.
Algorithm 3 tries to build each neighborhood to a balanced size. Its objective is to find informative data with a Must-link response from the Oracle. First, Algorithm 4, "informative-points," selects k informative data points (equal to the number of clusters) and sends them to Algorithm 3. Thereafter, the distance between each informative data point and the neighborhood centers is calculated; the distances are sorted in ascending order, and the point is presented to the Oracle against a data sample of each neighborhood in that order. The goal is to find the Must-link with minimum cost. Finally, the data with a Must-link answer are added to the corresponding neighborhood, and all the sets are updated accordingly.
Note that Algorithm 4 is the main algorithm for selecting the informative data points. It also dynamically finds the penalty for violating Must-link, W_m, and the penalty for violating Cannot-link, W_c. In this algorithm, we introduce a new concept, the uncertainty region, which is used for determining the degree of similarity on which document membership in neighborhoods is based. Indeed, the uncertainty region is a set of data pairs with greater uncertainty, based on the values in the distance matrix.
This algorithm takes the labels obtained from the current pairwise constrained clustering (in Algorithm 1) and treats them as classes. Then, it trains an SVM classifier k times. In this way, the distance from the HP is calculated for the available data, and the d_m matrix is calculated for each pair of data according to equation (5). The point of this method is that, contrary to common methods, the values of this matrix lie in the continuous interval [0, 1] after normalization, which provides high decision-making power. Next, we calculate the normalized matrix s_sm from the similarity-document-term matrix SM, computed for each pair of data according to equation (6); s_sm uses pretrained word2vec vectors and the cosine similarity method [18,32]. In line 5 of Algorithm 4, m_c and m_d are calculated from the values of the matrix d_m via histogram thresholding, to determine the boundary of the uncertainty region. Histogram thresholding is one of the methods for obtaining a threshold value in a continuous interval [49]; it acts as a two-class classification method whose objective is to reduce ambiguity within the interval of the existing values.
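Equations (5) and (6) are not reproduced in the text; the following sketch shows one plausible reading, using scikit-learn's linear SVM and cosine similarity (the pairwise form |d_i − d_j| for d_m is our assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[0.1, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
labels = np.array([0, 0, 1, 1])         # labels from the current clustering

svm = LinearSVC(C=1.0).fit(X, labels)
d = svm.decision_function(X)            # signed distance of each document to the HP
d_m = np.abs(d[:, None] - d[None, :])   # pairwise distance matrix (our reading of eq. (5))
d_m = d_m / d_m.max()                   # normalize into [0, 1]

s_sm = cosine_similarity(X)             # similarity matrix from document vectors (eq. (6))
print(d_m[0, 1] < d_m[0, 2])            # same-cluster pairs sit closer on the HP axis
```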
For this purpose, first, the unique values in the d_m matrix are collected; then, these values are divided into intervals, and the average of each interval is denoted D_i. In the next step, the number of data pairs in each interval is counted, g(D_i). Next, a weighted moving average with a window of 5, f(D_i), is calculated from these values in equation (7). According to f(D_i), we start from the first intervals and take the first relative minimum, f(D_v), as the threshold value; in this way, the boundary of the uncertainty region is calculated according to equation (8): after finding the first valley point in the modified histogram, a data pair lies in the uncertainty region if m_d ≤ its distance ≤ m_c, and in the strong region otherwise.
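The histogram-thresholding procedure can be sketched as follows; the number of bins and the uniform (rather than weighted) moving-average kernel are our assumptions:

```python
import numpy as np

def first_valley_threshold(values, n_bins=20, window=5):
    """Histogram thresholding as described above: bin the unique pairwise
    values, smooth the counts with a window-5 moving average (eq. (7)),
    and return the center of the first local minimum (the first valley)."""
    vals = np.unique(values)
    counts, edges = np.histogram(vals, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    kernel = np.ones(window) / window
    f = np.convolve(counts, kernel, mode="same")   # smoothed histogram f(D_i)
    for i in range(1, len(f) - 1):
        if f[i] <= f[i - 1] and f[i] <= f[i + 1]:
            return centers[i]                       # first relative minimum
    return centers[len(f) // 2]                     # fallback: midpoint

# bimodal toy data: the valley between the modes should be found
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.2, 0.03, 500), rng.normal(0.8, 0.03, 500)])
t = first_valley_threshold(data)
print(t)
```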
Then, two types of probability of belonging to neighborhoods are calculated. A similarity value between each data point and each neighborhood is calculated from equation (9), based on the s_sm matrix. Also, in equation (10), the degree of similarity to each neighborhood is measured: if the value of a data pair in the d_m matrix falls between the boundaries of the uncertainty region, we assign it the value 0.1. Then, to measure the level of uncertainty of each data point's membership in the neighborhoods, we use equation (14). Generally, there are many criteria, such as entropy, for measuring the level of uncertainty, and the choice among them does not affect the performance of our method.
We use a criterion based on the well-known silhouette index, which is often used in internal cluster validation. In this method, for each data point, the first highest probability (fm) and second highest probability (sm) of belonging to neighborhoods are selected, and the level of uncertainty is determined. Finally, the method is applied to the probabilities and then smoothed with a coefficient β; we use β = 0.4, which offers the best result. Equation (14) presents the method [24]. Data with a smaller value in equation (14) (indicating a greater level of uncertainty) are selected as the most informative data. Unlike other methods, we use local selection instead of global selection of informative data. In global methods, informative data are selected from the entire dataset. In the local method of this paper, in order to balance the number of data points in each neighborhood and, consequently, to balance the clustering, the most informative data are first selected from the current pairwise constrained cluster that has the maximum number of data points. This local method makes the results stable and more accurate. Finally, the informative data are selected from the data that are not members of any neighborhood.
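A sketch of the silhouette-style uncertainty score; since equation (14) is not reproduced in the text, the exact smoothing (a convex combination with the mean, weighted by β) is our assumption:

```python
import numpy as np

def uncertainty(probs, beta=0.4):
    """Silhouette-style uncertainty: for each data point take the first (fm)
    and second (sm) highest neighborhood-membership probabilities; a small
    (fm - sm)/fm means the point is torn between two neighborhoods. The
    beta-smoothing form here is an assumption, with beta = 0.4 as quoted."""
    p = np.sort(probs, axis=1)[:, ::-1]     # probabilities in descending order
    fm, sm = p[:, 0], p[:, 1]
    raw = (fm - sm) / fm
    return beta * raw + (1 - beta) * raw.mean()

P = np.array([[0.90, 0.05, 0.05],   # confident point
              [0.40, 0.38, 0.22]])  # torn between two neighborhoods
u = uncertainty(P)
print(u[1] < u[0])  # True: the torn point scores lower, i.e. more uncertain
```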

Experimental Method
In this section, we empirically evaluate the performance and accuracy of the proposed method in comparison with the methods explained in related works. First, we explain the experimental setup and then the experimental results.

Dataset.
There are three document clustering datasets that are commonly used in research: Newsgroup20, the Sector dataset, and the WebKB dataset (http://people.cs.umass.edu/∼mccallum/data.html). In order to reveal the robustness of our algorithm in different situations, five datasets with different numbers of classes and sizes are selected randomly from the three major datasets mentioned. The fifth dataset, presented in detail in Table 1, is randomly selected from the third main dataset.

Evaluation Criterion.
There are many criteria for evaluating document clustering. In this paper, two are used. The Rand index (RI) is used for calculating the agreement between the labels obtained from clustering and the real class labels.
RI measures the agreement between two partitions, p_1 and p_2, of the same dataset D. Each partition is viewed as a collection of n(n − 1)/2 pairwise decisions (in this case, ML and CL), where n is the size of D. For each pair of points d_i and d_j in D, each partition assigns them either to the same cluster or to different clusters. Let a be the number of decisions where d_i and d_j are in the same cluster in both partitions, and let b be the number of decisions where the two instances are placed in different clusters in both partitions. The total agreement can then be calculated as

RI(p_1, p_2) = (a + b) / (n(n − 1)/2).   (15)

The second criterion is normalized mutual information (NMI). This method evaluates the assigned clustering labels against the real class labels. NMI treats both the real and the assigned labels as two random variables, measures the mutual information between them, and normalizes it into the interval between zero and one. If C is the random variable of the class assigned by clustering and K is the random variable of the real class, then NMI is obtained as

NMI(C, K) = I(C; K) / sqrt(H(C) H(K)).   (16)

In this formula, I(C; K) = H(C) − H(C|K) is the mutual information between the two random variables C and K, H(C) is the entropy of variable C, and H(C|K) is the conditional entropy of C given K. In order to obtain reliable results, for each dataset the proposed algorithm is run 10 times, and the average result is reported.
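The RI computation is definitional and can be verified with a few lines:

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand index over all n(n-1)/2 pairwise decisions, as defined above:
    a = pairs together in both partitions, b = pairs apart in both."""
    a = b = 0
    for i, j in combinations(range(len(p1)), 2):
        same1, same2 = p1[i] == p1[j], p2[i] == p2[j]
        a += same1 and same2
        b += (not same1) and (not same2)
    return (a + b) / (len(p1) * (len(p1) - 1) / 2)

true_labels = [0, 0, 1, 1]
pred_labels = [1, 1, 0, 0]            # same partition, permuted label names
print(rand_index(true_labels, pred_labels))  # 1.0
```

Note that RI is invariant to a renaming of cluster labels, which is why permuted labels still score 1.0.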

Experimental Methodology.
In order to evaluate the proposed method, three perspectives have been considered, which evaluate the proposed method from different aspects.
First perspective: five datasets and the two criteria mentioned are used to compare the proposed algorithm with well-known, similar algorithms. Some of the compared algorithms are not designed for document clustering; therefore, the documents are first converted into document-term and similarity-document-term matrices, and the algorithms are then run on these matrices. The results of these runs can be observed in Table 2 and Figure 2. The compared algorithms are: (1) random selection: active learning, neighborhoods, and similar concepts are not used; pairs are randomly selected and presented to the Oracle for a response (this algorithm is usually used as the baseline); (2) PCKmeans [36]; (3) NPU [26]; (4) URASC [38].
Second perspective: one of the differences between the proposed algorithm and state-of-the-art algorithms is the exploration of neighborhoods in the first phase. In forming the neighborhoods, the selection of data with a Cannot-link response from the Oracle is preferred; therefore, the farthest-first and random selection strategies are usually used in the first phase. If the neighborhoods are completed earlier, a greater number of questions remain for the second phase; then, in the second phase, more informative data are selected and more balanced neighborhoods are consolidated, enhancing the accuracy and efficiency of the algorithm. The number of questions to the Oracle for the exploration of neighborhoods, as well as the accuracy and efficiency of the proposed algorithm in the first phase compared to the PCKmeans method, can be observed in Table 3 and Figure 3.
Third perspective: To show the power of the semantic representation of documents, word2vector is used. For evaluation, the proposed algorithm is therefore implemented with and without the word2vector representation. Without word2vector, the similarity matrix is obtained from the inverse of the distance matrix. Figure 4 demonstrates the results of this comparison for two datasets. Table 2 presents the runs of the algorithms in the first perspective. In Figure 2, the y-axis represents the resulting clustering performance for the first perspective (measured by RI), while the x-axis indicates the total number of queries to the Oracle. As mentioned previously, each curve shows the average RI of the proposed and state-of-the-art methods across 10 independent runs.
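The text only says that, without word2vector, the similarity matrix is obtained from the inverse of the distance matrix; one common realization of that mapping (our assumption, not necessarily the paper's exact formula) is s = 1/(1 + d):

```python
def inverse_distance_similarity(dist):
    """Turn a distance matrix into a similarity matrix via s = 1/(1 + d):
    d = 0 maps to similarity 1, and similarity decays toward 0 as the
    distance grows, avoiding division by zero on the diagonal."""
    return [[1.0 / (1.0 + d) for d in row] for row in dist]

print(inverse_distance_similarity([[0.0, 1.0], [1.0, 0.0]]))
```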

Experimental Results and Discussion
At the beginning of the curves, it is observed that the curves are close to each other, since their numbers of informative pairwise constraints are similar. In the middle of each curve, at approximately 20 queries, the curves separate; after 20 queries, all of the methods except the random method reach a significantly high RI. The explanation is that, after approximately 20 queries, highly informative pairwise constraints have been selected, which quickly converges sections A and B of Algorithm 1. Our method shows faster and more robust convergence than the other methods, followed by the URASC and NPU methods, respectively. The number of queries used by the proposed method in the first phase is low; therefore, a large number of queries is saved for the second phase, which is another reason stated in the second perspective. Table 2 also reports an analysis based on the NMI criterion. In this table, NMI follows the same pattern as RI. When the number of queries reaches 15, NMI makes a significant jump, after which the NMI value grows progressively. In our proposed method, the growth of NMI is usually greater than that of the other methods; the reason is the same as for RI, and the same pattern is repeated with some differences. The similarity of the RI and NMI results suggests the reliability and validity of our method. URASC is similar to our proposed algorithm, but it relies on complicated statistical computation; URASC is hard to adapt to document clustering, and its NMI value is sometimes low. The most important point in these results is the balance and stability of the proposed algorithm: in the proposed algorithm, the accuracy of the result usually improves as the number of pairwise constraints increases.
However, in the other algorithms, accuracy and efficiency sometimes decrease as the number of pairwise constraints increases. The reasons for these problems, and the solutions adopted in the proposed algorithm, are stated in Table 3.

In order to explain the second perspective, we analyze Table 4 and Figure 3. In Figure 3, the y-axis shows the resulting clustering performance for the second perspective (measured by RI on the left side and by NMI on the right side), and the x-axis indicates the five mentioned datasets. Each triangle point gives the RI and NMI results of our method, while each rectangle point gives the RI and NMI results of our method with the "farthest-first strategy" in the first phase. We use 30 queries, the average number of queries from the first perspective. As mentioned earlier, we report the average result across 10 independent runs.
As can be seen in Table 4, our proposed algorithm usually asks fewer questions of the Oracle in the neighborhood-exploration phase for each of the five datasets. This reflects the relative superiority of the proposed algorithm in the first phase. To investigate accuracy and efficiency in terms of the two stated criteria, we implement the first phase with two strategies: the "farthest-first strategy," which has been used in published works, and our proposed strategy. For example, NS5, ND6, and SD9 require a low number of queries with our proposed strategy in the first phase. Figure 3 indicates that, for each dataset, our proposed algorithm attains high RI and NMI values. Our method enjoys greater reliability and validity compared to the other methods over datasets with a variety of sizes and class numbers; in particular, it offers better results on datasets with a large size and a large number of classes. The third perspective is an important aspect of the proposed method. The use of word2vector can preserve the semantics and structure of the document. In contrast to other kinds of data, document clustering results depend on the semantic representation because of the unstructured content of documents; therefore, it is necessary to use a semantic representation, which has rarely been used in the other state-of-the-art methods. Compared with traditional semantic representations, the use of deep learning offers better results.
In Figure 4, the y-axis shows the clustering performance obtained for the third perspective (for two datasets), and the x-axis indicates the number of queries. This figure shows that the use of word2vector clearly improves the results. After half of the queries, due to the gradual improvement of the similarity, the curve of SVBPC with word2vector separates significantly. Overall, our proposed method (first and second phases together) offers the best results along with higher efficiency and accuracy. Indeed, the results indicate that the proposed algorithm enhances efficiency, accuracy, and the balance of results as the number of pairwise constraints grows.
To tune coefficients such as α in Algorithm 1, β in equation (15), and the value of the S operator in equation (13), the proposed algorithm was run several times with different values on the main datasets, after which the best coefficients were selected.
α in Algorithm 1: this coefficient smooths between the center of the current pairwise clusters and the center of the neighborhoods. Indeed, we want to blend the Oracle responses (in the form of neighborhoods) with the current pairwise clustering. At the beginning of a run, we found that a large α is better, but after the middle of the run, a small value is best, since the neighborhoods are initially incomplete and only gradually become complete and stable. Finally, we use α = 0.3, since with this value our method converges earlier and offers better accuracy.
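The smoothing above can be read as a convex combination of the two centroids. A minimal sketch, with the assumption (ours, not stated explicitly in the text) that α weights the neighborhood side:

```python
def smoothed_center(cluster_center, neighborhood_center, alpha=0.3):
    """Blend the current cluster centroid with the centroid of the
    Oracle-built neighborhood, componentwise. Which side alpha weights
    is an assumption here; the paper only says the two centers are
    smoothed with coefficient alpha."""
    return [alpha * nc + (1.0 - alpha) * cc
            for cc, nc in zip(cluster_center, neighborhood_center)]

print(smoothed_center([0.0, 0.0], [1.0, 1.0]))  # [0.3, 0.3]
```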
β in equation (15): two types of probability, the similarity and the degree of similarity, are calculated for linking data points to neighborhoods. β determines the contribution of the degree of similarity to the level of uncertainty. If β = 1, the degree of similarity is not used and our method becomes unstable; if β = 0, the similarity probability is not used. At the beginning of a run, a large β is better, as the neighborhoods are not yet complete. For efficiency, we use β = 0.8 at the beginning of the run and then, after half of the number of queries, β = 0.4.
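Under the reading above, β mixes the two probabilities as a convex combination, with a step-down schedule at the halfway point of the query budget. A hedged sketch (function names and the exact mixing form are our assumptions):

```python
def combined_probability(p_similarity, p_degree, beta):
    """Convex combination of the two probabilities: beta = 1 uses only
    the similarity probability, beta = 0 only the degree of similarity."""
    return beta * p_similarity + (1.0 - beta) * p_degree

def beta_schedule(query_index, total_queries):
    """Schedule described in the text: beta = 0.8 for the first half of
    the query budget, then beta = 0.4."""
    return 0.8 if query_index < total_queries / 2 else 0.4

# early in the run, the similarity probability dominates
print(combined_probability(0.6, 0.2, beta_schedule(5, 30)))
```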
S operator in equation (13): the value of this operator is required for obtaining the degree of similarity in equation (13). Based on the values of the d_m matrix, we want to identify weak pairwise relationships between data points, so the values of the matrix are divided into three sections with a histogram threshold. The values in the middle section correspond to a weak relationship between data points, and the operator value is assigned to the pairs in this middle section. At the beginning of a run, a small operator value did not yield the best result, as the neighborhoods were not yet complete. Generally, we use a small value (approximately 0.1) for the operator in the middle section. For the two other sections, with a strong relationship, we use a large value (approximately 0.9). With these values, we obtain stability of accuracy and performance. Using variable operator values in all sections would be better still, though it is time consuming and costly.
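The three-section split can be sketched as a simple thresholding of the d_m values; the default thresholds below (the 1/3 and 2/3 quantiles of the flattened matrix) are our stand-in for the paper's histogram threshold:

```python
def s_operator(d_m, low=None, high=None, weak=0.1, strong=0.9):
    """Split the values of the d_m matrix into three sections with two
    thresholds and assign the S-operator value per section: the middle
    section (weak pairwise relationship) gets a small value, the two
    outer sections (strong relationship) a large one."""
    flat = sorted(v for row in d_m for v in row)
    if low is None:
        low = flat[len(flat) // 3]
    if high is None:
        high = flat[(2 * len(flat)) // 3]
    return [[weak if low < v < high else strong for v in row]
            for row in d_m]

print(s_operator([[0.1, 0.5, 0.9]], low=0.33, high=0.66))
# middle value falls in the weak section, outer values in the strong ones
```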

Conclusion and Future Works
In this paper, document-term and similarity-document-term matrices were first built from the documents; then, in an iterative process, data with high uncertainty were selected for assignment to neighborhoods. To reach informative data, the SVM model, word2vec, neighborhoods, and uncertainty were used in each iteration. The proposed method outperformed the state-of-the-art methods with fewer queries to the Oracle. In the exploration phase, it achieved better results with fewer questions. In the second phase, the use of our proposed strategy together with the uncertainty region balanced the number of data points in each neighborhood. Generally, the obtained results were more balanced, meaning that the accuracy grows as the number of pairwise constraints increases. The reason for this balance is the uncertainty region and the determination of the degree of similarity for linking each data point to a neighborhood. Using the SVM model, initializing the current clustering from pairwise constraints, updating the centroids in each iteration, and using a semantic representation yield a considerable improvement in accuracy.
In future work, heuristic methods can be used to find the parameters of the proposed method, and deep learning tools can further improve the results of the semantic representation. To extend this research, deep learning tools can be applied at any step of constrained document clustering, especially to the similarity matrix and to dimension reduction. In addition, hybrid methods can be used in each iteration instead of a support vector machine.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.