Efficient 2-Step Protocol and Its Discriminative Feature Selections in Secure Similar Document Detection

Secure similar document detection (SSDD) identifies similar documents of two parties while each party does not disclose its own sensitive documents to another party. In this paper, we propose an efficient 2-step protocol that exploits a feature selection as the lower-dimensional transformation and presents discriminative feature selections to maximize the performance of the protocol. For this, we first analyze that the existing 1-step protocol causes serious computation and communication overhead for high dimensional document vectors. To alleviate the overhead, we next present the feature selection-based 2-step protocol and formally prove its correctness. The proposed 2-step protocol works as follows: (1) in the filtering step, it uses low dimensional vectors obtained by the feature selection to filter out non-similar documents; (2) in the post-processing step, it identifies similar documents only from the non-filtered documents by using the 1-step protocol. As the feature selection, we first consider the simplest one, random projection (RP), and propose its 2-step solution SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF (local frequency) which selects a few dimensions locally frequent in the current querying vector and SSDD-GF (global frequency) which selects ones globally frequent in the set of all document vectors. We finally propose a hybrid one, SSDD-HF (hybrid frequency), that takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed 2-step protocol outperforms the 1-step protocol by three or four orders of magnitude.


Introduction
Similar document detection is the problem of finding similar documents of two parties, Alice and Bob, and it has been widely used in version management of files, copyright protection, and plagiarism detection [24,25].
Recently, secure similar document detection (SSDD) [15] has been introduced to identify similar documents while preserving the privacy of each party's documents, as shown in Figure 1. That is, SSDD finds similar document pairs whose cosine similarity [13,26] exceeds the given tolerance while not disclosing document vectors to the other party. SSDD is a typical example of privacy-preserving data mining (PPDM) [1,2,16], and has the following applications [15]. First, in two or more conferences that do not allow double submissions, SSDD finds the double-submitted papers while not disclosing the papers to the other conferences. Second, in an insurance fraud detection system, SSDD searches for similar accident cases of two or more insurance companies while not providing sensitive and private cases to the other companies.
Jiang et al. [15] have proposed a novel solution for SSDD by exploiting secure multiparty computation (SMC) [9,22] in a semi-honest model. Their solution preserves the privacy of two parties by using the secure scalar product in computing cosine similarity between document vectors. As the secure scalar product, they have suggested random matrix and homomorphic encryption methods [12,28]. In this paper, we use the random matrix method as a base protocol, and we call it SSDD-Base.
However, SSDD-Base has a critical problem of incurring severe computation and communication overhead. Let Alice's and Bob's document sets be U and V, respectively; then SSDD-Base requires |U||V| secure scalar products. In many cases, the dimension n of document vectors reaches tens of thousands or even hundreds of thousands, and SSDD-Base incurs a very high complexity of O(n|U||V|), which is not practical for a large volume of document databases. In particular, if there are many parties or frequent changes in document databases, the overhead becomes much more critical.

To alleviate the computation and communication overhead of SSDD-Base, in this paper we present a 2-step protocol that exploits feature selection as a lower-dimensional transformation. The feature selection transforms high dimensional document vectors to low dimensional feature vectors, and in general it selects tens to hundreds of dimensions from thousands to tens of thousands of dimensions. We call the feature selection FS for short. Representative FS methods include RP (random projection) [5], DF (document frequency) [27], and LDA (linear discriminant analysis) [7]. In this paper, we use RP and DF since they are known as simple but efficient feature selections [27]. To devise a 2-step protocol, we need to find an upper bound of cosine similarity for the filtering process. Thus, we first present an upper bound of FS and formally prove its correctness. Using the upper bound property of FS, we then propose a generic 2-step protocol, called SSDD-FS. The proposed SSDD-FS works as follows: in the first filtering step, it converts n-dimensional vectors to f (≪ n)-dimensional vectors and applies the secure protocol to the f-dimensional vectors to filter out non-similar n-dimensional vectors; in the second post-processing step, it applies the base protocol SSDD-Base to the non-filtered n-dimensional vectors. In the filtering step, SSDD-FS prunes many non-similar high dimensional vectors by comparing low dimensional vectors with the relatively low complexity of O(f|U||V|), and thus, it significantly improves the performance compared with SSDD-Base.
To make SSDD-FS efficient, FS should be highly discriminative, i.e., FS should filter out as many non-similar high dimensional vectors as possible. In this paper, we analyze SSDD protocols in detail and propose four different techniques as discriminative implementations of FS. We can first think of RP as the easiest way of implementing FS. RP randomly selects f dimensions from n dimensions. RP is easy, but its filtering effect will be very low due to the randomness. To solve the problem of RP, we exploit DF, which selects feature dimensions based on frequencies in all document vectors. In particular, by referring to the concept of DF, we present three variants of DF, called LF (local frequency), GF (global frequency), and HF (hybrid frequency). First, LF considers term frequencies of Alice's current querying vector (we call it the current vector), and it selects the dimensions whose frequencies are higher than the others in the current vector. LF focuses on the locality, which means that considering the current vector only might be enough to decrease the upper bound of cosine similarity. Second, GF means DF itself, that is, GF counts the number of documents containing each term (dimension), constructs a frequency vector from those counts (we call it the whole vector), and selects high frequency dimensions from the whole vector. GF focuses on the globality since it considers all the document vectors. To implement GF, however, we need to make a secure protocol for obtaining the whole vector from both Alice's and Bob's document sets. For this, we propose a secure protocol SecureDF as a secure implementation of DF. Third, HF takes advantage of both the locality of LF and the globality of GF. HF computes a difference vector between the current and whole vectors and selects high-valued dimensions from the difference vector. This is because HF tries to maximize the value difference between Alice's and Bob's vectors for each selected dimension and eventually decrease the upper bound of cosine similarity. Table 1 summarizes these four feature selections and their corresponding SSDD protocols, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF, to be proposed in Section 4.

Table 1: Feature selection methods to be used for SSDD-FS.
Method     Description
SSDD-RP    Randomly select f dimensions from an n-dimensional vector.
SSDD-LF    Select highly frequent f dimensions from Alice's n-dimensional current vector.
SSDD-GF    Select highly frequent f dimensions from the n-dimensional whole vector.
SSDD-HF    Select high-valued f dimensions from the n-dimensional difference vector between current and whole vectors.
In this paper, we empirically evaluate the base protocol, SSDD-Base, and our four SSDD-FS protocols (SSDD-RP, SSDD-LF, SSDD-GF, SSDD-HF) using various data sets. Experimental results show that the SSDD-FS protocols significantly outperform SSDD-Base. This means that the proposed 2-step protocols effectively prune a large number of non-similar documents early in the filtering step. In particular, SSDD-HF, which takes advantage of both the locality of SSDD-LF and the globality of SSDD-GF, shows the best performance. Compared with SSDD-Base, SSDD-HF reduces the execution time of SSDD by three or four orders of magnitude.
The rest of this paper is organized as follows. Section 2 explains related work and background of the research. Section 3 presents the FS-based 2-step protocol, SSDD-FS, and proves its correctness. Section 4 introduces four feature selections, RP, LF, GF, and HF, and proposes their corresponding secure protocols. Section 5 explains experimental results on various data sets. We finally summarize and conclude the paper in Section 6.

Related Work and Background
We use cosine similarity as the basic operation of similar document detection. The cosine similarity of two vectors $\vec{U}$ and $\vec{V}$ is computed as $\cos(\vec{U}, \vec{V}) = \frac{\vec{U} \cdot \vec{V}}{\|\vec{U}\| \|\vec{V}\|}$. Thus, if we can compute the scalar product $\vec{U} \cdot \vec{V}$ securely in two parties, we can also compute $\cos(\vec{U}, \vec{V})$ securely. There are two representative methods for the secure scalar product [15]. The first one is the random matrix method [28], where two parties share the same random matrix and compute the scalar product securely using the matrix. The second one is the homomorphic encryption method [12], where two parties use the homomorphic probability key system for the secure computation of scalar products. In this paper, we use the random matrix method since it is more efficient than the homomorphic encryption one, but the homomorphic encryption method can also be used instead for the protocols to be discussed later. Without loss of generality, we assume that vectors $\vec{U}$ and $\vec{V}$ are normalized to size 1. That is, $\|\vec{U}\| = \|\vec{V}\| = 1$, and thus, simply $\cos(\vec{U}, \vec{V}) = \vec{U} \cdot \vec{V}$.
Figure 2 shows the protocol of SSDD-Base, the recent solution of SSDD by Jiang et al. [15]. SSDD-Base uses the random matrix method [28] for secure scalar products, where Alice and Bob share the same matrix A and securely determine whether two vectors $\vec{U}$ and $\vec{V}$ are similar or not. For the correctness and detailed explanation of Protocol SSDD-Base, readers are referred to [15]. In SSDD, we perform SSDD-Base for each pair of document vectors. More formally, if U and V are sets of document vectors owned by Alice and Bob, respectively, we perform SSDD-Base for each pair ($\vec{U}$, $\vec{V}$) with $\vec{U} \in$ U and $\vec{V} \in$ V. As we mentioned in Section 1, however, SSDD-Base incurs the severe computation and communication overhead of O(n|U||V|), which becomes much more serious if there are several parties or if a large number of documents change dynamically. To alleviate this critical overhead, in this paper we discuss the 2-step solution for SSDD.
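The reduction of cosine similarity to a scalar product for unit-length vectors can be illustrated in the clear, without any secure computation. The following is a minimal sketch; the function names `normalize` and `cosine_of_unit_vectors` are hypothetical and only serve the illustration.

```python
import math


def normalize(vec):
    """Scale a term-frequency vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]


def cosine_of_unit_vectors(u, v):
    """For unit-length vectors, the cosine similarity is the scalar product."""
    return sum(a * b for a, b in zip(u, v))


# Two identical documents and one document sharing no term with them.
u = normalize([3.0, 0.0, 4.0])
v = normalize([3.0, 0.0, 4.0])
w = normalize([0.0, 5.0, 0.0])

same = cosine_of_unit_vectors(u, v)      # close to 1.0
disjoint = cosine_of_unit_vectors(u, w)  # exactly 0.0
```

In SSDD the scalar product itself would be obtained through the secure scalar product protocol rather than computed directly as above.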
In text mining and time-series mining, many lower-dimensional transformations have been proposed to solve the dimensionality curse problem [3,20] of high dimensional vectors. We can classify lower-dimensional transformations into feature extractions and feature selections [23,27]. First, the feature extraction creates a few new features from an original high dimensional vector. Representative examples of feature extractions include LSI (latent semantic indexing) [10,30], LPI (locality preserving indexing) [6], DFT (discrete Fourier transform) [11,17,21], DWT (discrete wavelet transform) [8,18], and PAA (piecewise aggregate approximation) [14,31]. In contrast, the feature selection selects a few discriminative features from an original (or transformed) high dimensional vector. Representative examples of feature selections include RP, DF, LDA, and PCA (principal component analysis) [5,7,27]. In this paper, we use RP and DF with appropriate variations. This is because RP and DF are much simpler than other transformations, and accordingly, they are easily applied to SSDD with low complexity; on the other hand, LSI, LPI, LDA, and PCA may provide very accurate feature vectors, but they are too complex to be applied to SSDD. For the detailed explanation of lower-dimensional transformations for text mining, readers are referred to [23,27,30].
There have been many efforts on PPDM [4]. PPDM solutions can be classified into four categories: data perturbation, k-anonymization, distributed privacy preservation, and privacy preservation of mining results [19]. SSDD can be regarded as an application of distributed privacy preservation. For the detailed explanation of problems and solutions of data perturbation and k-anonymization, readers are referred to survey papers [1,4].

Feature Selection-based Secure 2-Step Protocol
In this paper, we use FS (feature selection) for the secure 2-step protocol. To transform an n-dimensional vector to an f-dimensional vector, FS chooses f dimensions from n dimensions, either randomly or by high frequency, and thus, its transformation process is very simple. In this section, we first assume that FS can select f dimensions from n dimensions in a secure manner, and we then propose the secure 2-step protocol of SSDD by using the secure FS.
To use a lower-dimensional transformation F for SSDD, we need to find an upper bound function upper($\vec{U}^F$, $\vec{V}^F$) that satisfies Eq. (1), where $\vec{U}^F$ and $\vec{V}^F$ are the vectors transformed from $\vec{U}$ and $\vec{V}$, respectively, by the transformation F.

$$\cos(\vec{U}, \vec{V}) \le \mathrm{upper}(\vec{U}^F, \vec{V}^F) \qquad (1)$$

The reason why the transformation F should satisfy Eq. (1) is that SSDD using F should not incur any false dismissal; this is analogous to the lower bound property of Euclidean distances, derived from Parseval's theorem, in time-series matching [11,18,20]. To obtain an upper bound of the lower-dimensional transformation F, we first define an upper bound of F as follows.
, respectively, we define an upper bound function of F , denoted by where In this paper, we want to use FS as a lower-dimensional transformation F , and thus, we formally prove that the upper bound function of FS satisfies Eq. ( 1), the upper bound property of cosine similarity.
PROOF: First, let Then, Eqs. ( 4) and ( 5) hold for We note that all entry values of − → U and − → V are non-negative, and FS constructs Based on this property, Eq. ( 6) holds.

Therefore, upper($\vec{U}^{FS}$, $\vec{V}^{FS}$) satisfies Eq. (1), the upper bound property of cosine similarity.
By using the upper bound property of FS, we now propose a generic 2-step protocol SSDD-FS.
After that, Alice computes the upper bound function of FS, upper($\vec{U}^{FS}$, $\vec{V}^{FS}$), in Line 6. In Line 7, we perform the filtering process by comparing the upper bound (= υ) and the given tolerance (= ε). If the upper bound is less than the tolerance, i.e., if υ < ε, the actual cosine similarity will also be less than the tolerance, and we do not need to compute it in the n-dimensional space. That is, if υ < ε, we can skip Line 8 of the second step. Thus, Line 8 is executed only if the n-dimensional vectors of ($\vec{U}$, $\vec{V}$) are not filtered out by the upper bound. In Line 8, we compute the actual cosine similarity for ($\vec{U}$, $\vec{V}$) by using SSDD-Base.
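Why the test in Line 7 cannot discard a similar pair can be illustrated with one concrete bound. The following is a sketch under our own assumption that the bound is taken from the Euclidean distance restricted to the selected dimensions; the index set $S$ of the $f$ selected dimensions and this particular form are illustrative and are not claimed to be the exact definition used by the protocol. For unit-length vectors,

$$
\cos(\vec{U}, \vec{V}) = \vec{U} \cdot \vec{V}
= 1 - \frac{1}{2}\,\|\vec{U} - \vec{V}\|^2
= 1 - \frac{1}{2} \sum_{i=1}^{n} (u_i - v_i)^2
\le 1 - \frac{1}{2} \sum_{i \in S} (u_i - v_i)^2 .
$$

The inequality holds because the dropped terms $(u_i - v_i)^2$ for $i \notin S$ are non-negative. Hence, whenever the right-hand side is below ε, the actual cosine similarity is also below ε, so the filtering step causes no false dismissal. The bound also becomes smaller as the differences $|u_i - v_i|$ on the selected dimensions grow, which is why the choice of $S$ matters for the filtering effect.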
We here note that how much SSDD-FS improves the performance compared with SSDD-Base depends on how many n-dimensional vectors are discarded in the first step. This filtering effect largely depends on the discriminative power of the feature selection, i.e., the efficiency of FS. In other words, if FS exploits the filtering effect largely, SSDD-FS can reduce the computation and communication overhead from O(n|U||V|) to O(f|U||V|). Based on this observation, we need to maximize the filtering effect of FS, and this can be seen as a problem of how we choose f dimensions from n dimensions to maximize the discriminative power of FS. Therefore, we propose efficient FS variants and their SSDD protocols in Section 4 and evaluate their performance in Section 5.
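The dependence of the cost on the filtering effect can be seen in a small non-secure simulation of the two steps. This is only a sketch: all computations are done in the clear (the secure scalar product is not modeled), the helper names are hypothetical, and the bound in `upper_bound` is an assumed one, namely one minus half the squared distance on the selected dimensions for unit-length vectors, which may differ from the bound defined in this section.

```python
def dot(u, v):
    """Scalar product; equals the cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(vec, indexes):
    """Keep only the selected f dimensions of an n-dimensional vector."""
    return [vec[i] for i in indexes]


def upper_bound(u_fs, v_fs):
    """Assumed bound: 1 - 0.5 * squared distance on the selected dimensions."""
    return 1.0 - 0.5 * sum((a - b) ** 2 for a, b in zip(u_fs, v_fs))


def two_step_search(u, bob_vectors, indexes, eps):
    """Return indexes of Bob's similar vectors and the number of full checks."""
    u_fs = project(u, indexes)
    similar, full_checks = [], 0
    for k, v in enumerate(bob_vectors):
        # 1st step: filter in the f-dimensional space.
        if upper_bound(u_fs, project(v, indexes)) < eps:
            continue
        # 2nd step: exact check in the n-dimensional space.
        full_checks += 1
        if dot(u, v) >= eps:
            similar.append(k)
    return similar, full_checks


u = [1.0, 0.0, 0.0, 0.0]
bob = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.6, 0.8, 0.0, 0.0]]
similar, full_checks = two_step_search(u, bob, indexes=[0], eps=0.8)
```

In this toy run one of the three pairs is pruned in the first step, so only two n-dimensional comparisons remain; a more discriminative choice of `indexes` would prune more.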

Discriminative Feature Selections for the 2-Step Protocol
In this section, we propose four methods to implement FS of Protocol SSDD-FS.

RP: Random Projection
The second RP method uses the same f dimensions for all ($\vec{U}$, $\vec{V}$) pairs. We can easily implement this method since Alice and Bob share the same f indexes only once before starting SSDD-FS. These first and second RP methods do not disclose any values of Alice's and Bob's document vectors, and thus, they are said to be secure. Also, these two methods have the same effect in selecting f dimensions randomly. Thus, we use the second one since it is much simpler than the first one, and we call the second one SSDD-RP to differentiate it from SSDD-FS.
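Sharing one random index set before the protocol starts can be sketched as follows. This is a minimal illustration, assuming that Alice draws the f indexes once and sends the list to Bob; the function names and the use of a seeded generator are our own choices for reproducibility and are not prescribed by the protocol.

```python
import random


def choose_rp_indexes(n, f, seed):
    """Draw f distinct dimension indexes out of n, once for the whole run."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), f))


def project(vec, indexes):
    """Build the f-dimensional feature vector from the shared indexes."""
    return [vec[i] for i in indexes]


n, f = 10, 3
shared_indexes = choose_rp_indexes(n, f, seed=42)  # chosen once by Alice, sent to Bob

alice_vec = [float(i) for i in range(n)]
bob_vec = [float(n - i) for i in range(n)]
alice_fs = project(alice_vec, shared_indexes)
bob_fs = project(bob_vec, shared_indexes)
```

Only index positions are exchanged in this sketch; no entry value of either vector leaves its owner.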

LF: Local Frequency
SSDD-RP proposed in Section 4.1 has the problem of exploiting only a small filtering effect in the first filtering step. This low filtering effect arises because RP chooses features without any consideration of the characteristics of document vectors. According to real experiments, SSDD-RP shows very little improvement in SSDD performance compared with SSDD-Base. To solve the problem of SSDD-RP and to enlarge the filtering effect, in this paper we consider how frequent each term is in the document or document set, i.e., we use the term frequency (TF). In general, we use the TF concept as follows: we first compute the number of occurrences (i.e., frequency) of each term throughout the whole data set and then choose the highly frequent dimensions. We call this selection method DF (document frequency) as in [27]. The reason why we consider TF (or DF) in SSDD-FS is that, if we select the highly frequent f dimensions, we can obtain relatively small upper bounds by Eq. (2), and accordingly, we can exploit the filtering effect largely.
As a feature selection using term frequencies, we first consider how frequent each term is in an individual document rather than the whole document set, that is, we first propose the feature selection exploiting the locality of each document. More precisely, for a pair of documents ($\vec{U}$, $\vec{V}$), the locality-based selection chooses the f dimensions that are highly frequent in Alice's current vector $\vec{U}$. This selection is based on the simple intuition that, even without considering the whole vectors of the document set, the current vector itself will have a big influence on the upper bound upper($\vec{U}$, $\vec{V}$). In this selection, we can instead use Bob's vector $\vec{V}$ rather than Alice's vector $\vec{U}$ as the current vector, or we can also use both Alice's and Bob's vectors $\vec{U}$ and $\vec{V}$. Using $\vec{V}$, however, incurs additional communication overhead, and thus, in this paper we consider the simple method of using Alice's $\vec{U}$ as the current vector. We call this selection method LF (local frequency) since it considers individual (i.e., local) documents rather than whole documents, and we denote the protocol of applying LF to SSDD-FS as SSDD-LF.
SSDD-LF exploits the locality by selecting f dimensions anew for each current vector before its comparisons start.

Figure 6: SSDD-LF, obtained by modifying Line (1) of SSDD-FS.
(1-2) Alice chooses f dimensions, $i_1, \ldots, i_f$ ($1 \le i_j \le n$), whose TFs are larger than those of the other dimensions.
(1-3) Alice sends those f indexes to Bob. // This can be done together with Line 3 of Figure 2.
(1-4) Alice and Bob obtain f-dimensional feature vectors by using the same f indexes.

Second, Alice and Bob need additional communication overhead to share the f indexes. However, this communication can be done with Line (3) of SSDD-Base of Figure 2, that is, Alice can send the f indexes together with the encrypted vector $\vec{Z}$ to Bob. The amount of the f indexes is much smaller than that of the n-dimensional vector, and the overhead of the f indexes is negligible. Thus, we can say that SSDD-LF causes the computation overhead of O(n log f), but the communication overhead can be ignored. In particular, we compare each vector $\vec{U}$ of Alice with a large number of vectors $\vec{V}$ (∈ V) of Bob, and thus, the computation overhead of O(n log f) can also be ignored as a pre-processing step.
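The O(n log f) cost of choosing the top f dimensions corresponds to scanning the n entries once while keeping a heap of size f. A minimal sketch using the bounded-heap selection of Python's standard library is given below; the function name `top_f_indexes` and the toy vector are hypothetical.

```python
import heapq


def top_f_indexes(vec, f):
    """Indexes of the f largest entries of vec, selected with a heap of size f."""
    best = heapq.nlargest(f, range(len(vec)), key=vec.__getitem__)
    return sorted(best)


# Alice's current vector (term frequencies after normalization, toy values).
current = [0.1, 0.7, 0.0, 0.5, 0.2]
lf_indexes = top_f_indexes(current, 2)  # the two most frequent terms
```

Only `lf_indexes` would be sent to Bob in Line (1-3); the entry values of `current` stay with Alice.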
Another point to consider in SSDD-LF is whether its feature selection process is secure or not. That is, there should be no privacy disclosure when Alice selects f indexes and shares them with Bob. Fortunately, Alice sends only the indexes $i_j$ to Bob rather than the entry values $u_{i_j}$ of $\vec{U}$, and the sensitive values $u_{i_j}$ are not disclosed in the selection process. Unfortunately, however, the information about which f dimensions are frequent in $\vec{U}$ is revealed to Bob. If the user cannot allow even this limited disclosure of information, s/he cannot use SSDD-LF as a secure protocol. In this case, we recommend using the previous SSDD-RP or the following SSDD-GF or SSDD-HF as a more secure protocol.

GF: Global Frequency
We call the feature selection that exploits the globality of the whole document set GF (global frequency) and denote the GF-based secure protocol as SSDD-GF. Actually, GF is the same as DF, which has been widely used as the representative feature selection, and it works as follows. First, let $\vec{A} = \{a_1, \ldots, a_n\}$ be the whole vector and $a_k$ be the number of documents containing the k-th term, that is, $a_k$ is the DF value of the k-th term. Then, to reduce the number of dimensions from n to f, GF simply selects the f dimensions whose DF values are larger than those of the other (n − f) dimensions. We can get the whole vector by scanning all the document vectors once. The traditional DF constructs the whole vector based on the assumption that all the document vectors are maintained in a single computer. In SSDD, however, document vectors are distributed between Alice and Bob, and they do not want to provide their own vectors to each other. Thus, to use GF in SSDD, we first need to present a secure protocol for constructing the whole vector from the document vectors stored separately by Alice and Bob.
We now explain SSDD-GF, which exploits SecureDF as the feature selection. Figure 8 shows how we modify Line (1) of Figure 3 for converting SSDD-FS to SSDD-GF. In Line (1-0), we first perform SecureDF to obtain the whole vector $\vec{A}$ and determine the f indexes which are most frequent in $\vec{A}$. Here, U is the set of n-dimensional vectors owned by Alice, and V is the set of n-dimensional vectors owned by Bob. Alice and Bob then obtain the f-dimensional feature vectors by using the determined f indexes. As shown in Figure 8, the current vectors and even their term frequencies are not disclosed to each other, and thus, we can say that SSDD-GF is a secure protocol of SSDD.
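What the whole vector contains can be illustrated with a non-secure computation. The sketch below is not SecureDF: it simply adds the two parties' local document-frequency counts in the clear, whereas the actual protocol must obtain the same sums without revealing either party's counts. All function names and the toy documents are hypothetical.

```python
import heapq


def local_df(doc_vectors, n):
    """Count, for each of the n terms, how many local documents contain it."""
    counts = [0] * n
    for vec in doc_vectors:
        for k, value in enumerate(vec):
            if value > 0:
                counts[k] += 1
    return counts


def whole_vector(alice_counts, bob_counts):
    """DF over both document sets is the sum of the two local DF vectors."""
    return [a + b for a, b in zip(alice_counts, bob_counts)]


def gf_indexes(whole, f):
    """Indexes of the f terms with the largest document frequency."""
    return sorted(heapq.nlargest(f, range(len(whole)), key=whole.__getitem__))


alice_docs = [[1, 0, 2], [0, 0, 1]]
bob_docs = [[1, 1, 0], [0, 0, 3], [2, 0, 1]]
whole = whole_vector(local_df(alice_docs, 3), local_df(bob_docs, 3))
selected = gf_indexes(whole, 2)
```

Because the selected indexes depend only on the aggregated counts, they are the same for every pair of documents and need to be determined only once.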

HF: Hybrid Frequency
LF and GF proposed in Sections 4.2 and 4.3 have the following characteristics from the viewpoint of the filtering effect. First, LF considers Alice's current vector $\vec{U}$ only, and thus, the filtering effect will be large only for the part of Bob's vectors whose TF patterns differ much from the current vector, but the effect is less exploited for most of the other vectors. In other words, LF can exploit a better filtering effect than GF when Alice's current vector differs considerably from the whole vector in TF patterns. Second, GF considers the whole vector $\vec{A}$ obtained by SecureDF without considering the current vector, and it thus can exploit the filtering effect relatively evenly on many of Bob's document vectors. That is, GF can exploit a better filtering effect than LF when Alice's current vector has characteristics similar to those of the whole vector in TF patterns.
To take advantage of both the locality of LF and the globality of GF, we now propose a hybrid feature selection, called HF (hybrid frequency). That is, HF uses the current vector for exploiting the locality of LF, and at the same time it also uses the whole vector for exploiting the globality of GF. We then present an advanced secure protocol SSDD-HF by applying HF to SSDD-FS. Simply speaking, HF compares the current and whole vectors and selects the feature dimensions whose differences are larger than those of the other dimensions. In more detail, we select feature dimensions which have one of the following two characteristics: (1) the dimensions which frequently occur in Alice's current vector but seldom occur in the whole vector (i.e., whose values are relatively large in the current vector but relatively small in the whole vector); or on the contrary, (2) the dimensions which seldom occur in Alice's current vector but frequently occur in the whole vector. This is because the larger $|u_i^F - v_i^F|$ (= the difference between values of the selected feature dimension) is, the smaller the upper bound of Eq. (2) becomes, which exploits the larger filtering effect.
However, we cannot directly compare Alice's current vector $\vec{U}$ and the whole vector $\vec{A}$ obtained by SecureDF. The reason is that $\vec{U}$ represents "frequencies of terms" in a single vector while $\vec{A}$ represents "frequencies of documents" containing those terms. That is, the meaning of frequencies in $\vec{U}$ differs from that of $\vec{A}$, and thus, their scales are also different. To resolve this problem, before comparing the two vectors $\vec{U}$ and $\vec{A}$, we first normalize them using their mean (= µ) and standard deviation (= σ). More precisely, we first normalize $\vec{U}$ and $\vec{A}$ to $\bar{U}$ and $\bar{A}$ by Eq. (9), and we next obtain the difference vector $\vec{D} = \{d_1, \ldots, d_n\}$ with $d_i = |\bar{u}_i - \bar{a}_i|$.

$$\bar{u}_i = \frac{u_i - \mu_U}{\sigma_U}, \qquad \bar{a}_i = \frac{a_i - \mu_A}{\sigma_A} \qquad (9)$$

After that, we select the largest f dimensions from $\vec{D}$ and use them as the features of SSDD-HF.

Protocol SSDD-HF (modification of Line (1) of SSDD-FS):
(1-5) Alice sends those f indexes to Bob. // This can be done together with Line 3 of Figure 2.

As in SSDD-LF, the overhead of obtaining the difference vector and choosing f dimensions from the n-dimensional vector can be ignored since it can also be seen as the pre-processing step. One more notable point is that SSDD-HF is a secure protocol like SSDD-GF since it uses SecureDF and the difference vector, which are secure and do not disclose any original values or any sensitive indexes of individual vectors.
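The normalize-then-compare selection of HF can be sketched as follows. This is an illustration under stated assumptions: the normalization is taken to be the usual z-score with the population standard deviation, the difference is taken as an absolute value so that both kinds of dimensions described above are captured, and the whole vector is assumed to be already available to Alice. The function names and toy values are hypothetical.

```python
import heapq
import statistics


def z_normalize(vec):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(vec)
    sigma = statistics.pstdev(vec)
    if sigma == 0.0:
        return [0.0] * len(vec)  # a constant vector carries no contrast
    return [(x - mu) / sigma for x in vec]


def hf_indexes(current, whole, f):
    """Indexes of the f dimensions where the normalized vectors differ most."""
    cur_n = z_normalize(current)
    whole_n = z_normalize(whole)
    diff = [abs(a - b) for a, b in zip(cur_n, whole_n)]
    return sorted(heapq.nlargest(f, range(len(diff)), key=diff.__getitem__))


current = [5, 0, 1, 0]  # term frequencies in Alice's current vector
whole = [1, 9, 1, 1]    # document frequencies in the whole vector
selected = hf_indexes(current, whole, 2)
```

In this toy case, term 0 is frequent in the current vector but rare overall, and term 1 is the opposite, so both are selected.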

Experimental Data and Environment
In this section, we empirically evaluate the feature selection-based SSDD protocols proposed in Section 4. As the experimental data, we use three datasets obtained from the document sets of the UCI repository [29]: KOS, NIPS, and EMAILS. We experiment with five SSDD protocols: SSDD-Base as the basic one and the four proposed ones, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF. In the experiments, we basically measure the elapsed time of executing SSDD for each protocol. In the first experiment, we vary the number of dimensions for a fixed tolerance, where the number of dimensions means f, i.e., the number of features (dimensions) selected by the feature selection. In the second experiment, we vary the tolerance for a fixed number of dimensions. For these two experiments, we use KOS and NIPS, which have a relatively small number of documents compared with EMAILS. On the other hand, the third experiment tests the scalability of each protocol, and we thus use EMAILS, whose number of documents is much larger than those of KOS and NIPS.
The hardware platform is an HP ProLiant ML110 G7 workstation equipped with an Intel(R) Xeon(R) Quad Core CPU E31220 3.10GHz, 16GB RAM, and a 250GB HDD; its software platform is CentOS 6.5 Linux. We use the C language for implementing all the protocols. We perform SSDD on a single machine using a local loop for network communication. The reason why we use the local loop is that we want to intentionally ignore the network speed since different network speeds or environments may largely distort the actual execution time of each protocol. We measure the execution time spent while Alice sends each document to Bob and identifies its similarity securely. More precisely, we store the whole dataset in Bob and select ten query documents for Alice. After that, we execute each SSDD protocol for those ten query documents and use the sum of their execution times as the experimental result.

Experimental Results
Figure 10 shows the experimental results for KOS. First, in Figure 10(a), we set the tolerance to 0.80 and vary the number of dimensions to 70, 210, 350, 490, and 640, which correspond to 1%, 3%, 5%, 7%, and 9% of the KOS dimensions. As shown in the figure, the x axis shows the number of (selected) dimensions, and the y axis shows the actual execution time. Note that the y axis is a log scale. Figure 10(a) shows that all proposed protocols significantly outperform the basic SSDD-Base. Even SSDD-RP, which selects features randomly, beats SSDD-Base by exploiting the filtering effect in the first step of the 2-step protocol. Next, SSDD-GF shows better performance than SSDD-RP since it selects the features that occur frequently throughout the whole dataset by using DF. In the case of SSDD-RP and SSDD-GF, we note that, as the number of dimensions increases, the execution time decreases. This is because the more dimensions we use, the larger the filtering effect we can exploit. SSDD-LF, which uses the locality of the current vector, also outperforms SSDD-RP as well as SSDD-Base. In particular, SSDD-LF is better than SSDD-GF for a small number of dimensions, but it is worse than SSDD-GF for a large number of dimensions. This is because only a small number of dimensions have a big influence on the locality of the current vector. Finally, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF, shows the best performance for all dimensions. In Figure 10(a), we note that the execution time of SSDD-LF and SSDD-HF slightly increases as the number of dimensions increases. The reason is that, as the number f of dimensions increases, the filtering effect increases relatively slowly, but the overhead of obtaining a current/difference vector and choosing f dimensions from that vector increases relatively quickly.
Second, in Figure 10(b), we set the number of dimensions to 70 (1% of the total dimensions) and vary the tolerance from 0.95 to 0.75 in decrements of 0.05. Note that the closer to 1.0 the tolerance is, the stronger the similarity we require. As shown in the figure, all proposed protocols significantly improve the performance compared with SSDD-Base. In particular, SSDD-LF and SSDD-HF, which exploit the locality, show better performance than the other two proposed ones. We here note that, as the tolerance decreases, the execution times of all proposed protocols gradually increase. This is because the smaller the tolerance we use, the more documents we get as similar ones. That is, as the tolerance decreases, more documents pass the first step, and thus, more time is spent in the second step. In summary of Figure 10, the proposed SSDD-LF and SSDD-HF significantly outperform SSDD-Base by up to 726.6 and 9858 times, respectively.
Figure 12 shows the results of the scalability test using a large volume of high dimensional data, EMAILS. We set the tolerance and the number of dimensions to 0.80 and 70, respectively, and we increase the number of documents (emails) from 40 (0.1%) to 39,861 (100%) by factors of 10. In this experiment, we exclude the results of SSDD-Base, SSDD-RP, and SSDD-GF for the case of 39,861 documents due to excessive execution time. As shown in the figure, like the results of KOS and NIPS, our feature selection-based protocols outperform SSDD-Base in all cases, and in particular, SSDD-LF and SSDD-HF show the best performance regardless of the number of documents. We also note that all proposed protocols show a pseudo linear trend on the number of documents. (Please note that both the x and y axes are log scales.) That is, the protocols are pseudo linear solutions on the number of documents, and we can say that they are excellent in scalability.

Conclusions
In this paper, we addressed an efficient method of significantly reducing computation and communication overhead in secure similar document detection. Contributions of the paper can be summarized as follows. First, we thoroughly analyzed the previous 1-step protocol and pointed out that it incurred serious performance overhead for high dimensional document vectors. Second, to alleviate the overhead, we presented the feature selection-based 2-step protocol and formally proved its correctness. Third, to improve the filtering efficiency of the 2-step protocol, we proposed four feature selections: (1) RP, which selects features randomly, (2) LF, which exploits the locality of a current vector, (3) GF, which exploits the globality of all document vectors, and (4) HF, which considers both locality and globality. Fourth, for each feature selection, we presented its formal protocol and analyzed its security and overhead. Fifth, through experiments on three real datasets, we showed that all proposed protocols significantly outperformed the base protocol, and in particular, the HF-based secure protocol improved performance by up to three or four orders of magnitude. As future work, we will consider two issues: (1) use of feature extraction (feature creation) instead of feature selection for dimensionality reduction and (2) use of homomorphic encryption rather than the random matrix for the secure scalar product.

Figure 1: Concept of secure similar document detection.
Figure 3 shows Protocol SSDD-FS. As shown in the protocol, SSDD-FS maintains the f-dimensional $\vec{U}_{FS}$ and $\vec{V}_{FS}$ as well as the n-dimensional $\vec{U}$ and $\vec{V}$ of SSDD-Base. Also, Alice and Bob share an f × f/2 matrix $A_{FS}$ as well as the n × n/2 matrix $A$ of SSDD-Base. Lines 1 to 7 of SSDD-FS are the first step, which discards non-similar n-dimensional vectors in the f-dimensional space. First, Lines 1 to 4 securely compute the scalar product δ for the f-dimensional vectors $\vec{U}_{FS}$ and $\vec{V}_{FS}$. Except for using f-dimensional vectors instead of n-dimensional vectors, these steps are the same as those of SSDD-Base. The only difference from SSDD-Base is that Bob additionally sends

Figure 4 shows the procedure of SSDD-FS including the feature selection step. As shown in the figure, we first obtain $\vec{U}_{FS}$ and $\vec{V}_{FS}$ from $\vec{U}$ and $\vec{V}$ through the feature selection, which should also be done securely. As mentioned in Section 1, we present RP, LF, GF, and HF as the feature selection methods, and we explain how they work

Protocol SSDD-FS
(1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors; f-dimensional feature vectors.

Execute Lines 4 to 6 of SSDD-Base for
if υ < ε then discard the pair ($\vec{U}$, $\vec{V}$) as a non-similar one; // υ = upper bound
[2nd step: the post-processing step in the n-dimensional space]
Alice and Bob:
8. Execute Lines 1 to 9 of SSDD-Base if ($\vec{U}$, $\vec{V}$) is not discarded in the 1st step;
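The filter-then-verify structure of SSDD-FS can be illustrated with a minimal plaintext sketch. It deliberately omits the secure scalar product, so it shows only the control flow of the two steps, not the privacy mechanism. The upper bound used here, (δ + |U_rest||V_rest|) / (|U||V|), follows from the Cauchy-Schwarz inequality and is an assumption of this sketch; it is not necessarily the bound used in the protocol. The function names are hypothetical, and the vectors are assumed to be non-zero.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(u, v):
    # Assumes non-zero vectors.
    return dot(u, v) / (norm(u) * norm(v))

def upper_bound(u, v, idx):
    """Upper bound of cos(u, v) that uses exact products only on idx.

    Assumed bound (Cauchy-Schwarz on the unselected dimensions):
    u.v <= u_fs.v_fs + |u_rest| * |v_rest|
    """
    chosen = set(idx)
    delta = dot([u[i] for i in idx], [v[i] for i in idx])
    u_rest = [x for i, x in enumerate(u) if i not in chosen]
    v_rest = [x for i, x in enumerate(v) if i not in chosen]
    return (delta + norm(u_rest) * norm(v_rest)) / (norm(u) * norm(v))

def two_step_similar(u, v, idx, eps):
    # 1st step (filtering): discard the pair if the upper bound is below eps.
    if upper_bound(u, v, idx) < eps:
        return False
    # 2nd step (post-processing): exact similarity in the n-dimensional space.
    return cosine(u, v) >= eps
```

Because the bound never underestimates the true cosine similarity, a pair discarded in the first step cannot be a similar pair, which is the property the filtering step relies on.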

Figure 6 shows how we implement SSDD-LF by modifying Line (1) of SSDD-FS of Figure 3. In Line (1-2), Alice first selects the top f frequent dimensions from her current vector $\vec{U}$. She sends the indexes of the selected f dimensions to Bob in Line (1-3). Thus, they share the same indexes and obtain f-dimensional feature vectors by using the same f indexes in Line (1-4). We now analyze the computation and communication overhead of feature selection in SSDD-LF. As shown in Figure 6, for each vector $\vec{U}$, Alice (1) chooses the top f frequent dimensions from n dimensions dimensional document vectors.
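The local-frequency selection described above can be sketched as follows. This is an illustrative plaintext version under stated assumptions: the function names are hypothetical, "frequent" is taken to mean the largest term counts in Alice's current vector, and ties are broken by the lower index, which the text does not specify.

```python
def select_local_frequent(u, f):
    """Return the indexes of the f largest entries of Alice's current vector u.

    Tie-breaking by lower index is an assumption of this sketch.
    """
    order = sorted(range(len(u)), key=lambda i: (-u[i], i))
    return sorted(order[:f])

def project(vec, idx):
    """Build the f-dimensional feature vector from the shared indexes."""
    return [vec[i] for i in idx]

# Alice selects the indexes from her own vector and sends only the indexes.
u = [0, 5, 2, 7, 0, 1]
v = [4, 0, 1, 2, 9, 0]
idx = select_local_frequent(u, 2)
u_fs = project(u, idx)   # Alice's feature vector
v_fs = project(v, idx)   # Bob's feature vector, built from the same indexes
```

Note that the indexes depend only on Alice's vector, so Bob's large entries (such as the value 9 above) may be left out of the feature vectors.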

Figure 7 shows Protocol SecureDF, which securely constructs a whole vector $\vec{A}$ from Alice's and Bob's document vectors and gets f frequent dimensions from $\vec{A}$. In Lines 1 to 8, Alice and Bob compute their own whole vectors independently. That is, Alice computes her own whole vector $\vec{A}_{Alice}$ from her own document set U, and Bob gets $\vec{A}_{Bob}$ from V. In Lines 4 and 8, they share the whole vectors $\vec{A}_{Alice}$ and $\vec{A}_{Bob}$ with each other. In Lines 9 to 11, they then compute the aggregated whole vector $\vec{A}$ from those vectors. After obtaining the whole vector $\vec{A}$, Alice and Bob can select f frequent dimensions from $\vec{A}$. We note that Alice sends $\vec{A}_{Alice}$ to Bob in Line 4, and Bob sends $\vec{A}_{Bob}$ to Alice in Line 8. The vectors $\vec{A}_{Alice}$ and $\vec{A}_{Bob}$, however, are not exact values of document vectors but simple statistics, and thus we can say that SecureDF does not reveal the private contents of individual documents. The computation and communication complexities of SecureDF are merely O(n|U| + n|V|) and O(n), respectively. Also, SecureDF can be seen as a pre-processing step executed only once for all document vectors of Alice and Bob. Thus, its complexity is negligible compared with the complexity O(n|U||V|) of SSDD-Base.
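A minimal plaintext sketch of the SecureDF idea is given below. It assumes that each party's whole vector holds document frequencies (the number of that party's documents containing each term) and that the aggregated vector is their element-wise sum; the aggregation rule, the tie-breaking, and the function names are assumptions of this sketch rather than details confirmed by the protocol listing.

```python
def document_frequency(docs, n):
    """Whole vector of one party: df[k] = number of documents containing term k."""
    df = [0] * n
    for doc in docs:
        for k in range(n):
            if doc[k] > 0:
                df[k] += 1
    return df

def select_global_frequent(docs_alice, docs_bob, n, f):
    a_alice = document_frequency(docs_alice, n)   # Alice's statistics, sent to Bob
    a_bob = document_frequency(docs_bob, n)       # Bob's statistics, sent to Alice
    # Assumed aggregation: element-wise sum of the two whole vectors.
    a = [x + y for x, y in zip(a_alice, a_bob)]
    # Choose the f dimensions with the largest DF values (ties: lower index).
    order = sorted(range(n), key=lambda k: (-a[k], k))
    return sorted(order[:f])
```

Each party scans its own documents once, which matches the stated O(n|U| + n|V|) computation, and only two n-dimensional vectors are exchanged, which matches the O(n) communication.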

Figure 7: Secure protocol for constructing the whole vector.

Figure 11 shows the experimental results for NIPS. As in Figure 10 for KOS, we measure the execution time of SSDD by varying the number of dimensions and the tolerance. In Figure 11(a), we set the tolerance to 0.80 and increase the number of dimensions from 120 (1%) to 600 (5%) in steps of 120 (1%), where 120 means 1% of the total 12,419 dimensions. Next, in Figure 11(b), we set the number of dimensions to 120 and decrease the tolerance from 0.95 to 0.75 in steps of 0.05. The experimental results of Figures 11(a) and 11(b) show a trend very similar to that of Figures 10(a) and 10(b). That is, all proposed protocols significantly outperform SSDD-Base, and SSDD-HF shows the best performance. In Figure 11, SSDD-HF improves the performance compared with SSDD-Base by up to 16620 times.

Figure 12: Experimental results of scalability test for EMAILS.
SSDD-LF of Section 4.2 has the problem of considering only Alice's current vector while ignoring all the vectors of Bob. Due to this problem, SSDD-LF achieves the filtering effect for only a part of Bob's vectors, but not for most of the other vectors. To overcome this problem, in this section we propose another feature selection that uses the whole vector, of which each element represents the number of documents containing the corresponding term. Unlike LF, which focuses on the current vector only, it considers all document vectors and thus has the characteristic of globality. We call this feature selection GF (global frequency).

$a_k$ = DF value of the k-th dimension
11. end-for
12. Choose f dimensions, $i_1, \ldots, i_f$, whose DF values of $\vec{A}$ are larger than the other dimensions;

These datasets are KOS blog entries, NIPS full papers, and Enron emails, which have been frequently used in text mining. The first dataset consists of KOS blog entries collected from dailykos.com, and we call it KOS. KOS consists of 3,430 documents with 6,906 different terms (dimensions), and it has 467,714 terms in total. The second dataset contains NIPS full papers published in the Neural Information Processing Systems Conference, and we call it NIPS. NIPS consists of 1,500 documents with 12,419 different terms, and it has about 1.9 million terms in total. The third dataset contains e-mail messages of Enron, and we call it EMAILS. EMAILS consists of 39,861 e-mails with 28,102 different terms, and it has about 6.4 million terms in total.