Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics

Detecting near duplicates on the web is challenging due to its volume and variety. Most previous studies require the setting of input parameters, making it difficult for them to achieve robustness across various scenarios without careful tuning. Recently, a universal and parameter-free similarity metric, the normalized compression distance or NCD, has been employed effectively in diverse applications. Nevertheless, two problems prevent NCD from being applied to medium-to-large datasets: it lacks efficiency and tends to get skewed by large object size. To make this parameter-free method feasible on a large corpus of web documents, we propose a new method called SigNCD which measures NCD based on lightweight signatures instead of full documents, leading to improved efficiency and stability. We derive various lower bounds of NCD and propose pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in F1 score compared with the original NCD method and a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, no parameter tuning is required in SigNCD, except a similarity threshold.


Introduction
With the rapid growth of information in the big data era, near duplicate detection algorithms face a number of challenges. Corresponding to the four V's of big data, namely, Volume, Velocity, Variety, and Value, near duplicates should be detected with scalability, efficiency, robustness, and effectiveness. Although massive research efforts have been devoted to this problem, it is still difficult to meet all the requirements. For example, most of the existing algorithms cannot be adapted to evolving scenarios and heterogeneous data types (e.g., image and video) without human intervention, especially when efficiency is considered. Feasible solutions that meet these requirements more thoroughly are still under exploration.
A quantitative way to define two objects as near duplicates is to use similarity or distance functions, such as Jaccard similarity [1], cosine similarity [2], and Hamming or edit distances [3]. To improve efficiency, a common approach is to extract feature vectors or more lightweight signatures/fingerprints [4] from documents to perform similarity matching. However, it is challenging to choose suitable signatures or fingerprints, as it usually involves a tradeoff between effectiveness and efficiency. To achieve better performance, a set of complicated factors is taken into account, such as the spots and frequencies of occurrence, and delicately designed mapping functions (e.g., hash) are also involved. This process generally requires careful parameter tuning for good performance. However, detection tasks keep evolving (which is common on the Internet), making it difficult for a tuned method to adapt to varying scenarios. Therefore, most of these signature-based approaches are task-specific and parameter-sensitive.
Apart from the above parameter-dependent methods, there also exist parameter-free methods built on a special similarity metric called normalized compression distance or NCD [5], which is measured by exploiting off-the-shelf compressors to estimate the amount of information shared by any two documents. NCD has been proven to be universal and can naturally be applied to a variety of domains such as genomics, languages, music, and images [5-10]. However, most of these methods were evaluated only on small datasets for two reasons. First, it is extremely time-consuming to compress each of the documents and each of the pairwise concatenations of documents, leading to a prohibitive $O(n^2)$ time complexity, where $n$ is the number of documents. Second, as we verify in the experiments, NCD is prone to be skewed by long documents [11]. Thus, NCD is only effective for short documents. However, web documents can be of a very wide range of lengths, making the performance of NCD unpredictable.
In this paper, to deal with large collections of documents with a very wide range of lengths, we propose a new near duplicate detection algorithm called SigNCD, which combines a signature extraction process with normalized compression distance. Specifically, we first propose a punctuation-spot signature extraction method, which is robust and can be applied to different languages. Then we use lightweight signatures (rather than the full documents) as the inputs of NCD, resulting in dramatically reduced complexity and significantly improved stability. To further improve the efficiency of SigNCD, we derive various lower bounds of SigNCD (or NCD) to filter out a large portion of unnecessary comparisons. In contrast to the parameter-laden methods, no parameter tuning is required for SigNCD apart from a similarity threshold, making it simple to implement and employ.
Overall, the contributions of this paper are threefold:
(i) To address the drawbacks of both signature-based and compression-based approaches, we propose a novel framework, SigNCD, that combines the strengths of both. SigNCD is robust and efficient and requires no parameter tuning except a similarity threshold.
(ii) Based on the derived tight lower bounds of SigNCD, we propose exact pruning policies for similarity search to significantly reduce the complexity of processing large collections of web documents.
(iii) Experimental evaluation over both English and Chinese web document datasets shows that SigNCD outperforms NCD in terms of both F1 score and runtime. Comparison with other competitive signature-based methods also shows that SigNCD produces better results.

Definition of NCD.
We first introduce the Kolmogorov complexity [12] on which the definition of NCD is based. Kolmogorov complexity is a concept in algorithmic information theory, and the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object and is also known as descriptive complexity [13]. Consider the following two strings of 48 lowercase letters and digits:
(i) "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc"
(ii) "4c1j5x8rx2y39umgw5q85s7b2p0cv4w1dqoxjausakcpvc"
The first string has a short description, namely, "abc 16 times," which consists of 12 characters. The second one has no obvious simple description other than writing down the string itself, which has 48 characters. Thus, it can be concluded that the first string has a lower Kolmogorov complexity than the second. Please note that Kolmogorov complexity is uncomputable.
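Although Kolmogorov complexity itself cannot be computed, the compressed size under a real compressor gives a computable stand-in for it. The following minimal sketch illustrates the contrast between the two example strings; it assumes Python's `zlib` purely for illustration (it is not the compressor used in the experiments), and `compressed_size` is a hypothetical helper name.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# The repetitive example string: "abc" repeated 16 times (48 characters).
regular = "abc" * 16
# The irregular example string, with the line-break space removed.
irregular = "4c1j5x8rx2y39umgw5q85s7b2p0cv4w1dqoxjausakcpvc"

# The repetitive string compresses to far fewer bytes than the irregular one.
print(compressed_size(regular), compressed_size(irregular))
```

The absolute sizes include a fixed format overhead, so only the comparison between the two numbers is meaningful here.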
Formally, the Kolmogorov complexity $K(x)$ of a finite string $x$ is defined as the length of the shortest program to generate $x$ on a universal computer. Intuitively, the minimal information distance between $x$ and $y$ is the length of the shortest program for a universal computer to transform $x$ into $y$ and $y$ into $x$. This measure is, up to a logarithmic additive term, equal to the maximum of the conditional Kolmogorov complexities, that is, $K(x \mid y)$ and $K(y \mid x)$. The conditional Kolmogorov complexity $K(x \mid y)$ of $x$ given a finite string $y$ is defined as the length of the shortest program that generates $x$ when $y$ is used as an auxiliary input to the program. The normalized information distance $\mathrm{NID}(x, y)$ [5] is then developed and defined as
$$\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y), K(y \mid x)\}}{\max\{K(x), K(y)\}}.$$
It is shown in [5] that NID is a universal similarity metric. Unfortunately, NID is based on Kolmogorov complexity, which is incomputable in the Turing sense. Thus, it is necessary to approximate it with a given compressor. The result of approximating the NID using a real compressor $C$ is called the normalized compression distance (NCD), formally defined as
$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}.$$
Here, $C(xy)$ denotes the compressed size of the concatenation of $x$ and $y$, and $C(x)$ and $C(y)$ denote the compressed sizes of $x$ and $y$, respectively. NCD is the real-world version of the ideal notion of NID. The idea is that if $x$ and $y$ share common information they will compress better together than separately. NCD can be explicitly computed between any two strings or files $x$ and $y$. In practice, NCD is a nonnegative number $0 \le r \le 1 + \epsilon$, where $\epsilon$ is caused by the imperfections in compression techniques, but with most standard compression algorithms one is unlikely to see $\epsilon$ above 0.1. The more similar the two files are, the smaller the value of NCD is.
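The NCD formula can be evaluated directly with any off-the-shelf compressor. The sketch below is a minimal illustration that assumes Python's `zlib` as the compressor $C$ (an assumption for illustration; it is not the compressor used in the experiments). The helper names `c` and `ncd` and the sample texts are hypothetical.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes, playing the role of C(.) in the NCD formula."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


doc = (
    b"Near duplicate detection on the web is challenging because of the "
    b"volume and variety of documents. Compression-based distances need "
    b"no feature engineering, since a compressor finds shared structure "
    b"between two inputs on its own."
)
near = doc.replace(b"challenging", b"difficult")  # a lightly edited copy
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(doc)))  # unrelated bytes

# Identical and lightly edited inputs score near 0; unrelated inputs near 1.
print(ncd(doc, doc), ncd(doc, near), ncd(doc, noise))
```

Because real compressors are imperfect, the value for identical inputs is small but not exactly zero, which corresponds to the $\epsilon$ slack mentioned in the text.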

Properties of NCD.
We give the properties of NCD that are used in our work. Firstly, we rely on the axioms that determine a normal compressor; we omit the illustration material, which can be found in [5].
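For reference, the axioms commonly used in the compression-distance literature to define a normal compressor are restated below. This is the standard formulation and is assumed, not confirmed, to match the definition intended in this section. A compressor $C$ is called normal if the following hold up to an additive $O(\log n)$ term, where $n$ is the maximal binary length of the strings involved and $\lambda$ denotes the empty string:

```latex
\begin{align*}
\text{Idempotency:}    \quad & C(xx) = C(x), \qquad C(\lambda) = 0, \\
\text{Monotonicity:}   \quad & C(xy) \ge C(x), \\
\text{Symmetry:}       \quad & C(xy) = C(yx), \\
\text{Distributivity:} \quad & C(xy) + C(z) \le C(xz) + C(yz).
\end{align*}
```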

Lemma 2. If the compressor is normal, then NCD is a normalized admissible distance satisfying the metric (in)equalities as follows:
(i) Idempotency: $\mathrm{NCD}(x, x) = 0$
The proof of the triangle inequality for NCD can also be found in [5].
To obtain NCD, off-the-shelf compressors such as gzip and Snappy can be used. Since the Kolmogorov complexity is not computable, it is impossible to compute how far away the NCD is from the NID. Nevertheless, previous works on various application domains have confirmed the effectiveness of NCD as a universal similarity metric.

Analysis of Existing Methods
In this section, we analyze two existing methods for duplicate detection. One is SpotSigs, representing a category of methods with complex parameters to tune. The other is NCD, a parameter-free method. We experimentally show their drawbacks.
Table 1 shows the performance of SpotSigs [1] under seven parameter settings. There are six parameters to tune in total (the specific meaning of each parameter can be found in [1]). $\tau$ is the Jaccard similarity threshold within [0, 1], and $\tau = 0.3$ provides the best F1, so, in all seven settings, we keep $\tau = 0.3$. It can be observed that SpotSigs is sensitive to different settings; for example, the best setting (setting 1) performs better than the worst setting (setting 5) by 51% in terms of F1. SpotSigs therefore is parameter-dependent, and it is challenging to choose a suitable setting from a large parameter space. Moreover, when the tasks evolve, it is difficult to adapt to varying scenarios.
In contrast to the parameter-dependent algorithms such as SpotSigs, NCD is a parameter-free method, that is, without parameters to tune except a similarity threshold. NCD has been proven to be effective as a universal method for various applications. However, it is extremely time-consuming to compress large objects, and the number of pairwise comparisons is prohibitive for large collections of objects. In addition, NCD can be easily skewed by long documents. Figure 1 shows the results of NCD when comparing prefixes of increasing length (in bytes) of two identical documents using the Snappy compressor [14]. It can be observed that NCD begins to get skewed when the size of the document exceeds 15 KB, which makes it infeasible to operate on large documents. The problem is due to violations of the compressors' inner limitations, such as the constraints on the size of the block, the sliding window, and the lookahead window. A similar phenomenon is also found in previous work, where bzip2 and gzip are used as compressors [11]. Figure 1 also shows that SigNCD, on the contrary, can alleviate this problem; that is, SigNCD is more effective and robust than NCD. The reason is that using signatures instead of full documents can dramatically reduce the size of strings for compression. In addition, using signatures together with the pruning policies can further improve the efficiency. We will describe SigNCD in detail in the next section.
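The skew can be reproduced in miniature by computing the NCD of a prefix with itself as the prefix grows. The sketch below is an illustration under stated assumptions: it uses Python's `zlib`, whose DEFLATE format has a 32 KB sliding window, instead of Snappy, so the point at which the distance degrades differs from the 15 KB reported above. It also uses seeded pseudo-random bytes as a hypothetical stand-in for a long, hard-to-compress document.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes under zlib (DEFLATE, 32 KB window)."""
    return len(zlib.compress(data, 6))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(42)
# Hypothetical stand-in for a long document (incompressible on its own).
document = bytes(rng.randrange(256) for _ in range(100_000))


def self_distance(n: int) -> float:
    """NCD between the first n bytes of the document and an identical copy."""
    prefix = document[:n]
    return ncd(prefix, prefix)


# Ideally every value would be close to 0, since the two inputs are identical.
for n in (1_000, 10_000, 30_000, 40_000, 100_000):
    print(n, round(self_distance(n), 3))
```

Once the prefix is longer than the compressor's window, the second copy can no longer be encoded as references to the first, so the distance between identical inputs drifts toward 1.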

SigNCD
In this section, we first describe the general framework of SigNCD and then illustrate the pruning policies and implementation issues.

General Framework.
Given a set of documents, there are four steps to conduct near duplicate detection for web documents: preprocessing, signature extraction, compression and comparison, which are described in detail as follows.
(1) Preprocessing. The crawled web pages usually contain noise such as framing elements for branding and advertisements. Moreover, the core text in web pages is often lexically fragmented because HTML tables are used for layout control to insert images, ads, or even unrelated material. In such cases, preprocessing is required before detection. To get the main text, which is what we focus on, the page source is scanned and HTML framing elements such as <script ...>...</script>, <style ...>...</style>, <a ...>...</a>, and <iframe ...>...</iframe> are replaced with blank spaces.
(2) Signature Extraction. Signature extraction is the key part of SigNCD, aiming to capture the core contents for similarity matching. The spots in the page at which signatures are generated should be frequent within the corpus and are preferably domain-independent or even linguistics-independent. A simple choice is to use punctuation marks as spots, which are likely to occur in every document and whose occurrences are widely and uniformly spread out in the documents. Hence punctuation-spot signatures extract the words around a subset of punctuation marks to construct a signature for each document. For example, consider the first paragraph in this subsection:
"Given a set of documents, there are four steps to conduct near duplicate detection for web documents: preprocessing, signature extraction, compression and comparison, which are described in detail as follows."
If we choose the comma as spot punctuation and extract the word before each comma, then the signature would be "documents preprocessing extraction comparison." Please note that if the number of occurrences of spot punctuation in a document is too small (less than three in our setting), we do not extract signatures but directly use the full document for compression and comparison instead. In addition to the proposed punctuation-spot signature extraction method, other signature extraction methods (e.g., using stop words as spots) can also be employed in our general framework, as we have done in the baseline called SpotSigNCD described in Section 4.2.
(3) Compression. We refer to the signature extracted from a document $x$ as $\mathrm{sig}(x)$, which is compressed by an off-the-shelf compressor. The size of the compressed signature, denoted as $C(\mathrm{sig}(x))$, is then used as NCD input. Besides, to measure the similarity between documents $x$ and $y$, the concatenation of $\mathrm{sig}(x)$ and $\mathrm{sig}(y)$ also needs to be compressed, and its size is denoted as $C(\mathrm{sig}(x)\mathrm{sig}(y))$. As compression is generally time-consuming, compressing signatures instead of full documents can significantly reduce computational complexity. Moreover, NCD can also benefit from the reduced size because it can be skewed by large objects. To further improve efficiency, we choose real-world compressors with superior speed.
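(4) Comparison. SigNCD is obtained by applying NCD to the signatures:
$$\mathrm{SigNCD}(x, y) = \frac{C(\mathrm{sig}(x)\mathrm{sig}(y)) - \min\{C(\mathrm{sig}(x)), C(\mathrm{sig}(y))\}}{\max\{C(\mathrm{sig}(x)), C(\mathrm{sig}(y))\}}.$$
A pair $\langle x, y \rangle$ is detected as a near duplicate if $\mathrm{SigNCD}(x, y) \le \theta$, where $\theta$ is the similarity threshold.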
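The four steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions rather than the implementation evaluated in the paper: it uses Python's `zlib` in place of Snappy, strips HTML with regular expressions (and, as an extra assumption, removes any remaining tags), and takes the single word before each comma as the signature. All function names are hypothetical.

```python
import re
import zlib

# Framing elements named in the preprocessing step; matched non-greedily.
_FRAMING = re.compile(
    r"<(script|style|a|iframe)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_TAG = re.compile(r"<[^>]+>")  # assumption: drop any remaining markup too


def preprocess(html: str) -> str:
    """Step 1: replace framing elements (and leftover tags) with spaces."""
    return _TAG.sub(" ", _FRAMING.sub(" ", html))


def signature(text: str, spot: str = ",", min_spots: int = 3) -> str:
    """Step 2: join the word found immediately before each spot.

    Falls back to the full text when the spot occurs fewer than
    `min_spots` times, as described in the signature extraction step.
    """
    if text.count(spot) < min_spots:
        return text
    return " ".join(re.findall(r"(\w+)" + re.escape(spot), text))


def c(data: bytes) -> int:
    """Step 3: compressed size in bytes (zlib stands in for Snappy)."""
    return len(zlib.compress(data, 9))


def sig_ncd(doc_x: str, doc_y: str) -> float:
    """NCD computed on signatures instead of full documents."""
    sx = signature(preprocess(doc_x)).encode("utf-8")
    sy = signature(preprocess(doc_y)).encode("utf-8")
    cx, cy, cxy = c(sx), c(sy), c(sx + sy)
    return (cxy - min(cx, cy)) / max(cx, cy)


def is_near_duplicate(doc_x: str, doc_y: str, theta: float = 0.7) -> bool:
    """Step 4: compare the distance against the similarity threshold."""
    return sig_ncd(doc_x, doc_y) <= theta


example = (
    "Given a set of documents, there are four steps to conduct near "
    "duplicate detection for web documents: preprocessing, signature "
    "extraction, compression and comparison, which are described in "
    "detail as follows."
)
print(signature(example))
```

The default threshold of 0.7 in `is_near_duplicate` is only a placeholder; on very short signatures the fixed overhead of the compression format dominates, so the value should be chosen on real data.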
Although computational complexity has been reduced by adopting signatures, comparing every pair of documents is still prohibitive ($O(n^2)$, where $n$ is the number of documents) for large collections. We propose pruning policies to filter unnecessary comparisons in the following subsection.

Pruning Policies.
We provide two pruning policies, P1 and P2, based on the properties of NCD. Though the lemmas we derived here are for $\mathrm{NCD}(x, y)$, they can also be applied to $\mathrm{SigNCD}(x, y)$ as they enjoy the same properties.
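As the lemmas themselves are not reproduced here, the following is a sketch of how lower bounds of this kind can be obtained. It assumes only the monotonicity and symmetry of a normal compressor and the triangle inequality of NCD, and it writes $\theta$ for the similarity threshold. It should be read as an illustration of the reasoning, not as the exact statement of the lemmas behind P1 and P2.

```latex
% Monotonicity and symmetry give C(xy) >= max{C(x), C(y)}
% (up to a logarithmic additive term), hence
\[
\mathrm{NCD}(x,y)
  = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}
  \ge 1 - \frac{\min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} .
\]
% Size-ratio test: if min{C(x),C(y)} / max{C(x),C(y)} < 1 - \theta,
% then NCD(x,y) > \theta, so the pair cannot be a near duplicate and
% the concatenation xy never needs to be compressed.

% Third-object test: the triangle inequality with any object z gives
\[
\mathrm{NCD}(x,y) \ge \bigl|\, \mathrm{NCD}(x,z) - \mathrm{NCD}(y,z) \,\bigr| ,
\]
% so the pair can also be skipped whenever the right-hand side exceeds \theta.
```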
We then describe how to incorporate the two pruning policies, namely, P1 and P2, into the general framework. To apply P1, we first need to sort the documents according to the compressed size of their signatures in ascending order (line (8)). Then we obtain an ordered list $D = \{d_1, d_2, \ldots, d_n\}$, where $C(\mathrm{sig}(d_1)) \le C(\mathrm{sig}(d_2)) \le \cdots \le C(\mathrm{sig}(d_n))$. After that, a straightforward approach for comparison is that, for each examined document, say $d_i$, we compare $d_i$ with each document $d_j$, where $i < j$ (line (25)). According to P1, we can safely skip the comparison of $\langle d_i, d_j \rangle$ when $C(\mathrm{sig}(d_i))/C(\mathrm{sig}(d_j)) < 1 - \theta$, which indicates an implicit matching partition on $D$ for each document, where comparisons are only required within the partition. More specifically, for each examined $d_i$, we may find a matching partition $[d_i, d_k)$ ($i < k$ and $k \le n$) on $D$, where $d_k$ is the boundary object, that is, the first object in $D$ that satisfies $C(\mathrm{sig}(d_k)) > C(\mathrm{sig}(d_i))/(1 - \theta)$ (line (20)). Then all the documents in $D$ whose indices are between $i$ and $k$ will be put into this partition. Note that sometimes we may not be able to find a $d_k$ satisfying the above condition. In such a case, the matching partition would be $[d_i, d_n]$, where $d_n$ is the last document in $D$. Given the matching partition of $d_i$, computation of $\mathrm{SigNCD}(d_i, d_j)$ ($i < j$) is only necessary when $d_j$ is within $d_i$'s matching partition.
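The matching-partition idea can be sketched with a sorted list of compressed signature sizes and a binary search for the boundary object. This is a minimal illustration of the size-ratio test only; the function name `p1_candidate_pairs` and the example sizes are hypothetical, and positions are 0-based indices into the sorted list.

```python
from bisect import bisect_right


def p1_candidate_pairs(sizes: list[int], theta: float) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, that the size-ratio test cannot prune.

    `sizes` must hold the compressed signature sizes in ascending order.
    A pair is pruned when sizes[i] / sizes[j] < 1 - theta.
    """
    pairs = []
    for i, size_i in enumerate(sizes):
        if theta >= 1.0:
            boundary = len(sizes)  # nothing can be pruned at theta = 1
        else:
            # First index whose size exceeds size_i / (1 - theta);
            # len(sizes) when no such boundary object exists.
            boundary = bisect_right(sizes, size_i / (1.0 - theta))
        pairs.extend((i, j) for j in range(i + 1, boundary))
    return pairs


# Hypothetical compressed sizes, already sorted in ascending order.
sizes = [10, 12, 20, 40]
print(p1_candidate_pairs(sizes, theta=0.5))
```

Only the surviving pairs need their concatenated signatures compressed, which is where the savings come from, since sorting and the binary search are cheap compared with compression.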
Besides, P2 can be used in combination with P1 to further reduce the number of comparisons. However, to make it work, there are two conditions. First, for the examined pair $\langle d_i, d_j \rangle$, the third-party $d_t$ should be in front of both $d_i$ and $d_j$ in $D$; that is, $t < \min\{i, j\}$. Second, $d_i$ and $d_j$ should also be within the matching partition of $d_t$. In other words, if the matching partition of $d_t$ is $[d_t, d_m)$, then $\max\{i, j\} < m$.
The compressor used in our experiment is Snappy [14], which is a key part of the Google infrastructure, with very high speed and reasonable compression. Note that SigNCD is also compatible with other compressors, which are also evaluated in our experiments. It has also been shown in [16] that NCD is largely independent of the underlying compression algorithms.

Experiments
As for effectiveness, we report microaverages for F1 score as quality measures, which are consistent with previous works such as SpotSigs [1]. As for efficiency, to make fair comparisons, all the algorithms are implemented in Java and run as single-threaded programs. Note that, for each algorithm, the runtime involves all the steps of the algorithm (so it also includes the preprocessing step). A multithreaded version of SigNCD is also evaluated for scalability. All experiments are performed on Intel Core(TM)2 Quad CPU Q9950 @ 2.83 GHz with 4 GB RAM.
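Micro-averaging pools the true positives, false positives, and false negatives over all evaluation groups before computing precision, recall, and F1. The sketch below shows one common way to do this for pairwise near duplicate decisions; the exact grouping used in the experiments is not specified here, so the `groups` structure and the example pairs are hypothetical.

```python
def micro_f1(groups):
    """Micro-averaged precision, recall, and F1.

    `groups` is an iterable of (predicted_pairs, true_pairs) sets; counts
    are pooled over all groups before the ratios are computed.
    """
    tp = fp = fn = 0
    for predicted, truth in groups:
        tp += len(predicted & truth)
        fp += len(predicted - truth)
        fn += len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: two groups of detected versus labeled duplicate pairs.
groups = [
    ({(1, 2), (1, 3)}, {(1, 2)}),
    ({(4, 5)}, {(4, 5), (4, 6)}),
]
precision, recall, f1 = micro_f1(groups)
print(precision, recall, f1)
```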

Comparison Methods. The following methods are compared:
(i) SigNCD. It is our proposal with three variations for evaluation, and we refer to SigNCD without pruning policies, with only P1, and with both P1 and P2 as SigNCD w/o, SigNCD w/ P1, and SigNCD w/ P1+P2, respectively.
(ii) NCD. NCD is applied without signature extraction or pruning processes [5].
(iii) SpotSigs. SpotSigs [1] is a competitive algorithm which shows superior performance against a set of counterparts in terms of both F1 and runtime. It also involves signature extraction (stop word-spot) and pruning policies (based on multiset Jaccard), making it a strong competitor against SigNCD.
(iv) Google's simhash. It is a fingerprinting technique based on Charikar's work [17], which maps high-dimensional vectors to small-sized fingerprints for efficiency. Besides, it has been improved with an algorithmic technique which can quickly find all fingerprints that differ from a given fingerprint in at most $k$ bit positions. Google reported using simhash for duplicate detection for web crawling [3].
(v) SL+ST. It is a recently proposed algorithm that uses two sentence-level features, that is, the number of terms and the terms at particular positions, to detect near duplicate documents. The suffix tree is adopted to efficiently match sentence blocks [18].
(vi) SpotSigNCD. It is based on the general framework of SigNCD, but the punctuation-spot signature extraction method is replaced by the stop word-spot signature extraction method proposed in SpotSigs [1]. It is denoted as SpotSigNCD w/ P1 if the P1 pruning policy is employed.
Note that, for all the baselines except NCD, we use their recommended settings. Specifically, we use 64-bit fingerprints in Google's simhash. For SpotSigs, we use the default setting within the author's code [19].
Figure 2 shows F1 score with a variety of subsets of punctuation marks under SigNCD w/ P1. It can be observed that using commas as spots performs best with an F1 of 0.92 when the similarity threshold $\theta = 0.7$ (where SigNCD w/ P1 is shown to perform well in Figure 6), and the combination of commas and full stops takes the second place with an F1 of 0.87. It can also be observed that adding more extracted signatures sometimes may hurt performance. The possible reason is that the additional information may involve some noise. For example, on one hand, the semicolon occurs more frequently in the web framing elements than the comma. On the other hand, the comma is more common in the core part than in the web framing part. Therefore, compared to only using commas as spots, using both commas and semicolons as spots may result in noisy signatures, which leads to performance degradation. When runtime is taken into account (as shown in Figure 3), using commas as spots performs much better than using the combination of commas and full stops. The reason is that larger signatures due to more spots lead to more time spent in compression. Thus, we choose commas as spots for SigNCD in the following experiments.
[...] 40.0%. The reason is that with the increase of $\theta$, the matching partition gets larger and hence fewer comparisons can be pruned. We can also observe that SigNCD w/ P1+P2 can filter out more comparisons than SigNCD w/ P1, but the difference is trivial. The possible reason is that most of the unnecessary comparisons have already been pruned by P1 and there is little room left for P2.
Figure 5 shows that SigNCD w/ P1 consistently runs faster than SigNCD w/o when $\theta \le 0.8$, while SigNCD w/ P1+P2 runs slower than SigNCD w/ P1 in most cases except when $\theta = 0.9$. The main reason is that P2 is expensive to compute, so SigNCD w/ P1+P2 incurs more overhead than benefit. Overall, SigNCD w/ P1 is a better choice than SigNCD w/ P1+P2 and will be used in the following experiments.

SigNCD versus the Baselines on Gold Set.
Table 2 summarizes the results of SigNCD w/ P1 against other methods when each algorithm achieves its maximum F1 score. It can be observed that SigNCD w/ P1 and SpotSigNCD w/ P1 outperform all other methods in all metrics, including precision, recall, max F1, and runtime. SigNCD w/ P1 and SpotSigNCD w/ P1 yield increases of 10.8%, 10.8%, 55.9%, and 48.4% against NCD, SpotSigs, Google's simhash, and SL+ST, respectively, in terms of F1, which shows the superiority of our SigNCD framework. When average runtime is considered, SigNCD w/ P1 still performs best and achieves speedups of 7.3 and 30.9 against NCD and SL+ST and speedups of 1.6, 1.7, and 1.4 against SpotSigs, Google's simhash, and SpotSigNCD w/ P1, respectively, which shows the efficiency of our punctuation-spot signature method. Keep in mind that, in contrast to SpotSigs, SpotSigNCD w/ P1, and Google's simhash, no parameter tuning is required for SigNCD, except a similarity threshold $\theta$.
Figures 6 and 7 compare F1 and runtime of SigNCD w/ P1 against NCD, SpotSigs, SL+ST, and SpotSigNCD w/ P1 when varying the values of $\theta$. Note that as SpotSigs and SL+ST use Jaccard similarity rather than normalized compression distance, to make a fair comparison, a conversion of the thresholds was conducted. Specifically, the performance of SpotSigs and SL+ST at $\theta = t$ in Figures 6 and 7 is actually obtained by using $(1 - t)$ as their Jaccard threshold. In Figure 6, we observe that $\theta = 0.5$, 0.6, and 0.7 are good settings for SigNCD w/ P1, and the best F1 appears when $\theta = 0.7$. When $0.3 \le \theta \le 0.8$, SigNCD w/ P1 consistently performs better than SpotSigs and NCD. Figure 7 demonstrates that SigNCD w/ P1 outperforms SL+ST, NCD, SpotSigNCD w/ P1, and SpotSigs throughout all the values of $\theta$ except $\theta = 1$. When $\theta = 0.7$, SigNCD w/ P1 achieves speedups of 45.6, 4.8, 1.2, and 1.6 against SL+ST, NCD, SpotSigNCD w/ P1, and SpotSigs, respectively. Note that, different from similarity thresholds $\theta \in [0, 1]$, Google's simhash uses distances instead. The distances are denoted as a number of bits with wide ranges, and hence we do not show their results in Figures 6 and 7, where values on the x-axis are within [0, 1].

SigNCD versus the Baselines on Chinese Finance News.
Table 3 [...] each comma) are extracted as signatures. In addition, if the two words (or three words) before each comma as well as the two words (or three words) after each comma are extracted as signatures, they are referred to as Sig-4 (or Sig-6).
Figure 8 shows the results of F1 score using different lengths of signatures as a function of $\theta$. It can be observed that, on average, Sig-1L performs best, while Sig-1R performs worst, and the relative difference is about 10.1%. More specifically, F1 score is quite stable for different lengths of signatures except at $\theta = 0.9$ and $\theta = 1$ (which are hardly used in real-world applications). Figure 9 shows that the runtime increases as the length of signatures grows. The reason is that it takes more time to accomplish signature extraction and compression. To summarize, using only one word before each comma as signatures (i.e., Sig-1L) is shown to be both effective and efficient and therefore is used as the default punctuation-spot signature in our proposal. In addition, our experimental results also show that, on average, the size of the signatures is only 5.2% of the size of the preprocessed documents, which means that SigNCD w/ P1 can significantly alleviate the problem of NCD that tends to get skewed by large object size.
Figure 10 shows [...] 0.7. We can observe that Snappy and lz4 perform better than gzip and zip.

Scalability.
To evaluate scalability, we run SigNCD w/ P1 on subsets of the whole Chinese Finance News dataset involving 43000 documents and measure the runtime. We randomly sample from 12.5% to 100% of the records and scale down the data so that the data distribution remains approximately the same. We normalize the square roots of the running times to those obtained on the subset of 12.5% of records. The results are shown in Figure 11 with $\theta = 0.5$, 0.6, and 0.7, where SigNCD w/ P1 usually achieves good performance. We also show the curve of $y = x$ for comparison. It is clear that the runtime of SigNCD w/ P1 grows quadratically, which is not surprising given the fact that the actual comparisons of documents also grow quadratically.
SigNCD w/ P1 has demonstrated a slower growth rate than $y = x$ (which indicates all-pairs comparisons). Figure 12 shows that runtime can be reduced by 77.3% when the number of threads is increased from 1 to 4, which shows that SigNCD w/ P1 can be accelerated by parallel techniques to further improve efficiency.

Related Work
Signature-based or fingerprint-based methods are widely used to detect near duplicates. The shingling algorithm [2] generates a sequence of fingerprints, called shingles, from the token sequence of a page. Then the percentage of unique shingles on which the two pages agree can be used to measure the similarity. To improve the efficiency, the use of super shingles was later proposed to deal with large collections [20].
Discontinuous n-grams were taken by skipping the words in between [21]. SpotSigs [1] used strings starting with stop words as features. However, different stop word lists may lead to different feature sets. In [22], the author combined two algorithms, namely, shingling [2] and Charikar's simhash [17], and achieved a better precision than either of them individually. Sentence-level features with heavily weighted terms were adopted in [18, 23]. A similar idea is also proposed in [24], which weighs the phrases in a sliding window based on the term frequency within the document of terms in that window and the inverse document frequency of those phrases. A hybrid approach embeds Jaro distance and statistical results of word usage frequency for near duplicate detection [25].
An improved locality-sensitive hashing based method is used for detecting duplicated tweets in order to identify potential social spammers [26]. The work in [27] proposed an approach to approximate the Jaccard similarity of two streams which are highly similar. Google's simhash [3] extends Charikar's simhash [17] with an efficient technique to quickly identify all fingerprints that differ from a given fingerprint in at most $k$ bit positions, making it a practical method to handle large collections of web documents. Later, a more efficient version was proposed at the cost of recall [28]. MinHash [29] uses hash collisions for detection. A compact binary sketch with one hash function was used for estimating Jaccard similarity to detect cases of very high similarity [4].
Kolmogorov complexity-based similarity metrics have been used in several domains involving image [30], audio [31], and time series [32]. A dictionary-based compression dissimilarity measure was proposed for multitask clustering [33]. A TokenCompress algorithm was designed in place of the universal compression algorithm [6]. A b-bit NCD, which only stores b bits of each byte value of an object, can improve efficiency [34]. A metric for multisets is proposed based on NCD [35]. In [7], a fast compression distance was proposed based on dictionaries extracted from images. Overall, most of the previous studies focus on variations of the metrics, data representations, or novel compressors. Successful applications of NCD in the context of near duplicate detection are scarce, and work on feasible bounds to reduce complexity is also lacking, which is the main contribution of our work.

Conclusion
Normalized compression distance (NCD) is a parameter-free, feature-free similarity metric. However, it falls short in effectiveness due to limitations of real-world compressors and performs badly in efficiency because of compression and pairwise comparisons. To tackle these problems, we propose SigNCD, which integrates the signature-based method into the compression-based metric, to achieve robustness and efficiency. Furthermore, it can be even faster with pruning policies based on the derived lower bounds. Thorough experiments on both English and Chinese datasets demonstrate the superior performance of SigNCD in terms of F1 score and runtime, compared with NCD and other methods. In addition, SigNCD and the associated pruning policies are universal and require no parameter tuning except the similarity threshold. Hence, they can be easily extended to other applications.

Figure 1: Normalized compression distance under SigNCD and NCD for prefixes of increasing length (in bytes) of two identical documents.

Figure 3: Runtime of SigNCD w/ P1 with different punctuation-spot signatures evaluated on Gold Set.

Figure 4: Number of comparisons with different pruning configurations as a function of $\theta$.

Figure 8: F1 score of SigNCD w/ P1 with different lengths of punctuation-spot signatures evaluated on Gold Set.

Figure 9: Runtime of SigNCD w/ P1 with different lengths of punctuation-spot signatures evaluated on Gold Set.

Figure 11: Normalized square root of runtime of SigNCD when the number of documents increases.

Figure 12: Runtime with varying number of threads.

Table 2: SigNCD versus the baselines for Gold Set.

Table 3: SigNCD versus the baselines on Chinese Finance News.