Content Deduplication with Granularity Tweak Based on Base and Deviation for Large Text Dataset



Introduction
The cloud is an attractive option for storing data from pervasive entities; it draws the digital world toward itself and has become the universe for data and services. The flexibility provided by the cloud, such as access to services and data from anywhere, has attracted a huge number of users. Digital content of zettabytes in size is being dumped into the cloud environment; to put the scenario in perspective, the amount of data stored in the cloud neared one hundred zettabytes at the end of 2020 [1]. On the other side, internet-enabled physical devices called cyber-physical systems (CPS), which range from simple embedded devices to complex healthcare devices, are growing in large numbers [2] and producing large amounts of data. The pervasive nature of CPS facilitates the utilization of existing cloud infrastructure for data storage and processing, with billions of devices working around the clock all over the world producing a rain of data with similar contents; as a result, the scarcity of storage space increases [3]. Since almost 60% to 70% of the data stored in the cloud [4] are accountable as duplicates, deduplication techniques become a necessity.
Data deduplication is a technique that eliminates redundant or duplicated data and stores only a unique copy, which considerably reduces storage space [5]. It is performed in two ways: file-level deduplication and subfile-level deduplication. File-level deduplication operates on whole files; it eliminates redundancy by comparing files and saves only one instance when they are identical [6]. The hash value computed for each file is used for the comparison, and it is sensitive to change, so two files with a small difference produce different hash values, are considered different, and both get stored; as a result, storage optimization is not achievable. This paved the way for a new concept called subfile-level or block-level deduplication.
In subfile-level deduplication, files are divided into small chunks of fixed or variable size. In a fixed-size chunking algorithm, files are divided into fixed-size chunks based on a predefined offset value, and in a variable-size chunking algorithm, the offset value is calculated by various techniques based on the context of the file [7]. Thus, the deduplication technique mainly operates on files with similar contents within themselves or in other correlated files. When the similarity concentration increases, the deduplication efficiency drops; that is, it fails to figure out the small changes that occur at the chunk level. This motivates the concept of content deduplication with granularity tweak based on base and deviation, which is the core of this paper. The main contributions of this paper are as follows:
(i) A deduplication technique is proposed that works in conjunction with generalized and classical deduplication.
(ii) A versioning concept is implemented with a machine learning technique, which improves the deduplication efficiency.
(iii) The base and deviation produced by the generalized deduplication technique act as the tweak.
(iv) The grouping of index-centric data in clusters decreases the processing time.
This paper is organized first with a detailed literature review, followed by the proposed system, and finally a comparative analysis to justify the findings.

Background
The modern digital ecosystem has shifted to a distributed environment where the demand for greater computational and storage capacity is satisfied. This shift also brings changes in the organization of data; that is, data are allowed to be stored in a distributed fashion. In such an environment, different text-based data-producing nodes produce data of different types with more or less similar content, and this is where classical deduplication finds its phaseout: it is designed around a particular context, and a change in that context degrades its performance.
The deduplication technique is mostly preferred among other compression techniques since it does not require decompression to reconstruct the data, so it dominates the deduplication ecosystem [8]. Classical deduplication works by comparing the hashes of chunks, which are obtained by segmenting the file and then computing a hash for each segment [9]. Chunks are the heart of deduplication, and their quantity plays a crucial part in the performance in terms of space and operational complexity.
In [10], a rule of thumb states that for 1 TB of data, 5 GB of RAM is required for processing in the Zettabyte file system (ZFS). Generalized data deduplication (GDD), recently handled in [11], is capable of converging many similar chunks to a common base, which brings down the memory consumption and improves the performance.
Generalized deduplication (GD) [12] fits under the lossless compression approach, which dynamically eliminates both similar and identical data. It works by splitting a data unit into a base (B) and a deviation (D) with the help of a transformation function. For example, consider a chunk $C_i$ of a file X which is transformed to $B_i, D_i$; only one copy of B is stored for similar chunks, with all their changes kept as deviations $\{D_1, D_2, \ldots, D_i\}$.
To be practical, consider a house located in the countryside: a time-lapse photograph of the house will have the house as static, with only the environmental conditions and the movement of objects changing. When such a scenario comes to the cloud, the GD transformation will separate each detail of the photo into base and deviation, in which the house becomes the base, since it is the static object, and the other things become deviations (morning, evening, snow, with a parked car, without a parked car, etc.). GD will generate the base-deviation pair and store only the deviation for a similar base. The reconstruction of the image is simple; that is, GD will determine the correct base (house) and then apply the exact deviation (environmental changes). The efficiency of matching depends on the transformation function, and the heart of deduplication lies in the indexing mechanism, which speeds up the search operation [13, 14] and improves further if the entries in the index are organized based on similarity with the help of fuzzy classifiers [15].
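To make the base/deviation split concrete, the toy sketch below treats the high-order bits of each byte as the base and the low-order bits as the deviation. This is only an illustrative transformation chosen for brevity, not the error-correcting (Reed-Solomon style) transform used later in this paper; the function names and the two-bit deviation width are assumptions.

```python
def split_chunk(chunk: bytes, dev_bits: int = 2):
    """Toy GD transform: base = each byte with its low dev_bits cleared, deviation = those bits."""
    mask = (1 << dev_bits) - 1
    base = bytes(b & ~mask & 0xFF for b in chunk)
    deviation = bytes(b & mask for b in chunk)
    return base, deviation

def merge_chunk(base: bytes, deviation: bytes) -> bytes:
    """Reverse the transform: the original chunk is exactly recoverable from base + deviation."""
    return bytes(b | d for b, d in zip(base, deviation))

# two nearly identical chunks map to the same base, so one base plus two deviations suffices
c1, c2 = b"house at dawn!", b"house at dawn#"   # differ only in the low bits of the last byte
b1, d1 = split_chunk(c1)
b2, d2 = split_chunk(c2)
assert b1 == b2 and merge_chunk(b1, d1) == c1 and merge_chunk(b2, d2) == c2
```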

Overview of the System Model.
The system is built in a modular fashion and includes summary generation, chunk generation, and content deduplication with granularity tweak (CDGT). Files from different time domains that arrive with similar text content have to be compared against duplicate content in a short duration, so the proposed system employs a summary generator for every new file that enters the system. The summary of a file can act as metadata, which enhances the process of finding similar contents to eliminate interfile duplication. Content deduplication with granularity tweak (CDGT) is designed to perform deduplication at the chunk level with the support of the base and deviation produced by Reed-Solomon coding [16].

Summary Generator Based on NLP.
The documents in every storage environment have to be organized to facilitate faster retrieval based on the content of interest, which has paved the way to incorporate natural language processing (NLP). The large content of a document is squashed to a miniature that includes the most important signature of the document; such miniature content is called a summary [17], and it falls under the extractive summarization technique [18].
The summary extraction is performed in a sequence of steps as shown in Figure 1; when a document arrives, it is put into the document pool, from where it is fetched for processing.
The process begins with the data in the files being divided into sentences/words called tokens; then the cleaning process eliminates the most commonly occurring words and special characters, called stop-words, because they do not provide useful information for the NLP to learn [9, 19]. Then, the term frequency (TF) is calculated, which is stated as the number of times a word (t) appears in a sentence divided by the total number of words in that sentence of the file [20], as shown in equation (1).
$$\mathrm{TF}(t) = \frac{\text{Number of occurrences of the word } t \text{ in a sentence}}{\text{Total number of words in the sentence}}. \quad (1)$$
Once the TF is calculated to bring out the common words in the file, the IDF is calculated to bring out the more unique words in the documents using equation (2).
$$\mathrm{IDF}(t) = \frac{\text{Total number of sentences in a file}}{\text{Total number of sentences containing the word } t}. \quad (2)$$
The calculated values from equations (1) and (2) are used to assign weights to the words as shown in equation (3), where t is the word in the sentence.
$$W(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t). \quad (3)$$
The word weight calculated in equation (3) for each word in the file is used to calculate a score value for each sentence in the file. Equation (4) describes the score calculation, where N is the number of words in a sentence.
The average score value calculated for a file using equation (4) is used to set the threshold for extracting unique sentences from the file. The value calculated in equation (5) is normalized by multiplying it by a constant to extract the most unique sentences from the file; the constant is set to 1.3 in the proposed system.
A sentence with a score value higher than the threshold is considered a more unique sentence of that file. In this manner, all unique sentences are extracted to form the summary of the file, which forms its metadata. This metadata acts as the backbone of the versioning process.
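The scoring pipeline described above can be sketched as follows. This is a minimal illustration, assuming the weight in equation (3) is the product TF × IDF, the sentence score in equation (4) is the average word weight, and the 1.3 normalization constant from the text; the stop-word list is illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}  # illustrative stop-word list

def summarize(text, factor=1.3):
    """Extract sentences whose score exceeds factor * average score (hedged sketch)."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    tokenized = [[w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
                 for s in sentences]

    # IDF per word: total sentences / sentences containing the word (equation (2), no log)
    n_sent = len(sentences)
    doc_freq = {}
    for words in tokenized:
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    idf = {w: n_sent / c for w, c in doc_freq.items()}

    scores = []
    for words in tokenized:
        tf = Counter(words)
        n = len(words) or 1
        # word weight = TF * IDF (assumed form of equation (3)); sentence score = mean weight
        weights = [(tf[w] / n) * idf[w] for w in words]
        scores.append(sum(weights) / n)

    threshold = factor * (sum(scores) / max(len(scores), 1))
    return [s for s, sc in zip(sentences, scores) if sc > threshold]
```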

Chunking Mechanism.
Almost all deduplication techniques centre on chunking; it is the process that breaks a file into a number of small pieces called chunks, which may be fixed or variable in size. These chunks are subjected to hashing, whose output is called a fingerprint and acts as a unique identifier for the chunk. When a new chunk arrives, it is compared with the stored identifiers in the index table; if the chunks are different, the new chunk gets stored, or else it is a duplicate entity, so only a new reference pointer is created for that chunk [21]. Figure 2 shows the working model of deduplication.
In traditional compression techniques, duplicates are eliminated over a small group of files or among contents of the same file, but the deduplication technique eliminates redundancy both interfile and intrafile over the largest data repositories and even over multiple distributed storage servers [22].
Variable-size chunking methods eliminate more redundant data; an example is the content-defined chunking (CDC) method, which is scalable and fast in processing. CDC works by figuring out a set of locations at which to break the input data stream; such breakpoints are called cutpoints, and based on these points the contents of the file are chunked. Many CDC-based algorithms lead to poor performance when they produce chunks of smaller size, that is, below 8 KB [23]. In fixed-size chunking, all chunks are of equal size, so the processing rate is higher [24], but this method suffers from the boundary shifting problem.
The boundary shifting problem occurs in a particular scenario: when an already stored file arrives with some modification for backup, then, while performing fixed-size chunking on the recently received file, the added/deleted/appended content relocates information from one chunk to another back and forth, depending on the nature of the modification. That is, a file with inserted content pushes content from one chunk into the next, and if content is deleted from a file, the chunks after the deleted part pull information forward to fill up the desired chunk size. Thus, the chunks of the new file slip past the deduplication process, which is not acceptable, since chunks with similar content and only small deviations overload the storage space. In this paper, to tackle this problem, a hybrid method is utilized that is both dynamic and fixed in nature, since variable and fixed-size chunking provide the same result [25].

Cosine Similarity and Dynamic Content Adjustment Policy.
The cosine similarity is used to measure the similarity between two documents of any size [26]. It measures the cosine of the angle between two given vectors in a multidimensional space; that is, the cosine of 0° is 1, and if the angle lies in (0, π), the value is less than one. Put another way, if two terms are the same, the cosine similarity is one; the value approaches 0 as the similarity between the texts reduces. The cosine similarity is calculated as shown in the following equation:
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|},$$
where x and y are the two term-frequency vectors from which the angle between the vectors is calculated. When the system is unable to find a sufficiently similar file for a newly arrived file in the storage system using the cosine similarity matching process, it uses the perfect square divisor (PSD) to define the size of a chunk. For example, consider a file size of 729 MB; 729 is a perfect square whose root is 27, so the file is divided into chunks of 27 MB each. However, a prime number has no perfect square, and such a situation seeks the nearest square of the given number.
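A compact sketch of the two helpers just described is shown below; the function names and the fallback of rounding to the nearest perfect square are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def psd_chunk_size(file_size_mb):
    """Perfect square divisor: 729 MB -> 27 MB chunks; otherwise fall back to the nearest square."""
    root = math.isqrt(file_size_mb)
    if root * root == file_size_mb:
        return root
    # assumed fallback: choose the root of the nearest perfect square
    return root if (file_size_mb - root * root) <= ((root + 1) ** 2 - file_size_mb) else root + 1
```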
The DCAP is activated whenever a file arrives in the storage system; if it is a new file, there will not be any copy of it, which can be figured out by computing the cosine similarity against the available summaries. If the cosine similarity value between the newly arrived file and an already stored file is less than 1, the stored file's chunks are fetched for deduplication. If the cosine similarity is 1, both files are the same and only pointers are adjusted.
The Dynamic Content Adjustment Policy (DCAP) can be explained by considering two scenarios. The first scenario details the management of the boundary shifting problem for a file that has been modified by the insertion of new content, and the second portrays the boundary shifting problem for a file with deleted content.
(1) Dynamic Content Adjustment Policy (DCAP) Working Model. When a file named File Version 2 (FV2) has to be stored, it triggers the cosine similarity function and fetches File Version 1 (FV1) without altering its chunk structure. The system then performs DCAP with the help of FV1 and FV2, which is explained with an example. Figure 3 shows the new version of a file being generated by a typical user; the newly inserted characters and their hexadecimal values are marked in red. The objective of DCAP is to divide the content of FV2 into an equal number of chunks by identifying the newly modified content. The content of FV2 is compared with the content of FV1 using a sliding-window-like mechanism in which the content of FV2 is logically divided based on a window (W) whose size equals the size of a chunk of FV1; inside W, two small seek windows (SW) are available, and these SW can move across the window. The SW located at the start of W is called the beginning seek window (BSW), and the one at the end is called the end seek window (ESW). The SW play an important role in DCAP; their size is fixed dynamically based on the window size or the number of characters in the chunks, and it is mostly set to 64 bytes for larger files. In this example, the number of characters inside W is 6, so the BSW and ESW are set to 2. Figure 4 shows the two files with the window and its corresponding seek windows.
The file FV1 is taken as chunks, and the file FV2 is kept as a full file over which a window of size 6 rolls. The DCAP algorithm starts by examining the file from the start. The characters inside the BSW and ESW of both files are compared, and if they are the same, the characters under the current window over FV2 are grouped to form a chunk. In this example, as shown in Figure 4, due to the insertion of content in FV2, the BSW and ESW hold different characters compared with FV1, which indicates that the file has been modified.
Once the modification is figured out, the DCAP starts to compare the characters from the end of the file, as shown in Figure 5. Here, the last chunk of the file FV1 is C5, so the operation starts there, and the characters inside the BSW and ESW of FV2 are compared with those of FV1. In this case, both the SW of FV1 and FV2 hold the same characters, and the characters under the current window over FV2 produce the same checksum as C5, so the window is considered a chunk similar to C5 of FV1. If the checksum is not the same, the window is considered a new chunk. Then the window moves backward over FV1 and FV2 to generate chunks.
Finally, the window reaches the point where the modification has happened, which is shown in Figure 6. The SW hold a set of characters that indicate the occurrence of the modification: the characters under the BSW of FV1 and FV2 have different values, which indicates either that the modification has happened in a previous chunk, which may have pushed characters towards the lower chunks, or that the modification has happened at this same location.
In Figure 6, the characters {62, 69} under the ESW of FV1 have matching values at the same location in the ESW of FV2, and the chunks C3, C4, and C5 of FV1 are similar to the corresponding chunks of FV2, indicating that the modification has happened towards the beginning of the file.
The modification at the beginning of the file has pushed the characters forward, so the characters {61, 69} of the ESW of FV1 are missing from the ESW of FV2, since they have moved in the forward direction. In the search operation shown in Figure 7, the ESW searches for the set of characters {61, 69}, and the BSW searches for {73, 61}; the ESW moves in the forward direction and the BSW in the backward direction. On a successful search, all bits up to {61, 69} are grouped as C1, and all bits up to {73, 61} are taken as C2 of FV2, as shown in Figure 8, provided that the newly formed chunks C1 and C2 of FV2 are less than or equal to 10 KB in size. C1 and C2 are the modified chunks containing the newly inserted content. The system will only consider C1 and C2 of file FV2 for performing GD and storing.
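The seek-window comparison can be sketched roughly as follows; this is a strongly simplified illustration on byte strings, assuming equal-sized FV1 chunks, a single inserted region, and given window/seek-window sizes, and it ignores the checksum comparison and the 10 KB limit on new chunks.

```python
def dcap_boundaries(old_chunks, new_data, sw=2):
    """Re-chunk new_data (FV2) so unmodified regions reuse the chunk boundaries of FV1.

    old_chunks: equal-sized byte chunks of the stored version (FV1).
    new_data:   full content of the new version (FV2), assumed to hold one inserted region.
    Returns (chunk, reused) pairs; reused marks chunks taken over unchanged from FV1."""
    w = len(old_chunks[0])

    # forward pass: accept windows whose beginning/end seek windows match the FV1 chunk
    front = 0
    while front < len(old_chunks):
        window = new_data[front * w:(front + 1) * w]
        old = old_chunks[front]
        if len(window) == w and window[:sw] == old[:sw] and window[-sw:] == old[-sw:]:
            front += 1
        else:
            break

    # backward pass: do the same from the ends of both versions
    back = 0
    while back < len(old_chunks) - front:
        old = old_chunks[len(old_chunks) - 1 - back]
        end = len(new_data) - back * w
        window = new_data[end - w:end]
        if len(window) == w and window[:sw] == old[:sw] and window[-sw:] == old[-sw:]:
            back += 1
        else:
            break

    chunks = [(c, True) for c in old_chunks[:front]]
    middle = new_data[front * w:len(new_data) - back * w]
    if middle:
        chunks.append((middle, False))        # the modified region becomes the new chunk(s)
    chunks += [(c, True) for c in old_chunks[len(old_chunks) - back:]]
    return chunks
```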
(2) Implementation of DCAP for a Deleted File Version. When a node wants to upload a file to the storage system, the system first looks for a similar version of the file in the storage system. If it finds a file with content similar to the file that the user wants to upload, the stored file is provided in chunks for verification. Figure 9(a) shows a file version that is already available in the storage system as chunks; the file is named FV1, and the chunks are named FV1C1 to FV1C5. The DCAP is associated with a window (W) that rolls over the files, and the beginning seek window (BSW) and the end seek window (ESW) are smaller windows that perform the matching function by seeking. A chunk of FV1 has 7 characters, so the window size is set to 7, and the seek windows are set to 2, as shown in Figure 10. The DCAP tries to find the deleted portion in FV2 to form a chunk while retaining the remaining portion the same as in FV1; the working mechanism is as follows: (i) The DCAP starts from the beginning of the file, as shown in Figure 10. The rolling window covers 7 characters of FV2, which are compared with the first chunk FV1C1 of FV1. Since both files have similar content inside the SW, their checksum values are compared, and if they are found to be the same, the content under the window over FV2 is converted to a chunk of 7 characters that is similar to FV1C1. If they do not produce the same checksum, the content under the window over FV2 is taken as a new chunk.

In Figure 14(a), the BSW of FV2 finds {6c, 65}, and this is combined with the current window to form a chunk. The remaining characters form a new chunk as shown in Figure 14, and the newly formed chunks should be less than or equal to 10 KB in size.
The search for the missing content in FV2 by the BSW moves backward until it finds a value matching the value under the BSW of FV1, and the search stops once the correct value is found. In some cases, if the character under consideration has been deleted, the search operation fails. In such a situation, the DCAP tries to find a possible way to form a chunk similar to that of its older version, as detailed next. For example, consider two versions of a file, the old version file (OVF) and the new version file (NVF). The OVF has 6 chunks; the first 2 chunks match those of the NVF, and the 3rd chunk contains the modification, so DCAP starts to perform its operation from the end of the file. The last window is unable to produce a chunk due to different values under the BSW or ESW of the NVF and OVF. In this situation, the BSW of the NVF will try to find the value under

Fuzzy C-Means Cluster Assist Indexing.
The performance of information retrieval is an important factor that directly affects the deduplication throughput, so it has to be addressed with extra care. The deduplication process is designed to handle text data, called documents or files, which can be grouped based on the similarity that exists among them. The grouping of documents is termed clustering, and it is performed based on the relationships preserved among the documents. The relationship can be drawn from the terms and the strong semantic connections among the terms using LSI [27], and fuzzy C-means clustering can then be applied to group similar documents. The mechanism behind document clustering is based on the term frequencies extracted from all available documents, grouping documents that share more similar terms into one cluster. The texts in the documents are converted into a vector space, which grows along with the number of documents. A large vector becomes more complex to cluster, and the larger search space leads to inaccurate matching at the term level; inaccurate matching leads to the clustering of irrelevant documents. The application of latent semantic indexing (LSI) to the document vectors reduces the space, and more semantically related documents are grouped [28]. Therefore, clustering is performed after preprocessing the documents based on LSI.

Latent Semantic Indexing.
The documents are associated with the summary extracted from the unique sentences available in them, and this summary is utilized for further processing. The first step is the tokenization process, in which the files are divided into sentences called tokens and the stop-words are removed to extract more unique words; the collection of tokens is called a bag of words. Then the term frequency (TF) is calculated based on equation (1), which brings out only the most common words of a particular summary. The second step is to create the IDF, which contains only the more unique terms among all available summaries of the n documents, computed based on equation (2). Then the TF and IDF values are used to generate the vector space based on equation (3), and this vector space is called the document vector.
The document vector space is very large, and it keeps increasing each time a new document is added to the system. Singular value decomposition (SVD) is used to reduce the document vector space and increase document retrieval accuracy; SVD serves as the tool for LSI. A plain document vector often fails to bring out the hidden concepts when retrieval algorithms are applied. Let us consider three terms {t1, t2, t3} and documents {D1, D2, D4} in which the term t2 appears in D1 and D4; these two documents are logically related but not necessarily semantically connected. In this case, if a retrieval query Q arrives with the term t2 in it, both D1 and D4 are returned as matching, but D1 is the only document that semantically matches the query Q. Thus, LSI maps all terms and documents to a semantically connected vector space. The arrival of a new document induces changes that have to be made to the existing vector space without recalculating the SVD, using a concept called folding-in; for more details refer to [29]. LSI is beyond the scope of this paper, so the reader may refer to [44] for more information.
(1) Singular Value Decomposition for Latent Semantic Indexing. SVD is a factorization technique performed on the matrix values to group documents into a defined number of topics. SVD is provided with the document matrix A of size m × n along with the number of topics C. The dimensions m and n correspond to the terms and the document collection built from the summaries provided at the initial stage. SVD produces three matrices as shown in the following equation:
$$A = U \Sigma V^{T},$$
where U is the eigenvector matrix of $AA^{T}$, V is the eigenvector matrix of $A^{T}A$, and Σ is the diagonal matrix of singular values obtained as the square roots of the eigenvalues of $A^{T}A$. Σ contains some singular values that are small enough to be negligible; since the values along the diagonal of Σ are listed in descending order, only the top r values are kept, which reduces the matrices U, Σ, and $V^{T}$ to $U_r$, $\Sigma_r$, and $V_r^{T}$; that is, the first r columns of U and the first r rows of $V^{T}$ are kept. This process is named truncated SVD. The approximated matrix $A_r$ is shown in the following equation:
$$A_r = U_r \Sigma_r V_r^{T}.$$
The following example demonstrates SVD. Table 1 contains nine documents and their bags of words, which are tokenized words. This bag of words is used to calculate TF-IDF, and the resultant matrix is called the document vector. Table 2 shows the ranking of the documents; that is, {D1, D2, D7} have the maximum scores for topic 1, and {D3, D4, D8, D9} and {D5, D6} fall in topics 2 and 3, respectively, based on the values they produce. Table 3 shows a new document D10, named the pseudo-document (PD), whose terms {John, gold, Juliet} match the terms {T9, T10, T11} already available in the term vector space. The vector space for PD is then calculated by taking the average of the weighted terms {T9, T10, T11}, as shown in Table 3, and from the result {0.05, 0.22, 0.16}, it is clear that the document PD has the maximum score for topic 2.
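A minimal NumPy sketch of this step is shown below; the topic count and the random toy matrix are illustrative, and the paper's actual TF-IDF weighting and folding-in step are not reproduced here.

```python
import numpy as np

def lsi_document_vectors(tfidf_matrix, num_topics):
    """Truncated SVD: reduce a (terms x documents) TF-IDF matrix to topic space."""
    U, s, Vt = np.linalg.svd(tfidf_matrix, full_matrices=False)  # singular values in descending order
    Ur, sr, Vtr = U[:, :num_topics], s[:num_topics], Vt[:num_topics, :]
    doc_vectors = (np.diag(sr) @ Vtr).T    # each row: one document expressed in topic space
    term_vectors = Ur                      # each row: one term expressed in topic space
    return term_vectors, doc_vectors

# toy usage with a random 16-term x 9-document matrix and 3 topics
A = np.random.rand(16, 9)
terms, docs = lsi_document_vectors(A, num_topics=3)
```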

Fuzzy C-Means Cluster.
To group similar files in the metadata table, FCM clustering is used. The resultant matrix V of LSI is a representative of the document collection and is given as the input $X = \{x_1, x_2, \ldots, x_n\}$ for FCM. The main motive of the FCM algorithm is to minimize the objective function [30], shown in the following equation:
$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{K} \mu_{ij}^{m} \, \|x_i - c_j\|^{2},$$
where $\|x_i - c_j\|$ is the Euclidean distance between the ith data point and the jth cluster centre, m is the fuzzifier whose value is set to 2, n is the number of data points in X, K is the number of clusters, and $\mu_{ij}$ is the membership degree of the ith data point in the jth cluster. The steps of the FCM algorithm are as follows:
(1) Initialize the parameters: the number of clusters (K), the initial membership matrix ($\mu^{(0)}$), the cluster centres $C = \{c_1, c_2, \ldots, c_K\}$, the termination criterion (β), and the maximum number of iterations (Max_Iter).
(2) Calculate the cluster centres:
$$c_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{m} x_i}{\sum_{i=1}^{n} \mu_{ij}^{m}}.$$
(3) Calculate the Euclidean distance between each data point in X and each cluster centre in C.
(4) Update the membership matrix ($\mu_{ij}$):
$$\mu_{ij} = \frac{1}{\sum_{k=1}^{K} \left( \|x_i - c_j\| / \|x_i - c_k\| \right)^{2/(m-1)}}.$$
(5) If $\|\mu^{(k+1)} - \mu^{(k)}\| < \beta$, then STOP; otherwise go to Step 2.
Here, k is the iteration step, and β is set to 0.01. Similarly, when there are no changes in the cluster centres [31], the algorithm can be stopped. Each of the specified clusters will then contain its corresponding documents, and the clusters formed with their associated documents are recorded.
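A compact NumPy sketch of the iteration described above is given below; the initialization and stopping logic follow the steps listed, with m = 2 and β = 0.01 as in the text, but the code is an illustration rather than the system's implementation.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, beta=0.01, max_iter=100, seed=0):
    """Fuzzy C-means over document vectors X (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = rng.random((n, k))
    mu /= mu.sum(axis=1, keepdims=True)                        # initial memberships, rows sum to 1

    for _ in range(max_iter):
        w = mu ** m
        centres = (w.T @ X) / w.sum(axis=0)[:, None]           # step 2: cluster centres
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12  # step 3
        inv = dist ** (-2.0 / (m - 1.0))
        new_mu = inv / inv.sum(axis=1, keepdims=True)          # step 4: membership update
        if np.linalg.norm(new_mu - mu) < beta:                 # step 5: termination test
            mu = new_mu
            break
        mu = new_mu
    return centres, mu

# usage: cluster the LSI document vectors into 3 groups
# centres, memberships = fuzzy_c_means(doc_vectors, k=3)
```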

Fingerprint Indexing.
The DCAP algorithm produces chunks, and these chunks have to be stored in the data nodes. Deduplication techniques work by finding similar chunks through comparison, and to speed up this operation, a fingerprint is computed for all chunks stored in the data node. The fingerprint is an important entity in this system and is generated using the SHA-3 (256-bit) algorithm; for this computation, the Keccak approach, which is based on the sponge construction, is used. SHA-3 uses the sponge to absorb the data and squeeze out the output. The message is subjected to various transformations: in the first phase, the message is divided into blocks, which are XOR'ed into a subset of the state [7], and then the entire state is transformed using the permutation function. In the squeeze phase, the output is fetched from the same state obtained from the transformation operation. Once the fingerprints are calculated, they have to be properly indexed for faster processing, so fuzzy C-means cluster assist indexing (FCMI) is used.
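Computing such a fingerprint is straightforward with a standard library; the sketch below uses Python's hashlib SHA3-256 for the chunk fingerprint and a plain dictionary as the index, which are assumptions about tooling rather than the paper's exact implementation.

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    """SHA-3 (256-bit) fingerprint of a chunk, used as its unique identifier."""
    return hashlib.sha3_256(chunk).hexdigest()

# usage: identical chunks yield identical fingerprints, so duplicates are found by lookup
fp_index = {}
for chunk in (b"once upon a time", b"in a faraway land", b"once upon a time"):
    fp = fingerprint(chunk)
    if fp in fp_index:
        fp_index[fp]["refs"] += 1                      # duplicate: only adjust the reference count
    else:
        fp_index[fp] = {"data": chunk, "refs": 1}      # new chunk: store it and index it
```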
FCMI is performed in two phases: in the first phase, the LSI-ranked documents are clustered, and in the second phase, the fingerprint values associated with the chunks of the files inside a cluster are indexed based on a B-tree.

Granularity Tweak Based on Base and Deviation.
The chunks produced by the DCAP algorithm are subjected to hashing, which acts as the first-stage index, as will become clear at a later stage. The raw chunks are then subjected to base and deviation separation. The base and deviation are entities that can be reversed to produce the original chunk, and this can be utilized for storage space optimization beyond deduplication. In normal hash-based deduplication, identical chunks are eliminated, but the less similar chunks find their way into the storage system, which leads to poor storage optimization. Storage optimization can still be finely tuned with the help of the base and deviation concept, in which the less similar chunks are separated, which considerably reduces the storage requirement. The idea is that each chunk is divided into a pair of a base and a deviation, in which the base contains the majority of the information and the deviation comparatively little. If two chunks are similar, it is enough to store a single base and their two different deviations; in this way, deduplication can be performed on the stored bases, which considerably reduces the storage cost. The transformation of a chunk to a base is a many-to-one mapping; the algorithm takes a chunk as input and produces the base and the deviation, which captures the difference between chunks, and such transformations are performed by error-correcting algorithms.

Generalized Deduplication Algorithm.
In this section, a generalized deduplication mechanism is presented whose objective is to reduce the size of the chunks. Let there be n chunks in the system, set_C = {C_1, C_2, ..., C_n}, and let each chunk be k bits long, so storing the chunks costs n·k bits. To picture the case, consider a typical system that produces 100 chunks of 64 bytes each at time t; the storage space requirement is then 100 * 64 = 6400 bytes, and this can be further reduced with the help of GD. The GD system divides each chunk into two parts, base (B) and deviation (D). For some K distinct bases of q_k bits each, the cost of storing the bases consists of the base bits plus their identifiers. If there are N chunks, each chunk additionally carries a ⌈log N⌉-bit identifier and its deviation, which gives the per-chunk storage and, summed over the N chunks, the total cost of the GD representation. For comparison, the cost of storing the N chunks without GD is N·k bits. The compression factor, expressed in equation (19), is the ratio of the two and accounts for both the chunks and the B, D pairs; CF > 1 states that compression has occurred.
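The arithmetic can be illustrated with a small worked example; the split of each 64-byte chunk into a 60-byte base and a 4-byte deviation and the count of 20 distinct bases are assumed numbers chosen only to show how the compression factor behaves.

```python
import math

# assumed example figures, not taken from the paper's evaluation
N = 100                     # chunks produced at time t
chunk_bits = 64 * 8         # each chunk is 64 bytes
base_bits = 60 * 8          # assumed base size after the GD transform
dev_bits = 4 * 8            # assumed deviation size
K = 20                      # assumed number of distinct bases among the 100 chunks

cost_without_gd = N * chunk_bits                         # plain storage: N * k bits
id_bits = math.ceil(math.log2(N))                        # base identifier carried by each chunk
cost_bases = K * (base_bits + id_bits)                   # each distinct base stored once, with its id
cost_chunks = N * (id_bits + dev_bits)                   # per chunk: base id + its deviation
cost_with_gd = cost_bases + cost_chunks

cf = cost_without_gd / cost_with_gd                      # CF > 1 means compression has occurred
print(f"without GD: {cost_without_gd} bits, with GD: {cost_with_gd} bits, CF = {cf:.2f}")
```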
Deduplication System. Data deduplication is a technique that eliminates redundant or duplicated data and stores a unique copy of the data, which considerably reduces storage [32]. Data deduplication needs to be performed before inserting data into the data store. In this system, deduplication is performed based on the base and deviation of the chunks; before performing deduplication, the system is preprocessed to enable smooth processing. The sequence starts with the chunking of files based on DCAP or user-defined chunking, followed by fingerprint calculation. Then the base and deviation of each chunk are calculated; on the other side, the documents are clustered, and the fingerprints are indexed using the B-tree, as shown in Figure 15. Files seeking storage space are collected from the file pool, and these newly arrived files are subjected to summary extraction, which helps disclose their similarity with files already stored; if this search is positive, DCAP is performed, producing a set of chunks, and if it is negative, the chunks are produced by the user-defined chunking algorithm (see Figure 16). The base and deviation are derived from the chunks of the file and act as the main basis of the deduplication system. Two identical chunks produce the same base and deviation, while two less similar chunks produce the same base with different deviations, and a chunk can be reconstructed from its base and deviation. The deduplication system works in the following manner:
(1) The chunks of a particular file are subjected to fingerprint calculation.
(2) The base and deviation are computed.
(3) The file under consideration is searched for a suitable domain among the available clusters. On finding a suitable cluster, the fingerprint is searched for a match, based on which the following activities are performed:
(a) If the fingerprint is matched, there is a matching chunk, so it is not necessary to store the base and deviation of the current chunk and only the reference pointer is adjusted; that is, the current chunk is deduplicated.
(b) If fingerprint matching fails, the chunk is new, so a new index entry is created and the base pool is queried for a match:
(i) If the base pool search succeeds, there is a chunk with similar data, so the existing base is taken as a reference together with a new deviation.
(ii) If the base pool search fails, a new base and deviation are generated and stored in the system.
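The decision flow above can be sketched as follows; the in-memory dictionaries standing in for the fingerprint index and the base pool, and the simplified bit-mask base/deviation split, are illustrative assumptions rather than the system's actual data structures.

```python
import hashlib

fingerprint_index = {}     # fingerprint -> chunk record (stand-in for the B-tree index)
base_pool = {}             # base fingerprint -> stored base

def split_chunk(chunk: bytes):
    """Toy base/deviation transform: high bits of each byte form the base, low 2 bits the deviation."""
    return bytes(b & 0xFC for b in chunk), bytes(b & 0x03 for b in chunk)

def deduplicate(chunk: bytes):
    """Store a chunk as (base, deviation) only when no identical chunk or similar base exists."""
    fp = hashlib.sha3_256(chunk).hexdigest()                 # step (1): fingerprint of the chunk
    base, deviation = split_chunk(chunk)                     # step (2): base and deviation
    if fp in fingerprint_index:                              # step (3a): identical chunk already stored
        fingerprint_index[fp]["refs"] += 1
        return "duplicate chunk, reference pointer adjusted"
    base_fp = hashlib.sha3_256(base).hexdigest()
    if base_fp in base_pool:                                 # step (3b i): similar chunk, reuse its base
        fingerprint_index[fp] = {"base": base_fp, "deviation": deviation, "refs": 1}
        return "similar chunk, new deviation stored against existing base"
    base_pool[base_fp] = base                                # step (3b ii): entirely new content
    fingerprint_index[fp] = {"base": base_fp, "deviation": deviation, "refs": 1}
    return "new base and deviation stored"
```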

Data Placement Policy.
The CDGT is designed to work on text-based files, and it is well integrated with the existing Hadoop architecture. The base and deviation are the entities stored in and retrieved from HDFS (Hadoop distributed file system), and the experimental setup consists of a name node (NN) and a set of data nodes (DN). The NN acts as the master, and the DN are slaves used to store the data, in our case the bases and deviations; the fingerprint index is maintained in MongoDB for faster processing. This system utilizes the in-line deduplication mechanism, where deduplication is performed before the data are written to disk. The data flow is explained as follows.
(i) The base and deviation that are produced and finalized are the entities to be stored in the DN.
(ii) When a client uploads a file, the agent module in the proposed deduplication ecosystem triggers the deduplication process, as shown in Figure 17.
(iii) The files that reach the agent module in the name node get deduplicated.
(iv) The final base and deviation that come out of the deduplication system are stored in the DN.
(v) In the case of an identical base, the reference is adjusted and only the corresponding deviations get stored in the DN.
(vi) The name node manages the storage of data, and the indexes are maintained in MongoDB.

Results and Discussion
Performance of Summary Generation. The proposed system involves a large number of files to facilitate versioning, and it uses summary extraction based on NLP, which is considered more advantageous than other non-NLP methods. The performance of the proposed summary extraction method is evaluated based on parameters [33] such as precision (P), recall (R), and F-measure, which are widely used in the field of information retrieval. Let the summary extracted by NLP be given by SNLP and the non-NLP summary by SnNLP, as shown in equations (20)-(22). The ROUGE method is also utilized to evaluate the performance; this method measures the quality of a summary by counting the overlapping units such as N-grams, word pairs, and sequences between the proposed and reference summaries based on the DUC 2004 dataset [34]. ROUGE-N is used to compare the N-grams between two summaries [35]. Table 4 shows the performance improvement of the NLP method.

Performance Evaluation of DCAP.
To evaluate the system, the experiment is carried out using Pentium E5700 and Intel i5 9th generation processors, since the system is built around commodity hardware. The data nodes are created virtually in Hadoop and are allocated disk space based on the data set used to evaluate the parameters. The developed system is designed to work in an interactive environment, so realistic data sets, namely backups of text-based files from 30 Windows and Linux users, are collected at various time intervals, amounting to 37 GB, 75 GB, and 98 GB. The proposed system uses the Dynamic Content Adjustment Policy (DCAP), whose performance is compared based on the following parameters: (i) chunk overhead and (ii) hash judgment time. The chunk is the processing unit in any typical deduplication system, so the number of chunks has a direct impact on performance. The proposed DCAP algorithm produces a smaller number of chunks than the other methods. DCAP uses the versioning concept, which is mainly responsible for producing a lower number of chunks: for a 15 MB file, DCAP produces 64% and 75% fewer chunks than fixed-size chunking (FSC) and content-defined chunking (CDC), respectively. When FSC is compared with CDC, FSC performs better by producing 39% fewer chunks than CDC for a 23 MB file, but when compared with DCAP, FSC outputs 74% more chunks for the 23 MB file. The higher number of chunks produced by CDC compared with FSC is because CDC examines the content for breakpoints and chunks the file based on them. FSC produces chunks of a fixed size that is set to fine-tune the deduplication elimination ratio (DER), and if the chunk size is set to a few KBs, the chunk overhead increases; the same phenomenon is seen in CDC. For 36 MB files, FSC produces 61% more chunks than DCAP, and CDC produces 75% more. Overall, DCAP produces on average 66% and 75% less chunk overhead than FSC and CDC, respectively, as shown in Figure 18. The lower chunk overhead of DCAP is due to the versioning of files, which helps to improve the processing efficiency of CDGT.

Hash Judgment Time.
The hash judgment time is defined as the time needed to find the duplicate entity for a newly calculated chunk in the already stored bases by looking up the index table, which is built from the hash values of the chunks and the bases. The efficiency of the hash judgment time depends on the nature of the index. The following evaluation parameters are taken into consideration, and Table 5 shows the operating modes.
(i) Hash judgment time in normal mode
(ii) Hash judgment time in GD mode
(iii) Hash judgment time in normal mode with FCMI
(iv) Hash judgment time in GD mode with FCMI
The hash judgment time is measured for both methods, that is, normal deduplication without GD and deduplication based on GD. The hash judgment time in the normal mode without FCM (H-NoFCM-NM) is on average 47.48 ms longer than the hash judgment time with FCM in the normal mode (H-FCM-NM); FCM indexes the entries, so the search time reduces considerably, which is seen in the lower hash judgment time of H-FCM-NM. The same kind of result is seen with deduplication based on GD; that is, hash judgment in the GD mode without FCM (H-NoFCM-GD) takes on average 20.66 ms more than hash judgment in the GD mode with FCM (H-FCM-GD). The GD-based method takes comparatively less time than normal deduplication to determine the duplicate contents; an average difference of 26.82 ms is seen between them, as shown in Figure 19. The reduction in hash judgment time under GD comes from GD reducing the number of chunks by working with finer granules, namely the base and deviation.

Performance Evaluation of GD.
The deduplication elimination ratio (DER) is defined as the ratio between the file size before deduplication (FSBD) and the file size after deduplication (FSAD), as shown in equation (23). The following evaluation parameters are taken into consideration:
(i) DER in normal mode
(ii) DER in GD mode
(iii) Saving percentage (SP)
$$\mathrm{DER} = \frac{\text{File size before deduplication}}{\text{File size after deduplication}}, \quad (23)$$
$$\mathrm{SP} = \left(1 - \frac{\text{Compressed data length}}{\text{Uncompressed data length}}\right) \times 100. \quad (24)$$
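Both metrics can be computed directly from the sizes before and after deduplication; the 100 GB / 60 GB figures below are made-up numbers used only to show the calculation.

```python
def dedup_metrics(size_before, size_after):
    """Deduplication elimination ratio (equation (23)) and saving percentage (equation (24))."""
    der = size_before / size_after
    sp = (1 - size_after / size_before) * 100
    return der, sp

# assumed example: a 100 GB backup reduced to 60 GB after deduplication
der, sp = dedup_metrics(100, 60)
print(f"DER = {der:.2f}, saving percentage = {sp:.1f}%")   # DER = 1.67, SP = 40.0%
```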
Table 6 shows the DER comparison between deduplication with and without GD for a set of files of varying sizes. The system based on GD shows an improvement in DER of about 10.98%, which helps to reduce the storage space considerably.
The saving percentage (the proportion of storage space saved) increases by an average of 6.72% when GD is employed in the system. The proposed system thus improves storage utilization, as shown in Figure 20.

Conclusion and Future Enhancement
The content deduplication with granularity tweak based on base and deviation is a deduplication system designed to improve the deduplication efficiency in terms of the deduplication elimination ratio (DER) and to optimize storage utilization. The implemented system employs the Dynamic Content Adjustment Policy as its chunking method, assisted by summary extraction based on NLP. The utilization of the cosine similarity mechanism improves the matching accuracy while performing versioning, which puts the proposed system in a promising position to achieve a higher DER. The novel chunking mechanism DCAP produces minimal chunks, which considerably brings down the overhead in terms of computational complexity. The fingerprints of the chunks and the base/deviation pairs are organized into clusters using FCM, and their entries are ordered, which brings down the search time and increases the accuracy rate irrespective of the size of the cluster and index; an average efficiency of 68.5% is recorded. The most important performance parameter of any deduplication system is the deduplication elimination ratio (DER), so this paper focuses on the improvement of DER. GD provides such a facility and fine-tunes the DER, as can be seen in the results section: the GD-based DER shows an improvement of about 10.98% over the deduplication system without GD, and GD improves the storage utilization by 6.72% compared with the system without GD.

Data Availability
The data used to support the findings of this study have not been made available because they contain personal files (The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, EMC Digital Universe with Research and Analysis by IDC, [Online]. Available: https://www.emc.com/leadership/digitaluniverse/2014iview/executive-summary.htm).

Conflicts of Interest
The authors declare that they have no conflicts of interest.