A Fragile Zero Watermarking Scheme to Detect and Characterize Malicious Modifications in Database Relations

We put forward a fragile zero watermarking scheme to detect and characterize malicious modifications made to a database relation. Most existing watermarking schemes for relational databases introduce intentional errors or permanent distortions as marks into the original database content. These distortions inevitably degrade data quality and usability because the integrity of the relational database is violated. Moreover, the existing fragile schemes can detect malicious data modifications but do not characterize the tampering attack, that is, the nature of the tampering. The proposed fragile scheme is based on a zero watermarking approach. In zero watermarking, the watermark is generated (constructed) from the contents of the original data rather than by introducing permanent distortions as marks into the data. As a result, the proposed scheme is distortion-free; thus, it also resolves the inherent conflict between security and imperceptibility. The proposed scheme also characterizes the malicious data modifications to quantify the nature of tampering attacks. Experimental results show that even minor malicious modifications made to a database relation can be detected and characterized successfully.


Introduction
Digital watermarking is a class of information hiding techniques that provides measures for copyright protection, broadcast monitoring, covert communication, copy control, tamper detection, and integrity proof of digital assets. Watermarking techniques were primarily proposed for multimedia content [1][2][3][4]; however, in the last decade, the research community has extended these techniques to relational databases for copyright protection, tamper detection, and integrity proof. Most of the existing watermarking schemes for relational databases [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] introduce intentional errors or distortions as marks in the underlying data, within some error tolerance, so that the marks do not significantly affect the usefulness of the data. However, this degrades data quality because the integrity of the relational database is violated. A large collection of real-world datasets has a strong usability constraint that disallows any permanent distortions or intentional errors. For example, safety-critical datasets are designed to minimize errors rather than to introduce intentional ones. Similarly, a business application may require that local properties like item-cost, ordered-quantity, and so forth, are preserved as well as global properties like the natural join between item and sales, employees and department, and so forth. Moreover, in business datasets, semantic constraints must not be violated, for example by a dissimilarity in attribute values for two similar transactions [21]. Query processing is sensitive to selection criteria and has well-defined semantics; therefore, watermarking schemes that introduce distortion into the original database content are not appropriate for certain applications.
The initial work on fragile watermarking schemes addressed images [29][30][31] and was later extended to audio [32,33] and video [3,34]. More recently, the importance of other data domains has been recognized, and fragile schemes for text [35,36] and relational databases [17][18][19][20][22][23][24][25] have been proposed. Like robust schemes, most of the fragile schemes for relational databases [17][18][19][20] introduce distortion into the original database content, which degrades data quality and affects data usability. These schemes use the content characteristics of the database relation itself to create a secure hash (used as a watermark), which is stored in the Least Significant Bits (LSBs) of the original database content, thus introducing distortion.
A fragile watermarking scheme presented by Guo et al. [17] detects malicious modifications made to a database relation. In their scheme, watermark generation is based on the content characteristics of the database relation itself. The generated watermarks are embedded in at most two LSBs of all attributes in the database relation, which introduces considerable distortion in the original database content. The fragile scheme presented by Khataeimaragheh and Rashidi [18] is also a distortion-based scheme for integrity proof of database relations. As in [17], the watermarks are embedded in at most two LSBs of all attributes in the relation, forming a two-bit watermark grid. The fragile scheme presented by Iqbal et al. [19] logically partitions the database relation into three groups and generates self-constructing fragile watermark information from each group. The generated watermarks are embedded in the LSBs of numerical attributes in each group, which introduces distortion in the original database content. Prasannakumari [20] presented a fragile scheme for tamper detection in database relations. This technique also introduces distortion, as it inserts a fake attribute into the database relation to act as a watermark. The data values for the newly inserted attribute are determined by applying an aggregate function to the original database content.
Besides distortion-based techniques, some researchers have also presented distortion-free fragile watermarking schemes [22][23][24][25] for integrity proof of database relations. The main feature of these schemes is that watermark embedding is in fact a reordering of tuples or attributes based on the content characteristics of the database relation. A fragile scheme proposed by Li et al. [22] detects and localizes malicious modifications made to database relations. Their scheme partitions the database relation into disjoint groups, and the watermark is embedded and verified in each group independently. The watermark is embedded as a tuple reordering: the order of each tuple pair in a group is changed or left unchanged depending on the tuple hash values and the corresponding group hash value. Although their technique does not introduce any distortion in the database relation, it works only for categorical data. Kamel [23] presented a fragile scheme to protect the integrity of database relations. The scheme divides the database relation into groups, and each group is marked independently. As in [22], watermark embedding is a reordering of the tuples in each group that corresponds to the value of some secret watermark. The fragile scheme proposed by Bhattacharya and Cortesi [24] detects malicious modifications in database relations having categorical attributes. Their scheme divides the database relation into groups on the basis of categorical attribute values. As in [22,23], the tuple hash value is used to obtain a watermark as a permutation of tuples. A fragile zero watermarking scheme is presented by Hamadou et al. [25] for authentication of database relations. Their technique is distortion-free and is based on an attribute reordering method. Initially, the attributes of the database relation are virtually sorted on the hash values of the attribute names to define a secret initial order of attributes. For each attribute in the database relation, the Most Significant Bits (MSBs) are extracted and used for watermark generation. The generated watermark is then registered with the Certification Authority (CA) for certification purposes. As their technique is based on virtual sorting of attributes by their names, any change in an attribute name by an attacker would cause the tamper detection process to fail.
In the previous discussion, we have identified two important issues in existing fragile watermarking schemes. First, the distortion-based fragile schemes [17][18][19][20] inevitably degrade data integrity and thus affect data usability; therefore, these schemes are not applicable to non-error-tolerant data such as safety-critical datasets. Second, although some fragile schemes like [22][23][24][25] are distortion-free, their watermarking approach is based on reordering of tuples or attributes, so they are vulnerable to sorting attacks. Also, if a modification is small enough that it does not affect the order of tuples, tamper detection would fail. To address these issues, we propose a fragile scheme based on a zero watermarking approach that does not modify any part or property of the database relation itself; therefore, the proposed scheme assures imperceptibility and overcomes the data integrity and data usability weaknesses of existing fragile watermarking schemes. Also, the proposed scheme is independent of tuple ordering as well as attribute ordering and naming, so it is not vulnerable to sorting attacks. Watermark generation in the proposed scheme is based on algorithmically evaluating local characteristics of the database relation, namely the frequency distribution of digits, lengths, and ranges of data values. This enables us to characterize the malicious data modifications on parameters like the fraction of digits, lengths, and ranges of data values attacked, the type of attack (insertion, deletion, or update), and the effect of the attack (low to high, high to low, or no change) on data values. Also, to the best of our knowledge, there is no other distortion-free fragile watermarking scheme that can characterize tampering attacks, that is, the nature of the tampering. Experimental results show that the proposed scheme can detect and characterize malicious data modifications successfully.

Materials and Methods
In this section, we present our proposed fragile zero watermarking scheme to detect and characterize malicious modifications made to a database relation. The proposed scheme exhibits the following important properties of a fragile watermarking system as discussed in [17].
(1) Fragility. The proposed scheme is designed to be fragile; that is, if there are any malicious data modifications, the embedded watermark is not detectable (destroyed).
(2) Imperceptibility. As the proposed scheme is based on zero watermarking approach, it does not introduce any distortion in the underlying data; therefore, the embedded watermark is invisible or imperceptible.
(3) Key-Based System. The watermark generation and verification in the proposed scheme is a key-based system. Also, to detect and characterize malicious data modifications, a secret key is required.
(4) Blindness. In the proposed scheme, the original database relation is not required to detect and characterize malicious data modifications.
(5) Tuple and Attribute Ordering. The existing fragile schemes are based on tuple ordering [22][23][24] or on attribute ordering and naming [25]. The proposed scheme is independent of tuple and attribute ordering, so it is not vulnerable to sorting attacks.
(6) Characterization. The proposed scheme not only detects but also characterizes the malicious data modifications in a database relation to quantify the nature of tampering attacks.

Watermark Generation.
Let R be a database relation with primary key PK and ν attributes, denoted by R(PK, A1, A2, ..., Aν). Watermark generation in the proposed scheme is based on the content characteristics of numeric data values, so we assume that some attributes of the database relation are numeric. Figure 1 shows the watermark generation process, which comprises subwatermark generation for the digit count, length, and range of data values. The generated watermark is registered with the Certification Authority (CA) for certification purposes. Table 1 presents the list of notations used in our algorithms and discussion.
The algorithm for watermark generation is presented in Algorithm 1. At lines 1-3, the digits, lengths, and ranges of data values in a database relation are algorithmically evaluated to generate the subwatermarks, as presented in Algorithms 2-4. These subwatermarks are then used to generate the database relation watermark, as shown at line 4. At line 5, the relation watermark is encrypted with a secret key SK known only to the database owner. We assume that the secret key is selected from a key space large enough that it is computationally infeasible for an attacker to guess the key. At lines 6-7, the encrypted relation watermark is concatenated with the owner ID and a date and time stamp to generate a watermark certificate, which is then registered with the CA for certification purposes before the database is published.
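The certificate assembly described for Algorithm 1 can be sketched roughly as follows. This is only an illustration under several assumptions that the text does not fix: the subwatermarks are taken to be plain strings that are simply concatenated, the cipher is replaced by a toy XOR keystream derived from SHA-256 (a stand-in that is not secure), and the "|" field separator and all function names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def _keystream(secret_key: bytes, n: int) -> bytes:
    """Derive n bytes from the key (toy construction, NOT a secure cipher)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(secret_key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def toy_encrypt(data: bytes, secret_key: bytes) -> bytes:
    """XOR stand-in for the unspecified cipher; applying it twice decrypts."""
    stream = _keystream(secret_key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def make_certificate(digit_wm, length_wm, range_wm, secret_key, owner_id,
                     timestamp=None):
    """Build a watermark certificate string from three subwatermark strings."""
    # Line 4: combine the subwatermarks (concatenation is an assumption).
    relation_wm = digit_wm + length_wm + range_wm
    # Line 5: encrypt the relation watermark with the owner's secret key.
    encrypted = toy_encrypt(relation_wm.encode(), secret_key)
    # Lines 6-7: attach owner ID and a date and time stamp.
    timestamp = timestamp or datetime.now(timezone.utc).isoformat()
    return f"{encrypted.hex()}|{owner_id}|{timestamp}"
```

In a real deployment the toy cipher would be replaced by a vetted symmetric cipher; the sketch only shows which pieces go into the certificate and in what order.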
Algorithm 2 generates a digit subwatermark based on the digit frequency of all data values present in a database relation. At lines 1-3, the length of each data value is determined, which is then used to extract the individual digits, as shown at lines 4-5. Lines 6-7 compute the frequency of each digit and the total number of digits present in the database relation. At line 11, the relative frequency of each digit is determined, which is then used to generate the digit subwatermark, as shown at line 13. At lines 15-16, the digit subwatermark is concatenated with the total digit count and returned to the watermark generation algorithm. Note that the digit subwatermark is composed of the relative frequency of each digit and the total count of all digits; this information is used for the characterization of attacks, as discussed in Section 3.
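A minimal sketch of the digit statistics behind the digit subwatermark might look like the following. It assumes integer data values and returns the relative frequencies and the total digit count as a Python tuple; how the paper serializes these into a watermark string is not specified here, and the function name is hypothetical.

```python
from collections import Counter


def digit_subwatermark(values):
    """Return (relative frequency per digit, total digit count)."""
    counts = Counter()
    for v in values:
        # Extract the individual digits of each value (sign is ignored).
        counts.update(str(abs(int(v))))
    total = sum(counts.values())
    if total == 0:
        return {}, 0
    # Relative frequency of each digit 0-9, including digits never seen.
    rel = {d: counts.get(d, 0) / total for d in "0123456789"}
    return rel, total
```

For example, `digit_subwatermark([120, 99, 7])` sees six digits in total, two of which are 9.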
The subwatermark generation for the length of data values in a database relation is presented in Algorithm 3. At lines 1-3, the length of each data value is determined. Lines 4-5 determine the frequency of each length and the total count of data value lengths present in the database relation. At line 9, the relative frequency of each length is computed, which is then used to generate the length subwatermark, as shown at line 10. At lines 12-13, the length subwatermark is concatenated with the total length count and returned.
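The length statistics can be sketched in the same spirit. The snippet below assumes integer data values and measures length as the number of decimal digits; the function name and the tuple return shape are illustrative choices, not the paper's encoding.

```python
from collections import Counter


def length_subwatermark(values):
    """Return (relative frequency per value length, total length count)."""
    # Length of a value = number of decimal digits, ignoring the sign.
    lengths = Counter(len(str(abs(int(v)))) for v in values)
    total = sum(lengths.values())
    rel = {n: c / total for n, c in sorted(lengths.items())}
    return rel, total
```

With the values 5, 42, 137, 250, and 1000, two of the five values have length 3, so the relative frequency of length 3 is 0.4.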
Algorithm 4 presents the subwatermark generation for the range of data values in a database relation. At line 1, the data ranges into which a data value of the database relation may fall are defined. Note that the defined ranges may be adjusted to suit the nature of the data values in the database relation and to allow more precise characterization of malicious data modifications, as discussed in Section 3. Lines 1-3 determine the attribute value within each tuple. Lines 5-13 determine the frequency of each data range and the total range count in the database relation. At lines 16-17, the relative frequency of each range is computed, which is then used to generate the range subwatermark. Lines 19-20 show that the range subwatermark is concatenated with the total range count and returned.
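The range bucketing can be illustrated with a short sketch. Only range 1 (100-999) is stated explicitly in the text; the other boundaries below, the use of five ranges numbered 0 to 4, and the function names are assumptions made for illustration and would be adjusted to the data at hand.

```python
from bisect import bisect_right
from collections import Counter

# Upper-exclusive boundaries between ranges 0..4. Only range 1 = 100-999
# comes from the text; the remaining boundaries are illustrative assumptions.
BOUNDS = [100, 1000, 10000, 100000]


def range_index(x):
    """Map a non-negative value to its range number (0..4)."""
    return bisect_right(BOUNDS, x)


def range_subwatermark(values):
    """Return (relative frequency per range, total range count)."""
    freq = Counter(range_index(abs(int(v))) for v in values)
    total = sum(freq.values())
    if total == 0:
        return {}, 0
    rel = {r: freq.get(r, 0) / total for r in range(len(BOUNDS) + 1)}
    return rel, total
```

Using `bisect_right` keeps the range definition in one list, so adjusting the ranges to the nature of the data only requires editing `BOUNDS`.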
Algorithm 4 (excerpt, lines 6-11): Range subwatermark generation.
(6) x in range 0: range frequency[0]++
(7) x in range 1: range frequency[1]++
(8) x in range 2: range frequency[2]++
(9) x in range 3: range frequency[3]++
(10) x in range 4: range frequency[4]++
(11) End Select
Figure 2 shows the model for detection of malicious modifications in a suspicious database relation. To detect malicious data modifications, the relation watermark is regenerated for the suspicious database relation and compared with the relation watermark registered at the CA; if the two watermarks differ, the suspicious database relation is considered a tampered relation.

Watermark Verification.
The algorithm for watermark detection is presented in Algorithm 5. At line 1, the watermark of the suspicious database relation is generated by using Algorithm 1. The watermark certificate already registered at the CA is used to extract the registered relation watermark, as shown at lines 2-4. At lines 5-10, each digit of the regenerated watermark is compared with the corresponding digit of the registered watermark, and the match count is incremented on each successful match. At line 9, the total count is incremented to record the number of digits tested. At lines 11-12, the WAR (Watermark Accuracy Rate) and WDR (Watermark Distortion Rate) are computed. If distortion exists, the suspicious database relation is rejected as a tampered relation with distortion rate WDR, as shown at lines 13-15.
The algorithm for characterization of malicious data modifications is presented in Algorithm 6. At line 2, the relative frequency of each digit is extracted from the digit subwatermark, which is part of the relation watermark already registered at the CA. The frequency distribution of each digit in the original relation is determined at line 3. At line 4, the frequency distribution of each digit in the suspicious database relation is determined. The change Δ in the frequency of each digit is computed at line 5, and the fractional change ΔF in each digit is determined at line 6. The computed value of ΔF is then used to characterize the malicious modifications made to the database relation. For example, if ΔF is zero, the suspicious relation has not been tampered with. A positive ΔF indicates that a fraction F of the digit has been maliciously inserted by the attacker in an attempt to transform low data values to high ones. Similarly, a negative ΔF indicates that a fraction F of the digit has been maliciously deleted by the attacker in an attempt to transform high data values to low ones. At lines 8-14 and 15-21, the same method is used to determine the fractional changes ΔF in the length and range frequencies, which characterize attacks on the length and range of data values. The characterization of malicious data modifications is further elaborated in Section 3.2 with experimental results.
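The characterization step can be sketched as follows. The sketch assumes that the original absolute frequency is recovered as relative frequency times total count, and that ΔF is the change Δ expressed as a percentage of the original frequency; this reading is inferred from the reported results (for instance, roughly 10% for a 10% insertion) rather than quoted from Algorithm 6. The function names and the dictionary-based inputs are hypothetical.

```python
def fractional_change(registered_rel, registered_total, suspicious_counts):
    """Return the percent change per symbol (digit, length, or range)."""
    result = {}
    for symbol, rel in registered_rel.items():
        # Recover the original absolute frequency from the registered data.
        original = round(rel * registered_total)
        delta = suspicious_counts.get(symbol, 0) - original
        if original == 0:
            # A symbol absent from the original has no meaningful percentage.
            result[symbol] = 0.0 if delta == 0 else float("inf")
        else:
            result[symbol] = 100.0 * delta / original
    return result


def trend(delta_f, tol=1e-9):
    """Label the effect of an attack from the sign of the fractional change."""
    if abs(delta_f) <= tol:
        return "no change"
    return "low to high" if delta_f > 0 else "high to low"
```

The same two functions apply unchanged to digit, length, and range statistics, since each is just a mapping from a symbol to a frequency.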

Results and Discussion
Suppose that Alice is the database owner and that she has used the proposed algorithms along with the secret key to generate a watermark for the database relation. The attacker Mallory, for his own nefarious objectives, may attempt to make malicious modifications to Alice's watermarked database relation.
Algorithm 5 (excerpt, lines 4-11): Watermark detection.
(4) registered watermark = Decrypt(encrypted watermark, SK)
(5) For i = 1 to length(registered watermark) Do
(6) If regenerated watermark[i] = registered watermark[i] Then
(7) match count = match count + 1
(8) End If
(9) total count = total count + 1
(10) End For
(11) WAR = match count/total count * 100
The experiments use the Forest Cover Type data set [37], which has 581,102 tuples, each with 10 integer attributes, 44 Boolean attributes, and 1 categorical attribute. In our experiments, we have used all 10 integer attributes. Note that in robust watermarking schemes, the aim of Mallory is to destroy Alice's watermark without affecting the database relation, whereas in fragile schemes, Mallory attempts to make malicious modifications to Alice's watermarked database relation without affecting the watermark. The experimental results presented in this section show that the watermark is adversely affected by even minor malicious data modifications; therefore, the generated watermark is fragile.

Detection of Malicious Modifications.
In this set of experiments, we randomly introduce malicious modifications in the Forest Cover Type data set [37]. As discussed in Algorithm 5, these malicious modifications are detected by regenerating the watermark for the suspicious database relation, which is then compared with the registered watermark to determine the WAR (Watermark Accuracy Rate) and WDR (Watermark Distortion Rate). Table 2 shows the WAR and WDR for malicious insertions made to the database relation at different attack rates. For example, when fake but similar tuples amounting to 10% of the relation are randomly inserted, the WDR is found to be high and the malicious insertions are detected with a low WAR. Tables 3-4 show similar results for malicious deletions and updates made to the database relation. Figure 3 summarizes the insertion, deletion, and update attacks and shows that the WDR is always high for different volumes of malicious data modifications.
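The WAR and WDR comparison can be sketched as below. The WAR follows the digit-by-digit match ratio described for Algorithm 5; taking WDR as 100 minus WAR and treating any length mismatch as non-matching positions are assumptions of this sketch, and the function name is hypothetical.

```python
def watermark_rates(regenerated: str, registered: str):
    """Return (WAR, WDR) in percent for two watermark digit strings."""
    # Positions beyond the shorter string count as mismatches (assumption).
    total = max(len(regenerated), len(registered))
    if total == 0:
        raise ValueError("watermarks must not be empty")
    matches = sum(a == b for a, b in zip(regenerated, registered))
    war = 100.0 * matches / total
    wdr = 100.0 - war  # assumed complement of WAR
    return war, wdr
```

A relation would be accepted as authentic only when the WDR is zero, that is, when every position of the regenerated watermark matches the registered one.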
In another set of attacks, we simultaneously perform malicious insertion, deletion, and update of tuples with different attack rates in the database relation. Table 5 shows the WDR for this set of attacks.
The experimental results presented in Tables 2-5 show that the malicious modifications are always detected and that the fragility of the registered watermark is observed even for low volumes of attack. The WAR is low and the WDR is high for different volumes of malicious insertions, deletions, and updates made to the database relation. The low WAR indicates the extent to which the database relation has been attacked, whereas the high WDR indicates that the database relation has been tampered with and is not authentic. The accuracy of the watermark is adversely affected even by minor malicious data modifications, and this fragility is what reveals that the database relation has been attacked.

Characterization of Malicious Modifications.
One of the important features of the proposed watermarking scheme is that it characterizes the malicious modifications made to database relations. As discussed in Algorithm 1, watermark generation is based on the content characteristics of the database relation itself, which enables us to characterize the malicious data modifications. Algorithm 6 characterizes malicious data modifications by evaluating the fractional change ΔF in the frequency of each digit, length, and range of data values in the tampered database relation.
We have conducted experiments with both random and deterministic attacks for characterization of malicious data modifications. In random tampering attacks, we randomly attack the digit frequency, length, and range of data values in the database relation, whereas in deterministic attacks, the attack is performed at specific attack rates. The random tampering attacks are presented in this section, and the detailed results of the deterministic attacks are shown in the Appendix for reference.

Attacks on Digit Frequency.
In this set of attacks, Mallory randomly performs malicious insertion, deletion, and update attacks on digit frequency in Alice's watermarked relation. For example, in an insertion attack, Mallory may attempt to maliciously insert some digits into the relation. Table 6 shows the experimental results obtained for characterization of a malicious insertion attack on digits 9 and 0, as discussed in Algorithm 6. A positive value of ΔF indicates that a fraction F of digits 9 and 0 has been maliciously inserted by Mallory into the database relation. The characteristic of this attack is an attempt to raise low data values to high ones, as increases of 35.84% and 24.42% are observed in ΔF for digits 9 and 0, respectively. As the other digits are not attacked, ΔF is zero for digits 1-8 and there is no change Δ in the frequency of these digits. This characteristic of the attack, when combined with the nature of the data, may provide useful information about the attacker's intention. For example, in a product sales environment, these malicious insertions indicate that the attacker may have attempted to increase low product sales volumes and amounts. Table 7 shows the results for random malicious deletions of digits 9 and 0 from the database relation. A negative value of ΔF indicates that a fraction F of digits 9 and 0 has been maliciously deleted by the attacker. The characteristic of this attack is an attempt to lower high data values. In this attack, 14.70% of digit 9 and 12.44% of digit 0 are randomly deleted from the database relation. As the other digits are not deleted, ΔF is zero for digits 1-8. Table 8 shows similar results for random malicious updates of digits 9 and 0 in the database relation. In this attack, digits 9 and 0 are randomly replaced with other digits, so the frequency of digits 9 and 0 decreases (high to low), whereas the frequency of digits 1-8 increases (low to high). Figure 4 summarizes the malicious insertion, deletion, and update attacks on digits 9 and 0. The insertion attack shows a positive trend (low to high) in the attacked digits, whereas a negative trend (high to low) is observed in the attacked digits for the deletion attack. In the update attack, negative (high to low) and positive (low to high) trends are observed for the attacked and unattacked digits, respectively.
In another set of attacks, we randomly insert, delete, and update 10% (lower bound) and 90% (upper bound) of the tuples in the database relation. Table 9 shows the effect on the fractional change in digit frequency ΔF for each digit. Note that, in the insertion attack, a positive trend (low to high) is observed in the frequency of each digit. For example, when similar tuples amounting to 10% of the relation are inserted, an increase of approximately 10% is observed in ΔF for each digit. Similarly, in the deletion attack, a negative trend (high to low) is observed in ΔF for each digit. In the update attack, no specific trend is observed in ΔF, as fractions of digits are randomly replaced by other digits.
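The roughly 10% increase per digit under a 10% insertion attack can be reproduced on synthetic data. The snippet below is a small simulation, not the paper's experiment: the relation is a list of random four-digit integers, the "similar" tuples are resampled from the relation itself, and ΔF is taken as the percent change in each digit's absolute frequency. All names and the seed are arbitrary choices.

```python
import random
from collections import Counter


def digit_counts(values):
    """Count how often each decimal digit occurs across all values."""
    counts = Counter()
    for v in values:
        counts.update(str(v))
    return counts


rng = random.Random(42)
# Synthetic single-attribute relation of four-digit values (assumption).
relation = [rng.randint(1000, 9999) for _ in range(10_000)]
# Insertion attack: add "similar" tuples equal to 10% of the relation.
attacked = relation + rng.sample(relation, k=len(relation) // 10)

before, after = digit_counts(relation), digit_counts(attacked)
# Fractional change per digit, in percent of the original frequency.
delta_f = {d: 100.0 * (after[d] - before[d]) / before[d] for d in sorted(before)}
```

Because the inserted tuples follow the same distribution as the original ones, every digit gains about one tenth of its original frequency, so each entry of `delta_f` lies near 10.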
It is to be noted that the attack on digit frequency (as discussed above) can be characterized on parameters like the digits being attacked, the fraction of each digit attacked, the type of attack (insertion, deletion, or update) on each digit, and the effect of attack (low to high, high to low, or no change) on data values.

Attack on Length of Data Values.
In this set of attacks, Mallory randomly performs malicious insertion, deletion, and update attacks on the length of data values. Table 10 shows the experimental results for characterization of malicious insertion of data values of length 3 into the database relation. A positive value of ΔF indicates that a fraction F of that length has been maliciously inserted into the database relation. The characteristic of this attack is to raise low data values to high ones, as an increase of 18.27% is observed in ΔF for data values of length 3. Also, ΔF is zero for lengths 1, 2, and 4, which shows that data values of these lengths are not attacked. Table 11 shows the results of random malicious deletion of data values of length 3. As in the deletion attack on digit frequency, a negative value of ΔF indicates that a fraction F of that length has been maliciously deleted, with the characteristic of lowering high data values in the database relation. Also, as in malicious insertion, ΔF is zero for lengths 1, 2, and 4, which indicates that data values of these lengths are not deleted. Table 12 shows the results for malicious updates of data values of length 3. In this attack, data values of length 3 are randomly replaced by values of lengths 1, 2, and 4. This attack shows a decrease in ΔF for length 3, whereas ΔF for lengths 1, 2, and 4 increases. Figure 5 summarizes the malicious insertion, deletion, and update attacks on data values of length 3. The insertion attack shows a positive trend (low to high) in the attacked length, whereas a negative trend (high to low) in the attacked length is observed for the deletion attack. In the update attack, a negative trend (high to low) is observed in the attacked length, whereas a positive trend (low to high) is observed in the unattacked lengths. Table 13 shows the effect on the fractional change in length frequency ΔF when 10% (lower bound) and 90% (upper bound) of tuples are maliciously inserted, deleted, and updated in the database relation. In the insertion attack, ΔF shows a positive trend (low to high) for each length of data values. Similarly, in the deletion attack, a negative trend (high to low) is observed for each length. For example, when 10% of tuples are randomly deleted from a database relation, a decrease of approximately 10% is observed in ΔF for each length. The update attack does not show any specific trend, as data values of different lengths are randomly replaced by values of other lengths.
It is to be noted that the attack on length of data values can be characterized on parameters like the length of data values being attacked, the fraction of each length of data values attacked, the type of attack (insertion, deletion, or update), and the effect of attack (low to high, high to low, or no change) on each length of data values.

Attack on Range of Data Values.
In this set of attacks, Mallory randomly performs insertion, deletion, and update attacks on range 1, that is, data values in the interval 100-999, in the database relation. Table 14 shows the experimental results for characterization of malicious insertion for range 1. Table 15 shows the results of random malicious deletion of data values in range 1. As in the deletion attack on digit frequency, a negative value of ΔF indicates that a fraction F of range 1 has been maliciously deleted, with the characteristic of transforming high data values to low ones. As the data values of ranges 0 and 2 are not attacked, ΔF is zero for these ranges. Table 16 shows the results for malicious updates of data values in range 1. In this attack, the data values of range 1 are randomly replaced by values in ranges 0 and 2. This attack shows a decrease in ΔF for range 1, whereas ΔF for ranges 0 and 2 increases.
The malicious insertion, deletion, and update attacks on range 1 of data values are summarized in Figure 6. A positive trend (low to high) is observed in the attacked range for the insertion attack, and a negative trend (high to low) is observed in the attacked range for the deletion attack. The update attack shows a negative trend (high to low) for the attacked range, that is, range 1, and a positive trend for the unattacked ranges, that is, ranges 0 and 2.
In another set of attacks, we randomly inserted, deleted, and updated 10% (lower bound) and 90% (upper bound) of the tuples in the database relation. Table 17 shows the effect on the fractional change in range frequency ΔF for each range of data values. ΔF shows a positive trend (low to high) for malicious insertion in each range. Similarly, in the deletion attack, a negative trend (high to low) is observed for each range. For example, when 10% of tuples are randomly deleted from a database relation, a decrease of approximately 10% is observed in ΔF for each range. The update attack does not show any specific trend, as data values in different ranges are randomly replaced by values in other ranges. Note that the data characteristics used in our experiments, namely the digits, lengths, and ranges of data values, are closely related to one another. Because of this relationship, we evaluated the effect of malicious data modifications on these three data characteristics. For example, if Mallory maliciously [...] The detailed experiments for this set of attacks are presented in the Appendix (Tables 20(a)-20(f)).