Lossless Join Decomposition for Extended Possibility-Based Fuzzy Relational Databases

Functional dependency is the basis of database normalization. Various types of fuzzy functional dependencies have been proposed for fuzzy relational databases and applied to the process of database normalization. However, achieving lossless join decomposition is problematic when these fuzzy functional dependencies are used for database normalization in extended possibility-based fuzzy data models. To resolve the problem, this study defines a fuzzy functional dependency based on a notion of approximate equality for extended possibility-based fuzzy relational databases. Examples show that the notion is more applicable than other similarity concepts to research related to the extended possibility-based data model. We provide a decomposition method that uses the proposed fuzzy functional dependency for database normalization and prove the lossless join property of the decomposition method.


Introduction
Database normalization plays a crucial role in the design theory of relational databases by avoiding insertion, deletion, and update anomalies. Normalization involves decomposing a relation schema (table) into several smaller ones. The essential requirement of the decomposition is the lossless join property, which ensures that the original relation can be obtained from its decomposed results via combination operations [1]. Several methods have been proposed to design normalized relation schemes based on the keys and functional dependencies of a relation to achieve lossless join decomposition [2,3]. The design theory has been applied to fuzzy databases, in which uncertain and imprecise information can be represented and manipulated. Fuzzy databases extend classical databases on the basis of fuzzy sets and possibility theory [4], and they follow either a resemblance-based fuzzy model [5,6] or a possibility-based fuzzy model [7,8]. In the context of fuzzy databases, the fuzzy functional dependency (FFD) has emerged as an extension of the classical functional dependency that represents functional relationships between classes/attributes of objects in fuzzy database models. Various FFD definitions have been proposed in fuzzy data models for database normalization [9,10].
However, very few studies discuss the lossless join property for normalization in possibility-based fuzzy databases. Achieving lossless join decomposition by using FFDs is more difficult for possibility-based fuzzy databases than for resemblance-based fuzzy databases, especially for the extended possibility-based fuzzy database. The extended possibility-based fuzzy database [7] extends the possibility-based fuzzy database [8] by including a resemblance-based fuzzy model [6]. In this fuzzy database, attribute values can be possibility distributions of the attribute on its domain. Additionally, the elements in a domain have some degree of resemblance. Previous work has applied FFDs to decomposition in this fuzzy database [10-12]. Informally, these FFDs are based on a certain degree of similarity between two attribute values. Namely, two tuples that are similar but not identical might be regarded as redundant. Applying similarity-based FFDs to relation decomposition makes lossless join decomposition difficult in two respects: (i) redundancy removal: how to eliminate redundant but nonidentical tuples from the decomposed results so that the results can later be used to reproduce the original relation without losing information; and (ii) tuple merging: how to combine two relations by merging tuples whose attribute values are similar but not identical.
Complicating this problem further, most similarity measures [7,13,14] for values in the form of possibility distributions are not transitive. When tuple redundancy is determined by nontransitive similarity measures, the result of eliminating redundant tuples from a decomposed relation might not be unique (that is, it might be order sensitive; see endnote 1). Inconsistent redundancy removal not only leads to unstable results of data integration, as described in [15], but also makes decomposition results lossy. When the decomposition result of a relation is not unique, the combination of the result has many different outcomes, at least one of which differs from the original relation. Accordingly, the decomposition inevitably violates the lossless join property. Moreover, nonunique results also occur in relation combination when the similarity relation among the attribute values to be joined/merged is nontransitive.
To avoid nontransitivity, Chen et al. provided an FFD with an embedded classical FD [11,16], where redundancy removal is restricted to duplicate tuples. However, this restriction draws the normalization process back to the traditional operations on crisp data. To obtain a transitive relationship among tuples, some research applied the max-min transitive closure to the relationship matrix of similarity degrees between tuples [17]. The max-min transitive closure of a relationship matrix is a matrix with max-min transitivity [18]. By referring to the transitive closure of the relationship matrix, the tuples whose similarity is higher than a given threshold can be grouped into disjoint sets. The tuples in the same set are regarded as redundant. However, this approach cannot determine the similarity of two tuples by examining only these two, and the similarity changes when other tuples are inserted or deleted. This nondeterministic and dynamic characteristic is impractical for databases.
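The dependence of closure-based similarity on the other tuples in a relation can be illustrated with a minimal sketch. The similarity values, the threshold of 0.8, and the function name `max_min_closure` are hypothetical and are not taken from [17]; the sketch only assumes a reflexive, symmetric similarity matrix.

```python
def max_min_closure(matrix):
    """Return the max-min transitive closure of a square similarity matrix."""
    n = len(matrix)
    closure = [row[:] for row in matrix]
    # Floyd-Warshall-style pass: the strength of a path is the minimum
    # similarity along it, and the closure keeps the maximum over all paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = min(closure[i][k], closure[k][j])
                closure[i][j] = max(closure[i][j], via_k)
    return closure

THRESHOLD = 0.8  # hypothetical redundancy threshold

# Two tuples (call them t1 and t3) that are not similar on their own.
before = max_min_closure([[1.0, 0.4],
                          [0.4, 1.0]])

# The same two tuples after inserting t2, which resembles both of them.
# Row/column order: t1, t2, t3.
after = max_min_closure([[1.0, 0.9, 0.4],
                         [0.9, 1.0, 0.9],
                         [0.4, 0.9, 1.0]])

redundant_before = before[0][1] >= THRESHOLD  # t1 vs t3 without t2
redundant_after = after[0][2] >= THRESHOLD    # t1 vs t3 with t2 present
```

Under these assumed values, the pair (t1, t3) is not redundant before the insertion but becomes redundant afterwards, although neither tuple changed.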
To our knowledge, very few studies provide a complete guideline for performing normalization that ensures lossless join decomposition in the fuzzy databases. The purpose of this study is to fill this gap. This study first proposes a notion of approximate equality, which represents a transitive equivalence relation among tuples. Then, it provides new definitions of FFD and lossless join decomposition based on approximate equality for the fuzzy databases. Both functional dependencies and lossless join decomposition in a traditional database are special cases of this proposal. Examples show that the notion is more applicable than other similarity concepts to research related to the fuzzy databases. This work also provides a method of achieving lossless join decomposition for the fuzzy databases.
The remainder of this paper is organized as follows. Section 2 gives a brief introduction to database normalization and fuzzy databases and surveys the similarity measures related to the fuzzy database. Section 3 demonstrates the problem of using nontransitive similarity measures for determining tuple redundancy and provides a notion of approximate equality to address it. Section 4 then defines the FFD based on the approximate equality, proposes the lossless join decomposition for the fuzzy databases, and proves its property. Section 5 draws the conclusion of this paper.

Preliminaries
This section first briefly reviews the essential operations for lossless join decomposition in traditional databases. Then, it introduces the fuzzy databases considered in this work and the similarity measures for values in the form of possibility distributions.

Essential Operations for
The natural join and projection operations are, respectively, used to combine and decompose relations.
In other words, the lossless join decomposition ensures that the combination of the decomposed results of a relation has no spurious tuple or missing tuple to the relation via natural join operation [1].
In this case, the decomposition {  ,   } of  has lossless join property because the natural join result of   and   is exactly the same as  (as shown in Table 2).
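The classical case can be made concrete with a short sketch of projection and natural join on crisp data. The relation below (attributes `emp`, `dept`, `mgr`) and all helper names are hypothetical and do not reproduce the paper's tables; the sketch assumes that the functional dependency `dept -> mgr` holds in the sample data.

```python
def project(rows, attrs):
    """Projection: keep only attrs and drop duplicate tuples."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result

def natural_join(left, right):
    """Natural join: combine tuples that agree on all shared attributes."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]

def same_relation(a, b):
    """Compare two relations as sets of tuples, ignoring row order."""
    canon = lambda rows: {tuple(sorted(row.items())) for row in rows}
    return canon(a) == canon(b)

# Hypothetical crisp relation in which dept -> mgr holds.
r = [
    {"emp": "ann", "dept": "sales", "mgr": "kim"},
    {"emp": "bob", "dept": "ops", "mgr": "kim"},
    {"emp": "cat", "dept": "hr", "mgr": "lee"},
]

# Decomposing on the determinant "dept" is lossless.
lossless = natural_join(project(r, ["emp", "dept"]),
                        project(r, ["dept", "mgr"]))

# Decomposing on "mgr", which determines nothing here, adds spurious tuples.
lossy = natural_join(project(r, ["emp", "mgr"]),
                     project(r, ["dept", "mgr"]))
```

In this sample, joining on `dept` returns exactly the three original tuples, whereas joining on `mgr` pairs each employee of manager `kim` with both of that manager's departments and so yields spurious tuples.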

The Fuzzy Databases.
In the last two decades, fuzzy concepts have been incorporated into traditional databases [5,8,19] and applied to measure the relation between data [20-22]. Fuzzy databases make it possible to deal with imprecision and uncertainty in the real world on the basis of fuzzy set theory and possibility distribution theory. Possibility-based fuzzy theory has been widely applied in environmental management, such as flood-diversion planning [23], water resources management [24], and air quality management [25]. This work considers the extended possibility-based databases proposed by Chen et al. [7] because they capture both the possibility-based fuzzy model and the resemblance-based fuzzy concept. This fuzzy database has drawn much research attention on semantic measures, information processing, update operations, and UML class diagrams [20,26,27]. The data model of the fuzzy databases is a hybrid of the possibility-based data model in [8] and the resemblance-based data model in [6]. The possibility-based model derives from Zadeh's fuzzy theory. In the theory [4], a fuzzy set $F$ on a universe of discourse $U$ is described by $\{\mu_F(u)/u \mid u \in U\}$, where $\mu_F : U \to [0, 1]$ is the membership function for the fuzzy set $F$ and $\mu_F(u)$ denotes the degree of membership of $u$ in $F$. In a possibility-based database [8], the value of an attribute $A$ on a domain $D$ is a possibility distribution $\pi_A = \{\pi_A(u)/u \mid u \in D\}$, where $\pi_A(u)$ denotes the possibility that $u$ is the actual value of $A$. For example, $D = \{d_1, d_2, d_3\}$ and $\pi_A = \{0.8/d_1, 0.5/d_2, 0.6/d_3\}$. An example of applying the possibility-based fuzzy theory in the real world (see endnote 2) is shown below. Consider that the domain of the attribute "eye color" is {black, brown, blue, green} and a possibility distribution is given below: Suppose that John's eye color is an "Asia color." Then, according to the interpretation of possibility-based fuzzy theory, one concludes that the possibility of John's eye being brown blue color is 0.3.
In the extended possibility-based database, attribute values are represented by possibility distributions of an attribute on its domain, and a domain is associated with a similarity relation on its elements. Formally, an $n$-ary relation instance $r$ on a schema $R(A_1, A_2, \ldots, A_n)$ in the fuzzy database is a subset of the Cartesian product $\Phi(A_1) \times \Phi(A_2) \times \cdots \times \Phi(A_n)$, where $\Phi(A_i)$ represents the set of all possibility distributions of attribute $A_i$ on its domain. For a domain $D_i$, a proximity relation is given to describe the resemblance between the elements of $D_i$. A proximity relation is a mapping $s_i : D_i \times D_i \to [0, 1]$ that is reflexive and symmetric; that is, $s_i(u, u) = 1$ and $s_i(u, v) = s_i(v, u)$. Because a proximity relation need not be transitive, the elements in a domain cannot be directly partitioned into disjoint equivalence classes by a threshold cut on the proximity relation.
To acquire equivalence classes of a proximity relation on a domain, Shenoi et al. [18] proposed the $\alpha$-proximate relation. Two elements $u, v \in D$ are $\alpha$-proximate (denoted by $u \, s_\alpha^{+} \, v$) if $s(u, v) > \alpha$ or there exists a sequence $y_1, y_2, \ldots, y_k \in D$ such that $\min\{s(u, y_1), s(y_1, y_2), \ldots, s(y_{k-1}, y_k), s(y_k, v)\} > \alpha$. Given a proximity relation $s$ and a threshold $\alpha$ for domain $D$, the domain can be partitioned into disjoint subsets (called $\alpha$-proximate equivalence classes) such that the elements in a partition are $\alpha$-proximate. These equivalence classes are basic concepts for the methods reviewed or proposed hereinafter.
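The partition described above amounts to taking connected components of the graph whose edges are the element pairs with proximity above the threshold. The following is a minimal sketch under that reading; the domain, the proximity degrees, and the names `make_proximity` and `proximate_classes` are hypothetical.

```python
def make_proximity(pairs):
    """Build a reflexive, symmetric proximity function from listed pairs."""
    table = {frozenset(k): v for k, v in pairs.items()}
    def proximity(u, v):
        # Reflexivity: an element has proximity 1 with itself.
        # Unlisted pairs default to 0 (an assumption of this sketch).
        return 1.0 if u == v else table.get(frozenset((u, v)), 0.0)
    return proximity

def proximate_classes(domain, proximity, threshold):
    """Partition domain into classes linked by chains of proximity > threshold."""
    parent = {d: d for d in domain}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, u in enumerate(domain):
        for v in domain[i + 1:]:
            if proximity(u, v) > threshold:
                parent[find(u)] = find(v)

    classes = {}
    for d in domain:
        classes.setdefault(find(d), []).append(d)
    return sorted(sorted(members) for members in classes.values())

domain = ["excellent", "good", "fair", "poor"]
proximity = make_proximity({
    ("excellent", "good"): 0.9,
    ("good", "fair"): 0.8,
    ("excellent", "fair"): 0.5,
    ("fair", "poor"): 0.3,
})
classes = proximate_classes(domain, proximity, 0.7)
```

With these assumed degrees, "excellent" and "fair" fall into the same class even though their direct proximity (0.5) is below the threshold, because both are linked through "good".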
By extending traditional functional dependency, research has proposed variety of fuzzy functional dependencies (FFDs) for fuzzy databases [14,21,28].The FFDs are determined by the degree of similarity of attribute values rather than by the identity.Several similarity measures of attribute values are proposed for the extended possibility-based frameworks [7,16,20,29].Most of them provide the estimates within the interval [0, 1].The similarity measures are briefly restated hereinafter, in which  and  ∈ [0,1], respectively, denote the proximity relation and a threshold defined on a given domain  = { 1 ,  2 , . . .,   };   and   represent two possibility distributions on .The degree of closeness between   and   , denoted by  1 (  ,   ), is defined as follows [29]: where The measure of  1 may give low degree of similarity for two values that are very similar to each other, for example,  1 ({0.9/excellent,1/good}, {1/good}) = 0.1.To prevent some counter-intuitive estimates of  1 , Chen et al. defined the possibility that   =   is true as shown below [7] (here ∧ denotes minimum): This assessment is widely adopted in the extended possibilitybased databases and is adoptable for the application with subnormal distribution (i.e.,  2 (  ,   ) < 1, or see [4] for details).For normal distribution, Chen et al. [16] included identity relation (denoted by = id ) into (4) as follows: (5) Ma et al. 
defined the similarity measure from the perspective of the semantic closeness between two attribute values [20] as shown below: where  denotes a semantic inclusion degree.Consider the following: The notion of  4 may violate the convention that the similarity degree of two values lies within [0, 1];  4 (  ,   ) = 1.52 for the case that   = {0.9/excellence,0.8/good} and   = {0.7/excellence,0.6/good} when the similarity of "excellence" and "good" is larger than the given threshold; that is,  (excellence, good) ≥ .It is difficult to set up a proper threshold for estimates that range out of [0, 1], having an unpredictable upper bound.Liu et al. [13] extended the semantic equivalence to ensure that the result of similarity measure lies within [0, 1].The measurement adjusts the possibility distributions of values based on -proximate equivalent classes of the domain before measuring their similarity.Let  = { 1 ,  2 , . . .,   } be the -proximate equivalent classes of domain .The adjusted value of possibility distribution  • is defined as follows: where Although the methods mentioned above differ from each other on measuring similarity of attribute values, most of the methods of measuring the similarity of tuples are the same.The methods adopt the minimum of the similarity of each pair of attribute values.Given tuples  = (  1 ,   2 , . . .,    ) and   = (   1 ,    2 , . . 
.,     ), the resemblance of tuples  and   , denoted by (,   ), is given by where  • could be either  1 ,  2 ,  3 ,  4 , or  5 in (3)- (9).Tuples  and   are redundant to each other if (,   ) ≥ , where  is a given threshold.The similarity measure of tuples has been applied to extract representative tuples for reducing information redundancy [17].Fuzzy functional dependency (FFD) is a concept derived from traditional FD.Both FFD and FD have several applications on databases, for example, redundancy elimination [30], missing data prediction, fuzzy data compression [17,31], and lossless join decomposition [10,28,32].In literature, various FFDs are defined for different fuzzy data model.For some fuzzy data representation, FFDs are defined based on the equivalence classes of tuples, such as the similarity-based fuzzy data model [33].In the extended possibility-based databases, the definition of FFD is also of variety, such as literature [10,14,34,35].One example among the FFD definitions in the literature is listed below.
Definition 1 (see [10], fuzzy functional dependency).Let  ∼ >  denote that attribute  is fuzzy functional which depends on attribute  in a relation .The example helps in understanding the problem of applying the FFDs on relation decomposition in the fuzzy databases illustrated in Section 3.

Redundancy Removal and Tuple Merging
Several factors determine whether a relation decomposition possesses the lossless join property: the way a relation is decomposed, the way redundant tuples are removed, and the way the decomposed results are combined. Redundancy removal eliminates redundant tuples. If the similarity measure used to determine tuple redundancy is not transitive, the result of redundancy removal may not be unique. An example of nontransitivity is that tuples $t$ and $t'$ are redundant to each other, and $t'$ and $t''$ are redundant as well, but $t$ and $t''$ are not redundant. In this case, the result of redundancy removal is $\{t, t''\}$ if $t'$ is deleted first, which differs from the one-tuple result obtained when the tuples other than $t'$ are deleted first. Nontransitivity thus makes the result of redundancy removal order sensitive and hinders lossless join decomposition.
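The order sensitivity can be reproduced with a small sketch. The tuple names, the similarity degrees, the threshold of 0.8, and the greedy removal strategy are assumptions made for illustration; they are not the paper's procedure.

```python
# Hypothetical nontransitive similarity: t1~t2 and t2~t3, but not t1~t3.
SIMILARITY = {
    frozenset(("t1", "t2")): 0.9,
    frozenset(("t2", "t3")): 0.9,
    frozenset(("t1", "t3")): 0.5,
}

def similarity(a, b):
    """Symmetric, reflexive lookup of the assumed similarity degrees."""
    return 1.0 if a == b else SIMILARITY[frozenset((a, b))]

def remove_redundant(tuples, threshold):
    """Greedy removal: keep a tuple only if no kept tuple is redundant to it."""
    kept = []
    for t in tuples:
        if all(similarity(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

# The same three tuples, scanned in two different orders.
result_a = remove_redundant(["t2", "t1", "t3"], 0.8)
result_b = remove_redundant(["t1", "t3", "t2"], 0.8)
```

Scanning `t2` first leaves a single tuple, whereas scanning `t1` and `t3` first leaves two, so the outcome depends on the processing order rather than on the data alone.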
To resolve this problem, this study proposes projection and equal join operations for the fuzzy databases, which involve the evaluation of redundancy and tuple merging. Since the decomposition of relations is based on FFDs, it depends on the similarity of tuples. For data in the fuzzy model, (3)-(9) can be used to measure the similarity of tuples and define FFDs in the fuzzy databases. However, (5) restricts redundant tuples to duplicates, and (3), (4), and (6) lack transitivity. Therefore, this work adopts (9) and (10) to define approximate equality for tuples that might not be identical but have a high degree of similarity. The approximate equality makes the result of redundancy removal unique.
In other words, tuples  and   are approximately equal if their similarity (,   ) = 1.
The tuples of approximate equality are considered to be redundant to each other.The notion of approximate equality can be applied to query processing with the predicate containing fuzzy concept [36] for fuzzy databases in different models.For simplicity, we let   ≅   denote  5 (  ,   ) = 1 hereafter.

Proposition 5. The approximate equality can be used to classify values of the fuzzy database into disjoint sets (equivalence classes).
Proof. By the definition in (9), the similarity measure is reflexive and symmetric: every value has similarity 1 with itself, and the similarity of two values does not depend on the order in which they are compared. Besides, approximate equality is transitive according to Lemma 3. Therefore, any two sets of approximately equal values are either disjoint or identical, and any two values within a set are approximately equal to each other.
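The following sketch illustrates why a relation that is reflexive, symmetric, and transitive yields an order-independent grouping. It assumes, purely for illustration, that the adjustment of a possibility distribution assigns to every element of a proximate equivalence class the maximum possibility found in that class, and that two values are approximately equal when their adjusted distributions coincide. These assumptions stand in for (8) and (9) and may differ from the paper's exact definitions; the names `adjust`, `approx_equal`, and `group` are hypothetical.

```python
def adjust(distribution, classes):
    """Assumed adjustment: each class member receives the class maximum."""
    adjusted = {}
    for cls in classes:
        peak = max(distribution.get(d, 0.0) for d in cls)
        if peak > 0:
            for d in cls:
                adjusted[d] = peak
    return adjusted

def approx_equal(p, q, classes):
    """Assumed approximate equality: identical adjusted distributions."""
    return adjust(p, classes) == adjust(q, classes)

def group(values, classes):
    """Group named values by their adjusted form (a canonical key)."""
    groups = {}
    for name, dist in values.items():
        key = tuple(sorted(adjust(dist, classes).items()))
        groups.setdefault(key, []).append(name)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical equivalence classes of a domain and four attribute values.
classes = [{"excellent", "good"}, {"poor"}]
values = {
    "a": {"excellent": 0.9, "good": 1.0},
    "b": {"good": 1.0},
    "c": {"excellent": 1.0},
    "d": {"poor": 0.7},
}
groups = group(values, classes)
```

Because the grouping compares canonical forms for exact equality, it is reflexive, symmetric, and transitive by construction, and the resulting classes do not depend on the order in which the values are visited.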
The transitivity of a similarity measure is important to any operation involving redundancy removal or tuple merging. Besides, a transitive measure can be applied to clustering methods or data groupings, such as the ones in [36,37].
Proposition 6. Given $\pi_A$ and its adjusted value $\bar{\pi}_A$ following (8), $\pi_A \cong \bar{\pi}_A$.
Proof.It is obvious by the definition of ( 9).
Buckles and Petry first proposed the way of tuple merging and applied it to remove redundant tuples in a fuzzy database [5].Tuple merging can also be used at join operation.This study extends the tuple merging of Chen et al. [16] to be (11) where each π• (or π • ) is the adjusted value of  • (or   • ) according to (8)  Proof.Based on ( 9) and (11), it is obvious that if  5 (  ,    ) = 1, then  5 (  ,   ∘    ) = 1 and  5 (   ,   ∘    ) = 1.
Based on the literature review and Lemma 7, we summarize the property of different similarity measures with threshold  = 1 in Table 3 to show the merit of (9) adopted in this work.

Approximate Lossless Join Decomposition
This section first offers the operations for relation decomposition and combination. Then, it proposes a notion of approximate lossless join decomposition (ALJD), which incorporates fuzzy concepts into lossless join decomposition. It also provides a method to achieve the ALJD.
Similar to the works in [37], this study generalizes the projection and natural join operations of traditional databases to the fuzzy databases, as below. Here, given a relation $R$, $\Theta$ denotes a set of attributes in $R$ (i.e., $\Theta \subset R$), and $t[\Theta]$ denotes the composite of values in tuple $t$ over the attributes $\Theta$.
Proposition 8. The projection result of a relation based on (12) must be unique.
Proof. It can be directly derived from Proposition 5.
Based on the operations (12) and (13), the ALJD is formally defined following the extension of approximate equality from tuple level to relation level in Definition 9.

Definition 9 (approximately equal relation instances). Two relation instances $r(R)$ and $r'(R)$ in the fuzzy database are approximately equal, denoted by $r(R) \approx r'(R)$, if for every tuple $t \in r(R)$ there exists a tuple $t' \in r'(R)$ such that $t \cong t'$, and vice versa.
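Definition 9 can be read as a two-way coverage check between relation instances. The sketch below takes the tuple-level approximate equality as a parameter; the crisp label values, the `CLASS_OF` mapping, and the predicate `tuples_approx_equal` are hypothetical stand-ins for the paper's measure on possibility distributions.

```python
def approx_equal_relations(r1, r2, approx_eq):
    """True if every tuple of each instance has an approximately equal
    counterpart in the other instance."""
    def covers(a, b):
        return all(any(approx_eq(t, u) for u in b) for t in a)
    return covers(r1, r2) and covers(r2, r1)

# Hypothetical stand-in: labels are approximately equal when they belong
# to the same equivalence class of the domain.
CLASS_OF = {"excellent": "high", "good": "high", "poor": "low"}

def tuples_approx_equal(t, u):
    return len(t) == len(u) and all(
        CLASS_OF[a] == CLASS_OF[b] for a, b in zip(t, u))

original = [("excellent",), ("good",), ("poor",)]
reduced = [("good",), ("poor",)]   # redundant tuple removed
too_small = [("good",)]            # a non-redundant tuple is missing

kept_information = approx_equal_relations(original, reduced,
                                          tuples_approx_equal)
lost_information = approx_equal_relations(original, too_small,
                                          tuples_approx_equal)
```

In this sample, `reduced` is approximately equal to `original` although it has fewer tuples, while `too_small` is not, because no tuple in it matches `("poor",)`.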

Definition 10 (approximate lossless join). A composition
The approximate lossless join decomposition means that the natural join of all decomposed results of a relation instance is approximately equal to the original relation instance. More specifically, every tuple in the original relation is approximately equal to one of the tuples in the combination result.
Proposition 11. Consider the following:
Proof. It can be derived from (11) and (12).
Corollary 12. Consider the following:
Proof. It can be derived directly from Proposition 11.
The projection of a relation over the same schema, as shown in Corollary 12, performs no operation other than removing redundant tuples from the instance of the relation via tuple merging. Corollary 12 shows that the result of redundancy removal of a relation instance is approximately equal to the original instance. This property is essential for obtaining a combination result that is approximately equal to the original instance after relation decomposition.
This study proposes FFD for the decomposition in the fuzzy database as shown below.Proof.Proof by contradiction: we assumed that (  ) ≈ () and there exists an FFD  ∼>  such that  satisfies  ∼>  and   does not.Because  ∼>  exists in (),  (16).
It is noted that the FFD in Definition 13 satisfies Armstrong's axioms (inference rules), including the reflexive, augmentation, and transitive rules (see endnote 3). This property enables the result of lossless join decomposition to have the dependency preservation property (see endnote 4) [1].
Inputs R and F, where R is a relation and F is the set of FFDs exists in ().
Step 1.Let R = {} and   be the set of all  ∈  that are not trivial.
Step 2. For a   :  ∼>  in   , Let   be the relation chosen from R such that both  and  are in   .
Step 3. If X is not a key attribute in   , do followings: (1) decompose   into  1 and  2 , such that  1 = Π (,) (  ) and Output is R, the set of relations decomposed from .Note:   () represents the list of all attributes in   other than X.Proof.We proved that () ≈ Π  1 () ⊗ Π  2 () based on Definition 10.Let   be a relation such that (  ) = Π  1 () ⊗ Π  2 ().We first prove that, for all  ∈ (), there exist   ∈ (  ) and   ≅  and then prove that, for all  ∈ (  ), there must exist   ∈ () and   ≅ .Proof by contradiction: let us assume that  ∈ () is the tuple such that no   ∈ (  ) satisfies   ≅ .Let  1 and  2 be tuples such that  Proof by contradiction for second part with renewed symbols: assume that   ∈ (  ) is the tuple such that no  ∈ () satisfies   ≅ .Since (  )Π  1 ()⊗Π  2 (), there exist  1 ∈ ( 1 ) and  2 ∈ ( 2 ) such that  1 ≅   [, ],  2 ≅   [, ], and  1 [] ≅  2 [] based on (13).Also, we let t1 and t2 be the tuples such that Since  1 = Π (,) (), based on Proposition 11, there must exist  ∈ () such that Likewise, there must also exist   ∈  () Based on (17)  Based on Armstrong's inference rules, IR1 and IR3 (see endnote 2), if a set  of FFD contains  ∼> , the closure of  will also contain  ∼>  and  ∼> , which is trivial.When a relation is decomposed into more relations, it takes more join operations to obtain the original data for query process.Considering the cost of join operations, it is not efficient to decompose a relation that has already been in the third normal form.A relation is in the third normal form if there is no functional dependency between nonkey attributes in the relation [1].Accordingly, the relation decomposition has two prerequisites as follows.For example, if  is not a key in (), then  will be decomposed based on  ∼>  rather than on trivial FFD  ∼>  or  ∼> .Based on Lemma 16 and Definition 17, we propose an algorithm for ALJD (see Algorithm 1).In the ALJD algorithm, an FFD containing key attributes is excluded from the decomposition process at 
Step 3. This follows the concept of normalization in traditional databases, where only the FDs of nonkey attributes are considered. To have a consistent presentation of data, this work generalizes the definition of key attributes for the fuzzy databases; namely, an attribute $X$ is a key attribute in $R$ if there do not exist two tuples $t$ and $t'$ in $r(R)$ such that $t[X] \cong t'[X]$. Excluding FFDs that contain key attributes prevents unnecessary decomposition of relations that have no update anomaly problem. Although the decomposition without the key exclusion is still an ALJD, it increases the cost of the join operations in query processing.
Proposition 18. Let $R(X, Y, Z)$ be a relation and let $X \sim> Y$ be an FFD in $r(R)$. If $R_1 = \Pi_{(X,Y)}(R)$ and $R_2 = \Pi_{(X,Z)}(R)$, then (i) each FFD existing in $r(R_1)$ or $r(R_2)$ must exist in $r(R)$ and (ii) every FFD existing in $r(R)$ must either exist in $r(R_1)$ or $r(R_2)$ or be derived via FFDs in $r(R_1)$ and $r(R_2)$.
Proof. Statement (i) can be derived from Proposition 11 and Lemma 15. Statement (ii) can be derived from Lemma 16 and the properties of FFDs (namely, Armstrong's inference rules IR1, IR2, and IR3, described in endnote 2).
The above statements show that the ALJD also preserves the closure of FFDs in the original relation, which is important to the issues related to the application of FFDs.

Conclusion
The contribution of this work is fourfold. First, it highlights the problem of relation decomposition when tuple elimination is order sensitive. To overcome the problem, it proposes the notion of approximate equality for tuples and relations in the fuzzy databases and provides a measure of the approximate equality. The measure is reflexive, symmetric, and transitive. It enables classifying tuples into disjoint sets and ensures that a decomposed relation has a unique result after redundancy removal or tuple merging. Therefore, the notion of approximate equality is important for data operations in the fuzzy databases. Second, it proposes approximate lossless join decomposition for the fuzzy databases and defines two operations, projection and equal join, for the decomposition, all of which are based on the approximate equality. The data operations and ALJD can be applied to data compression in the fuzzy databases. Third, this work defines FFDs and proposes an algorithm to decompose relations in the fuzzy databases based on the FFDs. The decomposition by the algorithm ensures the approximate lossless join property. The FFD and ALJD proposed for the fuzzy databases are generalizations of the traditional FD and lossless join decomposition, respectively. This generality is important for dealing with databases containing both crisp and fuzzy data. Fourth, similar to the existing approaches to database normalization on resemblance-based fuzzy databases, this study provides several propositions to prove that the proposed decomposition approach satisfies a degree of the lossless join property. Compared with normalization for resemblance-based fuzzy databases, achieving lossless join decomposition for the extended possibility-based fuzzy databases is more difficult because the data are more complex.
There are several directions for future work. Future studies can adopt the notion of approximate equality to define data operations for query processing in the fuzzy databases. The notion can also be applied to research related to data compression, fuzzy association rules, missing value prediction, relation compactness, and integrity constraints in the fuzzy databases. Studies that aim to incorporate the fuzzy concept into clustering methods or data groupings for decision-making in marketing, healthcare applications, or business operations can adopt the approximate equality for the similarity measures. Since the fuzzy concept has been incorporated into object-oriented databases in the literature, future work can provide the approximate equality specifically for the data in fuzzy object-oriented data models.
(i) It needs to avoid decomposing a relation based on trivial FFDs. (ii) It needs to make sure that the decomposed result preserves the closure of FFDs in the original relation.
Lossless Join Decomposition. In a traditional relational database, a row is called a tuple, a column header is called an attribute, and the table is called a relation. Given an $n$-ary relation schema $R(A_1, A_2, \ldots, A_n)$, an instance of $R$, denoted by $r(R)$, is the set of all tuples in $R$. Let $\mathbf{A}$ denote the set of attributes $A_1, A_2, \ldots, A_n$. A functional dependency (FD) $A_i \to A_j$ existing in $R(\mathbf{A})$ states that tuples having the same value on attribute $A_i$ must be identical on $A_j$, where $A_i, A_j \in \mathbf{A}$. Two operations are related to lossless join decomposition: projection and natural join. The projection operation generates a result by selecting certain attributes from a given relation and removing redundant tuples. Let $\Theta$ denote a set of attributes in $R(\mathbf{A})$; that is, $\Theta \subset \mathbf{A}$. The result of projecting $r$ over the attributes $\Theta \subseteq \mathbf{A}$ is $\Pi_\Theta(r) = \{t[\Theta] \mid t \in r(R)\}$, where $t[\Theta]$ represents the composite of values on $\Theta$ in tuple $t$. The natural join (denoted by $*$) of $r_1(XY)$ and $r_2(YZ)$ is obtained by removing the duplicate attribute from the results of the equal join on the joined attribute $Y$ and is denoted as shown below: