A Feasibility and Performance Study of Dependency Inference

This study develops the foundation for a simple yet efficient method for uncovering functional and approximate functional dependencies in relational databases. The technique is based upon the mathematical theory of partitions defined over a relation's row identifiers. Using a levelwise algorithm, the minimal non-trivial functional dependencies can be found with computations conducted on integers, so the required operations on partitions are both simple and fast. Additionally, the row identifiers directly identify the rows that are exceptions to approximate functional dependencies, which can be used effectively in practical data mining applications.


Introduction
The complexity of discovering functional dependencies has been studied in [5], [6], [7]. Functional dependencies are relationships between attributes of a database relation. A functional dependency states that the value of an attribute is uniquely determined by the values of some other attributes. Algorithmic approaches to the discovery of functional dependencies have been studied in [2], [3], [6], [10], [11].
Suppose that a company sets up a database to keep track of its employees and the various departments to which they are assigned from time to time. This would require three relations: one for employees, one for departments, and one for assignments of employees to departments. An instance of this database might include the relations shown in Table 1. In the ASSIGNMENTS relation, Employee number does not functionally determine Date assigned, but Employee number plus Department code does functionally determine Date assigned. Employee number is a primary key for the EMPLOYEES relation, since Employee number functionally determines all of the remaining attributes within the relation. Department code is the primary key for the DEPARTMENTS relation.
† Requests for reprints should be sent to Ronald King, Computer Science Department, The University of Texas at Tyler, Tyler, Texas 75799.
An approximate functional dependency is a functional dependency that almost holds. In a relation, a few rows can contain errors due to various noise factors, or can simply be exceptions to the rule. Many operational definitions of approximate functional dependencies have been studied [4]. The definition utilized in this paper is based upon the minimum number of rows that need to be removed from the relation $r$ for $X \to A$ to hold in $r$. The error is
$$g_3(X \to A) = 1 - \frac{\max\{|s| : s \subseteq r \text{ and } X \to A \text{ holds in } s\}}{|r|}.$$
$X \to A$ is an approximate dependency if and only if $g_3(X \to A) \le \varepsilon$ for a given threshold $\varepsilon$ with $0 \le \varepsilon \le 1$. If the modification of the EMPLOYEES relation shown in Table 2 is made, then First name plus Family name does not functionally determine Salary, because of the Salary value in the first entry; the dependency holds only approximately.
The algorithm for discovering functional and approximate functional dependencies employed in this paper is similar to the levelwise approach for the discovery of association rules [1]. That search strategy first computes some non-trivial information about attribute sets, namely the frequent item sets, from which the association rules can then be computed easily. In the present study, the non-trivial information about attribute sets takes the form of partitions of row identification numbers, from which the dependencies can be computed. The levelwise method was employed in [8] for the computation of dependencies as an instance of a generic data mining algorithm. The concept of rough sets, which is based upon partitions, was introduced in [9]. Rough sets were utilized in [12] to identify the most critical factors, allowing irrelevant attributes in a relation to be eliminated prior to the generation of rules describing data dependencies in databases.

Definition. Rows $s$ and $t$ are equivalent with respect to a set of attributes $X$ if and only if $s[A] = t[A]$ for all $A \in X$.
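Note that this definition of equivalent rows on a set of attributes $X$ partitions the rows of the relation into equivalence classes. The equivalence class of a row $t \in r$ with respect to a given set $X$ consists of all rows of $r$ that are equivalent to $t$ with respect to $X$, and $\Pi_X$ denotes the resulting partition of $r$. A partition $\Pi$ refines a partition $\Pi'$ if every equivalence class of $\Pi$ is a subset of some equivalence class of $\Pi'$.
Theorem 1 A functional dependency $X \to A$, where $A$ is a single attribute and $X$ is a set of attributes, holds if and only if $\Pi_X$ refines $\Pi_A$.
Proof: Assume that $X \to A$ holds. Then any two rows that are equivalent with respect to $X$ also agree on $A$. Thus for every $C \in \Pi_X$ there exists $D \in \Pi_A$ such that $C \subseteq D$. We can therefore conclude that $\Pi_X$ refines $\Pi_A$.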
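The refinement test of Theorem 1 can be illustrated with a minimal Python sketch. The function names (`partition`, `refines`, `holds`) and the toy relation `ROWS` are hypothetical and chosen only for illustration; they are not taken from the paper's tables, and the quadratic refinement check is written for clarity rather than speed.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Partition row identifiers by their values on attrs (the partition Pi_X)."""
    classes = defaultdict(set)
    for row_id, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(row_id)
    return list(classes.values())

def refines(fine, coarse):
    """True if every class of `fine` is a subset of some class of `coarse`."""
    return all(any(c <= d for d in coarse) for c in fine)

def holds(rows, lhs, rhs):
    """Theorem 1 test: X -> A holds iff Pi_X refines Pi_A."""
    return refines(partition(rows, lhs), partition(rows, [rhs]))

# Hypothetical toy relation (an assumption, not the paper's Table 1).
ROWS = [
    {"emp": 1, "dept": "A", "city": "Tyler"},
    {"emp": 2, "dept": "A", "city": "Tyler"},
    {"emp": 3, "dept": "B", "city": "Dallas"},
    {"emp": 4, "dept": "B", "city": "Dallas"},
]

print(holds(ROWS, ["dept"], "city"))  # tests dept -> city
print(holds(ROWS, ["dept"], "emp"))   # tests dept -> emp
```

Only row identifiers (integers) are stored in the partitions, which is the property the approach relies on.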
An extremely interesting simplification of the latter theorem exists: adding the attribute $A$ to the set of attributes $X$ does not break any equivalence classes of $\Pi_X$ whenever $\Pi_X$ refines $\Pi_A$.
Theorem 2 A functional dependency $X \to A$ holds if and only if $|\Pi_X| = |\Pi_{X \cup \{A\}}|$, where $|\Pi_X|$ denotes the rank of the partition $\Pi_X$ (the number of equivalence classes belonging to the partition).
Proof: Case I: Assume that $X \to A$. Then adding $A$ to $X$ does not break any equivalence classes in $\Pi_X$, since any two rows that agree on $X$ also agree on $A$.
Using Theorem 2 one can determine the approximate functional dependencies for a relation $r$. The error $g_3(X \to A)$ for a functional dependency $X \to A$ is the minimum fraction of rows that must be removed from the relation for the dependency to hold. Note that any equivalence class $C$ of $\Pi_X$ is the union of one or more equivalence classes $C_1, C_2, \ldots$ of $\Pi_{X \cup \{A\}}$, and the rows in all but one of the $C_i$ must be removed for $X \to A$ to be valid. The minimum number of rows to remove is the size of $C$ minus the size of the largest of the $C_i$. Therefore,
$$g_3(X \to A) = 1 - \frac{1}{|r|} \sum_{C \in \Pi_X} \max\{|C_i| : C_i \in \Pi_{X \cup \{A\}} \text{ and } C_i \subseteq C\}.$$
Clearly any superkey has the property that its partition consists of singleton equivalence classes only. Additionally, a set $X$ is a key if it is a superkey and no proper subset of it is a superkey. These observations lead to the following definition:
Definition. The error of a superkey, $g_3(X)$, is the minimum fraction of rows that need to be removed from the relation $r$ for $X$ to be a superkey. Given an error threshold $\varepsilon$, where $0 \le \varepsilon \le 1$, $X$ is an approximate superkey if and only if $g_3(X)$ is at most $\varepsilon$.
The partition $\Pi_X$ can be utilized for computing $g_3(X)$: $g_3(X) = 1 - |\Pi_X|/|r|$. Also, due to the latter computation and Theorem 2, we have:
Theorem 3 A functional dependency $X \to A$ holds if and only if $g_3(X) = g_3(X \cup \{A\})$.
The latter foundation can be employed in data mining to find all minimal non-trivial functional dependencies by searching through the space of non-trivial dependencies and testing the validity and minimality of each dependency. A functional dependency $X \to A$ for which there does not exist $Y \subset X$ such that $Y \to A$ is called a minimal functional dependency. These minimal non-trivial dependencies are tested for validity by taking refinements of partitions, and superkeys are represented by partitions containing only singleton equivalence classes. This leads to the observation that a singleton equivalence class of the left-hand side of a functional dependency cannot break any dependency.
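The error measures discussed above can be sketched in Python as follows. The helper names (`exception_rows`, `g3_dependency`, `g3_superkey`) and the toy relation are hypothetical; the attribute names echo the EMPLOYEES example, but the values are invented for illustration and are not the paper's Table 2. Within each equivalence class of the left-hand side, the sketch keeps the largest sub-class and reports the remaining row identifiers as exceptions.

```python
from collections import defaultdict

def exception_rows(rows, lhs, rhs):
    """Row ids to remove so that lhs -> rhs holds: within each class of Pi_X,
    keep the largest class of Pi_{X u {A}} and remove all other rows."""
    classes = defaultdict(lambda: defaultdict(list))
    for row_id, row in enumerate(rows):
        classes[tuple(row[a] for a in lhs)][row[rhs]].append(row_id)
    removed = []
    for subclasses in classes.values():
        largest = max(subclasses.values(), key=len)
        for ids in subclasses.values():
            if ids is not largest:
                removed.extend(ids)
    return sorted(removed)

def g3_dependency(rows, lhs, rhs):
    """g3(X -> A): minimum fraction of rows to remove for X -> A to hold."""
    return len(exception_rows(rows, lhs, rhs)) / len(rows)

def g3_superkey(rows, attrs):
    """g3(X) = 1 - |Pi_X| / |r|."""
    rank = len({tuple(row[a] for a in attrs) for row in rows})
    return 1 - rank / len(rows)

# Hypothetical toy data (an assumption, not the paper's Table 2).
ROWS = [
    {"first": "Ann", "family": "Lee", "salary": 100},
    {"first": "Ann", "family": "Lee", "salary": 100},
    {"first": "Ann", "family": "Lee", "salary": 120},
    {"first": "Bob", "family": "Ray", "salary": 90},
]

print(exception_rows(ROWS, ["first", "family"], "salary"))
print(g3_dependency(ROWS, ["first", "family"], "salary"))
```

Because the partitions hold row identifiers, the exceptions come out as concrete rows rather than only as a count, which is the practical advantage noted in the abstract.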

Optimizations via Constrained Partitions
Both space conservation and efficiency considerations lead to the concept of constrained partitions.
Definition. A partition with singleton equivalence classes removed is called a constrained partition. $\hat{\Pi}$ denotes the constrained partition for the partition $\Pi$.
Theorem 1 still holds for constrained partitions, since the refinement relationships of partitions are not affected by the singleton equivalence classes. But $|\hat{\Pi}_X|$ can be the same as $|\hat{\Pi}_{X \cup \{A\}}|$ even if $|\Pi_X| \ne |\Pi_{X \cup \{A\}}|$. However, Theorem 2 can be employed. The value $g_3(X)$ can be found using constrained partitions:
$$g_3(X) = \frac{\|\hat{\Pi}_X\| - |\hat{\Pi}_X|}{|r|},$$
where $\|\hat{\Pi}_X\|$ is the sum of the cardinalities of the equivalence classes in $\hat{\Pi}_X$. Computing the error $g_3(X \to A)$ for the relation $r$ takes $O(|r|)$ time. But employing the operational definitions for $g_3(X \to A)$ and $g_3(X)$ leads to:
$$g_3(X) - g_3(X \cup \{A\}) \le g_3(X \to A) \le g_3(X).$$
Therefore, if $g_3(X) - g_3(X \cup \{A\}) > \varepsilon$ or $g_3(X) < \varepsilon$, then one does not need to compute $g_3(X \to A)$ to decide whether or not $X \to A$ is an approximate dependency.
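A minimal Python sketch of constrained partitions and of the shortcut based on the bounds for the error follows. All names (`constrained_partition`, `g3_from_constrained`, `decide_by_bounds`) and the toy data are hypothetical. The sketch assumes a strict inequality for the lower-bound test, since a dependency whose error equals the threshold still qualifies as approximate.

```python
from collections import defaultdict

def constrained_partition(rows, attrs):
    """Partition of row ids on attrs with singleton classes removed."""
    classes = defaultdict(list)
    for row_id, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(row_id)
    return [ids for ids in classes.values() if len(ids) > 1]

def g3_from_constrained(constrained, n_rows):
    """g3(X) = (sum of class sizes - number of classes) / |r|."""
    total = sum(len(c) for c in constrained)
    return (total - len(constrained)) / n_rows

def decide_by_bounds(rows, lhs, rhs, eps):
    """Return False or True when the bounds settle whether lhs -> rhs is an
    approximate dependency, or None when g3(lhs -> rhs) must be computed."""
    n = len(rows)
    e_x = g3_from_constrained(constrained_partition(rows, lhs), n)
    e_xa = g3_from_constrained(constrained_partition(rows, list(lhs) + [rhs]), n)
    if e_x - e_xa > eps:   # lower bound already exceeds the threshold
        return False
    if e_x < eps:          # upper bound is already below the threshold
        return True
    return None

# Hypothetical toy data (an assumption, not taken from the paper).
ROWS = [
    {"first": "Ann", "family": "Lee", "salary": 100},
    {"first": "Ann", "family": "Lee", "salary": 100},
    {"first": "Ann", "family": "Lee", "salary": 120},
    {"first": "Bob", "family": "Ray", "salary": 90},
]

for eps in (0.1, 0.3, 0.6):
    print(eps, decide_by_bounds(ROWS, ["first", "family"], "salary", eps))
```

Dropping singleton classes saves space without changing the error values, because a singleton class contributes nothing to either the sum of cardinalities minus the class count or to any violation.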

Searching for the Non-Trivial Minimal Functional Dependencies
Partitions of row numbers can be utilized to perform the necessary validity tests on functional dependencies. Computing constrained partitions allows these validity tests to be performed efficiently.
The search space for the functional and approximate functional dependencies consists of all left-hand sides of potential dependencies. For example, this space for the attributes W, X, Y, Z is the set containment lattice illustrated in Figure 1.
Using a levelwise algorithm, the search starts from the singleton sets and works its way through the lattice level by level until the minimal functional dependencies are found. For each set $X$ of attributes, the algorithm tests dependencies of the form $X - \{A\} \to A$, where $A \in X$. False dependencies are eliminated as early as possible in order to reduce the search space. An edge between the sets $X$ and $X \cup \{A\}$ in the set containment lattice represents the non-trivial dependency $X \to A$. Search efficiency is obtained by reducing the computations on each level by using results from previous level computations.
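The level-by-level traversal can be sketched in Python as below. This is a deliberately naive illustration under stated assumptions, not the pruned algorithm of the paper: it visits every attribute set, tests validity by comparing the number of equivalence classes before and after adding the right-hand side, and checks minimality only against dependencies found on earlier levels. Dependencies with an empty left-hand side are ignored. The names `rank` and `minimal_fds` and the toy relation are hypothetical.

```python
from itertools import combinations

def rank(rows, attrs):
    """Number of equivalence classes of the partition on attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def minimal_fds(rows, attributes):
    """Levelwise search: for each set X on level |X| = 2, 3, ..., test the
    dependencies X - {A} -> A and keep those that are valid and minimal."""
    found = []  # list of (lhs as frozenset, rhs attribute)
    for size in range(2, len(attributes) + 1):
        for combo in combinations(attributes, size):
            for a in combo:
                lhs = frozenset(combo) - {a}
                # A smaller left-hand side already determines a: not minimal.
                if any(r == a and l < lhs for l, r in found):
                    continue
                lhs_list = sorted(lhs)
                if rank(rows, lhs_list) == rank(rows, lhs_list + [a]):
                    found.append((lhs, a))
    return found

# Hypothetical toy relation (an assumption, not the paper's Table 1).
ROWS = [
    {"emp": 1, "dept": "A", "city": "Tyler"},
    {"emp": 2, "dept": "A", "city": "Tyler"},
    {"emp": 3, "dept": "B", "city": "Dallas"},
    {"emp": 4, "dept": "B", "city": "Dallas"},
]

for lhs, rhs in minimal_fds(ROWS, ["emp", "dept", "city"]):
    print(sorted(lhs), "->", rhs)
```

Because all left-hand sides on one level have the same size, a proper subset of a left-hand side can only come from an earlier level, which is why the minimality check against `found` suffices in this sketch.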
By using the lattice, one can compute a partition as a product of two earlier partitions. The product of two partitions $\Pi'$ and $\Pi''$, denoted by $\Pi' \times \Pi''$, is the least refined partition $\Pi$ that refines both $\Pi'$ and $\Pi''$.
In particular, for attribute sets $X$ and $Y$, the product $\Pi_X \times \Pi_Y$ equals $\Pi_{X \cup Y}$.
Proof: Case I: Assume we are given $\Pi_X$ and $\Pi_Y$. For each $C \in \Pi_X$, form the set of all nonempty intersections of $C$ with the equivalence classes of $\Pi_Y$; call this set $C_X$. Then the union of all the $C_X$ is by construction the least refined partition that refines both $\Pi_X$ and $\Pi_Y$. Therefore the latter construction yields $\Pi_X \times \Pi_Y$. Let $u, t \in C$ for some $C \in \Pi_X \times \Pi_Y$. Then there exist a $D \in \Pi_X$ and an $E \in \Pi_Y$ such that $C = D \cap E$, so $u$ and $t$ agree on every attribute of $X \cup Y$.
A levelwise computational scheme can be employed to compute the partitions. First, the partitions $\Pi_{\{A\}}$, for each attribute $A \in R$, are computed directly from the database. Partitions $\Pi_X$, for $|X| \ge 2$, are computed as the product of the partitions with respect to two subsets of $X$. To find the partition for a set of size two, the product of two singleton set partitions is employed. The partition for a set of size $k$ is found as the product of the partition for a subset of size $k - 1$ and a singleton set partition; thus only the partitions from the previous levels are employed in the computations for finding partitions on the present level.
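The product construction and the levelwise computation of partitions can be sketched in Python as follows. The names `partition`, `product`, and `canonical`, the attribute names, and the toy data are hypothetical. The sketch forms the nonempty intersections in time linear in the number of rows by first recording, for each row identifier, the class of the second partition that contains it.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Partition of row ids on attrs, computed directly from the relation."""
    classes = defaultdict(set)
    for row_id, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(row_id)
    return list(classes.values())

def product(p1, p2):
    """Least refined partition that refines both p1 and p2: the nonempty
    intersections of a class of p1 with a class of p2."""
    class_of = {row_id: j for j, cls in enumerate(p2) for row_id in cls}
    result = []
    for cls in p1:
        pieces = defaultdict(set)
        for row_id in cls:
            pieces[class_of[row_id]].add(row_id)
        result.extend(pieces.values())
    return result

def canonical(p):
    """Order-independent form of a partition, for comparisons."""
    return {frozenset(c) for c in p}

# Hypothetical toy relation (an assumption, not taken from the paper).
ROWS = [
    {"a": 1, "b": "x", "c": "p"},
    {"a": 1, "b": "y", "c": "p"},
    {"a": 2, "b": "x", "c": "p"},
    {"a": 2, "b": "x", "c": "q"},
]

# Level 1: singleton partitions come from the data; higher levels use products.
single = {name: partition(ROWS, [name]) for name in ("a", "b", "c")}
pi_ab = product(single["a"], single["b"])
pi_abc = product(pi_ab, single["c"])
print(canonical(pi_ab) == canonical(partition(ROWS, ["a", "b"])))
```

Only the singleton partitions touch the attribute values; every later level works on integer row identifiers alone.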
Once the partition $\Pi_X$ is found, the error $g_3(X)$ is computed in order to test a functional dependency's validity using Theorem 3. The same value $g_3(X)$ can be utilized for testing the validity of $X \to A$ or $X - \{A\} \to A$ for several $A \in R$.

Testing for Minimality of Functional Dependencies
To test for the minimality of $X - \{A\} \to A$, we need to know whether $Y - \{A\} \to A$ holds for some proper subset $Y$ of $X$. $c(X) = \{A \in X : \text{there exists } B \in X - \{A\} \text{ such that } X - \{A, B\} \to B\}$. The closure of the rhs candidates, $C^+(X)$, for a set $X \subseteq R$ is defined as:
Proof: Suppose that $X - \{A\} \to A$ is not minimal. Then there exists $B \in X - \{A\}$ for which $X - \{A, B\} \to A$. Then $A \notin C^+(X - \{B\})$. Therefore, if $A \in C^+(X - \{B\})$ for all $B$, then $X - \{A\} \to A$ is minimal. Assume that there exists $B \in X$ with $A \notin C^+(X - \{B\})$. Then there exists $C \in X$ ...
Theorem 7 gives the closure of rhs candidates two advantages over rhs candidates. First, once such a $B$ is encountered, checking can be stopped. Second, for some $B$, $C^+(X - \{B\})$ can be empty when $C(X - \{B\})$ is not empty. The last statement implies that, with the closure of rhs candidates, the set $X$ is never even generated due to pruning.
When a key is found, additional pruning methods can be applied. A dependency $X \to A$, $A \notin X$, is tested when $X \cup \{A\}$ is computed, since one needs $\Pi_{X \cup \{A\}}$ for validity testing. If $X$ is a superkey, then $X \to A$ is always valid and we do not need $X \cup \{A\}$. If a superkey $X$ is not a key, then a dependency $X \to A$ is not minimal for any $A \notin X$. Also, if $A \in X$ and $X - \{A\} \to A$, then by the second part of Theorem 5, $X - \{A\}$ is a superkey and we do not need $\Pi_X$ for testing the validity of $X - \{A\} \to A$. Hence $X$ and $\Pi_X$ are not required for finding the non-trivial dependencies. All keys can be deleted, and thereby all their supersets, i.e., the superkeys that are not keys, can be pruned.
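Key-based pruning can be illustrated with the following minimal Python sketch, which finds the keys level by level and skips every superset of a key already found. The names `is_superkey` and `minimal_keys` and the toy ASSIGNMENTS-style relation are hypothetical; the data are invented and are not the paper's Table 1.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """X is a superkey iff its partition has only singleton classes,
    i.e. the number of classes equals the number of rows."""
    return len({tuple(row[a] for a in attrs) for row in rows}) == len(rows)

def minimal_keys(rows, attributes):
    """Levelwise key search; supersets of a found key are never tested."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            x = frozenset(combo)
            if any(k < x for k in keys):  # superkey that is not a key: pruned
                continue
            if is_superkey(rows, combo):
                keys.append(x)
    return keys

# Hypothetical toy relation (an assumption, not the paper's Table 1).
ROWS = [
    {"emp": 1, "dept": "A", "date": "d1"},
    {"emp": 1, "dept": "B", "date": "d2"},
    {"emp": 2, "dept": "A", "date": "d1"},
    {"emp": 2, "dept": "B", "date": "d3"},
]

for key in minimal_keys(ROWS, ["emp", "dept", "date"]):
    print(sorted(key))
```

In this toy relation the full attribute set is never tested, because it is a proper superset of a key found on the previous level.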

Conclusion
The foundation for a new algorithmic approach for the discovery of functional and approximate functional dependencies from relations has been provided. The approach is based on partitions of row identification numbers from the relation and on determining non-trivial minimal dependencies from the partitions. A breadth-first or levelwise search for the dependencies is conducted. Additionally, the search space can be pruned effectively. Both the partitions and dependencies can be computed efficiently.

Figure 1. The set containment lattice for W, X, Y, Z.

Table 1. Company Database

Table 2. Modified EMPLOYEES Relation: Approximate Functional Dependency First name, Family name → Salary