A Top-Down Proof Procedure for Generalized Data Dependencies
Stéphane Coulondre
Abstract: Data dependencies are well known in the context of relational databases. They aim to specify constraints that the data must satisfy to model correctly the part of the world under consideration. The implication problem for dependencies is to decide whether a given dependency is logically implied by a given set of dependencies. A proof procedure for the implication problem, called the "chase", has already been studied in the generalized case of tuple-generating and equality-generating dependencies. The chase is a bottom-up procedure, from hypotheses to conclusion, and thus is not goal-directed. In the case of TGDs, it also entails the dynamic creation of new constants, which can turn out to be a costly operation. This paper introduces a new proof procedure which is top-down, from conclusion to hypotheses, and thus goal-directed. The originality of this procedure is that it does not act as classical theorem-proving procedures, which require a special form of expressions, such as clausal form, obtained after Skolemization. We show, with our procedure, that this step is useless, and that the notion of piece allows inferring directly on dependencies, thus saving the cost of Skolemizing the dependency set; moreover, the inference can be performed without dynamically creating new constants. Although top-down approaches are known to be less time-efficient than bottom-up ones, the notion of piece cuts down the irrelevant goals usually generated, leading to a usable top-down method. With the more recent introduction of constrained and ordered dependencies, some interesting perspectives also arise.
Dependency theory allows the expression and modelling of constraints that the data must satisfy in order to reflect correctly the world that a relational database intends to describe. Since the introduction of functional dependencies by Codd ([Cod72]), many kinds of dependencies have been studied in the literature, and a lot of work was carried out in the late 70's and early 80's. Database dependency theory is still an active area of research [SF00] [Her95] [LL97a] [LL97b] [LL98]. Functional and multivalued dependencies are the best-known classes of data dependencies. In practice, these two kinds are generally sufficient to express constraints ([Ull88]). Nevertheless, more general classes have been introduced, with the purpose of finding a uniform way to express constraints ([BV81], [SU82]). This paper deals with the class commonly known to generalize most of the dependencies: that of tuple-generating and equality-generating dependencies (TGDs and EGDs) ([BV84b]). For a survey of this general class of dependencies, we refer the reader to [LL99] or [FV86].
The central problem is the implication problem, which is to decide whether a dependency is logically implied by a given set of dependencies. A process solving this problem would make it possible to find the minimal cover of a set of dependencies, to decide whether a dependency is redundant within a set (useful during the constraint acquisition stage), etc. A procedure has already been designed in [BV84b] for that purpose: the well-known chase. Unfortunately, as the implication problem for TGDs and EGDs is semi-decidable, the chase is only a proof procedure, and therefore the process may run forever. As we argue in this paper, the chase is clearly a bottom-up procedure: from hypotheses to conclusion. The chase also entails the dynamic creation of new constants in the general case of TGDs.
We introduce a new proof procedure which is top-down, from conclusion to hypotheses, and thus goal-directed. The originality of this procedure is that it does not act as classical theorem-proving procedures, which require a special form of the dependency set, such as clausal form, obtained after Skolemization. We show, with our procedure, that this step is useless, and that the notion of piece allows inferring directly on the dependency set.
Therefore, our top-down chase is not simply the usual chase reversed, but a new way of solving the implication problem. The fact that it can be performed top-down is the first contribution of this paper. The second contribution is to avoid the dynamic creation of constants, as well as the Skolemization of the dependency set, usually applied to the original knowledge base prior to top-down proofs. This is achieved by taking advantage of the form of the dependencies. To our knowledge, this has not been done before. Indeed, the dynamic creation of constants, performed by proof procedures such as the chase in order to take into account the effect of existential quantifiers, can be costly.
While it is true that top-down approaches can take exponentially longer than bottom-up ones, several reasons lead us to think that the proof procedure presented in this paper is efficient. These arguments are detailed in the last section. The efficiency w.r.t. the bottom-up chase is currently being assessed in detail.
More recently, constrained dependencies ([Mah94], [BCW95], [Mah97]) and ordered dependencies [Ng99] have been introduced. They originate in the constraint programming field and permit the expression of semantic relations on variables, thus giving them an interpretation. The chase procedure has been redesigned in [MS96], still in a bottom-up way, in order to deal with constrained tuple-generating dependencies. This work in dependency theory opens new perspectives for the top-down chase procedure we present.
The top-down chase originates in the conceptual graphs model, a knowledge representation model introduced by Sowa ([Sow84]). The base model has been extended with graph rules and an inference method, called piece resolution ([SM96]). The logical roots of this process have been studied in [CS98], and constitute the basis of the top-down chase. Proofs of the lemmas and theorems of this paper are therefore derived from these two last-mentioned works.
Section 2 describes the framework and the implication problem for data dependencies. We sketch the existing (bottom-up) chase. In Section 3 the top-down chase is explained. Section 4 closes the paper with some concluding remarks.
The following definitions are for the most part taken from [BV84b], though simplified for the purpose of this paper. The first subsection states some assumptions we make throughout this paper. The second subsection presents the necessary definitions. The third subsection describes the kind of dependencies this paper focuses on. Note that they are known to capture the semantics of most of the dependency types studied in the literature. The fourth subsection formally presents the implication problem and then describes the traditional chase procedure, which was designed to solve it. The chase is clearly a bottom-up mechanism, from hypotheses to conclusion.
For the sake of clarity, we assume several restrictions on the model. First, we assume that the universal relation assumption holds, i.e. the database is modelled with only one relation, because this usually permits a simpler formal presentation of the approaches. Secondly, dependencies are typed (many-sorted), so attribute domains are disjoint. Thirdly, we do not address dependencies with constants, such as the last one illustrated below.
TGDs and EGDs can all be expressed in first-order logic. Informally speaking, an equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal. A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.
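To fix ideas, the two forms can be written down concretely. The sketch below assumes a hypothetical encoding, not the paper's notation: tuples of a single universal relation are Python tuples of symbols, strings prefixed with `_` play the role of variables, and all other strings are constants.

```python
# Hypothetical encoding of dependencies over one universal relation.
# Symbols starting with '_' are variables; the rest are constants.

# TGD <I', I>: if the tuples of the hypothesis I are in the database,
# then the tuples of the conclusion I' must also exist (values of I'
# not bound by I are unknown, i.e. existentially quantified).
tgd = (
    frozenset({("_order", "_cust")}),   # conclusion I'
    frozenset({("_inv", "_order")}),    # hypothesis I
)

# EGD <(a1, a2), I>: if the tuples of I are in the database, then the
# values a1 and a2 must be equal (here: same name implies same id).
egd = (("_id1", "_id2"),
       frozenset({("_name", "_id1"), ("_name", "_id2")}))
```

Under this illustrative convention, the EGD corresponds to the customer example below (same name implies same identifier) and the TGD to the invoice/order example.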
For example, to express a classical functional dependency stating that two customers having the same name also have the same identifier, we use the following EGD, which is also a functional dependency:
To express that an invoice is always related to an existing order, we use the following TGD:
By introducing constants, we can state some more specialized constraints. For example, invoice 23 is related to an order taken by customer 12:
However, the framework used in this paper is that of tableaux, because it is well-suited when the universal relation assumption holds. On the other hand, this assumption is very impractical for real-world applications, because it would imply modelling the whole database schema with only one relation. That is the reason why we do not use real-world examples throughout this paper. Note that these restrictions appear here for clarity reasons only and all the results are applicable in the unrestricted model.
I :  w :  (tableau contents omitted)
▴
An equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal.
<(c_{1},c_{2}),I> :  K :  (tableau contents omitted)
▴
A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.
I′ :  T=<I′,I> :  K′ :  (tableau contents omitted)
▴
Let D be a set of dependencies, and d be a dependency. The implication problem is to decide whether D⊨ d, that is, to determine whether d is true in every database in which each dependency of D is true. Let SAT(D) be the set of all relations, composed only of constants, that satisfy all the dependencies in D; the implication problem is then equivalent to deciding whether SAT(D)⊆ SAT(d).
The implication problem can also be considered under a different view: that of the finite implication problem, which is to decide whether d is satisfied by every finite relation that satisfies all dependencies in D. However, as detailed below, this problem admits no proof procedure.
The chase procedure was designed to solve the implication problem. The reader can refer to [BV84b] for a complete description. Intuitively speaking, if the dependency to be proven is the TGD <I′,I>, or the EGD <(a_{1},a_{2}),I>, the chase procedure takes the relation I and treats it as a set of tuples. It then repeatedly applies the dependencies of D, following two distinct rules: one for TGDs, whose effect is to add tuples to the relation, and one for EGDs, whose effect is to "identify" two constants. When the dependency to be proven is a TGD, the procedure stops when the tuples of I′ are obtained. When it is an EGD, the procedure stops when the identification of a_{1} and a_{2} is obtained.
This mechanism has been shown to be sound and complete in [Var84]. Note that the implication problem for TGDs and EGDs is semi-decidable. Thus the chase may not terminate.
The chase procedure is clearly a bottom-up (or forward-chaining) one. Indeed, rule applications generate new tuples or identifications of constants. This is executed until the desired conclusion is obtained. The goal to be proven is not used to guide the process. Moreover, when applying a TGD rule, whose effect is to add tuples to the relation, existential quantification always requires a costly dynamic creation of new constants.
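As a point of comparison for what follows, a naive bottom-up TGD step can be sketched as below. The representation is an illustrative assumption (Python tuples of symbols, `_`-prefixed strings as variables), not the paper's formalism; note how the step must invent fresh constants for existential variables.

```python
from itertools import count

_fresh = count()  # supply of new constants, created dynamically

def is_var(s):
    return s.startswith("_")

def match(patterns, relation, sub=None):
    """Yield substitutions mapping every pattern tuple onto some tuple
    of the relation (naive backtracking search)."""
    sub = dict(sub or {})
    if not patterns:
        yield sub
        return
    first, rest = patterns[0], patterns[1:]
    for fact in relation:
        if len(fact) != len(first):
            continue
        s, ok = dict(sub), True
        for p, f in zip(first, fact):
            if is_var(p):
                if s.setdefault(p, f) != f:
                    ok = False
                    break
            elif p != f:
                ok = False
                break
        if ok:
            yield from match(rest, relation, s)

def chase_step_tgd(tgd, relation):
    """One bottom-up application of a TGD <I', I>: wherever the
    hypothesis I maps into the relation, add the conclusion I',
    inventing a fresh constant for each existential variable."""
    conclusion, hypothesis = tgd
    new_facts = set()
    for sub in match(list(hypothesis), relation):
        s = dict(sub)
        for t in conclusion:
            for p in t:
                if is_var(p) and p not in s:
                    s[p] = f"new{next(_fresh)}"  # dynamic constant creation
        new_facts |= {tuple(s.get(p, p) for p in t) for t in conclusion}
    return relation | new_facts
```

The fresh-constant line is exactly the costly operation the top-down procedure of the next section avoids.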
We now show that we can apply a top-down (or backward-chaining) procedure, in which the process is goal-directed.
Depending on the type of the dependencies, the implication problem is solvable or recursively unsolvable ([BV81, Var84, GL82]). This means that in the first case, there is a decision procedure, hence an algorithm that always halts, whereas in the second case, there is a proof procedure: if the implication is true, then the process will terminate; otherwise, it might never stop. The finite implication problem, on the other hand, is not even partially solvable. Therefore, there can be no proof procedure for it.
A subset of TGDs is known as Full TGDs. These dependencies have the same form as Datalog rules. In this case, the notion of piece, stated in the introduction, is useless. Therefore, dedicated top-down theorem-proving procedures, such as Query-Subquery ([Vie86]) and OLD-Resolution ([TS86]), can be applied to solve the implication problem involving only Full TGDs. The principal aim of these procedures is to provide a terminating algorithm for the top-down evaluation of Datalog rules. Indeed, even if the implication problem is decidable for this particular class, their top-down evaluation might not terminate. To tackle this problem, tabulation (or memoing) techniques must be applied in order to cut down looping sequences. Notice that these techniques do not affect completeness. In the general case of TGDs, however, there are non-terminating sequences that remain undetectable, due to the undecidable nature of the implication problem. In the case of Full TGDs, the implication problem and the finite implication problem coincide and are both decidable.
For this kind of dependency, as well as for all other subsets (functional and multivalued dependencies), the interest of the top-down chase is that it is also applicable. However, in these particular cases, specific proof procedures should be more efficient because they implement memoing. The top-down chase and the classical bottom-up chase are clearly needed when dependencies are more complex (i.e. not in a decidable subset), and also to provide a general way to solve the implication problem. Notice that memoing can be added to the top-down chase too, but this is an implementation-level extension.
Much work has also been carried out in the active database area. For a survey, we refer the reader to [WC95]. Active databases generally support the automatic triggering of updates in response to internal or external events. These updates are expressed by means of rules which are slightly different from TGDs and EGDs. Usually, these rules can be expressed in Datalog with negation [BDR98] [AHV95]. The main difference lies in the fact that the variables occurring in the conclusion are all universally quantified, whereas in database dependencies they are existentially quantified if they do not also occur in the hypothesis.
The first subsection presents a proof procedure for TGDs only. The second subsection shows how this procedure can take EGDs into account as well, by using some reduction theorems.
Let D be a set of TGDs without constants, and let d=<J′,J> be a TGD without constants. Intuitively speaking, to decide whether D⊨ d, we start with the tuples of J′ and treat them as if they formed a relation. Let Q be this relation. Q is considered the goal to reach. On the other hand, we add J to D, after replacing each symbol of J by a new constant, by transforming each tuple into a TGD with constants. Then we try to remove the tuples of Q by successively applying a rule, giving each time a new goal. These rule applications may introduce new tuples. If we succeed in removing all tuples, i.e. in obtaining an empty goal, then D⊨ d.
Compared with classical theorem-proving methods, this proof procedure relies on a complex core rule that does not require any modification of the dependencies in D. Indeed, in order to apply the classical resolution method, one needs to rewrite D and ¬ d in clausal form. To do so, the Skolemization step would generate constants or functions due to the presence of existential quantifiers in the conclusion. Then the resolution method would try to generate the empty clause. Once it is obtained, Herbrand's theorem ensures that D⊨ d.
In the present case, the top-down proof procedure requires neither rewriting the dependencies of D nor dynamically creating new constants or functions. Indeed, the originality is to allow inferring directly on the dependencies of D, thus providing a general mechanism. Notice that this is made possible by the particular form of TGDs and EGDs.
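The overall control loop of such a goal-directed procedure can be sketched as follows. This is a hedged illustration, not the paper's algorithm: `apply_rule` is a hypothetical parameter standing for the core rule defined below, returning the possible successor goals of a given goal.

```python
def top_down_chase(dependencies, goal, apply_rule, limit=1000):
    """Breadth-first search for a sequence of rule applications that
    empties the goal. Returns True when an empty goal is reached
    (implication proven) and False when the frontier is exhausted or
    the step limit is hit; since the implication problem is only
    semi-decidable, unbounded search may run forever."""
    frontier = [goal]
    for _ in range(limit):
        if not frontier:
            return False        # no more goals to try: search failed
        successors = []
        for g in frontier:
            if not g:
                return True     # empty goal: implication proven
            for tgd in dependencies:
                successors.extend(apply_rule(tgd, g))
        frontier = successors
    return False
```

The `limit` parameter is only a pragmatic bound on a semi-decidable search, not part of the formal procedure.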
Let us now explain this process in a more formal way. We need the notion of a distinct substitution, which replaces each symbol of a relation by a new constant.
Now, we can add to D a set of TGDs with constants, each of them being of the form <u,∅>, with u corresponding to a distinct tuple in ω(J). Thus we add |ω(J)| TGDs. The following theorem states that checking whether D⊨ d can be reduced to checking whether D_{ω}⊨ Q, where D_{ω} is the result of adding the new TGDs to D, and Q is the goal. Note that the added TGDs are the only ones of D_{ω} containing constants. More formally, we have:
d=<J′,J> :  (tableau contents omitted)
Let ω be a distinct substitution on J such that a_{4}↦ a, a_{5}↦ a′, b_{3}↦ b, b_{4}↦ b′, c_{3}↦ c, c_{4}↦ c′ and d_{3}↦ d, with a,a′,b,b′,c,c′,d being constants. Let ω′ be an extension of ω on J′ such that ω′ is the identity on VAL(J′)−VAL(J). Let D_{ω} denote D∪ { <u,∅ > | u∈ ω(J)}:
{ <u,∅ > | u∈ ω (J)} :  (tableau contents omitted)
Let Q denote the TGD <ω ′(J′),∅ > :
Q=<ω ′(J′),∅ > :  (tableau contents omitted)
Then D⊨ d if and only if D_{ω}⊨ Q .
▴
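Assuming a toy representation where tuples are Python tuples of symbols and `_`-prefixed strings are variables (an illustrative convention, not the paper's notation), the construction of theorem 1 can be sketched as:

```python
from itertools import count

_consts = count()  # supply of new constants for the distinct substitution

def freeze(d):
    """Given d = <J', J> as (conclusion, hypothesis), build the extra
    hypothesis-free TGDs { <u, {}> | u in omega(J) } and the goal
    Q = <omega'(J'), {}>, where omega maps each symbol of J to a new
    constant and omega' extends it by the identity elsewhere."""
    conclusion, hypothesis = d
    omega = {}
    for t in hypothesis:
        for s in t:
            omega.setdefault(s, f"k{next(_consts)}")  # distinct substitution
    extra_tgds = [(frozenset({tuple(omega[s] for s in t)}), frozenset())
                  for t in hypothesis]
    goal = frozenset(tuple(omega.get(s, s) for s in t) for t in conclusion)
    return extra_tgds, goal
```

By the theorem, deciding D⊨ d then reduces to deciding whether the goal can be emptied from D extended with `extra_tgds`.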
We now describe the main step of the process. Given a TGD and a goal, this rule constructs a new goal. We need to introduce the notion of piece. A piece is a set of tuples that are semantically linked because they share unknown values; when applying the rule, these tuples are treated as a whole, at the same time. This notion is an alternative to the classical Skolemization process, which replaces unknown values by constants prior to using classical first-order logic provers. It is the originality of the top-down chase.
L :  (tableau contents omitted)
P_{1} :  P_{2} :  (tableau contents omitted)
▴
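Computationally, a piece can be read as a connected component: two tuples of a goal belong to the same piece when they are chained through shared symbols of a distinguished set E (in the paper, the non-constant symbols of VAL(I′)−VAL(I)). A minimal sketch, under the assumption that goals are sets of Python tuples of symbols:

```python
def pieces(goal, E):
    """Partition the tuples of `goal` into pieces: connected components
    of the relation 'shares a symbol belonging to E'. A tuple with no
    symbol of E forms a piece on its own."""
    remaining, result = set(goal), []
    while remaining:
        piece = {remaining.pop()}
        shared = {s for t in piece for s in t if s in E}
        grown = True
        while grown:                      # grow the component to a fixpoint
            grown = False
            for t in list(remaining):
                if any(s in shared for s in t):
                    remaining.discard(t)
                    piece.add(t)
                    shared |= {s for s in t if s in E}
                    grown = True
        result.append(frozenset(piece))
    return result
```

For instance, with E={x}, a goal {(a,x),(x,b),(c,d)} splits into two pieces: the two tuples chained by x, and the tuple (c,d) alone.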
We now define the core rule.
Given a TGD T=<I′,I> and a goal Q_{n}=<V,∅ >, if there is a valuation h and a piece P of h(V) such that P⊆ h(I′), then the result of the application of the rule, denoted Q_{n+1}=T(T,Q_{n}), is Q_{n+1}=<h(V)\ P∪ h(I),∅ >. Q_{n+1} becomes a new goal. It is thus obtained by removing P from h(V) and adding h(I). Note that there may be several possible TGDs obtained by an application of the rule, depending on the valuation h. In a sense, the piece notion allows us to group together some tuples according to some particular symbols they share (which correspond in logic to existentially quantified variables of T).
T=<I′,I> :  Q_{n}=<V,∅ > :  (tableau contents omitted)
T′=<h(I′),h(I)> :  Q_{n}′=<h(V),∅ > :  (tableau contents omitted)
There are two pieces of h(V) in relation to (VAL(I′)−VAL(I))\ CONS={a_{2}} , which are the first tuple of h(V) and the last two tuples of h(V) . The second piece P is such that P⊆ h(I′) . We construct Q_{n+1}=T(T,Q_{n}) . Let V′=h(V)\ P∪ h(I) :
Q_{n+1}=<V′,∅ > :  (tableau contents omitted)
▴
The following theorem states that a goal is implied by a set of TGDs if and only if there is a sequence of rule applications such that the sequence starts with the original goal, and gives the empty goal as an end result. The proof is detailed at the end of the paper.
It is important to notice that whenever D_{ω}⊨ Q, theorem 2 ensures that there is at least one sequence of rule applications, but it does not give a method to find it. Therefore, to implement the top-down chase, we need to add a search strategy (breadth-first or depth-first, for example). In order to illustrate the need for a search strategy, let us detail three kinds of sequences for the same set of TGDs D_{ω} and the same goal Q. The first one stops and ends with an empty goal, thus proving the implication. The second one does not terminate (although in this particular case, the loop might be detected and stopped), and the third one is stuck (no rule can be applied any more) and thus needs backtracking whenever the search algorithm is depth-first.
In the following examples, we consider that D_{ω} contains the TGD T of example 6, as well as another TGD S detailed below. Remember also that in 4 it was shown that, in order to prove that a dependency d is implied by a set of dependencies D, we construct a goal to be proven (i.e. without hypothesis), and we add some TGDs (also without hypothesis) to D, giving D_{ω}. Therefore, in the present case, D_{ω} contains T as well as two more TGDs presented in example 6. Finally, the goal to be proven is Q, already presented in example 4. D_{ω} is detailed below:
U_{1} :  U_{2} :  (tableau contents omitted)
T :  S :  (tableau contents omitted)
And Q is :
Q :  (tableau contents omitted)
Q_{1} :  (tableau contents omitted)
Consider the TGD U_{1}. Let U_{1}=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS=∅ . Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a (trivial). h is also defined by: h(a_{1})=a and h(d_{1})=d . Therefore we have :
Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
As there is no element in (VAL(I′)−VAL(I))\ CONS, each tuple of h(V_{1}) is itself a piece. The third tuple P is such that P ⊆ h(I′) . We construct Q_{2}=T(U_{1},Q_{1}) . Let Q_{2}=<h(V_{1})\ P∪ h(I), ∅> :
Q_{2} :  (tableau contents omitted)
Now consider the TGD U_{2}. Let U_{2}=<I′,I> and Q_{2}=<V_{2},∅ >. We can check that (VAL(V_{2})∩ (VAL(I)∪ VAL(I′)))\ CONS=∅ . Let h be a valuation on I∪ I′ ∪ V_{2} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a (trivial). h is also defined by: h(d_{2})=d′ . Therefore we have :
Q_{2}′=<h(V_{2}),∅ > :  (tableau contents omitted)
As there is no element in (VAL(I′)−VAL(I))\ CONS, each tuple of h(V_{2}) is itself a piece. The first tuple P is such that P ⊆ h(I′) . We construct Q_{3}=T(U_{2},Q_{2}) . Let Q_{3}=<h(V_{2})\ P ∪ h(I), ∅> :
Q_{3} :  (tableau contents omitted)
Following the same scheme, we apply the very same rule to Q_{3}, giving Q_{4}=<h(V_{3})\ P ∪ h(I), ∅>:
Q_{4} :  (tableau contents omitted)
Therefore, we have proven that D_{ω}⊨ Q, thus D⊨ d.
▴
Q_{1} :  T :  (tableau contents omitted)
Let T=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS={a_{1}, d_{1}, d_{2}} . Therefore, we must rename these symbols either in T or in Q_{1}. Let now Q_{1} be :
Q_{1} :  (tableau contents omitted)
Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a . Therefore h(a_{1})=a_{1} , h(a_{2})=a_{2}, and h(d_{1})=d_{1} . h is also defined by: h(a_{3})=a_{2} , h(b_{1})=b′ , h(b_{2})=b , h(c_{1})=c , h(c_{2})=c′ , h(d_{3})=d_{2} and h(d_{4})=d_{2} . Therefore we have :
T′=<h(I′),h(I)> :  Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
There are two pieces of h(V_{1}) in relation to (VAL(I′)−VAL(I))\ CONS={a_{2}} , which are the first tuple of h(V_{1}) and the last two tuples of h(V_{1}) . The second piece P is such that P⊆ h(I′) . We construct Q_{2}=T(T,Q_{1}) . Let V_{2}=h(V_{1})\ P∪ h(I) :
Q_{2}=<V_{2},∅ > :  (tableau contents omitted)
By applying rule T over Q_{2} one more time, giving Q_{3}=T(T,Q_{2}), we obtain:
Q_{3} :  (tableau contents omitted)
Therefore Q_{3}=Q_{1} up to symbol renaming, thus the sequence enters a loop. Note that in this particular case, the loop is simple to detect and thus to stop. However, this is an implementation-level treatment.
▴
Q_{1} :  S :  (tableau contents omitted)
Let S=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS={a_{1}, d_{1}, d_{2}} . Therefore, we must rename these symbols either in S or in Q_{1}. Let now Q_{1} be :
Q_{1} :  (tableau contents omitted)
Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a . Therefore h(d_{1})=d_{1}, h(d_{2})=d_{2} and h(d_{3})=d_{3} . h is also defined by: h(a_{1})=a , h(a_{3})=a , h(b_{1})=b , h(b_{2})=b′ , h(c_{1})=c′ , h(c_{2})=c , h(d_{4})=d_{3} and h(d_{5})=d_{3} . Therefore we have :
S′=<h(I′),h(I)> :  Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
There are two pieces P_{1} and P_{2} of h(V_{1}) in relation to (VAL(I′)−VAL(I))\ CONS={d_{2}, d_{3}}, which are the first two (identical) tuples of h(V_{1}) and the last tuple of h(V_{1}). The two pieces are included in h(I′). Note that we can save one sequence step by removing both pieces at the same time instead of performing two steps with the same valuation h. We construct Q_{2}=T(S,Q_{1}). Let V_{2}=h(V_{1})\ P_{1} \ P_{2} ∪ h(I):
Q_{2}=<V_{2},∅ > :  (tableau contents omitted)
One can now verify that no rule amongst T, S, U_{1} and U_{2} is applicable to Q_{2}.
▴
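The three sequences above suggest, as an implementation-level sketch, a depth-first strategy with backtracking and a repeated-goal guard. The guard below uses exact equality, a crude approximation of "equal up to symbol renaming", and is purely illustrative: it does not make the general problem decidable. `apply_rule` is again a hypothetical parameter standing for the core rule.

```python
def dfs_chase(dependencies, goal, apply_rule, depth=50, seen=None):
    """Depth-first search: succeed on the empty goal, prune goals
    already encountered (cutting loops like the second sequence), and
    backtrack when stuck (third sequence). The depth bound is only a
    pragmatic safeguard on a semi-decidable search."""
    seen = set() if seen is None else seen
    if not goal:
        return True                 # empty goal: implication proven
    if depth == 0 or goal in seen:
        return False                # bound reached or loop detected
    seen.add(goal)
    for tgd in dependencies:
        for successor in apply_rule(tgd, goal):
            if dfs_chase(dependencies, successor, apply_rule, depth - 1, seen):
                return True
    return False                    # stuck: caller backtracks
```

A breadth-first variant would trade this memory-light backtracking behaviour for fairness across the three kinds of sequences.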
EGDs add some difficulties because they include an equality predicate. Nevertheless, it has been shown in [BV84b] that the implication problem for TGDs and EGDs without constants is reducible to the implication problem for TGDs without constants. Thus, our core rule is sufficient.
Let e be an A-EGD e=<(a_{1},a_{2}),J>. Let w_{1} be a tuple such that w_{1}[A]=a_{1} and, for every attribute B≠ A, w_{1}[B]∉J[B]. Let w_{2} be a tuple such that w_{2}[B]=w_{1}[B] for every attribute B≠ A, and w_{2}[A]=a_{2}. We associate with e two TGDs: e_{1} is <w_{1},J∪ {w_{2}}> and e_{2} is <w_{2},J∪ {w_{1}}>. For a given set D of TGDs and EGDs, let D^{*} be the result of replacing each EGD e in D by its two associated TGDs e_{1} and e_{2}.
w_{1} :  w_{2} :  (tableau contents omitted)
e_{1}=<w_{1},I∪ {w_{2}}> :  e_{2}=<w_{2},I∪ {w_{1}}> :  (tableau contents omitted)
▴
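Under a toy encoding (tuples of Python symbols, `_`-prefixed strings as variables, a fixed arity, and the equated attribute given by its position), the construction above can be sketched as follows; the fresh symbols `_f0`, `_f1`, ... are illustrative names assumed not to occur in J.

```python
def egd_to_tgds(egd, arity, pos):
    """Replace the EGD <(a1, a2), J>, equating the values at attribute
    position `pos`, by its two associated TGDs e1 = <{w1}, J u {w2}>
    and e2 = <{w2}, J u {w1}>, where w1 and w2 agree everywhere except
    at `pos` (a1 for w1, a2 for w2) and carry fresh symbols elsewhere."""
    (a1, a2), J = egd
    fresh = [f"_f{i}" for i in range(arity)]   # assumed not to occur in J
    w1 = tuple(a1 if i == pos else fresh[i] for i in range(arity))
    w2 = tuple(a2 if i == pos else fresh[i] for i in range(arity))
    e1 = (frozenset({w1}), frozenset(J) | {w2})
    e2 = (frozenset({w2}), frozenset(J) | {w1})
    return e1, e2
```

Applying this to every EGD of D yields the set D^{*} of TGDs on which the core rule alone suffices.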
The following theorems allow us to ignore the presence of EGDs by reducing the implication of a dependency by a set of TGDs and EGDs to the implication of a TGD by a set of TGDs only:
Thus, we can now generalize theorem 1 and give the final following theorem that reduces the implication problem of TGDs and EGDs to the existence of a topdown chase using the rule:
As a conclusion, we discuss several points. First we compare the top-down chase with a backward formal system and with other proof procedures. Then we discuss the contributions of our work. We conclude the paper by lifting some restrictions on the model and pointing out some perspectives.
Many formal systems have been studied for data dependencies ([BFH77], [Sci82], [BV84a], [SU82]). In [BV84a], some formal systems for TGDs and EGDs are studied. Two of these systems are backward, but only one, namely the T3 formal system, has some similarities with the top-down chase. We sketch here the main differences, and refer the reader to that paper for more details about formal systems for TGDs and EGDs.
The T3 system deals with TGDs and their unknown values (i.e. non-constant symbols of VAL(J′)−VAL(J)) in the following way: the process starts with a goal tuple Q and applies a TGD T=<J′,J> by making J′ and J coincide. Typically this is achieved using, in particular, the collapsing, augmentation and projection rules. When J′ and J coincide, it uses the transitivity rule to derive a new goal. A derivation tree is therefore not linear, contrary to the case of SLD-resolution.
The top-down chase leads to a linear inference in the sense that it uses TGDs directly from the base without first applying rules to them. There is only one rule, though it is obviously a more complex step.
To our knowledge, this is the first time a top-down proof procedure has been used to solve the implication problem for dependencies. As such, a first contribution is to have shown that this can be performed top-down.
The top-down chase avoids rewriting the dependency set in clausal form, and avoids the corresponding Skolemization process, necessary in order to use classical top-down theorem-proving procedures. In the top-down chase, only the preliminary transformation of the TGD to be proven requires creating as many new constants as there are universally quantified variables appearing in it. Thus this step is insignificant. By providing a way to infer directly on the original form of the dependencies, by way of the notion of piece, the top-down chase is, conceptually speaking, a simpler approach; this is our second contribution.
Compared to the classical chase, it does not entail the dynamic creation of new constants. This is our third contribution.
Note that the top-down chase is not the usual chase simply reversed. The core rule is totally different from those used in the bottom-up approach of [BV84b]. Indeed, it has been shown that there is a strong relationship between the bottom-up chase and resolution with paramodulation [BC90] [NR01]. Whereas the latter acts on the Skolemized set of clauses, the top-down chase does not need this prior step.
We are currently investigating in detail the efficiency of the top-down chase. Actually, this is not trivial to assess, because of the various parameters that come into play. We agree with [BMSU86] on the fact that "it is unreasonable to expect proofs of optimality, or even superiority of one algorithm over another, in all cases".
However, we shall detail some of our arguments.
First of all, there is no need for Skolemizing the dependency set, nor for dynamic constant creation. This might increase efficiency. Indeed, contrary to the bottom-up chase, existentially quantified variables in the TGD conclusions never turn into constants. The only new symbols created are those of the current goal, which are temporary variables. The goal can grow by the addition of hypotheses of the dependencies used by the T-rule. Notice that these symbols are, on the other hand, deleted whenever they are mapped to existing constants, or whenever some pieces are removed from the goal. Depending on the search strategy, they are also deleted from memory whenever the sequence of rule applications is stuck or successful (requiring, for example, backtracking when doing depth-first search).
On the other hand, top-down approaches can take exponentially longer than bottom-up ones. However, the notion of piece performs some dynamic optimizations by dealing with groups of atoms instead of one at a time; thus failures are detected earlier than with classical top-down procedures, and irrelevant goals are cut down earlier. This feature dramatically reduces the number of backtracks. As an illustration, we have compared our method with Prolog over 5000 proofs generated at random, varying each time the size and the content of the dependency set, and the size of the dependencies. The top-down chase has been implemented on top of the CoGITo platform [GS98], which is a set of tools for conceptual graph management. These tests showed that the top-down chase provides a logarithmic average improvement in the number of backtracks as well as in the proof duration (when the process stops), in spite of a non-optimized implementation. This drastically reduces the drawback of the top-down approach, and gives the top-down chase a fair practical level of efficiency. The fact that the search for the right symbol mappings has exponential complexity does not seem to be a limiting factor in practice. Indeed, whereas for SLD-resolution some irrelevant goals may lead to exploring a whole branch of the resolution tree, the top-down chase detects failures earlier and thus saves time on average.
However, either approach could run faster in practice on given data [BMSU86]. That is why we think presenting an example in which the top-down chase would be more efficient than the bottom-up chase would not be significant and would not provide any worthwhile addition to the discussion.
For all these reasons, we plan to implement the chase of [BV84b] in order to compare their efficiency in practice over a random dataset. We think dependencies might be divided into classes for which one or the other approach would be better, but identifying them is still an open problem.
We must mention that some optimisations of the bottom-up approach have been made with magic sets in [MS96]. The principle of magic sets ([BMSU86]) is to perform at compile time some optimizations that are usually performed at runtime, by rewriting the set of dependencies before inference. This avoids the generation of irrelevant facts during the process, which is the essence of the top-down approach. We shall take these optimizations into account for the implementation.
As already stated, we assumed some restrictions on the model that can easily be lifted. The reduction of EGDs to TGDs also works in this unrestricted model, as stated in [BV84b]. Thus all the results can be extended, as they are in [CS98] for piece resolution.
There has been renewed interest in data dependency theory with the introduction of constrained dependencies and ordered dependencies. These types of dependencies can express a wide variety of constraints on the data ([BCW95]), besides generalizing most of the temporal dependencies of the taxonomy presented in [JS92]. The chase procedure has been redesigned, still in a bottom-up way, in order to deal with constrained tuple-generating dependencies ([MS96]), which generalize constrained functional dependencies. Our procedure can serve as a basis for the design of a top-down chase for constrained tuple-generating dependencies.
We now prove theorem 1:
(Only If). Let K∈ SAT(D_{ω}) be a relation. (i) There is a valuation k on ω (J) such that k(ω (J))⊆ K . As ω (J) contains only constants, then ω (J)⊆ K . (ii) As D⊆ D_{ω} , then D_{ω}⊨ D and K∈ SAT(D) . As D⊨ d , then K∈ SAT(d) . As ω (J)⊆ K (cf. (i)) and K∈ SAT(d) , there is an extension k′ of k on ω ′(J′) such that k′(ω ′(J′))⊆ K . Hence K∈ SAT(Q) , and D_{ω}⊨ Q .
We now prove theorem 2. We shall first introduce several lemmas.
To do that, we must rewrite Φ(Q′), Φ(T) and ¬ Φ(Q) in clausal form, in the following way:
• Φ(Q′) has the form Φ(Q′)=∃ x_{1}... ∃ x_{h}(A_{1}∧ ... ∧ A_{j}). We need to introduce Skolem constants, denoted q_{i}, i∈ [1..h], each respectively replacing the variable x_{i}. We construct j clauses^{1} of the form Q′_{i}=A_{i}[q_{1},...,q_{h}], i∈ [1..j].
• Φ(T) has the form Φ(T)=∃ y_{1}... ∃ y_{k}(C_{1}∧ ... ∧ C_{l}) ← H_{1}∧ ... ∧ H_{n}, universally closed over the variables x_{1},...,x_{p}. We need to introduce Skolem functions, denoted f_{i}(x_{1},...,x_{p}), i∈ [1..k], each respectively replacing the variable y_{i}. We construct l clauses of the form T_{i}=(C_{i}∨ ¬ H_{1} ∨ ... ∨ ¬ H_{n})[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})] with i∈ [1..l].
• Φ(Q) has the form Φ(Q)=∃ x_{1}... ∃ x_{s}(Q_{1}∧ ... ∧ Q_{t}). The negation of Φ(Q) has the form ¬ Q_{1}∨ ... ∨ ¬ Q_{t}, universally closed over the variables x_{1},...,x_{s}. We construct a clause of the form NQ=(¬ Q_{1}∨ ... ∨ ¬ Q_{t})[x_{1},...,x_{s}].
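The transformation above can be sketched as a small program (a sketch only, with a hypothetical term representation: atoms are (predicate, terms) pairs, and the Skolem symbols q_{i} and f_{i} are generated as in the bullet points for Φ(Q′) and Φ(T)):

```python
# Sketch of the clausal transformation (hypothetical representation;
# variables and constants are plain strings, a Skolem term is a pair
# (function_name, argument_tuple)).

def skolemize_goal(atoms, ex_vars):
    """Phi(Q'): each existential variable x_i is replaced by a fresh
    Skolem constant q_i, yielding one unit clause per atom A_i."""
    mapping = {v: "q_%d" % (i + 1) for i, v in enumerate(ex_vars)}
    return [[(pred, tuple(mapping.get(t, t) for t in terms))]
            for pred, terms in atoms]

def skolemize_tgd(hyp, concl, univ_vars, ex_vars):
    """Phi(T): each existential variable y_i of the conclusion becomes a
    Skolem function f_i(x_1,...,x_p) over the universal variables; one
    clause C_i v ~H_1 v ... v ~H_n is built per conclusion atom C_i."""
    mapping = {v: ("f_%d" % (i + 1), tuple(univ_vars))
               for i, v in enumerate(ex_vars)}
    neg_hyp = [("not", pred, terms) for pred, terms in hyp]
    return [[(pred, tuple(mapping.get(t, t) for t in terms))] + neg_hyp
            for pred, terms in concl]
```

For instance, the TGD r(x) → ∃y s(x,y) yields the single clause s(x, f_1(x)) ∨ ¬r(x), matching the shape of the clauses T_{i} above.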
The linear refutation starting with NQ exists. As NQ is composed only of negative literals, NQ can only be resolved with clauses containing positive literals, i.e. the clauses Q′_{a}, a ∈ [1..j], and T_{b}, b ∈ [1..l].
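For intuition, a linear refutation of this kind can be sketched in the much simpler propositional Horn case (an illustrative sketch only; the proof here works in first-order logic with unification and Skolem terms, which this fragment omits):

```python
# Propositional SLD-style refutation sketch.  A program is a list of
# Horn clauses (head, [body]); facts have an empty body.  Refuting the
# negative clause ~g1 v ... v ~gk amounts to reducing the goal list to
# the empty clause by resolving the selected goal against clause heads.

def sld_refute(program, goals, depth=50):
    """Return True if the goal list reduces to the empty clause."""
    if not goals:
        return True          # empty clause: refutation found
    if depth == 0:
        return False         # crude guard against infinite descent
    first, rest = goals[0], goals[1:]
    for head, body in program:
        if head == first:    # resolve selected literal with a positive clause
            if sld_refute(program, body + rest, depth - 1):
                return True
    return False             # no positive clause resolves: branch fails
```

As in the proof, each resolution step can only use clauses carrying a positive literal, and the added body literals become new goals to be resolved away in turn.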
Let us suppose that the refutation uses only the clauses Q′_{a}, a∈ [1..j]. Then the clauses T_{b}, b∈ [1..l], are unnecessary; thus {Φ(Q′), ¬ Φ(Q)} is inconsistent, hence Φ(Q′)⊨ Φ(Q), which contradicts the hypothesis. Therefore the resolution does not use only the clauses Q′_{a}, a∈ [1..j].
Remember that T_{i}=(C_{i}∨ ¬ H_{1} ∨ ... ∨ ¬ H_{n})[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})] with i∈ [1..l]. If the refutation does not use only the clauses Q′_{a}, a∈ [1..j], then it uses at least one of the atoms C_{a}, a∈ [1..l], of the clauses T_{i}, i∈[1..l] (which are the only other positive literals). This step yields a resolvent containing the literals ¬ H_{1} ∨ ... ∨ ¬ H_{n}. The f_{1}(x_{1},...,x_{p}),...,f_{k}(x_{1},...,x_{p}) are Skolem functions. They correspond to existentially quantified variables of Φ(T); therefore there is a substitution between a Q_{b}, b∈ [1..t], and a C_{a}, a∈ [1..l], which maps neither from the equivalent existentially quantified variables (first condition of the T-rule verified) nor from the variables of H_{i}, i∈[1..n], that are not in C_{a}. Therefore, within the tableaux framework, there is an equivalent valuation h that satisfies the first condition of the T-rule (which concerns symbols of the conclusion of T not in the hypothesis, i.e. existentially quantified variables), and the second condition of the T-rule (which concerns symbols of the hypothesis of T not in the conclusion of T, i.e. universally quantified variables appearing only in the hypothesis). Indeed, as these variables do not come into play in the substitution, we force the equivalent symbols (in the tableaux valuation) to be identical. This has no consequence, because we have assumed that no symbols have the same name in T and Q.
This resolution step has instantiated the variables of NQ. Therefore the literals of NQ containing one of these variables as a term have it instantiated as well. Thus, as there are neither function symbols nor variables in the Q′_{a}, a∈ [1..j], these literals are unified with the C_{a}[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})], a∈ [1..l]. We are therefore sure that at least one piece of the conclusion of h(Q) appears in the conclusion of h(T). Thus the third condition of the T-rule is satisfied.
Let us focus on the literals added to the resolvent in the resolution step: ¬ H_{1} ∨ ... ∨ ¬ H_{n}. There are two possibilities: either they are unified with the C_{a}[x_{1},...,x_{p},f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})], a∈ [1..l], or with the A_{a}, a∈ [1..j]. Suppose that, at the moment they come into play within the resolution, they are unified with the C_{a}[x_{1},...,x_{p},f_{1}(x_{1},...,x_{p}),...,f_{k}(x_{1},...,x_{p})], a∈ [1..l]; then the resolvent would again contain the same atoms, because the negative atoms of the clauses T_{i}, i∈[1..l], are the same in each clause T_{i}. They are thus necessarily unified with the A_{a}, a∈ [1..j], for the resolution to terminate (which is the case). We thus see that the negative literals of the clauses T_{i}, i∈[1..l], can only be unified with the A_{a}, a∈ [1..j]. Therefore the negation of the new goal has a refutation which does not need the clauses T_{b}, b∈ [1..l]; thus {Φ(Q′), ¬ Φ(B)} is inconsistent, and Q′ ⊨ B.
We now have a new clause resulting from a sequence of resolutions with the clauses coming from T_{ip}. Thus this clause has a linear refutation. Let us show that this clause corresponds to a goal. To do so, let us perform the inverse transformation (i.e. from clausal form to FOL, and thus to tableaux). This clause contains only literals from the original goal and literals coming from the hypothesis of T_{ip}, all negative. Moreover, it contains no function symbols. Therefore we can perform the inverse transformation to obtain a goal B_{p−1}, and thus Γ ⊨ B_{p−1}.
As Γ ⊨ B_{p−1}, by induction there is a finite sequence of indices i_{1},...,i_{p−1} such that {T_{i1},...,T_{ip−1}}⊨ B_{p−1}, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p−1], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅>. As we have shown that {B_{p−1},T_{ip}}⊨ Q, there is a finite sequence of indices i_{1},...,i_{p} such that {T_{i1},...,T_{ip}}⊨ Q, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅> and B_{p}=Q.
SLD-resolution allows producing the empty clause from the set of clauses corresponding to {Φ(Q′),¬ Φ(Q)}. There is therefore a linear refutation starting from the negative clause corresponding to ¬ Φ(Q), provided that Φ(Q′) is in clausal form with exactly one positive literal per clause. Let us perform the same transformation as in lemma 2. The linear refutation starting from NQ exists. As NQ is composed only of negative literals, NQ can be resolved only with clauses having positive literals, i.e. the clauses Q′_{a}, a ∈ [1..j]. Each literal of NQ is thus unifiable with one of the A_{a}[q_{1},...,q_{h}], a∈ [1..j]. All the atoms of NQ thus appear in the A_{a}[q_{1},...,q_{h}], a∈ [1..j], and can only differ by a variable of NQ corresponding to a term in the A_{a}[q_{1},...,q_{h}], a∈ [1..j]. Therefore the tuples of Q are a subset of the tuples of Q′, and can only differ by extra symbols in Q.
If there is a rule application between Q′ and a TGD T, then there is a valuation h′ and (at least) one piece of the conclusion of h′(Q′) appearing entirely in the conclusion of h′(T). As the tuples of Q can only have extra symbols, and as the tuples of Q all appear in Q′, there is also a valuation h and (at least) one piece of the conclusion of h(Q) appearing entirely in the conclusion of h(T). Let us construct the new goal B by removing from the conclusion of h(Q) only the pieces corresponding to those removed from the conclusion of h′(Q′) when B′ was constructed. Then the new goal B contains some tuples of h(Q) that differ, by construction, from those of B′ by potentially extra symbols, and some tuples of h(T) that also differ from those of B′ by potentially extra symbols. By transforming ¬ Φ(B) and Φ(B′) into clausal form, it is easy to show that each literal of the clause coming from ¬ Φ(B) is unifiable with an atom of a clause coming from Φ(B′), and thus that the set {Φ(B′),¬ Φ(B)} is inconsistent, and B′ ⊨ B. By induction on the number of rule applications starting with Q′, we conclude that there is also a sequence of rule applications starting from Q and terminating with success.
(If). By induction on the number of T-rule applications. Trivially, as Q_{n}=<∅ ,∅ > , then D_{ω }⊨ Q_{n} . Assume the induction hypothesis holds for Q_{i}, ∀ i∈ {2,… ,n} ; thus D_{ω }⊨ Q_{2} . Let us prove that D_{ω }⊨ Q_{1} . By lemma 1, as Q_{2}=T(T_{1},Q_{1}) , then {Q_{2},T_{1}}⊨ Q_{1} , and also {Q_{2},D_{ω }}⊨ Q_{1} . As D_{ω }⊨ Q_{2} , it follows that D_{ω }⊨ Q_{1} .
(Only If). By lemma 3, there is a sequence of indices i_{1},...,i_{p} such that {T_{i1},...,T_{ip}}⊨ Q, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅> and B_{p}=Q. The proof is by induction on p. At step 1, we have {<∅,∅>, T_{i1}}⊨ B_{1}; therefore by lemma 2, there is a rule application between B_{1} and T_{i1}, giving a goal B_{0} such that <∅,∅>⊨ B_{0}. As <∅,∅> ⊨ <∅,∅>, the resolution terminates with success. Suppose the hypothesis holds up to step p−1, i.e. there is a sequence of rule applications starting from B_{p−1} and terminating with success. At step p, we have B_{p}=Q. By lemma 2, there is a rule application between Q and T_{ip}, giving a goal B such that B_{p−1}⊨ B. By the induction hypothesis, there is a sequence of rule applications starting from B_{p−1} and terminating with success. By lemma 4, there is also a sequence of rule applications starting from B, thus from Q, terminating with success.
By theorem 4, D⊨ e if and only if D^{*}⊨ e_{1} and there is a nontrivial A-EGD in D . D^{*} is a set of TGDs. Therefore, by theorem 1, D^{*}⊨ e_{1} if and only if D_{ω }^{*}⊨ Q . It follows that D⊨ e if and only if D_{ω }^{*}⊨ Q and there is a nontrivial A-EGD in D . By theorem 2, we prove the existence of the sequence of T-rules.