A Top-Down Proof Procedure for Generalized Data Dependencies
Stéphane Coulondre
Abstract: Data dependencies are well known in the context of relational databases. They aim to specify constraints that the data must satisfy to model correctly the part of the world under consideration. The implication problem for dependencies is to decide whether a given dependency is logically implied by a given set of dependencies. A proof procedure for the implication problem, called the "chase", has already been studied in the generalized case of tuple-generating and equality-generating dependencies. The chase is a bottom-up procedure, from hypotheses to conclusion, and thus is not goal-directed. In the case of TGDs, it also entails the dynamic creation of new constants, which can turn out to be a costly operation. This paper introduces a new proof procedure which is top-down, from conclusion to hypotheses, and thus goal-directed. The originality of this procedure is that it does not act as classical theorem-proving procedures, which require a special form of expressions, such as clausal form, obtained after Skolemization. We show, with our procedure, that this step is useless, and that the notion of piece allows inferring directly on dependencies, thus saving the cost of Skolemizing the dependency set; moreover, the inference can be performed without dynamically creating new constants. Although top-down approaches are known to be less time-efficient than bottom-up ones, the notion of piece cuts down the irrelevant goals usually generated, leading to a usable top-down method. With the more recent introduction of constrained and ordered dependencies, some interesting perspectives also arise.
Dependency theory allows the expression and modelling of constraints that the data must satisfy in order to reflect correctly the world that a relational database intends to describe. Since the introduction of functional dependencies by Codd ([Cod72]), many kinds of dependencies have been studied in the literature, and a lot of work was carried out in the late 70's and early 80's. Database dependency theory is still an active area of research [SF00] [Her95] [LL97a] [LL97b] [LL98]. Functional and multivalued dependencies are the best-known classes of data dependencies. In practice, these two kinds are generally sufficient to express constraints ([Ull88]). Nevertheless, more general classes have been introduced, with the purpose of finding a uniform way to express constraints ([BV81], [SU82]). This paper deals with the class commonly known to generalize most of the dependencies: that of tuple-generating and equality-generating dependencies (TGDs and EGDs) ([BV84b]). For a survey of this general class of dependencies, we refer the reader to [LL99] or [FV86].
The central problem is the implication problem, which is to decide whether a dependency is logically implied by a given set of dependencies. A process solving this problem would make it possible to find the minimal cover of a set of dependencies, to decide whether a dependency is redundant within a set (useful during the constraint acquisition stage), etc. A procedure has already been designed in [BV84b] for that purpose: the well-known chase. Unfortunately, as the implication problem for TGDs and EGDs is semi-decidable, the chase is only a proof procedure, and therefore the process may run forever. As we argue in this paper, the chase is clearly a bottom-up procedure: from hypotheses to conclusion. The chase also entails the dynamic creation of new constants in the general case of TGDs.
We introduce a new proof procedure which is top-down, from conclusion to hypotheses, and thus goal-directed. The originality of this procedure is that it does not act as classical theorem-proving procedures, which require a special form of the dependency set, such as clausal form, obtained after Skolemization. We show, with our procedure, that this step is useless, and that the notion of piece allows inferring directly on the dependency set.
Therefore, our top-down chase is not simply the usual chase reversed, but a new way of solving the implication problem. The fact that it can be performed top-down is the first contribution of this paper. The second contribution is to avoid the dynamic creation of constants, as well as the Skolemization of the dependency set, usually applied to the original knowledge base prior to top-down proofs. This is achieved by taking advantage of the form of the dependencies. To our knowledge, this has not been done before. Indeed, the dynamic creation of constants, performed by proof procedures such as the chase in order to take into account the effect of existential quantifiers, can be costly.
While it is true that top-down approaches can take exponentially longer than bottom-up ones, several reasons lead us to think that the proof procedure presented in this paper is efficient. These arguments are detailed in the last section. The efficiency w.r.t. the bottom-up chase is currently being assessed in detail.
More recently, constrained dependencies ([Mah94], [BCW95], [Mah97]) and ordered dependencies [Ng99] have been introduced. They originate in the constraint programming field and permit the expression of semantic relations on variables, thus giving them an interpretation. The chase procedure has been redesigned in [MS96], still in a bottom-up way, in order to deal with constrained tuple-generating dependencies. This work in dependency theory opens new perspectives for the top-down chase procedure we present.
The top-down chase originates in the conceptual graphs model, a knowledge representation model introduced by Sowa ([Sow84]). The base model has been extended with graph rules and an inference method, called piece resolution ([SM96]). The logical roots of this process have been studied in [CS98], and constitute the basis of the top-down chase. Proofs of the lemmas and theorems of this paper are therefore derived from these two last-mentioned works.
Section 2 describes the framework and the implication problem for data dependencies. We sketch the existing (bottom-up) chase. In Section 3 the top-down chase is explained. Section 4 closes the paper with some concluding remarks.
The following definitions are for the most part taken from [BV84b], though simplified for the purpose of this paper. The first subsection states some assumptions we make throughout this paper. The second subsection presents the necessary definitions. The third subsection describes the kind of dependencies this paper focuses on. Note that they are known to capture the semantics of most of the dependency types studied in the literature. The fourth subsection formally presents the implication problem and then describes the traditional chase procedure, which was designed to solve it. The chase is clearly a bottom-up mechanism, from hypotheses to conclusion.
For the sake of clarity, we assume several restrictions on the model. First, we assume that the universal relation assumption holds, i.e. the database is modelled with only one relation, because this usually permits a simpler formal presentation of the approaches. Secondly, dependencies are typed (many-sorted), so attribute domains are disjoint. Thirdly, we do not address dependencies with constants, such as the last one illustrated below.
TGDs and EGDs can all be expressed in first-order logic. Informally speaking, an equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal. A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.
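To fix ideas, the two forms can be written down concretely. The sketch below assumes a hypothetical encoding, not the paper's notation: tuples of a single universal relation are Python tuples of symbols, strings prefixed with `_` play the role of variables, and all other strings are constants.

```python
# Hypothetical encoding of dependencies over one universal relation.
# Symbols starting with '_' are variables; the rest are constants.

# TGD <I', I>: if the tuples of the hypothesis I are in the database,
# then the tuples of the conclusion I' must also exist (values of I'
# not bound by I are unknown, i.e. existentially quantified).
tgd = (
    frozenset({("_order", "_cust")}),   # conclusion I'
    frozenset({("_inv", "_order")}),    # hypothesis I
)

# EGD <(a1, a2), I>: if the tuples of I are in the database, then the
# values a1 and a2 must be equal (here: same name implies same id).
egd = (("_id1", "_id2"),
       frozenset({("_name", "_id1"), ("_name", "_id2")}))
```

Under this illustrative convention, the EGD corresponds to the customer example below (same name implies same identifier) and the TGD to the invoice/order example.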
For example, to express a classical functional dependency stating that two customers having the same name also have the same identifier, we use the following EGD, which is also a functional dependency:
To express that an invoice is always related to an existing order, we use the following TGD:
By introducing constants, we can state some more specialized constraints. For example, invoice 23 is related to an order taken by customer 12:
However, the framework used in this paper is that of tableaux, because it is well-suited when the universal relation assumption holds. On the other hand, this assumption is very impractical for real-world applications, because it would imply modelling the whole database schema with only one relation. That is the reason why we do not use real-world examples throughout this paper. Note that these restrictions appear here for clarity reasons only and all the results are applicable in the unrestricted model.
I :  w :  (tableau contents omitted)
▴
An equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal.
<(c_{1},c_{2}),I> :  K :  (tableau contents omitted)
▴
A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.
I′ :  T=<I′,I> :  K′ :  (tableau contents omitted)
▴
Let D be a set of dependencies, and d be a dependency. The implication problem is to decide whether D⊨ d, that is, to determine whether d is true in every database in which each dependency of D is true. Let SAT(D) be the set of all relations, composed only of constants, that satisfy all the dependencies in D; the implication problem is then equivalent to deciding whether SAT(D)⊆ SAT(d).
The implication problem can also be considered under a different view: that of the finite implication problem, which is to decide whether d is satisfied by every finite relation that satisfies all dependencies in D. However, as detailed below, this problem admits no proof procedure.
The chase procedure was designed to solve the implication problem. The reader can refer to [BV84b] for a complete description. Intuitively speaking, if the dependency to be proven is the TGD <I′,I>, or the EGD <(a_{1},a_{2}),I>, the chase procedure takes the relation I and treats it as a set of tuples. It then repeatedly applies the dependencies of D, following two distinct rules: one for TGDs, whose effect is to add tuples to the relation, and one for EGDs, whose effect is to "identify" two constants. When the dependency to be proven is a TGD, the procedure stops when the tuples of I′ are obtained. When it is an EGD, the procedure stops when the identification of a_{1} and a_{2} is obtained.
This mechanism has been shown to be sound and complete in [Var84]. Note that the implication problem for TGDs and EGDs is semi-decidable. Thus the chase may not terminate.
The chase procedure is clearly a bottom-up (or forward-chaining) one. Indeed, rule applications generate new tuples or identifications of constants. This is executed until the desired conclusion is obtained. The goal to be proven is not used to guide the process. Moreover, when applying a TGD rule, whose effect is to add tuples to the relation, existential quantification always requires a costly dynamic creation of new constants.
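As a point of comparison for what follows, a naive bottom-up TGD step can be sketched as below. The representation is an illustrative assumption (Python tuples of symbols, `_`-prefixed strings as variables), not the paper's formalism; note how the step must invent fresh constants for existential variables.

```python
from itertools import count

_fresh = count()  # supply of new constants, created dynamically

def is_var(s):
    return s.startswith("_")

def match(patterns, relation, sub=None):
    """Yield substitutions mapping every pattern tuple onto some tuple
    of the relation (naive backtracking search)."""
    sub = dict(sub or {})
    if not patterns:
        yield sub
        return
    first, rest = patterns[0], patterns[1:]
    for fact in relation:
        if len(fact) != len(first):
            continue
        s, ok = dict(sub), True
        for p, f in zip(first, fact):
            if is_var(p):
                if s.setdefault(p, f) != f:
                    ok = False
                    break
            elif p != f:
                ok = False
                break
        if ok:
            yield from match(rest, relation, s)

def chase_step_tgd(tgd, relation):
    """One bottom-up application of a TGD <I', I>: wherever the
    hypothesis I maps into the relation, add the conclusion I',
    inventing a fresh constant for each existential variable."""
    conclusion, hypothesis = tgd
    new_facts = set()
    for sub in match(list(hypothesis), relation):
        s = dict(sub)
        for t in conclusion:
            for p in t:
                if is_var(p) and p not in s:
                    s[p] = f"new{next(_fresh)}"  # dynamic constant creation
        new_facts |= {tuple(s.get(p, p) for p in t) for t in conclusion}
    return relation | new_facts
```

The fresh-constant line is exactly the costly operation the top-down procedure of the next section avoids.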
We now show that we can apply a top-down (or backward-chaining) procedure, in which the process is goal-directed.
Depending on the type of the dependencies, the implication problem is solvable or recursively unsolvable ([BV81, Var84, GL82]). This means that in the first case, there is a decision procedure, hence an algorithm that always halts, whereas in the second case, there is a proof procedure: if the implication is true, then the process will terminate; otherwise, it might never stop. The finite implication problem, on the other hand, is not even partially solvable. Therefore, there can be no proof procedure for it.
A subset of TGDs is known as Full TGDs. These dependencies have the same form as Datalog rules. In this case, the notion of piece, stated in the introduction, is useless. Therefore, dedicated top-down theorem-proving procedures, such as Query-Subquery ([Vie86]) and OLD-Resolution ([TS86]), can be applied to solve the implication problem involving only Full TGDs. The principal aim of these procedures is to provide a terminating algorithm for the top-down evaluation of Datalog rules. Indeed, even if the implication problem is decidable for this particular class, their top-down evaluation might not terminate. To tackle this problem, tabulation (or memoing) techniques must be applied in order to cut down looping sequences. Notice that these techniques do not affect completeness. In the general case of TGDs, however, there are non-terminating sequences that remain undetectable, due to the undecidable nature of the implication problem. In the case of Full TGDs, the implication problem and the finite implication problem coincide and are both decidable.
For this kind of dependency, as well as for all other subsets (functional and multivalued dependencies), the interest of the top-down chase is that it is also applicable. However, in these particular cases, specific proof procedures should be more efficient because they implement memoing. The top-down chase and the classical bottom-up chase are clearly needed when dependencies are more complex (i.e. not in a decidable subset), and also to provide a general way to solve the implication problem. Notice that memoing can be added to the top-down chase too, but this is an implementation-level extension.
Much work has also been carried out in the active database area. For a survey, we refer the reader to [WC95]. Active databases generally support the automatic triggering of updates in response to internal or external events. These updates are expressed by means of rules which are slightly different from TGDs and EGDs. Usually, these rules can be expressed in Datalog with negation [BDR98] [AHV95]. The main difference lies in the fact that the variables occurring in the conclusion are all universally quantified, whereas in database dependencies they are existentially quantified if they do not also occur in the hypothesis.
The first subsection presents a proof procedure for TGDs only. The second subsection shows how this procedure can take EGDs into account as well, by using some reduction theorems.
Let D be a set of TGDs without constants, and let d=<J′,J> be a TGD without constants. Intuitively speaking, to decide whether D⊨ d, we start with the tuples of J′ and treat them as if they formed a relation. Let Q be this relation. Q is considered the goal to reach. On the other hand, we add J to D, after replacing each symbol of J by a new constant, by transforming each tuple into a TGD with constants. Then we try to remove the tuples of Q by successively applying a rule, giving each time a new goal. These rule applications may introduce new tuples. If we succeed in removing all tuples, i.e. in obtaining an empty goal, then D⊨ d.
Compared with classical theorem-proving methods, this proof procedure relies on a complex core rule that does not require any modification of the dependencies in D. Indeed, in order to apply the classical resolution method, one needs to rewrite D and ¬ d in clausal form. To do so, the Skolemization step would generate constants or functions due to the presence of existential quantifiers in the conclusion. Then the resolution method would try to generate the empty clause. Once it is obtained, Herbrand's theorem ensures that D⊨ d.
In the present case, the top-down proof procedure requires neither rewriting the dependencies of D nor dynamically creating new constants or functions. Indeed, the originality is to allow inferring directly on the dependencies of D, thus providing a general mechanism. Notice that this is made possible by the particular form of TGDs and EGDs.
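The overall control loop of such a goal-directed procedure can be sketched as follows. This is a hedged illustration, not the paper's algorithm: `apply_rule` is a hypothetical parameter standing for the core rule defined below, returning the possible successor goals of a given goal.

```python
def top_down_chase(dependencies, goal, apply_rule, limit=1000):
    """Breadth-first search for a sequence of rule applications that
    empties the goal. Returns True when an empty goal is reached
    (implication proven) and False when the frontier is exhausted or
    the step limit is hit; since the implication problem is only
    semi-decidable, unbounded search may run forever."""
    frontier = [goal]
    for _ in range(limit):
        if not frontier:
            return False        # no more goals to try: search failed
        successors = []
        for g in frontier:
            if not g:
                return True     # empty goal: implication proven
            for tgd in dependencies:
                successors.extend(apply_rule(tgd, g))
        frontier = successors
    return False
```

The `limit` parameter is only a pragmatic bound on a semi-decidable search, not part of the formal procedure.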
Let us now explain this process in a more formal way. We need the notion of a distinct substitution, which replaces each symbol of a relation by a new constant.
Now, we can add to D a set of TGDs with constants, each of them being of the form <u,∅>, with u corresponding to a distinct tuple in ω(J). Thus we add |ω(J)| TGDs. The following theorem states that checking whether D⊨ d can be reduced to checking whether D_{ω}⊨ Q, where D_{ω} is the result of adding the new TGDs to D, and Q is the goal. Note that the added TGDs are the only ones of D_{ω} containing constants. More formally, we have:
d=<J′,J> :  (tableau contents omitted)
Let ω be a distinct substitution on J such that a_{4}↦ a, a_{5}↦ a′, b_{3}↦ b, b_{4}↦ b′, c_{3}↦ c, c_{4}↦ c′ and d_{3}↦ d, with a,a′,b,b′,c,c′,d being constants. Let ω′ be an extension of ω on J′ such that ω′ is the identity on VAL(J′)−VAL(J). Let D_{ω} denote D∪ { <u,∅ > | u∈ ω(J)}:
{ <u,∅ > | u∈ ω (J)} :  (tableau contents omitted)
Let Q denote the TGD <ω ′(J′),∅ > :
Q=<ω ′(J′),∅ > :  (tableau contents omitted)
Then D⊨ d if and only if D_{ω}⊨ Q .
▴
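Assuming a toy representation where tuples are Python tuples of symbols and `_`-prefixed strings are variables (an illustrative convention, not the paper's notation), the construction of theorem 1 can be sketched as:

```python
from itertools import count

_consts = count()  # supply of new constants for the distinct substitution

def freeze(d):
    """Given d = <J', J> as (conclusion, hypothesis), build the extra
    hypothesis-free TGDs { <u, {}> | u in omega(J) } and the goal
    Q = <omega'(J'), {}>, where omega maps each symbol of J to a new
    constant and omega' extends it by the identity elsewhere."""
    conclusion, hypothesis = d
    omega = {}
    for t in hypothesis:
        for s in t:
            omega.setdefault(s, f"k{next(_consts)}")  # distinct substitution
    extra_tgds = [(frozenset({tuple(omega[s] for s in t)}), frozenset())
                  for t in hypothesis]
    goal = frozenset(tuple(omega.get(s, s) for s in t) for t in conclusion)
    return extra_tgds, goal
```

By the theorem, deciding D⊨ d then reduces to deciding whether the goal can be emptied from D extended with `extra_tgds`.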
We now describe the main step of the process. Given a TGD and a goal, this rule constructs a new goal. We need to introduce the notion of piece. A piece is a set of tuples that are semantically linked because they share unknown values; when applying the rule, these tuples are treated as a whole, at the same time. This notion is an alternative to the classical Skolemization process, which replaces unknown values by constants prior to using classical first-order logic provers. It is the originality of the top-down chase.
L :  (tableau contents omitted)
P_{1} :  P_{2} :  (tableau contents omitted)
▴
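Computationally, a piece can be read as a connected component: two tuples of a goal belong to the same piece when they are chained through shared symbols of a distinguished set E (in the paper, the non-constant symbols of VAL(I′)−VAL(I)). A minimal sketch, under the assumption that goals are sets of Python tuples of symbols:

```python
def pieces(goal, E):
    """Partition the tuples of `goal` into pieces: connected components
    of the relation 'shares a symbol belonging to E'. A tuple with no
    symbol of E forms a piece on its own."""
    remaining, result = set(goal), []
    while remaining:
        piece = {remaining.pop()}
        shared = {s for t in piece for s in t if s in E}
        grown = True
        while grown:                      # grow the component to a fixpoint
            grown = False
            for t in list(remaining):
                if any(s in shared for s in t):
                    remaining.discard(t)
                    piece.add(t)
                    shared |= {s for s in t if s in E}
                    grown = True
        result.append(frozenset(piece))
    return result
```

For instance, with E={x}, a goal {(a,x),(x,b),(c,d)} splits into two pieces: the two tuples chained by x, and the tuple (c,d) alone.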
We now define the core rule.
Given a TGD T=<I′,I> and a goal Q_{n}=<V,∅ >, if there is a valuation h and a piece P of h(V) such that P⊆ h(I′), then the result of the application of the rule, denoted Q_{n+1}=T(T,Q_{n}), is Q_{n+1}=<h(V)\ P∪ h(I),∅ >. Q_{n+1} becomes a new goal. It is thus obtained by removing P from h(V) and adding h(I). Note that there may be several possible TGDs obtained by an application of the rule, depending on the valuation h. In a sense, the piece notion allows us to group together some tuples according to some particular symbols they share (which correspond in logic to existentially quantified variables of T).
T=<I′,I> :  Q_{n}=<V,∅ > :  (tableau contents omitted)
T′=<h(I′),h(I)> :  Q_{n}′=<h(V),∅ > :  (tableau contents omitted)
There are two pieces of h(V) in relation to (VAL(I′)−VAL(I))\ CONS={a_{2}} , which are the first tuple of h(V) and the last two tuples of h(V) . The second piece P is such that P⊆ h(I′) . We construct Q_{n+1}=T(T,Q_{n}) . Let V′=h(V)\ P∪ h(I) :
Q_{n+1}=<V′,∅ > :  (tableau contents omitted)
▴
The following theorem states that a goal is implied by a set of TGDs if and only if there is a sequence of rule applications such that the sequence starts with the original goal, and gives the empty goal as an end result. The proof is detailed at the end of the paper.
It is important to notice that whenever D_{ω}⊨ Q, theorem 2 ensures that there is at least one sequence of rule applications, but it does not give a method to find it. Therefore, to implement the top-down chase, we need to add a search strategy (breadth-first or depth-first, for example). In order to illustrate the need for a search strategy, let us detail three kinds of sequences for the same set of TGDs D_{ω} and the same goal Q. The first one stops and ends with an empty goal, thus proving the implication. The second one does not terminate (although in this particular case, the loop might be detected and stopped), and the third one is stuck (no rule can be applied any more) and thus needs backtracking whenever the search algorithm is depth-first.
In the following examples, we consider that D_{ω} contains the TGD T of example 6, as well as another TGD S detailed below. Remember also that in 4 it was shown that, in order to prove that a dependency d is implied by a set of dependencies D, we construct a goal to be proven (i.e. without hypothesis), and we add some TGDs (also without hypothesis) to D, giving D_{ω}. Therefore, in the present case, D_{ω} contains T as well as two more TGDs presented in example 6. Finally, the goal to be proven is Q, already presented in example 4. D_{ω} is detailed below:
U_{1} :  U_{2} :  (tableau contents omitted)
T :  S :  (tableau contents omitted)
And Q is :
Q :  (tableau contents omitted)
Q_{1} :  (tableau contents omitted)
Consider the TGD U_{1}. Let U_{1}=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS=∅ . Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a (trivial). h is also defined by: h(a_{1})=a and h(d_{1})=d . Therefore we have :
Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
As there is no element in (VAL(I′)−VAL(I))\ CONS, each tuple of h(V_{1}) is itself a piece. The third tuple P is such that P ⊆ h(I′) . We construct Q_{2}=T(U_{1},Q_{1}) . Let Q_{2}=<h(V_{1})\ P∪ h(I), ∅> :
Q_{2} :  (tableau contents omitted)
Now consider the TGD U_{2}. Let U_{2}=<I′,I> and Q_{2}=<V_{2},∅ >. We can check that (VAL(V_{2})∩ (VAL(I)∪ VAL(I′)))\ CONS=∅ . Let h be a valuation on I∪ I′ ∪ V_{2} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a (trivial). h is also defined by: h(d_{2})=d′ . Therefore we have :
Q_{2}′=<h(V_{2}),∅ > :  (tableau contents omitted)
As there is no element in (VAL(I′)−VAL(I))\ CONS, each tuple of h(V_{2}) is itself a piece. The first tuple P is such that P ⊆ h(I′) . We construct Q_{3}=T(U_{2},Q_{2}) . Let Q_{3}=<h(V_{2})\ P ∪ h(I), ∅> :
Q_{3} :  (tableau contents omitted)
Following the same scheme, we apply the very same rule to Q_{3}, giving Q_{4}=<h(V_{3})\ P ∪ h(I), ∅>:
Q_{4} :  (tableau contents omitted)
Therefore, we have proven that D_{ω}⊨ Q, thus D⊨ d.
▴
Q_{1} :  T :  (tableau contents omitted)
Let T=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS={a_{1}, d_{1}, d_{2}} . Therefore, we must rename these symbols either in T or in Q_{1}. Let now Q_{1} be :
Q_{1} :  (tableau contents omitted)
Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a . Therefore h(a_{1})=a_{1} , h(a_{2})=a_{2}, and h(d_{1})=d_{1} . h is also defined by: h(a_{3})=a_{2} , h(b_{1})=b′ , h(b_{2})=b , h(c_{1})=c , h(c_{2})=c′ , h(d_{3})=d_{2} and h(d_{4})=d_{2} . Therefore we have :
T′=<h(I′),h(I)> :  Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
There are two pieces of h(V_{1}) in relation to (VAL(I′)−VAL(I))\ CONS={a_{2}} , which are the first tuple of h(V_{1}) and the last two tuples of h(V_{1}) . The second piece P is such that P⊆ h(I′) . We construct Q_{2}=T(T,Q_{1}) . Let V_{2}=h(V_{1})\ P∪ h(I) :
Q_{2}=<V_{2},∅ > :  (tableau contents omitted)
By applying rule T over Q_{2} one more time, giving Q_{3}=T(T,Q_{2}), we obtain:
Q_{3} :  (tableau contents omitted)
Therefore Q_{3}=Q_{1} up to symbol renaming, thus the sequence enters a loop. Note that in this particular case, the loop is simple to detect and thus to stop. However, this is an implementation-level treatment.
▴
Q_{1} :  S :  (tableau contents omitted)
Let S=<I′,I> and Q_{1}=<V_{1},∅ >. We can check that (VAL(V_{1})∩ (VAL(I)∪ VAL(I′)))\ CONS={a_{1}, d_{1}, d_{2}} . Therefore, we must rename these symbols either in S or in Q_{1}. Let now Q_{1} be :
Q_{1} :  (tableau contents omitted)
Let h be a valuation on I∪ I′ ∪ V_{1} such that ∀ a∈ VAL(I)−VAL(I′) , h(a)=a and ∀ a∈ VAL(I′)−VAL(I) , h(a)=a . Therefore h(d_{1})=d_{1}, h(d_{2})=d_{2} and h(d_{3})=d_{3} . h is also defined by: h(a_{1})=a , h(a_{3})=a , h(b_{1})=b , h(b_{2})=b′ , h(c_{1})=c′ , h(c_{2})=c , h(d_{4})=d_{3} and h(d_{5})=d_{3} . Therefore we have :
S′=<h(I′),h(I)> :  Q_{1}′=<h(V_{1}),∅ > :  (tableau contents omitted)
There are two pieces P_{1} and P_{2} of h(V_{1}) in relation to (VAL(I′)−VAL(I))\ CONS={d_{2}, d_{3}}, which are the first two (identical) tuples of h(V_{1}) and the last tuple of h(V_{1}). The two pieces are included in h(I′). Note that we can save one sequence step by removing both pieces at the same time instead of performing two steps with the same valuation h. We construct Q_{2}=T(S,Q_{1}). Let V_{2}=h(V_{1})\ P_{1} \ P_{2} ∪ h(I):
Q_{2}=<V_{2},∅ > :  (tableau contents omitted)
One can now verify that no rule amongst T, S, U_{1} and U_{2} is applicable to Q_{2}.
▴
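The three sequences above suggest, as an implementation-level sketch, a depth-first strategy with backtracking and a repeated-goal guard. The guard below uses exact equality, a crude approximation of "equal up to symbol renaming", and is purely illustrative: it does not make the general problem decidable. `apply_rule` is again a hypothetical parameter standing for the core rule.

```python
def dfs_chase(dependencies, goal, apply_rule, depth=50, seen=None):
    """Depth-first search: succeed on the empty goal, prune goals
    already encountered (cutting loops like the second sequence), and
    backtrack when stuck (third sequence). The depth bound is only a
    pragmatic safeguard on a semi-decidable search."""
    seen = set() if seen is None else seen
    if not goal:
        return True                 # empty goal: implication proven
    if depth == 0 or goal in seen:
        return False                # bound reached or loop detected
    seen.add(goal)
    for tgd in dependencies:
        for successor in apply_rule(tgd, goal):
            if dfs_chase(dependencies, successor, apply_rule, depth - 1, seen):
                return True
    return False                    # stuck: caller backtracks
```

A breadth-first variant would trade this memory-light backtracking behaviour for fairness across the three kinds of sequences.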
EGDs add some difficulties because they include an equality predicate. Nevertheless, it has been shown in [BV84b] that the implication problem for TGDs and EGDs without constants is reducible to the implication problem for TGDs without constants. Thus, our core rule is sufficient.
Let e be an A-EGD e=<(a_{1},a_{2}),J>. Let w_{1} be a tuple such that w_{1}[A]=a_{1} and, for every attribute B≠ A, w_{1}[B]∉J[B]. Let w_{2} be a tuple such that w_{2}[B]=w_{1}[B] for every attribute B≠ A, and w_{2}[A]=a_{2}. We associate with e two TGDs: e_{1} is <w_{1},J∪ {w_{2}}> and e_{2} is <w_{2},J∪ {w_{1}}>. For a given set D of TGDs and EGDs, let D^{*} be the result of replacing each EGD e in D by its two associated TGDs e_{1} and e_{2}.
w_{1} :  w_{2} :  (tableau contents omitted)
e_{1}=<w_{1},I∪ {w_{2}}> :  e_{2}=<w_{2},I∪ {w_{1}}> :  (tableau contents omitted)
▴
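Under a toy encoding (tuples of Python symbols, `_`-prefixed strings as variables, a fixed arity, and the equated attribute given by its position), the construction above can be sketched as follows; the fresh symbols `_f0`, `_f1`, ... are illustrative names assumed not to occur in J.

```python
def egd_to_tgds(egd, arity, pos):
    """Replace the EGD <(a1, a2), J>, equating the values at attribute
    position `pos`, by its two associated TGDs e1 = <{w1}, J u {w2}>
    and e2 = <{w2}, J u {w1}>, where w1 and w2 agree everywhere except
    at `pos` (a1 for w1, a2 for w2) and carry fresh symbols elsewhere."""
    (a1, a2), J = egd
    fresh = [f"_f{i}" for i in range(arity)]   # assumed not to occur in J
    w1 = tuple(a1 if i == pos else fresh[i] for i in range(arity))
    w2 = tuple(a2 if i == pos else fresh[i] for i in range(arity))
    e1 = (frozenset({w1}), frozenset(J) | {w2})
    e2 = (frozenset({w2}), frozenset(J) | {w1})
    return e1, e2
```

Applying this to every EGD of D yields the set D^{*} of TGDs on which the core rule alone suffices.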
The following theorems allow us to ignore the presence of EGDs by reducing the implication of a dependency by a set of TGDs and EGDs to the implication of a TGD by a set of TGDs only:
Thus, we can now generalize theorem 1 and give the final following theorem that reduces the implication problem of TGDs and EGDs to the existence of a topdown chase using the rule:
As a conclusion, we discuss several points. First we compare the top-down chase with a backward formal system and with other proof procedures. Then we discuss the contributions of our work. We conclude the paper by lifting some restrictions on the model and pointing out some perspectives.
Many formal systems have been studied for data dependencies ([BFH77], [Sci82], [BV84a], [SU82]). In [BV84a], some formal systems for TGDs and EGDs are studied. Two of these systems are backward, but only one, namely the T3 formal system, has some similarities with the top-down chase. We sketch here the main differences, and refer the reader to that paper for more details about formal systems for TGDs and EGDs.
The T3 system deals with TGDs and their unknown values (i.e. non-constant symbols of VAL(J′)−VAL(J)) in the following way: the process starts with a goal tuple Q and applies a TGD T=<J′,J> by making J′ and J coincide. Typically this is achieved using, in particular, the collapsing, augmentation and projection rules. When J′ and J coincide, it uses the transitivity rule to derive a new goal. A derivation tree is therefore not linear, contrary to the case of SLD-resolution.
The top-down chase leads to a linear inference in the sense that it uses TGDs directly from the base without first applying rules to them. There is only one rule, though it is obviously a more complex step.
To our knowledge, this is the first time a top-down proof procedure has been used to solve the implication problem for dependencies. As such, a first contribution is to have shown that this can be performed top-down.
The top-down chase avoids rewriting the dependency set in clausal form, and avoids the corresponding Skolemization process, necessary in order to use classical top-down theorem-proving procedures. In the top-down chase, only the preliminary transformation of the TGD to be proven requires creating as many new constants as there are universally quantified variables appearing in it. Thus this step is insignificant. By providing a way to infer directly on the original form of the dependencies, by way of the notion of piece, the top-down chase is, conceptually speaking, a simpler approach; this is our second contribution.
Compared to the classical chase, it does not entail the dynamic creation of new constants. This is our third contribution.
Note that the top-down chase is not the usual chase simply reversed. The core rule is totally different from those used in the bottom-up approach of [BV84b]. Indeed, it has been shown that there is a strong relationship between the bottom-up chase and resolution with paramodulation [BC90] [NR01]. Whereas the latter acts on the Skolemized set of clauses, the top-down chase does not need this prior step.
We are currently investigating in detail the efficiency of the top-down chase. Actually, this is not trivial to assess, because of the various parameters that come into play. We agree with [BMSU86] on the fact that "it is unreasonable to expect proofs of optimality, or even superiority of one algorithm over another, in all cases".
However, we shall detail some of our arguments.
First of all, there is no need for Skolemizing the dependency set, nor for dynamic constant creation. This might increase efficiency. Indeed, contrary to the bottom-up chase, existentially quantified variables in the TGD conclusions never turn into constants. The only new symbols created are those of the current goal, which are temporary variables. The goal can grow by the addition of hypotheses of the dependencies used by the T-rule. Notice that these symbols are, on the other hand, deleted whenever they are mapped to existing constants, or whenever some pieces are removed from the goal. Depending on the search strategy, they are also deleted from memory whenever the sequence of rule applications is stuck or successful (requiring, for example, backtracking when doing depth-first search).
On the other hand, top-down approaches can take exponentially longer than bottom-up ones. However, the notion of piece performs some dynamic optimizations by dealing with groups of atoms instead of one at a time; thus failures are detected earlier than with classical top-down procedures, and irrelevant goals are cut down earlier. This feature dramatically reduces the number of backtracks. As an illustration, we have compared our method with Prolog over 5000 proofs generated at random, varying each time the size and the content of the dependency set, and the size of the dependencies. The top-down chase has been implemented on top of the CoGITo platform [GS98], which is a set of tools for conceptual graph management. These tests showed that the top-down chase provides a logarithmic average improvement in the number of backtracks as well as in the proof duration (when the process stops), in spite of a non-optimized implementation. This drastically reduces the drawback of the top-down approach, and gives the top-down chase a fair practical level of efficiency. The fact that the search for the right symbol mappings has exponential complexity does not seem to be a limiting factor in practice. Indeed, whereas for SLD-resolution some irrelevant goals may lead to exploring a whole branch of the resolution tree, the top-down chase detects failures earlier and thus saves time on average.
However, either approach could run faster in practice on given data [BMSU86]. That is why we think presenting an example in which the top-down chase would be more efficient than the bottom-up chase would not be significant and would not provide any worthwhile addition to the discussion.
For all these reasons, we plan to implement the chase of [BV84b] in order to compare their efficiency in practice over a random dataset. We think dependencies might be divided into classes for which one or the other approach would be better, but identifying them is still an open problem.
We must mention that some optimisations of the bottom-up approach have been made with magic sets in [MS96]. The principle of magic sets ([BMSU86]) is to perform at compile time some optimizations that are usually performed at runtime, by rewriting the set of dependencies before inference. This avoids the generation of irrelevant facts during the process, which is the essence of the top-down approach. We shall take these optimizations into account for the implementation.
As already stated, we assumed some restrictions on the model that can easily be lifted. The reduction of EGDs to TGDs also works in this unrestricted model, as stated in [BV84b]. Thus all the results can be extended, as they are in [CS98] for piece resolution.
There has been renewed interest in data dependency theory with the introduction of constrained dependencies and ordered dependencies. These types of dependencies can express a wide variety of constraints on the data ([BCW95]), besides generalizing most of the temporal dependencies of the taxonomy presented in [JS92]. The chase procedure has been redesigned, still in a bottom-up way, in order to deal with constrained tuple-generating dependencies ([MS96]), which generalize constrained functional dependencies. Our procedure can serve as a basis for the design of a top-down chase for constrained tuple-generating dependencies.
We now prove theorem 1:
(Only If). Let K∈ SAT(D_{ω}) be a relation. (i) There is a valuation k on ω (J) such that k(ω (J))⊆ K . As ω (J) contains only constants, then ω (J)⊆ K . (ii) As D⊆ D_{ω} , then D_{ω}⊨ D and K∈ SAT(D) . As D⊨ d , then K∈ SAT(d) . As ω (J)⊆ K (cf. (i)) and K∈ SAT(d) , there is an extension k′ of k on ω ′(J′) such that k′(ω ′(J′))⊆ K . Hence K∈ SAT(Q) , and D_{ω}⊨ Q .
We now prove theorem 2. We shall first introduce several lemmas.
To do that, we must rewrite Φ(Q′), Φ(T) and ¬ Φ(Q) in clausal form, in the following way:
• Φ(Q′) has the form Φ(Q′)=∃ x_{1}... ∃ x_{h}(A_{1}∧ ... ∧ A_{j}). We need to introduce Skolem constants, denoted q_{i}, i∈ [1..h], each respectively replacing the variable x_{i}. We construct j clauses^{1} of the form Q′_{i}=A_{i}[q_{1},...,q_{h}], i∈ [1..j].
• Φ(T) has the form Φ(T)=∃ y_{1}... ∃ y_{k}(C_{1}∧ ... ∧ C_{l}) ← H_{1}∧ ... ∧ H_{n}, universally closed over the variables x_{1},...,x_{p}. We need to introduce Skolem functions, denoted f_{i}(x_{1},...,x_{p}), i∈ [1..k], each respectively replacing the variable y_{i}. We construct l clauses of the form T_{i}=(C_{i}∨ ¬ H_{1} ∨ ... ∨ ¬ H_{n})[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})] with i∈ [1..l].
• Φ(Q) has the form Φ(Q)=∃ x_{1}... ∃ x_{s}(Q_{1}∧ ... ∧ Q_{t}). The negation of Φ(Q) has the form ¬ Q_{1}∨ ... ∨ ¬ Q_{t}, universally closed over the variables x_{1},...,x_{s}. We construct a clause of the form NQ=(¬ Q_{1}∨ ... ∨ ¬ Q_{t})[x_{1},...,x_{s}].
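The transformation above can be sketched as a small program (a sketch only, with a hypothetical term representation: atoms are (predicate, terms) pairs, and the Skolem symbols q_{i} and f_{i} are generated as in the bullet points for Φ(Q′) and Φ(T)):

```python
# Sketch of the clausal transformation (hypothetical representation;
# variables and constants are plain strings, a Skolem term is a pair
# (function_name, argument_tuple)).

def skolemize_goal(atoms, ex_vars):
    """Phi(Q'): each existential variable x_i is replaced by a fresh
    Skolem constant q_i, yielding one unit clause per atom A_i."""
    mapping = {v: "q_%d" % (i + 1) for i, v in enumerate(ex_vars)}
    return [[(pred, tuple(mapping.get(t, t) for t in terms))]
            for pred, terms in atoms]

def skolemize_tgd(hyp, concl, univ_vars, ex_vars):
    """Phi(T): each existential variable y_i of the conclusion becomes a
    Skolem function f_i(x_1,...,x_p) over the universal variables; one
    clause C_i v ~H_1 v ... v ~H_n is built per conclusion atom C_i."""
    mapping = {v: ("f_%d" % (i + 1), tuple(univ_vars))
               for i, v in enumerate(ex_vars)}
    neg_hyp = [("not", pred, terms) for pred, terms in hyp]
    return [[(pred, tuple(mapping.get(t, t) for t in terms))] + neg_hyp
            for pred, terms in concl]
```

For instance, the TGD r(x) → ∃y s(x,y) yields the single clause s(x, f_1(x)) ∨ ¬r(x), matching the shape of the clauses T_{i} above.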
The linear refutation starting with NQ exists. As NQ is composed only of negative literals, NQ can only be resolved with clauses containing positive literals, i.e. the clauses Q′_{a}, a ∈ [1..j], and T_{b}, b ∈ [1..l].
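For intuition, a linear refutation of this kind can be sketched in the much simpler propositional Horn case (an illustrative sketch only; the proof here works in first-order logic with unification and Skolem terms, which this fragment omits):

```python
# Propositional SLD-style refutation sketch.  A program is a list of
# Horn clauses (head, [body]); facts have an empty body.  Refuting the
# negative clause ~g1 v ... v ~gk amounts to reducing the goal list to
# the empty clause by resolving the selected goal against clause heads.

def sld_refute(program, goals, depth=50):
    """Return True if the goal list reduces to the empty clause."""
    if not goals:
        return True          # empty clause: refutation found
    if depth == 0:
        return False         # crude guard against infinite descent
    first, rest = goals[0], goals[1:]
    for head, body in program:
        if head == first:    # resolve selected literal with a positive clause
            if sld_refute(program, body + rest, depth - 1):
                return True
    return False             # no positive clause resolves: branch fails
```

As in the proof, each resolution step can only use clauses carrying a positive literal, and the added body literals become new goals to be resolved away in turn.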
Let us suppose that the refutation uses only the clauses Q′_{a}, a∈ [1..j]. Then the clauses T_{b}, b∈ [1..l], are unnecessary; thus {Φ(Q′), ¬ Φ(Q)} is inconsistent, hence Φ(Q′)⊨ Φ(Q), which contradicts the hypothesis. Therefore the resolution does not use only the clauses Q′_{a}, a∈ [1..j].
Remember that T_{i}=(C_{i}∨ ¬ H_{1} ∨ ... ∨ ¬ H_{n})[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})] with i∈ [1..l]. If the refutation does not use only the clauses Q′_{a}, a∈ [1..j], then it uses at least one of the atoms C_{a}, a∈ [1..l], of the clauses T_{i}, i∈[1..l] (which are the only other positive literals). This step yields a resolvent containing the literals ¬ H_{1} ∨ ... ∨ ¬ H_{n}. The f_{1}(x_{1},...,x_{p}),...,f_{k}(x_{1},...,x_{p}) are Skolem functions. They correspond to existentially quantified variables of Φ(T); therefore there is a substitution between a Q_{b}, b∈ [1..t], and a C_{a}, a∈ [1..l], which maps neither from the equivalent existentially quantified variables (first condition of the T-rule verified) nor from the variables of H_{i}, i∈[1..n], that are not in C_{a}. Therefore, within the tableaux framework, there is an equivalent valuation h that satisfies the first condition of the T-rule (which concerns symbols of the conclusion of T not in the hypothesis, i.e. existentially quantified variables), and the second condition of the T-rule (which concerns symbols of the hypothesis of T not in the conclusion of T, i.e. universally quantified variables appearing only in the hypothesis). Indeed, as these variables do not come into play in the substitution, we force the equivalent symbols (in the tableaux valuation) to be identical. This has no consequence, because we have assumed that no symbols have the same name in T and Q.
This resolution step has instantiated the variables of NQ. Therefore the literals of NQ containing one of these variables as a term have it instantiated as well. Thus, as there are neither function symbols nor variables in the Q′_{a}, a∈ [1..j], these literals are unified with the C_{a}[x_{1},...,x_{p}, f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})], a∈ [1..l]. We are therefore sure that at least one piece of the conclusion of h(Q) appears in the conclusion of h(T). Thus the third condition of the T-rule is satisfied.
Let us focus on the literals added to the resolvent in the resolution step: ¬ H_{1} ∨ ... ∨ ¬ H_{n}. There are two possibilities: either they are unified with the C_{a}[x_{1},...,x_{p},f_{1}(x_{1},...,x_{p}), ...,f_{k}(x_{1},...,x_{p})], a∈ [1..l], or with the A_{a}, a∈ [1..j]. Suppose that, at the moment they come into play within the resolution, they are unified with the C_{a}[x_{1},...,x_{p},f_{1}(x_{1},...,x_{p}),...,f_{k}(x_{1},...,x_{p})], a∈ [1..l]; then the resolvent would again contain the same atoms, because the negative atoms of the clauses T_{i}, i∈[1..l], are the same in each clause T_{i}. They are thus necessarily unified with the A_{a}, a∈ [1..j], for the resolution to terminate (which is the case). We thus see that the negative literals of the clauses T_{i}, i∈[1..l], can only be unified with the A_{a}, a∈ [1..j]. Therefore the negation of the new goal has a refutation which does not need the clauses T_{b}, b∈ [1..l]; thus {Φ(Q′), ¬ Φ(B)} is inconsistent, and Q′ ⊨ B.
We now have a new clause resulting from a sequence of resolutions with the clauses coming from T_{ip}. Thus this clause has a linear refutation. Let us show that this clause corresponds to a goal. To do so, let us perform the inverse transformation (i.e. from clausal form to FOL, and thus to tableaux). This clause contains only literals from the original goal and literals coming from the hypothesis of T_{ip}, all negative. Moreover, it contains no function symbols. Therefore we can perform the inverse transformation to obtain a goal B_{p−1}, and thus Γ ⊨ B_{p−1}.
As Γ ⊨ B_{p−1}, by induction there is a finite sequence of indices i_{1},...,i_{p−1} such that {T_{i1},...,T_{ip−1}}⊨ B_{p−1}, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p−1], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅>. As we have shown that {B_{p−1},T_{ip}}⊨ Q, there is a finite sequence of indices i_{1},...,i_{p} such that {T_{i1},...,T_{ip}}⊨ Q, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅> and B_{p}=Q.
SLD-resolution allows producing the empty clause from the set of clauses corresponding to {Φ(Q′),¬ Φ(Q)}. There is therefore a linear refutation starting from the negative clause corresponding to ¬ Φ(Q), provided that Φ(Q′) is in clausal form with exactly one positive literal per clause. Let us perform the same transformation as in lemma 2. The linear refutation starting from NQ exists. As NQ is composed only of negative literals, NQ can be resolved only with clauses having positive literals, i.e. the clauses Q′_{a}, a ∈ [1..j]. Each literal of NQ is thus unifiable with one of the A_{a}[q_{1},...,q_{h}], a∈ [1..j]. All the atoms of NQ thus appear in the A_{a}[q_{1},...,q_{h}], a∈ [1..j], and can only differ by a variable of NQ corresponding to a term in the A_{a}[q_{1},...,q_{h}], a∈ [1..j]. Therefore the tuples of Q are a subset of the tuples of Q′, and can only differ by extra symbols in Q.
If there is a rule application between Q′ and a TGD T, then there is a valuation h′ and (at least) one piece of the conclusion of h′(Q′) appearing entirely in the conclusion of h′(T). As the tuples of Q can only have extra symbols, and as the tuples of Q all appear in Q′, there is also a valuation h and (at least) one piece of the conclusion of h(Q) appearing entirely in the conclusion of h(T). Let us construct the new goal B by removing from the conclusion of h(Q) only the pieces corresponding to those removed from the conclusion of h′(Q′) when B′ was constructed. Then the new goal B contains some tuples of h(Q) that differ, by construction, from those of B′ by potentially extra symbols, and some tuples of h(T) that also differ from those of B′ by potentially extra symbols. By transforming ¬ Φ(B) and Φ(B′) into clausal form, it is easy to show that each literal of the clause coming from ¬ Φ(B) is unifiable with an atom of a clause coming from Φ(B′), and thus that the set {Φ(B′),¬ Φ(B)} is inconsistent, and B′ ⊨ B. By induction on the number of rule applications starting with Q′, we conclude that there is also a sequence of rule applications starting from Q and terminating with success.
(If). By induction on the number of T-rule applications. Trivially, as Q_{n}=<∅ ,∅ > , then D_{ω }⊨ Q_{n} . Assume the induction hypothesis holds for Q_{i}, ∀ i∈ {2,… ,n} ; thus D_{ω }⊨ Q_{2} . Let us prove that D_{ω }⊨ Q_{1} . By lemma 1, as Q_{2}=T(T_{1},Q_{1}) , then {Q_{2},T_{1}}⊨ Q_{1} , and also {Q_{2},D_{ω }}⊨ Q_{1} . As D_{ω }⊨ Q_{2} , it follows that D_{ω }⊨ Q_{1} .
(Only If). By lemma 3, there is a sequence of indices i_{1},...,i_{p} such that {T_{i1},...,T_{ip}}⊨ Q, with i_{j} ∈ [1..n] and such that ∀ j ∈ [1..p], there is a goal B_{j−1} such that {B_{j−1},T_{ij}}⊨ B_{j}, with B_{0}=<∅,∅> and B_{p}=Q. The proof is by induction on p. At step 1, we have {<∅,∅>, T_{i1}}⊨ B_{1}; therefore by lemma 2, there is a rule application between B_{1} and T_{i1}, giving a goal B_{0} such that <∅,∅>⊨ B_{0}. As <∅,∅> ⊨ <∅,∅>, the resolution terminates with success. Suppose the hypothesis holds up to step p−1, i.e. there is a sequence of rule applications starting from B_{p−1} and terminating with success. At step p, we have B_{p}=Q. By lemma 2, there is a rule application between Q and T_{ip}, giving a goal B such that B_{p−1}⊨ B. By the induction hypothesis, there is a sequence of rule applications starting from B_{p−1} and terminating with success. By lemma 4, there is also a sequence of rule applications starting from B, thus from Q, terminating with success.
By theorem 4, D⊨ e if and only if D^{*}⊨ e_{1} and there is a nontrivial A-EGD in D . D^{*} is a set of TGDs. Therefore, by theorem 1, D^{*}⊨ e_{1} if and only if D_{ω }^{*}⊨ Q . It follows that D⊨ e if and only if D_{ω }^{*}⊨ Q and there is a nontrivial A-EGD in D . By theorem 2, we prove the existence of the sequence of T-rules.