A Top-Down Proof Procedure for Generalized Data Dependencies

Stéphane Coulondre
INSA / University of Lyon, 20 Av Albert Einstein, 69621 Villeurbanne Cedex, France
Stephane.Coulondre(at)insa-lyon.fr

Abstract: Data dependencies are well known in the context of relational databases. They specify constraints that the data must satisfy in order to model correctly the part of the world under consideration. The implication problem for dependencies is to decide whether a given dependency is logically implied by a given set of dependencies. A proof procedure for the implication problem, called the “chase”, has already been studied in the generalized case of tuple-generating and equality-generating dependencies. The chase is a bottom-up procedure, working from hypotheses to conclusion, and thus is not goal-directed. In the case of TGDs it also entails the dynamic creation of new constants, which can turn out to be a costly operation. This paper introduces a new proof procedure which is top-down, from conclusion to hypotheses, and therefore goal-directed. Its originality is that it does not proceed like classical theorem-proving procedures, which require expressions in a special form, such as the clausal form obtained after Skolemization. We show that with our procedure this step is useless: the notion of piece allows inferring directly on dependencies, thus saving the cost of Skolemizing the dependency set, and, moreover, the inference can be performed without dynamically creating new constants. Although top-down approaches are known to be less efficient in time than bottom-up ones, the notion of piece cuts down the irrelevant goals usually generated, leading to a usable top-down method. With the more recent introduction of constrained and ordered dependencies, some interesting perspectives also arise.

1  Introduction

Dependency theory allows the expression and modelling of constraints that the data must satisfy in order to reflect correctly the world that a relational database intends to describe. Since the introduction of functional dependencies by Codd ([Cod72]), many kinds of dependencies have been studied in the literature, and a lot of work was carried out in the late 1970s and early 1980s. Database dependency theory is still an active area of research [SF00] [Her95] [LL97a] [LL97b] [LL98]. Functional and multivalued dependencies are the best-known classes of data dependencies. In practice, these two kinds are generally sufficient to express constraints ([Ull88]). Nevertheless, more general classes have been introduced with the purpose of finding a uniform way to express constraints ([BV81], [SU82]). This paper deals with the class commonly known to generalize most of the dependencies: that of tuple-generating and equality-generating dependencies (TGDs and EGDs) ([BV84b]). For a survey of this general class of dependencies, we refer the reader to [LL99] or [FV86].

The central problem is the implication problem: deciding whether a dependency is logically implied by a given set of dependencies. A process solving this problem would make it possible to find the minimal cover of a set of dependencies, to decide whether a dependency is redundant within a set (useful during the constraint acquisition stage), etc. A procedure has already been designed in [BV84b] for that purpose: the well-known chase. Unfortunately, as the implication problem for TGDs and EGDs is only semi-decidable, the chase is only a proof procedure, and therefore the process may run forever. As we argue in this paper, the chase is clearly a bottom-up procedure, working from hypotheses to conclusion. In the general case of TGDs, the chase also entails the dynamic creation of new constants.

We introduce a new proof procedure which is top-down, from conclusion to hypotheses, and thus goal-directed. Its originality is that it does not proceed like classical theorem-proving procedures, which require the dependency set in a special form, such as the clausal form obtained after Skolemization. We show that with our procedure this step is useless, and that the notion of piece allows inferring directly on the dependency set.

Therefore, our top-down chase is not simply the usual chase reversed, but a new way of solving the implication problem. The fact that it can be performed top-down is the first contribution of this paper. The second contribution is to avoid the dynamic creation of constants, as well as the Skolemization of the dependency set usually applied to the original knowledge base prior to top-down proofs. This is achieved by taking advantage of the form of the dependencies. To our knowledge, this has not been done before. Indeed, the dynamic creation of constants, needed to take into account the effect of existential quantifiers, can be costly in proof procedures such as the chase.

While it is true that top-down approaches can take exponentially longer than bottom-up ones, many reasons allow us to think that the proof procedure presented in this paper is efficient. These arguments are detailed in the last section. The efficiency with respect to the bottom-up chase is currently being assessed in detail.

More recently, constrained dependencies ([Mah94], [BCW95], [Mah97]) and ordered dependencies [Ng99] have been introduced. They originate in the constraint programming field and permit the expression of semantic relations on variables, thus giving them an interpretation. The chase procedure has been redesigned in [MS96], still in a bottom-up way, to deal with constrained tuple-generating dependencies. This line of work in dependency theory opens new perspectives for the top-down chase procedure we present.

The top-down chase originates in the conceptual graphs model, a knowledge representation model introduced by Sowa ([Sow84]). The base model has been extended with graph rules and an inference method called piece resolution ([SM96]). The logical roots of this process have been studied in [CS98], and constitute the basis of the top-down chase. The proofs of the lemmas and theorems of this paper are derived from these two works.

Section 2 describes the framework and the implication problem for data dependencies. We sketch the existing (bottom-up) chase. In section 3 the top-down chase is explained. Section 4 closes the paper with some concluding remarks.

2  The Framework

The following definitions are for the most part taken from [BV84b], though simplified for the purpose of this paper. The first subsection states some assumptions we make throughout this paper. The second subsection presents the necessary definitions. The third subsection describes the kind of dependencies this paper focuses on; note that they are known to capture the semantics of most of the dependency types studied in the literature. The fourth subsection formally presents the implication problem and then describes the traditional chase procedure, which was designed to solve it. The chase is clearly a bottom-up mechanism, from hypotheses to conclusion.

2.1  Preliminaries

For the sake of clarity, we assume several restrictions on the model. First, we assume that the universal relation assumption holds, i.e. the database is modelled with only one relation, because this usually permits a simpler formal presentation. Secondly, dependencies are typed (many-sorted), so attribute domains are disjoint. Thirdly, we do not address dependencies with constants, such as the last example illustrated below.

TGDs and EGDs can all be expressed in first-order logic. Informally speaking, an equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal. A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.

For example, to express the classical functional dependency stating that two customers having the same name also have the same identifier, we use the following EGD (which is also a functional dependency):

∀custid1 ∀custid2 ∀custname (cust(custid1, custname) ∧ cust(custid2, custname) → custid1 = custid2)

To express that an invoice is always related to an existing order, we use the following TGD:

∀invoiceid ∀amount ∀orderid (invoice(invoiceid, amount, orderid)
→ ∃custid order(custid, orderid))

By introducing constants, we can state more specialized constraints. For example, invoice 23 is related to an order taken by customer 12:

∀amount ∀orderid (invoice(23, amount, orderid)
→ order(12, orderid))

However, the framework used in this paper is that of tableaux, because it is well suited when the universal relation assumption holds. On the other hand, this assumption is very impractical for real-world applications, because it would imply modelling the whole database schema with only one relation. That is the reason why we do not use real-world examples throughout this paper. Note that these restrictions appear here for clarity reasons only, and all the results are applicable in the unrestricted model.

2.2  Attributes, relations and tableaux

Definition 1 (Attribute)   Attributes are symbols taken from a given finite set U = {A1, …, An} called the universe. All sets of attributes are subsets of U. The complement of an attribute A in U is denoted by Ā.
Definition 2 (Relation)   With each attribute A is associated an infinite set, its domain, denoted DOM(A), such that DOM(A) ∩ DOM(B) = ∅ for A ≠ B. Let Dom = ∪A∈U DOM(A). For an attribute set X, an X-value is a mapping w : X → Dom such that w(A) ∈ DOM(A) for all A ∈ X. A tuple is a U-value. A relation is a set (not necessarily finite) of tuples. We use the letters t, u, v, w, … to denote tuples, and I, J, … to denote relations. For a tuple w and a set Y ⊆ U, we denote the restriction of w to Y by w[Y]. For an attribute A, we do not distinguish between w[A] (which is an A-value) and w(A) (which is an element of DOM(A)). Let I be a relation. The set of X-values in I is I[X] = {w[X] | w ∈ I}. The set of values in I is VAL(I) = ∪A∈U I[A].
Definition 3 (Valuation)   A valuation is a mapping h : Dom → Dom such that a ∈ DOM(A) implies h(a) ∈ DOM(A) for all a ∈ Dom. The valuation h can be extended to tuples and relations as follows. Let w be a tuple. Then h(w) = h∘w (where ∘ denotes functional composition). Let I be a relation. Then h(I) = {h(w) | w ∈ I}. Usually, we are interested only in a small subset of Dom, for example VAL(I). We let h be undefined for other values, and say that h is a valuation on I. We associate with each attribute A an infinite set of constants CONS(A) ⊆ DOM(A). Let CONS = ∪A∈U CONS(A). For any valuation h and any constant a ∈ CONS(A), we require h(a) = a. Let I be a relation; CONS(I) is the set of constants appearing in VAL(I).
Example 1   Let I be a relation, and w be a tuple, as shown below. Let h be a valuation on w such that a1 ↦ a1, b1 ↦ b1, c1 ↦ c1 and d1 ↦ d2. Then h(w) ∈ h(I).
I :
  A  B  C  D
  a1 b1 c1 d2
  a1 b2 c2 d1

w :
  A  B  C  D
  a1 b1 c1 d1

Definition 4 (Extension)   Let I be a relation, and let h be a valuation on I. An extension of h to another relation I′ is a valuation h′ on I′ which agrees with h on VAL(I). If a valuation h is the identity on VAL(I), we say that h is the identity on I.
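To fix intuitions, these definitions can be sketched in Python (an illustration only, not part of the formalism): a tuple becomes a dict from attribute names to value symbols, and a valuation becomes a dict from symbols to symbols, taken as the identity where unspecified.

```python
# Sketch of Definitions 2-4: tuples as dicts, valuations as symbol maps.

def apply_valuation(h, t):
    """h(w) = h o w: apply valuation h to tuple t (identity by default)."""
    return {attr: h.get(val, val) for attr, val in t.items()}

def apply_to_relation(h, I):
    """h(I) = {h(w) | w in I}."""
    return [apply_valuation(h, t) for t in I]

# Relation I and tuple w of Example 1.
I = [{"A": "a1", "B": "b1", "C": "c1", "D": "d2"},
     {"A": "a1", "B": "b2", "C": "c2", "D": "d1"}]
w = {"A": "a1", "B": "b1", "C": "c1", "D": "d1"}

# The valuation of Example 1: identity everywhere except d1 -> d2.
h = {"d1": "d2"}
```

With these definitions, `apply_valuation(h, w)` is the first tuple of I, so h(w) ∈ h(I), as in example 1.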

2.3  Dependencies

An equality-generating dependency (EGD) says that if some tuples exist in the database, then some values in these tuples must be equal.

Definition 5 (EGD)   An Equality-Generating Dependency (EGD) is a pair <(a1,a2), I>, where a1 and a2 are A-values for some attribute A, and I is a finite relation such that a1, a2 ∈ I[A]. We also call this EGD an A-EGD. A relation J satisfies <(a1,a2), I> if, for any valuation h such that h(I) ⊆ J, we have h(a1) = h(a2). Note that, if a1 = a2, then <(a1,a2), I> is trivially satisfied by every relation and the EGD is trivial.
Example 2   Consider the relation I of example 1. Then E = <(c1,c2), I> is an EGD, as shown below, and K satisfies E. Note that E is equivalent to the functional dependency A → C.
E = <(c1,c2), I> :
  c1 = c2
  A  B  C  D
  a1 b1 c1 d2
  a1 b2 c2 d1

K :
  A  B  C  D
  a  b  c  d
  a  b  c  d
  a  b  c  d

A tuple-generating dependency (TGD) says that if some tuples exist in the database, then some other tuples, whose values are not necessarily taken from the first tuples, must also exist in the database. Thus some values may be unknown.

Definition 6 (TGD)   A Tuple-Generating Dependency (TGD) is a pair of finite relations <I′, I>. It is satisfied by a relation J if, for any valuation h such that h(I) ⊆ J, there is an extension h′ of h to I′ such that h′(I′) ⊆ J.
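For finite relations, Definition 6 can be checked by brute force: enumerate every valuation h with h(I) ⊆ J, and look for an extension reaching the conclusion. The sketch below is illustrative only (exponential, and treating every symbol as renameable, i.e. ignoring constants); it types the search using the disjointness of attribute domains.

```python
from itertools import product

def _apply(h, t):
    return {a: h.get(v, v) for a, v in t.items()}

def valuations(I, J):
    """All typed valuations h on VAL(I) such that h(I) is included in J.
    Constants are ignored for brevity: every symbol may be renamed."""
    attr = {v: a for t in I for a, v in t.items()}  # symbol -> its attribute
    syms = sorted(attr)
    cands = [sorted({t[attr[s]] for t in J}) for s in syms]
    for image in product(*cands):
        h = dict(zip(syms, image))
        if all(_apply(h, t) in J for t in I):
            yield h

def satisfies_tgd(J, conclusion, hypothesis):
    """Definition 6: J satisfies <I', I> if every valuation h with
    h(I) in J extends to some h' with h'(I') in J."""
    for h in valuations(hypothesis, J):
        full = hypothesis + conclusion
        # dict items views support set-like comparison: h2 extends h
        if not any(h2.items() >= h.items() for h2 in valuations(full, J)):
            return False
    return True
```

For instance, the "cross tuple" TGD with hypothesis {(x,y), (x2,y2)} and conclusion {(x,y2)} fails on a two-tuple relation but holds on its full cross product.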
Example 3   Consider the relation I of example 1. Let I′ be another relation, as shown below. Then T = <I′, I> is a TGD and K′ satisfies T.
I′ :
  A  B  C  D
  a2 b1 c2 d2
  a2 b2 c1 d2

T = <I′, I> :
  A  B  C  D
  a2 b1 c2 d2
  a2 b2 c1 d2
  a1 b1 c1 d2
  a1 b2 c2 d1

K′ :
  A  B  C  D
  a  b  c  d
  a  b  c  d
  a  b  c  d
  a  b  c  d

2.4  The implication problem and the (bottom-up) chase

Let D be a set of dependencies, and d be a dependency. The implication problem is to decide whether D ⊨ d, that is, whether d is true in every database in which each dependency of D is true. Let SAT(D) be the set of all relations, composed only of constants, that satisfy all the dependencies in D; the implication problem is then equivalent to deciding whether SAT(D) ⊆ SAT(d).

The implication problem can also be considered under a different view: that of finite implication problem. The finite implication problem is to decide whether d is satisfied by every finite relation that satisfies all dependencies in D. However, as detailed below, this problem admits no proof procedure.

The chase procedure was designed to solve the implication problem; the reader can refer to [BV84b] for a complete description. Intuitively speaking, if the dependency to be proven is the TGD <I′, I>, or the EGD <(a1,a2), I>, the chase takes the relation I and treats it as a set of ground tuples. Then it repeatedly applies the dependencies of D, following two distinct rules: one for TGDs, whose effect is to add tuples to the relation, and one for EGDs, whose effect is to “identify” two constants. When the dependency to be proven is a TGD, the procedure stops upon obtaining the tuples of I′. When it is an EGD, the procedure stops upon obtaining the identification of a1 and a2.

This mechanism has been shown to be sound and complete in [Var84]. Note that the implication problem for TGDs and EGDs is semi-decidable. Thus the chase may not stop.

The chase procedure is clearly a bottom-up (or forward-chaining) one. Indeed, rule applications generate new tuples or identifications of constants, until the desired conclusion is obtained. The goal to be proven is not used to guide the process. Moreover, when applying a TGD rule, whose effect is to add tuples to the relation, existential quantification always requires a costly dynamic creation of new constants.
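The TGD rule of the chase can be sketched as follows (a simplified illustration of the procedure of [BV84b]: tuples are dicts from attributes to symbols, the valuation search is brute force, and constants are ignored); note the fresh-constant creation it performs.

```python
from itertools import count, product

_fresh = count()

def _apply(h, t):
    return {a: h.get(v, v) for a, v in t.items()}

def _valuations(I, J):
    """Brute-force typed valuations h with h(I) included in J."""
    attr = {v: a for t in I for a, v in t.items()}
    syms = sorted(attr)
    cands = [sorted({t[attr[s]] for t in J}) for s in syms]
    for image in product(*cands):
        h = dict(zip(syms, image))
        if all(_apply(h, t) in J for t in I):
            yield h

def chase_tgd_step(state, conclusion, hypothesis):
    """One bottom-up TGD step: for every valuation h with h(hypothesis)
    in the current relation, add h'(conclusion), where h' extends h with
    a NEW constant for each conclusion-only symbol. This dynamic creation
    of constants is the cost the top-down chase avoids."""
    added = []
    for h in list(_valuations(hypothesis, state)):
        h = dict(h)
        for t in conclusion:
            for v in t.values():
                h.setdefault(v, f"new{next(_fresh)}")  # fresh constant
        for t in conclusion:
            u = _apply(h, t)
            if u not in state:
                state.append(u)
                added.append(u)
    return added
```

Applied to a one-tuple relation and the TGD "for every (x, y) there exists (x, z)", the step adds one tuple carrying a brand-new constant in its second column.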

We now show that we can apply a top-down (or backward) procedure, in which the process is goal-directed.

3  The Top-Down Chase

Depending on the type of the dependencies, the implication problem is solvable or recursively unsolvable ([BV81, Var84, GL82]). In the first case there is a decision procedure, hence an algorithm that always halts, whereas in the second case there is only a proof procedure: if the implication holds, the process terminates; otherwise it might never stop. The finite implication problem, on the other hand, is not even partially solvable; therefore there can be no proof procedure for it.

A subset of TGDs is known as Full TGDs. These dependencies have the same form as Datalog rules, and in this case the notion of piece, stated in the introduction, is useless. Therefore, dedicated top-down theorem-proving procedures, such as Query-Subquery ([Vie86]) and OLD-Resolution ([TS86]), can be applied to solve the implication problem involving only Full TGDs. The principal aim of these procedures is to provide a terminating algorithm for the top-down evaluation of Datalog rules: even though the implication problem is decidable for this particular class, a naive top-down evaluation might not terminate. To tackle this problem, tabulation (or memoing) techniques must be applied in order to cut down looping sequences; these techniques do not affect completeness. In the general case of TGDs, however, some non-terminating sequences remain undetectable, due to the undecidable nature of the implication problem. In the case of Full TGDs, the implication problem and the finite implication problem coincide and are both decidable.

For this kind of dependency, as well as for the other subsets (functional and multivalued dependencies), the interest of the top-down chase is that it remains applicable. However, in these particular cases, specific proof procedures will be more efficient because they implement memoing. The top-down chase and the classical bottom-up chase are clearly needed when dependencies are more complex (i.e. do not fall in a decidable subset), and also to provide a general way to solve the implication problem. Notice that memoing can be added to the top-down chase too, but this is an implementation-level extension.

A lot of work has also been carried out in the active database area; for a survey, we refer the reader to [WC95]. Active databases generally support the automatic triggering of updates in response to internal or external events. These updates are expressed by means of rules which are slightly different from TGDs and EGDs. Usually, these rules can be expressed in Datalog with negation [BDR98] [AHV95]. The main difference lies in the fact that the variables occurring in the conclusion are all universally quantified, whereas in database dependencies they are existentially quantified if they do not also occur in the hypothesis.

The first subsection presents a proof procedure for TGDs only. The second subsection shows how this procedure can take EGDs into account as well, by using some reduction theorems.

3.1  A proof procedure for TGDs

Let D be a set of TGDs without constants, and let d = <J′, J> be a TGD without constants. Intuitively speaking, to decide whether D ⊨ d, we start with the tuples of J′ and treat them as a relation Q, considered as the goal to reach. On the other hand, we add J to D, after replacing each symbol of J by a new constant, by transforming each tuple into a TGD with constants. Then we try to remove the tuples of Q by successively applying a rule, each application giving a new goal; these rule applications may also introduce new tuples. If we succeed in removing all tuples, i.e. in obtaining an empty goal, then D ⊨ d.

Compared with classical theorem-proving methods, this proof procedure relies on a complex core rule that does not require any modification of the dependencies in D. Indeed, in order to apply the classical resolution method, one needs to rewrite D and ¬d in clausal form. In doing so, the Skolemization step would generate constants or functions due to the presence of existential quantifiers in the conclusions. The resolution method would then try to generate the empty clause; once it is obtained, Herbrand's theorem ensures that D ⊨ d.

In the present case, the top-down proof procedure requires neither rewriting the dependencies of D nor dynamically creating new constants or functions. Indeed, its originality is to allow inferring directly on the dependencies of D, thus providing a general mechanism. Notice that this is possible thanks to the particular form of TGDs and EGDs.

Let us now explain this process in a more formal way. We need the notion of distinct substitution that replaces each symbol of a relation by a new constant.

Definition 7 (Distinct substitution)   Let J be a relation without constants. A distinct substitution ω on J is a valuation on J such that ω(a) is a new distinct constant for each a ∈ VAL(J).

Now, we can add to D a set of TGDs with constants, each of them being of the form <u, ∅>, with u corresponding to a distinct tuple of ω(J). Thus we add |ω(J)| TGDs. The following theorem states that checking whether D ⊨ d can be reduced to checking whether Dω ⊨ Q, where Dω is the result of adding the new TGDs to D, and Q is the goal. Note that the added TGDs are the only ones of Dω containing constants. More formally:

Theorem 1   Let D be a set of TGDs without constants, and let d = <J′, J> be a TGD without constants. Let ω be a distinct substitution on J. Let ω′ be an extension of ω to J′ such that ω′ is the identity on VAL(J′)−VAL(J). Let Dω denote D ∪ { <u, ∅> | u ∈ ω(J) } and let Q denote the TGD <ω′(J′), ∅>, which possibly contains constants. Then D ⊨ d if and only if Dω ⊨ Q. Q is also called a goal.
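Theorem 1's construction can be sketched directly (an illustration: tuples are dicts from attributes to symbols, a TGD is a pair (conclusion, hypothesis) of lists of tuples, and `freeze` plays the role of ω):

```python
from itertools import count

_fresh = count()

def freeze(J):
    """Distinct substitution (Definition 7): a new constant per symbol of J."""
    omega = {}
    for t in J:
        for v in t.values():
            omega.setdefault(v, f"cst{next(_fresh)}")
    return omega

def to_goal(d, D):
    """Theorem 1 (sketch): for d = <J', J>, build D_omega and the goal Q.
    D_omega adds one hypothesis-free TGD per tuple of omega(J), and
    Q = omega'(J') with omega' the identity outside VAL(J)."""
    J_prime, J = d
    omega = freeze(J)
    D_omega = list(D) + [([{a: omega[v] for a, v in t.items()}], [])
                         for t in J]
    Q = [{a: omega.get(v, v) for a, v in t.items()} for t in J_prime]
    return D_omega, Q
```

The symbols of J′ that do not occur in J are left untouched in Q, while the shared symbols are replaced by the very constants appearing in the added hypothesis-free TGDs.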
Example 4   Let d be a TGD without constants, as shown below.
d = <J′, J> :
  A  B  C  D
  a3 b4 c4 d5
  a4 b4 c3 d5
  a4 b3 c4 d5
  a3 b3 c3 d3
  a3 b4 c4 d4

(the first three tuples form J′, the last two form J)

Let ω be a distinct substitution on J such that a3 ↦ a, b3 ↦ b, b4 ↦ b′, c3 ↦ c, c4 ↦ c′, d3 ↦ d and d4 ↦ d′, with a, b, b′, c, c′, d, d′ being constants. Let ω′ be an extension of ω to J′ such that ω′ is the identity on VAL(J′)−VAL(J). Let Dω denote D ∪ { <u, ∅> | u ∈ ω(J) }:

{ <u, ∅> | u ∈ ω(J) } :
  A  B  C  D
  a  b  c  d

  A  B  C  D
  a  b′ c′ d′

Let Q denote the TGD <ω ′(J′),∅ > :

Q = <ω′(J′), ∅> :
  A  B  C  D
  a  b′ c′ d5
  a4 b′ c  d5
  a4 b  c′ d5

Then Dd if and only if DωQ .

We now describe the main step of the process: given a TGD and a goal, the core rule constructs a new goal. We first need the notion of piece. A piece is a set of tuples that are semantically linked; when applying the rule, these tuples can be treated at the same time. This notion is an alternative to the classical Skolemization process that replaces unknown values by constants prior to using classical first-order provers: a piece is a set of tuples which share unknown values, and which are therefore treated as a whole. This is the originality of the top-down chase.

Definition 8 (Piece)   Let L be a relation, and v ⊆ VAL(L). The pieces of L in relation to v are defined in the following way: two tuples u, u′ ∈ L belong to the same piece if and only if there is a sequence of tuples (u1, …, um) of L such that u1 = u, um = u′ and, for all i ∈ [1..m−1], there is a symbol a ∈ v such that a ∈ VAL(ui) and a ∈ VAL(ui+1). As a corollary, a tuple that shares no element of v with another tuple forms a piece by itself. By construction, the set of pieces is unique for a given relation L and a given set of symbols v.
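Computing the pieces of a relation with respect to a symbol set is essentially a connected-components computation. A sketch (tuples as dicts from attributes to symbols):

```python
def pieces(L, v):
    """Definition 8: partition L into pieces w.r.t. the symbol set v.
    Two tuples fall in the same piece when a chain of tuples of L links
    them, each consecutive pair sharing some symbol of v."""
    parts = []  # list of (list_of_tuples, set_of_v_symbols_they_cover)
    for t in L:
        syms = {x for x in t.values() if x in v}
        group, rest = [t], []
        for p, s in parts:
            if s & syms:          # t is linked to this piece through v
                group += p
                syms |= s
            else:
                rest.append((p, s))
        parts = rest + [(group, syms)]
    return [p for p, _ in parts]
```

Applied to the relation L and the set v = {a2, c1, d2} of the example below, it returns three pieces, of sizes 3, 2 and 1.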
Example 5   Let L be a relation, as shown below, and v = {a2, c1, d2}. Then there are three pieces of L in relation to v: P1 contains the tuples of L that share the symbols a2 and c1, P2 contains the tuples of L that share the symbol d2, and P3 contains the last tuple of L, which shares none of these symbols and therefore forms a piece by itself.
L :
  A  B  C  D
  a1 b3 c1 d1
  a2 b4 c3 d3
  a2 b1 c1 d1
  a1 b2 c2 d2
  a1 b3 c3 d2
  a1 b1 c3 d1

P1 :
  A  B  C  D
  a1 b3 c1 d1
  a2 b4 c3 d3
  a2 b1 c1 d1

P2 :
  A  B  C  D
  a1 b2 c2 d2
  a1 b3 c3 d2

We now define the core rule.

Definition 9 (Core rule)   Let T = <I′, I> be a TGD of Dω and Qn = <V, ∅> be a goal, such that (VAL(V) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = ∅ (renaming symbols apart if necessary). If there is a valuation h on I ∪ I′ ∪ V such that:

- h is the identity on VAL(I)−VAL(I′) and on VAL(I′)−VAL(I);
- there is a piece P of h(V) in relation to (VAL(I′)−VAL(I)) ∖ CONS such that P ⊆ h(I′);

then the result of the application of the rule, denoted Qn+1 = T(T,Qn), is Qn+1 = <(h(V) ∖ P) ∪ h(I), ∅>. Qn+1 becomes a new goal: it is obtained by removing P from h(V) and adding h(I). Note that several goals may be obtainable by an application of the rule, depending on the valuation h. In a sense, the piece notion allows us to group together tuples according to the particular symbols they share (which correspond, in logic, to the existentially quantified variables of T).
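Given a suitable valuation h, one application of the rule can be sketched as follows (an illustration only: tuples are dicts, a TGD is a pair (conclusion, hypothesis), the renaming precondition and the search for h itself are left out, and h is assumed to be the identity on the existential symbols):

```python
def _pieces(L, v):
    """Pieces of L w.r.t. symbol set v (Definition 8)."""
    parts = []
    for t in L:
        syms = {x for x in t.values() if x in v}
        group, rest = [t], []
        for p, s in parts:
            if s & syms:
                group += p
                syms |= s
            else:
                rest.append((p, s))
        parts = rest + [(group, syms)]
    return [p for p, _ in parts]

def core_rule(T, goal, h, constants=frozenset()):
    """One application of the core rule (a sketch). T = (conclusion,
    hypothesis). Returns the new goal, h(goal) minus a piece plus
    h(hypothesis), or None when no piece is contained in h(conclusion)."""
    conclusion, hypothesis = T
    existential = ({x for t in conclusion for x in t.values()}
                   - {x for t in hypothesis for x in t.values()}
                   - constants)
    def apply_(R):
        return [{a: h.get(x, x) for a, x in t.items()} for t in R]
    hV, hC, hH = apply_(goal), apply_(conclusion), apply_(hypothesis)
    for P in _pieces(hV, existential):   # h is the identity on existential
        if all(t in hC for t in P):
            rest = [t for t in hV if t not in P]
            return rest + [t for t in hH if t not in rest]
    return None
```

On a toy TGD "(u, w) implies there exists e with (u, e)" and the goal {(g1, g2)}, the valuation g1 ↦ u, g2 ↦ e erases the whole goal piece and replaces it by the hypothesis tuple (u, w).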

Example 6   Consider the TGD T of example 3, and the TGD Q of example 4, here called Qn, as shown below. Let V = ω′(J′); therefore Q = <V, ∅>. We can check that (VAL(V) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = ∅. Let h be a valuation on I ∪ I′ ∪ V such that, for all a ∈ VAL(I)−VAL(I′), h(a) = a and, for all a ∈ VAL(I′)−VAL(I), h(a) = a; in particular h(a2) = a2. h is also defined by: h(a4) = a2, h(b1) = b′, h(b2) = b, h(c1) = c′, h(c2) = c, and h(d5) = d2.
T = <I′, I> :
  A  B  C  D
  a2 b1 c2 d2
  a2 b2 c1 d2
  a1 b1 c1 d2
  a1 b2 c2 d1

Qn = <V, ∅> :
  A  B  C  D
  a  b′ c′ d5
  a4 b′ c  d5
  a4 b  c′ d5

T′ = <h(I′), h(I)> :
  A  B  C  D
  a2 b′ c  d2
  a2 b  c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

Qn′ = <h(V), ∅> :
  A  B  C  D
  a  b′ c′ d2
  a2 b′ c  d2
  a2 b  c′ d2

There are two pieces of h(V) in relation to (VAL(I′)−VAL(I)) ∖ CONS = {a2}: the first tuple of h(V), and the last two tuples of h(V). The second piece P is such that P ⊆ h(I′). We construct Qn+1 = T(T, Qn). Let V′ = (h(V) ∖ P) ∪ h(I):

Qn+1 = <V′, ∅> :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

The following theorem states that a goal is implied by a set of TGDs if and only if there is a sequence of rule applications such that the sequence starts with the original goal, and gives the empty goal as an end result. The proof is detailed at the end of the paper.

Theorem 2 (Soundness and Completeness)   Dω ⊨ Q if and only if there is a sequence ((T1,Q1), …, (Tn,Qn)) of rule applications such that T1, …, Tn ∈ Dω, Q1 = Q, Qn = <∅, ∅> and, for all i ∈ [1..n−1], Qi+1 = T(Ti, Qi). Note that the last TGD used, Tn, is of the form <u, ∅>.

It is important to notice that whenever Dω ⊨ Q, theorem 2 ensures that there is at least one sequence of rule applications, but it does not give a method to find it. Therefore, to implement the top-down chase, we need to add a search strategy (breadth-first or depth-first, for example). In order to illustrate this need, we detail below three kinds of sequences for the same set of TGDs Dω and the same goal Q. The first one stops and ends with an empty goal, thus proving the implication. The second one does not terminate (although in this particular case the loop could be detected and stopped), and the third one is stuck (no rule can be applied any more) and thus needs backtracking whenever the search algorithm is depth-first.
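The search strategy itself can be layered on top of the rule, for instance as a depth-bounded depth-first search (a sketch; `successors` is assumed to enumerate the goals reachable from a goal by one core-rule application, over all TGDs and valuations):

```python
def top_down_chase(goal, successors, depth_limit=10):
    """Depth-first, depth-bounded search for the empty goal. The bound
    forces termination at the price of completeness, since the
    implication problem is only semi-decidable."""
    if not goal:
        return True          # empty goal: the implication is proven
    if depth_limit == 0:
        return False         # give up on this branch
    return any(top_down_chase(g, successors, depth_limit - 1)
               for g in successors(goal))
```

A breadth-first or iterative-deepening variant avoids committing forever to a single non-terminating branch, at a memory cost.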

In the following examples, Dω contains the TGD T of example 6, as well as another TGD S detailed below. Remember also that theorem 1 showed that, in order to prove that a dependency d is implied by a set of dependencies D, we construct a goal to be proven (i.e. without hypothesis), and we add some TGDs (also without hypothesis) to D, giving Dω. Therefore, in the present case, Dω contains T and S as well as the two hypothesis-free TGDs U1 and U2 of example 4. Finally, the goal to be proven is Q, already presented in example 4. Dω is detailed below:

U1 :
  A  B  C  D
  a  b  c  d

U2 :
  A  B  C  D
  a  b′ c′ d′

T :
  A  B  C  D
  a2 b1 c2 d2
  a2 b2 c1 d2
  a1 b1 c1 d2
  a1 b2 c2 d1

S :
  A  B  C  D
  a1 b2 c1 d2
  a1 b1 c2 d3
  a1 b1 c1 d1
  a1 b2 c2 d1

And Q is :

Q :
  A  B  C  D
  a  b′ c′ d5
  a4 b′ c  d5
  a4 b  c′ d5
Example 7   The following example illustrates a successful proof sequence of rule applications, ending with the empty goal. Consider the first rule application of example 6, giving a new goal Q1=T(T,Q) to be proven :
Q1 :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

Consider the TGD U1. Let U1 = <I′, I> and Q1 = <V1, ∅>. We can check that (VAL(V1) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = ∅. Let h be a valuation on I ∪ I′ ∪ V1 such that, for all a ∈ VAL(I)−VAL(I′), h(a) = a and, for all a ∈ VAL(I′)−VAL(I), h(a) = a (trivial). h is also defined by: h(a1) = a and h(d1) = d. Therefore we have:

Q1′ = <h(V1), ∅> :
  A  B  C  D
  a  b′ c′ d2
  a  b′ c′ d2
  a  b  c  d

As there is no element in (VAL(I′)−VAL(I)) ∖ CONS, each tuple of h(V1) is itself a piece. The third tuple P is such that P ⊆ h(I′). We construct Q2 = T(U1, Q1). Let Q2 = <(h(V1) ∖ P) ∪ h(I), ∅>:

Q2 :
  A  B  C  D
  a  b′ c′ d2
  a  b′ c′ d2

Now consider the TGD U2. Let U2 = <I′, I> and Q2 = <V2, ∅>. We can check that (VAL(V2) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = ∅. Let h be a valuation on I ∪ I′ ∪ V2 such that, for all a ∈ VAL(I)−VAL(I′), h(a) = a and, for all a ∈ VAL(I′)−VAL(I), h(a) = a (trivial). h is also defined by: h(d2) = d′. Therefore we have:

Q2′ = <h(V2), ∅> :
  A  B  C  D
  a  b′ c′ d′
  a  b′ c′ d′

As there is no element in (VAL(I′)−VAL(I)) ∖ CONS, each tuple of h(V2) is itself a piece. The first tuple P is such that P ⊆ h(I′). We construct Q3 = T(U2, Q2). Let Q3 = <(h(V2) ∖ P) ∪ h(I), ∅>:

Q3 :
  A  B  C  D
  a  b′ c′ d′

Following the same schema, we apply the very same rule to Q3, giving Q4 = <(h(V3) ∖ P) ∪ h(I), ∅>:

Q4 :
  A  B  C  D
  (no tuples: Q4 is the empty goal)

Therefore, we have proven that Dω ⊨ Q, thus D ⊨ d.

Example 8   The following example illustrates a non-terminating sequence of rule applications. Consider the first rule application of example 6, giving the new goal Q1=T(T,Q) to be proven, as well as the TGD T :
Q1 :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

T :
  A  B  C  D
  a2 b1 c2 d2
  a2 b2 c1 d2
  a1 b1 c1 d2
  a1 b2 c2 d1

Let T = <I′, I> and Q1 = <V1, ∅>. We can check that (VAL(V1) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = {a1, d1, d2} ≠ ∅. Therefore, we must rename these symbols either in T or in Q1. Let Q1 now be:

Q1 :
  A  B  C  D
  a  b′ c′ d4
  a3 b′ c′ d4
  a3 b  c  d3

Let h be a valuation on I ∪ I′ ∪ V1 such that, for all a ∈ VAL(I)−VAL(I′), h(a) = a and, for all a ∈ VAL(I′)−VAL(I), h(a) = a; therefore h(a1) = a1, h(a2) = a2 and h(d1) = d1. h is also defined by: h(a3) = a2, h(b1) = b′, h(b2) = b, h(c1) = c, h(c2) = c′, h(d3) = d2 and h(d4) = d2. Therefore we have:

T′ = <h(I′), h(I)> :
  A  B  C  D
  a2 b′ c′ d2
  a2 b  c  d2
  a1 b′ c  d2
  a1 b  c′ d1

Q1′ = <h(V1), ∅> :
  A  B  C  D
  a  b′ c′ d2
  a2 b′ c′ d2
  a2 b  c  d2

There are two pieces of h(V1) in relation to (VAL(I′)−VAL(I)) ∖ CONS = {a2}: the first tuple of h(V1), and the last two tuples of h(V1). The second piece P is such that P ⊆ h(I′). We construct Q2 = T(T, Q1). Let V2 = (h(V1) ∖ P) ∪ h(I):

Q2 = <V2, ∅> :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c  d2
  a1 b  c′ d1

By applying the rule with T over Q2 one more time, giving Q3 = T(T, Q2), we obtain:

Q3 :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

Therefore Q3 = Q1 up to symbol renaming, and the sequence enters a loop. Note that in this particular case the loop is simple to detect, and thus to stop; however, this is an implementation-level treatment.
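At the implementation level, such a loop can be detected by testing whether a new goal equals an earlier one up to renaming of its non-constant symbols, e.g. with a brute-force bijection test (a sketch, exponential in the number of symbols; tuples as dicts):

```python
from itertools import permutations

def same_up_to_renaming(G1, G2, constants=frozenset()):
    """True when some bijective renaming of the non-constant symbols of
    G1 turns it into G2 (as for Q1 and Q3 above)."""
    s1 = sorted({v for t in G1 for v in t.values()} - constants)
    s2 = sorted({v for t in G2 for v in t.values()} - constants)
    if len(G1) != len(G2) or len(s1) != len(s2):
        return False
    for perm in permutations(s2):
        ren = dict(zip(s1, perm))
        image = [{a: ren.get(v, v) for a, v in t.items()} for t in G1]
        if all(t in G2 for t in image) and all(t in image for t in G2):
            return True
    return False
```

Storing a table of goals already seen and pruning any renaming-equivalent newcomer is precisely the memoing extension mentioned in section 3.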

Example 9   The following example illustrates a stuck sequence (no rule can be applied any more), which thus needs backtracking whenever the search algorithm is depth-first. Consider again the goal Q1 = T(T, Q) to be proven, as well as the TGD S:
Q1 :
  A  B  C  D
  a  b′ c′ d2
  a1 b′ c′ d2
  a1 b  c  d1

S :
  A  B  C  D
  a1 b2 c1 d2
  a1 b1 c2 d3
  a1 b1 c1 d1
  a1 b2 c2 d1

Let S = <I′, I> and Q1 = <V1, ∅>. We can check that (VAL(V1) ∩ (VAL(I) ∪ VAL(I′))) ∖ CONS = {a1, d1, d2} ≠ ∅. Therefore, we must rename these symbols either in S or in Q1. Let Q1 now be:

Q1 :
  A  B  C  D
  a  b′ c′ d5
  a3 b′ c′ d5
  a3 b  c  d4

Let h be a valuation on I ∪ I′ ∪ V1 such that, for all a ∈ VAL(I)−VAL(I′), h(a) = a and, for all a ∈ VAL(I′)−VAL(I), h(a) = a; therefore h(d1) = d1, h(d2) = d2 and h(d3) = d3. h is also defined by: h(a1) = a, h(a3) = a, h(b1) = b, h(b2) = b′, h(c1) = c′, h(c2) = c, h(d4) = d3 and h(d5) = d2. Therefore we have:

S′ = <h(I′), h(I)> :
  A  B  C  D
  a  b′ c′ d2
  a  b  c  d3
  a  b  c′ d1
  a  b′ c  d1

Q1′ = <h(V1), ∅> :
  A  B  C  D
  a  b′ c′ d2
  a  b′ c′ d2
  a  b  c  d3

There are two pieces P1 and P2 of h(V1) in relation to (VAL(I′)−VAL(I)) ∖ CONS = {d2, d3}: the (identical) first two tuples of h(V1), and the last tuple of h(V1). Both pieces are included in h(I′). Note that we can save one sequence step by removing both pieces at the same time instead of performing two steps with the same valuation h. We construct Q2 = T(S, Q1). Let V2 = (h(V1) ∖ P1 ∖ P2) ∪ h(I):

Q2 = <V2, ∅> :
  A  B  C  D
  a  b  c′ d1
  a  b′ c  d1

One can now verify that no rule amongst T, S, U1 and U2 is applicable to Q2.

3.2  Dealing with EGDs

EGDs add some difficulties because they involve an equality predicate. Nevertheless, it has been shown in [BV84b] that the implication problem for TGDs and EGDs without constants is reducible to the implication problem for TGDs without constants. Thus, our core rule is sufficient.

Let e be the A-EGD e = <(a1,a2), J>. Let w1 be a tuple such that w1[A] = a1 and, for all attributes B ≠ A, w1[B] ∉ J[B]. Let w2 be a tuple such that w2[Ā] = w1[Ā] and w2[A] = a2. We associate with e two TGDs: e1 is <w1, J ∪ {w2}> and e2 is <w2, J ∪ {w1}>. For a given set D of TGDs and EGDs, let D* be the result of replacing each EGD e of D by its two associated TGDs e1 and e2.
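The construction of the two associated TGDs can be sketched as follows (an illustration: tuples are dicts over the attribute universe, a TGD is a pair (conclusion, hypothesis), and the fresh symbols f0, f1, … stand for the values of w1 outside A):

```python
from itertools import count

_fresh = count()

def egd_to_tgds(a1, a2, J, A, attributes):
    """Build the two TGDs associated with the A-EGD <(a1, a2), J>:
    w1 carries a1 on A and fresh symbols elsewhere, w2 agrees with w1
    outside A and carries a2 on A. Returns (e1, e2) with
    e1 = <w1, J u {w2}> and e2 = <w2, J u {w1}>."""
    w1 = {B: (a1 if B == A else f"f{next(_fresh)}") for B in attributes}
    w2 = dict(w1)
    w2[A] = a2
    e1 = ([w1], J + [w2])
    e2 = ([w2], J + [w1])
    return e1, e2
```

Intuitively, e1 says: if the tuples of J are present together with a copy of w1 carrying a2 on A, then the copy carrying a1 must also be present, and symmetrically for e2, which is how the equality a1 = a2 is simulated with tuple generation only.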

Example 10   Consider the EGD E of example 2. We associate with E two TGDs e1 and e2, as shown below.
w1 :

A   B   C   D
a2  b2  c1  d3

w2 :

A   B   C   D
a2  b2  c2  d3
e1=<w1,I∪{w2}> :

A   B   C   D
a2  b2  c1  d3
a1  b1  c1  d2
a1  b1  c2  d1
a2  b2  c2  d3

e2=<w2,I∪{w1}> :

A   B   C   D
a2  b2  c2  d3
a1  b1  c1  d2
a1  b1  c2  d1
a2  b2  c1  d3

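As a hedged sketch of the translation just defined (assuming dict-based tuples and an illustrative fresh-symbol generator, neither of which comes from the original), the two TGDs associated with an A-EGD can be built as follows:

```python
# Sketch of the EGD-to-TGD translation: an A-EGD <(a1, a2), J> yields
# e1 = <w1, J + [w2]> and e2 = <w2, J + [w1]>, where w1 carries a1 on the
# distinguished attribute and fresh symbols elsewhere, and w2 agrees with
# w1 on every attribute except A, where it carries a2. Illustrative code.

def egd_to_tgds(attr, a1, a2, J, fresh):
    """Return the two TGDs (conclusion, hypothesis) associated with the EGD."""
    attrs = set().union(*(t.keys() for t in J))
    # w1: a1 on the distinguished attribute, fresh symbols on the others
    w1 = {b: (a1 if b == attr else fresh(b)) for b in attrs}
    # w2: identical to w1 except on the distinguished attribute
    w2 = dict(w1, **{attr: a2})
    return (w1, J + [w2]), (w2, J + [w1])

# Illustrative input in the spirit of example 2: the EGD equates c1 and c2.
J = [{'A': 'a1', 'B': 'b1', 'C': 'c1', 'D': 'd2'},
     {'A': 'a1', 'B': 'b1', 'C': 'c2', 'D': 'd1'}]
fresh = lambda b: b.lower() + '_new'   # guaranteed not to occur in J
e1, e2 = egd_to_tgds('C', 'c1', 'c2', J, fresh)
```

The conclusions of e1 and e2 then differ only on the distinguished attribute, exactly as w1 and w2 do in example 10.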
The following theorems allow us to ignore the presence of EGDs by reducing the implication of a dependency by a set of TGDs and EGDs to the implication of a TGD by a set of TGDs only:

Theorem 3   ([BV84b]) Let D be a set of TGDs and EGDs without constants and let d be a TGD without constants. Then D ⊨ d if and only if D* ⊨ d.
Theorem 4   ([BV84b]) Let D be a set of TGDs and EGDs without constants. Let e be a non-trivial A-EGD without constants and let e1 and e2 be its two associated TGDs. Then D ⊨ e if and only if D* ⊨ e1 and there is a non-trivial A-EGD in D.

Thus, we can now generalize theorem 1 and give the final following theorem that reduces the implication problem of TGDs and EGDs to the existence of a top-down chase using the rule:

Theorem 5   Let D be a set of TGDs and EGDs without constants. (i) Let d be a TGD without constants; then D ⊨ d if and only if there is a sequence of T-rule applications, using the TGDs of Dω*, starting from the goal Q associated with d and terminating with success. (ii) Let e be a non-trivial A-EGD without constants, with associated TGDs e1 and e2; then D ⊨ e if and only if there is such a sequence for the goal associated with e1 and there is a non-trivial A-EGD in D.

4  Conclusion and remarks

As a conclusion, we discuss several points. First we compare the top-down chase with a backward formal system and with other proof procedures. Then we discuss the contributions of our work. We conclude this paper by lifting some restrictions on the model and pointing out some perspectives.

4.1  Comparison with formal systems

Many formal systems have been studied for data dependencies ([BFH77], [Sci82], [BV84a], [SU82]). In [BV84a], some formal systems for TGDs and EGDs are studied. Two of these systems are backward, but only one, namely the T3 formal system, has some similarities with the top-down chase. We sketch here the main differences. We refer the reader to this paper for more details about formal systems for TGDs and EGDs.

The T3 system deals with TGDs and their unknown values (i.e. non-constant symbols of VAL(J′)−VAL(J)) in the following way: the process starts with a goal tuple Q and applies a TGD T=<J′,J> by making J′ and Q coincide, typically using the collapsing, augmentation and projection rules. When J′ and Q coincide, it uses the transitivity rule to derive a new goal. A derivation tree is therefore not linear, unlike in SLD-resolution.

The top-down chase leads to a linear inference, in the sense that it directly uses TGDs from the base without first applying rules to them. There is only one rule, though each application is obviously a more complex step.

4.2  Discussion

To our knowledge, this is the first time a top-down proof procedure has been used to solve the implication problem for dependencies. As such, a first contribution is to have shown that this can be performed top-down.

The top-down chase avoids rewriting the dependencies set in clausal form, and thus avoids the corresponding Skolemization process, which is necessary in order to use classical top-down theorem-proving procedures. In the top-down chase, only the preliminary transformation of the TGD to be proven requires creating as many new constants as there are universally quantified variables appearing in it; this step is thus insignificant. By providing a way to infer directly on the original form of dependencies, through the notion of piece, the top-down chase is a conceptually simpler approach; this is our second contribution.

Compared to the classical chase, it does not entail the dynamic creation of new constants. This is our third contribution.

Note that the top-down chase is not simply the usual chase reversed. The core rule is totally different from those used in the bottom-up approach of [BV84b]. Indeed, it has been shown that there is a strong relationship between the bottom-up chase and resolution with paramodulation [BC90][NR01]. Whereas the latter act on the Skolemized set of clauses, the top-down chase does not need this prior step.

We are currently investigating in detail the efficiency of the top-down chase. This is actually not trivial to assess, because of the various parameters that come into play. We agree with [BMSU86] that "it is unreasonable to expect proofs of optimality, or even superiority of one algorithm over another, in all cases".

However, we shall detail some of our arguments.

First of all, there is no need for Skolemizing the dependencies set, nor for dynamic constant creation. This might increase efficiency. Indeed, unlike in the bottom-up chase, existentially quantified variables in TGD conclusions are never transformed into constants. The only new symbols created are the temporary variables of the current goal. The goal can grow by the addition of the hypotheses of the dependencies used by the T-rule. Notice that these symbols are, on the other hand, deleted whenever they are mapped to existing constants, or whenever some pieces are removed from the goal. Depending on the search strategy, they are also deleted from memory whenever the sequence of rule applications gets stuck or succeeds (requiring, for example, backtracking when doing depth-first search).

On the other hand, top-down approaches can take exponentially longer than bottom-up ones. However, the notion of piece performs some dynamic optimizations by dealing with groups of atoms instead of one at a time: failures are detected earlier than with classical top-down procedures, and irrelevant goals are cut down earlier. This feature dramatically reduces the number of backtracks. As an illustration, we have compared our method with Prolog over 5000 proofs generated at random, varying each time the size and the content of the dependencies set, and the size of the dependencies. The top-down chase has been implemented on top of the CoGITo platform [GS98], which is a set of tools for conceptual graph management. These tests showed that the top-down chase provides a logarithmic average improvement in the number of backtracks as well as in proof duration (when the process stops), in spite of a non-optimized implementation. This drastically reduces the drawback of the top-down approach, and gives the top-down chase a fair practical level of efficiency. The fact that the search for the right symbol mappings has exponential complexity does not seem to be a limiting factor in practice. Indeed, whereas for SLD-resolution some irrelevant goals may lead to exploring a whole branch of the resolution tree, the top-down chase detects failures earlier, saving time on average.

However, either approach could run faster in practice on given data [BMSU86]. That is why we think that presenting an example in which the top-down chase is more efficient than the bottom-up chase would not be significant and would not provide any worthwhile addition to the discussion.

For all these reasons, we plan to implement the chase of [BV84b] in order to compare their efficiency in practice over a random dataset. We think dependencies might be divided into classes for which one or the other approach is better, but identifying them is still an open problem.

We must mention that some optimizations to the bottom-up approach have been made with magic sets in [MS96]. The principle of magic sets ([BMSU86]) is to perform at compile time some optimizations that are usually performed at run time, by rewriting the set of dependencies before inference. This avoids generating irrelevant facts during the process, which is the essence of the top-down approach. We shall take these optimizations into account in the implementation.

4.3  Extension of the model

As already stated, we assumed some restrictions on the model that can easily be lifted. The reduction of EGDs to TGDs also works in this unrestricted model, as stated in [BV84b]. Thus all the results can be extended, as they are in [CS98] for piece resolution.

There has been renewed interest in data dependency theory with the introduction of constrained dependencies and ordered dependencies. These types of dependencies can express a wide variety of constraints on the data ([BCW95]), besides generalizing most of the temporal dependencies of the taxonomy presented in [JS92]. The chase procedure has been redesigned, still in a bottom-up way, to deal with constrained tuple-generating dependencies ([MS96]). Our procedure can serve as a basis for the design of a top-down chase for constrained tuple-generating dependencies.

References

[AHV95]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Reading, Mass., 1995.
[BC90]
J. Biskup and B. Convent, Relational Chase Procedures Interpreted as Resolution with Paramodulation. Fundamenta Informaticae, 15(8):123–138, 1991.
[BCW95]
Marianne Baudinet, Jan Chomicki, and Pierre Wolper. Constraint-generating dependencies. ICDT’95, Proceedings of the 5th International Conference on Database Theory, Prague, Czech Republic, p 322–337, 1995.
[BDR98]
J. Bailey, G. Dong, and K. Ramamohanarao. Decidability and undecidability results for the termination problem of active database rules. In PODS ’98. Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, Washington, p 264–273, 1998.
[BFH77]
Catriel Beeri, Ronald Fagin, and John H. Howard. A complete axiomatization for functional and multivalued dependencies in database relations. SIGMOD'77, Proceedings of the International Conference on Management of Data, Toronto, Canada, p 47–61, 1977.
[BMSU86]
François Bancilhon, David Maier, Yehoshua Sagiv, and Jeffrey D. Ullman. Magic sets and other strange ways to implement logic programs. PODS’86, Proceedings of the Fifth Symposium on Principles of Database Systems, Cambridge, Massachusetts, p 1–16, 1986.
[BV81]
Catriel Beeri and Moshe Y. Vardi. The implication problem for data dependencies. Proceedings of the 8th Colloqium on Automata, Languages and Programming, Acre (Akko), Israel, p 73–85, 1981.
[BV84a]
C. Beeri and M. Y. Vardi. Formal systems for tuple and equality generating dependencies. SIAM Journal on Computing, 13(1):76–98, 1984.
[BV84b]
Catriel Beeri and Moshe Y. Vardi. A proof procedure for data dependencies. Journal of the ACM, 31(4):718–741, 1984.
[Cod72]
E. F. Codd. Further Normalization of the Database Relational Model. R. Rustin, Ed. Prentice-Hall, Englewood Cliffs, NJ, 1972.
[CS98]
Stéphane Coulondre and Eric Salvat. Piece resolution: Towards larger perspectives. ICCS’98, Proceedings of the Sixth International Conference on Conceptual Structures, Montpellier, France, p 179–193, 1998.
[FV86]
R. Fagin and M. Y. Vardi. The theory of data dependencies: a survey. Mathematics of Information Processing, Proceedings of Symposia in Applied Mathematics, 34:19–72, 1986.
[GL82]
Yuri Gurevich and Harry R. Lewis. The inference problem for template dependencies. Information and Control, 55(1–3):69–79, 1982.
[GS98]
David Genest and Eric Salvat. A platform allowing typed nested graphs: How cogito became cogitant. ICCS’98, Proceedings of the Sixth International Conference on Conceptual Structures, Montpellier, France, p 154–161, 1998.
[Her95]
Christian Herrmann. On the undecidability of implications between embedded multivalued database dependencies. Information and Computation, 122(2):221–235, 1995.
[JS92]
C. S. Jensen and R. T. Snodgrass. Temporal specialization. ICDE’92, Proceedings of the International Conference on Data Engineering, Tempe, Arizona, p 594–603, 1992.
[LL97a]
Mark Levene and George Loizou. The additivity problem for functional dependencies in incomplete relations. Acta Informatica, 34(2):135–149, 1997.
[LL97b]
Mark Levene and George Loizou. Null inclusion dependencies in relational databases. Information and Computation, 136(2):67–108, 1997.
[LL98]
M. Levene and G. Loizou. The additivity problem for data dependencies in incomplete relational databases. Acta Informatica, 34(2):135–149, 1997.
[LL99]
Mark Levene and George Loizou. A Guided Tour of Relational Databases and Beyond. Springer, London, 1999.
[Mah94]
Michael Maher. Constrained dependencies. ILPS’94, Proceedings of the Workshop on Constraints and Databases, Ithaca, New York, 1994.
[Mah97]
M. J. Maher. Constrained dependencies. Theoretical Computer Science, 173(1):113–149, 1997.
[MS96]
M. J. Maher and D. Srivastava. Chasing constrained tuple-generating dependencies. PODS ’96, Proceedings of the Fifteenth Symposium on Principles of Database Systems, Montréal, Canada, p 127–138, 1996.
[Ng99]
Wilfred Ng. Ordered functional dependencies in relational databases. Information Systems, 24(7):535–554, 1999.
[NR01]
R. Nieuwenhuis and A. Rubio, Paramodulation-Based Theorem Proving. Handbook of Automated Reasoning, Elsevier Science, Chapter 7, 371–443, 2001.
[Sci82]
Edward Sciore. A complete axiomatization of full join dependencies. Journal of the ACM, 29(2):373–393, April 1982.
[SF00]
Iztok Savnik and Peter A. Flach. Discovery of multivalued dependencies from relations. Technical Report report00135, Albert-Ludwigs-Universitaet Freiburg, Institut fuer Informatik, 2000.
[SM96]
Eric Salvat and Marie-Laure Mugnier. Sound and complete forward and backward chainings of graph rules. ICCS’96, Proceedings of the Fourth International Conference on Conceptual Structures, Sydney, Australia, p 248–262, 1996.
[Sow84]
J. F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, Mass., 1984.
[SU82]
Fereidoon Sadri and Jeffrey D. Ullman. Template dependencies: A large class of dependencies in relational databases and its complete axiomatization. Journal of the ACM, 29(2):363–372, 1982.
[TS86]
Hisao Tamaki and Taisuke Sato. OLD resolution with tabulation. Proceedings of the Third International Conference on Logic Programming, London, p 84–98, 1986.
[Ull88]
Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems. Volume I: Classical Database Systems. Computer Science Press, 1988.
[Var84]
Moshe Y. Vardi. The implication and finite implication problems for typed template dependencies. Journal of Computer and System Sciences, 28(1):3–28, 1984.
[Vie86]
L. Vielle. Recursive axioms in deductive databases: The query-subquery approach. Proceedings of the 1st Conference on Expert Database Systems, Charleston, South Carolina , p 253–267, 1986.
[WC95]
J. Widom and S. Ceri. Active Database Systems. Morgan-Kaufmann, San Mateo, Calif., 1995.

Appendix

We now prove theorem 1 :

Proof of theorem 1

Proof 1   (If). Let K ∈ SAT(D) be a relation. Let k be a valuation such that k(J) ⊆ K. So, by construction, K ∈ SAT({<u,∅> | u ∈ k(J)}). It follows that K ∈ SAT(Dk). As D does not contain constants, the only difference between Dk and Dω lies in the constants respectively introduced by the valuations k and ω. Therefore, we can construct a relation Kω such that Kω ∈ SAT(Dω) and K is obtained from Kω by renaming the constants of ω(J) in order to match those of k(J). Let k′ be an extension of k on J′ such that k′ is the identity on VAL(J′)−VAL(J). By construction, as there is a valuation h on ω′(J′) such that h(ω′(J′)) ⊆ Kω, there is a valuation h′ on k′(J′), identical to h on VAL(J′)−VAL(J), such that h′(k′(J′)) ⊆ K. Hence K ∈ SAT(d), and D ⊨ d.

(Only If). Let K ∈ SAT(Dω) be a relation. (i) There is a valuation k on ω(J) such that k(ω(J)) ⊆ K. As ω(J) contains only constants, then ω(J) ⊆ K. (ii) As D ⊆ Dω, then Dω ⊨ D and K ∈ SAT(D). As D ⊨ d, then K ∈ SAT(d). As ω(J) ⊆ K (cf. i) and K ∈ SAT(d), there is an extension k′ of k on ω′(J′) such that k′(ω′(J′)) ⊆ K. Hence K ∈ SAT(Q), and Dω ⊨ Q.

We now prove theorem 2. We shall first introduce several lemmas.

Lemma 1   Let Q=<I,∅ > and Q′ =<H,∅ > be two goals and T=<J′,J> be a TGD. If Q′=T(T,Q) then {Q′,T}⊨ Q .
Proof 2   Let K ∈ SAT({Q′,T}); then K ∈ SAT(Q′) and K ∈ SAT(T). (i) By construction, there is a valuation h such that Q′=<(h(I)\P) ∪ h(J),∅>, so there is a valuation k such that k((h(I)\P) ∪ h(J)) ⊆ K. We also have k(h(I)\P) ⊆ K, so K ∈ SAT(<h(I)\P,∅>). (ii) Trivially, we have K ∈ SAT(<h(J),h(J)>), and also K ∈ SAT(<h(J),(h(I)\P) ∪ h(J)>). Moreover, we have K ∈ SAT(T), thus K ∈ SAT(<J′,J>); we also have K ∈ SAT(<h(J′),h(J)>) and, as P ⊆ h(J′), K ∈ SAT(<P,h(J)>). Therefore K ∈ SAT(<P,(h(I)\P) ∪ h(J)>). (iii) Suppose K ∉ SAT(Q). Then there is no valuation k such that k(I) ⊆ K. Therefore, for every valuation k such that k(h(I)\P) ⊆ K, we have k(P) ⊈ K (such a k exists because K ∈ SAT(<h(I)\P,∅>), cf. i). As there is a valuation k such that k((h(I)\P) ∪ h(J)) ⊆ K (cf. i), then K ∉ SAT(<P,(h(I)\P) ∪ h(J)>), which leads to a contradiction (cf. ii). So K ∈ SAT(Q) and {Q′,T} ⊨ Q.
Lemma 2   Let Q and Q′ be two goals and T be a TGD. If {Q′,T} ⊨ Q but Q′ ⊭ Q, then there is a rule application B=T(T,Q) such that Q′ ⊨ B.
Proof 3   For this proof, we switch to the first-order logic framework. Φ will denote, for a given TGD, the equivalent formula in FOL. {Q′,T} ⊨ Q, therefore the set {Φ(Q′),Φ(T),¬Φ(Q)} is inconsistent. SLD-resolution allows us to derive the empty clause. We are then sure that there is a linear refutation, starting from the negative clause corresponding to ¬Φ(Q), provided that Φ(Q′) and Φ(T) are in clausal form with exactly one positive literal.

To do that, we must rewrite Φ(Q′), Φ(T) and ¬Φ(Q) in clausal form, in the following way:

Φ(Q′) is of the form Φ(Q′)=∃x1... ∃xh(A1 ∧ ... ∧ Aj). We need to introduce Skolem constants, denoted qi, i ∈ [1..h], each respectively replacing the variable xi. We construct j clauses1 of the form Qi=Ai[q1,...,qh], i ∈ [1..j].

Φ(T) is of the form Φ(T)=∃y1... ∃yk(C1 ∧ ... ∧ Cl) ← H1 ∧ ... ∧ Hn, universally closed by the variables x1,...,xp. We need to introduce Skolem functions, denoted fi(x1,...,xp), i ∈ [1..k], each respectively replacing the variable yi. We construct l clauses of the form Ti=(Ci ∨ ¬H1 ∨ ... ∨ ¬Hn)[x1,...,xp, f1(x1,...,xp),...,fk(x1,...,xp)] with i ∈ [1..l].

Φ(Q) is of the form Φ(Q)=∃x1... ∃xs(Q1 ∧ ... ∧ Qt). The negation of Φ(Q) is of the form ¬Q1 ∨ ... ∨ ¬Qt, universally closed by the variables x1,...,xs. We construct a clause of the form NQ=(¬Q1 ∨ ... ∨ ¬Qt)[x1,...,xs].
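For concreteness, the clausal transformation of Φ(T) can be sketched as follows (an illustrative Python rendering, not part of the formal development; atoms are encoded as (predicate, terms) pairs and all names are ours):

```python
# Sketch of the clausal transformation of a TGD formula: each conclusion
# atom Ci yields one clause Ci ∨ ¬H1 ∨ ... ∨ ¬Hn, with every existentially
# quantified variable yi replaced by a Skolem term fi(x1,...,xp) built over
# the universally quantified variables. Encoding and names are illustrative.

def tgd_to_clauses(conclusion, hypothesis, exist_vars, univ_vars):
    """Return one clause per conclusion atom, as (positive head, negated body)."""
    # yi -> fi(x1,...,xp): the Skolem term records the universal variables
    skolem = {y: (f'f{i + 1}', tuple(univ_vars))
              for i, y in enumerate(exist_vars)}
    clauses = []
    for pred, terms in conclusion:
        head = (pred, tuple(skolem.get(t, t) for t in terms))
        body = [(p, tuple(ts)) for p, ts in hypothesis]  # the negated Hi atoms
        clauses.append((head, body))
    return clauses

# r(x, y1) <- r(x, z) ∧ r(z, x): one conclusion atom, one existential variable
clauses = tgd_to_clauses(conclusion=[('r', ('x', 'y1'))],
                         hypothesis=[('r', ('x', 'z')), ('r', ('z', 'x'))],
                         exist_vars=['y1'],
                         univ_vars=['x', 'z'])
# clauses[0][0] is ('r', ('x', ('f1', ('x', 'z')))): y1 became f1(x, z)
```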

The linear refutation starting with NQ exists. As NQ is composed only of negative literals, NQ can only be resolved with clauses containing positive literals, i.e. the clauses Qa, a ∈ [1..j], and Tb, b ∈ [1..l].

Let us suppose that the refutation uses only the clauses Qa, a ∈ [1..j]. Then the clauses Tb, b ∈ [1..l], are unnecessary, thus {Φ(Q′),¬Φ(Q)} is inconsistent, and thus Φ(Q′) ⊨ Φ(Q), which contradicts the hypothesis. Therefore the refutation does not use only the clauses Qa, a ∈ [1..j].

Remember that Ti=(Ci ∨ ¬H1 ∨ ... ∨ ¬Hn)[x1,...,xp, f1(x1,...,xp),...,fk(x1,...,xp)] with i ∈ [1..l]. If the refutation does not use only the clauses Qa, a ∈ [1..j], then it uses at least one of the atoms Ca, a ∈ [1..l], of the clauses Ti, i ∈ [1..l] (which are the only other positive literals). This step gives a resolvent containing the literals ¬H1 ∨ ... ∨ ¬Hn. The f1(x1,...,xp),...,fk(x1,...,xp) are Skolem functions. They correspond to existentially quantified variables of Φ(T); therefore there is a substitution between a Qb, b ∈ [1..t], and a Ca, a ∈ [1..l], which does not map from the equivalent existentially quantified variables (first condition of the T-rule verified) nor from the variables of the Hi, i ∈ [1..n], that are not in Ca. Therefore, within the tableaux framework, there is an equivalent valuation h that satisfies the first condition of the T-rule (concerning symbols of the conclusion of T not in the hypothesis, i.e. existentially quantified variables), and the second condition of the T-rule (concerning symbols of the hypothesis of T not in the conclusion of T, i.e. universally quantified variables appearing only in the hypothesis). Indeed, as these variables do not come into play in the substitution, we force the equivalent symbols (in the tableaux valuation) to be identical. This has no consequences, because we have supposed that no symbols have the same name in T and Q.

This resolution step has instantiated the variables of NQ. Therefore, literals of NQ containing one of these variables as a term also have them instantiated. Thus, as there are no function symbols nor variables in the Qa, a ∈ [1..j], these literals are unified with the Ca[x1,...,xp, f1(x1,...,xp),...,fk(x1,...,xp)], a ∈ [1..l]. We are therefore sure that there is at least one piece of the conclusion of h(Q) appearing in the conclusion of h(T). Thus the third condition of the T-rule is satisfied.

Let us focus on the literals added to the resolvent in the resolution step: ¬H1 ∨ ... ∨ ¬Hn. There are two possibilities: either they are unified with the Ca[x1,...,xp, f1(x1,...,xp),...,fk(x1,...,xp)], a ∈ [1..l], or with the Aa, a ∈ [1..j]. Suppose that, at the moment they come into play in the refutation, they are unified with the Ca[x1,...,xp, f1(x1,...,xp),...,fk(x1,...,xp)], a ∈ [1..l]; then the resolvent would again contain the same atoms, because the negative atoms of the clauses Ti, i ∈ [1..l], are the same in each clause Ti. They are thus necessarily unified with the Aa, a ∈ [1..j], for the refutation to terminate (which is the case). Thus we see that the negative literals of the clauses Ti, i ∈ [1..l], can only be unified with the Aa, a ∈ [1..j]. Therefore, the negation of the new goal has a refutation which does not need the clauses Tb, b ∈ [1..l], and thus {Φ(Q′),¬Φ(B)} is inconsistent, and Q′ ⊨ B.

Lemma 3   Let Q be a goal and Γ={T1,...,Tn} be a set of TGDs. If Γ ⊨ Q, then there is a finite sequence of indices i1,...,ip, with ij ∈ [1..n], such that {Ti1,...,Tip} ⊨ Q and, for all j ∈ [1..p], there is a goal Bj such that {Bj−1,Tij} ⊨ Bj, with B0=<∅,∅> and Bp=Q.
Proof 4   By induction on p. At step 1, Ti1 ⊨ Q, thus {<∅,∅>,Ti1} ⊨ Q. By lemma 2, there is a rule application between Q and Ti1, giving a goal B1 such that <∅,∅> ⊨ B1. Let us suppose the induction true until step p−1: there is a finite sequence of indices i1,...,ip−1, with ij ∈ [1..n], such that {Ti1,...,Tip−1} ⊨ Q and, for all j ∈ [1..p−1], there is a goal Bj such that {Bj−1,Tij} ⊨ Bj, with B0=<∅,∅> and Bp−1=Q. At step p, we must show that {Bp−1,Tip} ⊨ Q and that Γ ⊨ Bp−1. Let Tip be the TGD corresponding to the first clause interfering with Q in the SLD-resolution process described in the previous lemma. Let us show that the resolutions with Tip can be performed at the beginning of the SLD-resolution, and that Γ ⊨ Bp−1. To do that, let us suppose that we cannot group the resolutions with the clauses coming from Tip. Then there is necessarily a resolution with a clause coming from Tip that needs a previous resolution with a clause not coming from Tip. If this is the case, then (i) either there is a literal of NQ that cannot be unified with an atom of a clause coming from Tip, but will be later (if it is not possible later, then we do not need this clause), (ii) or this literal will appear later in a resolvent, and will be resolved with a clause coming from a Tij, j ≤ p−2. If a literal of NQ cannot be unified with an atom of the clause coming from Tip, it will not be possible later either, because the resolution process does not create other opportunities.

We now have a new clause resulting from a sequence of resolutions with the clauses coming from Tip; this clause has a linear refutation. Let us show that it corresponds to a goal. To do that, let us perform the inverse transformation (i.e. from clausal form back to FOL and thus to tableaux). This clause contains only literals from the original goal and literals coming from the hypothesis of Tip, all negative. Moreover, it does not contain function symbols. Therefore we can perform the inverse transformation and obtain a goal Bp−1, and thus Γ ⊨ Bp−1.

As Γ ⊨ Bp−1, by induction, there is a finite sequence of indices i1,...,ip−1, with ij ∈ [1..n], such that {Ti1,...,Tip−1} ⊨ Bp−1 and, for all j ∈ [1..p−1], there is a goal Bj such that {Bj−1,Tij} ⊨ Bj, with B0=<∅,∅>. As we showed that {Bp−1,Tip} ⊨ Q, there is a finite sequence of indices i1,...,ip, with ij ∈ [1..n], such that {Ti1,...,Tip} ⊨ Q and, for all j ∈ [1..p], there is a goal Bj such that {Bj−1,Tij} ⊨ Bj, with B0=<∅,∅> and Bp=Q.

Lemma 4   Let Q and Q′ be two goals such that Q′ ⊨ Q, and let Γ={T1,...,Tn} be a set of TGDs. If there is a sequence of rule applications starting from Q′ and terminating with success, then there is a sequence of rule applications starting from Q and terminating with success.
Proof 5   Let us show that if there is a rule application between Q′ and a TGD T giving a new goal B′, then there is also a rule application between Q and T giving a new goal B such that B′ ⊨ B. First, Q′ ⊨ Q, thus the set {Φ(Q′),¬Φ(Q)} is inconsistent.

SLD-resolution allows us to produce the empty clause from the set of clauses corresponding to {Φ(Q′),¬Φ(Q)}. There is therefore a linear refutation starting from the negative clause corresponding to ¬Φ(Q), provided that Φ(Q′) is in clausal form with exactly one positive literal. Let us perform the same transformation as in lemma 2. The linear refutation starting from NQ exists. As NQ is composed only of negative literals, NQ can be resolved only with clauses having positive literals, i.e. the clauses Qa, a ∈ [1..j]. Each literal of NQ is thus unifiable with one of the Aa[q1,...,qh], a ∈ [1..j]. All the atoms of NQ thus appear in the Aa[q1,...,qh], a ∈ [1..j], and can only differ by a variable of NQ corresponding to a term in the Aa[q1,...,qh], a ∈ [1..j]. Therefore the tuples of Q are a subset of the tuples of Q′, and can only differ by extra symbols in Q′.

If there is a rule application between Q′ and a TGD T, then there is a valuation h′ and (at least) one piece of the conclusion of h′(Q′) appearing entirely in the conclusion of h′(T). As tuples of Q′ can only have extra symbols, and as tuples of Q all appear in Q′, there is also a valuation h and (at least) one piece of the conclusion of h(Q) appearing entirely in the conclusion of h(T). Let us construct the new goal B by removing from the conclusion of h(Q) only the pieces corresponding to those removed from the conclusion of h′(Q′) when B′ was constructed. Then the new goal B contains some tuples of h(Q) that differ, by construction, from those of B′ by potentially extra symbols, and some tuples of h(T) that also differ from those of B′ by potentially extra symbols. By performing the transformation of ¬Φ(B) and Φ(B′) into clausal form, it is easy to show that each literal of the clause coming from ¬Φ(B) is unifiable with an atom of a clause coming from Φ(B′), and thus that the set {Φ(B′),¬Φ(B)} is inconsistent, and B′ ⊨ B. By recurrence on the number of rule applications starting from Q′, we conclude that there is also a sequence of rule applications starting from Q and terminating with success.

Proof of theorem 2

Proof 6  

(If). By induction on the number of T-rule applications. Trivially, as Qn=<∅,∅>, then Dω ⊨ Qn. Assume the induction is true for Qi, i ∈ {2,…,n}; thus Dω ⊨ Q2. Let us prove that Dω ⊨ Q1. By lemma 1, as Q2=T(T1,Q1), then {Q2,T1} ⊨ Q1, and also {Q2,Dω} ⊨ Q1. As Dω ⊨ Q2, it follows that Dω ⊨ Q1.

(Only If). By lemma 3, there is a sequence of indices i1,...,ip, with ij ∈ [1..n], such that {Ti1,...,Tip} ⊨ Q and, for all j ∈ [1..p], there is a goal Bj such that {Bj−1,Tij} ⊨ Bj, with B0=<∅,∅> and Bp=Q. The proof is by induction on p. At step 1, we have {<∅,∅>,Ti1} ⊨ B1; therefore, by lemma 2, there is a rule application between B1 and Ti1, giving a goal B0 such that <∅,∅> ⊨ B0. As <∅,∅> ⊨ <∅,∅>, the resolution terminates with success. Let us suppose this hypothesis true until step p−1, i.e. there is a sequence of rule applications starting from Bp−1 and terminating with success. At step p, we have Bp=Q. By lemma 2, there is a rule application between Q and Tip, giving a goal B such that Bp−1 ⊨ B. According to the induction hypothesis, there is a sequence of rule applications starting from Bp−1 and terminating with success. By lemma 4, there is also a sequence of rule applications starting from B, thus from Q, terminating with success.

Proof of theorem 5

Proof 7   By theorem 3, D ⊨ d if and only if D* ⊨ d. D* is a set of TGDs. Therefore, by theorem 1, D* ⊨ d if and only if Dω* ⊨ Q. It follows that D ⊨ d if and only if Dω* ⊨ Q. By theorem 2, this is equivalent to the existence of a successful sequence of T-rule applications.

By theorem 4, D ⊨ e if and only if D* ⊨ e1 and there is a non-trivial A-EGD in D. D* is a set of TGDs. Therefore, by theorem 1, D* ⊨ e1 if and only if Dω* ⊨ Q. It follows that D ⊨ e if and only if Dω* ⊨ Q and there is a non-trivial A-EGD in D. By theorem 2, this is equivalent to the existence of a successful sequence of T-rule applications.


1
The notation F[t1,...,tk], where F is a formula, means that the terms of F (variables, constants and functions) are t1,...,tk.
