Efficient Semantics-Based Compliance Checking Using LTL Formulae and Unfolding

Business process models are required to be in line with frequently changing regulations, policies, and environments. In the field of intelligent modeling, organisations are concerned with automated business process compliance checking, as manual verification is time-consuming and inefficient. There are two key issues in business process compliance checking. One is the definition of a business process retrieval language that can be employed to capture the compliance rules; the other concerns the efficient evaluation of these rules. Traditional syntax-based retrieval approaches cannot deal with various important requirements of compliance checking in practice. Although a retrieval language that is based on semantics can overcome the drawback of syntax-based ones, it suffers from the well-known state space explosion. In this paper, we define a semantics-based process model query language by simplifying a property specification pattern system without affecting its expressiveness. We use this language to capture semantics-based compliance rules and constraints. We also propose a feasible approach that avoids the state space explosion as much as possible during compliance checking. A tool is implemented to evaluate the efficiency. An experiment conducted on three model collections illustrates that our technology is very efficient.


Introduction
Business process models are valuable intellectual assets capturing the ways organisations conduct their business. Current business process management evolves increasingly fast due to changing environments and emerging technologies. As a result, organisations accumulate huge numbers of business process models, some of which are highly complex. For example, Haier is one of the largest Chinese consumer electronics manufacturers. Over the years, Haier has gathered more than 4,000 process models from various domains, including purchase, financing, distribution, and service. In this context, support for business process management, for example, for the purposes of knowledge discovery and process reuse, faces real challenges. One of these challenges, which matters for maintaining a competitive advantage, concerns business process compliance checking, which makes sure that business processes are in line with frequently changing business environments and legal regulations. This problem has also gradually emerged as an important branch of intelligent modeling. Two key issues must be addressed for automated business process compliance checking. One is a retrieval language that can be employed to capture compliance rules; the other is the efficient evaluation of compliance checking.
In recent years, several query languages have been proposed to retrieve process models in repositories, such as BP-QL [1] and BPMN-Q [2]. In [3], BPMN-Q was also used to capture compliance rules. However, these languages are based on the syntax (structure) of process models rather than on their semantics. In the syntax of a process model, a directed path connecting a task A and a task B does not mean that during execution task A will always occur before task B. Let us consider, for example, the three process models in Figure 2. In these models, rectangles represent tasks, arcs represent sequential dependencies between tasks, and diamonds represent choices (if a diamond has one input arc and multiple output arcs) and merges (if a diamond has multiple input arcs and one output arc). These models represent three variants of a business process for opening an account in the BPMN notation [4]. These three variants could specify the way an account is opened in three different states in which the company conducts its business and could be part of a repository of hundreds, even thousands of process models for all states in which the bank operates. Next, let us take BPMN-Q as an example to illustrate the drawback of syntax-based languages. A rule written in BPMN-Q uses a directed edge connecting two activities to represent that these two activities are executed in order (in just some executions of a process). For example, the BPMN-Q query shown in Figure 2 can specify the compliance rule that task "receive customer request" must be followed by task "analyse customer credit history" in some process executions. But if an analyst needs to retrieve processes where in every process execution task "receive customer request" occurs before task "analyse customer credit history," BPMN-Q cannot capture this kind of requirement. Thus, after executing the query in Figure 2, we would retrieve the first and the third processes, since in both process (a) and process (c) there exists at least one process execution in which, if task "receive customer request" occurs, then task "analyse customer credit history" eventually occurs. However, according to the requirement, process (c) does not belong to the result, because process (c) has an execution in which the requirement that task "receive customer request" precedes task "analyse customer credit history" does not hold (the process execution where task "open VIP account" is run). As a result, the problem with BPMN-Q is that one cannot know whether all of the process executions of a resulting process satisfy the requirement or just some of them do. This issue is very important in the reuse of business processes, automatic modeling, and verification. For example, in the reuse of business processes, people often need to know whether there are process executions that fail to satisfy a requirement, so that they can find the reason and modify these executions. Therefore, in order to yield correct results, we have to explore every process execution of every process in a repository, which is indeed based on semantics.
As we can see from the example, syntax-based retrieval languages are not powerful enough. In fact, retrieval technologies based on semantics are in line with process execution and are therefore more intuitive to ordinary users who are not necessarily experts in business process management (BPM). A semantics-based process model query language should meet two requirements: (1) it can specify various semantic relationships between tasks; (2) it can explicitly specify whether these relationships hold in just some process executions or in every process execution.
In light of the previous discussion, in this paper we aim to address two questions. One is how many semantic relationships between tasks are enough; the other is that the evaluation of semantics-based compliance rules requires exploring every process execution of a process model, which suffers from the well-known state space explosion problem.
In [5], a property specification pattern system (SPS) was proposed for finite-state verification by Dwyer et al. SPS consists of 5 basic patterns and 5 scopes, which results in 5 × 5 = 25 LTL formulae. In this paper, we significantly simplify SPS without affecting its expressiveness through formal logic reasoning. After this simplification, there are only 3 basic LTL formulae, from which the remaining formulae can be deduced. A retrieval language for expressing semantics-based compliance rules is based on this simplified SPS. With respect to the evaluation of the semantics of process models, we propose a feasible technology that can extract every execution of a business process model. In this way, the state space explosion can be avoided as much as possible. We achieve this by adopting the theory of complete finite prefixes (CFP) [6] and its improvements [7]. Moreover, a tool is implemented to evaluate the performance of our technology over three collections of Petri nets. Of the three collections, two are obtained from practice, and the third is a much larger one that was generated artificially.
The remainder of this paper is organized as follows. In Section 2 we simplify SPS to define a process model retrieval language for specifying compliance rules. In Section 3 the basic concepts of Petri nets, unfolding, and CFP are presented. In Section 4, we detail the mechanism of efficient semantics-based compliance checking. Next, in Section 5 we illustrate the tool implementation and report on the performance evaluation over three process model collections. Finally, we discuss related work in Section 6 and conclude the paper in Section 7.

Language
As discussed in Section 1, a language is needed for specifying semantics-based compliance rules. This language should be powerful enough without being too complex. In this section, we simplify the SPS to obtain a core pattern system from which the remaining patterns and scopes of SPS can be derived. Then we present the formal definition of a new query language, namely, "a semantics-based process query language" (ASBPQL), based on this core pattern system.

LTL Formulae.
Linear temporal logic (LTL) is a widely used formalism for specifying properties of concurrent, finite-state systems. In this subsection, we use LTL to reason about the core of SPS.
Definition 1 (linear temporal logic formulae). The formulae of linear temporal logic are built from a finite set of atomic propositions $AP$, the logical operators $\neg$, $\wedge$, and $\vee$, and the temporal modal operators $\bigcirc$ and $U$. Formally, the set of LTL formulae over $AP$ can be inductively defined as follows: (i) both true and false are LTL formulae; (ii) for all $p \in AP$, $p$ and $\neg p$ are LTL formulae; (iii) if $\varphi_1$ and $\varphi_2$ are both LTL formulae, then $\varphi_1 \wedge \varphi_2$, $\varphi_1 \vee \varphi_2$, $\bigcirc\varphi_1$, $\varphi_1 \, U \, \varphi_2$, and $\varphi_1 \, R \, \varphi_2$ are LTL formulae.
The operator $\bigcirc$ is read as "next" and refers to the next state. The operator $U$ is read as "until" and means that its first argument has to hold until its second argument is true, where it is required that the second argument holds eventually. Some of the literature also defines the weak until operator ($W$), which is related to the strong until operator ($U$) through the following equivalences: $\varphi \, W \, \psi \equiv (\varphi \, U \, \psi) \vee \Box\varphi \equiv \varphi \, U \, (\psi \vee \Box\varphi) \equiv \psi \, R \, (\psi \vee \varphi)$ and $\varphi \, U \, \psi \equiv \Diamond\psi \wedge (\varphi \, W \, \psi)$.
The operator $R$ is read as "releases" and is the dual of $U$. In addition, two derived operators are in common use: (i) $\Diamond$ is read as "eventually," $\Diamond\varphi = \mathit{true} \, U \, \varphi$, which requires that its argument be true eventually, that is, at some state in the future; (ii) $\Box$ is read as "always," $\Box\varphi = \mathit{false} \, R \, \varphi$, which requires that its argument be true at all future states.
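To make the operators of Definition 1 concrete, the following minimal sketch evaluates a formula over a finite trace, where each state is the set of atomic propositions that hold in it. This is an illustrative assumption, since LTL is normally interpreted over infinite sequences; the tuple encoding of formulae and the names `holds`, `eventually`, and `always` are hypothetical and not part of the paper.

```python
def holds(f, trace, i=0):
    """Evaluate formula f at position i of a finite trace (a list of sets)."""
    if f is True or f is False:
        return f
    if isinstance(f, str):                      # atomic proposition
        return f in trace[i]
    op = f[0]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "or":
        return holds(f[1], trace, i) or holds(f[2], trace, i)
    if op == "next":                            # assumption: false at the last state
        return i + 1 < len(trace) and holds(f[1], trace, i + 1)
    if op == "until":                           # f[2] must hold eventually, f[1] until then
        for j in range(i, len(trace)):
            if holds(f[2], trace, j):
                return True
            if not holds(f[1], trace, j):
                return False
        return False
    if op == "release":                         # dual of until
        return not holds(("until", ("not", f[1]), ("not", f[2])), trace, i)
    raise ValueError(f"unknown operator: {op}")


def eventually(f):
    return ("until", True, f)                   # eventually f = true U f


def always(f):
    return ("release", False, f)                # always f = false R f


trace = [{"a"}, {"a"}, {"b"}]
print(holds(("until", "a", "b"), trace))        # a holds until b becomes true
print(holds(always("a"), trace))                # a fails in the last state
```

Encoding "always" through "releases" mirrors the derived-operator definitions above, so the sketch needs no separate case for the two derived operators.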

Simplification.
SPS consists of 5 basic patterns (the other three patterns are defined based on them) and 5 scopes, as shown in Figure 3. The intents of the 5 basic patterns are as follows: (i) Absence, a given task never occurs within a scope; (ii) Universality, a given task occurs throughout a scope; (iii) Existence, a given task occurs at least once within a scope; (iv) Precedence, a task $S$ occurs before a task $P$ within a scope; (v) Response, a task $P$ must be followed by a task $S$ within a scope.
The meanings of the five scopes are as follows: (i) Global means the entire extent of a process execution; (ii) Before $R$ means the extent up to an occurrence of the given task $R$ within a process execution; (iii) After $Q$ means the extent after an occurrence of the given task $Q$ within a process execution; (iv) Between $Q$ and $R$ means the part of a process execution from an occurrence of the task $Q$ to an occurrence of the task $R$; (v) After $Q$ until $R$ is similar to the scope Between $Q$ and $R$, except that the designated part of a process execution continues even if the task $R$ does not occur.
As shown in Table 1, for each scope there is an LTL formula corresponding to a pattern, which results in 25 formulae.
Next, we provide proofs that the SPS can be simplified from 5 patterns and 5 scopes to only 3 patterns (Absence, Existence, and Precedence) and 1 scope (After $Q$ until $R$). This significantly reduces the number of formulae from 25 to 3.
First, we take pattern Absence as an example to prove that scope Before $R$ can be derived from scope After $Q$ until $R$. According to the semantics of LTL, if $Q$ is always true, that is, $\Box Q$, scope Before $R$ can be derived from scope After $Q$ until $R$, that is, $\Box Q \wedge \Box((Q \wedge \neg R) \rightarrow (\neg P \, W \, R)) \Rightarrow (\Diamond R \rightarrow (\neg P \, U \, R))$. Now we prove that this proposition holds.
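One way to see why this implication holds is sketched below as a reading aid rather than as the formal step-by-step proof. It assumes that the formulae are evaluated at the initial state, that $P$ denotes the task required to be absent, and it uses only the standard equivalence $\varphi \, U \, \psi \equiv \Diamond\psi \wedge (\varphi \, W \, \psi)$.

```latex
\begin{align*}
&\text{Assume } \Box Q, \quad \Box\bigl((Q \wedge \neg R) \rightarrow (\neg P \; W \; R)\bigr), \quad \text{and } \Diamond R. \\
&\text{Case 1: } R \text{ holds in the current state. Then } \neg P \; U \; R \text{ holds immediately.} \\
&\text{Case 2: } \neg R \text{ holds in the current state. From } \Box Q \text{ we obtain } Q \wedge \neg R, \\
&\qquad \text{hence } \neg P \; W \; R \text{ by the second assumption.} \\
&\qquad \text{Together with } \Diamond R \text{ and } \varphi \; U \; \psi \equiv \Diamond\psi \wedge (\varphi \; W \; \psi)
   \text{ this gives } \neg P \; U \; R. \\
&\text{In both cases } \neg P \; U \; R \text{ holds, so } \Diamond R \rightarrow (\neg P \; U \; R).
\end{align*}
```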
Next, we prove that if $R$ holds eventually, that is, $\Diamond R$, we can derive the formula corresponding to scope Between $Q$ and $R$ from the formula corresponding to scope After $Q$ until $R$. The derivation concludes with step (6): $\Box((Q \wedge \neg R \wedge \Diamond R) \rightarrow (\neg P \, U \, R))$ (by (1), (5)). Now we have proved that the formulae corresponding to three scopes (After $Q$, Before $R$, and Between $Q$ and $R$) can be derived from the formula corresponding to scope After $Q$ until $R$. If $Q$ always holds and $R$ never holds, that is, $\Box Q \wedge \Box(\neg R)$, the formula corresponding to scope Global can be derived from that of scope After $Q$ until $R$. This proof is straightforward; owing to the page limit, we do not present it in this paper.
Finally, we obtain a simplified pattern system that consists of only 3 patterns (Absence, Existence, and Precedence) and one scope (After $Q$ until $R$), as shown in Figure 4. As we can see, this simplified pattern system is far more concise than SPS. As discussed in Section 1, after defining the basic semantic relationships between tasks, we have to determine whether these relationships hold in just some process executions or in every process execution of a business process. Taking all these considerations together, we define 6 basic predicates to capture the occurrence of tasks and the relationships between tasks in some or every process execution. In the following, the first two basic predicates, posoccur and alwoccur, capture the occurrence of a given task in some or every process execution of a process model. These two basic predicates are based on pattern Existence:

Syntax
(1) posoccur($t_1$, $r$): there exist some executions of process model $r$ where at least one instance of $t_1$ occurs; (2) alwoccur($t_1$, $r$): in every execution of process model $r$, at least one instance of $t_1$ occurs.
The next two basic predicates, exclusive and concur, capture the exclusive and concurrent relationships between tasks, respectively. Note that these two basic predicates do not assume that an instance of $t_1$ and $t_2$ should eventually occur: (3) exclusive($t_1$, $t_2$, $r$): $t_1$ and $t_2$ are both executable tasks (i.e., not dead tasks) of process model $r$; in every process execution of $r$, it is never possible that an instance of $t_1$ and an instance of $t_2$ both occur; (4) concur($t_1$, $t_2$, $r$): $t_1$ and $t_2$ are both executable tasks of process model $r$; $t_1$ and $t_2$ are not causally related; and in every execution of $r$, if an instance of $t_1$ occurs, then an instance of $t_2$ occurs and vice versa.
The last two basic predicates, alwpred and pospred, capture the basic relationship Precedence between tasks in every or some process execution of a given process model, respectively: (5) alwpred($t_1$, $t_2$, $r$): in every process execution of process model $r$, an instance of $t_1$ occurs before an instance of $t_2$; (6) pospred($t_1$, $t_2$, $r$): there exist some process executions of process model $r$ where an instance of $t_1$ occurs before an instance of $t_2$. Finally, we define ASBPQL by a BNF grammar. A Query in ASBPQL is a Condition. The result of the Query is the set of process models that satisfy the Condition. A Condition can consist of a task with an occurrence keyword, with the intended semantics of the basic predicate posoccur($t_1$, $r$) or alwoccur($t_1$, $r$); of a relation between tasks, with the intended semantics that all process models satisfying that particular relation between tasks must be retrieved; or it can be recursively defined as a binary or unary Condition through the application of logical operators. Specifically, a disjunction retrieves the union of the process models of the conditions involved, while a conjunction retrieves the intersection. The negation of a condition retrieves the process models that do not satisfy the condition. A task is identified by its label, which is a string. Using ASBPQL, we can capture the semantics-based compliance rules in which we are interested, including the relationship between tasks and the occurrence of tasks in some or every process execution. For example, the rule "A" pospred "B" and "B" alwpred "C" means that we want to search for all process models where in some process execution task A occurs before task B and in every execution task B occurs before task C.
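The difference between the "some execution" and "every execution" predicates can be illustrated with a small sketch. It assumes, purely for illustration, that every process model is given as an explicit list of executions and that each execution is a linear sequence of task labels; the toy repository, the function names, and the `query` helper are hypothetical, and the paper itself evaluates the predicates on unfoldings rather than on enumerated traces.

```python
def posoccur(t, execs):
    return any(t in e for e in execs)           # some execution contains t


def alwoccur(t, execs):
    return all(t in e for e in execs)           # every execution contains t


def _before(t1, t2, e):
    """True if an occurrence of t1 comes before an occurrence of t2 in e."""
    return t1 in e and t2 in e[e.index(t1) + 1:]


def pospred(t1, t2, execs):
    return any(_before(t1, t2, e) for e in execs)


def alwpred(t1, t2, execs):
    return all(_before(t1, t2, e) for e in execs)


# Toy repository (invented data): model name -> list of executions.
repository = {
    "a": [["receive", "analyse", "open"]],
    "b": [["analyse", "receive", "open"]],
    "c": [["receive", "analyse", "open"], ["receive", "open VIP"]],
}


def query(condition):
    """Return the names of all models whose executions satisfy the condition."""
    return {name for name, execs in repository.items() if condition(execs)}


some = query(lambda ex: pospred("receive", "analyse", ex))
every = query(lambda ex: alwpred("receive", "analyse", ex))
both = query(lambda ex: posoccur("open", ex) and alwpred("receive", "analyse", ex))
print(sorted(some), sorted(every), sorted(both))
```

Conjunction, disjunction, and negation of conditions map directly onto Python's `and`, `or`, and `not` inside the condition, which corresponds to intersecting, uniting, and complementing the retrieved sets of models.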

Petri Nets and Unfoldings
In this section, we discuss the basic concepts of Petri nets and unfolding on which we base our work. For more details, readers can refer to [8] for an in-depth introduction to Petri nets and to [6,7,9,10] for unfolding and its related definitions.

Petri Nets.
Petri nets are a formal notation system which can be employed to specify workflow systems (see, e.g., [11,12]). Petri nets are also used as a formal foundation for defining the semantics of other process modeling languages or for reasoning about process models specified in these languages, for example, BPMN [13], BPEL [14,15], and EPCs [16]. A formal definition of Petri nets is presented as follows.
Definition 8 (Petri nets). A Petri net is a tuple $(P, T, F)$, where (i) $P$ is a finite set of places; (ii) $T$ is a finite set of transitions, with $P \cap T = \emptyset$ and $P \cup T \neq \emptyset$; (iii) $F \subseteq (P \times T) \cup (T \times P)$ is a finite set of directed arcs representing the flow relation, connecting transitions and places together.
The conditions that the sets of places and transitions should be finite and that every transition has at least one input place and at least one output place derive from [7]. For notational convenience we adopt a commonly used notation, where ${}^{\bullet}x$ represents all the inputs of a node $x$ (which can be a place or a transition) and $x^{\bullet}$ captures all its outputs.
Next, a labeled Petri net is basically a Petri net with annotated transitions, and the annotation does not affect the semantics of the net.
Definition 9 (labeled Petri nets). A labeled Petri net is a tuple $(P, T, F, \mathcal{T}, L)$, where $(P, T, F)$ is a Petri net, $\mathcal{T}$ is a set of labels, and $L : T \rightarrow \mathcal{T} \cup \{\tau\}$ is a labeling function, where $\tau \notin \mathcal{T}$ is a silent action (i.e., an action not visible to the outside world).
A marking of a Petri net is an assignment of tokens to its places. A marking represents a state of the net, and a transition, if enabled, may change a marking into another marking by firing, thus capturing a state change.
Definition 10 (marking, enabling, and firing of a transition). Let $N = (P, T, F)$ be a Petri net.
(i) A marking $M$ of $N$ is a mapping $M : P \rightarrow \mathbb{N}$. A marking may be represented as a collection of pairs, for example, $\{(p_0, 2), (p_1, 3), (p_2, 0)\}$, or as a vector, for example, $2p_0 + 3p_1$ (in that case we drop places that do not have any tokens assigned to them). A labeled Petri net system is a labeled Petri net with an initial marking, usually represented as $M_0$. (ii) Markings can be compared with each other: $M_1 \geq M_2$ if and only if for all $p \in P$, $M_1(p) \geq M_2(p)$. Similarly, one can define $>$, $<$, $\leq$, and $=$. (iii) A transition $t$ is enabled in a marking $M$, denoted as $M \xrightarrow{t}$, if and only if the following holds: $\forall p \in {}^{\bullet}t$, $M(p) > 0$.
(iv) A transition $t$ that is enabled in a marking $M$ may fire and change marking $M$ into $M'$. This is denoted as $M \xrightarrow{t} M'$.
The markings of a Petri net system and the transition relation between these markings constitute a state space. In this paper we consider $k$-bounded Petri net systems (noting that such systems are always finite), which are necessary for the application of unfoldings. For a Petri net system $\Sigma$ with initial marking $M_0$: (i) a marking $M$ is reachable if a sequence of transition firings $\sigma$ leads from $M_0$ to $M$, which may also be denoted as $M_0 \xrightarrow{\sigma} M$ or, if the choice of $\sigma$ does not really matter, $M_0 \xrightarrow{*} M$;
(ii) $\Sigma$ is called a finite Petri net system if and only if its set of reachable markings is finite; (iii) $\Sigma$ is called $k$-bounded if and only if for every reachable marking $M$ and every place $p \in P$: $M(p) \leq k$.
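The notions of enabling, firing, reachability, and boundedness can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: a net is encoded by two hypothetical dictionaries `pre` and `post` that map each transition to its input and output places, a marking is a dictionary from places to token counts with empty places dropped, and the exploration is a naive breadth-first search that only suits small bounded nets.

```python
from collections import deque


def enabled(marking, t, pre):
    """A transition is enabled if every input place holds at least one token."""
    return all(marking.get(p, 0) > 0 for p in pre[t])


def fire(marking, t, pre, post):
    """Fire an enabled transition t and return the successor marking."""
    m = dict(marking)
    for p in pre[t]:
        m[p] -= 1
    for p in post[t]:
        m[p] = m.get(p, 0) + 1
    return {p: n for p, n in m.items() if n > 0}    # drop empty places


def reachable(m0, pre, post, limit=10_000):
    """Enumerate reachable markings breadth-first (naive state space search)."""
    m0 = {p: n for p, n in m0.items() if n > 0}
    seen = {frozenset(m0.items())}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for t in pre:
            if enabled(m, t, pre):
                m2 = fire(m, t, pre, post)
                key = frozenset(m2.items())
                if key not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("state space too large or unbounded")
                    seen.add(key)
                    queue.append(m2)
    return [dict(k) for k in seen]


def is_k_bounded(m0, pre, post, k):
    return all(n <= k for m in reachable(m0, pre, post) for n in m.values())


# Hypothetical net: t1 forks p0 into p1 and p2; t2 moves the token from p1 to p3.
pre = {"t1": ["p0"], "t2": ["p1"]}
post = {"t1": ["p1", "p2"], "t2": ["p3"]}
m0 = {"p0": 1}
print(len(reachable(m0, pre, post)), is_k_bounded(m0, pre, post, 1))
```

The explicit enumeration is exactly the kind of exploration that becomes intractable for highly concurrent nets, which motivates the unfolding technique of the next subsection.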

Unfolding.
It is well known that Petri nets may suffer from the state space explosion problem [17]. As such, a naive exploration of the state space, especially in the context of a Petri net which allows highly concurrent behaviour, may not be tractable. In order to deal with this, McMillan [6] proposed a state space search technique based on the use of unfolding (this technique was later improved by Esparza et al. [7] and is discussed in the next subsection). Unfoldings are applied to $k$-bounded (called $n$-safe in [7]) Petri net systems and provide a method of searching the state space of concurrent systems without considering all possible interleavings of concurrent events. The concept of unfolding was first introduced by Nielsen et al. [9] and later elaborated upon by Engelfriet [10] using the term branching processes. In the following we introduce the necessary concepts and notations to make this paper self-contained and to be able to build upon this theory. Most of these definitions are adopted from [7]. Firstly, various types of relationship may hold between pairs of nodes in a Petri net.
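(i) $F^{+}$ is the irreflexive transitive closure of $F$, while $F^{*}$ is its reflexive transitive closure. The partial orders defined by these closures are denoted as $<$ and $\leq$, respectively. (ii) Following [7], two nodes $x$ and $y$ are in conflict, denoted as $x \mathbin{\#} y$, if there are distinct transitions $t_1$ and $t_2$ with ${}^{\bullet}t_1 \cap {}^{\bullet}t_2 \neq \emptyset$ such that $(t_1, x) \in F^{*}$ and $(t_2, y) \in F^{*}$. (iii) Two nodes $x$ and $y$ are in the co relation, denoted as $x \; \mathit{co} \; y$, if neither $x < y$ nor $y < x$ nor $x \mathbin{\#} y$.
The unfolding of a Petri net is an occurrence net, usually infinite but with a simple, acyclic structure.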
Definition 13 (occurrence net (based on [7])). An occurrence net is a net $N' = (B, E, F)$, where (i) $B$ is a set of conditions; (ii) $E$ is a set of events, with $B \cap E = \emptyset$; (iii) $F \subseteq (B \times E) \cup (E \times B)$ is a flow relation such that (1) for every $b \in B$, $|{}^{\bullet}b| \leq 1$; (2) $F$ is acyclic, that is, $F^{+}$ is a strict partial order; and (3) for all $x \in B \cup E$ the set of nodes $y \in B \cup E$ for which $y < x$ is finite; (iv) no node is in self-conflict, that is, for all $x \in B \cup E$, $\neg(x \mathbin{\#} x)$.
We also adopt the notion of $\mathit{Min}(N')$, as in [7], to denote the set of minimal elements of $N'$ with respect to the strict partial order $F^{+}$. As for transitions in Petri nets, we only consider events that have at least one input and at least one output condition. The minimal elements are therefore conditions only, and intuitively $\mathit{Min}(N')$ can be seen as an initial marking of the net.
Definition 14 (branching process (based on [10])). A branching process of a Petri net system $\Sigma = (N, M_0)$, with $N = (P, T, F)$, is a pair $(N', h)$, where (i) $N' = (B, E, F')$ is an occurrence net; (ii) $h : N' \rightarrow N$ is a homomorphism which, following [10], maps conditions to places and events to transitions while preserving the input and output sets of events; (iii) for all $e, e' \in E$, if $h(e) = h(e')$ and ${}^{\bullet}e = {}^{\bullet}e'$, then $e = e'$.
Note that the definition allows for infinite branching processes. In [10] it is shown that, up to isomorphism, every net system has a unique maximal branching process. For a net system $\Sigma$, this unique process is referred to as the unfolding of $\Sigma$ and it is denoted as $\mathit{Unf}_{\Sigma}$. For example, in Figure 5 the Petri net in (a) can be unfolded into the occurrence net in (b). Note that in Figure 5(b) all (condition/event) nodes are identified by integers and annotated by the corresponding place or transition identifiers in Figure 5(a).

Complete Finite Prefix.
The unfolding of a Petri net is infinite when the net is cyclic, as, for example, $\mathit{Unf}_{\Sigma}$ in Figure 5(b). In [6], McMillan proposed an algorithm for the construction of a so-called truncated unfolding, which is a finite initial part of an unfolding and contains as much reachability information as the unfolding itself but may be much larger than necessary. In [7], Esparza et al. referred to this truncated unfolding as a complete finite prefix (CFP) and proposed an improved algorithm for computing a minimal CFP. For example, as illustrated in Figure 5(c) (the dashed arcs should be ignored for the moment), $\mathit{Fin}_{\Sigma}$ is a minimal CFP of $\Sigma$. Note that in Figure 5(c) the tuple of conditions positioned next to an event node represents the marking of the net upon the occurrence of that event.
The main theoretical notions required to understand the concept of a CFP are those of configuration and local configuration of events. Firstly, a configuration represents a possible partially ordered run of the net.
Definition 15 (configuration [7]). A configuration $C$ of an occurrence net $N = (B, E, F)$ is a set of events, that is, $C \subseteq E$, satisfying the following two conditions: (i) $C$ is causally closed, that is, for all $e \in C$, $e' \leq e \Rightarrow e' \in C$; (ii) $C$ is conflict free, that is, for all $e, e' \in C$: $\neg(e \mathbin{\#} e')$.
[Figure 5: Illustration of unfolding and complete finite prefix using the Petri net $\Sigma$ adapted from [7] (the net in (a) extends the example net in Figure 1 in [7] with $p_0$, $p_8$, and two further nodes). Panel (c): the CFP $\mathit{Fin}_{\Sigma}$ annotated with explicit links from cut-off events to continuation events.]
Given a configuration $C$, the set of places marked after firing the events of $C$ represents a reachable marking, which is denoted by $\mathit{Mark}(C)$. In other words, $\mathit{Mark}(C)$ is the marking reached by firing the configuration $C$. For example, in the unfolding $\mathit{Unf}_{\Sigma}$ in Figure 5(b) we have $\mathit{Mark}(\{2, 5, 7, 11, 17\}) = \{p_8\}$.
Definition 16 (cut [7]). Let $\Sigma$ be a Petri net system, and let $(N', h)$ be its unfolding. The set of conditions associated with a configuration $C$ of $N'$ is called a cut and is defined as $\mathit{Cut}(C) = (\mathit{Min}(N') \cup C^{\bullet}) \setminus {}^{\bullet}C$.
The concepts thus far can be used to introduce the unfolding algorithm. In [7] a branching process $(N', h)$ of a Petri net system $\Sigma$ is specified as a collection of nodes. These nodes are either conditions or events. A condition is a pair $(p, e)$, where $p$ is a place and $e$ is the input event of the condition, while an event is a pair $(t, X)$, where $t$ is a transition and $X$ is its set of input conditions. A set of conditions of a branching process is a coset if its elements are pairwise in the co relation. For example, in Figure 5(b) each of the node sets {13, 14}, {15, 16}, {45, 46}, {47, 48}, {49, 50}, and {51, 52} is a coset.
During the process of unfolding, the collection of nodes grows, and the function $\mathit{PE}(N', h)$ (which denotes the possible extensions) is applied to determine the nodes to be added. The possible extensions are given in the form of event pairs $(t, X)$, where $X$ is a coset of conditions of $(N', h)$ and $t$ is a transition of $\Sigma$ such that (1) $h(X) = {}^{\bullet}t$, and (2) no event $e$ exists for which $h(e) = t$ and ${}^{\bullet}e = X$. In the unfolding algorithm, nodes from the set of possible extensions $\mathit{PE}(N', h)$ are added to the unfolding of the net until this set is empty (i.e., there are no more extensions).
In the complete finite prefix approach, it is observed that a finite prefix of an unfolding may contain all reachability-related information. The key to obtaining a CFP is to identify those events at which we can cease unfolding (e.g., events 12, 41, and 42 in $\mathit{Fin}_{\Sigma}$ in Figure 5(c)) without loss of reachability information. Such events are referred to as cut-off events, and they are defined in terms of an adequate order on configurations.
Definition 17 (adequate order [7]). Let $\Sigma = (P, T, F, M_0)$ be a Petri net system, and let $\prec$ be a partial order on the finite configurations of one of its branching processes; then $\prec$ is an adequate order if and only if (i) $\prec$ is well founded; (ii) for all configurations $C_1$ and $C_2$, $C_1 \subset C_2 \Rightarrow C_1 \prec C_2$; (iii) the order $\prec$ is preserved in the context of finite extensions; that is, if $C_1 \prec C_2$ and $\mathit{Mark}(C_1) = \mathit{Mark}(C_2)$, then if we extend $C_1$ with $E$ to $C_1'$, and we extend $C_2$ to $C_2'$ by using an extension isomorphic to $E$, then $C_1' \prec C_2'$.
The last clause of this definition is not fully formalised here as it requires a certain amount of formalism, and we hope that the idea is sufficiently clear from an intuitive point of view. We refer the reader to [7] for a complete formal definition of this notion. Note that, as pointed out in [7], the order $\prec$ is essentially a parameter to the approach.
The concept of local configuration captures the idea of all preceding events to an event such that these events form a configuration.
Definition 18 (local configuration [7]). Let $N = (B, E, F)$ be an occurrence net; the local configuration of an event $e \in E$, denoted $[e]$, is the set of events $e'$, where $e' \in E$, such that $e' \leq e$.
Definition 19 (cut-off event [7]). Let $\Sigma$ be a Petri net system, let $N'$ be one of its branching processes, and let $\prec$ be an adequate order on the configurations of $N'$; then an event $e$ is a cut-off event if and only if $N'$ contains a local configuration $[e']$ such that $\mathit{Mark}([e]) = \mathit{Mark}([e'])$ and $[e'] \prec [e]$.
Without loss of reachability information, we can cease unfolding from an event $e$ if $e$ takes the net to a marking which can also be caused by some other, earlier event $e'$. So in Figure 5(c), we remove the part after event 12 from $\mathit{Unf}_{\Sigma}$ because it is isomorphic to that after event 11; that is, the continuation after event 12 is essentially the same as the continuation after event 11. For a proof of this approach we refer to [7].
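The notions of local configuration, cut, and marking can be computed directly on a small occurrence net. The sketch below is illustrative only: the dictionaries `pre_ev` and `post_ev` (input and output conditions of each event), `producer` (the event that produces a condition), and `h` (the place a condition maps to) are hypothetical encodings, and the cut is computed with the standard formula of [7], namely the minimal conditions plus all conditions produced by the configuration minus all conditions it consumes.

```python
def local_configuration(e, pre_ev, producer):
    """[e]: the event e together with all events that causally precede it."""
    config, stack = set(), [e]
    while stack:
        x = stack.pop()
        if x not in config:
            config.add(x)
            for b in pre_ev[x]:
                if producer.get(b) is not None:     # minimal conditions have no producer
                    stack.append(producer[b])
    return config


def cut(config, pre_ev, post_ev, minimal):
    """Conditions that hold after firing all events of the configuration."""
    produced, consumed = set(minimal), set()
    for e in config:
        produced |= post_ev[e]
        consumed |= pre_ev[e]
    return produced - consumed


def mark(config, pre_ev, post_ev, minimal, h):
    """Marking reached by the configuration, as a sorted list of places."""
    return sorted(h[b] for b in cut(config, pre_ev, post_ev, minimal))


# Hypothetical occurrence net: e1 forks b1 into b2 and b3; e2 and e3 follow.
pre_ev = {"e1": {"b1"}, "e2": {"b2"}, "e3": {"b3"}}
post_ev = {"e1": {"b2", "b3"}, "e2": {"b4"}, "e3": {"b5"}}
producer = {"b2": "e1", "b3": "e1", "b4": "e2", "b5": "e3"}
h = {"b1": "p0", "b2": "p1", "b3": "p2", "b4": "p3", "b5": "p4"}
minimal = {"b1"}

c = local_configuration("e2", pre_ev, producer)
print(sorted(c), mark(c, pre_ev, post_ev, minimal, h))
```

With these helpers, a cut-off test in the sense of Definition 19 amounts to comparing the marking of a new event's local configuration with the markings of local configurations already in the prefix, under whichever adequate order is chosen.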

Evaluation
In this section, we demonstrate how the basic predicates introduced in Section 2 can be derived for Petri nets based on the process executions extracted from CFPs.

Annotating Complete Finite Prefix.
In this work, the repository of process models is captured in terms of CFPs. All predicates between tasks are determined by examining the possible firing sequences in the CFP of each process model. To facilitate our algorithms for determining these predicates (presented in the next subsection), we represent the continuation from cut-off events slightly more explicitly in a CFP. The idea is that for each cut-off event $e$ in a CFP we mark out some other, earlier event $e'$ that leads to the same marking as $e$ (i.e., $\mathit{Mark}([e]) = \mathit{Mark}([e'])$ and $[e'] \prec [e]$). We refer to $e'$ as the continuation event of $e$ in the CFP. We then annotate the CFP with links that connect each cut-off event to its continuation event.
Definition 20 (notations of continuation events and cut-off events). Let $\Sigma = (N, M_0)$ be a Petri net system, with $N = (P, T, F)$, and let $\beta = (N', h)$, with $N' = (B', E', F')$, be an unfolding of $\Sigma$; then we define the following: (i) $\mathit{Eq}(\beta, M) = \{e \in E' \mid \mathit{Mark}([e]) = M\}$ for any reachable marking $M$ of $N$. If $\beta$ is clear from the context, we simply omit it and write $\mathit{Eq}(M)$ (a similar convention holds for the remainder of this definition, and $\beta$ is not introduced explicitly anymore); (ii) $\mathit{continuation}(M)$ refers to the continuation node in $\beta$ for a reachable marking $M$. It is defined as the unique event $e' \in \mathit{Eq}(M)$ such that for all $e \in \mathit{Eq}(M)$, if $e \neq e'$ then $[e'] \prec [e]$; (iii) $\mathit{cutoff}(M) = \mathit{Eq}(M) \setminus \{\mathit{continuation}(M)\}$ denotes the set of cut-off events for a reachable marking $M$.
Definition 21 (annotated complete finite prefix). Let $\Sigma = (N, M_0)$ be a Petri net system. An annotated CFP of $\Sigma$ is a CFP of $\Sigma$ that is annotated with links from cut-off events to their continuation events; it is a pair $(\mathit{Fin}_{\Sigma}, L)$, where (i) $\mathit{Fin}_{\Sigma} = (B, E, F)$ is the CFP of $\Sigma$; (ii) $L$ is a set of links defined as $L \subseteq E \times E$, and $(e, e') \in L$ if and only if there is a reachable marking $M$ such that $e' = \mathit{continuation}(M)$ and $e \in \mathit{cutoff}(M)$.
To generate an annotated CFP, we propose a slight adaptation of the algorithm for computing a CFP for an $n$-safe net system in [7]. This adapted algorithm is presented as Algorithm 1. Based on Definition 21, the data structure for the representation of an annotated CFP comprises that of a CFP in [7] and a set of links. As defined in [7], the set of possible extensions is the set of events that can be added to the branching process constructed so far. From this set, the algorithm selects an event $e$ which satisfies the following condition taken from [7]: $e$ is a possible extension and $[e]$ is minimal with respect to $\prec$. The algorithm also uses a predicate that abbreviates the condition used in [7], namely that $[e]$ contains no cut-off event. Next, a cut-off test returns whether or not $e$ is a cut-off event of the prefix (as in [7]), and during its application, the corresponding continuation event for $e$ is returned in a local variable, so that it does not need to be determined again when adding links. Note that we use $X \mathbin{\cup\!:=} Y$ as an abbreviation for $X := X \cup Y$ and $X \mathbin{\setminus\!:=} Y$ for $X := X \setminus Y$.
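Definitions 20 and 21 can be read as a grouping step: events reaching the same marking form one class, the smallest one is the continuation event, and all others are cut-off events linked to it. The sketch below illustrates only this grouping and is not Algorithm 1. It assumes that the marking and the size of the local configuration of every event are already known, and it uses the size of the local configuration (McMillan's original order, with ties broken by event name) as a stand-in for the adequate order; the function name `annotate` and the sample data are hypothetical.

```python
from collections import defaultdict


def annotate(events):
    """events: dict mapping event -> (marking, size of its local configuration).

    Returns the set of links (cut-off event, continuation event).
    """
    eq = defaultdict(list)                          # Eq(M): events grouped by marking
    for e, (marking, size) in events.items():
        eq[frozenset(marking)].append((size, e))
    links = set()
    for group in eq.values():
        group.sort()                                # smallest local configuration first
        continuation = group[0][1]
        links |= {(e, continuation) for _, e in group[1:]}
    return links


# Invented data: e11 and e12 reach the same marking, e5 reaches another one.
events = {
    "e11": ({"p1"}, 3),
    "e12": ({"p1"}, 4),
    "e5": ({"p2"}, 2),
}
print(annotate(events))
```

In an actual implementation the links would be recorded while the prefix is being built, at the moment an event is recognised as a cut-off event, rather than in a separate pass.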

Determining the Basic Predicates.
In Section 2, we defined a set of 6 basic predicates based on process execution semantics, and checking whether such a predicate holds requires, in principle, the exploration of all process executions. Since different process executions result from choices in a process model, we propose to preprocess the annotated CFP of each process model (Algorithm 2) as follows: first we transform such a CFP into a set of conflict-free CFPs (specified by function GetAllExecutions in Algorithm 3) and then convert each resulting CFP into a directed bipartite graph (or bigraph) (specified by AnnotatedCFP2Bigraph in Algorithm 5).
In Algorithm 3, GetLeafCondCoSets yields all cosets of leaf conditions in the input CFP; the CFP is then traversed backwards from these leaf conditions. Finally, Algorithm 5 specifies how to convert an annotated CFP into a directed bigraph. The transformation is straightforward: the events in the CFP become event nodes in the bigraph, conditions become condition nodes, the arcs become the directed edges, and the links are converted to edges leading from a cut-off event to each of the immediate successors (conditions) of the corresponding continuation event. For illustration, Figure 8 depicts an example of converting an annotated CFP to a directed bigraph.
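The conversion of a conflict-free annotated CFP into a directed bigraph can be sketched as follows. This is a minimal illustration of the rule described above, not the paper's Algorithm 5: the bigraph is encoded as a hypothetical successor dictionary, and the function name and the sample nodes are assumptions.

```python
def annotated_cfp_to_bigraph(arcs, links):
    """Build a successor map from the arcs and links of a conflict-free CFP.

    arcs:  iterable of (source, target) pairs between conditions and events.
    links: iterable of (cut-off event, continuation event) pairs.
    """
    succ = {}
    for src, dst in arcs:
        succ.setdefault(src, set()).add(dst)
        succ.setdefault(dst, set())
    base = {node: set(targets) for node, targets in succ.items()}
    for cutoff, continuation in links:
        # A link becomes edges from the cut-off event to every immediate
        # successor (condition) of its continuation event.
        succ.setdefault(cutoff, set()).update(base.get(continuation, set()))
    return succ


# Invented example: e2 is a cut-off event whose continuation event is e1.
arcs = [("c0", "e1"), ("e1", "c1"), ("c1", "e2"), ("e2", "c2")]
links = [("e2", "e1")]
bigraph = annotated_cfp_to_bigraph(arcs, links)
print(sorted(bigraph["e2"]))
```

The extra edge from `e2` back to `c1` makes the repetition that was truncated at the cut-off event explicit, which is why later graph searches need to remember the nodes they have already visited.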
During preprocessing, we first generate a CFP from a Petri net, and then from the CFP we extract one or more bigraphs. As we only add link information in an annotated CFP, the complexity of the adapted CFP generation algorithm (cf. Algorithm 1) is the same as that of the original CFP algorithm, which is exponential in the number of arcs of the Petri net [7]. The complexity of generating a bigraph from a CFP (cf. Algorithm 2) is linear in the size of the CFP, since the latter is traversed depth-first in reverse order (i.e., starting from a leaf condition).
Now we define the algorithms for determining the six basic predicates. First, we introduce two common functions: RetrieveBigraphs, which returns the set of bigraphs for a process model () from the above preprocessing, and RetrieveAllEvents, which returns the set of event nodes for (i.e., labeled with) a task () in a bigraph (). Each such bigraph represents a possible execution of the corresponding process, and each event node labeled with a task identifier in the bigraph captures an occurrence of the corresponding task in that process execution. For brevity, an event node labeled with task  is hereafter referred to as an -event node.
Algorithms 6 and 7 specify how to evaluate the two unary predicates. Predicates posoccur and alwoccur of task  in process model  can be determined by checking the presence of a -event node in any or all bigraphs of , respectively. Based on the fact that [...]
Let us move on to the algorithms for evaluation of intermediate predicates Pred. Consider an execution  of process model  and two tasks  1 and  2 in . Algorithm 12 specifies the evaluation of Pred. In this algorithm,   refers to  1 and   to  2 in the previous discussion, and function Precedes (which we will shortly describe in more detail) is used to evaluate the causal relationship between two specific task occurrences.
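As a rough illustration of the two unary predicates, the following Python sketch checks for the presence of an event node labeled with a task in any or in all bigraphs. It is not the paper's Algorithms 6 and 7; it assumes, for simplicity, that each bigraph is reduced to a hypothetical dictionary mapping event node identifiers to task labels.

```python
def has_task(bigraph, task):
    """True if some event node of the bigraph is labeled with `task`."""
    return task in bigraph.values()


def posoccur(bigraphs, task):
    """The task occurs in at least one execution (bigraph)."""
    return any(has_task(g, task) for g in bigraphs)


def alwoccur(bigraphs, task):
    """The task occurs in every execution (bigraph)."""
    return all(has_task(g, task) for g in bigraphs)


# Two executions of a toy process: task B is optional, task A is mandatory.
bigraphs = [{"e1": "A", "e2": "B"}, {"e3": "A"}]
```

Here `posoccur(bigraphs, "B")` holds while `alwoccur(bigraphs, "B")` does not, because the second execution contains no event node labeled B.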
Finally, we introduce the definition of function Precedes. In Algorithm 13, function Precedes determines if a given event node eventually precedes a given   -event node in bigraph  (representing a process execution). Following a typical graph search algorithm, it traverses bigraph  from the -event node (via recursively calling itself) until reaching the   -event node ( = ), the end of the graph (iSuccessors (, ) = ∅, where iSuccessors (, ) denotes the immediate successors of node  in graph ), or a node that was visited before ( ∈ , where  stores the set of visited nodes). Also, we consider that the Precedes relationship is irreflexive; that is, a task occurrence cannot have a Precedes relationship with itself. Hence, when  and   refer to the same task occurrence ( = ∧ = ∅), Precedes returns a negative result.
A basic predicate is evaluated by traversing breadth first each bigraph of each process model in the repository; thus this operation is linear in the size  of a bigraph. Let  be the total number of bigraphs in the repository, and let  be the number of basic predicates in a compliance rule. Hence, the complexity of evaluating a single rule (cf. Algorithms 6, 7, 8, 9, 10, 11, and 12) is linear in  times  times max  , where max  is the size of the largest bigraph in the repository.
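The search performed by Precedes can be sketched as a recursive traversal with a visited set. The Python below is a minimal sketch under assumptions, not a transcription of Algorithm 13: the bigraph is assumed to be a hypothetical dictionary `succ` mapping each node to its immediate successors, and the parameter names are illustrative.

```python
def precedes(succ, node, target, visited=None):
    """True if a non-empty directed path leads from `node` to `target`."""
    if visited is None:
        visited = set()
    if node == target:
        # On the initial call nothing has been visited yet, so a node never
        # precedes itself (irreflexivity). On a later call the target has
        # been reached through at least one edge.
        return bool(visited)
    if node in visited:
        return False          # already explored, do not loop forever
    visited.add(node)
    # An empty successor list (end of the graph) makes any() return False.
    return any(precedes(succ, nxt, target, visited) for nxt in succ.get(node, ()))


# Cyclic bigraph (a link closes the loop) and an acyclic one.
cyclic = {"e1": ["c1"], "c1": ["e2"], "e2": ["c2"], "c2": ["e1"]}
acyclic = {"a": ["c"], "c": ["b"]}
```

Because the recursion depth grows with the length of the explored path, a production version for very large bigraphs would probably use an explicit stack instead.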
It should be noted that for our purposes the adapted CFP generation algorithm and the bigraph extraction algorithm are applied to computing the basic predicates over a repository of process models specified as Petri nets. Hence, these operations are performed when inserting a Petri net in the repository. This means that the cost of evaluating a rule is not determined by the complexity of these two algorithms, as the computation of the basic behavioural relations would already have taken place (so essentially we trade space for time).

Experiments
In this section, we first describe the implementation of ASBPQL in a software tool, and then we report on the performance of ASBPQL which we measured using this tool.

Implementation.
In order to evaluate the performance of ASBPQL, we implemented a tool, namely, ASBPQL Querier, that supports compliance checking for business process models with ASBPQL. A screenshot of ASBPQL Querier is shown in Figure 9. The tool is part of the BeehiveZ toolset v3.0. BeehiveZ is an open-source BPM analysis system based on Java (BeehiveZ can be downloaded from http://code.google.com/p/beehivez/downloads/list).
The architecture of the ASBPQL Querier and of the process model repository with which the ASBPQL Querier interacts inside BeehiveZ is illustrated in Figure 10. The core of the ASBPQL Querier is the query engine: it takes as input the compliance rules produced by users via the query editor and generates as output the results of compliance checking via the query results display. The query editor uses the syntax of ASBPQL. Using this syntax, users can easily specify the semantic relationships in which they are interested. For example, "A" alwpred "B" and "C" concur "F" mean that the users want to retrieve all process models where, in every execution, task A precedes task B and task C occurs in parallel with task F.
Under the hood, the query engine exploits an internal parser which converts each query statement into a grammar tree. This parser is built with JavaCC (http://javacc.java.net/), which is a widely used open-source parser generator and lexical analyzer generator for Java. Grammar trees are then used by the evaluator to identify all process models in the repository that satisfy the requirements of a given query. To do so, the evaluator needs access to the collection of process models stored in the process model repository in Petri net format, as well as to the directed bigraphs which have been constructed from the annotated CFPs of each Petri net by the annotated CFP decomposer using Algorithm 2. The generation of annotated CFPs is performed by the annotated CFP generator using Algorithm 1. For an annotated CFP, conditions, events, and directed arcs are represented as nodes of doubly linked lists, which support in particular fast insertion of nodes and backward traversal.
Moreover, for efficiency reasons, we keep an inverted index for every node label that appears in the set of annotated CFPs. We use Apache Lucene to manage these indexes (http://lucene.apache.org/). Specifically, for each label we record all processes which contain that label in some nodes. Based on this index, after a compliance rule is issued the tool can instantly select a set of candidate models containing the labels used in the compliance rule. The rest of the models are ignored since they are not relevant to the current rule. This step typically reduces the scope of searching and increases the tool's performance. Furthermore, an advantage of using inverted indexes is that they can be easily updated as a result of changing a node label in the repository. For more details on this index, we refer to previous work [18].
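The label-based filtering step can be illustrated with a plain in-memory inverted index. The tool itself relies on Apache Lucene; the Python sketch below swaps in a dictionary from labels to model identifiers purely for illustration, and it assumes that a candidate model must contain every label used in the rule. The function names and the toy repository are hypothetical.

```python
from collections import defaultdict


def build_index(models):
    """Map each node label to the set of models that contain it."""
    index = defaultdict(set)
    for model_id, labels in models.items():
        for label in labels:
            index[label].add(model_id)
    return index


def candidate_models(index, rule_labels):
    """Models containing all labels of the rule (assumed filter semantics)."""
    postings = [index.get(label, set()) for label in rule_labels]
    return set.intersection(*postings) if postings else set()


# Toy repository: model identifier -> labels appearing in its annotated CFP.
models = {"m1": {"A", "B"}, "m2": {"A"}, "m3": {"B", "C"}}
index = build_index(models)
```

For a rule over labels A and B, only m1 remains a candidate, so the evaluator would inspect the bigraphs of a single model instead of three.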

Performance Measurements.
We prepared a set of eight sample rules using various ASBPQL basic predicates and measured the evaluation of each of these rules over three process model collections. The first two collections are real-life repositories: the SAP R/3 reference model, consisting of 604 EPC models, and the IBM BIT library, consisting of 1,128 Petri nets. The SAP dataset is used by SAP consultants to deploy the SAP enterprise resource planning system within organizations [19]. The IBM BIT library includes five collections (A, B1, B2, C1, and C2) of process models from various domains, including insurance and banking [20]. The third dataset contains 10,000 artificially generated models. (This dataset is available at http://code.google.com/p/beehivez/downloads/list.) Since the SAP dataset is represented in the EPC notation, we first transformed these models into Petri nets using ProM (http://www.processmining.org/). This resulted in 591 Petri nets for the SAP dataset (13 SAP reference models could not be mapped into Petri nets through ProM). In the resulting dataset there are 4,439 transitions, out of which 1,494 are uniquely labeled (33% of the total), while in the IBM dataset there are 9,083 transitions with 946 uniquely labeled ones (10% of the total). The structural characteristics of the three datasets used in the experiments are reported in Table 2. In particular, we can see that the SAP and IBM collections have models of comparable sizes based on the average number of their elements (transitions, places, and arcs).
We generated the third dataset using BeehiveZ based on the reduction rules from [8]. The number of nodes per model follows a normal distribution. Specifically, the number of transitions per model ranges from 1 to 50 (average 24.85), the number of places from 1 to 47 (average 16.81), and the number of arcs from 2 to 162 (average 63.22). The labels of transitions were randomly chosen from a fixed label set comprising the characters "A-Z" and "a-z" and the digits "0-9", each label consisting of a single character or digit. In total, this led to 248,493 transitions in this dataset, with 62 unique labels (corresponding to 0.026% of the total number of transitions).
As mentioned earlier, we maintain an inverted index over task labels. We therefore chose such a small set of unique labels, compared to the total number of transitions, in order to increase the number of models that can potentially satisfy a rule; in this way we obtain precise measurements of the efficiency of evaluating a compliance rule. All models used in the experiments are bounded Petri nets, which is a requirement for unfolding according to [21].
We conducted our tests on an Intel Core i7-2600 @3.4 GHz with 8 GB RAM, running Windows 7 Ultimate and JDK 6. The heap memory for the JVM was set to 1 GB. We executed each compliance rule twelve times and measured each response time. We then discarded the highest and lowest response times for each rule and computed the average response time over the remaining ten values. The test rules and the response times for the three datasets are reported in Table 3. In particular,  1 to  3 are used to test the unary basic predicates posoccur and alwoccur, and  4 and  5 are for the concur and exclusive predicates, while  6 to  8 are for causal relations. For readability, in the table we use fictitious labels for transitions (e.g.,  1 ). The real labels from the three datasets can be found in the Appendix.
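The measurement protocol (twelve runs per rule, discard the highest and lowest response times, average the remaining ten) can be sketched as follows. This is an illustrative Python sketch, not the Java harness used in the experiments; `run_rule` is a hypothetical callable standing for the evaluation of one compliance rule.

```python
import time


def trimmed_mean(samples):
    """Average after dropping one lowest and one highest sample."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def measure(run_rule, repetitions=12):
    """Run the rule `repetitions` times; return the trimmed mean in ms."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_rule()
        samples.append((time.perf_counter() - start) * 1000.0)
    return trimmed_mean(samples)


# Twelve synthetic response times (ms): one outlier on each side.
synthetic = [1.0, 100.0] + [10.0] * 10
average_ms = trimmed_mean(synthetic)
```

Dropping the two extremes reduces the influence of one-off effects such as JIT warm-up or garbage collection pauses on the reported average.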
The second and third columns of Table 3 show for each rule the number of models selected by BeehiveZ's inverted index ("candidate models") and the number of models that actually satisfy the rule ("returned models"). These numbers are very low for the SAP and IBM datasets (e.g.,  3 yields six models in the SAP dataset, out of which only two satisfy the rule), due to the high number of unique labels within these collections (see Table 2). However, as expected, these numbers grow significantly in the artificially generated collection (as an example,  6 yields 552 models, of which 72 satisfy the rule).
The last column of Table 3 shows the response times to execute the sample queries. These times are on the order of milliseconds for the SAP and IBM datasets (averages of 15 ms and 254.7 ms, respectively) and less than one second for the artificial dataset (average 850.4 ms). This indicates that rule evaluation scales well to large collections. That said, our technique shifts computation time from compliance checking to model insertion. In other words, most of the time is spent generating the CFPs rather than executing the compliance checking. Specifically, the overall time for building the set of CFPs and the corresponding bigraphs for the three datasets is 12.6 min (SAP dataset), 28.5 min (IBM dataset), and 8.1 hours (artificial dataset). However, since we build annotated CFPs incrementally as we insert each Petri net into the repository, in practice the time for creating a single CFP is very short: only 1.28 s on average for a model from the SAP dataset, 1.52 s for a model from the IBM dataset, and 2.92 s for a model from the artificial dataset. These times are reasonable since repository users typically insert or remove single process models, or small groups thereof, rather than entire model collections at once.
As expected, the storage size of the CFPs (including the label indexes) and corresponding bigraphs can be large. While it is only 26.8 MB for the SAP dataset and 18.1 MB for the IBM dataset, it reaches 3.38 GB for the artificial dataset. However, this space is still acceptable considering that in organizational settings dedicated servers, rather than single desktop machines, are typically employed to host process model repositories.

Related Work
Given the importance of query languages for business process models, in 2004 the Business Process Management Initiative (BPMI) planned to define a standard process model query language. While such a standard has never been published, two major research efforts have been dedicated to the development of query languages for process models. One is known as BP-QL [1], a graphical query language based on an abstract representation of BPEL and supported by a formal model of graph grammars for processing of queries. BP-QL can be used to query process specifications written in BPEL rather than possible executions, and it ignores the run-time semantics of certain BPEL constructs such as conditional execution and parallel execution.
The other effort, namely, BPMN-Q [2,3], is also a visual query language which extends a subset of the BPMN modelling notation and supports graph-based query processing. Similar to BP-QL, BPMN-Q only captures the structural (i.e., syntactical) relationships between tasks. BPMN-Q uses a directed path (enhanced by operators like ≪leads to≫ and ≪precedes≫) connecting two activities to capture the requirement that they occur in order. The processing of BPMN-Q queries includes several steps. In short, the BPMN-Q query engine searches for the process models that contain subgraphs that structurally match a query, reduces these subgraphs (removing elements that are not relevant to the query), translates the reduced subgraphs into Petri nets, and then calculates the corresponding reachability graph for each Petri net. Next, the query is translated into a temporal logic formula, which is fed into a model checker together with the reachability graphs generated from the Petri nets. Finally, the model checker outputs the process models that satisfy the query. Although part of the evaluation of BPMN-Q queries is based on LTL formulae, one of the most important steps is subgraph matching, which is entirely structure based. For example, for the BPMN-Q query in Figure 2, the subgraph obtained from the process model (c) in Figure 1 is shown in Figure 11. If we only consider this subgraph, this BPMN-Q query holds, but this is not the case for the process execution where task "open VIP account" occurs. Accordingly, as discussed in Section 1, the main problem of BPMN-Q is that it cannot answer the question whether, for the resulting processes, the requirements of a query hold in every process execution or in just some process executions. BPMN-Q only returns process models where requirements hold in some process executions, rather than in every process execution. A comparison between ASBPQL and BPMN-Q is shown in Figure 12, where empty cells mean that the corresponding requirements cannot be captured by BPMN-Q.
In [22], the authors explore the use of an information retrieval technique to derive similarities of activity names and develop an ontological expansion of BPMN-Q to tackle the problem of querying business processes that are developed with different terminologies. A framework of tool support for querying process model repositories using BPMN-Q and its extensions is presented in [23]. In [24], the authors proposed an indexing mechanism to improve the efficiency of evaluating BPMN-Q queries.
ASBPQL provides three distinguishing features compared to the previous languages. First, its abstract syntax and semantics have been purposefully defined to be independent of a specific process modelling language (such as BPEL or BPMN). This allows ASBPQL and its query evaluation technique to be implemented for a variety of process modelling languages. Second, ASBPQL can express various temporal-ordering relations (precedence/succession, concurrence, and exclusivity) between individual tasks, between an individual task and a set of tasks, and between different sets of tasks (in some or every process execution). Third, these rich querying constructs are evaluated over the execution semantics of process models, rather than their structural relationships. In fact, structural characteristics alone are not able to capture all possible order relations among tasks which can occur during execution, in particular with respect to cycles and task occurrences (recall the discussions in Section 1).
In earlier work [25], we provided an initial attempt at defining a query language based on execution semantics of process models. The language was written in linear temporal logic (LTL) and only supported precedence/succession relations among individual tasks (not sets of tasks). Queries in this language are evaluated directly on annotated CFPs (i.e., TPCFPs in [25]), rather than on the directed bigraphs which are built from decomposing the annotated CFPs (a directed bigraph represents an execution of a process). As a result, this language only returns the process models which satisfy the requirements in just some process executions, rather than in every execution. In addition, using LTL formulae as queries is not very user friendly for ordinary users. In [26], the authors proposed a business query language (BQL) to capture four types of relations (Exist, ParallelWith, Exclude, and Precede). A query in BQL returns processes of which some executions satisfy these four types of relations. Furthermore, BQL suffers from the drawback that its formal semantics has not been defined.
In addition to the development of a specific process model query language, other techniques are available in the literature which can be useful for querying process model repositories. In [27,28], the authors focus on querying the content of business process models based on metadata search. In [29], an XML-based process query language, IPM-PQM, was designed to express search requirements. IPM-PQM can express four types of search conditions: Process-Has-Attribute, Process-Has-Activity, Process-Has-Subprocess, and Process-Has-Transition. IPM-PQM is a typical structure-based process querying technology. The VisTrails system [30] allows users to query scientific workflows by example and to refine workflows by analogies. WISE [31] is a workflow information search engine which supports keyword search on workflow hierarchies. In [32], the authors use graph reduction techniques to find a match to the query graph in the process graph for querying process variants; however, the approach works on acyclic graphs only. In [33–36], a group of similarity-based techniques has been proposed which can be used to support process querying. In previous work, we designed a technique to query process model repositories based on an input Petri net [18]. In [37], the authors introduced an execution-log-based query language which enables users to find elements and their relationships in process logs. In [38,39], an approach that supports "static" and "dynamic" querying of processes has been presented. As for the static querying, this approach searches for matching processes which contain specified context elements, such as business functions, roles, and resources. This is based on keyword matching. As for the dynamic querying, similar to BPMN-Q, it tries to find process models where the requirements hold in just some process instances. In [40], the authors proposed an approach to searching business process models. This approach induces relationships between activities from their labels; it provides an approximate process model search mechanism. Finally, in [41], the notion of behavioural profile of a process model is defined, which captures dedicated behavioural relations like exclusiveness or potential occurrence of activities. These behavioural relations are derived from the structure of the unfolding of a process model. However, the main foundation of the behavioural profile is the weak order (two transitions  1 ,  2 are in weak order if there exists a process execution in which  1 occurs before  2 ). Thus, for the reasons mentioned above, a behavioural profile only provides an approximation of a process model's behaviour which just holds in some process executions, whereas we can precisely determine whether or not a process model satisfies a given query in every process execution. Moreover, the efficient computation of this approach requires process models to be sound free-choice Petri nets, whereas our query

Figure 1: Three variants of a business process for opening bank accounts.

Figure 3: The pattern hierarchy and the scopes in SPS.

Figure 6: The set of conflict-free CFPs as decomposition of Fin Σ in Figure 5(c).

Figure 7: The set of conflict-free annotated CFPs transformed from Fin  Σ in Figure 5(c) according to Algorithm 3.

Figure 8: Converting a conflict-free annotated CFP to a directed bigraph.

Figure 11: The resulting subgraph of the process model in Figure 1(c) after executing the query in Figure 2.

Based on the simplified SPS, we can define the basic relationships between tasks in ASBPQL. One is Existence, and the other is Precedence. Two other relationships are in very common use in business process management: one is Exclusive, and the other is Concurrence.
(iv) Nodes  1 and  2 are concurrent, denoted as  1 co  2 , if and only if  1 and  2 are neither causally related nor in conflict.
Input: A CFP  = (, , ), a set of CFPs Γ, a (cut-off) event   , a (continuation) event   
Output: A set of (updated) CFPs Γ  
begin
Γ  := ∅;
/* get  ready by removing the successor conditions of   (in ) */
  := iSuccessors(,   );  ⋅  \ :=   ;  ⋅  \ := {  } ×   ;
/* retrieve and process the CFPs that contain   in Γ */
for   ∈ Γ ℎ   ∈   ⋅  do
/* remove from  the part before   ,   itself, and the outgoing edges of   */
 := GetSubCFP to(  ,   );   ⋅  \ :=  ⋅ ;   ⋅  \ :=  ⋅ ;   ⋅  \ :=  ⋅  ∪ ({  } × iSuccessors(  ,   ));
/* connect the above (updated)  and   to   */
  ⋅  :=  ⋅  ∪   ⋅ ;

Based on the fact that each bigraph in the set of bigraphs of process model  is free of choices, the exclusive relation between two tasks  and   is determined by checking in every bigraph of  if there are both a -event node and a   -event node, as specified in Algorithm 8. In Algorithm 9, the concur relation between  and   in  holds if and only if in each bigraph of  either (1) there are no and   -event nodes at all, or (2) there are both an -event node and an   -event node, and no directed path exists between the two nodes.
Next, the remaining algorithms are defined for basic predicates capturing causal relationships between tasks. Evaluation of each such predicate is based on the result of evaluating the corresponding intermediate predicate in individual process executions. Given a process model , predicate alwpred holds only when its intermediate predicate (i.e., Pred) holds in all process executions of , while predicate pospred holds as long as its intermediate predicate (i.e., Pred) holds in one process execution of . To capture such semantics, we apply the logical operator ∧ (for predicate alwpred) or ∨ (for predicate pospred) between the intermediate predicate over the set of bigraphs (G) of  in the algorithms. Algorithm 10 specifies the evaluation of predicate alwpred, and Algorithm 11 specifies the evaluation of pospred.
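The binary predicates described above can be sketched in Python as follows. This is a hedged illustration rather than Algorithms 8 to 11: a bigraph is assumed to be a hypothetical pair (labels, successors), case (1) of concur is read as "neither task occurs", and when a task occurs several times in one execution the sketch requires that no pair of occurrences is connected by a directed path. The intermediate predicate Pred is passed in as a function because its definition (Algorithm 12) is not reproduced here.

```python
from collections import deque


def events_of(bigraph, task):
    """Event nodes of the bigraph that are labeled with `task`."""
    labels, _ = bigraph
    return [n for n, label in labels.items() if label == task]


def reachable(bigraph, src, dst):
    """True if a non-empty directed path leads from src to dst."""
    _, succ = bigraph
    seen, queue = set(), deque(succ.get(src, ()))
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            queue.extend(succ.get(node, ()))
    return False


def exclusive(bigraphs, t1, t2):
    """No execution contains occurrences of both tasks."""
    return not any(events_of(g, t1) and events_of(g, t2) for g in bigraphs)


def concur(bigraphs, t1, t2):
    """In each execution: neither task occurs, or both occur unordered."""
    for g in bigraphs:
        e1, e2 = events_of(g, t1), events_of(g, t2)
        if not e1 and not e2:
            continue
        if not e1 or not e2:
            return False
        if any(reachable(g, a, b) or reachable(g, b, a) for a in e1 for b in e2):
            return False
    return True


def alwpred(bigraphs, pred):
    """Conjunction (∧) of the intermediate predicate over all executions."""
    return all(pred(g) for g in bigraphs)


def pospred(bigraphs, pred):
    """Disjunction (∨) of the intermediate predicate over all executions."""
    return any(pred(g) for g in bigraphs)


# Toy executions: A then B in sequence; A and B in parallel after S; A alone.
g_seq = ({"e1": "A", "c1": None, "e2": "B"}, {"e1": ["c1"], "c1": ["e2"]})
g_par = ({"e0": "S", "c1": None, "c2": None, "e1": "A", "e2": "B"},
         {"e0": ["c1", "c2"], "c1": ["e1"], "c2": ["e2"]})
g_only_a = ({"e1": "A"}, {})
```

With these toy executions, `concur([g_par], "A", "B")` holds, whereas adding `g_seq` or `g_only_a` to the list makes it fail, which reflects the "in every execution" reading of the predicate.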

Table 2: Structural characteristics of the three datasets.

Table 3: Response times to execute eight sample compliance rules over the three datasets.