Splitting Long Event Sequences Drawn from Cyclic Processes for Discovering Workflow Nets

This paper addresses the preprocessing of event sequences issued from cyclic discrete event processes, which perform activities continuously and in which the delimitation of jobs or cases is not explicit. Because of the iterative behaviour, the sequences include several occurrences of the same events, so discovery methods conceived for workflow nets (WFN) cannot process them. To handle this issue, a novel technique is presented for splitting a set of long event traces S = {S_k} (|S| ≥ 1) exhibiting the behaviour of cyclic processes. The aim of this technique is to obtain from S a log λ = {σ_i} of event traces representing the same behaviour, which can be processed by methods that discover WFN. The procedures derived from this technique have polynomial-time complexity.


Introduction
In discrete event processes, modelling is essential for designing management or control systems or analysing processes in operation. In the latter case, automated modelling of discrete-event processes from the recorded system behaviour is a valuable resource for process reengineering. In the areas of business process and manufacturing systems, automated modelling is an active research topic; in the first area, it is called process discovery [1], while in the second it is named process identification [2].
1.1. Automated Modelling. In both areas, the aim is to build discrete-event models from records of event data generated by the processes; such event data are captured in the form of event sequences or traces, which reveal the actual process behaviour. The models must clearly represent sequential and concurrent behaviours; finite automata and Petri nets (PNs) are the most widely used formalisms.
The source of event traces, called the event log, is either the management information systems [3][4][5][6] or the process controllers [2,7]. In each type of process, the logs are represented in different formats. In business processes, the event logs are composed of large multisets of traces; each trace describes a process execution called a case. In manufacturing processes, the activities are performed continuously and iteratively; the delimitation of jobs or cases is not explicit. Thus, the event logs are composed of a few (usually one) very long sequences.
1.2. Event Log Preprocessing. Extracting the iterative subsequences from long task sequences is a way to isolate the executions of the t-components of the workflow net (WFN) to be discovered, which allows the long sequences to be split into multiple traces. This approach allows diverse WFN discovery techniques to be applied to event logs drawn from manufacturing processes.
Existing discovery methods for WFN cannot always process long sequences from cyclic processes, in particular when initial events occur again in the sequence due to the iterative behaviour of the process; the obtained models are less readable or, in some cases, wrong. Consider the log S = {abcdabcecd}; the model discovered using an existing method [8] is shown in Figure 1(a). Conversely, when the single sequence of the log is split into λ = {abcd, abcecd}, the same discovery method yields the WFN in Figure 1(b); the extended WFN replays S.
Splitting or partitioning an event log is a strategy used for several purposes: trace clustering [9,10], reduction of the surplus language for fault diagnosis [11], model simplification [12], discovering unobservable behaviour [13], and model refinement [14]. Methods dealing with the problem of sequence segmentation for improving the translation from Japanese to English have also been proposed [15,16].

Contribution.
In this paper, a novel technique for splitting long task sequences issued from highly repetitive cyclic processes into subsequences is proposed. To the best of our knowledge, there are no methods addressing the stated problem. The method processes a reduced set of long event traces S = {S_k} (|S| ≥ 1) and obtains a log λ = {σ_i} of event traces representing the same behaviour. The purpose of this processing is to apply WFN discovery algorithms, in particular those dealing with silent transitions.
The paper is organised as follows: Section 2 presents the notation on PNs, WFNs, and the splitting problem; Section 3 describes the trace splitting method; Section 4 presents the implementation and tests; finally, Section 5 presents the conclusions.

Background and Problem Statement
This section presents the basic concepts and notation of ordinary PNs and WFNs used in this paper. For further details, the reader can consult the study by van der Aalst et al. [1]. Additionally, the sequence splitting problem is formulated.
For any node x ∈ P ∪ T, … The places in P can be empty or marked by one or more tokens. A marking M: P → N determines the number of tokens within the places, where N is the set of nonnegative integers. A marking M, usually denoted by a vector in N^|P|, describes the current state of the modelled system.
Definition 2. A Petri net system or Petri net (PN) is the pair N = (G, M_0), where G is a PN structure and M_0 is an initial marking. R(G, M_0) denotes the set of all markings reachable from M_0.
Definition 3. A PN system is 1-bounded or safe iff, for any M_i ∈ R(G, M_0) and any p ∈ P, M_i(p) ≤ 1. A PN system is live iff, for every reachable marking … In a t-invariant Y_i, if an initial marking M_0 enables a t_i ∈ <Y_i>, then, once t_i is fired, M_0 can be reached again by firing only transitions in <Y_i>.
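The notions of marking, firing, reachability set, and safeness can be illustrated with a minimal sketch. It assumes an ordinary net stored as two dictionaries, `pre` and `post`, mapping each transition to its input and output places; these names, the `limit` cap, and the two-place example net are illustrative assumptions and do not come from the paper.

```python
from collections import deque

def enabled(marking, pre, t):
    """A transition is enabled when each of its input places holds a token."""
    return all(marking[p] >= 1 for p in pre[t])

def fire(marking, pre, post, t):
    """Return the marking reached by firing t (ordinary net: all arc weights are 1)."""
    new = dict(marking)
    for p in pre[t]:
        new[p] -= 1
    for p in post[t]:
        new[p] += 1
    return new

def reachable_markings(m0, pre, post, limit=10_000):
    """Breadth-first exploration of R(G, M0), stopped after `limit` markings."""
    key = lambda m: tuple(sorted(m.items()))
    seen = {key(m0): m0}
    queue = deque([m0])
    while queue and len(seen) < limit:
        m = queue.popleft()
        for t in pre:
            if enabled(m, pre, t):
                nxt = fire(m, pre, post, t)
                if key(nxt) not in seen:
                    seen[key(nxt)] = nxt
                    queue.append(nxt)
    return list(seen.values())

def is_safe(m0, pre, post):
    """1-boundedness over the explored markings (conclusive only if the cap is not hit)."""
    return all(v <= 1 for m in reachable_markings(m0, pre, post) for v in m.values())

# Hypothetical two-place cycle: p1 -> t1 -> p2 -> t2 -> p1
pre = {"t1": ["p1"], "t2": ["p2"]}
post = {"t1": ["p2"], "t2": ["p1"]}
m0 = {"p1": 1, "p2": 0}
print(is_safe(m0, pre, post))
```

In this example, firing t1 and then t2 returns the net to M_0, which is the cyclic behaviour that a t-invariant describes.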

Workflow Nets
Definition 5. A workflow net (WFN) N is a subclass of PN with the following properties [1]: (i) it has two special places, i and o. Place i is a source place (•i = ∅), and place o is a sink place (o• = ∅). (ii) If a transition t_e is added to the PN connecting place o to place i, then the resulting PN (called extended WFN) is strongly connected.
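The two structural conditions of a WFN can be checked directly on the arc relation. The following is a sketch under the assumption that the net is given as lists of places, transitions, and arcs (pairs of node names); the function name and the small example net are hypothetical.

```python
def is_wfn_structure(places, transitions, arcs, source="i", sink="o"):
    """Check conditions (i) and (ii) of a workflow net on its structure."""
    nodes = set(places) | set(transitions)
    # (i) the source place has no input arc and the sink place has no output arc
    if any(dst == source for _, dst in arcs):
        return False
    if any(src == sink for src, _ in arcs):
        return False
    # (ii) add t_e from o to i and test strong connectivity of the extended net
    extended = set(arcs) | {(sink, "t_e"), ("t_e", source)}
    nodes.add("t_e")

    def reach(start, edges):
        adjacency = {}
        for a, b in edges:
            adjacency.setdefault(a, []).append(b)
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Strongly connected iff every node is reachable from i and i from every node
    forward = reach(source, extended)
    backward = reach(source, {(b, a) for a, b in extended})
    return forward == nodes and backward == nodes

# Hypothetical sequential net: i -> a -> p -> b -> o
arcs = [("i", "a"), ("a", "p"), ("p", "b"), ("b", "o")]
print(is_wfn_structure(["i", "p", "o"], ["a", "b"], arcs))
```

A place that is produced but never consumed (for example, an extra place q fed only by transition a) makes the extended net fail the strong connectivity test.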
(N, M_0) contains no dead transitions. The extended WFN of a sound WFN is live and bounded. A WFN can represent a process behaviour by associating task labels to some transitions.
where Σ is a finite set of task labels, and L: T → Σ ∪ {ε} is the labelling function. Transitions labelled with ε are called silent or unobservable; otherwise they are called observable. Additionally, ∀ t_i, t_j ∈ T, t_i ≠ t_j, if L(t_i), L(t_j) ∈ Σ then L(t_i) ≠ L(t_j); i.e., two transitions cannot have the same label from Σ.
Definition 8. Let Σ be a finite set of task labels; an event log λ is a set of traces

The Splitting Technique
3.1. Strategy. Every S_k ∈ S is parsed by searching for subsequences in S_k that have the same alphabet; such subsequences are represented by a macrotask θ_j, which replaces them in all the S_k that contain them. This operation is repeated until all the sequences in S are formed only by macrotasks.
The main steps of the technique are the following. First, an initial splitting of S_k, induced by the first task, is performed. Then, the subsequences of S_k are analysed to obtain the macrotask θ_1, which is replaced in S_k.
3.2. Basic Operators. Several operators for handling task traces are introduced below.
Definition 10. Let λ be an event log over Σ and let a be a task in Σ; for every trace σ_k = x_1 x_2 … x_n ∈ λ and a ∈ σ_k: (i) τ(x_i, σ_k) provides the name of the task at position x_i in σ_k; (ii) First(S′) gets the first subsequence of the list S′; (iii) A(X) gets the set of tasks (alphabet) used in the object X; A(σ_k) and A(λ) get the set of tasks in a trace σ_k and in λ, respectively.
Definition 11. Let λ be an event log and … Notice that A(σ_1) is the support of a t-invariant of the extended WFN to build.
Definition 12. Let S′ = {σ_1, σ_2, …, σ_n} be a list of subtraces, σ = t_1 t_2 … t_m ∈ S′ be a subtrace, and i ∈ {1, …, m-1} and j ∈ {2, …, m} be indexes. Then, the operator delSet(S′, σ, θ, i, j) deletes the tasks in σ from i to j and replaces them with the symbol of the macrotask θ in S′.
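One possible reading of these operators is sketched below. It assumes that traces are tuples of task labels, that positions are 1-based, and that a macrotask is stood in for by a single label such as "θ1"; the Python names and this representation are assumptions made for illustration, not the paper's implementation.

```python
def tau(i, trace):
    """τ: name of the task at (1-based) position i of the trace."""
    return trace[i - 1]

def first(subtraces):
    """First(S'): first subsequence of the list S'."""
    return subtraces[0]

def alphabet_of_trace(trace):
    """A(σ): set of tasks used in one trace."""
    return set(trace)

def alphabet_of_log(log):
    """A(λ): set of tasks used in any trace of the log."""
    tasks = set()
    for trace in log:
        tasks |= set(trace)
    return tasks

def del_set(subtraces, sigma, theta, i, j):
    """delSet: in S', replace positions i..j (inclusive) of sigma by the macrotask label."""
    return [
        t[: i - 1] + (theta,) + t[j:] if t == sigma else t
        for t in subtraces
    ]

# Hypothetical list of subtraces with single-character task labels
s_prime = [tuple("abcd"), tuple("abcecd")]
print(del_set(s_prime, tuple("abcd"), "θ1", 1, 4))
```

Using tuples keeps the subtraces hashable and lets a macrotask label sit alongside ordinary task labels in the same sequence.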

Splitting Procedures
3.3.1. First Splitting. In the processing of S_k, the subsequences to consider are those delimited by a given task symbol T along S_k. The search starts with the first symbol of S_k; then, a list of sequences S′ is formed by all the subsequences of S_k starting with T.
The algorithm that splits the sequence S into shorter subsequences delimited by the occurrences of the first task is presented below.
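As a rough illustration of this first splitting step, the sketch below cuts a sequence at every occurrence of its first task. It assumes that tasks are single characters in a Python string; the function name `first_split` is hypothetical, and the code is not a transcription of the paper's Algorithm 1.

```python
def first_split(sequence):
    """Split a sequence into subsequences that each start with its first task."""
    if not sequence:
        return []
    first_task = sequence[0]
    parts = []
    current = [first_task]
    for task in sequence[1:]:
        if task == first_task:
            # A new occurrence of the first task closes the current subsequence
            parts.append("".join(current))
            current = [task]
        else:
            current.append(task)
    parts.append("".join(current))
    return parts

print(first_split("abcdabcecd"))  # ['abcd', 'abcecd']
```

The step is a single pass over the sequence, and concatenating the resulting subsequences in order gives back the original sequence.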

Determining Macrotasks.
Afterward, the subtrace σ_1 of S′ with the smallest alphabet is chosen and added to the macrotask θ_1; this subtrace is replaced by θ_1 in S′.
Based on A(σ_1) in θ_1, the remaining subtraces σ_r that have the same alphabet can be found and then added to θ_1.
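The selection step just described can be sketched as follows, assuming subtraces are strings of single-character tasks. Only the grouping by alphabet is shown; the replacement of a macrotask inside longer subtraces, which may split them further, is not covered. The function name and the tie-breaking rule (the first subtrace with a minimal alphabet wins) are assumptions.

```python
def pick_macrotask(subtraces):
    """Group into one macrotask all subtraces sharing the smallest alphabet.

    Returns (theta, rest): the subtraces collected in the macrotask and
    the subtraces left for later iterations.
    """
    # sigma_1: the subtrace with the smallest alphabet (first one on ties)
    sigma_1 = min(subtraces, key=lambda s: len(set(s)))
    target = set(sigma_1)
    theta = [s for s in subtraces if set(s) == target]
    rest = [s for s in subtraces if set(s) != target]
    return theta, rest

theta_1, remaining = pick_macrotask(["abcd", "abcecd", "abdc"])
print(theta_1, remaining)  # ['abcd', 'abdc'] ['abcecd']
```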

Mathematical Problems in Engineering
Replacing θ_1 in S′ may split the remaining subtraces and thus create new subsequences. This operation is performed again on S′ without considering θ_1, yielding θ_2, which is included in S′ as explained before. In every iteration, new macrotasks θ_s are created and replaced in S′. This process is repeated until S′ is formed only by macrotasks. The traces in all the macrotasks form the event log. The procedures (Algorithms 2 and 3) to replace a macrotask θ in S′ and delete the corresponding subsequences are presented below.
The procedure below (Algorithm 3) summarises the splitting process.

Remark. The computational complexity of
Property. Algorithm 3 efficiently processes an event sequence S, yielding a set S′ that contains the subsequences corresponding to the segmentation of S.
Proof. The procedure iteratively builds S′ and converges toward a set including only macrotasks. The concatenation of the subsequences represented by the macrotasks, in the order they are obtained, yields the sequence S. Since all the involved algorithms are polynomial-time, the processing is efficient. □

Implementation and Tests
The algorithms to split a long trace into several traces have been implemented as a software tool. Besides testing the software on sequences and verifying the correct splitting, an extended test scheme, described below, is defined.
4.1. Testing Scheme. The correctness of the splitting procedure is verified in a controlled manner through a rediscovery scheme, using artificial event logs, which are generated as follows. First, a known extended WFN that may contain silent transitions is created and executed in the PN editor PIPE [17]; this WFN contains a transition t_e that allows the cyclic behaviour in the net, so that long sequences are obtained. Then, the obtained string is processed to delete the occurrences of the task t_e and of the silent transitions labelled with ε. Finally, the long string is saved in a text file, which is the input of the implemented method. The developed tool processes the text file that contains the long sequence and splits it into several traces, which are saved in a text file; such traces represent the behaviour of the initial WFN. This text file can be used as input to a process discovery technique [18] to obtain a WFN, which is compared to the one used to generate the log. The discovered WFN is an XML file, which can be drawn by PIPE. The test scheme followed is shown in Figure 4.
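The cleaning step of this scheme, which removes t_e and the silent transitions from the simulated firing sequence, can be sketched as below. The representation of the firing sequence as a list of labels, the label strings "t_e" and "ε", and the function name are assumptions; the actual output format of PIPE is not described here.

```python
def clean_firing_sequence(labels, closing="t_e", silent="ε"):
    """Drop the closing transition and silent transitions from a firing sequence."""
    return [label for label in labels if label not in (closing, silent)]

# Hypothetical firing sequence obtained by simulating an extended WFN
fired = ["a", "b", "ε", "c", "d", "t_e", "a", "b", "c", "e", "c", "d", "t_e"]
long_sequence = "".join(clean_firing_sequence(fired))
print(long_sequence)  # abcdabcecd
```

After cleaning, the case boundaries marked by t_e are no longer visible, which is the situation the splitting technique has to handle.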

Experiments.
Several case studies using WFN with different structures and sizes were conducted using the software tool. The following examples are significant because of their structure rather than their size.

Test 1.
An execution of the software tool is presented in Figure 5. In Figure 5(a), the extended WFN edited in PIPE is shown; the artificial log is drawn from this net. The artificial log, composed of one sequence of length 1,045, is shown in Figure 5(b). In Figure 5(c), the split log with 11 traces obtained by the execution of the implemented tool is displayed. Then, the WFN discovered by applying the classification method to the split log is displayed in Figure 5(d).

Test 2.
A second test is presented in Figure 6. In Figure 6(a), the extended WFN is shown. The artificial log, with a length of 3,937, is shown in Figure 6(b). In Figure 6(c), the log with six traces obtained by the execution of the implemented tool is displayed. The WFN obtained using the split log and the classification method is displayed in Figure 6(d).
4.2.3. Test 3. In Figure 7, a third test is presented. In Figure 7(a), the extended WFN is shown. The artificial log, with a length of 10,093, is shown in Figure 7(b). In Figure 7(c), the log with eight traces obtained by the execution of the implemented tool is displayed. The WFN obtained using the split log and the classification method is displayed in Figure 7(d).

Conclusions
A technique for splitting long event sequences exhibiting the behaviour of cyclic processes has been presented. The result of the processing is an event log from which a WFN can be discovered. Long event sequences are drawn from highly repetitive processes, such as automated manufacturing systems, where the initial state is known but the delimitation of jobs or cases is not specified. Although there are discovery methods that deal with the sequences of cyclic processes, this preprocessing technique allows the application of many discovery algorithms that build WFN, particularly those that deal with silent transitions [18][19][20]. In this paper, the method in [18] has been used in the tests to rediscover the models that generate the long sequences.
The event logs obtained from the splitting technique contain traces capturing silent behaviour, represented in the discovered WFN by silent transitions of types skip, redo, switch, and finalise. However, these traces cannot always lead to the discovery of initialise silent transitions; this remains a matter for future research.

2.1. Petri Nets. Definition 1. An ordinary PN structure G is a bipartite digraph represented by the three-tuple G = (P, T, F), where P = {p_1, p_2, …, p_|P|} and T = {t_1, t_2, …, t_|T|} are finite sets of nodes named places and transitions, respectively, and F ⊆ (P × T) ∪ (T × P) is a relation representing the arcs between the nodes.

FIGURE 5: Test 1: (a) Extended WFN in PIPE, (b) artificial event log, (c) splitting the sequence, and (d) WFN obtained using the classification method.
Given a set of long event traces S = {S_k}, where S_k ∈ T* and |S| ≥ 1, representing the behaviour of a cyclic discrete event process, the aim is to obtain a set λ = {σ_i} of task traces by splitting the S_k, such that the concatenation of traces in λ represents the same behaviour expressed in S, i.e., an extended WFN discovered from λ must replay S.
Assumptions.
A1. The sequences S_k are arbitrarily long; they capture all the possible actual behaviour of the process. Such sequences are generated by an unknown, live, and 1-bounded cyclic PN. This means that the process is well behaved: there are neither deadlocks nor buffer overflows during the recording of traces.
A2. In every S_k, all the tasks occur at least twice.
A3. The S_k are recorded from the initial state. Thus, the first tasks are known.
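Assumption A2 can be verified on the input before any splitting is attempted. The check below is a small sketch that assumes each S_k is a string of single-character tasks; the function name is hypothetical.

```python
from collections import Counter

def satisfies_a2(sequences):
    """True when, in every sequence, each task that appears occurs at least twice."""
    return all(
        count >= 2
        for sequence in sequences
        for count in Counter(sequence).values()
    )

print(satisfies_a2(["abcabc"]))  # True
print(satisfies_a2(["abcab"]))   # False: task c occurs only once
```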