An Optimization Approach for Mining of Process Models with Infrequent Behaviors Integrating Data Flow and Control Flow

Infrequent behaviors of a business process are behaviors that occur only in very exceptional cases; their occurrence frequency is low because the conditions they require are rarely fulfilled. Hence, a strong coupling relationship exists between infrequent behavior and data flow. Furthermore, some infrequent behaviors may reveal very important information about the process. Thus, not all infrequent behaviors should be disregarded as noise, and identifying infrequent but correct behaviors in the event log is vital to process mining from the perspective of data flow. Existing process mining approaches construct a process model from frequent behaviors in the event log, mostly concentrating on control flow only, without considering infrequent behavior and data flow information. In this paper, we focus on data flow to extract infrequent but correct behaviors from logs. For an infrequent trace, frequent patterns and interactive behavior profiles are combined to find out which part of the behavior in the trace occurs with low frequency. Conditional dependency probability is then used to analyze the influence strength of the data flow information on infrequent behavior. Correspondingly, an approach for identifying effective infrequent behaviors based on frequent patterns under data awareness is proposed. Subsequently, an optimization approach for mining of process models with infrequent behaviors integrating data flow and control flow is presented. Experiments on synthetic and real-life event logs show that the proposed approach can distinguish effective infrequent behaviors from noise, in contrast to the compared approaches. The proposed approaches greatly improve the fitness of the mined process model without significantly decreasing its precision.


Introduction
The purpose of process mining is to extract useful knowledge from event logs recorded by enterprise IT systems in order to discover, monitor, and enhance the actual business process [1,2]. One important research area is process discovery, which automatically infers process models from event logs. The goal of process discovery is to find the "best" process model, that is, the one that reflects the recorded real executions as closely as possible. The four important metrics for measuring the "best" model are fitness, precision, generalization, and simplicity [3].
Unfortunately, real-life event logs often contain both noise and infrequent behavior [2,4,5]. In general, noise refers to behavior that does not conform to a process specification and/or correct execution, such as traces that record incomplete process behavior, recording errors, and erroneous executions of the process [5]. Effective infrequent behavior, in contrast, is a possible execution behavior that occurs in very exceptional cases, such as fraudulent behavior in insurance [6], risk problems in system operations [7], and escape problems of spacecraft systems. Some infrequent behaviors may be important behaviors that cannot be discarded in system operation. Early process discovery algorithms [8][9][10][11][12][13] assume that event logs accurately record system behavior and therefore have significant limitations in real life. Most of the recent discovery algorithms [14][15][16][17][18][19][20][21][22][23] support noise filtering but ignore infrequent behaviors of business processes. Very few discovery algorithms [24][25][26][27] consider infrequent behaviors, and even these still regard infrequent behaviors as noise. This may cause important information to be discarded. As a result, the derived models have difficulty in accurately describing the real behavior of systems. Therefore, one important challenge in process discovery is to distinguish infrequent behavior from noise in event logs.
There are few approaches related to the research of infrequent behavior, and most of them focus on the control-flow perspective. Existing approaches determine whether a behavior is infrequent or frequent only by considering the frequency of activities or directly-follows relations. For infrequent behavior, they rarely analyze whether it is related to data flow and instead directly remove it as noise. However, in real system operation, some execution paths may be taken depending on contextual data information, such as available resources, execution time, and execution status. As the required conditions (that is, specific data information) are rarely fulfilled, some paths are executed infrequently. These infrequent behaviors are therefore caused by their special required conditions. Once these conditions are fulfilled, the corresponding infrequent behavior will inevitably occur. We call these behaviors effective infrequent behaviors or correct infrequent behaviors. For instance, an airbag deploying in a car requires a suitable speed and angle of impact. Theoretically, the airbag can only be opened when the impact on a fixed object is within 60° in front of the vehicle and the car speed is higher than 30 km/h. Compared with normal driving activities, the frequency of airbag deployment behavior is lower. Obviously, there is a coupling relationship between these infrequent behaviors and the data information of the event. Moreover, airbag deployment is an important behavior for system operation. Therefore, existing approaches that filter low-frequency behavior based on control flow and treat it as noise are not appropriate. Identifying these effective infrequent behaviors from the perspective of data flow and integrating control flow and data flow information in process discovery play an important role in process model optimization, business process improvement, resource allocation adjustment, and so on.
This paper analyzes the coupling relationship between infrequent behavior and data flow. It quantifies the influence strength of data information on behavioral dependencies between events, which provides a reliable basis for the identification of effective infrequent behavior. We conduct a series of experiments to compare our approach with existing approaches on synthetic and real-life event logs and discuss the results. The experimental results indicate that the proposed approach identifies more infrequent but useful behaviors than the state-of-the-art mining technique and greatly increases the fitness of the process model without significantly decreasing its precision, thereby optimizing the process model. The main contributions of the paper are as follows:
(1) An analysis approach based on frequent patterns and interactive behavior profiles is proposed to identify which parts of the trace are infrequent.
(2) To quantify the strong influence of data information on the behavioral dependence between activities, a conditional dependence probability measurement approach is introduced.
(3) An effective infrequent behavior recognition approach based on frequent patterns under data awareness is presented, along with an optimization approach for mining of process models with infrequent behaviors integrating data flow and control flow.
The remainder of this paper is structured as follows. Section 2 discusses the related work. Section 3 introduces the problem with an example. Section 4 presents the notations and the required concepts. Section 5 proposes an effective infrequent behavior recognition approach based on frequent patterns under data awareness, followed by an optimization approach for mining of the process model with infrequent behaviors that integrates data flow and control flow. Section 6 evaluates how well the proposed approach works on synthetic and real-life event data. Finally, Section 7 concludes the paper and discusses future work.

Related Work
Many researchers have proposed a range of process mining algorithms. However, process mining algorithms still face many problems, such as short loops [28], indirect dependency relationships [29], duplicated transitions, invisible transitions, noise, and infrequent behaviors [30]. Some early mining algorithms, such as the α-algorithm [8] and its derived improved algorithms [9,10], the ILP mining algorithm [11], the inductive miner algorithm [12], and the domain-based mining algorithm [13], disregard noise in the event log. Clearly, they have great limitations in real life. Most of the recent mining algorithms support noise filtering [14][15][16][17][18][19][20][21][22][23]. The first discovery algorithm handling noise was the heuristics miner [14]. The heuristics miner considers the frequencies of the basic ordering relations when computing the strength of causal relations. The true dependency between two events (such as concurrency, exclusion, and causality) is determined by the strength of the causal relations. Algorithms derived from it have also been proposed [15,16]. Existing noise-filtering approaches are based on frequencies [14][15][16][17][18], machine-learning techniques [19,20], genetic algorithms [21], or probabilistic models [22,23]. All of those approaches focus on the control-flow perspective when filtering noise, do not consider data flow information, and exclude infrequent but useful behaviors. The literature [31][32][33] specifically studied noise processing in event logs but still did not address infrequent behaviors.
The literature on infrequent behavior is still very scarce, and it has mainly focused on the control-flow perspective [24][25][26][27]. In terms of control flow, the literature [24] proposed the WoMine-i algorithm, which retrieves infrequent behavior patterns from a process model, including structures with sequences, selections, parallels, and loops. However, a reference model is generally not available in real life; only logs that record reality are. Hence, it is difficult for this approach to improve the quality of the discovered model. In [25], the inductive miner infrequent algorithm was proposed, which adds infrequent behavior filters to all steps of the IM algorithm, such that infrequent behavior is filtered by adopting an eventually-follows graph. In [26], a minimum anomaly-free automaton (AFA) based on the whole event log and a given threshold was constructed. Subsequently, all events that did not fit the AFA were removed from the filtered event log, which led to the removal of individual events rather than entire traces from the log. However, this technique cannot detect some typical anomalies, such as incomplete traces. The approaches in [25,26] filter infrequent behaviors based on the frequency of the directly-follows relation between activity pairs, only from the perspective of the control flow, and neglect the dependence between some infrequent behaviors and data flow. In [27], the authors proposed a generic noise-filtering approach suitable for arbitrary process discovery algorithms. The approach uses the conditional occurrence probability to calculate the likelihood of an activity occurring after a subsequence. The disadvantage of this approach is that the log is interpreted as a set of sequences, whereas structural information, such as concurrent and loop structures, is not considered. For concurrent and loop structures, the same structure corresponds to multiple or even infinitely many different subsequences.
Additionally, the distinction between noise and infrequent behavior is ignored, and both are directly deleted as noise. In terms of data flow, in [34], the data-aware heuristic miner (DHM) was proposed, which combines data flow and control flow. A classification technique is used to find the data attributes between activities, which can reveal conditional infrequent behavior from the event log and thus distinguish infrequent behavior from noise effectively. There are two limitations to this approach. First, according to the condition of data dependence, only the dependency strength of two different directly-follows relations between activity a and activity b is computed, but the probability of activity b directly following activity a compared to other activities is not considered. Second, only conditional directly-follows dependencies between activities are discovered, and the conditional dependencies of more complex patterns cannot be discovered. Recent work on declarative process discovery [35][36][37] considered the data perspective and represents the discovery results from execution logs as declarative process models. In [37], the authors present an automated discovery of a declarative process model with data conditions. Clustering techniques in conjunction with a rule mining technique and redescription mining techniques, respectively, are used to discover constraints between two activities. However, similar to association rule mining, only sets of rules or constraints rather than full process models are returned.
As a consequence, existing process discovery algorithms have three problems in recognizing infrequent behavior. First, most of them focus on the control-flow perspective to recognize infrequent behavior and ignore the coupling relationship between infrequent behavior and data conditions. Second, they directly remove all infrequent behaviors as noise when discovering the process model, so infrequent but useful behaviors are excluded as well. Third, these approaches only consider the directly-follows dependency of activity pairs, while the dependencies of more complex patterns are ignored. For the reasons stated above, this paper proposes an effective infrequent behavior recognition approach based on frequent patterns under data awareness. Subsequently, an optimization approach for mining of process models with infrequent behaviors integrating data flow and control flow is provided.

Motivation
The business process of booking tickets in a train ticket reservation system is used as an example. Here, only the business process of ordering the ticket is considered; the business processes generated by refunds and changes are not. Assuming that only 1,000 records are extracted from the system log, the trace sequences and the frequency of their occurrence are shown in Table 1. The event names corresponding to the activities are shown in Table 2.
In a real system, $\sigma_8$, $\sigma_9$, and $\sigma_{12}$ are three effective infrequent traces corresponding to situations in which the total number of contact names related to the logged-in user exceeds 15. The time interval between confirming the order and the payment success is more than 30 minutes, the user has an unpaid order, and the waiting time for payment does not time out. These traces occur infrequently because the corresponding conditions are rarely fulfilled. Obviously, these behaviors are infrequent but correct. However, the IMi algorithm disregards them as noise when constructing the process model, so the resulting process model cannot truly describe the actual operation of the system.
In trace $\sigma_8$, it is not difficult to find an infrequent activity $K'$, which has an indirect data-dependent relationship with the frequent activity $F$. In trace $\sigma_9$, there is a low-frequency activity pair $TQ$, which is caused by an indirect data dependency between $R$ and $T$. Trace $\sigma_{12}$ occurs infrequently for the same reason as trace $\sigma_9$. It is obvious that there is a coupling relation between these infrequent behaviors and the particular data dependency. To capture infrequent but useful behavior, this paper proposes an effective infrequent behavior recognition approach based on frequent patterns under data awareness, which relies on the indirect data dependency between events. Furthermore, an optimized process model integrating data flow and control flow is obtained by incorporating infrequent behavior in the resulting process model, which increases the fitness of the process model and more accurately captures the important behaviors of the system.

Preliminaries
This section gives basic definitions of several terms used in this paper. The events in the log represent activities, and the event log is a collection of traces, with each trace corresponding to one execution of a process; the same trace may appear multiple times in the event log. The event log typically stores considerable additional information about each event, such as the executing resource (for example, a person or a device) and the timestamp of the event execution.
Definition 1 (process model [38]). A process model is a quadruple $N = (P, T, F, C)$ with (1) $P$ and $T$ as nonempty sets of places and transitions, respectively; […] or, pl, cy as the structure types of the process model for sequence, selection, parallel, and loop.
Definition 2 (weak order (log) [39]). Let $L$ be the event log. Let $A_L$ be the activity set of $L$. The weak order relation $\succ_L \subseteq A_L \times A_L$ contains all pairs $(x, y)$ for which there is a trace $\sigma = \langle a_1, \ldots, a_n \rangle$ in $L$ with $a_j = x$ and $a_k = y$ for some $1 \le j < k \le n$.
Definition 3 (behavioral profile (log) [40]). Let $L$ be the event log. Let $A_L$ be the activity set of $L$. A pair $(x, y) \in (A_L \times A_L)$ is in at most one of the following relations:
(1) The strict order relation $\rightarrow_L$, iff $x \succ_L y$ and $y \nsucc_L x$
(2) The exclusiveness relation $+_L$, iff $x \nsucc_L y$ and $y \nsucc_L x$
(3) The interleaving order relation $\|_L$, iff $x \succ_L y$ and $y \succ_L x$
Note that a pair $(x, y)$ is said to be in reverse strict order, denoted by $y \leftarrow_L x$, if and only if $x \rightarrow_L y$.
Definition 3 indicates that if no trace in the log contains both the activity $x$ and the activity $y$, then $x +_L y$. If there are two different traces such that $x \succ_L y$ holds in one and $y \succ_L x$ holds in the other, or there is a single trace in which both $x \succ_L y$ and $y \succ_L x$ hold, then $x \|_L y$. If there is a trace for which $x \succ_L y$ holds and there is no trace for which $y \succ_L x$ holds, then $x \rightarrow_L y$.
Definition 4 (causal behavioral profile (log) [40]). Let $L$ be the event log. Let $A_L$ be the activity set of $L$.
Clearly, the co-occurrence relation compensates for the optionality left open by the strict order relation. A causality holds between two activities $x$ and $y$ if they are in strict order $x \rightarrow_L y$ and every trace in the log that contains the activity $x$ also contains the activity $y$.
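The relations of Definition 3 can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the weak order is taken as "x occurs somewhere before y in at least one trace", the function names `weak_order` and `behavioral_profile` are hypothetical, and the toy log is invented for the example rather than taken from the paper.

```python
from itertools import product


def weak_order(log):
    """Return all pairs (x, y) such that x occurs before y in some trace."""
    pairs = set()
    for trace in log:
        for j in range(len(trace)):
            for k in range(j + 1, len(trace)):
                pairs.add((trace[j], trace[k]))
    return pairs


def behavioral_profile(log):
    """Classify every ordered pair of distinct activities (assumed reading of Definition 3)."""
    activities = {a for trace in log for a in trace}
    wo = weak_order(log)
    profile = {}
    for x, y in product(activities, repeat=2):
        if x == y:
            continue
        forward, backward = (x, y) in wo, (y, x) in wo
        if forward and backward:
            profile[(x, y)] = "interleaving"
        elif forward:
            profile[(x, y)] = "strict"
        elif backward:
            profile[(x, y)] = "reverse strict"
        else:
            profile[(x, y)] = "exclusive"
    return profile


# Hypothetical toy log: b and c appear in both orders, d never co-occurs with b.
toy_log = [("a", "b", "c", "g"), ("a", "c", "b", "g"), ("a", "d", "e")]
profile = behavioral_profile(toy_log)
print(profile[("b", "c")], profile[("a", "b")], profile[("b", "d")])
```

In this toy log, b and c are interleaving because both orders are observed, a is in strict order with b, and b and d are exclusive because no trace contains both.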

An Optimization Approach for Mining of Process Models with Infrequent Behaviors Integrating Data Flow and Control Flow
This section describes an approach for identifying effective infrequent behaviors from the perspective of data dependency and gives an algorithm to reconstruct optimized process models integrating data flow and control flow by incorporating effective infrequent behavior. Section 5.1 presents some relevant definitions and an algorithm for an effective infrequent behavior recognition approach based on frequent patterns under data awareness. Section 5.2 gives an optimization approach for mining of the process model with infrequent behaviors integrating data flow and control flow. The research framework of the proposed approach is shown in Figure 2.

Table 1: Trace sequences and their occurrence frequencies (rows $\sigma_8$ to $\sigma_{12}$).
Trace            Frequency   Activity sequence
$\sigma_8$       16          ABDFJKK′KLMRSW
$\sigma_9$       40          ABDFJLMRTQ
$\sigma_{10}$    10          DFABJLMRTW
$\sigma_{11}$    43          ABDFJJMLNRSW
$\sigma_{12}$    17          ABDEDFRSW

An Effective Infrequent Behavior Recognition Approach Based on Frequent Patterns under Data Awareness.
In this section, first (Section 5.1.1), some definitions related to the proposed approach are introduced, such as pattern, subsequence matching a pattern, interactive behavioral profile, and conditional dependency probability. Then, Section 5.1.2 elaborates on how to identify effective infrequent behaviors by using frequent patterns, interactive behavioral profiles, and conditional dependency probabilities.

The Relation between Infrequent Behavior and Data Dependency.
Prior to presenting the filtering approach, we present some basic notations used throughout the paper. Let $A$ denote the set of all possible activities and let $A^*$ denote the set of all finite sequences over $A$. A finite sequence $\sigma$ of length $n$ over $A$ is a function $\sigma: \{1, 2, \ldots, n\} \rightarrow A$, alternatively written as $\sigma = \langle a_1, a_2, \ldots, a_n \rangle$, where $a_i = \sigma[i]$ for $1 \le i \le n$. The empty sequence is written as $\varepsilon$. The concatenation of sequences $\sigma$ and $\sigma'$ is written as $\sigma \cdot \sigma'$. A sequence $\sigma' = \langle a'_1, a'_2, \ldots, a'_k \rangle$ is a subsequence of sequence $\sigma$ if and only if we can write $\sigma$ as $\sigma_1 \cdot \langle a'_1, a'_2, \ldots, a'_k \rangle \cdot \sigma_2$, where both $\sigma_1$ and $\sigma_2$ are allowed to be $\varepsilon$; in particular, $\sigma$ is a subsequence of itself. The beginning activity of a finite trace $\sigma$ is written as $firstAct(\sigma) = \sigma[1]$, and the end activity of a finite trace $\sigma$ is written as $lastAct(\sigma) = \sigma[n]$ with $|\sigma| = n$. The sets of all beginning activities and all ending activities in the event log $L$ are written as $startActs_L$ and $endActs_L$, respectively, where $startActs_L = \bigcup_{\sigma \in L} \{firstAct(\sigma)\}$ and $endActs_L = \bigcup_{\sigma \in L} \{lastAct(\sigma)\}$.
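The sequence notation above maps directly onto tuples. The following Python sketch is illustrative only; the function names are hypothetical, traces are assumed to be nonempty tuples of activity labels, and "subsequence" is taken in the contiguous sense used in the text.

```python
def is_subsequence(sub, trace):
    """True if trace can be written as s1 . sub . s2 (contiguous occurrence)."""
    sub, trace = tuple(sub), tuple(trace)
    k = len(sub)
    return any(trace[i:i + k] == sub for i in range(len(trace) - k + 1))


def first_act(trace):
    """firstAct(sigma) = sigma[1] in the 1-based notation of the text."""
    return trace[0]


def last_act(trace):
    """lastAct(sigma) = sigma[n] with n = |sigma|."""
    return trace[-1]


def start_acts(log):
    """Set of beginning activities over all traces of the log."""
    return {first_act(t) for t in log if t}


def end_acts(log):
    """Set of ending activities over all traces of the log."""
    return {last_act(t) for t in log if t}


# Hypothetical toy log used only for illustration.
toy_log = [("a", "b", "c", "g"), ("d", "e", "f")]
print(is_subsequence(("b", "c"), toy_log[0]))  # contiguous occurrence
print(is_subsequence(("a", "c"), toy_log[0]))  # a and c are not adjacent
print(start_acts(toy_log), end_acts(toy_log))
```

The empty tuple and the trace itself both count as subsequences, in line with allowing the prefix and suffix to be empty.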
Consider the event log $L_1$, which includes five traces, where $\sigma_1 = \langle a, b, c, g \rangle^{10}$, […], $\langle \ldots \rangle^{4}$ (the superscript of a trace indicates the number of times the trace appears in the log). Figure 3 shows the directly-follows graph $G(L_1)$ of log $L_1$. In the directly-follows graph, each node represents an activity, and an edge represents the directly-follows relationship between two activities in a trace. The start node of the log is marked by a dedicated symbol, double circles (as around $g$) mark an end node of the log, a line with a double-sided arrow indicates that two activities are in a concurrent relationship, and a line with a single-sided arrow indicates a directly-follows relationship.
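A directly-follows graph of this kind can be derived from a log in a few lines. The Python sketch below is a minimal illustration, assuming the log is given as a mapping from trace to frequency; the function name and the toy log are hypothetical (only the trace ⟨a, b, c, g⟩ with frequency 10 comes from the text, the second trace and its frequency are invented).

```python
from collections import Counter


def directly_follows_graph(log):
    """Return a Counter mapping each edge (a, b) to how often b directly follows a."""
    edges = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += freq
    return edges


# Hypothetical toy log: trace -> number of occurrences.
toy_log = {("a", "b", "c", "g"): 10, ("a", "c", "b", "g"): 5}
edges = directly_follows_graph(toy_log)

# Pairs observed in both directions are candidates for a concurrent relationship.
both_ways = {frozenset(e) for e in edges if (e[1], e[0]) in edges}
print(edges[("a", "b")], edges[("c", "b")], both_ways)
```

Edges that appear in both directions, such as (b, c) and (c, b) here, correspond to the double-sided arrows described for the graph.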
According to Definition 3, we obtain the behavioral profile between activities in the log $L_1$ shown in Figure 4. Clearly, $\sigma_1$ and $\sigma_2$ are distinct traces, but they are in fact behaviorally equivalent, as activity $b$ and activity $c$ are in an interleaving order relation. The same is true for traces $\sigma_4$ and $\sigma_5$. Therefore, traditionally treating a trace purely as a sequence is too imprecise. We consider sequences with equivalent behavior to be the same, even if the sequences themselves differ. To analyze the frequency and correctness of the subsequences included in a trace from the behavior perspective, we define a pattern that considers all types of structures: sequence, selection, concurrency, and loop.
Definition 5 (pattern). Given the event log $L$, let $A_L$ be the activity set of $L$. Let $G(L)$ be the directly-follows graph of log $L$. All vertices of $G(L)$ are written as $V(G)$, and all edges of $G(L)$ are written as $E(G)$. When a connected subgraph satisfies the following two conditions, we call it a pattern of log $L$: […]
Definition 5 indicates that a pattern represents part of the behavior of the traces in the event log, and all activities in the pattern have the same behavioral profile relation with any activity that is not in this pattern. For convenience, the vertices of the pattern $Patt$ are written as $V(Patt)$, the edges of the pattern $Patt$ are written as $E(Patt)$, and the pattern to which the activity $a$ belongs is written as $Patt(a)$.
In the example provided in Figure 5, according to Definition 5, for $b, c \in V(Patt\,2)$, $d \in V(Patt\,3)$, and $d \notin V(Patt\,2)$, we have $b +_L d \Rightarrow c +_L d$. For $a \in V(Patt\,1)$ and $a \notin V(Patt\,2)$, we have $b \leftarrow_L a \Rightarrow c \leftarrow_L a$.
Scientific Programming
To better express the interaction behavior between patterns, the concept of an interactive behavioral profile is introduced as follows.
Definition 6 (interactive successor relationship and interactive input (or output) arc). Given the event log $L$, let $A_L$ be the activity set of $L$ and $Patt$ be one of the patterns of the log $L$. For $\forall a, b \in A_L$, we denote by $a \preceq_I b$ an interactive successor relationship between the activity $a$ and the activity $b$ if $Patt(a) \ne Patt(b)$ and there exists a trace $\sigma \in L$ of the form $\sigma = \langle \ldots, a, b, \ldots \rangle$. In this case, we say that the activity $b$ has an interactive input arc and the activity $a$ has an interactive output arc.
Definition 6 indicates that when the activity $a$ and the activity $b$ belong to different patterns and a trace of the form $\langle \ldots, a, b, \ldots \rangle$ exists, there is an interactive successor relationship between $a$ and $b$. For instance, in Figure 5, there are two interactive successor relationships between patterns $Patt\,1$ and $Patt\,2$: $a \preceq_I b$ and $a \preceq_I c$.
An activity $a$ is referred to as an entry node of pattern $Patt$ if $a \in startActs_L$ or $a$ has an interactive input arc. Similarly, an activity $a$ is referred to as an exit node of the pattern if $a \in endActs_L$ or $a$ has an interactive output arc. As a pattern may have multiple entry nodes or multiple exit nodes, the set of entry nodes of pattern $Patt$ is written as $entry(Patt)$, and the set of exit nodes of pattern $Patt$ is written as $exit(Patt)$.
In the figures of a pattern, the interactive input arc of a node $a$ is drawn as an arc entering $a$, and its interactive output arc is drawn as an arc leaving $a$.
According to Definition 5, the directly-follows graph of the aforementioned log $L_1$ can be divided into three highly cohesive, loosely coupled subpatterns, as shown in Figures 5(a)-5(c).
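The interactive successor relationship and the entry and exit nodes of a pattern can be computed once every activity has been assigned to a pattern. The Python sketch below follows the textual reading of Definition 6 (two directly consecutive activities from different patterns); the assignment `pattern_of`, the pattern labels, the toy log, and the function names are hypothetical and only serve the illustration.

```python
def interactive_successors(log, pattern_of):
    """Pairs (a, b) with Patt(a) != Patt(b) that occur as <..., a, b, ...> in some trace."""
    rel = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            if pattern_of[a] != pattern_of[b]:
                rel.add((a, b))
    return rel


def entry_exit_nodes(log, pattern_of, pattern):
    """Entry nodes: start activities or targets of an interactive arc; exit nodes analogously."""
    rel = interactive_successors(log, pattern_of)
    members = {a for a, p in pattern_of.items() if p == pattern}
    starts = {t[0] for t in log if t}
    ends = {t[-1] for t in log if t}
    has_input = {b for _, b in rel}
    has_output = {a for a, _ in rel}
    entry = {a for a in members if a in starts or a in has_input}
    exits = {a for a in members if a in ends or a in has_output}
    return entry, exits


# Hypothetical pattern assignment and toy log.
pattern_of = {"a": "P1", "b": "P2", "c": "P2", "d": "P3", "e": "P3", "f": "P3"}
toy_log = [("a", "b", "c"), ("a", "c", "b"), ("a", "d", "e", "f", "d", "e")]

print(interactive_successors(toy_log, pattern_of))
print(entry_exit_nodes(toy_log, pattern_of, "P3"))
```

With this assignment, the only arcs that cross pattern boundaries leave a, so a is both the entry and the exit node of its own pattern, while d is the single entry node of the loop pattern.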
Definition 7 (subsequence matching a pattern). Let $L$ be an event log over a set of activities $A$. Let pattern $Patt$ be a subpattern of log $L$. A subsequence $\sigma' \in A^*$ in a trace is said to match the pattern $Patt$, denoted as $\sigma' \sqsubseteq Patt$, when $\sigma' = ActSeq(path(s, e))$ with $s \in entry(Patt)$ and $e \in exit(Patt)$ holds, where $path(s, e)$ denotes a path from node $s$ to node $e$, and $ActSeq(path(s, e))$ denotes the sequence of activities that consists of all nodes on the path from node $s$ to node $e$.
Definition 7 states that a subsequence matches the pattern $Patt$ when it corresponds to the sequence of all nodes on a path from an entry node to an exit node of $Patt$. The pattern that matches a subsequence $\sigma$ is denoted as $Patt(\sigma)$.
For example, the subsequences $\langle a \rangle$ and $\langle d, e, f, d, e \rangle$ of the trace $\sigma_4 = \langle a, d, e, f, d, e \rangle$ in log $L_1$ match patterns $Patt\,1$ and $Patt\,3$, respectively. Different subsequences that match the same pattern are considered to be behaviorally equivalent; for instance, although $\sigma_1$ and $\sigma_2$ are two different sequences, they are considered to be behaviorally equivalent. The same holds for $\sigma_4$ and $\sigma_5$.
Definition 8 (interactive behavioral profile (pattern)). Given the event log $L$, let $A_L$ be the activity set of $L$ and $\preceq_I$ be an interactive successor relationship. The interactive behavioral profile is the 3-tuple $(\rightarrow_I, \preceq_I, +_I)_L$ defined by […]
Also, we say that a transition pair $(b, a)$ is in reverse strict order of the interaction, denoted by $b \leftarrow_I a$, if and only if the transition pair $(a, b)$ satisfies the strict order of the interaction, i.e., $a \rightarrow_I b$.
According to Definition 8, the interactive behavioral profile of patterns Patt 1, Patt 2, and Patt 3 is shown in Figure 6.
An infrequent trace occurs at low frequencies either because it contains low-frequency events or because it contains low-frequency subsequences, along with a large number of high-frequency subsequences. How can the frequency of occurrence of subsequences or activities be determined? To solve this problem, the activity frequency and pattern frequency are both given below.
Definition 9 (activity frequency [24]). Given the event log $L$, let $A_L$ be the activity set of $L$. The frequency $ActFreq(a)$ of an activity $a \in A_L$ is defined as […] Given a frequency threshold $minActFreq$, an activity $a$ is frequent iff $ActFreq(a) \ge minActFreq$.
A pattern reflects the structural behavior relationship between activities. Since multiple different subsequences with the same behavior correspond to the same pattern, it is more accurate to measure their frequency by using the frequency of patterns.
Definition 10 (pattern frequency). Let $L$ be an event log over a set of activities $A$. Let pattern $Patt$ be a subpattern of log $L$. The frequency $PFreq(Patt)$ of a pattern $Patt$ is defined as […] Given a frequency threshold $minPFreq$, a pattern $Patt$ is frequent iff $PFreq(Patt) \ge minPFreq$.
Theorem 1. Let $L$ be an event log over a set of activities $A$. Let pattern $Patt$ be a subpattern of log $L$. Given a subsequence […]
Proof. According to Definition 12, we can easily obtain this conclusion.
Frequency-based filtering techniques only consider direct dependencies between activity pairs, whereas in some infrequent traces every directly-follows relation between activity pairs is itself frequent. For instance, $\langle a, b \rangle$ and $\langle b, c \rangle$ may be frequent activity pairs in the log, while the subsequence $\langle a, b, c \rangle$ composed of $\langle a, b \rangle$ and $\langle b, c \rangle$ is a low-frequency subsequence. Using only the direct dependency between activity pairs will not reveal that $\langle a, b, c \rangle$ is an infrequent subsequence. Therefore, filtering infrequent behavior only by the frequency of occurrence of a single activity pair is too imprecise. In this case, it is necessary to compute the probability that a certain activity directly occurs after a longer subsequence. Definitions 11 and 12 compute the number of conditional occurrence times and the conditional dependency probability, respectively, of an activity directly following a subsequence when the data dependency condition $C$ exists between the subsequence and the activity.
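The effect described above is easy to reproduce on a small example. The following Python sketch counts contiguous occurrences of a subsequence, weighted by trace frequency; the function name and the toy log with its frequencies are hypothetical and chosen only to show that both pairs can be frequent while their combination is rare.

```python
def count_contiguous(log, sub):
    """Frequency-weighted number of contiguous occurrences of sub in the log."""
    sub = tuple(sub)
    k = len(sub)
    total = 0
    for trace, freq in log.items():
        hits = sum(1 for i in range(len(trace) - k + 1) if trace[i:i + k] == sub)
        total += hits * freq
    return total


# Hypothetical toy log: trace -> number of occurrences.
toy_log = {
    ("a", "b", "d"): 50,
    ("e", "b", "c"): 50,
    ("a", "b", "c"): 2,
}

print(count_contiguous(toy_log, ("a", "b")))       # pair <a, b>
print(count_contiguous(toy_log, ("b", "c")))       # pair <b, c>
print(count_contiguous(toy_log, ("a", "b", "c")))  # the combined subsequence
```

A filter that looks only at the two pairs would accept every trace in this log, although the combined subsequence occurs in just 2 of 102 cases.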
In Definition 11, $Patt(\sigma')$ denotes the pattern that the subsequence $\sigma'$ matches.
For example, consider traces $\sigma_1$ and $\sigma_2$ of the previously mentioned log $L_1$, and let (1) the activity $g$ follow directly after the subsequence $bc$ with the latest attribute value $x_1$ in the trace $\langle a, b, c, g \rangle$, and (2) the activity $g$ also follow directly after the subsequence $cb$ with the latest attribute value $x_1$ in the trace $\langle a, c, b, g \rangle$.
According to Definition 11, $COT(bc >_{C,L} g) = COT(cb >_{C,L} g) = 15$. Definition 11 thus counts, under the same conditions, all occurrences in which any of several behaviorally equivalent subsequences (arising, for example, from concurrency or loops) is directly followed by the same activity.
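One possible reading of this counting scheme is sketched below in Python. Since the formal definition is not reproduced here, everything in the sketch is an assumption: events are modeled as (activity, data) pairs, the "latest attribute value" is taken to be the most recently written value of one attribute before the following activity, the set of behaviorally equivalent subsequences is passed in explicitly, and the attribute name `x`, the frequencies 5 and 3, and the function names are hypothetical.

```python
def latest_value(events, upto, attribute):
    """Most recent value of the attribute among events[0:upto], or None if never written."""
    for _, data in reversed(events[:upto]):
        if attribute in data:
            return data[attribute]
    return None


def conditional_occurrence_times(log, equivalent_subseqs, activity, attribute, value):
    """Count how often any equivalent subsequence is directly followed by `activity`
    while the latest value of `attribute` equals `value` (assumed reading)."""
    count = 0
    for events, freq in log:
        names = tuple(name for name, _ in events)
        for sub in equivalent_subseqs:
            k = len(sub)
            for i in range(len(names) - k):
                if (names[i:i + k] == sub and names[i + k] == activity
                        and latest_value(events, i + k, attribute) == value):
                    count += freq
    return count


# Hypothetical log: (list of (activity, written attributes), frequency).
toy_log = [
    ([("a", {}), ("b", {"x": "x1"}), ("c", {}), ("g", {})], 10),
    ([("a", {}), ("c", {"x": "x1"}), ("b", {}), ("g", {})], 5),
    ([("a", {}), ("b", {"x": "x2"}), ("c", {}), ("g", {})], 3),
]
equivalent = {("b", "c"), ("c", "b")}  # b and c are assumed to be interleaving
cot = conditional_occurrence_times(toy_log, equivalent, "g", "x", "x1")
print(cot)
```

Because the two orders of b and c are treated as one behavior, occurrences from both kinds of traces accumulate in a single count, while traces with a different attribute value are left out.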
Behavioral dependencies between activities in real-world systems may be affected by direct or indirect data dependency between activities. To capture the strength of behavioral dependence they cause, Definition 12 further gives the concept of conditional dependency probability between the subsequence and the activity based on the literature [20].
Definition 12 (conditional dependency probability). Let $L$ be an event log over a set of activities $A$. Given a subsequence $\sigma' \in A^*$, an activity $a \in A$, and dependency conditions $C$, we write $CDP(\sigma' \Rightarrow a)$ to represent the conditional dependency probability that the subsequence with the latest attribute value $x$ is directly followed by the activity $a$ under the dependency conditions $C$; we denote $CDP(\sigma' \Rightarrow a)$ as […] where $\bar{a}$ represents the activities other than activity $a$ and $occurTimes_L$ […]
Obviously, the value of $CDP(\sigma' \Rightarrow a)$ is a real number in $(-1, 1)$. When the dependency condition $C$ has the latest attribute value $x$, the higher the value of $CDP(\sigma' \Rightarrow a)$, the more likely the activity $a$ directly follows the subsequence $\sigma'$.
If an infrequent subsequence has a high value of $CDP(\sigma' \Rightarrow a)$, it can be judged to be a correct infrequent behavior.
For a given conditional dependency probability threshold $\theta_{dep}$, $\sigma' \cdot a$ is considered to be a reasonable subsequence under the current data dependency iff $CDP(\sigma' \Rightarrow a) \ge \theta_{dep}$.
Definition 13 (special data dependency). Given an event log $L$, a subsequence $\sigma$, activities $a, b \in A$, and dependency conditions $C$ or $C'$, $C$ or $C'$ is regarded as a special data dependency if the following two conditions are met: […] where $\theta$ is a frequency threshold, $|L|$ is the total number of traces in the log, and $C_{value} = x$ represents that the dependency condition $C$ has the latest attribute value $x$.
Figure 6: The interactive behavioral profile of Patt 1, Patt 2, and Patt 3.

An Effective Infrequent Behavior Recognition Approach Based on Frequent Patterns under Data Awareness.
Definition 12 in Section 5.1.1 quantifies the strength of the influence of data dependency on the behavioral dependency between an activity and a subsequence, which provides a basis for the identification of effective infrequent behaviors. This section provides an effective infrequent behavior recognition approach based on frequent patterns under data awareness.
For the frequent traces in the log, a number of patterns with high cohesion and low coupling in behavior can be constructed from their directly-follows graph and behavioral profile. It is easy to determine that these subpatterns are frequent patterns. Since an infrequent trace often contains some frequent behavior in addition to infrequent behavior, there may be a direct or indirect data dependency between them. To make full use of this dependency and accurately capture its impact on behavioral relationships, Algorithm 1 first finds out which part of the behavior in the trace occurs with low frequency. Then, it checks whether special data context information exists in the context of the infrequent behavior. If it exists, the conditional dependency probability is used to analyze the influence strength of the data flow information on the infrequent behavior. If its value reaches a given threshold, the infrequent behavior is considered to be an infrequent but correct behavior; otherwise, it is considered to be noise.
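The final decision described in this paragraph amounts to a small rule. The Python sketch below is only an illustration of that rule, not a rendering of Algorithm 1: the function name is hypothetical, the conditional dependency probability is assumed to be computed elsewhere and passed in, `None` is used to signal that no special data context was found, and treating that case as noise is an assumption. The threshold value 0.8 is likewise invented.

```python
from typing import Optional


def judge_infrequent_behavior(cdp: Optional[float], theta_dep: float) -> str:
    """Label an infrequent behavior as 'effective' or 'noise'.

    cdp is the conditional dependency probability of the behavior, or None
    when no special data context exists (assumed to be treated as noise).
    """
    if cdp is None:
        return "noise"
    return "effective" if cdp >= theta_dep else "noise"


# Hypothetical threshold and probability values.
print(judge_infrequent_behavior(0.9, theta_dep=0.8))
print(judge_infrequent_behavior(0.3, theta_dep=0.8))
print(judge_infrequent_behavior(None, theta_dep=0.8))
```

The comparison uses "greater than or equal", matching the condition stated for the threshold $\theta_{dep}$.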
Step 1-Step 11 in Algorithm 1 analyze the validity of the infrequent trace in terms of the infrequent activity, where Step 1-Step 3 determine whether there is an infrequent activity in the trace and if it exists, Step 4-Step 11 determine the correctness of the infrequent activity occurrence from a data dependence perspective.
Step 12-Step 33 analyze the validity of the infrequent trace in terms of the infrequent subsequence.
Step 12 divides the trace σ into several subsequences according to the activity set in the frequent subpatterns. The divided subsequence is either a frequent subsequence or an infrequent subsequence. If the divided subsequence includes a smaller infrequent subsequence, Step 16-Step 21 analyze the correctness of the infrequent subsequence from the perspective of data dependence. If the divided subsequences are all legal subsequences, Step 21-Step 33 determine whether the interaction behavior between the subsequences is reasonable according to the interaction behavior profile of the frequent subpatterns. If it is unreasonable, the correctness of infrequent interaction between them is judged from the data dependency perspective.
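One possible reading of the division in Step 12 is sketched below. It assumes that the activity sets of the frequent subpatterns are disjoint and that a trace is cut wherever two consecutive activities belong to different subpatterns; both assumptions, as well as the function name `split_by_patterns` and the pattern names, are introduced here for illustration only. The sketch does not check whether each resulting subsequence actually conforms to its subpattern.

```python
from itertools import groupby


def split_by_patterns(trace, pattern_activities):
    """Split `trace` into maximal runs of consecutive activities that belong
    to the same frequent subpattern.

    pattern_activities maps a pattern name to its set of activities (assumed
    disjoint). Activities found in no pattern are grouped under None.
    """
    owner = {act: name
             for name, acts in pattern_activities.items()
             for act in acts}
    return [(name, list(run))
            for name, run in groupby(trace, key=lambda a: owner.get(a))]


# Hypothetical subpatterns and trace.
patterns = {"Patt1": {"a", "b"}, "Patt2": {"c", "d"}}
parts = split_by_patterns(["a", "b", "c", "d", "z"], patterns)
```

Here `parts` contains one run for `Patt1`, one for `Patt2`, and a final run under `None` for the activity `z`, which would be a candidate infrequent subsequence to be examined from the data dependency perspective.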

An Optimization Approach for Mining of Process Models with Infrequent Behaviors Integrating Data Flow and Control Flow.
Algorithm 1 analyzes the effectiveness of infrequent behavior based on direct or indirect data dependencies between events. On this basis, Algorithm 2 further gives an optimization approach for mining of the process model with infrequent behaviors integrating data flow and control flow. First, the initial process model based on the control flow is constructed from the frequent traces by using the IMi mining algorithm. Then, Algorithm 1 is used to identify all the effective infrequent behaviors in the event log. Finally, an optimized process model integrating data flow and control flow is reconstructed by incorporating all the effective infrequent behaviors into the initial process model.
Step 1-Step 4 in Algorithm 2 preprocess the traces in the event log according to their occurrence frequency and divide the log into two sets, FilterLog and Outlier, where FilterLog represents all frequent traces and Outlier represents the infrequent traces that need to be analyzed. The initial process model is built by applying the IMi mining algorithm on the set FilterLog in Step 5. Incomplete traces that do not start or end normally are deleted and simultaneously added to the set Noise in Step 6-Step 11.
Step 12-Step 14 use Algorithm 1 to determine whether each trace in the set Outlier is an effective infrequent trace and further divide the traces into two subsets: the effective infrequent trace set Infrequent and the noise set Noise. An optimized process model M 2 fusing control flow and data flow is obtained by adding the infrequent behaviors in the set Infrequent to the initial process model M 1 in Step 15.
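The log partitioning performed by Step 1-Step 14 can be sketched as below. The sketch makes several assumptions: traces are non-empty tuples of activity names, the frequency of a trace is its number of occurrences in the log, and the check of Algorithm 1 is passed in as a hypothetical predicate `is_effective`. Model discovery with IMi (Step 5) and the model extension (Step 15) are deliberately left out.

```python
from collections import Counter


def partition_log(log, min_freq, is_effective):
    """Split `log` into frequent traces, effective infrequent traces, and noise.

    log          -- list of non-empty traces, each a tuple of activity names
    min_freq     -- minimum number of occurrences for a trace to be frequent
    is_effective -- stand-in for Algorithm 1: trace -> bool
    """
    counts = Counter(log)
    filter_log = [t for t in log if counts[t] >= min_freq]
    outlier = [t for t in log if counts[t] < min_freq]

    # Start and end activities observed in the frequent traces.
    start_acts = {t[0] for t in filter_log}
    end_acts = {t[-1] for t in filter_log}

    infrequent, noise = [], []
    for t in outlier:
        if t[0] not in start_acts or t[-1] not in end_acts:
            noise.append(t)        # incomplete trace: abnormal start or end
        elif is_effective(t):
            infrequent.append(t)   # infrequent but correct behavior
        else:
            noise.append(t)
    return filter_log, infrequent, noise


# Hypothetical toy log and predicate.
toy_log = [("a", "b", "c")] * 3 + [("a", "x", "c"), ("b", "c"), ("a", "y", "c")]
frequent, infrequent, noise = partition_log(
    toy_log, min_freq=2, is_effective=lambda t: "x" in t)
```

With this toy input, the trace starting with `b` is discarded as incomplete before the predicate is consulted, which mirrors the order of Step 9-Step 11 and Step 12-Step 14.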

Evaluations and Results
In this section, we report controlled experiments on synthetic and real-life event logs that compare our approach with existing approaches, and we discuss the results. First, in Section 6.1, we illustrate the solution steps of the proposed infrequent behavior identification approach using the synthetic event log shown in Section 3 and then report the number of infrequent behaviors correctly identified by our approach and by other approaches. Then, in Section 6.2, we compare the proposed approach with other approaches by measuring the quality of the discovered process model when different levels of infrequent behavior are injected into the real-life logs. These experiments were performed on an Intel i7-6500 processor and 8 GB of RAM (2.50).

Synthetic Dataset.
To verify the correctness of Algorithm 1, the event log given in Section 3 is taken as an example to elaborate on how Algorithm 1 identifies effective infrequent behavior. First, the causal behavioral profile is obtained from the frequent traces in the event log, as shown in Figure 7 (note that the subscript L of the behavioral profile is omitted here). Six maximal frequent patterns obtained according to the behavioral profile in Figure 7 are presented in Figure 8. The corresponding interactive behavior profiles between them are shown in Figure 9.
For the infrequent traces σ 7 , σ 8 , σ 9 , σ 10 , σ 11 , and σ 12 , since the end activity of trace σ 7 is an abnormal end activity, it is easy to determine that it is noise. In the event log, the attributes of some activities and their attribute values of infrequent traces σ 8 , σ 9 , σ 10 , σ 11 , and σ 12 are shown in Table 3 to Table 7, respectively. The conditional dependency probabilities between certain activities and subsequences are computed according to Algorithm 1, as displayed in Table 8.
The results show that when the conditional dependency probability threshold θ dep = 0.7, the traces σ 8 , σ 9 , and σ 12 are considered to be effective infrequent behaviors using the proposed approach.
Subsequently, we evaluate the ability of the proposed approach to identify effective infrequent behaviors compared to the IMi algorithm [17], the FM algorithm [19], and the DHM algorithm [20]. Table 9 indicates that the proposed approach can correctly identify more effective infrequent behaviors than the other approaches, whereas the DHM algorithm may mistake an incorrect infrequent trace for a correct one.
Finally, the optimized process model M 2 fusing control flow and data flow is constructed by incorporating these infrequent behaviors into the process model M 1 , as shown in Figure 10. The transitions d i (1 ≤ i ≤ 6) in process model M 2 are unobservable activities representing data flow; they have been added for routing purposes only and do not appear in the event log. Measured with the approach proposed in [41], the fitness of the model M 2 improves to 0.993, whereas the fitness of the initial model M 1 is 0.939.
Input: an event log L, an infrequent trace σ, a frequency threshold of activity minActfreq, m frequent subpatterns Patt j (1 ≤ j ≤ m), conditional dependency probability threshold θ dep .
Output: a Boolean value indicating whether the trace σ is infrequent but correct or not.
Step 1: for i = 1 to |σ| do
  Step 2: compute Actfreq(σ[i]) using Definition 9 //determine whether there exist infrequent activities in the trace or not
  Step 3: if Actfreq(σ[i]) < minActfreq then
    //If the infrequent activity exists, it is necessary to determine whether there exists a special conditional dependency between the activity and the previous subsequence
    Step 5: if (a special data dependency C exists between σ and w in the trace σ) then
      Step 6: compute CDP(σ⇒ w) using Definition 12
      Step 7: if CDP(σ⇒ w) ≥ θ dep then
        Step 8: return true;
      Step 9: else return false;
    Step 10: else return false;
  Step 11: else i++ //continue to determine whether the next activity is an infrequent activity
//If an infrequent activity does not exist, check whether an infrequent subsequence in the trace exists
Step 12: according to the activities in frequent subpatterns, the trace σ can be divided into several subsequences, let σ = σ 1 •σ 2 • · · · •σ n
Step 13: for (each σ i ) do
  Step 14: if ∃Patt j (1 ≤ j ≤ m) such that σ i ⊑ Patt j then
    Step 15: i++;
  //Case 1: an infrequent subsequence exists, i.e., a subsequence does not match any of the frequent subpatterns
    Step 17: if (a special data dependency C exists between σ and w in a trace σ) then
      Step 18: compute CDP(σ⇒ w) using Definition 12
      Step 19: if CDP(σ⇒ w) ≥ θ dep then return to Step 15
      Step 20: else return false;
    Step 21: else return false;
Step 22: if (i == n)
  //Check whether the interaction behavior between all legal subsequences is consistent with the interaction behavior profile between frequent subpatterns
  Step 23: for (each pair of adjacent subsequences σ k and σ k+1 ) do
    Step 24: if (the interaction behavior between σ k and σ k+1 is not consistent with the interaction behavior profile between corresponding frequent subpatterns)
      Step 25: then let σ = σ k , w = σ k+1 [1] (where σ k+1 [1] represents the first activity of σ k+1 )
      Step 26: if (a special data dependency C exists between σ and w in trace σ) then
        Step 27: compute CDP(σ⇒ w) using Definition 12
        Step 28: if CDP(σ⇒ w) ≥ θ dep then
          Step 29: k++
        Step 30: else return false
      Step 31: else return false
    Step 32: else k++; //if consistent, continue to judge the behavior relationship between the next adjacent subsequence
  Step 33: if (k == n − 1) then return true; //the interaction behavior between all subsequences is reasonable
ALGORITHM 1: An effective infrequent behavior recognition approach based on frequent patterns under data awareness.
Algorithm 1 uses two thresholds: a frequency threshold of activity and a conditional dependency probability threshold. The former differentiates frequent activities from infrequent activities, while the latter differentiates effective infrequent behaviors from noneffective infrequent behaviors. In practice, the performance of identifying effective infrequent behaviors is mainly affected by the conditional dependency probability threshold. To illustrate how varying this threshold affects the identification of effective infrequent behaviors, we designed an experiment to measure the amount of effective infrequent behavior correctly identified by our technique. Here, we use the previous synthetic log to evaluate the effect of different threshold levels on Algorithm 1 by incrementally injecting infrequent behaviors. As shown in Figure 11, the rate of effective infrequent behavior correctly recognized generally decreases as the threshold increases. When the threshold is set to a high value, infrequent behaviors that rarely happen (i.e., only once or twice) and some infrequent behaviors that share the same data dependency conditions with other traces containing recorded errors cannot be identified correctly.

Real-Life Dataset.
We designed a simulation experiment using the claims data package provided by an insurance company platform. The data are from the company's Insurance Service Platform-Log Data1, including 980 cases, 13,280 events, 27 activities, and 12 attributes. For Log Data1, we compare the proposed approach, the IMi algorithm, and the DHM algorithm on precision [42] and fitness [41] to evaluate the quality of the discovered model when 1% to 9% infrequent behavior is injected into the event log. In many cases, there is a trade-off between these two metrics. To balance them, the F-score is often used to combine fitness and precision through their harmonic mean, 2 × (fitness × precision)/(fitness + precision). In Figures 12-14, the abscissa corresponds to the ratio of the injected infrequent behavior, and the ordinate corresponds to the fitness, precision, and F-score value, respectively.
Figure 12 shows that the proposed approach can find more infrequent behaviors than the IMi and DHM approaches, and it significantly improves the fitness of the discovered model. Since the IMi algorithm only filters infrequent behaviors based on frequency from the control-flow perspective, many infrequent behaviors are disregarded as noise. Therefore, the fitness of the resulting model is relatively low. With the increase in infrequent behavior, the overall fitness of the three approaches declines.
Figure 7: Behavior profile of frequent traces in the log of scheduled tickets in the train booking system.
Input: an event log L, a frequency threshold of a trace minfreq
Output: a process model integrating data flow and control flow
Step 1: for (each trace σ in L) do
  Step 2: if |σ| ≥ minfreq then
    Step 3: FilterLog = FilterLog ∪ σ
  Step 4: else Outlier = Outlier ∪ σ
//calculate the start activities and the end activities of frequent traces
Step 5: the initial process model M 1 is constructed by applying the IMi algorithm on the set FilterLog
Step 6: for (each trace σ in FilterLog) do
  Step 7: compute firstAct(σ i ), lastAct(σ i )
Step 8: compute startActs FilterLog = ∪ ∀σ i ∈L firstAct(σ i ) and endActs FilterLog = ∪ ∀σ i ∈L lastAct(σ i )
Step 9: for (each trace σ in Outlier) do
  Step 10: if (σ[1] ∉ startActs FilterLog or σ[n] ∉ endActs FilterLog ), where |σ| = n then
    Step 11: Noise = Noise ∪ σ and Outlier = Outlier − σ
Step 12: for (each trace σ in Outlier) do
  Step 13: if IsInfrequent(σ) == true then
    Step 14: Infrequent = Infrequent ∪ σ and Outlier = Outlier − σ
  else Noise = Noise ∪ σ and Outlier = Outlier − σ
Step 15: an optimization process model M 2 is obtained by adding data flow and control flow to the initial process model M 1 to incorporate all infrequent but correct behaviors in Infrequent
ALGORITHM 2: An optimization approach for mining of process models with infrequent behaviors integrating data flow and control flow.
Figure 9: Interactive behavior profile of frequent patterns.
Figure 13 indicates that the precision of the model obtained by the proposed approach is higher than that of the others when fewer infrequent behaviors are injected. This may be because adding data flow to the resulting model reduces additional behaviors. The precision of the DHM algorithm is relatively low: although it can find more infrequent behavior, more control flow is added to the resulting causal net without data flow information, which makes the discovered model more complicated. With the increase in infrequent behavior, the overall precision of the three approaches shows a downward trend.
Figure 14 shows that the F-score of the proposed approach is generally superior to that of the IMi and DHM approaches. Because the increase in infrequent behavior may lead to a small decrease in precision, in some cases it is slightly lower than the others.
The experimental results on synthetic and real logs show that our approach noticeably improves the fitness of the discovered process model without significantly reducing its precision. Thus, our approach is promising in that it can preserve the effective infrequent behavior representing important information of the system when discovering the process model. Hence, the proposed approach provides better support for enterprise business improvement.
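The F-score used in this comparison is the harmonic mean of fitness and precision. A minimal sketch of the computation is given below; the function name `f_score` and the guard for the degenerate case where both inputs are zero are choices made for this illustration, and the example values passed to it are hypothetical rather than measurements from the experiments.

```python
def f_score(fitness, precision):
    """Harmonic mean of fitness and precision:
    2 * (fitness * precision) / (fitness + precision)."""
    if fitness + precision == 0:
        return 0.0  # assumed convention when both metrics are zero
    return 2 * fitness * precision / (fitness + precision)


# Hypothetical values, only to show how the metric behaves.
balanced = f_score(0.5, 0.5)
skewed = f_score(0.9, 0.6)
```

Because the harmonic mean is dominated by the smaller of the two values, a model that gains fitness at a large cost in precision does not obtain a high F-score, which is why the metric is suited to the trade-off discussed above.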

Conclusions and Future Work
In this paper, an effective infrequent behavior recognition approach based on frequent patterns under data awareness is presented. It analyzes the coupling between infrequent behavior and the data dependency information and uses the conditional dependency probability to quantify the influence strength between them. This approach provides a qualitative and quantitative analysis for the identification of effective infrequent behavior and captures long-term dependencies between the activity and the frequent pattern, not only directly-follows data dependencies. Moreover, an optimization approach for mining of process models with infrequent behaviors integrating data flow and control flow is provided in this paper. We compared the proposed approach with other techniques, showing that our approach discovers infrequent behavior that other techniques cannot detect. Furthermore, the evaluation on synthetic and real-life event logs indicates that incorporating infrequent but correct behavior, by adding appropriate data flow and control flow to the resulting process model, greatly improves the fitness of the discovered process model without significantly reducing its precision.

In the future, the proposed approach will be applied to more application fields, and various factors that lead to infrequent behavior occurrence from a data-flow perspective will be further studied. Association rules will be used to reveal data dependency between activities to provide a better basis for the recognition of infrequent behaviors.

Data Availability
The data used to support the findings of this study were supplied by an insurance company under license and so cannot be made freely available. Requests for access to these data should be made to slxjx@aust.edu.cn.

Conflicts of Interest
The authors declare that they have no potential conflicts of interest.