Event-Tree Based Sequence Mining Using LSTM Deep-Learning Model

During the operation of modern technical systems, LSTM models are widely used to predict process variable values and system states. The goal of this paper is to extend the application of LSTM-based models so that more information can be obtained from the prediction. In the proposed method, transition probabilities are predicted and the output layer is interpreted as a probability model, creating a prediction tree of sequences instead of just a single sequence. By further analyzing the prediction tree, we can take risk considerations into account, extract more complex predictions, and analyze what event trees are yielded from different input sequences; that is, for a given state or input sequence, the upcoming events and the probabilities of their occurrence are considered. In online application, by utilizing a series of input events and the probability trees, it is possible to predetermine subsequent event sequences. The applicability and performance of the approach are demonstrated on a dataset in which the occurrence of events is predetermined, and on further datasets generated with a higher-order decision tree-based model. The case studies simply and effectively validate the performance of the created tool, as the structure of the generated tree and the determined probabilities reflect the original dataset.


Introduction
Uncovering frequent event sequence scenarios has become a critical task across many disciplines. In the age of big data, when an immense amount of data is recorded into logs in the scope of the Industry 4.0 trend, it is important for engineers to acquire as much knowledge about the industrial processes as possible [1,2]. By using frequent pattern mining algorithms on event logs, we are able to identify sequences that can lead to given system states. This particular method has already proved its capability across numerous applications and industries. Taub et al. use sequence mining to distinguish efficient and nonefficient action patterns among their subjects in a game-based learning environment [3]. A similar frequent pattern identification method was used to give insight into successful learning patterns using the Betty's Brain computer-based learning environment [4]. A universal (language-independent) algorithm was proposed for linguistic pattern discovery, where special attention was paid to a clear, easily understandable output [5]. Kant et al. proposed a new algorithm (MCPRISM) to mine min-closed sequences to identify comment section spam content on websites [6]. A new framework called malicious sequential pattern-based malware detection was developed by using a novel sequential pattern mining algorithm (MSPE) to recognize new, unseen malicious executables in computer systems [7]. Weiss uses a genetic algorithm for analyzing the temporal patterns in the alarm data of telecommunication systems to identify equipment failure [8]. Sequential pattern mining has also been used for event prediction in numerous applications [9,10].
Although these examples are perfectly capable of fulfilling the sequential pattern mining task, traditional algorithms suffer greatly in runtime and accuracy when dealing with massive datasets [11]. Another drawback of frequent pattern mining solutions is that their output has proved challenging to interpret and handle, especially when the number of mined sequences is high, often introducing a new problem to solve [12]. To represent the yielded information, the frequent pattern tree proved to be a much more compact and workable data structure [13].
Machine learning techniques are excellent tools for processing massive datasets. Learning patterns from exemplary training sequences is a task similar to learning languages and identifying frequent event sequences, where the use of long short-term memory (LSTM) yields better results than traditional recurrent neural networks (RNNs) [14]. The reason why LSTM is suitable for this application is the forget gate in its cell, which is able to reset the internal state of the network [15]. The algorithm known as the seq2seq learning method was developed in 2014 by Sutskever et al. at Google for frequent sequence learning using LSTM to improve machine translation [16]. Ever since, this method has been used in numerous applications. Karatzoglou et al. used it to improve location-based services by learning human semantic trajectories and better predicting their upcoming location [17]. The method's capabilities have also been demonstrated in finance by Rebane et al., who analyzed its performance in cryptocurrency price prediction [18]. A seq2seq model-based approach was used to improve query-focused summarization performance [19]. Wu et al. described a novel method to create, store, and convert logs of Internet of Things big data systems to be later processed through their proposed seq2seq algorithm [20]. The method has also been applied in manufacturing systems. Hwang et al. used the algorithm to predict a furnace temperature based on other process variables with very high accuracy [21]. The general application of this structure for event prediction has been described in detail by Dörgő et al. [22,23].
Fundamentally, the output of a seq2seq approach is a single sequence, which consists of the items that have been found to be the most probable at each prediction step. By using a heuristic search algorithm during inference, further information can be retained from each prediction step. This information can help us better understand the black-box model of the prediction [24]. This optimization is done using beam search, which retains several of the best items; their number is usually referred to as the beam width. Cohen and Beck studied the performance degradation in neural sequence models when an inappropriate beam width is chosen [25]. In recent years, the use of beam search instead of the traditional greedy search has been favored because it usually provides much better results, although it is taxing on runtime [26]. Li et al. used a seq2seq model with a beam search decoder to realize a dependency parser with direct head prediction with promising performance [27]. Williams et al. proposed the use of beam search to build an end-to-end speech recognition system, which is capable of adapting the inference process based on contextual signals at each prediction step [28]. Several different pruning strategies have been explored for use with beam search to improve runtime [29]. A seq2seq model using a dynamic beam width was applied by Jahier Pagliari et al. to an embedded translation system in order to improve its efficiency [30]. A known drawback of the beam search algorithm is that it produces very similar output sequences in certain use cases. A solution for this phenomenon was proposed for image captioning [31].
This paper aims to create our own implementation of the seq2seq learning method with a beam search decoder, which is referred to as the seq2probTree method hereafter. The method is realized in the Python environment, and it is able to create a probability tree that describes the alternative network of events based on a given input. The implemented tool is capable of displaying the output as an easy-to-interpret, structured probability tree, thus giving a visualization of the prediction and aiding the debugging of seq2seq models, as the fault analysis of deep neural networks is a task of enormous importance, especially in safety-critical applications [32].
First, in Section 2, the methodology will be explained. The necessary expressions and the prediction task at hand will be defined. The LSTM deep-learning model will be described along with the tree creation process. The metrics used for the evaluation will also be defined in this section. In Section 3, the implementation process and the toolboxes used will be presented briefly. Then, the seq2probTree method will be put to the test by applying it to a first-order Markov chain model and later to a higher-order tree-based system, where the extent to which the method is able to reconstruct the tree is checked, and the necessary comparison score is defined. Finally, the real-life practical applicability is confirmed by using it on the alarm logs of a hydrofluoric acid alkylation production unit. Last, in Section 4, the findings and experiences of using the developed method will be summarized, and further steps in the subject will be proposed.

Methodology
In this section, the previously defined task will be explained in detail. An event sequence will be defined, together with the way its probability is calculated. The peculiarity of the seq2probTree method, namely, that it creates a whole sequence tree instead of only predicting the most likely scenario, is explained. Here, in addition to the theory of prediction, its extension to tree-based event-scenario generation is also provided. The metrics used for the evaluation of the predicted event scenarios are also explained in detail.

Sequences and the Prediction Task.
Industrial processes frequently generate event logs that logically consist of events (denoted as $e_i$) related to production, safety, transportation, storage, sales, financial transactions, marketing, etc. An event log, defined as the database $D_T$, is an ordered list of these events, where the events are arranged according to their start time in ascending order. The $D_T$ dataset can be segmented into sequences (denoted as $\Phi_n$), which are chronologically ordered lists of events, $\Phi_k := e_1 \Rightarrow e_2 \Rightarrow \cdots \Rightarrow e_k$. This segmentation can be carried out according to different aspects: causal connection of states, temporal segmentation, periodicity, etc. A sequence of $k$ events is referred to as a $k$-length sequence and is denoted by $\Phi_k$. These events represent the occurrence of $n$ different states (types of events) of the set $S = \{s_1, s_2, \ldots, s_n\}$. The sequence $\Phi_k := e_1 \Rightarrow e_2 \Rightarrow \cdots \Rightarrow e_k$ can be divided chronologically at any point as $\Phi_k = (\Phi_{k'} \Rightarrow \Phi_{k''})$, where $\Phi_{k'}$ and $\Phi_{k''}$ are the antecedent and future sequences of states, respectively (naturally, $k = k' + k''$). Hereinafter, the $\prime$ and $\prime\prime$ symbols denote the past and future sequences or states, respectively.
As single or multiple connected processes usually generate the data analyzed here, a causal flow connects the individual temporal instances of states (regardless of the type of the dataset, e.g., events, items, transactions, etc.), and the numbers of occurrences of different states are not independent of each other. Therefore, the probability of the occurrence of the $\Phi_k$ sequence, $P(e_1 \Rightarrow e_2 \Rightarrow e_3 \Rightarrow \cdots \Rightarrow e_k)$, can be calculated by the chain rule and the conditional probabilities of transitions between the events according to the following equation:

$$P(\Phi_k) = P(e_1) \times P(e_2 \mid e_1) \times P(e_3 \mid e_1 \Rightarrow e_2) \times \cdots \times P(e_k \mid e_1 \Rightarrow e_2 \Rightarrow \cdots \Rightarrow e_{k-1}). \qquad (1)$$

Therefore, according to the chain rule, the probability of a $k$-length sequence can be calculated as the product of the conditional probabilities of the step-by-step transition from the sequence of antecedent events to the present one. A conditional probability is the ratio of the number of occurrences of the more extended sequence to that of the shorter one, denoted by the supp value of the sequence, according to the following equation:

$$P(e_k \mid \Phi_{k-1}) = \frac{\mathrm{supp}(\Phi_k)}{\mathrm{supp}(\Phi_{k-1})}. \qquad (2)$$

This probability of transition reflects how confident the prediction of the next state is, knowing the previous sequence of states $\Phi_{k-1}$.
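The chain rule and the support-based transition probability can be illustrated with a short Python sketch. It assumes that the support of a sequence is the number of logged sequences that start with it; the function names and the four-sequence toy log are hypothetical and serve only as an illustration.

```python
from collections import Counter

def prefix_supports(sequences):
    """Count how many logged sequences start with each prefix (assumed supp definition)."""
    supp = Counter()
    for seq in sequences:
        for k in range(len(seq) + 1):
            supp[tuple(seq[:k])] += 1
    return supp

def transition_probability(supp, prefix, event):
    """P(event | prefix): support of the extended sequence over support of the shorter one."""
    shorter = supp[tuple(prefix)]
    if shorter == 0:
        return 0.0
    return supp[tuple(prefix) + (event,)] / shorter

def sequence_probability(supp, seq):
    """Chain rule of equation (1): product of the step-by-step transition probabilities."""
    p = 1.0
    for k, event in enumerate(seq):
        p *= transition_probability(supp, seq[:k], event)
    return p

# Toy event log, invented for this example.
log = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"], ["B", "C", "D"]]
supp = prefix_supports(log)
print(transition_probability(supp, ["A"], "B"))     # 2 of the 3 sequences starting with A continue with B
print(sequence_probability(supp, ["A", "B", "C"]))  # 3/4 * 2/3 * 1/2
```

In this toy log, the product of the three transition probabilities equals the share of logged sequences that are exactly A, B, C, which is what the chain rule states.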

The Network of Alternative Events: Sequence Trees.
The methodology in which the following state with the highest conditional probability is accepted as the prediction was described by Dörgő and Abonyi [22]. However, the underlying processes and, hence, the resultant datasets can be highly complex. The ultimate goal of this method is to create an event sequence tree that describes the possible courses of events (all highly probable $\Phi_{k''}$) based on a given input sequence ($\Phi_{k'}$). Figure 1 shows the idea in detail. The horizontal axis indicates time and illustrates how the possible future scenarios after $k'$ past events are ordered in a tree structure. The red branch of the tree indicates the scenario obtained if the prediction of the highest probability is accepted in every prediction step, namely, by using the greedy search algorithm. The EOS tag indicates the end-of-sequence prediction.
So far, only the scenario with the highest probability has been predicted, ignoring the possible occurrence of a less likely but highly informative and essential subsequence, which can indicate a different scenario of upcoming events. The added feature of this method is to uncover the information that these highly probable sequences may yield. Since the conditional probability-based prediction model often predicts several events with similar probability, the implemented beam search algorithm does not accept only the future sequence with the highest probability; instead, a scenario tree is formed by accepting all the predicted events above a certain probability threshold ($P_{thr}$). Therefore, after the occurrence of the first $k'$ events, the prediction of the first future event $e''_1$ is accepted if its confidence of transition is above a specific $P_{thr}$ limit as follows:

$$P(e''_1 \mid \Phi_{k'}) > P_{thr}. \qquad (3)$$

By applying equation (3) in every prediction step, not a single future sequence but multiple sequences or possible future scenarios are predicted, as depicted in Figure 1. Thus, as described by the prediction task, the conditional probability $P(\Phi_{k''} \mid \Phi_{k'})$ is to be determined for all possible future $\Phi_{k''}$ sequences.
In order to annotate the scenarios as well, a hierarchical annotation was introduced in the superscript of the predicted event: the numbers separated by commas after the $\prime\prime$ mark give the rank of the predicted event by likelihood in each prediction step, where 1 indicates the most likely future state. For instance, the tag $e^{\prime\prime,1,3,1}$ shows that this is the third predicted future event (three numbers are present after the $\prime\prime$ mark): the first predicted state, $e^{\prime\prime,1}$, was the event with the highest probability; accepting this prediction, the second predicted event, $e^{\prime\prime,1,3}$, had the third highest probability; and accepting the first two predictions, the third predicted event had the highest probability in the given prediction step. Similarly, the $e^{\prime\prime,2,1}$ future state follows the prediction with the second highest probability ($e^{\prime\prime,2}$) for the first future event and, accepting this prediction, is the prediction with the highest probability in the second prediction step. Therefore, by continuously accepting the most likely predictions, the sequence $e^{\prime\prime,1} \Rightarrow e^{\prime\prime,1,1} \Rightarrow e^{\prime\prime,1,1,1} \Rightarrow \cdots$ is predicted, highlighted by the red arrow in Figure 1. Although the predictions with the highest probability are accepted in every step of this sequence, its overall probability is not maximal in every situation: after the acceptance of a less likely prediction in one step, the following predicted events can be of high probability, and then the overall probability of the occurrence of that sequence can be relatively high (the overall probability of the occurrence of a sequence is the product of the transition probabilities according to equation (1)). By repeating the prediction task at each node, the sequence tree explained in Figure 1 may be created.
After each prediction step, by comparing the confidence of every possible event with the previously defined $P_{thr}$ probability limit, we can make sure that the complexity of the tree is kept as low as necessary for the given task.
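That the greedy branch does not necessarily have the highest overall probability can be checked on a small numerical example. The following Python sketch uses hypothetical two-step transition probabilities chosen only for illustration; the event names and values are not taken from any dataset of this paper.

```python
# Hypothetical two-step transition probabilities, chosen only for illustration.
first_step = {"X": 0.55, "Y": 0.45}
second_step = {
    "X": {"a": 0.40, "b": 0.35, "c": 0.25},
    "Y": {"a": 0.90, "b": 0.10},
}

def greedy_path():
    """Accept the most likely event in every step (the greedy search)."""
    e1 = max(first_step, key=first_step.get)
    e2 = max(second_step[e1], key=second_step[e1].get)
    return (e1, e2), first_step[e1] * second_step[e1][e2]

def best_path():
    """Enumerate every two-step scenario and keep the one with the largest product."""
    scenarios = {
        (e1, e2): p1 * p2
        for e1, p1 in first_step.items()
        for e2, p2 in second_step[e1].items()
    }
    best = max(scenarios, key=scenarios.get)
    return best, scenarios[best]

print(greedy_path())  # scenario X, a with probability of about 0.22
print(best_path())    # scenario Y, a with probability of about 0.405
```

Here the less likely first event Y is followed by a nearly certain second event, so its scenario is roughly twice as probable as the greedy one, which is why keeping more than one branch per step can be informative.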

The LSTM Deep-learning Model.
In the seq2seq machine learning method, the so-called long short-term memory is utilized as the recurrent neural network of choice. This network was specifically developed to deal with the problem of vanishing gradients with the least possible increase in computational cost [33]. The LSTM network is well known for its capability of classifying, processing, and making predictions on time series data due to its relative insensitivity to the gap length (lag) between discrete events, a property that is welcome in the given use case.
The LSTM structure is depicted in Figure 2.
The input of the model: Figure 2 highlights the structure of the input sequences. First, an end-of-sequence (EOS) tag is appended to the end of every sequence to indicate the end of the event series; in the subsequent steps, it is handled like all the other events. Moreover, the order of the events in the input sequence is reversed, since according to Sutskever et al. [16], the prediction accuracy significantly improves when the beginning of the input sequence is close to the beginning of the predicted sequence.
Embedding layer: The described sequence of input events needs to be transformed into a mathematically manageable vector of numerical values. Therefore, first, the symbols are encoded as one-hot encoded vectors $oh_t$ of binary values of length $n_d$, where $n_d$ is the number of one-hot encoded symbols. In the one-hot encoded vectors, only the one bit related to the encoded symbol is set. A detailed explanation and visualization of one-hot encoding can be found in [34]. Then, the embedding layer transforms the one-hot coded vectors into a lower dimension ($n_e$) of continuous values using the linear transformation $x_t = W_{emb}\, oh_t$. Note that, in Figure 2, the embedded forms of the EOS symbol are denoted by the symbol EOS.
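The input preparation can be sketched in a few lines of plain Python. The sketch assumes that the events are reversed first and the EOS tag is kept at the end of the input; the text does not fix this order unambiguously, so this is only one possible reading. The vocabulary, the embedding weights, and the function name `encode_input` are hypothetical.

```python
def encode_input(events, vocabulary, w_emb):
    """Reverse the events, append EOS, one-hot encode, and embed each symbol."""
    symbols = list(reversed(events)) + ["EOS"]   # assumption: EOS stays at the end
    index = {s: i for i, s in enumerate(vocabulary)}
    embedded = []
    for s in symbols:
        # One-hot vector oh_t of length n_d with a single set entry.
        oh = [0.0] * len(vocabulary)
        oh[index[s]] = 1.0
        # x_t = W_emb * oh_t, where W_emb has n_e rows and n_d columns.
        x = [sum(w * b for w, b in zip(row, oh)) for row in w_emb]
        embedded.append(x)
    return symbols, embedded

vocabulary = ["A", "B", "C", "EOS"]          # n_d = 4
w_emb = [[0.1, 0.2, 0.3, 0.4],               # n_e = 2, arbitrary illustrative weights
         [0.5, 0.6, 0.7, 0.8]]
symbols, embedded = encode_input(["A", "B", "C"], vocabulary, w_emb)
print(symbols)      # ['C', 'B', 'A', 'EOS']
print(embedded[0])  # the column of W_emb that belongs to "C": [0.3, 0.7]
```

Multiplying by a one-hot vector simply selects one column of the embedding matrix, which is why embedding layers are usually implemented as table lookups.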
Encoder and decoder layers: The encoder LSTM layer processes the sequence of one-hot coded and then embedded symbols. Instead of calculating output values, it maps the sequence into its internal states. These internal states of the encoder layer represent the state of the process that generated the events. They are used to condition the decoder layer, which means transferring the information on what happened previously in the process and generally means copying the encoder layer's internal states into the decoder layer, which has the same structure (of $n_u$ LSTM units).
These states define the prediction required from the decoder layer. After the input of an (embedded) start-of-sequence symbol, the decoder layer predicts the next event of the predicted sequence iteratively, always applying the previously predicted event as the input for the prediction of the next event.
This procedure is repeated until an end-of-sequence symbol is predicted or the maximum sequence length is reached.
Dense layer: After the decoder layer maps the input event, the resulting values are used to calculate the probabilities of occurrence of the events using the softmax activation function of the dense layer in Figure 2, where $w_{s,j}$ represents the $j$-th column vector of the weight matrix $W_s$ of the output dense layer of the network, and $b_j$ represents the bias. Once the probability of each state in our dictionary is determined, all the predictions above the defined threshold $P_{thr}$ are accepted as the next event of the related future scenario.
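The step from the dense-layer outputs to the set of accepted events can be sketched as follows. The sketch assumes the standard softmax function and a strict "greater than" comparison with the threshold; the input values stand for hypothetical pre-activation outputs of the dense layer and do not come from a trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over the dense-layer outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accepted_events(logits, vocabulary, p_thr):
    """Return (event, probability) pairs whose probability exceeds p_thr, most likely first."""
    probs = softmax(logits)
    kept = [(s, p) for s, p in zip(vocabulary, probs) if p > p_thr]
    return sorted(kept, key=lambda sp: sp[1], reverse=True)

vocabulary = ["A", "B", "C", "EOS"]
logits = [2.0, 1.5, -1.0, 0.0]   # hypothetical pre-activation values of the dense layer
for event, prob in accepted_events(logits, vocabulary, p_thr=0.2):
    print(event, round(prob, 3))
```

With these values, only A and B exceed the threshold of 0.2, so the scenario tree would branch in two directions at this prediction step.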

Creation and Traversal of the Probability Trees.
Prior to prediction, the sequence of events that defines the state of the process is transformed into the internal state of the encoder layer. Then, these internal states of the encoder layer, which contain information on the history of the process, are transferred to the decoder layer. The prediction starts with the input of a start-of-sequence symbol (marked as StOS in Figure 2). The decoder network generates the prediction of the next event, which is reintroduced into the input of the decoder network and applied as the input in the next time step. In the original seq2seq learning method, the generated events are continuously appended to the predicted sequence of events. The feature added by the seq2probTree method is that, after the first prediction step following the start-of-sequence symbol, we do not simply accept the event with the highest probability as the next one. Instead, we take the entire output vector and apply equation (3), thus pruning the candidates for the next possible event. Then, we further explore the network of alternative events, during which the probability of each upcoming event is determined (and stored if that probability is adequate), thereby realizing the beam search algorithm.

Figure 2: The illustration of the structure of the sequence-to-sequence event-scenario prediction. The encoder model maps the states of the input sequence into a fixed-length vector-based representation. Using these vector-based representations of input events as the initial state, the decoder model determines the next event. However, using the probabilities calculated by the dense layer, not just the event with the highest probability is recorded, but event scenarios are predicted using every prediction above a predefined threshold. The StOS and EOS tags mark the start-of-sequence and end-of-sequence, respectively.
The prediction process is continued in every scenario until the layer generates the end-of-sequence symbol or the previously set limit on the length of the predicted sequence is reached. The method results in a probability tree that is explored and recorded in a depth-first manner (Figure 3). The resource demand of this approach is significantly increased, as it is necessary to store all the internal LSTM states and the output of the previous prediction for each step, which, depending on the number of possible events, can consume a large amount of memory. In addition, an increase in the inference runtime is expected, as the time demand of the depth-first search algorithm is $O(|V| + |E|)$, where $V$ and $E$ stand for the sets of vertices and edges in the tree, respectively. The pseudocode for the tree traversal and the recursive prediction step is given below.
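A minimal Python sketch of such a recursive, depth-first tree construction is shown here; it is an illustration under stated assumptions and not the original pseudocode of the implementation. The `predict_next` callable is a hypothetical stand-in for one decoder step followed by the dense layer: it receives the history of events and returns a dictionary of next-event probabilities. A real implementation would additionally carry the internal LSTM states along each branch.

```python
def build_tree(predict_next, history, p_thr, max_len, name="StOS",
               support=1.0, confidence=1.0, depth=0):
    """Depth-first expansion of every event whose probability exceeds p_thr."""
    node = {"name": name, "support": support,
            "confidence": confidence, "children": []}
    # A leaf is reached when EOS is predicted or the length limit is hit.
    if name == "EOS" or depth >= max_len:
        return node
    for event, prob in predict_next(history).items():
        if prob > p_thr:
            child = build_tree(predict_next, history + [event], p_thr, max_len,
                               name=event, support=prob,
                               confidence=confidence * prob, depth=depth + 1)
            node["children"].append(child)
    return node

def leaves(node, path=()):
    """Yield (scenario, confidence) for every leaf of the tree."""
    path = path + (node["name"],)
    if not node["children"]:
        yield path[1:], node["confidence"]   # drop the StOS root from the scenario
    for child in node["children"]:
        yield from leaves(child, path)

def toy_predictor(history):
    """Hypothetical first-order model used in place of the trained decoder."""
    table = {1: {2: 1.0}, 2: {3: 0.65, 1: 0.35}, 3: {"EOS": 1.0}}
    return table[history[-1]]

tree = build_tree(toy_predictor, history=[1], p_thr=0.2, max_len=4)
for scenario, conf in leaves(tree):
    print(scenario, round(conf, 4))
```

Each node stores the support (the probability of the last transition) and the confidence (the product of the supports along the branch), so the leaves can be compared directly by their confidence.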

Evaluation and Metrics.
The evaluation of the model was carried out using metrics that measure the potential applicability of the method. Since the focus is on the development of a prediction system that draws attention to the most probable outcomes of the process, three performance metrics have been identified for characterizing the sequence containing the events that are found suitable in every step by using equation (3). For easier notation, we introduce $\Phi$, a sequence containing only the events with adequate prediction probabilities in every step.
First, $S_1$ is the percentage of the $\Phi$ sequences that include at least one well-predicted event. In the mathematical formulation, $\Phi$ is the sequence of events that we aim to predict, while $\Phi''$ is our prediction; $N$ is the number of sequences in the analyzed database, the cardinality of a set is marked with $|\cdot|$, and the common elements of two sequences are marked as their intersection. $S_1$ is therefore the share of the $N$ sequences for which $|\Phi \cap \Phi''| \geq 1$.
Second, $S_\%$ is a set-based similarity measure that describes the well-predicted events as a percentage of the length of the target sequence, that is, $|\Phi \cap \Phi''|$ relative to $|\Phi|$. The events do not have to be in the order of occurrence; $S_\%$ measures how accurately the types of events are predicted.
Finally, $S_{ED}$ is an edit distance-based similarity metric that provides the edit distance between the actual (target) and the predicted sequence as a percentage of the length of the longer of the two. The edit distance yields the minimum number of elements that must be inserted or skipped in the compared sequences in order to make them identical. The edit distance of two sequences is marked with ED, and equation (8) describes the $S_{ED}$ edit distance-based similarity metric mathematically.
These performance metrics are calculated for each sequence on the tree whenever a leaf is found, that is, when EOS is predicted or the maximum sequence length is reached. However, in order to make the resulting sequences even more comparable, their confidence is also calculated. The confidence of each $\Phi$ is defined as the product of the supports of all the events contained in the sequence. The support of an event is the probability that the LSTM calculated for that item, given the sequence of the previous events. For the events in the input sequence, the support is set to 1.
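The verbal definitions of the metrics can be turned into a small Python sketch. Several details are assumptions made only for this illustration: common events are counted as a set intersection, the metrics are averaged over all (target, predicted) pairs, and the edit distance allows only insertions and deletions, following the wording above. All function names are hypothetical.

```python
def common_events(target, predicted):
    """Event types that occur in both sequences (order is ignored)."""
    return set(target) & set(predicted)

def s_1(pairs):
    """Share of (target, predicted) pairs with at least one common event."""
    return sum(1 for t, p in pairs if common_events(t, p)) / len(pairs)

def s_percent(pairs):
    """Average number of common events relative to the target length."""
    return sum(len(common_events(t, p)) / len(t) for t, p in pairs) / len(pairs)

def edit_distance(a, b):
    """Minimum number of insertions and deletions needed to turn a into b."""
    # Longest common subsequence by dynamic programming.
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return len(a) + len(b) - 2 * lcs[-1][-1]

def s_ed(pairs):
    """Average edit distance relative to the length of the longer sequence."""
    return sum(edit_distance(t, p) / max(len(t), len(p)) for t, p in pairs) / len(pairs)

def confidence(supports):
    """Product of the supports of the events in a sequence."""
    product = 1.0
    for s in supports:
        product *= s
    return product

# Toy (target, predicted) pairs, invented for this example.
pairs = [([4, 5, 6, 7], [4, 5, 6]), ([4, 5, 1, 2], [1, 2, 3])]
print(s_1(pairs), s_percent(pairs), s_ed(pairs))
print(confidence([1.0, 0.5, 0.5]))
```

Because the events of the input sequence have a support of 1, they can be included in the product without changing the confidence of the predicted part.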

Implementation and Results
In this section, a summary of the implementation of the proposed method is provided. Then, the validation techniques used are detailed, and the obtained results are evaluated. Since the implemented tool is used for diagnostic purposes, the results should be easily reproducible. Thus, the validation is performed by applying the proposed method to examples of different complexity. First, the realized system is validated on a simple first-order Markov chain, where the method's capability to reproduce the sequence tree is examined. Then, to demonstrate the proposed method's capability to understand higher-order relationships between events, a more complex benchmark dataset is generated using a tree-based system. Finally, the method is tested on a real-life production unit.

Realization of the seq2probTree Method.
The described method was implemented in Python using the Spyder 4 Integrated Development Environment in the Anaconda open-source data science development platform. This platform was ideal for the task, as most of the necessary libraries are included by default, thus minimizing the setup process for development. The LSTM RNN was implemented using Keras, a deep-learning application programming interface running on top of the TensorFlow end-to-end open-source machine learning platform. The Keras API is well known for its full-fledged documentation and high-quality example codes, which are usually very well commented for easy adaptation. In order to decrease the runtime of the training process of the LSTM, the NVIDIA CUDA Deep Neural Network (cuDNN) library was utilized. Since Keras is built on top of TensorFlow, which is a cuDNN-accelerated framework, the time required by the LSTM training was reduced tenfold after the initial setup. This speed increase was provided by an NVIDIA GeForce GTX 1080 Ti graphics processing unit.
The probability trees presented in this paper were generated using the ETE toolkit for Python, which provides a wide range of tree-handling options and node annotation features alongside a tree visualization system to output the resultant trees. The code of the Markov chain models was created in the MathWorks MATLAB environment for the ease of exporting the simulated data into .xlsx format and importing it into Python using the pandas library. However, due to the vast size of the training dataset for the third-order Markov model, MATLAB's .mat format had to be utilized, which can be handled by SciPy (conveniently included in Anaconda).
The finished implementation consists of two routines. The first contains the selection of the desired dataset, the setup of the LSTM, the training procedure, and the creation of the training history plots.
After the training process has been completed, the encoder and decoder models are saved, thus eliminating the necessity of running the model training with each subsequent session of the application of the tool. e second routine consists of loading the LSTM models, the recursive decoding, and all the functions necessary for the metric calculation and the tree generation and output.

Validation on First-Order Markov Model.
In this section, a brief summary will be given on how the proposed method has been implemented. For the ease of validation, a simple Markov chain is used. The model consists of 12 states that, as a rule, follow each other in order. The only two exceptions are states 4 and 7, which break this rule. While transitioning from state 4, there is a probability of 0.35 that the system will "reset," thus returning to state 1. If the system reaches state 7, there is a 30% chance that the system skips the following two states and goes directly to state 10. This behavior can be observed in Figure 4. The dataset was established by creating 10000 sequences utilizing the described Markov chain. Each sequence starts from a randomly selected system state, and its length is also randomly determined between 9 and 12. After the generation of the dataset, the LSTM model was trained by using the following parameters:
(i) Embedding dimensions = 6
(ii) Latent dimensions = 15
(iii) Batch size = 256
(iv) Epochs = 70
The accuracy and loss of the training can be observed in Figure 5. In order to validate the model's performance, a cross-check was made by feeding each state as an input to the encoder, thus initializing the internal LSTM states. It is important to note here that, to initialize the encoder for the validation, not only the state from which the prediction starts needs to be used as the input but also the previous two states, because, for the model training, each sequence in the database was split after the third state into input and output. Then, one prediction step is completed, and the output of the LSTM is recorded. This is repeated for each state, creating the validation transition matrix, which is then compared to the transition matrix of the first-order Markov chain (part (a) in Figure 4). In Figure 6, each predicted value is plotted as a function of the original transition probability. The calculated coefficient of determination for this simple example is as high as 0.9994.
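The benchmark dataset described above can be reproduced approximately with the following Python sketch (the original data were generated in MATLAB). It assumes that state 12 is followed by state 1, which the description does not state explicitly, and the function names and the random seed are hypothetical.

```python
import random

N_STATES = 12

def next_state(state, rng):
    """One step of the first-order Markov chain with its two special transitions."""
    if state == 4 and rng.random() < 0.35:
        return 1                      # "reset" to state 1
    if state == 7 and rng.random() < 0.30:
        return 10                     # skip states 8 and 9
    return state % N_STATES + 1       # assumption: state 12 is followed by state 1

def generate_dataset(n_sequences=10000, seed=0):
    """Sequences with a random start state and a random length between 9 and 12."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_sequences):
        state = rng.randint(1, N_STATES)
        seq = [state]
        for _ in range(rng.randint(9, 12) - 1):
            state = next_state(state, rng)
            seq.append(state)
        dataset.append(seq)
    return dataset

def empirical_transitions(dataset):
    """Relative frequency of every observed state-to-state transition."""
    counts = {}
    for seq in dataset:
        for a, b in zip(seq, seq[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}

dataset = generate_dataset()
transitions = empirical_transitions(dataset)
print(round(transitions[4][1], 2), round(transitions[7][10], 2))
```

The empirical transition matrix obtained this way plays the role of the reference against which the one-step predictions of the trained model can be compared.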
After the training was completed, the seq2probTree method was applied with $P_{thr} = 0.2$, giving the input sequence [1, 2, 3] to the trained LSTM model. The maximum output sequence length was set to 12. Figure 7 explains the metrics placed at each node of the probability tree, while the acquired results can be observed in Figure 8. Each node on the tree has at least three properties: name, support, and confidence (top and bottom values, respectively). The EOS nodes also carry the three performance metrics calculated for the given sequence, $S_1$, $S_\%$, and $S_{ED}$, whose values can be found in the right column in this order from top to bottom. For example, it can be observed in Figure 7 that the seq2probTree method predicted state 11 after the subsequence ending with state 10 with a probability of 0.49. In addition, the calculated probability of ending the sequence after state 11 is 0.5. We can also see that the confidence of $\Phi_k$, that is, of the whole sequence ending with EOS, is 0.04. The $S_1$ value of the highlighted $\Phi_k$ sequence shows that every entry (1.0) in the input database starting with the given $\Phi_{k'}$ subsequence, in this case [1 2 3], has at least one state that has been predicted in $\Phi_{k''}$ by the method. $S_\%$ being 0.68 indicates that the states predicted in $\Phi_{k''}$ occur in 68% of the database entries starting with [1 2 3]. The last metric of this EOS node on the probability tree, $S_{ED}$, shows that the average edit distance, that is, the number of changes that need to be made to match the sequences, is 4.49 for the aforementioned $\Phi_{k'}$. The properties of the first-order Markov chain are observable in the results. Both of the distinguished transitions are identifiable, and the predicted transition probabilities are within a margin of error of those of the Markov chain. The tree also reflects all the different length variants of each possible sequence.

Validation on Higher-Order Tree-Based System.
As LSTM-based deep-learning networks are explicitly developed to capture long-term relationships in datasets, a higher-order system is used for further evaluation. The behavior of the system is based on a probability tree, which was pseudorandomly generated. Each node on the tree may have up to three children, with the system state that it represents and the probability that this state occurs also generated randomly. The sum of the probabilities of states originating from the same node is normalized to 1. The depth of the tree was determined randomly between 8 and 9, without considering the root (StOS) and leaf (EOS) nodes. The number of applied states is set to 4 to facilitate easier understanding and reconstruction of the results. However, even at this complexity, it is already a difficult task. The states are represented by the letters A, B, C, and D. The complexity of the system can be observed in Figure 9, while the inspected transition probabilities, that is, the highlighted areas, are shown more clearly in Figures 10-12.
To utilize the seq2probTree method, a training dataset was created consisting of 10,000 simulations of the system, each starting from the root node and randomly determining the path, based on the transition probabilities, until a leaf node is reached. After the simulations were concluded, the resultant dataset was copied six times, because during the training the sequences are split into input and target, and the position at which they are separated is randomly selected. The reason for the sixfold multiplication is that the position of the cut is varied between the 1st and the 6th state in the sequence, separating the input and target after the selected state. The generated dataset was used for the training of the LSTM model by using the following parameters: During the training on the dataset produced by the simulation of the proposed tree-based system, the accuracy and loss functions were also recorded. They can be observed in Figure 13. It is important to note that during the training, 20% of the dataset was used as validation data, while dropout was not utilized in the LSTM layer.
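The dataset construction described above can be sketched as follows. The small tree and all names are hypothetical stand-ins, and the sketch assumes that one random cut position is drawn for every copy of every sequence; it is not the authors' code.

```python
import random

# Toy stand-in for the generated system tree (hypothetical values).
TOY_TREE = {"state": "StOS", "children": [
    {"state": "A", "prob": 0.7, "children": [
        {"state": "B", "prob": 1.0, "children": [
            {"state": "C", "prob": 0.5, "children": []},
            {"state": "D", "prob": 0.5, "children": []}]}]},
    {"state": "B", "prob": 0.3, "children": [
        {"state": "D", "prob": 1.0, "children": [
            {"state": "A", "prob": 1.0, "children": []}]}]},
]}


def simulate(tree, rng):
    """Walk from the root to a leaf, choosing children by their probabilities."""
    sequence, node = [], tree
    while node["children"]:
        children = node["children"]
        weights = [child["prob"] for child in children]
        node = rng.choices(children, weights=weights, k=1)[0]
        sequence.append(node["state"])
    return sequence


def make_training_pairs(sequences, rng, copies=6, max_cut=6):
    """Copy the dataset `copies` times and cut each copy at a random position."""
    pairs = []
    for _ in range(copies):
        for seq in sequences:
            # Cut after the 1st..6th state; an empty target means only EOS remains.
            cut = rng.randint(1, min(max_cut, len(seq)))
            pairs.append((seq[:cut], seq[cut:]))
    return pairs


rng = random.Random(42)
sequences = [simulate(TOY_TREE, rng) for _ in range(10)]
pairs = make_training_pairs(sequences, rng)
```

Each pair holds the input part and the target part of one simulated sequence, so concatenating the two always restores the original path.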
After the training of the LSTM model on the aforementioned dataset, the seq2probTree method was applied with input sequences leading to the highlighted areas in Figures 14-16. In order to generate the smallest possible trees, consisting of only the states with the highest probability, $P_{thr}$ was set to a higher value of 0.25. In addition, the $TopN_{thr}$ parameter was introduced as the beam strength with a value of 2, which means that only the two states with the highest probabilities are taken into consideration during the construction of the tree. With the maximal sequence length set to 9, these measures ensured that the size of the resultant tree is suitable for evaluation.
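How the two thresholds prune the tree can be illustrated with a minimal sketch. Here `toy_next_probs` is a hypothetical stand-in for the output layer of the trained LSTM model, the threshold is assumed to apply to each individual transition probability, and the maximal length is assumed to count predicted states only; none of these names come from the authors' implementation.

```python
EOS = "EOS"


def toy_next_probs(prefix):
    """Hypothetical stand-in for the LSTM output layer (next-state probabilities)."""
    table = {
        "A": {"B": 0.6, "C": 0.3, EOS: 0.1},
        "B": {EOS: 0.7, "A": 0.3},
        "C": {EOS: 1.0},
    }
    return table[prefix[-1]]


def expand(prefix, next_probs, p_thr, top_n, max_len, confidence=1.0, depth=0):
    """Recursively build the child nodes that follow `prefix`."""
    if depth >= max_len:
        return []  # maximal output length reached
    ranked = sorted(next_probs(prefix).items(), key=lambda kv: kv[1], reverse=True)
    children = []
    for state, p in ranked[:top_n]:      # keep only the top-N candidates
        if p < p_thr:                    # drop transitions below the threshold
            break
        child = {"state": state, "prob": p,
                 "confidence": confidence * p, "children": []}
        if state != EOS:
            child["children"] = expand(prefix + [state], next_probs, p_thr,
                                       top_n, max_len, confidence * p, depth + 1)
        children.append(child)
    return children


def complete_sequences(children, path=()):
    """Yield (sequence, confidence) for every branch that ends in EOS."""
    for child in children:
        if child["state"] == EOS:
            yield list(path), child["confidence"]
        else:
            yield from complete_sequences(child["children"], path + (child["state"],))


tree = expand(["A"], toy_next_probs, p_thr=0.25, top_n=2, max_len=3)
results = {tuple(seq): conf for seq, conf in complete_sequences(tree)}
```

In this sketch, the confidence of a branch is the product of the transition probabilities along it, so raising either threshold removes whole subtrees rather than single nodes.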
By comparing the acquired probability trees to the tree that defines the system's behaviour, it is observable that the seq2probTree method based on the LSTM model can capture the long-term relationship of the states of a system. Given a training accuracy of 0.86, the acquired results represent the original probability tree on which the system is based quite accurately. A few prediction errors are observable in the results. These discrepancies could be explained by pointing out that the input sequence is relatively scarce; thus, shorter patterns with high confidences can "mislead" the model.
Figure 9: The state transition probabilities of the full tree-based system. The complexity of the system is easily observable.
As the seq2probTree method is proposed as a tool capable of online dynamic process supervision, the visual information it provides is crucial. While understanding the prediction tree with sparse input is an overwhelming task, the probability tree structure becomes less complex as the input sequence expands with more system states. Figures 16-21 represent the method's visual output as the input sequence is extended step by step from [B] to [B A B A D D], following the most probable path shown in Figure 16 (also the path of the sequence with the lowest $S_{ED}$ metric). The results clearly show how the complexity of the acquired probability trees decreases as the input sequences are expanded. When the LSTM model is given scarce input, the inferred state sequences are diverse, and a few erroneous conclusions are drawn. An excellent example of this behavior is found in Figure 17, where after the [B A] input sequence, state A was predicted with a probability that met $P_{thr}$ along with state B, which should have been a sure transition.
To quantify the accuracy of the model for each aforementioned input sequence, an average error has been calculated and is shown in Table 1. The error, just like the S metrics, is calculated for each predicted event sequence on the probability tree by simply counting how many elements from the end of the sequence are not found in the probability tree defining the system. The determined errors are then averaged.
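One possible reading of this error measure is sketched below: each predicted sequence is followed along the reference tree, and the number of trailing elements left over once the path leaves the tree is counted. The nested-dictionary representation of the reference tree and the function names are assumptions made for illustration only.

```python
def sequence_error(sequence, reference_tree):
    """Count the trailing elements of `sequence` not found in the reference tree.

    `reference_tree` is a nested dict {state: subtree}; this layout is an
    assumption of the sketch.
    """
    node = reference_tree
    for position, state in enumerate(sequence):
        if state not in node:
            return len(sequence) - position  # everything from here on is unmatched
        node = node[state]
    return 0


def average_error(sequences, reference_tree):
    """Mean of the per-sequence errors over all predicted sequences."""
    errors = [sequence_error(seq, reference_tree) for seq in sequences]
    return sum(errors) / len(errors)


# Hypothetical reference tree and predictions.
reference = {"B": {"A": {"B": {}}}}
predicted = [["B", "A", "B"], ["B", "A", "A", "D"]]
mean_error = average_error(predicted, reference)
```

Here the first prediction follows the reference tree completely, while the second one leaves it after [B A], so its last two elements count as errors.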
One question that arises during the utilization of the seq2probTree method concerns the length of the input sequence that is necessary for the output to be considered reasonably accurate. Table 1 suggests that if the input sequence is at least four elements long, then the probability trees generated by the seq2probTree method show no discrepancies when compared to the tree on which the behavior of the system is based. Thus, the probability trees for every possible 4-length input sequence were generated, and the average errors were calculated. Since the same accuracy cannot be expected for all the input sequences, an additional weighting was applied to the calculated average error for the sake of comparison. The weight is calculated by determining the confidence of each input sequence and normalizing it by the highest value. The resultant weighted average error values can be observed in Table 2. Based on the obtained results, it can be stated that after the 4th input element, the prediction is quite precise for this example system. More significant discrepancies were observed in low-confidence sequences, where even after the input sequence (several) diversions are possible.
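The weighting step can be sketched as follows, assuming that the weighted value is the product of the average error and the normalized confidence of the input sequence. The numbers and names are hypothetical and do not reproduce Table 2.

```python
def weighted_average_errors(average_errors, confidences):
    """Scale each average error by the confidence normalized to the highest one."""
    highest = max(confidences.values())
    return {
        seq: average_errors[seq] * confidences[seq] / highest
        for seq in average_errors
    }


# Hypothetical 4-length input sequences with made-up errors and confidences.
average_errors = {"BABA": 0.0, "ACDD": 0.5}
confidences = {"BABA": 0.2, "ACDD": 0.05}
weighted = weighted_average_errors(average_errors, confidences)
```

With this normalization, the most frequent input sequence keeps its error unchanged, while errors of rare input sequences are scaled down in proportion to their confidence.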
By utilizing the seq2probTree method on this tree-based system, the capability of the algorithm for predicting higher-order event relationships has been verified with success. The average error value has been introduced to help the evaluation of the results when a direct comparison is possible with an original probability tree.

Case Study: Alarm Scenarios of a Hydrofluoric Acid Alkylation Production Unit.
The proposed method has been applied to an alarm log of a hydrofluoric acid alkylation production unit to check its real-life performance. The process flow diagram of the technology can be observed in Figure 22. The log used for this experiment was created during the operation of the production unit over a four-month-long (121 days) period, during which all incoming alarms and other events were recorded. The unprocessed log contains precisely 200,802 entries, of which 30,168 messages are unsuppressed alarm events. 8,721 of these are alarms that were considered significant and thus were not shelved by the operators. The event sequences for the input of the tool were created by grouping the events based on a time window while preserving their sequential temporal property.
Thus, whenever a gap of 600 s follows the last event, the two consecutive events are not considered related, and a new sequence is started. By using this strategy, the significant alarms were separated into 3,330 sequences. Then, by considering only the event sequences with a minimum length of two, the number of valuable sequences was further reduced to 762. It is also important to note that this event database has a very high unique state count compared to the previous examples: the sequences are composed of 354 individual states. Due to confidentiality reasons, the names (the meaning) of the alarm tags have been removed. Then, this sequence database was analyzed for frequent events that start sequences. To carry out the analysis for this case study, the four most frequent starting events were selected as inputs for the seq2probTree method.
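The time-window grouping described above can be sketched as follows. The sketch assumes that the log is available as time-ordered (timestamp in seconds, tag) pairs and that a gap equal to or larger than the window starts a new sequence; the tags and function names are hypothetical.

```python
from collections import Counter


def split_by_gap(events, max_gap=600.0):
    """Group time-ordered (timestamp_s, tag) events into sequences.

    A new sequence starts whenever the gap to the previous event is at least
    `max_gap` seconds (whether the limit is inclusive is an assumption).
    """
    sequences, current, last_time = [], [], None
    for timestamp, tag in events:
        if last_time is not None and timestamp - last_time >= max_gap:
            sequences.append(current)
            current = []
        current.append(tag)
        last_time = timestamp
    if current:
        sequences.append(current)
    return sequences


# Hypothetical alarm log: (seconds since start, alarm tag).
log = [(0, "a"), (100, "b"), (800, "c"), (1500, "d"), (1600, "e")]
all_sequences = split_by_gap(log)
valuable = [seq for seq in all_sequences if len(seq) >= 2]  # minimum length of two
# Most frequent events that start a sequence (the paper selects the top four).
frequent_starts = Counter(seq[0] for seq in valuable).most_common(4)
```

Filtering by length after the split mirrors the order of the steps in the text, so single-alarm sequences are first created and then discarded.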
The names of the selected events and the number of times each occurs as the first event in a sequence are highlighted in Table 3.
After processing the event log, the seq2probTree method has been applied to the database. The training results in Figure 23 have been acquired by using the following LSTM and training parameters: By using the aforementioned events as inputs for the seq2probTree method, the probability trees in Figures 24-27 have been created. The parameters of the beam search algorithm, $P_{thr}$ and $TopN_{thr}$, were set to 0.065 and 3, respectively. Analyzing the trees, it is clear that the seq2probTree method is capable of learning and identifying the possible event scenarios. However, since the dataset is vastly diverse, especially since the seq2probTree method is also sensitive to the sequential position, the probabilities of the individual transitions are fairly low; thus, the low $P_{thr}$ value is justified. Lowering the probability threshold further would have resulted in immense trees; thus, only the most frequent transitions are displayed in the figures. In Figure 27, one drawback of the method is also observed: in the longer sequences, which contain or start with [136711], a recurring [361835] is often present. This transition is so prominent that the LSTM model keeps on predicting it with a high probability. In these cases, only the defined maximal output sequence length parameter kept the seq2probTree method from creating an ever-growing branch on the tree.
Figure 24 illustrates well the different alarm sequences related to the depropanizer. The tree is initialized with the alarm message of the depropanizer pressure [136769], which can be followed by either the level alarm of one of the vessels of the depropanizer [137161] or an alarm of a pump [136711]. After the alarm on the depropanizer vessel, the alarm of the depropanizer pressure [136769] or of the depropanizer feed [353848] can come in. The alarm sequences in Figure 25 are related to another scenario of the depropanizer.
As can be seen, the alarm
Similarly, very long alarm cascades of varying probabilities are generated in Figure 27. As we saw, the alarm of the pump [136711] reoccurs in many sequences, and, not surprisingly, it can induce the presence of several other alarms with different scenarios for the order of their occurrences.

Conclusion
By proposing the seq2probTree method, the application of the seq2seq learning algorithm is expanded by not only considering the most probable item but also further exploring the alternative courses of an event sequence using the beam search algorithm during inference. This approach has been realized in a Python environment by using state-of-the-art development tools. The capability of the method to reproduce the characteristics of a given system has been demonstrated by applying it to a first-order Markov chain model. The provided transition probabilities were reasonably identified, but the approach was also capable of revealing the given unique attributes and quirks of the examined systems. The assumption that the seq2probTree method is capable of exploring higher-order relationships between events has been demonstrated and validated using a tree-based system as an example. In addition, the average error metric has been proposed to aid the user in determining the length of the input necessary for reliable prediction. Finally, the applicability of the proposed method was examined on a real-life practical example, where it produced valuable results even in the case of a highly diversified system. The proposed approach was able to map the typical alarm event scenarios in a hydrofluoric acid alkylation process and represent them in a visually interpretable manner.
Based on this evidence, it can be stated that the sequence trees created by the seq2probTree method properly represent the network of possible alternative sequences of events. With this approach, the visual output necessary for understanding and diagnosing higher-order, complex systems can be obtained.

Data Availability
The benchmark datasets and the code of the developed algorithms will be available on the GitHub profile and the website of the authors (https://www.abonyilab.com/) after the publication of the results.

Conflicts of Interest
The authors declare that they have no conflicts of interest.