Research Article

Log Pattern Mining for Distributed System Maintenance

Due to the complexity of the network structure, log analysis is usually necessary for the maintenance of network-based distributed systems, since logs record rich information about system behaviors. In recent years, numerous approaches have been proposed for log analysis; however, they ignore the temporal relationships between logs. In this paper, we target the problem of mining informative patterns from temporal log data. We propose an approach to discover sequential patterns with temporal regularities from event sequences. The discovered patterns help engineers understand the behaviors of a network-based distributed system. To solve the well-known problem of pattern explosion, we resort to the minimum description length (MDL) principle and take a step forward in summarizing the temporal relationships between adjacent events of a pattern. Experiments on real log datasets demonstrate the efficiency and effectiveness of our method.


Introduction
With the increasing demand for computing power, many network-based distributed systems have emerged, such as popular distributed storage systems like HDFS. A distributed system utilizes multiple machine nodes to complete tasks over the network. Since the network may be complex and each node may report anomalies, experts usually maintain a network-based distributed application by analyzing node logs to evaluate the health of the system, instead of monitoring each node individually. Logs record the running status of the system and significant events such as the start and end of a task. Mining information from log sequences is therefore a useful way to understand the behaviors of network-based distributed systems.
Many researchers have proposed different log analysis approaches in recent years. Some try to sketch the operation process of the system [1][2][3], and some are devoted to functional tasks, such as anomaly detection and problem diagnosis [4, 5]. In this paper, we treat the timestamped log messages as an event sequence produced by a black-box system. We target the problem of mining informative patterns from temporal log sequence data. Here, by informative, we mean a set of patterns that can summarize the log sequence well. The discovered patterns can help operation engineers better understand the system behaviors and serve as an excellent source of information for online monitoring and anomaly detection. Our work is motivated by the well-known distributed file system HDFS, in which many informative patterns, corresponding to different system behaviors, are generated during system operation.
As the centerpiece of HDFS, the NameNode determines, among other functions, the mapping of the blocks of a file to DataNodes when a new file is written to HDFS. For the NameNode, the write operation regularly generates a pattern as shown in Table 1. Normally, when a new data block is allocated, three copies of this block are created on different DataNodes, and after creating the required replicas, the NameNode updates the block map.
By capturing this type of pattern, we can implement several log analysis applications. For example, with the increasing amount of log data generated by a complex system, it is rather difficult for engineers to check the log messages one by one. Compressing the log data to a smaller size without losing the important information is in high demand. Since the pattern represents the semantics of the block-allocation behavior, we can greatly reduce the number of logs by encoding the original log sequence with the discovered patterns. Besides, we can investigate the historical condition of the cluster, for example, by counting the number of newly added data blocks in the last week.
Sequential pattern mining has been an important data mining problem for decades. Dozens of research efforts focus on finding sequential patterns effectively [6][7][8]. However, traditional mining approaches often generate a huge number of patterns that confuse users. To solve the well-known problem of pattern explosion [9], we resort to the minimum description length (MDL) principle [10], which has been used in several previous works [9, 11, 12]. The MDL principle provides a good balance between the complexity of the resulting pattern set and its ability to represent the data.
Although our work is not the first to discover patterns by utilizing the MDL principle, we take a step forward in summarizing the temporal relationships among the events of a pattern. Logs monitor repetitive behaviors corresponding to execution traces of several program statements, so significant temporal regularities commonly exist in system logs. Consider the example in Figure 1, which shows simplified log messages for two behaviors of the NameNode. We use capital letters to represent event types and lowercase letters to represent the corresponding occurrences. There exist two patterns: P1 = ABC, where B is generated 2 or 5 seconds after A and C is generated 3 seconds after B, and P2 = DD, where the interval between the two D events is 1 or 2 seconds. Obviously, patterns that summarize the temporal relations between adjacent events convey more information about the running status of the system. However, existing methods do not fully consider these temporal relationships. From the perspective of handling temporal relations between adjacent events of a pattern, existing approaches can be divided into two groups. GoKrimp [9] and SQS [12] punish gaps by allocating a higher cost when encoding patterns with large gaps, which does not consider the regularity of interarrival times at all. CSC [11] restricts patterns to fixed intervals and does not allow any event type to appear more than once, which strongly limits the expressiveness of a pattern. The patterns generated by previous methods either mix together, resulting in high redundancy [9, 12], or are too simple to represent the true behaviors [11].
In this paper, we propose an approach called DTS (discovering patterns from temporal log sequences) to remedy these defects. To the best of our knowledge, our work is the first to encode the temporal regularity of patterns in an event sequence. The encoding scheme enables us to discover high-quality patterns with low redundancy.
The key contributions of our work include the following: (1) We formalize how to use histograms to describe the distribution of time intervals between adjacent events in a pattern. Moreover, we design an encoding scheme that losslessly compresses the original sequence as well as the temporal regularities with the mined patterns. (2) We introduce a heuristic algorithm, DTS, that discovers a set of informative patterns both effectively and efficiently. (3) Evaluation results on real datasets show that our method is capable of discovering high-quality patterns with low redundancy. The rest of this paper is organized as follows. In Section 2, we give the preliminary knowledge about our work, including the pattern semantics and the problem statement. The encoding scheme for temporal data is described in Section 3. In Section 4, we present the algorithm for mining the patterns that best compress the log sequence. We evaluate the effectiveness and efficiency of our approach in Section 5. We discuss related work in Section 6 and conclude this paper in Section 7.

Log Sequence and Log Parsing.
A log sequence is a sequence of log entry and timestamp pairs, denoted as S_log = {(log_1, t_1), (log_2, t_2), . . . , (log_n, t_n)} (t_i ≤ t_{i+1}, 1 ≤ i < n), where log_i is the i-th log entry and t_i is the corresponding timestamp. The granularity of the timestamps can be set to any level depending on the application.
Raw logs are usually unstructured, free-text messages. To analyze log data, a common preprocessing step is to parse unstructured log messages into structured representations [13, 14]. Concretely, log messages printed by the same statement are often highly similar to each other and vice versa [13]. Based on this observation, we can extract two parts of information from a log message: the event type and the parameters. The event type refers to the constant content shared by all logs generated by the same statement, while the parameters are the values of variables that differ from log to log. In this paper, we use Drain [14] to parse the original log messages due to its popularity and excellent performance. For each log message log_i, we store its event type and parameters in a structured representation, denoted as a tuple [e_i, para_1, para_2, . . . , para_p], where e_i refers to an instance of the event type of log_i, para_j (1 ≤ j ≤ p) refers to the value of a parameter, and p is the number of parameters in log_i. We assume that the event types of all logs in S_log come from a finite alphabet Σ = {E_1, E_2, . . . , E_l}. Therefore, the original log sequence is parsed into a temporal event sequence S = {(e_1, t_1), (e_2, t_2), . . . , (e_n, t_n)}. We use Δ(S) to denote the time spanned by S, i.e., Δ(S) = t_n − t_1. For brevity, we omit the parameters when unnecessary.
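As an illustration of the parsing step, the following sketch matches log lines against flat templates whose variable parts are marked `<*>`. This is a toy stand-in, not the actual Drain algorithm; the template strings and the sample log line are hypothetical.

```python
import re

# Hypothetical templates: each event type maps to its constant content,
# with "<*>" marking a parameter position. (A toy stand-in for Drain.)
TEMPLATES = {
    "E1": "BLOCK* NameSystem.allocateBlock: <*>",
    "E2": "Receiving block <*> src: <*> dest: <*>",
}

def parse(line):
    """Return (event_type, [parameters]) for the first matching template."""
    for etype, tpl in TEMPLATES.items():
        # Escape the constant parts; turn each <*> into a capture group.
        pattern = "^" + re.escape(tpl).replace(re.escape("<*>"), r"(\S+)") + "$"
        m = re.match(pattern, line)
        if m:
            return etype, list(m.groups())
    return None, []  # no known event type

etype, params = parse("Receiving block blk_42 src: /10.0.0.1 dest: /10.0.0.2")
# etype == "E2", params == ["blk_42", "/10.0.0.1", "/10.0.0.2"]
```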

Pattern Discovery with the MDL Principle.
In this work, our goal is to turn the raw log sequence into a sequence of more informative patterns. Specifically, we aim to discover a set of informative patterns that (1) represent event relationships and temporal regularities and (2) provide the best lossless compression of the event sequence. This is achieved based on the minimum description length (MDL) principle [10].

A Brief Introduction to MDL.
We use MDL as a metric to balance data compression quality and model complexity. Specifically, we apply the two-part MDL principle, which can be roughly described as follows. Given a set of models M, the best model M ∈ M is the one that minimizes L(S, M) = L(M) + L(S|M), where L(M) is the description length of M and L(S|M) is the description length of sequence S when encoded with M. The description length is computed at the bit level. In this paper, the model refers to a set of patterns P. We will later define the encoding scheme, i.e., how to describe S with a set of patterns and how to compute the description length. Thus, our problem of pattern discovery can be formulated as follows.
Problem Definition. Given a temporal log sequence S, find the set of patterns P that minimizes the description length L(S, P) = L(P) + L(S|P).
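As a toy illustration of this criterion (with purely hypothetical bit counts), model selection reduces to minimizing the sum of the two parts:

```python
# Each candidate model M is scored by (L(M), L(S|M)) in bits; the best model
# minimizes L(S, M) = L(M) + L(S|M). All numbers here are hypothetical.
candidates = {
    "singletons_only": (4.0, 120.0),   # trivial model, poor compression
    "pattern_set_A": (30.0, 60.0),     # richer model, much better fit
    "pattern_set_B": (80.0, 55.0),     # over-complex model
}

best = min(candidates, key=lambda m: sum(candidates[m]))
# best == "pattern_set_A" (total description length 90.0 bits)
```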

Semantics of the Pattern Language.
To use MDL, we first need to specify the pattern language, which determines the "vocabulary" of all possible patterns we can discover given S. Specifically, the key elements of a pattern P_i are defined as follows: (1) Content: the content of P_i is an episode, denoted as α_{P_i} = 〈E_1 E_2 . . . E_m〉, which can be used as a unique identifier of each pattern. The length of the episode of P_i is m. (2) Occurrence (set): an occurrence of pattern P_i in S is denoted as O_{P_i} = 〈(e_1, t_1), (e_2, t_2), . . . , (e_m, t_m)〉, a list of events e_j ordered by time t_j, where (e_j, t_j) ∈ S (1 ≤ j ≤ m) and e_j is an instance of event type E_j ∈ P_i. The occurrence set of a pattern is denoted as 𝒪_{P_i}. In this paper, we consider the leftmost occurrences of P_i. (3) Support: the support of a pattern P_i is the number of its nonoverlapping occurrences, denoted as supp_i. Two occurrences of P_i are nonoverlapping if they do not share common events. We also do not allow overlapping among occurrences of different patterns.
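The leftmost, nonoverlapping occurrence semantics can be sketched as follows. This is a simplified illustration (the scan pointer only moves forward, so occurrences never share events); the event sequence is hypothetical.

```python
def leftmost_occurrences(seq, episode):
    """Greedily collect leftmost nonoverlapping occurrences of `episode`
    (a list of event types) in `seq` (a list of (event_type, t) pairs)."""
    occs, j = [], 0
    while True:
        occ = []
        for etype in episode:
            while j < len(seq) and seq[j][0] != etype:
                j += 1           # skip events of other types
            if j == len(seq):
                return occs      # episode no longer fits: done
            occ.append(seq[j])
            j += 1
        occs.append(occ)

seq = [("A", 0), ("B", 2), ("C", 5), ("A", 9), ("D", 10), ("B", 11), ("C", 14)]
occs = leftmost_occurrences(seq, ["A", "B", "C"])
support = len(occs)  # 2 nonoverlapping occurrences, so supp = 2
```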

Data Encoding Scheme
In this section, we present our encoding scheme and the way we compute description lengths. It is worth noting that we are only interested in code length functions instead of actual encodings [10]. All logarithms in this paper are to base 2.

Encoding the Pattern Set.
Under the MDL criterion, we first need to encode the model, which in our case is the pattern set P. Here, P stores the elements of all patterns, which comprise the items in the header row of Figure 2. The event sequence S can then be compressed by replacing the occurrences of patterns with the corresponding codes. For each pattern, we need to encode its episode and its histograms, which describe the distribution of time intervals between adjacent events. Although most of the events in S are covered by patterns, some events can be left out. Each such event is covered by a special pattern with only one item in its content (namely, the type E_i of the event), called a singleton pattern, which has no histogram.
As for the description length computation, we first discuss how to calculate L(P_i). We use log(|Σ|) bits to describe each event type. Therefore, an episode α_{P_i} of length m needs m × log(|Σ|) bits to describe. We use a unique code, denoted as C(P_i), to encode each pattern. Intuitively, frequent patterns should have shorter description lengths. Hence, the length of the binary code depends on the pattern's support. Here, we consider optimal prefix codes, where the code length is L(C(P_i)) = −log(supp_i / Σ_{P_j ∈ P} supp_j). As for the histogram h_i^k, let the number of nonempty bins in h_i^k be cnt_bins. Since the bin width is set to the time unit of S, the more regular the time intervals between adjacent events in P_i are, the less diverse their values are and the smaller cnt_bins is. Consider the ratio r_{h^k} = cnt_bins / supp_i. A smaller value of r_{h^k} indicates better temporal regularity between e_k and e_{k+1}. This observation allows us to design an encoding scheme that compresses the temporal regularity of pattern P_i's occurrences.
To be specific, the value of a nonempty bin of h_i^k is a pair (b_j, cnt_{b_j}), where b_j is the interval between the timestamps of two adjacent events and cnt_{b_j} is the number of intervals in the occurrence set that fall into this bin.
We use log(Δ(S)) bits to encode b_j and log(supp_i) bits to encode cnt_{b_j}. We also need extra bits to identify each nonempty bin, whose length depends on the frequency of the bin. Concretely, the more intervals that equal b_j, the shorter the code length, which is −log(cnt_{b_j} / supp_i). To sum up, the description length of a pattern in P is

L(P_i) = m × log(|Σ|) + L(C(P_i)) + Σ_{k=1}^{m−1} Σ_{b_j ∈ h_i^k} (log(Δ(S)) + log(supp_i) − log(cnt_{b_j} / supp_i)). (1)

Figure 2 gives a concrete example of how the description length of a pattern is computed. For all patterns in P, the total description length is

L(P) = Σ_{P_i ∈ P} L(P_i). (2)
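A minimal sketch of the per-pattern cost, under our reading of the encoding scheme (episode cost, prefix-code cost, and per-bin histogram cost). All concrete numbers (timestamps, alphabet size, sequence span, supports) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical (t_A, t_B) timestamp pairs of a pattern's occurrences.
occ_timestamps = [(0, 2), (10, 15), (20, 22), (31, 36), (40, 42)]
intervals = [t2 - t1 for t1, t2 in occ_timestamps]  # [2, 5, 2, 5, 2]
hist = Counter(intervals)            # two nonempty bins: {2: 3, 5: 2}
supp = len(occ_timestamps)           # supp_i = 5
r = len(hist) / supp                 # cnt_bins / supp_i = 0.4

def pattern_length(m, alphabet_size, span, supp, total_supp, hists):
    """Bits for one pattern: episode content, its prefix code, and the
    (value, count, identifier) cost of every nonempty histogram bin."""
    L = m * math.log2(alphabet_size)        # episode of m event types
    L += -math.log2(supp / total_supp)      # optimal prefix code C(P_i)
    for h in hists:                         # one histogram per adjacent pair
        for b, cnt in h.items():
            L += math.log2(span)            # interval value b_j
            L += math.log2(supp)            # bin count cnt_{b_j}
            L += -math.log2(cnt / supp)     # bin identifier
    return L

L_P = pattern_length(m=2, alphabet_size=10, span=100,
                     supp=supp, total_supp=20, hists=[hist])
```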

Encoding the Event Sequence.
Given the encoded pattern set P, we can encode the event sequence S by replacing the occurrences of the patterns in P with their codes. Since S is a temporal sequence, for each occurrence of a pattern, we need to cover all of its events as well as their timestamps. In our encoding scheme, we replace an occurrence with the code of the corresponding pattern P_i and the timestamp of its first event. The description length of a timestamp is log(Δ(S)). Moreover, we use the codes of the histogram bins to specify the time intervals between adjacent events. Each occurrence is encoded individually. For the remaining events that are not covered by any pattern, we simply use singleton patterns E_i to encode them, i.e., (C(E_i), Δ(S)).
As a concrete example, we encode the sequence S in Figure 1 using the pattern set in Figure 2. Here, S is encoded by replacing the original events and timestamps in each occurrence of each pattern P_i with the corresponding codes.
With this encoding scheme, the description length of sequence S encoded by P is calculated as follows:

L(S|P) = Σ_{P_i ∈ P} Σ_{O ∈ 𝒪_{P_i}} (L(C(P_i)) + log(Δ(S)) + Σ_{k=1}^{m−1} −log(cnt_{b_j} / supp_i)). (3)

DTS: Discovering Patterns from Temporal Sequences
In this section, we introduce our DTS algorithm for pattern discovery. The problem of discovering a set of patterns P that best compresses S based on MDL has been proven to be NP-hard [9]. Therefore, we resort to discovering informative patterns heuristically instead of pursuing the optimal result. We assume sequence S is encoded by singleton patterns at the very beginning. Our DTS algorithm starts from an empty pattern set P and updates P iteratively until no further compression can be achieved.

Overall Process of DTS.
We now present an overview of our DTS algorithm. A naïve strategy is to update the pattern set P by inserting the best pattern (the one that achieves the largest gain in compression) in the current sequence S. However, for system logs, it is common that the execution of a task generates several different branches. For example, Figure 3 shows a simplified process of an HDFS client writing files. Each time the client finishes writing a data block, it calls addBlock() to allocate a new block. Each time a DataNode receives a new block, it reports to the NameNode, thus generating a sequence 〈XYYY〉. Also, the NameNode sometimes calls fsync() to persist the cached data, thus triggering the log event M. Therefore, another sequence 〈XMYYY〉 can also be generated during the writing process. The common part of these two sequences, i.e., 〈XYYY〉, can be easily discovered and inserted into the set P. Nonetheless, it is much harder to recognize M together with 〈XYYY〉. In practical maintenance applications, we hope to identify the behavioral patterns more completely.
This cannot be achieved using the aforementioned naïve strategy.
Faced with this problem, we divide the update operations on P into two types: the insertion operation and the refinement operation. Their definitions are as follows.
Definition 1 (insertion operation). Insert the newly discovered candidate pattern P into P. The pattern set is updated to P ∪ {P}.
Definition 2 (refinement operation). Given a pattern P = 〈E_1 E_2 . . . E_m〉 (P ∈ P), we refine it by adding an event of type E at position j (j ∈ [0, . . . , m]) of P. The result is a new pattern P_r of length m + 1. Specifically, P_r = 〈E_1 . . . E_j E E_{j+1} . . . E_m〉, where E is called the refining event. The pattern set is updated to P \ {P} ∪ {P_m, P_r}, where P_m denotes the remaining original pattern. If there are remaining occurrences of the original P, the operation is called a partial refinement; otherwise, it is called a full refinement and P_m is null. Figure 4 presents an example of the refinement operation. The pattern set P is updated iteratively by either inserting a new pattern into P or refining a pattern that is already in P. It is common that an event covered by a candidate pattern P is also a refining event for another pattern P′ ∈ P. We refer to such a case as an event conflict.
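In terms of episode content alone, the refinement of Definition 2 is a single insertion into the episode, e.g.:

```python
def refine_episode(episode, j, etype):
    """Insert refining event type `etype` at position j of `episode`:
    j = 0 places it before the first event, j = m after the last."""
    return episode[:j] + [etype] + episode[j:]

ep = refine_episode(["D", "D"], 1, "F")
# ep == ["D", "F", "D"]
```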
In each iteration, if there is no event conflict, we can greedily add the best candidate pattern to P. However, if there exists an event conflict, since our encoding scheme does not allow overlap among different patterns, it is necessary to decide which update operation is better. Here, we choose the operation that decreases the description length L(P) + L(S|P) more. After choosing an update operation, we encode the events covered by the new patterns with the corresponding codes.
The pattern set P is updated iteratively until we cannot decrease the description length any more. The overall process of DTS is divided into two stages: an update stage driven by the event conflict and an update stage containing only refinement. We maintain two data structures to support the two update operations: a candidate pattern set CP for insertion and a candidate refinement map CR for refinement. CP dynamically stores the candidate patterns in the current sequence, while CR stores the possible refinements for each pattern in P. As shown in Figure 5, in the first stage, DTS iteratively compares the compression gains of the candidate pattern and the refinement result, selecting the better one to update P. When CP becomes empty, i.e., we cannot obtain a valid candidate pattern any more, DTS continues with the second stage, iteratively updating P with the best refinement in CR until CR is empty. We now move on to the technical details of our DTS algorithm.

Candidate Pattern Set CP.
Concerning the insertion operation, we hope the pattern P added to P best decreases the description length. To this end, we define the compression gain of a pattern P as gain(P) = L(S, P) − L(S, P ∪ {P}). A naïve way of selecting the best candidate pattern is to enumerate all possible patterns exhaustively. However, this is time-consuming due to the exponential search space. In practice, instead of searching from scratch in each iteration, we maintain a candidate pattern set CP that stores promising candidate patterns. We initialize CP with Algorithm 1. For each event type E ∈ Σ, BestGrowth(α, S, E, P) generates a pattern starting from the singleton pattern P = 〈E〉 by greedily appending an event type E′ to the end of α. Since it is computationally intensive to find the optimal occurrence set with the fewest histogram bins, we resort to a simple approach that finds the leftmost occurrences of P. If the newly generated pattern yields a greater compression gain than the previous one, we move on to test the next pattern created by a new growth. The process stops when we cannot obtain a new pattern better than the previous one. We select the last pattern P_last as the best pattern starting with E and add it to CP.
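The greedy growth loop can be sketched as follows. The `gain` callback is a hypothetical stand-in for the compression gain defined above, and the target episode used in the toy score is likewise hypothetical.

```python
def best_growth(start_etype, alphabet, gain):
    """Grow an episode one event type at a time, keeping the extension with
    the best score, and stop once no extension beats the current episode."""
    episode = [start_etype]
    best_score = gain(episode)
    while True:
        ext, ext_score = None, best_score
        for e in alphabet:
            g = gain(episode + [e])
            if g > ext_score:
                ext, ext_score = e, g
        if ext is None:
            return episode          # no extension improves the score
        episode.append(ext)
        best_score = ext_score

target = ["A", "B", "C"]
def toy_gain(ep):
    # Reward matching the (hypothetical) true behavior ABC, minus a small
    # per-event cost so growth stops once extensions no longer pay off.
    return sum(a == b for a, b in zip(ep, target)) - 0.1 * len(ep)

best = best_growth("A", ["A", "B", "C", "D"], toy_gain)
# best == ["A", "B", "C"]
```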

Candidate Refinement Map CR.
Considering the refinement operation, we maintain a candidate refinement map CR to store the best refinements for each pattern P ∈ P. Each refinement follows the format R(P)_E^j = [refined position j, event type E, remaining pattern P_m, refined pattern P_r]. For example, if pattern DD is partially refined by F at position 1, the result is written as [1, F, DD, DFD], with occurrences of the original DD remaining. On the contrary, if DD is fully refined by F, the refinement result is written as [1, F, null, DFD]. The refinements in CR are indexed by the refining event type; in other words, refinement results with the same event type are grouped together. When the pattern set P is updated with a new pattern P, which is either selected from CP or generated by a refinement operation, we decide whether there exists a good refinement for each position by Algorithm 2. For each position j, we only store the refinement that best decreases the encoded description length. The compression gain of a refinement is defined analogously to that of an insertion, i.e., the decrease in L(S, P) achieved by applying the refinement. We obtain a refinement REFINE(P, S, j, E) as follows. The content of the newly refined pattern P_r can be easily obtained by Definition 2. The occurrences of P_r can be obtained from the original occurrence set of P. For each occurrence O_P = 〈(e_1, t_1), (e_2, t_2), . . . , (e_m, t_m)〉 ∈ 𝒪_P, we find the leftmost event instance (e_i, t_i) in S that meets the time restriction and refine O_P with it.

The Complete DTS Algorithm.
In this section, we present the complete DTS algorithm for heuristic discovery of the pattern set P. As is shown in Algorithm 3, we update P iteratively until we cannot decrease the description length of S any more. At the beginning, we initialize the relevant data structures. Lines 4-17 correspond to the update stage driven by the event conflict. During this stage, DTS first gets the best candidate pattern P * in CP.
Then, DTS calls the function BestConflictRefinement(P, CR) to check whether one or more refinements in CR conflict with P* and retrieves the best one, R*(P′)_E. DTS decides which operation better updates the pattern set P by comparing the compression gains of P* and R*(P′)_E at the event level, i.e., normalized by the number of events each operation covers, in order to avoid bias towards patterns with longer episodes or greater frequency. A positive value of cmp(P*, R*(P′)_E) indicates that P* has a greater compression gain; otherwise, R*(P′)_E is better. If P* is better or there is no conflicting refinement, P is updated to P ∪ {P*}; otherwise, it is updated to P \ {P′} ∪ {P′_m, P′_r}.
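The event-level comparison can be sketched as follows; the gains and covered-event counts are hypothetical numbers.

```python
def cmp_ops(gain_pattern, events_pattern, gain_refine, events_refine):
    """Compare per-event compression gains: a positive result favors the
    insertion, a negative one favors the refinement."""
    return gain_pattern / events_pattern - gain_refine / events_refine

# 120 bits saved over 60 events (2.0 bits/event) vs 50 bits over 30 events
# (about 1.67 bits/event): the insertion wins here.
choice = "insert" if cmp_ops(120.0, 60, 50.0, 30) > 0 else "refine"
```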
When the function BestCandidatePattern(CP) returns null, indicating that there are no more candidate patterns in the current sequence, DTS moves on to the next stage. Lines 18-21 correspond to the update stage containing only refinement. Here, DTS iteratively updates P with the best refinement in CR.
To illustrate, consider the example in Figure 6, in which the current best candidate pattern is P1 = EF, and there exist two conflicting refinements in CR, among which R*(DD)_F = [1, F, DD, DFD] is selected using BestCandidateRefinement(EF). If R*(DD)_F has a greater compression gain, the pattern set P is updated to P′; otherwise, pattern P1 = EF is added to P, resulting in P″. The data structures CP and CR are maintained dynamically: after each update, we delete invalid elements in them and adjust the patterns that are influenced by the update.

Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our approach. All experiments were executed on a machine with an Intel i7-3770 CPU @ 3.4 GHz, 24 GB of RAM, and Windows 10.

Datasets.
We use four log datasets collected from real systems. The basic statistics of these datasets are summarized in Table 2. The Zookeeper and OpenStack datasets are collected from the well-known loghub data repository [15], while the NameNode and DataNode logs are collected from our own HDFS cluster. These datasets have different characteristics. The NameNode logs have more event types and more complex system behaviors interleaving together, as a NameNode manages multiple DataNodes. The DataNode and Zookeeper logs have fewer event types as well as simpler and more regular behaviors. The OpenStack logs contain only behaviors such as creating a project and other simple tasks, and thus have the fewest event types. For each log dataset, we calculate the ratio of event types that occur fewer than 50 times. As shown in Table 2, three of the four datasets contain a high percentage of low-support event types (more than 65%). To discover nontrivial patterns in these datasets effectively, we need to set the support threshold to a low value, which can lead to the pattern explosion problem. We will later show that DTS can discover nontrivial patterns while maintaining a low level of redundancy.
We exclude SeqKrimp from our experiments because GoKrimp yields similar results and is more efficient than SeqKrimp. For CSC and our approach, we simply use the entire sequence as input. For the other approaches, the sequences are broken into disjoint subsequences of size 10. Our algorithms are implemented in Java, while the implementations of the other approaches are obtained from the original authors.

Evaluation of Efficiency.
In the first experiment, we compare the efficiency of all approaches as the length of the log sequence varies on the NameNode and DataNode datasets. The results are shown in Figure 7. It is worth noting that although the results of all 5 approaches are plotted in the same figures, the time unit for CSC, GoKrimp, and our approach is seconds, while that for SQS and ISM is minutes.
As shown, CSC is faster than all the other approaches because its gap constraint greatly restricts the size of the search space. However, this constraint means that CSC cannot find complex patterns effectively, as will be shown in the next experiment. DTS is slower than GoKrimp, as GoKrimp applies a dependency test for speedup. Nonetheless, this test prevents GoKrimp from discovering low-frequency patterns.
Although DTS takes a comparatively longer time, it needs less than 400 seconds to process a complex sequence with 600,000 events, which is acceptable in real applications. Besides, the running time of DTS increases steadily with the log size, indicating good scalability. On the contrary, SQS and ISM are much slower, with processing times ranging from dozens of minutes to several hours due to their complex iterative computations. The running times on the Zookeeper and OpenStack datasets are shown in Table 3 and are similar to those on the NameNode and DataNode logs. CSC is still the most efficient, while DTS comes next.


Evaluation of the Compression Ratio.
According to the MDL principle, we can encode the sequence with the discovered pattern set. The smaller the description length L(S, P) is, the stronger the expressiveness of the encoding pattern set. Here, we use the compression ratio [9] (denoted as CR) to measure the effectiveness of the encoding. CR is defined as the ratio of the description length of the encoding with only singleton patterns to that with the discovered patterns. The higher the ratio, the better the expressiveness. Note that here we only compare the compression ratios of the MDL-based approaches. Figure 8 shows the compression ratios on all four datasets. As indicated, DTS generally achieves the best compression ratio, thanks to our effective encoding scheme as well as the proposed refinement operations and the heuristic strategy based on the event conflict. Among the other three comparison algorithms, the iterative search strategy of GoKrimp is too simple to yield expressive results; therefore, it generally achieves the lowest compression ratio. CSC performs well on the DataNode and Zookeeper logs, yielding performance similar to DTS. This can be attributed to the fact that CSC is suitable for processing log data in which the time intervals between adjacent events are short and constant. On the contrary, for log sequences such as the NameNode logs, which contain complex patterns, CSC tends to perform poorly. As for SQS, thanks to its complex search strategy, it performs well on more complex data. The problem with SQS is that it generates too many patterns, so the model used to describe the data becomes too complicated, resulting in a greater description length. This can negatively impact the compression ratio of SQS, especially when the patterns in the log data are simple.
It is worth noting that while our approach generally achieves higher compression ratios, this does not mean that it produces more redundancy, which we will discuss later.
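For reference, the CR metric itself is a simple ratio (the bit counts below are hypothetical):

```python
def compression_ratio(len_singletons, len_with_patterns):
    """CR = description length with only singleton patterns divided by the
    description length with the discovered pattern set; higher is better."""
    return len_singletons / len_with_patterns

cr = compression_ratio(10_000.0, 4_000.0)
# cr == 2.5
```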

Evaluation of the Event Coverage.
In this section, we evaluate whether the patterns returned by different approaches adequately cover the various event types. We use the event coverage [17] (denoted as EC) as the metric, which measures the percentage of event types covered by the discovered patterns. A larger EC indicates better performance. Figure 9 shows the event coverage of all 5 approaches on the four log datasets. As shown, except for GoKrimp, all approaches achieve event coverage over 80%. By contrast, while GoKrimp achieves an EC of 91% on the OpenStack dataset, it fails to reach an EC of 50% on the remaining datasets. This is related to the proportion of rare event types in the log data. As mentioned before, the search heuristic of GoKrimp is oversimplified, which prevents it from discovering patterns with infrequent event types. Therefore, the more infrequent event types a log dataset contains, the lower the event coverage of GoKrimp is. By contrast, this factor has little effect on the other four algorithms.
Overall, our DTS achieves the best coverage on three of the four log datasets. However, it is slightly inferior to SQS and CSC on the OpenStack logs. This is because three event types in the OpenStack dataset occur randomly in the sequence, exhibiting no significant temporal regularity. Therefore, DTS does not consider these event types as part of any pattern, while CSC and SQS greedily include them in some patterns. This showcases the advantage of DTS, which takes the temporal regularities of patterns into consideration, reducing the number of invalid patterns made up of random combinations of unrelated events.
This can lead to a far less redundant result than those of CSC and SQS.

Evaluation of the Pattern Redundancy.
The goal of this work is to mine high-quality, low-redundancy pattern sets from log sequences. In previous sections, we have evaluated pattern quality using the compression ratio and event type coverage. We now shift our attention to the evaluation of pattern redundancy. Here, we utilize the following metrics: (1) Average intersequence edit distance (AED): this metric measures the edit similarity among the discovered patterns. Concretely, the AED of a pattern set is calculated as follows: for each pattern P in set P, we calculate the edit distance between P and each of the remaining discovered patterns P′ (P′ ∈ P) and normalize the distance by P's length. We record the minimum distance between P and all other patterns. AED is obtained by averaging these minimum distances over all patterns in the pattern set. The larger the distance is, the less redundant the pattern set is.
(2) Average count of supersets (ACS): this metric measures the diversity of the event types contained in the discovered pattern set. Concretely, the ACS of a pattern set is calculated as follows: for each pattern P in P, we denote the event type set contained in P as Set(P). For each P, we count the remaining patterns P′ (P′ ∈ P) whose event type set Set(P′) is a superset of Set(P). ACS is obtained by averaging this count over all patterns in the pattern set. For each pattern, the fewer patterns whose event type set is its superset, the less redundant the pattern is. Therefore, the smaller the ACS is, the more diverse the event types are and the less redundant the pattern set is. Table 4 reports the results of the four evaluation metrics on the different datasets. As is shown, our approach has a larger inter-sequence edit distance and a smaller count of supersets on average, which indicates that the patterns generated by our approach have lower redundancy and higher diversity. In terms of the number of patterns, DTS has relatively small No.P values while ensuring adequate coverage of event types and a high level of compression. Overall, the pattern sets discovered by DTS can express the meaningful information of a log sequence with low redundancy.
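A minimal sketch of the ACS metric, again following the textual definition rather than any released implementation; whether "superset" is taken as proper or non-strict is an assumption here (non-strict containment is used below):

```python
def acs(patterns):
    """Average count of supersets of a pattern set.

    For each pattern P, count the other patterns whose event type set
    contains Set(P), then average these counts over all patterns.
    Smaller ACS indicates more diverse, less redundant patterns.
    """
    type_sets = [set(p) for p in patterns]
    counts = []
    for i, s in enumerate(type_sets):
        # t >= s tests whether t is a (non-strict) superset of s
        counts.append(sum(1 for j, t in enumerate(type_sets)
                          if j != i and t >= s))
    return sum(counts) / len(counts)
```

For instance, with patterns over the type sets {a, b}, {a, b, c}, and {d}, only the first pattern has a superset among the others, giving an ACS of 1/3.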
Among the four rival algorithms, GoKrimp yields relatively good performance, generating the smallest number of patterns with relatively low redundancy. However, as was previously shown, the expressiveness of GoKrimp is generally the worst among all methods. As a matter of fact, the search strategy of GoKrimp acts as a double-edged sword: on the one hand, it helps achieve compact results; on the other hand, it limits GoKrimp's ability to mine meaningful patterns. Overall, this tradeoff is not desirable in practice. As for ISM, it has reasonable AED and ACS values, yet it is inferior when compared against DTS. By contrast, the results of CSC on the Zookeeper and DataNode logs are less redundant, which again manifests the effectiveness of CSC for this type of log data. However, on the NameNode and OpenStack logs, the redundancy of CSC is high. In particular, the values of ACS are more than 10, meaning that each combination of event types generates more than 10 patterns on average. As for SQS, it has a high degree of redundancy on every type of log data. This is because the encoding scheme of SQS only considers the support of the patterns and does not fully take the temporal regularity into account. Therefore, any sequential pattern generated by frequent behaviors can result in good compression, which leads to many redundant patterns.

Related Work
As an important data source for system management, logs have been widely used in many tasks such as anomaly detection [5,13], program workflow modeling [1], failure diagnosis [18], and performance monitoring [19]. These works mostly focus on automatic system maintenance and diagnosis through log analysis. In this paper, we aim at discovering meaningful patterns from raw log sequences. The discovered patterns can provide useful information that can support the aforementioned tasks.
Sequential pattern mining was first introduced by Agrawal and Srikant [6]. Since then, various sequential pattern mining algorithms have been proposed, such as PrefixSpan [20], SPAM [21], and BIDE [7]. Traditional pattern mining approaches usually generate a huge number of redundant patterns. This problem is commonly known as pattern explosion. Several restrictions on pattern semantics have been proposed to tackle this problem, such as closed frequent patterns [22] and maximal frequent patterns [23]. However, these measures do not fully resolve the problem. In order to reduce pattern redundancy, modern sequential pattern mining approaches resort to the minimum description length (MDL) principle [10] to select a set of patterns that can best compress the data. KRIMP [24] pioneered the use of MDL in identifying good pattern sets, mining itemsets that can well describe a transaction database. GoKrimp [9], SQS [12], and CSC [11] extend this methodology to the sequential pattern mining task. The basic idea of these approaches is to cover a sequence database with a set of patterns that achieves the highest compression. Li et al. [25] proposed a parallel approach, FCT, to efficiently find frequent co-occurring terms in relational data.
However, these existing approaches do not take the temporal regularity of timestamped sequences into consideration. By comparison, our approach achieves better results by utilizing the event relationships in a pattern as well as the temporal regularities among events, and it is capable of discovering informative patterns with low redundancy. Instead of designing an MDL-based encoding scheme, ISM [16] presents a subsequence interleaving method based on a probabilistic model of the sequence database, which searches for the most compressive set of patterns. However, this approach, along with the aforementioned SQS, is slow in our experiments due to its exhaustive nature.

Conclusions
In this paper, we have proposed a novel approach to discover sequential patterns from log sequences with temporal regularities, which processes the log data in a black-box manner and does not require any domain knowledge of the system. Specifically, we have drawn on the MDL principle and formalized an encoding scheme that takes event relationships as well as temporal regularities into consideration. Based on this scheme, we have proposed DTS, an efficient heuristic algorithm that greedily updates the pattern set. Extensive experiments on real datasets show that the proposed approach can discover high-quality patterns efficiently.

Data Availability
The NameNode and DataNode datasets are available from the corresponding author upon request. The Zookeeper and OpenStack datasets are collected from the well-known loghub data repository (https://github.com/logpai/loghub).

Conflicts of Interest
The authors declare that they have no conflicts of interest.