DLLog: An Online Log Parsing Approach for Large-Scale System

Syslog is a critical data source for analyzing system problems. Converting unstructured log entries into structured log data is necessary for effective log analysis. Existing log parsing methods demonstrate promising accuracy on limited datasets, but their generalizability and precision are uncertain when applied to diverse log data, so enhancements in these areas are necessary. This paper proposes an online log parsing method called DLLog, which is based on deep learning and the longest common subsequence. DLLog utilizes a GRU neural network to mine template words and applies the longest common subsequence to parse log entries in real time. In the offline stage, DLLog combines multiple log features to accurately extract the template words, creating a log template set to assist online log parsing. In the online stage, DLLog parses log entries by calculating the matching degree between the real-time log entry and each log template in the log template set. The method also supports incremental updates of the log template set to handle new log entries generated by systems. We summarized the previous works and validated DLLog using real log data collected from 16 systems. The results demonstrate that DLLog achieves high parsing accuracy, universality, and adaptability.


Introduction
Log data serves as a valuable and reliable source for operations staff to monitor systems, detect abnormalities, and locate faults [1]. Log data, easily obtainable from systems, contains a wealth of information, including system status, performance, and resource usage. However, log data is inherently unstructured, while most system analysis tasks require structured data as input [2][3][4]. Therefore, parsing unstructured log data into structured data becomes essential [5,6]. This paper aims to develop a log parsing method characterized by high accuracy, universality, and adaptability. The goal is to enable the accurate extraction of log templates from log data without manual intervention.
Traditional log parsing methods require considerable human resources and time. Moreover, as system scale and complexity increase, data volume expands rapidly. Importantly, developers have not established a unified standard for log format, making traditional manual log parsing methods impractical. Static code-based parsing methods are severely limited [7][8][9] because obtaining system source code during the parsing process is challenging. While frequent pattern mining-based log parsing methods demonstrate competitive parsing efficiency, they struggle to match rare, low-frequency logs to any log template, resulting in suboptimal parsing results [10][11][12]. Clustering-based log parsing methods often suffer from low parsing accuracy due to their simplistic parsing patterns (e.g., dividing log groups based on word frequency or different word types) [2][13][14][15]. In comparison to static code-based or frequent pattern mining-based methods, clustering methods have slower parsing speeds and require numerous iterations.
Current log parsing methods often exhibit limitations in terms of parsing accuracy and universality. While a specific log parsing method may demonstrate high accuracy on a particular dataset, it frequently struggles to maintain comparable accuracy when applied to a broader range of datasets. Log parsing methods also need incremental update capabilities, because systems undergo sporadic updates or optimizations after deployment, generating new log data that must be matched with novel log templates. Log parsing methods lacking incremental update functionality require substantial computational resources to build a new parsing model. A log parsing method that can update its model during the parsing process is therefore highly desirable.
To address these challenges, we propose an online log parsing method called DLLog, based on GRU neural networks and the longest common subsequence. Our method accurately mines log template words using multiple log features, thus achieving high universality. Prior to template matching, DLLog pre-classifies log templates to reduce the time spent on incorrect template matching. Moreover, our method supports log template set updates to accommodate new log data generated by the system.
DLLog parses logs by utilizing the structural, frequency, and association features of log entries. It combines offline log template word mining and online log parsing to enhance the universality and parsing accuracy of DLLog for large-scale log datasets. In the offline mining stage, DLLog initially employs common regular expressions to clean the logs and remove obvious parameter words. Then, it transforms each log entry into a sequence of word frequencies based on the log word frequency and the log structural feature. Subsequently, DLLog employs a GRU neural network to identify potential relationships between log words and extract log template words based on these relationships. Because log sequences include the log structural feature, log template words from rare logs are more easily and accurately mined by DLLog. Finally, log entries with the same log template words are categorized into the same log group. Each log group corresponds to a log template. Different log templates form a pre-classified log template set based on the log structural feature. This process does not require manual intervention. In addition, this grouping pattern effectively avoids the problem that rare logs cannot match any log template.
During the online parsing stage, DLLog processes logs by calculating the matching degree, defined as the length of the longest common subsequence between a real-time log entry and a template in the existing log template set. Based on the matching results, DLLog determines whether to update the log template set. By adopting incremental updates to the log template set, DLLog eliminates the need for retraining models, ensuring the efficient operation of the log parsing method and enhancing its universality when applied to large-scale log datasets. This paper evaluates DLLog on several extensive log datasets, demonstrating its success in achieving high parsing accuracy, universality, and adaptability.
The primary contributions of this paper are summarized as follows:
(i) This paper introduces an offline log template word mining approach that utilizes a GRU neural network to extract log template words and partition log data into distinct log groups.
(ii) This paper proposes an online log parsing method that leverages the longest common subsequence, enabling updates to the log template set to accommodate newly generated log data from the system.
(iii) We conducted comprehensive experiments and evaluations on various large-scale log datasets, demonstrating the superior performance of DLLog in terms of accuracy, universality, and adaptability.
The rest of this paper is organized as follows: Section 2 presents the related work on log parsing. Section 3 presents the basic structure of log data. Section 4 presents the detailed design of DLLog. Section 5 evaluates the performance of DLLog through experiments. Finally, Section 6 presents the final remarks.

Related Work
System logs are invaluable data resources extensively utilized in system operation and maintenance, fault analysis and detection, and various practical applications [16][17][18][19]. Since log messages typically consist of semi-structured text strings, log parsing is essential for converting unstructured logs into structured data [20]. Log parsing preserves the essence of log entries, removes parameter words, and minimizes log entry dimensions, making it easier to map diverse unstructured logs to standard log templates. We have categorized and summarized recent research in log parsing into the following categories.

2.1. Static Code-Based Log Parsing Methods. Liang et al. [8] introduced an MTS-DCGAN log parsing method based on source code analysis. This approach involves querying class names, call relationships, and object names associated with system behaviors. By traversing the syntax tree, log templates are constructed. Kabinna et al. [21] proposed the Cox models, which follow similar principles to MTS-DCGAN, identifying format strings in the code to create log parsing templates. While these methods accurately generate log templates, they depend on access to system source code, which prevents their use on closed-source systems.

2.2. Heuristic-Based Log Parsing Methods. He et al. [22] developed Drain, a log parsing method that utilizes parsing trees. Drain constructs parsing trees and then compares variances between log entries and log event groups within the parsing tree for log parsing purposes. While Drain provides high parsing accuracy, its versatility is limited, and it requires domain-specific knowledge.
Zhang et al. [23] presented the FT-tree method for log parsing, which creates a log template tree by analyzing log words and their combinations. The process involves pruning the log template tree by removing branches that do not satisfy certain constraints. Consequently, all log words along the path from the root node to any leaf node in the pruned log template tree constitute a log template. However, a drawback of this method is its tendency to overlook infrequent log templates, potentially leading to reduced parsing accuracy.

International Journal of Intelligent Systems

2.3. Clustering-Based Log Parsing Methods. Sedki et al. [10] proposed the unified log parsing tool, which identifies frequent phrases in log data to form frequent candidate itemsets. These itemsets are then clustered to generate class clusters along with their corresponding templates. Fu et al. [13] proposed the LKE method, a log parsing technique based on K-means clustering. LKE extracts log templates from initial log groups obtained by segmenting clusters using cluster midpoints and parameter distances. However, due to log data imbalance, clustering-based methods might misclassify low-frequency log template words as parameter words, leading to lower parsing accuracy.

2.4. Other Log Parsing Methods. Makanju et al. [24] proposed the iterative partition log mining (IPLoM) method, which employs iterative partitioning to categorize log entries into distinct groups. IPLoM further refines partitions based on log identifiers and location information to extract log templates for each log group. AEL [25] employs a clone detection method for log parsing, assuming significant text similarity among log entries within the same log event. AEL employs the "Adjust" step to consolidate similar log execution events and resolve all log templates. Du and Li [26] presented Spell, an online log parsing method based on the longest common subsequence, which updates and maintains the longest common subsequence library (LCSMap) of log event sequences.

Log Structure Overview
System logs are unstructured data stored as free text, recording various events, states, errors, or interaction behaviors generated by systems or components. Typically, there is no unified standard for defining log entry formats and syntax structures across different systems. Each log entry consists of a constant part and a variable part. The constant part, also referred to as the log template, comprises fixed plain text information generated by the printout code, containing semantic information in the form of log template words. The variable part, including dynamic parameter information such as IP addresses, port numbers, and file names, changes with log events. The words that make up the variable parts are referred to as log parameter words and generally lack valuable semantic information. Although the formats of log data vary greatly among different systems, log data typically includes the following important components: timestamp, log level, component, and log event.
(1) Timestamp: The time when the system generated the log entry.
(2) Log level: Also known as log type, it indicates the severity of log events (such as info, error, and warn).
(3) Component: The name of the component (software module or server) that generates log events.
(4) Log event: Describes the system interaction event under a specific time and environment. Generally, a log entry contains only one log event.
In log data, the log event serves as the core of each log entry. Log parsing extracts the constant part (common field) of log events to create a log template representing each log entry. Table 1 displays log samples from eight different types of original log data, including distributed systems, supercomputer systems, operating systems, and mobile systems.
We take the HDFS log entry (081109 203521 146 INFO dfs.DataNode$PacketResponder: Received block blk_7503483334202473044 of size 233217 from /10.251.71.16) as an example. The log event part is generated by the system printout code "LOG.info("Received block " + block + " of size " + block.getNumBytes() + " from " + inAddr)." The fixed parts are "Received block," "of size," and "from," which remain unchanged regardless of the event object. These words also constitute the log template for the log event. In the HDFS log data format shown in Table 2, the symbol "*" denotes a placeholder. A log template can be used to represent multiple log entries. Figure 1 provides twelve examples of HDFS raw log data.
In these examples, the log template "Received block * of size * from *" can also represent the second log entry (081109 205412 832 INFO dfs.DataNode$PacketResponder: Received block blk_-5704899712662113150 of size 67108864 from /10.251.91.229). Each log entry in log data can be characterized by only one log template, but one log template can represent multiple log entries. Table 3 displays the corresponding log templates for all log data examples in Figure 1.
As shown in Table 3, we can convert 12 different unstructured log entries into 5 types of structured data by transforming log data into log templates. A log template is a standardized format for representing a group of original log entries. Log entries with the same log template represent the same type of log event. In essence, the core of log parsing lies in converting each log entry into a specific log template. During log parsing, a parser must explicitly distinguish between the constant and variable parts of the log event, extract the constant part (log template words) to compose the log template, and then use the log template to represent the log entry, thereby completing the log data parsing task.
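The relationship between a log entry, its header fields, and its template can be illustrated with a minimal Python sketch. The regular expression for the HDFS header, the field name `pid` for the third column, and the helper names `split_entry` and `match_template` are assumptions made for illustration; they are not part of DLLog.

```python
import re

# Assumed header layout for the HDFS sample: date, time, a numeric id
# (called "pid" here for illustration), log level, component, log event.
HDFS_LINE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{6}) (?P<pid>\d+) "
    r"(?P<level>[A-Z]+) (?P<component>[^:]+): (?P<event>.*)$"
)


def split_entry(line):
    """Split one HDFS log entry into its header fields and log event."""
    m = HDFS_LINE.match(line)
    if m is None:
        raise ValueError("line does not follow the assumed HDFS layout")
    return m.groupdict()


def match_template(event, template):
    """Return the parameter words if `event` fits `template`, else None.

    Each "*" placeholder is assumed to stand for one whitespace-free token.
    """
    parts = [re.escape(p) for p in template.split("*")]
    pattern = re.compile("^" + r"(\S+)".join(parts) + "$")
    m = pattern.match(event)
    return list(m.groups()) if m else None


line = ("081109 203521 146 INFO dfs.DataNode$PacketResponder: "
        "Received block blk_7503483334202473044 of size 233217 "
        "from /10.251.71.16")
fields = split_entry(line)
params = match_template(fields["event"], "Received block * of size * from *")
```

Here `params` holds the three parameter words of the entry, while the template itself carries the constant part.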

DLLog Architecture and Overview
This section provides a detailed overview of the proposed online log parsing method, DLLog, which is based on GRU deep learning and the longest common subsequence. The fundamental concept behind DLLog is that log templates typically consist of the longest combinations of frequently occurring words. DLLog comprises three main modules: log data vectorization, offline log template word mining, and online log parsing. Figure 2 illustrates the framework of DLLog. Table 4 lists the notations used in DLLog and their explanations.
Log vectorization is the first step in DLLog. Its objective is to convert unstructured log entries into vectorized sequences, which are then used for offline log template word mining and online log parsing. The process for vectorizing log entries consists of three steps:

(1) The first step is to scan the entire log dataset, break down each log into words, and employ regular expressions to filter obvious log parameter words with a fixed format (such as IP addresses and file paths). The log vectorization process in Section 4.1 processes log datasets using the log data filtering rules provided by FT-tree [23], Spell [26], and Drain [22], which are widely adopted in the log parsing domain.
(2) The second step is to count the frequency of log words. In this step, the module fully considers the structural and frequency features of the log. The frequency is derived from the statistics of log level words ($F_{level}$), log component words ($F_{component}$), and log event words ($F_{word}$). Next, we categorize the frequency information by word type, sort it in descending order, and store it in the word frequency table, denoted as $F$, which is defined as

$$F = \langle F_{level}, F_{component}, F_{word} \rangle.$$

In the word frequency table $F$, the IDs corresponding to $F_{level}$ are positioned at the beginning of the frequency table, followed by $F_{component}$ in the middle, and $F_{word}$ at the end. Figure 3 provides a structure sample of the word frequency table $F$. We use the HDFS original log dataset in Figure 1 as an example to further illustrate the process of creating the word frequency table $F$. This log dataset includes two log levels: "INFO" and "WARN." Therefore, the word frequency sorting result for log levels can be expressed as $F_{level}$ = <(1: INFO: 10), (2: WARN: 2)>. Similarly, the frequency sorting results for the four log components can be expressed as $F_{component}$ = <(3: dfs.DataNode$DataXceiver: …), …>. The processing of $F_{word}$ follows the same procedure as $F_{component}$. The final word frequency table $F$ corresponding to the log dataset can be expressed as $F$ = <(1: INFO: 10), …, (25: blk_-3362838757940877177: 1)>. In the table $F$, each row is represented as a triple (ID: Word: Frequency), where the first unit represents the word frequency ID, the second unit is the word itself, and the third unit is the frequency (the number of times the word appears in the dataset). By categorizing log words into $F_{level}$, $F_{component}$, and $F_{word}$, this method helps prevent the incorrect categorization of low-frequency log template words as log parameter words, mitigating issues arising from unbalanced log data features.
(3) The third step is to replace log words with word frequency IDs, constructing the log word (token) frequency sequence in ascending order.
It is important to note that the online log entry vectorization process only requires the first and third steps. Since the word frequency table $F$ has already been constructed, we simply need to follow step (1) to clean the online log entry. If a log word appears in the real-time log but is not present in the word frequency table $F$, we incrementally update the word frequency table $F$ with the newly encountered log word. Then, according to the word frequency table $F$, we convert the cleaned log data into a log word sequence. Figure 4 illustrates an example of log vectorization.
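The three vectorization steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three filter patterns are placeholders for the FT-tree, Spell, and Drain rules, entries are assumed to arrive already split into (level, component, event), ties in frequency are broken alphabetically, and newly seen online words are simply appended with the next free ID.

```python
import re
from collections import Counter

# Illustrative filters only; stand-ins for the published filtering rules.
PARAM_PATTERNS = [
    re.compile(r"^/?\d{1,3}(\.\d{1,3}){3}(:\d+)?$"),  # IP address, optional port
    re.compile(r"^blk_-?\d+$"),                         # HDFS block ID
    re.compile(r"^\d+$"),                               # plain number
]


def clean(words):
    """Step 1: drop words that match an obvious parameter pattern."""
    return [w for w in words if not any(p.match(w) for p in PARAM_PATTERNS)]


def build_frequency_table(entries):
    """Step 2: count words per type and assign IDs (levels, components, words)."""
    levels, components, words = Counter(), Counter(), Counter()
    for level, component, event in entries:
        levels[level] += 1
        components[component] += 1
        words.update(clean(event.split()))
    table, next_id = {}, 1
    for kind, counter in (("level", levels), ("component", components),
                          ("word", words)):
        # Descending frequency; alphabetical tie-break is an assumption.
        for word, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
            table[(kind, word)] = (next_id, freq)
            next_id += 1
    return table


def vectorize(entry, table, ascending=True):
    """Step 3: replace words with IDs; unseen words get the next free ID."""
    level, component, event = entry
    keys = [("level", level), ("component", component)]
    keys += [("word", w) for w in clean(event.split())]
    ids = []
    for key in keys:
        if key not in table:
            table[key] = (len(table) + 1, 0)  # incremental update (assumed form)
        ids.append(table[key][0])
    return sorted(ids) if ascending else ids


entries = [
    ("INFO", "dfs.DataNode$PacketResponder",
     "Received block blk_1 of size 100 from /10.0.0.1"),
    ("INFO", "dfs.DataNode$PacketResponder",
     "Received block blk_2 of size 200 from /10.0.0.2"),
    ("WARN", "dfs.DataNode$DataXceiver",
     "Got exception while serving blk_3 to /10.0.0.3"),
]
table = build_frequency_table(entries)
sequence = vectorize(entries[0], table)
```

The `ascending` flag follows the wording of step (3); passing `ascending=False` keeps the original word order instead.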

Offline Log Template Word Mining.
Offline log template word mining aims to create an accurate log template set. In the log vectorization module, DLLog converts each log entry into a sequence of log word frequencies based on the log structural features and log frequency features. In the offline log template word mining module, DLLog learns the relationship between log words through a GRU neural network. It determines whether words are log parameter words or log template words, enabling the accurate extraction of log templates.
The core method of offline log template word mining is the GRU neural network [27]. The GRU neural network is a well-known variant of the recurrent neural network (RNN) and was introduced by Cho et al. [27]. It has found wide application in various fields, including text classification [28,29], machine translation [30], and emotion analysis [31].
Like the LSTM neural network, the GRU neural network has forgetting and updating mechanisms, both of which excel at tracking long-term dependencies. These mechanisms address the gradient vanishing or exploding problem that often occurs in recurrent neural networks during multiple propagations. Unlike the LSTM neural network, the GRU neural network simplifies the internal network structure, resulting in more efficient state information updates. The internal structure of the GRU unit is depicted in Figure 5. Each GRU block in a GRU neural network consists of an update gate and a reset gate. The reset gate determines which part of the information in the hidden state is "forgotten," while the update gate decides how much of the current input information is incorporated and temporarily stored in the candidate hidden state $\tilde{h}_t$. The formulas for the update gate, reset gate, candidate hidden state, and hidden state are as follows:

$$u_t = \sigma\left(W_u \cdot [h_{t-1}, x_t]\right) \quad (1)$$

$$r_t = \sigma\left(W_R \cdot [h_{t-1}, x_t]\right) \quad (2)$$

$$\tilde{h}_t = \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \quad (3)$$

$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot \tilde{h}_t \quad (4)$$

where $W_u$, $W_R$, and $W$ represent the weight values, $x_t$ is the input at time step $t$, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. When $u_t = 1$, the state from the past time step is retained in the current state. When $u_t = 0$, the past status information is forgotten.
Figure 6 illustrates the network structure of the DLLog log template word mining model. We input a log token subsequence of length $h$ (where $h$ represents the size of the sliding window), $S_h = \langle s_{t-h}, \ldots, s_{t-2}, s_{t-1} \rangle$, into the model, where $\forall s_i \in F$. First, the log token subsequence $S_h$ is passed through the word vectorization layer, which maps each token to a computationally recognizable vector. These word vectors then serve as input to the first layer of the GRU neural network. Both the first and second layers of the GRU neural network comprise $h$ GRU units, matching the length of the input data.
In each GRU cell, the input consists of the hidden state $h_{t-1}$ from the previous time step and the external input data at the current time step. The currently embedded word vector and the hidden state $h_{t-1}$ are both weighted in the update gate using their respective weights. The result of this weighted sum, obtained using formula (1), is then passed through a sigmoid activation function to calculate the final value of the update gate. The input for the reset gate is identical to that of the update gate, with both being multiplied by their corresponding weights. Formula (2) is applied to calculate the value of the reset gate. The reset gate determines how much information from the previous hidden state will be passed to the current candidate hidden state $\tilde{h}_t$, while the update gate decides how much information from the previous hidden state will be passed to the current hidden state $h_t$. The candidate hidden state $\tilde{h}_t$ and the hidden state $h_t$ are computed using formulas (3) and (4), respectively. Subsequently, the retained information (hidden state $h_t$) is passed to the next GRU unit.
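A single GRU step as described in this paragraph can be written out in plain Python to make the data flow explicit. This is a sketch under assumptions: the weight matrices act on the concatenation of the previous hidden state and the current input, no bias terms are used, and the convention from the text is followed in which an update gate value of 1 retains the previous state. The function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(h_prev, x, W_u, W_r, W):
    """One GRU time step; returns the new hidden state h_t."""
    concat = h_prev + x                                   # [h_{t-1}, x_t]
    u = [sigmoid(a) for a in matvec(W_u, concat)]         # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]         # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x    # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W, gated)]     # candidate state
    # u = 1 keeps the previous state, u = 0 replaces it with the candidate.
    return [ui * hp + (1.0 - ui) * hc
            for ui, hp, hc in zip(u, h_prev, h_cand)]


zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h_half = gru_step([1.0, -2.0], [0.3], zeros, zeros, zeros)

keep = [[0.0, 0.0, 100.0], [0.0, 0.0, 100.0]]
h_kept = gru_step([1.0, -2.0], [1.0], keep, zeros, zeros)
```

With all-zero weights both gates equal 0.5 and the candidate state is 0, so the hidden state is halved; with a saturated update gate the previous state passes through unchanged.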
For the double-layer GRU neural network, each GRU unit in the second layer corresponds to a GRU unit in the first layer. The hidden state produced by each GRU unit in the first layer serves as the input for the connected GRU unit in the second layer. Finally, the fully connected layer and softmax function are employed to transform the final hidden state of the second-layer GRU neural network into a probability distribution for predicting the next log word. During the training phase, the model utilizes the cross-entropy as the loss function and employs stochastic gradient descent (SGD) to iteratively update the weight parameters. The calculation formula for the cross-entropy loss function is given by

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{k} y_i^{(n)} \log p_i^{(n)} \quad (5)$$

where $y_i$ represents the actual label, $p_i$ is the probability value, $k$ is the number of categories (the number of words in the word frequency table $F$), $N$ is the total number of samples, and the superscript $(n)$ indexes the samples. Figure 7 illustrates the sample log template mining process.
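The output layer and loss described above can be illustrated with a short, dependency-free Python sketch. It assumes one-hot labels, so that only the probability of the true next word contributes to the loss for each sample; the function names are illustrative and the numbers are toy values.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(prob_rows, targets):
    """Mean cross-entropy; targets[n] is the index of the true next word."""
    losses = [-math.log(probs[t]) for probs, t in zip(prob_rows, targets)]
    return sum(losses) / len(losses)


probs = softmax([0.0, 0.0])          # two equally likely words
loss = cross_entropy([probs], [0])   # one sample whose true word has index 0
```

For a uniform distribution over two words the loss equals log 2, which is the value a model with no information would reach.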

Log
In fact, the final output of the model can be considered a binary classification problem. Based on prior experience, the target log word following an input sequence is not unique. Therefore, it is essential to manually set an appropriate probability threshold ρ when mining log template words. If the probability value of the target word exceeds the threshold ρ, the target word is considered to have a strong correlation with the input sequence, and it is identified as a log template word. Conversely, if the probability of the target word is below the threshold ρ, it is categorized as a log parameter word. To prevent mistakenly identifying log parameters as log template words, the extraction of template words for a sequence is halted at the first occurrence of a log parameter word within that sequence. Subsequently, processing continues with the next log word sequence until the entire log data has been processed.
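The thresholding rule can be sketched as a small Python function. The sketch makes several assumptions that the text does not settle: `predict_next` is a hypothetical stand-in for the trained GRU model that returns a probability per token ID, the first h tokens of a sequence seed the window and are kept as template words, and a probability exactly equal to ρ is treated as a parameter word. The defaults h = 3 and ρ = 0.63 are the values reported in the experimental setup.

```python
def mine_template_words(sequence, predict_next, h=3, rho=0.63):
    """Return the leading tokens of `sequence` judged to be template words.

    predict_next(window) -> {token_id: probability} is a hypothetical
    interface standing in for the trained model.
    """
    template = list(sequence[:h])          # assumption: seed window is kept
    for t in range(h, len(sequence)):
        window = tuple(sequence[t - h:t])  # sliding window of the last h tokens
        prob = predict_next(window).get(sequence[t], 0.0)
        if prob <= rho:
            break                          # first parameter word stops extraction
        template.append(sequence[t])
    return template


# Toy predictor with hand-written probabilities (not real model output).
toy = {
    (1, 3, 5): {6: 0.9},
    (3, 5, 6): {8: 0.8},
    (5, 6, 8): {9: 0.7, 20: 0.1},
}
predict = lambda window: toy.get(window, {})

kept = mine_template_words([1, 3, 5, 6, 8, 20, 9], predict)
full = mine_template_words([1, 3, 5, 6, 8, 9], predict)
```

In the first call, token 20 falls below the threshold, so extraction stops there and the trailing token 9 is never examined.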
After extracting the log template words corresponding to each log entry, the log entries are divided into different log groups based on the log level, component name, and log template words. Log entries within each log group share the same log template words. For each log group, a data structure named $tem_i$ is created to store the corresponding log template of that log group. A data structure named $Tem_{total}$ is initialized as empty and stores the final log template set, with $Tem_{total} = tem_1 \cup tem_2 \cup \ldots \cup tem_n$. Figure 8 illustrates the sample structure of the log template set.
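The grouping step can be pictured with a small Python sketch. The dictionary layout below is an assumption chosen for illustration: groups are keyed by (level, component, template words), and the template set is pre-classified by (level, component) so that online matching only has to look at one bucket.

```python
from collections import defaultdict


def group_entries(parsed):
    """parsed: iterable of (entry_id, level, component, template_words)."""
    groups = defaultdict(list)
    for entry_id, level, component, words in parsed:
        groups[(level, component, tuple(words))].append(entry_id)
    return dict(groups)


def build_template_set(groups):
    """One template per group, bucketed by (level, component)."""
    tem_total = defaultdict(list)
    for level, component, words in groups:
        tem_total[(level, component)].append(list(words))
    return dict(tem_total)


parsed = [
    (0, "INFO", "PacketResponder", ["Received", "block", "of", "size", "from"]),
    (1, "INFO", "PacketResponder", ["Received", "block", "of", "size", "from"]),
    (2, "INFO", "PacketResponder", ["PacketResponder", "terminating"]),
    (3, "WARN", "DataXceiver", ["Got", "exception", "while", "serving", "to"]),
]
groups = group_entries(parsed)
tem_total = build_template_set(groups)
```

Because every distinct set of template words forms its own group, a rare log entry still obtains a template even if it occurs only once.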

For a given current log word frequency sequence $S$, the first step is to search the existing log template set $Tem_{total}$ for log templates with the same log type and component name as $S$. These matched log templates form a new set, $Tem_{same}$. Then, we calculate the matching degree between each log template in $Tem_{same}$ and the current log word frequency sequence $S$ using the longest common subsequence (LCS) method [32][33][34]. The matching degree is determined by the length of the longest common subsequence. For instance, suppose there are three log templates ($tem_1$, $tem_2$, and $tem_3$) in $Tem_{same}$ that share the same log type and component name as the current log word sequence. The matching degrees between the current log word sequence and these log templates are denoted as $L_1$, $L_2$, and $L_3$, which are calculated using $LCS(S, tem_i)$.
The second step is to find the log template with the highest matching degree for the current log word frequency sequence $S$. If the log templates ($tem_1$, $tem_2$, and $tem_3$) share the same matching degree with the current sequence $S$, the system selects the log template with the shortest length as the corresponding log template. It is important to note that the matching degree $LCS(S, tem_i)$ between the selected log template and the current log word sequence $S$ should be greater than or equal to half the length of the current log word sequence and half the length of the selected log template. If the log template set $Tem_{total}$ cannot produce a match for the current log word sequence $S$, a new log template must be generated and added to the log template set $Tem_{total}$. If it is impossible to generate a new template based on the existing data, the current log word sequence $S$ is stored in the temporary log set $new_{logs}$.
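The first two steps can be sketched in Python as follows. The sketch assumes that the candidates passed in are already restricted to templates with the same log level and component as the sequence, and that sequences and templates are lists of word frequency IDs; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_template(seq, candidates):
    """Pick the candidate with the highest matching degree, or None."""
    best, chosen = 0, None
    for tem in candidates:
        degree = lcs_length(seq, tem)
        shorter_tie = (degree == best and chosen is not None
                       and len(tem) < len(chosen))
        if degree > best or shorter_tie:
            best, chosen = degree, tem
    if chosen is None:
        return None
    # The match must cover at least half of both the sequence and the template.
    if 2 * best >= len(seq) and 2 * best >= len(chosen):
        return chosen
    return None


exact = best_template([5, 6, 8, 9, 7], [[5, 6, 8, 9, 7], [5, 6], [10, 11, 12]])
tie = best_template([5, 6, 30], [[5, 6, 8, 9], [5, 6]])
miss = best_template([1, 2, 3, 4, 5, 6], [[1, 9, 9, 9]])
```

In the tie case both templates reach a matching degree of 2, so the shorter one is chosen; in the last case the degree of 1 fails the half-length condition and no template is returned.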
In each case, the examples are as follows: (i) If the matching degrees are ordered as …, DLLog selects the log template with the minimum length as the final log template corresponding to the log word sequence $S$ by comparing the lengths of the log templates …

The third step is to update the log template set. According to reference [23], when the system begins generating new types of system log entries, it often generates a substantial amount of log data of these new types within a single day. These log data typically contain numerous different parameter words. Consequently, new templates can be directly extracted by computing the longest common subsequence of these new types of logs. The pseudo-code of the log template set update algorithm is presented in Algorithm 2.
Input: log word frequency sequence S and log template set Tem_total
Output: log template
(1) Initialize optimal matching degree best = 0, optimal template length temlength = 0, and number w = 0
(2) Initialize temporary log set new_logs
(3) Go through all log templates in Tem_total with the same loglevel and componentname as S to form the log template set Tem_same
(4) for tem_i in Tem_same do
(5) … w = i
(10) end if
(11) end for
(12) …

For the current log word sequence $S$, which fails to match any log template within the log template set $Tem_{total}$, it becomes necessary to calculate the longest common subsequence between $S$ and each log entry in the temporary log set $new_{logs}$. Subsequently, the optimal longest common subsequence is selected as the new log template. Similarly, the length of this new template needs to be greater than or equal to half the length of both the current log word sequence and the selected log entry from $new_{logs}$. Once this condition is met, the log template can be added to the log template set $Tem_{total}$, thereby updating the set. The next time a log entry of the same new type appears, the first two steps can be employed to match the log template.
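The update step described in this paragraph can be sketched in Python. This is an illustration under assumptions rather than the paper's Algorithm 2: the template set is modelled as a plain list, the longest common subsequence with the most elements is taken as the "optimal" one, and a sequence that yields no template is appended to the temporary set. All names are illustrative.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b as a list."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], n, m
    while i and j:                      # backtrack through the table
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def update_template_set(seq, tem_total, new_logs):
    """Try to derive a new template for an unmatched sequence."""
    best, partner = [], None
    for other in new_logs:
        common = lcs(seq, other)
        if len(common) > len(best):
            best, partner = common, other
    # The new template must cover at least half of both source sequences.
    if (partner is not None and 2 * len(best) >= len(seq)
            and 2 * len(best) >= len(partner)):
        tem_total.append(best)
        return best
    new_logs.append(seq)                # keep for later; no template yet
    return None


tem_total, new_logs = [], []
first = update_template_set([1, 3, 40, 41, 42], tem_total, new_logs)
second = update_template_set([1, 3, 40, 99, 42], tem_total, new_logs)
```

The first unmatched sequence is parked in the temporary set; the second one shares four of five tokens with it, so their common subsequence becomes a new template.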

Evaluation
This section first introduces the hardware and software environment, the experimental log datasets, and the evaluation metrics. Then, specific experimental results are presented to demonstrate the superiority of DLLog.

Experimental Dataset.
The log datasets used in this section consist of 16 real-world log datasets published by the LogPai team (https://github.com/logpai). In the LogHub data repository, these log data come from different systems, including distributed systems (HDFS, Hadoop, Spark, ZooKeeper, and OpenStack), supercomputers (BGL, HPC, and Thunderbird), operating systems (Windows, Linux, and Mac), mobile systems (Android and HealthApp), server applications (Apache and OpenSSH), and standalone software (Proxifier). The LogHub log datasets can be used not only to measure the accuracy of log parsing methods but also to test the robustness and efficiency of parsing methods. These datasets have been widely employed in similar research endeavors [15,22,26,35]. Table 5 provides detailed information about these log datasets.
For each log dataset, Zhu et al. [11] sampled it and manually labeled the log template of each log entry. In all experiments in this section, these labels were used as the ground truth for evaluation.

Evaluation Index.
In the field of log parsing, parsing methods are typically evaluated using the Parsing Accuracy (PA) metric, as defined in reference [11]. PA is calculated as the ratio of correctly parsed log messages to the total number of log messages. Each log message corresponds to a specific log template, and log messages sharing the same log template are grouped into the same cluster, representing a particular type of log message. A log message is considered correctly parsed only when it is assigned to the same log template cluster as in the ground truth. In comparison to the evaluation metric (the Rand index) used in prior studies [35][36][37], PA is considered a more rigorous measure.
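A minimal Python sketch of the PA computation is given below. It assumes the common reading of the metric in which a message counts as correct only if the set of messages sharing its predicted template is identical to the set sharing its ground-truth template; the labels used are toy values.

```python
from collections import defaultdict


def parsing_accuracy(predicted, truth):
    """Fraction of messages whose predicted group equals the true group."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for index, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(index)
        true_groups[t].add(index)
    correct = 0
    for members in pred_groups.values():
        any_member = next(iter(members))
        if members == true_groups[truth[any_member]]:
            correct += len(members)
    return correct / len(truth)


truth = ["A", "A", "B", "B", "C"]
predicted = ["x", "x", "y", "z", "w"]
pa = parsing_accuracy(predicted, truth)
```

Here the two "B" messages are split across two predicted templates, so neither counts as correct and PA is 3/5, even though each of them was given a consistent template on its own.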

Environment and Implementation.
We have implemented the methods proposed in this chapter using the open-source Python machine learning library PyTorch. All experiments were conducted in a consistent experimental environment using Python 3.8 with PyTorch 1.7.0. The hardware platform utilized for the experiments featured an AMD Ryzen 5 3600 6-core processor running at 3.6 GHz, an NVIDIA GTX1660 GPU, 128 GB of memory, and the Windows 10 64-bit operating system. We constructed our model based on the above environment. Specifically, the offline training process and the log template mining process run on a GPU to accelerate model training. The DLLog online parsing phase runs on a CPU to allow for a fair comparison with other log parsing methods.

Input: log word frequency sequence S, log template set Tem_total, and temporary log set new_logs
Output: new log template and new log template set
(1) Initialize optimal matching value best = 0, optimal matching length temlength = 0, and number w = 0
(2) for NS_i in new_logs do
(3) …

The number of training epochs is set to 300, the hidden dimension of the GRU model is 64, and the number of layers is 2. In the log template word mining stage, the sliding window size, h, is set to 3, and the probability threshold, ρ, is set to 0.63. The learning rate is set to 0.001.

Accuracy Evaluation.
In our experiments, we aimed to select state-of-the-art log parsing methods as comparison baselines. However, due to the unavailability of the source code for some methods [38,39], such as Uniparser [38], we attempted to reproduce them for further experiments. Unfortunately, the reproduced model did not yield satisfactory parsing results on certain datasets. Consequently, to assess the accuracy of DLLog, we compared it with six baseline log parsing methods: Drain [22], Spell [26], Nulog [40], IPLoM [24], Logram [41], and Brain [42]. Drain is a tree-based log parsing method, Spell is a log parsing method based on the longest common subsequence, Nulog is a log parsing method based on a deep self-supervised learning model, IPLoM is a log parsing method based on iterative partitioning, Logram is a log parsing method based on the N-gram statistical language model, and Brain is a rule-based log parsing method that utilizes the longest common pattern. The six log parsing methods have been introduced in detail in Section 2. The comparison results of parsing accuracy are shown in Table 6. The best result for each dataset is shown in bold.
As depicted in Table 6, DLLog achieves the best parsing accuracy on 7 of the 16 log datasets, with an average parsing accuracy of 0.891. This is the highest average parsing accuracy among the compared state-of-the-art log parsing methods, which demonstrates the superiority of DLLog. DLLog also achieved high parsing accuracy scores on datasets where it did not attain the best parsing accuracy. DLLog's average parsing accuracy is approximately 4% higher than that of Brain and Drain. In comparison to the relatively lower accuracy of the Logram method, DLLog's average parsing accuracy is 11% higher. However, we also observed that, because different systems follow different rules when generating logs, no log parsing method achieves the best parsing accuracy on all datasets.
Nearly every parsing method can achieve satisfactory parsing results for log datasets with simpler structures, such as the HDFS and Apache log datasets; some methods even achieve the optimal parsing accuracy of 1. For log datasets with more complex structures, like the HealthApp and HPC log datasets, the accuracy of each parsing method decreases to varying degrees. However, DLLog still attains the highest parsing accuracy on both datasets. The Spell method, which relies solely on the longest common subsequence for log parsing, achieves accuracies of 0.654 on the HPC dataset and 0.787 on the BGL dataset. In contrast, DLLog, which is based on deep learning and the longest common subsequence, achieves accuracies of 0.996 on the HPC dataset and 0.988 on the BGL dataset. This suggests that fully utilizing the structural, frequency, and associative features of logs effectively helps the model parse logs and enhances log parsing accuracy.

Versatility Evaluation.
Experiment 2 evaluated the versatility of each log parsing method. The purpose is to verify whether the proposed method can widely support different log data types. Detailed statistics are given in Table 7, including the median (Median), minimum (Min.), standard deviation (STD), and interquartile range (IQR). Figure 9 shows the boxplot of the accuracy distribution for each log parsing method. For each box in Figure 9, the lines from bottom to top represent the minimum observation value (lower bound), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the maximum observation value (upper bound). The length of each box represents the interquartile range of the corresponding log parsing method.
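The summary statistics listed above can be computed with Python's standard `statistics` module, as in the following minimal sketch. The accuracy values are made up for illustration and are not taken from Table 6 or Table 7. The choice of the population standard deviation and of the "inclusive" quartile method are assumptions, since the text does not state which variants were used.

```python
import statistics


def summarize(accuracies):
    """Return the median, minimum, standard deviation, and IQR of accuracy scores."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "median": statistics.median(accuracies),
        "min": min(accuracies),
        # Assumption: population standard deviation rather than sample.
        "std": statistics.pstdev(accuracies),
        "iqr": q3 - q1,
    }


# Made-up accuracy values for illustration only; not results from the paper.
example = [0.5, 0.6, 0.7, 0.8, 0.9]
stats = summarize(example)
```

A smaller IQR and standard deviation indicate that a parser's accuracy varies less across datasets, which is the sense in which versatility is measured here.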
From Table 7 and Figure 9, it is clear that DLLog has the smallest interquartile range, 0.186, and the smallest standard deviation, 0.143, which is 2.0%, 11.1%, 17.8%, 7.7%, 21.8%, and 14.8% lower than that of Drain, Spell, Nulog, IPLoM, Logram, and Brain, respectively. This indicates that DLLog has the highest versatility and stability compared with the other log parsing methods. The average parsing accuracy of Drain, Nulog, and Brain is basically the same, but

In Figure 10, the ordinate represents the parsing time, and the abscissa represents the size of the log data volume.
As illustrated in Figure 10, the parsing time of DLLog grows linearly with the volume of log data on both log datasets. Every log parsing method takes more time to parse the BGL log dataset than the HDFS log dataset, because the HDFS log dataset has 30 templates while the BGL log dataset has 619 templates (approximately 20 times as many). Logram demonstrates the fastest parsing speed when the dataset size is less than 100 MB. This is because Logram, which is based on n-grams, calculates frequencies simply by counting, which saves a significant amount of time compared with parsing methods based on complex rules. We found that Nulog has the slowest parsing speed because, being based on deep learning models, it needs to retrain the model continuously throughout the entire parsing process.
Although DLLog also requires training a deep learning model, the model is used only during the initial offline parsing stage (with a data size of 0.3 M). In the online log parsing stage, DLLog pre-classifies log templates before template matching. Thus, a newly arrived log needs to be compared for similarity only with the existing log templates that fall into the relevant categories. This strategy significantly reduces template matching time. While DLLog does not achieve the best parsing speed compared with the 6 baseline parsing methods, its parsing speed remains within an acceptable range, and it attains the highest parsing accuracy and the best versatility.

Incremental Update Evaluation.
The maintenance and upgrade of systems result in the generation of new log data. Therefore, it is crucial to consider the performance of log parsing methods when dealing with newly emerging log types. Experiment 4 evaluated the update capabilities of 7 parsing methods on the HDFS and Android datasets, chosen for their volumes and the availability of ground truths for such evaluations. We used an initial data volume of 2 K (approximately 0.3 M) for each dataset for model training. Subsequently, we applied the trained model to data volumes of 1 M, 10 M, 100 M, 500 M, and 1000 M for each dataset. As the number of logs increases, new log types may emerge. For instance, the 2 K HDFS logs are generated from 14 log templates, while the 1000 MB HDFS dataset contains 29 log templates. An excellent log parser should maintain stable accuracy as new logs are introduced, accurately parsing new log types. The experimental results are shown in Figure 11, where the ordinate represents the parsing accuracy and the abscissa represents the size of the log data volume.
We can observe that, as the volume increases and new types of log data are introduced, DLLog demonstrates the best stability, indicating that it handles new log types effectively. However, it is worth noting that all log parsing methods perform better on the HDFS dataset than on the Android dataset. This difference can be attributed to the significantly larger number of log templates in the Android dataset, which makes parsing more complex. DLLog's incremental update capability, derived from the combination of deep neural networks and the longest common subsequence, enables it to process new log types effectively. Consequently, even as the log data volume increases, the decline in parsing accuracy is not significant.

Ablation Experiment.
In the process of constructing the log word frequency sequence, we conducted ablation experiments using two sequences, one sorted according to our method and the other unsorted, to verify the effectiveness of our approach. We also conducted a comparative experiment between DLLog based on LSTM and DLLog based on GRU. The experimental results are shown in Table 8. As depicted in Table 8, DLLog is significantly influenced by whether the log word frequency sequence is sorted. DLLog based on the sorted log word frequency sequence exhibits an average parsing accuracy that is 39% higher than the unsorted version. Moreover, substantial improvements are observed on individual datasets, such as Thunderbird and Linux. This is attributed to the presence of numerous templates and variable parameter words in these datasets, which makes the sorting of the log word frequency sequence more impactful on DLLog's training. We sorted the word frequency sequence, obtained from the frequency table F, in ascending order, leveraging the fact that template words appear more frequently, together with the construction method of the frequency table. This arrangement ensures that high-frequency template words form a "fixed" combination, with dynamic parameters following, allowing the GRU neural network to accurately learn the log pattern combinations.
On most datasets, the parsing accuracy of DLLog based on GRU is similar to that based on LSTM, but there is a significant difference in parsing accuracy on OpenSSH and Linux. We believe that in the majority of cases, the classifications of the two are consistent. Only in a few log sequences do the classifications by LSTM and GRU differ, leading to some inaccuracies in log grouping. However, these log groups constitute a significant portion of the data, resulting in a substantial difference in parsing accuracy between the two, according to the formula for grouping accuracy calculation. We attribute the difference in parsing accuracy between GRU and LSTM to the fact that GRU, as a simplified variant of LSTM, predicts more accurately on short sequence datasets, and logs are a type of short sequence data. Conversely, for longer log sequence data, LSTM performs slightly better than GRU.
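The sensitivity of grouping accuracy to a few misclassified entries can be illustrated with a short sketch. It assumes the definition commonly used in log parsing benchmarks, in which a log entry counts as correctly parsed only if the set of entries sharing its predicted template is identical to the set sharing its ground-truth template. The labels and data below are made up for illustration.

```python
from collections import defaultdict


def grouping_accuracy(predicted, truth):
    """Fraction of log entries whose predicted group equals their true group."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(i)
        true_groups[t].add(i)
    # An entry is correct only if its whole group matches exactly.
    correct = sum(
        1
        for p, t in zip(predicted, truth)
        if pred_groups[p] == true_groups[t]
    )
    return correct / len(truth)


# Made-up example: a single entry of the large group A is grouped with B.
truth = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
predicted = ["x"] * 5 + ["y"] * 3 + ["z"] * 2
acc = grouping_accuracy(predicted, truth)
```

Here only the two entries of group C are counted as correct, so one misplaced entry lowers the accuracy to 0.2. This is consistent with the observation above that a few differing classifications in large log groups can produce a substantial accuracy gap.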

Conclusions
In this paper, we proposed DLLog, an online log parsing method that parses templates accurately and incrementally without the need for domain-specific knowledge. DLLog uses the GRU neural network for offline template word mining and the longest common subsequence for parsing log entries in real time. By utilizing multiple log entry features, DLLog can autonomously extract template words, eliminating the requirement for manual intervention and enhancing its versatility in parsing unstructured log data. Additionally, DLLog supports incremental updates of the log template set, making it adaptable to newly generated log entries in evolving systems. We conducted a comprehensive evaluation of the DLLog parsing method on multiple extensive log datasets, and the experimental results demonstrated its accuracy, universality, and adaptability when parsing large-scale log data. In future research, we intend to incorporate location information and character features of log words to help the log parsing method distinguish between log parameter words and log template words. This aims to further enhance the precision and effectiveness of DLLog.

Figure 6: Structure of the deep GRU network model.

Figure 7: Example of log template mining process.
Table 2 displays the classification results of the aforementioned original log entry based on Timestamp, Log level, Component name, and Log events. This table also presents the Log template words, Parameter words, and Log template.

Table 1: Sample of raw log data.
* of size * from *
2 * * Got exception while serving * to *
3 * * Served block * to *
4 BLOCK * NameSystem.delete: * is added to invalidSet of * *
5 Verification succeeded for *

4.1. Offline Log Template Word Mining.
The solid arrow in Figure 2 illustrates DLLog's offline log template word mining process. This module initially scans and cleans the entire log dataset. It counts the frequency of each word that makes up the log level, component name, and log event. Using this frequency information, we construct a log word frequency table. DLLog vectorizes each log entry based on the word frequency ID in the word frequency table, converting the log entry into a vector to create the log word frequency sequence. During the training stage, the GRU neural networks are employed to learn the relationships between log words, enabling DLLog to extract log template words from log sequences based on the learned associations. Finally, DLLog categorizes log entries into different log groups depending on whether their log template words are identical. The log entries in each log group share the same log template, and the log templates from different log groups constitute the set of log templates.

4.2. Online Log Parsing.
The dashed arrow in Figure 2 illustrates the online parsing process of DLLog. Unlike offline log template word mining, the log sequence in online log parsing does not require sorting based on word frequency. Each real-time log entry only needs to undergo a cleaning process before being input into the log vectorization module to generate a sequence of log word frequencies. Then, DLLog calculates the matching degree between the current log sequence and the log templates within the existing log template set. By comparing the matching degree with the predefined threshold, DLLog determines whether the parsing is successful or whether the log template set needs updating.

4.3. Log Vectorization.
We define the log dataset as $L = \langle log_1, log_2, \ldots, log_n \rangle$, where $log_i$ represents a log entry, and we define the log event set $E = \langle event_1, event_2, \ldots, event_n \rangle$, where $event_i$ represents a log event. Let $W = \langle word_1, word_2, \ldots, word_n \rangle$ be the set of words constituting the log event. These words are also referred to as tokens. If a log word $word_i$ appears frequently (that is, it has a high frequency), then $word_i$ has a high probability of being a log template word. We define the set of log templates as $M = \langle tem_1, tem_2, \ldots, tem_n \rangle$, where $tem_i$ represents a log template composed of multiple log template words $word_i$ arranged in a specific manner. It should be noted that each log template $tem_i$ in the log template set $M$ corresponds to multiple log entries, and a log entry can only be represented by one log template.
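The word frequency table and the vectorization step can be sketched as follows. This is a minimal illustration, not the DLLog implementation. The function names, the assignment of ID 1 to the most frequent word, the alphabetical tie-breaking, and the ID reserved for unseen words are all assumptions made here, and the log lines are simplified examples.

```python
from collections import Counter


def build_frequency_table(entries):
    """Map each word to an ID; ID 1 goes to the most frequent word (assumption)."""
    counts = Counter(word for entry in entries for word in entry.split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}


def vectorize(entry, table, sort_ids=False):
    """Convert a cleaned log entry into its word frequency ID sequence."""
    unknown = len(table) + 1  # assumed ID for words missing from the table
    ids = [table.get(word, unknown) for word in entry.split()]
    # Offline mining sorts the sequence in ascending order; online parsing does not.
    return sorted(ids) if sort_ids else ids


logs = [
    "Served block blk_1 to 10.0.0.1",
    "Served block blk_2 to 10.0.0.2",
    "Verification succeeded for blk_1",
]
table = build_frequency_table(logs)
seq = vectorize("Served block blk_3 to 10.0.0.9", table)
seq_sorted = vectorize("Served block blk_3 to 10.0.0.9", table, sort_ids=True)
```

With the optional sort enabled, the low IDs of frequent words move to the front of the sequence and the IDs of rare or unseen words follow.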

Table 4: Notations with their explanatory terms.

Here, $x_t$ represents the input at time $t$ (the current time), $r_t \in [0, 1]$ is the reset gate, $u_t \in [0, 1]$ is the update gate, $h_t$ is the hidden state of the current GRU unit, $h_{t-1}$ is the hidden state from the previous GRU unit, and $\tilde{h}_t$ is the candidate hidden state. $\sigma$ and $\tanh$ are activation functions, $\oplus$ represents addition, and $\otimes$ represents pointwise multiplication.
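For reference, one common formulation of the GRU update that is consistent with the notation above is given below. The weight matrices $W$, $U$ and the biases $b$ are our own notation, and the direction of the interpolation in the last line differs between references, so this should be read as a sketch rather than the exact form used in DLLog.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
u_t &= \sigma\left(W_u x_t + U_u h_{t-1} + b_u\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \otimes h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - u_t\right) \otimes h_{t-1} \oplus u_t \otimes \tilde{h}_t
\end{aligned}
```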
Training Stage of DLLog. The DLLog log template word mining model employs a two-layer GRU neural network. Compared with a single-layer GRU neural network, the two-layer GRU neural network exhibits superior learning and generalization capabilities, making it better at preserving long-term dependencies within sequences. The log template word mining model based on the GRU neural network consists of four layers: the word embedding layer, the GRU neural network layer, the fully connected layer, and the softmax layer. Figure

Template Word Mining Stage. In the log template word mining stage, the input method for log data remains the same as in the training phase. The input consists of a log token subsequence $S_h = \langle s_{t-h}, \ldots, s_{t-2}, s_{t-1} \rangle$, where $h$ is the size of the sliding window and $\forall s_i \in F$. The output of the model is a probability distribution denoted as $P = (p_1, p_2, \ldots, p_{n-1}, p_n)$, which includes the probabilities associated with all words in the word frequency list $F$. Assuming that $p_i$ represents the probability corresponding to the target log word $s_t$, we use $p_i$ to indirectly indicate the association between $s_t$ and the input sequence $S_h = \langle s_{t-h}, \ldots, s_{t-2}, s_{t-1} \rangle$. If $s_t$ exhibits a strong association with the input sequence, it is determined to be a log template word; otherwise, it is a log parameter word. Figure

Module. In the online log parsing stage, when a new log entry $log_i$ arrives, DLLog first cleans, splits, and vectorizes the original log entry to construct the log word frequency sequence $S_i$. This process has been i with the log templates in $Tem_{total}$ to determine whether $S_i$ matches any existing log template $tem_i$, or if it should create a new log template to extend the log template set $Tem_{total}$. This section utilizes the longest common subsequence (LCS) to calculate the matching degree. Algorithm 1 shows the pseudo-code for online log parsing.
the system chooses $L_1$ for further processing.
(ii) If $L_1 \ge |S|/2$ and $L_1 \ge |tem_1|/2$, where $|S|$ is the length of the log word sequence and $|tem_1|$ is the length of the log template $tem_1$, then the log template $tem_1$ is the final log template corresponding to the log word sequence $S$.
(iii) If $L_1 < |S|/2$ or $L_1 < |tem_1|/2$, then a new template must be created for the current log word sequence $S$.
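The matching rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, the best candidate is taken to be the template with the largest LCS length, and a log that matches no template is simply stored as a new template, since the way DLLog builds the new template is not shown here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_template(seq, templates):
    """Return the index of the matching template, or None if a new one is needed."""
    best_idx, best_len = None, 0
    for idx, tem in enumerate(templates):
        length = lcs_length(seq, tem)
        if length > best_len:  # assumption: keep the template with the largest LCS
            best_idx, best_len = idx, length
    if best_idx is None:
        return None
    tem = templates[best_idx]
    # Rule (ii): L1 >= |S|/2 and L1 >= |tem1|/2, written without division.
    if 2 * best_len >= len(seq) and 2 * best_len >= len(tem):
        return best_idx
    return None  # rule (iii): a new template must be created


def parse_online(seq, templates):
    """Match seq against the template set, extending the set when nothing matches."""
    idx = match_template(seq, templates)
    if idx is None:
        templates.append(list(seq))  # assumption: the new log becomes its own template
        idx = len(templates) - 1
    return idx


templates = [[1, 3, 4], [7, 9]]
matched = parse_online([1, 3, 11, 4, 11], templates)
created = parse_online([20, 21, 22], templates)
```

In the example, the first sequence shares an LCS of length 3 with the first template, which satisfies both conditions of rule (ii). The second sequence shares no words with any template, so rule (iii) applies and the template set grows by one entry.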

Table 5: Summary of datasets.
5.4. Efficiency Evaluation.
Because the system produces a large amount of log data in real time, the online parsing efficiency of log parsing methods must also be considered. Experiment 3 measures the running time spent by the five methods to parse all HDFS and BGL log entries (2.16 G in total). The experimental results are shown in Figure 10.

Table 6: Accuracy comparison of parsing results.

Table 8: Ablation experiment comparison of parsing results.