Feature Extraction of Sequence of Keystrokes in Fixed Text Using the Multivariate Hawkes Process

In this paper, we propose a new method of extracting the features of keystrokes. The Hawkes process based on an exponential excitation kernel was used to model the sequence of keystrokes in fixed text, and the intensity function vector and adjacency matrix of the model obtained through training were regarded as the characteristics of the keystrokes. A visual analysis was carried out on the CMU keystroke raw data and the feature data extracted using the proposed method. We used one-class classifiers to compare the classification performance of the CMU keystroke raw data and the feature data extracted by the Hawkes process model and the POHMM model. The experimental results show that the feature data extracted using the proposed method contain rich information for distinguishing users. In addition, the feature data extracted using the proposed method have slightly better classification performance than the original CMU keystroke data for some users who are not easy to distinguish.


Introduction
The keyboard is a common man-machine interface in information systems. By recording the times at which people press and release the keys (keystrokes) and extracting features from the keystrokes, we can apply the extracted data to authentication, identity verification, and intrusion detection in the information system. Similar to studies on the voiceprint, studies on the analysis methods of keystrokes can be divided into two categories, namely, content-independent and content-related. In content-independent scenarios, the user types fixed text, e.g., username and password, when logging into the information system, which then recognizes or authenticates the user's identity by extracting the features of the keystrokes. In this case, for user identification, instead of considering the typed text, the time intervals between pressing and releasing the keys are formed into a multidimensional feature vector (one dimension for one action) as the original feature according to the sequence of keystrokes in the fixed text. In content-related scenarios, the system identifies the user while he or she is using the information system. Under such circumstances, the user types arbitrary text (also known as free text). To eliminate the influence of text content on the features of keystrokes in user identification, the average time spent by the user in typing common combinations of English letters, e.g., th, is, and ing, is transformed into a multidimensional feature vector (each combination of letters as one dimension) as the original feature.
In most previous studies, the above eigenvalues were taken as discrete quantities for modeling, and various classifiers were proposed by means of statistics, machine learning, and deep learning. However, the identification accuracy of these classifiers fails to meet practical requirements. One of the reasons is that the original features of the keystrokes or the features extracted from the original features are insufficient to achieve highly accurate classification. The dynamic change of keystrokes is another main cause. With the eigenvalues of keystrokes regarded as discrete quantities, it is hard to capture the dynamic changes of features. Hence, in this study, we established a model based on the sequence of keystrokes from the temporal perspective, which is more consistent with the dynamic change of keystrokes. The hidden Markov model (HMM) is a common temporal model for analyzing keystrokes. Its hidden state corresponds to keystrokes, and its emission probability corresponds to the probability distribution of the time interval between keystrokes. This model fails to consider the continuity of keystrokes. In addition, as a generative model, although the trained HMM contains the characteristics of users' keystrokes, it cannot describe the characteristics of a single sample. The temporal point process is a mathematical tool for describing discrete events in the continuous time domain. In this study, considering the continuous sequence of keystrokes, we used the multivariate Hawkes process, which is a special temporal point process, to model the sequence of keystrokes in fixed text in order to analyze and extract the characteristics of the sequence of keystrokes.

Research Background
The samples related to the sequence of keystrokes mainly include key value, down time, and up time. The difference in the down time of adjacent keys is referred to as DD time or digraph; the difference in the down time of two keys with one key in the middle is called trigraph; and the difference in the down time of two keys with n−2 keys in the middle is known as n-graph. The up time of one key minus the down time of this key equals the hold time. For adjacent keys, the down time of the latter key minus the up time of the former key equals the UD time. The data about the sequence of keystrokes in fixed text generally include DD time, UD time, and hold time [1,2]. Figure 1 shows the calculation of the time interval with the fixed text "GEN" as an example. It should be noted that Figure 1 shows only one scenario of the sequence of keystrokes.
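The interval definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the study: the function names, the tuple layout, and the millisecond timestamps for "GEN" are all made up for the example.

```python
from typing import Dict, List, Tuple

# (key, down_time, up_time); the timestamps are hypothetical, in milliseconds.
Keystroke = Tuple[str, int, int]


def keystroke_intervals(events: List[Keystroke]) -> Dict[str, List[int]]:
    """Return hold, DD and UD times for keystrokes ordered by down time."""
    # Hold: up time of a key minus its own down time.
    hold = [up - down for _, down, up in events]
    # DD: down time of the latter key minus down time of the former key.
    dd = [nxt[1] - cur[1] for cur, nxt in zip(events, events[1:])]
    # UD: down time of the latter key minus up time of the former key.
    # It is negative when the latter key is pressed before the former is released.
    ud = [nxt[1] - cur[2] for cur, nxt in zip(events, events[1:])]
    return {"hold": hold, "DD": dd, "UD": ud}


def n_graph(events: List[Keystroke], n: int) -> List[int]:
    """Down-time difference between keys that are n - 1 positions apart."""
    return [events[i + n - 1][1] - events[i][1] for i in range(len(events) - n + 1)]


sample = [("G", 0, 90), ("E", 150, 230), ("N", 210, 300)]
features = keystroke_intervals(sample)
trigraphs = n_graph(sample, 3)
```

In this made-up sample the key "N" is pressed before "E" is released, so the second UD time is negative, which is one of the overlapping scenarios that a single figure cannot cover.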
In a series of studies by Kevin et al. [1][2][3], the sequence of keystrokes was directly input into the classifier as the feature vector. In some studies, data about the sequence of keystrokes were processed. Bergadano et al. [4] calculated the average trigraph of the text and ranked the combinations of letters by the average trigraph as the feature of keystrokes. Robinson et al. [5] regarded the mean and variance of hold time as the feature. Monrose and Rubin [6] took the average and variance of the time spent in typing common combinations of letters, e.g., th and he, as the characteristics of the keystrokes. In a study by Araújo et al. [7], the averages and variances of UD time, DD time, and DU time served as the characteristics of the keystrokes. In a study by Epp et al. [8], besides various time intervals, the number of errors, the number of keystrokes, and the number of characters were also used as the original features, from which a feature subset was selected by a feature selection approach. With some statistical values (e.g., average, equation, skewness, autocorrelation, and moment) and information measurements (e.g., entropy) of the time interval as the features, Ulinskas et al. [9] applied a feature selection approach to select a feature subset from them. Based on fuzzy logic, de Ru and Eloff [10] divided the time interval into four categories, that is, very short, short, moderately short, and somewhat short, as the characteristics of keystrokes. Mondal and Bours [11] took the key values of adjacent keys as well as hold time, UD time, UU time (up time 2 − up time 1), and DD time as the feature vectors. Apart from the time interval, the key pressing strength, the position of the key on the keyboard, error frequency, and keystroke sound could also be used as features [12]. Lin et al. [13] input the key value, down time, UD time, and DD time into a convolutional neural network as the original feature matrices.
In the abovementioned studies, the discrete eigenvalues were combined to design the corresponding classifier and obtain the classification results. Sung and Cho [14] used a genetic algorithm-based SVM wrapper ensemble approach to select features, and some ensemble learning methods [15][16][17] can be applied to this field. These methods failed to effectively explain which features played a role in identifying users.
However, in some studies, the temporal characteristics of the sequence of keystrokes were also taken into consideration. Alpar [18] transformed the sequence of keystrokes in fixed text from the time domain to the frequency domain for analysis, which was not accurate enough due to the small number of keystrokes. HMM is a temporal model commonly used to analyze keystrokes [19][20][21]. Compared with the methods that take the sequence of keystrokes as a feature vector composed of the time intervals between the actions of pressing and releasing the keys, the temporal modeling method can make full use of the available data. The partially observable hidden Markov model (POHMM) [21] proposed by Monaco from the US Army Research Laboratory performs very well in this respect. The hidden variables of this model include two hidden states, i.e., positive and negative, based on which the sequence of observed keystrokes is generated. Corresponding to the state transition matrix and emission probability of the model, the features of keystrokes are the overall features of the training sample set rather than the features of a specific sample. Therefore, the features of a single sample cannot be obtained based on this model. In contrast, the temporal model proposed in this study is able to extract the features of a single sample for visual analysis.

Description of Keystrokes.
Keystrokes can be divided into two actions, i.e., pressing (down) and releasing (up). When the user types text, by recording the key value, action type, and time $t$, the information system can represent the sequence of keystrokes as a sequence of events $k_i^{\mathrm{type}}$, where $k_i^{\mathrm{down}}$ denotes that the $i$-th key is pressed; $k_i$ corresponds to the key value (e.g., key "A"); type represents the type of action (e.g., "down"); and $k_i^{\mathrm{type}}$ also stands for the time when the keystroke occurs.
This study is based on the premise that the user types text without errors (no "backspace" key in the sequence of keystrokes). The key values of the sequence of keystrokes are the content of the text. Figure 2 shows the sequences of keystroke events for the fixed text "hello." The first case was generated by hitting the keyboard in the rhythm of press-release-press-release. Poking the keyboard with a single finger will produce such a sequence. However, people usually use both hands to strike keys, and different fingers of the left and right hands strike different keys at the same time.
The sequences of keystrokes in the second and third cases will appear as a result. We can record the time the finger presses the $k_i$ key as $k_i^{\mathrm{down}}$. Then, the finger will "bounce" to release the $k_i$ key. Therefore, there is a triggering relationship between pressing and releasing; that is, $k_i^{\mathrm{down}}$ triggers $k_i^{\mathrm{up}}$. Next, consider the relationship between $k_i^{\mathrm{down}}$ and $k_{i+1}^{\mathrm{down}}$. Since the text determines the typing order, $k_i^{\mathrm{down}}$ must happen before $k_{i+1}^{\mathrm{down}}$. That is, after the user types $k_i$, $k_{i+1}$ will be typed immediately. Therefore, there is also a triggering relationship between them. Finally, consider the relationship among $k$ [...]. Figure 3 depicts an example of a sample. The down and up actions of the same type of event are self-triggering, and there is a mutual triggering relationship between different types of events (the down action corresponding to the previous text triggers the down action corresponding to the following text: the trigger value between adjacent events is larger; the farther apart the events are, the smaller the trigger value is). There is no triggering relationship between other events (i.e., the trigger value is equal to 0). The triggering relationship can be expressed as the following matrix:

Multivariate Hawkes Process.
The multivariate Hawkes process [22] is a counting process corresponding to a sequence of events composed of multiple types of (multidimensional) events. There is an incentive relationship between these events. "Multivariate" corresponds to multiple types. There are two ways to define the multivariate Hawkes process: the conditional intensity function and the Poisson cluster process. Both methods have their own advantages. Conditional intensity functions can be superimposed and combined; their formula description is flexible and concise; and they are easy to calculate. The Poisson cluster process is suitable for deriving the first or second moment metrics.
This article adopts the conditional intensity function definition. Suppose that the multivariate counting process is $N(t) = (N_1(t), \ldots, N_D(t))$, where $N(0) = 0$; its dimension is $D$; and the conditional intensity function is

$$\lambda_i(t) = \mu_i + \sum_{j=1}^{D} \int_0^{t} g_{ij}(t - s)\, dN_j(s), \quad i = 1, \ldots, D, \qquad (2)$$

where $\mu_i$ is a constant. The $D$-dimensional intensity function vector $\mu = (\mu_1, \ldots, \mu_D)$ represents the external part of the intensity function of the temporal point process $i$ (the intensity triggered by an external event); $G(t) = [g_{ij}(t)]_{i,j}$ is a $D \times D$-dimensional excitation kernel matrix; and the excitation function $g_{ij} \ge 0$ describes the endogenous influence (incentive) of events that have occurred in the $j$-th dimension of the multivariate Hawkes process on the intensity of the $i$-th dimension event. Formula (2) is subject to conditions on the kernel: the integrable function $g_{ij}(t) \in \mathbb{R}^+$ is the element of the matrix $G(t)$, called the excitation kernel function, which describes the incentive of $j$-type events to $i$-type events at time $t$ during $t \in [t_i, t_j)$. It can increase the probability of an $i$-type event occurring at time $t$ (note that $g_{ij}(t) \equiv 0$ means that there is no excitation between events; the conditional intensity function degenerates to a constant; and the temporal point process becomes a Poisson process with a parameter of $\mu$). Figure 4 is an example of a binary (two types of events, represented by red and blue) Hawkes process. Figure 4(a) shows the counting process $N(t)$, and Figure 4(b) shows the conditional intensity function $\lambda_2(t)$ corresponding to the blue event and its base intensity $\mu_2$. Figure 4(c) shows the times when the two types (dimensions) of events occur.

Log-Likelihood of Samples.
Suppose that the $D$-dimensional multivariate Hawkes process is composed of $D$ single-variable temporal point processes $N_i(t)$, where $i = 1, \ldots, D$, and $i$ represents dimensions. According to (2), its conditional intensity function can be written as

$$\lambda_i(t) = \mu_i + \sum_{j=1}^{D} \sum_{t_k^j \in H_t^j} g_{ij}(t - t_k^j),$$

where $H_t^j$ represents the historical events that occurred in the temporal point process $j$ before $t$. The exponential kernel function is a commonly used excitation kernel function:

$$g_{ij}(t) = \omega_{ij}\, \beta e^{-\beta t}, \quad t \ge 0,$$

where $W = [\omega_{ij}]$ is the adjacency matrix (or branching matrix), which describes the enhancement of the excitation intensity of the $i$-dimensional event by the $j$-dimensional event; and the attenuation coefficient $\beta \in \mathbb{R}^+$ describes the attenuation of the excitation intensity. Therefore, this article uses the adjacency matrix as the triggering matrix between keystroke events. The model parameters are $\mu$ and $W$, and the attenuation coefficient $\beta > 0$ is the hyperparameter of the model.
In order to describe the sequence of keystroke events with the multivariate Hawkes process, the following defines the sequence of keystrokes. According to the definition of the multivariate Hawkes process, the temporal point process $i$ corresponds to the up or down action of the $i$-th character of the keystroke event. A sample of keystroke behavior $X = \{X_k^d\}$ represents the data observed in the sampling time interval $[0, T)$; the superscript $d$ of $X$ is the dimension of the multivariate Hawkes process; and its maximum value $D$ is the number of keys in the sample. For example, if the keystroke event sequence corresponding to the "hello" text contains the final Enter key, the number of keys is 6. The subscript $k$ of $X$ is the $k$-th action of the temporal point process of a certain dimension. For example, the first action of the second dimension of the keystroke event sequence corresponding to the text "hello" is the down action of the button "E," and the maximum value $n_i$ is the number of actions in the temporal point process of this dimension (the buttons have down and up actions, so all $n_i = 2$ in this study). For a sample set consisting of multiple samples, $S := \{X^{(m)}\}_{m=1}^{M}$, the superscript of $X$ indicates the $m$-th sample, and $M$ is the total number of samples.
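To make the conditional intensity concrete, the following is a minimal sketch that evaluates $\lambda_i(t)$ from a list of past event times. It assumes the normalized exponential kernel $g_{ij}(t) = \omega_{ij}\beta e^{-\beta t}$; the function name, argument layout, and numbers are illustrative and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def intensity(
    i: int,
    t: float,
    mu: Sequence[float],
    W: Sequence[Sequence[float]],
    beta: float,
    history: Sequence[Sequence[float]],
) -> float:
    """Conditional intensity of dimension i at time t.

    history[j] holds the event times of dimension j. Only events strictly
    before t contribute. Assumed kernel: w_ij * beta * exp(-beta * dt).
    """
    lam = mu[i]  # base (exogenous) intensity
    for j, times in enumerate(history):
        for s in times:
            if s < t:
                lam += W[i][j] * beta * math.exp(-beta * (t - s))
    return lam


# Two dimensions; one event in dimension 0 at time 0 excites dimension 1.
mu = [0.2, 0.1]
W = [[0.0, 0.0], [0.5, 0.0]]
history: List[List[float]] = [[0.0], []]
lam_1 = intensity(1, 1.0, mu, W, beta=2.0, history=history)
lam_0 = intensity(0, 1.0, mu, W, beta=2.0, history=history)
```

Because row 0 of `W` is zero, dimension 0 stays at its base intensity, which is the Poisson special case mentioned for $g_{ij}(t) \equiv 0$.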
The log-likelihood [23] of the sample $X$ is

$$\log L(X \mid \mu, W) = \sum_{i=1}^{D} \left( \sum_{k=1}^{n_i} \log \lambda_i(X_k^i) - \int_0^T \lambda_i(t)\, dt \right),$$

where $X_k^i$ is the time of the $k$-th action in dimension $i$ and $T$ is the time taken for a single sample. Then, the log-likelihood of the sample set $S$ is

$$\log L(S \mid \mu, W) = \sum_{m=1}^{M} \log L(X^{(m)} \mid \mu, W).$$

In order to reduce the structural risk of the model, a regularization term $(1/\alpha) R(\mu, W)$ was added to the log-likelihood of the sample set. The maximum-likelihood estimate of the sample set $S$ with regularization is (introducing another hyperparameter, the penalty term coefficient $\alpha$)

$$(\hat{\mu}, \hat{W}) = \arg\max_{\mu, W} \; \log L(S \mid \mu, W) - \frac{1}{\alpha} R(\mu, W).$$

Here, the hyperparameter $\alpha$ constrains the influence of the regularization term on the likelihood estimation. In addition to the regular term constraint, in this study, we used (1) to constrain the excitation matrix $W$. The adjacency matrix is sparse. To satisfy the necessary conditions for the stationarity of the Hawkes process, the excitation function $g_{ij}(s)$ must satisfy $\int_0^\infty g_{ij}(s)\, ds < \infty$, and the spectral radius of the adjacency matrix $W$ must be less than 1 to ensure the smoothness of the model parameters. Hence, the $L_2$ regular term (ridge regression) was used:

$$R(\mu, W) = \sum_{i} \mu_i^2 + \sum_{i,j} \omega_{ij}^2.$$

The $L_2$ regular term can be described as a zero-mean Gaussian distribution on the weight $\omega_{ij}$:

$$p(\omega_{ij}) \propto \exp\left(-\frac{\omega_{ij}^2}{\alpha}\right).$$

Model Selection.
The multivariate Hawkes process can be used to mine the incentive relationships that exist in a sequence of multiple types of events. For example, Eichler et al. [24] used the multivariate Hawkes process to mine the causal relationship between different types of events. The commonly used training method for the multivariate Hawkes process model is to add a regular term to the maximum-likelihood estimation of the sample to constrain the complexity of the model parameters and avoid overfitting. Zhou et al. [25] used regular terms that enforce sparse and low-rank structures. In order to avoid excessive assumptions on the model parameters, Xu et al. [26] used a linear combination of basis functions as the intensity function and used the sparse group-lasso regular term to constrain the sparsity of the linear combination coefficients. The abovementioned frequentist methods all require a large number of training samples. As the training sample size decreases, the noise of the model becomes larger, so these methods are not suitable for keystroke behavior analysis (where the sample size is not large).
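As a point of reference for the regularized objective that the frequentist methods maximize, here is a minimal sketch of the penalized log-likelihood. It rests on two assumptions that should be checked against the original equations: the kernel is the normalized exponential $\omega_{ij}\beta e^{-\beta t}$, so the integral of the intensity has a closed form, and $R(\mu, W)$ is the sum of squared entries of $\mu$ and $W$. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

History = Sequence[Sequence[float]]  # history[j] = event times of dimension j


def log_likelihood(history: History, T: float, mu, W, beta: float) -> float:
    """Log-likelihood of one sample observed on [0, T)."""
    D = len(mu)
    ll = 0.0
    for i in range(D):
        # Sum of log-intensities at the events of dimension i.
        for t in history[i]:
            lam = mu[i]
            for j in range(D):
                for s in history[j]:
                    if s < t:
                        lam += W[i][j] * beta * math.exp(-beta * (t - s))
            ll += math.log(lam)  # requires lam > 0
        # Integral of lambda_i over [0, T), closed form for this kernel.
        comp = mu[i] * T
        for j in range(D):
            for s in history[j]:
                comp += W[i][j] * (1.0 - math.exp(-beta * (T - s)))
        ll -= comp
    return ll


def penalized_objective(
    samples: List[Tuple[History, float]], mu, W, beta: float, alpha: float
) -> float:
    """Sample-set log-likelihood minus (1 / alpha) times an assumed L2 penalty."""
    total = sum(log_likelihood(h, T, mu, W, beta) for h, T in samples)
    l2 = sum(m * m for m in mu) + sum(w * w for row in W for w in row)
    return total - l2 / alpha


# One-dimensional toy sample with self-excitation.
toy = ([[0.0, 1.0]], 2.0)
ll_toy = log_likelihood(toy[0], toy[1], mu=[1.0], W=[[0.5]], beta=1.0)
obj_toy = penalized_objective([toy], mu=[1.0], W=[[0.5]], beta=1.0, alpha=0.1)
```

A smaller $\alpha$ weights the penalty more heavily, which is the sense in which $\alpha$ constrains the influence of the regular term.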
Compared with frequentist methods, Bayesian methods can also be effective when the training sample size is small. The Bayesian method combines the priors of the model parameters with the training data. According to the objective function (usually the maximum likelihood of the sample), the method continuously corrects the parameters to obtain the optimal parameter posterior and finally uses the parameter posterior to make inferences and decisions. As long as there is a reasonable prior, a reasonable decision can be made even if the training sample size is small. Linderman and Adams [27] used the Gibbs sampling method to approximate the likelihood of the sample. All historical events must be considered when calculating the intensity, so the convergence speed is slow. In order to improve the speed of convergence, Linderman and Adams [28] divided the time axis into many small time periods. Each function describes the excitation relationship. The function and intensity are independent, so that the calculation of intensity does not need to consider all historical events. Compared with the Gibbs sampling method [27], the convergence speed is greatly improved. However, this approach introduces model noise [29]. Salehi et al. [29] proposed a variational inference method for the multivariate Hawkes process. Compared with the methods of Linderman and Adams [27,28], this method can converge quickly and simultaneously learn the model parameters and the coefficient α of the regular term, which improves the efficiency of model learning. In this study, we used Salehi's method [29] to model the sequence of fixed text keystroke events. As in Salehi's method, we set the number of Monte Carlo samples to one, and the number of iterations is I = 20000. For each sample, the computational complexity is O(I), and the runtime is about 15 minutes.

Experiment and Results
In order to facilitate comparison with other methods, in this study, we used the CMU dataset [1]. The dataset at http://www.cs.cmu.edu/~keystroke/DSL-StrongPasswordData.xls contains the keystrokes of 51 users. Each user typed the fixed text ".tie5Roanl" 50 times in one session, and each user completed 8 sessions (the interval between sessions was more than 1 day). Each user has a total of 400 (50 × 8) rows of records. Each row records one sample X of the model. Each sample has 11 types of events (11 keys: 10 characters plus the Enter key), and each type of event has 2 actions (down and up). Kevin [2] showed that, no matter which classifier was used, the accuracy of s036 and s052 was high (they are easy to distinguish from other users), and the accuracy of s002, s032, and s047 was low. This article selected the data of these 5 representative users for experiments. We determined the hyperparameter β of the Hawkes process model (code address: https://github.com/zcmail/KD-feature-extracted-by-Hawkes-process), trained the Hawkes process model to extract keystroke behavior characteristics, and finally compared and analyzed the raw data of CMU keystrokes and the POHMM model (code address: https://github.com/vmonaco/pohmm) [21,30].
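One way to turn a row of the dataset into a sample with one dimension per key and two actions per dimension is sketched below. It assumes that the 11 hold times and 10 DD times of a row have already been extracted into two lists in typing order; the function name and the toy numbers are hypothetical, and the first down time is placed at 0 by convention.

```python
from typing import List, Tuple


def row_to_sample(hold: List[float], dd: List[float]) -> Tuple[List[List[float]], float]:
    """Rebuild per-key [down, up] times from hold and DD intervals.

    Returns (history, T) where history[d] = [down time, up time] of key d
    and T is the time of the last action in the sample.
    """
    if len(dd) != len(hold) - 1:
        raise ValueError("expected one DD time fewer than hold times")
    downs = [0.0]  # assumption: the first key is pressed at time 0
    for gap in dd:
        downs.append(downs[-1] + gap)
    history = [[d, d + h] for d, h in zip(downs, hold)]
    T = max(up for _, up in history)
    return history, T


# Toy row with 3 keys instead of 11, times in seconds.
history, T = row_to_sample(hold=[0.1, 0.2, 0.1], dd=[0.3, 0.2])
```

With the real data, `hold` would have 11 entries and `dd` 10, giving 11 dimensions with $n_i = 2$ actions each.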

Selection of Hyperparameter β of Hawkes Model.
In this study, we used grid search to select the value of β. The model was trained with the samples of each session of each user, and the intensity function vector and the adjacency matrix corresponding to each session sample were obtained. The best choice of β was determined by comparing the intensity function vectors and the adjacency matrices. According to Figure 5, the candidate values of β are limited to [0.001, 0.005, 0.01, 0.03, 0.05, 0.1], and the value of β is further determined according to the principle of minimizing changes within the class. The specific method is to compare the adjacency matrices corresponding to different sessions of the same user and select the β for which the adjacency matrix changes little. Figure 6 shows the adjacency matrices corresponding to the samples of each session learned by the model for user s002 with different β values. Figure 7 shows the adjacency matrices corresponding to the samples of each session learned by the model for user s032 with different β values. It can be seen from Figures 6 and 7 that the smaller the β value, the greater the triggering effect and the larger the change of the adjacency matrix within the class. Based on this, we further narrowed the candidate values of β to [0.01, 0.03, 0.05, 0.1].
Next, the value of β can be determined according to the principle of large differences between classes. The basic intensities and adjacency matrices corresponding to different session samples were concatenated into feature vectors. Then, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) were used for dimensionality reduction and for visual analysis, respectively. Figure 8 shows the effect of different values of β.
In Figure 8, each point corresponds to the feature vector extracted by training on the sample data of a certain session of a certain user, and users are distinguished by color and label (the label of user s002 is 0, the label of s032 is 1, etc.). It can be seen from Figure 8 that when the value of β is 0.1, the discrimination effect is better. Therefore, β = 0.1 was selected in this study. Besides the decay parameter β, we initialized the penalty term coefficient α in [0.1, 1] and found that when α equaled one, the model sometimes could not converge. Therefore, we initialized α to 0.1. We used the Adam optimizer with a learning rate of 0.01.
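The within-class comparison above is made visually from Figures 6 and 7. As a rough numeric proxy, and not the authors' procedure, one could average the across-session variance of each adjacency entry for every user and rank the candidate β values by that spread. The data layout and names below are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Tuple

# results[beta][user] = list of flattened adjacency matrices, one per session
Results = Dict[float, Dict[str, List[List[float]]]]


def within_class_spread(sessions: List[List[float]]) -> float:
    """Mean over matrix entries of the variance across sessions."""
    n_entries = len(sessions[0])
    return sum(pvariance([s[k] for s in sessions]) for k in range(n_entries)) / n_entries


def rank_betas(results: Results) -> Tuple[List[float], Dict[float, float]]:
    """Return the beta values sorted by mean within-user spread (smallest first)."""
    scores = {}
    for beta, per_user in results.items():
        spreads = [within_class_spread(s) for s in per_user.values()]
        scores[beta] = sum(spreads) / len(spreads)
    return sorted(scores, key=scores.get), scores


# Toy input: one user, two sessions, 2-entry "matrices".
toy_results = {
    0.01: {"s002": [[1.0, 0.0], [3.0, 0.0]]},
    0.1: {"s002": [[1.0, 0.0], [1.5, 0.0]]},
}
ranking, spread = rank_betas(toy_results)
```

Such a score is scale dependent, so it favors β values that yield smaller adjacency entries; it is meant only to illustrate the "small change within the class" principle.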
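The concatenation step can be sketched as follows. PCA and t-SNE themselves are not reproduced here; the centroid distance is only an illustrative stand-in for "large differences between classes", and every name in the block is an assumption rather than part of the published code.

```python
import math
from typing import Dict, List, Sequence


def feature_vector(mu: Sequence[float], W: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the base intensities and the row-major flattened adjacency matrix."""
    return list(mu) + [w for row in W for w in row]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a user's session feature vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def min_centroid_distance(features_by_user: Dict[str, List[List[float]]]) -> float:
    """Smallest Euclidean distance between any two user centroids."""
    cents = [centroid(v) for v in features_by_user.values()]
    return min(
        math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]
    )


vec = feature_vector([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])
separation = min_centroid_distance(
    {"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[1.0, 3.0], [1.0, 5.0]]}
)
```

In practice many adjacency entries are forced to zero by the triggering structure, so only the non-zero entries would need to be kept in the vector.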

Feature Visualization.
Here, the original data of CMU keystrokes (the sample is directly taken as feature data) and the feature data extracted by the Hawkes process model are visualized. Figure 9(a) shows the DD time feature of the CMU keystroke raw data of 5 users, and Figure 9(b) shows its hold time feature. The ordinate is the characteristic value; the abscissa is the sample serial number corresponding to the characteristic. The sample serial numbers of s002 are 0 to 399, the sample serial numbers of s032 are 400 to 799, and so on. It can be seen from the figure that the hold time corresponding to the samples of s036 (serial numbers 800 to 1199) is shorter than that of other users, and the DD time between ie, ro, and nl is longer than that of other users. The DD time and hold time characteristics of s002 and s032 are similar, and they are not easy to distinguish. The characteristics of s047 are similar to those of s002 and s032, and the DD time between them is longer than that of other users. s052 has no features that distinguish it from other users. The results in Kevin's work [2] showed that s052 is easy to distinguish from other users. Figure 10 shows part of the values of the intensity function vector μ extracted for 5 users using the Hawkes process model (because μ_4, μ_5, μ_6, and μ_7 are 0, they are not drawn). The top row is μ_0, and the second row is μ_1. The feature number is consistent with Figure 9. It can be seen from Figure 10 that, according to μ_0, s036 is easy to distinguish from other users; according to μ_2, s052 is easy to distinguish from other users; and, according to μ_3, some samples of s032 can be distinguished. Figure 11 shows part of the values of the adjacency matrix extracted for 5 users using the Hawkes process model (as in Figure 10, the value 0 is not drawn).
It can be seen from Figure 11 that s002, s032, and s047 are not easy to distinguish from other users; the entries ω_{5,4}, ω_{6,5}, and ω_{9,8} of s036 are easy to distinguish from those of other users; and the entries ω_{2,0}, ω_{9,7}, and ω_{10,9} of s052 are easy to distinguish from those of other users. It is difficult to distinguish s002 and s032 from Figure 9, while, in Figures 10 and 11, s002 and s032 can be distinguished according to μ_3, ω_{7,5}, and ω_{7,6}. Therefore, compared with the original data of CMU keystroke events, the features extracted in this study have more specific information for distinguishing users.

Comparison of Classification Results of Different Feature Data.
In practical applications, the positive sample data are easy to collect, while the negative samples are often unknown. We used one-class classifiers to compare the classification performance of the CMU keystroke raw data, as well as the features extracted by the Hawkes process model and the POHMM model (the model contains features). Kevin [2] and Ali et al. [30] used the first 200 samples of the target user to train the model. Similar to their approach, the training set used in this study comprised the first 200 samples of the target user (or the feature data corresponding to the first 200 samples). We compared the classification performance of the CMU keystroke raw data as well as the features extracted by the POHMM and the multivariate Hawkes process model. For the CMU keystroke raw data, we adopted the scaled Manhattan classifier (in [2], this classifier performs best). The POHMM model is a generative model. It can extract the features of a sample set but cannot extract the features of a single sample. The result of POHMM is the log-likelihood value of the test sample. The threshold value is selected according to the log-likelihood values of the first 200 samples of the target user (the count of log-likelihood values greater than the threshold is 5% × 200 = 10). The features extracted by the Hawkes process model were classified with the scaled Manhattan classifier and the Euclidean classifier in [2]. These methods use the same training data and test data. Table 1 shows the classification results. Each user has two values. The upper value is the false negative rate, and the lower value is the false positive rate. It can be seen that, for the positive samples of s002 and s032, the classification performance of the Euclidean classifier based on the features extracted by the Hawkes process model is better than that of the keystroke raw data and the POHMM model. On the whole, the POHMM model has the best classification performance. However, no matter which classification method was used, s032, s002, and s047 have higher error rates [2].
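The two distance-based detectors and the 5% threshold rule can be sketched as below. This is a simplified reading, not the implementation in [2]: the scaled Manhattan score divides each absolute deviation by the per-feature mean absolute deviation of the training set, the Euclidean score is the plain distance to the mean vector, scores are treated as anomaly scores (higher means less like the target user), and "false negative" is taken to mean a genuine sample that is rejected. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_template(train: List[Vector]) -> Tuple[List[float], List[float]]:
    """Mean vector and per-feature mean absolute deviation of the training set."""
    n, d = len(train), len(train[0])
    mean = [sum(x[k] for x in train) / n for k in range(d)]
    mad = [sum(abs(x[k] - mean[k]) for x in train) / n for k in range(d)]
    return mean, mad


def scaled_manhattan(x: Vector, mean: Vector, mad: Vector, eps: float = 1e-12) -> float:
    """Sum of absolute deviations, each scaled by that feature's deviation."""
    return sum(abs(a - m) / max(s, eps) for a, m, s in zip(x, mean, mad))


def euclidean(x: Vector, mean: Vector) -> float:
    """Euclidean distance between a sample and the mean vector."""
    return math.dist(x, mean)


def threshold_at_rate(train_scores: List[float], rate: float = 0.05) -> float:
    """Threshold exceeded by round(rate * n) training scores (ignoring ties)."""
    k = round(rate * len(train_scores))
    return sorted(train_scores, reverse=True)[k]


def error_rates(genuine: List[float], impostor: List[float], thr: float) -> Tuple[float, float]:
    """(false negative rate, false positive rate) under the stated convention."""
    fn = sum(s > thr for s in genuine) / len(genuine)
    fp = sum(s <= thr for s in impostor) / len(impostor)
    return fn, fp


mean, mad = fit_template([[1.0, 10.0], [3.0, 14.0]])
sm_score = scaled_manhattan([4.0, 12.0], mean, mad)
eu_score = euclidean([4.0, 12.0], mean)
thr = threshold_at_rate([float(v) for v in range(1, 201)])
```

With 200 training scores and a 5% rate, exactly 10 training scores lie above the returned threshold, mirroring the 5% × 200 = 10 rule stated for the POHMM log-likelihoods.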
The ROC curves of different classification results are shown in Figure 12. The results of the classification experiments show that it is effective to extract the features using the Hawkes process model. For users who are not easily distinguishable, this method has slightly better performance than the scaled Manhattan classifier based on the CMU keystroke raw data. For users such as s036 and s052 that are easy to distinguish [2], the classification performance of the scaled Manhattan classifier based on the CMU keystroke raw data is better than that of the Euclidean classifier based on the features extracted by the Hawkes process model. A possible reason is that noise was introduced in the training process of the Hawkes process model. In addition, the selection of the hyperparameter β was not accurate, which leads to deviations in the model parameters.
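For readers who want to reproduce curves of the kind shown in Figure 12 from their own detector scores, a minimal ROC sketch follows. It assumes anomaly scores where larger values indicate an impostor; the function names and the toy scores are illustrative only.

```python
from typing import List, Tuple


def roc_points(genuine: List[float], impostor: List[float]) -> List[Tuple[float, float]]:
    """(false alarm rate, hit rate) pairs obtained by sweeping the threshold.

    A sample is flagged as an impostor when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    points = [(0.0, 0.0)]
    for thr in thresholds:
        false_alarm = sum(s >= thr for s in genuine) / len(genuine)
        hit = sum(s >= thr for s in impostor) / len(impostor)
        points.append((false_alarm, hit))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


separable = roc_points(genuine=[1.0, 2.0], impostor=[3.0, 4.0])
overlapping = roc_points(genuine=[1.0, 3.0], impostor=[2.0, 4.0])
```

For the overlapping toy scores, three of the four genuine-impostor pairs are ordered correctly, which is what the trapezoidal area reflects.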

Conclusion
Taking the time interval (discrete value) between keystroke events as a feature is a common practice in modeling the sequence of keystroke events. There are a small number of studies that consider continuous time features. In this study, we used the multivariate Hawkes process based on the exponential excitation kernel to model the sequence of fixed text keystroke events. In comparison with the CMU keystroke raw data and the POHMM model [30], we found that the features (model parameters μ and W) learned by the model are slightly more accurate in distinguishing users that are not easily distinguishable and have richer information for distinguishing users. The exponential excitation kernel function (which is memoryless) used in this study is a strong assumption. If the keystroke behavior does not conform to the exponential decay law, the model will deviate from the actual behavior. The next step of this study will be to explore the use of the nonparametric Hawkes process to model keystroke events (the kernel function has no specific form). In addition, since the Hawkes process can extract the features of the fixed text keystroke event sequence, it should also be able to extract the features of the free text keystroke event sequence, which can be investigated in future work.

Conflicts of Interest
The authors declare that they have no conflicts of interest.