NTM-Based Skill-Aware Knowledge Tracing for Conjunctive Skills

Knowledge tracing (KT) is the task of modelling students' knowledge state based on their historical interactions with intelligent tutoring systems. Existing KT models ignore the relevance among the multiple knowledge concepts of a question and the characteristics of online tutoring systems. This paper proposes neural Turing machine-based skill-aware knowledge tracing (NSKT) for conjunctive skills, which captures the relevance among the knowledge concepts of a question to model students' knowledge state more accurately and to discover more latent relevance among knowledge concepts effectively. We analyze the characteristics of three real-world KT datasets in depth. Experiments on these datasets show that NSKT outperforms the state-of-the-art deep KT models in prediction AUC. This paper also explores the details of NSKT's prediction process in modelling students' knowledge state, as well as the relevance of knowledge concepts and the conditional influences between exercises.


Introduction
With the development of intelligent tutoring systems (ITSs) and the emergence of massive open online courses (MOOCs) [1,2], knowledge tracing plays an important role in improving the efficiency of personalized learning platforms. Knowledge tracing is the task of modelling students' knowledge state based on their historical interactions to predict students' mastery of knowledge concepts (KCs), where a KC can be an exercise, a skill, or a concept [3,4].
In order to better model students' knowledge state, various knowledge-tracing models have been proposed. Among previous studies, Bayesian knowledge tracing (BKT) is a powerful knowledge-tracing model that traces students' knowledge concept state by using a hidden Markov model (HMM) for each KC [5].
As deep learning developed, many deep learning models have been applied to KT. Piech et al. were the first to apply a recurrent neural network (RNN) to model the student learning process, proposing deep knowledge tracing (DKT) [6][7][8][9]. The dynamic key-value memory network (DKVMN) uses a static memory called the key and a dynamic memory called the value to discover latent relations between exercises and knowledge concepts [10,11]. Self-attentive knowledge tracing (SAKT) proposes a self-attention-based KT model that models the students' knowledge state, with exercises as attention queries and students' past interactions as attention keys/values [3,[12][13][14][15].
However, the aforementioned works only focus on students' exercise interactions and ignore the relations between questions and skills. Merely focusing on students' interactions cannot model students' knowledge state accurately. Consequently, knowledge-tracing models have begun to pay attention to the structure of the knowledge concepts [16][17][18].
Deep hierarchical knowledge tracing models students' knowledge state by capturing the hierarchical structure of questions and knowledge concepts [16]. Neil Heffernan's latest work considers the information of the question to which a knowledge concept belongs [17]. Graph-based knowledge tracing considers the influence among neighboring knowledge concepts [19][20][21][22]. The bipartite graph is an effective structural model to capture latent relations between questions and skills [18]. This method is effective, but its amount of computation is huge because it needs to extract questions and skills separately. Thus, it is difficult to regard it as a streamlined and effective knowledge-tracing model.
None of the above KT models make full use of the multiknowledge concept information of the questions.
Existing knowledge-tracing models cannot capture latent relations between questions and concepts concisely and effectively. Questions are generally composed of multiple knowledge concepts, which are in fact closely related. In order to better model the students' learning process, our model is constructed with neural Turing machines (NTMs), an instance of memory-augmented neural networks (MANNs) with a large external memory capacity [23][24][25]. Therefore, on the basis of the above deep knowledge-tracing models, we propose an NTM-based skill-aware knowledge-tracing model. The highlight of our work is to utilize the knowledge concept composition information of questions to model the students' knowledge state more accurately and to discover more latent relevance among knowledge concepts effectively. The contributions of this paper are as follows:
(i) We process the real-world KT datasets in detail and discover new characteristics of online tutoring systems and knowledge-tracing datasets.
(ii) We design a question-skill dictionary algorithm to obtain the conjunctive skills of questions. The input encoding contains both students' answering interaction information and the related knowledge concept information.
(iii) We innovatively apply neural Turing machines to knowledge tracing to enhance the memory capacity of our model, predict students' mastery of knowledge concepts accurately, and discover knowledge concept substructure effectively.
(iv) We propose a novel NTM-based skill-aware knowledge-tracing model for conjunctive skills and apply a novel loss optimization function to deep knowledge tracing to enhance the model's ability of skill awareness. Our model considers the conjunctive knowledge concept information contained in a question when modelling the students' knowledge state; thus, it outperforms existing KT models.
The rest of this paper is organized as follows: Section 2 presents a brief overview of related work in the field of knowledge tracing. In Section 3, we formulate the process for NSKT to perform the knowledge-tracing task. Then, Section 4 introduces the characteristics and classifications of online tutoring systems. The details of the NSKT model are provided in Section 5. The experimental results and the comparison of the models' performance on the real-world datasets are given in Section 6. In Section 7, we discuss in detail the process of NSKT in modelling the students' knowledge state. Section 8 presents the conclusions and future work.

Related Work
In this section, we present a brief overview of the models and methods of related work in the field of knowledge tracing, which can be classified into two main categories, as shown in Table 1.

Item Response Theory.
Item response theory (IRT) was the most commonly used cognitive model to predict students' mastery of knowledge concepts before knowledge tracing was proposed in 1995 [26,27]. On the basis of IRT, knowledge state cognitive models based on factor analysis were later proposed: LFA [28] and PFA [29]. These logistic regression models predict students' mastery of knowledge concepts by analyzing the relationship among factors that have an impact on students' answering accuracy [30,31].

Knowledge Tracing.
Bayesian knowledge tracing (BKT) models the students' knowledge state by using the hidden Markov model (HMM) for a single knowledge concept, which is represented as a set of binary latent variables [5].
With the rise of deep learning, deep knowledge tracing (DKT) was proposed in [6], which regards students' historical interactions as time sequences and models the students' knowledge state with a recurrent neural network (RNN). The experimental results show that DKT has a powerful ability to model the students' knowledge state. After DKT, many deep KT models have been proposed to improve the AUC of predicting students' mastery of knowledge concepts. However, most of these deep knowledge-tracing models only focus on students' interactions on knowledge concepts and ignore the structural relationship between questions and knowledge concepts.

Question-KC Relation in Knowledge Tracing.
Cen et al. proposed two IRT models, the additive factor model (AFM) and the conjunctive factor model (CFM), to model the conjunctive skills in student datasets [32]. Both the AFM and the CFM consider the conjunctive skill information contained in an item to predict the probability of students answering the item correctly.
Deep hierarchical knowledge tracing (DHKT) begins to focus on the hierarchical relationship between knowledge concepts and questions to predict the performance of students [16]. DHKT trains a question embedding as the average of the embeddings of the skills belonging to the question. The model using bipartite graphs can capture relationships between knowledge concepts and questions effectively and systematically to pretrain question embeddings for each question [18]. Neil Heffernan's latest work also begins to focus on the architecture of knowledge concepts and questions [17].

Problem Formulation
Generally, KT can be formulated as a supervised sequence learning problem. The student's interaction tuple at timestamp t is h_t = (q_t, a_t), which represents which skill (exercise) q_t ∈ {1, ..., M} was answered and whether it was answered correctly, a_t ∈ {0, 1}, where M is the number of unique exercises in the dataset. Given the student's past exercise interactions H_t = (h_0, ..., h_t), the goal of KT is to predict the probability that the student will answer question q_{t+1} correctly at the next timestamp t + 1, P(a_{t+1} = 1 | q_{t+1}, H_t) [3,6,10].
It can be seen that existing KT models only focus on students' exercise interactions, so it is difficult for them to predict students' mastery of skills effectively. The notations used in this paper are shown in Table 2.
Definition 1. Related knowledge concepts (RKCs): the related knowledge concepts (RKCs) of a knowledge concept q are the other knowledge concepts S that compose the question p together with q, where the concepts in S and q are mutual conjunctive knowledge concepts (skills). Algorithm 1 processes the skills and the questions of the dataset to obtain a dictionary Dic with the question number as the key and the conjunctive skills of the question as the value, where conjunctive skills are the skills that make up the same question. The time complexity of Algorithm 1 is O(n^2). In this paper, we use KC, shown in Table 2, to represent a skill. Let S be the RKCs related to KC q of the answered question p, where S = {x | x ∈ Dic_p, x ≠ q}, as illustrated in Figure 1(a). The skill-aware knowledge-tracing model can be formulated as follows: the student's interaction at timestamp t is h'_t = (p_t, q_t, a_t, S_t, c_t), where a_t is the correctness of the answer to question p_t on skill q_t, S_t is the set of RKCs of KC q_t, and c_t is the correctness on the RKCs S_t. Given the student's past interactions H'_t = (h'_0, ..., h'_t), we can predict the probability that the student will answer the next KC q_{t+1} correctly at timestamp t + 1, P(q_{t+1}) = P(a_{t+1} = 1 | q_{t+1}, H'_t), or predict the students' mastery of all knowledge concepts, {P(q_i)}_{i=1}^{M}.
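The question-skill dictionary Dic described above can be sketched as follows; this is a minimal illustration (function and variable names are ours, not the paper's), assuming the dataset is available as (question_id, skill_id) pairs:

```python
def build_question_skill_dict(records):
    """Map each question id to the set of conjunctive skills that compose it.

    `records` is an iterable of (question_id, skill_id) pairs, one per
    interaction row; repeated pairs are deduplicated by the set.
    """
    dic = {}
    for question_id, skill_id in records:
        dic.setdefault(question_id, set()).add(skill_id)
    return dic


def related_kcs(dic, question_id, skill_id):
    """RKCs of skill q on question p: S = {x | x in Dic_p, x != q}."""
    return dic.get(question_id, set()) - {skill_id}
```

Note that with a hash map the dictionary can be built in a single pass over the records; the O(n^2) bound in the text covers a pairwise scan over questions and skills.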

Online Tutoring Systems
The online tutoring systems can be classified into two categories.

Question-Level Online Tutoring Systems.
In question-level online tutoring systems, students answer the question directly. If the question is answered correctly or incorrectly, all KCs (skills) of the question are answered correctly or incorrectly too. So if a student has answered q_t correctly or incorrectly, then they must have answered the RKCs S_t correctly or incorrectly too, which is illustrated in Figure 1(b). Because q_t and S_t come from the same question, in question-level online tutoring systems, for a student's interaction (p_t, q_t, a_t, S_t, c_t) at timestamp t, every entry of c_t equals a_t:

c_t = a_t · 1. (1)

Skill-Level Online Tutoring Systems.
The question-answering situation in skill-level online tutoring systems is much more complicated than that of question-level online tutoring systems. Students can individually answer one of the skills in a question and can answer this skill once or multiple times. So if a student answers KC 1 correctly, it does not mean that the student must answer KC 2 correctly, as shown in Figure 1(c).
Superficially, there is no obvious answering-correctness relationship between skill q_t and the related skill set S_t. However, skill-level online tutoring systems contain a large number of student answering examples like the one shown in Table 3: if a student answers q_t incorrectly many times, then even if he finally answers q_t correctly, his mastery of skill q_t is very poor, and similarly, his mastery of S_t is poor; it is very likely that he will answer q_t's related skills S_t incorrectly. So the student's mastery of q_t, P(q_t), and the student's mastery of S_t, P(S_t), are close:

P(q_t) ≈ P(S_t). (2)

This finding is strongly supported by the actual responses of students in skill-level online tutoring systems. So in skill-level online tutoring systems, according to formula (2), we can assume

c_t = a_t · 1, (3)

as shown in Table 4.

Method
In this section, we give a detailed introduction to our NSKT framework, whose overall architecture is shown in Figure 2.

Model.
The model consists of an encoding layer and a neural network layer. In order to better model the students' knowledge state, the model input is constructed from the interaction information composed of skills and questions.

Table 4: The relationship between a_t and c_t in skill-level online tutoring systems.

Table 3: Example of a student answering question 33 in skill-level online tutoring systems. Question 33 is composed of skills s11 and s21. The student answers s11 incorrectly three times (t_1-t_3) in succession. Even though he answers s11 correctly at timestamp t_4, his mastery of s11 is very poor and his mastery of s11's related skill s21 is not good either, so it is very likely that he will answer s21 incorrectly; in fact, he answers s21 incorrectly at timestamp t_5.


RKC Information Encoding.
The information of the set S of RKCs related to KC q is encoded as a vector E_S of length M, where each entry marks whether the corresponding skill belongs to S:

E_S(i) = 1 if s_i ∈ S, and E_S(i) = 0 otherwise.
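The RKC encoding amounts to a multi-hot vector over the M skills; a minimal sketch (names are ours, skill ids assumed zero-based):

```python
import numpy as np

def encode_rkcs(rkcs, num_skills):
    """Multi-hot encoding E_S of the RKC set S: E_S[i] = 1 iff skill i is in S."""
    e_s = np.zeros(num_skills, dtype=np.float32)
    for skill in rkcs:
        e_s[skill] = 1.0
    return e_s
```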

Neural Turing Machines.
Neural Turing machines are an instance of memory-augmented neural networks (MANNs) that extend the capabilities of neural networks by coupling them to external memory resources. Experiments show that neural Turing machines have stronger memory capabilities than the LSTM [23], which makes them very suitable for modelling the students' knowledge state [33][34][35]. Figure 3 shows a high-level diagram of the neural Turing machine architecture. As can be seen from Figure 3, the NTM is composed of four modules: the controller, read heads, write heads, and memory. The controller can be a feed-forward neural network or a recurrent neural network [23,34] and has read and write heads that access the external memory matrix.

Reading.
Let M_t be the external memory content, an n × m memory matrix at timestamp t, where n is the number of memory locations and m is the vector dimension at each memory location. The n elements w_t(i) of w_t, a vector of weightings over the n locations emitted by a read head at timestamp t, obey the following constraints:

Σ_i w_t(i) = 1, 0 ≤ w_t(i) ≤ 1, ∀i.

Let r_t be the read vector of length m returned by the head at timestamp t:

r_t = Σ_i w_t(i) M_t(i).

5.5. Writing.
The memory matrix M_t at timestamp t is modified by the erase vector e_t and the add vector a_t:

M̃_t(i) = M_{t-1}(i) [1 − w_t(i) e_t],
M_t(i) = M̃_t(i) + w_t(i) a_t.

5.6. Addressing Mechanisms

5.6.1. Focusing on Content.
Each head produces a key vector k_t of length m that is used to compute the normalised weighting w^c_t as follows:

w^c_t(i) = exp(β_t K(k_t, M_t(i))) / Σ_j exp(β_t K(k_t, M_t(j))),

where β_t is a positive key strength generated by the controller and the similarity measure K is cosine similarity:

K(u, v) = (u · v) / (‖u‖ ‖v‖).
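The reading, writing, and content-addressing steps above can be sketched in NumPy as a minimal single-head illustration (function names are ours):

```python
import numpy as np

def read_memory(memory, w):
    """r_t = sum_i w_t(i) * M_t(i); memory is (n, m), w sums to 1."""
    return w @ memory

def write_memory(memory, w, erase, add):
    """Erase then add: M_t(i) = M_{t-1}(i) * (1 - w(i) e_t) + w(i) a_t."""
    memory = memory * (1.0 - np.outer(w, erase))
    return memory + np.outer(w, add)

def content_weighting(memory, key, beta):
    """w_c(i) proportional to exp(beta * cosine(k_t, M_t(i)))."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = memory @ key / norms
    scores = np.exp(beta * similarity)
    return scores / scores.sum()
```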

Focusing on Location.
The location-based addressing mechanism is designed to facilitate both simple iterations across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting as follows [23].
Firstly, the interpolation gate g_t is used to blend between the weighting w_{t-1} and the weighting w^c_t:

w^g_t = g_t w^c_t + (1 − g_t) w_{t-1}.

Furthermore, the model uses a one-dimensional convolutional shift kernel to convolve the current weighting w^g_t:

w̃_t(i) = Σ_j w^g_t(j) s_t(i − j),

where s_t is the shift weighting generated by the controller. To correct the blur that occurs due to the convolution operation, each head emits one further scalar γ_t ≥ 1 whose effect is to sharpen the final weighting as follows:

w_t(i) = w̃_t(i)^{γ_t} / Σ_j w̃_t(j)^{γ_t}.
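The interpolation, circular convolutional shift, and sharpening steps can be sketched as follows (a minimal NumPy illustration; γ is passed as `gamma`):

```python
import numpy as np

def location_addressing(w_prev, w_c, g, s, gamma):
    """Interpolation, circular convolutional shift, and sharpening."""
    # w_g = g * w_c + (1 - g) * w_{t-1}
    w_g = g * w_c + (1.0 - g) * w_prev
    # circular convolution: w~(i) = sum_j w_g(j) * s((i - j) mod n)
    n = len(w_g)
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % n] for j in range(n))
                        for i in range(n)])
    # sharpening with gamma >= 1, then renormalising
    w_sharp = w_tilde ** gamma
    return w_sharp / w_sharp.sum()
```

With the identity kernel (s(0) = 1) and g = 1, the weighting passes through unchanged; a kernel with s(1) = 1 rotates it by one location.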

5.7. Controller.
The NTM controller in our model is the long short-term memory network [36], which can be formulated as follows:

i_t = σ(W_i [h_{t-1}, x_t] + b_i),
f_t = σ(W_f [h_{t-1}, x_t] + b_f),
o_t = σ(W_o [h_{t-1}, x_t] + b_o),
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t).

i, f, o, c, h are the activation matrices of the input gate, the forget gate, the output gate, the memory cell, and the hidden state matrix, respectively. W and b are the weight matrix and the bias vector of the corresponding gate, respectively. ⊙ denotes the Hadamard product. σ and tanh denote the sigmoid and hyperbolic tangent functions, respectively. Let logits ∈ R^M be the output of the last neural network layer of the NSKT model; the student's mastery of knowledge concepts predicted by the model at timestamp t is

y_t = σ(logits),

where y_t ∈ R^M.
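A minimal NumPy sketch of one LSTM controller step and the sigmoid output layer; the fused weight layout (four row blocks in one matrix) is our choice for brevity, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W is (4*hidden, input+hidden), b is (4*hidden,);
    the four row blocks hold the input, forget, output, and cell weights."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])        # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])        # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])        # output gate
    c_hat = np.tanh(z[3 * hidden:4 * hidden])    # candidate cell
    c = f * c_prev + i * c_hat                   # Hadamard products
    h = o * np.tanh(c)
    return h, c

def predict_mastery(logits):
    """y_t = sigmoid(logits), the predicted mastery of all M skills."""
    return sigmoid(logits)
```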

Optimization.
The loss function of the model consists of two parts, namely, the answering interaction loss L_1 and the related knowledge concept information loss L_2. Let ℓ be the binary cross-entropy loss:

ℓ(p, a) = −[a log p + (1 − a) log(1 − p)].

We optimize the average cross-entropy loss of the student's interactions as follows:

L_1 = (1/|H|) Σ_t ℓ(y_t^T δ(q_{t+1}), a_{t+1}),

where δ(q_{t+1}) is the one-hot encoding of KC q_{t+1} at timestamp t + 1, |H| is the total number of the student's interactions, and T denotes the transpose operation. The average cross-entropy loss of the related knowledge concept information is

L_2 = (1/|H|) Σ_t Σ_{q_i ∈ S_{t+1}} ℓ(y_t^T δ(q_i), c_i),

where c_i is the correctness on skill q_i ∈ S_{t+1}. The loss for a single student is represented by L as follows:

L = λ L_1 + (1 − λ) L_2,

where the hyperparameter λ is the coefficient that determines the proportion of the answering information loss and the related information loss. We use an optimizer to optimize our model. Thus, the training objective of NSKT is to find the parameters Θ that minimize L:

Θ* = argmin_Θ L.
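The combined loss can be sketched as follows; the λ-weighted convex combination of L_1 and L_2, like the per-RKC averaging, is our reading of the text and should be treated as an assumption:

```python
import numpy as np

def bce(p, a, eps=1e-7):
    """Binary cross-entropy loss l(p, a), with clipping for stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(a * np.log(p) + (1.0 - a) * np.log(1.0 - p))

def nskt_loss(y, q_next, a_next, rkcs_next, c_next, lam=0.9):
    """Combined loss for one student.

    y[t] is the model output over all M skills after interaction t;
    q_next[t], a_next[t] are the next answered skill and its correctness;
    rkcs_next[t], c_next[t] are the RKCs of that question and their
    correctness. The lambda-weighted mix of L1 and L2 is an assumption.
    """
    l1, l2, n2 = 0.0, 0.0, 0
    for t in range(len(q_next)):
        l1 += bce(y[t][q_next[t]], a_next[t])          # answering loss L1
        for q_i, c_i in zip(rkcs_next[t], c_next[t]):  # RKC loss L2
            l2 += bce(y[t][q_i], c_i)
            n2 += 1
    l1 /= len(q_next)
    l2 /= max(n2, 1)
    return lam * l1 + (1.0 - lam) * l2
```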

Skill Awareness.
The student's past interactions in online tutoring systems are H'_t = (h'_0, ..., h'_t), where h'_t = (q_t, a_t, S_t, c_t) denotes the student's interaction tuple at timestamp t.
The set of knowledge concepts Set_q^t that the student has actually answered so far is

Set_q^t = {q_i | 0 ≤ i ≤ t},

and the set of knowledge concepts (skills) Set_S^t that NSKT has been aware of so far is

Set_S^t = Set_q^t ∪ {s | s ∈ S_i, 0 ≤ i ≤ t}.

As shown in Figure 4, when the student answers the next skill q_{t+1} at the next timestamp t + 1, even if the student has not answered questions related to skill q_{t+1} before, q_{t+1} ∉ Set_q^t, NSKT can still predict the student's mastery of skill q_{t+1} accurately as long as NSKT is aware of skill q_{t+1}, that is, q_{t+1} ∈ Set_S^t.
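The two sets can be tracked with plain Python sets (a minimal illustration; names are ours):

```python
def awareness_sets(interactions):
    """Track Set_q (skills actually answered) and Set_S (skills NSKT is
    aware of, i.e. answered skills plus their RKCs) over a student's
    history of (q_t, a_t, rkcs_t, c_t) tuples."""
    set_q, set_s = set(), set()
    for q_t, a_t, rkcs_t, c_t in interactions:
        set_q.add(q_t)
        set_s.add(q_t)
        set_s.update(rkcs_t)
    return set_q, set_s
```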

Experiments
In this section, we give a detailed explanation of the datasets and the experiments conducted to evaluate the performance of the NSKT model and other KT models on three real-world open-source knowledge-tracing datasets.

Datasets.
To evaluate the KT models' performance, we use three datasets collected from online learning platforms. These three datasets are widely used real-world datasets in KT. (i) ASSIST09 is collected from the ASSISTments online tutoring platform. (ii) ASSIST17 is provided by the 2017 ASSISTments data mining competition and is the latest ASSISTments dataset with the most student responses. (iii) EdNet (https://github.com/riiid/ednet) is the dataset of all student-system interactions collected over 2 years by Santa, a multiplatform AI tutoring service with more than 780 K users in Korea, available through Android, iOS, and the Web [37]. We conducted our experiments on EdNet-KT1, which consists of students' question-solving logs and is the record of Santa collected since April 2017, following the question-response sequence format.
The complete statistical information for the three datasets is shown in Table 5.
The details about the columns in the datasets are as follows. ASSISTments:
(iv) correct_answer: the correct answer to each question, recorded as a character between a and d inclusively.
(v) user_answer: the answer that the student submitted, recorded as a character between a and d inclusively.

Dataset Characteristics
(i) ASSIST09 and EdNet: for multiple-skill questions, the records of students' interactions are repeated with different skill taggings, and each record represents the student's response to one skill of the question [38]. (ii) ASSIST17: similar to the ASSIST09 dataset, each record in ASSIST17 represents the student's response to one skill of the question. However, we noticed a special feature of this dataset: a large number of users in the ASSIST17 dataset answered only one skill of multiple-skill questions and answered this skill one or more times. Multiple-skill questions in this situation accounted for 44.88% of the total number of questions answered by students. That is, a student may answer one or more skills of a multiple-skill question, and the response to a skill may be given once or multiple times.

Compared Models and Implementation Details.
To show the performance of our model and demonstrate the improvement of our model to existing KT models, we compare NSKT against the state-of-the-art KT models. We give the reference GitHub repositories of some KT models.
(i) BKT [5]: Bayesian knowledge tracing uses the hidden Markov model (HMM) to model the students' latent knowledge state as a set of binary variables. We use pyBKT (https://github.com/CAHLR/pyBKT) to implement BKT and set the model parameters seed = 42 and num_fits = 1.
For all models, we use the Adam optimizer with learning_rate = 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. The minibatch size and the maximum length of the sequence for all datasets are set to 32 and 100, respectively. We perform standard five-fold cross-validation to evaluate all the KT models in this paper. We conduct the experiments on a server with an 8-core 2.50 GHz Intel(R) Xeon(R) Platinum 8163 CPU and 64 GB of memory.

Models' Performance.
We use the area under the receiver operating characteristic curve (AUC) as the evaluation metric to compare prediction performance among the KT models mentioned in Section 6.3. A higher AUC indicates better performance. The test AUC results on the three real-world datasets for all KT models are shown in Table 6. From the experimental results, we can observe that NSKT achieves the best test AUC on ASSIST09, ASSIST17, and EdNet. This proves that NSKT is better at mining hidden information from complex educational data features to improve the accuracy of prediction. Figure 5 shows the training process of the KT models on the three KT datasets. It shows that the DKVMN and SAKT learn faster than the other KT models. The training speeds of the DKT-LSTM, DKT-NTM, DSKT, and NSKT are close, but the test AUC of NSKT is the best.
We threshold the probability P(q_t) predicted by the KT models for KC q_t and assume that the student will answer KC q_t correctly if P(q_t) ≥ 0.5 and incorrectly if P(q_t) < 0.5:

a'_t = 1 if P(q_t) ≥ 0.5, and a'_t = 0 if P(q_t) < 0.5.

If a'_t = a_t, the model predicts correctly. Thus, the accuracy of prediction for the KT models on the datasets is shown in Figure 6. Figure 7 shows the performance of DSKT and NSKT under different λ values and the value of λ at which the models achieve the best performance. From Figure 7, we can draw the following conclusion: the test AUC of DSKT and NSKT is not ideal with a small λ value; however, as the value of λ increases, the test results of DSKT and NSKT get better and better. Thus, we recommend λ ≥ 0.9.
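The thresholding rule and the resulting accuracy can be sketched as:

```python
def predicted_answer(p):
    """a'_t = 1 if P(q_t) >= 0.5, else a'_t = 0."""
    return 1 if p >= 0.5 else 0

def prediction_accuracy(probs, answers):
    """Fraction of interactions where the thresholded prediction a'_t
    matches the observed correctness a_t."""
    correct = sum(predicted_answer(p) == a for p, a in zip(probs, answers))
    return correct / len(answers)
```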

Friedman-Aligned Rank Test.
We perform the Friedman-aligned rank test [39] on the AUC test results of the KT models shown in Table 6, using the statistic

T = ((k − 1) [Σ_i R_i^2 − (k n^2 / 4)(k n + 1)^2]) / ([k n (k n + 1)(2 k n + 1)] / 6 − (1/k) Σ_j R̂_j^2),

where R_i is the sum of the aligned ranks of the i-th sample (model), R̂_j is the sum of the aligned ranks of the j-th group (dataset), k is the number of samples, and n is the number of groups. The probability distribution of T can be approximated by the chi-squared distribution with k − 1 degrees of freedom, χ²_{k−1}. We test the null hypothesis H_0: there is no significant difference in the performance of the KT models. The P value of the Friedman-aligned rank test on the test AUC results is below the significance level; we therefore reject the null hypothesis H_0, which indicates a significant difference in the performance of the KT models.
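A sketch of the aligned-rank statistic, assuming the common formulation of the test (align each group by its row mean, rank all k·n aligned values jointly); treat the exact formula as our reconstruction rather than the paper's:

```python
import numpy as np

def _aligned_ranks(values):
    """Average ranks (1-based) of a flat array, ties sharing their mean rank."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j + 2) / 2.0
        i = j + 1
    return ranks

def friedman_aligned_statistic(results):
    """Aligned-rank statistic for an (n groups x k samples) score matrix.

    The result is compared against a chi-squared distribution with
    k - 1 degrees of freedom to obtain the P value.
    """
    results = np.asarray(results, dtype=float)
    n, k = results.shape
    aligned = results - results.mean(axis=1, keepdims=True)
    ranks = _aligned_ranks(aligned.ravel()).reshape(n, k)
    r_model = ranks.sum(axis=0)    # rank total per model (sample)
    r_group = ranks.sum(axis=1)    # rank total per dataset (group)
    num = (k - 1) * (np.sum(r_model ** 2) - (k * n ** 2 / 4.0) * (k * n + 1) ** 2)
    den = k * n * (k * n + 1) * (2 * k * n + 1) / 6.0 - np.sum(r_group ** 2) / k
    return num / den
```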

Execution Time.
We compare the execution time of the KT models per 200 batches on each dataset, as shown in Figure 8. The BKT model requires the least execution time to train on the same amount of data. This is because BKT is not a deep learning knowledge-tracing model and needs to train fewer parameters. For the deep learning knowledge-tracing models, the execution times of the DKT-LSTM, DKVMN, and SAKT are close, and the execution times of the DKT-NTM and DSKT are close. The execution time of the DKT-NTM is longer than that of the DKT-LSTM. The reason may be that the NTM takes more time to access its external memory matrix. NSKT considers the conjunctive skills of the questions during the training process and needs to access the NTM's external memory matrix to enhance the memory ability of the model. Hence, NSKT has the longest execution time, but this is also why NSKT performs better in modelling the students' knowledge state.
The experimental results show that the NTM-based skill-aware knowledge-tracing model has a strong ability to capture the relevance among knowledge concepts, can enhance the model's skill awareness for conjunctive skills, and improves the accuracy of prediction in modelling the students' knowledge state. The experiments demonstrate that NSKT is effective.

Discussion
In this section, we discuss the details of the prediction process of the KT model in modelling the students' knowledge state, as well as the relevance of knowledge concepts and conditional influence between exercises.

Prediction Process.
In our opinion, an excellent KT model not only can predict the probability that students will answer questions correctly at the next timestamp accurately but also can perform well in modelling the students' holistic knowledge concept state.
Analyzing the prediction process of KT models can show the performance of NSKT. We randomly select a student sample U 1 from the ASSIST09 dataset, and the detailed process of DKT and NSKT modelling U 1 's knowledge state is shown in Figure 9.
It can be seen from Figure 9(a) that although DKT performs fairly well in prediction, DKT only focuses on the knowledge concept to be predicted at the next timestamp and does not care about U_1's mastery of other knowledge concepts. Therefore, after U_1 answers s32 correctly (s32, 1) at timestamp t_3, the model's predicted probability of s32 decreases rapidly, indicating that U_1's mastery of s32 is getting worse and worse, which is inconsistent with U_1's real knowledge state shown in Table 7. Because of the lack of related knowledge concept (RKC) information, DKT's prediction accuracy and prediction breadth are not ideal.
As shown in Figure 9(b), we use two heatmap subfigures to show the process of modelling U_1's knowledge state with NSKT. The x-axis of the lower subfigure is the sequence of U_1's interactions (q_t, a_t) and the y-axis is the skill index. The x-axis of the upper subfigure is the RKC S_t and the y-axis is the index of the RKCs S_t.
Because U_1 answers skill 32 (abbreviated as s32) correctly (s32, 1) in the first three timestamps t_1-t_3, the predicted probability of s32 gets higher and higher and the color of s32 on the y-axis of the lower subfigure gets brighter and brighter. As shown on the x-axis of the upper subfigure, s33 is the related knowledge concept of s32 in the first three timestamps t_1-t_3; thus, the predicted probability of s33 gets higher and higher and the color of s33 on the y-axis of the upper subfigure gets brighter and brighter too.
In the next three timestamps t_4-t_6, U_1 answers s33 correctly (s33, 1) in succession, so the predicted probability of s33 gets higher and higher and the color of s33 on the y-axis of the lower subfigure gets brighter and brighter. s32 is the related knowledge concept of s33, so the predicted probability of s32 continues to increase, and the color of s32 on the y-axis of the upper subfigure gets brighter and brighter too and remains at a relatively high value.
In the next three timestamps t_7-t_9, U_1 continues to answer s33 correctly (s33, 1); however, at these timestamps s33 is a single skill without related knowledge concepts, so only the predicted probability of s33 gets higher and higher and the color of s33 on the y-axis of the lower subfigure gets brighter and brighter.
At the last timestamp t_10, U_1 answers s37 correctly (s37, 1), so the predicted probability of s37 gets higher and higher and the color of s37 on the y-axis of the lower subfigure gets brighter and brighter. Because s55 is the related knowledge concept of s37, the predicted probability of s55 gets higher and higher and the color of s55 on the y-axis of the upper subfigure gets brighter and brighter too.
In contrast, we randomly select a student sample U_2 with a low answering accuracy, shown in Table 8. The process of DKT and NSKT modelling U_2's knowledge state is shown in Figure 10. It can be seen from Figure 10(a) that DKT models U_2's knowledge state almost accurately, but the prediction breadth is not enough.
As shown in Figure 10(b), NSKT, like DKT, models U_2's knowledge state accurately and performs better in prediction breadth. At timestamp t_4, U_2 answers s95 incorrectly many times; the predicted probability of s95 gets lower and lower and the color of s95 on the y-axis of the lower subfigure gets darker and darker. As shown on the x-axis of the upper subfigure, s2 is the related knowledge concept of s95; thus, the predicted probability of s2 gets lower and lower and the color of s2 on the y-axis of the upper subfigure gets darker and darker too.
It can be concluded from Figures 9 and 10 that NSKT performs better in prediction accuracy and prediction breadth and can better model the students' knowledge state. NSKT not only focuses on students' mastery of the knowledge concept to be predicted at the next timestamp but also focuses on the students' mastery of the related knowledge concepts. This is where NSKT is superior to other existing KT models, and NSKT performs better in modelling the students' knowledge state than DKT [4].

Pearson Correlation Coefficient.
In this paper, we use the Pearson correlation coefficient as the metric to measure the correlation among skills. By estimating the covariance and standard deviations of the sample, we obtain the sample Pearson coefficient r:

r = Σ_i (x_i − x̄)(y_i − ȳ) / (√(Σ_i (x_i − x̄)^2) √(Σ_i (y_i − ȳ)^2)). (27)

Figures 11 and 12 show the comparison of the skill Pearson correlations of U_1's interactions and U_2's interactions on DKT and NSKT, respectively. Figures 11(a) and 12(a) show the skill Pearson correlations on DKT, and Figures 11(b) and 12(b) show the skill Pearson correlations on NSKT. It can be seen from the figures that DKT can only mine the correlation among the skills that have been answered in the past, indicating that DKT cannot effectively discover the relevance among knowledge concepts. As shown in Figures 11(b) and 12(b), NSKT can discover the correlation among four skills, while DKT can only discover the correlation among three. For example, it can be seen from Figure 11(b) that the Pearson correlation between s32 and s55 on NSKT for U_1's interactions is r_(s32,s55) = 0.31, which means there is a weak positive correlation between s32 and s55.

Figure 9: Comparison of the prediction process of U_1 on DKT and NSKT. The color of the heatmap indicates the predicted probability of U_1's mastery of skills after the interaction (q_t, a_t) at timestamp t; the yellower the color, the higher the probability. (a) Heatmap for the prediction process of DKT. The x-axis is the sequence of U_1's interactions (q_t, a_t) and the y-axis is the skill index. (b) Heatmap for the prediction process of NSKT. The x-axis of the lower subfigure is the sequence of U_1's interactions (q_t, a_t) and the y-axis is the skill index. The x-axis of the upper subfigure is the RKC S_t and the y-axis is the skill index of the RKCs S_t.

Table 7:
Skill index | Skill name | Accuracy (%)
32 | Ordering positive decimals | 100
33 | Ordering fractions | 100
37 | Addition whole numbers | 100
55 | Absolute value | 100
The Pearson correlation between s33 and s55 on NSKT of U_1's interactions is r(s33, s55) = 0.92, which means there is a strong positive correlation between s33 and s55. Through the above examples, we can conclude that NSKT is better at discovering latent relevance among knowledge concepts than existing KT models.
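The sample Pearson coefficient in equation (27) can be sketched directly in NumPy. The trajectories below are hypothetical predicted-mastery sequences for two skills, not values from the paper:

```python
import numpy as np

def skill_pearson(traj_a, traj_b):
    """Sample Pearson correlation r between two skills' predicted-mastery
    trajectories over the same sequence of timestamps (equation (27))."""
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    # r = cov(a, b) / (std(a) * std(b)), using the sample estimators
    a_c, b_c = a - a.mean(), b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c**2).sum() * (b_c**2).sum()))

# Hypothetical predicted-mastery trajectories for two correlated skills
s32 = [0.40, 0.55, 0.62, 0.70, 0.81]
s33 = [0.35, 0.50, 0.60, 0.72, 0.80]
r = skill_pearson(s32, s33)  # close to 1.0: strongly positively correlated
```

A value near 1 indicates a strong positive correlation, near 0 no linear correlation, and near −1 a strong negative correlation, matching the interpretation of r(s32, s55) and r(s33, s55) above.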

Knowledge Concepts' Discovery.
NSKT can learn latent knowledge concept substructure among skills without expert annotations and can cluster related skills into a cluster, which denotes a knowledge concept (KC) class [6]. Figure 13 shows the visualization of clustering the skill representation vectors with k-means after projecting them into two dimensions with the t-SNE method [40,41]. All skills are clustered into eight clusters, and each cluster can represent a knowledge concept class. Skills in the same cluster are labeled with the same color and have strong relevance and similarity. For example, s32 and s33 lie very close together in Figure 13, confirming their strong relevance and similarity, which further shows that NSKT has a stronger ability to discover latent skill relevance than existing KT models.
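The clustering step described above can be sketched as follows. To keep the example self-contained we use a minimal NumPy k-means over a tiny set of hypothetical 2-D skill embeddings (in the paper the vectors are learned skill representations, first reduced by t-SNE, and k = 8):

```python
import numpy as np

def kmeans(vectors, k, n_iter=50, seed=0):
    """Minimal k-means over skill representation vectors; each resulting
    cluster is read as one latent knowledge-concept (KC) class."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each skill vector to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster goes empty
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical embeddings: two nearby skills (like s32/s33) and two others
emb = np.array([[0.90, 0.10], [0.88, 0.12], [0.10, 0.95], [0.12, 0.90]])
labels = kmeans(emb, k=2)  # the two nearby pairs land in separate clusters
```

Skills ending up in the same cluster are the ones the model regards as belonging to one KC class, mirroring the colored groups in Figure 13.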
We have explored the latent conditional influence between exercises by

J_{ij} = y(j|i) / Σ_k y(j|k),

where y(j|i) is the correctness probability assigned by NSKT to exercise j when exercise i is answered correctly at the first time step [6]. We show the latent conditional influence relationships among the exercises corresponding to the interactions in Figure 9(b), marked with arrow symbols in Figure 13. The line width indicates connection strength, and nodes may be connected in both directions. We only show edges whose influence exceeds the threshold 0.08. The attached ASSIST09 skill maps are shown in Figure 13 (only the 110 skills with skill names are shown).

Figure 10: Comparison of the prediction process of U_2 on DKT and NSKT. The color of the heatmap indicates the predicted probability of U_2's mastery of skills after interaction (q_t, a_t) at timestamp t; the yellower the color, the higher the probability. (a) Heatmap for the prediction process of DKT: the x-axis is the sequence of U_2's interactions (q_t, a_t) and the y-axis is the skill index. (b) Heatmap for the prediction process of NSKT: the x-axis of the lower subfigure is the sequence of U_2's interactions (q_t, a_t) and the y-axis is the skill index; the x-axis of the upper subfigure is the RKCs S_t and the y-axis is the skill index of the RKCs S_t.
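Given a matrix of conditional correctness probabilities y(j|i) produced by the model, the influence values and the thresholded edge set can be computed as in this sketch (the y_cond values here are hypothetical illustration data):

```python
import numpy as np

def influence_matrix(y_cond):
    """Conditional influence J_ij = y(j|i) / sum_k y(j|k), the influence
    measure from the DKT paper [6]; y_cond[i, j] holds y(j|i)."""
    y = np.asarray(y_cond, dtype=float)
    # normalise each column j over the conditioning exercise index
    return y / y.sum(axis=0, keepdims=True)

# Hypothetical y(j|i) values for three exercises
y_cond = np.array([
    [0.9, 0.8, 0.3],
    [0.7, 0.9, 0.2],
    [0.4, 0.3, 0.9],
])
J = influence_matrix(y_cond)
edges = np.argwhere(J > 0.08)  # keep only edges above the 0.08 threshold
```

Each retained (i, j) pair becomes a directed edge in the skill map, with J_ij as its line width; since J_ij and J_ji are computed independently, nodes may indeed be connected in both directions.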

Conclusion
In this work, we proposed a novel NTM-based skill-aware knowledge-tracing model for conjunctive skills, which can capture the relevance among the multiple knowledge concepts of questions to predict students' mastery of knowledge concepts (KCs) more accurately and to discover more latent relevance among knowledge concepts effectively. In order to better model the students' knowledge state, we adopt neural Turing machines, which use an external memory matrix to augment memory ability. Furthermore, NSKT relates knowledge concepts (KCs) and related knowledge concepts (RKCs) as a whole to enhance the model's skill awareness and to improve prediction accuracy and prediction breadth. Experiments on real-world KT datasets demonstrate that the NTM-based skill-aware knowledge-tracing model (NSKT) outperforms existing state-of-the-art KT models in modelling the students' knowledge state and discovering latent relevance among knowledge concepts.
In future studies, we will focus on mining hidden associations among knowledge concepts and building students' personalized answering paths in intelligent tutoring systems. Furthermore, we will construct a holistic structure of knowledge concepts to enhance students' understanding of how the pieces of knowledge affect each other.

Data Availability
The datasets used to support the findings of this study are included within the article and are also available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.