A BERT-Based Approach for Extracting Prerequisite Relations among Wikipedia Concepts

Concept prerequisite relation prediction is a common task in the field of knowledge discovery. Concept prerequisite relations can be used to rank learning resources and help learners plan their learning paths. As the largest Internet encyclopedia, Wikipedia is composed of many articles edited in multiple languages, and basic knowledge concepts in a variety of subjects can be found on it. Although there are many knowledge concepts in each field, the prerequisite relations between them are not clear: when we browse pages in an area on Wikipedia, we do not know which page to start with. In this paper, we propose a BERT-based Wikipedia concept prerequisite relation prediction model. First, we create two types of concept pair features: one is based on BERT sentence embeddings and the other is based on the attributes of Wikipedia articles. Then, we use these two types of concept pair features to predict the prerequisite relation between two concepts. Experimental results show that our proposed method performs better than state-of-the-art methods on English and Chinese datasets.


Introduction
In recent years, the emergence of online learning platforms and e-learning resources has injected new impetus into people's learning. Online learning models have gradually become more popular, and research related to this field has received considerable attention. As everyone has a different knowledge background, the challenge faced by online learners is usually how to choose learning resources and how to rank them. Typically, each learning resource explains one or more main knowledge concepts. Concepts in a field are usually learned progressively, from simple to complex and from abstract to concrete. Usually, the order of learning resources is determined by the relations between their main concepts. This kind of relationship between concepts is generally called a concept prerequisite relation. A prerequisite is a concept or requirement that must be mastered before one can proceed to the following one. A prerequisite relation is a natural dependency among concepts when people learn, organize, apply, and generate knowledge [1][2][3]. The learning order between concepts is determined by their prerequisite relations. For knowledge in a given field, a directed acyclic graph can illustrate its concept prerequisite relations. Each concept appears as a node, and the direction of an arrow represents the prerequisite relation between two concepts. For the concept pair (A, B) in the teaching field, if concept B is a prerequisite of concept A, then concept B should be learned before concept A. This can be written as A⟵B. As shown in Figure 1, neural network (A) relies on concepts such as gradient descent, partial differential, and differential equation. These concepts in turn rely on differential (B). In other words, differential equation must be learned before neural network.
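To make the directed acyclic graph view concrete, the following minimal sketch derives one valid learning order from the Figure 1 example. The edges are taken from the description above; using a topological sort to obtain the order is our illustrative choice and is not part of the proposed method.

```python
from graphlib import TopologicalSorter

# Map each concept to the set of concepts that must be learned first.
# The edges follow the Figure 1 example as described in the text.
prerequisites = {
    "neural network": {
        "gradient descent",
        "partial differential",
        "differential equation",
    },
    "gradient descent": {"differential"},
    "partial differential": {"differential"},
    "differential equation": {"differential"},
    "differential": set(),
}

# static_order() yields each concept only after all of its prerequisites.
learning_order = list(TopologicalSorter(prerequisites).static_order())
print(learning_order)
```

Any order produced this way starts with differential and ends with neural network; the three intermediate concepts may appear in any order because no prerequisite relation among them is stated.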
In a classroom course, the instructor explains each central concept to students according to the inherent order of the concepts. The instructor may also spend some time explaining related background concepts to help students understand the current ones. However, students may not receive such assistance from instructors in online courses. For example, when students learn Vue.js, they usually need to master HTML and CSS first; when they learn Java Spring Boot, they usually need to master Maven first. There is usually a prerequisite relation between two different learning resources. In addition, when people browse a Wikipedia article, they often open the pages of other articles to learn more about the background of the current article. There is usually a prerequisite relation between Wikipedia articles as well. Due to a lack of understanding of the prerequisite relations between different concepts, people may be unable to complete courses or understand the content of Wikipedia articles.
In this article, we propose a method for extracting concept prerequisite relations from Wikipedia using BERT. We use concepts from Wikipedia, and each concept has its own Wikipedia article. Compared with courses on online learning platforms, the main concepts of Wikipedia are easier to extract in an automated way. Furthermore, because Wikipedia has a unique knowledge structure, we can extract the characteristics of concept pairs and analyze the prerequisite relations between concepts more easily.
Our main contributions include the following:
(1) A novel metric to measure the prerequisite relations among Wikipedia concepts that is superior to the existing methods
(2) A Chinese dataset annotated with prerequisite relations between pairs of Wikipedia concepts
The structure of this article is as follows. Section 2 reviews past work on the task of concept prerequisite relation extraction with Wikipedia and MOOCs. The problem definition of concept prerequisite relations is given in Section 3. Section 4 elaborates on the methodology. Section 5 describes the datasets and preparation techniques, together with our experimental results and analysis. Section 6 presents the concluding remarks and future work.

Related Work
Concept prerequisite relations determine the order in which knowledge is learned and the order in which documents are read. Nowadays, concept prerequisite relation extraction can be used in different kinds of education-related tasks [4], including curriculum planning [5,6], learning resource recommendation [7,8], knowledge tracing [9], and so on. There is also a lot of research on concept prerequisite relation extraction itself. The area that researchers pay most attention to is the extraction of prerequisite relations between Wikipedia concepts. Talukdar and Cohen [10] utilized three types of features for concept pairs, including WikiHyperlinks, WikiEdits, and WikiPageContent, and then used a MaxEnt classifier to predict prerequisite relations among Wikipedia concepts. Liang et al. [1] studied the problem of measuring prerequisite relations among concepts and proposed the RefD metric to capture the relation. RefD stands for reference distance, and it uses the page links in Wikipedia to model the prerequisite relation by measuring how differently two concepts refer to each other. Zhou and Xiao [11] employed Wikipedia page links, categories, article content, and time attributes of Wikipedia articles to create features and then predict concept prerequisite relations. Sayyadiharikandeh et al. [12] used the clickstream of human navigation among articles on Wikipedia to infer concept prerequisite relations. In addition, many similar studies have used machine learning methods to predict prerequisite relations between Wikipedia concepts [13][14][15][16][17]. A common problem with these methods is that all of them require experts to manually design the features of the concept pairs.
Besides Wikipedia, some researchers have tried to extract concepts from various learning resources and analyze the prerequisite relations between them. Pan et al. [18] manually extracted the main knowledge concepts of a course from MOOC videos and used the sequence and frequency of appearance of the concepts as features to analyze the prerequisite relations between concept pairs. Wang et al. [13] extracted the main knowledge concepts from textbooks, linked these concepts with Wikipedia articles, and then identified the prerequisite relations between the concepts. Liang et al. [14] explored the content of the course introductions on university websites, investigated how concept prerequisite relations can be recovered from course dependencies, and proposed an optimization-based framework to address the problem. Furthermore, other similar studies use the dependency relationship between learning resources to predict the prerequisite relations between knowledge concepts [1,2].
As mentioned above, all methods based on traditional machine learning rely on manually designed features of concept pairs to predict prerequisites. This usually causes other factors that could be used to infer the prerequisite relation to be ignored.
Deep learning may outperform traditional machine learning in this regard, since deep learning methods can automatically extract features from raw data. Miaschi et al. [19] used Word2Vec to convert the two concepts into vectors and input the vectors into two LSTM networks, respectively, to obtain the features of the concept pair and predict its prerequisite relation. However, Word2Vec only treats a concept as a normal word. Compared with Word2Vec, BERT [20] can better explore the semantic meaning of a concept, and the contextualized vectors that BERT generates can also be used to infer concept prerequisite relations.
Figure 1: Example of a concept prerequisite relation (the differential equation concept is a prerequisite of the neural network concept).
In this paper, we use the BERT sentence embedding based on contextual embedding to automatically extract the features of concept pairs. Meanwhile, we also designed some features manually for concept pairs. Both classes of features were employed to infer concept prerequisite relations. Furthermore, we created a Chinese concept pair dataset that can be used to identify the prerequisite relations.

Problem Definition
The goal of the concept prerequisite relation identification task is to judge whether there is a dependency between two concepts. For a concept pair (A, B), there are four possible relations: (1) A is a prerequisite of B; (2) B is a prerequisite of A; (3) the two concepts are related, but they do not have any prerequisite relation between them; and (4) the two concepts are unrelated [10]. In previous studies, researchers usually converted this task into a binary classification problem, simply judging whether A is a prerequisite of B. Formally, Preq(B, A) = 1 means that A is a prerequisite of B, that is, people must master concept A before they can learn about concept B, while Preq(B, A) = 0 means that A is not a prerequisite of B. In this article, we also treat the concept prerequisite relation identification problem as a binary classification task.
Moreover, the concepts we use are Wikipedia concepts. Each concept has a corresponding Wikipedia article, and the concept is the title of that article.

Wikipedia Concept Prerequisite Relations Prediction Method
This section presents our proposed concept prerequisite relation prediction model (AFs + MFs). The structure of the model is illustrated in Figure 2. The input of the model is composed of two types of concept pair features: features extracted automatically (AFs) and features extracted manually (MFs). Specifically, we extract two BERT sentence embeddings and a set of Wikipedia-based features from each concept pair. First, the model inputs the AFs of the concept pair into two LSTMs, and the two output vectors of the LSTMs are concatenated with the MFs. Then, these features are input to a fully connected layer to accomplish concept prerequisite relation recognition.

Features Extracted Automatically.
As a language model pretrained on large corpora with a bidirectional transformer, BERT has significantly improved performance on several NLP tasks. In particular, Sentence-BERT [21] applies pooling to the token embeddings generated by BERT to produce fixed-size sentence embeddings, obtaining state-of-the-art performance in many fields, including text similarity and classification problems.
Wikipedia concept articles typically contain a number of sentences, each carrying deep semantic information. Hence, we use BERT to generate sentence embeddings as the features extracted automatically from a concept.
More specifically, we first take the first k words or Chinese characters of the Wikipedia concept article and apply the BERT tokenizer with a maximum sequence length of 500 to obtain the token representation. Then, we generate the BERT sentence embedding of the concept by feeding these tokens to the BERT model (vector size = 768). The two BERT sentence embeddings of the concept pair are used as inputs to the neural network and are passed to two 32-unit LSTMs. The LSTMs can capture feature information that is not covered by the manually designed features and achieve deeper concept feature extraction.
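The truncation step above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `first_k_units`, the `language` codes, and the choice to count punctuation as characters for Chinese text are assumptions made for the example.

```python
def first_k_units(text: str, k: int, language: str) -> str:
    """Return the first k words (English) or first k characters (Chinese).

    Hypothetical helper: whitespace-separated tokens count as words for
    English, and every non-whitespace character counts for Chinese.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if language == "zh":
        # Chinese text is not whitespace-delimited, so count characters.
        chars = [ch for ch in text if not ch.isspace()]
        return "".join(chars[:k])
    # Default: treat the text as whitespace-delimited words.
    return " ".join(text.split()[:k])


english_prefix = first_k_units("A neural network is a network of neurons", 3, "en")
chinese_prefix = first_k_units("神经 网络模型", 3, "zh")
print(english_prefix)
print(chinese_prefix)
```

The truncated text would then be passed to the BERT tokenizer and encoder; that step depends on the encoding service and is not reproduced here.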

Features Extracted Manually.
As a multilingual open knowledge base, Wikipedia is characterized by multiuser collaborative editing, dynamic updating, and broad coverage. Wikipedia concepts are described through articles with corresponding titles, and the articles contain links, categories, and redirects (synonyms) in their content. Researchers can use this information to extract features of concept pairs. By manually extracting the structural features of concept pairs from Wikipedia articles, we can analyze the prerequisite relation between the two concepts. Therefore, we extract three types of concept pair features from Wikipedia article information: text features, link features, and category features. These features are as follows:
Note that features #1-#7 and #9-#10 are taken from [14], feature #5 is taken from [19], and features #8 and #11-14 are taken from [16]. Previously, these features had only been validated on English datasets; in this article, we evaluate them on both the Chinese and English datasets.

AFs + MFs: Concept Prerequisite Relations Prediction Model.
Based on the above design and analysis, for a concept pair (A, B), the model (Figure 2) separates concept prerequisite relation prediction into the following steps:
(1) The first k words or Chinese characters of the Wikipedia articles of the concept pair (A, B) are obtained, and the sentences S_A and S_B are generated.
(2) Then, each sentence is divided into individual words or Chinese characters, which are tokenized separately, and BERT is used to encode them to generate the sentence embeddings V_A and V_B.

Datasets and Implementation Details.
For our research, we used a public dataset, AL-CPL, which is an English dataset designed by Chen et al. [9]. The dataset consists of two-category concept pair sets and prerequisite relation labels from four different fields: data mining, geometry, physics, and precalculus. Each data item is formalized as a triple (A, B, Label), which contains the concepts A and B and their prerequisite relation label. Each concept in the dataset has a corresponding article in Wikipedia. The left half of Table 1 shows detailed information about the AL-CPL dataset.
In addition, we also want to verify whether the proposed method performs well in other languages. Starting from the AL-CPL English dataset, this paper creates the CH-AL-CPL Chinese dataset. First, the English Wikipedia article corresponding to each concept in the AL-CPL dataset is found, and then the corresponding Chinese article is located through the cross-language links in Wikipedia.
However, Chinese Wikipedia has only a small fraction of the articles available on English Wikipedia. Thus, the collection of Chinese concept pairs obtained by directly using cross-language links is not only small but also suffers from significant class imbalance. For this reason, this paper uses the transitivity and asymmetry of concept prerequisite relations to expand the Chinese dataset.
(1) Transitivity. If concept B is a prerequisite of concept A and concept C is a prerequisite of concept B, then concept C is a prerequisite of concept A.
(2) Asymmetry. If concept B is a prerequisite of concept A, then concept A cannot be a prerequisite of concept B.
By combining transitivity and asymmetry, we can increase the number of labeled concept pairs in the dataset and balance the ratio of the two categories. The right half of Table 1 shows the details of the concept pairs in each domain of CH-AL-CPL.
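The two properties can be turned into a simple expansion procedure, sketched below under stated assumptions. Here a pair (A, B) means that B is a prerequisite of A, transitivity is applied until no new positive pair appears, and each reversed positive pair is emitted as a negative example. The function name `expand_pairs` and this exact procedure are illustrative and may differ from the steps used to build CH-AL-CPL.

```python
def expand_pairs(prereq_pairs):
    """Expand positive pairs by transitivity and derive negatives by asymmetry.

    prereq_pairs: iterable of (A, B) tuples meaning "B is a prerequisite of A".
    Returns (positives, negatives) as two sets of tuples.
    """
    positives = set(prereq_pairs)

    # Transitivity: (A, B) and (B, C) imply (A, C). Repeat until stable.
    changed = True
    while changed:
        changed = False
        for a, b in list(positives):
            for b2, c in list(positives):
                if b == b2 and a != c and (a, c) not in positives:
                    positives.add((a, c))
                    changed = True

    # Asymmetry: if B is a prerequisite of A, then A is not one of B.
    negatives = {(b, a) for a, b in positives} - positives
    return positives, negatives


pos, neg = expand_pairs([("A", "B"), ("B", "C")])
print(sorted(pos))
print(sorted(neg))
```

In this toy input, two annotated pairs yield three positive and three negative pairs, which illustrates how the expansion both enlarges the dataset and adds examples of the other class.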
In the experiment, all models were implemented with Keras. Using the bert-as-service (https://bert-as-service.readthedocs.io/) sentence encoding service, we generated a 768-dimensional sentence embedding for the first k words or Chinese characters of each Wikipedia concept article. The sentences were tokenized with NLTK [22]. To train the model, the following parameters are set: 50 training epochs, a learning rate of 0.01, a hidden layer of 32 dimensions, and a dropout rate of 0.2. Adam optimization is used to train the model, and L2 regularization is used to prevent overfitting.
In the AFs model, the two 768-dimensional sentence embeddings of the Wikipedia concept pair (A, B) are input to two 32-unit LSTMs, and a fully connected layer receives the output of the LSTMs to identify prerequisite relations. In the MFs model, the #1-16 manual features of the Wikipedia concept pair (A, B) are combined and sent to a fully connected layer for prerequisite relation prediction. Finally, the AFs + MFs model concatenates the LSTM outputs of the AFs model with the manual features of the MFs model to complete prerequisite relation recognition.
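The fusion step of the AFs + MFs model can be illustrated with a small, framework-free sketch. It only shows the concatenation of the two LSTM outputs with the manual features followed by a single output unit; the sigmoid activation, the single-unit layer, and the name `fuse_and_score` are assumptions made for illustration, and the LSTMs themselves are not modeled here.

```python
import math


def fuse_and_score(h_a, h_b, manual_features, weights, bias):
    """Concatenate both LSTM outputs with the manual features and score them.

    h_a, h_b: output vectors of the two LSTMs (32 values each in the paper).
    manual_features: manually extracted features of the concept pair.
    weights, bias: parameters of a single dense unit (assumed, not from the paper).
    Returns a value in (0, 1) read as the probability of a prerequisite relation.
    """
    fused = list(h_a) + list(h_b) + list(manual_features)
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy call: all-zero inputs give a logit of 0 and therefore a score of 0.5.
n_manual = 14  # illustrative length only
score = fuse_and_score([0.0] * 32, [0.0] * 32, [0.0] * n_manual,
                       [0.1] * (64 + n_manual), 0.0)
print(score)
```

The number of manual features is left as a parameter of the call, so the same sketch applies to any feature set.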

Experimental Result and Analysis.
In our experiments, we compared our method with the following typical concept prerequisite relation prediction baselines:
(1) Reference Distance (RefD) [1]. The basic idea of this method is that each concept can be represented by its collection of related concepts in the concept space. If most of the related concepts of concept A refer to concept B while the related concepts of concept B rarely refer to concept A, then concept A may depend on concept B. The authors construct related links for each related concept and consider both the EQUAL and the TF-IDF weighting methods to identify the prerequisite relation between two concepts; we selected the best-performing TF-IDF weighting method.
(2) Machine Learning (AT) [14]. This method uses link-based and text-based features extracted from Wikipedia pages and then trains four classifiers, Naive Bayes (NB), logistic regression (LR), support vector machine (SVM), and random forest (RF), on the AL-CPL dataset to predict the prerequisite relations between concept pairs. We directly use the reported results of the best-performing random forest (RF) classifier as the basis for comparison.
(3) Neural Network (RS) [16]. This neural network-based method was proposed by the UNIGE_SE team for the PRELEARN shared task of EVALITA 2020. The authors propose eight categories of features based on the content and structure of Wikipedia and use Italian datasets and deep learning models to analyze the prerequisite relations between concepts. For comparison, we recalculate these feature values on the Chinese and English datasets and use the authors' method to predict the prerequisite relations of English and Chinese concept pairs.
(4) AFs and MFs. Besides the above baselines, to verify the effectiveness of the automatic features and the manual features of the proposed method, this paper also predicts concept prerequisite relations with each of the two types of features separately. Specifically, we use a fully connected layer to receive the automatic or the manual features as input, thereby performing prerequisite relation prediction with each type individually.
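The intuition behind the RefD baseline can be illustrated with a small sketch. It uses equal weights only and makes two simplifying assumptions: the related concepts of a page are taken to be the pages it links to, and a related concept "refers to" a target if it links to it. The TF-IDF weighting used in our comparison is not shown, and the toy link graph is invented for the example.

```python
def refd_equal(a, b, out_links):
    """Equal-weight reference distance between concepts a and b.

    out_links: dict mapping each concept to the set of concepts it links to.
    A positive value suggests that a depends on b (b is a prerequisite of a).
    """
    def reference_fraction(source, target):
        # Share of the concepts related to `source` that link to `target`.
        related = out_links.get(source, set())
        if not related:
            return 0.0
        hits = sum(1 for c in related if target in out_links.get(c, set()))
        return hits / len(related)

    return reference_fraction(a, b) - reference_fraction(b, a)


# Invented toy link graph, for illustration only.
toy_links = {
    "neural network": {"gradient descent", "backpropagation"},
    "gradient descent": {"differential"},
    "backpropagation": {"differential"},
    "differential": {"function"},
    "function": set(),
}
forward = refd_equal("neural network", "differential", toy_links)
backward = refd_equal("differential", "neural network", toy_links)
print(forward, backward)
```

In the toy graph, every concept related to neural network links to differential, while nothing related to differential links back, so the score is positive in one direction and negative in the other.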
On the AL-CPL and CH-AL-CPL datasets, we conducted a concept prerequisite relation prediction experiment. The performance of the model is evaluated using 5-fold cross-validation. To compare against the baselines, we use the widely used performance metrics Precision (P), Recall (R), and F1 score (F1). Tables 2 and 3 report the results of the different baselines under these metrics for AL-CPL and CH-AL-CPL, respectively. As shown in Tables 2 and 3, our method significantly outperforms all the baselines in all the metrics on the English and Chinese datasets (except for AT's Precision metric).

Table 1: Concept pair statistics for each domain of AL-CPL (left) and CH-AL-CPL (right).

Domain        AL-CPL                              CH-AL-CPL
              Pairs   Prereq.   Non-prereq.       Pairs   Prereq.   Non-prereq.
Data mining     826      292        534            1151      493        658
Geometry       1681      524       1157            3330     1825       1505
Physics        1962      487       1475            2958     1091       1867
Precalculus    2060      699       1361            3200     1431       1769
All            6529     2002       4527           10639     4840       5799
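For reference, the evaluation protocol can be sketched in plain Python. The metric definitions are the standard ones for the positive (prerequisite) class; the contiguous, unshuffled fold split and the function names are assumptions made for illustration and do not reproduce the experimental code.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, Recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
folds = list(k_fold_indices(10, k=5))
print(metrics)
print(folds[0])
```

In practice the concept pairs would be shuffled before splitting, and the three metrics would be averaged over the five test folds.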
From Table 2, we can see that our method achieves the best performance against all baselines in all domains, except for the Precision metric in the geometry and physics domains. The F1 score of AFs + MFs leads AT by about 3.6%, 3.7%, 5.9%, and 3% in the four domains, respectively. In geometry and physics, AT achieves the best Precision, probably because these two fields have rich text and link features.
Based on Table 3, we observe that our method outperforms all baselines in all metrics and achieves the best result. Table 3 reports the evaluation metrics for the four domains. The CH-AL-CPL dataset, which is expanded by transitivity and asymmetry, has the largest number of prerequisite pairs, and the performance obtained on it is generally better than that on AL-CPL. In addition, since the authors of the AT method [14] did not release their code, some features cannot be calculated for Chinese, so we do not report the results of the AT method on CH-AL-CPL.
As an excellent NLP language model, BERT is built on a bidirectional transformer encoder. With this design, the feature encoding of the words in a sentence is greatly improved. Compared with previous models, such as Word2Vec, the trained BERT model has a deeper contextual understanding. Such context-based semantic features are suitable for capturing the text features of Wikipedia concepts. Moreover, for a concept pair, we cannot ignore the rich link and category relationships between the two concepts. Overall, through the combination of the two types of features, the performance of the concept prerequisite relation model can be further improved.

Ablation Study.
In order to demonstrate that the length of the Wikipedia article influences the automatic extraction of features, we conducted an ablation experiment by varying the value of k (the first k words or characters), from 100 to 500. e experiment results are shown in Figure 3.
As shown in Figure 3, increasing k increases the F1 score of the AFs model. In particular, when k = 400, the model is the most effective in predicting the four domains. After k exceeds 400, however, the textual information of the concept becomes mixed with other background knowledge, which affects the performance of the model to a certain extent. On the CH-AL-CPL dataset, geometry and precalculus have the best F1 scores when k = 500. This may be due to the relatively long average length of the articles in these domains.
Additionally, the experiment explored the role of manual features in the MFs model. Three types of features are designed: content-based (#1-#7), category-based (#8), and link-based (#9-#14). In the experiment, one feature type was removed each time, and the result was compared with the full MFs model. The results are shown in Figure 4, which illustrates how the prediction performance decreases to varying degrees after a specific feature group is removed. After removing the link-based features on the AL-CPL dataset, the decline in the four domains reached 10.5%, 12.7%, 9.1%, and 7.4%, respectively, indicating that the link relations between concepts play a crucial role in the prediction of prerequisite relations.
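The feature-group ablation can be sketched as follows. The group boundaries follow the numbering given above; the helper name `drop_group` and the representation of a concept pair as a flat list of 14 values are assumptions made for the example.

```python
# Feature numbers per group, following the numbering in the text.
FEATURE_GROUPS = {
    "content": range(1, 8),    # features #1-#7
    "category": range(8, 9),   # feature #8
    "link": range(9, 15),      # features #9-#14
}


def drop_group(feature_vector, group):
    """Return a copy of the feature vector without the given feature group.

    feature_vector: list whose i-th entry (1-based) holds feature #i.
    group: one of the keys of FEATURE_GROUPS.
    """
    removed = set(FEATURE_GROUPS[group])
    return [value for number, value in enumerate(feature_vector, start=1)
            if number not in removed]


# Toy vector in which feature #i simply has the value i.
toy_vector = list(range(1, 15))
without_links = drop_group(toy_vector, "link")
print(without_links)
```

Each ablated variant would then be trained and evaluated in the same way as the full MFs model, and the drop in F1 score attributed to the removed group.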
On the CH-AL-CPL dataset, in contrast, removing the content-based features causes declines of 7.0%, 11.3%, 8.8%, and 6.1% in the four domains. Since Chinese Wikipedia articles contain fewer words than English ones, the loss of text information has a more significant impact on the text features. Moreover, removing the category-based features has the most negligible

Conclusion and Future Work
In this paper, we propose a novel concept prerequisite relation prediction method called AFs + MFs, which combines the BERT sentence embedding of the concept article (AFs) with Wikipedia-based features (MFs). Furthermore, we designed a Chinese prerequisite relation dataset to verify the effectiveness of the method. The experimental results show that our method achieves state-of-the-art results on four domains. In addition, we have conducted effectiveness studies on AFs and MFs separately.
In the future, we plan to identify the prerequisite relations of non-Wikipedia concepts. Moreover, some learning resources, such as MOOCs and e-lectures, contain multiple concepts. A further research question is how to recommend learning resources by considering concept prerequisite relations.

Mathematical Problems in Engineering
Data Availability
The CSV data used to support the results of this study have been stored in the GitHub repository (https://github.com/lycyhrc/CH-AL-CPL). These data are the concept prerequisite relations dataset of the Chinese version of AL-CPL. There are no restrictions on access to these data.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.