A Deep-Learning Framework for Analysing Students' Reviews in Higher Education

As part of continuous process improvements to teaching and learning, the management of tertiary institutions requests students to review modules towards the end of each semester. These reviews capture students' perceptions about various aspects of their learning experience. Considering the large volume of textual feedback, it is not feasible to manually analyze all the comments, hence the need for automated approaches. This study presents a framework for analyzing students' qualitative reviews. The framework consists of four distinct components: aspect-term extraction, aspect-category identification, sentiment polarity determination, and grades' prediction. We evaluated the framework with a dataset from the Lilongwe University of Agriculture and Natural Resources (LUANAR), using a sample of 1,111 reviews. A micro-average F1-score of 0.67 was achieved using a Bi-LSTM-CRF model and the BIO tagging scheme for aspect-term extraction. Twelve aspect categories were then defined for the education domain, and four RNN variants (GRU, LSTM, Bi-LSTM, and Bi-GRU) were compared. A Bi-GRU model was developed for sentiment polarity determination and achieved a weighted F1-score of 0.96 for sentiment analysis. Finally, a Bi-LSTM-ANN model combining textual and numerical features was implemented to predict students' grades based on the reviews. A weighted F1-score of 0.59 was obtained, and out of 29 students with an "F" grade, 20 were correctly identified by the model.


Introduction
Tertiary institutions play a significant role in equipping human resources with the specialized skills necessary for improving the socio-economic growth of a country. Raising the bar in teaching and learning is paramount to producing quality graduates capable of transforming a nation's economic status. Achieving this requires investing in quality assurance activities for the different aspects of teaching and learning. Factors that affect the quality of knowledge acquisition by students in tertiary institutions include students' learning styles, instructors' teaching methodology and preparedness, learning environments, and teaching and learning resources. For instance, different students prefer different learning styles to understand what is being taught, and this may result in significant differences in learning outcomes [1,2]. Against this background, it is important for instructors to align their delivery methods with students' preferred learning styles to effectively transfer knowledge.
As part of quality assurance activities, most tertiary institutions collect students' module feedback data at the end of each semester. The feedback contains students' ratings of their instructors on several attributes such as preparedness, teaching skills, availability, and behavior. Teaching and learning resources such as laboratory equipment and classroom ventilation are also assessed. The management of tertiary institutions leverages insights drawn from analyzing this feedback data to make informed decisions. Preventative and corrective measures are then taken to address the different challenges encountered by students during the semester. This not only improves the quality of teaching and learning but also increases learner satisfaction.
Manually analyzing thousands of students' qualitative comments is inefficient for the following reasons. Firstly, it may lead to subjective and inconsistent interpretations, reducing the reliability and repeatability of the results. Secondly, this approach is prone to human errors, arising mostly from fatigue when carrying out repetitive tasks manually. Thirdly, it reduces staff productivity, considering that a significant amount of effort and time needs to be allocated to this task; given the right tools, the human resources allocated to this task could be freed and efficiently utilized for other crucial activities of the organization. Fourthly, with the manual approach, it is difficult to mine hidden insights from the massive volume of students' feedback. Keeping track of recurring themes and patterns from students' past feedback is a daunting task, and valuable insights are left untapped. Furthermore, it is challenging to predict the academic performance of students prior to their writing final examinations.
Tertiary institutions need innovative and cost-effective approaches to analyze the reviews effectively and efficiently. Recently, there has been a surge in research in the fields of natural language processing (NLP) and machine learning, and these techniques can be leveraged to mine insights from the reviews. However, to ensure consistency, reliability, credibility, and accuracy of the analysis, it is essential to formulate a roadmap for the analysis. Therefore, this study proposes a novel framework for analyzing students' module feedback. The framework leverages existing NLP and deep learning techniques.
The aim of this research is to create a framework for analyzing students' module feedback. It is expected that the outcome of the reviews, based on the proposed framework, will assist the management of tertiary institutions to efficiently and effectively mine valuable insights from students' qualitative module reviews. With the insights extracted from the analysis of the reviews, university management can make informed decisions on improving teaching and learning processes. In this paper, three models were used for aspect-term extraction, namely, LSTM-CRF, GRU-CRF, and Bi-LSTM-CRF. For the task of aspect-category identification, four RNN variants (LSTM, GRU, Bi-LSTM, and Bi-GRU) were built and compared. Based on the aspect terms, the sentiment polarity of the students' reviews was determined. At the institution under study in this paper, students' feedback is collected prior to the examination. Therefore, a grade prediction model was also developed to identify which students are likely to fail based on their review of the module. In this respect, course leaders can develop initiatives to further support these students.
The rest of this paper is structured as follows. In Section 2, existing works in this domain are presented; in Section 3, the different steps in the developed framework for analyzing students' reviews are described. The findings of this study are discussed in Section 4, and Section 5 concludes the study, outlining possibilities for future work.

Literature Review
In this section, we describe aspect-based sentiment analysis and recent works in analyzing students' reviews through NLP and deep learning techniques.

Sentiment Analysis.
Sentiment analysis is defined as the process of mining people's opinions from text [3]. The opinions can be categorized as positive, negative, or neutral. Sentiment analysis can be performed at three levels of granularity: document, sentence, or aspect level [4]. At the document level, sentiment analysis is performed on the entire text document, which is classified into a positive, negative, or neutral class [5]. Document-level sentiment analysis is too generic since it does not provide polarity scores for the individual reviews contained in the document; it is a shallow analysis that just provides an overview. Sentence-level sentiment analysis is a deeper analysis performed at sentence granularity: sentiment polarity scores are calculated for each individual sentence in the document. This analysis assumes that there is a single opinion in the sentence, which is not usually the case in the real world. For example, a student might say "the lecture was great, but the classroom size was too small to accommodate all the students [6]." This sentence contains two different opinions: "great" directed at "lecture" and "too small" expressed about "classroom size." Sentiments for these two opinions need to be captured separately, hence the need for a much finer-grained analysis.
Aspect-based sentiment analysis (ABSA) is a subdiscipline of NLP. It pays special attention to sentiment targets, known as aspects, in a sentence [7]. It is more granular than document- and sentence-level sentiment analysis and hence more complex to implement [8]. The ABSA task is to identify opinions expressed toward specific entities or attributes [9]. ABSA consists of three main tasks [10,11], namely, (1) identification of aspects from the text; (2) extraction of the linguistic expressions used to refer to the aspects so as to identify entities and attributes; and (3) identification of the sentiment polarity or opinion of each aspect. ABSA is often used to identify the sentiment polarity or opinion in customer reviews for e-commerce websites and tweets [11] but has also been previously used in education-related fields for various purposes: to review the performance of teaching staff [12] and to review students' social media posts [13].
There are different approaches to sentiment analysis, namely, lexicon, "traditional" machine learning, and deep learning approaches. Lexicon approaches are dictionary-based: the dictionary contains a collection of sentiment terms and their associated polarity values [14]. With the lexicon approach, words are assigned sentiment polarity values based on the dictionary mappings between words and sentiment polarity scores. However, if a word is not found in the dictionary, its sentiment polarity score cannot be determined. Since the dictionary mapping does not capture the context of the sentiment terms, lexicon approaches do not scale well with words that have contrasting sentiment polarity scores in different domains or contexts. Popular lexicon dictionaries are AFINN, VADER, SentiWordNet, TextBlob, and OpinionFinder [4].
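The dictionary-lookup mechanism described above can be illustrated with a minimal sketch. The mini-lexicon and its polarity values below are invented for demonstration; real systems use dictionaries such as AFINN or SentiWordNet with thousands of entries.

```python
# Invented mini-lexicon for demonstration only.
LEXICON = {"great": 1.0, "helpful": 0.8, "clear": 0.6,
           "small": -0.4, "boring": -0.8, "confusing": -0.9}

def lexicon_score(review):
    """Sum the polarity of every known word; unknown words contribute 0,
    which is exactly the coverage blind spot of lexicon approaches."""
    return sum(LEXICON.get(w, 0.0) for w in review.lower().split())

print(lexicon_score("The lecture was great but confusing"))  # 1.0 - 0.9
```

Note how a review containing only out-of-dictionary words scores zero regardless of its actual sentiment, illustrating the coverage limitation discussed above.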
Machine learning approaches treat sentiment analysis as a classification problem [14]. Recent trends in sentiment analysis leverage the capabilities of deep learning, because deep learning models are capable of identifying complex relationships and patterns in datasets. Sentiment analysis models constructed using deep learning require less feature engineering compared to traditional machine learning approaches. Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are usually applied to sequential tasks such as sentiment analysis, named entity recognition, speech processing, DNA sequence analysis, and machine translation [6]. The RNN architecture allows a neural cell to learn from previous time-step outputs [9,15,16]. This capability makes it possible for RNNs to learn from the past to improve future performance. A major disadvantage of RNNs is that they suffer from the problem of vanishing gradients, especially for long sentences [17]. LSTMs and gated recurrent units (GRUs) were specifically designed to deal with this problem: their architectures contain gates that control which information to retain in or remove from memory. Another challenge with RNNs is that they are limited to learning from previous time steps only. To address this problem, bidirectional RNNs were introduced. Bidirectional RNNs contain two layers of RNNs that process inputs in opposite directions; outputs from the two RNNs are concatenated and forwarded to the next layer [6]. This ensures that the neural network learns from both the forward and backward passes. Bi-LSTMs and Bi-GRUs are variants of bidirectional RNNs that operate in a similar manner, with the added advantage of not having the vanishing gradient problem faced by plain RNNs.

Aspect-Based Sentiment Analysis of Students' Reviews.
As described in the previous section, ABSA pays special attention to sentiment targets, known as aspects, in a sentence [7,8]. It identifies opinions expressed towards specific entities or attributes [9] through three main tasks: aspect identification, extraction of the linguistic expressions used to refer to the aspects, and identification of the sentiment polarity of each aspect [10,11]. Beyond customer reviews for e-commerce websites and tweets [11], ABSA has also been applied to students' reviews, as discussed below.
Chauhan et al. [13] performed document-level ABSA to review students' posts on social media. Nouns, noun phrases, and an ontology containing concepts of higher education were used for aspect extraction, and a Naive Bayes classifier was used for the sentiment analysis. Shaikh and Doudpotta [18] used machine learning and a rule-based approach to develop a sentence-level ABSA system. SentiWordNet was used for opinion mining, and the corpus of the Pakistani language was manually labelled by the authors. Multinomial Naive Bayes classification was used for entity extraction.
Sindhu et al. [19] used LSTMs to perform sentence-level ABSA to detect all aspects and their sentiment polarity. A custom word embedding was developed. 5,000 surveys from Sukkur IBA University and the SemEval 2014 corpus for a non-educational domain were used. The F1-score was 93% for sentiment orientation detection and 91% for aspect extraction.
Nikolić et al. [7] implemented ABSA on reviews within the higher education domain. They developed a system for ABSA on reviews written in the Serbian language using NLP techniques and machine learning models at the sentence-segment level. The authors did not use any deep learning model due to a lack of data. They used two classification methods: the first based on an SVM model and the second on dictionary matching.
Kastrati et al. [20] first used conventional machine learning techniques to classify the polarity of a review based on aspects, as well as a 1D-CNN model using various word embeddings. The authors then developed LSTM and CNN deep learning models for aspect-category sentiment classification [21]. They used FastText, GloVe, Word2Vec, and domain-specific MOOC word embeddings for the experimentation. The systematic literature review carried out by [22] surveys existing work on sentiment analysis of students' feedback. The authors classified the various papers into three categories: (1) papers dealing with comments where the aspects focus on the teacher entity (teacher's knowledge, pedagogy, and behavior); (2) papers discussing aspects of courses, teachers, and institutions; and (3) papers dealing with capturing the opinions and attitudes of students toward institution entities.

Methods and Results
In this section, the methodology used to analyze students' reviews, determine sentiment polarity, and predict grades is explained, as shown in Figure 1.

Data Acquisition.
The first stage of data acquisition was domain understanding. The objectives set at this stage were twofold: firstly, defining the aspect categories found in students' reviews and, secondly, labelling the training and testing datasets. These objectives were achieved by working collaboratively with education domain experts. With the realization that expert knowledge might be biased, the task of labelling the dataset was performed independently by two education domain experts. Inconsistencies observed between the two were resolved through consensus. At the end of this stage, a list of aspect categories was produced. An aspect category was defined based on the frequency of related comments in the reviews. With assistance from the education domain experts, the categories were labelled. Table 1 lists selected phrases associated with each aspect category.
The students' module reviews dataset is from the Lilongwe University of Agriculture and Natural Resources, a public university in Malawi. The dataset consists of records obtained over a period of four consecutive academic years. In each academic year, students' module feedback was captured at the end of every semester, prior to the writing of examinations. The dataset contains both quantitative and qualitative data. The university has 5 faculties and an average enrolment of 4,000 students each academic year.

Data Preparation and Preprocessing.
Once the dataset was acquired, duplicates, non-English characters, and misspellings were removed. Stop words and punctuation marks were also removed. The aspect terms in each individual review were manually labelled using the BIO tagging scheme. It is to be noted that "course" and "module" are used interchangeably in this paper. To ensure efficient processing of the dataset by machine learning algorithms, the following standard preprocessing steps were carried out: converting all text to lowercase, spelling correction, padding short sentences, tokenization, and vectorization. All the reviews were converted to vectors, which were then used to build the grade prediction model and for sentiment analysis. The review vectors were also used for aspect-term extraction and aspect-category determination. The resulting aspect terms were used for sentiment analysis.

Aspect-Term Extraction and Modelling.
Aspect terms were labelled with either a "B-tag" or an "I-tag." If an aspect term consisted of two or more words, the first word was assigned a "B-tag," whereas all the remaining words in that compound aspect term were given an "I-tag." Aspect terms comprising a single word were labelled with a "B-tag." If a word was not part of an aspect term, it was given an "O-tag," as shown in Table 2. The "B-I-O" tagging annotation is summarised in Table 3. The SemEval datasets act as a de facto standard for ABSA, and their guidelines were followed during the data annotation process. The aspect terms, aspect categories, and sentiment polarity classes were labelled. Aspect terms are attributes of entities upon which sentiment expressions are directed in a given review sentence [23]. For example, consider a review sentence R: "the number of students in our class is very large." In this case, "number of students" constitutes the aspect term. Usually, aspect terms are nouns or noun phrases. The training and testing datasets for the aspect-extraction model were labelled using BIO tagging, since aspect-term extraction is considered a sequence labelling task [10]. Table 4 shows the number of reviews used as training and test data and the numbers of B-tags, I-tags, and O-tags. The goal of the aspect-term model was to extract aspect terms from students' qualitative reviews: given a sequence of review words, the task was to extract the aspect terms found in the sequence using the BIO tagging scheme. Figure 2 shows the Bi-LSTM-CRF architecture implemented to extract aspect terms from the students' reviews. POS vectors and word vectors were fed as inputs via the word embedding layer into the Bi-LSTM neural network. Outputs from the Bi-LSTM layer were then passed on to a final CRF layer.
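The BIO labelling described above can be sketched as a simple token-matching routine. The function below and the example sentence (the review R above) are illustrative only; in this study the annotation was performed manually by domain experts.

```python
def bio_tags(tokens, aspect_terms):
    """Assign B (beginning), I (inside), or O (outside) tags to tokens,
    given the aspect terms annotated in the sentence."""
    tags = ["O"] * len(tokens)
    for term in aspect_terms:
        span = term.split()
        # Find every occurrence of the aspect-term span in the sentence.
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                tags[i] = "B"                      # first word of the term
                for j in range(i + 1, i + len(span)):
                    tags[j] = "I"                  # remaining words
    return tags

tokens = "the number of students in our class is very large".split()
print(bio_tags(tokens, ["number of students"]))
# ['O', 'B', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O']
```

A single-word aspect term such as "lectures" would receive only a "B-tag", matching the labelling rules stated above.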
The CRF layer concatenated the outputs from the individual time steps of the Bi-LSTM neural network and, using constraints it learned from the dataset, guided the final label predictions. This ensured that the entire sentence context was considered when predicting the final BIO-tagged labels. Based on previous works, three models were implemented, namely, Bi-LSTM-CRF, LSTM-CRF, and GRU-CRF neural networks. Table 5 shows the results of experimenting with the three models. The Bi-LSTM-CRF model achieved the highest micro- and macro-F1-scores of 0.67 and 0.66, respectively. The micro-F1-score of 0.67 recorded by the Bi-LSTM-CRF model implies that an overall accuracy of 67% can be reached when predicting "B" and "I" aspect terms from the dataset using the model. Furthermore, a consistent pattern of higher F1-scores for the "B" tag compared to the "I" tag can be observed for all three models in Table 5. This trend could be explained by the difference in the sample-size representation of "I" and "B" tags in both the training and testing sets: our dataset had more "B" tags than "I" tags, considering that most of the reviews contained a single-word aspect term. From Table 5, it can also be seen that an F1-score of 0.74 for the "B" tag was recorded using the Bi-LSTM-CRF classifier.
On the other hand, an F1-score of 0.53 for the "I" tag was achieved using the Bi-LSTM-CRF classifier. One possible explanation for the higher performance achieved by the Bi-LSTM-CRF classifier compared to the rest of the models is its internal architecture, which enables it to learn from both the forward and backward passes, hence improving its performance.
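To make the distinction between the reported micro- and macro-F1-scores concrete, the sketch below computes both over the "B" and "I" tags at token level. The tag sequences are invented, and the evaluation is deliberately simplified (token-level matching rather than the span-level evaluation a full ABSA benchmark would use).

```python
def f1(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(gold, pred, labels=("B", "I")):
    counts = {lab: [0, 0, 0] for lab in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        if g == p:
            if p in counts:
                counts[p][0] += 1                # true positive
        else:
            if p in counts:
                counts[p][1] += 1                # false positive
            if g in counts:
                counts[g][2] += 1                # false negative
    per_label = {lab: f1(*counts[lab]) for lab in labels}
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)                          # pools all tag decisions
    macro = sum(per_label.values()) / len(per_label)  # plain average of per-tag F1
    return micro, macro, per_label

gold = ["B", "I", "O", "B", "O", "B", "I"]
pred = ["B", "O", "O", "B", "O", "B", "I"]      # one "I" tag missed
micro, macro, per_label = micro_macro_f1(gold, pred)
```

Because the rarer "I" tag is the one that gets missed, its per-label F1 drags the macro score below the micro score, mirroring the B-versus-I gap observed in the results above.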

Aspect-Category Identification Model.
Several studies in the literature have tackled the problem of aspect-category detection. Akhtar et al. [24] approached this task as a supervised multilabel classification problem. Mamatha et al. [9] identified seed words or aspect terms associated with each category and leveraged them when predicting the categories; using a conditional random field classifier together with grammatical relationship dependencies, they were able to predict an aspect category for each review. Considering that different aspect categories contain different topics, Movahedi et al. [25] leveraged deep learning's attention mechanism to detect aspect categories. Based on previous works in this field, four RNN variants were constructed and compared: LSTM, Bi-LSTM, GRU, and Bi-GRU models were trained on the labelled dataset and tested. Among the reasons for experimenting with RNN variants was to utilize the capabilities of RNNs for processing sequential inputs; RNNs have proven to perform well on sequential tasks such as speech recognition, named entity recognition, and sentiment analysis. After comparing the performance of the different models, the model with the highest performance was adopted and used for the rest of this study.
Since the aspect-category training and testing sets were uneven, with some categories having fewer instances than others, the weighted F1-score was the more appropriate metric for evaluating model performance because it takes class size into account. From Table 6, it can be seen that three models (GRU, Bi-GRU, and LSTM) achieved similar weighted F1-scores of 0.78, whereas Bi-LSTM achieved a slightly lower weighted F1-score of 0.77. Theoretically, it was expected that either the Bi-LSTM or the Bi-GRU model would achieve the highest performance, considering that they process inputs in both the forward and backward directions, which enables them to update training weights based on concatenated outputs from the two opposite processing layers. A possible explanation for this deviation from the expected theoretical performance is that our labelled dataset was too small for the benefits of bidirectional RNNs to be fully realized.
Since the GRU, Bi-GRU, and LSTM models achieved similar weighted F1-scores, either of these three models can be used, depending on the resource constraints and dataset size at hand. In environments with limited computational resources and smaller datasets, GRU neural networks are recommended because their architecture is computationally optimized to efficiently utilize resources during processing. For a large dataset, Bi-GRU is a better option. The 0.78 weighted average F1-score recorded by the GRU, Bi-GRU, and LSTM models implies that the models can be expected to accurately predict 78% of aspect categories from the dataset.
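The weighted F1-score used throughout this evaluation can be computed as below. The per-category scores and support counts are invented for illustration and do not reproduce the values in Table 6.

```python
def weighted_f1(per_class_f1, support):
    """Average the per-class F1-scores, each weighted by its share of
    true instances, so larger classes dominate the overall score."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] / total for c in per_class_f1)

# Invented per-category F1-scores and supports for demonstration.
scores = {"course content": 0.82, "assessment": 0.75, "general": 0.60}
support = {"course content": 50, "assessment": 30, "general": 20}
print(round(weighted_f1(scores, support), 3))  # 0.755
```

With equal supports the weighted F1 reduces to the plain (macro) average; the weighting only matters when, as here, the categories are imbalanced.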
It can also be seen from Table 6 that the GRU, Bi-GRU, and LSTM models perform consistently well in predicting individual aspect categories, as evidenced by F1-scores of at least 0.70 in more than seven categories out of twelve. On the other hand, there are variations in performance scores for the different aspect categories. This could be attributed to ambiguity caused by certain reviews belonging to multiple aspect categories.
In comparison with state-of-the-art models for the task of aspect-category detection in the education domain, the study conducted by [19] achieved the highest F1-score of 0.85. Among the reasons for the high score was the use of education-domain word embeddings; theoretically, using domain embeddings enhances model performance. Despite the limitation of a small labelled dataset, considering that deep learning models require massive training sets, the models developed in this work performed relatively well with respect to state-of-the-art models in predicting the aspect categories.
The key to achieving this performance was the use of aspect terms as seed words that aided the neural networks in predicting the corresponding aspect categories for the students' reviews.

Sentiment Polarity Model.
The goal of this model was to predict sentiment polarity scores for reviews with respect to the aspect terms found in each review. Given a review sentence and an aspect term as inputs, the model outputs either a negative or a positive sentiment polarity score in the set {0, 1}. In this study, the authors focused on predicting positive and negative sentiments for the reviews, leaving out neutral ones. This was because most of the neutral comments were suggestions on how to improve various aspects of teaching and learning; such neutral comments were treated as implicit negative sentiments, considering that such suggestions implied that the present state had problems requiring resolution. A Bi-GRU neural network was chosen as the model for two reasons. Firstly, its bidirectional processing capability allows the neural network to adjust training weights based on updated information from both the forward and backward passes, hence improving performance. Secondly, the Bi-GRU architecture consists of only two gates, which is fewer than the Bi-LSTM's three gates. These gates control what information to store in or discard from the neural network [26]. This makes it computationally less resource-intensive during training compared to Bi-LSTM neural networks.
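The two-gate mechanism mentioned above can be sketched as a single NumPy GRU step. The weights are randomly initialized, biases are omitted for brevity, and the code is a didactic simplification rather than the trained Bi-GRU used in this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W):
    """One GRU time step: the update gate z and reset gate r decide how much
    of the previous hidden state h to keep and how much to overwrite."""
    xh = np.concatenate([x, h])
    z = sigmoid(W["z"] @ xh)                                # update gate
    r = sigmoid(W["r"] @ xh)                                # reset gate
    h_cand = np.tanh(W["h"] @ np.concatenate([x, r * h]))   # candidate state
    return (1 - z) * h + z * h_cand                         # gated interpolation

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                  # toy dimensions
W = {k: rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for k in ("z", "r", "h")}

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):              # a toy 5-token sequence
    h = gru_step(x, h, W)
print(h.shape)                                    # (3,)
```

A bidirectional layer would run a second set of such cells over the reversed sequence and concatenate the two final states; an LSTM cell adds a third gate and a separate cell state, which is the extra cost the Bi-GRU avoids.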
To utilize aspect-term information in a review when predicting sentiment polarity, both sentence and aspect embeddings were used during model training. To the best of our knowledge, no study conducted so far in the education domain for sentiment analysis has experimented with the use of a Bi-GRU with aspect embeddings in constructing a sentiment polarity classifier. Since there were no open-source academic-domain embeddings available in the research community, Stanford's GloVe 840B was used to create word vectors capturing word semantics. The choice of GloVe 840B was based on empirical evidence presented by [27], whose findings revealed that the GloVe 840B and fastText embeddings achieved statistically superior performance compared with the rest of the embeddings.
The word vectors of the sentences and the aspect terms' vectors were fed as inputs via the sentence embedding and aspect embedding layers, respectively. Sentence embeddings consist of semantic word representations for all word tokens in the reviews, whereas aspect-term embeddings consist of semantic word representations for all aspect terms. Output from the two embedding layers was forwarded to the Bi-GRU layers, which processed the input sequences in both the forward and backward directions and then concatenated the results from the two directions.
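One simple way to combine sentence and aspect-term embeddings before the recurrent layers is sketched below. The toy vocabulary, the 5-dimensional random embeddings, and the mean-pooling of the aspect vector onto every token position are illustrative assumptions, not the exact wiring of the model used in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"class": 0, "size": 1, "too": 2, "small": 3}
E = rng.normal(size=(len(vocab), 5))                 # toy 5-d embedding matrix

def embed(tokens):
    """Look up the embedding vector of each token."""
    return np.array([E[vocab[t]] for t in tokens])

sent_emb = embed(["class", "size", "too", "small"])  # shape (4, 5)
aspect_emb = embed(["class", "size"]).mean(axis=0)   # pooled aspect vector, (5,)

# Attach the pooled aspect vector to every token position so the
# recurrent layer sees, at each step, both the token and its aspect.
combined = np.concatenate([sent_emb, np.tile(aspect_emb, (4, 1))], axis=1)
print(combined.shape)                                # (4, 10)
```

The recurrent layer then consumes the (tokens × features) matrix, so each time step carries both token semantics and the aspect being judged.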
Considering that the dataset was partially imbalanced, with more negative reviews than positive ones in both the training and testing sets, the weighted F1-score was the more appropriate metric for evaluating model performance, since it takes class weights into account. As shown in Table 7, an overall F1-score of 0.96 was obtained as the average of 4 runs. Among the possible reasons for the high performance recorded in this study is the processing capability of the Bi-GRU for sequential tasks through its bidirectional architecture, which allowed the neural network to learn from both the forward and backward passes. Aspect embedding information also played a significant role in determining the sentiment polarity scores.
Thus, the Bi-GRU neural network performed considerably well in predicting sentiment polarity scores. Table 8 gives an overview of the performance of the model compared to other models that have been implemented for sentiment polarity prediction within the education domain. Furthermore, out of 160 negative sentiments in the dataset, the proposed Bi-GRU model correctly predicted 158 as negative, and similarly, out of 31 positive sentiments, it correctly predicted 26 as positive.

Grades' Prediction Model.
In this section, we investigate whether module reviews can be leveraged to predict students' module grades. Predicting students' grades prior to their writing examinations is important because it acts as an early warning sign, identifying students who are at risk of failing certain modules. After identifying the students at risk of failing, course management can put preventative interventions in place to reduce failure rates. Normally, the reviews are conducted towards the end of each semester, before students sit the end-of-semester examinations.
To predict students' grades while considering both the students' review feedback and their academic performance history, a multi-input Bi-LSTM-ANN neural network that takes both textual and numerical features as inputs is proposed. For the text features, the model takes review sentence vectors via an embedding layer into its Bi-LSTM neural network. The numerical features considered are continuous assessment grade, sentiment polarity score, CGPA, course credit hours, course code, first-semester GPA, and semester. The numerical features were obtained by applying the heatmap variable-selection technique to the dataset with respect to the target variable, that is, the course final grade. Numerical variables that achieved a correlation coefficient of at least 0.1 with the target variable were considered as candidate features for the model, as shown in Figure 3.
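The correlation-based screening described above can be sketched as follows. The feature values and the "course_code" column are invented toy data; only the 0.1 threshold is taken from the study.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two numeric sequences."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def select_features(features, target, threshold=0.1):
    """Keep features whose absolute correlation with the target reaches
    the threshold, mirroring the heatmap-based screening."""
    return [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]

# Invented toy data: CGPA tracks the final mark, the course code does not.
final_mark = [70, 65, 80, 55, 60, 75]
features = {
    "cgpa": [3.1, 2.9, 3.6, 2.4, 2.7, 3.4],
    "course_code": [101, 102, 103, 104, 105, 106],
}
print(select_features(features, final_mark))  # ['cgpa']
```

In practice the correlations would be rendered as a heatmap for visual inspection, but the selection rule is exactly this thresholding.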
The text processing results from the Bi-LSTM neural network were concatenated with the numerical inputs from the ANN layer and passed to a dense neural network. The dense neural network extracted patterns from the combination of text and numerical features before forwarding the results to a final dense layer containing the softmax activation function. The softmax function generated probability distribution values for the five grade classes {A, B, C, D, F}.
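The behaviour of the final softmax layer can be illustrated in isolation. The logit values below are invented; only the grade ordering {A, B, C, D, F} is taken from the text.

```python
import numpy as np

GRADES = ["A", "B", "C", "D", "F"]

def softmax(logits):
    """Numerically stable softmax: shift by the max, exponentiate, normalize."""
    z = np.asarray(logits, float)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [0.3, 1.2, 0.8, -0.1, 2.0]   # invented final-layer activations
probs = softmax(logits)
predicted = GRADES[int(np.argmax(probs))]
print(predicted)                      # "F" has the largest logit here
```

Because the outputs form a probability distribution over the five classes, the model can both name the most likely grade and expose how confident it is, which matters for an early-warning use case.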
To evaluate the performance of the Bi-LSTM-ANN, it was compared with a baseline ANN model constructed using the above numerical features only. Figure 4 shows the proposed architecture for the model. The experiment was repeated three times and the results were averaged. It can be seen in Table 9 that the proposed model achieved a slightly higher weighted F1-score of 0.59 compared to the baseline F1-score of 0.58. This difference could be attributed to the impact of the text features in our proposed model, considering that the baseline model did not incorporate text features, whereas our proposed model had both text and numerical features. Since the difference between the two models is very small, the contribution of the students' reviews to grade prediction appears minimal. From Table 9, the proposed model achieved a score of 0.82 in predicting "F" grades.

Discussion
The aspect-term extraction component automates the mining of explicit and implicit aspect terms from students' reviews. Aspect terms, being the specific attributes upon which students express their views, require mechanisms for efficiently mining them. After experimenting with LSTM-CRF, GRU-CRF, and Bi-LSTM-CRF models using the BIO tagging scheme, the Bi-LSTM-CRF classifier achieved the highest micro- and macro-F1-scores of 0.67 and 0.66, respectively. Thus, the Bi-LSTM-CRF classifier can accurately predict 67% of aspect terms in students' reviews, and in comparison with state-of-the-art models for aspect-term extraction it performed competitively. A study by [23] on aspect-term extraction using Bi-LSTM and CRF models reported an F-measure of 44.49% on a Hindi dataset, and Zschornack Rodrigues Saraiva et al. [29] experimented with aspect-term extraction on restaurant and laptop datasets.

Once the aspect terms were extracted, the aim was to categorize students' reviews into predefined education-domain aspect categories. With the help of education domain experts, twelve aspect categories that were frequent in the dataset were defined. These categories include assessment, course content, course delivery approach, course practical hours, course lecture hours, course tutorial hours, experiential learning, teaching and learning environment, teaching and learning resources, and general. After defining the aspect categories, four RNN variants were developed: LSTM, Bi-LSTM, GRU, and Bi-GRU. The classifiers' performance was within the same range, but GRU scored slightly higher at 0.78. This study therefore recommends GRU classifiers for categorizing students' reviews, especially for smaller datasets.
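To illustrate the BIO scheme used for aspect-term labelling, the sketch below decodes a tagged token sequence back into its aspect terms. The example sentence and tags are invented for illustration, not drawn from the dataset:

```python
def decode_bio(tokens, tags):
    """Recover aspect terms from a BIO-tagged token sequence.
    B = beginning of an aspect term, I = inside, O = outside."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # start a new aspect term
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the current term
            current.append(token)
        else:                          # O ends any open term
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = ["The", "lecture", "notes", "were", "clear", "but",
          "tutorial", "hours", "were", "few"]
tags   = ["O", "B", "I", "O", "O", "O", "B", "I", "O", "O"]
print(decode_bio(tokens, tags))  # ['lecture notes', 'tutorial hours']
```

In the study, the Bi-LSTM-CRF model predicts the B/I/O tag for each token; this decoding step then turns those tag sequences into the extracted aspect terms.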
The third part of this study focused on mining sentiment polarity from the students' reviews. The goal was to classify reviews into positive or negative sentiment classes. Positive reviews included feedback where students expressed satisfaction with the aspects being reviewed, whereas negative reviews were those where students were not pleased and suggested improvements. Using a manually labelled dataset, experiments were performed with a Bi-GRU model and an LSTM baseline suggested by [19]. The proposed Bi-GRU model achieved an overall weighted F1-score of 0.96 compared to the baseline's 0.94. This good performance can be attributed to the bidirectional processing capabilities of Bi-GRUs. Under the assumption that the twelve aspect categories are mutually exclusive, so that a review can belong to only one aspect category, the sentiment polarity score assigned to a review also applies to the aspect category to which that review belongs.
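The weighted F1-score reported above averages per-class F1 by class support. A minimal sketch of that computation, using invented per-class counts rather than the study's actual confusion matrix:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_f1(per_class):
    """per_class: list of (support, precision, recall) tuples.
    Each class's F1 is weighted by its share of the examples."""
    total = sum(s for s, _, _ in per_class)
    return sum(s * f1(p, r) for s, p, r in per_class) / total

# Hypothetical (support, precision, recall) for positive / negative classes:
classes = [(80, 0.97, 0.98), (20, 0.92, 0.88)]
print(round(weighted_f1(classes), 2))  # 0.96
```

Support weighting matters here because sentiment datasets are often imbalanced: a weighted average prevents a small minority class from being either ignored or overweighted.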
The final component of this study is the grades prediction component, which combined students' text reviews and numerical features to predict a student's grade from the set {A, B, C, D, F}. The Bi-LSTM-ANN neural network was developed and evaluated against an ANN baseline; it was fed with both text reviews and numerical features. The results showed an overall weighted F1-score of 0.59 for the Bi-LSTM-ANN and 0.57 for the baseline model. It can be inferred from these results that the text feedback contributes only slightly to predicting a student's academic performance. This claim is further supported by the heatmap variable selection performed on the entire set of dataset variables, where the sentiment polarity score showed a correlation of 0.11 with the target variable, the course final grade. A relationship between students' module feedback and final grades therefore exists, but as the heatmap reveals, it is weak. Since the purpose of the grades prediction model is to identify students at risk of failing certain modules, students predicted to obtain "F" grades were treated as those requiring interventions to perform better in the final examinations. The proposed model achieved an F1-score of 0.82 in predicting students likely to score "F" grades. From the confusion matrix, out of 29 true "F" grades, the model correctly predicted 20. This shows that the proposed model is quite promising for identifying students at risk of failing.
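The at-risk detection rate quoted above is the recall for the "F" class, computed directly from the reported confusion-matrix counts:

```python
def recall(true_positives, actual_positives):
    """Fraction of the actual positives the model catches."""
    return true_positives / actual_positives

# From the reported confusion matrix: 29 true "F" grades, 20 predicted correctly
f_recall = recall(20, 29)
print(round(f_recall, 2))  # 0.69
```

So roughly 69% of genuinely failing students would be flagged for intervention, which is the figure that matters most for the early-warning use case, independent of the model's overall weighted F1.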

Limitations
The main limitation of this study was the lack of an available open-source students' reviews dataset. In the future, the authors look forward to evaluating the models on different datasets. For the task of aspect-category identification, it was assumed that the aspect categories were mutually exclusive. In future studies, the authors plan to explore ways of handling reviews belonging to multiple categories, considering that in practice some reviews belong to more than one category.

Conclusion
The aim of this work was to use deep learning models for aspect-based sentiment analysis of students' reviews about the modules they have studied, together with a grade prediction model to identify students at risk of failing. The accuracy of the models has been evaluated and compared with existing works. A list of aspect categories has been defined with the help of domain experts. The framework can be easily replicated for any higher education institution that collects student reviews in English. The deep learning models used performed reasonably well compared to the state of the art. The work proposed in this paper can help address the problems educational institutions face in processing reviews obtained from students. In this competitive era, institutions need to stand out by providing the best student experience and quality education. The ability to automatically process student feedback will allow institutions to become more proactive and to address issues promptly. With regard to grades prediction, it is in the interest of both students and institutions to keep the level of failure to a minimum. Undoubtedly, deep learning technologies can be leveraged to help educational institutions in this challenging task.

Data Availability
The data and materials used to support the study are available from the corresponding author upon request.

Disclosure
The funder was not involved in the study design, data collection, or manuscript writing.

Conflicts of Interest
The authors declare that they have no conflicts of interest.