Machine Learning Approach for Answer Detection in Discussion Forums: An Application of Big Data Analytics

Nowadays, data are flooding into online web forums, and it is highly desirable to turn this gigantic amount of data into actionable knowledge. Online web forums have become an integral part of the web and are a main source of knowledge. People use these platforms to post their questions and get answers from other forum members. Usually, an initial post (question) receives more than one reply post (answer), which makes it difficult for a user to scan all of them for the most relevant, highest-quality answer. Thus, how to automatically extract the most relevant answer for a question within a thread is an important issue. In this research, we treat the task of answer extraction as a classification problem. A reply post can be classified as relevant, partially relevant, or irrelevant to the initial post. To find the relevancy/similarity of a reply to the question, both lexical and nonlexical features are used. We propose to use LinearSVC, a variant of support vector machine (SVM), for answer classification. Two feature selection techniques, chi-square and univariate, are employed to reduce the feature space size. The experimental results show that the LinearSVC classifier outperformed the other state-of-the-art classifiers in classification accuracy on both the Ubuntu and TripAdvisor (NYC) discussion forum datasets.


Introduction
A web forum is an online discussion board where like-minded people gather and discuss issues on specific topics. Web forums have become an integral part of the web due to their constant growth. There is a good chance of landing on forum pages while searching for a question/topic. Forum users share information on different topics. A discussion starts when one user asks a question and other users/members answer it; this forms a forum thread in which a question gets more than one answer [1].
A question post in a forum thread usually gets answers of different qualities. Quality means the extent to which a reply post addresses the question. Each user answers the question according to their own understanding and knowledge, so a reply may be relevant, partially relevant, or irrelevant. This makes it difficult for the question poster to identify the most relevant answer [1]. It is very tedious and laborious to go through all the reply posts and then identify the relevant answer. So, the main objective of this research is to automatically extract/identify the most relevant answers/replies for a posted question within a thread.
We consider the thread's initial post as question and all other replies as candidate answers of different qualities. To keep the process simple, we ignore all the questions in the reply posts and topic drift in the thread.
There are two types of features, lexical and nonlexical. Both of them are used to find reply relevancy and similarity with the given question [1][2][3][4][5]. Some nonlexical features are not always available [6], cannot be calculated easily, and also make the model forum dependent; e.g., if forum metadata are used for training the model, then the model becomes dependent on those specific features, and hence it cannot be easily adapted to other forums. Therefore, in this study, we mostly exploited lexical, content-based, and semantic features to make the model forum independent so that it can be easily adapted to other forums.
Like other research works [7][8][9], we also consider answer extraction as a text classification problem. Replies are classified into three classes: high-quality, low-quality, and nonquality, depending on their relevancy with the question post. For answer detection/classification, we used support vector machines (SVMs), a group of algorithms used for classification, regression, and outlier detection. Two variants of SVM, LinearSVC and SVC, are used here. LinearSVC outperformed the other classifiers and gave the highest accuracy of 76.3%.
For lexical similarity features, like cosine similarity, we use the bag-of-words (BoW) approach to convert text into vectors [10]. Since not all features/words are equally important, redundant ones are devalued by using TfidfVectorizer. In this study, we used unigram, bigram, and trigram word sequences.
Mining for best reply posts within a thread has many applications. Question/answer forums such as Yahoo! Answers can suggest answers extracted from forum thread to their users.
It can also be used to generate question-answer pairs which can be further filtered to frequently asked questions (FAQ). Contributions of our work are summarized as follows: The remainder of the paper is organized as follows. Section 2 is about related work. Section 3 explains our proposed framework. Section 4 describes the experimental settings, results, and discussion. Finally, Section 5 concludes and presents the future work.

Related Work
Predicting answer quality in online web forums is a text classification problem [7-9, 11, 12]. Different approaches and methodologies have been used for this task. Bag-of-words (BoW) approach is a commonly used approach [1]. In this approach, the text is represented by its words and each word is considered as a feature. Frequency of each feature is recorded and a vector is created, which is further used to find the similarity with other vectors. Usually, BoW is used with bigram and trigram to get more information.
This approach was augmented with a co-occurrence feature from Wikipedia and was used to classify news articles into one of twenty groups [13]. The authors in [14] integrated the BoW approach with forum metadata, a simple question-mark rule, and question words to extract questions from web forums. A multimodal deep belief net was used in [8] to check answer quality. This model solved the issue of nonlinear correlation between lexical and nonlexical features. A framework based on a convolutional neural network was developed in [11] to classify massive open online course (MOOC) forum threads. Others used a character-level ConvNet for text classification [15].
To classify text in web forums as question or nonquestion, a sequential model [2] was proposed, which is based on patterns extracted from questions and nonquestions. The model then used a graph-based approach for answer extraction in the same thread.
Another approach called cascaded framework was introduced in [16] for <thread-title, reply> pair extraction from web forums to enrich chatbot knowledge. In the first step, replies were extracted which were logically relevant to the thread title. Then, the extracted pairs were ranked and the top N were selected.
Both types of algorithms, traditional such as Naïve Bayes and deep learning like convolutional neural networks and multimodal deep net, have been used for extracting quality contents from the web forums [1,8,11,15,17].
Text classification task is based on the quality of contents. Quality means to what extent it is relevant and addressing the query. So, for classification, it is necessary to measure the quality of contents using different features [18]. Reply post in a forum thread is classified as high-quality, low-quality, and nonquality based on their relevancy with the question post.
Primarily, there are two types of features, lexical and nonlexical, used for answer extraction within a thread. These are categorized in different ways: the authors in [1] identified six feature groups and further divided them into 28 subfeatures. The authors in [6] described five types of features (lexical, content-based, structural, forum specific, and reply-to) and further divided them into 17 subfeatures.
In some forums, lexical similarity cannot be used very effectively because answers have very minimal overlap with the questions and nonanswers also show the same behaviour [6]. In such cases, nonlexical features are more reliable than lexical ones. In some cases, researchers proposed a framework relying entirely on nonlexical features for judging the quality of documents [19]. Some researchers showed that combining lexical n-grams with nonlexical features gave good results [7]. The authors in [11] used user interactive behaviour features to classify massive open online course (MOOC) threads using a convolutional neural network, as models based on such features are language and content independent. The authors in [16] used structural and content-based features to develop their framework for <title, reply> pair extraction to enrich chatbot knowledge. The authors in [5] used nonlexical thread features to classify web forum threads into subjective and nonsubjective.
Thus, different research studies used various combinations of features to enhance the model performance. One such study identified 12 features while the other identified 6 best features [1,6].
In a nutshell, not all features are important; some do not contribute while others negatively affect the model performance, so in order to get optimal subfeature list, the authors in [1,20] eliminated nonvaluable and redundant features. Moreover, forum noise also adversely affects the model performance [20]. On the other hand, normalizing the forum noise will enhance the model performance. Hence, how to select best features list is nontrivial due to different nature of forum data.
There are different selection techniques to reduce feature space size. Mainly, these are grouped into filter, wrapper, and embedded methods. Document frequency thresholding (DF), chi-squared (CHI), information gain (IG), and Acc and Acc2 are the most commonly used feature selection techniques [21]. The authors in [17] used univariate and clustering feature techniques to improve the Naïve Bayes performance for the text classification task. The authors in [21] introduced two new feature selection metrics for text classification, Relevance Frequency Feature Selection (RFFS) and Alternative Accuracy2 (AAcc2), and reported that the new metrics produced promising results comparable to those of the frequently used metrics. Other researchers used information gain (IG), chi-square, and gain ratio (GR) to get the top 12 features. To gauge the significance of each feature, permutation and ablation tests are also performed [6].
In contrast, our proposed study used LinearSVC to classify reply posts within a forum thread. For feature space size reduction, univariate and chi-square selection techniques are used to select the optimal subfeature list. The next section describes our proposed methodology.

Proposed Methodology
Our proposed model is summarized in Figure 1. It is divided into four phases: in the first phase, data are preprocessed to eliminate errors and noise. In the second phase, lexical and nonlexical features for the question and reply posts are calculated to find their similarities.
Thirdly, features are filtered using different selection techniques. In the final phase, a linear-kernel SVM implementation called LinearSVC is used to classify the replies as high-quality, low-quality, and nonquality. These steps are explained below.

Preprocessing.
Converting raw data into a predictable and analyzable format is data preprocessing. The following steps are taken to preprocess the data:
(a) Converting all words to lowercase
(b) Lemmatizing words using WordNetLemmatizer of NLTK
(c) Removing all stop words
(d) Expanding the abbreviations

Feature Extraction.
Different features are used to find the relevancy and similarity of a reply post to its initial post. These features are categorized in different ways. A study conducted by Osman et al. [1] categorized features into six groups: relevancy, author activeness, timeliness, ease-of-understanding, amount-of-data, and politeness. These groups were further divided into 28 subfeatures.
Similarly, another research study identified five feature groups (lexical, content, forum specific, structural, and reply-to types) and further divided them into 17 subfeatures [6]. Broadly, features are classified into lexical and nonlexical. Lexical features are text-specific features, e.g., cosine similarity of question and reply posts. Similarly, the number of unique words in a reply post is also a lexical feature. Nonlexical features are forum-specific (author or thread structure related) and content-based features. The total number of threads in which a user has participated, author reputation in the forum, and time elapsed between question and reply posts are some examples of nonlexical features.
For answer extraction in discussion forums, some researchers have preferred nonlexical features over lexical ones [5][6][7][19], while others have proposed lexical features [20]. Naturally, questions have some kind of lexical similarity with their answers [20], so one should use both lexical and nonlexical features to extract the most relevant and highest-quality answer [8]. Lexical features are used to find the relevancy of an answer to the question, while nonlexical features are used to check its quality [19], that is, to what extent an answer addresses the question.
Some features are not always available. One researcher inspected 12 data forums and found that 36.3% of forum-specific features were available, while 75% of author activeness features were available [6]. In our case, timeliness features are not available. Moreover, using some features makes the model forum specific. So, in this study, we used both lexical and nonlexical features, particularly targeting those features which are 100% available and can be easily calculated from the text or structure of the thread. These features are lexical, content-based, and semantic features.
In this study, we used twenty features given in Table 1 with brief descriptions. Out of these, fourteen features are lexical, content-based, or semantic features as shown in Table 2. In the table, the three highlighted features F1, F16, and F17 are our newly proposed semantic features. Some features like F7, F11, F12, F13, and F20 are directly calculated from the text or thread structures. For example, F7 is the number of unique words in a reply post, which can be calculated by splitting the post into words and then applying the set and len functions in Python.
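The F7 computation described above can be sketched in a few lines. The function name and the sample reply are illustrative assumptions, and the paper does not say whether tokens are lowercased before counting; lowercasing is assumed here because it is preprocessing step (a).

```python
def unique_word_count(reply: str) -> int:
    """Number of distinct whitespace-separated tokens in a reply (feature F7)."""
    # Lowercasing mirrors preprocessing step (a); split() breaks on whitespace.
    return len(set(reply.lower().split()))


# Hypothetical reply text, for illustration only.
reply = "Try sudo apt update and then try sudo apt upgrade"
print(unique_word_count(reply))  # 7
```

Punctuation is not stripped in this sketch, so "update" and "update." would count as two different words unless the text is cleaned first.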
For pure lexical features like F2, F3, F4, F5, and F6, we used bag-of-words (BoW) approach. BoW approach is a well-known technique to extract features from documents and represent them as vector. Vector values represent number of occurrences of a word in the documents. Since BoW approach ignores feature order and only word frequency matters, to preserve sentence structure and words order, we used bigram and trigram word sequences which will get more meaning from the document. Some features get high frequency but are not much valuable, so for filtering unimportant features/words, we used the term frequency inverse document frequency (TF-IDF) technique, which converts text into vectors and assigns weightage to each word according to their importance in the document.
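To make the n-gram and TF-IDF weighting concrete, the following is a minimal standard-library sketch rather than the authors' pipeline, which uses TfidfVectorizer. The helper names, the smoothed IDF formula, and the sample texts are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, n_max=3):
    """Map each document to a sparse {ngram: tf * idf} dictionary."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    # Smoothed IDF (an assumed variant): n-grams found in every document
    # receive the lowest weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical question and replies, for illustration only.
question = "how do i install nvidia drivers on ubuntu"
replies = [
    "install nvidia drivers on ubuntu with the additional drivers tool",
    "the weather in new york was lovely last week",
]
q_vec, *reply_vecs = tfidf_vectors([question] + replies)
scores = [cosine(q_vec, r) for r in reply_vecs]
print(scores)
```

The first reply shares unigrams, bigrams, and trigrams with the question and so scores above zero, while the second shares none and scores exactly zero.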
We introduced three new semantic features called F1, F16, and F17 for answer extraction in discussion forums, and to the best of our knowledge, these features have not been used in the literature. We used word mover distance and Google's pretrained word2vec model for our proposed new features. Google's pretrained word2vec model, used for contextual/semantic similarity of words, has vectors for three million words/phrases, and it has been trained on roughly a hundred billion words from the Google News dataset. We keep the default word vector length of 300, so the word2vec model compares two words in a 300-dimensional space. Its speciality is that words with the same semantics/context have close vectors. Word mover (WM) distance is a measure of the dissimilarity of two documents. The greater the WM distance, the greater the dissimilarity, and vice versa. A distance of zero means that the two documents are maximally similar.
Feature F1 is the contextual similarity of each reply with the thread centroid. For the thread centroid, the most important features/words are obtained using the TF-IDF technique. The word mover distance of each reply from the centroid is calculated using Google's pretrained word2vec model. Feature F16 is the word mover distance between the thread title and a reply, while feature F17 is the word mover distance between the initial post/question and a reply. The proposed new semantic features (F1, F16, and F17) are important ones, since both chi-square and univariate feature selection techniques selected them in the top feature space for both the Ubuntu and TripAdvisor (NYC) datasets, as given in Tables 3-6.
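As a rough illustration of how a word mover style distance behaves, the sketch below computes the relaxed word mover distance, a cheaper lower bound on the exact distance, and not the exact measure used in this study. The two-dimensional vectors and all names are hypothetical stand-ins for the 300-dimensional pretrained word2vec vectors.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors, hand-made for illustration only;
# the paper uses 300-dimensional vectors from Google's pretrained word2vec.
TOY_VECTORS = {
    "install": (1.0, 0.1), "setup": (0.9, 0.2),
    "driver": (0.8, 0.9), "drivers": (0.8, 1.0),
    "pizza": (-1.0, 0.5), "lunch": (-0.9, 0.4),
}


def _one_way(doc_a, doc_b):
    """Weighted cost of moving every word in doc_a to its nearest word in doc_b."""
    weights = Counter(doc_a)
    total = sum(weights.values())
    return sum(
        (count / total)
        * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in doc_b)
        for w, count in weights.items()
    )


def relaxed_wmd(text_a, text_b):
    """Relaxed word mover distance: a lower bound on the exact WM distance."""
    a = [w for w in text_a.lower().split() if w in TOY_VECTORS]
    b = [w for w in text_b.lower().split() if w in TOY_VECTORS]
    if not a or not b:
        return float("inf")  # no words with known vectors to compare
    # Taking the larger of the two directions gives the tighter bound.
    return max(_one_way(a, b), _one_way(b, a))


near = relaxed_wmd("install drivers", "setup driver")
far = relaxed_wmd("install drivers", "pizza lunch")
print(near, far)
```

With these toy vectors the semantically close pair gets a much smaller distance than the unrelated pair, even though the two texts share no word, which is the property the lexical features cannot capture.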

Feature Selection.
There is a list of features, lexical and nonlexical, which can be used for extracting answers in question-answer forums. However, not all of them are equally important, and not all can be used, for the following reasons:
(a) Some features are nonvaluable and negatively affect the model performance [1]
To overcome the above limitations, we initially select those features whose availability is one hundred percent and which can be easily calculated from the text, as discussed in Section 3.2. Then, we employed two feature selection techniques, namely, chi-square and univariate, to reduce the feature space size in order to get optimal features, as discussed in detail in Section 4.3.

Classification Model Construction.
This phase aims to classify the reply posts as relevant, partially relevant, and irrelevant using a machine learning algorithm. We used a linear-kernel implementation of support vector machine (SVM) called LinearSVC. This classification is based on the relevancy of a reply to the initial post.
We compared the classification accuracy of the LinearSVC classifier with other SVM implementations as well as other state-of-the-art classification algorithms such as multinomial Naïve Bayes, Bernoulli Naïve Bayes, random forest, and logistic regression. All classifiers were trained and tested with three sets of features: all features and two feature subsets chosen by different feature selection techniques. More details can be found in Section 4.

Evaluation Data.
The proposed answer detection model is evaluated on two datasets: the online TripAdvisor (NYC) forum and the Ubuntu forum. A reply which is completely relevant is assigned class label 3, a partially relevant reply is assigned class label 2, and class label 1 is assigned to irrelevant replies. Both datasets have 7 columns for each thread: "ThreadID," "Title," "UserID_inipst," "Questions," "UserID," "Replies," and "Class." We split the labelled dataset so that 80% of the data is used for training and 20% for testing.
[Table 1 (fragment): "WMDbtwnTitlRpl"; F17 "WMDbtwnQustionRpl"; F20 "NoWrdsRply"]
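The 80/20 split can be sketched as below. The text does not state whether the split was random, stratified, or made per thread, so a seeded random shuffle over reply rows is assumed; the function name and the placeholder rows are hypothetical.

```python
import random


def train_test_split_80_20(rows, seed=42):
    """Shuffle rows reproducibly and return (train, test) with an 80/20 split."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10           # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder rows using the seven column names given in the text.
rows = [
    {"ThreadID": t, "Title": "title", "UserID_inipst": "u0",
     "Questions": "question text", "UserID": f"u{i}",
     "Replies": "reply text", "Class": 1 + (i % 3)}
    for t in range(10) for i in range(5)
]
train, test = train_test_split_80_20(rows)
print(len(train), len(test))
```

A stratified split, which keeps the proportion of the three class labels equal in both parts, may be preferable when one class is rare.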

Classification Algorithms.
We chose a linear kernel method of support vector machine (SVM) called LinearSVC for classification of answer/reply posts in text forum threads. SVM is widely used for text classification problems [22]. We also compared the performance of LinearSVC with other kernel methods of SVM as well as other state-of-the-art classification algorithms. The classifiers are briefly discussed as follows.

Naïve Bayes.
It is a group of supervised learning algorithms based on Bayes' theorem which considers each feature as independent of the other features. This classifier has been widely used in text classification problems and has given good results [23]. Bayes' theorem is stated below:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
where y is the class variable and x_1 to x_n represent a dependent feature vector. Naïve Bayes requires a small amount of data to train and is extremely fast compared to other classifiers. The following variant of Naïve Bayes is used in the evaluation of this study.
Multinomial Naïve Bayes. It is used for multinomially distributed data. It is mainly used for text classification.
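The effect of the independence assumption can be written out explicitly. The following is the standard textbook form, given as a sketch in the notation used above (class variable y, features x_1 to x_n): the joint likelihood factorizes into per-feature terms, the denominator is constant for a given input, and the predicted class is the one that maximizes the product.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

In the multinomial variant, each P(x_i | y) is estimated from how often feature i occurs in training samples of class y, which is why the variant suits count-based text features.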

Support Vector Machines (SVMs).
It is a group of algorithms used for classification, regression, and outlier detection. It performs well in high dimensional space [4] and uses less memory. It works with different kernels, and custom kernel can also be specified. We used the following three implementations.

Support Vector Classification (SVC).
It is based on libsvm. Its fit time increases at least quadratically with the number of samples. "rbf" is the default kernel. Other kernels are "linear," "poly," and "sigmoid."
NuSVC. It is the same as SVC but has a slightly different parameter set and mathematical formulation. It is also based on libsvm. Here, Nu is a regularization parameter with values between 0 and 1. The parameters C and Nu are equivalent in terms of classification power, but Nu is easier to evaluate than C.
LinearSVC. It is based on "liblinear" with a "linear" kernel. Its input can be dense or sparse, and it is more flexible in the choice of penalties and loss functions.
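A minimal scikit-learn sketch of training LinearSVC on a three-class problem is given below. The data are synthetic stand-ins for the twenty-feature reply table, and the feature scaling, dual=False, max_iter, and random seeds are assumptions made for this example, not settings reported in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 20-feature reply table (not the paper's data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
y = y + 1  # relabel 0..2 as 1..3 to match the class labels used in the paper

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# dual=False solves the primal problem, which suits data with more samples
# than features; scaling helps liblinear converge.
model = make_pipeline(StandardScaler(), LinearSVC(dual=False, max_iter=5000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"test accuracy: {accuracy:.3f}")
```

LinearSVC handles the three classes with a one-vs-rest scheme by default, so no extra wrapper is needed for the multiclass labels.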

Logistic Regression.
Logistic regression (LR) is used here in its multinomial form, a classification method that generalizes LR to multiclass problems, i.e., problems with more than two discrete outcomes. It predicts the probabilities of the different outcomes of a target variable given a set of input features.

Random Forests.
They are also known as random decision forests. They are an ensemble learning method for the classification task and work by building a large number of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual decision trees.
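The "mode of the classes" rule amounts to a majority vote over the trees. A small sketch follows, with a hypothetical function name and hypothetical votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the individual tree outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one reply (labels 1, 2, 3 as in the paper).
print(majority_vote([3, 2, 3, 1, 3]))  # 3
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities instead of counting hard votes, so its output can differ from this rule when the trees are uncertain.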

Feature Reduction.
To eliminate nonvaluable and redundant features, two selection techniques are employed: chi-square and univariate. The former selected the eleven best features for the Ubuntu dataset and the eight best features for the TripAdvisor dataset, shown in Tables 3 and 5, respectively, while the latter selected fifteen optimal features for the Ubuntu dataset and the ten best features for the TripAdvisor dataset, shown in Tables 4 and 6, respectively. The following section shows that classifiers with these subsets of features performed better than those with all features.
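One way to reproduce this kind of reduction with scikit-learn is sketched below on synthetic data. The paper does not state which scoring function its "univariate" selector uses, so the ANOVA F-test (f_classif) is an assumption here; the values k=11 and k=15 are the Ubuntu subset sizes reported above, and the min-max scaling is added only because chi2 requires non-negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 20-feature reply table (not the paper's data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative values

# Chi-square selection keeping 11 features.
chi_selector = SelectKBest(score_func=chi2, k=11).fit(X, y)
# Assumed scoring function for the "univariate" selector, keeping 15 features.
uni_selector = SelectKBest(score_func=f_classif, k=15).fit(X, y)

X_chi = chi_selector.transform(X)
X_uni = uni_selector.transform(X)
chi_columns = chi_selector.get_support(indices=True)
print(X_chi.shape, X_uni.shape, chi_columns)
```

To avoid leaking test information, the selectors would normally be fitted on the training split only and then applied to the test split.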

Experimental Results and Discussion.
Results of all six classifiers, used in this study, with all features and features selected by different selection techniques are discussed in this section. LinearSVC, SVC, NuSVC, MultinomialNB, random forest (RF), and logistic regression (LR) are used in this study.
In the first phase, classifiers were used for the Ubuntu dataset with all twenty features, as shown in Table 7. All six classifiers gave good results, but MultinomialNB and LinearSVC performed best and gave exactly the same accuracy of 73.7%. LR has the second highest accuracy (72.4%) while SVC resulted in 71.1% accuracy. Random forest occupied fourth position with 63.2% accuracy. Then, classifiers were tested on the TripAdvisor dataset using all twenty features. The results are shown in Table 8. It is clear from the results that LinearSVC has the highest accuracy of 68.4%. RF and LR are at the second position with 67.1% accuracy while NuSVC has 64.6% accuracy. SVC and MultinomialNB have the lowest accuracy.
In the second phase, the feature space was reduced by employing the chi-square feature selection technique. The eleven best features were selected for the Ubuntu dataset and the eight best features were chosen for the TripAdvisor dataset (Tables 3 and 5). The three new semantic features introduced in this work were picked by the feature selection technique. This shows that these features are important for question-reply similarity.
Classifiers were employed for the Ubuntu dataset with these optimal features. The results in Table 9 show again that LinearSVC has the highest accuracy of 73.7%. MultinomialNB and LR have the same accuracy of 72.4%. SVC is at fourth position with 67.1% accuracy. Random forest is at fifth and NuSVC is at sixth position. Referring to Tables 7 and 9, LinearSVC and LR gave the same accuracy as with the twenty features. Random forest and NuSVC also increased their accuracy. MultinomialNB's accuracy was slightly reduced, but this time only 11 features were used instead of twenty.
Results of all six classifiers, LinearSVC, NuSVC, RF, LR, SVC, and MultinomialNB, for TripAdvisor dataset with top eight best features selected by the chi-square technique are shown in Table 10. Again, LinearSVC performed well with 76% accuracy while LR is at second position and NuSVC is at third position with 73.4% and 67.1% accuracies, respectively. RF has the lowest accuracy (65.8%).
LinearSVC's accuracy increased by 7.6 percentage points, NuSVC's by 2.5 percentage points, and LR's by 6.3 percentage points compared to the accuracy with all twenty features given in Table 8. SVC and MultinomialNB also increased their accuracy.
In the third phase, the univariate feature selection technique was employed to filter the features. Fifteen best features were selected for Ubuntu dataset and ten were selected for TripAdvisor dataset, as shown in Tables 4 and 6, respectively. Again, the newly introduced three semantic features were also selected for both of the datasets.
Classifier results, with the selected features, for the Ubuntu dataset are given in Table 11. LinearSVC is at the top with 76.3% accuracy. MultinomialNB's accuracy is 73.7%. SVC and LR [...]
For the TripAdvisor dataset, algorithms were used with the top ten features chosen by the univariate feature selection technique. Results in Table 12 show that the classifiers' performance is much better than that with all twenty features. LinearSVC's accuracy increased from 68.4% to 73.4%. The accuracy of NuSVC and RF increased by 2.5 percentage points while LR improved its accuracy from 67.1% to 72.2%. The classification accuracy of different classifiers based on different features for the Ubuntu and TripAdvisor datasets is depicted in Figures 2 and 3, respectively. From the experimental results, we observed the following:
(1) Most of the classifiers' accuracy increased or remained unchanged with the best selected features.

Conclusion and Future Work
An automatic solution for extracting the most relevant and highest-quality answer to the initial post (question) in a thread/discussion forum is a challenging task. This study sets a new direction by presenting lexical, content-based, and semantic features that greatly improved the classification accuracy of the proposed classifier. In this study, we proposed to use a supervised machine learning model for extracting the most relevant replies to the initial post, within a forum thread, using a linear-kernel implementation of support vector machine called LinearSVC, and compared it with other SVM implementations and other state-of-the-art classification algorithms. LinearSVC, a variant of SVM, gave the highest accuracy. Two subsets of features were explored, which improved the model performance. Moreover, three new semantic features were introduced and selected as best features by both chi-square and univariate feature selection techniques, which significantly improved the accuracy of LinearSVC. For the Ubuntu dataset, the chi-square technique selected 6 lexical and 5 nonlexical features, while the univariate technique selected 10 lexical and 5 nonlexical features. For TripAdvisor (NYC), the chi-square technique selected 5 lexical and 3 nonlexical features while the univariate technique selected 7 lexical and 3 nonlexical features. So, lexical features proved more important for answer extraction in discussion boards.
In future, we plan to explore more semantic and content-based features to further enhance the model. Also, this work can be extended to thread summarization.

Conflicts of Interest
The authors declare no conflicts of interest.