Online healthcare forums (OHFs) have become increasingly popular for patients to share their health-related experiences. The healthcare-related texts posted in OHFs could help doctors and patients better understand specific diseases and the situations of other patients. To extract the meaning of a post, a common approach is to classify its sentences into several predefined categories of different semantics. However, the unstructured form of online posts brings challenges to existing classification algorithms. In addition, though many sophisticated classification models such as deep neural networks may have good predictive power, it is hard to interpret the models and the prediction results, which is, however, critical in healthcare applications. To tackle the challenges above, we propose an effective and interpretable OHF post classification framework. Specifically, we classify sentences into three classes: medication, symptom, and background. Each sentence is projected into an interpretable feature space consisting of labeled sequential patterns, UMLS semantic types, and other heuristic features. A forest-based model is developed for categorizing OHF posts. An interpretation method is also developed, where the decision rules can be explicitly extracted to gain insight into the useful information in the texts. Experimental results on real-world OHF data demonstrate the effectiveness of our proposed computational framework.
The past few years have witnessed the increasing popularity of online health forums (OHFs), such as WebMD Discussions and Patient, as communication platforms among patients. According to a survey by PwC in 2012, 54% of 1060 participants were comfortable with their doctors getting information related to their health conditions from online physician communities [
To extract insightful information from OHF posts, a commonly adopted strategy is to split posts into sentences and classify each sentence into different categories according to its semantic meaning [
An example of an online health forum post.
However, it is a challenging task to effectively analyze the expressions in online health forums. First, the user-generated content in OHFs is usually unstructured and contains background information that is relatively less important to analyze [
In this paper, we propose an effective framework for analyzing OHF posts. We develop a random forest model to classify sentences into three categories, namely, medication, symptom, and background, in order to accurately understand the role each sentence plays in the overall expression of the health situation. Besides, human-understandable interpretations of the classification results are generated from the forest model. To enable interpretation, the features involved in the classification task are designed in a human-understandable manner. Moreover, the contribution of features to a classification instance can be explicitly measured by the decision rules constructed during the training process [ Our main contributions are as follows. We propose a forest-based framework to deal with the healthcare-related text classification problem. Labeled sequential pattern features are involved in characterizing the unstructured healthcare-related texts at both the syntactic and semantic levels. We develop a method for constructing decision rules integrated from decision trees in forest-based models to achieve model interpretability. The effectiveness and interpretability of our framework are demonstrated through experiments on a real OHF dataset, where we analyze the interpretations provided by our framework in detail.
In this section, we will briefly introduce each module of our proposed framework (Figure
An overview of the interpretable classification framework.
Given a sentence “I am taking 90 units Lantus twice a day” for classification, for example, we will first convert it into an instance in a feature space through preprocessing to identify the number term “90,” the drug term “Lantus,” the frequency term “twice a day,” the context of each term, and so forth. Then, we will use the forest-based model to classify the sentence, along with the explanations based on the discriminative features identified by the model.
In this module, we split the collected online health community posts into sentences and manually assign each sentence one label from the classes {
In this module, we propose the feature extraction method
In this module, the task is to train a model
Interpretable features play an essential role in enabling users to understand prediction results. In this section, we discuss how to convert health-related sentences into instances in numerical feature space composed of labeled sequential patterns, UMLS semantic type features, sentence-based features, and heuristic features. The method of extracting labeled sequential patterns is introduced in detail.
In sentence classification, if we simply used a bag of words to represent each sentence, the overall data matrix would be huge and sparse, because there are a large number of terms and many of them occur in only a few sentences about specific diseases. It is undesirable to use these raw terms, via their correlations with sentence categories, as interpretations for classification results, because raw terms neither explicitly specify the semantics of words nor capture the structural information of sentences. Therefore, we propose to use higher-level features, rather than words, to represent a sentence, and we rely on these higher-level features to interpret the sentence classification results.
We first extract
Tags introduction.
Tag | Description |
---|---|
 | Part-of-speech tags |
 | Medications or drug terms |
 | Symptom terms |
 | Frequency phrases (customized regular expressions) |
Given a training set of labeled sentences
We now focus on mining the frequent sequential patterns from database
A
For example, given two labeled sequences
A
where
There are several algorithms to mine frequent patterns from a database. We select CM-SPAM [
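CM-SPAM itself ships with the SPMF Java library; as a rough illustration of what such a miner produces, here is a naive (exponential-time) sketch that enumerates tag subsequences and keeps those meeting a minimum relative support. The tag sequences and the threshold below are toy values, not drawn from the paper's dataset.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True iff `pattern` occurs as a (possibly non-contiguous) subsequence."""
    it = iter(sequence)
    return all(tag in it for tag in pattern)

def mine_fsp(sequences, min_support, max_len=3):
    """Naive frequent sequential pattern miner over tag sequences.

    Enumerates every subsequence of up to `max_len` tags and keeps those
    whose relative support reaches `min_support`. CM-SPAM does the same
    job far more efficiently via co-occurrence pruning.
    """
    counts = {}
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence
        for length in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idx))
        for pat in seen:
            counts[pat] = counts.get(pat, 0) + 1
    n = len(sequences)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

# Toy labeled sequences: POS tags plus a SYMP tag for symptom terms
db = [
    ["PRP", "VBP", "SYMP", "CC", "SYMP"],
    ["PRP", "VBP", "SYMP"],
    ["NN", "SYMP", "RB"],
]
fsps = mine_fsp(db, min_support=0.6)
# ("PRP", "VBP", "SYMP") appears in 2 of 3 sequences, so its support passes 0.6
```

`is_subsequence` is also what turns mined patterns into binary features: a sentence's tag sequence gets a 1 for every pattern it contains.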
With FSPs available, the next step is to select a subset of promising FSPs called frequent labeled sequential patterns (FLSPs) which are then used for classification.
Note that we have two classes:
We would also like to set the minimum support threshold to a small percentage in order to include more FSPs. In our experiments, we set the minimum
Finally, we obtain a set of FLSPs, which can be used as features to identify the relationship between labels and patterns in sentences [
In addition to FLSPs, we also use UMLS [
Generally, for each sentence
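As a hedged sketch of how semantic-type indicator features might be computed: the real pipeline maps terms to UMLS semantic types (typically via MetaMap); the tiny lexicon below is purely illustrative, including the assumed "lantus" mapping.

```python
# Hypothetical toy lexicon standing in for UMLS concept lookups; in the
# real pipeline, MetaMap maps each term to its UMLS semantic types.
SEMANTIC_LEXICON = {
    "fever": "sosy",     # sign or symptom
    "headache": "sosy",
    "anxiety": "mobd",   # mental or behavioral dysfunction
    "lantus": "phsu",    # pharmacologic substance (assumed mapping)
}

def semantic_type_features(sentence, types=("sosy", "mobd", "phsu", "patf")):
    """One binary indicator per semantic type: does the sentence contain
    any term mapped to that type?"""
    tokens = [t.strip(".,;!?").lower() for t in sentence.split()]
    found = {SEMANTIC_LEXICON[t] for t in tokens if t in SEMANTIC_LEXICON}
    return {typ: int(typ in found) for typ in types}

feats = semantic_type_features("I have a fever and anxiety.")
```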
Sentence-based features are capable of representing the sentence in a direct way [
Although word-based features such as bag-of-word representation usually suffer from the curse of dimensionality, we still take them into account to compare the classification performance because of their effectiveness [
Capitalized words and abbreviations can be good indicators of whether there are any medical terminologies in the sentence, which could be highly related to medication or symptom sentences. We can use two binary features to indicate whether the sentence contains any capitalized words or abbreviations, respectively.
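A minimal sketch of these two indicator features, assuming simple whitespace tokenization; the exact rules in the original pipeline may differ.

```python
import re

def capitalization_features(sentence):
    """Two binary indicators: a capitalized word after the first token
    (often a drug or disease name) and an all-caps abbreviation."""
    tokens = [t.strip(".,;!?") for t in sentence.split()]
    # Skip the first token: sentence-initial capitalization is uninformative
    has_cap = any(t[:1].isupper() for t in tokens[1:])
    has_abbrev = any(re.fullmatch(r"[A-Z]{2,}", t) for t in tokens)
    return {"capitalized": int(has_cap), "abbreviation": int(has_abbrev)}

f = capitalization_features("I am taking 90 units Lantus twice a day")
# "Lantus" is capitalized mid-sentence, so the first indicator fires
```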
(Algorithm overview: given the labeled sequence database, mine it into an FSP set, then filter the FSP set into an FLSP set.)
In addition to all the features originated from the texts of the sentences, we can also adopt useful side information of posts [
In general, we can select different combinations of the features introduced in this section to represent health-related sentences and then build models to predict the categories of sentences with interpretations.
In this section, we first introduce the classification of health forum sentences using a random forest model and how to interpret the forest model with features of high importance. Second, we introduce how to collect rules from decision trees in the forest to construct a new pattern space [
A random forest consists of an ensemble of tree-based classifiers and, in classification problems, predicts the most popular class by aggregating the votes of its trees [ Each tree is grown as follows: sample a bootstrap subset of the training instances; at each node, select a random subset of features as split candidates; grow the tree to the maximum size without pruning.
When growing a tree using samples from the original training set, about one-third of the training instances are left out of the randomly selected samples. This out-of-bag data provides an unbiased estimate of the classification accuracy of the currently growing tree and can also be used to estimate feature importance.
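In scikit-learn, the out-of-bag estimate and impurity-based feature importances can be obtained as follows; synthetic data stands in for the binary sentence feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the sentence feature matrix
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)

# oob_score=True evaluates each tree on the ~1/3 of training instances it
# never saw (its out-of-bag sample), giving an accuracy estimate without a
# separate held-out set.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
# Impurity-based feature importances come for free
top5 = np.argsort(rf.feature_importances_)[::-1][:5]
```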
The classification mechanism of a random forest is explained through a set of decision paths. To interpret random forest models, we propose to quantify the contributions of node features, rank them according to their contributions, and find out the most discriminative ones [
For a decision tree in the random forest, its decision function can be formulated as follows:
From another perspective, we can observe how a feature contributes to the
The prediction function of a forest, which is an ensemble of decision trees, takes the average of the predictions of its trees:
Suppose a random forest model
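One way to realize this decomposition, in the spirit of the treeinterpreter approach, is sketched below against scikit-learn's tree internals (this is not the authors' code): walk each decision path and attribute the change in class distribution at every split to the feature tested there, so that prediction = bias + sum of feature contributions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def tree_contributions(estimator, x):
    """Decompose one tree's class-probability prediction for sample x into
    a bias term (root distribution) plus per-feature contributions along
    the decision path."""
    t = estimator.tree_
    # Normalize node values to class distributions (works whether tree_
    # stores class counts or fractions)
    dist = t.value[:, 0, :] / t.value[:, 0, :].sum(axis=1, keepdims=True)
    contrib = np.zeros((t.n_features, dist.shape[1]))
    node = 0
    while t.children_left[node] != -1:  # until a leaf is reached
        feat = t.feature[node]
        nxt = (t.children_left[node] if x[feat] <= t.threshold[node]
               else t.children_right[node])
        contrib[feat] += dist[nxt] - dist[node]  # change caused by this split
        node = nxt
    return dist[0], contrib

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Forest-level decomposition = average over the trees
bias = np.zeros(3)
contrib = np.zeros((X.shape[1], 3))
for est in rf.estimators_:
    b, c = tree_contributions(est, X[0])
    bias += b / len(rf.estimators_)
    contrib += c / len(rf.estimators_)

# bias + summed contributions reconstruct the forest's predicted probabilities
reconstructed = bias + contrib.sum(axis=0)
```

The path sum telescopes to the leaf distribution, so the reconstruction is exact; ranking rows of `contrib` per class gives contribution tables of the kind reported later.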
To further exploit interpretability, we extract decision rules from the forest model to form a new space, where forward selection is applied to select the top discriminative decision rule combinations, that is, discriminative patterns [
Specifically, a
However, since the dimension
Then, we build a classifier using support vector machines [
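A DPClass-style sketch of this step: treat each leaf of each shallow tree as a conjunctive decision rule, map instances into the binary rule space, and fit a linear SVM there. Forward selection of the top rules is omitted for brevity, and all data and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def leaf_rule_features(forest, X):
    """Map instances into a binary rule space: one column per leaf of each
    tree, set to 1 iff the instance satisfies that leaf's conjunctive
    decision rule (i.e., lands in that leaf)."""
    cols = []
    for est in forest.estimators_:
        leaves = est.apply(X)                  # leaf id for every instance
        for leaf in np.unique(leaves):
            cols.append((leaves == leaf).astype(int))
    return np.column_stack(cols)

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
# Shallow trees keep every extracted rule short and human-readable
rf = RandomForestClassifier(n_estimators=20, max_depth=3,
                            random_state=0).fit(X, y)

R = leaf_rule_features(rf, X)                    # instances in the rule space
svm = LinearSVC(C=1.0, max_iter=5000).fit(R, y)  # linear model over rules
```

In a full pipeline, forward selection would pick the top discriminative columns of `R` before fitting the final model, and each selected column can be printed back as a readable conjunction of split conditions.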
In this section, first we present the experiments results which show that the forest-based models outperform the baseline methods. Second, we compare the interpretability between Lasso and our forest-based model by analyzing their discriminative features and discriminative patterns.
Since few datasets are available for health-related text classification, we created our own dataset by collecting texts from online health communities. The data used for the experiments in this study were crawled from patient.info (
Labeled sentence statistics.
Med. | Symp. | Others | Total |
---|---|---|---|
1127 | 772 | 200 | 2099 |
Our experiments aim to demonstrate two contributions: how much performance improvement our proposed method achieves by introducing labeled sequential patterns as features, and how interpretability is enabled by applying our proposed methods to sentence representations in a variety of feature spaces, providing insight into the health-related text classification model. To show the first contribution, we choose support vector machines trained on a variety of features proposed in [
The metrics for the evaluation are accuracy, weighted average precision, weighted average recall, and weighted average
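These weighted averages correspond to scikit-learn's `average="weighted"` option; a small example with made-up labels over the three classes:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up predictions, purely for illustration
y_true = ["med", "med", "symp", "symp", "back", "med"]
y_pred = ["med", "symp", "symp", "symp", "back", "med"]

acc = accuracy_score(y_true, y_pred)
# 'weighted' averages the per-class scores by class support, matching the
# weighted average precision/recall/F-score reported in the tables
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
```

Note that for multiclass problems the weighted average recall equals the overall accuracy, which is why those two columns often track each other in the results.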
Table
Model evaluation. We evaluate each model using 5-fold cross-validation. The average accuracy, weighted average precision, weighted average recall, and weighted average F-score for the medication class, the symptom class, and the overall performance are reported in the columns. Each row shows the performance of a model trained on a different feature combination.
 | Ft. set | M. Acc. | M. Prec. | M. Rec. | M. F1. | S. Acc. | S. Prec. | S. Rec. | S. F1. | Acc. | Prec. | Rec. | F1. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Select + SVM | Word-based | 0.843 | 0.846 | 0.867 | 0.856 | 0.886 | 0.875 | 0.804 | 0.838 | 0.798 | 0.808 | 0.798 | 0.802 |
 | + Semantic | | 0.854 | | | 0.884 | 0.874 | 0.801 | 0.836 | 0.804 | 0.816 | 0.804 | 0.808 |
 | + Position | 0.843 | 0.846 | 0.867 | 0.856 | 0.886 | 0.875 | 0.805 | 0.838 | 0.798 | 0.808 | 0.798 | 0.802 |
 | + Thr. Crt. | 0.844 | 0.846 | 0.867 | 0.857 | 0.896 | | 0.814 | 0.852 | 0.800 | 0.812 | 0.800 | 0.805 |
 | + Morpho. | 0.848 | 0.855 | 0.864 | 0.859 | 0.891 | 0.883 | 0.811 | 0.846 | 0.801 | 0.816 | 0.801 | 0.807 |
 | + Word Cnt. | 0.802 | 0.785 | | 0.826 | 0.864 | 0.888 | 0.722 | 0.796 | 0.761 | 0.773 | 0.761 | 0.763 |
 | LSP | 0.799 | | 0.709 | 0.790 | 0.831 | 0.862 | 0.644 | 0.737 | 0.691 | 0.821 | 0.691 | 0.731 |
 | + Semantic | 0.849 | 0.865 | 0.852 | 0.858 | 0.891 | 0.878 | 0.818 | 0.846 | 0.806 | | 0.806 | |
 | + Position | 0.841 | 0.851 | 0.852 | 0.851 | 0.893 | 0.883 | 0.817 | 0.848 | 0.800 | 0.815 | 0.800 | 0.806 |
 | + Thr. Crt. | 0.844 | 0.852 | 0.859 | 0.855 | | 0.885 | 0.826 | 0.855 | 0.801 | 0.814 | 0.801 | 0.807 |
 | + Morpho. | | 0.860 | 0.864 | 0.861 | 0.896 | 0.883 | 0.826 | 0.854 | | 0.820 | | |
 | + Word Cnt. | 0.848 | 0.856 | 0.862 | 0.859 | | 0.884 | | | 0.807 | 0.819 | 0.807 | 0.812 |
 | + Word-based | 0.810 | 0.810 | 0.844 | 0.826 | 0.870 | 0.887 | 0.739 | 0.806 | 0.768 | 0.792 | 0.768 | 0.776 |
Lasso | Word-based | 0.794 | 0.730 | | | 0.886 | | 0.712 | 0.820 | 0.791 | | 0.791 | 0.756 |
 | + Semantic | 0.793 | 0.741 | 0.947 | 0.831 | 0.886 | 0.923 | 0.752 | 0.828 | 0.789 | 0.754 | 0.789 | 0.757 |
 | + Position | 0.795 | 0.742 | 0.947 | 0.832 | 0.886 | 0.920 | 0.754 | 0.829 | 0.790 | 0.757 | 0.790 | 0.758 |
 | + Thr. Crt. | 0.796 | 0.745 | 0.945 | 0.833 | 0.889 | 0.922 | 0.762 | 0.834 | 0.791 | 0.756 | 0.791 | 0.759 |
 | + Morpho. | 0.797 | 0.745 | 0.947 | 0.834 | 0.889 | 0.924 | 0.759 | 0.833 | 0.792 | 0.757 | 0.792 | 0.760 |
 | + Word Cnt. | 0.798 | | 0.947 | 0.834 | 0.891 | 0.927 | 0.762 | 0.836 | 0.793 | 0.759 | 0.793 | 0.762 |
 | LSP | 0.715 | 0.663 | 0.955 | 0.782 | 0.802 | 0.875 | 0.538 | 0.666 | 0.711 | 0.678 | 0.711 | 0.665 |
 | + Semantic | 0.769 | 0.712 | 0.955 | 0.816 | 0.861 | 0.911 | 0.689 | 0.785 | 0.767 | 0.727 | 0.767 | 0.728 |
 | + Position | 0.767 | 0.710 | 0.955 | 0.814 | 0.860 | 0.910 | 0.686 | 0.782 | 0.765 | 0.716 | 0.765 | 0.725 |
 | + Thr. Crt. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.911 | 0.700 | 0.791 | 0.769 | 0.728 | 0.769 | 0.731 |
 | + Morpho. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.910 | 0.698 | 0.790 | 0.769 | 0.728 | 0.769 | 0.730 |
 | + Word Cnt. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.910 | 0.698 | 0.790 | 0.769 | 0.728 | 0.769 | 0.730 |
 | + Word-based | | 0.745 | 0.950 | 0.835 | | 0.930 | | | | 0.759 | | |
Forest-based | Word-based | | 0.795 | | | 0.881 | 0.891 | 0.773 | 0.827 | 0.819 | 0.808 | 0.819 | 0.795 |
 | + Semantic | 0.815 | 0.761 | 0.956 | 0.847 | 0.878 | 0.901 | 0.751 | 0.819 | 0.802 | 0.805 | 0.802 | 0.778 |
 | + Position | 0.820 | 0.767 | 0.957 | 0.851 | 0.887 | | 0.772 | 0.833 | 0.807 | 0.791 | 0.807 | 0.779 |
 | + Thr. Crt. | 0.817 | 0.765 | 0.949 | 0.847 | 0.872 | 0.884 | 0.749 | 0.811 | 0.799 | 0.792 | 0.799 | 0.774 |
 | + Morpho. | 0.832 | 0.776 | 0.965 | 0.860 | 0.890 | 0.907 | 0.781 | 0.838 | 0.816 | | 0.816 | 0.789 |
 | + Word Cnt. | 0.830 | 0.779 | 0.954 | 0.858 | | 0.893 | 0.804 | | 0.814 | 0.797 | 0.814 | 0.783 |
 | LSP | 0.786 | 0.742 | 0.921 | 0.822 | 0.863 | 0.861 | 0.748 | 0.801 | 0.771 | 0.725 | 0.771 | 0.739 |
 | + Semantic | 0.837 | 0.824 | 0.887 | 0.854 | 0.879 | 0.860 | 0.802 | 0.829 | 0.809 | 0.805 | 0.809 | |
 | + Position | 0.840 | | 0.873 | 0.854 | 0.882 | 0.844 | | 0.839 | 0.808 | 0.800 | 0.808 | 0.803 |
 | + Thr. Crt. | 0.832 | 0.825 | 0.875 | 0.849 | 0.879 | 0.849 | 0.814 | 0.831 | 0.802 | 0.796 | 0.802 | 0.797 |
 | + Morpho. | 0.841 | 0.829 | 0.886 | 0.856 | 0.881 | 0.843 | 0.832 | 0.837 | 0.812 | 0.802 | 0.812 | 0.804 |
 | + Word Cnt. | 0.829 | 0.816 | 0.881 | 0.847 | 0.880 | 0.856 | 0.808 | 0.831 | 0.800 | 0.791 | 0.800 | 0.793 |
 | + Word-based | 0.848 | 0.816 | 0.927 | 0.868 | 0.887 | 0.861 | 0.827 | 0.843 | | 0.803 | | 0.802 |
For the SVM model, the overall average prediction accuracy reaches 79.8% with only word-based features, which outperforms the accuracies of Lasso. SVM also performs very well in terms of precision, recall, and F1 score. The model trained on LSP features alone fails to outperform the model trained on word-based features, but the former achieves better performance than the latter once we add the UMLS semantic type features. Note that there are only hundreds of LSP features, whereas there are more than 16,000 word-based ones. Without feature selection, the performance of SVM is not very good, since the word-based features are considerably sparse. Furthermore, although SVMs with RBF kernels achieve good performance, they do not directly provide interpretability for gaining insight into the sentences.
From the experiment results using Lasso, we can find that the recall scores for classifying medication sentences are better than those for symptom ones, while the accuracies and precision scores indicate the opposite trend. As we use multiclass classifiers, many of the test instances are classified as medication class. The Lasso models trained on the word-based features slightly outperform the ones trained on the LSP features. As Table
Top 10 average weight of word-based, LSP, semantic features in Lasso.
Word-based | Average weight | LSP | Average weight | Semantic | Average weight |
---|---|---|---|---|---|
Avoiding | −0.413 | (PRP, PRP, RB, SYMP) | 0.081 | sosy | 0.329 |
Wrong | −0.363 | (PRP, PRP, VB, SYMP) | 0.060 | mobd | 0.207 |
Avoid | −0.343 | (VBZ, CC, SYMP) | 0.058 | patf | 0.190 |
Prescribe | −0.323 | (SYMP, SYMP, SYMP) | 0.054 | resa | −0.173 |
Bleeding | 0.283 | (PRP, SYMP, CC, SYMP, IN) | −0.053 | inpo | 0.100 |
Anxiety | 0.281 | (CC, SYMP, IN, SYMP) | −0.052 | anab | 0.094 |
Swelling | 0.233 | (PRP, SYMP, VBG) | 0.049 | mcha | −0.092 |
Increased | −0.185 | (RB, SYMP, VB) | 0.048 | aggp | −0.090 |
Migraines | 0.185 | (JJ, IN, JJ, SYMP) | 0.036 | plnt | −0.063 |
Fever | 0.160 | (NN, SYMP, RB, SYMP) | −0.033 | mamm | −0.052 |
For the forest-based model, the accuracies for the medication and symptom classes both exceed 80% with only LSP features and UMLS semantic type features. The overall accuracy reaches 80.9% and outperforms the other methods. Besides, with LSP and UMLS semantic type features, the precisions and recalls of both classes are greater than 0.8. Moreover, with the position feature and word-based features, the performance of the forest-based model is even better. In general, the random forest model achieves relatively better F1 scores for both medication and symptom sentence classification. Similarly, the random forest models trained on the word-based features slightly outperform those trained on the LSP features.
Although it is not guaranteed that the models trained on LSP features outperform the ones trained on word-based features, we would still like to take advantage of LSP features since the feature dimension is significantly reduced without sacrificing the discrimination ability of models. In addition, LSP features provide a valuable perspective in both tag and structural levels to interpret classification results for health-related sentences.
Table
LSPs are usually assigned positive weights, as they are capable of capturing the symptom terms in sentences. The pattern
Several UMLS semantic type features are assigned relatively larger weights to identify symptom sentences. For example, the term “sosy,” short for “sign or symptom,” is obviously a useful feature to identify symptom sentences. The term “mobd” (i.e., “mental or behavioral dysfunction”) can be used to detect mental disease symptoms. “patf” (i.e., “pathologic function”) is a parent semantic type of “mobd,” which is also an informative feature to detect pathologic terms.
To interpret healthcare-related sentences in forest-based models, we calculate the feature contributions from decision trees in the forest. We select one random forest model with the best accuracies in the experiments and list the 10 features with the greatest contributions for each class in Table
Top 10 feature contributions for medication and symptom class in a random forest model.
Feature | Back. | Med. | Sym. |
---|---|---|---|
Medication class | | | |
Prescribed = 1 | −0.00275 | 0.01195 | −0.00920 |
(PRP, CD, CD) = 1 | −0.00251 | 0.01156 | −0.00905 |
Morpho. = 1 | −0.00206 | 0.00660 | −0.00455 |
hlca = 1 | −0.00071 | 0.00559 | −0.00489 |
(NN, SYMP, SYMP, CC) = 0 | 0.00115 | 0.00429 | −0.00544 |
sosy = 0 | 0.00191 | 0.00406 | −0.00597 |
(PRP, CD, IN, NN, NN) = 1 | −0.00075 | 0.00402 | −0.00327 |
(CD, IN, CD, CD) = 1 | −0.00120 | 0.00396 | −0.00276 |
thr. Crt. = 0 | 0.00154 | 0.00381 | −0.00535 |
(PRP, CD, JJ, JJ) = 1 | −0.00086 | 0.00362 | −0.00276 |
Symptom class | | | |
sosy = 1 | −0.00589 | −0.00783 | 0.01371 |
Prescribed = 0 | 0.00234 | −0.015734 | 0.01339 |
thr. Crt. = 1 | −0.00381 | −0.00683 | 0.01064 |
(PRP, CD, CD) = 0 | 0.00271 | −0.01264 | 0.00993 |
(SYMP, SYMP, SYMP) = 1 | −0.00330 | −0.00564 | 0.00895 |
(NN, SYMP, SYMP, CC) = 1 | −0.00209 | −0.00667 | 0.00876 |
Position < | −0.00334 | −0.00540 | 0.00874 |
patf = 1 | −0.00254 | −0.00379 | 0.00633 |
(SYMP, CC, JJ) = 1 | −0.00172 | −0.00404 | 0.00576 |
Word count > | −0.00131 | −0.00423 | 0.00554 |
In identifying
For the
Compared to the feature ranking in Lasso, we can have a better understanding from the feature contribution rankings for each class in the random forest. The relationships between features and classes can be learned from the feature contribution vectors while Lasso only provides the weights of the features, which may not be expressive enough to represent the relationships between features and classes. The random forest model can achieve both better performance and interpretability compared to Lasso.
DPClass [
Top 10 discriminative patterns in a DPClass model.
Pattern | Leaf class |
---|---|
((RB, CD, CD) = 0) ∩ ((PRP, CD, CD, JJ) = 0) ∩ ((PRP, CD, NN, NN, NN) = 0) ∩ ((TO, VB, CD) = 1) | Med. |
((IN, NN, NN, comma, SYMP) = 0) ∩ ((CD, RB, CD) = 1) ∩ ((RB, IN, IN, CD, IN) = 0) ∩ ((PRP, CC, CD, NN) = 1) | Med. |
((SYMP, NN, VBG) = 1) | Sym. |
((VBP, CD, NN, NN) = 0) ∩ ((SYMP, SYMP, NN) = 1) | Sym. |
((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 1) ∩ (dsyn = 1) | Med. |
((PRP, VBP, CD) = 0) ∩ ((CD, CD, NN, NN) = 1) ∩ ((TO, CD, IN) = 1) | Med. |
(“cough” = 1) | Sym. |
((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 0) ∩ (fndg = 0) ∩ ((NN, comma, comma, SYMP) = 1) | Sym. |
((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 1) ∩ (dsyn = 0) | Med. |
(“anxiety” = 1) | Sym. |
Previous medication information extraction research mainly focused on extracting medication information from clinical notes, such as [
Our work focuses on both classifying and interpreting the online healthcare-related texts. The major challenges in healthcare-related text classification and interpretation are how to represent the texts and how to classify and interpret the data. For the former question, [
In our research, we propose to use labeled sequential patterns to represent healthcare-related sentences in order to reduce the dimension and sparsity of the data, which both preserves performance and improves efficiency. We then build forest-based models on the training data, which are capable of predicting with decent performance and of interpreting healthcare-related sentences by extracting the important features used in the decision rules, ranked by their contributions, as well as the discriminative patterns composed of those decision rules. Overall, the forest-based models trained on the proposed feature space achieve good performance and enable interpretability of the data. In the future, we will build a compact system based on this framework to help users directly extract and highlight insightful sentences while they are viewing healthcare-related articles, posts, and so forth. Moreover, we also aim to extract and interpret insightful sentences from other categories, such as medication effects and user questions, and to include data from other sources such as clinical notes.
The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
The authors declare that there is no conflict of interest regarding the publication of this paper.
The work is, in part, supported by DARPA (nos. N66001-17-2-4031 and W911NF-16-1-0565) and NSF (no. IIS-1657196).