A Time-Critical Topic Model for Predicting the Survival Time of Sepsis Patients

Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions of dollars annually worldwide. Predicting survival time for sepsis patients is a time-critical prediction problem. Considering the useful sequential information for sepsis development, this paper proposes a time-critical topic model (TiCTM) inspired by the latent Dirichlet allocation (LDA) model. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. Experimental results on the public MIMIC-III database show that, overall, our method outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. It is also found that our method achieves the best performance by using 5 topics when predicting the probability for 30-day survival time.


Introduction
Predicting the survival time of patients is an active research area for both clinicians and scientists [1][2][3][4][5]. It can significantly contribute to making decisions about clinical treatment, allocation of medical resources, and hospice care for patients [1]. Sepsis is a disease of life-threatening organ dysfunction caused by a dysregulated host response to infection [6]. Without timely treatment, sepsis can rapidly lead to tissue damage, organ failure, and death. Common signs and symptoms include fever, increased heart rate, increased breathing rate, and confusion. In US health systems, the cost for patients with sepsis accounted for more than $20 billion (5.2%) of total US hospital costs in 2011 [7]. In 2001-2010, one in twenty deaths in England was associated with sepsis based on information recorded on death certificates [8]. Although clinicians have made efforts to improve sepsis patient survival time, the mortality rate of sepsis is still very high [9,10]. Thus, accurate prediction of survival time for sepsis patients could help clinicians conduct prevention, provide early warning and effective treatment, and reduce the mortality rate. Unfortunately, the pathogenesis of sepsis remains unclear. Predicting the survival time for specific diseases, such as sepsis, is still a challenging problem.
To help clinicians understand the overall body situation of a patient, ICUs have introduced many different mechanisms for describing the patient's body situation and progress in the ICU. Different severity-of-illness scores have been used to predict sepsis or mortality risk in the ICU. The most widely used score-based methods include the acute physiology and chronic health evaluation (APACHE III) [11], the simplified acute physiology score (SAPS II) [12], the modified early warning score (MEWS) [13], the sepsis-related organ failure assessment (SOFA) [14], and quick SOFA (qSOFA) [6]. In addition, Zhang and Hong [15] proposed a novel score for predicting hospital mortality for severe sepsis. These score-based methods utilize a set of easily obtainable measurements from various patients to generate risk scores. Although they allow clinicians to make rapid diagnoses, the obtained results are not satisfactory [16,17]. Additionally, these score-based approaches only evaluate the patient's body situation at specific times and cannot predict the survival time of sepsis patients because the development of sepsis is a time-sensitive process.
Sepsis is a life-threatening condition that arises when the body's response to infection causes injury to its own tissues and organs [18,19]. It is difficult to predict the development of sepsis based on a small number of measurements. Conventional topic models such as latent Dirichlet allocation (LDA) are unsupervised machine learning methods that can recognize latent topic information in massive document collections [20,21]. Lehman et al. [22] proposed a novel approach for ICU patient risk stratification using a topic model. Ghassemi et al. [23] proposed a mortality model using a topic model to predict in-hospital mortality, 30-day postdischarge mortality, and 1-year postdischarge mortality. Vairamani [24] proposed an approach for mortality prediction in ICU patients based on LDA. Zhang et al. [25] proposed a novel survival topic model inspired by LDA for trauma patients. However, these topic models overlook the useful sequential information for disease development, thereby reducing the prediction accuracy for sepsis patient survival time. The development and treatment of sepsis is a time-critical process that is strongly correlated with the order of words in the measurements and notes.
To address this issue, this paper proposes a time-critical topic model (TiCTM) inspired by the LDA model to predict the survival time of sepsis patients. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. We consider the time-critical dynamic process of sepsis patients as an approximately linear variation under clinician treatment. Therefore, the linear change in the parameter of TiCTM can reflect the time-critical dynamic process, whereas the parameter of LDA is fixed. Our experimental results on the public MIMIC-III database show that, overall, TiCTM outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. In particular, TiCTM obtains the best performance when predicting the probability for 30-day survival time using 5 topics. The remainder of this paper is organized as follows. In Section 2, we describe the proposed TiCTM for predicting survival time for adult sepsis patients in the ICU. The experiments and evaluations are discussed in Section 3. Finally, we conclude our research in Section 4.

Methodology
In this section, we first present a brief review of the classical LDA model and then describe our proposed TiCTM approach in detail.

Brief Review of LDA.
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [20]. The LDA model is represented as a probabilistic graphical model, as shown in Figure 1. The meaning of each notation for LDA is shown in Table 1. The LDA model treats a document as a collection of words and thus overlooks the order of the words. To address this issue, we propose a time-critical topic model inspired by LDA to predict the survival time of sepsis patients, as described below.
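The generative view of LDA summarized above can be illustrated with a minimal sketch. It assumes the standard LDA process (a Dirichlet-distributed topic mixture per document, then a topic and a word drawn for each position). The function names, the toy `alpha` and `beta` values, and the vocabulary size are hypothetical and are not taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """Generate one bag-of-words document under the standard LDA process.

    alpha: Dirichlet prior over K topics (hypothetical toy values below)
    beta:  K rows, each a distribution over the V vocabulary entries
    """
    theta = sample_dirichlet(alpha, rng)        # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)      # topic assignment for this position
        w = sample_categorical(beta[z], rng)    # word index drawn from topic z
        words.append(w)
    return theta, words

rng = random.Random(0)
alpha = [0.5, 0.5]                              # K = 2 topics (illustrative)
beta = [[0.7, 0.2, 0.1, 0.0],                   # topic 0 over V = 4 words
        [0.0, 0.1, 0.2, 0.7]]                   # topic 1 over V = 4 words
theta, words = generate_document(alpha, beta, 10, rng)
```

Because each word is drawn independently given the topic mixture, shuffling `words` leaves the likelihood unchanged, which is the order-insensitivity noted above.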

Our Proposed Method.
Our proposed TiCTM approach considers the time dependency structure between notes, measurement, and survival time of a sepsis patient. The TiCTM model is depicted in Figure 2.
Assume that the sequential notes and measurement submodel has several phases. For phase 1, the meaning of each notation is shown in Table 2.
For phase I, the meaning of each notation is shown in Table 4.
As shown in Figure 2, the proposed TiCTM model consists of two submodels: sequential notes and measurement, and survival time prediction. This is because the development of sepsis is a time-critical process. The main idea of the TiCTM model is that we consider the time-critical dynamic process of sepsis patients as approximately linear variation under clinician treatment.
Therefore, we employ the linear change in the parameter of TiCTM to reflect this time-critical dynamic process, whereas the parameter of LDA is fixed.

Sequential Notes and Measurement.
For a sepsis patient m, the parameter $\mu_m$ represents the patient's initial body situation. With each day of clinician treatment, the patient's body situation changes by $\Delta\mu$. Therefore, after $i_m$ days of clinician treatment, the parameter $\mu_m$ of patient m will transit to $\mu_m + i_m \Delta\mu$, which is used to reflect the sequential changing process. The parameter $\mu_m + i_m \Delta\mu$ determines the disease probability distribution $\theta_i$. The parameter $\theta_i$ determines the disease $z_i$ for patient m. $\beta$ (probability of bigrams appearing in the topics of notes dictionary) and $z_i$ (topic variable) generate $N_{i,m}$ words in the notes of patient m. $\lambda$ (probability of measurements appearing in the topics of measurement dictionary) and $z_i$ generate $R_{i,m}$ measurements in the measurement result of patient m.
For each word $w_{i,n}$, $n = 1, 2, \ldots, N$ and $i = 1, 2, \ldots, I$, we draw a topic assignment $z_{i,n} \mid \theta_i \sim A(\theta_i)$, where $A(\theta_i)$ represents the probability of topic $z_{i,n}$ appearing, which is determined by parameter $\theta_i$. We then draw a word $w_{i,n} \mid z_{i,n}, \beta \sim B(\beta)$, where $B(\beta)$ represents the probability of word $w_{i,n}$ appearing, which is determined by topic $z_{i,n}$ and parameter $\beta$.
For each measurement $x_{i,r}$, we likewise draw a topic assignment $z_{i,r} \mid \theta_i \sim A'(\theta_i)$, where $A'(\theta_i)$ represents the probability of topic $z_{i,r}$ appearing, which is determined by parameter $\theta_i$. We then draw a measurement $x_{i,r} \mid z_{i,r}, \lambda \sim B'(\lambda)$, where $B'(\lambda)$ represents the probability of measurement $x_{i,r}$ appearing, which is determined by topic $z_{i,r}$ and parameter $\lambda$.
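The sequential draw described above can be sketched as follows. The text states only that the drifting parameter (initial body situation plus the day index times the daily change) "determines" the topic distribution. As a loudly labeled assumption, this sketch uses that vector as a Dirichlet concentration parameter and treats the topic and word/measurement draws as categorical. All function names, toy parameter values, and dictionary sizes are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws (all alpha must be > 0)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw an index according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_patient(mu, delta_mu, n_days, beta, lam, n_words, n_meas, rng):
    """Generate notes and measurements for days i = 1..n_days.

    mu:       initial body situation of the patient (length K)
    delta_mu: daily change of the body situation (length K)
    beta:     K distributions over the notes dictionary
    lam:      K distributions over the measurement dictionary
    """
    record = []
    for i in range(1, n_days + 1):
        # Linear drift of the body situation: mu + i * delta_mu
        state = [m + i * d for m, d in zip(mu, delta_mu)]
        # ASSUMPTION: the drifting state acts as a Dirichlet concentration for theta_i
        theta = dirichlet(state, rng)
        # Each word/measurement gets its own topic draw, then an item from that topic
        notes = [categorical(beta[categorical(theta, rng)], rng) for _ in range(n_words)]
        meas = [categorical(lam[categorical(theta, rng)], rng) for _ in range(n_meas)]
        record.append({"day": i, "state": state, "notes": notes, "measurements": meas})
    return record

rng = random.Random(1)
mu = [1.0, 1.0]                      # K = 2 topics (illustrative)
delta_mu = [0.75, 0.25]              # toy daily change; components sum to 1
beta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
lam = [[0.8, 0.2], [0.2, 0.8]]
record = generate_patient(mu, delta_mu, 3, beta, lam, n_words=5, n_meas=4, rng=rng)
```

In contrast to the LDA sketch, the prior parameter here changes with the day index, so the same patient can shift from one dominant topic to another over the stay.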

Survival Time Prediction.
For each patient m, $T_m$ denotes the real survival time, and $\hat{T}_m$ represents the survival time predicted by TiCTM. The body situation $\mu_m + I_m \Delta\mu$ and the regression coefficient $\eta$ are used to predict the survival time $\hat{T}_m$ for patient m. The objective is to minimize the difference (Diff) between $T_m$ and $\hat{T}_m$. As shown in Figure 2, we define the time to death T from formula (5). The details of predicting survival time for sepsis patients using the TiCTM model are described in the following subsection.

Details of Survival Time Prediction
We hypothesize $T_m$ as the real survival time for patient m after the second phase. $f_m$ is the function used in the analysis of the real survival time $T_m$.
Assume that patient m has the same probability $p_m$ of surviving each day; then, the probability of death on day $T_m + 1$ is $p_m^{T_m}(1 - p_m)$. To maximize $p_m^{T_m}(1 - p_m)$, a condition on $p_m$ must be satisfied, where $p_m$ takes the form of a sigmoid function. To allow $p_m = 0$ (death) and $p_m = 1$ (survival), we modify $p_m$. Then, we can obtain formula (12) for $f_m$ and formula (13) for $T_m$.
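The maximization step above can be checked numerically. Setting the derivative of p^T (1 − p) with respect to p to zero gives T (1 − p) = p, that is, p = T / (T + 1). This is the standard calculus result for this expression. Whether it matches the paper's own omitted formula is an assumption, and the function names below are hypothetical.

```python
def death_day_probability(p, t):
    """Probability of surviving t days and dying on day t + 1,
    given a constant daily survival probability p."""
    return (p ** t) * (1.0 - p)

def best_daily_survival_probability(t):
    """Closed-form maximizer of p**t * (1 - p) over p in [0, 1]."""
    return t / (t + 1.0)

t = 30                                            # illustrative survival time in days
p_star = best_daily_survival_probability(t)

# Brute-force check on a fine grid of candidate probabilities
grid = [k / 10000 for k in range(10001)]
p_grid = max(grid, key=lambda p: death_day_probability(p, t))
```

For a 30-day survival time the maximizing daily survival probability is 30/31, so longer observed survival times correspond to daily survival probabilities closer to 1.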

Definition of the Survival Time Prediction Function.
To verify the effectiveness of the proposed TiCTM model, this paper uses the two-phase survival model to analyze survival time for adult sepsis patients. The first phase is limited to the measurements and notes recorded within 24 hours of the patient's admission. The second phase uses the patient's measurements and notes from 24 hours after admission to the last record.
To calculate the probability of measurement $x_1$ and note $w_1$ in the first phase by body condition $\mu_m$, we use:
To calculate the probability of measurement $x_2$ and note $w_2$ in the second phase by body condition $\mu_m + i_m \Delta\mu$, we use:
The likelihood function of the two phases of patient m can be obtained from:
The log of the likelihood function of the two phases of patient m can be obtained from:
To simplify the calculation, we use formula (21) to replace the log of the likelihood function of formula (19).
After removing the constant term in formula (21), the log of the total likelihood function of the two phases for all patients can be obtained from the following formula:
To maximize the log of the total likelihood function, formula (23) is obtained as follows:

Notation | Meaning
$V$ | Size of notes dictionary
$\beta$ | The probability of words appearing in topics of notes dictionary
$\beta_{k,v}$ | The probability of the v-th bigram appearing in topic k of notes dictionary, $\sum_{v=1}^{V} \beta_{k,v} = 1$
$V'$ | Size of measurement dictionary
$\lambda$ | The probability of measurements appearing in the topics of measurement dictionary
$\lambda_{k,v'}$ | The probability of the v′-th measurement appearing in topic k of measurement dictionary, $\sum_{v'=1}^{V'} \lambda_{k,v'} = 1$

Scientific Programming
Under the given constraint $\sum_{k=1}^{K} \Delta\mu_k = 1$, the following formulas are obtained:
A body condition transition diagram is shown in Figure 3. Therefore, we can obtain the patient's body condition at discharge.
We hypothesize $\hat{T}_m$ as the predicted survival time for patient m after the second phase. $\hat{f}_m$ is the function used in predicting the survival time $\hat{T}_m$.
We predict the probability $\hat{p}_m$ for patient m to survive, where $h_{m,v}$ is an indicator function: when the notes of patient m contain the v-th bigram of the notes dictionary, $h_{m,v} = 1$; otherwise, $h_{m,v} = 0$. Similarly, $h_{m,v'}$ is an indicator function: when the measurements of patient m contain the v′-th measurement of the measurement dictionary, $h_{m,v'} = 1$; otherwise, $h_{m,v'} = 0$. $\eta_{v,k}$ is the regression coefficient for notes. The v-th bigram of the notes dictionary represents a risk factor when $\eta_{v,k} > 0$, which indicates a shorter survival time, and a protective factor when $\eta_{v,k} < 0$, which indicates a longer survival time. $\eta_{v',k}$ is the regression coefficient for measurements. The v′-th measurement of the measurement dictionary represents a risk factor when $\eta_{v',k} > 0$, which indicates a shorter survival time, and a protective factor when $\eta_{v',k} < 0$, which indicates a longer survival time.
Then, we can obtain formula (27) for $\hat{f}_m$ and formula (28) for $\hat{T}_m$.

To find the optimal $\eta$ that minimizes Diff, we solve $\arg\min_{\eta} \mathrm{Diff}$. Then, we obtain the gradient update from the partial derivatives $\partial\,\mathrm{Diff} / \partial \eta$.

Baselines.
In this experiment, we use two methods as our baselines: LDA [14] and linear regression. For the linear regression model, the meaning of each notation is shown in Table 5.
In the linear regression model, a and b are regression coefficients. The objective of survival time prediction for sepsis patients is to minimize the loss function between $f_m$ and $f^{LR}_m$. Then, we obtain the gradient update as formulas (36), (37), and (38).
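A gradient update of this kind can be sketched for a linear regression baseline. The sketch assumes a single scalar feature per patient, a mean squared error loss, and plain batch gradient descent. The feature, the learning rate, the epoch count, and the function name are hypothetical, and the paper's formulas (36) to (38) may differ in form.

```python
def fit_linear_regression(xs, ys, lr=0.01, epochs=5000):
    """Fit y ~ a * x + b by batch gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of (1/n) * sum((a*x + b - y)^2) with respect to a and b
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy data generated from y = 2x + 1 (illustrative only, not patient data)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_linear_regression(xs, ys)
```

On this noise-free toy data the fitted coefficients approach a = 2 and b = 1. With real patient features the loss would not reach zero, and the learning rate would need tuning.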

Evaluation Criteria.
To evaluate the performance of the proposed method, we use a 3-fold cross-validation scheme. The evaluation metrics adopted in this paper are recall, precision, accuracy, F1, and FPR. They are defined as follows:
Recall (TPR) = TP/(TP + FN)
Precision = TP/(TP + FP)
Accuracy = (TP + TN)/(TP + TN + FP + FN)
F1 = 2 × precision × recall/(precision + recall)
FPR = FP/(FP + TN)
Here, TP indicates true positive, which means predicting a survival time less than or equal to a given time, while the true survival time is less than or equal to the given time; FP indicates false positive, which means predicting a survival time less than or equal to a given time but the true survival time is greater than the given time; TN indicates true negative, which means predicting a survival time greater than a given time, while the true survival time is greater than the given time; and FN indicates false negative, which means predicting a survival time greater than a given time but the true survival time is less than or equal to the given time.
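The definitions above translate directly into code. The sketch below counts TP, FP, TN, and FN for a given time horizon and computes the five metrics. The function names and the toy survival times are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the paper.

```python
def confusion_counts(predicted_days, true_days, horizon):
    """Count TP, FP, TN, FN, where 'positive' means survival time <= horizon."""
    tp = fp = tn = fn = 0
    for pred, true in zip(predicted_days, true_days):
        pred_pos = pred <= horizon
        true_pos = true <= horizon
        if pred_pos and true_pos:
            tp += 1
        elif pred_pos and not true_pos:
            fp += 1
        elif not pred_pos and not true_pos:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

def evaluate(tp, fp, tn, fn):
    """Compute recall, precision, accuracy, F1, and FPR from the four counts."""
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }

# Toy survival times in days (illustrative only), evaluated at a 30-day horizon
predicted = [5, 40, 20, 10, 35, 3]
actual = [7, 25, 45, 12, 50, 60]
counts = confusion_counts(predicted, actual, horizon=30)
scores = evaluate(*counts)
```

For the toy data, `counts` is (2, 2, 1, 1) in the order TP, FP, TN, FN, which gives a recall of 2/3, a precision of 1/2, and an F1 of 4/7.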

Dataset.
The dataset used in the experiments is the public Medical Information Mart for Intensive Care (MIMIC-III) [26]. MIMIC-III had been used in 845 publications as of the end of August 2019 [27]. The version of MIMIC is MIMIC-III v1.4, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates. The data span June 2001 to October 2012. We used a dataset of 2,487 deceased adult (age > 14) sepsis patient records from MIMIC-III. The data processing flowchart is shown in Figure 4.
Patient features include text data, class data, and numerical data.
(1) Text data processing: clinical notes are text data. This paper considers the clinical notes as a set of bigrams. We calculate the TF-IDF value of every bigram after removing the stopwords. Then, we sort the bigrams by TF-IDF and select the top 3,000 bigrams as the words for the dictionary.
(2) Numerical data processing: we calculate the mean and standard deviation (std) of the numerical data and divide the values into five intervals: (−∞, mean − 1.5 · std), [mean − 1.5 · std, mean − 0.5 · std), [mean − 0.5 · std, mean + 0.5 · std), [mean + 0.5 · std, mean + 1.5 · std), and [mean + 1.5 · std, +∞). Each interval is used as a word in the measurement dictionary.
(3) Class data processing: each class of data is used as a word in the measurement dictionary.
Tables 6-9. In Table 9, we can see that 5 topics (K = 5) achieve the best performance for predicting the probability for 30-day survival time. The number of topics is chosen by a 3-fold cross-validation scheme. We use the average value over the 3 folds to select the best number of topics. It can be seen that the best number of topics for sepsis patients is 5. The F1 score of 5 topics improved by 3.55% (in-hospital death), 3.88% (7 days), 2.56% (14 days), and 1.99% (30 days) compared to 20 topics. A possible reason is that there are many words in the dictionary, and the number of topic-bigram combinations increases sharply as the number of topics increases. This leads to model overfitting, and the F1 score decreases.
From in-hospital death to 7-day, 14-day, and 30-day survival, the F1 score increases, as shown in Table 14. For longer survival times, the patient's condition remains steady, and the clinician intervenes less. Therefore, the F1 score is higher when predicting longer survival times. Figure 5 presents comprehensive performance comparisons of TiCTM (different topics) with LDA (K = 5, 10) and the linear regression model using ROC. In Figure 5, we can see that the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models.
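The text and numerical preprocessing steps (1) and (2) described in this section can be sketched as follows. The sketch makes several assumptions that the paper does not state: notes are already tokenized, a bigram's score is the sum over notes of term frequency times log(N / document frequency), and the mean and std are supplied by the caller. The function names, the toy notes, and the toy stopword list are hypothetical.

```python
import math
from collections import Counter

def bigrams(tokens, stopwords):
    """Remove stopwords, then pair each remaining token with its successor."""
    kept = [t for t in tokens if t not in stopwords]
    return list(zip(kept, kept[1:]))

def top_bigrams(docs, stopwords, top_n):
    """Rank bigrams by summed TF-IDF and return the top_n as the dictionary."""
    doc_bigrams = [bigrams(doc, stopwords) for doc in docs]
    df = Counter()                      # number of notes containing each bigram
    for bg in doc_bigrams:
        df.update(set(bg))
    n_docs = len(docs)
    scores = Counter()
    for bg in doc_bigrams:
        for term, count in Counter(bg).items():
            tf = count / len(bg)        # ASSUMPTION: length-normalized term frequency
            scores[term] += tf * math.log(n_docs / df[term])
    return [term for term, _ in scores.most_common(top_n)]

def interval_index(value, mean, std):
    """Map a numerical value to one of the five intervals (0..4) around the mean."""
    edges = [mean - 1.5 * std, mean - 0.5 * std, mean + 0.5 * std, mean + 1.5 * std]
    # Intervals are closed on the left, so count the edges that value reaches
    return sum(value >= e for e in edges)

# Toy tokenized notes (illustrative only)
docs = [["patient", "has", "septic", "shock"],
        ["septic", "shock", "with", "fever"],
        ["patient", "has", "fever"]]
stopwords = {"has", "with"}
dictionary = top_bigrams(docs, stopwords, top_n=1)
bins = [interval_index(v, mean=100.0, std=10.0) for v in (80, 85, 100, 105, 120)]
```

Each interval index can then be combined with the measurement name to form one word of the measurement dictionary, so that notes and measurements are both represented as discrete tokens.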

Conclusions
In this paper, we propose a time-critical topic model (TiCTM), which combines a patient's measurement with notes, to predict the survival time for adult sepsis patients in the ICU. We consider useful sequential information for predicting the survival time of sepsis patients, thereby increasing the prediction accuracy. Our experimental results show that the proposed TiCTM has the best performance when predicting the probability for 30-day survival time using 5 topics. In addition, the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models. In the future, our study will focus on the explainable machine learning model for predicting survival time in the ICU [28][29][30][31].

Data Availability
The data used to support the findings of this study are available from https://mimic.physionet.org.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.