The paper discusses different approaches to building a medical decision support system based on big data. The authors sought to avoid any data reduction and to apply universal teaching and big data processing methods that are independent of disease classification standards. The paper assesses and compares the accuracy of recommendations produced by three options: case-based reasoning, a simple single-layer neural network, and a probabilistic neural network. Further, the paper substantiates the assumption regarding the most efficient approach to solving the specified problem.
Providing support to medical decision-making is one of the most urgent issues in healthcare automation. It has been repeatedly noted in different articles, reports, and forum discussions [
Butko and Olshansky [
Malykh et al. [
Decision-making in hospitals has evolved from being opinion-based to being based on sound scientific evidence. This decision-making is recognized as evidence-based practice. Perpetual publication of new evidence combined with the demands of everyday practice makes it difficult for health professionals to keep up to date [
A large number of publications are devoted to medical decision support systems (DSSs), including publications in specialized scientific journals (
A number of contemporary approaches to medical decision support system development are listed by Malykh et al. [
The first one of these approaches involves provision of relevant data sources to doctors, helping them make decisions independently. The system does not recommend any final solutions—instead, it suggests data sources to study and find answers to current questions (Evidence-Based Clinical Decision Support at the Point of Care | UpToDate URL:
The second approach is to use clinical pathways. Clinical pathways represent prescriptive models of the standard healthcare procedures that need to be undertaken for a specific patient population. Instances of the clinical pathways (also known as cases) describe the actual diagnostic-therapeutic cycle of an individual patient [
The third approach involves development of a large number of individual narrow-focused decision support systems. This approach helps achieve top quality when solving isolated problems [
The fourth approach that claims to have a global scope of application is focused on building a cognitive system capable of self-learning and knowledge digestion directly from nonformalized text sources (IBM Watson
None of the reviewed approaches is flawless. All of them require expert effort and regular updates of knowledge bases. Moreover, each of the approaches is in fact tailored to specific purposes.
The latest Russian-language review [
In this paper, we will review general approaches to decision support system development based on nonreduced big clinical data. The main expectations related to application of general approaches ensue from the case-based nature of decision-making in healthcare, and the assumption that big clinical data already contain enough knowledge for efficient decision-making.
Two other factors draw attention to systems based on machine learning or the precedent-based approach.
The first is the set of trends in the development of our civilization, which include the explosive development of information technologies (among them M2M, Big Data, and IoT), their strong need for formalized knowledge, and the practical absence of qualified experts who could formalize that knowledge. The chief editor of the Rational Enterprise Management (REM) magazine (Russia) holds regular discussions on a wide range of problems, including those mentioned above. The results of the discussions are published in the REM editor’s column. The guests of a recent discussion [
As for the second factor, qualified experts in knowledge formalization are evidently lacking today, even in key fields. The actual situation is even more critical, because the experts who are able to solve at least part of these problems cannot cope with the ever-increasing information flow. From this point of view, precedent-based DSSs need practically no experts. Experts may be needed for enhancing or optimizing existing medical databases and knowledge bases [
We regard the diagnostic and treatment process (DTP) as a discrete controlled process with a memory. The model was first introduced by Malykh et al. [
Modern medical information systems store electronic medical records and contain descriptions of millions of various clinical cases. The degree of formalization of clinical data stored in MISs varies. MISs model the diagnostic and treatment process as a sequence of controlling events reflecting diagnostic and treatment activities, and a sequence of monitoring events describing the condition of the patient. Controlling events are well formalized; medical organizations keep statistical and business records of such events, plan them, and allocate required resources. Medical data related to monitoring of patients’ condition are less formalized and may be partly available in the form of plain text medical documents.
Previous studies provide evidence that it is possible to model the DTP using controlled stochastic Markov processes [
The choice of control (X
DTP modeling based on the Markov process appears sufficiently substantiated [
Thus, in the model, the DTP is represented by a sequence of vectors of equal length and structure V split into two components—control U and monitored properties X. Control components have non-negative numerical values. A zero value of control at this stage of the process means that this kind of control has never been applied before, starting from the beginning of the process and up until this step inclusively. Components of monitored properties are of different nature. They can be dimensional physical values or non-numerical, for example, assignment of a property’s value to a specific class. Since it is almost impossible to monitor all the properties at the same time, certain components of properties may be unknown to us. When applying different methods to the model, we may need to digitize non-numerical values of components and identify missing values of monitored properties.
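The state vector described above can be illustrated with a minimal Python sketch. It assumes that control components accumulate over the steps of the process (so that a zero means "never applied so far"), that missing monitored values are stored as NaN, and that non-numerical values are digitized through a lookup table. The names `encode_property`, `build_states`, and `class_codes` are hypothetical and are not taken from the paper.

```python
import numpy as np

def encode_property(value, class_codes):
    """Map one monitored property to a float; None (not observed) becomes NaN."""
    if value is None:
        return np.nan
    if isinstance(value, str):
        # Non-numerical value: digitize it through the (assumed) class lookup.
        return float(class_codes[value])
    return float(value)

def build_states(step_controls, step_properties, class_codes):
    """Build DTP state vectors V = (U, X), one row per step.

    step_controls: amounts of each control applied at each step (non-negative).
    step_properties: monitored values per step; None marks an unknown value.
    """
    controls = np.asarray(step_controls, dtype=float)
    if (controls < 0).any():
        raise ValueError("control components must be non-negative")
    # Assumption: U holds cumulative amounts, so U == 0 means the control
    # has not been applied from the start of the process up to this step.
    cumulative = np.cumsum(controls, axis=0)
    monitored = np.array(
        [[encode_property(v, class_codes) for v in row] for row in step_properties],
        dtype=float,
    )
    return np.hstack([cumulative, monitored])
```

Under these assumptions, every row has the same length and structure, and the positions of NaN entries show which monitored properties still have to be identified before a method that needs complete vectors can be applied.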
We will review several methods that can be applied to build a cybernetic taught system. The input into the system will be a sequence of vectors describing a discrete DTP in accordance with the presented model. The output will consist of recommendations proposing diagnostic and treatment options (choice of controls) for this particular state of the process. A diagram of the system is presented in Figure
Let us define the objective more accurately and assume that each DTP model is considered in the context of an already available predominant diagnosis. For each model, we have an array of earlier observed DTP implementations. Such implementations are sources of knowledge about treatment of a particular nosology, and they are used to teach a cybernetic recommender system to operate in the given context. Based on available DTP implementations, we defined a glossary of controls and monitored properties for each model. Issues related to normalization of primary data, outlier testing and exclusion, and approaches to data generalization based on assignment of monitored properties to generic classes are beyond the scope of this paper [
Finally, let us provide examples of typical properties of nonreduced primary data. We believe that a process ensemble in a data bank may reach 10^3–10^6 processes for an individual nosology. The dimension of a vector describing one step of a discrete DTP exceeds 10^3. The dimension of a control (output of the cybernetic system) may also exceed 10^3.
The case-based approach, including its application to medical decision support, has been described in sufficient detail in multiple sources [
Malykh et al. [
We have a network and each node in it is presented by a single DTP state. Each individual DTP represents a specific route within the network (routes are marked in Figure
Here is how the recommender system operates. The input into the system is the current state of the DTP; the situation in which the input contains the entire implemented sequence of process states is beyond the scope of this paper. Several nodes are randomly selected on the small-world graph (R1 in the example presented in Figure
It is easy to assess the scale of the network in question. In an example with 1,000 processes for one main nosology and an average process duration of ten days, we need 10,000 network nodes. Each node stores a vector of dimension 1,000 or higher. Computational experiments show that 0.5–1% of the total number of nodes is sufficient as random initial nodes; with 10,000 nodes, this gives 50–100 initial nodes. The descent along the small-world graph was quick: routes did not exceed 10 steps on average. Each node in the small-world graph had 8 outgoing edges. The upper-bound estimate of the number of metric calculations in this case is 100 × 10 × 8 = 8,000. The calculations can be accelerated by splitting the small-world graph into layers corresponding to specific DTP lengths and searching for the closest neighbors only within the layer corresponding to the input state. In the above example, each layer would consist of 1,000 states, and we would search for the closest neighbors starting from 5 to 10 randomly selected nodes. This is fully acceptable in terms of computational requirements: computational experiments show that, in this case, computations can be performed in near real time.
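The descent from random initial nodes can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean distance stands in for the metric, and a brute-force k-nearest-neighbor graph stands in for the small-world graph; neither is claimed to be the construction used in the experiments. The function names are hypothetical.

```python
import numpy as np

def build_knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph (a stand-in for the small-world graph)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a node is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]  # k outgoing edges per node

def greedy_search(points, neighbors, query, starts):
    """From each start node, move to the closest neighbor until no neighbor is closer."""
    evaluations = 0
    best_node, best_dist = None, np.inf
    for node in starts:
        node = int(node)
        dist = float(np.linalg.norm(points[node] - query))
        evaluations += 1
        while True:
            cand = neighbors[node]
            cand_dist = np.linalg.norm(points[cand] - query, axis=1)
            evaluations += len(cand)
            j = int(np.argmin(cand_dist))
            if cand_dist[j] >= dist:         # local minimum reached
                break
            node, dist = int(cand[j]), float(cand_dist[j])
        if dist < best_dist:
            best_node, best_dist = node, dist
    return best_node, best_dist, evaluations

# Upper bound from the text: 100 start nodes, routes of 10 steps, 8 edges per node.
upper_bound = 100 * 10 * 8  # 8,000 metric calculations
```

The `evaluations` counter makes it possible to compare the actual number of metric calculations with the 100 × 10 × 8 estimate for a given data set. A greedy descent of this kind returns a local minimum, which is one reason for starting from several random nodes.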
Let us review the network teaching process. Teaching means adding new DTP implementations to the network. The number of metric calculations
As an alternative approach, let us consider a basic neural network with a single layer. The structure of the network is outlined in Figure
Current DTP state is used as input to a basic one-layer neural network. The network contains
Let us use the network scale as an example. Let the dimension of the input vector be 1,000 and that of the control component be 500. In that case, the teaching process involves determining 1,000 × 500 = 500,000 weights. Note that the neural network cannot be substantially reduced for this problem, because the dimension of the control component is the number of diagnostic and treatment activities that can be prescribed for this nosology, including coexisting illnesses, and this number is enormous. Adding new layers to the neural network will only make matters worse by increasing the number of taught parameters.
Let us examine the network teaching process. Initially, a certain set of DTPs is selected and used for network teaching purposes, including calculation of weights. New DTP implementations emerge. How should we use this new knowledge? If a sufficiently large volume of DTP implementations was used to teach the network (1,000 to 10,000) and new implementations constitute an insignificant share of the teaching sample (e.g., 100 new implementations versus 10,000 is merely 1%), it can be asserted that network re-teaching will not result in any noticeable changes in teaching parameters, and consequently, any major variations in the network’s output. This kind of network is rough and conservative; it can “digest” new knowledge only when the volume of such is sufficient. In this respect, neural networks are not as good as networks applying the case-based approach.
As another alternative approach, let us consider a probabilistic neural network. The structure of the network is outlined in Figure
Figure
Now, a probability density function can be “restored” for each class. For input vector In, we apply Bayes’ formula to calculate the posterior probability of belonging to each class and generate recommendations regarding the choice of diagnostic and treatment activities for this state.
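A minimal sketch of this step for a single control is given below. It assumes Gaussian kernels with a common width `sigma`, class priors equal to class frequencies in the teaching sample, and two classes per control (applied or not applied). The function name and the parameter `sigma` are hypothetical and are not taken from the paper.

```python
import numpy as np

def pnn_posterior(train_x, train_labels, x, sigma):
    """Posterior class probabilities from Gaussian Parzen-window densities.

    With priors N_c / N and densities (1 / N_c) * sum of kernels in class c,
    Bayes' formula reduces to the share of the kernel mass that falls in class c.
    """
    train_x = np.asarray(train_x, dtype=float)
    train_labels = np.asarray(train_labels)
    sq_dist = np.sum((train_x - np.asarray(x, dtype=float)) ** 2, axis=1)
    kernels = np.exp(-sq_dist / (2.0 * sigma ** 2))   # one kernel per stored state
    classes = np.unique(train_labels)
    scores = np.array([kernels[train_labels == c].sum() for c in classes])
    total = scores.sum()
    if total == 0.0:
        # All kernels underflowed: fall back to a uniform posterior.
        return classes, np.full(len(classes), 1.0 / len(classes))
    return classes, scores / total
```

For example, with stored states `[[0, 0], [0, 1], [5, 5], [5, 6]]` labeled `[0, 0, 1, 1]` and `sigma = 1`, an input near the first two states receives almost all posterior mass for class 0. Repeating the calculation for every control component gives one recommendation per diagnostic or treatment activity.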
Let us use the network scale as an example. Let the dimension of the input vector be 1,000, the dimension of the control component be 500, and the teaching sample contain 1,000 processes with 10 states in each. We will need to calculate 10,000 kernel functions and then 1,000 posterior probabilities of the input vector belonging to each of the 500 × 2 different classes, for various distributions of kernel function supports.
Let us examine the network teaching process. The teaching process is focused on adding new DTP implementations to the network, including assignment of states to different classes. If the number of new implementations is a small share of the teaching sample used earlier, it can be asserted that adding new implementations will have no major impact on the network’s output. The probabilistic neural network proves to be rough and conservative; it can “digest” new knowledge only when the volume of such is sufficient. In this respect, probabilistic neural networks are not as good as networks applying the case-based approach.
Recommender system.
Structure of a case-based system.
Neural network.
Probabilistic neural network.
Impact of control parameter
In 2015–2016, we performed computational experiments for a network built using the case-based approach. The results were published in Malykh et al. [
Table
Accuracy assessment of recommended diagnostic and treatment activities for seven nosologies using the case-based approach.
| ICD-10 (MKB-10) code/nosology | Number of states/number of controlled variables | Correct recommendations among control precedents (absolute value/share in the total number of diagnostic and treatment activities) | Recommendations with a different control level among control precedents (absolute value/share in the total number of diagnostic and treatment activities) | Diagnostic and treatment activities for which the decision support system was unable to provide recommendations among control precedents (absolute value/share in the total number of diagnostic and treatment activities) |
|---|---|---|---|---|
| J13/pneumonia due to | 2938/118 | 6788/81.6% | 3923/47.2% | 1530/18.4% |
| K80.1/calculus of gallbladder with other cholecystitis | 12853/931 | 34468/76.7% | 18390/40.9% | 10490/23.3% |
| H25.1/age-related nuclear cataract | 5509/293 | 3522/94.9% | 539/14.5% | 189/5.1% |
| H26.2/complicated cataract | 5778/249 | 4362/91.4% | 1617/33.9% | 408/8.6% |
| I67.4/hypertensive encephalopathy | 23165/1431 | 65678/72.4% | 37563/41.4% | 25060/27.6% |
| I67.9/cerebrovascular disease, unspecified | 24875/1518 | 58649/75.4% | 32447/41.7% | 19117/24.6% |
| N20.1/calculus of ureter | 15922/205 | 17489/58.7% | 9948/58.7% | 12291/41.3% |
In the matter of neural networks, computational experiments for all nosologies listed in Table
Accuracy assessment of recommended diagnostic and treatment activities for nosology J13 based on a single-layer neural network.
| ICD-10 (MKB-10) code/nosology | Number of neural network inputs/number of neural network outputs (number of controlled variables) | Correct positive recommendations among control precedents (absolute value/share in the total number of positive recommendations) | Incorrect positive recommendations among control precedents (absolute value/share in the total number of positive recommendations) | Share of correct negative recommendations/share of correct positive recommendations |
|---|---|---|---|---|
| J13/pneumonia due to | 224/222 | 339/40.31% | 502/59.69% | 98.55%/40.31% |
Note that the volume of statistics on this illness stored in the database has grown since the earlier experiment involving the same nosology, from 166 to 266 completed clinical processes. Controls included all types of drug prescriptions (222 different pharmaceutical products in our case). Data normalization consisted of converting prescribed dosages of pharmaceutical products to unified dose units. The only monitored variable was “inpatient days.” The inputs also included a bias term, giving 224 inputs and 222 outputs, so 224 × 222 = 49,728 weights had to be determined. The optimized target function was the quadratic residual between the neural network output and the control components observed in the control samples, scaled to the interval (0, 1). We used a nonstandard, bell-shaped (Gaussian) neuron activation function. This choice was motivated by the fact that the integral values of many controls have apparent limits stipulated by Russian federal healthcare standards (standards of the Russian Ministry of Health). Different insurance programs also limit the integral values of controls. Healthcare providers do not exceed these limits unless they find it necessary. Formally, with respect to the model, this means that once an integral property of a control reaches a certain limit, it stops growing or further growth is highly unlikely. The gradient of the target function with respect to the weights was calculated explicitly, and the steepest descent method was applied. Teaching included 1,006 descent steps. Criteria reflecting the accuracy of the neural network are presented in Table
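The teaching procedure described in this paragraph can be sketched in Python as follows. The sketch is illustrative only: the exact form of the Gaussian activation, the weight initialization, the step-halving rule, and the reading of the activation threshold as a cut-off on the network output are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def loss_and_grad(W, X, T):
    """Quadratic residual and its explicit gradient for outputs y = exp(-(W x)^2)."""
    Z = X @ W.T                        # pre-activations, shape (samples, outputs)
    Y = np.exp(-Z ** 2)                # assumed bell-shaped (Gaussian) activation
    R = Y - T                          # residual against targets scaled to (0, 1)
    loss = 0.5 * np.sum(R ** 2)
    grad = (R * (-2.0 * Z * Y)).T @ X  # chain rule, using dY/dZ = -2 Z Y
    return loss, grad

def train(X, T, steps=100, seed=0):
    """Descent along the negative gradient with step halving (assumed step rule)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(T.shape[1], X.shape[1]))
    loss, grad = loss_and_grad(W, X, T)
    history = [loss]
    for _ in range(steps):
        step = 1.0
        while step > 1e-12:
            candidate = W - step * grad
            new_loss, new_grad = loss_and_grad(candidate, X, T)
            if new_loss < loss:        # accept the first step that lowers the loss
                W, loss, grad = candidate, new_loss, new_grad
                break
            step *= 0.5
        history.append(loss)
    return W, history

def recommend(W, x, threshold=0.1):
    """Assumed decision rule: recommend a control when its output exceeds the threshold."""
    return np.exp(-(W @ x) ** 2) > threshold
```

In this layout the weight matrix has one row per output and one column per input (bias included), so a network with 224 inputs and 222 outputs has 224 × 222 = 49,728 weights, in agreement with the figure given above.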
Accuracy of recommended diagnostic and treatment activities for nosology J13 based on a single-layer neural network with an activation threshold equal to 0.1.
| Measure | Absolute value (neuron activation threshold equal to 0.1) | Percent (neuron activation threshold equal to 0.1) |
|---|---|---|
| TP | 339 | 40.31% |
| FP | 502 | 59.69% |
| TN | 35,052 | 98.55% |
| FN | 515 | 1.45% |
TP, true positive; FP, false positive; TN, true negative; FN, false negative.
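The percentages in the table above appear to be shares within each type of recommendation: TP and FP are divided by the number of positive recommendations, and TN and FN by the number of negative recommendations. The short Python check below reproduces them from the absolute values; the function name `recommendation_shares` is hypothetical.

```python
def recommendation_shares(tp, fp, tn, fn):
    """Shares of correct and incorrect recommendations within each predicted class."""
    positive_total = tp + fp   # all positive recommendations
    negative_total = tn + fn   # all negative recommendations
    return {
        "TP": tp / positive_total,
        "FP": fp / positive_total,
        "TN": tn / negative_total,
        "FN": fn / negative_total,
    }

shares = recommendation_shares(tp=339, fp=502, tn=35052, fn=515)
percent = {name: round(100 * value, 2) for name, value in shares.items()}
# percent == {"TP": 40.31, "FP": 59.69, "TN": 98.55, "FN": 1.45}
```

Because negative recommendations far outnumber positive ones, the share of correct negative recommendations dominates any overall accuracy figure, which is why the two shares are reported separately.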
The relevant receiver operating characteristic (ROC) error curve is shown in Figure
ROC error curve.
Results of the experiment based on a probabilistic neural network are presented in Table
Accuracy of recommended diagnostic and treatment activities for nosology J13 based on a probabilistic neural network with
| Measure | Absolute value | Percent |
|---|---|---|
| TP | 233 | 55.0% |
| FP | 191 | 45.0% |
| TN | 35,376 | 98.31% |
| FN | 608 | 1.69% |
The focus of this paper was how to build a medical decision support system based on big clinical data. The authors reviewed general approaches to the problem that involve neither individual models for specific nosologies nor the engagement of subject-area experts in such modeling or in knowledge extraction from data. Data are extracted from the MIS without reduction, “as is.” It is assumed that the data contain significant information reflecting medical knowledge and contemporary medical treatment technologies. Three different approaches to big clinical data processing were examined: (1) case-based reasoning for decision-making; (2) decision-making based on a single-layer neural network; and (3) decision-making based on a probabilistic neural network. Experimental calculations were performed to assess the accuracy of recommendations generated using the different approaches.
Drawbacks of the above neural networks with respect to the given problem were identified. The overall accuracy of provided recommendations was rather high. Moreover, the accuracy of negative recommendations that the neural networks learned to provide was very high (98–99%). However, the accuracy of positive recommendations provided by the neural networks was not so high (40–55%, which is obviously insufficient for successful practical application). Another disadvantage of neural networks is their rough and conservative nature, particularly when digesting isolated portions of new data with the volume insignificant compared to previously available data.
The case-based approach to decision-making yielded more accurate recommendations (59–95%), which is sufficient for its successful practical application. Another advantage of the case-based approach is its sensitivity to new data. With respect to calculations, the case-based approach is also more efficient compared to other options under consideration as it ensures a high operating speed of the decision support system, thus making it acceptable for practical application. These are the key findings of the study conducted.
This offers encouraging prospects for designing and developing decision support systems for physicians based on empirical components of medical knowledge. This approach also corresponds to the existing case-based character of management and decision-making in medical practice. So far, the results indicate that the precedent-based approach is highly effective and could naturally enhance other approaches to supporting physicians’ decision-making, particularly knowledge-based ones. The obvious practical value of this approach lies in the fact that it can be complementary to other knowledge-based approaches (clinical pathways, Evidence-Based Clinical Decision Support, expert systems, Watson, etc.). The doctor will be able to make decisions based on the best examples of medical practice by finding precedents of clinical cases close to the given case.
The constraints of the precedent-based approach include the need for a representative database of verified precedents that excludes medical errors. From another perspective, precedents with corrected errors are of particular interest for physician training and for the prevention of such errors. Information about the consequences of these errors and possible ways of correcting them is also valuable. Thus, the precedent-based approach could be widely used as an educational tool. On the other hand, the precedent-based approach does not imply formalization of medical knowledge, which entails poor cognitive justification of the generated recommendations: justifications only describe how other patients were treated in similar clinical cases. There are also problems with the optimization of the metrics used, the compression of state descriptions, and the construction of training procedures. These problems stem from the high dimensionality of the space of state characteristics and of the samples of clinical precedents. However, discussion of these issues and possible ways of addressing them has been left outside this research [
In further studies, we are going to focus on the detailed application of the case-based approach, analyze metrics and distances not only for pairs of vectors but also for pairs of vector sequences, and examine issues related to intelligent normalization of primary data and data extraction from the plain text of medical documents.
UDC 007.52 (Automatically operated systems without any humans among system links, robots, and automated machines).
The authors declare that they have no conflicts of interest.
The authors would like to thank Professor V. M. Khachumov, a doctor of engineering sciences, and V. P. Fralenko, a candidate of engineering sciences, for their consulting support on neural networks and teaching methods, as well as Professor N. N. Nepevoda, a doctor of physical and mathematical sciences, for the discussion and assessment of the outcomes. Some of the outcomes presented in the paper were achieved earlier with the support of the Ministry of Education and Science of the Russian Federation (Project RFMEFI60714X0089) and in the context of Grant 13-07-12012 provided by the Russian Foundation for Basic Research.