Cost Control of Treatment for Cerebrovascular Patients Using a Machine Learning Model in Western China

Background. Cerebrovascular disease has been the leading cause of death in China since 2017, and the control of medical expenses for these diseases is an urgent issue. Diagnosis-related groups (DRG) are increasingly being used to decrease the costs of healthcare worldwide. However, the classification variables and rules used vary from region to region. Of these variables, the question of whether the length of stay (LOS) should be used as a grouping variable is controversial.
Aim. To identify the factors influencing inpatient medical expenditure in cerebrovascular disease patients. The performance of two sets of classification rules, and the effects of the extent of control of unreasonable medical treatment, were compared to investigate whether the classification variables should include LOS.
Methods. Data from 45,575 inpatients from the Healthcare Security Administration of a city in western China were used. Kruskal–Wallis H tests were used for single-factor analysis, and multiple linear stepwise regression was used to determine the main factors. A chi-squared automatic interaction detector (CHAID) algorithm was used to build a decision tree model for grouping related data. The intensity of control of the oversupply of services was increased step by step from 10% to 100%, and the performance was calculated for each group.
Results. The average hospitalization cost was 1,284 US dollars, and the total was 51.17 million US dollars. Of this, 43.42 million were paid by the government, and 7.75 million were paid by individuals. Factors including gender, age, type of insurance, level of hospital, LOS, surgery, therapeutic outcomes, main concomitant disease, and hypertension significantly influenced inpatient expenditure (P < 0.05). Incorporating LOS, the patients were divided into seven DRG groups, while without LOS, the patients were divided into eight DRG groups. More clinical variables were needed to achieve good results without LOS. Of the two rule sets, a smaller coefficient of variation (CV) and a lower upper limit for patient costs were found in the group including LOS. Using this type of economic control, 3.35 million US dollars could be saved in one year.


Introduction
Cerebrovascular disease and its complications are the leading cause of disability and death worldwide. Of all the diseases of the nervous system, cerebrovascular diseases have the greatest impact on disability and produce the highest economic burden [1][2][3]. Since 2017, this disease has been the leading cause of death in China [4]. The number of people suffering from cardiovascular and cerebrovascular diseases in China was 330 million in 2019, and these diseases are the leading cause of death among urban and rural residents [5]. In 2017, the total cost of treating cerebrovascular diseases in China reached 83.83 billion US dollars, ranking first among all diseases and accounting for 17% of the total medical cost of treating diseases, equivalent to 0.66% of GDP [6]. In this study, one city alone spent 51.17 million US dollars in a year on these diseases. In the face of so much economic pressure, the government must take effective action to reduce the economic burden of cerebrovascular diseases.
Diagnosis-related groups (DRG) are one of the most advanced medical payment management methods, aiming to reduce inefficiency and contain costs [7]. Based on factors such as a patient's demographic information, diagnosis, and disease severity, DRG-based payment systems group patients with similar clinical attributes requiring similar care, providing the necessary framework to aggregate patients into case types or products, which entail the use of similar resources [8]. DRG adopt a standard pricing framework for a single disease group [9] and provide equity in payments across healthcare providers for services of the same kind. Most studies have found DRG to have positive effects on controlling medical expenses and reducing the economic burden among patients [10]. Studies into cerebrovascular diseases have found that DRG can effectively reduce unreasonable costs incurred in the treatment of cerebrovascular diseases [11,12]. However, the rules of the grouping vary between countries and regions; for example, length of stay (LOS) is widely used as a statistical classification index in research into DRG management in Poland, Britain, and other developed countries [10]. Japan uses LOS as a secondary parameter [9]. However, Finland and Sweden do not consider LOS [13].
China Healthcare Security Diagnosis-Related Groups (CHS-DRG) are the unified grouping standard used by the national pilot city [14]. Due to the unbalanced development of China's economy, the Chinese government requires cities to develop localized grouping rules based on their actual conditions, so there are variations of DRG payment policy design and grouping rules across China [15]. Beijing Diagnosis-Related Groups (BJ-DRG) are the earliest localization group in China; Beijing built Chinese Diagnosis-Related Groups (CN-DRG) following the model of the All-Patient Diagnosis-Related Groups (AP-DRG) in the USA, and Shanghai built a Shanghai-DRG and National standards for paying fees according to DRG (C-DRG) based on the Australia Refined DRG (AR-DRG). However, these grouping methods are all based on the data collected from the first-tier developed cities in China, and there is no research into the underdeveloped cities in the west of the country. It is inappropriate for cities in the west to use the same rules, due to the unbalanced economic and technological development in China [16]. None of those grouping rules take into account the LOS, unlike most countries in Asia, which incorporate LOS [17].
In this study, we collected data from an underdeveloped city in western China. Machine learning was used to group patients with similar costs, and two sets of rules were built, one incorporating LOS and the other without LOS. We compared the performance of the grouping rules based on the coefficient of variation (CV) to assess the heterogeneity within a group, as has been done in previous studies [8]. We identified the outliers in each group and considered them to represent unreasonable costs. Finally, we tried to control these costs to different extents. This study fills the gap in previous studies, which have only focused on developed cities and which use CV as the standard measure of the results of grouping. In our study, underdeveloped cities and control performance were considered. The rest of this paper is organized as follows. In Section 2, we introduce our materials and methods. In Section 3, we present our results, including general information and inpatient medical expenditure, single and multiple factor analysis of the factors influencing inpatient medical expenditure, the results of two sets of rules for DRG grouping, medical expenses in different DRG, and payment method adjustment results. In Section 4, we discuss the results. Section 5 concludes this study and provides a description of directions for future research.

Patient Data.
The data used in this research were collected from the Healthcare Security Administration of a city in western China during 2018. The data included medical records and cost information related to 93,185 inpatients with cerebrovascular diseases (ICD-10: I60-I69) as the principal diagnosis, all of which fell under the major diagnostic category (MDC) of diseases and dysfunction of the nervous system (MDCB). Original information on these patients included 58 variables, such as gender, age, LOS, cost of hospitalization, payment of medical insurance, and type of insurance.

Data Cleaning.
In the first step of data cleaning, we selected data from only the comprehensive tertiary and secondary grade hospitals. Patients from township hospitals, community hospitals, and school hospitals were removed. As a second step, we eliminated outliers in costs [8] and patients younger than 18 years of age. Finally, patients who were not hospitalized in our study city but were reimbursed by the city's Medical Insurance Bureau were excluded. Valid data from a total of 45,575 patients were obtained after screening.

Statistical Analysis and Data Grouping.
The proportions of the training set and the test set were 80% and 20%, respectively. First, the grouping rules were built on the training set, and the grouping performance was then checked using the test set. Finally, all the data were passed through the grouping rules and analyzed.
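The 80/20 split described above can be illustrated with a short sketch. It is written in Python purely for illustration (the study itself used R), and the function name, the fixed seed, and the placeholder patient IDs are assumptions rather than details taken from the study.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle the records and return (train, test) lists.

    The seed is fixed only so that the sketch is reproducible; the
    study does not report how its random split was seeded.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers standing in for the 45,575 cleaned records.
patient_ids = list(range(45575))
train, test = split_train_test(patient_ids)
print(len(train), len(test))
```

Repeating the call with different seeds gives a simple way to rerun the grouping on several random splits and check that the results are stable.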
Kruskal-Wallis tests were used for single factor analysis to determine the factors influencing hospitalization expenses. Values of P < 0.05 were considered to be statistically significant [18]. Stepwise multiple generalized linear regression was used for variance analysis [19]. The medical costs for different subgroups were calculated, and the statistically significant variables with the greatest impacts on medical costs were selected for grouping analysis.
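For readers unfamiliar with the Kruskal-Wallis test, the following sketch computes the H statistic with the standard tie correction. It is an illustrative Python implementation, not the code used in the study, and the cost values are invented. The P value is obtained by comparing H with a chi-squared distribution with k - 1 degrees of freedom, where k is the number of subgroups; in practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees_of_freedom) for a list of non-empty samples.

    Tied values receive their average rank, and H is divided by the
    usual tie-correction factor. The function assumes that not all
    pooled values are identical (otherwise the correction is zero).
    """
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Sorted positions i..j-1 share one value: ranks i+1..j, averaged.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total**3 - n_total)
    return h / correction, len(groups) - 1


# Invented hospitalization costs (US dollars) for three subgroups.
costs_by_subgroup = [
    [500, 620, 710],
    [900, 1100, 1300],
    [1500, 1800, 2400],
]
h_stat, dof = kruskal_wallis_h(costs_by_subgroup)
print(h_stat, dof)
```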
The Chi-Squared Automatic Interaction Detection (CHAID) algorithm was used to establish the combination of DRG [10,20]. In the selection of grouping variables, we considered both the inclusion and exclusion of LOS and compared the CV and the percentage of outliers. We considered a CV value of less than 1 to indicate no heterogeneity within a group, as has been done in previous studies [8]. We regarded outliers as representing unreasonable medical treatment and calculated the variation in unreasonable medical costs among different participants under different degrees of control. We used inpatient hospitalization expenditure as the dependent variable, and the variables selected by the generalized linear stepwise model were set as the independent variables. LOS was shown to have a significant positive influence on medical expenditure. In order to further investigate the grouping performance of LOS, we built two decision tree models. The first model used the LOS as a classification variable, and the second model omitted the LOS. We conducted more than ten random trials using randomly sampled data, and the results of each trial were consistent, which indicates that the performance of the algorithm is stable. All analyses were carried out using R.studio 4.0.2 software [21] with the CHAID package [22].
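The homogeneity criterion (CV less than 1) can be made concrete with a small sketch. It is written in Python for illustration only. The use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and the group names and costs are invented.

```python
import statistics


def coefficient_of_variation(costs):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return statistics.stdev(costs) / statistics.mean(costs)


def is_homogeneous(costs, threshold=1.0):
    """Apply the CV < 1 rule used in the text to one DRG group."""
    return coefficient_of_variation(costs) < threshold


# Invented costs (US dollars) for two hypothetical DRG groups.
groups = {
    "group_a": [900, 1000, 1100, 1200, 1300],
    "group_b": [100, 100, 100, 100, 100, 10000],
}
for name, costs in groups.items():
    print(name, round(coefficient_of_variation(costs), 3), is_homogeneous(costs))
```

A group dominated by a single very expensive case, such as the second one here, fails the criterion, which is the kind of within-group heterogeneity the grouping rules are meant to avoid.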

Results
In the following section, we summarize general information about the patients' medical costs in Section 3.1, and single factor and multiple factor analyses are shown in Sections 3.2 and 3.3, respectively. The results of grouping using the two sets of rules based on machine learning are shown in Section 3.4. Finally, the performance of the algorithm using different levels of implementation control is presented in Section 3.5.

General Information and Inpatient Medical Expenditure.
As shown in Table 1, women, individuals over 60 years old, and urban residents accounted for the majority of patients, while men, the elderly, and rural residents had relatively high expenses. Of the patients, 50.18% spent less than nine days in hospital, and 82.26% recovered after hospitalization. Of the patients with complete data, 19,488 (42.76%) were male and 26,087 (57.24%) were female; 1,995 (4.37%) were under the age of 45, 9,117 (20%) were aged between 45 and 60, and 34,463 (75.64%) were older than 65. With respect to residence, 30,243 (66.36%) patients were urban workers, and 15,332 (33.64%) were rural residents. Among them, 24,482 (53.74%) were from a secondary grade hospital, and 21,087 (46.26%) were from a tertiary grade hospital. We also carried out statistical analysis on the effects of LOS, surgery, discharge status, comorbidities and complications (CCs), and grade III hypertension on the distribution of patients' medical expenditure in different subgroups. The average expenditure of these patients was 1,284 US dollars. Among the subgroups, expenditure was higher for males, individuals aged over 65, rural residents, patients from tertiary grade hospitals, patients with an LOS of more than 13 days, patients who underwent surgery, patients who died, and patients with CCs involving insufficiency of blood supply to the cerebral arteries.

Single Factor Analysis of the Factors Influencing Inpatient Medical Expenditure.
In this study, 58 variables were examined using single-factor analysis (Table 1). Ten factors, including gender, age, type of insurance, surgery, LOS, status on discharge, CCs, and grade III hypertension, were shown to be associated with statistically significant differences in hospital expenditure, using Kruskal-Wallis tests (P < 0.01). Expenditure was highest for men, individuals older than 60, rural residents, patients with longer LOS, patients undergoing surgery, patients who died, and patients with CCs.

Multiple Factor Analysis of the Factors Influencing Inpatient Medical Expenditure.
Generalized linear stepwise models were used for multiple regression analysis. Gender, LOS, level of hospital, surgery, status on discharge, type of insurance, CCs, and age had significant impacts on medical expenditure (Table 2). The R-squared value of the model was 0.521, and the kappa value was 12.08, indicating that the model performed well, and there was no multicollinearity between variables. All of these variables could be regarded as reasonable data for DRG grouping.

Two Rules for DRG Grouping and Medical Expenses in Different DRG.
There were seven subgroups in model one and eight groups in model two. The hospital level was the main factor, and the second rule, without LOS, required more disease-related information, such as details of CCs. The grouping without LOS was more stringent. For example, grade A tertiary and grade B tertiary hospitals were in the same group under the rule incorporating LOS, while they were in different groups without LOS. The number of individuals in each group and details of expenses are shown in Tables 3 and 4. Most of the CVs of the first grouping method were less than 0.5, indicating that the homogeneity within the group was good, and the grouping effect was better in the grouping rules incorporating LOS. The weight of a group was calculated as (the average cost of the group)/(the average cost of all patients). The higher the weight, the more resources consumed by the patients in the group. We set P75 + 1.5 IQR as the cost limit of each group, and the excess amount indicates how much of each group's medical expenses fell outside the cost limit.
We also analyzed the outliers of each group. Using the first grouping rules, the outliers were older than the normal patients, while using the second grouping rules, the outliers had a significantly longer LOS than the average.
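The group weight, the P75 + 1.5 IQR cost limit, and the excess amount defined in this section can be sketched as follows. The code is an illustrative Python example with invented costs. The quartile interpolation method (`method="inclusive"`) and the reading of the excess amount as the sum of costs above the limit are assumptions, because the text does not specify them.

```python
import statistics


def cost_limit(costs):
    """Upper cost limit of a group: P75 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(costs, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def group_summary(costs, overall_mean):
    """Weight, cost limit, outlier count, and excess amount for one group."""
    limit = cost_limit(costs)
    outliers = [c for c in costs if c > limit]
    return {
        # weight = average cost of the group / average cost of all patients
        "weight": statistics.mean(costs) / overall_mean,
        "cost_limit": limit,
        "n_outliers": len(outliers),
        # assumed reading: total spending above the limit
        "excess": sum(c - limit for c in outliers),
    }


# Invented costs (US dollars) for two hypothetical DRG groups.
groups = {
    "group_a": [800, 900, 1000, 1100, 1200, 5000],
    "group_b": [2000, 2200, 2400, 2600],
}
all_costs = [c for costs in groups.values() for c in costs]
overall_mean = statistics.mean(all_costs)
for name, costs in groups.items():
    print(name, group_summary(costs, overall_mean))
```

A weight above 1 marks a group that consumes more resources than the average patient, and a nonzero excess marks spending that the cost limit would flag as unreasonable.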

Prediction of Medical Expenses Based on an Increasing Control Ratio of Unreasonable Treatment.
In 2018, medical expenses totaling 51.17 million US dollars were incurred by 45,575 inpatients with cerebrovascular diseases as the principal diagnosis. The average cost was 1,284 US dollars. Of the total, 43.42 million were paid by the Healthcare Security Administration, and 7.75 million were paid by patients themselves. All of this expenditure was based on the Fee for Service (FFS) payment system. We took the mean cost of each group as the payment standard for the DRG group and calculated the average cost to the Healthcare Security Administration, hospital, and patient. The current FFS method encourages an oversupply of service in order to increase revenue [9]. We consider expenditure less than the cost limit in each group to be a normal supply and the instances in which the outliers exceed the upper limit as an oversupply of services. We increased the control intensity step by step from 10% to 100% for this oversupply of services, to simulate performance under the DRG payment system. The control effect of the two grouping rules is shown in Table 5. If we took full control, the rules with LOS could save 598,570 US dollars, and 3.35 million US dollars could be saved based on the grouping rules without LOS.

Discussion
The government therefore paid an average of 1,087 US dollars for each patient, and each patient paid 196 US dollars themselves. The expenditure in developed cities was even higher. Control of the medical expenses caused by cerebrovascular disease is an urgent problem for the Chinese government. The city we chose uses a Fee for Service system, which may provide an incentive to oversupply services. We used local data to classify the patients into different groups with similar medical costs. Two models with different rules were built, based on whether the LOS was included as a classification variable. We used the CV to measure the quality of the grouping and analyzed the characteristics of the outliers in each group.
We then increased the intensity of control of the oversupply of services step by step, from 10% to 100%, to simulate the performance based on the two grouping rules. The model incorporating LOS had a smaller CV than the model without LOS. If our standard model was built without LOS, it could reduce the occurrence of medical oversupply, saving 3.35 million US dollars in one year. These figures apply to only one city; if the whole country controlled costs in this way, the economic pressures on healthcare could quickly be alleviated.
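The stepwise control simulation can be sketched as below. This is an illustrative Python example with invented costs, not the procedure used to produce Table 5. It assumes that a control intensity of x removes a fraction x of each outlier's spending above its group's cost limit (P75 + 1.5 IQR), which is one plausible reading of the text, and that quartiles are computed with the `inclusive` method.

```python
import statistics


def cost_limit(costs):
    """Upper cost limit of a group: P75 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(costs, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def simulate_control(groups, steps=10):
    """Return (intensity, saved, remaining_total) for each control level.

    Assumption: at intensity x, a fraction x of all spending above the
    group cost limits is eliminated.
    """
    total = sum(sum(costs) for costs in groups.values())
    excess = 0.0
    for costs in groups.values():
        limit = cost_limit(costs)
        excess += sum(c - limit for c in costs if c > limit)
    results = []
    for step in range(1, steps + 1):
        intensity = step / steps
        saved = intensity * excess
        results.append((intensity, saved, total - saved))
    return results


# Invented costs (US dollars) for two hypothetical DRG groups.
groups = {
    "group_a": [800, 900, 1000, 1100, 1200, 5000],
    "group_b": [2000, 2200, 2400, 2600],
}
for intensity, saved, remaining in simulate_control(groups):
    print(f"{intensity:.0%} control: saved {saved:.0f}, remaining {remaining:.0f}")
```

Under this assumption the saving grows linearly with control intensity, so the full-control figure is simply the total excess above the cost limits.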
Although it is generally recognized that LOS is the main factor influencing medical expenses [23], practice regarding the inclusion of LOS as a DRG classification variable is inconsistent. It is generally believed that considering LOS as a classification variable may lead to upcoding [11]. Most European countries, including England, Estonia, and Finland, do not consider LOS as a classification variable. The official Chinese CHS-DRG, modelled on the American MS-DRG, does not include LOS [14], and the Shanghai-DRG, based on the Australian AR-DRG, also does not consider LOS. However, some studies indicate that omitting LOS may increase the frequency of readmission and moves between hospitals, with services provided in alternative ways [17]. Omitting LOS also leads to poorer care for patients who should have a longer stay. The grouping rules of some countries, such as France, Ireland, and Poland, consider LOS to be an important factor [13]. Tables 3 and 4 show the results of grouping. The grouping rule with LOS has a smaller CV, indicating that the cost difference within grouping rule one was smaller, and the grouping was more reasonable. We used P75 + 1.5 IQR as the upper limit to test for outliers in each group. The proportion of outliers was higher in the group without LOS.
This observation implies that the use of LOS can lead to accurate grouping. Both grouping rules demonstrate that the hospital level is very important. In grouping rules without LOS, hospital levels and comorbidity are more finely divided. It is therefore counterproductive to consider only one hospital level.
We analyzed the outliers (Tables 3 and 4) and found that in the LOS group, the age of the outliers was significantly higher than the average value of the group, while in the group without LOS, the LOS was significantly higher than the average. A study using MS-DRG hospital data from Malta also found that most of the outliers were older and higher costs were associated with higher LOS [8]. Further analysis of these results could help identify the reasons for the high costs.
In Asia, only the Republic of Korea considers the type of hospital as a factor for DRG-based payment [9]. In this study, we found that the level of the hospital crucially influenced inpatient medical expenditure. Although there have been studies looking at the impact of hospital levels on costs [19], research into DRG has tended to focus only on tertiary hospitals. Our research therefore complements previous studies that only grouped hospitals at one level [13]. The major diagnosis was directly related to the differences in the cost of hospitalization. Comorbid patients often require special treatment and care, and different comorbidities may affect the cost of additional care, making comorbid diseases an important grouping variable. Medical costs are higher for the elderly, who require special treatments [13], but age did not show up in our grouping variables. In China, many DRG subgroups, such as the pneumonia subgroup, have age as the primary factor [19], possibly because the high cost of this group is mainly concentrated in the elderly and children. However, the age distribution of cerebrovascular disease is mainly concentrated in the elderly. In most European countries, such as England and Estonia, age is not a factor used in grouping [13]. This observation is consistent with our findings. Most grouping rules have found surgery to be an important variable, and our single-factor analysis also showed that surgery has a significant impact on costs. However, surgery was not a variable identified in our grouping results. This may be related to the choice of disease. A cluster study in Beijing, China, also confirmed that in stroke, one of the cerebrovascular diseases, surgery is rare [24]. Table 5 shows the performance if the oversupply of services is controlled under the DRG payment system. The intensity of control was increased step by step from 10% to 100%, and the results of application of the two rule sets were compared.
More money could be saved without LOS. Experience in Europe indicates that use of LOS leads to upcoding, and the medical cost was high when considering the LOS. These results imply that without LOS the cost could be controlled better, but with LOS the patients could be classified better. More incentives and oversight are needed if DRG is to be introduced. For one city, 21 million RMB could be saved by applying the results of our research, an outcome which is highly desirable for the government. There were some limitations in this study. Due to the lack of standards for the data reported by the hospitals, there were 5,768 cases lacking information on whether surgery was performed, so these data were excluded from the grouping. Since there is no uniform surgical coding across hospitals, we could not use the surgical code in our analysis. Due to the large amount of data, we only considered data from one year. In the future, data from more years could be included, or data from another year could be used to calculate the CV of the test group.

Conclusions
We used real data from less developed regions for grouping for the DRG, filling the gap in previous studies, which took developed regions as research objects. To the best of our knowledge, this is the first time that secondary grade hospitals have been considered in a Chinese DRG study. We compared two grouping methods and discussed the results of the grouping. DRG payments were fixed, and this study adjusted the payment ratio of medical insurance, patients, and hospitals to achieve a satisfactory result for all three parties. To speed the development of DRG and rationalize the costs of cerebrovascular disease, the structure of hospital information and the standardization of data entry are essential. More research in this area is urgently needed.

Data Availability
All the data were taken from the Medical Insurance Laboratory of ChengDu Healthcare Security Administration.

Ethical Approval
The study does not involve human subjects and adheres to all current laws of China.

Conflicts of Interest
The authors report no conflicts of interest concerning the materials or methods used in this study or the findings presented in this paper.