Application of Zero-Inflated Poisson Mixed Models in Prognostic Factors of Hepatitis C

Background and Objectives. Hepatitis C virus (HCV) infection has become a major public health problem in recent years. Evaluating risk factors is one way to help protect people from the infection. This study aims to employ zero-inflated Poisson mixed models to evaluate prognostic factors of hepatitis C. Methods. The data were collected from a longitudinal study conducted during 2005–2010. First, a mixed Poisson regression (PR) model was fitted to the data. Then, a mixed zero-inflated Poisson model with compound Poisson random effects was fitted. To evaluate the performance of the proposed mixed model, the standard errors of the estimators were compared. Results. The mixed PR model showed that genotype 3 and treatment protocol were statistically significant. The zero-inflated Poisson mixed model showed that age, sex, genotypes 2 and 3, treatment protocol, and having risk factors had significant effects on the viral load of HCV patients. Of the two models, the zero-inflated Poisson mixed model yielded the estimators with the smaller standard errors. Conclusions. The results showed that the mixed zero-inflated Poisson model gave almost the best fit. The proposed model can capture serial dependence, additional overdispersion, and excess zeros in longitudinal count data.


Introduction
In recent years, hepatitis C virus (HCV) infection has been a major cause of liver disease worldwide and represents a major public health problem [1][2][3][4][5]. Transfusion of and contact with infected blood and blood products, intravenous drug use, and contamination during medical procedures are among the risk factors for HCV [6][7][8]. An estimated 130-170 million people worldwide are infected with hepatitis C. The reported prevalence of this infection ranges from approximately 0.2% to 40% across the world [2,9]. Prevalence differs between developed and developing countries because of differences in health policies and medical care [10]. Apart from a few studies on high-risk groups or in specific locations, no comprehensive and accurate estimate of HCV infection is available in Iran. According to two available studies of the Iranian population, the prevalence of HCV infection in the general population is less than 1% [11,12].
Hepatitis C is a common infection that causes chronic liver disease in the world [13]. The occurrence of end-stage liver disease caused by HCV is estimated to peak around 2020 [10,14]. According to other studies, HCV infection is responsible for 20% of acute hepatitis cases, 70% of all chronic hepatitis cases, 40% of all cases of liver cirrhosis, 60% of hepatocellular carcinomas (HCC), and 30% of liver transplants [15].
In the coming decades, the economic burden and mortality associated with hepatitis C are expected to rise [7,16]. Unfortunately, the majority of infections do not respond to treatment and lead to chronic disease. Controlling HCV infection is therefore an important public health issue [5,17]. Evaluating risk factors, with the aim of reducing the problem in the community, is one way to protect people from the infection.
In medical research, statistical modeling is a powerful approach to risk factor evaluation, but choosing an appropriate model is important. When the response variable is a count, several models are available for analyzing the data. Count data are sometimes overdispersed because they contain a large number of zeros; this phenomenon is called zero inflation. Applying a standard count model to zero-inflated data gives misleading results.
Lambert [18] proposed the zero-inflated Poisson (ZIP) regression model for independent count data. ZIP models have since been extended to clustered count data, and different types of such models have been introduced and used in different studies [19,20]. In this study, the relationship between the three viral loads of each HCV patient and several risk and demographic factors was investigated using mixed ZIP regression. Details of the mixed ZIP model and its parameter estimation are described in [21,22].
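To make the ZIP distribution concrete, the following minimal sketch writes out the probability mass function of the basic (non-mixed) ZIP distribution for independent counts, together with its mean and variance. The function and parameter names (`zip_pmf`, `lam`, `pi`) are illustrative assumptions, and the sketch contains neither covariates nor the random effects of the mixed model used in this study.

```python
import math


def zip_pmf(k, lam, pi):
    """Probability of count k under a zero-inflated Poisson distribution.

    lam: mean of the Poisson component (lam > 0)
    pi:  probability of a structural (excess) zero, 0 <= pi < 1
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero arises either structurally or from the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """Mean of the ZIP distribution."""
    return (1.0 - pi) * lam


def zip_variance(lam, pi):
    """Variance of the ZIP distribution.

    It exceeds the mean whenever pi > 0, which is the overdispersion
    induced by the excess zeros.
    """
    return (1.0 - pi) * lam * (1.0 + pi * lam)
```

With `pi = 0` the functions reduce to the ordinary Poisson distribution, for which the mean and the variance coincide.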

Patient Selection.
This is a longitudinal study; all data were drawn from the medical records of 186 patients with hepatitis C. All of these patients had been referred to the Tehran hepatitis clinic, a clinic of the Baqiyatallah Research Center for Gastroenterology and Liver Diseases, from 2005 to 2010. The information on these 186 patients includes the viral load (HCV-RNA), which had been recorded before treatment, during the treatment period, immediately after this period, and 3 to 4 months after the end of treatment. The viral load before treatment was used for baseline adjustment. The variables included in the study are as follows: demographic information (sex and age); genotype (1, 2, or 3); treatment protocol, either combination therapy of standard Interferon (3 MU three times a week) plus Ribavirin (800-1200 mg per day) for 24 weeks or 48 weeks [23][24][25] or combination therapy of Peg-Interferon (Alfa 2a in a fixed dose of 180 micrograms per week) plus Ribavirin (800-1200 mg per day) for 24 weeks or 48 weeks [24,26]; and risk factors, namely history of blood transfusion, addiction (IV drug use), and contaminated needle stick. All of these factors were extracted from the patients' medical records. Thus, five covariates (age, sex, genotype, treatment protocol, and risk factor) were entered in this study. Finally, 558 viral loads of HCV and their related information were extracted; that is, each patient was examined three times (the first time was baseline). Negative HCV-RNA is defined as a value below 100 and is taken as zero in the analyses. Generally, HCV-RNA of 100 to 200,000 is considered very low; 200,000 to 1,000,000 low; 1,000,000 to 5,000,000 medium; 5,000,000 to 25,000,000 high; and above 25,000,000 very high.
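The viral load coding described above can be sketched as a small helper. The thresholds are those stated in the text; the function names and the handling of boundary values (each lower bound treated as inclusive) are assumptions made for illustration, since the text does not say to which group a value lying exactly on a boundary belongs.

```python
# Lower bounds of the viral load groups, from highest to lowest.
# Assumption: a value equal to a bound belongs to the higher group.
CATEGORY_LOWER_BOUNDS = [
    (25_000_000, "very high"),
    (5_000_000, "high"),
    (1_000_000, "medium"),
    (200_000, "low"),
    (100, "very low"),
]


def viral_load_category(hcv_rna):
    """Return the viral load group for one HCV-RNA measurement."""
    for lower_bound, label in CATEGORY_LOWER_BOUNDS:
        if hcv_rna >= lower_bound:
            return label
    # Values below 100 are regarded as negative HCV-RNA.
    return "negative"


def analysis_value(hcv_rna):
    """Return the value entered in the analysis: negatives become zero."""
    return 0 if viral_load_category(hcv_rna) == "negative" else hcv_rna
```

For example, `viral_load_category(350_000)` returns `"low"`, and `analysis_value(60)` returns `0`.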

Statistical Analysis.
Descriptive statistics and frequency distribution such as mean, standard deviation, and [...] This model is a combination of zero-inflated and random effects models that accounts for both the zero-inflated and the clustered structure of the data [22]. For comparison, a Poisson random effects model, which does not consider the zero-inflated structure of the data, was also fitted [27]. The two models were compared using the standard errors of their estimators. Significance was defined as P < 0.05. Stata 11 and R 2.13.1 were used for the analysis.
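As a rough illustration of how the parameters of a zero-inflated model can be estimated, the sketch below fits an intercept-only ZIP distribution by the EM algorithm. It is not the estimation procedure of this study: it has no covariates and no compound Poisson random effects, and it treats the counts as independent. All names (`fit_zip_em`, `sample_poisson`) are hypothetical, and the simulated data in the usage note are not the study data.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def fit_zip_em(counts, n_iter=200):
    """Estimate (pi, lam) of an intercept-only ZIP model by EM.

    pi is the probability of a structural zero and lam the mean of the
    Poisson component. Assumes independent counts and no covariates.
    """
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    # Starting values: half of the observed zeros are taken as structural.
    pi = 0.5 * n_zero / n
    lam = max(total / n, 1e-8)
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        weight = pi / (pi + (1.0 - pi) * math.exp(-lam)) if pi > 0 else 0.0
        expected_structural = n_zero * weight
        # M-step: update the mixing proportion and the Poisson mean.
        pi = expected_structural / n
        lam = total / (n - expected_structural)
    return pi, lam
```

A possible usage on simulated data: draw counts with `rng = random.Random(0)`, setting each count to zero with probability 0.5 and to `sample_poisson(3.0, rng)` otherwise, then call `fit_zip_em(counts)`; with a few thousand observations the estimates should lie close to 0.5 and 3.0.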

Results
Of the 186 patients entered into this study, 55 were female. The mean and standard deviation of age were 42.88 and 11.17 years, respectively, and ages ranged from 19 to 76 years. Table 1 shows the distribution of covariates in this study. Each patient had four viral loads for evaluating the treatment process. Table 2 shows the distribution of the six viral load groups in the 186 patients, each measured four times. According to these results, 55.2% of patients had negative HCV-RNA, which indicates that a zero-inflated model is needed. In the first stage, a PR model with random effects (mixed PR) was fitted. The random effect was included to adjust for the clustered data structure. According to this model, genotype 3 and treatment protocol were statistically significant. Table 3 shows the results of this model. The significant Pearson chi-square goodness of fit (GOF) test (P < 0.001), along with other features of the model fit, indicated that the mixed PR model fitted poorly. In addition, a significant likelihood ratio test (P < 0.001) of whether the dispersion statistic differs from zero showed that the data are overdispersed.
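A simple descriptive check for excess zeros and overdispersion can be sketched as follows. It compares the observed proportion of zeros with the proportion expected under a Poisson distribution with the same mean, and reports the variance-to-mean ratio. This marginal check ignores covariates and clustering; it is not the Pearson GOF test or the likelihood ratio test reported above, and the function name and the example counts are hypothetical.

```python
import math


def zero_excess_summary(counts):
    """Summarize excess zeros and overdispersion in a list of counts."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((y - mean) ** 2 for y in counts) / (n - 1)
    return {
        "mean": mean,
        "variance": variance,
        # Proportion of zeros actually observed.
        "observed_zero": sum(1 for y in counts if y == 0) / n,
        # Proportion of zeros a Poisson with the same mean would produce.
        "expected_zero": math.exp(-mean),
        # Values well above 1 suggest overdispersion.
        "dispersion_index": variance / mean,
    }


# Hypothetical counts, for illustration only (not the study data).
summary = zero_excess_summary([0, 0, 0, 0, 0, 0, 4, 6, 5, 5])
```

In this made-up example the observed proportion of zeros (0.6) is far above the Poisson expectation, and the dispersion index is well above 1, the pattern that motivates a zero-inflated model.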
In the next stage, a ZIP model with random effects (mixed ZIP) was fitted to account for both clustering and excess zeros. Age, sex, genotype, treatment protocol, and risk factors had significant effects on HCV-RNA at α = 0.05. The rate of virological response was higher in younger males. Subjects who had none of the risk factors (history of blood transfusion, addiction (IV drug use), and needle stick) were more likely to have a virological response than others. Patients with genotype 3 or genotype 2 tended to have a greater virological response than those with genotype 1. The rate of virological response was also higher in subjects receiving combination therapy of Peg-Interferon plus Ribavirin. In addition to the regression parameters, two random-effect parameters were estimated in this model. The first (estimated at 0.3543) indicated longitudinal correlation in the subjects' repeated measurements; it also shows that the HCV-RNA value at each occasion partly depends on its value at the previous occasion. The second (estimated at 1.61) indicated that the variation in the data is much greater than that captured by the first random effect. The results of this model are shown in Table 4. The comparison of the two models is presented in Table 5. The standard errors for covariate effects obtained from the mixed ZIP model were generally smaller than those obtained from the mixed PR model.

Discussion
In this paper, a mixed ZIP model was used for clustered count data with excess zeros, and its results were compared with those of a mixed PR model. In the mixed ZIP model, all covariates had significant effects on the response variable. In this study, the rate of low viral load was higher in men than in women. Recent studies of patients with genotype 1 indicate that the sustained virological response (SVR) in men is 2.5 times higher than in women [28]; in the present study, the corresponding figure was 2.7. This difference may be due to physiological and psychological differences between men and women. Patients with genotypes 3 and 2 also had a greater virological response than patients with genotype 1. That SVR is more difficult to achieve in genotype 1 than in other genotypes has been confirmed by other studies [29]. Risk factors, on the other hand, decreased the rate of virological response. This result may reflect the relationship between the risk factors and the genotype. For example, there is a direct relationship between genotype 1 and injecting drug use, blood transfusion, and contact with infected blood and its products [30]. Two main treatment protocols were used in this study, chosen according to the genotype of the patient. According to the results, combination therapy of Peg-Interferon plus Ribavirin had better results than combination therapy of standard interferon plus Ribavirin. Many studies conducted so far have shown that patients respond best to Peg-Interferon plus Ribavirin [31][32][33][34], so this protocol appears to be the best choice [35,36]. Unfortunately, in Iran, because of the high cost of the drug, it is not the first choice for doctors; usually it is prescribed when patients have not responded to the initial treatment [36].
Although clustered count data with extra zeros occur often, few methods have been developed for correlated data with extra zeros [37]. Some studies have extended zero-inflated models to accommodate random effects [20,38]. All of these models contain two separate random effects, which makes the results more difficult to interpret and sometimes confusing. The mixed ZIP model used in this paper was introduced by Ma et al. in 2009. It has a compound Poisson random effect structure, a distribution that is very useful for characterizing both the excess zeros and the clustering structure of the data. Another advantage of this model is its computational efficiency, which is highly useful for analyzing massive data sets [21]. For this data set, the programs took two minutes and thirty seconds to run. The results of this model were also compared with those of one of the standard methods for analyzing longitudinal count data (the mixed PR model). The standard errors for covariate effects obtained from the mixed ZIP model were generally smaller than those obtained from the mixed PR model. The comparison was based on standard errors because no goodness of fit criterion is yet available for the mixed ZIP model. When the excess zeros are not accounted for, the standard errors of the estimators become larger; because the mixed PR model ignores the zero-inflated structure, its estimators have larger standard errors than those of the mixed ZIP model.
In conclusion, the mixed zero-inflated Poisson model was found to give almost the best fit. As in this study, clustered zero-inflated count data are quite frequent in medical research. Because a wrong model yields unreliable results, choosing the correct model for analyzing the data is highly important.