Sparse Logistic Regression for Diagnosis of Liver Fibrosis in Rat by Using SCAD-Penalized Likelihood

The objective of the present study is to quantify the relationship between the progression of liver fibrosis and the levels of certain serum markers using a mathematical model. We propose a sparse logistic regression with the smoothly clipped absolute deviation (SCAD) penalty function to diagnose liver fibrosis in rats. Not only does it give a sparse solution with high accuracy, it also provides the user with classification probabilities along with the class labels. In both the simulated and the experimental cases, the proposed method is comparable to stepwise linear discriminant analysis (SLDA) and to sparse logistic regression with the least absolute shrinkage and selection operator (LASSO) penalty. The diagnostic sensitivity of the selected variables is assessed with the receiver operating characteristic (ROC) curve, with the area under the curve (AUC) estimated by the Bayesian bootstrap. Results show that the new approach provides a good correlation between the serum marker levels and the liver fibrosis induced by thioacetamide (TAA) in rats. This approach might also be used in predicting the development of liver cirrhosis.


Introduction
Chronic hepatitis, characterized by hepatic fibrosis, is recognized as a health problem of worldwide prevalence, and it may gradually progress toward cirrhosis and hepatocellular carcinoma, which may be fatal. Successful and early treatment of chronic hepatitis can prevent the development of cirrhosis and hepatocellular carcinoma. There are two major features of chronic hepatitis: necroinflammatory activity and fibrosis. Liver fibrosis is the best sign for predicting the development of liver cirrhosis [1]. Since there is no correlation between aminotransferase activities and fibrosis, aminotransferase activities cannot be used for the diagnosis of fibrosis. Because liver fibrosis is reversible in the early stage, accurate and early diagnosis of liver fibrosis is required for a better prognosis of chronic hepatitis.
Liver biopsy has to date been the gold standard for grading hepatic inflammation and staging hepatic fibrosis, and it has been used as the reference method in evaluations of plasma markers of liver diseases [2]. However, it is an expensive, invasive procedure with a considerable risk of complications (particularly bleeding) and a small chance (<1 : 1000) of death [3]. In addition, a liver biopsy sample represents only about 1 : 50000 of the mass of the liver, which creates a risk of false negatives. Even with biopsies of adequate size, cirrhosis may still be missed in 15-30% of liver biopsies [4]. Because of the limitations of biopsy, including the small but significant mortality rate, sampling error, inter- and intraobserver variation in pathology reporting, and the provision of a static picture of liver architecture in a dynamic disease process, it is still necessary to look for alternative approaches. Moreover, to evaluate drug efficacy, it is essential to establish appropriate animal models of liver fibrosis. Establishing an animal model is time-consuming, usually lasting 8 to 12 weeks; histopathologic examination consumes animals, and testing a small proportion of them cannot reflect the condition of the whole population. All of these factors increase the cost and decrease the efficiency and reliability of the research. Building appropriate animal models of liver fibrosis therefore also requires a reliable and accurate method to diagnose liver fibrosis quickly.
The search for a noninvasive method to assess liver fibrosis has encouraged the development of various approaches. Transient elastography was developed for the noninvasive measurement of liver stiffness [5-7]. Monitoring serum markers of liver fibrosis could offer an attractive alternative to liver biopsy, as it allows fibrosis to be assessed dynamically and efficiently. Liver fibrosis is characterized by an overall increase of the extracellular matrix, mainly produced by hepatic stellate cells (HSCs) [8,9], which undergo a phenotypic switch induced within the inflammation process by numerous cells and cytokines. A number of potential serum markers of fibrosis and cirrhosis have been used in the diagnosis of a variety of chronic liver diseases.
Therefore, monitoring a variety of plasma markers, especially collagen-related biomarkers such as the aminoterminal peptide of procollagen III (PIIINP), is a novel approach to liver fibrosis diagnosis. Many plasma markers have been identified, but some have still not been determined. Constructing a sparse classifier for the progression of liver fibrosis based on the detected plasma markers has therefore attracted much attention. One such method is linear discriminant analysis with stepwise variable selection. Guyon et al. [10] proposed a recursive feature elimination technique with support vector machines to analyze gene expression data. Rocke and Nguyen [11] proposed dimension reduction for microarray-based classification. Li et al. [12] introduced two Bayesian approaches with automatic relevance determination for the same problem. Debashis and Arul [13] suggested a linear discriminant function by optimal scoring with LASSO as an alternative approach. Sparse Fisher's linear discriminant analysis, suggested by Qiao et al. [14], is another such method.
The aim of this study, therefore, is to develop classification rules that take measures of diagnostic accuracy into account. In particular, we are interested in finding markers of liver fibrosis that can discriminate between two populations. Our solution is to combine the problems of variable selection and classification. We propose an approach to classification that uses the smoothly clipped absolute deviation (SCAD) penalty [15] with logistic regression [16]. We compare it with stepwise linear discriminant analysis (SLDA) and sparse logistic regression (SLR) with the least absolute shrinkage and selection operator (LASSO) penalty [17,18]. Finally, we analyze the sensitivity of these methods using the receiver operating characteristic (ROC) curve. Considering the small sample size, we fit the ROC curve and compute the area under the curve (AUC) using the Bayesian bootstrap [19,20].

Materials and Method
2.1. Animals. Twenty-eight Sprague-Dawley (SD) rats were provided by SLAC Laboratory Animal Co. Ltd. (Shanghai, China, SCXK: 2007-0005). The rats were maintained under specific-pathogen-free conditions, at a constant temperature between 25 and 27 °C and a constant humidity between 45 and 50%, at the animal laboratory of China Pharmaceutical University. Animal care was in accordance with the guidelines of the animal laboratory of China Pharmaceutical University.

Induction of Liver Fibrosis.
The modeling method follows Imanishi et al. [21] and Kuriyama et al. [22]. The rats were randomly divided into 4 subgroups: (I) model group (8 weeks, n = 8), (II) model group (12 weeks, n = 8), (III) normal control group (8 weeks, n = 6), and (IV) normal control group (12 weeks, n = 6). All rats were observed at the 8th and 12th weeks after the start of TAA treatment. Rats in the model subgroups were injected intraperitoneally (i.p.) with TAA on 3 consecutive days per week for 8 weeks, with an initial dose of 200 mg/kg of 6% TAA. Subsequent doses were adjusted according to weekly weight and AST changes in response to TAA during the induction. Rats in the normal control groups were treated with saline. After the final administration in the 8th and 12th weeks, blood samples were collected, and serum was separated by centrifugation at 4 °C and kept at −20 °C for further analysis. All rats were then sacrificed under anesthesia. The livers were washed with cold saline, and a part of the right hepatic lobe was removed and stored in liquid nitrogen for determination of the hydroxyproline content. The remaining part of the right hepatic lobe was sectioned for pathological diagnosis. Several serum markers and liver function indices studied in clinical research were measured at the 8th and 12th weeks, respectively. Many serum markers have been reported as liver function indices. After consideration, we ultimately chose hyaluronan (HA), serum laminin (LN), collagen type I (Col I), collagen type IV (IVC), procollagen III (PC-III), aspartate aminotransferase (AST), albumin (Alb), hydroxyproline (Hyp), total protein (TP), and total bilirubin (T.Bil) [7,23] for this study. The liver tissue slices were examined and diagnosed using HE staining, Masson trichrome staining, and transmission electron microscopy. The relationship between the serum indices and the occurrence of liver fibrosis was then analyzed with a statistical model.

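Statistical Model and Solution
Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the input-output pairs of a given data set, where $x_i \in \mathbb{R}^p$ contains the levels of the plasma markers and $y_i \in \{0, 1\}$ indicates whether liver fibrosis was identified in the liver biopsy. Here, $n$ is the number of liver biopsies and $p$ is the number of plasma markers. The binary logistic regression model can be written as

$$\Pr(y_i = 1 \mid x_i) = \frac{\exp\{g(x_i^T\beta)\}}{1 + \exp\{g(x_i^T\beta)\}},$$

where $g(x_i^T\beta) = x_i^T\beta$, $i = 1, 2, \ldots, n$. Then, the log-likelihood function of this binary logistic regression is

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i\, x_i^T\beta - \log\{1 + \exp(x_i^T\beta)\} \right].$$

Further, the SCAD-penalized likelihood function is

$$Q(\beta) = \ell(\beta) - n \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|),$$

where the penalty is defined through its derivative

$$p'_{\lambda}(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a - 1)\lambda}\, I(\theta > \lambda) \right\}$$

for some $a > 2$ and $\theta > 0$.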
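As a concrete illustration of the SCAD penalty named above, the following is a minimal sketch assuming the standard form of Fan and Li [15]: the penalty is linear up to λ, quadratic between λ and aλ, and constant beyond aλ. The function names and the default a = 3.7 (a commonly quoted choice, not a value stated in this paper) are assumptions made for illustration.

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty for theta >= 0.

    Assumes the Fan-Li form; a = 3.7 is an illustrative default only.
    """
    theta = np.asarray(theta, dtype=float)
    # Beyond lambda the derivative decays linearly and is zero past a*lambda.
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|), obtained by integrating the derivative."""
    theta = np.abs(np.asarray(beta, dtype=float))
    linear = lam * theta                                   # |beta| <= lambda
    quad = (2 * a * lam * theta - theta**2 - lam**2) / (2 * (a - 1))
    const = lam**2 * (a + 1) / 2                           # |beta| > a*lambda
    return np.where(theta <= lam, linear,
                    np.where(theta <= a * lam, quad, const))
```

Unlike the LASSO penalty, the SCAD derivative vanishes for large coefficients, so strong effects are not shrunk while small ones are still pushed to zero.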
In general, the penalty functions $p_{\lambda_j}$ may differ across coefficients. Here, we take $\lambda_j = \lambda$ for all $j$; in other words, the same penalty function is applied to each component of $\beta$ [24]. Generally, $\lambda$ can be selected by generalized cross-validation (GCV). An alternative penalty function is the least absolute shrinkage and selection operator [17], which takes $p_{\lambda_j}(|\beta_j|) = \lambda_j |\beta_j|$. The algorithm can be carried out with least angle regression [25]. Taking $\lambda_j = \lambda$, the penalized likelihood with LASSO is

$$Q_{\mathrm{LASSO}}(\beta) = \ell(\beta) - n\lambda \sum_{j=1}^{p} |\beta_j|.$$

Once the regression coefficient $\beta$ is estimated, the classifier is constructed as follows. Let $c(i \mid j)$ be the cost of classifying an observation into class $i$ when the true class is $j$. Then, a new tissue sample with plasma markers $x$ is classified into

$$c(x) = \arg\min_{i} \sum_{j} c(i \mid j)\, \Pr(y = j \mid x).$$

In practice, the misclassification costs $c(i \mid j)$, $i \neq j$, are most often taken to be equal, in which case the minimum-cost rule reduces to $\arg\max_j \Pr(y = j \mid x)$.
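The cost-based classification rule described above can be sketched in a few lines. This is an illustrative example only: the function name `classify` and the cost-matrix layout (`cost[i, j]` holding c(i | j)) are assumptions, not part of the original method description.

```python
import numpy as np

def classify(p1, cost=None):
    """Assign class 0 or 1 by minimum expected cost.

    p1   : estimated probability Pr(y = 1 | x) from the fitted model
    cost : 2x2 array with cost[i, j] = c(i | j); defaults to equal
           misclassification costs (hypothetical layout for illustration)
    """
    if cost is None:
        cost = np.array([[0.0, 1.0],
                         [1.0, 0.0]])
    posterior = np.array([1.0 - p1, p1])   # Pr(y = 0 | x), Pr(y = 1 | x)
    expected = cost @ posterior            # expected[i] = sum_j c(i|j) Pr(y=j|x)
    return int(np.argmin(expected))
```

With equal costs the rule picks the class with the larger probability. Raising the cost of a missed fibrosis case, for example `cost = [[0, 5], [1, 0]]`, moves the decision threshold below 0.5.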

Solution.
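In this study, we principally discuss the SCAD penalty. Given an initial value $\beta_0$ close to the maximizer, the penalty can be locally approximated by a quadratic function as follows:

$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_{j0}|) + \frac{1}{2}\, \frac{p'_\lambda(|\beta_{j0}|)}{|\beta_{j0}|}\, (\beta_j^2 - \beta_{j0}^2), \qquad \beta_j \approx \beta_{j0}.$$

In other words,

$$[p_\lambda(|\beta_j|)]' = p'_\lambda(|\beta_j|)\, \mathrm{sgn}(\beta_j) \approx \frac{p'_\lambda(|\beta_{j0}|)}{|\beta_{j0}|}\, \beta_j.$$

Then, the penalized log-likelihood can be locally approximated, up to a constant, by

$$\ell(\beta_0) + \nabla\ell(\beta_0)^T (\beta - \beta_0) + \frac{1}{2} (\beta - \beta_0)^T \nabla^2\ell(\beta_0) (\beta - \beta_0) - \frac{1}{2}\, n\, \beta^T \Sigma_\lambda(\beta_0)\, \beta,$$

where

$$\Sigma_\lambda(\beta_0) = \mathrm{diag}\left\{ \frac{p'_\lambda(|\beta_{10}|)}{|\beta_{10}|}, \ldots, \frac{p'_\lambda(|\beta_{p0}|)}{|\beta_{p0}|} \right\}.$$

We then estimate $\beta$ as follows:

$$\beta_1 = \beta_0 - \left\{ \nabla^2\ell(\beta_0) - n\Sigma_\lambda(\beta_0) \right\}^{-1} \left\{ \nabla\ell(\beta_0) - n U_\lambda(\beta_0) \right\},$$

where $U_\lambda(\beta_0) = \Sigma_\lambda(\beta_0)\beta_0$, $\nabla\ell(\beta_0) = X^T\{y - \pi(\beta_0)\}$, $\nabla^2\ell(\beta_0) = -X^T W(\beta_0) X$ with $W = \mathrm{diag}\{\pi_i(1 - \pi_i)\}$ and $\pi_i = \Pr(y_i = 1 \mid x_i)$, and $X$ is the $n \times p$ design matrix of marker levels. Given a good initial value $\beta_0$, this one-step estimate can be as efficient as the fully iterative procedure.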
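The local quadratic approximation can be turned into an iterative Newton-type procedure. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the penalty takes the standard SCAD form with a = 3.7, the initial value comes from a few ridge-penalized Newton steps (the text only asks for "a good initial value"), and coefficients whose magnitude falls below `zero_tol` are set to zero and dropped. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    # Clip the linear predictor to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

def scad_derivative(theta, lam, a=3.7):
    theta = np.asarray(theta, dtype=float)
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def fit_scad_logistic(X, y, lam, a=3.7, max_iter=100, zero_tol=1e-4, tol=1e-8):
    """SCAD-penalized logistic regression via local quadratic approximation."""
    n, p = X.shape
    # Assumed initial value: ridge-penalized Newton steps starting from zero.
    beta = np.zeros(p)
    for _ in range(10):
        pi = sigmoid(X @ beta)
        grad = X.T @ (y - pi) - beta
        hess = -(X.T * (pi * (1.0 - pi))) @ X - np.eye(p)
        beta = beta - np.linalg.solve(hess, grad)
    for _ in range(max_iter):
        # Coefficients shrunk to (near) zero are removed and stay removed.
        active = np.abs(beta) > zero_tol
        beta[~active] = 0.0
        if not active.any():
            break
        Xa, ba = X[:, active], beta[active]
        pi = sigmoid(Xa @ ba)
        # Diagonal of Sigma_lambda(beta_0): p'(|b_j0|) / |b_j0|.
        sigma = scad_derivative(np.abs(ba), lam, a) / np.abs(ba)
        grad = Xa.T @ (y - pi) - n * sigma * ba            # grad l - n * U
        hess = -(Xa.T * (pi * (1.0 - pi))) @ Xa - n * np.diag(sigma)
        step = np.linalg.solve(hess, grad)
        beta[active] = ba - step                           # Newton update
        if np.max(np.abs(step)) < tol:
            break
    beta[np.abs(beta) <= zero_tol] = 0.0
    return beta

# Small synthetic check (independent standard normal covariates; illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((400, 8))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y_demo = rng.binomial(1, sigmoid(X_demo @ beta_true))
beta_hat = fit_scad_logistic(X_demo, y_demo, lam=0.1)
```

The value `lam=0.1` is arbitrary here; in the paper's setting λ would be chosen by a criterion such as GCV.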

Simulation Studies.
In this subsection, we numerically compare the proposed approach to variable selection and classification with the SLDA and LASSO methods. We simulate 1000 datasets for each of n = 20, 40, 60, and 100 from the model $Y \sim \mathrm{Bernoulli}\{\pi(x^T\beta)\}$, where $\pi(u) = \exp(u)/(1 + \exp(u))$ and $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$. The first six components of $x$ are standard normal, and the correlation between $x_i$ and $x_j$ is $\rho^{|i-j|}$ with $\rho = 0.5$. The last two components of $x$ are independently and identically distributed as Bernoulli with probability of success 0.5. All covariates are standardized. This model was used in Tibshirani et al. [17]. Classification is based on the rule $\arg\max_j \Pr(y = j \mid x)$. Define the true classification ratio as the proportion of observations assigned to their true class, $\mathrm{TCR} = \frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i = y_i)$, where $\hat{y}_i$ is the predicted class of observation $i$. The true classification ratio is computed via 1000 Monte Carlo simulations. The simulation results are summarized in Table 1. Table 1 shows that the true classification ratio depends on the sample size n: as n increases, the ratio tends to one. In addition, all three methods work well for large n.
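The simulation design above can be reproduced with a short data generator. This is a sketch under the stated design (six correlated standard normal covariates, two Bernoulli covariates, all standardized); the function names, the seed handling, and the 0.5 probability cut-off used for the true classification ratio are illustrative assumptions.

```python
import numpy as np

def simulate_dataset(n, rho=0.5, seed=0):
    """Draw one dataset from the simulation model described in the text."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    # Correlation rho^|i-j| among the six normal covariates.
    idx = np.arange(6)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    x_normal = rng.multivariate_normal(np.zeros(6), cov, size=n)
    # Two independent Bernoulli(0.5) covariates.
    x_binary = rng.binomial(1, 0.5, size=(n, 2)).astype(float)
    X = np.hstack([x_normal, x_binary])
    # Standardize every covariate to mean 0 and unit standard deviation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, prob)
    return X, y, beta

def true_classification_ratio(y, prob):
    """Share of observations whose arg-max class equals the true class."""
    y_hat = (np.asarray(prob) >= 0.5).astype(int)
    return float(np.mean(y_hat == np.asarray(y)))
```

Averaging `true_classification_ratio` over 1000 calls to `simulate_dataset` with different seeds gives the kind of Monte Carlo summary that Table 1 reports for each n.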

Result
3.1. Histopathology of Liver Fibrosis. Typical liver fibrosis was induced after 8 weeks of TAA treatment, with the pathological characteristics of hepatic fibrosis: fibroblasts extended outward from the central vein or portal area and formed obvious fibrous septa, without the formation of false lobules. Under the light microscope, the pathological slices of normal animals had clear lobules, there was no edema or fatty degeneration in the liver cells, and the hepatic sinusoids showed no dilation or congestion (Figures 1(a) and 1(d)). The pathological slices of model animals had obvious fibroplasia extending from the central vein or portal area to the surrounding area, which formed clear fibrous septa. The liver cells had mild steatosis with lipid droplets and vacuoles, accompanied by mild bile duct hyperplasia (Figures 1(b), 1(c), 1(e), and 1(f)).
As shown in Table 3, there was a major difference between the normal group and the model group in several serum markers and liver function indices such as HA, LN, IVC, and collagen I. AST activity reached its peak during the 1st to 4th weeks, and there was no significant difference (P > .05) in the 5th to 8th weeks (Table 2). Figure 2 shows that, after the peak, AST activity gradually decreased and returned to normal during the 12th week. In humans, the level of AST rises markedly during acute hepatitis and correlates with the severity of the disease; the activity of AST then decreases during the course of liver fibrosis [26].
The hyaluronan level of the model group was significantly higher than that of the normal group and decreased slightly in the 12th week, after four weeks without TAA, which agrees with the literature [27]. Overall, typical liver fibrosis was induced after 8 weeks of TAA.

Figure 2: AST activity.

Table 4: Meanings of the covariates.
x1    Albumin ratio of total protein (A/T)
x2    Aspartate aminotransferase (AST)
x3    Hydroxyproline (Hyp)
x4    IV collagen (IVC)
x5    Serum laminin (LN)
x6    III collagen (PC-III)
x7    I collagen (Col I)
x8    Hyaluronan (HA)
x9    Total bilirubin (T.Bil)
x10   Albumin ratio of the globulin (A/G)
x11   Albumin (Alb)
x12   Total protein (TP)

Application.
We applied the proposed sparse logistic regression with SCAD, together with LASSO and SLDA, to the classification of liver fibrosis. The dataset consisted of 26 observations. The binary response variable Y is 1 for rats that have liver fibrosis and 0 otherwise. All twelve covariates are considered; their meanings are listed in Table 4. The selected variables and the true classification rates obtained with SLDA, SLR-SCAD, and SLR-LASSO are given in Tables 5 and 6, respectively.
3.4. Test Significance. The receiver operating characteristic (ROC) curve is a well-established tool for testing the significance of selected variables. The ROC curve is a plot of the true positive fraction (TPF) as a function of the false positive fraction (FPF), or sensitivity versus one minus specificity, and is obtained by varying the threshold criterion that distinguishes between a positive and a negative diagnosis. For example, let the diagnostic variable X ∼ F be for the population without liver fibrosis and Y ∼ G for the population with liver fibrosis, where F and G are the distribution functions. Features such as the invariance property and the interpretation of the area under the curve (AUC) as Pr(Y > X) make ROC analysis extremely popular in diagnostics research. Generally, the selected variables are considered sensitive for the classification if the AUC > 0.7. However, estimating the AUC of the ROC curve is difficult, especially for small sample sizes. Here, we use the Bayesian bootstrap (BB) estimation of the AUC [19,20] to test the variables selected by SLDA, LASSO, and SCAD. The AUC results are shown in Table 7, which displays the sensitivity of each covariate for the classification variable Y. Figure 3 shows the ROC curves of these covariates. The AUCs of $x_3$, $x_5$, and $x_8$, which represent hydroxyproline, LN, and hyaluronan, respectively, are more than 0.7.
As shown in Table 7 and Figure 3, some covariates, such as hydroxyproline, LN, and hyaluronan, may be very important for diagnosing the occurrence of liver fibrosis, as the AUCs for these covariates are more than 0.7.
These covariates are all selected by the three statistical models. Although SLR-SCAD chose only two covariates, it achieves a diagnostic accuracy equal to that of SLDA and better than that of SLR-LASSO, as shown in Tables 5 and 6. This indicates that SLR-SCAD can obtain high diagnostic accuracy with fewer serum indices and thus reduce the experimental cost. SLDA and SLR-LASSO also perform well, but their high diagnostic accuracy requires more variables to be selected.
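To make the AUC estimation step concrete, the following is a minimal sketch of one common way to apply the Bayesian bootstrap to the AUC, interpreted as Pr(Y > X): Dirichlet(1, ..., 1) weights are drawn independently for the two groups and the weighted proportion of correctly ordered pairs is computed for each draw. It is not necessarily the exact procedure of [19,20]; the function name, the number of draws, and the handling of ties (counted as one half) are assumptions.

```python
import numpy as np

def bayesian_bootstrap_auc(x_neg, y_pos, n_draws=2000, seed=0):
    """Bayesian bootstrap draws of AUC = Pr(Y > X).

    x_neg : marker values for subjects without fibrosis (X ~ F)
    y_pos : marker values for subjects with fibrosis (Y ~ G)
    Returns the posterior mean and a 95% interval of the AUC.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x_neg, dtype=float)
    y = np.asarray(y_pos, dtype=float)
    # kernel[i, j] = 1 if y_i > x_j, 0.5 if tied, 0 otherwise.
    kernel = (y[:, None] > x[None, :]) + 0.5 * (y[:, None] == x[None, :])
    # Independent Dirichlet(1, ..., 1) weights for each group and each draw.
    w_pos = rng.dirichlet(np.ones(len(y)), size=n_draws)
    w_neg = rng.dirichlet(np.ones(len(x)), size=n_draws)
    # Weighted AUC per draw: sum_i sum_j w_pos[i] * kernel[i, j] * w_neg[j].
    draws = np.einsum("dm,mn,dn->d", w_pos, kernel, w_neg)
    return float(draws.mean()), np.quantile(draws, [0.025, 0.975])
```

Applied to a single covariate, for example the hydroxyproline values of the normal and fibrotic rats, the posterior mean can be compared with the 0.7 threshold used in the text, and the interval gives a sense of the uncertainty at this sample size.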

Discussion and Conclusion
It has been suggested that research on liver fibrosis is entering a new era in which fibrosis is viewed as part of the whole wound-healing response [28]. An ideal animal model of hepatic fibrosis should be reliable and reproducible and should show the same basic pathological characteristics and phase changes as human liver fibrosis. However, the detection methods used in establishing animal models have some defects. Considering the limitations of the traditional test methods in modeling, we propose an alternative approach that quantifies the relationship between the fibrotic condition and serum indices. Since accurate diagnosis of liver fibrosis remains a difficult and crucial problem in both animal models and patients, an ongoing challenge is to identify new prognostic markers that are directly related to liver fibrosis and that can more accurately predict the likelihood of developing liver fibrosis. Here we introduce a new approach to the joint problem of simultaneous classification and variable selection that uses sparse logistic regression with the SCAD penalty function. The proposed method is applied in our study to analyze the plasma marker data for the diagnosis of liver fibrosis in a rat model and to determine the occurrence of cirrhosis.
In this study, the pathological section results were used to determine whether a reliable animal model of liver fibrosis, induced by individualized dosing of TAA, had been established successfully. Although liver biopsy in human patients differs from pathological sectioning in sacrificed animals, both are expensive and invasive and carry a risk of death, which for the animals is inevitable. Because pathological slices from the whole rat liver provide the highest accuracy, their results can serve as the standard against which the outcome of the statistical model is verified. Sacrificing animals undoubtedly contributes to mortality and raises moral and ethical concerns. Our proposed method based on serum indices may greatly decrease the number of animals consumed when used in hepatic drug screening.
Monitoring a variety of plasma markers, especially the levels of collagen-related markers, is becoming a novel approach to liver fibrosis diagnosis, with a simple procedure and high sensitivity and accuracy. According to Table 1, the novel SLR-SCAD method performs as well as other well-known methods such as SLDA and SLR-LASSO in statistical classification and variable selection. They all perform well at a fixed sample size. SLR-SCAD achieved a higher classification ratio with the two selected variables, hydroxyproline and LN, a selection that is supported by the ROC curves. The two variables are also clinically meaningful. For SLDA and SLR-LASSO to obtain a higher classification ratio, more serum indices need to be tested. In addition, the SLDA method has its own drawback, especially when the variables are collinear [26,29]. Compared with the SLDA and SLR-LASSO methods, SLR-SCAD is well suited to penalized regression.
Determining the occurrence of liver fibrosis in animals with statistical models, with fairly high diagnostic accuracy and without loss of animals, will be a very useful approach. The occurrence of liver fibrosis in rats can be effectively estimated by using our proposed SLR-SCAD method with the right serum indices. In summary, we propose a new method that combines the analysis of serum indices of fibrosis with a statistical model and represents a reliable diagnostic and prognostic approach to liver disease. It could be beneficial to extend this study to other fields that involve classification with many variables.