A Multinomial Ordinal Probit Model with Singular Value Decomposition Method for a Multinomial Trait

We developed a multinomial ordinal probit model with singular value decomposition for testing a large number of single nucleotide polymorphisms SNPs simultaneously for association with multidisease status when sample size is much smaller than the number of SNPs. The validity and performance of the method was evaluated via simulation. We applied the method to our real study sample recruited through theMexican-American CoronaryArtery Disease study.We found 3 genes SORCS1, AMPD1, and PPARα to be associated with the development of both IGT and IFG, while 5 genes AMPD2, PRKAA2, C5, TCF7L2, and ITR with the IGTmechanism only and 6 genes CAPN10, IL4, NOS3, CD14, GCG, and SORT1 with the IFG mechanism only. These data suggest that IGT and IFG may indicate different physiological mechanism to prediabetes, via different genetic determinants.


Introduction
Genome-wide association studies GWASs examine genetic variants across the entire genome to improve the understanding of genetic components underlying complex human disease.With whole-genome genotyping techniques that allow GWAS to involve hundreds of thousands of single nucleotide polymorphisms SNPs , many studies have successfully identified novel genetic components for many diseases or related quantitative traits.However, the sample size is often limited due to the difficulty of recruiting patients and/or

Multinomial Probit Model with Singular Value Decomposition
Logit and probit models are statistical models that are widely used for the analysis of categorical ordinal/nominal data.The difference between these two models is the choice of the link function relating the linear predictor to the expected value; the probit model uses the inverse normal cumulative distribution, and the logit model uses the logit transformation.As discussed by Greene 4 , in most cases, the choice of the link function is largely a matter of taste.We utilized the probit model here to analyze data with polytomous ordinal response variables.In general, the multinomial ordinal probit model can be expressed by latent unobserved continuous variables associated with categorical responses.Let us assume that responses y 1 , y 2 , . . ., y n are observed, where y i takes one of the J-ordered categories and θ 1 , . . ., θ J are real numbers of bin boundaries, which satisfy that −∞ θ 1 ≤ • • • ≤ θ J ∞.As discussed by Albert and Chib 5 , we denote that z 1 , z 2 , . . ., z n are latent continuous random variables.We assume that the latent variable z i associated with a categorical outcome y i can be explained in terms of an underlying linear model, and that the observed response y i has category j if and only if z i falls between θ j−1 and θ j .The multinomial ordinal probit model is equivalent to the following model: where x i is a 1 × m vector of the explanatory variables for the ith sample, and β is a m × 1 vector of parameters to be estimated.In vector-matrix notation, we can have the multinomial ordinal probit model where z is the n×1 vector of latent variables, X is the n×m matrix of the explanatory variables, β is the m × 1 vector of unknown regression coefficients, and follows an independent standard multivariate normal distribution, ∼ N 0, I n .By applying SVD to the matrix X in 2.2 , when rank X n, the matrix can be expressed as X ADF , where A is the m × n singular value factor loading matrix with orthonormal columns so that A A I n , F is the n × n SVD orthogonal factor matrix with F F FF I n , and D diag d 1 , . . ., d n , the diagonal matrix of positive singular values, ordered as Therefore, in the product X ADF , the last n − r columns of both A and F for which d r 1 • • • d n 0 are ignored since they interact with the block of zeros in D. Hence, this leads to another form of SVD, X A r D r F r , that is, the product of the first r columns of A, the upper r × r block of D, and the first r columns of F. Since the difference between the both scenario is only in dimension of matrices in SVD, we assume that rank X n in the rest part of the paper for convenience.Thus, the model in 2.2 with the SVD of X can be written as follows: where L FD and γ n×1 A n×m β m×1 .Therefore, z, the n × 1 vector of latent variables in 2.3 , has a multivariate normal distribution, that is, z ∼ N Lγ, I n .As shown in 2.3 , γ is expressed by a linear combination of the original parameters β .Hence, we call γ as the vector of superfactors.The model in 2.3 represents a massive dimension reduction from m to n parameters.The regression model with m parameters reduced to that with n parameters derived from the SVD of the covariate matrix X.Therefore, the statistical inference on the original parameter turns into the superfactors.Let p i p i1 , . . ., p i J denote the vector of probabilities associated with the assignment of the ith sample into categories 1, . . ., J, where p ij denote the probability that a sample falls into category j.From 2.1 and 2.3 , it follows that where φ • and Φ • denote the probability density function and the cumulative density function of the standard normal distribution, respectively, and l i is the ith row of matrix L FD.Let y y 1 , . . ., y n denote the vector of responses observed for all samples.Then, the probability of observing data y is given as follow: 2.5 From 2.4 and 2.5 , the log likelihood function for γ, θ can be written as

Model Fitting with Maximum Likelihood Estimation
The maximum likelihood estimates MLEs of the superfactors, γ, in 2.3 can be obtained by the iteratively reweighted least squares IRLSs procedure 6 using the log likelihood function for γ, θ in 2.6 .The procedure can be briefly described as follows.Let η denote the vector of all model parameters, that is, η θ 2 , . . ., θ J−1 , γ 1 , . . ., γ n−J 2 .Note that θ 1 and θ J are not included in this vector because their values are assumed to be 0 and ∞, respectively, for the purpose of model identifiability.Also note that the J − 2 smallest singular values together with their corresponding factors are dropped from the parameters since the number of parameters must not exceed the number of samples.Assuming that J 4, define that and H i diag f i1 , . . ., f iJ−1 , where f ij denotes the derivative of the standard normal cumulative distribution function at θ j − l i γ.Take W i diag p i , where p i is the J × 1 vector of probabilities that the ith individual falls in each category, that is, p i p i1 , . . ., p i J , and let N i be a J × 1 vector of observation, that is, N i I{y i 1}, . . ., I{y i J} .After initialization of all elements, the iteration s 1 s 1, 2, . . .can be written as

2.8
The MLE of η can be found by performing the process recursively until the change between η s 1 and η s is negligible.

General Solution for the Original Parameters
We have discussed how to estimate the superfactor γ in 2.3 thus far.Since the primary interest is to find SNPs that are significantly associated with a disease, it is necessary to transform the superfactor γ to the original parameters β in 2.1 .The equation γ A β in 2.3 can be utilized for the transformation even though A is m × n nonsquare matrix.As discussed by Graybill 7 , the unique solution for β can be achieved by taking the generalized inverse matrix of A as A since A A I n .Therefore, the unique solution for SNP effect β can be calculated by β Aγ.

Selection of Significant SNPs
Finding significant SNPs is the same as testing if each SNP effect β i , i 1, . . ., m is statistically significant, that is, testing the hypothesis: H 0 : The simple method is to use Wald's test statistic, which forms β − β /se β and assumes a normal distribution.However, when m n, it is hard to calculate se β directly from the data.We therefore utilized permutation test to select significant SNPs.The rationale behind the test is that, under the null hypothesis, the estimate of β obtained from the raw unpermuted data is similar to the estimate of β obtained from the permuted data.That is, the difference between two estimates is closed to zero under H 0 .With this idea, we can construct Wald's test statistic as follows.Let β i i 1, . . ., m be the estimate of the ith SNP effect from the raw data and β k i k 1, . . ., K be the estimate of the ith SNP effect from the kth-permuted data.Let us define β d k i as the difference between β i and β k i , that is, Then, Wald's test statistic can be as follows: where

and se
Under the null hypothesis, the statistic Λ i defined in 2.9 follows approximately standard normal distribution when k is large.P value for rejecting the null hypothesis at a significance level α 0.05 can be utilized to identify significant SNPs.

Simulated Multinomial Ordinal Data
The validity of the proposed method was evaluated using simulated data sets.The procedure of data generation was composed of three steps: generating genotype data with certain genetic model, generating the latent variable, and defining the disease status variable by applying the predefined bin boundaries.The brief scheme of each step is as follows: we first generated 10 sets of the simulated genotype data under an additive genetic model, each set consists of 100 samples and 1000 SNPs.From 2.1 , we can notice that the latent variable z i consists of two parts: the expected value x i β and the random error i .In order to generate the expected value, we assumed that, for each sample, 9 out of the 1000 SNPs every 101th SNP, except the last one contribute to disease status , where β 1 and β 2 are set as −1 and 1, respectively.Hence, the latent variable can be obtained from the sum of the expected values x i β and the random error generated from standard normal distribution.We then generated disease status variable y i assuming 3 disease development stages.Therefore, when applying the proposed method to the simulated data sets, we would expect 9 strong signals corresponding to each of the 9 disease-associated SNPs.We also compared results obtained from the proposed method with that from single SNP analysis with multinomial ordinal probit model.

Mexican-American Coronary Artery Disease (MACAD) Study
We also applied the proposed method to study sample recruited through the Mexican-American Coronary Artery Disease MACAD study 8, 9 .The study population consists of probands who are Mexican American aged between 45 and 75 with coronary artery disease: spouses of probands, adult offspring ≥18 , and their spouses.For the offspring generation, we performed oral glucose tolerance test and genotyped 132 SNPs in 32 genes selected based on a prior relationship to insulin physiology.The goal of the study herein was to identify genes involved in the development of IGT and/or IFG, where IGT was defined as a 2 hr glucose level between 140 and 199 mg/dL and IFG defined as a fasting glucose level between 100 and 125 mg/dL.In order to identify and compare genes affecting the development of IGT and/or IFG, we generated two study samples, for which each sample has 3 disease stages D1 both 2 hr and fasting glucoses normal N/N n 1 60 , IGT only IGT/N n 2 31 and IGT and IFG IGT/IFG n 3 15 D2 both 2 hr and fasting glucoses normal N/N n 1 60 , IFG only N/IFG n 2 34 and IGT and IFG IGT/IFG n 3 15 .

Simulated Multinomial Data
Figure 1 summarizes the results of association analyses when applying the multinomial ordinal probit model with SVD to the simulated data sets.All numbers shown in the figures are the average of the estimates obtained from the 10 simulated data sets.As mentioned previously, we expected 9 strong signals corresponding to the 9 SNPs designed to be associated  with disease development when generating the simulated data sets, and 9 were observed in our analysis.Similar results from the single SNP analysis were shown in Figure 2. Figure 1 a summarizes MLEs of SNP effects calculated with the multinomial ordinal probit model with SVD.The figure shows that almost all MLEs except 9 were between −0.1 and 0.1, while there were 9 large MLEs 4 around 0.3, 5 around −0.3 corresponding to the 9 SNPs contributed to disease status. Figure 1 b gives P values in −log 10 scale for testing SNP effects.The line in Figure 1 b corresponds to significance level α 0.05.9 SNPs were clearly separated from the rest and had −log 10 P value > 1.3.
Figure 2 a summarizes MLEs of SNP effects obtained by the single SNP analysis and shows that no signal was strong enough to be distinguished from all other signals.The P values are given in Figure 2 b in −log 10 scale.SNP 501 in the middle of the figure had a relatively strong signal compared to all others.However, the −log 10 P value was much less than 1.3, which corresponds to significance level α 0.05.Thus, no SNPs were identified as statistically significant from the single SNP analysis method.In contrast to the fact that no SNP was identified as statistically significant by the single SNP analysis, the multinomial ordinal probit model with SVD method was able to identify all 9 SNPs contributing to disease status as statistically significant at significance level α 0.05.These results indicated that the proposed method should be reliable for the analysis of large-scale genome-wide association data that have polytomous ordinal responses when m n.

Mexican-American Coronary Artery Disease (MACAD) Study
We analyzed the data sets D1 and D2 see methods generated from a subsample of subjects recruited through a coronary artery disease proband in the Mexican-American Coronary  Artery Disease Project as described in the method section, using both the multinomial ordinal probit model with SVD method and the single SNP analysis method.Figure 3 summarizes the analysis results with the data D1 N/N-IGT/N-IGT/IFG using the single SNP analysis.Figure 3 a gives MLEs of SNP effects.Figure 3 b plots P values of association analysis in −log 10 scale.With Sidak correction, which is often used to correct multiple testing problem, the adjusted significance level should be 1 − 1 − α 1/m , where α is significance level, and m represents the number of tests.Thus, the corrected −log 10 P value threshold for significance level α 0.05 is 3.4, which corresponds to the line in Figure 3 b .We applied the adjusted significance level to the P values in Figure 3 b since the P values are before correcting multiple testing problem.No SNP was identified as statistically significant Figure 3 b .
The data set D2 N/N-N/IFG-IGT/IFG was analyzed with the same method, and analysis results are given in Figure 4. Figure 4 a plots MLEs of the SNP effects.Since P values in Figure 4 b are before the the multiple testing correction, we used 3.4 as the −log 10 P value threshold corresponding to 0.05 significance level after the multiple testing correction.Two SNPs corresponding to SORC1 and SORT1 were found significant.
We then also analyzed D1 and D2 with the multinomial ordinal probit model with SVD method.Figure 5 summarizes the analysis results for data D1. Figure 5 a plots MLEs of SNP effects.Figure 5 b plots P values in −log 10 scale for testing SNP effects.The multiple testing correction does not need to be applied now since the method tests all SNPs simultaneously.With the 1.3 P value threshold, which corresponds to 0.05 significance level, we identified that 8 out of the 32 candidate genes SORCS1, AMPD1, PPARα, AMPD2, PRKAA2, C5, TCF7L2, and ITR were associated with the disease path defined in D1.
The multinomial ordinal probit model with SVD method was applied to data set D2 as well.The results are shown in Figure 6.In Figure 6 a , MLEs of the SNP effects were     summarized.Figure 6 b plots the P values in −log 10 scale for testing the SNP effects.It showed that 11 SNPs corresponding to 9 out of 32 candidate genes SORCS1, AMPD1, PPARα, CAPN10, IL4, NOS3, CD14, GCG, and SORT1 have −log 10 P value greater than the 1.3 P value threshold.From the analyses of D1 and D2, we found that SNPs in 3 genes SORCS1, AMPD, and PPARα were associated with both IGT and IFG; SNPs in 5 genes AMPD2, PRKAA2, C5, TCF7L2, and ITR were associated with IGT only; SNPs in 6 genes CAPN, IL4, NOS3, CD14, GCG, and SORT1 were associated with IFG only.These results suggest that IGT and IFG may indicate different pathways to diabetes, with different genetic determinants.Thus, using both simulated data and a real study sample, we demonstrated that multinomial ordinal probit model with SVD method can be utilized to identify associated markers involved in disease development when multidisease stages are considered.For relatively small size of data set used in the paper, which is 100 samples and 1000 SNPs for the simulation study, the computation took about less than 10 minutes to complete.However, the computation time might be a concern when applying this method to large data set, such as GAWS with millions of SNPs and thousands of samples.

Figure 1 :
Figure 1: Analysis of the simulated data sets with multinomial ordinal probit model with SVD.

Figure 2 :
Figure 2: Analysis of the simulated data sets with single SNP analysis.

b
the disease status a Estimates of parameters by single-SNP analysis P values of testing SNP effects −log 10 P value

Figure 3 :
Figure 3: Analysis of genes for IGT/IFG through IGT pathway Data Set D1 with single-SNP analysis.
of testing SNP effects −log 10 P value

Figure 4 :
Figure 4: Analysis of genes for IGT/IFG through IFG pathway Data Set D2 with single SNP-analysis.

Figure 5 :
Figure 5: Analysis of Genes for IGT/IFG through IGT pathway Data Set D1 with multinomial ordinal probit model with SVD.
of testing SNP effects −log 10 P value

Figure 6 :
Figure 6: Analysis of genes for IGT/IFG through IFG pathway Data Set D2 with multinomial ordinal probit model with SVD.