Modeling the Ranked Antenatal Care Visits Using Optimized Partial Least Square Regression

The frequency and timing of antenatal care visits are significant factors in infant and maternal morbidity and mortality. The present research is conducted to determine the risk factors of reduced antenatal care visits using an optimized partial least square regression model. A data set collected during 2017-2018 by the Pakistan Demographic and Health Surveys is used for modeling purposes. Partial least square regression models coupled with rank correlation measures are introduced to improve performance for a ranked response. The proposed models included PLSρs, PLSτA, PLSτB, PLSτC, PLS D, PLSτGK, PLS G, and PLS U. Three filter-based factor selection methods are executed, and leave-one-out cross-validation by linear discriminant analysis is measured on the predicted scores of all models. Finally, the Monte Carlo simulation method with 10 iterations of repeated sampling for optimization of validation performance is applied to select the optimum model. The standard and proposed models are executed over simulated and real data sets for efficiency comparison. Based on validation performance, PLSρs is found to be the most appropriate proposed method to model the observed ranked data set of antenatal care visits. The optimal model selected 29 influential factors of inadequate use of antenatal care. The important factors of reduced antenatal care visits included women's educational status, wealth index, total children ever born, husband's education level, domestic violence, and history of cesarean section. The findings suggest that partial least square regression algorithms coupled with rank correlation coefficients provide more efficient estimates for ranked data in the presence of multicollinearity.


Introduction
Pakistan set a target of reducing the maternal mortality ratio (MMR) to 140 per 100,000 live births by 2015 by increasing skilled birth attendance and improving access to reproductive health care, as suggested by the fifth Millennium Development Goal (MDG) on improving maternal health. The MDG progress assessment reported that Pakistan was not close to attaining the target in 2015. Recently, Pakistan has endorsed the Sustainable Development Goals (SDGs), committing to decrease the MMR to 70 per 100,000 live births by 2030 by increasing skilled birth attendance, facilitating access to modern contraception, and extending the coverage of health workers. The Government of Pakistan took initiatives and made good progress in maternal health indicators during the last decade, and a significant decline in MMR from 276 to 178 was reported [1,2]. Pregnancy-related morbidity and mortality can be reduced by improving access to maternal health care services. At least four antenatal care (ANC) visits to skilled personnel are recommended to avoid pregnancy-related complications [3]. Nearly 12% of Pakistani women reported no ANC throughout their pregnancy, 36% had fewer than four visits, and 52% reported four or more visits [1]. Several studies have assessed the significant influential factors of antenatal care attendance in Pakistan without considering the frequency of ANC [4,5]. Poisson regression, negative binomial regression, zero-inflated, and hurdle regression models have been commonly used to model the count of ANC visits [6,7]. Binary logistic regression and multinomial logistic regression models have also been used to study the use and ranks of ANC visits [5,8]. Advancements in health research generate public health data having many covariates, some or all of which may be correlated. Several studies have been conducted to identify influential factors of different public health concerns using multiple statistical tools and techniques [9-13].
The partial least square (PLS) regression model has attracted interest during the last few decades as a statistical method for modeling data with multicollinearity. A variety of modified PLS algorithms have been introduced for superior model performance [14]. Most PLS algorithms model continuous factors, and a few are specifically designed for a categorical framework, but no specific algorithm has been proposed to address ranked data. To fill the gap of obtaining the optimal model for a ranked response variable, modified PLS algorithms based on rank correlation loading weights are introduced. The main motivation of the present study is to propose modified PLS algorithms that specifically address a ranked response factor in the presence of multicollinearity. To improve the PLS regression model, eight algorithms based on rank correlation measures are proposed in this study: Spearman's rank correlation coefficient, Kendall's τA rank correlation coefficient, Kendall's τB rank correlation coefficient, the Stuart-Kendall τC rank correlation coefficient, Somers' delta (D), Goodman-Kruskal's tau (τGK), Goodman-Kruskal's gamma (G), and Theil's U correlation coefficient. To the best of our knowledge, no previous research has considered multicollinear covariates in modeling the ranking of ANC visits of Pakistani women. Thus, the objectives of this study are twofold: (i) to develop a regression model for the ranked response covering the issue of multicollinearity and (ii) to determine the risk factors for inadequate use of antenatal care. This study introduced eight novel PLS algorithms addressing multicollinearity for a ranked response, which has not been discussed earlier. The proposed and standard algorithms are executed on a real-life application of ANC data for comparison purposes. These algorithms will help users obtain more efficient models than the standard PLS approaches, specifically for ranked data.
Regarding the clinical importance of this study, the influential selected variables of ANC will help maximize the chances for a normal pregnancy by providing priority interventions, increasing coverage, and improving health quality. The novel contribution of this study included:

Consider the regression model $y = \alpha + Z\beta + \varepsilon$, where $\alpha$ and $\beta$ are the unknown regression parameters and $\varepsilon$ is the error term. Let $Z_{(n,p)}$ be the matrix of explanatory variables, assumed to be linearly related to the response $y_{(n,1)}$, and let $C$ (where $C \le p$) denote the number of components used for prediction. The general algorithm then iterates over the components $c = 1, \dots, C$. Let $W$, $S$, $P$, and $q$ be the matrices/vectors that collect the loading weights, scores, $Z$-loadings, and $Y$-loadings computed at each iteration of the algorithm, respectively. The regression estimators of the PLSR model are computed by $\hat{\beta} = W(P'W)^{-1}q$ and $\hat{\alpha} = \bar{y} - \bar{Z}\hat{\beta}$ [15]. The general steps of standard PLSR are presented in Figure 1.
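As an illustration of how W, P, and q lead to the regression estimators, the following is a minimal sketch of a textbook single-response PLS iteration (covariance-based loading weights, scores, loadings, deflation). It is an assumption about the standard algorithm, not a reproduction of Figure 1; the function names `pls1`, `solve`, `dot`, and `column` are hypothetical, and only the Python standard library is used.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(M, j):
    return [row[j] for row in M]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(n):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [a - f * c for a, c in zip(M[r], M[k])]
    return [M[i][n] / M[i][i] for i in range(n)]

def pls1(Z, y, C):
    """Return (alpha_hat, beta_hat) from a C-component PLS fit (sketch)."""
    n, p = len(Z), len(Z[0])
    z_mean = [sum(column(Z, j)) / n for j in range(p)]
    y_mean = sum(y) / n
    Zc = [[row[j] - z_mean[j] for j in range(p)] for row in Z]  # centered Z
    yc = [v - y_mean for v in y]                                # centered y
    W, P, q = [], [], []
    for _ in range(C):
        # Loading weights: proportional to Z'y, scaled to unit length.
        w = [dot(column(Zc, j), yc) for j in range(p)]
        norm = sqrt(dot(w, w))
        w = [v / norm for v in w]
        s = [dot(row, w) for row in Zc]                          # scores
        ss = dot(s, s)
        p_c = [dot(column(Zc, j), s) / ss for j in range(p)]     # Z-loadings
        q_c = dot(yc, s) / ss                                    # Y-loading
        # Deflate Z and y by the part explained by this component.
        Zc = [[row[j] - s[i] * p_c[j] for j in range(p)]
              for i, row in enumerate(Zc)]
        yc = [yc[i] - s[i] * q_c for i in range(n)]
        W.append(w)
        P.append(p_c)
        q.append(q_c)
    # beta_hat = W (P'W)^{-1} q, with W and P stored component by component.
    PtW = [[dot(P[a], W[b]) for b in range(C)] for a in range(C)]
    x = solve(PtW, q)
    beta = [sum(W[c][j] * x[c] for c in range(C)) for j in range(p)]
    alpha = y_mean - dot(z_mean, beta)
    return alpha, beta
```

When C equals the number of predictors and the predictors are not collinear, this sketch reproduces the ordinary least squares fit; the benefit of PLS under multicollinearity comes from choosing C smaller than p.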
The standard PLS is designed for a continuous dependent variable y, but if the response is measured on a rank scale, this standard method may not work well. The most important phase of the PLS algorithm is computing loading weights that are able to identify significant factors. Loading weights measure the correlation between the dependent variable and the predictors. If the data set is ranked, then Spearman's rank correlation coefficient, Kendall's τA rank correlation coefficient, Kendall's τB rank correlation coefficient, the Stuart-Kendall τC rank correlation coefficient, Somers' delta (D), Goodman-Kruskal's tau (τGK), Goodman-Kruskal's gamma (G), and Theil's U correlation coefficient are the recommended measures of rank correlation. These measures of association are used to compute the loading weights of the PLS algorithm. The modified loading weights of PLSR are visually displayed in Figure 2.

PLSρs.
Spearman's rank correlation coefficient [16] is a nonparametric measure of rank correlation using a monotonic function. It is used to compute the PLS loading weights as $\rho_s = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ denotes the difference between the two ranks of each observation and $n$ is the number of observations; the modified PLSR algorithm is referred to as PLSρs.
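The idea of replacing covariance-based loading weights with Spearman's coefficient can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`ranks`, `spearman`, `spearman_loading_weights`) are hypothetical, the coefficient uses the rank-difference formula $1 - 6\sum d_i^2 / (n(n^2-1))$, which is exact only when there are no ties, and scaling the weight vector to unit length follows the usual PLS convention rather than anything stated in the text.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho from squared rank differences (exact without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def spearman_loading_weights(Z, y):
    """One weight per predictor column, scaled to unit length (assumption)."""
    w = [spearman([row[j] for row in Z], y) for j in range(len(Z[0]))]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]
```

For example, `spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])` has squared rank differences summing to 4, so the coefficient is 1 - 24/120 = 0.8.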

PLSτA.
The Kendall rank correlation coefficient, or Kendall's τ coefficient, is a measure of rank correlation. Tau-A (τA) makes no adjustment for ties [17]. It is used to define the PLS loading weights as $\tau_A = \frac{n_c - n_d}{n_0}$, where $n_c$ is the number of concordant pairs, $n_d$ is the number of discordant pairs, and $n_0 = n(n - 1)/2$; the modified algorithm is named PLSτA.
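The concordant and discordant pair counts behind τA can be obtained by direct enumeration of all pairs of observations. The sketch below is illustrative only; the function name `kendall_tau_a` is hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """tau_A = (n_c - n_d) / n_0 with n_0 = n(n - 1)/2; ties are not adjusted."""
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            n_c += 1      # both variables order the pair the same way
        elif s < 0:
            n_d += 1      # the two variables order the pair oppositely
        # s == 0: the pair is tied on x or y and counts for neither
    n_0 = n * (n - 1) / 2
    return (n_c - n_d) / n_0
```

Because tied pairs still count in the denominator, τA cannot reach 1 when ties are present, which is common for responses with only a few ranks.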

PLSτB.
Kendall's tau-B makes adjustments for ties [18]. The PLS loading weights are altered by using τB as $\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}$, where $n_1 = \sum_i t_i(t_i - 1)/2$, $n_2 = \sum_j u_j(u_j - 1)/2$, $t_i$ is the number of tied values in the $i$th group of ties for the first variable, and $u_j$ is the number of tied values in the $j$th group of ties for the second variable; the proposed model is termed PLSτB.
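A small worked example may help show the effect of the tie adjustment. The data are hypothetical: x = (1, 1, 2, 3) and y = (1, 2, 2, 3), so each variable has one tied group of size two, and tied pairs are counted as t(t - 1)/2 per group, consistent with $n_0 = n(n-1)/2$.

```latex
\[
n_0 = \frac{4 \cdot 3}{2} = 6, \qquad n_c = 4, \qquad n_d = 0, \qquad
n_1 = \frac{2 \cdot 1}{2} = 1, \qquad n_2 = \frac{2 \cdot 1}{2} = 1
\]
\[
\tau_A = \frac{4 - 0}{6} \approx 0.667, \qquad
\tau_B = \frac{4 - 0}{\sqrt{(6 - 1)(6 - 1)}} = \frac{4}{5} = 0.8
\]
```

The two tied pairs are removed from the denominator of τB, so the coefficient is larger than the unadjusted τA for the same data.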
2.2.4. PLSτC. The Stuart-Kendall tau-c is more suitable for contingency tables [19]. The τC replaced the PLS loading weights as $\tau_C = \frac{2(n_c - n_d)}{n^2 (m - 1)/m}$, where $m$ is the minimum of the number of rows and columns, and the modified PLSR algorithm is called PLSτC.
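As a brief numerical illustration, suppose a response with 3 ranks is cross-tabulated against a predictor with 4 ranks, so that m = min(3, 4) = 3. The counts n = 10, n_c = 20, and n_d = 8 are hypothetical values chosen only to show the arithmetic.

```latex
\[
\tau_C = \frac{2\,(n_c - n_d)}{n^2\,(m - 1)/m}
       = \frac{2\,(20 - 8)}{10^2 \cdot \tfrac{2}{3}}
       = \frac{24}{66.67} \approx 0.36
\]
```

The factor (m - 1)/m rescales the coefficient for the shape of a non-square table, which is the setting where τC is usually preferred over τB.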

PLS D.
The PLS loading weights based on Somers' delta (D) [20] of variable Y with respect to variable Z are defined as $D(Y \mid Z) = \tau(Z, Y)/\tau(Z, Z)$, where τ denotes Kendall's tau computed without tie adjustment. Kendall's τ is symmetric, whereas Somers' D is asymmetric in Z and Y; the model is named PLS D.
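The asymmetry of Somers' D can be seen in a short sketch that uses the usual definition of D as a ratio of two unadjusted Kendall coefficients, $D(Y \mid Z) = \tau(Z, Y)/\tau(Z, Z)$. The function names `tau_a` and `somers_d` are hypothetical, and the example data are made up for illustration.

```python
from itertools import combinations

def tau_a(x, y):
    """Kendall's tau without tie adjustment: (n_c - n_d) / (n(n - 1)/2)."""
    n = len(x)
    total = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        total += (s > 0) - (s < 0)   # +1 concordant, -1 discordant, 0 tied
    return total / (n * (n - 1) / 2)

def somers_d(y, z):
    """Somers' D of y with respect to z: tau(z, y) / tau(z, z)."""
    return tau_a(z, y) / tau_a(z, z)

z = [1, 1, 2, 3]        # predictor with one tied pair
y = [1, 2, 3, 4]        # response without ties
d_y_given_z = somers_d(y, z)   # pairs tied on z are excluded
d_z_given_y = somers_d(z, y)   # no pairs are tied on y, so none are excluded
```

Here `d_y_given_z` equals 1 while `d_z_given_y` equals 5/6, because only ties on the conditioning variable are removed from the denominator.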
2.2.6. PLSτGK. Goodman-Kruskal's tau (τGK) [21] is integrated as the PLS loading weights, and the modified algorithm is called PLSτGK.
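PLS G.
Goodman-Kruskal's gamma (G) is integrated as the PLS loading weights as $G = \frac{n_c - n_r}{n_c + n_r}$, where $n_c$ is the number of concordant pairs and $n_r$ is the number of reversed pairs. Goodman-Kruskal's gamma drops ties, and the PLSR model is named PLS G.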
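To illustrate what dropping ties means for gamma, consider the hypothetical observations (x, y) = (1, 2), (2, 1), (2, 3), (3, 3). Of the six pairs, three are concordant, one is reversed, and two are tied on x or y.

```latex
\[
G = \frac{n_c - n_r}{n_c + n_r} = \frac{3 - 1}{3 + 1} = 0.5,
\qquad
\tau_A = \frac{n_c - n_r}{n(n-1)/2} = \frac{3 - 1}{6} \approx 0.333
\]
```

Gamma uses only the four untied pairs in its denominator, so it is larger in magnitude than the unadjusted Kendall coefficient whenever ties occur.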
2.2.8. PLS U. Theil's U correlation coefficient, or uncertainty coefficient [22], altered the PLS loading weights as $U(Z \mid Y) = \frac{H(Z) - H(Z \mid Y)}{H(Z)}$, where $H(Z)$ represents the entropy of a single distribution and $H(Z \mid Y)$ represents the conditional entropy, and the modified PLSR algorithm is referred to as PLS U.
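The uncertainty coefficient can be computed directly from empirical frequencies. The sketch below is illustrative: the function names are hypothetical, natural logarithms are used (the ratio does not depend on the base), the conditional entropy is obtained as H(Z, Y) - H(Y), and returning 1.0 for a constant Z is a convention assumed here, not taken from the text.

```python
from collections import Counter
from math import log

def entropy(values):
    """Empirical Shannon entropy of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())

def conditional_entropy(z, y):
    """H(Z | Y) = H(Z, Y) - H(Y)."""
    return entropy(list(zip(z, y))) - entropy(y)

def theil_u(z, y):
    """U(Z | Y) = (H(Z) - H(Z | Y)) / H(Z)."""
    h_z = entropy(z)
    if h_z == 0:
        return 1.0   # assumed convention: a constant Z carries no uncertainty
    return (h_z - conditional_entropy(z, y)) / h_z
```

With z = [1, 1, 2, 2] and y = [1, 2, 3, 4], knowing y determines z completely, so U(Z | Y) = 1, whereas knowing z only halves the uncertainty about y, so U(Y | Z) = 0.5. Like Somers' D, the measure is asymmetric.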

Filter-Based Factor Selection Methods.
Several variable selection methods integrated with PLSR have been introduced. The following are considered here.

The Loading Weight (LW).
The loading weights $w_j$ are used as a measure of the importance of the predictors, as defined in [23].
2.3.3. Significance Multivariate Correlation (SMC). The significance multivariate correlation measure is used to reduce the effect of irrelevant predictors and enhance the influence of significant variables included in the model. The SMC is computed as described in [23].

Results
Initially, the PLS models with modified loading weights are executed on a simulated data set of ranked variables. A sample of size 1000 with 100 predictors is generated. The response variable and 50% of the predictors are generated over 3 ranks, and the remaining explanatory variables are distributed over 4 ranks. Spearman's coefficient, Kendall's coefficient-A, Kendall's coefficient-B, Stuart-Kendall's coefficient, Somers' delta, Goodman-Kruskal's tau, Goodman-Kruskal's gamma, and Theil's U coefficient are used as loading weights of the PLS algorithm fitted over the simulated data set to observe the variation in performance of the standard and proposed models based on the Akaike information criterion (AIC). Figure 3 showed the efficiency of the models as measured by AIC and indicated that PLSR algorithms with modified loading weights have higher efficiency (lower mean AIC) than standard PLSR for a ranked response. The PLSR model with τGK as the modified loading weight showed optimum performance compared to the eight other models without integrating any variable selection method. All other proposed models also showed higher efficiency than standard PLSR. Figures 4 and 5 also demonstrated the higher accuracy of the proposed models compared to the standard PLSR algorithm when integrated with the LW and SMC variable selection methods. Both figures showed that PLSτGK and PLS U have optimum performance compared to all other models. The standard and modified models are also executed over the real data set of ANC for comparison of accuracy. The data set of ANC visits had 43 predictors observed on 943 samples (mothers). The Spearman rank correlation coefficient is used to examine the multicollinearity in the data.

Figure 5: The AIC performance of PLSR models based on rank correlation coefficients over simulated data, integrated with the SMC variable selection method.
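The layout of the simulated data set (1000 observations, 100 predictors, a response and half of the predictors on 3 ranks, the rest on 4 ranks) can be sketched as below. The text does not state the sampling distribution, the dependence between variables, or which predictors receive 3 ranks, so independent uniform draws and the choice of the first 50 columns are assumptions made purely for illustration; the function name `simulate_ranked_data` is hypothetical.

```python
import random

def simulate_ranked_data(n=1000, p=100, seed=0):
    """Generate a ranked response y and an n-by-p ranked predictor matrix Z."""
    rng = random.Random(seed)
    half = p // 2
    # Response on 3 ranks (assumption: uniform over 1..3).
    y = [rng.randint(1, 3) for _ in range(n)]
    # First half of the predictors on 3 ranks, second half on 4 ranks
    # (assumption: independent uniform draws, no induced collinearity).
    Z = [[rng.randint(1, 3) if j < half else rng.randint(1, 4)
          for j in range(p)]
         for _ in range(n)]
    return Z, y

Z, y = simulate_ranked_data()
```

A simulation intended to study multicollinearity would additionally need correlated predictors, which this sketch does not attempt to model.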

Computational and Mathematical Methods in Medicine
The correlogram measured strong correlation among 16 covariates, while intermediate correlation is observed among several other predictors, as shown in Figure 6. The existence of multicollinearity supports the applicability of PLSR for dealing with ranked data with multicollinearity. The frequency of ANC is classified into three ranks: inadequate, intermediate, and adequate. A ratio of 70 : 30 is used to randomly split the data into training and testing sets, respectively. Initially, PLSR integrated with rank correlation coefficients as loading weights is executed. Figure 7 showed the comparison of validation performance of standard PLSR and the eight proposed PLSR models integrated with correlation coefficients without considering any variable selection method. These results showed that PLSρs and PLS D have optimum performance compared to standard PLSR and the other proposed PLSR models for the observed data of ANC visits. PLSτA and PLSτGK also have relatively higher accuracy than standard PLSR. In Figure 8, the loading weight factor selection method is incorporated with each PLSR model. The inclusion of the variable selection method enhanced the overall performance of the standard and modified PLSR models. The results showed that PLSτA and PLSτB are more efficient in terms of optimization accuracy compared to the standard and other modified PLS models. A similar pattern of performance for PLSρs and PLSτC is observed compared to standard PLSR. The four other proposed PLSR models integrated with rank correlation coefficients showed slightly lower accuracy compared to standard PLSR.
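The repeated 70 : 30 splitting used for validation can be sketched as follows. This is a hedged outline only: `split_70_30` and `monte_carlo_validation` are hypothetical names, `score_fn` stands in for whatever model fitting and accuracy computation is applied to each split (it is not implemented here), and the default of 10 repeats mirrors the number of iterations reported in the study.

```python
import random

def split_70_30(n, rng):
    """Randomly partition the indices 0..n-1 into 70% training, 30% testing."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(0.7 * n)
    return idx[:cut], idx[cut:]

def monte_carlo_validation(n, score_fn, repeats=10, seed=0):
    """Average score_fn(train_idx, test_idx) over repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = split_70_30(n, rng)
        scores.append(score_fn(train, test))
    return sum(scores) / repeats
```

For the 943 mothers in the ANC data, each split yields 660 training and 283 testing observations; averaging over repeats reduces the dependence of the reported accuracy on any single random partition.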
Figure 6: Correlogram of the predictors (axis labels X1 to X43).

Comparison based on validation accuracy supported that PLSρs is the most appropriate proposed method to model the observed ranked data set of ANC. Figure 7 represented the optimal performance of PLSρs compared to all other models without considering any variable selection method. Integrated with the RC factor selection method, the proposed PLSρs showed higher efficiency compared to standard PLS, as visualized in Figure 9. Moreover, PLSρs combined with the SMC factor selection method achieved the highest optimization accuracy of nearly 78% among all methods in Figure 10. Based on this evidence, PLSρs featured with the SMC method is finally picked for the selection of influential factors of ANC. For extraction of influential factors of ANC, PLSρs coupled with SMC is executed, and the 29 selected variables are presented in Table 1 with their regression estimates.

Discussion
To examine the significant predictors associated with ANC, sample data obtained from the PDHS (2017-2018) is used. The presence of multicollinearity motivated the use of PLS, a popular substitute for the standard regression model. The data is randomly divided into testing and training sets. Eight PLS algorithms based on rank correlation coefficients are introduced specifically to address the ranked response and are compared with the standard PLS model to demonstrate the improved efficiency in model building. Furthermore, three variable selection methods are integrated with the standard and modified PLS algorithms to examine the variation in the accuracy of the modified and standard PLS models with and without variable selection. The variable selection methods considered here are loading weights, regression coefficients, and significance multivariate correlation. The validation performance is computed over 10 iterations to examine the efficiency of the nine PLS models integrated with the variable selection methods.
Comparison based on validation performance supported that PLSρs is the most appropriate proposed method to model the observed ranked data set of ANC. Figure 7 represented the optimal performance of PLSρs compared to all other models without considering any variable selection method. Integrated with the RC factor selection method, the proposed PLSρs showed higher efficiency compared to standard PLS, as visualized in Figure 9. Moreover, PLSρs combined with the SMC factor selection method achieved the highest optimization accuracy of nearly 78% among all methods in Figure 10. Based on this evidence, PLSρs featured with the SMC method is finally picked for the selection of influential factors of ANC. Regarding validation accuracy, several notable facts are observed about the comparison of efficiency for ranked data. Primarily, PLSρs and PLS D have optimum performance compared to standard PLSR and the other proposed models without considering any factor selection method. For the observed data set, PLSτA and PLSτB combined with the LW variable selection method are found to be more efficient in terms of optimization accuracy compared to the standard and other modified PLS models. Integrated with the RC method for variable selection, PLSρs, PLSτA, PLSτB, PLSτC, PLS D, and PLS U featured improved performance compared to standard PLS. PLSτGK and PLS G are found to exhibit approximately the same efficiency as standard PLSR. PLSρs and PLSτC embedded with the SMC factor selection method showed optimum accuracy compared to standard PLS. Considering all validation comparisons, the modified models integrated with rank correlation coefficients exhibit higher efficiency for the ranked data set of ANC compared to the standard PLS algorithm.
PLSρs coupled with SMC is suggested for modeling the ANC ranked data, and 29 influential factors are observed to discriminate the ANC ranks. The proposed algorithms for a ranked response will help researchers in different fields fit regression models more efficiently, even in the presence of multicollinearity. Since a ranked response is rarely addressed specifically, the findings of this study offer new, potentially useful information for this ranked population. In the future, these algorithms may be integrated with other variable selection methods to observe their efficiency. Also, the proposed study can be extended to neutrosophic statistics [9]. The main limitation of this study is the small number of predictors, as not every possible factor was available for the target population; in addition, interaction effects are not included.

Conclusion
The proposed PLS algorithms integrated with rank correlation coefficients are observed to be a better option with regard to model efficiency and variable selection for ranked simulated and real data sets. This suggests that these rank measure-based PLS algorithms provide models with superior potential. PLSρs coupled with SMC identified the significant predictors of ANC using the optimized model for the observed data. The modified PLS models have the ability to address multicollinear ranked data more effectively. Regarding the clinical importance of this study, the influential selected variables of ANC will help maximize the chances for a normal pregnancy by providing priority interventions, increasing coverage, and improving health quality.

Data Availability
Data is available at https://dhsprogram.com/data/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.