NetDA: An R Package for Network-Based Discriminant Analysis Subject to Multilabel Classes



Introduction
Multiclassification, known as a classification problem in which the number of classes is greater than two, is a great challenge in data science. In supervised learning, discriminant analysis has been a useful method for classification. In the conventional method (e.g., Hastie et al. [1]; Section 4.3), linear discriminant functions, which are formulated in terms of the mean vectors and the inverse of the covariance matrices of the predictors, are used to classify subjects. In addition, several advanced methods have been proposed to address complex settings. For example, Guo et al. [2] discussed the LDA method and its application in microarray data analysis. Safo and Ahn [3] studied generalized sparse linear discriminant analysis for multilabel responses. In the presence of high-dimensional predictors, several advanced approaches have also been explored (e.g., Clemmensen et al. [4]; Witten and Tibshirani [5]).
However, network structures among predictors, which reflect (pairwise) dependence among predictors, are ubiquitous in data analysis [6]. In recent developments, Chen et al. [7] proposed a graphical-based logistic regression model. He et al. [8] proposed surrogate variables that were transformed from network structures and implemented them in the support vector machine. Within the framework of discriminant analysis, Cai et al. [9] and Liu et al. [10] developed graph-based linear discriminant analysis, but their approaches are restricted to binary responses. Moreover, general data analysts need to be able to implement existing software directly for their analyses, yet little software is available for classification with network structures accommodated. While some R packages related to discriminant analysis exist, such as MASS, sparseLDA, and penalizedLDA, they are not able to handle network structures in predictors.
Motivated by these concerns and to address these challenges, we follow the strategy proposed by Chen [11] and develop an R package called NetDA. Under the normality assumption for predictors, we apply the graphical lasso method to estimate the precision matrices and the corresponding network structures of the predictors. Since precision matrices are the inverses of covariance matrices, we are motivated to implement them directly in linear/quadratic discriminant functions. This strategy differs from conventional linear discriminant analysis, which simply employs empirical estimates of covariance matrices. The other issue is prediction: based on fitted models and predicted values, we also develop a function that contains several commonly used criteria to assess the performance of classification and prediction. The article is organized as follows. Section 2 introduces the data structure and outlines the methodology in our package. Section 3 describes the usage of the package NetDA. Section 4 illustrates the package with a real dataset. We conclude the article in Section 5.

Overview of Methodology
In this section, we primarily overview the data structure and network-based discriminant analysis proposed by Chen [11].

Data Structure.
Suppose that the data contain n subjects that come from I classes, where I is a fixed integer greater than 2 and the classes are nominal. Let n_i be the size of class i for i = 1, ..., I, and hence n = Σ_{i=1}^I n_i. Let Y denote the n-dimensional vector of responses whose jth component Y_j = i reflects the class membership that the jth subject is in the ith class, for i = 1, ..., I and j = 1, ..., n.
Let p > 1 denote the dimension of the predictors for each subject. Define X = [X_{j,k}] as the n × p matrix of predictors for j = 1, ..., n and k = 1, ..., p, where X_{j,k} represents the kth predictor for the jth subject. Furthermore, let X_{·k} = (X_{1,k}, ..., X_{n,k})^⊤ denote the n-dimensional vector of the kth predictor (the kth column of X), and let X_{j·} = (X_{j,1}, ..., X_{j,p})^⊤ denote the p-dimensional predictor vector for the jth subject (the jth row of X). Let {(X_{j·}, Y_j) : j = 1, ..., n} denote an independent and identically distributed (i.i.d.) sample. We let lowercase letters represent realized values of the corresponding random variables; for example, x_{j·} stands for a realized value of X_{j·}.

Gaussian Graphical Models.
For i = 1, ..., I and j = 1, ..., n, let f_{j|i}(x_{j·}) denote the conditional probability density function of the predictor X_{j·} taking the value x_{j·} given that subject j comes from the ith class. We particularly consider the case where the conditional distribution of X_{j·} given Y_j = i is a multivariate normal distribution with mean vector μ_i and positive-definite covariance matrix Σ_i. Then, the conditional probability density function is given by

f_{j|i}(x_{j·}) = (2π)^{−p/2} det(Σ_i)^{−1/2} exp{−(1/2)(x_{j·} − μ_i)^⊤ Σ_i^{−1} (x_{j·} − μ_i)}.   (1)

Moreover, by a suitable reparametrization (e.g., Hastie et al. [12]; p. 246) with Θ_i ≜ Σ_i^{−1} = [θ_{i,st}] and β_i ≜ Θ_i μ_i, we can transfer (1) to the Gaussian graphical model (GGM) based on class i:

f_{j|i}(x_{j·}) ∝ exp{β_i^⊤ x_{j·} − (1/2) x_{j·}^⊤ Θ_i x_{j·}},   (2)

where V = {1, ..., p} includes all the indices and E ⊂ V × V contains all pairs (s, t) with unequal coordinates such that θ_{i,st} ≠ 0, which yields a graph G ≜ (V, E) (e.g., Hastie et al. [1]; Chapter 17). Our main interest is to estimate Θ_i since, as we will see later, the main concern in discriminant analysis is to estimate the inverse of the covariance matrix. On the other hand, from the perspective of graphical models, a nonzero θ_{i,st} implies that X_{j,s} and X_{j,t} are conditionally dependent given the other variables in class i, while θ_{i,st} = 0 gives conditional independence of X_{j,s} and X_{j,t} given the other variables. Thus, the precision matrix Θ_i reflects the network structure of the predictors.
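To make the correspondence concrete, the following base-R sketch (with a hypothetical 3 × 3 precision matrix, not one taken from the package) shows that zero off-diagonal entries of Θ are exactly the missing edges of G, and that the precision matrix can be recovered from the implied covariance matrix:

```r
# Hypothetical precision matrix for p = 3 predictors: theta_{13} = 0,
# so X_1 and X_3 are conditionally independent given X_2 (no edge 1-3).
Theta <- matrix(c(2.0, 0.5, 0.0,
                  0.5, 2.0, 0.5,
                  0.0, 0.5, 2.0), nrow = 3, byrow = TRUE)
Sigma <- solve(Theta)        # covariance matrix implied by Theta

# Recover Theta from Sigma and read off the edge set E of the graph G:
# theta_{st} != 0 (s != t)  <=>  edge between X_s and X_t.
Theta_back <- solve(Sigma)
edges <- abs(Theta_back) > 1e-8 & row(Theta_back) != col(Theta_back)
edges
```

Note that every off-diagonal entry of Sigma here is nonzero even though Θ is sparse, which is why the network is read from the precision matrix rather than the covariance matrix.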
In the past literature, the graphical LASSO (GLASSO) [13] is a common method to estimate Θ_i. The key idea of GLASSO is based on the likelihood function. To see this, we follow the discussion on page 247 of Hastie et al. [12] and write the log-likelihood function of Θ_i based on (2) with β_i = 0:

ℓ(Θ_i) = log det(Θ_i) − trace(S_i Θ_i),   (3)

where S_i ≜ (1/n_i) Σ_{j: Y_j = i} x_{j·} x_{j·}^⊤, trace(·) is the sum of the diagonal entries of a square matrix, and

log det(Θ_i) = Σ_{j=1}^p log λ_j(Θ_i),   (4)

with λ_j(Θ_i) being the jth eigenvalue of Θ_i. Assume that the precision matrix Θ_i is sparse. To estimate the θ_{i,st} and identify the network structure by retaining dependent pairs of vertices and removing independent ones, we apply the L_1-norm as a constraint to achieve the desired result. In other words, the estimator of Θ_i can be obtained by the optimization

Θ̂_i = argmax_{Θ_i} {log det(Θ_i) − trace(S_i Θ_i) − ζ ρ(Θ_i)},   (5)

where ρ(Θ_i) ≜ Σ_{s≠t} |θ_{i,st}| is the penalty function and ζ is a tuning parameter. The optimization problem (5) is called GLASSO [13]. The detailed algorithm can be found on page 248 of Hastie et al. [12], and ζ can be determined by the Bayesian information criterion (BIC). By a discussion similar to Yuan and Lin [14], the estimated network structure determined by (5) equals the true graph with probability approaching one under suitable conditions. For the computation, the R package glasso can be implemented to derive the estimate Θ̂_i.
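As a small illustration of the tuning step, the following base-R sketch computes a BIC value for a candidate precision estimate; in practice the candidate would come from `glasso::glasso(S, rho = zeta)$wi` evaluated over a grid of ζ, and the degrees-of-freedom count used here (nonzero upper-triangular entries of the estimate) is one common convention, not necessarily the one used inside NetDA:

```r
# BIC for a candidate precision estimate Theta_hat, given the sample
# covariance matrix S and sample size n:
#   BIC = -2 * loglik + log(n) * df,
# with loglik = (n/2) * { log det(Theta_hat) - trace(S %*% Theta_hat) }.
bic_glasso <- function(Theta_hat, S, n) {
  loglik <- (n / 2) * (determinant(Theta_hat, logarithm = TRUE)$modulus -
                         sum(diag(S %*% Theta_hat)))
  df <- sum(Theta_hat[upper.tri(Theta_hat)] != 0)  # nonzero upper-tri entries
  as.numeric(-2 * loglik + log(n) * df)
}

# Toy check with a diagonal candidate: log det I = 0, trace(S I) = 3, df = 0.
S <- diag(c(1, 2))
bic_glasso(diag(2), S, n = 10)   # -> 30
```

Among the candidates on the grid, the ζ minimizing this criterion would be selected.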

Discriminant Analysis.
The idea of discriminant analysis is to model the distribution of the predictors X_{j·} separately for each of the response classes Y_j, and then to use Bayes' theorem to describe the conditional probabilities P(Y_j = i | X_{j·} = x_{j·}) (e.g., James et al. [15]).
Specifically, let π_i ≜ P(Y_j = i) denote the probability that the jth subject is randomly selected from class i, so that Σ_{i=1}^I π_i = 1. Moreover, applying Bayes' theorem to the conditional density function f_{j|i}(x_{j·}) and the π_i gives the posterior probability

P(Y_j = i | X_{j·} = x_{j·}) = π_i f_{j|i}(x_{j·}) / Σ_{l=1}^I π_l f_{j|l}(x_{j·})   (6)

for i = 1, ..., I and j = 1, ..., n.
To compare two classes i and l with i ≠ l, we calculate the log-ratio of (6):

log{P(Y_j = i | X_{j·} = x_{j·}) / P(Y_j = l | X_{j·} = x_{j·})} = log{f_{j|i}(x_{j·}) / f_{j|l}(x_{j·})} + log(π_i / π_l).   (7)

With the distributional assumption (1) imposed, we now have two scenarios for the first term on the right-hand side of (7), namely log{f_{j|i}(x_{j·}) / f_{j|l}(x_{j·})}.

Scenario 1. If the covariance matrices Σ_i in (1) are assumed to be common, that is, Σ_i = Σ for every i with Σ being a positive-definite matrix and Θ ≜ Σ^{−1}, (7) becomes

log(π_i / π_l) + x_{j·}^⊤ Θ (μ_i − μ_l) − (1/2)(μ_i + μ_l)^⊤ Θ (μ_i − μ_l),   (8)

and a positive value of (8) shows that subject j with predictors X_{j·} = x_{j·} is more likely selected from class i than from class l. Consequently, (8) defines a boundary between classes i and l in the sense that there is a linear function in x_{j·} separating classes i and l.
Motivated by the form of (8), we consider a linear function in x:

δ_i(x) ≜ x^⊤ Θ μ_i − (1/2) μ_i^⊤ Θ μ_i + log(π_i).   (9)

Moreover, μ_i and π_i can be empirically estimated, respectively, as

μ̂_i = (1/n_i) Σ_{j: Y_j = i} x_{j·}  and  π̂_i = n_i / n.   (10)

For the estimation of Σ^{−1}, or equivalently Θ, we adopt (5) by pooling all subjects in the dataset and denote the resulting estimator by Θ̂. Therefore, (9) can be estimated as

δ̂_i(x) = x^⊤ Θ̂ μ̂_i − (1/2) μ̂_i^⊤ Θ̂ μ̂_i + log(π̂_i),   (11)

and we call (11) the network-based linear discriminant function (NetLDA); it is used to determine the class label for a new observation. For the prediction of a new subject with predictor x, we first calculate δ̂_i(x) using (11) for i = 1, ..., I. Next, we find

i* = argmax_{i ∈ {1, ..., I}} δ̂_i(x),   (12)

and the class label for this subject is then predicted as i*.
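The linear rule (11)-(12) can be sketched in base R as follows; the inputs (a pooled precision matrix, class means, and class probabilities) are hypothetical stand-ins for the GLASSO-based estimates used by NetDA:

```r
# Linear discriminant rule: compute delta_i(x) for each class and take argmax.
#   delta_i(x) = x' Theta mu_i - (1/2) mu_i' Theta mu_i + log(pi_i)
net_lda_predict <- function(x, Theta, mu_list, pi_vec) {
  delta <- sapply(seq_along(pi_vec), function(i) {
    mu_i <- mu_list[[i]]
    drop(t(x) %*% Theta %*% mu_i) -
      0.5 * drop(t(mu_i) %*% Theta %*% mu_i) + log(pi_vec[i])
  })
  which.max(delta)   # predicted class label i*
}

# Two well-separated classes: a point near mu_1 is assigned to class 1.
Theta <- diag(2)                      # hypothetical pooled precision matrix
mus   <- list(c(0, 0), c(5, 5))       # hypothetical class means
net_lda_predict(c(0.2, -0.1), Theta, mus, pi_vec = c(0.5, 0.5))  # -> 1
```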

Scenario 2.
We allow Σ_i ≠ Σ_l, or equivalently Θ_i ≠ Θ_l, for any i ≠ l with i, l = 1, ..., I. Then, under the distributional assumption (1), we have

log{f_{j|i}(x_{j·}) / f_{j|l}(x_{j·})} = (1/2) log{det(Θ_i) / det(Θ_l)} − (1/2)(x_{j·} − μ_i)^⊤ Θ_i (x_{j·} − μ_i) + (1/2)(x_{j·} − μ_l)^⊤ Θ_l (x_{j·} − μ_l).   (13)

Replacing the first term on the right-hand side of (7) by (13) yields

log(π_i / π_l) + (1/2) log{det(Θ_i) / det(Θ_l)} − (1/2)(x_{j·} − μ_i)^⊤ Θ_i (x_{j·} − μ_i) + (1/2)(x_{j·} − μ_l)^⊤ Θ_l (x_{j·} − μ_l).   (14)

Therefore, based on (14), we further define a quadratic function of x based on class i:

φ_i(x) ≜ (1/2) log det(Θ_i) − (1/2)(x − μ_i)^⊤ Θ_i (x − μ_i) + log(π_i).   (15)

For i = 1, ..., I, the estimator of the precision matrix Θ_i, denoted Θ̂_i, is obtained by (5) based on the predictor information in class i; μ_i and π_i can be estimated by (10). Therefore, (15) can be estimated by

φ̂_i(x) = (1/2) log det(Θ̂_i) − (1/2)(x − μ̂_i)^⊤ Θ̂_i (x − μ̂_i) + log(π̂_i),   (16)

which is called the network-based quadratic discriminant function (NetQDA) and is used to determine the class label for a new observation. For the prediction of a new subject with predictor x, we first calculate φ̂_i(x) using (16) for i = 1, ..., I. Next, we find

i* = argmax_{i ∈ {1, ..., I}} φ̂_i(x),   (17)

and the class label for this subject is then predicted as i*.
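Analogously, the quadratic rule (16)-(17) can be sketched in base R; the class-dependent precision matrices below are hypothetical inputs rather than GLASSO estimates:

```r
# Quadratic discriminant rule: compute phi_i(x) for each class and take argmax.
#   phi_i(x) = (1/2) log det(Theta_i) - (1/2)(x - mu_i)' Theta_i (x - mu_i) + log(pi_i)
net_qda_predict <- function(x, Theta_list, mu_list, pi_vec) {
  phi <- sapply(seq_along(pi_vec), function(i) {
    d <- x - mu_list[[i]]
    0.5 * as.numeric(determinant(Theta_list[[i]], logarithm = TRUE)$modulus) -
      0.5 * drop(t(d) %*% Theta_list[[i]] %*% d) + log(pi_vec[i])
  })
  which.max(phi)   # predicted class label i*
}

# Class-dependent precision matrices (hypothetical): class 2 is more concentrated.
Thetas <- list(diag(2), 4 * diag(2))
mus    <- list(c(0, 0), c(3, 3))
net_qda_predict(c(3.1, 2.9), Thetas, mus, pi_vec = c(0.5, 0.5))  # -> 2
```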

Assessment of Classification and Prediction.
We first introduce micro-averaged metrics (e.g., Chen et al. [7]). Let V and T represent the sets of subject indexes for the validation and training datasets, respectively, and let n = |T| and m = |V| denote the sizes of the training and validation data. We use the training data T to fit models and then apply the fitted models to compute the predicted values Ŷ_j for j ∈ V. After that, for i = 1, ..., I, we calculate the number of true positives (TP), false positives (FP), and false negatives (FN) under the validation data V:

TP_i = Σ_{j∈V} I(Ŷ_j = i, Y_j = i),  FP_i = Σ_{j∈V} I(Ŷ_j = i, Y_j ≠ i),  FN_i = Σ_{j∈V} I(Ŷ_j ≠ i, Y_j = i),   (18)

where I(·) is the indicator function.
We define precision (PRE) and recall (REC) under the validation data V, respectively, as

PRE = Σ_{i=1}^I TP_i / Σ_{i=1}^I (TP_i + FP_i)  and  REC = Σ_{i=1}^I TP_i / Σ_{i=1}^I (TP_i + FN_i).   (19)

Then, the micro-F-score is defined as

F-score = 2 × PRE × REC / (PRE + REC).   (20)

According to the definitions in (19), when all subjects are correctly classified, FP and FN are equal to zero, so PRE and REC are equal to one; if all subjects are falsely classified, then TP is equal to zero, and thus PRE and REC are equal to zero. Therefore, the values of PRE and REC lie between zero and one. Moreover, under the range [0, 1], the F-score falls in [0, 1] as well by treating 0/0 as zero. In principle, higher values of PRE, REC, and F-score reflect better performance and more accurate classification (e.g., Chen et al. [7]).
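These micro-averaged quantities can be sketched in base R as follows (this is an illustration of the definitions, not the package's Metrics function):

```r
# Micro-averaged precision, recall, and F-score: sum TP_i, FP_i, FN_i over
# classes first, then form the ratios; 0/0 cases are treated as zero.
micro_metrics <- function(yhat, y) {
  classes <- sort(unique(y))
  TP <- sum(sapply(classes, function(i) sum(yhat == i & y == i)))
  FP <- sum(sapply(classes, function(i) sum(yhat == i & y != i)))
  FN <- sum(sapply(classes, function(i) sum(yhat != i & y == i)))
  PRE <- if (TP + FP == 0) 0 else TP / (TP + FP)
  REC <- if (TP + FN == 0) 0 else TP / (TP + FN)
  F1  <- if (PRE + REC == 0) 0 else 2 * PRE * REC / (PRE + REC)
  c(PRE = PRE, REC = REC, F = F1)
}

# One of four subjects misclassified: PRE = REC = F = 0.75.
micro_metrics(yhat = c(1, 2, 3, 3), y = c(1, 2, 3, 1))
```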
In addition to the criteria above, another commonly used criterion is the adjusted Rand index (ARI). For i, i′ = 1, ..., I and under the validation data V, we define

n_{ii′} = Σ_{j∈V} I(Y_j = i, Ŷ_j = i′).   (21)

Moreover, we define a_i = Σ_{i′=1}^I n_{ii′} for i = 1, ..., I and b_{i′} = Σ_{i=1}^I n_{ii′} for i′ = 1, ..., I. Then, ARI under the validation data V is defined as

ARI = [ Σ_{i,i′} C(n_{ii′}, 2) − { Σ_i C(a_i, 2) Σ_{i′} C(b_{i′}, 2) } / C(m, 2) ] / [ (1/2){ Σ_i C(a_i, 2) + Σ_{i′} C(b_{i′}, 2) } − { Σ_i C(a_i, 2) Σ_{i′} C(b_{i′}, 2) } / C(m, 2) ]   (22)

(Hubert and Arabie [16]), where C(a, 2) = a(a − 1)/2 denotes the binomial coefficient.
As mentioned in Hubert and Arabie [16], ARI is bounded above by one, and the higher value of ARI indicates the more accurate classification.
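The ARI definition above can be sketched in base R (binomial coefficients via choose(); again an illustration of the definition rather than the package code):

```r
# Adjusted Rand index following Hubert and Arabie: n_ii' counts subjects
# with true class i and predicted class i'; a_i and b_i' are the margins.
ari <- function(yhat, y) {
  tab <- table(y, yhat)                 # contingency table of n_ii'
  m <- length(y)
  sum_nij <- sum(choose(tab, 2))        # sum of C(n_ii', 2)
  sum_a   <- sum(choose(rowSums(tab), 2))
  sum_b   <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(m, 2)
  (sum_nij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

ari(yhat = c(1, 1, 2, 2), y = c(1, 1, 2, 2))  # perfect agreement -> 1
```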

Benchmark of NetDA.
The conventional linear discriminant methods (e.g., MASS, sparseLDA, and penalizedLDA) aim to adopt (9) with Θ estimated by the inverse of the empirical estimator of Σ. However, this approach may encounter cumbersome computation or possible singularity when calculating inverse matrices. In addition, if Θ is sparse, all entries of the empirical estimator of Θ are typically nonzero, which indicates that some unconnected pairs of predictors may be falsely included. As a result, an imprecise estimator of Θ may implicitly affect the performance of classification.
Unlike existing methods, the first contribution of NetDA is to estimate Θ directly. The graphical lasso method is a tool to identify zero entries in Θ and estimate the nonzero ones. This approach enables us to retain connected pairs of predictors and exclude unconnected ones. Moreover, our implementation avoids computing the inverse of Σ.
The second contribution of NetDA is to handle heterogeneous network structures: it stratifies the predictor information by class when characterizing the predictor network structures, and it is able to deal with multiclassification by adopting the network structure in each class.

Details of NetDA
In this section, we first overview the technique that we need from an existing package. After that, we describe the functions in our package.

Library Overview.
In our package, we use the R package glasso. Specifically, glasso implements the graphical lasso method proposed by Friedman et al. [13]. The purpose of glasso is to detect the network structures of random vectors that follow multivariate normal distributions. In particular, under multivariate normal distributions, the detection of network structures is equivalent to the estimation of the precision matrix.
In the following analysis, the response is the type of wine, labeled as 1, 2, and 3; the constituents are treated as continuous predictors. The goal is to adopt the information on the constituents to construct predictive models and then use them to classify the type of wine for a given subject.

NetDA.
NetDA contains two methods. The first method is called NetLDA, which estimates the precision matrix by pooling all individuals in the data; the corresponding discriminant function is given by (11). The second approach is called NetQDA, whose strategy is to estimate precision matrices based on the individuals in the different classes and then use the class-dependent estimated precision matrices to define the quadratic discriminant functions in (16). Unlike NetLDA, NetQDA takes possibly class-dependent network structures of the predictor variables into account and uses the network structures in the different classes to determine which classes individuals belong to. Once either the linear or the quadratic discriminant functions are obtained, they can be used to determine the class for a new subject.
To implement the NetLDA and NetQDA methods, we use the following command:

NetDA(X, Y, method, X_test)

where the meaning of each argument is as follows:
(i) X: an n × p matrix of the predictors from the training data.
(ii) Y: an n-dimensional vector of the response from the training data, whose elements are positive integers reflecting class labels.
(iii) method: a scalar determining the classification method; method = 1 represents NetLDA in (11), and method = 2 represents NetQDA in (16).
(iv) X_test: an m × p matrix of the predictors from the validation data.
The purpose of NetDA is to apply the training data "X" and "Y" to determine a fitted model that is specified by the argument "method". After that, we use "X_test" and the fitted model to determine the predicted class for the subjects in the validation data. Therefore, the function NetDA returns a list of components:
(i) yhat: a vector of predicted responses obtained by NetLDA or NetQDA based on the predictors in the validation data (X_test).
(ii) Network: the estimators of the precision matrices.
If "method = 1" is chosen, then there is one precision matrix; if "method = 2" is given, then there are I precision matrices.

Metrics.
The function Metrics is utilized to assess the performance of classification and prediction based on the commonly used criteria introduced in Section 2.4. Specifically, given the responses from the validation data and the predicted values obtained by NetDA, we first derive a confusion matrix to see the classification result. To further assess the performance of prediction, we evaluate precision, recall, the F-score, and ARI defined in (19), (20), and (22), respectively.
To obtain the desired results, we use the following command:

Metrics(yhat, Y_test)

where the meaning of each argument is as follows:
(i) yhat: a vector of predicted responses obtained by NetDA.
(ii) Y_test: an m-dimensional vector of the response from the validation data.

Demonstration of NetDA
In this section, we demonstrate the standard analysis of classification and prediction based on the two functions in the NetDA package. To show the advantage of NetDA, we compare it with the existing packages MASS, sparseLDA, and penalizedLDA.

Simulation.
Let I = 3 denote the number of classes, and let p = 12 denote the dimension of the predictors. We specify the sample size n = 600, in which the size of the ith class is n_i = n/I for i = 1, ..., I. For each class, we consider a different network structure of the predictors. Specifically, for class i = 1, ..., I, let Ω_{o,i} denote a p × p matrix whose diagonal entries are zero and whose off-diagonal entries are specified as either one or zero to reflect the edges of the corresponding two nodes in Figure 1. That is, for s ≠ t, entry (s, t) of Ω_{o,i} is 1 if an edge exists between X_{·s} and X_{·t} and 0 otherwise. In addition, we define a p × p diagonal matrix Ω_{d,i} whose nonzero entries take the common value 0.1 + |Λ_min(Ω_{o,i})|, where Λ_min(Ω_{o,i}) represents the smallest eigenvalue of Ω_{o,i}. Finally, we define the precision matrix Θ_i ≜ Ω_{o,i} + Ω_{d,i}, which is invertible. Therefore, based on Gaussian graphical models, the p-dimensional vector of predictors in class i is generated from a multivariate normal distribution with mean zero and covariance matrix Θ_i^{−1}. Let Theta1, Theta2, and Theta3 denote the p × p matrices that reflect the network structures in the left, middle, and right panels of Figure 1 (the network structures in classes 1, 2, and 3), respectively; these three matrices are specified in (23). Following the description above, we generate the n × p matrix of predictors and then determine the simulated data. To perform classification, we implement the NetDA function under the two scenarios described in Section 2.3. In addition, we demonstrate three existing methods, labeled lda (MASS), sda (sparseLDA), and pda (penalizedLDA), respectively.
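The precision-matrix construction above can be sketched in base R as follows, with a small hypothetical p = 4 graph rather than the p = 12 structures of Figure 1:

```r
# Omega_o encodes the edges of a hypothetical graph (1 = edge, 0 = none);
# the diagonal shift 0.1 + |smallest eigenvalue of Omega_o| makes
# Theta = Omega_o + Omega_d positive definite, hence invertible.
p <- 4
Omega_o <- matrix(0, p, p)
Omega_o[1, 2] <- Omega_o[2, 1] <- 1   # edge between X_1 and X_2
Omega_o[3, 4] <- Omega_o[4, 3] <- 1   # edge between X_3 and X_4
Omega_d <- diag(0.1 + abs(min(eigen(Omega_o)$values)), p)
Theta   <- Omega_o + Omega_d

min(eigen(Theta)$values) > 0          # positive definite  -> TRUE
```

Predictors for the class would then be drawn from a mean-zero multivariate normal with covariance solve(Theta), e.g., via MASS::mvrnorm.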
Detailed descriptions are given below:

>#Demonstration of MASS
>lda = lda(Y ~ X, prior = c(length(which(Y == 1)), length(which(Y == 2)),
+ length(which(Y == 3)))/length(Y))
>yhat_lda = predict(lda, data.frame(X))$class
>
>#Demonstration of sparseLDA
>y = matrix(0, n, I)
>y[1:200, 1] = 1
>y[201:400, 2] = 1
>y[401:600, 3] = 1
>colnames(y) <- c("1", "2", "3")
>sda = sda(data.frame(X), y, lambda = 1e-6, stop = -1, maxIte = 25,
+ trace = TRUE)
>yhat_sda = as.numeric(unlist(predict(sda, data.frame(X))$class))
>
>#Demonstration of penalizedLDA
>pda = PenalizedLDA(X, Y, lambda = 0.14, K = 2)
>yhat_pda = as.numeric(unlist(predict(pda, data.frame(X))))[1:n]
>
>#Demonstration of NetDA
>yhat_netlda = NetDA(X, Y, method = 1, X)$yhat
>yhat_netqda = NetDA(X, Y, method = 2, X)$yhat

After that, to assess the performance of classification, we adopt the function Metrics to compute the values of criteria (19) and (20) as well as (22):

>ARI_netlda = Metrics(yhat_netlda, Y)
>ARI_netqda = Metrics(yhat_netqda, Y)

We repeat the above simulation 500 times and summarize the numerical results in Table 1. We observe that the package NetDA provides higher values of PRE, REC, F-score, and ARI, showing that the classification obtained by the NetDA method is more accurate than that determined by the other methods. Specifically, comparing MASS and NetLDA, we can see that the latter outperforms the former, owing to the incorporation of network structure with irrelevant pairs of predictors removed from Θ. On the other hand, comparing NetLDA and NetQDA, we can see that NetQDA is much better than NetLDA, because the NetQDA method successfully detects the network structure in each class, and those detected network structures are valid for classification.
Those numerical findings verify the discussion in the subsection "Benchmark of NetDA."

Real Data Analysis.
In this study, we take the wine dataset introduced in Section 3 as an example to demonstrate the package NetDA. To demonstrate the functions and perform classification and prediction, we first split the full data into training data and validation data. In our example, we take the first 45 samples in each class as the training data and use the remaining samples in each class to form the validation data. Once the training data and the validation data are determined, we employ the function NetDA to perform classification. We insert "X", "Y", and "X_test" into the function NetDA, and we denote by "NetLDA" and "NetQDA" the results under method = 1 and method = 2, respectively. The resulting vectors of predicted classes and estimated precision matrices are given by "$yhat" and "$Network", respectively.
>library(network)
>library(GGally)
>library(sna)
>material_name = c("Alcohol", "Malic acid", "Ash", "Alkalinity", "Magnesium", "Phenols",
+ "Flavanoids", "Nonflavanoid", "Proanthocyanins", "Color", "Hue", "OD280", "Proline")
>ggnet2(Net_lda, mode = "circle", size = 8, label = material_name, label.size = 5)
>ggnet2(Net_qda[[1]], mode = "circle", size = 8, label = material_name, label.size = 5)
>ggnet2(Net_qda[[2]], mode = "circle", size = 8, label = material_name, label.size = 5)
>ggnet2(Net_qda[[3]], mode = "circle", size = 8, label = material_name, label.size = 5)

From Figures 2 and 3, we can observe that the precision matrices provide complex network structures among the predictors. In particular, in Figure 3, we can see that the estimated class-dependent network structures differ from each other, and the network structure in class 2 looks more complex than the others. To assess the performance of prediction, we input the predicted values (yhat_lda or yhat_qda) and the responses in the validation data (Y_test) into the function Metrics. Finally, we further adopt the packages MASS, sparseLDA, and penalizedLDA to perform the conventional discriminant methods and compare them with our NetDA. Detailed implementations are given below:

>Wine = data.frame(cbind(Y, X))
>##Demonstration of MASS
>lda = lda(Y ~ ., data = Wine, prior = c(length(which(Y == 1)), length(which(Y == 2)),
+ length(which(Y == 3)))/length(Y))
>lda_pred = predict(lda, X_test)$class

In general, we can see that NetLDA and NetQDA have satisfactory performance in prediction. For the NetLDA method, there is one misclassification as shown in the confusion matrix, while the predicted classes determined by NetQDA are all equal to the responses in the validation data.
Comparing with NetDA, we observe that the confusion matrix determined by the conventional linear discriminant analysis (lda) is comparable to that obtained by the NetLDA method, but it is interesting that the value of ARI determined by NetLDA is slightly larger than that based on lda. In addition, it is clear that the NetQDA method is better than lda. In contrast, it is surprising that the two penalized methods, sparseLDA and penalizedLDA, do not have satisfactory classification and prediction performance; sparseLDA gives the most unexpected result. In summary, the numerical results in this data analysis show (a) the importance of incorporating predictor network structures in the classification procedure and (b) the advantage of adopting class-dependent network structures.

Summary
Classification and prediction are important topics in supervised learning, and discriminant analysis is a useful method in statistical learning. While many methods have been developed, few are available to handle potential network structures in predictors when building predictive models. In addition, little relevant software has been developed for statistical analysts whose interest is to incorporate network structures and obtain precise classification.
To address this concern, we have developed the R package NetDA for public use. Our package provides two functions. The function NetDA incorporates the information of network structures in predictors into linear or quadratic discriminant functions. The other function, Metrics, summarizes some useful and informative criteria to assess the performance of classification and prediction. Detailed documentation and concrete examples illustrate the validity of the methods in this package. Finally, some further developments can be explored based on the current package, including alternative approaches for the detection of network structures (e.g., Hastie et al. [12]; Section 9.4), nonparametric discriminant analysis with network structures accommodated (e.g., Chen [17]), and the analysis of noisy data, such as measurement error models (e.g., Chen and Yi [18]; Chen and Yi [19]).

Data Availability
The data used to support this study are available at https://archive.ics.uci.edu/ml/datasets/wine.

Conflicts of Interest
The author declares that there are no conflicts of interest.