On the Variability of Neural Network Classification Measures in the Protein Secondary Structure Prediction Problem

We revisit the protein secondary structure prediction problem using linear and backpropagation neural network architectures commonly applied in the literature. In this context, neural network mappings are constructed between protein training set sequences and their assigned structure classes in order to analyze the class membership of test data and associated measures of significance. We present numerical results demonstrating that classifier performance measures can vary significantly depending upon the classifier architecture and the structure class encoding technique. Furthermore, an analytic formulation is introduced in order to substantiate the observed numerical data. Finally, we analyze and discuss the ability of the neural network to accurately model fundamental attributes of protein secondary structure.


Introduction
The protein secondary structure prediction problem can be phrased as a supervised pattern recognition problem [1-5] for which training data is readily available from reliable databases such as the Protein Data Bank (PDB) or CB513 [6]. Based upon training examples, subsequences derived from primary sequences are encoded according to a discrete set of classes. For instance, three-class encodings are commonly applied in the literature in order to numerically represent the secondary structure set (alpha helix, beta sheet, and coil) [7-11]. By applying a pattern recognition approach, subsequences of unknown classification can then be tested to determine the structure class to which they belong. Phrased in this way, backpropagation neural networks [7,12-14] and variations on the neural network theme [8,10,11,15-18] have been applied to the secondary structure prediction problem with varied success. Furthermore, many tools currently applying hybrid methodologies, such as PredictProtein [19,20], JPRED [8,17,21], SCRATCH [22,23], and PSIPRED [24,25], rely on the neural network paradigm as part of their prediction scheme.
One of the main reasons for applying the neural network approach in the first place is that such networks tend to be good universal approximators [26-30] and, theoretically, have the potential to create secondary structure models. In other words, after a given network architecture has been chosen and presented with a robust set of examples, the optimal parameters associated with the trained network, in principle, define an explicit function that can map a given protein sequence to its associated secondary structure. If the structure predicted by the network function is generally correct and consistent for an arbitrary input sequence not contained in the training set, one must be left to conclude that the neural network has accurately modeled some fundamental set of attributes that define the properties of protein secondary structure. Under these circumstances, one should then be able to extract information from the trained neural network model parameters, thus leading to a solution to the secondary structure prediction problem as well as a parametric understanding of the underlying basis for secondary structure.
The purpose of this work is to revisit the application of neural networks to the protein secondary structure prediction problem. In this setting, we consider the commonly encountered case where three structure classes (alpha helix, beta sheet, and coil) are used to classify a given protein subsequence. Given the same set of input training sequences, we demonstrate that, for the backpropagation neural network architecture, classification results and associated confidence measures can vary when two equally valid encoding schemes are employed to numerically represent the three structure classes (i.e., the "target encoding scheme"). Such a result goes against the intuition that the physical nature of the secondary structure property should be independent of the target encoding scheme chosen.
The contribution of this work is not to demonstrate improvements over existing techniques. The hybrid techniques outlined above have been demonstrated to outperform neural networks used alone. Instead, we focus our attention on the ability of the neural network model-based approach to accurately characterize fundamental attributes of protein secondary structure, given that certain models presented within this work are demonstrated to yield variable results. Specifically, in this work, we present (1) numerical results demonstrating how secondary structure classification results can vary as a function of classifier architecture and parameter choices; (2) an analytic formulation explaining under what circumstances classification variability can arise; (3) an outline of specific challenges associated with the neural network model-based approach outlined above.
The conclusions reported here are relevant because they bring into discussion a body of literature that has purported to offer a viable path to the solution of the secondary structure prediction problem. Section 3 describes the methods applied in this work. In particular, this section provides details concerning the encoding of the protein sequence data (Section 3.1), the encoding of the structure classes (Section 3.2), the neural network architectures (Sections 3.3-3.4), and the classifier performance measures (Section 3.5). Section 4 then presents results from numerical secondary structure classification experiments. Section 5 presents an analytic formulation for the linear network and the backpropagation network described in Section 3 in order to explain the numerical results given in Section 4.

Notation for the Supervised Classification Problem
In the supervised classification problem [1,2], it is assumed that a training set consists of N training pairs

{(x_i, y_i) : i = 1, ..., N}, (1)

where the x_i ∈ R^n are n-dimensional input column vectors and the y_i ∈ R^m are m-dimensional output column vectors. The goal of the supervised classifier approach is to ensure that the desired response to a given input vector x_i of dimension n from the training set is the m-dimensional output vector y_i. Furthermore, when the training data can be partitioned into K distinct classes, a set E ≡ {e_1, ..., e_j, ..., e_K} of m-dimensional target column vectors e_j ∈ R^m is chosen to encode (i.e., mathematically represent) each class for j = 1, ..., K. Under these circumstances, each output training vector y_i is derived from the set E.

Methods
In order to apply the neural network paradigm, two numerical issues must be addressed. First, since the input data comes in the form of an amino acid sequence, Section 3.1 discusses a simple encoding scheme for converting the amino acid alphabet into a usable numerical form. Second, for this work, our secondary structure target alphabet consists of elements from the set {Beta Sheet, Alpha Helix, Coil}. Hence, an encoding scheme must also be chosen for representing the neural network classifier output. Section 3.2 discusses two approaches to encoding the output in fine detail because this choice is critical to the main point of this paper. Specifically, we choose two different target vector encoding schemes that can be related by a simple mathematical relationship. Such an approach will allow us to compare classifier performance measures based upon the target vector encoding; in addition, it will facilitate the analytic formulation presented in Section 5. Finally, Sections 3.3-3.5 review the neural network architectures and the specific classifier performance measures employed in this work. Section 6 then concludes with some final observations regarding the neural network model-based approach to the protein secondary structure prediction problem.

Encoding of Protein Sequence Input Data.
For the numerical experiments, the training set was constructed using one hundred protein sequences randomly chosen from the CB513 database [6] available through the JPRED secondary structure prediction engine [21]. Furthermore, we apply a moving window of length 17 to each protein sequence where, in order to avoid protein terminal effects, the first and last 50 amino acids are omitted from the analysis. The secondary structure classification of the central residue is then assigned to each window of 17 amino acids. For the one hundred sequences analyzed, a total of 12000 windows of length 17 were extracted. The window size of 17 was chosen based upon the assumption that the eight closest neighboring residues on either side will have the greatest influence on the secondary structure conformation of the central residue. This assumption is consistent with similar approaches reported in the literature [7,12-14].
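The windowing step above can be sketched as follows. This is our own minimal Python illustration (the paper's experiments used MATLAB); `extract_windows` is a hypothetical helper, and the toy sequence, labels, and margin are illustrative only.

```python
# Sketch of the sliding-window step: a length-17 window is applied to each
# sequence and the structure label of the central residue is assigned to the
# whole window, skipping `margin` residues at each terminus.
def extract_windows(sequence, labels, window=17, margin=50):
    """Yield (window, central_label) pairs, avoiding protein terminal effects."""
    assert len(sequence) == len(labels)
    half = window // 2
    start = max(margin, half)
    end = min(len(sequence) - margin, len(sequence) - half)
    for i in range(start, end):
        yield sequence[i - half:i + half + 1], labels[i]

# Toy example with a synthetic 120-residue sequence (margin lowered so that a
# short example still produces windows; real sequences would use margin=50).
seq = "ACDEFGHIKLMNPQRSTVWY" * 6
lab = ("HHHEEECCC" * 14)[:len(seq)]
pairs = list(extract_windows(seq, lab, window=17, margin=20))
print(len(pairs), len(pairs[0][0]))  # 80 windows, each of length 17
```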
To encode the input amino acid sequences of length 17, we employ sparse orthogonal encoding [31], which maps symbols from a given sequence alphabet onto a set of orthogonal vectors. Specifically, for an alphabet containing D symbols, a unique D-dimensional unit vector is assigned to each symbol; furthermore, the kth unit vector is one at the kth position and zero at all other positions. Hence, if all training sequences and unknown test sequences are of uniform length L, an encoded input vector will be of dimension n, where n = DL. In our case, D = 20 and L = 17; hence, the dimension of any given input vector is n = 340.
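The sparse orthogonal (one-hot) encoding described above can be sketched in a few lines of Python. This is our own illustration of the scheme, not code from this work; each of the 20 amino-acid symbols maps to a 20-dimensional unit vector, so a window of 17 residues becomes a 20 × 17 = 340-dimensional input vector.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # D = 20 symbols
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def encode_window(window):
    """Concatenate one unit vector per residue: dimension D * L."""
    x = np.zeros(len(window) * len(AMINO_ACIDS))
    for pos, aa in enumerate(window):
        x[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1.0
    return x

x = encode_window("ACDEFGHIKLMNPQRSTVWY"[:17])
print(x.shape, int(x.sum()))  # a 340-dimensional vector with 17 nonzero entries
```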
The above input vector encoding technique is commonly applied in the bioinformatics and secondary structure prediction literature [7,15]. While many different and superior approaches to this phase of the machine learning problem have been suggested [3-5], we have chosen orthogonal encoding because of its simplicity and because the results of this work do not depend upon the input encoding scheme. Instead, our work specifically focuses on potential neural network classifier variabilities induced by the choice of the target vector encoding scheme.

Target Vector Encoding. Analytically characterizing the invariance of classifier performance measures clearly involves first establishing a relationship between different sets of target vectors E and Ẽ.
As a means of making the invariance formulation presented in this paper more tractable, we assume that two alternative sets of target vectors can be related via an affine transformation

ẽ_j = sΓe_j + t_j, j = 1, ..., K, (2)

involving a scale factor s, a rigid rotation Γ, where Γ is an orthogonal matrix, and a matrix T of translation column vectors t_j applied to each target vector. Many target vector choices regularly applied in the literature can be related via the transformation in (2). For instance, two equally valid and commonly applied encoding schemes for the three-class problem are orthogonal encoding [31], where each class is assigned a distinct unit vector (4), and an encoding where the class vectors are chosen on the vertices of a triangle in a two-dimensional plane (5) [14]. It turns out that (4) and (5) can be phrased in terms of (2) [32]; hence, the numerical results presented in this work apply this set of encodings. More precisely, the secondary structure classification associated with a given input vector is encoded using (4) and (5) (hence, K = 3). The set of target vectors E is derived from (4) and the set of target vectors Ẽ is derived from (5). Both the linear and the backpropagation networks are tested first by training using E and then comparing classifier performance with their counterparts trained using Ẽ. In all numerical experiments, MATLAB has been used for simulating and testing these networks.
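The two encoding styles can be made concrete with the following sketch. The exact vectors of the paper's equations (4) and (5) are not reproduced here; the vectors below are the usual illustrative choices (standard basis vectors of R^3, and unit vectors to the vertices of an equilateral triangle in the plane), and both place the three classes equidistant from one another.

```python
import numpy as np

# Orthogonal encoding: each class (helix, sheet, coil) gets a basis vector.
E_orth = np.eye(3)

# Triangle encoding: three unit vectors in the plane, 120 degrees apart.
E_tri = np.array([[0.0, 1.0],
                  [-np.sqrt(3) / 2, -0.5],
                  [np.sqrt(3) / 2, -0.5]])

# Both schemes are "equally valid" in the sense that every pair of class
# vectors is separated by the same distance.
pairs = [(0, 1), (0, 2), (1, 2)]
d_orth = [np.linalg.norm(E_orth[i] - E_orth[j]) for i, j in pairs]
d_tri = [np.linalg.norm(E_tri[i] - E_tri[j]) for i, j in pairs]
print(np.allclose(d_orth, d_orth[0]), np.allclose(d_tri, d_tri[0]))  # True True
```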

The Linear Network.
When the supervised classifier model in (1) assumes an affine relationship between the input and output data sets (as in the case of multiple linear regression), data matrices of the form X = [x_1 ⋅⋅⋅ x_N] and Y = [y_1 ⋅⋅⋅ y_N] (7) are generally introduced. Specifically, the linear network seeks to determine an m × n matrix A of coefficients and a constant m-dimensional column vector b such that the ith output vector y_i in the training set can be approximated by y_i ≈ Ax_i + b. Given this model, we can form a weight matrix of unknown coefficients that, ideally, will map each input training vector into the corresponding output training vector. If the bottom row of the input data matrix X is appended with a row of ones, leading to the (n + 1) × N matrix X̄, the goal is then to find an m × (n + 1) weight matrix W ≡ [A b] that minimizes the sum squared error E_linear over the set of data pairs by satisfying the first derivative condition ∂E_linear/∂W = 0.
The least squares solution to this problem is found via the pseudoinverse X̄† [33], where W = YX̄†. Once the optimal set of weights has been computed, the network response to an unknown n × 1 input vector x can be determined by defining the (n + 1) × 1 vector x̄ = [x^T 1]^T and calculating

f(x) = Wx̄, (15)

where f(x) is an m × 1 column vector.
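The linear network above can be sketched directly with NumPy's pseudoinverse. This is our own illustration (the paper used MATLAB); the dimensions and random data are placeholders for the encoded sequences and targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 6, 3, 50
X = rng.normal(size=(n, N))              # input training vectors as columns
Y = rng.normal(size=(m, N))              # encoded target vectors as columns

Xbar = np.vstack([X, np.ones((1, N))])   # append a row of ones: (n+1) x N
W = Y @ np.linalg.pinv(Xbar)             # least squares weights: m x (n+1)

def respond(x):
    """Network response f(x) = W @ [x; 1] to an unknown n x 1 input vector."""
    return W @ np.append(x, 1.0)

print(W.shape, respond(X[:, 0]).shape)   # (3, 7) (3,)
```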
3.4. The Backpropagation Network. Given an input vector x ∈ R^n, the model f : R^n → R^m for a backpropagation neural network with a single hidden layer consisting of h nodes is described as

f(x) = ασ(Wx − τ), (16)

where W ∈ R^(h×n), α ∈ R^(m×h), and τ ∈ R^h define the set of network weights W = {W, τ, α} and σ : R^h → R^h is a "sigmoidal" function that is bounded and monotonically increasing. To perform supervised training, in a manner similar to the linear network, W is determined by minimizing the objective function

E(W) = Σ_{i=1}^{N} ‖y_i − f(x_i)‖², (17)

given the training data defined in (1). Since f(x_i) is no longer linear in the weights, numerical techniques such as the gradient descent algorithm and variations thereof are relied upon to compute the set W that satisfies the first derivative condition ∂E/∂W = 0. Defining σ_i ≡ σ(Wx_i − τ) and its componentwise derivative σ'_i, where σ'(z) = dσ/dz, the first derivative conditions for the network weights prescribed by (16) and (17) can then be written in matrix form (19), (20) (where "T" denotes the matrix transpose), where diag(σ'_i) is a square diagonal matrix such that the diagonal entries consist of components from the vector σ'_i.
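The model (16) and its gradient descent training can be sketched as follows. This is our own minimal illustration (the paper used MATLAB); the first derivative conditions (19)-(20) are here handled implicitly by iterating gradient descent, and the dimensions, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, m, N = 5, 7, 3, 40
X = rng.normal(size=(n, N))                 # input training vectors (columns)
Y = rng.normal(size=(m, N))                 # encoded target outputs (columns)

sigma = np.tanh                             # a bounded, increasing "sigmoidal" map
W = rng.normal(scale=0.5, size=(h, n))
tau = np.zeros(h)
alpha = rng.normal(scale=0.5, size=(m, h))

def sse(W, tau, alpha):
    """Sum squared error objective (17), up to a conventional factor 1/2."""
    return 0.5 * np.sum((alpha @ sigma(W @ X - tau[:, None]) - Y) ** 2)

loss0, lr = sse(W, tau, alpha), 0.003
for _ in range(3000):
    H = sigma(W @ X - tau[:, None])         # h x N hidden activations sigma_i
    R = alpha @ H - Y                       # m x N residuals f(x_i) - y_i
    dZ = (alpha.T @ R) * (1.0 - H ** 2)     # error at hidden pre-activations
    alpha -= lr * (R @ H.T)                 # gradient step in alpha
    W -= lr * (dZ @ X.T)                    # gradient step in W
    tau += lr * dZ.sum(axis=1)              # tau enters as -tau, hence the sign
loss = sse(W, tau, alpha)
print(loss0, loss)                          # training error decreases
```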

Classification Measures.
After a given classifier is trained, when presented with an input vector x ∈ R^n of unknown classification, it will respond with an output f(x) ∈ R^m. The associated class membership C(x) is then often determined by applying a minimum distance criterion

C(x) = arg min_j d(f(x), e_j), (21)

where the target vector e_j that is closest to f(x) implies the class. Furthermore, when characterizing the performance of a pattern classifier, one often presents a set T of test vectors and analyzes the associated output. In addition to determining the class membership, it is also possible to rank the distance between a specific target vector and the classifier response to T. In this case, a similar distance criterion can be applied in order to rank an input vector x ∈ T with respect to class j ∈ {1, ..., K}:

ρ_j(x) = d(f(x), e_j). (22)
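The membership criterion (21) and ranking measure (22) amount to a few lines of code. This sketch is our own illustration, using the ℓ2 norm as the distance d and illustrative outputs; `class_membership` and `rank_against_class` are hypothetical helper names.

```python
import numpy as np

E = np.eye(3)                                   # target vectors e_1, e_2, e_3

def class_membership(f_x, targets=E):
    """C(x): index of the target vector closest to the classifier output."""
    d = np.linalg.norm(targets - f_x, axis=1)
    return int(np.argmin(d))

def rank_against_class(outputs, j, targets=E):
    """rho_j: order test-vector indices by distance of f(x) to target e_j."""
    d = np.linalg.norm(outputs - targets[j], axis=1)
    return np.argsort(d)

outputs = np.array([[0.9, 0.10, 0.00],          # three classifier responses
                    [0.2, 0.70, 0.10],
                    [0.4, 0.45, 0.15]])
print([class_membership(f) for f in outputs])   # [0, 1, 1]
print(rank_against_class(outputs, 0))           # [0 2 1]
```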
For the purposes of this work, (15) facilitates the determination of the class membership and ranking with respect to class j for the linear network. Similarly, assuming a set of weights W = {W, τ, α} for a trained backpropagation network, (21) and (22) would be applied using (16).
It is well established that, in the case of normally distributed data, the classification measures presented above minimize the probability of a classification error and are directly related to the statistical significance of a classification decision [1]. Given the neural network as the supervised classification technique and two distinct choices for the set of target vectors E and Ẽ, we demonstrate that, in certain instances, the classification and ranking results do not remain invariant, such that C(x) ≠ C̃(x) and ρ_j(x) ≠ ρ̃_j(x) for some input vector x.

Noninvariance of Secondary Structure Predictions
In this section, we numerically demonstrate that, when different target vector encodings are applied, the neural network classifier measures outlined above are, in certain cases, observed to vary widely. For each neural network architecture under consideration, an analytic formulation is then presented in Section 5 in order to explain the observed numerical data. As mentioned in Section 3.2, numerical experiments are performed first by training using E and then comparing classifier performance with counterparts trained using Ẽ. Multiple cross-validation trials are required in order to prevent potential dependency of the evaluated accuracy on the particular training or test sets chosen [7,15]. In this work, we apply a hold-n-out strategy similar to that of [14], using 85% of the 12000 encoded sequences as training data (i.e., N = 10200) and 15% as test data to validate the classification results. Recognition rates for both the linear and backpropagation networks using either set of target vector encodings were approximately 65%, which is typical of this genre of classifiers applying similar encoding methodologies [7,12-14]. Although these aggregate values remain consistent, using (21) and (22) we now present data demonstrating that, while class membership and ranking remain invariant for the linear network, these measures of performance vary considerably for the backpropagation network, which was trained with h = 17 hidden nodes and a mean squared training error less than 0.2. Ranking results from a representative test for the linear and backpropagation networks are presented for the top 20 ranked vectors in Tables 1 and 3. Class membership data are presented in Tables 2 and 4.
Observe that, for the linear network, indices for the top 20 ranked vectors remain invariant, indicating ranking invariance; in addition, no change in class membership is observed. On the other hand, Tables 3 and 4 clearly indicate a lack of consistency when considering the ranking and class membership of test vectors. A particularly troubling observation is that very few vectors ranked in the top 20 with respect to e_1 were also ranked in the top 20 with respect to ẽ_1. Furthermore, Table 4 indicates that the class membership of a substantial number of test vectors changed when an alternative set of target vectors was employed. The data also indicate that the greatest change in class membership took place for alpha helical sequences, implying that there is substantial disagreement over the modeling of this secondary structure element by the backpropagation network due to a simple transformation of the target vectors.

Analysis
The results in Section 4 clearly show that, while the pattern recognition results for the linear network remain invariant under a change in target vectors, those for the backpropagation network do not. In this section, we present analytic results in order to explain why these two techniques lead to different conclusions.
Definition 1. Given two sets of target vectors E and Ẽ, the class membership of an input vector x is invariant under a transformation of the target vectors if C(x) = C̃(x), where C̃(x) is derived from f̃(x), the output of the classifier trained with target vectors {ẽ_j}, j = 1, ..., K.
Definition 2. Given two sets of target vectors E and Ẽ, the ranking with respect to a specific class j is invariant under a transformation of the target vectors if, for any input vectors x_1 and x_2,

ρ_j(x_1) ≤ ρ_j(x_2) implies ρ̃_j(x_1) ≤ ρ̃_j(x_2). (23)

Based upon these definitions, the following has been established [32].

Proposition 3. Given two sets of target vectors E and Ẽ, if the ranking is invariant, then the class membership of an arbitrary input vector x will remain invariant.
In the analysis presented, the strategy for characterizing neural network performance depends upon the data from the previous section. For the linear network, since both ranking and classification were observed to remain invariant, it is more sensible to characterize the invariance of this network using Definition 2. Then, based upon Proposition 3, class membership invariance naturally follows. On the other hand, to explain the noninvariance of both class membership and ranking observed in the backpropagation network, the analysis is facilitated by considering Definition 1. The noninvariance of ranking then follows from the contrapositive of Proposition 3.

Invariance Analysis for the Linear Network.
When the target vectors are subjected to the transformation defined in (2), the network output can be expressed as f̃(x) = W̃x̄, where W̃ is derived from W such that the translation vector t_{k_i} associated with y_i ∈ E is appropriately aligned with the correct target vector in the matrix Y. In other words, when the output data matrix in (7) is of the form Y = [e_{k_1} ⋅⋅⋅ e_{k_N}], then

Ỹ = sΓY + T, where T = [t_{k_1} ⋅⋅⋅ t_{k_N}], (25)

and k_i ∈ {1, ..., K} for i = 1, ..., N. Given this network, the following result is applicable [32].

Proposition 4. If (i) the number of training observations N exceeds the vector dimension n; (ii) the rows of the matrix X̄ are linearly independent; (iii) E and Ẽ are related according to (2); and (iv) for some t ∈ R^m, t_j = t for all j = 1, ..., K in (2), then the ranking and, hence, the class membership for the linear network will remain invariant.
In other words, if the columns of the matrix T in (25) are all equal, then using (15) and (25) will result in (23) being satisfied. The above result is applicable to the presented numerical data with n = 340 and N = 10200; hence, the ranking and class membership invariances are corroborated by the data in Tables 1 and 2.
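Proposition 4 can be checked numerically. The following sketch (our own illustration, with synthetic data) trains the linear network twice, once with targets E and once with targets ẽ_j = sΓe_j + t, using the same translation t for every class as condition (iv) requires, and verifies that the distance-based rankings coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, K, N = 8, 3, 3, 200
E = np.eye(K)                                      # original target vectors
Gamma, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random rigid rotation
s, t = 1.7, rng.normal(size=m)                     # scale and common translation
E_t = (s * Gamma @ E.T).T + t                      # transformed targets (2)

labels = rng.integers(0, K, size=N)
X = rng.normal(size=(n, N))
Xbar = np.vstack([X, np.ones((1, N))])             # rows independent, N > n
Y, Y_t = E[labels].T, E_t[labels].T

W = Y @ np.linalg.pinv(Xbar)                       # network trained with E
W_t = Y_t @ np.linalg.pinv(Xbar)                   # network trained with E~

Xtest = np.vstack([rng.normal(size=(n, 30)), np.ones((1, 30))])
F, F_t = W @ Xtest, W_t @ Xtest
rank = np.argsort(np.linalg.norm(F.T - E[0], axis=1))
rank_t = np.argsort(np.linalg.norm(F_t.T - E_t[0], axis=1))
print(np.array_equal(rank, rank_t))                # True: rankings coincide
```

The mechanism is visible in the algebra: with a common translation, F_t = sΓF + t (columnwise), so every distance to a transformed target is exactly s times the original distance.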

Invariance Analysis for the Backpropagation Network.
In this section, we seek to characterize the noninvariance observed in class membership using the backpropagation network. If the class membership varies due to a change in target vectors, then this variation should be quantifiable by characterizing the boundary separating two respective classes. The decision boundary between class i and class j is defined by points x such that d(f(x), e_i) = d(f(x), e_j), where i, j ∈ {1, ..., K}, i ≠ j, and f(x) is the classifier output. Under these circumstances, if an ℓ2 norm is applied in (21), the solution set to this equation consists of all x such that f(x) is equidistant from e_i and e_j, where, for the purposes of this section, f(x) is defined by (16). Expanding terms on both sides of this equation leads to the condition

f(x)^T (e_i − e_j) = (‖e_i‖² − ‖e_j‖²)/2. (30)

If the class membership of a representative vector is to remain invariant under a change of target vectors, this same set of points must also satisfy

f̃(x)^T (ẽ_i − ẽ_j) = (‖ẽ_i‖² − ‖ẽ_j‖²)/2. (31)

Assuming that two networks have been trained using two different sets of target vectors E and Ẽ, the sets of weights {W, τ, α} and {W̃, τ̃, α̃} determine the network outputs in (16). Without loss of generality, we consider the case where all target vectors are normalized such that ‖e_j‖₂ = 1 and ‖ẽ_j‖₂ = 1 for j = 1, ..., K. In this case, the conditions in (30) and (31) become

f(x)^T (e_i − e_j) = 0, (32)
f̃(x)^T (ẽ_i − ẽ_j) = 0. (33)

We first consider a special case where the target vectors are related according to Ẽ = ΓE with T = 0 in (2) (the normalization fixing the scale factor). Under these circumstances, if the choice α̃ = Γα, W̃ = W, τ̃ = τ is made, it should be clear that, since Γ^T = Γ^{-1} and Ỹ = ΓY, the objective function is minimized by the condition ∂E/∂W = 0. Another way to see this is to observe that (19) and (20) remain invariant for this choice of target vectors and network weights. Hence, we have the following.
Proposition 5. For a specific choice of W̃ and τ̃, if T = 0 in (2) and α̃ = Γα, then the class membership for the backpropagation network will remain invariant.
Proof. Simply consider (32) and (33) and choose any x satisfying f(x)^T (e_i − e_j) = 0. It then immediately follows that f̃(x)^T (ẽ_i − ẽ_j) = (Γf(x))^T Γ(e_i − e_j) = f(x)^T Γ^T Γ (e_i − e_j) = f(x)^T (e_i − e_j) = 0. Therefore, if x satisfies (32), then it also satisfies (33) and, hence, is a point on the decision boundary for both networks.
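The identity at the heart of this proof is easy to verify numerically: for any orthogonal Γ, rotating both the output and the target difference leaves the inner product unchanged. The sketch below (our own check, with random vectors standing in for f(x), e_i, and e_j) confirms that conditions (32) and (33) agree when t = 0 and α̃ = Γα.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3
Gamma, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal matrix
f = rng.normal(size=m)                             # any network output f(x)
e_i, e_j = rng.normal(size=m), rng.normal(size=m)  # any pair of targets

lhs = (Gamma @ f) @ (Gamma @ e_i - Gamma @ e_j)    # condition (33) side
rhs = f @ (e_i - e_j)                              # condition (32) side
print(np.isclose(lhs, rhs))                        # True: Gamma^T Gamma = I
```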
Intuitively, a scaled, rigid rotation of the target vectors should not affect the decision boundary. However, when the more general transformation of (2) is applied with T ≠ 0, we now demonstrate that, due to the nonlinearity of σ in (16), no simple relationship exists such that (32) and (33) can simultaneously be satisfied by the same set of points. We first investigate the possibility of establishing an analytic relationship between the sets of weights {W̃, τ̃, α̃} and {W, τ, α} for both networks. In other words, we seek, ideally invertible, functions f_W, f_τ, and f_α such that

W̃ = f_W(W, τ, α), τ̃ = f_τ(W, τ, α), α̃ = f_α(W, τ, α), (39)

so that the set W can be transformed into W̃. If this can be done, then an analytic procedure similar to that presented in the proof of Proposition 5 can be established in order to relate (32) to (33) for the general case. Since (19) and (20) define the set W̃, it is reasonable to rephrase these equations in terms of the objective function

Ẽ(W̃) = Σ_{i=1}^{N} ‖ỹ_i − f̃(x_i)‖², (40)

where f̃(x_i) = α̃σ(W̃x_i − τ̃) and the ỹ_i are formed from (2) such that t_{k_i} is the translation vector associated with the target vector referred to by y_i. From these equations, it should be clear that no simple analytic relationship exists that will transform W into W̃. A numerical algorithm such as gradient descent will, assuming a local minimum actually exists, arrive at some solution for both W and W̃. We must therefore be content with the assumed existence of some set of functions defined by (39). Again, let us consider any point x on the decision boundary such that f(x)^T (e_i − e_j) = 0. Such a point must also simultaneously satisfy f̃(x)^T (ẽ_i − ẽ_j) = κ, where κ is some constant and f_W, f_τ, and f_α satisfy (39). Given an arbitrary training set defined by (1), it is highly unlikely that this constraint can be satisfied. One remote scenario might be that the terms f_W(W, τ, α) − W and f_τ(W, τ, α) − τ are always small. In this case, given a sigmoidal function σ(z) that is linear near z = 0, a linearized version of (43) could be solved using techniques described in [32]. However, this again is an unlikely set of events given an arbitrary training set. Therefore, given the transformation of (2), we are left to conclude that class membership invariance and, hence, ranking invariance are, in general, not achievable using the backpropagation neural network.

Discussion
Intuitively, given a reasonable target encoding scheme, one would desire that properties related to protein secondary structure be independent of the target vectors chosen. However, we have presented numerical data and a theoretical foundation demonstrating that secondary structure classification and confidence measures can vary depending on the type of neural network architecture and target vector encoding scheme employed. Specifically, linear network classification has been demonstrated to remain invariant under a change in the target structure encoding scheme, while the backpropagation network has not. As the training set size N increases, for the methodology applied in this work, recognition rates remain consistent with those reported in the literature; however, we have observed that adding more training data does not improve the invariance of classification measures for the backpropagation network. This conclusion is corroborated by the analytic formulation presented above.

Conclusions
As pointed out in the introduction, one major purpose of the neural network is to create a stable and reliable model that maps input training data to an output classification with the hope of extracting informative parameters. When methods similar to those in the literature are applied [7,12-14], we have demonstrated that classifier performance measures can vary considerably. Under these circumstances, parameters derived from a trained network for analytically describing protein secondary structure may not comprise a reliable set for the model-based approach. Furthermore, classifier variability would imply that a stable parametric model has not been derived. It is in some sense paradoxical that the neural network has been applied for structure classification and, yet, the associated parameters have not been applied for describing protein secondary structure. The neural network approach to deriving a solution to the protein secondary structure prediction problem therefore requires deeper exploration.


Table 1 :
Ranking results for the linear network, where i represents the test vector index and ρ_1(i) and ρ̃_1(i) represent the distance with respect to the helix class vectors e_1 and ẽ_1. Out of 1800 vectors tested, vectors referred to in this table were ranked from 1 to 20.

Table 2 :
Class membership results for the linear network. For each class, the total number of vectors classified using E is analyzed to examine the total number that retained their classification using Ẽ.

Table 3 :
Ranking results for the backpropagation network, where i represents the test vector index and ρ_1(i) and ρ̃_1(i) represent the distance with respect to the helix class vectors e_1 and ẽ_1. Out of 1800 vectors tested, vectors referred to in this table were ranked from 1 to 20.

Table 4 :
Class membership results for the backpropagation network. For each class, the total number of vectors classified using E is analyzed to examine the total number that retained their classification using Ẽ.