We revisit the protein secondary structure prediction problem using linear and backpropagation neural network architectures commonly applied in the literature. In this context, neural network mappings are constructed between protein training set sequences and their assigned structure classes in order to analyze the class membership of test data and associated measures of significance. We present numerical results demonstrating that classifier performance measures can vary significantly depending upon the classifier architecture and the structure class encoding technique. Furthermore, an analytic formulation is introduced in order to substantiate the observed numerical data. Finally, we analyze and discuss the ability of the neural network to accurately model fundamental attributes of protein secondary structure.
The protein secondary structure prediction problem can be phrased as a supervised pattern recognition problem [
One of the main reasons for applying the neural network approach in the first place is that they tend be good universal approximators [
The purpose of this work is to revisit the application of neural networks to the protein secondary structure prediction problem. In this setting, we consider the commonly encountered case where three structure classes (alpha helix
The contribution of this work is not to demonstrate improvements over existing techniques. The hybrid techniques outlined above have been demonstrated to outperform neural networks when used alone. Instead, we focus our attention on the ability of the neural network model-based approach to accurately characterize fundamental attributes of protein secondary structure given that certain models presented within this work are demonstrated to yield variable results. Specifically, in this work, we present numerical results demonstrating how secondary structure classification results can vary as function of classifier architecture and parameter choices; an analytic formulation in order to explain under what circumstances classification variability can arise; an outline of specific challenges associated with the neural network model-based approach outlined above.
The conclusions reported here are relevant because they bring into discussion a body of literature that has purported to offer a viable path to the solution of the secondary structure prediction problem. Section
In the supervised classification problem [
In order to apply the neural network paradigm, two numerical issues must be addressed. First, since the input data comes in the form of an amino acid sequence, Section 3.1 discusses a simple encoding scheme for converting the amino acid alphabet into a usable numerical form. Second, for this work, our secondary structure target alphabet consists of elements from the set
For the numerical experiments, the training set was constructed using one hundred protein sequences randomly chosen from the CB513 database [
To encode the input amino acid sequences of length 17, we employ sparse orthogonal encoding [
The above input vector encoding technique is commonly applied in the bioinformatics and secondary structure prediction literature [
Analytically characterizing the invariance of classifier performance measures clearly involves first establishing a relationship between different sets of target vectors
When the supervised classifier model in (
Once the optimal set of weights has been computed, the network response to an unknown
Given an input vector
Consider the following definitions:
After a given classifier is trained, when presented with an input vector
It is well established that, in the case of a normally distributed data, the classification measures presented above minimize the probability of a classification error and are directly related to the statistical significance of a classification decision [
In this section, we numerically demonstrate that, when different target vector encodings are applied, neural network classifier measures outlined above, in certain cases, are observed to vary widely. For each neural network architecture under consideration, an analytic formulation is then presented in Section
As mentioned in Section 3.2, numerical experiments are performed first by training using
Ranking results for the linear network where
|
|
|
|
---|---|---|---|
1205 | 0.0780 | 1205 | 0.1170 |
42 | 0.0867 | 42 | 0.1300 |
1031 | 0.0976 | 1031 | 0.1464 |
1773 | 0.1113 | 1773 | 0.1670 |
598 | 0.1238 | 598 | 0.1857 |
1761 | 0.1267 | 1761 | 0.1900 |
862 | 0.1354 | 862 | 0.2031 |
1073 | 0.1409 | 1073 | 0.2114 |
277 | 0.1459 | 277 | 0.2188 |
115 | 0.1540 | 115 | 0.2309 |
1505 | 0.1821 | 1505 | 0.2731 |
392 | 0.1839 | 392 | 0.2759 |
1421 | 0.1904 | 1421 | 0.2856 |
147 | 0.2001 | 147 | 0.3001 |
990 | 0.2044 | 990 | 0.3066 |
1457 | 0.2127 | 1457 | 0.3191 |
1288 | 0.2150 | 1288 | 0.3225 |
352 | 0.2160 | 352 | 0.3239 |
1232 | 0.2198 | 1232 | 0.3297 |
280 | 0.2311 | 280 | 0.3466 |
Class membership results for the linear network. For each class, the total number of vectors classified using
Class |
|
|
% change |
---|---|---|---|
|
202 | 202 | 0 |
|
621 | 621 | 0 |
|
977 | 977 | 0 |
Ranking results for the backpropagation network where
|
|
|
|
---|---|---|---|
817 | 0.0107 |
|
0.0101 |
887 | 0.0231 | 1604 | 0.0130 |
264 | 0.0405 | 887 | 0.0209 |
1183 | 0.0711 | 1145 | 0.0214 |
684 | 0.0727 | 461 | 0.0232 |
623 | 0.0874 | 583 | 0.0329 |
911 | 0.0891 | 1086 | 0.0339 |
1382 | 0.0917 | 1382 | 0.0478 |
1610 | 0.0939 | 413 | 0.0489 |
551 | 0.1060 | 225 | 0.0608 |
1042 | 0.1150 | 438 | 0.0609 |
924 | 0.1322 | 911 | 0.0613 |
727 | 0.1339 | 207 | 0.0774 |
438 | 0.1356 | 559 | 0.0885 |
577 | 0.1363 | 481 | 0.0945 |
896 | 0.1500 | 1548 | 0.0947 |
175 | 0.1513 | 962 | 0.0968 |
1138 | 0.1549 | 85 | 0.1012 |
583 | 0.1581 | 195 | 0.1111 |
559 | 0.1655 | 9 | 0.1167 |
Class membership results for the backpropagation network. For each class, the total number of vectors classified using
Class |
|
|
% change |
---|---|---|---|
|
225 | 142 | 36.9 |
|
581 | 476 | 18.1 |
|
994 | 878 | 11.7 |
The results in Section
Let us begin by considering two definitions.
Given two sets of target vectors
Given two sets of target vectors
Based upon these definitions, the following has been established [
Given two sets of target vectors
In the analysis presented, the strategy for characterizing neural network performance depends upon the data from the previous section. For the linear network, since both ranking and classification were observed to remain invariant, it is more sensible to characterize the invariance of this network using Definition
When the target vectors are subjected to the transformation defined in (
If
then the ranking and, hence, the class membership for the linear network will remain invariant.
In other words, if the columns of the matrix
In this section, we seek to characterize the noninvariance observed in class membership using the backpropagation network. If the class membership varies due to a change in target vectors, then this variation should be quantifiable by characterizing the boundary separating two respective classes. The decision boundary between class
We first consider a special case where the target vectors are related according to
For a specific choice of
Simply consider (
Intuitively, a scaled, rigid rotation of the target vectors should not affect the decision boundary. However, when the more general transformation of (
Another way to analyze this problem is to first set
Intuitively, given a reasonable target encoding scheme, one would desire that properties related to protein secondary structure would be independent of the target vectors chosen. However, we have presented numerical data and a theoretical foundation demonstrating that secondary structure classification and confidence measures can vary depending on the type of neural network architecture and target vector encoding scheme employed. Specifically, linear network classification has been demonstrated to remain invariant under a change in the target structure encoding scheme while the backpropagation network has not. As
As pointed out in the introduction, one major purpose of the neural network is to create a stable and reliable model that maps input training data to an output classification with the hope of extracting informative parameters. When methods similar to those in the literature are applied [
This publication was made possible by Grant Number G12RR017581 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH). The authors would also like to thank the reviewers for their helpful comments.