Analysis of Unweighted Amino Acids Network

The analysis of amino acid networks is important for studying the various physicochemical properties of amino acids. In this paper we consider the amino acid network based on mutation of the codons. To analyze the relative importance of the amino acids we discuss different measures of centrality. Centrality measures are a powerful tool of graph theory for ranking the vertices of a network and for analyzing biological networks. We also investigate the correlation coefficients between the various measures of centrality, and we discuss the clustering coefficient of each amino acid as well as the average clustering coefficient of the network. Finally we discuss the degree distribution and its skewness.


Introduction
Amino acids are the building blocks of proteins. Each protein is formed by a linear chain of amino acids. Twenty different amino acids are known to occur in proteins. Each amino acid is encoded by a triplet of bases drawn from four possible bases. A sequence of three bases forms a unit called a codon, and a codon specifies one amino acid. The genetic code is a series of codons that specify which amino acids are required to make up a specific protein. As there are four bases, Adenine (A), Cytosine (C), Guanine (G), and Thymine/Uracil (T/U), this gives us 64 codons. Out of these 64, the three triplets UAA, UAG, and UGA are known as stop codons or nonsense codons and their role is to stop the biosynthesis. The codon AUG codes for the initiation of the translation process and is therefore also known as the start codon. A codon can be changed in several ways; such a change is known as a mutation. There are various types of mutation, such as substitution, insertion, deletion, and frameshift. In this paper we consider one-point mutations of all possible bases. To discuss the relative importance or significance of amino acids we investigate four centrality measures in the amino acid network. The compatibility relation of the graph is defined based on the mutation of the codon. For example, the amino acid M (Methionine) is connected with K (Lysine), T (Threonine), R (Arginine), I (Isoleucine), V (Valine), and L (Leucine), because the one-point mutations of the codon AUG (M) yield codons for the amino acids K, T, R, I, V, and L.
Different researchers have made many contributions in this field. Kundu [1] showed that the hydrophobic and hydrophilic networks within proteins satisfy the "small-world property" and that the hydrophobic network has a larger average node degree than the hydrophilic network. In 2007 Aftabuddin and Kundu [2] discussed three types of networks within proteins and characterized each of them. Jiao et al. [3] discussed the weighted amino acid network based on the contact energy and showed that the weighted amino acid network satisfies the "small-world" property. Fell and Wagner [4] examined whether metabolites with the highest degree may belong to the oldest part of the metabolism. Wuchty and Stadler [5] discussed various centrality measures in biological networks and concluded that vertex degree centrality alone is not sufficient to distinguish lethal proteins from viable ones. Newman [6] discussed the correlation of degree centrality and betweenness centrality. Schreiber and Koschutzki [7] compared centralities for biological networks, namely, the PPI network and the transcriptional network. Their study showed that various centrality measures should be considered in the analysis of biological networks.
This paper is organized as follows. In Section 2 we define some preliminary concepts of graphs and briefly review the various centrality measures. In Section 3 we define the graph of amino acids based on mutation and discuss various centrality measures; we also discuss the bivariate correlation between different centrality measures. In Section 4 we discuss some network parameters. In Section 5 we give the conclusion of this paper.

Preliminary Concepts of Graph
An undirected graph $G = (V, E)$ consists of a finite set $V$ of vertices and a finite set $E \subseteq V \times V$ of edges. If an edge $e = (u, v)$ connects two vertices $u$ and $v$, then $u$ and $v$ are said to be incident with the edge $e$ and adjacent to each other. The set of all vertices adjacent to $u$ is called the neighborhood $N(u)$ of $u$. A complete graph is a graph in which every vertex is connected to every other vertex. A directed graph or digraph $G$ consists of a set $V$ of vertices and a set $E$ of edges in which each edge $e \in E$ has a direction. A graph is called loop-free if no edge connects a vertex to itself. The adjacency matrix of a graph $G = (V, E)$ with $n$ vertices is an $(n \times n)$ matrix $A = (a_{ij})$, where $a_{ij} = 1$ if and only if $(v_i, v_j) \in E$ and $a_{ij} = 0$ otherwise. The adjacency matrix of any undirected graph is symmetric. The degree of a vertex $v$ is the number of edges having $v$ as an end point. A walk is a finite alternating sequence of vertices and edges, beginning and ending with vertices, such that each edge is incident with the vertices preceding and following it. No edge appears more than once in a walk; a vertex, however, may appear more than once. The beginning and ending vertices of a walk are called its initial and terminal vertices. A walk is closed if its initial and terminal vertices are the same; otherwise it is called an open walk. A trail is a walk without repeated edges and a path is a walk without repeated vertices. A shortest or geodesic path between two vertices $u$, $v$ is a path of minimal length. A graph is connected if there exists a walk between every pair of its vertices.

Centrality in Graph.
In graph theory, a centrality measure of a vertex represents its relative importance within the graph. A centrality is a real-valued function on the nodes of a graph. More formally, a centrality is a function $C$ which assigns to every vertex $v \in V$ of a given graph $G$ a value $C(v) \in \mathbb{R}$. In the following we discuss the four most commonly used centrality measures.

Degree Centrality.
The simplest centrality measure is degree centrality, $C_D(u)$. It is defined as the number of nodes to which the node $u$ is directly connected. The nodes directly connected to a given node are also called the first neighbors of that node. Under degree centrality, an important node is one that is involved in a large number of interactions. These interactions give the immediate importance or risk of the node in the corresponding network. Mathematically it is defined as
$$C_D(u) = \deg(u).$$
However, in real-world problems degree centrality does not fully measure the importance or risk of a node, because an important node may be connected only indirectly with other nodes.
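As an illustration, degree centrality can be read off the adjacency matrix as a row sum. The following is a minimal Python sketch on a small hypothetical four-vertex graph (not the amino acid graph); the function name `degree_centrality` and the example matrix are assumptions made for this example.

```python
def degree_centrality(adj_matrix):
    """Return the degree of every vertex as the row sum of a 0/1 adjacency matrix."""
    return [sum(row) for row in adj_matrix]


# Hypothetical example: vertex 1 is joined to vertices 0, 2 and 3 (a star).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

degrees = degree_centrality(A)
print(degrees)
```

Because the matrix of an undirected graph is symmetric, column sums would give the same values.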

Eigenvector Centrality.
Another important measure of centrality is eigenvector centrality [8]. An eigenvalue of a square matrix $A$ is a value $\lambda$ for which $\det(A - \lambda I) = 0$, where $I$ is the identity matrix of the same order as $A$. Eigenvector centrality is defined as the principal eigenvector of the adjacency matrix of the corresponding graph.
In matrix-vector notation we can write
$$A x = \lambda x,$$
where $A$ is the adjacency matrix of the graph, $\lambda$ is a constant (the eigenvalue), and $x$ is the eigenvector. In general, there will be several different eigenvalues $\lambda$ for which an eigenvector solution exists. However, the eigenvector of the greatest eigenvalue is the eigenvector centrality [8]. Eigenvector centrality reflects the direct as well as the indirect importance of a node in a network.
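One common way to approximate the principal eigenvector is power iteration. The sketch below is only an illustration under stated assumptions: the text does not say how the eigenvector was computed, the graph is a small hypothetical one (a triangle with one pendant vertex), and the function name and the shift by the identity matrix are choices made for this example. It assumes a connected undirected graph.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-12):
    """Approximate the principal eigenvector of a 0/1 adjacency matrix.

    Power iteration is applied to A + I, which has the same eigenvectors
    as A but avoids oscillation on bipartite graphs. The result is scaled
    so that the largest entry equals 1.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        # y = (A + I) x
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = max(y)
        y = [value / scale for value in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

scores = eigenvector_centrality(A)
print([round(s, 3) for s in scores])
```

In this toy graph vertex 0 receives the top score, and vertex 3 scores lowest even though it is adjacent to the most central vertex, which illustrates how indirect links enter the measure.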

Closeness Centrality.
Closeness centrality captures how close a vertex is to all other vertices, not only to its first neighbors but on a global scale. If a vertex is central, then it is close to all other vertices, and a vertex that is close to the other vertices can quickly interact with all of them. In general, closeness centrality is defined as the inverse of the sum of the shortest path distances between a node and every other node in the network [9]. A node with high closeness centrality can easily reach or communicate with the other nodes of the network.
Mathematically it is defined as
$$C_C(u) = \frac{n - 1}{\sum_{v \in V} d(u, v)},$$
where $n$ is the number of vertices of the network and $d(u, v)$ is the shortest path distance between the pair of vertices $u$ and $v$. From this definition it is clear that the node with the minimum cumulative shortest path distance has the maximum closeness centrality, and such a node is very well connected to all other nodes.
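In an unweighted graph the shortest path distances can be obtained by breadth-first search. The following minimal sketch assumes the normalized form $(n - 1)/\sum_v d(u, v)$, which is one common convention (the unnormalized inverse sum gives the same ranking), and it assumes a connected graph. The path graph and the function names are hypothetical, chosen only for illustration.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(graph):
    """Closeness (n - 1) / sum of distances; assumes the graph is connected."""
    n = len(graph)
    result = {}
    for u in graph:
        total = sum(shortest_path_lengths(graph, u).values())
        result[u] = (n - 1) / total
    return result


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
closeness = closeness_centrality(path)
print(closeness)
```

The middle vertex of the path has the smallest cumulative distance (6) and therefore the largest closeness, while the two end vertices have the smallest.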

Betweenness Centrality.
Another well-known centrality measure is betweenness centrality [9]. It rests on the idea that interactions between two nonadjacent nodes depend on the other nodes, in particular on those lying on the paths between the two. The betweenness centrality of a node $v$ is the number of shortest paths going through $v$. Mathematically it is defined as
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$
where $\sigma_{st}$ is the number of shortest paths from vertex $s$ to $t$ and $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ that pass through $v$. Betweenness centrality identifies the nodes that carry most of the information flow of the network.
An important node will lie on a large number of paths between other nodes in the network. From such a node we can control the information flow of the network; without these nodes, there would be no way for some pairs of nodes to communicate with each other. In general a high-degree node has high betweenness centrality because many of the shortest paths may pass through it. However, a node with high betweenness centrality need not always be a high-degree node.
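Betweenness values are usually computed with Brandes' algorithm, which accumulates path dependencies from one breadth-first search per source vertex. The sketch below is an illustration only: the text does not state which algorithm was used, the four-cycle is a hypothetical example, and each unordered pair of end points is counted once (hence the final division by 2).

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        order = []                            # vertices in order of discovery
        preds = {v: [] for v in graph}        # predecessors on shortest paths
        sigma = {v: 0 for v in graph}         # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):             # farthest vertices first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was visited twice in an undirected graph.
    return {v: value / 2 for v, value in bc.items()}


# Hypothetical example: the cycle a - b - c - d - a.
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
betweenness = betweenness_centrality(cycle)
print(betweenness)
```

In the four-cycle every vertex lies on one of the two shortest paths between its two neighbours, so every vertex obtains the same value of 0.5.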

Graph of Amino Acids
Every codon codes for a unique amino acid. A one-point mutation of a codon may or may not change the coded amino acid. All one-point mutations of a codon give nine other codons. Some of these nine codons may code for the original amino acid, and several of them may code for the same amino acid other than the original one. In some sense the nine mutants can be termed near or close to the original one. In the language of topology these codons form the vicinity of the original codon; in other words, they are related to the original one. Since any mutation has its reverse mutation, this relation is bidirectional. This nearness relation or affinity is naturally carried over to the amino acids. Thus on the amino acids we have a binary relation which generates an undirected graph. In our amino acid graph the vertex set is the set of amino acids, and two amino acids $X$ and $Y$ are connected by an edge if a one-point mutation of a codon coding for $X$ yields a codon coding for $Y$. Two amino acids connected by an edge can thus be interpreted as having affinity towards each other in the sense that one may evolve from the other. The amino acid graph therefore gives a picture of the evolution of the amino acids. We will call it the evolutionary graph of amino acids. The corresponding graph is depicted in Figure 1.
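The construction described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: it assumes the standard genetic code, ignores mutations to or from stop codons, and uses hypothetical names (`CODON_TABLE`, `build_amino_acid_graph`, `is_connected`). Since Figure 1 and Table 1 are not reproduced here, agreement of every edge with the authors' graph is not verified.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def build_amino_acid_graph():
    """Link two amino acids if a one-point mutation turns a codon of one
    into a codon of the other (stop codons are skipped)."""
    graph = {a: set() for a in set(AMINO) if a != "*"}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                other = CODON_TABLE[mutant]
                if other != "*" and other != aa:
                    graph[aa].add(other)
                    graph[other].add(aa)
    return graph


def is_connected(graph):
    """Depth-first search from an arbitrary vertex reaches every vertex."""
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(graph)


graph = build_amino_acid_graph()
print(sorted(graph["M"]))   # neighbours of Methionine
print(is_connected(graph))
```

Under these assumptions the neighbours of M come out as I, K, L, R, T, and V, in agreement with the example given in the Introduction.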
From Figure 1, we observe that the graph is connected. Corresponding adjacency matrix of the graph is given as follows:

Centralities in Amino Acids Graph.
To analyse the amino acid graph (Figure 1) we have calculated different measures of centrality. In Table 1, we represent the different centrality values of the vertices.
From Figure 1 we observe that the graph is not complete; that is, some of the amino acids are not linked with some other amino acids. The amino acids R (Arginine) and S (Serine) form a complete graph with respect to first base or third base mutation of the codon. Moreover, R and S are well connected to all other amino acids through first base and/or second base mutations, so their cumulative shortest path distance is minimum. Hence the amino acids R and S have high closeness centrality, and first base and/or second base mutation has relative importance in terms of closeness centrality. Further, we observe that an amino acid which has no direct link with another amino acid is linked with it indirectly through R or S. For example, the amino acid G has no direct link with the amino acids M, L, I, F, Y, P, T, N, Q, K, and H, but it is linked with them indirectly through R and S by first base and/or second base mutations. Turning to betweenness centrality, the amino acids R and S have high betweenness values: because the degree of these amino acids is high, many shortest paths pass through them. As R and S are linked with other amino acids through first base and/or second base mutation, we conclude that first base and/or second base mutation has relative importance in terms of betweenness centrality.
From the degree centrality point of view, we observe that the amino acids R and S have the highest degree centrality. As degree centrality does not reflect the indirect links of the amino acids, we cannot draw any conclusion regarding which base mutation is important for degree centrality.
Finally, the amino acids R and S have the maximum eigenvector centrality because the sum of their direct and indirect links is maximum. As eigenvector centrality depends on direct as well as indirect links, and as the indirect links of R and S with other amino acids arise through first base and/or second base mutation, we conclude that first base and/or second base mutation has relative importance in the context of eigenvector centrality.

Correlation between Various Centralities.
In this section we discuss the bivariate correlation of the various measures of centrality for the amino acid network. Correlation is an important characteristic for studying whether a network is assortative or disassortative. The correlation coefficients are given in Table 2. All correlation coefficients ($r$) are based on Pearson's method. The value of $r$ lies between $-1$ and $+1$. If $r > 0$ then the network is assortative, whereas if $r < 0$ then the network is disassortative.
From Table 2, we observe that all the centrality measures are highly correlated. Since all the correlation coefficients are positive ($r > 0$), the network is of assortative type. Therefore information can be easily transferred through this network.
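For reference, Pearson's correlation coefficient between two lists of centrality scores can be computed as in the following sketch. The scores used here are not taken from Table 2; they are the degree and normalized closeness values of a hypothetical four-vertex graph (a triangle with one pendant vertex), and the function name `pearson_r` is an assumption of this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical scores for a triangle 0-1-2 with a pendant vertex 3 attached to 0.
degree = [3, 2, 2, 1]
closeness = [1.0, 0.75, 0.75, 0.6]

r = pearson_r(degree, closeness)
print(round(r, 3))
```

A value of $r$ close to $+1$, as in this toy case, means that the two measures rank the vertices in nearly the same way.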

Network Parameters
Various network parameters are used in the study of biological networks. In this paper we use three network parameters, namely, clustering coefficient, degree distribution, and Pearson's skewness. The clustering coefficient measures the tendency of a graph to be divided into clusters. A cluster is a subset of vertices that contains many edges connecting these vertices to each other. The clustering coefficient $C_i$ of a node $i$ is the ratio between the number $e_i$ of links actually connecting its nearest neighbours and the number of all possible links between these nearest neighbours, which is $k_i(k_i - 1)/2$, where $k_i$ is the degree of node $i$. It is given by $C_i = 2e_i / (k_i(k_i - 1))$. Nodes with fewer than two neighbours are assumed to have a clustering coefficient of 0. The coefficient takes values $0 \le C_i \le 1$. The clustering coefficient of the whole network is the average of all individual $C_i$. A higher clustering coefficient of a node represents a stronger relationship between its neighbouring nodes; that is, it represents a larger number of connections among its neighbours. In Table 3 we show the clustering coefficients of all the amino acids.
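The definition above translates directly into code. The following minimal sketch uses a small hypothetical graph (a triangle with one pendant vertex) rather than the amino acid graph; the function names are assumptions made for this example.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """2 * e / (k * (k - 1)): the share of neighbour pairs that are linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbours
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))


def average_clustering(graph):
    """Clustering coefficient of the whole network (mean over all nodes)."""
    return sum(clustering_coefficient(graph, v) for v in graph) / len(graph)


# Hypothetical graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

print({v: clustering_coefficient(toy, v) for v in toy})
print(average_clustering(toy))
```

Vertex 0 has three neighbours but only one of the three possible links among them is present, so its coefficient is 1/3, whereas vertices 1 and 2 each have a fully linked neighbourhood.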
From here it is clear that the clustering coefficient of an amino acid depends upon the degree of the amino acid as well as the number of direct connections between its neighbouring amino acids. We observe that the very large hydrophobic amino acid W (volume 227.8 Å³), which also has a high molecular weight (204.23), has a high clustering coefficient. The clustering coefficient of the whole amino acid network is 0.464, which is also the clustering coefficient of G. That is, the very small hydrophobic amino acid G (volume 60.1 Å³), which also has a very small molecular weight (75.07), represents the clustering coefficient of the whole network. Since the clustering coefficient increases with the number of connections among the neighbours, higher values of the clustering coefficient of a network have a large effect on the nodes of the network and slow down the spread of information. Since the clustering coefficient of the amino acid network is not high, information can be sent comparatively fast in this network.
International Scholarly Research Notices
Next, it is of interest to investigate the nature of the distribution of the degrees of the nodes. The spread in the number of links a node has is characterized by a distribution function $P(k)$. The degree distribution $P(k)$ of a network is defined to be the fraction of nodes in the network with degree $k$. If there are $N$ nodes in total in a network and $n_k$ of them have degree $k$, we have $P(k) = n_k / N$. Equivalently, $P(k)$ is the probability that a randomly selected node has exactly $k$ links. In Table 4, we show the degree distribution values of the different amino acids.
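The degree distribution can be tabulated by counting how many nodes share each degree. The sketch below is a minimal illustration on a hypothetical four-vertex graph, not on the amino acid graph; the function name `degree_distribution` is an assumption of this example.

```python
from collections import Counter


def degree_distribution(graph):
    """Return {k: n_k / N}, the fraction of nodes having each degree k."""
    n = len(graph)
    counts = Counter(len(neighbours) for neighbours in graph.values())
    return {k: count / n for k, count in sorted(counts.items())}


# Hypothetical graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

distribution = degree_distribution(toy)
print(distribution)
```

The values sum to 1, as expected of a probability distribution over the degrees.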
Another well-known parameter is skewness. Skewness is a measure of the symmetry or asymmetry of the distribution of a variable. A measure of skewness was first suggested by Karl Pearson in 1895. There are various measures of skewness; in this paper we use only Pearson's coefficient of skewness. In a normal curve the mean, the median, and the mode all coincide and there is perfect balance between the right and the left sides of the curve. Skewness, which means lack of symmetry, occurs in a curve when the mean, median, and mode do not coincide. Skewness describes the shape of the distribution. Symmetry means that the values of the variable are equidistant from the central value on either side, whereas an asymmetrical distribution is either positively skewed or negatively skewed. Skewness is denoted by $S_k$. Based on the relative positions of the mode, mean, and median, two types of skewness appear in a distribution, namely, positive skewness and negative skewness. If the mean is the largest, the mode is the smallest, and the median lies between the two, then the distribution is called positively skewed. If the mode is the largest, the mean is the smallest, and the median lies between the two, then the distribution is called negatively skewed. Karl Pearson's coefficient of skewness is given by the following formula:
$$S_k = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}.$$
The value of this measure of skewness lies within the range of $-3$ to $+3$. If $S_k = 0$, then the distribution is symmetrical, that is, normal.
If $S_k > 0$, then the distribution is positively skewed. If $S_k < 0$, then the distribution is negatively skewed. Here we take the degree distribution value as the variable ($x$) and the number of amino acids which share the same value of the distribution as the frequency ($f$). Then we have Table 5, where 0.15 is taken as the assumed mean. From Table 5, the mode is 0.4 (because it has the highest frequency, i.e., 8) and the median is 0.294. Also the standard deviation is 0.137. Therefore Pearson's coefficient of skewness is −1.18 < 0. From this we conclude that the degree distribution of the amino acids is negatively skewed.
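The reported coefficient can be checked with a short calculation. The sketch below rests on two assumptions that should be read with care: it uses the median-based form of Pearson's coefficient, 3(mean − median)/(standard deviation), which is consistent with the stated range of −3 to +3, and it rebuilds the frequency table from the degree classes listed in the Conclusion (2, 1, 1, 3, 8, 4, and 1 amino acids) rather than copying it from Table 5. The median and standard deviation are the values reported in the text.

```python
from math import sqrt

# x = degree distribution value, f = number of amino acids sharing that value.
# ASSUMPTION: rebuilt from the degree classes in the Conclusion, not from Table 5.
table = {0.05: 3, 0.10: 2, 0.15: 3, 0.20: 4, 0.40: 8}

n = sum(table.values())
mean = sum(x * f for x, f in table.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / n
std_dev = sqrt(variance)  # population standard deviation of the rebuilt table

# Values reported in the text.
median_reported = 0.294
std_dev_reported = 0.137

# ASSUMPTION: median-based form of Pearson's coefficient of skewness.
skewness = 3 * (mean - median_reported) / std_dev_reported

print(n, round(mean, 3), round(std_dev, 3), round(skewness, 2))
```

Under these assumptions the mean is 0.24, the standard deviation of the rebuilt table is close to the reported 0.137, and the coefficient rounds to the reported value of −1.18.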

Conclusion
In this paper we have equipped the amino acids with a graph structure by defining a compatibility relation based on mutation. We have observed that the graph is connected. We have discussed different centrality measures and observed that the highly hydrophilic amino acid R (Arginine) and the least hydrophilic amino acid S (Serine) have the highest centrality values irrespective of the centrality measure. Both amino acids are hydrophilic and have the same number of codons, namely six. Degree centrality assigns the top score to the amino acids S and R and the second score to the amino acid L, followed by the amino acid I (third score); the amino acids G, V, and T (fourth score); A, C, D, P, M, K, N, and H (fifth score); E, F, Y, and Q (sixth score); and finally W (seventh score). The only other measure that makes such a distinction is closeness centrality; neither betweenness centrality nor eigenvector centrality does. We have also observed that first base and/or second base mutation has relative importance in all the centrality measures. Next we found the correlation coefficients of the various centrality measures of the amino acids and observed that all the centrality measures are highly correlated. Hence we can conclude that in the amino acid network based on mutation all centrality measures give the same ranking to the amino acids. We have also observed that the large hydrophobic amino acid W (Tryptophan) has a high clustering coefficient and that the clustering coefficient of the small hydrophobic amino acid G (Glycine) equals the average clustering coefficient of the network. Finally we have observed that the degree distribution is negatively skewed. Then, using the Kolmogorov-Smirnov test, we observed that the degree distribution follows a three-parameter Weibull distribution. Since our graph is based on the mutation of codons, the network so generated gives a general picture of the evolution of the amino acids. An amino acid X has more affinity to evolve from another amino acid Y if they are linked than otherwise.