A Similarity Search Using Molecular Topological Graphs

A molecular similarity measure has been developed using molecular topological graphs and atomic partial charges. Two kinds of topological graphs were used. One is the ordinary adjacency matrix and the other is a matrix which represents the minimum path length between two atoms of the molecule. The ordinary adjacency matrix is suitable to compare the local structures of molecules such as functional groups, and the other matrix is suitable to compare the global structures of molecules. The combination of these two matrices gave a similarity measure. This method was applied to in silico drug screening, and the results showed that it was effective as a similarity measure.


Introduction
The molecular similarity measure is an important tool in chemometrics and chemoinformatics. The main applications include ligand-based in silico (virtual) drug screening, ADME-Tox (adsorption, distribution, metabolism, excretion, and toxicity) property prediction, physical molecular property prediction (1-octanol-water partition coefficient, solubility), and measurement of the diversity of chemical compounds in a library.
The molecular similarity measure generally assigns a one-dimensional (1D) and/or two-dimensional (2D) descriptor-that is, molecular fingerprints based on substructure, molecular mass, number of rotatable bonds, number of hydrogen donors/acceptors of the compound, and so forth-to compounds so that the similarities of the compounds can be evaluated [1][2][3][4][5]. Many methods have been proposed for the similarity search of chemical compounds, such as the comparison of overlapping substructures in the form of Daylight fingerprints (Daylight Chemical Information Systems Inc., Aliso Viejo, CA, USA), the chemically advanced template search (CATS) descriptor method developed by Pickett [6], and the Burden-CAS-University of Texas (BCUT) descriptor method [7]. One of the most widely used methods is to compare the existence of fragment structures; this is the technique employed by the MACSS key, which was developed by Molecular Design Limited (MDL, Santa Clara, CA, USA). Each element of the feature vector of the molecule represents the existence of a particular fragment structure in the molecule (dictionary based fingerprinting). A rather large example of a dictionary used for this fingerprinting technique is the software program DRAGON developed by Talete SRL (geographical information), which consists of more than 3200 molecular descriptors. The affinity fingerprint approach is a new type of similarity search method based on a multiprotein/multicompound affinity matrix [8][9][10][11][12][13][14][15][16][17][18][19][20][21]. In this method, each element of the feature vector of the molecule represents the binding affinity of the molecule with a particular protein. Usually, the binding affinity is measured by calculation using a proteincompound docking program.
There are various applications for molecular similarity, and thus many types of similarity measures are needed. Most of the conventional molecular descriptors aim scaffold hopping (lead hopping) to find a compound with a different scaffold from the known active compound. However, in some cases, we want to find similar compounds with similar scaffolds. For example, in lead generation, we want to find a series of similar compounds with the same or similar scaffold instead of performing an actual synthesis. Substructure 2 Journal of Biomedicine and Biotechnology searches have been used for this purpose [22,23]. However, a comparison of the indices of molecular topologies is much faster than a substructure search. A topological index, which is any of several numerical parameters of a molecular graph, is also widely used [5]. The Wiener index, Hosoya index, and Randic's molecular connectivity index, are graph invariants and conventional topological indices. These topological indices show correlation to the physical or chemical properties of molecules, although these indices do not recognize atom types and they can be quite difficult to calculate.
In the current study, we proposed a new similarity measure for identifying topologically similar compounds based on their molecular topologies and evaluated this method by applying it to a ligand-based drug screening test.

Similarity and Distance between Compounds.
First, the all-atom model compound structures are converted to united atom models, in which all hydrogen atoms are omitted and the atomic charge of the hydrogen atom is added to the atomic charge of the connected heavy atom.
In this method, the adjacency matrix E and the distance matrix D are used, and Figure 1 shows an example of these two matrices for a simple graph. The topology of the compound can be represented by an edge-adjacency matrix E [4,5]: e ab = 1, when the ath and the bth atoms are connected e ab = 0, otherwise, where e ab is the a-b element of matrix E. The value of e ab could be the bond order between the ath and the bth atoms (the value of e ab could be 1.5 for an aromatic bond). Just as in the BCUT method, the diagonal part (e aa ) is replaced by the converted atomic charge q c : where q A is an atomic partial charge and c is a coefficient. In this study, c was set to 1.0. The q c value is > 0 for any q A value. The a-b matrix element of the pseudodistance matrix D represents the minimum path length between the ath and bth atoms: where d ab is a shortest path length between the ath and the bth atoms. We also tried D ab = d ab (shortest-path topological distance matrix [4,5]) and D ab = d ab 1/2 and found that D ab = ln(1+d ab ) gave the best result among these definitions.
Let ε i and N be the ith eigenvalue of the matrix E or D and the number of atoms of the united-atom model of the compound. We define the eigenvalue histogram g(ε) as follows: Here ε and c are the energy and the arbitral coefficient. The distance S(A, B) between molecules A and B is defined based on the eigenvalue histogram of molecule A(g A (ε)) and that of molecule B(g B (ε)) as follows: In the current in silico drug screening, candidate hit compounds are selected using the following method. Let S E (A, B) and S D (A, B) be the distance between A and B based on the adjacency matrix E and that based on the distance matrix D. These two distances give the consensus distance S (A, B) with the weight parameter λ: Compounds that are close to the known active compounds are selected as the candidate hit compounds. For this purpose, the distance to the known active compounds is introduced. The distance from the kth compound to the average position of the active compounds (Dist k ) is defined as where A k , C i , and M are the kth compound, ith active compounds, and the total number of the active compounds. When the number of active compounds is one, Dist k = S . We call Dist k the molecular-graph (MG) distance and we call this screening procedure the molecular-graph distance (MGD) method. The eigenvalues (ε i ) of S E and S D of each compound of the compound library are stored in a database file a priori. For a query compound, the eigenvalues of S E and S D must be calculated, which costs less than 1 second. The database search is conducted only to perform the calculations in (4)- (7) and thus is quite fast.

Preparation of Materials
For the drug screening test, our target proteins were the macrophage migration inhibitory factor (MIF), cyclooxygenase-2 (COX-2), human immunodeficiency virus protease-1 (HIV), thermolysin (THR), glutathione Stransferase (GST), the histamine H1 receptor, the adrenaline beta receptor, the serotonin receptor, and the dopamine D2 receptor. For validation of the present method, we used the same set of compounds as used in our previous study [20]. Namely, the compound set consisted of 12 inhibitors of MIF, 28 inhibitors of THR, 15 inhibitors of COX-2, Journal of Biomedicine and Biotechnology 20 inhibitors of HIV, 12 inhibitors of GST, 10 antagonists of the histamine H1 receptor [24], 12 agonists and 13 antagonists of the adrenaline beta receptor [25], 8 agonists and 9 antagonists of the serotonin receptor [26], and 6 agonists and 15 antagonists of the dopamine D2 receptor [27] as the active compounds, along with 11050 potentially negative compounds from the random compound library of the Coelacanth Chemical Corporation (East Windsor, NJ, USA). Typically, only one hit compound could be found out of 10 4 randomly selected compounds; we therefore expected that there would be no more than a few, if any, hit compounds among these 11212 compounds. The atomic charge of each ligand was determined by the Gasteiger method [28,29]. To calculate the Gasteiger charge, an all-atom model is necessary for each compound. The three-dimensional (3D) coordinates of the 11050 random compounds above were generated to add the hydrogen atoms by the Concord program (Tripos, St. Louis, MO, USA) from 2D Sybyl SD files provided by the Coelacanth Chemical Corporation. The 3D coordinates of the active compounds (inhibitors, substrates, agonists, and antagonists) were generated by Chem3D (Cambridge Software, Cambridge, MA, USA).

Results
To evaluate the efficiency of this method, the leave-oneout cross-validation test was applied; namely, the active compounds of each target protein were selected one by one as the known active compounds for this software and the other unknown active compounds were discovered by the software. The test dataset consists of these active compounds and the other approximately 10 4 potential inactive compounds (decoy set). The total number of compounds was 11212. A total of 160 (= total 160 active compounds) database enrichment curves were calculated for these 9 target proteins and 11212 compounds and the results were averaged.
The surface area (q or "area under curve": AUC) under the total database enrichment curve ( f ) is a measure of the database enrichment: where x and f (x) are the percentages of compounds that are selected from the total compound library and the database enrichment curve, respectively. A higher q value corresponds to better database enrichment, and the q value is always greater than zero and less than 100. For the random screening, q = 50. First, the λ dependence of the hit ratio was examined in the MGD method. The average q values and the hit ratio of the 160 trials with various λ values are summarized in Table 1. The coefficients c for matrices E and D were optimized for every λ to maximize the hit ratio. When λ = 0 or λ = 1, the hit ratio and the q values were lower than those in the other cases. This result showed that the combination of matrices D and E is more effective than the single usage of either D or E. The optimized coefficients were used in the following study. The average eigenvalues of D and E were −3.21 * 10 −6 and 0.505 for the decoy set, respectively. The histograms g(ε) of (4) were close to single Gaussian distributions. We show the distributions of g(ε) of H1 antagonist diphenhydramine and COX-2 inhibitor indomethacin in the supplementary data.  Table 1. Figure 2 shows a score distribution with λ = 0.25 using diphenhydramine as the template (see Figure 4). The template corresponds to Dist k = 0. The frequency was normalized; the surface area under the curve was set to 1. The distribution is similar to Gaussian distribution.  Figure 3 shows the average database enrichment results of the 160 trial screening tests by the leave-one-out crossvalidation test. The MGD method worked well and showed good enrichment. In this calculation, λ was set to 0.25. Within the first 1%, 5%, and 10% of the database, 36.5%, 52.8%, and 60.0% of the active compounds were found by the similarity measure defined by (7), respectively. The average q value by the MGD method using (7) was 82.53. This result is worse than the result by the machine-learning docking score index method reported previously. Namely, about 70% of the active compounds were found within the first 1% of the database and the average q value was 98.5.
Ten histamine H1 receptor antagonists were included in the compound library (see Figure 4). Figure 5 shows the known active compound (template) and the best ranked molecules, when λ = 0, 0.25 and 1. The Dist k values and z-scores (= (score -average score)/standard deviation) of the three top-ranked compounds are summarized in Table 1. These z-scores show that the score distribution is slightly different from the Gaussian distribution. In the Gaussian distribution, the number of compounds with a z-score > 3 is 0.1% of the database (10 compounds in this case). The z-scores of the top-ranked compounds were only 2, in this case. The selected molecules look similar to the template compound, diphenhydramine. For λ = 0.25, the other H1 receptor antagonist, chlorpheniramine, was found as the third compound. Both diphenhydramine and chlorpheniramine have a diphenyl group. The other compounds (a1 and a2) are not very similar to the template. For λ = 0, compound b2, which is not an H1 antagonist, is similar to the template. Compound b3 is identical to compound a1. For λ = 1, the three best-ranked compounds are not particularly similar to the query. Compounds a1, a2, b3, and c3 are similar to each other.  (4) for the E matrix and the coefficient2 is the c parameter in (4) for the D matrix. The hit ratio is the % of the active compounds found within the first 1% of selected compounds of the database. The average and σ are the average value and the standard deviation of the scores. * : the top-ranked compounds when diphenhydramine is the template.  Figure 3 shows that 36.5% of the active compounds were found within the first 1% of the compounds of the whole library. Our previous study showed that 12.4%, 43.4%, and 67.5% of the active compounds were found within the first 1% of the database by the docking score index (DSI), factor-selection DSI (FS-DSI), and machine-learning DSI (ML-DSI) methods, respectively, when 180 proteins were used to calculate the affinity fingerprint [19]. The three-dimensional (3D) shape and charge distribution of a compound govern the protein-compound binding energy. The DSI, FS-DSI, and ML-DSI methods utilize the 3D shape and charge distribution of the compound through the affinity fingerprint. On the other hand, the 2D structure of the compound does not govern the protein-compound binding energy. Thus, the current similarity measure was not better than the previously developed screening methods for in silico drug screening, when it was used as a single measure to describe the molecular similarity. However, the MGD method did have an advantage in terms of its computational speed. The MGD method can search 10 000 000 compounds within 1 hour on a Xeon 3 GHz computer, which is 1000 times faster than the MSM-DSI method. However, the current similarity measure was still effective for in silico drug screening. Our active compounds were chosen based on literature. As shown in Figure 4, some compounds were very similar to each other by human-eye inspection. The main reason for this similarity was likely that these compounds were generated from a progenitor compound by lead optimization. Diphenhydramine, chlorpheniramine, homochlorcyclizine, cetirizine, and clemastine have a diphenyl-like group. In promethazine, olopatadine, mequitazine, and cyprohrptadine, the conformations of two phenyl groups are fixed. Most of these antagonists are structurally similar, which should be the reason why the current similarity measure was effective for the in silico drug screening. In other words, this method is not suitable for scaffold hopping (lead hopping) [30]. For scaffold hopping, other methods have been developed [30,31].

Discussion
If all matrix elements of the off-diagonal part are zero, the eigenvalues are equal to the values of the diagonal part. The off-diagonal part shifts the eigenvalues from the values of the diagonal part. In (1), the bond information close to the ith atom can give major perturbation on the ith diagonal value (atomic charge of the ith atom). Thus the matrix E represents the short-range information of the molecular topology. On the other hand, in (3), the i-j matrix element of the matrix D becomes large when the ith atom is far from the jth atom on the molecular topology. Thus the matrix D can represent the long-range information of the molecular topology.
The information of the matrix E and that of D are independent of each other. For each compound in the test database of 10 4 compounds, the values of S E and S D in (6) were calculated. The correlation coefficient of the values of S E and S D was only 0.08, indicating that there was no correlation between these values. Thus, the compounds in the compound library are widely distributed in the (S E , S D ) two-dimensional (2D) space, and the MGD method selects the compounds around the query compound from the compound library in the 2D space.

Conclusion
We developed a similarity measure for chemical compounds that is based on the molecular topology, the atomic charge, and the minimum path length between atoms. The histograms of eigenvalues of these matrices were smoothed to generate a continuous curve. The difference and overlap Journal of Biomedicine and Biotechnology 7 between these two histograms define the distance between the two compounds.
This similarity measure was applied to ligand-based in silico drug screening. In this calculation, compounds whose molecular topology structures are similar to the given active compounds were selected by using this similarity measure. This measure actually worked to find unknown active compounds from a random compound library.