REGRAS: an auxiliary program for pattern recognition and substructure elucidation of monoterpenes

The main purpose of this paper is to present a procedure that uses 13 C NMR for pattern recognition and substructure elucidation of monoterpenes. For this purpose, a new version of the REGRAS program was developed for the expert system SISTEMAT. The program analyzes the 13 C NMR data of a given compound and, from characteristic chemical shift ranges, recognizes the skeleton and the substructures present in it. At the end of this procedure, the program displays the likely skeletons and substructures of the substance in question. The REGRAS program was tested on the skeleton elucidation of 30 monoterpenes of widely varied skeleton types and gave excellent results in both skeleton and substructure prediction.


Introduction
During the last three decades, numerous expert systems have been developed for determining chemical structures. Among these systems, one can point out DENDRAL [1], ACCESS [2], DARC/EPIOS [3], SpecInfo [4] and Assemble 2.0 [5]. Each works in its own way, although they share some common points, for example the need for powerful computers and for numerous chemical constraints to reduce the number of structural proposals shown.
Over the last ten years, our research group, which is more concerned with artificial intelligence programs, has developed the expert system SISTEMAT [6,7]. Its goal is to aid researchers in natural product chemistry in determining the structures of substances. To that end, various classes of compounds have been studied, for example sesquiterpenes [8,9], sesquiterpene lactones [10], diterpenes [11] and triterpenes [12].
One advantage of the expert system SISTEMAT over others is that the concept of the carbon skeleton was built into the system when the databases were created, so that the number of structural proposals can be reduced during structure generation. Thus, once a database of 13 C NMR data (basically chemical shifts and multiplicities) has been created, the system can characterize and identify most existing carbon skeletons through a set of 13 C NMR chemical shifts chosen to identify a given skeleton or a given part of a skeleton, i.e., a substructure. Identifying substructures and skeletons from the 13 C NMR data of natural products acts directly on the structure generator, which then produces fewer structural proposals for a given data set. Therefore, the combinatorial explosion problems observed in other systems [1-3,13-15] may be avoided. Those systems exclude proposals that do not incorporate natural product skeletons only after exhaustively generating all the probable structures. A further advantage of our system is therefore the reduction in computational time.
The objective of this work is to present the 13 C NMR chemical shift ranges that are characteristic of several monoterpene skeletons, that is, the patterns that allow different skeletons and subskeletons to be recognized, and to verify their application in the structure elucidation of novel monoterpenes.

Methodology
In order to create a monoterpene database, the literature was surveyed up to 1997 (later years were reserved for testing the program), and all monoterpenes with reported 13 C NMR data were collected. A total of 1322 substances with their respective data were obtained and entered into the expert system SISTEMAT. These data are compiled in a literature review [16].

The search for heuristic rules
Heuristic rules are practical rules obtained from a specialist's experience, or derived by programs that perform a "machine learning" routine, and are aimed at solving a specific problem. In the SISTEMAT system, the search for these rules is done through the TIPCARB [11,16] and PICKUP [9,11,16] programs. The TIPCARB program can determine which carbon atoms are present at each position of a skeleton. This information helps in the search for heuristic rules because it defines whether or not the skeleton is substituted and the kind of substituents. This could also be done manually by careful analysis of the literature, but the huge volume of data makes a manual search for heuristic rules unfeasible.
After the position of each carbon atom and the types of substituents have been defined, these fragments, called substructures, are coded in the PICKUP program [16], which searches the database for the 13 C chemical shift range of each carbon in the substructure. Once the chemical shift ranges have been estimated, they are evaluated for their degree of recognition against the complete database, which allows one to state that a certain group of chemical shifts indicates, with a certain probability, the occurrence of a substructure in the compound. In summary, the TIPCARB program indicates which substructure should be selected, and the PICKUP program obtains the chemical shift ranges of its carbon atoms and the degree of recognition of these shifts within the database. This procedure had already been used successfully to obtain the chemical shift ranges of eudesmane sesquiterpenes [9] and of diterpenes [11], and in this work we tested its efficiency in characterizing monoterpene skeletons. We found that the system often lists substances from other skeletons that fit within the specified 13 C NMR range; however, most of these skeletons do not have the same types of carbon atoms. This can be observed, for example, in the menthane and pinane skeletons (Fig. 1). The number of quaternary and methine carbons clearly differs between the two skeletons. Hence, this kind of datum can be used to distinguish two skeletons and can therefore increase the percentage of recognition of one skeleton relative to the other. The solution found was to use the REGRAS program.

Table 1
Chemical shift ranges utilized for disfunctionalization of the 13 C NMR data
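The range search and the degree of recognition described in this subsection can be illustrated with a short sketch. It is not the PICKUP implementation: the function names, the four-compound database, the shift values, and the definition of the degree of recognition (the share of database compounds fitting the ranges that really belong to the target skeleton) are all assumptions made for the example.

```python
from typing import Dict, Tuple

Shifts = Dict[int, float]          # carbon position -> chemical shift (ppm)
Range = Tuple[float, float]        # (lowest shift, highest shift)

# Hypothetical mini-database: compound -> (skeleton label, assigned shifts).
# Labels and shift values are invented for illustration only.
DATABASE: Dict[str, Tuple[str, Shifts]] = {
    "cpd1": ("skeleton_A", {1: 30.1, 2: 45.0}),
    "cpd2": ("skeleton_A", {1: 31.5, 2: 47.2}),
    "cpd3": ("skeleton_B", {1: 30.8, 2: 46.0}),
    "cpd4": ("skeleton_B", {1: 52.0, 2: 20.3}),
}


def shift_ranges(skeleton: str, db: Dict[str, Tuple[str, Shifts]]) -> Dict[int, Range]:
    """Lowest and highest shift per carbon position over one skeleton's compounds."""
    ranges: Dict[int, Range] = {}
    for label, shifts in db.values():
        if label != skeleton:
            continue
        for pos, delta in shifts.items():
            lo, hi = ranges.get(pos, (delta, delta))
            ranges[pos] = (min(lo, delta), max(hi, delta))
    return ranges


def fits(shifts: Shifts, ranges: Dict[int, Range]) -> bool:
    """True when every ranged position has a shift inside its range."""
    return all(pos in shifts and lo <= shifts[pos] <= hi
               for pos, (lo, hi) in ranges.items())


def recognition(skeleton: str, db: Dict[str, Tuple[str, Shifts]]) -> float:
    """Percentage of compounds fitting the ranges that belong to the skeleton."""
    ranges = shift_ranges(skeleton, db)
    hits = [label for label, shifts in db.values() if fits(shifts, ranges)]
    return 100.0 * hits.count(skeleton) / len(hits)


if __name__ == "__main__":
    print(shift_ranges("skeleton_A", DATABASE))
    print(round(recognition("skeleton_A", DATABASE), 1))
```

In this toy database one compound of another skeleton also falls inside the ranges, so the recognition is below 100%. This is the situation the text describes, and it is why the carbon-type information is brought in as a second criterion.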

The REGRAS program
In its initial version [17], the REGRAS program performed the disfunctionalization of the 13 C NMR spectrum of a substance in order to propose, at the end of the analysis, the number of quaternary, methine, methylene and methyl carbons in the skeleton of the substance in question. After obtaining these data, the program matches the types of carbon atoms found against a database containing this information for all the skeletons of a given chemical class, thus obtaining the likely skeletons of the substance. This type of analysis restricts the number of probable skeletons proposed but often cannot determine which skeleton the substance has, since different skeletons may present the same types of carbon atoms.
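The matching of carbon types against a skeleton table can be sketched as follows. This is a minimal illustration, not the REGRAS code: the skeleton labels, the counts in `SKELETON_TYPES`, and the function names are hypothetical, and the spectrum is assumed to be already disfunctionalized.

```python
from collections import Counter
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int, int]   # (quaternary, methine, methylene, methyl)

# Multiplicity letters: s = quaternary C, d = CH, t = CH2, q = CH3
TYPE_OF = {"s": "C", "d": "CH", "t": "CH2", "q": "CH3"}

# Hypothetical skeleton table; labels and counts are invented for illustration.
SKELETON_TYPES: Dict[str, Counts] = {
    "skeleton_A": (0, 3, 4, 3),
    "skeleton_B": (1, 3, 3, 3),
    "skeleton_C": (0, 3, 4, 3),
}


def carbon_type_counts(multiplicities: List[str]) -> Counts:
    """Count quaternary, methine, methylene and methyl carbons."""
    c = Counter(TYPE_OF[m] for m in multiplicities)
    return (c["C"], c["CH"], c["CH2"], c["CH3"])


def candidate_skeletons(multiplicities: List[str],
                        table: Dict[str, Counts]) -> List[str]:
    """Skeletons whose carbon-type counts equal those of the spectrum."""
    counts = carbon_type_counts(multiplicities)
    return sorted(name for name, ref in table.items() if ref == counts)


if __name__ == "__main__":
    spectrum = list("dddttttqqq")    # 3 CH, 4 CH2, 3 CH3
    print(candidate_skeletons(spectrum, SKELETON_TYPES))
```

With this invented table, two skeletons share the same counts and both are returned, which mirrors the limitation noted above: carbon types alone narrow the list but do not always single out one skeleton.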
Because the search by characteristic 13 C NMR ranges lists skeletons with different types of carbon atoms, we implemented in a new version of the REGRAS program the chemical shift ranges obtained for each skeleton. In an analysis, the 13 C NMR spectral data must first be disfunctionalized [17] according to the data presented in Table 1, which yields the types of carbon atoms existing in the substance. From these data, the program selects the skeletons that meet these requirements, and the search by chemical shift ranges follows. By combining these data, skeletons that fall within the 13 C NMR range but show types of carbon atoms incompatible with the skeleton of the substance in question are not chosen. Thus, the problem initially found is solved.
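A rule-based disfunctionalization step of this kind might look like the sketch below. The rule structure follows the column headings of Table 1 (chemical function, initial multiplicity, final multiplicity, chemical shift range), but the ranges and function labels in `RULES` are placeholders chosen for the example, not the values of Table 1, and the function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    function: str   # chemical function responsible for the shift
    initial: str    # multiplicity observed in the spectrum
    final: str      # multiplicity on the bare (disfunctionalized) skeleton
    low: float      # lower limit of the shift range (ppm)
    high: float     # upper limit of the shift range (ppm)


# Placeholder rules: the ranges below are assumptions, not Table 1 data.
RULES: List[Rule] = [
    Rule("C-O (placeholder)", "s", "d", 65.0, 90.0),
    Rule("C-O (placeholder)", "d", "t", 65.0, 90.0),
    Rule("C=O (placeholder)", "s", "t", 190.0, 220.0),
]


def disfunctionalize(peaks: List[Tuple[float, str]],
                     rules: List[Rule]) -> List[str]:
    """Return the skeleton multiplicity of each peak after applying the rules."""
    result = []
    for delta, mult in peaks:
        for rule in rules:
            if rule.initial == mult and rule.low <= delta <= rule.high:
                mult = rule.final      # first matching rule wins
                break
        result.append(mult)
    return result


if __name__ == "__main__":
    peaks = [(72.0, "s"), (210.0, "s"), (25.0, "q")]
    print(disfunctionalize(peaks, RULES))
```

The output multiplicities can then be counted to give the numbers of quaternary, methine, methylene and methyl carbons used in the skeleton search.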
In summary, the REGRAS program has two analysis steps. The first is a "rough" but fundamental search by means of 13 C NMR data disfunctionalization, in which skeletons that show a given set of carbon atom types are selected. In the second step, the program performs a "fine" search among the previously selected skeletons through the characteristic chemical shift ranges of each skeleton or subskeleton.
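The two steps can be put together in a small sketch. It is only an illustration under stated assumptions: the reference data in `SKELETONS` are invented, and the score used in the fine search (the percentage of a skeleton's characteristic ranges matched by at least one observed shift) is an assumed stand-in for the probability that REGRAS reports, whose exact formula is not given here.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int, int]   # (quaternary, methine, methylene, methyl)
Range = Tuple[float, float]          # (low, high) in ppm

# Hypothetical reference data: carbon-type counts and characteristic ranges.
SKELETONS: Dict[str, Tuple[Counts, List[Range]]] = {
    "skeleton_A": ((0, 3, 4, 3), [(20.0, 24.0), (30.0, 36.0), (40.0, 46.0)]),
    "skeleton_B": ((1, 3, 3, 3), [(20.0, 24.0), (38.0, 42.0)]),
    "skeleton_C": ((0, 3, 4, 3), [(10.0, 15.0), (50.0, 58.0)]),
}


def rough_search(counts: Counts,
                 skeletons: Dict[str, Tuple[Counts, List[Range]]]) -> List[str]:
    """Step 1: keep skeletons with the same carbon-type counts."""
    return [name for name, (ref, _) in skeletons.items() if ref == counts]


def fine_search(shifts: List[float], candidates: List[str],
                skeletons: Dict[str, Tuple[Counts, List[Range]]]) -> Dict[str, float]:
    """Step 2: score each candidate by the share of its ranges that are matched."""
    scores = {}
    for name in candidates:
        ranges = skeletons[name][1]
        matched = sum(any(lo <= d <= hi for d in shifts) for lo, hi in ranges)
        scores[name] = 100.0 * matched / len(ranges)
    return scores


def analyse(counts: Counts, shifts: List[float],
            skeletons: Dict[str, Tuple[Counts, List[Range]]] = SKELETONS) -> Dict[str, float]:
    """Run the rough search, then the fine search on the survivors."""
    return fine_search(shifts, rough_search(counts, skeletons), skeletons)


if __name__ == "__main__":
    shifts = [21.5, 22.0, 23.1, 27.0, 28.4, 31.0, 33.2, 34.5, 43.0, 44.8]
    print(analyse((0, 3, 4, 3), shifts))
```

Running the rough search first means that a skeleton whose ranges happen to be matched, but whose carbon types are incompatible, never reaches the fine search.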
Figure 2 shows the flow chart of the REGRAS program. After the 13 C NMR spectral data of the monoterpene with a menthane skeleton isolated from Sphaeranthus suaveolens [18] and displayed in Fig. 3 are entered, the REGRAS program disfunctionalizes the spectrum of the substance. The 13 C NMR data belonging to the substituents present were previously identified and removed through the MACRONO program [19,20]. At this point of the analysis, the program shows, for the monoterpene in Fig. 3, the skeleton probabilities given in Table 2 (the proposed skeletons are presented in Fig. 4) and asks whether the user wants to continue the analysis and perform the search by characteristic 13 C NMR ranges. In this case, we continued the analysis, and the program displayed the following result: Menthane skeleton: 100.0%.
The program can also search by subskeletons. For the monoterpene in Fig. 3, this analysis was carried out with the characteristic chemical shift ranges obtained by the PICKUP system. The REGRAS program proposed the following subskeleton for the substance: Menthane [1EN; 3OR; 6OXO] - 100.0%, which corresponds to a substructure present in the monoterpene of Fig. 3.

Results
The chemical shift ranges obtained by the PICKUP system for monoterpene skeletons are presented in Table 3.
To test the validity of these ranges, 30 monoterpenes (Fig. 5) were randomly selected from the literature. Their 13 C NMR data were submitted to the REGRAS program, which proposed skeletons and subskeletons for the substances by disfunctionalizing the 13 C NMR data and then searching through the characteristic chemical shift ranges. The results obtained with the monoterpenes of Fig. 5 are shown in Table 4. As in the example presented previously, the data belonging to the substituents were identified and removed by the MACRONO program.

Discussion of the results
The REGRAS program, which disfunctionalizes 13 C NMR spectra and searches both by the types of carbon atoms present in the skeleton and by the characteristic chemical shift ranges of skeletons and subskeletons (Table 3), achieved a hit rate of 93.3%, having made only two errors, in tests 22 and 25. In test 22, the program disfunctionalized the 13 C NMR spectrum correctly; however, during the search for skeletons, it found only 1-ethylmenthane (Fig. 6) with the required types of carbon atoms, since the 10-norionane skeleton (Fig. 6) is a new skeleton that is not yet included in the database. In this case, therefore, the mistake did not arise directly from the program. In test 25, on the other hand, the REGRAS program gave an erroneous prediction: during disfunctionalization of the 13 C NMR spectrum, the chemical shift at δ 71.3, a singlet assigned to C-8 of the substance, was disfunctionalized according to the chemical shift ranges of Table 1, and this carbon was assigned as a methine carbon in the fully disfunctionalized skeleton. Accordingly, the program found the following types of carbon atoms for the disfunctionalized 13 C NMR spectrum of the substance: 0, 4, 3 and 3 quaternary, methine, methylene and methyl carbons, respectively. On consulting these types of carbon atoms, the program gave wrong skeleton proposals.
As for subskeletons, the REGRAS program made 42 subskeleton proposals for the test substances, and these were correct in 90.5% of the cases. In tests 1, 5 and 20, both correct and incorrect subskeleton proposals were presented, and it is up to the user to discern the most coherent one among them.
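The two percentages quoted in this discussion can be checked with simple arithmetic. The sketch below assumes 28 correct skeleton predictions out of 30 (two errors) and 38 correct subskeleton proposals out of 42. The figure of 38 is not stated in the text; it is inferred here as the only whole number of correct proposals that rounds to 90.5%.

```python
def hit_rate(correct: int, total: int) -> float:
    """Percentage of correct results, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# Skeleton prediction: 30 tests, errors in tests 22 and 25.
skeleton_rate = hit_rate(30 - 2, 30)

# Subskeleton prediction: 42 proposals; 38 correct is an inferred value.
subskeleton_rate = hit_rate(38, 42)

if __name__ == "__main__":
    print(skeleton_rate, subskeleton_rate)
```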

Conclusion
In this paper, we demonstrated that a group of compounds with the same skeleton can be characterized through 13 C NMR chemical shift ranges. When these are combined with information about the types of carbon atoms in a given skeleton, the selectivity and reliability of the results increase appreciably.
The REGRAS program proves to be a valuable tool for assisting researchers in determining new monoterpene structures. It will henceforth be incorporated into the set of programs of our expert system SISTEMAT so that its applicability can be extended to other classes of natural products.
It is worth noting that the skeletons proposed by the REGRAS program, as well as the substructures it supplies, will be used as constraints for the structure generator now being built, since the latter can start structure generation from a structural fragment that has already been found. Unlike other systems, the REGRAS program will run without combinatorial explosion. Another important point is that our system runs on ordinary PCs, whereas the other systems cited in this study need powerful computers for such procedures. Our system is therefore more accessible and can be used by any user.

Bornane [-] - 81.8%
Solvents: CD3OD: I-IV, IX, XV-XXII; D2O: V-VI; CDCl3: VII-VIII, X-XIII, XXIII-XIV, XXVII-XXIX; C5D5N: XIV, XXV-XXVI, XXX.
Table 1 (13 C NMR data), column headings: Chemical function; Initial multiplicity; Final multiplicity; Chemical shift ranges
C=O

Fig. 4. Skeletons proposed at the first stage of the analysis by the REGRAS program.

Table 2
Skeleton probability shown by the REGRAS program in the first stage of the analysis

Table 3
Characteristic chemical shift ranges of some monoterpene skeletons and subskeletons
Column headings: Subskeleton; No. C; 13 C NMR shifts range; % Recognition
Table 3 (continued): Necrodane; Ionane; Isocamphane