A new program for 13C NMR spectrum prediction based on tridimensional models

This work describes a new program, written for the Windows environment, for 13C NMR spectrum prediction based on interatomic distances and atom types. It has proved more effective at predicting natural product spectra than some commercial programs such as ACD and DENDRAL. Predictions of spectra of molecules with great conformational variation, such as sesquiterpene lactones, are given.


Introduction
The development of new spectrometric techniques and the greater ease of obtaining new data have, in recent years, generated an increasing need to process these data quickly in structure determination. In this sense, the computer has become an excellent tool for increasing the productivity of chemists and spectroscopists in structure elucidation.
Prediction of 13C NMR spectra through computational methods is generally based on a relationship between a substructure and a subspectrum. When such a relationship exists, one can infer the presence of molecular fragments. Nowadays, there are many programs that can be used to generate all the structures compatible with the substructural requirements [1-4].
One of the earliest expert systems created for computer-assisted structure determination was DENDRAL [5,6]. It comprises various programs developed to help the user explain chemical and spectral data, generate isomers and stereoisomers, evaluate them, and plan experiments.
Within this system, the program used for 13C NMR spectrum prediction is named PRED-CHECK [7]. Its algorithm takes chemical shifts proposed by the user for the atoms of a structure and compares them with the shifts observed for atoms belonging to similar substructures already included in the database. In the DENDRAL system, the substructural representation covers molecular constitution and stereochemical configuration within a radius of four bonds; however, it does not incorporate conformational factors. For this reason, the predicted spectra are imprecise, and shift permutations can occur.
The DARC/EPIOS [2] system mainly uses 13C NMR spectroscopy and molecular formulae. In this system, the chemical shift of a resonating carbon is associated with a chemical environment that represents the resonant atom within a radius of four chemical bonds. A chemical shift is attached to this carbon atom, but no molecular conformations are considered besides those implicit in configuration specifications such as, for example, the ring junction in a rigid system. When the spectrum of a given sample is supplied, the program searches the data bank for the fragments that satisfy all the chemical shifts.
The ACCESS [3] system has an encoding similar to that of DARC/EPIOS, and a data bank ten times larger. A database of substructures associated with 13C NMR chemical shifts is used to select the substructures compatible with the compound in question and then to predict its spectrum. The theoretical and experimental spectra are then matched, and a similarity index is assigned to them.
One of the newest spectroscopic analysis systems is SpecInfo [4], developed by BASF. This system encodes substructures by characterizing the environment of the atoms, that is, describing their chemical environment in three concentric spheres. The fragments in the data bank are associated with spectral patterns. SpecInfo has a large database with NMR spectra of 13C and other elements, as well as IR and mass spectra.
In natural product studies, the definitions of chemical class and carbon skeleton are important constraints for structure determination. This information can be obtained by analysing the spectra of the substance as well as data on its botanical origin (family, genus, species, etc.). The botanical origin, which may help the natural product chemist in structure determination, has never been used in expert systems because none of them was specifically developed for structure determination of substances of natural origin, although they have been tested with some success in the elucidation of this type of substance.
In 1988, the construction of an expert system named SISTEMAT [8,9] was initiated at the Instituto de Química da Universidade de São Paulo. It was designed for structure determination of natural products and for chemotaxonomic and evolutionary studies of plants. SISTEMAT can handle data from several spectrometric techniques as well as data on the botanical origin of a sample, in order to determine the class and carbon skeleton of a given substance present in that sample. This distinguishes SISTEMAT from other recently developed systems [10-13].
To evaluate the structures created by structure determination programs, a structure generator program is being developed. The predicted spectra and those observed for the sample can then be compared. The comparison is based on the number of signals, chemical shifts and multiplicities. In contrast with the complex patterns of IR and 1H NMR spectra, the data from 13C NMR spectra arise as characteristic signals from each magnetically different carbon atom. 13C NMR can potentially discriminate structural isomers, since the chemical shifts are sensitive to the chemical environment of each carbon atom.
One of the most recent programs, CSEARCH [14], uses a data bank of fragments centered on each carbon atom with five concentric spheres. These fragments are correlated with the chemical shift of each carbon atom. Each generated fragment carries a chemical shift range and a most probable value for its central carbon atom. To generate a new spectrum, the sample structure is first divided into fragments, one for each carbon atom and extending, where possible, to five spheres of neighboring atoms. The central carbon atom of each fragment is then compared with those of the fragments in the data bank and assigned the chemical shift corresponding to the largest equivalent fragment found in the bank.
ChemWindow [15], release 3, also includes a module for predicting 13C NMR spectra. It can distinguish geometric isomers. The average error in the chemical shift of each carbon atom is 4.3δ, although errors of up to 21-23δ may be found.
As spectrum prediction rests mainly on a correlation between substructures and subspectra, ab initio theory has been applied to the generation of tridimensional fragments [11]. However, the equations needed to deal with the structures are very complex and require an unreasonable computational time.
Semi-empirical chemical methods have been briefly evaluated [16,17] but never widely used.
In the known programs, the distances between atoms of the substructures are measured by the proximity of the atoms counted bond by bond. In 1989, Gastmans et al. [18] proposed a new encoding method based on the absolute interatomic distance and the relative orientation of the atoms. This program runs on personal computers and provides options that allow the theoretical spectrum to be predicted and the signals to be assigned.
In certain chemical classes of compounds such as the sesquiterpene lactones, which can have rings of up to ten carbon atoms with widely varied substitution, there is great conformational variety. Therefore, a substituent can alter the chemical shifts of carbons that are very far away in terms of carbon-carbon bonds.
To study conformational effects and those caused by substituents, a 13C NMR spectrum simulation program was developed for this work. It is based on the tridimensional chemical environment of each fragment, described not only by tracking chemical bonds but also by defining spheres centered on the resonant atom. All the other atoms within these spheres are considered neighbors, and a 13C NMR chemical shift range is associated with this fragment. The program allows the theoretical spectrum to be predicted and the chemical shifts of known experimental spectra to be assigned to specific carbon atoms.
The aim of this work was to build a system that, using only microcomputers, could predict 13C NMR spectra of natural product structures taking conformational effects into account. The main difference between our system and the others mentioned here is that the former uses a description of the chemical environment of a given atom, whereas the latter use additivity rules for the prediction of 13C and 1H NMR chemical shifts [15,19-21].

Methodology
In order to study conformational effects and to build a highly reliable data bank, a program for predicting 13C NMR spectra was developed using tridimensional substructures produced by the HyperChem program [22], developed at Hypercube Inc. Autodesk. HyperChem allows molecular mechanics calculations to be carried out for a given chemical structure.
The sesquiterpene lactones with 13C NMR spectra reported in the literature were drawn in the HyperChem program, and their geometries were optimized with the MM+ method using the Polak-Ribiere algorithm to a gradient of 0.01 kcal/(Å·mol) in vacuum. These structures were saved in *.hin format. This method was chosen because it applies the force field to all atoms simultaneously and can therefore calculate non-bonding interactions and modify bond lengths in order to avoid repulsions. It has the advantage of being quite fast for organic molecules that are not very large, and the observed errors are negligible considering the size of the ranges used in the prediction of 13C NMR spectra. Therefore, building and extending the data bank, as well as querying it for a given sample, can be very fast.
The data bank was built with a program called SIMULATE, which uses the facilities of the Microsoft Visual FoxPro database manager [23]. For the program described here, a data bank was built from more than 26,000 records of 13C NMR chemical shifts from nearly 1,600 sesquiterpene lactones published in the literature up to 1995 [24].

The SIMULATE program
The SIMULATE program is easy to install and runs on PCs; it occupies only 25 MB of hard disk space and needs 16 MB of RAM. The system is self-explanatory, and its user interface is quite simple. For each compound entered by the user, the program explains what must be done to accomplish the task.
The substructure encoding is based on interatomic distance. Each atom is assigned a number according to a code similar to that used in SISTEMAT [25] (Table 1), which defines the type of the resonant atom. The description of its environment is not made by tracking chemical bonds but through the definition of four spheres centered on the resonant atom.
The distances between the atoms are calculated, and a chemical environment is associated with each resonant carbon. This chemical environment is divided into four arbitrary ranges: 0-1.73 Å, 1.73-2.86 Å, 2.86-4.25 Å and 4.25-4.75 Å.
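The range assignment described above can be illustrated with a short Python sketch. It is not the SIMULATE code (which was written in Visual FoxPro): the function names, the tuple layout of an atom and the treatment of a distance lying exactly on a boundary (assigned here to the inner range) are assumptions made for illustration only.

```python
import math

# Upper bounds (in angstroms) of the four ranges quoted in the text.
RANGE_LIMITS = (1.73, 2.86, 4.25, 4.75)


def range_index(distance):
    """Return 1-4 for the range containing `distance`, or None beyond 4.75 A.

    Assumption: a distance exactly on a boundary belongs to the inner range.
    """
    for index, upper in enumerate(RANGE_LIMITS, start=1):
        if distance <= upper:
            return index
    return None


def build_environment(center, atoms):
    """Group the neighbours of the resonant atom `center` by range.

    `atoms` is a list of (type_code, (x, y, z)) tuples (hypothetical layout);
    `center` must be one of the objects in that list.
    Returns {range: sorted list of (type_code, distance rounded to 0.01 A)}.
    """
    environment = {1: [], 2: [], 3: [], 4: []}
    for atom in atoms:
        if atom is center:
            continue  # the resonant atom is not its own neighbour
        distance = math.dist(center[1], atom[1])
        index = range_index(distance)
        if index is not None:
            environment[index].append((atom[0], round(distance, 2)))
    for neighbours in environment.values():
        neighbours.sort()
    return environment
```

For a resonant carbon with a type 02 atom at 1.55 Å and a type 03 atom at 1.56 Å, both neighbours fall in the first range, as in the example quoted for carbon 1.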
The program also supplies the number of each stereocode, which allows one to look it up in the data bank and to know from which structure it was obtained. Moreover, it provides a table with the atoms found in each range and the distance at which each of them is located. Table 3 shows the ranges for the first 15 carbons of the substance presented in Table 2. For example, carbon 1 has in the first range an atom of type 02 (-CH2-) at a distance of 1.55 Å and an atom of type 03 (-CH<) at a distance of 1.56 Å, and so on. The program distinguishes all the different codes of the individual carbon atoms, counts the number of occurrences of each in the main file and calculates the maximum and minimum values, as well as the average and the standard deviation, of the chemical shifts for each code.
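The per-code statistics mentioned above can be sketched as follows. The function name and the record layout are hypothetical, and the text does not say whether the population or the sample standard deviation is used; the population form is assumed here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_shifts(records):
    """Summarize chemical shifts per stereocode.

    `records` is an iterable of (code, shift) pairs (hypothetical layout).
    Returns {code: {"count", "min", "max", "mean", "stdev"}}.
    """
    grouped = defaultdict(list)
    for code, shift in records:
        grouped[code].append(shift)

    summary = {}
    for code, shifts in grouped.items():
        summary[code] = {
            "count": len(shifts),
            "min": min(shifts),
            "max": max(shifts),
            "mean": mean(shifts),
            # Assumption: population standard deviation.
            "stdev": pstdev(shifts),
        }
    return summary
```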
When the "new data" option is chosen in the SIMULATE program, the carbons of the structure appear on the computer screen so that their 13C NMR chemical shifts can be typed in.
To predict the theoretical 13C NMR spectrum of a given substance, the structure must be drawn in the HyperChem program and then submitted to geometry optimization with the same parameters used in building the data bank. This structure must be saved in *.hin format. The "new sample" option of the SIMULATE program must then be chosen; it constructs the stereocode of each carbon of the sample and searches the file for similar codes for each carbon atom of the sample. If an equivalent code is not found up to the fourth range, the program searches up to the third range, and so on, until it locates an identical code. The fine search then begins: if the program has found an equivalent code only up to the third range, it searches within the fourth range for a partially equivalent code, i.e., one that need not be equal up to 4.75 Å but may be equal, for example, up to 4.60 Å.
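The coarse search with fallback from the fourth range down to the first can be sketched as below. This is a minimal illustration under stated assumptions: a stereocode is represented as a tuple of four per-range descriptions (innermost first), the predicted shift is taken as the mean of the matching records, and the fine search within the last range is omitted. None of these details is confirmed by the text.

```python
from statistics import mean


def predict_shift(sample_code, database):
    """Return (predicted shift, matched depth), or (None, 0) if nothing matches.

    `sample_code` is a tuple of four per-range descriptions, innermost first;
    `database` is a list of (code, shift) pairs in the same representation.
    Both layouts are hypothetical.
    """
    # Try the full code first, then progressively drop the outermost range.
    for depth in range(len(sample_code), 0, -1):
        matches = [
            shift
            for code, shift in database
            if code[:depth] == sample_code[:depth]
        ]
        if matches:
            # Assumption: the prediction is the average of the matching shifts.
            return mean(matches), depth
    return None, 0
```

The returned depth corresponds to the range up to which an equivalent code was found, which is the quantity reported for each carbon in the result tables.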
The SIMULATE program also has the following options: (1) results: provides the results of all samples in the file; (2) deldata: removes from the data bank all the codes of structures typed incorrectly; (3) delsample: removes incorrect samples from the file; (4) exit: ends the program.
The great advantage of this method is the following. The stereocodes are defined out to a large distance, so they include a chemical environment that may exert only a small influence on the chemical shifts and they give rise to a large number of substructures. Even so, these substructures do not become useless for prediction by never being repeated: when an identical code is not found up to the fourth range, the program itself searches for an identical one up to the third, second and first range, until it finds a similar code.
Because the 13C NMR spectrum is predicted from fragments corresponding to each carbon atom and its chemical environment, the prediction of the chemical shift of a carbon atom in a given structure can use parts of entirely different structures, provided they have a similar close vicinity.
The stereocodes cover a sufficiently large chemical vicinity, whose more distant atoms may exert only a slight influence on chemical shifts, giving rise to a large number of substructures that can be used for total or partial spectrum prediction. As a result, the data bank need not be very extensive. As the SIMULATE program uses relatively large distance ranges, there is no problem of erroneous assignment of chemical shifts to, e.g., methyls in α or β positions, since the chemical environments of the former and the latter differ from the fourth range on.
The system also has a data correction program. As soon as the chemical shifts to be added to the data bank are entered, the program matches them with those already in the bank. If new data differ greatly in chemical shift from the values stored for the same fragment (beyond ±2.0δ of the lower or upper limits), the program asks for a revision and a confirmation of the chemical shift value.
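A possible reading of this consistency check is sketched below. It assumes that a new shift is flagged when it lies more than 2.0δ below the minimum or above the maximum already stored for the same code, and that a code with no stored values is never flagged; the function and constant names are hypothetical.

```python
TOLERANCE = 2.0  # tolerance in delta units, as quoted in the text


def needs_review(new_shift, stored_shifts, tolerance=TOLERANCE):
    """Return True if `new_shift` should be confirmed by the user.

    Assumption: the allowed window extends `tolerance` below the stored
    minimum and above the stored maximum for the same fragment code.
    """
    if not stored_shifts:
        return False  # nothing to compare against yet
    lower = min(stored_shifts) - tolerance
    upper = max(stored_shifts) + tolerance
    return not (lower <= new_shift <= upper)
```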

Results and discussion
The data for fifty-seven sesquiterpene lactones published in the journal Phytochemistry from January 1998 to February 1999 were submitted to the SIMULATE program. Their structures were drawn in the HyperChem program and then optimized according to the methodology described previously.
With the INPUT SAMPLE option of the main menu, the same procedure as for building the data bank is followed: the stereocodes of each atom of the sample are assembled, and the program then searches for the equivalent codes for each atom, which are associated with chemical shifts.
To illustrate the performance of the program, we present here fifteen of the tests carried out.
In tests 1-11, the same structures were also submitted to the commercial ACD/CNMR program for 13C NMR spectrum simulation (Advanced Chemistry Development Inc.) [26], whereas in tests 12-15 the results are compared with those obtained by the DENDRAL system for sesquiterpene lactones [27]. The results are shown in Table 4, where: num is the number of the atom in the biosynthetic numbering; type is the type of atom (the lines referring to heteroatoms were removed); exper is the chemical shift reported in the literature; simulate is the chemical shift supplied by the SIMULATE program; dif1 is the difference between the chemical shifts from the literature and those from SIMULATE; range is the difference between the minimum and maximum shifts found in the database for the carbon studied; match is the distance at which the most distant atom belonging to the substructure is found; ACD/DENDRAL is the chemical shift predicted by the ACD or DENDRAL program; dif2 is the difference between the chemical shifts from the literature and those predicted by the ACD or DENDRAL program; code is the assigned atom code, following the arbitrary numbering in Table 1.
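The dif1 and dif2 columns, and average and maximum errors of the kind reported for each program, can be computed from the tabulated shifts as sketched below. The sketch assumes that the average error is the mean of the absolute differences, which the text does not state explicitly; the function name is hypothetical.

```python
def error_summary(experimental, predicted):
    """Return (signed differences, mean absolute error, maximum absolute error).

    `experimental` and `predicted` are equal-length sequences of chemical
    shifts for the same carbons, in the same order.
    """
    diffs = [e - p for e, p in zip(experimental, predicted)]
    abs_diffs = [abs(d) for d in diffs]
    # Assumption: "average error" means the mean absolute difference.
    return diffs, sum(abs_diffs) / len(abs_diffs), max(abs_diffs)
```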
The SIMULATE program showed an average error of 2.4δ and a maximum error of 11.9δ, whereas the ACD program showed an average error of 4.49δ and a maximum of 25.7δ. The substance analyzed in test 1 is very similar to those in the data bank. Consequently, the error is very small, from 0.0δ to 3.6δ, and the predictions are generally made with matches up to the third range, although matches up to the fourth range also occur.
In the other cases in which the sample is similar to structures stored in the data bank, as in tests 8 and 9, the predicted chemical shifts closely approach the literature data, using substructures that reach the third and fourth ranges with a very small error. However, for a sample with the pinguisane skeleton, for example, which is not stored in the data bank, the prediction is not as good and uses substructures that reach only the second range, as in test 2. In this test the errors of the SIMULATE program range from 1.3δ to 7.5δ, and the prediction was made up to the second range. For several carbon atoms the ACD prediction was better, with an error of up to 0.5δ. For others, however, the error was much larger, reaching up to 21.9δ.
The prediction of spectra of substances with the guaianolide skeleton has shown that the ACD program generally gives a very large error at carbon 5 (see tests 1, 3, 4 and 6), whereas SIMULATE predicts it with a good approximation. This may be because carbon 5 is a ring junction and is rather close to the lactone ring. When conformational effects are taken into account, the absolute distance of carbon 5 from the lactone ring is seen to be even smaller. In test 6 the same error occurred for carbons 5 and 1; carbon 1 is also a ring junction and has a double bond at the vicinal carbon [10(14)], and, in test 3, a double bond [1(10)] conjugated to a carbonyl at carbon 2.
The ACD program, perhaps because it lacks similar structures in its data bank or does not consider conformational effects on the chemical shifts, is not able to detect these differences. The SIMULATE data bank contains data on substances with the same substitution pattern. Hence, when the geometry of the sample is optimized, the program can find in the data bank substances that exhibit identical conjugation and γ-effects, leading to the correct prediction.
Test 10 involved a substitution pattern not present in the data bank. There was no record of an epoxidized lactone ring at C-7 and C-8, where lactonization usually occurs. Therefore, the prediction was made only up to the second range.
In addition, the program detected some mistakes in chemical shifts reported in the literature. In test 5, the assignment of carbons 14 and 15 is inverted, as can be verified from the data provided by the ACD program. In test 11, one can find a literature error probably due to typing: carbon 11, which should have a chemical shift of 126.2δ, is reported as 162.2δ, and carbon 21, which should have a chemical shift of 126.4δ, is reported as 162.4δ. In test 13, the DENDRAL system gives the chemical shift values of C1-C5 and C2-C6, respectively, as interchangeable. The SIMULATE program, on the other hand, was able to predict these chemical shifts without interchangeable values. These assignments were confirmed through TAI acylation by Budesinsky and Saman [26]. These chemical shifts were predicted correctly because the SIMULATE data bank was built on the correct carbon assignments from the literature and on tridimensional structures generated by the program.
In test 15, where the DENDRAL system obtained its best results by assigning all chemical shifts correctly, the SIMULATE program also predicted all of them correctly. For some carbon atoms, the SIMULATE program showed smaller errors than the DENDRAL system.

Conclusion
In this work, we have attempted to design an expert system that, running on microcomputers with a relatively limited data bank, is able to predict 13C NMR spectra taking into account the conformations of complex natural products. The new code described here is detailed enough to consider all the atoms that can influence the resonant carbon, without the danger of using substructures so detailed that they are useless for prediction. From the results of the parallel tests described, we conclude that, for sesquiterpene lactones, the SIMULATE program predicts 13C NMR spectra more accurately than other programs such as DENDRAL and ACD. This indicates that the code, which uses the conformational description and the chemical environment of the atoms, is more reliable for the prediction of 13C NMR data. This is the principal difference between the system described here and the others developed to date [15,19-21].
As the data bank grows, we will test this new program on other classes of natural products, in order to confirm its effectiveness in the prediction of 13C NMR data. We hope to do so successfully.

Fig. 2. Compounds used to test the programs.

Table 1
Chemical grouping codes utilized by the program

Table 2
Results supplied by the program for the eudesmanolide of Fig. 1. Sample: name of the sample to be analyzed; num: number of the atom in the structure drawing; type: type of atom (C, O, N, etc.); range: up to what range the program found an equivalent code; match: the distance at which the most distant atom of the substructure is found; δ exper: 13C NMR shift in the literature; δ calc: 13C NMR shift supplied by the program; code: the code of the atom according to Table 1.

Table 3
Atom types and distances at the four levels of codification used by the program

Fig. 1. Eudesmanolide used as a standard sample in the SIMULATE program.

Table 4
Results obtained by the SIMULATE and ACD/DENDRAL programs for the compounds shown in Fig. 2

Table 5
Summary of the results shown in Table 4