Developing a Suitable Model for Water Uptake for Biodegradable Polymers Using Small Training Sets

Prediction of the dynamic properties of water uptake across polymer libraries can accelerate polymer selection for a specific application. We first built semiempirical models using Artificial Neural Networks and all water uptake data, as individual input. These models give very good correlations (R 2 > 0.78 for test set) but very low accuracy on cross-validation sets (less than 19% of experimental points within experimental error). Instead, using consolidated parameters like equilibrium water uptake a good model is obtained (R 2 = 0.78 for test set), with accurate predictions for 50% of tested polymers. The semiempirical model was applied to the 56-polymer library of L-tyrosine-derived polyarylates, identifying groups of polymers that are likely to satisfy design criteria for water uptake. This research demonstrates that a surrogate modeling effort can reduce the number of polymers that must be synthesized and characterized to identify an appropriate polymer that meets certain performance criteria.


Introduction
Degradable materials are very important in fabricating biomedical devices. After implantation, they do not need to be removed; rather, under ideal conditions, the implant site repairs itself while the device is resorbed [1]. In comparison, nondegradable materials often need to be surgically removed after their purpose has been achieved, thus subjecting the patient to a second surgery that potentially exposes them to more complications [2]. Degradable devices can be used in a broad range of applications such as vascular stents, vascular bypass grafts, bone fixation devices, and soft tissue replacement scaffolds [3].
Degradable biomaterials have a wide range of requirements depending on the particular clinical application. Parameters such as chemical structure, composition, porosity, and device geometry determine surface and bulk properties of an implant, and thus, they are critical to the selection of the material [4].
One important characteristic of degradable biomaterials is their water uptake versus time, as it is crucial for the determination of how long a polymeric device will reside in the body before erosion leads to the ultimate removal of the device from the implant site [5]. Water uptake affects degradation, swelling, mechanical [6], and adhesive properties [7]; also it determines drug stability [8], drug release profile [9], and biological response [10].
Current methods used to measure water uptake versus time are labor intensive and time consuming. Depending on the polymer, water uptake can take days to weeks to equilibrate [11]. There are potentially very large libraries of polymeric biomaterials, which make it impractical to measure these parameters experimentally for each polymer. For example, a virtual library of about 40,000 polymethacrylates has 2 International Journal of Biomaterials been described by Kholodovych et al. [12]. This library would clearly be too large for each polymer to be characterized individually by experimental methods.
Computational modeling is a useful tool to minimize the number of experiments needed to characterize a polymer library [13]. Costache et al. [14], Gubskaya [15], and Le et al. [16] published reviews that include the most relevant models currently available for important parameters in biomaterials such as glass transition temperature ( ), Young's modulus, air-water contact angle, water uptake, and degradation. Serna et al. (2008) built a model of equilibrium water uptake for 12 aromatic polyamides with very similar levels of water uptake (13.9%-19.1%). They found correlations between the amidic hydrogen charge and the water uptake [17].
Although empirical mathematical modeling has been successfully used to model water uptake for different polymers, all models require parameters that can only be obtained through experimentation. Fick's diffusion [18], anomalous Fickian diffusion [19], dual-stage Fick's diffusion [20], power law [18], Weibull equation [21], Langmuir theory [22], and concentration-dependent diffusion coefficient model [23] have been used. Modeling of hydration at the molecular level has been demonstrated using parameters such as free volume redistribution frequency [24], Radial Distribution Functions (RDFs) [25], 3D atomic density maps known as spatial distribution functions [26], and angular distribution functions [27]. Furthermore, from MD simulations, water absorption has been predicted for a single polymer system [28][29][30].
Prior works by Kholodovych et al. [12,31], Smith et al. [32][33][34][35], Gubskaya et al. [36], and Ghosh et al. [37] showed that it is possible to build computational models of polymer properties for an entire library based upon experimental data for a small subset. In these studies, a polymer library is explored using a combined experimental and computational approach, looking for polymers that fulfill a series of design criteria to be suitable for specific applications. Smith et al. [33,34] developed semiempirical models using molecular descriptors obtained from two-dimensional polymer structures (i.e., the descriptors were independent of the polymer conformation). These models were able to predict fibrinogen adsorption within experimental error in 38 out of the 45 polymers and rat lung fibroblast proliferation in 41 out of 48 polymers. Pearson correlation coefficient values for these predictions were 0.54 ± 0.12 and 0.54 ± 0.09, respectively. Gubskaya et al. [36] calculated descriptors from relaxed threedimensional polymeric structures obtained from Molecular Dynamics (MD) simulations of tetramers in vacuum and implicit water. In this work, Decision Tree Analysis and ANNs were used to predict fibrinogen adsorption with a Pearson coefficient of 0.67 ± 0.13. The incorporation of threedimensional descriptors led to important improvements in comparison with previous semiempirical models, increasing the average Pearson correlation coefficient from 0.54 ± 0.12 to 0.67 ± 0.13.
One of the challenges of biomaterials is the change of their interactions and properties over time [38]. However, all aforementioned models study and predict individual values for each polymer. They do not consider dynamic properties that may change over time. Even Le et al. (2013), who built predictions of phase behavior over time, developed the model using each experimental value as a single input, without considering how the phase behavior changes over time [39]. Previously, we built ANN models to accurately predict drug release over time on a family of terpolymers [40] using molecular descriptors. In this study, we develop and compare models for water uptake over time, first using all individual data separately and then using a global parameter for this property.
Our research has two objectives: (i) the development of computational models for water uptake versus time based upon experimental data from a small subset of polymers in a library and (ii) the application of these models to predict water uptake for an entire library of polymers. The main challenge of this research is to model and predict properties that change over time with particular kinetics using a small set of experimental data. As a model system, a library of Ltyrosine-derived polyarylates was used. Kohn and collaborators used this library to discover promising lead polymers for several medical applications [41], such as bone pins [42], hernia repair devices, and an antibacterial sleeve that protects recipients of implanted cardiac assist devices from potentially life-threatening infections [43].
This library, consisting of A-B-type copolymers having an alternating sequence of a diphenol and a diacid [41], was obtained by copolymerizing 14 tyrosine-derived diphenols with 8 aliphatic diacids in all possible combinations resulting in 112 distinct polymers. Changes in polymer backbone or pendent chain length affect polymer properties such as and hydrophobicity. In this study we investigate the effect of polymer backbone and pendent chain on the water uptake profiles of polymer films.

Materials.
A subset of the L-tyrosine-derived polyarylates was synthesized as described previously by carbodiimide-mediated solution polycondensation of a diphenol and a diacid at room temperature [44].

Film Processing.
Polymer films were compression molded and annealed at 5-10 ∘ C above for 20 h before incubation, as described previously [11].

Water Uptake.
Water uptake was obtained for the selected polymers from the L-tyrosine-derived polyarylates combinatorial library (Table 1) using 3 H-labeled water, as described previously [45]. Briefly, films 1 cm in diameter were incubated in 3 H-radiolabeled water (0.2 Ci/mL) at 37 ∘ C.  The " * " symbol indicates the presence of more than one chiral center in the polymer repeat unit. b Molecular weight ( ) was measured by THF-GPC (mean value of three different films ± standard deviation (SD)). c The "‰" symbol indicates the polymers that did not dissolve in THF and, thus, could not be measured, and degradation could not be measured. d Glass transition temperature ( ) was measured by DSC for the dry polymer before pressing.
After 6 h and 12 h and 1, 2, 3, 4, 7, 14, 21, 28, 35, and 42 days, samples were removed from the vial, rinsed with distilled water, blotted dry, and dissolved with 3 mL of tetrahydrofuran (THF) (VWR) and 12 mL of liquid scintillation cocktail (LSC) (Ecolite). Radioactive counts were measured using a scintillation counter (Beckmann 6500), and water content ( 3 H 2 O ) was calculated using a calibration curve. Water uptake (WU) was calculated as the water content relative to the original dry weight ( sample ): (1) Table 2 lists the estimated values for equilibrium water uptake from the experimental measurements; both this parameter and individual water uptake experimental points were used to build surrogate models for water uptake.

Computational Methods.
The data-mining package WEKA (Waikato Environment for Knowledge Analysis) [46] was used in this study. The methodology can be summarized in the following steps ( Figure 1): (i) Polymers were characterized using two-dimensional (2D) descriptors [32] and three-dimensional (3D) descriptors [36].
(ii) Descriptors to build the model were selected using correlation based feature selection (CFS), expectation-maximization (EM) cluster analysis, Decision Tree Analysis, and linear regression.
(iii) Either all water uptake experimental data points over time or equilibrium water uptake was used to build the model using ANNs, using 10% for testing and the rest for training.  [50] and the OPLSall atom force field [51]. Although 3D descriptors obtained from tetramers do not capture the realistic structure of large polymers, they include very important information about their structure, which allows building more accurate models, as shown previously by Gubskaya et al. [36]. Similarly, other authors had previously used monomers [52] or less than 5 repeating monomeric units [53] to obtain molecular descriptors.

Descriptor Selection.
Starting with 2,272 descriptors taken from Gubskaya et al. [36] and Smith et al. [32], a correlation based feature selection (CFS) was used to reduce the dimensionality of the descriptors for each parameter in study. CFS is a function available in WEKA that evaluates the worth of a subset of attributes (descriptors) by considering the individual predictive ability of each feature along with the degree of redundancy between them. As a result, it selects a subset of attributes that are highly correlated with the parameter while removing irrelevant, redundant, and noisy attributes [54]. A genetic search algorithm was used in conjunction with the CFS, allowing a parallel search of the attribute space and avoiding local optima.
For each model, expectation-maximization (EM) [46] cluster analysis was employed to categorize the polymer property of study (i.e., water uptake and equilibrium water uptake) into three classes (i.e., low, medium, and high). When analyzing all data points for water uptake, both time and water uptake values were included in the cluster analysis.
The most significant descriptors were selected using a J48 Decision Tree [55], selecting descriptors that correctly partition the water uptake values and equilibrium water uptake according to the EM cluster analysis. Because Decision Tree Analysis cannot represent relationships between continuous variables, an additional descriptor was selected by linear regression, that is, the highest weight on the linear regression, for the full training set and the experimental values of water uptake. Time was also included as a descriptor for water uptake with all data points.

Artificial Neural Networks.
Linear models are insufficient to capture the complexity of the structure-propertyrelationships between polymer structure and water uptake profiles. Specifically, we observed that water uptake does not yield a simple correlation with the hydrophilic factor, as defined by Todeschini et al. [56] and calculated by Smith et al. [32].
Several authors have shown that an ANN model provides more accurate predictions than a linear model [57][58][59][60][61][62]. A multilayer perceptron (MLP) was used to build ANN models for each parameter with the three descriptors selected as explained in Section 2.3.2. Two hidden layers (nodes) were used. Output nodes were unthresholded linear units [46]. Backpropagation by gradient descent was used as MLP learning method. All input variables were scaled to the unit interval while the learning rate and the momentum applied for updating the weights were 0.3 and 0.2, respectively. Training time was set on 1,000 epochs, which showed to be enough for model convergence. To perform cross-validation, 10% of data was separated as test set in each model, in all possible combinations. Randomization of the initial weights and shuffling of the training data were performed by varying the seed for the random number generator. The model obtained with each seed represents a local optimum, based on the initial weights. Thus, running enough seeds and selecting the best model among them would allow finding the global optimum. For the present models, a hundred ANN models were obtained with different seeds, from which the best model in terms of root mean squared error for the training set was selected. Table 3 summarizes the descriptors selected for both models. One 3D descriptor and five 2D descriptors were selected for the model for all time points; two 3D descriptors and one 2D were selected for the model of WU eq . 2D descriptors include nCt, hydrophilic factor, SMR VSA6, GGI3, MATS3m, and C-003. nCt is the number of tertiary carbon atoms (sp3). The hydrophilic factor is calculated from the number of hydrophilic groups (-OH, -SH, and -NH) of the molecule [63] and it was previously used to predict biological response on this polymer library [34]. SMR VSA6 is a descriptor of subdivided surface area, based on accessible van der Waals surface area of each atom [64], and type of descriptor used before to predict fibrinogen adsorption of this polymer library [35]. GGI3 is a topological charge descriptor; similar topological descriptors have been used to predict biological response on polymethacrylate surfaces [37]. MATS3m is a Moran autocorrelation descriptor, which describes the level of correlation between molecules, and it has been used to study protein interactions [65]. C-003 is the number of CHR3 molecular subfragments, an atom center fragment; it gives information about structural motifs important for the molecular shape and it was used before to predict fibrinogen adsorption on polymethacrylate surfaces [37]. 3D descriptors include G2m and R8p+ in vacuum and Mor25m in water. G2m is a WHIM descriptor, which captures relevant 3D information about molecular size, shape, symmetry, and atom distribution with respect to invariant reference frames [66]. WHIM descriptors were used to predict fibrinogen adsorption on polymethacrylate surfaces [37]. R8p+ is R-GETAWAY descriptor, which accounts for the local aspects of the molecule such as branching, cyclicity, and conformational changes [67].

Descriptors Selection.
Mor25m is a 3D-MoRSE descriptor, which provides structural information of the molecules in the space [68], and it has been suggested that this information is related to the free volume of molecules [69,70] and, thus, responsible for the ability of the polymer to uptake water. 3D-MoRSE and GETAWAY descriptors have been also correlated with the tendency of a molecule to be solvated by water, measured by the hydrophilic index Hy [71], as defined by Todeschini and Consonni [63]. These types of descriptors encode relevant information of this polymer library that gives information of several physical and chemical processes such as water uptake and even in fibrinogen adsorption as discussed by Gubskaya et al. [36]. Table 4 show that correlation coefficient is not the best indicator of model accuracy. Both models present high 2 of training set (>0.92). However, the model using all data presents only 17% or less of predictions within experimental variability, for training and test sets, while the model for WU eq is able to predict 67% for training and 50% for test, within experimental variability.

Model for Water Uptake. Results in
Results of cross-validation have to be analyzed very carefully when using all data points, because they are not independent of each other. In that case, it is likely to select for cross-validation data that belong to polymers for which there is a large data set in the training set. Thus, depending on how the cross-validation set is selected, different results will be obtained. On the other hand, the model for WU eq obtains its values from several experimental measurements of each polymer after its water content is equilibrated. This gives more representative and reliable experimental data, and it captures more information than single points at the same time of incubation. With this, cross-validation that in this case includes only independent values, considering all possible combinations of leave-two-out (10%) of experimental values, gives accurate predictions in 50% of the cases from test sets, and WU eq was correctly classified as high, medium, or low according to the EM cluster analysis previously done, in 83% of the cases. With this, predictions accurately represent the relative order in water uptake of the polymers studied ( Figure 2 and Table 4).
International Journal of Biomaterials 7  This result is less accurate than predictions of simple physical behaviors such as [14], but it is much more accurate than predictions of more complex processes such as fibrinogen adsorption [34,36], cell growth [34], and gene delivery efficiency [72], where the Pearson coefficient for these models was below 0.77.
Some of these predictions would be expected directly from the chemical structure of the polymers, but others would not be easily expected. For example, all DTO polymers would have low water uptake, which is expected from the long pendant chain (8 carbons), while the DTH polymers, with only one carbon less than the DTO, would have water uptake levels from low to medium.

Limitations of Surrogate Modeling.
Limitations of this type of model include the following: (i) it needs experimental data to train the model; (ii) the descriptors give a reference of relevant parameters to the target property, but they cannot explain the mechanism; (iii) experimental measurements must be performed to validate the predictions; (iv) for new polymers outside of the sublibrary, new descriptors must be generated, which is time consuming due to the need for MD simulations. However, this last limitation is only encountered for the first property that you wish to model, for once the descriptors are generated, they can be used to build predictive models for several properties of the polymer library. The obtained models for water uptake can be improved by increasing the size of training set, by generating more meaningful descriptors, such as 3D descriptors in explicit water, by improving the descriptor selection algorithm, and by identifying other surrogate methods.

Conclusions
This study describes a new approach to modeling dynamic properties and demonstrates the potential value of this approach. In particular, we developed models for water uptake for a library of polymers using only a small training set and molecular descriptors for all the polymers in the library. We also demonstrate that using a consolidated parameter of water uptake, a dynamic property, gives a more accurate model than using a more conventional approach of all experimental measurement as independent values. By separating time points from one experiment, information about 8 International Journal of Biomaterials the slope, rate, and progression of the dynamic property is not considered in the model. And, since data points are not independent, accuracy of predictions is compromised.
A surrogate model was built to accurately predict equilibrium water uptake of a polymer sublibrary of 56 L-tyrosinederived polyarylates using a small training set and only three descriptors selected from a large set of descriptors, calculated from either 2D or 3D structures. Those descriptors included atom counts; 3D information about electron diffraction (3D-MoRSE); and chemical properties of molecular atoms, branching, cyclicity, and conformational changes (GETAWAY). Although these descriptors can be used only for this model in this polymer library, the methodology for selecting descriptors can be applied to any polymer library and/or polymer property.
The model was able to accurately predict low, intermediate, and high levels of water uptake for up to 12 of the 18 polymers. Using this model, predictions were obtained for the rest of the sublibrary. Those predictions can be used primarily as a reference of order of magnitude and ranking of polymers in terms of water uptake.
Finally, having several semiempirical models for different polymer properties such as glass transition temperature, contact angle, fibrinogen adsorption, cell response, water uptake, and degradation for the same polymer library may be used to select a polymer for a specific application. With a known set of design criteria, a group of polymers can be selected from the mentioned models. After this selection, the actual parameters must be measured experimentally, the models must be validated, and the best polymer can be selected to begin the device development process. With this, surrogate modeling of polymer properties may accelerate the discovery and selection of rationally designed materials for a target application.