A Novel Molecular Representation Learning for Molecular Property Prediction with a Multiple SMILES-Based Augmentation

Deep learning has driven rapid progress in molecular representation for various tasks, such as molecular property prediction. Predicting molecular properties is a crucial task in drug discovery for finding drugs with good pharmacological activity and pharmacokinetic properties. Inspired by natural language processing techniques, deep neural network models commonly take the SMILES string as a character-level input. However, such models are hindered by the nonunique nature of the SMILES string. To efficiently learn molecular features along all message paths, in this paper we encode multiple SMILES for every molecule as an automated data augmentation for molecular property prediction, which alleviates the overfitting caused by the small size of molecular property datasets. With the multiple SMILES-based augmentation, we obtained better molecular representations and superior performance on molecular property prediction tasks.


Introduction
Traditionally, drug discovery is time-consuming and very expensive. To understand the properties of a compound, many simulation results are obtained through the experience of a chemist or pharmacist. The overall process is complex, long, and often inefficient. Deep learning has brought rapid development to the field and is expected to accelerate the process of drug discovery [1][2][3][4][5][6][7][8]. Nevertheless, deep learning methods still face obstacles, such as the small amount of data in molecular datasets, few labeled data [3], and label noise [5,6], which lead to overfitting and poor predictive performance.
Inspired by natural language processing techniques, many deep learning models use the simplified molecular input line entry system (SMILES) [9] as a line text representation of a molecule. A SMILES string is a 1D sequence describing a chemical structure and can be encoded in one-hot form. Because a molecule may have multiple SMILES strings, it is often identified by its canonical SMILES [10], which ensures that each molecule corresponds to a unique string. The SMILES-based methods [11][12][13][14][15][16] have shown great potential and have been widely used in molecular property prediction [12][13][14] and molecular generation [15,16]. However, the performance of these deep learning models is hindered by the nonunique nature of the SMILES string, which affects the accuracy of molecular property prediction and the ability to explore the potential chemical space in molecular generation tasks [11]. Paul et al. proposed CheMixNet [12], a mixed deep learning architecture that learns molecular representations with several neural network designs (convolutional neural network (CNN), recurrent neural network (RNN), and multilayer perceptron (MLP)) applied to molecular SMILES sequences and Molecular ACCess System (MACCS) fingerprints, respectively. The two sets of features are then concatenated to make the final prediction. Lin et al. learn molecular representations with a sequence-based bidirectional gated recurrent unit (BiGRU) architecture [13], designed for single- and multitask classification in drug discovery. These methods feed a single SMILES sequence per molecule to the neural network, so the limited molecular representation restricts predictive performance.
SMILES2vec [14] was proposed to train on SMILES for predicting chemical properties using an RNN optimized via Bayesian optimization methods [17]. SMILES2vec was inspired by RNN-based language translation in the field of natural language processing (NLP) and does not explicitly encode the grammar of the SMILES specification. The LSTM-based [18] or GRU-based [19] recurrent neural network architecture is an effective design for learning features from sequence or text data. The above SMILES-based models are limited because only a single SMILES of each molecule is considered, so the grammatical features of SMILES cannot be learned well. Although SMILES enumeration [11] and the all SMILES variational autoencoder [15] have considered multiple SMILES strings of a single molecule to learn latent molecular representations, these methods have not been applied to molecular property prediction. In this paper, we propose a novel molecular representation method for molecular property prediction that uses multiple SMILES-based augmentation to alleviate the problem of small datasets and few labels, without relying on descriptor engineering or expert experience.

Related Work
A method closely related to this paper is SMILES enumeration [11], which exploits the fact that multiple SMILES strings represent the same molecule as a technique for data augmentation. The augmented dataset was larger than the original, and the neural network trained on the augmented dataset performed better on the test set than the model trained on the unaugmented dataset. Another SMILES enumeration-based method is the all SMILES variational autoencoder (VAE) [15], which used multiple SMILES strings of a single molecule to learn latent molecular representations for molecular generation. The all SMILES VAE encoded multiple SMILES with several recurrent neural network layers and decoded them to molecular SMILES. It learned a bijective mapping between molecules and the latent representations near the high-probability subspace of the prior, and the results showed state-of-the-art performance. However, these methods have not been applied to the molecular property prediction tasks recommended by MoleculeNet [8].

Methods
The key idea is to use multiple-SMILES representation learning as data augmentation for various downstream tasks. Deep learning models generally need to be fed with a large amount of data. By learning from a large amount of data, the model can find regularities and extract the latent knowledge in the data. Although a large number of molecules exist, labeled datasets are scarce. In molecular property prediction, the number of molecules in some datasets is also very small, which leads to unstable predictive performance of deep learning models due to overfitting and underfitting.
Inspired by SMILES enumeration [11] and the all SMILES VAE [15], we propose a novel method of molecular property prediction using multiple SMILES-based augmentation to address the overfitting and underfitting problem. The general framework is illustrated in Figure 1. Before the data are fed into the deep neural network, the multiple SMILES-based augmentation must be completed (Figure 1(a)); this determines whether the model can learn the latent knowledge in the datasets and is therefore a crucial step for successful prediction. The data augmentation process includes cleaning the data and removing invalid molecules. Then multiple SMILES sequences are generated for each molecule, and one-hot vectorization is carried out so that the result can be fed to the neural network to learn molecular features. The deep neural network used in this paper is shown in Figure 1(b) and consists of stacked CNN and RNN. The "Gate" in Figure 1(b) denotes the gated recurrent unit (GRU) or long short-term memory (LSTM). The final molecular representation can be used for a variety of downstream tasks, such as molecular property prediction.
In the following, we will describe technical details. We first give the mathematical definition of the problem (Section 3.1) and then propose a novel molecular representation method using multiple SMILES-based augmentation for molecular property prediction (Section 3.2).

Problem Definition.
We define a feed-forward convolutional neural network as $\mathrm{CNN}_{(kernel,channel,padding)}$, where kernel is the convolution kernel, channel denotes the convolution channel, and padding represents the type of padding. A recurrent neural network is defined as $\mathrm{RNN}_{gate}$, where gate denotes the type of gate, such as LSTM or GRU. Let $X_{os}$ be the original SMILES strings that have been cleaned and correspond to valid molecules, let $f_{ms}$ be the mapping function that generates multiple SMILES, and let $X_{ms}$ be the resulting multiple SMILES sequences. We define the vectorization function as $f_{vect}$. The problem is to learn the function $f_{res}$ that maps the multiple SMILES vectors to the molecular representation $X_{mol}$. The mapping relations are represented as follows:

$$X_{ms} = f_{ms}(X_{os}), \qquad X_{mol} = f_{res}\big(f_{vect}(X_{ms})\big).$$

The whole process includes three mapping functions, namely, $f_{ms}$, $f_{vect}$, and $f_{res}$. The final molecular representation is what we want to obtain and can be further used in specific tasks such as molecular property prediction.

Computational Intelligence and Neuroscience

Multiple SMILES-Based Augmentation.
SMILES [9] is a popular specification for describing molecules that uses ASCII strings to encode molecular structures as a line notation. The SMILES structure follows a certain grammar. Letters and digits in SMILES denote atoms and ring closures, respectively. Special characters such as "=" and "#" indicate bond types (double and triple bonds), and parentheses indicate side chains. The mapping function $f_{ms}$ can be implemented in RDKit [20] by randomizing the atom order of a molecule, renumbering its atoms accordingly, and then regenerating a new SMILES sequence using the "MolToSmiles" method with canonical set to "False". Figure 2 takes estradiol as an example and shows 10 randomly generated SMILES sequences. Estradiol is randomly selected from the ESOL dataset. Figure 3 shows 4 randomly generated SMILES sequences with renumbered atoms in the molecular graph of estradiol. It shows that different atom numberings of the same molecule generate different SMILES sequences. Therefore, the SMILES sequence of a molecule is not unique, whereas the canonical SMILES is unique for a specific molecule.
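The renumbering procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name `enumerate_smiles` and its `n` and `seed` parameters are hypothetical, while `Chem.MolFromSmiles`, `Chem.RenumberAtoms`, and `Chem.MolToSmiles(..., canonical=False)` are existing RDKit calls. The example molecule is 2-chloronaphthalene (Clc1ccc2ccccc2c1).

```python
import random


def enumerate_smiles(smiles, n=10, seed=0):
    """Return n randomized SMILES strings for one molecule (duplicates possible).

    Hypothetical helper: shuffles the atom order, renumbers the atoms, and
    writes a non-canonical SMILES, mirroring the procedure in the text.
    """
    # RDKit is imported here so that the module still loads without it.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles!r}")

    rng = random.Random(seed)
    order = list(range(mol.GetNumAtoms()))
    variants = []
    for _ in range(n):
        rng.shuffle(order)                           # random atom numbering
        renumbered = Chem.RenumberAtoms(mol, order)  # apply the new numbering
        # canonical=False keeps the atom order we just imposed
        variants.append(Chem.MolToSmiles(renumbered, canonical=False))
    return variants


if __name__ == "__main__":
    for s in enumerate_smiles("Clc1ccc2ccccc2c1", n=5):
        print(s)
```

Every string returned this way should parse back to the same canonical SMILES as the input, which is a convenient sanity check before using the variants as augmented training data.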
The mapping function $f_{vect}$ can be implemented using language translation techniques from natural language processing, which are effective for learning from text data. We first construct a character set from all SMILES sequences in a dataset, which plays a role similar to the corpus in natural language processing. Then randomization and vectorization are performed to convert each SMILES array to a one-hot vector. Figure 4 shows the one-hot images obtained by vectorizing multiple SMILES for estradiol. Each image in Figure 4 highlights the one-hot vector of a different SMILES sequence. The abscissa interval is [0,9], where 9 is the number of characters in the character set.
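A minimal sketch of character-level one-hot vectorization, assuming a sorted character set and all-zero rows as padding. Both choices are assumptions made for reproducibility; the text describes a randomly ordered character set, and its padding scheme is not specified here. The helper names `build_charset` and `one_hot_encode` are hypothetical.

```python
def build_charset(smiles_list):
    """Collect every character that occurs in the dataset (sorted for stability)."""
    return sorted(set("".join(smiles_list)))


def one_hot_encode(smiles, charset, max_len):
    """Encode one SMILES string as a max_len x len(charset) matrix of 0/1.

    Positions past the end of the string stay all-zero (assumed padding).
    Multi-character atoms such as "Cl" are split into single characters.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES is longer than max_len")
    index = {ch: i for i, ch in enumerate(charset)}
    matrix = [[0] * len(charset) for _ in range(max_len)]
    for pos, ch in enumerate(smiles):
        matrix[pos][index[ch]] = 1
    return matrix


if __name__ == "__main__":
    data = ["Clc1ccc2ccccc2c1"]
    charset = build_charset(data)
    encoded = one_hot_encode(data[0], charset, max_len=20)
    print(charset)
    print(len(encoded), "x", len(encoded[0]))
```

Fixing `max_len` across the dataset gives every molecule, and every randomized SMILES of that molecule, a matrix of the same shape, which is what a CNN or RNN input layer expects.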

The last mapping function $f_{res}$ for obtaining the molecular representation can be learned using a stacked CNN and RNN mixed architecture. The RNN consists of an input layer, a hidden layer, and an output layer. Figure 5 shows the simple structure of an RNN, where $X$ is the input vector, $H$ is the hidden vector of the hidden layer, and $O$ is the output vector. $W_{xh}$, $W_{ho}$, and $W_h$ denote the weight matrices from the input layer to the hidden layer, from the hidden layer to the output layer, and within the hidden layer, respectively. Figure 6 shows the timeline structure of the RNN. The output $O_t$ of the RNN at time $t$ is related not only to the input $x_t$ at time $t$ but also to the hidden layer value $h_{t-1}$ at time $t-1$. An RNN can therefore handle sequence information well, that is, cases in which earlier inputs are related to later inputs. A SMILES string has precisely this sequence structure, so its features can be extracted with a suitably designed RNN architecture. The message passing process for the stacked CNN and RNN mixed architecture is shown in Figure 7, which illustrates the message passing in the neural network at time $t$ and time $t-1$. The result of the output layer at time $t$ is based on the input at time $t$ and the result vector of the hidden layer at time $t-1$. The process can be summarized in matrix form as follows:

$$O_t = P(W_{ho}\, h_t), \qquad h_t = Q(W_{xh}\, x_t + W_h\, h_{t-1}),$$

where $P$ and $Q$ indicate some kind of neural network. Finally, the mapping function $f_{res}$ for obtaining the molecular representation can be represented as a composition of $\mathrm{CNN}_{(kernel,channel,padding)}$, $\mathrm{RNN}_{gate}$, and $f_{(Dense,Pooling,Gather)}$, where $f_{(Dense,Pooling,Gather)}$ denotes the mapping of the full-connection layer, pooling layer, and gather layer.

Figure 4: The images of one-hot vectorization using multiple SMILES for the estradiol molecule. It shows the vectorization of 6 randomly generated SMILES strings, using a random order of the character set for estradiol, which consists of 9 characters: "(", "3", "O", "c", "1", ")", "2", "4", "C". The padded length is 37, which includes predefined extra padding.
Figure 5: The structure of the recurrent neural network [21] (input layer, hidden layer, output layer).
Figure 6: The timeline structure of the recurrent neural network [21].
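To make the recurrence concrete, the sketch below runs a plain RNN over a sequence, with the hidden state at time t computed from the input at time t and the hidden state at time t − 1. It is an illustration only: tanh for the hidden update and a linear output map are assumptions (the text leaves these functions unspecified), the helper names are hypothetical, and the gated (GRU or LSTM) units and the CNN layers of the actual model are not reproduced.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_h, W_ho):
    """One time step: new hidden state from x_t and h_{t-1}, then the output."""
    pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_h, h_prev))]
    h_t = [math.tanh(p) for p in pre]  # assumed hidden nonlinearity
    o_t = matvec(W_ho, h_t)            # assumed linear output map
    return h_t, o_t


def run_rnn(xs, W_xh, W_h, W_ho):
    """Feed a sequence of input vectors through the RNN, starting from h = 0."""
    h = [0.0] * len(W_h)
    outputs = []
    for x_t in xs:
        h, o = rnn_step(x_t, h, W_xh, W_h, W_ho)
        outputs.append(o)
    return h, outputs


if __name__ == "__main__":
    W_xh, W_h, W_ho = [[1.0]], [[0.5]], [[2.0]]
    print(run_rnn([[1.0], [0.0]], W_xh, W_h, W_ho))
    print(run_rnn([[0.0], [1.0]], W_xh, W_h, W_ho))
```

The two calls use the same inputs in a different order and end in different hidden states, which is the order sensitivity that lets a recurrent layer distinguish different SMILES strings of the same molecule.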

Experiments
Extensive experiments have been implemented to evaluate the performance of molecular representation using the multiple SMILES-based augmentation for the tasks of molecular property prediction. We will describe the datasets, baselines, and experimental results.

Dataset Description.
We use five molecular property datasets recommended by MoleculeNet [8] for the experiments. Table 1 shows the information of the five datasets. The details of the datasets are as follows:
(i) ESOL: ESOL [22] contains the logarithmic aqueous solubility (mol/L) of 1,127 compounds and is used as a regression task to predict water solubility in deep neural networks.
(ii) Lipophilicity: Lipophilicity [23] includes the octanol/water distribution coefficient (logD at pH 7.4) of about 4,200 compounds, which is important for membrane permeability and solubility.
(iii) FreeSolv: FreeSolv [24] provides the hydration free energy (kcal/mol) of 642 compounds in water.
(iv) HIV: HIV [25] contains 41,127 compounds and is used as a classification task in deep neural networks to predict the activity of inhibiting HIV replication.
(v) BACE: BACE [26] contains 1,513 molecules, provides quantitative and qualitative binding results for a set of inhibitors, and is used as a classification task.
The datasets must be cleaned before being input into the neural network. The cleaning and preprocessing process is shown in Figure 8. The original data are cleaned in five steps, namely, excluding invalid molecules, filtering organic molecules, removing salt and stereochemistry information, keeping the largest fragment, and converting to canonical SMILES. The cleaned molecules are then used to generate multiple SMILES sequences and their vectorization. Finally, the vectorized features are fed to the neural network for training.
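The five cleaning steps can be approximated with RDKit as in the sketch below. It is a simplified reading of the description rather than the authors' pipeline: the element whitelist `ORGANIC_ELEMENTS` and the function name `clean_smiles` are hypothetical, and salt removal is folded into keeping the largest fragment, which is an assumption.

```python
# Hypothetical whitelist used for the "organic molecule" filter.
ORGANIC_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}


def clean_smiles(smiles):
    """Return a cleaned canonical SMILES, or None if the molecule is rejected."""
    # RDKit is imported here so that the module still loads without it.
    from rdkit import Chem

    # Step 1: exclude invalid molecules (RDKit returns None on parse errors).
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() == 0:
        return None

    # Step 2: keep only molecules built from the whitelisted elements.
    if any(atom.GetSymbol() not in ORGANIC_ELEMENTS for atom in mol.GetAtoms()):
        return None

    # Steps 3 and 4: drop counter-ions by keeping the largest fragment,
    # then strip stereochemistry information.
    fragments = Chem.GetMolFrags(mol, asMols=True)
    mol = max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())
    Chem.RemoveStereochemistry(mol)

    # Step 5: convert to canonical SMILES (the MolToSmiles default).
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    print(clean_smiles("C[C@H](N)C(=O)O.Cl"))  # stereo-labelled salt form
    print(clean_smiles("C1CC"))                # unclosed ring, rejected
```

Because the element filter runs before fragment selection, a salt with a metal counter-ion is rejected outright in this sketch; reversing the two steps would keep its organic part instead.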

Baselines.
We compared our method with the following models:
(1) CheMixNet: CheMixNet [12] is a mixed deep neural network architecture proposed for predicting chemical properties using molecular SMILES sequences and fingerprints. In this paper, we focus on the molecular SMILES sequence. Therefore, we do not consider computable characteristics, such as molecular fingerprints or physical descriptors. For a fair comparison with our method, we adopt the CNN and RNN architecture of CheMixNet [12], which uses SMILES as the sole input.
(2) Smi2Vec-BiGRU: Smi2Vec + BiGRU [13] was proposed for single- and multitask classification; it learns a low-dimensional representation of a molecule by transforming SMILES to vectors with a bidirectional gated recurrent unit (GRU) [18] architecture.
(3) XGBoost: XGBoost [27] is an ensemble method that implements a gradient boosting decision tree (GBDT) to improve the speed and efficiency of the model.
[...] is a generalized graph-based architecture [29], including the message passing phase and the readout phase. The former phase learns the characteristics of the graph, and the latter phase obtains the full graph representation for predicting various tasks.
(6) GC: GC [30] is a standard feature extraction method for molecules based on circular fingerprints; it is a kind of graph convolutional model and operates directly on graphs of arbitrary size and shape.
(7) Weave: Weave [31] implements graph convolutional operations on molecules using a simple encoding of the molecular graph, including atoms, bonds, and distances.
(8) Pretraining GNN: Pretraining GNN [32] proposed a new strategy and self-supervised methods for pretraining graph neural networks. To obtain useful local and global features, its strategy is to pretrain expressive graph neural networks at the level of individual nodes and entire graphs. Pretraining GNN achieved state-of-the-art performance on molecular property prediction tasks.
(9) Drug3D-Net: Drug3D-Net [2] is a grid-based 3D model for molecular representation using spatial-temporal gated attention, which uses the geometric information of molecules to extract molecular characteristics.
(10) Multiple SMILES (RNN (one layer), RNN (two layers), CNN_RNN): this is the method presented in this article. The neural network architectures include a one-layer RNN, a two-layer RNN, and the mixed network of CNN and RNN.

Experimental Setup.
In this experiment, we use root mean squared error (RMSE) and mean absolute error (MAE) to evaluate the performance on regression tasks. For classification datasets, we use the "binary_crossentropy" loss function, and we evaluate the model with the average area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) computed on the test set. Our models were trained with the Keras framework and TensorFlow [33]. We used the Adam algorithm [34] to optimize the parameters of the model. We set a total of 200 epochs, a batch size of 64, and 5-fold cross-validation with checkpointing and early stopping. We set the learning rate to 0.001 with learning rate decay.

We perform [...] and MAE values, which shows that the multiple SMILES-based data augmentation can alleviate the overfitting problem to a certain extent on small datasets, such as ESOL and FreeSolv. Figure 9 shows the scatter plots for four training folds on the FreeSolv dataset. The test-set points closely surround the identity line, which shows that the predictions on the test set are close to the target values, although the trend lines deviate slightly from the identity lines in each training fold. In addition, Figure 10 shows the loss curves on the training set and the validation set during training on the FreeSolv dataset. At the beginning of training, the loss on the training set and the loss on the validation set have a relatively large gap (the training and validation loss curves are far apart), indicating that the model is not yet stable. As the number of training epochs increases, the loss curves on the training set and the validation set become consistent and fit each other, indicating that the model becomes stable and is slowly converging.
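For reference, the regression metrics and the AUROC named in the experimental setup can be written out in plain Python as below. This is a minimal sketch for small inputs with hypothetical function names, not the evaluation code used in the experiments; the AUROC is computed by pairwise comparison of positive and negative scores, which is simple but quadratic in the number of samples.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With 5-fold cross-validation, each metric would be computed per fold on the held-out data and then averaged.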
Performance in Classifications.
Table 3 shows the predictive performance for HIV activity (HIV) and inhibitors of human β-secretase 1 (BACE). For both datasets, larger AUROC and AUPRC scores are better. Our method based on the mixed CNN and RNN architecture achieved the best AUROC and AUPRC scores on the test set for the HIV and BACE datasets. On HIV, we obtain 0.9767 AUROC and 0.9798 AUPRC, compared with 0.9621 AUROC and 0.9617 AUPRC for the 3D-based method Drug3D-Net, even though Drug3D-Net considers the information of molecular geometry. In addition, the performance of our method exceeds that of the pretrained model Pretraining GNN by a large margin.

In summary, our method shows superior performance on both regression and classification datasets, which implies the good molecular representation ability of the proposed method. As shown in Tables 2 and 3, the mixed CNN_RNN architecture obtains the best performance among RNN (one layer), RNN (two layers), and CNN_RNN, which indicates that the CNN convolution in our model is essential and can improve the predictive performance for downstream tasks. Meanwhile, the performance of the RNN (two layers) architecture is slightly better than that of the RNN (one layer) architecture, which shows that deeper neural networks can have better ability to extract features and can therefore perform better in specific tasks, such as molecular property prediction.

Conclusion
In this study, we make full use of the nonunique nature of the SMILES string to perform randomization of a SMILES string multiple times for efficiently learning molecular features along all message paths. By encoding multiple SMILES for every molecule as an automated data augmentation, we obtain better molecular representation and the proposed method shows superior performance in the tasks of predicting molecular properties, which alleviates the overfitting problem caused by the small amount of data in the datasets of molecular property prediction.

Data Availability
All input data are publicly available, and a detailed description is given in the Dataset Description section.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
