Prediction of the Antibacterial Activity of the Green Synthesized Silver Nanoparticles against Gram Negative and Positive Bacteria by using Machine Learning Algorithms

With the emergence and spread of microorganisms resistant to various antibiotics, and with the need to reduce the cost of health care, producing antimicrobials at lower cost has become a necessity for today's societies. Recently, the interdisciplinary field of nanotechnology has developed widely. One application of nanobiotechnology is the use of silver nanoparticles (AgNPs) as a new approach to treating microbial infections. AgNPs have unique properties that are useful in molecular diagnostics, therapies, and devices used in several medical procedures. In this field, machine learning algorithms have been used with promising results. Machine learning (ML) is a branch of artificial intelligence (AI) that learns predictive models from data. Machine learning techniques are attracting considerable attention because of their successes in a broad range of predictive tasks. In this work, we used machine learning techniques to predict the antibacterial activity of AgNPs against Escherichia coli, Pseudomonas aeruginosa, Staphylococcus aureus, and Klebsiella pneumoniae. We reviewed 100 articles to assemble the data, covering plants recently used for the green synthesis of highly efficient antimicrobial AgNPs, and extracted the key experimental conditions (amount of plant extract, volume of plant extract, volume of solvent, volume of AgNO3 solution, reaction temperature, reaction time, concentration of precursors, and nanoparticle size). The results showed that nanoparticle size and AgNPs concentration are the key factors in predicting the antibacterial effect of AgNPs.


Introduction
Resistance of human pathogens to drugs is a major challenge in pharmaceutical and medical science. The continuous use of antibiotics and other chemical compounds has caused widespread resistance in microorganisms [1]. This phenomenon has weakened or neutralized the effects of medicines and eventually has led to an increase in drug use. It has also encouraged the use of compounds with newer and stronger formulations [2,3]. Another disadvantage of these drugs is their increased side effects, together with the reemergence of multidrug-resistant pathogens and parasites, which cause diseases more dangerous than the primary disease [4]. Therefore, modifying antibacterial compounds to enhance their antibacterial potential has been an important part of research in recent years [5]. Nanotechnology, one of the most advanced technologies of the present century, has reached many aspects of human, animal, and plant life, and its innovations continue to shape them [6][7][8]. In this field, it is well known that silver nanoparticles are highly toxic to microorganisms [9]. Silver nanoparticle technology, which has transformed antibacterial materials, is the main direction for developing nanosilver products and has many advantages over chemical materials [10,11]. The antibacterial effects of silver nanoparticles have been investigated by many researchers, and their activity against a wide range of microbes, including antibiotic-resistant bacteria, has been demonstrated [12][13][14][15][16][17]. Many studies have examined the possible relationship between the physicochemical properties and the toxicity of silver nanoparticles [18]. Silver nanoparticles of different shapes, such as spheres, rods, beads, nanoprisms, and sheets, are being developed and studied for their specific antibacterial effects [19][20][21][22][23].
Recently, studies have shown that the surface coating and size of silver nanoparticles play an important role in antibacterial activity, with smaller nanoparticles being more toxic [19,24,25]. Two main factors determine the antibacterial activity of nanoparticles: (1) physicochemical properties, such as natural properties, surface modification, and composition, and (2) the type of bacteria [26][27][28]. Computational tools can shrink the design space by predicting the characteristics of nanoparticles before synthesis, which reduces experimental trial and error. Such tools are used to predict the three-dimensional structure of metallic nanoparticles [29,30] and to characterize the functional composition of the protein corona of nanoparticles [31]. Artificial intelligence (AI) is a branch of computer science that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. Because of this property, AI has attracted much attention in different fields. Machine learning (ML) is a subset of AI in which algorithms use historical data as input to predict new outputs. Because these algorithms rely on existing data rather than physical test materials, they are inexpensive [32,33]. ML has been used to study antimicrobial activity and resistance for specific bacteria. For instance, Tiihonen et al. [34] studied the antimicrobial activity of conjugated oligoelectrolyte molecules against Escherichia coli. Söylemez et al. [35] used ML to predict the antimicrobial activity of peptides against gram-negative and gram-positive bacteria. Her and Wu [36] explored a pan-genome-based ML method for predicting the antimicrobial resistance activities of E. coli. In this work, we used machine learning techniques to predict the antibacterial activity of AgNPs against E. coli, Pseudomonas aeruginosa, Staphylococcus aureus, and Klebsiella pneumoniae. The antibacterial activity was predicted from experimental conditions (amount of plant extract, volume of plant extract, volume of solvent, volume of AgNO3 solution, reaction temperature, reaction time, concentration of precursors, and nanoparticle size). This approach helps predict the antibacterial activity of AgNPs while saving time and decreasing costs. It can also be useful to researchers working on preventing bacterial growth, which is harmful to environmental and industrial components and especially to human health [37][38][39][40].

Materials and Methods
2.1. Approach. Figure 1 shows the workflow of our model step by step. After collecting the data, we identified the features and outcomes and created dataset I. In the preprocessing phase, we handled categorical and missing values and then normalized the data to create dataset II. In the third step, before training the regression models, important features were selected using ML algorithms and feature selection algorithms, and dataset III was created from the selected features and the outcomes. To find a suitable model, we trained several regression algorithms and finally evaluated the models with different evaluation metrics.

Dataset.
In this study, we surveyed published research on the green synthesis and antibacterial activity of AgNPs. These studies contained keywords such as "silver nanoparticles," "antimicrobial," "green synthesis," and "antibacterial." AgNPs are well known for their high stability, low cost, inhibitory effect against a wide range of bacteria, and applications in the food industry [41][42][43]. Therefore, AgNPs were selected as target antibacterial agents against gram-negative bacteria, including E. coli, P. aeruginosa, and K. pneumoniae, and gram-positive bacteria, including S. aureus. We reviewed 100 articles, concentrating on the amount of plant extract, volume of plant extract, volume of solvent, volume of AgNO3 solution, reaction temperature, reaction time, concentration of precursors, and nanoparticle size. These variables were used as input attributes for predicting the antibacterial activity of the AgNPs. The outcomes were recorded from antibacterial measurements such as the zone of inhibition.

Data Preprocessing.
Data preprocessing is one of the main stages in ML. It includes data cleaning, normalization, and transformation, and it has a considerable effect on the generalization performance of supervised ML algorithms [44]. Missing data (or missing values) are values that are not stored for a variable in an observation. Missing values can distort the results, so they should be handled explicitly rather than ignored or omitted, and several techniques exist for doing so. The primary dataset (dataset I) contained missing values in some features, and because these would affect our regression models, we handled them by imputation [45]. Categorical variables take values that are names or labels. ML algorithms cannot operate on label data directly, so categorical variables in a regression model must be converted to numerical form [46]. There are several methods for this conversion [47]. One of them is one-hot encoding, which represents each category as a separate binary column. We used this method because of its high accuracy; with this encoding, ML algorithms can do a better job in prediction. Features in ML models can also have different ranges and scales. The goal of normalization is to transform the numeric columns of the dataset to a common scale. Not every dataset needs normalization; it is required when feature columns have very different ranges. The most common normalization methods are the z-score and min-max methods [44]. In this study, normalization was done with the min-max method, which rescales each feature to a new range, usually [0, 1] or [−1, +1] [48].

$$\text{Normalized value} = \frac{\text{Original value} - \text{Minimum value}}{\text{Maximum value} - \text{Minimum value}}.$$
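The three preprocessing steps described in this section (imputation, one-hot encoding, and min-max scaling) can be illustrated with a minimal sketch. It is not the code used in this study: the choice of mean imputation, the toy values, and the names `impute_mean`, `one_hot`, and `min_max` are assumptions made only for illustration.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def one_hot(column):
    """Turn each category into a 0/1 indicator vector; categories are sorted for a stable order."""
    categories = sorted(set(column))
    return categories, [[1 if v == c else 0 for c in categories] for v in column]

def min_max(column):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical toy columns, not taken from the study's dataset.
reaction_time_min = [30, None, 60, 90]
method = ["disk", "well", "disk", "well"]

filled = impute_mean(reaction_time_min)   # the missing entry becomes the mean of 30, 60, 90
scaled = min_max(filled)                  # values now lie in [0, 1]
categories, encoded = one_hot(method)     # one binary column per category
```

Each numeric column is scaled independently, so a feature measured in minutes and one measured in nanometres end up on the same [0, 1] scale.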
Data splitting is an important part of data science, especially for data-driven models. It means dividing the dataset into two subsets: a training set, used to fit the ML models, and a test set, used to evaluate them after training. In this paper, we split our dataset randomly into two subsets: the training set contains 70% of the data, and the remaining 30% is held out for testing the model.
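A random 70/30 split of this kind can be sketched as follows. The function name `train_test_split`, the seed, and the use of row indices as stand-ins for dataset rows are assumptions for illustration only.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out the last `test_fraction` of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]

rows = list(range(200))          # stand-in for the 200 rows of the dataset
train, test = train_test_split(rows)
```

Fixing the seed keeps the same rows in the test set across runs, which makes the reported metrics comparable between models.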

Feature Importance.
In ML and data mining, analyzing high-dimensional data is challenging for researchers and engineers. Feature selection methods address this problem by removing irrelevant and redundant data, which decreases computation time and can improve learning accuracy. Suitable feature selection therefore eliminates irrelevant or redundant columns from the dataset without sacrificing accuracy and gives a better understanding of the learning model and the data. It also reduces the chance of overfitting, which is very important for future predictions, and decreases the time needed to train a model. Feature importance shows which features have the biggest impact on model predictions and which have little [49]. Methods for choosing important features include filter methods, wrapper methods, and embedded methods [50,51]. Researchers and engineers choose among them based on their datasets and models. In this study, feature selection was done with a wrapper method (recursive feature elimination (RFE)) and boosting methods (XGBoost and AdaBoost), and for each method we recorded the score assigned to each feature.

RFE Method.
A good feature ranking criterion is not necessarily a good feature subset ranking criterion. Ranking criteria estimate the effect of removing one feature at a time on the objective function, and they become very suboptimal when several features must be removed at once, which is necessary to obtain a small feature subset [52]. RFE addresses this by removing features iteratively: the model is retrained after each elimination, and the remaining features are ranked again. Among different feature selection methods, RFE is effective, and it seeks to improve generalization performance by deleting the least important features, whose elimination has the least effect on training errors [53].
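The iterative idea behind RFE can be sketched as a backward elimination loop. This is a simplified stand-in, not the RFE implementation used in this study: the ranking criterion here is the leave-one-out error of a 1-nearest-neighbour regressor, and the names `loo_1nn_mse` and `backward_elimination` as well as the toy data are hypothetical.

```python
def loo_1nn_mse(X, y, features):
    """Leave-one-out MSE of a 1-nearest-neighbour regressor restricted to `features`."""
    total = 0.0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        total += (y[i] - y[best_j]) ** 2
    return total / len(X)

def backward_elimination(X, y, n_keep):
    """Repeatedly drop the feature whose removal leaves the lowest error."""
    remaining = list(range(len(X[0])))
    dropped = []
    while len(remaining) > n_keep:
        # Error of the model when each candidate feature is left out.
        scores = {f: loo_1nn_mse(X, y, [g for g in remaining if g != f])
                  for f in remaining}
        weakest = min(remaining, key=lambda f: scores[f])
        remaining.remove(weakest)
        dropped.append(weakest)
    return remaining, dropped

# Toy data: column 0 determines the target, column 1 is unrelated noise.
X = [[0.0, 0.9], [0.1, 0.1], [0.2, 0.8], [0.3, 0.2], [0.4, 0.7], [0.5, 0.3]]
y = [0, 1, 2, 3, 4, 5]
remaining, dropped = backward_elimination(X, y, n_keep=1)
```

On this toy data the noise column is eliminated and the informative column is kept. The scores are recomputed after every removal, which is what distinguishes the recursive procedure from a single one-shot ranking.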

Boosting Method.
(1) XGBoost. One advantage of ensembles of decision trees, such as gradient boosting, is that they can estimate feature importance automatically from the training set. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance. XGBoost is well known for its computational speed and model performance; it is a scalable, end-to-end tree-boosting system [54]. This algorithm was therefore selected in this paper.
(2) AdaBoost. Boosting is a general method for improving the accuracy of any given learning algorithm. One of the most popular ensemble boosting regressors is adaptive boosting, known as AdaBoost. This algorithm was first introduced in 1995 by Freund and Schapire [55,56]. AdaBoost combines multiple weak learners into a strong learner to improve predictive performance.

Feature Selection Based on Feature Importance.
In this study, we recorded the scores for each tested feature selection algorithm. Based on these results, we observed that nanoparticle size, extract mass, solvent volume, extract volume, AgNO3 volume, AgNO3 concentration, reaction temperature, reaction time, diameters of disks and wells, incubation temperature, incubation time, bacteria concentration, and nanoparticles concentration were more important than the other features, so we created dataset III with these most important features.
2.5. Model Development or Regression Models. ML is a subset of AI that enables a machine to learn from data, use past experience to improve performance, and make predictions [57]. Based on the data and the way of learning, there are four main types of learning: supervised, unsupervised, semisupervised, and reinforcement learning. In supervised learning, algorithms are trained on labeled datasets [58], which contain both inputs and outputs. ML algorithms train on such a dataset until they capture the relationship between input and output. Supervised learning is divided into two main groups: classification and regression. Classification predicts the class of given data points from a fixed set of classes. Regression estimates a mapping function from the input features to a continuous output [59]. In this study, the dataset is labeled, containing both inputs and outputs, and the aim is to predict a continuous output rather than to categorize the data, so regression was selected. The ML algorithms map AgNPs features to the inhibition of bacteria and thereby predict the antibacterial activity of AgNPs. Several supervised regression algorithms were used to find out which model provides the most precise prediction: decision tree, random forest, and two boosting algorithms, XGBoost and AdaBoost.

Decision Tree.
It is one of the most commonly used algorithms in supervised learning and can be used for both classification and regression problems. The name of decision tree suggests that it uses a tree structure to present the predictions resulting from a series of feature-based splits. It divides the dataset into smaller subsets. In other words, it starts with a root node and finishes with a decision made by leaves. Decision tree has some advantages such as the flexibility to handle a broad range of response types, including numeric, categorical, ratings, and survival data. Besides, it is simple to understand and interpret [60].
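The feature-based splitting described above can be sketched with a small regression tree that chooses, at each node, the split minimising the squared error of its two children. This is an illustrative sketch rather than the implementation used in this study; the function names and the toy size and inhibition-zone values are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Return the (feature, threshold) pair with the lowest total child error, or None."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[1:]:
            left = [v for row, v in zip(X, y) if row[f] < t]
            right = [v for row, v in zip(X, y) if row[f] >= t]
            err = sse(left) + sse(right)
            if err < best_err:
                best, best_err = (f, t), err
    return best

def build_tree(X, y, depth=2):
    """Grow a regression tree; a leaf stores the mean target of its rows."""
    split = best_split(X, y) if depth > 0 and len(set(y)) > 1 else None
    if split is None:
        return mean(y)
    f, t = split
    left = [(r, v) for r, v in zip(X, y) if r[f] < t]
    right = [(r, v) for r, v in zip(X, y) if r[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [v for _, v in left], depth - 1),
            build_tree([r for r, _ in right], [v for _, v in right], depth - 1))

def predict(tree, row):
    """Follow the splits from the root until a leaf value is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree

# Hypothetical toy data: nanoparticle size (nm) -> zone of inhibition (mm).
X = [[10], [20], [30], [40]]
y = [18, 17, 12, 11]
tree = build_tree(X, y, depth=1)
```

With `depth=1` the tree is a single split at 30 nm, and its two leaves hold the mean zone of inhibition on each side, which is what makes such models easy to read.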

Random Forest.
Tree-based algorithms are attractive because of their high execution speed. Random forest is an algorithm that combines the results of many different decision trees to make better decisions. After a large number of trees are generated, they vote for the most popular class in classification problems, or their predictions are averaged in the case of regression. Like a decision tree, random forest works for both classification and regression. It can also handle thousands of variables without variable deletion or loss of accuracy, while limiting overfitting [61].
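The averaging step for regression can be sketched as follows. This is a deliberately reduced illustration, not the model used in this study: each tree is a depth-1 stump on a single feature, trained on a bootstrap sample, and the random feature subsampling of a full random forest is omitted. All names and values are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    best, best_err = None, float("inf")
    for t in sorted(set(xs))[1:]:
        left = [v for x, v in zip(xs, ys) if x < t]
        right = [v for x, v in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if err < best_err:
            best, best_err = (t, ml, mr), err
    # If the sample has a single distinct x, fall back to predicting the mean.
    return best if best else (float("inf"), mean(ys), mean(ys))

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees (the regression counterpart of voting)."""
    return mean([left if x < t else right for t, left, right in forest])

# Hypothetical toy data: nanoparticle size (nm) -> zone of inhibition (mm).
sizes = [10, 20, 30, 40]
zones = [18, 17, 12, 11]
forest = fit_forest(sizes, zones)
small, large = predict_forest(forest, 15), predict_forest(forest, 35)
```

Because every tree sees a different resample of the data, the averaged prediction is less sensitive to any single row than one tree would be.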
2.5.3. XGBoost. eXtreme Gradient Boosting (XGBoost) is an ensemble ML algorithm for tree boosting that uses a gradient boosting framework designed for speed and performance [54]. Ensemble learning combines a collection of predictors, that is, multiple models, to provide better prediction accuracy [62]. The main idea of the algorithm is to build N regression trees one by one, so that each subsequent tree is trained on the residuals of the previous trees [63]. Weights play a significant role in XGBoost: all independent variables are assigned weights, which are fed into the decision tree that predicts results. The weights of variables predicted wrongly by the tree are increased, and these variables are then fed to the second decision tree. The individual predictors are then combined into a strong and more precise model. We use this method for two main reasons: execution speed and model performance [54].
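The idea of training each new tree on the residuals of the trees before it can be sketched with plain gradient boosting under squared loss. This is not XGBoost itself, which adds regularisation and second-order information; the stump learner, the learning rate, and all names and values here are assumptions for illustration.

```python
from statistics import mean

def fit_stump(xs, targets):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    best, best_err = None, float("inf")
    for t in sorted(set(xs))[1:]:
        left = [v for x, v in zip(xs, targets) if x < t]
        right = [v for x, v in zip(xs, targets) if x >= t]
        ml, mr = mean(left), mean(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if err < best_err:
            best, best_err = (t, ml, mr), err
    return best

def stump_predict(stump, x):
    t, left, right = stump
    return left if x < t else right

def fit_boosting(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then fit each stump to the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_boosting(model, x):
    """Sum the base value and the shrunken contributions of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Hypothetical toy data: nanoparticle size (nm) -> zone of inhibition (mm).
sizes = [10, 20, 30, 40]
zones = [18, 17, 12, 11]
model = fit_boosting(sizes, zones)
```

Each round corrects part of the error left by the previous rounds, so the training error shrinks as trees are added; the learning rate controls how much of each correction is applied.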
2.5.4. AdaBoost. The term "boosting" refers to algorithms that convert weak, low-accuracy learners into a strong one. The main goal of boosting is to decrease bias as well as variance in supervised learning [64]. The AdaBoost algorithm, short for adaptive boosting, is one such ensemble method. It starts by fitting a learner to the original dataset with the same weight given to each observation. If the prediction made by the first learner is incorrect for an observation, a higher weight is assigned to that observation, and the process is repeated iteratively. It continues to add new learners until the limit set in the model is reached [56]. Real-world data include patterns, some of which are linear and some not, and finding these relationships can be challenging. Ensembles like AdaBoost allow us to capture many nonlinear relationships and make accurate predictions from them [55].

Model Validation.
Algorithms mentioned in the previous section were applied to predict the antibacterial activity of AgNPs in this study. The main goal of ML is to produce an effective computational model with high accuracy while avoiding overfitting and underfitting [65]. Cross-validation is a useful technique for evaluating the effectiveness of a model, especially when overfitting must be mitigated. The basic idea is to split the dataset into two parts, one for training and the other for testing the model (Figure 2). In the present study, we used 70% of our dataset for training and 30% for testing. We evaluated the performance of our models using mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and the coefficient of determination or R-squared (R²). MAE is the mean of the absolute differences between actual and predicted values:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations/rows. MSE is a standard and common error metric for regression problems, and it is calculated as the mean of the squared differences between predicted and expected target values in a dataset:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations/rows. RMSE is an extension of the mean squared error and is calculated as its square root:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.$$
For MAE, MSE, and RMSE, the smaller the value, the better the prediction. The coefficient of determination or R-squared (R²) is an important evaluation measure for regression-based algorithms. It indicates how much of the variation of the output is explained by the inputs and is usually reported between 0% and 100%. Unlike the other three metrics, for R² the higher the value, the better the prediction.
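The four metrics can be computed directly from their definitions, as in the following sketch. The function name `regression_metrics` and the toy values are hypothetical, and R² is computed with the standard formula of one minus the ratio of the residual sum of squares to the total sum of squares, which is an assumption about the exact form used in this study.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for paired actual and predicted values."""
    n = len(y_true)
    errors = [a - p for a, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)      # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": r2}

# Hypothetical actual and predicted zones of inhibition (mm).
metrics = regression_metrics([10, 12, 14, 16], [11, 12, 13, 17])
```

For these toy values the errors are -1, 0, 1, and -1, giving an MAE and MSE of 0.75 and an R² of 0.85. Note that with this formula R² can fall below zero on a test set when a model predicts worse than the mean of the actual values.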

Results
3.1. Data Preprocessing. From 100 studies on the antibacterial activity of AgNPs, we assembled the primary dataset (dataset I), which consisted of 14 columns and 200 rows. This dataset was not suitable for ML algorithms because it had some missing values and some categorical features, which ML algorithms cannot operate on directly. In addition, the other features were not in the same range and had different scales. We therefore used imputation to fill the missing values, one-hot encoding to transform the categorical features to numerical form, and min-max normalization to bring the features to a common scale, which produced dataset II. It had 18 columns (14 features, 4 results). The features comprised nanoparticle size (nm), extract mass (g), solvent volume (ml), extract volume (ml), AgNO3 volume (ml), AgNO3 concentration (mM), reaction temperature (°C), reaction time (min), diameters of disks and wells (mm), incubation temperature (°C), incubation time (hr), bacteria concentration (CFU/ml), and nanoparticles concentration (μg/ml). Table 1 introduces the structure of the primary dataset. The nanoparticle size, extract volume, AgNO3 volume, AgNO3 concentration, reaction temperature, diameters of disks and wells, incubation temperature, incubation time, bacteria concentration, and nanoparticles concentration had no missing values. The extract mass, solvent volume, reaction time, and method had 4.7%, 6%, 3.5%, and 37.6% missing values, respectively. After preprocessing of dataset I, there were no missing values or categorical variables, and all features were in the same range (0, 1).
3.2. Feature Scoring and Feature Ranking. When developing an ML model, only a few features in the dataset are effective for building the model, and the remaining variables are either irrelevant or redundant. We analyzed the feature scores and observed that the most important features were nanoparticle size and concentration, which were selected by all of the feature selection algorithms. The extract mass, reaction time, and temperature were comparatively important. Figures 3-6 show the results of selecting the important features with four algorithms for four bacteria: E. coli, P. aeruginosa, S. aureus, and K. pneumoniae. Further data analyses, such as correlation values, are given in Supplementary Materials (S1-S4).

Model Validation.
In this study, different ML methods were trained on the selected important features to predict the antibacterial activity of AgNPs. We applied decision tree, random forest, XGBoost, and AdaBoost models. The XGBoost model had the highest R² (0.84) for P. aeruginosa and the lowest error compared with the other algorithms for these bacteria, while the AdaBoost model had the lowest R² (0.52) and the highest error for S. aureus.

Conclusions
In this study, we used ML algorithms to predict the antibacterial activity of AgNPs against E. coli, P. aeruginosa, S. aureus, and K. pneumoniae. This approach helps predict the antibacterial activity of AgNPs while saving time and decreasing costs. Because ML algorithms cannot be used without an appropriate dataset, we first prepared our dataset by reviewing already published works. In the preprocessing step, operations such as one-hot encoding and normalization were performed on the dataset. Feature selection decreases computation time, including the time needed to train a model, and eliminates irrelevant or redundant columns from the dataset without sacrificing accuracy. Feature importance shows which features have the biggest impact on the model prediction and which have little. For selecting significant features, we utilized different algorithms like decision tree, random forest, and two boosting algorithms: XGBoost and AdaBoost. The results showed that the most important features were nanoparticle size and concentration. XGBoost gave the highest R² (0.84) for P. aeruginosa, and AdaBoost gave the lowest R² (0.52) for S. aureus.

Data Availability
All the steps mentioned in the article, which include data preprocessing to the implementation of algorithms and their evaluation, are available on https://github.com/asmadehghani/AntiBacterialLearning.git.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Supplementary Materials
Figures S1-S4 show pairplot correlation analysis of input variables with the outcome. Figures S5-S8 are the results of modeling based on the algorithms presented in this study. The x-axis shows the data points that each study provided and y-axis demonstrates the predicted and actual value.