Evolutionary Computation Techniques for Predicting Atmospheric Corrosion

Corrosion occurs in many engineering structures such as bridges, pipelines, and refineries; it gradually destroys materials and thus shortens their lifespan. It is therefore crucial to assess the structural integrity of engineering structures that are approaching or exceeding their designed lifespan in order to ensure their correct functioning, for example, their carrying ability and safety. An understanding of corrosion and an ability to predict the corrosion rate of a material in a particular environment play a vital role in evaluating the residual life of the material. In this paper we investigate the use of genetic programming and genetic algorithms in the derivation of corrosion-rate expressions for steel and zinc. Genetic programming is used to automatically evolve corrosion-rate expressions, while a genetic algorithm is used to evolve the parameters of an already engineered corrosion-rate expression. We show that both evolutionary techniques yield corrosion-rate expressions that have good accuracy.


Introduction
Corrosion is a natural phenomenon that can cause substantial economic and environmental losses resulting from the damage incurred in metal constructions over the years. The cost of corrosion has been reported [1,2] to be as large as 3.1% of the gross domestic product of countries such as the United States, the United Kingdom, and Australia. Corrosion costs can be (i) direct, when the metallic structure is so damaged that replacement or expensive maintenance is required, or (ii) indirect, when the worsened appearance of the construction reduces its value (even if the construction is not greatly damaged and remains usable).
Corrosion refers to the disintegration of materials into their constituent atoms because of chemical or electrochemical reactions with the environment [3]. This disintegration causes a loss in the thickness of the construction, which results in a decrease in resistance and strength and consequently a decrease in the service performance of the construction. Corrosion occurs in many engineering structures such as bridges, pipelines, and refineries, where it gradually destroys materials and hence shortens their lifespan.
Corrosion can occur in many environments such as the atmosphere, soil, and sea, where environmental factors act on the material through complicated processes that lead to its corrosion. Depending on the environment, corrosion can be atmospheric, underground, marine, gaseous, or microbial and bacterial. Atmospheric corrosion is the type of corrosion we are mainly interested in because (i) it has been reported that atmospheric corrosion is responsible for more corrosion-induced failures than any other corrosion type [4] and (ii) it is the predominant corrosion type in SABIC [5] industrial sites, where the findings of this work are going to be applied.
Because of its huge impact on the economy and environment, an understanding of corrosion and an ability to predict the corrosion rate of a material in a particular environment play a vital role in evaluating the residual life of the material and consequently reducing associated costs. In order to understand and predict corrosion, we must model the environmental factors that influence corrosion and derive relationships between them and the rate of the resulting corrosion.
In this paper, we propose the use of genetic programming [6] and genetic algorithms [7] to derive corrosion-rate expressions in terms of the major influential environmental factors.
The rest of the paper is structured as follows. In Section 2, we formulate the problem of deriving a corrosion-rate expression mathematically. In Section 3, we describe the methodology of our work, in which we use genetic programming and genetic algorithms. In Section 4, we conduct an empirical evaluation of our work. In Section 5, we discuss our findings. In Section 6, we review related work in the automatic derivation of corrosion-rate expressions. Finally, in Section 7, we draw conclusions and set directions for future work.

Problem Formulation
The problem of identifying a corrosion model reduces to defining a function CR that expresses corrosion rate in terms of the environmental factors that cause it. Such environmental factors differ from one site to another and include temperature, air humidity, wetness, acidity, and the concentration of particular chemicals. The interaction between the environmental factors and the metal causes corrosion over time. The major influential environmental factors in atmospheric corrosion reported in the literature are the following.
(i) Temperature (T): measured in degrees Celsius; an increase in temperature stimulates corrosion by increasing the rate of electrochemical reactions and diffusion processes [4].
(ii) Time of wetness (TOW): the time during which the relative humidity of the environment is greater than 80% and the average temperature is above 0 °C; under these conditions an electrolyte film forms on the metal and causes its corrosion [4,8].
(iii) Sulfur dioxide (SO₂): the concentration of the contaminant; sulfur dioxide stimulates electrochemical reactions in the electrolyte layer formed on the metal by humidity above 60% to 70% [8].
(iv) Chloride (Cl⁻): the concentration of the contaminant; chloride prevents the creation of protective oxide layers on the metal, which accelerates the corrosion process [9].
International Journal of Corrosion
In our work, the function CR that represents the corrosion rate has five inputs, T, TOW, SO₂, Cl⁻, and the exposure time $t$, that represent the $n = 5$ environmental factors. We shall use the following representation: the environmental factors form an $m$-by-$n$ matrix $X$ which is input to the function CR, and the resulting output corrosion rates form the $m$-by-1 vector $Y$. Here, $m$ is the number of observations in which the values of the environmental factors are recorded together with the corresponding corrosion rates. The corrosion model that we want to identify is thus $\mathrm{CR}(X) = Y$.
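As a minimal sketch of this representation, the Python snippet below lays out a few observations as a factor matrix and a vector of corrosion rates. The numeric values and the names `FACTORS`, `X`, and `Y` are placeholders chosen purely for illustration; they are not taken from the datasets used in this paper.

```python
# Column order of the five environmental factors (names are illustrative).
FACTORS = ["T", "TOW", "SO2", "Cl", "t"]

# Each observation: five factor values followed by the measured corrosion rate.
# These numbers are made-up placeholders, not measurements.
observations = [
    (12.0, 0.30, 10.0, 5.0, 1.0, 20.0),
    (25.0, 0.55, 40.0, 60.0, 2.0, 75.0),
    (18.0, 0.42, 22.0, 15.0, 4.0, 48.0),
]

X = [list(row[:-1]) for row in observations]  # one row of factors per observation
Y = [row[-1] for row in observations]         # one corrosion rate per observation

rows, cols = len(X), len(X[0])
print(rows, cols)
```

A candidate corrosion model is then any function that maps one row of `X` to a predicted rate that can be compared with the matching entry of `Y`.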

Methodology
In order to identify the function CR, we first collect an $m$-by-$(n + 1)$ matrix of data, where $m$ is the number of experiments in which the $n$ values of the variables in $X$ have been recorded together with the resulting value $y$. After that, the set of $m$ observations is split into two parts: one for building the model and one for evaluating its accuracy. The part used for building the model is usually the larger, say $0.9m$ data items, and the part used to evaluate the model is the remaining $0.1m$ data items. This division is not set in stone and can be changed while balancing two opposing criteria: (i) the training dataset should be as large as possible so that a diversity of data points is taken into account while deriving the expression of interest, and (ii) the evaluation dataset should be as large as possible so that overfitting in the derived model can be detected. After collecting the data, we apply the evolutionary technique of interest, that is, genetic programming (Section 3.1) or genetic algorithms (Section 3.2), in order to determine the function CR.

Genetic Programming.
Genetic programming (GP) [6] is a bioinspired computer algorithm that mimics the natural evolution of living organisms. It is similar to genetic algorithms with the exception that individuals are computer programs as opposed to vectors of values.
The objective of GP is to evolve a computer program that solves a given problem. In order to do so, a population of computer programs called individuals, which are randomly generated initially, is evolved across a number of generations. The evolution of the population involves the exchange of genetic material between individuals through crossover operations and the alteration of the genetic material of single individuals through mutation operations. A selection strategy is applied to the individuals of a population in a given generation to decide which ones are allowed to proceed to the next generation. Such selection is based on the fitness of the individuals, which is a problem-dependent value that specifies how good an individual is at solving the problem at hand. The evolution continues until a good-enough individual that solves the problem adequately is found, or until a maximum number of generations is reached.
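The generational loop described above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in this paper: the function name `evolve`, its parameters, the tournament selection, and the elitism step are all assumptions. The loop does not depend on the representation, so the toy usage evolves short bit lists only to keep the example small; in GP the individuals would be program trees.

```python
import random

def evolve(init, fitness, crossover, mutate, pop_size=30, max_gens=50,
           good_enough=0, mutation_rate=0.2, seed=0):
    """Generic generational loop; lower fitness is better.

    init(rng), crossover(a, b, rng) and mutate(a, rng) are supplied by the caller.
    """
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]

    def select():
        # Tournament selection: the fitter of two randomly drawn individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) <= fitness(b) else b

    best = min(population, key=fitness)
    for _ in range(max_gens):
        if fitness(best) <= good_enough:
            break  # a good-enough individual has been found
        children = [best]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            child = crossover(select(), select(), rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = children
        best = min(population, key=fitness)
    return best

# Toy usage: individuals are lists of 8 bits and the fitness counts the zeros.
best = evolve(
    init=lambda rng: [rng.randint(0, 1) for _ in range(8)],
    fitness=lambda bits: bits.count(0),
    crossover=lambda a, b, rng: a[:4] + b[4:],
    mutate=lambda a, rng: [bit ^ 1 if rng.random() < 0.1 else bit for bit in a],
)
print(best)
```

Because the best individual is always copied into the next generation, the best fitness can never get worse from one generation to the next.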
Each individual in the population is a program represented by its abstract-syntax tree (AST). All nonleaf nodes of the AST represent operators, and leaf nodes represent problem variables or constant values. Crossing over two programs means taking one or more subtrees from the first program and exchanging them with subtrees from the second program.
GP can be used to solve a variety of optimization problems, amongst which is symbolic regression, which we describe here because it is the essence of our approach. To solve a symbolic-regression problem (also known as a function-discovery problem), the genetic program (we use GP to refer to both "genetic programming" and "genetic program") takes as input a set of $m$ observations of values of some variable $x$ and a set of observations of values of some variable $y$, and tries to identify the function $f_0$ such that $y = f_0(x)$ is true for all pairs $(x, y)$ in the $m$ observations and also true outside the $m$ observations. The function $f_0$ to be determined is a computer program that will be evolved by the GP. The initial population of the GP contains a number $N$ of randomly generated functions $f_1, f_2, \ldots, f_N$, each represented as an AST; for example, Figure 1 shows the AST representation of some function $f_i(x) = x^3 + 2x + 5$ in the GP. The functions are crossed over and mutated over generations to produce new, fitter functions. The fitness of a function $f_i$ is calculated as the sum of the differences $|y - f_i(x)|$ over all pairs $(x, y)$ in the $m$ observations. At the end, the GP either discovers $f_0$ during evolution or another function of equal or inferior fitness. Notice that the GP can derive some function $f'_0$ that satisfies $y = f'_0(x)$ for all observed data pairs $(x, y)$ but does not satisfy $y = f'_0(x)$ in the general case; that is, for some unobserved pairs $(x, y)$ the relation $(y = f_0(x) \wedge y \neq f'_0(x))$ holds. This is a classic case of overfitting and is usually (partially) tackled by dividing the $m$ observations into a training part used during evolution and a testing part used after evolution to give an indication of how well the derived function $f'_0$ generalizes to new unseen data. Should the derived function not generalize well enough, evolution is restarted with the derived function $f'_0$ injected into the initial population of the new GP run.
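To make the AST representation and the fitness computation concrete, here is a small Python sketch. It is illustrative only: the nested-tuple encoding, the helper names `evaluate` and `fitness`, the example cubic used as the target, and the synthetic observations are assumptions, and the sequential 8/2 split merely stands in for the training and testing parts.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(node, x):
    """Evaluate an AST: tuples are operator nodes, "x" and numbers are leaves."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, x), evaluate(right, x))
    if node == "x":
        return x
    return node  # numeric constant

def fitness(node, pairs):
    """Sum of absolute differences |y - f(x)| over the given observations."""
    return sum(abs(y - evaluate(node, x)) for x, y in pairs)

# AST of f(x) = x^3 + 2x + 5, built from binary + and * nodes.
x_cubed = ("*", "x", ("*", "x", "x"))
target = ("+", x_cubed, ("+", ("*", 2, "x"), 5))

# Synthetic observations generated from the target itself (illustrative only).
pairs = [(x, evaluate(target, x)) for x in range(-5, 5)]
train, test = pairs[:8], pairs[8:]

candidate = ("+", x_cubed, 5)  # an imperfect candidate that drops the 2x term
print(fitness(target, train), fitness(candidate, train), fitness(candidate, test))
# prints: 0 36 14
```

The candidate's error on the held-out pairs is what indicates how well an evolved function carries over to data that was not used during evolution.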

Genetic Algorithms.
Genetic algorithms (GAs) [7] work in a very similar fashion to genetic programming, except that each individual during evolution is an array as opposed to a tree. This means that GAs cannot be used to evolve a symbolic expression as GP does, since evolving an expression requires the ability to evolve an AST, not a flat array. However, a GA can be used as a powerful regression tool to estimate the coefficients of an expression $f$ whose structure is already known. For example, if we know that some function $f$ is $f(x) = c_1 x + x^{c_2}$, where $x$ is the independent variable and $c_1$ and $c_2$ are constants, we can evolve the array $c = [c_1, c_2]$ such that the difference $f(x) - y$ is minimal over all $m$ observations (as discussed in Section 3.1). The use of GAs in this case is similar to the use of linear and nonlinear regression, with the added advantage that a GA can escape local minima.
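The following Python sketch shows how a real-valued GA might estimate two coefficients of a fixed expression. Everything in it is an assumption made for illustration: the structure `c1 * x + x ** c2` is one reading of the example above, the data are synthetic, and the operators (truncation selection, blend crossover, Gaussian mutation, elitism) are common choices rather than the configuration used in this paper.

```python
import random

def model(c, x):
    # Assumed structure: f(x) = c1 * x + x ** c2 (x must be positive here).
    return c[0] * x + x ** c[1]

def error(c, data):
    # Sum of absolute differences |f(x) - y| over all observations.
    return sum(abs(model(c, x) - y) for x, y in data)

def fit_ga(data, pop_size=40, generations=200, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 5), rng.uniform(0, 3)] for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        pop.sort(key=lambda c: error(c, data))
        history.append(error(pop[0], data))
        parents = pop[: pop_size // 2]  # truncation selection
        children = [pop[0]]             # elitism: keep the best array unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            # Blend crossover followed by Gaussian mutation, clamped to [0, 5].
            child = [w * ai + (1 - w) * bi for ai, bi in zip(a, b)]
            child = [min(5.0, max(0.0, g + rng.gauss(0, 0.1))) for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=lambda c: error(c, data))
    history.append(error(best, data))
    return best, history

# Synthetic data generated from c1 = 2.0 and c2 = 1.5 (illustrative only).
data = [(x, 2.0 * x + x ** 1.5) for x in [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]]
best, history = fit_ga(data)
print(best, history[-1])
```

Each individual is a flat array of coefficients, so crossover and mutation act on numbers rather than on subtrees; the structure of the expression never changes.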

Evaluation
We use the datasets available from [10] to conduct our experiments. The datasets show the corrosion rates of two metals: steel and zinc. Corrosion for both metals is measured against the five most influential environmental factors: temperature, time of wetness, concentration of sulfur dioxide, concentration of chloride, and exposure time, as explained in Section 2. Tables 1 and 2 show some relevant statistics about the datasets we use in our experiments. In the following, we show the results of applying each evolutionary technique to determine an expression or model of the corrosion rates of steel and zinc.

Results of Using GP.
The GP was run using the parameters shown in Table 3. The fitness function is an aggregate computed as the sum of the mean squared error (MSE) and the complexity of the solution, measured as the number of nodes in the resulting corrosion expression. The size of the expression is added as a penalty to guide the evolution process towards small expressions. The expressions obtained for the corrosion rates of steel and zinc are shown in Tables 4 and 5, respectively, in decreasing order of goodness-of-fit score. The value $R$ in Tables 4 and 5 is obtained by performing a regression analysis between the model output (i.e., the corrosion-rate values obtained using the derived expression) and the corresponding target (i.e., the corrosion-rate values available in the dataset). The closer $R$ is to 1, the better the model fits the target data. Figures 2, 3, 4, 5, and 6 show the goodness of fit of the derived GP expressions for steel according to their order in Table 4, and Figures 7, 8, 9, 10, and 11 show the goodness of fit of the derived GP expressions for zinc according to their order in Table 5. In these figures, the variable Target on the $x$-axis shows the measured corrosion rate in the datasets, while the variable Output on the $y$-axis shows the corrosion rate estimated by the respective GP expression for the same dataset point. The reported results were obtained using our own GP implementation. Although robust GP systems such as Eureqa Formulize [12] are available, we got the best results using our own GP system, especially because we forced all environmental factors to appear in the final symbolic expression, something we had little control over in Eureqa Formulize. As can be seen, the GP expressions have very high goodness-of-fit values despite the necessarily noisy datasets. The GP ran for around 60 minutes to derive the corrosion-rate expressions of each metal before reaching the maximum number of generations.
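A size-penalized fitness and a goodness-of-fit value of this kind could be computed along the following lines. This Python sketch rests on stated assumptions: the error and the node count are added without weights, the goodness-of-fit value is taken to be the Pearson correlation coefficient between model output and target, expressions are encoded as nested tuples, and all numbers are placeholders.

```python
import math

def mse(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

def node_count(node):
    """Number of nodes in an AST given as nested tuples (op, left, right)."""
    if isinstance(node, tuple):
        return 1 + sum(node_count(child) for child in node[1:])
    return 1

def penalized_fitness(outputs, targets, expression):
    # Aggregate fitness: prediction error plus expression size as a penalty.
    return mse(outputs, targets) + node_count(expression)

def pearson_r(outputs, targets):
    # Correlation between model output and target (assumed meaning of R).
    n = len(targets)
    mo, mt = sum(outputs) / n, sum(targets) / n
    cov = sum((o - mo) * (t - mt) for o, t in zip(outputs, targets))
    so = math.sqrt(sum((o - mo) ** 2 for o in outputs))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    return cov / (so * st)

expression = ("+", ("*", 2, "x"), 1)  # 2x + 1, five nodes
targets = [1.0, 3.0, 5.0, 7.0]        # placeholder target values
outputs = [1.0, 3.0, 5.0, 9.0]        # placeholder model outputs

print(penalized_fitness(outputs, targets, expression))
print(round(pearson_r(outputs, targets), 4))
```

Since the MSE and the node count live on different scales, a practical implementation may need to weight the two terms; the unweighted sum here simply follows the description above.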

Results of Using GAs.
We used the robust MATLAB ga library to obtain the best results. We used the corrosion expression (1) from [13], where the constants $c_i$, $i \in [1 \cdots 10]$, are the ones to evolve using the GA. Table 6 lists the GA parameters used in the evaluation:
$\mathrm{CR} = \cdots \times c_9\,(\cdots + c_{10})$. (1)
Initially, the GA yielded very inaccurate expressions for steel and zinc (the error was in the range of $10^{18}$), which was rather unexpected. However, a closer investigation revealed that the GA was not performing protected division; that is, during the evaluation of the fitness of an individual that has $c_3 = 0$, $c_5 = 0$, or $c_7 = 0$, the fitness values were erroneous.
To circumvent this problem, we penalized individuals that have $c_3 = 0$, $c_5 = 0$, or $c_7 = 0$ by assigning poor fitness values to them. The GA corrosion expressions for steel and zinc are shown in (2) and (3) [...] of 5791 and 15.9, respectively. The goodness-of-fit plots are shown in Figure 12 for steel and Figure 13 for zinc:
$\mathrm{CR} = 1.5124 \cdots$ (2)
$\mathrm{CR} = 0.1757 \cdots$ (3)
As can be seen, the GA expressions are also accurate. The GA ran for around 15 minutes to derive each of the reported expressions before reaching the maximum number of generations.
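The two usual remedies for division by zero during fitness evaluation, protected division and penalization, can be sketched in Python as below. The model `toy_model` is a hypothetical stand-in that merely divides by the third, fifth, and seventh coefficients; it is not the corrosion expression from [13], and the penalty value and function names are likewise assumptions.

```python
PENALTY = 1e30  # arbitrary large value standing in for "poor fitness"

def protected_div(numerator, denominator, fallback=1.0):
    """Division that returns a fallback instead of failing on a zero denominator."""
    return numerator / denominator if denominator != 0 else fallback

def toy_model(c, x):
    # Hypothetical stand-in for a model that divides by c3, c5 and c7
    # (indices 2, 4 and 6 when counting from zero).
    return x / c[2] + x / c[4] + x / c[6]

def penalized_error(c, data, guarded=(2, 4, 6)):
    """Squared-error fitness that assigns a poor value to invalid individuals."""
    if any(c[i] == 0 for i in guarded):
        return PENALTY  # never evaluate the model on an invalid individual
    return sum((toy_model(c, x) - y) ** 2 for x, y in data)

data = [(1.0, 3.0), (2.0, 6.0)]  # placeholder observations
valid = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
invalid = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]

print(penalized_error(valid, data), penalized_error(invalid, data))
```

Penalization keeps the model itself untouched and lets selection remove the offending individuals, whereas protected division changes the value the model returns at the singular points.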

Discussion
Table 7 shows a summary of the accuracy of the two evolutionary techniques for predicting corrosion rates for steel and zinc.
In terms of usefulness, the GP expressions are superior to the GA expressions because they are derived automatically, that is, without knowledge about the structure of the target corrosion-rate expression, whereas we assume a specific corrosion-expression structure for GAs and evolve its parameters. The explicit GP corrosion-rate expressions give more insight into the corrosion process because they show how corrosion rate is affected by the environmental factors. The datasets used in the experiments are characterized by the presence of a number of outliers, which can be either (i) genuine data points where the corrosion rate deviates significantly from the average because of the inherent complexity of the corrosion process or (ii) erroneous data points that result from faults in measurement devices, human mistakes during data entry, and so forth. The analysis presented in this paper assumes case (i), that is, all data points are assumed to be valid.
As can be seen from the results, the slope of the fitting line is seemingly controlled by outliers. In order to investigate this issue further, we reran the GP evolution, this time handling the outliers better using two methods. First, we used a logarithmic distance measure $\log(1 + |y - f(x)|)$ instead of the squared error $(y - f(x))^2$. The outliers in the dataset are data points of large magnitudes, which means that if they do not lie close to the curve during evolution, they will affect the fitness of the solution substantially if their distance from the curve is measured by, for example, $(y - f(x))^2$. When the distance is logarithmic, the effect of outliers on the evolution of the curve is significantly reduced; for example, if $(y - f(x))^2 = 10^6$ then $\log(1 + |y - f(x)|) \approx 3$ (with the logarithm taken to base 10). Second, we removed the outliers altogether and used the squared error $(y - f(x))^2$ as we did previously. The resulting corrosion-rate expression for steel using the logarithmic error is shown in (4); the resulting corrosion-rate expression for zinc using the logarithmic error is shown in (5); the resulting corrosion-rate expression for steel after dropping outliers is shown in (6); and the resulting corrosion-rate expression for zinc after dropping outliers is shown in (7). [...] fit of the corrosion-rate expressions (4), (5), (6), and [...]

As can be seen from the results, using GP still gave good results when outliers were eliminated and when their effect was significantly reduced.
RH is relative humidity.
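The damping effect of the logarithmic distance measure discussed in this section can be checked numerically. The short Python sketch below compares the two error measures on a few illustrative residuals; the function names are assumptions, and the base-10 logarithm is chosen because it is the base for which a squared error of 10^6 corresponds to a logarithmic distance of about 3.

```python
import math

def squared_error(y, fx):
    return (y - fx) ** 2

def log_distance(y, fx):
    # Base-10 logarithm of one plus the absolute residual.
    return math.log10(1 + abs(y - fx))

residuals = [1.0, 10.0, 1000.0]  # illustrative distances |y - f(x)|
for r in residuals:
    print(r, squared_error(r, 0.0), round(log_distance(r, 0.0), 4))
```

A residual of 1000 contributes 10^6 to a squared-error fitness but only about 3.0004 to the logarithmic one, so a single distant point no longer dominates the sum.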

Related Work
Corrosion modeling is not a novel research area. Many corrosion models have been developed in the literature to yield expressions of metal corrosion, as shown in Table 8. Most of these models do not take into account all five environmental factors that we consider in this work (see Table 8).
In addition, corrosion modeling has also been attempted using artificial neural networks in numerous works, including [10, 20-22], and using support vector regression [11]; however, these techniques do not yield explicit corrosion-rate expressions.

Conclusions
In this paper, we have developed a corrosion model based on two evolutionary computation techniques, namely, genetic programming and genetic algorithms. Both techniques yielded corrosion-rate expressions with good accuracy, with genetic programming being superior because it can learn the structure of the corrosion expression without prior knowledge. The findings of this work will allow a better understanding of the corrosion phenomenon in terms of cause and effect so that necessary action, such as prevention measures, can be carried out.

Table 3: The GP parameters used in the experiments.

Table 4: The best five expressions obtained for the corrosion expression of steel using the GP. The fitness of an expression is its mean squared error (MSE), and its size is the number of nodes.

Table 6: The GA parameters used in the experiments.

Table 7: Summary of Model-Identification Goodness of Fit ($R$ values).