Predictive Power of Machine Learning for Optimizing Solar Water Heater Performance: The Potential Application of High-Throughput Screening

Department of Chemistry, The University of Texas at Austin, 105 E. 24th Street, Stop A5300, Austin, TX 78712, USA
Institute for Computational and Engineering Sciences, The University of Texas at Austin, 105 E. 24th Street, Stop A5300, Austin, TX 78712, USA
Department of Power Engineering, School of Energy, Power and Mechanical Engineering, North China Electric Power University, Baoding 071003, China
Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005-1827, USA
School of Chemistry and Chemical Engineering, Chongqing University of Technology, Chongqing 400054, China


Introduction
Designing a high-performance solar energy conversion system cost-effectively has long been a challenge. The solar water heater (SWH), a typical solar energy conversion system, has complicated heat transfer and storage properties that are not easy to measure or predict by conventional means. In general, an SWH system uses solar collectors and concentrators to gather, store, and use solar radiation to heat air or water in domestic, commercial, or industrial plants [1]. Designing a high-performance SWH requires knowledge of the correlations between the external settings and the coefficients of thermal performance (CTP). However, some of these correlations are hard to establish for the following reasons: (i) measurements are time-consuming [2]; (ii) control experiments are usually difficult to perform; and (iii) no current physical model can precisely describe the relationships between the external settings and the intrinsic properties of an SWH. There are state-of-the-art methods for estimating energy system properties [3][4][5] and for optimizing performance [6][7][8][9][10][11], but most of them are not suitable for solar energy systems. These problems, together with economic concerns, significantly hinder the rational design of high-performance SWHs.
Fortunately, machine learning, as a powerful technique for nonlinear fitting, can help us precisely acquire the values of CTPs from easily measured independent variables. With a sufficiently large database, a machine learning technique with appropriate algorithms can "learn" the numerical correlations hidden in the dataset via a nonlinear fitting process and make precise predictions. With such a technique, we do not need to find the exact physical model for each CTP and can directly obtain a precise prediction from a well-developed predictive model. Over the past decades, Kalogirou et al. have carried out a large number of machine learning-based numerical predictions of important CTPs for solar energy systems [12][13][14][15][16][17][18][19]. Their results show that machine learning techniques have great potential for application to energy systems. Building on this work, we recently developed a series of machine learning models for predicting the heat collection rates (daily heat collection per square meter of a solar water system, MJ/m²) and heat loss coefficients (the average heat loss per unit, W/(m³ K)) of a water-in-glass evacuated tube solar water heater (WGET-SWH) system [2,20,21]. Our results show that, with some easily measured independent variables (e.g., number of tubes and tube length), both heat collection rates and heat loss coefficients can be precisely predicted after proper training on the datasets with suitable algorithms (e.g., artificial neural networks (ANNs) [2,20], support vector machine (SVM) [2], and extreme learning machine (ELM) [21]). A user-friendly ANN-based software package was also developed for quick measurements [20]. These machine learning-assisted measurements shorten the measurement period from weeks to seconds, which is of clear industrial benefit. However, all the machine learning studies mentioned here are limited to prediction and/or measurement. So far, very few industries have put these methods into practical application. To the best of our knowledge, very few references address the optimization of the thermal performance of energy systems using such a powerful knowledge-based technique [22]. To address this challenge, we recently used a high-throughput screening (HTS) method combined with a well-trained ANN model to screen 3.5 × 10⁸ possible designs of new WGET-SWH settings, and the screening results were in good agreement with the subsequent experimental validations [23]. This is so far the first trial of HTS for solar energy system design. The HTS method (roughly defined as screening for the candidates with the best target properties using advanced high-throughput experimental and/or computational techniques) has already been widely used in biological [24][25][26][27][28] and computational [29][30][31] areas. Based on the concept of screening thousands or even millions of possible cases to discover the candidates with the best target functions or performance, HTS dramatically reduces the number of regular experiments required, saving considerable cost and manpower.
In this paper, we propose an HTS framework for optimizing a solar energy system. Taking the SWH as a case study, we show how this optimization strategy can be applied to the design of a novel solar energy system. Unlike the study by Liu et al. [23], this paper focuses on the predictive power of machine learning and the development of a general HTS framework. Instead of listing tedious mathematical details, we provide the vital details of the general modeling and HTS process. Since tube solar collectors have a substantially lower heat loss coefficient than other types of collectors [12,32], WGET-SWHs have gradually become popular over the past decades [33][34][35], with the advantages of excellent thermal performance and easy transportability [36,37]. For this reason, we chose the WGET-SWH system as a typical SWH to show how a well-developed ANN model can be used to cost-effectively optimize the thermal performance of an SWH system using an HTS method.

Principles of an ANN.
Various machine learning algorithms have been effectively applied to predict the properties of energy systems, such as ANN [12,13,17,18,20,38], SVM [20,39,40], and ELM [21,41]. Because the ANN is the most popular algorithm for numerical prediction [42], we introduce only the basic principle of the ANN here. A general schematic ANN structure is shown in Figure 1, with the input, hidden, and output layers each constructed from a certain number of "neurons." Each neuron (also called a "node") in the input layer represents a specific independent variable. The neuron in the output layer represents the dependent variable to be predicted. Usually, the independent variables should be easily measured variables that have a potential relationship with the dependent variable. The dependent variable is usually one that is hard to measure experimentally and is expected to be precisely predicted. The layer between the input and output layers shown in Figure 1 is the hidden layer. The optimal number of neurons in the hidden layer depends on the object of study and the scale of the dataset. Each neuron connects to all the neurons in the adjacent layer; each connection carries a weight (usually denoted w), and the weights, together with the activation functions, directly determine the predictive performance of the ANN. To train an ANN, the initial weights are first selected randomly, and subsequent iterations then find the optimal weight values that fulfill the prediction criteria. All data move in one direction only (from left to right, as shown in Figure 1). A well-trained ANN should have the optimal number of hidden layers, hidden-layer neurons, and weight values, which sufficiently avoids the risk of either under- or overfitting. In practical applications, there are many neural networks with modified algorithms, such as ELM [43][44][45], backpropagation neural network (BPNN) [46][47][48], and general regression neural network (GRNN) [49][50][51]. Although there are various network models, the basic principles of model training are similar.
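As an illustration of the layered structure described above, the following is a minimal sketch of the forward pass of a network with one hidden layer and a single output neuron. It is not the model used in the cited studies: the tanh activation, the bias terms, the layer sizes, and the function names `init_weights` and `forward` are all assumptions chosen for the example.

```python
import math
import random


def init_weights(n_in, n_hidden, seed=0):
    """Randomly initialise weights and biases for a one-hidden-layer network."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = rng.uniform(-1, 1)
    return w_hidden, b_hidden, w_out, b_out


def forward(x, params):
    """Propagate one input vector from the input layer to the output neuron."""
    w_hidden, b_hidden, w_out, b_out = params
    # Hidden layer: weighted sum of the inputs, passed through a tanh
    # activation (the activation choice is an assumption of this sketch).
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Output layer: linear combination of the hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Seven inputs and five hidden neurons; the input values are placeholders.
params = init_weights(n_in=7, n_hidden=5)
y = forward([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], params)
```

Training would then adjust the randomly initialised weights iteratively; that step is omitted here because it depends on the specific algorithm (e.g., BPNN or ELM).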

Training of an ANN.
To train a robust ANN, several factors should be considered: (i) the percentages of the training and testing sets; (ii) the number of hidden neurons; (iii) the number of hidden layers; and (iv) the time required for training. When training an ANN for real applications, a large training set is recommended. For predicting the heat collection rates of WGET-SWHs, we found that with a relatively large dataset (>900 data groups), a training set of more than 85% of the data helped to obtain a model with good predictive performance on the testing set [2]. Another reason to use a large training set is that a small training set percentage wastes data in practical applications, since more data groups for training usually lead to better predictive performance. When selecting the number of hidden neurons, it is important to try neuron numbers from low to high. If there are too few hidden neurons, there is a risk of underfitting; if there are too many, there is a risk of overfitting and training becomes time-consuming. Therefore, finding the best number of neurons by comparison is particularly important. It should be noted that in some special neural network methods (e.g., GRNN), the number of hidden neurons can be fixed in some software packages once the dataset is defined. In that case, it is no longer necessary to tune the hidden neuron settings. In addition to the number of hidden neurons, the same tests should be done on the number of layers in order to avoid either under- or overfitting. The last factor to consider is the training time. According to the basic principle of an ANN (Figure 1), the interconnections among neurons become more complicated as the number of neurons increases. Therefore, with a larger database and larger numbers of independent variables and hidden neurons, the training time is longer. This means that an ordinary personal computer (PC) sometimes cannot sustain a tedious cross-validation test. From our previous studies on ANN training [2,51], we found that if the database was sufficiently large, repeated training and/or cross-validation training led to insignificant fluctuation. In other words, for practical applications, the ANN training and testing results are robust if the database is large, so a cross-validation process can reasonably be skipped after a simple sensitivity test in order to save computational cost.
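The first two factors can be sketched in a few lines. The example below splits a dataset into training and testing sets and then picks the hidden-neuron count with the lowest testing error by trying candidates from low to high. The helper names are hypothetical, and `train_and_score` stands in for whatever training routine is actually used; the quadratic error curve passed to it is a made-up placeholder, not a measured result.

```python
import random


def split_dataset(samples, train_fraction=0.85, seed=0):
    """Shuffle the samples and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def select_hidden_neurons(candidates, train_and_score):
    """Try each hidden-neuron count from low to high; keep the lowest error.

    train_and_score(n) is a user-supplied callable (hypothetical here) that
    trains a network with n hidden neurons and returns its testing-set error.
    """
    scores = {n: train_and_score(n) for n in sorted(candidates)}
    best = min(scores, key=scores.get)
    return best, scores


# 915 sample indices, split 85% / 15%.
train, test = split_dataset(range(915))

# Placeholder error curve: too few neurons underfit, too many overfit.
best, scores = select_hidden_neurons(range(2, 21), lambda n: (n - 9) ** 2 + 1.0)
```

The same comparison loop can be reused for the number of hidden layers by letting the callable take a layer count instead.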

Testing of an ANN.
Taking the testing result of an ANN for the prediction of heat collection rate as an example (Figure 2), we can see that a well-trained ANN can precisely predict the heat collection rates of the data in the testing set, with relatively low absolute residual values. Although deviations still exist for some predicted points, the overall accuracy is relatively high and acceptable for practical applications. It should be noted that for a solar energy system, the independent variables for modeling should always include some environmental variables, such as solar radiation intensity and ambient temperature [2]. These variables are highly dependent on the external temperature, location, and season. That is to say, the data to be predicted should come from environmental conditions similar to those of the data used for model training. Otherwise, the ANN may not deliver good predictive performance. In all of our recent studies, the data measurements were performed in very similar seasons, temperatures, and locations, which sufficiently ensures precise predictions in both the testing set and the subsequent experimental validation.
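The residual analysis described above can be reproduced with a few standard error measures. The sketch below assumes that a residual is defined as the predicted value minus the actual value; the source does not state the sign convention, and the function names are illustrative only.

```python
import math


def residuals(actual, predicted):
    """Residual per testing sample, taken here as predicted minus actual."""
    return [p - a for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean square error over the testing set."""
    res = residuals(actual, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))


def mean_absolute_error(actual, predicted):
    """Mean of the absolute residuals over the testing set."""
    res = residuals(actual, predicted)
    return sum(abs(r) for r in res) / len(res)
```

Plotting the residuals against the actual and the predicted values, as in panels (b) and (c) of Figure 2, helps reveal whether the errors grow in a particular region of the data.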

High-Throughput Screening (HTS)
The basic idea of computational HTS is simple: calculate all possible systems within a certain time period (using fast algorithms) and screen for the candidates with the target performance. Previously, Greeley et al. used density functional theory (DFT) calculations to screen and design high-performance metallic catalysts for the hydrogen evolution reaction via an HTS method, in good agreement with experimental validations [29]. Hautier et al. combined DFT calculations, machine learning, and HTS methods to predict the missing ternary oxide compounds in nature and develop a complete ternary oxide database [31], which shows that a machine learning-assisted HTS process can be used for new material prediction and discovery. However, although the HTS method has been widely used in many areas, its conceptual application to energy system optimization was not reported during the past decade. Very recently, our studies showed that the machine learning-assisted HTS process can be effectively applied to the optimization of solar energy systems [23]. Taking the WGET-SWH as a case study, our results show that an HTS process with a well-trained ANN model can be used to optimize the heat collection rate of an SWH. The first step was to generate an extremely large number of independent variable combinations (3.5 × 10⁸ possible design combinations) as the input of a well-trained ANN model. The heat collection rate of each combination was then predicted by the ANN. After that, the new designs with high predicted heat collection rates were recorded in a candidate database. For validation, we installed two screened cases and performed rigorous measurements. The experimental results showed that the two selected cases had higher average heat collection rates than all the existing cases in our previous measurement database. Similar to a previous chemical HTS concept proposed by Pyzer-Knapp et al. [52], we reconstruct the process of this optimization method in Figure 3. More modeling and experimental details can be found in [23].
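The generate, predict, and record loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [23]: the function name `screen`, the grid values, and the toy predictor that stands in for a trained ANN are all invented for the example.

```python
import heapq
import itertools


def screen(value_grid, predict, top_k=10):
    """Enumerate every combination in value_grid and keep the top_k predictions.

    value_grid: dict mapping variable name -> list of candidate values.
    predict:    callable taking a dict of variable values and returning the
                predicted target (a stand-in for a trained model).
    """
    names = list(value_grid)
    combos = itertools.product(*(value_grid[n] for n in names))
    scored = ((predict(dict(zip(names, c))), c) for c in combos)
    # nlargest consumes the generator lazily, so memory use scales with
    # top_k rather than with the total number of combinations.
    best = heapq.nlargest(top_k, scored, key=lambda item: item[0])
    return [(score, dict(zip(names, c))) for score, c in best]


# Illustrative grid and toy predictor (made-up numbers, not measured data).
grid = {
    "tube_length": [1.6, 1.8, 2.0],
    "number_of_tubes": [10, 20, 30],
    "tank_volume": [100, 150, 200],
}


def toy_predict(d):
    return d["tube_length"] * d["number_of_tubes"] / d["tank_volume"]


top = screen(grid, toy_predict, top_k=2)
```

Streaming the combinations instead of materialising them matters at the scale discussed here, since a list of hundreds of millions of design tuples would not fit comfortably in the memory of an ordinary PC.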

HTS-Based Optimization Framework
Based on the recent trials of the HTS-based optimization method on the SWH system, we here propose a framework for the design and optimization of solar energy systems. Although the machine learning-based HTS method is a quick design strategy, its preconditions should be fulfilled rigorously. That is, two vital conditions should be fulfilled: (i) a well-trained machine learning model and (ii) a rational generation of possible inputs.

A Well-Trained Machine Learning Model.
To acquire a well-trained machine learning model, in addition to the regular training and testing processes described in Sections 2.2 and 2.3, another key step is to define the independent variables for training. Since the dependent variable is usually the quantified performance of the energy system, selecting independent variables that have potential relationships with the dependent variable directly determines the predictive precision of the model. In our previous case [23], we chose seven independent variables as the inputs: tube length, number of tubes, tube center distance, tank volume, collector area, final temperature, and tilting angle (the angle between the tubes and the ground). A 3-D schematic design of a WGET-SWH system is shown in Figure 4 [23], which shows that with these independent variables alone, plus some minor empirical settings, a WGET-SWH system can be reconstructed quickly. Unlike a physical model (which requires rigorous mathematical deduction and hypotheses), machine learning does not require the user to know exactly the relationships between the independent and dependent variables. This feature also makes the machine learning prediction method more flexible than conventional methods. Of these seven inputs, all except the final temperature are important parameters of a WGET-SWH. As for the final temperature, we found that it is extremely important for ensuring a precise model for heat collection rate prediction. The reason is simple: the heat collection performance of a WGET-SWH is determined not only by the mechanical settings of the system but also by environmental conditions such as solar radiation intensity, ambient temperature, and the final temperature. Since the solar radiation intensity correlates well with the final temperature in a nonphotovoltaic heat transfer system and is not easy to measure, we did not consider it as a variable for model training. Also, because the ambient temperatures were very similar during the measurements of all the SWHs in our database (we performed all the measurements in similar months and locations), we also removed ambient temperature from the variable list. It should be noted that for measurements gathered in various seasons and unstable weather, the ambient temperature can be important and should not be neglected in modeling.
Results show that without the solar radiation intensity and ambient temperature, our predictive models were still sufficiently precise and robust [2]. Reducing the number of independent variables in this way not only dramatically reduces the time required for model training but also simplifies the input generation process in the subsequent HTS application. Another vital factor is the scale and size of the database. Due to the complexity of energy collection and transfer systems, there are usually a large number of independent variables. To ensure good training, a large database covering a wide range should be used. If the database for training is too small, fitting will produce high error rates; if the range of the database is too narrow, the trained model will perform well only in a very local data range, sacrificing precision for data in more remote regions. Many previous cases show that a large and wide database is crucial for good practical prediction [53]. In our case study, the ranges of the independent variables were wide enough to ensure good predictive performance of the ANN [2]. Detailed descriptive statistics (maximum, minimum, data range, average value, and standard deviation) of the WGET-SWH database we used for training are shown in Table 1.
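A simple way to act on the range requirement is to compute the descriptive statistics per variable and to flag any candidate design that falls outside the range covered by the training data. The sketch below is illustrative: the variable names `x1` and `x2` and their values are placeholders, the helper names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import statistics


def describe(values):
    """Descriptive statistics of one variable in the training database."""
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
    }


def outside_training_range(candidate, stats):
    """Names of the candidate's variables that lie outside the training range."""
    return [
        name
        for name, value in candidate.items()
        if not stats[name]["min"] <= value <= stats[name]["max"]
    ]


# Placeholder data for two generic variables (not values from Table 1).
stats = {
    "x1": describe([1.0, 2.0, 3.0, 4.0]),
    "x2": describe([10.0, 20.0, 30.0]),
}
flags = outside_training_range({"x1": 5.0, "x2": 15.0}, stats)
```

A flagged variable means the model would be extrapolating for that design, which is the situation in which a narrowly ranged database gives unreliable predictions.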


A Rational Generation of Possible Inputs.
A rational generation of inputs for the ANN during the HTS process is also crucial to ensure a quick HTS with low time consumption. Without a rational criterion, there will be infinitely many possible combinations, which will lead to infinite computational cost. In our current study, we found that a quick way is to generate the inputs according to the trained weights of each independent variable: an independent variable with a higher numerical weight in the model is assigned more possible values as ANN inputs during prediction. The basic assumption is that a larger weight leads to a more significant change in the predicted results. In Liu et al. [23], we showed that the tank volume has the highest weight in determining the heat collection rate, which also agrees qualitatively with empirical knowledge. Thus, we generated more input values of tank volume for the HTS process. Table 2 shows the numbers of selected values of the independent variables for screening the optimized WGET-SWHs via an HTS process [23]. Except for the final temperature, the number of values of each independent variable was assigned according to its weight ranking after a typical and robust ANN training. As for the final temperature, since it is not a part of the SWH installation, we used all its possible integer values in the database (Table 1) as inputs for HTS. It should be noted that the weight values of a trained ANN do not carry exact physical meaning, because the initial weights for ANN training are usually selected randomly, and multiple trainings of an ANN will lead to different final weight values. Thus, in addition to referring to the trained weight values, we should sometimes manually assign more possible input values to the independent variables that are physically more influential on the predicted results. For other weight-free algorithms (e.g., SVM), manual choices of inputs are particularly important.
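The weight-ranked allocation described above can be sketched as follows. How the weights of one input are condensed into a single ranking score is not specified in the text, so the sum of absolute input-to-hidden weights used here is an assumed heuristic; the weight matrix, the value counts, and the helper names are likewise made up for illustration.

```python
import math


def input_relevance(w_hidden):
    """Sum of absolute input-to-hidden weights per input (assumed heuristic)."""
    n_in = len(w_hidden[0])
    return [sum(abs(row[i]) for row in w_hidden) for i in range(n_in)]


def allocate_counts(relevance, counts):
    """Give the largest value count to the highest-ranked variable, and so on."""
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    ranked_counts = sorted(counts, reverse=True)
    allocation = [0] * len(relevance)
    for rank, i in enumerate(order):
        allocation[i] = ranked_counts[rank]
    return allocation


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


# Made-up weights: 2 hidden neurons (rows) by 3 input variables (columns).
w_hidden = [[0.2, -1.5, 0.4], [-0.1, 0.9, 0.7]]
rel = input_relevance(w_hidden)
alloc = allocate_counts(rel, counts=[3, 5, 9])
total = math.prod(alloc)  # number of combinations the screening must evaluate
```

Because the total number of combinations is the product of the per-variable counts, checking `total` before screening shows whether the grid is affordable. Since retraining can change the ranking, the allocation should be reviewed against physical intuition, as noted above.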

Experimental Validation.
With the generated independent variable values as inputs, the machine learning model can output the predicted heat collection rates in an extremely short time. After screening, the designs with high predicted heat collection rates can be recorded as candidates for future applications. In our recent studies, two typical designs from an HTS process were selected for experimental installation, with their independent variables summarized in Table 3.

Step 1: Select the independent and dependent variables for the machine learning model.
Step 2: Train and test a predictive machine learning model with a proper experimental database.
Step 3: Generate a large number of the combinations of independent variable values.
Step 4: Input the generated independent variables into the well-trained predictive model.
Step 5: Screen and record the output dependent variable values, together with their corresponding independent variable values, that fulfill all the screening criteria.
Step 6: Select the candidates from the results of Step 5 for experimental validation.
Step 7: Record the experimental results from Step 6.
To sum up, the proposed framework is shown in Figure 5. Once all the preconditions of the "cylinders" discussed above are fulfilled, a complete machine learning-assisted HTS process can be achieved. The ultimate goal of the screening is to find better candidates with optimized target performance. These candidates will have independent variables that differ (or partially differ) from those in the previous experimental database. By combining the previous experimental database with the experimental validation of the newly designed candidates, we can construct a new experimental database with more information for future applications. It should be noted that this framework works not only for solar energy systems but also for the optimization of other devices. We expect that this framework can be extended to other optimization demands in the future.

Conclusions
In this paper, we have summarized our recent studies on the predictive performance of machine learning for an energy system and proposed a framework for SWH design using a machine learning-based HTS method. This framework consists of (i) developing a predictive model and (ii) screening possible candidates. A combined computational and experimental case study on the WGET-SWH shows that this framework can help efficiently design new WGET-SWHs with optimized performance without detailed knowledge of the physical relationships between the SWH settings and the target performance. We expect that this study can fill the gap in HTS applications for optimizing energy systems and provide new insight into the design of high-performance energy systems.

Figure 1: Schematic structure of a typical ANN. Circles represent the neurons in the algorithm.

Figure 2: Testing results using an ANN model for the prediction of heat collection rate for WGET-SWHs. (a) Predicted values versus actual values; (b) residual values versus actual values; and (c) residual values versus predicted values. Reproduced with permission from Liu et al. [2].

Figure 3: An HTS process for solar energy system optimization. Each orange circle represents a possible design.

Figure 5: A proposed framework of a machine learning-assisted HTS process for target performance optimization. Independent variables are denoted "ind." Dependent variables are denoted "dep." {A in } represents the original experimental database. {B in } represents the generated independent variables as the inputs. {B in (new)} represents the generated independent variables and their predicted dependent variables. {C in } represents the new experimental database combining the original experimental database and the experimental validation results of the screened candidates.

Table 1: Descriptive statistics of the variables for 915 samples of in-service WGET-SWHs. Reproduced with permission from Liu et al. [2]. TCD: tube center distance; final temp.: final temperature; HCR: heat collection rate (MJ/m²). Tank volume was defined as the maximum mass of water in the tank (kg).

Table 2: Number of selected values of different independent variables (extrinsic properties). Reproduced with permission from Liu et al. [23].

Table 4: Measured heat collection rates (MJ/m²) of the two novel designs. All the measurements were performed under environmental conditions similar to those of the measurements for the previous database (Table 1). Reproduced with permission from Liu et al. [23].