Transparent and Interpretable State of Health Forecasting of Lithium-Ion Batteries with Deep Learning and Saliency Maps

Batteries are the most expensive component of battery electric vehicles (BEVs), but they degrade over time and with battery operation. State of health (SOH) forecasting models learn how battery operation over long time periods of weeks or months influences battery aging. Existing methods for SOH forecasting of lithium-ion batteries based on deep neural network (DNN) models lack explainability of their forecasts due to their inherent black box character. However, the explainability of forecasts is essential to build user trust in the forecasting models. In this work, we address this problem from two perspectives: First, we compared four machine learning (ML) models, such as decision tree and random forest, which are inherently transparent, to two new DNN architectures with a more inherent black box character. Second, we proposed a new method using Gaussian-filtered saliency maps to visualize battery operational states that are relevant to DNN models. This method is applied to the best previously trained DNN models. We used an extensive data corpus consisting of five public data sets with different operational conditions, battery types, and aging trajectories. Furthermore, we show that the Gaussian-filtered saliency maps meaningfully visualize battery operational states that are consistent with findings from controlled laboratory aging experiments. Thus, this work adds transparency and interpretability to the SOH forecasting results of two state-of-the-art DNNs, mitigating their inherent black box character while maintaining their superior performance compared to transparent ML models.


Introduction
Global lithium-ion battery (LIB) production capacity has been increasing since 2016. This trend is expected to continue exponentially, mainly driven by the growing demand for LIBs for battery electric vehicles (BEVs) [1][2][3]. For BEV users, owners, and fleet operators, the state of health (SOH) of the batteries is one of their main concerns, because it reflects the aging of the battery depending on its usage and environmental conditions [4][5][6]. The task of determining the current SOH with the battery data available at the current point in time is called SOH estimation [7][8][9]. When the battery ages, the SOH decreases. Modeling this change of the SOH from a current SOH to a future SOH due to aging causes is called SOH forecasting. These aging causes are encoded in the battery operational load through parameters like state of charge (SOC), temperature, and current [9]. SOH forecasting is also called "battery aging prediction" [7,8].
Forecasting the future SOH enables assessments of the future residual value of the battery and planning of battery replacements. It also may help to derive operational strategies for BEV fleets as one of the proposed applications [9]. Until now, only a few forecasting methods have been applicable in real-world vehicle operation [9,10]. Apart from the limited transferability of models to new battery types or new battery operation, the black box behavior of most machine learning (ML) models, especially of deep learning (DL) models, is also problematic: when running in prediction mode, only the model's input and the model's prediction are available, without any further explanation. However, the understandability of predictions is important to build user trust in the models [11].
In terms of model understandability, Bzdok and Ioannidis [12] describe a trade-off between model complexity and transparency: Complexity gives a model the ability to learn more sophisticated problems. A model is transparent if it is understandable by itself; that is, it has inherent white box character. Examples of transparent models are k-nearest neighbor (k-NN), decision trees (DTs), and linear models [13,14] (for other less popular definitions of transparency, interpretability, and explainability, consult [15,16]). Such models are transparent but offer limited learning capacity, while more complex models like support vector regression and multilayer perceptrons (MLPs) can learn complex problems but are nontransparent [17]. For these nontransparent models, understandability by humans can be achieved by further explanations. Ensemble methods like random forest (RF) are challenging because they combine the predictions of several models like DTs. The biggest challenge is in artificial neural networks (ANNs), especially in deep ANNs as DL models [17].
Building on an existing SOH forecasting method applying an MLP architecture with residence time histograms of current, temperature, and SOC [18], this work addresses the following research questions (RQs): Which established ML methods provide comparable results to DL architectures while improving the explainability of the predictions? (RQ.1) How can the explainability of SOH forecasts be provided using DL models with inherent black box character? (RQ.2) This paper contributes by answering both RQs: First, we compare ML methods with novel DL architectures on the task of SOH forecasting on an aggregated data set consisting of five different public battery cell data sets from laboratory operation (RQ.1). As ML methods with more inherent white box character, we utilize DT, RF, extreme gradient boosting (XGBoost), and histogram-based gradient boosting (HistXGBoost), while the two newly introduced DL methods with more inherent black box character are multilayer perceptrons with residual blocks (ResMLPs) and two-dimensional convolutional neural networks with residual blocks (ResCNNs). Second, we answer RQ.2 by further analyzing a ResMLP and a two-dimensional ResCNN through a newly introduced method of applying saliency maps with Gaussian filters to explain their SOH forecasts. Because saliency maps tend to result in noisy representations, we added Gaussian filters to reduce the noise and enable experts to interpret a prediction of the model. Furthermore, combining the saliency maps with Gaussian filters and a data representation that includes human-understandable features allows deep and detailed insights into the model's predictions.
The remainder of this paper is structured as follows: Section 2 introduces the fundamentals of battery aging and SOH as well as ML models and explainable AI (XAI) methods. Section 3 presents an existing SOH forecasting method applied in this work and related work of XAI methods for battery aging models. In Section 4, the SOH forecasting method with two DL architectures, the gradient-based saliency maps, and the utilized data sets with their preprocessing are explained. In Section 5, experiments and results are presented and discussed. Section 6 concludes our work.

Fundamentals
2.1. Battery and State of Health. LIBs are popular electrochemical energy storage devices that convert chemical energy into electrical energy (discharging); the process is reversed for charging [19]. For a schematic representation of a typical LIB cell and information on the operating principle, readers are referred to [20][21][22]. Batteries age due to use and environmental conditions; that is, their performance degrades. Therefore, one common measure is the SOH; more precisely, the capacity-based state of health (SOH_C) describes the remaining capacity C(t) relative to the initial capacity of a new battery, also called nominal capacity (C_nom) [6, 23-27]:

SOH_C(t) = C(t) / C_nom

In the following, we focus on the SOH_C because it reflects the range of BEVs, and refer to it as SOH for simplicity. In automotive applications of batteries, the relevant lower threshold of the SOH is usually considered to be the end of life (EOL) at SOH = 80% [24,26,28]. A more detailed introduction to the SOH and EOL criteria is presented by von Bülow and Meisen [9].
Battery aging can be structured into two causes, which are considered in this paper: calendar aging and cyclic aging. Calendar aging is associated with the storage of batteries; that is, there is no charging or discharging. Hence, it is also called passive aging. Cyclic aging corresponds to the impact of battery usage on the SOH, i.e., aging due to charging and discharging [29].
High temperatures T and high SOCs cause rapid battery calendar and cyclic aging [30]. For example, a high SOC above 80% accelerates solid-electrolyte interphase (SEI) growth [31]. Other stressors that accelerate battery aging are a high state of charge lift (ΔSOC) as well as high charge and discharge currents I, i.e., the current-rate (C-rate) [29,32]. The C-rate, with [C-rate] = 1/h = A/Ah, is the current relative to the nominal capacity. Known battery stressors are qualitatively presented in Table 1. For example, in the 2nd row, ↑ and ↓ mean that high and low SOCs, respectively, accelerate SEI formation.

Machine Learning Models.
In this section, we briefly introduce the four aforementioned ML models DT, RF, XGBoost, and HistXGBoost that are relevant to RQ.1.
The first ML model is a DT for regression tasks, which is an explainable tree model. A DT consists of hierarchically encoded simple binary decision rules pointing to child nodes [35]. These simple rules can be combined to produce understandable predictions. A single DT tends to overfit on the training data, which is compensated by pruning when the tree reaches a maximum size.
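As a minimal illustration of the rule-based transparency of such a DT regressor, the following scikit-learn sketch prints the learned binary rules; the feature names (`t_high_soc`, `t_high_T`) and the toy ΔSOH target are illustrative assumptions, not the paper's data:

```python
# Sketch: a transparent DT regressor on synthetic "stressor-like" features.
# Features and target are illustrative, not from the paper's data sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))   # e.g., time at high SOC, time at high T
y = -0.5 * X[:, 0] - 0.3 * X[:, 1]     # toy ΔSOH in percentage points

# Limiting max_depth acts as pre-pruning against overfitting.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The hierarchical decision rules can be printed, making predictions traceable.
print(export_text(tree, feature_names=["t_high_soc", "t_high_T"]))
```

The printed rule list is exactly the white box behavior discussed above: every prediction can be traced along one root-to-leaf path.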

International Journal of Energy Research

A further development of the DT, the RF algorithm, has been shown to perform well on large-scale data sets [36] but lacks explainability compared to the DT. Instead of consisting of a single optimized tree, an RF consists of an ensemble of binary trees, each trained on a random subset of the training data [37]; by the law of large numbers, RFs are less likely to overfit the training data [38]. A final prediction is aggregated over the ensemble of independent estimates using majority voting to achieve better results [39]. The third ML model is the XGBoost tree boosting algorithm. Unlike the tree algorithms described before, XGBoost contains an ensemble of trees with continuous scores for each leaf. Furthermore, XGBoost optimizes this ensemble by minimizing a second-order approximation objective [40]. The last ML model is the HistXGBoost, which enhances the XGBoost algorithm by using a histogram-based algorithm to create discrete feature bins from continuous feature values. This improves the inefficient finding of the optimal split points in the underlying DT. It also speeds up the training process and is more memory-efficient [41].
These previously mentioned ML models allow for different levels of interpretability. Shallow rule-based models such as DTs have an inherent white box behavior [42] because the rules of a DT can be extracted to provide a clear representation of why a particular prediction was made [35,43]. RF, XGBoost, and HistXGBoost contain an ensemble of different trees, each with different rules. Consequently, these tree ensemble models cannot be interpreted at a rule level because the final prediction is based on a voting mechanism that combines the predictions of all these trees. However, the feature importance of the prediction can be determined by calculating an importance score of how influential a particular input was on the prediction. This allows for an interpretation of why a model made a particular prediction [44].
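Such importance scores can be read directly from a fitted tree ensemble; a small sketch with synthetic data, where one dominant feature is constructed on purpose (the data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 3))
y = -0.8 * X[:, 0] + 0.05 * rng.normal(size=300)   # only feature 0 matters

rf = RandomForestRegressor(n_estimators=50, random_state=2).fit(X, y)
# Impurity-based importance scores: one value per input feature, summing to 1.
print(rf.feature_importances_)
```

The dominant score for the first feature reflects exactly the kind of "how influential was this input" interpretation described above.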

Explainable AI.
In order to achieve better explainability, two strategies are possible when starting with a black box model: (1) switching to a model with more or better white box behavior and the same, similar, or even better performance or (2) opening up black box models with additional methods. According to Benchekroun et al. [15], the latter is called postmodel explainability. Formally, a model approximates a function F : ℝ^m → ℝ^o with the input X ∈ ℝ^m and the output Y ∈ ℝ^o. Explanation methods map the model inputs X to outputs of the same shape via an explanation map E : ℝ^m → ℝ^m. One explanation method is the gradient explanation E_grad(X) = ∂F(X)/∂X, which quantifies the influence of each input value x ∈ X on the prediction F(X) locally, i.e., in a small region around x. For two-dimensional objects like images, videos, or 2D histograms, it holds m = m₁ × m₂. In this work, we call these two-dimensional gradient explanations saliency maps. Such a saliency map discriminates important areas with respect to the class of a given image [45]. For more theory on saliency maps, consult [45,46]. The gradient explanation is computationally fast as it only requires one backpropagation [45,46]. Other gradient explanation methods, like guided backpropagation, guided gradient-weighted class activation mapping, and SmoothGrad, are summarized and analyzed in [46].
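A minimal sketch of the gradient explanation E_grad(X) = ∂F(X)/∂X for a tiny one-hidden-layer network, written in plain NumPy so that the single backward pass is explicit; weights and shapes are illustrative, and the result is checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(5, 8))   # input dim m = 5, hidden width 8
W2 = rng.normal(size=(8, 1))   # scalar output, o = 1

def F(x):                      # forward pass: F(x) = tanh(x W1) W2
    return np.tanh(x @ W1) @ W2

def E_grad(x):                 # one backward pass: dF/dx
    h = np.tanh(x @ W1)        # hidden activations
    dh = 1.0 - h**2            # derivative of tanh at x W1
    return (dh * W2.ravel()) @ W1.T

x = rng.normal(size=5)
g = E_grad(x)

# Sanity check against a central finite-difference approximation of dF/dx.
eps = 1e-6
num = np.array([(F(x + eps*np.eye(5)[j]) - F(x - eps*np.eye(5)[j]))[0] / (2*eps)
                for j in range(5)])
print(np.max(np.abs(g - num)))   # numerically close to zero
```

Reshaping such a gradient vector into the 2D input shape is what yields a saliency map for histogram inputs.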
Unlike gradient-based methods, the Shapley additive explanation (SHAP) is based on the Shapley value. The Shapley value is the average marginal contribution of a feature value based on cooperative game theory. It can be interpreted as the importance of each feature for a given sample [47].
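For small feature counts, the Shapley value can be computed exactly by enumerating all coalitions, which makes the "average marginal contribution" definition concrete; the toy model, baseline, and feature values below are illustrative assumptions:

```python
import itertools, math
import numpy as np

def f(x):                       # toy model with an x0-x1 interaction
    return 2*x[0] + x[1] + x[0]*x[1]

baseline = np.zeros(3)          # reference values for "absent" features
x = np.array([1.0, 1.0, 1.0])
n = 3

def value(S):                   # model output with only features in S present
    z = baseline.copy()
    for j in S:
        z[j] = x[j]
    return f(z)

phi = np.zeros(n)
for j in range(n):
    others = [k for k in range(n) if k != j]
    for r in range(n):
        for S in itertools.combinations(others, r):
            # Shapley weight |S|!(n-|S|-1)!/n! times marginal contribution of j
            w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
            phi[j] += w * (value(S + (j,)) - value(S))
print(phi)   # Shapley values; their sum equals f(x) - f(baseline)
```

The efficiency property checked here (contributions summing to the difference from the baseline prediction) is what makes Shapley values attractive as local feature importances.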

Related Work
3.1. State of Health Forecasting. Regarding RQ.1, a structured literature review on SOH forecasting methods by von Bülow and Meisen [9] already exists. They distinguish SOH estimation, i.e., the determination of a state, from SOH forecasting, i.e., the determination of a state change, among others by two criteria ([9], Section 2.1): first, the relative temporal position of the model inputs, and second, what the input features of the model encode. For SOH estimation, the features encode the effect of the battery aging, like the SOH or capacity trajectory (autoregressive models) or changes in the partial charge curves. These data are obtained from the past or present until the current point in time [48,49]. For SOH forecasting, the features encode the causes of the battery aging, i.e., the operational load on the battery through parameters such as SOC, temperature, and current. The data on the causes of battery aging are collected from the current point in time t₁ to the end of the forecast horizon t₂. This is required for training, while for inference, scenario-based adaptations can be made to express the future battery operational load, e.g., halving the amount of high power charging. They highlight the suitability of two-dimensional residence time histograms as features for SOH forecasting with an MLP and LIB cell data from laboratory operation [18]. The two-dimensionality improves the interpretability of the battery operation over multiple cycles. The same method has also been applied to LIB system data from BEV operation [50] and investigated regarding transferability between different cell data sets [51].
Their SOH forecasting method [18] is based on the perception mentioned in Section 2.1 that battery aging is a state change from a current SOH(t₁) to a future SOH(t₂) due to aging causes. The aging causes are encoded in the battery operational data during that time t ∈ [t₁, t₂], which consists of multidimensional time series signals of C-rate, temperature, and SOC. The structure of the method is shown in Figure 1: (1) The above-mentioned battery operational data is used to extract stressor table data of battery stressor types that are known to cause battery aging. A battery stressor type is defined by one or more relevant battery signals, each limited by an interval (histogram-like binning). For example, one stressor type is the battery operation within a C-rate of 3C to 4C and a temperature of 31°C to 32°C as depicted in Figure 2.
(2) The flattened stressor table data is used as input to the ML model, which outputs the state of health change (ΔSOH) from a current SOH(t₁) to a future SOH(t₂). The SOH values are assumed to be known for model training.
To investigate RQ.1, we use different ML models for the 2nd process block shown in Figure 1 in our method.
3.2. Explainability of Battery Aging Models. SOH forecasting models learn how the operational parameters current I, temperature T, and SOC of the battery during a time window [t₁, t₂] influence the SOH(t₂) at the end of this time window. The explainability of SOH forecasts with DL models (RQ.2) shall provide transparency on the model's forecast under a given operational load, e.g., to identify critical load situations or patterns. Thus, autoregressive models with a sequence of C or SOH, but without explicit I, T, and SOC as inputs, are not suitable for SOH forecasting because the operational load is only an implicit input [9]. There is work in the broader context of XAI and battery aging that also includes autoregressive models: Lee et al. [52] trained various ML models for SOH estimation of LIBs: RF, gradient boosting machine, support vector machine, k-NN, MLP, and gated recurrent unit. For each of these, the future SOH value at the target cycle was learned from the moving average, moving first-order difference, and moving variance up to the start cycle of the prediction. Then, SHAP [47] was applied to explain the model's predictions. However, the Shapley values only provide local explainability. This means that explainability over multiple samples and features is not possible without further modifications. Therefore, Lee et al. [52] calculated the average Shapley value for each input feature.

Figure 1: SOH forecasting method structure: (1) stressor extraction and (2) ML model (from our previous work [18]).

The authors of [53] predicted the battery degradation trajectory with an autoregressive long short-term memory (LSTM) based on in-cycle features, previous capacities C, charge and discharge C-rate, and the cycle counter k. They applied a saliency analysis to the LSTM on the test data; that is, they took the absolute gradient of the output over the time series input as introduced in Section 2.3.
This shows that the relevance of the features changes over time: towards the kneepoint, the charge and discharge C-rates are more important, while in earlier cycles, the capacities are more relevant.
Liu et al. [54] proposed an RF-based method for predicting battery capacity at an early stage of production. They focused on several parameters from the battery cell production step of coating and the influence of variations of these parameters on the battery capacity. Their method is aimed at dynamically explaining the predictions. Therefore, they introduced the accumulated local effect (ALE), which simplifies the prediction function. The ALE averages variations of several predictions and accumulates them in a sampling grid for each feature. Thus, it compares the effect of a single feature at a given value to the average prediction. The ALE can be extended to express the interaction effect of two different features.
In another work, Liu et al. [55] focused on parameters in the production steps of mixing and coating at an early production state. They implemented a boosting tree-based framework using AdaBoost, LPBoost, and TotalBoost. After model training, feature importance and correlation were analyzed using the Gini index and the predictive measure of association (PMOA). The Gini index measures the variation of impurity, i.e., statistical dispersion [56]. Thus, using the Gini index in this application, the impurity can be interpreted as the goodness of a split at the nodes of a DT. PMOA quantifies analogies between different splitting rules. The resulting important features can be utilized by the battery manufacturer to better control the production process and improve quality.
Zhang et al. [57] implemented an attention-based deep neural network (DNN) that learns the capacity at each cycle, C_{t+N} = f(X_t, …, X_{t+N−1}), from the inputs X_t containing the N previous cycles as sequential data of discharge capacity, partial charge, and discharge voltage curves. The model is partially autoregressive because it predicts C_{t+N} at each cycle given the capacities of the previous N cycles as part of X. N is set to 3 for the NASA [58] and the Oxford [59] data sets and to 30 for the Stanford data set [60]. For the Oxford data set with N = 3, the most recent measurements X_{t+N−1} = X_{t+2} have the largest attention weight over the entire battery life; that is, they are considered to be the most influential.
Mamo and Wang [61] trained an attention-based LSTM with the cycle number k and temperature T as inputs to analyze the capacity degradation. However, they do not analyze the interpretability or attention weights of the model. Still, for more complex input features, attention weights may provide interpretability.
In summary, these papers indicate that XAI is still at an early stage of application for battery aging models, especially for SOH forecasting models, but the interpretability of these models is urgently needed to build trust in their forecasts.

Method
Building on the method for SOH forecasting introduced in Section 3.1, we present two DL architectures in Section 4.1 which are used in the 2nd process block called "ML model" in Figure 1 (RQ.1). In Section 4.2, we introduce a method for XAI using saliency maps to overcome the lack of interpretability of the applied DL-based black box models (RQ.2). Section 4.3 gives an overview of the applied data sets and their preprocessing.

Deep Learning Models.
When encoding battery operation in two-dimensional stressor tables, e.g., of T and SOC, different battery operation leads to distinct patterns in the stressor tables. The consideration of these patterns can be beneficial for the SOH forecast. Hence, we want to investigate whether it is advantageous for a DL model to consider not only single separate bins in the stressor tables but also groups of neighboring bins. These groups of neighboring bins can be treated similarly to pixels in images, where local patterns are learned by two-dimensional convolutional layers. Therefore, we compared an MLP-based model, which cannot utilize the positional information of the bins after the second layer, and a convolutional neural network-(CNN) based model, which is able to learn local patterns through convolutional operations as in images.

An MLP has already shown state-of-the-art performance on the task of SOH forecasting [18]. We added residual blocks to an MLP architecture (for a general introduction to MLPs, consult [62][63][64]) to enable deeper neural network architectures without running into the problem of vanishing gradients [65]. Our ResMLP model consisted of an input layer, a variable number of residual blocks, a dropout layer, and an output layer. Each residual block consisted of multiple layers as visualized in Figure 3(a). We identified the optimal number of residual blocks, as well as the size of each layer, through a hyperparameter search.
For a general introduction to CNNs, consult [62,64,66]. Our CNN architecture, ResCNN, utilizes two-dimensional convolutional layers with residual skip connections in similar structural blocks as the ResMLP.
Furthermore, a challenge for the ResCNN architecture with the stressor tables as input is that the stressor tables "I and SOC charging" and "I and SOC discharging" have a smaller shape than the remaining stressor tables from Table 3. Thus, we created a ResCNN architecture that has two separate ResCNN blocks of different input sizes (20 × 70 and 20 × 18) that are concatenated at the end, as briefly shown in Figure 4. Each ResCNN block started with a two-dimensional convolutional layer with a kernel size of 7 and a max-pooling layer. Next, a variable number of residual blocks consisting of convolutional layers, as visualized in Figure 3(b), was used to achieve a tuneable depth of the network. These blocks were followed by two convolutions with max-pooling. When converting the size of the stressor tables, the missing values are zero-padded. This enables the concatenation of the two blocks with initially different input sizes. A linear layer was utilized to predict the target ΔSOH values. A more detailed structure of the ResCNN architecture can be found in the Appendix in Figure 13.
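The zero-padding that aligns the two input sizes before concatenation can be sketched as follows; which axis is padded and the use of raw tables instead of learned feature maps are simplifying assumptions on our part:

```python
import numpy as np

# Stressor-table shapes as stated in the text; the padding scheme shown
# here is a simplification of the actual network's internal alignment.
table_large = np.random.rand(20, 70)   # e.g., a T-and-SOC histogram
table_small = np.random.rand(20, 18)   # "I and SOC charging/discharging"

# Zero-pad the smaller table on the second axis to match 20 x 70.
pad = table_large.shape[1] - table_small.shape[1]
table_small_padded = np.pad(table_small, ((0, 0), (0, pad)))

# After padding, the two branches can be stacked/concatenated channel-wise.
stacked = np.stack([table_large, table_small_padded])
print(stacked.shape)   # (2, 20, 70)
```

Zero-padding leaves the original histogram counts untouched and fills only the nonexistent bins, so no operational information is distorted by the alignment.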

Explaining State of Health Forecasts with Saliency Maps.
In RQ.2, we investigate the explainability of the two DL architectures mentioned above, ResMLP and ResCNN, by comparing the saliency maps of both DL models with each other and with domain knowledge from laboratory experiments. Saliency maps are an explanation method suitable for any gradient-based ML model, i.e., especially for ANNs like CNNs and MLPs. Our novel extension combines the binning of saliency maps into aging groups by the ΔSOH value and the Gaussian filtering of the saliency maps to remove noise.
The input of one sample X_i consists of m features x_j ∈ X_i = {x_1, …, x_m}. In this work, X_i contains SOH(t₁) as well as the values of all histogram bins. The output of one sample Y_i consists only of ΔSOH_i.
The trained ResMLP and ResCNN learned a function F(X) = Ŷ. The gradient explanation of each sample i, E_grad(X_i) = ∂F(X_i)/∂X_i, quantifies the influence of the corresponding input features x_j on the prediction F(X_i) locally, i.e., the influence of a stressor bin in a histogram on the forecast of the ΔSOH. This process is depicted in Figure 5.
We further define subsets S_binned,p ⊂ S of the samples by binning the output values Y_i = ΔSOH_i to obtain aging-dependent explanations as motivated in Section 4.3. For each of these subsets defined by the ΔSOH bins, we derive mean saliency maps for each histogram in X by applying the mean to the gradient explanations E_grad(X_i) over all samples in the respective S_binned,p.
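The binning of samples into ΔSOH groups and the per-group averaging of saliency maps might look as follows; the sample count, the random saliency values, and the uniform ΔSOH distribution are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 500
sal = rng.normal(size=(n_samples, 20, 70))      # per-sample saliency maps
dsoh = rng.uniform(-28.1, 0.7, size=n_samples)  # per-sample ΔSOH in %

# 15 equally spaced ΔSOH bins over the test-data range, as in this work.
edges = np.linspace(-28.1, 0.7, 16)
bin_idx = np.clip(np.digitize(dsoh, edges) - 1, 0, 14)

# Mean saliency map per aging group S_binned,p.
mean_maps = np.stack([sal[bin_idx == p].mean(axis=0) for p in range(15)])
print(mean_maps.shape)   # (15, 20, 70)
```

Averaging over each group stabilizes the per-sample gradients, so each of the 15 mean maps summarizes one aging regime.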
In order to interpret the saliency maps of MLP architectures and compare them with the saliency maps of CNN architectures, we projected the computed values of the saliency map back into the original shape of the stressor tables; that is, the flat gradient vector in ℝ^m is reshaped into ℝ^(m₁×m₂) with m = m₁ × m₂. Finally, we visualized these processed saliency maps as heatmaps in the shape of the stressor tables.
In image processing, the application of Gaussian filters to blur or smooth images is well known [67,68]. Another use case of Gaussian filters is the removal of noise in signal processing [69]. In addition, Rahman et al. [68] applied a Gaussian filter to saliency maps; their objective was to explore the edges of the salient object. Similarly, we aim to smooth the edges of the salient operational region in the two-dimensional histograms. Moreover, we are not interested in a single gradient of a single field in a stressor table, i.e., in a histogram. Instead, we want to investigate which areas of the stressor table are important for the prediction of the model and whether these areas are aligned with important areas identified by a domain expert. Applying the Gaussian filter with a manually tuned standard deviation of σ = 5 has the advantage of smoothing the outliers and edges in the saliency maps. The design of the stressor tables with strict intervals for each bin in the histograms favors noise in the saliency maps, but this noise is negligible in terms of the interpretability of the model. Consequently, a domain expert is not distracted by the noise in the saliency map and can focus on areas that have a high influence on the model prediction.
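A sketch of this filtering step with σ = 5, using SciPy's `gaussian_filter` on a synthetic noisy saliency map with one planted salient region; the map size matches the stressor tables, but the noise level and region location are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(5)
# Noisy saliency map with one truly salient region (rows 5-10, cols 20-30).
sal = 0.1 * rng.normal(size=(20, 70))
sal[5:10, 20:30] += 1.0

# sigma = 5 as manually tuned in this work: bin-edge noise is smoothed away
# so the salient operational region stands out as one contiguous area.
smoothed = gaussian_filter(sal, sigma=5)
print(smoothed.shape)
```

After filtering, the global maximum of the map sits inside the planted region rather than on an isolated noisy bin, which is exactly the property that makes the heatmaps readable for a domain expert.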

Battery Cell Data and Preprocessing.
In this work, we tested our method on the basis of 188 LIB cells from five public battery cell data sets [60,[70][71][72][73]] from laboratory operation, as in the previous publications on SOH forecasting and transfer learning [51]. The preprocessing described in [51] was also the basis for this work. More detailed illustrations of the preprocessing are given in Figures 11 and 12 in the Appendix. An overview of the data sets used regarding battery operation and materials is given in Table 4 in the Appendix. These selected public data sets (1) provide data from a sufficient number of battery cells; (2) include both identical and different battery types, e.g., regarding materials and C_nom; and (3) have a variety of battery operational loads, i.e., charging, discharging, and storage of the batteries. This enables us to ensure the validity of our results for different battery types and operational loads.
We used two-dimensional stressor tables, variant A, with a fine signal interval width, i.e., bin size, because this gave the best result in previous work [18] (fine signal interval width equals bins of 0.5C for I, 0.5°C for T, and 5% for SOC; 2D stressor table variant A consists of several 2D residence time histograms of T, I, and SOC as displayed in Table 3 in the Appendix). This includes separate histograms for the three operational modes of the batteries: charging (I > 0 A), discharging (I < 0 A), and hold (I = 0 A). For example, the T and SOC hold histogram bins the signals temperature and SOC while I = 0 A. Further, a grouped window width of w_w = {25, 50, 100} cycles (named W9 in previous work [18]) and a window shift of w_s = 25 cycles showed promising results regarding generalization. A major obstacle for the qualitative analysis and interpretation of the saliency maps presented in Section 4.2 is that the output values, i.e., the ΔSOH values, are not uniformly distributed, as displayed in Figure 6 (the same histogram of the training data is shown in the Appendix in Figure 14). This corresponds to the ΔC histograms of ([74], p. 80), who also used two of the data sets applied in this work [60,70]. As mentioned before, we are interested in the feature importance of samples with decreasing SOH, i.e., ΔSOH ≤ 0 and even better ΔSOH ≪ 0 for significant aging. However, samples with high aging are especially rare; most samples have little or no aging. At the same time, we also need a sufficient number of test samples, as listed in Figure 6, to obtain statistically representative saliency maps (the corresponding values are displayed in Table 5 in the Appendix).
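The construction of one such residence time histogram, e.g., "T and SOC hold", can be sketched with NumPy as follows; the synthetic signals and their ranges are illustrative, not taken from the data sets:

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic per-window signals (values are illustrative).
T   = rng.uniform(10, 45, size=10_000)           # temperature in °C
soc = rng.uniform(0, 100, size=10_000)           # SOC in %
I   = rng.choice([-1.0, 0.0, 1.0], size=10_000)  # sign encodes the mode

# "T and SOC hold" histogram: bin T (0.5 °C) and SOC (5 %) while I = 0 A.
hold = I == 0.0
hist, t_edges, soc_edges = np.histogram2d(
    T[hold], soc[hold],
    bins=[np.arange(10, 45.5, 0.5), np.arange(0, 105, 5)])
print(hist.shape)   # (70, 20)
```

Each histogram entry is a residence count, so the table directly encodes how long the battery spent in each operational state within the window.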
In order to get an overview of the model's behavior in the different aging phases, we created fifteen equally spaced bins of the test data ranging from ΔSOH values of −28.1% to 0.7%. Each bin contains test samples with the corresponding ΔSOH values.
We used the root mean square error (RMSE) as an evaluation metric because it is a common error metric for evaluating regression problems and, unlike the mean square error (MSE), has the same unit as the predicted output value. Having the same unit enables the direct comparison of the model's performance with a required error threshold. We applied min-max normalization feature-wise using only the training data; that is, each feature was normalized separately based on its minimum and maximum values in the training data, as in [18]. A more detailed description of the data, including their characteristics, can be found in [51].
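A short sketch of these two evaluation utilities, RMSE and feature-wise min-max normalization fitted on the training data only; the toy matrix is illustrative:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error: same unit as the target (here: ΔSOH in %).
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def minmax_fit(train):
    # Statistics from the training data only: one (min, range) per feature.
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0          # guard against constant features
    return lo, span

def minmax_apply(X, lo, span):
    return (X - lo) / span

X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
lo, span = minmax_fit(X_train)
X_scaled = minmax_apply(X_train, lo, span)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))   # [0. 0.] [1. 1.]
```

Fitting the statistics on the training split alone and only applying them to validation/test data avoids leaking test-set information into the scaling.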

Comparison of Different Machine Learning Models.
We first conducted a hyperparameter search for each model using Weights & Biases with a Bayesian Gaussian process model to minimize the validation RMSE. The hyperparameter spaces for the ML and DL models are specified in Tables 6 and 7 in the Appendix. We then took the best hyperparameters from this search and ran the experiment with five different random seeds to increase the robustness of the results. The mean and standard deviation of these five runs for each model are shown as the final result with the RMSE as metric in Table 2 and with the MSE and coefficient of determination (R²) as metrics in Table 8 in the Appendix.
An overview of the mean runtime during the hyperparameter search can be found in Table 9 in the Appendix. The mean training time of the DT during the hyperparameter search was the shortest, and the mean run time of the ResCNN was the longest. However, we have not made a detailed comparison of runtimes given the limited comparability due to the different hardware (CPU vs. GPU) deployed in this work for which the ML and DL models are designed.
We refrained from a comparison with state-of-the-art models because of the different data sets, forecast horizons, metrics, and output values used, which limit comparability [9]. Thus, in this section, comparisons are limited to the ML and DL architectures used in this work, which share the same data sets (see Table 4), forecast horizon (w_w = {25, 50, 100} cycles), metrics (RMSE, MSE, and R²), and output value (ΔSOH).
Our results show that the ResMLP outperformed all other models. It outperformed the DT, the worst performing ML model, by 319% and the best performing ML model, HistXGBoost, by 73% in terms of the RMSE test score. The HistXGBoost achieved an RMSE test score of 0.368, an improvement of approximately 142% over the DT model. The ResCNN, in turn, outperformed the HistXGBoost model by 33%.
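The relative improvements quoted above follow from the reported RMSE test scores; an x% figure means the reference model's RMSE is x% higher than that of the better model:

```python
# Relative RMSE improvement: how much higher (in %) the worse model's RMSE
# is compared to the better model's RMSE.
def relative_improvement(rmse_worse, rmse_better):
    return 100.0 * (rmse_worse / rmse_better - 1.0)

# RMSE test scores from the text: DT, HistXGBoost, and ResMLP.
dt, hist_xgb, res_mlp = 0.892, 0.368, 0.213
```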
The RF outperforms the DT, which is likely due to the reasons described in Section 2.2: the DT cannot learn complex functions as well as the RF, which approximates the underlying function better thanks to its ensemble of several trees.
The results shown in Table 2 confirm the advantage of the DNNs over the ML models tested. They are also coherent with our expectations from Section 1 and are probably caused by the inherent ability of more complex DNNs to learn more complex input-output relationships; that is, DNNs are better approximators of the underlying function in this data set for SOH forecasting. The ResCNN could not outperform the ResMLP, which achieved a 30% better RMSE test score. Consequently, the interdependencies of the neighboring bins in a stressor table did not significantly influence the resulting ΔSOH target for the ResCNN model, or these dependencies could not be learned by the locally limited filters of the convolutions. Furthermore, the better performing DL models support the need for more interpretable results to build trust in them (RQ.2).

(Table 2 note: Since we only have 16065 training samples, the subsampling is not affected by the random state. The random state also has no effect on the thresholds of the feature bins, which are simply linearly spaced. The best values are marked bold.)

(Figure 6 caption: Number of test samples in each ΔSOH bin with bounds in %. Test data are 10% of the complete data set.)

Interpretation of Exemplary Saliency Maps from Deep Learning Models.
In the following, we qualitatively analyze and interpret exemplary saliency maps. In the saliency maps, blue indicates low or no relevance for the prediction, while red indicates high importance. These values represent the mean gradients, which are described without a unit in the literature for common applications of saliency maps to image classification. Within the scope of this paper, a few representative saliency maps have been selected for illustration and discussion. The generated saliency maps of all ΔSOH bins, all histograms, and both DL architectures, ResMLP and ResCNN, can be selected via a graphical user interface (GUI): https://tmdt-buw.github.io/soh-saliencymaps/. Each saliency map shows the mean over all test samples from one ΔSOH bin in the corresponding stressor table.
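The core computation behind such a saliency map can be sketched as follows; the untrained stand-in model, the table shape, the use of the absolute gradient, and the Gaussian kernel parameters are assumptions, not the authors' exact implementation:

```python
import numpy as np
import torch

# Stand-in for a trained DL model mapping flattened stressor tables to ΔSOH.
model = torch.nn.Sequential(
    torch.nn.Linear(40 * 20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

def gaussian_filter2d(img, sigma=1.0, radius=3):
    """Separable Gaussian smoothing with edge padding (numpy-only stand-in
    for scipy.ndimage.gaussian_filter)."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def mean_saliency(model, inputs, table_shape=(40, 20)):
    """Mean absolute input gradient over all samples of one ΔSOH bin,
    reshaped to the stressor table and smoothed with a Gaussian filter."""
    x = inputs.clone().requires_grad_(True)
    model(x).sum().backward()             # one forward and one backward pass
    grads = x.grad.abs().mean(dim=0)      # average over all samples in the bin
    return gaussian_filter2d(grads.numpy().reshape(table_shape))

# Stand-in for the test samples of one ΔSOH bin.
saliency = mean_saliency(model, torch.rand(32, 40 * 20))
```

The resulting 2D array would then be rendered as a colored heatmap over the stressor table axes.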
We refrain from comparing the presented gradient-based saliency maps with other work because of the novelty of combining SOH forecasting with XAI. Furthermore, the saliency maps, as a qualitative means of providing explainability, would only enable qualitative comparisons with other works. Thus, their values are expressed only as easily perceivable colored heatmaps and not numerically.
Furthermore, the computation times of the saliency maps are neither presented nor discussed because they are negligibly small, requiring only a single forward and backward propagation on the test data set. Also, in practical applications, these computations are likely to be run in a setting without constrained computational resources, e.g., not on an electronic control unit (ECU) in a BEV or an edge device [75,76].

T and SOC Hold (Figures 7(a), 7(b), 8(a), and 8(b)).
In Figures 7(a) and 7(b), the saliency maps of the ResCNN and ResMLP are shown, respectively, for the T and SOC hold histogram and the ΔSOH bin with ΔSOH ∈ [−7.6, −5.5]. The important temperature band is 23°C to 30°C for the ResCNN and 22°C to 26°C for the ResMLP; that is, it is narrower and lower for the ResMLP. Also, for the ResCNN, the highest gradient (yellow) is about half of the maximum gradient of the ResMLP (red). The most relevant SOC range for the ResCNN extends from 0% to approximately 50% to 70%. In contrast, for the ResMLP, the red area extends from the right, i.e., SOC = 100%, and not from the left, i.e., SOC = 0%. For the ResMLP, the orange area extends further to the left than the turquoise area extends to the right for the ResCNN. That is, the ResMLP has learned that low and high SOCs are harmful; for the ResCNN, this is only true for low and medium SOCs. Overall, the two models have learned different areas of focus with different levels of relevance.

In Figures 8(a) and 8(b), the saliency maps of the ResCNN and ResMLP are shown, respectively, for the T and SOC hold histogram and the next ΔSOH bin with ΔSOH ∈ [−5.5, −3.5]. Figures 7(a) and 8(a) for the ResCNN, as well as Figures 7(b) and 8(b) for the ResMLP, are pairwise similar, underlining a consistent learning behavior of each model. However, at lower absolute ΔSOH, i.e., less aging, in the saliency map of the ResCNN in Figure 8(a), the medium-importance yellow SOC regions are at lower temperatures than in Figure 7(a). This is interesting because it is known from laboratory aging experiments that high temperatures and high SOC induce accelerated aging during hold mode (I = 0 A) [77]. However, only the saliency maps of the ResMLP indicate this. The relevance of these operational states seems to be lost in the convolutional structure of the ResCNN, which could be connected to the poorer performance of the ResCNN compared to the ResMLP.
For the ResCNN, the area of highest importance has slightly higher relevance for less degradation, i.e., it changes from white in Figure 8(a) to yellow in Figure 7(a) with larger absolute ΔSOHs. Contrarily, for the ResMLP, the relevance increases slightly for more degradation, i.e., it changes from light orange and light red in Figure 8(b) to dark orange and dark red in Figure 7(b) with larger absolute ΔSOHs. Again, the relevance shown in the saliency maps of the ResMLP is consistent with domain knowledge: the same operational regions are marked with higher relevance for samples with more degradation.

I and SOC Charging (Figures 9(a) and 9(b)).
For the I and SOC charging histogram, the saliency maps of the ResCNN and ResMLP are shown in Figures 9(a) and 9(b), respectively, for the ΔSOH bin with ΔSOH ∈ [−7.6, −5.5]. As in the previous saliency maps, the overall gradients of the ResMLP are higher than those of the ResCNN. Both models generally learn a similar region of maximum importance: their saliency maps indicate the top right corner as inducing high aging, i.e., operation with I ⪆ 5 h⁻¹ and SOC ⪆ 65%. These states of high C-rate and high SOC do not usually occur in parallel for a long time because, during constant voltage (CV) charging at the end of the charging process with high SOC, the decreasing C-rate is controlled to hold a constant voltage. These saliency maps thus seem to indicate that the transition at the end of the constant current (CC) charging phase is relevant for aging. For the ResCNN, the remaining operational states have almost the same relevance (light green). For the ResMLP, however, the importance decreases in a clockwise circle around the center of the saliency map: higher C-rates and low SOCs are marked in light orange, which fades to white for low C-rates. Finally, low C-rates and high SOC are correlated with low aging.
I and SOC Discharging (Figures 10(a) and 10(b)).
For the I and SOC discharging histogram, the saliency maps of the ResCNN and the ResMLP are shown in Figures 10(a) and 10(b), respectively, for the ΔSOH bin with ΔSOH ∈ [−7.6, −5.5]. Again, both models generally learn a similar operational region as most relevant: their saliency maps indicate the lower left corner as inducing high aging, i.e., operation with approximately I ⪅ −2 h⁻¹ and SOC ⪅ 20%. Differences lie in the operational region of higher SOC ⪆ 35%: the saliency map of the ResMLP shows increasing importance for increasing SOCs starting from SOC ≈ 50%, whereas the saliency map of the ResCNN shows slightly decreasing importance for higher SOCs. In conclusion, the saliency map of the ResMLP seems to be more consistent with the knowledge gained from laboratory battery aging experiments because it shows less aging for medium SOCs and more aging for higher currents as well as for high and low SOCs.

When comparing I and SOC charging (Figures 9(a) and 9(b)) and I and SOC discharging (Figures 10(a) and 10(b)), it is clear that the most important regions are roughly mirrored along the diagonal (\); that is, the important regions change from high C-rates and high SOC for charging (upper right corner) to high C-rates and low SOC for discharging (lower left corner). A possible explanation for the generally lower gradients of the ResCNN may lie in the convolutional filters: changing one stressor value, i.e., a δx, affects several filters, but each only partially due to the convolution. However, this explanation is disproved by the similar maximum gradients of the ResCNN and ResMLP in Figures 9(a) and 9(b).
We could not conclusively establish that the coherence of the saliency maps, i.e., of the explanations of the model's forecasts, is related to the model's performance, but a strong connection is assumed. However, a well-performing model might have learned patterns that are not directly related to those known by a domain expert. This could complicate comprehensible explanations for domain experts.
To summarize the analysis of the saliency maps, both black box models, ResCNN and ResMLP, have been improved to the extent that a battery-domain expert can build trust in the models. We expect that a more trustworthy model will be more easily accepted by users interested in SOH forecasts, as motivated in Section 1. The trustworthiness of the SOH forecasting models in this work is achieved by gaining interpretability of the SOH forecasts through insights into the operational state regions in the histograms that the models focus on for their predictions. We conclude that the ResMLP is the more trustworthy model in terms of both predictive performance, as shown in Table 2, and interpretability, due to its consistency with human domain expert knowledge. Therefore, the best model we investigated is the ResMLP.

Conclusion
SOH forecasting has been shown to work with an MLP trained on laboratory battery cell data previously [18,51]. Still, due to the high inherent black box character of MLPs, the problem of the lack of explainability of their forecasts remained open. Explainability of ML models, e.g., by using methods from the field of XAI, aims at building trust in ML models and their predictions to ensure a wider distribution and increased acceptance.
In this work, we compared different ML model types with increasing inherent black box character and decreasing explainability: DT, RF, XGBoost, HistXGBoost, and two new DL architectures, ResMLP and ResCNN. The ResMLP showed the best result with an RMSE of 0.213, while the ML models performed significantly worse, e.g., the DT with an RMSE of 0.892 (+319%). Due to the lack of comparability [9], we refrained from a comparison with state-of-the-art models and focused on the comparison of different ML and DL models. We rate only the model fit of the ResMLP and ResCNN as good enough for predictions in the relevant SOH range of 80% to 100%, i.e., a range of 20% SOH given by the EOL of automotive batteries at SOH = 80%.
Thus, we focused on adding explainability to these two DL methods. To this end, we proposed a new method using Gaussian-filtered saliency maps to visualize the battery operational states that are relevant to DNN models. The saliency maps of the ResMLP show meaningful visualizations of operational states that are consistent with previous knowledge from controlled laboratory aging experiments. In conclusion, the proposed Gaussian-filtered saliency maps, in combination with the human-interpretable data representation in the form of two-dimensional histograms, were able to add interpretability to the SOH forecasting results, mitigating the inherent black box character of DL models compared to transparent ML models.

Consequently, we have contributed to the field of battery aging prediction by showing that combining two well-established methods, Gaussian filters and saliency maps, with a human-interpretable data representation results in understandable aging forecasts. This work includes 188 cells from 5 different public battery cell data sets with different cathode materials, nominal capacities, and operational load profiles. Despite the comparatively large amount of data, the number of available public battery aging data sets is limited. This is even more true for battery pack data, and even more so for data from real-world BEV operation, such as from BEV fleets [9,50].
Gaussian-filtered saliency maps as a method are transferable to other gradient-based learning models with histogram-based inputs. However, saliency maps, like many other post-model explanation methods, only enable explanations of labeled samples. Thus, explanations of unlabeled forecasts are not possible with saliency maps. This would only be possible, e.g., with a DT, in which a human could trace the path of the forecast through the leaves of the tree.

In future work, an appropriate histogram embedding for attention-based models could be developed. To this end, we have already explored three encodings: First, we applied the standard positional encoding with sine and cosine introduced by Vaswani et al. [78]. Second, we implemented a custom positional encoding of the histogram-based features by a histogram-id, row-id, and column-id that encode the histogram type and the position within the histogram. Finally, we learned an encoding using a CNN, i.e., a CNN followed by attention layers with a final MLP head. None of these encodings showed promising results, so further research is needed. Perhaps unsupervised learning of an embedding or approaches from natural language processing (NLP) could be a starting point. Another problem in applying attention-based models was that our database was too small to profit from the benefits of attention-based models as in other fields such as NLP [79] or image recognition [80]. Also, the computational cost of dot-product attention with quadratic complexity poses problems with our large feature space. Once an attention-based model achieves similar performance to our models, the attention weights could be investigated for their interpretability.

Table 3 gives an overview of the specification of the stressor tables and their dimensions used in this paper. Table 4 gives an overview of the data sets used in this paper. Figure 11 shows the general data preprocessing flowchart. Afterwards, the data is preprocessed in a data-set-specific manner, as visualized in Figure 12.

D. Architecture of ResCNN Model
The model architecture of the ResCNN is visualized in Figure 13 and explained in the following: First, the input stressor tables are split into two groups of table shapes (20 × 70 and 20 × 18). When converting the stressor tables to a common size, the missing values are zero-padded. Then, the two groups of tables are fed into the corresponding convolutional blocks of the ResCNN. The number of residual blocks and the hidden size are used as variables in the hyperparameter search.

Table 5 shows the numerical values of the state of health change bins and the number of test samples displayed in the histogram in Figure 6.
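A hedged sketch of one residual convolutional block consistent with the description above; the kernel size, channel counts, and activation are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """One residual block: two 3x3 convolutions with a skip connection.
    Channel counts and kernel size are illustrative assumptions."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Skip connection around the convolutional body preserves the shape.
        return torch.relu(x + self.body(x))

# One input group of stressor tables (20 x 70); the second group (20 x 18)
# would be zero-padded and processed by its own convolutional branch.
block = ResidualConvBlock(channels=4, hidden=16)
out = block(torch.zeros(1, 4, 20, 70))
```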

F. Train Data Distribution
The histogram of the training data is shown in Figure 14. It corresponds to Figure 6, which displays the test data.

G. Hyperparameters
The hyperparameter spaces and the selected best sets of hyperparameters for the ML and DL models are specified in Tables 6 and 7, respectively. The influence of residual blocks varied between the two DL architectures investigated. The ResMLP architecture benefited from the residual blocks, which is supported by the best performing model of the hyperparameter search having six residual blocks. A ResMLP without residual blocks, i.e., an MLP similar to the model investigated in [18,50], did not perform best. On the other hand, the residual blocks of the ResCNN did not add value to the predictions, as the best performing model did not contain residual blocks.
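A residual MLP of the kind described above can be sketched as follows; the layer width is an illustrative assumption, the six residual blocks follow the best hyperparameter setting, and the input width of 1760 assumes the flattened stressor tables (20 × 70 + 20 × 18):

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """One residual block: two linear layers with a skip connection."""
    def __init__(self, width):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

# Assumed input width: flattened stressor tables, 20*70 + 20*18 = 1760.
model = nn.Sequential(
    nn.Linear(1760, 256),
    *[ResidualMLPBlock(256) for _ in range(6)],  # six residual blocks
    nn.Linear(256, 1),                           # scalar ΔSOH forecast
)
y = model(torch.zeros(2, 1760))
```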

H. Further Results
The mean and standard deviation of the five runs for each model are shown as the final result with the MSE and R² as metrics in Table 8. Table 9 gives an overview of the mean runtime during the hyperparameter search. The ML models were run on a CPU node of the PLEIADES cluster at the University of Wuppertal, and the DL models were run on a GPU node. The CPU node used an AMD EPYC 7452 CPU with a limit of 8 cores per hyperparameter search instance.