Robust Data-Driven Fault Detection: An Application to Aircraft Air Data Sensors

Fault detection (FD) is important for health monitoring and safe operation of dynamical systems. Previous studies use model-based approaches, which are sensitive to system specifics and therefore less robust. Data-driven methods have claimed accurate performance that scales well to different cases, but their algorithmic structures and enclosed operations are "black," jeopardizing their robustness. To address these issues, taking the FD problem of aircraft air data sensors as an example, we develop a robust (accurate, scalable, explainable, and interpretable) FD scheme using a typical data-driven method, i.e., deep neural networks (DNN). To guarantee scalability, aircraft inertial reference unit measurements are adopted as equivalent inputs to the DNN, and a database associated with 6 different aircraft/flight conditions is constructed. Convolutional neural network (CNN) and long short-term memory (LSTM) blocks are used in the DNN scheme for accurate FD performance. To enhance the robustness of the DNN, we also develop two new concepts: "large structure," which corresponds to the parameters that can be objectively optimized (e.g., CNN kernel size) via certain metrics (e.g., accuracy), and "small structure," which conveys the subjective understanding of humans (e.g., class activation mapping in CNN) within a certain context (e.g., object detection). We illustrate the optimization process adopted in devising the DNN large structure, which yields accurate (90%) and scalable (24 diverse cases) performance. We also interpret the DNN small structure via class activation mapping, which yields promising results and solidifies the robustness of the DNN. The lessons and experiences we learned are also summarized in the paper, which we believe are instructive for addressing FD problems in other similar fields.


Motivation. Fault detection (FD) is important for the safe operation of dynamical systems. For instance, aircraft air data sensors (ADS) provide measurements of the aircraft's airspeed, angle of attack (AOA), and sideslip angle. Erroneous sensor measurements, however, were found to be the cause of many catastrophic flight accidents, including the crashes of the NASA X-31 [1], Airbus A330 [2], and most recently Boeing 737 MAX [3]. A robust fault detection strategy is urgently needed for the health monitoring of commercial airliners.
At present, hardware redundancy (HR) is widely used for FD problems. Particularly for ADS in commercial airliners, HR consists of installing multiple sensors to produce redundant measurements of the air data. Outputs from all the sensors are continuously monitored by a voting logic, which detects (and isolates) the erroneous sensor. The correct measurement is then reported using the remaining sensors [4][5][6].
One issue with HR-based fault detection is the cost and weight penalty (due to the redundant sensors). Moreover, recent accidents indicate that HR is not sufficient in addressing the fault detection problem (e.g., the Boeing 737 MAX accident due to AOA sensors). As an alternative to HR, analytical redundancy (AR) has been investigated. A majority of the AR methods adopt model-based approaches. Different from HR, AR investigates each sensor separately. For a certain sensor, a mathematical model is developed in conjunction with other sensors. An inferred sensor measurement is then estimated and compared with the sensor's output to generate a residual. If the residual exceeds a predefined bound, a fault is declared for that sensor [7].
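The residual test described above can be sketched in a few lines. This is a minimal illustration only: the function name `detect_fault`, the sample AOA values, and the bound of 1.0 are hypothetical, and the inferred measurement is taken as given rather than produced by any particular model.

```python
import numpy as np

def detect_fault(measured, inferred, bound):
    """Flag samples whose residual |measured - inferred| exceeds the bound."""
    residual = np.abs(np.asarray(measured, dtype=float)
                      - np.asarray(inferred, dtype=float))
    return residual > bound

# Hypothetical AOA readings (deg): the sensor picks up a bias after the third sample.
measured = [2.0, 2.1, 2.2, 5.3, 5.4]
inferred = [2.0, 2.0, 2.1, 2.2, 2.3]
flags = detect_fault(measured, inferred, bound=1.0)
```

In practice the bound trades false alarms against missed detections, which is one reason model-based schemes tend to need tuning per aircraft and flight condition.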
The model-based AR, nevertheless, hinges on a model that is derived from system specifics and is sensitive to operational conditions. The development of model-based AR typically requires ad hoc parameter tuning, which is time-consuming. Another line of AR adopts model-free, mostly data-driven methods. These do not require system specifics, only the recorded data (e.g., sensor measurements and the associated faults). In particular, deep neural networks (DNN) have been widely used [8][9][10][11][12][13][14]. However, no explainable rules exist for devising the DNN architecture, and most works adopted a trial-and-error methodology; the mathematical operations enclosed within the DNN are also considered "black," and the scalability of DNN-based FD schemes is doubtful. In other fields where DNN has been developed (e.g., computer vision), many rules have been proposed (and widely accepted) for devising DNN architectures (e.g., how many CNN filters should be used in each layer). Ablation studies [15] and DNN visualization (e.g., class activation mapping [16]) are widely adopted, which greatly elevate the physical understanding (robustness) of the DNN. Similar concepts/approaches may also be used in analyzing the DNN-based FD scheme.
We thus summarize the motivation of this paper: although DNN-based methods have yielded accurate and scalable FD performance, two weaknesses remain, namely, the explainability of how the DNN architecture is devised and the interpretability of the DNN operations. Taking the FD problem of aircraft air data sensors as an example, we propose to develop a robust DNN-based FD scheme. Whilst the FD accuracy and scalability must be guaranteed, we also explain the rules used in devising the DNN architecture and interpret the operations enclosed in the DNN structure.
1.2. Related Work
1.2.1. Model-Based FD. Model-based FD hinges on a mathematical model to infer the sensor measurement, which is further used to generate a residual. A plethora of works use Kalman filtering (KF), e.g., the extended KF [17][18][19], the unscented KF [20], the theoretical analysis of an adaptive three-step KF [21], and the implementation of a KF-based method on real data [22]. Other KF-based works include [23,24]: fuzzy logic was used in conjunction with KF to consolidate the sensor data in [23], and a hidden Markov model was used to decide the sensor state (faulty/healthy) based on the KF outputs in [24]. KF-based FD schemes, however, rely on an evolution model derived from the system dynamics/kinematics; ad hoc parameter tuning is required to adjust the KF to different systems (e.g., different aircraft) or operational cases (e.g., different flight conditions).
Other model-based FD methods adopted robust control theory [25][26][27][28], wherein a robustness synthesis-based filter was constructed to output the residual, but a sensor state evolution model is needed, and no rules pertaining to parameter tuning were studied. In [29][30][31], a moving horizon estimator was developed. It compensated for both sensor faults and wind speed estimation in the fault tolerant control. However, the authors discussed limited aircraft/flight cases; the scalability of the proposed methods is unclear. A scheme designed particularly for systems with two time-scale dynamics (e.g., phugoid and short periods in the aircraft's longitudinal plane) was discussed in [32,33], wherein both a nonlinear geometric approach and a singular perturbation technique were involved, but the computational load of the algorithm was relatively high, and parameter tuning was time-consuming. A barrier function-based learning observer was proposed in [34], and a set-valued observer (SVO) was used in [35]. As claimed in the papers, these works significantly decreased the FD false alarm rate. The weaknesses of [34,35], however, are also typical: model sensitivity and unclear scalability.
1.2.2. Data-Driven FD. Different data-driven methods have been proposed for the FD problem. In [36], a fuzzy inference system in conjunction with a thresholder was proposed for the FD of DC motors. In [37], 4 different Wiener models were ensembled for the fault analysis of an industrial gas turbine. Dynamic principal component analysis was used together with a support vector machine in [38] for the FD of a gearbox. Finally, in [39], a total of 5 state-of-the-art algorithms were studied for the FD of marine machinery. Although these methods have yielded promising results, their weakness is also obvious: heavy parameter tuning is usually needed to finalize the algorithm structures.
Other data-driven schemes use neural networks (NN). In [8,9], a fully connected cascaded NN was adopted, and the authors discussed fault detection and isolation for the inertial reference unit (IRU). Similar works were found in [12,40,41], wherein a feed-forward NN was used. In [42,43], an NN-based adaptive observer was developed to generate the sensor measurement residual; the parameters of the NN were adjusted online via KF [42]. Also, in [10][11][12], NN was used to establish nonlinear identification models, which were used as state observers to generate the residual. The essence of all these works was to regress a functional relation that maps from the designated input to the desired output (i.e., fault cases), but a traditional NN is inefficient at abstracting high-level features. It is usually used in a hybrid form with other methods (e.g., KF). In addition, no explainability or interpretability analysis was thoroughly illustrated in the associated publications.
Recent NN developments advocate deep neural networks (DNN) in many academic/industrial fields [44]. A DNN typically has more ("deeper") layers, which are activated using a designated function (e.g., ReLU). More dedicated operations were also designed in convolutional neural network (CNN) and long short-term memory (LSTM) blocks for extracting both the spatial and temporal features enclosed in the DNN input. Early works along the DNN-based FD line were found in [45][46][47], wherein a recurrent neural network (RNN) was used. Later works adopted a variant of RNN, i.e., LSTM, which attenuates the error vanishing/explosion problems of the traditional RNN. CNN was also widely used, either independently [48,49] or in conjunction with LSTM; new data formats defined as "state image" and "control image" were proposed in [50,51], via which the sensor FD accuracy was significantly improved. The CNN-LSTM fusion-based DNN architecture has claimed promising results in [51] for air data sensor FD and most recently in [52] for fault estimation.
International Journal of Aerospace Engineering
Despite the rapid development of various DNN-based FD architectures, research efforts on explainability and interpretability analysis are still rare.  (2) to interpret the small structure that depicts the operations enclosed in the DNN architecture. Similar works have appeared in the literature. The DNN large structure corresponds to the specifics (e.g., CNN kernel size) that can be objectively optimized via certain metrics (e.g., DNN testing accuracy). To explain the large structure, comparative studies were commonly used. In [53], different sets of parameters (number of CNN filters, kernel sizes, etc.) were assembled in the DNN. The authors then performed thorough training for each parameter set and selected the optimal one by comparing the trained DNNs. Technical tools designed specifically for optimizing the DNN training were also found, of which the most notable is Microsoft's NNI, which decides the best training coefficients (e.g., learning rate and number of epochs) for a certain DNN architecture [54]. The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach; as a black-box optimizer, it can be used in various scenarios and performs well, especially when computational resources are limited and only a small number of trials can be run [55]. When an "optimal" DNN is found, an ablation study is commonly used to verify the architecture (e.g., CNN branches and CNN-LSTM fusions), which involves cropping a certain subarchitecture from the "optimal" one and comparing the DNN performances. Typical examples are found in [15].
The DNN small structure depicts the operations enclosed within the DNN (e.g., a certain CNN filter). It is usually analyzed by mirroring the DNN outputs to what humans understand in a certain context. For instance, in the visual object classification problem, CNN is commonly used. The terms humans understand in such a context are the visual features that one relies on to classify an object (e.g., the "ear" or "nose" of a cat/dog). Class activation mapping (CAM) was thus proposed in [16] and rapidly developed in [56][57][58][59][60][61]; it points out the highlighted region(s) on which the CNN filters focus. The CNN architecture may be considered reasonable (interpretable) if the highlighted region(s) correspond to what humans tend to watch (e.g., the "ear"/"nose"). Related studies along this line have made promising progress, which promoted both academic research and industrial applications of CNN in vision-related problems, but "vision-related" only; very few studies along this line were found in other DNN-based works (e.g., DNN-based fault detection).
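As a rough illustration of the CAM idea, the sketch below forms a class-specific map as a weighted sum of the last-layer feature maps. The function name, array shapes, and toy values are assumptions for illustration; clipping negative values and scaling to [0, 1] are display conveniences added here and are not claimed to match [16] exactly.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps: (H, W, K) and (K,) -> (H, W) map in [0, 1]."""
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)          # keep only positive evidence (assumption)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: two 2 x 3 feature maps, one supporting the class, one opposing it.
feature_maps = np.zeros((2, 3, 2))
feature_maps[0, 2, 0] = 1.0   # filter 0 fires at row 0, column 2
feature_maps[1, 0, 1] = 1.0   # filter 1 fires at row 1, column 0
cam = class_activation_map(feature_maps, np.array([1.0, -1.0]))
hot = np.unravel_index(np.argmax(cam), cam.shape)   # location of the strongest highlight
```

The highlighted location is where the filters with positive class weights respond, which is the region a human reader would then compare against the known fault location.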

1.3. Overview of This Paper. In this paper, taking the FD problem of aircraft air data sensors as an example, we aim to develop a robust (accurate, scalable, explainable, and interpretable) DNN-based fault detection scheme. We highlight our contributions as follows:
and summarize the rules we adopted in optimizing several large structure parameters; this enhances the explainability of why such DNN specifics are adopted
(iii) Interpretable DNN small structure: we borrow feature visualization from computer vision to analyze the DNN small structure, and the results correspond to what humans understand; this elevates the interpretability of how the DNN operations work
The remainder of this paper is organized as follows. Section 2 defines the problem. Section 3 illustrates the database. Section 4 prepares the data and experimental setup. Studies on both the large and small structures are detailed in Section 5, wherein the lessons and experiences we learned are also summarized. Finally, conclusions and future works are discussed in Section 6.

Problem Definition
To define the FD problem of aircraft air data sensors, we start with the air data evolution equations, Eq. (1), wherein S_* and C_* represent the sin and cos operations, {V, α, β} are the airspeed, AOA, and sideslip angle, g is the gravitational acceleration, and {w_x, w_y, w_z} and {ψ, θ, ϕ} denote the rotational speeds and angles, respectively. In Eq. (1), {A_x, A_y, A_z} indicates the accelerations along the different axes of the aircraft body, which are defined as {A_i = F_i/m}_{i=x,y,z}, wherein m is the mass of the aircraft, and {F_x, F_y, F_z} are the external forces generated by the control actions, Eq. (2). In Eq. (2), δ_* indicates an individual control input (e.g., throttle δ_th, elevator δ_e, aileron δ_a, and rudder δ_r), and S, c, and b are the aircraft wing area, mean chord length, and span, respectively. The kinematics of the aircraft rotational speeds and angles is written as Eq. (3), and the dynamics of the aircraft rotation follows, wherein {M_x, M_y, M_z} are the external control moments. Figure 1 depicts the overall flow of the above equations. Traditional works hinge on model-based approaches to monitor the control inputs and sensor outputs. Implicitly in such a model, the external control forces/moments must be considered, which are generated by the associated control actions and are directly related to the aircraft specifics (e.g., wing area and mass) and flight conditions (e.g., airspeed and AOA). Parameters within such a model-based FD scheme typically require ad hoc tuning per aircraft/flight condition. Its scalability is therefore doubtful.

Table 2: Measurement noise standard deviations in the simulation for aircraft D (columns: Sensor, Standard deviation, Unit).
Despite the high dependency of the control forces/moments upon aircraft specifics/flight conditions, their outcome (i.e., {A_i}_{i=x,y,z} and {w_i}_{i=x,y,z}) can be directly measured using an inertial reference unit (IRU). Via Eq. (3), the rotational angles of the aircraft can also be calculated using {w_i}_{i=x,y,z} (although dedicated sensors do exist to measure them directly). We thus adopt the IRU measurements as a probe into the overall system, model them as equivalent inputs to the air data evolution, and perform the air data sensor FD task directly.
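To make the angle reconstruction concrete, the sketch below integrates the rotational angles from body rotational speeds. It assumes the conventional Euler-angle kinematics (yaw ψ, pitch θ, roll ϕ with body rates w_x, w_y, w_z), which may differ in notation from the paper's Eq. (3); the function names, the forward-Euler integration, and the need for known initial angles are likewise assumptions of this illustration.

```python
import numpy as np

def euler_rates(angles, w):
    """Time derivatives of (psi, theta, phi) given body rates (w_x, w_y, w_z)."""
    psi, theta, phi = angles
    wx, wy, wz = w
    phi_dot = wx + (wy * np.sin(phi) + wz * np.cos(phi)) * np.tan(theta)
    theta_dot = wy * np.cos(phi) - wz * np.sin(phi)
    psi_dot = (wy * np.sin(phi) + wz * np.cos(phi)) / np.cos(theta)
    return np.array([psi_dot, theta_dot, phi_dot])

def integrate_angles(angles0, w_history, dt):
    """Forward-Euler integration of the angles over a sequence of body rates."""
    angles = np.array(angles0, dtype=float)
    history = [angles.copy()]
    for w in w_history:
        angles = angles + dt * euler_rates(angles, w)
        history.append(angles.copy())
    return np.array(history)

# Pure roll at 0.1 rad/s for 10 s from level flight: only phi changes.
trajectory = integrate_angles((0.0, 0.0, 0.0), [(0.1, 0.0, 0.0)] * 10, dt=1.0)
```

The kinematics become singular as θ approaches ±90°, so a sketch like this is only meaningful away from vertical attitudes.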
To be specific, the problem in this paper is defined as detecting (classifying) the different faults that occur in the air data sensors, given the air data measurements {V, α, β} and other data resources, which may include {A_x, A_y, A_z}, {w_x, w_y, w_z}, and {ψ, θ, ϕ}. The FD scheme is modeled as a mapping process     [65]). We also involve 6 different flight conditions to cover the aircraft's entire envelope, i.e., high, medium, and low altitudes for cruise, manual free flight, and the low-altitude landing/take-off cycle (LTO). Different control forms from both a human pilot (manual) and automated control laws (autopilot, AP) are also considered. See Table 1 for more details. In Figure 2, we also plot a data sample from our database.

Measurement Noises and Disturbances.
Both simulation and real flight data are considered in the paper. While measurement noises and disturbances exist naturally in real flight, we adopt the model following [66] in simulation. Dryden atmospheric disturbances are injected to perturb the flight states, on which the measurement noises are added to generate the noise-corrupted data. Measurement noises are assumed to follow a Gaussian distribution. The standard deviations of the noise of each sensor are characterized in Table 2 [17].
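The noise step alone can be sketched as follows. The standard deviations in `NOISE_STD` are hypothetical placeholders and do not reproduce the values of Table 2; the Dryden disturbance model is not included, and the function and variable names are illustrative.

```python
import numpy as np

# Hypothetical per-sensor standard deviations (not the values of Table 2).
NOISE_STD = {"V": 0.5, "alpha": 0.1, "beta": 0.1}

def add_measurement_noise(clean, noise_std, rng):
    """Add zero-mean Gaussian noise to each clean signal with its own std."""
    return {
        name: np.asarray(values, dtype=float)
        + rng.normal(0.0, noise_std[name], size=len(values))
        for name, values in clean.items()
    }

rng = np.random.default_rng(0)
clean = {
    "V": np.full(1000, 150.0),     # airspeed
    "alpha": np.full(1000, 3.0),   # AOA
    "beta": np.zeros(1000),        # sideslip angle
}
noisy = add_measurement_noise(clean, NOISE_STD, rng)
```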

Designated Training and Testing.
To avoid overfitting, the training and testing data are strictly separated. We put all the real flight data in testing to fully evaluate the DNN performance. Diversity is crucial in specifying the training data, as the training algorithm is expected to extract an efficient FD mapping from this data. We therefore adopt the real data from B_2 and F and the simulated data from D and B_1 manual flight for testing. As for training, we use the data from Y manual LTO and B_1 AP cruise; see Table 3. In the table, an overview of the data for each aircraft/flight condition is also characterized using the minimum, maximum, and standard deviation of the key (clean) flight states (i.e., altitude, airspeed, AOA, and sideslip angle).

ADS Fault Modeling and Injection. Different sensor fault types have been discussed in previous works, including ramp bias, oscillations, and drift. For airspeed, most flight accidents happened because the pitot tube was clogged by ice/rain. We thus consider a drift fault for airspeed (measurement loss). For AOA and sideslip angle, the deflection vanes may be stuck or perturbed by the external atmosphere, which causes drift (constant bias) and extra noise. As in Table 4, a total of 5 ADS fault cases are discussed in this paper, wherein the magnitude for each case is specified following [51].
We implement the ADS faults in an additive form, i.e., the "clean" data (Case 0 in Table 4) are retrieved from real flight/simulation, and sensor faults are then injected into the ADS data. Following [51], this injection is performed in a randomized manner, i.e., for every 60 seconds in the data, the fault cases occur randomly at randomized moments, with a duration (also randomized) not exceeding the 60 seconds. In Figure 3, different fault cases are injected into the airspeed, AOA, and sideslip angle for illustrative purposes. Table 3 also presents the distribution of the different cases in the final data we adopt for the DNN training/testing.
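A minimal sketch of such randomized additive injection is given below for a single constant-bias fault. The function name, the bias magnitude, and the choice of exactly one fault per window are assumptions for illustration; the paper additionally randomizes the fault case, with magnitudes given in Table 4.

```python
import numpy as np

def inject_bias_fault(signal, bias, rng, fs=1.0, window_s=60.0):
    """Add a constant bias with random onset and duration inside every window."""
    faulty = np.array(signal, dtype=float)
    labels = np.zeros(len(faulty), dtype=int)        # 0 = clean, 1 = faulty
    n_win = int(round(window_s * fs))                # samples per window
    for start in range(0, len(faulty), n_win):
        stop = min(start + n_win, len(faulty))
        onset = rng.integers(start, stop)             # random moment in the window
        duration = rng.integers(1, stop - onset + 1)  # never runs past the window
        faulty[onset:onset + duration] += bias
        labels[onset:onset + duration] = 1
    return faulty, labels

rng = np.random.default_rng(1)
clean = np.zeros(180)                                 # 3 windows of 60 s at 1 Hz
faulty, labels = inject_bias_fault(clean, bias=2.0, rng=rng)
```

Returning the label vector alongside the corrupted signal keeps the ground truth aligned with the data, which is what a supervised classifier needs for training.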

A Brief Introduction of CNN and LSTM.
We use both 2-dimensional CNN and LSTM in this paper. In CNN [67], convolutional filters scan the input (e.g., an image), and the results are concatenated as feature maps. Multiple filters yield various feature maps, which are stacked to construct the designated mapping. Activation functions and pooling operations may also be used, with the former adding nonlinearity to the mapping and the latter adding noise tolerance [50,51,68]. LSTM is typically used for sequential data. Different from the spatial local-connectivity feature extraction in CNN, it aims to abstract the temporal knowledge. Earlier DNNs, including RNN [67], also target the temporal features. RNN, however, may suffer from error explosion/vanishing problems in training. LSTM adopts gate operations to automatically select the historic input that may be useful in the mapping. As shown in many works, LSTM can improve the training efficiency, mapping accuracy, and deployment cost [50,51].
In coping with the dynamics problem, both CNN and LSTM may be useful. Via CNN, spatial correlations of different states are abstracted. In LSTM, the temporal features of each state are modeled. Previous works advocated a fusion of CNN and LSTM in designing the DNN architecture [50,51] for parameter identification, icing detection, etc. As there is not an input explicitly defined as "image," the key in implementing CNN and LSTM in these problems is to reshape the dynamic data (e.g., flight states and control commands) into an image-like format. We detail this in Section 4.2.

Data Preprocessing.
We perform data preprocessing to generate the image-formatted input to the CNN and LSTM; see Figure 4. Via real flight or simulation, we have the records of different states. We inject faults into the ADS states, allocate all other flight data, and stack them into a 2D matrix (middle plot). In this matrix, each row stands for the historic measurements of a certain state and each column for the values of the states at a certain moment. For each group of aircraft flight states (e.g., air data, accelerations, rotational speeds, and rotational angles), we stack this matrix separately.

Figure 6: DNN-full; this "full" architecture adopts both CNN and LSTM operations for all variables that relate to the ADS evolution. CNN kernel size, filter numbers, LSTM nodes, and fully connected (FC) layers are specified directly following [50,51]. Conv: convolutional; MP: max-pooling; FL: flatten.
A time window is also used. At each moment t, we consider the flight records in the range from t − ΔT to t (both included), wherein ΔT is 30 s. Following [50,51], this window may be understood as a compromise between the aircraft "fast" motion modes, whose periods are in seconds (e.g., longitudinal short period and lateral roll), and the "slow" modes, which typically last for tens/hundreds of seconds (e.g., longitudinal phugoid and Dutch roll). For different aircraft, the data are recorded at various sampling rates (e.g., 20 Hz for B_1 and 30 Hz for F). We downsample them to a unified frequency f_s = 1/Δt, wherein Δt = ΔT/n = ΔT/30 = 1 s. We then stack the state matrix using the resampled data.
In the downsampled flight data, the range of each state varies significantly (see Table 3). In practice, this may create numerical difficulties in the DNN training (singularities, error vanishing/explosion). Normalization is adopted to process the sampled data. Following [50,51], this normalization is performed linearly along each row of the stacked matrix. After normalization, the "image" we obtain in the right plot of Figure 4 is adopted as input to the DNN.
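The windowing, downsampling, and row-wise normalization can be sketched together as below. Two choices here are assumptions rather than details confirmed by the text: downsampling is done by plain decimation (picking every k-th raw sample, without anti-aliasing filtering), and the "linear" normalization is taken to be min-max scaling of each row to [0, 1]. The function name and the synthetic records are illustrative.

```python
import numpy as np

def state_image(records, fs_raw, t_end, window_s=30.0, dt=1.0):
    """records: (n_states, n_samples) sampled at fs_raw Hz -> (n_states, 31) image."""
    step = int(round(fs_raw * dt))             # raw samples per resampled step
    end = int(round(t_end * fs_raw))           # raw index of the moment t
    n = int(round(window_s / dt))              # 30 intervals, hence 31 columns
    idx = end - step * np.arange(n, -1, -1)    # t - ΔT, ..., t (both included)
    window = np.asarray(records, dtype=float)[:, idx]
    lo = window.min(axis=1, keepdims=True)
    hi = window.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid dividing by zero on flat rows
    return (window - lo) / span

# Synthetic air data group recorded at 20 Hz for 60 s (1201 samples).
t = np.arange(1201) / 20.0
records = np.vstack([150.0 + 0.1 * t, np.sin(0.1 * t), np.zeros_like(t)])
image = state_image(records, fs_raw=20, t_end=60.0)
```

With three air data states and a 30 s window at 1 Hz, the result is the 3 × 31 "image" size discussed later for the CNN kernel study.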

Experimental Setup.
In training/testing the DNN, we record both the training loss and the validation accuracy (with all testing data designated as the validation dataset). Following [51], we repeatedly perform 30 training runs for all DNN architectures, exclude the best/worst 5 runs, and summarize the outcome via the remaining 20 records. We also adopt the Keras (version 2.0.8) API with Tensorflow (version 1.3.0) as the backend in the programming. Our computational platform is configured with CUDA 8.0 and cuDNN 6.0 with Nvidia driver version 384.69 (GPU: Nvidia GTX2080Ti) on a Windows 10 system (Python 3.7). The platform has one i9-9900K CPU and 32 GB RAM.    [54]. Given different DNN architectures, NNI probes into the architecture, analyzes the training data, and decides an optimal combination of the training parameters; see Figure 5. Via NNI, the different DNN architectures are trained separately in their associated "optimal" manners, which we believe provides more solid ground for the comparative studies.
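The trimming of repeated runs can be sketched as follows. The accuracies are synthetic, and the reported statistics (mean, minimum, maximum) are an assumption, since the text does not state which summary is drawn from the remaining 20 records.

```python
import numpy as np

def summarize_runs(accuracies, n_trim=5):
    """Drop the n_trim worst and n_trim best runs, then summarize the rest."""
    kept = np.sort(np.asarray(accuracies, dtype=float))[n_trim:-n_trim]
    return {
        "n_kept": int(kept.size),
        "mean": float(kept.mean()),
        "min": float(kept.min()),
        "max": float(kept.max()),
    }

runs = np.linspace(0.90, 0.96, 30)   # hypothetical validation accuracies of 30 runs
summary = summarize_runs(runs)
```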

Development and Study on the DNN-Based Fault Detection Scheme
Development of the DNN-based FD scheme is twofold: devising the DNN large architecture and optimizing the parameters enclosed within. Both are detailed in Sections 5.1 and 5.2 and verified via ablation studies in Section 5.3. The interpretability analysis of the DNN small structure is given in Section 5.4. The experiences and lessons we have learned are discussed in Section 5.5.

Devising the Large Architecture.
In devising the large architecture, we need to (1) select the DNN input (the output being the fault cases directly) and (2) decide the CNN/LSTM branches. To the authors' best knowledge, there is no universal rule for this issue. We refer to [50,51] and start with a "full" architecture in Figure 6 (DNN-full). In Eq. (1), the accelerations, rotational angles, and rotational speeds all relate to the ADS states. DNN-full thus absorbs all these variables in the input layer and uses both CNN and LSTM branches to fully extract the features. We follow [50,51] in specifying the CNN filters/LSTM nodes. NNI is adopted to decide the optimal training coefficients, and    Table 5 yields promising results (validation accuracy 0.925).
Starting with DNN-full, we proceed to simplify the DNN architecture (whilst the performance still needs to be guaranteed). Referring to Eq. (3), the input of rotational angles may be redundant, as they can be completely calculated from the rotational speeds. The input branches of the rotational angles are therefore cropped from DNN-full, and the rest remains unchanged; we denote the result as DNN-angles (Table 5). Again via NNI, DNN-angles is trained in an optimal manner, which yields an even better outcome than DNN-full (validation accuracy 0.932).
The better training outcome of DNN-angles provides an important cue in devising the DNN architecture: whilst the input (e.g., data resources) and operations (e.g., CNN and LSTM) enclosed within the DNN must be complete, the architecture should be as simple as possible, so as to lighten the training load. In DNN-angles, referring to Eq. (1), we consider that the input air data, accelerations, and rotational speeds are complete (all necessary), as they are all independently related to the ADS evolution in Eq. (1). However, CNN is typically used to abstract coupling effects. While coupling does exist in the air data, the 3 states of the rotational speeds and accelerations are considered uncoupled, as each state is generated via independent control actions. We thus exclude CNN from the rotational speed and acceleration branches and propose the DNN-final architecture (Table 5). Again via NNI, the optimal training outcome of DNN-final is even better than that of DNN-angles (0.946). DNN-final is the architecture we adopt in developing the DNN-based FD scheme.

Optimizing the Large Structure Parameters.
In the previous section, via gradually cropping the DNN architecture and investigating the training outcome, we adopted DNN-final. Whilst this architecture is further verified in the next section via ablation studies, in this section, we focus on DNN-final and seek the optimal large structure parameters. In particular, we aim to find the best CNN    [50,51] and adopt a 3 × 3 CNN kernel size. The input size in this paper, however, is smaller (3 × 31) than in [50,51] (11 × 31). We thus study 3 different CNN kernel sizes, i.e., 3 × 3, 2 × 2, and 1 × 1; see Figure 7. Based on the comparative results in the figure, kernel size 2 × 2 yields the best training outcome (i.e., DNN-K2). DNN-K2 agrees with the physical understanding of convolutional operations. On the one hand, although 1 × 1 kernels have appeared in the literature to provide a delicate scanning of the input [69], the essence of CNN filters is to extract local-connectivity features. A kernel spanning more than one element is needed for "patching" the features. DNN-K1 in Figure 7 thus yields the worst performance. On the other hand, a full kernel that operates on a whole input dimension (e.g., 3 × 3 in our case for the 3 × 31 image) has appeared [70], which was claimed to provide a panoramic view of the entire image, but a larger CNN kernel may render the training more difficult, hence jeopardizing the training outcome, as yielded by DNN-final in Figure 7. In the following, all iterative studies start with DNN-K2.

CNN Filter Numbers.
Based on DNN-K2, we proceed to decide how many filters should be used in each layer of the CNN. In DNN-K2, we use 64 filters for 2 consecutive CNN layers, followed by another 2 layers with 128 filters, with a max-pooling layer in between. In our studies, we retain this structure and inflate/deflate the CNN filter numbers in the 4 layers synchronously, thus creating a total of 9 descendant architectures (Table 6). Note that in the table, the subsequent fully connected layers are also tuned to match the CNN output size. Training outcomes of the 10 architectures in Table 6 are plotted in Figure 8, wherein 48 CNN filters yield the best performance.
The plots in Figure 8 agree with the understanding of CNN filters. To start with, multiple CNN filters must be used to stack various features. With more filters adopted in the CNN, the performance may improve. However, there also exists a certain number of CNN filters beyond which the feature extraction is "saturated," as illustrated by the plateau in Figure 8. We mark the DNN-K2 with 48 CNN filters as DNN-C48, starting from which we perform the iterative studies on the LSTM node number in the following.
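To give a feel for how the training load grows with the filter number, the sketch below counts the trainable parameters of the 4-layer convolutional stack. It assumes a single-channel input, 2 × 2 kernels, one bias per filter, and that the quoted filter number refers to the first two layers with the last two layers doubled (as in the 64/128 baseline); these are assumptions of this illustration, and the fully connected layers are not counted.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter of a 2D convolutional layer."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def cnn_stack_params(base_filters, kernel=(2, 2), in_channels=1):
    """Two layers of base_filters followed by two layers of 2 * base_filters."""
    widths = [base_filters, base_filters, 2 * base_filters, 2 * base_filters]
    total, channels = 0, in_channels
    for width in widths:
        total += conv2d_params(kernel[0], kernel[1], channels, width)
        channels = width          # max-pooling in between adds no parameters
    return total

baseline = cnn_stack_params(64)   # 64/64/128/128 stack of DNN-K2
smaller = cnn_stack_params(48)    # assumed 48/48/96/96 stack
```

Because the count grows roughly quadratically with the filter number, a moderately narrower stack is considerably lighter to train, which is consistent with a saturation plateau at larger filter numbers.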

LSTM Node Numbers.
In DNN-C48, all LSTM operations remain unchanged as in [50,51], and 128 LSTM nodes are used ubiquitously in all LSTM layers. We retain the CNN branches in DNN-C48, tweak the node number in all LSTM layers (synchronously), and create multiple descendant architectures. Again, via NNI, we perform optimal training for all these architectures. The associated training results are shown in Figure 9. Clearly, the training outcome reaches a peak at 16 nodes.
In Figure 9, from 4 nodes to 16, we observe an almost linear increase in the training outcome; this is probably due to the relatively simple activation functions used in the LSTM, and more LSTM nodes provide better performance as more temporal features are extracted. After 16 nodes, however, there is a slow drop (also almost linear) towards 80 nodes, mainly due to the heavier LSTM training load. A plateau is again found after 80 nodes, and the reason is similar: the LSTM node number is already saturated; purely increasing the LSTM nodes will not improve the performance.   Table 6. DNN-opt yields the best validation accuracy at 0.972.

kernel size 2 × 2, filter number 48, and LSTM node number 16) and detail its architecture in Figure 10. Via NNI, training coefficients of DNN-opt are optimized in Figure 5. Training histories of DNN-opt are illustrated in Figure 11.
We also apply DNN-opt to the testing data in Section 3 and characterize the testing confusion matrix in Table 7. In the table, the FD accuracies in 6 different cases (see Table 4) for 4 different aircraft and conditions (see Table 3) are investigated. In the confusion matrix for each aircraft and condition, the horizontal direction indicates the real cases and the vertical direction the cases detected by DNN-opt. The shadowed diagonal elements in the matrix represent the detection accuracy for each case, and the off-diagonal values indicate wrong detection rates (e.g., 2.46% in the D AP cruise matrix indicates that 2.46% of the real case 3 data were detected as case 0 via DNN-opt). As shown in the table, DNN-opt yields promising results (90% accuracy) in the ADS fault detection problem for all 4 aircraft at diverse flight conditions.
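The layout of such a confusion matrix can be sketched as below, with real cases along the horizontal direction (columns) and detected cases along the vertical direction (rows), and each column normalized to percentages of that real case. The labels are synthetic and the function name is illustrative.

```python
import numpy as np

def confusion_percent(real, detected, n_cases=6):
    """Entry [d, r] is the percentage of real-case-r samples detected as case d."""
    counts = np.zeros((n_cases, n_cases))
    for r, d in zip(real, detected):
        counts[d, r] += 1
    totals = counts.sum(axis=0, keepdims=True)            # samples per real case
    return 100.0 * counts / np.where(totals > 0, totals, 1.0)

# Synthetic labels: one of four real case 3 samples is detected as case 0.
real = [0, 0, 0, 0, 3, 3, 3, 3]
detected = [0, 0, 0, 0, 3, 3, 3, 0]
cm = confusion_percent(real, detected)
```

Here `cm[0, 3]` plays the role of the 2.46% example in the text: the share of real case 3 data wrongly detected as case 0.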

Ablation Studies on Large Structure. The DNN-opt architecture originates from DNN-final, which was obtained by cropping redundant operations from the DNN-full architecture. In this part, we verify the DNN-opt architecture via ablation studies (see Table 8 and Figure 12). We crop the LSTM and CNN branches for designated inputs from DNN-opt, optimize the training of the remaining part via NNI, and investigate the training outcomes. As found in Table 8 and Figure 12, convolution of the air data branch affects the DNN-opt performance most significantly (the accuracy of opt-adsCNN drops most drastically). This agrees with the previous analysis in cropping from DNN-full to DNN-final: convolution strives to abstract the coupling effects of the input, and in our case, we rely primarily on the correlations in air data to assert the fault. Although other structures (e.g., the LSTM for accelerations, opt-accLSTM) do not matter as much as the air data convolution, the associated performances still deteriorate relative to DNN-opt. The results in Table 8 and Figure 12 show that DNN-opt achieves the best performance among all ablated structures.
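The comparison behind such an ablation study amounts to ranking each ablated variant by its accuracy drop relative to the unablated baseline. The sketch below assumes a hypothetical helper `rank_ablations`; the variant names opt-adsCNN and opt-accLSTM come from the text, but every accuracy value is a placeholder chosen for illustration and is not taken from Table 8.

```python
def rank_ablations(baseline_acc, ablated_acc):
    """Sort ablated variants by accuracy drop relative to the baseline,
    largest drop first. A larger drop means the removed branch mattered more."""
    drops = {name: baseline_acc - acc for name, acc in ablated_acc.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

# Placeholder accuracies (assumptions for illustration, NOT the paper's results).
baseline_acc = 0.95
ablated_acc = {
    "opt-adsCNN": 0.70,            # air data convolution removed
    "opt-accLSTM": 0.90,           # acceleration LSTM removed
    "hypothetical-variant": 0.93,  # invented third variant
}

ranking = rank_ablations(baseline_acc, ablated_acc)
for name, drop in ranking:
    print(f"{name}: accuracy drop {drop:.2f}")
```

With these placeholder numbers the air data convolution variant ranks first, mirroring the qualitative finding reported above.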

Interpretable Analysis on Small Structure. We seek to interpret the air data convolutional operations in DNN-opt, as they affect the DNN-opt performance most significantly. In Figure 13, flight records corresponding to fault case 4 (Table 4, sideslip angle drift) are plotted. The associated state image is stacked in Figure 14. Features abstracted in the 4 convolutional layers of DNN-opt are also visualized. Note that we adopt "same" padding in DNN-opt, so the feature dimension exported from the first 2 convolutional layers remains the same as that of the input state image. The highlighted features in Figures 15 and 16 correspond to the fault occurrence marked with a red box in Figure 14 (last 1/3 of the time window, i.e., 21~31 s). In the second 2 convolutional layers, which follow the max-pooling, the feature size is reduced to half; the highlighted features also reflect the fault marked in Figure 14 (last 1/3 along the horizontal axis).
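The dimension bookkeeping described above can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with 2 × 2 kernels and a 2 × 2 max-pooling with stride 2 (the text only states that the size is halved), and it uses a hypothetical 12 × 32 state image because the actual image size is not given in this section.

```python
import math

def conv_out(size, kernel, stride=1, padding="same"):
    """Output length of a convolution along one axis."""
    if padding == "same":
        # "same" padding keeps ceil(size / stride) outputs regardless of kernel.
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding}")

def pool_out(size, pool=2, stride=None):
    """Output length of an unpadded max-pooling along one axis."""
    stride = stride or pool
    return (size - pool) // stride + 1

def trace_shapes(height, width):
    """Feature-map sizes through conv, conv, max-pool, conv, conv."""
    shapes = [(height, width)]
    for layer in ["conv", "conv", "pool", "conv", "conv"]:
        h, w = shapes[-1]
        if layer == "conv":
            shapes.append((conv_out(h, kernel=2), conv_out(w, kernel=2)))
        else:
            shapes.append((pool_out(h), pool_out(w)))
    return shapes

# Hypothetical 12 x 32 state image (assumed size, not from the paper).
shapes = trace_shapes(12, 32)
print(shapes)
```

Because the horizontal axis is only rescaled, a fault occupying the last third of the input remains in the last third of every feature map, which is why the highlighted regions stay aligned with the marked fault.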
To better illustrate the feature extraction in DNN-opt, we adopt the class activation mapping (CAM) technique proposed in [61] and visualize the CAM plots for the 6 cases separately in Figure 19. For illustrative purposes, the state image imported to DNN-opt is also shown on the left, alongside the CAM plots. Note that a "hotter" mapping on CAM indicates the regions that the convolution hinges on to assert the FD output. In Figure 19, CAM agrees with the general understanding of the FD problem, as the highlighted hotter (red) regions overlap the areas where the fault occurs (marked with red).
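For readers unfamiliar with CAM, the following sketch shows the commonly used formulation: the map for a class is the weighted sum of the last convolutional feature maps, with weights taken from the dense layer that follows global average pooling. Whether DNN-opt uses exactly this layer arrangement is not stated in this section, so the arrangement, the function name, and the toy feature maps below are all assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Compute a normalized CAM.

    feature_maps:  array of shape (H, W, K), output of the last conv layer.
    class_weights: array of shape (K,), dense-layer weights for one class
                   (assumes global average pooling precedes the dense layer).
    Returns an (H, W) map scaled to [0, 1]; larger values are "hotter".
    """
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam

# Toy feature maps (assumption): channel 0 fires on the last third of the
# horizontal axis, channel 1 is a weak uniform background.
features = np.zeros((4, 6, 2))
features[:, 4:, 0] = 1.0
features[:, :, 1] = 0.2
weights = np.array([1.0, 0.5])

cam = class_activation_map(features, weights)
print(np.round(cam, 2))
```

In this toy case the hot region of `cam` coincides with the last third of the horizontal axis, which is the kind of overlap between hot regions and fault locations that the CAM plots are used to check.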

Experiences and Lessons. In Sections 5.1-5.4, we have completed the training/testing of DNN-opt and presented the explainability and interpretability analysis. In this part, we summarize our experiences and lessons in developing the DNN-based FD scheme:
(i) Start with a "full" DNN: iterative studies are usually adopted in devising the DNN large structure and optimizing the associated parameters. This iteration, however, must start from a certain architecture. Although there is still no rule for initializing such a DNN architecture, starting from a "full" one proves to be effective in our work (i.e., DNN-full). The full DNN should involve all available data sources in the input layer and implement all potentially useful DNN operations to fully extract the features (e.g., CNN for spatial and LSTM for temporal)
(ii) Simplify the architecture: although redundant inputs/operations in the DNN architecture may provide extra information in extracting the features, the training load is usually high. The DNN architecture should be simplified (cropped) as much as possible, to the point that an accuracy plateau/peak is found (e.g., DNN-C48 in our work)
(iii) The iteration policy: multiple large structure parameters need to be studied in simplifying the DNN architecture (e.g., CNN kernel size and filter number). Whilst we suggest studying one parameter at a time, the iteration policy is important: which parameter to iterate first. In our studies, we tried different orders of iterating the CNN kernel size, CNN filter number, and LSTM node number and found the one presented in this paper to be the most effective. In other related works, more parameters may need to be tuned (e.g., the number of fully connected nodes); a similar policy still needs to be studied
(iv) Explainability and interpretability analysis: neural networks were long considered a "black box". Recent developments in computer vision and natural language processing, however, indicate that certain rules do exist in explaining (devising) the DNN large structure, and the enclosed operations correspond to what humans understand. Similar concepts/approaches are advocated in developing/analyzing DNN-based FD schemes (e.g., ablation studies and CAM, which yield promising results in our work)
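The "one parameter at a time" policy discussed above can be sketched as a coordinate-wise search in which the iteration order is an explicit argument. Everything here is illustrative: `coordinate_search` is a hypothetical helper, the candidate grids are assumed, and `toy_score` is a synthetic stand-in for a full training run (it merely peaks at kernel size 2, 48 filters, and 16 LSTM nodes, the values reported for DNN-opt).

```python
def coordinate_search(evaluate, grid, start, order):
    """Tune one parameter at a time, in the given order.

    evaluate: callable mapping a parameter dict to a score (higher is better).
    grid:     dict of parameter name -> list of candidate values.
    start:    dict with the initial value of every parameter.
    order:    list of parameter names giving the iteration policy.
    """
    best = dict(start)
    best_score = evaluate(best)
    history = []
    for name in order:
        for value in grid[name]:
            trial = dict(best)
            trial[name] = value
            score = evaluate(trial)
            history.append((trial, score))
            if score > best_score:
                best, best_score = trial, score
    return best, best_score, history

def toy_score(params):
    """Synthetic objective (assumption): stands in for validation accuracy."""
    return -(abs(params["kernel"] - 2)
             + abs(params["filters"] - 48) / 16
             + abs(params["nodes"] - 16) / 16)

# Assumed candidate grids and starting point, for illustration only.
grid = {
    "kernel": [1, 2, 3, 4],
    "filters": [16, 32, 48, 64],
    "nodes": [4, 8, 16, 32, 64, 80, 128],
}
start = {"kernel": 3, "filters": 32, "nodes": 128}

best, best_score, history = coordinate_search(
    toy_score, grid, start, order=["kernel", "filters", "nodes"]
)
print(best, best_score, len(history))
```

The toy objective is separable, so any order reaches the same optimum; with a real network the parameters interact, which is why the order passed as `order` can change both the result and the number of training runs needed.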

Conclusion and Future Works
Exemplifying the fault detection (FD) problem of aircraft air data sensors, we aim to develop a robust DNN-based FD scheme in this paper. We model the FD problem using aircraft inertial reference unit measurements as equivalent inputs and construct a dedicated database which involves different aircraft/conditions; both provide a solid basis for training/testing the DNN. In devising the DNN architecture, we adopt iterative studies to specify the large structures and optimize the parameters enclosed within. Ablation studies are also adopted to explain the constructed DNN architecture. Whilst the developed DNN yields promising training/testing performances, we apply methods widely used in computer vision (i.e., feature visualization and CAM) to interpret the DNN small structure operations, which correspond to what humans understand in similar contexts. Combining all the above, the developed DNN is considered robust: its performance is accurate and scalable, its large structure explainable, and its small structure interpretable. As a continuation of the work, we plan to apply similar formulations to other FD problems, e.g., aircraft actuator faults and communication datalink failures in commercial airlines. Interpretation of other operations (e.g., LSTM) in the fault detection context will also be studied.

Figure caption (partial): ... Table 4); the highlighted (red hot) features correspond to the faults that occurred on the state image (marked with red boxes).

Data Availability
Data are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.