To improve data quality and the quality of decision-making related to the design, operation, and management of water resources, this research develops an outlier detection method for hydrological time series that identifies data deviating from historical patterns. The method first builds a forecasting model on the historical data and then uses it to predict future values. Anomalies are assumed to occur when the observed values fall outside a given prediction confidence interval (
As the fundamental resource for water resources management and planning, long-term hydrological data are sets of discrete recorded values of hydrological elements collected over time, and they are frequently analyzed in fields such as flood and drought control, water resources management, and water environment protection. With the development of data acquisition and data transmission technology, hydrological departments have collected ever-increasing amounts of time series data from automatic monitoring systems via loggers and telemetry systems. With these datasets, hydrologic time series analysis becomes workable and credible for building mathematical models to generate synthetic hydrologic records, to forecast hydrologic events, to detect trends and shifts in hydrologic records, and to fill in missing data and extend records [
the large volumes of data,
parameter patterns that are specific to each hydrological acquisition system and change from one system to another because of their multi-temporal-scale characteristics,
abnormal events or disturbances that create spurious effects in the data series and result in unexpected patterns,
inaccuracies in hydrological models due to imprecise and outdated information, logger and communications failures, poor calibration, and lack of system feedback.
Such situations in hydrological information systems may result in the DRQP (data rich, but quality poor) phenomenon. Consequently, the original monitoring data (i.e., precipitation, discharge, and water levels) should undergo a preprocessing step to eliminate the negative influence of incorrect or abnormal data caused by instrumentation faults, inherent changes in the data, operation errors, or other possible factors [
This study develops a real-time outlier detection method that employs a window-based forecasting model for hydrologic time series collected from automatic monitoring systems. The method builds a forecasting model from a sequence of historical point values within a given window to predict future values. If the observed value differs from the predicted value by more than a certain threshold, an outlier is indicated. The method uses prediction confidence interval (
In order to evaluate the proposed method, it was applied to two different hydrological variables, water level and daily flow, from
The rest of the paper is organized as follows. In the next section (Section
A time series
Time series analysis is the investigation of a temporally distributed sequence of data or the synthesis of a model for prediction wherein time is an independent variable; as a consequence, the information obtained from time series analysis can be applied to forecasting, process control, outlier detection, and other applications [
In hydrology, time series analysis is one of the frontier scientific issues because it can detect and quantitatively describe each of the hydrologic processes underlying a given sequence of observations. Moreover, hydrologic time series analysis can also be used for building mathematical models to generate synthetic hydrologic records, to forecast hydrologic events, to detect trends and shifts in hydrologic records, and to fill in missing data and extend records. Consequently, time series analysis has become a vital tool in the hydrological sciences, and its importance has been dramatically enhanced in the recent past by the ever-increasing interest in the scientific understanding of climate change [
In time series analysis, it is assumed that the data (observations) consist of a systematic pattern and a stochastic component; the former is deterministic in nature, whereas the latter accounts for the random error and usually makes the pattern difficult to identify. Previous research usually equates the stochastic component with system error and simply discards it so as not to complicate the statistical analyses. However, the stochastic component potentially includes interesting and meaningful information, so it must be treated with caution. For this reason, outlier detection has become an active research topic in recent years.
An outlier can be defined as “observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism” [
Outlier detection, also known as anomaly detection in some of the literature, is an important long-standing research problem in the domains of data mining and statistics. The major objective of outlier detection is to identify data objects that are markedly different from, or inconsistent with, the remaining set of data [
Outlier detection is a very broad field and has been studied in the context of a large number of application domains where many detection methods have been proposed according to the different data characteristics. Recently, there has been significant interest in detecting outliers in time series. Generally, methods for time series outlier detection should consider the sequence nature of data and operate either on a single time series or on a time series database. The goal of outlier detection on a single time series is to find an anomalous subregion, while the goal of the latter is to identify a few sequences as outliers or to identify a subsequence in a test sequence as an outlier. In some cases, a single time series is converted to a time series database through the use of a sliding window [
Given a single time series, one can find particular elements (or time points) within the time series as outliers or find subsequence outliers. Fox defines two types of outliers (type I/additive and type II/innovative) based on the data associated with an individual object across time, ignoring the community aspect completely [
The problem of outlier detection in time series database mainly focuses on how to find all anomalous time series. It is assumed that most of the time series in the database are normal while a few are anomalous. Similar to the traditional outlier detection, the usual recipe of solving such problems is to first learn a model based on all the time series sequences in the database and then compute an outlier score for each sequence with respect to the model [
Hydrologic systems involving outliers invariably represent complex dynamical systems. The current state and future evolutions of such dynamical systems depend on countless properties and interactions involving numerous highly variable physical elements. The representation of such dynamical systems in their corresponding models is complicated because certain relationships can only be developed through analyses.
Outlier detection in hydrologic data is a common problem which has received considerable attention in the univariate framework. In the multivariate setting, the problem is well established in statistics. However, in the hydrologic field, the concepts are much less established. A pioneering work in this direction was recently presented by Chebana and Ouarda [
Although many outlier detection methods exist in the literature, there is little discussion on how to select a proper detection method for hydrological outliers. This is mainly because most outlier detection methods are statistical approaches that require the data to follow a particular distribution, and because the selection of a suitable method is critically determined by the intent of the analyst and the intended use of the results. Analysts have to consider several technical aspects in their decision-making, such as the tradeoff between accuracy and efficiency, the evaluation of consequences (i.e., masking and swamping), the design assumptions and limitations of different methods, and the preference for a parametric or nonparametric approach. Without a thorough understanding of outlier phenomena, it is difficult to determine a suitable outlier detection method.
Faced with such challenges, this work proposes a new method to detect outliers that splits a given historical hydrological time series into subsequences with a sliding window and then an autoregressive (
In this section, we formulate the outlier detection problem and give formal definitions of the concepts used in the proposed algorithm. We then introduce the algorithm for detecting outliers in time series based on the sliding-window prediction model. In addition, we describe efficient strategies for choosing the optimal parameters to meet users' requirements. The formal formulation of the contextual outlier detection algorithm follows.
Hydrological quantities vary with time, and their changes are referred to as hydrological processes. As important scientific data resources, hydrological data are the discrete records of hydrological processes and can be divided into flow, water level, rainfall, evaporation, and other hydrologic time series according to the physical quantities they represent.
A hydrological time series
For outlier detection purposes, we are typically not interested in any of the global properties of a time series; rather, we are interested in local subsections of the time series, which are called subsequences.
Given a time series
Since all subsequences may potentially be abnormal, any algorithm will eventually have to extract all of them; this can be achieved by use of a sliding window.
Given a time series
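The sliding-window extraction of subsequences can be illustrated with a minimal sketch. The function name `sliding_windows` and the parameter `w` (window length) are hypothetical names chosen for this example; the sketch assumes a step of one point between consecutive windows.

```python
def sliding_windows(series, w):
    """Return every contiguous subsequence of length w (step of one point)."""
    if w <= 0 or w > len(series):
        raise ValueError("w must satisfy 1 <= w <= len(series)")
    # A series of n points yields n - w + 1 overlapping subsequences.
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


levels = [3.1, 3.2, 3.0, 3.4, 3.3]
for sub in sliding_windows(levels, 3):
    print(sub)
```

Under these assumptions, a series of n points produces n - w + 1 subsequences, each of which is a candidate for outlier analysis.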
Generally, the first problem in time series outlier detection is to define what kind of data in a given dataset is abnormal; that is, the definition of an outlier determines the goal of outlier detection. In hydrologic time series, sequences of different physical quantities show very different anomaly characteristics; therefore, it is difficult to give a uniform definition of abnormality. In this paper, we identify a subsequence as an outlier based on its nearest neighbor.
Given a time series
Given a time series
From the above definition, one can see that the nearestneighbors window size
After studying the current situation and challenge of hydrological time series and its outliers, this study proposes a new outlier detection method that uses a sliding window of hydrological time series
In brief, the method consists of the following steps beginning at time
Define
Build a nearest-neighbor-window prediction model that takes
Compare the actual measurement at time
Modify
Repeat Steps
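The loop formed by these steps can be sketched as follows for the one-sided case. This is an illustration under stated assumptions, not the authors' implementation: the window mean stands in for the prediction model, the interval uses a normal quantile times the window standard deviation, and a flagged point is replaced by its prediction before the window advances. The function name `detect_outliers` and the parameters `k` (window width) and `p` (confidence coefficient) are hypothetical.

```python
import statistics
from statistics import NormalDist


def detect_outliers(series, k=5, p=0.95):
    """Flag points that fall outside a prediction interval built from the
    previous k points (illustrative one-sided sliding-window detector)."""
    if k < 2:
        raise ValueError("k must be at least 2 to estimate a spread")
    z = NormalDist().inv_cdf(0.5 + p / 2)  # two-sided normal quantile (assumption)
    cleaned = list(series)                 # working copy used to build windows
    flags = [False] * len(series)
    for t in range(k, len(series)):
        window = cleaned[t - k:t]               # Step 1: neighbor window
        predicted = statistics.fmean(window)    # Step 2: stand-in prediction model
        spread = statistics.stdev(window)
        lower, upper = predicted - z * spread, predicted + z * spread
        if not lower <= series[t] <= upper:     # Step 3: compare with the interval
            flags[t] = True
            cleaned[t] = predicted              # Step 4 (assumed): replace, then advance
    return flags                                # Step 5: the loop repeats for each t


levels = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 25.0, 10.0, 10.0, 10.1]
print([i for i, f in enumerate(detect_outliers(levels)) if f])
```

Replacing a flagged value keeps a detected outlier from contaminating later windows; whether to replace or keep the raw value is a design choice that this sketch fixes arbitrarily.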
The detection process of data point
Diagram of the proposed outlier detection method.
The first step of this outlier detection process, the
Generally, outliers are detected in two types of hydrological data: historical time series data and real-time data. The main difference is that the former uses both the previous and subsequent neighbor windows as input to detect an outlier, while the latter uses only the previous neighbor window. Accordingly, the neighbor window can be divided into one-sided and two-sided types.
Note that
Here,
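The difference between the two window types can be made concrete with a short sketch. The function names and the choice of `k` points on each side are assumptions made for illustration; the exact window definitions in the text may use a different number of points.

```python
def one_sided_window(series, t, k):
    """Previous (left) neighbors of the point at index t: at most k points."""
    return list(series[max(0, t - k):t])


def two_sided_window(series, t, k):
    """Previous and subsequent neighbors of index t, excluding the point itself."""
    left = list(series[max(0, t - k):t])
    right = list(series[t + 1:t + 1 + k])
    return left + right


flow = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(one_sided_window(flow, 5, 2))  # neighbors available in real time
print(two_sided_window(flow, 5, 2))  # neighbors available in historical data
```

Near the ends of the series both functions return shorter windows, which a caller would need to handle before fitting a prediction model.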
In this step, an autoregressive (
The neighborhood point sets
Generally, two-sided neighbor windows need both the previous and subsequent neighbors of a point; however, the right neighbors may contain outliers that have not yet been detected, which may affect subsequent detection results. So, in some application fields, only the previous (left) neighbor-window data can be used to predict and identify forthcoming outliers. Therefore, a simple modification of the two-sided-window model is often used to predict the measurement at time
Similar to the two-sided neighbor windows, the weight vector
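As an illustration of how a weight vector can turn a neighbor window into a prediction, the sketch below computes a normalized weighted average. The normalization by the sum of the weights and the name `weighted_prediction` are assumptions for this example; the text's own weighting scheme is not reproduced here.

```python
def weighted_prediction(neighbors, weights):
    """Predict a value as the weighted average of its neighbor window."""
    if not neighbors or len(neighbors) != len(weights):
        raise ValueError("neighbors and weights must be non-empty and equal in length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, neighbors)) / total


# Hypothetical weights that favor the more recent neighbor.
print(weighted_prediction([10.0, 12.0], [1.0, 3.0]))
```

With equal weights the prediction reduces to the plain window mean, so the weights control how strongly nearer neighbors dominate.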
Given the model prediction from Section
The two-sided-window approach to outlier detection is illustrated with a simple example in Figure
Outlier detection example, where the raw data, predicted value, confidence bound, and outlier are identified by different symbols. In this example,
In order to detect the outliers in the time series, the proposed methods should calculate the plausible values range
(1) The window width
(2) The confidence coefficient
To find the combination of algorithm parameters that maximizes the detection rate, a cross-validation scheme is applied. The complete sample is split into two segments: a training dataset and a testing dataset. This makes it possible to check whether the parameters found during the first period, the training phase, are consistent and still valid in a different period, the testing phase. Cross-validation is a generic technique for validating statistical procedures and has been applied in many different contexts. Furthermore, in the training phase, the training set is divided into 10 nonintersecting subsets of equal size, chosen by random sampling. The model is then trained 10 times, each time reserving one of the subsets as a validation set on which the model error is evaluated while fitting the model parameters using the remaining nine subsets. The model parameters with the lowest mean squared error among the 10 training models are then selected for the final model [
To demonstrate the efficacy of the outlier detection methods developed in this study for data QA/QC, they are applied to hydrological data series from the national hydrology database of MWR, China. In the following, the real data are described, and the results are presented and discussed. More precisely, the previously presented approaches are first used to identify outliers; then, the performance evaluation is discussed and interpreted on the basis of the hydrological data; finally, some results using multivariate approaches are provided for comparison purposes.
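A minimal sketch of the fold construction and parameter selection is given below. It makes several simplifying assumptions: the window mean stands in for the prediction model, so there are no fitted parameters and each fold is used only to measure validation error; the candidate window width with the lowest mean validation error is selected; and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n, folds=10, seed=0):
    """Split indices 0..n-1 into disjoint random subsets of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::folds] for i in range(folds)]


def validation_mse(series, k, targets):
    """Mean squared error of the window-mean predictor at the given indices."""
    errors = [(series[t] - statistics.fmean(series[t - k:t])) ** 2
              for t in targets if t >= k]
    return statistics.fmean(errors) if errors else None


def select_window_width(series, candidates, folds=10, seed=0):
    """Return the candidate width with the lowest mean validation MSE."""
    parts = k_fold_indices(len(series), folds, seed)
    scores = {}
    for k in candidates:
        fold_errors = [e for e in (validation_mse(series, k, p) for p in parts)
                       if e is not None]
        scores[k] = statistics.fmean(fold_errors)  # assumes a non-trivial series
    return min(scores, key=scores.get), scores


series = [float(i % 2) for i in range(200)]  # synthetic alternating signal
best, scores = select_window_width(series, [1, 2, 3])
print(best, scores)
```

On the synthetic alternating signal, a width of 2 averages one high and one low value and therefore gives the smallest error; real hydrological series would need the actual prediction model in place of the window mean.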
In this subsection we report on a number of experiments using two different hydrologic elements, water level (m) and daily flow (m^{3} s^{−1}), from
Geographical location of
The data downloaded from the national hydrology database of MWR, China, were available as raw data. In addition, we followed the procedures described in the previous section. Figures
Raw hydrological time series (panels: water level; daily flow).
Raw hydrological time series (panels: water level; daily flow).
Since the data used in this study were subjected to manual quality control before being archived in the national hydrology database, it was expected that the detectors would not identify many outliers in the archive. However, it is easy to see that some data points deviate from their neighbors. We then apply our methods to detect the outliers in the given hydrological time series with the window size
Detection results (panels: two-sided window; one-sided window).
Detection results (panels: two-sided window; one-sided window).
Detection results (panels: two-sided window; one-sided window).
Detection results (panels: two-sided window; one-sided window).
Figures
The detection results for the proposed methods with the two different hydrologic elements from
The objective of the evaluation analysis described in this section is to assess the effectiveness of the methods. We classify the results from our experiment into four categories (see Table
Assessment of both methods.
Truth  Detected as outlier  Detected as not an outlier
Outlier  True positives, TP (A): data points that are outliers and are identified as outliers  False negatives, FN (C): data points that are outliers but are identified as normal
Not an outlier  False positives, FP (B): data points that are normal but are identified as outliers  True negatives, TN (D): data points that are normal and are identified as normal
The categories in Table
According to these definitions, the
Another relevant parameter is the
The positive predictive value (
Finally, the negative predictive value (
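The four measures can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below is a worked example; the function name `confusion_metrics` is hypothetical, and the counts are taken from the first water-level column of the HYK results reported later in this section (TP = 12, TN = 713, FP = 2, FN = 3).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard rates derived from a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true outliers that are detected
        "specificity": tn / (tn + fp),  # share of normal points left unflagged
        "ppv": tp / (tp + fp),          # share of flagged points that are outliers
        "npv": tn / (tn + fn),          # share of unflagged points that are normal
    }


for name, value in confusion_metrics(tp=12, tn=713, fp=2, fn=3).items():
    print(f"{name}: {100 * value:.2f}%")
```

These counts reproduce the 80.00%, 99.72%, 85.71%, and 99.58% reported for that configuration.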
A statistical measure of the accuracy is provided in this section. The parameters used are the ones described in Section
Statistical analysis of the one-sided method with different parameters at HYK station.
Variable  Water level  Water level  Water level  Water level  Daily flow  Daily flow  Daily flow  Daily flow
Parameters  (5, 0.95)  (6, 0.95)  (6, 0.96)  (7, 0.95)  (5, 0.95)  (5, 0.96)  (6, 0.96)  (5, 0.97)
TP  12  14  12  11  15  16  14  13 
TN  713  713  711  712  707  710  708  709 
FP  2  2  4  3  5  2  4  3 
FN  3  1  3  4  3  2  4  5 
Sensitivity  80.00%  93.33%  80.00%  73.33%  83.33%  88.89%  77.78%  72.22% 
Specificity  99.72%  99.72%  99.44%  99.58%  99.30%  99.72%  99.44%  99.58% 
PPV  85.71%  87.50%  75.00%  78.57%  75.00%  88.89%  77.78%  81.25% 
NPV  99.58%  99.86%  99.58%  99.44%  99.58%  99.72%  99.44%  99.30% 
Statistical analysis of both methods with optimal parameters for the given datasets.
Station  LZ  LZ  LZ  LZ  HYK  HYK  HYK  HYK
Variable  Water level  Water level  Daily flow  Daily flow  Water level  Water level  Daily flow  Daily flow
Method  One-sided  Two-sided  One-sided  Two-sided  One-sided  Two-sided  One-sided  Two-sided
TP  20  18  18  19  14  13  17  17 
TN  704  704  706  704  713  710  710  708 
FP  4  4  3  5  2  5  2  4 
FN  2  4  3  2  1  2  1  1 
Sensitivity  90.91%  81.82%  85.71%  90.48%  93.33%  86.67%  94.44%  94.44% 
Specificity  99.44%  99.44%  99.58%  99.29%  99.72%  99.30%  99.72%  99.44% 
PPV  83.33%  81.82%  85.71%  79.17%  87.50%  72.22%  89.47%  80.95% 
NPV  99.72%  99.44%  99.58%  99.72%  99.86%  99.72%  99.86%  99.86% 
As it is shown in Table
The results show high accuracy for both variables. The specificity is particularly remarkable: it exceeds 99% in all situations. These values mean that points that are truly normal are almost never flagged as outliers.
As for the sensitivity, all the results reached values greater than 73% (except for daily flow with the situation
The
Finally,
As for Table
We compared our methods with other methods such as
Figure
Comparisons of area under ROC curves.
AUCs  Median  Boxplot  SVM  Two-sided  One-sided
LZ water level  0.922  0.894  0.871  0.935  0.957
LZ daily flow  0.843  0.852  0.895  0.92  0.933
HYK water level  0.819  0.836  0.865  0.903  0.921 
HYK daily flow  0.934  0.928  0.93  0.942  0.955 
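For readers unfamiliar with the measure, the area under the ROC curve can be computed from outlier scores and true labels as the fraction of (outlier, normal) pairs in which the outlier receives the higher score, with ties counted as one half. The sketch below uses this rank-based definition on hypothetical scores; it does not reproduce the values in the table above.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, is_outlier in zip(scores, labels) if is_outlier]
    neg = [s for s, is_outlier in zip(scores, labels) if not is_outlier]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairs where the outlier outranks the normal point; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outlier scores: higher means more anomalous.
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

An AUC of 1.0 means every outlier is scored above every normal point, while 0.5 corresponds to random ranking.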
For our method, it can be inferred that the anomalies can be effectively detected by the window-based forecasting model, which is constructed using
It is important to mention that we have also validated our method on different datasets. The results were similar to those reported in this section. Furthermore, this paper used
Outlier detection, one of the classical topics of data mining, has generated a great deal of research in recent years owing to the new challenges posed by large high-dimensional data. Meanwhile, outlier detection in hydrological time series has many practical applications, such as data QA/QC, adaptive sampling, and anomalous event detection. This research developed a time series outlier detection method that employs a window-based forecasting model in conjunction with
The case study results suggest that the outlier detection methods developed in this study are useful tools for identifying anomalies in hydrological time series. Since these methods require only a time series model of the data, they can be easily applied to many real-time hydrological time series. However, it should be noted that, while the
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the Natural Science Foundation of China (nos. 51079040, 61170200, and 61370091) and the National Science and Technology Infrastructure of China (no. 2005DKA32000).