Technical note: A nonparametric outlier rejection scheme

Experimental data always contain measurement errors (noise, in signal-processing terms). This paper is concerned with the removal of outliers from a data set consisting of only a handful of points. The data set has a unimodal probability distribution function, so the mode is a reliable estimate of the central tendency. The approach is nonparametric; for the data set (xi, yi) only the ordinates yi are used, and the abscissae xi are reparametrized to the index i = 1, ..., N. The data are bounded using a calculated mode and a new measure, the mean absolute deviation from the mode, which does not seem to have been reported before. The mean is then removed, low-frequency filtering is performed in the frequency domain, and the mean is reintroduced.


Introduction
Consider an experiment where measurements are logged for further calculations. The intention is to use the data points that are most reliable; that is, those taken after the experiment and the measurement device have settled. Two main causes of invalid data (i.e. outliers) are: (1) the influence of previous experiments; (2) the effect of the environment (for example, mains flicker).
A common and simple practice is to ensure a long experimental run to produce enough valid data. If the run length is very large and the measurement errors causing bad data are independent and normally distributed with a constant standard deviation, the ubiquitous least-squares fit is a maximum-likelihood estimator of the line parameters. The weighted least-squares (or chi-squared) fit relaxes the assumption of a fixed standard deviation.
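As an illustrative sketch (not part of the note), the ordinary and weighted least-squares line fits mentioned above can be compared with NumPy; the data, the weights, and the choice of which point to down-weight are our own assumptions:

```python
import numpy as np

# Synthetic line y = 2x + 1 with one corrupted reading.
x = np.arange(6, dtype=float)
y = 2.0 * x + 1.0
y[5] += 3.0                                      # one bad point

# Ordinary least-squares fit: the outlier pulls the slope up.
slope, intercept = np.polyfit(x, y, 1)

# Weighted least-squares: down-weighting the suspect point
# (weight chosen arbitrarily here) recovers the true slope.
w = np.ones_like(x)
w[5] = 0.1
slope_w, intercept_w = np.polyfit(x, y, 1, w=w)
```

With the bad point nearly excluded, `slope_w` is close to the true value 2, while the unweighted `slope` is noticeably biased.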
It is often assumed that a random variable is normally distributed. The central limit theorem (see any standard text on mathematical statistics, and also the references) justifies approximate normality for large sets of data, but it may be difficult to construct a normal distribution from experimental data (Press et al. [1] provide a good treatment of this problem). Various outlier rejection schemes are based on this normality (for example, see the Matlab Users' Guide [2], Section 5.3).
The data have a central tendency towards the true value. Now, the central tendency is characterized by a scalar which is a function of the moments of the dataset.
The mean is the most commonly used measure of the central tendency. The variance is then used to measure the spread about the central tendency.
The average deviation, or mean absolute deviation, for the data set x_i, i = 1, ..., N is

    ADev = (1/N) * sum_{j=1}^{N} |x_j - x̄|    (1)

where x̄ is the mean.
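As a concrete illustration (the function name is ours, not from the note), Eq. (1) can be computed directly:

```python
def mean_abs_deviation(x):
    """Mean absolute deviation about the mean, Eq. (1)."""
    m = sum(x) / len(x)                        # sample mean
    return sum(abs(v - m) for v in x) / len(x)

mean_abs_deviation([1.0, 2.0, 3.0, 10.0])      # -> 3.0
```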
The mean absolute deviation is recognized as a more robust estimate of the width around the central tendency if the second moment can not be realized (i.e. if it is infinite) [1].
In a small dataset the mean is not representative of the central tendency, because the distribution has broad tails. In this case, the median (for a probability distribution function p[x], the median Xmed is the value for which larger and smaller values are equally probable) and the mode (the value at which p[x] is a maximum) are alternative estimates of the true value.

Mean absolute deviation from the mode
If the median is also not representative of the set (which is the case when the area in the tails of the distribution is large; the mean already fails if the first moment of the tails is large), then the mode of the distribution must be evaluated. Subsequent statistics of the set should therefore, by necessity, involve neither the mean nor the median. The 'width' about the central tendency can be taken as the mean absolute deviation from the mode:

    s_mode = (1/N) * sum_{j=1}^{N} |x_j - Xmode|    (2)

The outlier rejection algorithm is then as follows.
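A minimal sketch of Eq. (2) follows. The note does not specify how the mode is estimated; the half-sample approach below (repeatedly keeping the half of the sorted sample with the smallest range) is one common nonparametric choice, and both function names are ours:

```python
def half_sample_mode(x):
    """One possible nonparametric mode estimator (our choice; the
    note does not specify one): repeatedly keep the half of the
    sorted sample with the smallest range."""
    xs = sorted(x)
    while len(xs) > 2:
        h = (len(xs) + 1) // 2                 # half-sample size
        i = min(range(len(xs) - h + 1),
                key=lambda j: xs[j + h - 1] - xs[j])
        xs = xs[i:i + h]                       # keep tightest half
    return sum(xs) / len(xs)

def mad_from_mode(x, mode):
    """Mean absolute deviation from the mode, Eq. (2)."""
    return sum(abs(v - mode) for v in x) / len(x)
```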
Outliers in a small dataset
This paper discusses small datasets with at most a handful of members per set. The data are grouped around the true value; the outliers are readings from previous and subsequent experiments. The distribution therefore peaks around the true value.

Outlier rejection algorithm
(1) Compute the mode Xmode of the set. (2) Reject any point whose absolute deviation from Xmode exceeds a criterion Δ; here Δ is evaluated as the mean absolute deviation from the mode, Eq. (2). (3) Remove the mean of the remaining points, perform low-frequency filtering in the frequency domain, and reintroduce the mean.
The method of removing the linear trend in SMOOFT has hidden dangers. The straight line is constructed through the first and last data points. In the examples below, the first point is always from the previous experiment, while the last may be from the following one. When the linear trend is restored after the frequency-domain calculations, these points reintroduce unwanted characteristics.
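The scheme can be sketched end to end as below. This is only an illustration under stated assumptions: the half-sample mode estimator and the `keep_fraction` low-pass cutoff are our choices, not from the note, and, unlike SMOOFT's endpoint-based detrending criticized above, only the mean is removed and restored:

```python
import numpy as np

def half_sample_mode(x):
    """One possible nonparametric mode estimator (our choice):
    repeatedly keep the half of the sorted sample with the
    smallest range."""
    xs = sorted(x)
    while len(xs) > 2:
        h = (len(xs) + 1) // 2
        i = min(range(len(xs) - h + 1),
                key=lambda j: xs[j + h - 1] - xs[j])
        xs = xs[i:i + h]
    return sum(xs) / len(xs)

def reject_and_smooth(y, keep_fraction=0.25):
    """Sketch of the full scheme: bound the data to within the mean
    absolute deviation from the mode (Eq. (2)), remove the mean,
    zero the high frequencies, and reintroduce the mean.
    `keep_fraction` is an assumed parameter, not from the note."""
    y = np.asarray(y, dtype=float)
    mode = half_sample_mode(y)
    s = np.mean(np.abs(y - mode))              # Eq. (2)
    kept = y[np.abs(y - mode) <= s]            # bound the data
    m = kept.mean()
    spec = np.fft.rfft(kept - m)               # mean removed first
    cutoff = max(1, int(keep_fraction * len(spec)))
    spec[cutoff:] = 0.0                        # low-pass filter
    return np.fft.irfft(spec, n=len(kept)) + m
```

Because only the mean (the DC component) is removed and restored, no artificial trend through the first and last points is reintroduced.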

Examples
Two sets of experimental data are presented to illustrate the performance of the outlier rejection algorithm. The aim is to find the true or steady-state reading. The results are self-evident in the plots (see Figures 1 and 2).

Conclusions
A simple but effective outlier rejection routine has been presented. The routine is based on bounding data points to within the mean absolute deviation from the mode.
This statistic does not seem to have been reported previously.