The systematic error caused by random errors through data reduction

During the evaluation of the continuous measurement signal of analytical instruments by a digital computer, the signal is sampled periodically, and the analytical information [1] is computed from this sequence of discrete values representing the signal by the procedure called data reduction [2]. For example, in the case of chromatogram evaluation, retention data, peak heights and/or peak areas are produced from the raw discretized signal.


Introduction
The analytical information produced should, of course, not contain a systematic error component; that is, averaging replicate measurements should yield increasingly precise estimates of the true value.
Measurement conditions are chosen so that the raw signal of the measurement device, as represented by the sequence of discrete values, does not contain a systematic error component. As measurement is a complex process, its component processes can add random errors with various distributions, and these random errors can be transformed into systematic error components [3]. The random error of the raw signal can also be transformed into a systematic error component in the analytical information by the data processing itself: Enke and Nieman [4] investigated this effect in the case of data smoothing, and Eisenhart [5] and Ivanova and Tkatchev [6] studied it in the case of calibration. This paper reports an investigation of the data reduction transformation causing a systematic error component from random errors; the work used mathematical and simulation methods.

In order to handle the further formulae more simply, let us assume that the discretized signal contains neither outliers nor systematic error components, and let us investigate the one-dimensional case. (Extension to the multivariate case does not cause any difficulty.) The model of the measurement signal can thus be written in the form:

x(t) = X(t) + s·e(t),   t = 1, ..., T   (1)

where:
x(t) is the stochastic sequence of the discrete measurement signal;
X(t) is the true value sequence;
e(t) is a white noise stochastic sequence with standard normal distribution (independent elements, zero mean and unit variance);
s² is the variance of the random measurement error;
T is the number of samples.
Our assumption about the absence of outliers and systematic error components does not limit the generality of our conclusions. If it can be shown that in this simple (special) case the data reduction transforms the random error contained in the measurement signal into systematic error in the analytical information, then in the more complicated case the analytical information will at least contain this systematic error component.
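As a concrete illustration, the measurement model x(t) = X(t) + s·e(t) can be generated in a few lines. This is a minimal sketch; the function name simulate_signal and all parameter values are illustrative assumptions, not part of the original work:

```python
import random

def simulate_signal(true_values, s, seed=None):
    # x(t) = X(t) + s*e(t), with e(t) i.i.d. standard normal (zero mean, unit variance)
    rng = random.Random(seed)
    return [X + s * rng.gauss(0.0, 1.0) for X in true_values]

# A flat true signal of T = 100 samples with noise level s = 0.5.
T = 100
X = [10.0] * T
x = simulate_signal(X, s=0.5, seed=1)
```

With s = 0 the sequence reduces to the true values, matching the assumption that the raw signal itself carries no systematic error component.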
During the data reduction step a new stochastic sequence, containing the analytical information, is obtained from the measurement sequence (1) by a transformation determined by the type of data reduction chosen:

y(k) = R{k; x(1), ..., x(T)},   k = 1, ..., K   (2)

where K << T. The transformation R{·; ·, ..., ·} is called the data reduction transformation.

Mathematical modelling of data reduction
The aim of data reduction is to compress the raw data produced by the measurement device and to bring it to a form suitable for evaluation without significant loss of the analytically important information. A loss of information is unavoidable during the analytical information production step of data reduction. This is because not all of the information contained in the measured raw data is necessary; the information content of the analytical information is orders of magnitude smaller.
As a measurement data model a so-called discrete-time vector-valued stochastic process (stochastic sequence) was chosen, because the raw continuous-time measurement signal is usually sampled as the first step of evaluation by digital computers.

The data reduction transformation is in the general case a nonlinear one. According to the well-known definition of a linear transformation, a data reduction transformation is linear if multiplication of the sequence by a constant before or after the transformation gives the same result, and the sum of two transformed sequences is equal to the transform of the sum of the two sequences.
The true value sequence {Y(k)}, k = 1, ..., K, of the analytical information can be defined as the sequence whose elements are obtained from the true values of the measurement sequence by the data reduction transformation (2):

Y(k) = R{k; X(1), ..., X(T)},   k = 1, ..., K   (3)
As the data reduction transformation is nonlinear in the general case, the mean value of the analytical information sequence is not equal to its true value, so:

E[y(k)] = E[R{k; x(1), ..., x(T)}] ≠ R{k; E[x(1)], ..., E[x(T)]} = R{k; X(1), ..., X(T)} = Y(k)   (4)

Thus in the case of nonlinear data reduction transformations the random error component in the measurement sequence causes a systematic error component in the analytical information sequence. This systematic error component can in theory be eliminated by a suitably chosen correction, but the problem is not solved for the data reduction transformations used in analytical chemical practice.
At the same time, in the case of linear data reduction transformations, the random error in the measurement sequence does not cause systematic error in the analytical information. In this case the data reduction transformation has the following simple form:

y = R·x   (5)

where y and x denote the vectors composed of the elements of the analytical information and measurement sequences respectively, and R is the matrix of the data reduction transformation, with only time-dependent elements. Repeating the derivation of Equation (4) for this case, using the fact that the elements of the random error sequence are independent, we obtain:

E[y] = E[R·x] = R·E[x] = R·X = Y   (6)

In this case there is no systematic error in the analytical information sequence.
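The difference between the two cases can be checked with a small Monte Carlo sketch: applying a linear transformation (here the sum) to pure-noise sequences leaves the average result at the true value, while a nonlinear transformation (here the maximum) acquires a clearly positive bias. The helper name replicate and all parameter values are illustrative assumptions:

```python
import random

def replicate(transform, T, s, reps, seed=0):
    # Average the transform of many noise-only sequences (true values X(t) = 0).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [s * rng.gauss(0.0, 1.0) for _ in range(T)]
        total += transform(x)
    return total / reps

# On the noise-free sequence both transforms would give exactly 0.
bias_sum = replicate(sum, T=50, s=1.0, reps=2000)  # linear: stays near 0
bias_max = replicate(max, T=50, s=1.0, reps=2000)  # nonlinear: clearly above 0
```

The persistent positive value of bias_max is exactly the systematic error component of Equation (4), generated from purely random input errors.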
Data reduction transformations for the reduction of discretized chromatographic raw data
In the discussion of the data reduction of measurement sequences obtained from chromatographs, the components of the measurement sequence have to be taken into account. For the sake of simplicity let us assume that the measurement sequence contains only the signal of the unknown compounds, as peaks with the shape of a Gaussian distribution function, an additive random error and a base-line component:

x(t) = Σ_{j=1..J} q(j)·exp(−(t − u(j))² / (2·z(j)²)) + c(t) + s·e(t),   t = 1, ..., T   (7)

where:
u(j), z(j) are parameters characterizing the quality of the compounds;
q(j) is the parameter characterizing the amount of the jth compound;
c(t) is the base-line sequence (deterministic).
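A sequence following this peak-plus-base-line-plus-noise model can be simulated directly; the sketch below implements one Gaussian peak on a drifting linear base-line. The function name chromatogram and all parameter values are illustrative assumptions:

```python
import math
import random

def chromatogram(T, peaks, baseline, s, seed=None):
    # x(t) = sum_j q(j)*exp(-(t - u(j))**2 / (2*z(j)**2)) + c(t) + s*e(t)
    # peaks: list of (u, z, q) triples; baseline: the deterministic c(t).
    rng = random.Random(seed)
    signal = []
    for t in range(1, T + 1):
        value = baseline(t)
        for u, z, q in peaks:
            value += q * math.exp(-((t - u) ** 2) / (2.0 * z ** 2))
        value += s * rng.gauss(0.0, 1.0)
        signal.append(value)
    return signal

# One peak at u = 100 (width parameter z = 10, amount q = 50) on a linear base-line.
x = chromatogram(T=200, peaks=[(100, 10, 50.0)],
                 baseline=lambda t: 2.0 + 0.01 * t, s=0.2, seed=3)
```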
As analytical information for the determination of the amount of the jth compound, the peak height or the peak area is used. For the determination of the influence of the data reduction process, it can be divided into well-defined subprocesses (elementary transformations), which are:
(1) Peak recognition.
(2) Base-line correction.
(3) Calculation of peak height or area.
The data reduction transformation can be regarded as a composite of the transformations listed above in the given order. Let us first investigate the elementary transformations of the composite data reduction transformation separately.
The peak recognition algorithm is used to decide whether or not a peak is beginning or ending at a given time t, as the parameters u(j), j = 1, ..., J, are unknown.
The peak recognition algorithm separates the set of signal samples belonging to the peak, from which the maximum or the integral area is computed. For this purpose a moving group of consecutive samples, from the jth to the sth, is separated as lying before (or after) the peak (typically up to 30 sample values), and its mean c̄(t_j, t_s) is calculated. The start or end of a peak is assumed if the deviation of two consecutive samples from this mean value exceeds a given limit m:

|x(t_s + k) − c̄(t_j, t_s)| > m,   k = 1, 2

Furthermore, the peak is accepted only if it has a maximum and if its area exceeds a specified minimum value. It must be noted here that other peak detection algorithms (for example, ones based on the value of the derivatives of the sequence) can also be used.
The difference between consecutive signal samples depends not only on the deterministic component (X(t_s + 1) and X(t_s + 2)), but also on the random error (e(t_s + 1), e(t_s + 2)) and the base-line value. This means that the number of samples assumed to belong to the peak is a random variable, which will be denoted by n(j). It is easily seen that n(j) is a nonlinear function of the measurement signal samples: if q(j) is twice as large, the number of samples belonging to the peak will not be twice as large, because of the nonlinear (Gaussian) shape of the peak. Thus the peak recognition step is a source of systematic error.
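This nonlinearity can be checked numerically even in the noise-free case: doubling the amount q does not double the number of samples exceeding the detection limit m, because the width of the region where a Gaussian peak clears a fixed threshold grows only logarithmically with its height. The sketch assumes a zero base-line and illustrative parameter values:

```python
import math

def samples_in_peak(q, z=10.0, m=5.0):
    # Count samples of a noise-free Gaussian peak (centre 0, width parameter z,
    # amount q) whose deviation from the zero base-line exceeds the limit m.
    return sum(1 for t in range(-1000, 1001)
               if q * math.exp(-t * t / (2.0 * z * z)) > m)

n1 = samples_in_peak(q=100.0)  # amount q
n2 = samples_in_peak(q=200.0)  # amount 2q: fewer than twice as many samples
```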
The data in table 1 show the variation of the number of samples assumed by the given algorithm to belong to the peak in simulated chromatograms, as a function of the various parameters of the peak. (The description of the simulation experiment is given in the next section.) Data in the column designated H/W = 1000 refer to 'narrow' peaks. It can be seen that the number of samples included does not vary very much, but also that the variation is not a monotonic function of the signal-to-noise ratio (it first increases and then decreases with increasing signal-to-noise ratio). At the same time the uncertainty of peak recognition decreases with increasing signal-to-noise ratio in every column, as demonstrated by the value of the standard deviation S. For wide peaks the number of samples assumed to belong to the peak increases very rapidly with increasing signal-to-noise ratio, and this property becomes stronger as the peak width increases.
For the computation of the peak height, as well as the peak area, the value of the base-line must be determined. As a peak covers up the base-line, a base-line correction algorithm is used to estimate the values of the base-line from the samples before and after the peak. In the simplest (linear) case, the algorithm selects two groups of samples of predetermined size from the measurement sequence, one before and one after the peak, computes the mean of each group, assigns these averages to the centres of the groups on the time axis as points of the base-line, and connects the two points with a straight line, which serves as the estimate of the base-line between them. It is important to note that the systematic error of the peak recognition affects the base-line correction algorithm: because the groups before and after the peak should be as close to the peak as possible, samples that in fact belong to the peak can end up in one of the surrounding groups.
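The linear variant described above can be sketched as follows (the function name, group positions and group sizes are illustrative assumptions). If the two groups really contain only base-line samples and the base-line is truly linear, the interpolated estimate is exact:

```python
def baseline_estimate(x, pre, post):
    # Mean of each group, assigned to the group centre on the index axis,
    # connected by a straight line: the linear base-line estimate.
    m_pre = sum(x[i] for i in pre) / len(pre)
    m_post = sum(x[i] for i in post) / len(post)
    c_pre = sum(pre) / len(pre)
    c_post = sum(post) / len(post)
    slope = (m_post - m_pre) / (c_post - c_pre)
    return lambda t: m_pre + slope * (t - c_pre)

# A noise-free, truly linear base-line is recovered exactly.
x = [2.0 + 0.1 * t for t in range(100)]
est = baseline_estimate(x, pre=list(range(0, 10)), post=list(range(90, 100)))
```

If peak samples leak into either group (the systematic error of peak recognition), the group mean is pulled upward and the estimated base-line is biased, which then propagates into the height and area calculations.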
As the final step, the calculation of the peak area from the n(j) selected samples is a linear transformation if a constant or linear form of the base-line is assumed:

A(j) = Σ_{t = t_j}^{t_j + n(j)} [x(t) − ĉ(t)]   (9)

where ĉ(t) is the estimated base-line. In this case there is no further nonlinear transformation causing a possible systematic error in this step.
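The linearity of the area calculation is easy to demonstrate: doubling every sample's excess over the base-line exactly doubles the computed area. The function name peak_area and the toy numbers are illustrative; note that the number of summed samples is fixed here, whereas in practice n(j) is itself a random variable set by the peak recognition step:

```python
def peak_area(x, start, n, baseline):
    # Sum of base-line-corrected samples over the n + 1 samples of the peak.
    return sum(x[t] - baseline(t) for t in range(start, start + n + 1))

def base(t):
    return 1.0  # constant base-line

heights = (0, 1, 3, 6, 3, 1, 0)
x1 = [1.0 + h for h in heights]
x2 = [1.0 + 2 * h for h in heights]
a1 = peak_area(x1, 0, 6, base)  # 14.0
a2 = peak_area(x2, 0, 6, base)  # 28.0, exactly twice a1
```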
The calculation of the peak height from the n(j) selected samples is, however, a nonlinear data transformation. The peak height is usually determined by fitting a parabolic curve to a set number of the samples having the largest values, the peak height being taken as the height of the parabola. This algorithm filters the effect of the random error component in the case of Gaussian distributed random errors, according to the properties of least squares estimation (for other distributions this is only approximately valid); but the estimation of the peak height via the parameters of the parabola is a nonlinear transformation of the measurement samples, because of the nonlinear character of the parabola. For this reason the estimation of the peak height is a nonlinear data transformation, and so it can introduce systematic error components from the random errors.
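A common three-point version of the parabolic height estimate can be sketched as follows. It reproduces the height exactly when the samples really lie on a parabola, but for samples taken from the top of a Gaussian peak it is only an approximation, and the vertex height is a nonlinear function of the samples, which is how random error can turn into bias. The function name is an illustrative assumption:

```python
def parabola_height(y0, y1, y2):
    # Height of the vertex of the parabola through (-1, y0), (0, y1), (1, y2),
    # where y1 is the largest of the selected samples.
    a = (y0 + y2) / 2.0 - y1       # curvature (negative near a maximum)
    b = (y2 - y0) / 2.0            # slope at the centre point
    return y1 - b * b / (4.0 * a)  # vertex height: note the nonlinear b*b/a term

# Samples from the true parabola y = 5 - (x - 0.3)**2 at x = -1, 0, 1:
h = parabola_height(5 - 1.69, 5 - 0.09, 5 - 0.49)  # recovers the true height 5
```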
At the same time the systematic error of the base-line correction algorithm appears in the value of the peak height calculated as above, transferring the systematic error caused by the peak recognition step. However, the systematic error in the estimated base-line affects the peak height calculation much less than it affects the peak area, since in the latter case both the number and the values of the measurement signal samples assumed to belong to the peak distort the result of the peak area calculation.
The above explains the well-known empirical fact that only the peak height, and not the peak area, can be used for quantitative trace analysis (Hachenberg [7]): in trace analysis the signal-to-noise ratio is rather small.
The effect of data reduction on the result of chromatographic analysis as a function of the parameters of the chromatogram
The effect of data reduction was investigated by computer simulation of measurement signal sequences with known variances. The simulations were performed on a HP 21MX computer.
It was assumed for the computations that the function describing the peak shape is a Gaussian distribution function. In order to model the qualitative and quantitative deviations of the compounds, several Gaussian distribution functions were used, characterized by their height H and the interval W cut from the base-line by the tangents at the inflexion points. The signal-to-noise ratio is defined as:

E = H/s   (11)

The magnitude and sign of the systematic error vary as a function of peak shape, that is, as a function of the height/width ratio (H/W). Figure 1 contains data for narrow peaks (high H/W), for which there are relatively few signal samples belonging to the peak, having a relatively rapid deterministic change. It can be seen that both the relative peak height and area increase with decreasing signal-to-noise ratio. It should also be noted from the data in table 1 that there is only a relatively small variation in the number of samples assumed to belong to the peak, especially when relative values are examined.
It can be seen from figure 2 that in the case of medium-wide peaks (intermediate values of H/W), the peak height increases with decreasing signal-to-noise ratio, while the peak area decreases after a short initial increase.
In the case of broad peaks (low values of H/W) the number of samples belonging to the peak is large. The data in table 1 show that in this case the number of samples assumed to belong to the peak falls very rapidly with decreasing signal-to-noise ratio, so for broad peaks both the peak height and the peak area decrease with decreasing signal-to-noise ratio.

Discussion
It can be seen from the simulation results that the signal-to-noise ratio required for a negligible bias to be introduced by the transformation of the random error of the measurement signal samples varies with the chromatographic peak shape. For example, the signal-to-noise ratio normally achievable in analytical practice (around 500) is sufficient in the case of narrow peaks, but it should be greater than several thousand for wide peaks.

[Figure 1. The effect of the signal-to-noise ratio (E) on the peak height and area (H/W = 1000 mV/s, with Gaussian distributed random error, each point representing the mean of 10 simulated values). H: peak height; ΔH: variation of the peak height compared to the theoretical (true) value; A: peak area; ΔA: variation of the peak area compared to the theoretical (true) value; W: peak width.]

For comparison purposes it should be noted that the best signal-to-noise ratio achievable in spectrodensitometry of thin-layer chromatograms is not greater than 500, because of the significant noise of optical origin [11]. At the same time, the signal-to-noise ratio of the main components in gas and modern liquid chromatography can reach several tens of thousands, with suitable detector sensitivities and retention times. In the case of chromatographic trace analysis, the signal-to-noise ratio is usually less than 50. For details of practical applications see Leisztner et al.