STATISTICAL ANALYSIS OF DATA FROM ELECTRONIC COMPONENT LIFETESTS (A TUTORIAL PAPER)

Methods of statistically analysing data from electronic component lifetests are discussed. Particular emphasis is given to analysis techniques using the assumptions of constant hazard rate (exponential distribution), the Weibull distribution and mixed Weibull distributions. Methods are discussed for analysing Weibull data that are non-uniform, both because test samples are removed during the test and because the components under test are not under continuous surveillance. Finally, attention is given to the effect of two or more failure mechanisms, which can produce S-shaped patterns when data are plotted on Weibull graph paper.


INTRODUCTION
The subject of statistical analysis of lifetest data is very wide, and several textbooks treat it in depth. Most textbooks in reliability engineering include, directly or in an appendix, many of the fundamental theories and methods. It is not the purpose of this paper to compete with the textbooks, merely to extract some methods which experience has shown to be useful and which are not too complex to be applied in day-to-day engineering work in industrial R & D laboratories. For a more complete treatment, some of the textbooks in the list of references may be studied (Ref. 1, 2, 3, 4).
In the following we will start by going through some of the most important definitions of reliability and try to explain their practical applicability. The next step will be to discuss the different reliability tests and failure types to be taken into account. Then, as the main body of the paper, some very useful applications of the Weibull analysis technique will be described in detail. The concepts and methods will be explained by practical examples founded on real-life experience.
The paper is oriented towards components (or, in general terms, non-repaired items). The methods cannot be used for repairable items, which are repaired during their life.

SOME IMPORTANT RELIABILITY MEASURES
The concept of reliability is clearly expressed in the definition: The ability of an item to perform a required function under stated conditions for a stated period of time.
The three key phrases are "required function", "stated conditions" and "stated period of time", and we will return to this definition several times in the paper.
This definition leads to the fundamental probabilistic measure of reliability

$$R(t) = P(t_{\mathrm{actual}} > t)$$

which says that the reliability is the probability, $P$, of having a lifetime ($t_{\mathrm{actual}}$) greater than the stated period of time $t$. The word "lifetime" means that the item under consideration performs the required function under stated conditions. These two conditions are often forgotten when planning and interpreting lifetests.

Professor Jørgen Møltoft, Danish Engineering Academy, Mechatronics Section, Bygning 451, DK 2800 Lyngby, Denmark, Telephone: 02-883022.
The lifetime $t_{\mathrm{actual}}$ of an item can be both shorter and longer than $t$. Holding a specific item in our hands, we cannot tell how long $t_{\mathrm{actual}}$ would be for that particular item. This means that the lifetime is a stochastic variable with an associated probability distribution that may be characterized by one or more parameters. One of the purposes of a lifetest is to estimate these parameters.
Probability distributions are usually expressed in terms of the cumulative distribution function (c.d.f.) or the probability density function (p.d.f.). In the present case the c.d.f. for the lifetime is

$$F(t) = P(t_{\mathrm{actual}} \le t)$$

This states the probability of having a lifetime shorter than or equal to the stated period of time. As an item either performs its function or not, the two events are complementary and we have

$$R(t) + F(t) = 1$$

The corresponding p.d.f. is

$$f(t) = \frac{dF(t)}{dt}$$

While the p.d.f. is seldom used directly, a further measure that is often used in reliability technology is the hazard or failure rate function.
The failure rate $h(t)$ expresses a conditional probability: $h(t)\,dt$ is the probability that the item will fail in the coming time interval $dt$, given that it has survived the stated period of time $t$. It follows that

$$h(t) = \frac{f(t)}{R(t)}$$
For small values of $t$, where $R(t)$ is close to unity, the p.d.f. and the hazard rate function are almost identical. As $R(t)$ decreases with time, $h(t)$ and $f(t)$ differ more and more.
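The relation between $f(t)$ and $h(t)$ can be checked numerically. The short Python sketch below assumes an exponential lifetime with a hypothetical constant rate `LAMBDA` (chosen only for illustration) and uses the standard identity $h(t) = f(t)/R(t)$. Early on, $f$ and $h$ almost coincide; later, $f$ falls well below the constant $h$.

```python
import math

LAMBDA = 2e-5  # hypothetical constant failure rate, per hour

def reliability(t):
    """R(t) for an exponential lifetime."""
    return math.exp(-LAMBDA * t)

def pdf(t):
    """f(t) = dF/dt = LAMBDA * exp(-LAMBDA * t)."""
    return LAMBDA * math.exp(-LAMBDA * t)

def hazard(t):
    """h(t) = f(t) / R(t); constant in the exponential case."""
    return pdf(t) / reliability(t)

for t in (100.0, 10_000.0, 100_000.0):
    print(f"t = {t:>9.0f} h  R = {reliability(t):.4f}  "
          f"f = {pdf(t):.3e}/h  h = {hazard(t):.3e}/h")
```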
One of the most used and abused reliability measures is the mean time to failure, MTTF.This is simply the mean value of our stochastic variable, the lifetime.
It is calculated by the standard formula for mean values:

$$\mathrm{MTTF} = \int_0^\infty t\,f(t)\,dt$$

which, by partial integration, equals $\int_0^\infty R(t)\,dt$.
As mentioned, the MTTF measure is often abused. One common misconception is to confuse it with the working life, that is, the time at which wear-out sets in and the hazard rate begins to rise. The MTTF has nothing to do with the working life, a point that cannot be emphasised too often.
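Two properties of the MTTF can be shown in a few lines of Python: it equals $\int_0^\infty R(t)\,dt$, and it says nothing about when wear-out starts. The sketch below integrates $R(t)$ with the trapezoidal rule. The helper name `mttf_numeric`, the rate and the integration limit are illustrative assumptions.

```python
import math

def mttf_numeric(reliability, t_max, steps=200_000):
    """MTTF as the integral of R(t) from 0 to t_max, by the trapezoidal rule.

    t_max must be large enough that R(t_max) is negligible.
    """
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt

lam = 2e-5  # hypothetical constant failure rate, per hour
mttf = mttf_numeric(lambda t: math.exp(-lam * t), t_max=2e6)
fraction_failed = 1.0 - math.exp(-lam * mttf)  # F(MTTF) for the exponential case
print(f"MTTF = {mttf:.0f} h; fraction failed by t = MTTF: {fraction_failed:.3f}")
```

For any constant hazard rate, the fraction failed at $t = \mathrm{MTTF} = 1/\lambda$ is $1 - e^{-1} \approx 0.632$. Almost two thirds of the population has therefore failed by the MTTF, which is why it must not be read as a working life.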

For a field test the operational conditions are determined by the selected market and customers.
For a laboratory test we set up a test cycle consisting of combinations of environmental parameters (temperature, voltage, current, humidity, etc.), functional modes (test patterns, signal flow, biasing, etc.) and duty cycle (on-time, off-time). The operational conditions in a laboratory test should be selected to give reproducible results, to precipitate the failure modes and mechanisms that are seen in actual use, and to enable predictions of field failure patterns. The next step is to make a statistical test plan in which the sample size, test duration, etc. are determined. The same procedures and methods are used for field and laboratory tests. The planning methods are, however, different depending on whether the outcome required is a reliability verification or a reliability estimation. Here we will take a close look at estimation. However, in many cases the data from a verification test may be analysed in the same way, and can thus provide more information than the simple answer, "yes" or "no", to the postulate being verified.
Failures can be categorized in many ways, usually in accordance with the purpose of the exercise. In the present case a failure occurs when a component ceases to perform its intended function. This can happen in two ways: 1) catastrophically, when the component suddenly and abruptly ceases to work properly, or 2) gradually, when a parameter drifts outside a preset specification limit. We will accept both types as failures in our statistical analysis. However, for further analyses, for example failure cause analyses and prediction purposes, it is very important to distinguish between these two types of failures.
Often both types are discussed together and the concept of the "bath tub" curve is introduced. This concept must be treated with care if mistakes are to be avoided, because it regards all failures as catastrophic: the failure rate function follows a bath-tub shaped curve with a high failure rate at the beginning, a long middle period with a constant failure rate, and a rising failure rate at the end.
The descriptions above of the types of lifetest and failure are not in any way exhaustive. Their purpose was just to explain the framework within which we make the statistical analysis.
An example for illustration: background
For the purpose of illustrating the basic methods, a specific example has been selected. It is based on results from a real-life reliability test in which high-speed CMOS integrated circuits were subjected to 85 °C, 85 % RH and a supply voltage of 6 V. The failure mechanism reported was corrosion of the thin metal film interconnection layer.
Analyses under the constant hazard rate assumption
Many component manufacturers still present data in the form "a sample of 50 CMOS circuit components were tested for 6000 hours and 6 failures were found", followed by the statement "the failure rate for the components is 2 %/1000 hours".
The calculation is very simple because exponentially distributed lifetimes are assumed. In this case the reliability function is

$$R(t) = e^{-\lambda t}$$

with the corresponding c.d.f.

$$F(t) = 1 - e^{-\lambda t}$$

The hazard rate will be constant:

$$h(t) = \lambda$$

The estimate of $\lambda$ is obtained from

$$\hat{\lambda} = \frac{r}{T}$$

where $r$ is the total number of failures and $T$ is the accumulated relevant test time. $T$ is determined from

$$T = \sum_{i=1}^{r} t_i + (n - r)\,t^*$$

where $t_i$ is the time to failure of failure no. $i$, $n$ is the sample size and $t^*$ is the test duration.
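Confidence limits can also be derived using the $\chi^2$ distribution (Ref. 1 and 3). Writing $\chi^2_{q,\nu}$ for the $q$-quantile of the $\chi^2$ distribution with $\nu$ degrees of freedom, the two-sided confidence interval is

$$\frac{\chi^2_{\alpha/2,\,2r}}{2T} \le \lambda \le \frac{\chi^2_{1-\alpha/2,\,\nu}}{2T}$$

and the single-sided upper confidence limit is

$$\lambda \le \frac{\chi^2_{1-\alpha,\,\nu}}{2T}$$

The confidence level is $1 - \alpha$. The number of degrees of freedom is $\nu = 2r$ for a failure truncated test and $\nu = 2r + 2$ for a time truncated test, where $r$ is the number of failures. The test duration may be either preset (a time truncated test) or determined by the occurrence of the last of a preset number of failures (a failure truncated test).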

Example
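Having the formulas in hand, we immediately get $n = 50$, $r = 6$ and $t^* = 6000$ h. However, the value of $T$ has to be approximated, because we do not have the times to failure. Taking $T \approx n\,t^* = 50 \times 6000\ \mathrm{h} = 3 \times 10^5$ h gives $\hat{\lambda} = 6/(3 \times 10^5\ \mathrm{h}) = 2 \times 10^{-5}\ \mathrm{h}^{-1}$, i.e. 2 %/1000 hours.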
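The constant-hazard calculation, including the $\chi^2$ confidence limits, is easy to script. The sketch below does it in plain Python. It assumes the lower-quantile convention $\chi^2_{q,\nu}$ and uses the closed-form $\chi^2$ c.d.f. that exists for an even number of degrees of freedom; all the $\nu$ values used here ($2r$ and $2r + 2$) are even. The function names are hypothetical. $T$ is approximated by $n\,t^*$ because the individual failure times are not available.

```python
import math

def chi2_cdf_even(x, dof):
    """Chi-square c.d.f. for an even number of degrees of freedom (closed form)."""
    k = dof // 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2.0) / j          # (x/2)^j / j!
        total += term
    return 1.0 - math.exp(-x / 2.0) * total

def chi2_quantile_even(q, dof):
    """q-quantile of chi-square with even dof, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < q:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def constant_rate_estimate(r, T, alpha, time_truncated=True):
    """Point estimate r/T and two-sided (1 - alpha) confidence limits for lambda."""
    lam_hat = r / T
    nu = 2 * r + 2 if time_truncated else 2 * r   # dof for the upper limit
    lower = chi2_quantile_even(alpha / 2, 2 * r) / (2 * T) if r > 0 else 0.0
    upper = chi2_quantile_even(1 - alpha / 2, nu) / (2 * T)
    return lam_hat, lower, upper

# Example from the text: n = 50, r = 6, t* = 6000 h (a time truncated test);
# T is approximated by n * t* because the individual failure times are unknown.
n, r, t_star = 50, 6, 6000.0
lam_hat, lo, up = constant_rate_estimate(r, n * t_star, alpha=0.10)
print(f"lambda = {lam_hat:.2e}/h, 90 % limits: {lo:.2e} to {up:.2e}/h")
```

With the example data this gives $\hat{\lambda} = 2 \times 10^{-5}\ \mathrm{h}^{-1}$ and a 90 % two-sided interval of about $8.7 \times 10^{-6}$ to $3.9 \times 10^{-5}\ \mathrm{h}^{-1}$. This shows how wide the uncertainty is with only six failures.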
The problem with this method is that the assumption of exponentially distributed lifetimes is seldom fulfilled, so more sophisticated methods are necessary. These, however, require the times to failure to be known, at least approximately.

WEIBULL ANALYSIS
The foundation for the Weibull analysis technique is the Weibull distribution function. If the lifetimes are Weibull distributed, the reliability function is

$$R(t) = \exp\left[-\left(\frac{t - t_0}{\eta}\right)^{\beta}\right], \qquad t \ge t_0$$

where the three parameters are: $\beta$, the shape parameter; $\eta$, the characteristic lifetime; and $t_0$, the location parameter.
$\beta$ describes the shape of the p.d.f. For low values of $\beta$ the bulk of the p.d.f. lies to the left, with a long tail to the right. $\beta = 1$ gives an exponential distribution, and $\beta = 3.44$ gives a good approximation to the normal distribution except in the tails. For higher values of $\beta$ the bulk of the p.d.f. lies to the right, with a long tail to the left. $\eta$ is the time, counted from $t_0$, at which the cumulative failure is 63.2 %, regardless of the value of $\beta$. $t_0$ defines where the distribution can start. For $t < t_0$ the formula does not give a valid reliability: for odd integer values of $\beta$ it gives a reliability greater than one, and for non-integer values a complex value. For this reason $t_0$ is often called the minimum life.
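For $t_0 = 0$ (or after the transformation $t' = t - t_0$) we get the two-parameter Weibull distribution

$$R(t) = \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right] \qquad (1)$$

This is often the case, and in the following we will concentrate on the two-parameter distribution. The c.d.f. becomes

$$F(t) = 1 - \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right] \qquad (2)$$

and the p.d.f.

$$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}\exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right] \qquad (3)$$

For the hazard rate we obtain

$$h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} \qquad (4)$$

In the exponential case, where $\beta = 1$, $\eta = \mathrm{MTTF}$. In other cases $\eta$ and MTTF are related through the gamma function and the value of $\beta$:

$$\mathrm{MTTF} = \eta\,\Gamma\!\left(1 + \frac{1}{\beta}\right)$$

On the basis of the two-parameter Weibull function, a Weibull graph paper has been developed.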
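For numerical work, the two-parameter Weibull expressions translate directly into code. The Python sketch below implements $R$, $F$, $f$, $h$ and the MTTF via `math.gamma`. The function names are illustrative, and the sketch assumes $t > 0$, since for $\beta < 1$ the hazard is unbounded at $t = 0$.

```python
import math

# Two-parameter Weibull functions (t > 0, beta > 0, eta > 0).
def weibull_R(t, beta, eta):
    """Reliability R(t) = exp[-(t/eta)^beta]."""
    return math.exp(-((t / eta) ** beta))

def weibull_F(t, beta, eta):
    """Cumulative distribution F(t) = 1 - R(t)."""
    return 1.0 - weibull_R(t, beta, eta)

def weibull_f(t, beta, eta):
    """Probability density f(t) = (beta/eta)(t/eta)^(beta-1) R(t)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0) * weibull_R(t, beta, eta)

def weibull_h(t, beta, eta):
    """Hazard rate h(t) = (beta/eta)(t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def weibull_mttf(beta, eta):
    """Mean time to failure, eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

print(weibull_mttf(1.0, 500.0), round(weibull_F(8.0, 2.4, 8.0), 3))
```

As a check, `weibull_mttf(1.0, eta)` returns `eta`, matching the exponential case. `weibull_F(8.0, 2.4, 8.0)` gives 0.632, the cumulative failure at $t = \eta$.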

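Taking logarithms twice in the two-parameter c.d.f. gives

$$\ln\ln\frac{1}{1 - F(t)} = \beta \ln t - \beta \ln \eta \qquad (5)$$

which is a straight line $y = \beta x - \beta \ln \eta$ in the variables $y = \ln\ln[1/(1 - F(t))]$ and $x = \ln t$. In order to plot data on the basis of the two-parameter Weibull function (eqn. 1), the function in the form of equation 5 is used. For this purpose special graph papers have been developed, and figure 2 shows one such.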
In figure 2, the ordinate plots values of the c.d.f., $F(t)$, in terms of the left-hand side of equation 5. The abscissa is $\ln t$, and $t$ is therefore plotted, for convenience, on a logarithmic scale.
Therefore, on such paper, a plot of $F(t)$ versus time gives a straight line of slope $\beta$ if the lifetimes are Weibull distributed.
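The linearisation can be verified numerically. Mapping exact Weibull c.d.f. values onto the Weibull-paper axes $x = \ln t$ and $y = \ln\ln[1/(1 - F)]$ should give points on a straight line of slope $\beta$ that crosses $y = 0$ at $t = \eta$. The Python sketch below uses $\beta = 2.4$ and $\eta = 8$. The helper name `weibull_paper_coords` is an illustrative assumption.

```python
import math

def weibull_paper_coords(t, F):
    """Weibull-paper axes: x = ln t, y = ln ln[1 / (1 - F)]."""
    return math.log(t), math.log(math.log(1.0 / (1.0 - F)))

beta, eta = 2.4, 8.0  # illustrative values
pts = [weibull_paper_coords(t, 1.0 - math.exp(-((t / eta) ** beta)))
       for t in (2.0, 8.0, 20.0)]
slope = (pts[2][1] - pts[0][1]) / (pts[2][0] - pts[0][0])
print(f"slope = {slope:.3f}; y at t = eta: {pts[1][1]:.3f}")
```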
However, specific constructions are necessary to avoid having to calculate $\beta$ directly from the axes of the Weibull paper. A construction for determining $\eta$ can be established at the same time. These constructions are shown in figure 3, where equation 5 is plotted with $\eta = 8$ and $\beta = 2.4$.

Setting $t = \eta$ in the c.d.f. gives, for any value of $\beta$,

$$F(\eta) = 1 - e^{-1} \approx 0.632$$

This is shown in figure 3, where the c.d.f. value of 63.2 % on the Weibull plot gives a value of $\eta$ (= $t$) of 8.
The determination of $\beta$ requires a special construction relative to a $\beta$ scale that has been calculated for that construction and placed on the graph paper. In the case of Chartwell graph papers (Ref. nos. 6572 and 6573), this is done using an estimator point. In the present case, the value is found by drawing a line parallel to the original plot but through the fixed point given by $F(t) = 63.2\,\%$ and $t = e$.
Where this parallel line intercepts the vertical line $t = 1$, a value for $\beta$ can be read off numerically from the scale on the left or right of the page.
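The straight-line form of equation 5 shows why this construction works. Writing $x = \ln t$ and $y = \ln\ln[1/(1 - F)]$, and assuming the fixed point lies at $t = e$, $F = 63.2\,\%$ as described, gives the following short derivation:

$$F = 1 - e^{-1} \;\Rightarrow\; y = \ln\ln e = 0, \qquad t = e \;\Rightarrow\; x = 1$$

$$\text{line of slope } \beta \text{ through } (x, y) = (1, 0): \quad y = \beta\,(x - 1)$$

$$t = 1 \;\Rightarrow\; x = 0 \;\Rightarrow\; y = -\beta$$

The $\beta$ scale therefore only needs to mark $-y$ along the vertical line $t = 1$.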
In a lifetest, the lifetimes or times to failure of components are measured. To obtain the cumulative percentage failure, the times to failure are arranged in increasing order, and median rank values $\hat{F}(t)$ are calculated using the following formula developed by Benard (Ref. 8):

$$\hat{F}(t_i) = \frac{i - 0.3}{n + 0.4} \times 100\,\% \qquad (6)$$

where $i$ is the failure number (rank order no.) and $n$ is the sample size. The median rank calculation is necessary because the times to first failure, second failure, etc. in the sample would each form part of a distribution if the test were repeated many times. The above formula is only an approximation; more accurate values can be obtained from tables if required.

FIGURE 3 How to use the Weibull graph paper

Corresponding values of lifetimes and estimated c.d.f. values are plotted on the Weibull paper. If a straight line can be drawn through the points, the data are consistent with a Weibull distribution, and the values of $\beta$ and $\eta$ can be estimated. If the plotted points diverge from a straight line, we may split the curve into a combination of Weibull functions. Thus we have a tool to analyse many complex distribution functions: engineering common sense combined with statistical Weibull analysis gives a simple and very powerful tool for reliability assessments.
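Graph paper can be replaced by a calculation. The Python sketch below computes Benard's median ranks as fractions. It then fits the straight line of equation 5 by ordinary least squares, regressing $y$ on $x$. This regression substitutes for drawing the line by eye; it is not the graphical procedure of the paper, and the function names are hypothetical. The usage lines feed it noise-free synthetic times, constructed from $\beta = 1.8$ and $\eta = 19\,300$ h, so the fit should recover exactly those values.

```python
import math

def benard_median_ranks(n_failures, sample_size):
    """Benard's approximation F_i = (i - 0.3) / (n + 0.4), as fractions, i = 1..n_failures."""
    return [(i - 0.3) / (sample_size + 0.4) for i in range(1, n_failures + 1)]

def fit_weibull_lsq(times, F):
    """Least-squares line y = beta * x + c with x = ln t, y = ln(-ln(1 - F)).

    Returns (beta, eta), where eta = exp(-c / beta) is the time at which y = 0.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - p)) for p in F]
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    c = y_bar - beta * x_bar
    return beta, math.exp(-c / beta)

# Hypothetical noise-free data: 6 failures out of n = 50, placed exactly on a
# Weibull line with beta = 1.8 and eta = 19 300 h.
ranks = benard_median_ranks(6, 50)
times = [19_300.0 * (-math.log(1.0 - p)) ** (1.0 / 1.8) for p in ranks]
beta, eta = fit_weibull_lsq(times, ranks)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h")
```

With real data the points scatter, and the fitted values will differ somewhat from those of a line drawn by eye.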

Example
Analysing further the lifetest data for the CMOS circuits discussed previously (Section 3), we find the lifetimes for the failed components given in Table I. In the table the median rank values, $\hat{F}(t)$, have been calculated for the sample size $n = 50$.
Figure 4 shows the Weibull plot, and a straight line can be drawn through the points. The estimated value of $\beta$ is 1.8, which means that we are far from the exponential case ($\beta = 1$). Furthermore, the plot shows that the previous calculation was too optimistic for the long-term behaviour and too pessimistic for the short-term behaviour of the components (compare with the dotted line for the exponential case, in which $\beta = 1$).
For this failure mechanism we have an increasing hazard rate with time (Fig. 5).
$$h(t) = \frac{1.8}{19\,300\ \mathrm{h}}\left(\frac{t}{19\,300\ \mathrm{h}}\right)^{0.8}$$

This is quite common for a wear-out type failure such as corrosion.
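The comparison with the exponential assumption can be made numerical. The sketch below evaluates both c.d.f.s using $\lambda = 2 \times 10^{-5}\ \mathrm{h}^{-1}$ from the constant-hazard calculation, and $\beta = 1.8$, $\eta = 19\,300$ h from the Weibull plot. The evaluation times are arbitrary choices.

```python
import math

LAMBDA = 2e-5             # 2 %/1000 h from the constant hazard calculation
BETA, ETA = 1.8, 19_300.0  # Weibull parameters read from the plot

def F_exp(t):
    """Exponential c.d.f."""
    return 1.0 - math.exp(-LAMBDA * t)

def F_weibull(t):
    """Two-parameter Weibull c.d.f."""
    return 1.0 - math.exp(-((t / ETA) ** BETA))

def h_weibull(t):
    """Weibull hazard rate, increasing with t because BETA > 1."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1.0)

for t in (1_000.0, 6_000.0, 50_000.0):
    print(f"t = {t:>7.0f} h  F_exp = {F_exp(t):.3f}  F_weibull = {F_weibull(t):.3f}  "
          f"h_weibull = {h_weibull(t):.2e}/h")
```

At 1000 h the exponential model predicts about four times as many failures as the Weibull model (about 2.0 % against 0.5 %). At 50 000 h it predicts about 63 % against almost 100 %. Near the 6000 h test duration the two nearly coincide (about 11 %).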
Confidence limit curves can also be added to the graph. This is described in several textbooks, for example Ref. 1 and Ref. 2.

Effect of non-uniform data
In many cases the testing procedure is not that simple. The sample size is often large at the beginning and decreases as time goes on, for a number of reasons. Some of the samples fail for reasons that are irrelevant to the purpose of the test. After some time a substantial number may be taken away from the sample for other testing purposes. It may also be too expensive to carry on testing a large sample. In all such cases we use the rank order calculation for suspended items. The increase, $\Delta$, to obtain the next rank order number after a suspension is (Ref. 9):
$$\Delta = \frac{(n + 1) - (\text{previous rank order number})}{1 + (\text{number of items following the suspended set})}$$

Subsequent rank order numbers after the suspension are calculated by adding $\Delta$ to the previous recalculated rank order number. After each new suspension a new value of $\Delta$ must be calculated.

Another frequently used short cut in reliability testing is that the components are not under continuous surveillance. Instead, every component is measured at regular time intervals. At a single measurement more than one failure may be discovered, and the precise lifetimes are not known: the times to failure of the failed components lie somewhere between two measurement points. If such a short cut has to be used, it is advisable to measure at short time intervals at the beginning and at longer intervals towards the end of the testing period.
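The rank-increment rule is straightforward to automate. The Python sketch below walks through the sample in time order. At each failure it applies $\Delta = (n + 1 - \text{previous rank})/(1 + \text{items remaining})$, where the items remaining are the current failure and every item still on test after it. This is equivalent to recomputing $\Delta$ only after each suspension, because between suspensions the formula returns the same $\Delta$ again. The function names and the tie rule (failures placed before suspensions recorded at the same time) are assumptions of this sketch.

```python
def adjusted_rank_orders(events):
    """Rank order numbers of failures in a sample that contains suspensions.

    events: one (time, failed) pair per item of the original sample;
    failed=False marks a suspension (item removed while still working).
    Returns (time, rank_order_number) for the failures only.
    """
    n = len(events)
    # Time order; at equal times failures are placed before suspensions (assumption).
    ordered = sorted(events, key=lambda e: (e[0], not e[1]))
    prev_rank, result = 0.0, []
    for k, (t, failed) in enumerate(ordered):
        if not failed:
            continue
        items_remaining = n - k  # this failure plus every item after it
        delta = (n + 1 - prev_rank) / (1 + items_remaining)
        prev_rank += delta
        result.append((t, prev_rank))
    return result

def benard(rank, n):
    """Median rank (fraction) from equation (6) for a possibly non-integer rank."""
    return (rank - 0.3) / (n + 0.4)

# Hypothetical event list with n = 818, arranged so that 698 items follow the suspended set.
events = ([(100.0 * i, True) for i in range(1, 7)]   # failures no. 1-6
          + [(650.0, False)] * 114                  # items removed for other tests
          + [(700.0, True), (800.0, True)]          # failures no. 7 and 8
          + [(10_000.0, False)] * 696)              # still working at end of test
ranks = adjusted_rank_orders(events)
for t, i in ranks:
    print(f"t = {t:>5.0f} h  rank = {i:.3f}  median rank = {benard(i, len(events)):.4%}")
```

In this hypothetical layout, 114 removals after the sixth failure leave 698 items following the suspended set. That gives $\Delta = 813/699 \approx 1.163$ and rank order numbers 7.163 and 8.326 for the next two failures. Benard's formula with $n = 818$ then converts them into median ranks.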

Example
The two major points discussed above, namely removal of specimens from the sample under test and non-continuous surveillance, are illustrated in the following example, based on the results of a set of real lifetests on CMOS circuits.
Using the formula for suspended items (Ref. 9), we obtain Table III. Note that the numbering of the failures within each of the five time intervals is arbitrary, as there has not been continuous surveillance.
As an example, let us calculate the rank order numbers for failures no. 7 and 8. Originally there were 818 circuits under test, so $n = 818$. Failure no. 7 comes right after a suspension, so

$$\Delta_6 = \frac{(818 + 1) - 6}{1 + (\text{number of items following the suspended set})} = 1.163$$

Thus

$$i_7 = i_6 + \Delta_6 = 6.000 + 1.163 = 7.163$$

Failure no. 8 follows failure no. 7 and not a suspension. We therefore get

$$i_8 = i_7 + \Delta_6 = 7.163 + 1.163 = 8.326$$

This process can be continued through all the failures up to the next suspension. The figures given in Table III after the suspension between failures no. 15 and 16 have been rounded to two decimal places for simplicity; however, three decimal places for $\Delta$ are needed in the calculation to obtain them. Since the failures found at one measurement share the same recorded time, the first failure in each group can be taken to be the top one in the group, and a Weibull line can be drawn using only these points.
Taking these aspects into account, the corresponding times to failure and median rank values (equation (6)) can be tabulated as in Table IV.
The median rank value, $\hat{F}(t)$, is calculated using the sample size $n = 818$ at all times.

The Weibull plot of the data is shown in figure 6, and it appears that a straight line through the top points of the groups is a reasonable approximation.
From the graph we can derive the values of $\beta$ and $\eta$: $\beta = 1.8$ and $\eta = 17000$ hours. These figures are enough for a complete description of the estimated Weibull distribution.
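With $\beta$ and $\eta$ known, derived figures such as a B10 life or the fraction failed after a given time follow in closed form, by inverting or evaluating the c.d.f. The Python sketch below uses $\beta = 1.8$ and $\eta = 17000$ h; the function names are illustrative.

```python
import math

BETA, ETA = 1.8, 17_000.0  # values read from figure 6

def weibull_cdf(t):
    """F(t) for the fitted two-parameter Weibull distribution."""
    return 1.0 - math.exp(-((t / ETA) ** BETA))

def weibull_life(q):
    """Time by which a fraction q has failed (the B10 life for q = 0.10)."""
    return ETA * (-math.log(1.0 - q)) ** (1.0 / BETA)

print(f"B10 = {weibull_life(0.10):.0f} h")
print(f"F(8760 h) = {weibull_cdf(8760.0):.1%}")
```

This gives $F(8760\ \text{h}) \approx 26\,\%$ and a B10 life of about 4900 h. The B10 value differs somewhat from a graphical reading, as can be expected when $\beta$ and $\eta$ are themselves read from a plot and rounded.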
Other figures may be derived as well, for example the well-known B10 life, which is the time by which 10 % of the components have failed. From figure 6 we obtain $B_{10} \approx 4600$ hours. Furthermore, we can find the percentage of failures expected in one year of continuous operation: $F(1\ \text{year} = 8760\ \text{hours}) \approx 26\,\%$.

In many cases the plotted points on the Weibull graph paper do not fit a straight line. A typical pattern appears in figure 7, in which the times to failure for CMOS components (type 4007 dual complementary pair plus inverter) tested at 200 °C are plotted using the method described above.
An approximate curve may be fitted to such a pattern by assuming the bimodal distribution

$$F(t) = p\,F_1(t) + (1 - p)\,F_2(t) \qquad (8)$$

in which $F_1(t)$ and $F_2(t)$ are ordinary Weibull functions and $p$ is the probability of a component having a lifetime distribution that follows $F_1(t)$. A fit for the present case is shown in figure 8. The curve is shaped like an S, and such patterns are therefore called S-curves.
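The S-shape of equation (8) on Weibull axes can be reproduced with a few lines of Python. The parameters below are hypothetical: 10 % flawed components following $\beta_1 = 1.3$, $\eta_1 = 570$ h, and a wear-out population with $\beta_2 = 3$, $\eta_2 = 60\,000$ h. The printed $y$ values rise steeply, flatten at a plateau near $F = p$, and then rise again.

```python
import math

def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull c.d.f."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def bimodal_cdf(t, p, beta1, eta1, beta2, eta2):
    """Equation (8): F(t) = p F1(t) + (1 - p) F2(t)."""
    return p * weibull_cdf(t, beta1, eta1) + (1.0 - p) * weibull_cdf(t, beta2, eta2)

# Hypothetical parameters: 10 % flawed parts failing early, the rest wearing out late.
params = dict(p=0.10, beta1=1.3, eta1=570.0, beta2=3.0, eta2=60_000.0)
for t in (100.0, 1_000.0, 5_000.0, 20_000.0, 60_000.0, 120_000.0):
    F = bimodal_cdf(t, **params)
    y = math.log(-math.log(1.0 - F))  # ordinate on Weibull paper
    print(f"t = {t:>8.0f} h  F = {F:.4f}  y = {y:+.3f}")
```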
The S-shaped pattern is often associated with flaws in the components that cause early failures. The later failures are normally caused by unavoidable wear-out mechanisms. Under normal use conditions wear-out may happen far out in time, and the last part of an S-curve is therefore usually seen only if the test data are obtained during accelerated testing. This was the case for the CMOS 4007 example shown.
The philosophy behind the bimodal lifetime distribution is as follows. A specific component with no flaw will have a lifetime distribution $F_2(t)$; with a flaw, the lifetime distribution will follow $F_1(t)$. These two possible failure situations are mutually exclusive.
Either a flaw is present or it is not, and once a component has failed due to one of the causes, it cannot fail due to the other. We do not know whether a specific component contains a flaw; the only thing we know is the probability $p$ that a flaw is present. Furthermore, the probability that a flaw is present is statistically independent of the lifetime the flaw will cause, and vice versa. The statistical independence gives the two terms $p\,F_1(t)$ and $(1 - p)\,F_2(t)$, and because of the mutual exclusiveness the two terms can be added directly. This principle can be used in an analogous manner if more than one flaw type may be present in the component type in question, as long as the probability of having more than one flaw in a specific component can be regarded as negligible.
Looking more specifically at the $F_2(t)$ distribution, this may be a consequence of several different failure mechanisms, all acting at the same time. The situation is very similar to an electronic system with many parts: if one part fails, the system fails. Likewise, if any failure mechanism in the component causes a failure, the component as such fails. In reliability terms we are dealing with a series system, and if the failure mechanisms act statistically independently, the combined c.d.f., $F_2(t)$, can be calculated from

$$F_2(t) = 1 - \prod_{j=1}^{m}\left[1 - F_j(t)\right] \qquad (9)$$

where $F_j(t)$ is the c.d.f. for the $j$th failure mechanism and $m$ is the total number of simultaneously acting failure mechanisms. This is sometimes called the model of competing failure mechanisms.
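The competing-failure-mechanism model is the familiar series-system product of survival probabilities, $F(t) = 1 - \prod_j [1 - F_j(t)]$. The Python sketch below evaluates it for independent Weibull mechanisms whose parameters are hypothetical. When all mechanisms share the same $\beta$, the product is itself a Weibull distribution with $\eta^{-\beta} = \sum_j \eta_j^{-\beta}$, and the usage lines check this.

```python
import math

def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull c.d.f."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def competing_cdf(t, mechanisms):
    """Series model: F(t) = 1 - prod_j [1 - F_j(t)] for independent mechanisms.

    mechanisms: iterable of (beta, eta) pairs, one per failure mechanism.
    """
    survival = 1.0
    for beta, eta in mechanisms:
        survival *= 1.0 - weibull_cdf(t, beta, eta)
    return 1.0 - survival

# Hypothetical mechanisms sharing beta = 2; combined eta from eta^-beta = sum eta_j^-beta.
mechs = [(2.0, 1_000.0), (2.0, 2_000.0)]
eta_combined = (1_000.0 ** -2 + 2_000.0 ** -2) ** -0.5
print(competing_cdf(500.0, mechs), weibull_cdf(500.0, 2.0, eta_combined))
```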

S-curve analysis
As mentioned above, one usually observes only the first part of the S-curve. If we want to examine this part, the method is straightforward and quite simple. The analysis of the latter part may be very uncertain owing to lack of data, and it is also more complex to carry out. The first step is to separate the points belonging to the two distributions. This is done using Bayes' analysis (Ref. 10). The formula

$$P_1(t_i) = \frac{p\,f_1(t_i)}{p\,f_1(t_i) + (1 - p)\,f_2(t_i)} \qquad (10)$$

gives the probability that failure no. $i$ belongs to subpopulation no. 1. Here $p$ is the previously mentioned probability of a component having a flaw, and $f_1(t_i)$ and $f_2(t_i)$ are the p.d.f. values of subpopulations 1 and 2, respectively, at the time $t_i$ of failure no. $i$. If this probability exceeds 50 % we deem the failure to belong to subpopulation no. 1; otherwise it belongs to no. 2.
In order to use this formula we need approximate values of $p$ and of the parameters describing $f_1(t)$ and $f_2(t)$. These can be derived by plotting the known points using the median rank method already described. The technique is illustrated in figure 9.
An approximate value of $p$ is derived by estimating the plateau level. The slope of the first part of the S-curve is then approximately $\beta_1$, and the intersection between a horizontal line at $F(t) = p \times 63.2\,\%$ and the first part of the S-curve gives an estimate of $\eta_1$. While the parameters of $f_1(t)$ need to be close to the correct ones, the parameters of $f_2(t)$ are less important, because the Bayes formula is robust to that type of uncertainty. If the points after the plateau are too few to indicate a slope, a conservative estimate is to draw a line with the Weibull slope $\beta = 1$ through the last point. This line's intersection with the $F(t) = 63.2\,\%$ line can be used as an approximate value of $\eta_2$. However, if there are enough points to indicate a straight line, this line can be used as an approximation of $F_2(t)$, and $\beta_2$ and $\eta_2$ are determined in the usual way.
Example
90 components were tested over 10000 hours and 13 failures were found. The times to failure were known and are ranked as shown in Table V. $\hat{F}(t)$ is calculated using equation (6) with $n = 90$. Point no. 14 is not measured, because the test was stopped at 10000 hours. However, a failure could have happened just after 10000 hours, and point no. 14 can therefore contribute to the curve as a conservative estimate of the time to failure of failure no. 14.
To examine the data in more detail, the Bayes formula, equation (10), can be used, inserting $p$ and the values of $f_1(t_i)$ and $f_2(t_i)$ calculated using equation (3), i.e.

$$f_k(t_i) = \frac{\beta_k}{\eta_k}\left(\frac{t_i}{\eta_k}\right)^{\beta_k - 1}\exp\left[-\left(\frac{t_i}{\eta_k}\right)^{\beta_k}\right], \qquad k = 1, 2$$

These formulae can be applied to all the failures of Table V. When this is done for failures no. 11 and 12, an abrupt change in $P_1$ is found. For failure no. 11 ($t = 1200$ h) we obtain

$$P_1 = \frac{0.13 \cdot 2.123 \cdot 10^{-4}}{0.13 \cdot 2.123 \cdot 10^{-4} + 0.87 \cdot 1.582 \cdot 10^{-5}} = 0.67 = 67\,\%$$

For failure no. 12 ($t = 1920$ h) we obtain in the same way $f_1(1920) = 3.6620 \cdot 10^{-5}$, $f_2(1920) = 1.5637 \cdot 10^{-5}$ and $P_1 = 0.26 = 26\,\%$. Interpreting the results, the first 11 failures can be deemed to belong to subpopulation 1 and the rest to subpopulation 2.
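The classification step can be checked with a short Python sketch of the Bayes formula (10). It takes $p = 0.13$ and the $f_1$ values of the example, and computes $f_2$ from $\beta_2 = 1$ and $\eta_2 = 62\,000$ h. Those are the subpopulation 2 parameters used in the later computerized calculation, and they reproduce the $f_2$ values of the example. The function names are illustrative.

```python
import math

def weibull_pdf(t, beta, eta):
    """Two-parameter Weibull p.d.f., equation (3)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0) * math.exp(-((t / eta) ** beta))

def prob_subpopulation_1(p, f1, f2):
    """Bayes formula (10): probability that a failure belongs to subpopulation 1."""
    return p * f1 / (p * f1 + (1.0 - p) * f2)

P_FLAW = 0.13                          # rough plateau estimate of p
BETA2, ETA2 = 1.0, 62_000.0            # subpopulation 2 parameters
F1_VALUES = {1200.0: 2.123e-4,         # f1 at failure no. 11, per hour
             1920.0: 3.6620e-5}        # f1 at failure no. 12, per hour

results = {}
for t, f1 in F1_VALUES.items():
    f2 = weibull_pdf(t, BETA2, ETA2)
    P1 = prob_subpopulation_1(P_FLAW, f1, f2)
    results[t] = (f2, P1)
    group = 1 if P1 > 0.5 else 2
    print(f"t = {t:.0f} h  f2 = {f2:.4e}/h  P1 = {P1:.2f}  -> subpopulation {group}")
```

It reproduces the abrupt drop from about 0.67 at failure no. 11 to about 0.26 at failure no. 12.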
The next step is to plot the 11 failures belonging to subpopulation 1 as if they were a "sample" from a population containing only flawed components. In this case the "sample" size is 11 and all in the "sample" have failed; the corresponding values are given in Table VI. Plotting on Weibull paper gives figure 11, and in the usual manner we can estimate $\beta_1 = 1.3$ and $\eta_1 = 570$ h, which is very similar to the previous rough estimates. With regard to the points belonging to subpopulation 2, we must realise that these two (or maybe three) points are too few to make any analysis realistic. This will often be the case in practical testing. However, a comprehensive description of the analysis method has been given by Jensen and Petersen (Ref. 2). Our last concern is the value of $p$ itself. Only a very rough estimate can be made from the S-curve itself, and if $F_1(t)$ and $F_2(t)$ are close to each other, this type of estimate becomes very uncertain. A better estimate can be obtained using the Bayes formula. In the present case a computerized calculation with $\eta_1 = 570$ h, $\beta_2 = 1$ and $\eta_2 = 62000$ h gives $\hat{p} = 0.104$ for $\beta_1 = 1.2$ and $\hat{p} = 0.103$ for $\beta_1 = 1.3$. Compared with the rough estimate ($p = 0.13$), this estimate is significantly different.
Furthermore, we see that the change in the $\beta_1$ value in this case does not make any practical difference.

CONCLUSION
As mentioned in the introduction, this presentation of the methods available for statistical analysis of data from lifetests is not in any way comprehensive. For example, we have left out distributions other than the exponential and Weibull, in particular the log-normal distribution, which is used very often by some of the major telecommunication companies in analysing equipment failures. Furthermore, we have not discussed methods for analysing parametric drift with time.
However, it is the author's belief that mastering the methods described enables one to carry out reasonable analyses on most of the lifetest data that appear in practice at present. One word of caution should be given here: the statistical analysis cannot stand alone. Before drawing conclusions, it is very important to carry out physical failure analyses and to establish the causes of failure as closely as possible. It is the combination of statistics, physics and engineering judgement that constitutes a powerful tool, not the statistical analysis by itself.

FIGURE 2 An example of a Weibull graph paper

FIGURE 7 The cumulative failure distribution function for CMOS type 4007 tested at a temperature of 200 °C

FIGURE 8 CMOS 4007 test. Approximation with a bimodal distribution function

FIGURE 10 Weibull plot for components with a bimodal lifetime distribution (sample size n = 90)

FIGURE 11 Weibull plot of the failures belonging to subpopulation 1
3. TYPES OF LIFETEST AND FAILURES
Referring to figure 1, lifetests can be performed either in the field or in the laboratory. Laboratory tests are the most used for components. Whatever test is used, the operational conditions have to be decided.

TABLE IV Time to failure and median rank values for CMOS circuit failures