The Challenges of Spirometric Diagnosis of COPD

Chronic obstructive pulmonary disease (COPD) is one of the top causes of morbidity and mortality worldwide. Although for many years its accurate diagnosis has been a focus of intense research, it is still challenging. Due to its simplicity, portability, and low cost, spirometry has been established as the main tool to detect this condition, but its flawed performance makes it an imperfect COPD diagnosis gold standard. This review aims to provide an up-to-date literature overview of recent studies regarding COPD diagnosis; we seek to identify their limitations and establish perspectives for spirometric diagnosis of COPD in the XXI century by combining deep clinical knowledge of the disease with advanced computer analysis techniques.


Introduction
Chronic obstructive pulmonary disease (COPD) is characterized by respiratory symptoms and airflow limitation caused by airway and alveolar alterations. COPD is an umbrella term that includes chronic bronchitis and emphysema (Figure 1). Functional deviations are triggered by exposure to noxious particles or gases, mainly smoke from cigarettes or biomass combustion. Despite being preventable, it is currently the third leading cause of morbidity and mortality worldwide [2]. In 2019 alone, 3.28 million deaths were caused by COPD [3].
COPD can be diagnosed by several pulmonary function tests (PFTs), but spirometry is the most widely used tool due to its low cost and simplicity. Figure 2 shows the usual result of a spirometry test: two graphs and a summary table of measurements made on the curves in those graphs [5]. The most important spirometric measurement for detecting COPD is the ratio between the forced expiratory volume in the first second (FEV1) and the forced vital capacity (FVC), both measured during a forced expiration/inspiration manoeuvre after applying a bronchodilator.
According to the Global Initiative for Chronic Obstructive Lung Disease (GOLD), if the FEV1/FVC ratio is below 0.7 (70%), the subject is deemed to have COPD. Recommendations from the American Thoracic Society (ATS) and the European Respiratory Society (ERS) include the use of the statistically derived lower limit of normal (LLN) as an alternative to the fixed FEV1/FVC threshold of 0.7 [6], since the LLN includes the effect of normal ageing in the diagnostic process (Figure 3). This graph shows the 70% threshold for reference, the general behaviour of the LLN, and the predicted value according to age, although these parameters also depend on height and sex.
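The difference between the two case definitions can be illustrated with a minimal sketch. It assumes that the LLN is taken as the 5th percentile of a normally distributed reference population (predicted value minus 1.645 standard deviations); `predicted_ratio` and `sd_ratio` are hypothetical inputs that would come from a reference equation depending on age, height, and sex, which is not reproduced here.

```python
def below_fixed_threshold(fev1, fvc, threshold=0.70):
    """Fixed criterion: post-bronchodilator FEV1/FVC below a fixed cut-off."""
    return fev1 / fvc < threshold


def lower_limit_of_normal(predicted_ratio, sd_ratio):
    """LLN as the 5th percentile of an assumed normal reference distribution."""
    return predicted_ratio - 1.645 * sd_ratio


def below_lln(fev1, fvc, predicted_ratio, sd_ratio):
    """LLN criterion: measured FEV1/FVC below the subject-specific LLN."""
    return fev1 / fvc < lower_limit_of_normal(predicted_ratio, sd_ratio)


# Hypothetical older subject: measured ratio 0.68, predicted 0.74, SD 0.06.
fev1, fvc = 2.04, 3.00
print("fixed criterion positive:", below_fixed_threshold(fev1, fvc))
print("LLN criterion positive:  ", below_lln(fev1, fvc, 0.74, 0.06))
```

With these invented numbers the subject is positive under the fixed threshold but negative under the LLN, which is the kind of discordance that produces different prevalence estimates.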
A major effect of having these two case definitions for COPD is a disparity in prevalence estimation. According to Adeloye et al. [8], the global prevalence of COPD in 2019 was 10.3% when using the fixed threshold, and 7.6% according to the LLN definition.
As a tool for establishing the diagnostic performance of a test, a table known as the confusion matrix (also called the diagnostic 2 × 2 table) can be built to classify the diagnosis results (Table 1).
A confusion matrix shows how an index test classifies the subjects in comparison with the truth as defined by a reference test or gold standard. The matrix includes four boxes: true positives, false positives, false negatives, and true negatives. Test performance is often assessed on the basis of metrics derived from these boxes, such as accuracy, sensitivity, and specificity, and these metrics are also used to contrast different diagnostic methods.
Diagnosis is generally based on measuring one or a few variables on the subject. Proposing a new method requires setting a threshold value for the discriminatory variable. Such variables should discriminate between the population with the disease and the disease-free population. However, it is unusual to find a criterion that perfectly separates both populations, since the values of the measured variable in the two populations may overlap (Figure 4). Such overlap means that the threshold value could favour either more false positives or more false negatives, which implies a trade-off between sensitivity and specificity.
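The trade-off can be made concrete with a small sketch. The FEV1/FVC values below are invented for illustration and do not come from any study; a value below the threshold is counted as a positive test.

```python
def sensitivity_specificity(diseased, healthy, threshold):
    """Return (sensitivity, specificity) when value < threshold means 'positive'."""
    true_pos = sum(value < threshold for value in diseased)
    false_neg = len(diseased) - true_pos
    true_neg = sum(value >= threshold for value in healthy)
    false_pos = len(healthy) - true_neg
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


# Hypothetical, overlapping FEV1/FVC values for the two populations.
diseased = [0.52, 0.58, 0.61, 0.64, 0.66, 0.69, 0.71, 0.73]
healthy = [0.67, 0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.83]

for threshold in (0.65, 0.70, 0.75):
    sens, spec = sensitivity_specificity(diseased, healthy, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

Raising the threshold captures more diseased subjects (higher sensitivity) at the cost of labelling more healthy subjects as positive (lower specificity).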
Identifying a subject as a false positive or a false negative has important consequences. A false-positive COPD diagnosis can lead to a potentially harmful treatment, and it could also hinder the identification and treatment of other diseases that may be causing the respiratory symptoms in the patient's clinical picture. On the other hand, a false-negative diagnosis could make the patient miss the opportunity to receive timely COPD treatment, so disease progression may not be managed at an early stage [10]. Bearing this in mind and considering the stage in the diagnostic process, a more sensitive test (usually preferred for screening) or a more specific test (better for confirmatory testing) may be used.
To evaluate the diagnostic capabilities of spirometry, repeatability and reproducibility are important parameters to consider. The GOLD, the ATS, and the ERS established certain standards [2,11] to ensure that a test reaches an appropriate level of quality, and any study involving spirometry should always examine the conditions in which the test was performed [12]. In this case, statistical techniques such as the method agreement analysis [13], intraclass correlation coefficient (ICC) [14], and Bland-Altman plots [15] have proven to be very useful.
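As an illustration of one of these techniques, the numbers behind a Bland-Altman plot (the mean difference, or bias, and the 95% limits of agreement at bias ± 1.96 standard deviations of the differences) can be sketched as follows. The paired FEV1 values are hypothetical and only serve to show the calculation.

```python
from statistics import mean, stdev


def bland_altman_limits(first, second):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)  # sample SD of the differences
    return bias, bias - half_width, bias + half_width


# Hypothetical FEV1 values (litres) from two sessions on the same five subjects.
session_1 = [2.10, 3.05, 1.80, 2.60, 3.40]
session_2 = [2.00, 3.10, 1.75, 2.70, 3.30]

bias, lower, upper = bland_altman_limits(session_1, session_2)
print(f"bias={bias:.3f} L, limits of agreement=({lower:.3f}, {upper:.3f}) L")
```

Narrow limits of agreement around a bias close to zero would suggest that the two sessions can be used interchangeably.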
Even though the adequate use of spirometry is well described [11], there are some issues regarding its diagnostic accuracy. It is well known that traditional spirometric measures lack sensitivity to detect mild disease. Several reasons may explain such underperformance. First, airflow obstruction diagnosis currently relies on the use of fixed values in the flow-volume curve, which are insensitive to small airway disease (where COPD has its early onset) [16]. Secondly, spirometry requires a forced manoeuvre dependent on the patient's effort, which may be variable, and it may be difficult for some patients [17]; this translates into poor reproducibility. Thirdly, any patient with an FEV1/FVC ratio below the 95% confidence interval of normal is assumed to be diseased.
This review seeks to provide an overarching perspective of COPD diagnosis, summarise recent COPD diagnostic accuracy studies to understand current hurdles, and identify where there is room for improvement.
1.1. Traditional Spirometric Measures.
One of the most debated questions in COPD diagnosis is whether to use the fixed 0.7 value or the LLN as the threshold for the postbronchodilator (post-BD) FEV1/FVC ratio. Efforts have been made to resolve this issue. For instance, Miller et al. [17] compared the clinical characteristics of patients recently diagnosed with COPD by the fixed ratio method and those diagnosed by the lower limit of normal. They found that the fixed ratio identifies more subjects with fewer respiratory symptoms and more cardiac clinical characteristics. Furthermore, the following studies have compared the diagnostic accuracy of FEV1/FVC < LLN versus that of FEV1/FVC < 0.7 in different countries. Andreeva et al. [18] compared COPD prevalence in two major cities in Russia using both thresholds as criteria for diagnosing COPD. They included patients with reversible airway obstruction and, if FEV1/FVC < 0.7 is taken as the gold standard, then FEV1/FVC < LLN would have had a sensitivity of 0.69, a specificity of 0.99, and an accuracy of 0.98.
Canadian Respiratory Journal
The same comparison was performed in several Canadian cities [19]. If FEV1/FVC < 0.7 is taken as gold standard, then FEV1/FVC < LLN would have achieved an accuracy of 0.94, with relatively low sensitivity (0.64) but perfect specificity [2].
A similar study was carried out in Thailand, where a misidentification prevalence of 5.6% with most subjects in the "underestimated" subgroup was found, meaning that they were identified as false positives when using FEV1/FVC < LLN as the index test and FEV1/FVC < 0.7 as gold standard. The subjects in this "underestimated" group showed significant clinical conditions including chronic respiratory symptoms, so they should not have been considered false positives [20].
Similarly, a study in the Netherlands compared the diagnostic performance of the fixed value and the LLN against a clinical COPD diagnosis. The authors found that, while the fixed value was more sensitive than the LLN (0.73 vs. 0.47), it was also less specific (0.95 vs. 0.99) [21].
All the abovementioned studies reported results of spirometric measures after applying a dose of bronchodilators (BD). Nonetheless, some studies have tried to define the impact of not using this medication on spirometric diagnostic accuracy. For instance, Kronborg et al. [10] report that an increase from 64% to 79% in the diagnostic accuracy of pre-BD FEV1/FVC can be achieved by changing the threshold from 0.7 to 0.66, using post-BD FEV1/FVC < 0.7 as a reference.
On the other hand, completing the forced expiratory manoeuvre can be difficult for some patients for different reasons [22], including the severity of their symptoms or their cognitive capacity, which affect the quality of the FVC measurement. Consequently, several studies used spirometric measures at a fixed time point. In particular, the forced expiratory volume at 6 seconds (FEV6) has been extensively investigated as a replacement for FVC.
For example, in China, Pan et al. [23] determined the diagnostic accuracy of post-BD FEV1/FEV6 < 0.73 vs. post-BD FEV1/FVC < 0.7, which turned out to have an accuracy of 0.95, a sensitivity of 0.952, and a specificity of 0.945.
Along the same lines, Chung et al. [24] sought to define the best threshold for pre-BD FEV1/FEV6 to replace pre-BD FEV1/FVC to detect airway obstruction in a Korean population of 14,978 subjects. A criterion of pre-BD FEV1/FEV6 < 0.75 achieved a sensitivity of 0.94, a specificity of 0.95, and an overall accuracy of 0.95.
Furthermore, Wang et al. [25] defined the best threshold for FEV1/FEV6 and compared its diagnostic accuracy against FEV1/FVC < 0.70 to detect airway obstruction. This study found that a threshold of 0.75 for FEV1/FEV6 has an accuracy of 0.98, a sensitivity of 0.97, and a specificity of 0.99.
Regarding other spirometric parameters, Ioachimescu et al. [26] proposed an estimation of FVC based on the forced expiratory volume at 3 seconds (FEV3). With FEV1/FVC < LLN as the reference test, FEV1/FVC3 < LLN yielded an accuracy of 0.90, with a sensitivity of 0.94 and a specificity of 0.89.

Nontraditional Spirometric Measures.
As mentioned before, traditional spirometric measures are based on specific fixed values which do not seem to take advantage of the wealth of information the expiratory flow-volume curve has to offer. Some researchers have focused on the description of different measures of the shape of the flow-volume curve.
For instance, Bhatt et al. [16] introduced the D parameter (measured in the "volume vs. time" curve) and the transition point and transition distance (measured in the flow-volume curve) and reported its COPD diagnostic accuracy as 0.84, when compared with computed tomography (CT). The measurements proposed in this paper are shown in Figure 5.
In addition, Oh et al. [27] proposed the "flow decay," a measure defined as the slope of volume versus the natural logarithm of the reciprocal of the flow (ln (1/flow)) in midexhalation, to quantify dynamic airway resistance. This measure was found to have an accuracy of 0.94, a sensitivity of 0.95, and a specificity of 0.92 when compared with FEV1/FVC < LLN and plethysmography (Figure 6).
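A rough sketch of how such a slope could be computed is given below. It is not the authors' implementation: the midexhalation window (25% to 75% of the exhaled volume), the choice of volume as the dependent variable, and the use of ordinary least squares are all assumptions made here for illustration, and the curve is synthetic.

```python
import math


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def flow_decay(volumes, flows, lower=0.25, upper=0.75):
    """Slope of volume vs. ln(1/flow) within an assumed midexhalation window.

    `volumes` is the cumulative exhaled volume (its last value is taken as
    the total exhaled volume); `lower` and `upper` are hypothetical window limits.
    """
    total = volumes[-1]
    points = [
        (math.log(1.0 / flow), volume)
        for volume, flow in zip(volumes, flows)
        if flow > 0 and lower * total <= volume <= upper * total
    ]
    if len(points) < 2:
        raise ValueError("need at least two midexhalation samples")
    xs, ys = zip(*points)
    return least_squares_slope(xs, ys)


# Synthetic curve: flow falls exponentially with exhaled volume (constant 1.5 L).
volumes = [0.5 * i for i in range(9)]
flows = [8.0 * math.exp(-v / 1.5) for v in volumes]
print(round(flow_decay(volumes, flows), 3))
```

For this synthetic exponential curve the fitted slope recovers the decay constant used to generate it, which is a convenient sanity check before applying the function to measured data.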
Li et al. [28] introduced a new parameter, termed the AUC3/AT3, which is the area under the descending limb of the expiratory flow-volume curve before the end of the first 3 seconds (AUC3) divided by the area of the triangle before the end of the first 3 seconds (AT3), with an accuracy, sensitivity, and specificity of 0.86, 0.87, and 0.86, respectively, vs. the FEV1/FVC < LLN (Figure 7).
The utility of the area under the expiratory flow-volume curve (AEX) has sparked interest among several researchers due to its apparent ability to detect respiratory abnormalities.
Several studies [29-31] have been performed regarding the AEX's ability to diagnose respiratory impairment. Ioachimescu et al. [29] found that AEX has a good discriminating capacity between obstruction, restriction, mixed defects, and small airway disease. Later, Ioachimescu and Stoller [30] assessed the diagnostic accuracy and utility of several geometric approximations of AEX based on standard instantaneous flows; they obtained correlations ranging between 0.95 and 0.99 with the actual value of AEX (Figure 8). Ioachimescu and Stoller [31] also evaluated the capability of the square root of one of those approximated values, AEX, to detect and classify bronchodilator responsiveness into five categories: negative, minimal, mild, moderate, and marked, suggesting that this measure could become useful for stratifying dysfunction in obstructive lung disease.
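To make the quantity concrete, the area under a sampled expiratory flow-volume curve can be approximated numerically. The sketch below uses the trapezoidal rule, which is a choice made here and not necessarily the method used in the cited studies; the three-point curve is a hypothetical, idealised triangle.

```python
import math


def area_under_flow_volume_curve(volumes, flows):
    """Trapezoidal approximation of the area under a flow-volume curve.

    `volumes` (L) must be increasing; `flows` (L/s) are the matching samples.
    The result is in L^2/s.
    """
    area = 0.0
    for i in range(1, len(volumes)):
        width = volumes[i] - volumes[i - 1]
        area += width * (flows[i] + flows[i - 1]) / 2.0
    return area


# Idealised curve: flow rises to 8 L/s at 0.5 L and falls to zero at 4 L.
volumes = [0.0, 0.5, 4.0]
flows = [0.0, 8.0, 0.0]

aex = area_under_flow_volume_curve(volumes, flows)
print(aex, math.sqrt(aex))
```

The square root is printed alongside the area because it is the transformed quantity discussed in the studies above.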
Furthermore, the concavity of the expiratory flow-volume curve can also be analysed from the spirometric curves. Nozoe et al. [32] proposed that the concavity/convexity level of the flow-volume curve during spontaneous breathing can be an appropriate replacement for the traditional forced expiratory manoeuvre in older patients. They found that the percent-of-predicted FEV1 had an area under the curve (AUC-ROC) of 0.92, a sensitivity of 0.93, and a specificity of 0.93 as a predictor of the spontaneous expiratory flow-volume curve. In this study, a rectangle defined by the maximum spontaneous expiratory flow and the beginning of the inspiration was calculated, using the area below the curve within the rectangle for diagnosis (Figure 9). Also, Mochizuki et al. [33] presented a new metric for the maximal expiratory flow-volume curve (MEFV) concavity and proposed a new index, the obstructive index, to quantify the extent of emphysema in COPD, asthma-COPD overlap (ACO), and asthma. This new index, defined as the ratio of forced vital capacity to the difference in volume between the two points where the MEFV curve hits half the value of the peak expiratory flow, had a significant association with the CT measurement of low-attenuation volume (LAV%), which indicates that it could successfully reflect the extent of emphysema (Figure 10).
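The definition of the obstructive index given above translates into a short calculation. The sketch below is only an illustration under stated assumptions: the curve is sampled as paired volume and flow values, the half-peak crossings are located by linear interpolation (a choice made here), and the example curve is an invented triangle.

```python
def _interpolate_volume(v0, f0, v1, f1, target_flow):
    """Volume at which the segment (v0, f0)-(v1, f1) reaches target_flow."""
    return v0 + (target_flow - f0) * (v1 - v0) / (f1 - f0)


def obstructive_index(volumes, flows):
    """FVC divided by the curve width (in volume) at half the peak flow."""
    peak = max(range(len(flows)), key=flows.__getitem__)
    half = flows[peak] / 2.0

    rising = falling = None
    for i in range(1, peak + 1):  # ascending limb
        if flows[i - 1] < half <= flows[i]:
            rising = _interpolate_volume(
                volumes[i - 1], flows[i - 1], volumes[i], flows[i], half)
            break
    for i in range(peak + 1, len(flows)):  # descending limb
        if flows[i - 1] >= half > flows[i]:
            falling = _interpolate_volume(
                volumes[i - 1], flows[i - 1], volumes[i], flows[i], half)
            break
    if rising is None or falling is None:
        raise ValueError("curve does not cross half of the peak flow twice")

    fvc = volumes[-1] - volumes[0]
    return fvc / (falling - rising)


# Invented triangular curve: peak of 8 L/s at 0.5 L, FVC of 4 L.
print(obstructive_index([0.0, 0.5, 4.0], [0.0, 8.0, 0.0]))
```

A more concave descending limb narrows the width at half of the peak flow, so the index grows as the curve becomes more "scooped".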
Central concavity and peripheral concavity (Figure 11) are other examples of alternative measures, which are calculated based on the forced expiratory flow at 50% and 75% of the forced vital capacity, respectively. Johns et al. [34] found a moderately strong correlation between concavity, FEV1/FVC ratio, and midflow rate. They also found that concavity was more specific for clinical symptoms of COPD.

Machine-Learning Techniques.
Lately, artificial intelligence has been used in different fields to improve the performance of diverse systems by trying to emulate the way human intelligence works. Machine learning is a subcategory of artificial intelligence, and it is based on the principle that a computer can learn to perform a task (usually classification or regression) based on examples or experience, and not by being specifically programmed for the task. Deep learning is a machine learning technique that takes advantage of using a vast volume of information to learn. For example, Das et al. [35] developed a convolutional neural network (CNN) to verify if a flow-volume trace fulfils the ATS/ERS quality control criteria for spirometry. The CNN showed an accuracy of 87% for acceptability and 92% for usability in contrast to classifications made by respiratory technicians.
In the case of diagnostic performance for COPD, machine learning has been tested to provide a faster and more accurate diagnostic interpretation of PFTs since it can recognize patterns in high-dimensional feature spaces [36].
Combining their study of AEX with machine learning, Ioachimescu and Stoller [37] proposed the square root of AEX as an alternative spirometric parameter to differentiate between normal, obstructive, restrictive, and mixed patterns. They used machine learning in a model that combined best-split partition and artificial neural networks.
Also, three versions of residual networks were independently trained to perform COPD diagnosis using random subsets of CT scans collected from the PanCan study, which enrolled ex-smokers and current smokers at high risk of lung cancer [38]. These networks were evaluated by using threefold cross-validation experiments. The best performing networks achieved an accuracy of 0.889 (SD 0.017), measured as the area under the curve (AUC). Moreover, Bodduluri et al. [39] also used CT scans and deep learning to analyse spirometry, and they found that artificial neural networks (ANNs) and random forests do a better job at phenotyping COPD than the traditional spirometric measurements.
Jafari et al. [40] designed a system to detect normal and abnormal pulmonary functions using spirometry data and multilayer perceptron neural networks (MLPNNs), which classified respiratory patterns into normal, obstructive, restrictive, and mixed patterns, based on the flow-volume curve. This system achieved an accuracy of 0.98, a sensitivity of 0.98, and a specificity of 0.99 across all categories. In a similar study [41], two neural networks were concatenated in such a way that the first classified the sample as normal or abnormal and the second classified abnormal samples into restrictive or obstructive patterns, reporting accuracies, sensitivities, and specificities above 0.90 for all three patterns.
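The threefold cross-validation mentioned above can be sketched generically as follows. This is not the procedure of any of the cited studies: the function name `k_fold_indices` is hypothetical, the split is done on sample indices without shuffling or stratification, and the model training step is deliberately left out.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Ten hypothetical samples split into three folds.
for fold, (train, test) in enumerate(k_fold_indices(10, k=3), start=1):
    print(f"fold {fold}: train={train} test={test}")
```

Each sample is used for testing exactly once, and the per-fold scores can then be summarised by their mean and standard deviation, which is how a figure such as "0.889 (SD 0.017)" is typically reported.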
Finally, machine learning has been used not only in diagnosis but also in day-to-day applications to improve the quality of life of COPD patients.
For instance, Swaminathan et al. [42] used a machine learning-based strategy for early detection of COPD exacerbations and subsequent triage. The goal of this study was to identify exacerbations in a timely manner and to evaluate their severity to offer an action plan for the patient. This strategy was compared with the evaluation made by a group of physicians, and it showed good performance in predicting the need for emergency care.
In another study, Cheng et al. [43] proposed a system to classify lung function based on movement sensors in phones by using support vector machines. This study analysed the patients' walking patterns captured by their phone sensors and created a machine-learning model that perfectly classified their pulmonary function into GOLD I/II/III categories.

Discussion
Ideally, to obtain a COPD diagnosis with certainty, CT would be the gold standard. Vimala et al. [44] established a correlation between quantitative and qualitative parameters of high-resolution CT and pulmonary function tests, showing that CT has a key role not only in diagnosis but also in COPD severity definition.
However, CT is not always available, and spirometry is the most widely used method, at least during the first stages of diagnosis. The most frequently used spirometric measure is the FEV1/FVC ratio with a fixed ratio of 0.7 as the threshold [22]. This value is easy to calculate and remember in a clinical setting and it works reasonably well in the average patient with suspected COPD. Yet, it is well known that this fixed threshold leads to overdiagnosis of older subjects and underdiagnosis of younger subjects because pulmonary function declines with ageing.
Therefore, when dealing with patients either younger or older than the average, LLN works better. In ideal conditions, the definition of LLN should be obtained by deriving local population-specific equations. However, studies to develop such equations for every population have not been conducted due to logistics and costs. Most studies that use LLN to diagnose COPD or to establish COPD prevalence use known equations (mostly obtained in developed countries, for specific ethnicities) as their reference, which may lead to a decreased diagnostic accuracy. Therefore, an effort should be made globally to develop appropriate LLN equations.
Bhatt and Wood [45] performed a thorough review regarding the controversy around fixed value vs. LLN when dealing with ageing subjects and two important issues were found. First, most studies trying to justify LLN as a better COPD classification tool did not use postbronchodilation, which means that the GOLD recommendations were not fulfilled. Secondly, they found that subjects with FEV1/FVC ratio under 0.7 but over LLN had a higher risk of mortality and hospitalizations due to exacerbations. However, a more recent study [46] found that using FEV1/FVC under 0.7 was not significantly different from, nor more accurate than, other fixed or LLN thresholds in predicting COPD-related hospitalizations or mortality.
In addition, the fixed threshold (0.7) and the LLN for FEV1/FVC have different sensitivities and specificities. This should always be considered in the different stages of COPD diagnosis because they should not be considered interchangeable when used for screening vs. confirmatory testing [47].
Some studies try to exploit data obtained from studies with different goals (e.g., CT data obtained when screening for cancer) or aim at studying spirometric measures without applying bronchodilators. Therefore, these studies do not test bronchodilator response (BDR), which could be an inappropriate practice because, theoretically, not testing BDR makes it difficult to differentiate between asthma and COPD and goes against GOLD recommendations for COPD diagnosis.
Interestingly, Janson et al. [48] questioned the use of bronchodilator response in diagnosing COPD due to its limited ability to differentiate asthma from COPD. Also, Fortis et al. [49] studied the impact of bronchodilator response on adverse outcome measures (such as exacerbations and mortality) and concluded that when BDR is evident in both FEV1 and FVC, the clinical picture is associated with less emphysema, more frequent and severe exacerbations, and lower mortality, suggesting a COPD phenotype with asthma-like features.
Moreover, not all studies check for spirometric repeatability and reproducibility and, if they do, they do not always report doing so. Repeatable measurements are critical to guarantee the reliability of the diagnostic test and, when unmet, there is no point in defining the test's diagnostic accuracy. Besides, repeatability is essential for machine-learning models since the models' accuracy will be as good as that of the data used to train them. If there is no quality verification, the achieved models cannot be deemed reliable. Furthermore, machine-learning algorithms would likely benefit from having repeated measurements to learn from.
In addition, since the FEV1/FVC ratio is well known to be an imperfect diagnostic test (whether using fixed or LLN values) [50], it should not be used as a unique criterion to diagnose COPD, nor should it be used as a single gold standard for new diagnostic tests. Whenever possible, all available clinical information should be used to evaluate the diagnostic accuracy of any new method. This is particularly important when training machine-learning models with supervised techniques since the new model will only be as good as the gold standard used to train it.
In fact, some studies suggest that, due to the heterogeneous nature of disease presentation, it is wise to consider its different manifestations beyond spirometry. Lowe et al. [51] segregated current and former smokers into 8 groups, depending on the presence of one or more of the 4 characteristics: exposure (cigarette smoke only), respiratory symptoms (dyspnoea and/or chronic bronchitis), chest CT abnormalities (emphysema, gas trapping, and/or airway wall thickening), and abnormal spirometry, to show how each characteristic contributes to the disease progression and mortality. Adding these nonspirometric characteristics to a machine-learning technique can be easily implemented, which would result in new and perhaps more efficient diagnostic methods.
Recently, the very definition of COPD has been reviewed and a new naming system has been proposed, based on the origin of COPD [52]. This new classification includes 7 definitions: genetic COPD, COPD due to abnormal lung development, COPD due to infections, COPD and asthma, environmental COPD (which has two subcategories: cigarette smoking and biomass and pollution exposure), COPD of unknown causes, and COPD of mixed causes. This study alone may change the way we diagnose COPD considering the different manifestations the disease may have based on its causes.
On a final note, it is remarkable that neural networks are the most frequently used method when applying machine-learning techniques to diagnose COPD. Neural networks are known to be a very powerful tool, but they have a major disadvantage: they are black boxes, meaning that the problem is solved without really understanding the process and the reasoning behind the solution. Perhaps, simpler approaches, which are easier to understand, can be tested to see if their performance is powerful enough to improve the timely diagnosis of COPD.

Conclusion
COPD is a highly prevalent disease with a significant burden that seriously decreases a patient's quality of life, and its diagnosis remains a challenge despite the many studies performed on this topic. The heterogeneity of the disease and its multiple origins and presentations make diagnosis a multidimensional problem. Leveraging advanced machine-learning techniques, along with deep clinical knowledge of the issue, may be the key to tackling the problem and finding more suitable solutions, which should aid in achieving a more efficient diagnosis of COPD.

Abbreviations
AUC3: Area under the descending limb of the expiratory flow-volume curve before the end of the first 3 seconds
AT3: Area of the triangle before the end of the first 3 seconds
AEX: Area under the expiratory flow-volume curve
MEFV: Maximal expiratory flow-volume curve
ACO: Asthma-COPD overlap
CNN: Convolutional neural network
AUC: Area under the curve
MLPNNs: Multilayer perceptron neural networks
BDR: Bronchodilator response
(i) True positives (TP): patients correctly identified as having the disease by the index test
(ii) False positives (FP): patients incorrectly identified as having the disease by the index test
(iii) False negatives (FN): patients incorrectly identified as disease-free by the index test
(iv) True negatives (TN): patients correctly identified as disease-free by the index test
Furthermore, based on the classification shown in this matrix, a few diagnostic metrics can be defined and calculated as follows:
(i) Accuracy: ability of the test to correctly classify the subject ((TP + TN)/(TP + TN + FP + FN))
(ii) Sensitivity or recall: ability of the test to correctly detect diseased subjects (TP/(TP + FN))
(iii) Specificity: ability of the test to correctly detect disease-free subjects (TN/(FP + TN))
(iv) Positive predictive value: probability that a subject with a positive test result does have the disease (TP/(TP + FP))
(v) Negative predictive value: probability that a subject with a negative test result does not have the disease (TN/(FN + TN))
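The five metrics defined above follow directly from the four cells of the matrix. The sketch below applies the formulas to invented counts; the function name and the example numbers are hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from the cells of a 2 x 2 table."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
    }


# Invented example: 100 subjects, 50 with the disease and 50 without.
metrics = diagnostic_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values depend on the proportion of diseased subjects in the sample, so they do not transfer between populations with different prevalence in the way sensitivity and specificity do.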

Figure 5: Spirometric measurements proposed by Bhatt et al. in [16], including the D parameter.

Figure 6: Flow decay in a healthy person vs. in a patient with COPD (taken from [27]).