Detecting the Early Stage of Phaeosphaeria Leaf Spot Infestations in Maize Crop Using In Situ Hyperspectral Data and Guided Regularized Random Forest Algorithm

School of Geography, Archaeology and Environmental Studies, University of the Witwatersrand, Private Bag 3, Wits, Johannesburg 2050, South Africa
Intuit Mountain View, Mountain View, CA, USA
School of Agriculture, Earth and Environmental Sciences, Pietermaritzburg Campus, University of KwaZulu-Natal, Scottsville P/Bag X01, Pietermaritzburg 3209, South Africa
African Insect Science for Food and Health (ICIPE), P.O. Box 30772, Nairobi 00100, Kenya
Department of Agronomy, Faculty of Agriculture, University of Khartoum, 13314 Khartoum North, Sudan


Introduction
Maize (Zea mays L.) accounts for 15-50% of energy in human diets in sub-Saharan Africa and is a staple food for the majority of the African population (Kagoda et al. 2010). In some sub-Saharan African countries, maize is considered a major feed for livestock [1,2]. In South Africa, maize is the second most popular crop after sugar cane and accounts for over 50% of production within the Southern African Development Community (SADC) region [3]. The crop is grown in two main zones within the country: a marginal western belt and a reliable, higher-productivity eastern core. In these areas, maize is highly dependent on climatic variables such as temperature and precipitation [3]. Less than 10% of the crop is produced under irrigation [3]. Thus, climatic variability and change are key factors influencing interannual maize production in the country.
Commonly, changes in temperature and precipitation and the increased frequency and intensity of extreme weather events influence agricultural productivity and food safety through a number of pathways [4]. For example, climate change characterized by increases in temperature and changes in rainfall patterns influences the onset, persistence, and patterns of crop bacteria, viruses, parasites, and fungi [4]. Such changes may also affect plants' physiology and host susceptibility, which may result in the emergence, redistribution, and changes in the incidence and intensity of plant diseases and pest infestations [4,5].
One of the major diseases that threatens maize production in tropical and subtropical growing areas is Phaeosphaeria leaf spot [6,7]. Phaeosphaeria leaf spot (PLS) is a maize foliar disease caused by the ascomycete fungus Phaeosphaeria maydis (Henn.). First noted in India, it has recently spread widely to other parts of the world such as Brazil [6,7], the USA [8], and Central, East, and Southern Africa [9,10]. The disease is predominant in areas of high rainfall and moderate temperatures, a common characteristic of the higher tropical and subtropical elevations [9,11,12].
Early symptoms of PLS are small, dark green, water-soaked leaf spots, which may be circular, oval, elliptic, or slightly elongated and are often 0.3 to 2.0 cm in diameter [6,12]. Typically, lesions are scattered over the leaf surface and have a chlorotic appearance [9,13]. These turn pale green, straw-colored, bleached and necrotic, or dried with dark brown margins [9,13]. Under favourable conditions, these lesions may coalesce into large irregular shapes and blight the entire leaf, in some cases infecting the stem [6].
PLS can result in a considerable reduction in photosynthetic leaf area as the spots coalesce [9]. This can cause premature leaf drying, a shortened plant cycle, decreased grain size and weight, and, in extreme cases, early plant death [6,14]. Substantial grain yield losses ranging from 11% to 60% in PLS-susceptible cultivars have been reported in the United States of America and Brazil [8,14]. Such losses may be attributed to nontranslocation of nitrogen in infected plants [14]. Whereas grain yield losses from PLS are yet to be quantified in South Africa, the maize genotypes in the country have been shown to be susceptible to PLS infestation [9,10]. Due to the importance of maize production in the tropical and subtropical regions, losses caused by the disease affect not only yield, but also micro and macro socioeconomic systems [10,14]. This necessitates the adoption of effective disease management approaches to sustain production.
Control and management of PLS using fungicides such as mancozeb, applied before or at the early stages of the disease, have been successful in Brazil [9]. Management approaches using fungicides therefore require that information on PLS infestation is available at appropriate scales and is summarized in a way that allows for suitable and timely management practice at the right place [15-17].
Traditionally, field survey data, based on expert visual inspection, have been used to identify PLS infestation. However, this requires continuous monitoring, which might be prohibitively expensive, time-consuming, and in some cases impractical on large farms [18,19]. Recently, remote sensing datasets and techniques have emerged as valuable means to detect and measure crop disease incidences in real time at both regional and farm scales [17,20-23]. Detailed information and discussion on the use of different remotely sensed data in detecting crop diseases and pests can be found in two recent comprehensive review papers [17,24].
The physiological changes in maize due to PLS infestation are caused by changes in the biochemical composition and in the internal and external structure of the leaf [6,9,14]. Consequently, the maize canopy spectral reflectance also changes across the relevant sections of the electromagnetic spectrum. These spectral characteristics form the basis for the use of remote sensing in spatial modelling of crops stressed by the disease [25]. In this paper, we seek to detect the early stage of PLS in tropical maize using a combination of biochemical data and field spectral reflectance measurements.

Material and Methods
2.1. The Study Area and the Experimental Setup. This study was conducted at the Cedara experimental farm, located in KwaZulu-Natal province, South Africa (30°16′ E, 29°32′ S; 1076 metres above sea level). The plantings were done in two replications in November 2013 and 2014. Each plot was 3 m long, with inter-row and in-row spacings of 0.75 m and 0.3 m, respectively. Two seeds were planted per station and later thinned to one. The plant population density was about 44,000 per hectare. Field data were collected during the vegetative (VT) growth stage, when the last branch of the tassel is visible [26]. The plots were classified based on the number of leaves showing the disease symptoms. The early stage of PLS was assessed fortnightly from the first appearance of symptoms, based on visual assessment of the leaves using a 1-9 rating scale [9], where 1 = 0%, 2 = <1%, 3 = 1-3%, 4 = 4-6%, 5 = 7-12%, 6 = 13-25%, 7 = 26-50%, 8 = 51-75%, and 9 = 75-100% of leaf area showing disease symptoms. The scores were further classified into the following disease reaction types: 1.0 = symptomless (healthy) and 2.0-3.0 = early stage. The other disease stages from 4 to 9 (moderate and severe) were not considered in this study.
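The grouping of rating scores into disease reaction types can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `classify_pls_score` and the label "excluded" are hypothetical, and integer scores are assumed.

```python
def classify_pls_score(score: int) -> str:
    """Map a 1-9 visual PLS rating to the reaction type used in the study.

    Hypothetical helper: 1 = healthy, 2-3 = early stage, 4-9 = not analysed.
    """
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    if score == 1:
        return "healthy"
    if score <= 3:
        return "early"
    return "excluded"  # moderate and severe stages were not considered


# Example: label a handful of hypothetical plot scores.
plot_scores = [1, 2, 3, 5]
plot_labels = [classify_pls_score(s) for s in plot_scores]
```

Plots labelled "excluded" would simply be dropped before building the two-class (healthy versus early stage) dataset.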
2.2. Ground-Based Hyperspectral Measurements. Leaf reflectance spectra were obtained using an Analytical Spectral Devices (ASD Inc., Boulder, CO, USA) FieldSpec® 3 spectrometer. The spectra were collected under sunny and clear-sky conditions between 10:00 am and 02:00 pm local time over two seasons, on 5 January 2013 and 25 January 2014. The FieldSpec® 3 spectrometer has a spectral range of 350 to 2500 nm and registers radiation at 1.4 nm intervals for the 350-1000 nm spectral region and 2 nm intervals for the 1000-2500 nm spectral region. Measurements were then interpolated to 1 nm spectral resolution across the spectrum [27]. From each plot, three to five leaves from the top canopy of the maize crop were sampled. For each sample unit, piles of maize leaves were arranged and placed randomly on top of thick black cardboard [28]. The leaf reflectance was then taken immediately at a nadir-looking angle from about 25 cm above the leaves. About 15 to 20 measurements were made from each pile of leaves by moving randomly over each canopy, to derive the representative reflectance spectra for the canopy (Figure 1). These spectral measurements were then averaged to represent the final spectral measurement for each leaf sample. A white reference measurement on the calibration panel was performed every 10-20 measurements to offset any change in the atmospheric conditions and sun irradiance spectrum [27].
In total, 66 and 72 plots for the PLS early stage and healthy maize were sampled, respectively. Using a similar procedure, field spectral measurements were replicated on 15 January, 2014 on the same experimental farm and under similar conditions. Spectral reflectance from 66 and 72 plots for the PLS early stage and healthy maize was collected, respectively.
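The two preprocessing steps described above, averaging repeated scans and resampling to a 1 nm grid, can be illustrated with a short standard-library sketch. Linear interpolation is an assumption here (the text does not state which interpolation the instrument software applies), and the function names and reflectance values are hypothetical.

```python
from bisect import bisect_right
from statistics import fmean


def mean_spectrum(scans):
    """Average repeated scans band by band (all scans share one wavelength grid)."""
    return [fmean(band) for band in zip(*scans)]


def interpolate_linear(wavelengths, values, targets):
    """Linearly interpolate values at ascending wavelengths onto target wavelengths."""
    resampled = []
    for t in targets:
        if not wavelengths[0] <= t <= wavelengths[-1]:
            raise ValueError("target wavelength outside the measured range")
        # Index of the right-hand neighbour, clamped so the last point is valid.
        j = min(bisect_right(wavelengths, t), len(wavelengths) - 1)
        x0, x1 = wavelengths[j - 1], wavelengths[j]
        y0, y1 = values[j - 1], values[j]
        resampled.append(y0 + (y1 - y0) * (t - x0) / (x1 - x0))
    return resampled


# Two hypothetical scans of one leaf pile at a 2 nm sampling interval.
scans = [[0.40, 0.44, 0.48], [0.42, 0.46, 0.50]]
native_grid = [1000.0, 1002.0, 1004.0]
averaged = mean_spectrum(scans)
one_nm_grid = [1000.0, 1001.0, 1002.0, 1003.0, 1004.0]
resampled = interpolate_linear(native_grid, averaged, one_nm_grid)
```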

2.3. Leaf Sampling and Biochemical Analysis. Disease infestation affects the amount and quality of the chemical composition, the physical structure, and the spectral properties of the leaves [29]. To test whether PLS has a significant impact on the chemical composition of maize leaves, the piles of leaf samples (n = 2-5) from both the healthy and the PLS early stage plots were packed immediately after the spectral measurements. The samples were then pooled, bagged, dried at 70 °C for 48 h, and sent for full biochemical analysis at the Department of Agriculture and Environmental Affairs Feed laboratory in Cedara, South Africa. A t-test was used to determine whether PLS caused any significant difference in the biochemical composition of the leaves relative to the healthy leaves.
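As an illustration of the two-sample comparison, the sketch below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. The text does not specify which t-test variant was applied, so the choice of Welch's form is an assumption, and the nitrogen values are invented purely for demonstration.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (fmean(sample_a) - fmean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical leaf nitrogen concentrations (%), not data from the study.
healthy_n = [2.9, 3.1, 3.0, 3.2]
early_pls_n = [2.4, 2.6, 2.5, 2.7]
t_stat, dof = welch_t(healthy_n, early_pls_n)
```

The statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain a p-value.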

2.4. Hyperspectral Data Analysis. Reflectance values of 528 wavelengths distributed in four spectral regions (i.e., 350-399 nm, 1300-1400 nm, 1750-1980 nm, and 2350-2500 nm) were removed from the maize spectra due to noise and atmospheric water absorption [30]. Therefore, only 1623 out of 2151 wavelengths were used in the spectral analysis. One of the most notable difficulties in hyperspectral data processing is the hyperdimensionality of the data, which requires sufficient training samples to simplify the complexity of classification and prediction processes [31-34]. Practically, in most hyperspectral applications, the number of training samples (n) is limited with respect to the large number of hyperspectral bands (p) [34]. Therefore, variable selection methods have been widely used to select a compact subset of variables without loss of the predictive power of hyperspectral data [33]. In this study, a recently developed method, the guided regularized random forest (GRRF) [35,36], was tested for hyperspectral band selection and classification.
2.5. Random Forest Classifier. Random forest (RF) is an ensemble learning technique developed by Breiman [37] to improve classification and regression trees (CART) by combining a large set of decision trees. The RF [37] grows multiple unpruned trees (ntree) on bootstrap samples of the original data. Each bootstrap sample is taken with replacement and contains about two-thirds of the original observations (known as "in-bag" data). Each tree is split into many nodes using random subsets of variables (mtry); the default mtry value is the square root of the total number of variables. From the mtry selected variables, the variable that yields the highest decrease in impurity is chosen to split the samples at each node [37]. A tree is grown to its maximum size without pruning until the nodes are pure, that is, until the nodes hold samples of the same class or contain a specified number of samples. A prediction of the response variable (e.g., PLS early stage) is made by aggregating the predictions over all trees. In a classification application, a majority vote from all the trees in the ensemble determines the final prediction [37]. A more detailed description of RF can be found in Breiman [37] and Touw et al. (2012), among others. We used RF because it naturally handles numerical and categorical features, different scales, interactions, and nonlinearities [36].
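Three of the building blocks described above (bootstrap sampling, the default mtry, and majority voting) can be sketched with the standard library alone. This is only an illustration of the mechanics, not a full random forest; the function names, the seed, and the sample size of 138 are assumptions chosen for the example.

```python
import random
from collections import Counter
from math import isqrt


def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (the in-bag sample for one tree)."""
    return [rng.randrange(n) for _ in range(n)]


def default_mtry(p):
    """Default number of candidate variables per split: floor of sqrt(p)."""
    return isqrt(p)


def majority_vote(tree_predictions):
    """Aggregate per-tree class labels into the ensemble prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_samples = 138                 # hypothetical training-set size
in_bag = set(bootstrap_indices(n_samples, rng))
oob_fraction = 1 - len(in_bag) / n_samples   # typically close to 1/e (about 0.37)

mtry_all_bands = default_mtry(1623)          # retained wavelengths as predictors
ensemble_label = majority_vote(["ES", "HS", "ES"])
```

The samples left out of a tree's bootstrap draw are the out-of-bag samples, which is why roughly one-third of the data is available to estimate each tree's error.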
2.6. Feature Selection via Guided Regularized Random Forest. Ordinary random forest has been widely used in hyperspectral data reduction. However, its preference for highly correlated predictor variables when identifying variables in high-dimensional spectral space has been identified as its major limitation [38,39]. Moreover, while RF provides insight into the importance of each variable in the classification process, it does not automatically select the optimal number of variables that could yield the lowest error rate [40]. The new approach tested in this study was first developed and tested by Deng and Runger [36] on a small and simple dataset.
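Random forest provides an internal measure of variable importance using the Gini index. The Gini index at node $v$ is defined as

$$\mathrm{Gini}(v) = \sum_{k=1}^{K} p_k^v \left(1 - p_k^v\right), \tag{1}$$

where $K$ is the number of classes and $p_k^v$ is the proportion of class $k$ observations at node $v$. The Gini information gain of feature $X_i$ at node $v$, $\mathrm{Gain}(X_i, v)$, is the difference between the impurity at node $v$ and the weighted average of impurities at each child node of $v$. The weights are proportional to the number of samples assigned to each child from the split at node $v$, as defined in

$$\mathrm{Gain}(X_i, v) = \mathrm{Gini}(v) - w_L\,\mathrm{Gini}(v^L) - w_R\,\mathrm{Gini}(v^R), \tag{2}$$

where $\mathrm{Gini}(v^L)$ and $\mathrm{Gini}(v^R)$ are the Gini indices and $w_L$ and $w_R$ are the weights for the left and right child nodes.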
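A minimal sketch of these two quantities, assuming the standard definitions (node impurity as the sum over classes of p(1 - p), and gain as the parent impurity minus the sample-weighted child impurities). The function names and the toy labels are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right children."""
    w_left = len(left) / len(parent)    # share of samples sent to the left child
    w_right = len(right) / len(parent)  # share of samples sent to the right child
    return gini(parent) - w_left * gini(left) - w_right * gini(right)


# Toy node with four early-stage (ES) and four healthy (HS) samples.
parent = ["ES"] * 4 + ["HS"] * 4
imperfect_gain = gini_gain(parent, ["ES"] * 3 + ["HS"], ["ES"] + ["HS"] * 3)
perfect_gain = gini_gain(parent, ["ES"] * 4, ["HS"] * 4)
```

A split that separates the two classes completely recovers the full parent impurity, whereas a partially mixed split yields a smaller gain.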
To identify the key predictors, researchers have leveraged random forest for feature selection. For example, the recursive feature elimination (RFE) framework [41] and forward variable selection [40,42] build multiple random forests in order to obtain an optimal subset of features that best explains the phenomena of interest. However, these methods are computationally intensive. Consequently, Deng and Runger [35] proposed a regularization framework that can be applied to random forest (regularized random forest) and boosted trees (regularized boosted trees). The regularization framework avoids selecting a new feature for splitting the data in a tree node when that feature produces similar information to the features already selected. The regularized framework builds only one model, which may considerably reduce the training time. Guided regularized random forest (GRRF) is an enhanced regularized algorithm that uses the importance scores from an ordinary random forest to guide the feature selection process [36].
The guided regularized random forest is built similarly to random forest, but uses a regularized version of information gain at each node $v$, as in

$$\mathrm{Gain}_R(X_i, v) = \begin{cases} \lambda_i\,\mathrm{Gain}(X_i, v), & X_i \notin F, \\ \mathrm{Gain}(X_i, v), & X_i \in F, \end{cases} \tag{3}$$

where $F$ is the feature set selected in the previous nodes and $\lambda_i \in (0, 1]$ is called the coefficient of regularization for $X_i$ and can be calculated as follows:

$$\lambda_i = 1 - \gamma + \gamma\,\mathrm{Imp}_i, \tag{4}$$

where $\mathrm{Imp}_i \in [0, 1]$ is the normalized importance score for $X_i$ from an ordinary random forest built on the data set and $\gamma \in [0, 1]$ is called the importance coefficient. For a feature $X_i$ that does not have the maximum importance score of 1, a larger $\gamma$ leads to a smaller $\lambda_i$ and, thus, a larger penalty on $\mathrm{Gain}(X_i, v)$ when $X_i$ has not been used in the nodes prior to node $v$.
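The penalty mechanism can be sketched as below. The sketch assumes the coefficient takes the form lambda_i = 1 - gamma + gamma * Imp_i (the guided form of Deng and Runger with a base coefficient of 1), which is consistent with the behaviour described above: a larger gamma gives a smaller lambda_i for any feature whose normalized importance is below 1. Function names, wavelengths, and numbers are illustrative.

```python
def regularization_coefficient(importance, gamma):
    """Assumed form: lambda_i = 1 - gamma + gamma * Imp_i, with both inputs in [0, 1]."""
    if not (0.0 <= importance <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("importance and gamma must lie in [0, 1]")
    return 1.0 - gamma + gamma * importance


def regularized_gain(gain, feature, selected, importance, gamma):
    """Penalise the gain of a feature that is not yet in the selected set F."""
    if feature in selected:
        return gain  # features already used in earlier nodes are not penalised
    return regularization_coefficient(importance, gamma) * gain


selected_bands = {795}   # hypothetical set F of wavelengths (nm) already used
new_band_gain = regularized_gain(0.10, 779, selected_bands, importance=0.5, gamma=0.8)
old_band_gain = regularized_gain(0.10, 795, selected_bands, importance=0.5, gamma=0.8)
```

With these settings the unused band keeps 60% of its raw gain, so it only enters the subset if it is clearly more informative than the bands already selected.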
Comparative studies have shown that GRRF is effective in selecting high-quality feature subsets while maintaining predictive accuracies [36].
2.7. Accuracy Assessment. The accuracy of the RF classifier was assessed using the independent test dataset collected during the following growing season (2014) under the same PLS conditions. Out-of-bag (OOB) error [37], which provides an unbiased estimate of the error of the RF, was used to estimate the misclassification rate. A confusion matrix was subsequently constructed to compute the overall accuracy (OA), user's accuracy (UA), and producer's accuracy (PA) as criteria for evaluating the generalization ability (accuracy) of the RF classifiers [43]. OA is the ratio (%) between the number of correctly classified samples and the number of test samples. UA is the probability that a sample assigned to a given class by the classifier actually belongs to that class, whereas PA is the probability that a reference sample of a given class is correctly classified. Furthermore, kappa analysis, which uses the κ statistic, was carried out to determine whether one error matrix is significantly different from another. The kappa coefficient measures how far the actual agreement between the reference data and the classification exceeds the agreement expected by chance. A kappa coefficient equal or close to one indicates perfect or near-perfect agreement [44].
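The accuracy measures named above can be computed from a confusion matrix as in the following sketch. The matrix layout (rows as reference classes, columns as classified classes) and the counts are assumptions made for illustration; they are not the values reported in Table 1.

```python
def accuracy_metrics(matrix):
    """Compute OA, per-class UA and PA, and kappa from a square confusion matrix.

    matrix[i][j] = number of samples of reference class i classified as class j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    reference_totals = [sum(row) for row in matrix]
    classified_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    overall = correct / total
    producers = [matrix[i][i] / reference_totals[i] for i in range(k)]
    users = [matrix[j][j] / classified_totals[j] for j in range(k)]

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(reference_totals, classified_totals)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, users, producers, kappa


# Hypothetical two-class matrix: early stage (ES) first, healthy stage (HS) second.
example = [[40, 10],
           [5, 45]]
oa, ua, pa, kappa = accuracy_metrics(example)
```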

Results and Discussion
3.1. Chemical Analysis of the Leaves. Since the effects of PLS on the biochemical characteristics of maize have never been established, it was necessary to explore whether the chemical composition of leaves at the PLS early stage differed significantly from that of healthy leaves. A t-test was used to determine whether differences in the chemical composition (NPK) between the PLS early stage and healthy leaves were significant. Statistically significant differences in biochemical concentration were observed between healthy leaves and PLS-infested leaves for nitrogen (N), calcium (Ca), magnesium (Mg), copper (Cu), manganese (Mn), and phosphorus (P). PLS is known to affect translocation of nitrogen, reduce the plant cycle and photosynthetic activity, and accelerate leaf senescence, which reduces grain size and weight [7,9]. However, this conclusion should be treated with caution as it is based on a general analysis to understand the effects of PLS on the spectral characteristics of healthy maize leaves. More replicated biochemical analyses under a controlled environment are required to better understand the effects of PLS on maize growth.

3.2. Variable Importance Measurement and Selection. The new variable selection procedure used in this study was able to reduce the high dimensionality of the hyperspectral data by eliminating irrelevant or redundant wavelengths. The importance of variables (wavelengths) in discriminating the PLS early stage and healthy maize leaves as determined by the ordinary RF classifier is shown in Figure 2. The most important wavebands are located in the red edge (670-780 nm) and near infrared (700-1200 nm) portions of the electromagnetic spectrum. Very few bands are located in the visible (400-500 nm) and the shortwave infrared (1900-2300 nm) sections of the electromagnetic spectrum. A possible explanation for the selection of these wavelengths (within the visible and red edge regions) as the most important in discriminating the PLS early stage is that these regions are more sensitive to vegetation biochemical properties such as canopy chlorophyll and nitrogen contents [45,46]. Results in this study have shown that PLS infestation leads to significant differences in the biochemical properties between leaves at the early stage of PLS and healthy leaves. Changes in these leaf properties result in a shift in the red edge curve and increase the reflectance in the visible region [47], hence the selection of these regions. The selection of the infrared wavelengths may be related to other effects of PLS, namely accelerated leaf senescence and decreased grain size and weight [7]. Figure 2 indicates that many variables (wavelengths) share the same maximal Gini information gain at a node. Therefore, the importance scores from the ordinary RF were used to facilitate GRRF's selection of a subset of wavelengths that can better discriminate between the early symptoms of PLS and healthy maize leaves.
The GRRF was able to select 6 wavelengths using the ranking output of the ordinary RF. The selected wavelengths are located at 420 nm, 795 nm, 779 nm, 1543 nm, 1747 nm, and 1010 nm (Figure 3). These 6 spectral wavebands produced a minimal OOB error of 9.42% using the training dataset, compared to a 15.78% OOB error rate when the total number of wavelengths (n = 1623) was used. The subset selected by GRRF not only has fewer features than the full set of variables, but also leads to a lower OOB error on the training dataset. This could be explained by the fact that, in model-based analysis, the use of less important or redundant hyperspectral wavelengths leads to a decrease in model accuracy, because the noise in the redundant data propagates through the classifier [33]. The noise may not only decrease the performance of a weak classifier with a limited capability in handling such variables, but may also affect the performance of more advanced classifiers such as random forest [35].
Results from the present experiment thus reaffirm previous findings [35,36,48], which show that the integrated approach of ordinary RF and GRRF is able to select small subsets of powerful variables in high-dimensional data through an efficient computational procedure and achieve a competitive accuracy. Consequently, it is worth considering GRRF for variable selection in future hyperspectral applications. However, this assertion requires additional testing and comparisons with different variable selection methods on different types of datasets before GRRF is adopted as a substitute for data dimensionality reduction. If proven reliable, this integration could save significant time spent on complex computational procedures for hyperspectral data analysis.

Figure 2: The importance scores of variables (wavelengths) as measured by ordinary RF using mean decrease in Gini index. The highest mean index is the most important variable.

3.3. Accuracy Assessment. The six wavelengths identified by GRRF were used as input variables to the RF classifier to discriminate between the early stage symptoms (ES) of PLS and the healthy stage (HS) of the maize leaves. Random forest parameters (ntree and mtry) were optimized using the training dataset (2013 dataset), and the model was tested on an independent test dataset (2014 dataset). The results indicate that with the best setting of ntree (7500) and mtry (2), the RF classifier yielded an overall accuracy of 81.88% using all the variables (n = 1623). The results improved to 87.68% when the subset of selected variables (n = 6) was used (Table 1).
Based exclusively on the overall accuracy (OA), the use of the wavelengths selected by GRRF (n = 6) proved to be more accurate (OA = 87.68%) than the use of all wavelengths (n = 1623) for detecting the early stage of PLS.
Whereas GRRF can be used as a classification algorithm, we preferred to use the traditional RF as a classifier. We opted for this approach because GRRF is designed for feature selection and its trees are not constructed independently; therefore, the classification model may have a higher variance than the traditional random forest [36]. Moreover, the traditional random forest has been used successfully as a classifier on different types of datasets, particularly remotely sensed data [49-51]. Therefore, we considered the traditional RF as a classifier to provide an effective and efficient evaluation of the newly developed feature selection method in a binary classification application. Results from this study indicate that GRRF was able to select high-quality feature subsets that significantly improved the classification performance of RF in detecting the early stage of PLS.
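As a rough consistency check on the two reported accuracies, the sketch below converts each percentage into an implied number of correctly classified samples. It assumes that the independent test set consisted of the 66 early-stage and 72 healthy plots described in the methods (138 samples); the implied counts are derived here for illustration and are not figures stated in the text.

```python
def implied_correct(overall_accuracy_percent, n_test):
    """Number of correct classifications implied by a reported overall accuracy."""
    return round(overall_accuracy_percent / 100 * n_test)


test_samples = 66 + 72   # assumed size of the independent test set

correct_all_bands = implied_correct(81.88, test_samples)   # all 1623 wavelengths
correct_six_bands = implied_correct(87.68, test_samples)   # six GRRF-selected wavelengths

# Recompute the percentages from the implied counts to confirm they round back.
oa_all_bands = round(correct_all_bands / test_samples * 100, 2)
oa_six_bands = round(correct_six_bands / test_samples * 100, 2)
additional_correct = correct_six_bands - correct_all_bands
```

Under this assumption, the gain from band selection corresponds to a small number of additional test samples classified correctly, which is worth keeping in mind when interpreting the size of the improvement.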

Conclusions
The objective of the present study was to investigate the potential use of remotely sensed data in detecting the early stage of PLS in tropical maize. An extensive set of in situ hyperspectral measurements was collected over two different seasons, and an integrated new approach of GRRF and ordinary RF was investigated for variable selection and classification. The relatively high overall accuracy obtained in this study indicates that the early stage of PLS in tropical maize can be detected using selected hyperspectral wavelengths. From the results of the present study, we can conclude the following:
(1) Phaeosphaeria leaf spot could be detected accurately at the early stage using hyperspectral data. This may provide insight into the choice of appropriate spatial and temporal management practices.
(2) The new GRRF method produced high-quality feature subsets for the traditional RF classifier. Therefore, it could be considered an effective and efficient feature selection tool for reducing the high dimensionality of data in hyperspectral applications.
Overall, our study presents a successful application of hyperspectral data, GRRF feature selection, and RF classifiers in detecting the early stage of PLS. This could be valuable in precision agriculture, specifically in the management and control of PLS. However, these results should be interpreted with caution as our study was based on analysing the spectral characteristics of PLS only. More studies are therefore needed to investigate the optimal spectral and spatial resolutions for PLS detection and to upscale these results to spaceborne or airborne sensor resolutions.

Figure 1: The sample plots of 2013 and 2014 with the mean reflectance of healthy and early stage of PLS using data collected in 2013. The grey areas indicate the spectral ranges excluded from the analysis due to the external effects.

Figure 3: The importance scores of variables (wavelengths) as measured by ordinary RF using mean decrease in Gini index. The highest mean index is the most important variable.

Table 1: The confusion matrix showing overall accuracy (OA) and kappa for classifying the early stages of PLS (ES) and the healthy stage (HS) using the independent test dataset.