Application of Stochastic Gradient Boosting Approach to Early Prediction of Safety Accidents at Construction Site

The construction industry is one of the deadliest industries in the United States and Korea. The number of accidents at construction sites has recently been increasing despite institutional support and managerial efforts. Proactive prediction of safety accidents is the best way to prevent them, but the dynamically changing conditions of a construction project make prediction very tricky and complicated. Moreover, preventive work for safety accidents at a construction site mainly depends on the intuitive and subjective opinions of practitioners with limited experience. The stochastic gradient boosting (SGB) approach may be an attractive alternative to conventional methods for predicting safety accidents because of its superior predictive performance. Therefore, SGB is applied to the early prediction of safety accidents at a construction site in order to examine its applicability to the construction safety domain. The prediction result of the proposed model is compared to those of an artificial neural network (ANN) model and a decision tree (DT) model. The proposed model shows a slightly better result than the ANN and DT models. Moreover, the result of the proposed model also demonstrates the advantages of a simple parameter set in constructing a model and a comprehensible decision-making procedure for safety management.


Introduction
The construction industry is one of the largest but deadliest industries in the United States. According to a survey of fatal work injuries by industry in the US as of 2017, of the 5,147 fatally injured workers in total, the construction industry accounted for 971, the transportation and warehouse sector for 882, agriculture, forestry, fishing, and hunting for 581, and professional and business services and others for 532 [1]. According to a survey of deaths caused by industrial accidents by industry in Korea as of 2016, of the 1,777 workers in total, the construction industry accounted for 554, the manufacturing industry for 408, the mining industry for 364, others for 293, and transport, warehouse, and communication for 129 [2]. The death toll reported for the construction field is the highest. Despite many managerial efforts to address this poor performance, the accident statistics have worsened over the past decade [3]. This might imply that the construction industry still employs ineffective, traditional approaches to safety management and that there is an urgent need for innovation [4].
That is, in order to prevent construction safety accidents (hereinafter, construction accidents), proactive safety management that can predict risk in advance is needed rather than reactive safety management that analyzes construction accidents that have already happened. Previous studies mainly focused on clarifying the relationship between causes and construction accidents by analyzing accident cases that had occurred [5][6][7]. However, because of the peculiar characteristics of a construction project, i.e., differences in scale, region, workers, contractor, and construction period, it is believed that hardly any two projects carry identical risk. As a result, it is very difficult for a safety manager to accurately judge and predict how risky a possible safety accident is in the current project based on his/her own intuition and experience.
Various data mining approaches have been applied to solve this problem. To be more specific, a decision tree (DT) [3, 8, 9, 10], an artificial neural network (ANN) [11][12][13], and a support vector machine (SVM) [14,15] have been used to perform the intricate prediction of construction accidents, which are affected by various factors. DT is a very powerful and humanly comprehensible approach because of its intuitive binary "if-then" rule-based structure [16]. However, DT is known to be unstable. In other words, a small change in the training samples can bring about a large difference not only in the tree but also in the analytical result [17]. ANN usually provides better results than the traditional simple-cut technique. However, this approach is a black-box technique [18], so it lacks human interpretability. SVM has attracted much attention because of its capacity for self-learning and high performance in generalization [19]. However, it needs considerable trial and error to determine a suitable kernel function [18]. Additionally, it has a high level of algorithmic complexity and needs extensive memory [16].
Among recent machine learning approaches, the boosting approach, which culminated in the research by Freund and Schapire, who introduced AdaBoost.M1 (also known as the AdaBoost algorithm), has drawn much attention because of its robustness and ease of application [20].
The boosting approach was originally proposed to combine several weak classifiers to improve classification performance. Previous studies found that, in terms of predictive performance, a boosting method is superior to competing methods including DT, ANN, and SVM, even when the dataset is defective and small [21][22][23]. In addition, the applicability of the boosting approach is much higher than that of competing techniques because of its single parameter [22]. Furthermore, the instability of DTs can be overcome by the boosting approach, i.e., by growing a forest of DTs and classifying each instance by a majority vote over the classifications given by the individual trees [24,25].
In 2002, Friedman proposed stochastic gradient boosting (SGB) as a relatively new tree-based regression and classification method to optimize predictive performance [26,27]. SGB is a hybrid that combines the advantages of both boosting and bagging approaches [26]. SGB provides several advantages, such as a limited number of user-defined parameters and the ability to model nonlinear relationships, handle qualitative and quantitative variables, and remain robust to missing values and outliers in the data [28]. Furthermore, the stochastic component of SGB not only provides superior predictive performance relative to parametric approaches but also reduces overfitting by excluding a certain fraction of the input data through random sampling, which is one of the representative characteristics of a bagging algorithm [26,29].
These advantages have allowed SGB to be actively used in various domains, such as image classification [30,31], mobile phone customer type discrimination [32], software reliability prediction [33], software maintainability prediction [34], genomic selection [35], financial fraud detection [36], airfoil self-noise prediction [37], rock burst damage prediction [38], drug combination prediction [39], vulnerable web component prediction [40], tree canopy cover estimation [29], pillar stability prediction [41], and credit risk prediction [42]. In the construction domain, Shin first applied SGB to preliminary construction cost estimation [43], and Tixier et al. subsequently applied it to construction injury prediction [44]. Unfortunately, there have been few attempts to apply SGB to safety management, an area of the construction domain that now needs innovation. For this reason, in this study, SGB is applied to the early prediction of accidents at a construction site to examine its applicability to construction safety management.
In the next section, preventive efforts against construction accidents made in Korea are discussed. The stochastic gradient boosting approach is reviewed theoretically in the third section. An early prediction model based on SGB is then applied to a dataset of actual construction accident cases in Korea, and its results are compared to those of ANN and DT. Finally, conclusions and suggestions for further study are given.

Construction Accidents and Preventive Efforts in Korea
In Korea, the construction industry has experienced a serious labor shortage due to workers' avoidance of 3D (dirty, difficult, and dangerous) jobs. In addition, the workers currently engaged in construction are aging, and there is concern that they will gradually become more vulnerable to construction accidents. According to the statistical data of the Korean Statistical Information Service [2], construction accidents in Korea caused a total of 2,596 fatal injuries over the past five years (2012-2016), as shown in Table 1. Fatal injuries in the construction industry increased annually from 496 in 2012 to 554 in 2016, even though total industrial fatal injuries decreased annually over the same years. When an accident does occur, fatal injuries are more likely in a construction project than in other industries because of its unique features, including outdoor production, work at height, and heavy equipment work. Moreover, given that construction projects have become increasingly taller and larger and that the number of foreign workers with different cultural backgrounds and languages has been increasing, safety accidents may not decrease in the construction industry [45].
The construction industry in Korea has long made practical and institutional efforts to prevent construction accidents. As a reactive approach, past accident data have been analyzed to prepare comprehensive guidelines for construction accidents based on statistical results by type in terms of the cause, time, and mechanism of the accidents. However, these efforts are limited in that the varied and dynamically changing conditions of construction projects have not been considered. Furthermore, under Article 41 (2) of the Industrial Safety and Health Law, it is mandatory to make efforts to reduce accidents by assessing the degree of risk in advance for each work at each construction site [46]. Risk assessment is a series of processes that identify hazards and risk factors of a workplace, estimate and determine the probability and severity of injuries or illnesses caused by those hazards and risk factors, and establish and implement reduction measures. Under the law, each site conducts a five-phase evaluation: preparation of risk assessment, hazard identification using a checklist, risk estimation, risk evaluation, and risk control and implementation. However, this method is also limited in that safety managers or practitioners make decisions based on subjective and limited experience and judgment. Recently, state-of-the-art technologies, e.g., big data analysis, the Internet of Things, and image processing, have been applied to construction safety management [47][48][49]. Data mining techniques, e.g., decision tree analysis, association rule analysis, and machine learning, have also been applied to predict construction accidents proactively [3,5,10,12,13]. Although these efforts are not yet sufficient for practical application in construction projects, scientific and systematic decision support tools that help safety managers proactively predict how risky a possible safety accident is will be needed in the long term.
Advances in Civil Engineering

Stochastic Gradient Boosting Approach
SGB, proposed by Friedman, is a hybrid of the bagging and boosting approaches [26]. SGB is an ensemble learning algorithm that combines boosting with decision trees and makes a prediction by weighting the ensemble members of all trees, as shown in Figure 1 [50].
In each iteration, the new model is built along the gradient descent direction of the loss function of the previous tree.
The essence of SGB is to minimize this loss function between the classification function and the real function by training the classification function $F^*(x)$.
Construction accident prediction is a classification problem. For multiple classes, the surrogate loss function can be written as in the following equations, following [26].
The loss function can be expressed as shown in the following equation:
$$\psi\left(\{y_k, F_k(x)\}_1^K\right) = -\sum_{k=1}^{K} y_k \log p_k(x),$$
where $X = \{x_1, x_2, \ldots, x_n\}$ is the set of input variables, $K$ is the number of classes, $y_k$ is the output variable for class $k$, and $p_k(x)$ is the probability of class $k$.
Then, the following equation can be obtained for the current residuals:
$$\tilde{y}_{ik} = y_{ik} - p_k(x_i),$$
where $K$ trees are induced, each of which predicts the corresponding current residuals. This produces $K$ trees at iteration $m$, each with $L$ terminal nodes $R_{klm}$. A separate line search is performed in each terminal node $l$ of each tree $k$, as shown in the following equation:
$$\gamma_{klm} = \arg\min_{\gamma} \sum_{x_i \in R_{klm}} \phi_k\left(y_{ik}, F_{k,m-1}(x_i) + \gamma\right),$$
where $\phi_k = -y_k \log[p_k(x)]$. Each of the functions $F_k$ is then updated, and the SGB model is established.
Generally, the accuracy of predicting construction accidents on-site correlates with the amount of available accident information concerning project scale, project location, project type, season, and other conditions. In this study, the factors for predicting construction accidents were determined in two steps. First, a list of factors affecting construction accident prediction was made by reviewing the construction accident case database provided by KOSHA. Second, appropriate factors were selected from the list through interviews with construction practitioners who were very experienced in safety management in Korea. Consequently, 15 factors, i.e., input variables, were selected in this study, as indicated in Table 2. The construction accident prediction results were classified into six representative categories: fall, wipeout, hit, cut, narrowness, and crash, which accounted for over 87% of accidents occurring at construction sites in 2014 [13].

Applying SGB to Predicting the Construction Accident.
In this study, the SGB-based prediction model for construction accidents is evaluated by applying it to actual cases of construction accidents. A construction accident was predicted using SGB as follows: (1) The classification function $F^*(x)$ was trained using the training data. In the dataset, each $x_i$ of the training set comprises the project scale, project duration, rate of process, and other factors. Each result, i.e., the accident type, was assigned to $y_i$. (2) Training was carried out using parameters such as the number of additive trees, the learning rate (shrinkage coefficient), and the maximum and minimum number of levels; once it was completed, the series of trees $F^*(x)$ that maps $x$ to $y$ for the training dataset $(y_i, x_i)$ with a minimized loss function $\psi(\{y_k, F_k(x)\}_1^K)$ was found. (3) The expected value of $F^*(x)$, i.e., the expected accident, was calculated for each new case of the test dataset $(y_j, x_j)$.
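As a worked illustration of the loss that step (2) minimizes, the sketch below assumes the multiclass log-loss with softmax class probabilities, the six accident categories ($K = 6$), and all class scores initialized to zero. The symbol $\tilde{y}_k$ for the pseudo-residual and the zero initialization are assumptions made for this example, not values reported in the study.

```latex
% K = 6 accident classes, all scores F_k(x) = 0 before the first tree
p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{6} e^{F_l(x)}} = \frac{1}{6}, \qquad k = 1, \dots, 6

% loss for one case whose true class is k = 1 (y_1 = 1, y_k = 0 otherwise)
\psi = -\sum_{k=1}^{6} y_k \log p_k(x) = -\log\frac{1}{6} = \log 6 \approx 1.79

% negative gradient (pseudo-residual) to which the next K trees are fitted
\tilde{y}_k = y_k - p_k(x): \qquad \tilde{y}_1 = \tfrac{5}{6}, \qquad \tilde{y}_k = -\tfrac{1}{6} \quad (k \neq 1)
```

Under these assumptions, each subsequent iteration raises the predicted probability of the true class, so both the loss and the magnitude of the residuals shrink.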
The construction accident prediction model of this study was developed using STATISTICA software. STATISTICA employs an implementation of the method usually referred to as SGB by Friedman [26]. To conduct an SGB training procedure, the parameters, i.e., the learning rate, the number of additive trees, and the subsampling proportion, have to be set, as shown in Figure 2. First, the learning rate was set at 0.1, as this value leads to better results in terms of prediction error according to previous empirical studies [26,52]. The subsample proportion and the seed for the random number generator were set to 0.5 and 1, respectively, according to a previous study [40]. The number of additive terms was set at 200. Figures 3 and 4 show how consecutive trees in the boosting step improve the quality of the SGB model, which uses randomly selected training data and testing data in each dataset. As additive terms are added to the model, the prediction error of the training data continues to decrease. The point at which the error estimate for the testing data starts to increase indicates the optimal number of trees for avoiding overfitting. As a result of training, the optimal number of additive trees for the SGB model in this study was 60, and the maximum tree size was 3, as shown in Figure 4.
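The training procedure and the choice of the number of trees can be sketched in plain Python. This is a toy illustration, not the STATISTICA implementation used in the study: it assumes single-split trees (stumps), the multiclass log-loss with softmax probabilities, and a one-step Newton estimate for the leaf values. The data are synthetic, and all function names (`train_sgb`, `fit_stump`, `error_curve`, `make_data`) are hypothetical. Only the learning rate (0.1), subsample proportion (0.5), seed (1), and number of additive trees (200) are taken from the text.

```python
import math
import random

def softmax(scores):
    """Turn per-class scores F_k(x) into class probabilities p_k(x)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sse(values):
    """Sum of squared deviations from the mean."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)

def leaf_value(res, n_classes):
    """One-step Newton estimate of the update in a terminal node."""
    den = sum(abs(r) * (1.0 - abs(r)) for r in res)
    if den < 1e-12:
        return 0.0
    return (n_classes - 1) / n_classes * sum(res) / den

def fit_stump(xs, res, n_classes, learning_rate):
    """Least-squares stump on the residuals: (feature, threshold, left, right)."""
    best = None
    for f in range(len(xs[0])):
        cuts = sorted(set(x[f] for x in xs))
        for lo, hi in zip(cuts, cuts[1:]):
            thr = (lo + hi) / 2.0
            left = [r for x, r in zip(xs, res) if x[f] <= thr]
            right = [r for x, r in zip(xs, res) if x[f] > thr]
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, thr, left, right)
    if best is None:  # no valid split: fall back to a single constant leaf
        v = learning_rate * leaf_value(res, n_classes)
        return (0, float("inf"), v, v)
    _, f, thr, left, right = best
    return (f, thr, learning_rate * leaf_value(left, n_classes),
            learning_rate * leaf_value(right, n_classes))

def stump_predict(stump, x):
    f, thr, left, right = stump
    return left if x[f] <= thr else right

def train_sgb(xs, ys, n_classes, n_trees=200, learning_rate=0.1,
              subsample=0.5, seed=1):
    """Fit K stumps per iteration, each on a random subsample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    scores = [[0.0] * n_classes for _ in range(n)]
    model = []
    for _ in range(n_trees):
        idx = rng.sample(range(n), int(subsample * n))  # stochastic step
        sub_x = [xs[i] for i in idx]
        probs = [softmax(scores[i]) for i in idx]
        stage = []
        for k in range(n_classes):
            # pseudo-residuals: indicator of class k minus predicted probability
            res = [(1.0 if ys[i] == k else 0.0) - p[k] for i, p in zip(idx, probs)]
            stage.append(fit_stump(sub_x, res, n_classes, learning_rate))
        for i in range(n):
            for k in range(n_classes):
                scores[i][k] += stump_predict(stage[k], xs[i])
        model.append(stage)
    return model

def error_curve(model, xs, ys, n_classes):
    """Misclassification rate after 1, 2, ..., len(model) iterations."""
    scores = [[0.0] * n_classes for _ in xs]
    curve = []
    for stage in model:
        wrong = 0
        for s, x, y in zip(scores, xs, ys):
            for k in range(n_classes):
                s[k] += stump_predict(stage[k], x)
            wrong += s.index(max(s)) != y
        curve.append(wrong / len(xs))
    return curve

def make_data(rng, per_class, n_classes=3):
    """Synthetic, hypothetical data: feature 0 separates classes, feature 1 is noise."""
    xs, ys = [], []
    for c in range(n_classes):
        for _ in range(per_class):
            xs.append((2.0 * c + rng.uniform(-0.8, 0.8), rng.uniform(0.0, 1.0)))
            ys.append(c)
    return xs, ys

data_rng = random.Random(0)
train_x, train_y = make_data(data_rng, 20)
test_x, test_y = make_data(data_rng, 10)
model = train_sgb(train_x, train_y, n_classes=3)
test_curve = error_curve(model, test_x, test_y, 3)
best_trees = test_curve.index(min(test_curve)) + 1  # fewest trees with lowest test error
print(best_trees, min(test_curve))
```

Reading `best_trees` off the test error curve mirrors the selection rule described above. On noisy real data the curve would typically fall and then rise again, whereas on this cleanly separable toy set it flattens early.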
For the verification of the SGB model, the same cases were applied to the ANN and DT models to compare the results. This is because the ANN and DT approaches showed superior prediction accuracy in previous studies [3,13]. Statistical Package for the Social Sciences (SPSS) Statistics 19 was used to construct the ANN and DT models in this study. To construct models using ANN and DT, optimal parameters have to be selected beforehand: the learning rate, the number of hidden neurons, and the momentum for ANN; and the maximum tree depth, the minimum number of cases for parent and child nodes, and the significance level of the chi-square automatic interaction detection (CHAID) method, among others, for DT. These values were determined through repeated experiments.

Results of Evaluation.
In the evaluation of the three prediction models, this study includes the second and third most likely construction accidents along with the most likely construction accident. The output results of this study are discrete variables, not continuous variables. For this reason, it is considered more reasonable to predict a construction accident by considering the three most likely accidents (from the first to the third most probable) rather than only the most likely accident in each model. Table 3 summarizes the results for the 30 test cases using SGB, ANN, and DT. This table consists of the actual values (target) and the predicted results of the SGB, ANN, and DT models. Rather than presenting only the first prediction, it lists the first, second, and third predictions, ranked by the predicted probabilities of the six construction accident types, together with the cumulative prediction accuracy in that order.
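The cumulative accuracy over the first, second, and third predictions can be computed as in the following minimal sketch. The function name `cumulative_top_k_accuracy` and the four example cases with their probabilities are hypothetical; only the six accident category names come from the text.

```python
def cumulative_top_k_accuracy(cases, max_k=3):
    """Share of cases whose target is among the k most probable classes, k = 1..max_k."""
    hits = [0] * max_k
    for target, probs in cases:
        ranked = sorted(probs, key=probs.get, reverse=True)
        rank = ranked.index(target)  # 0 means the most likely prediction
        for k in range(rank, max_k):
            hits[k] += 1
    return [h / len(cases) for h in hits]

# Hypothetical predicted probabilities for four test cases (not data from the study).
cases = [
    ("fall", {"fall": 0.40, "wipeout": 0.10, "hit": 0.25,
              "cut": 0.12, "narrowness": 0.08, "crash": 0.05}),
    ("hit", {"fall": 0.45, "wipeout": 0.05, "hit": 0.30,
             "cut": 0.10, "narrowness": 0.06, "crash": 0.04}),
    ("cut", {"fall": 0.35, "wipeout": 0.25, "hit": 0.07,
             "cut": 0.20, "narrowness": 0.08, "crash": 0.05}),
    ("crash", {"fall": 0.30, "wipeout": 0.24, "hit": 0.20,
               "cut": 0.15, "narrowness": 0.07, "crash": 0.04}),
]
accuracy = cumulative_top_k_accuracy(cases)
print(accuracy)  # [0.25, 0.5, 0.75]
```

In this toy example the target is the first prediction once, the second once, and the third once, and it falls outside the top three in the last case, so the cumulative accuracy rises from 25% to 75%.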
The SGB model provided comprehensible information about new cases for prediction, which is an advantage inherent in the decision tree. First, the importance of each input variable in predicting construction accidents was estimated, as shown in Figure 5. These values indicate the relative importance of each variable, obtained by assigning 100% to the highest value and scaling the others accordingly. Second, the tree structures of the model are given in Figure 6. This figure shows the prediction rules, such as the applied variables and their influence in the proposed model, so an intuitive understanding of the whole structure of the model is possible.
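The scaling of importance values described above can be sketched as follows. The function name `relative_importance` and the raw scores are hypothetical; the three variable names are factors mentioned in the text, but the numbers are not results from the study.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that the largest becomes 100%."""
    top = max(raw_scores.values())
    return {name: 100.0 * score / top for name, score in raw_scores.items()}

# Hypothetical raw scores, chosen only to illustrate the scaling.
raw = {"project scale": 40.0, "project duration": 20.0, "rate of process": 10.0}
scaled = relative_importance(raw)
for name, pct in sorted(scaled.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.0f}%")
```

Because every value is divided by the same maximum, the ranking of the variables is unchanged and only the scale becomes easier to read.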

Discussion of Results.
This study was conducted using the same training set and test set of construction accident data for all models. For the most likely accident, SGB showed a slightly higher prediction accuracy (36.6%) than ANN (33.3%) and DT (33.3%). When the second and third most likely accidents were included, ANN showed the best result (80.0%), ahead of SGB (76.6%) and DT (70%). In construction accident prediction, it is difficult to conclude that the SGB model outperforms both the ANN and DT models because the gap in accuracy among the models is very slight. However, even comparable performance by the SGB model is notable because ANN and DT models have shown superior prediction accuracy for construction accidents in previous studies. Additionally, the SGB model provided additional information, i.e., the importance plot and the tree structure of the model, which helps safety managers and practitioners comprehend the decision-making process intuitively. Moreover, the SGB model was simpler and easier to construct than the ANN and DT models because it has only three parameters.
Consequently, these results reveal that SGB, a relatively new AI approach in construction safety management, has the potential to be applied to construction accident prediction. It will help safety managers and practitioners with the difficult problem of early prediction, which currently relies on past accident cases and their own limited experience while the conditions of construction projects change dynamically. In addition, SGB can potentially be used in many areas, since the boosting approach can use existing AI techniques such as ANN and SVM, as well as DT, as the base learner in the boosting algorithm. The construction accident prediction method proposed in this study is ultimately expected to improve the effectiveness of safety management by helping safety managers and practitioners review possible construction accidents from a wider perspective.

Conclusion
This study proposed an early prediction model for construction accidents to assist the decision-making of safety managers in the construction field. To validate the applicability of the proposed model to construction safety management, its results were compared to those of ANN and DT models, which have shown high predictive performance for construction accidents. In an experiment with the three models using actual construction accident case data from Korea, the SGB model showed similar results. Moreover, the SGB model can provide additional information about the factors for predicting construction accidents, which helps safety managers and practitioners comprehend the decision-making procedure. Lastly, the SGB model is easier to construct than the ANN and DT models because it needs fewer parameters.
These results demonstrate that SGB has the dual advantages of boosting and decision trees. SGB has considerable potential to be a leading technique for the next generation of construction safety management systems.
In this study, the application of SGB to the early prediction of construction accidents was examined. Although the SGB prediction for the most likely accident is slightly better than that of ANN or DT, it is difficult to conclude that the performance of the model is sufficient for practical application. Therefore, further detailed analysis of the quality of the collected data and statistical verification of the factors in construction accident occurrence will be necessary before the proposed model can be used for an actual new project.

Figure 2: Parameter setting for the SGB model.

Figure 3: Training process of the SGB model.

Figure 4: Training results of the SGB model.

Figure 5: Importance of input variables of the SGB model.

Figure 6:

Table 1: Fatal injuries in the Korean construction industry and overall industry.

Table 2: Factors of construction accidents on-site.

Table 3: Comparison of prediction results of SGB, ANN, and DT models.