Addressing the Safety of Transportation Cyber-physical Systems: Development and Validation of a Verbal Warning Utility Scale for Intelligent Transportation Systems

As an important application of Cyber-Physical Systems (CPS), advances in intelligent transportation systems (ITS) improve driving safety by informing drivers of hazards with warnings in advance. The evaluation of the warning effectiveness is an important issue in facilitating communication of ITS. The goal of the present study was to develop a scale to evaluate the warning utility, namely, the effectiveness of a warning in preventing accidents in general. A driving simulator study was conducted to validate the Verbal Warning Utility Scale (VWUS) in a simulated driving environment. The reliability analysis indicated a good split-half reliability for the VWUS with a Spearman-Brown Coefficient of 0.873. The predictive validity of VWUS in measuring the effectiveness of the verbal warnings was verified by the significant prediction of safety benefits indicated by variables, including reduced kinetic energy and collision rate. Compared to conducting experimental studies, this scale provides a simpler way to evaluate overall utility of verbal warnings in communicating associated hazards in intelligent transportation systems. This scale can be further applied to improve the design of warnings of ITS in order to improve transportation safety. The applications of the scale in nonverbal warning situations and limitations of the current scale are also discussed.


Introduction
Warnings play an important role in communicating information about potential hazards so that accidents and injuries can be avoided. Research on road safety communication has increased significantly over the last three decades [1]. To improve driving safety, recent advances in Transportation Cyber-Physical Systems (CPS) aim to establish a connected transportation environment that links the cyber world (e.g., information, communication, and intelligence) with the physical world (e.g., sensors and actuators) and provides integrated real-time information at multiple levels, including vehicle-to-vehicle communication, vehicle-to-infrastructure communication, and in-vehicle information communication [2]. Compared to the traditional transportation environment, the connectivity of the CPS allows drivers to learn about traffic status beyond their line of sight and gives them more time to respond to warnings about potential hazards [3][4][5]. As an important application of Transportation CPS, most intelligent transportation systems (ITS) research has proposed algorithms that schedule warnings based on warning utility. With the variety of warnings provided by ITS, it is increasingly important to evaluate warning utility in the design of Transportation CPS.
Warning utility refers to the degree to which a warning enhances user performance through its presence. Shackel [6] defined the utility of a system as whether the system does what is needed functionally. This definition is similar to the definition of system effectiveness in the work of Regan et al. [7], who clarified the definitions of effectiveness, usefulness, and ease of use. In their work, effectiveness was defined as "the question of whether the system works in accordance with its functional description"; perceived usefulness was defined as "the degree to which a person believes that using a particular system will enhance his/her performance"; and ease of use was defined as "the degree to which a person believes that using a particular system would be free of effort." As discussed in [8], acceptance was defined with two dimensions: usefulness and satisfaction. However, further studies suggested that acceptance also contains dimensions such as effectiveness and social influence [7]. Adell [9] categorized different definitions of system acceptance and defined it as "the degree to which an individual intends to use a system and, when available, incorporates the system in his/her driving." Previous studies have indicated that subjective assessments of perceived usefulness do not always reflect improvements in driving performance when using the system [10,11]. Specifically, a warning that comes too late or too early may still be perceived as useful by drivers, even without significantly improving driver safety. Given this difference, our previous work showed that warning utility may serve as a better construct for assessing expected objective benefits [12]. Compared to the definitions of usefulness and acceptability, the definition of utility focuses on user performance rather than user attitude.
To date, much work has been done to subjectively evaluate warnings with respect to different constructs, including acceptance, usefulness, and ease of use. Davis [13] proposed the Technology Acceptance Model and developed a scale to evaluate acceptance with two dimensions: perceived usefulness and ease of use. Segars and Grover [14] examined this scale with a factor analysis and found an additional factor, "effectiveness," besides perceived usefulness and ease of use. Van Der Laan et al. [8] developed a scale to measure acceptance with two dimensions: usefulness and satisfaction. This scale has been applied to estimate usefulness and satisfaction for a variety of systems. Based on the Unified Theory of Acceptance and Use of Technology (UTAUT), Adell [9] developed a questionnaire to measure acceptance with the following dimensions: performance expectancy, effort expectancy, social influence, and behavioural intention to use the system. In terms of subjective assessment of usefulness, some studies used the word "usefulness" directly to measure system usefulness [15,16]. Because the above subjective assessment tools focused on evaluating users' attitudes toward systems, warning utility, with its focus on expected objective benefit (e.g., reduction in accident risk), has seldom been measured by scales. The behavioural approach to assessing the utility of warning systems can be highly task dependent and, at the same time, time consuming and costly. The current study addresses this problem by developing a scale to evaluate verbal warning utility with regard to improving the effectiveness of communication in ITS.
This work focuses mainly on auditory warnings, which are now widely applied in many warning systems. Auditory warnings have an advantage over visual warnings in that human hearing cannot be shut off in the way human vision can. Studies have demonstrated that auditory warnings achieve higher levels of compliance than visual warnings (e.g., Wogalter and Young [17]). Auditory warnings can attract attention regardless of where it is directed, and they are particularly suitable when people work under high workload, especially high visual workload, when the operator has to move about a lot, or when visual conditions are poor.
Among auditory warnings, the current scale development focused mainly on verbal auditory warnings, which are user friendly because users can easily understand and differentiate them without specific training. Compared to verbal auditory warnings, nonverbal auditory warnings have the drawback that their meanings need to be learned, remembered, and recognized at the time that they sound [18]. Research has indicated that humans are quite bad at remembering and recognizing nonverbal auditory warnings. For example, one study showed that people working in an operating room and recovery room in a teaching hospital were unable to recognize more than half of the alarms currently in use [19]. Patterson's study showed that people are able to learn the first few of a set of auditory warnings more or less as fast as they can be presented, but that progress slows beyond six or seven warnings [20]. Moreover, identification of nonverbal auditory warnings in operational situations is likely to be worse than in the laboratory setting since recognition on an absolute basis is likely to be lower than on a relative basis [21]. Previous work found that verbal warnings led to faster reaction times than nonverbal warnings [22], especially in complicated road conditions [23]. Studies also suggested that verbal warnings led to a lower crash rate than nonverbal warnings in intersection collision warning systems [24], especially for older drivers [25]. Recommendations on the use of verbal warnings in safety critical situations are inconsistent. The current ISO working draft [26] suggested that verbal warnings should not be used for safety critical warnings since such warnings take more time to present, whereas Noyes et al. [27] suggested that verbal warnings should be used in safety critical situations since drivers responded more quickly and had a better chance of reacting accurately to verbal warnings than to nonverbal warnings. Generally speaking, verbal auditory warnings can be widely applied without additional training, which makes them especially suitable for transportation assistance systems.
The purpose of the present research was to develop a scale to evaluate the utility (effectiveness) of verbal warnings in order to achieve effective communication in intelligent transportation systems. The evaluation of warning utility is based mainly on two aspects: how well the warning attracts human attention and how well it provides the understandable information that its audience needs [1,28]. After summarizing items that may influence warning utility, the dimensions of the Verbal Warning Utility Scale (VWUS) were selected through a subjective evaluation. After the scale was developed, its reliability was tested with a split-half reliability analysis, and a factor analysis was conducted to explore the structure of the scale. An experimental study was then conducted to test the predictive validity of the scale in a simulated environment. If the verbal warning utility score can predict the safety benefits brought by the designed warnings, this suggests that the VWUS can be applied to evaluate warning effectiveness in improving driver safety.
The validated scale can be utilized to assess the utility of verbal warnings in the intelligent transportation systems instead of performing behavioral experiments.

Development of the Scale
To develop the verbal warning utility scale, we referred to the development process of the NASA Task Load Index (NASA-TLX), which is also a multidimensional scale [29]. The steps in developing the NASA-TLX include exploring factors relevant to the concept being tested (i.e., workload), conducting subjective ratings to select salient factors according to a set criterion, and performing experimental studies to validate the scale.

Method
2.1.1. Participants. One hundred and four students (72 males and 32 females) from the University at Buffalo completed the exploratory questionnaire. Participants were recruited through flyers on campus. All had a valid driver's license and were native English speakers. Participants' ages ranged from 18 to 69 years with an average age of 29.66 years (SD = 12.09). In terms of driving experience, the average time since obtaining a valid US driver's license was 11.11 years (SD = 11.47), and the average mileage was 9355.77 miles (SD = 7037.41).

Material.
Since many warning characteristics may combine to create a subjective warning utility, we reviewed warning characteristics that may influence the effectiveness of warnings. To the best of our knowledge, the following warning characteristics could influence warning effectiveness. Category one concerns the effectiveness of the warning in representing the hazard/task: Representation of Event Urgency [30][31][32], the degree to which the event urgency is represented by a warning; Time to Display [33], the degree to which a warning occurs at a favourable time; Number of Replications [34]; Distinctive Features [35], the degree to which a warning can be differentiated from other warnings in the system. Category two concerns features of the warning itself: Modality [36], the way a warning is presented; Loudness [37], the attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud; Length of the Warning, the number of words in a warning; Voice Type, whether a warning is presented with a female or male voice [38]; Voice Quality [34], the degree to which a warning can be recognized; Rate of Speech, the number of words presented per unit of time. Category three concerns the human cognitive processing of verbal warnings: Alertness [30], the degree to which the user's attention is raised; Loss of Vigilance, the degree to which the ability to maintain attention is impaired over prolonged periods of time; Accuracy of Understanding [39], the degree to which an individual understands the hazard in the presence of a warning; Familiarity [40], the degree to which an individual is familiar with a warning; Annoyance [30], the degree to which a warning annoys an individual; Likeness/Preference, the degree to which an individual prefers the presence of a warning; Comfort, the degree to which an individual feels comfortable in responding to a warning; Trust [41], the degree to which an individual trusts a warning to be reliable; Reduced Workload [37], the degree to which the workload is reduced by the presence of a warning; Forgetting [34], the probability of a warning being forgotten by an individual after its presentation.
A questionnaire was used to explore which items were salient enough to be subjectively equivalent to warning utility. The questionnaire listed the name of each item along with its explanation and asked the participants to identify its relationship with the overall verbal warning utility (i.e., "a": subjectively equivalent, "b": related, and "c": unrelated). An item being subjectively equivalent to the overall verbal warning utility indicates that this item could represent the utility of verbal warnings. An item being related to the overall verbal warning utility indicates that this item contributes to the utility of verbal warnings. An item being unrelated to the overall verbal warning utility indicates that this item does not contribute to the utility of verbal warnings. An example item is "Representation of event urgency: How urgent is the event conveyed by the warning?"
2.1.3. Procedure. The study was conducted at the University at Buffalo. Participants were asked to complete a questionnaire regarding demographic information (e.g., gender, age, and driving history), and the exploration questionnaire was then administered. The participants were asked to imagine a scenario in which a warning was provided to describe an upcoming dangerous event and help them decrease the probability of collision. As shown in Figure 1, an example was presented in which an in-vehicle warning system alerts the driver of the subject vehicle to a hazard caused by a hazard vehicle running the red light from its left. The warning is, for instance, "a vehicle at your front-left is running the red light at the intersection."

Table 1: Mean score of each factor and frequency of the three kinds of relationships between each factor and the warning utility (N = 104).

Results
Three relationship options, namely, subjectively equivalent, related, and unrelated to verbal warning utility, were recorded. The frequency of each relationship for all twenty items is shown in Table 1. A chi-square goodness-of-fit test was conducted for each factor to test whether the frequency of "subjectively equivalent" was significantly higher than the frequency of the other two relationships.
Applying the Bonferroni correction [42], α = .05 was divided by the number of tests (3) to obtain the Bonferroni critical value, so a test would need p < .017 to be significant. Under that criterion, only the tests for Representation of Event Urgency, Time to Display, Alertness, and Accuracy of Understanding showed a significantly higher proportion of equivalency to warning utility than of being related or unrelated to warning utility. In addition, these four items (i.e., Representation of Event Urgency, Time to Display, Alertness, and Accuracy of Understanding) were considered to be subjectively equivalent to warning utility by at least 50% of the participants (frequency of equivalent ≥ 52) [43]. Combining the above results, these four most salient items were confirmed as the items of the Verbal Warning Utility Scale (VWUS). The definitions of the four items are as follows.
(i) Representation of event urgency (item 1): how well is the event urgency conveyed by the warning?
(ii) Time to display (item 2): whether it is an appropriate time to display the warning to users?
(iii) Alertness (item 11): how well does the warning alert users to the dangerous event when they received the warning?
(iv) Accuracy of understanding (item 13): how well does the warning let users understand the event that just happened?
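The item selection rule described above (a chi-square goodness-of-fit test per item, a Bonferroni-adjusted threshold of .05/3, and a majority criterion of at least 52 of 104 responses) can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the response counts are hypothetical, and the null hypothesis of equal expected frequencies across the three options is an assumption, since the text does not state the expected frequencies. With three categories the test has 2 degrees of freedom, for which the upper-tail p-value has the closed form exp(-χ²/2).

```python
import math

def chi_square_gof_equal(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts.

    ASSUMPTION: equal expected frequencies; the paper does not state them.
    """
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def p_value_df2(chi2):
    """Upper-tail p-value for a chi-square statistic with exactly 2 df."""
    return math.exp(-chi2 / 2.0)

ALPHA = 0.05
N_TESTS = 3                          # divisor reported in the text
BONFERRONI_ALPHA = ALPHA / N_TESTS   # about .017

# HYPOTHETICAL counts for one item: (equivalent, related, unrelated), N = 104.
counts = (60, 34, 10)

chi2 = chi_square_gof_equal(counts)
p = p_value_df2(chi2)
significant = p < BONFERRONI_ALPHA   # passes the corrected threshold
majority = counts[0] >= 52           # at least half rated it "equivalent"
keep_item = significant and majority

print(f"chi2 = {chi2:.2f}, p = {p:.2g}, keep item: {keep_item}")
```

An item is retained in this sketch only when both conditions hold, mirroring the way the two criteria are combined in the text.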

Validation Study
An empirical study was conducted to test the correlation between VWUS scores and behavioral measurements of avoiding potential collisions in a simulated driving task, which would lend support to the predictive validity of the VWUS. It was hypothesized that the VWUS scores developed in the present research would significantly predict human performance in avoiding a potential accident when auditory warnings were broadcast.
Warning lead time plays an important role in determining the effectiveness of warnings in Transportation Cyber-Physical Systems [44,45]. The Society of Automotive Engineers (SAE) proposed four levels of time urgency to define a priority order index. The emergency level was defined with lead times ranging from 0 to 3 s, the immediate level with lead times from 3 to 10 s, the near term level with lead times from 10 to 20 s, and the preparatory level with lead times from 20 to 120 s [46]. Previous empirical studies have focused more on the emergency and immediate levels of warning lead time. Yan et al. [47] studied verbal warnings with 7 levels of warning lead time (i.e., 2.5 s, 3 s, 3.5 s, 4 s, 4.5 s, 5 s, and 5.5 s). Results suggested that early warnings led to shorter brake reaction times, lower crash rates, and smaller decelerations, indicating more timely and gradual responses to avoid the hazards compared with late warnings. Yan et al. [48] explored the impact of warning lead time (3 s versus 5 s) and warning content (with or without direction information) on driving performance. Zhang et al. [49] also investigated the impact of directional and nondirectional verbal warnings on driving behavior at a warning timing (lead time to collision) of 7 s. In order to measure verbal warning utility comprehensively, the current study applied a wider range of lead times, including extremely short and long lead times.
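The four urgency levels quoted above can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and because the quoted ranges share their end points (3 s, 10 s, 20 s), assigning each boundary value to the more urgent level is an assumption rather than something the text specifies.

```python
def urgency_level(lead_time_s):
    """Map a warning lead time in seconds to one of the four urgency levels.

    ASSUMPTION: shared boundaries (3, 10, 20 s) go to the more urgent level.
    """
    if lead_time_s < 0 or lead_time_s > 120:
        raise ValueError("lead time outside the 0-120 s range covered by the levels")
    if lead_time_s <= 3:
        return "emergency"
    if lead_time_s <= 10:
        return "immediate"
    if lead_time_s <= 20:
        return "near term"
    return "preparatory"

# The seven lead times studied by Yan et al. [47], as listed in the text.
yan_lead_times = [2.5, 3, 3.5, 4, 4.5, 5, 5.5]
levels = {t: urgency_level(t) for t in yan_lead_times}
print(levels)
```

Under this boundary convention, all seven lead times fall in the emergency or immediate levels, which is consistent with the observation that earlier studies concentrated on those two levels.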
The behavioral measurements quantifying the effectiveness of verbal warnings in the current study were collision and impact reduction. "Collision" was coded as a dichotomous variable specifying whether there was a collision between the subject's vehicle and the hazard vehicle (collision: 1 and collision avoided: 0). The reduced kinetic energy of the subject's vehicle specified the impact reduction brought about by the warnings. Because the mass of a vehicle can vary in reality, we studied the reduced kinetic energy of a vehicle with unit mass.
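The two behavioral measures can be illustrated with a short sketch. It assumes that the reduced kinetic energy per unit mass is the difference between the kinetic energy at the vehicle's speed before the response and at its speed afterwards, i.e. ½(v_before² − v_after²); the text does not specify the two reference points, so the function and variable names here are hypothetical.

```python
MPH_TO_MS = 0.44704  # exact conversion factor from miles per hour to metres per second

def reduced_kinetic_energy_per_unit_mass(speed_before_ms, speed_after_ms):
    """Kinetic energy shed per unit mass, in J/kg.

    ASSUMPTION: computed as 0.5 * (v_before^2 - v_after^2); the exact
    reference speeds used in the study are not stated in the text.
    """
    return 0.5 * (speed_before_ms ** 2 - speed_after_ms ** 2)

def code_collision(collided):
    """Dichotomous coding used in the text: collision = 1, avoided = 0."""
    return 1 if collided else 0

# HYPOTHETICAL trial: the driver slows from 45 mile/h to 20 mile/h and avoids the hazard.
before = 45 * MPH_TO_MS
after = 20 * MPH_TO_MS
energy = reduced_kinetic_energy_per_unit_mass(before, after)
print(f"reduced kinetic energy: {energy:.1f} J/kg, collision = {code_collision(False)}")
```

Working per unit mass keeps the measure comparable across vehicles, which is the motivation given in the text for dropping the mass term.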

Method
3.1.1. Participants. Thirty-two participants (24 males and 8 females) took part in the current experiment, with ages ranging from 18 to 26 years (mean = 21.13, SD = 2.54). In terms of driving experience, the average time since obtaining a valid US driver's license was 3.5 years (SD = 2.36) and the average mileage was 8343.75 miles (SD = 6438.0). All participants had normal or corrected-to-normal vision and valid driver's licenses and were asked to sign an informed consent form before the experiment. Each driver was paid $20 for the time taken to complete the experiment.

Apparatus.
The driving task was completed using a STISIM driving simulator (STISIMDRIVE M100K, Systems Technology Inc., Hawthorne, CA). The driving simulator consists of a Logitech Momo steering wheel with force feedback, a gas pedal, and a brake pedal (Logitech Inc., Fremont, CA). The resting position of the throttle pedal is 38.2° (the angle between the pedal surface and the ground) and the maximal throttle input is 15.2°. For the brake pedal, the resting position is 60.1° and the maximal brake input is 28.6°. The STISIM simulator was installed on a Dell Workstation (Precision 490, Dual Core Intel Xeon Processor 5130 2 GHz) with a 256 MB PCIe × 16 nVidia graphics card, Sound Blaster X-Fi system, and Dell A225 Stereo System. The driving scenario was presented on a 27-inch LCD with a resolution of 1920 × 1200 pixels.

Material.
The Verbal Warning Utility Scale (VWUS) consisted of a rating scale and a weighting scale (see Appendix 1 in the Supplementary Materials available online at http://dx.doi.org/10.1155/2015/126947). The rating scale comprised ratings of four items (Representation of Event Urgency, Time to Display, Alertness, and Accuracy of Understanding) and a question regarding the overall utility of the warning. Participants were first asked how useful the warning heard in the experiment was in helping them avoid the hazard by rating each factor on a ten-point scale (e.g., "0": not well at all to "9": extremely well). An example item is "How well did the warning represent the event urgency?" At the end of the rating scale, the participants were also asked about the overall utility of the warning using the item "In general, how useful was the warning?" In order to obtain more accurate ratings, a weighting scale was introduced. The weights obtained from the weighting scale account for two potential sources of variability in the rating of verbal warning utility: differences between subjects in the definition of verbal warning utility for the same warning, and differences between warnings in the sources of verbal warning utility. After finishing the rating scale, subjects were asked to compare each pair of properties of the warning they had just rated and, using the weighting scale, select the one that contributed more to the warning utility. By comparing each dimension with the other three dimensions, each dimension received three weighting scores ("0": contributes less to the warning utility and "1": contributes more to the warning utility). The weighted score for each dimension was calculated as the average of these three scores. The total warning utility score was calculated by summing, over the four dimensions, the product of the rating score and the weighted score.
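The scoring procedure can be made concrete with a small sketch: each of the six pairwise comparisons awards 1 to the chosen dimension and 0 to the other, a dimension's weight is the mean of its three comparison scores, and the total is the sum of rating × weight. The dimension labels, function names, and example responses below are hypothetical and only illustrate the arithmetic described in the text.

```python
from itertools import combinations

# Short labels for the four VWUS items (labels are illustrative).
DIMENSIONS = ("urgency", "timing", "alertness", "understanding")

def weights_from_pairs(winners):
    """Turn pairwise choices into weights.

    `winners` maps frozenset({a, b}) to the dimension chosen as contributing
    more. Each dimension takes part in three comparisons, so its weight is
    the number of comparisons it won divided by 3.
    """
    wins = {d: 0 for d in DIMENSIONS}
    for a, b in combinations(DIMENSIONS, 2):
        winner = winners[frozenset((a, b))]
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not part of the pair ({a}, {b})")
        wins[winner] += 1
    return {d: wins[d] / 3 for d in DIMENSIONS}

def vwus_total(ratings, weights):
    """Total utility score: sum over dimensions of rating (0-9) times weight."""
    return sum(ratings[d] * weights[d] for d in DIMENSIONS)

# HYPOTHETICAL responses from one participant for one warning.
ratings = {"urgency": 8, "timing": 6, "alertness": 9, "understanding": 7}
winners = {
    frozenset(("urgency", "timing")): "urgency",
    frozenset(("urgency", "alertness")): "alertness",
    frozenset(("urgency", "understanding")): "urgency",
    frozenset(("timing", "alertness")): "alertness",
    frozenset(("timing", "understanding")): "timing",
    frozenset(("alertness", "understanding")): "alertness",
}

weights = weights_from_pairs(winners)
total = vwus_total(ratings, weights)
print(weights, round(total, 2))
```

Because six comparisons are shared among four dimensions, the weights always sum to 2 under this scheme, so with ratings from 0 to 9 the total score ranges from 0 to 18.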

Experiment Design.
The current experiment adopted a single-factor within-subject design with lead time as the independent variable. Sixteen levels of lead time were used in the experiment (0 s, 1 s, 1.5 s, 2 s, 2.5 s, 3 s, 3.5 s, 4 s, 4.5 s, 5 s, 6 s, 8 s, 10 s, 15 s, 30 s, and 60 s). The lead time indicated the time to collision if the participant's vehicle and the hazard continued with their current relative position, velocity, and acceleration. Such a wide range of lead times was chosen because current intelligent transportation systems enable drivers to acquire information about a potential hazard through warnings well in advance.
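The definition of lead time as a time to collision under constant relative motion can be sketched with the standard constant-acceleration kinematic relation gap = v·t + ½·a·t². This is an illustrative reconstruction under that assumption, not the simulator's actual trigger logic; the function name and the example numbers are hypothetical.

```python
import math

def time_to_collision(gap_m, closing_speed_ms, closing_accel_ms2=0.0):
    """Earliest t >= 0 solving gap = v*t + 0.5*a*t**2, or math.inf if none.

    ASSUMPTION: relative speed and acceleration stay constant, matching the
    definition of lead time given in the text.
    """
    if gap_m <= 0:
        return 0.0
    v, a = closing_speed_ms, closing_accel_ms2
    if abs(a) < 1e-12:
        # No relative acceleration: simple distance over closing speed.
        return gap_m / v if v > 0 else math.inf
    disc = v * v + 2.0 * a * gap_m
    if disc < 0:
        # The gap stops closing before it reaches zero.
        return math.inf
    t = (-v + math.sqrt(disc)) / a
    return t if t > 0 else math.inf

def should_warn(gap_m, closing_speed_ms, lead_time_s, closing_accel_ms2=0.0):
    """Hypothetical trigger: warn once the time to collision drops to the lead time."""
    return time_to_collision(gap_m, closing_speed_ms, closing_accel_ms2) <= lead_time_s

# HYPOTHETICAL example: a 100 m gap closing at 20 m/s gives 5 s to collision.
print(time_to_collision(100.0, 20.0))
print(should_warn(100.0, 20.0, lead_time_s=5.0))
```

A warning with a 5 s lead time would therefore be issued, in this sketch, when the remaining gap at 20 m/s closing speed has shrunk to 100 m.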
Each participant went through sixteen trials, one for each of the sixteen levels of lead time, each of which involved a hazard leading to a potential collision and its corresponding warning. The orders of the 16 lead times and the 16 collision events were randomized. As shown in Figure 2, sixteen collision scenarios along with their warnings were designed and programmed in the driving simulator to represent potential collision events in reality. In the experiment, the participant's view of the potential hazard was blocked on purpose so that they could rely only on the auditory warning to learn about the upcoming collision event until the last moment (i.e., they were not able to see the hazard until it was too late to avoid it successfully, even with full braking, once they had confirmed the hazard visually). The behavioural measurements in this case indicated the safety benefits brought by the warnings without being confounded by participants' own judgement of the hazards. In addition, nonwarning messages (e.g., news and weather forecasts) were randomly presented during the experiment to prevent participants from simply braking whenever a message was played. We could therefore assume that drivers' responses to the verbal warnings were based on their understanding of the warning content. The nonwarning messages were designed carefully so as not to interfere with the warnings. They served as the kind of distraction sources found in real driving situations.
3.1.5. Procedure. Upon arrival at the laboratory, participants were asked to sign an informed consent form and then complete the demographic questionnaire. In each trial of the experiment, they experienced one potential collision scenario with its warning broadcast to them. After each trial of the test block, they were asked to complete the Verbal Warning Utility Scale (VWUS) to evaluate the utility of the verbal warning in helping them avoid the corresponding collision event.
Before the experiment, participants were trained by completing a practice block to become familiar with the operation of the driving simulator and the driving environment. During the practice block, participants were asked to drive a four-mile distance with two randomly selected collision events (with corresponding verbal warnings) and five nonwarning messages broadcast. They were informed that there would be collision events with corresponding verbal warnings and that they could respond based on their own driving experience. Subjects were instructed to drive in the inside lane and were informed that messages would be broadcast during the driving task. The scenario in the practice block was designed to be similar to the one in the test block. Following the practice session, participants completed the test block comprising sixteen trials in a local environment with 2 lanes in each direction. Before the formal experiment, all participants were advised to adjust the seat until they felt comfortable and their feet could come into full contact with the surface of the pedals. In the formal experiment, all participants were required to follow normal traffic laws and try to keep their speed at 45 mile/h. They were advised to adjust their speed if it was lower than 40 mile/h or higher than 50 mile/h when there was no warning, turn, stop sign, or red light.

Descriptive Analysis.
In total, we obtained 512 VWUS total scores of warnings (32 participants × 16 warnings each). Table 2 presents descriptive statistics and the associations between the VWUS and demographic variables, expressed as correlations between its total scores and variables such as age, gender, annual mileage, and license year. None of the Spearman correlations between VWUS total scores and demographic characteristics was significant (p > .05), and all coefficients were low (< .06), suggesting that the criteria for evaluating warning utility are general across these demographic characteristics. The reduced kinetic energy had a significant correlation with VWUS total scores.

Split-Half Reliability.
Since the VWUS had only one question for each item, we were unable to conduct an internal consistency analysis for the scale. In order to test its reliability, a split-half reliability analysis was conducted for the 32 participants who completed the VWUS during the experiment. Cronbach's α for the two halves was .762 and .777, respectively, and the split-half reliability of the VWUS was high, with a Spearman-Brown Coefficient of .873.
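For readers unfamiliar with the split-half procedure, the sketch below shows the standard Spearman-Brown step-up formula, ρ = 2r / (1 + r), where r is the correlation between scores on the two halves. How the halves were formed in this study is not described in the text, so the half scores below are hypothetical and only the formula itself is standard.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_brown(r_half):
    """Step a half-test correlation up to full-length reliability."""
    return 2.0 * r_half / (1.0 + r_half)

def implied_half_correlation(rho_full):
    """Invert the formula: half-test correlation implied by a full-length coefficient."""
    return rho_full / (2.0 - rho_full)

# HYPOTHETICAL half scores for six respondents (not data from the study).
half_a = [10.0, 12.5, 8.0, 15.0, 11.0, 9.5]
half_b = [11.0, 13.0, 9.0, 14.0, 10.0, 10.5]

r = pearson_r(half_a, half_b)
print(f"half correlation = {r:.3f}, Spearman-Brown = {spearman_brown(r):.3f}")
print(f"half correlation implied by .873: {implied_half_correlation(0.873):.3f}")
```

Read backwards, the formula implies that the reported coefficient of .873 corresponds to a correlation of about .77 between the two halves.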

Factor Analysis.
A factor analysis with varimax rotation was conducted on the 4 VWUS items. Bartlett's test of sphericity, χ²(6) = 841.37, p < .001, indicated that correlations between items were sufficiently large for the analysis. An initial analysis was run to obtain eigenvalues for each component in the data. Two components had eigenvalues over 1.0, meeting Kaiser's criterion [50], and in combination explained 73.22% of the variance. The 2-factor extraction accounted for 73.22% of the variance with communalities ranging from 0.51 to 0.97 and 1 item cross-loading. The 1-factor extraction accounted for 41.51% of the variance with communalities ranging from 0.60 to 0.84, and thus explained less of the variance. The 3-factor extraction accounted for 97.68% of the variance with communalities ranging from 0.51 to 0.97 and 3 item cross-loadings. Based on Osborne and Waters's [51] criteria for factor extraction, the 2-factor solution was selected because it had the fewest item cross-loadings and explained a relatively high proportion of the variance. Table 3 shows the initial eigenvalues and the variance explained by each factor. A summary of the loading values of each item in the rotated factor matrix is also provided, with loading values less than .30 suppressed [51]. By examining the item loadings on these factors, specific themes were defined based on the content of the items within each factor. The first factor, defined as "Accuracy of Hazard Representation," contained 3 items: "Representation of Event Urgency," "Time to Display," and "Accuracy of Understanding." The second factor contained the item Alertness.

Criterion Validity.
Figure 3 illustrates a significant correlation between the VWUS total utility score (calculated from each item's rating and weighting scores in the VWUS) and the overall utility score obtained directly from the participants (r = .851, p < .001). This suggests that the total utility score calculated using the VWUS is a good fit for the overall utility score in assessing warning utility in general.
The intercorrelations among the scores of each factor and the overall utility are reported as Spearman correlation coefficients (r_s) since the assumption of normal distribution was violated. As shown in Table 4, the overall utility score was significantly correlated with the scores of the factor "Accuracy of Hazard Representation" and the factor "Alertness." The intercorrelation between the two factors was not significant (r_s = .07, p = .12). Among the items of the factor "Accuracy of Hazard Representation," the intercorrelation between Representation of Event Urgency and Time to Display was r_s = −.35 (p < .01), that between Representation of Event Urgency and Accuracy of Understanding was r_s = −.15 (p < .01), and that between Time to Display and Accuracy of Understanding was r_s = −.45 (p < .01).
Two sequential linear regression analyses were conducted to determine whether the VWUS could significantly predict warning utility using a criterion validity measure (i.e., reduced kinetic energy) obtained from the behavioural experiment. This behavioural measurement served as an indicator of the safety benefits brought by the warnings. The collision rate was defined as the proportion of drivers involved in a collision among all drivers at each level of lead time. Gender, age, and license year were entered at Step 1; order of lead time, scenario type, and warning lead time were entered at Step 2; and either the VWUS total scores of each warning (N = 502) or the scores of the two factors (Accuracy of Hazard Representation and Alertness) were entered at Step 3 in the respective analysis.
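The collision rate defined above (the proportion of drivers who collided at each level of lead time) amounts to a grouped mean of the 0/1 collision codes. The following sketch illustrates the computation on hypothetical trial records; the record layout and function name are assumptions, not the study's data format.

```python
from collections import defaultdict

def collision_rate_by_lead_time(trials):
    """Proportion of collisions per lead time.

    `trials` is an iterable of (lead_time_s, collided) pairs, with collided
    coded 1 for a collision and 0 for an avoided collision.
    """
    counts = defaultdict(lambda: [0, 0])  # lead time -> [collisions, trials]
    for lead_time_s, collided in trials:
        counts[lead_time_s][0] += collided
        counts[lead_time_s][1] += 1
    return {lead: c / n for lead, (c, n) in sorted(counts.items())}

# HYPOTHETICAL records for four drivers at three lead times.
trials = [
    (0, 1), (0, 1), (0, 1), (0, 0),
    (3, 1), (3, 0), (3, 0), (3, 0),
    (10, 0), (10, 0), (10, 0), (10, 0),
]

rates = collision_rate_by_lead_time(trials)
print(rates)
```

In the experiment each lead-time level would contain one record per participant, so each rate would be a proportion out of the number of drivers tested at that level.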
The results of the first model, which examined the VWUS total score in predicting the safety benefits brought by the verbal warnings, are presented in Table 5. The results indicated that the VWUS total score, calculated from the rating score and weighting score of each dimension, significantly predicted reduced kinetic energy ( = 15.50, p < .001). Specifically, more kinetic energy was reduced when drivers responded to warnings with higher VWUS total scores. The results of the second model, with the scores of the two VWUS factors, suggested that the accuracy of hazard representation significantly predicted the reduced kinetic energy ( = −2.36, p < .05). This suggests that the more accurately the warning described the hazard, the more kinetic energy was reduced as a result of the braking response to the warning. Both results support the validity of the VWUS in predicting the effectiveness of verbal warnings.
A logistic regression analysis was then conducted to determine whether the VWUS could significantly predict warning utility using collision as the criterion validity measure. Collision (collision avoided: 0; collision: 1) served as another indicator of the safety benefits brought by the warnings. Gender, age, and license year were entered at Step 1; order of lead time, scenario type, and warning lead time were entered at Step 2; and the VWUS total scores of each warning (n = 512) were entered at Step 3. Another logistic regression analysis was performed with the scores of the two VWUS factors entered at Step 3. However, the test of this two-factor model against a constant-only model was not statistically significant (χ²(8) = 12.53, p = .13). Therefore, only the model with the VWUS total score as a predictor is reported. The results of the logistic regression analysis examining the VWUS total score in predicting the safety benefits brought by the verbal warnings are presented in Table 6.
The test of the first logistic regression model against a constant-only model was statistically significant, indicating that the predictors as a set reliably predicted the probability of collision (χ²(8) = 15.62, p < .05). Nagelkerke's R² of .50 indicated a moderate relationship between prediction and grouping. The model correctly predicted the collision outcome in 80% of cases. The Wald criterion demonstrated that the VWUS total score made a significant contribution to prediction (p < .001). Age and warning lead time also significantly predicted collision. In particular, warnings with higher VWUS total scores were associated with a significantly lower probability of collision.

Scale Sensitivity.
The scale sensitivity was tested by comparing two ANCOVA tests with warning lead time as the independent variable, scenario type as the covariate, and scale scores and reduced kinetic energy as the respective dependent variables. The results indicated significant main effects of lead time on scale scores (F(15, 490) = 21.68, p < .001, partial η² = .40) and on reduced kinetic energy (F(15, 490) = 28.49, p < .001, partial η² = .47) after controlling for the effect of scenario type. As shown in Figure 4, the subjective warning utility (i.e., VWUS total score) and the objective warning utility (i.e., reduced kinetic energy) followed a fairly similar pattern as the lead time changed. Specifically, the post hoc tests indicated that both the scale scores on warning utility and the reduced kinetic energy differed significantly (p < .05) between the following pairs of lead time levels: lead time of 0 s and lead time levels ranging from 1 s to 60 s; lead time of 1 s and lead time levels ranging from 2.5 s to 60 s; lead time of 1.5 s and lead time levels ranging from 3 s to 30 s; lead time of 2 s and lead time levels ranging from 3 s to 30 s; and lead time of 2.5 s and lead time levels ranging from 4 s to 8 s. The only discrepancy between the two sets of post hoc tests was that the scale scores differed between the lead time of 60 s and lead time levels ranging from 3 s to 30 s, whereas the reduced kinetic energy differed between the lead time of 60 s and lead time levels ranging from 4 s to 30 s.
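The reported effect sizes can be cross-checked against the F statistics, because partial η² for an effect equals F·df_effect / (F·df_effect + df_error). The short sketch below applies this standard identity to the two reported tests; the function and variable names are illustrative, not part of the original analysis.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    ss_ratio = f_value * df_effect  # equals SS_effect / MS_error
    return ss_ratio / (ss_ratio + df_error)


# F statistics and degrees of freedom as reported for the two ANCOVA tests.
scale_effect = partial_eta_squared(21.68, 15, 490)
energy_effect = partial_eta_squared(28.49, 15, 490)
print(round(scale_effect, 2), round(energy_effect, 2))
```

Both values round to the effect sizes reported above (.40 and .47), so the F statistics, degrees of freedom, and partial η² values are mutually consistent.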

Discussion
Drawing on the development process of NASA-TLX, we developed the Verbal Warning Utility Scale (VWUS) to evaluate warning utility in general. The VWUS consists of two parts: ratings of the total utility on four items (Representation of Event Urgency, Time to Display, Alertness, and Accuracy of Understanding) and weightings of these items, obtained by comparing them pairwise according to their perceived importance. The reliability analysis indicated good split-half reliability for the VWUS. As in NASA-TLX, each VWUS item had only one rating question, so the internal consistency of the VWUS could not be tested. However, scales with multiple questions for each dimension may allow the memory of the warning to decay during the evaluation process, resulting in less accurate or less reliable ratings. Therefore, from the application perspective, the VWUS, with a single rating question per dimension, provides a simple way to evaluate warning effectiveness before much memory decay takes place.
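The two-part structure can be illustrated with a NASA-TLX-style calculation, in which each item's weight is the number of pairwise comparisons it wins and the total is the weight-averaged rating. This is a sketch under assumptions: the exact VWUS scoring formula is not restated in this section, and the rating range, the choices, and all function names below are hypothetical.

```python
from itertools import combinations

ITEMS = [
    "Representation of Event Urgency",
    "Time to Display",
    "Alertness",
    "Accuracy of Understanding",
]


def pairwise_weights(choices):
    """Count how often each item was judged the more important of a pair.

    `choices` maps each unordered pair (a frozenset of two item names)
    to the item the respondent picked.
    """
    weights = {item: 0 for item in ITEMS}
    for pair in combinations(ITEMS, 2):  # 6 pairs for 4 items
        weights[choices[frozenset(pair)]] += 1
    return weights


def weighted_total(ratings, weights):
    """Weight-averaged rating, assuming NASA-TLX-style aggregation."""
    n_pairs = sum(weights.values())
    return sum(ratings[item] * weights[item] for item in ITEMS) / n_pairs


# Hypothetical responses from one participant (assumed 0-100 rating range).
U, T, A, C = ITEMS
choices = {
    frozenset((U, T)): U, frozenset((U, A)): U, frozenset((U, C)): U,
    frozenset((T, A)): T, frozenset((T, C)): T, frozenset((A, C)): A,
}
ratings = {U: 80, T: 60, A: 70, C: 90}
weights = pairwise_weights(choices)
total = weighted_total(ratings, weights)
```

With these hypothetical responses the weights are 3, 2, 1, and 0, so an item that never wins a comparison does not contribute to the total, which is a known property of this kind of pairwise weighting.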
The results from the validation study indicated that the Verbal Warning Utility Scale (VWUS) significantly predicted warning utility in terms of objective safety benefits. A higher VWUS score indicated a warning with higher utility, which was more likely to reduce the probability of potential risk during driving tasks. The linear and logistic regression analyses using the VWUS total score suggested that it was a significant predictor of the objective warning utility indicators obtained from the experiment, namely the reduced kinetic energy and the probability of collision. More specifically, a higher VWUS score indicated that more kinetic energy was reduced and a lower collision rate was achieved in the presence of the warning, which in turn suggested a higher utility of the warning. The relatively low R² reported in the second linear regression model, with the two VWUS factors as predictors, may result from the fact that the accuracy of hazard representation was the only significant predictor of warning utility in the current validation experiment (see model 2 in Table 5). One reason for this result may be that the independent variable (lead time) manipulated in the current validation experiment mainly addressed the accuracy of hazard representation. Further validation studies that manipulate warning alertness levels need to be performed.
Research on warning design to improve the safe communication of various systems using verbal warnings has increased significantly. To achieve a safe and successful interface between the human and the system, issues regarding warning effectiveness have drawn researchers' attention. To date, much work has been done to subjectively evaluate warnings on different constructs, including acceptance, usefulness, and ease of use [8,9,13]. However, previous studies have indicated that subjective assessments of acceptance and perceived usefulness did not always reflect improvements in driving performance when using the system [10,11]. Building on these previous works, the scale developed in the current study aims to evaluate the effectiveness of verbal warnings with a focus on driver performance. The establishment of the Verbal Warning Utility Scale enables us to assess whether the verbal warnings designed into a system are effective in improving driver performance, rather than only capturing drivers' subjective opinions of the warnings.
The current development of the VWUS is mainly based on verbal warnings. Since all the dimensions chosen in the current version of the scale are quite general, the scale may also be suitable for examining the utility of nonverbal warnings. Compared to nonverbal warnings, verbal warnings are much easier to identify, understand, and remember. However, verbal warnings may take more time to respond to, which makes nonverbal warnings a better choice if the situation is urgent or if the user is trained to use the warning system. In the warning design of a system, both types of warnings can be chosen to alert users to potential risks. The current validation experiment has indicated the validity of the scale when applied to verbal warnings. Further work could therefore validate its effectiveness in testing the utility of nonverbal warnings in order to broaden its applicability. In addition, the current validation study supported the validity of the scale only for evaluating warnings in transportation systems. To generalize the testing power of the VWUS, further validation studies are also necessary to examine the correlations between the VWUS and human performance in avoiding potential hazards with different warnings presented in diverse domains.
The scale sensitivity was tested by comparing two ANCOVA tests on scale scores and reduced kinetic energy, respectively. The subjective warning utility (i.e., VWUS total score) and the objective warning utility (i.e., reduced kinetic energy) followed a fairly similar pattern as the lead time changed. The results from the sensitivity analysis suggested that the VWUS was sensitive to changes in warning effectiveness produced by a change in warning timing of as little as 1 s. Previous studies of warning timing indicated that a change in warning timing of 1.5 s for nonverbal warnings [52] and 2.5 s for verbal warnings [47] was enough to change drivers' collision avoidance performance. Compared to this previous work, the scale developed in the current work showed adequate sensitivity to changes in warning effectiveness.

Conclusion
The validation study suggests that the Verbal Warning Utility Scale (VWUS) developed in the current work is an effective tool for evaluating the utility of verbal warnings, with relatively good reliability and validity. Compared to previous works using behavioural experiments, the scale provides a simpler way to assess the effectiveness of verbal warnings in intelligent transportation systems. The scale can be further applied to evaluate the design of warnings in order to improve communication in intelligent transportation systems.

Figure 1: The example of the test scenario.

Figure 2: The designed scenarios and verbal warnings.

Figure 3: The correlation between the VWUS total scores and the overall utility scores (error bar: ±1 SE).

Figure 4: The changes of VWUS total score and objective warning utility as a function of lead time (error bar: ±1 SE).

Table 2: Mean and standard deviation for demographic measures and Spearman correlation coefficient.

Table 3: Loading values on items of VWUS.

Table 4: Intercorrelation of scores of VWUS factors and the overall utility.

Table 6: Logistic regression results for the prediction of collision in the VWUS scores.