Correlation Analysis of External Environment Risk Factors for High-Speed Railway Derailment Based on Unstructured Data



Introduction
The occurrence of a high-speed railway derailment accident may result in severe financial and human losses. Such accidents have significant disaster characteristics and strong nonlinear characteristics and pose a challenge to high-speed railway safety management. Technological advances have helped to mitigate the internal factors behind railway derailment, but the external factors remain an underexplored area of research.
Advances in computer technology have benefited large-scale numerical calculation by enhancing operation speed, storage capacity, and operation scale. Big data analysis, furthermore, has been building on progress in accuracy, quality, and reliability and has become a major area of academic interest in the context of high-speed railway safety.
To minimize risks that may lead to unsafe events, the railway corporation has built a reporting system to record such events, and the system has generated a large amount of data. Figure 1 exemplifies a typical record of an unsafe event.
However, the records of unsafe events, which are currently described in natural language as shown above, cannot readily be incorporated into a digital database, because such a database cannot be established without a consistent description standard. At the same time, the railway infrastructure, rolling stock, and other equipment are diverse and complicated, and the data sources are complex, involving many specialties, such as track maintenance, power supply equipment maintenance, signal and communication equipment maintenance, EMU maintenance, passenger transportation, and external environment. The database covers a wide range of faults: fixed equipment faults, mobile equipment faults, perimeter intrusion events, and others. A sizable amount of data has been gathered. In practice, the information needs to be retrieved, read, and updated manually, which costs a huge amount of human and material resources; the processing efficiency is considerably low, resulting in a low utilization rate of railway unsafe event data.
To meet the real-time and accuracy requirements of railway safety management and risk analysis, it is urgent to carry out an association analysis of railway risk factors based on unstructured data, while taking into account the different data structures, different sources, and scattered and independent records of railway unsafe events. Through this research, the heterogeneous data sources could be integrated, the heterogeneity of data could be eliminated, and an accurate association between data and railway risk could be achieved. Therefore, this paper takes the external environment risk of high-speed railway derailment as the research focus and carries out a correlation analysis of external environment risk factors of high-speed railway derailment based on unstructured data. The study seeks to identify the risk factors of high-speed railway derailment and select appropriate models to process unstructured data, including data collection, data cleaning, data dictionary construction, data extraction, data storage, and other steps. Accordingly, the association between data and both the possibility of risk occurrence and the severity of consequences will be realized. The paper then moves on to analyzing the high-speed railway derailment risk factors and the effective extraction of unstructured data.
In recent years, with the breakthrough of big data and artificial intelligence technology, a number of investigations have been carried out regarding the railway safety index, unstructured data analysis, and multisource heterogeneous railway safety data identification and extraction. In terms of the railway safety index, since 2015, the International Union of Railways (UIC) has been building the global safety index (GSI) [1]. Based on safety data and accident information, it evaluates the safety level of railways in Europe and in some Asian and Middle Eastern countries and regions and analyzes the statistical data of safety accidents, the impact of accidents, and the safety level and development trend. Zhao et al. [2] established a railway accident index by measuring the occurrence frequency and consequence severity of railway accidents, which is used to evaluate the overall situation of China's railway safety. In the aspect of unstructured data analysis and mining, Zhang [3] built an unstructured data analysis platform based on report documents with Chinese word segmentation technology, an unstructured data extraction method, pattern matching, and other methods. Zhu et al. [4] put forward a new HGD tree index technology and a new partition method, which uses the probability density function to partition data and improve the speed of data access, and gave a solution based on the optimization operation method. Wang et al. [5,6] analyzed and studied the safety data of dangerous goods transportation based on the data mining method. In the aspect of multisource heterogeneous railway safety data extraction and data analysis, Wang et al. [7] conducted a quantitative analysis of railway derailment and the change of accident rate based on American railway safety data. Lin et al. [8] analyzed the data of American trunk line passenger trains and quantitatively analyzed the causes of passenger train accidents. Liu et al. [9] analyzed the causes of major train derailment and their effect on accident rates. Turla et al. [10] analyzed the freight train collision risk in the United States. Li [11] recognized and extracted the fault features of high-speed railway equipment by establishing the +bilstm and +CRF method for character representation and the +transformer method for word segmentation representation. Zhou and Li [12] established a method of fault data feature recognition and extraction for railway signal equipment based on MCNN.
To sum up, previous studies on high-speed railway risks mostly employ the expert evaluation method, which is arguably based on subjective deliberation and may undermine the research validity. Besides, past investigations mainly focus on feature recognition, extraction for specific structures of safety data, and the processing of unstructured data. As such, research on the correlation analysis between safety data and risk is still lacking. This study proposes a data-driven risk judgment method to facilitate accurate association between data and railway risk, with the implication that the proposed model can contribute to improving the feasibility and accuracy of risk judgment.

Analysis of External Environment Risk Associated Factors
To explore the derailment mechanism of high-speed railway, this paper establishes a dynamic derailment-related element model of high-speed railway. As shown in Figure 2, the derailment of high-speed railway is mainly related to the EMU subsystem and the line subsystem, and the coupling relationship between wheel and rail has an important impact on the derailment of high-speed railway. In addition, external environmental factors such as natural geology and perimeter intrusion factors also play an important role in the derailment of high-speed railway. The external environmental safety for high-speed railway derailment has been widely investigated and is the focus of the current study [13].
Figure 1: A typical record of an unsafe event. "At 10:17 on April 18, 2021, at 191 km 188 m on XX line of XX railway corporation, the mountain watcher inspected and found that the mountain on the right side of 191 km 201 m was cracked and that about 3 cubic meters of rockfall had intruded into the clearance (the largest piece measured 1.1 × 0.9 × 1.2 m). The mountain watcher immediately informed the relevant personnel to block the line. At 10:21, the line was blocked. After being handled by the track maintenance staff, the line was opened at 11:17. The incident affected the train traffic and caused train no. XXX to change its parking place. The rockfall occurred at 191 km 201 m; this site is a straight section of the line, a half-dike and half-cutting section, and the right side is 4 meters away from the track center of the line."
Based on the above analysis, the paper puts forward the framework model of risk associated external environmental factors for high-speed railway derailment, which fall into two categories: natural and geological factors (natural hazard factors and geological hazard factors) and perimeter intrusion factors (perimeter intrusion of animals, perimeter intrusion of objects and plants, and perimeter intrusion of people), as shown in Figure 3. The classification is detailed in Table 1 [14][15][16][17].

Unstructured Data Mining Method
The raw data are obtained from the database of a railway company that has kept records of as many as 15,000 past unsafe events.

General Analysis.
Based on a scientific analysis of the unsafe event data, the study combines regular expression and pattern matching technology and establishes a matching model between the external environmental factors of high-speed railway derailment risk and the associated unsafe events. This paper analyzes and mines the relationship between the external environmental factors of high-speed railway derailment and the unsafe events and automatically, quickly, and accurately extracts key characteristic information, such as the possibility of risk occurrence and the severity of the consequences, so as to transform unstructured data into structured information.
The main data analysis and mining process is shown in Figure 4. The process includes (1) collecting unstructured railway safety data, (2) splitting and matching keywords, and (3) mining association rules and analyzing the association degree. Through the process, an accurate association between data and railway risk could be acquired [18][19][20].
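The regular-expression matching step can be illustrated with a minimal sketch. It assumes a record worded like the Figure 1 example; the patterns, the keyword-to-factor mapping, and the output field names are hypothetical and are not taken from the matching model of this paper.

```python
import re
from datetime import datetime

# Record text adapted from the Figure 1 example (shortened for illustration).
RECORD = (
    "10:17 on April 18, 2021, at 191 km 188 m on XX line: about 3 cubic meters "
    "of rockfall intruded into the clearance. At 10:21, the line was blocked. "
    "The line was opened at 11:17."
)

TIME_RE = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")   # clock times such as 10:17
KM_POST_RE = re.compile(r"(\d+)\s*km\s*(\d+)\s*m\b")      # kilometre post such as 191 km 188 m

# Hypothetical mapping from a risk factor name to the keywords that indicate it.
FACTOR_PATTERNS = {
    "rockfall": re.compile(r"rockfall|falling rocks?", re.IGNORECASE),
    "flood": re.compile(r"flood|waterlogging", re.IGNORECASE),
}


def extract(record):
    """Turn one free-text record into a small structured dictionary."""
    blocked = re.search(r"At (\d{1,2}:\d{2}), the line was blocked", record)
    opened = re.search(r"line was opened at (\d{1,2}:\d{2})", record)
    interruption = None
    if blocked and opened:
        t0 = datetime.strptime(blocked.group(1), "%H:%M")
        t1 = datetime.strptime(opened.group(1), "%H:%M")
        interruption = int((t1 - t0).total_seconds() // 60)
    post = KM_POST_RE.search(record)
    return {
        "times": [m.group(0) for m in TIME_RE.finditer(record)],
        "location_m": int(post.group(1)) * 1000 + int(post.group(2)) if post else None,
        "factors": sorted(n for n, p in FACTOR_PATTERNS.items() if p.search(record)),
        "interruption_min": interruption,
    }


result = extract(RECORD)
print(result)
```

In this sketch the interruption time is taken as the interval between the blocking and the reopening of the line, which is one plausible reading of the consequence measure used later in the paper.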

Keyword Extraction and Matching.
Keyword extraction is relatively simple for English and similar languages, in which spaces separate the words. In Chinese, however, no such delimiters are available, so coherent sentences must first be segmented into words before keywords can be extracted. Expressions in Chinese may vary widely, leading to potential ambiguity in the word segmentation. For keyword extraction and matching, a railway safety dictionary is designed, and algorithms and models such as the trie tree, DAG (directed acyclic graph), Viterbi, HMM (hidden Markov model), and keyword matching are used in combination. The main process of keyword extraction and matching is shown in Figure 5 [21-23].
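The dictionary-based part of this pipeline can be sketched as follows. The sketch builds a DAG of all dictionary words found in a sentence and selects the most probable segmentation by dynamic programming, then matches the resulting words against a keyword table. The dictionary entries, frequencies, and factor labels are hypothetical; a plain Python dict stands in for the trie, and the HMM/Viterbi step for out-of-dictionary words is omitted.

```python
import math

# Hypothetical toy railway safety dictionary: word -> frequency.
# Glosses: 山体 mountain body, 落石 rockfall, 侵入 intrude, 限界 clearance,
# 侵入限界 clearance intrusion, 线路 line, 封锁 block.
DICTIONARY = {"山体": 30, "落石": 50, "侵入": 40, "限界": 30,
              "侵入限界": 20, "线路": 60, "封锁": 25}
LOG_TOTAL = math.log(sum(DICTIONARY.values()))

# Hypothetical keyword-to-factor table used for the matching step.
KEYWORD_TO_FACTOR = {"落石": "rockfall", "侵入限界": "clearance intrusion"}


def build_dag(sentence):
    """For each start index, list the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n)
                if sentence[start:end + 1] in DICTIONARY]
        dag[start] = ends or [start]   # fall back to a single character
    return dag


def segment(sentence):
    """Choose the path through the DAG with the highest total log probability."""
    n = len(sentence)
    dag = build_dag(sentence)
    best = {n: (0.0, n)}               # best[i] = (score of best path from i, end index)
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(DICTIONARY.get(sentence[start:end + 1], 1))
             - LOG_TOTAL + best[end + 1][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = best[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


def match_factors(words):
    """Map segmented words to risk factor labels via the keyword table."""
    return [KEYWORD_TO_FACTOR[w] for w in words if w in KEYWORD_TO_FACTOR]


words = segment("山体落石侵入限界")
print(words, match_factors(words))
```

Because the domain term 侵入限界 is in the dictionary with a sufficient frequency, the sketch keeps it as one word instead of splitting it, which is the kind of behavior a dedicated railway safety dictionary is meant to provide.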

Association Analysis-Based Apriori Algorithm.
This study utilizes association rules to explore the correlations between data generated by different methods, so as to build rules that may inform decision-making. The data mining of association rules mainly includes two processes. First, identify the frequent item sets, that is, the item sets whose frequency is not less than the minimum support degree. Second, mine the strong association rules that satisfy the minimum confidence based on the frequent item sets obtained. The overall performance of association rule data mining is determined by the first process [24][25][26][27][28].
Finding frequent item sets is not easy because the data explosion involved in the calculation process may lead to unacceptable computational complexity. However, once the frequent item sets are obtained, the association rules whose confidence is not less than the minimum confidence can be derived from them. The association rules' mining algorithm used in this paper is the Apriori algorithm. The data in Table 2 are used as an example to illustrate the implementation process of the Apriori algorithm [29][30][31][32].
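For reference, the two thresholds can be written in the usual textbook form. The symbols are introduced here for illustration and are not taken from the paper: $D$ is the set of records, $T$ a single record, and $A$, $B$ are disjoint item sets.

```latex
\mathrm{support}(A \Rightarrow B) = \frac{\lvert \{\, T \in D : A \cup B \subseteq T \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
```

A rule is called strong when its support and confidence both reach the chosen minimum values.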
The Apriori algorithm could be used to mine association rules. The process of mining frequent item sets is shown in Figure 6, where $L_k$ denotes the frequent item sets of length $k$, $C_k$ denotes the candidate item sets of length $k$, and Support_count($k$) denotes the support count of the $k$-item sets.
All item sets in $L_1, L_2, \ldots, L_n$ are frequent item sets, and the confidence of the rules derived from each frequent item set is then calculated. When the support threshold is set to 40% and the confidence threshold is set to 50%, the results shown in Table 3 can be obtained.
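The level-wise search can be sketched in a few lines. The 40% support and 50% confidence thresholds are those quoted above, but the transactions are hypothetical illustration data and do not reproduce Table 2 or Table 3.

```python
from itertools import combinations

# Hypothetical event tags per unsafe event (not the data of Table 2).
EVENTS = [
    {"rainstorm", "rockfall", "line_blocked"},
    {"rainstorm", "rockfall"},
    {"strong_wind", "foreign_object", "line_blocked"},
    {"rainstorm", "rockfall", "line_blocked"},
    {"strong_wind", "line_blocked"},
]


def apriori(transactions, min_support):
    """Return {frequent item set: support} using level-wise candidate generation."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]           # C1
    frequent, k = {}, 1
    while candidates:
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}   # Lk
        frequent.update(level)
        k += 1
        # Join step: merge pairs of frequent sets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def strong_rules(frequent, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from frequent item sets."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


frequent = apriori(EVENTS, min_support=0.4)
rules = strong_rules(frequent, min_confidence=0.5)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```

The prune step is what keeps the candidate sets small: an item set is only counted against the data if every one of its subsets was already found to be frequent.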

Grey Relation Analysis.
Grey relation analysis (GRA) is a multifactor statistical analysis method. The basic method of calculating the correlation degree is to initialize the original data sequence, then calculate the correlation coefficient, obtain the correlation degree and the correlation matrix by combining the correlation coefficients, and finally rank the factor sequences according to their correlation degrees [33].
The calculation method is as follows. Let the characteristic behavior sequence of the system be

$$X_0 = \big(x_0(1), x_0(2), \ldots, x_0(n)\big),$$

and let the related factor sequences be

$$X_i = \big(x_i(1), x_i(2), \ldots, x_i(n)\big), \quad i = 1, 2, \ldots, m.$$

Because the units or initial values of each data sequence are different, in order to make them comparable, it is necessary to implement dimensionless processing on the original data so that data of different dimensions (or magnitudes) could be compared. The initial value method could be used for the calculation:

$$x_i'(k) = \frac{x_i(k)}{x_i(1)}, \quad k = 1, 2, \ldots, n, \; i = 0, 1, \ldots, m.$$

The correlation coefficient expresses the degree of correlation based on the geometric shape and development trend of each factor sequence. The expression is as follows:

$$\xi_i(k) = \frac{\min_i \min_k \lvert x_0'(k) - x_i'(k) \rvert + \rho \max_i \max_k \lvert x_0'(k) - x_i'(k) \rvert}{\lvert x_0'(k) - x_i'(k) \rvert + \rho \max_i \max_k \lvert x_0'(k) - x_i'(k) \rvert}, \quad \rho \in (0, 1), \; k = 1, 2, \ldots, n, \; i = 1, 2, \ldots, m.$$

Among them, $\rho$ is called the resolution coefficient. The smaller the $\rho$ value is, the greater the resolution is. Because there is one correlation coefficient for each point of each sequence, the information is too scattered for direct comparison. It is therefore necessary to summarize the correlation coefficients at the different points into a single value by calculating their average. The average value obtained is the correlation degree. The expression is as follows:

$$r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k), \quad i = 1, 2, \ldots, m.$$
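The three steps described in this section (initial-value normalization, correlation coefficient, and averaging) can be sketched as below. The sketch assumes the common formulation of grey relation analysis, in which the coefficient is (Δmin + ρΔmax) / (Δ + ρΔmax) with global minimum and maximum differences; the default ρ = 0.5 and the sample sequences are assumptions made for illustration.

```python
def grey_relation_degrees(reference, factors, rho=0.5):
    """Return the grey correlation degree of each factor sequence to the reference."""

    def initial_value(seq):
        # Initial value method: divide every term by the first term.
        return [value / seq[0] for value in seq]

    x0 = initial_value(reference)
    normalized = [initial_value(f) for f in factors]

    # Absolute differences between the reference and each factor sequence.
    diffs = [[abs(a - b) for a, b in zip(x0, x)] for x in normalized]
    d_min = min(min(row) for row in diffs)
    d_max = max(max(row) for row in diffs)
    if d_max == 0:                      # every sequence coincides with the reference
        return [1.0] * len(factors)

    coefficients = [[(d_min + rho * d_max) / (d + rho * d_max) for d in row]
                    for row in diffs]
    # Correlation degree: average of the coefficients over all points.
    return [sum(row) / len(row) for row in coefficients]


# Hypothetical sequences: the first factor is proportional to the reference.
degrees = grey_relation_degrees([2, 4, 6], [[1, 2, 3], [1, 1, 1]])
print(degrees)
```

A factor whose normalized curve coincides with the reference obtains a degree of 1, and factors can then be ranked by their degrees.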

Risk Analysis Based on Unstructured Data
This paper analyzes the external environment risk associated with high-speed railway derailment in terms of the occurrence possibility and the severity of the consequences, and it measures the risk by mining both quantities from the unsafe events related to the external environment risk factors. The occurrence possibility is determined mainly through unstructured safety data mining, which accurately associates the unsafe events with the corresponding risk factors; it is obtained by accumulating the occurrence frequency of the unsafe events associated with each external environment risk factor of high-speed railway derailment. Because the basic data involved in the study are mainly unsafe event data, the main consequence is the interruption time; the severity of the consequences is therefore determined mainly by mining the interruption time caused by the unsafe events associated with the risk factors of high-speed railway derailment [34][35][36][37].
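The accumulation described above can be sketched as follows. The factor codes follow the Z/Y naming used in this paper, but the event data are hypothetical, and the way the two quantities are combined (each scaled by its maximum, then multiplied) is an assumption made for illustration, since the combination rule is not stated in the text.

```python
from collections import defaultdict

# Hypothetical (factor code, interruption time in minutes) pairs.
EVENTS = [("Y33", 56), ("Y33", 120), ("Y33", 64),
          ("Z13", 90), ("Z13", 30), ("Y21", 20)]


def risk_index(events):
    """Combine occurrence frequency and mean interruption time per factor."""
    if not events:
        return {}
    count = defaultdict(int)
    minutes = defaultdict(float)
    for factor, interruption in events:
        count[factor] += 1               # possibility: accumulated frequency
        minutes[factor] += interruption  # severity: accumulated interruption time
    mean_minutes = {f: minutes[f] / count[f] for f in count}

    # Unify the dimensions by scaling each quantity to its maximum (assumed rule).
    max_count = max(count.values())
    max_minutes = max(mean_minutes.values()) or 1.0
    return {f: (count[f] / max_count) * (mean_minutes[f] / max_minutes)
            for f in count}


index = risk_index(EVENTS)
for factor, value in sorted(index.items(), key=lambda kv: -kv[1]):
    print(factor, round(value, 3))
```

Factors that never appear in the event data receive no index from this calculation, which is the case that calls for a supplementary expert-based evaluation.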
A small number of factors with a low probability of occurrence may not be associated with any event in the data. In this case, evaluation methods based on expert experience, such as the analytic hierarchy process, could be used as a supplement. Based on the above method, after analysis, mining, and unifying the dimensions, the probability of external environment factors associated with high-speed railway derailment is shown in Figure 7, the consequence severity of external environment factors associated with high-speed railway derailment is shown in Figure 8, the risk distribution scatter diagram of natural and geological factors is shown in Figure 9, and the risk distribution scatter diagram of perimeter intrusion factors is shown in Figure 10. The distribution of the comprehensive risk index of external environment factors related to high-speed railway derailment is shown in Figure 11. It can be seen that Z13, Y21, and Y33 are high-risk factors, especially Y33. Risk management procedures should be implemented, and targeted measures should be taken to control them. When implementing control measures, attention should be paid to their actual effect, and feedback should be provided to ensure that the control measures are fully implemented [38][39][40][41].

Conclusion
Utilizing data on railway unsafe events, this paper establishes a matching model that builds correlations between unsafe events and external environment factors in the context of high-speed railway derailment. Operating in an automatic fashion, the model may be employed to analyze and mine the relationship between external environment factors of high-speed railway derailment and unsafe events. The model may also be used, with an enhanced accuracy for identifying high-risk elements, to extract key feature information such as risk possibility and consequence severity. The current investigation contributes to the field of research by introducing a statistical method for analyzing unsafe events' data. It seeks to identify high-risk elements of high-speed railway derailment, refines the external environment risk index for high-speed railway derailment, and analyzes data in combination with the proposed model. As such, the study achieves a dynamic display of the results arising from external environment risk analysis in the context of high-speed railway derailment. The study is significant in that it seeks to rationalize the methods for analyzing external environment risk and to better visualize the safety laws of high-speed railway derailment. Therefore, the study helps to advance the operation of high-speed railway and its safety arrangements towards a more digitalized and smarter system.
Previous studies on high-speed railway risks mostly employ the expert evaluation method, which is arguably based on subjective deliberation and may undermine the research validity. The current project proposes a data-driven method of risk judgment, which may help to advance the feasibility and accuracy of the analysis. This study contributes to improving the safety level of railway operation by putting forward a method for integrating heterogeneous data sources, minimizing data heterogeneity, and thus building, with enhanced accuracy, the association between the data and railway risks.
According to the needs of high-speed railway operation safety management, with the continuous accumulation of railway unsafe event data, the external environment risk model related to high-speed railway derailment could be continuously modified and improved, and the correlation matching between risk and unsafe event could be more accurate, which could ensure the continuous improvement of high-speed railway operation safety.

Data Availability
Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions.

Conflicts of Interest
The authors declare that they have no conflicts of interest.