Analysis of Risk Factors of Coal Chemical Enterprises Based on Text Mining

Coal chemical enterprises face many risk factors, and the causes of accidents are complex. Traditional risk assessment methods rely on expert experience and previous literature to determine the causes of accidents, which leads to problems such as a lack of objectivity and low interpretability. Analyzing accident reports helps to identify typical accident risk factors and determine the rules of accident evolution. However, this work is usually performed manually by experts, which is subjective and time-consuming. This paper developed an improved approach to identify safety risk factors from a large volume of coal chemical accident reports using text mining (TM) technology. First, the accident reports were preprocessed, and Term Frequency-Inverse Document Frequency (TF-IDF) was used for feature extraction. Then, the K-means algorithm and the apriori algorithm were applied to the vectorized documents in the TF-IDF matrix for clustering and association rule analysis, respectively, to quickly identify the hidden risk factors and the relationships between them in the accident reports and to propose targeted safety management measures. Using the sample data of 505 accidents from a large coal chemical enterprise in Western China over the past seven years, the enterprise's accident reports were analyzed by text clustering and association rule analysis. Through the analysis, six accident clusters and 13 association rules were obtained, the main risk factors of each accident cluster were further mined, and corresponding management suggestions were put forward for the enterprise. This method provides a new idea for coal chemical enterprises to make safety management decisions and helps to prevent safety accidents.


Introduction
The COVID-19 epidemic has severely affected the energy markets [1]. China has again turned its attention to coal in order to ensure national energy security. In 2020, coal accounted for about 56.8% of China's primary energy consumption and is still the leading energy source in China [2]. As a vital form of energy in China and an important organic raw material, coal is widely used in coking, chemical fertilizer production, rubber, plastics, and other coal chemical industries [3,4]. The coal chemical industry takes coal as the raw material, converts it into gas, liquid, or solid fuel and chemicals through chemical processing, and then produces various chemical products. Developing the coal chemical industry is essential to promote clean and efficient coal utilization and to ensure national energy security [5].
There are over 100 large-scale coal chemical enterprises worldwide, with nearly 400 modern gasifiers. In comparison, there are more than 3000 coal chemical production enterprises in China, with more than 1 million employees in the coal chemical industry [6]. Developing a safe, green, and environmentally friendly coal chemical industry can effectively supplement China's shortage of oil and gas resources. However, there have been many problems, such as serious waste of resources and insufficient attention to safety and environmental protection [7]. Most production processes in the coal chemical industry involve harsh process conditions and complex production devices. A safety accident in a coal chemical enterprise may cause significant harm to personnel, equipment, facilities, and the environment [8]. In addition, during routine production in coal chemical enterprises, continuous operation times are generally long and work intensity is high, so staff are prone to negligence, leading to safety accidents.
The safety production situation of China's chemical industry is still grim and complex. From 2016 to 2019, 784 chemical accidents in China caused 1002 deaths. From January to November 2021, 127 domestic chemical accidents resulted in 157 deaths. Therefore, accurately identifying the potential safety hazards of coal chemical enterprises has become an urgent problem for many scholars [9].
Common safety accidents in coal chemical enterprises mainly include fire, explosion, and leakage of toxic gas or liquid [10]. Most accidents originate from process areas, storage areas, and waste storage or disposal areas. The direct causes of chemical accidents mainly include mechanical failure, human error, and violent reactions, of which human error accounts for the most significant proportion [11]. The commonly used safety assessment methods in coal chemical enterprises are diverse: qualitative methods include preliminary hazard analysis, safety checklist, HAZOP analysis, and FMEA analysis; quantitative methods include the Dow Chemical method and probabilistic risk assessment; and approaches combining qualitative and quantitative methods include safety checklist, event tree, and accident tree analysis [6].
However, safety risk identification in those models was limited to experience-based methods (e.g., literature review and questionnaires). Various accident causation theories and models have been proposed based on the inductive analysis of accidents, such as the Swiss cheese model, the man-made disaster theory, and the System-Theoretic Accident Model and Processes (STAMP) [12]. These theories highlight the primary mechanisms by which risk factors might cause an accident. However, accident causal models do not clearly define detailed safety risk factors. That is mainly because most studies take expert experience and previous literature as the primary source in determining the causes of accidents, resulting in a lack of objectivity and low interpretability. Secondly, when studying the causal relationships among accident causes, researchers often put forward assumptions and variable combinations through observation or related theories, lacking an objective basis [13]. Using data mining methods to deeply mine disaster information therefore offers excellent value for accident prevention.
Although industrial fields differ, accidents have similar trajectories [14]. Learning from accidents is a pivotal link in preventing future injuries [15], focusing on determining an event's root cause [16]. As explicit knowledge, the text information in accident reports is easy to share [17], and hundreds of accident reports can form a valuable knowledge database. Accident reports have outstanding value in understanding the details of the accident sequence, including important text information related to corrective and preventive maintenance after the accident [18]. Analyzing accident reports helps to identify typical accident risk factors and determine an accident's cause, type, location, and severity [19]. Currently, this work mainly depends on the judgment of experts in the field, which is subjective and time-consuming [20]. In particular, enterprises have accumulated many safety accident reports and hidden danger troubleshooting reports. These reports are presented as unstructured text, which increases the difficulty of quickly and accurately identifying risk factors from large text datasets.
In recent years, data analysis of accident investigation reports has provided a new way to study the causes of accidents [21]. Through extensive study of accident reports, text mining can better reveal the causes of accidents and significantly improve the accuracy of accident prediction [22]. In the field of chemical safety management, work on anomaly detection [23], ontology-based knowledge acquisition [24], and process alarm prediction [25] has been undertaken based on accident texts. Despite such work, no existing method meets the demands of both universality and accuracy, and there is still no efficient, convenient, universal tool for extracting risk factors from coal chemical accident cases.
This paper is both data-driven and theory-driven, aiming at better prevention and control of the potential safety hazards of coal chemical enterprises and at ensuring safe production. It proposes a text mining method to automatically identify the critical risk factors hidden in accident reports, which can help enterprises find valuable information and implicit knowledge. A specific dictionary for the coal chemical domain is established, which plays a vital role in the text mining workflow. Six accident clusters are obtained through text cluster analysis, the accident causes of these clusters are identified, and improvement measures are put forward. Through association rule analysis of the mined risk factors, 13 association rules are obtained. According to these association rules, targeted safety management can be carried out.
The remainder of this paper is arranged as follows. The background of text mining and related works is presented in Section 2. Section 3 gives the details of the proposed approach. Section 4 introduces the case application of this method in a large coal chemical enterprise in Western China. Section 5 summarizes this study.

Literature Review
The production process conditions of the coal chemical industry are harsh, and the production equipment is complex. Scholars have carried out much risk research on the coal chemical industry. In order to improve the risk management and control ability of coal chemical enterprises, Miao studied a dynamic risk management and control model for coal chemical enterprises and developed the corresponding application software [26]. Chen introduced the modeling methods and management strategies of the domino effect and pointed out future research directions and challenges to better protect the chemical industry from catastrophic accidents [11]. Zhang established a quantitative relationship between probability and equipment damage degree and developed a reliability probability model for specific types of chemical processing equipment [27]. Shahriar studied the risk of oil and gas pipeline leakage accidents through a sustainability assessment method and used bow-tie analysis based on fuzzy theory to prevent significant accidents [28]. Although these studies have improved risk assessment methods for the coal chemical process, risk management still faces substantial challenges. Major coal chemical accidents are low-frequency events, so traditional risk assessment methods that rely on coal chemical accident data cannot be effectively applied in production practice [29].
With the rapid development of science and technology such as artificial intelligence, 5G, big data, the Internet of Things, and cloud platforms, more data are available than ever. The International Data Corporation predicts that data will increase from 33 billion TB in 2018 to 175 billion TB in 2025 [30]. In addition, most of these data are in unstructured formats, including audio, video, and free text; unstructured data account for about 80%. Knowledge can be found in various information sources, while text is still the largest existing information source.
Information overload means that the amount of data generated has exceeded the capacity to process and analyze it. It is a growing concern in many industries, especially for the large amounts of free-text data handled under human supervision. Henke believes that 76% of work activities require natural language understanding. Therefore, developing automated methods to deal effectively with natural language texts is essential [31].
Text mining, also known as knowledge discovery in texts (KDT), processes large amounts of unstructured text data through natural language processing (NLP) technology to obtain new knowledge and valuable information [32]. In the 1950s, Luhn first proposed applying the idea of word frequency statistics to automatic classification, creating a precedent in the research and application of text mining [33]. The concept of "text mining" was first proposed by Feldman at the First International Conference on Text Mining and Knowledge Discovery [34].
Text mining is a branch of data mining and covers many research fields. As is well known, data mining can extract seemingly scarce potential knowledge from massive, explosively growing data. When the mined data appear as text, this mining method can be called text mining. Because the information hidden in reports is unstructured, computers cannot process it directly, while manual text processing is time-consuming and error-prone. Through text preprocessing and feature extraction in text mining technology, text information can be scientifically abstracted and transformed into a mathematical model that a computer can recognize. Of course, the theories and methods of machine learning, information processing, pattern recognition, statistics, computational linguistics, and other disciplines are needed in this process [35]. Text data are characterized by large volume, variety, velocity, and low value density, the 4V features of big data [36]. Compared with the wide application of machine learning in image processing, speech recognition, and other fields, text data mining remains challenging.
Text mining is becoming a new research hotspot and has been widely used in many fields, such as medicine, commerce, and security. The most famous application in the medical field is PubGene, a search engine containing a wealth of life science and biomedical data, which can visually show possible relationships between keywords and literature data. In the business field, enterprises use the intelligent web crawler function of text mining technology to collect information about the market, competitors, and the market environment and further analyze this information to adjust their development strategy [37]. In the safety field, many scholars use text mining technology to analyze coal mine, rail transit, ship collision, aviation, and other accidents, extract the causes of accidents, and then put forward practical safety management suggestions. For example, Lin developed a text mining method based on keyword extraction and topic modeling to identify the key concerns and dynamics of on-site inspection problems of construction projects to support better decisions [38]. Sarkar developed a text-mining-based prediction model using fault tree analysis (FTA) and Bayesian networks (BN) that could predict the occurrence of accidents attributable to different primary causes [39]. Raviv used text mining and K-means cluster analysis of 212 crane-related accident reports to find that technical failure is the most dangerous risk factor [40]. Hughes introduced a semiautomatic technique for classifying text-based close call reports in the GB railway industry to categorize large amounts of unstructured text [41]. Singh identified the nine most common accident paths and the corresponding prevention strategies through text mining of proactive data (workplace observations and high-risk control plans) and reactive data (event records) [42].
The techniques used in text mining include information extraction, topic tracking, text classification, text clustering, association analysis, information visualization, latent semantic analysis, and sentiment analysis [37].
Text clustering is essential in data mining and machine learning [43]. Its purpose is to find helpful knowledge or patterns in unstructured or semistructured text sets [44]. Given a document set, we need to divide the documents into several clusters so that the documents in the same cluster are similar. Unlike classification, clustering is a typical unsupervised learning method [45], and we do not need to label documents in advance. Therefore, text clustering technology can be considered when no annotation information for documents is available. Text clustering has a wide range of applications, such as topic detection and tracking [46], document summarization [47], and search results clustering [48]. A wealth of techniques has been proposed for text clustering, including spectral methods [49], matrix factorization [50], hierarchical methods [51], partitional approaches [52], and model-based methods [53], in addition to further approaches based on semantic similarity [54], evolutionary algorithms [55], and concept factorization [56].
According to accident causation theory, accidents are not caused by a single factor but occur when the bottom line of the defense system is broken through under the joint action of different factors. In order to further reveal the patterns linking different factors, association analysis is needed to extract strong association rules between risk factors. Association rule mining is an essential branch of data mining technology. Agrawal first proposed the concept of association rule mining, the association or correlation between itemsets in a database, also known as shopping basket analysis [57]. Well-known algorithms such as apriori [57], FP-growth [58], and ECLAT [59] and their derivatives have introduced efficient frequent itemset mining processes for association rules. Other types of itemset mining methods have been introduced for rule mining, such as approximate [60], rare [61], and uncertain itemset mining [62]. These mined itemsets have been used to produce several forms of rules, such as multilevel and multidimensional association rules.
Given the subjectivity of traditional risk factor analysis methods in the coal chemical field, this paper combines data-driven and theory-driven approaches. It proposes a text mining method and workflow that can objectively extract risk factors from large amounts of accident case data.

Methodology
This paper presents a process and method for extracting risk factors from accident reports based on text mining, as shown in Figure 1. In the safety management process, coal chemical enterprises have accumulated many accident reports, hidden danger troubleshooting records, and other text data, constituting a knowledge treasure waiting for in-depth excavation. After preprocessing and feature extraction, the content of a large number of reports is transformed into a structured dataset. Text clustering and association rule analysis are then carried out. Finally, combined with the mined tacit knowledge, the daily safety management decision-making of the enterprise is supported.

Text Preprocessing.
Text preprocessing is the fundamental step of text mining, which aims to clean and standardize the corpus. It usually includes data screening, removing stop words, word segmentation, and part-of-speech tagging. Chinese text preprocessing does not need stemming, lemmatization, or case normalization, which makes it different from English text preprocessing. Four substeps are designed: data screening, removing stop words, constructing a domain dictionary, and word segmentation.
(1) Data Screening. Because of the randomness of text data records and the many professional terms and idioms, text normalization is required before Chinese word segmentation, which is usually completed by regular expressions [63]. This study takes the accident report as the initial database. Since the preparation of accident reports has corresponding requirements and the text is relatively standardized, this paper only needs to delete duplicate and defective reports (for example, incomplete reports) during data screening. (2) Removing Stop Words. Stop words such as punctuation marks and function words carry little meaning and interfere with the analysis, so they are deleted by matching against a stop-words list. (3) Constructing a Domain Dictionary. Because general-purpose segmentation tools do not recognize coal chemical terminology, a domain dictionary of professional terms is built to improve segmentation accuracy [64]. (4) Word Segmentation. By locating the term boundaries, the corpus is decomposed into discrete, linguistically meaningful terms [65]. Chinese word segmentation recombines continuous Chinese sentences into word sequences according to specific rules. After removing stop words to eliminate the interference of meaningless words and introducing the constructed domain dictionary, word segmentation can be carried out directly.
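As an illustration, the screening and stop-word steps might be sketched in Python as follows. The reports, length threshold, and stop-word list are hypothetical, and real Chinese text would additionally be segmented with a tool such as Jieba:

```python
import re

# Hypothetical stop-word list; the real list (1893 entries) would be loaded from file.
STOP_WORDS = {"the", "of", "a", "in", "was"}

def screen_reports(reports):
    """Drop duplicate and defective (e.g., near-empty) reports."""
    seen, kept = set(), []
    for text in reports:
        norm = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if len(norm) < 10 or norm in seen:        # defective or duplicate
            continue
        seen.add(norm)
        kept.append(norm)
    return kept

def remove_stop_words(tokens):
    """Delete meaningless words after word segmentation."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

reports = [
    "Leakage of the economizer was found in the boiler area.",
    "Leakage of the economizer was found in the boiler area.",  # duplicate
    "bad",                                                      # defective
]
clean = screen_reports(reports)
tokens = remove_stop_words(clean[0].split())
print(len(clean))   # 1
print(tokens)
```

The same two-pass structure (clean the corpus, then filter tokens) carries over directly to the Chinese pipeline once a segmenter replaces the whitespace split.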

Feature Selection.
After text preprocessing and feature vectorization, the feature dimension of the text is still very high. In order to reduce computational complexity, feature extraction is needed. Feature extraction is a dimensionality reduction method that calculates each feature's score according to a feature evaluation function, sorts the features by score, and selects the features with high scores as feature items. It reduces the number of features and the computational complexity of modeling and improves clustering performance. As a traditional feature selection method, Term Frequency-Inverse Document Frequency (TF-IDF) is usually used as the feature evaluation function for feature extraction [66]. The TF-IDF matrix has been widely used to train shallow learning models [67], such as SVM, KNN, and NB. The TF of a keyword is expressed as

TF(i, j) = n_ij / Σ_k n_kj,

where n_ij denotes the number of occurrences of keyword t_i in accident record document d_j and Σ_k n_kj is the number of all keywords in accident record document d_j.
The IDF of a keyword is expressed as

IDF(i) = log(|D| / (1 + |{j : t_i ∈ d_j}|)),

where |D| represents the total number of accident record documents and |{j : t_i ∈ d_j}| is the number of documents containing keyword t_i; to avoid the divisor being zero, it is generally written as 1 + |{j : t_i ∈ d_j}|. IDF means that the fewer accident record documents a keyword appears in, the greater the weight given to that keyword. It is the opposite of TF's idea but is susceptible to rare keywords. TF-IDF combines the advantages of TF and IDF [68], indicating that the weight of a keyword increases with the number of times it appears in an accident record document and decreases with the number of relevant accident records in the database:

TF-IDF(i, j) = TF(i, j) × IDF(i).

Using the TF-IDF method to calculate the weights of the keywords obtained from word segmentation, we can identify the important keywords within the documents and realize feature extraction. It can effectively filter out many worthless words and improve the performance of subsequent clustering analysis and the effect of association analysis.
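A minimal pure-Python sketch of the TF-IDF computation defined above; the toy keyword lists are illustrative, and in practice a library such as scikit-learn would compute the matrix:

```python
import math

def tf_idf(docs):
    """TF-IDF weights per (keyword, document), following the formulas above:
    TF(i,j) = n_ij / sum_k n_kj,  IDF(i) = log(|D| / (1 + |{j : t_i in d_j}|))."""
    D = len(docs)
    df = {}                                   # document frequency per keyword
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for t in set(doc):
            tf = doc.count(t) / len(doc)      # term frequency in this document
            idf = math.log(D / (1 + df[t]))   # smoothed inverse document frequency
            w[t] = tf * idf
        weights.append(w)
    return weights

# Toy tokenized accident records (illustrative keywords only).
docs = [["gasifier", "pressure", "interlock"],
        ["gasifier", "leak"],
        ["pressure", "pressure", "valve"]]
w = tf_idf(docs)
print(round(w[1]["leak"], 4))   # 0.5 * log(3/2) ≈ 0.2027
```

Note that with the 1 + df smoothing, a keyword appearing in most documents gets a weight near zero, which is exactly the filtering effect the text describes.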

Text Clustering Analysis.
Textual documents usually need to be classified according to content similarity [69]. For small datasets, we can manually assign texts to specific clusters. However, clustering a large number of documents manually would be very time-consuming. Therefore, it is crucial to develop accurate and fast methods in text mining [70].
K-means clustering is the most commonly used clustering technique [71]. The algorithm scales to large datasets and has been applied in many applications [72]. The K-means clustering algorithm takes the sum of squared errors (SSE) as the objective function, minimizing the SSE between texts in the K clusters. The cluster center e_i of cluster E_i can be expressed as

e_i = (1 / n_i) Σ_{x ∈ E_i} x.

The SSE between texts is calculated as

SSE = Σ_{i=1}^{K} Σ_{x ∈ E_i} ||x − e_i||²,

where x represents a text object, E_i is the i-th cluster, n_i denotes the number of samples therein, and e_i is the center of cluster E_i.
With K-means clustering, the vectorized documents in the TF-IDF matrix are divided into K distinct clusters based on the Euclidean distance to the centroid of a cluster [73].
First, the cluster number K needs to be given; then, after all feature vectors are assigned to the nearest centroid, the centroid positions are recalculated. This process is repeated until convergence occurs and no further changes take place [74]. The initially set K value directly affects the clustering effect. Rousseeuw proposed the silhouette coefficient method, which provides a graphic display to evaluate cluster quality and judge the text clustering effect [75]. Assuming that the original data are divided into K clusters, for each vector i in a cluster, let a(i) be the average distance from vector i to the other vectors in the same cluster, indicating the degree of cohesion within the cluster, and b(i) be the average distance from vector i to all vectors in the nearest other cluster, indicating the degree of separation between clusters. S(i), the silhouette coefficient of vector i, can be expressed as

S(i) = (b(i) − a(i)) / max{a(i), b(i)}.

Averaging the silhouette coefficients of all vectors gives the silhouette coefficient of the clustering. Its value range is [−1, 1]: 1 indicates high-density clustering, −1 indicates incorrect classification, and values around 0 indicate overlapping clusters.
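On a toy 1-D dataset (illustrative values only, with cluster assignments fixed by hand), the SSE and silhouette computations can be sketched as:

```python
def centroid(cluster):
    """Mean of the points in one cluster."""
    return sum(cluster) / len(cluster)

def sse(clusters):
    """Sum of squared errors: squared distance of each point to its cluster center."""
    return sum((x - centroid(c)) ** 2 for c in clusters for x in c)

def silhouette(point, own, other):
    """S(i) = (b - a) / max(a, b) for one point, given its own cluster and the
    nearest other cluster (own cluster assumed to have at least 2 points)."""
    a = sum(abs(point - x) for x in own if x != point) / (len(own) - 1)
    b = sum(abs(point - x) for x in other) / len(other)
    return (b - a) / max(a, b)

clusters = [[1.0, 2.0], [8.0, 9.0, 10.0]]
print(sse(clusters))                               # 0.5 + 2.0 = 2.5
print(silhouette(1.0, clusters[0], clusters[1]))   # (8 - 1) / 8 = 0.875
```

A well-separated point such as 1.0 scores close to 1, matching the interpretation of the coefficient given above.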

Association Rule Analysis.
There are many algorithms for mining association rules, among which the most classic is the apriori algorithm [76]. The basic idea of the apriori algorithm is to find frequent itemsets according to a set support threshold until no frequent (K + 1)-itemsets exist. Corresponding to the general steps of association rule algorithms in data mining, text association analysis also includes two stages: (1) searching for frequent itemsets and (2) generating association rules based on the frequent itemsets.
Three measures are widely used in the literature to evaluate the quality of association rules: support, confidence, and lift.
The expression of support (S) is

S(X⇒Y) = P(X ∪ Y).

Journal of Environmental and Public Health
In the formula, P represents the probability that both itemsets X and Y co-occur in a transaction. Moreover, the support is symmetrical; that is, the support of X⇒Y is equivalent to the support of Y⇒X.
The expression of confidence (C) is

C(X⇒Y) = P(Y | X) = S(X ∪ Y) / S(X).

This formula represents the conditional probability of event Y given that event X occurs. It is not symmetric; the confidence of the rule X⇒Y may differ from the confidence of the rule Y⇒X.
Support and confidence are probability values, and their interval is [0, 1]. The closer the value is to 1, the stronger the relationship between the events.
The expression of lift (L) is

L(X⇒Y) = C(X⇒Y) / S(Y) = P(Y | X) / P(Y).

The lift is the conditional probability of itemset Y given itemset X in the transaction set, divided by the probability of itemset Y occurring alone in the transaction set. Generally, the lift value is compared with 1: less than 1 indicates a negative correlation between the antecedent and consequent items, greater than 1 indicates a positive correlation between the two, and equal to 1 indicates no correlation.
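The three measures can be computed directly from their definitions, as in this sketch (the toy transactions and item names are hypothetical):

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(transactions, X | Y) / support(transactions, X)

def lift(transactions, X, Y):
    """confidence(X ⇒ Y) / support(Y)."""
    return confidence(transactions, X, Y) / support(transactions, Y)

# Toy transactions: each set holds the risk-factor items of one accident report.
T = [{"pressure", "interlock"},
     {"pressure", "interlock", "leak"},
     {"leak"},
     {"pressure"}]
X, Y = {"pressure"}, {"interlock"}
print(support(T, X | Y))       # 2/4 = 0.5
print(confidence(T, X, Y))     # 0.5 / 0.75 = 2/3
print(lift(T, X, Y))           # (2/3) / 0.5 = 4/3 > 1, positive correlation
```

Here lift > 1, so in this toy data "pressure" and "interlock" would be positively correlated, matching the interpretation above.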

Case Application
This section introduces the case application. We investigated a large coal chemical enterprise in Western China and obtained 505 accident reports from 2015 to 2020. The accident reports record the department, name, time, grade, nature, injury degree, and process of each accident. This paper chose the Python language to mine and analyze the obtained text database.

Text Preprocessing.
The integrity and standardization of the obtained accident reports were checked. All 505 accident reports were found to be filled in a standardized format and complete in content, so they could be analyzed in the next step.
Combining Modern Chinese Function Words, Baidu's, and Harbin Institute of Technology's stop-words lists, a Chinese stop-words list containing 1893 entries, such as punctuation marks and function words, was sorted out. The meaningless words in the accident reports were deleted by importing this stop-words list.
This study used a combination of corpus-based and knowledge-based methods to construct a dictionary for the coal chemical industry. The specialized vocabulary comes from safety engineering, chemical engineering, and risk management. At the same time, enterprise safety managers were invited to sort out professional words with industry recognition, combined with the expression characteristics of accident reports. These two parts of the vocabulary constitute the domain dictionary used in this study.
The widely used Jieba Chinese word segmentation toolkit was installed in Anaconda. Word segmentation and part-of-speech tagging were carried out on the text data combined with the established domain dictionary. However, the words obtained through word segmentation cannot be mined and analyzed directly. On the one hand, too many keywords contain unhelpful interference items, leading to the curse of dimensionality. On the other hand, the obtained keywords only carry frequency statistics, and simple word frequency cannot reflect the importance of vocabulary.

Feature Extraction.
Scikit-learn is a Python-based machine learning package; a feature matrix was constructed for the obtained keywords by calling its CountVectorizer function. The TfidfVectorizer was then called to calculate the weight of each feature according to the TF-IDF algorithm (see Table 1).
In this paper, keywords that occur more than three times are regarded as high-frequency words; the TF-IDF values of the keywords are arranged in descending order, and the top 10% are defined as feature items. It can be seen from the table that the weight value of each feature item is relatively small, because the weight value is related to the frequency of the word in the document, and the entire database contains thousands of words. The weight value of a feature item is only a relative value, which serves to rank importance. The feature items with higher weight values mainly include gasifier, central control room, coke oven gas, interlock, pressure, and induced draft fan, indicating that the accidents are mostly related to these feature items. By comparing the accident records, it was found that shutdown and maintenance accidents are caused by the failure of equipment components such as the gasifier, induced draft fan, compressor, and reactor, and by pipeline rupture. Interlocking accidents are caused by excessive fluctuations of process parameters such as flow, pressure, liquid level, and temperature. The feature items with high weights can accurately reflect relevant information about frequent accidents. Cluster analysis and association rules require further analysis of detailed accident characteristics and causes.
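The selection rule described above (keywords occurring more than three times, then the top 10% by TF-IDF weight) might be sketched as follows, with hypothetical counts and weights:

```python
def select_features(freq, weight, min_count=3, top_frac=0.10):
    """Keep keywords occurring more than `min_count` times, then take the
    top `top_frac` of them by TF-IDF weight (descending order)."""
    frequent = [t for t in weight if freq.get(t, 0) > min_count]
    frequent.sort(key=lambda t: weight[t], reverse=True)
    k = max(1, int(len(frequent) * top_frac))   # keep at least one feature
    return frequent[:k]

# Hypothetical occurrence counts and TF-IDF weights for illustration.
freq   = {"gasifier": 12, "pressure": 9, "valve": 2, "interlock": 7}
weight = {"gasifier": 0.031, "pressure": 0.027, "valve": 0.040, "interlock": 0.022}
print(select_features(freq, weight))   # ['gasifier']
```

Note that "valve" is excluded despite its high weight because it fails the frequency filter, mirroring the two-stage rule in the text.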

Text Clustering Analysis.
The last section calculated the weights of the extracted feature items and constructed a space vector model. This section calculates the similarity between documents for text clustering analysis.
The silhouette coefficient combines the cohesion and separation of clustering to evaluate the clustering effect. The silhouette_score function was used to determine the K value for cluster analysis. The result is shown in Figure 2. When the K value is 6, the silhouette coefficient is the largest and close to 1, indicating that high-density clustering can be obtained.
K-means cluster analysis was carried out on the 505 accident reports, and six clusters were finally obtained, as shown in Figure 3; see Table 2 for detailed results.
The numbers of accident reports included in the six clusters are 121, 105, 87, 82, 71, and 39. The leading causes of each accident cluster can be found by summarizing the feature items, as shown in Table 3. Cluster 0 contains the most accidents, mainly personal injury accidents. The causes of these accidents include insufficient safety awareness of employees, failure to take protective measures, nonstandard operation, misoperation, and untimely communication, reflecting the lack of employees' occupational safety knowledge and safety awareness. Cluster 1 mainly refers to equipment and parts damage accidents. The causes include induced draft fan parts damage, motor damage, and compressor parts damage, indicating that enterprises need to strengthen the inspection and maintenance of the frequently faulty equipment above. Cluster 2 mainly refers to leakage accidents. The causes include economizer leakage, pipeline blockage/rupture, and flange leakage, reflecting the need for enterprises to formulate and improve regular inspection systems and assessment mechanisms. Cluster 3 mainly refers to production line shutdown accidents caused by large fluctuations in process parameters such as flow, liquid level, and pressure, which indicates that enterprises should strengthen the training of employees in the operation skills of the Distributed Control System and Safety Monitoring System and formulate emergency plans for various emergencies. The leading causes of the accidents in Cluster 4 are that sundries on the equipment and site are not cleaned up in time, resulting in equipment tripping, fire, and other accidents, indicating that the housekeeping of the enterprise is insufficient. Cluster 5 mainly refers to traffic accidents in the plant area. The causes are insufficient safety awareness of employees and failure to comply with traffic rules.
It is also necessary to strengthen the safety training of employees and formulate regulations on trafc travel in the plant area.
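The TF-IDF and K-means steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sample reports, tokenization, and scikit-learn parameters are assumptions, and only the choice of six clusters follows the text.

```python
# Hypothetical sketch of the TF-IDF + K-means pipeline, assuming the accident
# reports have already been preprocessed into whitespace-separated terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for preprocessed accident reports (illustrative only)
reports = [
    "employee injury protective measures missing",
    "induced draft fan motor damage",
    "economizer leakage flange pipeline rupture",
    "flow pressure fluctuation shutdown",
    "debris equipment tripping fire",
    "traffic accident plant rule violation",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)      # TF-IDF feature matrix (reports x terms)

km = KMeans(n_clusters=6, n_init=10, random_state=0)  # k = 6 as in the paper
labels = km.fit_predict(X)                 # one cluster label per report
print(labels)
```

In practice the 505 real reports would replace the toy list, and the feature items summarized in Table 3 would be read off the top-weighted terms of each cluster centroid.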

Association Rule Analysis.
According to the data format required by the association rules, a Boolean matrix was constructed from the feature items in the accident reports, as shown in Table 4. Each column represents an item, i.e., an accident cause mined from the text. Each row represents a transaction, i.e., an accident report D_i; 1 indicates that the cause appears in the accident report, and 0 that it does not. The minimum support threshold was set to 0.1 and the minimum confidence threshold to 0.3. Frequent itemsets were searched, and the rules with a lift greater than 1 were filtered out, yielding 13 association rules, as shown in Table 5.

The reason for the relatively low support of these rules is that the vocabulary contained in the database is very large and the dimension of the feature matrix is very high, so the frequency with which any two feature items appear together is low. However, the confidence of almost all of these association rules is greater than 0.5, which means that the feature items within a rule are strongly correlated. As the table shows, 38.5% of the association rules obtained are related to the central control, indicating that in case of abnormal conditions in the daily production process, the central control issues operation instructions in the shortest time to avoid worse outcomes. Abnormal flow, load, and liquid level fluctuations will lead to abnormal changes in process pressure. Load disturbances will also cause abnormalities in pressure, flow, motor, and other equipment parameters. The probability of compressor failure and unqualified propylene products appearing together in the accident reports is 14.2%. The lift of the final 13 rules is greater than 1, indicating that the consequent of each rule is strongly affected by its antecedent. The obtained rules therefore have obvious practical significance and support targeted safety management.
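For reference, the three metrics behind these rules can be computed directly for a rule A → B. The counts below are made up for illustration; only the total of 505 reports comes from the paper.

```python
# Support, confidence, and lift for a rule A -> B over N reports:
#   support    = count(A and B) / N
#   confidence = count(A and B) / count(A)
#   lift       = confidence / (count(B) / N)
# lift > 1 means A and B co-occur more often than if they were independent.
N = 505       # total accident reports (from the paper)
n_a = 100     # reports containing cause A (illustrative count)
n_b = 120     # reports containing cause B (illustrative count)
n_ab = 60     # reports containing both causes (illustrative count)

support = n_ab / N
confidence = n_ab / n_a
lift = confidence / (n_b / N)
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

This also shows why support can be low while confidence stays high: n_ab is small relative to the full corpus but large relative to the reports containing A.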

Results of Practical Application.
According to the text clustering results and the characteristics of departments and posts, the case enterprise designed different knowledge question banks and randomly selected questions every month to test the employees. Employees can access the question bank through mobile phones at any time for learning. At the same time, the enterprise focused for two months on the inspection and maintenance of induced draft fans, motors, compressors, economizers, and other equipment prone to frequent failures. According to the association analysis results, the enterprise added a linkage monitoring function for essential process parameters to the central control system to prevent domino events caused by excessive fluctuations of process parameters. A statistical analysis of the accident reports for 2021 found that the number of accidents decreased by 16.9% year-on-year, of which human factor accidents, equipment failure accidents, and interlocking accidents decreased by 18.4%, 11.1%, and 14.3%, respectively, and the safety situation of the enterprise was significantly improved.

Conclusion
To accurately identify the risk factors in coal chemical enterprises and effectively prevent safety accidents, this paper developed an improved approach to identifying safety risk factors from a large volume of coal chemical accident reports using TM technology. Firstly, the features of the preprocessed accident text are extracted using the TF-IDF method. Secondly, based on the characteristics of coal chemical enterprises, the K-means algorithm and the Apriori algorithm are developed to perform clustering and association rule analysis on the feature matrix, respectively. The analysis results identify the main risk factors of each accident cluster and the correlations between risk factors. Finally, the method is applied to a large chemical enterprise in Western China, and six accident clusters and 13 association rules are obtained. The main risk factors of each accident cluster are further analyzed, and corresponding safety management measures are proposed. The final results show that the proposed method can quickly identify the critical risk factors hidden in accident reports and their relationships, and helps enterprises carry out scientific management and decision-making.
In the future, more enterprise data should be used to verify the method. At the same time, it is necessary to summarize accident types and causes in more detail to better identify the risk factors present in the enterprise.