Monitoring and Early Warning Analysis of the Epidemic Situation of Escherichia coli Based on Big Data Technology and Cloud Computing

The purpose of this study is to analyze the molecular epidemiological characteristics and resistance mechanisms of Escherichia coli. The study established a big data cloud computing prediction model for the epidemic mechanism of the pathogen. The study establishes the early warning, control parameters, and mathematical model of Escherichia coli infectious disease and monitors the molecular sequence of the pathogen based on discrete indicators. A nonlinear mathematical model equation was used to establish the epidemic trend model of Escherichia coli. The study shows that the use of the model can control the relative error at about 5%. The experiment proves the effectiveness of the combined model.


Introduction
The gene promoter is the most important regulatory element of gene transcription; it determines where gene expression starts. Therefore, the study of promoters has always been a hot spot in modern molecular biology. As an important part of identifying the complete structure of genes, the theoretical prediction of gene promoters has become an important research topic in bioinformatics [1]. With the advent of the postgenomic era, a large amount of genomic data has been generated, but the available annotation information related to promoters is still relatively scarce. Therefore, it is urgent to design a fast and effective method to identify promoter sequences in the genome.
Because the genome promoters of prokaryotes and eukaryotes are quite different, they are usually studied separately for prediction. Escherichia coli is one of the most important prokaryotic model organisms. At present, a variety of mathematical models have been used to predict the promoters of Escherichia coli. The position weight matrix (PWM) is one of the more commonly used prediction methods. Some scholars selected 288 different PWMs to conduct a systematic study on 599 sigma70 promoters. The study found that the sensitivity reached 86%, while the accuracy rate was only 53%. Some scholars have predicted 469 Escherichia coli promoter sequences and their positions based on predicted transcription units, using the Markov model (MM). The accuracy rate is more than 70%. The neural network (NN) method has also been used many times to predict the promoters of Escherichia coli. Recently, some scholars have used the NNPP2.2 software together with the distance from the transcription start site (TSS) to the translation initiation site (TIS) to improve the prediction accuracy of the Escherichia coli promoter [2]. Some scholars used the support vector machine (SVM) to predict 669 Escherichia coli sigma70 promoters and obtained high prediction accuracy. Some scholars have proposed a prokaryotic promoter identification method based on feature screening, and this method has also achieved satisfactory prediction results. We previously proposed a position association scoring matrix (PCSM) algorithm to improve the prediction accuracy of promoters. Recently, some scholars have obtained higher recognition accuracy by combining the diversity increment with quadratic discriminant analysis (the IDQD method).
Although the prediction success rate of promoters is constantly improving, there are still many problems. First, the promoter datasets used in the past are mostly small, and the nonpromoter datasets are relatively large. This increases the number of false positives and affects the accuracy of performance evaluation. Second, most previous work lacks a deep understanding of promoters and makes insufficient use of their characteristic information [3]. Third, most previous work has carried out two-class predictions, such as promoter versus gene or promoter versus coding region, whereas the actual need is to identify promoter sequences within the entire genome.
In view of the above problems in promoter prediction, this article reintegrates the characteristics of the promoter sequences of Escherichia coli and uses them for prediction. First, we consider the interaction between RNA polymerase and the promoter sequence, and we use the position association scoring function (PCSF) to describe the positional conservation of promoter sequences. Second, the promoter sequence is divided into different windows according to the sequence characteristics, and the discrete increment index (ID) is used to measure the information content of the sequence in each window. Finally, we use the modified Markov discriminant to predict the promoters of Escherichia coli. We call this method the IPMD algorithm [4]. Comparison with previous results shows that the algorithm we developed has better predictive performance and is more practical.

The Establishment of the Database.
The Escherichia coli sigma70 promoter sequences are from RegulonDB, an annotation database of the Escherichia coli transcriptional regulation network. A total of 741 experimentally confirmed sigma70 promoters were obtained, and the length of each promoter sequence was 81 bp (−60 to +20, with the TSS at position 0). The negative dataset was obtained from parts of the whole genome of Escherichia coli (downloaded from GenBank, accession number U00096) that do not contain promoters. In fact, however, no experiment proves which parts of the sequence do not contain a promoter [5]. Therefore, based on the known transcription unit structure of Escherichia coli and the known promoter and coding region locations, we extracted negative data while trying to avoid regions where promoters may appear. The nonpromoter sequences selected in this study come from two regions: coding region sequences and noncoding region sequences. Since the promoter drives its downstream genes, it is generally located at the head of the coding region. However, because the Escherichia coli genome is small and 89% of it consists of coding regions, some promoters lie at the end of the preceding gene. Therefore, the nonpromoter sequences of the coding region are selected from the middle part of longer genes. Next, we select nonpromoter sequences from noncoding regions.
Based on the above considerations, we selected 700 nonpromoter sequences in the coding region and 700 nonpromoter sequences in the convergent (noncoding) region, each of which was 81 bp in length.
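The 81 bp window described above (positions −60 to +20 with the TSS at position 0) can be illustrated with a minimal Python sketch. The function name `extract_window`, the 0-based indexing, and the toy sequence are assumptions made for illustration; the sketch handles only the forward strand and ignores the circularity of the genome.

```python
def extract_window(genome: str, tss: int, upstream: int = 60, downstream: int = 20) -> str:
    """Return the window from -upstream to +downstream around a 0-based TSS index.

    Forward strand only; a reverse-strand promoter would additionally need
    reverse complementing (not shown in this sketch).
    """
    start = tss - upstream
    end = tss + downstream + 1  # +1 so that position +downstream is included
    if start < 0 or end > len(genome):
        raise ValueError("window extends past the ends of the sequence")
    return genome[start:end]


# Toy sequence (hypothetical, not Escherichia coli data).
toy_genome = "ACGT" * 50
window = extract_window(toy_genome, tss=100)
print(len(window))  # 60 upstream + TSS + 20 downstream = 81
```

The same function could be applied at positions in the middle of long genes or in noncoding regions to cut negative samples of identical length.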

Location-Related Scoring Function.
Given a standard sample set, the position correlation weight matrix is defined as $P = [p_{xi}]_{M \times L}$, where $M$ is the number of character types, $L$ is the length of the sequence, and $p_{xi}$ is the probability of character $x$ appearing at position $i$.
We count the number of sextet (six-nucleotide) fragments at each position in the sequence. We introduce the pseudocount $B_i$ and redefine the matrix elements $p_{xi}$ of the position association weight matrix in terms of $B_i$ and the background frequency $p_0$, defined as $p_0 = 1/N_i$. Using the position weight matrix, we define the associated scoring function $F$. The value of $F$ characterizes the degree of similarity between a sequence and a promoter sequence [6].
The larger the value of $F$, the more likely the sequence is a promoter sequence.
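Because the explicit formulas for the matrix elements and for $F$ are not reproduced in this text, the following Python sketch only illustrates the general idea with a commonly used form, which is an assumption and not necessarily the authors' definition: $p_{xi} = (n_{xi} + B\,p_0)/(N + B)$ with $B = \sqrt{N}$, a uniform background $p_0 = 1/M$, and a log-odds score $F = \sum_i \ln(p_{x_i i}/p_0)$. For brevity it works on single nucleotides rather than six-nucleotide fragments, and the names `build_pwm` and `score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(seqs, alphabet=ALPHABET):
    """Position weight matrix with pseudocounts (assumed form, see lead-in)."""
    n = len(seqs)                # number of aligned training sequences
    length = len(seqs[0])
    b = math.sqrt(n)             # pseudocount B (assumption: sqrt of sample size)
    p0 = 1.0 / len(alphabet)     # uniform background frequency (assumption)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)  # missing characters count as 0
        pwm.append({x: (counts[x] + b * p0) / (n + b) for x in alphabet})
    return pwm, p0


def score(seq, pwm, p0):
    """Log-odds score F: larger values mean closer to the training set."""
    return sum(math.log(col[x] / p0) for x, col in zip(seq, pwm))


# Toy aligned sequences (hypothetical, loosely resembling a -10 box).
training = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm, p0 = build_pwm(training)
f_similar = score("TATAAT", pwm, p0)
f_different = score("GCGCGC", pwm, p0)
print(f_similar > f_different)
```

The pseudocount keeps every matrix element strictly positive, so the logarithm is defined even for characters never observed at a position.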
If $n_i$ or $m_i$ is zero, then $D(n_i, m_i) = 0$. It is easy to prove that the discrete increment is nonnegative, namely, $\Delta(X, Y) \ge 0$. We take the natural logarithm (in this case, the unit of information is the nat). The discrete increment $\Delta(X, Y)$ can be regarded as a quantitative expression of the biological similarity relationship, which reflects the similarity of the two sets of data [7]. The smaller $\Delta(X, Y)$ is, the more similar the two sets of data are.
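The properties stated above can be checked numerically with a short Python sketch. It assumes that $\Delta(X, Y)$ is the standard increment of diversity, $\Delta(X, Y) = D(X + Y) - D(X) - D(Y)$ with $D(X) = N \ln N - \sum_i n_i \ln n_i$ and the convention $0 \ln 0 = 0$; this definition is not reproduced in the text and is an assumption here, as are the function names.

```python
import math


def xlogx(x):
    """x * ln(x) with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def diversity(counts):
    """D(X) = N ln N - sum_i n_i ln n_i for a vector of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def increment_of_diversity(n, m):
    """Delta(X, Y) = D(X + Y) - D(X) - D(Y), in nats (assumed definition)."""
    if len(n) != len(m):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(n, m)]
    return diversity(merged) - diversity(n) - diversity(m)


# Proportional count vectors are maximally similar: the increment is 0.
delta_same = increment_of_diversity([2, 4, 6], [1, 2, 3])
# Non-overlapping count vectors are maximally dissimilar.
delta_disjoint = increment_of_diversity([5, 0], [0, 5])
print(delta_same, delta_disjoint)
```

Under this definition, a state in which either count is zero contributes nothing to the sum, which matches the statement $D(n_i, m_i) = 0$ above.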

Modified Markov Discriminant.
Considering samples with multiscale features, this study uses the modified Markov discriminant to integrate features [8]. For any sequence $S$ to be predicted, a discriminant function $MD(S, \mu)$ between it and each training set is defined. Then, the type of sequence $S$ is given by the following discriminant rule:

$\xi = MD(S, \mu_{\text{promoter}}) - \mathrm{Min}\{MD(S, \mu_{\text{coding}}), MD(S, \mu_{\text{non-coding}})\}$ (5)

The operator Min returns the smallest value in the brackets. For any given threshold $\xi_0$, the type of the sequence to be tested can then be predicted.
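The decision rule in equation (5) can be sketched in Python as follows. Since the definition of $MD$ is not reproduced in this text, the sketch substitutes a simple variance-scaled squared distance to each class mean; this stand-in, the class statistics, the direction of the threshold test (a sequence is called a promoter when $\xi < \xi_0$), and all names are assumptions made for illustration only.

```python
def scaled_sq_distance(s, mean, var):
    """Stand-in for MD(S, mu): squared distance to a class mean, scaled by
    per-feature variance. This is NOT the authors' discriminant function."""
    return sum((x - m) ** 2 / v for x, m, v in zip(s, mean, var))


def xi(s, promoter, coding, noncoding):
    """Equation (5): distance to the promoter class minus the smaller of the
    distances to the two negative classes. Each class is a (mean, var) pair."""
    d_promoter = scaled_sq_distance(s, *promoter)
    d_negative = min(scaled_sq_distance(s, *coding),
                     scaled_sq_distance(s, *noncoding))
    return d_promoter - d_negative


def predict(s, promoter, coding, noncoding, xi0=0.0):
    """Assumed rule: call S a promoter when xi falls below the threshold xi0."""
    return "promoter" if xi(s, promoter, coding, noncoding) < xi0 else "nonpromoter"


# Toy 2-dimensional class statistics (hypothetical values).
promoter_stats = ([1.0, 1.0], [1.0, 1.0])
coding_stats = ([5.0, 5.0], [1.0, 1.0])
noncoding_stats = ([-4.0, 4.0], [1.0, 1.0])

label_a = predict([1.2, 0.9], promoter_stats, coding_stats, noncoding_stats)
label_b = predict([4.8, 5.1], promoter_stats, coding_stats, noncoding_stats)
print(label_a, label_b)
```

Taking the minimum over the two negative classes means that a sequence is compared with whichever nonpromoter class it resembles most, which is how a three-class model yields a single promoter/nonpromoter decision.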

Accuracy Evaluation.
We use sensitivity ($S_n$), specificity ($S_p$), and the correlation coefficient (CC) to evaluate the predictive performance of the algorithm.
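The defining formulas are not reproduced in this text. A minimal Python sketch follows, assuming $S_n = TP/(TP+FN)$, $S_p = TP/(TP+FP)$ (a convention found in some sequence-prediction literature, elsewhere called precision), and CC as the Matthews correlation coefficient. The counts are hypothetical values back-calculated for 741 promoters and 1400 nonpromoters, not figures reported by the authors. Under this assumed reading they give values close to the cross-validation results reported in the Forecast Accuracy section (84.9%, 84.0%, 89.2%, and 0.761), whereas the alternative $S_p = TN/(TN+FP)$ could not yield an overall accuracy higher than both $S_n$ and $S_p$.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Return (Sn, Sp, accuracy, CC) under the conventions assumed above."""
    sn = tp / (tp + fn)                    # fraction of true promoters found
    sp = tp / (tp + fp)                    # fraction of predicted promoters that are real
    acc = (tp + tn) / (tp + fn + fp + tn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation
    return sn, sp, acc, cc


# Hypothetical confusion counts: 741 promoters, 1400 nonpromoters.
sn, sp, acc, cc = evaluate(tp=629, fn=112, fp=120, tn=1280)
print(round(sn, 3), round(sp, 3), round(acc, 3), round(cc, 3))
```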

Promoter Feature Selection.
The features of the Escherichia coli promoter were selected according to its sequence characteristics and previous conservation analyses of its promoter sequences. Usually, a two-class problem gives better prediction results than a three-class problem. However, the negative data in the noncoding region and the negative data in the coding region differ considerably in structure and composition, so mixing the two datasets into a single negative dataset for promoter prediction is bound to reduce the predictive performance of the model [9]. Therefore, the prediction model in this work is trained on three datasets. The feature vector input to the modified Markov discriminant is a 9-dimensional vector (Table 1).

Forecast Accuracy.
Prediction accuracy measures the predictive ability of the algorithm. We divide the positive sequences and the two types of negative sequences into a test set and a training set according to the ratios 1 : 9, 2 : 8, 3 : 7, 4 : 6, and 5 : 5. In this way, the model is trained and tested [10]. The prediction results are given in Table 2. The results show that, regardless of the ratio used to split the data for training and testing the IPMD model, its prediction accuracy does not change significantly. This shows that our model is stable.
Although good prediction accuracy is obtained for the various split ratios, this test method does not fully reflect the predictive ability of the model. We therefore use the more objective 10-fold cross-validation to evaluate the IPMD algorithm [11]. In 10-fold cross-validation, the dataset is divided into 10 equal parts. We take one part as the test set and the remaining 9 as the training set, and this is repeated 10 times to test the algorithm. We then use the receiver operating characteristic (ROC) curve to evaluate the algorithm performance. The curve is constructed by plotting the true positive rate against the false positive rate calculated at a number of given thresholds. It is a comprehensive indicator that reflects the continuous changes in sensitivity and specificity. We use the area under the ROC curve to evaluate the prediction effect (Figure 1). The results show that the area under the ROC curve reaches 0.953. When the optimal threshold $\xi_0$ is selected as −1.20, the prediction sensitivity reaches 84.9% and the specificity is 84.0%. The overall accuracy and correlation coefficient are 89.2% and 0.761, respectively.
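The evaluation procedure described above can be sketched in plain Python. The fold assignment, the function names, and the toy scores are assumptions made for illustration. The area under the ROC curve is computed through the equivalent pairwise formulation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half) rather than by tracing the curve threshold by threshold, and scores are assumed to be oriented so that larger values mean more promoter-like.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy check of the fold structure on 25 samples (hypothetical size).
splits = list(k_fold_indices(25, k=10))
tested = sorted(j for _, test in splits for j in test)

# Toy scores (hypothetical): one positive is ranked below one negative.
auc = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(len(splits), auc)
```

In practice the positive and negative sets would usually be split separately so that every fold keeps the same class proportions; the sketch omits this for brevity.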

Comparison of Results.
The above gives only the prediction results of IPMD on the three datasets. Although the overall accuracy reaches about 90%, this alone does not establish that our model outperforms other algorithms. Therefore, following previous promoter prediction methods, we carried out prediction studies on promoter versus coding region sequences and on promoter versus noncoding region sequences, respectively [12][13][14], and compared this algorithm with other algorithms. The 10-fold cross-validation is still used here, and the comparison results are given in Table 3. Our results are a further improvement over those of previous algorithms.
This indicates that a prediction model that takes multiple characteristics into account can better identify the Escherichia coli sigma70 promoter [15].

Conclusion
In this study, a new prediction model of the Escherichia coli promoter is developed. We first considered the interaction between RNA polymerase and the DNA sequence and constructed a position correlation scoring function. In fact, this scoring function can roughly measure the free energy of interaction between RNA polymerase and the DNA sequence. Second, the discrete index is used to describe the sequence composition of different windows of the promoter. The discrete index is another form of information entropy, so the discrete increment describes the increase in sequence information. Both quantities have a rigorous physical meaning, but they correspond to different physical concepts and can be regarded as mathematically orthogonal. In this way, we obtained a description of the promoter at multiple feature scales and then used the modified Markov discriminant to predict the promoters of Escherichia coli.
The comparison with other algorithms shows that our algorithm has better performance and stronger scalability and can be extended to promoter prediction in other species.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.