Transmission and Drug Resistance Characteristics of Human Immunodeficiency Virus-1 Strain Using Medical Information Data Retrieval System

This study was aimed at exploring the transmission and drug resistance characteristics of acquired immunodeficiency syndrome (AIDS) caused by human immunodeficiency virus-1 (HIV-1). The query expansion algorithm based on Candecomp Parafac (CP) decomposition was adopted to construct a data information retrieval system for semantic web and tensor decomposition. In the latent variable model based on tensor decomposition, the three elements in the triples generated feature vectors to calculate the training samples. The HIV patient data set was selected to evaluate the performance of the system, and then, the HIV gene resistance of 213 patients was retrospectively analyzed based on the electronic medical records. 43 cases showed failure of ribonucleic acid drug resistance, the ART virological failure rate was 24.43% (43/213), and one case was not reported. There was 1 case of RNA hemolysis that could not be detected. There were 50 resistant cases of nonnucleotide reverse transcriptase inhibitors (NNRTI), accounting for 29.94% (50/167), and there were 17 resistant cases of nucleotide reverse transcriptase inhibitors (NRTI), accounting for 10.18% (17/167) of all mutation cases. Among the HIV-1 strains, 19 cases failed the detection of drug resistance sites in the integrase region, and mutations in the integrase region were significantly more than those in the protease region. There were 12 types of HIV-1 strains with drug-resistant mutations. The fusion technical scheme constructed in this study showed excellent performance in medical information retrieval. In this study, the characteristics of HIV-1 of AIDS patients were analyzed from different directions, and effective treatment was performed for patients, so as to provide reference for clinical diagnosis of AIDS patients.


Introduction
Acquired immune deficiency syndrome (AIDS) refers to that human immunodeficiency virus (HIV) which invades the human immune system, causing immune function damage and various opportunistic infections in the body [1][2][3][4]. At this stage, there is no specific treatment method for AIDS [5][6][7][8]. Virus strains may have drug resistance mutations due to natural mutations or the selection of antiviral drugs. After mutations, more adaptable strains become dominant strains and spread further [9][10][11]. HIV includes two types, HIV-1 and HIV-2. HIV-1 is the main strain circulating globally in the new stage. The M, N, P, and O of HIV-1 and HIV-2 are relatively weak in pathogenicity, mainly in Central Africa and West Africa, while European and American countries are dominated by type B strains [12,13]. At present, HIV-1 drug resistance data and drug resistance mechanisms are mainly derived from B subtype, but B subtype infection only accounts for 10% of the global HIU-1 infection population. Subtype C is the most prevalent HIV-1 subtype in the world. Due to differences in intrinsic adaptability and replication ability, the accumulation of drug-resistant mutations in different subtypes of infected individuals may be different, and drug selection pressure has different effects on drug-resistant mutations in different subtypes of strains. New drug resistance mutations are being reported, and some unexplained drug sensitivities may explain the results of genotypic resistance [14][15][16]. Research on drug resistance of the main HIV-1 subtypes prevalent in my country and summarizing the characteristics of drug resistance of patients with failed antiviral treatments is very important for clinical treatment [17].
The continuous development of information and the diversified information demands have accelerated the process of big data [18]. Extended search is to screen the currently available information resources and adjust the scope of the search based on the reduced content. The formulation of the search scope needs to be divided into more levels and steps according to the needs [19][20][21]. At this stage, the molecular epidemiological investigation of HIV-1 is not clear, so it is necessary to conduct retrospective research on the genotypic characteristics of HIV-1 and the emergence of drug resistance characteristics.
Based on electronic medical records and oriented to medical literature retrieval for diagnosis and treatment decision support systems, an enhanced medical information retrieval system was constructed in this study. It integrated structured and unstructured data sources, combined the supervised learning methods for automatic query expansion to improve the effect of information retrieval, and analyzed the drug resistance characteristics of HIV-1 strains of AIDS patients from different directions. This study is aimed at providing a reference for clinical diagnosis of AIDS patients.

Research
Objects. This study retrospectively analyzed 213 patients who were treated in hospital. The data were mainly sourced from the Archives of Antiretroviral Treatment for AIDS Patients collected by the Municipal Center for Disease Control and Prevention and the National Handbook of Free AIDS Antiretroviral Treatment collected by the state. The diagnostic criteria and the diagnostic criteria stipulated in the Guidelines for the Diagnosis and Treatment of AIDS in China jointly promulgated by the Chinese Medical Association and the Ministry of Health in 2004. The antiretroviral failure standard referred to the National Handbook of Free AIDS Antiviral Drug Treatment. The prescribed standards were divided into virological failure, immunological failure, and clinical failure.

HIV-1 Introduction. HIV-1 viral load was based on the
Guidelines for HIV-1 Viral Load Determination and Treatment Assurance (2013 Edition). The EasyQ viral load meter (France, BioMérieux) was adopted to test the HIV viral load in plasma. 100 copies/mL was defined as the lowest detection limit, and the operation process was carried out strictly in accordance with the instructions. Then, the self-built genotype drug resistance test method was adopted to extract the viral ribonucleic acid (RNA) from the plasma using the QIAamp Viral RNA Isolation Kit according to the HIV-1 Genotype Drug Resistance Test and Quality Assurance Guidelines (2013 Edition). Nested polymerase chain reaction (PCR) was adopted for two rounds of amplification, covering the entire gene sequence and the reverse transcriptase codon sequence. Taking the extracted RNA was the template, and the gene sequence amplification was performed using reverse transcription PCR (RT-PCR) and nest-PCR methods. The genotype was determined based on the National Center of Biotechnology Information (NCBI) genotyping tool. The calibrated sequence was submitted to the HIV drug resistance database of Stanford University (http://hivdb.Stanford.edu) for online analysis of HIV drug resistance-related mutations and drug resistance. The degree of drug resistance was judged according to Table 1. 2.3. Data Mining. The development and application of medical information technology have produced a huge amount of data, so it is necessary to dig out potentially useful information from a large amount of obscure data and observe and analyze more useful knowledge from different angles. Data mining technology uses computer science, statistics, artificial intelligence, and other technologies to intelligently and efficiently analyze the objects in the target database and intelligently summarize or predict potential models to make reasonable decisions for the individual. The basic process of data mining mainly includes target understanding, data collection, data preprocessing, model building, model evaluation, and implementation. It can be regarded as a cyclical operation process. If the target set during the excavation is not reached, it should return to the previous one without making adjustments and then perform the operation. Figure 1 shows the flow chart of data mining.
It was supposed that in the database W = fP1, P2, ⋯Pn g, Tx = fL1, L2, ⋯::Lng, n represented the number of items, Tx was a transaction, Km (m = 1, 2, 3, ⋯, n) was the data item (Item), and the value in the data table was the item. The set F = f f 1, f 2, f 3 ⋯ f ng in the W database was selected, and Q was to represent any subset. Each field in the data table was close to the item in the set, the length of Q was set to y, Q can be called y-itemset, for example, set I = fproduct Ag, F was f -itemset.
If set f ∈ F, the frequency of f appearing in F was U, and the total length of set f was M; then the support of f was expressed as follows: SupportT described the proportion of itemsets relative to itemsets, n was the number of occurrences, and N was the total number of things.
In a transaction, 1-itemset H = fitem Hg, S = fitem Sg has H ⟶ S. If H was selected, S must also be selected. The support of ðH ⇒ SÞ represented the frequency of HS cooccurrence; the equation was as follows: The association rule ðH ⇒ SÞ was found, H ∈ F, S ∈ F, and H ∩ S = Φ; then, the confidence of the data set was 2 Computational and Mathematical Methods in Medicine expressed as follows: In the set F, the confidence of the rule R was expressed as the probability that Y would also appear under the condition that X appears, so as to show the reliability of this rule.
Minimum support shows the minimum threshold that valuable knowledge must meet, which shows the lowest importance of discovering this rule. Only when the minimum support degree is satisfied can it show its practical significance.
The minimum level of reliability that valuable knowledge must meet is the minimum confidence level (Minimum Confidence). Compared with the support degree, the confidence degree is used to measure the credibility of the excavated association rules; that is, the higher the confidence degree, the higher the credibility degree. Low credibility will be discarded.

Medical Information Retrieval
System. The method and technology of obtaining medical information through a new type of retrieval is called information retrieval. In this study, the medical information retrieval system is constructed by enhancing semantics, comprehensively supervised learning, and automatic query expansion to improve the information retrieval effect. The first is to input the query expansion words and then select the documents. The semantic expansion model selected based on tensor decomposition was adopted to expand the query words, search and sort the obtained files, and then obtain the search results. Query expansion includes two parts of sources: one is structured data, and the other is unstructured data. The combination of the two types of data can achieve better results in information retrieval. Figure 2 shows the framework of semanticenhanced information retrieval system.

Three-Dimensional
Tensor Representation. In actual information retrieval tasks, large-scale data document collections and limited labeled samples cannot support triple query expansion. The semantic web triples are adopted to assist query and document relevance to help retrieve relevant literature on treatment. In the data knowledge base, there are problems of incompleteness, sparsity, and noise, and tensors are used to represent triples. Each associated triple represents the value in the tensor and plays an important role in the analysis process. The specific schematic diagram is illustrated in Figure 3.

The Query Expansion Algorithm Based on Candecomp
Parafac (CP) Decomposition. Tensor representation can provide good prediction performance under the condition of coefficients. Compared with traditional methods, it can significantly emphasize the impact of different semantic relationships on correlation analysis. In the process of analyzing the relevance of different semantic triples, if R i is used to represent the query word category, Q k is used to represent the semantic relationship, and A k is used to represent the document word type; the corresponding feature vectors are generated after decomposition. According to the inner product of feature vectors, the triples are scored, and M is selected to represent it. The calculation equation was as follows: The initial value of the important score of the triple selected in the marked sample contains the value in the original quantity, and the calculation equation was as follows: C f ðR i , Q j , A k Þ represents the number of triples ðR i , Q j , A k Þ in the marked sample, and C r ðR i , Q j , A k Þ represents the number of triples ðR i , Q j , A k Þ extracted from the marked relevant documents and query terms. Initial value ∈ ð0, 1Þ. The tensor CP decomposition used the alternating least squares algorithm to perform tensor decomposition calculation on the semantic relations in the prediction data set.
The CP decomposition of a three-dimensional tensor can be represented by a factor matrix, as follows: The expanded matrix form was expressed as follows: ⊙ represents the Khatri-Rao product, and + represents the Moore-Penrose pseudoinverse, and the matrix A could be expressed as Then, the Khatri-Rao product attribute was expressed as The below equations were found.
Then, equation (9) can be obtained as follows: Equation (11) can perform iterative and alternate calculations for tensor decomposition. For labeled samples, using tensor decomposition to evaluate semantics was very important. The equation for constructing tensors is shown in In the equation above, U represents the observed element of marked Y, Y represents the tensor filled with missing values, and 1 represents the tensor of element 1. In the calculation process, In practical applications, the selection of rank was close to the original tensor. The larger the rank, the smaller the error. Therefore, the known elements can be used to calculate the fitting situation and gradually increase the rank value and finally decompose the parameters. In the proposed system, for the semantic expansion mode, calculation was not allowed in the process of specific query, so it would not affect the efficiency of the system.

CP-ALS Decomposition
Algorithm of 3D Tensor. The complexity of tensor calculation was constantly increasing, which was also one of the shortcomings of calculation. In the proposed system, tensor decomposition was used to evaluate the mode of semantic expansion, which was not necessary to be calculated in the process of requiring specific queries, so that it would not affect efficiency of the system. Figure 4 illustrates the process of the ALS algorithm. The r in the figure represents the number of iterations. In the latent variable model based on tensor decomposition, the three elements in the triples generated feature vectors to calculate the training samples.

Data Set.
In order to evaluate the performance of the constructed information retrieval system, the HIV patient data set was used, which was oriented to the diagnosis and treatment decision support system. The medical literature was retrieved based on query statements generated by electronic medical records. The data set provided query statements, document collections, and evaluation samples. Summary represents a simpli-fied summary of report information, type represents the type of information required, and description represents a complete report. PudMed Central (PMC) is a literature database that provides free full text for the fields of biomedicine and life sciences. The task of the data set is to retrieve relevant documents from the document collection based on the query and to support the diagnosis or treatment plan of the corresponding case.
2.9. Statistical Methods. The survey data processing in this study was analyzed by SPSS19.0 version statistical software. The measurement data conforming to the normal distribution were expressed by the mean ± standard deviation ( x ± s ), and the nonconforming count data was expressed by the frequency and frequency (%). The t-test data was adopted, and a chi-square test was performed for quality comparison. The difference was statistically significant at P < 0:05; otherwise, it was not significant.

Evaluation of the System.
For large-scale document collections, it is difficult to implement complete relevance annotation. For the incomplete test set, the evaluation parameters choose inferred average precision (infap), inferred normalized discounted Cumulative gain (infndgg), and precision at rank10 (p10). The incremental PRF system and Solr are used as the baseline system to evaluate the performance of the medical information data retrieval system. The system can effectively improve the accuracy of retrieval and achieve consistent results in measurement parameters. The specific comparison on performance evaluation of medical information retrieval system is illustrated in Figure 5. The support vector machine (SVM) was used to test the classification effect, and the radial basis function (RBF) and linear functions were used to evaluate the accuracy. The results are shown in Figure 6. It can be observed that the SVM of linear kernel function and the SVM of RBF were not very satisfactory and may be affected by many factors such as noise.
3.3. Comparison on Accuracy. The average accuracy of least squares decomposition machine (Fm-als) used in this study is shown in Figure 7. The average accuracy of Fm-als was 0.793, the average accuracy of svm-RBF was 0.546, and the average accuracy of svm-linear was 0.561. Therefore, the accuracy of Fm-als was significantly higher than that of other methods, and the difference was significant (P < 0:05).

Basic Situation of Research Objects.
In the study, 43 cases of RNA drug resistance test failed, and the basic conditions of the patients who failed the test were statistically analyzed. The results are shown in Table 3. The ART virological failure   24.43% (43/213). The results of gender showed that males were higher than females, and unmarried cases were higher than married or cohabiting cases and separated cases. The Han nationality showed significantly more cases than the minority nationalities.

Strain Mutation.
A total of 213 strains were studied, 43 of which failed the RNA drug resistance test, and one test report did not come out. There was 1 case of RNA hemolysis that could not be detected. There was 1 DNA showing that the detection of DNA and RNA drug resistance sites failed. Table 4 shows the mutations at the drug resistance site of the HIV-1 strain. There were 50 cases of drug resistance of nonnucleotide reverse transcriptase inhibitors (NNRTI), accounting for 29.94% (50/167).
There were 17 cases of drug resistance of nucleotide reverse transcriptase inhibitors (NRTI), accounting for 10.18% (17/ 167) of all mutation cases. Each mutation site and mutation rate are shown in Table 5.

Mutant Strain Type in Drug
Resistance. There were 12 types of virus strains with drug resistance mutations, as shown in Table 6. A total of 167 cases of drug resistance mutations occurred in this study. The number of cases with different virus strains is shown in Table 6.

Mutation Sites in the Protease Region and Integrase
Region. Of the 213 cases in this study, 19 cases failed to detect the drug resistance site in the integrase region, and 1 case was retested on the drug resistance site in the integrase region. The mutation sites of the protease region and the integrase region are shown in Table 7. The mutations in the integrase region were significantly more than those in the protease region.

Discussion
Since the Chinese government issued the "Four-Free and One-Care" policy for AIDS prevention and treatment in 2003, free antiretroviral treatment for AIDS has been launched in China, providing free antiretroviral treatment for AIDS patients. HIV replicates, evolves, and spreads under the pressure of drugs. The selective pressure of drugs causes the virus to quickly produce drug resistance, which obviously changes the evolutionary characteristics of the virus in its natural state [22]. Among the free antiviral treatment formulas, there are 6 kinds of imitation antiviral drugs (didanosine (DDI), azidothymidine (AZT), 2 ′ ,3 ′ -didehydro-3 ′ -deoxythymidine (D4T), lamivudine (3TC), nevirapine (NVP), and efavirenz (EFV)) to compose 6 formulas such as AZT/DDI/NVP and AZT/3CT/NVP. Antiviral therapy can reduce the speed of virus replication for HIV patients and improve the body's immunity [23].
Clare et al. [24] showed that the incidence of drug resistance of HIV subtype B after antiviral treatment was 7.1%. Guan et al. [25] suggested that the mutation rate of subtype B and subtype C was 4.71% and 1.05%, respectively, and the prevalence of low-level monitored drug-resistant mutations was predicted to be 2.1%. There were 12 virus strains with subtype B-resistant mutations in this study, and the proportion of subtype B cases was 4.09% (7 out of 176), which was basically consistent with Guan's research results. The proportion of subtype C cases was 5.99% (1 in 176), which was higher than that of Guan's, which may be related to the region of the study population. In this study, 19 of the 213 cases failed to detect integrase region drug resistance sites, and 1 case was retested for integrase region drug resistance sites. There were more mutations in the integrase region than in the protease region. Different subtypes of HIV-1 strains have different drug-resistant mutation patterns compared with subtype B, possibly because subtypespecific genetic barriers can play a certain role in the occurrence of drug-resistant mutations. The emergence of drug resistance strains causes the virus to be insensitive to or less sensitive to drugs, which is the main reason for the failure of antiviral treatment [26]. Silva et al. [27] studied 75 patients  In this study, it was found that the mutation sites in NNRTI were significantly more than those in the NRTI region, and mutations also occurred in the integrase region and the protease region. There were 43 patients who failed in this study, which should also increase the management of drug resistance. The types of drug resistance and the degree of drug resistance of various drugs can be used as the basis for case adjustment and medication, which showed reference value and guiding significance for clinicians' medication.

Conclusion
Based on electronic medical records and oriented to medical literature retrieval for diagnosis and treatment decision support systems, an enhanced medical information retrieval system was constructed in this study. It integrated structured and unstructured data sources, combined the supervised learning methods for automatic query expansion to improve the effect of information retrieval, and analyzed the drug resistance characteristics of HIV-1 strains of AIDS patients from different directions. The semantic web and tensor decomposition constructed in the study could perform effec-tive data mining in the data retrieval system, and the fusion technical solution showed excellent performance in medical information retrieval. There were still some shortcomings in this study. The three-tuple relationship network was proposed in the article only. If the four-tuple relationship was constructed, whether semantic analysis and tensor decomposition were more accurate and effective required effective and in-depth discussion in the next step.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.