Metagenomic Sequencing Analysis for Acne Using Machine Learning Methods Adapted to Single or Multiple Data

Human health status can be assessed through research on and analysis of the human microbiome. Acne is a common skin disease whose morbidity increases year by year. In recent years, the lipids that strongly influence acne have been studied by metagenomic methods. In this paper, machine learning methods are used to analyze metagenomic sequencing data of acne, i.e., all kinds of lipids in the facial skin. First, lipid data from the diseased skin (DS) samples and the healthy skin (HS) samples of acne patients and from the normal control (NC) samples of healthy persons are analyzed separately using principal component analysis (PCA) and kernel principal component analysis (KPCA), which yields the lipids that have the main influence on each kind of sample. In addition, multiset canonical correlation analysis (MCCA) is used to find lipids that can differentiate the facial skin of the above three sample types. The experimental results show that the machine learning methods can effectively analyze metagenomic sequencing data of acne. According to the results, lipids that influence only one of the three sample types, or lipids that influence all three to different degrees, can be used as indicators to judge skin status.


Introduction
Microbes are invisible to the naked eye but are major residents of the earth. Almost any environment that can be imagined, such as air dust, surface soil, underground rocks, water systems, and other natural environments, as well as animals including humans, may harbor microbes. The ecological community of microbes that lives in a certain part of the host body is referred to as a microbiota. The microbiota usually includes bacteria, archaea, microscopic eukaryotes, and viruses [1,2]. The collection of genomes and genes of these microbes is called the microbiome [3,4]. Microbes have a significant influence on the environment or their host via complex interactions [5][6][7].
In a sense, the human body is not an individual organism but a complex community, or symbiotic organism, of human cells and various microbial species. Earlier research on human microbes focused on specific pathogens that cause human diseases. As the research has deepened, especially in the past decade, researchers have found that (1) microbes inside and on the human body are not only pathogenic but can also be beneficial probiotics for the host, and (2) in most cases, microbes living together with the host play an important role as a whole [4]. Studies have shown that human health can be assessed by research on and analysis of the human microbiome [8,9].
The traditional methods of studying the microbiome are based on the independent cultivation of each microbial strain, whose characteristics and functions are then studied. This research has yielded much knowledge and insight about microbes, but it also has limitations. On the one hand, it has been reported that 99% of microbes cannot be isolated and cultured [10,11], which means that a large number of microbes cannot be studied using separation methods. On the other hand, the microbes of a microbiota tend to live and function as members of a system rather than as a group of isolated microbes [12]. As a result, researchers began to look for new ways to obtain genomic information indirectly from microbial communities, and metagenomics came into being. Metagenomics refers to the sum of the genome information of all species in an environmental microbiota. With the development of next-generation sequencing (NGS) technology, it is now very convenient to use metagenomic sequencing to study microbes. Because of the importance of microbes to human health, more and more researchers have begun to use metagenomic sequencing to study human microbes [13,14]. With the rapid development of high-throughput sequencing technologies and the substantial reduction of sequencing costs, metagenomic sequencing has become a promising pathogen detection method for the accurate diagnosis of infectious diseases [15]. Fan et al. [16] performed metagenomic sequencing on the cerebrospinal fluid of 4 patients with suspected central nervous system infection, and Brucella was detected within 48 hours. By contrast, verifying these results by polymerase chain reaction and Sanger sequencing required the patient's cerebrospinal fluid to be cultured for 7 days, which indicated that metagenomic sequencing was more rapid, efficient, and accurate in detecting pathogens than the culture method.
Metagenomic sequencing, as a fast, low-cost, and high-throughput pathogen DNA sequencing technology, offers high detection efficiency and accuracy and has been used to detect various pathogen infections, which demonstrates that metagenomic sequencing can effectively guide clinical treatment [17]. At present, classification and prediction methods based on machine learning have been successfully applied to many fields, such as complex text sentiment analysis, satire identification, and other difficult prediction and classification tasks [18][19][20]. In recent years, much work has been done on applying machine learning to metagenomics. Machine learning can be applied to the clustering and binning of metagenomic data, comparative metagenomics, gene prediction, and so on [21][22][23]. Principal component analysis has been used to identify the bacteria that have the main effect on gingivitis by analyzing data from gingivitis and healthy gums [24]. In a human gut metagenomics study of type 2 diabetes, the gene cluster found by correlation analysis represents the difference between the samples [25].
The skin is the most exposed organ of the body, and it is also the front line that protects the tissues and organs of the body from physical and chemical damage and from damage by pathogenic microorganisms. Globally, the prevalence of skin diseases is increasing. According to statistics, acne is the most common skin disease in the world. Acne is a chronic skin disease with a benign course, characterized by inflammation of the hair follicles and the attached sebaceous glands [26][27][28][29].
Acne mainly occurs in the facial and thoracodorsal areas and other seborrheic areas [30]. Its manifestations are polymorphic, ranging from blackheads, pimples, and pustules to more severe forms such as nodules, cysts, and pustules [29,31]. The long course and high recurrence rate of acne badly affect the patient's appearance. Acne can also cause psychological problems such as low self-esteem, negative emotion, anxiety, and depression [32][33][34]. Therefore, the study and treatment of acne is an important and widely studied issue in dermatology. The pathogenesis of acne is complex. At present, many researchers have studied the role of bacteria in the pathogenesis of acne, such as Propionibacterium acnes (P. acnes), Staphylococcus epidermidis (S. epidermidis), and Staphylococcus aureus (S. aureus) [23][35][36][37][38]. However, whether these bacteria are the main pathogens of acne is still controversial [38][39][40][41].
Given the effective application of machine learning to metagenomic data, we attempt to analyze the metagenomic sequencing data of acne using machine learning methods. In this article, we obtained metagenomic sequencing data for three skin statuses: the facial skin of healthy people, and the healthy facial skin and diseased facial skin of acne patients. Principal component analysis (PCA) and kernel principal component analysis (KPCA) are used to find the lipids that contribute most to the status of each kind of skin. In addition, multiset canonical correlation analysis (MCCA) is used to obtain lipids that can effectively differentiate the above three skin statuses. Figure 1 is the framework diagram of the proposed method.
The rest of this paper is organized as follows. First, the Material and Methods are described in detail in Section 2. Then, extensive experiments on metagenomic sequencing data of acne are presented in Section 3. Finally, a conclusion is drawn in Section 4.

Material and Methods
The flow rate was maintained at 0.3 mL/min. The injection volume was 2.0 μL. During UPLC runs, the injector needle was washed with the mobile phase. The eluent outlet was connected to the QTOF-MS for entity detection and characterization. High-resolution mass measurements were performed with a Waters Xevo G2-XS QTOF-MS (Waters Corporation, Milford, Massachusetts, USA) equipped with an electrospray ionization (ESI) interface operating in the positive ion mode. Entities eluted from the UPLC system were introduced into the QTOF-MS apparatus at the operating chromatographic flow rate. Nitrogen was used as the nebulizing and desolvation gas. UPLC-QTOF-MS data were collected as raw data by Masslynx 4.1 (Waters Corporation, Milford, Massachusetts, USA). In this way, three sample sets were obtained for this experiment: the patients' diseased skin (DS) samples, the patients' healthy skin (HS) samples, and the NC samples. Each sample set has 35 volunteers, and 2520 sequence data items were extracted for each volunteer.

Principal Component Analysis.
Principal component analysis (PCA) is a common technique in data analysis. Its aim is to use fewer variables to explain most of the variation in the original data, so that the main feature components of the data are extracted.
Suppose the sample set X includes m samples, each of which is an n-dimensional vector, and the samples are centered so that their sum is 0, as shown in Equations (1) and (2).
Computational and Mathematical Methods in Medicine
Suppose the new coordinate system after the projection transformation is $W_{n\times n} = (w_1, w_2, \cdots, w_n)$, where the $w_i$ form an orthonormal basis. The original data samples are projected onto the new coordinate system. The projection rule is shown in Equation (3).
To separate the samples as far as possible after projection, the variance of the projected samples should be maximized. Therefore, the optimized objective function is shown in Equation (4), where I is the identity matrix.
The Lagrange multiplier method is used to solve this constrained problem, and the resulting objective function is shown in Equation (5).
Setting the derivative of the above equation to zero yields Equation (6).
It can be seen from the above equation that, to find the eigenspace $W_{n\times n}$, the eigenvalues and corresponding eigenvectors of the covariance matrix should be calculated. However, the eigenspace obtained is still n-dimensional, so the goal of dimensionality reduction has not been achieved. Therefore, the eigenvalues λ are arranged in descending order, and a reconstruction threshold $t_1$ is selected using Equation (7).
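For reference, the PCA steps described in this subsection can be written out as follows. This is a sketch of the standard derivation under the definitions in the text (the samples are the columns of X, and W is orthonormal), numbered to match the equation references above. The projected sample $z_i$ and the diagonal multiplier matrix $\Lambda$ are notation assumed here, and the exact form of the original equations may differ.

```latex
\begin{gather}
X = (x_1, x_2, \cdots, x_m), \quad x_i \in \mathbb{R}^{n} \tag{1}\\
\sum_{i=1}^{m} x_i = 0 \tag{2}\\
z_i = W^{T} x_i \tag{3}\\
\max_{W}\ \operatorname{tr}\left(W^{T} X X^{T} W\right)
  \quad \text{s.t.}\quad W^{T} W = I \tag{4}\\
J(W) = \operatorname{tr}\left(W^{T} X X^{T} W\right)
  - \operatorname{tr}\left(\Lambda \left(W^{T} W - I\right)\right),
  \quad \Lambda = \operatorname{diag}(\lambda_1, \cdots, \lambda_n) \tag{5}\\
\frac{\partial J}{\partial W} = 0
  \;\Longrightarrow\; X X^{T} w_i = \lambda_i w_i \tag{6}\\
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \geq t_1 \tag{7}
\end{gather}
```

Under this reading, k is the smallest number of leading eigenvalues whose cumulative share reaches the threshold $t_1$.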
Then, the eigenspace $W_{n\times k} = (w_1, w_2, \cdots, w_k)$ $(k < n)$ composed of k eigenvectors can be determined. The information contained in the discarded part is often related to noise. Therefore, discarding this part of the information can improve the experimental effect to a certain extent. In general, when the reconstruction threshold $t_1$ reaches 85%, the principal components found are considered to have a large effect on the sample set.
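The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pca_by_threshold`, the column-per-sample layout of `X`, and the synthetic demo data are assumptions made for the example.

```python
import numpy as np

def pca_by_threshold(X, t1=0.95):
    """PCA on X (n features x m samples, hypothetical layout).

    Keeps the smallest number k of leading eigenvectors whose
    cumulative eigenvalue share reaches the threshold t1.
    """
    # Center the samples so that they sum to zero.
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]

    # eigh returns ascending eigenvalues; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negatives
    eigvecs = eigvecs[:, order]

    # Smallest k with cumulative share >= t1.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, t1)) + 1, len(eigvals))

    W = eigvecs[:, :k]   # n x k eigenspace
    Y = W.T @ Xc         # k x m reduced data
    return W, eigvals[:k], Y, k

# Synthetic stand-in data: 6 features, 35 samples.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 35))
W_demo, lam_demo, Y_demo, k_demo = pca_by_threshold(X_demo, t1=0.95)
```

With the real data, `X` would hold one column per volunteer and one row per lipid, and `t1` would be set to 0.95 or 0.99 as in the experiments.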

Kernel Principal Component Analysis.
Compared with PCA, kernel principal component analysis (KPCA) can mine the nonlinear information contained in the data set. In KPCA, a kernel function is introduced and used to calculate the kernel matrix K of the input data. The Gaussian kernel is selected as the kernel function, so each entry of the kernel matrix K is the Gaussian kernel evaluated on a pair of samples. Then, the eigenvalues and eigenvectors of the kernel matrix K are calculated. After the eigenvalues are arranged from largest to smallest, the reconstruction threshold $t_1$ is set to determine the eigenspace W. In our experiment, the reconstruction threshold $t_1$ is set to 95% and 99% for both PCA and KPCA.
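A minimal sketch of this KPCA step is given below. It assumes the common Gaussian kernel parametrization $K_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$ and adds the conventional feature-space centering of K. Neither the bandwidth `sigma` nor the centering step is specified in the text, so both are assumptions, as are the function names.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix for X (n features x m samples)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    d2 = np.maximum(d2, 0.0)  # guard against rounding below zero
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_by_threshold(X, sigma=1.0, t1=0.95):
    """KPCA keeping the leading components whose eigenvalue share reaches t1."""
    m = X.shape[1]
    K = gaussian_kernel_matrix(X, sigma)

    # Assumption: center the kernel matrix in feature space.
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one

    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, t1)) + 1, m)

    # Scale eigenvectors so the feature-space directions have unit norm.
    alphas = eigvecs[:, :k] / np.sqrt(eigvals[:k])
    Y = (Kc @ alphas).T  # k x m projections of the training samples
    return K, eigvals[:k], Y, k
```

The choice of `sigma` strongly affects how many components are needed to reach a given threshold, so in practice it would have to be tuned to the data.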

Multiset Canonical Correlation Analysis.
Since the PCA and KPCA methods can each analyze only one sample set at a time, a multiset canonical correlation analysis (MCCA) method is used to obtain lipids that better distinguish the three sample sets. MCCA analyzes the relationship between multiple sets of data. Its main idea is to find, for each sample set, the typical variable $w_i$ for which the correlation coefficient β between the sample sets is maximal. Given that the number of sample sets is u and that each sample set includes N samples, the objective function is described as [...]

Using the Lagrange multiplier method for the objective function, the following equation can be obtained: [...] where C = [...]. Then, the influential lipids can be found by means of the typical variable $w_i$.
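One common way to realize MCCA is sketched below. It assumes the sum-of-covariances formulation: maximize $\sum_{i \neq j} w_i^T C_{ij} w_j$ subject to $\sum_i w_i^T C_{ii} w_i = 1$, which leads to the generalized eigenproblem $(C - D)\,w = \beta D\,w$. Here C is the matrix of all cross-covariance blocks and D is its block-diagonal part. This formulation, the ridge term `reg` (needed when there are more lipids than samples), and the function names are assumptions, not necessarily the form used in the paper.

```python
import numpy as np

def mcca_first_component(sets, reg=1e-3):
    """First MCCA component for a list of arrays (n_i features x N samples)."""
    N = sets[0].shape[1]
    centered = [X - X.mean(axis=1, keepdims=True) for X in sets]
    Z = np.vstack(centered)
    C = Z @ Z.T / N                      # all covariance blocks C_ij

    # D keeps only the within-set blocks C_ii.
    D = np.zeros_like(C)
    slices, start = [], 0
    for X in centered:
        stop = start + X.shape[0]
        D[start:stop, start:stop] = C[start:stop, start:stop]
        slices.append(slice(start, stop))
        start = stop

    R = C - D                            # between-set blocks only
    D_reg = D + reg * np.eye(D.shape[0]) # assumption: ridge for invertibility

    # Reduce R w = beta * D_reg w to a symmetric eigenproblem via Cholesky.
    L = np.linalg.cholesky(D_reg)
    L_inv = np.linalg.inv(L)
    M = L_inv @ R @ L_inv.T
    M = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(M)

    beta = vals[-1]                      # largest generalized eigenvalue
    w = L_inv.T @ vecs[:, -1]
    return beta, [w[s] for s in slices]

def top_indices(w, top=5):
    """Indices of the entries of w with the largest absolute weight."""
    return np.argsort(-np.abs(w))[:top]
```

Ranking lipids by the absolute entries of each $w_i$, as `top_indices` does, is one plausible reading of "found by means of the typical variable" and is likewise an assumption.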

Feature Selection.
Data can be reconstructed using PCA and KPCA as shown in Equation (11).
The new data after dimension reduction is $Y_{k\times m} = (y_1, y_2, \cdots, y_m)$. In this way, the n-dimensional data in the original data X are reduced to k-dimensional data. The obtained Y lies in a different space and no longer carries the lipid information of the original data directly. Therefore, based on relevant knowledge of mathematical statistics, a method that maps the eigenspace back to the input space is proposed in this paper.
First, the positions of the basis space components $w_{j\times p}$ that have the greatest influence on the new data Y are recorded; these positions correspond to lipids in the original data. Then, the frequency with which each original variable appears for the same eigenvector is calculated, and weights are assigned according to the eigenvalues. Finally, for every eigenvector, the frequencies are multiplied by the weights, and the products belonging to the same original variable are summed. The sums are arranged in descending order, and the k largest results identify the lipids that have the greater impact on acne.
In Equation (11), each element in Y is calculated as shown in Equation (12). Each element $y_{p\times q}$ is the sum of n terms, $y_{p\times q} = \sum_{j=1}^{n} w_{j\times p}\, x_{j\times q}$, and each term is the product of a basis space component and the original data. For each element $y_{p\times q}$, the n terms are arranged in descending order during accumulation, and each term is written as $w^{count}_{j\times p} \cdot x_{j\times q}$, $count \in [1, n]$. The larger the index count is, the smaller the value of the term is. The threshold value $t_2$ is selected to satisfy Equation (13), and the positions of the l basis space components $w_{j\times p}$ that give the largest terms are recorded.
Because each eigenvector has a different eigenvalue, the positions of the basis space components obtained for each eigenvector must be treated as a separate group. The position of a basis space component corresponds to an item of the original data, so the frequencies $f_{i\times j}$ $(i \in [1, k])$ of the different original data items in each group are calculated separately, where j denotes the position of the space component. Equation (14) is used to calculate the weighted products P.
In this way, k groups of P values are obtained. However, since the number and type of positions of the basis space components differ between groups, the P values of different groups that share the same basis space component position must be summed. The size of the final sum represents the amount of information contained, as shown in Equation (15).
Using the above method, the types of lipids that have a greater impact on acne in the original data can be determined.
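The scoring scheme described above can be sketched as follows. Several details are assumptions made for illustration because the original equations are not available here: the terms are ranked by absolute value, the threshold $t_2$ is applied to their cumulative share, and the eigenvalue itself is used as the weight of each group. The function name `score_features` is hypothetical.

```python
import numpy as np

def score_features(X, W, eigvals, t2=0.9):
    """Score the original variables from the eigenspace.

    X: n x m data (features x samples), W: n x k eigenvectors,
    eigvals: the k eigenvalues used as group weights (assumption).
    Returns an array of n scores; larger means more influential.
    """
    n, m = X.shape
    k = W.shape[1]
    scores = np.zeros(n)
    for p in range(k):                  # one group per eigenvector
        freq = np.zeros(n)
        for q in range(m):              # one element y_{p x q} per sample
            terms = np.abs(W[:, p] * X[:, q])
            total = terms.sum()
            if total == 0:
                continue
            order = np.argsort(terms)[::-1]
            share = np.cumsum(terms[order]) / total
            # Smallest l whose leading terms reach the share t2.
            l = min(int(np.searchsorted(share, t2)) + 1, n)
            freq[order[:l]] += 1        # count the recorded positions
        scores += eigvals[p] * freq     # weighted product, summed over groups
    return scores

def top_features(scores, top=10):
    """Indices of the highest-scoring original variables."""
    return np.argsort(scores)[::-1][:top]
```

Here `W` and `eigvals` would come from the PCA or KPCA step, and the indices returned by `top_features` would be mapped back to lipid labels.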

The Experiment
Results and Analysis of PCA and KPCA Methods.
The PCA or KPCA method can only analyze the DS, HS, and NC samples separately. The number of principal components is determined from the cumulative contribution rate, and the lipids that have a great influence on the samples are found using the corresponding eigenvalues and eigenvectors. In Figure 2, a Venn diagram shows the similarities and differences between the PCA and KPCA results for the DS, HS, and NC samples, i.e., the lipids that have a larger influence on the samples, when the cumulative contribution rate is 95%. The numbers in Figure 2 are the labels of certain lipids, and the descriptions of the lipids are presented in Table 1.
In Figure 2(a), three lipids, numbers 1311, 1264, and 1240, are found to have the greater impact on the DS samples by both the PCA and KPCA methods. In Figure 2(b), lipid number 608, which has the larger influence on the HS samples, is found by both the PCA and KPCA methods. In addition, lipid number 607 is found by the PCA method only, and two lipids, number 2334 and number 776, are found by the KPCA method only. In Figure 2(c), the same 7 lipids have a significant effect on the NC samples under both PCA and KPCA. In Figure 2, the contribution of the lipids decreases step by step along the direction of the arrow. [...] Table 2.
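The Venn-diagram comparison amounts to simple set operations. The sketch below uses the HS lipid labels quoted above and assumes, for illustration only, that they are the complete lists returned by each method. The helper `venn_regions` is hypothetical.

```python
def venn_regions(a, b):
    """Split two label sets into shared and method-specific regions."""
    return {"both": a & b, "only_a": a - b, "only_b": b - a}

# HS lipid labels as quoted in the text (assumed complete).
pca_hs = {608, 607}
kpca_hs = {608, 2334, 776}

regions_hs = venn_regions(pca_hs, kpca_hs)
# regions_hs["both"]   -> {608}
# regions_hs["only_a"] -> {607}
# regions_hs["only_b"] -> {776, 2334}
```

The same helper could be applied to the DS and NC results once the full label lists from each method are available.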
It can be seen from Figure 3 [...] Table 2 which bear on all sample sets DS, HS, and NC are obtained. Among the 19 kinds of lipids, 13 lipids exert different effects on the three types of sample sets. In Figure 4, the abscissa represents the samples, and the ordinate denotes the influence degree, i.e., the content of the lipids. The descriptions of the lipids found are presented in Table 3. It can be concluded from Figure 4 that the lipids with different effects on these three sample sets can be categorized into three types, shown in Figures 4(a)-4(c), respectively. Among the three line charts, Figure 4( [...] content of the lipids. As demonstrated above, these three lipids have a major impact only on DS, and thus, their content in DS is clearly higher than that in the other two sample sets. We can conclude that when the content of these three lipids increases significantly, the skin of the subject is in a diseased condition and needs treatment. Conversely, if a patient with acne shows a marked decrease in the content of these lipids during treatment, this indicates that the skin condition is improving.
Likewise, Figure 6 shows the changes in content of lipids No. 608, No. 2172, and No. 2334 for DS, HS, and NC. These three lipids are absent in DS and NC, whereas their content is markedly higher in HS. The result suggests that when the content of lipids such as No. 608, No. 2172, and No. 2334 rises, the subject's skin is in a transitional period. Figure 7 presents two lipids, No. 889 and No. 2374, which have effects only on the NC sample set. It can be seen that the content of these two lipids is notably higher in NC and rather low in DS and HS. If the content of No. 889 and No. 2374 rises significantly in the course of treatment, this indicates that the treatment is effective and the skin is in a healthy condition.

Conclusion
As one of the most common skin diseases in the world, acne affects a large number of patients, has a complex etiology, and causes psychological and physiological harm to patients. Therefore, research on and treatment of acne are of great significance. In this paper, the pathogenesis of acne is analyzed from the perspective of metagenomics. Because the volume of acne metagenomic data is large, it is difficult to find the valuable information hidden in it, so machine learning methods are proposed for the analysis. In the experiment, PCA, KPCA, and MCCA are used to analyze the data of the DS, HS, and NC sample sets, and the lipids that can distinguish the three sample sets are obtained. Comparing all experimental results, it is found that lipid No. 1240 can be used to distinguish the DS sample set, lipids such as No. 608 and No. 2334 can be used to distinguish the HS sample set, and lipids No. 1264 and No. 1311 can be used to distinguish the NC sample set. The experimental results suggest that machine learning methods can quickly and accurately determine and distinguish lipids in different sample sets, which can provide auxiliary guidance for the prevention, diagnosis, and treatment of acne.

Data Availability
The data cannot be made publicly available because of an agreement signed with the partner in this study.