Dimension Reduction of Big Data Using Recognition of Data Features Based on the Copula Function and Principal Component Analysis

Nowadays, data are generated around the world at high speed; therefore, recognizing features and reducing the dimensions of data without losing useful information is of great importance. There are many approaches to dimension reduction, including principal component analysis (PCA), which reduces the dimension of the data by identifying the effective dimensions at an acceptable level. In the usual principal component analysis, the data are either normally distributed or are first normalized, and the principal component analysis method is then applied. Many studies have treated principal component analysis as a step of data preparation. In this paper, we propose a method that improves principal component analysis and makes data analysis easier and more efficient. We first identify the relationships between the data by fitting a multivariate copula function to them and simulate new data using the estimated parameters; then, we reduce the dimensions of the new data by principal component analysis. The aim is to improve the performance of the principal component analysis method in finding effective dimensions.


Introduction
In many real-world applications, reducing high-volume data is important and necessary as a preprocessing stage of data analysis. For example, in data mining applications, dimensionality reduction is considered one of the most important stages for removing data redundancy, increasing the precision of measurement, and improving the decision-making process. Analyzing high-volume data is intrinsically difficult, since many learning and data processing algorithms require high-volume computations. In dimensionality reduction methods, extraction of data features is highly important. A widely used method for reducing the dimension of data in data mining, during the data preparation phase, is principal component analysis. The PCA method can be used if the original variables are correlated and homogeneous, if each component is guaranteed to be independent, and if the dataset is normally distributed [1,2]. A critical issue for the majority of dimensionality reduction studies is how to provide a convenient way to generate correlated multivariate random variables without imposing constraints on specific types of marginal distributions. An appropriate approach to this problem is to use copula theory [3,4]. In this paper, we first use the copula function to study the correlations and relationships between the data, to determine and eliminate irrelevant variables, and to simulate new data using the estimated parameters; then, using the PCA method, we reduce the dimensions of the data [4][5][6].
Principal Component Analysis
The principal components are obtained from the eigenvectors and eigenvalues of the covariance matrix. By its mathematical definition, principal component analysis is an orthogonal transformation taking the data to a new coordinate system such that the largest data variance lies on the first coordinate axis, the second largest variance on the second coordinate axis, and so on. Principal component analysis aims at transforming a dataset X with m dimensions into data Y with l dimensions. It is assumed that the matrix X is formed of the vectors X_1, X_2, ⋯, X_n, each with m entries and placed as a column of X, so the data matrix has the form m × n. The principal components depend only on the covariance matrix Σ (or the correlation matrix ρ) of the random variables X_1, X_2, ⋯, X_n [7].

Calculating the Empirical Mean and Covariance Matrix and Normalizing the Data. To calculate the covariance matrix, the data first have to be normalized. To do so, the vector of empirical means is calculated as follows:
$$u[i] = \frac{1}{n}\sum_{j=1}^{n} X[i,j], \quad i = 1, 2, \cdots, m. \tag{1}$$
Clearly, the empirical mean is taken along the rows of the matrix.
Then, the matrix of deviations from the mean is obtained as
$$B = X - u\,h, \tag{2}$$
where h is a 1 × n vector with every entry equal to 1. The covariance matrix Σ with m × m dimensions is obtained as
$$\Sigma = E[B \otimes B] = \frac{1}{n}\, B\, B^{*}, \tag{3}$$
where E is the arithmetic mean, ⊗ is the outer product, and B* is the conjugate transpose of the matrix B. Consider the random vector X′ = [X_1, X_2, ⋯, X_n] and assume that it has covariance matrix Σ with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n ≥ 0. Consider the following linear combinations:
$$Y_i = l_i' X = l_{i1} X_1 + l_{i2} X_2 + \cdots + l_{in} X_n, \quad i = 1, 2, \cdots, n. \tag{4}$$
Using relationship (4), we have
$$\operatorname{var}(Y_i) = l_i' \Sigma\, l_i, \qquad \operatorname{cov}(Y_i, Y_k) = l_i' \Sigma\, l_k, \quad i, k = 1, 2, \cdots, n. \tag{5}$$
The principal components are the uncorrelated linear combinations Y_1, Y_2, ⋯, Y_n whose variances in relationship (5) are as large as possible. The first principal component is the linear combination with maximum variance. Clearly, var(Y_1) = l_1′Σl_1 can be increased without bound by multiplying l_1 by a constant, so the coefficient vectors are restricted to unit length. That is, the first principal component is the linear combination l_1′X that maximizes var(Y_1) subject to l_1′l_1 = 1. The second principal component is the linear combination l_2′X that maximizes var(Y_2) subject to l_2′l_2 = 1 and cov(l_1′X, l_2′X) = 0, and so on up to the n-th principal component.
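As a concrete illustration, the empirical mean, the deviation matrix, and the covariance matrix described above can be computed with a few lines of NumPy. The data values here are hypothetical and the variable names are ours:

```python
import numpy as np

# Toy data matrix X in the m x n layout used in the text:
# m = 3 variables (rows), n = 5 observations (columns); values are illustrative.
X = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0, 4.0],
              [0.5, 0.7, 0.9, 1.1, 1.3]])

u = X.mean(axis=1, keepdims=True)     # empirical mean of each variable (row)
B = X - u                             # deviations from the mean, B = X - u h
Sigma = (B @ B.T) / (X.shape[1] - 1)  # empirical covariance matrix (m x m), using the
                                      # unbiased 1/(n - 1) normalization of np.cov

print(Sigma)
```

NumPy's own `np.cov(X)` computes the same matrix and can serve as a cross-check; the 1/n normalization in the text differs from the unbiased 1/(n − 1) version only by a constant factor.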
According to relationship (5), the total population variance is
$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{nn} = \lambda_1 + \lambda_2 + \cdots + \lambda_n, \tag{6}$$
and the share of the total variance due to the k-th component (k = 1, 2, ⋯, n) is
$$\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}. \tag{7}$$
If, for large n, most of the total population variance (80 or 90%) can be attributed to the first few components, these components can replace the n original variables without much loss of information [2,[8][9][10]].
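The variance shares above can be read directly off the eigenvalues of the covariance matrix. A minimal sketch using simulated, hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 observations of 4 variables, two of them strongly correlated.
Z = rng.standard_normal((500, 4))
Z[:, 1] = 0.9 * Z[:, 0] + 0.1 * rng.standard_normal(500)

Sigma = np.cov(Z, rowvar=False)            # covariance matrix of the variables
eigvals = np.linalg.eigvalsh(Sigma)[::-1]  # eigenvalues sorted lambda_1 >= ... >= lambda_4
ratios = eigvals / eigvals.sum()           # share of total variance per component

print(np.cumsum(ratios))  # cumulative share of variance of the first k components
```

If the first few cumulative ratios reach 80–90%, the corresponding components can stand in for the original variables, exactly as described in the text.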

Copula Function
In general, the copula function links a multivariate distribution to its marginal distributions. A copula function is a multivariate distribution whose marginal distributions are uniform on the interval [0,1] [11][12][13].
2.1. Characteristics of the Copula Function. Assume the following characteristics for C : I² ⟶ I, where I = [0,1]:
(1) For every u, v ∈ [0,1], C(u, 0) = C(0, v) = 0, C(u, 1) = u, and C(1, v) = v.
(2) For every u_1, u_2, v_1, v_2 ∈ [0,1] with u_1 ≤ u_2 and v_1 ≤ v_2, C(u_2, v_2) − C(u_2, v_1) − C(u_1, v_2) + C(u_1, v_1) ≥ 0.
A function C satisfying the two above conditions is called a copula function [14].
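As a sanity check of these two conditions, consider the simplest copula, the independence (product) copula C(u, v) = uv; the following snippet verifies both conditions numerically at a few illustrative points of our choosing:

```python
def C(u, v):
    # Independence (product) copula: the copula of two independent uniforms
    return u * v

u, v = 0.3, 0.7
# Condition (1): boundary behaviour on the edges of the unit square
assert C(u, 0.0) == 0.0 and C(0.0, v) == 0.0
assert C(u, 1.0) == u and C(1.0, v) == v

# Condition (2): the C-volume of any rectangle [u1, u2] x [v1, v2] is nonnegative
u1, u2, v1, v2 = 0.2, 0.6, 0.1, 0.9
volume = C(u2, v2) - C(u2, v1) - C(u1, v2) + C(u1, v1)
assert volume >= 0.0

print("both copula conditions hold for C(u, v) = u * v")
```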
2.2. Sklar's Theorem. Sklar's theorem states that if a joint distribution function H is available with marginal distributions F and G, then there exists a copula function C such that, for every x, y ∈ ℝ,
$$H(x, y) = C(F(x), G(y)).$$
If F and G are continuous, then the copula function C is unique; otherwise, C is uniquely defined on Ran(F) × Ran(G).
The most important application of the copula function is the formulation of a proper method to generate correlated multivariate random variables and to provide a solution for the density estimation problem through transformation [15].
Consider the invertible transformation of n continuous random variables X_1, X_2, ⋯, X_n, based on their distribution functions, into n variables with uniform distribution U_1 = F_1(X_1), U_2 = F_2(X_2), ⋯, U_n = F_n(X_n); let the joint probability density function of X_1, ⋯, X_n be f(X_1, ⋯, X_n) and the joint probability density function of U_1, U_2, ⋯, U_n be c(U_1, ⋯, U_n). The probability density function f(X_1, ⋯, X_n) can therefore take a nonparametric form (unknown distribution). Here, the probability density function c(U_1, ⋯, U_n) of U_1, U_2, ⋯, U_n is estimated instead of that of X_1, X_2, ⋯, X_n, so that the density estimation problem becomes simpler. Simulation is then carried out so that random samples X_1, X_2, ⋯, X_n are obtained through the inverse transformation X_i = F_i^{-1}(U_i). According to Sklar's theorem, a unique n-dimensional copula function C exists on [0,1]^n with uniform marginal distributions U_1, U_2, ⋯, U_n. That is, every distribution function F with margins F_1, F_2, ⋯, F_n can be written as
$$F(x_1, \cdots, x_n) = C\big(F_1(x_1), \cdots, F_n(x_n)\big).$$
To evaluate a copula function selected via an estimated parameter, and to avoid imposing any hypothesis on the distributions, the empirical distribution function can be used. The empirical copula function is useful for studying the dependence structure of multivariate random vectors. In general, the empirical copula function is
$$C_n(u_1, \cdots, u_n) = \frac{1}{n} \sum_{i=1}^{n} I\big(F_1(X_{1i}) \le u_1, \cdots, F_n(X_{ni}) \le u_n\big),$$
where I(•) is an indicator function [16].
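A minimal sketch of the transformation to uniform pseudo-observations and of the empirical copula, using simulated dependent data (the sample, its size, and the evaluation point are illustrative choices of ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical positively dependent pair (X1, X2)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

def pseudo_obs(x):
    # Rank-based probability integral transform: U_i = rank_i / (n + 1),
    # which keeps the pseudo-observations strictly inside (0, 1)
    return (np.argsort(np.argsort(x)) + 1) / (len(x) + 1)

u1, u2 = pseudo_obs(x1), pseudo_obs(x2)

def empirical_copula(u, v):
    # C_n(u, v): fraction of observations with U1_i <= u and U2_i <= v
    return np.mean((u1 <= u) & (u2 <= v))

value = empirical_copula(0.5, 0.5)
print(value)  # exceeds 0.25 (the independence value) because of positive dependence
```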

2.3. Gaussian Copula Function. The difference between the Gaussian copula function and the multivariate normal distribution function is that the former allows arbitrary marginal distribution functions to be combined into a joint distribution [14]. In probability theory and statistics, the multivariate normal distribution is the generalization of the one-dimensional normal distribution [17]. The Gaussian copula function is defined as
$$C(U_1, \cdots, U_n) = \Phi_{\Sigma}\big(\Phi^{-1}(U_1), \cdots, \Phi^{-1}(U_n)\big),$$
where Φ is the standard normal distribution function, Φ_Σ is the joint distribution function of a multivariate normal vector with standard normal margins, and Σ is the correlation matrix. The resulting copula function C(U_1, ⋯, U_n) is called a Gaussian copula function.
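Sampling from a Gaussian copula follows directly from this definition: draw a multivariate normal vector with correlation matrix Σ and push each coordinate through the standard normal CDF. A sketch with an illustrative 3 × 3 correlation matrix of our own choosing:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Illustrative correlation matrix Sigma for a 3-dimensional Gaussian copula
Sigma = np.array([[1.0, 0.7, 0.3],
                  [0.7, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

# Z ~ N(0, Sigma) via the Cholesky factor; U = Phi(Z) then has uniform (0, 1)
# margins while keeping the Gaussian dependence structure.
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((1000, 3)) @ L.T
U = norm.cdf(Z)

print(U.min(), U.max())  # all entries lie strictly between 0 and 1
```

Arbitrary margins can then be imposed by applying the inverse CDF of each desired marginal distribution to the columns of U.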

Methodology
In this research, a two-stage method is used for dimensionality reduction. First, the empirical copula function and a fit of the Gaussian copula function to the data are used to estimate the parameter ρ for the variables X_1, X_2, ⋯, X_n. An important advantage of using the copula function for multivariate distributions is that the correlation between the variables is taken into account; in fact, independence of the variables is not required, and the correlation structure between the variables is modeled explicitly [18]. For estimation purposes, the value of the correlation coefficient has to be specified. To do so, the Pearson correlation coefficient is used, defined for two variables X_i and X_j as
$$\rho(X_i, X_j) = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_{X_i}\, \sigma_{X_j}},$$
where σ_{X_i} and σ_{X_j} are the standard deviations of X_i and X_j, respectively. Then, the variables with lower correlation than the others are eliminated, and using the estimated parameters and the Gaussian copula function for the remaining variables X_1, X_2, ⋯, X_m, the m uniform variables U_1 = F_1(X_1), U_2 = F_2(X_2), ⋯, U_m = F_m(X_m) are generated (m ≤ n) and used in place of X_1, X_2, ⋯, X_m in the principal component analysis method. After dimensionality reduction, the results are compared with those of applying the method to the raw data [16,19].
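The two-stage procedure can be sketched end to end in Python. Everything here is an illustration: the simulated data, the "weak" fourth variable, the screening rule (dropping the variable with the lowest mean absolute Pearson correlation), and the rank-based copula fit are simplified choices of ours, not the paper's exact implementation:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 500

# Hypothetical raw data: three related variables plus one weakly related variable
base = rng.standard_normal(n)
X = np.column_stack([base + 0.3 * rng.standard_normal(n),
                     base + 0.3 * rng.standard_normal(n),
                     0.8 * base + 0.6 * rng.standard_normal(n),
                     rng.standard_normal(n)])

# Stage 1a: drop the variable with the lowest mean absolute Pearson correlation
R = np.corrcoef(X, rowvar=False)
mean_abs = (np.abs(R).sum(axis=0) - 1.0) / (R.shape[0] - 1)
drop = int(np.argmin(mean_abs))
X_red = np.delete(X, drop, axis=1)

# Stage 1b: fit a Gaussian copula by estimating the correlation of the normal scores
ranks = np.argsort(np.argsort(X_red, axis=0), axis=0) + 1
U = ranks / (n + 1)                       # pseudo-observations in (0, 1)
Sigma_hat = np.corrcoef(norm.ppf(U), rowvar=False)

# Stage 1c: simulate new uniform data from the fitted Gaussian copula
Z_new = rng.standard_normal((n, X_red.shape[1])) @ np.linalg.cholesky(Sigma_hat).T
U_new = norm.cdf(Z_new)

# Stage 2: principal component analysis on the simulated data
eigvals = np.linalg.eigvalsh(np.cov(norm.ppf(U_new), rowvar=False))[::-1]
print(eigvals / eigvals.sum())  # variance ratio of each principal component
```

Because the weakly correlated variable has been removed and the copula preserves the dependence of the remaining variables, the first component captures most of the variance.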

Numerical Results
During the past 30 years, an increasing prevalence of urinary stone disease has been observed. About 80% of kidney stones are of the calcium oxalate type. Here, 79 urine samples are analyzed to determine whether some physical characteristics of the urine are related to the formation of calcium oxalate crystals. These data include the following columns (variables), which are available at https://cran.r-project.org/web/packages/cond.
Using the Gaussian copula function, the correlation values of the variables are obtained. Considering Table 1, it is observed that the correlation of variable X2 is lower than that of the other variables, so it is eliminated at the first stage. After estimation of the parameters, new data are generated. Figure 1 shows the copula function for the original data and for the data generated by this method. To check whether the data are generated correctly, a Q-Q plot is drawn.
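The Q-Q check can be sketched as follows: compare the sorted quantiles of an original variable with those of its simulated counterpart; points close to the 45-degree line indicate that the generated data reproduce the marginal distribution. The samples below are hypothetical stand-ins for one urine variable, drawn from the same skewed distribution so that the quantiles should align:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical "original" and "generated" samples of a single variable
original = rng.gamma(shape=2.0, scale=1.0, size=300)
generated = rng.gamma(shape=2.0, scale=1.0, size=300)

q = np.linspace(0.05, 0.95, 19)
q_orig = np.quantile(original, q)   # quantiles of the original sample
q_gen = np.quantile(generated, q)   # quantiles of the generated sample

# In a Q-Q plot one would draw q_gen against q_orig; here we report the
# largest quantile gap as a rough numerical summary of the agreement.
max_gap = float(np.max(np.abs(q_orig - q_gen)))
print(max_gap)
```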

Advances in Mathematical Physics
Figure 2 shows that the data are generated correctly. In the second stage, after elimination of the variable X2, principal component analysis is performed on the generated data. Figure 3 shows the principal components for the original data and the generated data after removal of the variable X2.
The ratios of the population variance related to the principal components are provided in the following tables, and the corresponding scree plot follows. In these data, X1 is the specific gravity of the urine, X2 is the urine pH, X3 is the urine osmolarity (corresponding to the concentration of solutes), X4 is the urine conductivity (corresponding to the concentration of charged ions in solution), X5 is the urea concentration (mM/liter), and X6 is the calcium concentration (mM/liter).

Considering Tables 2 and 3 as well as Figure 4, it is observed that, in the dimensionality reduction method presented in this research, the first two components account for more than 80% of the population variance and the first component for more than 70%.
Example 1. To recognize characters displayed as black-and-white images, each character image is enclosed in a rectangular box and the numbers of black and white pixels in the box are measured. The character images are based on 20 different fonts, and each image has been randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus has been converted into numerical attributes scaled to fit within the range 0–15; the following 7 variables are used here (the data are available at https://cran.r-project.org/web/packages/mlbench/index.html).
There are 2000 observations of these variables.
Figure 4: Diagram of the population variance ratio related to the principal components for the original data and the data generated through the recommended method.

Using the Gaussian copula function, the correlation values of the variables are obtained. Here, X1 is the horizontal position of the box (x.box), X2 is the vertical position of the box (y.box), X3 is the width of the box (width), X4 is the height of the box (height), X5 is the total number of "on" pixels in the box (onpix), X6 is the mean x value of the "on" pixels in the box (x.bar), and X7 is the mean y value of the "on" pixels in the box (y.bar).
Considering Table 4, it is observed that the correlations of the variables X6 and X7 are lower than those of the other variables. So, these two are eliminated at the first stage; then, the Gaussian copula function is fitted to the reduced data, and new data are generated through the estimated parameters, as shown in Figure 5. To check the generated data, a Q-Q plot is drawn.
Now, principal component analysis is performed on the generated data. Diagrams of the principal components are shown in Figure 6.

Scree plots of the population variance ratios related to the principal components for both methods are shown in Figure 7.
According to Tables 5 and 6 as well as Figure 7, it is observed that, in the recommended method, the ratio of the population variance for the first two components amounts to almost 85% and for the first component to almost 80%, whereas, for the original data, the first three components are needed to reach almost 85% of the population variance.

Conclusion
Considering the two aforementioned examples, it has been observed that the data generated according to the estimated parameters of the Gaussian copula distribution are consistent with the original data (see Figures 2 and 8). Using the method recommended in this research, the copula function recognizes the dependencies and the dependence structure between the variables and, in addition to eliminating redundant data, improves the performance of the principal component analysis method in finding effective dimensions (see Figures 4 and 7 and Tables 2, 3, 5, and 6). Considering that nowadays data are generated at high speed, appropriate and efficient methods for dimensionality reduction without loss of information are important and necessary, and the method recommended in this research is a useful one for this purpose. The recommended method can also be combined with other dimensionality reduction techniques so that data are prepared for further analysis, for example in data mining.

Data Availability
The data that support the findings of this study are openly available at https://cran.r-project.org/web/packages/cond and https://cran.r-project.org/web/packages/mlbench/index.html.