Big data is a current trend with significant impact on information technologies. One of the central issues in big data applications is handling large-scale data sets, which often require computation resources provided by public cloud services. Analyzing big data efficiently has therefore become a major challenge. In this paper, we combine interval regression with the smooth support vector machine (SSVM) to analyze big data. The SSVM was recently proposed as an alternative to the standard SVM and has been shown to be more efficient than the traditional SVM in processing large-scale data. In addition, we propose a soft margin method that corrects the excursion of the separation margin and remains effective in the gray zone, where the distribution of the data is hard to describe and the separation margin between classes becomes blurred.
Big data has become one of the new research frontiers. Generally speaking, big data refers to collections of large-scale and complex data sets that are difficult to process using current database management systems and traditional data processing applications. In 2012, Gartner Inc. defined big data as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” [
One of the major applications of future parallel, distributed, and cloud systems is big data analytics [
The support vector machine (SVM) has been shown to be an efficient approach for a variety of data mining, classification, analysis, pattern recognition, and distribution estimation tasks [
However, several problems arise when using the SVM model. (1) Big data: when dealing with big data sets, the solution of an SVM with a nonlinear kernel may be difficult to find. (2) Noise and interaction: the distribution of the data becomes hard to describe, and the separation margin between classes becomes a “gray” zone. (3) Imbalance: the number of samples from one class is much larger than that from the other classes, which causes an excursion of the separation margin.
Under these circumstances, developing an efficient method to analyze big data becomes important. The smooth support vector machine (SSVM) has been shown to be more efficient than the traditional SVM in processing large-scale data [
In this study, we combine interval regression [
This study is organized as follows. Section
Since Tanaka et al. [
The interval linear regression model (
For a data set with crisp inputs and interval outputs, two interval regression models, the possibility and necessity models, are considered. By assumption, the center coefficients of the possibility regression model and the necessity regression model are the same [
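To make the interval setting concrete, the following sketch (using generic center/radius notation, which may differ from the paper's) shows how an interval linear model with interval coefficients produces an interval prediction from crisp inputs:

```python
def interval_predict(centers, radii, x):
    """Predict an interval output [center - radius, center + radius]
    from interval coefficients A_j = (c_j, r_j), where c_j is the
    center and r_j >= 0 the radius (spread) of the j-th coefficient."""
    center = sum(c * xj for c, xj in zip(centers, x))
    # Spreads combine through absolute values of the inputs
    radius = sum(r * abs(xj) for r, xj in zip(radii, x))
    return center - radius, center + radius

# Hypothetical model Y(x) = (5, 1) + (2, 0.5) * x1, evaluated at x1 = 3
# (the leading 1.0 is the intercept input x0)
lo, hi = interval_predict([5.0, 2.0], [1.0, 0.5], [1.0, 3.0])
# -> the interval [8.5, 13.5]
```

A possibility model and a necessity model differ only in how wide the radii are fit, with the necessity intervals included in the possibility intervals.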
In this section, we review methods commonly used for interval regression analysis.
Tanaka and Lee [
Interval regression analysis by the QP approach unifies the possibility and necessity models subject to the inclusion relations,
Hong and Hwang [
With the principle of QLSVM, the interval nonlinear regression model is given as follows:
Gaussian (radial basis) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)); hyperbolic tangent kernel: K(x, y) = tanh(κ xᵀy + θ); polynomial kernel: K(x, y) = (xᵀy + 1)^d.
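These standard kernel forms can be sketched in code as follows (σ, κ, θ, and d are the conventional kernel parameters; the defaults below are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def tanh_kernel(x, y, kappa=1.0, theta=0.0):
    # K(x, y) = tanh(kappa * <x, y> + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (<x, y> + 1)^d
    return (np.dot(x, y) + 1.0) ** d
```

All three map a pair of input vectors to a scalar similarity, so any of them can be substituted into the kernelized regression model.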
The advantage of Hong and Hwang’s approach is that it is model-free: there is no need to assume an underlying model function for the interval nonlinear regression model with crisp inputs and interval output.
There are two problems with the traditional SVM model. (1) Large scale: when dealing with large-scale data sets, the solution may be difficult to find when using an SVM with nonlinear kernels. (2) Imbalance: the number of samples from one class is much larger than that from the other classes, which causes an excursion of the separation margin.
To resolve these problems, Huang [
With the principle of RSVM, the interval nonlinear regression model is listed as follows:
The advantage of Huang’s approach is that it reduces the number of support vectors by randomly selecting a subset of samples. When processing large-scale data sets, a solution can then be found easily even with nonlinear kernels.
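The reduced-kernel idea behind this approach can be sketched as follows; the subset size m_bar, the Gaussian kernel, and the parameter gamma are illustrative choices for this sketch, not values taken from the paper:

```python
import numpy as np

def reduced_kernel(A, m_bar, gamma=0.5, seed=0):
    """Form the reduced Gaussian kernel matrix K(A, A_bar), where A_bar
    is a random subset of m_bar rows of the data matrix A -- the key
    idea of the reduced SVM (RSVM)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=m_bar, replace=False)
    A_bar = A[idx]
    # Squared Euclidean distances between every row of A and of A_bar
    d2 = ((A[:, None, :] - A_bar[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2), idx

A = np.random.default_rng(1).normal(size=(200, 5))
K, idx = reduced_kernel(A, m_bar=20)
# K is 200 x 20 instead of the full 200 x 200 kernel matrix
```

Only m_bar columns of the kernel matrix are ever formed, which is what makes the nonlinear problem tractable on large data sets.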
In this section, we first propose the soft margin method, which corrects the excursion of the separation margin and remains effective in the gray zone. We then introduce the formulation of interval regression with SSVM for analyzing big data.
In a conventional SVM, the sign function is used as the decision function. Its separation threshold is 0, which leads to an excursion of the separation margin for unbalanced data sets. The aim of hard-margin separation is to find a hyperplane with the largest distance to the nearest training data. However, the hard-margin formulation has the following limitations: a separating hyperplane may not exist for certain training data; complete separation with zero training error can lead to suboptimal prediction error; and it is difficult to deal with the gray zone between classes.
Thus, we propose the soft margin method to correct the excursion of the separation margin and to remain effective in the gray zone. The soft margin is defined as
With the soft margin as shown in Figure
Soft margin.
The smooth support vector machine (SSVM) is solved by a fast Newton-Armijo algorithm [
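The smoothing that makes a Newton-Armijo method applicable replaces the non-differentiable plus function (x)₊ = max(x, 0) in the SVM objective with Lee and Mangasarian's smooth approximation p(x, α) = x + (1/α) log(1 + e^(−αx)); a minimal sketch:

```python
import numpy as np

def plus(x):
    # The plus function (x)_+ = max(x, 0) in the SVM objective
    return np.maximum(x, 0.0)

def smooth_plus(x, alpha=5.0):
    """Smooth approximation p(x, alpha) = x + (1/alpha)*log(1 + e^{-alpha*x}).
    It is twice differentiable, so a Newton-Armijo method can be applied,
    and the error versus (x)_+ is at most log(2)/alpha."""
    return x + np.log1p(np.exp(-alpha * x)) / alpha

x = np.linspace(-2.0, 2.0, 9)
err = np.max(np.abs(smooth_plus(x, alpha=50.0) - plus(x)))
# err stays within the log(2)/alpha bound
```

As alpha grows, the smooth surrogate converges uniformly to the plus function, so the smoothed problem approaches the original one.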
Suppose that
The first hyperplane (
In Lee and Mangasarian’s approach [
The objective function in (
For specific data sets, an appropriate nonlinear mapping
With the principle of SSVM, we can formulate the interval linear regression model as follows:
Given (
The Karush-Kuhn-Tucker (KKT) conditions that the partial derivatives of
Substituting (
Similarly, we can obtain the interval nonlinear regression model by mapping
To illustrate the methods developed in Section
TAIEX [
TAIEX [
TAIEX [
TAIEX [
The comparison is made using the measure of fitness [
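For illustration only (the exact fitness measure follows the cited reference and may differ), one generic way to score the agreement between a predicted interval and an observed interval is the ratio of the intersection length to the union length:

```python
def interval_fitness(pred, obs):
    """Generic interval agreement score |pred ∩ obs| / |pred ∪ obs|,
    equal to 1 for identical intervals and 0 for disjoint ones.
    Illustrative assumption, not necessarily the paper's definition."""
    (pl, pu), (ol, ou) = pred, obs
    inter = max(0.0, min(pu, ou) - max(pl, ol))
    union = max(pu, ou) - min(pl, ol)
    return inter / union if union > 0 else 1.0

score = interval_fitness((1.0, 3.0), (2.0, 4.0))  # overlap [2, 3] over [1, 4]
```

Averaging such a score over all observations gives a single number for comparing interval regression models.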
Table
Comparison results of the measure of fitness.
|  | Tanaka and Lee [ | Hong and Hwang [ | Huang [ | Proposed methods |
|---|---|---|---|---|
|  | 0.1404 | 0.1395 | 0.1412 | 0.1354 |
|  | 0.1573 | 0.1562 | 0.1581 | 0.1429 |
|  | 0.1694 | 0.1658 | 0.1706 | 0.1583 |
|  | 0.1714 | 0.1695 | 0.1723 | 0.1609 |
In this paper, we combined interval regression with SSVM to analyze big data. In addition, we proposed the soft margin method to correct the excursion of the separation margin and to remain effective in the gray zone. The SSVM is solved by a fast Newton-Armijo algorithm and has been extended to nonlinear separation surfaces by means of nonlinear kernel techniques. The experimental results show that the proposed methods are more efficient than the existing methods. In this study, we estimated the interval regression model with crisp inputs and interval output. In future work, both interval inputs-interval output and fuzzy inputs-fuzzy output will be considered.
The authors declare that there is no conflict of interest regarding the publication of this paper.
The authors appreciate the anonymous referees for their useful comments and suggestions which helped to improve the quality and presentation of this paper. The original version was accepted by International Conference on Business, Information, and Cultural Creative Industry, 2014. Also, special thanks are due to the National Science Council, Taiwan, for financially supporting this research under Grants nos. NSC 102-2410-H-141-012-MY2 (C. H. Huang) and NSC 102-2410-H-259-039-(H. Y. Kao).