This study proposes a hybrid data envelopment analysis (DEA) and support vector machine (SVM) approach for efficiency estimation and classification in web security. In the proposed framework, the factors and efficiency scores from the DEA models are fed into an SVM, which learns patterns of web security performance and provides further decision support. A numerical case study of hospital web security efficiency demonstrates the feasibility of this design.
Over the past decades, the Internet and the World Wide Web (WWW) have become prevalent platforms for information sharing and transformation. Consequently, web security management has become a major concern in for-profit as well as nonprofit organizations. Assuring information systems security involves not only tangible costs but also intangible inputs, which makes it challenging to evaluate the performance of such investments. This section concisely introduces the basics of web security, data envelopment analysis (DEA), support vector machine (SVM), classification in web security, and the goals of this study.
Phishing [
The increasing popularity of web-based systems has resulted in phishing behaviors causing significant financial damage to both individuals and organizations. The statistics provided by the Anti-Phishing Working Group [
Protection against attacks and unauthorized access to sensitive information is vital on the Internet. Several technical antiphishing solutions have been proposed on the server side or the client side. Server-side defenses employ Secure Sockets Layer certificates, user-selected site images, and other security indicators to help users verify the legitimacy of web sites, while client-side defenses equip web browsers with automatic phishing-detection features or add-ons (e.g., SpoofGuard) to warn users against suspected phishing sites [
Efficiency evaluation is a common issue in various domains and organizations, which is critical to investment analysis and resource allocation. Data envelopment analysis [
One has
Based on the CCR ratio model, the objective function
Notably, the solution space of the CCR LP model is smaller than that of the CCR ratio model due to the constraint (
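For reference, the CCR model in its standard textbook form can be sketched as follows. The notation is an assumption made here and may differ from the authors' symbols: DMU $k$ is evaluated among $n$ DMUs, $x_{ij}$ and $y_{rj}$ denote the $i$th input and $r$th output of DMU $j$, $v_i$ and $u_r$ are the corresponding weights, and $\varepsilon$ is a small positive constant.

```latex
% CCR ratio (fractional) model for DMU k
\begin{aligned}
\max_{u,v}\quad & h_k = \frac{\sum_{r=1}^{s} u_r\, y_{rk}}{\sum_{i=1}^{m} v_i\, x_{ik}} \\
\text{s.t.}\quad & \frac{\sum_{r=1}^{s} u_r\, y_{rj}}{\sum_{i=1}^{m} v_i\, x_{ij}} \le 1, \qquad j = 1,\dots,n, \\
& u_r \ge \varepsilon > 0,\quad v_i \ge \varepsilon > 0 .
\end{aligned}

% Equivalent CCR linear program: the denominator is normalized to 1
\begin{aligned}
\max_{u,v}\quad & h_k = \sum_{r=1}^{s} u_r\, y_{rk} \\
\text{s.t.}\quad & \sum_{i=1}^{m} v_i\, x_{ik} = 1, \\
& \sum_{r=1}^{s} u_r\, y_{rj} - \sum_{i=1}^{m} v_i\, x_{ij} \le 0, \qquad j = 1,\dots,n, \\
& u_r \ge \varepsilon > 0,\quad v_i \ge \varepsilon > 0 .
\end{aligned}
```

In this form, the normalization $\sum_{i} v_i x_{ik} = 1$ is the extra constraint that restricts the linear program to a subset of the weights admitted by the ratio model, while leaving the optimal efficiency score unchanged.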
Support vector machine (SVM) is a popular classifier and pattern recognition method based on statistical learning [
Hence, for such a generalized optimal separating hyperplane, the functional to be minimized includes an extra term accounting for the cost of overlapping errors. In fact the cost function (
For
Due to the KKT conditions, a dual Lagrangian function has to be maximized as follows:
In learning a nonlinear classifier, we can define a kernel and the dual Lagrangian to be maximized as follows:
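As a point of reference, the kernelized soft-margin dual problem is usually written in the following standard form. The symbols are assumptions made for this sketch and may not match the authors' notation: $\alpha_i$ are the Lagrange multipliers, $y_i \in \{-1, +1\}$ are the class labels, $C$ is the penalty parameter for overlapping errors, and $K(\cdot,\cdot)$ is the kernel function.

```latex
\begin{aligned}
\max_{\alpha}\quad & L_d(\alpha) = \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j) \\
\text{s.t.}\quad & \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1,\dots,l .
\end{aligned}

% Resulting decision function
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{l} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \right)

% Commonly used kernels (gamma, r, d are kernel parameters)
\begin{aligned}
\text{linear:}\quad & K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^{\top}\mathbf{x}_j \\
\text{polynomial:}\quad & K(\mathbf{x}_i,\mathbf{x}_j) = (\gamma\, \mathbf{x}_i^{\top}\mathbf{x}_j + r)^{d} \\
\text{radial basis:}\quad & K(\mathbf{x}_i,\mathbf{x}_j) = \exp\!\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}\right) \\
\text{sigmoid:}\quad & K(\mathbf{x}_i,\mathbf{x}_j) = \tanh(\gamma\, \mathbf{x}_i^{\top}\mathbf{x}_j + r)
\end{aligned}
```

Setting $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^{\top}\mathbf{x}_j$ recovers the linear case.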
Bose and Leung [
This study integrates DEA and SVM for web security efficiency estimation and classification for three key reasons. First, as medical informatics and security gain growing attention, a practical evaluation scheme is needed. We develop a DEA model to assess the relative efficiency of hospitals, as a pioneering study in this field. Second, in addition to evaluating the current websites at one snapshot, other websites may later need to be assessed as new data, so an efficient and reliable classifier is essential to discriminate future data. Among the many machine learning methods, SVM is relatively robust and well established, so we integrate DEA and SVM to build the efficiency classification platform. Third, compared with related studies, this work emphasizes web security preparedness rather than the detection of potential web attacks. That is, we assess web security from a proactive view that is not limited to technical aspects. The rest of this paper is organized as follows. Section
This section develops the hybrid DEA and SVM approaches for efficiency classification. Consider
The research procedure.
Based on (
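This step iteratively separates the fully productive group from the remaining DMUs. Starting with all DMUs in the candidate set SC, each iteration computes the efficiency scores of the DMUs in SC, stores the efficient DMUs (those with score 1) in the current tier, and determines whether the extraction process continues. If it does, the procedure proceeds to the next tier with the DMUs not yet extracted; otherwise it stops and outputs the DMUs in each tier.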
The procedure of this step is demonstrated in Figure
Tier extraction [
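The tier extraction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ratio_efficiency`, `extract_tiers`, and `max_tiers` are hypothetical, the rule that all DMUs left at the last tier share that tier is an assumption, and the efficiency function is the single-input, single-output special case of CCR (each ratio divided by the best ratio in the set) rather than the full multi-factor LP.

```python
from typing import Callable, Dict

# Hypothetical record layout: {"input": x, "output": y} per DMU.
DMU = Dict[str, float]


def ratio_efficiency(dmus: Dict[str, DMU]) -> Dict[str, float]:
    """Single-input, single-output CCR efficiency: output/input ratio
    of each DMU divided by the best ratio in the current set."""
    ratios = {k: d["output"] / d["input"] for k, d in dmus.items()}
    best = max(ratios.values())
    return {k: r / best for k, r in ratios.items()}


def extract_tiers(
    dmus: Dict[str, DMU],
    efficiency_fn: Callable[[Dict[str, DMU]], Dict[str, float]] = ratio_efficiency,
    max_tiers: int = 4,
    tol: float = 1e-9,
) -> Dict[str, int]:
    """Peel off efficient DMUs (score 1) tier by tier.

    Assumption: once the last allowed tier is reached, every DMU still
    in the candidate set is assigned to that tier.
    """
    remaining = dict(dmus)  # candidate set SC
    tiers: Dict[str, int] = {}
    tier = 1
    while remaining:
        if tier == max_tiers:
            for name in remaining:
                tiers[name] = tier
            break
        scores = efficiency_fn(remaining)
        efficient = [k for k, s in scores.items() if s >= 1.0 - tol]
        for name in efficient:
            tiers[name] = tier
            del remaining[name]
        tier += 1
    return tiers


# Made-up example data, for illustration only.
example = {
    "A": {"input": 2.0, "output": 4.0},
    "B": {"input": 4.0, "output": 8.0},
    "C": {"input": 5.0, "output": 5.0},
    "D": {"input": 10.0, "output": 5.0},
    "E": {"input": 10.0, "output": 2.0},
    "F": {"input": 10.0, "output": 1.0},
}
tiers = extract_tiers(example)
print(tiers)
```

Passing the efficiency computation in as a function keeps the loop independent of the DEA model, so a full multi-input, multi-output CCR solver could be substituted without changing the tiering logic.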
Here the classification schema is learned as
The set of testing data will be used to validate the classification model.
In the next section, the case of hospital web security efficiency is studied to demonstrate the procedure developed above.
This study investigates 91 medical institutes in Taiwan, among which 8 (8.79%) are medical centers, 45 (49.45%) are metropolitan hospitals, and 38 (41.76%) are local community hospitals. To assess the hospitals’ efficiencies in web security, 10 input factors in 3 categories (
The factors for the DEA model.
Clarity of purpose | |
Communication | |
Security framework | |
By (
The efficiencies from DEA.
Hospital ID | Efficiency* | Tier |
---|---|---|
1 | 1.000 | 1 |
2 | 0.778 | 2 |
3 | 0.903 | 2 |
4 | 0.667 | 3 |
5 | 0.889 | 2 |
6 | 0.920 | 2 |
7 | 1.000 | 1 |
8 | 0.808 | 2 |
9 | 1.000 | 1 |
10 | 0.122 | 4 |
11 | 0.970 | 2 |
12 | 0.669 | 3 |
13 | 0.264 | 4 |
14 | 0.332 | 4 |
15 | 1.000 | 1 |
16 | 0.303 | 4 |
17 | 1.000 | 1 |
18 | 0.678 | 3 |
19 | 1.000 | 1 |
20 | 0.295 | 4 |
21 | 1.000 | 1 |
22 | 0.155 | 4 |
23 | 0.750 | 2 |
24 | 1.000 | 1 |
25 | 1.000 | 1 |
26 | 0.295 | 4 |
27 | 1.000 | 1 |
28 | 0.264 | 3 |
29 | 1.000 | 1 |
30 | 0.295 | 4 |
31 | 0.176 | 4 |
32 | 1.000 | 1 |
33 | 0.335 | 4 |
34 | 1.000 | 1 |
35 | 1.000 | 1 |
36 | 0.264 | 3 |
37 | 0.122 | 4 |
38 | 0.848 | 2 |
39 | 0.848 | 2 |
40 | 0.264 | 4 |
41 | 1.000 | 1 |
42 | 0.145 | 4 |
43 | 0.122 | 4 |
44 | 0.284 | 3 |
45 | 0.801 | 3 |
46 | 0.264 | 3 |
47 | 0.264 | 4 |
48 | 0.801 | 3 |
49 | 1.000 | 1 |
50 | 0.125 | 4 |
51 | 0.961 | 2 |
52 | 1.000 | 1 |
53 | 0.145 | 4 |
54 | 1.000 | 1 |
55 | 1.000 | 1 |
56 | 1.000 | 1 |
57 | 0.176 | 4 |
58 | 0.122 | 4 |
59 | 0.388 | 3 |
60 | 0.388 | 3 |
61 | 0.388 | 3 |
62 | 0.388 | 3 |
63 | 0.388 | 3 |
64 | 0.388 | 3 |
65 | 0.388 | 3 |
66 | 0.388 | 3 |
67 | 1.000 | 1 |
68 | 1.000 | 1 |
69 | 0.800 | 2 |
70 | 0.388 | 3 |
71 | 0.778 | 2 |
72 | 0.388 | 3 |
73 | 0.332 | 4 |
74 | 0.388 | 3 |
75 | 0.388 | 4 |
76 | 0.388 | 3 |
77 | 1.000 | 1 |
78 | 0.388 | 3 |
79 | 0.295 | 4 |
80 | 0.332 | 3 |
81 | 0.388 | 3 |
82 | 0.388 | 3 |
83 | 0.332 | 3 |
84 | 0.388 | 3 |
85 | 0.332 | 3 |
86 | 0.388 | 3 |
87 | 0.332 | 4 |
88 | 0.388 | 3 |
89 | 1.000 | 1 |
90 | 0.388 | 3 |
91 | 1.000 | 1 |
The distribution of efficiencies.
Efficiency | No. of DMU | Percentage (%) |
---|---|---|
1 | 25 | 27.47 |
0.900–0.999 | 4 | 4.40 |
0.800–0.899 | 7 | 7.69 |
0.700–0.799 | 3 | 3.30 |
0.600–0.699 | 3 | 3.30 |
0.300–0.399 | 20 | 21.98 |
0–0.299 | 29 | 31.87 |
The distribution of efficiencies.
By each iteration of the tier analysis algorithm in Section
The distribution of efficiency tiers.
Tier | No. of DMU | Percentage (%) |
---|---|---|
1 | 25 | 27.47 |
2 | 12 | 13.19 |
3 | 31 | 34.07 |
4 | 23 | 25.27 |
The distribution of tiers.
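The percentages in the tier distribution follow directly from the tier counts over the 91 DMUs. The short check below is only an illustrative sketch; the variable names are hypothetical, and the counts are those reported in the tier distribution table.

```python
# Tier counts as reported in the tier distribution table.
tier_counts = {1: 25, 2: 12, 3: 31, 4: 23}

total = sum(tier_counts.values())  # number of DMUs (hospitals)

# Share of each tier, in percent, rounded to two decimals as in the table.
shares = {tier: round(100 * n / total, 2) for tier, n in tier_counts.items()}

for tier, pct in shares.items():
    print(f"Tier {tier}: {tier_counts[tier]} DMUs, {pct:.2f}%")
```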
The accuracy from SVM (%).
Kernel type | 2 tiers | 3 tiers | 4 tiers | Average |
---|---|---|---|---|
Linear | 94.51 | 90.11 | 85.71 | 90.11 |
Polynomial | 91.21 | 85.71 | 84.62 | 87.18 |
Radial basis | 96.70 | 85.71 | 85.71 | 89.37 |
Sigmoid | 72.53 | 59.34 | 34.07 | 55.31 |
Average | 88.74 | 80.22 | 72.53 | |
The results from SVM classifiers.
The distribution of efficiencies in Figure
From the results, the kernel functions with satisfactory prediction accuracy are linear (90.11% on average), radial basis (89.37% on average), and polynomial (87.18% on average), while the sigmoid function yields the lowest accuracy (55.31% on average). The linear, radial basis, and polynomial functions are therefore the more appropriate kernel types for web security efficiency classification in this case. The pattern of the tier distribution is possibly the reason why these three kernel types outperform the sigmoid function in SVM classification.
From the perspective of tier refinement, the 2-tier setting achieves the highest accuracy (88.74% on average), while the 4-tier setting has the lowest (72.53% on average). The results show that fewer tiers yield better classification accuracy, which is consistent with the general expectation that classification becomes harder as the number of classes increases.
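The averages quoted above can be reproduced from the individual accuracy figures. The following sketch simply recomputes the per-kernel and per-tier-setting means from the reported table; the variable names are hypothetical.

```python
# Reported SVM accuracy (%) for the 2-, 3-, and 4-tier settings.
accuracy = {
    "Linear": [94.51, 90.11, 85.71],
    "Polynomial": [91.21, 85.71, 84.62],
    "Radial basis": [96.70, 85.71, 85.71],
    "Sigmoid": [72.53, 59.34, 34.07],
}

# Mean accuracy per kernel type (the "Average" column).
kernel_avg = {k: round(sum(v) / len(v), 2) for k, v in accuracy.items()}

# Mean accuracy per tier setting across kernels (the "Average" row).
tier_avg = [round(sum(col) / len(col), 2) for col in zip(*accuracy.values())]

print(kernel_avg)
print(dict(zip(["2 tiers", "3 tiers", "4 tiers"], tier_avg)))
```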
This study proposes a combined data envelopment analysis and support vector machine approach for efficiency estimation and classification. To keep data collection feasible, we use surrogate variables to construct the tangible and intangible input factors. In defining the output factors, we define an objective variable of
The authors are indebted to the anonymous reviewers for their careful reading and suggestions, which enhanced the quality of this paper. This work was supported by the National Science Council, Taiwan (Grant nos. NSC 102-2410-H-259-039- and NSC 101-2221-E-259-030).