Classification of Hospital Web Security Efficiency Using Data Envelopment Analysis and Support Vector Machine

This study proposes a hybrid data envelopment analysis (DEA) and support vector machine (SVM) approach for efficiency estimation and classification in web security. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM to learn patterns of web security performance and to provide further decision support. A numerical case study of hospital web security efficiency is presented to demonstrate the feasibility of this design.


Introduction
During the past decades, the Internet and World Wide Web (WWW) have been prevalent platforms for information sharing and transformation. Consequently, web security management has become a major theme in for-profit as well as nonprofit organizations. Assurance of information systems security involves not only tangible costs but also intangible inputs, which makes it challenging to evaluate the performance of the investments. This section concisely introduces the basics of web security, data envelopment analysis (DEA), support vector machine (SVM), classification on web security, and the goals of this study.

Phishing and Web Security.
Phishing [1][2][3][4] is a criminal activity employing both social engineering and technical subterfuge to acquire personal data such as usernames, passwords, and credit card numbers. Phishing has become a serious threat to information security and Internet privacy. An analysis of phishing attacks by the Financial Services Technology Consortium [3] produces a taxonomy consisting of six stages: planning, set up, attack, collection, fraud, and postattack.
The increasing popularity of web-based systems has resulted in phishing behaviors causing significant financial damage to both individuals and organizations. The statistics provided by the Anti-Phishing Working Group [4] show that the number of unique phishing websites detected during the fourth quarter of 2009 was 137,619 and that financial services and payment services were the most-targeted industry sectors. It is clear that financial gain is the main objective of phishing attacks. A survey by Gartner [5] reveals that between September 2007 and September 2008, more than 5 million people in the United States were affected by phishing attacks with the average loss of US$351 per incident, and the number of victims has increased by 39.8 percent.
Protection against attacks and unauthorized access to sensitive information is vital on the Internet. Several technical antiphishing solutions have been proposed on the server side or the client side. Server-side defenses employ Secure Sockets Layer certificates, user-selected site images, and other security indicators to help users verify the legitimacy of web sites, while client-side defenses equip web browsers with automatic phishing-detection features or add-ons (e.g., SpoofGuard) to warn users against suspected phishing sites [6]. In addition to the technical solutions, training users on antiphishing techniques is a frequently recommended and widely used approach for countering phishing attacks. ISO and NIST security standards, which many companies are contractually obligated to follow, include security training as an important component of security compliance [7,8]. These standards describe a three-level framework comprising

Based on the CCR ratio model, the objective function $h_k$ is maximized for every DMU$_k$ individually. In the model, $x_{ij}$ and $y_{rj}$ are the $i$th input and $r$th output of DMU$_j$; $u_r$ and $v_i$ are the weights of the outputs and inputs, respectively; $\varepsilon$ is a small positive value which ensures that all weights are strictly positive. For computational convenience, the CCR ratio model is frequently transformed into a linear programming (LP) model by assuming [14] that

$$\sum_{i} v_i x_{ik} = 1. \quad (2)$$

Notably, the solution space of the CCR LP model is smaller than that of the CCR ratio model due to the constraint (2); thus, the CCR LP model finds the local optimum for the ratio model, which comprises fractional terms [20].
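For reference, the textbook CCR formulation can be written out as follows. This is a sketch under the assumption that the models (1) and (2) referred to in this paper follow the standard form. The symbols are notational assumptions: $h_k$ is the efficiency of the DMU under evaluation, $x_{ij}$ and $y_{rj}$ are the $i$th input and $r$th output of DMU $j$, $u_r$ and $v_i$ are the output and input weights, and $m$, $s$, $n$ count the inputs, outputs, and DMUs.

```latex
% CCR ratio model for the DMU under evaluation (index k)
\begin{align*}
\max_{u,v}\quad & h_k = \frac{\sum_{r=1}^{s} u_r y_{rk}}{\sum_{i=1}^{m} v_i x_{ik}} \\
\text{s.t.}\quad & \frac{\sum_{r=1}^{s} u_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \le 1, \quad j = 1, \ldots, n, \\
& u_r \ge \varepsilon > 0, \quad v_i \ge \varepsilon > 0 .
\end{align*}

% LP form obtained by fixing the weighted input of DMU k to one
\begin{align*}
\max_{u,v}\quad & h_k = \sum_{r=1}^{s} u_r y_{rk} \\
\text{s.t.}\quad & \sum_{i=1}^{m} v_i x_{ik} = 1, \\
& \sum_{r=1}^{s} u_r y_{rj} - \sum_{i=1}^{m} v_i x_{ij} \le 0, \quad j = 1, \ldots, n, \\
& u_r \ge \varepsilon, \quad v_i \ge \varepsilon .
\end{align*}
```

Fixing the denominator of the ratio for DMU $k$ removes the fractional objective, and multiplying each ratio constraint by its positive denominator turns it into a linear inequality.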
1.3. Support Vector Machine.
Support vector machine (SVM) is a popular classifier and pattern recognition method based on statistical learning [21][22][23]. Suppose that $N$ training data $\{x_i, y_i\}$, $i = 1, 2, \ldots, N$, are given, where $x_i \in \mathbb{R}^d$ are the input patterns and $y_i \in \{-1, +1\}$ are the related target values of the two-class pattern classification case. Then the standard linear support vector machine is as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N, \quad (3)$$

where $b$ is the location of the hyperplane relative to the origin. The regularization constant $C > 0$ is the penalty parameter of the error term $\sum_{i=1}^{N} \xi_i^2$, which determines the tradeoff between the flatness of the linear functions $(w^T x_i + b)$ and the empirical error.
Hence, for such a generalized optimal separating hyperplane, the functional to be minimized comprises an extra term accounting for the cost of overlapping errors. In fact, the cost function (3) can be even more general, as given below:

$$J(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i^k,$$

subject to the same constraints. This is a convex programming problem that is usually solved only for $k = 1$ or $k = 2$, and such soft margin SVMs are dubbed $L_1$ and $L_2$ SVMs, respectively [23].
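The generalized cost can be evaluated numerically for a given hyperplane. The following is a minimal sketch, assuming the slack of each point is taken as its smallest feasible value, $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$; the function name, the default $C$, and the toy data are illustrative and not taken from the paper.

```python
def soft_margin_cost(w, b, X, y, C=1.0, k=1):
    """Evaluate 0.5 * w.w + C * sum(xi_i ** k) for a fixed hyperplane (w, b).

    Each slack xi_i is the smallest value satisfying
    y_i * (w.x_i + b) >= 1 - xi_i and xi_i >= 0.
    """
    if k not in (1, 2):
        raise ValueError("only the L1 (k=1) and L2 (k=2) soft margins are covered")
    regularizer = 0.5 * sum(wj * wj for wj in w)
    slacks = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return regularizer + C * sum(s ** k for s in slacks)


# Toy data (hypothetical): two points violate the margin, with slacks 0.5 and 1.5.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

l1_cost = soft_margin_cost(w, b, X, y, C=1.0, k=1)  # 0.5 + (0.5 + 1.5)
l2_cost = soft_margin_cost(w, b, X, y, C=1.0, k=2)  # 0.5 + (0.25 + 2.25)
```

With $k = 2$ the large violation dominates the penalty, which is the practical difference between the two soft margin variants.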
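Mathematical Problems in Engineering

For $L_1$ SVMs ($k = 1$), the solution to the quadratic programming problem (3), with the error term taken at $k = 1$, is given by the saddle point of the primal Lagrangian shown below:

$$L_p(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left\{ y_i (w^T x_i + b) - 1 + \xi_i \right\} - \sum_{i=1}^{N} \beta_i \xi_i,$$

where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers. Due to the KKT conditions, a dual Lagrangian function has to be maximized as follows:

$$L_d(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j x_i^T x_j, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

In learning a nonlinear classifier, we can define a kernel and the dual Lagrangian to be maximized as follows:

$$L_d(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j K(x_i, x_j),$$

where $K(x_i, x_j) = \Phi^T(x_i) \Phi(x_j)$ is the kernel function and $\Phi$ maps the training vector $x_i$ into a higher dimensional space. Popularly used kernel types include linear, polynomial, Gaussian, radial basis, and sigmoid [23].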
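The four kernel types used later in the case study can be sketched in a few lines. The forms below follow the definitions used by LIBSVM (linear $u^T v$, polynomial $(\gamma u^T v + c_0)^d$, radial basis $\exp(-\gamma \|u - v\|^2)$, sigmoid $\tanh(\gamma u^T v + c_0)$); the function names and default parameter values are illustrative assumptions, not settings reported in this paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def gram_matrix(X, kernel):
    """Matrix of K(x_i, x_j) values that enters the dual Lagrangian."""
    return [[kernel(a, b) for b in X] for a in X]


# Hypothetical two-dimensional patterns, for illustration only.
X = [[1.0, 2.0], [3.0, 4.0]]
K_linear = gram_matrix(X, linear_kernel)
```

Only the Gram matrix changes between kernel types; the dual problem itself keeps the same form.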

Estimation and Classification on Web Security.
Bose and Leung [2] investigate the antiphishing preparedness of banks in Hong Kong by analyzing the websites of the registered Hong Kong banks. They compute the score for each bank by averaging the performance of the bank's website in three aspects: accessibility, usability, and information content. Later on, Chen et al. [24] assess the severity of phishing attacks in terms of their risk levels and the potential loss in market value suffered by the targeted firms. They analyze 1030 phishing alerts released on a public database and financial data related to the targeted firms using a hybrid method that predicts the severity of the attack. Nishanth et al. [25] employ a two-stage soft computing approach for data imputation to assess the severity of phishing attacks, which involves the K-means algorithm and multilayer perceptron (MLP), probabilistic neural network (PNN), and decision trees (DT). Similar machine-learning techniques are employed by Lakshmi and Vijaya [26] for modelling the prediction task. The supervised learning algorithms, namely, multilayer perceptron, decision tree induction, and naïve Bayes classification, are used for exploring the results.

This study intends to integrate DEA and SVM for web security efficiency estimation and classification for several key reasons. First, as medical informatics and security gain growing attention, a practical evaluation scheme is needed. We develop the DEA to assess the relative efficiency of the hospitals as a pioneer study in the related field. Second, in addition to evaluating the current websites at one snapshot, further websites may be reviewed later as a potential data set, so an efficient and reliable classifier is essential to discriminate future data. Among the wide range of machine learning methods, SVM is relatively robust and convincing, so we integrate DEA and SVM to build the efficiency classification platform. Third, compared with related studies, this work emphasizes web security preparedness instead of the detection of potential web attacks. That is, we assess web security from a proactive view that is not limited to the technical aspect.

The rest of this paper is organized as follows. Section 2 addresses the problem and methods. Section 3 presents the numerical case study of web security analysis in medical institutions. Finally, the concluding remarks are given in Section 4.

The Method
This section develops the hybrid DEA and SVM approach for efficiency classification. Consider $n$ DMUs ($j = 1, \ldots, n$) that require assessment. Each DMU consumes $m$ inputs ($i = 1, \ldots, m$) and produces $s$ outputs ($r = 1, \ldots, s$), denoted by $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_s$, respectively. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM to learn patterns of the DMUs' performance and to provide further decision support. The procedure of the hybrid method is demonstrated in Figure 1.
Step 1 (efficiency evaluation). Based on (1) and (2), this study first evaluates the efficiency of the training data of the DMUs. The efficiency $h_k$ of DMU$_k$ is defined as in (1).
Step 2 (efficiency tier analysis). This step iteratively discriminates the fully productive group ($h_k = 1$) and the subproductive group ($h_k < 1$). Using tier analysis [27], the DMUs are divided according to their efficiency scores. Then, the fully productive group is moved to the current tier, while the remaining DMUs are kept for further tier extraction. The algorithm proceeds as follows: in each iteration, the DEA model is solved for the remaining DMUs, those with $h_k = 1$ form the current tier, and the rest proceed to the next iteration. The procedure of this step is demonstrated in Figure 2. Each DMU will belong to one tier thereafter.
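The tier extraction loop can be sketched as follows. This is a minimal illustration, not the algorithm listing of [27]: the efficiency function is passed in as a parameter, and `ratio_efficiency` is a hypothetical stand-in that covers only the single-input, single-output case, where the CCR score reduces to each DMU's output/input ratio divided by the best ratio among the remaining DMUs. The `max_tiers` cap, which places all leftover DMUs in the last tier, is also an assumption.

```python
def ratio_efficiency(remaining):
    """Stand-in efficiency for one input and one output per DMU.

    remaining: dict mapping a DMU name to an (input, output) pair.
    Returns a dict mapping each name to a score in (0, 1].
    """
    ratios = {name: out / inp for name, (inp, out) in remaining.items()}
    best = max(ratios.values())
    return {name: ratio / best for name, ratio in ratios.items()}


def tier_analysis(dmus, efficiency_fn, max_tiers=None, tol=1e-9):
    """Assign each DMU to a tier by repeatedly peeling off the efficient group."""
    remaining = dict(dmus)
    tiers = {}
    tier = 1
    while remaining:
        if max_tiers is not None and tier == max_tiers:
            # Last allowed tier: every DMU still unassigned goes here.
            for name in remaining:
                tiers[name] = tier
            break
        scores = efficiency_fn(remaining)
        efficient = [name for name, score in scores.items() if score >= 1.0 - tol]
        if not efficient:
            raise ValueError("efficiency_fn returned no fully efficient DMU")
        for name in efficient:
            tiers[name] = tier
            del remaining[name]
        tier += 1
    return tiers


# Hypothetical DMUs as (input, output) pairs, for illustration only.
dmus = {"A": (2.0, 4.0), "B": (4.0, 4.0), "C": (4.0, 2.0), "D": (1.0, 2.0)}
all_tiers = tier_analysis(dmus, ratio_efficiency)
two_tiers = tier_analysis(dmus, ratio_efficiency, max_tiers=2)
```

Because the efficiency function is re-evaluated on the remaining DMUs only, a DMU that is inefficient against the full set can become fully efficient in a later iteration, which is what produces the successive tiers. For the multi-input, multi-output case an LP-based CCR solver would replace the stand-in.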
Step 3 (SVM learning). Here the classification schema is learned as

$$(x_j \mid y_j) \rightarrow t_j,$$

where $t_j$ stands for the tier that DMU$_j$ belongs to and $x_j \mid y_j$ is the vector combining the input and output factors of DMU$_j$. Since there can be more than two tiers, this is a multiclass classification problem.
Step 4 (testing). The set of testing data will be used to validate the classification model.
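Steps 3 and 4 can be illustrated by preparing the learning data and scoring the predictions. The sketch below assumes the data are written in the sparse text format read by LIBSVM, the tool used in the case study (one line per DMU: the tier label followed by `index:value` pairs with indices starting at 1). The function names and sample scores are hypothetical.

```python
def to_libsvm_line(tier, inputs, outputs):
    """Encode one DMU as '<tier> 1:<v1> 2:<v2> ...'.

    The feature vector is the input factors followed by the output factors.
    """
    features = list(inputs) + list(outputs)
    pairs = " ".join(
        f"{index}:{value:g}" for index, value in enumerate(features, start=1)
    )
    return f"{tier} {pairs}"


def accuracy(predicted, actual):
    """Fraction of test DMUs whose predicted tier matches the actual tier."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Hypothetical DMU: two input scores, two output scores, assigned to tier 2.
line = to_libsvm_line(2, [7.5, 8.0], [6.0, 9.0])

# Hypothetical predicted and actual tiers for four test DMUs.
test_accuracy = accuracy([1, 2, 2, 3], [1, 2, 3, 3])
```

Keeping the tier as the class label and the concatenated factors as the features means the trained classifier can assign a tier to a newly reviewed web site without re-running the DEA.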
In the next section, the case of hospital web security efficiency is studied thoroughly to demonstrate the procedure developed above.

Case Study
Two professional users with web security expertise independently assess the web sites of these hospitals. All items are scored between 1 and 9 by the predefined measures, where 9 means total consistency between the statement and the practice while 1 stands for the opposite. After the reviewers observe the sample web sites, they give the scores according to the level of conformity to the factors in Table 1. The scores from the reviewers are averaged to give the input/output values for the DEA models. Most of the factors are nearly objective except user satisfaction ($y_1$). Notably, the input variables are surrogate variables that represent the investments in web security.
Step 1 (efficiency evaluation). By (1) and (2), the efficiencies of the information security investment in hospitals are computed. The distributions of the results are summarized in Tables 2 and 3 and Figure 3.
Step 2 (efficiency tier analysis). In each iteration of the tier analysis algorithm in Section 2, DEA determines one productive group of hospitals ($h_k = 1$) and another group of subproductive hospitals ($h_k < 1$). Then the fully efficient group is extracted and the remaining hospitals proceed to the next iteration. Each DMU will belong to an efficiency tier $t$ thereafter. The members of the tiers are distributed as in Table 4 and Figure 4.
Steps 3 and 4 (SVM learning and testing). In this step, we use the utility LIBSVM [28] to build the SVM classification models. Four types of kernel functions are learned: linear, polynomial, radial basis, and sigmoid. The testing accuracy is compared across different numbers of tiers and kernel function types. The report is shown in Table 5 and Figure 5.

The distribution of efficiencies in Figure 3 manifests an unbalanced pattern in which most hospitals lie at the two extremes of the efficiency scale. However, by tier analysis, the distribution of tiers is nearly even, except for the second tier, which has the lowest number of hospitals.
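The comparison of kernel types with LIBSVM can be set up along the following lines. This is a sketch under stated assumptions: it only builds the argument lists for LIBSVM's `svm-train` and `svm-predict` command-line tools, using the `-t` kernel codes (0 linear, 1 polynomial, 2 radial basis, 3 sigmoid) and the C-SVC type (`-s 0`). The file names are hypothetical, and the parameter settings actually used in this study are not reported here.

```python
# LIBSVM kernel codes for the -t option of svm-train.
KERNEL_TYPES = {"linear": 0, "polynomial": 1, "radial basis": 2, "sigmoid": 3}


def libsvm_commands(train_file, test_file, kernel):
    """Build the svm-train and svm-predict argument lists for one kernel type."""
    code = KERNEL_TYPES[kernel]
    model_file = f"{train_file}.{code}.model"
    output_file = f"{test_file}.{code}.predict"
    train_cmd = ["svm-train", "-s", "0", "-t", str(code), train_file, model_file]
    predict_cmd = ["svm-predict", test_file, model_file, output_file]
    return train_cmd, predict_cmd


# One train/predict pair per kernel type, for a hypothetical 3-tier data set.
commands = {
    kernel: libsvm_commands("tiers3.train", "tiers3.test", kernel)
    for kernel in KERNEL_TYPES
}
```

Each pair can be run with `subprocess.run` where the LIBSVM binaries are installed; `svm-predict` reports the testing accuracy when the test file carries the true labels. Repeating this for each number of tiers gives the grid of kernel type against tier count.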
From the results, the kernel functions with satisfactory prediction accuracy are linear (90.11% on average), radial basis (89.37% on average), and polynomial (87.18% on average), while the sigmoid function results in the lowest accuracy (55.31% on average). The linear, radial basis, and polynomial functions are therefore more appropriate kernel types for web security efficiency classification in this case. The pattern of tier distribution is possibly the reason why those three kernel types outperform the sigmoid function in SVM classification.
From the perspective of data tier refinement, 2 tiers get the highest accuracy (88.74% on average) while 4 tiers have the lowest (72.53% on average). The results show that fewer tiers yield better classification accuracy, which is consistent with the general expectation that classification becomes harder as the number of classes grows.

Conclusions
This study proposes a hybrid data envelopment analysis and support vector machine approach for efficiency estimation and classification. For the feasibility of data collection, we use surrogate variables to construct the tangible and intangible input factors. For the output factors, we define an objective variable, the progress of ISO 27001 accreditation, and a subjective one, user satisfaction, so that efficiency is evaluated not only from a technical perspective but also from the users' perception. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM for learning patterns of expected web security performance. From the case study, linear and radial basis kernel functions

Figure 3: The distribution of efficiencies.

Figure 4: The distribution of tiers.

Table 1: The factors for the DEA model.
Clarity of purpose
$x_1$: security information is clearly structured on homepage;
$x_2$: adequate security measures have been adopted from official homepage or login page;
$x_3$: security policy documents are provided on homepage;
$x_4$: news or events on the Internet security are reported;
$x_5$: fault detection and responses adequately function.

$x_9$: how sound are the security methods or protocols used;
$x_{10}$: first aid directions for computer security risks.

Output factors: the benefits of web security
$y_1$: user satisfaction;
$y_2$: progress of ISO 27001 accreditation.

Table 2: The efficiencies from DEA.

Table 1 .
Two professional users with web

Table 3: The distribution of efficiencies.

Table 4: The distribution of efficiency tiers.