Anomaly Detection in Health Insurance Claims Using Bayesian Quantile Regression

Research has shown that current health expenditure in most countries, especially in sub-Saharan Africa, is inadequate and unsustainable. Yet, fraud, abuse, and waste in health insurance claims by service providers and subscribers threaten the delivery of quality healthcare. It is therefore imperative to analyze health insurance claim data to identify potentially suspicious claims. Typically, anomaly detection can be posited as a classification problem that requires the use of statistical methods such as mixture models and machine learning approaches to classify data points as either normal or anomalous. Additionally, health insurance claim data are mostly associated with problems of sparsity, heteroscedasticity, multicollinearity, and the presence of missing values. +e analyses of such data are best addressed by adopting more robust statistical techniques. In this paper, we utilized the Bayesian quantile regression model to establish the relations between claim outcome of interest and subject-level features and further classify claims as either normal or anomalous. An estimated model component is assumed to inherently capture the behaviors of the response variable. A Bayesian mixture model, assuming a normal mixture of two components, is used to label claims as either normal or anomalous. +e model was applied to health insurance data captured on 115 people suffering from various cardiovascular diseases across different states in the USA. Results show that 25 out of 115 claims (21.7%) were potentially suspicious. +e overall accuracy of the fitted model was assessed to be 92%. +rough the methodological approach and empirical application, we demonstrated that the Bayesian quantile regression is a viable model for anomaly detection.


Introduction
Providing quality and accessible healthcare is a major political decision that governments make all over the world. A major challenge in achieving this is funding. According to Novignon et al. [1], majority of countries in developing regions, especially sub-Saharan Africa (SSA), rely on donor grants and loans to finance healthcare. Such expenditures are not only unsustainable but also inadequate considering the enormous healthcare burden in the region. Apart from direct government investment into health services, health insurance schemes have been instituted to reduce the proportion of health service cost borne by the populace in many countries. However, fraud, abuse, and waste in health insurance claims by service providers and subscribers threaten the delivery of quality healthcare. e National Healthcare Antifraud Association (NHCAA) of the United States defines healthcare fraud as an intentional deception or misrepresentation made by a person or an entity, with the knowledge that the deception could result in some kinds of unauthorized benefits to that person or entity [2,3]. is "intentional deception or misrepresentation" can produce observations in the data that do not conform to normal observed patterns. ese can therefore be regarded as anomalies or outliers. ere are several statistical methods that are employed in the identification of fraud, especially in health insurance claims. Wilson [4] applied the logistic regression model to detect auto insurance fraud. is method is only possible if there is information on both legitimate and fraudulent claims. Moreover, the labeling process may be fraught with error such that any model that is built on this may not automatically capture these errors. Ekina et al. [2] used Bayesian co-clustering model to detect health insurance fraud. is method has been used for the analysis of dyadic data, and since "conspiracy fraud" involves a connection between two entities, the Bayesian coclustering is an appropriate method. Liu and Vasarhelyi [5] used a clustering method incorporating geo-locating information to detect fraud in Medicare/Medicaid data in the US. Ekin et al. [6] provided a comprehensive assessment of medical fraud research. ey mentioned that health fraud research is an emerging field which has gained some attention from researchers but acknowledged that more research needs to be conducted especially using Bayesian techniques. Bayesian methods have generally been proven to be an effective approach of identifying fraud. is is because Bayesian methods do not make strong statistical assumption as the frequentist counterparts.
Below are some reasons why Bayesian techniques would be appropriate for health fraud study [6].
(i) Medical data are sparse because fraudulent cases are rare, as compared to legitimate cases. (ii) Data are also dynamic. Both fraudulent and legitimate patterns evolve over time. (iii) Fraudulent claims are not homogeneous. (iv) Missing values are present, and these can lead to over-or undersampling, nonrepresentativeness, and potential bias in inference. (v) Medical data are skewed and non-normal.
To address these challenges, this study proposes the use of Bayesian quantile regression techniques and Bayesian mixture model to identify fraud in health insurance data.
is study seeks to specify the posterior distribution of the parameters of interest given the data; identify the distribution of anomalous claims; and estimate the probability of each claim to identify potential anomalous claims.

Bayesian Method.
From the Bayesian perspective, parameter vector of a given model is assumed to have a probability distribution referred to as the prior probability distribution. is prior probability distribution is informed by the prior knowledge about the parameter. e data from which the parameter is to be estimated are generated by a probability process, and this generating process is referred to as the likelihood. Of interest is the posterior probability which is the probability of the unknown parameter given the data. Given Y i as a vector of observations, independently and identically distributed, and θ i a vector of unknown parameters, then the posterior probability distribution of the parameters is represented as where P(θ i ) is the prior probability of the parameter θ i and P(θ i |Y i ) is the posterior probability of θ i given the observed data Y i . Generally, the prior may be chosen based on some information available and hence called informative prior. It may be chosen based on little or no information available. Such priors are called noninformative prior. e uniform distribution or Jeffrey's prior is mostly used when there is no prior information about the unknown parameter. A prior may also either be a proper prior or an improper prior. A chosen proper prior results in a proper posterior probability. An improper prior is a distribution that does not integrate to one. However, if an improper prior is chosen properly, the resulting posterior distribution may be a proper posterior distribution.
ere are other ways of choosing the prior. e prior can be chosen such that the resultant posterior distribution has the same form as the prior probability. Such priors are called conjugate.

Quantile Regression.
is study used the quantile regression to capture the relationship between the variables and to estimate parameters of the model. is is because of its ability to capture the relationship across the entire data without imposing strong parametric assumptions about the response variable.
It is therefore an adequate technique which captures all dynamics in the data and addresses challenges associated with health insurance claims data.
Furthermore, given a set of response variables Y and explanatory variables X, the conditional cumulative distribution function is and for any τ, 0 < τ < 1, the τth conditional quantile function is e quantile function Q τ is monotonic increasing such that for any two quantiles τ 1 and τ 2 where τ 1 < τ 2 , Q τ 1 < Q τ 2 .

Description of the Method.
To specify the density distribution that describes the anomalous claims, a posterior distribution of the parameters of interest (β, σ, and m i ′ s) is first constructed using the quantile regression model: where E(Y) � Q τ (Y) � Xβ + θ m i and u i has a standard normal distribution. e parameters to be estimated are the regression coefficients β, a scale parameter σ, and the mixture component m i .

International Journal of Mathematics and Mathematical Sciences
From the Bayesian perspective, given that Y i are independently and identically random variables from which the parameters β, σ, and m i ′ s are to be estimated, the posterior distribution f(β, σ, m i |y 1 , y 2 , . . . , y n ) is constructed to be proportional to some function g(β, σ, m i ) expressed as f β, σ, m i |y 1 , y 2 , . . . , y n ∝ g β, σ, m i . Now by the factorization theorem. To estimate say β, σ and m i are treated as constants and their functions are included as a normalizing constant. us, so that We then use the Gibbs sampler to sample β from the conditional distribution f(β|Y, σ, m i ). is is the same as sampling β from the posterior distribution f(β, σ, m i |y 1 , y 2 , . . . , y n ). Samples are also drawn from the conditional distributions f(σ|Y, β, m i ) and f(m i |Y, β, σ) to estimate σ and m i , respectively. It is assumed that the estimates for m i ′ s form a mixture distribution of unknown densities.

e Model.
Suppose Y i , i � 1, 2, . . . , n are the set of independent variables identically and independently distributed, X i are the regressors or covariates, and β j are the regression coefficients. e model for the Bayesian quantile regression as proposed by Kozumi and Kobayashi [7] is where σ is a scale parameter, . z i has a standard exponential distribution, and u i has the standard normal distribution. We replace σz i with m i and equation (9) becomes which is the same as equation (4). Given a random error variable ε with scale parameter σ, asymmetry parameter τ, and location parameter μ, ε is said to have the asymmetry Laplace distribution (assumed by Yu and Moyeed [8]) given by the density From equation (12), μ � X ij β j + θm i is the location parameter. v i � θm i can only be zero through θ when τ � (1/2). e term m i , however, is an exponential random variable and cannot be zero. e presence of anomaly in Y can be demonstrated through m i by assuming that the predicted values of m i have a mixture distribution.
It is somewhat intractable to derive the full conditional posterior distributions for the parameters from equation (12). Markov Chain Monte Carlo (MCMC) algorithms such as Gibbs sampler [9] and Metropolis-Hastings algorithm [10] enable one to estimate the parameters from the posterior distribution.
MCMC algorithms generate random numbers that will have stationary transition probabilities, hence an equilibrium distribution. In the Bayesian perspective, this equilibrium distribution, given some initial conditions, mimics the posterior distribution. Having done this, we can then compute statistics such as mean and variance from this distribution as estimates of the posterior distribution.
Assuming ε i has an exponential normal mixture distribution, it can be shown that Y i will be normally distributed with mean, X ij β j + θm i , variance, w 2 σm i , and a normal likelihood density of the form: International Journal of Mathematics and Mathematical Sciences e variables β, m i , and σ are estimated by assuming the following priors: By specifying the above conjugate priors for the parameters, their full marginal posterior distributions can be derived and we can subsequently sample from them.
From the likelihood function in equation (13) and the above prior distributions, the posterior distribution is given as which becomes e marginal conditional posterior distributions of the parameters are derived from equation (16) as follows: e full conditional posterior distribution of β is a normal distribution given as where K is a constant that does not depend on β.
Note that a function x↦ exp(ax 2 + bx + c) with domain on the whole real line is an unnormalized probability density function (pdf ) if and only if a < 0. In this case, the unnormalized pdf of the normal distribution with mean (μ) and variance (σ 2 ) is Comparing equation (17) to x↦ exp(ax 2 + bx + c) yields If we let σ 2 � V and μ � B, then the distribution of β is normal with parameters B and V where It can also be derived from equation (16) that the full conditional distribution of m i is e full conditional posterior distribution of the m i is the generalized inverse Gaussian distribution with parameters (1/2), μ, and ] given as where and ] 2 � ((2/σ) + (θ/σw 2 )).
Again from equation (16), the full conditional distribution of σ follows the inverse gamma distribution given as is is an inverse gamma distribution with shape parameter (3n + n o )/2 and scale parameter ((s o /2) + m i + ((Y i − X ij β j − θm i ) 2 /(2w 2 m i ))).

2.5.
Estimating the Mixture Components of m i . We used the R package for the Bayesian mixture model, "BayesMix," to estimate the mixture components of m i . BayesMix provides Gibbs sampling of the posterior distribution, a method to set up the model and specify the priors and initial values required for the Gibbs sampler.
To run the Gibbs sampler, we specify the prior distributions of the parameters and initial values for the hyperparameters in the proposed models. e regression coefficients are normally distributed with mean β o and variance V o . We set initial values for these parameters as 0 and 10 (with precision as 1/10), respectively.
International Journal of Mathematics and Mathematical Sciences m i ′ s are exponentially distributed with rate parameter 1/σ. e prior distribution of the scale parameter σ is an inverse gamma distribution with hyperparameters (n o /2) and (s o /2). e quantity n o is the prior effective sample size for σ, and s o is prior point estimate for σ. For our analysis, we set 1/σ to prec which is the rate of the exponential variable m i . e initial values for prec are generated randomly as uniform values with interval 0 and 1 and 0 and 2. is choice is subjective and arbitrary but with the hope that the data will be informative enough to produce a stationary posterior distribution.

Model Evaluation.
e proposed model is evaluated by replicating new values of Y, Y rep and determining how close these values are to the original ones. e approach is called posterior consistency and involves using a posterior predictive distribution. e posterior predictive distribution is given by where Θ is our parameter space and θ i is the set of parameters we estimated from our model. We sample from this distribution as follows: (1) Sample θ i from posterior distribution f(θ i |Y).
(2) Substitute the values of θ i above in f(Y|θ i ) and draw Y rep .
To compare the predictive posterior distribution to our model, we can compute some statistics such as the mean, standard deviation, and order statistics and compare the values. Graphical approaches may also be used. e histogram of Y rep gives the distribution of the predictive posterior distribution, and this can be compared to the model distribution of Y.

Computing the Probability Scores.
e densities of the component mixtures are estimated using the Bayesian method. Now, let λ � 0 and λ � 1 correspond to the density distribution with normal claims and density distribution with anomalous claims (which will have the higher mean), respectively.
From the above preamble, we compute the probability of each of the claim belonging to anomalous claim class, P(Y|λ � 1). ese probabilities are ranked for all observations, and those with values greater than 0.5 are classified into the anomalous claim class.
We treat m i as a mixture of two normal densities and estimate the proportion and density of each component. For μ 1 and μ 2 (anomalous claims) while the smaller one will be associated with λ � 1(anomalous claims) while the smaller one will be associated with λ � 0 (normal claims). p(Y|λ i � 1), which describes the outlier distribution, can be computed as

Results and Discussion
We run the Gibbs sampler at quantile values, τ � 0.95, 0.75, 0.50, 0.25, to observe the trend of the changes in β ′ s (regression coefficients) and σ (scale parameter). Table 3 shows the estimates of the regression coefficients at the specified quantiles (τ � 0.95, 0.75, 0.50, 0.25). From Table 3, the quantile value τ � 0.75 is preferred to estimate the anomalous claims in the data. e values of the estimated parameters at τ � 0.75 help satisfy the assumption that an anomalous/fraudulent health insurance claim is usually located at the right of a legitimate claim.
is suggests that we can estimate the parameters of the proposed model even at τ � 0.95. However, the potential reduction factor at τ � 0.75 and τ � 0.95 is 1.01 and 1.08, respectively. is makes τ � 0.75 the preferred quantile value. e mean, standard deviation, and standard error of the mean for each parameter estimated at τ � 0.75 are given in Table 4. 6 International Journal of Mathematics and Mathematical Sciences We provide the residual and normal Q-Q plots of the model in Figure 1 and also the trace plot and density of β and σ in Figure 2. e results from Figure 2 (diagram of densities) satisfy the assumption of inverse gamma and normal prior distributions made for σ and β, respectively.
Our objective is to identify potential fraudulent claims through the estimated values of the unobserved variable m i . Figure 3 shows the histogram and density of m i at τ � 0.75.

Identifying Anomalous Claims.
e "BayesMix package" in R was used to set up an MCMC sampler to identify the two components of the mixture of m i ′ s. e mean (μ), sigma (σ), and proportion (c) of each component were estimated. Note that the probability of a claim being anomalous is P(λ � 1) � c [2] and the probability of a claim being normal is P(λ � 0) � c [1]. e results are shown in Table 5.
e trace plots and densities of the mixture parameters μ and σ 2 are shown in Figures 4 and 5, respectively. e probability of each of the claims belonging to anomalous class, P(Y|λ � 1), is through m i ′ s, so we use μ [2] � 14.17 for density of P(Y, m, λ � 1) and P(λ � 1) � 0.1651. Moreover, the density of P(λ � 0)P(Y, m, λ � 0) is μ[1] � 10.48 and P(λ � 0) � 0.8349. Assuming the mixture components are    Table 4: Empirical mean and standard deviation for each parameter, with standard error of the mean at τ � 0.75.

Conclusions and Recommendations
e model identified 25 claims (about 22%) estimated at quantile level τ � 0.75 as belonging to the anomalous class and can thus be assumed to be fraudulent claims. e first 10 of these are shown in Table 6.
e model correctly identified all anomalous claims (sensitivity of 100% ) and gave a high specificity value of 90%. e overall accuracy of the model was 92% with a precision of 74%. is study therefore recommends Bayesian quantile regression as a viable model for anomaly detection. It is worthy of note that the study assumed a normal density mixture for the m i ′ s to be able to estimate the anomalous cluster in the data. erefore, a choice of a different distribution for the m i ′ s could result in different findings. Also, instead of using a mixture model, a variable selection

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest.