There is a need for new classes of flexible multivariate distributions that can capture heavy tails and skewness without being so flexible as to fully incur the curse of dimensionality intrinsic to nonparametric density estimation. We focus on the family of Gaussian variance-mean mixtures, which have received limited attention in multivariate settings beyond simple special cases. Using a Bayesian semiparametric approach, we allow the unknown mixing distribution to be inferred from the data. Properties are considered, and an approach to posterior computation relying on Markov chain Monte Carlo is developed. The methods are evaluated through simulation studies and applied to a variety of applications, illustrating their flexible performance in characterizing heavy tails, tail dependence, and skewness.
There is an increasing awareness of the importance of developing new classes of multivariate distributions that flexibly characterize heavy tails and skewness, while accommodating tail dependence. Such tail dependence arises in many applications as a natural consequence of dependence in outlying events. It is well known to occur in financial data, communication networks, weather, and other settings, but is not adequately characterized by common approaches such as Gaussian copula models. Salmon [
There is an existing literature relevant to this topic. Wang et al. [
Alternatively, nonparametric approaches have been explored to handle heavy-tailed and skewed observations with more flexibility. For instance, mixtures of normal distributions have been widely used to approximate arbitrary distributions. Venturini et al. [
We focus on Gaussian variance-mean mixtures (GVMMs), introduced by Barndorff-Nielsen [
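Since the formal definition is abbreviated here, it may help to record the standard stochastic representation of a GVMM (the notation below is assumed, matching common usage rather than the paper's exact display):

```latex
% Standard representation of a d-dimensional Gaussian variance-mean mixture:
\[
  Y = \mu + V\beta + \sqrt{V}\,\Sigma^{1/2} Z,
  \qquad Z \sim \mathrm{N}_d(0, I_d), \quad V \sim G,
\]
% so that, conditionally on the mixing variable,
\[
  Y \mid V = v \;\sim\; \mathrm{N}_d\!\left(\mu + v\beta,\; v\Sigma\right).
\]
```

Here \(\beta\) controls skewness (a Gaussian variance mixture is recovered when \(\beta = 0\)) and the mixing distribution \(G\) governs the tail behavior.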
Current literature on Gaussian variance-mean mixtures is mostly focused on univariate models in which the mixing distribution
We propose Bayesian semiparametric Gaussian variance-mean mixture models, in which the mixing distribution
Consider the
The family of log GN distributions considered here was initially introduced by Vianelli [
Although log GN distributions are appealing in providing a simple generalization of the lognormal that is more flexible in the tails, such distributions have rarely been implemented, even in simpler settings, due to the computational hurdles involved. Fortunately, for Bayesian posterior computation via MCMC we can rely on a data augmentation algorithm based on Fact
Let
DPMs of Gaussians provide a highly flexible approximation to arbitrary densities. As a prior for
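To make the DPM prior concrete, a draw from a truncated stick-breaking DP mixture of Gaussians can be simulated as below (the hyperparameters and the Gaussian base measure are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def draw_dpm_of_gaussians(n, alpha=1.0, K=50, base_mean=0.0, base_sd=1.0,
                          obs_sd=0.5, rng=None):
    """Draw n samples from a truncated stick-breaking DP mixture of Gaussians.

    The base measure G0 = N(base_mean, base_sd^2) generates cluster locations;
    alpha controls how many clusters carry appreciable weight; K truncates the
    stick-breaking construction.
    """
    rng = np.random.default_rng(rng)
    # Stick-breaking: v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    w /= w.sum()  # renormalize the small mass lost to truncation
    # Cluster locations drawn from the base measure
    mu = rng.normal(base_mean, base_sd, size=K)
    # Assign each observation to a cluster, then draw around the cluster mean
    z = rng.choice(K, size=n, p=w)
    return rng.normal(mu[z], obs_sd)

samples = draw_dpm_of_gaussians(1000, alpha=2.0, rng=0)
```

Larger `alpha` spreads mass over more sticks, yielding more occupied clusters on average.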
Expression (
It is important to understand the relationship between the tail behavior of the mixture distribution
Suppose
if
if
Observe that the tail behavior of a Gaussian variance mixture (when
Suppose
Proofs can be found in the appendix. This lemma links the tail behavior of a Gaussian variance-mean mixture to that of a Gaussian variance mixture, via the link between the tail behaviors of the two mixing distributions. This relationship is used in the following theorem.
Suppose
Theorem
To compute the moments of Gaussian variance-mean mixtures, we can directly apply the law of total cumulance. Let
More generally, we have
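As one concrete instance (assuming the standard representation \(Y = \mu + V\beta + \sqrt{V}\,\Sigma^{1/2}Z\) with mixing variable \(V \sim G\)), the first two cumulants follow directly from the laws of total expectation and total covariance:

```latex
\[
  \mathbb{E}[Y] \;=\; \mathbb{E}\,\mathbb{E}[Y \mid V] \;=\; \mu + \beta\,\mathbb{E}[V],
\]
\[
  \operatorname{Cov}(Y)
  \;=\; \mathbb{E}\,\operatorname{Cov}(Y \mid V)
        + \operatorname{Cov}\!\big(\mathbb{E}[Y \mid V]\big)
  \;=\; \mathbb{E}[V]\,\Sigma + \operatorname{Var}(V)\,\beta\beta^{\top}.
\]
```

In particular, the skewness parameter \(\beta\) inflates the covariance whenever the mixing distribution is non-degenerate.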
In the semiparametric GVMM framework, the
In particular, we transform the original data to have a positive sample skewness and unit sample variance. The data are first normalized and the sample skewness is calculated; if it is negative, we multiply the normalized data by −1. In conducting inference, we transform back to the scale and sign of the original data. As GVMMs are closed under linear transformations, this induces a GVMM. Because the transformed data have nonnegative sample skewness (right skewed or symmetric), we can more easily elicit a default weakly informative prior for the skewness parameter
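The preprocessing just described can be sketched as follows (a minimal illustration; the function names and centering conventions are assumptions, not the paper's exact code):

```python
import numpy as np
from scipy import stats

def preprocess(y):
    """Standardize data to unit sample variance and flip sign so that the
    sample skewness is nonnegative.

    Returns the transformed data plus the (location, scale, sign) needed to
    map posterior summaries back to the original scale and sign.
    """
    y = np.asarray(y, dtype=float)
    loc, scale = y.mean(), y.std(ddof=1)
    z = (y - loc) / scale                 # normalize
    sign = -1.0 if stats.skew(z) < 0 else 1.0
    return sign * z, (loc, scale, sign)   # transformed data: right skewed or symmetric

def postprocess(z, transform):
    """Invert the preprocessing, e.g. for posterior predictive draws."""
    loc, scale, sign = transform
    return sign * np.asarray(z) * scale + loc
```

Because a GVMM is closed under linear maps, applying `postprocess` to draws from a GVMM fitted to the transformed data again yields GVMM draws on the original scale.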
As for the DPM of log GN prior for the mixing distribution
Given the model and priors as specified by (
Samples
Samples
Sample
and sample
where
Sample
Sample
where
Sampling
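Since the exact conditional updates are abbreviated above, the overall data-augmentation structure can be illustrated in a deliberately simplified special case: a univariate GVMM with fixed Exp(λ) mixing and a flat prior on (μ, β), rather than the paper's DPM-of-log GN prior. In this toy model each latent scale V_i has a generalized inverse Gaussian full conditional, which is available in SciPy:

```python
import numpy as np
from scipy.stats import geninvgauss

def gibbs_gvmm_exp(y, lam=1.0, n_iter=200, rng=None):
    """Toy Gibbs sampler for a univariate GVMM with Exp(lam) mixing.

    Simplified model (a stand-in for the paper's semiparametric mixing):
        y_i | V_i ~ N(mu + beta * V_i, V_i),   V_i ~ Exp(lam),
    with a flat prior on (mu, beta). The V_i full conditional is generalized
    inverse Gaussian: GIG(p=1/2, chi=(y_i - mu)^2, psi=beta^2 + 2*lam).
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu, beta = 0.0, 0.0
    keep = []
    for _ in range(n_iter):
        # 1) Latent scales: V_i | y_i, mu, beta ~ GIG(1/2, chi_i, psi).
        #    SciPy's geninvgauss(p, b, scale=s) matches GIG(p, chi, psi)
        #    with b = sqrt(chi * psi) and s = sqrt(chi / psi).
        chi = (y - mu) ** 2 + 1e-10          # jitter keeps chi > 0
        psi = beta ** 2 + 2.0 * lam
        b = np.sqrt(chi * psi)
        V = geninvgauss.rvs(0.5, b, scale=np.sqrt(chi / psi), random_state=rng)
        # 2) (mu, beta) | V: weighted Gaussian regression of y on [1, V]
        #    with observation variances V_i (weights 1 / V_i).
        X = np.column_stack([np.ones(n), V])
        W = 1.0 / V
        prec = X.T @ (W[:, None] * X)
        mean = np.linalg.solve(prec, X.T @ (W * y))
        mu, beta = rng.multivariate_normal(mean, np.linalg.inv(prec))
        keep.append((mu, beta))
    return np.array(keep)
```

The semiparametric sampler alternates analogous steps, with the fixed exponential mixing replaced by updates of the DPM-of-log GN mixing distribution.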
To test the semiparametric framework, we first model a dataset simulated from a univariate GVMM. Specifically, observations
For Bayesian inference, we preprocess the original simulated data and place priors as described previously. We run the MCMC for 10000 iterations, discarding the first 5000 as burn-in. Several aspects of the posterior distributions are analyzed to evaluate the model fit. First, posterior samples of
Furthermore, we can reconstruct the dataset based on the posterior samples of model parameters and the mixing distribution
Posterior quantile estimation to show the model fit. Posterior quantile C.I.s are obtained by simulating 200 reconstructed datasets (each consisting of 5000 data points) based on posterior samples of unknown quantities, each dataset giving one set of quantile point estimates. Observed quantiles are obtained from the 1000 observed simulated data points.
Quantiles   Observed quantile   Posterior mean   Posterior 95% C.I.
2.5%                                             [−1.535, −1.450]
5%          −1.328              −1.322           [−1.363, −1.284]
25%         −0.738              −0.711           [−0.742, −0.676]
50%         −0.175              −0.162           [−0.190, −0.124]
75%         0.555               0.534            [0.492, 0.584]
95%         1.783               1.871            [1.776, 1.985]
97.5%       2.394               2.414            [2.283, 2.566]
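The posterior quantile summaries reported above can be computed from the reconstructed datasets along the following lines (a sketch; the array and function names are illustrative):

```python
import numpy as np

def quantile_summary(reconstructed,
                     probs=(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975)):
    """Posterior quantile point estimates and 95% intervals.

    `reconstructed` is an (n_datasets, n_points) array, one row per dataset
    simulated from the posterior; each row yields one set of quantile
    estimates, and the spread across rows gives the interval.
    """
    reconstructed = np.asarray(reconstructed)
    q = np.quantile(reconstructed, probs, axis=1)   # (len(probs), n_datasets)
    post_mean = q.mean(axis=1)                      # posterior mean per quantile
    lo, hi = np.quantile(q, [0.025, 0.975], axis=1) # 95% C.I. per quantile
    return {p: (m, (l, h)) for p, m, l, h in zip(probs, post_mean, lo, hi)}
```

Comparing these summaries with the empirical quantiles of the observed data gives the tabulated model-fit check.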
Quantile estimates are obtained from models fitted to the observed dataset using maximum likelihood estimation. To obtain the maximum likelihood estimators in skewed-Gaussian and
Quantiles   Observed   Gaussian   Skewed Gaussian   Skewed
2.5%                              −1.451            −1.446
5%          −1.328     −1.645     −1.306            −1.300
25%         −0.738     −0.674     −0.751            −0.749
50%         −0.175     0.000      −0.173
75%         0.555      0.674      0.590             0.566
95%         1.783      1.645      1.887             1.895
97.5%       2.394      1.960      2.339             2.379
It is well known that stock returns do not always conform well to a Gaussian distribution, and modeling both the heavy-tailedness and the asymmetry of returns has become important in economics and finance. Here, we look at daily returns of the Standard & Poor's 500 Composite (S&P 500) index from 01/02/1990 to 09/13/2011. In total, 5470 observations are shown in Figure S3(A), with sample skewness 0.189 (after preprocessing), which suggests that the return distribution may be slightly right skewed.
A similar univariate semiparametric GVMM and prior setup are applied to the dataset to assess the capability of the model in capturing the return distribution. To evaluate the model fit, we again reconstructed 200 datasets based on the posterior samples of unknown quantities (each consisting of 5470 observations); a quick comparison (Figure S3) between the observed and reconstructed datasets shows close agreement, indicating that our model captures the return distribution well. We also look at posterior quantile estimates based on the 200 simulated datasets (Table
Posterior quantile estimation to show the model fit. Posterior quantile C.I.s are obtained by simulating 200 reconstructed datasets (each consisting of 5470 data points) based on posterior samples of unknown quantities, each dataset giving one set of quantile point estimates. Real-data quantiles are obtained from the 5470 observed S&P 500 returns.
Quantiles   Real data quantile   Posterior mean   Posterior 95% C.I.
2.5%        −1.965               −1.965           [−2.097, −1.867]
5%                               −1.513           [−1.589, −1.440]
25%         −0.493               −0.495           [−0.534, −0.469]
50%         −0.0315              −0.0175          [−0.0421, 0.0087]
75%         0.467                0.476            [0.440, 0.512]
95%         1.537                1.560            [1.482, 1.649]
97.5%       2.056                2.064            [1.943, 2.225]
Furthermore, we specifically look at the posterior distribution of
Posterior distribution of
Posterior distribution of
There has also been growing interest in flexible families of non-Gaussian distributions allowing skewness and heavy tails in environmental science and climatology, as heavy-tailed and skewed data are frequently observed in practice. Specifically, it is well known that monthly rainfall is strongly skewed to the right, with high positive skewness coefficients (e.g., [
US national and regional precipitation data are publicly available from the United States Historical Climatology Network (USHCN). For the purpose of exposition, we used monthly precipitation data measured in inches from four local stations (Albemarle, Chapel Hill, Edenton, and Elizabeth City) in the state of North Carolina, for the period from 1895 through 2010 (116 observations per station for each month). Figure
Monthly log-precipitation data for July from 1895 to 2010 (116 observations) obtained from four stations in North Carolina show heavy right skewness.
We fit the semiparametric multivariate GVMM (
We run the Markov chain for 10000 iterations, which shows good mixing and convergence, and discard the first 5000 as burn-in. To illustrate the model fit, we reconstruct a precipitation dataset with 5000 observations based on posterior samples of all unknown quantities and compare the reconstructed posterior predictive distribution with the observed distribution. Specifically, we check whether both the marginal univariate distribution in each dimension and the covariance structure are captured correctly. As shown in Figure
Monthly log-precipitation data for July from 1895 to 2010 (116 observations) obtained from four stations in North Carolina (shown in histograms) are fitted using the Bayesian semiparametric GVMM. Red lines show kernel density estimates of the fitted distributions for the stations, each estimated from 5000 posterior predictive samples of the fitted GVMM.
PP plots for the Bayesian semiparametric GVMM fitted to log-precipitation for July at all four stations.
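PP plots of this kind compare the empirical CDF of the observed data with the CDF implied by posterior predictive samples; a minimal sketch (function and variable names are assumed, not the paper's code):

```python
import numpy as np

def pp_points(observed, predictive):
    """Coordinates for a PP plot.

    x-coordinates: empirical CDF of the observed data at its own order
    statistics; y-coordinates: the predictive-sample CDF evaluated at the
    same points. Points near the 45-degree line indicate a good fit.
    """
    obs = np.sort(np.asarray(observed))
    n = len(obs)
    p_obs = (np.arange(1, n + 1) - 0.5) / n
    p_pred = np.searchsorted(np.sort(predictive), obs, side="right") / len(predictive)
    return p_obs, p_pred
```

Plotting `p_pred` against `p_obs` (e.g., with `matplotlib`) for each station gives one panel of the figure described above.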
Sample covariance structure of monthly log-precipitation data from four local stations in the state of North Carolina. Alb: Albemarle, CH: Chapel Hill, Ede: Edenton, and Eli: Elizabeth City.
Covariance structure of monthly log-precipitation when fitted with the Bayesian semiparametric GVMM. Alb: Albemarle, CH: Chapel Hill, Ede: Edenton, and Eli: Elizabeth City are four stations in the state of North Carolina.
As a comparison, we also fitted the log-precipitation data with a multivariate skewed
This paper proposes the use of Bayesian semiparametric Gaussian variance-mean mixtures as a flexible, interpretable, and computationally tractable model for heavy-tailed and skewed observations. The model assumes the mixing distribution
Let
Given