Cost-Sensitive Estimation of ARMA Models for Financial Asset Return Data

The autoregressive moving average (ARMA) model is a simple but powerful model in financial engineering for representing time series with long-range statistical dependency. However, the traditional maximum likelihood (ML) estimator minimizes a loss function that is inherently symmetric due to Gaussianity. Consequently, when the data of interest are asset returns and the main goal is to maximize profit through accurate forecasting, the ML objective may be less appropriate, potentially leading to a suboptimal solution. It is more reasonable to adopt an asymmetric loss in which a prediction in the same direction as the true return is penalized less than a prediction in the opposite direction. We propose an asymmetric cost-sensitive loss function and incorporate it into ARMA model estimation. On the online portfolio selection problem with real stock return data, we demonstrate that the investment strategy based on predictions by the proposed estimator can be significantly more profitable than the one based on the traditional ML estimator.


Introduction
In modeling time-series data, capturing the statistical dependency of the variables of interest at the current time on the historic data is central to accurate forecasting and faithful data representation. For financial time-series data especially (e.g., daily asset prices or returns), where a large number of potential prognostic indicators are available, the development and analysis of sensible dynamic models and effective parameter estimation algorithms have been investigated extensively.
To account for statistical properties specific to financial sequences, several sophisticated dynamic time-series models have been developed: fairly natural autoregressive and/or moving average models [1], the conditional heteroscedastic models that represent dynamics of volatilities (variances) of the asset returns [2][3][4], and nonlinear models [5,6] including bilinear models [7], threshold models [8], and regime switching models [9,10].
Among those, the autoregressive moving average (ARMA) model [1] is the simplest yet essential, in the sense that most other models are equipped with at least the basic ARMA components. ARMA models have recently appeared in a wide spectrum of applications, including filter design in signal processing [11], time-series analysis and model selection in computational statistics [12], and jump (large changes) modeling for asset prices in quantitative finance [13], to name just a few. For a time series $\mathbf{y} = y_1, \ldots, y_T$ (e.g., $y_t$ is the asset return on the $t$th day), the ARMA($p$, $q$) model determines $y_t$ by
$$y_t = \alpha^\top y_{t-p}^{t-1} + \beta^\top \epsilon_{t-q}^{t-1} + \beta_0 \epsilon_t + c, \qquad (1)$$
where $y_a^b$ indicates the vector $[y_a, y_{a+1}, \ldots, y_b]^\top$. Here $\epsilon_t \sim \mathcal{N}(0, \gamma^2)$ is the stochastic Gaussian error term at time $t$, assumed iid across $t$. In (1), $\alpha$, $\beta$, $\beta_0$, and $c$ are the model parameters. That is, $y_t$ depends on the $p$ previous asset returns, the $q$ historic errors, and the current error $\epsilon_t$.
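The recursion just described can be sketched in a few lines of Python. This is a minimal simulation under an assumed standard ARMA form (a constant plus weighted past values, weighted past errors, and the current error). The function name `simulate_arma`, its argument names, and the zero warm-up history are illustrative assumptions, not taken from the paper.

```python
import random

def simulate_arma(alpha, beta, c=0.0, gamma=1.0, T=200, seed=0):
    """Simulate T values of an ARMA(p, q) series (hypothetical helper).

    alpha[k] weights y_{t-1-k}; beta[k] weights eps_{t-1-k};
    gamma is the standard deviation of the Gaussian error term.
    """
    rng = random.Random(seed)
    p, q = len(alpha), len(beta)
    warm = max(p, q, 1)
    y = [0.0] * warm      # assumption: zero history before the series starts
    eps = [0.0] * warm    # assumption: zero errors before the series starts
    for _ in range(T):
        e = rng.gauss(0.0, gamma)                             # current error
        ar = sum(a * y[-1 - k] for k, a in enumerate(alpha))  # p past values
        ma = sum(b * eps[-1 - k] for k, b in enumerate(beta)) # q past errors
        y.append(c + ar + ma + e)
        eps.append(e)
    return y[warm:]

returns = simulate_arma(alpha=[0.3, -0.1], beta=[0.2, 0.1], gamma=0.01)
print(len(returns))
```

Setting `gamma=0.0` removes the randomness, which is a convenient way to check the autoregressive part in isolation.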
In this paper we consider a more general, recent stochastic extension of ARMA (abbreviated as sARMA) [14] that, in contrast to the deterministic equation (1), adds Gaussian noise to (1). Moreover, extra covariates $x_t$ (called cross predictors) are assumed available at time $t$; typically they are economic indicators, market indices, and/or the previous returns of other related assets. The sARMA($p$, $q$) model can be written as
$$y_t \mid y_{t-p}^{t-1}, \epsilon_{t-q}^{t}, x_t \sim \mathcal{N}(m_t, \sigma^2), \qquad (2)$$
where
$$m_t = \alpha^\top y_{t-p}^{t-1} + \beta^\top \epsilon_{t-q}^{t-1} + \beta_0 \epsilon_t + \eta^\top x_t + c. \qquad (3)$$
Here $\eta$ is the weight vector (model parameters) for the cross predictor.
Hence, sARMA deals with a Gaussian noisy observation (with variance $\sigma^2$), and it reduces exactly to the ARMA model in the limiting case $\sigma \to 0$. The noisy observation modeling of sARMA is beneficial in several respects: it not only accounts for the underlying noise process in the observation but also makes the model fully stochastic, which allows principled probabilistic inference and model estimation even with missing data [14].
Given the observed sequence data, the parameters $\theta = \{\gamma^2, \alpha, \beta, \beta_0, c, \eta, \sigma^2\}$ of the sARMA model can be estimated by the expectation maximization (EM) algorithm [15]. Compared to the traditional Levenberg-Marquardt method for ARMA model estimation [1], the EM algorithm is beneficial for dealing with latent variables (i.e., the error terms) as well as any missing observations in an efficient and principled way. However, both estimators basically aim at data likelihood maximization (ML) under the Gaussian model (2) [14] (with $\sigma \to 0$ corresponding to ARMA).
Due to the Gaussian observation modeling in sARMA, ML estimation inherently minimizes a symmetric loss. In other words, letting $\hat{y}_t$ and $y_t$ be the model forecast and the true value at time $t$, respectively, an incorrect prediction $\hat{y}_t$ with prediction error $d = |\hat{y}_t - y_t|$ incurs the same loss for both $\hat{y}_t = y_t + d$ and $\hat{y}_t = y_t - d$ (i.e., regardless of over- or underestimation). This strategy is far from optimal, especially for asset return data, as argued in the following.
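The symmetry can be checked numerically. The sketch below evaluates the Gaussian negative log-likelihood for two forecasts that miss the true return by the same amount in opposite directions; the function name `gaussian_nll` and the sample values are illustrative assumptions.

```python
import math

def gaussian_nll(y_hat, y, sigma=1.0):
    """Negative log-density of y under N(y_hat, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - y_hat) ** 2 / (2 * sigma ** 2)

y_true = -0.01                                 # the asset actually lost 1%
same_direction = gaussian_nll(-0.03, y_true)   # forecast also predicts a loss
wrong_direction = gaussian_nll(0.01, y_true)   # forecast predicts a gain
print(math.isclose(same_direction, wrong_direction))
```

Both forecasts receive the same loss even though only the second one points in the wrong direction.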
The main goal is to maximize profit by accurate forecasting with asset return data, whose signs (directions) determine profits. The traditional maximum likelihood (ML) estimator minimizes a loss function that is inherently symmetric and hence cannot exploit this property of asset return data, leading to a suboptimal solution.
Suppose that our data $\mathbf{y}$ form a sequence of daily stock log-returns, where $y_t > 0$ ($y_t < 0$) indicates that the stock price moves up (declines) on the $t$th day relative to the previous day. Now consider a portfolio selection algorithm that makes an investment based on the forecast $\hat{y}_t$ given the information up to time $t$. The investment yields positive revenue when the signs of $\hat{y}_t$ and $y_t$ agree and negative revenue otherwise. Hence, the prediction loss should be inherently asymmetric. Furthermore, when $y_t < 0$ and $\hat{y}_t < 0$, even underestimation (i.e., $\hat{y}_t < y_t$) should be penalized less than a prediction in the opposite direction (i.e., $\hat{y}_t > 0$), because the former does not incur any loss in revenue but the latter does.
To address this issue, we propose a cost function that captures the intrinsic asymmetric profit/loss structure of asset return data. Our cost function encodes how well the directions of the true and model-predicted asset returns match, which is directly related to the ultimate profit of the investment. We also provide an efficient optimization strategy based on subgradient descent with a trust-region approximation, whose effectiveness is empirically demonstrated on the portfolio selection problem with real-world stock return data.
It is worth mentioning that several other asymmetric loss functions similar to ours have been proposed in the literature. However, existing loss models focus only on asymmetry with respect to the ground-truth value. For instance, the linex function [16, 17] is defined as a linear-exponential function of the difference between the predicted and ground-truth values. The linlin method [18] adopts a piecewise linear function whose change point is simply the ground-truth value. To the best of our knowledge, we are the first to derive a loss based on matching the directions (signs) of the predicted and ground-truth returns. This enables incorporating the critical information about the directions of profits/losses, in turn leading to a more accurate forecasting model.
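To make the contrast concrete, the following sketch implements the textbook forms of the linex and linlin losses and evaluates them on two forecasts with the same absolute error. The parameter values (`a`, `over`, `under`) and the sample returns are hypothetical, chosen so that both losses penalize overestimation more heavily.

```python
import math

def linex(y_hat, y, a=50.0):
    """Linear-exponential loss of the signed error d = y_hat - y."""
    d = y_hat - y
    return math.exp(a * d) - a * d - 1.0

def linlin(y_hat, y, over=2.0, under=1.0):
    """Piecewise linear loss whose change point is the true value y."""
    d = y_hat - y
    return over * d if d >= 0 else -under * d

y_true = 0.01
right_sign = 0.03    # overestimates, but in the correct direction
wrong_sign = -0.01   # same absolute error, opposite direction
for loss in (linex, linlin):
    print(loss.__name__, loss(right_sign, y_true) > loss(wrong_sign, y_true))
```

With these settings both losses charge more for the forecast with the correct sign, because they look only at the signed error around the true value and not at whether the predicted direction matches.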
The rest of the paper is organized as follows. In the next section we present a novel sARMA estimation algorithm based on the cost-sensitive loss function: beginning with the overall objective, we derive the one-step predictor for the sARMA model in Section 2.1, provide details of the proposed cost function in Section 2.2, and state the optimization strategy in Section 2.3. The statistical inference algorithm for the sARMA model is derived in full in Section 2.4. In the empirical study in Section 3, we demonstrate the effectiveness of the proposed algorithm on the online portfolio selection problem with real data, where the proposed approach attains a significantly higher total profit than the investment based on the traditional ML-estimated sARMA model.

Cost-Sensitive Estimation
The proposed estimator for sARMA is based on the cost-sensitive loss of the model-predicted one-step forecast (denoted by $\hat{y}_t$) at each time $t$ with respect to the true value (denoted by $y_t$) available from data. More specifically, for a given data sequence $\mathbf{y} = y_1, \ldots, y_T$, we aim to solve the optimization problem
$$\min_{\theta} \; \sum_{t=t_0+1}^{T} C(\hat{y}_t(\theta), y_t) + \lambda R(\theta), \qquad (4)$$
where $t_0 = \max(p, q)$.
Here $C(\hat{y}_t, y_t)$ is the cost of predicting the asset return as $\hat{y}_t$ when the true value is $y_t$. In Section 2.2 we define a cost function that incorporates the idea of the asymmetric cost-sensitive loss discussed in the introduction.
In the objective we also minimize $R(\theta)$, the parameter regularizer, which penalizes a nonsmooth sARMA model and prefers a smooth one (achieved by encouraging the regression parameters in $\theta$ to be close to 0). Specifically, we use the L2 penalty $R(\theta) = \|\alpha\|^2 + \|\beta\|^2 + \beta_0^2 + \|\eta\|^2$. The constant $\lambda$ ($> 0$) trades off the regularization against the prediction error cost.
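As a small illustration of how the regularized objective is assembled, the sketch below sums per-step prediction costs and adds a weighted L2 penalty. The grouping of the regression parameters into two vectors, one scalar, and a cross-predictor weight vector is an assumption made for this example, as are the names `l2_regularizer`, `objective`, and `lam`.

```python
def l2_regularizer(alpha, beta, beta0, eta):
    """Sum of squared regression parameters (assumed parameter grouping)."""
    return (sum(a * a for a in alpha)
            + sum(b * b for b in beta)
            + beta0 * beta0
            + sum(e * e for e in eta))

def objective(step_costs, lam, reg):
    """Total prediction cost plus lam times the regularizer."""
    return sum(step_costs) + lam * reg

reg = l2_regularizer(alpha=[0.3, -0.1], beta=[0.2, 0.1], beta0=1.0, eta=[0.05])
print(objective(step_costs=[0.002, 0.0, 0.01], lam=0.1, reg=reg))
```

A larger `lam` pulls the regression weights toward zero at the expense of a higher prediction cost.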
Note also that in (4) we use the notation $\hat{y}_t(\theta)$ to emphasize the dependency of the model prediction $\hat{y}_t$ on $\theta$. We use the principled maximum a posteriori (MAP) predictor under the sARMA model, which is fully described in Section 2.1. The predictor is evaluated based on inference on the latent error terms, which can be computed recursively; detailed derivations for the inference are given in Section 2.4.

One-Step Predictor for sARMA.
Under the sARMA model, the predictive model at time $t$, given all available information, is $P(y_t \mid y_1^{t-1}, x_1^t, \epsilon_1^p)$ for $t = p+1, p+2, \ldots$. From this predictive model, one can make a deterministic decision on the asset return at $t$, typically by maximum a posteriori (MAP) estimation:
$$\hat{y}_t = \arg\max_{y_t} P(y_t \mid y_1^{t-1}, x_1^t, \epsilon_1^p). \qquad (5)$$
Note that in the sARMA model it is always assumed that we have at least $p$ previous observations $y_1^p$ and $q$ previous error terms. The initial error terms are simply assumed to be $\epsilon_1 = \cdots = \epsilon_p = 0$ throughout the paper. Due to the linear Gaussianity of the sARMA local conditional densities, the predictive density is Gaussian, and the MAP predictor (5) coincides exactly with its mean. In this section we derive the MAP (or mean) prediction $\hat{y}_t$ as a function of the sARMA model parameters $\theta$, which can then be used in gradient evaluation for the optimization in (4). As will be shown, the predictive distributions rely heavily on the posterior distributions of the error terms, namely, $P(\epsilon_{p+1}^t \mid y_1^t, x_1^t, \epsilon_1^p)$ for $t = p+1, \ldots, T$. They are also Gaussian, and we denote them by
$$P(\epsilon_{p+1}^t \mid y_1^t, x_1^t, \epsilon_1^p) = \mathcal{N}(\epsilon_{p+1}^t; \mu_t, \Sigma_t) \qquad (6)$$
for $t = p+1, \ldots, T$. Note that $\mu_t$ and $\Sigma_t$ have dimensions $((t-p) \times 1)$ and $((t-p) \times (t-p))$, respectively. The full derivation of the error term posteriors is provided in Section 2.4.
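To derive the predictive density, one needs to distinguish three cases for $t$: (i) $t = p+1$, (ii) $p+1 < t \le p+q+1$, and (iii) $t > p+q+1$. The first case simply forms the initial condition, which follows immediately from the local conditional model by marginalizing out $\epsilon_{p+1}$. That is, when $t = p+1$,
$$P(y_t \mid y_1^{t-1}, x_1^t, \epsilon_1^p) = \int P(y_t \mid y_1^{t-1}, x_t, \epsilon_1^p, \epsilon_t)\, P(\epsilon_t)\, d\epsilon_t = \mathcal{N}(y_t; f_t, \beta_0^2 \gamma^2 + \sigma^2),$$
where we define $f_t = \alpha^\top y_{t-p}^{t-1} + \eta^\top x_t + c$ for $t = p+1, \ldots, T$. We distinguish the second and third cases for the following reason: at time $t$, the previous $q$ error terms are fully included in the time window $[p+1, t]$ in the latter case, while they are only partially included in the former. Hence in the second case we additionally deal with the error terms $\epsilon_{t-q}^{p}$, which are always given as 0. Specifically, in the second case ($p+1 < t \le p+q+1$), the terms $\epsilon_{t-q}^{t}$ are partitioned into ($\epsilon_t$, $\epsilon_{p+1}^{t-1}$, $\epsilon_{t-q}^{p} = 0$), and we have
$$P(y_t \mid y_1^{t-1}, x_1^t, \epsilon_1^p) = \mathcal{N}(y_t; f_t + \beta_2^\top \mu_{t-1}, \; \beta_2^\top \Sigma_{t-1} \beta_2 + \beta_0^2 \gamma^2 + \sigma^2). \qquad (12)$$
In (12), we let $\beta_2$ be the subvector of $\beta$ corresponding to $\epsilon_{p+1}^{t-1}$. In the third case ($t > p+q+1$), we only need to deal with the error terms $\epsilon_{t-q}^{t}$, and the predictive density is derived as follows:
$$P(y_t \mid y_1^{t-1}, x_1^t, \epsilon_1^p) = \mathcal{N}(y_t; f_t + \beta^\top \tilde{\mu}_{t-1}, \; \beta^\top \tilde{\Sigma}_{t-1} \beta + \beta_0^2 \gamma^2 + \sigma^2). \qquad (14)$$
In (14), we introduce $\tilde{\mu}_{t-1}$ and $\tilde{\Sigma}_{t-1}$ as the parts of $\mu_{t-1}$ and $\Sigma_{t-1}$ that take the indices from $(t-q)$ to $(t-1)$ only.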
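In summary, the one-step predictor $\hat{y}_t(\theta)$ at $t$ with all available information ($y_1^{t-1}$, $x_1^t$, $\epsilon_1^p$) can be written as
$$\hat{y}_t(\theta) = \begin{cases} f_t, & t = p+1, \\ f_t + \beta_2^\top \mu_{t-1}, & p+1 < t \le p+q+1, \\ f_t + \beta^\top \tilde{\mu}_{t-1}, & t > p+q+1. \end{cases} \qquad (16)$$
Note here that the means of the error term posteriors $\mu_{t-1}$ (and their subvectors $\tilde{\mu}_{t-1}$) also depend on the model parameters $\theta$.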

Proposed Cost Function.
In this section we propose a cost function $C(\hat{y}_t, y_t)$ (used in (4)) that encodes the intrinsic asymmetric profit/loss structure of asset return data. Following the motivating idea discussed in Section 1, we deal with two main cases: the case when the true $y_t$ is positive and the case when $y_t$ is negative. In each case, we further consider a margin $\rho$ (small positive, e.g., $\rho = 0.005$), where observing $y_t > \rho$ indicates a positive return with high certainty; on the other hand, $0 \le y_t \le \rho$ can be regarded as weak positivity and might be considered noise. For negative returns, we similarly have two regimes with different certainty levels.
We discuss the first case, $y_t > \rho$. Depending on the value of $\hat{y}_t$, the cost function changes over four intervals: (i) $\hat{y}_t < 0$ incurs the highest loss, with a super-linear penalty in the magnitude of $\hat{y}_t$ (we choose a convex quadratic function); (ii) $\hat{y}_t \ge y_t$, that is, overestimation, should be penalized the least, and we opt for an increasing linear function with a small slope; (iii) $\rho \le \hat{y}_t < y_t$ is an underestimation, but the prediction has certainty greater than the margin and thus is penalized mildly (we choose a linear function with a slope slightly higher than in the second interval); and (iv) $0 \le \hat{y}_t < \rho$ predicts the correct direction, but due to the weak certainty below the margin, we penalize it more severely than in the previous two regimes.
In the case of $y_t < -\rho$, we penalize the prediction in exactly the same way as in the first situation, with the directions mirrored. Specifically, the cost definition for $y_t < -\rho$ is given by (18), where the same constants are used and the offsets $h_0$ and $h_1$ are reset accordingly. For an uncertain return (within the margin $\rho$), we still follow the strategy of encouraging the same direction as the true return. In the case of $0 < y_t \le \rho$, we assign a small penalty for overestimation as long as it is in the correct direction, and a rapidly growing quadratic loss for a prediction in the opposite direction. The cost for $0 < y_t \le \rho$ is given by (19), where the offset is set for continuity. The other case, $-\rho \le y_t < 0$, is defined similarly in (20), with the offset again chosen for continuity of the cost function.
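The qualitative shape of such a cost can be sketched as a continuous piecewise function. The code below is only an illustration of the four regimes and of the mirrored negative case; the slopes `s_over`, `s_under`, `s_weak`, the quadratic weight `k`, and the treatment of `y == 0` as a positive return are hypothetical choices, not the constants used in the paper.

```python
def cost(y_hat, y, rho=0.005, s_over=0.1, s_under=0.2, s_weak=1.0, k=100.0):
    """Direction-aware asymmetric cost (illustrative constants only)."""
    if y < 0:
        # Negative true return: mirror both values and reuse the positive case.
        return cost(-y_hat, -y, rho, s_over, s_under, s_weak, k)
    if y_hat >= y:
        # Overestimation in the correct direction: smallest slope.
        return s_over * (y_hat - y)
    knee = min(rho, y)  # the margin, or y itself when y lies inside the margin
    if y_hat >= knee:
        # Underestimation that is still above the margin: mild linear penalty.
        return s_under * (y - y_hat)
    at_knee = s_under * (y - knee)
    if y_hat >= 0:
        # Correct direction but below the margin: steeper linear penalty.
        return at_knee + s_weak * (knee - y_hat)
    # Wrong direction: continue from the value at zero and add a quadratic term.
    at_zero = at_knee + s_weak * knee
    return at_zero + s_weak * (-y_hat) + k * y_hat ** 2

print(cost(0.05, 0.02), cost(0.01, 0.02), cost(0.002, 0.02), cost(-0.01, 0.02))
```

The pieces are joined so that the cost is continuous at every change point, which keeps a subgradient well defined everywhere.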

Optimization Strategy.
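In this section we briefly describe the optimization strategy for (4). We basically follow subgradient descent [19, 20], where the derivative of the cost function with respect to $\theta$ can be derived as
$$\frac{\partial C(\hat{y}_t(\theta), y_t)}{\partial \theta} = \frac{\partial \hat{y}_t(\theta)}{\partial \theta} \cdot \frac{\partial C(\hat{y}_t, y_t)}{\partial \hat{y}_t}. \qquad (21)$$
Here, because the cost function is continuous but nondifferentiable, we use a subgradient in place of the second factor on the RHS of (21).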
Evaluating the first factor, that is, the gradient of $\hat{y}_t$ with respect to $\theta$, requires further effort. According to the functional form of $\hat{y}_t$ in (16), it has a complex recursive dependency on $\theta$, mainly through the error posterior means $\mu_{t-1}$. Instead of computing the exact derivative of $\mu_{t-1}$, we evaluate an approximate gradient by treating $\mu_{t-1}$ as a constant (evaluated at the current iterate $\theta$). As a consequence, $\hat{y}_t$ becomes a linear function of $\theta$, and the gradient can be computed easily. However, the approximation (i.e., $\mu_{t-1}$ constant with respect to $\theta$) is only valid in the vicinity of the current $\theta$. Hence, to reduce the approximation error, we restrict the search space to stay close to the current iterate (i.e., we search for the next $\theta$ within a small-radius ball centered at the current $\theta$, specifically $\|\theta - \theta_{\mathrm{curr}}\| \le \delta$ for some small $\delta > 0$). Our optimization strategy is closely related to the trust-region method [21], where the objective is approximated in the vicinity of the current parameters.
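One simple way to keep an update inside the small-radius ball is to shrink the subgradient step whenever it would leave the ball. The sketch below shows this under the assumption of a plain Euclidean norm; the function name `trust_region_step` and the parameters `step` and `delta` are illustrative, and the paper may restrict the search differently.

```python
import math

def trust_region_step(theta, subgrad, step, delta):
    """Take a subgradient step, rescaled so that the move has norm <= delta."""
    move = [-step * g for g in subgrad]
    norm = math.sqrt(sum(m * m for m in move))
    if norm > delta:
        # Pull the proposed point back onto the boundary of the ball.
        move = [m * delta / norm for m in move]
    return [t + m for t, m in zip(theta, move)]

theta_next = trust_region_step([0.0, 0.0], [3.0, 4.0], step=1.0, delta=0.5)
print(theta_next)
```

After each such step the error posterior means would be recomputed at the new iterate before the next approximate gradient is formed.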

Inference in sARMA.
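In this section we give full derivations for statistical inference on the latent error variables $\epsilon_{p+1}^t$ for each $t$, conditioned on the historic observations $y_1^t$ and the cross predictors $x_1^t$ in the sARMA model. That is, the posterior densities $P(\epsilon_{p+1}^t \mid y_1^t, x_1^t, \epsilon_1^p)$ for $t = p+1, \ldots, T$ are fully derived. These are all Gaussian, and, as denoted in (6), we find recursive formulas for the means and covariances ($\mu_t$, $\Sigma_t$). We also denote the inverse covariance $\Sigma_t^{-1}$ by $\Lambda_t$, and write $r_t = y_t - f_t$ for brevity. As with the one-step predictive distributions, we consider three cases: (i) the initial $t = p+1$, (ii) $p+1 < t \le p+q+1$, where $\epsilon_{t-q}^{t}$ fully contains what we need to infer, that is, $\epsilon_{p+1}^t$, and (iii) $t > p+q+1$, where we have to infer three groups of variables.
The initial case ($t = p+1$) is straightforwardly derived as follows:
$$P(\epsilon_{p+1} \mid y_1^{p+1}, x_1^{p+1}, \epsilon_1^p) \propto \mathcal{N}(\epsilon_{p+1}; 0, \gamma^2)\, \mathcal{N}(y_{p+1}; f_{p+1} + \beta_0 \epsilon_{p+1}, \sigma^2) \qquad (23)$$
$$\propto \mathcal{N}(\epsilon_{p+1}; \mu_{p+1}, \Lambda_{p+1}^{-1}). \qquad (24)$$
In (23), $f_t = \alpha^\top y_{t-p}^{t-1} + \eta^\top x_t + c$ as before, and we use $\epsilon_{p+1-q}^{p} = 0$. The theorem on the product of two Gaussians is applied to yield (24) from (23). This forms the initial posterior mean and inverse covariance as follows:
$$\Lambda_{p+1} = \frac{1}{\gamma^2} + \frac{\beta_0^2}{\sigma^2}, \qquad \mu_{p+1} = \Lambda_{p+1}^{-1} \frac{\beta_0\, r_{p+1}}{\sigma^2}.$$
We next deal with the second case, that is, $p+1 < t \le p+q+1$. We partition $\epsilon_{t-q}^{t}$ into three parts: $e_1 = \epsilon_{t-q}^{p}$, $e_2 = \epsilon_{p+1}^{t-1}$, and $e_3 = \epsilon_t$. The parameter vector $\beta$ for $\epsilon_{t-q}^{t-1}$ is accordingly divided into subvectors $\beta_1$ (for $e_1$) and $\beta_2$ (for $e_2$). We only need to infer $\epsilon_{p+1}^t$, that is, $e_2$ and $e_3$, and the conditional density can be derived as follows:
$$P(e_2, e_3 \mid y_1^t, x_1^t, \epsilon_1^p) \propto \mathcal{N}(e_2; \mu_{t-1}, \Lambda_{t-1}^{-1})\, \mathcal{N}(e_3; 0, \gamma^2)\, \mathcal{N}(y_t; f_t + \beta_2^\top e_2 + \beta_0 e_3, \sigma^2). \qquad (28)$$
To derive $\mu_t$ and $\Lambda_t$, we rearrange the exponent of (28) as a canonical quadratic form in terms of ($e_2$, $e_3$). After some algebra we obtain
$$\Lambda_t = \begin{bmatrix} \Lambda_{t-1} + \beta_2 \beta_2^\top / \sigma^2 & \beta_0 \beta_2 / \sigma^2 \\ \beta_0 \beta_2^\top / \sigma^2 & 1/\gamma^2 + \beta_0^2 / \sigma^2 \end{bmatrix}, \qquad \mu_t = \Lambda_t^{-1} \begin{bmatrix} \Lambda_{t-1} \mu_{t-1} + \beta_2 r_t / \sigma^2 \\ \beta_0 r_t / \sigma^2 \end{bmatrix}.$$
Finally, for the third case ($t > p+q+1$), the variables to be inferred (i.e., $\epsilon_{p+1}^t$) are partitioned into three groups: $e_1 = \epsilon_{p+1}^{t-q-1}$, $e_2 = \epsilon_{t-q}^{t-1}$, and $e_3 = \epsilon_t$. Here, $e_1$ and $e_2$, when concatenated, yield a vector of the same dimension as $\mu_{t-1}$, and we partition $\mu_{t-1}$ accordingly as $\mu_{t-1}(1)$ and $\mu_{t-1}(2)$. Similarly, $\Lambda_{t-1}$ is partitioned into ($2 \times 2$) blocks, denoted by $\Lambda_{t-1}(i, j)$ for $i, j \in \{1, 2\}$. The posterior can then be written as
$$P(e_1, e_2, e_3 \mid y_1^t, x_1^t, \epsilon_1^p) \propto \mathcal{N}\!\left(\begin{bmatrix} e_1 \\ e_2 \end{bmatrix}; \mu_{t-1}, \Lambda_{t-1}^{-1}\right) \mathcal{N}(e_3; 0, \gamma^2)\, \mathcal{N}(y_t; f_t + \beta^\top e_2 + \beta_0 e_3, \sigma^2). \qquad (33)$$
Similar to the second case, we derive $\mu_t$ and $\Lambda_t$ by rearranging the exponent of (33) as a canonical quadratic form in terms of ($e_1$, $e_2$, $e_3$). The resulting formulas are as follows:
$$\Lambda_t = \begin{bmatrix} \Lambda_{t-1}(1,1) & \Lambda_{t-1}(1,2) & 0 \\ \Lambda_{t-1}(2,1) & \Lambda_{t-1}(2,2) + \beta \beta^\top / \sigma^2 & \beta_0 \beta / \sigma^2 \\ 0 & \beta_0 \beta^\top / \sigma^2 & 1/\gamma^2 + \beta_0^2 / \sigma^2 \end{bmatrix}, \qquad \mu_t = \Lambda_t^{-1} \begin{bmatrix} g_{t-1}(1) \\ g_{t-1}(2) + \beta r_t / \sigma^2 \\ \beta_0 r_t / \sigma^2 \end{bmatrix},$$
where
$$g_{t-1}(i) = \Lambda_{t-1}(i,1)\, \mu_{t-1}(1) + \Lambda_{t-1}(i,2)\, \mu_{t-1}(2), \quad i \in \{1, 2\}.$$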
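The product-of-Gaussians step for the initial case can be illustrated with scalars. The sketch assumes that the observation equals a known offset `f` plus `beta0` times the latent error plus independent Gaussian noise; that observation form, the function name `initial_error_posterior`, and the argument names are assumptions made for this example.

```python
def initial_error_posterior(y, f, beta0, gamma2, sigma2):
    """Posterior mean and precision of a scalar error eps.

    Assumed model: eps ~ N(0, gamma2) and y = f + beta0 * eps + noise,
    with noise ~ N(0, sigma2).
    """
    precision = 1.0 / gamma2 + beta0 ** 2 / sigma2      # prior + likelihood
    mean = (beta0 * (y - f) / sigma2) / precision        # information / precision
    return mean, precision

mean, precision = initial_error_posterior(y=2.0, f=0.0, beta0=1.0, gamma2=1.0, sigma2=1.0)
print(mean, precision)
```

Working with the precision (inverse covariance) keeps the update additive, which is why the later recursive cases are naturally written in terms of the inverse covariance.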

Empirical Study
In this section we empirically test the effectiveness of the proposed sARMA estimation method. In particular, we deal with the task of portfolio selection on a real-world dataset comprising daily closing prices from the Dow Jones Industrial Average (DJIA).
We consider the online portfolio selection (OLPS) problem with real stock return data and begin with a brief description of the problem. Assume there are $n$ different stocks to invest in on a daily basis. At the beginning of day $t$, the historic closing stock prices up to day $t-1$, denoted by $\{\mathbf{s}_\tau\}_{\tau=0}^{t-1}$, are available, where $\mathbf{s}_\tau$ is an $n$-dim vector whose $i$th element $\mathbf{s}_\tau(i)$ is the price of the $i$th ticker. Using this information, one decides the portfolio allocation vector $\mathbf{b}_t$, a nonnegative $n$-dim vector that sums to 1 (i.e., $\sum_{i=1}^{n} \mathbf{b}_t(i) = 1$). Assuming no short positions are allowed, $\mathbf{b}_t(i)$ is the proportion of the whole budget to be invested in the $i$th stock for $i = 1, \ldots, n$.
The portfolio strategy is thus a function that maps the historic market information (say, $\{\mathbf{s}_\tau\}_{\tau=0}^{t-1}$) to the allocation $\mathbf{b}_t$. The sARMA-based portfolio strategy can be built by estimating sARMA models, one for each stock ticker $i$, on the stock log-return data, namely, $y_t = \log s_t - \log s_{t-1}$ (here we drop the dependency on $i$ for simplicity). The predicted $\hat{y}_t$ can then be used to decide the proportion of the budget to be invested in the $i$th ticker at time $t$. A reasonable strategy is to make no investment (i.e., $\mathbf{b}_t(i) = 0$) if $\hat{y}_t < 0$, while setting $\mathbf{b}_t(i)$ proportional to $\hat{y}_t$ if $\hat{y}_t > 0$.
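The two ingredients of this strategy, log-returns and an allocation proportional to positive forecasts, can be sketched as follows. The helper names and the choice to return an all-zero allocation when no forecast is positive are assumptions made for illustration.

```python
import math

def log_returns(prices):
    """Daily log-returns log(s_t) - log(s_{t-1}) of one price series."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def proportional_allocation(forecasts):
    """Weights proportional to positive forecasts; none for negative ones."""
    positive = [max(f, 0.0) for f in forecasts]
    total = sum(positive)
    if total == 0.0:
        return [0.0] * len(forecasts)  # assumption: stay out of the market
    return [v / total for v in positive]

print(log_returns([100.0, 110.0, 99.0]))
print(proportional_allocation([0.02, -0.01, 0.06]))
```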
To evaluate the performance of a portfolio strategy, we use the popular (running) relative cumulative wealth (RCW), defined as $\mathrm{RCW}(t) = W_t / W_0$, where $W_t$ is the total budget at time $t$. Thus $\mathrm{RCW}(t)$ indicates the total budget at time $t$ relative to the initial budget, and a portfolio strategy that yields a high $\mathrm{RCW}(t)$ for many epochs $t$ is regarded as a good strategy. Assuming that there is no transaction cost, it is not difficult to see that $\mathrm{RCW}(t) = \prod_{\tau=1}^{t} \mathbf{b}_\tau^\top \mathbf{r}_\tau$, where we define the price relative vector $\mathbf{r}_\tau = \mathbf{s}_\tau / \mathbf{s}_{\tau-1}$ (element-wise division).
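The product formula translates directly into a running computation. In this sketch, `allocations` and `price_relatives` are hypothetical lists with one vector per day, and transaction costs are ignored as in the text.

```python
def running_rcw(allocations, price_relatives):
    """Running relative cumulative wealth, one value per day."""
    wealth, path = 1.0, []
    for b, x in zip(allocations, price_relatives):
        # Daily growth factor: allocation-weighted average of price relatives.
        wealth *= sum(bi * xi for bi, xi in zip(b, x))
        path.append(wealth)
    return path

path = running_rcw(
    allocations=[[0.5, 0.5], [1.0, 0.0]],
    price_relatives=[[1.1, 0.9], [1.2, 1.0]],
)
print(path)
```

A value above 1 at the end of the horizon means the strategy finished with more than the initial budget.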
Hence, in the sARMA-based strategy, it is crucial to forecast the returns accurately, and we compare the model estimated with our cost-sensitive loss against the one using traditional ML estimation. For each approach, we estimate $n$ sARMA models, one for each stock return, and once the predicted returns $\hat{y}_t(i)$ at $t$ are obtained, the allocations $\mathbf{b}_t(i)$ are decided as follows: $\mathbf{b}_t(i) = 0$ if $\hat{y}_t(i) < \rho$ and $\mathbf{b}_t(i) = 1/n_1$ otherwise, where $n_1$ is the number of tickers $i$ with $\hat{y}_t(i) \ge \rho$. We also contrast them with the fairly standard market portfolio strategy, which sets $\mathbf{b}_t(i)$ proportional to the total market volume (i.e., the product of the price and the total number of shares) of ticker $i$.
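The thresholded equal-weight rule can be written compactly. The function name `threshold_allocation`, the default threshold value, and the all-zero fallback when no forecast clears the threshold are assumptions for this sketch; the text does not say how that last situation is handled.

```python
def threshold_allocation(forecasts, rho=0.005):
    """Equal weights over tickers whose forecast is at least rho."""
    chosen = [f >= rho for f in forecasts]
    n1 = sum(chosen)
    if n1 == 0:
        return [0.0] * len(forecasts)  # assumption: no investment that day
    return [1.0 / n1 if c else 0.0 for c in chosen]

print(threshold_allocation([0.01, 0.001, -0.02, 0.007]))
```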
We test the above-mentioned three portfolio strategies on real-world data: the 30 tickers' daily closing prices from the Dow Jones Industrial Average (DJIA) for about 15 months beginning on January 14, 2001, which amounts to about 340 daily records. The dataset is publicly available (http://www.mysmu.edu.sg/faculty/chhoi/olps/datasets.html, http://www.cs.technion.ac.il/~rani/portfolios), and a detailed description can be found in [22]. The stock tickers appear to be considerably correlated with one another and include GE, Microsoft, AMEX, GM, COCA-COLA, and Intel.
In the sARMA estimation, we set $p = q = 2$, and the cross predictors $x_t$ are defined to be the returns of the other 29 stocks at day $t-1$. The tuning parameter in our cost-sensitive estimation is chosen empirically. First, the average costs attained, that is, $(1/T) \sum_{t=1}^{T} C(\hat{y}_t, y_t)$, further averaged over the $n$ different models, are 114.0994 for the ML-estimated sARMA and 0.0030 for the proposed cost-sensitive sARMA. This implies that the proposed estimation method yields far more accurate predictions than the traditional ML method in terms of the proposed cost function.
Next, we depict the running RCW scores for the three competing portfolio strategies in Figure 1. As shown, the proposed approach (sARMA-cost) achieves the highest profits consistently for almost all $t$ during the time horizon, significantly outperforming the market strategy. The ML-based sARMA estimator performs the worst, which can be explained by its attempt to fit a model to the overall data without accounting for the asymmetric loss structure of asset return data, especially regarding the directions of return predictions. In the end, for $t > 250$, the proposed method indeed gives a positive return (i.e., $\mathrm{RCW}(t) > 1$), whereas the other two methods suffer a substantial budget loss ($\mathrm{RCW}(t) < 1$). This again signifies the effectiveness of cost-sensitive loss minimization in return prediction.

Conclusion
In this paper we have introduced a novel ARMA model identification method that exploits the asymmetric loss structure of financial asset return data. The proposed cost function encodes how well the directions of the true and model-predicted asset returns match, which is directly related to the ultimate profit of the investment. We have provided a subgradient-based optimization using a trust-region approximation, which has been empirically shown to work well for the portfolio selection problem in a real-world situation.

Figure 1: Running relative cumulative wealth for three competing portfolio selection strategies on the DJIA stock return data.