Matrix Factorization for Evolution Data

1 Software Institute, Sun Yat-Sen University, Guangzhou 510275, China; 2 School of Economics and Commerce, South China University of Technology, Guangzhou 510006, China; 3 Department of Interventional Radiology, The First Affiliated Hospital of Sun Yat-Sen University, Guangzhou 510080, China; 4 Department of Computing Science, Umeå University, 901 87 Umeå, Sweden; 5 Academy of Guangdong Telecom Co. Ltd., Guangzhou 510630, China


Introduction
The essence of the matrix factorization (MF) problem is to find two factor matrices $U$ and $V$ such that their product can approximate a given matrix $X$; that is, $X \approx U^{\top}V$. As a fundamental model in machine learning and data mining, MF methods have been widely used in various applications, such as collaborative filtering [1,2], social network analysis [3], text analysis [4], image analysis [5,6], and biological data analysis [7].
In this work, we study a special variant of the MF problem. Let the matrix $X \in \mathbb{R}^{m \times n}$ consist of the values of the objects $o_1, o_2, \ldots, o_m$ at a series of time points $t_1, t_2, \ldots, t_n$, where the entry $x_{i,j}$ is the value of $o_i$ at $t_j$. Our goal is to find suitable factor matrices $U$ and $V$ that not only approximate $X$, but also depict the evolution of the objects over time.
A typical application of this work is estimating missing historical traffic speed data, which is needed by many transportation information systems [8][9][10]. Given an urban road network with $m$ roads $r_1, r_2, \ldots, r_m$, we let $o_i$ ($1 \le i \le m$) correspond to the traffic speed of $r_i$ and let entry $x_{i,j}$ be the speed of $r_i$ at $t_j$; then $X$ is composed of the speed values of the roads $r_1 \sim r_m$ at the times $t_1 \sim t_n$. In particular, if the value of $x_{i,j}$ was not collected, we say it is missing and denote it as $x_{i,j} = \bot$. With this representation, to estimate the missing values, we can first fit $U$ and $V$ with the non-missing entries of $X$, and then for each $x_{i,j} = \bot$, take its estimation as $\hat{x}_{i,j} = u_i^{\top} v_j$, where $u_i$/$v_j$ is the $i$th/$j$th column of $U$/$V$. A main difference between this work and existing MF models is that we take time evolution effects into account. Well-studied MF models, such as nonnegative matrix factorization [5], max-margin matrix factorization [11,12], and probabilistic matrix factorization [13], are based on the i.i.d. assumption, which implies that the time-evolving behavior of the objects is ignored and treated as independent of time; their ability to describe time-varying data is therefore poor. To tackle this issue, the factors that affect the evolution of the objects are explicitly modeled in our contribution, which enables our model to exploit time-dependent information. Furthermore, our empirical studies confirm that the proposed model scales well with the size of the data.
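To make the imputation scheme concrete, here is a minimal numpy sketch. It is illustrative only: the dimensions are toy values, NaN stands in for the missing marker $\bot$, and the factor matrices are assumed to have already been fitted on the non-missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 6, 2  # m roads, n time points, k latent features

# Toy speed matrix; NaN plays the role of the missing marker.
X = rng.uniform(20.0, 60.0, size=(m, n))
X[1, 3] = np.nan
X[2, 0] = np.nan

# Pretend U (k x m) and V (k x n) were already fitted on the observed entries.
U = rng.normal(size=(k, m))
V = rng.normal(size=(k, n))

# Every entry estimate is the inner product of an object column and an
# environment column: X_hat[i, j] = u_i^T v_j.
X_hat = U.T @ V

# Keep the observed values, fill only the missing ones.
X_filled = np.where(np.isnan(X), X_hat, X)
```

Observed entries pass through unchanged; only the NaN positions receive the factor-model estimates.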

Mathematical Problems in Engineering
The remainder of this paper is organized as follows. Section 2 briefly summarizes the notations used in the paper. Section 3 reviews the works on matrix factorization. Section 4 presents our matrix factorization model as well as the statistical mechanism of the model. The proposed algorithm is introduced in Section 5. Section 6 is devoted to the analysis of the experiments. Our conclusions and future works are presented in Section 7.

Notations
For a matrix $A \in \mathbb{R}^{m \times n}$, we let $a_j$ be the $j$th column of $A$ and denote its Frobenius norm by $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}^2}$.

Matrix Factorization
The essence of matrix factorization is to find suitable factor matrices $U$ and $V$ such that their product can approximate a given matrix $X$. In principle, the MF problem can be formulated as the optimization model

(M1) $\min_{U,V} \ \mathrm{Loss}(U^{\top}V, X)$,

where the function Loss measures the closeness of the approximation $U^{\top}V$ to the target $X$. In general, $\mathrm{Loss}(U^{\top}V, X)$ can be decomposed into the sum of the pairwise losses between the entries of $U^{\top}V$ and $X$; that is, $\mathrm{Loss}(U^{\top}V, X) = \sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{Loss}((U^{\top}V)_{i,j}, x_{i,j})$. The most commonly used forms of the Loss function include the square loss ($\mathrm{Loss}(a,b) = (a-b)^2$) [1,13,14], the 0-1 loss ($\mathrm{Loss}(a,b) = \mathbb{I}(a \ne b)$) [11], and the divergence loss ($\mathrm{Loss}(a,b) = a \log(a/b) - a + b$) [6].
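The three pairwise loss forms are easy to state in code. A minimal illustration (the 0-1 loss is read here as an indicator of disagreement between the two entries):

```python
import math

def square_loss(a, b):
    """Square loss: (a - b)^2."""
    return (a - b) ** 2

def zero_one_loss(a, b):
    """0-1 loss: 1 when the entries disagree, 0 otherwise."""
    return 0 if a == b else 1

def divergence_loss(a, b):
    """Generalized KL divergence loss: a*log(a/b) - a + b, for a, b > 0."""
    return a * math.log(a / b) - a + b
```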
Notably, if $\{U^*, V^*\}$ is a solution of (M1), then for any scalar $c > 0$, $\{cU^*, (1/c)V^*\}$ is another solution, and hence problem (M1) is ill posed. To overcome this obstacle, various constraints on $U$ and $V$ have been introduced, such as constraints on the entries [5], on the sparseness [15,16], on the norms [13,17], and on the ranks [18,19]. From the perspective of statistical learning theory, all these constraints can be regarded as measures of the length of the model to be fitted. According to the minimum description length principle [20,21], a smaller length means a better model, and thus most of these constraints can be incorporated into (M1) as additional regularization terms; that is,

(M2) $\min_{U,V} \ \mathrm{Loss}(U^{\top}V, X) + R(U, V)$,

where the regularization term $R(U, V)$ corresponds to the constraints on $U$ and $V$.
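As a concrete instance, the square-loss version of (M2) with a Frobenius-norm regularizer can be sketched as follows. This is an illustration only: the function name and the specific choice $R(U,V) = \lambda(\|U\|_F^2 + \|V\|_F^2)$ are our assumptions, and the mask argument (1 on observed entries, 0 elsewhere) assumes $X$ stores a numeric placeholder in unobserved positions.

```python
import numpy as np

def mf_objective(U, V, X, mask, lam=0.1):
    """Square-loss form of (M2): sum of squared errors on the observed
    entries plus the norm penalty lam * (||U||_F^2 + ||V||_F^2)."""
    residual = (U.T @ V - X) * mask  # zero out unobserved entries
    return np.sum(residual ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))
```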
As a transductive model, (M2) has many appealing mathematical properties, such as a generalization error bound [22] and exactness guarantees [14,23]. However, as is well known, compared with a generative model, a main weakness of a transductive model is that it can hardly be used to describe the relations in the data. In particular, for the problem studied here, even though model (M2) may work well, it is difficult to express the interplay between the model and the evolution of the objects.

The Proposed Model
In this section the proposed model, matrix factorization for evolution data (MAFED), is presented. We first formalize MAFED as a constrained optimization model and then present a probabilistic graphical model whose solution is equivalent to that of the MAFED model. With this we obtain the generative interpretation of MAFED, which we use to elaborate the ability of MAFED to describe relations in the data. The last subsection is devoted to the solution algorithm for MAFED.
Here $u_i$ and $v_j$ correspond to the $i$th column of $U$ and the $j$th column of $V$, respectively. From constraints (3b)-(3d), we can see that the roles of $U$ and $V$ in (3a) are asymmetric. We call $U$ the object matrix, of which the $i$th column $u_i$ is the invariant feature vector of $o_i$. Similarly, we call $V$ the environment matrix, of which the $j$th column $v_j$ is the time-varying environment feature vector at $t_j$. Note that each $v_j$ takes effect globally; that is, at each time point $t_j$, all the objects $o_1, o_2, \ldots, o_m$ share the same environment feature vector $v_j$. As a result, every entry $x_{i,j}$ ($1 \le i \le m$, $1 \le j \le n$) is composed of the object's intrinsic feature vector $u_i$ and the environment feature vector $v_j$, and hence for each object $o_i$, its evolution over the times $t_1, t_2, \ldots, t_n$ is controlled by $v_1, v_2, \ldots, v_n$. To be more illustrative, consider the following example. Let $X$ be the speed matrix of the roads $r_1, r_2, \ldots, r_m$ at the time points $t_1, t_2, \ldots, t_n$, let $U$ be the feature matrix of the roads, and let $V$ be the feature matrix of the environment. Then $u_i$ corresponds to the feature description of $r_i$, and the entries of $u_i$ may consist of the intrinsic features of the road, such as the surface roughness, the number of lanes, and the role in the traffic network; similarly, every column $v_j$ of $V$ corresponds to the description of the environment at $t_j$, and the entries of $v_j$ can be the time, the weather, the visibility, and so on. Therefore, the product $x_{i,j} = u_i^{\top} v_j$ can be interpreted as follows: the speed of $r_i$ at $t_j$ is composed of the road feature $u_i$ and the environment feature $v_j$.
The ability of MAFED to describe the time evolution effect lies in constraint (3d). A fundamental characteristic of evolution is that, for all objects, when the interval between two adjacent time points, say $t_j$ and $t_{j+1}$, is small enough, the corresponding values of the objects at $t_j$ and $t_{j+1}$ should tend to remain the same. To satisfy this requirement, in (3d) we introduce tunable bounds $\sigma_j$ on $\|v_{j+1} - v_j\|_2$. When $U$ and $X$ are fixed, $\sigma_j \to 0$ forces $\|v_{j+1} - v_j\|_2 \to 0$, and so via (3d) it is guaranteed that $v_{j+1}$ stays close to $v_j$. On the other hand, when the time interval $|t_j - t_{j+1}| \to +\infty$, the values that $v_j$ and $v_{j+1}$ take tend to become independent. For this case, we let $\sigma_j \to +\infty$, indicating that $v_{j+1}$ can take its value regardless of $v_j$.
With the constraints (3c) and (3d), we can control the value of $\|v_j\|_2$ ($j = 2, \ldots, n$): by the triangle inequality, $\|v_j\|_2 \le \|v_1\|_2 + \sum_{l=1}^{j-1} \|v_{l+1} - v_l\|_2$. This shows that we can bound the sum $\sum_{j=1}^{n} \|v_j\|_2^2$ by selecting appropriate parameters in (3c) and (3d). Besides, from constraint (3b), the sum $\sum_{i=1}^{m} \|u_i\|_2^2$ is bounded. According to [12,24], under suitable conditions the sum is the convex envelope of $\mathrm{rank}(U^{\top}V)$, and hence MAFED accommodates the ability to control the rank of the model. On the other side, as shown by Candès and Recht [14] and Candès and Tao [23], when the target matrix is low rank, it is possible to exactly recover it from only a few of its observations. This suggests that MAFED can achieve high accuracy in missing data imputation applications.
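The chain constraint can also be read as a smoothness penalty on adjacent environment vectors. A small illustrative helper (a sketch in our own naming, not part of MAFED's exact formulation):

```python
import numpy as np

def chain_penalty(V):
    """Sum of squared distances between adjacent columns of V, a soft form
    of constraint (3d): driving this toward zero forces v_{j+1} to stay
    close to v_j (gradual evolution), while leaving it large lets adjacent
    columns drift apart (saltation)."""
    diffs = V[:, 1:] - V[:, :-1]
    return float(np.sum(diffs ** 2))
```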

The Generative Interpretation.
In this section, to investigate how MAFED interplays with the evolution of the data from the viewpoint of generative modeling, we first present a probabilistic graphical model (PGM [25]) and then show that the maximum likelihood solution of the proposed PGM is exactly the same as the solution of (3a). First of all, the Lagrange dual of (3a) is the penalized model (7). Here $\lambda$, $\mu$, and $\theta_j$ ($j = 1, 2, \ldots, n-1$) are tunable parameters, corresponding to the upper bounds of constraints (3b)-(3d), respectively, where larger bounds correspond to smaller parameters, and vice versa.
Considering the PGM presented in Figure 1, we make the following assumptions for $U$, $V$, and $X$.
(1) The columns of $U$ are drawn from the same Gaussian distribution with mean $0$ and covariance matrix $\sigma_U^2 I_k$; that is, for $1 \le i \le m$, $u_i \sim \mathcal{N}(0, \sigma_U^2 I_k)$ (here $k$ denotes the dimension of $u_i$). (2) The columns of $V$ are linearly dependent in the order of their subscripts with respect to the prespecified priors $\sigma_V$ and $\sigma_1, \sigma_2, \ldots, \sigma_{n-1}$.

We also assume $v_1$ is a Gaussian random vector with mean $0$ and covariance $\sigma_V^2 I_k$, and $v_j$ ($j > 1$) is a Gaussian random vector with mean $v_{j-1}$ and covariance $\sigma_{j-1}^2 I_k$; that is, $v_1 \sim \mathcal{N}(0, \sigma_V^2 I_k)$ and $v_j \sim \mathcal{N}(v_{j-1}, \sigma_{j-1}^2 I_k)$ for $j > 1$. (3) The $(i,j)$th entry of $X$ ($1 \le i \le m$, $1 \le j \le n$) is a Gaussian random variable with mean $u_i^{\top} v_j$ and variance $\sigma_X^2$; that is, $x_{i,j} \sim \mathcal{N}(u_i^{\top} v_j, \sigma_X^2)$. With these assumptions, we show in the remainder of this section that, for a given matrix $X$ and the priors above, model (7) is equivalent to a maximum likelihood fitting problem: combining the conditional independencies of Figure 1, the joint likelihood factorizes into the priors on $U$ and $V$ and the conditional distribution of $X$, and taking the logarithm of the likelihood and negating it recovers the penalized objective of (7).

The Algorithm
We in this section present a fitting algorithm to solve (7). For clarity, we only discuss the case in which all the time intervals are of equal length; that is, $t_{j+1} - t_j$ is the same for all $j = 1, \ldots, n-1$. Under this hypothesis, we take $\theta_1 = \theta_2 = \cdots = \theta_{n-1}$ in model (7); the gradient of the chain penalty with respect to $v_j$ then involves only the adjacent columns $v_{j-1}$ and $v_{j+1}$ for $j = 2, 3, \ldots, n-1$, with analogous boundary expressions for $v_1$ and $v_n$. With these equations, Algorithm 1 is presented.
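A plausible gradient-descent sketch of this equal-interval fitting procedure follows, assuming a square loss and the penalized objective with parameters $\lambda$, $\mu$, $\theta$. All names, the single shared learning rate, the default values, and the loss choice are our assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def fit_mafed(X, mask, k=5, eta=0.05, lam=0.1, mu=0.1, theta=0.1,
              iters=300, seed=0):
    """Gradient descent on
        sum of squared errors on observed entries
        + lam * ||U||_F^2 + mu * ||v_1||^2
        + theta * sum_j ||v_{j+1} - v_j||^2   (equal time intervals).
    Returns the fitted factors U (k x m) and V (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = 0.1 * rng.normal(size=(k, m))
    V = 0.1 * rng.normal(size=(k, n))
    Xz = np.nan_to_num(X)              # missing entries are masked out anyway
    for _ in range(iters):
        E = (U.T @ V - Xz) * mask      # residuals on observed entries only
        gU = 2.0 * (V @ E.T) + 2.0 * lam * U
        gV = 2.0 * (U @ E)
        # Gradient of the chain penalty: only adjacent columns interact.
        d = V[:, 1:] - V[:, :-1]
        gV[:, :-1] -= 2.0 * theta * d
        gV[:, 1:] += 2.0 * theta * d
        gV[:, 0] += 2.0 * mu * V[:, 0]
        U -= eta * gU
        V -= eta * gV
    return U, V
```

On a fully observed low-rank matrix this sketch drives the reconstruction error far below that of the trivial all-zero prediction within a few hundred iterations.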

Experiments
In this section, experimental evaluations on a finance dataset and a traffic speed dataset are presented. The proposed MAFED algorithm is compared with two other state-of-the-art approaches.

Evaluation Methodology.
To evaluate the algorithms, we perform missing data imputation on an incomplete matrix $X$. As in many other matrix completion problems, the testing protocol adopted is the Given X ($0 < X < 1$) protocol [27]; that is, we show the algorithm only a fraction $X$ of the observed entries, while holding out the remaining $(1-X)$ of the observations to evaluate the trained model. For example, when $X = 10\%$, the algorithm is trained with 10% of the non-missing entries and the remaining 90% of the non-missing entries are held out to be recovered. The first comparison algorithm selected is the probabilistic principal component analysis model (PPCA [28]), which achieves state-of-the-art performance on the missing traffic flow data imputation problem [29]. The second is the probabilistic matrix factorization model (PMF), one of the most popular algorithms for the Netflix matrix completion problem [13].
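A minimal sketch of such a split (the function and its interface are our own naming; NaN marks the genuinely missing entries):

```python
import numpy as np

def given_x_split(X, x_frac, seed=0):
    """Split the observed (non-NaN) entries of X into a training mask
    containing a fraction x_frac of them and a test mask holding the rest,
    mirroring the Given X protocol."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(X))    # (row, col) pairs of observed entries
    rng.shuffle(obs)
    cut = int(round(x_frac * len(obs)))
    train = np.zeros(X.shape, dtype=bool)
    test = np.zeros(X.shape, dtype=bool)
    train[tuple(obs[:cut].T)] = True
    test[tuple(obs[cut:].T)] = True
    return train, test
```

Genuinely missing entries belong to neither mask, so the trained model is never scored on values that were never observed.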
The evaluation criterion employed is the root mean square error (RMSE). More formally, for a test dataset $S = \{s_1, s_2, \ldots, s_N\}$ and the estimated set $\hat{S} = \{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N\}$ ($\hat{s}_l$ is the estimation of $s_l$), the RMSE of the estimation is given by $\sqrt{(1/N)\sum_{l=1}^{N}(s_l - \hat{s}_l)^2}$. In all our experiments, for every $X$ value, the data partition is repeated 5 times, and the average results as well as the standard deviations are recorded. To characterize the evolution of the opening prices, we introduce the term incremental rate: for the $i$th company, the incremental rate of its opening price on the $j$th day is given by $\mathrm{Inc}_{i,j} = (x_{i,j} - x_{i,j-1}) / x_{i,j-1}$.
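Both quantities are straightforward to compute; a short sketch (the function names are ours):

```python
import numpy as np

def rmse(s, s_hat):
    """Root mean square error between held-out values s and estimates s_hat."""
    s = np.asarray(s, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    return float(np.sqrt(np.mean((s - s_hat) ** 2)))

def incremental_rates(prices):
    """Relative change of consecutive values: (p_j - p_{j-1}) / p_{j-1}."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] - p[:-1]) / p[:-1]
```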

Experiments on the Finance Dataset.
Intuitively, the incremental rate quantifies how strong the evolution is. A smaller $|\mathrm{Inc}_{i,j}|$ implies a closer connection between $x_{i,j-1}$ and $x_{i,j}$, and hence we say the evolution from $x_{i,j-1}$ to $x_{i,j}$ is gradual; a larger $|\mathrm{Inc}_{i,j}|$ indicates a looser relation between $x_{i,j}$ and $x_{i,j-1}$, and thus the evolution is considered saltatory.
We calculate the incremental rates for all companies over days 2 to 245. Figure 2 illustrates the cumulative probability distribution of these rates. We can observe that almost all the incremental rates fall within the narrow interval (−10%, 10%). This implies that the changes in the opening prices are very slight; in other words, the evolution of the opening prices is gradual.
We evaluate the parametric sensitivity of the algorithm by tuning two of the regularization parameters of model (7). In our experiments, we first fix one parameter at 0.01 and vary the other via the assignment $0.01 \times 2^p$, where $p = 0, 1, \ldots, 9$; we then do the reverse. For the Given X protocol setting, we take $X = 50\%$; that is, 50% of the observed data is randomly selected as training data, with which the algorithm is trained to recover the remaining 50% as test data. The average RMSEs of the experiments are shown in Figure 3.
As shown in Figure 3, the RMSE values remain stable even when the first parameter is expanded by more than 200 times ($2^8 = 256$). A similar result appears in the experiments on the other parameter: significant changes in the RMSE value occur only in cases where $p > 7$ (i.e., the parameter is expanded by more than 128 times).
To study the prediction ability of the proposed algorithm, we increase the $X$ value in the Given X protocol from 10% to 50%. Then for each $X$ setting, missing imputations are performed using MAFED with $\lambda = \mu = \theta = 0.5$ and the comparison algorithms with the latent feature number $k = 10$ and $k = 30$, respectively. As illustrated in Table 1, MAFED outperforms PPCA significantly in all settings. For any $X$ value, the RMSE of MAFED is at most 20% of that of PPCA. Specifically, for $X \in \{30\%, 40\%, 50\%\}$, the RMSE of MAFED is only 10% of that of PPCA. More notably, the value of $k$ has strikingly different impacts on MAFED and PPCA. When $k$ changes from 10 to 30, almost all the RMSEs of PPCA dramatically increase, while for MAFED, the RMSEs decrease by nearly 5% except for the case with $X = 10\%$. This observation suggests that, with a careful setting of the parameter $k$, it is possible to improve the prediction accuracy of MAFED. We can also observe that, compared with PMF, MAFED achieves smaller RMSEs in all settings. In particular, when $X \in \{30\%, 40\%, 50\%\}$, the RMSE of MAFED is only half of that of PMF. Another interesting finding is that, for PMF, the value of $k$ has little impact on the predictions.

Algorithm 1. Input: matrix $X$; number of latent features $k$; learning rates $\eta_1$, $\eta_2$, and $\eta_3$; regularization parameters $\lambda$, $\mu$, and $\theta$; threshold $\epsilon$. Output: the estimated matrix $\hat{X}$.

The convergence rate of MAFED is also explored in this work. We record the RMSEs of the data recovered by MAFED every 10 iterations and plot them in Figure 4, where the $x$-axis is the number of iterations and the $y$-axis represents the RMSEs. We can observe that, for all $X$ values, the corresponding curves drop dramatically in the first 20 iterations and remain stable afterwards. From the results presented in Figure 3 and Table 1, we can conclude that, in our experiments, MAFED converges to a local optimum within 100 iterations.

Experiments on the Traffic Speed Dataset.
To reconfirm the recovery performance of MAFED, in this section we conduct another evaluation on a traffic speed dataset collected in the urban road network of Zhuhai City, China [30], from April 1, 2011 to April 30, 2011. Again we adopt a matrix $X$ to represent the data, where $X$ consists of 1853 rows and 8729 columns. Each row corresponds to a road and each column corresponds to a 5-minute time interval. All columns are arranged in ascending order of time. The entry $x_{i,j}$ ($1 \le i \le 1853$, $1 \le j \le 8729$) of $X$ is the aggregate mean traffic speed of the $i$th road in the $j$th interval. As the data in $X$ are collected by probe vehicles, the value of $x_{i,j}$ may be missing if there was no probe vehicle on the $i$th road during the $j$th time interval. In the dataset, nearly half of the data, that is, about 8 million entries of $X$, are such missing values.
We calculate the incremental rates of the non-missing data in $X$ and present the cumulative distribution of the incremental rates in Figure 5. The evolution of the traffic speeds differs remarkably from that of the stock opening prices. As mentioned above, the incremental rates of the finance dataset concentrate mainly in the narrow interval [−10%, 10%], while for the traffic speed data, as demonstrated in Figure 5, the range of the incremental rates spreads from −1 to 10. This indicates that plenty of sudden changes of the road speeds occur in adjacent time intervals; in other words, many of the speed evolutions are saltatory. A possible cause of this observation is the effect of traffic lights in the urban road network. When the traffic light of a road turns from green to yellow (and then red), the traffic speed on the road immediately drops to 0. On the other hand, when the light turns from red to green, vehicles restart almost instantly, resulting in a significant acceleration of the speed on the road.
This experiment is conducted with the latent feature number $k = 10$ and the other two parameters set to 0.5 and 10, respectively. Table 2 shows that in all the $X$ settings, MAFED again outperforms PPCA and PMF. In particular, when there are few observations (e.g., $X = 10\%$ and $X = 20\%$), the RMSEs of MAFED are 33% lower than those of PPCA and 10% lower than those of PMF. When $X > 20\%$, the RMSE differences between PPCA and MAFED become slight. Despite that, the overall errors of PPCA remain roughly 3% to 5% higher than those of MAFED, and for PMF, the RMSEs remain roughly 10% higher than those of MAFED.

Conclusion and Future Works
Matrix factorization models are fundamental in machine learning and data mining. In this paper, we present MAFED, a matrix factorization model for evolution data. In MAFED, the two factor matrices are treated as being composed of the intrinsic features of the objects and the time-varying features of the environment, respectively. Hence, their product accommodates the ability to describe the evolution of the objects over time. Besides, we construct a probabilistic graphical model with which the statistical nature of MAFED is elaborated. We also present a fitting algorithm to find the solution of MAFED. Finally, we evaluate MAFED through missing data imputation on two real-world datasets. The experimental results indicate that the proposed model outperforms the comparison algorithms in both the gradual and the saltatory cases.
According to the generative interpretation of MAFED, its success is mainly attributed to the introduction of the linear-chain structural prior; a natural extension of the present work is therefore to deal with factorizations of matrices with more complex structural priors. In particular, as shown in the empirical studies, much real-world data is highly incomplete, so it is important and interesting to study the factorization of incomplete matrices with no prespecified structural information.

Figure 2: The cumulative probability distribution curve of the incremental rates of the finance dataset.

Figure 4: Empirical results on the algorithm convergence.

Figure 5: The cumulative probability distribution curve of the incremental rates of the traffic speed dataset.

Table 1: Imputation results on the finance dataset.

Table 2: Imputation results on the traffic dataset.