Joint MMSE transceiver designs and performance benchmark for CoMP transmission and reception

Coordinated Multipoint (CoMP) transmission and reception has been suggested as a key enabling technology of future cellular systems. To understand different CoMP configurations and to facilitate the configuration selection (and thus determine channel state information (CSI) feedback and data sharing requirements), performance benchmarks are needed to show what performance gains are possible. A unified approach is also needed to enable the cluster of cooperating cells to systematically take care of the transceiver design. To address these needs, the generalized iterative approach (GIA) is proposed as a unified approach for the minimum mean square error (MMSE) transceiver design of general multiple-transmitter multiple-receiver multiple-input-multiple-output (MIMO) systems subject to general linear power constraints. Moreover, the optimum decoder covariance optimization approach is proposed for downlink systems. Their optimality and relationships are established and shown numerically. Five CoMP configurations (Joint Processing-Equivalent Uplink, Joint Processing-Equivalent Downlink, Joint Processing-Equivalent Single User, Noncoordinated Multipoint, and Coordinated Beamforming) are studied and compared numerically. Physical insights, performance benchmarks, and some guidelines for CoMP configuration selection are presented.


Introduction
Although cellular systems face many challenges, such as multipath fading, cell edge interference, and scarce spectrum, there is demand for even better cellular performance than what is achieved today. Meeting this demand requires revolutionary ideas. Coordinated Multipoint (CoMP) transmission and reception, a type of Network MIMO (multiple-input multiple-output) in Long-Term Evolution-Advanced (LTE-A) [1], is one of those ideas and is a key enabling technology of future cellular systems. Being a MIMO technique, it actually exploits the multipath fading. Furthermore, it lowers the cell edge interference by having potentially interfering cells cooperate. Lastly, the lower interference allows for better spectrum reuse and, therefore, better use of the scarce spectrum. Since there are various levels of cell cooperation, there are various CoMP configurations [1][2][3][4]. The following three categories of configurations are generally considered.
The first category is Noncoordinated Multipoint (Non-CoMP) and does not use CoMP at all. In it, each base station (BS) communicates with its own user(s) and does so without cooperating with the other cells in data sharing or channel state information (CSI) exchange. Each BS either ignores or tries to estimate the intercell interference. It has the lowest level of cooperation.
The second category is Coordinated Beamforming (CBF). (In LTE-A, it is also referred to as Coordinated Scheduling and Coordinated Beamforming (CS/CB).) Here, each BS again communicates only with its own user(s), and there is no data sharing between BSs and no data sharing between users. This time, though, the cells do cooperate to minimize the interference they cause to each other through coordination and joint transmitter and/or receiver design. It has the second lowest level of cooperation. Much work has been done for CBF configurations where each cell has one transmitter and receiver pair [5][6][7][8][9][10][11][12] and where each cell has one transmitter and multiple receivers [13][14][15][16]. There are also different CSI considerations (e.g., CSI only available at receivers [5][6][7][8][16], full CSI available at a central processing unit [9][10][11][12][13][14], CSI available only on a per-cell basis [15]) and different design strategies (e.g., centralized [9][10][11][12][13][14] or distributed [15] designs).
The third category is Joint Processing (JP). Here, the cells fully cooperate; the BSs act as a single equivalent transmitter in downlink (the data is processed and transmitted jointly from the BSs) to form the Joint Processing-Equivalent Downlink (JP-DL) [17][18][19] and act as a single equivalent receiver in uplink (all received signals are shared and jointly processed) to form the Joint Processing-Equivalent Uplink (JP-UL) [20]. It has been shown that JP-UL [20] and JP-DL [17] bring significant gains to both the cell average throughput and the cell edge user throughput. Note that JP-UL and JP-DL have a higher level of cooperation than the previous two categories (Non-CoMP and CBF). When the users also act as a single equivalent receiver (resp., transmitter) in downlink (resp., uplink), the result is the Joint Processing-Equivalent Single User (JP-SU), which is essentially a point-to-point MIMO system. JP-SU has the highest level of cooperation and is only of theoretical interest.
In addition, a few attempts have been made to jointly consider different categories/configurations. For example, joint precoder and decoder designs (e.g., SINR balancing, user rate balancing, and maximum sum rate) are proposed for Non-CoMP, JP-DL, and CBF, and a numerical comparison of their ergodic sum rates is made in [21][22][23]. To the best of our knowledge, however, the literature offers neither a comparison of the various CoMP configurations nor guidelines for selecting among them.
As seen from these previous works, the precoder and decoder designs and performance evaluation for CoMP systems can be very complex and diverse. This is due to the fact that there exist various CoMP configurations, design criteria, and constraints (e.g., the per-antenna power constraint, per-transmitter power constraint). There also exists a vast number of design approaches associated with each of the design criteria, each of the constraints, and each of the CoMP configurations. Moreover, CoMP was not considered mature and was not adopted by 3GPP in LTE release 10 [24]. Thus, performance benchmarks (which show what performance gains are possible) for CoMP configurations are needed to help determine rules for configuration selection. Since different CoMP configurations require different levels of CSI feedback and data sharing, these rules also help to determine CSI feedback and data sharing requirements. There is also a need for a unified approach to enable the cluster of cooperating cells to systematically take care of the transceiver design of whatever configuration they choose to implement. Both of these two needs will be addressed in this paper.
To address the need for performance benchmarks, we consider joint MMSE precoder and decoder designs for JP-UL, JP-DL, JP-SU, Non-CoMP, and CBF. Firstly, this is because joint MMSE designs can be considered as performance benchmarks for other practical design criteria; an MMSE solution is near optimum in some other senses (e.g., maximum sum rate [25,26], minimum BER [27]) as well. It has been shown that maximizing the sum rate is equivalent to minimizing the geometric mean of the MSEs of all data streams [25]. Moreover, minimizing the sum MSE is equivalent to minimizing an upper bound on the geometric mean of the MSEs. Thus, the MMSE results are nearly optimum in the maximum sum rate sense. Regarding BER, it has been shown that the MMSE design minimizes a lower bound on the BER [27]. In addition, the BER results of the MMSE and minimum BER designs in [26] are very comparable. So, the MMSE results are nearly optimum in the minimum BER sense as well. Though the studies in [25][26][27] are for single-user systems, these remarks are also true for CoMP systems. Secondly, note that with full CSI, JP-SU provides a performance upper bound for all CoMP configurations with the same total number of transmit antennas and the same total number of receive antennas, as shown in Figure 1. Similarly, Non-CoMP and CBF, where each cell has one transmitter and receiver pair, provide performance upper bounds for their respective categories, given the same total number of transmit antennas and the same total number of receive antennas. Thus, the performance benchmarks can be set forth numerically for various simulation setups; these numerical performance benchmarks can then be used to compare the different configurations and/or categories.
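The link between the sum MSE and the geometric mean of the MSEs follows from the inequality of arithmetic and geometric means. As a short worked step, write $\eta_k$ for the MSE of the $k$th of $m$ data streams (these symbols are introduced here only for illustration):

```latex
\left( \prod_{k=1}^{m} \eta_k \right)^{1/m}
\;\le\;
\frac{1}{m} \sum_{k=1}^{m} \eta_k ,
\qquad \eta_k \ge 0 .
```

Reducing the sum MSE therefore reduces an upper bound on the geometric mean of the MSEs, and the bound is tight only when all $\eta_k$ are equal.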
Although not much MMSE work has been published for the CoMP configurations, joint MMSE transceiver designs for the single-user, multiuser downlink, multiuser uplink, and CBF MIMO systems have been studied. For example, for single-user MIMO systems, closed-form expressions of the MMSE design have been derived for the total power constraint [25,26] and for the shaping constraints [28]. For uplink MIMO systems subject to the per-user power constraint, numerical solutions are provided mainly by the optimal transmit covariance optimization approach (TCOA) [29,30] and suboptimal iterative approaches such as in [29]. For downlink systems, numerical solutions are provided mainly by iterative approaches such as in [31] for the total power constraint and in [18] for the per-antenna and per-cell power constraints. Dual uplink approaches [32][33][34] have also been employed for the total power constraint. Recently, for K-user MIMO interference channels (a case of CBF), a joint MMSE design subject to per-transmitter power constraint, using a linear search for each Lagrange multiplier, is proposed [35].

Note that various CoMP configurations can be considered as special cases of general multiple-transmitter multiple-receiver (MTMR) systems. In this paper, the novel generalized iterative approach (GIA) is proposed as the unified approach to take care of the MMSE design of general MTMR MIMO systems subject to general linear power constraints, including the per-transmitter power constraint and the more practical per-antenna power constraint. The GIA can provide tradeoff between multiplexing and diversity gains. In addition, the optimum decoder covariance optimization approach (DCOA) for the MMSE design of downlink systems (i.e., JP-SU, JP-DL, and Non-CoMP) subject to general linear power constraints is also proposed so that the optimality of the GIA can be studied. For this purpose, the equivalence between the GIA and the optimum TCOA [29,30] for the uplink or DCOA for the downlink is established in the respective configurations.
In the numerical simulations, firstly, aspects pertaining to the proposed approaches are investigated. The convergence properties of the proposed approaches are investigated; the optimality and diversity/multiplexing tradeoff of the GIA are verified numerically; numerical comparison between the GIA and the approach in [35] is investigated. Secondly, aspects pertaining to performance benchmark are investigated. To set forth a benchmark among different CoMP configurations, MSE and BER performances for the five CoMP configurations (JP-SU, JP-DL, JP-UL, CBF, and Non-CoMP) are compared. Since this paper is concerned with performance benchmarks (achievable theoretical upper bounds), fairness-type criteria, and practical issues such as synchronization required by different CoMP configurations are not considered here. Various important factors (level of cooperation, system load, system size, and path loss) are studied though. The performance benchmarks and the resulting physical insights (into the mechanisms and performances of CoMP configurations) are very useful. In particular, much needed guidelines for the configuration selection process are obtained.
Notations are as follows. All boldface letters indicate vectors (lower case) or matrices (upper case). $A^T$, $A^*$, $A^{-1}$, $\mathrm{tr}(A)$, $E(A)$, $\mathrm{rank}(A)$, and $\|A\|_F$ stand for the transpose, conjugate transpose, inverse, trace, expectation, rank, and Frobenius norm of $A$, respectively. $\mathrm{abs}(A)$ denotes the element-wise absolute value of $A$. $\mathrm{span}(A)$ represents the subspace spanned by the columns of $A$. Matrix $I_a$ signifies an identity matrix of rank $a$. Matrix $0$ signifies a zero matrix of proper dimension. $\mathrm{diag}[\cdots]$ denotes the diagonal matrix with elements $[\cdots]$ on the main diagonal. $A > B$ ($A \ge B$) means that $A - B$ is positive definite (semidefinite). $A \bullet B$ denotes the Schur product of $A$ and $B$ (element-wise product of $A$ and $B$). $CN(\mu, q)$ denotes a complex normal random variable with mean $\mu$ and variance $q$. Finally, i.i.d. stands for independent and identically distributed.

A Single Formulation for General MTMR MIMO Systems.
In this subsection, we derive a single formulation to describe a general MTMR MIMO system, including the five CoMP configurations (JP-UL, JP-DL, JP-SU, Non-CoMP, and CBF) investigated in this paper. Consider an MTMR MIMO system with $T$ transmitters and $R$ receivers. Let $\tau_n$ and $\gamma_l$ denote the numbers of antennas at the $n$th transmitter and the $l$th receiver, respectively. Accounting for the path loss (spatial correlation can be easily incorporated as well but has been omitted for simplicity), the channel from the $n$th transmitter to the $l$th receiver is modeled as

$$H_{ln} = d_{ln}^{-\beta} H_{w,ln}. \quad (1)$$

Here, $d_{ln}$ denotes the distance between the $l$th receiver and the $n$th transmitter, and $2\beta$ is the path loss exponent. The entries of $H_{w,ln}$ are i.i.d. $CN(0,1)$; the subscript $w$ indicates that they are spatially white. Some of the transmitters (resp., receivers) in the CoMP system may be sharing and jointly processing their data (resp., received signals). Such a collection of transmitters (resp., receivers), which are connected via backhaul, share CSI and data, and act like a single transmitter (resp., receiver) in transmission and data processing, is a composite transmitter (resp., receiver) and thus an equivalent transmitter (resp., receiver). For the sake of having a single formulation, a transmitter (resp., receiver) which does not collaborate with other transmitters (resp., receivers) in the above way is also considered to be an equivalent transmitter (resp., receiver). Thus, this MTMR MIMO system can also be (and will be) considered as having $C$ equivalent transmitters (eq-transmitters for short) and $K$ equivalent receivers (eq-receivers for short). Obviously, $C \le T$ and $K \le R$.
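The channel model can be sketched numerically as follows. This is a minimal illustration, assuming the amplitude of each link scales as $d_{ln}^{-\beta}$ so that the received power decays with the path loss exponent $2\beta$; the function name `pathloss_channel` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def pathloss_channel(n_rx, n_tx, distance, beta, rng):
    """Draw one channel matrix with i.i.d. CN(0,1) entries scaled by path loss.

    Assumption: amplitude scaling distance**(-beta), i.e. power decays as
    distance**(-2*beta). Spatial correlation is omitted, as in the text.
    """
    # Real and imaginary parts each have variance 1/2, giving CN(0,1) entries.
    h_w = (rng.standard_normal((n_rx, n_tx))
           + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2.0)
    return distance ** (-beta) * h_w
```

Doubling the distance with the same random draw scales every entry by $2^{-\beta}$, which is a quick way to sanity-check a simulation setup.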
Let $t_c$ and $r_i$ denote the numbers of antennas at the $c$th eq-transmitter and the $i$th eq-receiver, respectively. Then, $t = \sum_{n=1}^{T} \tau_n = \sum_{c=1}^{C} t_c$ and $r = \sum_{l=1}^{R} \gamma_l = \sum_{i=1}^{K} r_i$ are the total numbers of transmit and receive antennas, respectively. Also let $H_{ic}$ denote the composite channel matrix from the $c$th eq-transmitter to the $i$th eq-receiver. At the $c$th eq-transmitter, let $s_{ic}$, $m_{ic}$, and $F_{ic}$ denote the data, number of data streams, and precoder for the $i$th eq-receiver, respectively. Furthermore, let $\Phi_{s_{ic}} = E(s_{ic} s_{ic}^*)$ and $G_{ic}$ be, respectively, the source covariance matrix for $s_{ic}$ and the decoder for $s_{ic}$. Which transmitter transmits to which receiver is configurable. When the $c$th eq-transmitter has no data to transmit to the $i$th eq-receiver, $s_{ic} = 0$, $m_{ic} = 0$, $\Phi_{s_{ic}} = 0$, $F_{ic} = 0$, and $G_{ic} = 0$. When it does, $\Phi_{s_{ic}}$ is positive definite and $F_{ic}$ and $G_{ic}$ must be designed.
In this system, there may be multiple clusters, where each cluster jointly designs the MIMO processors for its own eq-transmitters and eq-receivers but does so independently of the other clusters. There is no CSI sharing between clusters, and the intercluster interference is formulated as noise. Let $D$ and $S$ define one such cluster, $D$ being the set of eq-transmitter indices in the cluster and $S$ being the set of eq-receiver indices in the cluster. $D$ and $S$ are introduced to allow a single formulation to take care of the MMSE transceiver design for different CoMP configurations. At the $i$th eq-receiver, $i \in S$, the received signal is thus

$$y_i = \sum_{c \in D} H_{ic} \sum_{k \in S} F_{kc} s_{kc} + n_i, \quad (2)$$

where $n_i = a_i + i_i$. Here, $n_i$, $a_i$, and $i_i$ are the noise plus intercluster interference vector, the noise vector, and the intercluster interference vector, respectively, at the $i$th eq-receiver. The interference is from all of the eq-transmitters which do not belong to $D$. Thus, when there is only one cluster in the system, there is no intercluster interference and $n_i = a_i$, $i_i = 0$ for every $i \in S$. Note that, except in Non-CoMP, the possible intercell interference is implicitly included in the first term in (2) and is considered to be manageable.

Five CoMP Configurations.
The CSI feedback and data sharing needed in each CoMP configuration are assumed to be carried out over ideal links with zero delay. The above single formulation is able to describe any general MTMR MIMO system, including JP-UL, JP-DL, JP-SU, Non-CoMP, and CBF. There is only one cluster in JP-UL, JP-DL, JP-SU, and CBF, whereas there are $C$ clusters in Non-CoMP. Without loss of generality and for convenience, the Non-CoMP and CBF configurations considered in this paper have only one transmitter-receiver pair per cluster.

Configuration I: JP-UL.
In JP-UL, the system has only one cluster and is just an equivalent uplink MIMO system; that is, there are multiple transmitters (each being an eq-transmitter) but only one eq-receiver (full cooperation among all receivers). Thus, $C = T$, $K = 1$, and the system is described by (4). For both FDD and TDD systems, each BS estimates all uplink CSI and sends the CSI to a central processing unit via the backhaul (if the BSs are co-located, the backhaul is not needed). The central processing unit performs the system-wide transceiver design and sends each user its optimized precoder through the serving BS. Each user uses the received precoder for transmitting data. Lastly, the BSs share their received signals with the central processing unit for joint decoding.

Configuration II: JP-DL.
In JP-DL, the system has only one cluster and is just an equivalent downlink MIMO system; that is, there are multiple receivers (each being an eq-receiver) but only one eq-transmitter (full cooperation among all transmitters). Thus, $C = 1$, $K = R$, and the system is described by (5). In TDD systems, the BSs estimate downlink CSI through reciprocity. In FDD systems, each user estimates all intracluster downlink CSI and feeds back the CSI to its serving BS. After obtaining the CSI, each BS sends the CSI to a central processing unit via the backhaul (if the BSs are co-located, the backhaul is not needed). The central processing unit performs the system-wide transceiver design and sends the optimized precoders and decoders to the BSs. Each BS uses the optimized precoder for transmitting data. Each BS also sends the decoder to its users for processing the received data.

Configuration III: JP-SU.
In JP-SU, essentially a point-to-point MIMO system, there is only one eq-transmitter (full cooperation among all transmitters) and only one eq-receiver (full cooperation among all receivers). It is only of theoretical interest (showing the performance upper bound for all CoMP systems), so the signaling issues are irrelevant and omitted. It is assumed that a central processing unit knows all the channels and performs the system-wide transceiver design. Thus, $C = K = 1$, and the system is described by (6).

Configuration IV: Non-CoMP.
In Non-CoMP, each transmitter (being an eq-transmitter) is paired with a unique receiver (being an eq-receiver). Each pair is a cluster of the system, so the intercell interference is the intercluster interference. Thus, pairwise transceiver design is performed, and the system with $C$ eq-transmitter eq-receiver pairs ($C = K = T = R$) is decoupled into $C$ single-user clusters, with the $i$th one described by (7). In TDD systems, each transmitter estimates the forward link CSI through reciprocity. The transmitter performs the joint transceiver design and sends the decoder to the receiver. In FDD systems, each receiver estimates the forward link CSI and sends the estimated information to the transmitter. Both transmitter and receiver can independently perform the joint transceiver design. The transmitter will use the resulting precoder to transmit data, and the receiver will use the decoder to process the received data.

Configuration V: CBF.
Like Non-CoMP, there are multiple pairs of transmitters and receivers in CBF. However, unlike Non-CoMP, there is only one cluster here. Note that in CBF, $F_{ic} = 0$ for $i \neq c$ and the BSs do not share data. The CSI acquisition and signaling requirements in uplink (resp., downlink) for a central processing unit are the same as in JP-UL (resp., JP-DL). The central processing unit performs the system-wide transceiver design. Thus, the system is described by (8).

Note that, for the composite channel matrix $H_{ic}$ in (4)-(8), the subscript $i$ is the eq-receiver index and the subscript $c$ is the eq-transmitter index. However, for the channel matrix $H_{ln}$, the subscript $l$ is the receiver index and the subscript $n$ is the transmitter index.

MMSE Design Subject to General Linear Power Constraints.
For a given cluster, define the MSE with respect to the $i$th eq-receiver and the $c$th eq-transmitter, $i \in S$, $c \in D$, as $\eta_{ic}$ in (9). Note that when the $c$th eq-transmitter has no data for the $i$th eq-receiver, $\eta_{ic} = 0$. The sum MSE $\eta$ in (10) is the sum of $\eta_{ic}$ over all $i \in S$ and $c \in D$.

MMSE Problem.
We will jointly choose $\{F_{ic}, G_{ic}\}_{i \in S, c \in D}$ to minimize the sum MSE $\eta$, as stated in (11), subject to general linear power constraints, for example, the per-antenna power constraint at the $c$th eq-transmitter in (12) or the per-transmitter power constraint at the $n$th transmitter of the $c$th eq-transmitter in (13). Here, $J_c$ denotes the set of all cooperating transmitters that form the $c$th eq-transmitter. When there is only one element in $J_c$, that is, $J_c = \{n\}$, $Q_n = I_{t_c}$ in (13). When there is more than one element in $J_c$, $Q_n$ is a $t_c \times t_c$ matrix whose entries are all equal to zero except for the diagonal elements corresponding to the antennas of the $n$th transmitter. The values of these nonzero diagonal elements are equal to one.
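The selection matrices $Q_n$ described above can be built and used as in the following sketch. It assumes unit-variance, uncorrelated data streams so that the transmit covariance is the sum of $F_{ic} F_{ic}^*$, and it assumes the antennas of each cooperating transmitter occupy a contiguous block; the function names are hypothetical and the exact form of (12) and (13) is not reproduced.

```python
import numpy as np

def selection_matrices(antennas_per_tx):
    """Return one 0/1 diagonal matrix Q_n per cooperating transmitter.

    antennas_per_tx lists the antenna counts of the transmitters that form
    one eq-transmitter (assumed to be stacked in contiguous blocks).
    """
    t_c = sum(antennas_per_tx)
    mats, start = [], 0
    for tau in antennas_per_tx:
        d = np.zeros(t_c)
        d[start:start + tau] = 1.0      # ones on this transmitter's antennas
        mats.append(np.diag(d))
        start += tau
    return mats

def transmit_covariance(F_list):
    """Sum of F F^* over all precoders at one eq-transmitter (streams ~ identity covariance)."""
    return sum(F @ F.conj().T for F in F_list)

def per_transmitter_power(F_list, Q_n):
    """Power radiated by the transmitter selected by Q_n: tr(Q_n * covariance)."""
    return float(np.real(np.trace(Q_n @ transmit_covariance(F_list))))

def per_antenna_power(F_list):
    """Power radiated by each antenna: the diagonal of the transmit covariance."""
    return np.real(np.diag(transmit_covariance(F_list)))
```

Comparing `per_transmitter_power` or `per_antenna_power` against the corresponding power budgets gives a direct feasibility check for a candidate set of precoders.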

Augmented Cost Function.
To solve (11) subject to (12) or (13), one can use the method of Lagrange multipliers to set up the augmented cost function $\xi$ in (14) for general linear power constraints, where $\Lambda_c$ represents the Lagrange multipliers. Only the widely considered per-transmitter power constraint and the practical per-antenna power constraint are given as examples. For the per-antenna power constraint in (12), the corresponding quantities are given in (15). For the per-transmitter power constraint in (13), let $\Delta_n = \lambda_{nc} I_{\tau_n}$ and $\Gamma_{nc} = (P_{bnc}/\tau_n) I_{\tau_n}$, $c \in D$; the corresponding quantities are then given in (16).

MMSE Decoders and Precoders.
Define the noise covariance matrix and the noise plus interference covariance matrix at the $i$th eq-receiver as $\Phi_{a_i} = E(a_i a_i^*)$ and $\Phi_{n_i} = E(n_i n_i^*)$, respectively. Assume $\Phi_{a_i}$ is known. Then $\Phi_{n_i}$ is also known in JP-SU, JP-UL, JP-DL, and CBF because $\Phi_{n_i} = \Phi_{a_i}$. In Non-CoMP, $\Phi_{n_i}$ can be estimated explicitly. After some mathematical manipulations, (9) becomes (17). There are two possible directions to solve the MMSE problem.

MMSE Decoder.
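On one hand, for a given set of precoders $\{F_{ic}\}_{i \in S, c \in D}$, setting the gradient of $\eta$ in (10) with respect to $G_{ic}$ equal to zero yields the MMSE decoder for $s_{ic}$, $c \in D$, $i \in S$, given in (18). Substituting (18) into (17), $\eta$ in (10) is reduced to $\eta_1$ in (19). The augmented cost function $\xi$ in (14) is also reduced to $\xi_1$ in (20). Note that $\eta_1$ in (19) and $\xi_1$ in (20) no longer depend on the decoders $\{G_{ic}\}_{i \in S, c \in D}$.

MMSE Precoder.
On the other hand, for a given set of decoders $\{G_{ic}\}_{i \in S, c \in D}$, setting the gradient of $\xi$ in (14) with respect to $F_{ic}$ equal to zero yields the MMSE precoder for $s_{ic}$, $c \in D$, $i \in S$, given in (21). Substituting (21) into (14), the augmented cost function $\xi$ in (14) is reduced to $\xi_2$ in (22). Note that $\xi_2$ in (22) is merely a function of the decoders $\{G_{ic}\}_{i \in S, c \in D}$ and Lagrange multipliers $\{\Lambda_c\}_{c \in D}$.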
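To make the two directions concrete, consider the simplest special case of a single eq-transmitter and a single eq-receiver with indices dropped. The expressions below are the standard stationarity conditions for that case, written under the assumption that the received signal is $y = HFs + n$ and that the power constraint enters the cost through a positive definite multiplier matrix $\Lambda$; they are a sketch and not the general multiuser forms of (18) and (21), which also carry the intracluster interference terms.

```latex
% Decoder: zero gradient of E\|Gy - s\|^2 with respect to G, for fixed F
G = \Phi_s F^* H^* \left( H F \Phi_s F^* H^* + \Phi_n \right)^{-1}

% Precoder: zero gradient of the augmented cost with respect to F, for fixed G
F = \left( H^* G^* G H + \Lambda \right)^{-1} H^* G^*
```

The two conditions are coupled: the decoder depends on the precoder and vice versa, which is why an iterative solution is natural once $\Lambda$ is available.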

Transmit and Decoder Covariance Matrices.
When the nonzero source covariance matrices are diagonal matrices with the same diagonal elements (i.e., $\Phi_{s_{ic}} = \sigma^2 I_{m_{ic}}$), applying an arbitrary unitary matrix $A_{ic}$ with proper dimension to the data streams does not change the power constraint (12) or (13). Define the transmit covariance matrices $\{U_{ic}\}_{i \in S, c \in D}$ as in (23) and the decoder covariance matrices $\{V_{ic}\}_{i \in S, c \in D}$ as in (24), for arbitrary unitary matrices $\{A_{ic}\}_{i \in S, c \in D}$. Therefore, the transmit and decoder covariance matrices $\{U_{ic}, V_{ic}\}_{i \in S, c \in D}$ can be used to determine the MSE (in fact, they also determine the achievable sum rate) and consequently determine the precoders and decoders. Thus, if the transmit covariance matrices $\{U_{ic}\}_{i \in S, c \in D}$ which minimize the MSE are found, the precoders $\{F_{ic}\}_{i \in S, c \in D}$ can be obtained using (23) and the decoders $\{G_{ic}\}_{i \in S, c \in D}$ can be obtained from (18). Similarly, if the decoder covariance matrices $\{V_{ic}\}_{i \in S, c \in D}$ which minimize the MSE are found, the decoders $\{G_{ic}\}_{i \in S, c \in D}$ can be obtained using (24) and the precoders $\{F_{ic}\}_{i \in S, c \in D}$ can be obtained from (21).
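The unitary invariance behind the covariance parameterization can be checked numerically. The sketch below assumes the transmit covariance takes the form $U = F F^*$ for unit-variance streams, which is an assumption about (23) rather than a quotation of it; the helper names are hypothetical.

```python
import numpy as np

def random_unitary(m, rng):
    """Draw an m x m unitary matrix from the QR factorization of a complex Gaussian matrix."""
    z = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    q, _ = np.linalg.qr(z)
    return q

def transmit_covariance(F):
    """Assumed form of the transmit covariance for identity source covariance."""
    return F @ F.conj().T

def precoder_from_covariance(U, m):
    """Recover one valid precoder with m streams from a Hermitian PSD matrix U.

    Any other valid precoder differs from this one by a unitary rotation of
    the streams, so the choice made here is one representative.
    """
    w, v = np.linalg.eigh(U)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:m]             # keep the m dominant directions
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))
```

Rotating the streams by a unitary matrix leaves `transmit_covariance` unchanged, and `precoder_from_covariance` returns a precoder that reproduces the same covariance, which is the sense in which the covariance matrix determines the precoder.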

Unified Approach for General MTMR MIMO Systems
The GIA is proposed as a unified approach for the MMSE design of general MTMR MIMO systems. It is motivated by the fact that, if the Lagrange multipliers $\Lambda_c$ in (21) are known, we can solve the coupled equations (18) and (21) iteratively for the decoders $\{G_{ic}\}_{i \in S, c \in D}$ and precoders $\{F_{ic}\}_{i \in S, c \in D}$. Note that, in much of the literature (e.g., [35]), the Lagrange multipliers are obtained through a linear search, in which the search space increases significantly as the system size increases. We herein propose a much more efficient approach using an explicit expression for the Lagrange multipliers.
To obtain an explicit expression for the Lagrange multipliers $\Lambda_c$, $c \in D$, set the gradient of $\xi_1$ in (20) with respect to $F_{ic}$ equal to zero and then left-multiply the resulting equation by $F_{ic}$. Once this is done for each $i \in S$, sum the results to obtain (25), in which $B_c$ is defined in (26). Utilizing (12), the Lagrange multipliers for the per-antenna power constraint are given by (27). Utilizing (13), the Lagrange multipliers for the per-transmitter power constraint are given by (28). Note that the usage of (27) or (28) enforces the corresponding complementary slackness conditions, (29) or (30). With the explicit expression for the Lagrange multipliers in (27) or (28) in hand, the GIA can be developed. There are three steps in each iteration of the GIA.
The iterative procedure of the GIA stops when the Karush-Kuhn-Tucker (KKT) conditions are all satisfied, that is, when the following three requirements are fulfilled: one, the MSE no longer decreases; two, each precoder (decoder) converges; three, the transmission powers at the transmitter(s) meet the desired power constraints. Since the MSE has a lower bound at zero and each of the GIA steps actually enforces one of the KKT conditions of the MMSE problem, the GIA can converge quickly to a local minimum at low powers. At high transmit powers, a scaling initialization (scaling the MMSE MIMO precoders and decoders given by the GIA at lower powers) is very effective and efficient. Note that the GIA can deal with arbitrary source covariance matrices $\{\Phi_{s_{ic}}\}_{i \in S, c \in D}$, thus allowing $m_{ic}$, the number of data streams intended from the $c$th eq-transmitter to the $i$th eq-receiver, to be prespecified for all $i \in S$ and $c \in D$ with $s_{ic} \neq 0$. Since the numbers of data streams can be prespecified, the GIA allows for a tradeoff between diversity and multiplexing gains.
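The flavor of such an iteration can be illustrated on the simplest special case. The sketch below is not the GIA itself: it treats a single eq-transmitter and eq-receiver under a total power constraint with identity source covariance, uses the closed-form multiplier $\lambda = \mathrm{tr}(G \Phi_n G^*)/P$ that holds for that special case, and adds a power-rescaling step (a swapped-in technique, not taken from the paper) so that the constraint is met with equality at every iteration. All function names are hypothetical.

```python
import numpy as np

def mmse_decoder(H, F, Phi_n):
    """Linear MMSE decoder for y = H F s + n with E(s s^*) = I."""
    A = H @ F
    return A.conj().T @ np.linalg.inv(A @ A.conj().T + Phi_n)

def mse(H, F, G, Phi_n):
    """Sum MSE E||G y - s||^2 for identity source covariance."""
    E = G @ H @ F - np.eye(F.shape[1])
    return float(np.real(np.trace(E @ E.conj().T)
                         + np.trace(G @ Phi_n @ G.conj().T)))

def alternating_mmse(H, Phi_n, m, P, iters=50, seed=0):
    """Alternate decoder update, explicit multiplier, and precoder update."""
    rng = np.random.default_rng(seed)
    t = H.shape[1]
    F = rng.standard_normal((t, m)) + 1j * rng.standard_normal((t, m))
    F *= np.sqrt(P / np.real(np.trace(F @ F.conj().T)))   # start at full power
    history = []
    for _ in range(iters):
        G = mmse_decoder(H, F, Phi_n)                      # step 1: decoder
        history.append(mse(H, F, G, Phi_n))
        lam = np.real(np.trace(G @ Phi_n @ G.conj().T)) / P  # step 2: multiplier
        B = G @ H
        F = np.linalg.solve(B.conj().T @ B + lam * np.eye(t), B.conj().T)  # step 3: precoder
        F *= np.sqrt(P / np.real(np.trace(F @ F.conj().T)))  # rescale to power P
    G = mmse_decoder(H, F, Phi_n)
    history.append(mse(H, F, G, Phi_n))
    return F, G, history
```

With the rescaling step, each precoder update is optimal for the current decoder up to a scalar gain that the next decoder update absorbs, so the recorded MSE does not increase from one iteration to the next. The general multiuser case with per-antenna or per-transmitter constraints requires the expressions in (18), (21), (27), and (28) instead.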

Optimum Approaches for Special MTMR Systems
When the source covariance matrices are diagonal matrices with the same diagonal elements, that is, $\Phi_{s_{ic}} = \sigma^2 I_{m_{ic}}$, $i \in S$, $c \in D$, optimum approaches for the MMSE design subject to the general linear power constraints may be developed for special MTMR systems: uplink systems (e.g., JP-UL, JP-SU, and Non-CoMP where $S$ has only one element) in Section 4.1 and downlink systems (e.g., JP-DL, JP-SU, and Non-CoMP where $D$ has only one element) in Section 4.2.
For convenience and without loss of generality, in this section, we assume $\sigma^2 = 1$.

TCOA [29,30] for Systems with One Eq-Receiver.
The TCOA [29,30] can be used for JP-UL, JP-SU, and Non-CoMP where $S$ has only one element (but not for JP-DL and CBF) under general linear power constraints. (Note that in [30,31], the TCOA is only for the per-user power constraint. We use it here to deal with the per-antenna power constraint as well.) It is motivated by the fact that the MMSE problem may be solved by searching for the transmit covariance matrices $\{U_{ic}\}_{i \in S, c \in D}$ that jointly minimize $\eta_1$ in (19). The optimum numbers of data streams $\{m_{ic}\}_{i \in S, c \in D}$ are determined by the ranks of the optimum $\{U_{ic}\}_{i \in S, c \in D}$. The TCOA [30] can be reformulated as a semidefinite program (SDP), which can be solved numerically by SDP solvers (such as SeDuMi [36] and YALMIP [37]) in polynomial time.

DCOA for Systems with One Eq-Transmitter.
The DCOA can be developed for JP-DL, JP-SU, and Non-CoMP where $D$ has only one element (but not for JP-UL and CBF). It is motivated by the fact that the MMSE problem may be solved by searching for the decoder covariance matrices $\{V_{ic}\}_{i \in S, c \in D}$ that jointly minimize $\xi_2$ in (22). Using (24), $\xi_2$ in (22) becomes (31), with the associated quantities defined in (32). The MMSE transceiver design problem becomes (33). The problem in (33) is not convex because of the constraints on the numbers of data streams, that is, $\mathrm{rank}(V_{ic}) = m_{ic}$. Allowing $\{m_{ic}\}_{i \in S, c \in D}$ to be unspecified, we obtain the rank-relaxed decoder covariance optimization problem in (34). The cost function $\xi_{2,\mathrm{rel}}$ in (34) is convex with respect to $\{V_{ic}\}_{i \in S, c \in D}$ and concave with respect to $\Lambda_c$. Define $\min_{V_{ic} \ge 0, i \in S} \max_{\Lambda_c \ge 0} \xi_{2,\mathrm{rel}}$ as the primal problem and $\max_{\Lambda_c \ge 0} \min_{V_{ic} \ge 0, i \in S} \xi_{2,\mathrm{rel}}$ as the dual problem. Since both the primal problem and the dual problem are convex and strictly feasible, strong duality holds; that is, the optimum values of $\{V_{ic}\}_{i \in S, c \in D}$, $\Lambda_c$, and $\xi_{2,\mathrm{rel}}$ obtained from the primal problem are the same as those obtained from the dual problem.

Primal-Dual Algorithm.
We propose a novel primal-dual algorithm to solve the rank-relaxed decoder covariance optimization problem in (34). Denote the feasible set of values for $\{V_{ic}\}_{i \in S, c \in D}$ as the primal domain and the feasible set of values for $\Lambda_c$ as the dual domain. In short, the approach consists of iterating between a primal domain step and a dual domain step. (Both subproblems, defined in (35) and (36), are convex because their cost functions are convex and concave, respectively, and their constraints are all linear matrix inequalities. The solution of each subproblem is optimum for that subproblem.) In the $(j+1)$th iteration, the primal domain step in (35) is followed by the dual domain step in (36). The convexity of the rank-relaxed decoder covariance optimization problem guarantees that the solution provided by the primal-dual algorithm is a global optimum. The iterative procedure stops when the values of $\xi_{2,\mathrm{rel}}$ corresponding to the primal domain step and the dual domain step converge to the same value and when $\{V_{ic}\}_{i \in S, c \in D}$ and $\Lambda_c$ converge. In practice, the DCOA given by solving (35) and (36) is considered to have converged at the $(j+1)$th iteration when the changes in $\{V_{ic}\}_{i \in S, c \in D}$ and $\Lambda_c$ between successive iterations, measured in the Frobenius norm, and the duality gap of the values of $\xi_{2,\mathrm{rel}}$ derived from the two steps, as given in (37), are less than some prespecified thresholds. Note that, in all this, the power constraints have been accounted for by the Lagrange multipliers. The optimum numbers of data streams $\{m_{ic}\}_{i \in S, c \in D}$ are determined by the ranks of the optimum $\{V_{ic}\}_{i \in S, c \in D}$.

Two-Semidefinite Programming (Two-SDP) Procedure.
Similar to the TCOA [30] in uplink, (35) and (36) can be reformulated as the SDPs in (38) and (39), respectively. Both (38) and (39) can be solved numerically by SDP solvers (such as SeDuMi [36] and YALMIP [37]) in polynomial time. However, the primal-dual algorithm of the DCOA needs both the primal and dual subproblems to be solved in each iteration, which leads to high computational complexity. Furthermore, the Two-SDP Procedure is sensitive to the numerical precision of the SDP solvers. It works well at low transmit powers, but the duality gap cannot be made arbitrarily small at high transmit powers because of the insufficient numerical precision of publicly available SDP solvers. Nevertheless, a very important contribution here is that the MMSE transceiver design under general linear power constraints provided by the Two-SDP Procedure is optimal for downlink.

Numerically Efficient Procedure.
To reduce the computational complexity and improve the convergence properties of the Two-SDP Procedure, the SDP formulation in (38) is still employed for the primal domain step in (35), while explicit expressions for $\Lambda_c$, derived as follows, are employed for the dual domain step in (36). Substituting (18) into (24) and using (23), we obtain (40). Similarly, substituting (21) into (23) and using (24), we obtain (41). To remove the dependence of $\{V_{ic}\}_{i \in S, c \in D}$ on $\{U_{ic}\}_{i \in S, c \in D}$, substitute (41) into (40) to yield (42). Similarly, substituting (23) into $B_c$ in (26) and using (41), we can express the Lagrange multipliers $\{\Lambda_c\}_{c \in D}$ in (27) or (28) in terms of $\{V_{ic}\}_{i \in S, c \in D}$.

Equivalence among the Proposed Approaches and Optimality of GIA
In this section, we focus the discussions on the optimality of and the relationships between the GIA, TCOA, and DCOA. Then, the optimality of the GIA can be established.

Equivalence of the TCOA and GIA for Systems with One Eq-Receiver.
When the TCOA is applicable and the transmit covariance matrices $\{U_{ic}\}_{i\in S, c\in D}$ obtained from the MMSE designs are of full rank, the TCOA and GIA are equivalent.
Consequently, the solution of the GIA is actually optimum because the solution of the TCOA is optimum.
To prove the equivalence between the TCOA and GIA, it suffices to show that the KKT conditions of the two approaches are equivalent. This is because the TCOA is a convex approach, so that its KKT conditions are sufficient conditions for optimality. The KKT conditions common to both approaches are (18), the power constraint (12) or (13), the complementary slackness condition (29) or (30), and the nonnegativeness of the Lagrange multipliers. To obtain the KKT condition unique to the TCOA, we set up the augmented cost function (43), which includes the nonnegative definite constraint on $\{U_{ic}\}_{i\in S, c\in D}$ through the Lagrange multipliers $\{\Psi_{uic}\}_{i\in S, c\in D}$ satisfying $\mathrm{tr}(\Psi_{uic} U_{ic}) = 0$, $\Psi_{uic} \geq 0$, $i \in S$, $c \in D$. When $\{U_{ic}\}_{i\in S, c\in D}$ are of full rank, the Lagrange multipliers $\{\Psi_{uic}\}_{i\in S, c\in D}$ are zero matrices, since $\mathrm{tr}(\Psi_{uic} U_{ic}) = 0$ with $\Psi_{uic} \geq 0$ and positive definite $U_{ic}$ forces $\Psi_{uic} = 0$. Setting the gradients of (43) with respect to $\{U_{ic}\}_{i\in S, c\in D}$ to zero yields (44). The task of showing the equivalence of the KKT conditions of the two approaches boils down to showing that the above KKT condition of the TCOA, (44), can be derived from (and can be used to derive) the KKT condition unique to the GIA, (21). Substituting (18) and (23) into (21) gives (45). Right-multiplying (45) by $F_{ic}^* U_{ic}^{-1}$ then gives (46). With some matrix manipulations, we can show that (46) and (44) are equivalent. Since (21) and (44) can be derived from each other, the proof is complete. The above proof assumes $\Phi_{sic} = \sigma^2 I_{m_{ic}}$ with $\sigma^2 = 1$, $i \in S$, $c \in D$. It is also applicable when $\sigma^2 \neq 1$.

Equivalence of the DCOA and GIA for Systems with One Eq-Transmitter.
When the DCOA is applicable and the decoder covariance matrices $\{V_{ic}\}_{i\in S, c\in D}$ obtained from the MMSE designs are of full rank, the DCOA and GIA are equivalent. Consequently, the solution of the GIA is actually optimum because the solution given by the DCOA is optimum.
To prove the equivalence between the DCOA and GIA, it suffices to show that the KKT conditions of the two approaches are equivalent. This is because the DCOA is a convex approach, so that its KKT conditions are sufficient conditions for optimality. The KKT conditions common to both approaches are (21), the power constraint (12) or (13), the complementary slackness condition (29) or (30), and the nonnegativeness of the Lagrange multipliers. To obtain the KKT condition unique to the DCOA, we set up from (34) the augmented cost function (47), which includes the nonnegative definite constraint on $\{V_{ic}\}_{i\in S, c\in D}$ through the Lagrange multipliers $\{\Psi_{vic}\}_{i\in S, c\in D}$ satisfying $\mathrm{tr}(\Psi_{vic} V_{ic}) = 0$, $\Psi_{vic} \geq 0$, $i \in S$, $c \in D$. When $\{V_{ic}\}_{i\in S, c\in D}$ are of full rank, the Lagrange multipliers $\{\Psi_{vic}\}_{i\in S, c\in D}$ are zero matrices. Setting the gradients of (47) with respect to $\{V_{ic}\}_{i\in S, c\in D}$ to zero yields (48). The task of showing the equivalence of the KKT conditions of the two approaches boils down to showing that the above KKT condition of the DCOA, (48), can be derived from (and can be used to derive) the KKT condition unique to the GIA, (18). Substituting (21) and (24) into (18) gives (49). Left-multiplying (49) by $V_{ic}^{-1} G_{ic}^*$ then gives (50). With some matrix manipulations, we can show that (50) and (48) are equivalent. Since (18) and (48) can be derived from each other, the proof is complete. The above proof assumes $\Phi_{sic} = \sigma^2 I_{m_{ic}}$ with $\sigma^2 = 1$, $i \in S$, $c \in D$. It is also applicable when $\sigma^2 \neq 1$.

Simulation Setup
In all of the simulations, the noise and nonzero source covariance matrices, $\Phi_{ai}$ and $\Phi_{sic}$, are identity matrices of dimension $r_i$ and $m_{ic}$, respectively. The nonzero source (data) vectors consist entirely of uncoded binary phase shift keying (BPSK) modulated bits. For the per-antenna power constraint, $P_{cd} = P$, $d = 1, 2, \ldots, t_c$, $c = 1, 2, \ldots, C$ (see (12)), and for the per-transmitter power constraint, $P_{bnc} = \tau_n P$ for all $n \in J_c$, $c = 1, 2, \ldots, C$ (see (13)). Thus, the maximum transmission power of the $n$th transmitter is always the same (i.e., $\tau_n P$) under both power constraints in (12) and (13). Without loss of generality, in all of the simulations, the numbers of transmitters and receivers are the same and each cell has only one transmitter and one receiver. Since the transmitter in the $l$th cell always (no matter which configuration) has data for the receiver in the $l$th cell, they are labeled the $l$th transmitter and receiver, respectively. Furthermore, for simplicity, $d_{ll}$ (see (1)) is normalized to be equal to 1 for all $l$. Since all other links are possibly (depending on the configuration) interfering links, they are normalized such that $d_{ln} \geq 1$, $l \neq n$. Again, for the sake of simplicity, all $d_{ln}$, $l \neq n$, are set equal, thus giving rise to a single parameter $\delta$, which increases with $d_{ln}$ and equals 1 when $d_{ln} = 1$. Note that, in a cellular context, the users (base stations) are the receivers (transmitters) in downlink and the transmitters (receivers) in uplink. Thus, $d_{ln} = 1$ ($\delta = 1$) means that all of the users are cell edge users (the system is in a cell edge scenario). Furthermore, as $d_{ln}$ increases, $\delta$ increases and each user moves away from the cell edge toward its own base station. In all of the simulations, $2\beta = 4$ in the path loss model of (1). All of the setups (1a, 1b, ..., 5b) used in these simulations for the five CoMP configurations are defined in Table 1. (Note though that the distances are not specified in these baseline setups because they are example dependent.)
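The two power constraints described above can be illustrated with a minimal sketch. It assumes a single transmitter with precoder F and identity source covariance (as in this section), so that the per-antenna powers are the diagonal entries of F F^*. The function names are hypothetical and are not part of the paper.

```python
import numpy as np

def antenna_powers(F):
    """Per-antenna transmit powers: diagonal of F F^* (source covariance = I)."""
    return np.real(np.sum(F * F.conj(), axis=1))

def meets_per_antenna(F, P, tol=1e-9):
    """Per-antenna constraint: every antenna transmits at most P."""
    return bool(np.all(antenna_powers(F) <= P + tol))

def meets_per_transmitter(F, P, tol=1e-9):
    """Per-transmitter constraint: total power at most tau * P,
    where tau is the number of antennas of this transmitter."""
    tau = F.shape[0]
    return bool(antenna_powers(F).sum() <= tau * P + tol)

P = 1.0
F_equal = np.array([[1.0], [1.0]])            # each antenna uses power P
F_skewed = np.array([[np.sqrt(2.0)], [0.0]])  # all power on the first antenna

print(meets_per_antenna(F_equal, P), meets_per_transmitter(F_equal, P))
print(meets_per_antenna(F_skewed, P), meets_per_transmitter(F_skewed, P))
```

In this sketch the skewed precoder satisfies the per-transmitter limit but violates the per-antenna limit, which illustrates why the per-antenna constraint is the more stringent of the two for the same maximum transmitter power.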
For each CoMP configuration, there are various setups. The differences between the setups for a particular CoMP configuration are marked in bold. For example, for JP-UL, setups 1a and 1b are exactly the same except for the values of $\{m_{ic}\}$ and $m$. Unlike setups 1a-3b, where each setup corresponds to only one configuration, setups 4a, 4b, 5a, and 5b can correspond to either Non-CoMP or CBF. Thus, to help distinguish whether a setup belongs to Non-CoMP or CBF, the name of the configuration is placed next to the setup number; for example, 5a (Non-CoMP) denotes setup 5a for Non-CoMP. Note that not every approach can be used for every configuration and every setup in Table 1. Also note that the channel matrices generated numerically usually have full column and/or row rank. This in general results in maximum feasible rank transmit covariance matrices and/or decoder covariance matrices in the MMSE designs if the numbers of data streams are not pre-specified. Therefore, in such cases, the TCOA and DCOA are applicable in the corresponding setups. The applicability of the proposed approaches in the setups is summarized in Table 2, where "Y" means an approach is applicable in a setup while "N" means it is not.
Notation used in Table 1: $T$: number of transmitters; $\tau_n$: number of antennas of the $n$th transmitter; $C$: number of eq-transmitters; $t_c$: number of antennas of the $c$th eq-transmitter; $t$: total number of transmit antennas; $1 \leq n \leq T$, $1 \leq c \leq C$; $R$: number of receivers; $\gamma_l$: number of antennas of the $l$th receiver; $K$: number of eq-receivers; $r_i$: number of antennas of the $i$th eq-receiver; $r$: total number of receive antennas; $1 \leq l \leq R$, $1 \leq i \leq K$; $m_{ic}$: number of data streams from the $c$th eq-transmitter to the $i$th eq-receiver if $s_{ic} \neq 0$; $m$: total number of data streams; $1 \leq c \leq C$, $1 \leq i \leq K$.
As a last note, the results for setup 4b (Non-CoMP) under the per-antenna power constraint are obtained using the optimum closed-form solution (see Appendix B). The results for setups 5a (Non-CoMP) and 5b (Non-CoMP) can also be obtained by the optimum closed-form solution, but they are omitted for the clarity of the figures.

Investigation into the Proposed Approaches
In this section, the convergence properties, optimality, and diversity/multiplexing tradeoff of the GIA, and numerical comparison of the GIA with the approach in [35] for CBF are investigated. All results except for the ones in Section 7.1 are obtained by averaging over 20 channel realizations. These results are consistent with those obtained by averaging over more channel realizations.

Convergence Properties of the Approaches.
Consider setup 3a (JP-SU). All approaches are applicable. The convergence property (expressed as MSE, dG, and dP) of the GIA under the per-antenna power constraint for one set of channel realizations is shown in Figure 2. Here, dG and dP denote, respectively, the difference in the decoders and the difference in the per-antenna powers between the $j$th and $(j+1)$th iterations. The convergence property for the per-transmitter power constraint is similar and is omitted due to the page limit. As shown in Figure 2, both the MSE and dG converge quickly. It is remarkable that the dP's converge much more slowly at higher power. This is due to the fact that, when $P$ increases, the Lagrange multipliers decrease quickly (see (27) or (28)). Note that (27) and (28) are used together with the complementary slackness conditions (29) or (30). For large $P$'s, the Lagrange multipliers are very small. For example, when $10\log_{10} P = 30$ dB, they can be as small as $10^{-10}$. Thus, the number of iterations increases drastically as $P$ increases if equality in the power constraints in (12) or (13) is insisted upon. The slow convergence of the dP's is also observed in other configurations. In Non-CoMP and CBF, the power constraints may not be met with equality for the MMSE results (in which case the corresponding Lagrange multipliers are essentially zero). Although the Lagrange multipliers are formulated in this paper using equality power constraints in order to derive explicit expressions for them, the GIA can in fact be used to solve problems with inequality power constraints. When a particular power constraint is not met with equality, the corresponding Lagrange multiplier becomes zero (in agreement with the complementary slackness condition).
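A stopping test based on dG and dP could be sketched as follows. The exact definitions of dG and dP are not reproduced in this section, so the sketch assumes, purely for illustration, that dG is the summed Frobenius-norm change of the decoders and dP is the largest absolute change of any per-antenna power. All function names and the tolerances are hypothetical.

```python
import numpy as np

def d_G(G_prev, G_next):
    """Assumed dG: summed Frobenius-norm change of the decoders."""
    return float(sum(np.linalg.norm(Gn - Gp, "fro")
                     for Gp, Gn in zip(G_prev, G_next)))

def d_P(p_prev, p_next):
    """Assumed dP: largest absolute change of any per-antenna power."""
    return float(np.max(np.abs(np.asarray(p_next) - np.asarray(p_prev))))

def has_converged(G_prev, G_next, p_prev, p_next, tol_G=1e-6, tol_P=1e-6):
    """Declare convergence only when both metrics fall below their tolerances."""
    return d_G(G_prev, G_next) < tol_G and d_P(p_prev, p_next) < tol_P

# Toy iterates: the decoder moved slightly, one antenna power changed a lot.
G_old = [np.eye(2)]
G_new = [1.1 * np.eye(2)]
p_old = [1.0, 1.0]
p_new = [1.0, 0.25]

print(d_G(G_old, G_new), d_P(p_old, p_new))
print(has_converged(G_old, G_new, p_old, p_new))
```

Keeping the two tolerances separate reflects the observation above that the decoders can settle long before the antenna powers do at high transmit power.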
For the DCOA, the convergence properties of the Two-SDP Procedure and the Numerically Efficient Procedure, using the SDP solvers SeDuMi [36] and Yalmip [37], are shown in Figure 3 for setup 3a (JP-SU) under the per-antenna power constraint for one set of channel realizations. Observing the convergence rates of the duality gap in (37) and of the antenna powers in Figure 3, we find that the Numerically Efficient Procedure converges faster than the Two-SDP Procedure.
The GIA is equivalent to the DCOA and yields the globally optimum solution. Similarly, in setups 3a, 3c, and 3d (JP-SU) (see Figure 5), the MSE curves of all approaches merge. The GIA is equivalent to both the TCOA and DCOA and yields the globally optimum solution. The GIA is able to transmit the maximum number of data streams, as are the other proposed approaches. On the other hand, in setups 1b (JP-UL), 2b (JP-DL), and 3b (JP-SU), the GIA is also able to transmit fewer data streams, resulting in lower MSE and BER (see the dashed curves in Figures 4 and 5), while the other proposed approaches are not applicable. In other words, the GIA is able, unlike the other approaches, to provide a tradeoff between multiplexing gain and diversity gain.
As noted in Section 7.1, our proposed GIA can in fact handle inequality power constraints. So, both our proposed GIA and the approach in [35] are 3-step iterative approaches applicable to CBF with the per-transmitter power constraint. The only difference is the way of finding the Lagrange multipliers. Reference [35] uses a linear search method to find the Lagrange multipliers when the equality power constraint is enforced, while the GIA uses the more efficient explicit expression (28). In setup 5a (CBF), the MSE (BER) curves of the GIA and the approach in [35] merge, as shown in Figure 6. This shows that the GIA performs as well as the approach in [35] numerically, but is more efficient. Furthermore, the approach in [35] is only applicable with the per-transmitter power constraint, while the GIA can deal with the more practical per-antenna power constraint.

Performance Benchmark
As shown in the previous section, the proposed unified approach, the GIA, is applicable to all setups. It is optimal when the number of data streams is equal to the rank of the channel, and it provides diversity gain when the number of data streams is less than the rank of the channel (e.g., in setups 1b, 2b, and 3b). In this section, all results are generated using the GIA for simplicity. The performances of the five CoMP configurations are studied, in particular the impacts of the level of cooperation and system load (Section 8.1), the system size (Section 8.2), and the path loss. Two cases are considered first: Case A (setups 1a, 2a, 3a, 4a (Non-CoMP), and 4a (CBF)) and Case B (setups 1b, 2b, 3b, 4b (Non-CoMP), and 4b (CBF)), both with $d_{ln} = 1$. (Note that this choice of $d_{ln}$ makes $\delta = 1$. It also places all of the users at the cell edge.) The difference between the two cases lies in the number of data streams transmitted; all setups in Case A have four data streams transmitted in total (i.e., fully loaded systems) while all setups in Case B have two data streams transmitted in total (i.e., partially loaded systems). Figures 7(a) and 7(b) show the MSE and BER results, respectively.
Figure 7: (a) Impact of the level of cooperation and system load: system-wide MSEs for Case A (setups 1a, 2a, 3a, 4a (Non-CoMP), and 4a (CBF)) and Case B (setups 1b, 2b, 3b, 4b (Non-CoMP), and 4b (CBF)) under the per-antenna and per-transmitter power constraints. The blue solid lines, red dotted lines, black dashed lines, magenta dash-dot lines, and olive solid lines with dots represent, respectively, the results of setups 1x, 2x, 3x, 4x (Non-CoMP), and 4x (CBF) under the per-transmitter power constraint. The correspondingly colored markers (including red ♦, black +, and olive ×) represent the corresponding results under the per-antenna power constraint. (b) Impact of the level of cooperation and system load: system-wide BERs for the same cases and power constraints. (Legends: same as those of Figure 7(a).)
Before comparing the results of Case A and Case B, let us first compare the individual setups within each case. Firstly, observe that, in both cases, the performance order of the configurations is exactly the same as the level of cooperation order: the performance improves as the level of cooperation increases. Note that the MSE and BER performance order agrees with that of the ergodic sum rate in [22,23]. Secondly, note that, in both cases, the per-transmitter power constraint in CBF is not usually met with equality for every pair. However, it always is for Non-CoMP. The reason is quite interesting. In Non-CoMP, each pair designs its precoder and decoder to minimize its own MSE. Thus, there is no reason for any of the pairs to limit its transmit power. However, in CBF, all the pairs jointly design their precoders and decoders to minimize the system-wide MSE. Thus, it may not always be beneficial for all transmitters to transmit at full power, since the mutual interference may be large.
Thirdly, note that both the per-transmitter and per-antenna power constraints usually meet with equality for the three JP configurations. With that done, let us now compare the results of Cases A and B. The first observation is that limiting the numbers of data streams is crucial for the performance. The second observation is that, in Case B, the MSE performances of CBF and the higher level of cooperation configurations (JP-UL, JP-DL, and JP-SU) are actually similar at high transmit power. The last observation, somewhat related to the first, is that the performances of Non-CoMP and CBF are much more dependent on the number of data streams than JP-UL, JP-DL, and JP-SU. Comments similar to this last observation are made in [22,23] for the ergodic sum rate results of JP-DL and CBF with multiple receivers per cell.
The difference in the BERs of Non-CoMP and CBF between the two cases is remarkable and can be explained as follows. Using (2) and (3d), we have
\[ \hat{s}_{cc} = G_{cc} H_{cc} F_{cc} s_{cc} + G_{cc} H_{ck} F_{kk} s_{kk} + G_{cc} a_c, \quad c, k = 1, 2, \; k \neq c, \]
where $\hat{s}_{cc}$ is the soft output data at the $c$th eq-receiver. As can be easily seen, $G_{cc} H_{cc} F_{cc} s_{cc}$ is the desired term, $G_{cc} H_{ck} F_{kk} s_{kk}$ is the interference term, and $G_{cc} a_c$ is the noise term. Since each of the channels is $2 \times 2$ and will be of full rank with probability 1, their nonsingularity will be assumed throughout this explanation. In Case A, the $c$th receiver, $c = 1, 2$, needs $G_{cc} H_{cc} F_{cc}$ (the effective channel from input data to output data) to be of full rank in order to successfully receive its two data streams. But, if $G_{cc} H_{cc} F_{cc}$ is of full rank for both receivers (i.e., for $c = 1, 2$), then $G_{cc} H_{ck} F_{kk}$, $c, k = 1, 2$, $k \neq c$, are of full rank as well. Thus, the interference and desired signals cannot be separated. If the interference is significant, as is likely at the cell edge, the performance will suffer greatly. On the other hand, it is possible in Case B for both pairs to successfully receive each of their data streams and null out the interference. This is because $\mathrm{rank}(H_{cc} F_{cc}) = \mathrm{rank}(H_{ck} F_{kk}) = 1$ and therefore $\mathrm{span}(H_{cc} F_{cc})$ is not necessarily equal to $\mathrm{span}(H_{ck} F_{kk})$, $c, k = 1, 2$, $k \neq c$. In CBF, the precoders can be chosen to steer $H_{ck} F_{kk}$, $k \neq c$, away from $H_{cc} F_{cc}$ and the decoders can be chosen to sufficiently null out $H_{ck} F_{kk}$, $k \neq c$. In Non-CoMP, the $c$th pair does not know $H_{ck} F_{kk}$, $k \neq c$, but it knows the estimated noise plus interference covariance matrix $\Phi_{nc}$ (see Appendix A). It can therefore design $F_{cc}$ and $G_{cc}$ based on its knowledge of $\Phi_{nc}$. As can be seen, the performance of Non-CoMP is quite good under the per-transmitter power constraint; it is poor, though, under the more stringent per-antenna power constraint.
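The rank argument above can be checked numerically with a small sketch. It assumes randomly drawn 2 × 2 complex Gaussian channels and random (not MMSE-optimized) precoders, and in the one-stream case it uses a simple zero-forcing decoder rather than the MMSE decoder of the paper. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def cn(shape):
    """i.i.d. circularly symmetric complex Gaussian entries, unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# H[(c, k)]: 2x2 channel from transmitter k to receiver c (two pairs: 0 and 1).
H = {(c, k): cn((2, 2)) for c in (0, 1) for k in (0, 1)}

# Case A (two streams per pair): full-rank 2x2 precoders and decoders.
F_A = [cn((2, 2)) for _ in range(2)]
G_A = [cn((2, 2)) for _ in range(2)]
rank_desired_A = [int(np.linalg.matrix_rank(G_A[c] @ H[c, c] @ F_A[c])) for c in (0, 1)]
rank_interf_A = [int(np.linalg.matrix_rank(G_A[c] @ H[c, 1 - c] @ F_A[1 - c])) for c in (0, 1)]

# Case B (one stream per pair): 2x1 precoders, 1x2 decoders.
F_B = [cn((2, 1)) for _ in range(2)]

def null_decoder(v):
    """Row vector g with g @ v = 0 for a 2x1 vector v (zero-forcing choice)."""
    return np.array([[v[1, 0], -v[0, 0]]])

G_B = [null_decoder(H[c, 1 - c] @ F_B[1 - c]) for c in (0, 1)]
leak_B = [abs((G_B[c] @ H[c, 1 - c] @ F_B[1 - c]).item()) for c in (0, 1)]
gain_B = [abs((G_B[c] @ H[c, c] @ F_B[c]).item()) for c in (0, 1)]

print("Case A ranks (desired, interference):", rank_desired_A, rank_interf_A)
print("Case B leakage:", leak_B, "desired gain:", gain_B)
```

In Case A the interference term has the same full rank as the desired term, so no linear decoder can remove it without also destroying a desired stream, whereas in Case B each receiver has a spare dimension in which the single interfering stream can be nulled.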

Impact of System Size (the Number of Transmitter Receiver Pairs).
To gain some understanding on what happens when the number of transmitter receiver pairs increases, we consider five different setups: 4b (Non-CoMP), 5a (Non-CoMP), 4b (CBF), 5a (CBF), and 5b (CBF) in Table 1. For convenience, we choose d ln = 1 for l, n = 1, 2 (cell edge scenario). Figure 8 shows the resulting MSEs and BERs. Note that the maximum antenna power is P in all of the setups. The normalized MSE shown in Figure 8 is defined to be the average MSE per data stream.
Firstly, we compare the results of CBF setups 4b, 5a, and 5b to see the performance degradation when more transmitter-receiver pairs join the wireless environment. Consider setup 4b (CBF) as a baseline system. We observe that setups 5a (CBF) and 5b (CBF) have, respectively, 2-5 dB and 7-14 dB loss in the normalized MSE results. In addition, the BER results of setups 5a (CBF) and 5b (CBF) have smaller diversity gains (absolute values of the slopes) than setup 4b (CBF). However, more data streams are transmitted in setups 5a and 5b. How does CBF handle the $C = K = 3$ (setup 5a) and $C = K = 4$ (setup 5b) systems when each node has only 2 antennas? Does it perform IA, that is, do its precoders and decoders satisfy $\mathrm{rank}(G_{cc} H_{cc} F_{cc}) = m_{cc}$ and $G_{cc} H_{ck} F_{kk} = 0$, $c, k = 1, 2, \ldots, C$, $k \neq c$ [9][10][11][12]38]? MMSE designs are more general than IA because IA is not always feasible and does not take into account arbitrary $\Phi_{nc}$. Even so, the MMSE design is seen, at times, to exhibit IA-like features; that is, the interference projections, $H_{ck} F_{kk}$, for all $k \neq c$, are steered by the MMSE design such that they lie predominantly in a subspace not containing the signal projection, $H_{cc} F_{cc}$. As is to be expected, the MMSE decoders take into account both the noise and the interference, rather than always nulling out the interference as the IA conditions would dictate. In addition, better IA is generally achieved at higher transmit SNRs due to the reduction in the significance of the noise. Furthermore, it is seen that our MMSE design supports more transmitter-receiver pairs than the upper bound for IA designs in [38].
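How closely a given set of precoders and decoders satisfies the two IA conditions quoted above can be measured with a short sketch like the following. The function name `ia_report`, the nested-list data layout, and the rank tolerance are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def ia_report(G, H, F, m, tol=1e-9):
    """Check the IA conditions for C transmitter-receiver pairs.

    G[c]: decoder of pair c, F[k]: precoder of pair k,
    H[c][k]: channel from transmitter k to receiver c,
    m[c]: number of data streams of pair c.
    Returns (rank_ok, leakage): whether every desired effective channel has
    rank m[c], and the largest Frobenius norm of any interference term.
    """
    C = len(G)
    rank_ok = all(
        np.linalg.matrix_rank(G[c] @ H[c][c] @ F[c], tol=tol) == m[c]
        for c in range(C)
    )
    leakage = max(
        float(np.linalg.norm(G[c] @ H[c][k] @ F[k], "fro"))
        for c in range(C) for k in range(C) if k != c
    )
    return rank_ok, leakage

# Toy two-pair example with identity channels and orthogonal beams.
I2 = np.eye(2)
H = [[I2, I2], [I2, I2]]
F = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
G_aligned = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
G_misaligned = [np.array([[1.0, 1.0]]), np.array([[0.0, 1.0]])]

print(ia_report(G_aligned, H, F, m=[1, 1]))
print(ia_report(G_misaligned, H, F, m=[1, 1]))
```

For an MMSE design one would expect a small but nonzero leakage that shrinks as the transmit SNR grows, in line with the IA-like behavior described above.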
Secondly, we compare Non-CoMP and CBF to see how important joint system-wide transceiver design is to systems with more than 2 transmitter-receiver pairs. BER-wise, it can be seen that, under the per-transmitter power constraint, the best curve for Non-CoMP (that of setup 4b (Non-CoMP)) has only a 1 dB gain over the worst of the CBF curves. Moreover, only 2 transmitter-receiver pairs are communicating in setup 4b (Non-CoMP), as opposed to the 4 transmitter-receiver pairs in setup 5b (CBF). Under the per-antenna power constraint, all of the CBF BER curves are better than the best Non-CoMP one. Furthermore, the performance for setup 5a (Non-CoMP) is very poor. Thus, it is clear that joint system-wide transceiver design can greatly help systems with multiple transmitter-receiver pairs by mitigating multiple intercell interferences.

Impact of the Path Loss.
Firstly, using Cases A and B (as defined in Section 8.1), the system performance of all five CoMP configurations under different path losses and system loads is studied. As such, $d_{ln}$, $l \neq n$, varies between 1 and 4 ($d_{ll} = 1$, $l = 1, 2$, as always). Figures 9(a) and 9(b) show, respectively, the MSE and BER results against $d_{ln}$, $l \neq n$, for $10\log_{10} P = 5$ dB.
In both Cases A and B, as $d_{ln}$, $l \neq n$, (and thus $\delta$) gets larger, the performances of both Non-CoMP and CBF improve while the performances of JP-UL, JP-DL, and JP-SU worsen. This is because $d_{ln}$, $l \neq n$, corresponds to interference channels (channels which do not carry desired data) in Non-CoMP and CBF and to desired channels (channels which can carry desired data) in JP-UL, JP-DL, and JP-SU. As $d_{ln}$, $l \neq n$, (and thus $\delta$) increases, the path losses of the interference channels increase for Non-CoMP and CBF and the path losses of some of the desired channels increase for JP-UL, JP-DL, and JP-SU. Actually, the MSE performances of the five configurations eventually merge when $d_{ln}$, $l \neq n$, (and thus $\delta$) is large. This is because the system essentially ends up consisting of two independent and interference-free transmitter-receiver pairs when $d_{ln}$, $l \neq n$, (and thus $\delta$) is large enough. It is remarkable that this merging of performances can already be seen when $d_{ln} = 3$, $l \neq n$, in Case A and when $d_{ln} = 2$, $l \neq n$, in Case B. It is also remarkable (but to be expected) that this merging phenomenon of JP-DL and CBF is also seen with ergodic sum rates in [22,23].
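To give a feel for how quickly the cross links weaken, the following worked arithmetic assumes that the average received power over a link of normalized distance $d$ scales as $d^{-2\beta}$. This form of the path loss model (1) is an assumption here, since (1) is not reproduced in this section. With $2\beta = 4$:
\[
\begin{aligned}
d = 2 &: \quad 2^{-4} = \tfrac{1}{16} \;\approx\; -12.0\ \text{dB}, \\
d = 3 &: \quad 3^{-4} = \tfrac{1}{81} \;\approx\; -19.1\ \text{dB}, \\
d = 4 &: \quad 4^{-4} = \tfrac{1}{256} \;\approx\; -24.1\ \text{dB}.
\end{aligned}
\]
Under this assumption, the cross links at $d_{ln} = 3$ are roughly 19 dB weaker than the direct links ($d_{ll} = 1$), which is consistent with the configurations behaving almost like two independent transmitter-receiver pairs at that distance.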
Secondly, using the five setups (4b (Non-CoMP), 5a (Non-CoMP), 4b (CBF), 5a (CBF), and 5b (CBF)) employed in Section 8.2, further path loss studies are conducted for Non-CoMP and CBF with respect to different system sizes. With $d_{ll} = 1$ for all $l$ and $10\log_{10} P = 5$ dB, Figure 10 shows the MSE and BER results against $d_{ln}$, $l \neq n$. As $d_{ln}$, $l \neq n$, (and thus $\delta$) gets larger, it is clearly seen that the performances of the setups improve and merge together. This behavior arises because $d_{ln}$, $l \neq n$, corresponds to the interference channels for both Non-CoMP and CBF. As $d_{ln}$, $l \neq n$, increases, both the inter-pair interference and the importance of joint design across the pairs decrease.

Guidelines for Configuration Selection.
The purpose of this sub-section is to gain some understanding about when each configuration should be used. This understanding also helps to determine the CSI feedback and data sharing requirements, since different CoMP configurations require different levels of CSI feedback and data sharing. For example, if, based on the BER performance, only Non-CoMP is needed, a downlink user only needs to feed back the desired channel and the inter-cluster interference covariance matrix, but not the intercell channels. To this end, consider the following example: there are two transmitters and two receivers (i.e., $T = R = 2$). The MMSE design of their precoders and decoders is subject to the per-transmitter power constraint with $10\log_{10} P = 5$ dB. If the desired BER threshold is $3 \times 10^{-2}$, when should JP-UL, JP-DL, JP-SU, Non-CoMP, and CBF be used?
Looking at Figures 9(a) and 9(b), it is surprising but clear that, for Case B (partially loaded systems), Non-CoMP should always be used, even at the cell edge. (Note though that, for the per-antenna power constraint, the performance of Non-CoMP is only marginally acceptable at the cell edge.) Non-CoMP is good enough; the other configurations, with their greater network overheads (e.g., information exchange and synchronization), are not needed. For Case A (fully loaded systems), on the other hand, which configuration should be used depends on $d_{ln}$ (and thus $\delta$). For small enough $d_{ln}$, $l \neq n$ (and thus small enough $\delta$), that is, for a cell edge type scenario, either JP-UL or JP-DL should be used. The interference is too much for Non-CoMP and CBF. However, for larger $d_{ln}$, $l \neq n$, Non-CoMP should be used. With respect to JP-SU, it is remarkable that, in both Cases A and B, it has no significant performance advantage over JP-UL and JP-DL and is not needed here.
Looking at Figure 10, it is clear that CBF should be used when there are a few transmitter-receiver pairs, all at the cell edge, each wanting 1 data stream. In that case, CBF's interference management capabilities enable it to satisfy the BER threshold when Non-CoMP cannot. It is also clear that, for any number of transmitter-receiver pairs, there will be a $d$ such that, when $d_{ln} > d$, $l \neq n$, Non-CoMP is good enough and should be employed.

Conclusion
For developing a practical CoMP technology in future cellular systems, there are two crucial needs: a performance benchmark and a unified approach for different CoMP configurations. For the need of a performance benchmark, joint MMSE transceiver designs of various CoMP configurations are considered. The joint MMSE design is nearly optimum in maximizing sum rate. The MSE and BER performances of five CoMP systems (JP-SU, JP-DL, JP-UL, CBF, Non-CoMP) under various levels of cooperation, system loads, system sizes, and path losses are investigated thoroughly. Guidelines for CoMP configuration selection are then established. For the need of a unified approach, the GIA is proposed for performing joint MMSE transceiver designs for general MTMR MIMO systems subject to general linear power constraints. In addition, the optimum DCOA for downlink is developed to validate the optimality of the GIA results when applicable. Remarkably, the GIA is shown to be equivalent to the TCOA when each of them converges and the transmit covariance matrices obtained from them are of full rank. It is also shown to be equivalent to the DCOA when each of them converges and the decoder covariance matrices obtained from them are of full rank. This means that the GIA gives globally optimum results under the abovementioned special conditions. The convergence properties of the proposed approaches and the optimality and diversity/multiplexing tradeoff of the GIA are verified numerically.
The performance analysis of the five CoMP configurations is conducted using the GIA to provide physical insights and performance benchmarks. Firstly, in the cell edge scenario, it is found that the higher the level of cooperation, the better the performance. Actually, JP-UL and JP-DL achieve essentially the same performance as JP-SU. Note that CBF and Non-CoMP as considered in this paper give the achievable performance upper bound for their respective categories, given the same total number of transmit antennas and the same total number of receive antennas.
Secondly, in the cell edge scenario, it is found that the performances of Non-CoMP and CBF are much more dependent on the number of data streams than those of JP-UL, JP-DL, and JP-SU. When the system is fully loaded, both Non-CoMP and CBF suffer severe interference and thus perform poorly. However, for a partially loaded system with two transmitter-receiver pairs, CBF is able to perform well under both the per-transmitter and per-antenna power constraints. Non-CoMP also performs well, but only under the per-transmitter power constraint (the per-antenna power constraint turns out to be too stringent for it). Thirdly, CBF is able to take care of even more than two transmitter-receiver pairs because of its superior interference management capabilities (such as its ability to perform IA-like maneuvers). Not only that, it can actually support more pairs than the upper bound for IA designs in [38]. Fourthly, it is found that the per-transmitter power constraint in the CBF configuration is not usually met with equality for every pair. However, it always is for the Non-CoMP configuration. This phenomenon is due to the following: (a) in Non-CoMP, each pair cares only about its own MSE while, in CBF, each pair cares about the system-wide MSE, and (b) increasing the power at a pair will always be good for the MSE of that pair but not necessarily good for the MSE of the entire system. Fifthly, for a given system, as the path loss of the channels corresponding to the interfering links of Non-CoMP and CBF increases, interesting trends are observed: the performances of CBF and Non-CoMP improve greatly whereas the performances of JP-UL, JP-DL, and JP-SU worsen. Actually, the MSE performances of the five configurations eventually merge together.
In addition to producing these findings, these simulations numerically put forth performance benchmarks for the JP, CBF, and Non-CoMP categories-actually, due to JP-SU, performance benchmarks are given for all CoMP configurations. Moreover, due to the use of the MMSE criterion, benchmarks are put forth for the transceiver designs under other criteria as well (such as maximum capacity and minimum BER). These simulations also provide some guidelines for configuration selection.
These performance benchmarks and guidelines are produced under ideal conditions; for example, the synchronization requirements of the configurations, among others, are not taken into account. The modulation and coding scheme (MCS) selection and CSI error are not accounted for either. Even so, they can be used to greatly simplify the complex configuration selection problem under practical conditions; they can help to show which schemes need or do not need to be considered in a particular scenario. Take, for example, the typical two BS-user pair downlink system with the users at the cell edge. In the partially loaded case, it is clear from this paper that Non-CoMP and CBF should be considered first. In the fully loaded case, it is even simpler: it is clear that JP-DL should be considered first. After such large reductions in scope, accounting for the various parameters (MCS, limited feedback, etc.) will be much more manageable. Furthermore, one can use the guidelines to choose the CSI feedback and data sharing schemes, since different CoMP configurations require different levels of CSI feedback and data sharing. For example, in one of our papers, we demonstrate a practical scheme for decentralized CBF in TDD systems [39].

A. Noise Plus Interference Covariance Matrix in Non-CoMP
Since $E(H_{W,ic} M H_{W,ic}^*) = \mathrm{tr}(M) I_{\gamma_i}$ for any deterministic matrix $M$, the noise plus interference covariance matrix for the $i$th eq-receiver in Non-CoMP can be expressed as
\[ \Phi_{ni} = \sum_{l \neq i} d_{il}^{-2\beta}\, \mathrm{tr}\left(F_{ll} \Phi_{sll} F_{ll}^*\right) I_{r_i} + \Phi_{ai}. \quad \text{(A.1)} \]
If each transmitter transmits with full power, the trace in (A.1) can be replaced by $P_{bll}$ and the following expression is exact:
\[ \Phi_{ni} = \sum_{l \neq i} d_{il}^{-2\beta}\, P_{bll}\, I_{r_i} + \Phi_{ai}. \quad \text{(A.2)} \]
Note that even when there is receive spatial correlation (not considered in (1)), (A.2) still holds. When some transmitters do not transmit with full power, (A.2) is a "worst case" approximation and is still used for the design in this paper.
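The expectation identity that opens this appendix can be checked by simulation. The sketch below assumes that the entries of the random matrix are i.i.d. circularly symmetric complex Gaussian with unit variance, and uses an arbitrary Hermitian test matrix M. The dimensions, the seed, and the number of trials are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
r, t, n_trials = 2, 3, 20000

# Arbitrary deterministic test matrix M (t x t, Hermitian positive semidefinite).
A = rng.standard_normal((t, t)) + 1j * rng.standard_normal((t, t))
M = A @ A.conj().T

# n_trials independent r x t matrices with i.i.d. CN(0, 1) entries.
Hw = (rng.standard_normal((n_trials, r, t))
      + 1j * rng.standard_normal((n_trials, r, t))) / np.sqrt(2)

# Sample average of Hw M Hw^* over all trials.
estimate = (Hw @ M @ Hw.conj().transpose(0, 2, 1)).mean(axis=0)

# Value predicted by the identity: tr(M) times the r x r identity.
target = np.trace(M).real * np.eye(r)
rel_err = np.linalg.norm(estimate - target) / np.linalg.norm(target)
print("relative error:", rel_err)
```

The sample average approaches a scaled identity because distinct rows of the random matrix are independent and zero mean, so the off-diagonal entries average out, while each diagonal entry averages to the trace of M.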

B. Alternative Approach to the MMSE Transceiver Design of Non-CoMP under the Per-Antenna Power Constraint
For Non-CoMP with one data stream, this appendix shows a different approach to the MMSE transceiver design problem subject to the per-antenna power constraint. Without loss of generality, consider the ith eq-transmitter eq-receiver pair and let Φ sii = σ 2 ii I ti for all i. Given the MMSE decoder (18), the reduced MMSE problem can be written as subject to (12). Here, b mn is the mnth element of the nonnegative definite Hermitian matrix B. Expressing F * ii in polar form, for some integer k. If b 12 = 0, γ is maximized if and only if a 1 = a 2 = 1. It is remarkable that, in this case, optimality happens only when the equality in the per-antenna power constraint in (12) is met.